Remote-sensing applications process and analyze images acquired by satellites to extract useful information about the ground, in areas including disaster monitoring, environmental monitoring, geology, and resource exploration [1]. Spatial resolution is a key indicator of satellite remote sensing performance and is very important in practical applications, so high-resolution (HR) images are usually desired for remote sensing analysis and processing. However, remote sensing images are often degraded by the limitations of the imaging sensors and by other factors such as optical system aberration, atmospheric disturbance, motion, and noise in the imaging system. The simplest way to improve the resolution is to increase the sensor density of the image acquisition equipment. However, this generates shot noise, raises hardware costs considerably, increases the weight and volume of the sensor, and makes the satellite harder to launch, which is not conducive to the application and popularization of high-resolution sensors [2,3,4]. In this respect, single-image super-resolution (SISR) is a better approach. It is an image post-processing technology based on digital signal processing theory that can improve image resolution effectively and conveniently. In remote sensing applications, it can obtain high-resolution images of regions of interest without additional hardware investment, improve the recognition accuracy of targets of interest in images, and increase the value of image applications [5]. SISR methods are mainly divided into two types: reconstruction-based SISR and learning-based SISR.

The reconstruction-based method models the imaging process that produces low-resolution (LR) images and imposes a series of constraints on the reconstructed image. The classic algorithms include iterative back-projection (IBP) [6], projection onto convex sets (POCS) [7], and Bayesian maximum a posteriori (MAP) estimation [8]. Among these, the MAP method is the most widely used, usually combined with a regularization term [9] to build a MAP solution framework. The total variation (TV) regularization method [10] assumes that the total variation of a noisy image is always greater than that of a clean image, so noise in the reconstruction is suppressed by constraining the total variation of the image. General total variation (GTV) regularization [11] describes the distance relationship between the point of interest and its neighborhood more accurately. Gradually, more reasonable and effective regularized image models [12, 13] have been used for super-resolution restoration. However, reconstruction-based SISR algorithms make insufficient use of the prior information in the image itself. Most of these methods form constraints from prior knowledge of the image’s edges and local smoothness and then solve the resulting optimization problem iteratively, but when the magnification is large, the reconstructed image is often too smooth and lacks sharpness.

The learning-based method learns the mapping between LR and HR images from a training set in advance and uses the learned mapping to restore the high-resolution image. Learning-based SISR was first developed by Freeman et al. [14] and then applied by Baker et al. [15] to reconstruct face images. Super-resolution reconstruction based on clustering [16, 17] has achieved good results, and learning based on sparse representation [18, 19] is the most widely used. Reference [20] improves image feature extraction and dimensionality reduction during dictionary training so that the reconstructed image retains more high-frequency detail. Reference [21] proposes a sparse representation of a sample database composed of low-resolution and high-resolution image patches and uses the over-complete dictionary corresponding to the training image pairs. In recent years, super-resolution restoration using deep learning has begun to appear. Reference [22] proposes a three-layer convolutional network whose layers correspond to image patch extraction, non-linear feature mapping, and final reconstruction; the interpolated (enlarged) LR image is used as the input. Reference [23] proposes a feedback residual network based on deep edge guidance, in which images are trained by frequency band and routed through a recursive residual network. Reference [24] uses residual learning to implement image reconstruction. Reference [25] applies the convolutions to the low-resolution image and performs the upsampling operation, that is, the operation that increases the resolution, only at the end of the network. Reference [26] introduces the generative adversarial idea into super-resolution: a generator network and a discriminator network are trained against each other, and the discriminator judges the predicted high-resolution image produced by the generator. However, these learning-based SISR techniques require sufficient HR training examples in order to perform properly and generalize well. In addition, they tend to overfit quickly because of the models’ complexity and the lack of training data.

To overcome the problems mentioned above, we propose a novel fusion SR method named IDRRN in this paper. A recursive residual network is introduced into the super-resolution restoration of remote sensing images. In this network model, global residual learning and local residual learning are introduced to reduce the difficulty of training deep networks, and a recursive block composed of residual units is used to learn the residual image between the high-resolution and low-resolution images. Because the weights are shared within the recursive block, the accuracy can be boosted by increasing the network depth without adding any weight parameters. Without loss of image restoration quality, the deep learning model is improved to make its network structure more concise and compact. By connecting multiple secondary filters in the deep network, the accuracy is significantly improved. This model uses local residual learning instead of global residual learning to train deep networks, which is more conducive to information transmission and gradient flow. Introducing a recursive structure into the residual block reduces the parameters and makes the model more compact. The network takes the uninterpolated LR image as input and uses a deconvolution layer at its end to upsample directly to the SR output image, which greatly reduces the computational complexity.

The algorithm has been adapted to execute efficiently in parallel and includes some methodological improvements that make the model more efficient and effective. Experimental results show that the proposed method significantly outperforms existing methods in both evaluation indicators and visual effect.

### Related works

In this section, we briefly review the ideas and prior work related to this paper. First, we discuss image degradation in remote sensing and derive the mathematical model of LR images. Next, we describe the main idea of deep learning and its application in SISR algorithms. Finally, we describe the image restoration model in which a convolutional neural network (CNN) learns the residual, with the corruption treated as “residual information.”

### Image degradation in remote sensing

The formation of a remote sensing image passes through several stages, and image degradation and loss of quality inevitably occur along the way. To obtain high-quality spatial images, the acquired remote sensing images need to be denoised and deblurred [27]. As shown in Fig. 1, a degradation model is first established from the original image to the actually acquired image, where the original image is an HR image and the acquired image is an LR image.

Each time the scene is imaged, the original image is subject to motion deformation *M*_{i}, blurring by the point spread function *B*_{i}, and downsampling *D*_{i}, so that a sequence of LR images is finally obtained. Once the image degradation model is established, the mathematical model of the low-resolution image can be expressed as follows:

$$ {g}_i={D}_i{B}_i{M}_i\boldsymbol{f}+{n}_i,i=1,2,\dots, q $$

(1)

Here, *g*_{i} is the vectorized representation of the *i*th low-resolution image, *q* is the number of LR image frames, *f* is the vectorized representation of the HR image, *m* and *n* represent the spatial dimensions of the real image, *M*_{i} is the motion matrix, *B*_{i} is the blur matrix, *D*_{i} is the downsampling matrix, and *n*_{i} is the vectorized representation of the (*m* × *n*) × 1 dimensional noise.

Let

$$ \boldsymbol{g}=\left[\begin{array}{c}{g}_1\\ {}{g}_2\\ {}\vdots \\ {}{g}_q\end{array}\right],\quad \boldsymbol{H}=\left[\begin{array}{c}{D}_1{B}_1{M}_1\\ {}{D}_2{B}_2{M}_2\\ {}\vdots \\ {}{D}_q{B}_q{M}_q\end{array}\right],\quad \boldsymbol{n}=\left[\begin{array}{c}{n}_1\\ {}{n}_2\\ {}\vdots \\ {}{n}_q\end{array}\right] $$

(2)

then the degradation model of *q* LR remote sensing images can be abbreviated as follows:

$$ \boldsymbol{g}=\boldsymbol{Hf}+\boldsymbol{n} $$

(3)

Here, *g* is the vectorized representation of the stacked LR images, *H* is the degradation matrix, and *n* is the vectorized representation of the noise.
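As a rough illustration of Eqs. (1) to (3), the following sketch builds small explicit matrices for the three operators and applies them to a toy one-dimensional "HR" signal. The specific operators are assumptions chosen only for brevity and are not taken from this paper: the motion *M*_{i} is a circular shift, the blur *B*_{i} is a 3-tap moving average standing in for the point spread function, and *D*_{i} keeps every second sample. All function names are hypothetical.

```python
import random

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def matvec(a, v):
    """Multiply a matrix (list of rows) by a vector (list)."""
    return [sum(row[k] * v[k] for k in range(len(v))) for row in a]

def shift_matrix(n, shift):
    """Assumed motion operator M: circular shift of an n-sample signal."""
    return [[1.0 if (i - shift) % n == j else 0.0 for j in range(n)]
            for i in range(n)]

def blur_matrix(n):
    """Assumed blur operator B: 3-tap circular moving average."""
    m = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for d in (-1, 0, 1):
            m[i][(i + d) % n] += 1.0 / 3.0
    return m

def downsample_matrix(n, factor):
    """Downsampling operator D: keep every `factor`-th sample."""
    return [[1.0 if j == i * factor else 0.0 for j in range(n)]
            for i in range(n // factor)]

def degrade(f, shift, factor, noise_std, rng):
    """Return one LR frame g_i = D_i B_i M_i f + n_i, as in Eq. (1)."""
    n = len(f)
    h_i = matmul(downsample_matrix(n, factor),
                 matmul(blur_matrix(n), shift_matrix(n, shift)))
    clean = matvec(h_i, f)
    return [c + rng.gauss(0.0, noise_std) for c in clean]

rng = random.Random(0)
f = [0.0, 0.0, 9.0, 9.0, 9.0, 0.0, 0.0, 0.0]   # toy HR signal
# q = 3 frames with different shifts; noise is switched off to keep the
# example deterministic.
frames = [degrade(f, shift=s, factor=2, noise_std=0.0, rng=rng)
          for s in range(3)]
g = [v for frame in frames for v in frame]      # stacked vector g of Eq. (3)
```

Stacking the per-frame products row-wise is what turns the *q* separate models of Eq. (1) into the single system *g* = *Hf* + *n*; super-resolution then amounts to estimating *f* from *g*.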

### Deep learning for SISR in remote sensing

High-resolution remote sensing images play an important role in agricultural and forestry monitoring, urban planning, and military reconnaissance. Spatial resolution, the smallest size of target detail that can be distinguished in the image, is one of the key indicators for evaluating image quality. However, because HR remote sensing satellites are costly and time-consuming to develop, obtaining HR images economically and conveniently has always been a major challenge in the field of remote sensing. Super-resolution reconstruction is a popular approach to this problem. The general objective in SR is to improve the image resolution beyond the sensor limits, that is, to increase the number of image pixels while providing finer spatial details than those captured by the original acquisition instrument.

The SISR of remote sensing images is an ill-posed inverse problem, so a reasonable image feature representation is particularly important in the reconstruction process. Deep learning methods, especially CNNs, can apply feature transformations and non-linear mappings to LR images to obtain complex feature representations and then build a complex mapping from LR images to HR images. In essence, deep learning is a self-learning method for data representation that replaces manual feature extraction with unsupervised or semi-supervised feature learning and hierarchical feature acquisition.

The super-resolution convolutional neural network (SRCNN) [22] opened the era of deep convolutional neural networks for super-resolution. The algorithm takes the interpolated LR image as the network input and obtains an HR image after three convolutional layers, which perform feature extraction, non-linear mapping, and reconstruction in turn and give a very good restoration result. The first convolution layer extracts image features: image patches are extracted from the LR image, and each patch is represented as a high-dimensional vector. Given a low-resolution image *x*, the process can be expressed as follows:

$$ {N}_1(x)=\max \left(0,{f}_1x+{d}_1\right) $$

(4)

Here, *f*_{1} is the convolution kernel of the first convolution layer, which can be regarded as a filter, so that the product of *f*_{1} and *x* denotes the convolution of *x* with that kernel, and *d*_{1} represents the bias of the first layer.

The second convolution layer is a non-linear mapping between features, mapping each high-dimensional vector to another high-dimensional vector. Each mapping vector is a conceptual representation of HR blocks, which can be expressed as follows:

$$ {N}_2(x)=\max \left(0,{f}_2{N}_1(x)+{d}_2\right) $$

(5)

Here *f*_{2} and *d*_{2} represent the filter and bias of the second convolution layer.

The third convolution layer reconstructs the image. This operation combines the above HR patch representations to generate the final HR image, which can be expressed as:

$$ {N}_3(x)={f}_3{N}_2(x)+{d}_3 $$

(6)

Here, *f*_{3} and *d*_{3} represent the filter and bias of the third convolution layer.
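The three steps of Eqs. (4) to (6) can be sketched as a plain forward pass. This is a deliberately reduced illustration, not the SRCNN implementation: it assumes a single channel, a one-dimensional signal, zero padding, and scalar biases, whereas the real network uses many two-dimensional filters per layer. The function names are hypothetical.

```python
def conv_same(x, kernel):
    """Zero-padded 'same' filtering of a 1-D signal with an odd-length kernel
    (cross-correlation, the usual CNN convention)."""
    r = len(kernel) // 2
    out = []
    for i in range(len(x)):
        acc = 0.0
        for k, w in enumerate(kernel):
            j = i + k - r
            if 0 <= j < len(x):
                acc += w * x[j]
        out.append(acc)
    return out

def relu(v):
    """Element-wise max(0, .)."""
    return [max(0.0, a) for a in v]

def srcnn_forward(x, f1, d1, f2, d2, f3, d3):
    """Three-layer forward pass on an already interpolated LR signal x."""
    n1 = relu([a + d1 for a in conv_same(x, f1)])    # Eq. (4): feature extraction
    n2 = relu([a + d2 for a in conv_same(n1, f2)])   # Eq. (5): non-linear mapping
    return [a + d3 for a in conv_same(n2, f3)]       # Eq. (6): reconstruction

identity = [0.0, 1.0, 0.0]
y = srcnn_forward([0.5, 2.0], identity, -1.0, identity, 0.0, [0.0, 2.0, 0.0], 0.5)
```

Note that the last layer has no ReLU, matching Eq. (6), so the reconstructed output is not clipped at zero.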

The network is trained iteratively to reduce its loss. When the loss value is minimized and has stabilized, the weights and biases of each convolution layer are taken as the optimal parameters of the network.

With the rapid development of deep learning and the great success of SRCNN, super-resolution algorithms based on deep convolutional networks developed rapidly, and various improved variants and new network structures appeared, such as the fast super-resolution convolutional neural network (FSRCNN) [28], very deep convolutional networks for image super-resolution (VDSR) [24], the super-resolution generative adversarial network (SRGAN) [26], end-to-end deep and shallow networks (EEDS) [29], and the enhanced deep super-resolution network (EDSR) [30]. These have greatly advanced the practical application of deep learning for SISR.

### Deep residual network

The residual network (ResNet) was proposed to solve the degradation problem that arises when a deep neural network has too many hidden layers. Its main idea is to learn a residual function with respect to the input instead of the original function, which makes deeper networks easier to train and allows better performance to be obtained from them [31,32,33]. Its network structure is shown in Fig. 2.

Reference [34] treats two weight layers and a ReLU activation function as a basic unit. The input and output of the unit are added at the pixel level through a skip connection, that is, the corresponding pixels in the feature maps are added, and the residual operation is performed as follows:

$$ H(x)=F(x)+x $$

(7)

Here, *x* represents the input of a basic unit, *H*(*x*) represents the result of the residual calculation, and *F*(*x*) represents the output of the basic unit.

The residual block structure is as follows:

$$ {x}_o=U(x)=\sigma \left(F\left(x,W\right)+h(x)\right) $$

(8)

Here, *x*_{o} represents the output of the residual block, *h*(*x*) is an identity mapping with *h*(*x*) = *x*, *W* is a set of weights, *F*(*x*, *W*) is the residual mapping to be learned, *σ* represents the ReLU activation function, and *U* represents the residual block function. The residual mapping is easier to optimize than the original mapping.
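A minimal sketch of the residual block of Eq. (8) is given below. It assumes, purely for illustration, that the residual mapping *F*(*x*, *W*) consists of two small dense weight matrices with a ReLU between them acting on a feature vector; in an actual network these would be convolution layers acting on feature maps. The names `linear` and `residual_unit` are hypothetical.

```python
def relu(v):
    """Element-wise max(0, .)."""
    return [max(0.0, a) for a in v]

def linear(w, v):
    """Apply a weight matrix (list of rows) to a vector."""
    return [sum(wij * vj for wij, vj in zip(row, v)) for row in w]

def residual_unit(x, w1, w2):
    """x_o = sigma(F(x, W) + h(x)) with h(x) = x, as in Eq. (8)."""
    fx = linear(w2, relu(linear(w1, x)))          # residual mapping F(x, W)
    return relu([a + b for a, b in zip(fx, x)])   # skip connection, then ReLU

zeros = [[0.0, 0.0], [0.0, 0.0]]
eye = [[1.0, 0.0], [0.0, 1.0]]
passthrough = residual_unit([1.0, 2.0], zeros, zeros)   # F contributes nothing
doubled = residual_unit([1.0, 2.0], eye, eye)           # F(x) = x, so output is 2x
```

With all weights at zero the unit simply passes a non-negative input through, which is one way to see why the residual form is easy to optimize: the block only has to learn a correction to the identity.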

The residual network overturned the view that adding more layers to a network cannot improve performance. Moreover, the structure of the deep residual network is simple, it solves the performance degradation of very deep convolutional neural networks, and its classification performance is excellent.

### Proposed improved method

#### Recursive structure

Reference [35] proposed the deeply-recursive convolutional network (DRCN), which introduced recursion into the residual network. The recursive structure is a chain of 16 recursions. DRCN passes the result of each recursion through the reconstruction layer, generating intermediate HR images. Its recursive structure allows weight parameters to be shared across the convolutional layers, which effectively limits the number of model parameters. However, because training such a deep model is prone to vanishing or exploding gradients, every recursion needs to be supervised, which undoubtedly increases the burden on the network.

In response to the above issues, in this paper, the improved recursive structure is introduced into the residual block to reduce the network scale and make the model more compact. At the same time, the weights are shared among the residual blocks, reducing the number of model parameters. The residual block function is defined as:

$$ {H}^{\mu }=R\left({H}^{\mu -1}\right)=F\left({H}^{\mu -1},W\right)+{H}^0 $$

(9)

Here, *H*^{μ} is the output of the *μ*th residual block, *R* represents the residual block function, *F*(*H*^{μ − 1}, *W*) is the residual mapping to be learned, *W* is the set of weights, and *H*^{0} is the feature image output by the first convolution layer.

A convolution layer and a ReLU layer are placed at the beginning of the recursive block, followed by multiple stacked residual blocks, which together form the recursive structure. *H*^{0} serves as the identity branch of every residual block, and *B* represents the number of residual blocks contained in the recursive structure. The recursive structure of the algorithm is shown in Fig. 3.

The result of the *μ*th residual block can be obtained by applying the residual block function *R* recursively, starting from *H*^{0}.
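The recursion of Eq. (9) can be unrolled in a few lines. In this sketch the residual mapping *F* is a hypothetical stand-in (one shared scalar weight followed by a ReLU) chosen only so that the numbers are easy to follow; the point being illustrated is that the same weights are reused in every one of the *B* residual blocks and that every block adds the same *H*^{0}.

```python
def make_residual_fn(w):
    """Hypothetical residual mapping F(., W): one shared scalar weight + ReLU."""
    def f(h):
        return [max(0.0, w * a) for a in h]
    return f

def recursive_block(h0, residual_fn, num_blocks):
    """Unroll H^u = F(H^(u-1), W) + H^0 for u = 1..B with shared weights.

    Returns the list [H^1, ..., H^B].
    """
    h = h0
    outputs = []
    for _ in range(num_blocks):
        fh = residual_fn(h)                    # same F (same W) at every step
        h = [a + b for a, b in zip(fh, h0)]    # identity branch always adds H^0
        outputs.append(h)
    return outputs

shared_f = make_residual_fn(0.5)
outs = recursive_block([1.0], shared_f, num_blocks=3)
# outs is [[1.5], [1.75], [1.875]]
```

Increasing `num_blocks` deepens the unrolled computation while the parameter set (here the single weight `0.5`) stays the same size, which is the property the text relies on when it says depth can grow without adding weight parameters.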

#### Network structure optimization

The algorithm introduces local residual learning to reduce the difficulty of training the deep network. First, the high-frequency features of the LR input image are extracted by the first convolution layer; then, after every two convolution layers, the feature image extracted by the first convolution layer is added. That is, the inputs of all identity branches in the residual blocks remain the same. In this way, more image information can be transmitted to the deeper layers of the network, and the identity branch also helps the back propagation of gradients during training, avoiding the overfitting phenomenon [36]. The improved residual block structure consists of two convolutional layers and two ReLU layers. The residual block structure is shown in Fig. 4.

A recursive structure is introduced in the residual block, which reduces the parameters and is more helpful for information transmission and gradient flow. The LR input image goes through a convolutional layer and a ReLU layer that extract features; the extracted features are then fed into several residual blocks, which learn the residual mapping function recursively. Finally, at the end of the network, a deconvolution layer directly upsamples the learned residual image and restores the SR output image. The optimized network structure is shown in Fig. 5.

The figure shows the number of convolutional layers in each residual network unit. In the improved network, the residual network units at the front of the network have more layers, and those toward the back have fewer. This design lets the entire network contain deeper branches while using the same number of parameters, thereby improving the quality of the generated images. Because the adjusted network has more deep branches, the optimized network can work more efficiently. At the same time, to avoid vanishing gradients and overfitting in deep networks, a pooling layer is added to the branches that lie deeper in the network, that is, to the residual network units near the output end.

The whole network is composed of three parts: feature extraction, nonlinear mapping of the residual function, and SR image reconstruction. The LR input image passes through a convolutional layer and a ReLU layer to extract features; the extracted features are then input into several residual blocks, and the residual mapping function is learned recursively. Finally, at the end of the network, a deconvolution layer is used to directly upsample the learned residual image to reconstruct the SR output image.
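To make the final upsampling step concrete, the sketch below implements a one-dimensional transposed convolution, which is what a deconvolution layer computes. It is an illustration under simplifying assumptions (single channel, no padding, hand-picked kernel values rather than learned ones), and `deconv1d` is a hypothetical name, not a function from this paper.

```python
def deconv1d(x, kernel, stride):
    """1-D transposed convolution without padding.

    Each input sample scatters a scaled copy of the kernel into the output,
    with consecutive copies offset by `stride`. The output length is
    (len(x) - 1) * stride + len(kernel).
    """
    out = [0.0] * ((len(x) - 1) * stride + len(kernel))
    for i, v in enumerate(x):
        for k, w in enumerate(kernel):
            out[i * stride + k] += v * w
    return out

lr_features = [1.0, 2.0]
nearest = deconv1d(lr_features, [1.0, 1.0], stride=2)        # [1.0, 1.0, 2.0, 2.0]
smooth = deconv1d([2.0, 4.0], [0.5, 1.0, 0.5], stride=2)     # [1.0, 2.0, 3.0, 4.0, 2.0]
```

Because the upsampling happens only in this last layer, all preceding convolutions operate on the smaller LR grid, which is the source of the reduced computational cost compared with networks that interpolate the input first.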

### Evaluation criteria

Objectively, the deviation between the restored image and the original image is generally used to evaluate the quality of image restoration. In this paper, peak signal-to-noise ratio (PSNR), structural similarity (SSIM), and Erreur Relative Globale Adimensionnelle de Synthèse (ERGAS) are used as reference evaluation indicators for image quality.

The larger the PSNR value, the smaller the difference between the reconstruction result and the original image, the better the reconstruction effect. The calculation formula is as follows:

$$ PSNR=10\ast {\log}_{10}\left\{\frac{{\left[\max \left({X}_i\right)\right]}^2\ast MN}{{\left\Vert {X}_i-{Y}_i\right\Vert}^2}\right\} $$

(10)

Here, *X*_{i} is the original HR reference image, *Y*_{i} is the reconstructed image, and *M* and *N* are the height and width of the image. For 8-bit images, max(*X*_{i}) is generally 255, which can be substituted directly into the formula.
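Eq. (10) translates directly into code. The sketch below works on images that have already been flattened into lists of pixel values, so the list length plays the role of *MN*; the `peak` default of 255 assumes 8-bit data, and returning infinity for identical images is a convention adopted here rather than something stated in the text.

```python
import math

def psnr(x, y, peak=255.0):
    """PSNR of Eq. (10): 10 * log10(peak^2 * MN / ||x - y||^2)."""
    if len(x) != len(y):
        raise ValueError("images must have the same number of pixels")
    sq_err = sum((a - b) ** 2 for a, b in zip(x, y))
    if sq_err == 0:
        return math.inf   # identical images: the error term vanishes
    return 10.0 * math.log10(peak ** 2 * len(x) / sq_err)

reference = [255.0, 0.0, 0.0, 0.0]
restored = [250.0, 0.0, 0.0, 0.0]
value = psnr(reference, restored)   # about 40.17 dB
```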

PSNR is based mainly on pixel-by-pixel comparison, and its evaluation of the local structure of the image is relatively weak. Sometimes the PSNR values of two images are close, but their visual effects are very different. Images generally have their own structures, and adjacent pixels are correlated to a greater or lesser degree. SSIM measures the structural similarity between the reconstruction result and the reference high-resolution image. The calculation formula is as follows:

$$ SSIM\left(X,Y\right)=\frac{\left(2{\mu}_X{\mu}_Y+{C}_1\right)\left(2{\sigma}_{XY}+{C}_2\right)}{\left({\mu}_X^2+{\mu}_Y^2+{C}_1\right)\left({\sigma}_X^2+{\sigma}_Y^2+{C}_2\right)} $$

(11)

Here, *X* and *Y* are the reference HR image and the restored image, respectively. *μ*_{X} and *μ*_{Y} represent the mean pixel values of the two images, which are defined as follows:

$$ {\mu}_X=\frac{1}{N}\sum \limits_{i=1}^NX(i) $$

(12)

$$ {\mu}_Y=\frac{1}{N}\sum \limits_{i=1}^NY(i) $$

(13)

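*N* is the number of pixels in the image after it is unrolled column by column into a vector. *σ*_{X} and *σ*_{Y} are the corresponding standard deviations, so that their squares in Eq. (11) are the variances; they are defined as follows: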

$$ {\sigma}_X={\left(\frac{1}{N-1}\sum \limits_{i=1}^N{\left(X(i)-{\mu}_X\right)}^2\right)}^{\frac{1}{2}} $$

(14)

$$ {\sigma}_Y={\left(\frac{1}{N-1}\sum \limits_{i=1}^N{\left(Y(i)-{\mu}_Y\right)}^2\right)}^{\frac{1}{2}} $$

(15)

*σ*_{XY} is the covariance, which is defined as:

$$ {\sigma}_{XY}=\frac{1}{N-1}\sum \limits_{i=1}^N\left(X(i)-{\mu}_X\right)\left(Y(i)-{\mu}_Y\right) $$

(16)

*C*_{1} and *C*_{2} are small positive constants that keep the denominator from being zero. The value of SSIM is at most 1 and, for image pairs that are not negatively correlated, lies between 0 and 1. The closer the value is to 1, the more similar the two images are, and the better the reconstruction result is.
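The quantities in Eqs. (11) to (16) can be computed as follows for two flattened images. Two assumptions should be noted. First, the text does not give values for *C*_{1} and *C*_{2}; the defaults used here, (0.01 · 255)² and (0.03 · 255)², are a commonly used choice for 8-bit images. Second, the statistics are taken once over the whole image, exactly as the sums over *N* pixels are written, whereas SSIM is often evaluated over local windows and then averaged.

```python
def ssim_global(x, y, c1=(0.01 * 255) ** 2, c2=(0.03 * 255) ** 2):
    """Single-window SSIM of Eq. (11) on flattened images x and y."""
    n = len(x)
    mu_x = sum(x) / n                                             # Eq. (12)
    mu_y = sum(y) / n                                             # Eq. (13)
    var_x = sum((a - mu_x) ** 2 for a in x) / (n - 1)             # sigma_X^2, Eq. (14)
    var_y = sum((b - mu_y) ** 2 for b in y) / (n - 1)             # sigma_Y^2, Eq. (15)
    cov = sum((a - mu_x) * (b - mu_y) for a, b in zip(x, y)) / (n - 1)  # Eq. (16)
    num = (2 * mu_x * mu_y + c1) * (2 * cov + c2)
    den = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return num / den

img = [10.0, 50.0, 90.0, 200.0]
same = ssim_global(img, img)                       # equals 1 up to rounding
inverted = ssim_global([0.0, 255.0, 0.0, 255.0],
                       [255.0, 0.0, 255.0, 0.0])   # negative: structure is reversed
```

The second call shows that a negative covariance drives the value below zero, which is why the 0 to 1 range should be read as the typical case rather than a strict bound.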

ERGAS is a quality evaluation method proposed for image fusion research, which reflects the degree of spectral distortion between the restored image and the reference image. It is also commonly used in the super-resolution restoration quality evaluation of images. The calculation formula is as follows:

$$ ERGAS=100\frac{h}{l}\sqrt{\frac{1}{K}\sum \limits_{k=1}^K{\left(\frac{RMSE(k)}{\mu (k)}\right)}^2} $$

(17)

*l* and *h* represent the resolution before and after image reconstruction, *K* represents the number of bands, *μ*(*k*) represents the mean of the *k*th band, and *RMSE*(*k*) represents the root mean square error of the *k*th band. The ideal value of ERGAS is 0.
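A direct reading of Eq. (17) is sketched below for multi-band images stored as one flattened list per band. The text does not say whether *μ*(*k*) is taken from the reference or the restored band; this sketch assumes the reference band. The ratio *h*/*l* is passed in as a single hypothetical argument `h_over_l` so that no particular unit for the resolutions has to be assumed.

```python
import math

def ergas(ref_bands, rec_bands, h_over_l):
    """ERGAS of Eq. (17) for K bands given as flattened lists of pixels."""
    if len(ref_bands) != len(rec_bands):
        raise ValueError("both images must have the same number of bands")
    total = 0.0
    for ref, rec in zip(ref_bands, rec_bands):
        mse = sum((a - b) ** 2 for a, b in zip(ref, rec)) / len(ref)
        rmse = math.sqrt(mse)               # RMSE(k)
        mu = sum(ref) / len(ref)            # mu(k), assumed to be the reference mean
        total += (rmse / mu) ** 2
    return 100.0 * h_over_l * math.sqrt(total / len(ref_bands))

ref = [[10.0, 10.0]]
rec = [[11.0, 9.0]]
score = ergas(ref, rec, h_over_l=0.5)   # RMSE = 1, mean = 10, so about 5.0
```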