Single-frame super-resolution for remote sensing images based on improved deep recursive residual network

Single-frame image super-resolution (SISR) technology in remote sensing is improving rapidly. Deep learning methods have been widely used in SISR to improve the details of reconstructed images and speed up network training. However, these supervised techniques tend to overfit quickly because of the models' complexity and the lack of training data. In this paper, an Improved Deep Recursive Residual Network (IDRRN) super-resolution model is proposed to reduce the difficulty of network training. The deep recursive structure is configured to control the number of model parameters while increasing the network depth. At the same time, short-path recursive connections are used to alleviate vanishing gradients and enhance feature propagation. Comprehensive experiments show that IDRRN improves both the quantitative metrics and the visual quality of the reconstructed images.


Introduction
Remote-sensing applications mainly process and analyze images acquired by satellites to extract useful information about the ground, in fields including disaster monitoring, environmental detection, geology, and resource exploration [1]. As a key indicator of satellite remote sensing performance, the spatial resolution of remote sensing images is very important in practical applications. High-resolution (HR) images are usually desired for remote sensing analysis and processing. However, remote sensing images are always degraded by the limitations of the image sensors and by other factors such as optical system aberration, atmospheric disturbance, movement, and noise of the imaging system. The simplest way to improve the resolution is to increase the sensor density of the image acquisition equipment. However, this generates shot noise, incurs substantial hardware cost, increases the weight and volume of the sensor, and makes the satellite launch more difficult, which is not conducive to the application and popularization of high-resolution sensors [2][3][4]. In this respect, SISR is a better approach. It is an image post-processing technology that is based on digital signal processing theory and can effectively and conveniently improve image resolution. SISR is mainly divided into two types: reconstruction-based SISR and learning-based SISR. In remote sensing applications, SISR can obtain high-resolution images of regions of interest without increasing hardware investment, improve the recognition accuracy of targets of interest in images, and increase the value of image applications [5].
The reconstruction-based method mainly uses the imaging process of low-resolution (LR) images to build a model and imposes a series of constraints on the reconstructed image. The classic algorithms include iterative back-projection (IBP) [6], projection onto convex sets (POCS) [7], and Bayesian maximum a posteriori (MAP) [8]. Among them, the MAP method is the most widely used, usually with a regularization term [9] to build a MAP solution framework. The total variation (TV) regularization method [10] assumes that the total variation of a noisy image is always greater than that of a clean image, so noise is suppressed during reconstruction by constraining the total variation of the image; general total variation (GTV) regularization [11] describes the distance relationship between the point of interest and its neighborhood more accurately. Gradually, more reasonable and effective regularized image models [12,13] have been used for super-resolution restoration. Reconstruction-based SISR algorithms make insufficient use of the prior information of the image itself. Most of these methods use prior knowledge of the image's edges and local smoothness to form constraints and then use iterative algorithms to solve the optimization problem, but when the magnification is large, the reconstructed image is often too smooth and lacks sharpness.
The learning-based method learns the mapping relationship between LR and HR images by training on a training set in advance and uses the learned mapping to restore the high-resolution image. The learning-based SISR algorithm was first developed by Freeman et al. [14] and then applied by Baker et al. [15] to reconstruct face images. Super-resolution reconstruction based on clustering [16,17] has achieved good results, and learning based on sparse representation [18,19] is the most widely used; reference [20] improves image feature extraction and dimensionality reduction during dictionary training so that the reconstructed image retains more high-frequency detail; reference [21] proposes a sparse representation of a sample database composed of low-resolution and high-resolution sample image blocks, using the over-complete dictionary corresponding to the training image pairs. In recent years, super-resolution restoration using deep learning has begun to appear. Reference [22] proposes a three-layer convolutional network whose layers correspond to image block extraction, non-linear feature mapping, and final reconstruction; the interpolated, enlarged LR image is the input from which the image is reconstructed. Reference [23] proposes a feedback residual network based on deep edge guidance, in which images are trained by frequency band and route through a recursive residual network. Reference [24] puts forward the idea of using residual learning to implement image reconstruction; reference [25] applies the convolution operations to the low-resolution image and performs the upsampling operation, that is, the operation that increases the resolution, only at the end of the network; in reference [26], the idea of generative adversarial learning is introduced into super-resolution, and a generator network and a discriminator network are trained against each other.
The discriminator network is used to judge the predicted high-resolution image […] examples in order to perform properly and generalize well. In addition, these methods usually tend to overfit quickly because of the models' complexity and the lack of training data.
To overcome the problems mentioned above, we propose a novel fusion SR method named IDRRN in this paper. A recursive residual network is introduced into the super-resolution restoration of remote sensing images. In this network model, global residual learning and local residual learning are introduced to reduce the difficulty of training deep networks, and a recursive block composed of residual units is used. Because the network learns the residual image between the high-resolution and low-resolution images, we can boost the accuracy by increasing the network depth without adding any weight parameters. Without loss of image restoration quality, the deep learning model is improved to make its network structure more concise and compact. By connecting multiple secondary filters in the deep network, the accuracy is significantly improved. This model uses local residual learning instead of global residual learning to train deep networks, which is more conducive to information transmission and gradient flow. Introducing a recursive structure into the residual block reduces the number of parameters and makes the model more compact. Because the network takes the uninterpolated LR image as input and uses a deconvolution layer at its end to upsample directly to the SR output image, the computational complexity is greatly reduced.
The algorithm has been adapted to run efficiently in parallel and includes some methodological improvements that make the model more efficient and effective. Experimental results show that the proposed method performs significantly better than existing methods in both evaluation indicators and visual effect.

Related works
We briefly review the ideas and work progress related to this paper in this section. Firstly, we discuss the image degradation in remote sensing and get the mathematical model of LR images. Next, we describe the main idea of deep learning and its application in SISR algorithms. Finally, we illustrate the image restoration model of learning the residual by the convolutional neural network (CNN), in which the corruption is considered as "residual information."

Image degradation in remote sensing
The formation of a remote sensing image passes through several stages, and image degradation and loss of quality inevitably occur along the way. In order to obtain high-quality spatial images, the acquired remote sensing images need to be denoised and deblurred [27]. As shown in Fig. 1, a degradation model is first established from the original image to the actually acquired image, where the original image is an HR image and the acquired image is an LR image.
Each time an image is acquired, the HR scene undergoes a motion deformation $M_i$, a blurring point spread function $B_i$, and a downsampling $D_i$, so that an LR image sequence is finally obtained. After the image degradation model is established, the mathematical model of the low-resolution image can be expressed as follows:

$$g_i = D_i B_i M_i f + n_i, \quad i = 1, 2, \ldots, q$$

Here, $g_i$ is the vectorized representation of the $i$th low-resolution image, $q$ is the number of LR image frames, $f$ is the vectorized representation of the HR image, $m$ and $n$ represent the spatial dimensions of the real image, $M_i$ is the motion matrix, $B_i$ is the blur matrix, $D_i$ is the downsampling matrix, and $n_i$ is the vectorized representation of the $(m \times n) \times 1$ dimensional noise. Let

$$g = \begin{bmatrix} g_1 \\ \vdots \\ g_q \end{bmatrix}, \quad H = \begin{bmatrix} D_1 B_1 M_1 \\ \vdots \\ D_q B_q M_q \end{bmatrix}, \quad n = \begin{bmatrix} n_1 \\ \vdots \\ n_q \end{bmatrix}$$

then the degradation model of the $q$ LR remote sensing images can be abbreviated as follows:

$$g = Hf + n$$

Here, $g$ is the vectorized representation of the LR images, $H$ is the degradation matrix, and $n$ is the vectorized representation of the noise.
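The degradation model can be illustrated with a minimal NumPy sketch for a single frame. It is not the authors' simulation code: the box blur standing in for $B$, plain decimation standing in for $D$, the identity motion $M$, and the function name `degrade` with its default values are all assumptions made for illustration.

```python
import numpy as np

def degrade(hr, scale=2, blur_size=3, noise_std=0.01, seed=0):
    """Simulate g = D B f + n for one 2-D image (motion M assumed to be identity)."""
    rng = np.random.default_rng(seed)
    hr = np.asarray(hr, dtype=np.float64)

    # B: box blur, i.e. the mean over a blur_size x blur_size window (edge-padded).
    # blur_size is assumed to be odd so that the window has a center.
    pad = blur_size // 2
    padded = np.pad(hr, pad, mode="edge")
    blurred = np.zeros_like(hr)
    for dy in range(blur_size):
        for dx in range(blur_size):
            blurred += padded[dy:dy + hr.shape[0], dx:dx + hr.shape[1]]
    blurred /= blur_size ** 2

    # D: keep every scale-th pixel in both directions.
    lr = blurred[::scale, ::scale]

    # n: additive zero-mean Gaussian noise.
    return lr + rng.normal(0.0, noise_std, lr.shape)
```

With `noise_std=0.0`, a constant image stays constant and only shrinks, which is a quick sanity check that the blur and the decimation are applied in the intended order.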

Deep learning for SISR in remote sensing
High-resolution remote sensing images play an important role in agricultural and forestry monitoring, urban planning, and military reconnaissance. Defined as the smallest size of target detail that can be distinguished in the image, the spatial resolution of a remote sensing image is one of the key indicators for evaluating image quality. However, because developing HR remote sensing satellites is costly and time-consuming, obtaining HR images economically and conveniently has always been a major challenge in the field of remote sensing. Super-resolution reconstruction technology is a popular answer to such problems. The general objective in SR is to improve the image resolution beyond the sensor limits, that is, to increase the number of image pixels while providing finer spatial details than those captured by the original acquisition instrument. The SISR of remote sensing images is an ill-conditioned inverse problem, so a reasonable expression of image features is particularly important in the reconstruction process. Deep learning methods, especially CNNs, can perform feature transformation and non-linear mapping on LR images to obtain complex feature expressions and then build the complex mapping relationship from LR images to HR images. The essence of deep learning is a self-learning method for data representation that replaces manual feature extraction with unsupervised or semi-supervised feature learning and hierarchical feature acquisition.
The super-resolution convolutional neural network (SRCNN) [22] opened the era of deep convolutional neural networks for super-resolution problems. The algorithm takes the interpolated LR image as the network input and obtains an HR image after three convolutional transformations. After the three steps of feature extraction, nonlinear transformation, and feature restoration, a very good restoration effect is obtained. The first convolution layer extracts image features. Image blocks are extracted from the LR image, and each block is represented as a high-dimensional vector. Given a low-resolution image $x$, the process can be expressed as follows:

$$F_1(x) = \max(0, f_1 * x + d_1)$$

Here, $f_1$ is the convolution kernel of the first convolution layer, which can be regarded as a filter, $d_1$ represents the bias of the first layer, and $*$ denotes convolution.
The second convolution layer is a non-linear mapping between features, mapping each high-dimensional vector to another high-dimensional vector. Each mapped vector is a conceptual representation of an HR block, which can be expressed as follows:

$$F_2(x) = \max(0, f_2 * F_1(x) + d_2)$$

Here, $f_2$ and $d_2$ represent the filter and bias of the second convolution layer. The third convolution layer reconstructs the image. This operation stitches the above HR image blocks together to generate the final HR image, which can be expressed as:

$$F_3(x) = f_3 * F_2(x) + d_3$$

Here, $f_3$ and $d_3$ represent the filter and bias of the third convolution layer. The entire convolutional neural network model continuously reduces the loss of the network through iteration. When the loss value is minimized and stabilized, the corresponding weights and biases of each convolution layer are the optimal results of the network.
Accompanying the robust development of deep learning algorithms and the great success of SRCNN, super-resolution recovery algorithms based on deep convolutional networks have developed rapidly, and various improved variants and new network structures have appeared, such as the fast super-resolution convolutional neural network (FSRCNN) [28], very deep convolutional networks for image super-resolution (VDSR) [24], the super-resolution generative adversarial network (SRGAN) [26], end-to-end deep and shallow networks (EEDS) [29], and the enhanced deep super-resolution network (EDSR) [30]. This greatly advances the practical application of deep learning for SISR.
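The three SRCNN stages can be sketched as a plain NumPy forward pass. This is only an illustration of the layer structure, not the reference implementation: the helper names `conv2d` and `srcnn_forward`, the zero padding, and the kernel sizes and channel counts used in the usage note are assumptions.

```python
import numpy as np

def conv2d(x, w, b):
    """Zero-padded 'same' convolution as used in CNNs (i.e. cross-correlation).

    x: (C_in, H, W) feature maps, w: (C_out, C_in, k, k) filters with odd k,
    b: (C_out,) biases. Returns (C_out, H, W).
    """
    c_out, _, k, _ = w.shape
    p = k // 2
    xp = np.pad(x, ((0, 0), (p, p), (p, p)))  # zero padding keeps H and W fixed
    h, wd = x.shape[1:]
    out = np.zeros((c_out, h, wd))
    for dy in range(k):
        for dx in range(k):
            patch = xp[:, dy:dy + h, dx:dx + wd]  # input shifted by (dy, dx)
            out += np.einsum("oc,chw->ohw", w[:, :, dy, dx], patch)
    return out + b[:, None, None]

def srcnn_forward(x, params):
    """params = [(f1, d1), (f2, d2), (f3, d3)], one (filter, bias) pair per layer."""
    (f1, d1), (f2, d2), (f3, d3) = params
    feat1 = np.maximum(0, conv2d(x, f1, d1))      # patch extraction and representation
    feat2 = np.maximum(0, conv2d(feat1, f2, d2))  # non-linear mapping
    return conv2d(feat2, f3, d3)                  # reconstruction (no activation)
```

For example, filters of shape `(8, 1, 9, 9)`, `(4, 8, 1, 1)`, and `(1, 4, 5, 5)` applied to a `(1, 12, 12)` input give a `(1, 12, 12)` output; these shapes are chosen small for illustration and are not the trained network's configuration.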

Deep residual network
Residual network (ResNet) is proposed to solve the problem of network degradation when the deep neural network has too many hidden layers. Its main idea is to learn the residual function instead of the original function based on the input, which makes the training of the deeper network simpler, and can get better performance from the deeper network [31][32][33]. Its network structure is shown in Fig. 2.
Reference [34] treats two weight layers and a ReLU activation function as a basic unit. The input and output of the unit are added at the pixel level through a skip connection, that is, the corresponding pixels in the feature maps are added, and the residual operation is performed as follows:

$$H(x) = F(x) + x$$

Here, $x$ represents the input of a basic unit, $H(x)$ represents the result of the residual calculation, and $F(x)$ represents the output of the basic unit.
The residual block structure is as follows:

$$x_o = U(x) = \sigma(F(x, W) + h(x))$$

Here, $x_o$ represents the output of the residual block, $h(x)$ is an identity mapping with $h(x) = x$, $W$ is a set of weights, $F(x, W)$ is the residual mapping to be learned, $\sigma$ represents the ReLU activation function, and $U$ represents the residual block function. The residual mapping is easier to optimize than the original mapping.
The proposed residual network breaks the argument that deepening the number of layers in the network cannot improve performance. Moreover, the structure of the deep residual network is simple, which solves the problem of performance degradation of deep convolutional neural networks under extremely deep conditions, and the classification performance is excellent.

Recursive structure
Reference [35] proposed the deeply-recursive convolutional network (DRCN) algorithm, which introduced recursion into the residual network. The recursive structure consists of 16 chained recursions. DRCN passes each recursive result through the reconstruction layer, generating intermediate HR images. DRCN's recursive structure allows weight parameters to be shared in the convolutional layer, effectively controlling the number of model parameters. However, because training a deep model is prone to vanishing or exploding gradients, each recursion needs to be supervised, which undoubtedly increases the burden on the network. In response to these issues, in this paper an improved recursive structure is introduced into the residual block to reduce the network scale and make the model more compact. At the same time, the weights are shared among the residual blocks, reducing the number of model parameters. The residual block function is defined as:

$$H_{\mu} = R(H_{\mu-1}) = F(H_{\mu-1}, W) + H_0$$

Here, $H_{\mu}$ is the output of the $\mu$th residual block, $R$ represents the residual block function, $F(H_{\mu-1}, W)$ is the residual mapping to be learned, $W$ is the set of weights, and $H_0$ is the feature image output by the first convolution layer.
A convolution layer and a ReLU layer are placed at the beginning of the recursive block, followed by a stack of multiple residual blocks, which forms a recursive structure. $H_0$ serves as the identity branch of each residual block, and $B$ represents the number of residual blocks contained in the recursive structure. The recursive structure of the algorithm is shown in Fig. 3.
The result of the $\mu$th residual block can be obtained by applying the residual block function $R$ recursively to $H_0$: $H_{\mu} = R(R(\cdots R(H_0) \cdots))$, with $R$ applied $\mu$ times.
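The weight-sharing recursion, in which every residual block adds the same first-layer features $H_0$ to its residual branch, can be sketched in NumPy as follows. The sketch makes simplifying assumptions: 1×1 channel-mixing layers stand in for the 3×3 convolutions, the ReLU is placed before each layer, and the names `residual_branch` and `recursive_block` are hypothetical.

```python
import numpy as np

def residual_branch(h, w1, w2):
    """F(h, W): two weight layers, each preceded by a ReLU.

    h: (C, H, W) features; w1, w2: (C, C) matrices acting as 1x1 convolutions
    (an illustrative stand-in for the 3x3 layers of the real network).
    """
    a = np.einsum("oc,chw->ohw", w1, np.maximum(0, h))
    return np.einsum("oc,chw->ohw", w2, np.maximum(0, a))

def recursive_block(h0, w1, w2, num_units):
    """Return [H_1, ..., H_B] with H_mu = F(H_{mu-1}, W) + H_0.

    The same (w1, w2) is reused in every unit, so the parameter count does not
    grow with num_units; only the effective depth does.
    """
    h = h0
    outputs = []
    for _ in range(num_units):
        h = residual_branch(h, w1, w2) + h0  # identity branch always carries H_0
        outputs.append(h)
    return outputs
```

Increasing `num_units` deepens the network while `w1` and `w2` stay the only learnable tensors, which is the property the recursive structure relies on to keep the model compact.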

Network structure optimization
The algorithm introduces local residual learning to reduce the difficulty of training the deep network. First, the high-frequency features of the LR input image are extracted through the convolution layer, and then after each two-layer convolution layer, the feature image extracted by the first convolution layer is added. That is, the inputs of all identity branches in the residual block remain the same. In this way, more image information can be transmitted to the deeper layer of the network, and its identity branch also helps the back propagation of gradients during training, avoiding the overfitting phenomenon [36]. The improved residual block structure consists of two convolutional layers and two Relu layers. The residual block structure is shown in Fig. 4.
A recursive structure is introduced in the residual block. This reduces the number of parameters and is more helpful for information transmission and gradient flow. The LR input image passes through a convolutional layer and a ReLU layer to extract features, the extracted features are fed into several residual blocks, and the residual mapping function is learned recursively. Finally, at the end of the network, a deconvolution layer directly upsamples the learned residual image and restores the SR output image. The optimized network structure is shown in Fig. 5.
The figure shows the number of convolutional layers in each residual network unit. In the improved network, the residual network units at the front of the network have more layers and those at the rear have fewer. This design lets the entire network contain deeper branches while using the same number of parameters, thereby improving the quality of the generated images. Because the adjusted network has more deep branches, the optimized network can work more efficiently. At the same time, in order to avoid gradient dispersion and overfitting in deep networks, a pooling layer is added to the branches with deeper network layers, that is, the residual network units near the output end. The whole network is composed of three parts: feature extraction, nonlinear mapping of the residual function, and SR image reconstruction.

Evaluation criteria
Objectively, the deviation between the restored image and the original image is generally used to evaluate the quality of image restoration. In this paper, peak signal-to-noise ratio (PSNR), structural similarity (SSIM), and Erreur Relative Globale Adimensionnelle de Synthèse (ERGAS) are used as reference evaluation indicators for image quality.
The larger the PSNR value, the smaller the difference between the reconstruction result and the original image, and the better the reconstruction effect. The calculation formula is as follows:

$$\mathrm{MSE} = \frac{1}{MN} \sum_{i=1}^{MN} (X_i - Y_i)^2, \quad \mathrm{PSNR} = 10 \log_{10} \frac{\max(X_i)^2}{\mathrm{MSE}}$$

Here, $X_i$ and $Y_i$ are the $i$th pixels of the original HR reference image and of the reconstructed image, and $M$ and $N$ are the height and width of the image. Generally, the maximum value $\max(X_i)$ is 255, which can be directly substituted in the formula.
PSNR is mainly based on pixel-by-pixel comparison, and its evaluation of the local structure of the image is relatively weak. Sometimes the PSNR values of two images are close, but their visual effects are very different. Images generally have their own structures, and adjacent pixels are more or less correlated. SSIM is a structural measure between the reconstruction result and the reference high-resolution image. The calculation formula is as follows:

$$\mathrm{SSIM}(X, Y) = \frac{(2 \mu_X \mu_Y + C_1)(2 \sigma_{XY} + C_2)}{(\mu_X^2 + \mu_Y^2 + C_1)(\sigma_X^2 + \sigma_Y^2 + C_2)}$$

Here, $X$ and $Y$ are the reference HR image and the restored result image, respectively. $\mu_X$ and $\mu_Y$ represent the average pixel values of the two images, which are defined as follows:

$$\mu_X = \frac{1}{N} \sum_{i=1}^{N} X_i, \quad \mu_Y = \frac{1}{N} \sum_{i=1}^{N} Y_i$$

$N$ is the number of elements obtained when the image is unrolled column by column into a vector. $\sigma_X^2$ and $\sigma_Y^2$ are the corresponding variances, defined as follows:

$$\sigma_X^2 = \frac{1}{N-1} \sum_{i=1}^{N} (X_i - \mu_X)^2, \quad \sigma_Y^2 = \frac{1}{N-1} \sum_{i=1}^{N} (Y_i - \mu_Y)^2$$

$\sigma_{XY}$ is the covariance, which is defined as:

$$\sigma_{XY} = \frac{1}{N-1} \sum_{i=1}^{N} (X_i - \mu_X)(Y_i - \mu_Y)$$

$C_1$ and $C_2$ are small positive constants that keep the denominator from being zero. The value of SSIM ranges from 0 to 1. The closer the value is to 1, the more similar the two images are, and the better the reconstruction result is.
ERGAS is a quality evaluation method proposed for image fusion research, which reflects the degree of spectral distortion between the restored image and the reference image. It is also commonly used to evaluate the quality of super-resolution restoration. The calculation formula is as follows:

$$\mathrm{ERGAS} = 100 \, \frac{h}{l} \sqrt{\frac{1}{K} \sum_{k=1}^{K} \frac{\mathrm{RMSE}(k)^2}{\mu(k)^2}}$$

Here, $l$ and $h$ represent the resolution before and after image reconstruction, $K$ represents the number of bands, $\mu(k)$ represents the mean of band $k$, and $\mathrm{RMSE}(k)$ represents the root mean square error of band $k$. The ideal value of ERGAS is 0.
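As a minimal sketch of the PSNR computation described above, assuming 8-bit images so that the peak value is 255 (the function name `psnr` and its `peak` parameter are illustrative choices):

```python
import numpy as np

def psnr(reference, reconstructed, peak=255.0):
    """Peak signal-to-noise ratio in dB between two images of equal shape."""
    x = np.asarray(reference, dtype=np.float64)
    y = np.asarray(reconstructed, dtype=np.float64)
    mse = np.mean((x - y) ** 2)  # mean squared error over all pixels
    if mse == 0:
        return float("inf")      # identical images: PSNR is unbounded
    return 10.0 * np.log10(peak ** 2 / mse)
```

When every pixel differs by exactly one gray level, the MSE is 1 and the PSNR equals $20 \log_{10} 255$, which gives a convenient hand check.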
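The SSIM definition can be sketched as a single-window computation over the whole image. Several points are assumptions rather than details taken from the text: the unbiased $1/(N-1)$ normalization, the constants $C_1 = (0.01 \cdot 255)^2$ and $C_2 = (0.03 \cdot 255)^2$, and the function name `ssim_global`. Practical SSIM implementations usually average the index over local sliding windows instead.

```python
import numpy as np

def ssim_global(reference, restored, peak=255.0, k1=0.01, k2=0.03):
    """Single-window SSIM computed from global image statistics."""
    x = np.asarray(reference, dtype=np.float64).ravel()
    y = np.asarray(restored, dtype=np.float64).ravel()
    c1 = (k1 * peak) ** 2  # stabilizes the luminance term
    c2 = (k2 * peak) ** 2  # stabilizes the contrast/structure term
    mu_x, mu_y = x.mean(), y.mean()
    var_x = x.var(ddof=1)  # unbiased variance, 1 / (N - 1)
    var_y = y.var(ddof=1)
    cov_xy = np.sum((x - mu_x) * (y - mu_y)) / (x.size - 1)
    numerator = (2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)
    denominator = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return numerator / denominator
```

Comparing an image with itself makes the numerator and denominator equal, so the function returns 1, the upper end of the range described above.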
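A minimal sketch of the ERGAS computation for a multi-band image follows. The array layout `(K, H, W)`, the function name `ergas`, and passing the ratio $h/l$ as a single argument are assumptions made for illustration.

```python
import numpy as np

def ergas(reference, restored, ratio):
    """ERGAS for images of shape (K, H, W); `ratio` is h / l from the formula."""
    ref = np.asarray(reference, dtype=np.float64)
    rec = np.asarray(restored, dtype=np.float64)
    rmse = np.sqrt(np.mean((ref - rec) ** 2, axis=(1, 2)))  # RMSE of each band
    mean = np.mean(ref, axis=(1, 2))                        # mean of each reference band
    return 100.0 * ratio * np.sqrt(np.mean((rmse / mean) ** 2))
```

As a hand check, two bands with mean 10 and a uniform error of 1 give a relative RMSE of 0.1 per band, so a ratio of 0.5 yields 100 × 0.5 × 0.1 = 5.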

Experimental environment and settings
The experimental software environment is Ubuntu 14.04, Python 2.7, and TensorFlow 1.4; the hardware environment is an Intel Core i7-6700K CPU with 16 GB of RAM and an NVIDIA GTX1080 GPU. We use the remote sensing image scene classification data set NWPU-RESISC45 [37] created by Northwestern Polytechnical University. The data set includes 45 scenes, each scene has 700 images, and each image is 256×256 pixels, ensuring the authenticity and diversity of the experimental data.
From each type of remote sensing image, 100 images with obvious features are selected, for a total of 4500 images. These images constitute the training data set used to train the algorithm model. In addition, a total of 450 images of each type are chosen as test data sets, and different SR algorithms (SRCNN, FSRCNN, DRCN, VDSR, EDSR, and IDRRN) are used to simulate the test results. Some of the training images are shown in Fig. 6, comprising the following scenes: airplane, basketball_court, bridge, circular_farmland, harbor, industrial_area, intersection, and parking_lot.
For the input images, the original training image is first downsampled by the magnification factor $n$ to obtain an LR image. The LR image is then cropped into a set of sub-images of $f_{sub} \times f_{sub}$ pixels with stride $s$, and HR sub-images of $(n f_{sub})^2$ pixels are cropped from the corresponding positions of the real image. These LR/HR sub-image pairs are the training samples. To ensure that the image size does not change during the mapping process, the convolutional layers are zero-padded. When training IDRRN, the deconvolution filter generates an output image of size $(n f_{sub} - n + 1)^2$. Therefore, we need to crop $n - 1$ boundary pixels from each HR sub-image.
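The sub-image bookkeeping can be sketched as follows. The sketch assumes plain decimation as the downsampling operator and removes the $n - 1$ boundary pixels from the bottom and right edges of each HR sub-image; neither choice is stated in the text, and the function name `make_pairs` is hypothetical.

```python
import numpy as np

def make_pairs(hr, n, f_sub, stride):
    """Cut aligned LR/HR training sub-images from one 2-D HR image.

    LR patches are f_sub x f_sub; HR patches are cropped to
    (n * f_sub - n + 1) pixels per side to match the deconvolution output.
    """
    lr = hr[::n, ::n]              # assumption: decimation by the scale factor n
    out = n * f_sub - n + 1        # side length of the cropped HR sub-image
    pairs = []
    for y in range(0, lr.shape[0] - f_sub + 1, stride):
        for x in range(0, lr.shape[1] - f_sub + 1, stride):
            lr_patch = lr[y:y + f_sub, x:x + f_sub]
            hr_patch = hr[n * y:n * y + out, n * x:n * x + out]
            pairs.append((lr_patch, hr_patch))
    return pairs
```

For a 256×256 image with `n=3`, `f_sub=11`, and `stride=10`, each LR patch is 11×11 and each HR patch is 31×31, since 3 · 11 − 3 + 1 = 31.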

Quantitative results of SR methods
The IDRRN network proposed in this paper has a depth of 12 layers. The filter size should be odd so that the filter has a center, such as 3×3, 5×5, or 7×7. Using smaller convolution kernels is a current trend for reducing parameters while maintaining network accuracy. The parameter setting of the convolution layers is the same as in VDSR [24]: all convolutional layer filters are 3×3 in size, and the number of filters is 64. The deconvolution layer is randomly initialized from a Gaussian distribution with a mean of 0 and a standard deviation of 0.001, and the ReLU function is used as the activation function. The filter size follows the DRCN algorithm [35] and is 5×5; the stride is equal to the amplification factor $n$. During training, the batch size is 128, the momentum is 0.9, and the weight decay is 0.0001. The initial learning rate is set to 0.1 and is halved every 15 epochs; training stops after 120 epochs, and the loss function is the mean square error (MSE).
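The step learning-rate schedule described above can be written as a small helper. The sketch assumes that the schedule is counted in zero-indexed training epochs; the function name `learning_rate` and its parameter names are illustrative.

```python
def learning_rate(epoch, base_lr=0.1, drop_every=15, factor=0.5):
    """Learning rate for a zero-indexed epoch under a step-decay schedule."""
    return base_lr * factor ** (epoch // drop_every)

# Rates used over the 120 training epochs: 0.1, 0.05, 0.025, ...
schedule = [learning_rate(epoch) for epoch in range(120)]
```

Under these assumptions the rate is halved seven times during training, so the final epochs run at 0.1 / 2^7.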
The performance of the proposed approach has been compared with the results obtained by six different SR methods available in the literature (Bicubic, SRCNN [22], FSRCNN [28], DRCN [35], VDSR [24], and EDSR [30]). Three different scaling factors, ×2, ×3, and ×4, have been tested over the considered image data set (airplane, bridge, harbor, intersection, and parking_lot). All the tested methods have been used considering the default settings suggested by the methods' authors for each particular scaling ratio. Table 1 provides a brief PSNR/SSIM description of the SR techniques.
As shown in Table 1, the average PSNR and SSIM values of the images generated by the method in this paper are higher than those of the other current mainstream SISR algorithms.
The PSNR values are optimal in 5 types of scenarios. The maximum boost value is […] Because of the particular characteristics of remote sensing images, this paper also uses the ERGAS value in Formula (17) to compare the SR results and further verify the effectiveness of the improved algorithm. Table 2 shows that, among the 15 ERGAS results, the IDRRN algorithm obtains 11 optimal values. By analyzing and comparing the SR results in Tables 1 and 2, we find that recursive residual learning can transfer more effective image information to the deeper layers of the network, learn more image features, and greatly improve the image restoration quality.
Furthermore, thanks to its inherent parameter sharing, the proposed IDRRN approach achieves higher parameter efficiency than other learning-based methods. In Fig. 7, we illustrate the parameters-to-PSNR relationship of our model and several state-of-the-art methods, including SRCNN, FSRCNN, DRCN, VDSR, and EDSR. Our method represents a favorable trade-off between model size and SR performance and has a modest processing time.
The addition of improved recursive structure does not need to increase the number of parameters. In addition, it improves the restoration quality of the image. The network structure is more compact and the objective performance is better.

Visual results and discussion
In order to demonstrate the effectiveness of our approach more fully, we also show some of the visual comparisons on three scales ×2, ×3, and ×4. Figures 8, 9, and 10 show the qualitative evaluation results of various algorithms. By enlarging the details of the image, the quality of image restoration of several SISR methods can be intuitively evaluated from the visual effect. It can be seen from the figures that our method has a significant improvement in both image sharpness and clarity. After image processing, it is easier to identify multiple image categories in the remote sensing image. IDRRN overcomes the shortcomings of the overall smooth reconstruction result of the traditional method, and the reconstruction result restores more high-frequency details.
In addition, the comparison of the enlarged parts of the tail of the aircraft in Fig. 8, the ships in the port in Fig. 9, and the vehicles on the bridge in Fig. 10 shows that the image reconstructed by the IDRRN network is sharper than those produced by the other mainstream algorithms. IDRRN performs better in restoring remote sensing image details and is more effective in repairing complex textures in damaged images. After repair, the details in the image are richer and more consistent with the visual characteristics of the human eye. With the SR restoration of the remote sensing image, the texture and edges are clearer, and the objects in the output image are easier to recognize.

Conclusion
In this paper, we propose a new type of residual network that introduces an improved recursive structure in the residual block. The jump connection and recursive structure can effectively reduce the burden of carrying characteristic information on the network, achieving high-quality SR remote sensing image recovery. Experiments were performed using the NWPU-RESISC45 remote sensing image data set, and PSNR, SSIM, and ERGAS are the objective quality evaluation index of image SR. Experimental results show that compared with other super-resolution methods based on CNN, the method in this paper has more compact network structure and fewer model parameters, and the reconstruction details are more abundant. Moreover, the restoration results have better visual effects and are more conducive to further remote sensing image analysis.
In the next work, we will try to generalize the proposed IDRRN method to color images by designing a more compact network structure and improving the loss function of the model. In addition, we hope to further improve the details of super-resolution images and the repair effect of complex textures.