Pansharpening based on convolutional autoencoder and multi-scale guided filter

In this paper, we propose a pansharpening method based on a convolutional autoencoder. The convolutional autoencoder is a type of convolutional neural network (CNN) whose objective is to reduce the input dimension and represent image features with high accuracy. First, the autoencoder network is trained to reduce the difference between degraded panchromatic image patches and the reconstructed original panchromatic image patches. The intensity component, which is produced by the adaptive intensity-hue-saturation (AIHS) method, is then fed into the trained convolutional autoencoder network to generate an enhanced intensity component of the multi-spectral image. The pansharpening is accomplished by enhancing the panchromatic image from the enhanced intensity component using a multi-scale guided filter; the extracted spatial detail is then injected into the upsampled multi-spectral image. Real and degraded datasets are utilized for the experiments, which show that the proposed technique can preserve high spatial detail and high spectral characteristics simultaneously. Furthermore, the experimental results demonstrate that the proposed method achieves state-of-the-art results in terms of subjective and objective assessments on remote sensing data.


Introduction
Many applications based on remote sensing satellites require observing changes on the Earth's surface, such as image fusion [1][2][3] and land cover mapping [4]. Consequently, pansharpening is one of the essential interests of many scientists. Due to data transmission limitations, it is difficult for remote sensing satellites to acquire an image with both high spatial resolution and high spectral resolution at the same time; instead, they provide a panchromatic image (PAN) and a multi-spectral image (MS). The main objective of pansharpening is thus to fuse the high spatial resolution PAN image with the corresponding high spectral resolution MS image to obtain an MS image with both high spatial and spectral resolution [5].
As indicated by [6][7][8], the wide assortment of image fusion techniques can be classified into two classes based on the way the spatial detail is extracted from the PAN image: (1) component substitution (CS) and (2) multi-resolution analysis (MRA). And some methods do

Convolutional autoencoder
The autoencoder belongs to unsupervised learning: it takes an input image and attempts to reconstruct it. The convolutional autoencoder is a type of convolutional neural network that reproduces the input image patches at its output. The design of a convolutional autoencoder comprises two fundamental phases: the encoding phase and the decoding phase. The encoding phase represents half of the network and incorporates convolution and max-pooling layers. In contrast, the decoding phase, which recreates the input image patches from the degraded ones, comprises deconvolution and upsampling layers [29].

Encoding phase
A convolution is performed between an input volume I = {I_1, ..., I_D} with D channels and each convolutional layer, which is composed of n convolutional filters F^(1) = {F^(1)_1, ..., F^(1)_n} that produce n feature maps:

O_m = a( I * F^(1)_m + b_m ),  m = 1, ..., n

where O_m represents the m-th feature map of the input I, b_m represents the bias, * denotes the convolution operator, and a denotes an activation function.

Decoding phase
The n produced feature maps O = {O_i}_{i=1}^{n} are used as input to the decoder, which reconstructs the input image by convolving O with the convolutional filters F^(2) = {F^(2)_1, ..., F^(2)_n}:

Ĩ = a( O * F^(2) + b^(2) )

Considering that the output image patches and the input have the same dimensions, it is possible to relate I and Ĩ through a loss function, for example the mean square error (MSE), to update the weights during training.

Adaptive intensity-hue-saturation
The IHS technique belongs to the CS-based methods introduced in [30], and it is only appropriate for MS images with three bands [11]. Although the IHS strategy delivers excellent spatial quality, it suffers severely from spectral distortion. The general formula for generating the intensity component is

I = Σ_{i=1}^{n} α_i M_i    (5)

where α_i denotes the weight coefficients, n represents the number of spectral bands, and M_i indicates the i-th band of the upsampled MS image. Therefore, Rahmani et al. [31] introduced AIHS, in which the optimal weights are obtained by solving the following optimization problem:

α* = arg min_α || PAN − Σ_{i=1}^{n} α_i M_i ||²

where PAN denotes the panchromatic image.
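As an illustration of how such adaptive weights can be estimated, the snippet below solves the least-squares problem above with NumPy. This is only a sketch: it is a plain unconstrained least-squares fit, whereas the original AIHS formulation of Rahmani et al. additionally constrains the weights (e.g., to be nonnegative).

```python
import numpy as np

def aihs_intensity(ms_up, pan):
    """Estimate band weights alpha by least squares so that the weighted
    band sum approximates the PAN image, then form the intensity
    component I = sum_i alpha_i * M_i.
    Note: unconstrained sketch; the original AIHS imposes additional
    constraints on the weights."""
    H, W, B = ms_up.shape
    A = ms_up.reshape(-1, B)             # one row per pixel, one column per band
    alpha, *_ = np.linalg.lstsq(A, pan.ravel(), rcond=None)
    intensity = (A @ alpha).reshape(H, W)
    return intensity, alpha
```

When the PAN image is exactly a weighted sum of the upsampled MS bands, the fit recovers those weights; in practice it returns the best approximation in the least-squares sense.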

Guided filter
The guided filter (GF) was introduced by He et al. [32] and has been widely utilized in image processing fields such as detail enhancement and image fusion; it can also avoid ringing artifacts. The GF depends on a local linear model that uses the guidance image gui to filter the input image inp, so that the output image Out preserves the essential data of inp while following the variation trend of gui at the same time [19]. Mathematically, the guided filter finds, in each window w_i, a pair of scalar values a_i and b_i that solves the following problem [33]:

E(a_i, b_i) = Σ_{k ∈ w_i} ( (a_i gui_k + b_i − inp_k)² + ζ a_i² )

Here, n denotes the number of pixels in a squared window w with size (2r+1)×(2r+1), and ζ is a small regularization constant that prevents large a_i. The solution is

a_i = ( (1/n) Σ_{k ∈ w_i} gui_k inp_k − ḡui_i · īnp_i ) / ( σ_i² + ζ ),  b_i = īnp_i − a_i ḡui_i

where īnp_i and ḡui_i represent the input image mean and the guidance image mean over w_i, respectively, and σ_i² is the variance of gui over w_i. Thus, after computing a_i and b_i for all windows in the image, the filtering output is computed as follows:

Out_k = ā_k gui_k + b̄_k

where ā_k and b̄_k are the averages of a_i and b_i over all windows containing pixel k. In this paper, the guided filter operation is denoted by Out = GF(inp, gui, r, ζ).
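The guided filter described above can be sketched in a few lines of NumPy. This is a generic single-channel implementation of He et al.'s formulation (box means computed via an integral image), not the authors' code; the default r = 8 and ζ = 0.8² follow the parameter study reported later in the paper.

```python
import numpy as np

def box_mean(img, r):
    """Mean over a (2r+1)x(2r+1) window, edge-padded, via an integral image."""
    k = 2 * r + 1
    p = np.pad(img, r, mode='edge')
    ii = np.pad(p, ((1, 0), (1, 0))).cumsum(0).cumsum(1)
    H, W = img.shape
    s = ii[k:k+H, k:k+W] - ii[:H, k:k+W] - ii[k:k+H, :W] + ii[:H, :W]
    return s / (k * k)

def guided_filter(inp, gui, r=8, eps=0.8**2):
    """Guided filter of He et al.: local linear model Out = a*gui + b."""
    mean_g = box_mean(gui, r)
    mean_i = box_mean(inp, r)
    corr_gi = box_mean(gui * inp, r)
    var_g = box_mean(gui * gui, r) - mean_g ** 2
    a = (corr_gi - mean_g * mean_i) / (var_g + eps)   # per-window slope
    b = mean_i - a * mean_g                           # per-window offset
    # average the coefficients over all windows covering each pixel
    return box_mean(a, r) * gui + box_mean(b, r)
```

A quick sanity check of the local linear model: filtering a constant input returns the same constant (a ≈ 0, b ≈ the constant) regardless of the guidance image.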

Methodology
In this paper, we propose a pansharpening technique based on a convolutional autoencoder and a CS-based method. The main steps of the proposed technique are: • Train the convolutional autoencoder on the spatial resolution enhancement of the degraded PAN image, so that it can later enhance the intensity component obtained by AIHS from the MS and PAN images. • Generate the intensity component of the MS image using the AIHS-based method, and feed it to the trained convolutional autoencoder as a testing step.
• Utilize the estimated intensity component to enhance the PAN image by using the guided filter.
• Perform the fusion step, which represents the last phase of the proposed technique and is explained in detail later. Figure 1 illustrates the schematic of the proposed method.

Enhancing the spatial detail
To enhance the spatial detail of the intensity component, we utilize a convolutional autoencoder network that learns the relationship between PAN image patches and their degraded counterparts. Note that the degraded PAN image is generated using bi-cubic interpolation. The convolutional autoencoder is trained to minimize the difference between the input image patches and the reconstructed original image patches. Figure 2 illustrates the applied structure of the convolutional autoencoder. Following [28], the same training setup applies here: the PAN image and its spatially degraded version are partitioned into 8×8 patches with 5 overlapping pixels, yielding 500,000 patch pairs, and the network is trained for 30 epochs so that the relationship between the PAN image patches and the degraded patches is learned. The following equation illustrates the output patches of the convolutional autoencoder network at each iteration:

P_i = Dec(Enc(P^L_i)),  i = 1, ..., n    (12)

where {P_i}_{i=1}^{n} and {P^L_i}_{i=1}^{n} represent the output and input patches, respectively, and Enc and Dec indicate the encoding and decoding processes. The encoding process involves several layers: (1) the 8×8 input image patch; (2) a Conv2D layer (a 2D convolutional layer) with 16 filters of 3×3 kernel size, "ReLU" activation, and "same" padding, where the "ReLU" activation is used for its simplicity and computational efficiency compared with other activation functions [34]; (3) a 2D max-pooling layer over 2×2 regions with "same" padding; (4) a Conv2D layer with 8 filters of 3×3 kernel size, "ReLU" activation, and "same" padding; (5) max-pooling over 2×2 regions with "same" padding; and (6) a Conv2D layer with 8 filters of 3×3 kernel size, "ReLU" activation, and "same" padding. CAEs are fully convolutional networks; thus, the decoding process also consists of convolutions.
The decoding process involves several layers: (1) a Conv2D layer with 8 filters of 3×3 kernel size, "ReLU" activation, and "same" padding; (2) a 2D UpSampling layer over 2×2 regions; (3) a Conv2D layer with 8 filters of 3×3 kernel size, "ReLU" activation, and "same" padding; (4) UpSampling over 2×2 regions; (5) a Conv2D layer with 16 filters of 3×3 kernel size, "ReLU" activation, and "same" padding; and (6) a Conv2D layer with 1 filter of 3×3 kernel size, "linear" activation, and "same" padding. Adadelta optimization is used throughout training, and the MSE between the reconstructed output patches and the target patches {P^H_i}_{i=1}^{n} is used for updating the weights as follows:

MSE = (1/n) Σ_{i=1}^{n} || P^H_i − P_i ||²

The weights are updated through the back-propagation algorithm during training. In the testing stage, because of the similar characteristics of the PAN image and the corresponding intensity component of the MS image, the trained network is expected to improve the intensity component of the MS image: the intensity component I generated by Eq. (5) is partitioned into patches {I_i}_{i=1}^{n} and fed to the trained network, and the resulting output patches are tiled to form the estimated intensity component.
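To make the patch partitioning and the layer-by-layer shapes concrete, the following NumPy sketch extracts 8×8 patches with 5 overlapping pixels (stride 3) and traces one patch through the encoder/decoder stack described above. The convolution weights are random stand-ins for the learned filters (training with Adadelta/MSE is not shown), so only the shapes, not the values, are meaningful.

```python
import numpy as np

def extract_patches(img, size=8, overlap=5):
    """Partition a 2-D image into size x size patches with `overlap`
    overlapping pixels between neighbours (stride = size - overlap)."""
    stride = size - overlap
    H, W = img.shape
    return np.stack([img[i:i+size, j:j+size]
                     for i in range(0, H - size + 1, stride)
                     for j in range(0, W - size + 1, stride)])

def conv2d_same(x, n_filters, rng, relu=True, k=3):
    """3x3 'same' convolution with random stand-in weights;
    ReLU activation unless relu=False (linear output layer)."""
    H, W, C = x.shape
    w = rng.standard_normal((k, k, C, n_filters)) * 0.1
    xp = np.pad(x, ((k // 2, k // 2), (k // 2, k // 2), (0, 0)))
    out = np.empty((H, W, n_filters))
    for i in range(H):
        for j in range(W):
            out[i, j] = np.tensordot(xp[i:i+k, j:j+k], w, axes=3)
    return np.maximum(out, 0) if relu else out

def maxpool2(x):
    H, W, C = x.shape
    return x.reshape(H // 2, 2, W // 2, 2, C).max(axis=(1, 3))

def upsample2(x):
    return x.repeat(2, axis=0).repeat(2, axis=1)

def cae_forward(patch, rng):
    # encoder: Conv16 -> pool -> Conv8 -> pool -> Conv8
    x = conv2d_same(patch, 16, rng)            # 8x8x16
    x = maxpool2(x)                            # 4x4x16
    x = conv2d_same(x, 8, rng)                 # 4x4x8
    x = maxpool2(x)                            # 2x2x8
    code = conv2d_same(x, 8, rng)              # 2x2x8 (bottleneck)
    # decoder: Conv8 -> up -> Conv8 -> up -> Conv16 -> Conv1 (linear)
    x = conv2d_same(code, 8, rng)              # 2x2x8
    x = upsample2(x)                           # 4x4x8
    x = conv2d_same(x, 8, rng)                 # 4x4x8
    x = upsample2(x)                           # 8x8x8
    x = conv2d_same(x, 16, rng)                # 8x8x16
    return conv2d_same(x, 1, rng, relu=False)  # 8x8x1, linear output
```

The reconstruction has the same 8×8 size as the input patch, which is what allows the MSE loss between input and output patches to be computed directly.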

Fusion process
The estimated intensity component E_I is employed to enhance the PAN image by means of the two-scale guided filter. First, E_I is used as the guidance image and the PAN image as the input image, producing the first-scale approximation image O_1.
The spatial detail D_1 is given by the difference between the input image and the approximation image O_1. However, D_1 may still blend with low-frequency components and cause serious spectral distortion [35]; therefore, D_1 is used as the input image for the second scale of the guided filter, producing O_2.
The spatial detail D_2 is given by the difference between O_1 and O_2.
The total detail map D_Total is injected into the upsampled MS image through injection gains g_i, which are adjusted by Eq. (19). The high-resolution multi-spectral (HRMS) fused image is obtained by the following equation:

HRMS_i = M̃_i + g_i · D_Total

where M̃_i denotes the i-th band of the upsampled MS image.
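The injection step can be sketched as follows. The gains g_i are taken as given inputs here, since their adjustment rule (Eq. (19) in the paper) is not reproduced in this excerpt; the function simply adds the gain-weighted detail map to each upsampled MS band.

```python
import numpy as np

def inject_detail(ms_up, d_total, gains):
    """Per-band detail injection: HRMS_i = M_i + g_i * D_total.
    ms_up:   (H, W, B) upsampled MS image
    d_total: (H, W)    total detail map from the two-scale guided filter
    gains:   (B,)      injection gains (assumed given)"""
    return ms_up + d_total[..., None] * gains
```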

Results and discussion
In this section, several experiments performed on different datasets are reported to evaluate the performance of the model with respect to several quality metrics. Here, 8×8 patches with 5 overlapping pixels of the degraded PAN and original PAN images, comprising 500,000 patch pairs, were utilized for training the network. In total, six datasets were selected for the experiments: three degraded datasets (full reference, meaning a reference image is available) and three real datasets (no reference image), drawn from the QuickBird and GeoEye satellites. We compared our technique with several conventional pansharpening methods, such as IHS [11], PCA [12], BDSD [36], PRACS [37], and AIHS [31], and several state-of-the-art methods, such as SFIM [15], MTF-GLP [16], Indusion [17], MSGF [19], CAE [28], and PNN [38]. Moreover, seven widely used image quality indexes are employed to assess the quality of the fused image:
1. Correlation coefficient (CC) [39]
2. Universal Image Quality Index (UIQI) [40]
3. Quaternion Theory-based Quality Index (Q4) [40]
4. Root mean square error (RMSE) [41]
5. Relative average spectral error (RASE) [42]
6. Spectral Angle Mapper (SAM) [43]
7. Erreur Relative Globale Adimensionnelle de Synthèse (ERGAS) [44]
To assess the quality of the fused images for the real datasets, D_s, D_λ, and QNR [45] were employed. The ideal value of each quality index is shown in parentheses in the tables.

Parameter investigation
Here, we study the influence of the guided filter parameters on the fusion of the degraded QuickBird-1 dataset, namely, the window size r and the regularization parameter ζ. Figures 3, 4, and 5 illustrate the influence of these parameters, where the horizontal axis is the regularization parameter ζ for three cases of window size r and the vertical axis is the quality index result. As can be seen, the best performance results are obtained by setting the parameters r and ζ to 8 and 0.8², respectively.

Fusion results of degraded datasets (full reference)
In this section, the simulations were carried out on degraded datasets that have the reference image to evaluate our proposed method according to Wald's protocol [46]. Regarding the degraded datasets (QuickBird, GeoEye), the sizes of the MS image and the PAN image

Experiments on degraded QuickBird datasets
In this section, two pairs of QuickBird satellite datasets were examined; Fig. 6 illustrates the fusion results on the degraded QuickBird-1 dataset. For better comparison, the red square area is enlarged and displayed at the bottom left of each fusion image. As can be observed, the methods in Fig. 6d-j produce inferior pansharpening results compared with the CAE and proposed methods. Figure 6i-j suffer from spatial distortion, and Fig. 6m suffers from both spatial and spectral distortion. The fusion result of the PNN method, depicted in Fig. 6n, shows some unnatural color compared with the reference image. The CAE method (Fig. 6l) and the proposed method (Fig. 6o) look most similar to the reference image (Fig. 6a), but the proposed method performs better in terms of spectral and spatial fidelity. Similar observations can be made for the QuickBird-2 dataset. Figure 7 displays the fusion results on the degraded QuickBird-2 dataset; for better visual comparison, the red rectangle area is enlarged and displayed at the bottom of the selected area. The proposed and CAE methods achieve the best visual results. In terms of objective evaluation, the numerical indexes of the fused images in Figs. 6 and 7 are computed and reported in Tables 2 and 3, respectively. From both tables, it is clear that our method attains the best values of the quality indexes.
Here, it can also be seen that the SFIM, Indusion, and MTF-GLP methods perform well, as shown in Fig. 8i-k. We can observe from Fig. 8l that the result of the CAE method has a color problem in the vegetation area compared with the reference image. The colors of the fused images of the MSGF and PNN methods show remarkable distortion, as shown in Fig. 8m, n. Overall, the proposed method creates a fused image with appropriate spectral and spatial resolution compared with the others, as shown in Fig. 8o. The numerical indexes of the fused images in Fig. 8 are computed and reported in Table 4. From the table, it is clear that our method attains the best values in most quality indexes.

Fusion results of real datasets (no reference)
Regarding real datasets, two kinds of real datasets (QuickBird, GeoEye) were implemented, and the sizes of the MS image and the PAN image are 256×256 and 1024×1024, respectively.

Experiments on real QuickBird datasets
Two pairs of real QuickBird satellite datasets were examined; for better visual comparison, the red square area is enlarged and displayed at the bottom left of each fusion image. Figure 9 displays the fusion results of the real QuickBird-1 dataset.
All methods produce improved fusion results, but the CS-based methods and the CAE method suffer from spectral distortion, as shown in Fig. 9c, e, and k, and the BDSD fusion method exhibits remarkable distortions. The SFIM, Indusion, and MTF-GLP methods achieve relatively better spectral resolution than the others, as shown in Fig. 9h-j. The MSGF method suffers from spatial distortion, as shown in Fig. 9l, and the colors of the fused image of the PNN method show remarkable distortions. However, the proposed method performs better than the others, as shown in Fig. 9o. Similar observations can be made for the real QuickBird-2 dataset. Figure 10 displays the fusion results of the real QuickBird-2 dataset. The CS-based methods suffer from spectral distortion, as shown in Fig. 10c, e, and the BDSD fusion method exhibits remarkable distortions, as shown in Fig. 10e. The CAE method performs well in the spatial aspect but still shows a lighter color in the vegetation area compared with the upsampled MS image, as shown in Fig. 10k.
The fusion results of the SFIM, Indusion, MTF-GLP, MSGF, PNN, and proposed methods are improved in both the spectral and spatial aspects.
The numerical measurements of the real-data fused images in Figs. 9 and 10 are computed and listed in Tables 5 and 6, respectively. Table 5 shows that the proposed method achieved the best values in terms of D_λ and D_s. Likewise, our method showed the best values in terms of D_λ and QNR, as reported in Table 6. Figure 11 displays the fusion results of the real GeoEye-1 dataset. The selected red square area is enlarged and displayed at the bottom right of each fusion image for better visual comparison. As shown in Fig. 11c-e, these methods perform well in the spatial aspect but suffer from spectral distortion, while Fig. 11f-i and l suffer from notable spectral and spatial distortion. Here, it can be seen that the MTF-GLP, CAE, and proposed methods perform well, as shown in Fig. 11j, k, and o.

Experiment on real GeoEye dataset
Overall, the proposed method creates a fused image with appropriate spectral and spatial resolution. The numerical indexes of the fused images in Fig. 11 are computed and reported in Table 7. From Table 7, the PNN method achieves the best value in terms of D_λ, followed by our method. Overall, our method still attains the best values in most quality indexes.

Conclusion
In this paper, we have proposed a pansharpening technique based on a convolutional autoencoder combined with AIHS and a multi-scale guided filter. The proposed method first trains the convolutional autoencoder to learn the relationship between the panchromatic image and its degraded version; the trained network is then used to enhance the intensity component. Furthermore, the multi-scale guided filter is used to enhance the original panchromatic image. Several experiments were conducted and their results reported. The outcomes of this research are, first, that in visual terms the proposed method preserves more of the spectral detail of the MS image and the spatial detail of the panchromatic image than existing fusion methods, and second, that the quality indexes of our method show significant improvements over the comparative methods. Overall, the model developed in this research preserves appropriate spatial and spectral characteristics of the fused image compared with the comparative methods in both subjective and objective evaluations.