According to recent research on single image super-resolution (SR), sparse representation has shown advantages in recovering discontinuous and inhomogeneous image regions [9, 16, 17, 27, 29]. Therefore, our proposed SR method also uses sparse coding to represent image features. As mentioned in Section 2, the global dictionary provides comprehensive image structure, while the local dictionary is more relevant to the image to be enhanced. We therefore propose an SR method that utilizes the global and local dictionaries together to generate high resolution images with clear texture and suppressed blurring artifacts.

### Overview of our proposed SR method

The flowchart of our proposed single image super-resolution method is presented in Fig. 1. First, a set of global dictionary pairs \(\left \{D^{H}_{i},D^{L}_{i}|i=1.1, 1.2, \dots, 4\right \}\) is trained from a large number of natural images. Here, *i* represents the upscaling ratio from LR to HR images, and *D*^{H} and *D*^{L} represent the HR and LR image dictionaries, respectively. These global dictionary pairs are generated under the assumption that LR and HR images share the same sparse representation. By training multiple dictionary pairs, a multi-scale mapping relation between HR and LR images can be established.

Given a low resolution image \(I_{\text {LR}} \in \mathbb {R}^{m\times n}\) and the scale factor *s*, a high resolution image \(I_{\text {HR}} \in \mathbb {R}^{sm\times sn}\) is gradually generated by our proposed SR method. First, the magnification factor *s*_{i} is initialized as *s*_{1}=*s*. According to the value of *s*_{i}, the corresponding global dictionary pair \(\left \{D^{H}_{s_{i}},D^{L}_{s_{i}}\right \}\) is used to magnify the low resolution image. In order to suppress the artifacts and noise introduced by sparse representation, a local dictionary \(D^{0}_{s_{i}}\) is generated. Since \(D^{0}_{s_{i}}\) is constructed from the self-information of the image, this dictionary is more consistent with the image content. Based on \(D^{0}_{s_{i}}\), a sparse fidelity term and a non-local smoothing term are used as constraints, so that the structure of the reconstructed HR image remains similar to that of the original input image. Afterwards, the magnification factor *s*_{i} is updated according to a blind image quality estimation function *f*(HR_{c}), where HR_{c} is the current estimated HR image. The HR image is iteratively updated until *f*(HR_{c}) converges.
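
The overall loop described above can be sketched as follows. The helpers `global_sr_step`, `local_sr_step`, and `quality_f` are hypothetical stand-ins (pixel replication, box smoothing, and a mean-gradient score) for the dictionary-based magnification, the local refinement, and the blind quality function *f*(HR_{c}); they only make the control flow runnable and are not the method itself.

```python
import numpy as np

def global_sr_step(lr, s_i):
    """Stand-in for sparse coding with the global pair {D^H_{s_i}, D^L_{s_i}}:
    integer pixel replication, so the sketch runs end to end."""
    s = int(round(s_i))
    return np.kron(lr, np.ones((s, s)))

def local_sr_step(hr):
    """Stand-in for the local-dictionary refinement (sparse fidelity +
    non-local smoothing): a simple 3x3 box smoothing."""
    p = np.pad(hr, 1, mode="edge")
    h, w = hr.shape
    return sum(p[i:i+h, j:j+w] for i in range(3) for j in range(3)) / 9.0

def quality_f(hr):
    """Stand-in blind quality score f(HR_c): mean absolute gradient."""
    return np.abs(np.diff(hr, axis=0)).mean() + np.abs(np.diff(hr, axis=1)).mean()

def super_resolve(lr, s, max_iter=10, tol=1e-4):
    hr = global_sr_step(lr, s)              # initial HR guess from the global dictionary
    score = quality_f(hr)
    for _ in range(max_iter):
        hr = local_sr_step(hr)              # refine with the local dictionary
        new_score = quality_f(hr)
        if abs(new_score - score) < tol:    # stop when f(HR_c) converges
            break
        score = new_score
    return hr

lr = np.random.rand(8, 8)
hr = super_resolve(lr, s=2)
print(hr.shape)  # (16, 16)
```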

The details of our proposed SR method are introduced as follows.

### Global dictionary training based on multi-scale image structures

According to Section 2.2, the initial value significantly affects the quality of the final HR image in local dictionary-based SR methods such as NCSR. Compared with NCSR's default initial value, which is generated by bicubic interpolation, the quality of the estimated HR image can be significantly improved if the initial value is better generated by another SR method. Therefore, we propose to train global dictionaries from a large dataset with multiple scaling factors to better generate the initial guess of the HR image.

In global dictionary-based sparse image representation, it is often assumed that the same image patch has the same sparse code in different resolutions. Given an LR image *I*_{lr} and a dictionary *D*_{lr} trained from an LR image dataset, the sparse codes of image patches in *I*_{lr} can be estimated. Under the assumption that the corresponding HR image *I*_{hr} shares the same sparse code with *I*_{lr}, we can reconstruct the high-resolution image from the low-resolution one if an appropriate high-resolution dictionary *D*_{hr} is also available. Clearly, the most important step is to find a dictionary pair *D*_{hr} and *D*_{lr} that can reliably represent the HR image and its LR version with the same sparse code.

Given a large high-resolution training dataset \(S_{{hr}}=\left \{I_{hr1}, I_{hr2}, \dots, I_{{hrn}}\right \}\) with clear natural images *I*_{hri}, the low-resolution training dataset \(S_{{lr}}=\left \{I_{lr1}, I_{lr2}, \dots, I_{{lrn}}\right \}\) is generated by applying Gaussian blurring, down-sampling, and bicubic scaling back to the same size as the images *I*_{hri} in *S*_{hr}. Afterwards, the images in *S*_{lr} and *S*_{hr} are decomposed into patch sets \(P_{{lr}}=\{p_{lr1}, p_{lr2}, \dots, p_{{lrm}}\}\) and \(P_{{hr}}=\{p_{hr1}, p_{hr2}, \dots, p_{{hrm}}\}\), where *m* is the number of patches extracted from the dataset.
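
A minimal sketch of this degradation and patch-extraction pipeline, assuming an illustrative Gaussian blur width `sigma` and patch stride, with SciPy's cubic spline `zoom` standing in for bicubic interpolation:

```python
import numpy as np
from scipy.ndimage import gaussian_filter, zoom

def make_lr(hr, s, sigma=1.0):
    """Degradation sketch: Gaussian blur, downsample by s, then scale back
    to the HR size so LR/HR patch pairs align spatially."""
    blurred = gaussian_filter(hr, sigma)
    small = zoom(blurred, 1.0 / s, order=3)
    return zoom(small, (hr.shape[0] / small.shape[0],
                        hr.shape[1] / small.shape[1]), order=3)

def extract_patches(img, size=7, stride=3):
    """Decompose an image into overlapping size x size patches (one per row)."""
    h, w = img.shape
    return np.array([img[r:r+size, c:c+size].ravel()
                     for r in range(0, h - size + 1, stride)
                     for c in range(0, w - size + 1, stride)])

hr = np.random.rand(32, 32)
lr = make_lr(hr, s=2.0)
P_hr, P_lr = extract_patches(hr), extract_patches(lr)
print(P_hr.shape, P_lr.shape)  # (81, 49) (81, 49)
```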

In order to guarantee that the dictionary represents viewing-sensitive textures well, we represent the image by its high-frequency component rather than the original image. Similar to [11], the features of an HR patch (*p*_{hri}) are extracted by subtracting the corresponding LR patch from the original HR patch, while the features of an LR patch (*p*_{lri}) are extracted using first- and second-order gradient filters. Afterwards, two training matrices can be generated: the HR training matrix (\({X_{{hr}}}=[x_{hr1}, x_{hr2}, \dots, x_{{hrm}}]\)) and the LR training matrix (\({X_{{lr}}}=[x_{lr1}, x_{lr2}, \dots, x_{{lrm}}]\)), where each *x* is a column vector reshaped from one training patch. Given *X*_{hr} and *X*_{lr}, the high- and low-resolution dictionaries can be estimated by Eqs. (4) and (5), respectively:
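
This feature extraction might be sketched as follows; the 1-D gradient kernels [-1, 0, 1] and [1, 0, -2, 0, 1] are the ones commonly used with [11], and applying them along both axes and stacking the results is one plausible arrangement, not necessarily the authors' exact layout.

```python
import numpy as np
from scipy.ndimage import convolve1d

def lr_features(patch_lr):
    """First- and second-order gradient features for an LR patch."""
    f1 = np.array([-1.0, 0.0, 1.0])             # first-order gradient kernel
    f2 = np.array([1.0, 0.0, -2.0, 0.0, 1.0])   # second-order gradient kernel
    maps = [convolve1d(patch_lr, f, axis=ax) for f in (f1, f2) for ax in (0, 1)]
    return np.concatenate([m.ravel() for m in maps])  # stacked feature vector

def hr_features(patch_hr, patch_lr):
    """HR feature: the high-frequency residual (HR minus its LR counterpart)."""
    return (patch_hr - patch_lr).ravel()

p_hr = np.random.rand(7, 7)
p_lr = np.random.rand(7, 7)
print(lr_features(p_lr).shape, hr_features(p_hr, p_lr).shape)  # (196,) (49,)
```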

$$\begin{array}{@{}rcl@{}} D_{h}&=&\underset{D_{h},\alpha}{\mathrm{arg\ min}} \left\{\|X_{{hr}}-D_{h}\alpha \|^{2}_{2}+\lambda\|\alpha\|_{1}\right\}, \end{array} $$

(4)

$$\begin{array}{@{}rcl@{}} D_{l}&=&\underset{D_{l},\alpha}{\mathrm{arg\ min}} \left\{\|X_{{lr}}-D_{l}\alpha \|^{2}_{2}+\lambda\|\alpha\|_{1}\right\}. \end{array} $$

(5)

Because the sparse code *α* is shared between LR and HR patches, Eqs. (4) and (5) can be combined in Eq. (6):

$$ \left[D_{l}, D_{h}\right]=\underset{D_{l},D_{h},\alpha}{\mathrm{arg\ min}} \left\{ \|X_{{lr}}-D_{l}\alpha \|^{2}_{2}+ \|X_{{hr}}-D_{h}\alpha \|^{2}_{2}+\lambda\|\alpha\|_{1}\right\}. $$

(6)
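
Because the code *α* is shared, Eq. (6) reduces to standard sparse coding on the vertically stacked matrix [*X*_{lr}; *X*_{hr}]. A toy sketch of this joint training follows, alternating ISTA for the shared codes with a least-squares dictionary update; the authors' actual solver is not specified, so this is only one plausible instantiation.

```python
import numpy as np

rng = np.random.default_rng(0)

def sparse_code(X, D, lam, n_iter=50):
    """ISTA for min_a ||X - D a||_2^2 + lam ||a||_1."""
    L = max(np.linalg.norm(D, 2) ** 2, 1e-8)   # Lipschitz constant of the gradient
    A = np.zeros((D.shape[1], X.shape[1]))
    for _ in range(n_iter):
        G = A - D.T @ (D @ A - X) / L          # gradient step
        A = np.sign(G) * np.maximum(np.abs(G) - lam / (2 * L), 0)  # soft threshold
    return A

def joint_dictionary(X_lr, X_hr, n_atoms=16, lam=0.1, n_iter=10):
    """Train {D_l, D_h} with a shared sparse code by stacking the training
    matrices, which turns Eq. (6) into a single sparse-coding problem."""
    X = np.vstack([X_lr, X_hr])
    D = rng.standard_normal((X.shape[0], n_atoms))
    D /= np.linalg.norm(D, axis=0)
    for _ in range(n_iter):
        A = sparse_code(X, D, lam)
        D = X @ np.linalg.pinv(A)              # least-squares dictionary update
        D /= np.maximum(np.linalg.norm(D, axis=0), 1e-12)
    d = X_lr.shape[0]
    return D[:d], D[d:]                        # split back into D_l and D_h

X_lr = rng.standard_normal((20, 100))
X_hr = rng.standard_normal((30, 100))
D_l, D_h = joint_dictionary(X_lr, X_hr)
print(D_l.shape, D_h.shape)  # (20, 16) (30, 16)
```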

The global dictionary pair is trained to reveal the underlying relation between the LR and HR images based on the knowledge from the training dataset. If the LR training images are generated by down-sampling the HR training images with a fixed scaling factor *s*, the LR images can only provide structure knowledge at a fixed level. Therefore, one global dictionary pair trained from such a dataset may not be suitable in all situations. For example, we can generate the LR training images by down-sampling the HR training images with a scale factor of 4 and then estimate the global dictionary pair to reveal the LR-HR mapping relation. Although this dictionary pair may be effective if we scale an LR image to 4 times its original size, it may perform unsatisfactorily if we scale the LR image to other sizes. With this concern, we down-sample the HR training images by different scaling factors to generate multiple LR training sets \(\left \{S_{lr1}, S_{lr2}, \dots, S_{{lrm}}\right \}\). The LR images in the different training sets *S*_{lri} represent low-resolution structures at multiple scales. As shown in Fig. 2, we generate an LR-HR dictionary pair for each LR-HR training set pair {*S*_{lri},*S*_{hr}}. Given the global dictionary pairs at different structure levels, we can always choose the appropriate dictionary to enhance the LR image according to the situation.

In order to illustrate the necessity of multiple global dictionary pairs at different structure levels, we generate 30 LR training datasets \(\{S_{lr1}, S_{lr2}, \dots, S_{lr30}\}\) from one HR training dataset *S*_{hr}. Every LR set is generated by down-sampling the HR images at the ratio *s*_{i}. In this paper, \(s_{i} \in \{1.1, 1.2, 1.3, \dots, 4.0\}\). In Fig. 3, we reconstruct the HR image from the LR image by using different global dictionary pairs. The horizontal axis represents the down-sampling ratio *s*_{i} at which the LR training set is generated, and the vertical axis represents the PSNR value of the reconstructed HR image. In Fig. 3a, the LR image is scaled by 2.4 to generate the HR image, and the best reconstruction result is obtained when the LR training set is down-sampled by 2.5. Similarly, the LR image is scaled by 3.3 to generate the HR image in Fig. 3b, and the best reconstruction result is obtained when the LR training set is down-sampled by 3.5. It is clear that the choice of global dictionary pair affects the HR reconstruction result. In Fig. 4, the visual quality of the reconstructed HR images is presented along with the ground truth HR. It is visually noticeable that the quality of the reconstructed HR image is better when an appropriate global dictionary pair is used.

### Local dictionary training using K-PCA

The global dictionary requires great diversity so that it can be used to recover general images. Despite the comprehensive information it provides, the global dictionary has proved to be unstable for sparse representation precisely because of this high diversity [15]. In order to represent the image with a robust and compact dictionary, we use K-PCA and non-locally centralized sparse representation (NCSR) [17] to generate a local dictionary that is consistent with the input image.

The input LR image is scaled to a set of images \(S_{I}=\left \{I_{s_{k}}|k=1,2,3,\dots,N\right \}\) of different sizes by bicubic interpolation. If the input \(LR \in \mathbb {R}^{m\times n}\) and the desired output \(HR \in \mathbb {R}^{sm\times sn}\), the height and width of the scaled image \(I_{s_{k}}\) are \(0.8^{s_{k}}sm\) and \(0.8^{s_{k}}sn\), respectively. By extracting 7×7 image patches from *S*_{I}, we generate the patch set ***P***, which is further clustered into *K* groups \(\boldsymbol {P}=\{{\boldsymbol {P}_{\boldsymbol {i}}}|i=1,2,\dots,K\}\) using *K*-means clustering. We assume the patches within one group are similar, so these patches can be robustly represented by a compact dictionary *D*_{i}. Principal component analysis (PCA) is applied to each group, and the PCA basis is regarded as *D*_{i} for group *P*_{i}. After combining all \(D_{i} (i=1,2,\dots,K)\), a complete local dictionary \(D^{0}=\left [D_{1}, D_{2}, \dots, D_{K}\right ]\) is generated from the input LR image itself.
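
The K-PCA construction can be sketched as follows, assuming patches have already been extracted as row vectors; SciPy's `kmeans2` and a per-cluster SVD stand in for the clustering and PCA steps.

```python
import numpy as np
from scipy.cluster.vq import kmeans2

def local_dictionary(patches, K=4):
    """K-PCA sketch: K-means clustering of the patch set, then a PCA basis
    per cluster; the concatenated bases form the local dictionary D^0."""
    np.random.seed(0)                          # reproducible cluster seeding
    _, labels = kmeans2(patches, K, minit="++")
    sub_dicts = []
    for k in range(K):
        group = patches[labels == k]
        if len(group) == 0:                    # guard against empty clusters
            continue
        centered = group - group.mean(axis=0)
        # right-singular vectors of the centered group = PCA basis
        _, _, Vt = np.linalg.svd(centered, full_matrices=False)
        sub_dicts.append(Vt.T)                 # D_k: columns are PCA basis vectors
    return np.hstack(sub_dicts)                # D^0 = [D_1, D_2, ..., D_K]

patches = np.random.default_rng(1).standard_normal((200, 49))  # 7x7 patches
D0 = local_dictionary(patches, K=4)
print(D0.shape)
```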

### Image super-resolution based on local and global training

In this section, we introduce the high resolution image reconstruction based on the global and local dictionaries. As shown in Eq. (3), a standard solution for sparse image representation can be formulated as the minimization of an energy function with a fidelity term and a regularization term. The fidelity term ensures that the observed low-resolution image is a blurred and down-sampled version of the high-resolution image constructed by sparse representation. In this case, a reliable sparse representation is critical for high-resolution image reconstruction. In this paper, we adopt the global and local dictionaries at the same time to ensure that the sparse representation provides rich texture details and remains consistent with the observed low resolution image. With these concerns, we reformulate Eq. (3) as Eq. (7).

$$ \boldsymbol{\alpha}_{\boldsymbol{y}}=\underset{\alpha_{l}}{\mathrm{arg\ min}} \left\{\|\boldsymbol{y}-\boldsymbol{H}\boldsymbol{D}^{\boldsymbol{0}}\boldsymbol{\alpha}_{\boldsymbol{l}} \|^{2}_{2} +\lambda\| \boldsymbol{\alpha}_{\boldsymbol{l}} - \boldsymbol{\beta}_{\boldsymbol{l}}\|_{1}\right\}, $$

(7)

s.t.

$$ \| U_{{IP}}(\boldsymbol{y})-\boldsymbol{D}^{\boldsymbol{L}}\boldsymbol{\alpha}_{\boldsymbol{g}} \|_{2}^{2}<\epsilon, $$

(8)

where *ε* is a small factor, *y* is the observed low-resolution image, *H* is a matrix for blurring and down-sampling, *D*^{0} is the local dictionary, *α*_{l} is the sparse code of the high-resolution image according to the local dictionary, *D*^{L} is the global LR dictionary, *α*_{g} is the sparse code of the image according to the global dictionary, *U*(*·*) is the upscaling operator, *U*_{IP}(*y*) is the initial prediction of the upscaled *y* in the gradient descent-based optimization of Eq. (7), and *β*_{l} is the non-local mean of *α*_{l}, which is formulated as follows:

$$ \boldsymbol{\beta}_{\boldsymbol{i}}=\sum\limits_{n \in N_{i}} \omega_{i,n}\boldsymbol{\alpha}_{\boldsymbol{n}}, $$

(9)

where *α*_{n} denotes the sparse code of an image patch *x*_{n}, *N*_{i} denotes the *N* most similar patches to patch *x*_{i}, *β*_{i} is the sparse code of patch *x*_{i} after non-local smoothing, and *ω*_{i,n} is the weight factor defined as follows:

$$\begin{array}{*{20}l} \omega_{i,n}=\frac{\text{exp}\left(-\|x_{i}-x_{n}\|^{2}_{2}\right)}{\sum\limits_{n \in N_{i}}\text{exp}\left(-\|x_{i}-x_{n}\|^{2}_{2}\right) }. \end{array} $$

(10)
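
Eqs. (9) and (10) can be sketched directly. Two details below are added numerical choices rather than statements from the paper: the patch itself is excluded from *N*_{i}, and the exponents are shifted by the minimum distance so the softmax-style weights do not underflow.

```python
import numpy as np

def nonlocal_beta(codes, patches, i, n_neighbors=5):
    """beta_i = sum_{n in N_i} w_{i,n} alpha_n over the n_neighbors patches
    most similar to patch i, with exponential weights as in Eq. (10)."""
    d2 = np.sum((patches - patches[i]) ** 2, axis=1)  # squared distances
    d2[i] = np.inf                                    # exclude the patch itself
    N_i = np.argsort(d2)[:n_neighbors]                # most similar patches
    w = np.exp(-(d2[N_i] - d2[N_i].min()))            # shifted for stability
    w /= w.sum()                                      # normalize the weights
    return w @ codes[N_i]                             # weighted mean of codes

rng = np.random.default_rng(2)
patches = rng.standard_normal((50, 49))   # 7x7 patches as rows
codes = rng.standard_normal((50, 16))     # sparse codes alpha_n as rows
beta = nonlocal_beta(codes, patches, i=0)
print(beta.shape)  # (16,)
```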

Given an LR image, its HR version is estimated by iteratively solving Eq. (7). In the *i*th iteration, the global sparse code *α*_{gi} of the current LR image is estimated by Eq. (11):

$$\begin{array}{*{20}l} \alpha_{{gi}}=\underset{\alpha}{\mathrm{arg\ min}} \left\{ \|X^{L}_{i}-D^{L}_{i}\alpha \|^{2}_{2}+\lambda\|\alpha\|_{1}\right\}, \end{array} $$

(11)

where \(X^{L}_{i}\) is the initial LR image in the *i*th iteration and also the final HR image estimate from the (*i*−1)th iteration (note that *U*_{IP}(*y*) in Eq. (8) is the general representation of \(X^{L}_{i}\)); \(D^{L}_{i}\) is the LR global dictionary used in the *i*th iteration; *α* is the sparse code of \(X^{L}_{i}\); and *α*_{gi} is the optimal *α* that minimizes Eq. (11).

With the global sparse code *α*_{gi} from Eq. (11), the HR estimation from global dictionary in the *i*th iteration is given in Eq. (12):

$$\begin{array}{*{20}l} X^{L}_{i+1/3}=D^{H}_{i}\alpha_{{gi}}, \end{array} $$

(12)

where \(D^{H}_{i}\) is the HR global dictionary used in the *i*th iteration, and \(X^{L}_{i+1/3}\) is an intermediate HR estimate, which also serves as the initial guess for the following local dictionary-based HR estimation.

Given \(X^{L}_{i+1/3}\), the local dictionary *D*^{0} and the corresponding local sparse code *α*_{l} in Eq. (7) can be generated by K-PCA as mentioned in Section 3.3. According to Eqs. (9) and (10), the sparse code *β*_{i} of each patch’s non-local mean can be estimated. With *β*_{i}, the regularization term \(X^{\beta _{i}}_{i}\) can be calculated as follows:

$$\begin{array}{*{20}l} X^{\beta_{i}}_{i}=D^{0}_{i}\beta_{i}. \end{array} $$

(13)

After we have *H*, *D*^{0}, the initial sparse code *α*_{l}, and the sparse code *β*_{l} of the non-local mean, the optimal *α*_{y} in Eq. (7) can be iteratively approached. First, we fix the second regularization term, and the optimization problem becomes a least-squares problem that can be solved efficiently; the gradient descent-based update is given in matrix form by Eq. (14):

$$\begin{array}{*{20}l} X^{L}_{i+2/3}=\boldsymbol{D}^{\boldsymbol{0}}\boldsymbol{\alpha}_{\boldsymbol{l}}+\theta H^{T}\left(y-H\boldsymbol{D}^{\boldsymbol{0}}\boldsymbol{\alpha}_{\boldsymbol{l}}\right), \end{array} $$

(14)

where *θ* is the learning step size, which is set to 2.4 in this paper. Afterwards, we fix the first fidelity term in Eq. (7), and the function can be solved by an iterative shrinkage algorithm:

$$\begin{array}{*{20}l} X^{L}_{i+1}=S_{\tau}\left(X^{L}_{i+2/3}-X^{\beta_{i}}_{i}\right)+X^{\beta_{i}}_{i} \end{array} $$

(15)

where *S*_{τ} is a soft-thresholding operator and \(X^{\beta _{i}}_{i}\) is calculated by Eq. (13). According to the work presented in [30], the aforementioned algorithm empirically converges. We also present the PSNR convergence of our proposed method in Fig. 5. Three cases are compared: (1) the global SR *D*^{H}*α*_{g} is not used, and the SR image is generated based on the local dictionary only; (2) the global SR is used only once, as the initial estimate for the local SR; and (3) the global SR is used twice to update the estimate during the gradient descent optimization of the local SR. It can be observed that the use of global SR significantly improves the SR quality. It is worth noticing that the global SR at the 201st iteration first reduces the PSNR, but the final PSNR converges to a higher value compared with the cases without global SR. It is very likely that the global SR updates the estimate during the gradient descent (GD) process of the local SR and prevents the GD process from being trapped in a local minimum. According to our experiments, the proposed SR method converges within 300 iterations of Eq. (14).
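
One pass of the alternation in Eqs. (14) and (15) can be sketched on toy dense matrices as follows; the threshold `tau` is an assumed value, since the paper does not specify it.

```python
import numpy as np

def soft_threshold(x, tau):
    """Soft-thresholding operator S_tau."""
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

def local_sr_iteration(y, H, D0, alpha_l, beta_l, theta=2.4, tau=0.05):
    """One pass of Eqs. (14)-(15): a gradient step on the fidelity term,
    then shrinkage toward the non-local estimate X^beta = D0 beta_l."""
    x = D0 @ alpha_l
    x_mid = x + theta * H.T @ (y - H @ x)       # Eq. (14): gradient step
    x_beta = D0 @ beta_l                        # Eq. (13): non-local estimate
    return soft_threshold(x_mid - x_beta, tau) + x_beta   # Eq. (15)

rng = np.random.default_rng(3)
n, m, k = 64, 16, 32
H = rng.standard_normal((m, n)) / np.sqrt(n)    # toy blur + down-sampling operator
D0 = rng.standard_normal((n, k)) / np.sqrt(k)   # toy local dictionary
alpha_l = rng.standard_normal(k)
beta_l = alpha_l + 0.1 * rng.standard_normal(k) # perturbed non-local code
y = H @ (D0 @ alpha_l)                          # synthetic LR observation
x_new = local_sr_iteration(y, H, D0, alpha_l, beta_l)
print(x_new.shape)  # (64,)
```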

The last problem is to select the proper global dictionary in each iteration. Although we have generated global dictionary pairs over multi-scale structures, it is still difficult to select the most appropriate one in each iteration. Based on our extensive experiments, we use the iteration number to stand for the degradation level and introduce the non-linear function in Eq. (16) to guide the selection of global dictionary pairs:

$$\begin{array}{*{20}l} f(i)=\frac{a}{i}+b \end{array} $$

(16)

where *a* and *b* are numeric parameters, *i* is the iteration number, and *f*(*i*) is the index number for global dictionary selection. In this paper, we set *a* and *b* to *s*−1 and 1 for all test images, where *s* is the scaling factor of the HR image.
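
With *a* = *s* − 1 and *b* = 1, Eq. (16) can be sketched as below; snapping *f*(*i*) to the trained grid {1.1, 1.2, …, 4.0} is our assumption about how the continuous index is discretized, which the paper leaves implicit.

```python
def dictionary_index(i, s, step=0.1, i_min=1.1, i_max=4.0):
    """Select the global dictionary pair for iteration i via f(i) = a/i + b
    with a = s - 1 and b = 1, snapped to the trained grid {1.1, ..., 4.0}."""
    f = (s - 1.0) / i + 1.0
    f = min(max(f, i_min), i_max)            # clamp to the trained range
    return round(round(f / step) * step, 1)  # snap to the 0.1 grid

# Early iterations use large-ratio dictionaries; later ones approach 1.1.
print([dictionary_index(i, s=3.0) for i in (1, 2, 5, 20)])  # [3.0, 2.0, 1.4, 1.1]
```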