Compared with previous deep-learning-based single-image SR methods, our proposed method is also an end-to-end mapping that takes the low-resolution image as input and directly outputs the high-resolution one. The differences lie mainly in two aspects: we use a sparse prior constraint convolution layer to take the image sparse prior into account, and we use an anchored neighborhood convolution layer to prevent neurons from compromising among different image contents. Therefore, we first introduce the sparse prior constraint convolution layer and the anchored neighborhood convolution layer, which address the two problems we focus on. Finally, we present our new network structure for single-image SR.
Sparse prior constraint layer
As shown in Eqs. (4) and (5), the L2-norm sparse constraint objective function has a closed-form solution \(x_{i} = P_{i}y_{i}\), where the projection matrix \(P_{i}\) is precomputed offline from a set of low- and high-resolution image patch pairs. If each row of the projection matrix \(P_{i}\) is considered as a filter, we can use a convolution layer to mimic this mapping process and predict the image detail. Here, we assume that \(y_{i}\) is a vector of size n×1, \(x_{i}\) is a vector of size m×1, and \(P_{i}\) is a matrix of size m×n. Then, each convolution is of size 1×1×n, i.e., the spatial size of each filter is 1×1 and it spans n feature maps. Since the projection matrix \(P_{i}\) has m rows, there are m convolutions of size 1×1×n. It should be noted that the filters have no bias terms, so that they fully mimic the matrix multiplication process.
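To make this concrete, the following minimal sketch (PyTorch; the sizes n and m and the random \(P_{i}\) are placeholders, not the paper's values) checks that a bias-free 1×1 convolution with n input and m output feature maps reproduces the matrix product \(P_{i}y_{i}\) exactly:

```python
import torch
import torch.nn as nn

# Hypothetical sizes: y_i is n x 1, x_i is m x 1 (not the paper's values).
n, m = 9, 36
P_i = torch.randn(m, n)  # stands in for the offline-trained projection matrix

# m filters of size 1x1xn, with no bias, exactly as described above.
proj = nn.Conv2d(in_channels=n, out_channels=m, kernel_size=1, bias=False)
with torch.no_grad():
    # Conv2d weights have shape (out_channels, in_channels, 1, 1).
    proj.weight.copy_(P_i.view(m, n, 1, 1))

y = torch.randn(1, n, 1, 1)      # one feature vector laid out as n feature maps
x_conv = proj(y).view(m)         # 1x1 convolution
x_mat = P_i @ y.view(n)          # plain matrix multiplication
assert torch.allclose(x_conv, x_mat, atol=1e-5)
```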
As shown in Eq. (5), \({x_{i}} = {D_{h}}{\left ({D_{l}^{T}{D_{l}} + \lambda I} \right)^{- 1}}D_{l}^{T}{y_{i}}\), where \(D_{l}\) and \(D_{h}\) are the well-trained low- and high-resolution dictionaries. Since \(x_{i}\) is the closed-form solution under the image sparse prior constraint, transferring the matrix weights to a convolution layer gives our network an inherent ability to take the image sparse prior into account, and the output \(x_{i}\) is therefore a more accurate high-frequency prediction.
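For illustration, the projection matrix of Eq. (5) can be computed as below (a NumPy sketch; the function name and the convention that dictionary atoms are stored as columns are our assumptions):

```python
import numpy as np

def projection_matrix(D_l, D_h, lam):
    """Eq. (5): P = D_h (D_l^T D_l + lam * I)^{-1} D_l^T.

    D_l, D_h: trained LR/HR dictionaries with atoms as columns; lam: lambda.
    """
    n_atoms = D_l.shape[1]
    gram = D_l.T @ D_l + lam * np.eye(n_atoms)
    # solve(gram, D_l.T) computes gram^{-1} D_l^T without an explicit inverse.
    return D_h @ np.linalg.solve(gram, D_l.T)
```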
Anchored neighborhood layer
ANR and A+ first find the neighborhoods and then calculate a separate projection matrix \(P_{i}\) for each dictionary atom \(D_{i}\) in the offline training process. As a result, given an input patch feature \(y_{i}\), they only need to anchor it to its nearest atom \(D_{i}\) and map it to the HR space using the stored projection matrix \(P_{i}\). In this paper, we use a network to mimic this process, which gives our method an inherent advantage in performance.
The anchored neighborhood convolution layer is outlined in Fig. 2. For each dictionary atom \(D_{i}\), we calculate its projection matrix \(P_{i}\) using the same method as A+, which takes the image sparse prior into account. After training all the projection matrices, we transfer them to different convolution layers using the method described above. That is, each sub-convolution layer associated with an atom in the anchored neighborhood layer is a sparse prior constraint convolution layer. It should be noted that all these sub-convolution layers can be implemented in parallel. For each input low-frequency feature vector, the anchored neighborhood layer anchors it to one dictionary atom, which activates the corresponding sub-convolution layer. The activated convolution layer then maps the low-frequency feature vector to the high-resolution space, executing the traditional matrix multiplication process.
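A minimal sketch of this anchoring-and-mapping step is given below (all names are illustrative; the atoms are assumed l2-normalized so that correlation identifies the nearest atom, as in ANR/A+):

```python
import torch

def anchored_forward(y, atoms, projections):
    """y: (n,) low-frequency feature vector.
    atoms: (K, n) l2-normalized dictionary atoms (K = 1024 in our experiments).
    projections: (K, m, n) per-atom projection matrices, i.e., the weights of
    the K parallel sparse prior constraint sub-convolution layers.
    """
    k = torch.argmax(atoms @ y)     # anchor to the nearest atom by correlation
    return projections[k] @ y       # activate only that sub-convolution layer
```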
Since we transfer the weights of the projection matrix \(P_{i}\) to the sub-convolution layer, the anchored neighborhood convolution layer fully takes the sparse image prior into account. Both ANR and A+ demonstrate that the projection matrix \(P_{i}\) can be used to accurately predict the high-frequency details. Therefore, our anchored neighborhood convolution layer can reliably predict the high-frequency image content for the later layers to further refine. More importantly, through the anchoring process, the image patches are divided into multiple categories, so each neuron works on similar feature vectors instead of the whole image, which avoids compromising among different image contents.
Proposed network structure
The proposed network structure is outlined in Fig. 3. It can be divided into four parts, i.e., the feature extraction layer, the anchored neighborhood convolution layer, the combination layer, and the deep integration subnetwork. We use different colors to mark the corresponding parts in Fig. 3.
Feature extraction. ANR and A+ show that the features used to represent the image patches have a strong influence on performance. The most basic feature is the patch itself; however, this does not give the feature good generalization properties. A commonly used alternative is the first- and second-order derivatives of the patch [3, 35]. In this paper, we use a convolution layer with n1 filters of size 3s×3s×1, where s is the magnification factor, to extract the image feature. As a result, the output feature is an n1×1 vector. At the same time, we use a "one-hot" convolution, in which each filter extracts exactly one pixel of the receptive field, to extract the LR patches for the later image reconstruction. The filter size of the one-hot convolution is also 3s×3s×1.
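The sketch below illustrates both convolutions for a single-channel image (PyTorch; the filter count n1 = 128, the (3s)^2 one-hot filters, and the non-overlapping stride are our assumptions, not the paper's settings):

```python
import torch
import torch.nn as nn

s, n1 = 3, 128               # magnification factor; n1 = 128 is illustrative
k = 3 * s

# Feature extraction: n1 filters of size 3s x 3s x 1, one n1-dim vector per patch.
feat = nn.Conv2d(1, n1, kernel_size=k, stride=k)

# "One-hot" convolution: (3s)^2 fixed filters, each containing a single 1, so
# every output channel copies one pixel of the receptive field (the LR patch).
onehot = nn.Conv2d(1, k * k, kernel_size=k, stride=k, bias=False)
with torch.no_grad():
    w = torch.zeros(k * k, 1, k, k)
    for i in range(k * k):
        w[i, 0, i // k, i % k] = 1.0
    onehot.weight.copy_(w)
onehot.weight.requires_grad_(False)   # these filters stay fixed during training
```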
Anchored neighborhood convolution. This layer has been introduced in detail in Section 4.2. It is used to take the image prior into account so as to quickly and accurately predict the image details, and to make the neurons work on local image patches to avoid compromising among different image contents. Note that the dictionary used in our experiments has 1024 atoms; therefore, there are 1024 parallel sparse prior constraint layers in this anchored neighborhood layer.
Combination. The anchored neighborhood convolution layer outputs the initial high-frequency details for each low-resolution patch. We first add these estimated high-frequency details to the corresponding LR patch, which is extracted by the one-hot convolution, to obtain the initial high-resolution feature vector. We then reshape these feature vectors into image patches and concatenate them to output the initial high-resolution estimation. In other words, the combination layer consists of a reshape step and a concatenation step, as sketched below.
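A minimal sketch of the combination step, assuming non-overlapping patches laid out in row-major order over the image (the function signature and tiling scheme are our assumptions):

```python
import torch

def combine(hf, lr_patches, H, W, s):
    """hf, lr_patches: (N, (3s)^2) high-frequency details and one-hot LR patches.
    Returns the initial H x W high-resolution estimation.
    """
    k = 3 * s
    patches = (hf + lr_patches).view(-1, k, k)   # add details, reshape to patches
    rows, cols = H // k, W // k
    # Concatenate the patches back into the full image grid.
    return patches.view(rows, cols, k, k).permute(0, 2, 1, 3).reshape(H, W)
```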
Deep integration. It has been demonstrated in the literature that the deeper the network, the better the performance. To further fuse the local image similarity details, we design a deep integration subnetwork that cascades m convolution layers, where all layers except the first and the last are of the same type: d filters of size f×f×d, where each filter operates on an f×f spatial region across d channels (feature maps). The first layer operates on the output of the combination layer, so it has d filters of size f×f×1. The last layer, which outputs the final image estimation, consists of a single filter of size f×f×d. This can be formulated as
$$ {F_{i}}\left(y \right) = \max \left({0,{w_{i}} * y + {b_{i}}} \right),\quad i \in \left\{ {1, \ldots, m - 1} \right\} $$
(6)
$$ {F_{m}}\left(y \right) = {w_{m}} * {F_{m - 1}}\left(y \right) + {b_{m}} $$
(7)
where max(·) represents the rectified linear unit (ReLU) operator, and \(w_{i}\) and \(b_{i}\) represent the filters and biases of the ith layer, respectively.
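The subnetwork of Eqs. (6) and (7) can be sketched as follows (PyTorch; "same" padding is our assumption to keep the spatial resolution fixed):

```python
import torch.nn as nn

def deep_integration(m, d, f):
    """m cascaded convolutions implementing Eqs. (6) and (7):
    ReLU after every layer except the last."""
    layers = [nn.Conv2d(1, d, kernel_size=f, padding=f // 2), nn.ReLU(inplace=True)]
    for _ in range(m - 2):
        layers += [nn.Conv2d(d, d, kernel_size=f, padding=f // 2), nn.ReLU(inplace=True)]
    layers.append(nn.Conv2d(d, 1, kernel_size=f, padding=f // 2))  # final estimate, no ReLU
    return nn.Sequential(*layers)
```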
Training
We now describe the objective we minimize to find the optimal parameters of our model. Following most deep-learning-based image restoration methods, the mean square error is adopted as the cost function of our network. Our goal is to train an end-to-end mapping f that predicts values \(\hat y = f\left (x \right)\), where x is an input low-resolution image and \(\hat y\) is the estimate of the corresponding high-resolution image. Given a set of high-resolution image examples \(y_{i}, i = 1 \ldots N\), we generate the corresponding low-resolution images \(x_{i}, i = 1 \ldots N\) (in fact, we upscale them to the original size by bicubic interpolation). The optimization objective is then
$$ \mathop {\min }\limits_{\theta} \frac{1}{{2N}}{\sum\nolimits}_{i = 1}^{N} {\left\| {f\left({{x_{i}};\theta} \right) - {y_{i}}} \right\|}_{F}^{2} $$
(8)
where θ denotes the network parameters to be trained and \(f(x_{i};\theta)\) is the estimated high-resolution image with respect to the low-resolution image \(x_{i}\). We use adaptive moment estimation (Adam) [18] to optimize all the network parameters.
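A minimal training sketch for Eq. (8) follows (the stand-in model, synthetic data, batch size, and learning rate are placeholders, not the paper's settings):

```python
import torch
import torch.nn as nn

# Stand-in network; the real model is the full structure of Fig. 3.
model = nn.Sequential(nn.Conv2d(1, 64, 3, padding=1), nn.ReLU(),
                      nn.Conv2d(64, 1, 3, padding=1))
criterion = nn.MSELoss()                       # Eq. (8), up to the 1/(2N) scaling
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

x = torch.randn(8, 1, 48, 48)   # bicubic-upscaled LR inputs (synthetic)
y = torch.randn(8, 1, 48, 48)   # corresponding HR targets (synthetic)
for step in range(100):
    optimizer.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()
    optimizer.step()
```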