Anchored neighborhood deep network for single-image super-resolution
EURASIP Journal on Image and Video Processing volume 2018, Article number: 34 (2018)
Abstract
Real-time image and video processing is a challenging problem in smart surveillance applications. Many applications must trade off frame rate against resolution to stay within limited bandwidth. Thus, image super-resolution has become a commonly used technique on surveillance platforms. Existing image super-resolution methods have demonstrated that making full use of image priors can improve algorithm performance. However, previous deep-learning-based image super-resolution methods rarely take image priors into account. Therefore, how to make full use of image priors is one of the unsolved problems for deep-network-based single-image super-resolution methods. In this paper, we establish the relationship between the traditional sparse-representation-based single-image super-resolution methods and the deep-learning-based ones and use transfer learning so that our proposed deep network takes the image prior into account. Another unresolved problem of deep-learning-based single-image super-resolution is how to keep neurons from having to compromise across different image contents. In this paper, the image patches are anchored to dictionary atoms and thereby grouped into categories. As a result, each neuron works on image patches of the same type, with similar details, which makes the network more accurate at recovering high-frequency details. By solving these two problems, we propose an anchored neighborhood deep network for single-image super-resolution. Experimental results show that our proposed method outperforms many state-of-the-art single-image super-resolution methods.
1 Introduction
In recent years, big data [4, 7], cloud computing, and artificial intelligence (AI) have been among the most popular research topics. Deep learning has moved from research labs to applications, especially in computer vision, natural language processing, speech recognition, and many other fields [8, 21]. These applications require systems or platforms to interact with the real world through sensors [5, 6], such as cameras in computer vision. However, the bandwidth of these devices is very limited. For example, the bandwidth of the USB 2.0 interface is about 480 Mbps. If the resolution is 1920×1080 and the frame rate is 100 Hz, the required bandwidth is about 5 Gbps, which corresponds to uncompressed video at 24 bits per pixel. Moreover, the frame rate must exceed 100 Hz in some high-speed applications. Hence, frame rate up-conversion and super-resolution are necessary in many real-time applications. Figure 1 gives an example of super-resolution in a surveillance application. In that case, the server side has enough computing power to process images acquired from sensors with a limited-bandwidth interface. In addition, such a high payload raises the scheduling problem in different communication environments [26–28].
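The bandwidth figure above can be checked with a few lines of arithmetic. This is a minimal sketch that assumes uncompressed video at 24 bits per pixel (a bit depth the text does not state but which reproduces the quoted 5 Gbps); the function name is illustrative.

```python
def raw_video_bandwidth_bps(width, height, fps, bits_per_pixel=24):
    """Bandwidth of uncompressed video in bits per second.

    bits_per_pixel=24 is an assumption (8-bit RGB), not a value from the text.
    """
    return width * height * fps * bits_per_pixel


USB2_BPS = 480e6  # nominal USB 2.0 rate quoted in the text

full_hd = raw_video_bandwidth_bps(1920, 1080, 100)
print(f"1920x1080 @ 100 Hz: {full_hd / 1e9:.2f} Gbps")        # 4.98 Gbps
print(f"ratio to USB 2.0:   {full_hd / USB2_BPS:.1f}x")       # 10.4x
```

Under this assumption the raw stream exceeds the USB 2.0 rate by roughly an order of magnitude, which is the gap that lower-resolution capture followed by super-resolution is meant to close.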
Image super-resolution (SR) technology takes low-resolution (LR) images as input and maps them to the corresponding high-resolution (HR) space. It has been studied for a long time but has become more prevalent with the new generation of ultra-high-definition (UHD) TVs (3840×2160). Most video content is not available in UHD resolution. Therefore, SR algorithms are needed to generate UHD content from full high definition (FHD) (1920×1080) or lower resolutions [16]. Depending on the number of input LR images, image SR can generally be divided into single-image SR and multiple-image SR. In this paper, we focus on single-image SR, which aims at recovering a high-resolution image from a single low-resolution image. For convenience, we roughly subdivide single-image SR methods into two subclasses: non-deep-learning-based methods and deep-learning-based ones. Most non-deep-learning-based single-image SR methods either try to find new kinds of image priors or propose a new way to use existing image priors, while the deep-learning-based methods typically learn a simple end-to-end mapping between the LR image and the HR one.
Traditional non-deep-learning-based SR methods have demonstrated that image priors, e.g., local smoothness, non-local self-similarity, and sparsity, play an important role in image SR. Neighbor embedding (NE) approaches assume that small image patches from a low-resolution image and its high-resolution counterpart form low-dimensional nonlinear manifolds with similar local geometry. Chang et al. [3] proposed an SR method based on this principle using the manifold learning method of locally linear embedding (LLE) [24]. In addition to the local linear prior, image sparsity is the most commonly used prior in the single-image SR literature. Yang et al. [35] proposed the first sparse-representation-based single-image SR method, which assumes that low-frequency image patches have the same sparse representation as the corresponding high-frequency image patches. On this basis, Zeyde et al. [37] proposed a more efficient dictionary learning method for both low- and high-resolution patches, which leads to significant training time savings. Other kinds of image priors, e.g., local smoothness and non-local self-similarity, are also well studied as regularization terms in reconstruction-constraint-based single-image SR methods. Apart from investigating new image priors, some traditional methods try to find a more compact representation of well-known image priors or a more efficient way to use them to improve image SR performance. In [30], Timofte et al. propose anchored neighborhood regression (ANR) for single-image SR, which anchors the neighborhood embedding of a low-resolution patch to the nearest atom in the dictionary and precomputes the corresponding embedding matrix. Later, they proposed an improved variant of ANR, which combines the best qualities of anchored neighborhood regression and simple functions (SF) [31]. To make better use of the image sparse prior, Zhang et al. [38] propose a dual dictionary to learn the residual iteratively.
Recently, deep learning has received much attention and has been successfully applied to many low- and high-level computer vision problems. Some deep-learning-based image SR methods have also been explored. The pioneering work in deep-learning-based SR is SRCNN, proposed by Dong et al. [10, 11]. They demonstrated that a convolutional neural network (CNN) can learn a mapping from a low-resolution image to a high-resolution one in an end-to-end manner. It does not require any of the engineered features that are typically necessary in traditional non-deep-learning-based methods. Soon after, they extended this work to JPEG-compressed image restoration [9]. More recently, they proposed an improved version of SRCNN that uses 1 × 1 convolutions to reduce the number of network weights, resulting in a fast SRCNN (FSRCNN) [12]. Unlike [9–12], which use the undegraded image as ground truth for training, some works learn the image residual instead. Kim et al. [17] proposed a very deep network that learns the residual to accelerate convergence. The above-mentioned methods (except FSRCNN) need to upscale the low-resolution input image to the high-resolution space using a single filter, commonly bicubic interpolation, before reconstruction. To avoid this added computational complexity, Shi et al. [29] proposed to operate in the low-resolution image space and introduced an efficient sub-pixel convolution layer, which learns an array of upscaling filters to upscale the final LR feature maps into the HR output. Many other deep-learning-based single-image SR methods [15, 20, 34] and video SR methods [2, 14] have also been proposed.
Both the traditional non-deep-learning-based methods and the deep-learning-based ones have their advantages. Traditional methods take the image prior into account and rest on a strict mathematical derivation. In comparison, deep-learning-based methods are end-to-end mappings, which avoids a complex optimization process and results in much faster running speed. How to combine the advantages of both is a very interesting problem. In this paper, one of the problems we focus on is how to take the image prior into account in deep-learning-based single-image SR. On the other hand, in all the previous deep-learning-based methods, each neuron works on the whole input feature map. It has to compromise across different image contents, even though it has a very small receptive field. For example, in a smooth region of the image, we expect the neuron to act as a low-pass filter, while a high-pass filter is better suited to a complex texture region. However, natural images are usually content-rich, containing both smooth regions and complex texture regions. A neuron that works on the whole image has to compromise across these totally different image contents, which affects its contribution to the final output. In this paper, how to spare the neuron this compromise is the other problem we focus on.
In this paper, we propose an anchored neighborhood deep network for single-image SR. The well-known anchored neighborhood regression method shows that the high-frequency image details can be computed by multiplying the input low-frequency patches by a precomputed matrix, which is trained on a large number of low- and high-resolution patch pairs under a sparse prior constraint. We design a convolution layer to mimic this matrix multiplication and transfer the weights of each row of the trained matrix to one convolution filter. Unlike previous transfer learning, which transfers weights from one network to another, we transfer the weights to the network from a matrix that lies outside any network and is trained under a strict image prior constraint. Since the weights of the matrix are trained under a strict image prior constraint, the convolution layer whose parameters are transferred from the trained matrix has taken the image prior into account. Inspired by the anchored neighborhood regression single-image SR method, we first anchor the feature vectors to the nearest dictionary atom. Then, for different kinds of feature vectors, we use different convolution layers to predict their corresponding high-frequency details. Figure 2 gives an intuitive description of this process. As a result, each neuron works on image patches of the same kind and does not have to compromise across different image contents. By solving these two problems, we design an anchored neighborhood deep network for single-image SR. Experimental results show that our proposed method has comparable performance with many state-of-the-art single-image SR methods.
In the next section, we give some background on sparse-representation-based and deep-learning-based SR methods and review the anchored neighborhood regression methods. Section 3 introduces the motivation and summarizes our contributions. In Section 4, we describe our anchored neighborhood deep network in detail. Section 5 describes our experiments, where we compare the performance of our approach to other state-of-the-art methods. Section 6 concludes our work. Finally, we discuss some valuable future work in Section 7.
2 Related work
Since our proposed method is inspired by anchored neighborhood regression and combines the advantages of sparse representation approaches and deep-learning-based ones, we briefly review them.
2.1 Sparse representation approaches
Sparse representation tries to use as few nonzero coefficients as possible to represent the main information of a signal. For a patch \(x_{i}\), the process of finding its sparse representation vector \(\alpha_{i}\) with respect to a known overcomplete dictionary D is called sparse coding. Owing to the overcompleteness, the null space of D introduces additional degrees of freedom in the choice of \(\alpha_{i}\), which can be exploited to improve its compressibility. To obtain the sparse representation, sparse coding can be formulated as
\[\min_{\alpha_{i}} \left\|\alpha_{i}\right\|_{0} \quad \text{s.t.} \quad x_{i} = D\alpha_{i} \tag{1}\]
Though this problem is NP-hard in general, it can be approximated by a wide range of techniques [22]. In this paper, we adopt the orthogonal matching pursuit (OMP) [32] algorithm to solve this problem for its simplicity and efficiency.
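To make the greedy nature of OMP concrete, the following is a minimal NumPy sketch of the textbook algorithm, not the authors' implementation. It assumes the dictionary columns are unit-norm and that the caller supplies a sparsity budget `n_nonzero`; both the function name and that parameter are illustrative.

```python
import numpy as np


def omp(D, x, n_nonzero):
    """Orthogonal matching pursuit: approximate x with at most n_nonzero atoms of D.

    D is (dim, n_atoms) with unit-norm columns (assumed); x is (dim,).
    """
    x = np.asarray(x, dtype=float)
    residual = x.copy()
    support = []                      # indices of the atoms selected so far
    coeffs = np.zeros(0)
    for _ in range(n_nonzero):
        # pick the atom most correlated with the current residual
        k = int(np.argmax(np.abs(D.T @ residual)))
        if k in support:
            break
        support.append(k)
        # refit all selected atoms jointly by least squares
        coeffs, *_ = np.linalg.lstsq(D[:, support], x, rcond=None)
        residual = x - D[:, support] @ coeffs
        if np.linalg.norm(residual) < 1e-10:
            break
    alpha = np.zeros(D.shape[1])
    alpha[support] = coeffs
    return alpha
```

Each iteration adds one atom and then re-solves a small least-squares problem over the selected atoms, so the residual stays orthogonal to everything already chosen.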
The other main problem of sparse representation is dictionary learning. In its standard form, it can be formulated as:
\[\min_{D, \{\alpha_{i}\}} \sum_{i} \left\|x_{i} - D\alpha_{i}\right\|_{2}^{2} \quad \text{s.t.} \quad \left\|\alpha_{i}\right\|_{0} \le T \;\; \forall i \tag{2}\]
where \(\{\alpha_{i}\}\) are the sparse representation vectors for \(\{x_{i}\}\) and T bounds the number of nonzero coefficients. Many dictionary learning methods have been proposed in recent years. One of the most widely used is K-SVD [18], which has been shown to be more effective and more efficient than many other state-of-the-art dictionary learning methods.
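The sparse-representation-based SR method assumes that low-resolution patches have the same sparse representation as their corresponding high-resolution patches. Therefore, the sparse dictionaries have to be learned jointly for low- and high-resolution image patches. Given a set of training image patch pairs \(X_{h}\) and \(X_{l}\), the joint dictionary learning can be formulated, in the form used by Yang et al. [35], as:
\[\min_{D_{h}, D_{l}, \alpha} \frac{1}{N}\left\|X_{h} - D_{h}\alpha\right\|_{2}^{2} + \frac{1}{M}\left\|X_{l} - D_{l}\alpha\right\|_{2}^{2} + \lambda\left(\frac{1}{N} + \frac{1}{M}\right)\left\|\alpha\right\|_{1} \tag{3}\]
where \(X_{h}\) and \(X_{l}\) are the high- and low-resolution patches, N and M are their respective dimensionalities, α holds the sparse coefficients shared by both dictionaries, and λ weights the sparsity constraint.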
To speed up the running time, Timofte et al. [30] proposed anchored neighborhood regression for fast single-image SR. They relaxed the L0 norm constraint to an L2 norm and used only part of the dictionary atoms to represent each patch. The objective function then becomes
\[\min_{\alpha_{i}} \left\|y_{i} - D_{l}\alpha_{i}\right\|_{2}^{2} + \lambda\left\|\alpha_{i}\right\|_{2}^{2} \tag{4}\]
With the L2 norm, the problem becomes a ridge regression with a closed-form solution. An input low-frequency patch \(y_{i}\) can be projected to the high-resolution space as
\[x_{i} = D_{h}\left(D_{l}^{T}D_{l} + \lambda I\right)^{-1}D_{l}^{T}y_{i} = P_{i}y_{i} \tag{5}\]
where \(P_{i}\) is the stored projection matrix for dictionary atom \(D_{l}^{i}\), \(D_{l}\) and \(D_{h}\) are restricted to the atoms used for that patch, and λ is the regularization weight. In summary, ANR computes the projection matrix \(P_{i}\) for each dictionary atom offline during training; at test time, it anchors each patch to its most similar dictionary atom and maps it to the high-frequency detail patch with the corresponding projection matrix \(P_{i}\). In [31], Timofte et al. propose A+, an improved variant of ANR, which combines the best qualities of ANR and SF. We refer the reader to [30, 31] for more details about ANR and A+.
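The offline and online halves of ANR described above can be sketched in NumPy as follows. This is an illustrative sketch under stated assumptions, not the reference implementation of [30]: atoms are taken to be unit-norm, neighbors and anchors are chosen by absolute correlation, and the values of `n_neighbors` and `lam` are arbitrary placeholders.

```python
import numpy as np


def projection_matrix(N_l, N_h, lam):
    """Ridge-regression projection N_h (N_l^T N_l + lam I)^-1 N_l^T."""
    k = N_l.shape[1]
    gram = N_l.T @ N_l + lam * np.eye(k)
    return N_h @ np.linalg.solve(gram, N_l.T)


def anr_projections(D_l, D_h, n_neighbors, lam=0.1):
    """Offline step: one projection matrix per atom, built from its nearest atoms."""
    similarity = np.abs(D_l.T @ D_l)          # correlation between LR atoms
    projections = []
    for j in range(D_l.shape[1]):
        neighbors = np.argsort(-similarity[j])[:n_neighbors]
        projections.append(
            projection_matrix(D_l[:, neighbors], D_h[:, neighbors], lam))
    return np.stack(projections)              # (n_atoms, m, n)


def super_resolve_patch(y, D_l, projections):
    """Online step: anchor y to its most correlated atom, then project."""
    j = int(np.argmax(np.abs(D_l.T @ y)))
    return projections[j] @ y
```

All of the expensive linear algebra happens in `anr_projections`; at test time each patch costs one nearest-atom search and one matrix-vector product.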
2.2 Deep learning approaches
Previous deep-learning-based image SR approaches learn an end-to-end mapping, which takes the low-resolution image as input and directly outputs the high-resolution one. The pioneering work is SRCNN [10], which is a simple three-layer network. Specifically, the first layer performs patch extraction and representation: it extracts overlapping patches from the input image and represents each patch as a high-dimensional vector. Then, the nonlinear mapping layer maps each high-dimensional vector of the first layer to another high-dimensional vector, which is conceptually the representation of a high-resolution patch. Finally, the reconstruction layer aggregates the patch-wise representations to generate the final output. Inspired by other successful high-level works, Kim et al. [17] proposed to increase the network depth to obtain a larger receptive field for predicting the image details and to use residual learning to accelerate convergence. In [34], Wang et al. designed a network to mimic the traditional sparse-representation-based SR method. However, it needs multiple layers to obtain an accurate sparse representation, and all the image patches use the same network structure. Many other deep-learning-based image SR methods have also been proposed.
Transfer learning in deep neural networks has become popular since the success of deep learning in image classification [19]. The features learned from ImageNet show good generalization ability [36] and have become a powerful tool for several high-level vision problems. Many works have demonstrated that transferring network parameters learned from ImageNet to their own network can improve performance. Inspired by the success of transfer learning in high-level vision problems, Dong et al. [9] explored several transfer settings for compression artifact reduction and demonstrated the effectiveness of transfer learning in low-level vision problems. Unlike the transfer learning approaches mentioned above, we propose to transfer the parameters from a precomputed projection matrix, which allows our network to make full use of the image prior and improves SR performance.
3 Motivations and contributions
Traditional non-deep-learning-based single-image SR methods try either to find new kinds of image priors or to propose a new way to use existing image priors. All these previous works demonstrated that making full use of image priors can improve image SR performance. How to use the image prior in deep-learning-based methods is still rarely studied, which inspires us to explore how to take the image prior into account in deep-learning-based methods. Fortunately, previous works by Timofte et al. [30, 31] show that the objective function with a sparse prior constraint has a closed-form solution. Furthermore, the matrix multiplication can be easily implemented by a convolution layer. Therefore, transferring the weights of the projection matrix trained offline to a convolution layer is a very natural choice.
The neurons of previous deep-learning-based methods work on the whole input feature map. They have to compromise across different image contents. ANR and A+, proposed by Timofte et al. [30, 31], inspire us to anchor different image patches to different dictionary atoms; the patches are then naturally divided into multiple categories, and each neuron works on similar image patches.
In this paper, contrary to previous works, we propose to transfer the weights of the matrix, which are trained offline on a large number of patches under an image prior constraint, to the weights of a convolution layer. As a result, our network inherently takes the image prior into account. Similar to ANR and A+, we anchor each feature vector to one of the dictionary atoms and then use the corresponding convolution layer to map the low-frequency input vector to its predicted high-frequency detail. As a result, each neuron of our network works on image patches of the same kind and does not have to compromise across different image contents.
In short, the contributions of this work are mainly in three aspects:

We establish a relationship between our deep-learning-based single-image SR method and the well-known sparse representation one. Transfer learning is used to join the traditional approach, which makes good use of image prior knowledge, with the deep-learning-based approach, which has strong end-to-end optimization ability.

We propose an anchored neighborhood deep network for single-image SR. Compared to previous deep-learning-based methods, the neurons in our proposed SR network focus on local image information and do not have to compromise across different image contents. Whereas the traditional anchored neighborhood regression methods perform local optimization, our proposed network is optimized globally, end to end.

We conduct extensive experiments to demonstrate the robustness of our proposed single-image SR method.
4 Proposed method
Like previous deep-learning-based single-image SR methods, our proposed method is an end-to-end mapping that takes the low-resolution image as input and directly outputs the high-resolution one. It differs in two main aspects: we use a sparse prior constraint convolution layer to take the image sparse prior into account, and we use an anchored neighborhood convolution layer so that neurons do not have to compromise across different image contents. Therefore, we first introduce the sparse prior constraint convolution layer and the anchored neighborhood convolution layer, which address the two problems we focus on. Finally, we introduce our new network structure for single-image SR.
4.1 Sparse prior constraint layer
As shown in Eqs. (4) and (5), the L2-norm-constrained objective function has a closed-form solution \(x_{i} = P_{i}y_{i}\), where the projection matrix is precomputed offline from a set of low- and high-resolution image patch pairs. If each row of the projection matrix \(P_{i}\) is considered as a filter, we can use a convolution layer to mimic this mapping process to predict the image detail. Here, we assume that \(y_{i}\) is a vector of size n×1, \(x_{i}\) is a vector of size m×1, and \(P_{i}\) is a matrix of size m×n. Then, each filter is of size 1×1×n, i.e., its spatial size is 1×1 and it spans n input feature maps. Since the projection matrix \(P_{i}\) has m rows, there are m filters of size 1×1×n. It should be noted that the filters have no bias, so that they exactly mimic the matrix multiplication.
As shown in Eq. (5), \({x_{i}} = {D_{h}}{\left ({D_{l}^{T}{D_{l}} + \lambda I} \right)^{-1}}D_{l}^{T}{y_{i}}\), where \(D_{l}\) and \(D_{h}\) are the two trained low- and high-resolution dictionaries. Since \(x_{i}\) is the closed-form solution under the image sparse prior constraint, transferring the matrix weights to a convolution layer gives our network an inherent ability to take the image sparse prior into account, and the output \(x_{i}\) will be a more accurate high-frequency prediction.
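The equivalence between the matrix product and a bias-free 1×1 convolution can be checked numerically. In this minimal NumPy sketch the sizes n = 30 and m = 81 and the random matrix standing in for a trained projection matrix are illustrative assumptions.

```python
import numpy as np


def conv1x1(features, P):
    """Bias-free 1x1 convolution.

    features: (H, W, n) feature map; P: (m, n) matrix whose rows are the
    m filters of size 1x1xn. Returns an (H, W, m) feature map.
    """
    return np.einsum("hwn,mn->hwm", features, P)


rng = np.random.default_rng(0)
n, m = 30, 81                                  # illustrative sizes only
P = rng.standard_normal((m, n))                # stand-in for a trained P_i
features = rng.standard_normal((4, 5, n))

out = conv1x1(features, P)
# at every pixel the convolution output equals the matrix-vector product
assert np.allclose(out[2, 3], P @ features[2, 3])
```

Transferring the weights therefore amounts to copying row r of the projection matrix into filter r; no training is needed for the layer to reproduce the projection.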
4.2 Anchored neighborhood layer
ANR and A+ first find the neighborhoods and then calculate a separate projection matrix \(P_{i}\) for each dictionary atom \(D_{i}\) in the offline training process. As a result, given an input patch feature \(y_{i}\), it is only necessary to anchor it to its nearest neighbor atom \(D_{i}\) and map it to HR space using the stored projection matrix \(P_{i}\). In this paper, we use a network to mimic this process, which inherently improves the performance of our method.
The anchored neighborhood convolution layer is outlined in Fig. 2. For each dictionary atom \(D_{i}\), we calculate its projection matrix \(P_{i}\) using the same method as A+, which takes the image sparse prior into account. After training all projection matrices, we transfer them to different convolution layers using the method described above. That is, each sub-convolution layer associated with an atom in the anchored neighborhood layer is a sparse prior constraint convolution layer. It should be noted that all these sub-convolution layers can be implemented in parallel. For each input low-frequency feature vector, the anchored neighborhood layer anchors it to one dictionary atom, which activates the corresponding sub-convolution layer. Then, the activated convolution layer maps the low-frequency feature vector to the high-resolution space, which carries out the traditional matrix multiplication.
Since we transfer the weights of the projection matrix \(P_{i}\) to the sub-convolution layer, the anchored neighborhood convolution layer fully takes the sparse image prior into account. Both ANR and A+ demonstrate that the projection matrix \(P_{i}\) can be used to accurately predict the high-frequency details. We therefore expect our anchored neighborhood convolution layer to provide an accurate high-frequency prediction for the later layers to refine further. More importantly, through the anchoring process, the image patches are divided into multiple categories, and each neuron works on similar feature vectors instead of the whole image, so it does not have to compromise across different image contents.
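The routing performed by the anchored neighborhood layer can be sketched as below. This is a hedged NumPy illustration rather than the authors' MatConvNet layer: it assumes unit-norm atoms, anchoring by absolute correlation, and projection matrices stacked in an array of shape (K, m, n); all names are hypothetical.

```python
import numpy as np


def anchored_neighborhood_layer(features, atoms, projections):
    """Apply, to each feature vector, the projection of its anchor atom.

    features:    (H, W, n) low-frequency feature map
    atoms:       (n, K) dictionary with unit-norm columns (assumed)
    projections: (K, m, n), one projection matrix per atom
    Returns the (H, W, m) prediction and the (H, W) anchor index map.
    """
    H, W, n = features.shape
    flat = features.reshape(-1, n)                        # (H*W, n)
    anchor = np.argmax(np.abs(flat @ atoms), axis=1)      # nearest atom per vector
    out = np.zeros((flat.shape[0], projections.shape[1]))
    for k in np.unique(anchor):
        mask = anchor == k
        # only the sub-layer of atom k is active for these vectors
        out[mask] = flat[mask] @ projections[k].T
    return out.reshape(H, W, -1), anchor.reshape(H, W)
```

Each group in the loop is independent of the others, which is why the sub-convolution layers can run in parallel.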
4.3 Proposed network structure
The proposed network structure is outlined in Fig. 3. It can be simply divided into four parts, i.e., feature extraction layer, anchored neighborhood convolution layer, combination layer, and deep integration subnetwork. We have used different colors to mark the corresponding part in Fig. 3.
Feature extraction. ANR and A+ show that the features used to represent the image patches have a strong influence on performance. The most basic feature to use is the patch itself. This, however, does not give the feature good generalization properties. A frequently used alternative is the first- and second-order derivatives of the patch [3, 35]. In this paper, we use a convolution layer with n1 filters of size 3s×3s×1, where s is the magnification factor, to extract the image feature. As a result, the output feature is an n1×1 vector. At the same time, we use a “one-hot” convolution, in which each filter extracts only one pixel of the receptive field, to extract LR patches for the later image reconstruction. The filter size of the one-hot convolution is also 3s×3s×1.
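The one-hot convolution can be illustrated with a small filter bank in which filter k contains a single 1 at flat position k, so that the bank copies each patch into a vector. The sketch below assumes stride 1 and no padding, neither of which is stated in the text, and uses s = 2 purely as an example.

```python
import numpy as np


def one_hot_filters(size):
    """size*size filters of shape (size, size); filter k is 1 at flat index k."""
    return np.eye(size * size).reshape(size * size, size, size)


def correlate_valid(image, filters):
    """Stride-1, unpadded cross-correlation of a 2-D image with a filter bank."""
    k = filters.shape[-1]
    H, W = image.shape
    out = np.empty((H - k + 1, W - k + 1, filters.shape[0]))
    for i in range(H - k + 1):
        for j in range(W - k + 1):
            patch = image[i:i + k, j:j + k]
            out[i, j] = np.tensordot(filters, patch, axes=([1, 2], [0, 1]))
    return out


s = 2                                           # example magnification factor
filters = one_hot_filters(3 * s)                # 36 filters of size 6x6
image = np.arange(64, dtype=float).reshape(8, 8)
patches = correlate_valid(image, filters)
# each output vector is exactly the flattened patch under the filter window
assert np.array_equal(patches[1, 2], image[1:7, 2:8].ravel())
```

Expressing patch extraction as a convolution keeps the whole pipeline inside the network, so the extracted patches line up pixel for pixel with the learned features.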
Anchored neighborhood convolution. This layer has been introduced in detail in Section 4.2. It takes the image prior into account to predict the image details quickly and accurately, and it makes the neurons work on local image patches so that they do not have to compromise across different image contents. Note that the dictionary used in our experiments has 1024 atoms. Therefore, there are 1024 parallel sparse prior constraint layers in this anchored neighborhood layer.
Combination. The anchored neighborhood convolution layer outputs the initial high-frequency details for each low-resolution patch. We first add these estimated high-frequency details to the corresponding LR patch, which is extracted by the one-hot convolution, to get the initial high-resolution feature vector. We then reshape these feature vectors into image patches and concatenate them to output the initial high-resolution estimate. In other words, the combination layer consists of a reshape and a concatenation step.
Deep integration. It has been demonstrated in the literature that the deeper the network, the better the performance. To further fuse the local similarity details of the image, we design a deep integration subnetwork that cascades m convolution layers, where all layers except the first and the last are of the same type: d filters of size f×f×d, where a filter operates on an f×f spatial region across d channels (feature maps). The first layer operates on the output of the combination layer, so it has d filters of size f×f×1. The last layer, which outputs the final image estimate, consists of a single filter of size f×f×d. Writing \(F_{i}\) for the output of the ith layer and \(F_{0}\) for the output of the combination layer, it can be formulated as
\[F_{i} = \max\left(0, w_{i} * F_{i-1} + b_{i}\right) \tag{6}\]
where * denotes convolution, max(·) represents the rectified linear unit (ReLU) operator, and \(w_{i}\) and \(b_{i}\) represent the filters and biases of the ith layer, respectively.
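The layer shapes described for the deep integration subnetwork can be tabulated with a short helper. This is a sketch under assumptions: the text does not give a value for d, so d = 64 and f = 3 below are placeholders, and the function names are hypothetical.

```python
def integration_layer_shapes(num_layers, f, d):
    """(filters, height, width, channels) of each layer in the cascade."""
    if num_layers < 2:
        raise ValueError("need at least a first and a last layer")
    shapes = [(d, f, f, 1)]                      # first layer: d filters, f x f x 1
    shapes += [(d, f, f, d)] * (num_layers - 2)  # middle layers: d filters, f x f x d
    shapes.append((1, f, f, d))                  # last layer: one filter, f x f x d
    return shapes


def parameter_count(shapes):
    """Weights plus one bias per filter."""
    return sum(n * h * w * c + n for n, h, w, c in shapes)


shapes = integration_layer_shapes(num_layers=4, f=3, d=64)  # d=64 is a placeholder
for index, shape in enumerate(shapes, start=1):
    print(f"layer {index}: {shape}")
print("trainable parameters:", parameter_count(shapes))     # 75073
```

Most of the parameters sit in the middle layers, since each of them connects d input channels to d output channels.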
4.4 Training
We now describe the objective that is minimized to find the optimal parameters of our model. Following most deep-learning-based image restoration methods, the mean square error is adopted as the cost function of our network. Our goal is to train an end-to-end mapping f that predicts values \(\hat y = f\left (x \right)\), where x is an input low-resolution image and \(\hat y\) is the estimate of the corresponding high-resolution image. Given a set of high-resolution image examples \(y_{i}\), i=1…N, we generate the corresponding low-resolution images \(x_{i}\), i=1…N (in fact, we upscale them back to the original size by bicubic interpolation). Then, the optimization objective is represented as
\[\min_{\theta} \frac{1}{N}\sum_{i=1}^{N}\left\|f\left(x_{i};\theta\right) - y_{i}\right\|_{2}^{2} \tag{7}\]
where θ denotes the network parameters to be trained and \(f(x_{i};\theta)\) is the estimated high-resolution image for the low-resolution image \(x_{i}\). We use adaptive moment estimation (Adam) [18] to optimize all network parameters.
5 Experimental results and discussion
In this section, we evaluate the performance of our method on several datasets. We first describe the datasets used for training and testing our method. Next, some training details are given. Finally, we show quantitative and qualitative comparisons with four state-of-the-art methods. We name our anchored neighborhood deep network ANNet.
5.1 Implementation details
Datasets for training and testing. It is well known that the training dataset is very important for the performance of learning-based image restoration methods. Many training datasets can be found in the literature. For example, SRCNN [10, 11] uses a 91-image dataset and VDSR [17] uses a 291-image dataset. In this paper, we mainly follow FSRCNN [12] and use the General-100 dataset, which contains 100 BMP-format images (with no compression). To further test the impact of different training datasets on performance, we also build our own training dataset, which contains 260 BMP-format images. We set the patch size to 45 × 45 and use data augmentation (rotation or flip) to prepare the training data. Following FSRCNN and SRCNN, we use three datasets for testing, i.e., Set5 [1] (5 images), Set14 [37] (14 images), and BSD200 [23] (200 images), which are widely used as benchmarks in other works. Note that the test images are strictly separate from the training datasets.
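A minimal sketch of the rotation and flip augmentation is given below. The text does not specify which rotations or flips are used; the sketch assumes rotations by multiples of 90 degrees combined with a horizontal flip, which yields eight variants per patch.

```python
import numpy as np


def augment(patch):
    """Return the 8 rotated/flipped variants of a square patch.

    The choice of 90-degree rotations plus a horizontal flip is an assumption.
    """
    variants = []
    for k in range(4):
        rotated = np.rot90(patch, k)          # rotate by k * 90 degrees
        variants.append(rotated)
        variants.append(np.fliplr(rotated))   # mirrored copy of the rotation
    return variants


patch = np.arange(45 * 45, dtype=np.float32).reshape(45, 45)
augmented = augment(patch)
```

Because the patches are square, every variant keeps the 45 × 45 size used for training.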
Training strategy. For weight initialization, we use the method described in He et al. [13]. This is a theoretically sound procedure for networks utilizing rectified linear units (ReLU). For the other hyper-parameters of Adam, we set the exponential decay rates for the first and second moment estimates to 0.9 and 0.999, respectively. We train all our experiments for only 30 epochs, and each epoch consists of 2000 iterations with a batch size of 64. The learning rate is 0.0001 for the first 10 epochs, 0.00001 for epochs 11 to 20, and 0.000001 for the last 10 epochs. We implement our model using the MatConvNet package [33].
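The step schedule described above can be written as a small function. The function and constant names are illustrative; the values come from the text.

```python
def learning_rate(epoch):
    """Step learning-rate schedule; epoch is 1-based, 30 epochs in total."""
    if not 1 <= epoch <= 30:
        raise ValueError("epoch must be in 1..30")
    if epoch <= 10:
        return 1e-4
    if epoch <= 20:
        return 1e-5
    return 1e-6


ADAM_BETA1, ADAM_BETA2 = 0.9, 0.999   # decay rates for the moment estimates
EPOCHS = 30
ITERATIONS_PER_EPOCH = 2000
BATCH_SIZE = 64

total_iterations = EPOCHS * ITERATIONS_PER_EPOCH   # 60000 parameter updates
patches_seen = total_iterations * BATCH_SIZE       # 3840000 training patches
```

The whole run therefore amounts to 60,000 updates, with the learning rate dropping by a factor of ten after every 20,000.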
5.2 Investigation of different settings
To test the properties of our anchored neighborhood deep network, we design a set of controlled experiments. We investigate the impact of the filter size, the network depth, and the training dataset. Since the parameters of the anchored neighborhood layer are fixed by the projection matrix trained offline, we mainly investigate different settings of the deep integration subnetwork.
First, we investigate the impact of the filter size on performance. In these experiments, the deep integration subnetwork has just two convolution layers. The average PSNR and SSIM values on the Set5 dataset for these experiments are shown in Table 1. The first column gives the filter size of the first convolution layer of the deep integration subnetwork, while the first row gives the filter size of the second convolution layer. Therefore, the values in the second row and the third column are the average PSNR and SSIM values of our network with the spatial sizes of the first and second layers of the deep integration subnetwork being 3×3 and 5×5, respectively. Since we use square filters, we abbreviate each size to one number. Table 1 shows that the larger the filter size, the better the performance. This can be attributed to the larger receptive field, which provides more useful information for predicting the image details.
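For reference, PSNR as reported in the tables can be computed as in the following sketch. It assumes 8-bit images with a peak value of 255; the text does not state which channel is evaluated or whether borders are cropped, so those choices are left to the caller.

```python
import numpy as np


def psnr(reference, estimate, peak=255.0):
    """Peak signal-to-noise ratio in dB between two images of equal shape."""
    reference = np.asarray(reference, dtype=np.float64)
    estimate = np.asarray(estimate, dtype=np.float64)
    mse = np.mean((reference - estimate) ** 2)
    if mse == 0:
        return float("inf")          # identical images
    return 10.0 * np.log10(peak ** 2 / mse)
```

Because the scale is logarithmic, a gain of a few tenths of a dB, as discussed in this section, corresponds to a reduction of a few percent in mean squared error.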
Next, we investigate the impact of network depth on performance. In these experiments, the filter sizes of the deep integration subnetwork for the two-layer, three-layer, and four-layer networks are 9-3, 9-5-3, and 9-5-5-3, respectively, where each number is a filter's spatial size. The PSNR and SSIM values on the Set5 dataset for these experiments are shown in Table 2. The four-layer network obtains the best performance not only on the average PSNR and SSIM but also on every single image. The three-layer network also outperforms the two-layer network. On this test dataset, the four-layer network improves by roughly 0.27 dB and 0.08 dB on average over the two- and three-layer networks, respectively. This reveals that the deeper the network, the better the performance, which agrees with other researchers' findings. The good performance of the deeper network can also be attributed to its larger receptive field, which provides more useful information for predicting the image details.
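The receptive-field argument can be made concrete with a short calculation. The sketch reads the three configurations as filter sizes 9-3, 9-5-3, and 9-5-5-3 and assumes stride-1 convolutions without dilation; the receptive field is measured on the input of the deep integration subnetwork.

```python
def receptive_field(filter_sizes):
    """Receptive field of stacked stride-1 convolutions: 1 + sum(f - 1)."""
    return 1 + sum(f - 1 for f in filter_sizes)


for sizes in ([9, 3], [9, 5, 3], [9, 5, 5, 3]):
    side = receptive_field(sizes)
    print(sizes, f"-> {side}x{side}")
# [9, 3] -> 11x11, [9, 5, 3] -> 15x15, [9, 5, 5, 3] -> 19x19
```

Each additional 5×5 layer widens the window by four pixels in each dimension, so the deeper variants see a larger context for every output pixel.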
Finally, we investigate the impact of the training dataset on performance. In most of our experiments, we follow FSRCNN and use General-100 as the training dataset. To further test the impact of the training dataset on performance, we build our own training dataset, which contains 260 BMP-format images. Table 3 shows the PSNR and SSIM values on Set5 of our proposed ANNet trained with the two datasets. The small dataset is General-100, which contains 100 images, while the big dataset is our newly built dataset, which contains 260 images. On this test dataset, our network trained on the larger dataset improves by roughly 0.27 dB on average over the same network trained on the small dataset. This means that a larger training dataset is an effective way to improve network performance.
5.3 Comparisons with state-of-the-art methods
We compare our ANNet with four state-of-the-art learning-based single-image SR methods, namely A+ [31], SRF [25], SRCNN [10, 11], and SCN [34]. A+ and SRF are two state-of-the-art traditional non-deep-learning-based methods, while SRCNN and SCN are two popular deep-learning-based single-image SR methods. Table 4 summarizes the quantitative evaluation on several datasets. The results of the other four methods are taken as reported in FSRCNN [12]. For this comparison, the deep integration subnetwork of our ANNet has two layers with filter sizes 5 and 3, respectively, and the network is trained on the small public General100 dataset instead of our own big dataset. Table 4 shows that our proposed ANNet outperforms A+, SRF, and SRCNN. With this setting, averaged over all three test datasets, our ANNet improves PSNR by roughly 1.97, 0.38, 0.24, and 0.1 dB over Bicubic, SRCNN, A+, and SRF, respectively. Our ANNet is comparable with SCN, because the difference in average PSNR is just 0.04 dB. Furthermore, SCN needs multiple cascade operations to reach its best performance. As discussed in the previous section, we can increase our network depth or use a larger training dataset to get better performance. Some qualitative results are given in Figs. 4, 5, and 6. Figure 4 shows the visual quality comparison of single-image SR on the image butterfly from Set5 with an upscaling factor of 3. Figures 5 and 6 use the images baby and woman from Set5, respectively, as two examples with an upscaling factor of 4. In these examples, our ANNet recovers more image details. All these results indicate that our proposed ANNet is a robust single-image SR method.
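To give a feel for the size of the PSNR gaps quoted above, a gain in dB can be converted into a ratio of mean squared errors. Since PSNR = 10 log10(peak² / MSE), a gain of Δ dB corresponds to MSE_new / MSE_baseline = 10^(−Δ/10). The sketch below applies this identity to the average gaps from the paragraph above. It is only a rough reading aid: an average of per-image PSNR values is not the PSNR of the average MSE, so the percentages are indicative rather than exact. The function name is hypothetical.

```python
def mse_ratio_from_psnr_gain(gain_db: float) -> float:
    """MSE of the improved result divided by MSE of the baseline, given a PSNR gain in dB."""
    return 10.0 ** (-gain_db / 10.0)


if __name__ == "__main__":
    # Average PSNR gaps in dB as quoted in the comparison paragraph.
    # The SCN entry is the magnitude of the gap only; its sign is not stated there.
    gaps_db = {
        "Bicubic": 1.97,
        "SRCNN": 0.38,
        "A+": 0.24,
        "SRF": 0.1,
        "SCN (magnitude only)": 0.04,
    }
    for name, gain in gaps_db.items():
        reduction = 1.0 - mse_ratio_from_psnr_gain(gain)
        print(f"{name}: {gain:.2f} dB is about {100.0 * reduction:.1f}% difference in MSE")
```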
6 Conclusions
In this paper, we focus on two rarely studied problems in deep-learning-based single-image super-resolution: how to take the image prior into account in deep-learning-based approaches, and how to avoid neurons compromising across different image contents. For the first problem, we use transfer learning to transfer the weights of a projection matrix, trained under a strict image prior constraint, to one convolution layer. For the second problem, the proposed ANNet anchors each input feature vector to one of the dictionary atoms and maps it to the high-resolution space with the corresponding convolution layer. By solving these two problems, we have proposed an anchored neighborhood deep network for single-image super-resolution. Experimental results show that our proposed method outperforms many state-of-the-art single-image super-resolution methods. Our experiments also show that the deeper our network, the better the performance. Furthermore, a larger training dataset improves network performance, which motivates us to train our network on a larger dataset such as ImageNet for practical applications.
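The anchoring step summarized above can be illustrated in plain linear-algebra terms. The following is a minimal sketch in the spirit of anchored neighborhood regression, not the convolutional implementation used in ANNet: each feature vector is assigned to the dictionary atom with which it has the largest inner product (the choice of similarity measure is an assumption of this sketch), and the projection matrix belonging to that atom maps it to the high-resolution space. All names, sizes, and the random dictionary and projections are hypothetical placeholders, not learned values.

```python
import numpy as np


def anchored_regression(features, atoms, projections):
    """Map each LR feature vector to an HR vector through its anchor atom.

    features:    (n, d_lr) low-resolution feature vectors
    atoms:       (k, d_lr) dictionary atoms, assumed L2-normalised
    projections: (k, d_hr, d_lr) one projection matrix per atom
    """
    features = np.asarray(features, dtype=np.float64)
    # Anchor each feature to the atom with the largest inner product.
    anchor_idx = np.argmax(features @ atoms.T, axis=1)
    # Apply, per feature, the projection matrix of its anchor atom.
    hr = np.einsum("nij,nj->ni", projections[anchor_idx], features)
    return hr, anchor_idx


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    k, d_lr, d_hr = 8, 30, 81  # hypothetical sizes
    atoms = rng.standard_normal((k, d_lr))
    atoms /= np.linalg.norm(atoms, axis=1, keepdims=True)
    projections = rng.standard_normal((k, d_hr, d_lr))  # placeholders, not trained
    feats = rng.standard_normal((5, d_lr))
    hr, idx = anchored_regression(feats, atoms, projections)
    print(hr.shape, idx)
```

Grouping features by anchor means each projection only ever sees inputs that resemble its atom, which is the property the conclusions credit for more accurate recovery of high-frequency detail.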
7 Future work
One of the problems we focus on is how to take the image prior into account for deep-learning-based image super-resolution. However, we have used only the sparse image prior in this paper. Apart from the sparse prior, many other kinds of image prior, e.g., local smoothness and nonlocal self-similarity, have been well studied in traditional non-deep-learning-based image restoration methods. Therefore, a natural way to extend our work is to take more image priors into account and to explore more effective ways of using them for deep-learning-based image super-resolution. In addition, the multiple input frames available in video super-resolution offer more abundant image information. In the future, we will also investigate extending our proposed anchored neighborhood deep network into a spatio-temporal network that super-resolves one frame from multiple neighboring frames.
Abbreviations
AI: Artificial intelligence
Adam: Adaptive moment estimation
ANR: Anchored neighborhood regression
CNN: Convolutional neural network
FHD: Full high definition
FSRCNN: Fast SRCNN
HR: High resolution
JPEG: Joint Photographic Experts Group
K-SVD: K-singular value decomposition
LLE: Locally linear embedding
LR: Low resolution
NE: Neighbor embedding
ReLU: Rectified linear units
SF: Simple functions
SR: Super-resolution
SRCNN: Super-resolution CNN
USB: Universal serial bus
UHD: Ultra-high definition
VDSR: Very deep convolutional SR
References
M Bevilacqua, A Roumy, C Guillemot, ML Alberi-Morel, Low-complexity single-image super-resolution based on nonnegative neighbor embedding (British Machine Vision Association, BMVA, 2012).
J Caballero, C Ledig, A Aitken, A Acosta, J Totz, Z Wang, W Shi, in Proceedings of the 30th IEEE Conference on Computer Vision and Pattern Recognition, CVPR. Real-time video super-resolution with spatio-temporal networks and motion compensation, (2017), pp. 2848–2857.
H Chang, DY Yeung, Y Xiong, in Computer Vision and Pattern Recognition, 2004. CVPR 2004. Proceedings of the 2004 IEEE Computer Society Conference on, vol. 1. Super-resolution through neighbor embedding (Institute of Electrical and Electronics Engineers Computer Society, 2004), pp. I–I.
BW Chen, X He, SY Kung, Support vector analysis of large-scale data based on kernels with iteratively increasing order. J. Supercomput. 72(9), 3297–3311 (2015).
BW Chen, M Imran, M Guizani, Cognitive sensors based on ridge phase-smoothing localization and multiregional histograms of oriented gradients. IEEE Trans. Emerg. Top. Comput. 99(1), 1–1 (2016).
BW Chen, W Ji, Geo-conquesting based on graph analysis for crowdsourced metatrails from mobile sensing. IEEE Commun. Mag. 55(1), 92–97 (2017).
BW Chen, L Yang, Y Gu, Privacy-preserved big data analysis based on asymmetric imputation kernels. Futur. Gener. Comput. Syst. 78(2), 859–866 (2018).
YH Chen, GL Peng, CH Xie, W Zhang, CH Li, SH Liu, Acdin: Bridging the gap between artificial and real bearing damages for bearing fault diagnosis, (2018).
C Dong, Y Deng, C Change Loy, X Tang, in Proceedings of the IEEE International Conference on Computer Vision. Compression artifacts reduction by a deep convolutional network (Institute of Electrical and Electronics Engineers Inc., 2015), pp. 576–584.
C Dong, CC Loy, K He, X Tang, in European Conference on Computer Vision. Learning a deep convolutional network for image super-resolution (Springer Verlag, 2014), pp. 184–199.
C Dong, CC Loy, K He, X Tang, Image super-resolution using deep convolutional networks. IEEE Trans. Pattern. Anal. Mach. Intell. 38(2), 295–307 (2016).
C Dong, CC Loy, X Tang, in European Conference on Computer Vision. Accelerating the super-resolution convolutional neural network (Springer Verlag, 2016), pp. 391–407.
K He, X Zhang, S Ren, J Sun, in Proceedings of the IEEE International Conference on Computer Vision. Delving deep into rectifiers: surpassing human-level performance on imagenet classification (Institute of Electrical and Electronics Engineers Inc., 2015), pp. 1026–1034.
Y Huang, W Wang, L Wang, in Advances in Neural Information Processing Systems. Bidirectional recurrent convolutional networks for multi-frame super-resolution (Neural Information Processing Systems Foundation, 2015), pp. 235–243.
J Johnson, A Alahi, L Fei-Fei, in European Conference on Computer Vision. Perceptual losses for real-time style transfer and super-resolution (Springer Verlag, 2016), pp. 694–711.
A Kappeler, S Yoo, Q Dai, AK Katsaggelos, Video super-resolution with convolutional neural networks. IEEE Trans. Comput. Imaging. 2(2), 109–122 (2016).
J Kim, J Kwon Lee, K Mu Lee, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Accurate image super-resolution using very deep convolutional networks (IEEE Computer Society, 2016), pp. 1646–1654.
D Kingma, J Ba, in International Conference on Learning Representations (ICLR2015). Adam: a method for stochastic optimization, (2015). arXiv:1412.6980. https://arxiv.org/abs/1412.6980.
A Krizhevsky, I Sutskever, GE Hinton, in Advances in Neural Information Processing Systems. Imagenet classification with deep convolutional neural networks (Neural Information Processing Systems Foundation, Canada, 2012), pp. 1097–1105.
C Li, W Zhang, G Peng, S Liu, Bearing fault diagnosis using fully-connected winner-take-all autoencoder. IEEE Access, 6103–6115 (2017).
CH Li, W Zhang, GL Peng, SH Liu, Bearing fault diagnosis using fully-connected winner-take-all autoencoder. IEEE Access. 6, 6103–6115 (2017).
X Liu, X Wu, D Zhao, in Image Processing (ICIP), 2013 20th IEEE International Conference on. Sparsity-based soft decoding of compressed images in transform domain (IEEE Computer Society, 2013), pp. 563–566.
D Martin, C Fowlkes, D Tal, J Malik, in Computer Vision, 2001. ICCV 2001. Proceedings. Eighth IEEE International Conference on, vol. 2. A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics (Institute of Electrical and Electronics Engineers Inc., 2001), pp. 416–423.
ST Roweis, LK Saul, Nonlinear dimensionality reduction by locally linear embedding. Science. 290(5500), 2323–2326 (2000).
S Schulter, C Leistner, in Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Fast and accurate image upscaling with super-resolution forests (IEEE Computer Society, 2015).
B Shen, N Chilamkurti, R Wang, X Zhou, S Wang, Deadline-aware rate allocation for IoT services in data center network, (2017).
B Shen, X Zhou, M Kim, Mixed scheduling with heterogeneous delay constraints in cyber-physical systems. Futur. Gener. Comput. Syst. 61(8), 108–117 (2016).
B Shen, X Zhou, R Wang, A delay-aware schedule method for distributed information fusion with elastic and inelastic traffic. Inf. Fusion. 36(7), 68–79 (2017).
W Shi, J Caballero, F Huszár, J Totz, AP Aitken, R Bishop, D Rueckert, Z Wang, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network (IEEE Computer Society, 2016), pp. 1874–1883.
R Timofte, V De Smet, L Van Gool, in Proceedings of the IEEE International Conference on Computer Vision. Anchored neighborhood regression for fast example-based super-resolution (Institute of Electrical and Electronics Engineers Inc., 2013), pp. 1920–1927.
R Timofte, V De Smet, L Van Gool, in Asian Conference on Computer Vision. A+: Adjusted anchored neighborhood regression for fast super-resolution (Springer Verlag, 2014), pp. 111–126.
JA Tropp, AC Gilbert, Signal recovery from random measurements via orthogonal matching pursuit. IEEE Trans. Inf. Theory. 53(12), 4655–4666 (2007).
A Vedaldi, K Lenc, in Proceedings of the 23rd ACM International Conference on Multimedia. Matconvnet: Convolutional neural networks for matlab (Association for Computing Machinery, Inc., 2015), pp. 689–692.
Z Wang, D Liu, J Yang, W Han, T Huang, in Proceedings of the IEEE International Conference on Computer Vision. Deep networks for image super-resolution with sparse prior (Institute of Electrical and Electronics Engineers Inc., 2015), pp. 370–378.
J Yang, J Wright, TS Huang, Y Ma, Image super-resolution via sparse representation. IEEE Trans. Image Process. 19(11), 2861–2873 (2010).
MD Zeiler, R Fergus, in European Conference on Computer Vision. Visualizing and understanding convolutional networks (Springer Verlag, 2014), pp. 818–833.
R Zeyde, M Elad, M Protter, in International Conference on Curves and Surfaces. On single image scale-up using sparse-representations (Springer Verlag, Heidelberg, 2010), pp. 711–730.
J Zhang, C Zhao, R Xiong, S Ma, D Zhao, Image super-resolution via dual-dictionary learning and sparse representation (IEEE Computer Society, Washington, 2012).
Acknowledgements
We would like to acknowledge all our team members, especially Min Gao and Xinwei Gao, for their constructive suggestions on deep-learning-based image restoration and image compression. We would also like to acknowledge NVIDIA Corporation, which kindly provided two sets of GPUs.
Funding
This work is partially funded by the MOE-Microsoft Key Laboratory of Natural Language Processing and Speech, Harbin Institute of Technology, the Major State Basic Research Development Program of China (973 Program 2015CB351804), and the National Natural Science Foundation of China under Grant Nos. 61572155, 61672188, and 61272386.
Availability of data and materials
The General100 training dataset [12] and the testing datasets Set5 [1] (5 images), Set14 [37] (14 images), and BSD200 [23] (200 images) are public. Please refer to the corresponding project websites to download these datasets. The source code of the proposed method and our self-established dataset are available from the corresponding author on reasonable request and will also soon be available on GitHub.
Author information
Contributions
WS, SL, and FJ conceived and designed the study. WS performed the experiments. WS and SL wrote the paper. FJ, DZ, and ZT reviewed and edited the manuscript. All authors read and approved the manuscript.
Ethics declarations
Competing interests
The authors declare that they have no competing interests.
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Additional information
Authors’ information
Wuzhen Shi is now a PhD candidate at the Harbin Institute of Technology (HIT), Harbin, China. He received a master's degree from Northwest A & F University, Yangling, Shaanxi, China, in 2014 and a bachelor's degree from Shenyang Agricultural University, Shenyang, China, in 2012. His research interest is deep-learning-based image restoration.
Shaohui Liu received the BS, MS, and PhD degrees in computer science from the Harbin Institute of Technology (HIT), Harbin, China, in 2000, 2002, and 2007, respectively. He is now an associate professor in the Department of Computer Science, HIT, and his research interests include data compression, pattern recognition, and image and video processing.
Feng Jiang received the BS, MS, and PhD degrees in computer science from the Harbin Institute of Technology (HIT), Harbin, China, in 2001, 2003, and 2008, respectively. He is now an associate professor in the Department of Computer Science, HIT, and a visiting scholar in the School of Electrical Engineering, Princeton University. His research interests include computer vision, pattern recognition, and image and video processing.
Debin Zhao received the BS, MS, and PhD degrees in computer science from the Harbin Institute of Technology (HIT), Harbin, China, in 1985, 1988, and 1998, respectively. He is now a professor in the Department of Computer Science, HIT. He has published over 200 technical articles in refereed journals and conference proceedings in the areas of image and video coding, video processing, video streaming and transmission, and pattern recognition.
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
About this article
Cite this article
Shi, W., Liu, S., Jiang, F. et al. Anchored neighborhood deep network for single-image super-resolution. J Image Video Proc. 2018, 34 (2018). https://doi.org/10.1186/s13640-018-0269-7
DOI: https://doi.org/10.1186/s13640-018-0269-7