Explicit-implicit dual stream network for image quality assessment

The communications industry has changed remarkably with the development of fifth-generation cellular networks. The image, as an indispensable component of communication, has attracted wide attention; thus, finding a suitable approach to assess image quality is important. Therefore, we propose a deep learning model for image quality assessment (IQA) based on an explicit-implicit dual stream network. We use frequency domain kurtosis features based on the wavelet transform to represent explicit features and spatial features extracted by a convolutional neural network (CNN) to represent implicit features. Thus, we construct an explicit-implicit (EI) parallel deep learning model, namely, the EI-IQA model. The EI-IQA model is based on VGGNet, which extracts the spatial domain features. On this basis, the number of network layers of VGGNet is reduced by adding the parallel wavelet-kurtosis frequency domain features, so the number of training parameters and the sample requirements decline. We verified, by cross-validation on different databases, that the wavelet-kurtosis feature fusion method based on deep learning extracts features more completely and generalises better. Thus, the method better simulates the human visual perception system, and its predictions are closer to human subjective perception. The source code of the proposed EI-IQA model is available on GitHub: https://github.com/jacob6/EI-IQA.


Introduction
The emergence of the 5G era [1] has brought great innovation to the communications industry, and the demand for information transmission has increased under the bombardment of high-speed information streams. Evidently, different regions have different abilities to receive information due to their geographic location and other factors. The image is the main carrier of visual information [2] because it intuitively reflects information, which is particularly important in the information transmission process. During image acquisition and transmission, different degrees of distortion [3] are caused by various factors, such as the processing system and environmental noise, and these distortions affect people's visual experience. Image quality directly affects the subjective perception of the human eye and the acquisition of image information. Therefore, research on IQA has aroused widespread concern.
According to whether human observers are involved, IQA methods can be roughly divided into two types: (1) subjective assessment and (2) objective assessment [4]. The subjective assessment method uses human visual assessment [5] as a standard and assesses images based on intuitive human visual experience; both the distorted image [6] and the original image are assessed. Although the subjective assessment method works well, it entails a heavy workload; thus, it does not satisfy the requirements of practical application [7, 8]. By contrast, objective assessment is simpler than subjective assessment and has strong controllability: it uses a mathematical model to score the image directly, without the need for an assessor, thereby saving time and resources. Thus, objective assessment has a better application prospect and has become the main method in the IQA field [9].
In the NIQE algorithm proposed by Mittal et al., the image to be tested is partitioned into blocks in a non-interval manner, and a feature vector is extracted for each block by the same method as in the BRISQUE algorithm. Finally, a multivariate Gaussian (MVG) model is fitted to the extracted feature vectors to obtain the final quality assessment score. Zhang et al. proposed IL-NIQE [24], an upgraded version of NIQE, which measures image quality by computing the distance between the distorted image and undistorted images under a multivariate Gaussian distribution model and performs principal component analysis (PCA) dimensionality reduction on the extracted feature vectors. The MVG computation is then performed on the features obtained from each image block to finally obtain the quality score. Zhang et al. also proposed the DESIQUE [25] algorithm, which extracts features from the spatial and frequency domains, downsamples the image, and finally obtains a quality score through a shallow regression network. Liu et al. proposed the SSEQ [26] algorithm, which divides the input picture into blocks and calculates the local entropy mean, local entropy skewness, local spectral domain entropy mean and local spectral domain entropy skewness of each region; 12 image features are then obtained by performing the computation at two downsampling scales, and a quality score is finally obtained through a shallow regression network. A performance comparison of explicit NR-IQA algorithms is shown in Table 1.
The early extraction of explicit features could not establish an end-to-end model [27] due to limited computational power. Therefore, a hierarchical extraction method was used: a feature set is obtained first, and corresponding operations are then performed on the feature set to produce the image score. With the advancement of computing power and the emergence of deep networks, implicit NR-IQA was introduced. Implicit NR-IQA [11] inputs the image into the algorithm, establishes an end-to-end model and directly obtains the final image quality score. Among methods represented by deep networks, the DeepIQA [28] model proposed by Gao Xinbo et al. performs qualitative assessment through machine learning and outputs numerical scores. The image is represented by the statistical characteristics of natural scenes [2], the deep model is trained, a classification framework is established, the extracted features are graded to correspond to different subjective feelings, and the qualitative labels are then converted into image quality assessment scores by pooling. The CNNIQA [29] algorithm proposed by Kang et al. is a BIQA model based on the CNN [30]. It takes image patches as input and uses back propagation and other methods for training. The CNN works in the spatial domain, and feature extraction and regression are integrated into the CNN, deepening the network to improve its learning ability. Kang et al. proposed a follow-up algorithm, CNNIQA++ [31], which increases the number of convolutional layers and modifies the fully connected layers on the basis of CNNIQA. Ma et al. proposed the dipIQ [37] method to generate quality-discriminable image pairs (DIPs) to solve the problem of insufficient training data. They used RankNet [38] to learn opinion-free BIQA (OF-BIQA) models from DIPs. The automatic DIP generation model was selected from MS-SSIM [39], VIF [40] and GMSD [41]. In addition, a nonlinear logistic function [42] was used to map the predictions of the three different models to the DMOS [42] of the Laboratory for Image and Video Engineering (LIVE) library. Ma et al. also proposed an end-to-end optimised multitask deep neural network, MEON [43], which first trains a distortion type recognition subnetwork and then trains a quality prediction subnetwork from the pretrained early layers and the output of the first subnetwork.
Gao et al. proposed blind image quality prediction, BLINDER [44], through multilevel deep representations. They extracted multilevel representations from the DNN model VGGNet [45], computed features at each layer and estimated a quality score for each feature vector, finally averaging the predicted scores to estimate the overall quality. Kim et al. proposed a deep-learning-based virtual reality IQA method (DeepVR-IQA) [46] built on a deep network consisting of a virtual reality quality score predictor and a human perception guider.
The VR quality score predictor encodes image patches and learns from the positional and visual characteristics of the image. The human perception guider refers to the subjective scores of human eyes through adversarial learning, and the combination of the two predicts the quality score. The performance of implicit NR-IQA algorithms is shown in Table 2.
Apparently, implicit NR-IQA has good continuity and ensures that information is extracted adequately. However, the extracted features lack evident physical significance, which produces a great deal of information redundancy, thereby increasing the model parameters and the difficulty of machine learning. Explicit NR-IQA, by contrast, has good physical meaning and interpretability. Its advantage lies in this interpretability, which offers insight into the relationships between features, and selective combination can obtain better stackability. Hence, the feature sets produced by explicit feature extraction are often used as the training basis to enhance other machine learning algorithms. However, the finally extracted feature information is incomplete. Figure 1 shows a comparison of explicit and implicit network structures. For this reason, the proposed EI-IQA combines explicit and implicit features and puts forward an experimental scheme. We combine the explicit and implicit features to describe the image characteristics and then develop an effective deep learning model. After training and adjusting the model parameters, we can finally obtain a reliable image quality score and distortion type. Compared with traditional IQA algorithms, the proposed EI-IQA combines the advantages of explicit and implicit features, extracts mixed features and establishes an end-to-end model based on a CNN. It uses a deep neural network to extract the spatial features representing the implicit features and the wavelet transform to extract the frequency domain features representing the explicit features. The frequency domain features supplement the spatial features, reducing the number of deep neural network layers, improving the generalisation capability, reducing the training difficulty and avoiding the loss of information during feature extraction. The combination of frequency domain and spatial domain information reduces information redundancy, and extracting mixed features reduces the algorithm's need for large training samples. The experimental results show that our scheme effectively improves the algorithm's performance.
In summary, our contributions are as follows:

• The proposed EI-IQA combines two kinds of features so that they complement each other: the implicit feature makes up for the incompleteness of the explicit feature, and the explicit feature makes up for the lack of physical meaning of the implicit feature extracted by the deep network. At the same time, the number of network layers, the network parameters and the complexity are reduced.

• We design a deep network that combines the advantages of explicit and implicit features, effectively reducing the dependence of the deep network on large training samples.

• Our results show that the implicit features extracted by the deep network are sufficient only when the network reaches a certain depth; that is, when the number of layers of the deep network is insufficient, the extracted features are redundant and incomplete.
The remainder of this paper is organised as follows. Section 2 details the proposed EI-IQA method and the parameter training process. Section 3 provides the experimental results and analysis. Finally, Section 4 concludes with a summary of our work and describes the future outlook.

System solutions
As research in the field of images grows more complicated, feature extraction has become an important part of many algorithm design processes. In many cases, the shallow features extracted by traditional methods cannot satisfy the requirements of the algorithm. To ensure sufficient extraction of information, most researchers use deep networks to automatically learn features from big data. However, features learned in this way lack evident physical significance, which produces a great deal of information redundancy, thereby increasing the model parameters and the difficulty of machine learning. In such approaches, a deep neural network is used to extract the implicit features, represented by the spatial features of the image, which are then input into a regression network to obtain the quality score and distortion type. To solve the problem of feature redundancy, which results in a large number of model parameters, we propose a new scheme that combines the explicit features, represented by manually extracted wavelet features, with the implicit features, represented by the spatial features extracted by the deep learning network. The proposed scheme is shown in Fig. 2.
For the input image, the spatial features are extracted through the VGG13 deep network model, and the frequency domain features in the wavelet domain are extracted using the wavelet transform. The spatial domain features represent the implicit features, and the frequency domain features represent the explicit features. The explicit features are combined with the implicit features, and mixed features are obtained through feature fusion. The mixed features are then input into the established regression network for multitask learning to obtain the final image quality score and distortion type. That is, by adding frequency domain information extraction on the basis of the deep network, using the implicit features extracted by the VGGNet model and the explicit features extracted by the wavelet transform, we propose our deep image quality assessment [47] method, EI-IQA. We use the combination of explicit and implicit features to describe the image and a deep network model to achieve effective learning, finally obtaining the quality score and distortion type.
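To make the fusion step concrete, the following is a minimal PyTorch sketch of how the 38-dimensional explicit wavelet feature and the implicit CNN feature could be concatenated and fed to a multitask regression head. The concatenation fusion, the hidden width and all names here are our illustrative assumptions, not the authors' exact design.

```python
# A minimal sketch of explicit-implicit feature fusion with a multitask head.
# Fusion by concatenation, the hidden size and the names are assumptions.
import torch
import torch.nn as nn

class FusionHead(nn.Module):
    def __init__(self, implicit_dim=512, explicit_dim=38, n_types=5):
        super().__init__()
        self.shared = nn.Sequential(
            nn.Linear(implicit_dim + explicit_dim, 256), nn.ReLU())
        self.score = nn.Linear(256, 1)        # quality score branch
        self.dtype = nn.Linear(256, n_types)  # distortion type branch

    def forward(self, implicit_feat, explicit_feat):
        mixed = torch.cat([implicit_feat, explicit_feat], dim=1)  # feature fusion
        h = self.shared(mixed)
        return self.score(h).squeeze(1), self.dtype(h)
```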
Compared with the previous solution, we add frequency domain information extraction, which yields more accurate and comprehensive features, and the combination of explicit and implicit features reduces information redundancy.
Because the explicit features complement the implicit features, the depth of the deep neural network needed to extract the implicit features is reduced, thereby reducing the number of training parameters and the difficulty of machine learning. Thus, our algorithm generalises better. In addition, because we extract a mixture of explicit and implicit features, which greatly reduces our demand for samples, our algorithm can obtain satisfactory results even on small sample libraries.

Feature extraction
Feature extraction in the proposed model combines the explicit features extracted in the wavelet domain with the implicit features extracted by the deep network, reducing the parameters to be trained and adding frequency domain information.

Explicit feature extraction based on kurtosis value in wavelet domain
According to existing research, compared with the original image, the coefficient distribution of a distorted image is flatter, with a lower peak and a longer tail. Its kurtosis value is invariant across frequency scales, so it can be used as a metric and as a feature to distinguish images with different degrees of distortion.
In the explicit feature extraction represented by the kurtosis value in the wavelet domain, a discrete wavelet transform (DWT) [49] is performed on the image using 38 Daubechies filters from low-frequency to high-frequency scales. Thus, $t = 38$ groups of wavelet sub-band coefficients are obtained in the low-frequency, horizontal, vertical and diagonal directions, denoted $L_t$, $H_t$, $V_t$ and $D_t$ $(t \in \{1, 2, 3, \cdots, 38\})$, respectively. For each $t$, the four sub-bands are merged into a new matrix, giving 38 matrices $J_t$ $(t \in \{1, 2, 3, \cdots, 38\})$.
The kurtosis value of the matrix $J_t$ is recorded as $K(J_t)$ and is computed as follows:

$$K(J_t) = \frac{k_4(J_t)}{k_2^2(J_t)} = \frac{\mu_4(J_t)}{\mu_2^2(J_t)} - 3$$

where $k_i(J_t)$ represents the $i$th cumulant of the matrix $J_t$, and $\mu_i(J_t)$ represents the $i$th central moment of the matrix $J_t$.
The 38 kurtosis values of an image are combined into a 38-dimensional feature vector, denoted as $E$, as follows:

$$E = \left[K(J_1), K(J_2), \cdots, K(J_{38})\right]^T$$

The wavelet frequency domain feature extraction process is shown in Fig. 3.
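As a concrete illustration, a minimal Python sketch of this extraction using PyWavelets and SciPy follows. The single-level 2D DWT and the db1-db38 filter indexing are assumptions read off the description above; the authors' exact filter settings may differ.

```python
# A minimal sketch of the explicit wavelet-kurtosis feature extractor.
import numpy as np
import pywt
from scipy.stats import kurtosis

def wavelet_kurtosis_features(image: np.ndarray) -> np.ndarray:
    """Return the 38-dimensional explicit feature vector E for a greyscale image."""
    features = []
    for t in range(1, 39):  # 38 Daubechies filters, db1 ... db38
        # Single-level 2D DWT: approximation (L_t) plus horizontal (H_t),
        # vertical (V_t) and diagonal (D_t) detail sub-bands.
        L, (H, V, D) = pywt.dwt2(image, f'db{t}')
        # Merge the four sub-bands into the matrix J_t.
        J = np.concatenate([c.ravel() for c in (L, H, V, D)])
        # K(J_t) = mu_4 / mu_2^2 - 3 (Fisher kurtosis), as in the equation above.
        features.append(kurtosis(J, fisher=True))
    return np.asarray(features)  # feature vector E, shape (38,)
```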

Implicit feature extraction based on deep CNN
The deep neural network avoids the loss of feature information: the original image is directly input into the network model for training, and feature learning is combined with training. The proposed EI-IQA uses the classic deep neural network model VGG to segment the input image into 32 × 32 blocks and extract features. After $n$ ($n = 13$) convolutional layers, the final implicit features, represented by the spatial features, are obtained, where $n$ is the number of convolutional layers in the VGG model.
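A minimal PyTorch sketch of this extractor follows, using torchvision's VGG13 convolutional stack as a stand-in for the authors' network; note that the layer configuration of torchvision's VGG13 may not match the authors' stated count of $n = 13$ convolutional layers exactly.

```python
# A minimal sketch of the implicit feature extractor on 32x32 patches,
# assuming torchvision's VGG13 convolutional stack as a stand-in.
import torch
import torchvision.models as models

class ImplicitExtractor(torch.nn.Module):
    def __init__(self):
        super().__init__()
        # Keep only the convolutional part of VGG13 (drop the classifier).
        self.features = models.vgg13(weights=None).features

    def forward(self, patches: torch.Tensor) -> torch.Tensor:
        # patches: (B, 3, 32, 32) image blocks.
        x = self.features(patches)   # -> (B, 512, 1, 1) after five pooling stages
        return torch.flatten(x, 1)   # implicit feature vectors, shape (B, 512)
```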

Model training
In the model training process, the images in the database are first normalised and then divided into 32 × 32 blocks. The multitask CNN is then used to predict each image block, and the predictions are finally combined over the original image to obtain the final results. Assuming that the original image is uniformly distorted, the resulting quality score is the average of the quality scores of all image blocks, and the distortion type of the final image is determined by a majority vote over the image blocks.
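The aggregation rule can be written down directly; the following sketch (with illustrative names) averages the patch quality scores and takes a majority vote over the predicted patch distortion types.

```python
# A minimal sketch of the patch-level aggregation described above.
import numpy as np

def aggregate(patch_scores: np.ndarray, patch_types: np.ndarray):
    """patch_scores: per-patch quality scores; patch_types: integer class indices."""
    quality = patch_scores.mean()        # image score = mean of patch scores
    votes = np.bincount(patch_types)     # count votes per distortion class
    distortion = int(votes.argmax())     # majority-vote distortion type
    return quality, distortion
```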

Image normalisation
The essence of neural network learning is to learn the distribution of the data; a difference between the distributions of the training and test data greatly reduces the generalisation ability of the model. Moreover, the neural network learning process is complicated: once a slight change occurs in a middle layer of the network, it is gradually amplified in subsequent layers, and each layer must re-adapt to the new data distribution at every iteration, greatly reducing the training speed. Therefore, normalised data preprocessing is an indispensable part of the model training process. The image is cut into 32 × 32 non-overlapping image blocks, and the local normalisation operation is performed as follows:

$$\hat{I}_{i,j} = \frac{I_{i,j} - \mu_{i,j}}{\sigma_{i,j} + C}$$

where $\hat{I}_{i,j}$ represents the locally normalised brightness image, $i \in \{1, 2, \cdots, M\}$, $j \in \{1, 2, \cdots, N\}$, $M$ and $N$ represent the height and width of the image, respectively, and $C$ is a constant that prevents the denominator from reaching zero. The calculation formulas for $\mu_{i,j}$ and $\sigma_{i,j}$ are as follows:

$$\mu_{i,j} = \sum_{k=-K}^{K} \sum_{l=-L}^{L} w_{k,l} I_{i+k,\,j+l}$$

$$\sigma_{i,j} = \sqrt{\sum_{k=-K}^{K} \sum_{l=-L}^{L} w_{k,l} \left(I_{i+k,\,j+l} - \mu_{i,j}\right)^2}$$

where $w_{k,l}$ $(k \in \{-K, \cdots, K\},\ l \in \{-L, \cdots, L\})$ is a 2D circularly symmetric Gaussian weighting [50] function, sampled out to three standard deviations and rescaled to unit volume, with $K = L = 3$.
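A minimal NumPy/SciPy sketch of this local normalisation follows, assuming the standard divisive-normalisation form implied by the symbols above, with $\sigma = 1$ so that the 7 × 7 window spans three standard deviations ($K = L = 3$).

```python
# A minimal sketch of the local (divisive) normalisation step.
import numpy as np
from scipy.ndimage import gaussian_filter

def local_normalise(image: np.ndarray, C: float = 1.0) -> np.ndarray:
    image = image.astype(np.float64)
    # Gaussian-weighted local mean mu(i, j); sigma=1, truncate=3 gives a
    # 7x7 window spanning three standard deviations (K = L = 3).
    mu = gaussian_filter(image, sigma=1.0, truncate=3.0)
    # Gaussian-weighted local deviation sigma(i, j); abs() guards against
    # tiny negative values from floating-point round-off.
    var = gaussian_filter(image ** 2, sigma=1.0, truncate=3.0) - mu ** 2
    sigma = np.sqrt(np.abs(var))
    # Divisive normalisation; C keeps the denominator away from zero.
    return (image - mu) / (sigma + C)
```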

Loss function
Stochastic gradient descent (SGD) and back propagation are used in the model to approximately minimise the loss during training: the gradients are weighted per task, and the network parameters are updated. Let $w_i$ be the $i$th network parameter, $p_i$ the learning rate for the $i$th parameter, $D_i^m$ the gradient of task $m$ with respect to $w_i$, and $\alpha_m$ the relative weight of task $m$. The update rule during the iteration process is as follows:

$$w_i \leftarrow w_i - p_i \sum_{m} \alpha_m D_i^m$$

The loss function describes the average absolute error between the predicted value and the target value and can be written as

$$l(x, y) = L = \{l_1, \cdots, l_N\}^T \tag{8}$$

where $N$ is the number of input samples and $l_i$ is the absolute error between the predicted value and the target value for the $i$th sample:

$$l_i = \left|X_1^{(i)} - X_2^{(i)}\right|$$

where $X_1^{(i)}$ and $X_2^{(i)}$ are the two input vectors (the prediction and the target) of the $i$th sample.
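A minimal PyTorch sketch of one weighted multitask update follows. The L1 quality loss and the per-task weights $\alpha_m$ follow the text; pairing it with a cross-entropy term for the distortion type branch is our assumption, and all names are illustrative.

```python
# A minimal sketch of a weighted multitask SGD step, assuming an L1 quality
# loss (as in Eq. (8)) and a cross-entropy distortion-type loss.
import torch

quality_loss = torch.nn.L1Loss()         # mean absolute error, per Eq. (8)
type_loss = torch.nn.CrossEntropyLoss()  # assumed loss for the type branch
alpha = {'quality': 1.0, 'type': 1.0}    # relative task weights alpha_m

def train_step(model, optimiser, patches, scores, labels):
    optimiser.zero_grad()
    pred_score, pred_type = model(patches)
    loss = (alpha['quality'] * quality_loss(pred_score, scores)
            + alpha['type'] * type_loss(pred_type, labels))
    loss.backward()     # back propagation: task gradients D_i^m, weighted by alpha_m
    optimiser.step()    # SGD update of the parameters w_i
    return loss.item()
```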

Methods
Apparently, explicit NR-IQA, represented by shallow machine learning, extracts features incompletely, but its physical significance is evident. On the contrary, implicit NR-IQA, represented by deep networks, extracts features adequately, but the lack of obvious significance causes information redundancy. Therefore, we combine both to extract features and propose the EI-IQA method, which addresses the shortcomings of insufficient explicit features and the unclear physical meaning of implicit features. The frequency domain features extracted by the wavelet transform represent the explicit features, and the spatial domain features extracted by the deep network represent the implicit features. The explicit and implicit features are extracted in parallel, and their combination is input into the regression network; we then directly obtain the final image quality score.

Results and discussion
To verify the performance of the algorithm, we conduct experiments on the LIVE [51], categorical subjective image quality (CSIQ) [52] and TID2013 [53] databases. A basic comparison of the three databases is shown in Table 3. These databases provide data sources for image quality assessment and play an important role. To ensure the consistency of training and testing in cross-database tests, we select only the same five distortion categories (JP2K, JPEG, fast-fading, Gaussian noise and Gaussian blur). The Spearman rank-order correlation coefficient (SROCC), Pearson linear correlation coefficient (PLCC) and distortion type classification accuracy (Acc) are used to assess the performance of the algorithm, given that the model outputs both distortion types and quality scores. To enhance the model comparison, in the selection of feature extraction [54] models, the SROCC, Kendall rank-order correlation coefficient (KROCC), root mean square error (RMSE), MSE, outlier ratio (OR), PLCC and distortion type classification Acc are used to assess the performance of the feature extraction models. During the experiments, the implicit feature extraction method was verified first: the VGGNet and ResNet models were used to extract the implicit features, and two models with different implicit features were established, namely, EI-VIQA and EI-RIQA. The comparative training results proved that VGGNet extracts the implicit features more effectively when the same explicit features are used. Then, to determine the optimal depth of VGGNet in the EI-VIQA model, we conduct an EI-VIQA ablation experiment. Finally, EI-VIQA is used as the implicit feature extraction model of the EI-IQA model. At the same time, an ablation supplement experiment is performed on EI-RIQA, which further proves the experimental idea: results similar to those of the original deep network can be achieved by adding explicit features to supplement the implicit features while reducing the network depth. Thus, the structure of the EI-IQA model is established.
Then, the assessment indexes of the proposed algorithm are compared with those of classic algorithms. Different assessment indexes have slightly different emphases. The comparison shows that the proposed EI-IQA achieves higher scores on a considerable part of the assessment indexes, proving the superiority of the algorithm. Finally, the established EI-IQA model is used for cross-database training, carried out for different distortion types. Among them, JP2K, JPEG, noise, blur and other distortion types are assessed well across multiple databases. The results prove that the model has good generalisation performance.

Model selection of the feature extraction
To extract better implicit features, the VGGNet and ResNet models were selected during the experiment, and the implicit features extracted by each model on the LIVE library were combined with the same explicit features as the feature input. The implicit feature extraction models EI-VIQA and EI-RIQA were established and compared to determine the better implicit feature extraction model. The results showed that the implicit features extracted by EI-VIQA achieve better results. Table 4 shows the comparison of the implicit features extracted by the VGGNet and ResNet models on the LIVE library.
The comparison in Table 4 between the implicit features extracted by VGGNet and those extracted by the ResNet model evidently shows that the implicit feature extraction of VGGNet has the advantages of smaller depth and better indexes.

Quantitative test results
Same-library training on the LIVE library is performed on the EI-IQA model determined by the experimental plan. EI-IQA and the 18 classic algorithms, including shallow and deep networks, discussed in Section 1 are trained together on the LIVE library. To prove the generalisation performance of the EI-IQA algorithm, the algorithm trained on the LIVE library was cross-library tested on the CSIQ and TID2013 databases and again compared with the 18 classic algorithms discussed in Section 1. Table 5 shows the IQA performance indexes tested on the LIVE database [55]. Table 6 shows the cross-database comparative results on TID2013, and Table 7 shows the cross-database comparative results on CSIQ. Figure 4 shows the cross-library training results of the algorithm on CSIQ [52] and TID2013 [53].
Tables 5, 6 and 7 show that the proposed EI-IQA still has considerable advantages over the classic algorithms on the CSIQ and TID2013 libraries. It is superior on some algorithm indicators, has good generalisation performance and satisfies the expected requirements. Figure 4 compares the assessment index results of the EI-IQA algorithm on different databases. The histogram clearly shows that the assessment indexes of the EI-IQA algorithm do not differ significantly across databases, which further proves that the EI-IQA algorithm has effective generalisation performance and greatly enhanced reliability.

Model testing for specific distortion types
To test the generalisation ability of the assessment model for different samples, we use the entire LIVE database as the training set and select the same distortion types (JPEG2000, JPEG compression, fast-fading, white noise and Gaussian blur) from the CSIQ and TID2013 databases as the test set to obtain the performance indexes of the algorithm. The SROCC assessment index is used to quantify the adaptability of the algorithm to different types of distortion, and it is compared with other algorithms. The comparative results under the LIVE library are shown in Table 8, those under the CSIQ library in Table 9, and those under the TID2013 library in Table 10. These tables show that the proposed EI-IQA is more sensitive to distortion types such as JP2K, JPEG, noise and blur and slightly less sensitive to fast-fading distortion.

Ablation experiment
The depth of the deep network is reduced by adding explicit features: the number of network layers of VGGNet is changed, and the frequency domain features extracted by the wavelet transform are combined with the network. By adding explicit features to supplement the implicit features, the number of VGGNet layers is reduced while the experimental results remain similar to those of the original deep network. In addition, cross-library testing of the model achieved good results compared with other experiments. Table 11 shows the performance indexes of the algorithm tested on the LIVE library, and Fig. 5 shows a histogram of the algorithm performance indicators tested on the LIVE library. The SROCC, PLCC and distortion type classification Acc of the proposed EI-IQA are the best among similar algorithms. In Table 11, the depth of the VGG network is reduced and then combined with explicit features of different degrees; no evident fluctuation in the assessment index results is found. This experiment aims to reduce the depth of the deep network by adding explicit features. Figure 5 visualises part of the results of Table 11, indicating that when the depth of the VGG network is reduced and combined with the explicit features, the assessment indicator results do not change significantly.
To further prove the experimental idea, the implicit feature extraction model is replaced: the ResNet model is used to extract the implicit features. The implicit features extracted by the ResNet model are supplemented by adding frequency domain features, which achieves similar final indexes whilst reducing the depth of the ResNet model. This finding further proves the experimental idea, that is, adding frequency domain features to supplement the implicit features reduces the depth of the model and the difficulty of parameter training. Table 12 shows the performance indexes of the ResNet model on the LIVE library, and Fig. 6 shows the corresponding histogram. In Table 12, the depth of the ResNet network is reduced and then combined with different levels of explicit features; no evident fluctuation is found in the assessment index results. As a supplementary experiment, this further proves the idea of the proposed EI-IQA that explicit features can be added to reduce the depth of the deep network. Figure 6 visualises part of the results of Table 12 and shows that when the depth of the ResNet network is reduced and combined with the explicit features, the assessment indicator results do not fluctuate significantly.
Thus, the structure of the EI-IQA algorithm is established: VGGNet is used to extract the implicit features, and the wavelet transform is used to extract the wavelet frequency domain features that represent the explicit features. The explicit features supplement the implicit features, thereby appropriately reducing the depth of the VGGNet model while achieving experimental results close to those at the original network depth. The extracted explicit and implicit features are fused into mixed features, which are then input into the established regression network [42] for multitask learning to obtain the final quality score and distortion category.

Discussion
Most IQA methods have a large demand for samples. Compared with classical methods, the proposed EI-IQA combines explicit and implicit features; the explicit features complement the implicit features, reducing the network depth and solving the problem of the large sample demand of traditional algorithms. The proposed EI-IQA constructs two different approaches to extract implicit features and chooses VGGNet as the better one. In future work, we hope to further optimise the extraction of implicit features and combine them with explicit features to obtain better mixed features, improve the generalisation ability of the model through its structure, and bring the assessment closer to the subjective perception of human eyes.

Conclusion
We build the EI-IQA method to extract features from two different approaches: combining explicit and implicit features to describe image characteristics makes up for the shortcomings of insufficient explicit features and the unclear physical meaning of implicit features. The proposed EI-IQA model avoids the loss of feature information: the original image is directly fed into the model for training, and the generalisation ability of the model is effectively improved. The explicit features complement the implicit features, and the mixed features serve as input; thus, the dependence of deep networks on large training samples is effectively reduced. We also construct two different approaches to extract implicit features, using VGGNet and ResNet, and our results suggest that VGGNet extracts features better. We then test the proposed EI-IQA on three different databases. Compared with some classical methods, the proposed EI-IQA obtains better scores on several indexes. We believe that the proposed EI-IQA has more effective generalisation performance and greatly enhanced reliability.