Explicit-implicit dual stream network for image quality assessment

Abstract

The communications industry has changed remarkably with the development of fifth-generation cellular networks. The image, as an indispensable component of communication, has attracted wide attention, so finding a suitable approach to assess image quality is important. We therefore propose a deep learning model for image quality assessment (IQA) based on an explicit-implicit dual stream network. We use frequency domain features, based on the kurtosis of wavelet transform coefficients, to represent explicit features and spatial features extracted by a convolutional neural network (CNN) to represent implicit features. On this basis, we construct an explicit-implicit (EI) parallel deep learning model, namely the EI-IQA model. The EI-IQA model is built on VGGNet, which extracts the spatial domain features; by adding the parallel wavelet kurtosis frequency domain features, the number of network layers of VGGNet is reduced, and the training parameters and sample requirements decline. We verify, by cross-validation on different databases, that the wavelet kurtosis feature fusion method based on deep learning extracts features more completely and generalises better. Thus, the method better simulates the human visual perception system, and its predictions are closer to the subjective perception of the human eye. The source code of the proposed EI-IQA model is available on GitHub at https://github.com/jacob6/EI-IQA.

1 Introduction

The emergence of the 5G era [1] has brought great innovation to the communications industry, and the demand for information transmission has increased under the bombardment of high-speed information streams. Evidently, different regions have different abilities to receive information because of their geographic locations and other factors. The image is the main carrier of visual information [2] because it reflects information intuitively, which is particularly important in the information transmission process. During image acquisition and transmission, different degrees of distortion [3] are caused by various factors, such as the processing system and environmental noise, and these distortions degrade the viewer's visual experience. Image quality directly affects the subjective perception of the human eye and the acquisition of image information. Therefore, research on IQA has aroused widespread concern.

In accordance with whether the human eye is required, IQA methods can be roughly divided into two types: (1) subjective assessment and (2) objective assessment [4]. The subjective assessment method uses human visual assessment [5] as the standard and assesses images on the basis of intuitive human visual experience; both the distorted image [6] and the original image are assessed. Although the subjective assessment method works well, it entails a heavy workload and thus does not satisfy the requirements of practical applications [7, 8]. By contrast, objective assessment is simpler than subjective assessment and has strong controllability: it uses a mathematical model to score the image directly without the need for an assessor, thereby saving time and resources. Thus, objective assessment has a better application prospect and has become the main approach in the IQA field [9]. Objective assessment methods can be further divided into full-reference, semi-reference and non-reference (or blind) IQA [10]; their main differences lie in the presence or absence of the reference image. The non-reference IQA (NR-IQA) method has the broadest application prospect because the original image is often unavailable in practical applications.

NR-IQA can be divided into two categories in accordance with how features are extracted. One type is the explicit NR-IQA method [11]. The features of the image to be tested are extracted first, and the extracted features are then input into a shallow regression network to obtain the final quality score. Represented by shallow machine learning, the BIQI [12] algorithm proposed by Moorthy et al. extracts statistical features in the wavelet domain of distorted images based on a two-level framework and uses support vector machines to classify image distortions. It calculates the probability of each distortion type being present and the quality corresponding to each distortion; the final quality is the weighted sum of the distortion probabilities and the corresponding qualities. On the basis of the BIQI model, Moorthy et al. proposed an image authenticity and integrity assessment model, the DIIVINE [13] algorithm, based on distortion type identification. This algorithm uses a steerable pyramid [14] to perform wavelet decomposition over orientation and scale, extracts the statistical characteristics of the wavelet coefficients after divisive normalisation [15], and then uses support vector machines to build a feature model. Saad et al. proposed the BLIINDS [16] algorithm and its improved version, BLIINDS-II [17]. The image is divided into blocks, and for each block, the statistical characteristics of the discrete cosine transform (DCT) [18] coefficients are extracted in the DCT domain to establish a support vector regression model. The model is based on the statistics of the local DCT coefficients and achieves performance that satisfies the requirements of real-time systems. The improved BLIINDS-II algorithm uses a simple expression, a low-dimensional feature space and a simple Bayesian prediction framework in the sparse DCT domain. Mittal et al. proposed the BRISQUE [19] algorithm, which establishes a regression model by extracting statistical features of the image's spatially normalised [20] coefficients. After the mean subtracted contrast normalised (MSCN) [21] coefficients are calculated, they are modelled by a symmetric generalised Gaussian distribution (GGD) [12] and an asymmetric GGD (AGGD [22]) to obtain statistical features. The nearest-neighbour algorithm is then used for downsampling, features are extracted at a second scale, and 36 features are finally obtained for each training image. On the basis of the BRISQUE method, Mittal et al. proposed a 'completely blind' image quality analyser, NIQE [23]. After the MSCN coefficients of the image to be tested are calculated, the image is partitioned into non-overlapping blocks, and a feature vector is extracted from each block by the same method as in the BRISQUE algorithm. Finally, a multivariate Gaussian (MVG) model is fitted to the extracted feature vectors to obtain the final quality assessment score. Zhang et al. proposed IL-NIQE [24], an upgraded version of NIQE, which measures image quality by calculating the distance between the distorted image and undistorted images under a multivariate Gaussian distribution model and performs principal component analysis (PCA) dimensionality reduction on the extracted feature vectors. The MVG calculation is then performed on the feature obtained from each image block to obtain the final quality. Zhang et al. proposed the DESIQUE [25] algorithm, which extracts features from the spatial and frequency domains, then downsamples the image, and finally obtains a quality score through a shallow regression network. Liu et al. proposed the SSEQ [26] algorithm, which divides the input image into blocks and calculates the local entropy mean, local entropy skewness, local spectral entropy mean and local spectral entropy skewness of each region. Twelve image features are then obtained by downsampling twice, and a quality score is finally obtained through a shallow regression network. A performance comparison of explicit NR-IQA algorithms is shown in Table 1.

Table 1 Explicit NR-IQA algorithm performance comparison

The early extraction of explicit features could not establish an end-to-end model [27] because of limited computational power. Therefore, features are extracted hierarchically to obtain a feature set, and corresponding operations are then performed on the feature set to obtain the image score. With the advancement of computing power and the emergence of deep networks, implicit NR-IQA was introduced. Implicit NR-IQA [11] feeds the image into the algorithm, establishes an end-to-end model and directly obtains the final image quality score. Represented by deep networks, the DeepIQA [28] model proposed by Hou et al. performs qualitative assessment through machine learning and outputs numerical scores. The image is represented by natural scene statistics [2], a deep model is trained and a classification framework is established; the extracted features are graded to correspond to different subjective feelings, and the qualitative labels are then converted into image quality assessment scores by pooling. The CNNIQA [29] algorithm proposed by Kang et al. is a BIQA model based on the CNN [30]. It takes image patches as input and is trained with backpropagation and related methods. The CNN works in the spatial domain, and feature extraction and regression are integrated into the CNN, deepening the network to improve its learning ability. Kang et al. proposed a follow-up algorithm, CNNIQA++ [31], which increases the number of convolutional layers on the basis of CNNIQA, modifies the fully connected layers and reduces the receptive field of the filters to estimate image quality and identify distortion simultaneously. Compared with CNNIQA, its training parameters are reduced by nearly 90%, because the small size of the training set limits the achievable depth. DeepBIQ [32], proposed by Bianco et al., is a BIQA model built on a CNN pretrained for classification tasks and transfer learning; the overall image quality is estimated by accumulating and averaging the prediction scores of image subregions. RankIQA [33], proposed by Liu et al., ranks image quality with three networks from shallow to deep and then transfers the trained network to a traditional CNN to estimate the absolute quality of a single image. The deepIQA [34] model proposed by Bosse et al., based on end-to-end training, contains 10 convolutional layers, five pooling layers and two fully connected layers. BIECON [35], proposed by Kim et al., is a blind image evaluator based on convolutional networks that generates local quality estimates and then aggregates them by regression to obtain a subjective score, where the local quality scores used for training are obtained by a full-reference method. Kim et al. proposed DIQA [36] to assess images with deep networks. The training process includes two parts: regression to objective error maps and subjective scoring. Two handcrafted features are used to capture specific distortion statistics caused by normalisation and feature mapping. Ma et al. proposed the dipIQ [37] method, which generates quality-discriminable image pairs to solve the problem of insufficient training data and uses RankNet [38] to learn opinion-unaware BIQA models from these pairs. The automatic pair generation model is selected from MS-SSIM [39], VIF [40] and GMSD [41]. In addition, a nonlinear logistic function [42] is used to map the predictions of the three different models to the DMOS [42] of the Laboratory for Image and Video Engineering (LIVE) database. Ma et al. proposed an end-to-end optimised multitask deep neural network, MEON [43], which first trains a distortion type recognition subnetwork and then trains a quality prediction subnetwork from the pretrained early layers and the output of the first subnetwork. Gao et al. proposed blind image quality prediction through multilevel deep representations, BLINDER [44]. They extracted multilevel representations from the DNN model VGGNet [45], computed features at each layer, estimated a quality score from each feature vector and finally averaged the prediction scores to estimate the overall quality. Kim et al. proposed a virtual reality IQA method based on deep learning (DeepVR-IQA) [46], a deep network consisting of a virtual reality quality score predictor and a human perception guider. The quality score predictor encodes image patches and learns from their positions and visual characteristics; the human perception guider refers to subjective human scores through adversarial learning, and their combination predicts the quality score. A performance comparison of implicit NR-IQA algorithms is shown in Table 2.

Table 2 Implicit NR-IQA algorithm performance comparison

Apparently, implicit NR-IQA has good continuity and ensures that information is extracted adequately. However, the lack of evident physical significance results in a great deal of information redundancy, which increases the number of model parameters and the difficulty of machine learning. Explicit NR-IQA has good physical meaning and interpretability. Its advantage lies in this interpretability, which offers insight into the relationships between features, and selective combination of features yields better stackability. Hence, people use feature sets obtained by explicit feature extraction as the basis of training to enhance other machine learning algorithms. However, the finally extracted feature information is incomplete. Figure 1 shows a comparison of explicit and implicit network structures.

Fig. 1 Comparison of explicit and implicit network structures

For this reason, the proposed EI-IQA combines explicit and implicit features and puts forward an experimental scheme. We combine the explicit and implicit features to describe image characteristics and then develop an effective deep learning model. After training and adjusting the model parameters, we can finally obtain a reliable image quality score and distortion type. Compared with traditional IQA algorithms, the proposed EI-IQA combines the advantages of explicit and implicit features, extracts mixed features and builds an end-to-end model around a CNN. It uses a deep neural network to extract spatial features representing the implicit features and the wavelet transform to extract frequency domain features representing the explicit features. The frequency domain features supplement the spatial features, reducing the number of deep neural network layers, improving the generalisation capability, reducing the training difficulty and avoiding the loss of information during feature extraction. The combination of frequency domain and spatial domain information reduces information redundancy, and extracting mixed features reduces the algorithm's demand for large training samples. The experimental results show that our scheme effectively improves the algorithm's performance.

Our contributions are summarised as follows:

  • The proposed EI-IQA combines the two types of features so that they complement each other: the implicit features make up for the insufficiency of the explicit features, and the explicit features make up for the unclear physical meaning of the implicit features extracted by the deep network. At the same time, the number of network layers, the number of network parameters and the complexity are reduced.

  • We design a deep network that combines the advantages of explicit features and implicit features, effectively reducing the dependence of the deep network on large training samples.

  • Our results show that the implicit features extracted by a deep network are sufficient only when the number of network layers reaches a certain depth; that is, when the depth of the deep network is insufficient, the extracted features are redundant and insufficient.

The remainder of this paper is organised as follows. Section 2 details the proposed EI-IQA method and the parameter training process. Section 3 provides the experimental results and analysis. Finally, Section 4 concludes with a summary of our work and describes the future outlook.

2 The proposed EI-IQA framework

2.1 System solutions

With the increasing complexity of research in the image field, feature extraction has become an important part of many algorithm design processes. In many cases, the shallow features extracted by traditional methods cannot satisfy the requirements of the algorithm. To ensure sufficient extraction of information, most researchers use deep networks to learn features automatically from big data. However, the lack of evident physical significance results in a great deal of information redundancy, which increases the number of model parameters and the difficulty of machine learning.

A deep neural network is used to extract the implicit features represented by the spatial features of the image, which are then input into a regression network to obtain the quality score and distortion type. To solve the problem of feature redundancy, which results in a large number of model parameters, we propose a new scheme that combines explicit features, represented by manually extracted wavelet features, with implicit features, represented by the spatial features extracted by the deep learning network. The proposed scheme is shown in Fig. 2.

Fig. 2 Framework of the proposed EI-IQA method

For the input image, the spatial features are extracted through the VGG13 deep network model, and the frequency domain features in the wavelet domain are extracted using the wavelet transform. The spatial domain features represent the implicit features, and the frequency domain features represent the explicit features. The explicit features are combined with the implicit features, and mixed features are obtained through feature fusion. The mixed features are then fed into the established regression network for multitask learning to obtain the final image quality score and distortion type. That is, by adding frequency domain information extraction on top of the deep network and combining the implicit features extracted by the VGGNet model with the explicit features extracted by the wavelet transform, we propose our deep image quality assessment [47] method, EI-IQA. We use a combination of explicit and implicit features to describe the image and a deep network model to achieve effective learning, finally obtaining the quality score and distortion type.
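To make this pipeline concrete, the following is a minimal PyTorch sketch of the dual-stream idea: a VGG13 convolutional branch for the implicit spatial features, a 38-dimensional wavelet kurtosis vector for the explicit features, concatenation as feature fusion, and a small two-head regressor for the quality score and the distortion type. The layer widths, the fusion by simple concatenation and the regression head are our assumptions for illustration; the released EI-IQA code may differ in detail.

```python
# A minimal PyTorch sketch of the dual-stream pipeline described above.
# The layer widths, fusion by concatenation and the two-head regressor are
# illustrative assumptions; the released EI-IQA code may differ in detail.
import torch
import torch.nn as nn
import torchvision.models as models

class DualStreamIQA(nn.Module):
    def __init__(self, n_explicit=38, n_distortions=5):
        super().__init__()
        # Implicit stream: VGG13 convolutional backbone (spatial features),
        # randomly initialised here.
        self.cnn = models.vgg13().features
        self.pool = nn.AdaptiveAvgPool2d(1)               # -> (B, 512, 1, 1)
        # Shared regressor over the fused explicit-implicit feature.
        self.fc = nn.Sequential(nn.Linear(512 + n_explicit, 256), nn.ReLU())
        self.score_head = nn.Linear(256, 1)               # quality score
        self.type_head = nn.Linear(256, n_distortions)    # distortion class

    def forward(self, patch, explicit_feat):
        # patch: (B, 3, 32, 32); explicit_feat: (B, 38) wavelet kurtosis vector
        x = self.pool(self.cnn(patch)).flatten(1)
        x = torch.cat([x, explicit_feat], dim=1)          # feature fusion
        x = self.fc(x)
        return self.score_head(x).squeeze(1), self.type_head(x)
```

A forward pass takes a batch of 32×32 patches together with their explicit feature vectors and returns one quality score and one distortion logit vector per patch.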

Compared with the previous solution, we add frequency domain information extraction, which yields more accurate and comprehensive features. The combination of explicit and implicit features reduces information redundancy. Because the explicit features are related to the implicit features, the depth of the deep neural network used to extract implicit features can be reduced, thereby reducing the number of training parameters and the difficulty of machine learning. Thus, our algorithm generalises better. In addition, because we extract a mixture of explicit and implicit features, our demand for samples is greatly reduced, and our algorithm can obtain satisfactory results even on small sample libraries.

2.2 Feature extraction

The frequency domain features extracted by the wavelet transform represent the explicit features, and the spatial domain features extracted by the deep network represent the implicit features. To solve the shortcomings of insufficient explicit features and the unclear physical meaning of implicit features, we propose the parallel explicit-implicit deep learning model EI-IQA. The implicit features extracted by VGGNet and the explicit features represented by the wavelet kurtosis [48] frequency domain features are used as the underlying feature vector X. The explicit and implicit features are extracted in parallel, and their combination is used as the input of the regression network.

Different from a traditional deep network, the proposed EI-IQA model combines the explicit features, represented by frequency domain features, with the implicit features, represented by spatial domain features, to form a mixed feature. This solves the shortcomings of insufficient explicit features and the unclear physical meaning of implicit features; the two types of features complement each other's advantages and disadvantages. In addition, by adding frequency domain information, the EI-IQA model can achieve an experimental effect similar to that of the original deep model while reducing the number of layers of the deep neural network and the number of training parameters.

2.2.1 Explicit feature extraction based on kurtosis value in wavelet domain

According to existing research, compared with that of the original image, the coefficient distribution of a distorted image is flatter, with a lower peak and a longer tail. Its kurtosis value is invariant across frequency scales and can therefore be used as a metric and as a feature to distinguish images with different degrees of distortion.

In the explicit feature extraction represented by the kurtosis value in the wavelet domain, the discrete wavelet transform (DWT) [49] is performed on the image using 38 Daubechies filters ranging from low to high frequency scales. Thus, t (t=38) groups of wavelet sub-band coefficients are obtained in the low-frequency, horizontal, vertical and diagonal directions, denoted \(L_t\), \(H_t\), \(V_t\) and \(D_t\) \((t=1,2,3,\dots,38)\), respectively. They are merged into 38 new matrices \(J_t\) \((t=1,2,3,\dots,38)\), as follows:

$$ J_{t}=\left[L_{t},H_{t},V_{t},D_{t}\right] $$
(1)

The kurtosis value of the matrix \(J_t\) is recorded as \(K_{(J_{t})}\) and computed as follows:

$$ K_{(J_{t})}=\frac{k_{4}(J_{t})}{k_{2}(J_{t})^{2}}=\frac{\mu_{4}(J_{t})}{\sigma(J_{t})^{4}} - 3 $$
(2)

where \(k_i(J_t)\) represents the ith cumulant of the matrix \(J_t\), \(\mu_i(J_t)\) represents the ith central moment of \(J_t\), and \(\sigma(J_t)\) is its standard deviation.

The 38 kurtosis values of an image are combined into a 38-dimensional feature vector, denoted as E, as follows:

$$ E=\left[K_{(J_{1})},K_{(J_{2})},\cdots,K_{(J_{38})}\right] $$
(3)

The wavelet frequency domain feature extraction process is shown in Fig. 3.

Fig. 3 Schematic of explicit feature extraction structure

2.2.2 Implicit feature extraction based on deep CNN

The deep neural network can avoid the loss of feature information: the original image is directly input into the network model for training, and feature learning is combined with training. The proposed EI-IQA uses the classic deep neural network model VGG to segment the input image into 32×32 patches and extract features. After n (n=13) convolutional layers, the final implicit features, represented by the spatial features, are obtained, where n is the number of convolutional layers of the VGG model used in the network.
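As a quick check of the implicit stream, a 32×32 patch passed through the convolutional part of a stock torchvision VGG13 (with its five pooling stages) yields a 512-channel feature of spatial size 1×1; the snippet below assumes torchvision's standard VGG13 configuration rather than the authors' modified network.

```python
# Shape check of the implicit stream: a 32x32 patch through the convolutional
# part of a stock torchvision VGG13 gives a 512-channel feature of size 1x1.
import torch
import torchvision.models as models

backbone = models.vgg13().features    # randomly initialised; shapes only
patch = torch.randn(1, 3, 32, 32)     # one 32x32 RGB patch
with torch.no_grad():
    feat = backbone(patch)
print(feat.shape)                     # torch.Size([1, 512, 1, 1])
```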

2.3 Model training

In the model training process, the images in the database are first normalised and then divided into 32×32 patches. The multitask CNN is then used to predict each image block, and finally the patch predictions are combined to obtain the result for the whole image. Assuming that the original image is uniformly distorted, the resulting quality score is the average of the quality scores of all image blocks, and the distortion type of the image is determined by a majority vote over the image blocks.
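A small sketch of this patch-level protocol is given below: the image is cut into non-overlapping 32×32 blocks, each block is scored by the network, the image score is the mean of the block scores, and the distortion type is decided by a majority vote. The helper names are hypothetical.

```python
# A sketch of the patch-level protocol described above; helper names are
# hypothetical.
import numpy as np

def split_into_patches(img, size=32):
    """Cut an HxW(xC) image into non-overlapping size-by-size blocks."""
    H, W = img.shape[:2]
    return [img[i:i + size, j:j + size]
            for i in range(0, H - size + 1, size)
            for j in range(0, W - size + 1, size)]

def aggregate_patch_predictions(patch_scores, patch_type_ids):
    """Mean quality score and majority-vote distortion type over all patches."""
    image_score = float(np.mean(patch_scores))
    image_type = int(np.argmax(np.bincount(np.asarray(patch_type_ids))))
    return image_score, image_type
```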

2.3.1 Image normalisation

The essence of neural network learning is to learn the distribution of the data. A difference between the distributions of the training data and the test data greatly reduces the generalisation ability of the model. Moreover, the neural network learning process is complicated: once a slight change occurs in an intermediate layer of the network, it is gradually amplified in subsequent layers, and that layer must adapt to the new data distribution in every iteration, greatly reducing the speed of neural network training. Therefore, normalised data preprocessing is an indispensable part of the model training process.

The image is cut into 32×32 non-overlapping image blocks, and the local normalisation operation is then performed as follows:

$$ \hat{I}\left(i,j\right)=\frac{I\left(i,j\right) - \mu\left(i,j\right)}{\sigma\left(i,j\right) + C} $$
(4)

where I(i,j) denotes the luminance of the input image at pixel (i,j) and \(\hat{I}(i,j)\) is its locally normalised value, with \(i \in \{1,2,\dots,M\}\) and \(j \in \{1,2,\dots,N\}\), where M and N represent the height and width of the image, respectively. C is a constant that prevents the denominator from reaching zero. The calculation formulas for μ(i,j) and σ(i,j) are as follows:

$$ \mu\left(i,j\right)=\sum\limits_{k=-K}^{k=K}\sum\limits_{l=-L}^{l=L}w_{k,l}I_{k,l}\left(i,j\right) $$
(5)
$$ \sigma\left(i,j\right)=\sqrt{\sum\limits_{k=-K}^{k=K}\sum\limits_{l=-L}^{l=L}w_{k,l}\left(I_{k,l}\left(i,j\right) - \mu\left(i,j\right)\right)^{2}} $$
(6)

where \(w_{k,l}\) \((k=-K,\dots,K;\ l=-L,\dots,L)\) represents a 2D circularly symmetric Gaussian weighting [50] function, sampled out to three standard deviations and rescaled to unit volume, with K=L=3.
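Equations (4)-(6) can be sketched as follows with SciPy; using gaussian_filter for the weighted local mean and standard deviation follows our reading of the text (the Gaussian kernel is normalised to unit volume and truncated at three standard deviations, so K = L = 3), while the kernel width sigma = 1 and C = 1 are illustrative assumptions.

```python
# A sketch of the local normalisation of Eqs. (4)-(6) with SciPy. The Gaussian
# kernel is unit-normalised and truncated at three standard deviations
# (K = L = 3); the kernel width sigma = 1 and C = 1 are illustrative choices.
import numpy as np
from scipy.ndimage import gaussian_filter

def mscn(image, sigma=1.0, C=1.0):
    """Return the locally normalised coefficients I_hat(i, j) of a greyscale image."""
    img = image.astype(np.float64)
    # mu(i, j): Gaussian-weighted local mean, Eq. (5).
    mu = gaussian_filter(img, sigma, truncate=3.0)
    # sigma(i, j): Gaussian-weighted local standard deviation, Eq. (6).
    var = gaussian_filter(img * img, sigma, truncate=3.0) - mu * mu
    sd = np.sqrt(np.maximum(var, 0.0))
    return (img - mu) / (sd + C)  # Eq. (4)
```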

2.3.2 Loss function

Stochastic gradient descent and backpropagation are used in the model to approximately minimise the loss during training. The gradients of the two tasks are weighted, and the network parameters are updated accordingly. Let \(w_i\) be the ith network parameter and \(p_i\) the learning rate for the ith parameter; \(D_{m}^{i}\) represents the gradient of task m with respect to \(w_i\), and \(\alpha_m\) represents the relative weight of task m. The update rule during each iteration is as follows:

$$ w_{i} \leftarrow w_{i} - p_{i}\sum\limits_{m=1}^{m=2}\alpha_{m}D_{m}^{i} $$
(7)

The loss function describes the average absolute error between the predicted value and the target value and can be written as follows:

$$ l(x,y)=L=\left\{l_{1},\cdots,l_{N}\right\}^{T} $$
(8)

where N is the number of input samples and \(l_N\) is the average absolute error between the predicted values and the target values when the number of input samples is N; the specific expression is as follows:

$$ l_{N}=\frac{1}{N}\sum\limits_{i=1}^{N}\parallel {\boldsymbol{X}}_{1}^{(i)} - {\boldsymbol{X}}_{2}^{(i)} \parallel $$
(9)

where \(\boldsymbol {X}_{1}^{(i)}\) and \(\boldsymbol {X}_{2}^{(i)}\) are the predicted and target vectors of the ith sample, and N is the number of input samples.
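The weighted multitask update of Eq. (7) with the mean-absolute-error loss of Eqs. (8) and (9) can be sketched in PyTorch as below. The cross-entropy loss for the distortion-type task, the equal task weights and the model interface (returning a score and distortion logits, as in the earlier sketch) are assumptions rather than the authors' exact training code.

```python
# A PyTorch sketch of one multitask training step implied by Eqs. (7)-(9).
import torch
import torch.nn as nn

score_loss = nn.L1Loss()            # mean absolute error, Eqs. (8)-(9)
type_loss = nn.CrossEntropyLoss()   # distortion-type task (assumed)
alpha = (1.0, 1.0)                  # relative task weights alpha_m

def training_step(model, optimizer, patches, explicit, target_score, target_type):
    optimizer.zero_grad()
    pred_score, pred_type = model(patches, explicit)
    loss = (alpha[0] * score_loss(pred_score, target_score)
            + alpha[1] * type_loss(pred_type, target_type))
    loss.backward()                 # backpropagate the weighted gradients
    optimizer.step()                # w_i <- w_i - p_i * sum_m alpha_m * D_m^i
    return loss.item()
```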

2.4 Methods

Apparently, explicit NR-IQA, represented by shallow machine learning, extracts features incompletely, but their physical significance is evident. On the contrary, implicit NR-IQA, represented by deep networks, extracts features adequately, but their unclear significance causes information redundancy. Therefore, we combine both of them to extract features and propose the EI-IQA method. The proposed EI-IQA solves the shortcomings of insufficient explicit features and the unclear physical meaning of implicit features. The frequency domain features extracted by the wavelet transform represent the explicit features, and the spatial domain features extracted by the deep network represent the implicit features. The explicit and implicit features are extracted in parallel, and their combination is input to the regression network. We then directly obtain the final image quality score.

3 Results and discussion

To verify the performance of the algorithm, we conduct experiments on the LIVE [51], categorical subjective image quality (CSIQ) [52] and TID2013 [53] databases. The basic comparison of the three databases is shown in Table 3. These databases provide a data source for image quality assessment and play an important role. To ensure the consistency of training and testing, when conducting cross-database tests, we only select the same five distortion categories (JP2K, JPEG, fast fading, Gaussian noise and Gaussian blur) for the test. Spearman's rank correlation coefficient (SROCC), the Pearson correlation coefficient (PLCC) and the distortion type classification accuracy (Acc) are used to assess the performance of the algorithm, given that the output of the model involves distortion types and quality scores. To enhance the model comparison, in the selection of feature extraction [54] models, the SROCC, Kendall rank correlation coefficient (KROCC), root mean square error (RMSE), MSE, outlier ratio (OR), PLCC and distortion type classification Acc are used to assess the performance of the feature extraction models.
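For reference, the correlation indicators above can be computed with SciPy as in the short sketch that follows; pred and mos are hypothetical arrays of predicted and subjective scores, and in practice PLCC and RMSE are often computed after a nonlinear mapping of the predictions, which is omitted here.

```python
# Computing the correlation indicators with SciPy; pred and mos are
# placeholder arrays of predicted and subjective scores.
import numpy as np
from scipy.stats import spearmanr, pearsonr, kendalltau

pred = np.array([0.91, 0.55, 0.78, 0.33, 0.62])  # placeholder predicted scores
mos = np.array([0.88, 0.60, 0.70, 0.40, 0.58])   # placeholder subjective scores

srocc, _ = spearmanr(pred, mos)             # SROCC: prediction monotonicity
plcc, _ = pearsonr(pred, mos)               # PLCC: linear correlation
krocc, _ = kendalltau(pred, mos)            # KROCC: rank agreement
rmse = np.sqrt(np.mean((pred - mos) ** 2))  # RMSE: prediction error
print(srocc, plcc, krocc, rmse)
```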

Table 3 Comparison of basic database conditions

During the experiment, the implicit feature extraction method was verified first. The VGGNet and ResNet models were used to extract the implicit features, and two models with different implicit features were established, namely EI-VIQA and EI-RIQA. The comparative training results proved that VGGNet extracts the implicit features more effectively when the same explicit features are used. Then, to determine the optimal depth of VGGNet in the EI-VIQA model, we conducted an EI-VIQA ablation experiment, and EI-VIQA was finally used as the implicit feature extraction model of the EI-IQA model. At the same time, a supplementary ablation experiment was performed on EI-RIQA, which further proved the correctness of the experimental idea: experimental results similar to those of the original deep network can be achieved by adding explicit features to supplement the implicit features while reducing the network depth. Thus, the structure of the EI-IQA model was established.

Then, the assessment indexes of the proposed algorithm are compared with those of the classic algorithms. The emphasis of different assessment indexes differs slightly. The comparison shows that the proposed EI-IQA achieves higher scores on a considerable number of assessment indexes, proving the superiority of the algorithm. Finally, the established EI-IQA model is used for cross-database training and is tested on different distortion types. Among them, JP2K, JPEG, noise, blur and other distortion types are assessed well across multiple databases. The results prove that the model has good generalisation performance.

3.1 Model selection of the feature extraction

To extract better implicit features, the VGGNet and ResNet models were selected during the experiment, and the implicit features extracted by each model on the LIVE library were combined with the same explicit features as inputs. The implicit feature extraction models EI-VIQA and EI-RIQA were established and compared to obtain the better implicit feature extraction model. The results showed that the implicit features extracted by EI-VIQA achieve better results. Table 4 shows the comparison of the implicit features extracted by the VGGNet and ResNet models on the LIVE library.

Table 4 Comparison between VGGNet and ResNet extracted features

In Table 4, the comparison between the implicit features extracted by VGGNet and those extracted by the ResNet model clearly shows that VGGNet's implicit feature extraction achieves excellent indexes with a smaller depth.

3.2 Quantitative test results

Same-library training on the LIVE library is performed on the EI-IQA model determined by the experimental plan. EI-IQA and the 18 classic algorithms, including shallow and deep networks, discussed in Section 1 are trained together on the LIVE library. To prove the generalisation performance of the EI-IQA algorithm, the algorithm trained on the LIVE library is tested cross-database on the CSIQ and TID2013 databases and is also compared with the 18 classic algorithms discussed in Section 1. Table 5 shows the IQA performance indexes tested on the LIVE database [55]. Table 6 shows the cross-database results of the algorithm on TID2013, and Table 7 shows the cross-database results on CSIQ. Figure 4 shows the cross-database results of the algorithm on CSIQ [52] and TID2013 [53].

Fig. 4 Proposed EI-IQA cross-database result histogram

Table 5 Test results on LIVE database
Table 6 Test results on TID2013 database
Table 7 Test results on CSIQ database

Tables 5, 6 and 7 show that the proposed EI-IQA still has considerable advantages over the classic algorithms on the CSIQ and TID2013 libraries. It is superior on several indicators, has good generalisation performance and satisfies the expected requirements. In Fig. 4, the assessment index results of the EI-IQA algorithm on different databases are compared. The histogram clearly shows that the assessment indexes of the EI-IQA algorithm on different databases do not differ significantly, further proving that the EI-IQA algorithm has effective generalisation performance and greatly enhanced reliability.

3.3 Model testing for specific distortion types

To test the generalisation ability of the assessment model for different samples, we use the entire LIVE database as the training set, and images with the same distortion types as the training samples (JPEG2000, JPEG compression, fast fading, white noise and Gaussian blur) in the CSIQ and TID2013 databases are used as the test set to obtain the performance indexes of the algorithm. The SROCC index is used to quantify the adaptability of the algorithm to different types of distortion, and it is compared with classic IQA algorithms. The adaptability of the proposed EI-IQA to different types of distortion differs. The comparative results on the LIVE library are shown in Table 8, those on the CSIQ library in Table 9 and those on the TID2013 library in Table 10.

Table 8 Test results of specific distortion types under the LIVE database
Table 9 Test results of specific distortion types under the CSIQ database
Table 10 Test results of specific distortion types under the TID2013 database

These tables show that the proposed EI-IQA is more sensitive to distortion types such as JP2K, JPEG, noise and blur, and slightly less sensitive to fast-fading distortion.

3.4 Ablation experiment

The depth of the deep network is reduced by adding explicit features, changing the number of network layers of VGGNet and combining the frequency domain features extracted by the wavelet transform. When the number of network layers of VGGNet is reduced and explicit features are added to supplement the implicit features, the experimental results are similar to those of the original deep network. In addition, cross-database testing of the model obtained good results compared with the other experiments. Table 11 shows the performance indexes of the algorithm tested on the LIVE library, and Fig. 5 shows a histogram of these performance indexes. The SROCC, PLCC and distortion type classification Acc of the proposed EI-IQA are the best among similar algorithms.

Fig. 5 Comparison histogram of VGGNet degrading results in LIVE

Table 11 Comparison of VGGNet degrading results

In Table 11, the depth of the VGG network is reduced and then combined with explicit features to different degrees; no evident fluctuation in the assessment index results is found. This experiment aims to reduce the depth of the deep network by adding explicit features. In Fig. 5, part of the results of Table 11 are visualised, showing that when the depth of the VGG network is reduced and combined with the explicit features, the assessment indicators do not change significantly.

To further prove the experimental idea, the implicit feature extraction model is replaced. The ResNet model is used to extract implicit features. The implicit features extracted by the ResNet model are supplemented by adding frequency domain features, which can achieve similar final indicators whilst reducing the depth of the ResNet model. This finding further proves the experimental idea, that is, adding frequency-domain features to supplement implicit features reduces the depth of the model and the difficulty of parameter training. Table 12 shows the performance index of the ResNet model under the LIVE library, and Fig. 6 shows the histogram of the performance index of the ResNet model under the LIVE library.

Fig. 6 Comparison histogram of ResNet degrading results in LIVE

Table 12 Comparison of ResNet degradation results

In Table 12, the depth of the ResNet network is reduced and then combined with different levels of explicit features; no evident fluctuation is found in the assessment index results. As a supplementary experiment to further prove the idea of the proposed EI-IQA, explicit features are added to reduce the depth of the deep network. In Fig. 6, part of the results of Table 12 are visualised. The figure shows that when the depth of the ResNet network is reduced and combined with the explicit features, the assessment indicators do not fluctuate significantly.

At this point, the structure of the EI-IQA algorithm has been established: VGGNet is used to extract the implicit features, and the wavelet transform is used to extract frequency domain features in the wavelet domain to represent the explicit features. The explicit features supplement the implicit features, thereby allowing the depth of the VGGNet model to be appropriately reduced while achieving experimental results close to those of the original network depth. The extracted explicit and implicit features are fused to form mixed features, which are then input into the regression network [42] for multitask learning to obtain the final quality score and distortion category.

3.5 Discussion

Most IQA methods have a large demand for samples. Compared with classical methods, the proposed EI-IQA combines explicit and implicit features; the explicit features complement the implicit features, reducing the network depth and alleviating the large sample demand of traditional algorithms. The proposed EI-IQA constructs two different approaches to extract implicit features and chooses VGGNet as the better one. In future work, we hope to further optimise the extraction of implicit features and combine them with explicit features to obtain better mixed features, improve the generalisation ability of the model through its structure and move closer to the subjective perception of the human eye.

4 Conclusion

We build the EI-IQA method to extract features from two different approaches: combining explicit and implicit features to describe image characteristics makes up for the shortcomings of insufficient explicit features and the unclear physical meaning of implicit features. The proposed EI-IQA model avoids the loss of feature information: the original image is fed directly into the model for training, and the generalisation ability of the model is effectively improved. The explicit features complement the implicit features, and the mixed features are taken as input; thus, the dependence of deep networks on large training samples is effectively reduced.

We also construct two different approaches to extract implicit features, using VGGNet and ResNet. Our results suggest that VGGNet extracts features better. We then test the proposed EI-IQA on three different databases. Compared with some classical methods, the proposed EI-IQA obtains better scores on several indexes. We believe that the proposed EI-IQA has more effective generalisation performance and greatly enhanced reliability.

Availability of data and materials

The Python source code of EI-IQA can be downloaded at https://github.com/jacob6/EI-IQA for public use and evaluation. You can change this program as you like and use it anywhere, but please cite its original source.

Abbreviations

AGGD: Asymmetric generalised Gaussian distribution
BIQA: Blind image quality assessment
CNN: Convolutional neural network
CSIQ: Categorical subjective image quality
DCT: Discrete cosine transform
DeepVR-IQA: Virtual reality IQA method based on deep learning
DMOS: Differential mean opinion score
DWT: Discrete wavelet transform
EI: Explicit-implicit
EI-IQA: Explicit-implicit dual stream network for image quality assessment
GGD: Generalised Gaussian distribution
IQA: Image quality assessment
KROCC: Kendall rank correlation coefficient
LIVE: Laboratory for image and video engineering
MOS: Mean opinion score
MSCN: Mean subtracted contrast normalised
MVG: Multivariate Gaussian
NR-IQA: Non-reference IQA
PCA: Principal component analysis
PLCC: Pearson correlation coefficient
RMSE: Root mean square error
SROCC: Spearman's rank correlation coefficient

References

  1. C. Luo, J. Ji, Q. Wang, X. Chen, P. Li, Channel state information prediction for 5G wireless communications: a deep learning approach. IEEE Trans. Netw. Sci. Eng.7(1), 227–236 (2020).

  2. C. Yan, B. Shao, H. Zhao, R. Ning, Y. Zhang, F. Xu, 3D room layout estimation from a single RGB image. IEEE Trans. Multimedia, 1–1 (2020).

  3. K. Gu, G. Zhai, X. Yang, W. Zhang, M. Liu, in 2013 IEEE International Conference on Image Processing, vol. 67. Subjective and objective quality assessment for images with contrast change, (2013), pp. 383–387.

  4. Z. Wang, A. C. Bovik, Modern image quality assessment (2006).

  5. F. Zhang, Y. Xu, in 2009 Chinese Control and Decision Conference, vol. 14. Image quality evaluation based on human visual perception, (2009), pp. 1487–1490.

  6. A. C. Bovik, Automatic prediction of perceptual image and video quality. Proc. IEEE. 101(9), 2008–2024 (2013).

  7. Z. Wang, A. C. Bovik, Mean squared error: love it or leave it? A new look at signal fidelity measures. IEEE Signal Proc. Mag.26(1), 98–117 (2009).

  8. Z. Wang, Applications of objective image quality assessment methods [applications corner]. IEEE Signal Proc. Mag.28(6), 137–142 (2011).

  9. G. J. Katuwal, J. Kerekes, R. Ramchandran, C. Sisson, N. Rao, in 2013 IEEE Western New York Image Processing Workshop (WNYIPW), vol. 37. Automatic fundus image field detection and quality assessment, (2013), pp. 9–13.

  10. Z. Wang, A. C. Bovik, H. R. Sheikh, E. P. Simoncelli, Image quality assessment: from error visibility to structural similarity. IEEE Trans. Image Process.13(4), 600–612 (2004).

  11. J. Kim, H. Zeng, D. Ghadiyaram, S. Lee, L. Zhang, A. C. Bovik, Deep convolutional neural models for picture-quality prediction: challenges and solutions to data-driven image quality assessment. IEEE Signal Proc. Mag.34(6), 130–141 (2017).

  12. A. K. Moorthy, A. C. Bovik, A two-step framework for constructing blind image quality indices. IEEE Signal Proc. Lett.17(5), 513–516 (2010).

  13. A. K. Moorthy, A. C. Bovik, Blind image quality assessment: from natural scene statistics to perceptual quality. IEEE Trans. Image Process.20(12), 3350–3364 (2011).

  14. M. Unser, N. Chenouard, D. Van De Ville, Steerable pyramids and tight wavelet frames in l2(Rd). IEEE Trans. Image Process.20(10), 2705–2721 (2011).

  15. M. Wainwright, O. Schwartz, E. Simoncelli, Natural image statistics and divisive normalization: modeling nonlinearity and adaptation in cortical neurons. Stat. Theor. Brain, 203–222 (2002).

  16. M. A. Saad, A. C. Bovik, C. Charrier, A DCT statistics-based blind image quality index. IEEE Signal Proc. Lett.17(6), 583–586 (2010).

  17. M. A. Saad, A. C. Bovik, C. Charrier, in 2011 18th IEEE International Conference on Image Processing, vol. 11. DCT statistics model-based blind image quality assessment, (2011), pp. 3093–3096.

  18. N. Ahmed, T. Natarajan, K. R. Rao, Discrete cosine transform. IEEE Trans. Comput.C-23(1), 90–93 (1974).

  19. A. Mittal, A. K. Moorthy, A. C. Bovik, No-reference image quality assessment in the spatial domain. IEEE Trans. Image Process.21(12), 4695–4708 (2012).

  20. D. L. Ruderman, Statistics of natural images. Netw. Comput. Neural Syst.5(4), 517–548 (1994).

  21. K. Gu, G. Zhai, X. Yang, W. Zhang, Using free energy principle for blind image quality assessment. IEEE Trans. Multimed.17(1), 50–63 (2015).

  22. A. Mittal, A. K. Moorthy, A. C. Bovik, No-reference image quality assessment in the spatial domain. IEEE Trans. Image Process.21(12), 4695–4708 (2012).

  23. A. Mittal, R. Soundararajan, A. C. Bovik, Making a 'completely blind' image quality analyzer. IEEE Signal Proc. Lett.20(3), 209–212 (2013).

  24. L. Zhang, L. Zhang, A. C. Bovik, A feature-enriched completely blind image quality evaluator. IEEE Trans. Image Process.24(8), 2579–2591 (2015).

  25. Z. Yi, D. M. Chandler, No-reference image quality assessment based on log-derivative statistics of natural scenes. J. Electron. Imaging. 22(4), 043025–104302522 (2013).

  26. L. Liu, B. Liu, H. Huang, A. C. Bovik, No-reference image quality assessment based on spatial and spectral entropies. Sig. Process. Image Commun. (2014).

  27. K. Ma, W. Liu, K. Zhang, Z. Duanmu, Z. Wang, W. Zuo, End-to-end blind image quality assessment using deep neural networks. IEEE Trans. Image Process.27(3), 1202–1213 (2018).

  28. W. Hou, X. Gao, D. Tao, X. Li, Blind image quality assessment via deep learning. IEEE Trans. Neural Netw. Learn. Syst.26(6), 1275–1286 (2015).

  29. L. Kang, P. Ye, Y. Li, D. Doermann, in 2014 IEEE Conference on Computer Vision and Pattern Recognition, vol. 14. Convolutional neural networks for no-reference image quality assessment, (2014), pp. 1733–1740.

  30. R. Girshick, J. Donahue, T. Darrell, J. Malik, in 2014 IEEE Conference on Computer Vision and Pattern Recognition, vol. 34. Rich feature hierarchies for accurate object detection and semantic segmentation, (2014), pp. 580–587.

  31. L. Kang, P. Ye, Y. Li, D. Doermann, in 2015 IEEE International Conference on Image Processing (ICIP). Simultaneous estimation of image quality and distortion via multi-task convolutional neural networks, (2015), pp. 2791–2795.

  32. S. Bianco, L. Celona, P. Napoletano, in Signal, Image and Video Processing, vol. 12. On the use of deep learning for blind image quality assessment, (2018), pp. 355–362.

  33. X. Liu, J. Van De Weijer, A. D. Bagdanov, in 2017 IEEE International Conference on Computer Vision (ICCV), vol. 18. Rankiqa: learning from rankings for no-reference image quality assessment, (2017), pp. 1040–1049.

  34. S. Bosse, D. Maniry, K. Müller, T. Wiegand, W. Samek, Deep neural networks for no-reference and full-reference image quality assessment. IEEE Trans. Image Process.27(1), 206–219 (2018).

  35. J. Kim, S. Lee, Fully deep blind image quality predictor. IEEE J. Sel. Top. Sig. Process.11(1), 206–220 (2017).

  36. J. Kim, A. Nguyen, S. Lee, Deep CNN-based blind image quality predictor. IEEE Trans. Neural Netw. Learn. Syst.30(1), 11–24 (2019).

  37. K. Ma, W. Liu, T. Liu, Z. Wang, D. Tao, dipiq: blind image quality assessment by learning-to-rank discriminable image pairs. IEEE Trans. Image Process.26(8), 3951–3964 (2017).

  38. C. Burges, T. Shaked, E. Renshaw, Learning to rank using gradient descent. Proc. 22nd Int. Conf. Mach. Learn.11(10), 89–96 (2005).

  39. Z. Wang, E. P. Simoncelli, A. C. Bovik, in The Thrity-Seventh Asilomar Conference on Signals, Systems Computers, 2003, vol. 2. Multiscale structural similarity for image quality assessment, (2003), pp. 1398–14022.

  40. H. R. Sheikh, A. C. Bovik, Image information and visual quality. IEEE Trans. Image Process.15(2), 430–444 (2006).

  41. W. Xue, L. Zhang, X. Mou, A. C. Bovik, Gradient magnitude similarity deviation: a highly efficient perceptual image quality index. IEEE Trans. Image Process.23(2), 684–695 (2014).

  42. H. R. Sheikh, M. F. Sabir, A. C. Bovik, A statistical evaluation of recent full reference image quality assessment algorithms. IEEE Trans. Image Process.15(11), 3440–3451 (2006).

  43. K. Ma, W. Liu, K. Zhang, Z. Duanmu, Z. Wang, W. Zuo, End-to-end blind image quality assessment using deep neural networks. IEEE Trans. Image Process.27(3), 1202–1213 (2018).

  44. G. Fei, J. Yu, S. Zhu, Q. Huang, T. Qi, Blind image quality prediction by exploiting multi-level deep representations. Pattern Recog.81:, 432–442 (2018).

  45. K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image recognition. Proc. ImageNet Chall.1409(15), 1–10 (2014).

  46. H. G. Kim, H. Lim, Y. M. Ro, Deep virtual reality image quality assessment with human perception guider for omnidirectional image. IEEE Trans. Circ. Syst. Video Technol.30(4), 917–928 (2020).

  47. C. Yan, Z. Li, Y. Zhang, Y. Liu, X. Ji, Y. Zhang, Depth image denoising using nuclear norm and learning graph model (2020).

  48. M. Boumahdi, J. Lacoume, in 1995 International Conference on Acoustics, Speech, and Signal Processing, vol. 3. Blind identification using the kurtosis: results of field data processing, (1995), pp. 1980–19833.

  49. M. J. Shensa, The discrete wavelet transform: wedding the a trous and Mallat algorithms. IEEE Trans. Sig. Process.40(10), 2464–2482 (1992).

  50. D. J. Field, Relations between the statistics of natural images and the response properties of cortical cells. J. Opt. Soc. Am. A-optics Image Sci. Vision. 4(12), 2379–2394 (1987).

  51. H. R. Sheikh, A. C. Bovik, L. Cormack, No-reference quality assessment using natural scene statistics: JPEG2000. IEEE Trans. Image Process.14(11), 1918–1927 (2005).

  52. E. C. Larson, D. M. Chandler, Most apparent distortion: full-reference image quality assessment and the role of strategy. J. Electron. Imaging. 19(1), 011006 (2010).

  53. N. Ponomarenko, L. Jin, O. Ieremeiev, Image database TID2013: Peculiarities, results and perspectives. Sig. Process. Image Commun.30:, 57–77 (2015).

  54. C. Yan, B. Gong, Y. Wei, Y. Gao, Deep multi-view enhancement hashing for image retrieval. arXiv. 29(10), 1–1 (2020).

  55. H. R. Sheikh, LIVE image quality assessment database release 2. Online (2005). http://live.ece.utexas.edu/research/quality.

Acknowledgements

The numerical calculations in this paper have been done on the supercomputing system in the Supercomputing Center of Wuhan University. The authors would like to thank Prof. Hongyan Zhang and Prof. Jiang Hao for the valuable opinions they have offered during our heated discussions.

Funding

This study is partially supported by the National Natural Science Foundation of China (NSFC) (Nos. 61871298 and 61671333) and the National Key Research and Development Program of China (Nos. 2018YFB0504501 and 2018YFB1201602).

Author information

Contributions

XY conducted the experiments and drafted the manuscript. HT and CK implemented the core method and performed the statistical analysis. GY designed the methodology. WZ modified the manuscript. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Weizheng Jin.

Ethics declarations

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Yang, G., Ding, X., Huang, T. et al. Explicit-implicit dual stream network for image quality assessment. J Image Video Proc. 2020, 48 (2020). https://doi.org/10.1186/s13640-020-00538-y
