Scalable kernel-based minimum mean square error estimate for light-field image compression
EURASIP Journal on Image and Video Processing volume 2018, Article number: 52 (2018)
Abstract
Light-field imaging can capture both spatial and angular information of a 3D scene and is considered a prospective acquisition and display solution that supplies a more natural and fatigue-free 3D visualization. However, a major obstacle in handling light-field data is its sheer volume. In this context, efficient coding schemes for this particular type of image are needed. In this paper, we propose a scalable kernel-based minimum mean square error (MMSE) estimation method to further improve the coding efficiency of light-field images and accelerate the prediction process. The whole prediction procedure is decomposed into three layers. By using different prediction methods in different layers, the coding efficiency of light-field images is further improved and the computational complexity is reduced on both the encoder and decoder sides. In addition, we design a layer management mechanism that determines which layers are employed to perform the prediction of the coding block by using the high correlation between the coding block and its adjacent known blocks. Experimental results demonstrate the advantage of the proposed compression method in terms of different quality metrics as well as the visual quality of views rendered from decompressed light-field content, compared to the HEVC intra-prediction method and several other prediction methods in this field.
Introduction
Light-field imaging, also referred to as plenoptic imaging, holoscopic imaging, and integral imaging, can capture both spatial and angular information of a 3D scene and enable new possibilities for digital imaging [1]. Light fields (LFs) captured by light-field imaging represent the intensity and direction of the light rays emanating from a 3D scene. The full LFs can be represented by a seven-dimensional plenoptic function L = L(x, y, z, θ, ϕ, λ, t) introduced by Adelson and Bergen [2], where (x, y, z) is the viewing position, (θ, ϕ) is the light ray direction, λ is the light ray wavelength, and t is the time. The seven-dimensional plenoptic function is further simplified into four dimensions by considering only information taken in a region free of occlusions at a single time instance [3, 4]. The simplified 4D function L = L(u, v, x, y) can represent a set of light rays by being parametrized as an intersection of rays with two planes, where uv describes the ray position in the aperture (object) plane and xy describes the ray position in the image plane. Under this point of view, the LFs can be explored from the digital perspective with the advances in computational photography [5].
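To make the two-plane parametrization concrete, the 4D function L(u, v, x, y) can be pictured as a 4D array in which fixing (u, v) yields one sub-aperture view and fixing (x, y) yields the angular samples of a single scene point. A minimal sketch (the array sizes and names are our own illustration, not from the paper):

```python
import numpy as np

# Toy 4D light field L(u, v, x, y): (u, v) indexes the aperture-plane
# (view) position, (x, y) indexes the image-plane pixel of that view.
U, V, X, Y = 5, 5, 32, 32                    # 5x5 views of 32x32 pixels
L = np.random.default_rng(0).random((U, V, X, Y))

# Fixing (u, v) extracts one sub-aperture view of the scene.
center_view = L[U // 2, V // 2]              # shape (32, 32)

# Fixing (x, y) gathers the angular samples of one scene point
# across all views.
angular_slice = L[:, :, 16, 16]              # shape (5, 5)
```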
There are several techniques that can be utilized to capture a light-field image, such as coded apertures, multi-camera arrays, and microlens arrays. In the microlens-array-based technique, the most common way to capture the LF image is by using a plenoptic camera, such as the ones produced by Lytro. The commercially available plenoptic cameras can be divided into two categories: the standard plenoptic camera and the focused plenoptic camera. The focused plenoptic camera can provide a trade-off between spatial and angular information by putting the focal plane of the microlenses away from the image sensor plane. Owing to the wide range of possible applications and the rapid development of light-field technology, many research groups have considered the standardization of light-field applications. The JPEG working group has started a new study, known as JPEG Pleno [6], aiming at richer image capturing, visualization, and manipulation. The MPEG group has also been pursuing the third phase of free-viewpoint television (FTV) since 2013, targeting super multiview, free navigation, and full parallax imaging applications [7].
Since the acquired LF image records both spatial and angular information of a 3D scene, it can naturally provide the benefits of rendering views at different viewpoints and at different focal planes, which expands its range of applications. However, these enhanced features come at the cost of a vast amount of data acquired for each LF image. Even though many image coding methods [8,9,10,11] have been proposed, they cannot be directly used for LF images. Therefore, efficient compression schemes for this particular type of image are needed for effective transmission and storage.
According to the available techniques to capture and visualize LF images, the compression schemes of LF images can be mainly categorized into two different categories, where the general workflow for LF image acquisition and visualization is depicted in Fig. 1. The first kind of compression method, called the spatial-correlation-based compression method, compresses the acquired lenslet image directly (see Fig. 1a) based on the fact that the elementary images (EIs) of the lenslet image exhibit repetitive patterns and a large amount of redundancy exists between neighboring EIs, which can be seen in Fig. 2a. By exploiting the inherent non-local spatial redundancy of the LF image, a coding method combining locally linear embedding (LLE) is proposed in [12]. This work is further improved by combining the LLE-based method and a self-similarity (SS)-based compensated prediction method in [13]. Paper [14] puts forward a disparity-compensation-based light-field image coding algorithm by exploring the high spatial correlation existing in LF images, which is further improved by using kernel-based minimum mean-square-error estimation prediction [15] and a Gaussian-process-regression-based prediction method [16]. In [17], Conti et al. introduced the SS mode into HEVC to improve the coding efficiency of light-field images, which is similar to the intra-block copy (IntraBC) incorporated into the HEVC range extension to code screen contents. To further improve the coding performance, a bi-predicted SS estimation and SS compensation are proposed in [18], where the candidate predictor can also be devised as a linear combination of two blocks within the same search window. A displacement intra-prediction scheme for LF contents is proposed in [19], where more than one hypothesis is used to reduce prediction errors.
The other kind of compression method, called the pseudo-sequence-based compression method, considers creating a 4D LF representation of the LF image prior to compression (see Fig. 1b). The pseudo-sequence-based coding methods try to decompose the LF image into multiple views, which can be seen in Fig. 2b. The derived multiple views are then organized into a sequence to make full use of the inter-correlations among the various views. In [20], a sub-aperture image streaming scheme is proposed to compress the lenslet images, in which rotation scan mapping is adopted to further improve compression efficiency. A pseudo-sequence-based scheme for LF image compression is proposed in [21], in which the specific coding order of views, the prediction structure, and rate allocation have been investigated for encoding the pseudo-sequence. A new LF multiview video coding prediction structure, extending the inter-view prediction into a two-directional parallel structure, is designed in [22] to analyze the relationship of the prediction structure with its coding performance. In [23], a lossless compression method for rectified LF images is presented to exploit the high similarity existing among the sub-aperture images or view images. A novel pseudo-sequence-based 2D hierarchical reference structure for light-field image compression is proposed in [24], where a 2D hierarchical reference structure is used with distance-based reference frame selection and spatial-coordinate-based motion vector scaling to better characterize the inter-correlations among the various views decomposed from the light-field image.
Although the pseudo-sequence-based compression method can compress the LF image effectively, the process of deriving the 4D LF view images from the raw sensor data strongly depends on the exact acquisition device. In contrast, the spatial-correlation-based compression method does not need to extract view images from the LF image, and this kind of method has the potential to achieve better coding efficiency if we can make full use of the high spatial correlation between adjacent EIs. Therefore, in this paper, we follow the spatial-correlation-based compression approach and propose a scalable kernel-based minimum mean square error (MMSE) estimation method to effectively compress the LF image by exploiting such high spatial correlation. The contributions of this paper are as follows:

1)
Hybrid kernel-based MMSE estimation and intra-block copy for LF image compression. The kernel-based MMSE estimation aims to predict the coding block by using an MMSE estimator whose required probabilities are obtained through kernel density estimation (KDE). Although the kernel-based MMSE estimation method can achieve high coding efficiency, it does not always lead to a good prediction for the unknown block to be predicted, especially in non-homogeneous texture areas. Fortunately, the unknown blocks located in such areas can be better predicted by using a direct match with the block to be predicted. Therefore, we combine the kernel-based MMSE estimation method with the IntraBC mode to further improve the overall coding efficiency.

2)
Scalable kernel-based MMSE estimate to accelerate LF image compression. The kernel-based MMSE estimation method is time-consuming on both the encoder and decoder sides, and the burden is even heavier for the hybrid prediction method on the encoder side. Therefore, we propose a scalable kernel-based MMSE estimation method to alleviate these shortcomings. In the scalable method, the reconstruction framework is decomposed into a set of reconstruction layers ranging from a basic layer that produces a rough yet fast estimation to more complex layers yielding high-quality results.

3)
Adaptive layer management mechanism. We use the prediction mode information and the gradient information to decide which layer the current coding block belongs to, based on how complex its surrounding area is.
Part of this work has been published in [25]. In this paper, we give more details of the theoretical analysis and propose a scalable kernel-based MMSE estimation method to accelerate LF image compression, and we also provide an adaptive layer management mechanism. Experimental results demonstrate the advantage of the proposed scalable compression method in terms of different quality metrics as well as the visual quality of views rendered from decompressed LF content.
The rest of this paper is organized as follows. An overview of the kernel-based MMSE estimation method is given in Section 2. The proposed scalable kernel-based MMSE estimate and its different reconstruction layers are described in Section 3. Section 4 gives the details of the layer selection mechanism. Experimental results are presented and analyzed in Section 5, and concluding remarks are given in Section 6.
Kernel-based MMSE estimation method
The kernel-based MMSE estimation method aims to predict the current coding block given its known neighboring context, from a kernel-based point of view, by constructing a statistical model and calculating a kernel-based MMSE estimate. In order to construct the statistical model, the pixel values in the coding block and its known neighboring context are arranged into a multidimensional formalism. Kernel density estimation (KDE) is used to estimate the probability density function (PDF) of the statistical model with a set of observed vectors. The coding block is then predicted from an MMSE estimator given the PDF.
Let the pixel values in the current coding block be stacked in a column vector x_{0}, and the pixel values in its neighboring templates with template thickness T be compacted in a column vector y_{0}, as shown in Fig. 3a. The current coding block and its neighboring templates with template thickness T are called the prototype region in this paper. Therefore, the main goal of the prediction method can be expressed as deriving the MMSE estimate E[x|y_{0}] of vector x_{0} given its context y_{0}. In order to do so, we arrange the vectors x_{0} and y_{0} into a multidimensional vector z_{0} = (x_{0}, y_{0}), and a random vector variable z = (x, y) that has the same configuration as z_{0} is considered to capture the signal's statistical behavior. If the PDF of z is acquired, the MMSE estimator E[x|y_{0}] can be derived from it. In the proposed scheme, we propose to utilize KDE to estimate the PDF of the random variable z given a set of observed vectors {z_{k} | k = 1, …, K}. Here, the observed vectors are composed of KNN patches, the top K closest templates that have the same configuration as the prototype region in terms of Euclidean distance, derived within the specified horizontal and vertical search windows, as shown in Fig. 3b. Since the coding block in the prototype region is unknown during the KNN patch searching procedure, its neighboring blocks are used for searching. We set the template thickness to T_{1}, which equals the size of the current coding block, to increase the searching accuracy, as shown in Fig. 3b.
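The KNN patch search described above can be sketched as follows. This is an illustrative simplification with our own function names; for brevity it only excludes the not-yet-coded area at and after the current block on its own row, whereas a real codec would restrict the search to fully reconstructed pixels:

```python
import numpy as np

def template_pixels(img, r, c, B, T):
    """L-shaped context of thickness T above and to the left of the
    B x B block whose top-left corner is (r, c)."""
    top = img[r - T:r, c - T:c + B].ravel()   # strip above (incl. corner)
    left = img[r:r + B, c - T:c].ravel()      # strip to the left
    return np.concatenate([top, left])

def knn_patches(img, r, c, B, T, K, win):
    """Top-K candidate positions whose templates best match the current
    block's template (squared Euclidean distance), searched within a
    window of radius `win`."""
    target = template_pixels(img, r, c, B, T)
    cands = []
    for rr in range(max(T, r - win), r + 1):
        for cc in range(max(T, c - win), min(c + win, img.shape[1] - B) + 1):
            if rr == r and cc >= c:           # skip not-yet-coded area
                break
            d = float(np.sum((template_pixels(img, rr, cc, B, T) - target) ** 2))
            cands.append((d, rr, cc))
    cands.sort(key=lambda t: t[0])            # nearest templates first
    return cands[:K]
```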
Given the observed vectors, the estimator of the PDF of z using KDE with a Gaussian kernel \( K\left(\mathbf{u}\right)=\exp \left(-{\mathbf{u}}^{\mathrm{T}}\mathbf{u}/2\right)/\sqrt{2\pi } \) can be defined by [26],
\( p\left(\mathbf{z}\right)=\frac{1}{K}\sum_{k=1}^{K}{K}_Z^{(k)}\left(\mathbf{z}\right) \)  (1)
where the matrix H is called the bandwidth, controlling the smoothness of the resulting PDF. \( {K}_Z^{(k)}\left(\mathbf{z}\right) \) can be considered as a multivariate Gaussian with mean z_{k} and covariance matrix H = HH^{T}. The covariance matrix H is also referred to as the bandwidth for simplicity, and it can be decomposed as
\( \mathbf{H}=\left[\begin{array}{cc}{\mathbf{H}}_{XX}& {\mathbf{H}}_{XY}\\ {\mathbf{H}}_{YX}& {\mathbf{H}}_{YY}\end{array}\right] \)  (2)
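As a sanity check on the statistical model: the KDE estimate of p(z) is an equal-weight mixture of Gaussians, one component per observed vector z_k with shared covariance HH^T. A small sketch with our own naming, evaluating that mixture density at a point:

```python
import numpy as np

def kde_pdf(z, samples, H):
    """KDE estimate of p(z): an equal-weight Gaussian mixture with
    means z_k (the observed vectors) and shared covariance HH^T."""
    Sigma = H @ H.T                           # mixture covariance
    inv = np.linalg.inv(Sigma)
    _, logdet = np.linalg.slogdet(Sigma)
    d = z.shape[0]
    log_norm = -0.5 * (d * np.log(2.0 * np.pi) + logdet)
    comps = [np.exp(log_norm - 0.5 * (z - zk) @ inv @ (z - zk))
             for zk in samples]
    return float(np.mean(comps))              # a priori weight 1/K each
```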
With the knowledge of p(z), we can calculate the MMSE estimator E[x|y_{0}] of vector x_{0}. We find from Eq. (1) that p(z) has the form of a Gaussian mixture model (GMM) with a priori probabilities 1/K and covariance matrix H. Therefore, it is reasonable to utilize the expressions of the MMSE estimator under the GMM model [26, 27]. The MMSE estimator of the coding block can be expressed as,
where \( {K}_Y^{(k)}\left(\mathbf{y}\right) \) is the marginal kernel for y, with mean y_{k} and covariance matrix H_{YY}. According to Eqs. (3)–(5), we can obtain the prediction of the coding block, and the estimation method is referred to as the kernel-based MMSE (KMMSE) estimation method. For simplicity, the KMMSE estimate can be rewritten as
\( {\widehat{\mathbf{x}}}_0={\tilde{\mathbf{x}}}_0+{\mathbf{H}}_{XY}{\mathbf{H}}_{YY}^{-1}\left({\mathbf{y}}_0-{\tilde{\mathbf{y}}}_0\right) \)  (6)
where \( {\tilde{\mathbf{x}}}_0 \) and \( {\tilde{\mathbf{y}}}_0 \) express the linear predictions of x_{0} and y_{0} from the set of vectors z_{k} (k = 1, …, K) by
\( {\tilde{\mathbf{x}}}_0=\sum_{k=1}^{K}{\omega}_k\left({\mathbf{y}}_0\right){\mathbf{x}}_k,\kern1em {\tilde{\mathbf{y}}}_0=\sum_{k=1}^{K}{\omega}_k\left({\mathbf{y}}_0\right){\mathbf{y}}_k \)  (7)
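The resulting two-term KMMSE prediction, a weighted linear prediction of the block plus the context residual transferred into block space, can be sketched as follows (hypothetical variable names; the weights and bandwidth sub-blocks are assumed already given):

```python
import numpy as np

def kmmse_predict(y0, xs, ys, weights, H_XY, H_YY):
    """Two-term KMMSE prediction: a weighted linear prediction of the
    block plus the context residual mapped into block space by the
    transfer matrix H_XY @ inv(H_YY)."""
    x_lin = sum(w * xk for w, xk in zip(weights, xs))   # linear prediction of x0
    y_lin = sum(w * yk for w, yk in zip(weights, ys))   # linear prediction of y0
    transfer = H_XY @ np.linalg.inv(H_YY)               # context -> block space
    return x_lin + transfer @ (y0 - y_lin)              # add correction term
```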
There are two issues to be tackled in the KMMSE estimation method. One is to derive the weight vector ω_{k}(y_{0}). The other is to estimate the kernel bandwidth matrix H.
In order to derive the weight vector, we adopt a direct approach, namely minimizing the residual energy ε(ω) by solving a squared error function, where ε(ω) is given by
\( \varepsilon \left(\boldsymbol{\omega}\right)={\left\Vert {\mathbf{y}}_0-\sum_{k=1}^{K}{\omega}_k{\mathbf{y}}_k\right\Vert}^2 \)  (8)
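Minimizing this residual energy is an ordinary least-squares problem over the K template vectors. A sketch of this step (our own naming, assuming unconstrained weights):

```python
import numpy as np

def template_weights(y0, ys):
    """Weights minimizing the residual energy ||y0 - sum_k w_k * y_k||^2,
    obtained as an ordinary least-squares solution over the K templates."""
    Y = np.stack(ys, axis=1)                  # columns are the templates y_k
    w, *_ = np.linalg.lstsq(Y, y0, rcond=None)
    return w
```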
In order to estimate the kernel bandwidth matrix, we propose a new bandwidth estimation (BE) method based on an analysis of the physical interpretation of the KMMSE estimation method. From Eq. (6), we find that the KMMSE estimator consists of two parts. The first part is a linear prediction of vector x_{0}, and the second part is a correction vector representing the unpredictable part of y_{0} transformed into subspace x [26]. The physical interpretation of the second part is that a vector x_{0} close to vector y_{0} is likely to have an unpredictable part similar to that of y_{0}. The matrix \( {\mathbf{H}}_{XY}{\mathbf{H}}_{YY}^{-1} \) can be regarded as a transfer matrix used to transfer the unpredictable part of y_{0} to subspace x. It is reasonable to infer that the bandwidth matrix H in the KMMSE estimation is used to measure the similarity of subspace x and subspace y. Therefore, we propose to utilize Eq. (9) to estimate the bandwidth matrix H approximately.
where η is a hyperparameter and is set to 1.0 in the proposed system.
Scalable kernel-based MMSE (SKMMSE) estimation
The kernel-based MMSE estimation method is a powerful prediction method and can achieve good prediction accuracy for LF contents in homogeneous texture areas. However, some shortcomings still exist. Firstly, since the method predicts the coding blocks given their known neighboring contexts, it does not always lead to a good prediction for the unknown block in some non-homogeneous texture areas. Secondly, the kernel-based MMSE estimation method is time-consuming, especially on the decoder side. Thirdly, for visually flat regions in homogeneous texture areas, applying the kernel-based MMSE estimation method would be overkill, since similar reconstruction quality could be achieved by simpler (and, therefore, faster) estimators when dealing with relatively simple structures. To this end, this paper proposes a scalable kernel-based MMSE estimation method, also called SKMMSE, that aims at further improving the coding efficiency and accelerating the prediction process by decomposing the prediction procedure into different layers. The proposed SKMMSE algorithm comprises three prediction layers. The layer division is based on the contents of the previously encoded blocks adjacent to the coding block. The higher the layer within the scalable hierarchy, the higher the computational complexity. The scalable layers are introduced in the next subsections, and the layer management mechanism is illustrated in the next section.
Hybrid prediction layer (HPL)
We have mentioned above that the KMMSE estimation method does not always lead to a good prediction for unknown blocks in non-homogeneous texture areas. In order to improve the prediction accuracy for blocks in such texture areas, we propose to use the hybrid kernel-based MMSE estimation and IntraBC method (hybrid prediction method) to predict the coding blocks; the coding blocks in such texture areas constitute the hybrid prediction layer. The hybrid prediction method is based on the HEVC screen content coding (HEVC-SCC) framework. In the hybrid prediction method, the KMMSE estimation method, IntraBC prediction, and intra-directional prediction are all used as competing prediction modes. The proposed hybrid prediction explores the idea of using the IntraBC scheme or intra-directional prediction to find the best prediction of the coding block \( {\widehat{\mathbf{x}}}_0^{HPL} \) when the KMMSE estimation method fails, based on the rate-distortion optimization (RDO) procedure. In other words, the hybrid prediction method uses the "try all then select best" intra-mode decision method to find the best prediction mode and optimal depth for each coding block.
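The "try all then select best" decision amounts to evaluating every competing mode and keeping the one with the smallest Lagrangian rate-distortion cost J = D + λR. A schematic sketch (the candidate structure and λ value are illustrative, not taken from the HEVC-SCC software):

```python
def best_mode(candidates, lam):
    """'Try all then select best': evaluate every competing prediction
    mode and keep the one with the lowest RD cost J = D + lambda * R."""
    return min(candidates, key=lambda m: m["distortion"] + lam * m["rate"])
```

Note that the optimal mode depends on λ: a larger λ penalizes rate more heavily and can shift the decision toward cheaper-to-signal modes.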
It is worth noting that the KMMSE estimation method is introduced into HEVC-SCC by replacing one of the existing 35 intra-directional prediction modes in order to avoid modifying the bitstream structure, which means that the prediction samples generated by the KMMSE estimation method replace the outputs produced by the substituted intra-directional prediction mode.
KMMSE prediction layer (KPL)
Apart from the non-homogeneous texture areas, in many cases we are dealing with homogeneous texture areas. In these areas, the coding block and its adjacent reconstructed blocks share a similar texture structure, which means that they are highly correlated. For such areas, we propose to use the KMMSE estimation method to predict the coding blocks, as described in Section 2, and the coding blocks in such texture areas constitute the KMMSE prediction layer. Since a high correlation exists between subspace x and subspace y, the current coding block is likely to have an unpredictable part similar to that of its adjacent reconstructed blocks. Therefore, we can achieve a higher prediction accuracy by using the KMMSE estimation method for the coding blocks in the KPL. The estimator of the coding blocks in the homogeneous texture areas can be expressed as
\( {\widehat{\mathbf{x}}}_0^{KPL}={\tilde{\mathbf{x}}}_0+{\mathbf{H}}_{XY}{\mathbf{H}}_{YY}^{-1}\left({\mathbf{y}}_0-{\tilde{\mathbf{y}}}_0\right) \)  (10)
As mentioned earlier, the KMMSE estimation method is also implemented in the HEVC-SCC framework. For coding blocks in the HPL, the "try all then select best" intra-mode decision method is used to find the best prediction mode and optimal depth. However, for coding blocks in the KPL, we skip the IntraBC mode and only derive the best prediction mode and optimal depth among the KMMSE mode (KMMSE estimation method) and the other 34 intra-directional prediction modes, in order to reduce the computational complexity. Moreover, since the LF image is composed of numerous EIs and texture-homogeneous areas hardly prevail in the LF image, the coding unit size 64 × 64 is seldom chosen as the optimal block size. Consequently, for the KMMSE estimation method, we only use four coding block sizes ranging from 32 × 32 down to 4 × 4.
Linear prediction layer (LPL)
There is a special case in the KPL, namely the visually flat regions, such as skies and walls. In such flat regions, the luminance information of the coding block and its adjacent reconstructed blocks is simple. In the KMMSE estimation method, the coding block is predicted by using two terms, as shown in Eq. (6). The first term is a linear prediction of vector x_{0}, and the second term is a correction vector representing the unpredictable part of y_{0} transformed into subspace x [26]. Since the luminance information in the flat regions is simple, the unpredictable part of y_{0} can be neglected with negligible effect on the prediction accuracy. This means that we can predict the coding blocks in such flat regions by directly using a linear prediction, without computing the correction vector. Accordingly, the coding blocks in the visually flat regions constitute the LPL, and the estimator of the coding blocks in such regions can be derived by
\( {\widehat{\mathbf{x}}}_0^{LPL}={\tilde{\mathbf{x}}}_0=\sum_{k=1}^{K}{\omega}_k\left({\mathbf{y}}_0\right){\mathbf{x}}_k \)  (11)
The weight vector ω_{k}(y_{0}) is obtained by using Eq. (8). The linear prediction method is implemented in the HEVC-SCC framework in the same way as the KMMSE estimation method.
In this section, we have introduced three prediction layers according to the contents of the coding blocks and their adjacent reconstructed blocks, which comprise the proposed SKMMSE estimation method. The main idea is to further improve the coding efficiency and accelerate the prediction process. The three prediction layers are summarized as follows.

1)
HPL consists of the non-homogeneous texture areas, where the hybrid prediction method is used to predict the coding blocks.

2)
KPL consists of the homogeneous texture areas where the texture information is abundant. For the KPL, the KMMSE estimation method is utilized to predict the coding blocks.

3)
LPL consists of the visually flat regions in homogeneous texture areas, where a linear prediction method, a simplified form of the KMMSE estimation method obtained by discarding the correction vector, is used to predict the coding blocks.
Layer switching and management mechanism
The goal of the proposed SKMMSE is to further improve the coding efficiency and accelerate the prediction process. In order to do so, we propose a content-adaptive layer selection scheme. Since the current coding block is unknown, and since its contents are closely linked to those of the known blocks adjacent to it, we propose to utilize the contents of these neighboring known blocks to determine which layer the current coding block belongs to and, hence, which layers are employed to perform its prediction. In order to achieve this goal, two assumptions are taken into consideration:

1)
Given that the HPL is used to improve the overall coding efficiency, accurately deciding which coding blocks belong to this layer is of great importance. The decision criterion should take into account the contents of the coding blocks.

2)
In order to accelerate the prediction process, the LPL is expected to perform well for the visually flat regions in homogeneous texture areas. The visual flatness should serve as the decision criterion, and the criterion should be simple and fast.
Taking the two assumptions into account, we propose to use the prediction mode correlation to decide which blocks belong to the HPL and to use the gradient information to measure the visual flatness. The two layer-switching mechanisms are introduced below.
For homogeneous texture areas, the coding blocks and their neighboring blocks are closely linked. In most cases, they have similar texture information and structural characteristics. As a result, the prediction modes of the coding blocks should be similar to those of their neighboring known blocks in the homogeneous texture areas. Consequently, we can apply the prediction mode information to decide whether the coding block belongs to the homogeneous texture areas and further determine whether the coding block belongs to the HPL or the KPL. Since the prediction mode information of the current coding block is unknown, the prediction mode information of its neighboring known blocks is utilized in the decision criterion.
Suppose the optimal prediction modes of the current coding block and its left, up, and up-left neighboring blocks are denoted by PM_{C}, PM_{L}, PM_{U}, and PM_{UL}, respectively. We define a flag, flag_{CB}, used to determine whether the coding block belongs to the HPL or the KPL:
\( \mathrm{flag}_{CB}=\left\{\begin{array}{ll}1,& \mathrm{PM}_L=\mathrm{PM}_U=\mathrm{PM}_{UL}\\ 0,& \mathrm{otherwise}\end{array}\right. \)  (12)
From Eq. (12), we see that if PM_{L} = PM_{U} = PM_{UL}, the flag_{CB} is set to 1 and the current coding block is determined to belong to KPL. Otherwise, flag_{CB} is set to 0 and the current coding block is determined to belong to HPL. The main reason is that if PM_{L} = PM_{U} = PM_{UL}, the current coding block is likely to have the same prediction mode as its neighboring blocks (left neighboring block, up neighboring block, and upleft neighboring block), which indicates that the current coding block and its neighboring blocks have a similar texture information and structural characteristics and they belong to the homogenous texture areas. Therefore, the current coding block is grouped into KPL. If the flag_{CB} is equal to 0, which means that the current coding block is quite different from its neighboring blocks, the current coding block is grouped into HPL.
In order to verify the decision accuracy, Table 1 shows the accuracy of PM_{C} = PM_{L} = PM_{U} = PM_{UL} given PM_{L} = PM_{U} = PM_{UL} at each depth level. The accuracy denotes the probability of PM_{C} = PM_{L} = PM_{U} = PM_{UL} across all the test QPs when PM_{L} = PM_{U} = PM_{UL}. From Table 1, we see that this accuracy ranges from 77.8 to 93.8% across the depth levels, 85.0% on average. This means that if PM_{L} = PM_{U} = PM_{UL}, the prediction mode of the current coding block can be considered to be the same as that of its neighboring blocks. Therefore, we can use such prediction mode information to decide whether the current coding block is located in a homogeneous texture area. From Table 1, we also find that the average accuracy is lower for rectangular lens LF images (e.g., bike, fountain, Laura, and seagull) at depth 0 than at the other depth levels. The reason is that the size of the EI in a rectangular lens LF image is close to the coding block size at depth 0. Since the EIs exhibit repetitive patterns, homogeneous texture areas hardly prevail at this depth level. Fortunately, the average accuracy across all the depth levels is approximately 80%, so it is feasible to use the prediction mode information of the neighboring known blocks to decide whether the current coding block is located in a homogeneous texture area.
We have mentioned above that there is a special case in the KPL, namely the visually flat regions. If we can find the coding blocks that belong to these regions and use a simpler and faster method to predict them with negligible effect on the prediction accuracy, the computational complexity can be reduced, especially on the decoder side. To this end, we utilize the gradient information of the top nearest patch of the prototype region in the KNN patch set to judge whether the coding block belongs to the LPL. Let z_{1} = (x_{1}, y_{1}) be the vector form of the top nearest patch of the prototype region in the KNN patch set. Suppose G_{z1}, G_{x1}, and G_{y1} represent the gradients of vectors z_{1}, x_{1}, and y_{1}, respectively. We also define a flag \( \mathrm{fla}{\mathrm{g}}_{CB}^{\prime } \). If ‖G_{x1} − G_{y1}‖ < G_{z1}, the coding block and its neighboring blocks are considered to be located in a visually flat region and \( \mathrm{fla}{\mathrm{g}}_{CB}^{\prime } \) is set to 1. Otherwise, \( \mathrm{fla}{\mathrm{g}}_{CB}^{\prime } \) is set to 0, and the coding block is considered to belong to the KPL. The flag \( \mathrm{fla}{\mathrm{g}}_{CB}^{\prime } \) is defined by
\( \mathrm{flag}_{CB}^{\prime}=\left\{\begin{array}{ll}1,& \left\Vert {\mathbf{G}}_{x_1}-{\mathbf{G}}_{y_1}\right\Vert <{G}_{z_1}\\ 0,& \mathrm{otherwise}\end{array}\right. \)  (13)
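Putting the two switching rules together, the per-block layer decision can be sketched as follows (our own function and variable names; in the actual codec the modes and gradients are derived from the reconstructed neighborhood):

```python
import numpy as np

def select_layer(pm_left, pm_up, pm_upleft, g_x1, g_y1, g_z1):
    """Layer choice from the causal neighborhood: HPL unless the three
    neighboring prediction modes agree; within the agreeing case, fall
    back to the cheap linear layer (LPL) when the nearest KNN patch
    looks visually flat."""
    if not (pm_left == pm_up == pm_upleft):
        return "HPL"                          # heterogeneous context
    if np.linalg.norm(np.asarray(g_x1) - np.asarray(g_y1)) < g_z1:
        return "LPL"                          # flat: linear prediction suffices
    return "KPL"
```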
The proposed algorithm is summarized by the flow graph in Fig. 4. It shows that the SKMMSE algorithm is a scalable LF image coding method in which the coding blocks are divided into three layers. The HPL is used to improve the prediction accuracy for coding blocks in the non-homogeneous texture areas, while the KPL is applied to ensure the coding efficiency for the coding blocks in the homogeneous texture areas where the texture information is abundant. Regarding the LPL, a simplified prediction method is adopted to further reduce the overall computational complexity with negligible effect on the prediction accuracy, especially on the decoder side. Note that the proposed framework can also be used to predict chrominance blocks.
Experimental results and discussion
In order to validate the efficiency of the proposed method, 12 LF test images, including eight circular lens LF test images provided by the ICME 2016 Grand Challenge on light-field image compression [28] and four rectangular lens LF test images provided by Dr. T. Georgiev [29], are used as the test set. The LF test images are all captured by the focused plenoptic camera. The resolution of the circular lens LF test images is 7728 × 5368. The original resolution of the rectangular lens LF test images is 7240 × 5432, and we crop the four test images to 3840 × 2160 for simplicity. The size of each EI in a rectangular lens LF image is 75 × 75. All the LF test images are transformed into YUV 4:2:0 format. The central rendered views from each LF test image are shown in Fig. 5.
The HEVC-SCC reference software SCM-3.0 [30] is modified for the proposed hybrid codec architecture. The coding configuration is set to "All Intra," as defined in [31]. Four quantization parameters, 22, 27, 32, and 37, are tested. The proposed hybrid prediction method (referred to as SKMMSE) is compared with three prediction schemes: the original HEVC (referred to as HEVC), the screen content coding extension Ver. 3.0 to HEVC (referred to as HEVC-SCC), and the kernel-based minimum mean-square-error estimation method [15] (referred to as KMMSE). The KMMSE method is realized in the HEVC-SCC reference software SCM-3.0. The Y-PSNR and YUV-PSNR between the original LF image and the decoded LF image, as given in [28], are used as the objective quality metrics.
The template thickness T in Fig. 3a is set to 4. The dimensions of the search windows used in the SKMMSE and KMMSE methods are given by V = 128 and H = 128, as shown in Fig. 3b. We have mentioned that the KMMSE method is integrated into the HEVC-SCC standard by replacing one of the 35 intra-directional prediction modes in the SKMMSE method. In our experiment, intra-prediction mode "4" is replaced, and K is set to 6.
Table 2 gives the rate-distortion gains of the three prediction methods over the HEVC intra standard with Y-PSNR as the objective quality metric. The proposed SKMMSE is clearly superior to the other methods: it achieves an average gain of up to 1.61 dB over the HEVC intra standard and saves around 30.9% BD-rate on average compared to HEVC-SCC, because integrating the KMMSE method into the HEVC-SCC standard effectively improves the prediction accuracy. Compared to KMMSE, the proposed SKMMSE achieves about 0.17 dB average gain, mainly because KMMSE does not work well for blocks in non-homogeneous texture areas; by incorporating the IBC mode, SKMMSE achieves a better prediction of the coding blocks in such areas. Table 2 also shows that the KMMSE method alone saves an average of 22.4% BD-rate over HEVC-SCC, which means that the KMMSE mode provides a better prediction and is selected as the best prediction mode in most cases. Figure 6 shows the rate-distortion curves of the test set under the different coding schemes with Y-PSNR as the metric, further confirming that the proposed SKMMSE outperforms the other prediction methods.
The rate-distortion gains of the three prediction methods over the HEVC intra standard with YUV-PSNR as the objective quality metric are given in Table 3, and the conclusions are consistent with Table 2. Compared to the HEVC intra standard, the proposed SKMMSE achieves an average gain of up to 1.42 dB. Likewise, around 10.5% and 32.8% BD-rate are saved on average by SKMMSE compared to the KMMSE and HEVC-SCC methods, respectively. This again validates that the proposed SKMMSE architecture compresses LF data effectively. Figure 7 gives the corresponding rate-distortion curves with YUV-PSNR as the metric, further demonstrating the validity of the proposed SKMMSE.
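The BD-rate figures quoted above come from the standard Bjontegaard delta measure. A minimal sketch, assuming four rate/PSNR points per codec as produced by the QP set {22, 27, 32, 37}:

```python
import numpy as np

def bd_rate(rate_anchor, psnr_anchor, rate_test, psnr_test):
    """Bjontegaard delta rate (%) between two rate-distortion curves.
    Negative values mean the test codec needs fewer bits at equal quality.
    Fits a cubic of log-rate vs. PSNR and averages the gap between the two
    fits over the overlapping PSNR range."""
    pa = np.polyfit(psnr_anchor, np.log(rate_anchor), 3)
    pt = np.polyfit(psnr_test, np.log(rate_test), 3)
    lo = max(min(psnr_anchor), min(psnr_test))   # overlapping PSNR interval
    hi = min(max(psnr_anchor), max(psnr_test))
    ia, it = np.polyint(pa), np.polyint(pt)
    avg_a = (np.polyval(ia, hi) - np.polyval(ia, lo)) / (hi - lo)
    avg_t = (np.polyval(it, hi) - np.polyval(it, lo)) / (hi - lo)
    return (np.exp(avg_t - avg_a) - 1.0) * 100.0
```
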
Table 4 shows the execution-time ratios of the three coding methods relative to the HEVC intra standard at both the encoder and the decoder. The KMMSE method requires the most execution time on both sides, mainly because computing the kernel bandwidth matrix H is time-consuming for every coding block. Since the decoder must repeat the same prediction procedure, KMMSE requires 49.1 times the execution time of the HEVC intra standard at the decoder. The proposed SKMMSE method is designed to reduce this computational complexity, and Table 4 confirms its effectiveness: SKMMSE saves 16.4% and 70.5% of the average coding time of KMMSE at the encoder and decoder, respectively. There are two main reasons. First, the coding blocks are divided into three layers, and applying a simpler and faster prediction method to the LPL, with negligible effect on prediction accuracy, reduces the complexity at the encoder. Second, under this three-layer division, many coding blocks select the IntraBC mode or the linear prediction mode as the optimal prediction mode, and these two modes cost far less than the KMMSE mode, especially at the decoder. Although SKMMSE is less complex than KMMSE, it still requires about twice the encoding time of HEVC-SCC.
Since an LF image captures both spatial and angular information of a scene, view images can be rendered from the LF data. To further validate the proposed coding scheme, Fig. 8 presents a visual-quality comparison of view images rendered from the decoded LF images. As shown in Fig. 8, the proposed SKMMSE obtains better visual quality, especially in textured regions, for two main reasons: the proposed scheme achieves better coding efficiency than the other methods, and the SKMMSE prediction effectively preserves the detail information of the EIs during the prediction process.
Conclusions
In this paper, we propose a scalable kernel-based MMSE estimation method to compress LF images effectively. The coding blocks are divided into three layers. In the HPL, a hybrid method combining kernel-based MMSE estimation and IntraBC is proposed to improve the prediction accuracy for coding blocks in non-homogeneous texture areas: when the KMMSE estimation fails, the IntraBC scheme or intra-directional prediction is used to find the best prediction of the coding block, based on the rate-distortion optimization (RDO) procedure. In the KPL, the KMMSE estimation method predicts the coding blocks to ensure the coding efficiency for homogeneous texture areas. In the LPL, the coding blocks are predicted directly with a linear prediction method, without computing the correction vector; this linear prediction can be seen as a simplified form of the KMMSE estimation. To decide accurately which layer a coding block belongs to, we use the prediction-mode correlation with adjacent blocks to identify HPL blocks and the gradient information to measure visual flatness.
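The layer decision summarized above can be sketched as follows. The gradient-based flatness test for the LPL and the neighbour-mode rule for the HPL follow the idea in the text, but the threshold value and the exact rules are illustrative assumptions, not the paper's tuned criteria.

```python
import numpy as np

def classify_layer(block, neighbor_modes, grad_thresh=8.0):
    """Assign a coding block to LPL, HPL, or KPL. The mean gradient
    magnitude measures visual flatness; a neighbouring block coded with
    IntraBC flags a likely non-homogeneous area. Threshold and rules are
    illustrative stand-ins for the paper's layer management mechanism."""
    gy, gx = np.gradient(block.astype(np.float64))
    if np.mean(np.hypot(gx, gy)) < grad_thresh:
        return "LPL"  # visually flat: cheap linear prediction suffices
    if "IBC" in neighbor_modes:
        return "HPL"  # neighbours used IntraBC: hybrid prediction layer
    return "KPL"      # default: kernel-based MMSE prediction layer
```
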
The experimental results demonstrate that the proposed SKMMSE method compresses light-field images efficiently. It outperforms the HEVC intra standard with average quality improvements of 1.61 dB and 1.42 dB in Y-PSNR and YUV-PSNR, respectively. With regard to computational complexity, the proposed SKMMSE method saves around 16.4% and 70.5% of the average coding time of the KMMSE estimation method at the encoder and decoder, respectively.
Future work will include complexity reduction and how to further improve the prediction accuracy for texture and edge regions.
Abbreviations
FTV: Free-viewpoint television
GMM: Gaussian mixture model
HEVC-SCC: HEVC screen content coding
HPL: Hybrid prediction layer
KDE: Kernel density estimation
KMMSE: Kernel-based MMSE
KPL: KMMSE prediction layer
LFs: Light fields
LLE: Locally linear embedding
LPL: Linear prediction layer
MMSE: Minimum mean square error estimation
PDF: Probability density function
RDO: Rate-distortion optimization
SKMMSE: Scalable kernel-based MMSE
References
1. R Yang, X Huang, S Li, C Jaynes, Toward the light field display: autostereoscopic rendering via a cluster of projectors. IEEE Trans. Vis. Comput. Graphics 14(1), 84–96 (2008)
2. EH Adelson, JR Bergen, The plenoptic function and the elements of early vision, in Computational Models of Visual Processing (MIT Press, Cambridge, 1991), pp. 3–20
3. M Levoy, P Hanrahan, Light field rendering, in Proc. 23rd Annu. Conf. Comput. Graph. Interact. Techn. (1996), pp. 31–42
4. F Liu, G Hou, Z Sun, T Tan, High quality depth map estimation of object surface from light-field images. Neurocomputing 252, 3–16 (2017)
5. M Levoy, Light fields and computational imaging. Computer 39, 46–55 (2006)
6. T Ebrahimi, JPEG PLENO abstract and executive summary, ISO/IEC JTC 1/SC 29/WG1 N6922, Sydney, Australia, 2015
7. MP Tehrani, S Shimizu, G Lafruit, T Senoh, T Fujii, A Vetro, et al., Use cases and requirements on free-viewpoint television (FTV), ISO/IEC JTC1/SC29/WG11 MPEG N14104, Geneva, Switzerland, 2013
8. C Yan, H Xie, D Yang, J Yin, Y Zhang, Q Dai, Supervised hash coding with deep neural network for environment perception of intelligent vehicles. IEEE Trans. Intell. Transp. Syst. 19(1), 284–295 (2018)
9. C Yan, H Xie, S Liu, J Yan, Y Zhang, Q Dai, Effective Uyghur language text detection in complex background images for traffic prompt identification. IEEE Trans. Intell. Transp. Syst. 19(1), 220–229 (2018)
10. C Yan, Y Zhang, J Xu, F Dai, L Li, Q Dai, F Wu, A highly parallel framework for HEVC coding unit partitioning tree decision on many-core processors. IEEE Signal Process. Lett. 21(5), 573–576 (2014)
11. C Yan, Y Zhang, J Xu, F Dai, J Zhang, Q Dai, F Wu, Efficient parallel framework for HEVC motion estimation on many-core processors. IEEE Trans. Circuits Syst. Video Technol. 24(12), 2077–2089 (2014)
12. LFR Lucas, C Conti, P Nunes, LD Soares, NMM Rodrigues, CL Pagliari, EAB da Silva, SMM de Faria, Locally linear embedding-based prediction for 3D holoscopic image coding using HEVC, in Proc. 22nd European Signal Processing Conference (EUSIPCO) (2014)
13. R Monteiro et al., Light field HEVC-based image coding using locally linear embedding and self-similarity compensated prediction, in 2016 IEEE International Conference on Multimedia & Expo Workshops (ICMEW) (2016), pp. 1–4
14. D Liu, P An, R Ma, L Shen, Disparity compensation based 3D holoscopic image coding using HEVC, in 2015 IEEE China Summit & Int. Conf. Signal and Information Processing (ChinaSIP) (2015), pp. 201–205
15. D Liu, P An, R Ma, C Yang, L Shen, K Li, Three-dimensional holoscopic image coding scheme using high-efficiency video coding with kernel-based minimum mean-square-error estimation. J. Electron. Imaging 25(4), 043015 (2016)
16. D Liu, P An, R Ma, C Yang, L Shen, 3D holoscopic image coding scheme using HEVC with Gaussian process regression. Signal Process. Image Commun. 47, 438–451 (2016)
17. C Conti, LD Soares, P Nunes, HEVC-based 3D holoscopic video coding using self-similarity compensated prediction. Signal Process. Image Commun. 42, 59–78 (2016)
18. C Conti, P Nunes, LD Soares, HEVC-based light field image coding with bi-predicted self-similarity compensation, in 2016 IEEE International Conference on Multimedia & Expo Workshops (ICMEW) (2016), pp. 1–4
19. Y Li, M Sjöström, R Olsson, U Jennehag, Coding of focused plenoptic contents by displacement intra prediction. IEEE Trans. Circuits Syst. Video Technol. 26(7), 1308–1319 (2016)
20. F Dai, J Zhang, Y Ma, Y Zhang, Lenselet image compression scheme based on subaperture images streaming, in 2015 IEEE International Conference on Image Processing (ICIP), Quebec City, QC (2015), pp. 4733–4737
21. D Liu, L Wang, L Li, Z Xiong, F Wu, W Zeng, Pseudo-sequence-based light field image compression, in 2016 IEEE International Conference on Multimedia & Expo Workshops (ICMEW) (2016), pp. 1–4
22. G Wang, W Xiang, M Pickering, CW Chen, Light field multi-view video coding with two-directional parallel inter-view prediction. IEEE Trans. Image Process. 25(11), 5104–5117 (2016)
23. P Helin, P Astola, B Rao, I Tabus, Sparse modelling and predictive coding of subaperture images for lossless plenoptic image compression, in 2016 3DTV-Conference: The True Vision – Capture, Transmission and Display of 3D Video (3DTV-CON) (2016), pp. 1–4
24. L Li, Z Li, B Li, D Liu, H Li, Pseudo-sequence-based 2D hierarchical coding structure for light-field image compression, in 2017 Data Compression Conference (DCC) (2017), pp. 131–140
25. D Liu, P An, R Ma, X Huang, L Shen, Hybrid kernel-based template prediction and intra block copy for light field image coding, in 2017 Pacific-Rim Conference on Multimedia (PCM) (2017)
26. J Koloda, AM Peinado, V Sánchez, Kernel-based MMSE multimedia signal reconstruction and its application to spatial error concealment. IEEE Trans. Multimedia 16(6), 1729–1738 (2014)
27. D Persson, T Eriksson, P Hedelin, Packet video error concealment with Gaussian mixture models. IEEE Trans. Image Process. 17(2), 145–154 (2008)
28. M Rerabek, T Bruylants, T Ebrahimi, F Pereira, P Schelkens, Call for proposals and evaluation procedure, ICME 2016 grand challenge: light-field image compression, Seattle, USA, pp. 1–8, 2016
29. T Georgiev, 2013. [Online]. Available: http://www.tgeorgiev.net. Accessed 1 July 2017
30. HEVC SCC reference software Ver. 3.0 (SCM-3.0). [Online]. Available: https://hevc.hhi.fraunhofer.de/svn/svn_HEVCSoftware/. Accessed 31 Aug 2017
31. H Yu, R Cohen, K Rapaka, J Xu, Common test conditions for screen content coding, document JCTVC-X1015, 2016
Acknowledgements
The authors would like to thank the editors and anonymous reviewers for their valuable comments.
Funding
This work was supported in part by the National Natural Science Foundation of China under grants 61571285 and 1301257 and Shanghai Science and Technology Commission under grant 17DZ2292400.
Availability of data and materials
The data will not be shared because the research is still ongoing, and the data and materials are still required by the authors and coauthors for further investigation.
Author information
Affiliations
Contributions
PA designed and conceived the research. ZY performed the simulated experiments and DL analyzed the experimental results. ZY wrote the manuscript. PA and DL edited the manuscript. All authors read and approved the final manuscript.
Corresponding authors
Correspondence to Zhixiang You or Ping An.
Ethics declarations
Authors’ information
Zhixiang You received his B.S. degree from Nanjing University in 2005, and M.S. degree from Tsinghua University in 2008. He is currently pursuing the Ph.D. degree in communication and information systems from Shanghai University, Shanghai, China. His research interests include algorithms and systems for 3D and VR imaging, processing, analysis, and quality assessment.
Ping An received her B.S. and M.S. degrees from Hefei University of Technology, Hefei, China, in 1990 and 1993, respectively, and the Ph.D. degree in communication and information systems from Shanghai University, Shanghai, China, in 2002. She is currently a professor in School of Communication and Information Engineering, Shanghai University. Her research interests include stereoscopic and threedimensional vision analysis and image and video processing, coding, and application.
Deyang Liu received his B.S. degree from Anqing Normal University, Anqing, China, in 2011, and his M.S. and Ph.D. degrees in Signal and Information Processing from Shanghai University, Shanghai, China, in 2014 and 2017, respectively. He is currently a lecturer in the School of Computer and Information, Anqing Normal University. His research interests include 3D video processing, light-field image coding, and scalable light-field video coding.
Competing interests
The authors declare that they have no competing interests.
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Additional information
Ping An is an IEEE member
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
About this article
Cite this article
You, Z., An, P. & Liu, D. Scalable kernel-based minimum mean square error estimate for light-field image compression. J Image Video Proc. 2018, 52 (2018). https://doi.org/10.1186/s13640-018-0291-9
Keywords
Light-field image
Image compression
Scalable kernel-based MMSE estimate
Layer management mechanism
HEVC SCC