Light-field imaging, also referred to as plenoptic imaging, holoscopic imaging, and integral imaging, can capture both the spatial and angular information of a 3D scene and enables new possibilities for digital imaging [1]. Light fields (LFs) captured by light-field imaging represent the intensity and direction of the light rays emanating from a 3D scene. The full LF can be represented by the seven-dimensional plenoptic function L = L(x, y, z, θ, ϕ, λ, t) introduced by Adelson and Bergen [2], where (x, y, z) is the viewing position, (θ, ϕ) gives the light ray direction, λ is the light ray wavelength, and t is the time. The seven-dimensional plenoptic function can be simplified to four dimensions by considering only information captured in a region free of occlusions at a single time instance [3, 4]. The simplified 4D function L = L(u, v, x, y) represents a set of light rays parametrized by their intersections with two planes, where uv describes the ray position in the aperture (object) plane and xy describes the ray position in the image plane. From this point of view, LFs can be explored from a digital perspective with the advances in computational photography [5].
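The two-plane indexing above can be sketched in a few lines of Python. This is an illustrative toy (a random 9×9×64×64 array standing in for real captured data, with the array layout an assumption): fixing (u, v) yields a sub-aperture view, while fixing (x, y) yields the angular samples recorded behind one spatial position.

```python
import numpy as np

# Hypothetical 4D light field L(u, v, x, y): (u, v) indexes the angular
# (aperture-plane) coordinates, (x, y) the spatial (image-plane) ones.
U, V, X, Y = 9, 9, 64, 64                 # 9x9 angular, 64x64 spatial
lf = np.random.rand(U, V, X, Y)           # placeholder data, not a real capture

# Fixing (u, v) gives one sub-aperture view: the scene as seen from a
# single direction through the aperture plane.
center_view = lf[U // 2, V // 2]          # shape (64, 64)

# Fixing (x, y) instead gives the angular samples behind one pixel,
# i.e. the micro-image recorded under one micro-lens.
micro_image = lf[:, :, X // 2, Y // 2]    # shape (9, 9)
```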
Several techniques can be used to capture a light-field image, such as coded apertures, multi-camera arrays, and micro-lens arrays. Among micro-lens array-based techniques, the most practical way to capture an LF image is with a plenoptic camera, such as those produced by Lytro. Commercially available plenoptic cameras can be divided into two categories: the standard plenoptic camera and the focused plenoptic camera. The focused plenoptic camera provides a trade-off between spatial and angular information by placing the focal plane of the microlenses away from the image sensor plane. Given the wide range of possible applications and the rapid development of light-field technology, many research groups have considered the standardization of light-field applications. The JPEG working group started a new study, known as JPEG Pleno [6], aiming at richer image capturing, visualization, and manipulation. The MPEG group has also pursued the third phase of free-viewpoint television (FTV) since 2013, targeting super multi-view, free navigation, and full parallax imaging applications [7].
Since the acquired LF image records both the spatial and angular information of a 3D scene, it naturally supports rendering views at different viewpoints and at different focal planes, which broadens its applications. However, these enhanced features require a vast amount of data to be acquired for each LF image. Even though many image coding methods [8,9,10,11] have been proposed, they cannot be directly applied to LF images. Therefore, efficient compression schemes for this particular type of image are needed for effective transmission and storage.
According to the available techniques for capturing and visualizing LF images, LF image compression schemes fall into two main categories; the general workflow for LF image acquisition and visualization is depicted in Fig. 1. The first kind, called spatial correlation-based compression, compresses the acquired lenslet image directly (see Fig. 1a), exploiting the fact that the elementary images (EIs) of the lenslet image exhibit repetitive patterns and that a large amount of redundancy exists between neighboring EIs, as can be seen in Fig. 2a. Exploiting the inherent non-local spatial redundancy of the LF image, a coding method based on locally linear embedding (LLE) is proposed in [12]. This work is further improved in [13] by combining the LLE-based method with a self-similarity (SS)-based compensated prediction method. The work in [14] puts forward a disparity compensation-based light-field image coding algorithm that exploits the high spatial correlation within LF images, which is further improved by kernel-based minimum mean-square-error estimation prediction [15] and a Gaussian process regression-based prediction method [16]. In [17], Conti et al. introduced the SS mode into HEVC to improve the coding efficiency of light-field images; this mode is similar to the intra-block copy (IntraBC) mode incorporated into the HEVC range extension for coding screen content. To further improve coding performance, bi-predicted SS estimation and SS compensation are proposed in [18], where the candidate predictor can also be devised as a linear combination of two blocks within the same search window. A displacement intra-prediction scheme for LF content is proposed in [19], where more than one hypothesis is used to reduce prediction errors.
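To illustrate the self-similarity/IntraBC idea underlying these spatial correlation-based methods, the sketch below searches the causal (already reconstructed) area of an image for the block that best predicts the current one. The helper name `best_match_sad` and the use of plain SAD as the matching cost are assumptions for illustration; the cited methods use their own cost functions and rate-distortion criteria.

```python
import numpy as np

def best_match_sad(recon, top, left, bsize, search):
    """Find the displacement of the best-matching predictor block inside a
    causal search window, using the sum of absolute differences (SAD)."""
    target = recon[top:top + bsize, left:left + bsize].astype(int)
    best, best_sad = None, np.inf
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            cy, cx = top + dy, left + dx
            # skip candidates that fall outside the image
            if cy < 0 or cx < 0:
                continue
            if cy + bsize > recon.shape[0] or cx + bsize > recon.shape[1]:
                continue
            # causal constraint: the candidate must lie entirely in the
            # already-reconstructed area (fully above or fully left)
            if not (cy + bsize <= top or cx + bsize <= left):
                continue
            cand = recon[cy:cy + bsize, cx:cx + bsize].astype(int)
            sad = np.abs(cand - target).sum()
            if sad < best_sad:
                best_sad, best = sad, (dy, dx)
    return best, best_sad
```

On a lenslet image, the repetitive micro-image pattern means such a search typically finds a near-identical predictor one EI away, which is exactly the redundancy the SS and IntraBC modes exploit.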
The other kind, called pseudo-sequence-based compression, creates a 4D LF representation of the LF image prior to compression (see Fig. 1b). Pseudo-sequence-based coding methods decompose the LF image into multiple views, as can be seen in Fig. 2b. The derived views are then organized into a sequence to make full use of the inter-correlations among them. In [20], a sub-aperture image streaming scheme is proposed to compress lenslet images, in which rotation scan mapping is adopted to further improve compression efficiency. A pseudo-sequence-based scheme for LF image compression is proposed in [21], in which the coding order of views, the prediction structure, and rate allocation are investigated for encoding the pseudo-sequence. A new LF multi-view video coding prediction structure, extending inter-view prediction into a two-directional parallel structure, is designed in [22] to analyze the relationship between the prediction structure and its coding performance. In [23], a lossless compression method for rectified LF images is presented that exploits the high similarity among the sub-aperture (view) images. A novel pseudo-sequence-based 2D hierarchical reference structure for light-field image compression is proposed in [24], combining distance-based reference frame selection with spatial-coordinate-based motion vector scaling to better characterize the inter-correlations among the views decomposed from the light-field image.
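As one illustrative way to organize the decomposed views into a pseudo-sequence, a serpentine (boustrophedon) scan keeps consecutive frames as neighboring views, so inter-frame prediction sees small disparities. This is a simplified stand-in, not the specific coding orders investigated in [20, 21]:

```python
def serpentine_order(rows, cols):
    """Order an angular grid of views (u, v) into a pseudo-sequence by a
    serpentine scan: even rows left-to-right, odd rows right-to-left, so
    each frame in the sequence is an angular neighbor of the previous one."""
    order = []
    for u in range(rows):
        vs = range(cols) if u % 2 == 0 else range(cols - 1, -1, -1)
        order.extend((u, v) for v in vs)
    return order
```

The resulting index list determines the frame order fed to a standard video encoder, which then exploits the inter-view redundancy as ordinary temporal redundancy.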
Although the pseudo-sequence-based compression method can compress the LF image effectively, the process of deriving the 4D LF view images from the raw sensor data strongly depends on the exact acquisition device. In contrast, the spatial correlation-based compression method does not need to extract view images from the LF image, and it has the potential to achieve better coding efficiency if we can make full use of the high spatial correlation between adjacent EIs. Therefore, in this paper, we follow the spatial correlation-based approach and propose a scalable kernel-based minimum mean-square-error (MMSE) estimation method to effectively compress the LF image by exploiting this high spatial correlation. The contributions of this paper are as follows:
1) Hybrid kernel-based MMSE estimation and intra-block copy for LF image compression. The kernel-based MMSE estimation predicts the coding block using an MMSE estimator whose required probabilities are obtained through kernel density estimation (KDE). Although the kernel-based MMSE estimation method can achieve high coding efficiency, it does not always yield a good prediction for the unknown block, especially in non-homogeneous texture areas. Fortunately, blocks located in such areas can be better predicted by a direct match against the block to be predicted. Therefore, we combine the kernel-based MMSE estimation method with the IntraBC mode to further improve the overall coding efficiency.
2) Scalable kernel-based MMSE estimation to accelerate LF image compression. The kernel-based MMSE estimation method is time-consuming on both the encoder and decoder sides, and the burden is even heavier for the hybrid prediction method on the encoder side. Therefore, we propose a scalable kernel-based MMSE estimation method to alleviate these shortcomings. In the scalable method, the reconstruction framework is decomposed into a set of reconstruction layers ranging from a basic layer that produces a rough yet fast estimation to more complex layers yielding high-quality results.
3) Adaptive layer management mechanism. We use the prediction mode clue and gradient information to decide which layer the current coding block belongs to. The adaptive layer management mechanism selects the layer for a coding block based on how complex its surrounding area is.
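To make the core predictor of contribution 1) concrete, it can be sketched in its Nadaraya-Watson form: the MMSE estimate of an unknown sample is the conditional mean of training values drawn from the reconstructed area, weighted by Gaussian kernel density estimates of how closely their contexts match the current context. This is a minimal illustration of the estimator form under a Gaussian kernel, not the paper's scalable implementation; the function name and bandwidth parameter h are assumptions.

```python
import numpy as np

def kernel_mmse_predict(contexts, values, query, h=1.0):
    """Kernel-based MMSE prediction in Nadaraya-Watson form.

    contexts : (N, d) array of context vectors gathered from the
               already-reconstructed neighborhood
    values   : (N,) array of the sample values observed at those contexts
    query    : (d,) context vector of the sample to be predicted
    h        : kernel bandwidth (assumed Gaussian kernel)

    Returns the conditional-mean estimate E[value | context = query],
    with the densities supplied by Gaussian kernel density estimation.
    """
    d2 = ((contexts - query) ** 2).sum(axis=1)   # squared context distances
    w = np.exp(-0.5 * d2 / h ** 2)               # Gaussian kernel weights
    return (w * values).sum() / (w.sum() + 1e-12)
```

When the query context closely matches some training contexts, their values dominate the weighted mean; in flat, homogeneous areas this gives accurate predictions, while in non-homogeneous texture the weights spread out and the estimate blurs, which motivates the hybrid fallback to IntraBC described above.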
Part of this work has been published in [25]. In this paper, we give a more detailed theoretical analysis, propose a scalable kernel-based MMSE estimation method to accelerate LF image compression, and provide an adaptive layer management mechanism. Experimental results demonstrate the advantage of the proposed scalable compression method in terms of different quality metrics as well as the visual quality of views rendered from the decompressed LF content.
The rest of this paper is organized as follows. An overview of the kernel-based MMSE estimation method is introduced in Section 2. The proposed scalable kernel-based MMSE estimate and its different reconstruction layers are described in Section 3. Section 4 gives the details of the layer selection mechanism. Experimental results are presented and analyzed in Section 5, and the concluding remarks are given in Section 6.