Joint processing and fast encoding algorithm for multi-view depth video
© The Author(s). 2016
Received: 10 October 2015
Accepted: 16 August 2016
Published: 1 September 2016
The multi-view video plus depth format is the main representation of a three-dimensional (3D) scene. In the 3D extension of high-efficiency video coding (3D-HEVC), the main framework for depth video is similar to that of color video. However, because of the limitation of the mainstream capture technologies, depth video is inaccurate and inconsistent. In addition, the depth video coding method in the current 3D-HEVC software implementation is highly complex. In this paper, we introduce a joint processing and fast coding algorithm for depth video. The proposed algorithm utilizes the depth and color features to extract depth discontinuous regions, depth edge regions, and motion regions as masks for efficient processing and fast coding. The processing step includes spatial and temporal enhancement. The fast coding method mainly limits the traversal of the CU partition and mode decision. Experimental results demonstrate that the proposed algorithm reduces the coding time and the depth video coding bitrate; the proposed algorithm reduces overall coding time by 44.24 % and depth video coding time by 72.00 % on average. In addition, there is a 24.07 % depth video coding bitrate reduction with an average of 1.65 % Bjontegaard delta bitrate gains.
As computing, communication, and multimedia technologies rapidly advance, users are increasingly interested in three-dimensional (3D) video system applications, such as 3D television, free-viewpoint television, and photorealistic rendering of 3D scenes [1, 2]. Multi-view video plus depth (MVD) is a mainstream format that represents 3D scenes. MVD includes multiple viewpoint color and depth video. The color video represents the visual information of the scene while the corresponding depth video represents the geometric information of the 3D scene. In this scene representation, virtual views are synthesized using depth image based rendering (DIBR). The MVD data format satisfies the 3D video system’s requirements and supports the wide viewing angle of 3D displays and auto-stereoscopic displays. However, MVD contains a large amount of data, which becomes a challenge for data storage and network transmission. Therefore, multi-view depth video as well as color video should be compressed efficiently.
Recently, depth video coding has become an active research area. High compression ratio, high virtual view quality, and low computational complexity are the targets of depth video coding. Cheung et al. proposed a depth map compression technique based on sparse representation. Lei et al. improved the depth video coding performance by utilizing the depth-texture and motion similarities. Liu et al. presented two depth compression techniques, the trilateral filter and the sparse dyadic mode. Kang et al. designed an adaptive geometry-based intra-prediction scheme for depth video coding. Oh et al. proposed a depth boundary reconstruction filter to code the depth video. Zhang et al. proposed regional bit allocation and rate distortion (RD) optimization algorithms for multi-view depth video coding that allocate the bitrate unevenly across different regions. Shao et al. proposed a depth video coding algorithm based on distortion analysis. Standardization of MVD coding was also investigated by the MPEG and ITU-T/ISO/IEC Joint Collaborative Team on 3D Video Coding Extension Development (JCT-3V). The standard for high-efficiency video coding (HEVC) has been completed by ISO and ITU. However, the 3D extension framework for HEVC (3D-HEVC) remains under development.
So far, depth video of natural scenes can be obtained via Kinect sensors, depth camera systems, and depth estimation software [14–16]. The resulting depth video is inaccurate and inconsistent because of the limitations of each technique. Depth images captured by Kinect sensors suffer from temporal flickering, noise, and holes because of the inherent limitations of the structured light technique. Depth video obtained by a depth camera system based on the time-of-flight principle may be inconsistent with the scene because of ambient light noise, motion artifacts, specular reflections, and so on. Depth video estimated by software usually contains discrete and rugged noise. Consequently, the ideal compression performance cannot be achieved, even if state-of-the-art encoding methods are used. To improve encoding and virtual view rendering performance, many depth video processing algorithms [17–23] have been proposed. Hu et al. proposed a depth video restoration algorithm that is effective for depth video corrupted by additive white Gaussian noise. Nguyen et al. suppressed the coding artifacts over object boundaries by using weighted mode filtering. Zhao et al. proposed a depth no-synthesis-error (D-NOSE) model and presented a smoothing scheme for depth video coding. Lei et al. proposed a depth sensation enhancement method for multiple virtual view rendering. Silva et al. proposed a depth processing method based on a just noticeable depth difference (JNDD) model. In our previous works, the correlation of depth video was enhanced for high compression performance [22, 23]. The main contribution of these algorithms [17–23] is better virtual view quality or a higher compression ratio under the H.264/AVC framework. In 3D-HEVC, the CU sizes and prediction modes differ from those of H.264/AVC, so the ideal compression ratio might not be achieved when these algorithms are applied under 3D-HEVC.
The computational complexity of depth video coding is also a concern. The 3D-HEVC standard still uses the quadtree partition structure introduced in HEVC. Both color and depth videos are split into a sequence of coding tree units (CTUs), and each CTU is recursively divided into four leaf coding units (CUs); the largest CU size is 64 × 64. Each CU contains prediction units (PUs) and can be coded with different prediction modes, i.e., SKIP, merge, inter modes (Inter2N×2N, Inter2N×N, InterN×2N, and InterN×N), asymmetric motion partitioning (AMP) modes (Inter2N×nU, Inter2N×nD, InternL×2N, and InternR×2N), and intra-prediction modes. All prediction modes are probed among all temporal and inter-view frames to determine the mode that achieves the best RD performance for the current CU. Adopting this full search scheme to obtain the best CU quadtree structure as well as the motion or disparity vector for each CU clearly consumes considerable search time [25, 26]. So far, many fast algorithms have been proposed to optimize mode selection, reference frame selection, motion estimation, and disparity estimation in depth video coding [27–33]. These algorithms can be categorized into two classes: those that exploit the correlations between color and depth video [27–29] and those that use depth image features [30–33]. Zhang et al. proposed a low complexity MVD coding algorithm that includes motion vector sharing based on the texture image similarity correlation and SKIP mode decision in depth video coding. Shen et al. proposed a fast depth video coding algorithm that uses the correlations of the prediction modes, reference frames, and motion vectors from color videos and depth maps. Lee et al. proposed a fast and efficient multi-view depth image coding algorithm based on the temporal and inter-view correlations between the previously encoded texture images.
Park proposed a fast depth video coding algorithm based on edge classification and depth-modeling mode omission. Mora et al. presented a quadtree limitation coding tool and the associated predictive coding part; the runtime of depth video coding is reduced by exploiting the texture-depth correlation. Tsang et al. proposed an intra-prediction algorithm that uses a single prediction direction instead of multiple prediction directions. Wang et al. proposed a fast depth video encoding algorithm based on depth video partitioning. However, these algorithms were designed for inconsistent and inaccurate depth video and still leave room for improvement when applied to processed depth video.
In summary, studies on depth video processing and fast depth video coding have been conducted separately. In fact, the optimization chain of depth video processing and depth video coding should be coupled rather than treated as mutually independent. Hence, a joint depth video processing and fast encoding algorithm is proposed in this paper. The algorithm includes two aspects. One is depth video processing for 3D-HEVC coding. The other is fast depth video coding based on the processed depth video. The fast encoding method, which accelerates the CU partition and mode decision, is tailored to the processed depth video. The proposed algorithm not only improves the compression ratio but also speeds up the encoding process.
This paper is a follow-up to our previous works [22, 23], on which the depth video processing is based. The contribution of this paper is the joint algorithm, which couples depth video processing and fast coding together.
The rest of this paper is organized as follows. Motivation and analysis of this paper are given in Section 2, and the proposed algorithm is described in Section 3. Section 4 presents the experimental results, and the conclusions are drawn in Section 5.
2 Motivation and analysis
3 Joint depth video processing and fast encoding algorithm
3.1 Mask extraction
In the proposed algorithm, discontinuous regions (DRs), edge regions (ERs), and motion regions (MRs) are extracted as masks for the subsequent processing and encoding steps. Their extraction methods are detailed in this subsection.
3.1.1 Discontinuous and edge regions extraction
Object edges in depth video are important for virtual view rendering, and pixel value variations in these regions lead to virtual view distortions during the rendering process. We use the classical Canny operator to extract the edges so that they can be preserved. The edge extraction process starts by smoothing the depth map with a Gaussian filter to reduce the impact of noise. The amplitude and angle of the depth gradients are then calculated, followed by non-maximum suppression of the gradient amplitudes and edge extraction using the double-threshold method.
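The four stages listed above can be sketched in a few lines of NumPy. This is only an illustrative sketch of a generic Canny-style detector, not the implementation used in the paper: the function names, the Sobel gradient kernels, and the default values of `sigma`, `low`, and `high` are all assumptions chosen for the example.

```python
import numpy as np

def gaussian_kernel(sigma=1.0, radius=2):
    """1-D normalized Gaussian kernel (sigma and radius are assumed values)."""
    x = np.arange(-radius, radius + 1, dtype=np.float64)
    k = np.exp(-(x ** 2) / (2.0 * sigma ** 2))
    return k / k.sum()

def filter_separable(img, kernel_row, kernel_col):
    """Correlate img with a separable kernel, replicating border pixels."""
    r = len(kernel_row) // 2
    h, w = img.shape
    padded = np.pad(img, r, mode="edge")
    tmp = np.zeros((h + 2 * r, w))
    for i, k in enumerate(kernel_row):      # horizontal pass
        tmp += k * padded[:, i:i + w]
    out = np.zeros((h, w))
    for i, k in enumerate(kernel_col):      # vertical pass
        out += k * tmp[i:i + h, :]
    return out

def canny_edges(depth, sigma=1.0, low=10.0, high=30.0):
    """Return a boolean edge mask for a 2-D depth map."""
    img = depth.astype(np.float64)
    h, w = img.shape
    # 1) Gaussian smoothing to reduce the impact of noise.
    g = gaussian_kernel(sigma)
    smooth = filter_separable(img, g, g)
    # 2) Gradient amplitude and angle (Sobel kernels, an assumption).
    gx = filter_separable(smooth, [-1.0, 0.0, 1.0], [1.0, 2.0, 1.0])
    gy = filter_separable(smooth, [1.0, 2.0, 1.0], [-1.0, 0.0, 1.0])
    mag = np.hypot(gx, gy)
    angle = (np.rad2deg(np.arctan2(gy, gx)) + 180.0) % 180.0
    sector = np.round(angle / 45.0).astype(int) % 4   # 0, 45, 90, 135 degrees
    # 3) Non-maximum suppression along the quantized gradient direction.
    offsets = {0: (0, 1), 1: (1, 1), 2: (1, 0), 3: (1, -1)}
    pm = np.pad(mag, 1, mode="constant")
    thin = np.zeros_like(mag)
    for s, (dy, dx) in offsets.items():
        ahead = pm[1 + dy:1 + dy + h, 1 + dx:1 + dx + w]
        behind = pm[1 - dy:1 - dy + h, 1 - dx:1 - dx + w]
        keep = (sector == s) & (mag >= ahead) & (mag >= behind)
        thin[keep] = mag[keep]
    # 4) Double threshold: keep strong pixels, then grow them into
    #    8-connected weak pixels until nothing changes.
    weak = thin >= low
    edges = thin >= high
    while True:
        p = np.pad(edges, 1, mode="constant")
        grown = np.zeros_like(edges)
        for dy in (-1, 0, 1):
            for dx in (-1, 0, 1):
                grown |= p[1 + dy:1 + dy + h, 1 + dx:1 + dx + w]
        new = grown & weak
        if np.array_equal(new, edges):
            return edges
        edges = new
```

For a synthetic depth map with a single vertical step, the mask returned by `canny_edges` marks only the columns adjacent to the step, which is the behavior an edge region (ER) mask needs.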
3.1.2 Motion regions extraction
3.2 Depth processing
In the proposed algorithm, the depth video is processed in a manner that enhances the spatial and temporal correlations.
3.2.1 Depth video spatial enhancement
The ER is preserved for the sake of rendering performance. For non-ERs, a Gaussian filter and adaptive window smoothing filter are used, respectively.
The Gaussian filter is selected by considering the tradeoff between encoding performance and virtual view quality. First, it is easy to achieve a high compression ratio for smooth depth video. Second, in the DIBR process, rendering holes are decreased while edge regions are appropriately smoothed. Some small holes may disappear when using the smoothing filter.
Next, each pixel on the vertical axis of the cross window is taken as a center, with n as the maximum search range. From each center, we check the pixels in the left and right directions. If a pixel that does not belong to a DR, or that belongs to an ER, is encountered, the search in that direction stops. After the left and right directions of every center have been searched, the adaptive window is formed, as shown in Fig. 7b. The pixel at the center of the adaptive window is set to the mean of all pixels within this window, which completes the adaptive window smoothing filtering for the current pixel.
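A minimal sketch of this window construction is given below, under stated assumptions. The text above describes only the horizontal search; the sketch assumes that the vertical axis of the cross window is grown from the current pixel with the same stopping rule and the same range n. The function and parameter names (`adaptive_window_mean`, `dr_mask`, `er_mask`) are hypothetical.

```python
import numpy as np

def _extent(allowed, y, x, dy, dx, n):
    """Number of steps (at most n) from (y, x) that stay on allowed pixels."""
    h, w = allowed.shape
    steps = 0
    for k in range(1, n + 1):
        yy, xx = y + k * dy, x + k * dx
        inside = 0 <= yy < h and 0 <= xx < w
        if not inside or not allowed[yy, xx]:
            break                      # left the DR or hit an ER pixel
        steps = k
    return steps

def adaptive_window_mean(depth, dr_mask, er_mask, y, x, n=4):
    """Mean depth over the adaptive window centered at pixel (y, x)."""
    # A pixel may join the window only if it is in a DR and not in an ER.
    allowed = dr_mask & ~er_mask
    # Assumption: the vertical axis is grown with the same stopping rule.
    up = _extent(allowed, y, x, -1, 0, n)
    down = _extent(allowed, y, x, 1, 0, n)
    total, count = 0.0, 0
    for yy in range(y - up, y + down + 1):
        # Each pixel on the vertical axis is a center for a left/right search.
        left = _extent(allowed, yy, x, 0, -1, n)
        right = _extent(allowed, yy, x, 0, 1, n)
        row = depth[yy, x - left:x + right + 1]
        total += float(row.sum())
        count += row.size
    return total / count
```

Because each row of the window stops independently at the first disallowed pixel, the window follows the local shape of the DR and never averages across an edge.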
3.2.2 Depth video temporal enhancement
A depth map is a gray image with DRs, ERs, and large areas of constant depth value. DRs or ERs in an accurate depth map should be exactly aligned with the objects’ boundaries in the color video. However, depth videos are always inaccurate, and some of the depth edges do not exactly correspond to the color edge positions. After depth video processing, the inconsistency and discontinuity in the depth video decrease in the spatial domain, and inconsistent pixels in the time domain have been filtered out. As a result, depth video consistency is enhanced. Many DRs in the processed depth video shrink and become flat regions. As the proposed processing method protects the regions that are sensitive for rendering, the quality degradation of the rendered view is kept to a minimum. During the coding process, stationary and non-edge regions in successive frames are very similar to the collocated and neighboring regions. Hence, pre-determining the prediction modes will reduce the coding complexity without significant fluctuations in the rendered view quality.
3.3 Fast depth video coding
Depth video coding in 3D-HEVC inherits the most effective technologies of the color video coding process and adds some tools that are designed for depth video. In the coding process, the optimal prediction modes and the CU splitting depth are decided by an RD optimization process, and the entire decision process is time consuming.
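To give a rough sense of why the exhaustive decision is costly, the following sketch counts how many CUs a full quadtree search visits inside one CTU. It assumes a 64 × 64 CTU and a minimum CU size of 8 × 8, which is a common HEVC configuration but is not stated in this paper; the function name is hypothetical.

```python
def exhaustive_cu_evaluations(ctu_size=64, min_cu_size=8):
    """Map each CU depth to (CU size, number of CUs at that depth in one CTU)."""
    counts = {}
    depth, size = 0, ctu_size
    while size >= min_cu_size:
        # Every split produces four children, so depth d holds 4**d CUs.
        counts[depth] = (size, 4 ** depth)
        size //= 2
        depth += 1
    return counts

per_depth = exhaustive_cu_evaluations()
total_cus = sum(n for _, n in per_depth.values())
```

Under these assumptions a full search visits 85 CUs per CTU (1 + 4 + 16 + 64), and each of them is tested against every candidate prediction mode, which is the cost that the depth bound and mode pre-decision aim to cut.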
Compared with other prediction modes, the SKIP mode requires less computation. Each CU coded in an inter mode must determine the best motion parameters, including the motion vector, reference picture index, and reference picture list flag, whereas a CU coded in SKIP mode contains only one PU, with no significant transform coefficients or motion vectors, and its reference picture list flag is inherited from the merge mode. The merge mode can be applied to the SKIP mode and any inter mode.
The proposed fast encoding method focuses on improving the prediction mode decision process and the CU splitting process. Since the CU splitting depth of depth video is mostly less than the corresponding CU splitting depth in color video, we use the corresponding color CU splitting depth as an upper bound and thereby narrow the CU depth range. We divide the pixels in the depth video into two classes, R_MR ∩ R_ER and others. The regions R_MR ∩ R_ER contain motion of different objects or different classes of pixels with respect to edge properties. For these regions, a fine-grained test of the prediction modes is appropriate to determine the best one. The other regions of the depth map are largely flat. They are highly likely to have the same pixel values as the neighboring inter coded unit, which means that the best prediction mode is likely to be SKIP or merge. The proposed method pre-decides the prediction modes, which simplifies the RD optimization process and reduces computational complexity.
Table 1 Common test conditions
Full resolution of color and depth video
Color QP values: 25, 30, 35, 40
Depth QP values: 34, 39, 42, 45
Texture SAO: ON
Texture SAO: OFF
View synthesis s/w
Table 2 Test sequence information and view numbers used for encoding
Resolutions: 1024 × 768 (three sequences), 1920 × 1088 (one sequence)
- Step 1
If the collocated color CU depth is larger than the current CU depth, the current CU is split further. Otherwise, splitting stops, and the optimal CU depth is determined.
- Step 2
If the current CU belongs to R_MR ∩ R_ER, all prediction modes are searched to select the optimal one. Otherwise, only the SKIP and merge modes are searched.
- Step 3
If the current CU depth is optimal, the final CU depth and prediction mode are determined. Otherwise, the current CU depth is increased and the procedure returns to Step 1, repeating until the current CU depth is optimal.
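The three steps above can be sketched as a small search planner. This is a simplified illustration rather than the encoder logic itself: it follows a single CU branch instead of all four sub-CUs, it lists the modes that would be tested instead of computing RD costs, and the mode identifiers, the `max_depth` default, and the `in_mr_er_region` callback are assumptions introduced for the example.

```python
# Mode names follow the list given in the introduction; identifiers are illustrative.
FAST_MODES = ("SKIP", "MERGE")
ALL_MODES = FAST_MODES + (
    "INTER_2Nx2N", "INTER_2NxN", "INTER_Nx2N", "INTER_NxN",
    "AMP_2NxnU", "AMP_2NxnD", "AMP_nLx2N", "AMP_nRx2N",
    "INTRA",
)

def plan_cu_search(color_cu_depth, in_mr_er_region, max_depth=3):
    """List (depth, candidate modes) pairs tested along one CU branch.

    color_cu_depth  -- splitting depth of the collocated color CU (upper bound)
    in_mr_er_region -- callable(depth) -> bool, True if the CU at that depth
                       belongs to the motion/edge class (hypothetical helper)
    """
    plan = []
    depth = 0
    while True:
        # Step 2: full mode search only for the motion/edge class.
        modes = ALL_MODES if in_mr_er_region(depth) else FAST_MODES
        plan.append((depth, modes))
        # Step 1: split further only while the color CU is deeper.
        if color_cu_depth > depth and depth < max_depth:
            depth += 1                 # Step 3: move to the next depth
        else:
            return plan

# Example: the color CU was split to depth 2, and the depth CU only
# overlaps the motion/edge class once it is split to depth 1 or deeper.
example = plan_cu_search(2, lambda d: d >= 1)
```

In this example the planner tests only SKIP and merge at depth 0, tests all modes at depths 1 and 2, and never visits depth 3 because the color CU depth bounds the search.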
4 Experimental results
In this section, we evaluate the performance of the proposed algorithm under the common test conditions required by JCT-3V, which are shown in Table 1. We tested several sequences provided in the 3DV core experiments in two view configurations (right-left): “Poznan_Street,” provided by Poznan University of Technology; “Kendo” and “Balloons,” provided by Nagoya University; and “Newspaper1,” provided by Gwangju Institute of Science and Technology. As the proposed algorithm is specially designed for depth videos that are estimated using stereo matching, the computer-generated sequences “Undo_Dancer,” “GT_Fly,” and “Shark” were not tested. Table 2 lists the details of the test sequences used for coding.
4.1 Performance evaluation of depth video processing
Depth videos carry the geometric information of the 3D scene and are used for virtual view rendering on the client side. Hence, the performance of the depth video processing method is assessed by the virtual view quality and the bitrate of the depth video. The virtual views, listed in Table 2, were rendered from the decoded video using the 1D-fast VSRS method. We evaluate the depth video processing method of this paper by comparing its experimental results with those of Hu’s, Silva’s, and Zhao’s methods. Note that, for Hu’s method, depth video corrupted by white Gaussian noise is used for testing, with a noise standard deviation of σ = 10.
4.2 Performance evaluation of fast coding
Reduction of overall and depth video encoding times for Mora’s scheme and the proposed algorithm (%), reported for QP = 25, 30, 35, and 40 as ∆T_MORA/∆t_MORA and ∆T_proposed/∆t_proposed
Bitrate variation of depth video for Mora’s scheme and the proposed algorithm (%), reported for QP = 25, 30, 35, and 40
In Mora’s scheme, CU depth limitation and prediction mode pre-decision are utilized. The depth video bitrate reduction in Mora et al.’s method is mainly caused by the reduced cost of transmitting split flags and partition sizes. The bitrate reduction of the proposed method is mainly contributed by the depth video processing, which enhances the correlation and makes the depth video smoother and more consistent. In addition, the fast coding method of the proposed algorithm mainly reduces the coding complexity in smooth regions. Hence, both the compression ratio and the encoding speed are improved.
PSNR and MS-SSIM results of the virtual view
BD-MS-SSIM and BDBR comparison of Mora’s scheme and the proposed algorithm
5 Conclusions
In this paper, we proposed a joint processing and fast depth video coding algorithm that takes into account depth processing and depth video feature information, namely DRs, ERs, and MRs. Because the depth videos captured by mainstream technologies are inaccurate and inconsistent, the proposed processing method improves the consistency of depth video in the spatial and temporal domains. The fast coding method is based on the processed depth video and a statistical analysis of prediction modes. Experimental results show that the proposed algorithm reduces the coding time and the depth video coding bitrate while maintaining the quality of the rendered virtual view.
This work was supported by the National Natural Science Foundation of China under Grant 61620106012, Grant 61271270 and Grant U1301257, National High-tech R&D Program of China (863 Program, 2015AA015901), Natural Science Foundation of Zhejiang Province (LY16F010002, LY15F010005), and Natural Science Foundation of Ningbo (2015A610127, 2015A610124). It is also sponsored by K.C. Wong Magna Fund in Ningbo University.
ZP designed the proposed algorithm and drafted the manuscript. HH tested the proposed algorithm. FC carried out the depth video processing studies. GJ participated in the algorithm design. MY performed the statistical analysis. All authors read and approved the final manuscript.
The authors declare that they have no competing interests.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
1. M Tanimoto, FTV: free-viewpoint television. Signal Process. Image Commun. 27(6), 555–570 (2012)
2. A Aggoun, E Tsekleves, MR Swash, Immersive 3D holoscopic video system. IEEE Trans. Multimedia 20(1), 28–37 (2013)
3. C Fehn, Depth-image-based rendering (DIBR), compression and transmission for a new approach on 3D-TV (SPIE Conference on Stereoscopic Display and Virtual Reality System XI, San Jose, 2004), pp. 93–104
4. K Müller, P Merkle, G Tech et al., 3D video formats and coding methods (IEEE International Conference on Image Processing, Hong Kong, 2010), pp. 2389–2392
5. A De Abreu, P Frossard, F Pereira et al., Optimizing multiview video plus depth prediction structures for interactive multiview video streaming. IEEE J. Sel. Top. Sign. Proces. 9(3), 487–500 (2015)
6. G Cheung, A Kubota, A Ortega, Sparse representation of depth maps for efficient transform coding (Picture Coding Symposium, Nagoya, Japan, 2010), pp. 298–301
7. J Lei, S Li, C Zhu et al., Depth coding based on depth-texture motion and structure similarities. IEEE Trans. Circuits Syst. Video Technol. 25(2), 275–286 (2015)
8. S Liu, P Lai, D Tian et al., New depth coding techniques with utilization of corresponding video. IEEE Trans. Broadcast. 57(2), 551–561 (2011)
9. M-K Kang, Y-S Ho, Depth video coding using adaptive geometry based intra prediction for 3D video systems. IEEE Trans. Multimedia 14(1), 121–128 (2011)
10. K-J Oh, A Vetro, Y-S Ho, Depth coding using a boundary reconstruction filter for 3-d video systems. IEEE Trans. Circuits Syst. Video Technol. 21(3), 350–359 (2011)
11. Y Zhang, S Kwong, L Xu et al., Regional bit allocation and rate distortion optimization for multiview depth video coding with view synthesis distortion model. IEEE Trans. Image Process. 22(9), 3497–3511 (2013)
12. F Shao, W Lin, G Jiang, M Yu et al., Depth map coding for view synthesis based on distortion analyses. IEEE J. Emerging Sel. Top. Circuits Syst. 4(1), 106–117 (2014)
13. K Muller, H Schwarz, D Marpe et al., 3D High-efficiency video coding for multi-view video and depth data. IEEE Trans. Image Process. 22(9), 3366–3378 (2013)
14. J Smisek, M Jancosek, T Pajdla, 3D with Kinect (IEEE International Conference on Computer Vision Workshops, Barcelona, 2011), pp. 1154–1160
15. S Foix, G Alenyà, C Torras, Lock-in time-of-flight (ToF) cameras: a survey. IEEE Sensors J. 11(9), 1917–1926 (2011)
16. ISO/IEC JTC1/SC29/WG11, M16923, Depth estimation reference software (DERS) 5.0 (Xian, China, 2009)
17. W Hu, X Li, G Cheung et al., Depth map denoising using graph-based transform and group sparsity, IEEE 15th International Workshop on Multimedia Signal Processing (MMSP), 2013, pp. 1–6
18. V-A Nguyen, D Min, MN Do, Efficient techniques for depth video compression using weighted mode filtering. IEEE Trans. Circuits Syst. Video Technol. 23(2), 189–202 (2013)
19. Y Zhao, C Zhu, Z Chen et al., Depth no-synthesis-error model for view synthesis in 3D video. IEEE Trans. Image Process. 20(8), 2221–2228 (2011)
20. J Lei, C Zhang, Y Fang et al., Depth sensation enhancement for multiple virtual view rendering. IEEE Trans. Multimedia 17(4), 457–469 (2015)
21. DVSX De Silva, E Ekmekcioglu, WAC Fernando et al., Display dependent preprocessing of depth maps based on just noticeable depth difference modeling. IEEE J. Sel. Top. Sign. Proces. 5(2), 335–351 (2011)
22. Z Peng, G Jiang, M Yu et al., Temporal pixel classification and smoothing for higher depth video compression performance. IEEE Trans. Consum. Electron. 57(4), 1815–1822 (2011)
23. Z Peng, F Chen, G Jiang et al., Depth video spatial and temporal correlation enhancement algorithm based on just noticeable rendering distortion model. J. Vis. Commun. Image Represent. 33(11), 309–322 (2015)
24. GJ Sullivan, JR Ohm, WJ Han et al., Overview of the high efficiency video coding (HEVC) standard. IEEE Trans. Circuits Syst. Video Technol. 22(12), 1649–1668 (2012)
25. J-H Lee, K Goswami, B-G Kim et al., Fast encoding algorithm for high-efficiency video coding (HEVC) system based on spatio-temporal correlation. J. Real-Time Image Proc. 12(2), 407–418 (2016)
26. Y-J Ahn, D Sim, Square-type-first inter-CU tree search algorithm for acceleration of HEVC encoder. J. Real-Time Image Proc. 12(2), 419–432 (2016)
27. Q Zhang, P An, Y Zhang et al., Low complexity multi-view video plus depth coding. IEEE Trans. Consum. Electron. 57(4), 1857–1865 (2011)
28. L Shen, Z Zhang, Z Liu, Inter mode selection for depth map coding in 3D video. IEEE Trans. Consum. Electron. 58(3), 926–931 (2012)
29. JY Lee, HC Wey, DS Park, A fast and efficient multi-view depth image coding method based on temporal and inter-view correlations of texture images. IEEE Trans. Circuits Syst. Video Technol. 21(12), 1859–1868 (2011)
30. C-S Park, Edge-based intra mode selection for depth-map coding in 3D-HEVC. IEEE Trans. Image Process. 24(1), 155–162 (2015)
31. EG Mora, J Jung, M Cagnazzo et al., Initialization, limitation and predictive coding of the depth and texture quadtree in 3D-HEVC. IEEE Trans. Circuits Syst. Video Technol. 24(9), 1554–1565 (2014)
32. SH Tsang, YL Chan, WC Siu, Efficient intra prediction algorithm for smooth regions in depth coding. Electron. Lett. 48(18), 1117–1119 (2012)
33. Y Wang, Z Peng, G Jiang et al., Fast mode decision for depth video coding based on depth segmentation. KSII Trans. Internet Inf. Syst. 6(4), 1128–1139 (2012)
34. JCT-3V, H1003, Test model 8 of 3D-HEVC and MV-HEVC (Geneva, 2014)
35. JCT-3V, F1100, Common test conditions of 3dv core experiments (Valencia, ES, 2013)
36. P Hanhart, E Bosc, P Le Callet et al., Free-viewpoint video sequences: a new challenge for objective quality metrics, IEEE 16th International Workshop on Multimedia Signal Processing (MMSP), 2014, pp. 22–24
37. G Bjontegaard, Calculation of average PSNR differences between RD-curves (Video Coding Experts Group, Austin, 2001)