Monocular vision-based depth map extraction method for 2D to 3D video conversion
© Tsai and Fan. 2016
Received: 14 October 2015
Accepted: 26 April 2016
Published: 3 June 2016
Due to the demand for 3D visualization and the lack of 3D video content, methods that convert 2D video to 3D play an important role. In this paper, a low-cost and highly efficient post-processing method is presented to produce vivid 3D video. We present two semi-automatic depth map extraction methods for stereo video conversion. For static background video sequences, we propose a method that combines foreground segmentation with vanishing line technology. Given the foreground and background separated by the segmentation algorithm, the viewer can use acquired visual experience to initialize the operation with some background depth information. We further propose a conversion method for dynamic background video sequences, in which foreground segmentation is replaced by relative velocity estimation based on motion estimation and motion compensation. Combining the depth map from this work with the original 2D video, a vivid 3D video is produced.
3D video signal processing has become a major development trend with large potential in visual processing. However, the problem of 3D content generation still lingers: users can only watch computer-graphics 3D animations or movies produced with a particular camera setup. Given the lack of 3D media content, techniques that convert existing 2D content into 3D can play a key role in the growing 3D market.
Many methods have been proposed to convert 2D video to 3D video during the past few years. These conversion methods rely on different visual cues, ranging from motion information to perspective structures. Some methods are based on horizontal parallax; one such work is based on geometric and texture cues. The work by Jung et al. proposed a novel line tracing method based on the relative height depth cue. Modern methods take advantage of depth map information to render stereo and even multiple views for display, applying depth image-based rendering techniques to the display system. Additionally, motion estimation techniques have been used to aid moving object detection [4, 5]. Pourazad et al. [5] proposed an H.264-based scheme for 2D to 3D video conversion that uses the motion information between successive frames and addresses the depth ambiguity at the boundaries of moving objects. A similar approach utilized the spatio-temporal analysis of MPEG videos to produce stereoscopic video.
2D to 3D video conversion methods can be divided into two categories according to the degree of human-computer interaction: fully automatic and semi-automatic. Fully automatic methods generate 3D video directly from 2D without any human-computer interaction. However, creating a robust and stable fully automatic solution for arbitrary content remains a major issue, which justifies the need for human interaction in accurate stereo view generation. By introducing human-computer interaction, semi-automatic methods can balance quality and cost more flexibly than fully automatic methods while preserving the quality of the stereo view. Stereo quality and conversion cost are determined by the key frame intervals and the accuracy of the depth maps on key frames: a more accurate depth map improves the stereo quality but also increases the conversion cost. Therefore, a tradeoff has to be made to obtain satisfactory quality at an acceptable cost.
In this paper, a conversion method for 2D to 3D video is proposed. We apply a semi-automatic depth map extraction approach to provide high-quality stereo video for 3D entertainment. Two methods are constructed to deal with static and dynamic background scenes, respectively. The rest of this paper is organized as follows. Related works are reviewed in Section 2. Section 3 presents an overview of the proposed method. In Section 4, Method-1 is introduced, using a Gaussian mixture model for background modeling and moving object detection. In Section 5, Method-2 is introduced, using a relative velocity estimation method for moving object detection. Section 6 presents the depth extraction and depth fusion process shared by both methods. Visual results and comparison data are shown in Section 7, and a conclusion is given in Section 8.
2 Related works
This section reviews related work on fully automatic and semi-automatic methods, the two main approaches to 2D to 3D video conversion. Regarding fully automatic methods, Knorr et al. proposed a geometric segmentation approach for dynamic scenes, including a prioritized sequential algorithm for sparse 3D reconstruction and camera path estimation to efficiently reconstruct 3D scenes from broadcast video. A structure-from-motion method was proposed to automatically recover the 3D structure of a scene; however, it has limitations on camera and scene movement, which reduce its general applicability. Zhang et al. recovered consistent video depth maps with a novel method based on bundle optimization. Zhang et al. also proposed a method that organically integrates occlusion and visual attention to calculate depth maps. Recently, example-based video stereolization was proposed with foreground segmentation and depth propagation according to the key and non-key frames in 2D videos.
Many semi-automatic methods have been proposed in recent years. Guttmann et al. presented a semi-automatic system that propagates a sparse set of disparity values across a video; it employs classifiers combined with solving a linear system of equations, and it requires only a sparse set of disparity values on the first and last frames of the video clip, reducing manual labor. Yan et al. presented an effective method to semi-automatically generate high-quality depth maps for monocular images based on limited user inputs and depth propagation: the user specifies the depth values of selected pixels and locates the approximate positions of T-junctions, and depth maps are then generated by depth propagation combining the user inputs, color, and edge information. For stereoscopic 3D conversion, Phan et al. proposed a module that greatly reduces user input, as only the first frame needs to be marked. Within the semi-automatic approach, we presented a previous result for static background scenes. The main concept of that design is vanishing point detection for depth map realization. As discussed there, a scene with vanishing lines is the most representative and the easiest to manipulate. This simple technique often leads to representative results, especially in man-made environments, which typically present many regular structures and parallel lines.
3 Overview of the proposed method
3.1 System overview
In this paper, we propose techniques that extract scene geometry cues from a video and incorporate them into video segmentation and grouping. To infer depth information, several monocular depth cues have been proposed, such as texture variation gradients, haze, and defocus. Following the semi-automatic concept, the user sets five initial points. Four of these points induce two vanishing lines; the vanishing point that follows from these two lines is then combined with the fifth point to determine the horizon line and derive the depth map.
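The intersection of two such user-initiated vanishing lines can be computed in homogeneous coordinates. The sketch below is illustrative (the point values are invented for the example, not from the paper), assuming NumPy:

```python
import numpy as np

def vanishing_point(p1, p2, p3, p4):
    """Estimate the vanishing point as the intersection of the line
    through p1-p2 with the line through p3-p4, using homogeneous
    coordinates (a line through two points, and the intersection of
    two lines, are both cross products)."""
    def to_h(p):
        return np.array([p[0], p[1], 1.0])
    l1 = np.cross(to_h(p1), to_h(p2))   # line joining p1 and p2
    l2 = np.cross(to_h(p3), to_h(p4))   # line joining p3 and p4
    vp = np.cross(l1, l2)               # intersection of the two lines
    if abs(vp[2]) < 1e-9:               # lines parallel in the image plane
        return None
    return (vp[0] / vp[2], vp[1] / vp[2])

# Two converging "road edge" lines; they meet at x = 50, y = 0:
print(vanishing_point((0, 100), (25, 50), (100, 100), (75, 50)))
```

If the four points define parallel image lines, the homogeneous intersection lies at infinity and no finite vanishing point exists, which is why the third coordinate is checked.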
3.2 Overview on Method-1
For Method-1, we propose a novel semi-automatic approach in which the user specifies a set of initial conditions. First, vanishing line extraction is applied to these static background video sequences. To generate depth maps of these scenes, three key issues are addressed: first, acquiring the depth layers of the static scene; second, precisely segmenting the moving objects; third, assigning depth to the segmented objects. With precise segmentation of foreground and background, the separated scenes can be fused with the corresponding depth map.
3.3 Overview on Method-2
3.4 The advantages of the proposed method
Three main concepts underlie the proposed method. First, moving objects are usually the focus of the viewer's attention. Second, people can use their acquired monocular depth experience to directly provide key hints for computed depth generation, saving the computation time of background classification and monocular depth cue estimation. Third, to save operating time, the user should not need to interact frequently with the conversion system. Following these concepts, the segmented background/foreground result and a user-guided vanishing line extraction method are combined with motion information between neighboring frames.
4 Static background segmentation
Background modeling and moving object detection are based on the adaptive background subtraction method, in which each pixel is modeled as a mixture of Gaussians and an on-line approximation updates the model. The Gaussian distributions are then evaluated to determine which are most likely to result from a background process.
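As a rough illustration of the on-line update idea, the following deliberately simplified sketch keeps a single Gaussian per pixel rather than the full mixture of the cited method; the learning rate `alpha` and the k-sigma threshold are assumed parameters, not values from the paper:

```python
import numpy as np

def update_background(mu, var, frame, alpha=0.05, k=2.5):
    """One on-line update step of a per-pixel Gaussian background model.
    A single-Gaussian simplification of the adaptive mixture approach:
    pixels outside k standard deviations of the model are foreground;
    the rest are blended into the model with learning rate alpha."""
    frame = frame.astype(float)
    diff2 = (frame - mu) ** 2
    foreground = diff2 > (k ** 2) * var          # outside k sigma
    background = ~foreground
    # update only background-matched pixels; foreground leaves the model intact
    new_mu = np.where(background, (1 - alpha) * mu + alpha * frame, mu)
    new_var = np.where(background, (1 - alpha) * var + alpha * diff2, var)
    return new_mu, new_var, foreground

# Static 4x4 background (grey level 100) with one bright "moving object" pixel:
mu = np.full((4, 4), 100.0)
var = np.full((4, 4), 25.0)
frame = np.full((4, 4), 101.0)
frame[0, 0] = 200.0
mu, var, fg = update_background(mu, var, frame)
print(fg[0, 0], fg[1, 1])   # True False: only the outlier pixel is foreground
```

The full mixture version additionally ranks several Gaussians per pixel by weight and variance, replacing the weakest component when no match is found.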
4.1 Background modeling
4.2 Moving object detection
5 Dynamic background subtraction
5.1 Motion estimation on relative velocity
Motion vectors are extracted by block motion estimation, for which a full search algorithm is implemented. The search range and block size are chosen as a tradeoff between computation time and prediction precision. Given two adjacent frames, each block of the current frame is matched with the block of the reference frame that has the minimum sum of absolute differences (SAD). Currently, high efficiency video coding (HEVC) provides better efficiency than previous video coding standards; its high coding efficiency is suitable for sophisticated motion estimation, but it comes at the price of higher computational complexity [27, 28].
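A minimal full-search block matching sketch with SAD as the matching criterion might look as follows; the block size `B` and search range `R` are illustrative choices, not the settings used in the paper:

```python
import numpy as np

def full_search(ref, cur, bx, by, B=8, R=4):
    """Full-search block matching: find the motion vector of the BxB block
    of `cur` whose top-left corner is (bx, by), by minimising the sum of
    absolute differences (SAD) over a +/-R search window in `ref`."""
    H, W = ref.shape
    block = cur[by:by + B, bx:bx + B].astype(int)
    best_sad, best_mv = None, (0, 0)
    for dy in range(-R, R + 1):
        for dx in range(-R, R + 1):
            y, x = by + dy, bx + dx
            if y < 0 or x < 0 or y + B > H or x + B > W:
                continue                       # candidate falls outside frame
            sad = int(np.abs(ref[y:y + B, x:x + B].astype(int) - block).sum())
            if best_sad is None or sad < best_sad:
                best_sad, best_mv = sad, (dx, dy)
    return best_mv, best_sad

# A bright block in the reference frame; in the current frame it has
# moved by (+2, +1), so the best match in `ref` is at offset (-2, -1):
ref = np.zeros((32, 32), dtype=np.uint8)
ref[8:16, 8:16] = 255
cur = np.zeros((32, 32), dtype=np.uint8)
cur[9:17, 10:18] = 255
print(full_search(ref, cur, 10, 9))  # -> ((-2, -1), 0)
```

Exhaustive search guarantees the minimum-SAD match within the window, which is why fast approximations (and the parallel HEVC frameworks cited above) matter at scale.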
5.2 Moving object extraction
5.3 Object filling
6 Depth extraction and depth fusion process
We generate the depth map information by exploiting the line features of a single frame, which serve as the input source for extracting the geometric cue, i.e., the vanishing point (VP). Most scenes are composed of parallel lines. Parallel lines in real scenes appear to converge with distance in a perspective image, eventually reaching a VP at the horizon; linear perspective is therefore an important depth cue for these scenes. Taking into account the information collected in the pre-process analysis, a series of intermediate steps recovers the final depth map. These steps can be summarized as vanishing line extraction, gradient plane generation, and depth gradient assignment.
6.1 Vanishing line extraction and gradient plane generation
Up case: 0 ≤ X_vp ≤ W ∩ Y_vp ≥ H
Down case: 0 ≤ X_vp ≤ W ∩ Y_vp ≤ 0
Right case: X_vp ≥ W ∩ 0 ≤ Y_vp ≤ H
Left case: X_vp ≤ 0 ∩ 0 ≤ Y_vp ≤ H
Inside case: 0 < X_vp < W ∩ 0 < Y_vp < H
Default case: otherwise (none of the above five cases)
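The six cases above translate directly into a classification function over the VP coordinates; the sketch below simply mirrors the listed conditions (the image coordinate convention, i.e., which direction "up" corresponds to, is an assumption):

```python
def classify_vp(x_vp, y_vp, W, H):
    """Classify the vanishing point position relative to the WxH frame,
    following the six cases listed above; the tests are checked in the
    same order as the list, and anything unmatched is the default case."""
    if 0 <= x_vp <= W and y_vp >= H:
        return "up"
    if 0 <= x_vp <= W and y_vp <= 0:
        return "down"
    if x_vp >= W and 0 <= y_vp <= H:
        return "right"
    if x_vp <= 0 and 0 <= y_vp <= H:
        return "left"
    if 0 < x_vp < W and 0 < y_vp < H:
        return "inside"
    return "default"

print(classify_vp(320, 240, 640, 480))  # VP within the frame -> inside
print(classify_vp(700, 100, 640, 480))  # VP beyond the right edge -> right
print(classify_vp(700, 500, 640, 480))  # diagonal corner region -> default
```

Note that the default case covers the four corner regions outside the frame, where neither coordinate test of the directional cases holds.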
6.2 Background depth extraction
Higher depth level corresponds to lower grey values.
The VP is the most distant point from the observer.
L_p is the most distant of each pair of continuous farther points of the p-th area. LL is a length computed from two middle points: one middle point lies between FP_p and FP_{p+1}, and the other between CP_p and CP_{p+1}. R is a depth range proportional to the distance within the video frame. W′ stands for a user-defined weighting factor applied when LL is smaller than 255, and W_n stands for the weighting factor that assigns the depth level of the nth area. The depth layers of the static background scene are extracted in this way.
In each case, the distance between the VP and the most distant intersection of the vanishing line with the frame boundary is taken as proportional to the depth range. The depth assignment of the inside case combines the four directional cases, excluding the default case. In the default case, only the frame boundary with the most CPs is considered, choosing among the four directional cases and excluding the inside case. Human vision is more sensitive to depth variations for close objects than for far ones; in other words, faster vanishing-line convergence induces higher depth variation. Thus, the depth levels have an increasing slope from the closest position to the farthest VP.
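As an illustration of assigning grey levels that grow away from the VP, a simplified row-wise sketch follows; it uses a linear ramp instead of the increasing slope described above, and assumes the convention that the VP side is the darkest (farthest):

```python
import numpy as np

def background_depth(W, H, y_vp):
    """Simplified background depth gradient for a WxH frame whose VP lies
    on row y_vp: the VP row (farthest) gets grey level 0 and the bottom
    row (closest) 255.  A real gradient plane would follow the vanishing
    lines rather than image rows, and use a non-linear slope."""
    rows = np.arange(H, dtype=float)
    # distance of each row below the VP; rows at or above the VP stay at 0
    d = np.clip(rows - y_vp, 0, None)
    levels = 255.0 * d / d.max() if d.max() > 0 else np.zeros(H)
    return np.tile(levels[:, None], (1, W)).astype(np.uint8)

depth = background_depth(8, 8, 2)
print(depth[0, 0], depth[-1, 0])   # far (VP side) -> 0, near (bottom) -> 255
```

This corresponds to the "up"-type geometry; the other directional cases would ramp along columns or the opposite row direction.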
6.3 Moving object depth extraction
7 Experimental results
7.1 Simulation results
7.2 Evaluation and implementation results
In this paper, a monocular vision-based depth map extraction method for 2D to 3D video conversion is presented. We propose a low-complexity, highly integrated mechanism that also takes the characteristics of the sequence into account. First, scene detection classifies the characteristics of the background scene; we then provide two conversion methods for static and dynamic backgrounds, respectively. The semi-automatic vanishing line extraction method saves considerable computation time and increases the precision of vanishing line detection. The estimated depth map can be used to generate the right- and left-view images of each frame, producing a 3D video result. The results indicate that the proposed framework can perform 2D to 3D conversion with high quality.
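The left/right view generation from a depth map can be sketched as a minimal horizontal-shift rendering (disocclusion hole filling is omitted, and `max_disp`, the maximum parallax, is an assumed parameter rather than a value from the paper):

```python
import numpy as np

def render_stereo(frame, depth, max_disp=8):
    """Generate left/right views from a grayscale 2D frame and its depth
    map by per-pixel horizontal shifting: nearer pixels (higher grey in
    the depth map) receive larger disparity.  A minimal DIBR-style sketch."""
    H, W = depth.shape
    disp = (depth.astype(float) / 255.0 * max_disp / 2).astype(int)
    left = np.zeros_like(frame)
    right = np.zeros_like(frame)
    cols = np.arange(W)
    for y in range(H):
        xl = np.clip(cols + disp[y], 0, W - 1)   # shift right for left eye
        xr = np.clip(cols - disp[y], 0, W - 1)   # shift left for right eye
        left[y, xl] = frame[y, cols]
        right[y, xr] = frame[y, cols]
    return left, right

# With a flat (zero) depth map there is no parallax, so both views
# reproduce the input frame exactly:
frame = np.arange(32, dtype=np.uint8).reshape(4, 8)
flat = np.zeros((4, 8), dtype=np.uint8)
left, right = render_stereo(frame, flat)
print(np.array_equal(left, frame), np.array_equal(right, frame))  # True True
```

Production renderers additionally fill the holes exposed at depth discontinuities, typically by background-side interpolation.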
This research was supported by the Ministry of Science and Technology, Taiwan, under Grant 104-2220-E-008-001.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
1. K Han, K Hong, Geometric and texture cue based depth-map estimation for 2D to 3D image conversion, IEEE International Conference on Consumer Electronics (ICCE), 651–652 (2011). doi:10.1109/ICCE.2011.5722790
2. YJ Jung, A Baik, J Kim, D Park, A novel 2D-to-3D conversion technique based on relative height depth cue. Proc. SPIE 7234, 72371U-1–72371U-8 (2009)
3. R Liu, W Tan, YJ Wu, YC Tan, B Li, H Xie, G Tai, X Xu, Deinterlacing of depth-image-based three-dimensional video for a depth-image-based rendering system. J. Electron. Imaging 22(3), 033031 (2013)
4. XJ Huang, LH Wang, JJ Huang, DX Li, M Zhang, A depth extraction method based on motion and geometry for 2D to 3D conversion, in Third International Symposium on Intelligent Information Technology Application (IITA), vol. 3, 2009, pp. 294–298
5. MT Pourazad, P Nasiopoulos, RK Ward, An H.264-based scheme for 2D to 3D video conversion. IEEE Trans. Consum. Electron. 55(2), 742–748 (2009)
6. GS Lin, HY Huang, WC Chen, CY Yeh, KC Liu, WN Lie, A stereoscopic video conversion scheme based on spatio-temporal analysis of MPEG videos. EURASIP J. Adv. Signal Process. (2012). doi:10.1186/1687-6180-2012-237
7. Z Li, X Cao, X Dai, A novel method for 2D-to-3D video conversion using bi-directional motion estimation, in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2012
8. S Knorr, E Imre, B Ozkalayci, A Alatan, T Sikora, A modular scheme for 2D/3D conversion of TV broadcast. Proc. 3DPVT, 703–710 (2006). doi:10.1109/3DPVT.2006.15
9. T Huang, A Netravali, Motion and structure from feature correspondences: a review. Proc. IEEE 82(2), 252–268 (1994)
10. G Zhang, J Jia, T Wong, H Bao, Recovering consistent video depth maps via bundle optimization. Proc. CVPR, 1–8 (2008). doi:10.1109/CVPR.2008.4587496
11. J Zhang, Y Yang, Q Dai, A novel 2D-to-3D scheme by visual attention and occlusion analysis. 3DTV Conf., 1–4 (2011). doi:10.1109/3DTV.2011.5877189
12. L Wang, C Jung, Example-based video stereolization with foreground segmentation and depth propagation. IEEE Trans. Multimed. 16(7), 1905–1914 (2014)
13. M Guttmann, L Wolf, D Cohen-Or, Semi-automatic stereo extraction from video footage. Proc. ICCV, 136–142 (2009). doi:10.1109/ICCV.2009.5459158
14. X Yan, Y Yang, G Er, Q Dai, Depth map generation for 2D-to-3D conversion by limited user inputs and depth propagation. 3DTV Conf., 1–4 (2011). doi:10.1109/3DTV.2011.5877167
15. R Phan, D Androutsos, Robust semi-automatic depth map generation in unconstrained images and video sequences for 2D to stereoscopic 3D conversion. IEEE Trans. Multimed. 16(1), 122–136 (2014)
16. TH Tsai, CS Fan, CC Huang, Semi-automatic depth map extraction method for stereo video conversion, in Sixth International Conference on Genetic and Evolutionary Computing, 2012
17. A Almansa, A Desolneux, S Vamech, Vanishing point detection without any a priori information. IEEE Trans. Pattern Anal. Mach. Intell. 25, 502–507 (2003). doi:10.1109/TPAMI.2003.1190575
18. C Stauffer, WEL Grimson, Adaptive background mixture models for real-time tracking, in IEEE Conference on Computer Vision and Pattern Recognition, Colorado, USA, pp. 246–252, June 1999
19. A Saxena, J Schulte, AY Ng, Depth estimation using monocular and stereo cues. Int. Joint Conf. Artif. Intell., 2197–2203 (2007)
20. V Cantoni, L Lombardi, M Porta, N Sicari, Vanishing point detection: representation analysis and new approaches, Dip. di Informatica e Sistemistica, Università di Pavia, IEEE 2001
21. J Zhong, S Sclaroff, Segmenting foreground objects from a dynamic textured background via a robust Kalman filter. Ninth IEEE Int. Conf. Comput. Vis. 2, 44–50 (2003)
22. A Monnet, A Mittal, N Paragios, V Ramesh, Background modeling and subtraction of dynamic scenes. Ninth IEEE Int. Conf. Comput. Vis. 2, 1305–1312 (2003)
23. DS Lee, Effective Gaussian mixture learning for video background subtraction. IEEE Trans. Pattern Anal. Mach. Intell. 27, 827–832 (2005)
24. TH Tsai, WT Sheu, CY Lin, Foreground object detection based on multi-model background maintenance, in IEEE International Symposium on Multimedia Workshops, 2007, pp. 151–159
25. YC Lin, SC Tai, Fast full-search block-matching algorithm for motion-compensated video compression. Proc. 13th Int. Conf. Pattern Recog. 3, 914–918 (1996)
26. SL Kilthau, MS Drew, T Moller, Full search content independent block matching based on the fast Fourier transform. Int. Conf. Image Process. 1, 669–672 (2002)
27. C Yan, Y Zhang, J Xu, F Dai, J Zhang, Q Dai, F Wu, Efficient parallel framework for HEVC motion estimation on many-core processors. IEEE Trans. Circuits Syst. Video Technol. 24(12), 2077–2089 (2014)
28. C Yan, Y Zhang, J Xu, F Dai, L Li, Q Dai, F Wu, A highly parallel framework for HEVC coding unit partitioning tree decision on many-core processors. IEEE Signal Process. Lett. 21, 573–576 (2014)
29. DDD. http://www.dddgroupplc.com/