# A novel method for 2D-to-3D video conversion based on boundary information

Tsung-Han Tsai^{1}, Tai-Wei Huang^{1} and Rui-Zhi Wang^{1}

**2018**:2

https://doi.org/10.1186/s13640-017-0239-5

© The Author(s). 2018

**Received:** 11 May 2017

**Accepted:** 3 December 2017

**Published:** 8 January 2018

## Abstract

This paper proposes a novel method for 2D-to-3D video conversion that uses boundary information to generate the depth map automatically. First, we use a Gaussian mixture model to detect foreground objects and separate the foreground from the background. Second, we employ a superpixel algorithm to find the edge information. According to the superpixels, we assign the corresponding hierarchical depth values to an initial depth map. From the result of the depth value assignment, we detect edges by Sobel edge detection with two thresholds to strengthen the edge information. To identify the boundary pixels, we use a thinning algorithm to refine the detected edges. Following these results, we assign the depth value of the foreground to refine the depth map. We then scan the entire image along four kinds of scanning paths to create a more accurate depth map, which yields the final depth map. Finally, we utilize depth image-based rendering (DIBR) to synthesize left and right view images. After combining the depth map and the original 2D video, a vivid 3D video is produced.

## 1 Introduction

In the field of visual processing, 3D image processing has become very popular in recent years. Compared with the traditional 2D visual experience, 3D displays enable a number of new applications, including education, games, movies, and cameras, and the demand for 3D video is still growing. At present, however, users can only watch 3D animation or 3D movies that were produced on a computer or shot with a special camera. This lack of 3D videos makes 2D-to-3D image conversion quite practical.

Synthesizing a 3D image from a 2D image is performed in two steps: estimating the depth map of the original 2D image and then using this depth map to synthesize a 3D stereoscopic image. Thus, the quality of the depth map largely determines the quality of the 3D image. Depending on whether human intervention is required, 2D-to-3D image synthesis can be divided into two kinds: automatic and semi-automatic [1]. In automatic methods, human-computer interactions are not involved. The process relies on different visual cues, ranging from motion information to perspective structures. In [2], a geometric and material-based algorithm was proposed, but the major issue of a fully automatic method is creating a robust and stable solution for any general content. This brought about semi-automatic methods, which contain some human-computer interactions to balance quality and cost. The stereo quality and conversion cost are determined by the key frame intervals and the accuracy of the depth maps on key frames. Guttmann et al. set up a semi-automatic depth map method [3] with a sparse labeling depth estimation method. It handles the depth of several key images semi-automatically and synthesizes the others automatically to improve accuracy. Obviously, how the key images are determined influences the accuracy of the entire depth map.

Regardless of whether the algorithm is fully automatic or semi-automatic, a static scene with moving parts is one of the most common cases. To generate a depth map in this scenario, the static background is typically treated as a layer of depth values, which allows moving objects to be captured accurately. In [1], motion estimation [4] was used to present a conversion that requires only a few user instructions on key frames and propagates the depth maps to non-key frames. Huang et al. used the H.264 codec to encode the motion vectors and combined two kinds of depth cues: motion information and geometric perspective [5]. Raza et al. [6] presented a method for dynamic scenes. It needs several kinds of information, including object shape, amount of movement, occlusion boundaries, and scene geometry, to obtain depth information.

Since depth information is the most important issue for 2D-to-3D conversion, producing an accurate depth map is critical. Depth map generation can be classified into single-frame and multi-frame methods. Multi-frame methods are based on stereo or multi-view input with the related depth information. Depth from motion is obtained from relative velocity information [7–9]. In [7], the authors used depth from motion and solved a multi-frame structure-from-motion problem. The method uses the epipolar criterion to segment the features belonging to independently moving objects. In [9], the authors applied a block-based method together with a bilateral filter to diminish the block effect and to generate a comfortable depth map.

The classification of the methods on depth map generation

Algorithm | Depth cue | Comments |
---|---|---|
Multi-frame | Depth from motion | Use relative velocity to judge depth information |
Single-frame | Depth from perspective geometry | Vanishing line detection |
Single-frame | Depth from model | Color theory |
Single-frame | Depth from defocus | Use blur information to get depth value |
Single-frame | Depth from visual saliency | Estimation in region of interest |

In this work, we propose a fully automatic 2D-to-3D conversion method, based on a single-frame approach, to generate stereo visual results. We use a GMM (Gaussian mixture model) and SLIC (simple linear iterative clustering) to generate the initial depth map and then utilize edge information and four kinds of scanning paths to refine the depth values. Afterwards, we have a precise final depth map. By DIBR, we produce the left and right view images to complete the 2D-to-3D conversion. This paper is organized as follows. Section 2 provides an overview of the proposed method. Section 3 describes the Gaussian mixture model. Section 4 discusses the superpixel technique. Section 5 describes depth extraction and depth map refinement. Section 6 provides the experimental results and discussion. Finally, a conclusion is given in Section 7.

## 2 Overview of the proposed method

1. The conversion is fully automatic;
2. Foreground detection and edge information help unify the depth value on the object;
3. The superpixel algorithm clusters pixels with similar information;
4. Six kinds of initial gradient hypotheses are used for the initial depth map;
5. Four kinds of scanning modes are used to fix the depth map;
6. Through a Hough transform, we only extract one-line information to get the slope of the line.

## 3 Gaussian mixture model

### 3.1 Background modeling

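The history of a pixel's values up to time *t* can be written as *X* = {*X*_{1}, …, *X*_{t}}. This pixel is modeled by a mixture of *K* Gaussian distributions. The probability of observing the current pixel value *P*(*X*_{t}) is shown in (1).

\( P\left({X}_t\right)=\sum_{i=1}^{K}{\omega}_{i,t}\, p\left({X}_t,{\mu}_{i,t},{\Sigma}_{i,t}\right) \) (1)

where *K* is the number of distributions, *ω*_{i,t} is an estimate of the weight of the *i*th Gaussian in the mixture at time *t*, *μ*_{i,t} is the mean value of the *i*th Gaussian in the mixture at time *t*, Σ_{i,t} is the covariance matrix of the *i*th Gaussian in the mixture at time *t*, and *p* is a Gaussian probability density function shown as:

\( p\left({X}_t,\mu, \Sigma \right)=\frac{1}{{\left(2\pi \right)}^{n/2}{\left|\Sigma \right|}^{1/2}}\exp \left(-\frac{1}{2}{\left({X}_t-\mu \right)}^T{\Sigma}^{-1}\left({X}_t-\mu \right)\right) \) (2)

Here, *n* is the dimension of *X*_{t}.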

According to each Gaussian's parameters, we can evaluate which distribution most accurately describes the background. Based on the variance and the persistence of each Gaussian in the mixture, we determine which Gaussians may correspond to the background colors. Because there is a mixture model for every pixel in the image, we use [27] to execute our algorithm. Each new pixel value *X*_{t} is checked against the existing *K* Gaussian distributions until a match is found. A match is defined as a pixel value whose difference from the mean lies within a threshold of the covariance. If one of the *K* distributions matches the current pixel value, the parameters of that distribution are updated. Gaussian distributions with greater weight and smaller variance are usually identified as the background model. Figure 2 shows the results of the background model.

Here, *η*(*t*) is a convergence factor used in the parameter update. In traditional methods, this factor is set either to a constant value or to a variable that decreases at a constant rate over time. These choices lead to slow convergence or to difficulty in converging, respectively. Our modification solves these issues.
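The match-and-update loop described above can be sketched for a single grayscale pixel as follows. This is only an illustration under stated assumptions, not the implementation of [27]: the class name `PixelGMM`, the constant learning rate `alpha`, the 2.5-standard-deviation match rule, the variance floor, and the replacement of the weakest component on a miss are borrowed from the common mixture-of-Gaussians formulation, and the adaptive convergence factor *η*(*t*) is not modeled.

```python
import numpy as np

class PixelGMM:
    """Mixture of K Gaussians for one grayscale pixel (illustrative only)."""

    def __init__(self, k=3, alpha=0.05, init_var=225.0, match_sigmas=2.5):
        self.w = np.full(k, 1.0 / k)           # component weights
        self.mu = np.linspace(0.0, 255.0, k)   # component means (assumed init)
        self.var = np.full(k, init_var)        # component variances
        self.alpha = alpha                     # assumed constant learning rate
        self.init_var = init_var
        self.match_sigmas = match_sigmas

    def update(self, x):
        """Feed one pixel value; return the index of the component that took it."""
        sigma = np.sqrt(self.var)
        matched = np.abs(x - self.mu) <= self.match_sigmas * sigma
        hit = np.zeros_like(self.w)
        if matched.any():
            # Among matching components, prefer the one with the largest w/sigma.
            cand = np.flatnonzero(matched)
            i = int(cand[np.argmax(self.w[cand] / sigma[cand])])
            self.mu[i] += self.alpha * (x - self.mu[i])
            self.var[i] += self.alpha * ((x - self.mu[i]) ** 2 - self.var[i])
            self.var[i] = max(self.var[i], 1.0)  # floor keeps sigma usable
            hit[i] = 1.0
        else:
            # No match: replace the least probable component (assumption).
            i = int(np.argmin(self.w))
            self.mu[i] = x
            self.var[i] = self.init_var
        self.w = (1.0 - self.alpha) * self.w + self.alpha * hit
        self.w /= self.w.sum()
        return i

    def background_index(self):
        """Component with the largest weight-to-spread ratio."""
        return int(np.argmax(self.w / np.sqrt(self.var)))
```

After a pixel has shown a stable value for a while, the component that tracks it gains weight and loses variance, so `background_index()` points to it, and a sudden different value lands in another component.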

### 3.2 Moving object detection

Each Gaussian distribution is evaluated by *ω/σ*, where *σ* is the covariance of the distribution. The Gaussian distribution with the maximum *ω/σ* is taken as the background model. Moving objects can then be distinguished from the original 2D image through the *i*th Gaussian with the maximum *ω/σ*, which serves as the background model. We then binarize the moving object and background. For easy representation, the pixels of a moving object are assigned a white color, and the background is black. In this determination, *A*(*x*, *y*) is the binary result of moving object detection, *I*(*x*, *y*) is the current pixel value of the image, and *T* denotes a threshold. Figure 2a, c shows the results of moving object detection.
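A minimal sketch of this binarization is given below. It assumes that the comparison is an absolute difference between the current pixel value and the background value (for example, the mean of the background Gaussian); the function name `binarize_foreground` and the array `background` are hypothetical names introduced for illustration.

```python
import numpy as np

def binarize_foreground(frame, background, threshold):
    """Return 255 (white) where the frame departs from the background, else 0.

    Assumption: a pixel belongs to a moving object when the absolute
    difference from the background value exceeds the threshold T.
    """
    diff = np.abs(frame.astype(np.int32) - background.astype(np.int32))
    return np.where(diff > threshold, 255, 0).astype(np.uint8)
```

For example, with a uniform background of value 10 and a threshold of 30, only pixels that differ from 10 by more than 30 are marked white.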

## 4 SLIC superpixels

This section introduces SLIC superpixels. According to [28], using SLIC to generate superpixels is faster than other superpixel methods, e.g., the normalized cuts algorithm. It achieves state-of-the-art boundary adherence more efficiently and improves the performance of segmentation algorithms. We further modify the SLIC method to speed up this process.

### 4.1 Concept of SLIC algorithm

Simple linear iterative clustering is an adaptation of *K*-means for superpixel generation. The only parameter of the algorithm is *k*, which represents the number of approximately equally sized superpixels. For color images, we transform the color space from YUV to CIELAB. The clustering procedure begins with an initialization step in which *k* initial cluster centers, called *C*_{k}, are placed on a regular grid. The grid interval is *S* = √(*N*/*k*), where *N* is the number of pixels in the image, and *C*_{i} = [*l*_{i}, *a*_{i}, *b*_{i}, *x*_{i}, *y*_{i}]^{T} combines the color and location information. The label of every pixel is initialized to −1, and the distance value *D* is initialized to infinity.

The conventional *K*-means method searches the entire image for each cluster center, so its complexity is *O*(*kNI*), where *I* is the number of iterations. SLIC instead limits the search to a 2*S* × 2*S* region around each cluster center. Thus, the complexity is reduced to *O*(*N*).
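The initialization step can be sketched as follows. The function name `init_slic_centers` and the choice of sampling the centers at the middle of each grid cell are assumptions made for illustration; the input is assumed to be an image already converted to CIELAB.

```python
import numpy as np

def init_slic_centers(lab_image, k):
    """Place about k cluster centers on a regular grid with interval S = sqrt(N/k)."""
    h, w, _ = lab_image.shape
    n = h * w
    s = max(int(round(np.sqrt(n / k))), 1)
    ys = np.arange(s // 2, h, s)   # assumed: sample at the middle of each cell
    xs = np.arange(s // 2, w, s)
    # Each center is [l, a, b, x, y].
    centers = np.array(
        [[*lab_image[y, x], x, y] for y in ys for x in xs], dtype=float
    )
    labels = np.full((h, w), -1, dtype=int)   # pixel labels start at -1
    distances = np.full((h, w), np.inf)       # distances start at infinity
    return centers, labels, distances, s
```

Because the grid is regular, the number of centers actually produced can differ slightly from *k* when the image size is not a multiple of *S*.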

### 4.2 Distance measurement

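Each pixel is described by a five-dimensional vector, where [*l a b*]^{T} is the color information and [*x y*]^{T} is the location information of the pixel. The distance value *D* helps analyze the correlation among pixels and decides whether they can be classified into the same superpixel or not. We calculate the color distance and the location distance separately, as shown in (6, 7).

\( {d}_c=\sqrt{{\left({l}_j-{l}_i\right)}^2+{\left({a}_j-{a}_i\right)}^2+{\left({b}_j-{b}_i\right)}^2} \) (6)

\( {d}_s=\sqrt{{\left({x}_j-{x}_i\right)}^2+{\left({y}_j-{y}_i\right)}^2} \) (7)

We use the location coefficient *N*_{S} and the color space coefficient *N*_{C} to normalize the distance value. We can express *D*' as:

\( {D}^{\prime }=\sqrt{{\left(\frac{d_c}{N_C}\right)}^2+{\left(\frac{d_s}{N_S}\right)}^2} \) (8)

where *N*_{S} is the largest distance of the grid, and *N*_{C} varies according to different images or the initialized grid. We apply a parameter *m* to control how tightly the superpixel edges follow the image boundaries. Taking *N*_{S} = *S* and replacing *N*_{C} by the constant *m*, (8) is changed as follows:

\( D=\sqrt{{d_c}^2+{\left(\frac{d_s}{S}\right)}^2{m}^2} \) (9)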

After the distance value *D* is calculated, the label of each pixel is updated. If the pixel has the smallest distance value *D* to the *k*th grid center, its label is updated to *k*. After each pixel has a corresponding label value, we average the color and location information of the pixels with the same label value to get the grid's new center. The process is repeated until convergence.
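One assignment pass of this procedure can be sketched as below. It assumes the standard SLIC distance from [28], with a color term and a location term scaled by the grid interval and weighted by *m*, and a search window of 2*S* × 2*S* around each center; the function names and the default `m=10.0` are illustrative choices, and the speed-up modification mentioned in this section is not modeled.

```python
import numpy as np

def slic_distance(center, pixel, s, m):
    """Combined distance between two [l, a, b, x, y] vectors."""
    d_c = np.linalg.norm(pixel[:3] - center[:3])   # color distance
    d_s = np.linalg.norm(pixel[3:] - center[3:])   # location distance
    return np.sqrt(d_c ** 2 + (d_s / s) ** 2 * m ** 2)

def assign_labels(lab_image, centers, s, m=10.0):
    """Give each pixel the label of the nearest center within a 2S x 2S window."""
    h, w, _ = lab_image.shape
    labels = np.full((h, w), -1, dtype=int)
    distances = np.full((h, w), np.inf)
    for k, c in enumerate(centers):
        cx, cy = int(c[3]), int(c[4])
        y0, y1 = max(cy - s, 0), min(cy + s + 1, h)
        x0, x1 = max(cx - s, 0), min(cx + s + 1, w)
        for y in range(y0, y1):
            for x in range(x0, x1):
                p = np.array([*lab_image[y, x], x, y], dtype=float)
                d = slic_distance(c, p, s, m)
                if d < distances[y, x]:
                    distances[y, x] = d
                    labels[y, x] = k
    return labels
```

A full run would alternate this pass with recomputing each center as the mean of its assigned pixels until the centers stop moving.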

*E* in Fig. 4a denotes the residual error.

## 5 Depth extraction and depth fusion process

We employ edge information and superpixels to generate the depth map. After finding the foreground, a similar depth value is assigned to the whole object by using the edge information of the object in the current frame. Depth extraction and depth fusion are the key techniques that determine the visual quality of the synthesized 3D result. In our approach, we assign the corresponding hierarchical depth values according to the superpixel information. We then use Sobel edge detection and a thinning algorithm to capture the objects. Finally, we scan the entire image in four directions and correct the depth values to obtain the full depth map.

### 5.1 Depth from prior hypothesis

1. \( \mathrm{Depth}=\mathrm{White}-i\times \left(\frac{\mathrm{White}}{\mathrm{Height}}\right),\quad \mathrm{where}\ \mathrm{slope}>1\ \mathrm{or}\ \mathrm{slope}<-1,\quad i=\left\{1,2,3,\dots, \mathrm{Height}\right\} \) (10).
2. \( \mathrm{Depth}=\mathrm{White}-i\times \left(\frac{\mathrm{White}}{\mathrm{Width}}\right),\quad \mathrm{where}\ \mathrm{slope}=0,\quad i=\left\{1,2,3,\dots, \mathrm{Width}\right\} \) (11).
3. \( {\mathrm{Depth}}_t=\mathrm{White}-i\times \left(\frac{\mathrm{White}}{\sqrt{{\mathrm{Width}}^2+{\mathrm{Height}}^2}}\right) \)
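The first two hypotheses can be sketched as linear ramps over the image. This is a hedged illustration: it assumes that White is the maximum 8-bit gray level 255, that *i* is the row index in (10) and the column index in (11), and it leaves out the diagonal case because the text above does not state how *i* is indexed along the diagonal. The function names are hypothetical.

```python
import numpy as np

WHITE = 255.0  # assumed maximum depth value (8-bit gray level)

def depth_by_rows(height, width, white=WHITE):
    """Eq. (10): depth changes linearly with the row index i = 1..height."""
    i = np.arange(1, height + 1, dtype=float)
    column = white - i * (white / height)
    return np.tile(column[:, None], (1, width))

def depth_by_columns(height, width, white=WHITE):
    """Eq. (11): depth changes linearly with the column index i = 1..width."""
    i = np.arange(1, width + 1, dtype=float)
    row = white - i * (white / width)
    return np.tile(row[None, :], (height, 1))
```

In both cases the ramp reaches 0 at the last row or column, since *i* then equals the height or width.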

### 5.2 Sobel edge detection

The Sobel operator computes the horizontal gradient *G*_{x} and the vertical gradient *G*_{y}. After deriving the gradients, we use (13) to make the weighted gradient *G*_{z}.

We compare the gradient value of pixel *P* with a threshold to decide whether an edge exists. To obtain a good result, we propose a two-threshold method. These two thresholds give two different sets of edge data. The large threshold retains strong edge information and masks the noise on boundary pixels. The small threshold retains weak edge information and thus also keeps more noise. Through this two-threshold mechanism, the result is more precise than a single-threshold solution.
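A minimal sketch of the two-threshold idea is shown below. The combined gradient is taken here as the Euclidean magnitude of *G*_{x} and *G*_{y}, which is an assumption and may differ from the weighting used in (13); the threshold values and the names `filter3x3` and `two_threshold_edges` are illustrative, and how the two edge maps are merged afterwards is not modeled.

```python
import numpy as np

# Standard 3x3 Sobel kernels for horizontal and vertical gradients.
KX = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
KY = KX.T

def filter3x3(image, kernel):
    """Apply a 3x3 kernel with edge-replicated borders (pure NumPy)."""
    padded = np.pad(image.astype(float), 1, mode="edge")
    h, w = image.shape
    out = np.zeros((h, w))
    for dy in range(3):
        for dx in range(3):
            out += kernel[dy, dx] * padded[dy:dy + h, dx:dx + w]
    return out

def two_threshold_edges(image, low, high):
    """Return (strong, weak) boolean edge maps for a large and a small threshold."""
    gx = filter3x3(image, KX)
    gy = filter3x3(image, KY)
    magnitude = np.hypot(gx, gy)   # assumed combination of Gx and Gy
    strong = magnitude >= high     # large threshold: strong edges only
    weak = magnitude >= low        # small threshold: also weak edges and noise
    return strong, weak
```

By construction, every strong edge pixel is also a weak edge pixel, so the weak map can be used to reconnect gaps in the strong map.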

### 5.3 Depth refinement

1. From the lower right to the upper left: if the pixel's depth value is smaller than that of its right neighbor, it is replaced by the right neighbor's value;
2. From the lower left to the upper right: if the pixel's depth value is smaller than that of its left neighbor, it is replaced by the left neighbor's value;
3. From the upper left to the lower right: if the pixel's depth value is smaller than that of its upper neighbor, it is replaced by the upper neighbor's value;
4. From the upper right to the lower left: if the pixel's depth value is smaller than that of its lower neighbor, it is replaced by the lower neighbor's value.
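The four scanning rules can be sketched literally as follows. The sketch applies each rule to every pixel and does not model how edge or foreground information constrains the propagation, so it should be read as an illustration of the scan order and neighbor comparison only; the function names are hypothetical.

```python
import numpy as np

def scan_pass(depth, bottom_up, right_to_left, dy, dx):
    """One scan: replace a pixel by its (dy, dx) neighbor when the pixel is smaller."""
    out = depth.copy()
    h, w = out.shape
    rows = range(h - 1, -1, -1) if bottom_up else range(h)
    cols = range(w - 1, -1, -1) if right_to_left else range(w)
    for y in rows:
        for x in cols:
            ny, nx = y + dy, x + dx
            if 0 <= ny < h and 0 <= nx < w and out[y, x] < out[ny, nx]:
                out[y, x] = out[ny, nx]
    return out

def refine_depth(depth):
    """Apply the four scans in the order listed above."""
    d = scan_pass(depth, True, True, 0, 1)     # 1: lower right to upper left, right neighbor
    d = scan_pass(d, True, False, 0, -1)       # 2: lower left to upper right, left neighbor
    d = scan_pass(d, False, False, -1, 0)      # 3: upper left to lower right, upper neighbor
    d = scan_pass(d, False, True, 1, 0)        # 4: upper right to lower left, lower neighbor
    return d
```

Since each rule only ever raises a value to that of a neighbor, the refined map is never smaller than the input at any pixel.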

## 6 Experimental results and discussion

### 6.1 Simulation results

The test sequences include *Hall*, *Subway*, *Station*, *Hall monitor*, *Laboratory*, *Bridge close*, and *Flower*. For brevity, we only show the depth map result and the synthesized view, as illustrated in Fig. 11.

Execution time for each sequence (on PC environment)

Sequence | Time (s) |
---|---|
Highway | 1.073 |
Subway | 1.050 |
Station | 1.059 |
Hall monitor | 1.054 |
Laboratory | 1.053 |
Bridge close | 1.077 |
Bridge far | 1.071 |
Flower | 1.018 |
Champagne | 0.993 |
Balloon | 0.999 |
Kendo | 0.992 |
Average | 1.040 |

### 6.2 Evaluation and comparison

Because most reference works did not provide their simulated video sequences, we evaluate them by their characteristics. Kim et al. [32] used motion analysis that calculated three cues to decide the scale factor of the motion-to-disparity conversion. However, it is hard for this method to detect blending between shots. Li et al. [33] used several simple monocular cues to estimate disparity maps and confidence maps of low spatial and temporal resolution in real time, but the method is less sensitive to the variety of scenes. The work in [34] proposed a simplified algorithm that learns the scene depth from a large repository of image and depth pairs. It can provide high performance but takes much computation time. The work in [26] used edge information to segment the objects. A similar work [30] also used edge detection and scan paths to fill the depth values. If only edge information is used, the same object may receive different depth values. In our method, we use not only edge information but also foreground detection to unify the depth value within the same object. Furthermore, the two-threshold decision yields more precise edge information and avoids disconnected edges on an object and inconsistent depth values. The work in [6] learned and inferred depth values from motion, scene geometry, appearance, and occlusion boundaries. Although it can segment the images into spatio-temporal super-voxels and predict depth values with random forest regression, it is still hard for it to segment an object well from the background.

## 7 Conclusions

This paper has proposed a 2D-to-3D conversion algorithm. First, we separate the foreground and background. Second, we use the superpixel algorithm to get boundary information and gather the pixels with the same depth value. Through six gradient hypotheses on the depth map, the initial depth value is assigned. Since the boundary information is needed for refinement, we perform Sobel edge detection with two different thresholds to get two kinds of results. We then apply a thinning algorithm to obtain edges that are only one pixel wide. Together with the two-threshold decision, we add foreground information to unify the final depth information. Four scanning paths are used to refine the depth values. Finally, depth image-based rendering is employed to synthesize a virtual image. In future work, we will utilize more information, such as visual saliency or blur information, to determine the initial depth map so that depth maps in complex scenes can be handled more precisely.

## Declarations

### Acknowledgements

This research was supported and funded by the Ministry of Science and Technology, Taiwan, under Grant 104-2220-E-008-001.

### Authors’ contributions

THT carried out the algorithm studies, participated in the simulation, and drafted the manuscript. TWH designed the proposed algorithm. RZW helped to execute the experiments. All of the authors read and approved the final manuscript.

### Competing interests

The authors declare that they have no competing interests.

### Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Open Access** This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.

## References

1. Z Li, X Cao, X Dai, A novel method for 2D-to-3D video conversion using bi-directional motion estimation, Acoustics, Speech and Signal Processing (ICASSP), 2012 IEEE International Conference (2012)
2. KS Han, KY Hong, Geometric and texture cue based depth-map estimation for 2D to 3D image conversion, IEEE International Conference on Consumer Electronics (ICCE), 651–652 (2011)
3. M Guttmann, L Wolf, D Cohen-Or, Semi-automatic stereo extraction from video footage, 2009 IEEE 12th International Conference on Computer Vision, 136–142 (2009)
4. C Yan, et al., Efficient parallel framework for HEVC motion estimation on many-core processors, IEEE Trans. Circuits Syst. Video Technol. 24(12), 2077–2089 (2014)
5. XJ Huang, LH Wang, JJ Huang, DX Li, M Zhang, A depth extraction method based on motion and geometry for 2D to 3D conversion, IITA 2009, Third International Symposium on Intelligent Information Technology Application, 3, 294–298 (2009)
6. SH Raza, O Javed, A Das, H Cheng, H Singh, I Essa, Depth extraction from videos using geometric context and occlusion boundaries (BMVC, 2014)
7. E Imre, S Knorr, AA Alatan, T Sikora, Prioritized sequential 3D reconstruction in video sequences with multiple motions, in IEEE Int. Conf. Image Process. (ICIP, Atlanta, 2006)
8. D Nister, A Davison, Real-time motion and structure estimation from moving cameras, Tutorial at CVPR (2005)
9. CC Chang, CT Li, PS Huang, TK Lin, YM Tsai, LG Chen, A block-based 2D-to-3D conversion system with bilateral filter, Proc. IEEE Int. Conf. Consumer Electronics, 1–2 (2009)
10. S Battiato, S Curti, M La Cascia, M Tortora, E Scordato, Depth map generation by image classification, Proc. SPIE 5302, 95–104 (2004)
11. X Huang, L Wang, J Huang, D Li, M Zhang, A depth extraction method based on motion and geometry for 2D to 3D conversion. 3rd Int. Symp. Intell. Inf. Technol. 3, 294–298 (2009)
12. TH Tsai, CS Fan, CC Huang, Semi-automatic depth map extraction method for stereo video conversion, The 6th International Conference on Genetic and Evolutionary Computing (ICGEC) (Kitakyushu, 2012)
13. YK Lai, YF Lai, C Chen, An effective hybrid depth-generation algorithm for 2D-to-3D conversion in 3D displays, IEEE/OSA J. Display Technol. 9(3), 146–161 (2013)
14. K Yamada, Y Suzuki, Real-time 2D-to-3D conversion at full HD1080p resolution, The 13th IEEE International Symposium on Consumer Electronics, pp. 103–107 (2009)
15. K Yamada, K Suehiro, H Nakamura, Pseudo 3D image generation with simple depth models, IEEE International Conference on Consumer Electronics 2005, pp. 4–22 (Las Vegas, 2005)
16. J Ens, P Lawrence, An investigation of methods of determining depth from focus. IEEE Trans. Pattern Anal. Mach. Intell. 15(2), 523–531 (1993)
17. SA Valencia, RM Rodriguez-Dagnino, Synthesizing stereo 3D views from focus cues in monoscopic 2D images, Proc. SPIE 5006, 377–388 (2003)
18. KR Ranipa, MV Joshi, A practical approach for depth estimation and image restoration using defocus cue, Machine Learning for Signal Processing (MLSP), 2011 IEEE International Workshop on, pp. 1–6 (2011)
19. PPK Chan, BZ Jing, WWY Ng, DS Yeung, Depth estimation from a single image using defocus cues, Machine Learning and Cybernetics (ICMLC), International Conference on, 4, 1732–1738 (2011)
20. C Huang, Q Liu, S Yu, Regions of interest extraction from color image based on visual saliency (Springer Science Business Media, 2010)
21. SJ Yao, LH Wang, DX Li, M Zhang, A real-time full HD 2D-to-3D video conversion system based on FPGA, in Image and Graphics (ICIG), 2013 Seventh International Conference on, pp. 774–778 (2013)
22. YM Fang, JL Wang, M Narwaria, PL Callet, W Lin, Saliency detection for stereoscopic images, Image Processing, IEEE Transactions on 23(6), 2625–2636 (2014)
23. YJ Jung, A Baik, J Kim, D Park, A novel 2D-to-3D conversion technique based on relative height depth cue, Proc. of SPIE-IS&T Electronic Imaging, SPIE Vol. 7237 (2009)
24. C Fehn, Depth-image-based rendering (DIBR), compression and transmission for a new approach on 3D-TV, in Proc. SPIE Conf. Stereoscopic Displays and Virtual Reality Systems XI, 5291, 93–104 (2004)
25. Y Luo, Z Zhang, P An, Stereo video coding based on frame estimation and interpolation. IEEE Trans. Broadcast. 49(1), 14–21 (2003)
26. CC Cheng, CT Li, LG Chen, A novel 2D-to-3D conversion system using edge information, Consumer Electronics, IEEE Transactions on 56, 1739–1745 (2010)
27. TH Tsai, CC Huang, CS Fan, A high performance foreground detection algorithm for night scenes, Signal Processing Systems (SIPS), IEEE Workshop on, pp. 284–288 (2013)
28. R Achanta, A Shaji, K Smith, A Lucchi, P Fua, S Süsstrunk, SLIC superpixels compared to state-of-the-art superpixel methods, Pattern Analysis and Machine Intelligence, IEEE Transactions on 34, 2274–2282 (2012)
29. TY Zhang, CY Suen, A fast parallel algorithm for thinning digital patterns, Communications of the ACM 27(3), 236–239 (1984)
30. BL Lin, LC Chang, SS Huang, DW Shen, YC Fan, Two dimensional to three dimensional image conversion system design of digital archives for classical antiques and document, Information Security and Intelligence Control (ISIC), International Conference on, pp. 218–221 (2012)
31. Nagoya University Multi-view Sequences Download List. http://www.fujii.nuee.nagoya-u.ac.jp/multiview-data/
32. D Kim, D Min, K Sohn, A stereoscopic video generation method using stereoscopic display characterization and motion analysis. IEEE Trans. on Broadcasting 54(2), 188–197 (2008)
33. CT Li, YC Lai, C Wu, LG Chen, Perceptual multi-cues 2D-to-3D conversion system, 2011 Visual Communications and Image Processing (VCIP), pp. 1–1 (Tainan, 2011)
34. J Konrad, M Wang, P Ishwar, 2D-to-3D image conversion by learning depth from examples, 2012 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, pp. 16–22 (Providence, 2012)