Depth-image-based rendering with spatial and temporal texture synthesis for 3DTV

Xi, Ming; Wang, Liang-Hao; Yang, Qing-Qing; Li, Dong-Xiao; Zhang, Ming

doi:10.1186/1687-5281-2013-9

Research
Open access
Published: 11 February 2013

EURASIP Journal on Image and Video Processing volume 2013, Article number: 9 (2013)

Abstract

A depth-image-based rendering (DIBR) method with spatial and temporal texture synthesis is presented in this article. Theoretically, the DIBR algorithm can be used to generate arbitrary virtual views of the same scene in a three-dimensional television system. But the disoccluded area, which is occluded in the original views and becomes visible in the virtual views, makes it very difficult to obtain high image quality in the extrapolated views. The proposed view synthesis method combines the temporally stationary scene information extracted from the input video and spatial texture in the current frame to fill the disoccluded areas in the virtual views. Firstly, the current texture image and a stationary scene image, which is extracted from the input video, are warped to the same virtual perspective position by the DIBR method. Then, the two virtual images are merged together to reduce the hole regions and maintain the temporal consistency of these areas. Finally, an oriented exemplar-based inpainting method is utilized to eliminate the remaining holes. Experimental results are shown to demonstrate the performance and advantage of the proposed method compared with other view synthesis methods.

Introduction

The year 2010 is considered the breakthrough year for 3D video and the 3D industry [1]. Numerous 3D films have been produced and released to the market. Stereo movies give viewers a perception of depth by showing two slightly different images of the same scene, and audiences experience an immersive feeling when watching them in theaters with stereo eyeglasses. Disks and players of the 3D Blu-ray standard have entered home entertainment. The prosperity of the 3D industry presents an important opportunity for the three-dimensional television (3DTV) system, which is believed to be the next generation of television broadcasting after high-definition television. The concept of the 3DTV system was defined by the European project ATTEST [2] and developed by Morvan et al. [3] and Kubota et al. [4]. To improve the depth perception of users, autostereoscopic display technology, which requires no additional glasses, is preferred in the display part of 3DTV. Autostereoscopic displays can provide comfortable stereo parallax and smooth motion disparity by displaying multiview images of the same scene simultaneously. A simple approach is to capture, compress, and transmit multiple views directly. The current multiview video coding standard [5, 6], which achieves high compression efficiency by exploiting the spatial correlations of the neighboring views, is used to encode and decode the multiple video streams, generally more than eight views. However, the transmission bandwidth cost remains a challenging and unresolved problem. Meanwhile, it is commonly suggested that future 3DTV systems should have completely decoupled capture and display operations [7]. A proper abstract intermediate representation of the captured data, the video plus depth format, was proposed by Fehn [8] to achieve such a decoupled operation with an acceptable increase in bandwidth. The depth-image-based rendering (DIBR) [2] algorithm is used to render multiple perspective views from the video plus depth data according to the requirements of autostereoscopic displays. Thus, the DIBR method has attracted much attention and become a key technology of the 3DTV system [1].

The video plus depth data format consists of one texture color image and its corresponding per-pixel dense depth map. Theoretically, given the intrinsic and extrinsic parameters of the virtual views, the DIBR algorithm can synthesize any virtual perspective view from the video plus depth data. But there exist three problems [2]: visibility, resampling, and disocclusion. Multiple pixels of the reference view may fall onto the same position in the virtual image plane, which causes the visibility problem. A Z-buffer algorithm [9] solves this problem by recording the Z values and choosing the pixel nearest to the virtual camera plane. The resampling problem arises when an integer pixel position in the reference view image is projected to a subpixel position in the virtual view; it can be handled by an upsampling procedure or by backward warping with interpolation. The remaining disocclusion problem is that some parts of the captured scene, which are occluded in the original views, become visible in the virtual views. It is caused by the lack of scene information occluded by the foreground objects at the original view position. As the distance from the virtual view to the reference view increases, the disoccluded area becomes larger, as shown in Figure 1.
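The Z-buffer idea mentioned above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `zbuffer_resolve` and the tuple layout are hypothetical and are not taken from [9].

```python
def zbuffer_resolve(projected):
    """Resolve visibility for pixels that land on the same virtual position.

    projected: iterable of ((u, v), z, color) tuples, where z is the distance
    to the virtual camera plane (hypothetical layout chosen for this sketch).
    Returns a dict mapping (u, v) to the color of the nearest candidate.
    """
    nearest = {}
    for position, z, color in projected:
        # Keep the candidate only if it is closer than the one stored so far.
        if position not in nearest or z < nearest[position][0]:
            nearest[position] = (z, color)
    return {position: color for position, (_, color) in nearest.items()}


warped = [
    ((3, 4), 2.5, "background"),
    ((3, 4), 1.0, "foreground"),
    ((5, 4), 2.5, "background"),
]
visible = zbuffer_resolve(warped)
```

When two reference pixels compete for position (3, 4), the one with the smaller Z value is kept, so the foreground correctly hides the background.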

The disocclusion problem is considered to be the most significant and difficult problem of the DIBR algorithm. It is well handled in the interpolation case [10–13], but becomes severe in the extrapolation case, where the missing image information needs to be reconstructed by appropriate algorithms. Many algorithms have been developed to solve this problem; they can be divided into three categories.

The first is layered-depth-image (LDI) [14, 15], which can achieve excellent rendering results by providing sufficient information about the scene. LDI data are composed of a number of color layers and their corresponding depth layers, which contain not only the texture and depth information of the scene visible from the front view, but also that of the occluded regions. It is very simple to obtain high-quality multiview images from LDI data. However, the procedure of creating LDI is computationally complex and quite time-consuming. The transmission bandwidth of LDI data also increases drastically with the number of layers. A simplified LDI data format, called the “Declipse” format [16], was proposed by Philips Corporation. “Declipse” format data consist of a foreground layer and a background layer. This format improves the rendering quality with only a small overhead in terms of complexity and bitrate.

The second approach is called depth image preprocessing. To reduce the disoccluded areas in virtual views, a low-pass filter is applied to smooth the depth image. Fehn [2] preprocesses the depth image with a suitable Gaussian filter to eliminate the disocclusions at the cost of slight geometric distortions. An asymmetric smoothing method is proposed by Zhang and Tam [17]. By enlarging the standard deviation and window size of the Gaussian filter in the vertical direction, the distortion of vertical structures is reduced. The filtering smooths the sharp discontinuities in the depth image, thus reducing the hole areas near object boundaries. A consequence of these algorithms is that the whole depth map is modified, which severely blurs the depth differences between scene objects in different depth layers. To cope with this depth loss, different kinds of oriented filters [18–21] have been designed on the same principle, i.e., smoothing the sharp edges in the depth image locally and keeping the depth of the other regions unchanged. The oriented filters can improve the image quality of the virtual views, but still induce geometric distortion. Although the depth image preprocessing methods can handle the disoccluded regions in virtual views with a small baseline, obvious geometric distortions occur as the baseline gets larger.

The third approach to filling the disoccluded areas is image completing techniques. This approach can be further classified into statistical-based methods, partial differential equations (PDE)-based methods, and exemplar-based methods. Statistical-based methods [22–25] have good performance in pure texture synthesis applications, but fail to complete natural images with complex structure. PDE-based methods [26–29], which are also called image inpainting methods, propagate linear structures into the disoccluded areas smoothly via diffusion. The diffusion process is simulated by the PDE of physical heat flow. Inpainting methods are suitable for removing small image artifacts, such as speckles, scratches, and overlaid texts. When the disocclusion is getting larger, the diffusion process will over-smooth the image and cause visible blurring artifacts. Exemplar-based methods [30, 31] fill the hole regions by copying patches with the similar texture from the known neighborhood of the image. Criminisi et al. [30] use the exemplar-based method to remove objects from images. Komodakis and Tziritas [31] propose an efficient belief propagation method to obtain global optimization. Exemplar-based methods have been used for the case of video completion in [32, 33]. Multiple frames are provided as the searching source of best match patch by Cheng et al. [34] to achieve temporal continuity. Exemplar-based methods have been the most powerful techniques for dealing with large disoccluded regions. Schmeing and Jiang [35] first obtain the background information with a computed background model. But their approach cannot handle the uncovered areas caused by static foreground objects. For each virtual view, Ndjiki-Nya et al. [36] use a background sprite to update the texture and depth information of disoccluded areas. There are two major drawbacks of this method. 
One is that the valuable background information of the disocclusions cannot be reused during the generation of other virtual views. The other is that the memory cost increases with the number of virtual views.

In this article, a new virtual view generation method with spatial and temporal texture synthesis is proposed. The structure information of the captured scene in the temporal domain is taken into account by maintaining an accumulated sprite of the stationary scene. An oriented exemplar-based inpainting algorithm is applied to restore the remaining disoccluded areas with background texture.

The remainder of this article is organized as follows. In Section 2, a brief description of the algorithm framework is given. The details of each processing module are described in Sections 3, 4, 5, and 6. Experimental results are compared with state-of-the-art methods in Section 7. The conclusions and future work can be found in Section 8.

System overview

The framework of proposed DIBR method with spatial and temporal texture synthesis is shown in Figure 2. The proposed method is divided into four main stages, i.e., stationary scene extraction, backward DIBR, merging operation, and oriented exemplar-based inpainting.

In the first stage, a sprite of the stationary scene is maintained throughout the view synthesis process; it stores the temporally accumulated structure and depth information of the stationary image part. The Structural SIMilarity index (SSIM) [37], combined with the input depth images, is used to distinguish the stationary scene from the moving foreground objects. For the stationary scene, the SSIM index between adjacent frames is large, so the image part that is stationary in both adjacent frames can be extracted using the SSIM index values. But some stationary regions cannot be distinguished in this way because they are occluded by moving foreground objects. By considering the spatial relationship provided by the input depth maps, the texture information of these occluded stationary regions can also be obtained. In the description of our algorithm, the camera of the input view is assumed to be still for simplicity. If the camera is moving, an additional camera tracking module needs to be inserted before the stationary scene extraction stage to compensate for the global motion, which is beyond the scope of this article.

In the second stage, current frame and stationary scene sprite are warped to the same virtual perspective view by a backward DIBR method to tackle the visibility problem and resampling problem.

In the third stage, the proposed algorithm merges the two virtual images obtained from the second stage. The merging operation needs to be done very carefully, because the foreground objects in virtual views may still contain interior hole pixels. The merging operation can make use of most of the scene information provided by the sprite of the stationary scene.

After the merging operation, there still exist a few blank regions without pixel values. In the final stage, an oriented exemplar-based inpainting approach is applied to fill the remaining holes by searching for the best matching exemplar with background texture. The current virtual image is used as the search source for the best matching patch. The filling order of the inpainting method is steered from background structures to foreground objects.

Note that the proposed method only uses the sequence of color images and depth images from one captured view as the input data. If image data of another view are also provided, the switch in the framework can directly be shifted from the extrapolation mode to the interpolation mode without any changes of the framework.

Stationary scene extraction

The DIBR algorithm warps the original view to the virtual view position by projecting current pixels to points in real 3D space and re-projecting the 3D points to virtual image plane. Large disocclusions will appear in the discontinuous edges of depth map, which is the transition place between foreground and background in texture image. The background image part occluded by foreground objects should be visible in the virtual views. But the occluded background information is lost during the procedure of recording a 3D scene by a 2D image. To solve this problem, the proposed stationary scene extraction module tries to recover the lost background structure from video sequences. For a video captured by a fixed camera or a short cut of video, the image consists of moving foreground objects and stationary background. The occluded background information in current image frame may appear in frames at other moments. If the information can effectively be used, the filling effect of disoccluded areas will be more convincing.

Stationary scene extraction algorithm keeps a global sprite throughout the view generation process to accumulate structure and depth information of stationary scene in temporal direction. The global sprite of stationary scene is composed of two components: one is the texture image of stationary scene, denoted as C _SS, the other is the depth map of stationary scene, denoted as M _SS. C _SS and M _SS are, respectively, initialized with the first frame of the texture sequence and depth sequence of the original view. The initialization step is expressed as follows:

\{\begin{matrix} C_{SS} (p) = I_{t} (p) \\ M_{SS} (p) = D_{t} (p) \end{matrix}, t = 0

(1)

where p:(i, j) corresponds to the pixel with column coordinate i and row coordinate j. I _t and D _t represent the color intensity frame and depth map frame of the input original view at time t, respectively. D _t is represented as an 8-bit gray-scale image. The continuous depth range is quantized to 255 discrete depth values. The object nearest to the camera image sensor is assigned 255 and the farthest object is assigned 1. Pixels with depth value 0 are denoted as holes. The transform formula between the discrete depth level and the actual distance in the real scene can be found in [12].

After the initialization, a temporary sprite of the stationary scene, denoted as T C _SS and T M _SS, is obtained from each input image frame I _t and its previous frame I _t-1 to extract the useful information about the occluded background in I _t. For the stationary scene, the SSIM index [37] between adjacent frames is large, so the image part that is stationary in both adjacent frames can be extracted using the SSIM index values. For each pixel p:(i, j), a structural similarity index p _SSIM as defined in [37] is calculated between the corresponding square areas $Φ_{t}^{I}$ and $Φ_{t - 1}^{I}$ of I _t and I _t-1, which take p as the center pixel and L × L as the window size. The SSIM index p _SSIM is calculated as follows

p_{SSIM} = \frac{(2 μ_{Φ_{t}} μ_{Φ_{t - 1}} + K_{1}) (2 σ_{Φ_{t (t - 1)}} + K_{2})}{(μ_{Φ_{t}}^{2} + μ_{Φ_{t - 1}}^{2} + K_{1}) (σ_{Φ_{t}}^{2} + σ_{Φ_{t - 1}}^{2} + K_{2})}

(2)

where $μ_{Φ_{t}}$ and $μ_{Φ_{t - 1}}$ represent the luminance mean values of $Φ_{t}^{I}$ and $Φ_{t - 1}^{I}$, respectively. $σ_{Φ_{t}}$ and $σ_{Φ_{t - 1}}$ represent the luminance standard deviations of $Φ_{t}^{I}$ and $Φ_{t - 1}^{I}$. $σ_{Φ_{t (t - 1)}}$ denotes the luminance correlation coefficient between $Φ_{t}^{I}$ and $Φ_{t - 1}^{I}$. K ₁ and K ₂ are constants whose values can be determined according to [37]. The expressions of mean, standard deviation, and correlation coefficient can also be found in [37].

Then an arbiter with threshold A is used to divide the pixels of input image frame I _t into stationary part I _s and rest part I _r. The classifier can be expressed as follows:

\{\begin{matrix} p \in I_{s}, p_{SSIM} \geq A \\ p \in I_{r}, p_{SSIM} < A \end{matrix}, p : (i, j) \in I_{t} .

(3)
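Equations (2) and (3) can be illustrated with a small pure-Python sketch that works on one pair of L × L windows given as flat lists of luminance values. Several details are assumptions made for this example only: the cross term is computed as the covariance used in the standard SSIM definition, the variances are population variances, the constants `k1` and `k2` default to the common 8-bit SSIM values (0.01·255)² and (0.03·255)², and the threshold `threshold_a` is a hypothetical value rather than the one from Table 1.

```python
def window_ssim(win_t, win_prev, k1=6.5025, k2=58.5225):
    """SSIM of two equally sized luminance windows (flat lists), following Eq. (2)."""
    n = len(win_t)
    mu_t = sum(win_t) / n
    mu_prev = sum(win_prev) / n
    var_t = sum((a - mu_t) ** 2 for a in win_t) / n
    var_prev = sum((b - mu_prev) ** 2 for b in win_prev) / n
    # Cross term: covariance between the two windows (assumption, see lead-in).
    cov = sum((a - mu_t) * (b - mu_prev) for a, b in zip(win_t, win_prev)) / n
    numerator = (2 * mu_t * mu_prev + k1) * (2 * cov + k2)
    denominator = (mu_t ** 2 + mu_prev ** 2 + k1) * (var_t + var_prev + k2)
    return numerator / denominator


def is_stationary(win_t, win_prev, threshold_a=0.9):
    """Eq. (3): the centre pixel belongs to I_s when SSIM >= A, otherwise to I_r."""
    return window_ssim(win_t, win_prev) >= threshold_a


unchanged = is_stationary([10, 20, 30, 40], [10, 20, 30, 40])
changed = is_stationary([10, 20, 30, 40], [200, 50, 220, 30])
```

Identical windows give an SSIM of 1 and are classified as stationary, while the strongly different pair falls below the threshold and is passed on to the depth-based classification.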

I _s contains the stationary pixels with high SSIM value, which can directly be used to update the same pixel positions in T C _SS. I _r is composed of three parts: the part with changed luminance P _lc, the relatively moving part P _rm, and the actually moving part P _am. P _lc represents the areas with similar scene structure but different luminance, which causes the decrease of the SSIM value. P _rm is the region which is moving in I _t-1 and stationary in I _t. P _am denotes the image part which is moving in I _t and stationary in I _t-1. As shown in Figure 3c, I _s between Figure 3a,b is marked as black, the actually moving part P _am is marked as red, the region with changed luminance P _lc is marked as green, and the relatively moving area P _rm is marked as blue. The first two kinds, P _lc and P _rm, can also be used to update T C _SS directly, whereas the third kind, P _am, needs to be excluded from I _t, and the pixels in the same regions of I _t-1 are used to update T C _SS instead. As shown in Figure 3e–g, the poster occluded by the men’s hands in Figure 3e and the white board behind the man in Figure 3f are both preserved in Figure 3g. Given the corresponding depth maps D _t and D _t-1, the three image parts are defined as follows.

\{\begin{matrix} p \in P_{lc}, |μ_{t}^{D} - μ_{t - 1}^{D}| \leq T \\ p \in P_{rm}, μ_{t}^{D} - μ_{t - 1}^{D} < - T \\ p \in P_{am}, μ_{t}^{D} - μ_{t - 1}^{D} > T \end{matrix}, p : (i, j) \in I_{r}

(4)

where $μ_{t}^{D}$ and $μ_{t - 1}^{D}$ , respectively, represent the average depth value of square areas in D _t and D _t-1. The square neighborhoods have the same window size L × L with SSIM computation in Equation (2) and take the coordinates of pixel p as center position. T is a constant threshold, which defines the acceptable range of depth fluctuation. |·| is the absolute function.
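The three-way split of Equation (4) amounts to comparing two window-averaged depth values. The sketch below assumes the averages have already been computed; the function name, the string labels, and the default threshold `t=5.0` are hypothetical choices for illustration.

```python
def classify_rest_pixel(mu_depth_t, mu_depth_prev, t=5.0):
    """Classify a pixel of I_r according to Eq. (4).

    mu_depth_t / mu_depth_prev: average depth of the L x L window in D_t and
    D_{t-1}. Larger depth values mean closer to the camera (255 = nearest).
    """
    diff = mu_depth_t - mu_depth_prev
    if abs(diff) <= t:
        return "lc"  # same depth, only the luminance changed
    if diff < -t:
        return "rm"  # farther now: a foreground object has left this pixel
    return "am"      # nearer now: a foreground object has moved onto this pixel


labels = [
    classify_rest_pixel(100, 102),
    classify_rest_pixel(50, 120),
    classify_rest_pixel(120, 50),
]
```

Because the nearest object carries the largest depth value, a positive jump in average depth means that something has moved in front of the background at this position.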

Then the information of stationary scene between two adjacent frames can be extracted by the following equation:

\begin{matrix} T C_{SS} (p) & = \{\begin{matrix} I_{t} (p), p : (i, j) \in I_{s} \cup P_{lc} \cup P_{rm} \\ I_{t - 1} (p), p : (i, j) \in P_{am} \end{matrix} \\ T M_{SS} (p) & = \{\begin{matrix} D_{t} (p), p : (i, j) \in I_{s} \cup P_{lc} \cup P_{rm} \\ D_{t - 1} (p), p : (i, j) \in P_{am} \end{matrix} \end{matrix}

(5)

Finally, the temporary sprite of stationary scene (T C _SS and T M _SS) is used to update the global sprite (C _SS and M _SS). The update operation is described as follows.

\begin{matrix} C_{SS} (p) = \{\begin{matrix} T C_{SS} (p), μ_{TM}^{p} - μ_{M}^{p} \leq T \\ C_{SS} (p), otherwise \end{matrix}, p : (i, j) \in C_{SS} \\ M_{SS} (p) = \{\begin{matrix} T M_{SS} (p), μ_{TM}^{p} - μ_{M}^{p} \leq T \\ M_{SS} (p), otherwise \end{matrix}, p : (i, j) \in M_{SS} \end{matrix}

(6)

where $μ_{TM}^{p}$ and $μ_{M}^{p}$ , respectively, represent the average depth value of square areas in T M _SS and M _SS. The square neighborhoods have the same window size L × L with SSIM computation and take the coordinates of pixel p as center position. T is the same constant threshold defined in Equation (4). Figure 3d shows T C _SS of Figure 3b. Figure 3h,i are C _SS and M _SS of Figure 3b, respectively. Almost all the texture and depth information of stationary scene are restored in Figure 3h,i.
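A per-pixel reading of Equations (5) and (6) might look as follows. This is a sketch under assumptions: each pixel is represented as a `(color, depth)` pair, the label `"am"` marks the actually moving part, the window-averaged depths are passed in precomputed, and `t=5.0` is a hypothetical threshold.

```python
def temporary_sprite_pixel(label, pixel_t, pixel_prev):
    """Eq. (5): use frame t unless the pixel is actually moving ('am')."""
    return pixel_prev if label == "am" else pixel_t


def update_global_sprite_pixel(sprite_pixel, temp_pixel, mu_temp_depth, mu_sprite_depth, t=5.0):
    """Eq. (6): accept the temporary sprite unless it is clearly nearer than the stored one."""
    if mu_temp_depth - mu_sprite_depth <= t:
        return temp_pixel
    return sprite_pixel


# A foreground object (depth 180) covers background (depth 40) in frame t.
temp = temporary_sprite_pixel("am", pixel_t=(200, 180), pixel_prev=(90, 40))
kept = update_global_sprite_pixel((90, 40), (200, 180), mu_temp_depth=180, mu_sprite_depth=40)
refreshed = update_global_sprite_pixel((90, 40), (95, 42), mu_temp_depth=42, mu_sprite_depth=40)
```

The depth test in the second function keeps the sprite from being overwritten by content that is much closer to the camera than what is already stored, which is how background accumulated from earlier frames survives.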

At this point, the background information that has appeared in past frames is stored in C _SS and M _SS, which can be used to partly solve the disocclusion problem of the virtual view synthesis algorithm.

Backward DIBR

The backward DIBR method, which shares the same idea with the inverse warping method in [13], can efficiently eliminate the small cracks in virtual view caused by resampling problem in traditional DIBR process [2]. In general, the backward DIBR method can be divided into two steps: warping the depth map of the reference view to the virtual view position and generating the texture image of the virtual view.

In the backward DIBR method, D _t is first warped to the virtual perspective position. A two-pixel-wide region around background–foreground transitions is marked as unreliable. During the rendering of the depth map, the unreliable pixels are skipped, because their depth values are inaccurate. Each pixel q:(u,v) of the virtual view has four registers, which store the depth and distance of the four nearest pixels projected from the reference image. The four registers of pixel q only store rendered pixels from the reference image whose distance to q is less than one pixel either in horizontal or vertical direction. VD _t, the depth map of the virtual view, is calculated as follows

V D_{t} (q) = \{\begin{matrix} \sum_{k = 1}^{N (q)} λ_{k} D_{k}, N (q) > 0 and N (q) \leq 4 \\ 0, N (q) = 0 \end{matrix}

(7)

where N(q) denotes the number of pixels warped to q that satisfy the condition mentioned above. If N(q) is larger than 4, we sort the warped pixels by depth value in descending order and store the first four pixels, i.e., those with the largest depth. D _k is the depth value of a stored pixel. N(q) = 0 means that no pixel is projected to pixel q. λ _k represents the normalized weight factor combining distance and depth, which is defined as

λ_{k} = \frac{ρ_{k} ω_{k}}{\sum_{m = 1}^{N (q)} ρ_{m} ω_{m}}, \sum_{k = 1}^{N (q)} λ_{k} = 1

(8)

where the weight factor of distance ω _k is expressed as Equation (9). (U _k,V _k) is the projected position of warped pixel in virtual image plane.

ω_{k} = \frac{1}{\sqrt{{(U_{k} - u)}^{2} + {(V_{k} - v)}^{2}}}

(9)

The weight factor of depth ρ _k is expressed as

ρ_{k} = \{\begin{matrix} 1, D_{k} \geq μ_{ND} \\ 0, D_{k} < μ_{ND} \end{matrix}

(10)

where μ _ND is the average depth value of all the stored warped pixels in pixel q.
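Equations (7) to (10) can be combined into one small routine for a single virtual pixel. The following is a sketch under assumptions: candidates are given as `(U_k, V_k, D_k)` tuples that already satisfy the one-pixel distance condition, a small `eps` guards the division in Equation (9) when a projected position coincides with the pixel centre, and a tiny tolerance is used when comparing a depth with the mean. None of these guards is stated in the text.

```python
import math


def blend_virtual_depth(candidates, u, v, eps=1e-6):
    """Depth of virtual pixel (u, v) from warped candidates (U_k, V_k, D_k)."""
    if not candidates:
        return 0.0  # Eq. (7): no warped pixel, so the position is a hole
    # Keep at most four candidates, preferring larger depth (nearer to the camera).
    kept = sorted(candidates, key=lambda c: c[2], reverse=True)[:4]
    mu_nd = sum(d for _, _, d in kept) / len(kept)
    weights = []
    for uk, vk, dk in kept:
        rho = 1.0 if dk >= mu_nd - 1e-9 else 0.0                # Eq. (10)
        omega = 1.0 / max(math.hypot(uk - u, vk - v), eps)      # Eq. (9)
        weights.append(rho * omega)
    total = sum(weights)                                         # Eq. (8) normalisation
    return sum(w * c[2] for w, c in zip(weights, kept)) / total


mixed = blend_virtual_depth([(10.5, 20.0, 200), (9.5, 20.0, 50)], 10, 20)
uniform = blend_virtual_depth([(10.5, 20.0, 100), (9.75, 20.0, 100)], 10, 20)
empty = blend_virtual_depth([], 10, 20)
```

The binary depth weight discards candidates that lie behind the local average, so a foreground sample is not blended with background samples that happen to land on the same virtual pixel.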

The non-hole pixel (u,v) in V D _t is reprojected to position (X _uv,Y _uv) in image plane of original view to get the texture image of virtual view by interpolation operation. The texture image of virtual view V I _t is calculated by

V I_{t} (q) = \{\begin{matrix} \frac{\sum_{n = 1}^{4} θ_{n} I_{n}}{\sum_{n = 1}^{4} θ_{n}}, V D_{t} (q) > 0 \\ hole, V D_{t} (q) = 0 \end{matrix}

(11)

where ‘hole’ flag means there is no warped pixel from the reference image. We set the hole pixels with a white color (R = 255, G = 255, B = 255). I _n represents the color value of pixel (x _n,y _n) whose distance to (X _uv,Y _uv) is less than one pixel either in horizontal or vertical direction. θ _n is the weight factor of distance, which is expressed as

θ_{n} = \frac{1}{\sqrt{{(X_{uv} - x_{n})}^{2} + {(Y_{uv} - y_{n})}^{2}}} .

(12)
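The inverse-distance interpolation of Equations (11) and (12) can be sketched for a single-channel image stored as a list of rows. It is assumed here that the four contributing pixels are the integer grid points surrounding the reprojected position, that an exact hit on a grid point returns that pixel directly (Equation (12) is undefined at zero distance), and that neighbours outside the image are skipped; these choices are illustrative only.

```python
import math


def interpolate_texture(x, y, image):
    """Inverse-distance interpolation at sub-pixel position (x, y).

    image: list of rows, indexed as image[row][col]; x is the column
    coordinate and y the row coordinate of the reprojected position.
    """
    x0, y0 = math.floor(x), math.floor(y)
    numerator = denominator = 0.0
    for yn in (y0, y0 + 1):
        for xn in (x0, x0 + 1):
            if not (0 <= yn < len(image) and 0 <= xn < len(image[0])):
                continue  # neighbour lies outside the reference image
            dist = math.hypot(x - xn, y - yn)
            if dist < 1e-6:
                return float(image[yn][xn])  # exact hit on a grid point
            theta = 1.0 / dist               # Eq. (12)
            numerator += theta * image[yn][xn]
            denominator += theta
    return numerator / denominator if denominator > 0 else None


centre = interpolate_texture(0.5, 0.5, [[0, 100], [0, 100]])
on_grid = interpolate_texture(1.0, 0.0, [[10, 20], [30, 40]])
```

At the centre of a 2 × 2 block all four weights are equal, so the result is the plain average of the four neighbours.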

The virtual depth map VM _t projected from M _SS and the virtual texture image VC _t projected from C _SS can be obtained by the same backward DIBR method. Two results of our backward DIBR algorithm are given in Figure 4e,f.

Merging operation

To efficiently use the structure information in C _SS, the two virtual texture images (VI _t and VC _t) need to be merged together. The merged virtual image and its depth map are denoted as MI _t and MD _t, respectively. The virtual view image VI _t dominates the merging process. Available background information in VC _t is used to fill the blank areas in VI _t. There may be holes in both foreground and background due to the inaccuracy of the depth map, as shown in Figure 4e. We perform the merging operation carefully to avoid filling holes in the foreground with background structures.

First, an estimated depth value $D_{E}^{q}$ is obtained for each hole pixel q:(u,v) in VI _t. As mentioned in Section 3, the hole regions of the virtual view lack background information. When q lies between background and foreground, we choose the smaller depth value, i.e., that of the background scene, as the estimate, and the average depth otherwise. The estimate is defined as

D_{E}^{q} = \{\begin{matrix} \frac{μ_{D}^{q_{L}} + μ_{D}^{q_{R}}}{2}, |μ_{D}^{q_{L}} - μ_{D}^{q_{R}}| \leq T \\ μ_{D}^{q_{R}}, μ_{D}^{q_{L}} - μ_{D}^{q_{R}} > T \\ μ_{D}^{q_{L}}, μ_{D}^{q_{L}} - μ_{D}^{q_{R}} < - T \end{matrix}; q is hole

(13)

where q _L and q _R represent the first non-hole pixels to the left and to the right of q in the horizontal direction, respectively. $μ_{D}^{q_{L}}$ and $μ_{D}^{q_{R}}$ represent the average depth of the K × K windows which take q _L and q _R as the center pixels in VD _t. T is the same constant defined in Equation (4).
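Equation (13) reduces to a short decision on the two window averages. In the sketch below the averages are assumed to be precomputed, and the function name and the default threshold `t=5.0` are hypothetical.

```python
def estimate_hole_depth(mu_left, mu_right, t=5.0):
    """Estimated depth of a hole pixel from its left/right non-hole neighbours, Eq. (13)."""
    diff = mu_left - mu_right
    if abs(diff) <= t:
        return (mu_left + mu_right) / 2.0  # both sides lie at a similar depth
    # Depth discontinuity: take the smaller value, i.e. the background side.
    return mu_right if diff > t else mu_left


flat = estimate_hole_depth(60, 62)
foreground_on_left = estimate_hole_depth(200, 50)
foreground_on_right = estimate_hole_depth(50, 200)
```

In both discontinuity cases the estimate equals the background depth, which matches the observation that disocclusions expose background rather than foreground.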

Then the merging operation is executed as follows.

M I_{t} (q) = \{\begin{matrix} V I_{t} (q), V I_{t} (q) is non-hole \\ V C_{t} (q), V I_{t} (q) is hole and V C_{t} (q) is non-hole and |V M_{t} (q) - D_{E}^{q}| \leq F \\ hole, otherwise \end{matrix}

(14)

where the non-hole flag means that a meaningful value exists at this pixel position. The second condition in Equation (14) covers the situation in which pixel q is a hole in VI _t but a meaningful pixel with available background texture in VC _t. This condition ensures that the holes in foreground objects will not be filled with the accumulated background information in VC _t. F represents the acceptable range of depth fluctuation in the merging operation. In Figure 4g, the available texture of the stationary background scene in Figure 4f is merged with the virtual image (Figure 4e) rendered from the original view, and the hole areas in foreground objects are preserved. The corresponding depth value of each non-hole pixel in the merged virtual view MI _t is stored in MD _t, and the depth value of each hole pixel is set to zero.
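The merging rule of Equation (14) can be written as a per-pixel function. The sketch assumes that holes are represented by `None` (the text uses a white color flag instead) and that the estimated depth of the hole pixel is supplied by the caller; the default `f=10.0` is a hypothetical value.

```python
HOLE = None  # sentinel for a missing pixel (assumption for this sketch)


def merge_pixel(vi, vc, vm_depth, estimated_depth, f=10.0):
    """Merge one pixel of VI_t and VC_t according to Eq. (14)."""
    if vi is not HOLE:
        return vi  # the view rendered from the current frame dominates
    if vc is not HOLE and abs(vm_depth - estimated_depth) <= f:
        return vc  # sprite texture lies at the depth expected for this hole
    return HOLE    # left for the inpainting stage


from_current = merge_pixel(120, 80, vm_depth=40, estimated_depth=42)
from_sprite = merge_pixel(HOLE, 80, vm_depth=40, estimated_depth=42)
foreground_hole = merge_pixel(HOLE, 80, vm_depth=40, estimated_depth=200)
```

In the last call the hole sits inside a foreground object (estimated depth 200), so the background texture from the sprite is rejected and the pixel stays a hole.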

Oriented exemplar-based inpainting

The merging operation can solve the disocclusion problem partly, because the useful background information in C _SS and M _SS is limited. There still exist hole areas in the merged virtual view MI _t, which are divided into two kinds: the foreground holes caused by inaccurate depth map and the blank areas caused by occlusion in original view. The image part with known pixels is defined by Λ, and the remaining hole area is denoted as Γ. The border of hole area Γ is defined as ∂Γ, as shown in Figure 5a.

To restore the missing information of the remaining hole areas, we propose an oriented exemplar-based inpainting algorithm based on the previous work of Criminisi et al. [30]. They determine the filling order of hole pixels h ∈ ∂Γ by assigning each hole pixel a priority P(h). The hole pixel with the highest priority is filled first with the best match patch in Λ. The priority is the product of the confidence term C(h) and the data term D(h). The confidence term favors filling hole pixels with a large support set of known pixels first, while the data term ensures the continuous propagation of linear structures into hole regions. Noting that most remaining holes are due to a lack of scene information about the stationary background, we improve their algorithm in two ways. One is to fill the border pixels in ∂Γ that are adjacent to the background area first. The other is to choose the texture of the known background area to restore the disoccluded regions. The improvements are implemented by considering the depth cue in the calculation of the priority term and the energy function, both of which are used in the best exemplar search procedure.

The modified priority term is defined as

P (h) = C (h) D (h) + de (h), h \in ∂Γ

(15)

where de(h) represents the depth term. The definition of C(h) and D(h) is the same as Criminisi’s approach, and their expressions can be found in [30]. The depth term is expressed as follows.

de (h) = \{\begin{matrix} Q, h near to BG \\ 0, h near to FG \end{matrix}, h \in ∂Γ

(16)

where BG and FG represent the background areas and foreground objects, respectively. Q is a constant, which should be no less than the maximum of the product of C(h) and D(h). We set Q = 256 in our framework. The new priority term will steer the filling order from background to foreground and keep the advantage of linear structure propagation.
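The effect of the depth term in Equations (15) and (16) on the filling order can be seen in a short sketch. How a border pixel is judged to be near the background is not specified in this section, so a precomputed boolean is assumed; the function names and the tuple layout are likewise hypothetical.

```python
def fill_priority(confidence, data, near_background, q=256.0):
    """Modified priority P(h) = C(h) * D(h) + de(h), Eqs. (15) and (16)."""
    depth_term = q if near_background else 0.0
    return confidence * data + depth_term


def next_fill_pixel(front):
    """front: list of (pixel, confidence, data, near_background) entries on the hole border."""
    return max(front, key=lambda e: fill_priority(e[1], e[2], e[3]))[0]


front = [
    ((5, 5), 0.9, 0.8, False),  # strong structure, but next to the foreground
    ((7, 2), 0.2, 0.1, True),
    ((8, 2), 0.4, 0.3, True),
]
chosen = next_fill_pixel(front)
```

As long as Q is no smaller than the largest possible product C(h)·D(h), every background-adjacent pixel outranks every foreground-adjacent one, while the original priority still orders pixels within each group.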

Let r denote the pixel with maximum priority in ∂Γ. The J × J samples patch, which takes r as center, is defined as Ψ. A square area around r with W × W samples is defined to be the searching area Ω. Then the oriented exemplar-based inpainting algorithm needs to search for the best match patch S in Ω, which has the most similar texture with Ψ. The center of S is denoted as s. The corresponding depth areas of Ψ and S are represented by Θ and O, respectively.

The energy function combining the depth cue is expressed as follows.

E = \sum_{m \in Ψ_{k}} {∥Ψ (m) - S (m)∥}^{2} + β \sum_{m \in Ψ_{k}} {∥Θ (m) - O (m)∥}^{2} + γ {|μ_{Θ}^{k} - μ_{O}^{u}|}^{2}

(17)

where Ψ_k denotes the position set of known pixels in the filling target patch Ψ. The position set of hole pixels in Ψ is represented by Ψ_u : Ψ_u = Ψ - Ψ_k. Ψ(m) and S(m) denote the pixel value of pixel position m in Ψ and S, respectively. Θ(m) and O(m) represent the depth value of pixel position m in Θ and O, respectively. β is a constant, which is the weighting factor for the depth values of corresponding pixels with Ψ_k in Θ. $μ_{Θ}^{k}$ represents the average depth value of the corresponding pixels with Ψ_k in Θ. $μ_{O}^{u}$ represents the average depth value of the corresponding pixels with Ψ_u in O. $μ_{Θ}^{k}$ and $μ_{O}^{u}$ are defined as

μ_{Θ}^{k} = \sum_{m \in Ψ_{k}} Θ (m) / |Ψ_{k}|, μ_{O}^{u} = \sum_{m \in Ψ_{u}} O (m) / |Ψ_{u}|

(18)

where |Ψ_u| denotes the area of Ψ_u. γ is the penalizing factor for the candidate patches with foreground texture. γ is an adaptive parameter related to the area of Ψ_k, denoted as |Ψ_k|. Then γ is calculated as

γ = \{\begin{matrix} 0, μ_{O}^{u} - μ_{Θ}^{k} \leq T \\ 10 |Ψ_{k}|, otherwise \end{matrix}

(19)

where T is a constant as defined in Equation (4).
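Equations (17) to (19) can be evaluated for one candidate patch as in the sketch below. Patches are assumed to be flat lists of luminance or depth values of equal length, `known[m]` marks the positions in Ψ_k, and both Ψ_k and Ψ_u are assumed to be non-empty, which holds for a patch centred on the hole border. The defaults `beta=0.5` and `t=5.0` are hypothetical, since the values of Table 1 are not reproduced in the text.

```python
def patch_energy(target, candidate, target_depth, candidate_depth, known, beta=0.5, t=5.0):
    """Matching cost E of Eq. (17) for one candidate patch S."""
    k_idx = [m for m, is_known in enumerate(known) if is_known]
    u_idx = [m for m, is_known in enumerate(known) if not is_known]
    texture = sum((target[m] - candidate[m]) ** 2 for m in k_idx)
    depth = sum((target_depth[m] - candidate_depth[m]) ** 2 for m in k_idx)
    mu_theta_k = sum(target_depth[m] for m in k_idx) / len(k_idx)   # Eq. (18)
    mu_o_u = sum(candidate_depth[m] for m in u_idx) / len(u_idx)    # Eq. (18)
    # Eq. (19): penalise candidates that would paste foreground into the hole.
    gamma = 0.0 if mu_o_u - mu_theta_k <= t else 10.0 * len(k_idx)
    return texture + beta * depth + gamma * (mu_theta_k - mu_o_u) ** 2


known = [True, True, False, False]
target, target_depth = [10, 10, 0, 0], [50, 50, 0, 0]
e_background = patch_energy(target, [10, 10, 12, 12], target_depth, [50, 50, 50, 50], known)
e_foreground = patch_energy(target, [10, 10, 12, 12], target_depth, [50, 50, 200, 200], known)
```

Both candidates match the known texture and depth exactly, yet the one whose fill-in region is much nearer to the camera receives a large penalty and is therefore not selected.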

The best match block in the searching area Ω is obtained by minimizing the energy cost function (17). The first term in energy function (17) represents the texture difference between the known pixels in target patch Ψ and the corresponding pixels in match patch S. In our approach, only the luminance component is considered. The second term in (17) indicates the depth similarity, which has lower importance than the first texture term. The third term is a penalization term. If there exist pixels of foreground objects in the corresponding area of Ψ_u in S, the penalization term will become larger. The likelihood of selecting patches with foreground pixels is greatly reduced by adding the penalization term. According to the definition of the energy function, the patches of the background scene, which contain similar texture and depth structure with the target block, will be selected to restore the missing information of the disoccluded image areas. We applied our oriented exemplar-based inpainting method to synthesize the missing texture information of disoccluded area in Figure 5a. The blank region is filled from background scene to foreground objects, and the linear structure is propagated into the hole in an appropriate way (see Figure 5c–g).

Methods

To evaluate the performance of the proposed method, we compare our approach with other methods, including the MPEG view synthesis reference software (VSRS, version 3.5) [38], the depth-based inpainting method in [29], and the Asymmetric Gaussian filtering method of Zhang and Tam [17].

Our experiments are carried out on three test sequences: “Book arrival”, “Breakdancers”, and “Ballet”. Each sequence has 100 frames and a resolution of 1024 × 768 samples. Multiple video plus depth data from different camera views are available. The “Book arrival” sequence is captured by a parallel camera array and the others are obtained by a toed-in camera array. The baseline between two adjacent cameras is approximately 6.5 cm for the “Book arrival” sequence and 20 cm for the other two sequences.

The parameter values used in our proposed algorithm are summarized in Table 1. The optimized parameters are used for the MPEG method (VSRS 3.5). For the Asymmetric Gaussian filtering method, we use strong smoothing parameters to eliminate the disoccluded areas caused by the large camera baseline. We set the horizontal and vertical standard deviations of the Gaussian kernel to 20 and 60, respectively. The filter window sizes are set to 61 samples horizontally and 193 samples vertically. In the experiments, the Asymmetric Gaussian filtering method and the depth-based inpainting method employ the backward DIBR approach proposed in Section 4 to handle the visibility and resampling problems, in the same way as our proposed method.
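The asymmetric smoothing configuration quoted above can be sketched with two one-dimensional Gaussian kernels applied separably to a depth map. This is a minimal illustration and not the implementation of Zhang and Tam [17]: the function names are hypothetical, and clamping indices at the image border is an assumed boundary treatment.

```python
import math


def gaussian_kernel(size, sigma):
    """Normalized 1-D Gaussian kernel with an odd window size."""
    half = size // 2
    weights = [math.exp(-(x * x) / (2.0 * sigma * sigma))
               for x in range(-half, half + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def filter_1d(row, kernel):
    """Convolve one row with the kernel, clamping indices at the borders."""
    half = len(kernel) // 2
    n = len(row)
    out = []
    for i in range(n):
        acc = 0.0
        for k, w in enumerate(kernel):
            j = min(max(i + k - half, 0), n - 1)
            acc += w * row[j]
        out.append(acc)
    return out


def smooth_depth(depth, k_h, k_v):
    """Horizontal pass with k_h, then vertical pass with k_v."""
    rows = [filter_1d(r, k_h) for r in depth]
    cols = [filter_1d(list(c), k_v) for c in zip(*rows)]
    return [list(r) for r in zip(*cols)]


# Parameters quoted in the text: sigma 20 / window 61 horizontally,
# sigma 60 / window 193 vertically.
k_h = gaussian_kernel(61, 20.0)
k_v = gaussian_kernel(193, 60.0)

# Tiny depth map with a vertical depth edge (toy values).
depth = [[50.0, 50.0, 200.0, 200.0] for _ in range(3)]
smoothed = smooth_depth(depth, k_h, k_v)
print([round(v, 1) for v in smoothed[0]])
```

The vertical kernel is much wider than the horizontal one, which is what makes the filter asymmetric; smoothing the depth edge in this way shrinks the holes after warping at the cost of the geometric distortions discussed in the subjective evaluation.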

Table 1 Parameter values used in proposed method

Subjective evaluation

The view synthesis results of the three test video sequences are shown in Figures 6, 7, and 8. All four approaches can handle the visibility and resampling problems and fill the disoccluded areas in the virtual view. Our proposed algorithm gives the best subjective results of the four methods.

The Asymmetric Gaussian filtering method causes noticeable geometric distortions. Vertical structures are curved in Figures 6c and 7c, and the foreground objects become fatter, as shown in Figures 6k and 7g,k. This method also slightly shifts objects away from their correct positions (see Figure 6g), which reduces the disparity between the reference image and the virtual image and weakens the 3D effect. For autostereoscopic display, although the visual quality of Figure 6g is still pleasant, the depth perception of the scene is distorted by these shifts. The distorted stereo display makes viewers feel uncomfortable and causes visual fatigue. The depth-based inpainting method can restore the blank areas with the color of background pixels, but it induces severe blurring artifacts (see Figures 6l, 7h,l, and 8h) and some color bleeding defects (see Figures 7h,l and 8h). The filling results are visually unpleasant. The VSRS method leads to significant horizontal structure artifacts (as shown in Figures 6i,m, 7i,m, and 8i,m) and greatly decreases the visual quality.

The proposed approach utilizes the accumulated information of the stationary scene to fill the disoccluded areas and achieves convincing results, as shown in Figures 6n, 7j, and 8j. The missing structure of blank regions is restored with the true background structure. Even for the disoccluded areas caused by stationary foreground objects, our proposed method can obtain plausible filling results. As shown in Figures 6j and 7n, the hole areas are filled with the texture of the background scene without losing sharpness compared to Figure 6h,l. Figure 8l gives a better visual effect than Figure 8n. Because the man’s leg is very close to the wall in Figure 8b, it is difficult to distinguish the leg from the wall, and in Figure 8n our approach wrongly fills the hole with the texture of the wall. Another important advantage of our approach is the temporal texture consistency of the filled disoccluded regions. For disoccluded areas caused by moving foreground objects, the missing texture is recovered from other frames: the true texture information in other frames is extracted and used to restore the hole areas. To demonstrate the consistency in the temporal direction, a series of magnified virtual image subsections for the “Ballet” sequence is shown in Figure 9. The disoccluded regions around the woman in adjacent frames are restored with the same true background structure, so the texture of the filled image areas remains consistent over time.

Objective comparison

We adopt the peak-signal-to-noise ratio (PSNR) and SSIM [37] to compare the performance of the proposed approach with the other three methods.

For every test case of each sequence, the PSNR and SSIM values are calculated over the whole image region of every virtual image frame. The mean values of PSNR and SSIM for each test case are listed in Table 2, and the best results are highlighted in boldface. The “Camera” column indicates the camera configuration used for virtual view generation, e.g., “8→9” means synthesizing the virtual view at the perspective position of the 9th camera from the 8th camera.
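The per-frame PSNR and its mean over a sequence can be sketched as below. The sketch assumes 8-bit samples (peak value 255) and frames stored as nested lists of single-channel values; both are assumptions for illustration, as are the function names. SSIM is omitted because its windowed statistics do not fit a short example.

```python
import math


def psnr(reference, synthesized, peak=255.0):
    """PSNR in dB over the whole frame; returns inf for identical frames."""
    squared_error = 0.0
    count = 0
    for ref_row, syn_row in zip(reference, synthesized):
        for r, s in zip(ref_row, syn_row):
            squared_error += (r - s) ** 2
            count += 1
    mse = squared_error / count
    if mse == 0:
        return math.inf
    return 10.0 * math.log10(peak * peak / mse)


def mean_psnr(reference_frames, synthesized_frames):
    """Mean of the per-frame PSNR values for one test case."""
    values = [psnr(r, s) for r, s in zip(reference_frames, synthesized_frames)]
    return sum(values) / len(values)


# Toy 2 x 2 frames: every sample of the synthesized frame is off by one.
reference = [[10, 20], [30, 40]]
synthesized = [[11, 21], [31, 41]]
print(round(psnr(reference, synthesized), 2))
```

Here the reference frame would be the image captured by the real camera at the virtual view position, and the synthesized frame the output of the view synthesis method under test.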

Table 2 PSNR and SSIM results

From Table 2, we can observe that among the four methods the proposed framework has the best PSNR and SSIM performance for both the parallel and the toed-in camera configurations. The Asymmetric Gaussian filtering method obtains the lowest PSNR and SSIM values because of its geometric distortion. For the four test cases of the “Book arrival” sequence, the baselines between the virtual view and the reference view are small (6.5–13 cm). Because the holes around the image boundary account for a large percentage of the total disocclusions (see Figure 6b), the PSNR and SSIM gains of our proposed framework are small, i.e., 0.09–0.22 dB for PSNR and 0.0006–0.0027 for SSIM compared to the depth-based inpainting method. For the test cases of “Breakdancers” and “Ballet” with large baselines (20–40 cm), our proposed approach obtains larger PSNR and SSIM gains compared to the depth-based inpainting method, i.e., 0.16–1.82 dB for PSNR and 0.0018–0.0116 for SSIM. There are two important reasons for the improvements of PSNR and SSIM in our proposed framework. One is the structure information available from the stationary scene sprite; the other is the oriented exemplar-based inpainting process with a reasonable filling order. Figure 10 shows the PSNR and SSIM curves for two test cases. One is the virtual view of the “Ballet” sequence, which is generated from the 3rd camera to the 4th camera. The other is the virtual view of the “Breakdancers” sequence, which is generated from the 5th camera to the 4th camera.

Figure 11 gives the PSNR curves for a local area of the “Book arrival” sequence. The local area is the same subsection shown in Figure 6n. From the 1st frame to the 31st frame, the local area covers only background objects, so the performance of the three algorithms is very close. From the 32nd frame to the 99th frame, the local area contains both background and foreground objects, and disoccluded regions appear in it because of the depth discontinuity. With the proposed stationary scene extraction algorithm, the true texture information of the background objects is used to recover the disoccluded regions, so the temporal consistency of texture and structure is maintained for these frames. Compared to the VSRS and the depth-based inpainting algorithm, the fluctuation of the PSNR values is much smaller for the proposed method (as shown in Figure 11), which means that the temporal consistency of the rendered sequence is improved. The PSNR value drops noticeably at the 32nd frame and the 51st frame because of the sudden depth change in the input sequence. To obtain a more consistent rendered sequence, a temporal filtering procedure for the input depth sequence would be beneficial.
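The text does not define a numerical measure of PSNR fluctuation, so the following sketch shows two possible measures under that caveat: the standard deviation of the per-frame PSNR values and the mean absolute difference between consecutive frames. The function names are hypothetical and the PSNR lists are synthetic toy values, not measurements from Figure 11.

```python
import statistics


def psnr_std(values):
    """Population standard deviation of the per-frame PSNR values."""
    return statistics.pstdev(values)


def mean_abs_frame_diff(values):
    """Mean absolute PSNR change between consecutive frames."""
    diffs = [abs(b - a) for a, b in zip(values, values[1:])]
    return sum(diffs) / len(diffs)


# Synthetic per-frame PSNR curves (dB) for illustration only.
steady_curve = [30.0, 30.1, 29.9, 30.0]
jumpy_curve = [30.0, 27.0, 31.0, 26.5]

for name, curve in (("steady", steady_curve), ("jumpy", jumpy_curve)):
    print(name, round(psnr_std(curve), 3), round(mean_abs_frame_diff(curve), 3))
```

A lower value of either measure would indicate a flatter PSNR curve, which is the property the comparison in Figure 11 relies on.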

Execution time

We implemented the four algorithms in C on a Dell workstation and evaluated their runtime costs, as summarized in Table 3. The execution time of each step in the proposed framework is given in Table 4. The workstation is equipped with a 2.93-GHz Intel Xeon quad-core CPU and 4 GB of DDR2 RAM.

Table 3 Execution time comparison

Table 4 Execution time of proposed framework

The runtime costs of the Asymmetric Gaussian filtering method and the MPEG method are within 10 s per frame. The depth-based inpainting algorithm takes more than 2 min because of its time-consuming iterative operation. The proposed approach takes about 20 s to generate the virtual view for each frame. The oriented exemplar-based inpainting process accounts for most of the time cost of our approach, about 50–80%, as shown in Table 4. The execution time of the oriented exemplar-based inpainting algorithm depends on the size of the disoccluded areas, the image patch size, and the size of the search window. For the “Ballet” sequence, because the area of the hole regions is larger than in the other two test sequences (cf. Figures 7b, 6b, and 8b), the runtime cost is about 2 times higher. The additional time cost is acceptable given the improvement in the objective and subjective quality of the virtual view image.

Conclusion and future work

This article presents a novel DIBR method combined with spatial and temporal texture synthesis. By maintaining a sprite of the stationary scene of the original sequence, useful structure information can be used to restore the missing texture of disocclusions in virtual view images. The remaining disoccluded areas are restored by the proposed oriented exemplar-based inpainting approach, which fills the remaining hole areas from background to foreground and propagates structure and texture into the blank regions in an appropriate way. By combining these two algorithms, the proposed DIBR method handles the disocclusion problem well and achieves spatial and temporal consistency. These features make the proposed approach very suitable for extrapolated virtual view synthesis. Meanwhile, the proposed framework can also be adapted to the interpolation operation. Theoretical analysis and experimental results show that the proposed method outperforms the state-of-the-art view synthesis methods compared here. The increase in runtime cost is moderate and acceptable. Our future work will focus on camera tracking and motion compensation to extend the proposed method to situations with moving cameras.

Abbreviations

3DTV: three-dimensional television
DIBR: depth-image-based rendering
LDI: layered-depth-image
PDE: partial differential equations
PSNR: peak-signal-to-noise ratio
SSIM: structural similarity index

References

[1] Smolic A, Kauff P, Knorr S, Hornung A, Kunter M, Muller M, Lang M: Three-dimensional video postproduction and processing. Proc. IEEE 2011, 99(4):607-625.
[2] Fehn C: Depth-image-based rendering (DIBR), compression, and transmission for a new approach on 3D-TV. In Proceedings of SPIE Stereoscopic Displays and Virtual Reality Systems XI. San Jose, CA, USA; 2004:93-104.
[3] Morvan Y, Farin D, de With PH: System architecture for free-viewpoint video and 3D-TV. IEEE Trans. Consum. Electron 2008, 54(2):925-932.
[4] Kubota A, Smolic A, Magnor M, Tanimoto M, Chen T, Zhang C: Multiview imaging and 3DTV. IEEE Signal Process. Mag 2007, 24(6):10-21.
[5] Smolic A, Mueller K, Stefanoski N, Ostermann J, Gotchev A, Akar G, Triantafyllidis G, Koz A: Coding algorithms for 3DTV—a survey. IEEE Trans. Circuits Syst. Video Technol 2007, 17(11):1606-1621.
[6] Merkle P, Smolic A, Muller K, Wiegand T: Efficient prediction structures for multiview video coding. IEEE Trans. Circuits Syst. Video Technol 2007, 17(11):1461-1473.
[7] Onural L, Sikora T: Introduction to the special section on 3DTV. IEEE Trans. Circuits Syst. Video Technol 2007, 17(11):1566-1567.
[8] Fehn C: A 3D-TV approach using depth-image-based rendering (DIBR). In Proceedings of the Visualization, Imaging, and Image Processing. ACTA Press, Benalmadena, Spain; 2003:482-487.
[9] Greene N, Kass M, Miller G: Hierarchical Z-buffer visibility. In Proceedings of the 20th annual conference on Computer graphics and interactive techniques. ACM Press, CA, USA; 1993:231-238.
[10] Zitnick C, Kang S, Uyttendaele M, Winder S, Szeliski R: High-quality video view interpolation using a layered representation. ACM Trans. Graph. (TOG) 2004, 23(3):600-608. 10.1145/1015706.1015766
[11] Smolic A, Muller K, Dix K, Merkle P, Kauff P, Wiegand T: Intermediate view interpolation based on multiview video plus depth for advanced 3D video systems. 15th IEEE International Conference on Image Processing (San Diego, CA, USA, 12–15 October 2008) pp. 2448–2451
[12] Mori Y, Fukushima N, Yendo T, Fujii T, Tanimoto M: View generation with 3D warping using depth information for FTV. Signal Process.: Image Commun 2009, 24(1–2):65-72.
[13] Zinger S, Do L, et al.: Free-viewpoint depth image based rendering. J. Vis. Commun. Image Represent 2010, 21(5–6):533-541.
[14] Shade J, Gortler S, He L, Szeliski R: Layered depth images. In Proceedings of the 25th Annual Conference on Computer Graphics and Interactive Techniques. ACM, Orlando, FL, USA; 1998:231-242.
[15] Yoon S, Ho Y: Multiple color and depth video coding using a hierarchical representation. IEEE Trans. Circuits Syst. Video Technol 2007, 17(11):1450-1460.
[16] Comparative study and recommendations http://www.3d4you.eu/
[17] Zhang L, Tam W: Stereoscopic image generation based on depth images for 3D TV. IEEE Trans. Broadcast 2005, 51(2):191-199. 10.1109/TBC.2005.846190
[18] Chen W, Chang Y, Lin S, Ding L, Chen L: Efficient depth image based rendering with edge dependent depth filter and interpolation. IEEE International Conference on Multimedia and Expo (Amsterdam, Netherlands, 6 July 2005) pp. 1314–1317
[19] Daribo I, Tillier C, Pesquet-Popescu B: Distance dependent depth filtering in 3D warping for 3DTV. IEEE 9th Workshop on Multimedia Signal Processing (Chania, Crete, Greece, 1–3 October 2007) pp. 312–315
[20] Wang W, Huo L, Zeng W, Huang Q, Gao W: Depth image segmentation for improved virtual view image quality in 3-DTV. IEEE International Symposium on Intelligent Signal Processing and Communication Systems (Xiamen, China, 28 November–1 December 2007) pp. 300–303
[21] Wang L, Huang X, Xi M, Li D, Zhang M: An asymmetric edge adaptive filter for depth generation and hole filling in 3DTV. IEEE Trans. Broadcast 2010, 56(3):425-431.
[22] Heeger D, Bergen J: Pyramid-based texture analysis/synthesis. In Proceedings of the 22nd Annual Conference on Computer Graphics and Interactive Techniques. ACM, Los Angeles, CA, USA; 1995:229-238.
[23] De Bonet J: Multiresolution sampling procedure for analysis and synthesis of texture images. In Proceedings of the 24th Annual Conference on Computer Graphics and Interactive Techniques. ACM Press/Addison-Wesley Publishing Co., Los Angeles, CA, USA; 1997:361-368.
[24] Portilla J, Simoncelli E: A parametric texture model based on joint statistics of complex wavelet coefficients. Int. J. Comput. Vis 2000, 40: 49-70. 10.1023/A:1026553619983
[25] Doretto G, Chiuso A, Wu Y, Soatto S: Dynamic textures. Int. J. Comput. Vis 2003, 51(2):91-109. 10.1023/A:1021669406132
[26] Bertalmio M, Sapiro G, Caselles V, Ballester C: Image inpainting. In Proceedings of the 27th annual conference on Computer graphics and interactive techniques. ACM Press/Addison-Wesley Publishing Co., New Orleans, LA, USA; 2000:417-424.
[27] Bertalmio M, Vese L, Sapiro G, Osher S: Simultaneous structure and texture image inpainting. IEEE Trans. Image Process 2003, 12(8):882-889. 10.1109/TIP.2003.815261
[28] Chan T, Shen J: Nontexture inpainting by curvature-driven diffusions. J. Vis. Commun. Image Represent 2001, 12(4):436-449. 10.1006/jvci.2001.0487
[29] Oh K, Yea S, Ho Y: Hole filling method using depth based in-painting for view synthesis in free viewpoint television and 3-d video. IEEE Proceedings of Picture Coding Symposium (Chicago, IL, USA, 6–8 May 2009) pp. 1–4
[30] Criminisi A, Pérez P, Toyama K: Region filling and object removal by exemplar-based image inpainting. IEEE Trans. Image Process 2004, 13(9):1200-1212. 10.1109/TIP.2004.833105
[31] Komodakis N, Tziritas G: Image completion using efficient belief propagation via priority scheduling and dynamic pruning. IEEE Trans. Image Process 2007, 16(11):2649-2661.
[32] Patwardhan K, Sapiro G, Bertalmío M: Video inpainting under constrained camera motion. IEEE Trans. Image Process 2007, 16(2):545-553.
[33] Shih T, Tang N, Hwang J: Exemplar-based video inpainting without ghost shadow artifacts by maintaining temporal continuity. IEEE Trans. Circuits Syst. Video Technol 2009, 19(3):347-360.
[34] Cheng C, Lin S, Lai S: Spatio-temporally consistent novel view synthesis algorithm from video-plus-depth sequences for autostereoscopic displays. IEEE Trans. Broadcast 2011, 57(2):523.
[35] Schmeing M, Jiang X: Depth image based rendering: a faithful approach for the disocclusion problem. In IEEE 3DTV-Conference: The True Vision-Capture, Transmission and Display of 3D Video (3DTV-CON). Tampere, Finland; 7:1-4.
[36] Ndjiki-Nya P, Koppel M, Doshkov D, Lakshman H, Merkle P, Muller K, Wiegand T: Depth image-based rendering with advanced texture synthesis for 3-D video. IEEE Trans. Multimed 2011, 13(3):453-465.
[37] Wang Z, Bovik A, Sheikh H, Simoncelli E: Image quality assessment: from error visibility to structural similarity. IEEE Trans. Image Process 2004, 13(4):600-612. 10.1109/TIP.2003.819861
[38] Tanimoto M, Fujii T, Suzuki K: View synthesis algorithm in view synthesis reference software 2.0 (VSRS2.0) ISO/IEC JTC1/SC29/WG11. 2008.

Acknowledgements

Ming Xi would like to thank Yin Zhao for his discussion and suggestions about the backward DIBR algorithm. Ming Xi would also like to thank Menno Wildeboer and Masayuki Tanimoto for their kind help with the implementations. The authors would like to thank the Interactive Visual Media Group at Microsoft Research and the Fraunhofer Institute for Telecommunications-Heinrich Hertz Institute for providing the “Breakdancers”, “Ballet”, and “Book arrival” sequences, respectively. This study was supported in part by the National Natural Science Foundation of China (Grant nos. 60802013, 61072081, 61271338), the National High Technology Research and Development Program (863) of China (Grant no. 2012AA011505), the National Science and Technology Major Project of the Ministry of Science and Technology of China (Grant no. 2009ZX01033-001-007), the Key Science and Technology Innovation Team of Zhejiang Province, China (Grant no. 2009R50003), and the China Postdoctoral Science Foundation (Grant nos. 20110491804, 2012T50545).

Author information

Authors and Affiliations

Institute of Information and Communication Engineering, Zhejiang University, Hangzhou, 310027, P.R. China
Ming Xi, Liang-Hao Wang, Qing-Qing Yang, Dong-Xiao Li & Ming Zhang
Zhejiang Provincial Key Laboratory of Information Network Technology, Zhejiang University, Hangzhou, 310027, P.R. China
Ming Xi, Liang-Hao Wang, Qing-Qing Yang, Dong-Xiao Li & Ming Zhang

Corresponding author

Correspondence to Liang-Hao Wang.

Additional information

Competing interests

The authors declare that they have no competing interests.

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 2.0 International License ( https://creativecommons.org/licenses/by/2.0 ), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

About this article

Cite this article

Xi, M., Wang, LH., Yang, QQ. et al. Depth-image-based rendering with spatial and temporal texture synthesis for 3DTV. J Image Video Proc 2013, 9 (2013). https://doi.org/10.1186/1687-5281-2013-9

Received: 27 July 2012
Accepted: 23 November 2012
Published: 11 February 2013
DOI: https://doi.org/10.1186/1687-5281-2013-9
