A depth-image-based rendering (DIBR) method with spatial and temporal texture synthesis is presented in this article. Theoretically, the DIBR algorithm can be used to generate arbitrary virtual views of the same scene in a three-dimensional television system. But the disoccluded area, which is occluded in the original views and becomes visible in the virtual views, makes it very difficult to obtain high image quality in the extrapolated views. The proposed view synthesis method combines the temporally stationary scene information extracted from the input video and spatial texture in the current frame to fill the disoccluded areas in the virtual views. Firstly, the current texture image and a stationary scene image, which is extracted from the input video, are warped to the same virtual perspective position by the DIBR method. Then, the two virtual images are merged together to reduce the hole regions and maintain the temporal consistency of these areas. Finally, an oriented exemplar-based inpainting method is utilized to eliminate the remaining holes. Experimental results are shown to demonstrate the performance and advantage of the proposed method compared with other view synthesis methods.
Year 2010 is considered to be the breakthrough year for 3D video and the 3D industry. Numerous 3D films have been produced and released to the market. Stereo movies give viewers a stereo perception by showing two slightly different images of the same scene. Consumers can have an immersive experience by watching them in theaters with stereo eyeglasses. Disks and players of the 3D Blu-ray standard have entered home entertainment. The prosperity of the 3D industry offers an important opportunity for the three-dimensional television (3DTV) system, which is believed to be the next generation of television broadcasting after high-definition television. The concept of the 3DTV system is defined by the European project ATTEST and developed by Morvan et al. and Kubota et al. To improve the depth perception of users, autostereoscopic display technology, which needs no additional glasses, is preferred in the display part of 3DTV. Autostereoscopic displays can provide comfortable stereo parallax and smooth motion disparity by displaying multiview images of the same scene simultaneously. A simple approach is to capture, compress, and transmit multiple views directly. The current multiview video coding standard
[5, 6] with high compression efficiency, which exploits the spatial correlations of the neighboring views, is used to encode and decode the multiple video streams, generally more than eight views. But the transmission bandwidth cost remains a challenging and unresolved problem. Meanwhile, it is commonly suggested that the future 3DTV systems should have completely decoupled capture and display operations
. A proper abstract intermediate representation of the captured data, video plus depth format, is proposed by Fehn
 to achieve such a decoupled operation with an acceptable increment of bandwidth. The depth-image-based rendering (DIBR)
algorithm will be used to render multiple perspective views from the video plus depth data according to the requirements of autostereoscopic displays. Thus, the DIBR method has attracted much attention and has become a key technology of the 3DTV system.
The video plus depth data format consists of one texture color image and its corresponding per-pixel dense depth map. Theoretically, given the intrinsic and extrinsic parameters of the virtual views, the DIBR algorithm can be used to synthesize any virtual perspective view from the video plus depth data. But there exist three problems
, which are visibility, resampling, and disocclusion. Multiple pixels of the reference view may fall into the same position in the virtual image plane, which will cause the visibility problem. A Z-buffer algorithm
 can solve this problem by recording the Z values and choosing the nearest pixel to the virtual camera plane. The phenomenon of an integer pixel position in the reference view image being projected to a subpixel position in the virtual view is called resampling problem, which can be coped with upsampling procedure or backwards warping with interpolation. The remaining disocclusion problem is the fact that some parts of the captured scene, which are occluded in the original views, become visible in the virtual views. It is caused by the lack of scene information occluded by the foreground objects in the original view position. As the distance from virtual view to reference view increases, the disoccluded area becomes larger, as shown in Figure
The disocclusion problem is considered to be the most significant and difficult one of the DIBR algorithm. It is well handled in the interpolation operation
[10–13], but becomes severe in the extrapolation situation, where the missing image information needs to be reconstructed by appropriate algorithms. Many algorithms have been developed to solve this problem, and they can be divided into three categories.
The first is layered-depth-image (LDI)
[14, 15], which can achieve excellent rendering results by providing sufficient information of the scene. LDI data are composed of a number of color layers and their corresponding depth layers, which contain not only the texture and depth information of visible scene from the front view, but also that of the occluded regions. It is very simple to obtain high-quality multiview images from LDI data. However, the procedure of creating LDI is computationally complex and quite time-consuming. The transmission bandwidth of LDI data also increases drastically with the number of layers. A simplified data format of LDI, which is called the “Declipse” format
, is proposed by Philips Corporation. The “Declipse” format data consist of foreground layer and background layer. It presents the advantage to improve the rendering quality with a quite small overhead in terms of complexity and bitrate.
The second approach is called depth image preprocessing. To reduce the disoccluded areas in virtual views, a low-pass filter is applied to smooth the depth image. Fehn uses a suitable Gaussian filter to preprocess the depth image, eliminating the disocclusions at the cost of slight geometric distortions. An asymmetric smoothing method is proposed by Zhang and Tam
. By enlarging the standard deviation and window size of Gaussian filter in vertical direction, the vertical structure distortion is reduced. The filtering effect is to smooth the sharp discontinuities in the depth image, thus reducing the hole areas near object boundaries. A consequence of these algorithms is that the whole depth map has been modified, which will severely blur the distance between scene objects in different depth layers. To cope with depth loss, different kinds of oriented filters
[18–21] are designed with the same principle, i.e., smoothing the sharp edge in the depth image locally and keeping the depth of the other regions unchanged. The oriented filters can improve the image quality of the virtual views, but still induce geometric distortion. Although the depth image preprocessing methods can be used to handle the disoccluded regions in the virtual views of small baseline, obvious geometric distortions will occur when the baseline is getting larger.
The third approach to filling the disoccluded areas is image completion. This approach can be further classified into statistical-based methods, partial differential equation (PDE)-based methods, and exemplar-based methods. Statistical-based methods
[22–25] have good performance in pure texture synthesis applications, but fail to complete natural images with complex structure. PDE-based methods
[26–29], which are also called image inpainting methods, propagate linear structures into the disoccluded areas smoothly via diffusion. The diffusion process is simulated by the PDE of physical heat flow. Inpainting methods are suitable for removing small image artifacts, such as speckles, scratches, and overlaid texts. When the disocclusion is getting larger, the diffusion process will over-smooth the image and cause visible blurring artifacts. Exemplar-based methods
[30, 31] fill the hole regions by copying patches with the similar texture from the known neighborhood of the image. Criminisi et al.
 use the exemplar-based method to remove objects from images. Komodakis and Tziritas
 propose an efficient belief propagation method to obtain global optimization. Exemplar-based methods have been used for the case of video completion in
[32, 33]. Multiple frames are provided as the searching source of best match patch by Cheng et al.
 to achieve temporal continuity. Exemplar-based methods have been the most powerful techniques for dealing with large disoccluded regions. Schmeing and Jiang
 first obtain the background information with a computed background model. But their approach cannot handle the uncovered areas caused by static foreground objects. For each virtual view, Ndjiki-Nya et al.
use a background sprite to update the texture and depth information of disoccluded areas. This method has two major drawbacks. One is that the valuable background information of the disocclusions cannot be reused during the generation of other virtual views. The other is that the memory cost increases with the number of virtual views.
In this article, a new virtual view generation method with spatial and temporal texture synthesis is proposed. The structure information of the captured scene in the temporal domain is taken into account by maintaining an accumulated sprite of the stationary scene. An oriented exemplar-based inpainting algorithm is applied to restore the remaining disoccluded areas with background texture.
The remainder of this article is organized as follows. In Section 2, a brief description of the algorithm framework is given. The details of each processing module are described in Sections 3, 4, 5, and 6. Experimental results are compared with state-of-the-art methods in Section 7. The conclusions and future work can be found in Section 8.
The framework of the proposed DIBR method with spatial and temporal texture synthesis is shown in Figure 2. The proposed method is divided into four main stages: stationary scene extraction, backward DIBR, merging operation, and oriented exemplar-based inpainting.
In the first stage, a sprite of stationary scene is maintained throughout the view synthesis process, which stores the temporally accumulated structure and depth information of stationary image part. The Structural SIMilarity index (SSIM)
is utilized, in combination with the input depth images, to distinguish the stationary scene from the moving foreground objects. For a stationary scene, the SSIM index between adjacent frames is large, so the image part that is stationary in both adjacent frames can be extracted by using the SSIM index values. However, some stationary scene parts still cannot be distinguished because they are occluded by moving foreground objects. By considering the spatial relationship provided by the input depth maps, the texture information of these occluded stationary parts can also be obtained. In the demonstration of our algorithm, the camera of the input view is assumed to be still for simplicity. If the camera is moving, an additional camera tracking module needs to be inserted before the stationary scene extraction stage to compensate for the global motion, which is beyond the scope of this article.
In the second stage, current frame and stationary scene sprite are warped to the same virtual perspective view by a backward DIBR method to tackle the visibility problem and resampling problem.
In the third stage, the proposed algorithm merges the two virtual images obtained from the second stage. The merging operation needs to be done very carefully, because the foreground objects in virtual views may still contain inner hole pixels. The merging operation can make use of most of the scene information provided by the sprite of the stationary scene.
After the merging operation, there still exist a few blank regions without pixel values. In the final stage, an oriented exemplar-based inpainting approach is applied to fill the remaining holes by searching for the best matching exemplar with background texture. The current virtual image is used as the search source for the best matching patch. The filling order of the inpainting method is steered from background structures to foreground objects.
Note that the proposed method only uses the sequence of color images and depth images from one captured view as the input data. If image data of another view are also provided, the switch in the framework can directly be shifted from the extrapolation mode to the interpolation mode without any changes of the framework.
Stationary scene extraction
The DIBR algorithm warps the original view to the virtual view position by projecting the current pixels to points in real 3D space and re-projecting the 3D points to the virtual image plane. Large disocclusions appear at the discontinuous edges of the depth map, which mark the transition between foreground and background in the texture image. The background image part occluded by foreground objects should be visible in the virtual views, but the occluded background information is lost when a 3D scene is recorded as a 2D image. To solve this problem, the proposed stationary scene extraction module tries to recover the lost background structure from the video sequence. For a video captured by a fixed camera, or for a short video shot, the image consists of moving foreground objects and stationary background. The background information occluded in the current image frame may appear in frames at other moments. If this information can be used effectively, the filling of disoccluded areas will be more convincing.
Stationary scene extraction algorithm keeps a global sprite throughout the view generation process to accumulate structure and depth information of stationary scene in temporal direction. The global sprite of stationary scene is composed of two components: one is the texture image of stationary scene, denoted as CSS, the other is the depth map of stationary scene, denoted as MSS. CSS and MSS are, respectively, initialized with the first frame of the texture sequence and depth sequence of the original view. The initialization step is expressed as follows:
where p:(i,j) corresponds to the pixel with column coordinate i and row coordinate j. It and Dt represent the color intensity frame and depth map frame of the input original view at time t, respectively. Dt is represented as an 8-bit gray-scale image. The continuous depth range is quantized to 255 discrete depth values. The object nearest to the camera image sensor is assigned 255 and the farthest object is assigned 1. Pixels with depth value 0 are denoted as holes. The transform formula between discrete depth level and actual distance in real scene can be found in
After the initialization, a temporary sprite of stationary scene, denoted as TCSS and TMSS, is obtained between each input image frame It and its previous frame It-1 to extract the useful information of occluded background in It. For stationary scene, the SSIM index
 between adjacent frames is large, so the image part, which is stationary in both adjacent frames, can be extracted by using the SSIM index values. For each pixel p:(ij), a structure similarity index pSSIM defined in
 is calculated between the corresponding square areas
of It and It-1, which take p as the center pixel and L × L as the window size. pSSIM is calculated from the luminance mean values and the luminance standard deviations of the two square areas, the luminance correlation coefficient between them, and two constants K1 and K2. The values of K1 and K2 can be determined according to the cited research work, where the expressions of mean, standard deviation, and correlation coefficient can also be found.
Then an arbiter with threshold A is used to divide the pixels of input image frame It into stationary part Is and rest part Ir. The classifier can be expressed as follows:
Is contains the stationary pixels with high SSIM values, which can directly be used to update the same pixel positions in TCSS. Ir is composed of three parts: the part with changed luminance Plc, the relatively moving part Prm, and the actually moving part Pam. Plc represents the areas with similar scene structure but different luminance, which causes the decrease of the SSIM value. Prm is the region which is moving in It-1 and stationary in It. Pam denotes the image part which is moving in It and stationary in It-1. As shown in Figure 3c, Is between Figure 3a,b is marked as black, the actually moving part Pam is marked as red, the region with changed luminance Plc is marked as green, and the relatively moving area Prm is marked as blue. The first two kinds, Plc and Prm, can also be used to update TCSS directly, whereas the third kind, Pam, needs to be excluded from It, and the pixels in the same regions of It-1 are used to update TCSS instead. As shown in Figure 3e–g, the poster occluded by the men’s hands in Figure 3e and the white board behind the man in Figure 3f are all preserved in Figure 3g. Provided with the corresponding depth maps Dt and Dt-1, the three different image parts are defined as follows.
The two averaged quantities in these definitions are the average depth values of the square areas in Dt and Dt-1, respectively. The square neighborhoods use the same L × L window size as the SSIM computation in Equation (2) and are centered on pixel p. T is a constant threshold, which defines the acceptable range of depth fluctuation. |·| denotes the absolute value.
Then the information of stationary scene between two adjacent frames can be extracted by the following equation:
Finally, the temporary sprite of stationary scene (TCSS and TMSS) is used to update the global sprite (CSS and MSS). The update operation is described as follows.
, respectively, represent the average depth value of square areas in TMSS and MSS. The square neighborhoods have the same window size L × L with SSIM computation and take the coordinates of pixel p as center position. T is the same constant threshold defined in Equation (4). Figure
3d shows TCSS of Figure
3h,i are CSS and MSS of Figure
3b, respectively. Almost all the texture and depth information of stationary scene are restored in Figure
So far, the background information that appeared in past frames is stored in CSS and MSS, which can be used to partly solve the disocclusion problem of the virtual view synthesis algorithm.
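The split of a frame into stationary and non-stationary pixels can be illustrated with a minimal Python sketch. It assumes the standard SSIM form with K1 = 0.01 and K2 = 0.03, a window size L = 7, and a threshold A = 0.9; none of these values are given in the text, and the helper names `box_mean`, `ssim_map`, and `stationary_mask` are hypothetical. The depth-based refinement of the remaining pixels into Plc, Prm, and Pam and the sprite update are deliberately left out.

```python
import numpy as np

def box_mean(img, L):
    """Mean over the L x L window centred on each pixel (L odd, edges replicated)."""
    pad = L // 2
    padded = np.pad(img, pad, mode="edge")
    # Summed-area table with a leading row and column of zeros.
    sat = np.pad(padded, ((1, 0), (1, 0))).cumsum(axis=0).cumsum(axis=1)
    h, w = img.shape
    total = sat[L:L + h, L:L + w] - sat[:h, L:L + w] - sat[L:L + h, :w] + sat[:h, :w]
    return total / (L * L)

def ssim_map(prev, curr, L=7, K1=0.01, K2=0.03, peak=255.0):
    """Per-pixel SSIM between two luminance frames (assumed standard form)."""
    prev = prev.astype(np.float64)
    curr = curr.astype(np.float64)
    c1, c2 = (K1 * peak) ** 2, (K2 * peak) ** 2
    mu_p, mu_c = box_mean(prev, L), box_mean(curr, L)
    var_p = box_mean(prev * prev, L) - mu_p * mu_p
    var_c = box_mean(curr * curr, L) - mu_c * mu_c
    cov = box_mean(prev * curr, L) - mu_p * mu_c
    return ((2 * mu_p * mu_c + c1) * (2 * cov + c2)) / (
        (mu_p * mu_p + mu_c * mu_c + c1) * (var_p + var_c + c2))

def stationary_mask(prev, curr, A=0.9, L=7):
    """True where the SSIM index reaches the (assumed) threshold A."""
    return ssim_map(prev, curr, L=L) >= A
```

Pixels where the mask is true play the role of Is and could be copied into the temporary sprite directly; the others would still need the depth comparison described above.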
The backward DIBR method, which shares the same idea with the inverse warping method in
, can efficiently eliminate the small cracks in virtual view caused by resampling problem in traditional DIBR process
. In general, the backward DIBR method can be divided into two steps: warping the depth map of the reference view to the virtual view position and generating the texture image of the virtual view.
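The two steps can be sketched in Python under strong simplifying assumptions that are not taken from the article: the cameras are rectified and parallel, so pixels only shift horizontally; the disparity is an affine function of the 8-bit depth level between hypothetical bounds `d_far` and `d_near`; and a plain Z-buffer with nearest-pixel rounding stands in for the four-register weighted scheme described in this section. The function names are illustrative only.

```python
import numpy as np

def disparity(level, d_far, d_near):
    """Assumed mapping: disparity in pixels is affine in the 8-bit depth level."""
    return d_far + (d_near - d_far) * np.asarray(level, dtype=np.float64) / 255.0

def warp_depth(depth, d_far, d_near):
    """Step 1: forward-warp the depth map; level 0 marks a hole."""
    h, w = depth.shape
    virt = np.zeros_like(depth)
    disp = disparity(depth, d_far, d_near)
    for y in range(h):
        for x in range(w):
            level = depth[y, x]
            if level == 0:
                continue
            u = int(round(x - disp[y, x]))
            # Z-buffer test: a larger level means a nearer surface.
            if 0 <= u < w and level > virt[y, u]:
                virt[y, u] = level
    return virt

def sample_texture(texture, virt_depth, d_far, d_near, hole_value=255.0):
    """Step 2: fetch each non-hole virtual pixel from the reference texture."""
    h, w = virt_depth.shape
    out = np.full((h, w), hole_value, dtype=np.float64)
    disp = disparity(virt_depth, d_far, d_near)
    for y in range(h):
        for u in range(w):
            if virt_depth[y, u] == 0:
                continue
            # Sub-pixel source column, clamped at the image border.
            xs = min(max(u + disp[y, u], 0.0), w - 1.0)
            x0 = int(np.floor(xs))
            x1 = min(x0 + 1, w - 1)
            a = xs - x0
            out[y, u] = (1 - a) * texture[y, x0] + a * texture[y, x1]
    return out
```

Because the colour is fetched by interpolation at a sub-pixel source position rather than splatted forward, every non-hole virtual pixel receives a value, which is why this direction avoids the small cracks of forward warping.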
In the backward DIBR method, Dt is warped to the virtual perspective position. A two-pixel-wide region around background–foreground transitions is marked as unreliable pixels. During the rendering process of the depth map, the unreliable pixels are skipped, because their depth values are inaccurate. Each pixel q:(u,v) of the virtual view has four registers, which are used to store the depth and distance of the four nearest pixels projected from the reference image. The four registers of pixel q only store rendered pixels from the reference image whose distance to q is less than one pixel either in horizontal or vertical direction. VDt, the depth map of the virtual view, is calculated as follows
where N(q) denotes the number of pixels warped to q that satisfy the condition mentioned above. If N(q) is larger than 4, we sort the warped pixels by depth value in descending order and store the first four pixels, which have the largest depth values. Dk is the depth value of a stored pixel. N(q) = 0 means that no pixel is projected to pixel q. λk represents the normalized weight factor combining distance and depth, which is defined as
where the weight factor of distance ωk is expressed as Equation (9). (Uk,Vk) is the projected position of warped pixel in virtual image plane.
The weight factor of depth ρk is expressed as
where μND is the average depth value of all the stored warped pixels in pixel q.
The non-hole pixel (u,v) in VDt is reprojected to position (Xuv,Yuv) in image plane of original view to get the texture image of virtual view by interpolation operation. The texture image of virtual view VIt is calculated by
where ‘hole’ flag means there is no warped pixel from the reference image. We set the hole pixels with a white color (R = 255, G = 255, B = 255). In represents the color value of pixel (xn,yn) whose distance to (Xuv,Yuv) is less than one pixel either in horizontal or vertical direction. θn is the weight factor of distance, which is expressed as
The virtual depth map VMt projected from MSS and the virtual texture image VCt projected from CSS can be obtained by the same backward DIBR method. Two results of our backward DIBR algorithm are given in Figure
To efficiently use the structure information in CSS, the two virtual texture images (VIt and VCt) need to be merged together. The merged virtual image and its depth map are denoted as MIt and MDt, respectively. The virtual view image VIt dominates the merging process. Available background information in VCt is used to fill the blank areas in VIt. There may be holes in both foreground and background due to the inaccuracy of the depth map, as shown in Figure 4e. We perform the merging operation carefully to avoid filling holes in the foreground with background structures.
First, an estimated depth value is obtained for each hole pixel q:(u,v) in VIt. As mentioned in Section 3, the hole regions of the virtual view lack background information. When q is located between background and foreground, we choose the smaller depth value, which belongs to the background scene, as the estimate; otherwise, we use the average depth. The estimation is defined as
where qL and qR represent the first non-hole pixels to the left and to the right of q in the same row, respectively, and the two averaged depths are taken over the K × K windows in VDt centered on qL and qR. T is the same constant defined in Equation (4).
Then the merging operation is executed as follows.
where the non-hole flag means that a meaningful value exists at this pixel position. The second condition in Equation (14) covers the situation in which pixel q is a hole in VIt but a meaningful pixel with available background texture in VCt. This condition ensures that the holes in foreground objects will not be filled with the accumulated background information in VCt. F represents the acceptable range of depth fluctuation in the merging operation. In Figure 4g, the available texture of the stationary background scene in Figure 4f is merged with the virtual image (Figure 4e) rendered from the original view, and the hole areas in foreground objects are preserved. The corresponding depth value of each non-hole pixel in the merged virtual view MIt is stored in MDt, and the depth value of each hole pixel is set to zero.
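The merging rule can be sketched in Python as follows. This is one reading of the prose above, not the article's exact equations: holes are identified by a zero in the virtual depth map, the neighbour depths are taken from single pixels instead of K × K window averages, and the thresholds `T = 10` and `F = 10` as well as the function names are assumptions made for illustration.

```python
import numpy as np

def estimate_hole_depth(vd, v, u, T):
    """Depth guess for hole (v, u) from its nearest non-hole neighbours in the row."""
    row = vd[v]
    left = next((int(row[x]) for x in range(u - 1, -1, -1) if row[x] > 0), None)
    right = next((int(row[x]) for x in range(u + 1, row.size) if row[x] > 0), None)
    if left is None or right is None:
        return left if right is None else right   # may be None if the row is empty
    if abs(left - right) > T:
        return min(left, right)   # foreground/background boundary: take background
    return (left + right) / 2.0   # same depth layer: take the average

def merge_views(vi, vd, vc, vm, T=10, F=10):
    """Fill holes of (vi, vd) from the sprite view (vc, vm) when depths agree."""
    mi, md = vi.copy(), vd.copy()
    for v, u in zip(*np.nonzero(vd == 0)):
        est = estimate_hole_depth(vd, v, u, T)
        sprite_ok = vm[v, u] > 0 and est is not None
        if sprite_ok and abs(int(vm[v, u]) - est) <= F:
            mi[v, u], md[v, u] = vc[v, u], vm[v, u]
    return mi, md
```

A hole inside a foreground object has foreground depth on both sides, so its estimate is far from the sprite's background depth and the hole is left for the inpainting stage.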
Oriented exemplar-based inpainting
The merging operation can solve the disocclusion problem partly, because the useful background information in CSS and MSS is limited. There still exist hole areas in the merged virtual view MIt, which are divided into two kinds: the foreground holes caused by inaccurate depth map and the blank areas caused by occlusion in original view. The image part with known pixels is defined by Λ, and the remaining hole area is denoted as Γ. The border of hole area Γ is defined as ∂Γ, as shown in Figure
To restore the missing information of the remaining hole areas, we propose an oriented exemplar-based inpainting algorithm based on the previous work of Criminisi et al.
. They determine the filling order of hole pixels h∈∂Γ by assigning each hole pixel a priority P(h). The hole pixel with the highest priority is filled first with the best match patch in Λ. The priority is the product of the confidence term C(h) and the data term D(h). The confidence term favors filling hole pixels that have a large support set of known pixels first, while the data term ensures the continuous propagation of linear structure into hole regions. Noticing that most remaining holes are due to a lack of scene information of the stationary background, we improve their algorithm in two ways. One is to fill first the border pixels in ∂Γ that are adjacent to the background area. The other is to choose the texture of the known background area to restore the disoccluded regions. The improvements are implemented by considering the depth cue in the calculation of the priority term and the energy function, both of which are used for the best exemplar searching procedure.
The modified priority term is defined as
where de(h) represents the depth term. The definition of C(h) and D(h) is the same as Criminisi’s approach, and their expressions can be found in
. The depth term is expressed as follows.
where BG and FG represent the background areas and foreground objects, respectively. Q is a constant, which should be no less than the maximum of the product of C(h) and D(h). We set Q = 256 in our framework. The new priority term will steer the filling order from background to foreground and keep the advantage of linear structure propagation.
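The effect of the depth term on the filling order can be illustrated with a small Python sketch. It assumes an additive form, priority = C(h)·D(h) + de(h) with de(h) = Q for border pixels adjacent to the background and 0 otherwise, which is one reading consistent with the stated constraint on Q but is not confirmed by the text. The `BorderPixel` class and the `touches_background` flag are hypothetical.

```python
from dataclasses import dataclass

Q = 256.0  # value stated in the text; must dominate any product C(h) * D(h)

@dataclass
class BorderPixel:
    pos: tuple                 # (row, column) of the border pixel h
    confidence: float          # C(h) as in Criminisi et al.
    data: float                # D(h) as in Criminisi et al.
    touches_background: bool   # assumed flag: h is adjacent to a background area

def priority(px, q=Q):
    """Assumed priority: Criminisi's product plus a depth offset for background."""
    depth_term = q if px.touches_background else 0.0
    return px.confidence * px.data + depth_term

def next_to_fill(border):
    """Return the border pixel that should be filled first."""
    return max(border, key=priority)
```

With this form, every background-adjacent pixel outranks every foreground-adjacent one, while the ordering inside each group is still decided by C(h)·D(h), so linear structures keep propagating as in the original scheme.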
Let r denote the pixel with maximum priority in ∂Γ. The J × J samples patch, which takes r as center, is defined as Ψ. A square area around r with W × W samples is defined to be the searching area Ω. Then the oriented exemplar-based inpainting algorithm needs to search for the best match patch S in Ω, which has the most similar texture with Ψ. The center of S is denoted as s. The corresponding depth areas of Ψ and S are represented by Θ and O, respectively.
The energy function combining the depth cue is expressed as follows.
where Ψk denotes the position set of known pixels in the filling target patch Ψ. The position set of hole pixels in Ψ is represented by Ψu: Ψu = Ψ - Ψk. Ψ(m) and S(m) denote the pixel value at pixel position m in Ψ and S, respectively. Θ(m) and O(m) represent the depth value at pixel position m in Θ and O, respectively. β is a constant, which is the weighting factor for the depth values of the pixels in Θ that correspond to Ψk. The energy function also uses two average depth values: that of the pixels in Θ corresponding to Ψk, and that of the pixels in O corresponding to Ψu. They are defined as
where |Ψu| denotes the area of Ψu. γ is the penalizing factor for the candidate patches with foreground texture. γ is an adaptive parameter related to the area of Ψk, denoted as |Ψk|. Then γ is calculated as
The best match block in the searching area Ω is obtained by minimizing the energy cost function (17). The first term in energy function (17) represents the texture difference between the known pixels in target patch Ψ and the corresponding pixels in match patch S. In our approach, only the luminance component is considered. The second term in (17) indicates the depth similarity, which has lower importance than the first texture term. The third term is a penalization term. If there exist pixels of foreground objects in the corresponding area of Ψu in S, the penalization term will become larger. The likelihood of selecting patches with foreground pixels is greatly reduced by adding the penalization term. According to the definition of the energy function, the patches of the background scene, which contain similar texture and depth structure with the target block, will be selected to restore the missing information of the disoccluded image areas. We applied our oriented exemplar-based inpainting method to synthesize the missing texture information of disoccluded area in Figure
5a. The blank region is filled from background scene to foreground objects, and the linear structure is propagated into the hole in an appropriate way (see Figure
To evaluate the performance of the proposed method, we compare our approach with other methods, including the MPEG view synthesis reference software (VSRS, version 3.5)
, the depth-based inpainting method in
, and the Asymmetric Gaussian filtering method of Zhang and Tam
Our experiments are carried out on three test sequences: “Book arrival”, “Breakdancers”, and “Ballet”. These sequences have 100 frames and a resolution of 1024 × 768 samples. Multiple video plus depth data from different camera views are available. The “Book arrival” sequence is captured by a parallel camera array and the others are obtained by a toed-in camera array. The baseline between two adjacent cameras is approximately 6.5 cm for the “Book arrival” sequence and 20 cm for the other two sequences.
The parameter values used in our proposed algorithm are summarized in Table 1. The optimized parameters are used for the MPEG method (VSRS 3.5). For the Asymmetric Gaussian filtering method, we use strong smoothing parameters to eliminate the disoccluded areas caused by the large camera baseline. We set the horizontal and vertical standard deviations of the Gaussian kernel to 20 and 60, respectively. The filter window sizes are set to 61 samples horizontally and 193 samples vertically. In the experiments, the Asymmetric Gaussian filtering method and the depth-based inpainting method employ the backward DIBR approach proposed in Section 4 to handle the visibility and resampling problems, just as our proposed method does.
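For reference, the asymmetric smoothing with these settings can be sketched in Python as a separable filter. The standard deviations (20 and 60) and window sizes (61 and 193) come from the text; the truncated, renormalised kernel, the edge-replicating border handling, and the function names are assumptions of this sketch rather than details of the compared method.

```python
import numpy as np

def gaussian_kernel(sigma, window):
    """Truncated 1-D Gaussian of odd length `window`, normalised to sum to 1."""
    x = np.arange(window) - window // 2
    k = np.exp(-(x ** 2) / (2.0 * sigma ** 2))
    return k / k.sum()

def smooth_axis(img, kernel, axis):
    """Filter every 1-D line along `axis`, replicating the border values."""
    pad = kernel.size // 2

    def conv(line):
        return np.convolve(np.pad(line, pad, mode="edge"), kernel, mode="valid")

    return np.apply_along_axis(conv, axis, img)

def asymmetric_smooth(depth, sigma_h=20, sigma_v=60, win_h=61, win_v=193):
    """Smooth a depth map more strongly in the vertical than the horizontal direction."""
    d = depth.astype(np.float64)
    d = smooth_axis(d, gaussian_kernel(sigma_h, win_h), axis=1)  # along rows
    return smooth_axis(d, gaussian_kernel(sigma_v, win_v), axis=0)  # along columns
```

Kernels this wide flatten a sharp depth edge into a long ramp, which removes the holes at object boundaries but also explains the geometric distortions reported for this method.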
The view synthesis results of these three test video sequences are shown in Figures
8. All four presented approaches can handle the visibility and resampling problems and fill the disoccluded areas in the virtual view. Our proposed algorithm gives the best subjective quality compared with the other three methods.
The Asymmetric Gaussian filtering method causes noticeable geometric distortions. The vertical structure is curved in Figures
7c. The foreground objects appear widened, as shown in Figures
7g,k. This method will slightly shift the object away from its correct position (see Figure
6g), which will reduce the disparity between reference image and virtual image and decrease the 3D feelings. For the purpose of autostereoscopic display, although the visual quality of Figure
6g is still pleasant, the depth perception of the scene is distorted due to these shifts. The distorted stereo display will make viewers feel uncomfortable and cause visual fatigue. The depth-based inpainting method can restore the blank areas with the color of background pixels, but induces severe blurring artifacts (see Figures
8h) and some color bleeding defects (see Figures
8h). The filling results are very uncomfortable for visual experience. The VSRS method will lead to significant horizontal structure artifacts (as shown in Figures
8i,m) and decrease the visual quality greatly.
The proposed approach utilizes the accumulated information of stationary scene to fill the disoccluded areas and achieves convincing effect, as shown in Figures
8j. The missing structure of blank regions is restored with the true background structure. Even for the disoccluded areas caused by stationary foreground objects, our proposed method can obtain plausible filling results. As shown in Figures
7n, the hole areas are filled with the texture of background scene without losing the sharpness compared to Figure
8l gives better visual effect than Figure
8n. Because the man’s leg is very close to the wall in Figure
8b, it is difficult to distinguish the leg from the wall. In Figure 8n, our approach wrongly fills the hole with the texture of the wall. Another important advantage of our approach is the temporal texture consistency of the filled disoccluded regions. For disoccluded areas caused by moving foreground objects, the missing texture is recovered from other frames. The true texture information in other frames is extracted and used to restore the hole areas. To demonstrate the consistency in the temporal direction, a series of magnified virtual image subsections for the “Ballet” sequence is shown in Figure 9. The disoccluded regions around the woman in adjacent frames are restored with the same true background structure, so the texture of the filled image areas remains consistent over time.
We adopt the peak signal-to-noise ratio (PSNR) and the structural similarity index (SSIM)
 to compare the performance of the proposed approach with the other three methods.
For every test case of each sequence, the PSNR and SSIM values are calculated over the whole image region of every virtual image frame. The mean values of PSNR and SSIM for each test case are listed in Table
2 and the best results are highlighted in boldface. The “Camera” column indicates the camera configuration used for virtual view generation; e.g., “8→9” means synthesizing the virtual view at the 9th camera’s perspective position from the 8th camera.
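As a minimal sketch of the evaluation protocol described above, the following Python code computes the PSNR over the whole image for each frame and then averages the values over one test case. It assumes 8-bit single-channel frames stored as nested lists and a peak value of 255; these choices, and the function names `psnr` and `mean_psnr`, are illustrative assumptions and are not taken from the authors' C implementation. The per-frame SSIM values would be averaged per test case in the same way.

```python
import math

def psnr(reference, rendered, peak=255.0):
    """PSNR in dB between two equally sized frames (lists of pixel rows)."""
    squared_error = 0.0
    count = 0
    for ref_row, ren_row in zip(reference, rendered):
        for r, s in zip(ref_row, ren_row):
            squared_error += (r - s) ** 2
            count += 1
    mse = squared_error / count
    if mse == 0:
        return math.inf  # identical frames have unbounded PSNR
    return 10.0 * math.log10(peak * peak / mse)

def mean_psnr(reference_frames, rendered_frames):
    """Average the per-frame, whole-image PSNR over one test case."""
    values = [psnr(a, b) for a, b in zip(reference_frames, rendered_frames)]
    return sum(values) / len(values)
```

Here `reference_frames` would hold the images actually captured at the target camera position and `rendered_frames` the synthesized virtual views for the same frames.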
From Table 2, we can observe that, among the four methods, the proposed framework has the best PSNR and SSIM performance for both the parallel and toed-in camera configurations. The Asymmetric Gaussian filtering method yields the lowest PSNR and SSIM values because of the geometric distortion. For the four test cases of the “Book arrival” sequence, the baselines between the virtual view and the reference view are small (6.5–13 cm). Because the holes around the image boundary account for a large percentage of all disocclusions (see Figure
6b), the PSNR and SSIM gains of our proposed framework are small, i.e., 0.09–0.22 dB for PSNR and 0.0006–0.0027 for SSIM compared to the depth-based inpainting method. For the test cases of “Breakdancers” and “Ballet” with large baselines (20–40 cm), our proposed approach obtains larger PSNR and SSIM gains compared to the depth-based inpainting method, i.e., 0.16–1.82 dB for PSNR and 0.0018–0.0116 for SSIM. There are two important reasons for the PSNR and SSIM improvements of our proposed framework. One is the structure information available from the stationary scene sprite; the other is the oriented exemplar-based inpainting process with a reasonable filling order. Figure
10 shows the PSNR and SSIM curves for two test cases. One is the virtual view of the “Ballet” sequence, which is generated from the 3rd camera to the 4th camera. The other is the virtual view of the “Breakdancers” sequence, which is generated from the 5th camera to the 4th camera.
Figure 11 gives the PSNR curves for a local area of the “Book arrival” sequence. The local area of interest is the same subsection shown in Figure
6n. From the 1st frame to the 31st frame, the local area covers only background objects, so the performance of the three algorithms is very close. From the 32nd frame to the 99th frame, the local area contains both background and foreground objects, so disoccluded regions appear in the local area because of the depth discontinuity. With the proposed stationary scene extraction algorithm, the true texture information of the background objects is used to recover the disoccluded regions. The temporal consistency of texture and structure is maintained for these frames using our algorithm. Compared to the VSRS and the depth-based inpainting algorithm, the fluctuation of the PSNR values is much smaller for the proposed method (as shown in Figure
11), which means that the temporal consistency of the rendered sequence is improved. The PSNR value clearly drops at the 32nd frame and the 51st frame because of the sudden depth change in the input sequence. To obtain a more consistent rendered sequence, a temporal filtering procedure for the input depth sequence would be beneficial.
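One simple way to quantify the fluctuation discussed above is the standard deviation of the per-frame PSNR values, reported together with the largest frame-to-frame change. The sketch below is an illustrative assumption rather than a measure defined in this article; the function name `psnr_fluctuation` and the sample values are hypothetical.

```python
import statistics

def psnr_fluctuation(psnr_per_frame):
    """Return (population std dev, largest absolute frame-to-frame change) in dB."""
    spread = statistics.pstdev(psnr_per_frame)
    jumps = [abs(b - a) for a, b in zip(psnr_per_frame, psnr_per_frame[1:])]
    largest_jump = max(jumps) if jumps else 0.0
    return spread, largest_jump

# Hypothetical PSNR curves (not measured data): a steady one and an unsteady one.
steady = [30.0, 30.2, 29.9, 30.1]
jumpy = [30.0, 33.0, 27.0, 31.0]
print(psnr_fluctuation(steady)[0] < psnr_fluctuation(jumpy)[0])  # prints True
```

A smaller spread and a smaller largest jump both indicate a temporally steadier rendered sequence under this measure.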
We implemented the four algorithms in C on a Dell workstation and evaluated their runtime costs, as summarized in Table
3. The execution time of each step in the proposed framework is given in Table
4. The workstation is equipped with an Intel 2.93-GHz Xeon quad-core CPU and 4-GB DDR2 RAM.
The runtime costs of the Asymmetric Gaussian filtering and the MPEG method are within 10 s per frame. The depth-based inpainting algorithm takes more than 2 min per frame because of its time-consuming iterative operation. The proposed approach takes about 20 s to generate the virtual view for each frame. The oriented exemplar-based inpainting process accounts for most of the runtime of our approach, about 50–80%, as shown in Table 4. The execution time of the oriented exemplar-based inpainting algorithm depends on the size of the disoccluded areas, the image patch size, and the size of the search window. For the “Ballet” sequence, because the area of the hole regions is larger than in the other two test sequences (cf. Figures
8b), the runtime cost increases about twofold. The additional time cost is acceptable given the improvement in the objective and subjective quality of the virtual view image.
Conclusion and future work
This article presents a novel DIBR method combined with spatial and temporal texture synthesis. By maintaining a sprite of the stationary scene of the original sequence, useful structure information can be used to restore the missing texture of disocclusions in virtual view images. The remaining disoccluded areas are restored by the proposed oriented exemplar-based inpainting approach, which fills the remaining hole areas from background to foreground and propagates structure and texture into the blank regions in an appropriate way. By combining these two algorithms, the proposed DIBR method solves the disocclusion problem well and achieves spatial and temporal consistency. These features make the proposed approach well suited to extrapolation in virtual view synthesis. Meanwhile, the proposed framework is flexible enough to be switched to the interpolation operation. Theoretical analysis and experimental results show that the proposed method outperforms state-of-the-art view synthesis methods. The increase in runtime cost is moderate and acceptable. Our future work will focus on camera tracking and motion compensation to extend the proposed method to scenarios with moving cameras.
PDE: partial differential equations
SSIM: structural similarity index.
Smolic A, Kauff P, Knorr S, Hornung A, Kunter M, Muller M, Lang M: Three-dimensional video postproduction and processing. Proc. IEEE 2011, 99(4):607-625.
Fehn C: Depth-image-based rendering (DIBR), compression, and transmission for a new approach on 3D-TV. In Proceedings of SPIE Stereoscopic Displays and Virtual Reality Systems XI. San Jose, CA, USA; 2004:93-104.
Smolic A, Muller K, Dix K, Merkle P, Kauff P, Wiegand T: Intermediate view interpolation based on multiview video plus depth for advanced 3D video systems. 15th IEEE International Conference on Image Processing (San Diego, CA, USA, 12–15 October 2008) pp. 2448–2451
Mori Y, Fukushima N, Yendo T, Fujii T, Tanimoto M: View generation with 3D warping using depth information for FTV. Signal Process.: Image Commun 2009, 24(1–2):65-72.
Chen W, Chang Y, Lin S, Ding L, Chen L: Efficient depth image based rendering with edge dependent depth filter and interpolation. IEEE International Conference on Multimedia and Expo (Amsterdam, Netherlands, 6 July 2005) pp. 1314–1317
Daribo I, Tillier C, Pesquet-Popescu B: Distance dependent depth filtering in 3D warping for 3DTV. IEEE 9th Workshop on Multimedia Signal Processing (Chania, Crete, Greece, 1–3 October 2007) pp. 312–315
Wang W, Huo L, Zeng W, Huang Q, Gao W: Depth image segmentation for improved virtual view image quality in 3-DTV. IEEE International Symposium on Intelligent Signal Processing and Communication Systems (Xiamen, China, 28 November–1 December 2007) pp. 300–303
De Bonet J: Multiresolution sampling procedure for analysis and synthesis of texture images. In Proceedings of the 24th Annual Conference on Computer Graphics and Interactive Techniques. ACM Press/Addison-Wesley Publishing Co., Los Angeles, CA, USA; 1997:361-368.
Bertalmio M, Sapiro G, Caselles V, Ballester C: Image inpainting. In Proceedings of the 27th annual conference on Computer graphics and interactive techniques. ACM Press/Addison-Wesley Publishing Co., New Orleans, LA, USA; 2000:417-424.
Oh K, Yea S, Ho Y: Hole filling method using depth based in-painting for view synthesis in free viewpoint television and 3-d video. IEEE Proceedings of Picture Coding Symposium (Chicago, IL, USA, 6–8 May 2009) pp. 1–4
Schmeing M, Jiang X: Depth image based rendering: a faithful approach for the disocclusion problem. In IEEE 3DTV-Conference: The True Vision-Capture, Transmission and Display of 3D Video (3DTV-CON). Tampere, Finland; 7:1-4.
Ming Xi would like to thank Yin Zhao for his discussion and suggestions about the backward DIBR algorithm. Ming Xi would also like to thank Menno Wildeboer and Masayuki Tanimoto for their kind help with the implementations. The authors would like to thank the Interactive Visual Media Group at Microsoft Research for providing the “Breakdancers” and “Ballet” sequences and the Fraunhofer Institute for Telecommunications-Heinrich Hertz Institute for providing the “Book arrival” sequence. This study was supported in part by the National Natural Science Foundation of China (Grant nos. 60802013, 61072081, 61271338), the National High Technology Research and Development Program (863) of China (Grant no. 2012AA011505), the National Science and Technology Major Project of the Ministry of Science and Technology of China (Grant no. 2009ZX01033-001-007), the Key Science and Technology Innovation Team of Zhejiang Province, China (Grant no. 2009R50003), and the China Postdoctoral Science Foundation (Grant nos. 20110491804, 2012T50545).
Authors and Affiliations
Institute of Information and Communication Engineering, Zhejiang University, Hangzhou, 310027, P.R. China
Ming Xi, Liang-Hao Wang, Qing-Qing Yang, Dong-Xiao Li & Ming Zhang
Zhejiang Provincial Key Laboratory of Information Network Technology, Zhejiang University, Hangzhou, 310027, P.R. China
Ming Xi, Liang-Hao Wang, Qing-Qing Yang, Dong-Xiao Li & Ming Zhang
This article is distributed under the terms of the Creative Commons Attribution 2.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.