# Real-time virtual view synthesis using light field

- Li Yao
^{1}Email authorView ORCID ID profile, - Yunjian Liu
^{1}and - Weixin Xu
^{1}

**2016**:25

https://doi.org/10.1186/s13640-016-0129-2

© The Author(s). 2016

**Received: **30 June 2016

**Accepted: **6 September 2016

**Published: **15 September 2016

## Abstract

Virtual view synthesis technique renders a virtual view image from several pre-collected viewpoint images. The hotspot on virtual view synthesis area is depth image-based rendering (DIBR), which has low one-time imaging quality. To achieve high imaging quality artifacts, the holes must be inpainted after image warping which means high computational complexity. This paper proposed a real-time virtual view synthesis method from light field. Then the light field is transformed into frequency domain. The light field is parameterized and reconstructed from image array. The virtual view is rendered by resampling the light field in frequency domain. After resampling the image by performing Fourier slice, the virtual view image is obtained by inverse Fourier transform. Experiments show that our method can get high one-time imaging quality in real time.

## Keywords

## 1 Introduction

For many modern applications including special effects on films, TVs, and surveillance as well as virtual reality and other applications, it is often desirable to generate a high-quality virtual view image of a 3D scene from a viewpoint for which no direct information is available. The virtual view is synthesized from information from a number of known reference views (RV). Depth image-based rendering (DIBR) technique is the hotspot in virtual view synthesis field. Each pixel in RV has corresponding depth information. The core of DIBR is 3D image warping theory proposed by McMillan [1]. The virtual view is synthesized by re-projection of every pixel of referenced images in space.

The main drawback of DIBR technology is a low one-time image quality. The holes [2, 3], artifacts [4, 5], deformation [6], and abnormal edge [7] will appear after image warping; repair works for these issues, such as inpainting, have high computational complexity. Although there are a lot of recent work on improving imaging quality in real time [8], Ho [9] and Xiao [10] reduced the computational complexity and improved operational efficiency in their work but still cannot achieve real-time performance. Mori [11] and Zinger [12] enhanced imaging quality in their work but further increased computational complexity.

Levoy [13], the proposer of light field rendering theory, called light field the “next generation display technology” and introduced light field into computer graphics. Studies on light field have gotten good results in many research areas, such as microscopic imaging [14], light field camera [15], complex scene 3D reconstruction [16], and accurate depth estimation [17]. Light field has shown great potential in these research works.

For the first time, this paper applies the light field theory to virtual view synthesis. Our method can synthesize high-quality virtual view from the light field in real time.

## 2 Parameterization and collection of light field

Traditional cameras capture a 2D digital image, in which each pixel is the energy integration of all rays that reach the point in scene; the direction of these rays are not distinguished. The image is a projection of 3D scene, which makes the direction and position of rays lost. Light field retains the directions and positions of rays in scene.

### 2.1 Parameterization

*L*= plenoptic (

*V*

_{ x },

*V*

_{ y },

*V*

_{ z },

*θ*,

*∅*,

*λ*,

*t*). Plenoptic function is a seven dimensional function. (

*V*

_{ x },

*V*

_{ y },

*V*

_{ z }) is the spatial position of the observation point, (

*θ*,

*∅*) is the light pitch angle and azimuth angle, λ is the wavelength of light, and

*t*is the observation light recording time. Taking the propagation of light in 3D space into account, the wavelength

*λ*(color) generally does not change. Then the plenoptic function can be reduced to five-dimensional, which is

*L*= plenoptic (

*V*

_{ x },

*V*

_{ y },

*V*

_{ z },

*θ*,

*∅*). In most real-world scenarios, the light that imaging system collected are propagating in a limited light path. Thus, light field can be represented using two reciprocal parallel planar, called light field [13] or lumigraph [19], which is expressed as

*L*(

*u*,

*v*,

*s*,

*t*), wherein (

*u*,

*v*) are coordinates of the camera focal plane and (

*s*,

*t*) are coordinates of the image plane, as shown in Fig. 1. Light field can be understood as a certain intensity of light passing through the point (

*u*,

*v*) and (

*s*,

*t*) on the two planes. In this representation, a ray is represented by two points in (

*u*,

*v*) and (

*s*,

*t*), so the direction and position can be determined, meanwhile the characterization of light down to four-dimensional, moreover light field can be limited into the effective area of the two planes. In practical, color information

*λ*represented by RGB channel and different frames can express time information

*t*. Therefore, light field could only focus on direction and position.

### 2.2 Light field collection based on camera array

*u*,

*v*) plane is the main lens plane and (

*s*,

*t*) is the image plane.

*L*

_{ F }(

*u*,

*v*,

*s*,

*t*) represents the radiation of light, while

*F*is the distance between (

*u*,

*v*) and (

*s*,

*t*). The amount of received radiation on the image plane can be expressed as

*E*

_{ F }(

*s*,

*t*).

*θ*is the angle between light

*L*

_{ F }(

*u*,

*v*,

*s*,

*t*) and the (

*u*,

*v*) plane.

*A*(

*u*,

*v*) is the pupil function, which describes the impact of light coming in through the imaging system. Taking amplitude and phase of the light will change in the imaging system for instance.

## 3 Method

### 3.1 Light field reconstruction

*u*plane is the plane of the main lens of the camera and

*s*plane is the image plane of the camera; the thin lines represent the light converged on a pixel in the imaging plane through the main lens. Figure 4b describes a pixel which is formed in the

*s*plane by overlaying a series of light pass through the main lens in the

*u*plane. The box in Fig. 4b represents that all the light pass through the main lens and image on the

*s*plane. The linear inside box intersects with axis

*s*at point (

*a*, 0) means a series of light pass through the

*u*plane forms pixel (

*a*, 0), which is the pixel in Fig. 4a. Here, for convenience of explanation, this example uses only one dimension each of plane (

*s*,

*t*) and plane (

*u*,

*v*). In practical, the imaging process also includes the vertical dimension

*v*and

*t*of light.

In the conventional imaging process, light is projected on the image plane which makes the position information of light on the *s* plane recorded but does not record the location information on the *u* plane. However, in a camera array, the imaging process, the resultant is a multiple image array of the same scene. This is equivalent to placing lots of main lens on the *u* plane in Fig. 4, which means the position information of light on the *u* plane is recorded. Therefore, using camera spatial relationship and imaging Eq. (3), a light field is captured.

*s'*and

*s*, in Eq. (3)

*F*is different; the result is equivalent to cutting the propagation of light in a track, different facets obtained. When camera is focused on the new plane s,

*F'*changes into

*F*, then expressed

*L*

_{ F }(

*u*,

*s*) with

*L*

_{ F }(

*u*,

*s'*); thus, a new image is synthesized.

*L*

_{ F }described by (

*u*,

*s'*) also can be described by (

*u*,

*s*). Thus, let

*α = F '*/

*F*, by the similar triangles, and then extending two-dimensional to four-dimensional case, the following equation can be obtained:

The formula expressed that image can be taken as a slice of 4D light field project on a 2D plane. Therefore, each image in image array can be taken as a slice of the light field of a certain scene project on a 2D plane.

### 3.2 Resampling in frequency domain

Although the spatial relationship between the image and the light field can be visually described in space domain, but the description itself is an integral projection process, algorithm of this process usually have high computational complexity O(*n*
^{
4
}). In contrast, the relationship between images and light field in frequency domain can be simply express as an image is a 2D slice of 4D light field in Fourier domain. This conclusion stems from the fundamental theory Fourier slice theorem by Ron Bracewell [23] proposed in 1956. The classical theory has a great contribution later in the field of medical imaging, computed tomography scanning, and positron imaging technology.

*u*will converge to a point in plane

*s*. Figure 7c shows the Fourier slice in frequency domain, which coincides with

*k*

_{ s }axis.

*u*,

*v*) and plane (

*s*,

*t*). As shown in Fig. 8, if the tangent is out of blue range, such as the yellow tangent, the resolution of virtual view image is deleted; if the tangent is in the blue region, such as the red tangent, the resolution of virtual view image is not deleted. Therefore, when resampling in frequency domain, in order to get a full-resolution image, the focal length should be shorter than the distance between plane (

*u*,

*v*) and (

*s*,

*t*).

The process of virtual view synthesis can be transformed from space domain to frequency equivalently.

Equations (6) to (8) are the resampling imaging process in frequency domain, in which *f*
_{
u
}, *f*
_{
v
}, *f*
_{
s
}, and *f*
_{
t
} are the projection of the Fourier domain.

*O*(

*n*

^{2}log

*n*) +

*O*(

*n*

^{ 2 }) =

*O*(

*n*

^{2}log

*n*), which is reduced a lot compared to

*O*(

*n*

^{4}) when operating in space domain.

## 4 Results and discussion

### 4.1 Experiment environment

The software and hardware environment

CPU | Intel(R) Core(TM) i7-4790 K @ 4.00 GHz |

GPU | NVIDIA(R) GeForce GTX TITAN X (16 GB) |

RAM | 16GB DDR3 1600 MHz |

OS | Windows 10 professional x64 |

Software | OpenGL, C++, OpenCV 2.4.11, CUDA 7.0 |

### 4.2 Experiment 1: open dataset

Image array information

Image array | Amethyst | Chess | Ball | Knight | Gantry |
---|---|---|---|---|---|

Image amount | 17 × 17 | 17 × 17 | 17 × 17 | 17 × 17 | 17 × 17 |

Image resolution | 768 × 1024 | 1400 × 800 | 1024 × 1024 | 1024 × 1024 | 640 × 1024 |

Single image size(KB) | 939 ~ 968 | 1265 ~ 1274 | 1657 ~ 1687 | 1263 ~ 1280 | 873 ~ 883 |

Total size (MB) | 269 | 358 | 473 | 356 | 248 |

#### 4.2.1 Real-time performance experiment

In this section, the real-time performance is evaluated by frames per second (fps). The more frames rendered per second, the better real-time performance is.

- 1)The viewpoint is moving along the red path in Fig. 10.
- 2)
This process lasts 18 s. From start to end, fps is calculated each second. Thus, average fps and one-time imaging time can be obtained.

Figure 10 is a visualization of plane (*u*, *v*) in a parameterized light field. The dark square is the light field we constructed. Each dark point in dark square represents the center of a camera in camera array. The white point upper left is where the viewpoint starts. The black point lower left is where the viewpoint ends. The red broken line is the path which viewpoint moves along. The blue point upper left means the viewpoint is closer to that camera. The viewpoint is moving along the red path. The viewpoint movement trajectory covers all camera points, from start to end.

#### 4.2.2 Vision effect experiment

As shown in Fig. 12, virtual view images of amethyst and chess are similar with real images. Synthesized image in ball blurs obviously.

When evaluating the quality of synthesized images, we usually compare the real view image and the synthesized image. The more similar they are the better vision effect there is. PSNR is a common objective evaluation standard of image quality, the unit is dB, and the greater the value, the better synthesis quality is.

- 1)
This data set is a close-up image array, compared with amethyst and chess with the same space between cameras; the rays captured in ball have larger angle of incidence, which results in the attenuation effect of the incident light [25].

- 2)
The scene contains a relatively large transparent object [26], which has negative influence on resampling. This is a current problem in this area, our algorithm provides no special treatment with this case, which lowers the imaging effect.

### 4.3 Experiment 2: simulated dataset

In this experiment, we compare the real-time performance and imaging quality of our algorithm with a DIBR-based virtual view synthesis algorithm. Limited by current studies, the data used in two virtual view synthesis methods do not support mutual-use. Therefore, simulation scenario is used in this experiment.

Most of studies based on DIBR in recent years is some improvements based on the framework proposed Do [24], for example, depth image preprocessing [27], improved hole-filling algorithm [28], allocation of resources [9], and parallelization acceleration. So in this experiment, we selected a DIBR synthesis algorithm proposed by Do [24].

#### 4.3.1 Real-time performance experiment

Comparison between DIBR and light field

Experiment | Our algorithm | Do 1 (CPU) | Do 2 (CUDA) |
---|---|---|---|

ms/Pic | 33 ms | 3506 ms | 890 ms |

PSNR | 25.7 | 27.2 | |

MSE | 175.0 | 123.9 | |

SSIM | 0.89 | 0.9 |

#### 4.3.2 Vision effect comparison

As shown in Fig. 14, column (a) lists real image and the details in virtual view position, column (b) lists images synthesized by Do’s algorithm, and column (c) lists images synthesized by our algorithm. In the cups scene, we can see that our result is much more similar to the ground truth at the right edge of the cup (the first and second rows in Fig. 14). In the bowling scene, there are ghost edges and blurring on the bowling pins because of the specular surface, and our result is good.

Objective evaluation results are as shown in Table 3 from 3 to 5 rows. The PSNR, MSE, and SSIM differences between our method and Do’s method were small, but our algorithm has better visual result in real-time performance.

### 4.4 Experiment 3: light field compression

Light field compression result

Image array | Amethyst | Chess | Ball | Knights | Gantry |
---|---|---|---|---|---|

Original light Field (GB) | 1.3 | 2.2 | 1.9 | 1.7 | 1.4 |

Compressed light field (MB) | 87.5 | 65.3 | 100.3 | 73.2 | 84.7 |

Compress rate (%) | 6.5 | 2.9 | 5.2 | 4.2 | 5.9 |

## 5 Conclusions

Traditional DIBR based virtual view synthesis method has low one-time imaging quality, and it is time-consuming to deal with artifacts, holes, and other issues. Our light field based virtual view synthesis method provides a better one-time imaging quality, and no repair work is needed after first-time imaging, which means lower computational complexity; thus, high-quality and real-time virtual view synthesis is performed in this paper. Experiments show that the virtual view synthetic methods proposed in this paper provides good performance in real time and objective image quality using light field. But with transparent objects in the scene, the imaging quality is not high enough. Tao [25] in their study has made some attempt for this issue, which is also the direction of our next study. Moreover, our method is currently only performed on image array. The amount of data will be larger when it comes to videos. Our next work is to compress light field.

## Declarations

### Acknowledgements

This work is supported by the National High-tech R&D Program of China (863 Program) (Grant No. 2015AA015904), China Postdoctoral Science Foundation funded project (2015 M571640), Special grade of China Postdoctoral Science Foundation funded project (2016 T90408), CCF-Tencent Open Fund (RAGR20150120), and Special Program for Applied Research on Super Computation of the NSFC-Guangdong Joint Fund (the second phase).

### Authors’ contributions

LY implemented the core algorithm and drafted the manuscript. WX participated in light field reconstruction and helped to draft the manuscript. YL participated in the light field compression. All authors read and approved the final manuscript.

### Competing interests

The authors declare that they have no competing interest.

**Open Access**This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.

## Authors’ Affiliations

## References

- L Mcmillan,
*An Image-Based Approach to Three-Dimensional Computer Graphics*(University of North Carolina, Chapel Hill, 1997)Google Scholar - Oh, K. J., Yea, S., & Ho, Y. S. (2009). Hole filling method using depth based in-painting for view synthesis in free viewpoint television and 3-D video. Picture Coding Symposium (pp.1-4).Google Scholar
- S Horst, Q Matthias, B Jörg, G Karsten, M Peter, Inter-view consistent hole filling in view extrapolation for multi-view image generation. IEEE Int Conf Image Process
**22**, 426 (2015)Google Scholar - Muller, K., Smolic, A., Dix, K., & Kauff, P. (2008). Reliability-based generation and view synthesis in layered depth video. Multimedia Signal Processing, 2008 IEEE, Workshop on (Vol.24, pp.409-427). IEEE.Google Scholar
- Smolic A, Mueller K, Merkle P, et al. Multi-view video plus depth (MVD) format for advanced 3D video systems. ISO/IEC JTC1/SC29/WG11 and ITU-T SG16 Q, 2007, 6: 2127.Google Scholar
- Gaurav, C., Olga, S., & George, D. (2011). Silhouette-aware warping for image-based rendering. Eurographics Conference on Rendering (Vol.30, pp.1223–1232). Eurographics AssociationGoogle Scholar
- P Merkle, Y Morvan, A Smolic, D Farin, K Müller, PHND With et al., The effects of multiview depth video compression on multiview rendering. Signal Process Image Commun
**24**(1–2), 73–88 (2009)View ArticleGoogle Scholar - M Sharma, S Chaudhury, B Lall, MS Venkatesh, A flexible architecture for multi-view 3dtv based on uncalibrated cameras. J. Vis. Commun. Image Represent.
**25**(4), 599–621 (2014)View ArticleGoogle Scholar - TY Ho, DN Yang, W Liao, Efficient resource allocation of mobile multi-view 3d videos with depth image-based rendering. IEEE Trans Mob Comput
**14**(2), 344–357 (2015)View ArticleGoogle Scholar - J Xiao, MM Hannuksela, T Tillo, M Gabbouj, Scalable bit allocation between texture and depth views for 3-d video streaming over heterogeneous networks. IEEE Trans. Circuits Syst. Video Technol.
**25**(1), 139–152 (2015)View ArticleGoogle Scholar - Y Mori, N Fukushima, T Yendo, T Fujii, M Tanimoto, View generation with 3d warping using depth information for ftv. Signal Process Image Commun
**24**(1–2), 65–72 (2009)View ArticleGoogle Scholar - Zinger, S., Do, L., & With, P. H. N. D. (2010). Free-viewpoint depth image based rendering. Journal of Visual Communication & Image Representation, 21(s 5–6), 533–541.Google Scholar
- Levoy, M. (1996). Light field rendering. Conference on Computer Graphics and Interactive Techniques (Vol.2, pp.II-64 - II-71). ACM..Google Scholar
- M Levoy, R Ng, A Adams, M Footer, M Horowitz, Light field microscopy. ACM Trans Graph
**25**(3), 924–934 (2006)View ArticleGoogle Scholar - Ren, N., Levoy, M., Bredif, M., Duval, G., Horowitz, M., & Hanrahan, P. (2005). Light field photography with a hand-held plenoptic cameraGoogle Scholar
- C Kim, H Zimmer, Y Pritch, A Sorkine-Hornung, M Gross, Scene reconstruction from high spatio-angular resolution light fields. ACM Trans Graph
**32**(4), 96–96 (2013)MATHGoogle Scholar - Tao, M. W., Srinivasan, P. P., Malik, J., Rusinkiewicz, S., & Ramamoorthi, R. (2015). Depth from shading, defocus, and correspondence using light-field angular coherence. IEEE Conference on Computer Vision and Pattern Recognition (pp.1940-1948). IEEE Computer Society.Google Scholar
- Adelson, Edward H., and James R. Bergen. The plenoptic function and the elements of early vision. Vision and Modeling Group, Media Laboratory, Massachusetts Institute of Technology, 1991Google Scholar
- R Szeliski, S Gortler, R Grzeszczuk, MF Cohen,
*The lumigraph*. Proceedings of Siggraph, 2001, pp. 43–54Google Scholar - B Wilburn, N Joshi, V Vaish, EV Talvala, E Antunez, A Barth et al., High performance imaging using large camera arrays. ACM Trans. Graph.
**24**(3), 765–776 (2010)View ArticleGoogle Scholar - Stroebel, & Leslie, D..Basic photographic materials and processes /. Focal Press..Google Scholar
- Ng, R. (2006). Digital light field photography. (Vol.115, pp.38-39). Stanford University.Google Scholar
- Bracewell, R. N.. The Fourier Transform & Its Applications. The Fourier transform and its applications /. WCB/McGraw Hill.Google Scholar
- Do, L., Zinger, S., & With, P. H. N. D. (2009). Quality improving techniques for free-viewpoint dibr. Proceedings of SPIE - The International Society for Optical Engineering, 7524, 1–4.Google Scholar
- Tao, M. W., Wang, T. C., Malik, J., & Ramamoorthi, R. (2014). Depth Estimation for Glossy Surfaces with Light-Field Cameras. Computer Vision - ECCV 2014 Workshops. Springer International PublishingGoogle Scholar
- Magnor, M., & Girod, B. (2000). 3D TV: Data compression for light-field rendering. (Vol.10, pp.338-343).Google Scholar
- Y Cho, K Seo, KS Park, Enhancing depth accuracy on the region of interest in a scene for depth image based rendering. Ksii Trans Internet Inf Systs
**8**(7), 2434–2448 (2014)Google Scholar - B Kim, M Hong, An adaptive interpolation algorithm for hole-filling in free viewpoint video. J Meas Sci Instrume
**8681**(4), 343–345 (2013)MathSciNetGoogle Scholar