The DIBR algorithm warps the original view to the virtual view position by projecting the current pixels to points in real 3D space and re-projecting those 3D points onto the virtual image plane. Large disocclusions appear along the discontinuous edges of the depth map, i.e., the transitions between foreground and background in the texture image. The background regions occluded by foreground objects should become visible in the virtual views, but the occluded background information is lost when a 3D scene is recorded as a 2D image. To solve this problem, the proposed stationary scene extraction module tries to recover the lost background structure from the video sequence. For a video captured by a fixed camera, or a short shot of a video, each image consists of moving foreground objects and a stationary background. Background information occluded in the current frame may appear in frames at other moments; if this information is used effectively, the filling of disoccluded areas will be more convincing.
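As an illustrative sketch of the warping step, the snippet below warps a gray-scale frame to a horizontally shifted virtual view under a rectified (parallel-camera) assumption, where the 3D projection and re-projection reduce to a per-pixel horizontal disparity shift. The function name, camera parameters, and the depth-to-distance mapping are assumptions, not taken from the paper; z-buffering is omitted for brevity.

```python
import numpy as np

def dibr_warp(color, depth, f=1000.0, baseline=0.05, z_near=0.3, z_far=10.0):
    """Warp a gray-scale frame to a horizontally shifted virtual view.

    Assumes a rectified (parallel-camera) setup, so projecting to 3D and
    re-projecting reduces to a per-pixel horizontal disparity shift.
    All camera parameters are illustrative, not taken from the paper.
    """
    h, w = depth.shape
    virtual = np.zeros_like(color)
    hole_mask = np.ones((h, w), dtype=bool)   # True where no source pixel lands
    # Map 8-bit depth levels (255 = nearest, 1 = farthest) to metric depth.
    z = 1.0 / (depth / 255.0 * (1.0 / z_near - 1.0 / z_far) + 1.0 / z_far)
    disparity = np.round(f * baseline / z).astype(int)
    for j in range(h):
        for i in range(w):
            if depth[j, i] == 0:              # level 0 marks a hole
                continue
            it = i + disparity[j, i]
            if 0 <= it < w:                   # no z-buffering, kept minimal
                virtual[j, it] = color[j, i]
                hole_mask[j, it] = False
    return virtual, hole_mask                 # True entries are disocclusions
```

The returned hole mask marks exactly the disoccluded areas that the stationary scene extraction module is designed to fill.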

The stationary scene extraction algorithm keeps a global sprite throughout the view generation process to accumulate structure and depth information of the stationary scene in the temporal direction. The global sprite consists of two components: the texture image of the stationary scene, denoted *C*_{SS}, and the depth map of the stationary scene, denoted *M*_{SS}. *C*_{SS} and *M*_{SS} are initialized with the first frame of the texture sequence and depth sequence of the original view, respectively. The initialization step is expressed as follows:

\left\{\begin{array}{l}{C}_{\text{SS}}\left(p\right)={I}_{t}\left(p\right)\\ {M}_{\text{SS}}\left(p\right)={D}_{t}\left(p\right)\end{array}\right.,\quad t=0

(1)

where *p*:(*i*, *j*) corresponds to the pixel at column coordinate *i* and row coordinate *j*. *I*_{t} and *D*_{t} represent the color intensity frame and depth map frame of the input original view at time *t*, respectively. *D*_{t} is an 8-bit gray-scale image: the continuous depth range is quantized to 255 discrete depth values, with the object nearest to the camera image sensor assigned 255 and the farthest object assigned 1. Pixels with depth value 0 are marked as holes. The transform between discrete depth level and actual distance in the real scene can be found in [12].
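The initialization of Equation (1), together with the depth-level convention just described, can be sketched as follows; the function name and the returned hole mask are illustrative additions.

```python
import numpy as np

def init_global_sprite(I0, D0):
    """Eq. (1): initialise the global sprite with the first texture and
    depth frames. Depth convention: 255 = nearest, 1 = farthest, 0 = hole."""
    C_SS = I0.copy()            # texture component of the global sprite
    M_SS = D0.copy()            # depth component of the global sprite
    hole_mask = (D0 == 0)       # pixels with depth level 0 are holes
    return C_SS, M_SS, hole_mask
```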

After the initialization, a temporary sprite of the stationary scene, denoted *T* *C*_{SS} and *T* *M*_{SS}, is obtained between each input image frame *I*_{t} and its previous frame *I*_{t-1} to extract the useful occluded-background information in *I*_{t}. For a stationary scene, the SSIM index [37] between adjacent frames is large, so the image part that is stationary in both adjacent frames can be extracted using the SSIM values. For each pixel *p*:(*i*, *j*), a structural similarity index *p*_{SSIM}, defined in [37], is calculated between the corresponding square areas {\Phi}_{t}^{I} and {\Phi}_{t-1}^{I} of *I*_{t} and *I*_{t-1}, which take *p* as the center pixel and *L* × *L* as the window size. *p*_{SSIM} is calculated as follows:

{p}_{\text{SSIM}}=\frac{(2{\mu}_{{\Phi}_{t}}{\mu}_{{\Phi}_{t-1}}+{K}_{1})(2{\sigma}_{{\Phi}_{t(t-1)}}+{K}_{2})}{({\mu}_{{\Phi}_{t}}^{2}+{\mu}_{{\Phi}_{t-1}}^{2}+{K}_{1})({\sigma}_{{\Phi}_{t}}^{2}+{\sigma}_{{\Phi}_{t-1}}^{2}+{K}_{2})}

(2)

where {\mu}_{{\Phi}_{t}} and {\mu}_{{\Phi}_{t-1}} represent the luminance mean values of {\Phi}_{t}^{I} and {\Phi}_{t-1}^{I}, respectively; {\sigma}_{{\Phi}_{t}} and {\sigma}_{{\Phi}_{t-1}} their luminance standard deviations; and {\sigma}_{{\Phi}_{t(t-1)}} the luminance covariance between {\Phi}_{t}^{I} and {\Phi}_{t-1}^{I}. *K*_{1} and *K*_{2} are constants whose values can be determined according to the research work in [37]. The expressions for the mean, standard deviation, and covariance can also be found in [37].
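Equation (2) can be evaluated densely over the whole frame with box filters for the local means, variances, and covariance. The sketch below uses `scipy.ndimage.uniform_filter` for this; the default window size and the constants *K*_{1} = (0.01·255)² and *K*_{2} = (0.03·255)² are the common choices recommended in [37], assumed here for illustration.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def ssim_map(I_t, I_prev, L=11, K1=6.5025, K2=58.5225):
    """Per-pixel SSIM between consecutive frames (Eq. 2), computed over
    L x L windows. K1 = (0.01*255)^2 and K2 = (0.03*255)^2 are the usual
    choices for 8-bit luminance; L is an illustrative window size."""
    I_t = I_t.astype(np.float64)
    I_prev = I_prev.astype(np.float64)
    mu_t = uniform_filter(I_t, L)                        # local means
    mu_p = uniform_filter(I_prev, L)
    var_t = uniform_filter(I_t * I_t, L) - mu_t ** 2     # local variances
    var_p = uniform_filter(I_prev * I_prev, L) - mu_p ** 2
    cov = uniform_filter(I_t * I_prev, L) - mu_t * mu_p  # local covariance
    return ((2 * mu_t * mu_p + K1) * (2 * cov + K2)) / \
           ((mu_t ** 2 + mu_p ** 2 + K1) * (var_t + var_p + K2))
```

For identical adjacent frames the map is exactly 1 everywhere, which is why a threshold close to 1 can isolate the stationary part.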

Then an arbiter with threshold *A* divides the pixels of the input image frame *I*_{t} into a stationary part *I*_{s} and a rest part *I*_{r}. The classifier can be expressed as follows:

\left\{\begin{array}{l}p\in {I}_{s},\quad {p}_{\text{SSIM}}\ge A\\ p\in {I}_{r},\quad {p}_{\text{SSIM}}<A\end{array}\right.,\quad p:(i,j)\in {I}_{t}

(3)

*I*_{s} contains the stationary pixels with high SSIM values, which can be used directly to update the same pixel positions in *T* *C*_{SS}. *I*_{r} is composed of three parts: the part with changed luminance *P*_{lc}, the relatively moving part *P*_{rm}, and the actually moving part *P*_{am}. *P*_{lc} represents areas with similar scene structure but different luminance, which lowers the SSIM value. *P*_{rm} is the region that is moving in *I*_{t-1} and stationary in *I*_{t}. *P*_{am} denotes the image part that is moving in *I*_{t} and stationary in *I*_{t-1}. As shown in Figure 3c, *I*_{s} between Figure 3a,b is marked in black, the actually moving part *P*_{am} in red, the region with changed luminance *P*_{lc} in green, and the relatively moving area *P*_{rm} in blue. The first two kinds, *P*_{lc} and *P*_{rm}, can also be used to update *T* *C*_{SS} directly, whereas the third kind, *P*_{am}, must be excluded from *I*_{t}; the pixels in the same regions of *I*_{t-1} are used to update *T* *C*_{SS} instead. As shown in Figure 3e–g, the poster occluded by the man's hands in Figure 3e and the white board behind the man in Figure 3f are both preserved in Figure 3g. Given the corresponding depth maps *D*_{t} and *D*_{t-1}, the three image parts are defined as follows:

\left\{\begin{array}{l}p\in {P}_{\text{lc}},\quad \left|{\mu}_{t}^{D}-{\mu}_{t-1}^{D}\right|\le T\\ p\in {P}_{\text{rm}},\quad {\mu}_{t}^{D}-{\mu}_{t-1}^{D}<-T\\ p\in {P}_{\text{am}},\quad {\mu}_{t}^{D}-{\mu}_{t-1}^{D}>T\end{array}\right.,\quad p:(i,j)\in {I}_{r}

(4)

where {\mu}_{t}^{D} and {\mu}_{t-1}^{D} represent the average depth values of the square areas in *D*_{t} and *D*_{t-1}, respectively. The square neighborhoods have the same *L* × *L* window size as the SSIM computation in Equation (2) and are centered on pixel *p*. *T* is a constant threshold that defines the acceptable range of depth fluctuation, and |·| is the absolute-value function.

Then the information of the stationary scene shared by two adjacent frames can be extracted with the following equation:

\begin{array}{l}T{C}_{\text{SS}}\left(p\right)=\left\{\begin{array}{l}{I}_{t}\left(p\right),\quad p:(i,j)\in {I}_{s}\cup {P}_{\text{lc}}\cup {P}_{\text{rm}}\\ {I}_{t-1}\left(p\right),\quad p:(i,j)\in {P}_{\text{am}}\end{array}\right.\\ T{M}_{\text{SS}}\left(p\right)=\left\{\begin{array}{l}{D}_{t}\left(p\right),\quad p:(i,j)\in {I}_{s}\cup {P}_{\text{lc}}\cup {P}_{\text{rm}}\\ {D}_{t-1}\left(p\right),\quad p:(i,j)\in {P}_{\text{am}}\end{array}\right.\end{array}

(5)
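The arbiter of Equation (3), the depth test of Equation (4), and the composition of Equation (5) can be sketched together as below. Since only *P*_{am} triggers a fall-back to frame *t*-1, the sketch does not need to separate *P*_{lc} from *P*_{rm} explicitly; the threshold values `A`, `T`, `L` and the function name are illustrative choices, not from the paper.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def temporary_sprite(I_t, I_prev, D_t, D_prev, p_ssim, A=0.9, T=10, L=11):
    """Build the temporary sprite (Eqs. 3-5): stationary pixels and the
    P_lc / P_rm parts come from frame t; the actually-moving part P_am is
    filled from frame t-1 instead."""
    mu_Dt = uniform_filter(D_t.astype(np.float64), L)    # local mean depth, t
    mu_Dp = uniform_filter(D_prev.astype(np.float64), L) # local mean depth, t-1
    diff = mu_Dt - mu_Dp
    stationary = p_ssim >= A                  # I_s, Eq. (3)
    p_am = (~stationary) & (diff > T)         # moving in I_t, still in I_{t-1}
    TC_SS = np.where(p_am, I_prev, I_t)       # texture of temporary sprite
    TM_SS = np.where(p_am, D_prev, D_t)       # depth of temporary sprite
    return TC_SS, TM_SS
```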

Finally, the temporary sprite of the stationary scene (*T* *C*_{SS} and *T* *M*_{SS}) is used to update the global sprite (*C*_{SS} and *M*_{SS}). The update operation is described as follows:

\begin{array}{l}{C}_{\text{SS}}\left(p\right)=\left\{\begin{array}{l}T{C}_{\text{SS}}\left(p\right),\quad {\mu}_{\text{TM}}^{p}-{\mu}_{M}^{p}\le T\\ {C}_{\text{SS}}\left(p\right),\quad \text{otherwise}\end{array}\right.,\quad p:(i,j)\in {C}_{\text{SS}}\\ {M}_{\text{SS}}\left(p\right)=\left\{\begin{array}{l}T{M}_{\text{SS}}\left(p\right),\quad {\mu}_{\text{TM}}^{p}-{\mu}_{M}^{p}\le T\\ {M}_{\text{SS}}\left(p\right),\quad \text{otherwise}\end{array}\right.,\quad p:(i,j)\in {M}_{\text{SS}}\end{array}

(6)

where {\mu}_{\text{TM}}^{p} and {\mu}_{M}^{p} represent the average depth values of the square areas in *T* *M*_{SS} and *M*_{SS}, respectively. The square neighborhoods have the same *L* × *L* window size as the SSIM computation and are centered on pixel *p*. *T* is the same constant threshold defined in Equation (4). Figure 3d shows *T* *C*_{SS} of Figure 3b. Figure 3h,i show *C*_{SS} and *M*_{SS} of Figure 3b, respectively. Almost all the texture and depth information of the stationary scene is restored in Figure 3h,i.
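The update of Equation (6) can be sketched as a masked merge. With the depth convention 255 = nearest, the test {\mu}_{\text{TM}}^{p} - {\mu}_{M}^{p} ≤ *T* accepts a temporary-sprite pixel only when it is not markedly nearer to the camera than the stored background, so foreground objects do not overwrite accumulated background. The values of `T` and `L` are illustrative.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def update_global_sprite(C_SS, M_SS, TC_SS, TM_SS, T=10, L=11):
    """Update the global sprite (Eq. 6): accept a temporary-sprite pixel
    only if its local mean depth does not exceed the stored one by more
    than T (i.e. it is not clearly nearer than the kept background)."""
    mu_TM = uniform_filter(TM_SS.astype(np.float64), L)
    mu_M = uniform_filter(M_SS.astype(np.float64), L)
    accept = (mu_TM - mu_M) <= T              # per-pixel depth test of Eq. (6)
    C_new = np.where(accept, TC_SS, C_SS)     # merge texture components
    M_new = np.where(accept, TM_SS, M_SS)     # merge depth components
    return C_new, M_new
```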

So far, the background information that appeared in past frames is stored in *C*_{SS} and *M*_{SS}, which can be used to partly solve the disocclusion problem of the virtual view synthesis algorithm.