Figure 1 shows the framework of the proposed video background initialization and foreground segmentation approach for bootstrapping video sequences, which contains four major processing steps, namely, block representation, background updating, initial segmented foreground, and noise removal and shadow suppression with two morphological operations. In Figure 1, the input includes the current (gray-level) video frame *I*^{t} and the previous (gray-level) video frame *I*^{t–1} of a bootstrapping video sequence, and the output includes the modeled background frame *B*^{t} and the segmented foreground frame *F*^{t}, where *t* denotes the frame number (index). Here, *I*_{(x,y)}^{t}, *I*_{(x,y)}^{t−1}, *B*_{(x,y)}^{t}, and *F*_{(x,y)}^{t} denote pixels (*x,y*) in *I*^{t}, *I*^{t–1}, *B*^{t}, and *F*^{t}, respectively. Each video frame is *W* × *H* (pixels) in size and is partitioned into non-overlapping, equal-sized blocks of size *N* × *N* (pixels). Let (*i,j*) be the block index, where *i* = 0,1,2,…,(*W/N*) – 1 and *j* = 0,1,2,…,(*H/N*) – 1. Here, **b**_{(i,j)}^{t} = {*I*_{(iN+a,jN+b)}^{t}: *a*, *b* = 0, 1, 2,…,*N* − 1}, **b**_{(i,j)}^{t−1} = {*I*_{(iN+a,jN+b)}^{t−1}: *a*, *b* = 0, 1, 2,…,*N* − 1}, and {\tilde{b}}_{\left(i,j\right)}^{t}=\left\{{B}_{\left(\mathit{iN}+a,\mathit{jN}+b\right)}^{t}:a,b=0,1,2,\dots ,N-1\right\} denote blocks (*i,j*) in *I*^{t}, *I*^{t–1}, and *B*^{t}, respectively. In addition, let {\widehat{B}}^{t} denote the initial modeled background frame and {\widehat{b}}_{\left(i,j\right)}^{t}=\left\{{\widehat{B}}_{\left(\mathit{iN}+a,\mathit{jN}+b\right)}^{t}:a,b=0,1,2,\dots ,N-1\right\} denote block (*i,j*) in {\widehat{B}}^{t}.
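The block partitioning and indexing above can be sketched in a few lines of NumPy; this is a minimal illustration, and the frame size, block size, and helper name are example assumptions, not values from the paper:

```python
import numpy as np

def get_block(frame, i, j, N):
    """Return block (i, j) of an H x W gray-level frame, i.e. the
    N x N pixels {frame[jN+b, iN+a] : a, b = 0..N-1}.
    Rows correspond to y and columns to x in the paper's (x, y) notation."""
    return frame[j * N:(j + 1) * N, i * N:(i + 1) * N]

# Example values (assumptions): a W x H frame split into N x N blocks.
W, H, N = 320, 240, 16
frame = np.zeros((H, W), dtype=np.uint8)
num_blocks = (W // N) * (H // N)  # i = 0..W/N-1, j = 0..H/N-1
```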

### 2.1. Initial modeled background processing

As illustrated by the example in Figure 2, a sequence of initial modeled background frames {\widehat{B}}^{t} (*t* = 1,2,…) is obtained in the initial modeled background processing procedure. At the beginning (*t* = 1), each block {\widehat{b}}_{\left(i,j\right)}^{1} of size *N* × *N* (pixels) in {\widehat{B}}^{1} is set to “undefined” (labeled in black), as shown in Figure 2l. Then, the initial modeled background frame {\widehat{B}}^{t} (*t* = 2,3,…,19) is obtained based on the “updated” modeled background frame *B*^{t–1} (see Section 2.3) and the block motion representation frame {\widehat{R}}^{t}. Each block of size *N* × *N* (pixels) in {\widehat{R}}^{t} is determined as either a “static” block (labeled in blue) or a “moving” block (labeled in red) by motion estimation (see Section 2.2) between two consecutive (gray-level) video frames *I*^{t–1} and *I*^{t} of the bootstrapping video sequence, as shown in Figure 2g–k. For an “undefined” block {\widehat{b}}_{\left(i,j\right)}^{t-1} in {\widehat{B}}^{t-1}, if its corresponding block in {\widehat{R}}^{t} is determined as a “static” block, i.e., its motion vector is (0,0), the “static” block {\widehat{b}}_{\left(i,j\right)}^{t} in {\widehat{B}}^{t} is duplicated from the corresponding block **b**_{(i,j)}^{t} in *I*^{t}. Then, each “static” block {\widehat{b}}_{\left(i,j\right)}^{t} in {\widehat{B}}^{t} undergoes the background updating procedure (see Section 2.3) to obtain {\tilde{b}}_{\left(i,j\right)}^{t} in *B*^{t}. Otherwise, the “undefined” block {\widehat{b}}_{\left(i,j\right)}^{t-1} in {\widehat{B}}^{t-1} remains the “undefined” block {\widehat{b}}_{\left(i,j\right)}^{t} in {\widehat{B}}^{t}. That is, each “undefined” block {\widehat{b}}_{\left(i,j\right)}^{t} does not participate in the background updating procedure until it is determined as a “static” block. As shown in Figure 2, each block in {\widehat{R}}^{t} is determined by motion estimation between two consecutive (gray-level) video frames, *I*^{t–1} and *I*^{t}, of the bootstrapping video sequence. Each block in the initial modeled background frame {\widehat{B}}^{2} is based on *B*^{1} and {\widehat{R}}^{2}, in which some blocks in {\widehat{B}}^{2} are still “undefined” (labeled in black). The initial modeled background frame {\widehat{B}}^{3} is obtained based on *B*^{2} and {\widehat{R}}^{3}; {\widehat{R}}^{4}, {\widehat{R}}^{5},…, and {\widehat{R}}^{19} in the illustrated example can be obtained similarly. Note that, in the illustrated example shown in Figure 2, each initial modeled background frame {\widehat{B}}^{t} (*t* = 1,2,…,18) contains at least one “undefined” block.

Finally, as shown in Figure 2q, the initial modeled background frame {\widehat{B}}^{19} contains no “undefined” block. Here, for the illustrated example shown in Figure 2, the performance index *T*_{1} (= 19) is defined as the frame index for initial modeled background processing. Afterwards, the initial modeled background frame {\widehat{B}}^{t} (*t* = 20,21,…) is duplicated from the “updated” modeled background frame *B*^{t–1}, i.e., {\widehat{B}}^{t}={B}^{t-1} (*t* = 20,21,…) [35].
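One step of the fill-in rule for “undefined” blocks might be sketched as follows; representing “undefined” pixels with NaN, the function name, and the mask layout are all assumptions made for illustration:

```python
import numpy as np

def init_background_step(B_hat, frame, static_mask, N):
    """One step of initial modeled background processing: every block
    that is still 'undefined' (all-NaN) and whose block in the motion
    representation frame is 'static' is duplicated from the current
    frame; other 'undefined' blocks stay undefined.
    static_mask[j, i] is True if block (i, j) is 'static'."""
    out = B_hat.copy()
    n_j, n_i = static_mask.shape
    for j in range(n_j):
        for i in range(n_i):
            sl = np.s_[j * N:(j + 1) * N, i * N:(i + 1) * N]
            if np.isnan(out[sl]).all() and static_mask[j, i]:
                out[sl] = frame[sl]
    return out
```

Once a block is filled, it would then enter the background updating procedure of Section 2.3 instead of this rule.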

### 2.2. Block representation

As illustrated by the example in Figure 3, in the proposed block representation approach, each block of the current video frame *I*^{t} is classified into one of four categories, namely, “background,” “still object,” “illumination change,” and “moving object.” In Figure 3b, each block of the block representation frame *R*^{t} for *I*^{t} is labeled in one of four different gray levels. The block representation frame *R*^{t} is obtained based on the two consecutive video frames, *I*^{t} and *I*^{t–1}, and the initial modeled background frame {\widehat{B}}^{t} by the proposed block representation approach (as shown in Figure 4), in which motion estimation and correlation coefficient computation are used to perform block representation (classification).

Motion estimation is performed between the two consecutive video frames, *I*^{t} and *I*^{t–1}, using a block matching algorithm so that each block in *I*^{t} is determined as either “static” or “moving.” In this study, the sum of absolute differences (SAD) is used as the cost function for block matching between block **b**_{(i,j)}^{t} in *I*^{t} and the corresponding block in *I*^{t–1}, and the search range for motion estimation is set to ±*N*/2 [35, 36]. For a block in *I*^{t}, if the minimum SAD, *D*_{mv(u,v)}, for motion vector (*u,v*), is smaller than 90% of the SAD for the null vector (0,0), *D*_{mv(0,0)}, the block is determined as a “moving” block; otherwise, it is determined as a “static” block [19, 35].
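The SAD-based static/moving decision might be sketched as below; an exhaustive full search over the ±*N*/2 range is assumed here, since the paper does not name its search strategy:

```python
import numpy as np

def classify_block_motion(prev, curr, i, j, N):
    """Block matching with SAD over a +/- N/2 search range.
    The block is 'moving' if the minimum SAD over all candidate
    motion vectors is smaller than 90% of the SAD for the
    null vector (0, 0); otherwise it is 'static'."""
    H, W = curr.shape
    y0, x0 = j * N, i * N
    block = curr[y0:y0 + N, x0:x0 + N].astype(np.int64)
    r = N // 2
    sad0, best = None, None
    for v in range(-r, r + 1):
        for u in range(-r, r + 1):
            ys, xs = y0 + v, x0 + u
            if ys < 0 or xs < 0 or ys + N > H or xs + N > W:
                continue  # candidate falls outside the frame
            cand = prev[ys:ys + N, xs:xs + N].astype(np.int64)
            sad = int(np.abs(block - cand).sum())
            if u == 0 and v == 0:
                sad0 = sad
            if best is None or sad < best:
                best = sad
    return "moving" if best < 0.9 * sad0 else "static"
```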

On the other hand, the correlation coefficient *C*_{B}(*i,j*) between block **b**_{(i,j)}^{t} in *I*^{t} and block {\widehat{b}}_{\left(i,j\right)}^{t} in the initial modeled background frame {\widehat{B}}^{t} is computed as

{C}_{B}\left(i,j\right)=\frac{{\displaystyle \sum}\left({\mathbf{b}}_{\left(i,j\right)}^{t}-{\mu}_{{\mathbf{b}}_{\left(i,j\right)}^{t}}\right)\left({\widehat{\mathbf{b}}}_{\left(i,j\right)}^{t}-{\mu}_{{\widehat{\mathbf{b}}}_{\left(i,j\right)}^{t}}\right)}{\sqrt{{\displaystyle \sum}{\left({\mathbf{b}}_{\left(i,j\right)}^{t}-{\mu}_{{\mathbf{b}}_{\left(i,j\right)}^{t}}\right)}^{2}}\times \sqrt{{\displaystyle \sum}{\left({\widehat{\mathbf{b}}}_{\left(i,j\right)}^{t}-{\mu}_{{\widehat{\mathbf{b}}}_{\left(i,j\right)}^{t}}\right)}^{2}}}

(1)

where *μ*_{b} is the mean of the pixel values in block **b**. As shown in Figure 4, based on *C*_{B}(*i,j*) and the threshold TH_{CB}, a “static” block can be further classified into either a “background” block (if *C*_{B}(*i,j*) ≥ TH_{CB}) or a “still object” block (otherwise), whereas a “moving” block can be further classified into either an “illumination change” block (if *C*_{B}(*i,j*) ≥ TH_{CB}) or a “moving object” block (otherwise). Afterwards, four different block representations are obtained.
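The correlation coefficient of Equation (1) and the four-way decision of Figure 4 can be sketched as follows; the helper names and the placeholder value TH_CB = 0.7 are assumptions, since the paper does not give the threshold value here:

```python
import numpy as np

def block_correlation(b, b_hat):
    """Pearson correlation coefficient C_B between a frame block and
    the co-located block of the initial modeled background (Eq. 1)."""
    b = b.astype(np.float64).ravel()
    h = b_hat.astype(np.float64).ravel()
    db, dh = b - b.mean(), h - h.mean()
    denom = np.sqrt((db * db).sum()) * np.sqrt((dh * dh).sum())
    return (db * dh).sum() / denom if denom > 0 else 1.0

def classify(motion, c_b, th_cb=0.7):
    """Four-way block representation (Figure 4); motion is the
    'static'/'moving' label from block matching."""
    if motion == "static":
        return "background" if c_b >= th_cb else "still object"
    return "illumination change" if c_b >= th_cb else "moving object"
```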

### 2.3. Background updating

By background updating, each block {\widehat{b}}_{\left(i,j\right)}^{t} in the initial modeled background frame {\widehat{B}}^{t} can be updated to obtain the corresponding block {\tilde{b}}_{\left(i,j\right)}^{t} in the modeled background frame *B*^{t} as follows. Both the “background” and “illumination change” blocks are updated by temporal smoothing, i.e., block {\tilde{b}}_{\left(i,j\right)}^{t} in *B*^{t} is updated as the linearly weighted sum of block {\widehat{b}}_{\left(i,j\right)}^{t} in {\widehat{B}}^{t} and block **b**_{(i,j)}^{t} in *I*^{t}. On the other hand, both the “still object” and “moving object” blocks are updated by block replacement.

(a) *Background*: the modeled background block {\tilde{b}}_{\left(i,j\right)}^{t} in *B* ^{t} is updated by

{\tilde{\mathbf{b}}}_{\left(i,j\right)}^{t}=\alpha \cdot {\widehat{\mathbf{b}}}_{\left(i,j\right)}^{t}+\left(1-\alpha \right)\cdot {\mathbf{b}}_{\left(i,j\right)}^{t}

(2)

where α, the updating weight, is empirically set to 0.9 in this study.

(b) *Still object*: the modeled background block {\tilde{b}}_{\left(i,j\right)}^{t} in *B* ^{t} is updated by

{\tilde{\mathbf{b}}}_{\left(i,j\right)}^{t}=\{\begin{array}{ll}{\mathbf{b}}_{\left(i,j\right)}^{t}, & \mathrm{if}\ {\mathrm{Count}}_{\left(i,j\right)}\ge {\mathrm{TH}}_{\mathrm{still}},\\ {\widehat{\mathbf{b}}}_{\left(i,j\right)}^{t}, & \mathrm{otherwise},\end{array}

(3)

where Count_{(i,j)} is the number of times that **b**_{(i,j)}^{t} in *I*^{t} has been consecutively determined as a “still object” block, and TH_{still} is a threshold for the time duration (in terms of the number of frames) after which a “still object” block is learned as a “background” block. That is, if an object (or a block **b**_{(i,j)}^{t} in *I*^{t}) does not “move” for a sufficiently long duration, it becomes part of the background. As illustrated by the example in Figure 5, the marked block **b**_{(11,13)}^{33} in *I*^{33} has been detected as a “still object” block (in *R*^{33}) for a sufficient time duration (TH_{still} = 20). Then, its corresponding block {\tilde{b}}_{\left(11,13\right)}^{33} in *B*^{33} is updated (replaced) by **b**_{(11,13)}^{33} in *I*^{33}.

(c) *Illumination change*: the modeled background block {\tilde{b}}_{\left(i,j\right)}^{t} in *B* ^{t} is similarly updated by Equation (2).

(d) *Moving object*: the modeled background block {\tilde{b}}_{\left(i,j\right)}^{t} in *B* ^{t} is updated by

{\tilde{\mathbf{b}}}_{\left(i,j\right)}^{t}=\{\begin{array}{ll}{\mathbf{b}}_{\left(i,j\right)}^{t}, & \mathrm{if}\ \mathrm{SM}\left({\mathbf{b}}_{\left(i,j\right)}^{t}\right)<\mathrm{SM}\left({\widehat{\mathbf{b}}}_{\left(i,j\right)}^{t}\right),\\ {\widehat{\mathbf{b}}}_{\left(i,j\right)}^{t}, & \mathrm{otherwise},\end{array}

(4)

where \mathrm{SM}\left({\mathbf{b}}_{\left(i,j\right)}^{t}\right) and \mathrm{SM}\left({\widehat{b}}_{\left(i,j\right)}^{t}\right) denote the side-match measures for block **b**_{(i,j)}^{t} from *I*^{t} “embedded” in {\widehat{B}}^{t} and for block {\widehat{b}}_{\left(i,j\right)}^{t} “embedded” in {\widehat{B}}^{t}, respectively, as shown in Figure 6. The side-match measure (or boundary match measure) [37, 38] is widely used in various image/video error concealment algorithms due to its good trade-off between complexity and visual quality. SM(**b**_{(i,j)}^{t}) is defined as the sum of squared differences between the boundary of the embedded block **b**_{(i,j)}^{t} from *I*^{t} and the boundaries of the four neighboring blocks {\widehat{b}}_{\left(i-1,j\right)}^{t}, {\widehat{b}}_{\left(i+1,j\right)}^{t}, {\widehat{b}}_{\left(i,j-1\right)}^{t}, and {\widehat{b}}_{\left(i,j+1\right)}^{t} in {\widehat{B}}^{t} (Figure 6a), i.e.,

\begin{array}{l}\mathrm{SM}\left({\mathbf{b}}_{\left(i,j\right)}^{t}\right)={\displaystyle \sum _{b=0}^{N-1}{\left({\widehat{B}}_{\left(\mathit{iN}-1,\mathit{jN}+b\right)}^{t}-{I}_{\left(\mathit{iN},\mathit{jN}+b\right)}^{t}\right)}^{2}}\\ \phantom{\rule{6em}{0ex}}+{\displaystyle \sum _{b=0}^{N-1}{\left({\widehat{B}}_{\left(\mathit{iN}+N,\mathit{jN}+b\right)}^{t}-{I}_{\left(\mathit{iN}+N-1,\mathit{jN}+b\right)}^{t}\right)}^{2}}\\ \phantom{\rule{6em}{0ex}}+{\displaystyle \sum _{a=0}^{N-1}{\left({\widehat{B}}_{\left(\mathit{iN}+a,\mathit{jN}-1\right)}^{t}-{I}_{\left(\mathit{iN}+a,\mathit{jN}\right)}^{t}\right)}^{2}}\\ \phantom{\rule{6em}{0ex}}+{\displaystyle \sum _{a=0}^{N-1}{\left({\widehat{B}}_{\left(\mathit{iN}+a,\mathit{jN}+N\right)}^{t}-{I}_{\left(\mathit{iN}+a,\mathit{jN}+N-1\right)}^{t}\right)}^{2}}.\end{array}

(5)

Similarly, \mathrm{SM}\left({\widehat{b}}_{\left(i,j\right)}^{t}\right) is defined as the sum of squared differences between the boundary of block {\widehat{b}}_{\left(i,j\right)}^{t} and the boundaries of its four neighboring blocks {\widehat{b}}_{\left(i-1,j\right)}^{t}, {\widehat{b}}_{\left(i+1,j\right)}^{t}, {\widehat{b}}_{\left(i,j-1\right)}^{t}, and {\widehat{b}}_{\left(i,j+1\right)}^{t} in {\widehat{B}}^{t} (Figure 6b), i.e.,

\begin{array}{l}\mathrm{SM}\left({\widehat{\mathbf{b}}}_{\left(i,j\right)}^{t}\right)={\displaystyle \sum _{b=0}^{N-1}{\left({\widehat{B}}_{\left(\mathit{iN}-1,\mathit{jN}+b\right)}^{t}-{\widehat{B}}_{\left(\mathit{iN},\mathit{jN}+b\right)}^{t}\right)}^{2}}\\ \phantom{\rule{6em}{0ex}}+{\displaystyle \sum _{b=0}^{N-1}{\left({\widehat{B}}_{\left(\mathit{iN}+N,\mathit{jN}+b\right)}^{t}-{\widehat{B}}_{\left(\mathit{iN}+N-1,\mathit{jN}+b\right)}^{t}\right)}^{2}}\\ \phantom{\rule{6em}{0ex}}+{\displaystyle \sum _{a=0}^{N-1}{\left({\widehat{B}}_{\left(\mathit{iN}+a,\mathit{jN}-1\right)}^{t}-{\widehat{B}}_{\left(\mathit{iN}+a,\mathit{jN}\right)}^{t}\right)}^{2}}\\ \phantom{\rule{6em}{0ex}}+{\displaystyle \sum _{a=0}^{N-1}{\left({\widehat{B}}_{\left(\mathit{iN}+a,\mathit{jN}+N\right)}^{t}-{\widehat{B}}_{\left(\mathit{iN}+a,\mathit{jN}+N-1\right)}^{t}\right)}^{2}}.\end{array}

(6)

Note that if a block in *R*^{t} is determined as a “moving object” block twice consecutively, the corresponding modeled background block {\tilde{b}}_{\left(i,j\right)}^{t} in *B*^{t} is updated by Equation (4). The side-match measure exploits the possible camouflage of each “moving object” block to select the more suitable modeled background block, which speeds up the background updating procedure. As illustrated by the example in Figure 7, the two marked blocks **b**_{(12,9)} and **b**_{(11,10)} in both *I*^{12} and *I*^{13} are detected as “moving object” blocks in both *R*^{12} and *R*^{13} consecutively. Thus, their corresponding blocks {\tilde{b}}_{\left(12,9\right)}^{13} and {\tilde{b}}_{\left(11,10\right)}^{13} in *B*^{13} are updated (replaced) by blocks **b**_{(12,9)}^{13} and **b**_{(11,10)}^{13} in *I*^{13}, respectively.
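Putting the four cases together, the per-block updating rule might be sketched as follows. Only α = 0.9 and TH_still = 20 come from the text; the helper names, the dispatch structure, and skipping the side-match terms at frame borders (the paper does not describe its border handling) are assumptions:

```python
import numpy as np

def side_match(B_hat, patch, i, j, N):
    """Eqs. (5)-(6): sum of squared differences between the boundary
    pixels of `patch` embedded at block (i, j) and the adjacent
    rows/columns of its four neighbouring blocks in B_hat."""
    H, W = B_hat.shape
    y0, x0 = j * N, i * N
    p, Bf = patch.astype(float), B_hat.astype(float)
    sm = 0.0
    if y0 > 0:                            # top neighbour
        sm += ((Bf[y0 - 1, x0:x0 + N] - p[0, :]) ** 2).sum()
    if y0 + N < H:                        # bottom neighbour
        sm += ((Bf[y0 + N, x0:x0 + N] - p[-1, :]) ** 2).sum()
    if x0 > 0:                            # left neighbour
        sm += ((Bf[y0:y0 + N, x0 - 1] - p[:, 0]) ** 2).sum()
    if x0 + N < W:                        # right neighbour
        sm += ((Bf[y0:y0 + N, x0 + N] - p[:, -1]) ** 2).sum()
    return sm

def update_block(label, b_hat, b, B_hat, i, j, N, count=0,
                 alpha=0.9, th_still=20):
    """Dispatch on the block representation: Eq. (2) temporal
    smoothing for 'background'/'illumination change', Eq. (3)
    replacement for a long-lived 'still object', and the Eq. (4)
    side-match test for 'moving object'."""
    if label in ("background", "illumination change"):
        return alpha * b_hat + (1 - alpha) * b
    if label == "still object":
        return b if count >= th_still else b_hat
    # 'moving object': keep whichever block matches its surroundings better
    if side_match(B_hat, b, i, j, N) < side_match(B_hat, b_hat, i, j, N):
        return b
    return b_hat
```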

### 2.4. Initial segmented foreground

Based on the modeled background frame *B*^{t} obtained by background updating, as illustrated by the example in Figure 8, the initial (binary) segmented foreground frame {\widehat{F}}^{t} can be obtained as

{\widehat{F}}_{\left(x,y\right)}^{t}=\{\begin{array}{ll}1, & \mathrm{if}\ {I}_{\left(x,y\right)}^{t}-{B}_{\left(x,y\right)}^{t}\ge {\mathrm{TH}}_{\mathrm{isf}},\\ 0, & \mathrm{otherwise},\end{array}

(7)

where TH_{isf} is a threshold, which is empirically set to 15 in this study.
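Equation (7) reduces to a one-line NumPy thresholding; this sketch follows the paper's signed frame difference literally, and the function name is an assumption:

```python
import numpy as np

def initial_foreground(frame, background, th_isf=15):
    """Eq. (7): pixel-wise thresholding of the difference between
    the current frame I^t and the modeled background B^t, with
    TH_isf = 15 as in the paper."""
    diff = frame.astype(np.int64) - background.astype(np.int64)
    return (diff >= th_isf).astype(np.uint8)
```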

### 2.5. Noise removal and shadow suppression with two morphological operations

As shown in Figure 8, {\widehat{F}}^{t} usually contains some fragmented (noisy) parts and shadows. To obtain the precise segmented foreground frame *F*^{t}, a noise removal and shadow suppression procedure is adopted, which combines the shadow suppression approach in [39] and the edge information extracted from *I*^{t} with {\widehat{F}}^{t} being the (binary) operation mask.

Let {\widehat{F}}_{S}^{t} be the *S* (saturation) component of the original video frame (frame *t*) represented in the HSV color space and {\widehat{F}}_{E}^{t} be the gradient image of *I*^{t} using the Sobel operator [40] with {\widehat{F}}^{t} being the (binary) operation mask. The segmented foreground frame {\overline{F}}^{t} is defined as

{\overline{F}}^{t}=\{\begin{array}{ll}1, & \mathrm{if}\ \left({\widehat{F}}^{t}\cap \left({\widehat{F}}_{S}^{t}\ge {\sigma}_{{\widehat{F}}_{S}^{t}}\right)\right)\cup \left({\widehat{F}}_{E}^{t}\ge {\mathrm{TH}}_{E}\right),\\ 0, & \mathrm{otherwise},\end{array}

(8)

where ∩ and ∪ denote the logical AND and OR operators, respectively, {\sigma}_{{\widehat{F}}_{S}^{t}} is the standard deviation of {\widehat{F}}_{S}^{t}, and TH_{E} is a threshold, which is empirically set to 120 in this study. Figure 9 shows an illustrated example of the noise removal and shadow suppression procedure. By applying the shadow suppression approach in [39], the “second” (binary) segmented foreground frame (shown in Figure 9b) is obtained based on {\widehat{F}}^{t} (shown in Figure 8c) and {\widehat{F}}_{S}^{t}\ge {\sigma}_{{\widehat{F}}_{S}^{t}} (shown in Figure 9a). Then, combining the “second” (binary) segmented foreground frame (shown in Figure 9b) with the gradient image {\widehat{F}}_{E}^{t} of *I*^{t} (shown in Figure 9c), which preserves the edge information in the initial (binary) segmented foreground frame {\widehat{F}}^{t}, the segmented foreground frame {\overline{F}}^{t} (shown in Figure 9d) is obtained by Equation (8). Finally, the final segmented foreground frame *F*^{t} (shown in Figure 9e) is obtained by applying two morphological (erosion and dilation) operations [40].
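Equation (8) and the final morphological clean-up might be sketched as follows; the helper names and the 3 × 3 structuring element are assumptions (the paper does not specify the structuring element of its erosion and dilation):

```python
import numpy as np

def refine_foreground(F_hat, F_s, F_e, th_e=120):
    """Eq. (8): a pixel survives if it is initial foreground with
    saturation at or above the std-dev of the saturation image
    (shadow suppression), or it lies on a strong edge of I^t."""
    sigma = F_s.std()
    keep = (F_hat.astype(bool) & (F_s >= sigma)) | (F_e >= th_e)
    return keep.astype(np.uint8)

def morph3x3(mask, mode):
    """Minimal 3x3 binary erosion/dilation for the final clean-up.
    Padding is foreground for erosion and background for dilation."""
    pad_val = mode == "erode"
    p = np.pad(mask.astype(bool), 1, constant_values=pad_val)
    h, w = mask.shape
    shifts = [p[1 + dy:1 + dy + h, 1 + dx:1 + dx + w]
              for dy in (-1, 0, 1) for dx in (-1, 0, 1)]
    out = (np.logical_and.reduce(shifts) if mode == "erode"
           else np.logical_or.reduce(shifts))
    return out.astype(np.uint8)
```

For example, `morph3x3(morph3x3(F_bar, "erode"), "dilate")` performs a morphological opening that removes isolated noise pixels from the segmented foreground.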