Figure 1 shows the framework of the proposed video background initialization and foreground segmentation approach for bootstrapping video sequences, which contains four major processing steps, namely, block representation, background updating, initial segmented foreground, and noise removal and shadow suppression with two morphological operations. In Figure 1, the input includes the current (gray-level) video frame It and the previous (gray-level) video frame It–1 of a bootstrapping video sequence, and the output includes the modeled background frame Bt and the segmented foreground frame Ft, where i denotes the frame number (index). Here, I
(x,y)
t, I
(x,y)
t−1, B
(x,y)
t, and F
(x,y)
t denote pixels (x,y) in It, It–1, Bt, and Ft, respectively. Each video frame is W × H (pixels) in size, and each video frame is partitioned into non-overlapping and equal-sized blocks of size N × N (pixels). Let (i,j) be the block index, where i = 0,1,2,…,(W/N) – 1 and j = 0,1,2,…,(H/N) – 1. Here, b
(i,j)
t = {I
(iN+a,jN+b)
t: a, b = 0, 1, 2,…,N − 1}, b
(i,j)
t−1 = {I
(iN+a,jN+b)
t−1: a, b = 0, 1, 2,…,N − 1}, and , denote blocks (i,j) in It, It–1, and Bt, respectively. In addition, let denote the initial modeled background frame and denote block (i,j) in .
2.1. Initial modeled background processing
As the illustrated example shown in Figure 2, a sequence of initial modeled background frames (t = 1,2,…) will be obtained in the initial modeled background processing procedure. At the beginning (t = 1), each block of size N × N (pixels) in is set to “undefined” (labeled in black), as shown in Figure 2l. Then, the initial modeled background frame (t = 2,3,…,19) is obtained based on the “updated” modeled background frame Bt–1 (see Section 2.3) and the block motion representation frame . Each block of size N × N (pixels) in is determined as either a “static” block (labeled in blue) or a “moving” block (labeled in red) by motion estimation (see Section 2.2) between two consecutive (gray-level) video frames It–1 and It of the bootstrapping video sequence, as shown in Figure 2g–k. For one “undefined” block in , if its corresponding block in is determined as a “static” block, i.e., its motion vector is (0,0), the “static” block in is duplicated from the corresponding block b
(i,j)
t in It. Then, each “static” block in will perform the background updating procedure (see Section 2.3) to obtain in Bt Otherwise, the “undefined” block in will remain as the “undefined” block in . That is, each “undefined” block will not participate the background updating procedure until is determined as a “static” block. As shown in Figure 2, each block in is determined by motion estimation between two consecutive (gray-level) video frames, It–1 and It, of the bootstrapping video sequence. Each block in the initial modeled background frame is based on B1 and , in which some blocks in are still “undefined” (labeled in black). The initial modeled background frame is obtained based on B2 and . , ,…, and in the illustrated example can similarly be obtained. Note that, in the illustrated example shown in Figure 2, each initial modeled background frame (t = 1,2,…,18) contains at least one “undefined” block.
Finally, as shown in Figure 2q, the initial modeled background frame contains no “undefined” block. Here, for the illustrated example shown in Figure 2, the performance index T
1(=19) is defined as the frame index for initial modeled background processing. Afterwards, the initial modeled background frame (t = 20,21,…) is duplicated from the “updated” modeled background frame Bt–1, i.e., (t = 20,21,…) [35].
2.2. Block representation
As the illustrated example shown in Figure 3, in the proposed block representation approach, each block of the current video frame It is classified into one of the four categories, namely, “background,” “still object,” “illumination change,” and “moving object.” In Figure 3b, each block of the block representation frame Rt for It is labeled in four different gray levels. The block representation frame Rt is obtained based on the two consecutive video frames, It and It–1, and the initial modeled background frame by the proposed block representation approach (as shown in Figure 4), in which motion estimation and correlation coefficient computation are used to perform block representation (classification).
Motion estimation is performed between the two consecutive video frames, It and It– 1 using a block matching algorithm so that each block in It is determined as either “static” or “moving.” In this study, the sum of absolute differences (SAD) is used as the cost function for block matching between block b
(i,j)
t in It and the corresponding block in It–1 and the search range for motion estimation is set to ±N/2 [35, 36]. For a block in It, if the minimum SAD, D
mv(u,v), for motion vector (u,v), is smaller than 90% of the SAD for the null-vector (0,0), D
mv(0,0), the block is determined as a “moving” block; otherwise, it is determined as a “static” block [19, 35].
On the other hand, the correlation coefficient C
B
(i, j) between block b
(i,j)
t in It and block in the initial modeled background frame is computed as
(1)
where μ
b
is the mean of the pixel values in block b. As shown in Figure 4, based on C
B
(i,j) and the threshold THCB a “static” block can be further classified into either a “background” block (if C
B
(i,j) ≥ THCB) or a “still object” block (otherwise), whereas a “moving” block can be further classified into either an “illumination change” block (if C
B
(i,j) ≥ THCB) or a “moving object” block (otherwise). Afterwards, four different block representations are obtained.
2.3. Background updating
By background updating, each block in the initial modeled background frame can be updated to obtain the corresponding block in the modeled background frame Bt as follows. Both the “background” and “illumination change” blocks are updated by temporal smoothing, i.e., block in Bt is updated as the linearly weighted sum of block in and block b
(i,j)
t in It. On the other hand, both the “still object” and “moving object” blocks are updated by block replacement.
(a) Background: the modeled background block in B t is updated by
(2)
where α, the updating weight, is empirically set to 0.9 in this study.
(b) Still object: the modeled background block in B t is updated by
(3)
where Count(i,j) is the number of times that b
(i,j)
t in It is successively determined as a “still object” block previously, and THstill is a threshold for the time duration (in terms of the number of frames) that a “still object” block will learn to be a “background” block. That is, if an object (or a block b
(i,j)
t in It) does not “move” for a sufficient time duration, it will become some part of the background. As the illustrated example shown in Figure 5, the marked block b
(11,13)
33 in I33 is detected as a “still object” block (in R33) for a sufficient time duration (THstill = 20). Then, its corresponding block in B33 will be updated (replaced) by b
(11,13)
33 in I33.
(c) Illumination change: the modeled background block in B t is similarly updated by Equation (2).
(d) Moving object: the modeled background block in B t is updated by
(4)
where SM(b
(i,j)
t) and denote the side-match measures for block b
(i,j)
t from It embedded in and that for block “embedded” in , respectively, as shown in Figure 6. The side-match measure (or the boundary match measure) [37, 38] is widely used in various image/video error concealment algorithms due to its good trade-off in complexity and visual quality. SM(b
(i,j)
t) is defined as the sum of squared differences between the boundary of the embedded block b
(i,j)
t from It and the boundaries of the four neighboring blocks
and in (Figure 6a), i.e.,
(5)
Similarly, is defined as the sum of squared differences between the boundary of block and the boundaries of its four neighboring blocks
and in (Figure 6b), i.e.,
(6)
Note that if a block in Rt is determined as a “moving object” block two times consecutively, the corresponding modeled background block in Bt is updated by Equation (4). The side-match measure uses the camouflage of each “moving object” block to search the more suitable modeled background block so that we can speed up the background updating procedure. As the illustrated example shown in Figure 7, two marked blocks b
(12,9) and b
(11,10) in both I12 and I13 are detected as two “moving object” blocks in both R12 and R13 consecutively. Thus, their corresponding blocks and in B13 will be updated (replaced) by blocks b
(12,9)
13 and b
(11,10)
13 in I13, respectively.
2.4. Initial segmented foreground
Based on the modeled background frame Bt performing background updating, as an illustrated example shown in Figure 8, the initial (binary) segmented foreground frame can be obtained as
(7)
where THisf is a threshold, which is empirically set to 15 in this study.
2.5. Noise removal and shadow suppression with two morphological operations
As shown in Figure 8, usually contains some fragmented (noisy) parts and shadows. To obtain the precise segmented foreground frame Ft, a noise removal and shadow suppression procedure is adopted, which combines the shadow suppression approach in [39] and the edge information extracted from It with being the (binary) operation mask.
Let be the S (saturation) component of the original video frame (frame t) represented in the HSV color space and be the gradient image of It using the Sobel operator [40] with being the (binary) operation mask. The segmented foreground frame is defined as
(8)
where ∩ and ∪ denote the logical AND and OR operators, respectively, is the standard deviation of , and TH
E
is a threshold. Here, TH
E
is empirically set to 120 in this study. Figure 9 shows an illustrated example performing the noise removal and shadow suppression procedure. By applying the shadow suppression approach in [39], the “second” (binary) segmented foreground frame (shown in Figure 9b) is obtained based on (shown in Figure 8c) and (shown in Figure 9a). Based on the “second” (binary) segmented foreground frame (shown in Figure 9b), combining the gradient image of It (shown in Figure 9c) preserving the edge information in the initial (binary) segmented foreground frame , the segmented foreground frame (shown in Figure 9d) is obtained by Equation (8). Finally, the final segmented foreground frame (shown in Figure 9e) is obtained as Ft with two morphological (erosion and dilation) operations [40].