
Low-complexity background subtraction based on spatial similarity

Abstract

Robust detection of moving objects from video sequences is an important task in machine vision systems and applications. To detect moving objects, accurate background subtraction is essential. In real environments, due to complex and various background types, background subtraction is a challenging task. In this paper, we propose a pixel-based background subtraction method based on spatial similarity. The main difficulties of background subtraction include various background changes, shadows, and objects similar in color to background areas. In order to address these problems, we first computed the spatial similarity using the structural similarity method (SSIM). Spatial similarity is an effective way of eliminating shadows and detecting objects similar to the background areas. With spatial similarity, we roughly eliminated most background pixels such as shadows and moving background areas, while preserving objects that are similar to the background regions. Finally, the remaining pixels were classified as background pixels and foreground pixels using density estimation. Previous methods based on density estimation required high computational complexity. However, by selecting the minimum number of features and deleting most background pixels, we were able to significantly reduce the level of computational complexity. We compared our method with some existing background modeling methods. The experimental results show that the proposed method produced more accurate and stable results.

Introduction

As security monitoring emerges as an important issue, there has been an increasing demand for intelligent surveillance systems. Key operations in intelligent surveillance include object tracking, abnormal behavior detection, and behavior understanding, and accurate background subtraction plays an important role in all of them. The goal of background subtraction is to eliminate background components and detect meaningful moving objects. In real environments, background subtraction is a difficult task due to various and complex background types such as moving escalators, waving tree branches, water fountains, and flickering monitors. Researchers have addressed these problems by using background modeling. Simple background models assume static background images. Background components can generally be eliminated by computing the difference between an input image and a background image modeled using averaging, low-pass filtering, or median filtering [1–4]. For instance, in [1], the median background image was used to subtract the background components. Since temporal median filtering is time-consuming, a fast algorithm utilizing the characteristics of adjacent frames was proposed [2]. Cheng et al. applied a recursive mean procedure to compute background images [3]. In [4], low-pass filtering was utilized to estimate a static background image. However, these approaches cannot handle dynamic backgrounds and are sensitive to threshold values.

In order to handle various background types, statistical approaches were introduced. Among these approaches, Gaussian modeling methods have been widely used. Initially, a uni-modal distribution was used to model pixel values [5]. In [6], a background subtraction method using the HSV color space was presented based on single Gaussian modeling. A fast and stable linear discriminant approach based on uni-modal distribution and Markov random fields was proposed in [7]. Rambabu and Woo proposed a background subtraction method based on single Gaussian modeling that is robust against noise and changing illumination [8]. Although these models have low complexity and produce satisfactory performance in controlled backgrounds, it is difficult to use them for dynamic scenes. The Gaussian mixture model (GMM) is usually used to model various background types. Stauffer and Grimson used the GMM for background subtraction in [9], and it is still a popular method for background subtraction [10–20]. A spatio-temporal GMM (STGMM) was proposed to handle complex backgrounds [10]. Using a GMM, a statistical framework was investigated to localize a foreground object [11], and a dynamic background was modeled for highly dynamic conditions such as active cameras and high motion activity in background regions [12]. Also, the subtraction of two Gaussian kernels (difference of Gaussians) was used to eliminate background regions on embedded platforms [13]. A general framework of regularized online classification EM for GMM was proposed [14]. Wang et al. proposed an adaptive local-patch GMM to detect moving objects in dynamic background regions [15]. In [16], a new update algorithm was proposed for learning adaptive mixture models, and Bin et al. proposed a self-adaptive moving object detection algorithm that improved the original GMM in order to adapt to sudden or gradual illumination changes [17]. In [18], a new rate control method based on high-level feedback was developed to improve GMM performance. An improved adaptive-K GMM method was presented for updating background regions [19], and a GMM was used for modeling background regions in a Bayer-pattern domain [20]. A disadvantage of these multimodal Gaussian modeling methods is that they require pre-defined parameters such as the number of Gaussian distributions and their standard deviations. Also, dynamic backgrounds cannot be accurately modeled by a few Gaussian distributions. To overcome the limitations of parametric background modeling, nonparametric techniques have been developed for estimating background probabilities. Nonparametric background modeling methods estimate the background distribution from pixel values observed in the past. In [21], the Gaussian kernel was used for pixel-based background modeling. This nonparametric approach can handle multiple modes of dynamic backgrounds without pre-defined parameters. However, these nonparametric methods use kernel density estimation (KDE), which requires heavy computational complexity and a large amount of memory. Various efforts have been made to address these problems. Using Parzen density estimation and foreground object detection, a fast estimation method was presented [22], and an automatic background modeling method based on multivariate non-parametric KDE was proposed [23]. In [24], a non-parametric method was proposed for foreground and background modeling which did not require any initialization. Han et al. proposed an efficient algorithm for recursive density approximation based on density mode propagation [25]. Also, depth information, on-line auto-regressive modeling, and Gaussian family distributions were used to eliminate background regions [26–28]. In [29], a new object segmentation method was proposed based on recursive KDE; it used the mean-shift method to approximate the local maximum of the density function. In [30], the background was modeled using real-time KDE based on online histogram learning.

Alternative approaches have also been proposed based on neural network techniques or the support vector machine (SVM) method [31–35]. A method based on self-organization through artificial neural networks was proposed in [31]. Furthermore, the self-organization method was combined with a fuzzy approach to update the background [32]. In [33–35], automatic algorithms were proposed to perform background modeling using SVM.

To develop a robust model with low complexity, we used a pixel-based background subtraction method based on spatial similarity computed using the structural similarity (SSIM) method [36]. Using spatial similarity, we measured pixel similarity and eliminated background pixels. The remaining pixels were classified as either background or foreground pixels using KDE. Since we eliminated most background pixels and used only two features for KDE, the complexity of the proposed method was significantly reduced. The proposed method was evaluated using two datasets (Wallflower's and Li's datasets) and compared favorably with some existing methods.

The overall algorithm for efficient background subtraction

Preparation

The structural similarity for eliminating background components

To eliminate background components while preserving potential foreground components, we first computed the spatial similarity using the SSIM method, which was developed for image quality assessment [36]. The SSIM was computed as follows:

$$
\begin{aligned}
\text{Luminance:}\quad & l(x,y) = \frac{2\mu_x\mu_y + C_1}{\mu_x^2 + \mu_y^2 + C_1}\\
\text{Contrast:}\quad & c(x,y) = \frac{2\sigma_x\sigma_y + C_2}{\sigma_x^2 + \sigma_y^2 + C_2}\\
\text{Structure:}\quad & s(x,y) = \frac{\sigma_{xy} + C_3}{\sigma_x\sigma_y + C_3}\\
\text{SSIM}(x,y) &= l(x,y)^{\alpha}\, c(x,y)^{\beta}\, s(x,y)^{\gamma} = \frac{(2\mu_x\mu_y + C_1)(2\sigma_{xy} + C_2)}{(\mu_x^2 + \mu_y^2 + C_1)(\sigma_x^2 + \sigma_y^2 + C_2)}
\end{aligned}
$$
(1)

where α, β, and γ are parameters that determine the relative importance of l(x, y), c(x, y), and s(x, y); we set α, β, and γ to 1. μ_x and μ_y are the local means, σ_x and σ_y are the local standard deviations, and σ_xy is the local covariance between regions x and y. C3 was set to C2/2, and C1 and C2 are constants that were set to 6.5025 and 58.5225, as proposed in [36]. In Equation 1, l, c, and s represent the luminance, contrast, and structure components of the two images. In this paper, we computed the SSIM for local regions (e.g., a 3 × 3 block) to eliminate background components. Figure 1a and b show the input and reference background images, respectively. Figure 1c and g show the intensity and hue difference images between Figure 1a and b, respectively. The SSIM difference image between Figure 1a and b is shown in Figure 1k. Thresholding (if a pixel value of the difference image was larger than the given threshold value, the pixel was eliminated) was applied to the difference images with various threshold values (low, medium, large), and the resulting images are shown in Figure 1d,e,f,h,i,j,l,m,n. For the intensity component (Figure 1c,d,e,f), the differences between the shadow regions and the corresponding background regions were high. The thresholding operation still left shadows when using a low threshold value (e.g., 80). When we used a larger threshold value (e.g., 120) to eliminate the shadows, potential foreground objects were also eliminated (Figure 1f).
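To make Equation 1 concrete, the following minimal Python/NumPy sketch (ours, not the authors' implementation) evaluates the SSIM of two same-size local blocks with α = β = γ = 1 and C3 = C2/2; the function name and the example blocks are purely illustrative:

import numpy as np

C1, C2 = 6.5025, 58.5225  # constants proposed in [36] for 8-bit images

def ssim_block(x, y):
    # Equation 1 with alpha = beta = gamma = 1 and C3 = C2 / 2
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov_xy = ((x - mu_x) * (y - mu_y)).mean()
    num = (2.0 * mu_x * mu_y + C1) * (2.0 * cov_xy + C2)
    den = (mu_x ** 2 + mu_y ** 2 + C1) * (var_x + var_y + C2)
    return num / den

# Two 3x3 blocks that differ only by a brightness offset still score high,
# because the contrast and structure terms equal 1 for a pure intensity shift.
a = np.arange(9, dtype=np.float64).reshape(3, 3) * 10.0
print(ssim_block(a, a + 25.0))   # high (about 0.89): only the luminance term drops
print(ssim_block(a, a[::-1]))    # structural change: much lower (negative here)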

Figure 1

The characteristic of features. (a) An input image, (b) the reference background image, (c) intensity difference image between (a) and (b), (d) thresholding of (c) with a low value (80), (e) thresholding of (c) with a middle value (100), (f) thresholding of (c) with a large value (120), (g) hue difference image between (a) and (b), (h) thresholding of (g) with a low value (80), (i) thresholding of (g) with a middle value (100), (j) thresholding of (g) with a large value (120), (k) SSIM difference image between (a) and (b), (l) thresholding of (k) with a low value (0.4), (m) thresholding of (k) with a middle value (0.5), and (n) thresholding of (k) with a large value (0.6).

For the hue component (Figure 1g,h,i,j), shadows were not retained, but many of the background regions contained high difference values. To eliminate these background regions, we tried using a larger threshold value (e.g., 120). However, the top portion of the person with the blue jacket and the red portion of the person on the right were also eliminated. Furthermore, the small object in the lower-left corner was almost deleted when the intensity component or the hue component was used. In contrast, the method based on the SSIM correctly retained the object (Figure 1k,l,m,n). In the SSIM, global intensity and contrast changes are not treated as distortions [36]. Therefore, the proposed method is robust against shadows, which lower intensity values while retaining internal structures. Furthermore, since the proposed method used the variances and covariance of two local regions, it could detect objects with colors similar to the background. In Figure 2a, a person's head color was similar to the background regions. The proposed method showed improved performance compared to the other method [31] (http://www.na.icar.cnr.it/~maddalena.l/MODLab/SoftwareSOBS.html). Similarly, in Figure 2b, the woman's jacket color was similar to the background regions. The proposed method correctly classified the woman as a foreground object, while the other method missed the jacket.

Figure 2

The effect of detecting objects that are similar to the background. (a) Input image 1, (b) the result of the other method [31], (c) the result of the proposed method, (d) input image 2, (e) the result of the other method [31], and (f) the result of the proposed method.

To apply the SSIM to local regions, we used a sliding window approach. For each pixel, we computed the SSIM of a 3 × 3 window centered at the pixel. Let A(i, j) = [A_R(i, j), A_G(i, j), A_B(i, j)] be a pixel in the RGB color space. Then, the similarity image (SI) between intensity images A_I(i, j) and B_I(i, j) was calculated as follows:

$$
\mathrm{SI}_{A_I,B_I}(i,j) = \mathrm{SSIM}\bigl(A_I(i,j),\, B_I(i,j)\bigr)
$$
(2)

where

$$
\begin{aligned}
A_I(i,j) &= \tfrac{1}{3}\bigl(A_R(i,j) + A_G(i,j) + A_B(i,j)\bigr), \qquad
\mu_{A_I}(i,j) = \tfrac{1}{9}\sum_{v=-1}^{1}\sum_{u=-1}^{1} A_I(i+u,\, j+v)\\
\sigma_{A_I}^{2}(i,j) &= \tfrac{1}{9}\sum_{v=-1}^{1}\sum_{u=-1}^{1} A_I(i+u,\, j+v)^2 - \mu_{A_I}^{2}(i,j)\\
\sigma_{A_I B_I}(i,j) &= \tfrac{1}{9}\sum_{v=-1}^{1}\sum_{u=-1}^{1}\bigl(A_I(i+u,\, j+v) - \mu_{A_I}(i,j)\bigr)\bigl(B_I(i+u,\, j+v) - \mu_{B_I}(i,j)\bigr)
\end{aligned}
$$
(3)

A_I(i, j) represents an intensity value, μ_{A_I}(i, j) and μ_{B_I}(i, j) are the local intensity means, σ_{A_I}(i, j) and σ_{B_I}(i, j) are the local intensity standard deviations, and σ_{A_I B_I}(i, j) is the local intensity covariance. SI_{A_I,B_I}(i, j) is close to 1 when the two window regions are similar. C1 and C2 were set to 6.5025 and 58.5225, respectively [36]. By assuming that one image was a reference background image, we obtained a binary background image (BBI) by applying a thresholding operation:

$$
\mathrm{BBI}_{A_I,B_I}(i,j) =
\begin{cases}
0\ (\text{background}) & \text{if } \mathrm{SI}_{A_I,B_I}(i,j) > T_1\\
1\ (\text{foreground candidate}) & \text{otherwise}
\end{cases}
$$
(4)

T1 is a threshold value which was empirically determined and set to 0.55. Figure 3 shows the effect of the threshold value. When we used a small value for T1, most pixels were classified as background regions (Figure 3c). When we used a large value for T1, most pixels were classified as foreground regions (Figure 3o). Based on this observation, we set T1 to 0.55, though any value between 0.1 and 0.9 provided good performance.

Figure 3

The effect of threshold T1. (a) An input image, (b) the reference background image, (c) T1 = 0.1, (d) T1 = 0.2, (e) T1 = 0.3, (f) T1 = 0.35, (g) T1 = 0.4, (h) T1 = 0.45, (i) T1 = 0.50, (j) T1 = 0.55, (k) T1 = 0.60, (l) T1 = 0.65, (m) T1 = 0.70, (n) T1 = 0.80, and (o) T1 = 0.90.

Since we calculated the means and the variances, the computational complexity was low. However, some background pixels were still retained. In order to eliminate the background pixels, we used nonparametric kernel density estimation.
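For reference, a minimal vectorized sketch of Equations 2 to 4 is given below (a Python/NumPy illustration under our own naming, not the authors' code); it assumes SciPy's uniform_filter for the 3 × 3 local means:

import numpy as np
from scipy.ndimage import uniform_filter

C1, C2 = 6.5025, 58.5225

def similarity_image(a_rgb, b_rgb, win=3):
    # SI map (Equation 2) from 3x3 local statistics (Equation 3)
    a = a_rgb.astype(np.float64).mean(axis=2)   # A_I = (R + G + B) / 3
    b = b_rgb.astype(np.float64).mean(axis=2)
    mu_a, mu_b = uniform_filter(a, win), uniform_filter(b, win)
    var_a = uniform_filter(a * a, win) - mu_a ** 2
    var_b = uniform_filter(b * b, win) - mu_b ** 2
    cov = uniform_filter(a * b, win) - mu_a * mu_b
    return ((2 * mu_a * mu_b + C1) * (2 * cov + C2)) / \
           ((mu_a ** 2 + mu_b ** 2 + C1) * (var_a + var_b + C2))

def binary_background_image(si, t1=0.55):
    # BBI (Equation 4): 0 = background, 1 = foreground candidate
    return (si <= t1).astype(np.uint8)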

Determining foreground and background areas using KDE

Generally, KDE can model multi-modal probability distributions without requiring any prior information, so it is effective for modeling the arbitrary densities found in real environments. KDE was applied to each pixel of the training images; in other words, we extracted training samples at each pixel location of the training images. Let s_1, s_2, …, s_N be the training samples; we used the Gaussian kernel function. Then, the probability of x_t was calculated as follows [21]:

$$
p(x_t) = \frac{1}{N}\sum_{i=1}^{N} \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(s_i - x_t)^2}{2\sigma^2}}
$$
(5)

where σ represents the kernel bandwidth and N is the number of training samples. A pixel was classified as a background pixel if the estimated probability was larger than the given threshold. It was observed that a large value of N produced more robust results; consequently, a typical KDE method requires a large number of operations. On the other hand, we first eliminated most background pixels using the spatial similarity (SS) method and used only two features (one of the RGB components and one of the normalized RGB components). Also, we used a small number of samples (one hundred). Therefore, we were able to significantly reduce the computational complexity of the KDE without sacrificing performance. Figure 4 shows an example of the proposed method. We eliminated most background pixels using the SS method (Figure 4c). However, some background pixels were still retained, and we eliminated these pixels using KDE. In this case, the candidate pixels made up 5% to 6% of the entire image, and the processing time was reduced accordingly.

Based on this observation, we propose a computationally efficient background subtraction method that eliminates background regions using spatial similarity in the spatial domain and the KDE method in the temporal domain. By combining spatial and temporal features, the proposed method produced better performance than the conventional KDE method. Figure 5 shows the comparison results. These sequences contain dynamic background regions: tree branches were swaying and the curtain was moving in the wind. In dynamic background regions, it is difficult for the conventional KDE method to accurately model the background, so many background components are often classified as foreground components. However, since most of the background components were eliminated with spatial similarity in the proposed method, most of the background components that would otherwise be misclassified as foreground components were correctly classified as background components.
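As an illustration of Equation 5 (a hedged sketch, not the authors' code), the per-pixel Gaussian KDE can be written as follows; the sample history and the test values below are hypothetical:

import numpy as np

def kde_probability(x, samples, sigma):
    # Gaussian kernel density estimate of p(x) from past samples (Equation 5)
    samples = np.asarray(samples, dtype=np.float64)
    k = np.exp(-((samples - x) ** 2) / (2.0 * sigma ** 2)) / (np.sqrt(2.0 * np.pi) * sigma)
    return k.mean()

# 100 past intensity values around 120: a nearby value gets a high density,
# while a value far from the observed history gets a density near zero.
rng = np.random.default_rng(0)
history = 120.0 + 3.0 * rng.standard_normal(100)
print(kde_probability(122.0, history, sigma=history.std()))
print(kde_probability(200.0, history, sigma=history.std()))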

Figure 4

An example of the proposed method. (a) An input image, (b) the reference background image, (c) the similarity image, and (d) the final result.

Figure 5

The results of the proposed method and the single KDE method. First column: input image; second column: results of single KDE method; and third column: the proposed method.

The proposed method

Determine the background type

The reference background image (RBI) was computed as the average of the training intensity images:

$$
\mathrm{RBI}_I(i,j) = \frac{1}{N}\sum_{t=0}^{N-1} A_t^I(i,j)
$$
(6)

where A_t^I(i, j) represents a pixel of the t-th intensity image of a video sequence and N is the number of training images, which was set to 100. In other words, the first 100 images of a given video sequence were generally used as training images. We also computed the averages of the RGB channels of the training images:

$$
\mathrm{RBI}_\Omega(i,j) = \frac{1}{N}\sum_{t=0}^{N-1} A_t^\Omega(i,j), \qquad \Omega \in \{R, G, B\}
$$
(7)

Then, a similarity image between the reference background image and each training intensity image was computed using Equation 2, and the reference binary background image (RBBI) was obtained:

For each pixel (i, j):
$$
r(i,j) = \frac{1}{N}\sum_{t=0}^{N-1} \mathrm{SI}_{\mathrm{RBI}_I,\,A_t^I}(i,j), \qquad
\mathrm{RBBI}(i,j) =
\begin{cases}
0\ (\text{static background}) & \text{if } r(i,j) > 0.8\\
1\ (\text{moving background}) & \text{otherwise}
\end{cases}
$$
(8)

The RBBI successfully detected moving background components such as moving escalators, waving tree branches, and water fountains.
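A compact sketch of this training step (Equations 6 to 8) is shown below; it reuses the hypothetical similarity_image() helper from the earlier sketch and is not the authors' implementation:

import numpy as np

def train_background_model(frames_rgb, t_static=0.8):
    # frames_rgb: array of shape (N, H, W, 3) holding the training frames
    frames = np.asarray(frames_rgb, dtype=np.float64)
    rbi_rgb = frames.mean(axis=0)                       # RBI per channel (Equations 6-7)
    r = np.mean([similarity_image(rbi_rgb, f) for f in frames], axis=0)
    rbbi = (r <= t_static).astype(np.uint8)             # Equation 8: 1 = moving background
    return rbi_rgb, rbbi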

Determine the foreground candidate pixels

When a new image arrived, a BBI was computed between the RBI and the input intensity image using Equations 3 and 4. If BBI(i, j) = 1, the pixel could be either a foreground pixel or a moving background pixel. If RBBI(i, j) = 1 (moving background), we computed the difference between the input intensity image and RBI_I: if the difference was small, the pixel was likely a background pixel, and it was classified as a foreground candidate only when the difference was larger than the given threshold. If BBI(i, j) = 1 and RBBI(i, j) = 0, the pixel was directly classified as a foreground candidate. The following procedure was used to classify a pixel:

For each pixel (i, j) with \(\mathrm{BBI}_{\mathrm{RBI}_I,\,I_k^I}(i,j) = 1\):
$$
\mathrm{FCI}_k(i,j) =
\begin{cases}
1\ (\text{foreground candidate}) & \text{if } \mathrm{RBBI}(i,j) = 0\\
1\ (\text{foreground candidate}) & \text{if } \mathrm{RBBI}(i,j) = 1 \text{ and } \bigl|\mathrm{RBI}_I(i,j) - I_k^I(i,j)\bigr| > T_2\\
0\ (\text{background}) & \text{otherwise}
\end{cases}
$$
(9)

where FCI_k(i, j) represents the candidate image, I_k^I(i, j) represents the k-th input intensity image (see Equation 3), and T2 was empirically set to 30. If T2 was too large, most pixels were classified as background pixels; in other words, many foreground pixels were misclassified as background pixels. Figure 6 shows the results for various values of T2. Figure 6a,b shows an input image and the BBI. Figure 6c,d,e,f shows the FCI for various values of T2. Most foreground pixels were eliminated when T2 was set to 80 (Figure 6f), while most moving background pixels were retained when T2 was set to 10 (Figure 6c). In order to choose an optimal threshold value, we tested the proposed method with various values of T2 on video sequences with dynamic background regions and chose T2 = 30. At this point, most background regions were removed.
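The candidate selection of Equation 9 reduces, per pixel, to a simple Boolean rule; a minimal sketch (our notation, not the authors' code) is:

import numpy as np

def foreground_candidates(bbi, rbbi, rbi_intensity, frame_intensity, t2=30.0):
    # FCI_k (Equation 9): keep every BBI candidate, but re-check pixels that lie
    # in moving-background regions against the intensity difference threshold T2.
    diff = np.abs(rbi_intensity.astype(np.float64) - frame_intensity)
    fci = (bbi == 1) & ((rbbi == 0) | (diff > t2))
    return fci.astype(np.uint8)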

Figure 6

The results with various threshold (T2) values. (a) An input image, (b) BBI, (c) T2 = 10, (d) T2 = 30, (e) T2 = 60, and (f) T2 = 80.

Subtract the background pixels using KDE

We classified only the foreground candidate pixels (i.e., FCI_k(i, j) = 1) using KDE. Since there were high correlations among the R, G, and B components, and using all three channels produced only slight improvements, we used only the channel with the largest difference. To improve performance, we also used one of the normalized RGB components, which are robust against illumination changes and represent the chrominance information well. We selected one of the RGB channels as follows:

For each pixel (i, j) of the k-th frame with \(\mathrm{FCI}_k(i,j) = 1\):
$$
d_{\max} = \max\bigl(\mathrm{Diff}_R,\ \mathrm{Diff}_G,\ \mathrm{Diff}_B\bigr)
$$
(10)

where

$$
\mathrm{Diff}_R = \bigl|\mathrm{RBI}_R(i,j) - I_k^R(i,j)\bigr|, \quad
\mathrm{Diff}_G = \bigl|\mathrm{RBI}_G(i,j) - I_k^G(i,j)\bigr|, \quad
\mathrm{Diff}_B = \bigl|\mathrm{RBI}_B(i,j) - I_k^B(i,j)\bigr|
$$

where d_max represents the maximum difference. Let Ω_max be the channel with the maximum difference.

A foreground candidate pixel was classified as a background pixel when the estimated probability density of its value was larger than the given threshold:

$$
\text{if } \frac{1}{N}\sum_{m=0}^{N-1} \frac{1}{\sqrt{2\pi}\,\sigma}\, \exp\!\left(-\frac{\bigl(I_k^{\Omega_{\max}}(i,j) - A_m^{\Omega_{\max}}(i,j)\bigr)^2}{2\sigma^2}\right) > T_3,\ \text{classify the pixel as background; otherwise, classify it as foreground}
$$
(11)

where σ represents the kernel width. Since the probability density function of the background pixel was unknown, we assumed that the probability densities for all intensity values were identical. Therefore, we set T3 to 1/256. We used the standard deviation of the training images as the kernel width. This procedure was repeated using the normalized RGB color components, which were computed as follows:

$$
I_{\text{normalized}}^{\Omega}(i,j) = 255 \cdot \frac{I^{\Omega}(i,j)}{I^R(i,j) + I^G(i,j) + I^B(i,j)}, \qquad \Omega \in \{R, G, B\}
$$
(12)

where I(i, j) represents the input image pixel. If either test, on the selected RGB channel or on the corresponding normalized RGB channel, classified the pixel as a foreground component, the pixel was determined to be a foreground component. After this procedure, there were several small holes inside the foreground regions and some noise elements in the background regions; most pixel-based methods suffer from this kind of problem. In order to address it, we applied a morphological operation to remove the small holes and noise elements. In particular, we used erosion followed by dilation, and then a region-filling technique was applied to the results [37].
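The per-pixel decision described by Equations 10 to 12 can be sketched as follows (an illustrative Python version with hypothetical names, reusing kde_probability() from the earlier sketch; for simplicity it keeps the same channel for the normalized test, which is our assumption rather than a detail stated in the paper):

import numpy as np

T3 = 1.0 / 256.0

def classify_candidate(pixel_rgb, rbi_rgb, history_rgb, sigma_rgb, sigma_norm):
    # pixel_rgb, rbi_rgb: length-3 RGB values; history_rgb: (N, 3) training samples
    # at this pixel; sigma_*: per-channel kernel widths. Returns True for foreground.
    pixel_rgb = np.asarray(pixel_rgb, dtype=np.float64)
    rbi_rgb = np.asarray(rbi_rgb, dtype=np.float64)
    history_rgb = np.asarray(history_rgb, dtype=np.float64)

    # Equation 10: channel with the largest absolute difference to the RBI
    ch = int(np.argmax(np.abs(rbi_rgb - pixel_rgb)))
    if kde_probability(pixel_rgb[ch], history_rgb[:, ch], sigma_rgb[ch]) <= T3:
        return True  # foreground by the raw-channel test (Equation 11)

    # Equation 12: repeat the test with the normalized RGB component
    norm_pixel = 255.0 * pixel_rgb[ch] / max(pixel_rgb.sum(), 1e-6)
    norm_hist = 255.0 * history_rgb[:, ch] / np.maximum(history_rgb.sum(axis=1), 1e-6)
    return kde_probability(norm_pixel, norm_hist, sigma_norm[ch]) <= T3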

Updating

After the decision procedure, the RBI and the pixels of the training images had to be updated to adapt to changing background areas. We used a simple IIR filter to update the RBI as follows [38]:

If pixel (i, j) is classified as background:
$$
\mathrm{RBI}_\Omega(i,j) = (1-\alpha)\,\mathrm{RBI}_\Omega(i,j) + \alpha\, I_k^\Omega(i,j), \qquad \Omega \in \{R, G, B\}
$$
(13)

where α represents the learning rate, which was set to 0.01. The training images were updated by replacing the oldest pixel with the new background pixel. There is a trade-off in the choice of α: if α is large, the RBI quickly reflects background changes. Figures 7 and 8 show the RBI changes for various learning rate values. As can be seen in Figure 7, the RBI was affected by shadows when we used a large value for α. Figure 7a,b shows the initial RBI and the 372nd input image, respectively. Figure 7c shows the RBI when α was 0.6; because of the large value of α, the RBI was quickly affected by the shadows. If we used a small value for α, the RBI did not quickly reflect background changes.

Figure 7

The RBI images at different values of the learning rate 1. (a) The initial RBI, (b) the 372nd input image, (c) the RBI with α = 0.6, (d) the RBI with α = 0.3, (e) the RBI with α = 0.05, and (f) the RBI with α = 0.01.

Figure 8

The RBI images at different values of the learning rate 2. (a) The first input image, (b) the 1,386th input image, (c) the 1,386th RBI with α = 0.001, and (d) the 1,386th RBI with α = 0.01.

In some test sequences, the background gradually became brighter over a period (Figure 8). The RBI did not reflect this gradual background change with a small value of α (Figure 8c). Thus, we set α = 0.01, and the learning rate was able to handle background changes adequately (Figure 8d).
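The update of Equation 13 is a one-line IIR filter applied only to pixels classified as background; a minimal sketch (an assumed NumPy formulation, not the authors' code) is:

import numpy as np

def update_background(rbi_rgb, frame_rgb, background_mask, alpha=0.01):
    # Equation 13: blend the new frame into the RBI only where the pixel is background
    rbi = rbi_rgb.astype(np.float64)
    mask = background_mask.astype(bool)[..., None]   # broadcast over the 3 channels
    return np.where(mask, (1.0 - alpha) * rbi + alpha * frame_rgb, rbi)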

If a sudden background change occurred, the results could be erroneous. In order to handle such sudden background changes, we calculated the mean intensity difference between the input image and the RBI and determined that a sudden background change had occurred if the difference was larger than the given threshold:

$$
\text{if } \frac{1}{N_x N_y}\sum_{i=0}^{N_x-1}\sum_{j=0}^{N_y-1} \bigl|I_k^I(i,j) - \mathrm{RBI}_I(i,j)\bigr| > 30,\ \text{a sudden background change occurs at the } k\text{-th frame}
$$
(14)
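As an illustration (not the authors' code), the test of Equation 14 amounts to a mean-absolute-difference check:

import numpy as np

def sudden_change(frame_intensity, rbi_intensity, threshold=30.0):
    # Equation 14: mean absolute intensity difference between the frame and the RBI
    diff = np.abs(frame_intensity.astype(np.float64) - rbi_intensity)
    return diff.mean() > threshold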

When a sudden background change was detected at the k-th image, we calculated the image differences between the previous 100 images (from the (k-99)-th image to the k-th image) and the RBI. We selected the previous images whose frame differences were larger than the threshold, and the selected images were temporarily used as the training images. If the number of selected images was smaller than 15, all the pixels of the k-th image were classified as background components. However, the RBI was not updated while sudden changes were detected.

Figure 9 shows an example of the proposed background subtraction procedure. Figure 9a is an input image, and Figure 9b shows the reference background image. Figure 9c is the reference binary background image, where the white areas represent moving backgrounds (the waving trees). Figure 9d shows the binary background image between Figure 9a and b. Figure 9e shows the foreground candidate image. Figure 9f shows the result obtained using the original RGB components, and Figure 9g shows the final result using the normalized RGB components and the morphological operation.

Figure 9

An example of the overall procedure of the proposed method. (a) An input image, (b) RBI, (c) RBBI, (d) the BBI between (a) and (b), (e) the foreground candidate image, (f) the result obtained using the original RGB components, and (g) the final result.

Experimental results

Experiments were performed using two datasets (Li's dataset and the Wallflower's dataset). Li's dataset contained several dynamic background video sequences (water surface (WS), campus (CAM), fountain (FT), and meeting room (MR)) and static background video sequences (shopping center (SC), subway station (SS), airport (AP), lobby (LB), bootstrap (B)). The Wallflower's dataset contained various background types (bootstrap (B), camouflage (C), foreground aperture (FA), light switch (LS), moved object (MO), time of day (TD), and waving tree (WT)).

First, we measured the processing time of the proposed method. The proposed method took about 0.015 s per 10,000 pixels, while the processing time of a conventional method [38] was about 1.475 s per 10,000 pixels (using a 2.8-GHz Pentium IV with 1 GB of RAM) when the number of sample images was 100. For instance, the proposed method processed 66.7 frames per second for 160 × 128 video sequences. The complexity of KDE is O_KDE(MN) evaluations (kernel function evaluations, multiplications, and additions), assuming N image pixels and M sample points (N pixels per image and M training images). In the proposed method, we applied spatial similarity to eliminate potential background pixels using a window processing operation (window size w). The computational complexity of calculating spatial similarity is O_similarity(w²N) operations (multiplications and additions). Then, the remaining pixels (the number of remaining pixels: K = τN) are further processed using KDE (O_KDE(KM)). Therefore, the computational complexity of the proposed method is calculated as follows:

$$
\text{Number of operations} = O_{\text{similarity}}(w^2 N) + O_{\text{KDE}}(KM) = O_{\text{similarity}}(w^2 N) + O_{\text{KDE}}(\tau N M)
$$
(15)

In the proposed method, the window size is 3 (w = 3), and the remaining pixels are on average about 5% to 6% of the image pixels (τ ≅ 0.05). In other words, the KDE operations are reduced by approximately 95%. Although we need to compute the additional spatial similarity, it has a minor effect on the overall complexity. With 100 training images, the computational complexity of KDE and of the proposed method is O(100N) and O((9 + 0.05 × 100)N) = O(14N), respectively. In this case, the complexity of the proposed method is about 14% of that of the conventional KDE.
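The arithmetic behind this estimate can be checked with a few lines (a back-of-the-envelope sketch using the paper's settings; the per-pixel operation counts follow the rough model of Equation 15 and are not measured values):

w, M, tau = 3, 100, 0.05            # window size, training images, surviving-pixel ratio
ops_kde_only = M                    # conventional KDE: ~M kernel evaluations per pixel
ops_proposed = w * w + tau * M      # spatial similarity plus KDE on the remaining pixels
print(ops_proposed / ops_kde_only)  # ~0.14, i.e., about 14% of the conventional KDE cost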

Next, the proposed method was compared with some existing algorithms [31, 38–40]. The Jaccard similarity was used as a performance measure [41]:

$$
\mathrm{JS} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FP} + \mathrm{FN}}
$$
(16)

where TP represents the number of true positive pixels, FP represents the number of false positive pixels, and FN represents the number of false negative pixels. Generally, a higher Jaccard similarity index indicates better performance.
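For completeness, the Jaccard similarity of Equation 16 can be computed from two binary masks as follows (an illustrative snippet with hypothetical names, not part of the original evaluation code):

import numpy as np

def jaccard_similarity(prediction, ground_truth):
    # Equation 16: TP / (TP + FP + FN) for binary foreground masks
    pred, gt = prediction.astype(bool), ground_truth.astype(bool)
    tp = np.logical_and(pred, gt).sum()
    fp = np.logical_and(pred, ~gt).sum()
    fn = np.logical_and(~pred, gt).sum()
    return tp / float(tp + fp + fn)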

Results using Li's dataset

Table 1 shows a performance comparison based on Jaccard similarity for Li's dataset. Figure 10 shows the background subtraction results of the proposed method and Li's method using Li's dataset. The first column shows a test image, the second column shows the ground truth data of the test image, the third column shows the results of Li's method, and the fourth column shows the results of the proposed method. Using spatial similarity, the proposed method was robust against shadows. Noticeable improvements were observed in the SC, LB, B, and AP sequences, which contained significant shadows. For these sequences, the proposed method showed about 8.4% ~ 14.9%, 7.8% ~ 20.6%, and 2.91% ~ 12.7% improvement compared to SOBS, Li's method, and Park's method, respectively, in terms of the Jaccard similarity. Since the proposed method used the covariance and the variances of two local regions as well as the normalized RGB color components, it was able to detect some objects that were similar to the background intensity. Therefore, in the WS and FT sequences, which contained objects whose intensity values were similar to the background regions, the proposed method showed improved performance compared to the other methods. For instance, a main difficulty of the WS sequence was detecting a person's leg when the intensity value of the leg was similar to the background intensity value. The other methods missed parts of the leg, while the proposed method accurately detected it. For the WS sequence, the proposed method showed about 10.4%, 7.8%, and 2.91% improvements compared to SOBS, Li's method, and Park's method, respectively. A main difficulty of the FT sequence was that a person's pants color was similar to the background region when the person stood against the fountain. For the FT sequence, the Jaccard similarity of the proposed method was 0.820, and the proposed method showed about 16.5%, 14.6%, and 10.3% improvements compared to SOBS, Li's method, and Park's method, respectively. However, some sequences (e.g., CAM, SS, and MR) contained complex dynamic backgrounds: in the CAM sequence, tree branches were constantly swayed by a strong wind; the SS sequence contained moving escalators; and the MR sequence contained moving curtains. In these kinds of dynamic background sequences, Park's method (in CAM, SS, and MR) and Li's method (in MR) performed slightly better than the proposed method.

Table 1 Performance comparison with Jaccard similarity (Li's dataset)
Figure 10

Background subtraction results of the proposed method and Li's method using Li's dataset.

Results using Wallflower's dataset

Table 2 shows a performance comparison based on the total number of false positive and false negative pixels (FP + FN) for Wallflower's dataset. Figure 11 shows the results of the proposed method and the Wallflower method using Wallflower's dataset. The first column shows a test image, the second column shows the ground truth data of the test image, the third column shows the results of the Wallflower method, and the fourth column shows the results of the proposed method. The proposed method showed noticeable improvements for the C and B sequences. In the B sequence, the proposed method successfully detected objects that were similar to the background areas. On the other hand, since some moving trees in the WT sequence were classified as foreground components, the proposed method was not as good as Park's method. The LS sequence contained a sudden background change, and the proposed method showed better performance. In the MO sequence, the proposed method classified the relocated objects (the chair and the phone) as foreground components; to handle this kind of problem, higher-level processing such as that used in the Wallflower method might be required. The proposed method missed an object whose color was similar to that of the background area in the TD sequence.

Table 2 Performance comparison with the number of false positive and false negative pixels (Wallflower's dataset)
Figure 11

Background subtraction results of the proposed method and the Wallflower method using the Wallflower's dataset.

The effects of thresholds

Next, we investigated the effects of the thresholds (T1 in Equation 4 and T2 in Equation 9). Figure 12 shows the Jaccard similarity of the proposed method as the T1 and T2 values increase, for Li's dataset and Wallflower's dataset. In order to analyze the effect of T1, we computed the false positive ratio (FPR) and false negative ratio (FNR) metrics as follows:

$$
\mathrm{FPR} = \frac{\mathrm{FP}}{\mathrm{TP} + \mathrm{FP} + \mathrm{FN}}, \qquad
\mathrm{FNR} = \frac{\mathrm{FN}}{\mathrm{TP} + \mathrm{FP} + \mathrm{FN}}
$$
(17)
Figure 12

The effects of thresholds (T1 and T2).

When we used a large value for T1, most foreground pixels were correctly classified as foreground pixels; however, many background pixels were also classified as foreground pixels, so the FPR increased and the FNR decreased. When we used a small value for T1, most background pixels were classified as background pixels; however, many foreground pixels were also classified as background pixels, so the FPR decreased and the FNR increased. Figure 13 shows the Jaccard similarity and the FPR and FNR metrics for various values of T1 (T2 was fixed at 30).

Figure 13

The evaluation metrics with various T1 values.

We selected optimal values for T1 and T2. When we set T1 and T2 to 0.55 and 30, respectively, the foreground candidate pixels made up about 5% of the total number of pixels; the Jaccard similarity of the proposed method was about 0.78 for Li's dataset, and the total number of FP and FN pixels was about 6,888 for Wallflower's dataset. Experiments with various values of T1 and T2 showed that the proposed method produced stable performance when T1 was between 0.5 and 0.65 and T2 was between 25 and 35.

Conclusions

In this paper, we proposed a background subtraction method based on structural similarity that is robust against various background types. The proposed method also significantly reduced the computational complexity, since most pixels were eliminated using the similarity image. We tested the proposed method with two datasets and compared it with some existing methods. The experimental results demonstrated that the proposed method was effective for various background scenes and compared favorably with some existing algorithms.

Authors’ information

Sangwook Lee received the BS and MS degrees in electrical and electronic engineering from Yonsei University, Seoul, Republic of Korea, in 2004 and 2006, respectively. He is currently working toward the PhD degree at Yonsei University and is a senior engineer at Samsung Electronics Co. Ltd., Republic of Korea. His research interests include machine vision, image/signal processing, and video quality measurement.

Chulhee Lee received the BS and MS degrees in electronic engineering from Seoul National University in 1984 and 1986, respectively, and the PhD degree in electrical engineering from Purdue University, West Lafayette, Indiana, in 1992. In 1996, he joined the faculty of the Department of Electrical and Computer Engineering, Yonsei University, Seoul, Republic of Korea. His research interests include image/signal processing, pattern recognition, and neural networks.

References

1. McFarlane NJB, Schofield CP: Segmentation and tracking of piglets in images. Mach. Vision Appl. 1995, 8(1):187-193.

2. Hung MH, Hsieh CH: Speed up temporal median filter for background subtraction. In Proceedings of the PCSPA, vol. 1. Harbin; 2004:297-300.

3. Cheng F, Huang S, Ruan S: Advanced motion detection for intelligent video surveillance systems. In Proceedings of the ACM SAC, vol. 1. Sierre; 2010:983-984.

4. Cohen S: Background estimation as a labeling problem. In Proceedings of ICCV, vol. 2. Beijing; 2005:1034-1041.

5. Wren C, Azarbayejani A, Darrell T, Pentland A: Pfinder: real-time tracking of the human body. IEEE Trans. Pattern Anal. Mach. Intell. 1997, 19(7):780-785. 10.1109/34.598236

6. Zhao M, Bu J, Chen C: Robust background subtraction in HSV color space. In Proceedings of SPIE MSAV, vol. 1. Boston; 2002:325-332.

7. Pan X, Wu Y: GSM-MRF based classification approach for real-time moving object detection. J. Zhejiang Univ. Sci. A 2008, 9(2):250-255. 10.1631/jzus.A071267

8. Rambabu C, Woo W: Robust and accurate segmentation of moving objects in real-time video. In Proceedings of the International Symposium on Ubiquitous VR, vol. 191. Yanji City; 2006:65-69.

9. Stauffer C, Grimson E: Adaptive background mixture models for real-time tracking. In Proceedings of IEEE Conf. Computer Vision Patt. Recog., vol. 2. Fort Collins; 1999:246-252.

10. Zhang W, Fang X, Yang X, Wu Q: Spatiotemporal Gaussian mixture model to detect moving objects in dynamic scenes. J. Electron. Imaging 2007, 16(2):023013-1–023013-6.

11. Su T, Hu J: Background removal in vision servo system using Gaussian mixture model framework. In Proceedings of ICNSC, vol. 1. Singapore; 2004:70-75.

12. Doulamis A: Dynamic background modeling for a safe road design. In Proceedings of PETRA, vol. 1. Samos; 2010:1-9.

13. Khan MH, Kypraios I, Khan U: A robust background subtraction algorithm for motion based video scene segmentation in embedded platforms. In Proceedings of FIT, vol. 1. Abbottabad; 2009:1-8.

14. Wang H, Miller P: Regularized online mixture of Gaussians for background with shadow removal. In Proceedings of AVSS, vol. 1. Klagenfurt; 2011:249-254.

15. Wang SC, Su TF, Lai SH: Detection of moving objects from dynamic background with shadow removal. In Proceedings of ICASSP, vol. 1. Prague; 2011:925.

16. Zhao L, He X: Adaptive Gaussian mixture learning for moving object detection. In Proceedings of IC-BNMT, vol. 1. Beijing; 2010:1176-1180.

17. Bin Z, Liu Y: Robust moving object detection and shadow removing based on improved Gaussian model and gradient information. In Proceedings of ICMT 2010, vol. 1. Ningbo; 2010:1-5.

18. Lim HH, Chuang JH, Liu TL: Regularized background adaptation: a novel learning rate control scheme for Gaussian mixture modeling. IEEE Trans. Image Process. 2011, 20(3):822-836.

19. Zhou H, Zhang X, Gao Y, Yu P: Video background subtraction using improved adaptive-K Gaussian mixture model. In Proceedings of ICACTE, vol. 5. Chengdu; 2010:363-366.

20. Suhr J, Jung H, Li G, Kim J: Mixture of Gaussians-based background subtraction for Bayer-pattern image sequences. IEEE Trans. Circuits Syst. Video Technol. 2011, 21(3):365-370.

21. Elgammal A, Harwood D, Davis L: Non-parametric model for background subtraction. In Proceedings of ECCV, vol. 1. Dublin; 2000:751-767.

22. Tanaka T, Shimada A, Arita D, Taniguchi R: A fast algorithm for adaptive background model construction using Parzen density estimation. In Proceedings of IEEE Conf. AVSS, vol. 1. London; 2007:528-553.

23. Tavakkoli A, Nicolescu M, Bebis G: Automatic robust background modeling using multivariate non-parametric kernel density estimation for visual surveillance. In Proceedings of the International Symposium on Advances in Visual Computing, LNCS, vol. 1. Nevada; 2005:363-370.

24. Martel-Brisson N, Zaccarin A: Unsupervised approach for building non-parametric background and foreground models of scenes with significant foreground activity. In Proceedings of VNBA, vol. 1. Vancouver; 2008:93-100.

25. Han B, Comaniciu D, Zhu Y, Davis L: Sequential kernel density approximation through mode propagation: applications to background modeling. In Proceedings of ACCV, vol. 1. Jeju; 2004:1-6.

26. Gordon G, Darrell T, Harville M, Woodfill J: Background estimation and removal based on range and color. In Proceedings of CVPR, vol. 1. Fort Collins; 1999:2459-2464.

27. Monnet A, Mittal A, Paragios N, Ramesh V: Background modeling and subtraction of dynamic scenes. In Proceedings of ICCV, vol. 2. Beijing; 2003:1-8.

28. Kim H, Sakamoto R, Kitahara I, Toriyama T, Kogure K: Robust foreground extraction technique using Gaussian family model and multiple thresholds. In Proceedings of ACCV, vol. 1. Tokyo; 2007:758-768.

29. Zhu Q, Liu G, Wang Z, Chen H, Xie Y: A novel video object segmentation based on recursive kernel density estimation. In Proceedings of ICINFA, vol. 1. Shenzhen; 2011:843-846.

30. Kolawole A, Tavakkoli A: Robust foreground detection in videos using adaptive color histogram thresholding and shadow removal. In Proceedings of ISVC, vol. 2. Las Vegas; 2011:496-505.

31. Maddalena L, Petrosino A: A self-organizing approach to background subtraction for visual surveillance applications. IEEE Trans. Image Process. 2008, 17(7):1168-1177.

32. Maddalena L, Petrosino A: Self organizing and fuzzy modelling for parked vehicles detection. In Proceedings of ACIVS, vol. 1. Bordeaux; 2009:422-433.

33. Lin H, Liu T, Chuang J: A probabilistic SVM approach for background scene initialization. In Proceedings of ICIP, vol. 3. Rochester; 2002:893-896.

34. Cheng L, Gong M, Schuurmans D, Caelli T: Real-time discriminative background subtraction. IEEE Trans. Image Process. 2011, 20(5):1401-1414.

35. Junejo I, Bhutta A, Foroosh H: Dynamic scene modeling for object detection using single-class SVM. In Proceedings of the International Conference on Image Processing, vol. 1. Hong Kong; 2010:1541-1544.

36. Wang Z, Bovik AC, Sheikh HR, Simoncelli EP: Image quality assessment: from error visibility to structural similarity. IEEE Trans. Image Process. 2004, 13(4):600-612.

37. Gonzalez R, Woods R: Digital Image Processing. 2nd edition. Prentice Hall, Englewood Cliffs; 2002.

38. Park JG, Lee C: Bayesian rule-based complex background modeling and foreground detection. Opt. Eng. 2010, 49(2):027006-1–027006-11.

39. Li L, Huang W, Gu IYH, Tian Q: Statistical modeling of complex backgrounds for foreground object detection. IEEE Trans. Image Process. 2004, 13(11):1459-1472.

40. Toyama K, Krumm J, Brumitt B, Meyers B: Wallflower: principles and practice of background maintenance. In Proceedings of IEEE ICCV, vol. 1. Kerkyra; 1999:255-261.

41. Jaccard P: The distribution of flora in the alpine zone. New Phytol. 1912, 11(2):37-50. 10.1111/j.1469-8137.1912.tb05611.x


Acknowledgements

This work was supported by grant no. R01-2006-000-11223-0 from the Basic Research Program of the Korea Science & Engineering Foundation.

Author information


Corresponding author

Correspondence to Chulhee Lee.

Additional information

Competing interests

The authors declare that they have no competing interests.


Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 2.0 International License (https://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.


About this article


Cite this article

Lee, S., Lee, C. Low-complexity background subtraction based on spatial similarity. J Image Video Proc 2014, 30 (2014). https://doi.org/10.1186/1687-5281-2014-30

