AUTO GMM-SAMT: An Automatic Object Tracking System for Video Surveillance in Traffic Scenarios
© K. Quast and A. Kaup. 2011
Received: 1 April 2010
Accepted: 26 October 2010
Published: 27 October 2010
A complete video surveillance system for automatically tracking shape and position of objects in traffic scenarios is presented. The system, called Auto GMM-SAMT, consists of a detection and a tracking unit. The detection unit is composed of a Gaussian mixture model- (GMM-) based moving foreground detection method followed by a method for determining reliable objects among the detected foreground regions using a projective transformation. Unlike the standard GMM detection the proposed detection method considers spatial and temporal dependencies as well as a limitation of the standard deviation leading to a faster update of the mixture model and to smoother binary masks. The binary masks are transformed in such a way that the object size can be used for a simple but fast classification. The core of the tracking unit, named GMM-SAMT, is a shape adaptive mean shift- (SAMT-) based tracking technique, which uses Gaussian mixture models to adapt the kernel to the object shape. GMM-SAMT returns not only the precise object position but also the current shape of the object. Thus, Auto GMM-SAMT achieves good tracking results even if the object is performing out-of-plane rotations.
1. Introduction
Moving object detection and object tracking are important and challenging tasks not only in video surveillance applications but also in all kinds of multimedia technologies. A lot of research has been performed on these topics, giving rise to numerous detection and tracking methods. A good survey of detection as well as tracking methods can be found in [1]. Typically, an automatic object tracking system consists of a moving object detection unit and the actual tracking algorithm [2, 3].
The aim of the detection unit is to detect moving foreground regions and store the detection result in a binary mask. A very common solution for moving foreground detection is background subtraction. In background subtraction a reference background image is subtracted from each frame of the sequence, and binary masks with the moving foreground objects are obtained by thresholding the resulting difference images. The key problem in background subtraction is to find a good background model. Commonly a mixture of Gaussian distributions is used for modeling the color values of a particular pixel over time [4–6]. Hence, the background can be modeled by a Gaussian mixture model (GMM). Once the pixelwise GMM likelihood is obtained, the final binary mask is generated either by thresholding [4, 6, 7] or according to more sophisticated decision rules [8–10]. Although the Gaussian mixture model technique is quite successful, the obtained binary masks are often noisy and irregular. The main reason for this is that spatial and temporal dependencies are neglected in most approaches. Thus, the method of our detection unit improves the standard GMM method by regarding spatial and temporal dependencies and integrating a limitation of the standard deviation into the traditional method. While the spatial dependency and the limitation of the standard deviation lead to clear and noiseless object boundaries, false positive detections caused by shadows and uncovered background regions, so-called ghosts, can be reduced due to the consideration of the temporal dependency. By combining this improved detection method with a fast shadow removal technique, which is inspired by the technique of , the quality of the detection result is further enhanced and good binary masks are obtained without adding any complex and computationally expensive extensions to the method.
Once an object is detected and classified as reliable, the actual tracking algorithm can be initialized. In [1] tracking methods are divided into three main categories: point tracking, kernel tracking, and silhouette tracking. Due to its ease of implementation, computational speed, and robust tracking performance, we decided to use a mean shift-based tracking algorithm , which belongs to the kernel tracking category. In spite of its advantages, traditional mean shift has two main drawbacks. The first problem is the fixed scale of the kernel, that is, the constant kernel bandwidth. In order to achieve a reliable tracking result for an object of changing size, an adaptive kernel scale is necessary. The second drawback is the use of a radially symmetric kernel. Since most objects have anisotropic shapes, a symmetric kernel with its isotropic shape is not a good representation of the object shape. In fact, if not specially treated, the symmetric kernel shape may lead to an inclusion of background information into the target model, which can even cause tracking failures. An intuitive approach to solving the first problem is to run the algorithm with three different kernel bandwidths, the former bandwidth and the former bandwidth ±10%, and to choose the kernel bandwidth which maximizes the appearance similarity (±10% method) . A more sophisticated method using a difference of Gaussian mean shift kernel in scale space has been proposed in [13]. The method provides good tracking results but is computationally very expensive. Moreover, neither method is able to adapt to the orientation or the shape of the object.
Mean shift-based methods which adapt not only the kernel scale but also the orientation of the kernel are presented in [14–17]. The method of [14] focuses on face tracking and uses ellipses as basic face models; thus it cannot easily be generalized for tracking other objects since adequate models are required. As in [15], scale and orientation of a kernel can be obtained by estimating the second-order moments of the object silhouette, but this is computationally expensive. In [16] mean shift is combined with adaptive filtering to obtain kernel scale and orientation. The estimations of kernel scale and orientation are good, but since a symmetric kernel is used, no adaptation to the actual object shape can be performed. Therefore, in [17] asymmetric kernels are generated using implicit level set functions. Since the search space is extended by a scale and an orientation dimension, the method simultaneously estimates the new object position, scale, and orientation. However, the method can only estimate the object's orientation for in-plane rotations. In case of 3D or out-of-plane rotations none of the mentioned algorithms is able to adapt to the shape of the object.
Therefore, for the tracking unit of Auto GMM-SAMT we developed GMM-SAMT, a mean shift-based tracking method which is able to adapt to the object contour no matter what kind of 3D rotation the object is performing. During initialization the tracking unit generates an asymmetric and shape-adapted kernel from the object mask delivered by the previous units of Auto GMM-SAMT. During the tracking the kernel scale is first adapted to the current object size by running the mean shift iterations in an extended search space. The scale-adapted kernel is then fully adapted to the current contour of the object by a segmentation process based on a maximum a posteriori estimation considering the GMMs of the object and the background histogram. Thus, a good fit of the object shape is retrieved even if the object is performing out-of-plane rotations.
The paper is organized as follows. In Section 2 the detection of moving foreground regions is explained while Section 3 describes the determination of reliable objects among the detected foreground regions. GMM-SAMT, the core of Auto GMM-SAMT, is presented in Section 4. The whole system (Figure 1) is evaluated in Section 5 and finally conclusions are drawn in Section 6.
2. Moving Foreground Detection
2.1. GMM-Based Background Subtraction
The other parameters remain the same. The Gaussians are now ordered by the value of the reliability measure in such a way that with increasing subscript the reliability decreases. If a pixel matches more than one Gaussian distribution, the one with the highest reliability is chosen. If the constraint in (2) is not fulfilled and a color value cannot be assigned to any of the distributions, the least probable distribution is replaced by a distribution with the current value as its mean value, a low prior weight, and an initially high standard deviation and is rescaled.
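The update rule described above can be illustrated with a minimal per-pixel sketch. It is not the authors' implementation: it assumes a scalar grey value instead of a color vector, takes the reliability measure to be weight divided by standard deviation (as in the classical mixture-of-Gaussians background model), uses a single learning rate `alpha` for all parameters, and all names and default values (`Gaussian`, `update_pixel_model`, `d`, `init_weight`, `init_std`) are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Gaussian:
    weight: float
    mean: float
    std: float

    @property
    def reliability(self) -> float:
        # Assumed reliability measure: high weight and low spread.
        return self.weight / self.std


def update_pixel_model(model, x, alpha=0.01, d=2.5,
                       init_weight=0.05, init_std=30.0):
    """Update the mixture of one pixel with grey value x.

    Returns the index of the matched Gaussian (after sorting by
    reliability) or None if no Gaussian matched.
    """
    # Order the Gaussians so that reliability decreases with the index.
    model.sort(key=lambda g: g.reliability, reverse=True)
    # The first match in this order is the most reliable matching Gaussian.
    matched = next((i for i, g in enumerate(model)
                    if abs(x - g.mean) < d * g.std), None)
    if matched is None:
        # No match: replace the last (least reliable) Gaussian by one
        # centred on x with a low weight and a high standard deviation.
        model[-1] = Gaussian(init_weight, float(x), init_std)
    for i, g in enumerate(model):
        g.weight = (1 - alpha) * g.weight + (alpha if i == matched else 0.0)
    if matched is not None:
        g = model[matched]
        g.mean = (1 - alpha) * g.mean + alpha * x
        g.std = ((1 - alpha) * g.std ** 2 + alpha * (x - g.mean) ** 2) ** 0.5
    # Rescale the weights so that they sum to one again.
    total = sum(g.weight for g in model)
    for g in model:
        g.weight /= total
    return matched
```

Calling `update_pixel_model` once per frame for every pixel keeps the mixture current; a pixel whose value matches none of the Gaussians would typically be a foreground candidate.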
2.2. Temporal Dependency
The traditional method takes into account only the mean temporal frequency of the color values of the sequence. The more often a pixel has a certain color value, the greater is the probability of occurrence of the corresponding Gaussian distribution. But the direct temporal dependency is not taken into account.
2.3. Spatial Dependency
In the standard GMM method, each pixel is treated separately and spatial dependency between adjacent pixels is not considered. Therefore, false positives caused by noise-based exceedance of the threshold in (2) or by slight lighting changes are obtained. While the false positives of the first type are small and isolated image regions, those of the second type cover larger adjacent regions, as they mostly appear at the border of shadows, the so-called penumbra. Through spatial dependency both kinds of false positives can be eliminated.
2.4. Background Quality Enhancement
If a pixel in a new frame is not described very well by the current model, the standard deviation of a Gaussian distribution modelling the foreground might increase enormously. This happens most notably when the pixel's color value deviates tremendously from the mean of the distribution and large values of are obtained during the model update. The larger the standard deviation gets, the more color values can be matched to the Gaussian distribution. Again this increases the probability of large values of .
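The feedback loop described here can be made concrete with a small numerical sketch. It assumes that the limitation is a simple upper clamp applied after the usual running variance update; the function name and the parameters `rho` and `std_max` are hypothetical and not taken from the paper.

```python
def update_std_limited(std, mean, x, rho, std_max):
    """Running update of the standard deviation with an upper limit.

    Assumed form: the variance is blended with the squared deviation
    of the new value x, and the result is clamped to std_max.
    """
    var = (1 - rho) * std ** 2 + rho * (x - mean) ** 2
    return min(var ** 0.5, std_max)
```

Without the clamp, a single outlier such as `x = 250` against a mean of 100 would inflate a standard deviation of 5 to more than 100 for `rho = 0.5`, after which almost any color value would match the distribution.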
2.5. Single Step Shadow Removal
where is the maximum allowed darkness, is the maximum allowed brightness, and and are the maximum allowed angle separation for penumbra and umbra. Compared to the shadow removal scheme described in , the proposed technique suppresses penumbra and umbra simultaneously while the method of  has to be run twice. More details can be found in .
3. Determination of Reliable Objects
After the GMM-based background subtraction it has to be decided which of the detected foreground pixels in the binary mask represent true and reliable object regions. In spite of its good performance the background subtraction unit still needs a few frames to adjust when an object, which has not been moving for a long time, suddenly starts to move. During this period uncovered background regions, also referred to as ghosts, can be detected as foreground. To avoid tracking these wrong detection results we have to distinguish between reliable objects (true objects) and nonreliable objects (uncovered background). Since it does not make sense to track objects which only appear in the scene for a few frames, these objects are also considered as nonreliable objects.
where is an 8-dimensional vector consisting of the first 8 elements of . Concatenating the equations from more than four point correspondences a linear set of equations of the form of is obtained which can be solved by a least squares technique.
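The least squares estimation of the homography can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the ninth element of the homography is fixed to 1 (which is what restricting the unknowns to the first 8 elements suggests), NumPy's `lstsq` stands in for the unspecified least squares technique, and the function names are hypothetical.

```python
import numpy as np


def estimate_homography(src, dst):
    """Estimate a 3x3 homography from N >= 4 point correspondences.

    Each correspondence (x, y) -> (u, v) contributes two linear
    equations in the 8 unknown elements; the ninth is fixed to 1.
    """
    rows, rhs = [], []
    for (x, y), (u, v) in zip(src, dst):
        rows.append([x, y, 1, 0, 0, 0, -u * x, -u * y])
        rows.append([0, 0, 0, x, y, 1, -v * x, -v * y])
        rhs.extend([u, v])
    a = np.array(rows, dtype=float)
    b = np.array(rhs, dtype=float)
    h8, *_ = np.linalg.lstsq(a, b, rcond=None)
    return np.append(h8, 1.0).reshape(3, 3)


def apply_homography(h, pts):
    """Map 2D points through the homography h."""
    pts = np.asarray(pts, dtype=float)
    hom = np.hstack([pts, np.ones((len(pts), 1))]) @ h.T
    return hom[:, :2] / hom[:, 2:3]
```

With exactly four correspondences the system is square; additional correspondences overdetermine it, and the least squares solution averages out localization errors in the marked points.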
In case of airport apron surveillance or other surveillance scenarios where the scene is captured from a (slanted) top view position, moving objects on the ground can be considered as flat compared to the reference plane. Thus, in the transformed binary masks the size of the detected foreground regions almost does not change over the sequence, compare masks in Figure 6. Hence, we can now use the size for detecting reliable objects. Since airplanes and vehicles are the most interesting objects on the airport apron, we only keep detected regions which are bigger than a certain size in the transformed binary image. In most cases can also be used to distinguish between airplanes and other vehicles. After removing all foreground regions which are smaller than , the binary mask is transformed back into the original view. All remaining foreground regions in two subsequent frames are then matched by estimating the shortest distance between the centroids. We define a foreground region as a reliable object, if the region is detected and matched in subsequent frames.
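The size filtering and centroid matching described above can be sketched as follows. The representation of a region as a dictionary with `area` and `centroid` keys, the function names, and the `max_dist` gate on the centroid distance are assumptions made for illustration; the paper only states that regions are matched by the shortest centroid distance.

```python
import math


def filter_by_size(regions, min_area):
    """Keep only regions whose area in the transformed mask exceeds min_area."""
    return [r for r in regions if r["area"] > min_area]


def match_regions(prev, curr, max_dist):
    """Match each current region to the closest previous region.

    Returns a list of (previous index, current index) pairs. The
    max_dist gate is an assumption that rejects implausible matches.
    """
    matches = []
    for j, c in enumerate(curr):
        best = min(range(len(prev)),
                   key=lambda i: math.dist(prev[i]["centroid"], c["centroid"]),
                   default=None)
        if best is not None and \
                math.dist(prev[best]["centroid"], c["centroid"]) <= max_dist:
            matches.append((best, j))
    return matches
```

A region would then be declared reliable once it has been matched over the required number of subsequent frames.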
The detection result of a reliable object already being tracked is compared to the tracking result of GMM-SAMT to check if the detection result is still valid; see Figure 1. The comparison is also used as a final refinement step for the GMM-SAMT results. In case of very similar object and background color the tracking result might miss small object segments at the border of the object, which might be identified as object regions during the detection step and can be added to the object shape. Also, small segments at the border of the object which are actually background regions can be identified and corrected by comparing the tracking result with the detection result. For objects that are considered as reliable for the first time, the mask of the object is used to build the shape adaptive kernel and to estimate the color histogram of the object for generating the target model as described in Sections 4.1 and 4.2. After the adaptive kernel and target model are estimated, GMM-SAMT can be initialized.
4. Object Tracking Using GMM-SAMT
4.1. Mean Shift Tracking Overview
The aim is to minimize the distance between the two color distributions as a function of in the neighborhood of a given position . This can be achieved using the mean shift algorithm. By running this algorithm the kernel is recursively moved from to according to the mean shift vector.
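One iteration of this procedure can be sketched as follows. The sketch assumes the standard formulation of kernel-based mean shift tracking with an Epanechnikov profile, for which the new position reduces to a weighted mean of the pixel positions, and a distance based on the Bhattacharyya coefficient; the function names and the array layout are hypothetical.

```python
import numpy as np


def bhattacharyya_distance(p, q):
    """Distance between two normalized histograms p and q."""
    rho = float(np.sum(np.sqrt(p * q)))  # Bhattacharyya coefficient
    return np.sqrt(max(0.0, 1.0 - rho))


def mean_shift_step(positions, bins, q, p):
    """Return the new kernel position after one mean shift step.

    positions: (N, 2) pixel coordinates inside the current kernel.
    bins:      (N,) histogram bin index of each pixel.
    q, p:      target and candidate histograms (normalized).
    """
    eps = 1e-12  # guards against division by an empty candidate bin
    w = np.sqrt(q[bins] / np.maximum(p[bins], eps))
    return (w[:, None] * positions).sum(axis=0) / w.sum()
```

Pixels whose color is more frequent in the target model than in the current candidate receive larger weights, so the kernel is pulled toward them; the step is repeated until the position change falls below a chosen tolerance.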
4.2. Asymmetric Kernel Selection
4.3. Mean Shift Tracking in Spatial-Scale-Space
Given the object mask for the initial frame, the object centroid and the target model are computed. To make the target model more robust, the histogram of a specified neighborhood of the object is also estimated, and bins of the neighborhood histogram are set to zero in the target histogram to eliminate the influence of colors which are contained in the object as well as in the background. In case of an object mask with a slightly different shape than the object shape, too many object colors might be suppressed in the target model if the directly neighboring pixels are considered. Therefore, the directly neighboring pixels are not included in the considered neighborhood. The mean shift iterations are then performed as described in [17, 23], and the new position of the object as well as a scaled object shape will be determined, where the latter can be considered as a first shape estimate.
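The suppression of shared colors in the target model can be sketched as follows. For brevity the sketch uses an unweighted histogram (the kernel weighting of the pixels is omitted) and assumes that the pixels have already been quantized to bin indices; the function and parameter names are hypothetical.

```python
import numpy as np


def target_model(object_bins, neighborhood_bins, n_bins):
    """Build a normalized target histogram with background bins removed.

    object_bins:       bin index of every pixel inside the object mask.
    neighborhood_bins: bin index of every pixel in the considered
                       neighborhood (excluding the directly adjacent ring).
    """
    q = np.bincount(object_bins, minlength=n_bins).astype(float)
    background = np.bincount(neighborhood_bins, minlength=n_bins)
    q[background > 0] = 0.0  # drop colors that also occur in the background
    total = q.sum()
    return q / total if total > 0 else q
```

The remaining bins describe colors that are distinctive for the object, which is what makes the subsequent mean shift iterations less sensitive to the surrounding background.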
4.4. Shape Adaptation Using GMMs
where indicates that a pixel, or more precisely its color value , belongs to the object ( ) or the background class ( ), and is the corresponding a priori probability. To set to an appropriate value, the object and background areas of the initial mask are considered.
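The MAP decision for a single pixel can be sketched as follows. The sketch makes several simplifying assumptions that are not from the paper: a scalar feature is used instead of a color vector, each GMM is given as a list of `(weight, mean, std)` tuples, and the prior `prior_obj` is passed in directly (for example as the fraction of object area in the initial mask, which is one plausible reading of the sentence above).

```python
import math


def gmm_pdf(x, components):
    """Evaluate a 1D Gaussian mixture given as (weight, mean, std) tuples."""
    return sum(
        w * math.exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))
        for w, mu, sd in components
    )


def map_is_object(x, obj_gmm, bg_gmm, prior_obj):
    """Return True if the object class has the larger posterior for x."""
    posterior_obj = prior_obj * gmm_pdf(x, obj_gmm)
    posterior_bg = (1.0 - prior_obj) * gmm_pdf(x, bg_gmm)
    return posterior_obj > posterior_bg
```

Because both posteriors share the same normalization term, comparing the products of prior and likelihood is sufficient for the decision.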
Based on the number of its object and background pixels, a segment is assigned as an object or background segment. If more than 50% of the pixels of a segment belong to the object class, the segment is assigned as an object segment; otherwise the segment is considered to belong to the background. The tracking result is then compared to the corresponding detection result of the GMM-based background subtraction method. Segments of the GMM-SAMT result which match the detected moving foreground region are considered as true moving object segments. But segments which are not at least partly included in the moving foreground region of the background subtraction result are discarded, since they are most likely wrongly assigned as object segments due to errors in the MAP estimation caused by very similar foreground and background colors. Hence, the final object shape consists only of segments complying with the constraints of the background subtraction as well as the constraints of the GMM-SAMT procedure. Thus, we obtain quite a trustworthy representation of the final object shape from which the next object-based kernel is generated. Finally, the next mean shift iterations of GMM-SAMT can be initialized.
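The segment-level decision and the consistency check against the detection result can be sketched as follows. The array-based representation (a label image from the mean shift segmentation, a per-pixel MAP result, and the binary foreground mask) and the function name are assumptions made for illustration.

```python
import numpy as np


def refine_shape(segments, pixel_is_object, foreground):
    """Combine per-pixel MAP labels into a final object mask.

    segments:        integer label image from the segmentation.
    pixel_is_object: boolean image, True where the MAP decision is "object".
    foreground:      boolean mask from the background subtraction.
    """
    shape = np.zeros(segments.shape, dtype=bool)
    for label in np.unique(segments):
        seg = segments == label
        majority_object = pixel_is_object[seg].mean() > 0.5
        overlaps_detection = foreground[seg].any()
        # Keep a segment only if both constraints are fulfilled.
        if majority_object and overlaps_detection:
            shape[seg] = True
    return shape
```

The returned mask is the shape from which the next object-based kernel would be generated.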
5. Experimental Results
The performance of Auto GMM-SAMT was tested on several sequences showing typical traffic scenarios recorded outside. To show that the detection method itself is also applicable for other surveillance scenarios, it was also tested on indoor surveillance sequences. In particular, the detection method was tested on two indoor sequences provided by  and three outdoor sequences, while the tracking and overall performance of Auto GMM-SAMT was tested on five outdoor sequences. For each sequence at least 15 ground truth frames were either manually labeled or taken from . Overall the performance of Auto GMM-SAMT was evaluated on a total of 200 sample frames.
After parameter testing the GMM methods achieved good detection results for all sequences with Gaussians, , , and , whereas the parameters for temporal dependency and and for spatial dependency were set to and . Due to the very different illumination conditions in the indoor and outdoor scenarios, the learning rate and the shadow removal parameters were chosen separately for indoor sequences and outdoor sequences; see Table 2.
[Figure: panels showing ground truth frames and results of the detection unit of Auto GMM-SAMT]
To determine reliable objects among the detected foreground regions the obtained binary masks are transformed using the corresponding homography matrix. The homography matrix is estimated only once at the beginning of a sequence and can then be used for the whole sequence. A recalculation of the homography matrix is not necessary. Thus, the homography estimation can be considered as a calibration step of the surveillance system, which does not influence the computational performance of Auto GMM-SAMT at all. In the transformed mask only foreground regions of interesting size (e.g., ) are kept and considered as possible object regions. For our purpose was set to 2000 pixels for detecting cars and to 75000 pixels for airplanes.
After possible object regions are estimated in the transformed binary mask, the mask is transformed back into the original view. All possible object regions which could be matched in subsequent frames are considered as reliable objects. For each reliably detected object the mask-based kernel is generated. Each object kernel is then used for computing the weighted histogram in the RGB space with bins. For the scale dimension the Epanechnikov kernel with a bandwidth of is used. For mean shift segmentation a multivariate kernel defined according to (35) in  as the product of two Epanechnikov kernels, one for the spatial domain (pixel coordinates) and one for the range domain (color), is used. The bandwidth of the Epanechnikov kernel in range domain was set to , and the bandwidth of the one in spatial domain to . The minimal segment size was set to 5 pixels. Since the colors of an object and the surrounding background do not change too drastically in the considered scenarios while the object is being tracked, the object and background GMMs for the MAP decision are only estimated at the beginning of the tracking by running the EM algorithm until convergence or for a maximum number of 30 iterations. Since Auto GMM-SAMT is developed for video surveillance of traffic scenarios, which are recorded diagonally from above such that the homography leads to reasonable results, the tracking performance was tested on five outdoor sequences containing mainly three-dimensional rigid objects.
[Figure: panels showing ground truth frames and results of standard mean shift]
The performance of the detection unit (implemented in C++) is about 29 fps for image resolution on a 2.83 GHz Intel Core 2 Q9550. By using multithreading the performance is further enhanced up to 60.16 fps using 4 threads. Since the tracking unit is implemented in Matlab, it does not perform in real-time yet. But our modifications do not add any computationally expensive routines to the mean shift method, and the EM algorithm is only run at the beginning of the tracking. Thus, a good computational performance should also be possible for a C/C++ implementation of the tracking unit.
6. Conclusions
The presented Auto GMM-SAMT video surveillance system shows that the GMM-SAMT algorithm could successfully be combined with our improved GMM-based background subtraction method. Thus, an automatic object tracking for video surveillance is achieved.
On the one hand Auto GMM-SAMT takes advantage of GMM-SAMT, which extends the standard mean shift algorithm to track the contour of objects of changing shape without the help of any predefined shape model. Since the tracking unit works with object mask-based kernels, the influence of background colors on the target model is avoided. Thus, the Auto GMM-SAMT tracking unit is much more robust than standard mean shift tracking. Because the kernel is adapted to the current object shape in each frame, Auto GMM-SAMT is able to track the shape of an object even if the object is performing out-of-plane rotations.
On the other hand Auto GMM-SAMT automates the initialization of the tracking algorithm using our improved GMM-based detection algorithm. Because of the limitation of the standard deviation and the consideration of temporal and spatial dependencies in the detection unit, the Auto GMM-SAMT system obtains good binary masks. Even uncovered background regions are classified as background relatively quickly due to the spatiotemporal adaptive detection method. Despite this fast adaptation to uncovered background areas, false positives caused by uncovered background regions might be contained in the masks for a few frames. But it is shown that the GMM-SAMT tracking method can also achieve good tracking results when initialized with binary masks of moderate quality, as long as the colors of object and (uncovered) background are not too similar. Otherwise Auto GMM-SAMT will deliver the first correct object contours after the uncovered background is correctly identified as such by the detection unit. Nevertheless, Auto GMM-SAMT can keep up with the stand-alone implementation of GMM-SAMT. In some cases Auto GMM-SAMT performs even better than GMM-SAMT due to the final shape refinement when comparing the tracking results with the background subtraction results. However, in the case of very similar foreground and background colors detection and tracking problems can occur.
In addition, the projective transformation of Auto GMM-SAMT can be considered only as a fast but very simple object classification. Since the classification is not reliable enough for a robust surveillance system, we will focus on other object features as well as on alternative classification techniques in our future work. The consideration of other object features could also help to improve the detection and tracking performance in case of very similar object and background colors. Besides we also plan to investigate the automation of the homography estimation to remove the manual calibration step.
This work has been supported by Gesellschaft für Informatik, Automatisierung und Datenverarbeitung (iAd) and the Bundesministerium für Wirtschaft und Technologie (BMWi), ID 20V0801I.
References
[1] Yilmaz A, Javed O, Shah M: Object tracking: a survey. ACM Computing Surveys 2006, 38(4):1-45.
[2] Borg M, Thirde D, Ferryman J, et al.: Video surveillance for aircraft activity monitoring. Proceedings of the IEEE International Conference on Advanced Video and Signal-Based Surveillance (AVSS '05), 2005, Como, Italy 16-21.
[3] Porikli F, Tuzel O: Human body tracking by adaptive background models and mean-shift analysis. Proceedings of the IEEE International Workshop on Performance Evaluation of Tracking and Surveillance (PETS-ICVS '03), 2003, Graz, Austria
[4] Stauffer C, Grimson WEL: Adaptive background mixture models for real-time tracking. Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '99), 1999, Miami, Fla, USA 2: 252-258.
[5] Stauffer C, Grimson WEL: Learning patterns of activity using real-time tracking. IEEE Transactions on Pattern Analysis and Machine Intelligence 2000, 22(8):747-757. doi:10.1109/34.868677
[6] Power PW, Schoonees JA: Understanding background mixture models for foreground segmentation. Proceedings of the Image and Vision Computing (IVCNZ '02), November 2002, Auckland, New Zealand 267-271.
[7] KaewTraKulPong P, Bowden R: An improved adaptive background mixture model for real-time tracking with shadow detection. Proceedings of the 2nd European Workshop Advanced Video Based Surveillance Systems (AVBS '01), 2001, Kingston upon Thames, UK 1:
[8] Carminati L, Benois-Pineau J: Gaussian mixture classification for moving object detection in video surveillance environment. Proceedings of the IEEE International Conference on Image Processing (ICIP '05), 2005, Genoa, Italy 3: 113-116.
[9] Li L, Huang W, Gu IY-H, Tian Q: Statistical modeling of complex backgrounds for foreground object detection. IEEE Transactions on Image Processing 2004, 13(11):1459-1472. doi:10.1109/TIP.2004.836169
[10] Yang SY, Hsu CT: Background modeling from gmm likelihood combined with spatial and color coherency. Proceedings of the IEEE International Conference on Image Processing (ICIP '06), 2006, Atlanta, Ga, USA 2801-2804.
[11] Comaniciu D, Ramesh V, Meer P: Real-time tracking of non-rigid objects using mean shift. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2000, Hilton Head, SC, USA 142-149.
[12] Comaniciu D, Ramesh V, Meer P: Kernel-based object tracking. IEEE Transactions on Pattern Analysis and Machine Intelligence 2003, 25(5):564-577. doi:10.1109/TPAMI.2003.1195991
[13] Collins RT: Mean-shift blob tracking through scale space. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR '03), 2003, Madison, Wis, USA 2: 234-240.
[14] Vilaplana V, Marques F: Region-based mean shift tracking: application to face tracking. Proceedings of the 15th IEEE International Conference on Image Processing (ICIP '08), 2008, San Diego, Calif, USA 2712-2715.
[15] Bradski GR: Computer vision face tracking for use in a perceptual user interface. Intel Technology Journal 1998, 2: 12-21.
[16] Qiao Q, Zhang D, Peng Y: An adaptive selection of the scale and orientation in kernel based tracking. Proceedings of the 3rd International IEEE Conference on Signal-Image Technologies and Internet-Based System (SITIS '07), 2007, Shanghai, China 1: 659-664.
[17] Yilmaz A: Object tracking by asymmetric kernel mean shift with automatic scale and orientation selection. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR '07), 2007, Minneapolis, Minn, USA 1-6.
[18] Aach T, Kaup A: Bayesian algorithms for adaptive change detection in image sequences using Markov random fields. Signal Processing 1995, 7(2):147-160.
[19] Quast K, Kaup A: Real-time moving object detection in video sequences using spatio-temporal adaptive gaussian mixture models. Proceedings of the International Conference on Computer Vision Theory and Applications (VISAPP '10), 2010, Angers, France
[20] Hartley RI, Zisserman A: Multiple View Geometry in Computer Vision. Cambridge University Press, New York, NY, USA; 2000.
[21] Quast K, Kaup A: Scale and shape adaptive mean shift object tracking in video sequences. Proceedings of the 17th European Signal Processing Conference (EUSIPCO '09), 2009, Glasgow, Scotland 1513-1517.
[22] Comaniciu D, Meer P: Mean shift: a robust approach toward feature space analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence 2002, 24(5):603-619. doi:10.1109/34.1000236
[23] Quast K, Kaup A: Shape adaptive mean shift object tracking using gaussian mixture models. Proceedings of the IEEE 11th International Workshop on Image Analysis for Multimedia Interactive Services (WIAMIS '10), 2010, Desenzano, Italy
[24] Dempster AP, Laird NM, Rubin DB, et al.: Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society. Series B 1977, 39(1):1-38.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.