A multi-cue spatio-temporal framework for automatic frontal face clustering in video sequences
 Simeon Schwab^{1, 2},
 Thierry Chateau^{1},
 Christophe Blanc^{1, 2} and
 Laurent Trassoudaine^{1}
DOI: 10.1186/1687-5281-2013-10
© Schwab et al.; licensee Springer. 2013
Received: 2 April 2012
Accepted: 23 November 2012
Published: 11 February 2013
Abstract
Clustering of specific object detections is a challenging problem for video summarization. In this article, we present a method to form tracks by grouping face detections of a video sequence. Our clustering method is based on a probabilistic maximum a posteriori data association framework, and we apply it to face detection in a visual surveillance context. An optimal solution is found with a procedure using network-flow algorithms described in previous pedestrian tracking-by-detection works. To address difficult cases of small detections in scenes with multiple moving people, given that face detections are located in a video sequence, we use dissimilarities involving appearance and spatio-temporal information. The main contribution is the use of optical flow or local front–back tracking to handle complex situations appearing in real sequences. The resulting algorithm is then able to deal with situations where people cross one another and face detections are scattered due to head rotation. The clustering step of our framework is compared to generic clustering methods (hierarchical clustering and affinity propagation) on several challenging real sequences; the evaluations indicate that it is better adapted to video-based detection clustering. We propose a new evaluation criterion, derived from the purity and inverse purity of a clustering estimation, to assess the performance of such methods. Results also show that optical flow and a skin color prior added to face detections improve the clustering quality.
Keywords
Clustering; Face detection; Multiple visual tracking; Optical flow; Maximum a posteriori
Introduction
Face detection on still images is becoming more and more common and efficient, yet use in real surveillance video sequences remains a big issue. Due to the large number of detections extracted from video, an automatic clustering of face detections is interesting for visual surveillance applications. For archive browsing or for face tagging on videos, it is easier to investigate with an album of faces than with a set of all the detected faces.
We propose a method to cluster specific object detections of a video sequence, which we applied to face detections. Our efforts focus on real visual surveillance constraints: cluttered, uncontrolled scenes containing multiple small faces.
In uncontrolled visual surveillance scenes, the use of a face recognition system remains complicated due to the poor quality of face images. It is for this reason that our method focuses on grouping face detections extracted from a video sequence and we do not address directly the face recognition problem. Our goal is to form tracks of face detections occurring in a whole video sequence.
Related work
Current face tagging systems present interesting results on TV shows/series or news videos. The main works [1–9] combine a face detector for still images, face tracking techniques, and a face recognition system with different probabilistic frameworks.
Some of them are semi-supervised: the study by Ramanan et al. [1] proposes a pre-clustering step to drastically reduce the hand-labeling time, and the study by Berg et al. [2] automatically corrects inaccurately and ambiguously labeled faces from news videos. In most of the cases studied, a recognition system can be used thanks to the detection quality, and external information is sometimes added. Sivic et al. [4] work with near-frontal faces and use other attributes such as hair or clothes to describe a person, and an extension of this method [3] uses subtitles to improve the labeling and naming of persons.
All of these works use news or TV shows/series videos as material for face tagging. However, in surveillance video scenes, people are not looking at the camera, close-up face views are rare, and many mutual occlusions occur. These are some of the reasons why face clustering remains extremely challenging in unconstrained visual surveillance situations.
Recent works on multi-object tracking can be seen as partial solutions to face clustering in video sequences. Such methods are generally divided into two main parts: (1) object detection and (2) data association. In these multi-object trackers, the data association problem is crucial, since it has to handle complex situations such as partial or total occlusion. In fact, the problem of multi-object tracking and clustering of detected moving objects has many similarities to our problem. Many works in multi-object tracking have proposed partial solutions [10–13] using sequential or global strategies to estimate object trajectories and overcome identification errors.
To track different objects, we often have to extrapolate the trajectories to predict the tracking during occlusions. One way to improve the quality of multi-object tracking is to add space–time information, and one currently used solution is short-term tracking. The studies [14–16] use a single object tracker with various probabilistic models and sampling.
For global tracking-by-detection, the main problem is to find a solution in a non-prohibitive computation time, because of the complexity of the possible detection combinations. To solve a data association problem, as occurs in multi-target tracking, many specific clustering algorithms are also employed: sampling with Monte Carlo methods [14, 15, 17], optimization methods like linear programming [18], the Hungarian algorithm [19, 20], or successive min-cost flows [21] on a graph. We chose to use this last method, and find a solution to the maximum a posteriori (MAP) problem by successive searches of min-cost flows on a graph [21]. Even though it restricts the problem model, this method is advantageous in that it finds an optimal solution in a reasonable computation time.
In this article, we employ the problem definition and solving method of [21]. To add a space–time cue, we extended face detections with forward and backward tracking. However, most of the time, short-term object trackers cannot actually handle occlusions, as they are just used to estimate the motion of the object to fill time gaps. That is why our idea is to use optical flow to add local space–time information. We also tested the use of a generic clustering method for the third stage of our method. As the dissimilarities employed in visual detection clustering are not necessarily metrics, we focus on non-metric clustering methods. We selected two clustering methods from all the algorithms available. The first is a classical and well-known method: hierarchical ascendant clustering. This agglomerative method builds a hierarchy according to dissimilarities between objects; this hierarchy can then be thresholded to define clusters. The other selected algorithm is affinity propagation [22–24]. This algorithm performs unsupervised classification by identifying a subset of representative exemplars, and is also employed in a visual understanding context.
Probabilistic data-association model
This section presents the background we employed to define the probabilistic model used to cluster detected faces. It is inspired by the model proposed by Nevatia and colleagues [21]; we modified its formulation to clarify the likelihood and a priori probability terms and adapted it to the video-based face clustering case. To obtain an optimal clustering according to our probabilistic MAP framework, we used their network-flow-based algorithm.
MAP data-association model
The model is based on a probabilistic framework where a state (set of associations) has to be estimated from observations given by the set of faces extracted from a video sequence. State and observations are random variables, and the objective is to obtain the MAP estimate of face associations.
Each observation (face detection) contains its position in the frame, its time in the sequence (frame index), the appearance of the detection area, and motion information (detailed in the next section). Let $\mathcal{Z}=\{z_i\}$ be the set of all the detections, where $z_i=(x_i,s_i,a_i,t_i)$ is a detection; $x_i$ represents the xy-position in pixels, $s_i$ its width in pixels, $a_i$ the appearance descriptor, and $t_i$ the time (frame index) in the video. We denote by $D$ the number of detections in $\mathcal{Z}$.
The state to be estimated is a set of trajectories $T=\{T_1,\dots,T_K,T_{\mathrm{FP}}\}$, where a trajectory is a set of detections: $T_k=\{z_{k_1},\dots,z_{k_{n_k}}\}$ ($k_i=j$ means that the $i$th element of trajectory $k$ is the detection $z_j$). The cluster denoted by $T_{\mathrm{FP}}$ represents the set of detections considered as false positives. As one detection can only belong to one trajectory and each detection is assigned to a trajectory, $T$ is in fact a clustering of the $D$ detections. $\mathcal{T}$ denotes the set of all the possible clusterings of the $D$ detections.
In [21], the detection likelihood $P(z_i \mid T)$ is represented with a Bernoulli distribution with a constant parameter. This parameter corresponds to the false positive rate of the face detector. We chose to delegate this term to the prior because it does not involve any observation, and we introduced a detection likelihood based on observations. This likelihood is described by the probability that an observation $z_i$ is a real face and not a false positive of the detector. A more detailed description of this probability is given in Section 5.
In the prior, $P_e$ represents the probability of starting a trajectory with a given detection (estimated as the number of people over the number of detections), $K$ is the number of trajectories in $T$, and $|T_{\mathrm{FP}}|$ is the number of detections considered as false positives. The $\beta$ parameter is the false positive rate of the detector, and $D$ denotes the total detection count.
Section 6 describes how we used dissimilarities (involving time, appearance, and movement) to define the $P_{\text{link}}$ probabilities.
As the face detector fails when the camera is not in front of the face, missed detections are not necessarily attributed to occlusion events and an explicit occlusion model (as in [21]) is not very suitable.
It is therefore complicated to set the $P_e$ parameter. However, the experiments in Section 7.3 show that, in practice, results are quite stable when $P_e$ varies, so we empirically set it to 1%.
Resolving the MAP
Computing all the solutions of the MAP model is in general a challenging task, mainly because of the computational complexity.
By using the first-order Markov chain hypothesis for the trajectory likelihood, the problem becomes tractable and an optimal solution to the MAP can be computed by successively solving min-cost flow problems on a specific graph [21]. Nodes of this graph represent detections, and a path on the directed graph represents a cluster. The cost of an arc between two nodes is set to the computed transition dissimilarity (the negative log-likelihood) between the two corresponding detections. So, for a given flow value, the min-cost flow determines the associations to be used for clustering. An optimal MAP solution is found by varying the flow value and iteratively solving min-cost flow problems.
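As an illustration of the graph search described above, the following is a minimal sketch in Python, not the authors' C++ implementation. It assumes the usual network-flow construction in which each detection is split into an "in" and an "out" node, and it adds one unit of flow at a time along the cheapest residual path until an extra track would no longer lower the total cost. The function name `map_tracks` and the cost arguments are hypothetical; the costs are assumed to be negative log-probabilities, and links are assumed to go forward in time only.

```python
import math

def map_tracks(det_cost, link_cost, entry_cost, exit_cost):
    """Group detections into tracks by successive shortest augmenting paths.

    det_cost[i]      : cost of keeping detection i (negative if likely a face)
    link_cost[(i,j)] : cost of linking i to a later detection j
    entry/exit_cost  : cost of starting/ending a track (e.g. -log P_e)
    All names are hypothetical; links must go forward in time (acyclic).
    """
    n = len(det_cost)
    src, snk, n_nodes = 2 * n, 2 * n + 1, 2 * n + 2
    edges = []                      # [head, residual capacity, cost]
    adj = [[] for _ in range(n_nodes)]

    def add_arc(u, v, cost):
        """Add a unit-capacity arc and its residual twin; return its id."""
        adj[u].append(len(edges)); edges.append([v, 1, cost])
        adj[v].append(len(edges)); edges.append([u, 0, -cost])
        return len(edges) - 2

    entry_id = []
    for i in range(n):              # detection i = nodes 2i (in) and 2i+1 (out)
        entry_id.append(add_arc(src, 2 * i, entry_cost))
        add_arc(2 * i, 2 * i + 1, det_cost[i])
        add_arc(2 * i + 1, snk, exit_cost)
    link_id = {(i, j): add_arc(2 * i + 1, 2 * j, c)
               for (i, j), c in link_cost.items()}

    while True:
        # Bellman-Ford handles the negative costs of the residual graph.
        dist = [math.inf] * n_nodes
        pred = [-1] * n_nodes
        dist[src] = 0.0
        for _ in range(n_nodes - 1):
            changed = False
            for u in range(n_nodes):
                if dist[u] == math.inf:
                    continue
                for e in adj[u]:
                    v, cap, cost = edges[e]
                    if cap > 0 and dist[u] + cost < dist[v] - 1e-12:
                        dist[v], pred[v], changed = dist[u] + cost, e, True
            if not changed:
                break
        if dist[snk] >= 0:          # one more track would not lower the cost
            break
        v = snk
        while v != src:             # push one unit of flow along the path
            e = pred[v]
            edges[e][1] -= 1
            edges[e ^ 1][1] += 1
            v = edges[e ^ 1][0]

    # Saturated arcs carry flow: follow them from each used entry arc.
    nxt = {i: j for (i, j), e in link_id.items() if edges[e][1] == 0}
    tracks = []
    for i in range(n):
        if edges[entry_id[i]][1] == 0:
            track = [i]
            while track[-1] in nxt:
                track.append(nxt[track[-1]])
            tracks.append(track)
    return tracks


# Toy example: detections 0->2 and 1->3 are cheap links, 4 is an unlikely face.
tracks = map_tracks(
    det_cost=[-4.0, -4.0, -4.0, -4.0, 2.0],
    link_cost={(0, 2): 0.5, (1, 3): 0.5, (0, 3): 3.0, (1, 2): 3.0, (2, 4): 4.0},
    entry_cost=3.0, exit_cost=3.0)
```

In this toy example, detections left out of every track (here detection 4) play the role of the false positive cluster. Each added unit of flow corresponds to one more trajectory, which mirrors the idea of varying the flow value.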
Detection extension with movement information
In this section, we present the manner in which we add space–time information to face detections. Two ways of representing space–time cues are proposed: (1) a basic short-term tracker and (2) an optical flow estimation. The two approaches are compared in Section 7.3.2.
Short-term tracking
To overcome hard situations due to long periods of undetected faces, we first propose to extend detections by using short-term backward and forward tracking. This algorithm provides additional space–time information to help in situations involving two detections distant in time. The tracking system is based on the estimation of the optimal position and size of the reference face (which is the patch of the detection in the present frame) at a further frame. The optimization procedure uses a cost function based on the appearance dissimilarity with the reference patch. The search domain is bounded by priors: a maximum velocity and scale factor. The core of the optimization is a simplex-based Nelder–Mead method with the previous estimation as starting point. Tracking is performed for each detection, both backward and forward in video time.
Each detection $z_i$ is thus extended into a tracklet $\tilde{z}_i=\{z_i^1,\dots,z_i^{N_i}\}$, where $z_i^k=(x_i^k,s_i^k,t_i^k)$ are the positions and sizes estimated with the short-term tracking and where $N_i$ is the number of elements in the tracklet of detection $i$. The $z_i^k$ are sorted by frame time, and one of them is the previously defined $z_i$.
Optical flow
Another way to take into account space–time information is to use optical flow. There are many methods to compute an optical flow between two frames; one of the main issues is to represent a large range of displacements [25–27]. In our case, we used a pyramidal version of the Lucas–Kanade algorithm to represent both small and large displacements, as explained in [26].
In each frame having a detection, we compute the optical flow from the previous to the current frame and from the current to the next frame, and these two optical flows are then averaged. The resulting speed vector of a detection is obtained by taking the most representative flow vector in the detection area. This vector is added to the observation $z_i$.
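The averaging and selection steps can be sketched as follows. The text does not specify how the "most representative" vector is chosen, so this sketch assumes a component-wise median over the detection area, which is only one plausible choice. The function name `detection_flow` and the data layout (dense flow fields stored as rows of `(dx, dy)` pairs, boxes as `(x, y, w, h)`) are hypothetical.

```python
from statistics import median

def detection_flow(flow_prev, flow_next, box):
    """Return one (dx, dy) speed vector for a detection box.

    flow_prev : dense flow from the previous to the current frame
    flow_next : dense flow from the current to the next frame
    box       : (x, y, w, h) of the detection, in pixels
    The two flows are averaged per pixel; the component-wise median over the
    box is an assumed stand-in for the "most representative" vector.
    """
    x, y, w, h = box
    dxs, dys = [], []
    for r in range(y, y + h):
        for c in range(x, x + w):
            (ax, ay), (bx, by) = flow_prev[r][c], flow_next[r][c]
            dxs.append((ax + bx) / 2.0)
            dys.append((ay + by) / 2.0)
    return (median(dxs), median(dys))
```

A median is robust to a few background pixels inside the box, which is why it is used here instead of a plain mean.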
Detection likelihood
The probability of being a face ($P_f$) given an appearance $a_i$ is estimated by the proportion of skin color pixels over the detected face patch. The skin color segmentation is simply done with fixed colorimetric boundaries [28]. Due to ethnic skin differences and colorimetric noise, skin color detection is far from being the best representation of skin proportion, but it still adds information in most cases. To limit the exclusion of true detections without skin color pixels, we threshold $P_f(a_i)$ to a minimum of 0.01 instead of 0.
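The skin proportion and its lower bound can be sketched as below. The fixed boundaries used here are a commonly used RGB rule chosen for illustration; they are an assumption and not necessarily the boundaries of [28]. The names `is_skin` and `face_probability` are hypothetical, and a patch is assumed to be a list of rows of `(r, g, b)` values in 0..255.

```python
def is_skin(r, g, b):
    """Fixed RGB boundaries (illustrative rule, not necessarily that of [28])."""
    return (r > 95 and g > 40 and b > 20
            and max(r, g, b) - min(r, g, b) > 15
            and abs(r - g) > 15 and r > g and r > b)

def face_probability(patch, floor=0.01):
    """Proportion of skin-colored pixels in the patch, never below `floor`."""
    pixels = [p for row in patch for p in row]
    if not pixels:
        return floor
    ratio = sum(1 for p in pixels if is_skin(*p)) / len(pixels)
    return max(floor, ratio)
```

The floor of 0.01 keeps a detection without any skin-colored pixel from being excluded outright, as described above.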
Another way to improve the detector is to add the prior that a detection is on the foreground by using background extraction techniques. The drawback of this method is that it cannot handle moving objects such as pedestrian clothes or vehicles, and background extraction is complicated with nonfixed cameras.
Detection dissimilarities
with $\tilde{d}_{\mathrm{a}}$ the appearance dissimilarity, $\tilde{d}_{\mathrm{m}}$ the motion dissimilarity, and $\tilde{d}_{\mathrm{t}}$ the temporal dissimilarity.
where $x$ is a, m, s, or t and $\sigma_x$ is the standard deviation estimated with all the $d_x(f,g)$ with $(f,g)\in \mathcal{Z}\times \mathcal{Z}$. The next sections describe the different $d_x$ dissimilarities.
Appearance dissimilarity
Detection appearance is represented by an HSV histogram [29]. This histogram is the concatenation of a 2D HS histogram and a 1D V histogram of image pixels (where H, S, and V represent hue, saturation, and value of a color, respectively). If the S and V values are large enough for a pixel, the pixel is counted in the HS histogram; otherwise it is counted in the V histogram. To measure the dissimilarity between two HSV histograms, we used the Bhattacharyya coefficient.
When only the face detection area is considered, color information is not sufficient to distinguish two different faces. We therefore extended the face detection to an area under the head, in order to retrieve color information from the pedestrian's clothes. This is done by doubling the detection area downward.
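The histogram construction and its comparison can be sketched as follows. The bin counts, the saturation and value thresholds, and the conversion of the Bhattacharyya coefficient into a dissimilarity via $\sqrt{1-\mathrm{BC}}$ are assumptions made for this illustration; the names `extend_box_down`, `hsv_histogram`, and `appearance_dissimilarity` are hypothetical.

```python
import colorsys
import math

def extend_box_down(box):
    """Double the box height downward to include clothes under the head."""
    x, y, w, h = box
    return (x, y, w, 2 * h)

def hsv_histogram(pixels, n_h=8, n_s=8, n_v=8, s_min=0.1, v_min=0.2):
    """Normalized HS (2D) + V (1D) histogram of (r, g, b) pixels in 0..255.

    Bin counts and thresholds are illustrative values, not the paper's.
    """
    hist = [0.0] * (n_h * n_s + n_v)
    for r, g, b in pixels:
        h, s, v = colorsys.rgb_to_hsv(r / 255.0, g / 255.0, b / 255.0)
        if s > s_min and v > v_min:
            # Colorful, bright enough pixel: counted in the HS part.
            hi = min(int(h * n_h), n_h - 1)
            si = min(int(s * n_s), n_s - 1)
            hist[hi * n_s + si] += 1.0
        else:
            # Dark or gray pixel: hue is unreliable, counted in the V part.
            hist[n_h * n_s + min(int(v * n_v), n_v - 1)] += 1.0
    total = sum(hist)
    return [c / total for c in hist] if total else hist

def appearance_dissimilarity(p, q):
    """sqrt(1 - BC), with BC the Bhattacharyya coefficient (assumed form)."""
    bc = sum(math.sqrt(a * b) for a, b in zip(p, q))
    return math.sqrt(max(0.0, 1.0 - bc))
```

With this form, identical histograms give 0 and histograms with no common bin give 1.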
Space–time dissimilarities
For dissimilarities involving position in the frame and frame time, we define four dissimilarities: two for the motion (based on tracklets or optical flows), one for the speed (in pixels per frame), and one for the time. If detections are extended with tracklets ($\tilde{z}_i$ instead of $z_i$), the tracklet dissimilarity is employed; if not, the optical flow is estimated and the optical flow dissimilarity is used.
Tracklet dissimilarity
The motion dissimilarity between two trajectories is obtained by averaging the two acquired distances, as shown in Figure 2. This average is weighted by the number of positions used to estimate the speed. In practice, to account for possible high accelerations, the number of finite differences used to estimate the speed is limited.
If the two tracklets overlap (i.e., at least one date in common), the motion dissimilarity is computed from the spatial average of position distances on common frames.
Given $T_{\text{inter}}^{ij}=\{t_i^1,\dots,t_i^{N_i}\}\cap \{t_j^1,\dots,t_j^{N_j}\}$ the frame intersection and assuming that $z_i^1$ is before $z_j^1$, the movement dissimilarity is defined as follows.

If there is no overlap (i.e., $T_{\text{inter}}^{ij}=\varnothing$):
$d_m(\tilde{z}_i,\tilde{z}_j)=\frac{K_i\, d_{\text{pos}}(\hat{z}_i, z_j^1)+K_j\, d_{\text{pos}}(\hat{z}_j, z_i^{N_i})}{K_i+K_j}$ (10)
where $\hat{z}_i$ is the forward extrapolation of $\tilde{z}_i$, $\hat{z}_j$ the backward extrapolation of $\tilde{z}_j$, and $K_i$ (resp. $K_j$) is the number of elements used to estimate the speed from $\tilde{z}_i$ (resp. $\tilde{z}_j$). In practice, we take $K_i=\min(10,N_i)$ for our experiments.

If the tracklets overlap:
$d_m(\tilde{z}_i,\tilde{z}_j)=\frac{\sum_{t\in T_{\text{inter}}^{ij}} d_{\text{pos}}\left(z_i^{k^i(t)}, z_j^{k^j(t)}\right)}{\left|T_{\text{inter}}^{ij}\right|}$ (11)
This normalization is done to be closer to a spatial overlap measure than the simple Euclidean distance is.
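The two cases of the tracklet dissimilarity, Equations (10) and (11), can be sketched as below. The definition of $d_{\text{pos}}$ is not available in this text, so the sketch assumes a center distance divided by the mean detection width. The speed is assumed to be estimated linearly from the first and last of the $K$ points used, and tracklets are assumed to cover contiguous frames. All function names are hypothetical, and a tracklet is a time-sorted list of `(x, y, s, t)` tuples.

```python
import math

def d_pos(a, b):
    """Assumed position distance: center distance scaled by the mean width."""
    return math.hypot(a[0] - b[0], a[1] - b[1]) / ((a[2] + b[2]) / 2.0)

def extrapolate(points, t):
    """Linearly extrapolate time-sorted (x, y, s, t) points to time t."""
    first, last = points[0], points[-1]
    dt = last[3] - first[3]
    vx = (last[0] - first[0]) / dt if dt else 0.0
    vy = (last[1] - first[1]) / dt if dt else 0.0
    ref = last if t >= last[3] else first   # forward or backward in time
    return (ref[0] + vx * (t - ref[3]), ref[1] + vy * (t - ref[3]), ref[2], t)

def tracklet_dissimilarity(ti, tj, k_max=10):
    """Motion dissimilarity between two tracklets, Eqs. (10) and (11)."""
    if ti[0][3] > tj[0][3]:                 # make ti the one starting first
        ti, tj = tj, ti
    by_time_i = {p[3]: p for p in ti}
    common = [p for p in tj if p[3] in by_time_i]
    if common:                              # Eq. (11): mean over common frames
        return sum(d_pos(by_time_i[p[3]], p) for p in common) / len(common)
    ki, kj = min(k_max, len(ti)), min(k_max, len(tj))
    zi_hat = extrapolate(ti[-ki:], tj[0][3])    # ti pushed forward to tj start
    zj_hat = extrapolate(tj[:kj], ti[-1][3])    # tj pushed backward to ti end
    return (ki * d_pos(zi_hat, tj[0]) + kj * d_pos(zj_hat, ti[-1])) / (ki + kj)
```

Two tracklets that lie on the same constant-speed trajectory get a dissimilarity of 0 even when they are separated by a gap of undetected frames.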
Optical flow dissimilarity
where $\vec{v}_i$ is the estimated optical flow vector of detection $i$.
Speed dissimilarity
Time dissimilarity
where $\Delta t$ is empirically set (50 frames for our experiments).
Evaluation
This section presents the evaluation of the proposed method on several challenging videos.
Evaluation criteria
There are several ways to measure clustering quality: intrinsic methods (by measuring the proximity of elements inside a cluster and the proximity between clusters) and extrinsic methods that use manual ground truth classification.
As shown by Amigó et al. [30], there are many ways to extrinsically evaluate clustering involving different quality measures, such as good and bad pair counting, purity, entropy measures, etc.
We propose an evaluation measurement based on purity and inverse purity. We define the estimated clustering as the clustering obtained by a clustering algorithm, as opposed to the ground-truth clustering manually produced by a human expert. Moreover, the purity is called estimation purity (EP) and the inverse purity ground-truth purity (GTP).

EP: $\text{EP}=\frac{1}{D}\sum_{k}\max_{j}\left|E_k\cap \text{GT}_j\right|$ (17)
where $D$ is the number of detections, $\text{GT}=\{\text{GT}_j\}$ the ground-truth clustering, and $E=\{E_k\}$ the estimated clustering. The higher the EP, the fewer covering errors there are. We refer to a covering error when a cluster includes the faces of different people.

GTP: $\text{GTP}=\frac{1}{D}\sum_{k}\max_{j}\left|\text{GT}_k\cap E_j\right|$ (18)
GTP shows the proportion of well-represented detections (i.e., it measures the fact that there are few people represented by multiple estimated clusters).
This measure is used in our experiments to compare the estimated clustering with the ground-truth clustering.
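Equations (17) and (18) can be computed as in the sketch below. The combination of EP and GTP into a single F-measure is assumed here to be their harmonic mean, which is the usual definition; the exact formula is not available in this text. The function names are hypothetical, and each clustering is a list of clusters of detection indices.

```python
def purity(clusters, reference):
    """(1/D) * sum over clusters of the best overlap with a reference cluster."""
    total = sum(len(c) for c in clusters)
    return sum(max(len(set(c) & set(r)) for r in reference)
               for c in clusters) / total

def evaluate(estimated, ground_truth):
    """Return (EP, GTP, F); F is assumed to be the harmonic mean of EP and GTP."""
    ep = purity(estimated, ground_truth)    # Eq. (17)
    gtp = purity(ground_truth, estimated)   # Eq. (18)
    return ep, gtp, 2 * ep * gtp / (ep + gtp)


# Two people with three detections each; the estimate splits them in three.
ep, gtp, f = evaluate([[0, 1], [2, 3], [4, 5]], [[0, 1, 2], [3, 4, 5]])
```

In this example the mixed cluster `[2, 3]` lowers EP to 5/6, and the fact that each person is spread over several estimated clusters lowers GTP to 4/6.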
Experiments
Dataset
Table 1 Dataset for experiments
Video  Passages  Frames  Detections  FP (%)  Size
1      24        1934    1725        2.78    35
2      6         307     200         11.5    35
3      7         384     920         2.61    36
4      6         485     463         3.46    58
5      29        1966    1794        1.56    63
6      14        5042    1299        22.17   32
Additional file 1: Video 1. (AVI 8 MB)
Additional file 2: Video 2. (AVI 2 MB)
Additional file 3: Video 3. (AVI 2 MB)
Additional file 4: Video 4. (AVI 7 MB)
Experimental procedures
In the following experiments, we compare the clustering stage in the MAP framework with two generic clustering methods using the same dissimilarity matrices. The first one is hierarchical clustering with single-link, and the second one is a relatively new method based on affinity propagation between elements [22–24].
Then, we compare different movement dissimilarities: one based on optical flow, another with forward and backward tracking, and the last one with just the pixel distance between detections. We also present some results showing the impact of the skin color term $P_f$ in the detection appearance likelihood.
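For reference, the single-link baseline can be sketched in a few lines. Cutting a single-link hierarchy at a threshold yields the connected components of the graph that links every pair of detections whose dissimilarity does not exceed the threshold, so the sketch uses a union-find structure instead of building the full hierarchy. The function name and the threshold value are hypothetical.

```python
def single_link_clusters(dissim, threshold):
    """Clusters of a single-link hierarchy cut at `threshold`.

    dissim is a symmetric matrix (list of lists) of pairwise dissimilarities.
    """
    n = len(dissim)
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path halving
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if dissim[i][j] <= threshold:
                parent[find(i)] = find(j)   # merge the two components
    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Detections 0-1 and 1-2 are close, 0-2 is far: single-link still chains them.
clusters = single_link_clusters(
    [[0, 1, 5, 9], [1, 0, 1, 9], [5, 1, 0, 9], [9, 9, 9, 0]], threshold=2)
```

The example shows the chain effect: detections 0 and 2 end up together only because both are close to detection 1.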
Results
Performance of the MAP clustering method
Table 2 Best performances reached by three clustering algorithms with the same dissimilarity matrix
Video  MAP   HAC   AP
1      90.9  88.9  76.5
2      81.5  73.4  67.5
3      76.5  67.3  57.6
4      98.1  88.8  82.3
5      98.1  92.3  81.9
6      77.7  79.3  54.0
Results show that MAP clustering leads to better performance than the two other clustering methods on five of the six videos (hierarchical clustering is slightly better on video 6). We can also see that hierarchical clustering outperforms affinity propagation. Hierarchical clustering with single-link often suffers from the chain effect, but in our case this effect is not so problematic, mostly because one detection has a high affinity with two other detections: one in the previous frames and the other in the next frames. This naturally gives clusters a chain shape. Affinity propagation selects exemplars in each cluster, so the clusters are grouped around exemplars. This cluster structure is not as suited to our application as the chain one.
The performance of the MAP clustering seems to come from the fact that it is better suited to the video-based detection situation. The two other methods just use dissimilarities and no prior on cluster shape, while the presented MAP model uses a first-order Markov chain to model clusters and handles false positives in a specific way.
Performance of our dissimilarity measure
Table 3 F-measure (in %) of different methods and videos; mean (standard deviation)
Video  Basic       Tracklet     OF          No prior
1      80.4 (6.9)  76.6 (8.5)   85.8 (2.8)  79.4 (6.7)
2      70.8 (6.5)  72.1 (2.2)   72.5 (5.6)  66.4 (4.6)
3      69.2 (2.9)  63.7 (2.5)   68.7 (2.5)  67.1 (5.6)
4      87.8 (6.7)  77.2 (10.5)  88.5 (7.2)  87.6 (6.6)
5      94.8 (2.3)  94.4 (4.2)   95.4 (2.2)  94.6 (2.7)
6      75.8 (5.9)  74.4 (4.3)   77.5 (6.0)  72.8 (3.5)
The mean and standard deviation of the F-measures are estimated over results obtained by varying the $P_e$ parameter, which is the probability to start (or stop) a cluster at a given detection. This parameter is used to force the number of clusters. For these experiments, we take 100 values of $P_e$ between 0.04 and 25%. Results show that the quality does not appear to be very sensitive to this large variation of $P_e$.
Although the dataset videos have various false positive rates (from 1.56 to 22.17% in the dataset table), we arbitrarily set the false positive parameter ($\beta$) to 1% for all the videos. This is done to avoid manually adjusting an a priori parameter for each video.
Table 3 shows that the use of optical flow dissimilarities gives better results compared to the use of dissimilarities based on tracklets. We can also see that the prior information based on skin color pixels improves results due to a better estimation of false positive cluster.
Concerning the running time, for Additional file 1: Video 1 (1934 frames and 1725 detections), the face detector takes 8 min 30 s to process all the frames, the feature extraction stage (optical-flow version) takes 1 min 8 s, the computation of the dissimilarity matrix takes 8 s, and the optimization used to find the optimal clustering takes 20 s. We used a C++ implementation and ran the test on a 2.4-GHz processor without parallelization. These figures give an overview of the different processing times and indicate that most of the computation time is spent in detection and image processing tasks.
Synthesis
In most of the presented experiments, the method based on MAP clustering is better suited to our problem than hierarchical clustering or affinity propagation. In most cases, the movement dissimilarity based on optical flow improves the results more than the tracklet dissimilarity does. Using a skin color term in the face likelihood enhances clustering quality by improving the quality of the false positive cluster.
Conclusions
This article proposes a method to cluster face detections in challenging video sequences. Our method relies on a data-association framework and solves a MAP problem. In the case of a frontal face detector, where detections are particularly sparse due to head rotations, experiments show that adding movement information to detection dissimilarities improves the results. Two different approaches are tested: the first based on short-term tracking and the second using optical flow extraction. We also present a new criterion to evaluate the performance of the resulting clustering.
Although our method has not reached the quality required for visual surveillance applications, we present a starting point for a video face summarization system based on tracking-by-detection, in scenes where automatic face recognition remains a challenging issue.
Consent
Consent was obtained from the persons appearing in the videos 1 to 5 used for this publication.
Endnote
^{a}AVSS AB Hard from http://www.eecs.qmul.ac.uk/~andrea/avss2007_d.html.
Declarations
Acknowledgements
This study was supported by Vesalis and ANRT (France). We thank Institut Pascal and the BioRafale consortium. We also used the video AVSS AB Hard from the i-LIDS dataset for AVSS 2007.
References
1. Ramanan D, Baker S, Kakade S: Leveraging archival video for building face datasets. In 11th IEEE International Conference on Computer Vision (Rio de Janeiro, 14–21 Oct 2007). IEEE.
2. Berg TL, Berg AC, Edwards J, Maire M, White R, Teh YW, Learned-Miller E, Forsyth DA: Names and faces in the news. IEEE Conference on Computer Vision and Pattern Recognition (Berkeley, 27 June–2 July 2004), pp. 848–854.
3. Everingham M, Sivic J, Zisserman A: Taking the bite out of automated naming of characters in TV video. In Elsevier Image and Vision Computing. Elsevier; 2009:545–559. [http://www.elsevier.com]
4. Sivic J, Everingham M, Zisserman A: Person spotting: video shot retrieval for face sets. In International Conference on Image and Video Retrieval. Singapore; 2005:226–236.
5. Arandjelovic O, Zisserman A: Automatic face recognition for film character retrieval in feature-length films. In IEEE Conference on Computer Vision and Pattern Recognition. Oxford; 2005:860–867.
6. Nechyba MC, Brandy L, Schneiderman H: Pittpatt face detection and tracking for the clear 2007 evaluation. Multimodal Technologies for Perception of Humans: International Evaluation Workshops CLEAR 2007 and RT 2007 2008, 126–137.
7. Nechyba MC, Schneiderman H: Pittpatt face detection and tracking for the clear 2006 evaluation. In Multimodal Technologies for Perception of Humans: First International Evaluation Workshop on Classification of Events, Activities and Relationships. CLEAR; 2007:161–170. [http://www.clearevaluation.org/]
8. Zhao M, Yagnik J, Adam H, Bau D: Large scale learning and recognition of faces in web videos. IEEE International Conference on Automatic Face and Gesture Recognition (Amsterdam, 17–19 Sept 2008).
9. Kim M, Kumar S, Pavlovic V, Rowley H: Face tracking and recognition with visual constraints in real-world videos. IEEE Conference on Computer Vision and Pattern Recognition (Anchorage AK, 23–28 June 2008).
10. Bardet F, Chateau T, Ramadasan D: Illumination aware mcmc particle filter for long-term outdoor multi-object simultaneous tracking and classification. In 12th IEEE International Conference on Computer Vision. Kyoto; 2009:1623–1630.
11. Dubuisson S, Fabrizio J: Optimal recursive clustering of likelihood functions for multiple object tracking. In Elsevier Pattern Recognition Letters. Elsevier; 2009:606–614. [http://www.elsevier.com]
12. Ess A, Leibe B, Schindler K, Van Gool L: A mobile vision system for robust multi-person tracking. IEEE Conference on Computer Vision and Pattern Recognition (Anchorage AK, 23–28 June 2008).
13. Leibe B, Schindler K, Van Gool L: Coupled detection and trajectory estimation for multi-object tracking. IEEE Proceedings of 11th International Conference on Computer Vision (Zurich, 14–21 Oct 2007).
14. Benfold B, Reid I: Stable multi-target tracking in real-time surveillance video. IEEE Conference on Computer Vision and Pattern Recognition (Oxford, 20–25 June 2011), pp. 3457–3464.
15. Ge W, Collins R: Multi-target data association by tracklets with unsupervised parameter estimation. In British Machine Vision Conference. Leeds; 2008.
16. Song B, Jeng TY, Staudt E, Roy-Chowdhury AK: A stochastic graph evolution framework for robust multi-target tracking. In Springer 11th European Conference on Computer Vision. Heraklion; 2010:605–619.
17. Yu Q, Medioni G: Multiple-target tracking by spatiotemporal monte carlo markov chain data association. In IEEE Transactions on Pattern Analysis and Machine Intelligence. IEEE; 2009:2196–2210. [http://www.ieee.org]
18. Berclaz J, Fleuret F, Fua P: Multiple object tracking using flow linear programming. In Twelfth IEEE International Workshop on Performance Evaluation of Tracking and Surveillance. Lausanne; June 2009.
19. Huang C, Wu B, Nevatia R: Robust object tracking by hierarchical association of detection responses. In Proceedings of the 10th European Conference on Computer Vision: Part II. Marseille; 2008:788–801.
20. Stergiou A, Karame G, Pnevmatikakis A, Polymenakos L: The ait 2d face detection and tracking system for clear 2007. In Multimodal Technologies for Perception of Humans: International Evaluation Workshops CLEAR 2007 and RT 2007. Baltimore; 2008:113–125.
21. Zhang L, Li Y, Nevatia R: Global data association for multi-object tracking using network flows. IEEE Conference on Computer Vision and Pattern Recognition (Los Angeles, 23–28 June 2008).
22. Dueck D, Frey BJ: Non-metric affinity propagation for unsupervised image categorization. IEEE International Conference on Computer Vision (Toronto, 14–21 Oct 2007).
23. Frey BJ, Dueck D: Clustering by passing messages between data points. In American Association for the Advancement of Science. Science; 2007:972–976. [http://www.sciencemag.org/]
24. Lu Z, Carreira-Perpiñán MA: Constrained spectral clustering through affinity propagation. IEEE International Conference on Computer Vision (Portland, 23–28 June 2008).
25. Brox T, Bregler C, Malik J: Large displacement optical flow. IEEE Conference on Computer Vision and Pattern Recognition (Berkeley, 20–25 June 2009), pp. 41–48.
26. Marzat J, Dumortier Y, Ducrot A: Real-time dense and accurate parallel optical flow using cuda. In Proceedings of The 17th International Conference in Central Europe on Computer Graphics, Visualization and Computer Vision (WSCG). WSCG; 2009. [http://www.wscg.eu/]
27. Sun D, Sudderth EB, Black MJ: Layered image motion with explicit occlusions, temporal consistency, and depth ordering. Advances in Neural Information Processing Systems 2010, 2226–2234.
28. Rahman N, Wei K, See J: RGB-H-CbCr skin colour model for human face detection. In Proceedings of The MMU International Symposium on Information and Communications Technologies. Springer; 2006. [http://www.springer.com/]
29. Perez P, Hue C, Vermaak J, Gangnet M: Color-based probabilistic tracking. In 7th European Conference on Computer Vision. Copenhagen; 2002:661–675.
30. Amigó E, Gonzalo J, Artiles J, Verdejo F: A comparison of extrinsic clustering evaluation metrics based on formal constraints. In Springer Information Retrieval. Springer; 2009:461–486. [http://www.springer.com/]
31. Viola P, Jones M: Rapid object detection using a boosted cascade of simple features. In IEEE Conference on Computer Vision and Pattern Recognition. Cambridge; 2001:511–518.
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.