Most tracking algorithms impose constraints on the motion and appearance of objects such as the prior knowledge of motion model, the number and size or the shape of objects. Various approaches have been proposed so far including the mean-shift, the Kalman filtering and particle filter. The mean-shift based tracker iteratively shifts a data point to the average of data points in its neighborhood, which minimizes the distance between a model histogram representing the target and candidate histograms computed on the current frame. However, it ignores the motion information and is difficult to recover from temporary tracking failures. The Kalman filter is the minimum-variance state estimator for linear dynamic systems with Gaussian noise [21]. For the visual object which moves rapidly, it is hard, in general, to implement the optimal state estimator in closed form [22]. Various modifications of the Kalman filter can be used to estimate the state. These modifications include the extended Kalman filter [23], and the unscented Kalman filter [24]. A multi-step tracking framework was also introduced in [25] to track facial landmarks points under head rotations and facial expressions. The Kalman filter was used to predict the locations of landmarks and a better performance was achieved. However, there are some shortcomings for Kalman filter to track the landmarks of facial expressions, such as the nonlinearity of the head motions, the unimodality of the Kalman, the inherent tracking delay, etc.
Over the last few years, there has been immense attention on particle filters for image tracking because of their simplicity, flexibility, and systematic treatment of nonlinearity and non-Gaussianity. Particle filters provide a convenient Bayesian filtering framework of integrating the detector into the tracker. Based on point mass representations of probability densities, particle filters operate by propagating the particle estimation and can be applied to any state-space model [26–29]. However the sampling results from the proposal density are assigned with low weights and a large number of the particles are wasted in areas with small likelihood. To track the state of a temporal event with a set of noisy observations, the main idea is to maintain a set of solutions that are an efficient representation of the conditional probability. However a large amount of particles that result from sampling from the proposal density might be wasted because they are propagated into areas with small likelihood. Some of the existing works ignore the fact that, while a particle might have low likelihood, parts of it might be close to the correct solution. The estimation of the particle weights does not take into account the interdependences between the different parts of the state of a temporal event.
Particle filter can use multi-modal likelihood functions and propagate multi-modal posterior distributions [30, 31]. There are two basic schemes: sending the output of the detector into the measurement likelihood [32, 33], or applying a mixture proposal distribution by combining the dynamic model with the output of the detector [34]. However, directly applying particle filter on multiple objects tracking is not feasible because the standard particle filter does not define a way to identify individual modes or hypotheses. Some researchers used sequential state estimation techniques to track multiple objects [35]. Patras and Pantic applied auxiliary particle filtering with factorized likelihoods for tracking of facial points [27]. Zhao et al. [36] introduced a method for tracking of facial points with multi cue particle filter. They have incorporated information from both color and edge of facial features and proposed the point distribution model for constraint tracking results and avoid tracking fails during occlusion. The standard particle filter has a common problem that it turns out to be inadequate when the dynamic system has a very low process noise, or if the observation noise has very small variance [34]. The reason is due to its defective sampling strategy with large dimensionality of the state space. After a few iterations, the particle set will collapse to one single point [31]. Therefore, the resampling method is applied to eliminate particles that have small weights and to concentrate on particles with large weights. It has been realized that improving the resampling or global optimization strategy is more decisive to the success of the tracking [30].
In this paper, we use multiple DE-MC particle filters to track the facial landmarks through the video sequence depending on the locations of the current appearance of the spatially sampled features.
3.1 DE-MC particle filter
The particle filter provides a robust Bayesian framework for the visual tracking problem. It maintains a particle based representation of the a posteriori probability p(X
k
|Y
1 : k
) of the state X
k
given all the observations Y
1 : k
= {Y
1, Y
2, …, Y
k
} up to and including the current time, k, instance, according to:
(6)
In (6), the state X
k
is a 2 M-component vector that represents the location of landmarks, the observation Y
1:k
is the set of image frames up to the current time instant. The normalization constant λ
k
is independent of X
k
. The motion model p(X
k
|X
k - 1) is conditioned directly on the immediate preceding state and independent of the earlier history if the motion dynamics are assumed to form a temporal Markov chain. The distribution is represented by discrete samples N through particle filtering. The N samples (particles) are drawn from a proposed distribution , i = 1,2,…,N and assigned with weights .
Suppose that at a previous time instance k – 1, we have a particle based representation of the density, that is, we have a collection of N particles and their corresponding weights . At time step k, select a new set of samples from with the probability proportional to . The samples with a larger weight should be selected with a higher probability. Then, applying a constant velocity dynamical model to the samples yields:
(7)
where is a new set of samples selected at time k, and V
k-1 is the velocity vector computed in time step k-1.
The particle set acts as the initial N population for a T-iteration DE-MC processing. For any one landmark in the T-iteration processing, two different integers, r
1
r
2 that r
1 ≠ r
2 ≠ k, are randomly chosen from the population of previous iteration. A new member that is created, where λ is a scalar whose value is found to be optimal when , g is drawn from a symmetric distribution with small variance compared to that of . A target function is given based on the ratio between the populations of current and previous step until a convergence or a preset end point is reached. Then the weights of particles are subject to update by the DE-MC. At the end of this step, we take the output population as the particle set of current time step .
We estimate the state at time step k as:
(8)
and update the velocity vector of current time step V
k
= X
k
- X
k - 1. The step size of random jumping for current DE-MC iteration is reduced if the survival rate of the last DE-MC iteration is high or inflated otherwise [37]. The update scheme for the maximum likelihood decision on the weights w can be summarized as follows:
Starting from the set of particles which are the filtering result of time step k – 1: .
-
1.
Selection: select a set of samples from with the probability proportional to .
-
2.
Prediction and Measurement: Apply a constant velocity dynamical model to the samples using Eq. (7). At the end of this step, we take the output population as the particle set of current .time step that .
-
3.
Representation and Velocity Updating: Estimate the state at time step k by Eq. (8) and update the velocity vector of current time step.
While the tracker updates and tracks the X
k
vector that represents the coordinates of the 26 landmark points, the samples are already drawn. The DE-MC particle filter is able to make a more reasonable sampling and keeps them from running off into implausible shapes even if they are placed in the positions far away from the solution point or are trapped in the local cost basin of the state space. The observation model can help the sample points for positions close to the solution in regard to their starting points. The measurement module provides necessary feedback to the sampling module, according to which, the hypothesis moves to the regions where it is more likely for the global maximum of the measurement function to be found.
3.2 Kernel correlation-based observation likelihood
The kernel correlation based on Hue Saturation Value (HSV) color histograms is used to estimate the observation likelihood and measure the correctness of particles, since HSV decouples the intensity (value) from color (hue and saturation) and corresponds more naturally to human perception [38]. We set each feature point at the centre of a window as the observation model. The kernel density estimate (KDE) K(X
k
) for the color distribution of the object X
k
at time step k is given as:
(9)
where the c(.) function is a three dimensional vector of HSV and can be generated from the candidate region within a search region R centered at X
k
at time step k. It should be sufficiently large to reach the maximum facial point movement without overlapping with any neighboring windows. c(r) can be generated from the target region, which is r position translation in the search region R. The normalizing constant ζ ensures K(X
k
;r) to be a probability distribution, . The kernel width is used to scale the KDE K(X
k
;r), and the optimal solution for kernel width that minimizes the Mean Integrated Square Error (MISE) [39] is given by:
(10)
where i
x
is the number of particles in the set at time k and d
opt
denotes the optimal solution for . If we denote K*(X
k
;r) as the reference region model and K(X
k
;r) as a candidate region model, we can measure the data likelihood to track the facial point movements by considering the maximum value of the correlation coefficient between the color histograms in this region and in a target region. The correlation coefficient ρ(X
k
) is calculated as:
(11)
where E(K(X
k
;r)) is the means of the vectors K(X
k
;r) and K*(X
k
;r), and E(K*(X
k
;r)) is the average intensities of the color model. Finally, we define the observation likelihood of the color measurement distribution using the correlation coefficient ρ(X
k
) that:
(12)
where τ
i
is a scaling parameter, which helps the result evaluated by (12) be more reasonably distributed in the range of (0,1).
3.3 Landmark point tracking
In this section, we present using multiple DE-MC filters for facial landmarks tracking over time. Once the observation model is defined we need to model the transition density and to specify the scheme for reweighting the particles. The single particle filters weight particles based on a likelihood score and then propagate these weighted particles according to a motion model. Simply running particle filters for multiple landmarks tracking needs a complex motion model for the identity between targets. Such an approach suffers from exponential complexity in the number of tracked targets [40]. In contrast to traditional methods, our approach addresses the multi-target tracking problem using the M-component non-parametric mixture model, where each component (every landmark point) is modeled with an individual particle filter that forms part of the mixture. The landmark states have multi-modal distribution functions and the filters in the mixture interact only through the computation of the importance weights. In particular, we combined color based kernel correlation technique for the observation likelihood with DE-MC particle filtering distribution. A set of weighted particles are used to approximate a density function corresponding to the probability of the location of the target given observations.
To avoid sampling from a complicated distribution, the M-component model is adopted for the posterior distribution over the state X
k
of all targets M according to:
(13)
where M = 26, p
j
(X
k
|Y
1 : k
) is the posteriori probability of the facial landmarks with the M-component non-parametric mixture model, and Pi is the mixture weights satisfy . We utilize training data to learn the interdependencies between the positions of the facial landmarks for the reweighting scheme. It is clear that the performance can be improved if we consider the motion models of the landmark points. The motion model p(X
k
|X
k - 1) predicts the state X
k
given the previous state X
k- 1. Using the filtering distribution computed from (13), the predictive distribution becomes:
(14)
where p
m
(X
k
|Y
1 : k - 1) = ∫ p(X
k
|X
k - 1)p
m
(X
k - 1|Y
1 : k - 1)dX
k - 1. The likelihood p(Y
k
|X
k
) is the measurement model and expresses the probability of observation Y
k
. We approximate the posterior from an appropriate proposal distribution to maintain a particle based representation for the a posteriori probability of the state. It provides a consistent way to resolve the ambiguities that arise in associating multiple objects with measurements of the similarity criterion between the target points and the candidate points. The updated posterior mixture takes the form that:
(15)
The new weights can be approximated with a prior on the relative positions of the facial features as:
(16)
The particles are sampled from the training data to obtain the appropriate distribution in the M-mixture model. The prediction step and the measurement step are integrated together instead of functioning separately. The use of the priors provides sufficient constrains for reliable tracking at the presence of appearance changes due to facial expressions. The measurement function evaluates the resemblance between image features generated by hypothesis and those generated by ground truth positions, as the criterion for judging the correctness of hypothesis.
When tracking the multiple modalities, multiple trackers start with mode-seeking procedure, the posterior modes are subsequently detected through the HSV color histograms based kernel correlation analysis. Using a trained color-based observation model allows us to track different landmark points. Here, we have M different likelihood distributions. At time k we sample candidate particles from an appropriate proposal distribution from and weight these particles according to the probability proportional:
(17)
In our work, scaling is normalized by person-related scaling factors that are estimated from the positions of the facial features at the first frame, such as the dimensions of the mouth. This scheme simply processes with the prior knowledge by sampling from the transition priors and updating the particles using importance weights derived from (17).