# Automatic landmark point detection and tracking for human facial expressions

- Yun Tie and Ling Guan

**2013**:8

https://doi.org/10.1186/1687-5281-2013-8

© Tie and Guan; licensee Springer. 2013

**Received: **3 November 2011

**Accepted: **8 January 2013

**Published: **4 February 2013

## Abstract

Facial landmarks are a set of salient points, usually located on the corners, tips or mid points of the facial components. Reliable facial landmarks and their associated detection and tracking algorithms can be widely used for representing the important visual features for face registration and expression recognition. In this paper we propose an efficient and robust method for facial landmark detection and tracking from video sequences. We select 26 landmark points on the facial region to facilitate the analysis of human facial expressions. They are detected in the first input frame by the scale invariant feature based detectors. Multiple Differential Evolution-Markov Chain (DE-MC) particle filters are applied for tracking these points through the video sequences. A kernel correlation analysis approach is proposed to find the detection likelihood by maximizing a similarity criterion between the target points and the candidate points. The detection likelihood is then integrated into the tracker’s observation likelihood. Sampling efficiency is improved and minimal amount of computation is achieved by using the intermediate results obtained in particle allocations. Three public databases are used for experiments and the results demonstrate the effectiveness of our method.

## 1. Introduction

As computers have become an integral part of our life, the need has arisen for a more natural communication interface between humans and machines. To make human-computer interaction (HCI) more natural and friendly, it would be beneficial to give computers the ability to recognize states of mind of humans the same way a human does. Analyzing facial expression in real time without human intervention will help to understand people’s behavior, and thus plays an important role in efficient HCI systems. Automatic facial component localization, such as the eyes, a mouth or nose, is a critical step for expression understanding and emotion recognition [1]. To capture the full range of emotional facial expressions from video sequences, accurate and reliable feature detection and tracking methods are required.

The rest of this paper is organized as follows. Section 2 presents automatic facial landmark detection. In Section 3, we describe the multiple point tracking method with DE-MC particle filters and the kernel correlation technique. The experimental setup and results are presented in Section 4. Finally, Section 5 discusses the results and draws conclusions.

## 2. Facial landmark detection

Automatic landmark detection in still images is useful in many computer vision tasks where object recognition or pose determination is needed with high reliability. It aims to facilitate locating point correspondences between images, or between images and a known model, where natural features such as texture, shape, or location information are not present in sufficient quantity and uniqueness. Some previous works used shape information for facial feature localization, such as template matching [3], graph matching [4], and snakes [5]. These works can detect facial features well in neutral faces but fail to perform well under large variations such as non-uniform illumination, changes of pose, and facial expressions.

Due to the inherent difficulty of detecting the landmark points in a single image, temporal information captured from subsequent frames of a video sequence has been utilized. Detecting and tracking landmark points in video sequences enables computers to recognize affective states of humans, as well as to interpret and respond appropriately to users’ affective feedback [6, 7]. We can categorize the landmark detection algorithms in the literature into two groups based on the type of features and anthropometrical information they use: the geometric feature-based methods [8–10] and the appearance-based methods [11–13]. The geometric feature-based methods utilize prior knowledge about the face position, and constrain the landmark search by heuristic rules that involve angles, distances, and areas. A number of the existing methods did have success in detecting facial features. For example, [6] used a multi-feature based fusion scheme for facial fiducial point detection and achieved an average detection rate of 75%, and [8] used a Gabor feature based boosted classifier for detecting 20 facial feature points, achieving an average recognition rate of 86%. In general, these methods perform quite well when localizing a small number of facial feature points such as the corners of the eyes and the mouth; however, none of them detects and tracks all 26 facial landmarks.

The appearance-based methods, on the other hand, use image filters such as Gabor wavelets to generate facial features for either the whole face or specific regions of a face image. The Active Shape Model (ASM) [14] and Active Appearance Model (AAM) [15] are two popular appearance-based methods with statistical face models that prevent locating inappropriate feature points. Cristinacce and Cootes [16] extended the AAM with constrained local models using a set of local feature templates. Milborrow and Nicolls [17] introduced modifications to the ASM with more sophisticated methods. However, these methods were mainly applied to a full face shape model. When the object appears small, cluttered backgrounds and occlusion lead to severe ambiguity.

In this section we introduce the scale invariant feature based method for the landmark detection, which includes three steps: *preprocessing, candidate selection and feature vectors extraction*.

### 2.1 Preprocessing

Since faces are non-rigid and have a high degree of variability in location, color and pose, it is difficult to detect a face automatically in a complex environment. Occlusion and illumination artifacts can also change the overall appearance of a face. We therefore propose detecting facial regions in the input video sequence using a face detector with local illumination compensation for normalization and optimal adaptive correlation (OAC) [18]. Specifically, each frame of the input video sequence is extracted and regularized using an illumination compensation process, including gamma intensity correction (GIC), difference of Gaussian (DoG), local histogram matching (LHM) and local normal distribution (LND). Face candidate regions are then located by the OAC technique with kernel canonical correlation analysis (KCCA). Compared to Viola and Jones’ algorithm [19], the local normalization based method is adaptive to the normalized input image and designed to complete the segmentation in a single iteration. With local normalization, the proposed method tends to be more robust under different illumination conditions.
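As an illustration of the first compensation step, a minimal gamma intensity correction (GIC) might look like the following sketch; the gamma value and the 8-bit intensity scaling are illustrative assumptions, not the parameters used in [18]:

```python
import numpy as np

def gamma_intensity_correction(image, gamma=0.5):
    """Gamma intensity correction (GIC), the first illumination
    compensation step: remap normalized intensities I -> I ** (1 / gamma).
    The gamma value here is an illustrative choice.
    """
    img = np.clip(np.asarray(image, dtype=float) / 255.0, 0.0, 1.0)
    return (img ** (1.0 / gamma) * 255.0).astype(np.uint8)
```

The mapping leaves pure black and pure white unchanged while redistributing mid-range intensities, which is what makes it useful as a normalization step before correlation-based detection.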

Before the raw data sequences can be used for automatic landmark point detection and tracking, it is necessary to normalize the size of each sequence so that it is in the format required by the system. Since the displacement of a landmark point in each frame depends on the individual, we use the inter-ocular distance (IOD) for size normalization. The distance between the left and right eye pupils is determined in the first input frame. We also manually marked the landmarks of the selected sequences to create the ground truth data.
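The IOD-based size normalization described above can be sketched as follows; the function name and the pupil coordinates are hypothetical:

```python
import numpy as np

def normalize_by_iod(landmarks, left_pupil, right_pupil):
    """Scale landmark coordinates so the inter-ocular distance (IOD) is 1.

    landmarks: (N, 2) array of (x, y) points from the first frame.
    left_pupil, right_pupil: (x, y) pupil positions in the same frame.
    """
    iod = np.linalg.norm(np.asarray(right_pupil, dtype=float)
                         - np.asarray(left_pupil, dtype=float))
    if iod == 0:
        raise ValueError("pupil positions coincide; cannot normalize")
    return np.asarray(landmarks, dtype=float) / iod

# Usage: with an IOD of 60 pixels, displacements are expressed in IOD units.
pts = normalize_by_iod([[120.0, 90.0]], left_pupil=(100, 80), right_pupil=(160, 80))
```

Expressing displacements in IOD units makes the later tracking-error thresholds comparable across subjects with different face sizes.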

After the facial region is detected, we use the scale space extrema method to find the locations of candidate points, as described in Section 2.2. The scale invariant feature for each candidate point is extracted and the 26 landmark detectors are constructed as described in Section 2.3.

### 2.2 Candidates selection

The scale space representation *L*(*x, y, σ*) of the input image at different scales is expressed as:

$$L\left(x,y,\sigma \right)=G\left(x,y,\sigma \right)\ast s\left(x,y\right)$$

where *L*(*x,y,σ*) is the spatial scale image, *s*(*x, y*) indicates the input image of the facial region, and *G*(*x,y,σ*) is the Gaussian convolution kernel function defined as:

$$G\left(x,y,\sigma \right)=\frac{1}{2\pi {\sigma}^{2}}\exp \left(-\frac{{x}^{2}+{y}^{2}}{2{\sigma}^{2}}\right)$$

with *σ* being the scale factor. The image smoothness varies with *σ*, and a series of scale images is obtained with different *σ* values. The scale space extrema are computed using the difference of Gaussian (DoG) function of the input image, which calculates the difference of two nearby scales separated by a constant multiplicative factor *k*:

$$D\left(x,y,\sigma \right)=\left(G\left(x,y,k\sigma \right)-G\left(x,y,\sigma \right)\right)\ast s\left(x,y\right)=L\left(x,y,k\sigma \right)-L\left(x,y,\sigma \right)$$

where *D*(*x,y,σ*) is the DoG function of the input image. In this work, we set the interval number *n* to 3 to form *n* + 2 DoG images, and *k* to 2^{1/3}. Each pixel in a DoG image is compared to its eight neighbors on the same scale and its nine neighbors one scale up and one scale down. If its value is the minimum or maximum among the pixels compared, it is an extremum, and the pixel is chosen as an interest candidate point, recording its position, scale, and the adjacent scales of the local extremum. Since the success of landmark detection depends on the quantity of the selected candidates, we use a larger number of scale samples. DoG extrema are detected repeatedly across the scale space; they are stable features across all possible scales and are invariant to scale and rotation. These points are highly distinctive and are generally located on contours, corners and edges in the facial region. Since there are 5 DoG images in our work, all the interest candidates are examined to determine location and scale, and the landmarks are detected based on the measurements from these local decisions.
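The candidate selection above can be sketched roughly as follows. This is a simplified illustration (pure NumPy, brute-force 26-neighbor comparison), not the authors' implementation; the helper names are hypothetical:

```python
import numpy as np

def _gaussian_blur(img, sigma):
    """Separable Gaussian blur with a truncated kernel (pure NumPy)."""
    radius = max(1, int(3 * sigma))
    x = np.arange(-radius, radius + 1)
    g = np.exp(-x ** 2 / (2 * sigma ** 2))
    g /= g.sum()
    padded = np.pad(img, radius, mode="edge")
    rows = np.apply_along_axis(lambda r: np.convolve(r, g, mode="valid"), 1, padded)
    return np.apply_along_axis(lambda c: np.convolve(c, g, mode="valid"), 0, rows)

def dog_extrema(image, sigma=1.6, k=2 ** (1 / 3), n_intervals=3):
    """Scale-space extrema over n_intervals + 2 DoG images (5 here).

    A pixel is kept when it is strictly larger or smaller than all 26
    neighbours: 8 at its own scale plus 9 at each adjacent scale.
    """
    n_dog = n_intervals + 2
    blurred = [_gaussian_blur(np.asarray(image, dtype=float), sigma * k ** i)
               for i in range(n_dog + 1)]
    dog = np.stack([blurred[i + 1] - blurred[i] for i in range(n_dog)])

    candidates = []
    for s in range(1, n_dog - 1):                 # one scale above and below
        for y in range(1, dog.shape[1] - 1):
            for x in range(1, dog.shape[2] - 1):
                cube = dog[s - 1:s + 2, y - 1:y + 2, x - 1:x + 2].ravel()
                others = np.delete(cube, 13)      # drop the centre value
                v = dog[s, y, x]
                if v > others.max() or v < others.min():
                    candidates.append((x, y, s))
    return candidates
```

The strict inequalities avoid flagging flat regions where all 27 values in the cube are equal.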

### 2.3 Feature vectors extraction

We set *σ* = 1.6 for the scale, a reasonable compromise between stable extrema detection and computational cost; this value is used throughout this work. A gradient orientation histogram is calculated for the direction of each interest point in its neighborhood. The gradient magnitude *m*(*x, y*) and orientation *θ*(*x, y*) are computed using pixel differences:

$$m\left(x,y\right)=\sqrt{{\left(L\left(x+1,y\right)-L\left(x-1,y\right)\right)}^{2}+{\left(L\left(x,y+1\right)-L\left(x,y-1\right)\right)}^{2}} \quad \left(4\right)$$

$$\theta \left(x,y\right)={\tan}^{-1}\left(\frac{L\left(x,y+1\right)-L\left(x,y-1\right)}{L\left(x+1,y\right)-L\left(x-1,y\right)}\right) \quad \left(5\right)$$

where *L* is the image at scale *σ*. We choose a neighborhood *F* centered at the interest point. By calculating the directions of points in *F*, we obtain the histogram of gradient directions. The orientation has a range of 360 degrees calculated by Eqs. (4) and (5). However, it is complex and computationally expensive to use the original orientation histogram with 360 bins. To reduce the computing cost, we equally divide the histogram into 36 phases each covering a range of 10 degrees of the orientations. As a result, the orientation histogram has 36 bins. The direction of the interest candidate point is the maximal component of the 36 phases in the histogram.
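A minimal sketch of the 36-bin orientation assignment, assuming a precomputed scale image `L`; the neighborhood radius is an illustrative choice and the function name is hypothetical:

```python
import numpy as np

def dominant_orientation(L, x, y, radius=4):
    """Assign a direction to an interest point from a 36-bin histogram of
    gradient orientations (10 degrees per bin) in its neighbourhood,
    with magnitude-weighted votes as in Eqs. (4) and (5).
    """
    h = np.zeros(36)
    for j in range(y - radius, y + radius + 1):
        for i in range(x - radius, x + radius + 1):
            if 0 < i < L.shape[1] - 1 and 0 < j < L.shape[0] - 1:
                dx = L[j, i + 1] - L[j, i - 1]
                dy = L[j + 1, i] - L[j - 1, i]
                m = np.hypot(dx, dy)                    # gradient magnitude
                theta = np.degrees(np.arctan2(dy, dx)) % 360.0
                h[int(theta // 10) % 36] += m           # magnitude-weighted vote
    return 10.0 * np.argmax(h), h                       # peak bin -> direction
```

On a ramp image whose intensity increases along x, every vote falls in the 0-degree bin, so the returned direction is 0.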

To detect the landmarks from the interest candidate points, a set of landmark detectors with feature descriptions from the gradient orientation histograms of the input images is constructed. The descriptor is a vector containing the values of all the orientation histogram entries. Centered at each landmark, a neighborhood window is selected and divided into 4 × 4 = 16 subregions. Using (4) and (5), the directions and amplitudes of all pixels in the subregions are obtained and accumulated into orientation histograms summarizing the contents of the 4 × 4 subregions. Using the orientation histogram, we calculate the eight direction distributions in the ranges (0, π/4, π/2, 3π/4, π, 5π/4, 3π/2, 7π/4), with each length corresponding to the sum of the gradient magnitudes near that direction within the region. The amplitude and a Gaussian weighting function are also applied to the eight direction distributions to create the direction histogram of each subregion. The feature description of each landmark point is obtained by concatenating the direction descriptions of all subregions. The total number of direction descriptions is 16 since the landmark descriptor has 4 × 4 subregions, so the length of a landmark point detector is 128 = 16 × 8; the vector is normalized to ensure illumination invariance.
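The 4 × 4 × 8 descriptor construction can be sketched as follows; the Gaussian weighting mentioned in the text is omitted for brevity, and the function name and window layout are hypothetical simplifications:

```python
import numpy as np

def landmark_descriptor(mag, ori, cx, cy, win=16):
    """Build the 128-dimensional descriptor: a 16x16 window around the
    landmark split into 4x4 subregions, each summarized by an 8-bin
    orientation histogram (pi/4 per bin), then length-normalized for
    illumination invariance. `mag` and `ori` (radians) are precomputed
    gradient magnitude and orientation maps.
    """
    half = win // 2
    desc = []
    for by in range(4):
        for bx in range(4):
            hist = np.zeros(8)
            y0, x0 = cy - half + 4 * by, cx - half + 4 * bx
            for j in range(y0, y0 + 4):
                for i in range(x0, x0 + 4):
                    if 0 <= j < mag.shape[0] and 0 <= i < mag.shape[1]:
                        b = int((ori[j, i] % (2 * np.pi)) // (np.pi / 4)) % 8
                        hist[b] += mag[j, i]
            desc.append(hist)
    v = np.concatenate(desc)                 # 16 subregions x 8 bins = 128
    n = np.linalg.norm(v)
    return v / n if n > 0 else v
```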

## 3. Multiple points tracker

Most tracking algorithms impose constraints on the motion and appearance of objects, such as prior knowledge of the motion model, or the number, size, or shape of the objects. Various approaches have been proposed, including mean-shift, Kalman filtering, and particle filtering. The mean-shift based tracker iteratively shifts a data point to the average of the data points in its neighborhood, which minimizes the distance between a model histogram representing the target and candidate histograms computed on the current frame. However, it ignores motion information and has difficulty recovering from temporary tracking failures. The Kalman filter is the minimum-variance state estimator for linear dynamic systems with Gaussian noise [21]. For a visual object that moves rapidly, it is in general hard to implement the optimal state estimator in closed form [22]. Various modifications of the Kalman filter can be used to estimate the state, including the extended Kalman filter [23] and the unscented Kalman filter [24]. A multi-step tracking framework was also introduced in [25] to track facial landmark points under head rotations and facial expressions; the Kalman filter was used to predict the locations of landmarks and better performance was achieved. However, the Kalman filter has shortcomings for tracking the landmarks of facial expressions, such as the nonlinearity of head motions, the unimodality of the Kalman estimate, and the inherent tracking delay.

Over the last few years, particle filters have received immense attention for image tracking because of their simplicity, flexibility, and systematic treatment of nonlinearity and non-Gaussianity. Particle filters provide a convenient Bayesian filtering framework for integrating the detector into the tracker. Based on point mass representations of probability densities, particle filters operate by propagating the particle estimation and can be applied to any state-space model [26–29]. To track the state of a temporal event with a set of noisy observations, the main idea is to maintain a set of solutions that efficiently represent the conditional probability. However, a large number of particles sampled from the proposal density may be assigned low weights and wasted because they are propagated into areas with small likelihood. Some of the existing works ignore the fact that, while a particle might have low likelihood, parts of it might be close to the correct solution. The estimation of the particle weights does not take into account the interdependences between the different parts of the state of a temporal event.

Particle filters can use multi-modal likelihood functions and propagate multi-modal posterior distributions [30, 31]. There are two basic schemes: sending the output of the detector into the measurement likelihood [32, 33], or applying a mixture proposal distribution by combining the dynamic model with the output of the detector [34]. However, directly applying a particle filter to multiple object tracking is not feasible because the standard particle filter does not define a way to identify individual modes or hypotheses. Some researchers used sequential state estimation techniques to track multiple objects [35]. Patras and Pantic applied auxiliary particle filtering with factorized likelihoods for tracking facial points [27]. Zhao et al. [36] introduced a method for tracking facial points with a multi-cue particle filter. They incorporated information from both the color and the edges of facial features, and proposed a point distribution model to constrain the tracking results and avoid tracking failures during occlusion. The standard particle filter has a common problem: it turns out to be inadequate when the dynamic system has very low process noise, or when the observation noise has very small variance [34]. The reason is its defective sampling strategy in a state space of large dimensionality. After a few iterations, the particle set will collapse to a single point [31]. Therefore, resampling is applied to eliminate particles that have small weights and to concentrate on particles with large weights. It has been recognized that improving the resampling or global optimization strategy is more decisive to the success of tracking [30].

In this paper, we use multiple DE-MC particle filters to track the facial landmarks through the video sequence depending on the locations of the current appearance of the spatially sampled features.

### 3.1 DE-MC particle filter

The particle filter recursively estimates the posterior *p*(*X*_{k}|*Y*_{1 : k}) of the state *X*_{k} given all the observations *Y*_{1 : k} = {*Y*_{1}, *Y*_{2}, …, *Y*_{k}} up to and including the current time instant *k*, according to:

$$p\left({X}_{k}|{Y}_{1:k}\right)={\lambda}_{k}\,p\left({Y}_{k}|{X}_{k}\right)\int p\left({X}_{k}|{X}_{k-1}\right)\,p\left({X}_{k-1}|{Y}_{1:k-1}\right)\,d{X}_{k-1} \quad \left(6\right)$$

In (6), the state *X*_{k} is a 2*M*-component vector that represents the locations of the landmarks, and the observation *Y*_{1 : k} is the set of image frames up to the current time instant. The normalization constant *λ*_{k} is independent of *X*_{k}. The motion model *p*(*X*_{k}|*X*_{k - 1}) is conditioned directly on the immediately preceding state and is independent of the earlier history if the motion dynamics are assumed to form a temporal Markov chain. The distribution is represented by *N* discrete samples through particle filtering. The *N* samples (particles) are drawn from a proposal distribution $p\left(\left.{X}_{k}^{\left(i\right)}\right|{X}_{k}^{\left(i-1\right)},{Y}_{k}\right)$, *i* = 1,2,…,*N*, and assigned weights $w\left({X}_{k}^{\left(i\right)}\right)$.

At time step *k* – 1, we have a particle-based representation of the density, that is, a collection of *N* particles and their corresponding weights ${\left\{{X}_{k-1}^{\left(i\right)},w\left({X}_{k-1}^{\left(i\right)}\right)\right\}}_{i=1}^{N}$. At time step *k*, select a new set of samples ${\left\{{\widehat{X}}_{k}^{\left(i\right)}\right\}}_{i=1}^{N}$ from ${\left\{{X}_{k-1}^{\left(i\right)}\right\}}_{i=1}^{N}$ with probability proportional to $w\left({X}_{k-1}^{\left(i\right)}\right)$; samples with larger weights should be selected with higher probability. Then, applying a constant velocity dynamical model to the samples yields:

$${X}_{k}^{\left(i\right)-}={\widehat{X}}_{k}^{\left(i\right)}+{V}_{k-1} \quad \left(7\right)$$

where ${\widehat{X}}_{k}^{\left(i\right)}$ is the new set of samples selected at time *k*, and *V*_{k-1} is the velocity vector computed at time step *k* – 1.

The particle set ${\left\{{X}_{k}^{\left(i\right)-}\right\}}_{i=1}^{N}$ acts as the initial population of *N* members for a *T*-iteration DE-MC processing. For any one landmark in the *T*-iteration processing, two different integers *r*_{1} and *r*_{2}, with *r*_{1} ≠ *r*_{2} ≠ *i*, are randomly chosen from the population of the previous iteration. A new member $\left\{{X}_{k}^{*\left(i\right)}\right\}=\left\{{X}_{k-1}^{\left(i\right)}\right\}+\lambda \left(\left\{{X}_{k-1}^{\left({r}_{1}\right)}\right\}-\left\{{X}_{k-1}^{\left({r}_{2}\right)}\right\}\right)+g$ is created, where *λ* is a scalar whose value is found to be optimal when $\lambda =2.38/\sqrt{2N}$, and *g* is drawn from a symmetric distribution with small variance compared to that of ${\left\{{X}_{k}^{\left(i\right)}\right\}}_{i=1}^{N}$. A target function based on the ratio between the populations of the current and previous steps is evaluated until convergence or a preset end point is reached, and the weights of the particles are then updated by the DE-MC. At the end of this step, we take the output population as the particle set of the current time step, ${\left\{{X}_{k}^{\left(i\right)},w\left({X}_{k}^{\left(i\right)}\right)\right\}}_{i=1}^{N}$.

We estimate the state at time step *k* as:

$${X}_{k}={\sum}_{i=1}^{N}w\left({X}_{k}^{\left(i\right)}\right){X}_{k}^{\left(i\right)} \quad \left(8\right)$$

and update the velocity vector of the current time step, *V*_{k} = *X*_{k} - *X*_{k - 1}. The step size of the random jump for the current DE-MC iteration is reduced if the survival rate of the last DE-MC iteration is high, and inflated otherwise [37]. The update scheme for the maximum likelihood decision on the weights *w* can be summarized as follows:

Initialization: the particle set of time step *k* – 1 is ${\left\{{X}_{k-1}^{\left(i\right)},w\left({X}_{k-1}^{\left(i\right)}\right)\right\}}_{i=1}^{N}$.

- 1.
Selection: select a set of samples ${\left\{{\widehat{X}}_{k}^{\left(i\right)}\right\}}_{i=1}^{N}$ from ${\left\{{X}_{k-1}^{\left(i\right)}\right\}}_{i=1}^{N}$ with probability proportional to $w\left({X}_{k-1}^{\left(i\right)}\right)$.

- 2.
Prediction and Measurement: apply the constant velocity dynamical model to the samples using Eq. (7). At the end of this step, we take the output population as the particle set of the current time step, ${\left\{{X}_{k}^{\left(i\right)},w\left({X}_{k}^{\left(i\right)}\right)\right\}}_{i=1}^{N}$.

- 3.
Representation and Velocity Updating: estimate the state at time step *k* by Eq. (8) and update the velocity vector of the current time step.
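The steps above can be sketched as one filtering cycle. This simplified version omits the DE-MC refinement iterations, and all names are hypothetical:

```python
import numpy as np

def particle_filter_step(particles, weights, velocity, likelihood_fn, rng=None):
    """One select -> predict -> measure cycle over the particle set.

    particles: (N, D) array of states at step k-1.
    weights:   (N,) normalized importance weights.
    velocity:  (D,) velocity vector V_{k-1} for the constant velocity model.
    likelihood_fn: maps a state to its observation likelihood p(Y_k | X_k).
    Returns the new particles, weights, and the weighted state estimate.
    """
    rng = np.random.default_rng() if rng is None else rng
    n = len(particles)

    # 1. Selection: resample indices proportionally to the weights.
    idx = rng.choice(n, size=n, p=weights)
    selected = particles[idx]

    # 2. Prediction and measurement: constant velocity model (Eq. (7)),
    #    then reweight by the observation likelihood.
    predicted = selected + velocity
    new_w = np.array([likelihood_fn(x) for x in predicted])
    new_w = new_w / new_w.sum()

    # 3. Representation: weighted mean state estimate (Eq. (8)).
    estimate = (new_w[:, None] * predicted).sum(axis=0)
    return predicted, new_w, estimate
```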

While the tracker updates and tracks the *X*_{k} vector that represents the coordinates of the 26 landmark points, the samples are already drawn. The DE-MC particle filter makes a more reasonable sampling and keeps the samples from running off into implausible shapes, even if they are placed far away from the solution point or are trapped in a local cost basin of the state space. The observation model helps move the sample points toward positions close to the solution relative to their starting points. The measurement module provides the necessary feedback to the sampling module, according to which the hypotheses move to the regions where the global maximum of the measurement function is more likely to be found.

### 3.2 Kernel correlation-based observation likelihood

The kernel density estimate *K*(*X*_{k}) for the color distribution of the object *X*_{k} at time step *k* is given as:

$$K\left({X}_{k};r\right)=\zeta {\sum}_{i=1}^{N}\kappa \left(\frac{c\left({X}_{k}^{\left(i\right)}\right)-c\left(r\right)}{{d}^{{i}_{x}}}\right) \quad \left(9\right)$$

where *κ*(.) is the kernel profile, the *c*(.) function is a three-dimensional vector of HSV values, and $c\left({X}_{k}^{\left(i\right)}\right)$ can be generated from the candidate region within a search region *R* centered at *X*_{k} at time step *k*. The search region should be sufficiently large to cover the maximum facial point movement without overlapping any neighboring windows.

*c*(*r*) can be generated from the target region, which is the translation by position *r* within the search region *R*. The normalizing constant *ζ* ensures that *K*(*X*_{k}; *r*) is a probability distribution, ${\sum}_{k=1}^{N}K\left({X}_{k};r\right)=1$. The kernel width ${d}^{{i}_{x}}$ is used to scale the KDE *K*(*X*_{k}; *r*), and the optimal kernel width that minimizes the Mean Integrated Square Error (MISE) [39] is given by:

$${d}_{opt}=1.06\,\widehat{\sigma}\,{i}_{x}^{-1/5} \quad \left(10\right)$$

with $\widehat{\sigma}$ the standard deviation of the particle samples.

Here *i*_{x} is the number of particles in the set at time *k*, and *d*_{opt} denotes the optimal solution for ${d}^{{i}_{x}}$. If we denote *K*^{*}(*X*_{k}; *r*) as the reference region model and *K*(*X*_{k}; *r*) as a candidate region model, we can measure the data likelihood for tracking the facial point movements by considering the maximum value of the correlation coefficient between the color histograms in this region and in a target region. The correlation coefficient *ρ*(*X*_{k}) is calculated as:

$$\rho \left({X}_{k}\right)=\frac{E\left[\left(K\left({X}_{k};r\right)-E\left(K\left({X}_{k};r\right)\right)\right)\left({K}^{*}\left({X}_{k};r\right)-E\left({K}^{*}\left({X}_{k};r\right)\right)\right)\right]}{\sqrt{E\left[{\left(K\left({X}_{k};r\right)-E\left(K\left({X}_{k};r\right)\right)\right)}^{2}\right]\,E\left[{\left({K}^{*}\left({X}_{k};r\right)-E\left({K}^{*}\left({X}_{k};r\right)\right)\right)}^{2}\right]}} \quad \left(11\right)$$

where *E*(*K*(*X*_{k}; *r*)) is the mean of the candidate vector *K*(*X*_{k}; *r*) and *E*(*K*^{*}(*X*_{k}; *r*)) is the mean (average intensity) of the reference color model. Finally, we define the observation likelihood of the color measurement distribution using the correlation coefficient *ρ*(*X*_{k}):

$$p\left({Y}_{k}|{X}_{k}\right)=\exp \left(-\frac{1-\rho \left({X}_{k}\right)}{{\tau}_{i}}\right) \quad \left(12\right)$$

where *τ*_{i} is a scaling parameter, which helps the result evaluated by (12) to be more reasonably distributed in the range of (0,1).
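The correlation-based likelihood can be sketched as follows; the exact exponential mapping and the value of τ are illustrative assumptions:

```python
import numpy as np

def correlation_likelihood(candidate_hist, reference_hist, tau=0.2):
    """Observation likelihood from the correlation coefficient between a
    candidate color histogram K(X_k; r) and the reference model K*(X_k; r).
    The exponential mapping and tau value are illustrative; tau scales the
    result toward the range (0, 1).
    """
    a = np.asarray(candidate_hist, dtype=float)
    b = np.asarray(reference_hist, dtype=float)
    a_c, b_c = a - a.mean(), b - b.mean()
    denom = np.sqrt((a_c ** 2).sum() * (b_c ** 2).sum())
    rho = (a_c * b_c).sum() / denom if denom > 0 else 0.0
    return np.exp(-(1.0 - rho) / tau)   # rho = 1 gives likelihood 1
```

A perfectly matching candidate (identical histograms) yields ρ = 1 and the maximal likelihood; as the correlation drops, the likelihood decays exponentially.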

### 3.3 Landmark point tracking

In this section, we present the use of multiple DE-MC filters for tracking facial landmarks over time. Once the observation model is defined, we need to model the transition density and to specify the scheme for reweighting the particles. A single particle filter weights particles based on a likelihood score and then propagates these weighted particles according to a motion model. Naively running particle filters for multiple landmark tracking requires a complex joint motion model to maintain the identity of the targets, and such an approach suffers from exponential complexity in the number of tracked targets [40]. In contrast to traditional methods, our approach addresses the multi-target tracking problem using an M-component non-parametric mixture model, where each component (every landmark point) is modeled with an individual particle filter that forms part of the mixture. The landmark states have multi-modal distribution functions, and the filters in the mixture interact only through the computation of the importance weights. In particular, we combine the color based kernel correlation technique for the observation likelihood with the DE-MC particle filtering distribution. A set of weighted particles is used to approximate a density function corresponding to the probability of the location of the target given the observations.

We formulate the filtering distribution for the combined state *X*_{k} of all *M* targets according to:

$$p\left({X}_{k}|{Y}_{1:k}\right)={\sum}_{m=1}^{M}P{i}_{m,k}\,{p}_{m}\left({X}_{k}|{Y}_{1:k}\right) \quad \left(13\right)$$

where *M* = 26, *p*_{m}(*X*_{k}|*Y*_{1 : k}) is the posterior probability of the facial landmarks under the M-component non-parametric mixture model, and the mixture weights *Pi*_{m,k} satisfy ${\sum}_{m=1}^{M}P{i}_{m,k}=1$. We utilize training data to learn the interdependencies between the positions of the facial landmarks for the reweighting scheme. The performance can be improved if we consider the motion models of the landmark points. The motion model *p*(*X*_{k}|*X*_{k - 1}) predicts the state *X*_{k} given the previous state *X*_{k - 1}. Using the filtering distribution computed from (13), the predictive distribution becomes:

$${p}_{m}\left({X}_{k}|{Y}_{1:k-1}\right)=\int p\left({X}_{k}|{X}_{k-1}\right)\,{p}_{m}\left({X}_{k-1}|{Y}_{1:k-1}\right)\,d{X}_{k-1}$$

The likelihood *p*(*Y*_{k}|*X*_{k}) is the measurement model and expresses the probability of the observation *Y*_{k}. We approximate the posterior from an appropriate proposal distribution to maintain a particle-based representation of the a posteriori probability of the state. This provides a consistent way to resolve the ambiguities that arise in associating multiple objects with measurements of the similarity criterion between the target points and the candidate points. The updated posterior of each mixture component takes the form:

$${p}_{m}\left({X}_{k}|{Y}_{1:k}\right)=\frac{p\left({Y}_{k}|{X}_{k}\right)\,{p}_{m}\left({X}_{k}|{Y}_{1:k-1}\right)}{\int p\left({Y}_{k}|{X}_{k}\right)\,{p}_{m}\left({X}_{k}|{Y}_{1:k-1}\right)\,d{X}_{k}}$$

The particles are sampled from the training data to obtain the appropriate distribution in the M-mixture model. The prediction step and the measurement step are integrated rather than functioning separately. The use of the priors provides sufficient constraints for reliable tracking in the presence of appearance changes due to facial expressions. The measurement function evaluates the resemblance between image features generated by a hypothesis and those generated by the ground truth positions, as the criterion for judging the correctness of the hypothesis.

At each time step *k*, we sample candidate particles ${\left\{{\widehat{X}}_{k}^{\left(i\right)}\right\}}_{i=1}^{N}$ from an appropriate proposal distribution based on ${\left\{{X}_{k-1}^{\left(i\right)}\right\}}_{i=1}^{N}$, and weight these particles with probability proportional to the observation likelihood:

$$w\left({X}_{k}^{\left(i\right)}\right)\propto p\left({Y}_{k}|{\widehat{X}}_{k}^{\left(i\right)}\right) \quad \left(17\right)$$

In our work, scaling is normalized by person-related scaling factors that are estimated from the positions of the facial features in the first frame, such as the dimensions of the mouth. This scheme simply proceeds with the prior knowledge by sampling from the transition priors and updating the particles using importance weights derived from (17).
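One simple way the mixture weights *Pi*_{m,k} might be renormalized after reweighting is sketched below; using each component's total particle likelihood as its score is an assumption for illustration, not the paper's exact scheme:

```python
import numpy as np

def update_mixture_weights(pi, component_likelihoods):
    """Update the mixture weights Pi_{m,k} of the M landmark filters so
    they stay normalized (sum to 1), scaling each component by the total
    likelihood its particles received at this time step (an assumed score).
    """
    pi = np.asarray(pi, dtype=float)
    score = np.asarray(component_likelihoods, dtype=float)
    new_pi = pi * score
    return new_pi / new_pi.sum()
```

With equal scores, the weights are unchanged; components whose particles match the observations better gradually gain weight within the mixture.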

## 4. Experiments and results

To evaluate the system performance of the proposed detection and tracking method for facial expression, we construct an experimental dataset from three publicly available databases: the RML Emotion database [9], the Cohn-Kanade (CK) database [41] and the Mind Reading (MR) database [42]. The RML Emotion database was originally recorded for language and context independent emotional recognition with the six fundamental emotional states: happiness, sadness, anger, disgust, fear and surprise. It includes eight subjects in nearly frontal view (2 Italian, 2 Chinese, 2 Pakistani, 1 Persian, and 1 Canadian) and 520 video sequences in total. Each video pictures a single emotional expression and ends at the apex of that expression, while the first frame of every video sequence shows a neutral face. Video sequences from neutral to target display are digitized into 320 × 340 pixel arrays with 24-bit color values. The CK database consists of approximately 2000 image sequences in nearly frontal view from over 200 subjects. Each video pictures a single facial expression and ends at the apex of that expression, while the first frame of every video sequence shows a neutral face. The MR database is an interactive computer-based resource for facial emotional expressions, developed by Baron-Cohen and his team of psychologists. It consists of 2472 faces, 2472 voices and 2472 stories. Each video pictures the frontal face with a single facial expression of one actor (30 actors in total) of varying age ranges and ethnic origins.

We select 320 videos of eight subjects from the RML Emotion database, 360 image sequences of 90 subjects from CK database and 360 videos of 30 subjects from MR database for the experiments. As a result, the experimental dataset includes 1040 image sequences of 128 subjects in total. The experiments are implemented on a Quad CPU 2.4 GHz PC with 3.25 GB memory, under the Windows XP operating system.

We compare the automatically located facial landmarks with the ground truth points to evaluate the performance of the detection and tracking method. In general, a detection or tracking result is regarded as a SUCCESS if the bias of the automatic labeling result from the manual labeling result is less than 30% of the true inter-ocular distance [43]. However, this is unacceptable in the case of facial expression analysis. To follow the subtle changes in the facial feature appearance, we define a SUCCESS case as one in which the bias of a detected point from the true facial point is less than 10% of the inter-ocular distance in the test image. The one-against-all (OAA) and leave-one-subject-out (LOSO) cross validation strategies are utilized to perform the experiments. The OAA strategy works as follows: each time, one sample is held out as the testing data, while the rest of the data in the entire dataset is used as the training data. This procedure continues until every individual sample in the entire dataset has been held out once. In the LOSO strategy, the samples belonging to one subject are used as the testing data and the remainder as the training data. This is also repeated for all possible trials until every subject has been used as the testing data; there is no overlap between the training and testing subjects. The experimental results are averaged to produce the final accuracy.
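The SUCCESS criterion used for evaluation can be sketched as a short scoring routine; the function name and the sample coordinates are hypothetical:

```python
import numpy as np

def success_rate(detected, ground_truth, iod, threshold=0.10):
    """Fraction of SUCCESS points: a detection counts as SUCCESS when its
    bias from the manually labeled point is below 10% of the inter-ocular
    distance (IOD), the criterion used in the experiments.
    """
    detected = np.asarray(detected, dtype=float)
    ground_truth = np.asarray(ground_truth, dtype=float)
    errors = np.linalg.norm(detected - ground_truth, axis=1)
    return float(np.mean(errors < threshold * iod))

# With IOD = 60 px, a point must land within 6 px of the ground truth.
rate = success_rate([[10, 10], [30, 42]], [[11, 10], [30, 30]], iod=60)
```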

### 4.1 Facial landmark detection

In this section, we present the experimental results of the proposed facial landmark detection method. The AdaBoost algorithm is applied to train the 26 facial landmark detectors. We use ten frames from each training sequence with the manually labeled ground truth points. The eight positions surrounding the true point are also selected as positive examples in a training image, and another five arbitrary points in the same frame are chosen as negative examples. The prototypical 128-dimensional feature vector is used for each sample point. In the testing images, candidate points are first extracted from the facial region using the scale invariant feature. For a given facial landmark, the AdaBoost classifier outputs a response depicting the similarity between the representations of the candidate points and the learned training model. After the entire facial region is checked, the position with the highest response reveals the landmark point.

### 4.2 Tracking results

In this section, we present the experimental results of the proposed multiple DE-MC filters. The positions of the facial landmarks in the first frame of an input sequence are automatically found using the detection method. The positions in all subsequent frames are then determined by the multiple particle filters with the color-based observation likelihood. The observation model is built from the training data of manually labeled sequences, using a finite set of particles within a window centered on the feature point. We approximate the posterior *p*(*X*_{k}|*Y*_{1:k}) from an appropriate proposal distribution to maintain a particle-based representation of the a posteriori probability of the state. Since the calculation of the particle weights is a critical step in multiple-point tracking, in the proposed M-mixture model we sample the particles from the training data to obtain an appropriate proposal distribution. The method proceeds by sampling from the transition priors and updating the particles using importance weights derived from Eq. (17). In the DE-MC iterations, the measurement module provides the necessary feedback to the sampling module, which accordingly moves the sampling toward regions of the state space where the global maximum of the measurement function is more likely to be found. Since we are interested in the global optimal state, we place denser sampling grids in the region of interest. This approach yields a result reasonably close to that obtained by sampling strictly according to the ground truth posterior distribution.
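The sample/weight/resample cycle described above follows the generic particle-filter pattern. A minimal one-dimensional sketch (a bootstrap filter with toy Gaussian transition and likelihood models, not the paper's Eq. (17) weights):

```python
import math
import random

def particle_filter_step(particles, transition, likelihood, rng):
    """One predict/update cycle: propagate each particle through the
    transition prior, weight by the observation likelihood, resample."""
    proposed = [transition(x, rng) for x in particles]
    weights = [likelihood(x) for x in proposed]
    total = sum(weights)
    weights = [w / total for w in weights]
    # Multinomial resampling back to equal-weight particles.
    return rng.choices(proposed, weights=weights, k=len(proposed))

rng = random.Random(0)
particles = [0.0] * 200
transition = lambda x, r: x + r.gauss(0.0, 1.0)         # random-walk prior
likelihood = lambda x: math.exp(-0.5 * (x - 2.0) ** 2)  # observation near 2.0
for _ in range(10):
    particles = particle_filter_step(particles, transition, likelihood, rng)
mean = sum(particles) / len(particles)  # concentrates near the observation
```

The DE-MC refinement replaces the blind propagation here with measurement-guided moves, which is what concentrates samples in the region of interest.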

The detectors are invoked again *n* frames after the missing points first occurred. This step length *n* can be changed by the user and is not crucial to the system. If the trackers respond correctly after a few frames, they are able to recover due to the accumulation of probabilities. However, when the step length *n* continues to grow because of incorrect detector responses, the color correlation of the observation likelihood drops and the trackers begin to lose points. At that point, "point lost" is declared: we stop estimating the point's motion *V*_{k} and discard the motion likelihood term. The trackers are then reinitialized by the point detectors in the following frames, and all 26 points can be detected with a new set of parameters if the facial region reappears in the scene. The improved result is shown in Figure 10: reinitialization executes and all facial landmarks are found again after frame 183.
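The lost-point bookkeeping can be sketched as a small state machine (an illustrative simplification; `responses` abstracts the per-frame detector correctness, and the declare-and-reinitialize action is reduced to a label):

```python
def track_with_recovery(responses, n):
    """Label each frame OK / LOST from per-frame detector correctness.

    After n consecutive incorrect responses the point is declared lost
    and the tracker is reinitialized by the detector; a correct response
    within the window lets the tracker recover on its own.
    """
    states, misses = [], 0
    for ok in responses:
        misses = 0 if ok else misses + 1
        if misses >= n:
            states.append("LOST")   # reinitialize from the detector here
            misses = 0
        else:
            states.append("OK")
    return states

# A short outage recovers; three misses in a row trigger reinitialization.
states = track_with_recovery([1, 0, 1, 0, 0, 0, 1], n=3)
```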

### 4.3 Performance evaluation

where *N*_{SUCCESS} stands for the number of SUCCESS points from the detection and tracking, *N*_{miss} stands for the number of missed points, and *N*_{false} stands for the number of false alarms. The sum *N*_{SUCCESS} + *N*_{miss} is the total number of manually labeled facial landmarks in the entire video sequence.
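The displayed equation these terms belong to did not survive extraction, so the formulas below are an assumption consistent with the stated definitions: the success rate is taken as the fraction of ground-truth landmarks correctly located, and false alarms are reported as a separate rate over detections.

```python
def detection_rates(n_success, n_miss, n_false):
    """Success rate over all ground-truth landmarks, plus a false-alarm
    rate. The exact formulas are assumptions (see lead-in), since only
    the term definitions survive in the text."""
    total_truth = n_success + n_miss        # all manually labeled points
    success_rate = n_success / total_truth
    false_alarm_rate = n_false / (n_success + n_false)
    return success_rate, false_alarm_rate

s, f = detection_rates(n_success=90, n_miss=10, n_false=5)  # s = 0.9
```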

### 4.4 Comparison with state-of-the-art

To distinguish person-independent affective states, subtle changes in facial expressions must be extracted for feature construction. Automatic facial landmark detection and tracking are crucial for analyzing the current facial appearance, since they facilitate the examination of the fine structural changes inherent in spontaneous expressions. A key motivation for developing landmark point techniques is that they lay the foundation for 3D models and the associated dynamic feature extraction and recognition techniques, which are very likely superior to 2D-based and static 3D-based techniques. We therefore first compare with the result reported in [9], which also used the RML Emotion database but with static visual features extracted by 2D Gabor filters. The comparison shows that, working on the same database, facial landmark based 3D dynamic features [49] (90% recognition rate) substantially outperform the 2D Gabor features (approximately 50%) and also the bimodal features (approximately 82%).

**Comparisons based on different public databases**

| Fiducial points | Proposed, BIOID | Proposed, BUHMAP | Expanded AAM [14], BIOID | Expanded AAM [14], BUHMAP | Factorized PF [29], BIOID | Factorized PF [29], BUHMAP | SIR PF [51], BIOID | SIR PF [51], BUHMAP | Gabor PF [52], BIOID | Gabor PF [52], BUHMAP |
|---|---|---|---|---|---|---|---|---|---|---|
| P1 | 92.89 | 91.97 | 85.45 | 86.13 | 83.66 | 83.27 | 81.35 | 79.19 | 87.42 | 89.73 |
| P2 | 94.68 | 93.06 | 87.15 | 88.56 | 84.81 | 82.99 | 79.64 | 80.61 | 86.25 | 86.44 |
| P3 | 93.33 | 89.56 | 84.21 | 83.68 | 79.40 | 76.72 | 74.39 | 75.65 | 82.91 | 80.19 |
| P4 | 90.94 | 91.76 | 84.48 | 81.95 | 78.38 | 78.49 | 76.34 | 73.71 | 82.03 | 84.97 |
| P5 | 95.31 | 94.28 | 90.33 | 89.95 | 82.67 | 82.01 | 79.08 | 80.14 | 89.50 | 90.32 |
| P6 | 88.86 | 89.59 | 80.94 | 81.38 | 77.78 | 74.91 | 75.12 | 72.92 | 79.74 | 80.63 |
| P7 | 94.99 | 93.47 | 88.34 | 87.45 | 82.40 | 82.74 | 82.94 | 81.68 | 80.42 | 81.06 |
| P8 | 89.33 | 88.45 | 83.47 | 81.97 | 79.61 | 74.96 | 76.27 | 75.54 | 79.24 | 78.62 |
| P9 | 96.01 | 94.73 | 91.04 | 90.15 | 81.92 | 82.59 | 79.35 | 76.54 | 81.48 | 79.20 |
| P10 | 86.31 | 87.14 | 79.69 | 80.41 | 74.02 | 79.14 | 76.05 | 79.13 | 79.36 | 79.06 |
| P11 | 89.03 | 90.02 | 86.63 | 87.37 | 82.33 | 81.86 | 85.02 | 82.46 | 85.24 | 83.15 |
| P12 | 85.12 | 86.24 | 80.31 | 81.06 | 75.21 | 75.83 | 73.66 | 76.98 | 79.45 | 79.13 |
| P13 | 91.92 | 93.10 | 86.69 | 87.81 | 82.70 | 83.51 | 81.12 | 83.06 | 81.28 | 83.67 |
| P14 | 84.97 | 83.15 | 79.62 | 78.38 | 78.26 | 77.44 | 78.45 | 77.96 | 79.66 | 81.71 |
| P15 | 91.24 | 92.45 | 84.71 | 84.55 | 82.83 | 81.34 | 76.52 | 75.43 | 86.01 | 86.93 |
| P16 | 89.56 | 88.74 | 85.35 | 86.16 | 78.25 | 79.59 | 78.05 | 79.29 | 86.36 | 82.98 |
| P17 | 82.49 | 86.35 | 79.22 | 81.74 | 76.17 | 78.79 | 74.62 | 73.64 | 81.31 | 84.52 |
| P18 | 89.45 | 90.12 | 86.08 | 85.15 | 81.93 | 83.58 | 80.23 | 80.71 | 85.51 | 85.78 |
| P19 | 88.94 | 90.84 | 87.11 | 87.84 | 80.81 | 82.14 | 76.03 | 75.34 | 84.18 | 86.19 |
| P20 | 91.03 | 93.21 | 82.21 | 81.16 | 79.85 | 79.62 | 76.82 | 78.02 | 84.45 | 85.56 |
| P21 | 89.62 | 88.82 | 81.74 | 81.06 | 81.48 | 84.44 | 82.82 | 79.12 | 79.82 | 80.43 |
| P22 | 93.87 | 92.53 | 83.94 | 80.09 | 85.30 | 86.52 | 79.99 | 79.21 | 85.28 | 86.58 |
| P23 | 96.40 | 94.77 | 86.37 | 85.52 | 86.23 | 85.17 | 84.80 | 82.95 | 86.34 | 84.42 |
| P24 | 90.85 | 91.06 | 85.81 | 81.03 | 79.41 | 78.30 | 79.27 | 78.13 | 84.73 | 83.96 |
| P25 | 94.97 | 95.43 | 89.24 | 86.11 | 83.22 | 83.80 | 80.59 | 80.62 | 88.27 | 86.48 |
| P26 | 91.67 | 90.28 | 83.67 | 84.84 | 76.53 | 79.43 | 76.27 | 78.74 | 77.75 | 78.58 |
| Ave. | 90.86% | | 84.51% | | 80.66% | | 78.58% | | 83.35% | |

Per-method averages in the last row are computed over both databases.

As is evident from these results, our method achieves the best overall performance, with a 90.86% average rate. Compared with the other approaches, the most significant improvement of the proposed method is that the prediction step and the measurement step are integrated rather than functioning separately. The use of the priors provides sufficient constraints for reliable tracking in the presence of appearance changes due to facial expressions. The measurement function evaluates the resemblance between the image features generated by a hypothesis and those generated by the ground truth positions, as the criterion for judging the correctness of the hypothesis.

The proposed method has demonstrated its ability to handle pose variation problems and can be used for both image- and video-based facial expression recognition. Computationally, the proposed method has the advantage of automatic initialization through scale-invariant feature extraction over methods that examine pixels one by one. Note that the method proposed in [27] achieved a better overall detection rate. However, that method was tested only on perfectly manually aligned image sequences, and no experiments under fully automatic conditions were reported. In addition, only 13 sequences were used in [27], so the result is far from conclusive.

## 5. Discussions and conclusions

Automatic facial landmark detection and tracking is a challenging task in facial expression analysis. In this paper, we proposed an automatic approach to detect and track facial landmarks across varying facial expressions. We first construct a set of facial landmark detectors based on scale-invariant features. Locating feature points automatically on a single frame makes it possible to eliminate the manual initialization step of the tracking algorithm.

We also adopt multiple DE-MC filters for facial landmark tracking. Compared with existing multi-target tracking methods, such as the joint probabilistic data association filter (JPDAF) [53], moving horizon estimation [54], various modifications of the Kalman filter [55], or interior point approaches [56], the DE-MC particle filter leads to a more reasonable approximation of the proposal distribution. It combines the strength of the Differential Evolution algorithm in global optimization with the ability of Markov Chain Monte Carlo to sample a high-dimensional state space, and it evidently boosts the performance of traditional tracking methods through more accurate motion vector prediction. Because in a visual tracking application the posterior depends on both the previous state and the current observation, the DE-MC particle filter can also considerably improve tracking accuracy by building a path connecting sampling with measurement. Taking advantage of the DE-MC algorithm, we obtain reasonably distributed samples that concentrate on important regions of the state space. A novel kernel correlation with robust color histograms is proposed for the observation likelihood to deal with changes in facial appearance across different expressions.
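For reference, the underlying DE-MC move can be sketched as follows (a ter Braak-style update of the kind the tracker's sampler builds on; the 1D Gaussian target and the parameter values are illustrative, not the paper's):

```python
import math
import random

def demc_step(chains, log_post, gamma, eps, rng):
    """One Differential-Evolution Markov-Chain sweep: each chain proposes
    x_i + gamma*(x_a - x_b) + noise from two other chains and accepts by
    the Metropolis rule, so the ensemble jointly explores the posterior."""
    out = list(chains)
    n = len(out)
    for i in range(n):
        a, b = rng.sample([j for j in range(n) if j != i], 2)
        cand = out[i] + gamma * (out[a] - out[b]) + rng.gauss(0.0, eps)
        if math.log(rng.random() + 1e-300) < log_post(cand) - log_post(out[i]):
            out[i] = cand
    return out

rng = random.Random(1)
log_post = lambda x: -0.5 * (x - 3.0) ** 2   # toy Gaussian posterior at 3.0
chains = [rng.uniform(-10.0, 10.0) for _ in range(20)]
for _ in range(300):
    chains = demc_step(chains, log_post, gamma=0.8, eps=0.05, rng=rng)
mean = sum(chains) / len(chains)             # chains settle around the mode
```

The difference-of-chains proposal is what adapts the step scale and direction to the shape of the posterior, which is the global-optimization strength referred to above.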

Furthermore, the facial landmarks are tracked by utilizing prior knowledge on the facial feature configurations. It provides a consistent way to resolve the ambiguities that arise in associating multiple objects with measurements of the similarity criterion between the target points and the candidate points. Instead of simply applying the single DE-MC filter for multiple point tracking, we utilize the M-component non-parametric mixture model for the multiple DE-MC filters' posterior distribution over the states of all target points. This approach yields a result reasonably close to that obtained by sampling strictly according to the ground truth posterior distribution.
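Sampling from such an M-component mixture can be sketched as a two-stage draw: first a component by its weight, then a particle from within that component (the particle sets and weights below are toy values, not the learned model):

```python
import random

def sample_mixture(component_particles, weights, k, rng):
    """Draw k samples from an M-component mixture: pick a component m
    with probability weights[m], then a particle uniformly from it."""
    comps = rng.choices(range(len(component_particles)), weights=weights, k=k)
    return [rng.choice(component_particles[m]) for m in comps]

rng = random.Random(2)
parts = [[(1, 1), (1, 2)], [(9, 9), (9, 8)]]  # toy particle sets, 2 targets
draws = sample_mixture(parts, [0.5, 0.5], k=100, rng=rng)
```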

For future work, we plan to improve the detection and tracking performance and extend our real-time algorithm to cope with both self and other forms of occlusions.

## References

1. Salah AA, Cinar H, Akarun L, Sankur B: Robust facial landmarking for registration. *Annals of Telecommunications* 2007, 62(1):1608-1633.
2. Zeng Z, Pantic M, Roisman GI, Huang TS: A survey of affect recognition methods: audio, visual, and spontaneous expressions. *IEEE Transactions on Pattern Analysis and Machine Intelligence* 2009, 31(1):39-58.
3. Brunelli R, Poggio T: Face recognition: features versus templates. *IEEE Transactions on Pattern Analysis and Machine Intelligence* 1993, 15(10):1042-1062. 10.1109/34.254061
4. Herpers R, Sommer G: *An Attentive Processing Strategy for the Analysis of Facial Features*. Berlin Heidelberg, New York: Springer-Verlag; 1998.
5. Pardas M, Losada M: Facial parameter extraction system based on active contours. *International Conference on Image Processing*, Thessaloniki, 2001, 1:1058-1061.
6. Akakin HC, Sankur B: Multi-attribute robust facial feature localization. *Automatic Face & Gesture Recognition* 2008, 1-6.
7. Cohen I, Sebe N, Garg A, Lew MS, Huang TS: Facial expression recognition from video sequences. *Proceedings of the International Conference on Multimedia and Expo* 2002, 2:121-124.
8. Pantic M, Rothkrantz LJM: Facial action recognition for facial expression analysis from static face images. *IEEE Transactions on Systems, Man, and Cybernetics-Part B* 2004, 34:1449-1461. 10.1109/TSMCB.2004.825931
9. Silva LCD, Hui SC: Real-time facial feature extraction and emotion recognition. *Proceedings of the 4th International Conference on Information Communications and Signal Processing* 2003, 3:1310-1314.
10. Anderson K, McOwan PW: A real-time automated system for the recognition of human facial expressions. *IEEE Transactions on Systems, Man, and Cybernetics-Part B* 2006, 36(1):96-105.
11. Lyons MJ, Budynek J, Plante A, Akamatsu S: Classifying facial attributes using a 2-D Gabor wavelet representation and discriminant analysis. *Proceedings of the 4th International Conference on Automatic Face and Gesture Recognition* 2000, 202-207.
12. Cohen I, Sebe N, Sun Y, Lew MS, Huang TS: Evaluation of expression recognition techniques. *Proceedings of the International Conference on Image and Video Retrieval* 2003, 184-195.
13. Wang Y, Guan L: Recognizing human emotional state from audiovisual signals. *IEEE Transactions on Multimedia* 2008, 10(5):659-668.
14. Cootes TF, Taylor C, Cooper D, Graham J: Active shape models: their training and their applications. *Computer Vision and Image Understanding* 1995, 61(1):38-59. 10.1006/cviu.1995.1004
15. Cootes TF, Edwards GJ, Taylor CJ: Active appearance models. *European Conference on Computer Vision* 1998, 2:484-498.
16. Cristinacce D, Cootes T: Automatic feature localization with constrained local models. *Pattern Recognition* 2008, 41:3054-3067. 10.1016/j.patcog.2008.01.024
17. Milborrow S, Nicolls F: Locating facial features with an extended active shape model. *European Conference on Computer Vision* 2008, 5305:504-513.
18. Yun T, Guan L: Automatic face detection in video sequences using local normalization and optimal adaptive correlation techniques. *Pattern Recognition* 2009, 42(9):1859-1868. 10.1016/j.patcog.2008.11.026
19. Viola P, Jones M: Robust real time object detection. *Proceedings of the 2nd International Workshop on Statistical and Computational Theories of Vision* 2001.
20. Lowe DG: Distinctive image features from scale-invariant keypoints. *International Journal of Computer Vision* 2004, 60(2):91-110.
21. Rhodes I: A tutorial introduction to estimation and filtering. *IEEE Transactions on Automatic Control* 1971, 16(6):688-706. 10.1109/TAC.1971.1099833
22. Simon D: *Optimal State Estimation*. New Jersey: John Wiley & Sons; 2006.
23. Simon D: *Optimal State Estimation: Kalman, H Infinity, and Nonlinear Approaches*. 1st edition. Wiley-Interscience; 2006.
24. Julier S, Uhlmann J: Unscented filtering and nonlinear estimation. *Proceedings of the IEEE* 2004, 92(3):401-422. 10.1109/JPROC.2003.823141
25. Akakin HC, Sankur B: Robust classification of face and head gestures in video. *Image and Vision Computing* 2011, 29:470-483. 10.1016/j.imavis.2011.03.001
26. Pitt MK, Shephard N: Filtering via simulation: auxiliary particle filters. *Journal of the American Statistical Association* 1999, 94:590-599. 10.1080/01621459.1999.10474153
27. Patras I, Pantic M: Particle filtering with factorized likelihoods for tracking facial features. *Sixth IEEE International Conference on Automatic Face and Gesture Recognition*, Seoul, Korea, 2004:97-102.
28. Rui Y, Chen Y: Better proposal distributions: object tracking using unscented particle filter. *Proceedings of the International Conference on Computer Vision and Pattern Recognition* 2001, 2:786-793.
29. Deutscher J, Blake A, Reid I: Automatic partitioning of high dimensional search spaces associated with articulated body motion capture. *Proceedings of the International Conference on Computer Vision and Pattern Recognition* 2001, 2:669-676.
30. Du M, Guan L: Monocular human motion tracking with the DE-MC particle filter. *IEEE International Conference on Acoustics, Speech and Signal Processing* 2006, 2:14-19.
31. Maghami M, Zoroofi RA, Araabi BN, Shiva M, Vahedi E: Kalman filter tracking for facial expression recognition using noticeable feature selection. *International Conference on Intelligent and Advanced Systems* 2007.
32. Hue C, Le Cadre JP, Pérez P: Tracking multiple objects with particle filtering. *IEEE Transactions on Aerospace and Electronic Systems* 2002, 38:791-812. 10.1109/TAES.2002.1039400
33. Isard M, MacCormick J: BraMBLe: a Bayesian multiple-blob tracker. *Proceedings of the IEEE International Conference on Computer Vision* 2001, 2:34-41.
34. Vermaak J, Doucet A, Pérez P: Maintaining multi-modality through mixture tracking. *Ninth IEEE International Conference on Computer Vision* 2003, 2:1110-1116.
35. Yu T, Wu Y: Collaborative tracking of multiple targets. *IEEE CVPR*, Washington, D.C., 2004.
36. Liyue Z, Jianhua T: Fast facial feature tracking with multi-cue particle filter. *Image and Vision Computing New Zealand (IVCNZ)*, Hamilton, New Zealand, 2007.
37. MacCormick J, Blake A: Partitioned sampling, articulated objects and interface-quality hand tracking. *Proceedings of the European Conference on Computer Vision* 2000, 2:3-19.
38. Pérez P, Hue C, Vermaak J, Gangnet M: Color-based probabilistic tracking. *European Conference on Computer Vision* 2002.
39. Silverman B: *Density Estimation for Statistics and Data Analysis*. Chapman and Hall; 1986:254-259.
40. Khan Z, Balch T, Dellaert F: Efficient particle filter-based tracking of multiple interacting targets using an MRF-based motion model. *IEEE International Conference on Intelligent Robots and Systems* 2003.
41. Kanade T, Cohn J, Tian Y: Comprehensive database for facial expression analysis. *IEEE International Conference on Automatic Face and Gesture Recognition* 2000, 46-53.
42. Baron-Cohen S, Golan O, Wheelwright S, Hill J: *Mind Reading: The Interactive Guide to Emotions*. London: Jessica Kingsley; 2004.
43. Vukadinovic D, Pantic M: Fully automatic facial feature point detection using Gabor feature based boosted classifiers. *IEEE International Conference on Systems, Man and Cybernetics*, Waikoloa, 2005.
44. Wu B, Ai H, Huang C, Lao S: Fast rotation invariant multi-view face detection based on RealAdaboost. *Proceedings of the IEEE International Conference on Automatic Face and Gesture Recognition* 2004.
45. Friedman J, Hastie T, Tibshirani R: Additive logistic regression: a statistical view of boosting. *The Annals of Statistics* 2000, 28:337-374.
46. Vezhnevets A, Vezhnevets V: Modest AdaBoost: teaching AdaBoost to generalize better. *Graphicon-2005*, Novosibirsk Akademgorodok, Russia, 2005.
47. GML AdaBoost Matlab Toolbox. http://research.graphicon.ru/machine-learning/gml-adaboost-matlab-toolbox.html
48. Matthews I, Baker S, Ishikawa T: The template update problem. *IEEE Transactions on Pattern Analysis and Machine Intelligence* 2004, 26(6):1115-1118.
49. Tie Y, Guan L: Human emotion recognition using a deformable 3D facial expression model. *IEEE International Symposium on Circuits and Systems (ISCAS)*, Seoul, Korea, 2012.
50. Jesorsky O, Kirchberg K, Frischholz R: Robust face detection using the Hausdorff distance. *International Conference on Audio- and Video-Based Biometric Person Authentication*. Springer; 2001:90-95.
51. Fazli S, Afrouzian R, Seyedarabi H: Fiducial facial points tracking using particle filter and geometric features. *Ultra Modern Telecommunications and Control Systems and Workshops* 2010, 396-400.
52. Valstar M, Pantic M: Fully automatic facial action unit detection and temporal analysis. *Computer Vision and Pattern Recognition Workshop* 2006.
53. Rasmussen C, Hager G: Probabilistic data association methods for tracking complex visual objects. *IEEE Transactions on Pattern Analysis and Machine Intelligence* 2001, 560-576.
54. Goodwin G, Seron M, De Doná J: *Constrained Control and Estimation*. Springer; 2005.
55. Yang C, Blasch E: Kalman filtering with nonlinear state constraints. *IEEE Transactions on Aerospace and Electronic Systems* 2008, 45(1):70-84.
56. Bell B, Burke J, Pillonetto G: An inequality constrained nonlinear Kalman–Bucy smoother by interior point likelihood maximization. *Automatica* 2009, 45(1):25-33. 10.1016/j.automatica.2008.05.029

## Copyright

This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.