Open Access

Gauss–Laguerre wavelet textural feature fusion with geometrical information for facial expression identification

  • Ahmad Poursaberi1,
  • Hossein Ahmadi Noubari2,
  • Marina Gavrilova1 and
  • Svetlana N Yanushkevich1

EURASIP Journal on Image and Video Processing 2012, 2012:17

DOI: 10.1186/1687-5281-2012-17

Received: 1 November 2011

Accepted: 3 September 2012

Published: 25 September 2012

Abstract

Facial expressions are a valuable source of information that accompanies facial biometrics. Early detection of physiological and psycho-emotional data from facial expressions is linked to the situational awareness module of any advanced biometric system for personal state re/identification. In this article, a new method that utilizes both texture and geometric information of facial fiducial points is presented. We investigate Gauss–Laguerre wavelets, which have rich frequency extraction capabilities, to extract texture information of various facial expressions. Rotation invariance and the multiscale approach of these wavelets make the feature extraction robust. Moreover, geometric positions of fiducial points provide valuable information for upper/lower face action units. The combination of these two types of features is used for facial expression classification. The performance of this system has been validated on three public databases: the JAFFE, the Cohn-Kanade, and the MMI databases.

Keywords

Facial expression · Gauss–Laguerre wavelet · Feature fusion · Texture analysis

Introduction

Automatic facial expression recognition (AFER) is of interest to researchers because of its importance for facial biometric-based intelligent support systems. It provides a behavioral measure to assess emotions, cognitive processes, and social interaction [1]. Examples of applications of AFER include robotics, human–computer interfaces, behavioral science, animations and computer games, educational software, emotion processing, and fatigue detection. Despite limitations and difficulties such as occlusion, lighting conditions, and variation of expressions across the population, or even for an individual, an automatic system helps in creating intelligent visual media for understanding different expressions. Moreover, this understanding helps in building meaningful and responsive HCI interfaces.

Each AFER implements three main functions: face detection and tracking, feature extraction, and expression classification. The first attempt towards AFER was made in 1978 by Suwa et al. [2], who presented a system for facial expression analysis from video, tracking 20 points as features. Before that, only two approaches existed for FER [3]: (a) human observer-based coding systems, which are subjective, time-consuming, and hard to standardize, and (b) electromyography-based systems, which are invasive (they need sensors on the face). Muscle actions result in various facial behaviors and motions, which can then be used to represent the corresponding facial expressions. These assumptions became the basis for developing the following systems for coding multiple facial expressions and emotions:
  1. The Facial Action Coding System (FACS)—Ekman and Friesen [4].

  2. The Facial Animation Parameters (FAPs)—MPEG-4 standard, SNHC [5].
In the study of Ekman and Friesen [6], it was shown that six emotions—anger, disgust, fear, happiness, sadness, and surprise—are “discriminable within any one literate culture”. Sometimes, a neutral expression is considered as a seventh expression. The FACS describes facial expressions in terms of action units (AUs). It explains how to identify different facial expressions based on the activation of various facial muscles, individually or in groups. It contains 46 AUs, which are basic facial movements corresponding to different muscle activities. It is not an easy task to recognize AUs automatically, given an image or a video. There are two main approaches to AU recognition:
  1. Processing 2D static images.

  2. Processing image sequences.

The first approach, which is more difficult than processing image sequences since less information is available, often uses feature-based methods. Using only one image for expression recognition requires robust and highly distinctive features to cope with variations in human subjects or imaging conditions [3]. There are several methods to process still images. One of them is PCA-based holistic representations with feed-forward neural networks (NN) for classification, proposed by Cottrell and Metcalfe [7]. Chen and Huang [8] used clustering-based feature extraction to recognize only three facial expressions. Eigenface feature extraction accompanied by principal component analysis (PCA) was proposed by Turk and Pentland [9]. Holistic representations and NNs were applied to pyramid-structured images by Rahardja et al. [10]. Feng et al. [11] applied local binary patterns for feature extraction and used a linear programming technique as the classifier. Deformable models were utilized by Lanitis et al. [12] to capture variations in shape and grey-level appearance. In the second approach, an image sequence displays one expression. The neutral face is used as a baseline, and FER is based on the difference between the baseline and the following input face image. Preliminary work on facial expressions, by tracking the motion of 20 identified spots, was done by Suwa et al. [13]. Motion tracking of facial features in image sequences has been performed by optical flow, with expressions classified into six basic classes [14]. The Fourier transform was utilized for feature extraction, and fuzzy C-means clustering was applied to build a spatiotemporal model for each expression in [15].

Facial coding is normally performed in two different ways: holistic and analytic. In the holistic approach, the face is treated as a whole. Different methods have been presented in this approach, including [16, 17]: optical flow, Fisher linear discriminants, NNs, active appearance models (AAMs), and Gabor filters. In the analytic approach, local features are used instead of the whole face; namely, fiducial points describe the position of important points on the face (e.g., eyes, eyebrows, mouth, nose, etc.), together with the geometry or texture features around these points [18].

Gabor filters are widely used in texture analysis. These filters model simple cells in the primary visual cortex. Zafeiriou and Pitas [19] showed the best performance of Gabor filters in both analytic and holistic approaches. Gabor filters have been used for expression classification in [4, 20]. Although Gabor filters show high performance in FER, the main problem in using them is how to select the optimal filter in terms of scale and orientation. For example, in [20], 40 filters (5 scales and 8 orientations) are used. Because of the large number of convolution operations, this requires large amounts of memory and computation. Moreover, with small training samples, the dimensionality is very high [21]. Normally, two types of facial features are used: permanent and transient. Permanent features include the eyes, lips, brows, and cheeks; transient features include facial lines, wrinkles, and furrows. The eyebrows and mouth play the main role in facial expressions. Pardas and Bonafonte [22] showed that expressions such as surprise, joy, and disgust have much higher recognition rates, since clear motion of the mouth and the eyebrows is involved.

In this article, the combined texture and geometric information of face fiducial points is used to code different expressions. Gauss–Laguerre (GL) wavelets are used for texture analysis, and the positions of 18 fiducial points represent the deformation of the eyes, eyebrows, and mouth. The combination of these features is used for expression classification. The K-nearest neighbor (KNN) classifier assigns expressions based on the closest training examples in the feature space. The rest of the article is organized as follows: in the “GL wavelets” section, a mathematical description of GL circular harmonic wavelets (CHWs) is presented; the feature extraction approach and the classification method are described in “The proposed approach” section; experimental results using the JAFFE, the Cohn-Kanade, and the MMI face databases are reported in the “Experiment results” section; finally, a conclusion is drawn in the “Conclusion” section.

GL wavelets

The CHWs are polar-separable wavelets with a harmonic angular shape. They are steerable in any desired direction by simple multiplication with a complex steering factor; thus, they are referred to as self-steerable wavelets. The CHWs were first introduced in [23] and utilize the concepts of circular harmonic functions (CHFs) employed in optical correlation for rotation-invariant pattern recognition. A CHF is represented in polar coordinates [21] as

$$ f_{nK}(r,\theta) = V_{nK}(r)\, e^{jn\theta} \tag{1} $$

where n is the order, K is the degree of the CHF, and $V_{nK}(r)$ is the radial profile.

The same functions also appear in harmonic tomographic decomposition and have been considered for the analysis of local image symmetry. CHFs have been employed for defining rotation-invariant pattern signatures. A family of orthogonal CHWs, forming a multi-resolution pyramid referred to as the circular harmonic pyramid (CHP), is utilized for coefficient generation and coding. Each CHW pertaining to the pyramid represents the image by translated, dilated, and rotated versions of a CHF. At the same time, for a fixed resolution, the CHP orthogonal system provides a local representation of the given image around a point in terms of CHFs. The self-steerability of each component of the CHP can be exploited for pattern analysis in the presence of rotation (in addition to translation and dilation), in particular for pattern recognition irrespective of orientation.

CHFs are complex, polar-separable filters characterized by a harmonic angular shape, which allows building rotation-invariant descriptors. A scale parameter is also introduced to perform multi-resolution analysis. The GL filters, a family of orthogonal functions satisfying the wavelet admissibility condition required for multi-resolution wavelet pyramid analysis, are used here. Similar to Gabor wavelets, any image may be represented by translated, dilated, and rotated replicas of the GL wavelet. For a fixed resolution, the GL CHFs provide a local representation of the image in the polar coordinate system centered at a given point, named the pivot point. This representation is called the GL transform [24].

For a given image $I(x,y) \in L^2(\mathbb{R}^2, dx^2)$, the expression $I_p(r,\theta) = I(\tilde{x} + r\sin\theta,\ \tilde{y} - r\cos\theta)$ is its representation in the polar coordinate space centered at the pivot $(\tilde{x}, \tilde{y})$. $I_p(\cdot)$ can be decomposed in terms of CHFs, based on its periodicity with respect to θ:

$$ I_p(r,\theta) = \sum_{n} V_n(r)\, e^{jn\theta} \tag{2} $$

where the radial profile $V_n(r)$ is given by the Fourier integral:

$$ V_n(r) = \frac{1}{2\pi} \int_0^{2\pi} I_p(r,\theta)\, e^{-jn\theta}\, d\theta \tag{3} $$
The radial profiles are expanded in a series of weighted orthogonal functions, the GL CHFs:

$$ \mathcal{L}_{nK}(r,\theta) = (-1)^{K} \left[ 2^{n+1}\, \pi^{\frac{n}{2}}\, \frac{K!}{(n+K)!} \right]^{\frac{1}{2}} r^{n}\, L_{K}^{n}(2\pi r^{2})\, e^{-\pi r^{2}}\, e^{jn\theta} \tag{4} $$

where $L_{K}^{n}(r)$ is the generalized Laguerre polynomial defined by

$$ L_{K}^{n}(r) = \sum_{h=0}^{K} (-1)^{h} \binom{n+K}{K-h} \frac{r^{h}}{h!} \tag{5} $$
As any CHF, GL functions are self-steering, i.e., they are rotated by an angle φ when multiplied by the factor $e^{jn\varphi}$. In particular, the real and imaginary parts of each GL function form a geometrical pair in phase quadrature. Moreover, GL functions are isomorphic to their Fourier transform. It is shown in [24] that each GL function defines an admissible dyadic wavelet. Thus, the redundant set of wavelets corresponding to different GL functions represents a self-steering pyramid, utilized for local and multiscale image analysis. The real part of the GL function is depicted in Figure 1a. An important feature, applicable to facial expression recognition, is that GL functions with various degrees of freedom can be tuned to significant visual features. For example, for n = 1, GLs are tuned to edges, for n = 2 to ridges, for n = 3 to equiangular forks, and for n = 4 to orthogonal crosses, irrespective of their actual orientation [25]. Given an image I(x,y), for every site of the plane, it is possible to perform the GL analysis by convolving it with each properly scaled GL function as follows:
$$ g_{jnK}(x,y) = \frac{1}{a^{2j}}\, \mathcal{L}_{nK}\!\left( \frac{r\cos\theta}{a^{2j}},\ \frac{r\sin\theta}{a^{2j}} \right) \tag{6} $$

where $a^{2j}$ are the dyadic scale factors. Figure 1b shows a plot of GL CHFs for a fixed dyadic scale factor and varying n and K.
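As an illustration, a single complex GL kernel of Eq. (4) can be sampled on a small grid and applied to an image by FFT-based convolution. The grid size, coordinate scaling, and normalization below are our assumptions for a sketch, not the paper's exact settings:

```python
import math
import numpy as np

def laguerre_poly(n, K, r):
    # Generalized Laguerre polynomial L_K^n(r), Eq. (5); works elementwise on arrays
    return sum((-1) ** h * math.comb(n + K, K - h) * r ** h / math.factorial(h)
               for h in range(K + 1))

def gl_kernel(n=2, K=1, scale=2.0, size=31):
    """Complex Gauss-Laguerre CHF sampled on a size x size grid (Eq. (4)),
    with coordinates shrunk by `scale` in the spirit of Eq. (6)."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1] / (scale * half / 2)
    r = np.hypot(x, y)
    theta = np.arctan2(y, x)
    c = (-1) ** K * (2 ** (n + 1) * math.pi ** (n / 2)
                     * math.factorial(K) / math.factorial(n + K)) ** 0.5
    radial = c * r ** n * laguerre_poly(n, K, 2 * math.pi * r ** 2) * np.exp(-math.pi * r ** 2)
    return radial * np.exp(1j * n * theta) / scale ** 2

kernel = gl_kernel()                           # n = 2 (ridges), K = 1, as in the paper
image = np.random.default_rng(0).random((128, 96))
# FFT-based convolution; zero-padding the kernel to the image size
response = np.fft.ifft2(np.fft.fft2(image) * np.fft.fft2(kernel, s=image.shape))
```

The complex-valued response has the same size as the 128 × 96 input, matching the description in the “Textural feature extraction” section.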
Figure 1

(a) Real part of a GL function; n = 4, K = 1, j = 2. (b) Real part of GL CHFs: variation of the filter in the spatial domain with fixed scale, K = 0, 1, …, 4, and n = 1, 2, …, 5.

The proposed approach

In this section, the algorithmic steps of the proposed approach are explained. For each input image, the face area is localized first. Then, features are extracted based on GL filters, and, finally, KNN classification is used for expression recognition.

Preprocessing

Preprocessing is normally performed before feature extraction for FER, in order to increase system performance. The aim of this step, which includes scaling, intensity normalization, and size equalization, is to obtain images that contain only a face expressing a certain emotion. Sometimes, histogram equalization is also used to adjust image brightness and contrast. To normalize the face, the image with the neutral expression is scaled so that it has a fixed distance between the eyes. No intensity normalization has been considered, since the GL filters can extract an abundance of features without any preprocessing.

For face localization, we used the well-known Viola–Jones algorithm, which is based on Haar-like features and the AdaBoost learning algorithm. The localized face image is cropped automatically and resized to 128 × 96 pixels. The next step is to normalize the geometry, so that the recognition method is robust to individual differences. For the normalization, we need to extract the eye locations. Figure 2 shows an example of the normalization procedure. The output of this step is used directly for textural feature extraction. In this experiment, if the face was not cropped well, the cropping was done manually, since the purpose of this article is to propose a new feature extraction method for facial expression recognition rather than a face detector.
Figure 2

Normalization procedure (left to right): (a) input image (from the JAFFE database), (b) the extracted AAM fiducial points, (c) normalized image to have fixed distance between eyes, (d) the localized and resized face.

To extract facial landmarks, the AAM is utilized. It is widely used in face recognition and expression classification due to its remarkable performance in extracting face shape and texture information. The AAM [12] contains both a statistical shape model and texture information of the face, and performs matching by finding the model parameters that minimize the difference between the image and the synthesized model. We used 18 fiducial points to model the face and distinguish facial expressions; the features used to distinguish the latter are explained in section “AAM”. In our experiment, the AAM has been created using different images from the three databases with different expressions. All images were roughly resized and cropped to 256 × 256. After creating the AAM, the eye positions in each image are automatically extracted, and the line which connects the inner corners of the eyes is used for normalization.
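The eye-based normalization might be sketched as a similarity transform that maps the line connecting the inner eye corners to a horizontal segment of fixed length. The target distance and the pure-numpy formulation are assumptions, as the paper does not give its exact parameters:

```python
import numpy as np

def normalize_by_eyes(points, left_eye, right_eye, target_dist=32.0):
    """Similarity-transform fiducial points so the inner eye corners lie on a
    horizontal line a fixed distance apart (target_dist is illustrative)."""
    d = np.asarray(right_eye, float) - np.asarray(left_eye, float)
    scale = target_dist / np.hypot(*d)
    angle = -np.arctan2(d[1], d[0])            # rotate the eye line to horizontal
    c, s = np.cos(angle), np.sin(angle)
    R = scale * np.array([[c, -s], [s, c]])
    centre = (np.asarray(left_eye, float) + np.asarray(right_eye, float)) / 2
    return (np.asarray(points, float) - centre) @ R.T

# already-horizontal eyes 32 px apart are mapped to (-16, 0) and (16, 0)
pts = normalize_by_eyes([[10, 20], [42, 20]], left_eye=[10, 20], right_eye=[42, 20])
```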

AAM

AAM [12] is an algorithm for matching a statistical shape model to an image with both shape and appearance variations. In facial expression recognition, for example, these deformations include facial expression changes and pose variations, along with texture variations caused by illumination. These variations are represented by a linear model such as PCA. The main purpose of the AAM is thus first to define a model, and then to find the best-matching parameters between a given new image and the built model using a fitting algorithm.

Normally, the fitting algorithm is iterated until the shape and appearance parameters converge. The shape model is created by combining the vectors constructed from the points of the labeled images:

$$ s = s_0 + \sum_{i=1}^{m} p_i\, s_i \tag{7} $$
where $p_i$ are the shape parameters. The mean shape $s_0$ and the m shape basis vectors $s_i$ are obtained by applying PCA to the training data; they are the m eigenvectors corresponding to the m largest eigenvalues. Before applying PCA, the landmark points are normalized. The appearance variation is represented by a linear combination of a mean appearance $A_0(x)$ and n appearance basis vectors $A_i(x)$ as

$$ A(x) = A_0(x) + \sum_{i=1}^{n} \alpha_i\, A_i(x) \tag{8} $$

where $\alpha_i$ are the appearance parameters. After finding the shape and appearance parameters, a piecewise affine warp is used to construct the AAM by mapping each pixel of the appearance onto the inside of the current shape. The goal is to minimize the difference between the warped image and the appearance image.
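The linear shape model of Eq. (7) can be illustrated with a PCA over aligned landmark vectors. The toy data, function names, and number of retained modes below are assumptions for a sketch:

```python
import numpy as np

def build_shape_model(shapes, m=2):
    """PCA shape model: mean s0 and the m leading basis vectors s_i (Eq. (7))."""
    X = np.asarray(shapes, float)              # each row: flattened landmark coords
    s0 = X.mean(axis=0)
    U, S, Vt = np.linalg.svd(X - s0, full_matrices=False)
    return s0, Vt[:m]                          # directions of largest variance

def synthesize(s0, basis, p):
    """Reconstruct a shape from parameters p: s = s0 + sum_i p_i * s_i."""
    return s0 + np.asarray(p) @ basis

rng = np.random.default_rng(1)
shapes = rng.normal(size=(50, 36))             # 50 training shapes, 18 (x, y) points
s0, basis = build_shape_model(shapes, m=2)
s = synthesize(s0, basis, [0.5, -0.3])
```

The rows of `basis` are orthonormal, so each parameter $p_i$ independently scales one mode of shape variation.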

Feature extraction

The feature vector consists of two types: textural features, which are extracted globally by applying the GL filter, and the geometric information of local fiducial points.

Textural feature extraction

To extract facial features, we used GL functions, which provide a self-steering pyramidal analysis structure. By a proper choice of the GL function parameters (scale, order, and degree of the CHF, explained below), it is possible to generate a set of redundant wavelets and, thus, an accurate extraction of the complex texture features of facial expressions. The redundancy of the wavelet transform and the higher degree of freedom in the selected parameters make the GL function highly suitable for facial texture extraction compared to Gabor wavelets. To take advantage of the degrees of freedom provided by the GL function, the filter parameters have to be tuned to significant visual features and texture patterns, so that the filter extracts the desired frequency information of facial texture patterns. In our experiment, it was found via several simulations, in which the filters are convolved directly with the face image, that the best results are obtained for n = 2. The other parameters (scale and degree) were adjusted in the same manner; the best results were obtained for a = 2, k = 1. The filtered image has the same size as the input image (128 × 96) and is complex-valued. Figure 3 shows examples of GL filtering with various parameters a and k. Unlike the Gabor filter bank, there is no need to construct multiple filters: a single tuned GL filter is sufficient for feature selection. The size of the textural feature vector (128 × 96 = 12,288) is quite large for fusion with the geometric information. When the dimension of the input vector is large and the data are highly correlated, redundancy can be removed by methods such as PCA. During PCA, the components of the input vectors are orthogonalized, which means they are no longer correlated with each other. The components are ordered so that those with the largest variation come first, and those with low variation are eliminated. The data are usually normalized before performing PCA to have zero mean and unit variance. In our case, the size of the feature vector after down-sampling by a factor of 4 is 3072, which is further reduced to 384 components per image using PCA.
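The reduction pipeline above (down-sampling the 128 × 96 response by a factor of 4 overall, then projecting onto the leading principal components) can be sketched as follows; the random input standing in for filter responses, the use of the magnitude, and the SVD-based PCA are illustrative assumptions:

```python
import numpy as np

def reduce_features(responses, n_components=384):
    """Down-sample each 128x96 magnitude map by 2 along each axis (factor 4
    overall, 12288 -> 3072), then project onto the leading PCA components."""
    X = np.stack([np.abs(r)[::2, ::2].ravel() for r in responses])  # (N, 3072)
    mu = X.mean(axis=0)
    Xc = X - mu                                # center before PCA
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    k = min(n_components, Vt.shape[0])
    return Xc @ Vt[:k].T                       # (N, k) reduced features

rng = np.random.default_rng(2)
feats = reduce_features([rng.random((128, 96)) for _ in range(20)])
```

Note that with only 20 sample images PCA can retain at most 20 components; the 384 components used in the text presuppose a correspondingly larger training set.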
Figure 3

The GL pyramid response for three different filters.

Geometric feature extraction

As mentioned in section “Preprocessing”, 18 fiducial points are put together to construct the model. These points are extracted automatically, based on the AAM model. The coordinates of these fiducial points are used to calculate 15 Euclidean distances. Different expressions result in different deformations of the corresponding facial components, especially near the eyes and mouth. The selected geometric feature extractions are performed as follows.
  1. The AAM is applied to extract the 18 points. The distances are labeled by d’s, as shown in Figure 4.
Figure 4

Geometric feature selection based on fiducial points.

  2. For the upper portion of the face, ten distances are calculated, according to Table 1.
Table 1

Upper (distances 1–10) and lower (distances 11–15) face geometric distances. $d_i^E$ and $d_i^N$ denote the i-th distance in the expression image and in the neutral image, respectively.

| Meaning | Distance |
| --- | --- |
| left inner brow–left inner eye corner | $P_1 = (d_3^E - d_3^N)/d_3^N$ |
| right inner brow–right inner eye corner | $P_2 = (d_4^E - d_4^N)/d_4^N$ |
| left top brow–line connecting left eye corners | $P_3 = (d_1^E - d_1^N)/d_1^N$ |
| right top brow–line connecting right eye corners | $P_4 = (d_2^E - d_2^N)/d_2^N$ |
| left eye height | $P_5 = \big((d_5^E + d_7^E) - (d_5^N + d_7^N)\big)/(d_5^N + d_7^N)$ |
| right eye height | $P_6 = \big((d_6^E + d_8^E) - (d_6^N + d_8^N)\big)/(d_6^N + d_8^N)$ |
| left top eye point–line connecting left eye corners | $P_7 = (d_5^E - d_5^N)/d_5^N$ |
| right top eye point–line connecting right eye corners | $P_8 = (d_6^E - d_6^N)/d_6^N$ |
| left bottom eye point–line connecting left eye corners | $P_9 = (d_7^E - d_7^N)/d_7^N$ |
| right bottom eye point–line connecting right eye corners | $P_{10} = (d_8^E - d_8^N)/d_8^N$ |
| mouth height | $P_{11} = \big((d_{12}^E + d_{13}^E) - (d_{12}^N + d_{13}^N)\big)/(d_{12}^N + d_{13}^N)$ |
| mouth width | $P_{12} = (d_{11}^E - d_{11}^N)/d_{11}^N$ |
| left lip corner–line connecting left eye corners | $P_{13} = (d_9^E - d_9^N)/d_9^N$ |
| right lip corner–line connecting right eye corners | $P_{14} = (d_{10}^E - d_{10}^N)/d_{10}^N$ |
| top lip–line connecting lip corners | $P_{15} = (d_{13}^E - d_{13}^N)/d_{13}^N$ |
  3. For the lower portion of the face, five distances are calculated, according to Table 1.
The final feature vector is a combination of the two feature types. Since the dimension of the texture feature vector is 384 and the dimension of the geometric feature vector is 15, the total size is 399. During the simulations, it was observed that geometric features are more important than texture ones. To find the appropriate weight coefficients for both types of features, the average recognition rate versus the weight coefficient for geometric features was monitored. Let the textural feature be $F_T$ and the geometrical feature $F_G$; the final feature vector is $F = \alpha F_T + \beta F_G$. The average recognition rate was obtained using the “Leave-One-Out” approach in three trials (see section “Experiment results”). In each trial, and for each database, one random image is used for testing and the rest are used for training; hence the three trials, generally speaking, have different train/test sets. The coefficient for geometric features was varied starting from 0.5 in steps of 0.01. Figure 5 shows the average recognition rate versus the geometric coefficient weight. The best average recognition rate was obtained for β = 0.69 and α = 0.31.
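Each geometric feature in Table 1 is the relative change of a fiducial distance with respect to the neutral face, and, since the 384-D texture vector and 15-D geometric vector give the stated total of 399, $F = \alpha F_T + \beta F_G$ is read here as a weighted concatenation. A sketch under these assumptions (function names and example values are illustrative):

```python
import numpy as np

def geometric_features(dists_expr, dists_neutral):
    """Relative deformation of each fiducial distance w.r.t. the neutral face,
    e.g. P1 = (d3_E - d3_N) / d3_N as in Table 1."""
    dE = np.asarray(dists_expr, float)
    dN = np.asarray(dists_neutral, float)
    return (dE - dN) / dN

def fuse(f_texture, f_geometry, alpha=0.31, beta=0.69):
    """Weighted concatenation of the texture and geometric feature vectors."""
    return np.concatenate([alpha * np.asarray(f_texture, float),
                           beta * np.asarray(f_geometry, float)])

p = geometric_features([12.0, 40.0], [10.0, 32.0])   # e.g. widened mouth, raised brow
F = fuse(np.ones(384), np.ones(15))                  # fused vector of length 399
```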
Figure 5

Variation of coefficient for geometric features versus average recognition rate.

Classification

KNN is a well-known instance-based classification algorithm [26] that makes no assumptions about the underlying data distribution. The similarity between the test sample and the training samples is calculated, and the k most similar samples are determined. The class of the test sample is then decided based on the classes of its k nearest neighbors.

This classifier suits multi-class problems, in which the decision is based on a small neighborhood of similar objects. In the classification procedure, the training data are first plotted in n-dimensional space, where n is the number of features. The data consist of vectors labeled with their associated class (an arbitrary number of classes). The number k defines how many neighbors influence the classification. Based on the suggestion made in [26], the best classification is obtained for k = 3. This suggestion was based on different experiments and observation of the classification rate on the JAFFE database. The same classifier is used for the Cohn-Kanade and the MMI databases as well.
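The voting rule with k = 3 can be sketched as follows (a pure-numpy illustration; the paper does not specify its distance metric, so Euclidean distance is assumed):

```python
import numpy as np

def knn_predict(X_train, y_train, x, k=3):
    """Classify x by majority vote among its k nearest training samples."""
    d = np.linalg.norm(np.asarray(X_train, float) - np.asarray(x, float), axis=1)
    nearest = np.asarray(y_train)[np.argsort(d)[:k]]
    labels, counts = np.unique(nearest, return_counts=True)
    return labels[np.argmax(counts)]

# toy 2-D features with two expression classes
X = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]
y = ['happy', 'happy', 'happy', 'sad', 'sad', 'sad']
pred = knn_predict(X, y, [0.2, 0.4], k=3)      # nearest three samples are 'happy'
```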

Experiment results

To evaluate the performance of the proposed method, the JAFFE image database[27], the Cohn-Kanade, and the MMI databases have been used. Eighteen fiducial points have been obtained via the AAM model, and two types of information have been extracted: geometric and textural. MATLAB was used for implementation.

The JAFFE database contains 213 images with a resolution of 256 × 256 pixels. Six basic expressions, in addition to the neutral face (seven in total), are considered in this database: happiness, sadness, surprise, anger, disgust, and fear. The images were taken of 10 Japanese female models, and each emotion was subjectively rated by 60 Japanese volunteers. Each subject posed three or four times for each expression. Figure 6 shows examples of each expression.
Figure 6

Various expressions from the JAFFE database (left to right): anger, disgust, fear, happiness, sadness, surprise, and neutral.

The Cohn-Kanade database [28], which is widely used in the literature on facial expression analysis, consists of approximately 500 image sequences from posers. Each sequence goes from the neutral face to the target display, with the last frame being AU-coded. The subjects range in age from 18 to 30 years; 65% are female, 15% are African-American, and 3% are Asian or Latino. The images contain six different facial expressions: anger, disgust, fear, happiness, sadness, and surprise. Figure 7 shows examples of some expressions.
Figure 7

Examples of various expressions from the Cohn-Kanade database.

The MMI Facial Expression Database [29] was created by the Man–Machine Interaction Group, Delft University of Technology, Netherlands. The database was initially established for research on machine analysis of facial expressions. It consists of over 2,900 videos and high-resolution still images of 75 subjects of both genders, who range in age from 19 to 62 years and have either a European, Asian, or South American ethnic background. The samples show both non-occluded and partially occluded faces, with or without facial hair and glasses. In our experiments, 96 image sequences were selected from the MMI database; the only selection criterion was that a sequence could be labeled as one of the six basic emotions [30]. The sequences come from 20 subjects, with 1–6 emotions per subject. The neutral face and three peak frames of each sequence (hence, 384 images in total) were used for 6-class expression recognition. Some sample images from the MMI database are shown in Figure 8.
Figure 8

The sample face expression images from the MMI database.

Three different methods were selected to verify the accuracy of this system:
  1. “Leave-One-Out” cross-validation: for each expression from each subject, one image is left out, and the rest are used for training [26].
  2. Cross-validation: the database is randomly partitioned into ten distinct segments; nine partitions are used for training, and the remaining partition is used to test performance. The procedure is repeated so that every equal-sized set is used once as the test set. Finally, the average over the ten experiments is reported [31].
  3. Expresser-based segmentation: the database is divided into several segments, each corresponding to a subject. For the JAFFE database, 213 expression images, posed by 10 subjects, are partitioned into 10 segments, each corresponding to one subject [32]. For the Cohn-Kanade database, 375 video sequences have been used, that is, over 4,000 images. Nine out of ten segments are used for training and the tenth for testing. This is repeated so that each of the ten segments is used in testing, and the average over the ten experiments is reported.

JAFFE database

Table 2 shows the average success rate for the different approaches. The confusion matrix for the “Leave-One-Out” method is presented in Table 3. For the average recognition rate, nine out of ten expression image classes are used for training, with the remaining one serving as the testing set each time. This procedure is repeated for each subject.
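The ten-segment partitioning used in the evaluation protocols above can be sketched as follows (the random seeding and numpy's `array_split` are illustrative choices):

```python
import numpy as np

def ten_fold_indices(n_samples, seed=0):
    """Randomly partition sample indices into 10 near-equal, disjoint folds."""
    idx = np.random.default_rng(seed).permutation(n_samples)
    return np.array_split(idx, 10)

folds = ten_fold_indices(213)                  # e.g. the 213 JAFFE images
# each fold serves once as the test set, the other nine as training data
```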
Table 2

Recognition accuracy (%) on the JAFFE database for different approaches

| Expression | Leave-One-Out | Cross validation | Expresser-based segmentation |
| --- | --- | --- | --- |
| Anger | 94.3 | 92.8 | 89.3 |
| Disgust | 95.7 | 94.6 | 90.7 |
| Fear | 96.0 | 95.0 | 91.1 |
| Happiness | 98.2 | 96.9 | 92.6 |
| Sad | 96.7 | 94.2 | 90.2 |
| Surprise | 98.2 | 96.3 | 92.3 |
| Neutral | 97.9 | 95.5 | 91.7 |
| Average | 96.71 | 95.04 | 91.12 |

Table 3

Confusion matrix for the Leave-One-Out method (the JAFFE database)

|  | AN | FE | SU | DI | HA | SA | NE | Total |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| AN | 94.3% | 0 | 0.7% | 3.2% | 0 | 1.8% | 0 | 30 |
| FE | 1% | 96% | 2% | 1% | 0 | 0 | 0 | 32 |
| SU | 0 | 0.5% | 98.2% | 0 | 1.3% | 0 | 0 | 30 |
| DI | 1.1% | 0 | 0.5% | 95.7% | 0 | 2.7% | 0 | 29 |
| HA | 0 | 0 | 0 | 0 | 98.2% | 0 | 1.8% | 31 |
| SA | 2% | 0 | 0.3% | 1% | 0 | 96.7% | 0 | 31 |
| NE | 0 | 0 | 0 | 0 | 0 | 2.1% | 97.9% | 30 |

The confusion matrix is a 7 × 7 matrix whose rows correspond to the actual class labels and whose columns to the recognized ones. The diagonal entries are the rounded average successful recognition rates over ten trials, while the off-diagonal entries correspond to misclassifications. The total recognition rate is 96.71%; the best rates are for the surprise and happiness expressions, and the lowest one is for anger. The performance of the proposed method is compared against some published methods in Table 4.
Table 4

Comparison with other methods on the JAFFE database

| Method | Recognition rate (average %) |
| --- | --- |
| Lyons et al. [31] | 92.00 |
| Zhi and Ruan [33] | 95.91 |
| Zhang et al. [34] | 90.34 |
| Liejun et al. [35] | 95.7 |
| Shih et al. [36] | 95.71 |
| Zhao et al. [37] | 93.72 |
| Guo and Dyer [38] | 91.00 |
| Proposed | 96.71 |

Cohn-Kanade database

Table 5 shows the average success rate for the different approaches. The confusion matrix for the “Leave-One-Out” method is presented in Table 6. Different numbers of images have been selected for experiments in the literature, based on different criteria (see Table 7). In this experiment, 375 image sequences were selected from 97 subjects; the criteria were that a sequence is labeled as one of the six basic emotions and that the video clip is longer than ten frames. The total recognition rate is 92.2%; the best rate is for the happiness expression, and the lowest one is for sadness. The performance of the proposed method is compared against some published methods in Table 7. Although several frames from each video sequence are used, we consider them as “static” images, without using any temporal information.
Table 5

Recognition accuracy (%) on the Cohn-Kanade database for different approaches

| Expression | Leave-One-Out | Cross validation | Expresser-based segmentation |
| --- | --- | --- | --- |
| Anger | 88.25 | 87.03 | 82.35 |
| Disgust | 94.9 | 91.58 | 88.65 |
| Fear | 92.28 | 90.98 | 87.12 |
| Happiness | 97.82 | 96.92 | 91.05 |
| Sad | 87.87 | 84.58 | 81.11 |
| Surprise | 92.08 | 91.23 | 86.32 |
| Average | 92.2 | 90.37 | 86.1 |

Table 6

Confusion matrix for the Leave-One-Out method (the Cohn-Kanade database)

|  | AN (%) | FE (%) | SU (%) | DI (%) | HA (%) | SA (%) |
| --- | --- | --- | --- | --- | --- | --- |
| AN | 88.25 | 1.21 | 0.89 | 1.08 | 0.00 | 8.57 |
| FE | 0.87 | 92.28 | 1.98 | 2.17 | 2.21 | 0.49 |
| SU | 1.87 | 1.43 | 92.08 | 1.35 | 1.31 | 1.96 |
| DI | 1.65 | 0.67 | 0.00 | 94.9 | 0.00 | 2.78 |
| HA | 0.00 | 0.00 | 2.18 | 0.00 | 97.82 | 0.00 |
| SA | 6.54 | 4.63 | 0.00 | 0.96 | 0.00 | 87.87 |

Table 7

Comparison of facial expression recognition for the Cohn-Kanade database

| Method | Number of selected video sequences | Recognition rate (average %) |
| --- | --- | --- |
| Zhan et al. [39] | 300 | 90.4 |
| Shan et al. [40] | 320 | 92.1 |
| Bartlett et al. [41] | 313 | 86.9 |
| Littlewort et al. [42] | 313 | 93.8 |
| Yang et al. [43] | 352 | 92.23 |
| Tian [44] | 375 | 93.8 |
| Aleksic and Katsaggelos [45] | 284 | 93.66 |
| Zafeiriou and Pitas [19] | 374 | 97.1 |
| Kotsia and Pitas [46] |  | 99.7 |
| Proposed | 374 | 92.20 |

MMI database

Table 8 shows the average success rate for different approaches. The total recognition rate is 87.66%; the best rate is for the happiness expression, and the lowest for sadness.
Table 8

Recognition accuracy (%) on the MMI database for different approaches

| Expression | Leave-One-Out | Cross validation | Expresser-based segmentation |
|---|---|---|---|
| Anger | 86.14 | 85.08 | 80.11 |
| Disgust | 85.22 | 86.4 | 78.22 |
| Fear | 89.91 | 84.42 | 81.32 |
| Happiness | 91.1 | 88.81 | 83.21 |
| Sad | 83.44 | 81.79 | 77.11 |
| Surprise | 90.2 | 89.34 | 81.02 |
| Average | 87.66 | 85.97 | 80.16 |

The experimental results show that the proposed method meets the criteria of accuracy and efficiency for facial expression classification. In terms of accuracy, it outperforms some other existing approaches that used the same databases. The average recognition rate of the proposed approach on the JAFFE database is 96.71% when using the “Leave-One-Out” method and 95.04% when using cross-validation for estimating its accuracy. For the Cohn-Kanade database, the average recognition rate is 92.20% with the “Leave-One-Out” method and 90.37% with cross-validation. For the MMI database, it is 87.66% with the “Leave-One-Out” method and 85.97% with cross-validation. Few articles have reported emotion recognition accuracy on the MMI database; most report recognition rates for action units (AUs). Sánchez et al.[47] achieved 92.42%, but it is not clear how many video sequences were used. Cerezo et al.[48] reported a 92.9% average recognition rate on 1,500 still images from the mixed MMI and Cohn-Kanade databases. Shan et al.[30] used 384 images from the MMI and reported an average recognition rate of 86.9%.

For the “Leave-One-Out” procedure in Table 5, all image sequences are divided into six classes, each corresponding to one of the six expressions. Four sets, each containing a randomly chosen 20% of the data for each class, were created to be used as training sets, while the remaining 20% were used as the test set.

The classification procedure is repeated five times. In each cycle, the samples from the current test set are moved into the training set; a new test set (20% of the samples for each class) is then formed, and the remaining samples constitute the new training set. Finally, the average classification rate is computed as the mean of the per-cycle success rates.
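The rotation procedure above can be sketched as follows. This is a minimal illustration that assumes precomputed feature vectors and substitutes a simple 1-nearest-neighbour classifier; the article's actual feature extraction and KNN configuration are not reproduced here.

```python
import numpy as np

def rotating_fold_accuracy(features, labels, n_folds=5, seed=0):
    """Average classification rate over n_folds rotations: each cycle
    holds out a different ~20% of the samples as the test set and
    classifies them with 1-NN trained on the remaining ~80%."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(labels))          # shuffle sample indices
    folds = np.array_split(order, n_folds)        # five disjoint folds
    rates = []
    for i in range(n_folds):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(n_folds) if j != i])
        # 1-NN: assign the label of the closest training sample (Euclidean)
        d = np.linalg.norm(features[test, None, :] - features[None, train, :],
                           axis=-1)
        pred = labels[train][np.argmin(d, axis=1)]
        rates.append(np.mean(pred == labels[test]))
    return float(np.mean(rates))
```

The final score is the mean of the five per-cycle success rates, matching the averaging described in the text.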

Conclusion

This article proposes a combined texture/geometric feature selection for facial expression recognition. The GL circular harmonic filter is applied, for the first time, to facial expression identification. The advantages of this filter are its rich frequency extraction capability for texture analysis, its rotation invariance, and its multiscale nature. The geometric information of fiducial points is added to the texture information to construct the feature vector. Given a still expression image, normalization is performed first; the extracted features are then passed through a KNN classifier. Experiments showed that the selected features represent facial expressions effectively, demonstrating average success rates of 96.71, 92.2, and 87.66% on the JAFFE, Cohn-Kanade, and MMI databases, respectively, when following the “Leave-One-Out” strategy for accuracy estimation, and of 95.04, 90.37, and 85.97% when following the cross-validation method. These results are comparable with those reported for other approaches: they demonstrate a better success rate for the JAFFE database and fall within the same range as other approaches for the Cohn-Kanade database. Further development of the proposed approach includes refining the local and global feature selections, as well as testing other classification techniques.
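As an illustration of the feature-fusion and classification steps summarized above, the following sketch concatenates a hypothetical, precomputed texture descriptor (standing in for the GL-wavelet features) with flattened fiducial-point coordinates, and classifies the fused vector with a small majority-vote KNN. It is a sketch of the pipeline's structure, not the authors' implementation.

```python
import numpy as np

def fuse_features(texture_vec, fiducial_xy):
    """Concatenate a texture descriptor with flattened fiducial-point
    (x, y) coordinates into one feature vector; the GL-wavelet texture
    extraction itself is assumed done elsewhere."""
    return np.concatenate([np.ravel(texture_vec), np.ravel(fiducial_xy)])

def knn_predict(train_feats, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training
    vectors under Euclidean distance."""
    d = np.linalg.norm(train_feats - query, axis=1)
    nearest = train_labels[np.argsort(d)[:k]]
    vals, counts = np.unique(nearest, return_counts=True)
    return vals[np.argmax(counts)]
```

In practice each expression image would contribute one fused vector, and the choice of k and the distance metric would be tuned per database.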

Declarations

Authors’ Affiliations

(1)
Department of Electrical and Computer Engineering, University of Calgary
(2)
Department of Electrical and Computer Engineering, University of British Columbia

References

  1. Yuki M, Maddux WW, Masuda T: Are the windows to the soul the same in the East and West? Cultural differences in using the eyes and mouth as cues to recognize emotions in Japan and the United States. J. Exp. Soc. Psychol. 2007, 43(2):303-311. doi:10.1016/j.jesp.2006.02.004
  2. Suwa M, Sugie N, Fujimora K: A preliminary note on pattern recognition of human emotional expression. Proceedings of the 4th International Joint Conference on Pattern Recognition, Kyoto, Japan; 1978:408-410.
  3. Bashyal S, Venayagamoorthy GK: Recognition of facial expressions using Gabor wavelets and learning vector quantization. Eng. Appl. Artif. Intell. 2008, 21:1056-1064. doi:10.1016/j.engappai.2007.11.010
  4. Ekman P, Friesen WV: Manual for the Facial Action Coding System. Consulting Psychologists Press, Palo Alto, CA; 1977.
  5. MPEG Video and SNHC: Text of ISO/IEC FDIS 14496-3 (Audio). Doc. ISO/MPEG N2503, MPEG Meeting, Atlantic City; 1998.
  6. Ekman P, Friesen WV: Constants across cultures in the face and emotions. J. Personal Soc. Psychol. 1971, 17(2):124-129.
  7. Cottrell G, Metcalfe J: Face, gender and emotion recognition using holons. In Advances in Neural Information Processing Systems 3. Morgan Kaufmann, San Mateo; 1991:564-571.
  8. Chen X, Huang T: Facial expression recognition: a clustering-based approach. Pattern Recognit. Lett. 2003, 24:1295-1302. doi:10.1016/S0167-8655(02)00371-9
  9. Turk M, Pentland A: Eigenfaces for recognition. J. Cogn. Neurosci. 1991, 3:71-86. doi:10.1162/jocn.1991.3.1.71
  10. Rahardja A, Sowmya A, Wilson W: A neural network approach to component versus holistic recognition of facial expressions in images. Intell. Robots Comput. Vis. X: Algorithms and Techniques 1991, 1607:62-70.
  11. Feng X, Pietikäinen M, Hadid A: Facial expression recognition based on local binary patterns. Pattern Recognit. Image Anal. 2007, 17(4):592-598. doi:10.1134/S1054661807040190
  12. Lanitis A, Taylor C, Cootes T: Automatic interpretation and coding of face images using flexible models. IEEE Trans. Pattern Anal. Mach. Intell. 1997, 19(7):743-756. doi:10.1109/34.598231
  13. Suwa M, Sugie N, Fujimora K: A preliminary note on pattern recognition of human emotional expression. Proceedings of the 4th International Joint Conference on Pattern Recognition, Kyoto, Japan; 1978:408-410.
  14. Yacoob Y, Davis L: Recognizing faces showing expressions. International Workshop on Automatic Face and Gesture Recognition, Zurich, Switzerland; 1995:278-283.
  15. Xiang T, Leung MKH, Cho SY: Expression recognition using fuzzy spatio-temporal modeling. Pattern Recognit. 2008, 41(1):204-216. doi:10.1016/j.patcog.2007.04.021
  16. Essa I, Pentland A: Coding, analysis, interpretation and recognition of facial expressions. IEEE Trans. Pattern Anal. Mach. Intell. 1997, 19(7):757-763. doi:10.1109/34.598232
  17. Fasel IR, Bartlett MS, Movellan JR: A comparison of Gabor filter methods for automatic detection of facial landmarks. IEEE 5th International Conference on Automatic Face and Gesture Recognition, Washington, DC; 2002:242-248.
  18. Fasel B, Luettin J: Automatic facial expression analysis: a survey. Pattern Recognit. 2003, 36(1):259-275. doi:10.1016/S0031-3203(02)00052-3
  19. Zafeiriou S, Pitas I: Discriminant graph structures for facial expression recognition. IEEE Trans. Multimed. 2008, 10(8):1528-1540.
  20. Lee CC, Shih CY: Gabor feature selection for facial expression recognition. International Conference on Signals and Electronic Systems, Gliwice, Poland; 2010:139-142.
  21. Deng H, Zhu J, Lyu MR, King I: Two-stage multi-class AdaBoost for facial expression recognition. Proceedings of IJCNN 2007, Orlando, USA; 2007:3005-3010.
  22. Pardas M, Bonafonte A: Facial animation parameters extraction and expression recognition using Hidden Markov Models. Signal Process.: Image Commun. 2002, 17:675-688. doi:10.1016/S0923-5965(02)00078-4
  23. Jacovitti G, Neri A: Multiscale image features analysis with circular harmonic wavelets. Proc. SPIE: Wavelets Appl. Signal Image Process. 1995, 2569:363-372.
  24. Capodiferro L, Casieri V, Laurenti A, Jacovitti G: Multiple feature based multiscale image enhancement. Digital Signal Processing Conference, Greece; 2002, 2:931-934.
  25. Ahmadi H, Pousaberi A, Azizzadeh A, Kamarei M: An efficient iris coding based on Gauss-Laguerre wavelets. 2nd IAPR/IEEE International Conference on Biometrics, Seoul, South Korea; 2007, 4642:917-926.
  26. Sohail A, Bhattacharya P: Classification of facial expressions using k-nearest neighbor classifier. Computer Vision/Computer Graphics Collaboration Techniques, LNCS 4418; 2007:555-566.
  27. Lyons M, Akamatsu S, Kamachi M, Gyoba J: Coding facial expressions with Gabor wavelets. Third IEEE International Conference on Automatic Face and Gesture Recognition, Nara, Japan; 1998:200-206.
  28. Kanade T, Cohn JF, Tian Y: Comprehensive database for facial expression analysis. Proceedings of the 4th IEEE International Conference on Automatic Face and Gesture Recognition, Grenoble, France; 2000:46-53.
  29. Pantic M, Valstar MF, Rademaker R, Maat L: Web-based database for facial expression analysis. Proceedings of the IEEE International Conference on Multimedia and Expo (ICME’05), Amsterdam, Netherlands; 2005:317-321.
  30. Shan C, Gong S, McOwan PW: Facial expression recognition based on local binary patterns: a comprehensive study. Image Vis. Comput. 2009, 27:803-816. doi:10.1016/j.imavis.2008.08.005
  31. Lyons M, Budynek J, Akamatsu S: Automatic classification of single facial images. IEEE Trans. Pattern Anal. Mach. Intell. 1999, 21:1357-1362. doi:10.1109/34.817413
  32. Feng X, Lv B, Li Z, Zhang J: A novel feature extraction method for facial expression recognition. Proceedings of JCIS, Taiwan; 2006:371-375.
  33. Zhi R, Ruan Q: Facial expression recognition based on two-dimensional discriminant locality preserving projections. Neurocomputing 2008, 71:1730-1734.
  34. Zhang Z, Lyons M, Schuster M, Akamatsu S: Comparison between geometry-based and Gabor-wavelet-based facial expression recognition using multi-layer perceptron. Proceedings of the 3rd International Conference on Automatic Face and Gesture Recognition, Nara, Japan; 1998:454-459.
  35. Liejun W, Xizhong Q, Taiyi Z: Facial expression recognition using improved support vector machine by modifying kernels. Inf. Technol. J. 2009, 8(4):595-599. doi:10.3923/itj.2009.595.599
  36. Shih FY, Chuang C, Wang PSP: Performance comparisons of facial expression recognition in JAFFE database. IJPRAI 2008:445-459.
  37. Zhao L, Zhuang G, Xu X: Facial expression recognition based on PCA and NMF. Proceedings of the 7th World Congress on Intelligent Control and Automation, Chongqing, China; 2008:6822-6825.
  38. Guo G, Dyer CR: Learning from examples in the small sample case: face expression recognition. IEEE Trans. Syst. Man Cybern. B 2005, 35(3):477-488. doi:10.1109/TSMCB.2005.846658
  39. Zhan Y, Ye J, Niu D, Cao P: Facial expression recognition based on Gabor wavelet transformation and elastic templates matching. Int. J. Image Graph. 2006, 6(1):125-138. doi:10.1142/S0219467806002112
  40. Shan C, Gong S, McOwan PW: Robust facial expression recognition using local binary patterns. Proceedings of ICIP 2005, Genoa, Italy; 2005, 2:370-373.
  41. Bartlett MS, Littlewort G, Fasel I, Movellan JR: Real time face detection and facial expression recognition: development and applications to human computer interaction. IEEE Conference on Computer Vision and Pattern Recognition, Madison, Wisconsin; 2003, 5:53.
  42. Littlewort G, Bartlett M, Fasel I, Susskind J, Movellan J: Dynamics of facial expression extracted automatically from video. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Workshop on Face Processing in Video, New York, USA; 2004:80-88.
  43. Yang P, Liu Q, Metaxas DN: Exploring facial expressions with compositional features. Proceedings of CVPR, San Francisco, USA; 2010:2638-2644.
  44. Tian Y: Evaluation of face resolution for expression analysis. Proceedings of the IEEE Workshop on Face Processing in Video, Washington, DC, USA; 2004:82.
  45. Aleksic PS, Katsaggelos AK: Automatic facial expression recognition using facial animation parameters and multi-stream HMMs. IEEE Trans. Inf. Forensics Secur. 2006, 1(1):3-11. doi:10.1109/TIFS.2005.863510
  46. Kotsia I, Pitas I: Facial expression recognition in image sequences using geometric deformation features and support vector machines. IEEE Trans. Image Process. 2007, 16(1):172.
  47. Sánchez A, Ruiz JV, Moreno AB, Montemayor AS, Hernández J, Pantrigo JJ: Differential optical flow applied to automatic facial expression recognition. Neurocomputing 2011, 74(8):1272-1282.
  48. Cerezo E, Hupont I, Baldassarri S, Ballano S: Emotional facial sensing and multimodal fusion in a continuous 2D affective space. Ambient Intell. Hum. Comput. 2012, 3:31-46. doi:10.1007/s12652-011-0087-6

Copyright

© Poursaberi et al.; licensee Springer. 2012

This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.