Non-frontal facial expression recognition based on salient facial patches

Methods using salient facial patches (SFP) play a significant role in research on facial expression recognition. However, most SFP methods use only frontal face images or videos for recognition and do not consider variations of head position. In our view, SFP can also be a good choice for recognizing facial expressions under different head rotations, and we therefore propose an algorithm for this purpose, called Profile Salient Facial Patches (PSFP). First, to detect the facial landmarks in profile face images, the tree-structured part model is used for pose-free landmark localization; this approach excels at detecting facial landmarks and estimating head poses. Second, to obtain the salient facial patches from profile face images, the facial patches are selected using the detected facial landmarks, while avoiding overlap with each other or extension beyond the range of the actual face. To analyze the recognition performance of PSFP, three classical approaches for local feature extraction, namely the histogram of oriented gradients (HOG), local binary patterns (LBP), and Gabor filters, were applied to extract profile facial expression features. Experimental results on the Radboud Faces Database show that PSFP with HOG features can achieve higher accuracies under most head rotations.


Introduction
The problem of determining how to use face information in human-computer interaction has been the subject of analysis for a number of years, and an increasing number of applications make use of face information. The authors of Ref. 9 considered relationships between head poses and proposed a pose-based hierarchical Bayesian-themed model. Jampour et al. 10 found that linear or non-linear local mapping methods provide more reasonable results for multi-pose facial expression recognition than global mapping methods.
However, none of the above algorithms is sufficient to correctly recognize expressions on faces in non-frontal images. Although researchers have sought to achieve higher recognition rates by constructing models or functions that map the relationship between frontal and non-frontal face images, the feature point movements and texture variations are considerably more complex under head pose variations and identity bias. An effective feature extraction method is necessary for the recognition of non-frontal facial expressions.
Recently, methods based on salient facial patches, which seek salient patches on the human face and extract facial expression features from them, have played a significant role in emotion recognition. [11][12][13][14][15][16][17][18][19] In these methods, a few prominent facial patches (e.g., eyebrows, eyes, cheeks, and mouth) serve as the key regions of face images, and discriminative features are extracted from these salient regions. The extracted features are important for distinguishing one expression from another, and the salient facial patches create favorable conditions for non-frontal facial expression recognition. Thus, we propose an algorithm based on salient facial patches designed to recognize facial expressions in non-frontal face images. This method, called Profile Salient Facial Patches (PSFP), detects salient facial patches in non-frontal face images and recognizes facial expressions from those patches. The remainder of this paper is organized as follows. Related work is described in Sec. 2, and the details of PSFP are presented in Sec. 3.
We provide the design and analysis of experiments for facial expression recognition in Sec. 4.
Finally, we conclude the paper in Sec. 5.

Related Work
Sabu and Mathai 11 were the first to investigate the importance of algorithms based on salient facial patches for facial expression recognition. They found that the most accurate and efficient of the methods proposed to date was that of Happy and Routray, 12 who provided a system for facial expression recognition using salient facial patches. These salient regions vary across facial expressions and can be responsible for deformation of the face. The system is easy to reproduce and is efficient for recognizing frontal-view facial expressions. Chitta and Sajjan 13 found that the most effective salient facial patches are located mainly in the lower half of the face.
Thus, they reduced the salient region and extracted the emotion features from the lower face.
However, their algorithm did not achieve high recognition rates in their experiment. Zhang et al. 14 used a sparse group lasso scheme to explore the most salient patches for each facial expression and combined these patches into the final features for emotion recognition; they achieved an average recognition rate of 95.33% on the CK+ database. Wen et al. 15 used a convolutional neural network 20 to train on the salient facial patches of face images and then employed a secondary voting mechanism to help the trained network determine the final category of test images. Sun et al. 16 presented a convolutional neural network that uses a visual attention mechanism and can be used for facial expression recognition; this mechanism attends to local areas of face images and determines the importance of each region. In particular, whole face images with different poses were used for training the network. Yi et al. 17 extended salient facial patches from static images to video sequences and used 24 feature points to capture the deformation in facial geometry throughout the entire face. Yao et al. 18 presented a deep neural network classifier that can capture pose-variant expression features from depth patches and recognize non-frontal expressions. Barman and Dutta 19 used an active appearance model (AAM) 21 to detect the salient facial landmarks, whose connections form triangles that can be regarded as salient facial regions; geometric features are then extracted for recognition of emotion in the face.
Based on this overview of algorithms using salient facial patches, we find the following commonalities in facial expression recognition: 1. Most of the proposed methods are used on frontal face images.
2. There are three main components of salient facial regions: eyes, nose, and lips.

Face Detection
Yu et al. 23 presented a unified framework to detect the human face and track facial feature points simultaneously. There are two main steps in their framework: (1) Initialization. The mixtures of the tree-structured part model are used to formulate the problem as

X* = arg max_{X, v ∈ (1, V)} [App_v(I, X) + Shape_v(X)],    (1)

where the first term, App_v(I, X), is a local patch appearance evaluation function, which indicates whether a facial landmark may be at the aligned position; the second term, Shape_v(X), is a shape deformation cost, which maintains the balance of the relative locations of neighboring facial landmarks; and the tree is defined as T_v = (P_v, E_v), v ∈ (1, V), in which P_v represents the shared pool of parts and E_v represents the edges between parts. For each viewpoint v, this scoring function is applied to measure the facial landmark configuration X. Eq. 1 assigns a larger score to more likely positions of facial landmarks.
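To make the maximization in Eq. 1 concrete, the following toy sketch (not from the paper; all scores and costs are invented) finds the exact maximum of a sum of per-part appearance scores and per-edge deformation costs on a small star-shaped tree, by passing messages from the child parts to the root as in dynamic programming on trees:

```python
import numpy as np

# Toy instance of Eq. 1: each part picks one of K candidate positions; the
# total score is the sum of per-part appearance scores and per-edge
# deformation costs, maximized exactly by message passing on the tree.
K = 4                                  # candidate positions per part
app = {0: np.array([1., 0., 0., 0.]),  # appearance score App_i per position
       1: np.array([0., 2., 0., 0.]),
       2: np.array([0., 0., 3., 0.])}
edges = {1: 0, 2: 0}                   # child -> parent (part 0 is the root)
# Shape(child_pos, parent_pos): penalize child positions far from the parent
shape = {c: -0.1 * np.abs(np.subtract.outer(np.arange(K), np.arange(K)))
         for c in edges}

# Message from each child to the root: best child contribution for every
# candidate root position (max over the child's own positions).
msg = {c: (app[c][:, None] + shape[c]).max(axis=0) for c in edges}
root_score = app[0] + msg[1] + msg[2]  # combined score per root position
best = root_score.max()                # value of the arg-max configuration
print(best)
```

Real implementations use the same leaf-to-root recursion over a 68-point tree with learned appearance filters and deformation weights; the exactness of the maximization is what makes tree-structured models attractive here.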
To solve Eq. 1, a group sparse learning algorithm 24 can be used to select the most salient weights and form a new tree. The structure of a tree is closely related to the weights: if the tree structure changes, the weights must be retrained. Training is thus transformed into a max-margin optimization problem. 23 (2) Localization. Once the initial facial landmarks have been detected, the authors use the Procrustes analysis method to project their 3D reference shape model onto the 2D face image. Because the 3D shape model can handle continuous view changes, it is appropriate for solving the pose-free facial landmark initialization problem. The authors first transform this problem into parametric form and then build a probabilistic model; based on this probabilistic model, they propose a two-step cascaded deformable shape model to refine the locations of the facial landmarks. The objective

p* = arg max p(p | {l_i = 1}_{i=1}^N, D)    (2)

aims to maximize the likelihood of alignment over the shape parameter p, where l_i = 1 indicates that landmark i is aligned and D denotes the image observations. The Bayesian rule can then be used to derive Eq. 3. From Eq. 4, the parameter p determines the 3D shape model s, with s = s(p); the authors suppose that the prior p(p) obeys a Gaussian distribution. In addition, the logistic regressor is p(l_i = 1 | φ_i) = 1/(1 + exp(−(ϑ φ_i + b))), where φ_i is the feature descriptor of facial landmark patch i, and the other parameters ϑ and b represent the two regressor weights.
To solve the optimization problem of Eqs. 2-4, two main steps are taken. Step One: find the neighborhood of the facial landmarks and achieve the optimal alignment likelihood; the expectation-maximization method is the key to solving Eq. 4.
Step Two: the facial landmarks of each component are aligned to the global optimum, and the landmark locations are refined.
After these two steps are performed, the algorithm can determine locations of landmarks to form the salient facial patches; in addition, the algorithm also has the estimated head pose for each facial image.
Finally, the landmarks can be tracked and represented as l_i = (x_i, y_i), i = 1, 2, …, 66. The locations of the landmarks for an image such as Fig. 1 (a) are shown in Fig. 1 (b).

Extraction of Pose-free Salient Facial Patches
The special salient facial patches are obtained from the face images according to the head pose.
From the analysis of related work, we find that the eyes, nose, and lips are important facial components for the salient facial patches. The locations of these facial components for an image such as Fig. 1 (a) are shown in Fig. 1 (c). The salient facial patches can be extracted around these facial parts and the areas among the eyebrow, eye, nose, and lip areas; each patch P_j is specified by a center point c_j = (x_j, y_j) and a size M×N.
If L salient facial patches have been selected from image R, the facial expression features are extracted from the L salient facial patches as F_k = {f_1^k, f_2^k, …, f_L^k}, where k is the index of the image. The locations of the 19 salient facial patches on a frontal face image are shown in Fig. 1 (d). P1 and P4 are at the lip corners; P9 and P11 are just below them. P10 is at the midpoint of P9 and P11. P16 is at the center of the two eyes, and P17 is above P16. P15 and P14 are below the left and right eyes, respectively. P3 and P6 are located midway between the nose and the eyes. P5, P13, and P12 are stacked together, extracted from the left side of the nose, and P2, P7, and P8 are at the right side of the nose. P18 and P19 are located on the respective outer eyebrows.
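As a minimal sketch of the constraint that patches must not extend beyond the face image, the following hypothetical helper (the 16×16 patch size and the landmark centers are illustrative choices, not the paper's exact rule) shifts each window inward whenever it would cross the image border:

```python
import numpy as np

def extract_patch(img, center, M=16, N=16):
    """Crop an MxN patch centered near `center` = (x, y); if the window would
    extend beyond the image, shift it inward so the whole patch stays inside."""
    h, w = img.shape[:2]
    x, y = center
    x0 = int(np.clip(x - N // 2, 0, w - N))   # clipped top-left corner
    y0 = int(np.clip(y - M // 2, 0, h - M))
    return img[y0:y0 + M, x0:x0 + N]

face = np.zeros((100, 100))                   # hypothetical face image
inner = extract_patch(face, (50, 50))         # landmark well inside the face
border = extract_patch(face, (2, 2))          # landmark near the border
print(inner.shape, border.shape)              # both windows stay 16x16
```

Overlap between neighboring patches can then be avoided by checking the clipped windows pairwise before accepting a patch layout.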
The method of selection of facial patches in PSFP is similar to that in Ref. 12, with two exceptions.
The first difference is that the salient facial patches (SFP) method of Ref. 12 selects the facial patches as shown in Fig. 2 (a), whereas ours selects them as shown in Fig. 2 (b). Given that there are already two patches at the inner eyebrows, if the patches were larger, they would likely overlap with those at the inner eyebrows. When the image is a non-frontal facial view, the face is partially occluded, which reduces the visible surface area of the inner eyebrows. If we were to use the position of the inner eyebrows as in the method of Ref. 12, much information would be lost. Fig. 3 shows the positions of the P19 patch as determined by the two selection methods; the patch selected by the method of Ref. 12 is clearly beyond the face area.

Feature Extraction and Classification
After the salient facial patches have been obtained from the face images, the features of the facial patches need to be extracted for classification. After the features have been obtained, a representative classifier is applied for facial expression classification. (1) HOG: First, we break up the image into small cells; second, we obtain a histogram of gradient orientations from each cell; finally, we normalize the computed results and return a descriptor.
(2) LBP: The N×N LBP operator is used to obtain the facial expression features. The weights of the operator are multiplied by the thresholded values of the corresponding pixels of the face image, and the sum over the N×N − 1 neighborhood pixels gives the LBP feature of the neighborhood.
There are many variants of the LBP algorithm; in Ref. 12, the highest recognition rate was attained using the uniform LBP. The N×N uniform LBP operator computes LBP features from a circular neighborhood; it has two important parameters: P, the number of neighboring pixels, and R, the radius of the circular neighborhood.
(3) Gabor: Gabor filters can be formulated as

ψ_{u,v}(z) = (||k_{u,v}||² / σ²) exp(−||k_{u,v}||² ||z||² / (2σ²)) [exp(i k_{u,v} · z) − exp(−σ² / 2)],

where u represents the orientation and v represents the scale. If an image is convolved with a Gabor filter, the Gabor features are extracted at the particular u and v values.
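As a rough illustration of the three descriptors applied to a single patch, the following sketch uses the scikit-image implementations; all parameter values (cell and block sizes, P, R, the Gabor frequency and orientation) are illustrative assumptions, not the settings used in the paper:

```python
import numpy as np
from skimage.feature import hog, local_binary_pattern
from skimage.filters import gabor

rng = np.random.default_rng(0)
patch = rng.random((32, 32))          # hypothetical grayscale facial patch

# HOG: orientation histograms over small cells, block-normalized and
# concatenated into one descriptor vector.
hog_feat = hog(patch, orientations=9, pixels_per_cell=(8, 8),
               cells_per_block=(2, 2), block_norm="L2-Hys")

# Uniform LBP with P sampling points on a circle of radius R, summarized
# as a histogram over the P + 2 uniform pattern codes.
P, R = 8, 1
codes = local_binary_pattern(patch, P, R, method="uniform")
lbp_feat, _ = np.histogram(codes, bins=P + 2, range=(0, P + 2))

# Gabor: filter the patch at one orientation (theta) and scale (frequency),
# then use the response magnitude as the feature map.
real, imag = gabor(patch, frequency=0.25, theta=np.pi / 4)
gabor_feat = np.hypot(real, imag).ravel()

print(hog_feat.shape, lbp_feat.shape, gabor_feat.shape)
```

In practice the Gabor responses are usually pooled or downsampled before classification, since the raw magnitude map has the same dimensionality as the patch itself.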
The above examples show feature extraction performed from only a single patch; thus, feature fusion is necessary for feature extraction of the salient facial patches.
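Feature-level fusion here can be as simple as concatenating the per-patch descriptors in a fixed patch order; a minimal sketch with hypothetical dimensions (L = 19 patches, d = 10 features each, random values standing in for real descriptors):

```python
import numpy as np

# Hypothetical per-patch descriptors for L = 19 salient patches,
# each reduced to d = 10 dimensions (values are random stand-ins).
L_patches, d = 19, 10
rng = np.random.default_rng(1)
per_patch = [rng.random(d) for _ in range(L_patches)]

# Simple feature-level fusion: concatenate the patch descriptors
# in a fixed order so every image yields a vector of the same length.
fused = np.concatenate(per_patch)
print(fused.shape)
```

Keeping the patch order fixed across images is what makes the fused vectors comparable between training and test samples.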

Classification
After the facial expression features have been extracted, the final task is feature classification.
Non-frontal face images are hampered by a lack of emotion information, so if the classifier is weak, the recognition rate may be very low. For this problem, the adaptive boosting (AdaBoost) 27 algorithm is applied for the classification: it is good at combining many weak learners to improve recognition performance and is thus a suitable method for this classification task.

Experimental Setting
The simulation environment used the MATLAB R2015b platform running on a Dell personal computer. We evaluated the PSFP algorithm on the Radboud Faces Database (RaFD), 25 with data consisting of ten people, eight expressions, three gaze directions, and five head poses.
The framework for the PSFP algorithm was implemented as shown in Fig. 5.

Purposes
In this study, experiments were used to validate the recognition performance of PSFP from four different perspectives.

Testing PSFP performance under different training-testing strategies
There are two commonly used experimental ways of performing non-frontal facial expression recognition.

Testing PSFP performance under different parameter values
Generally, the selection of parameters depends on empirical values, and it is difficult to support them with a rigorous proof. Therefore, it is necessary to use different parameter values for PSFP and observe the recognition performance on the test set. As described in Sec. 4.1, the size of the facial patches was typically set to 16×16 and the feature dimensionality was typically set to 10.
Both of these key parameters can affect the expression recognition performance. Secs. 4.5.1 and 4.5.2 describe the experiments carried out for this performance comparison.

Comparing PSFP with SFP for frontal facial expression recognition
In Sec. 3.2, we discussed the two differences between the SFP method of Ref. 12 and the proposed PSFP method.

Comparing PSFP with non-SFP using whole-face images
A salient facial patch is in fact only part of the face image. According to common understanding, if the whole-face image is used for the recognition, the performance may be better. However, if the selection of salient facial patches is sufficiently good, PSFP could perform better than this non-SFP method. Therefore, we used the same feature extraction and classification method for the two methods and compared them, as described in Sec. 4.5.4.

Pose-Invariant Non-frontal Facial Expression Recognition
There are two training-testing strategies for facial expression recognition: person-dependent and person-independent.
In the experiments on person-dependent facial expression recognition, the subjects appearing in the training set also appear in the test set. Because every model has three different head poses, a three-fold cross-validation strategy was used for the person-dependent facial expression recognition.
The dataset can be divided into three segments according to head pose. Each time, two segments were used for training and the remaining segment for testing. Thus, the number of images in the training set was 160, and the number in the test set was 80, for each head rotation angle. The same training-testing procedure was carried out three times, and the average result of the three runs is taken as the final recognition performance of the PSFP algorithm. The HOG, LBP, and Gabor methods were used for feature extraction, and the AdaBoost algorithm with the NN classifier was applied for classification.
The recognition rates of these methods are shown in Table 2.
Each row shows recognition performance with five head rotation angles (90°, 45°, 0°, −45°, and −90°). The best recognition rates are highlighted in bold. For most angles, HOG has the best recognition performance, and at −45°, LBP has the best recognition performance. We also find that the best head rotation angle for recognition of non-frontal facial expressions is −45°. In the experiments on person-independent facial expression recognition, the subjects appearing in the training set do not appear in the test set. For this reason, the leave-one-person-out strategy was used for the experiments: All photographs of one person are selected as the test set, and the remaining photographs in the dataset are used for training. Thus, the number of images in the training set was 216, and the number in the test set was 24, for each head rotation angle. This procedure was repeated 10 times, and the averaged result is taken as the final recognition rate.
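The leave-one-person-out protocol described above corresponds to grouped cross-validation; a sketch using scikit-learn's LeaveOneGroupOut with the dataset sizes from the text (assuming 240 images from 10 subjects, with placeholder features):

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut

# Hypothetical setup: 10 subjects x 24 images each = 240 images total.
subjects = np.repeat(np.arange(10), 24)   # subject id for every image
X = np.zeros((240, 5))                    # placeholder feature vectors

# Leave-one-person-out: each fold holds out all images of one subject,
# so no identity appears in both the training and test sets.
logo = LeaveOneGroupOut()
splits = list(logo.split(X, groups=subjects))
train_idx, test_idx = splits[0]
print(len(splits), len(train_idx), len(test_idx))
```

The fold sizes match the text: 10 folds, each with 216 training images and 24 test images.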
The experiment results are shown in Table 3.
For most angles, HOG achieved the best recognition rate, and at 0° and −45°, Gabor achieved the best recognition rate. Again, we find that the best head rotation angle for recognition of non-frontal facial expressions is −45°. In summary, analyses of the pose-invariant non-frontal facial expression recognition experiments show the following: (1) When the head rotation angle is larger, the recognition rate may be lower.
Because many facial patches are occluded by head rotation, the number of emotion features is not sufficient to achieve a high recognition rate; PSFP with the HOG algorithm, however, still obtains good recognition rates. (2) Although identity bias and face occlusion interfere with facial expression recognition, the PSFP algorithm can achieve better recognition performance on non-frontal facial expression recognition.

Pose-Variant Non-frontal Facial Expression Recognition
Again, there are two training-testing strategies for facial expression recognition: person-dependent and person-independent.
In the experiments on person-dependent facial expression recognition, a three-fold cross-validation strategy was used for training and testing. The number of images in the training set was 800, and the number in the test set was 400. The same procedure was performed three times.
In the experiments on person-independent facial expression recognition, the leave-one-person-out strategy was used. The number of images in the training set was 1080, and the number in the test set was 120. This procedure was performed 10 times for each dataset, and the average values are taken as the final recognition rate.
The experiment results are listed in Table 4.
As shown in the table, having different head pose rotations increases the difficulty of non-frontal facial expression recognition. However, the proposed method performed well. PSFP with the HOG algorithm again achieved the best recognition rates.

Comparison by size of facial patches
In the above experiments, the size of the facial patches was 16×16. We increased the size to 32×32, and the experiment results are shown in Figs. 6 and 7. As can be seen, the 32×32 facial patches achieved higher recognition performance than the 16×16 facial patches. This is because the feature extraction methods can obtain much more information, which helps improve the recognition performance of non-frontal facial expression recognition.

Comparison by feature dimensionality
In the above experiments, the feature dimensionality was set to 10. We re-ran the experiments for pose-variant non-frontal facial expression recognition and allowed the feature dimensionality to increase from 10 to 100. AdaBoost with NN was used as the classifier, and the feature extraction methods were HOG, LBP, and Gabor, shown separately.
The experiment results are shown in Figs. 8 and 9. As shown in the figures, the recognition rates rise from their initial values and eventually settle within a range; in the experiment on pose-variant non-frontal facial expression recognition, the width of this range is 4% to 6%. Although the recognition rate may increase with the feature dimensionality, the computation cost of the algorithm necessarily grows as well. We suggest that the feature dimensionality be set to a value as small as possible while maintaining good performance.

Comparing PSFP with SFP for frontal facial expression recognition

The recognition rate of PSFP was higher than that of the SFP method of Ref. 12. This demonstrates that the PSFP method can also outperform SFP for frontal facial expression recognition.

Comparison with non-SFP method using whole-face images
In this experiment, the LBP algorithm was used to extract features from the whole-face images, and the AdaBoost algorithm was applied for classification. The non-SFP method was compared with PSFP; the results for the person-dependent and person-independent strategies are shown in Fig. 12 and Fig. 13. Even though the PSFP method does not use the whole-face image for recognition, its accuracy is not lower than that of the non-SFP method using whole-face images. The selection of salient facial patches helps the PSFP method to achieve the higher accuracy. Moreover, the size of the feature data used by PSFP is much smaller.

Summary
From the above experiments, we find that the PSFP method has the following characteristics: (1) HOG features have better recognition performance than LBP or Gabor features. We believe the reason is that whereas LBP features are based on local image regions of the facial patch and Gabor features are extracted from the whole facial patch, HOG features are obtained from the small square cells of the facial patch. Therefore, the HOG method can more effectively extract the emotion features under complex changes of lighting, scale, pose, and identity. (2) The PSFP method can also be applied to frontal facial expression recognition; it is an extension of the SFP method. (3) PSFP can achieve high recognition rates while consuming less data.

Conclusion
This paper has presented an algorithm based on salient facial patches, called PSFP. It exploits the relevance of facial patches to non-frontal facial expression recognition and uses a facial landmark detection method to track key points on the pose-free human face. In addition, an algorithm for extracting the salient facial patches was proposed; this algorithm determines the facial patches under different head rotations. The facial expression features can then be extracted from the facial patches and finally used for classification. The experiment results show that PSFP can achieve high recognition rates while consuming less data.