Multithreading cascade of SURF for facial expression recognition
EURASIP Journal on Image and Video Processing volume 2016, Article number: 37 (2016)
Abstract
We propose a novel and general framework called the multithreading cascade of Speeded Up Robust Features (McSURF), which is capable of processing multiple classifications simultaneously and accurately. The proposed framework adopts SURF features, but the framework is a multiclass and simultaneous cascade, i.e., a multithreading cascade. McSURF is implemented by configuring an area under the receiver operating characteristic (ROC) curve (AUC) of the weak SURF classifier for each data category into a real-value lookup list. These non-interfering lists are built into thread channels to train the boosting cascade for each data category. This boosting-cascade-based approach can be trained to fit complex distributions and can simultaneously and robustly process multiclass events. The proposed method takes facial expression recognition as a test case and is validated on three popular and representative public databases: the Extended Cohn-Kanade, the MMI Facial Expression Database, and the Acted Facial Expressions in the Wild (AFEW) database. Overall results show that this framework outperforms other state-of-the-art methods.
Introduction
Robustly and simultaneously learning highly discriminative multiclass classifiers with local image features is one of the most significant challenges facing computer vision researchers, because such classifiers are critical infrastructure for recognition engines; this line of research is therefore of great importance. Our study focuses on feature descriptors and learning classifiers to develop a novel learning framework for multiclass recognition applications.
In this study, we propose a framework called the multithreading cascade of Speeded Up Robust Features (McSURF), which adopts SURF for training a multithreading boosting cascade. The proposed learning model is applied to facial expression recognition (FER), and while it is derived from AdaBoost [1], it is a novel, multiclass, simultaneous cascade, i.e., a multithreading cascade. In contrast to conventional boosting cascade models (e.g., BinBoost [2], joint cascade [3], and LUT-AdaBoost [4–6]), we propose a novel and robust cascade algorithm (McSURF) that can simultaneously learn multitask cascades using the local feature detector and descriptor SURF [7]. The proposed boosting-cascade-based approach can be trained to fit complex distributions and can simultaneously process multiclass events much more robustly.
We experimentally evaluated the proposed method on three public expression databases, i.e., the Extended Cohn-Kanade (CK+) [8], the MMI Facial Expression Database [9, 10], and the Acted Facial Expressions in the Wild (AFEW) database [11], which together represent lab-controlled and real-world scenarios. Some examples of expression recognition results are shown in Fig. 1. The experimental results show that the proposed method can construct a robust FER system whose results outperform well-known state-of-the-art FER methods.
The main contribution of our study is the development of a novel framework (McSURF) that can simultaneously build a cascade learning model while robustly processing a multiclass recognition application. In so doing, we make the following original contributions: (1) Typically, a boosting classifier is trained as a binary classification model; our proposed multithreading cascade learning model allows multiple categories to be trained simultaneously in one cascade learning model. (2) McSURF is an effective method for FER applications; its performance experimentally outperforms many state-of-the-art methods. (3) We experimentally evaluated the impact of face registration at both the learning and recognition stages and determined how face registration affects a boosting classifier during these stages. This represents an important breakthrough that is relevant to related industries and those with related research interests.
The remainder of this paper is organized as follows: We review the related works in Section 2. We describe the proposed framework in Section 3. In Section 4, we describe our experiments, and we draw our conclusions in Section 5.
Related work
Recently, mainstream FER approaches have been based on effective local descriptors or facial action units. Local descriptors such as the local binary pattern on three orthogonal planes (LBP-TOP) [12], HOE [13], and histograms of oriented gradients (HOG) 3D [14] are extracted from local facial cuboids to obtain a representation of fixed length, independent of the time resolution. In other words, these approaches try to describe the spatiotemporal properties of facial expressions using descriptors. Such feature descriptor approaches provide effective and robust FER representations, because they are robust to intraclass variation and face deformation. However, rigid cuboids can only capture low-level feature information, and these low-level features often fail to describe high-level facial concepts, i.e., there is a "semantic gap" between low-level features and high-level concepts. Therefore, the effective use of local descriptors to represent complex expressions has been an ongoing problem.
Another category of approaches processes facial action areas. Although these approaches are less popular than those based on local descriptors, this category is also important to consider. Methods based on facial action areas use a series of facial landmarks, as discussed in [8] and [15], and use the active appearance model (AAM) [16] and the constrained local model (CLM) [15, 17] to encode shape and texture information, respectively. These approaches do not suffer from the semantic gap problem, because they focus solely on the detection of mid-level facial action areas, which contain sufficient semantic cues. However, it is difficult to accurately detect landmarks (or defined action units) when facial expression varies, because the defined landmarks cannot completely cover the many varied and complex expressions.
This study aims to present a more complete solution for FER. It has been shown that local features trained with classifiers can effectively cancel out the problems caused by the semantic gap, leading to an overall significant improvement in classification performance. Therefore, we propose a novel and general learning framework that combines robust classifiers with high-quality local feature descriptors; the technical details are discussed in the following section.
The proposed method
Our proposed framework has these components: SURF features for local patch description; logistic-regression-based weak classifiers, which are combined with the area under the receiver operating characteristic (ROC) curve (AUC) [18] as a single criterion for cascade convergence testing; and a multithreading cascade for boosting training that can process multiple categories.
Figure 2 shows a schematic of the implementation process of the proposed framework. First, the facial region is detected based on the VJ framework. Then, the detected facial region is processed in parallel by multiple classifiers to estimate the expression. The parallel classification, i.e., the multithreading aspect, is implemented by configuring the AUC of the weak classifier for each data category into a real-value lookup list. As shown in Fig. 2, these non-interfering lists are built into thread channels in which the algorithm can appropriately organize the ensemble of weak classifiers into related classes. In the proposed framework, SURF represents the expressional features of the detected facial regions for the weak classifiers. We describe SURF in Section 3.1 and explain how to use SURF features to construct logistic-regression-based weak classifiers in Section 3.2. To enable the parallel aspect shown in Fig. 2, we design the multithreading cascade channels in Section 3.3. We describe how to learn weak classifiers in each channel in Section 3.4. Finally, in Section 3.5, we describe the boosting cascade training. These approaches are formulated in the following sections.
Feature description
SURF is a scale- and rotation-invariant interest point detector and descriptor. It is faster than the scale-invariant feature transform (SIFT) [7, 19], and AdaBoost-based algorithms that adopt SURF have been shown to obtain the best accuracy and speed [20]. In this study, we adopt an 8-bin T2 SURF descriptor to describe the local features, inspired by the approach of Li et al. [20]. However, in contrast to Li et al.'s [20] approach, we allow a different aspect ratio (the ratio of width to height) for each patch, because this can improve the speed of image traversal. We also introduce diagonal and antidiagonal filters to improve the description capability of the SURF descriptors.
Given a recognition window, we define rectangular local patches within it, each patch having four spatial cells and with the patch size ranging from 12×12 to 40×40 pixels. Each patch is represented by a 32-dimensional SURF descriptor, which can be computed quickly based on the sums of two-dimensional Haar wavelet responses, making efficient use of integral images [1]. We define d_x as the horizontal gradient image, obtained using the filter kernel [−1,0,1], and d_y as the vertical gradient image, obtained using the filter kernel [−1,0,1]^T; we define d_D as the diagonal image and d_AD as the antidiagonal image, both computed using the two-dimensional filter kernels diag(−1,0,1) and antidiag(−1,0,1). Therefore, the 8-bin T2 descriptor can be defined as \(v=(\sum(|d_{x}|+d_{x}),\sum(|d_{x}|-d_{x}),\sum(|d_{y}|+d_{y}),\sum(|d_{y}|-d_{y}),\sum(|d_{\mathrm{D}}|+d_{\mathrm{D}}),\sum(|d_{\mathrm{D}}|-d_{\mathrm{D}}),\sum(|d_{\text{AD}}|+d_{\text{AD}}),\sum(|d_{\text{AD}}|-d_{\text{AD}}))\). Here, d_x, d_y, d_D, and d_AD can be computed individually, using integral images, by the filters shown in Fig. 3 a(1), a(2), b(1), and b(2), respectively. For details on how to compute the two-dimensional Haar responses with integral images, please refer to [1].
The recognition template for SURF is 40×40 pixels with four spatial cells, again with the patch size ranging from 12×12 to 40×40 pixels. We slide the patch over the recognition template with a stride of four pixels to ensure a sufficient feature-level difference. In addition, we allow a different aspect ratio for each patch. The local candidate region of the features is also divided into four cells, and the descriptor is extracted from each cell; hence, concatenating the features of all four cells yields a 32-dimensional feature vector. For feature normalization in practice, an L_2 normalization followed by clipping and renormalization (L_2-Hys) [21] has been shown to work best.
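The 8-bin T2 descriptor and L_2-Hys normalization above can be sketched as follows. This is a minimal illustration, not the paper's implementation: it filters images directly rather than using the integral images of [1], and the clipping threshold of 0.2 is an assumption borrowed from common L_2-Hys practice.

```python
import numpy as np

def t2_bins(d):
    """Split a gradient image into T2 bins: sum(|d| + d) collects positive
    responses, sum(|d| - d) collects negative ones."""
    a = np.abs(d)
    return np.sum(a + d), np.sum(a - d)

def surf_8bin_t2(cell):
    """8-bin T2 SURF descriptor for one spatial cell (direct filtering sketch)."""
    p = cell.astype(np.float64)
    d_x = np.zeros_like(p); d_x[:, 1:-1] = p[:, 2:] - p[:, :-2]    # kernel [-1, 0, 1]
    d_y = np.zeros_like(p); d_y[1:-1, :] = p[2:, :] - p[:-2, :]    # kernel [-1, 0, 1]^T
    d_D = np.zeros_like(p); d_D[1:-1, 1:-1] = p[2:, 2:] - p[:-2, :-2]    # diag(-1, 0, 1)
    d_AD = np.zeros_like(p); d_AD[1:-1, 1:-1] = p[2:, :-2] - p[:-2, 2:]  # antidiag(-1, 0, 1)
    v = []
    for d in (d_x, d_y, d_D, d_AD):
        v.extend(t2_bins(d))
    return np.array(v)  # 8 bins per cell

def patch_descriptor(patch):
    """Concatenate the 4 cells (2x2 grid) -> 32-dim vector, then L2-Hys."""
    h, w = patch.shape
    cells = [patch[i * h // 2:(i + 1) * h // 2, j * w // 2:(j + 1) * w // 2]
             for i in range(2) for j in range(2)]
    v = np.concatenate([surf_8bin_t2(c) for c in cells])
    v /= (np.linalg.norm(v) + 1e-12)           # L2 normalize
    v = np.clip(v, 0.0, 0.2)                   # clip (threshold 0.2 is an assumption)
    return v / (np.linalg.norm(v) + 1e-12)     # renormalize
```

Because each bin is a sum of |d|±d, all 32 entries are non-negative, and the final vector has unit L_2 norm.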
Weak classifier construction
In this study, we build a weak classifier over each local patch described by the SURF descriptor and select the optimum patches in each boosting iteration from the patch pool. Meanwhile, we construct the weak classifier for each local patch by logistic regression, which fits our classification framework because it is a probabilistic linear classifier. Given a SURF feature \(\mathbb {F}\) over a local patch, logistic regression defines the probability model
$$ P(q\mid\mathbb{F},\mathbf{w})=\frac{1}{1+\exp\left(-q\left(\mathbf{w}^{T}\mathbb{F}+b\right)\right)}, $$(1)
where q=1 means that the trained sample is a positive sample of the current class, q=−1 indicates a negative sample, w is a weight vector for the model, and b is a bias term. We train classifiers on local patches from a large-scale dataset. Assuming that, in each boosting iteration stage, there are K possible local patches, each represented by a SURF feature \(\mathbb {F}\), each stage is a boosting training procedure with logistic regression models as weak classifiers. In this way, the parameters can be identified by minimizing the objective
$$ \min_{\mathbf{w},b}~\lambda\|\mathbf{w}\|_{p}+\sum_{j}\log\left(1+\exp\left(-q_{j}\left(\mathbf{w}^{T}\mathbb{F}_{j}+b\right)\right)\right), $$(2)
where λ denotes a tunable parameter for the regularization term and ∥w∥_p is the L_p norm of the weight vector. Note that the same formulation applies to the L_2-loss and L_1-loss linear support vector machines (SVMs) implemented in the well-known open-source package LIBLINEAR [22]; therefore, this problem can be solved using the algorithms in [22].
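A hedged sketch of such a weak classifier, using scikit-learn's LIBLINEAR-backed logistic regression on synthetic stand-in features; the data, the parameter values, and the variable names here are illustrative, not the paper's (C plays the role of 1/λ in the objective above).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy stand-in for SURF features F over one local patch: 32-dim vectors,
# q = +1 for the current expression class, q = -1 otherwise.
rng = np.random.default_rng(1)
F_pos = rng.normal(+0.5, 1.0, size=(200, 32))
F_neg = rng.normal(-0.5, 1.0, size=(200, 32))
X = np.vstack([F_pos, F_neg])
q = np.array([1] * 200 + [-1] * 200)

# L2-regularized logistic regression solved by LIBLINEAR.
weak = LogisticRegression(penalty="l2", C=1.0, solver="liblinear").fit(X, q)

# Real-valued weak output used later in the cascade: 2*P(q=1|F,w) - 1 in [-1, 1].
h = 2 * weak.predict_proba(X)[:, list(weak.classes_).index(1)] - 1
```

The real-valued output `h` is exactly the 2P(q|F,w)−1 form that reappears in the channel construction of Section 3.3.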
Multithreading cascade channel construction
In this subsection, we describe how the parallel aspect is implemented. Assuming there are M expression categories in the training sample set, given the weak classifiers \(h_{i}^{(n)}\) for category i data, the strong classifier is defined as \(H_{i}^{(N)}(\mathbb {F})=\frac {1}{N}\sum _{n=1}^{N}h_{i}^{(n)}(\mathbb {F})\).
Assuming there are a total of N boosting iteration rounds, in round n we build K weak classifiers \([h_{i}^{(n)}(\mathbb {F}_{k})]_{k=1}^{K}\), one for each local patch, in parallel from the boosting sample subset. Meanwhile, we also test each model \( h_{i}^{(n)}(\mathbb {F}_{k})\) in combination with the previous n−1 boosting rounds. In other words, we test \(H_{i}^{(n-1)}(\mathbb {F}) + h_{i}^{(n)}(\mathbb {F}_{k})\) as a candidate for \(H_{i}^{(n)}(\mathbb {F})\) on all training samples, and we select the candidate that produces the highest AUC score [18, 23] \(J(H_{i}^{(n-1)}(\mathbb {F}) + h_{i}^{(n)}(\mathbb {F}_{k}))\), i.e.,
$$ S_{i}=\max_{k}~J\left(H_{i}^{(n-1)}(\mathbb{F}) + h_{i}^{(n)}(\mathbb{F}_{k})\right). $$(3)
This procedure is repeated until the AUC scores converge or the designated number of iterations N is reached. Then, the selected \(S_{i}\) is set as a threshold to generate an AUC score pool, which contains the values of \( J(H_{i}^{(n-1)}(\mathbb {F}) + h_{i}^{(n)}(\mathbb {F}_{k})) \geq 0.8 \times S_{i}\). In this way, an AUC score pool is built for each class of object.
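The per-round selection and the AUC score pool can be sketched as follows; the helper name and the toy data are hypothetical, with `roc_auc_score` standing in for the AUC criterion J.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def select_weak_classifier(H_prev, candidates, labels):
    """One boosting round: among K candidate weak outputs h_k, pick the one
    whose combination with the partial strong classifier H^(n-1) gives the
    highest AUC score J(H^(n-1) + h_k) on the training samples."""
    scores = [roc_auc_score(labels, H_prev + h_k) for h_k in candidates]
    best = int(np.argmax(scores))
    return best, scores[best], scores

# Toy example: 3 candidate patches; the second one is clearly discriminative.
rng = np.random.default_rng(2)
labels = np.array([1] * 50 + [0] * 50)
H_prev = rng.normal(0, 0.1, 100)
candidates = [rng.normal(0, 1, 100),
              np.where(labels == 1, 1.0, -1.0) + rng.normal(0, 0.3, 100),
              rng.normal(0, 1, 100)]
best_k, best_auc, all_scores = select_weak_classifier(H_prev, candidates, labels)

# The AUC score pool keeps every candidate with J >= 0.8 * S_i, where S_i = best_auc.
pool = [k for k, s in enumerate(all_scores) if s >= 0.8 * best_auc]
```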
To learn multiclass classifiers simultaneously, we adopt these AUC data to construct independent channels for boosting learning. The details of the procedure are summarized as follows:

Assuming the AUC score pools have been normalized to [0,1], we divide the range into M subrange bins. Each bin corresponds to a channel ID. In this way, we can obtain a channel ID set \( \textbf {C}=\{\text {bin}_{l}=[\frac {l-1}{M}, \frac {l}{M}] \mid l=1,\cdots,M\}\). In each channel, we build an independent boosting model for training classifiers that can recognize a corresponding category task.

We set \(u=S_{i}(\mathbb {F},x)\) and define the weak classifier h _{ i }(x) as follows:
$$ \begin{aligned} & \textbf{if}~u \in \textbf{C}~\text{and} ~x \in \{\text{the ~samples ~of~} \text{expression}~i\}, \\ & \textbf{then}~ h_{i}(x) = 2P(q\mid\mathbb{F}, \mathbf{w})-1. \end{aligned} $$(4)This guarantees that the precision of h is greater than 0.5.

Given the characteristic function
$$\begin{array}{@{}rcl@{}} B^{(l, i)}(u,\textbf{Y})=\left\{ \begin{array}{ll} 1 & u\in\text{bin}_{l} \wedge i\in\textbf{Y} \\ 0 & \text{otherwise,} \end{array}\right. \end{array} $$(5)where i∈Y and Y is defined as the label set of those expression categories that can be recognized by the classifier h. This function is used to check and ensure that the expression categories of the channel, classifier, and sample are consistent.

Lastly, incorporating the characteristic function, we formally express the weak classifier as
$$ h(\mathbb{F})=\sum_{l=1}^{M}\sum_{i=1}^{M}\left(2P\left(q\mid\mathbb{F}, \mathbf{w}\right)-1 \right)B^{(l, i)}(u, \textbf{Y}). $$(6)As shown in Fig. 2, using the above approaches, we can construct the parallel aspect for training. Meanwhile, each classifier's category can be determined and automatically assigned to the related channel. In this way, we can learn the classifiers with Algorithm 1 and train multithreading boosting cascades simultaneously in their training channels via Algorithm 2.
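The channel-ID binning of step 1 and the routing of weak classifiers into non-interfering channels might be sketched as below; the function names and the score field are illustrative assumptions, not the authors' code.

```python
from collections import defaultdict

def channel_id(u, M):
    """Map a normalized AUC score u in [0, 1] to a channel ID l, where
    bin_l = [(l-1)/M, l/M] for l = 1..M (u = 1.0 falls into the last bin)."""
    return min(int(u * M) + 1, M)

def route_to_channels(weak_classifiers, M):
    """Group weak classifiers into the M non-interfering thread channels
    according to their normalized score u = S_i(F, x)."""
    channels = defaultdict(list)
    for clf in weak_classifiers:
        channels[channel_id(clf["u"], M)].append(clf["name"])
    return dict(channels)

M = 7  # e.g., seven expression categories -> seven training channels
pool = [{"name": "h_a", "u": 0.05}, {"name": "h_b", "u": 0.62},
        {"name": "h_c", "u": 0.10}, {"name": "h_d", "u": 1.00}]
channels = route_to_channels(pool, M)
```

Each channel then trains its own boosting cascade independently, which is what allows the M category tasks to be learned simultaneously.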
Learning weak classifiers
In this subsection, we describe how to learn weak classifiers in each threading channel, as shown in Fig. 2. Like most existing multiclass classification algorithms, our approach depends crucially on labeled data from the sample space to learn the classifiers. In this study, we combine this approach with the cascade channels constructed above to implement multiclass classification. In our case, we denote the sample space as X and the label set as Y. A sample of a multiclass and multilabel problem is a pair (x,Y), where Y(l) is defined as
$$ Y(l)=\left\{ \begin{array}{ll} +1 & l\in Y \\ -1 & l\notin Y, \end{array}\right. $$(7)
where x∈X, l∈Y, and Y⊆Y.
The whole procedure involves the forward selection and inclusion of a weak classifier over possible local patch templates, which can be adjusted using different template configurations according to the images being processed. To enhance both the convergence speed of learning and its robustness, our algorithm further introduces a backward removal approach. For more details on including backward removal, or even a floating search capability, in the boosting framework, please refer to [24]. In this study, we implement backward removal at step 4 of Algorithm 1 to extend the procedure with the capability to remove redundant weak classifiers. This not only reduces the number of weak classifiers in each stage but also improves the generalization capability of the strong classifiers.
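One plausible form of the backward-removal step, assuming the stage ensemble is scored by the sum of the weak outputs and AUC is the removal criterion; this is a sketch under those assumptions, not the authors' Algorithm 1.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def backward_removal(weak_outputs, labels):
    """Backward step of a floating search (cf. [24]): after forward inclusion,
    repeatedly drop any weak classifier whose removal does not lower the
    ensemble AUC, shrinking each stage and improving generalization."""
    kept = list(range(len(weak_outputs)))
    improved = True
    while improved and len(kept) > 1:
        improved = False
        base = roc_auc_score(labels, sum(weak_outputs[i] for i in kept))
        for i in list(kept):
            trial = [j for j in kept if j != i]
            if roc_auc_score(labels, sum(weak_outputs[j] for j in trial)) >= base:
                kept = trial          # classifier i was redundant
                improved = True
                break
    return kept

# Toy stage: one discriminative weak classifier plus one redundant noisy one.
rng = np.random.default_rng(4)
labels = np.array([1] * 100 + [0] * 100)
h_good = np.where(labels == 1, 1.0, -1.0)
h_noise = rng.normal(0, 1, 200)
kept = backward_removal([h_good, h_noise], labels)
```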
Boosting cascade training
Inspired by [18] and [20], we introduce the AUC as a single criterion for cascade convergence testing, which realizes an adaptive false positive rate (FPR) among the different stages (for a more detailed description of the AUC, refer to [18]). Hence, combined with the logistic-regression-based weak classifiers over SURF features, this approach yields a fast convergence speed and a cascade model with far fewer stages.
Within one stage, no threshold for the intermediate weak classifiers is required. We need only determine the decision threshold θ_i for the i-th emotional category in its threading channel. In our case, using the ROC curve, the FPR of each emotional category is easily determined once the minimal hit rate \(d_{i}^{\mathrm {(min)}}\) is given. We decrease \(d_{i}^{(j)}\) from 1 along the ROC curve until reaching the transit point \(d_{i}^{(j)}= d_{i}^{\mathrm {(min)}}\). The corresponding threshold at that point is the desired θ_i, i.e., the FPR is adaptive to each stage and is usually much smaller than 0.5. Therefore, the convergence speed is much quicker than that of conventional methods.
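A sketch of this per-stage threshold search on the ROC curve, assuming real-valued strong-classifier scores; the toy data and the d_min value are illustrative.

```python
import numpy as np

def stage_threshold(scores, labels, d_min=0.995):
    """Walk down the ROC curve from hit rate 1.0 and stop at the transit
    point where the hit rate first reaches the minimal hit rate d_min;
    return the decision threshold theta_i there and the (adaptive) FPR
    it induces, instead of assuming a fixed FPR of 0.5."""
    pos = np.sort(scores[labels == 1])
    k = int(np.floor((1.0 - d_min) * len(pos)))  # positives we may reject
    theta = pos[k]
    fpr = float(np.mean(scores[labels == 0] >= theta))
    return theta, fpr

# Toy stage scores: well-separated positives and negatives.
rng = np.random.default_rng(3)
scores = np.concatenate([rng.normal(4, 1, 1000), rng.normal(0, 1, 1000)])
labels = np.concatenate([np.ones(1000, int), np.zeros(1000, int)])
theta, fpr = stage_threshold(scores, labels, d_min=0.995)
```

With well-separated scores, the resulting stage FPR is far below 0.5, which is exactly what makes the AUC-driven cascade converge in fewer stages.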
To avoid overfitting, we restricted the number of samples used during training, as in [25]. In practice, we sampled an active subset from the whole training set according to the boosting weight. It is generally good practice to use about 30×p samples of each class, where p is a multiple coefficient (Algorithm 1 step 3.a).
After one stage of classifier learning has converged via Algorithm 2, we continue to train the next stage with false-positive samples obtained by scanning non-target images with the partially trained cascade. We repeat this procedure until the overall FPR reaches the stated goal. As with many current methods [3, 20, 26, 27], this measure was inspired by the VJ framework [1]; however, although we indicated in Section 3.3 that we had adopted this approach, our approach can process multiclass cascades as well as binary ones. In every independent threading channel, the respective cascade recognition subframeworks can be trained simultaneously for each data category. Furthermore, we propose an algorithm (Algorithm 2) to implement the boosting ensemble of classifiers for multiclass cascades, which is an original contribution to boosting research. Equally important, in our approach, the cascade training process is based on AUC analysis, and the FPR is usually much smaller than 0.5 and adaptive across stages. Therefore, this approach results in a much smaller model and a dramatically faster recognition speed.
Experiments
In this section, we provide details of the datasets and the evaluation results for our proposed method as applied to FER. We implemented all training and recognition programs in C++ on Red Hat Enterprise Linux (RHEL) 6.5, running on a PC with a Core i7-2600 3.40 GHz CPU and 8 GB RAM.
Databases and protocols
We evaluated the proposed method on three public databases, i.e., CK+, MMI, and AFEW, which include two lab-controlled databases (CK+ and MMI) and one with real-world scenarios (AFEW).
CK+ DB
The CK+ database (DB) is a set of facial expression samples posed by 123 people. There are 327 sequences, taken from 593 sequences, that meet the criteria for one of seven discrete emotions of the Facial Action Coding System (FACS) [8]: anger (An), contempt (Co), disgust (Di), fear (Fe), happiness (Ha), sadness (Sa), and surprise (Su). In our experiments, we divided these samples into several groups for each expression by the person-independent rule, and each group included ten posers. A person-independent tenfold cross-validation was conducted on this DB to compare the results with a number of outstanding current methods. For the recognition experiments, we assembled these images into 10-min videos, 640×480 in size and with a frame rate of 60 frames per second (FPS), based on the person-independent rule.
MMI DB
The MMI DB is a public database that includes more than 30 subjects, in which the female-to-male ratio is roughly 11:15. The subjects' ages range from 19 to 62, and they are of European, Asian, or South American descent. This database is considered more challenging than CK+ because some posers wear accessories such as glasses. In the experiments, we used all 205 effective image sequences of the six expressions in the MMI dataset. As with the CK+ DB, a person-independent tenfold cross-validation was completed to compare results with the state-of-the-art methods. For the recognition stage, these images were also made into 10-min videos, 640×480 in size and with a frame rate of 60 FPS, based on the person-independent rule.
AFEW DB
For the AFEW DB, which is a much more challenging database, we also performed evaluation experiments [11]. All of the AFEW sets were collected from movies to depict so-called wild scenarios. For this study, we adopted the 2013 AFEW version [28], because the evaluation results of many state-of-the-art methods are based on this version. We trained on its training set, and the results are reported for its validation set, in the same way as in the latest FER work [29].
Face registration
Like most facial research, our recognition performance is assessed on faces normalized by the position of the eyes [30–32], i.e., eye centers are used to register faces, whereas we utilized elastic bunch graph matching (EBGM) training images for the rest [31, 32]. Some examples of randomly selected faces with eye perturbation are shown in Fig. 4.
First, to determine the impact of face registration on the boosting convergence speed, we considered eye perturbation in the training sets only. Our results showed that the proposed method needed only 261 min to converge, at the 11th stage. In contrast, without any face registration approach, it needed 422 min to converge, at the 16th cascade stage, with no corresponding increase in performance.
We expect that classifier testing on a dataset with similar registration may improve the recognition results. Therefore, we next considered eye perturbation in both the recognition and training stages. As shown in Fig. 5 b, the performance improved by 3–6 %, compared to the results in Fig. 5 a.
The experiments show that in the FER case, face registration is very important: crafting features for every registration permutation would require far more data, whereas a good face registration approach requires less data and fewer boosting convergence stages. Face registration in both the testing and training sets can improve the robustness of these algorithms in FER applications, because classifiers trained on face images with similar eye perturbations can better cope with face images containing registration errors. This also gives some insight into why VJ face detection [1] is commonly followed by the use of eye detectors.
Computational cost evaluation
We used all the training samples in the AFEW training set and collected training samples from the CK+ and MMI DBs according to the person-independent tenfold cross-validation rule. To reduce the training time, we trained the samples from the three datasets together. All of the training samples were normalized to 100×100-pixel facial patches and processed by histogram equalization; no color information was used. To enhance the generalization performance of boosting learning, we applied some transformations to the training samples (mirror reflection, rotation, etc.), which finally increased the original number of samples by a factor of 64. Normalization was not performed on any of the testing sample sequences. In the training stages, we adopted the training data of the currently processed expression as positive sample data and the data from the other expressions as negative data.
For every expression category, we set the maximum number of weak classifiers in each stage to 128. The proposed method took 281 min to converge, at the 11th iteration stage. The cascade detector contained 2963 classifiers for all expressions and needed to evaluate only 3.5 SURF features per window. Details of the FER cascade, as illustrated in Fig. 6 a, b, include the number of weak classifiers in each stage and the average accumulated rejection rate over all cascade stages. The results indicate that the first seven stages rejected 98 % of the non-current class samples. After training, we observed that the top three picked local patches for FER lay in the regions of the two eyes and the mouth. This situation is similar to that of Haar-based classifiers [6]; see the examples in Fig. 7.
In order to evaluate the convergence speed of the AUC model, we determined the FPR at each boosting stage. The results show that, in the AUC model, the FPR f_j at each cascade stage is adaptive among the different stages, ranging from 0.04486 to 0.26837, and is much smaller than the conventional model FPR of 0.5. In almost all existing cascade frameworks, training stops when the overall FPR \(\prod _{j=1}^{T}f_{j}\) (where T denotes the total number of cascade stages) reaches the goal (usually set to 10^{−6}). This means that conventional models require more iterations and that the AUC model cascade can converge much faster, which relates directly to training efficiency and recognition speed. Therefore, these experimental results confirm that the AUC cascade model is much more efficient than conventional cascade models. However, because the proposed framework recognizes the multiclass expressions in parallel, its peak memory cost is nearly six times that of the conventional one.
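The stage-count arithmetic behind this claim can be checked directly: at the conventional fixed f = 0.5 the product needs 20 stages to reach 10^−6, while a mid-range adaptive FPR of 0.15 (inside the observed 0.045–0.268 band) needs only 8.

```python
def stages_needed(f, goal=1e-6):
    """Number of cascade stages until the overall FPR, the product of the
    per-stage FPRs f, first drops to the goal."""
    n = 0
    total = 1.0
    while total > goal:
        total *= f
        n += 1
    return n

conventional = stages_needed(0.5)   # fixed per-stage FPR of 0.5
adaptive = stages_needed(0.15)      # a mid-range adaptive per-stage FPR
```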
In addition, the average recognition speed of the proposed method is 54.6 FPS over the three datasets. We tried almost all existing local features, such as HOG [21] and SIFT. At first, we expected SIFT and HOG features to be more discriminating than SURF, but the results show that HOG descriptors lack robustness to head rotation, which has also been pointed out by Klaeser et al. [14]. The accuracy of the SIFT-based version is similar to that of the 8-bin T2 SURF descriptor, but its memory requirement is four times that of the 8-bin T2 SURF descriptors. Moreover, its speed was only 15.4 FPS, which cannot process real-time scenes smoothly. We also tested a Haar-based version, which contains more than 26 boosting stages and 27,396 classifiers over all categories; it requires more than 37 Haar-like features per window and has the slowest convergence speed of all. Consequently, we concluded that SURF is better suited to the proposed framework.
Recognition result evaluations
In this study, we used the same labels for the expression categories as those in the original databases. All of the recognition experiments are based on videos, and we evaluated their accuracies frame by frame. Here, we present the recognition results for the three representative public databases (CK+, MMI, and AFEW), because we needed to evaluate both the lab-controlled (CK+ and MMI) and real-world (AFEW) scenarios.
We also selected a number of methods for comparison to represent the state of the art in this field, including methods proposed for improving spatiotemporal descriptors: LBP-TOP [12], HOE [13], PLBP [33], and HOG 3D [14]. CLM [15] is a typical approach used to process facial action units. These methods are very popular for FER, while 3DCNN-DAP [34] and STM-ExpLet [29] are the latest methods. We also compared methods that focus on enhancing the robustness of classification frameworks, such as ITBN [35], 3D LUT [6], and LSH-CORF [36]. For a fair comparison, we used the same databases, evaluated via standardized items. Tables 1, 2, and 3 compare our method (McSURF) with these state-of-the-art methods, most of which were run using their released codes with their parameters tuned to better adapt to our experiments. However, for some methods whose source codes we could not obtain (e.g., STM-ExpLet [29] and 3DCNN-DAP [34]), it was necessary to cite the results reported in related studies. In addition, McSURF (3T) denotes the recognition results (person-independent tenfold) obtained using the data from the three databases together, while McSURF (OD) denotes the performance (tenfold) with the original data from each database.
In Table 1, the experimental results for the CK+ database compare our approach (McSURF) with eight state-of-the-art methods (CLM [15], HOE [13], LBP-TOP [12], ITBN [35], HOG 3D [14], LSH-CORF [36], 3D LUT [6], and 3DCNN-DAP [34]). The mean average precision (mAP) of our method is highly competitive with the state-of-the-art methods.
Table 2 lists the evaluation results for the MMI DB. The proposed method outperformed the state-of-the-art methods. In addition, unlike many existing methods that evaluate only selected samples, our experiments used all 205 effective image sequences of the six expressions (anger (An), disgust (Di), fear (Fe), happiness (Ha), sadness (Sa), and surprise (Su)).
Table 3 shows the evaluation results for the AFEW database (Ne means neutral), which is designed as a real-world scenario dataset in which the faces undergo sharp rotations. Since CK+ and MMI are lab-controlled datasets, they have some shortcomings with respect to evaluation in real-world scenarios. Therefore, we once again compared our proposed method with the state-of-the-art methods. Our results show that our method achieves 32.6 %, a performance better than the following state-of-the-art methods: HOE [13] 19.5 %, LBP-TOP [12] 25.1 %, HOG 3D [14] 26.9 %, LSH-CORF [36] 21.8 %, 3D LUT [6] 25.2 %, and STM-ExpLet [29] 31.7 %.
To date, we have performed all the necessary experiments and covered all items relating to the latest works in FER. The proposed method, 3DCNN-DAP, and STM outperform the other methods, which means that general learning frameworks lead to greater robustness with respect to intraclass variation and face deformation. Because local descriptor-based methods, such as LBP-TOP, HOE, and HOG 3D, lack semantic meaning, they can hardly represent complex variations over mid-level facial action areas; accuracy is likewise difficult to achieve in methods based on facial action areas. However, to capture the spatiotemporal property of expressions, 3DCNN-DAP and STM treat the time axis of the video as a third dimension, which limits the possible number of subject-independent applications; they obtain good performance only on dynamic images. Hence, although the mean average precision of 3DCNN-DAP is almost the same as the average accuracy of our proposed method, its results decline sharply on the MMI database. In contrast, the proposed framework treats feature learning separately, adapting the subject-independent classifier to the final classification objective. Local features trained by classifiers can effectively cancel out the problems caused by the semantic gap, which leads to an overall significant improvement in classification performance [37–39]. Thus, the learned features and classifiers have specificity and discriminative capability, and the performance of the proposed framework is distinctive.
Conclusions
In this study, we proposed a novel cascade framework called the multithreading cascade of SURF (McSURF) for robust FER. The main contribution is the proposed multithreading cascade learning model, which allows multiple categories of data to be trained simultaneously. The concurrency of this multithreading learning model extends the range of applications for cascades and represents a significant advance for related imaging industries.
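The multithreading cascade idea summarized above can be illustrated with a minimal conceptual sketch: each data category owns its own cascade of stages evaluated in an independent thread channel, a sample is rejected early by any failing stage, and the final label is the accepting channel with the highest accumulated score. The toy stage functions, thresholds, and `ThreadPoolExecutor` scheduling below are illustrative assumptions only; in McSURF the stages are boosted SURF weak classifiers whose responses come from AUC-configured real-value lookup lists.

```python
from concurrent.futures import ThreadPoolExecutor

def run_cascade(stages, sample):
    """Return the accumulated score if the sample survives every stage,
    or None if any stage rejects it (early exit, as in a boosting cascade)."""
    total = 0.0
    for score_fn, threshold in stages:
        total += score_fn(sample)
        if total < threshold:
            return None          # rejected by this stage
    return total

def classify(cascades, sample):
    """Evaluate every category's cascade in its own thread channel and
    return the accepting category with the highest score (or None)."""
    with ThreadPoolExecutor(max_workers=len(cascades)) as pool:
        futures = {label: pool.submit(run_cascade, stages, sample)
                   for label, stages in cascades.items()}
    accepted = {label: f.result() for label, f in futures.items()
                if f.result() is not None}
    return max(accepted, key=accepted.get) if accepted else None

# Toy two-category example: the "happy" channel favors large feature values.
cascades = {
    "happy":   [(lambda x: x,       0.5), (lambda x: x * 0.5, 1.0)],
    "neutral": [(lambda x: 1.0 - x, 0.5), (lambda x: 0.2,     0.9)],
}
print(classify(cascades, 0.9))   # prints "happy"
```

Because the per-category channels never interfere with each other, adding a new expression category amounts to adding one more cascade to the dictionary, which is the concurrency property the conclusion refers to.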
We used three popular and representative public databases in the FER research field to experimentally confirm the validity of the proposed method. Based on our experimental results, we analyzed the impact of face registration on both the learning and recognition stages, obtaining detailed answers on how face registration works in AdaBoost-based algorithms and why it improves the robustness of these algorithms in FER applications. These issues are important to those with related research interests.
In future work, we will first attempt to improve the discriminative power of the multiple classification framework and investigate how feature representation errors impact recognition frameworks.
References
1. P Viola, M Jones, Robust real-time face detection. Int. J. Comput. Vis. (IJCV) 57(2), 137–154 (2004).
2. T Trzcinski, M Christoudias, V Lepetit, Learning image descriptors with boosting. IEEE Trans. Patt. Analy. Mach. Intell. (TPAMI) 37(3), 597–610 (2015).
3. D Chen, S Ren, Y Wei, X Cao, J Sun, in Proc. Eur. Conf. Comput. Vis. (ECCV). Joint cascade face detection and alignment (2014), pp. 109–122.
4. B Wu, H Ai, C Huang, in Proc. Audio- and Video-Based Biometric Person Authentication. LUT-based AdaBoost for gender classification (2003), pp. 104–110.
5. Y Wang, H Ai, B Wu, C Huang, in Proc. Int. Conf. Pattern Recognit. (ICPR), 3. Real time facial expression recognition with AdaBoost (2004), pp. 926–929.
6. J Chen, Y Ariki, T Takiguchi, in Proc. ACM Multimedia Conf. (MM). Robust facial expressions recognition using 3D average face and ameliorated AdaBoost (2013), pp. 661–664.
7. H Bay, A Ess, T Tuytelaars, LV Gool, Speeded-up robust features (SURF). Comput. Vis. Image Underst. (CVIU) 110(3), 346–359 (2008).
8. P Lucey, JF Cohn, T Kanade, J Saragih, Z Ambadar, I Matthews, in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR) Workshops. The Extended Cohn-Kanade Dataset (CK+): a complete dataset for action unit and emotion-specified expression (2010), pp. 94–101.
9. MF Valstar, M Pantic, in Proc. Int. Conf. Language Resources and Evaluation, Workshop on EMOTION. Induced disgust, happiness and surprise: an addition to the MMI Facial Expression Database (2010), pp. 65–70.
10. M Pantic, MF Valstar, R Rademaker, L Maat, in Proc. IEEE Int. Conf. Multimedia and Expo (ICME). Web-based database for facial expression analysis (2005), pp. 317–321.
11. A Dhall, R Goecke, S Lucey, T Gedeon, Collecting large, richly annotated facial-expression databases from movies. IEEE MultiMedia 19(3), 34–41 (2012).
12. G Zhao, M Pietikainen, Dynamic texture recognition using local binary patterns with an application to facial expressions. IEEE Trans. Patt. Analy. Mach. Intell. (TPAMI) 29(6), 915–928 (2007).
13. L Wang, Y Qiao, X Tang, in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR). Motionlets: mid-level 3D parts for human motion recognition (2013), pp. 2674–2681.
14. A Klaeser, M Marszalek, C Schmid, in Proc. British Machine Vis. Conf. (BMVC). A spatio-temporal descriptor based on 3D-gradients (2008), pp. 99.1–99.10.
15. SW Chew, P Lucey, S Lucey, J Saragih, JF Cohn, S Sridharan, in Proc. IEEE Int. Conf. Autom. Face Gesture Recognit. (FG). Person-independent facial expression detection using constrained local models (2011), pp. 915–920.
16. TF Cootes, GJ Edwards, CJ Taylor, Active appearance models. IEEE Trans. Patt. Analy. Mach. Intell. (TPAMI) 23(6), 681–685 (2001).
17. D Cristinacce, TF Cootes, in Proc. British Machine Vis. Conf. (BMVC). Feature detection and tracking with constrained local models (2006), pp. 95.1–95.10.
18. C Ferri, PA Flach, J Hernández-Orallo, in Proc. Int. Conf. Machine Learn. (ICML). Learning decision trees using the area under the ROC curve (2002), pp. 139–146.
19. DG Lowe, in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), 2. Object recognition from local scale-invariant features (1999), pp. 1150–1157.
20. J Li, T Wang, Y Zhang, in Proc. IEEE Int. Conf. Comput. Vis. (ICCV) Workshops. Face detection using SURF cascade (2011), pp. 2183–2190.
21. N Dalal, B Triggs, in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), 1. Histograms of oriented gradients for human detection (2005), pp. 886–893.
22. RE Fan, KW Chang, CJ Hsieh, XR Wang, CJ Lin, LIBLINEAR: a library for large linear classification. J. Mach. Learn. Res. 9, 1871–1874 (2008).
23. P Long, R Servedio, in Proc. Adv. Neural Inf. Proc. Syst. (NIPS). Boosting the area under the ROC curve (2007), pp. 945–952.
24. SZ Li, Z Zhang, HY Shum, H Zhang, in Proc. Adv. Neural Inf. Proc. Syst. (NIPS). FloatBoost learning for classification (2002), pp. 993–1000.
25. R Xiao, H Zhu, H Sun, X Tang, in Proc. IEEE Int. Conf. Comput. Vis. (ICCV). Dynamic cascades for face detection (2007), pp. 1–8.
26. L Bourdev, J Brandt, in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), 2. Robust object detection via soft cascade (2005), pp. 236–243.
27. Q Zhu, MC Yeh, KT Cheng, S Avidan, in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), 2. Fast human detection using a cascade of histograms of oriented gradients (2006), pp. 1491–1498.
28. A Dhall, R Goecke, J Joshi, K Sikka, T Gedeon, in Proc. ACM Int. Conf. Multimodal Interaction (ICMI). Emotion recognition in the wild challenge 2014 (2014), pp. 461–466.
29. M Liu, S Shan, R Wang, X Chen, in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR). Learning expressionlets on spatio-temporal manifold for dynamic facial expression recognition (2014), pp. 1749–1756.
30. E Rentzeperis, A Stergiou, A Pnevmatikakis, L Polymenakos, in Proc. Artificial Intelligence Applications and Innovations (AIAI), Int. Federation for Inf. Proc. (IFIP), 204. Impact of face registration errors on recognition (2006), pp. 187–194.
31. L Wiskott, JM Fellous, N Krüger, C von der Malsburg, Face recognition by elastic bunch graph matching. IEEE Trans. Patt. Analy. Mach. Intell. (TPAMI) 19(7), 775–779 (1997).
32. A Albiol, D Monzo, A Martin, J Sastre, A Albiol, Face recognition using HOG-EBGM. Pattern Recogn. Lett. 29(10), 1537–1543 (2008).
33. RA Khan, A Meyer, H Konik, S Bouakaz, Framework for reliable, real-time facial expression recognition for low resolution images. Pattern Recogn. Lett. 34(10), 1159–1168 (2013).
34. M Liu, S Li, S Shan, R Wang, X Chen, in Proc. Asia Conf. Comput. Vis. (ACCV), 9006. Deeply learning deformable facial action parts model for dynamic expression analysis (2014), pp. 143–157.
35. P Scovanner, S Ali, M Shah, in Proc. ACM Multimedia Conf. (MM). A 3-dimensional SIFT descriptor and its application to action recognition (2007), pp. 357–360.
36. O Rudovic, V Pavlovic, M Pantic, in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR). Multi-output Laplacian dynamic ordinal regression for facial expression recognition and intensity estimation (2012), pp. 2634–2641.
37. L Xie, Q Tian, M Wang, B Zhang, Spatial pooling of heterogeneous features for image classification. IEEE Trans. Image Proc. (TIP) 23, 1994–2008 (2014).
38. J Chen, T Takiguchi, Y Ariki, A robust SVM classification framework using PSM for multi-class recognition. EURASIP J. Image Video Process. 2015(1), 1–12 (2015).
39. J Sánchez, F Perronnin, T Mensink, J Verbeek, Image classification with the Fisher vector: theory and practice. Int. J. Comput. Vis. (IJCV) 105(3), 222–245 (2013).
Authors’ contributions
JC designed the core methodology of the study and carried out the implementation and experiments, and he drafted the manuscript. ZL assisted in doing the experiments. TT and YA participated in the study and helped to draft the manuscript. All authors read and approved the final manuscript.
Competing interests
The authors declare that they have no competing interests.
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
Cite this article
Chen, J., Luo, Z., Takiguchi, T. et al. Multithreading cascade of SURF for facial expression recognition. EURASIP J. Image Video Process. 2016, 37 (2016). https://doi.org/10.1186/s13640-016-0140-7
Keywords
 AdaBoost
 Multithreading cascade
 SURF
 AUC
 Facial expression recognition