Handling missing weak classifiers in boosted cascade: application to multiview and occluded face detection
 Pierre Bouges^{1},
 Thierry Chateau^{1},
 Christophe Blanc^{1} and
 Gaëlle Loosli^{2}
https://doi.org/10.1186/16875281201355
© Bouges et al.; licensee Springer. 2013
Received: 2 July 2012
Accepted: 27 June 2013
Published: 25 October 2013
Abstract
We propose a generic framework to handle missing weak classifiers at testing stage in a boosted cascade. The main contribution is a probabilistic formulation of the cascade structure that considers the uncertainty introduced by missing weak classifiers. This new formulation involves two problems: (1) the approximation of posterior probabilities on each level and (2) the computation of thresholds on these probabilities to make a decision. Both problems are studied, and several solutions are proposed and evaluated. The method is then applied to two popular computer vision applications: detecting occluded faces and detecting faces in a pose different from the one learned. Experimental results on conventional databases are provided to compare the proposed strategies with basic ones.
Keywords
Pattern recognition; Supervised learning; Object detection; Missing data; Adaptation; Face
1 Introduction
Missing data in classification can be divided into two subproblems: (1) missing data at training stage and (2) missing data at testing stage. In this paper, we assume that missing data only occur at testing stage and that training is done with complete data. A recent study on missing data at testing stage can be found in [2], where Saar-Tsechansky and Provost evaluate different methods to handle such data. They compare two kinds of approach: reduced models and predictive value imputation. Their study does not focus on boosted cascades; the solution we propose in this paper is, to our knowledge, the first algorithm that handles missing data in a boosted cascade without modifying the initial training. Most existing solutions are based on learning algorithms that are designed to be robust to missing data. For example, Smeraldi et al. [3] used a modified version of adaptive boosting (AdaBoost) where weak classifiers can abstain when a feature is missing. Globerson and Roweis [4] proposed another algorithm, built to be robust to feature deletion. In the same way, Dekel and Shamir [5] improved this idea with an algorithm robust to both feature deletion and feature corruption. Chen et al. [6] proposed a solution to detect occluded faces using only one upright face classifier, but they lost the cascade structure, resulting in a high detection time.
Here we propose a generic solution to the problem of occluded object detection where occluded weak classifiers are considered as unavailable. Unavailable weak classifiers are seen as missing data, and this fact is incorporated in the cascade structure. We evaluate the proposed method for two different applications: (1) detecting occluded faces and (2) detecting faces in a pose different from the one learned. For each application, we explain how weak classifiers can be considered as available or not. Our method differs from former studies [1, 7] in two aspects: the proposed solution does not need the training of multiple classifiers, and, as opposed to existing methods where classifiers are designed to detect objects in a specific pose or with specific occlusions, the proposed solution relies on only one classifier that can adapt to specific poses or occlusions.
Section 2 presents the principle of boosted cascade. A new algorithm that handles missing weak classifiers in a boosted cascade is then detailed in Section 3. Application to occluded faces is presented in Section 4, followed by application to multiview face detection in Section 5. The proposed method is then evaluated in Section 6.
2 Boosted cascade overview
This section presents the principle of boosted cascade. The boosting algorithm was introduced by Schapire [8], and many extensions have been proposed. The main idea is to combine the performance of many weak classifiers to produce a powerful strong classifier. The goal is then to perform binary classification. In this paper, we focus on real boosting algorithms (e.g., Real AdaBoost, LogitBoost, or Gentle AdaBoost), which means that weak classifiers are real-valued functions.
Let $\mathcal{L}=\{(\mathbf{x}_i,y_i)\}_{i=1}^{N}$ be a training set where $\mathbf{x}_i$ are training examples and $y_i\in\{-1,1\}$ are their corresponding labels (1 is for the object class, also called the positive class). Given this set, a real boosting algorithm iteratively finds $T$ weak classifiers $h_t$ to form a strong classifier $\text{sign}(H(\mathbf{x}))=\text{sign}\left(\sum_{t=1}^{T}h_t(\mathbf{x})\right)$ where $\mathbf{x}$ is a sample to be classified. Moreover, $\text{sign}(h_t(\mathbf{x}))$ gives the label of $\mathbf{x}$ predicted by $h_t$, and the magnitude $|h_t(\mathbf{x})|$ represents the confidence of the prediction. Each training example $\mathbf{x}_i$ is an image $R_i$ of the object or non-object, and each weak classifier $h_t$ is learned on a set of subwindows $\{r_{ti}\}_{i=1}^{N}$ which correspond to discriminative areas in all images $\{R_i\}_{i=1}^{N}$ (see Figure 1a for an example of such subwindows).
To speed up classification, Viola and Jones [9] proposed a cascade structure where several strong classifiers are associated into successive levels. The idea is that the first strong classifiers reject most of the negative examples, while the last strong classifiers try to discriminate positive examples from hard negative examples. In such cascades, strong classifiers are slightly changed into $\text{sign}(H_j(\mathbf{x})-\alpha_j)=\text{sign}\left(\sum_{t=1}^{T_j}h_{jt}(\mathbf{x})-\alpha_j\right)$ where $\alpha_j$ are thresholds that are fixed during training (without cascade, $\alpha_j=0$). The training of a boosted cascade requires five elements: (1) the value $f_{\max}$, the maximum acceptable false-positive rate per level; (2) the value $d_{\min}$, the minimum acceptable detection rate per level; (3) the value $F$, the overall false-positive rate to be achieved; (4) a set $\mathcal{S}^{p}$ of positive images; and (5) a set of background images that will be used to generate interesting negative examples during training. The training of level $j$ consists of two steps: (1) applying the current cascaded detector (levels 1 to $j-1$) to the background images to generate false positives and create a set of negative examples $\mathcal{S}^{n}$ and (2) using $\mathcal{S}^{p}$ and $\mathcal{S}^{n}$ to train the strong classifier $\text{sign}(H_j-\alpha_j)$. This classifier is designed so that a detection rate of at least $d_{\min}$ and a false-positive rate of at most $f_{\max}$ are achieved. Both parameters $d_{\min}$ and $f_{\max}$ are fixed by the user. These two steps are repeated until the constraint defined by $F$ is satisfied. In this paper, we consider that the training stage is already done: the cascade of strong classifiers $\{\text{sign}(H_1-\alpha_1),\dots,\text{sign}(H_K-\alpha_K)\}$ is available.
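As an illustration of how such a trained cascade is applied at testing stage, the following Python sketch evaluates the levels in order and rejects a sample as soon as one level's score falls below its threshold. The function names, the representation of weak classifiers as plain callables, and the convention that a score of exactly zero counts as positive are assumptions made for this example, not details taken from the original implementation.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical type: a weak classifier maps a sample to a real-valued confidence.
WeakClassifier = Callable[[Sequence[float]], float]


def strong_score(weak_classifiers: List[WeakClassifier], x: Sequence[float]) -> float:
    """H_j(x): sum of the real-valued outputs of the weak classifiers of one level."""
    return sum(h(x) for h in weak_classifiers)


def cascade_classify(
    levels: List[List[WeakClassifier]],
    alphas: List[float],
    x: Sequence[float],
) -> Tuple[int, int]:
    """Return (label, level reached), with label in {-1, 1}.

    A sample is rejected at the first level j where H_j(x) - alpha_j < 0
    (assumption: a difference of exactly 0 is treated as positive).
    """
    for j, (weak_classifiers, alpha) in enumerate(zip(levels, alphas), start=1):
        if strong_score(weak_classifiers, x) - alpha < 0:
            return -1, j
    return 1, len(levels)
```

Because most negative samples are rejected by the first levels, the later and larger levels are only evaluated on a small fraction of the tested windows.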
The following section presents a generic framework to use this cascade when some weak classifiers h _{ jt } are missing at testing stage.
3 Handling missing weak classifiers
3.1 Naive approach
Suppose that we want to classify a sample $\mathbf{x}$ with a strong classifier $\text{sign}(H-\alpha)$ where $H$ is made up of a set of weak classifiers $\{h_1,\dots,h_T\}$. Suppose also that only $p<T$ weak classifiers are available, given by $\{h_{\mathrm{a}_1},\dots,h_{\mathrm{a}_p}\}$. The set of unavailable weak classifiers is defined as $\{h_{\mathrm{u}_1},\dots,h_{\mathrm{u}_q}\}$ where $q=T-p$. In such a situation, the easiest strategy to classify $\mathbf{x}$ consists in setting all unavailable weak classifiers to zero, i.e., $h_{\mathrm{u}_1}(\mathbf{x})=\cdots=h_{\mathrm{u}_q}(\mathbf{x})=0$. If we note $H_{\mathrm{a}}(\mathbf{x})=\sum_{t=1}^{p}h_{\mathrm{a}_t}(\mathbf{x})$, the strong classifier becomes $\text{sign}(H_{\mathrm{a}}-\alpha)$. By applying this principle to all cascade levels, the set of strong classifiers becomes $\{\text{sign}(H_{1\mathrm{a}}-\alpha_1),\dots,\text{sign}(H_{K\mathrm{a}}-\alpha_K)\}$. To sum up, the naive approach consists in setting all unavailable weak classifiers to zero and keeping all cascade thresholds unchanged. This approach will be used as our baseline in the experiments section and will be referred to as the 'naive approach'.
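The naive approach can be sketched in a few lines of Python. The boolean availability masks and the function names are hypothetical; the sketch only illustrates the rule stated above: unavailable weak classifiers contribute zero and the thresholds $\alpha_j$ are left unchanged.

```python
from typing import Callable, List, Sequence

WeakClassifier = Callable[[Sequence[float]], float]


def naive_strong_score(
    weak_classifiers: List[WeakClassifier],
    available: List[bool],
    x: Sequence[float],
) -> float:
    """H_a(x): unavailable weak classifiers are skipped, i.e. set to zero."""
    return sum(h(x) for h, ok in zip(weak_classifiers, available) if ok)


def naive_cascade_classify(
    levels: List[List[WeakClassifier]],
    masks: List[List[bool]],
    alphas: List[float],
    x: Sequence[float],
) -> int:
    """Classify x with sign(H_ja(x) - alpha_j) at each level, thresholds unchanged."""
    for weak_classifiers, mask, alpha in zip(levels, masks, alphas):
        if naive_strong_score(weak_classifiers, mask, x) - alpha < 0:
            return -1
    return 1
```

In this toy setting, a sample accepted with all weak classifiers available can be rejected once some of them are masked, because the remaining scores are compared with a threshold that was tuned for the complete sum.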
3.2 Probabilistic formulation of a boosted cascade
This probabilistic formulation is very close to the one of Lefakis and Fleuret in [11]. Our motivation remains different because they proposed a new learning algorithm based on a probabilistic cascade formulation. In our case, we use a probabilistic formulation to handle the fact that some weak classifiers are missing at testing stage.
3.3 Posterior probability estimation
When weak classifiers are missing, the probability $P(y=1\mid\mathbf{x})$ can no longer be computed and an approximation must be used. We propose three different approximation strategies to do this:

The simplest strategy to estimate $P(y=1\mid\mathbf{x})$ is to compute a probability based on available weak classifiers. Thus, we define $P_{\text{boost}}(y=1\mid\mathbf{x})$ as: $P_{\text{boost}}(y=1\mid\mathbf{x})\doteq \dfrac{e^{H_{\mathrm{a}}(\mathbf{x})}}{e^{H_{\mathrm{a}}(\mathbf{x})}+e^{-H_{\mathrm{a}}(\mathbf{x})}}.$ (7)

A second strategy, noted $P_{\text{knn}}(y=1\mid\mathbf{x})$, tries to benefit from the initial training. Indeed, each training example $\mathbf{x}_i$ provides a set of weak classifier values $h_{\mathbf{x}_i}=(h_1(\mathbf{x}_i),\dots,h_T(\mathbf{x}_i))$ and an associated label $y_i$. All these weak classifier values form a set $\mathscr{H}=\{(h_{\mathbf{x}_i},y_i)\}_{i=1}^{N}$, and the subset of available weak classifiers forms $\mathscr{H}_{\mathrm{a}}=\{(h_{\mathrm{a}_{\mathbf{x}_i}},y_i)\}_{i=1}^{N}$ where $h_{\mathrm{a}_{\mathbf{x}_i}}=(h_{\mathrm{a}_1}(\mathbf{x}_i),\dots,h_{\mathrm{a}_p}(\mathbf{x}_i))$. The resulting set $\mathscr{H}_{\mathrm{a}}$ is used as a training set to approximate $P(y=1\mid\mathbf{x})$ with the help of the k-nearest neighbor (knn) algorithm. Given a sample $\mathbf{x}$, its associated available weak classifier scores $h_{\mathrm{a}_{\mathbf{x}}}=(h_{\mathrm{a}_1}(\mathbf{x}),\dots,h_{\mathrm{a}_p}(\mathbf{x}))$ are first computed. Then, the knn algorithm searches the $k$ nearest neighbors of the point $h_{\mathrm{a}_{\mathbf{x}}}$ in the space $\mathscr{H}_{\mathrm{a}}$. Considering the labels $\{y_1^{\ast},\dots,y_k^{\ast}\}$ of the $k$ nearest neighbors, the probability $P_{\text{knn}}(y=1\mid\mathbf{x})$ is computed as: $P_{\text{knn}}(y=1\mid\mathbf{x})\doteq \sum_{i=1}^{k}\dfrac{\mathbb{1}_{\{y_i^{\ast}=1\}}}{k},$ (8)

where $\mathbb{1}_{\{\text{pred}\}}=1$ if the predicate (pred) is true and $\mathbb{1}_{\{\text{pred}\}}=0$ otherwise. Figure 4 illustrates the computation of $P_{\text{knn}}(y=1\mid\mathbf{x})$ when two weak classifiers are available.

An additional strategy, noted $P_{\text{comb}}(y=1\mid\mathbf{x})$, consists in combining the two previous methods in the simplest way, by averaging them: $P_{\text{comb}}(y=1\mid\mathbf{x})\doteq \dfrac{P_{\text{boost}}(y=1\mid\mathbf{x})+P_{\text{knn}}(y=1\mid\mathbf{x})}{2}.$ (9)
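The three estimators of Equations 7 to 9 can be sketched as follows. This is a minimal illustration that uses a brute-force Euclidean neighbor search from the Python standard library instead of the FLANN library used in the experiments; the function names and the choice of the Euclidean distance are assumptions.

```python
import math
from typing import List, Sequence


def p_boost(available_scores: Sequence[float]) -> float:
    """Equation 7: e^{H_a} / (e^{H_a} + e^{-H_a}), with H_a the sum of available scores.

    The expression is rewritten as a logistic function of 2 * H_a so that
    large positive or negative sums do not overflow.
    """
    h_a = sum(available_scores)
    if h_a >= 0:
        return 1.0 / (1.0 + math.exp(-2.0 * h_a))
    e = math.exp(2.0 * h_a)
    return e / (1.0 + e)


def p_knn(
    query: Sequence[float],
    train_vectors: List[Sequence[float]],
    train_labels: List[int],
    k: int = 3,
) -> float:
    """Equation 8: fraction of the k nearest training vectors whose label is 1.

    Vectors hold the scores of the available weak classifiers only.
    """
    if not 1 <= k <= len(train_vectors):
        raise ValueError("k must be between 1 and the number of training vectors")
    order = sorted(
        range(len(train_vectors)),
        key=lambda i: math.dist(query, train_vectors[i]),
    )
    return sum(1 for i in order[:k] if train_labels[i] == 1) / k


def p_comb(p_boost_value: float, p_knn_value: float) -> float:
    """Equation 9: plain average of the two previous estimates."""
    return 0.5 * (p_boost_value + p_knn_value)
```

Note that $P_{\text{knn}}$ can only take the $k+1$ values $0, 1/k, \dots, 1$, whereas $P_{\text{boost}}$ varies continuously with $H_{\mathrm{a}}(\mathbf{x})$.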
3.4 Boosted McCascade threshold estimation
Before a McCascade can be used to classify a sample $\mathbf{x}$, the thresholds $\beta_1,\dots,\beta_K$ must be estimated. This estimation can be seen as the training stage of a McCascade. It is achieved through an iterative procedure which uses the set $\mathcal{S}^{p}$ and the background images from the initial training stage. This procedure is described in Algorithm 1. At iteration $j$, the threshold $\beta_j$ of level $j$ is computed using the following scheme: all probabilities $p_{ji}\doteq P(y_{1:j}=1\mid\mathbf{x}_i)$ are first computed. Then, the set of probabilities $\{p_{ji}\}_{i=1}^{N}$ is sorted and $\beta_j$ is chosen among the set of finite values $\tilde{p}_{ji}\doteq 0.5\,(p_{ji}+p_{j(i+1)}),\ i\in\{1,\dots,N-1\}$. The function find_optimal_threshold (see Algorithm 2) finds the threshold that minimizes a cost function defined on false-positive and true-positive rates. Contrary to the initial cascade where each level ensures reaching a true-positive rate of at least $d_{\min}$ with a false-positive rate of at most $f_{\max}$, the McCascade cannot guarantee the same performance. The cost function's goal is to ensure that each threshold found provides a performance close to the initial cascade performance. Three cost functions are proposed:

FP_cost is defined on the false-positive rate $f_{\beta}$ associated with a threshold $\beta$: $\text{FP\_cost}(f_{\beta})\doteq \max(0,\,f_{\beta}-f_{\max}).$ (10)

The false-positive rate $f_{\beta}$ is computed on the training examples. Using this function means that the threshold found provides a false-positive rate which is as close as possible to $f_{\max}$ (it remains greater than or equal to $f_{\max}$).

TP_cost is defined on the true-positive rate $d_{\beta}$ associated with a threshold $\beta$: $\text{TP\_cost}(d_{\beta})\doteq \max(0,\,d_{\min}-d_{\beta}).$ (11)

The true-positive rate $d_{\beta}$ is computed on the training examples. The threshold computed with this function will ensure a true-positive rate close to $d_{\min}$ (it remains lower than or equal to $d_{\min}$).

FP_TP_cost is defined on both false-positive and true-positive rates: $\text{FP\_TP\_cost}(f_{\beta},d_{\beta})\doteq \text{FP\_cost}(f_{\beta})+\text{TP\_cost}(d_{\beta}).$ (12)

This last cost function is a compromise between a false-positive rate of $f_{\max}$ and a true-positive rate of $d_{\min}$.
A detailed version of find_optimal_threshold with the cost function FP_TP_cost is given in Algorithm 2. Once all the thresholds ${\beta}_{1},\dots ,{\beta}_{K}$ are estimated, the McCascade can be used to classify any unknown sample x.
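The cost functions of Equations 10 to 12 and a threshold search over the midpoints $\tilde{p}_{ji}$ can be sketched as follows. This is not a transcription of Algorithm 2: the acceptance rule $p\geq\beta$, the exhaustive search, and the tie-breaking (the lowest candidate with minimal cost is returned) are assumptions made for the example.

```python
from typing import List, Sequence, Tuple


def fp_cost(f_beta: float, f_max: float) -> float:
    """Equation 10: penalty when the false-positive rate exceeds f_max."""
    return max(0.0, f_beta - f_max)


def tp_cost(d_beta: float, d_min: float) -> float:
    """Equation 11: penalty when the true-positive rate falls below d_min."""
    return max(0.0, d_min - d_beta)


def fp_tp_cost(f_beta: float, d_beta: float, f_max: float, d_min: float) -> float:
    """Equation 12: sum of the two previous costs."""
    return fp_cost(f_beta, f_max) + tp_cost(d_beta, d_min)


def rates(probs: Sequence[float], labels: Sequence[int], beta: float) -> Tuple[float, float]:
    """Return (f_beta, d_beta) on the training examples for a threshold beta.

    Assumption: a sample passes the level when its probability is >= beta.
    Both classes must be present in labels.
    """
    pos = [p for p, y in zip(probs, labels) if y == 1]
    neg = [p for p, y in zip(probs, labels) if y != 1]
    d_beta = sum(1 for p in pos if p >= beta) / len(pos)
    f_beta = sum(1 for p in neg if p >= beta) / len(neg)
    return f_beta, d_beta


def find_optimal_threshold(
    probs: Sequence[float], labels: Sequence[int], f_max: float, d_min: float
) -> float:
    """Pick, among midpoints of consecutive sorted probabilities, the one
    minimizing FP_TP_cost (the first minimum wins in case of ties)."""
    ordered: List[float] = sorted(probs)
    candidates = [0.5 * (ordered[i] + ordered[i + 1]) for i in range(len(ordered) - 1)]

    def cost(beta: float) -> float:
        f_beta, d_beta = rates(probs, labels, beta)
        return fp_tp_cost(f_beta, d_beta, f_max, d_min)

    return min(candidates, key=cost)
```

Swapping `fp_tp_cost` for `fp_cost` or `tp_cost` inside `cost` gives the two other strategies; since each of these costs is zero over a whole range of thresholds, the tie-breaking rule then largely determines which threshold is returned.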
3.5 Cascade and McCascade training time
When a McCascade is created, the thresholds $\beta_1,\dots,\beta_K$ must be computed. This step can be seen as the training stage of a McCascade. Compared to a cascade, a McCascade needs less time to be trained. The training time of a cascade depends on many parameters: number of training samples, number of levels, implementation (C++/MATLAB), etc. Rather than giving precise training times to compare a cascade and a McCascade, rough estimates are given here to emphasize the fact that a McCascade is faster to train than a cascade.
Training a cascade involves the following steps:
 1. Gather training data. Training data are made up of the positive images and of the background images. This step can last a few seconds if a public database exists. It can also last a few days if images must be manually gathered.
 2. Generate false positives. At the beginning of each level, the negative samples are generated by applying the current classifier to the set of background images. This step can last a few seconds to a few minutes.
 3. Train a cascade level. At each boosting iteration, several weak classifiers are learned (one for each subwindow), and the best one is kept. The number of iterations depends on the classification performance that must be reached. This step can last a few minutes to a few hours.
Training a McCascade involves the following steps:
 1. Generate false positives. At the beginning of each level, the negative samples are generated by applying the current classifier to the set of background images. This step can last a few seconds to a few minutes.
 2. Fix the level threshold. A probability is computed for each training example, and the threshold is computed according to these probabilities. This step can last a few milliseconds to a few seconds.
4 Application to occluded face detection
Occlusions can greatly change the appearance of a face, and an upright face detector will easily fail to detect such faces. A cascaded detector that can deal with occlusions has already been proposed by Lin et al. [7]. Their solution relies on the training of nine cascaded detectors (one main cascade + eight occlusion cascades) that are then combined. This solution exhibits good performance at the cost of a prohibitive training time. On the other hand, Chen et al. [6] also proposed a detector to handle occlusion with only one training. They first train a boosted cascade and then combine all the weak classifiers learned to obtain a detector robust to occlusions. The problem is that the cascade structure is lost, resulting in a long execution time. Our solution relies on the use of an upright face detector and the definition of several occlusion configurations. Each occlusion configuration is associated with a set of occluded weak classifiers among all the weak classifiers of the upright face detector. Based on this set, a McCascade that only uses non-occluded weak classifiers can be built. Each McCascade created in this way is called an occlusion cascade. Hence, we build several occlusion cascades which are then combined with the principle of cascading with evidence explained later.
4.1 Occlusion cascade creation
Based on these two sets, two McCascades $\mathcal{C}^{\mathcal{A}}$ and $\mathcal{C}^{\mathcal{B}}$ can be created. $\mathcal{C}^{\mathcal{A}}$ only uses weak classifiers defined in $\mathscr{H}^{\mathcal{A}}$. In the same way, $\mathcal{C}^{\mathcal{B}}$ only uses weak classifiers defined in $\mathscr{H}^{\mathcal{B}}$. Finally, thresholds $\beta_j$ of both McCascades are fixed with the help of Algorithm 1.
4.2 Cascading with evidence
The vector ε _{ j }(x) is called the evidence of x at level j.
Equation 16 means that $H_j^{\mathcal{I}}$ only involves weak classifiers over subwindows that do not intersect with $\mathcal{O}_{\mathcal{I}}$. With the evidence vector presented in Equation 15, weak classifiers can now be defined as available or not depending on the occlusion encountered. Indeed, let $\mathbf{x}$ be an occluded face example of type $\mathcal{A}$ and suppose that the main cascade rejects it at level $j$ because $H_j(\mathbf{x})<\alpha_j$. Before rejecting it, we check the evidence vector of $\mathbf{x}$. In particular, the majority of $H_1^{\mathcal{A}}(\mathbf{x}),\dots,H_j^{\mathcal{A}}(\mathbf{x})$ should be positive, indicating that $\mathbf{x}$ is an occluded face of type $\mathcal{A}$. Based on this fact, weak classifiers that can handle the occlusion (i.e., $h_{jt}$ verifying $\mathcal{S}_{jt}\cap\mathcal{O}_{\mathcal{A}}=\varnothing$) are defined as available, and $\mathbf{x}$ continues the classification process with the McCascade $\mathcal{C}^{\mathcal{A}}$ defined on available weak classifiers. Generally speaking, if a sample is occluded of type $\mathcal{I}$ and if this sample is rejected by the main cascade, this sample will be passed to the McCascade $\mathcal{C}^{\mathcal{I}}$. Note that with this principle of cascading with evidence, there is no explicit occlusion detection.
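The control flow of cascading with evidence can be sketched as follows. This is a simplified illustration under several assumptions: the partial scores $H_j^{\mathcal{I}}$ and the McCascades are passed in as opaque callables, the evidence is reduced to a majority vote on the sign of the partial scores collected so far, the McCascade is assumed to resume at the level where the main cascade rejected the sample, and the first occlusion type whose evidence is positive wins. All names are hypothetical.

```python
from typing import Callable, Dict, List, Optional, Sequence, Tuple

Score = Callable[[Sequence[float]], float]
# Hypothetical McCascade interface: (sample, index of the rejecting level) -> label.
McCascade = Callable[[Sequence[float], int], int]


def majority_positive(evidence: List[float]) -> bool:
    """True when more than half of the collected partial scores are positive."""
    return sum(1 for e in evidence if e > 0) > len(evidence) / 2


def classify_with_evidence(
    x: Sequence[float],
    main_levels: List[Score],
    alphas: List[float],
    occlusion_types: Dict[str, Tuple[List[Score], McCascade]],
) -> Tuple[int, Optional[str]]:
    """Return (label, occlusion type used or None)."""
    evidence: Dict[str, List[float]] = {name: [] for name in occlusion_types}
    for j, (h_main, alpha) in enumerate(zip(main_levels, alphas)):
        # Collect the partial scores H_j^I(x) for every occlusion type I.
        for name, (partial_levels, _) in occlusion_types.items():
            evidence[name].append(partial_levels[j](x))
        if h_main(x) - alpha < 0:
            # Main cascade rejects: check the evidence before giving up.
            for name, (_, mccascade) in occlusion_types.items():
                if majority_positive(evidence[name]):
                    return mccascade(x, j), name
            return -1, None
    return 1, None
```

As stated above, no explicit occlusion detector is involved: the decision to switch to an occlusion cascade only relies on scores that restrict the main cascade's weak classifiers to the non-occluded subwindows.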
5 Application to multiview face detection
5.1 Detecting faces with ROP angle
 1. We associate a point $P_1^i=(u_1\ v_1\ w_1)^T$ to the point $p_1^i$. $P_1^i$ is the 3D point with the same x-coordinate and y-coordinate as $p_1^i$ that belongs to the ellipsoid. We just have to compute the z-coordinate $w_1$ with the help of the ellipsoid equation expressed in $\mathit{CS}_i$ (see Figure 11a): $\dfrac{(u-u_0)^2}{a^2}+\dfrac{(v-v_0)^2}{b^2}+\dfrac{(w-w_0)^2}{c^2}=1,$ (17)
 2. We express $P_1^i$ in the coordinate system $\mathit{CS}_e$ whose origin is the ellipsoid center. This gives us the $P_1^e$ point: $\begin{bmatrix}\tilde{x}_1\\ \tilde{y}_1\\ \tilde{z}_1\\ \tilde{d}_1\end{bmatrix}=\begin{bmatrix}1&0&0&w/2\\ 0&1&0&w/2\\ 0&0&1&0\\ 0&0&0&1\end{bmatrix}\begin{bmatrix}u_1\\ v_1\\ w_1\\ 1\end{bmatrix},$ (18)
 3. Finally, we express $P_2^e$ in $\mathit{CS}_i$ to get the $P_2^i$ point (see Figure 11c): $\begin{bmatrix}\tilde{u}_2\\ \tilde{y}_2\\ \tilde{z}_2\\ \tilde{d}_2\end{bmatrix}=\begin{bmatrix}1&0&0&w/2\\ 0&1&0&w/2\\ 0&0&1&0\\ 0&0&0&1\end{bmatrix}^{-1}\begin{bmatrix}x_2\\ y_2\\ z_2\\ 1\end{bmatrix}.$ (20)
 1. Modifying the position of all subwindows using an ellipsoid model,
 2. Defining the set of available weak classifiers by checking that their associated subwindows do not disappear after rotation, and
 3. Creating the McCascade using available weak classifiers.
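The geometric part of this procedure can be illustrated with the following Python sketch, which moves a single 2D point: it is lifted onto the ellipsoid (Equation 17), expressed relative to the ellipsoid center, rotated, and projected back. Several elements are assumptions made for the example: the rotation is taken about the vertical axis by an angle `theta_y` with the sign convention coded below, the front half of the ellipsoid is used when solving for the depth, the ellipsoid center `(u0, v0, w0)` is used directly as the translation, the projection is orthographic, and a point is declared to have disappeared when it ends up on the back side of the ellipsoid.

```python
import math
from typing import Optional, Tuple

Point2D = Tuple[float, float]
Triple = Tuple[float, float, float]


def lift_to_ellipsoid(u: float, v: float, center: Triple, axes: Triple) -> Optional[Triple]:
    """Solve the ellipsoid equation for the depth w of the image point (u, v).

    Returns None when (u, v) lies outside the ellipsoid outline.
    """
    u0, v0, w0 = center
    a, b, c = axes
    s = 1.0 - ((u - u0) / a) ** 2 - ((v - v0) / b) ** 2
    if s < 0:
        return None
    return (u, v, w0 + c * math.sqrt(s))  # front half (assumption)


def rotate_point(
    u: float, v: float, theta_y: float, center: Triple, axes: Triple
) -> Optional[Point2D]:
    """Return the new image position of (u, v) after a rotation of theta_y
    radians, or None if the point is no longer visible."""
    lifted = lift_to_ellipsoid(u, v, center, axes)
    if lifted is None:
        return None
    u0, v0, w0 = center
    # Coordinates relative to the ellipsoid center.
    x, y, z = lifted[0] - u0, lifted[1] - v0, lifted[2] - w0
    # Rotation about the vertical axis (assumed convention).
    x2 = x * math.cos(theta_y) + z * math.sin(theta_y)
    z2 = -x * math.sin(theta_y) + z * math.cos(theta_y)
    if z2 < 0:
        return None  # rotated onto the hidden side of the ellipsoid
    # Back to image coordinates, dropping the depth (orthographic projection).
    return (x2 + u0, y + v0)
```

Applying `rotate_point` to the corners of a subwindow and declaring the associated weak classifier unavailable as soon as one corner returns `None` would be one possible way to implement the availability check of step 2.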
5.2 A multiview system
Note that the system used to combine the three detectors can be extended to get a face detector robust to pose and to occlusion. Indeed, using this system, several occlusion cascades (presented in Section 4.1) and several pose-specific detectors (presented in Section 5.1) can be combined.
6 Experiments
This section presents the experiments conducted in order to (1) evaluate the performance of McCascade compared to the naive approach and (2) evaluate the McCascade algorithm for two concrete applications: occluded face detection and multiview face detection. In these experiments, upright face detectors are similar to the system of Tuzel et al. [15]: covariance matrices are used as features [16], and the learning algorithm is a cascade of LogitBoost [10]. Weak classifiers are linear functions that are learned from a set of feature vectors. A feature vector is derived from a covariance matrix by taking its upper triangular part. The only difference with the system of [15] is that we assume that a feature vector lies on a vector space (in [15], a feature vector lies on a Riemannian manifold).
The first part of the experiments, related to McCascade performance (Sections 6.2 and 6.3), is done with an upright face cascaded detector of three levels with 5, 10, and 25 weak classifiers, respectively. Positive examples come from the labeled upright faces in the wild database [17], and negative samples were generated from 1,310 images containing no face. A total of 4,000 positive examples and 8,000 negative examples are used to train each cascade level. The second part of the experiments, related to applications (Sections 6.4, 6.5, and 6.6), is done with an upright face detector of nine levels. This detector is noted . Each level was trained with 5,000 positive examples and 5,000 negative examples. Each level was designed so that a detection rate of at least $d_{\min}=0.998$ and a false-positive rate of at most $f_{\max}=0.5$ were achieved on training examples. The positive examples again come from the labeled upright faces in the wild database, and negative samples were generated from 2,500 images containing no face. The FLANN library [18] is used to perform nearest neighbor searches (used in $P_{\text{knn}}$ and $P_{\text{comb}}$). The test database is the CMU frontal face test set A, which consists of 42 images showing 169 upright faces with varied background [19].
In the first part of the experiments, receiver operating characteristic (ROC) curves are used to evaluate and compare performance, and all results shown are raw, i.e., the post-processing step of merging multiple detections is not taken into account. This means that the false-positive rate can be reduced with this post-processing step without modifying the true-positive rate. When multiple detections occur for the same person, only the one with the highest classification score is kept. The others are simply ignored. In the second part of the experiments, free-response ROC (FROC) curves are used, and multiple detections are merged. Contrary to the ROC curve which plots the detection rate versus the false acceptance rate, the FROC curve plots the detection rate versus the number of false positives and is better suited to evaluating the performance of an object detector in specific applications. Different experiments were conducted to evaluate the different aspects of our method. In Section 6.2, we test the three proposed cost functions TP_cost, FP_cost, and FP_TP_cost used in the computation of the McCascade thresholds. Then, Section 6.3 deals with the evaluation of the different strategies used to estimate the posterior probability: $P_{\text{boost}}$, $P_{\text{knn}}$, and $P_{\text{comb}}$. After these two series of experiments, we apply our method to two specific applications: detecting faces occluded by a scarf or sunglasses (see Section 6.4) and detecting faces in a pose different from the one learned (see Section 6.5).
6.1 Good detection criterion
$\rho$ stands for the precision area and $\pi$ for the recall area. GT is the ground truth area, and D is the detection area. The operator $|R|$ is the number of pixels in the area $R$. A detection matches the ground truth if $F_{\text{overlap}}>0.5$.
6.2 Evaluation of threshold estimation strategies
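Evaluation of cost function used to compute thresholds $\beta_j$ when 50% of weak classifiers are missing

| Strategy | k | Cost function | FP (×10^{-3}) | TP | $\overline{n_{\text{level}}}$ |
|---|---|---|---|---|---|
| $P_{\text{boost}}$ | - | TP_cost | 3.21 | 0.88 | 1.61 |
| $P_{\text{boost}}$ | - | FP_cost | 0.066 | 0.1 | 1.48 |
| $P_{\text{boost}}$ | - | FP_TP_cost | 0.08 | 0.15 | 1.48 |
| $P_{\text{knn}}$ | 3 | TP_cost | 5.56 | 0.95 | 1.29 |
| $P_{\text{knn}}$ | 3 | FP_cost | 0.14 | 0.52 | 1.26 |
| $P_{\text{knn}}$ | 3 | FP_TP_cost | 0.17 | 0.44 | 1.29 |
| $P_{\text{comb}}$ | 3 | TP_cost | 5.43 | 0.95 | 1.62 |
| $P_{\text{comb}}$ | 3 | FP_cost | 0.006 | 0.12 | 1.48 |
| $P_{\text{comb}}$ | 3 | FP_TP_cost | 0.03 | 0.24 | 1.48 |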
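Evaluation of cost function used to compute thresholds $\beta_j$ when 60% of weak classifiers are missing

| Strategy | k | Cost function | FP (×10^{-3}) | TP | $\overline{n_{\text{level}}}$ |
|---|---|---|---|---|---|
| $P_{\text{boost}}$ | - | TP_cost | 8.4 | 0.95 | 1.64 |
| $P_{\text{boost}}$ | - | FP_cost | 0.058 | 0.06 | 1.48 |
| $P_{\text{boost}}$ | - | FP_TP_cost | 0.17 | 0.29 | 1.49 |
| $P_{\text{knn}}$ | 3 | TP_cost | 8.25 | 0.96 | 1.32 |
| $P_{\text{knn}}$ | 3 | FP_cost | 0.15 | 0.56 | 1.26 |
| $P_{\text{knn}}$ | 3 | FP_TP_cost | 0.28 | 0.58 | 1.32 |
| $P_{\text{comb}}$ | 3 | TP_cost | 11.9 | 0.97 | 1.67 |
| $P_{\text{comb}}$ | 3 | FP_cost | 0.005 | 0.11 | 1.48 |
| $P_{\text{comb}}$ | 3 | FP_TP_cost | 0.16 | 0.49 | 1.5 |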
6.3 Performance of the posterior probability estimation
6.4 Occluded face detection
In this section, we evaluate the performance of McCascade coupled with the principle of cascading with evidence in a specific application: detecting faces with top occlusions (like sunglasses) or bottom occlusions (like a scarf). We only consider these two types of occlusions for two reasons. The first is that we are working in a video surveillance context in which these two types of occlusions are often encountered. The second reason is that a public database with these two types of occlusion is available: the AR database.
6.4.1 Evaluation on the AR database
The AR database [21] is used first. In particular, we use the 765 images of faces occluded by a scarf and the 765 images of faces occluded by sunglasses. The classifier used here is the upright face detector of nine levels. Using this cascade, we build a McCascade $\mathcal{C}^{\mathcal{A}}$ that can handle bottom occlusion and a McCascade $\mathcal{C}^{\mathcal{B}}$ that can handle top occlusion. Also, a detector that associates the upright cascade, $\mathcal{C}^{\mathcal{A}}$, and $\mathcal{C}^{\mathcal{B}}$ with the principle of cascading with evidence is created. This detector will be noted 'McCascades + evidence' in the results. The McCascade $\mathcal{C}^{\mathcal{A}}$ has, on average, 42% unavailable weak classifiers per level. The McCascade $\mathcal{C}^{\mathcal{B}}$ has, on average, 46% unavailable weak classifiers per level.
Two scenarios are tested:

Scenario 1. We consider images of faces occluded by a scarf, and we then compare (1) the upright cascade, (2) the McCascade $\mathcal{C}^{\mathcal{A}}$, and (3) the detector McCascades + evidence.

Scenario 2. We consider images of faces occluded by sunglasses, and we then compare (1) the upright cascade, (2) the McCascade $\mathcal{C}^{\mathcal{B}}$, and (3) the detector McCascades + evidence.
For all scenarios, FROC curves are computed. To create the FROC curve of a cascaded detector, several threshold values are tested for the last level, which results in corresponding points of detection rate and number of false positives. To get more points (points with a higher detection rate and a higher number of false positives), the last level must be removed, and then different thresholds for the new last level are tested. This procedure continues until enough points are collected. When several cascades are associated (e.g., in the detector 'McCascades + evidence'), creating a FROC curve is not straightforward because each cascade has its own thresholds. To alleviate this problem, we use the idea proposed by Viola and Jones in [22]. To create FROC curves from multiple cascades, thresholds are simultaneously modified in all cascades. In the same way, layers are simultaneously removed in all cascades.
Finally, we normalize all the values between 0 and 1. Based on this map, we understand that our method fails on faces occluded by sunglasses because, in this scenario, we only use weak classifiers located on the lower part of the face which are too weak to ensure good performance.
6.4.2 Evaluation in reallife scenario
Three detectors are applied to this sequence:

Upright face detector. It is noted 'FD_{cov}' in the results.

Detector that associates the upright face detector, $\mathcal{C}^{\mathcal{A}}$, and $\mathcal{C}^{\mathcal{B}}$ with the principle of cascading with evidence. It is noted 'FD_{cov} + occlusion' in the results.

Upright face detector of the OpenCV library (the file haarcascade_frontalface_alt_tree.xml is used). This detector is the implementation of the solution of Lienhart et al. [23]. This classifier is a cascade of boosted classifiers. Haar features are used. It is noted 'FD_{haar}' in the results.
6.5 Multiview face detection
In this part of the experiments, the boosted McCascade algorithm has been applied to another specific application: detecting faces in different poses using an upright face detector. The FERET database [24] was used to evaluate the system. We test our method on faces turned 22.5°, 45° and 67.5°. For each angle, all the subwindow positions are first adjusted using the procedure described in Section 5.
6.5.1 Ellipsoid parameters
 1. Based on the upright face classifier, we create two classifiers $\mathcal{C}^{22.5}$ and $\mathcal{C}^{45}$ by adjusting all the subwindow positions using ellipsoid parameters $(a_i,b_i,c_i)$. Subwindows that disappear are handled by the naive approach presented in Section 3.1, i.e., associated weak classifiers are simply ignored.
 2. $\mathcal{C}^{22.5}$ is applied to the validation set of images of faces turned 22.5°, and the ROC curve is computed. Then, the area under the ROC curve is computed, which gives $\text{auc}_i^{22.5}$ (auc is a criterion to compare ROC curves: the higher it is, the better the ROC curve). Using $\mathcal{C}^{45}$, we also get $\text{auc}_i^{45}$.
 3. Finally, the overall value $\text{auc}_i=\text{auc}_i^{22.5}+\text{auc}_i^{45}$ is computed.
Parameters with the best value $\text{auc}_i$ were kept. We found that $a=2.0\cdot w/2$, $b=w$, and $c=w/2$ give the best results.
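The selection criterion can be sketched as follows: for each candidate parameter triple, the areas under the two ROC curves are summed and the triple with the highest sum is kept. The trapezoidal integration and the data layout (a dictionary mapping each triple to its two ROC curves) are assumptions made for this example.

```python
from typing import Dict, List, Tuple

Roc = Tuple[List[float], List[float]]  # (false-positive rates, true-positive rates)
Params = Tuple[float, float, float]    # ellipsoid parameters (a, b, c)


def auc(fpr: List[float], tpr: List[float]) -> float:
    """Area under a ROC curve, computed with the trapezoidal rule."""
    points = sorted(zip(fpr, tpr))
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


def select_ellipsoid_parameters(results: Dict[Params, Tuple[Roc, Roc]]) -> Params:
    """Return the parameters maximizing auc^{22.5} + auc^{45}.

    results maps each candidate (a, b, c) to the ROC curves obtained on the
    22.5 degree and 45 degree validation sets.
    """
    def overall(params: Params) -> float:
        roc_22_5, roc_45 = results[params]
        return auc(*roc_22_5) + auc(*roc_45)

    return max(results, key=overall)
```

Summing the two areas gives the same weight to both angles; a weighted sum would be a possible variant if one pose mattered more than the other.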
6.5.2 Modification of subwindow positions
Here, the use of an ellipsoid to modify subwindow positions is evaluated. Three detectors are built:

${\mathcal{C}}^{22.5}$ is a detector of faces turned 22.5°,

${\mathcal{C}}^{45}$ is a detector of faces turned 45°, and

${\mathcal{C}}^{67.5}$ is a detector of faces turned 67.5°.
1. The subwindow position modification should compensate for the modified appearance of a face turned by an angle $\theta_y$. As $\theta_y$ increases, the change in appearance becomes stronger and therefore much more difficult to compensate for.
2. In Section 5, we explain that some subwindows can disappear due to rotation. The number of subwindows that disappear increases with the angle $\theta_y$, and this loss degrades the initial performance.
6.5.3 Association with a McCascade
The three detectors of the previous section ${\mathcal{C}}^{22.5}$, ${\mathcal{C}}^{45}$, and ${\mathcal{C}}^{67.5}$ have some unavailable weak classifiers:

${\mathcal{C}}^{22.5}$ has, on average, 18% unavailable weak classifiers per level.

${\mathcal{C}}^{45}$ has, on average, 27% unavailable weak classifiers per level.

${\mathcal{C}}^{67.5}$ has, on average, 44% unavailable weak classifiers per level.
Rather than handling these unavailable weak classifiers with the naive approach, the cascade structure can be modified into a McCascade. In this section, the structure of the three detectors is changed into a McCascade. The strategy $P_{\text{knn}}$ is used with $k=3$ neighbors, and the thresholds $\beta_j$ are fixed using the cost function TP_cost. In Figures 23 and 24, these detectors are denoted 'MaMcCascade'. On faces turned 22.5° and 45°, the improvement over the naive approach is slight (an increase in detection rate of 2% to 5%). The impact of using a McCascade is greater on faces turned 67.5°: contrary to the naive approach, the McCascade improves the detection rate with only a few false positives. However, performance remains limited. For example, 55% of faces are detected with 12 false positives, whereas this rate is 90% when faces are turned 22.5° and 45°.
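To make the role of $k$ and $\beta_j$ concrete, the following sketch shows a generic k-nearest-neighbor posterior estimate restricted to the available weak classifiers, followed by a threshold test on one level. It is an illustration under stated assumptions, not the exact $P_{\text{knn}}$ procedure: it assumes that each training sample of a level is stored as the vector of its weak classifier outputs with a 0/1 label, that distances are Euclidean, and that the posterior is the fraction of faces among the $k$ neighbors. All function names are hypothetical.

```python
import math
from typing import List, Sequence


def knn_posterior(
    test_outputs: Sequence[float],
    available: Sequence[int],
    train_outputs: Sequence[Sequence[float]],
    train_labels: Sequence[int],
    k: int = 3,
) -> float:
    """Estimate P(face | available weak classifier outputs) on one level.

    Distances are computed only over the indices in `available`, so
    missing weak classifiers do not influence the neighbor search.
    """
    scored: List[tuple] = []
    for vec, label in zip(train_outputs, train_labels):
        dist = math.sqrt(sum((test_outputs[i] - vec[i]) ** 2 for i in available))
        scored.append((dist, label))
    scored.sort(key=lambda item: item[0])
    nearest = scored[:k]
    # Fraction of face samples (label 1) among the k nearest neighbors.
    return sum(label for _, label in nearest) / len(nearest)


def level_accepts(posterior: float, beta_j: float) -> bool:
    """A level passes the subwindow on when the posterior reaches beta_j."""
    return posterior >= beta_j


# Toy level with three weak classifiers; the third one is unavailable.
train_outputs = [
    [1.0, 1.0, 1.0], [1.0, 0.8, 1.0], [0.9, 1.0, -1.0],        # faces
    [-1.0, -1.0, 1.0], [-1.0, -0.9, -1.0], [-0.8, -1.0, 0.0],  # non-faces
]
train_labels = [1, 1, 1, 0, 0, 0]
posterior = knn_posterior([0.95, 0.9, 0.0], [0, 1], train_outputs, train_labels)
accepted = level_accepts(posterior, beta_j=0.5)
```

With $k=3$ the estimate can only take the values 0, 1/3, 2/3, and 1, so the choice of $\beta_j$ effectively selects how many of the three neighbors must be faces for the level to accept.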
6.5.4 The multiview system
In the previous sections, the pose of the faces was known. Here, a multiview system that can detect faces with different ROP angles is evaluated. The three detectors ${\mathcal{C}}^{22.5}$, ${\mathcal{C}}^{45}$, and ${\mathcal{C}}^{67.5}$ are combined to obtain the multiview system following the principle of Section 5.2. Unavailable weak classifiers are handled with a McCascade. In Figures 23 and 24, this detector is denoted 'MaMcCascade multiview'. Its performance is close to that of the pose-specific detectors (denoted MaMcCascade on each curve).
6.6 Computation time
Mean detection time on faces turned 45°
Classifier  Mean time (ms)  Minimum time (ms)  Maximum time (ms)
Cascade  234 ± 46  196  593 
MaMcCascade  296 ± 96  201  663 
Mean detection time on faces occluded by a scarf
Classifier  Mean time (ms)  Minimum time (ms)  Maximum time (ms)
Cascade  375 ± 43  272  610 
McCascade + evidence  468 ± 63  335  758 
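The cost of the probabilistic formulation can be read from the two tables as the relative increase in mean detection time. The short calculation below uses only the mean values reported above; the helper name `relative_overhead` is introduced here for illustration.

```python
def relative_overhead(baseline_ms: float, modified_ms: float) -> float:
    """Relative increase in mean detection time over the plain cascade."""
    return (modified_ms - baseline_ms) / baseline_ms


# Mean detection times (ms) taken from the two tables above.
turned_45 = relative_overhead(234, 296)  # Cascade vs. MaMcCascade
scarf = relative_overhead(375, 468)      # Cascade vs. McCascade + evidence

print(f"faces turned 45 degrees: {turned_45:.1%}")  # 26.5%
print(f"faces occluded by a scarf: {scarf:.1%}")    # 24.8%
```

In both settings the mean detection time grows by roughly a quarter compared with the plain cascade.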
7 Conclusions
We have presented a solution for handling missing weak classifiers in a boosted cascade. Our method relies on a probabilistic formulation of the cascade structure and on the computation of a posterior probability on each level. To make a decision on each level, thresholds have been introduced and are fixed through an iterative procedure that minimizes a cost function. All aspects of the proposed solution have been tested. Moreover, the method has been successfully applied to two specific applications: detecting occluded faces and detecting faces in a pose different from the one learned. The experiments on occluded faces and on turned faces also revealed limitations of the proposed solution, which are due to performance differences between weak classifiers. On the other hand, the main advantage of the proposed method is that it only uses an existing face classifier; no additional training is needed to detect occluded faces or faces in another pose. Future work will focus on the method's limitation on occluded faces. During the experiments on occluded faces, we noticed that the proposed solution can fail on some occlusion types because the learned weak classifiers do not cover the face with the same performance. To alleviate this problem, we plan to modify the initial training by adding constraints on the weak classifier locations.
8 Consent
Declarations
Acknowledgements
We thank OSEO for supporting our work, which is part of the Biorafale project aimed at detecting and recognizing dangerous fans in football stadiums.
References
1. Huang C, Ai H, Li Y, Lao S: High-performance rotation invariant multiview face detection. Trans. Patt. Anal. Mach. Intell. 2007, 29(4):671–686.
2. Saar-Tsechansky M, Provost F: Handling missing values when applying classification models. J. Mach. Learn. Res. 2007, 8:1623–1657.
3. Smeraldi F, Defoin-Platel M, Saqi M: Handling missing features with boosting algorithms for protein-protein interaction prediction. In Data Integration in the Life Sciences, ed. by P Lambrix, G Kemp. Proceedings of the 7th International Conference, DILS 2010, Gothenburg, Sweden, August 25–27, 2010. Lecture Notes in Computer Science, vol 6254 (Springer, Berlin, 2010), pp. 132–147.
4. Globerson A, Roweis S: Nightmare at test time: robust learning by feature deletion. In Proceedings of the 23rd International Conference on Machine Learning (ICML '06), June 2006. New York: ACM; 2006:353–360.
5. Dekel O, Shamir O, Xiao L: Learning to classify with missing and corrupted features. Mach. Learn. 2008, 81:149–178.
6. Chen J, Shan S, Yang S, Chen X, Gao W: Modification of the adaboost-based detector for partially occluded faces. 18th Int. Conf. Pattern Recognit. 2006, 2:516–519.
7. Lin YY, Liu TL, Fuh CS: Fast object detection with occlusions. Eur. Conf. Comput. Vis. 2004, 3021:402–413.
8. Schapire RE: The strength of weak learnability. Mach. Learn. 1990, 5(2):197–227.
9. Viola P, Jones M: Rapid object detection using a boosted cascade of simple features. Conf. Comput. Vis. Pattern Recognit. 2001, 1:511–518.
10. Friedman J, Hastie T, Tibshirani R: Additive logistic regression: a statistical view of boosting. Ann. Statist. 2000, 28:337–407.
11. Lefakis L, Fleuret F: Joint cascade optimization using a product of boosted classifiers. Adv. Neural Inf. Process. Syst. 2010, 23:1315–1323.
12. Lin YY, Liu TL: Robust face detection with multi-class boosting. Conf. Comput. Vis. Pattern Recognit. 2005, 1:680–687.
13. Schneiderman H, Kanade T: A statistical method for 3D object detection applied to faces and cars. Conf. Comput. Vis. Pattern Recognit. 2000, 1:746–751.
14. Huang C, Ai H, Wu B, Lao S: Boosting nested cascade detector for multi-view face detection. Int. Conf. Pattern Recognit. 2004, 2:415–418.
15. Tuzel O, Porikli F, Meer P: Human detection via classification on Riemannian manifolds. In IEEE Conference on Computer Vision and Pattern Recognition, 17–22 June 2007. Piscataway: IEEE; 2007:1–8.
16. Tuzel O, Porikli F, Meer P: Region covariance: a fast descriptor for detection and classification. In Proceedings of the 19th European Conference on Computer Vision, May 2006. Berlin: Springer-Verlag; 2006:589–600.
17. Huang GB, Ramesh M, Berg T, Learned-Miller E: Labeled faces in the wild: a database for studying face recognition in unconstrained environments. Technical Report 07-49, University of Massachusetts (2007).
18. Muja M, Lowe DG: Fast approximate nearest neighbors with automatic algorithm configuration. In VISAPP International Conference on Computer Vision Theory and Applications (VISAPP'09), Lisbon, 5–8 February 2009. Setubal: INSTICC Press; 2009:331–340.
19. Rowley HA, Baluja S, Kanade T: Neural network-based face detection. Trans. Patt. Anal. Mach. Intell. 1998, 20:23–38. 10.1109/34.655647
20. Yao J, Odobez JM: Fast human detection from joint appearance and foreground feature subset covariances. Comput. Vis. Image Understanding 2011, 115:1414–1426. 10.1016/j.cviu.2011.06.002
21. Martinez AM, Benavente R: The AR face database. Technical Report 24, The Ohio State University (1998).
22. Jones M, Viola P: Fast multi-view face detection. Technical Report 96, Mitsubishi Electric Research Laboratories (2003).
23. Lienhart R, Kuranov A, Pisarevsky V: Empirical analysis of detection cascades of boosted classifiers for rapid object detection. Pattern Recognit 2002, 2781:297–304.
24. Phillips PJ, Moon H, Rizvi SA, Rauss PJ: The FERET evaluation methodology for face recognition algorithms. Trans. Patt. Anal. Mach. Intell. 2000, 22(10):1090–1104. 10.1109/34.879790
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.