Robust head pose estimation based on key frames for human-machine interaction

Humans can interact with several kinds of machines (motor vehicles, robots, among others) in different ways. One way is through their head pose. In this work, we propose a head pose estimation framework that combines 2D and 3D cues using the concept of key frames (KFs). KFs are a set of frames learned automatically offline that consist of the following: 2D features, encoded through Speeded Up Robust Feature (SURF) descriptors; 3D information, captured by Fast Point Feature Histogram (FPFH) descriptors; and the target's head orientation (pose) in real-world coordinates, which is represented through a 3D facial model. Then, the KF information is reinforced through a global optimization process that minimizes error in a way similar to bundle adjustment. The KFs allow us to formulate, in an online process, a hypothesis of the head pose in new images that is then refined through an optimization process, performed by the iterative closest point (ICP) algorithm. This KF-based framework can handle partial occlusions and extreme rotations even with noisy depth data, improving the accuracy of pose estimation and detection rate. We evaluate the proposal using two public benchmarks in the state of the art: (1) BIWI Kinect Head Pose Database and (2) ICT 3D HeadPose Database. In addition, we evaluate this framework with a small but challenging dataset of our own authorship where the targets perform more complex behaviors than those in the aforementioned public datasets. We show how our approach outperforms relevant state-of-the-art proposals on all these datasets.


Introduction
The head pose provides rich information about the emotional state, behavior, and intentionality of a person. This knowledge is useful in several areas such as human-machine interaction [1], augmented reality [2,3], expression recognition [4], and driver assistance [5], among others.
The task of correctly estimating the head pose with non-invasive systems might seem easy, and many current devices (smartphones or webcams) can detect human faces from videos or images in real time. These are adequate for recreational use, but they cannot handle all the difficulties in head pose estimation (HPE) such as (self-)occlusion, extreme head poses, facial expressions, and fast movements.
The driver assistance scenario is a particular case where the user may exhibit complex behaviors such as moving toward or away from the steering wheel (zooming in/out), a wide range of head rotation, and fast movements. Here, the pose can verify whether the user pays attention to the road, allowing an autonomous system to assist the driver when necessary. Therefore, HPE algorithms should provide fast and robust information because missed detections or spurious estimates can lead to accidents.
Usually, HPE proposals [6-8] rely on RGB images to find specific 2D facial features, such as eyes, eyebrows, mouth, or nose. These heterogeneous features provide accurate estimations, but they are not available all the time, e.g., when working with blurry images or under lighting changes. Depth-based approaches, e.g., Fanelli et al. [4], can overcome some of the limitations of the 2D estimation, allowing a better 3D HPE. Both methodologies perform well when the target's face is nearly frontal, but as mentioned above, this assumption cannot be guaranteed. Some applications use 3D models [9,10] to retrieve the pose because they also provide semantic information, e.g., gaze estimation and facial expression.
We propose a framework that takes the best features of the aforementioned methodologies, combining 2D and 3D cues with a rigid 3D face model. It can handle challenging situations, such as large head poses, with a high detection rate and good accuracy for a wide range of orientations. Our approach follows an efficient key frame (KF) methodology with an offline learning phase and an online pose estimation step. Our fast and non-invasive offline step learns the target's appearance and pose using an RGB-D sensor, in such a way that it creates a set of key frames (KFs) for that specific person, see Fig. 1.
*Correspondence: lerasle@laas.fr. 1 CNRS, LAAS, 7 avenue du Colonel Roche, F-31400, Toulouse, France. 2 Univ. de Toulouse, UPS, LAAS, F-31400, Toulouse, France.
The KFs could be spurious or inaccurate; therefore, we propose a global optimization process based on bundle adjustment that improves the set of KFs and updates the 3D face model to better fit the target. This information is later used to estimate an accurate pose in the online step.
This process could be seen as a disadvantage because it needs to learn KFs for each new user, but our proposal incorporates an automatic learning system that only requires the user to perform simple movements for a short time before launching the online step. In several contexts, we can afford to perform this initialization stage. This is the case for driving assistance, where learning could be done while the vehicle is stopped. Moreover, we might even suppose that the offline process is a precondition for starting the vehicle, which allows verifying in advance whether the user is in good condition to drive.
We show how this key frame-based proposal provides competitive results to those in the state of the art. We evaluate our approach using the following: (i) the standard benchmark BIWI Kinect Head Pose Database [4], (ii) ICT 3D HeadPose Database, and (iii) our own dataset recorded with a Microsoft Kinect v1.
The BIWI and ICT-3DHP datasets are, in the literature, standard benchmarks for evaluating head pose detectors, with more than 240 and 200 citations, respectively [4,10-12]. In them, each target is recorded with a neutral expression, rotating the head at a slow-medium speed. However, these datasets do not represent the complex and challenging movements that a human could make. Therefore, we develop our own dataset where the targets perform more natural movements, like those expected in real scenarios. It consists of four sequences where targets show complex behaviors, such as rapid head movements, self-occlusion, and facial expression, among others. Although we evaluate several datasets, all the examples shown in this paper use images from our "ICU" dataset to describe the different steps of our proposal. Thanks to quantitative evaluations of these challenging sequences, we demonstrate that our monocular RGB-D-based approach offers results competitive with current approaches in the state of the art. The main contributions of this paper are as follows:

1. A key frame-based framework, with state-of-the-art accuracy, that consists of an original offline process with an automatic learning step with global consistency, a KF optimization step based on error propagation, and a 3D face model updating methodology. All the above learned information is considered during an online head pose estimation with a formulation that takes into account the descriptors, surface normals, and self-occlusion.
2. A new dataset exhibiting more complex behaviors than those present in the aforementioned datasets.
This paper has the following structure: We present the related work in Section 2. The formulation of our methodology for pose detection is given in Section 3. Section 4 presents the quantitative and qualitative results, including a discussion where we compare our framework with two other approaches in the state of the art. Last, Section 5 describes conclusions and future work.

Related works
In the fields of mobile robotics and computer vision, there are works focused on monocular systems for HPE, e.g., [13,14], that can be categorized according to the cue used. Hereafter, we mention a few of the most relevant ones.

RGB-based approaches
Some approaches tackle the HPE problem by using 2D deformable models that can approximate the human face shape [15,16]. In [6], Kazemi and Sullivan propose a fast face alignment framework based on a random forest where each regression tree is learned by a gradient boosting-based loss function. This methodology can detect multiple faces with high accuracy at a speed of 1 ms per image, even with complex expressions (strong facial deformations) or small head rotations.
Other proposals seek specific facial features, i.e., eyes and nose, among others. Valenti et al. [8] learn the location of the eyes from a set of training images, and assuming that the head follows a geometrical shape, those are projected in a cylinder. This person-specific model is used then for detecting and tracking the target. Barros et al. [7] follow a similar strategy, but including motion information from optical flow to reinforce the estimation. Drouard et al. [17] propose a learning method based on histogram of oriented gradients (HoG). HoG features are mapped (through a Gaussian locally linear model) onto the head pose space, which is then used to predict a new head orientation. Chen et al. [18] achieve good results with RGB images of low resolution using a Support Vector Regression (SVR) classifier trained with a gradient-based feature. All these methods combine the information in a single model and achieve state-of-the-art (SoA) results when sufficient training data is provided.
Learning the appearance of a person with a single shot is not always possible due to problems such as changes in lighting or occlusions. Therefore, several works rely on the information coming from a set of relevant frames, called key frames (KFs) [19]. In [20], Vacchetti et al. propose a KF-based method that detects and estimates the 3D pose of static rigid objects using only RGB images. Each KF consists of a set of key points and a 3D model, projected to image plane using camera calibration. The proposal provides SoA results by considering both 2D-3D key frame matching and 2D-2D temporal matching. The work of Kim et al. [21] exploits the idea of KF for pose estimation and tracking of multiple 3D objects from 2D information. The methodology can obtain results in real time, i.e., 40 objects within 6 to 25 ms per frame. In the last two proposals, the camera is moving while the target remains static. Nevertheless, these methods can work in the opposite way, i.e., static camera with moving targets. Morency et al. [22] propose a generalized adaptive view-based appearance model (extension of the AVAM algorithm of [23]) that estimates the head pose for a specific image region. The final pose is inferred by merging the results of (1) a referential frame, (2) tracking between current and previous frame, and (3) matching against a KF.
A more recent method [24] uses deep learning to train a convolutional neural network (CNN) using RGB images. The results are provided in real time and can handle challenging issues such as different light conditions. The 2D-based proposals perform well with nearly frontal views, but they have difficulty estimating an accurate head pose due to problems such as large poses, (self-)occlusions, and changes in lighting. In such situations, the depth cue is more effective.

Depth-based approaches
Many current SoA methods are based on the depth cue because 3D information provides the shape of the head in a more distinctive way [4,12,25].
In [25], the authors use the depth image to tackle some of the problems of pose estimation, such as partial occlusion and head orientation variations. The proposal rotates a generic 3D human face model, and each rotation is transformed into a depth image, which is later used in the alignment process. This offline-learned set is compared to the input depth frame, and the best match provides the pose hypothesis. It achieves real-time results thanks to a framework based on graphics processing units (GPUs).
Fanelli et al. [4] train a random regression forest that detects poses in real time through nose tip detection. The training data is generated in a similar way as in [25], using a 3D face model set with several orientations. Each leaf of the regression tree votes for a possible nose position.
Papazov et al. [12] propose a new 3D invariant descriptor that encodes facial landmarks. The descriptors are learned in an offline training phase using a group of high-resolution meshes with triangular paths. A CNN is used in [26] to estimate the head pose from pure depth data by means of a Siamese network (a pair of CNNs), achieving highly accurate results in real time.

RGB-D-based approaches
The combination of color and depth cues has shown high performance in challenging situations. In the work of [11], the pose is inferred by fitting a morphable 3D model on the target represented by a 3D point cloud. The model is learned for a specific person in an offline training step. Saeed and Al-Hamadi [27] use HoG features, extracted from both RGB and D cues, to train a classifier based on a support vector machine (SVM). In [28], the authors present a similar method that combines 2D and 3D HoG features but trains a multi-layer perceptron classifier instead. In [29], the authors present an improvement to the constrained local model by including 3D information. Then, they train some SVM classifiers and logistic regressors using probabilistic features.
Some works enhance classical methods by including depth information. This is the case with [30], in which the authors use the depth cue in a visual odometry technique.
Smolyanskiy et al. [31] add a depth-based constraint to an active appearance model fitting. However, this approach suffers from drift problems, where the final model is not well aligned with the target's 3D position. Other proposals combine depth and color cues using a random forest [32]. Here, tensor-based regressors make it possible to model large variations of head orientation.
In [10], Li et al. propose a method based on an energy minimization function that optimizes the distance between a 3D point cloud (current frame) and a rigid template model of the human face. The optimization is carried out using the ICP algorithm, and the color cue is used in two ways: (1) to detect 2D facial landmarks, using the method of Viola and Jones [33], and (2) to remove outliers, using a k-means clustering algorithm. The detected landmarks, i.e., the eyes, are projected to the 3D world through the depth image and included in the energy function as a weight factor, which increases the accuracy and convergence speed of ICP. On the other hand, k-means separates relevant 3D points (i.e., those belonging to the face) from the spurious ones (i.e., clutter). The face model is updated online in a parallel process using only the depth cue, which allows it to adapt to different kinds of faces. The proposal relies on the work of Fanelli et al. [4] to reinitialize the approach because ICP requires more time to infer a face pose from an initial position than from the previous frame, whereas the approach of Fanelli et al. finds a face faster but with less precision. Yu et al. [9] propose a similar method that instead learns a 360° 3D morphable model and includes a motion cue, based on optical flow, in the ICP optimization process.

Descriptors
Descriptors encode important information about the visual characteristics of the objects present in images [34], such as appearance [35,36], motion [37], or geometry [38]. Therefore, they have been used in multiple contexts. Yan et al. [39] propose a FAST-like descriptor which considers the orientation of image intensity. Alioua et al. [35] propose a 2D head pose estimation framework using a combination of classic descriptors, e.g., HoG, SURF, and Haar. Yan et al. [36] use two CNN features to model the global and local appearance of the target and a 3D CNN which encodes the motion. The computational cost of some descriptors can be high, especially for those based on deep learning [36], even when using parallelization methods [37]. Therefore, we rely on robust features with a fair computational cost.

Synthesis
The aforementioned proposals have some qualities that adapt well to specific scenarios. To mention some outstanding methods, we have the following: Kazemi and Sullivan [6], who use an RGB-based method with fast estimation and high accuracy in frontal view; the proposal of Fanelli et al. [4], which relies on depth information and provides a good detection rate; and Li et al. [10], who can achieve accurate results for head poses with large rotation. A combination of these (or more) methods could face the challenges of estimating the head pose, but a direct combination could not generate results in real time.
Finally, there are some datasets to evaluate the performance of HPE algorithms, such as the BIWI dataset [4] and the ICT-3DHP dataset [29], that are the standard benchmarks used in several relevant papers [4,9-12]. They consist of multiple sequences, each with a different person, where the target has a neutral expression, with slow-medium speed head rotation and (mostly) remaining in the same position.
From above, we can summarize our contributions as follows:
1. A robust HPE algorithm based on KFs that combines 3D geometry information (point cloud), appearance, and shape (encoded through SURF and FPFH descriptors), exploiting all RGB-D channels.
2. A double mechanism consisting of (1) an offline learning phase that exploits the complementarity of the aforementioned techniques to create a person-specific set of KFs and (2) an online framework based on KFs and ICP that estimates the head pose robustly and in real time.
3. A bundle adjustment process that improves the accuracy, in terms of performance and CPU cost, of the learned KFs so that they are mutually consistent.
4. An online update of both the KFs and the 3D face model.
5. A new dataset with more challenging behaviors and situations than those in the literature, consisting of four sequences with a ground truth generated from a motion capture (MoCap) system. It includes rapid head movements, facial deformation, self-occlusions, and position displacement, among others.
6. A rigorous and large-scale evaluation and comparison with relevant existing approaches in the state of the art.

Method
Our key frame-based approach is inspired by works such as [20,22,25] but targets the application context of HPE for human-machine interaction, i.e., human HPE instead of static objects, considering both appearance and depth cues with a partial 3D face model. Each KF consists of a set of 3D appearance features (SURF descriptors projected to the 3D world through the depth image), 3D-based features, and an approximate head pose, represented with a 3D template model. We first describe the contents of each key frame and then show how they are learned consistently and subsequently used in a pose estimation system.

3D face model
A 3D morphable face model (3DMFM) is a shape representation of a human face that can be used to provide accurate estimations for most head poses. A face model $M$ is a set of 3D vertices (points) created as a linear combination of a mean shape $\mu$ with a weighted deformation basis $DB$ as follows:
$$M = \mu + \sum_{i} \omega_i \, \gamma_i \, DB_i \qquad (1)$$
Here, $\gamma_i$ and $DB_i$ are the eigenvalue and eigenvector, respectively, learned from a set of 3D scans, and $\omega_i$ is the weight of the $i$-th deformation. In our approach, we use the Basel Face Model (BFM) [40], which has learned the $DB$ values from the 3D face scans of 200 subjects, each with different age, gender, height, and width. Traditionally, 3DMFM fitting is an offline optimization step that finds the $\omega_i$ values through the minimization of the distance between one (or more) 3D frame(s) and the model. This creates a model with a facial shape similar to a specific person, e.g., [9,10].
Our offline key frame learning step uses a generic human face model $M$ with average characteristics, i.e., age, weight, and gender. This model fits well in most cases, but it must be updated in order to fit some facial structures. Section 3.2.3 describes an efficient optimization scheme that does not rely on calculating $\omega$ of Eq. 1 like other methods, but on an error propagation-based approach inspired by bundle adjustment.
Even with a well-fitted model, some HPE algorithms have problems handling face deformations such as mouth movements or facial expressions. This is a common situation when a person is speaking with someone else or reacting to external stimuli, e.g., music and other people's movements, to mention a few. We take this into consideration and create a partial model with only the part between the nasal base and the forehead. This region does not deform much and provides results as accurate as more complete models.
In any case, we use Eq. 1 to build a partial face model. An example of the model is shown in Fig. 1, represented as the output of the red block.
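As a minimal sketch of how such a model could be assembled and cropped, assume that the linear combination takes the form M = μ + Σ_i ω_i γ_i DB_i, i.e., each eigenvector is scaled by its eigenvalue and a per-person weight (this form is our reading of the text, not a confirmed formula). The function names, the toy vertices, and the crop bounds along the y axis are hypothetical and only serve the illustration; they are not taken from the BFM.

```python
def build_face_model(mean_shape, basis, eigenvalues, weights):
    """Return M = mu + sum_i w_i * gamma_i * DB_i as a list of (x, y, z) vertices.

    ASSUMPTION: this linear form is our interpretation of the morphable model.
    """
    model = [list(v) for v in mean_shape]
    for w, gamma, component in zip(weights, eigenvalues, basis):
        for vertex, offset in zip(model, component):
            for axis in range(3):
                vertex[axis] += w * gamma * offset[axis]
    return [tuple(v) for v in model]


def crop_partial_model(model, y_min, y_max):
    """Keep only vertices whose y coordinate lies in [y_min, y_max].

    Hypothetical stand-in for keeping the nasal-base-to-forehead region.
    """
    return [v for v in model if y_min <= v[1] <= y_max]


# Toy data (hypothetical): 3 vertices and 1 deformation component.
mean_shape = [(0.0, -0.05, 0.0), (0.0, 0.00, 0.02), (0.0, 0.06, 0.0)]
basis = [[(0.0, 0.0, 0.0), (0.0, 0.0, 1.0), (0.0, 0.0, 0.0)]]
eigenvalues = [0.01]

generic = build_face_model(mean_shape, basis, eigenvalues, weights=[0.0])
adapted = build_face_model(mean_shape, basis, eigenvalues, weights=[2.0])
partial = crop_partial_model(generic, y_min=-0.01, y_max=0.10)
```

With all weights set to zero the sketch returns the mean shape, which corresponds to the generic model with average characteristics used in the offline step.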

Face descriptors
Our proposal is based on natural facial landmarks encoded through SURF descriptors, which provide features invariant to rotation and scale, and Fast Point Feature Histogram (FPFH) descriptors, which include 3D information invariant to illumination changes. These descriptors enhance the robustness of the HPE and increase both the accuracy and the range of detectable orientations.
SURF descriptors
SURF is a robust and reliable descriptor that has shown good performance in several topics such as SLAM, camera pose estimation, and image registration. In the context of HPE, SURF describes a specific person's face in a general way, avoiding the need to search for specific features (e.g., eyes, nose). Therefore, any relevant characteristic is taken into account, regardless of its origin, e.g., beard, mustache, glasses, or other. In addition, these descriptors are invariant to scale and rotation, which makes it possible to detect non-static targets, e.g., drivers moving around in the cockpit and people interacting with robots, among others. We use SURF in a similar way as in image registration: we calculate a set of $\eta_\alpha$ interest points in the foreground of the image plane using the good features to track algorithm. Since each RGB pixel has an associated depth value, we define the background as any point farther than a threshold $th_a$. Thereby, we have a set of features $f^\alpha_j$ with their respective 3D positions $p^\alpha_j = \{x, y, z\}$ as follows:
$$d^\alpha_j = \{ f^\alpha_j, \; p^\alpha_j \} \qquad (2)$$
From Eq. 2, we have a descriptor that encodes the appearance of a specific person in the 3D world, and by grouping them, we get the set:
$$D^\alpha = \{ d^\alpha_1, \ldots, d^\alpha_{\eta_\alpha} \}$$
In practice, the parameters used in SURF yield $\eta_\alpha \approx 100$ to $200$ descriptors. SURF descriptors are robust in cases with small luminosity changes and flat objects, and in our problem, they have proven to be useful for the pose estimation. However, certain changes of a 3D object, due to lighting or rotation, cannot be captured properly by these descriptors; therefore, we also use a shape descriptor that reinforces the estimation.
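The association between a 2D interest point and its 3D position can be sketched as a back-projection through the depth image. The example below is only an illustration under stated assumptions: it assumes a pinhole camera with intrinsics fx, fy, cx, cy and a depth map in metres indexed as depth[v][u], none of which are specified in the text, and the function name and the value of th_a are hypothetical. Feature extraction itself is left out.

```python
def backproject_keypoints(keypoints, depth, fx, fy, cx, cy, th_a):
    """Lift 2D interest points (u, v) to 3D using a depth map in metres.

    Points with no depth reading, or farther than th_a, are treated as
    background and discarded. Returns a list of ((u, v), (x, y, z)) pairs.
    ASSUMPTION: pinhole camera model with intrinsics fx, fy, cx, cy.
    """
    lifted = []
    for u, v in keypoints:
        z = depth[v][u]
        if z <= 0.0 or z > th_a:
            continue  # invalid depth or background
        x = (u - cx) * z / fx
        y = (v - cy) * z / fy
        lifted.append(((u, v), (x, y, z)))
    return lifted


# Toy 3x3 depth map (hypothetical values, metres); 0.0 marks a missing reading.
depth = [
    [0.0, 3.0, 3.0],
    [0.8, 0.8, 3.0],
    [0.8, 0.9, 3.0],
]
keypoints = [(0, 0), (1, 1), (2, 2), (1, 2)]
features = backproject_keypoints(
    keypoints, depth, fx=2.0, fy=2.0, cx=1.0, cy=1.0, th_a=1.5
)
```

In this toy case only the two keypoints with a valid foreground depth survive; each surviving 3D position would then be paired with the descriptor computed at the same pixel.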

FPFH descriptors
Curvature estimates and surface normals are a basic representation of the geometry of an object, easy to compute and compare. However, they capture little detail, and many points share the same (or similar) feature values. 3D descriptors are an alternative: they summarize the object's geometry, taking the aforementioned features into account in an efficient manner.
The Fast Point Feature Histogram (FPFH) descriptor, proposed by Rusu et al. [38], captures the surface normal variations around a point, resulting in a high-dimensional signature that is invariant to the 6D pose (rotation and position) and robust against neighborhood noise. It is formulated as follows:
$$FPFH(p^\beta_j) = SPFH(p^\beta_j) + \frac{1}{|N_j|} \sum_{i \in N_j} \frac{1}{\kappa_i} \, SPFH(p_i)$$
where SPFH (Simplified Point Feature Histogram) computes the set of angular features of the PFH descriptor, $\kappa_i$ is the distance between $p^\beta_j$ and $p_i$, and $N_j$ is the set of neighboring points of $p^\beta_j$. We build the set of points to evaluate by considering (1) the 3D projection of the points computed by the good features to track method, in the same way as for SURF, and (2) a downsampling of the target point cloud. The 3D frame descriptors are formulated in a similar way as in the previous section:
$$D^\beta = \{ d^\beta_1, \ldots, d^\beta_{\eta_\beta} \}, \quad \text{where} \quad d^\beta_j = \{ f^\beta_j, \; p^\beta_j \}, \quad f^\beta_j = FPFH(p^\beta_j)$$
Finally, each KF contains these three elements: appearance and shape signatures and a 3D face model, together with the depth image. In practice, the number of descriptors is $\eta_\beta \approx 200$.
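The weighting scheme of FPFH, in which the SPFH of a point is combined with the distance-weighted SPFH values of its neighbors, can be illustrated with a short sketch. It assumes that the SPFH histograms and the neighbor lists are already available (computing the angular features is outside its scope) and that no neighbor coincides with the query point; the function name and the toy values are hypothetical.

```python
import math


def fpfh(index, points, spfh, neighbors):
    """FPFH(p_j) = SPFH(p_j) + (1/|N_j|) * sum_{i in N_j} SPFH(p_i) / kappa_i.

    points:    list of (x, y, z) tuples
    spfh:      list of precomputed SPFH histograms (ASSUMED given)
    neighbors: dict mapping a point index to the indices of its neighbors
    """
    own = spfh[index]
    ids = neighbors[index]
    if not ids:
        return list(own)
    acc = [0.0] * len(own)
    for i in ids:
        kappa = math.dist(points[index], points[i])  # distance to neighbor i
        for b, value in enumerate(spfh[i]):
            acc[b] += value / kappa
    return [o + a / len(ids) for o, a in zip(own, acc)]


# Toy example (hypothetical): three points with two-bin histograms.
points = [(0.0, 0.0, 0.0), (0.5, 0.0, 0.0), (0.0, 0.25, 0.0)]
spfh = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
neighbors = {0: [1, 2], 1: [0], 2: [0]}
signature = fpfh(0, points, spfh, neighbors)
```

Closer neighbors contribute more to the final signature because their histograms are divided by a smaller distance.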

Offline key frame learning
In this section, we describe how the KFs are learned from an RGB-D stream (see the workflow in Fig. 1). First, the target pose is roughly estimated using a robust but computationally expensive system based on three state-of-the-art methods (see Fig. 1). Only the most relevant frames, according to the quality of the estimated pose and the descriptors, are selected as key frames (yellow block). Finally, an optimization process (green block) improves the KF-estimated poses and suppresses spurious frames, i.e., those that are not consistent with any other.

Rough pose estimation
Some methods require the use of other algorithms for initialization or learning [9,10]. Our proposal requires a rough estimation of the pose, which is computed by combining three HPE systems that have a good accuracy/CPU-cost ratio: the 2D face detector of Kazemi and Sullivan [6], the depth-based method of Fanelli et al. [4], and the RGB-D-based method of Li et al. [10].
These proposals complement each other and provide a good first estimate on which we rely to create a more robust method. The proposal of Kazemi and Sullivan [6] is a fast facial feature detector and is part of a public library, DLib [41]. The method of Fanelli et al. [4] is the depth-based regression forest described in Section 2. The work of [10] consists of two independent parts (computed in parallel): (1) a head pose tracking framework based on ICP and (2) a 3D model update system. This method is based on facial features and cannot handle large head rotations well; thus, the accuracy decreases when the 2D face landmark detector fails. Therefore, we propose a simple but reliable 3D feature, see Fig. 2, that provides additional information for feature-based systems such as [10].
Let us assume $q_{t-1} = \{x, y, z\}$ as the 3D position of the nose tip estimated from the previous frame and $\theta_{t-1}$ as the head orientation (red sphere and blue line in Fig. 2b, respectively). Assuming a slow movement of the target, the next nose point $q_t$ should be close to the previous estimation; we can find this new nose by analyzing the neighborhood of $q_{t-1}$ in the current target's point cloud $\psi_t$:
$$N_t = \{\, p \in \psi_t : \lVert p - q_{t-1} \rVert < r \,\}$$
where $r = 0.2$ m is the search radius. In other words, $N_t$ contains the neighboring points of $q_{t-1}$, and one of those is a good candidate to be the next nose tip ($q_t \in N_t$), see the yellow area of Fig. 2d. From the previous pose estimation, we define the nose $\hat{q}_t$ as the furthest point of $N_t$ in the orientation $\theta_{t-1}$, using a function $\upsilon(\cdot)$ that computes the distance between a point $p$ and a line segment defined through $q_{t-1}$ and $\theta_{t-1}$. $\hat{q}_t$ is shown as the blue sphere in Fig. 2e.
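The nose search can be sketched as a radius query followed by a selection along the previous orientation. The example below is an illustration under two assumptions that the text does not confirm: the orientation is represented as a 3D direction vector, and "furthest in the orientation" is interpreted as the largest projection of a neighbor onto that direction (the υ(·) function itself is not reproduced). The function name and the toy point cloud are hypothetical.

```python
import math


def find_nose_candidate(cloud, q_prev, direction, r=0.2):
    """Among points within radius r of q_prev, return the one that lies
    furthest along `direction` (largest projection onto the unit vector).

    ASSUMPTION: the projection criterion is our interpretation of
    'furthest point in the orientation'.
    """
    norm = math.sqrt(sum(c * c for c in direction))
    unit = [c / norm for c in direction]
    best, best_score = None, -math.inf
    for p in cloud:
        if math.dist(p, q_prev) >= r:
            continue  # outside the search radius
        score = sum((pc - qc) * uc for pc, qc, uc in zip(p, q_prev, unit))
        if score > best_score:
            best, best_score = p, score
    return best


# Toy cloud (hypothetical, metres): the head faces the camera along -z.
q_prev = (0.0, 0.0, 1.0)
direction = (0.0, 0.0, -1.0)
cloud = [
    (0.0, 0.0, 1.0),
    (0.01, 0.0, 0.97),   # slightly in front of the previous nose tip
    (0.05, 0.0, 1.02),
    (0.0, 0.0, 0.5),     # far in front, but outside the 0.2 m radius
]
nose = find_nose_candidate(cloud, q_prev, direction)
```

The radius test keeps the estimate close to the previous nose position, which matches the slow-movement assumption stated above.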
In [10], the authors include the 3D eye positions, detected with the Viola and Jones [33] algorithm and projected through the depth image, as a weighted factor in the ICP algorithm. We do the same with the nose feature $\hat{q}_t$: the correspondences between $\hat{q}_t$ and the 3D template model have a weight of 40, as indicated in [10], and the rest are set to 1. This process guides the template to zones with a high probability of being the target's face; Fig. 2f shows the final estimation.
This feature enhances the accuracy of the original proposal; thus, we use this nose-based framework in the KF learning. Like other person-specific methods [11], we must learn the appearance of each new target, but the process is worth it because, as detailed below, it improves the accuracy of the estimations.

Automatic frame selection
In some application contexts, e.g., driver assistance, we can safely take some time to perform the KF learning before starting the vehicle. Here, robust estimates of the head pose are essential because inaccurate or missed detections can cause accidents. This could be difficult to achieve because the target behavior is sometimes complex, with random or abrupt movements. We develop our proposal considering that the KFs can handle these scenarios well, providing high-quality results. Therefore, we consider that it is justifiable to take a little time in order to learn a robust person-specific set of KFs.
First, we estimate the rough pose as described in Sec. 3.2.1, where each of the methods (Kazemi, Fanelli, and Li) proposes an HPE $P_* = \{q_*, \theta_*\}$, where $q = \{x, y, z\}$ is the nose location and $\theta$ is the head orientation. Thus, we have at frame $t$ three pose estimation candidates $C_t = \{P_{Kazemi}, P_{Li}, P_{Fanelli}\}$. In the best-case scenario, all the methods converge to a similar point, i.e., the mean of the three poses $\bar{P}_t = \{\bar{q}_t, \bar{\theta}_t\}$ has a small variance $Var(C_t)$. If this is the case, we add $\bar{P}_t$ to the set of key frame poses $S^{KF}$. Otherwise, we select a pose according to the qualities of the methods. Kazemi is highly accurate with frontal view targets, Fanelli can detect poses even with rapid motion, and Li works better with heads that exhibit a large orientation (looking to the right/left, full profile). Therefore, we privilege these techniques according to each situation. In this selection rule, $th_d = 5$ cm and $th_\theta = 45°$ are the position and orientation thresholds, $\theta_o$ is the angle between the camera origin and the target pose, and $th_v = 0.5$ is the variance threshold. We define $s = \lVert P_t - P^{KF}_{t-1} \rVert$ as the angular speed between two consecutive pose estimations, with $th_s = 1$ rad/s as the speed threshold.
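The selection among the three candidates can be illustrated with the following sketch. It is not the rule used in this work: the order of the tests, the spread measure standing in for Var(C_t), and the representation of a pose as a 6-tuple (x, y, z, yaw, pitch, roll) are assumptions made for the example, and the position threshold th_d is not used. Only the threshold values th_v, th_θ, and th_s come from the text.

```python
from statistics import pvariance


def mean_pose(candidates):
    """Component-wise mean of the candidate poses."""
    return tuple(sum(values) / len(values) for values in zip(*candidates))


def pose_spread(candidates):
    """Average per-component population variance.

    ASSUMPTION: a simple stand-in for Var(C_t); the actual measure may differ.
    """
    per_component = [pvariance(values) for values in zip(*candidates)]
    return sum(per_component) / len(per_component)


def select_pose(kazemi, fanelli, li, speed, view_angle,
                th_v=0.5, th_theta=45.0, th_s=1.0):
    """Return (label, pose). The order of the fallback tests is hypothetical."""
    candidates = [kazemi, fanelli, li]
    if pose_spread(candidates) < th_v:
        return "mean", mean_pose(candidates)   # all methods agree
    if speed > th_s:
        return "fanelli", fanelli              # rapid motion
    if abs(view_angle) < th_theta:
        return "kazemi", kazemi                # nearly frontal view
    return "li", li                            # large rotation


# Toy poses (hypothetical): metres for position, degrees for orientation.
agree = select_pose((0.0, 0.0, 1.0, 10.0, 0.0, 0.0),
                    (0.0, 0.0, 1.0, 11.0, 0.0, 0.0),
                    (0.0, 0.0, 1.0, 9.0, 0.0, 0.0),
                    speed=0.2, view_angle=10.0)
profile = select_pose((0.0, 0.0, 1.0, 10.0, 0.0, 0.0),
                      (0.0, 0.0, 1.0, 60.0, 0.0, 0.0),
                      (0.0, 0.0, 1.0, 70.0, 0.0, 0.0),
                      speed=0.2, view_angle=65.0)
```

A production version would need a proper treatment of angles (wrap-around, consistent units) before averaging orientations; the sketch ignores this for brevity.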
Descriptor computation
So far, the descriptors $D^\alpha$ and $D^\beta$ are calculated in the foreground and, therefore, may include irrelevant non-face features. To remove spurious information, we simply rely on the rough estimate $P^{KF}_t$ that defines the position of the 3D face model. We use this knowledge to filter out the points far from the template. Let us assume $q_M$ as the nose position of the model and $L_2(d, p)$ as the Euclidean distance ($L_2$ norm) between 3D points. Then, we filter the points according to a threshold $th_e$ as follows:
$$\hat{D} = \{\, d \in D : L_2(d, q_M) < th_e \,\}, \quad D \in \{D^\alpha, D^\beta\}$$
Frame selection
The accuracy of the estimation is related to the number of key frames. More KFs improve the results, but the computational cost also increases. We keep the number low by discretizing the orientation space through spherical coordinates in steps of 20°. An example is shown in Fig. 3, where a yellow polygon depicts the discretized orientation.
Once an estimate is close to the center of the discretized area, we keep the pose $P^{KF}_t$ and compute the descriptors $\hat{D}^{KF}_t = \{\hat{D}^\alpha, \hat{D}^\beta\}$ around it. We change the color of the visited areas to green in such a way that the user can observe the missing orientations (Fig. 3). Sometimes an area is visited more than once; in this case, we keep the best KF based on a fitting score (given by the pose estimation algorithms) and the number of descriptors. Finally, the KF set is defined as $S^{KF} = \{S^{KF}_1, \ldots, S^{KF}_K\}$, where each $S^{KF}_k$ stores the pose $P^{KF}_k$ and the descriptors $\hat{D}^{KF}_k$. In this learning process, the target should move the head at normal speed, performing only head rotations, as recorded in the BIWI and ICT-3DHP datasets. We consider around 30 to 40 KFs, covering most of the orientation space, and 100 SURF/FPFH descriptors. The set $S^{KF}$ can be used as it is; however, we can enhance the pose estimation of each KF by applying an optimization step.
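The bookkeeping of the frame selection can be sketched as a dictionary indexed by discretized orientation cells. The 20° step comes from the text; everything else is an assumption made for the illustration: orientations are given as yaw and pitch in degrees, a frame qualifies when it is within a hypothetical max_offset of the cell center, and a higher (fitting score, number of descriptors) pair is considered better.

```python
import math


def orientation_cell(yaw, pitch, step=20.0):
    """Index of the step-degree cell whose center is nearest to (yaw, pitch)."""
    return (math.floor(yaw / step + 0.5), math.floor(pitch / step + 0.5))


def update_key_frames(key_frames, yaw, pitch, score, n_descriptors, frame_id,
                      step=20.0, max_offset=5.0):
    """Store the frame as the KF of its cell if it is close to the cell center
    and better than the KF already stored there. Returns True when stored.

    ASSUMPTIONS: max_offset and the (score, n_descriptors) ordering are
    hypothetical choices for this sketch.
    """
    cell = orientation_cell(yaw, pitch, step)
    center = (cell[0] * step, cell[1] * step)
    if max(abs(yaw - center[0]), abs(pitch - center[1])) > max_offset:
        return False  # too far from the center of the discretized area
    quality = (score, n_descriptors)
    current = key_frames.get(cell)
    if current is not None and current["quality"] >= quality:
        return False  # the stored KF is at least as good
    key_frames[cell] = {"frame": frame_id, "pose": (yaw, pitch),
                        "quality": quality}
    return True


# Toy stream (hypothetical values).
key_frames = {}
results = [
    update_key_frames(key_frames, 2.0, -1.0, 0.8, 120, frame_id=0),
    update_key_frames(key_frames, 9.0, 0.0, 0.9, 150, frame_id=1),
    update_key_frames(key_frames, 1.0, 0.0, 0.9, 100, frame_id=2),
    update_key_frames(key_frames, 21.0, 0.0, 0.5, 90, frame_id=3),
]
```

The set of keys of the dictionary gives the visited areas, which is the information displayed to the user in green during learning.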

Key frame pose optimization
The KFs provide rich information about the pose and appearance of the target. An automatic learning method provides a good initial estimation, but small errors in the set of KFs limit the quality of new estimates. Moreover, the set could include spurious frames (inconsistent estimates), see the red circle in Fig. 4. We overcome those issues by applying an optimization process that provides a global and simultaneous consistency between all KFs and the 3D face model. To achieve this, we need to minimize the error between the 3D face model and all KFs. Let us assume $M$ as the template model in a reference position (origin of the 3D world with no rotation) and $K_k$ as the point cloud of the $k$-th KF. We need to process only the points corresponding to the face. This position is known from the estimated poses $P^{KF}_k$, and therefore, we filter the points $p$ of $K_k$, keeping only those within 20 cm of the pose estimation, i.e., $H_k = \{p \in K_k : L_2(p, p^{KF}_k) < 0.2\,\mathrm{m}\}$. Hence, the goal is to find the transformation parameters $\tau_k = \{R_k, t_k\}$ that minimize two aspects: (1) the local error between the paired points of the human face model $M$ and the KF point cloud $H_k$, and (2) the global error with respect to the rest of the KF facial point clouds $H_*$. This is achieved by minimizing a cost function (Eq. 10) in which $T(\cdot)$ applies the geometric transformation $\tau_*$ to a point $h$, $|\cdot|$ is the cardinality, and $\tau = \{\tau_1, \ldots, \tau_K\}$ is the set of all transformations. The variable $\lambda_i$ weights the contribution of the $i$-th KF ($H_i$) and is derived from the percentage of paired points between the face model $M$ and the $i$-th KF point cloud. Thus, $\lambda_i$ is close to zero when the number of paired points ($P_i$) is small, meaning that this KF is not a good match because it is misaligned or is a spurious frame.
At each iteration, we remove the KFs with a low weight ($\lambda_i < 0.25$) because we cannot guarantee that they are a real part of the face rather than point clouds coming from bad estimates.
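The face-point filter and the weight-based pruning can be illustrated with the sketch below. The 0.2 m radius and the 0.25 threshold come from the text; everything else is an assumption made for illustration, in particular the pairing rule (a model point counts as paired when some cloud point lies within `max_pair_dist`), the choice of the model size as the denominator of the weight, and the function names.

```python
import math

FACE_RADIUS_M = 0.2   # radius around the estimated head position (from the text)
MIN_WEIGHT = 0.25     # KFs with a lower weight are discarded (from the text)


def face_points(cloud, head_position, radius=FACE_RADIUS_M):
    """Keep only the 3D points closer than `radius` to the estimated head position."""
    return [p for p in cloud if math.dist(p, head_position) < radius]


def pairing_weight(model, cloud, max_pair_dist=0.01):
    """Fraction of model points that have a cloud point within max_pair_dist.

    Assumption: the model is already expressed in the same frame as the cloud,
    and the weight is normalized by the number of model points.
    """
    if not model:
        return 0.0
    paired = sum(
        1 for m in model
        if any(math.dist(m, h) <= max_pair_dist for h in cloud)
    )
    return paired / len(model)


def prune(key_frame_clouds, model):
    """Drop the key frames whose pairing weight falls below MIN_WEIGHT."""
    return {
        k: cloud for k, cloud in key_frame_clouds.items()
        if pairing_weight(model, cloud) >= MIN_WEIGHT
    }
```

The brute-force nearest-point search is quadratic and is used here only to keep the example short; a spatial index would be the usual choice for real point clouds.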
We optimize Eq. 10 following an iterative scheme such as ICP. First, we select a KF k and perform the optimization, and we repeat this process with the rest until convergence. Figure 4 shows the KFs (projected to a reference frame) before and after optimization, from which the target's face can be seen more clearly. Finally, we recalculate the poses and filter 3D points of the model.
The resulting point cloud $\hat{H}$ is treated as scattered data, and through an interpolation algorithm based on Delaunay triangulation [42], we create a mesh $F = \mathrm{Delaunay}(\hat{H})$ that describes the facial surface of the target. Then, the new model $\hat{M}$ is estimated from the paired vertices $(M_i, F_i)$ by minimizing a cost function in which $N_i$ are the neighboring vertices of $\hat{M}_i$ and $\gamma$ weights the similarity to the original model. This update moves the points of $M$ toward $F$, allowing the generic template to evolve into a model $\hat{M}$ that is more similar to the target.

Online head pose estimation
In this section, we present our original framework, which exploits the characteristics of the KFs, in contrast to other existing approaches. We have a set $S^{KF}$ with appearance and shape descriptors (associated with 3D points) and a robust pose estimation. As mentioned above, descriptors are computed only on the area around the 3D model, so we have $\hat{D}_k = \{d_1, \ldots, d_{\eta_k}\}$ for each KF $k$. We apply a similar process to the current frame.

Pose estimation
Initialization For a new frame $t$, we first compute the descriptors following the steps described in Section 3.1.2, sampling over the whole foreground image. Although in some cases it may not be necessary to use both types of descriptors, using both compensates for the weaknesses of each; for example, drastic changes in lighting affect SURF. The best KF is then found by minimizing (arg min over the KFs) a matching cost in which dist computes the distance between two features and $\rho_k$ is the number of correspondences. After this optimization, we set the selected $k$-th KF as the best candidate for the current frame $t$, i.e., $S^{KF}_b = S^{KF}_k$. Finding the best KF is a time-consuming process, but our proposal achieves real-time results by considering the previous estimation. We first evaluate the KFs close to the last estimated pose, and we accept one as the best frame if the number of correspondences is sufficient (i.e., > 20). This selection considerably reduces the computational cost.
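The neighbor-first search can be sketched as below. This is a simplified illustration under stated assumptions: the acceptance test uses only the number of correspondences (the text's matching cost also involves feature distances), orientations are compared with a plain Euclidean gap in (yaw, pitch), and `select_key_frame` and `count_matches` are hypothetical names.

```python
import math

MIN_MATCHES = 20  # a KF is accepted when it yields more correspondences than this


def angular_gap(a, b):
    """Euclidean gap between two (yaw, pitch) orientations, in degrees."""
    return math.hypot(a[0] - b[0], a[1] - b[1])


def select_key_frame(kf_orientations, last_orientation, count_matches):
    """Return the first KF, ordered by proximity to the last pose, with enough matches.

    kf_orientations: dict mapping a KF id to its (yaw, pitch) orientation.
    count_matches:   callable mapping a KF id to its number of correspondences
                     with the current frame (the expensive step).
    Returns None when no KF qualifies, so the caller can apply a fallback.
    """
    ordered = sorted(
        kf_orientations,
        key=lambda k: angular_gap(kf_orientations[k], last_orientation),
    )
    for kf_id in ordered:
        if count_matches(kf_id) > MIN_MATCHES:
            return kf_id  # early exit: the remaining KFs are never matched
    return None
```

Because the loop stops at the first acceptable KF, the expensive matching step usually runs for only one or two KFs when the head moves smoothly between frames.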

Key frame selection
Nevertheless, the correspondences between $D_t$ and $\hat{D}_b$ could be inconsistent due to the symmetry of the face (e.g., the eyes) or to matches between different parts with similar appearance (e.g., mustache and eyebrow). Coherent matches must share similar geometrical characteristics such as distance and orientation in 3D coordinates.
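One simple way to test such geometric coherence is to reject matches whose 3D displacement is atypical compared with the other matches. The sketch below is an assumption-laden illustration rather than the method of this paper: it uses only the distance characteristic, the one-dimensional Mahalanobis distance $|d - \mu| / \sigma$, and a placeholder threshold; `filter_matches` is a hypothetical name.

```python
import math
from statistics import mean, pstdev


def filter_matches(points_kf, points_cur, threshold=1.0):
    """Return the indices of matches whose displacement length is typical.

    points_kf, points_cur: equally long lists of matched 3D points.
    threshold: placeholder bound on the 1D Mahalanobis distance |d - mu| / sigma.
    """
    lengths = [math.dist(a, b) for a, b in zip(points_kf, points_cur)]
    if not lengths:
        return []
    mu = mean(lengths)
    sigma = pstdev(lengths)
    if sigma == 0.0:
        # All displacements are identical, so nothing is atypical.
        return list(range(len(lengths)))
    return [i for i, d in enumerate(lengths) if abs(d - mu) / sigma < threshold]
```

The same test could be repeated on the orientation of the displacement vectors, which is the second characteristic mentioned in the text.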
Descriptor filtering Let $M^*_{b,t}$ be the correct match set between $D_b$ and $D_t$, and $\hat{p}$ a vector containing the 3D positions of both appearance ($D^{\alpha}$) and shape ($D^{\beta}$) descriptors. We compute the mean and variance between the KF points $\hat{p}^*_b$ and the current-frame points $\hat{p}^*_t$ in terms of distance and orientation, and then we remove atypical points with a Mahalanobis test, where $\mathrm{Mah}(\cdot)$ calculates the Mahalanobis distance, $th_m < 1$ is its associated threshold, and $\mu_*$ and $\sigma_*$ are the mean and variance, respectively, of (1) the Euclidean distance and (2) the orientation between matched points.
Initial pose We use the points $p$ of the correspondences $M_{b,t}$ to compute a rigid transformation from $D_b$ to $D_t$ in order to get an initial head pose $P_b$. The relative transformation $\hat{\tau}_t = \{\hat{R}_t, \hat{t}_t\}$ is estimated by minimizing (arg min) a weighted cost function in which $\omega_j$ is the confidence weight of the $j$-th matched pair, calculated from the distance between the corresponding features. Thus, reliable features contribute more to the estimate of the transformation $\hat{\tau}_t$. This pose is then enhanced by considering additional information such as occlusions in the current frame.
Now, let $M_t$ be the model $M$ after applying this rigid transformation. We improve the pose by aligning the points $p_m$ of the model $M_t$ with the corresponding points $p_t$ of the current frame, which is done by minimizing a point-to-plane cost function in which $n^{(j)}_m$ is the surface normal of point $p^{(j)}_m$. The weight $\omega_j$ encodes the affinity between correspondences based on their normals, distance, and orientation with respect to the camera. It relies on the angle between two vectors:
$$\mathrm{ang}(a_1, a_2) = \arccos\left(\frac{a_1 \cdot a_2}{\|a_1\|\,\|a_2\|}\right).$$
Equation 16 measures the angle between the normals of the matched points.
Sometimes it is not possible to find a suitable KF for a given frame, i.e., the number of matches is not sufficient. In this case, we use the last KF-based estimation as a temporary KF, and thus, we continue the pose estimation without interruption.
We optimize Eqs. 10, 14, and 15 through an ICP scheme with classic termination criteria, i.e., a maximum number of iterations (10) and the mean square error in terms of translation and rotation. Thus, we obtain the final pose $P_t = \{p_t, \theta_t\}$, which corresponds to the nose tip position and the orientation, respectively, of the model after the transformation $\tau_t = \{R_t, t_t\}$.
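Two small building blocks of this refinement can be sketched as follows: the angle between two vectors (used to compare surface normals) and an iteration loop with the termination criteria mentioned above. The iteration cap of 10 comes from the text; the error tolerance, the exact form of the stopping test, and the `step` callable that stands in for one alignment iteration are assumptions made for illustration.

```python
import math

MAX_ITERATIONS = 10  # iteration cap stated in the text


def ang(a1, a2):
    """Angle in radians between two 3D vectors (arccos of the normalized dot product)."""
    dot = sum(x * y for x, y in zip(a1, a2))
    norm = math.hypot(*a1) * math.hypot(*a2)
    if norm == 0.0:
        raise ValueError("angle is undefined for a zero-length vector")
    # Clamp to [-1, 1] to guard against rounding slightly outside the domain of acos.
    return math.acos(max(-1.0, min(1.0, dot / norm)))


def iterate_until_converged(step, tolerance=1e-6, max_iterations=MAX_ITERATIONS):
    """Call `step` until its returned error drops below `tolerance` or the cap is hit.

    step: hypothetical callable that performs one alignment iteration and
          returns the current mean square error.
    Returns (number of iterations performed, last error).
    """
    error = math.inf
    for iteration in range(1, max_iterations + 1):
        error = step()
        if error < tolerance:
            return iteration, error
    return max_iterations, error
```

In a full pipeline, `step` would re-estimate the correspondences and the transformation at each call; here it is left abstract so that the stopping logic stays visible.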

Key frame updating
Our system does not need to learn all 50 discretized orientations in order to be launched, but it benefits from having more KFs. Therefore, the online system begins once 20 KFs are available, and new KFs can then be added from the current estimates of our proposal. This is done by checking the current estimated pose $P_t$: if the orientation $\theta_t$ does not have an associated KF in the discretized space, we include it in the set following the considerations of Section 3.2.2. Otherwise, we compare the fitness score of the current frame with that of the closest KF. The score checks the average distance between the model and the point cloud, the number of descriptors, and the feature distance, and we keep the frame with more descriptors and a smaller distance. The optimization described in Section 3.2.3 is carried out when enough KFs have been added or modified, i.e., 5 frames. Since this operation is performed in parallel and only when necessary, no additional time is added to the online estimate.
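The update policy can be sketched as a small store that tracks how many KFs changed since the last global optimization. The thresholds (20 KFs to start, 5 changes to re-optimize) come from the text; the class and method names, the reduction of the fitness score to two numbers, and the strict "more descriptors and smaller distance" comparison are assumptions for illustration.

```python
from dataclasses import dataclass, field

MIN_KFS_TO_START = 20   # the online system starts once this many KFs exist
REOPTIMIZE_AFTER = 5    # number of added or modified KFs that triggers optimization


@dataclass
class KeyFrameStore:
    # Maps a discretized orientation cell to (n_descriptors, mean_distance).
    frames: dict = field(default_factory=dict)
    pending_changes: int = 0

    def ready(self):
        """True when enough KFs are available to launch the online estimation."""
        return len(self.frames) >= MIN_KFS_TO_START

    def update(self, cell, n_descriptors, mean_distance):
        """Add or replace the KF of `cell`; return True when re-optimization is due."""
        current = self.frames.get(cell)
        if current is not None:
            cur_n, cur_dist = current
            # Keep the stored KF unless the new frame is better on both criteria.
            if not (n_descriptors > cur_n and mean_distance < cur_dist):
                return False
        self.frames[cell] = (n_descriptors, mean_distance)
        self.pending_changes += 1
        if self.pending_changes >= REOPTIMIZE_AFTER:
            self.pending_changes = 0
            return True
        return False
```

A caller would run the global optimization in a background worker whenever `update` returns `True`, which matches the idea that it adds no time to the online estimate.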

Experimental evaluations
We evaluate our KF-based proposal, Fanelli's method, and Li's approach with the variant of the 3D nose feature (see Section 3.2.1) on two public benchmarks: the ICT-3DHP dataset [29] and the BIWI Kinect Head Pose Database [4]. We also create a more realistic dataset with complex behaviors that challenge these pose estimation frameworks.

Datasets
The BIWI Kinect Head Pose Database [4] is a baseline for evaluating HPE algorithms. It consists of 24 sequences with 20 persons of different gender, age, and facial characteristics. It has over 15K RGB-D images and is aimed at frame-by-frame detection rather than tracking, because many sequences have some missing frames. In each sequence, a single target slowly rotates his/her head with a neutral expression, covering a range of ±75° for yaw and ±60° for pitch. Head pose annotations are estimated using a tracking system.
The ICT-3DHP Dataset is proposed in [29]. It is divided into 10 sequences containing about 14K RGB-D frames with both color and depth images. The targets perform a similar head motion as in BIWI dataset, but some targets present facial expressions, self-occlusion (e.g., hair), and small change of position. The ground truth is generated through a Polhemus FASTRAK flock of birds tracker, which is a commercial system that estimates head pose from sensors located over a white sport cap.
Our ICU-Head Pose Dataset consists of 4 sequences, each with a unique person (see Fig. 5). The targets have different facial morphology and features, e.g., glasses, mustaches, or beards. The sequences are created to test the performance of HPE algorithms under challenging scenarios. Therefore, the targets perform complex behaviors including changes of head position, a large range of head orientations, self-occlusions, fast motion, and facial deformation. We collect the sequences with a Microsoft Kinect v1 under controlled conditions with a resolution of 640×480. The ground truth is automatically annotated through a commercial motion capture (MoCap) system with a total of 6 markers (reflective spheres) fixed on a bicycle helmet using metallic bars of 10 cm (see Fig. 6). The MoCap system detects these markers as a rigid object and estimates, with high precision, the location and orientation of the helmet and therefore of the target's head.
Each target performs a different set of behaviors with unique characteristics such as speed. A summary of the sequences is presented in Table 1. The details of each sequence are as follows. In Seq1, the target performs simple actions at slow speed. It covers a small range of head orientations, with a complexity similar to that of the public BIWI and ICT-3DHP datasets. We rate Seq2 as medium difficulty because it presents a large orientation range and fast motions. Also, the target changes his/her head position several times, approaching and moving away from the camera. Seq3 and Seq4 are the most challenging of the whole set. In Seq3, the target performs extreme head orientations and multiple self-occlusions. Finally, Seq4 depicts fast head movements in orientation and position. Throughout the article, we show several examples using our dataset.

Evaluation criteria
We evaluate the performance of the HPE algorithms through standard metrics such as missed detection, Euler angle error (roll, pitch, and yaw), and mean angular error.
A head pose is labeled as a missed detection if the estimation algorithm does not converge to a solution, according to the termination criteria, or if the proposed pose has an error of more than 45°. We learn the KFs for each sequence using the system described in Section 3.2.2, and those frames are not considered in the evaluation step. We evaluate and report the results of three proposals: (1) Fanelli's method [4], using the open source code; (2) an implementation of Li's proposal [10]; and (3) Li Nose, which includes our nose-based feature in the approach of Li. We analyze different parts of our proposal separately by creating three variants; an overview is shown in Table 2. Recall that KFv1 is our basic version, published in [43], which only uses the SURF descriptors.
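These criteria can be illustrated with the sketch below. The 45° threshold comes from the text; the remaining choices are assumptions: a frame counts as missed when any single Euler angle exceeds the threshold, missed frames are excluded from the error statistics, the mean angular error is taken as the average of the three per-axis errors, and angles are assumed not to wrap around ±180°. The name `evaluate` is hypothetical.

```python
MISS_THRESHOLD_DEG = 45.0  # error above which a pose counts as a missed detection


def evaluate(estimates, ground_truth, miss_threshold=MISS_THRESHOLD_DEG):
    """Compute the missed-detection rate and the angular errors in degrees.

    estimates:    list of (roll, pitch, yaw) tuples, or None when the
                  estimator did not converge for that frame.
    ground_truth: list of (roll, pitch, yaw) tuples of the same length.
    """
    kept = []
    missed = 0
    for est, gt in zip(estimates, ground_truth):
        if est is None:
            missed += 1
            continue
        errors = tuple(abs(e - g) for e, g in zip(est, gt))
        if max(errors) > miss_threshold:
            missed += 1
            continue
        kept.append(errors)
    total = len(estimates)
    # Per-axis mean absolute error over the detected frames only.
    per_axis = tuple(sum(axis) / len(kept) for axis in zip(*kept)) if kept else None
    return {
        "missed_rate": missed / total if total else 0.0,
        "axis_error": per_axis,  # (roll, pitch, yaw)
        "mean_angular_error": sum(per_axis) / 3 if per_axis else None,
    }
```

Excluding missed frames from the error keeps a method from looking accurate only because its failures are counted elsewhere, so both numbers should be read together.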
We only report the results with respect to the orientation because an incorrect position estimate is reflected in the orientation error as well.

Results
First, we analyze the BIWI dataset: Fig. 7 reports the mean error over all sequences per proposal, and Fig. 8 shows the missed detection percentage. The mean error of the Li-based approaches (red and green columns) is almost the same, but the number of missed detections decreases substantially when we incorporate the nose feature (green column). The last three columns (purple and cyan) depict the results of our proposal. The performance, in both accuracy and detection rate, improves after we apply the optimization process over the KFs. Also, Table 3 reports the results and compares them against other state-of-the-art methods. Our proposal has the best accuracy in terms of pitch and yaw, while Venturelli et al.'s approach [26] has a similar performance for roll. Nevertheless, the variance of our KFv3 proposal is smaller in all cases, making this approach more stable.
Similarly, we evaluate the proposals with the ICT-3DHP dataset and show the results in Figs. 9 and 10. The mean error is almost the same for Li's approach and the KFv1 proposal, but we can observe that the optimized approach KFv3 is more accurate, with a missed detection rate of less than 0.5%. We compare our results with other approaches in Table 4. The KFv3 method gives the best results, with the smallest variance of all the techniques, meaning it is more stable.
Figures 11 and 12 and Table 5 show the results using our dataset. Fanelli's approach has the largest error, the Li-based proposals have a similar mean error of around 8°, and the KF-based approaches have the smallest error. By observing Fig. 11, we note a great improvement with respect to missed detection because our KF-based approach handles fast motions and occlusions better. In Fig. 13, we see a qualitative example for Seq3 where, for a given frame, the estimated pose is depicted with a blue line and the 3D template model in green. We observe that Fanelli and Li have limitations in detecting the pose, whereas our approach can detect a sufficient part of the face to infer a correct pose.
From all the results, we observe that Fanelli's approach has the largest error in most of the cases. This is because it is difficult to find the nose tip when the face is in full profile, which makes the nose barely distinguishable. Better training could improve this aspect, but that requires more pre-processing.
In general, Li's basic approach performs better than Fanelli's, but on our dataset, Li's proposal has problems detecting the pose. Figure 14 shows more detailed results for each sequence. We can observe that sequences 1 and 2 yield a performance similar to that of the previous public datasets; nonetheless, in sequences 3 and 4, the missed detection rate of Li is higher than that of the rest. These sequences present fast motion with both targets wearing glasses; therefore, the images are blurred, and on some occasions, the light is reflected in the glasses. This makes it difficult for the 2D face landmark detector to find the eyes, forcing Li's proposal to use ICP without any additional information. If we compare the red and green columns, we observe an improvement, meaning that the addition of the 3D nose feature overcomes the aforementioned problems.
In Figs. 15 and 16, we analyze the results in terms of missed detection. These figures are 2D histograms of the discretized orientation for pitch and yaw. When a frame is labeled as a missed detection, we use the ground truth to increase the counter of the corresponding pose. The histograms are normalized by the number of frames, so each cell (for a specific orientation) depicts the percentage of missed detections. The histogram center, highlighted with the green and blue arrows, represents a target in frontal view (looking at the camera). Moving along the x-axis means the head is turning from left to right (blue arrow), and moving along the y-axis means it is moving from up to down. In Fig. 15, we report the results for sequence 14 of the BIWI dataset, where we can observe that Fanelli's proposal (left image) cannot reliably detect a pose at full profile. In other words, it has problems handling a target that is looking up.
Figure 16 shows other cases, also from the BIWI dataset, using sequence 24. Both approaches based on Li (first two images on the left) do not detect the head well when it is looking slightly toward the upper right corner. The third image shows the results of our KF-based method without optimization (KFv1). Most of the undetected frames occur when the target is looking upward. In contrast, this does not happen with KFv3 because it improves the detection rate in that orientation.
The previous results show how our approach improves the HPE performance under challenging scenarios. In some cases, other proposals provide slightly more accurate results, but in all cases, the KF-based approach is more stable, and it does not require a specific architecture (e.g., GPUs) while keeping a reasonable computation time. This makes the approach more reliable and robust.
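The missed-detection histograms of Figs. 15 and 16 can be sketched as follows. This is a minimal illustration under assumptions: the bin size is hypothetical, cells are centered so that the frontal view falls in cell (0, 0), and each cell is normalized by the number of frames whose ground truth falls in that cell (the text only states that the histograms are normalized by the number of frames).

```python
from collections import Counter

BIN_DEG = 15.0  # hypothetical width of one orientation cell


def missed_histogram(gt_orientations, missed_flags, bin_deg=BIN_DEG):
    """Percentage of missed detections per discretized (yaw, pitch) cell.

    gt_orientations: list of ground-truth (yaw, pitch) pairs in degrees.
    missed_flags:    list of booleans, True when the frame was a missed detection.
    Returns a dict mapping each visited cell to a percentage in [0, 100].
    """
    frames_per_cell = Counter()
    missed_per_cell = Counter()
    for (yaw, pitch), missed in zip(gt_orientations, missed_flags):
        cell = (round(yaw / bin_deg), round(pitch / bin_deg))
        frames_per_cell[cell] += 1
        if missed:
            missed_per_cell[cell] += 1
    return {
        cell: 100.0 * missed_per_cell[cell] / count
        for cell, count in frames_per_cell.items()
    }
```

Cells with a high percentage then point to the orientations where a method tends to fail, independently of how often the target visited them.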

Discussion
Our learning step uses the output of two state-of-the-art HPE methods, those of Fanelli and Li, although several proposals in Tables 3 and 4 outperform them. The natural question is why we favor those methods over more accurate proposals. The answer can be found in Table 6, which summarizes some features of the most relevant approaches. The proposals of Ahn, Saeed, and Venturelli [24,26,27] are more accurate and faster, but they require a GPU. This makes them more expensive and more complex to use in embedded systems. The other proposals, e.g., [9,12,28], require more computational time and show high variance in their estimates; for instance, [12] has a variance of 16° and 9.6° for yaw and pitch, respectively. Our proposal runs at ≈ 10 fps, which is reasonable for most applications. Moreover, most of our proposal is highly parallelizable, so the computation time can be reduced further if necessary.
When comparing the results of each dataset, we observe that in the simplest sequences, our proposal obtains results with equivalent precision. Also, the results with the most complex sequences (i.e., ICU dataset) show that our proposal has a better performance both in accuracy and missed detection percentage.
If we compare the three versions of KF, we observe how the versions with global optimization (KFv2 and KFv3), described in Section 3.2.3, improve the stability of the performance in comparison with the KFv1. The accuracy is further improved in KFv3 by including (1) the descriptor distance as weighting factors in the optimization process and (2) an adaptive model to the target's face.
We give a qualitative evaluation of the tested methods in Table 7, based on our personal experience. Here, we grade them according to our impression in each aspect as follows: (+) low, (++) good, and (+++) excellent.
As shown in the first row, Li does not handle fast motions well. In this case, blurry images directly affect two appearance-based aspects of the proposal: the 2D (eye) landmark detector and the color-based k-means, which removes non-face correspondences from the ICP algorithm. This makes it unstable under fast motion, and therefore, it gives a low detection rate. On the other hand, it can detect poses in a wide range of orientations with good precision.
Li's proposal improves when more features are available. The inclusion of the nose feature enhances the accuracy of the estimations and reduces the missed detection rate. This is because the 3D feature is based on depth information, which is not much affected by blurry images.
In general, the orientation range and accuracy are better than the classic implementation but still need more improvement.
Fanelli's approach deals better with fast motions because depth information is not distorted by movement. In contrast, it has a more restricted detection range because the nose tip, the key element of Fanelli's method, is indistinguishable in full-profile images. In other words, there is not enough evidence to distinguish the nose tip from the edge of the face. The rest of the time, it has no problem detecting a pose in a short time, which is why Li used this method to initialize his proposal. Nevertheless, the accuracy of the results is low.
In most of the fast motions, our proposal could find enough features to estimate the pose. Also, it estimates the pose even with targets at full profile (i.e., looking to the left or right), with an excellent orientation range. Thanks to these two aspects, it has fewer problems detecting the target most of the time, with results competitive with the state of the art.
From the results, we observe how the use of the KF-based approach improves the estimation, and how the results are further enhanced by applying the global optimization process. The inclusion of the descriptor weights (KFv3) helps to estimate the pose more robustly because it reduces the importance of weak correspondences, which may not be good matches (large distance between descriptors), and prioritizes strong matches.

Conclusion and future work
This paper has presented a framework for HPE based on key frames, which includes information on appearance, shape, and head pose hypotheses. It includes an original offline learning proposal consisting of two stages: (1) an automatic KF learning step and (2) an original post-processing step that globally minimizes the error between the KFs and the 3D face model, enhancing the accuracy and consistency of the KF set. We evaluated this person-specific approach on two public benchmarks, and we have shown that the use of KFs provides robust estimates for a wide range of orientations in reasonable time. Also, we presented a more challenging dataset with complex behaviors that include self-occlusions, fast motion, changes of head position, and extreme head orientations. The results on this dataset showed that our approach can estimate a pose even in complex situations, unlike other approaches. At the same time, we have shown that our proposal is more stable than the others, with a gain in precision as the complexity of the datasets increases.
We have compared against several works and considered classic benchmark datasets. Regarding the benchmark datasets, the results have shown how the KF-based approach, learned from weaker estimation algorithms, provides good performance and how the results are enhanced after optimization. Furthermore, our approach maintains a competitive CPU cost with respect to other approaches.
A natural direction for future work is to relax the offline stage (leaving a mostly online system) by learning only a couple of KFs of the target, in a neutral pose and looking toward the camera. We would then run our pose estimation algorithm and learn more KFs as soon as new estimates are available. The set of KFs would be updated as described in Section 3.3.2.