# Rotation update on manifold in probabilistic NRSFM for robust 3D face modeling

*EURASIP Journal on Image and Video Processing*
**volume 2015**, Article number: 45 (2015)

## Abstract

This paper focuses on recovering the 3D structure and motion of human faces from a sequence of 2D images. Based on a probabilistic model, we study the rotation constraints of the problem in depth. Instead of imposing the constraints through numerical approximation, the inherent geometric properties of the rotation matrices are taken into account. The conventional Newton’s method for optimization problems is generalized to the rotation manifold, which turns the constrained problem into an unconstrained optimization on the manifold. Furthermore, we extend the algorithm to model within-individual and between-individual shape variances separately. Evaluation results show improvement over state-of-the-art algorithms on the Mocap-Face dataset with additive noise, as well as on the Binghamton University 3D Facial Expression (BU-3DFE) dataset. Robustness in handling noisy data and modeling multiple subjects shows the capability of our system to deal with real-world image tracks.

## Introduction

Recovering scene geometry and camera motion from sequences of 2D monocular images has seen significant success for the 3D geometry of static objects. The widely used rigid factorization method was first introduced by Tomasi and Kanade [1]. Orthonormality constraints are adopted on the rotation matrices in order to recover structure and motion in a single step. Unfortunately, most biological objects and natural scenes are deformable. 3D rigid motions, i.e., camera rotation and translation, along with non-rigid deformations, e.g., stretching and bending, are mixed altogether in their image measurements. Hence, extending the existing rigid algorithms to the non-rigid scenario turns out to be a far more challenging task than it appears to be.

It is known that the problem of non-rigid structure from motion (NRSFM) is generally underconstrained and thus intractable, if each point of the object moves arbitrarily. In practice, however, many objects, e.g., faces, deform under certain rules. A possible approach is to learn an application-specific 3D model of non-rigid structure from the training data to constrain deformation [2]. Another possibility is to hard-code and learn a model incrementally [3]. Some approaches [4–7] were proposed from another perspective to remove the need of such a prior model, which is not available in most real-world situations. The shape model, i.e., shape bases, is treated as unknowns to be solved, with only the orthonormality constraints on camera rotations being utilized. Xiao et al. [8] proved that only enforcing the orthonormality constraints is not enough for the factorization-based method; therefore, they introduced the basis constraints to reduce ambiguity.

In this work, two major contributions have been made. We first investigated the geometric properties of the orthonormality constraints and generalized the Newton’s optimization method to the underlying manifold of the camera rotation matrices. That means, non-linear optimization can be carried out on the manifold without any imprecise approximations. We used a probabilistic principal component analysis (PPCA)-based framework [9] to model NRSFM as it is more robust to noise than the closed-form factorization techniques. Our second contribution is about dealing with multiple subjects. The current NRSFM algorithms mostly focus on the reconstruction of a single subject. While dealing with data containing multiple subjects, no difference is taken into account, when modeling between-individual variation (e.g., face model of different identities) and within-individual variation (e.g., facial expression of the same identity). For that reason, we extended the PPCA-based framework to the probabilistic linear discriminant analysis (PLDA) [10] model to improve reconstruction performance on data with multiple subjects.

The remainder of this paper is organized as follows. Previous research on NRSFM is reviewed in Section 2. Section 3 presents the probabilistic NRSFM model [9] and our novel manifold optimization technique on the orthonormality constraints. Section 4 discusses the experimental results of our algorithm. Finally, we conclude our work in Section 5.

## Related work

Modern structure from motion (SFM) algorithms employ the factorization method for orthographic camera projection proposed by Tomasi and Kanade [1]. The rank theorem ensures that the input matrix can be factorized into two matrices, one corresponds to the camera motion, and the other represents the shape. Although the resulting matrices from singular value decomposition (SVD) are not unique, they only differ by a linear transformation. By imposing metric constraints, a decent solution of the SFM problem for rigid objects can be achieved.
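The rank argument above can be checked numerically. The following is a minimal NumPy sketch, not code from any of the cited works: the sizes, the helper `random_rotation`, and the matrix `A` are illustrative assumptions. It builds a noise-free measurement matrix for a rigid shape under orthographic cameras, confirms that it has rank three, and shows that the SVD factors are only determined up to an invertible 3×3 transformation. The metric constraints that resolve this ambiguity are not implemented here.

```python
import numpy as np

rng = np.random.default_rng(0)
T, P = 12, 30                       # frames and points (illustrative sizes)
S = rng.standard_normal((3, P))     # rigid 3D shape, one column per point
S -= S.mean(axis=1, keepdims=True)  # center the shape so translation can be dropped

def random_rotation(rng):
    # QR of a Gaussian matrix gives an orthogonal matrix; flip a column if det = -1.
    Q, _ = np.linalg.qr(rng.standard_normal((3, 3)))
    if np.linalg.det(Q) < 0:
        Q[:, 0] = -Q[:, 0]
    return Q

# Stack the 2x3 orthographic cameras into a 2T x 3 motion matrix.
M = np.vstack([random_rotation(rng)[:2] for _ in range(T)])
W = M @ S                           # 2T x P measurement matrix

U, sv, Vt = np.linalg.svd(W, full_matrices=False)
rank = int(np.sum(sv > 1e-9 * sv[0]))

# Rank-3 factors from the SVD.
M_hat = U[:, :3] * np.sqrt(sv[:3])
S_hat = np.sqrt(sv[:3])[:, None] * Vt[:3]

# Any invertible A gives another valid factorization with the same product.
A = np.array([[2.0, 1.0, 0.0],
              [0.0, 1.0, 1.0],
              [1.0, 0.0, 3.0]])
same_product = np.allclose((M_hat @ A) @ (np.linalg.inv(A) @ S_hat), W)
```

The last check is the reason additional constraints on the motion factor are needed before the factors can be read as camera rotations and 3D shape.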

In the seminal work of Bregler et al. [11] and Torresani et al. [6] on NRSFM, the 3D shape of an object is assumed to be a linear combination of deformation shapes applied to a dominant rigid component. In this way, non-rigid motion recovery is formulated as a factorization problem and the low rank of the image measurements is analyzed. In general, this model assumes that the number of basis shapes is known, and an inaccurate choice can lead to a performance drop. If the number is underestimated, the bases are not sufficient to represent all variations of the object; if it is overestimated, the extra degrees of freedom are unconstrained, so the model starts fitting to noise and is unlikely to generalize well.

Using the linear representation, Xiao et al. [8] proposed a closed-form scheme for solving the NRSFM problem. They proved in the previous work that by imposing orthonormality constraints alone on camera rotations, the increased degree of freedom will cause ambiguity. The additional basis constraints will determine the shape bases uniquely. In [12], Xiao and Kanade pointed out that even enforcing both sets of linear metric constraints above could still lead to ambiguity, if there exist degenerate bases, which are not of full rank three. However, by exploiting the rank three constraints inherently, Akhter et al. [13] analytically proved that orthonormality constraints alone are sufficient to recover the exact structure. Ambiguity solely lies in the transformation of linear basis vectors, which does not affect the 3D structure reconstruction. Dai et al. [14] proved this claim by solving the NRSFM problem without any prior using matrix trace norm minimization.

Torresani et al. [9] proposed a probabilistic deformation model based on PPCA and suggested that it yields better reconstruction results than the conventional linear model. In their work, 3D shapes are drawn from non-uniform probability distribution functions (PDFs) with a Gaussian prior on each shape in the subspace instead of the common linear subspace model, which is a specific usage of PPCA. The parameters of the PDF are unknown in advance and are optimized using the expectation-maximization (EM) algorithm together with the 3D shapes and rigid motions. An advantage of PPCA over the simple deterministic subspace model is that degeneracy of closed-form solutions does not occur, so the ambiguity problem pointed out by Xiao et al. in [12] does not arise here. However, the rotation matrices are approximated by using a single Gauss–Newton step with a fixed updating step length, which can lead to a considerable performance drop in the rotation reconstruction if no proper metric on the manifold is defined.

Over the last years, more research on NRSFM has also been done using various forms of non-linear optimization techniques to minimize the 3D reprojection error. In order to overcome the degeneracy problem, some additional heuristic constraints were introduced. Shaji and Chandran [15] introduced a canonical Riemannian metric on the product span subspace of the rotation matrices and articulated shape weights. The Newton’s algorithm is generalized to the product manifold to recover those parameters, while the Wiberg algorithm is employed to solve the shape update. It differs from our approach in that our framework uses a probabilistic model with a posterior objective function over the latent variables, which is more robust to noise introduced by tracking error or manual labeling. Section 4 shows the robustness of our model with extreme conditions of noise.

Other than recovering the whole 3D shapes and motion parameters like in almost all the existing applications, Rabaud and Belongie [16] presented a manifold learning approach that only focuses on an embedding of frames within the input image sequence. The intuition is as follows: given enough image frames, a non-rigid deformed 3D shape can be observed several times in different view angles. If some of the frames share a low 3D reconstruction error, they are highly likely to represent a similar 3D shape, otherwise it means a poorly matched set of frames. Following this principle, triplets of frames are compared to exploit all repetitions in possible shape deformations. Then the generalized non-metric multi-dimensional scaling framework is used to estimate the weight of each deformation shape. Bundle adjustment is employed as a further optimization step, which minimizes the reprojection error. This closed-form approach can reconstruct accurate 3D shape on a clean synthetic dataset; however, with the amount of noise added, their performance drops very fast and approaches that of PPCA. Tao and Matuszewski [17] also employed manifold learning-based diffusion maps to handle highly deformable objects.

By exploiting the temporal smoothness of the shape trajectories across the images, Akhter et al. [18] addressed the NRSFM problem in trajectory space, which is the dual problem to the conventional spatial shape bases. By describing the 3D point trajectory linearly using object independent discrete cosine transform (DCT) vectors, unknowns in estimation are reduced, and stable reconstruction is achieved as a result. Gotardo and Martinez extended the temporal dependence to iteratively obtain higher-frequency DCT in [19] and explicitly modeled the complementary spaces of rank three in [20]. Valmadre and Lucey [21] formulated the regularization of the trajectory basis with a temporal filter. Recently, Park et al. [22] simplified the global motion estimation of the trajectory basis with the aid of a few stationary points in the scene. Despite the robust performance on various motion capture datasets, limitation of its application is also obvious, i.e., the object deformation should be temporally continuous and smooth. Otherwise, higher-frequency DCT vectors are needed, which significantly increases the rank of the trajectory matrix factorization and will eventually lead to degeneration and unstable performance. In contrast, the primary problem in shape space does not suffer from this.
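The dependence of trajectory-space methods on temporal smoothness can be illustrated with a small experiment. This is a minimal sketch under stated assumptions, not the implementation of [18–20]: `dct_basis` and `projection_error` are hypothetical helper names, the orthonormal DCT-II basis is one common choice, and the two test signals are synthetic.

```python
import numpy as np

def dct_basis(T, K):
    """First K orthonormal DCT-II vectors of length T, as columns of a T x K matrix."""
    t = np.arange(T)[:, None]
    k = np.arange(K)[None, :]
    B = np.sqrt(2.0 / T) * np.cos(np.pi * (2 * t + 1) * k / (2 * T))
    B[:, 0] /= np.sqrt(2.0)         # the constant vector needs a different scale
    return B

def projection_error(x, K):
    """Relative error after projecting trajectory x onto the first K DCT vectors."""
    B = dct_basis(len(x), K)
    return np.linalg.norm(x - B @ (B.T @ x)) / np.linalg.norm(x)

T = 64
time = np.linspace(0.0, 1.0, T)
smooth = np.sin(2 * np.pi * time)   # slowly varying point coordinate
rng = np.random.default_rng(0)
jumpy = rng.standard_normal(T)      # coordinate without temporal smoothness

err_smooth = [projection_error(smooth, K) for K in (4, 8, 16)]
err_jumpy = [projection_error(jumpy, K) for K in (4, 8, 16)]
```

A smooth trajectory is captured by a few low-frequency vectors, whereas the unordered signal keeps a large residual until many more vectors are used, which is the situation described above for data without temporal continuity.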

In this paper, we demonstrate a probabilistic, iterative alternating approach to solve the NRSFM problem. In contrast to Torresani et al. [9], the conventional Newton’s method is generalized on the rotation manifold to solve the optimal rotation matrix for each optimization iteration. The orthonormality constraints are naturally guaranteed by the metric update step without the need of being projected back after constrained optimizations on the Euclidean space. Additionally, a generic PLDA model that takes into account the commonness across all subjects, as well as the specific characteristics between the subjects, can be learned. On datasets with more than one subject, better individual reconstruction is achieved even if insufficient number of frames are available for each subject.

## NRSFM model

Most state-of-the-art NRSFM algorithms use a linear subspace model to represent the shape, in which a linear combination of deformation shapes is applied to a dominant rigid component. Let the 3*P*×1 vector \(\bar{\mathbf{s}}\) be the mean shape, and let the 3*P*×*K* matrix **V** and the *K*-dimensional vector \(\mathbf{z}_{t}\) be the remaining basis shapes and their weights, respectively, where *P* is the number of landmarks in each image frame and *K* the number of articulation shapes apart from the mean shape. The 3D shape of the *t*th frame is represented as

\[\mathbf{s}_{t} = \bar{\mathbf{s}} + \mathbf{V}\mathbf{z}_{t}. \qquad (1)\]

Note that shapes are stacked in matrix **V** so that each column represents a basis shape. Camera rotation in frame *t* is denoted by the 2×3 matrix \(\mathbf{R}_{t}\). Due to the inevitable presence of internal and external noise in image tracks or labeling, zero-mean Gaussian noise \(\mathbf{n}_{t}\) with variance \(\sigma^{2}\) is also added. If we align the images to the center and drop the translations, the 2D observation matrix under the orthographic camera model can be factorized into

This probabilistic formulation of conventional principal component analysis (PCA) was addressed by Tipping and Bishop in [23]. It has a simple linear probabilistic assumption that all marginal and conditional distributions are Gaussian. PPCA is closely related to factor analysis [24], in which a statistical model is used to describe the relation between the observed vector \(\mathbf{p}_{t}\) and the corresponding latent variables \(\mathbf{z}_{t}\).
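The generative view of the model can be sketched in a few lines of NumPy. This is an illustration under assumptions rather than the paper's exact formulation: landmarks are assumed to be stacked as `[x1, y1, z1, x2, ...]`, the 2×3 camera matrix is applied to every landmark through a Kronecker product, the latent weights follow the standard PPCA unit Gaussian prior, and all sizes and the name `sample_observation` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
P, K, sigma = 5, 2, 0.01            # landmarks, basis shapes, noise std (illustrative)

s_bar = rng.standard_normal(3 * P)          # mean shape, stacked as [x1 y1 z1 x2 ...]
V = 0.1 * rng.standard_normal((3 * P, K))   # deformation basis, one shape per column

def sample_observation(R, rng):
    """Draw one 2D frame (length 2P) for a 2x3 camera matrix R."""
    z = rng.standard_normal(K)              # latent weights, assumed z ~ N(0, I)
    s = s_bar + V @ z                       # 3D shape of this frame
    M = np.kron(np.eye(P), R)               # applies R to every landmark
    noise = sigma * rng.standard_normal(2 * P)
    return M @ s + noise, z

R = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0]])             # frontal orthographic camera
p, z = sample_observation(R, rng)
```

With the frontal camera used here, the observation is simply the x and y coordinates of the deformed shape plus noise, which makes the role of each term easy to inspect.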

In Eq. (2), the weight coefficients \(\mathbf{z}_{t}\) are formulated as an independent and identically distributed (i.i.d.) Gaussian prior

These unobserved or latent variables are marginalized out instead of being explicitly calculated. Since only linear transformations appear in Eq. (2), the measurement vector \(\mathbf{p}_{t}\) is also Gaussian distributed [9] with the form

### Shape update

The PPCA model can be estimated iteratively by the EM algorithm [9]. In the expectation step (E-step), the posterior distribution over \(\mathbf{z}_{t}\) is defined as

Over this distribution, the first two moments of \(\mathbf{z}_{t}\) are given by

In the following maximization step (M-step), the expected negative log-likelihood function

is minimized. The shape bases \(\{\bar{\mathbf{s}}, \mathbf{V}\}\) and the noise parameter \(\sigma^{2}\) can be updated individually in closed form by setting their partial derivatives to zero [9] with the help of the expectations in Eqs. (6) and (7).
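For orientation, the E-step moments of a PPCA model with a unit Gaussian prior can be computed as in the sketch below. It uses the standard PPCA posterior of Tipping and Bishop and assumes the Kronecker form of the camera matrix; it is not claimed to match the notation of Eqs. (5) to (7), and `posterior_moments` and the toy data are hypothetical.

```python
import numpy as np

def posterior_moments(p, R, s_bar, V, sigma2):
    """Return E[z] and E[z z^T] for one frame under a PPCA observation model."""
    P = s_bar.size // 3
    K = V.shape[1]
    M = np.kron(np.eye(P), R)               # 2P x 3P camera matrix (assumed form)
    W = M @ V                               # projected basis, 2P x K
    # beta = W^T (W W^T + sigma^2 I)^{-1}, the standard PPCA posterior gain
    beta = np.linalg.solve(W @ W.T + sigma2 * np.eye(2 * P), W).T
    mean = beta @ (p - M @ s_bar)
    second = np.eye(K) - beta @ W + np.outer(mean, mean)
    return mean, second

rng = np.random.default_rng(0)
P, K, sigma2 = 6, 2, 0.05
s_bar = rng.standard_normal(3 * P)
V = rng.standard_normal((3 * P, K))
R = np.eye(3)[:2]                           # frontal orthographic camera
p = rng.standard_normal(2 * P)              # stand-in observation
mean, second = posterior_moments(p, R, s_bar, V, sigma2)
```

The second moment is the posterior covariance plus the outer product of the mean, which is the quantity the closed-form M-step updates consume.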

However, the camera rotation parameter \(\mathbf{R}_{t}\) is subject to orthonormality constraints; hence, a closed-form update like that of the other parameters is not possible. Torresani et al. [9] approximated the solution with a single Gauss–Newton step in Euclidean space, which is inaccurate and has a theoretically low convergence rate. In the following section, we propose our optimization technique on the manifold.

### Motion update on manifold

In [9], a twist vector **ξ** is employed to hold the result of the single Gauss–Newton step. The exponential map of the skew-symmetric matrix \(\hat{\boldsymbol{\xi}}\) is then set as the updating vector **Δ**. Note that without defining an appropriate metric on the manifold, a manually selected and fixed updating step length is used, which noticeably degrades performance in complex setups.

#### Newton’s method on *SO*(3)

As we consider the orthographic camera model, the camera motion matrix \(\mathbf{R}_{t}\) in Eq. (2) is obtained by projecting a 3D rotation matrix to 2D with an orthographic projection matrix

so that the mapping

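from 3D to 2D is satisfied. The rotation matrix **Q** is an orthogonal matrix with determinant one, which lies exactly on the manifold of the special orthogonal group

\[SO(3) = \left\{\mathbf{Q} \in \mathbb{R}^{3 \times 3} : \mathbf{Q}^{\top}\mathbf{Q} = \mathbf{I},\ \det(\mathbf{Q}) = 1\right\}.\]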

Hence, instead of putting an approximate algebraic or numeric constraint on the Euclidean space \(\mathbb {R}^{N}\) and projecting them back onto the *SO*(3) manifold, an unconstrained optimization on the manifold is a natural generalization and is expected to perform better.
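The difference between a Euclidean update and an update that respects the manifold can be made concrete with a toy example. The sketch below is an illustration only: `PI` stands for the usual orthographic projection that keeps the first two rows, `project_to_so3` is a hypothetical helper that returns the closest rotation via SVD, and the step is an arbitrary small skew-symmetric matrix.

```python
import numpy as np

PI = np.array([[1.0, 0.0, 0.0],
               [0.0, 1.0, 0.0]])            # orthographic projection, drops depth

def orthonormality_error(Q):
    """Frobenius distance of Q^T Q from the identity."""
    return np.linalg.norm(Q.T @ Q - np.eye(3))

def project_to_so3(A):
    """Closest rotation to A in the Frobenius norm, via SVD."""
    U, _, Vt = np.linalg.svd(A)
    D = np.diag([1.0, 1.0, np.linalg.det(U @ Vt)])
    return U @ D @ Vt

Q = np.eye(3)
step = 0.1 * np.array([[0.0, -1.0, 0.5],
                       [1.0, 0.0, -0.2],
                       [-0.5, 0.2, 0.0]])   # a tangent (skew-symmetric) direction

Q_euclid = Q + step                 # straight-line update leaves the manifold
Q_fixed = project_to_so3(Q_euclid)  # extra projection step needed to repair it
R = PI @ Q_fixed                    # 2x3 camera matrix with orthonormal rows
```

Even though the step is tangent to the manifold, the straight-line update is no longer a rotation, so a Euclidean scheme has to project back after every iteration.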

To start with, consider the ordinary Euclidean space. Newton’s method iteratively finds the stationary points of differentiable functions. Provided that *f*(*x*) is a twice-differentiable function, it is approximated around the current iterate \(x_{n}\) by its Taylor series expansion up to the second order, and the update sequence can be written as

\[x_{n+1} = x_{n} - \frac{f'(x_{n})}{f''(x_{n})}. \qquad (12)\]

Given a quadratic function *f*(*x*), the optimal point is found in a single step. In Euclidean space, the first- and second-order derivatives of the objective function are therefore needed. Edelman et al. [25] proved that for Stiefel manifolds (the set of all orthonormal *k*-frames in \(\mathbb{R}^{n}\), \(V_{k}(\mathbb{R}^{n}) = \left\{\mathbf{A} \in \mathbb{R}^{n \times k}: \mathbf{A}^{\top}\mathbf{A} = \mathbf{I}\right\}\)), e.g., *SO*(3), their canonical Riemannian structure makes it possible to generalize Newton’s method to them. Besides the gradient and Hessian, the update along the geodesic of the manifold must be defined to ensure that the update is valid: unlike in Euclidean space, the update path is no longer a straight line but a geodesic curve, which stays on the surface of the manifold at all times and defines the shortest path between two points on the surface. The update step is illustrated in Fig. 1.
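The Euclidean Newton iteration described above, in which each step subtracts the ratio of the first to the second derivative, can be sketched in plain Python. The function name `newton_minimize` and the two test functions are illustrative choices, not part of the paper.

```python
def newton_minimize(df, d2f, x0, steps=20, tol=1e-12):
    """Find a stationary point of f from its first and second derivatives."""
    x = x0
    for i in range(steps):
        update = df(x) / d2f(x)     # Newton step: f'(x) / f''(x)
        x = x - update
        if abs(update) < tol:
            return x, i + 1
    return x, steps

# Quadratic f(x) = 3 (x - 2)^2 + 1: a single step lands on the minimizer x = 2.
x_quad, _ = newton_minimize(lambda x: 6.0 * (x - 2.0), lambda x: 6.0, x0=10.0, steps=1)

# Non-quadratic f(x) = x^4 / 4 - x: minimizer at x = 1, needs several steps.
x_quartic, n_iter = newton_minimize(lambda x: x**3 - 1.0, lambda x: 3.0 * x**2, x0=2.0)
```

The quadratic case shows why a second-order model is attractive: when the model is exact, one step suffices, and otherwise the iteration converges quickly near the solution.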

We define the objective function *F* with respect to the rotation matrix **Q** for the manifold optimization as follows

Since **Q**∈*SO*(3), following [26], its tangent vector **Δ**∈*T*(*SO*(3)) is given by

\[\boldsymbol{\Delta} = \mathbf{Q}\hat{\mathbf{u}},\]

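where \(\hat{\mathbf{u}}\) is the skew-symmetric matrix of the vector \(\mathbf{u} = [u_{1}, u_{2}, u_{3}]^{\top}\) in the form of

\[\hat{\mathbf{u}} = \begin{bmatrix} 0 & -u_{3} & u_{2} \\ u_{3} & 0 & -u_{1} \\ -u_{2} & u_{1} & 0 \end{bmatrix}.\]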

For the Riemannian manifold, the metric can simply be induced from the Euclidean metric as

The explicit formula for geodesics on *SO*(3) at **Q** in direction **Δ** is then

where \(t \in \mathbb {R}\), \(\omega = \mathbf {Q}^{\top } \boldsymbol {\Delta } \in \mathfrak {so}(3)\) (\(\mathfrak {so}(3)\) is the Lie algebra of *SO*(3)). The last equation is called the Rodrigues’ rotation formula [27].
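A geodesic on *SO*(3) can be evaluated numerically with Rodrigues' rotation formula. The sketch below assumes the common convention that the geodesic leaving **Q** with velocity \(\mathbf{Q}\hat{\mathbf{u}}\) is \(\mathbf{Q}\exp(t\hat{\mathbf{u}})\); the helper names `hat`, `exp_so3`, and `geodesic` are hypothetical.

```python
import numpy as np

def hat(u):
    """Skew-symmetric matrix with hat(u) @ v == np.cross(u, v)."""
    return np.array([[0.0, -u[2], u[1]],
                     [u[2], 0.0, -u[0]],
                     [-u[1], u[0], 0.0]])

def exp_so3(u):
    """Matrix exponential of hat(u) by Rodrigues' rotation formula."""
    theta = np.linalg.norm(u)
    A = hat(u)
    if theta < 1e-12:
        return np.eye(3) + A                # first-order limit for tiny angles
    return (np.eye(3)
            + (np.sin(theta) / theta) * A
            + ((1.0 - np.cos(theta)) / theta**2) * (A @ A))

def geodesic(Q, u, t):
    """Point at parameter t on the geodesic leaving Q with velocity Q @ hat(u)."""
    return Q @ exp_so3(t * np.asarray(u, dtype=float))

Q0 = exp_so3(np.array([0.3, -0.2, 0.5]))
u = np.array([0.0, 0.0, np.pi / 2])         # quarter turn about the z-axis per unit t
Q1 = geodesic(Q0, u, 1.0)
```

Every point returned by `geodesic` is again a rotation matrix, so no projection back onto the manifold is required after the update.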

#### Gradient and Hessian

To obtain the gradient and Hessian, we first derive the first and second order derivative for the geodesic **Q**(*t*) with respect to *t*:

Note that the last step in Eq. (19) is derived from the property of the tangent space on the Stiefel manifold that \(\mathbf{Q}^{\top}\boldsymbol{\Delta}\) is a skew-symmetric matrix with

Given the geodesic definition, we derive the gradient and Hessian in direction **Δ**∈*T*(*SO*(3)):

For any arbitrary pair of vectors **X**,**Y**∈*T*(*SO*(3)), polarization [26] helps compute Hess *F*(**X**,**Y**) with

#### Algorithm summary

With the requirements for generalizing Newton’s method being ready, the optimal updating vector on the manifold can be found by modifying the original Newton Eq. (12) to

assuming that the Hessian is non-degenerate. This is the same as finding a vector **Δ** that satisfies, for all vector fields **Y**,

where **G**=∇*F* stands for the gradient. The Hessian can be uniquely determined by substituting an orthonormal basis \(\{\mathbf{E}^{k}\}\), *k*=1,2,3, into Eq. (25) as

For simplicity, the standard basis \(\mathbf{e}_{k}\) for \(\mathbb{R}^{3}\) is chosen so that \(\mathbf{E}^{k} = \mathbf{Q}\hat{\mathbf{e}}_{k} \in T(SO(3))\). Thus, the 3×3 Hessian matrix **H** and the three-dimensional gradient vector **g** can be obtained:

Then, we solve for the vector \(\mathbf{u} = [u_{1}, u_{2}, u_{3}]^{\top} \in \mathbb{R}^{3}\) using

Finally, the desired updating vector \(\boldsymbol {\Delta } = \mathbf {Q} \hat {\mathbf {u}}\) is obtained. The last step is to update the current rotation along the geodesic in the direction of this vector. The algorithm is summarized in Algorithm 1.
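The steps above (directional derivatives along geodesics, polarization for the Hessian entries, a 3×3 linear solve, and a geodesic update) can be sketched as follows. This is a toy illustration under explicit assumptions: the objective `cost` is a stand-in orthographic reprojection error for a known rigid shape, not the expected negative log-likelihood used in the paper, and all function names and data are hypothetical.

```python
import numpy as np

def hat(u):
    return np.array([[0.0, -u[2], u[1]],
                     [u[2], 0.0, -u[0]],
                     [-u[1], u[0], 0.0]])

def exp_so3(u):
    theta = np.linalg.norm(u)
    A = hat(u)
    if theta < 1e-12:
        return np.eye(3) + A
    return np.eye(3) + np.sin(theta) / theta * A + (1 - np.cos(theta)) / theta**2 * (A @ A)

PI = np.eye(3)[:2]                          # orthographic projection (2 x 3)

def cost(Q, S, P2d):
    return np.sum((P2d - PI @ Q @ S) ** 2)  # stand-in objective, not the paper's F

def d1(Q, u, S, P2d):
    """First derivative of the cost along the geodesic Q exp(t hat(u)) at t = 0."""
    E = P2d - PI @ Q @ S
    return -2.0 * np.sum(E * (PI @ Q @ hat(u) @ S))

def d2(Q, u, S, P2d):
    """Second derivative along the same geodesic at t = 0."""
    E = P2d - PI @ Q @ S
    A = hat(u)
    return 2.0 * np.sum((PI @ Q @ A @ S) ** 2) - 2.0 * np.sum(E * (PI @ Q @ A @ A @ S))

def newton_step(Q, S, P2d):
    e = np.eye(3)
    g = np.array([d1(Q, e[k], S, P2d) for k in range(3)])
    # Polarization: Hess(X, Y) = (d2(X + Y) - d2(X - Y)) / 4
    H = np.array([[0.25 * (d2(Q, e[k] + e[l], S, P2d) - d2(Q, e[k] - e[l], S, P2d))
                   for l in range(3)] for k in range(3)])
    u = np.linalg.solve(H, -g)              # coordinates of the update in the basis Q hat(e_k)
    return Q @ exp_so3(u)                   # move along the geodesic, stays on SO(3)

rng = np.random.default_rng(0)
S = rng.standard_normal((3, 20))            # known rigid 3D points (toy data)
Q_true = exp_so3(np.array([0.4, -0.3, 0.2]))
P2d = PI @ Q_true @ S                       # noise-free 2D observations
Q = Q_true @ exp_so3(np.array([0.1, -0.05, 0.08]))  # perturbed initial guess
for _ in range(8):
    Q = newton_step(Q, S, P2d)
```

Because the update is applied through the exponential map, the iterate remains a rotation matrix at every step, and no re-orthonormalization is needed between iterations.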

### NRSFM with PLDA

PLDA was presented by Prince and Elder in [10] as a probabilistic estimation for deterministic linear discriminant analysis (LDA) [28]. This model separately models the between-individual and within-individual variations among different subjects. Unlike PCA, which only takes into account the whole data distribution, LDA seeks the maximum separability of classes along the direction that has the highest ratio of the variance between the classes to the variance within the classes [29]. Thus, for our PLDA model, the single shape subspace **V** in Eq. (1) is replaced by the between-individual subspace **F** and the within-individual subspace **K** as follows

where *i* denotes the *i*th individual of the total *I* subjects and *j* denotes the *j*th image of the *J* images belonging to this person. Compared to the original definition in Eq. (1), the latent variables \(\mathbf{z}_{t}\) now consist of two parts. The first part, \(\mathbf{h}_{i}\), is the parameter for the between-individual subspace **F** and remains constant for all *J* images of individual *i*, while the second part, \(\mathbf{w}_{ij}\), describes how each image varies in the within-individual subspace **K**. Given this advanced shape model, the latent identity variables \(\mathbf{h}_{i}\) guarantee that a great part of the commonness in the same subject is preserved and taken into account at runtime.

In order to estimate the PLDA parameters, Prince and Elder [10] presented an EM algorithm similar to that of PPCA, with modifications in the E-step. The main point is to ensure that all *J* images share the same latent identity variable \(\mathbf{h}_{i}\) despite the image-specific latent variables \(\mathbf{w}_{ij}\). Therefore, the calculation for these *J* images is done in a single step and the corresponding equations in Eq. (1) are stacked up into a composite matrix system

or equivalently

Accordingly, the expectations of the new latent variables \(\mathbf{y}_{i}\) for the E-step change from Eqs. (6) and (7) to

As for the M-step, most of the existing PPCA updates remain unchanged if we replace the original shape matrix **V** with [**F** **K**] and each of the latent variables \(\mathbf{z}_{ij}\) with \(\left[\begin{array}{l}\mathbf{h}_{i} \\ \mathbf{w}_{ij}\end{array}\right]\). Accordingly, the objective log-likelihood function to be minimized in Eq. (8) can be modified to

We apply the Gauss-Newton step [9] as well as our manifold extension of Newton’s method to optimize the objective function.
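The stacking of the *J* images of one subject into a single system can be sketched as below. This is a simplified illustration that works directly in 3D shape space and leaves out the camera matrices and the noise term; the block layout follows the composite construction of Prince and Elder [10], and all sizes and variable names (`Kmat`, `dF`, `dK`) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
P, J = 4, 3                         # landmarks and images of one subject (toy sizes)
dF, dK = 2, 3                       # between- and within-individual dimensions

s_bar = rng.standard_normal(3 * P)
F = rng.standard_normal((3 * P, dF))        # between-individual basis
Kmat = rng.standard_normal((3 * P, dK))     # within-individual basis

h = rng.standard_normal(dF)                 # identity variable, shared by all J images
w = rng.standard_normal((J, dK))            # one image-specific variable per image

# Per-image shapes: mean shape + identity term + image-specific term
shapes = np.stack([s_bar + F @ h + Kmat @ w[j] for j in range(J)])

# Composite system: every block row repeats F and places K in its own column block.
A = np.hstack([np.tile(F, (J, 1)), np.kron(np.eye(J), Kmat)])
y = np.concatenate([h, w.reshape(-1)])      # stacked latent vector [h; w_1; ...; w_J]
stacked = np.tile(s_bar, J) + A @ y
```

Treating the stacked latent vector as one variable is what lets a PPCA-style E-step infer a single identity variable from all images of the subject at once.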

## Experiments

In this section, extensive experiments are conducted to validate the proposed approaches. Rotation recovery using the Newton’s method on the manifold is first assessed on different datasets. Subsequently, performance of PLDA on generated data with multiple subjects is presented.

### Setup

For our experiments, the evaluation criterion is the same as in [9], i.e., the sum of squared differences between the estimated 3D shapes and the ground truth depth: \(\Vert \hat{\mathbf{s}}_{1:T} - \mathbf{s}_{1:T}\Vert_{F}^{2}\), with the camera rotation **R** also being applied to the 3D shape. As the ground truth for camera rotation is not given, we are not able to measure the absolute performance gain of our algorithm explicitly. However, the decreased reconstruction error implicitly reflects the effectiveness of rotation estimation in our algorithm.
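The error measure can be computed as in the following sketch. The squared Frobenius difference follows the formula in the text; the relative variant, which divides by the squared norm of the ground truth to obtain a percentage-like figure, is an assumption about how such errors are commonly normalized and is not stated in the paper. Both function names are hypothetical.

```python
import numpy as np

def reconstruction_error(S_est, S_true):
    """Sum of squared differences between two stacks of shapes (T x 3P arrays)."""
    S_est = np.asarray(S_est, dtype=float)
    S_true = np.asarray(S_true, dtype=float)
    return float(np.sum((S_est - S_true) ** 2))

def relative_error(S_est, S_true):
    """Same error divided by the squared norm of the ground truth (assumed normalization)."""
    S_true = np.asarray(S_true, dtype=float)
    return reconstruction_error(S_est, S_true) / float(np.sum(S_true ** 2))

err = reconstruction_error([[1, 2], [3, 4]], [[1, 2], [3, 6]])
rel = relative_error([[1, 2], [3, 4]], [[1, 2], [3, 6]])
```

In practice the estimated shapes would first be rotated into the camera frame, as the text notes, so that both stacks are expressed in the same coordinates before the difference is taken.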

Moreover, additive zero-mean Gaussian noise is imposed to analyze the robustness of reconstruction. The noise level is plotted as the ratio of the noise variance to the norm of the 2D measurements: \(JT\sigma^{2}/\Vert\mathbf{p}_{1:T}\Vert_{F}\). The noise levels range from 0 to 30 % in steps of 2 %, and the trials for each noise level are averaged over 10 runs. Our test is carried out on two face datasets, i.e., the Vicon motion capture data Mocap-Face [9] and the Binghamton University 3D Facial Expression (BU-3DFE) dataset [30].

The Mocap-Face dataset [9] contains a single video, which captures a single male subject with 40 markers attached to his face. The video contains 316 frames in total. Sample frames from this dataset can be seen in Fig. 2. Throughout the video sequence, the subject made limited changes of facial expression and head pose. Note that the tracking is very accurate using the markers.

The BU-3DFE dataset [30] was originally created for 3D facial expression analysis. The complete dataset consists of 100 subjects, covering different ethnic groups. Seven facial expressions are performed at four intensity levels by each subject. We randomly select 300 frames from the 100 subjects for our test, in contrast to the Mocap-Face dataset [9], where there exists only one subject. This is a more practical setup, since in many computer vision datasets, e.g., for face alignment [31, 32], only static images of multiple subjects are provided, with only a few samples of the same subject available. Separate application of NRSFM to each single subject is then impossible. Random poses are generated by projecting the 3D landmarks to 2D. Note that temporal smoothness of the shapes does not hold in this dataset. The dataset provides manual annotation of 83 marker points as shown in Fig. 3. Due to labeling noise and inconsistency, this dataset contains noise in the original measurements.

### Evaluation of manifold PPCA

We first give quantitative results for the recovery of the 3D camera rotation between our algorithm (manifold PPCA) and the baseline PPCA [9], as well as the other state-of-the-art approaches, point trajectory approach (PTA) [18] and column space fitting (CSF2) [20].

#### Reconstruction results on Mocap-Face

In the first experiment without noise on Mocap-Face [9], our approach achieves slightly better performance than PPCA, while both deliver an effective reconstruction with under 3 % error, as plotted in Fig. 4a. Qualitative results are also shown in Fig. 5a and yield a similar outcome. In comparison, the performance of methods in trajectory space is more sensitive to the number of DCT bases. Starting from *K*=8, both PTA and CSF2 degrade abruptly.

In real life, there are no markers and automatic point detectors are usually not stable. To assess the performance of the system in a real-world case, it is necessary to test it on noisy data. We evaluate our system with additive Gaussian noise at different noise levels. As can be observed from Fig. 4c, all algorithms have almost the same error rate up to 6 % noise. With more noise added, the error of PPCA rises significantly more steeply than that of our approach. From a noise level of 20 % onwards, our error rate is 50 % lower than that of PPCA. That is most likely because with more noise, it is more difficult for the rotation approximation in PPCA to find the right updating direction. Despite achieving the lowest error in the noise-free experiment above, the state-of-the-art CSF2 surprisingly fails to hold up well against noise and approaches PPCA as the second worst. PTA, which is also trajectory-based, is more stable, thanks to smoother DCT bases. Shaji and Chandran [15] also evaluated on the Mocap-Face dataset [9] with additive noise. According to their plot, the performance degrades very quickly at noise levels over 20 %, whereas our probabilistic approach does not suffer from this problem. Additionally, the variance of the results at each noise level is shown in the figure, from which we observe that the manifold PPCA also reduces error variances. That means our approach performs much more stably under noisy circumstances.

#### Reconstruction results on BU-3DFE

Since the BU-3DFE dataset [30] is a more difficult setup, the performance is lower than in the test on the Mocap-Face dataset [9]. The purpose of this test is to see how well a generic face model can be generated using different NRSFM approaches. As can be seen in Fig. 4b, the recovered models cannot fit all instances as well as on the Mocap-Face dataset [9]. But again, the error level of our approach is overall ca. 8 % lower than that of PPCA regardless of the choice of *K*, which corresponds to a relative performance gain of 30 to 40 %. As can be observed qualitatively in Fig. 5b, PPCA’s rotation approximation prevents it from obtaining a better rotation estimate in frame 50. It also has difficulties recovering the contour of the faces correctly in frame 250, whereas our system clearly does better. It is also interesting to test the approaches in trajectory bases, where the smooth shape deformation assumption is no longer valid on the BU-3DFE dataset. As expected, CSF2 performs worse than the manifold PPCA, which does not take into account the temporal prior. The selection of *K* also has no influence on the error rate, unlike in Fig. 4a. If we add Gaussian noise to the data (see Fig. 4d), manifold PPCA and PTA degrade slightly more slowly than PPCA and CSF2, similar to the behavior on the Mocap-Face dataset [9]. These results reveal that, when modeling more complicated shapes, an optimal rotation estimation using manifold optimization techniques is superior. Another advantage of our manifold-based approach is that it is more robust in noisy environments in general.

### Evaluation of PLDA

In the second part of our experiments, we generate datasets with multiple subjects from the original BU-3DFE dataset [30] to evaluate the PLDA variant of the NRSFM algorithm in comparison with PPCA. We consider two setups with different number of subjects involved. For the first setup, we select six subjects and each subject has 50 images. For the second setup, there are 12 subjects with 25 images, respectively. Thus, the total number of frames is still 300. Similar to the PPCA experiments with additive noise, we randomly generate five input datasets for each setup to obtain statistically significant results.

For all tests in this section, we directly compare the results of PPCA and PLDA as well as the influence of imposing Newton’s method on the manifold for recovery of the rotation matrix to both probabilistic frameworks. Since in Section 4.2.2, trajectory-based methods are proven not to generalize well when temporal smoothness does not hold, the results of PTA and CSF2 are not included. The curves of both PPCA approaches are plotted in dashed lines while the PLDA results are plotted in solid lines.

In the test case without additive noise, we fix the number of between-subject shape bases *F*=3 and vary the number of within-subject bases *G* in PLDA from 1 to 7, while the number of shape bases *K* in PPCA equals the sum of *F* and *G*. We first notice that with the help of PLDA, the performance of both rotation recovery techniques improves further, regardless of whether the input data contain 6 subjects (Fig. 6a) or 12 subjects (Fig. 6b). Especially for the baseline PPCA algorithm with Gauss-Newton rotation approximation, there is nearly 10 % less error in the reconstruction in both setups. Moreover, the error variance also drops substantially to an acceptable level. For our Newton’s rotation recovery method on the manifold, only a small improvement in reconstruction error is observed; however, the performance without PLDA already delivers a satisfactory result. As a result, all of our proposed methods, i.e., manifold PPCA, PLDA, and manifold PLDA, achieve a significant performance enhancement. The similar outcome of the manifold PPCA and PLDA, which employ different extensions for the optimization problem, indicates that the achieved result may be close to the performance limit of the dataset using the probabilistic framework.

When zero-mean Gaussian noise is added (Fig. 6c, d), although the gap between the manifold PLDA and PLDA remains small, the lower average error rate and the stability with notably less error deviation at some noise levels again demonstrate the effectiveness and necessity of our better rotation recovery. We also observe that both PLDA-based methods degrade faster than the PPCA-based methods once the amount of imposed Gaussian noise exceeds ca. 15 to 20 %. The reason is probably that PLDA-based approaches need to estimate more parameters in the E-step than PPCA in each iteration, which makes the additional noise and uncertainty a decisive disadvantage for the approach. But overall, introducing PLDA does help to further decrease the reconstruction error with or without additive noise.

Qualitative experiments are also conducted to review the effect of applying PLDA to datasets with more than one subject, as can be seen in Fig. 7. From Eq. (30), we know that the between-individual linear shape model of PLDA consists of the global mean shape \(\bar {\mathbf {s}}\) plus the subject-specific shape term \(\mathbf {F}\mathbf {h}_{i}\). The frame-specific shape term \(\mathbf {K}\mathbf {w}_{ij}\) serves solely as within-individual variance and is therefore omitted in this experiment. Thus, Fig. 7 shows the reconstruction of every single subject in each dataset as given by the first two terms of Eq. (30). The 3D ground truth is obtained by averaging all 50 shape vectors of the corresponding subject. As expected, the reconstruction result is fairly satisfactory, thanks to the characteristics of PLDA. Even for some unique faces, as in Fig. 7e, g, the contours and facial features are still well modeled, which again demonstrates the capability and tolerance of our approach.
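The relation between the identity-only reconstruction and the averaged ground truth can be illustrated with a minimal NumPy sketch. It assumes the three-term linear form described above (mean shape, subject term, frame term); the dimensions, the random values, and the function names `identity_shape` and `full_shape` are hypothetical and not taken from the paper. The frame weights are explicitly centered here so that the average over frames matches the identity-only shape exactly; with real data this would only hold approximately.

```python
import numpy as np

def identity_shape(s_bar, F, h_i):
    """Between-individual part of the shape model: mean shape plus subject term."""
    return s_bar + F @ h_i

def full_shape(s_bar, F, h_i, K, w_ij):
    """Full shape for one frame: identity part plus frame-specific term."""
    return identity_shape(s_bar, F, h_i) + K @ w_ij

rng = np.random.default_rng(0)
P = 5                        # number of 3D landmarks (hypothetical)
n_between, n_within = 3, 2   # numbers of basis vectors (hypothetical)
n_frames = 50                # shape vectors per subject, as in the experiment

s_bar = rng.normal(size=3 * P)             # global mean shape
F = rng.normal(size=(3 * P, n_between))    # between-individual bases
K = rng.normal(size=(3 * P, n_within))     # within-individual bases
h_i = rng.normal(size=n_between)           # identity weights of subject i

# Frame weights, centered so that they average to zero (assumption).
w = rng.normal(size=(n_frames, n_within))
w -= w.mean(axis=0)

frames = np.stack([full_shape(s_bar, F, h_i, K, w_ij) for w_ij in w])
subject_mean = frames.mean(axis=0)         # "ground truth" by averaging frames
```

Under these assumptions, `subject_mean` coincides with `identity_shape(s_bar, F, h_i)`, which is why comparing the first two terms of the model against a per-subject average is a sensible check.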

In Fig. 8, we illustrate the reconstructed bases **F** and **K**. We analyze the effect of varying the between-individual (**F**) and within-individual (**K**) shape bases learned by PLDA between ±3 standard deviations from the mean value. With the first three bases of **F**, different eye, face contour, and mouth types are modeled in Fig. 8b, c, and d, respectively. It is interesting to see that the variations in these figures relate more to the identity of the subjects than to their expressions. With the bases of **K**, in contrast, the different facial expressions present in BU-3DFE [30] are clearly recognizable. For example, Fig. 8e shows the mouth opening and closing, Fig. 8f illustrates the transition from anger to fear, and Fig. 8g the transition from surprise to happiness. These results meet our expectations and conform to the characteristics of PLDA, which provides more meaningful reconstructed shape bases than those given by PPCA.
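A basis sweep of this kind can be sketched as follows. This is an illustrative sketch, not the authors' visualization code: the helper `vary_basis`, the parameter `sigma_k` (standard deviation of the latent coefficient, set to 1 here on the assumption of a unit-variance Gaussian prior), and all dimensions are hypothetical.

```python
import numpy as np

def vary_basis(s_bar, B, k, sigma_k=1.0, n_steps=7, max_sd=3.0):
    """Shapes obtained by moving along basis column k of B.

    Returns the coefficients (in standard deviations) and one shape per
    coefficient, each equal to s_bar + alpha * sigma_k * B[:, k].
    """
    alphas = np.linspace(-max_sd, max_sd, n_steps)
    shapes = s_bar[None, :] + (alphas[:, None] * sigma_k) * B[:, k][None, :]
    return alphas, shapes

rng = np.random.default_rng(1)
P = 4                                  # number of 3D landmarks (hypothetical)
s_bar = rng.normal(size=3 * P)         # mean shape
B = rng.normal(size=(3 * P, 3))        # stands in for F or K

alphas, shapes = vary_basis(s_bar, B, k=0)
```

Each row of `shapes` is one face along the sweep; the middle row is the mean shape, and the first and last rows are the extremes at -3 and +3 standard deviations.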

## Conclusions

In this work, we have presented a novel way to resolve the orthonormality constraints on the camera rotation matrix in the NRSFM problem. Without requiring complex approximations, performing the rotation update on the *SO*(3) manifold implicitly ensures that the constraints hold. In the experiments on the Mocap-Face dataset [9] with additional noise, which contains only one subject, our approach performs significantly better, reducing the reconstruction error by up to 50 %. Furthermore, the proposed PLDA approach successfully extends the existing probabilistic framework to model between-subject and within-subject shape variations separately during the alternating optimization for datasets with multiple identities. On the BU-3DFE dataset [30] with multiple subjects and manually annotated landmarks, we clearly outperform the baseline approaches in all tests. To conclude, we have shown that the proposed approaches are robust against noise, which indicates that they are better suited to real-world data. In addition to their robustness, our approaches generalize better on datasets with multiple subjects.
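Why an update on *SO*(3) keeps the constraints satisfied can be seen in a short sketch. It assumes the common parameterization of a step by a 3-vector `w` that is mapped to a rotation through the matrix exponential (Rodrigues' formula) and applied by right multiplication; the step itself would come from the Newton iteration, which is not reproduced here. The function names and the multiplication side are assumptions made for illustration.

```python
import numpy as np

def hat(w):
    """Map a 3-vector to the corresponding 3x3 skew-symmetric matrix."""
    wx, wy, wz = w
    return np.array([[0.0, -wz, wy],
                     [wz, 0.0, -wx],
                     [-wy, wx, 0.0]])

def exp_so3(w):
    """Matrix exponential of hat(w) via Rodrigues' formula."""
    theta = np.linalg.norm(w)
    W = hat(w)
    if theta < 1e-12:
        # First-order expansion for vanishing rotation angles.
        return np.eye(3) + W
    return (np.eye(3)
            + (np.sin(theta) / theta) * W
            + ((1.0 - np.cos(theta)) / theta**2) * (W @ W))

def update_rotation(R, w):
    """Move from R along the tangent direction w while staying on SO(3)."""
    return R @ exp_so3(w)

R0 = exp_so3([0.3, -0.2, 0.5])     # some valid starting rotation
w = [0.05, 0.10, -0.02]            # hypothetical update step

R_manifold = update_rotation(R0, w)
R_additive = R0 + R0 @ hat(w)      # linearized update, for comparison only
```

The product of two rotations is again a rotation, so `R_manifold` is orthonormal with determinant 1 without any re-projection, whereas the linearized `R_additive` leaves the manifold and would need to be re-orthonormalized.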

## References

1. C Tomasi, T Kanade, Shape and motion from image streams under orthography: a factorization method. Int. J. Comput. Vis. **9**(2), 137–154 (1992).
2. V Blanz, T Vetter, in *Proceedings of the Annual International Conference on Computer Graphics and Interactive Techniques*. A morphable model for the synthesis of 3D faces, (1999), pp. 187–194.
3. S Ullman, Maximizing rigidity: the incremental recovery of 3-D structure from rigid and nonrigid motion. Perception. **13**, 255–274 (1983).
4. M Brand, in *Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition*, 2. Morphable 3D models from video, (2001), pp. 456–463.
5. M Brand, in *Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition*, 2. A direct method for 3D factorization of nonrigid motion observed in 2D, (2005), pp. 122–128.
6. L Torresani, DB Yang, EJ Alexander, C Bregler, in *Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition*, 1. Tracking and modeling non-rigid objects with rank constraints, (2001), pp. 493–500.
7. L Torresani, A Hertzmann, in *Proceedings of the European Conference on Computer Vision*. Automatic non-rigid 3D modeling from video, (2004), pp. 299–312.
8. J Xiao, J Chai, T Kanade, A closed-form solution to non-rigid shape and motion recovery. Int. J. Comput. Vis. **67**(2), 233–246 (2006).
9. L Torresani, A Hertzmann, C Bregler, Nonrigid structure-from-motion: estimating shape and motion with hierarchical priors. IEEE Trans. Pattern Anal. Mach. Intell. **30**(5), 878–892 (2008).
10. SJD Prince, JH Elder, in *Proceedings of the IEEE International Conference on Computer Vision*. Probabilistic linear discriminant analysis for inferences about identity, (2007), pp. 1–8.
11. C Bregler, A Hertzmann, H Biermann, in *Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition*, 2. Recovering non-rigid 3D shape from image streams, (2000), pp. 690–696.
12. J Xiao, T Kanade, in *Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition*, 1. Non-rigid shape and motion recovery: degenerate deformations, (2004), pp. 668–675.
13. I Akhter, Y Sheikh, S Khan, in *Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition*. In defense of orthonormality constraints for nonrigid structure from motion, (2009), pp. 1534–1541.
14. Y Dai, H Li, M He, A simple prior-free method for non-rigid structure-from-motion factorization. Int. J. Comput. Vis. **107**(2), 101–122 (2014).
15. A Shaji, S Chandran, in *Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops*. Riemannian manifold optimisation for non-rigid structure from motion, (2008), pp. 1–6.
16. V Rabaud, S Belongie, in *Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition*. Linear embeddings in non-rigid structure from motion, (2009), pp. 2427–2434.
17. L Tao, BJ Matuszewski, in *Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition*. Non-rigid structure from motion with diffusion maps prior, (2013), pp. 1530–1537.
18. I Akhter, Y Sheikh, S Khan, T Kanade, in *Advances in Neural Information Processing Systems*. Nonrigid structure from motion in trajectory space, (2008), pp. 41–48.
19. PFU Gotardo, AM Martinez, Computing smooth time trajectories for camera and deformable shape in structure from motion with occlusion. IEEE Trans. Pattern Anal. Mach. Intell. **33**(10), 2051–2065 (2011).
20. PFU Gotardo, AM Martinez, in *Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition*. Non-rigid structure from motion with complementary rank-3 spaces, (2011), pp. 3065–3072.
21. J Valmadre, S Lucey, in *Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition*. General trajectory prior for non-rigid reconstruction, (2012), pp. 1394–1401.
22. HS Park, T Shiratori, I Matthews, Y Sheikh, 3D trajectory reconstruction under perspective projection. Int. J. Comput. Vis. **115**(2), 115–135 (2015).
23. ME Tipping, CM Bishop, Probabilistic principal component analysis. J. R. Stat. Soc. **61**, 611–622 (1999).
24. DJ Bartholomew, *Latent Variable Models and Factor Analysis* (Charles Griffin & Co. Ltd., London, 1987).
25. A Edelman, TA Arias, ST Smith, The geometry of algorithms with orthogonality constraints. SIAM J. Matrix Anal. Appl. **20**(2), 303–353 (1999).
26. Y Ma, J Košecká, S Sastry, Optimization criteria and geometric algorithms for motion and structure estimation. Int. J. Comput. Vis. **44**(3), 219–249 (1999).
27. RM Murray, Z Li, SS Sastry, *A Mathematical Introduction to Robotic Manipulation* (CRC Press, Boca Raton, FL, 1994).
28. PN Belhumeur, J Hespanha, DJ Kriegman, Eigenfaces vs. Fisherfaces: recognition using class specific linear projection. IEEE Trans. Pattern Anal. Mach. Intell. **19**(7), 711–720 (1997).
29. AM Martínez, AC Kak, PCA versus LDA. IEEE Trans. Pattern Anal. Mach. Intell. **23**(2), 228–233 (2001).
30. L Yin, X Wei, Y Sun, J Wang, MJ Rosato, in *Proceedings of the International Conference on Automatic Face and Gesture Recognition*. A 3D facial expression database for facial behavior research, (2006), pp. 211–216.
31. R Gross, I Matthews, J Cohn, T Kanade, S Baker, in *Proceedings of the International Conference on Automatic Face and Gesture Recognition*. Multi-PIE, (2008), pp. 1–8.
32. K Messer, J Matas, J Kittler, J Luettin, G Maître, in *Proceedings of the International Conference on Audio and Video-based Biometric Personal Verification*. XM2VTSDB: The extended M2VTS database, (1999), pp. 72–77.

## Acknowledgements

This work was done when H. Gao was at the Computer Vision for Human-Computer Interaction Lab (CV:HCI), Karlsruhe Institute of Technology (KIT). H. K. Ekenel was partially supported by TUBITAK, project no. 113E121, and a Marie Curie FP7 Integration Grant within the 7th EU Framework Programme. We acknowledge support by Deutsche Forschungsgemeinschaft (DFG) and the Open Access Publishing Fund of KIT.

## Additional information

### Competing interests

The authors declare that they have no competing interests.

### Authors’ contributions

In this work, two major contributions have been made. First, we investigated the geometric properties of the orthonormality constraints and generalized Newton's optimization method to the underlying manifold of the camera rotation matrices. This means that non-linear optimization can be carried out on the manifold without any imprecise approximations. We used a PPCA-based framework [9] to model NRSFM, as it is more robust to noise than the closed-form factorization techniques. Our second contribution deals with multiple subjects. Current NRSFM algorithms mostly focus on the reconstruction of a single subject. When they are applied to data containing multiple subjects, no distinction is made between between-individual variation (e.g., the face models of different identities) and within-individual variation (e.g., the facial expressions of the same identity). For that reason, we extended the PPCA-based framework to the PLDA [10] model to improve reconstruction performance on data with multiple subjects.

## Rights and permissions

**Open Access** This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.

## About this article

### Cite this article

Qu, C., Gao, H. & Ekenel, H.K. Rotation update on manifold in probabilistic NRSFM for robust 3D face modeling. *J Image Video Proc.* **2015**, 45 (2015). https://doi.org/10.1186/s13640-015-0101-6

DOI: https://doi.org/10.1186/s13640-015-0101-6

### Keywords

- Non-rigid structure from motion
- Manifold optimization
- Newton’s method
- PLDA
- Face model