Evaluating effects of focal length and viewing angle in a comparison of recent face landmark and alignment methods

Recent attention to facial alignment and landmark detection methods, particularly with application of deep convolutional neural networks, has yielded notable improvements. Neither these neural-network methods nor more traditional ones, though, have been tested directly for performance differences due to camera-lens focal length or to camera viewing angle of subjects sampled systematically across the viewing hemisphere. This work uses photo-realistic, synthesized facial images with varying parameters and corresponding ground-truth landmarks to enable comparison of alignment and landmark detection techniques relative to general performance, performance across focal length, and performance across viewing angle. Recently published high-performing methods along with traditional techniques are compared with regard to these aspects.


Introduction
Face detection, tracking, and recognition continue to be employed in a variety of ever more commonplace biometric applications, particularly with recent integrations in mobile-device security and communication. Most of these applications, such as identity verification, pose tracking, expression analysis, and age or gender estimation, make use of landmark points around facial components. Correctly locating these key points is crucial as they are often used to abstract main features such as the jaw, eyebrows, eyes, nose shape, nostrils, and mouth [1]. Due to the complexity of head gestures, automatic localization of canonical landmarks usually first involves face alignment to account for rotation, translation, and scale due to pose or view-direction differences [2][3][4][5]. Furthermore, 2D images photographically captured by cameras are affected by perspective and lens distortion, an important aspect considered in this work.
This review aims to compare performance of five notable facial landmark and alignment methods under the effects of different camera focal lengths and positions, particularly under conditions that have been ignored or difficult to test. Previously, Çeliktutan et al. completed a thorough survey of facial landmark detection algorithms and comparative performance in 2013, which at the time primarily focused on 2D techniques such as Active Shape Model (ASM) and Active Appearance Model (AAM) variations [6]. In 2018, Johnston and Chazal published work that built on the earlier survey, noting the shift of interest to deep-learning methods due to potential performance increases as well as techniques that also perform 3D alignment [7]. Several strong-performing neural-network methods have been published since; in general, however, no performance comparisons have included lens-perspective effects or systematic evaluation across the range of viewing angles. This study is not an exhaustive survey of recent methods but rather an investigation into the effects of focal length and viewing angle on both traditional and more recent neural methods (published after the 2018 article). Focal-length-based perspective and viewing angle are both important considerations when designing a biometric or other system, in order to account for the lens chosen, the viewing angle, and the proximity necessary for the system. The effects a lens imparts on acquisition have often been ignored in face-related research. Estimating a camera projection matrix is a fundamental technique in computer vision and has been addressed in many studies; however, the datasets used to train and test landmark detection do not usually include camera meta-data (particularly large datasets gleaned from the Internet for deep-learning approaches), or the datasets have been captured in very controlled situations with a single lens.
The most widely used databases in training recent deep networks are 300W [8], COFW [3], WFLW [9], and AFLW [10]. These cover large variation over age, ethnicity, skin color, expression, and pose and have been used by top-performing deep neural networks [11][12][13][14][15]. None of them explicitly note focal length as a parameter. In short, there is no dataset published online that has considered focal length or field of view versus proximity for training alignment or landmark detection methods. We assume perspective distortions caused by focal length will likely affect the final annotation results. If so, training sets including camera and lens parameters could increase accuracy of a system or at least aid in designing systems.
A few researchers have considered aspects of image distortions relative to face images for particular applications, but not what we present here. Damer et al. investigated state-of-the-art deep neural networks for facial landmark detection, but their main focus was perspective distortion due to distances between cameras and captured faces, and they did not consider the effects due to lenses and associated field of view [16]. Valente et al. investigated basic lens effects; however, they only analyzed these relative to simple mathematical algorithms for facial recognition (EIGENDETECT and SRC) and not relative to facial alignment or facial landmark detection [17]. Flores et al. also focused on perspective distortion caused by distance [18]. They estimated camera pose from facial images using Efficient Perspective n-Point (EPnP) rather than evaluating landmark location. In this work, we consider the effects of lens focal length and viewing angle with regard to some of the highest performing recent facial-landmarking techniques. Although the method of evaluation uses synthetic images, the question of performance relative to lens and viewing angle is itself a relative one, and the goal is to demonstrate that all methods are affected to varying degrees. Studying such effects without large datasets that include camera and lens meta-data would not currently be possible without either collecting such a dataset or creating test images synthetically as we have done. Future work could include design of a dataset, although it could be prohibitive to collect data on the size order of Internet-driven datasets used for deep-learning training. Future work could also consider improving synthetically rendered images for higher fidelity, style transfer, or in-environment placement. The contributions of this work include evaluation of five different facial landmark detection methods with regard to varying lens choice and viewing angle.
Three of them are recently published deep-learning 3D facial annotation methods, and the remaining two are AAM implementations. We evaluate the performance of these methods across view angles and focal lengths by using face images synthesized from detailed 3D scans of individuals. We demonstrate that all are subject to particular performance degradation with lens-perspective distortion and viewing angle. This information may be used to guide design choices in biometric or other imaging systems as well as to develop methods that are more robust to lens choice and angle.

Landmark schemes
There have been a variety of landmark schemes used in related projects, but a few have been most used in recent work and make a logical choice for comparative evaluation. Following the categories in [6], there are two major groups of facial landmark schemes: primary landmarks and secondary landmarks. Primary landmarks usually define the eye corners, the mouth corners, and the nose tip. Those landmarks are located at "T" sections between boundaries or at high curvatures on a face, which may be detected by image processing algorithms, e.g., multi-resolution shape models [19], the Harris Corner Detection model [20], or the Image Gradient Orientation (IGO) model [21]. Secondary landmarks outline the contour of main features that are guided by primary landmarks, such as the jaw line, eyebrows, and nostrils. Wu et al. [1] provide a thorough survey on facial landmark databases and their corresponding landmark schemes. A common 68-point landmark scheme is supported by many face databases, e.g., AFLW [10], BU-4DFE [22], Helen [8,23], etc. For easiest consistency, the 68-point scheme from Multi-PIE [24], further popularized by iBUG's 300W [8], was chosen for this study.
Johnston et al. [7] and Sagonas et al. [8] state that primary landmarks are more easily detected than secondary landmarks while annotating the ground-truth reference. The "m7 landmarks," comprising the 4 eye corners, 1 nose tip, and 2 mouth corners, are also included here in some comparisons with the idea that they provide higher importance information. Figure 1 shows the two landmark schemes used in this paper.
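As an illustration of how the reduced scheme relates to the full one, the following minimal sketch extracts an m7 subset from a 68-point array. The index values are an assumption based on the commonly used 0-based ordering of the Multi-PIE/300W annotation (eyes at 36 to 47, nose tip at 30, outer mouth at 48 to 59); the names `M7_INDICES` and `select_m7` are hypothetical and do not come from this study.

```python
import numpy as np

# Assumed 0-based indices in the common 68-point ordering:
# four eye corners, the nose tip, and the two mouth corners.
M7_INDICES = [36, 39, 42, 45, 30, 48, 54]


def select_m7(landmarks_68):
    """Return the 7 primary landmarks from a (68, 2) or (68, 3) array."""
    pts = np.asarray(landmarks_68, dtype=float)
    if pts.shape[0] != 68:
        raise ValueError("expected 68 landmarks")
    return pts[M7_INDICES]
```

Any method that outputs the 68-point scheme can then be scored on both schemes from a single prediction.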
In order to generate face images at controlled focal lengths and precise angle selections, we synthesized photo-realistic images using detailed 3D meshes captured from a structured-light 3dMD system. Our facial capture participants were asked to make different expressions following the Facial Action Coding System (FACS). FACS was created by the anatomist Carl-Herman Hjortsjö [25] and further developed by Ekman et al. [26]. It provides a coding system that describes how to categorize facial expressions into Action Units (AUs) based on muscle movements. We manually annotated the ground-truth landmarks in 3D for 84 faces from our participants: 64 from a set of FACS-capture expressions of two individuals and 20 of unique individuals with a range of ethnicity, age, and gender where the pose was neutral or a slight smile. Figure 2 shows an example of FACS and neutral faces in our dataset. Landmark variation often occurs between datasets, particularly for areas such as the jawline or eyebrows. For consistency, we keep jawline points evenly distributed along the chin. In some projects, eyebrow points are placed at the center, bottom, or top of brow arcs. Good choices for landmark points include those near high curvature or boundaries on objects. Here, eyebrows are marked anatomically at the supraorbital ridge or eyebrow ridge.

Evaluation metrics
We use ground-truth-based localization error to evaluate performance in each case via root mean squared error (RMSE). Accurate landmarks are generated for each synthetic image by projecting the manual 3D landmarks to match the rendered angle and field of view. We use the method proposed by Johnston et al. [7] for calculating the RMSE:

$$\mathrm{RMSE} = \frac{1}{K}\sum_{k=1}^{K}\sqrt{(x_k-\tilde{x}_k)^2+(y_k-\tilde{y}_k)^2} \tag{1}$$

where (x_k, y_k) denotes the predicted location of landmark k among the K landmarks in an image, and (x̃_k, ỹ_k) indicates the corresponding ground-truth landmark. Normalizing for face size in pixels is useful due to the variance across images. Previously, RMSE has been normalized by the distance between the ground-truth outer corners of the left eye and right eye (Eq. 3) [8]. The error per landmark k in image i is given as:

$$e_i^k = \frac{\sqrt{(x_k-\tilde{x}_k)^2+(y_k-\tilde{y}_k)^2}}{d_i} \tag{2}$$

$$d_i = \sqrt{(\tilde{x}_{le}-\tilde{x}_{re})^2+(\tilde{y}_{le}-\tilde{y}_{re})^2} \tag{3}$$

where (x̃_le, ỹ_le) and (x̃_re, ỹ_re) are the ground-truth outer corners of the left eye and right eye in image i. In our case, however, the synthetic images vary with camera position, and the distance between the outer eye corners may be affected at side angles due to perspective projection. Hence, we calculate the Normalized Root Mean Squared Error (NRMSE) by normalizing by the width of the head bounding box, which takes the place of d_i in Eq. 2. We calculate the percentage of accepted points among all points to show the performance for each algorithm:

$$\delta_i^k = \begin{cases} 1, & e_i^k < Th \\ 0, & \text{otherwise} \end{cases} \tag{4}$$

where δ_i^k is a mask function: if the normalized distance is less than the threshold Th, the landmark is acceptable and δ_i^k is set to 1; otherwise, the result is not acceptable and δ_i^k is set to 0. The overall performance over K landmarks in each image for a set of I images is then:

$$P = \frac{100\%}{I\,K}\sum_{i=1}^{I}\sum_{k=1}^{K}\delta_i^k \tag{5}$$
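The normalized error and acceptance rate described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the evaluation code used in the study: the function names are hypothetical, and the head bounding-box width is approximated here by the horizontal extent of the ground-truth landmarks, whereas the study may have measured the rendered head itself.

```python
import numpy as np


def bbox_width(truth):
    """Horizontal extent of the ground-truth landmarks (assumed stand-in
    for the width of the head bounding box)."""
    truth = np.asarray(truth, dtype=float)
    return truth[:, 0].max() - truth[:, 0].min()


def normalized_errors(pred, truth, norm_width):
    """Per-landmark Euclidean error divided by a normalizing length."""
    pred = np.asarray(pred, dtype=float)
    truth = np.asarray(truth, dtype=float)
    dist = np.linalg.norm(pred - truth, axis=-1)
    return dist / float(norm_width)


def acceptance_rate(errors_per_image, threshold):
    """Percentage of landmarks, pooled over all images, whose normalized
    error is below the threshold."""
    pooled = np.concatenate([np.ravel(e) for e in errors_per_image])
    return 100.0 * np.mean(pooled < threshold)
```

Sweeping `threshold` over several levels yields an acceptance curve per method, which is the kind of summary plotted against threshold in the results.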

Camera position and focal length
Our coordinate system follows the typical computer-graphics right-handed convention, where the X-axis points horizontally to the right, the Y-axis points vertically up, and the Z-axis, perpendicular to both X and Y, points outward from the screen. In order to track the camera around each face, we use spherical coordinates to represent camera positions. Our interest is in analyzing performance over a wide range of specific viewing angles. We define camera positions in spherical coordinates at (r, φ, θ), where φ is the polar angle (also known as the zenith angle) from the positive Y-axis with 45° ≤ φ ≤ 135°, in steps of 15°. We define θ to be the azimuthal angle in the XZ-plane from the positive X-axis with 0° ≤ θ ≤ 180°, at intervals of 30°. Lastly, r varies with the simulated focal length. Overall, we have 49 camera positions so that various front views of the face and some extreme camera positions could be tested. Figure 3 shows the spherical coordinates and samples of face images with different viewing angles.
Focal length, relative to the dimensions of the film or digital sensor, determines the field of view on a physical camera. Physical lenses also exhibit radial distortion typical of certain optical designs, such as pincushion and barrel distortion (these are not specifically included here but could warrant a follow-up study). In photography, focal length is commonly expressed relative to the 35-mm film frame size used for much of the twentieth century and carried forward into digital sensors. This "35-mm" frame size of 36 mm across by 24 mm down came to be a standard for still photography when Oskar Barnack doubled the individual frame from motion-picture film (standardized by Thomas Edison) to use in still cameras.
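The camera sampling described earlier in this section can be sketched as follows. The sketch assumes that the azimuth is measured about the Y-axis from +X toward +Z, so that φ = 90° and θ = 90° places the camera on the +Z axis directly in front of the face; the function names are hypothetical.

```python
import numpy as np


def camera_position(r, phi_deg, theta_deg):
    """Convert (r, phi, theta) to Cartesian coordinates with Y up.

    phi is the polar angle from +Y; theta is the azimuth from +X,
    measured in the plane perpendicular to Y (assumed convention).
    """
    phi = np.radians(phi_deg)
    theta = np.radians(theta_deg)
    x = r * np.sin(phi) * np.cos(theta)
    y = r * np.cos(phi)
    z = r * np.sin(phi) * np.sin(theta)
    return np.array([x, y, z])


def camera_grid(r=1.0):
    """All 7 x 7 = 49 sampled positions, keyed by (phi, theta) in degrees."""
    phis = range(45, 136, 15)    # 45, 60, ..., 135
    thetas = range(0, 181, 30)   # 0, 30, ..., 180
    return {(p, t): camera_position(r, p, t) for p in phis for t in thetas}
```

Seven polar steps times seven azimuthal steps give the 49 positions; the radius would be set separately for each simulated focal length.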
The relation between angle of view and focal length is given by:

$$\alpha = 2\arctan\left(\frac{d}{2f}\right) \tag{6}$$

where α is the angle of view, d denotes the size of the film or sensor, and f is the focal length. As can be seen from this relationship, shorter focal lengths widen the field of view and vice versa. To maintain a face at the same relative size in images captured with different focal lengths, the distance to the camera needs to be changed. Perspective effects are modified as this occurs, as can be noted in Fig. 4. Short focal lengths (wide-angle lenses) introduce a fair amount of facial distortion, whereas longer lengths begin to approximate an orthographic projection that better maintains relative distances among landmarks. Although not tested here, these effects can be more pronounced near the edges of a capture frame. With the growth of mobile-phone photography, some of the most common focal lengths, expressed relative to the 35-mm standard, fall in the 28-mm to 35-mm range, a relatively wide field of view. Interchangeable-lens cameras or cameras with zoom lenses can vary the focal length. As can be seen from Eq. 6, a lens with a larger focal length has a narrower angle of view at the same camera-to-object distance, which offers magnified, detailed photos. Focal lengths greater than 50 mm are often used in longer range photography, long-range biometric acquisition, and especially head-and-shoulder portrait photography. For this study, common focal lengths of prime lenses used in still photography were chosen as the range, from 24 mm (wide-angle on a 35-mm system) to 135 mm (slight telephoto on a 35-mm system), covering typical focal lengths used in photography and excluding extreme wide-angle and extreme telephoto lenses. We chose six common lens focal lengths (24 mm, 28 mm, 35 mm, 50 mm, 85 mm, and 135 mm) as our test domains for comparison.
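To make the relationship concrete, the short sketch below evaluates the horizontal angle of view for the six focal lengths on a 36-mm-wide frame and, under an ideal pinhole assumption, the relative camera distance that keeps the face at the same image size. The function names and the 50-mm reference are illustrative choices, not values taken from the study.

```python
import math

FOCAL_LENGTHS_MM = [24, 28, 35, 50, 85, 135]


def angle_of_view_deg(focal_mm, sensor_mm=36.0):
    """Angle of view alpha = 2 * atan(d / (2 f)), in degrees."""
    return math.degrees(2.0 * math.atan(sensor_mm / (2.0 * focal_mm)))


def distance_for_constant_size(focal_mm, ref_focal_mm=50.0, ref_distance=1.0):
    """Camera distance that keeps the subject at the same image size.

    For a pinhole camera the projected size scales as f / distance,
    so the distance must grow in proportion to the focal length.
    """
    return ref_distance * focal_mm / ref_focal_mm


if __name__ == "__main__":
    for f in FOCAL_LENGTHS_MM:
        print(f, round(angle_of_view_deg(f), 2), distance_for_constant_size(f))
```

Because the distance grows with focal length, the 24-mm render places the camera much closer to the face than the 135-mm render, which is what produces the stronger perspective effects at short focal lengths.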

Face landmark and alignment methods
Wu et al. [1] classify the technology into holistic methods, constrained local model methods, and regression-based methods. Holistic methods treat a whole face image as the entire appearance and shape to train models. Constrained local models locate landmarks based on the global face while emphasizing local features around landmarks. Regression-based methods are mostly adopted for deep learning, using regression analysis to map images to landmark locations directly. Johnston et al. [7] divide facial landmark detection methods into generative methods, discriminative methods, and statistical methods. Generative methods minimize the error between models and facial reconstructions. Discriminative methods use a dataset to train the regression models. Statistical models are a combination of generative and discriminative methods. Çeliktutan et al. classify facial landmark detection into model-based (using the entire face region) and texture-based (matching landmarks to local features) [6]. Here we consider landmarking algorithms based on either statistical methods or deep-learning methods. Statistical methods calculate the positions of landmarks using mathematical algorithms. Most of the traditional methods (e.g., AAM and ASM) fall into this group. Deep-learning methods feed facial images to deep neural networks that are trained to locate landmarks.
ASM and AAM models have been among the best-performing landmark-detection algorithms for nearly two decades. ASM, first introduced by Cootes et al., attempts to detect and measure the expected shape of a target in an image. ASM requires a set of landmarked images for training the model. The first step is using Procrustes Analysis to align all object shapes. A mean shape is then calculated, and Principal Component Analysis (PCA) is applied to find eigenvectors and eigenvalues [27]. All the objects' shapes can be approximated as:

$$x = \bar{x} + Pb \tag{7}$$

where x̄ is the mean shape calculated over all training data, P is a set of eigenvectors derived from the covariance matrix calculated via PCA, and b is a set of shape parameters given by:

$$b = P^{T}(x - \bar{x}) \tag{8}$$

As an improvement on ASM, an active appearance model matches both shape and texture simultaneously and gives an optimal parameterized model. PCA is also applied to texture and once again to find combined appearance parameters and vectors. Menpo provides five different AAM versions in two main groups: Holistic AAM (HAAM) and Patch AAM (PAAM) [28]. HAAM warps appearance information using a nonlinear function, such as Thin Plate Spline (TPS), and takes the whole texture into account when fitting, while PAAM uses rectangular patches around each landmark as texture appearance. We test both HAAM and PAAM as separate techniques for comparison here. For building the AAM, we chose the widely used Helen Dataset, which provides a high-resolution set of annotated facial images containing different ethnicities, ages, genders, head poses, facial expressions, and skin colors, and which was similarly used by Johnston et al. [7]. In order to reduce error caused by facial detection, we extract faces from each image using bounding boxes calculated from ground-truth landmarks and dilated by 5%.
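The linear shape model used by ASM and AAM can be illustrated with a small NumPy sketch. It assumes the training shapes have already been aligned (for example by Procrustes Analysis) and flattened to vectors of length 2K; the function names are hypothetical, and this is not the Menpo implementation used in the study. The optional clamp to ±3 standard deviations per mode is a common convention and is included here as an assumption.

```python
import numpy as np


def fit_shape_model(shapes, n_modes):
    """PCA on aligned shapes: returns the mean, eigenvectors P, eigenvalues."""
    X = np.asarray(shapes, dtype=float)          # (N, 2K), one shape per row
    mean = X.mean(axis=0)
    cov = np.cov(X - mean, rowvar=False)         # (2K, 2K) covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)       # ascending eigenvalues
    order = np.argsort(eigvals)[::-1][:n_modes]  # keep the largest modes
    return mean, eigvecs[:, order], eigvals[order]


def project(shape, mean, P):
    """Shape parameters b = P^T (x - mean)."""
    return P.T @ (np.asarray(shape, dtype=float) - mean)


def reconstruct(b, mean, P, eigvals=None, limit=3.0):
    """Approximate shape x = mean + P b, optionally clamping each
    parameter to +/- limit standard deviations of its mode."""
    b = np.asarray(b, dtype=float)
    if eigvals is not None:
        bound = limit * np.sqrt(np.maximum(eigvals, 0.0))
        b = np.clip(b, -bound, bound)
    return mean + P @ b
```

A shape far outside the pose variation of the training set projects to parameters beyond the clamp, so it cannot be reconstructed faithfully by the model.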
In the past few years, deep-learning-based neural-network methods have leveraged very large datasets for training and have recently outperformed statistical shape and appearance models in many areas. We gathered three recent high-performing methods for which implementations were available to compare in our various cases. The first method is the Position Map Regression Network (PRNet) [29]. The main idea of PRNet is creating a 2D UV position map that contains the shape of an entire face in order to predict 3D positions. PRNet employs a convolutional neural network (CNN) trained on 2D images along with ground-truth dense 3D point clouds created via a 3D morphable model (3DMM). The 3D positions are projected to the UV texture-map format and used in training the CNN. The UV texture map preserves 3D information, even for poses with occlusions.
The second method is the 3D Face Alignment Network (3D-FAN). Bulat and Tzimiropoulos use a 2D-to-3D Face Alignment Network combined with a stacked heat-map sub-network to predict Z coordinates along with 2D landmarks [30].
The third method, from Bhagavatula et al., uses a 3D Spatial Transformer Network (3DSTN) to estimate a camera projection matrix in order to reconstruct 3D facial geometry. The method forms occluded faces with 2D landmark regression and predicts 3D landmark locations [31].
These methods were trained on 300W-LP except for 3D-FAN which was trained on the 230,000 + 300W-LP. It would be prohibitive to attempt to include all recent deep-learning methods in this comparison, but these were chosen based on strong performance in recent publications, and we believe other recent methods would very likely perform similarly, given their similar overall performance on the same datasets.
To calculate the RMSE (Eq. 1) and NRMSE (Eqs. 2 and 3) on landmarks, all measurements require ground truth as a reference. All facial meshes with texture were manually marked using landmarker.io to create these ground-truth landmarks. Figure 6 shows an example of 3D facial annotation in landmarker.io as performed on our dataset [28].
Using our own Python-, Qt-, and OpenGL-based lab application, Countenance Tool, we render 3D faces at varying angles and focal lengths. Since we compare how view angles and focal lengths affect landmark methods, we move the virtual camera to the 49 different locations shown in Fig. 3. At each location, we rasterize faces with 6 different synthesized focal lengths (24 mm, 28 mm, 35 mm, 50 mm, 85 mm, and 135 mm) by changing the focal length parameter shown in Eq. 6 before rendering. Overall, there are images at 49 angles and 6 focal lengths for each face. At the same time, we use the same camera matrices (varying with viewing angle and focal length) to project the 3D ground-truth landmarks, yielding the ground-truth 2D landmarks in image coordinates. Figure 7 shows a set of images with ground-truth landmarks at different focal lengths and viewing angles.
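The projection of 3D ground-truth landmarks into image coordinates can be sketched with an ideal pinhole camera that looks at the origin. This is a simplified stand-in for the camera matrices of the rendering tool, not its actual code: the function names, the square 512-pixel image, and the 36-mm sensor width are assumptions made for illustration.

```python
import numpy as np


def look_at(eye, target=(0.0, 0.0, 0.0), up=(0.0, 1.0, 0.0)):
    """Rotation and translation taking world points to camera space
    (right-handed, camera looking down its own -Z axis)."""
    eye = np.asarray(eye, dtype=float)
    forward = np.asarray(target, dtype=float) - eye
    forward = forward / np.linalg.norm(forward)
    right = np.cross(forward, np.asarray(up, dtype=float))
    right = right / np.linalg.norm(right)
    true_up = np.cross(right, forward)
    rotation = np.stack([right, true_up, -forward])
    translation = -rotation @ eye
    return rotation, translation


def project_points(points, eye, focal_mm, sensor_mm=36.0, image_px=512):
    """Project (N, 3) world points to (N, 2) pixel coordinates."""
    rotation, translation = look_at(eye)
    cam = np.asarray(points, dtype=float) @ rotation.T + translation
    depth = -cam[:, 2]                         # distance along the view axis
    f_px = focal_mm / sensor_mm * image_px     # focal length in pixels
    u = image_px / 2.0 + f_px * cam[:, 0] / depth
    v = image_px / 2.0 - f_px * cam[:, 1] / depth   # image Y points down
    return np.stack([u, v], axis=1)
```

Using the same `eye` and `focal_mm` for both the render and the landmark projection is what keeps the 2D ground truth consistent with each synthesized image.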
To summarize the workflow demonstrated in Fig. 5, we first performed facial geometry capture with a 3dMD system. The 3dMD system provided 3D meshes along with texture information. We then imported those into landmarker.io to annotate each face manually to generate 3D ground-truth landmarks. After getting the ground-truth, we rasterized each face at 49 angles and 6 focal lengths and calculated the ground-truth 2D landmark locations. Finally, we analyzed performance of each method by calculating NRMSE error between a method's predicted landmarks and the 2D ground-truth locations.

Results and discussion
In this section, we compare the RMSE performance of the five methods with the full 68-point scheme and the reduced m7 scheme against 6 threshold levels. Figure 8 plots the percentage correctly accepted for each facial landmark and alignment method with both schemes. Generally speaking, as expected, the overall acceptance performance for each algorithm increases as the threshold widens. The m7 landmarking scheme tends to show better performance as a smaller set located at distinct "corners." In general, the CNN methods perform better, but all are still subject to performance effects due to focal lengths and viewing angles. It would be remiss to declare one method particularly better than another here, particularly since 3D-FAN was trained on an augmented dataset versus the others; we used the publicly available pre-trained networks. Compared to the neural-network techniques, the performance of traditional statistical methods is typically lower. As Cootes explains [32], the performance of ASM and AAM is dependent on the starting position.
One of the main contributions of this paper is demonstrating the effect of focal length on landmarking accuracy. Figure 9 demonstrates lower performance with a wider field of view, associated with strong perspective effects, and better performance as focal length increases. As expected, performance levels off as focal length increases further.
In order to visualize effects on specific landmarks at different focal lengths, we drew, for each method, the 68-point landmarks averaged over the frontal view at the two focal-length extremes, with circles based on RMSE (135-mm lens in blue and 24-mm lens in red). This shows which landmarks are most affected by the focal-length perspective warping. Figure 10 also reflects the data depicted in Fig. 9. The radius of each circle represents how far the predicted landmark is from the ground truth. The result shows that landmarks close to the center of faces have more accurate predictions, while landmarks along facial edges have lower accuracy due to projective distortions; the corners of the eyes and lips seem particularly affected.
The last consideration for this paper is systematic adjustment of the camera's viewing angle across the viewing hemisphere. We place the camera at 49 different positions with extreme poses included. When the camera views from the center (θ ≈ 90° and φ ≈ 90°), the performance results are better than when the camera views from the sides. The landmark predictions at φ around 45° and 135° have the lowest performance due to extreme viewing angles. As expected, performance drops as the view moves to the more extreme angles, and the rate of this effect for each method is shown in Figs. 11, 12, 13, 14, and 15. Most of the facial landmark and alignment algorithms perform well at frontal views, and the detection precision relies on the training set variability. To delineate prediction differences between extreme-view cases and center-view cases, we chose the most centered view image (φ = 90° and θ = 90°), as well as the 8 images surrounding it, to be the frontal group (Fig. 16). The rest of the images are the outer group (Fig. 17). Front-view detection can approach almost 100% accuracy, especially at center view for deep neural-network methods. The precision rate drops more than 50% approaching extreme angles (θ = 0° and θ = 180°).
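The split into frontal and outer groups can be expressed compactly. The sketch below assumes that the "8 images surrounding" the center view are its immediate neighbors on the sampling grid (one step of 15° in φ and one step of 30° in θ); the function name is hypothetical.

```python
def view_groups(phis=range(45, 136, 15), thetas=range(0, 181, 30),
                center=(90, 90), phi_step=15, theta_step=30):
    """Split (phi, theta) grid positions into frontal and outer groups.

    A view is frontal if it is the center view or one of its immediate
    grid neighbors (assumed interpretation of the surrounding 8 views).
    """
    frontal, outer = [], []
    for p in phis:
        for t in thetas:
            near = (abs(p - center[0]) <= phi_step
                    and abs(t - center[1]) <= theta_step)
            (frontal if near else outer).append((p, t))
    return frontal, outer
```

With the 7 by 7 grid this gives 9 frontal views and 40 outer views, so acceptance rates can be averaged separately within each group.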
Part of the image set was also based on 3D captures of action units from FACS, which taxonomizes individual physical expressions of emotion. The results shown in Fig. 18 illustrate that, in general, the landmark-prediction methods work better on neutral faces, since the stronger facial expressions in the FACS faces increase prediction difficulty. The performance decreases with wide field of view and with extreme view angle are consistent across both groups.

Conclusion
In conclusion, the 3DSTN, PRNet, and 3D-FAN methods generally work better than traditional statistical methods. Deep-learning methods have become the prevalent research direction for the time being, but they are still subject to viewing angles and also, particularly, to lens effects that have rarely been considered in performance evaluations. Increasing focal length tends to improve landmark and alignment performance due to less projection distortion. This could inform design decisions about the camera system and lens chosen for a biometric system, or it could be used to inform future algorithm design.
Given the experimental results, all methods, as expected, work best from frontal viewing angles. It is also interesting to note that the slope of the performance fall-off introduced by shorter focal lengths (wider field of view) is smaller for the AAM-based methods and the 3DSTN approach. This is likely due to the AAM methods being based on image features, with PAAM more specifically emphasizing local image features. 3DSTN likely does well because part of the method specifically estimates a camera projection matrix, which in some sense should help counteract some of the perspective issues introduced by focal length. The PRNet and 3D-FAN methods, which use more general 3D data, are likely more affected, and the larger training set for 3D-FAN likely assists its performance here. One limitation of statistical algorithms is that landmark detection performance is tied to the head pose variation in the training set. When applying PCA, the first N eigenvectors are chosen as the main components. Typically, these are chosen based on representing ±3 standard deviations from the mean value. Because of this limitation, landmarking performance drops for extreme view angles, as is often shown. The CNN methods, which all incorporate some system of 3D reference, tend to do better as viewing angles move from the center; however, they still suffer performance drops and are still affected by shorter focal lengths.
Since focal length variance does affect final face landmark and alignment performance, future work could include use of this to augment training data. This could be done through data collection or use of synthetic data.
Meta-data from capture lenses stored in digital photographs is often removed by the time images reach large datasets, but it would be interesting to note such effects from in-the-wild photographs. In the meantime, training with synthetic data that includes controlled variance of viewing angle ranges as well as varying focal length, added to photographic datasets, should likely improve results.
In the future, image acquisition should not only cover pose, illumination, expression, ethnicity, skin color, etc., but also include consideration of full camera and lens parameters when possible.