Deep-learned faces: a survey

*Correspondence: s.wickramaarachchilage@qmul.ac.uk Multimedia and Vision Group, School of Electronic Engineering and Computer Science, Queen Mary University of London, Mile End Rd, E1 4NS London, UK

Abstract
Deep learning technology has enabled successful modeling of complex facial features when high-quality images are available. Nonetheless, accurate modeling and recognition of human faces in real-world scenarios “in the wild” or under adverse conditions remains an open problem. Consequently, a plethora of novel deep network architectures addressing issues related to low-quality images, varying pose, illumination changes, emotional expressions, etc., have been proposed and studied over the last few years. This survey presents a comprehensive analysis of the latest developments in the field. A conventional deep face recognition system entails several main components: deep network, optimization loss function, classification algorithm, and train data collection. Aiming at providing a complete and comprehensive study of such complex frameworks, this paper first discusses the evolution of related network architectures. Next, a comparative analysis of loss functions, classification algorithms, and face datasets is given. Then, a comparative study of state-of-the-art face recognition systems is presented. Here, the performance of the systems is discussed using three benchmarking datasets with increasing degrees of complexity. Furthermore, an experimental study was conducted to compare several openly accessible face recognition frameworks in terms of recognition accuracy and speed.


Introduction
The face conveys a plethora of discriminative features rich enough to determine one's identity [1]. These features can be extracted in unconstrained scenarios and in a non-intrusive manner. Hence, automated face recognition can be exploited in a large number of practical applications [2]. Among others, it has shown excellent capabilities in security applications like intelligent surveillance [3,4], user authentication applications like traveler verification at border crossing points [5,6], and diverse other mobile and social media applications [7][8][9][10]. Indeed, person identity prediction based on facial features for practical purposes is a valuable tool in modern information technology [11]. Straightforward as it may seem, the underlying modeling and mapping of faces is complex, and it becomes daunting due to the diversity of facial features. Such complexity is further exacerbated by other variations like emotions, illumination, make-up, and low-quality sensing [12,13]. To tackle this important yet challenging problem of face recognition, intensive research efforts have been reported by numerous research groups and scholars. The discipline can be traced back to the sixties [14,15], when both feature-based approaches and holistic approaches were reported. Feature-based approaches exploit the geometric relationships among distinctive facial features such as eyes, mouth, and other face landmarks [16][17][18][19][20][21][22][23]. In contrast, holistic approaches aim at capturing features of the entire facial area in an image [24][25][26][27][28][29].
(2020) 2020: 25 Page 2 of 33
Holistic approaches assign equal importance to all the pixels rather than paying special attention to a set of points of interest. Hence, these approaches offer higher distinctive power at the cost of increased computational complexity [6,30]. Deep convolutional neural networks (DCNNs) are a holistic approach that recently enabled a quantum leap in the field. In 2014, Facebook reported a face recognition system named DeepFace [27], which achieved near-human performance on the LFW benchmark [31]. This accuracy was quickly surpassed by systems like DeepId3 [28] and FaceNet [29]. Such substantial progress in face recognition technology is a reflection of cutting-edge research developments in deep network architectures. Starting from LeNet in 1989 [32], DCNNs have evolved into sophisticated networks, particularly fueled by classification challenges like the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) [33]. AlexNet [34], VGGNet [35], and GoogleNet [36] are arguably the three most influential ILSVRC networks.
The role of a deep network in a classic image classification system is to map the complex high-dimensional image information into a low-dimensional proprietary template, i.e., a feature vector. The generated feature vectors can be interpreted as points in a fixed-dimensional space. Clearly, face images form a subspace of the much larger image space. This fact implies that network architectures that succeeded in the problem of image classification are adaptable to face classification. Some successful face recognition applications that emerged from image classification networks are as follows: DeepId3 [28], which was influenced by VGGNet [35] and GoogLeNet [36]; Google's FaceNet [29], which used the GoogleNet [36] architecture; and VGGFace [37], which exploited concepts from VGGNet [35].
A deep network is generally underpinned by an optimization loss function. When the deep net outputs feature vectors from input images, the loss function adds discriminative power to the generated features. Over the years, loss functions have evolved, complementing the network architectures. These loss functions can be categorized as classification-based approaches, i.e., softmax loss and its variants, and metric learning approaches, i.e., contrastive loss and triplet loss. Successful exploitation of suitable loss functions in face recognition includes softmax loss in DeepFace [27], a variation of softmax loss as used in ArcFace [38], and a triplet loss used in FaceNet [29].
Figure 1 shows the data flow of a typical face recognition system. During training, the network model learns from large training datasets. The trained model is then used to generate feature vectors for test faces. A classic face recognition task generally includes a gallery of labelled faces and probe/query images. Labelled gallery images are usually processed in advance in a step called the 'enrolment process'. Here, the feature vectors/templates of the gallery subjects are generated. These features are then either stored with their corresponding labels or used to generate subject-specific models. During the face recognition phase, the template of the query face is compared to the enrolled templates. This comparison can use either a nearest neighbor search or a model-based classification. The former approach is referred to throughout this paper as template learning and the latter is referred to as subject-specific modelling.
An important aspect of face recognition is benchmarking. As mentioned before, network architectures together with optimization loss functions and sufficient and diversified train datasets have enabled successful modeling of complex facial features when provided with high-quality images.
These face recognition systems reported near-perfect performance on classic benchmarks like LFW [31]. However, the performance saturation on these benchmarks resulted in more challenging benchmarks [39][40][41] entailing more realistic pictures captured under adverse conditions. The evaluations on such real-world data show that the performance of face recognition systems is affected by many factors including emotions, illumination variations, make-up, and pose variations [39,40,42,43].

Surveys on deep face recognition
Due to the importance of the topic and the vast number of face recognition papers reported in the past, there is no shortage of related surveys either. Some noteworthy face recognition surveys include Zhang et al. [11], Jafri et al. [30], Bowyer et al. [44], and Scheenstra et al. [45]. These comprehensively survey face recognition systems prior to DeepFace. Hence, these surveys do not discuss the new sophisticated deep learning approaches that emerged during the last decade. Surveys that discuss deep face recognition have singled out face recognition as an individual discipline rather than a collection of components adopted from different studies. These surveys generally discuss the face recognition pipeline (face pre-processing, network, loss function, and face classification) [42,46,50] or discuss a single aspect of face recognition such as 3-D face recognition [47], illumination-invariant face recognition [52], or pose-invariant face recognition [51]. Although these surveys are important and provide an excellent basis for the analysis of the state-of-the-art in the field, they do not provide conclusive comparisons or analysis of the underlying network architectures. To better illustrate the difference between the key contributions of past surveys and this survey, Table 1 summarises the main deep face recognition surveys. The analysis presented by Wang et al. [46] is arguably the most comprehensive survey yet in the field. It provides a holistic overview of the broad topics of deep face recognition including the face recognition pipeline, face datasets, benchmarks, and industry scenes, briefly surveying all elements of face recognition. In contrast, this paper focuses on deep learning based components in the recognition pipeline and delivers a much more detailed analysis of the 18 most critical deep face recognition systems.
The paper describes a face recognition system as a unique combination of a deep net, loss function, classification approach, train dataset, and other system specific novelties if any. To properly understand how each system was derived, the paper also discusses the evolution of the aforementioned components.

Paper contribution
The key contributions of this survey include:
• The background knowledge required to understand and analyze the underlying frameworks used in face recognition, including:
- The origin and evolution of DCNN frameworks that were effective in face recognition (Table 2).
- The loss functions used in face recognition, categorized and compared under two classes: classification-based approaches and metric learning approaches.
- A comparative discussion on two main classification approaches used in face recognition, i.e., template learning and subject-specific modelling.
- A brief discussion on key face datasets and evaluation benchmarks.
• The face recognition systems are analyzed based on the network architecture, loss function, classification approach, train data, and other unique system design details.
• The performance of face recognition is discussed based on three scenarios:
• An experimental study that compares three face recognition systems (DLIB [64], OpenFace [65], and FaceNet_Re [66]) with respect to face recognition accuracy and speed.
• Discussion on open issues and challenges in face recognition highlighting possible future research.
The remainder of the survey is organized as follows. Section 2 presents a cognitive study of the evolution of DCNN architectures. Then, the paper presents a comparative analysis of loss functions in Section 3, a study of classification algorithms in Section 4, and face datasets and evaluation benchmarks in Section 5. Section 6 presents a study on state-of-the-art face recognition systems. This study is threefold and includes an individual systems analysis, a comparative performance analysis on three benchmarks, and an experimental performance analysis. Finally, the paper presents the open issues of face recognition followed by the conclusion.

Table 1 Main deep face recognition surveys. Deep learning based face recognition is either the main focus or is included as a subsection in each survey
Year | Survey | Description
... | ... | Comparatively analyses Local Binary Convolutional Neural Networks (LBCNNs) against other state-of-the-art networks in terms of sensibility and processing time.
2017 | A survey on facial feature extraction techniques for automatic face annotation [49] | Discusses six facial feature extraction approaches: speeded up robust features, eigenfaces, scale invariant feature transform, convolutional neural network, Gabor filter, and local binary pattern.
2016 | A survey of deep face recognition in the wild [50] | Discusses the network models of seven face recognition systems, comparing their reported performance on the LFW [31] benchmark.
2016 | A comprehensive survey on pose-invariant face recognition [51] | Discusses pose-robust facial feature extraction systems under two categories: engineered features and learning based features. Deep frameworks are discussed under learning based features.
2015 | Addressing the illumination challenge in two-dimensional face recognition: a survey [52] | Provides a summarized review of 72 state-of-the-art illumination-invariant facial feature extraction methods prior to 2014.
2015 | A survey of unconstrained face recognition algorithm and its applications [53] | Discusses face recognition techniques in terms of their behavior at pose variations, non-uniform motion blur, and illumination for the period 2011-2014. The discussion includes several neural network based systems.

Table 2 The origin and evolution of DCNN frameworks that were effective in face recognition (fragment, cells in source order): Ensemble of 2 models; 6.8; VGGFace [37], DeepId3 [28]; 2014; GoogleNet [36]; Inception architecture; Ensemble of 7 models; 6.67; FaceNet [29], DeepId3 [28], OpenFace [65]; 2014; InceptionV2 [73]; Adding batch normalization to Inception architecture

The evolution of deep face architectures
Andrew Ng, the Chief Scientist at Baidu Research, described the notion of deep learning as "Using brain simulations, hope to make learning algorithms much better and easier to use and make revolutionary advances in machine learning and AI". While deep neural networks (DNNs) have conquered different disciplines, convolutional neural networks (ConvNets or CNNs) have been particularly effective in visual science [68]. Given the appropriate network architecture, CNNs are able to process, analyze, and classify high-dimensional patterns, resulting in an extremely valuable tool in computer vision.
A typical DCNN adheres to a conventional structure which consists of a set of stacked convolutional layers followed by contrast normalization and max-pooling and finally one or more fully connected layers [36]. Different variants of this structure have been explored for performance enhancements. Please refer to Fig. 5 for the general structure of a DCNN.
The evolution of deep network architectures began with increased size with respect to depth (the number of levels) and width (the number of units at each level) [34,35,69]. Nonetheless, the increased complexity associated with larger nets was not favored in practical applications. Hence, systems like GoogleNet pioneered architecturally enhanced networks with fewer parameters [36]. This was followed by Microsoft's efforts to simplify the training process by using networks with lower complexity [70]. More recently, researchers have combined these two design techniques for further simplified networks [71].
Classification challenges such as ILSVRC [33], MNIST, and CIFAR have led to several milestones in image recognition. AlexNet [34], the winner of ILSVRC 2012 and a pioneer of DNN-based image recognition, achieved a top-5 test error rate of 15.3%. To this day, the publication is considered to be one of the most influential breakthroughs. The second milestone was recorded when VGGNet [35], the runner-up of ILSVRC 2014, achieved significant improvements (top-5 test error rate of 6.8%) with increased depth in DNNs.
Although going deeper with convolutions seemed to be the straightforward solution for accuracy enhancements [34,35,69], this approach had two main drawbacks: (1) the large number of parameters in these deeper networks made them prone to over-fitting and (2) the deeper networks meant increased consumption of computing resources. These factors turned the attention of the research community towards sparsely connected systems. Nonetheless, sparse systems were not a simple solution and had their own complications and limitations. The calculations associated with these non-uniform sparse systems, even with fewer arithmetic operations, suffered from the overhead of look-ups and cache misses. In comparison, dense nets, even with a higher number of arithmetic operations, had the advantage of fast dense matrix multiplication operations provided by improved numerical libraries [34,72].
GoogleNet [36], which was the winner of ILSVRC 2014, introduced an architecture code-named Inception, which was capable of outperforming AlexNet with 12 times fewer parameters than AlexNet. The main concept behind this architecture is finding an optimal local sparse structure that can be covered by readily available dense components.
The Inception architecture was learned layer by layer. In a single layer, units with high correlation were clustered together. These clusters, which are connected to the units of the previous layer, formed the units of the next layer. When these Inception modules were stacked, the higher layers required more and more 3×3 and 5×5 convolutions. This is because the highly abstract features are captured by the higher layers, and their spatial concentration reduces as a result. To contain the resulting computational cost, dimension reduction was introduced to the architecture. In doing so, 1×1 convolutions were introduced before the 3×3 and 5×5 convolutions so that these 1×1 convolutions compute reductions prior to feeding the data to the more expensive convolutions. The Inception architecture was later modified in subsequent versions by adding batch normalization in Inception V2 [73] and additional factorizations in Inception V3 [74].
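The effect of the 1×1 reduction can be illustrated with a simple parameter count. The following is a minimal sketch, not taken from the surveyed papers: the helper `conv_params` and the channel sizes (192 input channels, a reduction to 16 channels, 32 output channels) are assumptions chosen for illustration only.

```python
def conv_params(in_channels: int, out_channels: int, kernel: int) -> int:
    """Weights plus biases of a kernel x kernel convolution layer."""
    return kernel * kernel * in_channels * out_channels + out_channels


# Hypothetical channel sizes, chosen for illustration only.
IN_CH, REDUCED_CH, OUT_CH = 192, 16, 32

# A 5x5 convolution applied directly to the 192-channel input.
direct = conv_params(IN_CH, OUT_CH, 5)

# A 1x1 reduction to 16 channels, followed by the 5x5 convolution.
reduced = conv_params(IN_CH, REDUCED_CH, 1) + conv_params(REDUCED_CH, OUT_CH, 5)

print("direct 5x5:", direct)
print("1x1 reduction + 5x5:", reduced)
print("saving factor:", round(direct / reduced, 2))
```

Under these assumed sizes, the reduced branch needs roughly one tenth of the parameters of the direct 5×5 convolution, which is the motivation for placing the 1×1 convolutions first.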
The uniqueness of the Inception architecture is that its design principles focus on computational simplicity, enabling inference to be run even on a single machine. For this reason, GoogleNet was later used by many face recognition systems, including Google's FaceNet [29] and DeepId3 [28].
In contemporary research, Microsoft Research employed the concept of deep residual learning [70] for image recognition. The authors show that the residual learning framework enables very deep networks, deeper than the traditional DNNs, to be implemented with lower complexity. The study presented a DNN with 152 layers, which is eight times as deep as VGGNet [35].
Residual learning can be explained as follows. Consider a stack of layers, which could be the entire network or a part of it. Let the input to the stack of layers be x and the underlying mapping be H(x). Instead of training the layers to learn the traditional complicated function H(x), the stack of layers is trained to learn the corresponding residual function, i.e., H(x) − x, thus deriving Eq. 1:
\[ F(x) = H(x) - x \tag{1} \]
The original mapping is then recovered as F(x) + x.
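A residual block can be sketched in a few lines. The example below is a simplified illustration under stated assumptions, not the implementation of [70]: it uses small fully connected layers instead of convolutions, and the names `residual_block`, `W1`, and `W2` are hypothetical. The stacked layers compute the residual, and the shortcut adds the input back.

```python
def relu(v):
    """Element-wise rectified linear unit."""
    return [max(0.0, a) for a in v]


def matvec(W, v):
    """Multiply matrix W (list of rows) by vector v."""
    return [sum(w * a for w, a in zip(row, v)) for row in W]


def residual_block(x, W1, W2):
    """Return relu(F(x) + x), where F(x) = W2 * relu(W1 * x) is the learned residual."""
    fx = matvec(W2, relu(matvec(W1, x)))
    return relu([f + a for f, a in zip(fx, x)])


# With all-zero weights the residual F(x) is zero, so the block
# passes a non-negative input through unchanged.
zeros = [[0.0] * 3 for _ in range(3)]
x = [1.0, 2.0, 0.5]
y = residual_block(x, zeros, zeros)
print(y)
```

The zero-weight case shows why the formulation is convenient: an identity mapping corresponds to driving the residual towards zero rather than fitting the identity through a stack of nonlinear layers.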
The authors presented several networks of different sizes. ResNet-152, which is 152 layers deep, outperformed VGGNet and GoogleNet in ImageNet validation.
When residual connections on top of traditional DCNN architectures achieved performance close to that of Inception V3, it raised the question of whether residual connections on top of Inception would further enhance the performance. This hypothesis was explored in Inception V4 (Fig. 3) [71]. The authors showed that, while it is feasible to achieve competitive results through very deep networks without the use of residual connections, the inclusion of residual connections significantly improves training speed.
Fig. 3 (caption, beginning missing): ... [70], modified residual connections used in [36], the schema for the 17×17 grid of the pure InceptionV4 network, and the schema for the 17×17 grid (Inception-ResNet-B) module of the Inception-ResNet-v1 network
In addition to the networks discussed, bilinear CNNs are a model designed for image recognition and later adopted in face recognition. The network consists of two feature extractors whose outputs are multiplied using the outer product at each location of the image and pooled to obtain a bilinear vector [75]. This model was proven to be effective in fine-grained recognition tasks.
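The bilinear pooling step can be sketched as follows. This is a minimal illustration that assumes sum pooling over locations and uses hypothetical toy feature maps; the function name `bilinear_pool` is not from [75], and any normalization applied after pooling is omitted.

```python
def bilinear_pool(feat_a, feat_b):
    """Sum over locations of the outer product of the two feature vectors.

    feat_a and feat_b hold one feature vector per image location and must
    cover the same locations. The pooled matrix is returned flattened.
    """
    dim_a, dim_b = len(feat_a[0]), len(feat_b[0])
    pooled = [[0.0] * dim_b for _ in range(dim_a)]
    for a, b in zip(feat_a, feat_b):
        for i in range(dim_a):
            for j in range(dim_b):
                pooled[i][j] += a[i] * b[j]
    return [value for row in pooled for value in row]


# Two image locations; extractor A outputs 2-D features, extractor B 3-D features.
feat_a = [[1.0, 2.0], [0.0, 1.0]]
feat_b = [[1.0, 0.0, 2.0], [3.0, 1.0, 0.0]]
bilinear_vector = bilinear_pool(feat_a, feat_b)
print(bilinear_vector)
```

The resulting vector has one entry per pair of feature channels (2 × 3 = 6 here), which is why bilinear descriptors grow quickly with the width of the two extractors.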
The major architectural innovations in DCNN history are associated with three concepts: increased network size, the Inception architecture, and residual connections. These innovations vary in performance indices like model complexity, computational complexity, memory usage, and inference time. These indices are vital in selecting an appropriate architecture compatible with the resource constraints in practical deployment. Canziani et al. [76] and Bianco et al. [77] present experimental comparisons between different DNNs. From the results of Canziani et al., Fig. 2 presents the model complexity and computational complexity of DCNNs that have a major impact on face recognition (with the exception of E-Net, BN-NIN, and BN-AlexNet).

Comparative analysis of loss functions
The loss function is the supervisory signal used to train a deep network. The study of loss functions has been carried out along two main lines of research (Fig. 4): (1) classification-based approaches (the conventional softmax classifier [27,37,78] and modified versions of softmax loss [38,60,61,79-86]) and (2) metric learning approaches (contrastive loss [28,55,56,87,88] and triplet loss [29]). The softmax loss learns by classifying each train image into one of the pre-defined classes. Variants of softmax loss have made efforts to increase the intra-class compactness in the process. In contrast, metric learning approaches learn by increasing the similarity between faces of the same identity while decreasing the similarity between faces of different identities. Regardless of the approach, all deep face supervisory signals are driven towards a single goal: inter-class discrepancy with intra-class compactness.

Classification-based approaches
The softmax loss treats face recognition as a multi-class classification problem in which the input data contains one or more images of each of a set of individuals and the classifier learns the features of each individual. Despite being referred to as softmax loss for convenience, technically, a k-way softmax function is employed to obtain a probability distribution over the labels of k classes [27], and the cross-entropy loss is minimized for each training sample. The softmax loss is denoted in Eq. 2,
\[ L_{\mathrm{softmax}} = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{e^{W_{y_i}^{T}x_i + b_{y_i}}}{\sum_{j=1}^{n} e^{W_j^{T}x_i + b_j}} \tag{2} \]
where $x_i \in \mathbb{R}^d$ denotes the d-dimensional deep feature of the ith sample, belonging to the $y_i$th class, $W_j \in \mathbb{R}^d$ denotes the jth column of the weight $W \in \mathbb{R}^{d \times n}$, and $b_j$ is the jth entry of the bias term $b \in \mathbb{R}^n$. The batch size and the class number are N and n, respectively. Softmax loss, despite achieving inter-class dispersion, provides no particular inclination towards intra-class compactness. Hence, the features learned through softmax loss may not be discriminative enough for the rather challenging open-set classification problem [38]. Studies that followed have reported several efforts to enhance the discriminative power of softmax loss [60,61,79-81,83-86]. An extension of softmax loss named center-loss [79] attempted to achieve the missing intra-class compactness by taking into account the Euclidean distance between the feature vector and the center of the class. The authors show that a combination of center-loss and the softmax loss could be an optimum solution. However, for huge training datasets with a large number of classes, the class-wise learning approach becomes complicated and difficult. In an effort to solidify class-wise learning in large datasets, a new approach named SphereFace [60] incorporated a multiplicative angular margin penalty. Even though a new loss function was introduced in this publication, the presented optimum solution was a hybrid with softmax loss. Later, Wang et al. [61] proposed a system named CosFace, which used a cosine margin penalty. As opposed to SphereFace, CosFace used an additive margin. This approach outperformed SphereFace.
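The softmax loss and the center-loss term discussed above can be sketched as follows. This is an illustrative sketch under stated assumptions rather than the implementation of [27] or [79]: weights are stored as one vector per class, the center loss is averaged over the batch, and the weighting factor `lam` used to combine the two terms is a hypothetical value.

```python
import math


def softmax_loss(features, labels, W, b):
    """Mean cross-entropy of a linear softmax classifier; W[j] is the weight vector of class j."""
    total = 0.0
    for x, y in zip(features, labels):
        logits = [sum(w * a for w, a in zip(W[j], x)) + b[j] for j in range(len(W))]
        top = max(logits)  # subtract the maximum for numerical stability
        log_sum = top + math.log(sum(math.exp(z - top) for z in logits))
        total += log_sum - logits[y]
    return total / len(features)


def center_loss(features, labels, centers):
    """Half the mean squared Euclidean distance between each feature and its class center."""
    total = 0.0
    for x, y in zip(features, labels):
        total += 0.5 * sum((a - c) ** 2 for a, c in zip(x, centers[y]))
    return total / len(features)


features = [[1.0, 0.0], [0.0, 1.0]]
labels = [0, 1]
W = [[1.0, 0.0], [0.0, 1.0]]
b = [0.0, 0.0]
centers = [[1.0, 0.0], [0.0, 0.0]]

lam = 0.01  # hypothetical weighting of the center-loss term
l_softmax = softmax_loss(features, labels, W, b)
l_center = center_loss(features, labels, centers)
print(l_softmax, l_center, l_softmax + lam * l_center)
```

The softmax term only requires the correct class to score highest, whereas the center term directly penalizes the spread of features around their class center, which is the intra-class compactness that plain softmax lacks.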
Most recently, in 2019, a research team from Imperial College introduced an additive angular margin loss named ArcFace [38]. The derivation of ArcFace can be outlined as follows.
Consider the traditional softmax loss denoted in Eq. 2. The bias is removed and the logit $W_j^{T}x_i$ is transformed to its dot product form as $W_j^{T}x_i = \|W_j\|\,\|x_i\|\cos\theta_j$, where $\theta_j$ is the angle between the weight $W_j$ and the feature $x_i$. When $\ell_2$ normalization is applied to each individual weight and embedding feature, $\|W_j\| = 1$ and $\|x_i\|$ is re-scaled to s, yielding the following equation.
\[ L_{\mathrm{norm}} = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{e^{s\cos\theta_{y_i}}}{e^{s\cos\theta_{y_i}} + \sum_{j=1,\,j\neq y_i}^{n} e^{s\cos\theta_j}} \]
Now, the predictions depend only on the angle between the feature and the weight. The inter-class discrepancy and intra-class compactness are achieved by the additive angular margin penalty m; hence, the final equation is as follows.
\[ L_{\mathrm{ArcFace}} = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{e^{s\cos(\theta_{y_i}+m)}}{e^{s\cos(\theta_{y_i}+m)} + \sum_{j=1,\,j\neq y_i}^{n} e^{s\cos\theta_j}} \]
The ArcFace system reported a considerable improvement, with 99.83% LFW accuracy.
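The derivation above can be traced numerically with a short sketch. The code below is an illustration under stated assumptions and not the reference implementation of [38]: the function names are hypothetical, the values s = 64 and m = 0.5 are treated as assumed defaults, and the special handling that practical implementations apply when the angle plus margin exceeds π is omitted.

```python
import math


def l2_normalize(v):
    """Scale v to unit Euclidean length."""
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v]


def margin_logits(x, W, y, s=64.0, m=0.5):
    """Scaled cosine logits; the angle of the target class y is enlarged by the margin m."""
    x_hat = l2_normalize(x)
    logits = []
    for j, w in enumerate(W):
        cos_t = sum(a * c for a, c in zip(l2_normalize(w), x_hat))
        cos_t = max(-1.0, min(1.0, cos_t))  # guard acos against rounding errors
        if j == y:
            cos_t = math.cos(math.acos(cos_t) + m)
        logits.append(s * cos_t)
    return logits


def arcface_loss(features, labels, W, s=64.0, m=0.5):
    """Mean cross-entropy over the margin-adjusted logits."""
    total = 0.0
    for x, y in zip(features, labels):
        z = margin_logits(x, W, y, s, m)
        top = max(z)
        total += top + math.log(sum(math.exp(v - top) for v in z)) - z[y]
    return total / len(features)


W = [[1.0, 0.0], [0.0, 1.0]]   # one weight vector per class
features = [[1.0, 1.0]]        # a feature lying between the two class directions
labels = [0]

loss_plain = arcface_loss(features, labels, W, m=0.0)
loss_margin = arcface_loss(features, labels, W, m=0.5)
print(loss_plain, loss_margin)
```

With m = 0 the sketch reduces to the normalized softmax loss. With m > 0 the same feature incurs a larger loss, so training has to pull it closer in angle to its own class weight, which is how the margin enforces intra-class compactness.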

Metric learning approaches
Metric learning approaches optimize differently from softmax loss and its variants. In metric learning, the network is provided with sample images and is penalized based on whether the samples are of the same class or not. Contrastive loss and triplet loss are two metric learning approaches popular in face recognition. Contrastive loss is generally used in Siamese-style networks. A Siamese network is an architecture with two parallel neural networks with shared weights. Each network takes a different input, and the two outputs are combined to provide some prediction [89]. Contrastive loss was proposed by Hadsell et al. [90] and was used in face recognition by DeepId2 [55], DeepId2+ [56], DeepId3 [28], and others [87,88]. Figure 7 shows the Siamese network used in DeepId2+. Researchers at Google presented a system that learns a direct mapping from face images to points in a compact Euclidean space [29]. The optimization loss function is triplet loss. Given a triplet (an anchor, a positive sample, and a negative sample), this loss aims at minimizing the distance between the anchor and the positive while maximizing the distance between the anchor and the negative. Contemporary research carried out by Baidu Research also reports the use of triplet loss [57].
Despite being conceptually straightforward, the effectiveness of metric learning mainly depends on the input samples. For example, FaceNet uses a hard sample mining algorithm to select optimum triplets. Moreover, the number of possible triplets grows combinatorially with dataset size, and hence effective triplet mining becomes complicated. Studies that followed [37,38] reported that, while triplet loss is an effective approach, combining learning by classification with metric learning makes the training more convenient.
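The two metric learning losses and a simple form of hard sample selection can be sketched as follows. This is a minimal illustration under stated assumptions: the margin values are hypothetical, and `hardest_negative` picks the closest negative, which is a simplification and not the mining strategy of any particular system.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def contrastive_loss(a, b, same_identity, margin=1.0):
    """Pull matching pairs together; push non-matching pairs at least `margin` apart."""
    d = euclidean(a, b)
    if same_identity:
        return 0.5 * d ** 2
    return 0.5 * max(0.0, margin - d) ** 2


def triplet_loss(anchor, positive, negative, alpha=0.2):
    """Require the negative to be farther from the anchor than the positive by `alpha`."""
    d_pos = euclidean(anchor, positive) ** 2
    d_neg = euclidean(anchor, negative) ** 2
    return max(0.0, d_pos - d_neg + alpha)


def hardest_negative(anchor, negatives):
    """Select the negative sample closest to the anchor."""
    return min(negatives, key=lambda n: euclidean(anchor, n))


anchor, positive = [0.0, 0.0], [0.1, 0.0]
negatives = [[1.0, 0.0], [0.3, 0.0]]

hard = hardest_negative(anchor, negatives)
print(triplet_loss(anchor, positive, hard))          # informative triplet, non-zero loss
print(triplet_loss(anchor, positive, negatives[0]))  # easy triplet, zero loss
```

The easy triplet contributes nothing to training because its loss is already zero, which illustrates why the choice of input samples dominates the effectiveness of metric learning.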

A study on face classification algorithms
Generally, the train data in face recognition are large-scale datasets diversified with variations in gender, ethnicity, profession, etc. In contrast, the gallery set is much smaller and application specific (e.g., mugshot images of persons of interest). Often, gallery images are disjoint from the train data. Even if the gallery set was included in the much larger train set, each update to the gallery would require complete retraining of the network. In these situations, the effort to use the trained model without alteration for online face recognition is inconvenient and naive. To this end, deep face recognition exploits the strategy of transfer learning. In this approach, as shown in Fig. 6, the network learns from large volumes of train data and the trained model is used to generate features for test faces. A shallow classifier is then used on the generated features for face identification. In doing so, the enrolment of gallery samples, i.e., training the shallow classifier, is carried out as an intermediary step between offline model training and online face recognition. The enrolment could be a model-based approach or a template learning approach. This section aims to discuss the algorithms used in this shallow classification process. Generally, transfer learning includes a source domain which is trained offline and a target domain for online processing [59,91,92]. In this context, the source domain is the large datasets used for offline training of the network model and the target domain is the online face recognition data. Prior to DeepFace, transfer learning meant fine-tuning the network model with the gallery samples. DeepFace presented a varied approach to transfer learning for face recognition. The DeepFace net was initially trained as a traditional multi-class face classification problem. The authors considered the output of the last fully connected layer as a raw feature representation of the input face.
With this notion, DeepFace used two identical DNNs with shared weights to simultaneously generate feature vectors for two faces for face verification. Contemporary research that exploited a similar concept is the DeepId series [28,54-56], which used the generated feature vectors for tasks like face verification and recognition. This feature vector based classification was exploited in face recognition in two main approaches: (1) subject-specific modelling and (2) template-based learning.
When several gallery images are available for a single subject, multiple feature vectors per subject result. These feature vectors can be modeled into a single representation for the subject. This is generally carried out with the use of algorithms like support vector machines (SVM). The model-based approaches yield optimum performance when multiple images per subject are available. In other circumstances, template-based learning is a straightforward approach. In template learning, the unknown feature vector, i.e., template, is compared to known templates to find the nearest neighbor.
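The two classification approaches can be contrasted in a short sketch. The example is illustrative only: the gallery labels and feature values are hypothetical, cosine similarity is an assumed choice of comparison measure, and the subject-specific model is a simple mean template used here as a stand-in for a trained classifier such as an SVM.

```python
import math


def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    dot = sum(p * q for p, q in zip(a, b))
    return dot / (math.sqrt(sum(p * p for p in a)) * math.sqrt(sum(q * q for q in b)))


def identify_by_template(probe, gallery):
    """Template learning: return the label of the most similar enrolled template."""
    return max(gallery, key=lambda item: cosine(probe, item[1]))[0]


def enrol_subject_models(gallery):
    """Subject-specific modelling (simplified): average all templates of each subject."""
    grouped = {}
    for label, template in gallery:
        grouped.setdefault(label, []).append(template)
    return {
        label: [sum(column) / len(templates) for column in zip(*templates)]
        for label, templates in grouped.items()
    }


def identify_by_model(probe, models):
    """Return the subject whose model is most similar to the probe."""
    return max(models, key=lambda label: cosine(probe, models[label]))


# Hypothetical enrolled gallery of (label, template) pairs.
gallery = [("alice", [1.0, 0.1]), ("alice", [0.9, 0.2]), ("bob", [0.1, 1.0])]
probe = [0.8, 0.3]

models = enrol_subject_models(gallery)
print(identify_by_template(probe, gallery), identify_by_model(probe, models))
```

Template learning keeps every enrolled vector and needs no further training when the gallery changes, while the model-based variant compresses each subject into one representation and benefits from having several images per subject.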
In contrast to image-based face recognition, video-based face recognition generally has more than one face image for a probe subject, spread across a set of consecutive frames. Hence, multiple feature vectors are available for comparison [58,62,63,93,94] during the classification. Studies on video face recognition have carried out classification in three main approaches: (1) classification on each frame, (2) result pooling over a set of frames [62,63,94], and (3) integration of information across frames for one-time face recognition [58,95]. The second and third approaches maintain the information across all frames and have reported progress on the IJB benchmark.
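The difference between result pooling and feature integration can be sketched as follows. This is an illustrative sketch under stated assumptions: majority voting and mean pooling are assumed pooling rules and may differ from the schemes used in the cited works, and the gallery and frame features are hypothetical.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    dot = sum(p * q for p, q in zip(a, b))
    return dot / (math.sqrt(sum(p * p for p in a)) * math.sqrt(sum(q * q for q in b)))


def nearest_label(template, gallery):
    """Label of the gallery template most similar to the given template."""
    return max(gallery, key=lambda item: cosine(template, item[1]))[0]


def pool_results(frame_templates, gallery):
    """Approach (2): classify every frame, then take a majority vote."""
    votes = Counter(nearest_label(t, gallery) for t in frame_templates)
    return votes.most_common(1)[0][0]


def pool_features(frame_templates, gallery):
    """Approach (3): average the frame features, then classify once."""
    mean = [sum(column) / len(frame_templates) for column in zip(*frame_templates)]
    return nearest_label(mean, gallery)


gallery = [("alice", [1.0, 0.0]), ("bob", [0.0, 1.0])]
frames = [[0.9, 0.1], [0.8, 0.3], [0.4, 0.6]]  # the last frame alone would match "bob"

print(pool_results(frames, gallery), pool_features(frames, gallery))
```

In this toy case a single poor frame would be misclassified on its own, but both pooling strategies recover the identity supported by the remaining frames.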

Face datasets and evaluation benchmarks
The data serves two purposes in a typical face recognition system: it serves as training data and as benchmarks for system validation. It is known that the quality of train data has a huge impact on the performance of a DNN. Similarly, the quality of the validation data has a huge impact on the reliability of the benchmark results. The term "quality" refers to the size and the level of inter- and intra-class variations. The intra-class variation is a measure of the depth of the dataset, i.e., the number of images per individual, and the inter-class variation is achieved by increasing the breadth, i.e., the number of individuals in the dataset (Table 3). Initially, face datasets consisted of high-quality images mostly featuring celebrities [31,88,96]. Datasets that followed were more practicality driven and hence consisted of data captured in unconstrained environments (e.g., surveillance footage) [39,40]. Moreover, several datasets aimed at including challenging variations like age gaps [97][98][99][100], pose [101], disguise [102], and ethnic variations [37,67].
Over the years, face recognition systems have been employing train datasets of increasing scale. Facebook once used a dataset of 500 million images of over 10 million subjects for training face recognition models [103], and DeepFace [27] was trained on a dataset of 4 million facial images from 4000 subjects. The success of these systems, backed up by large-scale private datasets, attracted research attention towards large and openly accessible face datasets like VGGFace2 [78].
The evaluation benchmarks are generally disjoint from the train datasets. They provide an estimate of the reliability of the trained model under different protocols like face verification, closed-set face identification, and open-set face identification [104]. For an unbiased comparison, the results are reported in the notation specified by the benchmark. Face verification is the task of determining whether two faces belong to the same identity or not. Verification accuracy is generally represented by the receiver operating characteristic (ROC) [31]. The ROC curve plots the true acceptance rate against the false acceptance rate. Closed-set face identification is the task of identifying a probe against the pre-defined gallery with the assumption that the probe has a mate in the gallery. The accuracy of closed-set recognition is commonly denoted using the cumulative match characteristic (CMC) [39,40]. The CMC curve measures the percentage of true identifications within a given rank, i.e., rank 5 identification accuracy denotes the true identifications within the top 5 predictions. Open-set face identification is the task of identifying a probe against the pre-defined gallery while being open to the possibility that the probe may not have a mate in the gallery. The open-set face recognition accuracy can be denoted using the decision error trade-off (DET) [39]. The DET curve plots the false-negative identification rate (FNIR) as a function of the false-positive identification rate (FPIR). This section aims to provide an overview of face datasets that have been effective in face recognition, discussing their important features, advantages, and disadvantages.
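The verification and closed-set identification measures described above can be computed with a few lines of code. The following is a minimal sketch with hypothetical similarity scores and predictions; the function names are illustrative and do not correspond to any benchmark toolkit.

```python
def tar_far(genuine_scores, impostor_scores, threshold):
    """True and false acceptance rates when pairs scoring >= threshold are accepted."""
    tar = sum(s >= threshold for s in genuine_scores) / len(genuine_scores)
    far = sum(s >= threshold for s in impostor_scores) / len(impostor_scores)
    return tar, far


def roc_points(genuine_scores, impostor_scores):
    """(FAR, TAR) pairs obtained by sweeping the threshold over all observed scores."""
    thresholds = sorted(set(genuine_scores) | set(impostor_scores), reverse=True)
    points = []
    for t in thresholds:
        tar, far = tar_far(genuine_scores, impostor_scores, t)
        points.append((far, tar))
    return points


def cmc_accuracy(ranked_predictions, true_labels, rank):
    """Fraction of probes whose true identity appears within the top `rank` predictions."""
    hits = sum(truth in preds[:rank] for preds, truth in zip(ranked_predictions, true_labels))
    return hits / len(true_labels)


# Hypothetical similarity scores for same-identity and different-identity pairs.
genuine = [0.9, 0.8, 0.4]
impostor = [0.7, 0.3, 0.2, 0.1]
print(tar_far(genuine, impostor, 0.5))

# Hypothetical ranked gallery predictions for three probes of subject "a".
ranked = [["a", "b", "c"], ["b", "a", "c"], ["c", "b", "a"]]
truth = ["a", "a", "a"]
print([cmc_accuracy(ranked, truth, k) for k in (1, 2, 3)])
```

Sweeping the threshold traces the ROC curve for verification, while increasing the rank traces the CMC curve for closed-set identification.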

LFW [31]
LFW is by far the most effective benchmark for unconstrained face recognition. The dataset comprises 13,233 images of 5749 people under varying conditions of pose, lighting, focus, resolution, etc. The cropped faces are detections produced by the Haar cascade-based face detector of Viola and Jones [105].
The benchmark targets the pair matching problem/face verification. Two evaluation protocols are provided: (1) restricted, the pairs are provided, and (2) unrestricted, the pairs can be generated as per user's preference. The ROC curve is used for recording the results.

YTF [96]
Following LFW, a similar dataset and benchmark were released for evaluating unconstrained face recognition in videos. The dataset comprises 3425 videos of 1595 individuals. These individuals are a subset of those in the LFW dataset.
Since the dataset was designed to align with LFW, the benchmark tests were designed the same way. The benchmark includes pair matching tests under two protocols: restricted and unrestricted.

VGGFace [37]
VGGFace [37] consists of 2.6 million images of 2622 individuals. Although it is recognized as one of the largest publicly available datasets for training, the refined dataset, in which label noise is removed by human annotators, consists of 800,000 images.

VGGFace2 [78]
VGGFace2 consists of 3.31 million images of 9131 classes, giving an average of 362.6 images per class. The dataset was created with the aim of achieving a higher depth and breadth. The additional design goals of the dataset include achieving a wide range of age, pose, and ethnic variations.

CASIA-WebFace [88]
The CASIA-WebFace dataset consists of a total of 453,453 images over 10,575 identities. The data is collected from the IMDb website. The dataset is designed to be compatible with the LFW benchmark, meaning that there is no overlap between the two datasets. Hence, a system trained on CASIA-WebFace can be independently evaluated on LFW.

CelebFaces [106]
CelebFaces contains 87,628 face images of 5436 celebrities from the Internet, with approximately 16 images per person on average.

Ms-celeb-1m [107]
The Ms-celeb-1m dataset consists of a benchmark test, which includes evaluation data and an evaluation protocol, and a separate dataset for training. The evaluation dataset comprises data from one million celebrities, and the training dataset comprises approximately 10 million images of 100,000 celebrities.

MegaFace [67]
The MegaFace challenge evaluates the performance of face recognition and face verification with up to 1 million distractors. Moreover, it includes protocols for age-invariant face recognition. The probe data collection of MegaFace is composed of two datasets: (1) the Face-Scrub dataset [108], which consists of 100,000 photos of 530 celebrities, and (2) the FG-Net dataset [109,110], which consists of 975 photos of 82 people. The latter encompasses age variation, with photos spanning many ages of each subject. The MegaFace distractor data, i.e., the gallery collection, includes 1 million photos of more than 690,000 unique subjects collected from Yahoo's Flickr dataset [111]. The evaluation protocol for face recognition is as follows. Let the probe set have M faces of a subject, out of which one is placed in the gallery of 1 million distractors. The face recognition system is provided with the remaining M-1 images. The system is expected to learn from these M-1 images and rank the distractor set in order of similarity. Ideally, the one image from the probe set should be ranked in first place. The results are provided via CMC curves. For evaluations on face verification, all pairs between the probe set and the distractor set are provided within the dataset. This contains 4 billion negative pairs. The verification results are provided via ROC curves.

IJB [39-41]
In contrast to the LFW benchmark, which used a commodity face detector, the IJB dataset provides a set of face images that are manually aligned (Fig. 8). The manual alignment process aims at preserving challenging variations such as pose, occlusion, and illumination that are generally filtered out by automated detection. The dataset is a collection of media in the wild containing both images and videos. It contains media from 500 individuals, gathered so as to produce a near-uniform geographic distribution. The complete dataset comprises 5712 images and 2085 videos.
This dataset is benchmarked for face verification and for closed-set and open-set face recognition. Performance evaluation on IJB follows 10-fold cross-validation. The dataset is divided into 10 random training and test splits, with 333 subjects allocated for training in each split and the remaining 167 subjects for testing. The training set can be used either to fine-tune the network or to experimentally derive the optimum threshold distance between two facial feature vectors; when this distance is exceeded, the faces are concluded to be of different identities. The test set is then split into two parts, a gallery set and a probe set. Each subject has media in both sets. The media in the probe set are used as search terms, and the gallery set is the database that each probe is tested against.
To facilitate the open-set classification problem, 55 randomly picked subjects are removed from the gallery. In the protocol specified for face verification, the genuine and imposter pairs are provided following the LFW convention, but to increase the difficulty, the imposter pairs are selected with restrictions so as to pick pairs of greater similarity. The performance is reported using ROC, CMC, and DET curves.
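To make the open-set setting concrete, the sketch below computes one (FPIR, FNIR) operating point for a set of searches in which some probes have no mate in the gallery. It assumes rank 1 decisions on similarity scores, with a candidate returned only when the top score reaches the threshold; the function name, the tuple layout, and the toy data are hypothetical and do not reproduce the IJB evaluation code.

```python
def open_set_rates(searches, threshold):
    """Return (FPIR, FNIR) at a given similarity threshold.

    Each search is (top_label, top_score, true_label), where true_label is
    None when the probe subject has been removed from the gallery.
    """
    mated = [s for s in searches if s[2] is not None]
    non_mated = [s for s in searches if s[2] is None]
    # False positive: a probe without a mate still returns a candidate.
    false_pos = sum(score >= threshold for _, score, _ in non_mated)
    # False negative: a mated probe does not return its true mate at rank 1.
    false_neg = sum(not (score >= threshold and label == truth)
                    for label, score, truth in mated)
    return false_pos / len(non_mated), false_neg / len(mated)


searches = [
    ("alice", 0.92, "alice"),  # correct identification
    ("bob", 0.55, "carol"),    # wrong identity returned: a miss for carol
    ("dave", 0.30, "dave"),    # right identity, but below the threshold
    ("alice", 0.71, None),     # probe without a mate raises a false alarm
    ("bob", 0.20, None),       # probe without a mate is correctly rejected
]
fpir, fnir = open_set_rates(searches, threshold=0.5)
```

Repeating the computation over a range of thresholds would yield the points of a DET curve.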

DeepFace [27]
DeepFace uses a nine-layer deep neural network with more than 120 million parameters for face recognition. Softmax loss was employed to train the network, and the training dataset was a private dataset of four million facial images of more than 4000 identities. The system also implements an effective pre-processing mechanism in which a 3D model is used to align faces into a canonical pose. In summary, the success of DeepFace is due to three main factors: (1) a sound pre-processing step, (2) the network architecture, and (3) large-scale training data. In addition to the proposed system, DeepFace also presents an end-to-end face verification system using a Siamese network. Following the training, the network excluding the classification layer is replicated twice to generate features simultaneously for two images. The generated feature vectors are compared to decide whether the two images are of the same person.

DeepId series [28, 54-56]
DeepId introduced the concept that when a CNN is trained for face classification with approximately 10,000 identities, and the network is designed such that the number of neurons is reduced higher in the feature extraction hierarchy, the top layers produce compact identity-related features with only a few neurons. These identity features, referred to as DeepIds, can then be generalized to other tasks like face verification. This approach of learning facial feature representations through a classification task has conceptual similarities to the Siamese network proposed by DeepFace.
The network used in DeepId consists of four convolutional layers, each followed by a max pooling layer. On top of this lies the fully connected layer, which is referred to as the DeepId layer because the DeepIds are extracted from it. The DeepId layer is followed by the top layer, which is a softmax layer. The DeepIds extracted from this network are fed to a joint Bayesian technique, through which the verification is carried out. The system was trained on an extended version of CelebFaces [106], codenamed CelebFaces+, which contains 202,599 face images of 10,177 celebrities. The system yielded 97.45% verification accuracy on unconstrained face verification in LFW. Following the success of DeepId, DeepId2 suggested that including both face identification signals and face verification signals (contrastive loss) for supervision can further increase the accuracy of face recognition/verification systems. This hypothesis was based on the premise that the face identification signals contribute to increasing inter-personal variations, whereas face verification signals contribute to reducing intra-personal variations. DeepId2 achieved 99.15% LFW accuracy. This performance was further improved by DeepId2+, which introduced two system improvements: (1) increasing the dimension of hidden representations and (2) introducing supervisory signals to early convolution layers. Please refer to Fig. 7 for the DeepId2+ network.
Adding to the continuous improvements, DeepId3 used both identification and verification signals as supervision, but on deeper architectures than those of previous DeepId versions. The DeepId3 nets were influenced by the VGGNet (stacking of convolutions to achieve increased depth) and GoogLeNet (Inception) architectures. With this implementation, an ensemble of two DeepId3 nets achieved 99.53% LFW accuracy (Fig. 8).
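The joint supervision described for DeepId2 can be sketched as the sum of a softmax cross-entropy term (identification signal) and a contrastive term on a pair of feature vectors (verification signal). The sketch below is illustrative only: the weighting factor `lam`, the margin value, and the function names are assumptions chosen for the example and are not the settings reported by the authors.

```python
import math


def identification_loss(logits, label):
    """Softmax cross-entropy for one sample (identification signal)."""
    peak = max(logits)  # subtract the maximum for numerical stability
    log_sum = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_sum - logits[label]


def verification_loss(f_i, f_j, same_identity, margin=1.0):
    """Contrastive loss on a pair of feature vectors (verification signal)."""
    dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(f_i, f_j)))
    if same_identity:
        return 0.5 * dist ** 2                    # pull same-identity pairs together
    return 0.5 * max(0.0, margin - dist) ** 2     # push different identities beyond the margin


def joint_loss(logits, label, f_i, f_j, same_identity, lam=0.05):
    """Identification loss plus a weighted verification loss (lam is a hypothetical weight)."""
    return identification_loss(logits, label) + lam * verification_loss(f_i, f_j, same_identity)


same_pair = verification_loss([0.0, 0.0], [3.0, 4.0], same_identity=True)
far_pair = verification_loss([0.0, 0.0], [3.0, 4.0], same_identity=False)
close_pair = verification_loss([0.0, 0.0], [0.0, 0.5], same_identity=False)
```

In this toy example, the same-identity pair is penalized for being far apart, the distant different-identity pair incurs no penalty, and the close different-identity pair is penalized for falling inside the margin.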

VGGFace [37]
Inspired by VGGNet, which showed that deeper convolutional networks can be more effective in large-scale image recognition, VGGFace applies the same concept to face recognition. The authors employed a modified version of the architecture presented in VGGNet and trained it on the VGGFace dataset. The authors evaluated two loss functions, softmax loss and triplet loss, and concluded that the triplet loss provides better overall performance. Nonetheless, the authors report that training the network as a classifier with softmax loss makes the training significantly easier and faster.

Template adaptation [59]
The VGGFace system was later used for transfer learning with template adaptation. In this implementation, the deep CNN features from the pre-trained VGGFace network are combined with linear SVMs trained at test time [59]. The one-vs-rest linear SVMs are reported to increase the discriminative power of the feature space.

FaceNet [29]
Instead of training a face recognition system in the form of a conventional classifier, FaceNet implements a system which directly maps the input face thumbnails to a compact Euclidean space. The Euclidean space is generated such that the l2 distance between faces of the same identity is small, whereas the l2 distance between a pair of face images from different identities is large. This is enabled by triplet loss which, by definition, aims at minimizing the distance between pairs of the same identity while maximizing the distance between pairs of different identities. The authors used two DNNs: (1) the Zeiler and Fergus architecture [125] and (2) the GoogLeNet architecture [36]. The nets were trained on an in-house dataset of 100-200 million face images of about 8 million different identities. Out of the two nets used, the Zeiler and Fergus net achieved an impressive LFW accuracy of 99.63% and a YTF accuracy of 95.12%.
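The triplet loss described above can be written for a single (anchor, positive, negative) triplet as the hinge of the difference between the two squared l2 distances plus a margin. The following is a minimal sketch; the margin value and the two-dimensional toy embeddings are assumptions for illustration, and the embeddings are not normalized as they would be in a trained system.

```python
def squared_l2(u, v):
    """Squared Euclidean distance between two embeddings."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge on d(anchor, positive) - d(anchor, negative) + margin."""
    return max(0.0, squared_l2(anchor, positive) - squared_l2(anchor, negative) + margin)


anchor = [0.0, 0.0]
positive = [0.0, 1.0]        # same identity as the anchor
easy_negative = [2.0, 0.0]   # different identity, already far from the anchor
hard_negative = [0.5, 0.0]   # different identity, closer than the positive

easy = triplet_loss(anchor, positive, easy_negative)
hard = triplet_loss(anchor, positive, hard_negative)
```

The easy triplet already satisfies the margin and contributes no loss, whereas the hard triplet yields a positive loss that would drive the negative away from the anchor during training.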

Baidu [57]
The authors present a network comprising 9 convolutions trained with triplet loss. The system reports a near-perfect LFW verification accuracy. The authors conclude that triplet loss, compared to multi-class classification, is more suitable for face verification and retrieval problems.

DLIB [64]
Dlib [64] is a library written in C++ which provides software components targeting specialities like data mining, machine learning, image processing, and linear algebra. The library includes a face recognition component that uses a modified version of ResNet-34 [70] to obtain a unique embedding for each face thumbnail. The output feature vectors have 128 dimensions, and the network is trained using triplet loss. The network has been trained on a dataset of 3 million images. The face recognition component of the Dlib library employs transfer learning, allowing the user to provide an annotated dataset against which the probe face image/video is compared. During the enrolment process, the pre-trained model generates feature vectors for the annotated face images, which are then stored. During the recognition process, the Euclidean distance between the probe feature vector and each of the stored gallery feature vectors is calculated. During the classification, if the calculated distance lies below a pre-defined threshold, the two faces are considered to be of the same identity. This implementation identifies one or more subjects as possible identities of the unknown face.
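The enrolment and recognition steps described above amount to a thresholded nearest-neighbour search over stored embeddings. The sketch below illustrates the idea in plain Python and does not use the Dlib API; the helper names, the threshold value of 0.6, and the three-dimensional toy embeddings (in place of 128-dimensional ones) are assumptions made for the example.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two embeddings."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def enrol(gallery, name, embedding):
    """Store an embedding under the subject's name."""
    gallery.setdefault(name, []).append(embedding)


def identify(gallery, probe, threshold=0.6):
    """Return the names whose closest stored embedding lies below the threshold, nearest first."""
    matches = []
    for name, embeddings in gallery.items():
        best = min(euclidean(probe, e) for e in embeddings)
        if best < threshold:
            matches.append((best, name))
    return [name for _, name in sorted(matches)]


gallery = {}
enrol(gallery, "alice", [0.0, 0.0, 0.0])
enrol(gallery, "bob", [1.0, 0.0, 0.0])

near_alice = identify(gallery, [0.1, 0.0, 0.0])   # only alice is within the threshold
ambiguous = identify(gallery, [0.5, 0.0, 0.0])    # equally close to both subjects
unknown = identify(gallery, [5.0, 5.0, 5.0])      # no stored subject is close enough
```

Because every enrolled subject below the threshold is returned, an ambiguous probe can yield more than one candidate identity, which mirrors the behaviour described in the text.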

OpenFace [65]
OpenFace [65] is a face recognition system open sourced under the Apache 2.0 license. The system was developed with the purpose of bridging the gap between the publicly available face recognition systems and the state-of-the-art high-performing private systems. The system is based on concepts introduced in GoogLeNet [36] and FaceNet [29].
OpenFace uses a modified version of the nn4 network from GoogLeNet, which was also used in FaceNet. The DNN is trained using triplet loss. The output feature vectors obtained from this trained model have 128 dimensions.
The face classification is carried out by a subject-specific modelling approach using a linear SVM. Given the labeled face images of the training data, the system generates a feature vector for each face. Then, the feature vectors are fed to the SVM, which creates a model based on them. When provided with the feature vector of an unknown face image, the SVM model classifies the unknown face.

FaceNet: re-implementation (FaceNet_Re) [66]
This openly accessible face recognition system is a modified re-implementation of FaceNet [29]. The system provides three pre-trained models of the Inception ResNet V1 architecture, trained with varying loss functions and training datasets. As seen in Table 4, the model trained with softmax loss and VGGFace2 reported the highest LFW accuracy of the three. Similar to the DeepId series, once trained, the inference network, which is the network without the top layer, is used as the pre-trained model to generate feature vectors of 512 dimensions. Similar to the OpenFace implementation, an SVM classifier is used for the classification task.

SphereFace [60] and CosFace [61]
SphereFace and CosFace are two face recognition systems which were used to introduce the SphereFace loss and the CosFace loss, respectively. Both systems use the ResNet-64 architecture and are trained on CASIA-WebFace. Additionally, CosFace trains the system with another private dataset and reports a higher performance.

ArcFace [38]
ArcFace, which is a quite recent publication, implements a series of DNNs (ResNet-100, ResNet-50, and ResNet-34) along with the ArcFace loss. This system outputs a 512-dimensional feature vector for face images. The DNNs were trained on a modified version of the Ms Celeb dataset (ms1m). In a series of experimental results, the authors show that this implementation outperforms the majority of the reported state-of-the-art results.

Neural aggregation network (NAN) [58]
NAN is a system designed for video face recognition. It comprises a deep network and an aggregation module. The deep network generates feature vectors for faces in video frames. The aggregation module aggregates the feature vectors to form a single feature inside the convex hull spanned by them. This aggregation is invariant to the image order and hence does not utilize the temporal information across video frames. The network used in the paper is of the GoogLeNet [36] architecture with batch normalization [73]. Face verification is carried out with a Siamese NAN structure with two NANs trained with contrastive loss. Face identification is carried out by adding a fully connected layer on top of the NAN for softmax loss. The training dataset uses about 3M face images of 50K identities from the Internet.
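The aggregation step can be illustrated with a single attention block: each frame feature receives a score, the scores are turned into non-negative weights that sum to one, and the weighted sum is returned. Such a weighted sum lies inside the convex hull of the frame features and does not depend on the frame order. The sketch assumes dot-product scoring against one query vector, a softmax over the scores, and toy two-dimensional features; the names used are hypothetical and the code is not the authors' implementation.

```python
import math


def aggregate(features, query):
    """Attention-weighted average of frame features.

    Returns the pooled feature and the per-frame weights.
    """
    scores = [sum(q * x for q, x in zip(query, f)) for f in features]
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]  # non-negative and summing to one
    dim = len(features[0])
    pooled = [sum(w * f[d] for w, f in zip(weights, features)) for d in range(dim)]
    return pooled, weights


frames = [[1.0, 0.0], [0.0, 1.0], [0.2, 0.9]]
query = [2.0, -1.0]  # hypothetical learned query vector

pooled, weights = aggregate(frames, query)
pooled_reversed, _ = aggregate(list(reversed(frames)), query)

# With an all-zero query, every frame is weighted equally (plain averaging).
mean_pooled, mean_weights = aggregate([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])
```

A zero query reduces the scheme to average pooling, so learned scoring can be viewed as a generalization of the simple frame average.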

Bilinear CNNs (B-CNN) [62]
The system uses a symmetric bilinear-CNN model comprising two ImageNet-pretrained "M-net" models from VGG's MatConvNet [126]. The models are fine-tuned with the Face-Scrub dataset. One-versus-rest linear SVM classifiers are trained on the gallery set during experiments.

DCNNmanual+metric [63]
The paper presents an end-to-end system for face verification. The authors train a DCNN with 10 convolutional layers, 5 pooling layers, and 1 fully connected layer on the CASIA-WebFace dataset [88]. The system uses joint Bayesian metric learning [127,128] for face verification. Out of the presented deep nets, the network named DCNNmanual+metric yields the best performance. DCNNmanual+metric uses the model trained on the CASIA-WebFace dataset, further fine-tuned using IJB-A [39] and its extended version, the Janus Challenging set 2 dataset. The system uses cosine distance as a measure of similarity between faces. Manual stands for using training data with manual annotation, and metric stands for applying metric learning to compute similarity.

LFW (2007)
LFW has been the commodity benchmark for face verification over the last decade. Table 4 presents a summary of recent milestones in face recognition alongside the reported LFW accuracy. The high accuracies reported on LFW indicate that the benchmark has reached saturation, creating a requirement for more advanced benchmarks. This near-perfect performance on LFW has been explained by Klare et al. [39] in terms of the nature of the face detector used. This commodity face detector, despite attractive features such as scalability and real-time efficiency, is not resilient to variations in visual data. Once the faces are mined using this detector, variations like pose, occlusion, and illumination are filtered out. The resulting clear, good-quality frontal images make the task easier for face recognition systems and thus overlook the probable challenges in advanced applications like intelligent surveillance. In comparison to the face recognition results reported on larger benchmarks like MegaFace (Table 5), the smaller dataset size can be identified as a second factor that enables higher accuracy on LFW.

MegaFace
The MegaFace challenge advocates evaluation of deep face recognition in the presence of a million distractors. The aim of the benchmark is to scale with real-world applications, which usually involve recognizing a face at a planetary scale.
"Algorithms that achieve above 95% performance on LFW (equivalent of 10 distractors in our plots), achieve 35-75% identification rates with 1 million distractors," reports MegaFace. Among the results reported on this benchmark, Google's FaceNet, which achieved near-perfect LFW accuracy, recorded an accuracy in the 70% range on MegaFace. The other noteworthy results were those of a commercial system named NTech-Lab. While the situation reported in 2014-2015 was neither perfect nor impressive, the years that followed brought progress in recognition results [60,61]. The recent results reported by ArcFace [38] indicate an impressive near-perfect accuracy on the MegaFace benchmark. Please refer to Table 5 for a summary of identification and verification results on MegaFace.
Table 6 presents a summary of reported face recognition results on the IJB benchmark. While the reported results on this benchmark are comparatively higher than those on MegaFace, the results are neither perfect nor near-perfect. Hence, these results indicate that face recognition is challenged by complications in unconstrained data. A noteworthy fact regarding this benchmark is that, since the dataset includes multiple images and videos for a single recognition, the system should ideally include a mechanism to fully exploit the excess information. While the authors of the dataset suggest subject-specific modelling, systems like B-CNN have employed other approaches like result pooling.

Study 3: Experimental analysis
Bianco et al. [77] presents ... OpenFace and DLIB use the HoG face detector [116], while FaceNet_Re uses the MTCNN face detector [115]. To avoid dependencies on the detectors, only the faces detected by both algorithms were considered in the experiment. To account for dependencies on the different classification approaches, the two systems that used subject-specific modeling with linear SVMs (OpenFace and FaceNet_Re) were modified to perform template comparison in a manner similar to that of DLIB (nearest neighbor based on Euclidean distance). In addition, the results from the original SVM implementation were also reported for comparison.
Depending on the use case, the recognition could be from still images to still images (S2S), from video to video (V2V), or from still images to video (S2V). Several benchmarks have addressed the first two approaches: LFW and MegaFace address S2S, and IJB addresses a combination of S2V, S2S, and V2V tests. While some benchmarks have made efforts to address S2V, these are problem-specific datasets with some form of bias [130]. Hence, the experiment measures S2V recognition with the LFW dataset as the set of gallery images and selected videos of the YTF dataset as the probe videos. The experiment measures the rank 1 recognition accuracy with increasing gallery sizes. In addition, the average time taken by each system to run the forward pass on a single face thumbnail is compared.
Figure 9 plots the recognition accuracy against the gallery size. The graph supports two observations: (1) comparing the three systems with template learning (TL) as the classifier, FaceNet_Re TL and OpenFace TL show a performance decrease as the gallery grows, whereas the DLIB system shows considerable stability; and (2) comparing the same system with SVM and template learning classifiers, in both instances (OpenFace and FaceNet_Re) the SVM is effective with a limited number of subjects, but its performance drops drastically as the number of subjects increases, whereas one-to-one template learning is comparatively more stable against larger gallery sets. Since many studies have encouraged the use of subject-specific modeling to better utilize all the available information from multiple visual data [27,39-41,59], it is important to properly analyze the strengths and weaknesses of different modeling approaches. The popularity of SVM in image classification can be explained by its ability to scale well with high-dimensional data [131][132][133]. Although this works well when provided with a small number of classes, an increased number of classes with limited training data per class could complicate the process of finding the separating hyperplane.
Table 7 reports the average time taken by each system to run the DCNN model on a single face thumbnail, as recorded on an Intel Core i7-7740X CPU @ 4.30GHz. The times reflect the underlying computational complexity involved in feature extraction from raw pixels. OpenFace and FaceNet_Re, which include Inception modules in their frameworks, reported shorter forward-pass times than the DLIB model. Among the limited records in the literature on computational efficiency, DeepFace reports a 0.18 s feature extraction time on a single-core Intel 2.2GHz CPU, and FaceNet reports 30 ms/image on a mobile phone with a small NN, which is reported to have lower, yet sufficient-for-face-clustering, accuracy.

Open issues
Starting from face verification with high-quality data, face recognition has advanced over recent years to address complicated scenarios like face recognition in unconstrained images and video face recognition. Simultaneously, face recognition benchmarks like IJB and MegaFace have aimed to replicate real-world applications. While the reports indicate continuous progress, there are some unaddressed issues in terms of face recognition systems and benchmarks.

A comparative analysis for face recognition accuracy
Several studies have carried out experimental evaluations comparing state-of-the-art DCNN frameworks for image classification [68,76,134]. These experiments provide unbiased comparisons of the systems. This is particularly important in situations where not all the systems are evaluated on the same benchmark. The condition applies to face recognition as well. While almost all face recognition systems report LFW accuracy, systems that were implemented prior to benchmarks like IJB and MegaFace do not provide evaluation results on them. Hence, there is a need for these systems to be evaluated under a common benchmark.

A comparative analysis for computational complexity
While the studies report the recorded accuracy, only a limited number of publications report the associated computational complexity. Although offline processing is generally flexible with respect to computational complexity, low complexity is one of the most critical requirements in real-time applications. Hence, there is a need for a comparative analysis of deep face recognition systems with respect to performance indices like computational complexity, memory consumption, and inference time.

End-to-end systems
The majority of the studies and benchmarks tend to isolate face recognition as an individual discipline and hence do not provide sufficient insight into critical issues arising from the inevitable integration with modules like face detection (e.g., false recognitions resulting from false detections). Despite a limited number of studies [63,135,136], end-to-end face recognition is still an open research problem.

Multi-model face recognition
Most deep face recognition systems generate a single feature embedding for each face. This approach considers holistic features and does not contemplate component-level features. Several studies have aimed to implement multi-model face recognition systems to make optimum use of the diverse information in a face image [137][138][139]. Several studies have made efforts to fuse multiple descriptors across the face [140][141][142]. These systems show that, despite the possibility of increased computational complexity, multi-model systems can yield positive results. Hence, this line of study can be improved, targeting applications that require offline processing.

Multi-face recognition and tracking
The benchmarks and systems for video face recognition portray the problem as face recognition on a set of face images per subject [39][40][41]. These benchmarks do not evaluate face tracking. Nonetheless, face tracking is of vital importance in multi-face recognition in videos. In this scenario, the pixel-level information in a video frame and the temporal information across video frames can be fused for an improved result. While face tracking and face clustering have been studied as separate disciplines [143][144][145][146], in practical applications they are applied along with face recognition. Hence, evaluating state-of-the-art face recognition systems along with face tracking is a possible research direction with practical use.

Ensemble of deep learning and traditional face descriptors
While deep face descriptors are becoming the main feature representations for face recognition, traditional visual appearance descriptors can be used as additional guidance. Recent developments have demonstrated effective usage of traditional visual descriptors in image processing tasks such as image semantic learning [147] and text mining in complex background images [148]. Exploring their effectiveness in an ensemble with deep learning is a possible direction for future research.

Video face recognition
Frame-wise face recognition, feature aggregation across frames, and score pooling across frames are popular approaches to video face recognition. The first approach provides a crisp classification output stating that the probe face belongs to identity x. Unconstrained videos, where faces are subjected to motion blur and other factors such as partial faces due to pose, might require aggregation of several partial truths into a higher truth. Although score pooling and feature aggregation are straightforward aggregations across video frames, there is room for more sophisticated algorithms such as inference based on fuzzy logic. These can be adapted from research in similar disciplines like image annotation [149]. Through this mechanism, a degree of certainty can be assigned to the classification output based on factors like the quality of the image and the fraction of the face visible. While temporal attention has proven to be effective in video face recognition, research disciplines like video captioning have shown improved performance by including spatial affinities in the attention [150]. Hence, spatial-temporal attention emerges as a possible research direction for video face recognition.

Application-specific designs
The expected functionality of face recognition varies with the application. A face recognition application designed for intelligent surveillance, where the cost of false alarms (registered individuals recognized as possible intruders) is high and the cost of missed alarms (possible intruders recognized as registered) is even higher, should strive for minimum FPIR with reasonable flexibility on false-negative rejections. In contrast, applications that detect persons of interest are expected to have minimum FNIR (a person of interest recognized as unknown) with reasonable flexibility on FPIR (a false alarm where a regular individual is identified as a person of interest). Hence, the need for scenario-specific designs and benchmarks cannot be overlooked.

Basic challenges for face recognition
Despite many architectural enhancements and diversified datasets, face recognition still has scope for improvement in terms of elementary complications arising from visual variations like pose, expression, occlusion, and illumination. In addition to direct studies like expression-invariant face recognition [151], face recognition under occlusion [152], or illumination-invariant face recognition [52], tools like video segmentation [153,154] and region-of-interest extraction [155] can provide potential indirect assistance for face recognition on noisy imagery. Regardless of the varying modeling approaches and application-specific fine-tuning, face recognition has mainly been influenced by DCNN frameworks, loss functions, classification algorithms, and training data. The continuous advancement of deep network architectures in image classification generates networks adaptable for face recognition. The studies of Dong et al. [156], Bruna and Mallat [157], and Hankins et al. [158] present image classification networks with prospects for face classification. Hence, face recognition will remain an active research area striving for sophisticated frameworks.

Conclusion
This survey has presented the origin, evolution, and a comparative analysis of 18 face recognition systems. Through this, the survey aims to provide informational guidance to stimulate future research. In doing so, the paper has analyzed the performance of the systems in terms of results reported on three benchmarks, which address different aspects of face recognition, and an experimental study. Additionally, the survey has discussed the open issues in face recognition along with a note on possible future research.