- Research Article
- Open Access
Face Recognition from Still Images to Video Sequences: A Local-Feature-Based Framework
© Shaokang Chen et al. 2011
- Received: 30 April 2010
- Accepted: 9 December 2010
- Published: 15 December 2010
Although automatic face recognition has shown success for high-quality images under controlled conditions, it is hard to attain similar levels of performance for video-based recognition. We describe in this paper recent advances in a project being undertaken to trial and develop advanced surveillance systems for public safety. We propose a local facial feature-based framework for both still image and video-based face recognition. The evaluation is performed on a still image dataset (LFW) and a video sequence dataset (MOBIO) to compare, on 3 different features, 4 methods that operate on features, namely feature averaging (Avg-Feature), Mutual Subspace Method (MSM), Manifold to Manifold Distance (MMD), and Affine Hull Method (AHM), and 4 methods that operate on distance. The experimental results show that the Multi-region Histogram (MRH) feature is more discriminative for face recognition than Local Binary Patterns (LBP) and raw pixel intensity. When only a small number of images is available per person, feature averaging is more reliable than MSM, MMD, and AHM and is much faster. Thus, our proposed framework, which averages MRH features, is more suitable for CCTV surveillance systems with constraints on the number of images and the speed of processing.
- Face Recognition
- Video Sequence
- Face Image
- Local Binary Pattern
- Scale Invariant Feature Transform
After the bombing attack in 2005, special attention has been paid to the use of CCTV for surveillance to prevent such attacks in the future. Based on the number of CCTV cameras on Putney High Street, it is "guesstimated" that there are around 500,000 CCTV cameras in the London area and 4,000,000 cameras in the UK. This implies that there is approximately one camera for every 14 people in the UK. Given the huge number of cameras, it is impossible to hire enough security guards to constantly monitor all camera feeds. Hence, the CCTV feeds are generally recorded without monitoring, and the videos are mainly used for a forensic or reactive response to crime and terrorism after it has happened. However, the immense cost of successful terrorist attacks in public spaces shows that forensic analysis of videos after the event is simply not an adequate response. In the case of suicide attacks, there is no possibility of prosecution after the event, so merely recording surveillance video provides no deterrent to terrorism. There is an emerging need to detect events and persons of interest from CCTV videos before any serious attack happens. This means that cameras must be monitored at all times.
However, two main constraints restrict human monitoring of CCTV videos. One important issue is the limited number of videos that a person can monitor simultaneously. A large number of cameras requires many operators, resulting in high ongoing costs. Another issue is that such a personnel-intensive system may not be reliable, because human attention decreases rapidly when performing such tedious tasks for a long time. One possible solution is advanced surveillance systems that employ computers to monitor all video feeds and deliver alerts to human operators for response. Because of this, there has been an urgent need in both industry and the research community to develop advanced surveillance systems, sometimes dubbed Intelligent CCTV (ICCTV). In particular, developing total solutions for protecting critical infrastructure has been at the forefront of R&D activities in this field [2–4].
The outline of this paper is as follows: we review the state-of-the-art techniques for still image and video-based face recognition in Section 2, followed by a discussion of still images and video sequences for surveillance in Section 3; we then propose the Multiregion Histogram for still image face recognition in Section 4; the extension of MRH to video-based face recognition is presented in Section 5; Section 6 presents the conclusions and future work.
2.1. Still Image Face Recognition
Research on still image face recognition has been done for nearly half a century. Two main approaches have been proposed for illumination invariant recognition. One is to represent images with features that are less sensitive to illumination changes [5, 6], such as the edge maps of the image. This approach suffers from the fact that features generated from shadows are related to illumination changes and may have an impact on recognition. Experiments by Adinj et al. show that even with the best image representations, the misclassification rate is more than 20%. Another approach is to construct a low-dimensional linear subspace for images of faces taken under different lighting conditions [8, 9]. This approach is based on the assumption that images of a convex Lambertian object under variable illumination form a convex cone in the space of all possible images. Around 3 to 9 images are required to construct the convex cone. Nevertheless, the surface of a human face is neither perfectly Lambertian nor convex. Therefore, it is hard for these methods to deal with cast shadows. Furthermore, these systems need several images of the same face taken under different controlled lighting source directions to construct a model of a given face.
As for expression invariant recognition, it is still unsolved for machine recognition and is even a difficult task for humans. In [11, 12], images are morphed to be the same shape as the one used for training. But it is not guaranteed that all images can be morphed correctly; for example, an image with closed eyes cannot be morphed to a neutral image because of the lack of texture inside the eyes. Liu et al. propose to use optical flow for face recognition with facial expression variations. However, it is hard to learn the local motions within the feature space to determine the expression changes of each face, since the way one person expresses a certain emotion is normally somewhat different from others. Martinez proposed a weighting method to deal with facial expressions. An image is divided into several local areas, and those that are less sensitive to expression changes are chosen and weighted independently. But features that are insensitive to expression changes may be sensitive to illumination variations.
Pose variability is usually considered to be the most challenging problem. There are three main approaches developed for 2D-based pose invariant face recognition. Wiskott et al. proposed Elastic Bunch Graph Matching, which applies Gabor filters to extract pose invariant features. In [16–18] multiple-view templates are used to represent faces with different poses. Multiple-view approaches require several gallery images per person under controlled view conditions to identify a face, which restricts their application when only one image is available per person. Face synthesis methods have emerged in an attempt to overcome this issue. Gao et al. constructed a Face-Specific Subspace by synthesising novel views from a single image. A method for direct synthesis of face model parameters has also been proposed. An Active Appearance Model- (AAM-) based face synthesis method has been applied for face recognition subject to relatively small pose variations. A recurring problem with AAM-based synthesis and multiview methods is the need to reliably locate facial features to determine the pose angle for pose compensation; this turns out to be a difficult task in its own right.
The above methods can handle certain kinds of face image variation successfully, but drawbacks still restrict their application. It may be risky to rely heavily on choosing invariant features [5, 6, 14, 15], such as using edge maps of the image or choosing expression insensitive regions. This is because features insensitive to one variation may be highly sensitive to other variations, and it is very difficult to abstract features that are completely immune to all kinds of variation . Some approaches attempt to construct face-specific models to describe possible variations under changes in lighting or pose [8, 9, 19, 22]. Such methods require multiple images per person taken under controlled conditions to construct a specific subspace for each person for the face representation. This leads to expensive image capture processes, poor scalability of the face model, and does not permit applications, where only one gallery image is available per person. Other approaches divide the range of variation into several subranges (e.g., low, medium, and high pose angles) and construct multiple face spaces to describe face variations lying in the corresponding subrange [16–18]. These approaches require us to register several images representing different variations per person into the corresponding variation models so that matching can be done in each interval individually. Once again, acquiring multiple images per person under specific conditions is often very difficult, if not impossible, in practice.
2.2. Video-Based Face Recognition
In recent years, increasing attention has been paid to the video-based face recognition. Many approaches were proposed to use temporal information to enhance face recognition for videos. One direct approach is temporal voting. A still image-matching mechanism is proposed by Satoh for matching two video sequences . The distance between two videos is the minimum distance between two frames across two videos. Zhou and Chellappa presented a sequential importance sampling (SIS) method to incorporate temporal information in a video sequence for face recognition . A state space model with tracking state vector and recognizing identity variable was used to characterize the identity by integrating motion and identity information over time. However, this approach only considers identity consistency in temporal domain, and thus it may not work well when the face is partially occluded. Zhang and Martinez applied a weighted probabilistic approach on appearance face models to solve the occlusion problem . Their experiment shows that this approach can improve the performance for PCA, LDA, and ICA. The approach proposed in  uses the condensation algorithm to model the temporal structures.
Some approaches utilize spatial information by considering frames from videos as still image sets without considering their temporal information. In one approach, person-specific models are trained from video sequences to form many individual eigenspaces, and the angles between subspaces are considered as the similarity between videos. In another, each person is represented by a low-dimensional appearance manifold learned from training exemplars sampled from videos. The probabilistic likelihood of the linear models is propagated through the transition matrix between different pose manifolds. An exemplar-based probabilistic approach has also been proposed, in which representative face images are selected as exemplars from training videos by radial basis functions. This approach can model small 2D motion effectively, but it cannot handle large pose variation or occlusion. Topkaya and Bayazit applied dimensional analysis on the representative frames selected based on facial features and the corresponding positions.
Most of the recent approaches utilize spatiotemporal information for face recognition in video. A sparse representation of face is learned from video for online face recognition under unconstrained conditions . Principal component null space analysis (PCNSA) is proposed in , which is helpful for nonwhite noise covariance matrices. The Autoregressive and Moving Average (ARMA) model method is proposed in  to model a moving face as a linear dynamical object. Liu and Chen proposed an adaptive Hidden Markov Model (HMM) on dynamic textures for video-based face recognition. Kim et al. applied HMM to solve the visual constraints problem for face tracking and recognition .
The above approaches for face recognition in video have several main drawbacks. Firstly, person-specific facial dynamics are useful for discriminating between different persons, but intrapersonal temporal information that is related to facial expression and emotion is also encoded and used. Secondly, consistent weights are normally assigned to spatiotemporal features, based on the observation that some features are more helpful for recognition; because the weights are not adaptively assigned, they may be harmful when face appearance changes dramatically, especially in the case of occlusion, where some features may disappear. Thirdly, most of the methods require well-aligned faces, which limits their usage in practice. Last but not least, most of the above approaches utilize holistic facial features, while local facial features, which have been shown to be useful for image analysis and face recognition on still images, are not well investigated.
For face recognition in surveillance scenarios, identifying a person captured on image or video is one of the key tasks. This implies matching faces on both still images and video sequences. It can be further classified into three categories: still image to still image matching, video sequence to video sequence matching, and still image to video sequence matching.
However, there are some major advantages of video sequences. First, we can employ spatial and temporal information of faces in the video sequence to improve still images recognition performance. Second, psychophysical and neural studies have shown that dynamic information is very crucial in the human face recognition process . Third, with redundant information, we can reconstruct more complex representations of faces such as a 3D face model  or super-resolution images  and apply them to improve recognition performance. Fourth, some online learning techniques can be applied for video-based face recognition to update the model over time .
Since we need to do both still image and video-based face recognition under surveillance conditions, the above approaches are not suitable. Most still image face recognition techniques are not appropriate for surveillance images due to the following concurrent and uncontrolled factors. Pose, illumination, and expression variations have been shown to have a great impact on face recognition. Image resolution changes due to variable distances to cameras are another factor that influences recognition performance. The face localization error induced by an automatic face detector will also affect the recognition results, as there is no guarantee that the localization is perfect (e.g., misalignment or wrong scale). In addition to image properties, a surveillance system may have further constraints: a limited number of images, for example, only one gallery image per person, as well as real-time operation requirements in order to handle large volumes of people. Many still image face recognition techniques are restricted to medium- to high-resolution face images and require expensive computation or multiple gallery images, so they are not applicable to surveillance. Most video-based face recognition approaches are designed for video-to-video matching and cannot be used for still image recognition. Moreover, the above approaches rely heavily on good face detection and feature localization, which is impractical under surveillance conditions, where images are of low resolution and processing should be in real time. We thus develop a framework for both still image and video-based face recognition under surveillance scenarios using local facial features. This approach can handle low-resolution face image recognition with pose, illumination, and expression variations to a certain degree and is not sensitive to localization errors. Moreover, the computation for this approach is fast enough for real-time processing.
In this section, we describe a Multiregion Histogram- (MRH-) based approach with the aim of concurrently addressing the above-mentioned problems.
4.1. Multiregion Histograms of Visual Words
where the $g$th element in the histogram vector $\mathbf{h}$ is the posterior probability of the feature vector $\mathbf{x}$ according to the $g$th component of a visual dictionary model. The visual dictionary model is built from a convex mixture of Gaussians, parameterised by $\lambda = \{w_g, \boldsymbol{\mu}_g, \mathbf{C}_g\}_{g=1}^{G}$, where $G$ is the number of Gaussians, while $w_g$, $\boldsymbol{\mu}_g$, and $\mathbf{C}_g$ are the weight, mean vector, and covariance matrix for Gaussian $g$, respectively. The mean of each Gaussian can be regarded as a particular "visual word."
The DCT decomposition acts like a low-pass filter, which retains features robust to small alterations due to in-plane rotations, expression changes, or smoothing due to upsampling from low-resolution images. The overlapping of blocks during feature extraction, as well as the loss of spatial relations within each region (due to averaging), results in robustness to translations of the face that are caused by imperfect face localization. We note that in the region configuration (used in ) the overall topology of the face is effectively lost, while in configurations such as it is largely retained (while still allowing for deformations in each region).
The visual dictionary is obtained by pooling a large number of feature vectors from training faces, followed by employing the Expectation Maximisation algorithm  to optimise the dictionary's parameters (i.e., ).
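The histogram construction described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: it assumes diagonal covariance matrices for brevity, takes the block feature vectors as given (the DCT feature extraction is not shown), and uses a hypothetical two-word dictionary with made-up values. The function names are illustrative only.

```python
import math

def gaussian_pdf_diag(x, mean, var):
    """Density of a Gaussian with diagonal covariance (assumption made for brevity)."""
    log_p = 0.0
    for xi, mi, vi in zip(x, mean, var):
        log_p += -0.5 * (math.log(2.0 * math.pi * vi) + (xi - mi) ** 2 / vi)
    return math.exp(log_p)

def block_posteriors(x, weights, means, variances):
    """Posterior probability of each visual word (Gaussian) given one block feature x."""
    joint = [w * gaussian_pdf_diag(x, m, v)
             for w, m, v in zip(weights, means, variances)]
    total = sum(joint)
    return [j / total for j in joint]

def region_histogram(block_features, weights, means, variances):
    """Average the per-block posterior vectors over all blocks of one region."""
    hist = [0.0] * len(weights)
    for x in block_features:
        post = block_posteriors(x, weights, means, variances)
        for g, p in enumerate(post):
            hist[g] += p
    return [h / len(block_features) for h in hist]

# Hypothetical toy dictionary: two visual words in a 2-D feature space.
weights = [0.5, 0.5]
means = [[0.0, 0.0], [4.0, 4.0]]
variances = [[1.0, 1.0], [1.0, 1.0]]

# Three made-up block features from one region; two lie near the first word.
blocks = [[0.1, -0.2], [3.9, 4.2], [0.0, 0.3]]
hist = region_histogram(blocks, weights, means, variances)
print(hist)
```

Because each block contributes a vector of posteriors that sums to one, the averaged region histogram also sums to one, regardless of how many blocks the region contains.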
4.2. Normalised Distance
where $C_i$ is the $i$th cohort face and $M$ is the number of cohorts. In the above equation cohort faces are assumed to be reference faces that are known not to be of persons depicted in $A$ or $B$. As such, the two cohort terms in the denominator estimate how far away, on average, faces $A$ and $B$ are from the face of an impostor. This typically results in (4) being approximately 1 when $A$ and $B$ represent faces from two different people and less than 1 when $A$ and $B$ represent two instances of the same person. If the conditions of given images cause their raw distance to increase, the average raw distances to the cohorts will also increase. As such, the division in (4) attempts to cancel out the effect of varying image conditions.
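The cohort normalisation described above can be sketched as follows. Two details are assumptions made for illustration, not taken from the text: the raw distance is computed as an L1 distance averaged over regions, and the denominator is taken as the mean of the two average cohort distances. All histogram values are made up.

```python
def raw_distance(face_a, face_b):
    """Mean L1 distance between corresponding region histograms (assumed form)."""
    total = 0.0
    for hist_a, hist_b in zip(face_a, face_b):
        total += sum(abs(a - b) for a, b in zip(hist_a, hist_b))
    return total / len(face_a)

def normalised_distance(face_a, face_b, cohorts):
    """Raw distance divided by the average raw distance of both faces to the cohorts."""
    avg_a = sum(raw_distance(face_a, c) for c in cohorts) / len(cohorts)
    avg_b = sum(raw_distance(face_b, c) for c in cohorts) / len(cohorts)
    return raw_distance(face_a, face_b) / (0.5 * (avg_a + avg_b))

# Toy faces with a single region and a two-word histogram each (hypothetical values).
person_1_a = [[0.9, 0.1]]
person_1_b = [[0.8, 0.2]]
person_2 = [[0.1, 0.9]]
cohorts = [[[0.2, 0.8]], [[0.5, 0.5]], [[0.3, 0.7]]]

same = normalised_distance(person_1_a, person_1_b, cohorts)
different = normalised_distance(person_1_a, person_2, cohorts)
print(same, different)
```

The cohort set only needs to be compared against each face once, so the cohort distances of gallery faces can be cached if many probes are matched against the same gallery.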
4.3. Empirical Evaluation
This approach is evaluated on LFW dataset  which contains 13,233 face images with variations in pose, illumination, expression, in-plane rotation, resolution, and localization (resulting in scale or translation error). The images of LFW were obtained from the Internet, and faces were centered, scaled, and cropped based on bounding boxes provided by an automatic face locator. We normalize the extracted faces to pixels, with an average distance between eyes of 32 pixels.
The test protocol of LFW is verification based: the task is to classify whether a pair of previously unseen faces is of the same person (matched pair) or two different persons (mismatched pair). The protocol specifies two views of the dataset: view 1, aimed at algorithm development and model selection, and view 2, aimed at final performance reporting. In view 1, there are 1100 matched and 1100 mismatched pairs in the training set and 500 unseen matched and 500 unseen mismatched pairs in the test set. We use the training set to construct the visual dictionary and to optimize the threshold. In view 2, the images are split into 10 sets, each with 300 matched and 300 mismatched pairs. A 10-fold cross-validation is performed, with each subset used in turn for testing and the remaining 9 for training. Performance is reported as the mean and standard error of the accuracies over all 10 subsets. The standard error is useful for assessing the significance of performance differences across algorithms.
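The reporting step of this protocol can be sketched as follows, assuming that a pair is declared matched when its distance falls below a threshold and that the standard error is the sample standard deviation of the fold accuracies divided by the square root of the number of folds. The fold accuracies and distances below are placeholder values, not results from the paper.

```python
import math

def pair_accuracy(distances, is_matched, threshold):
    """Fraction of pairs classified correctly; a pair is 'matched' if distance < threshold."""
    correct = sum(1 for d, m in zip(distances, is_matched) if (d < threshold) == m)
    return correct / len(distances)

def mean_and_standard_error(fold_accuracies):
    """Mean accuracy over folds and its standard error (sample std / sqrt(n))."""
    n = len(fold_accuracies)
    mean = sum(fold_accuracies) / n
    sample_var = sum((a - mean) ** 2 for a in fold_accuracies) / (n - 1)
    return mean, math.sqrt(sample_var / n)

# Placeholder accuracies for 10 folds (hypothetical values).
fold_accuracies = [0.70, 0.72, 0.74, 0.72, 0.71, 0.73, 0.72, 0.70, 0.74, 0.72]
mean_acc, std_err = mean_and_standard_error(fold_accuracies)

# Toy pair classification with a hypothetical threshold.
toy_acc = pair_accuracy([0.4, 0.9, 1.2, 0.7], [True, True, False, False], threshold=0.8)
print(mean_acc, std_err, toy_acc)
```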
Results on view 2 of LFW. MRH approaches used a 1024-component visual dictionary.
MRH (probabilistic, normalised distance)
MRH (probabilistic, raw distance)
MRH (probabilistic, normalised distance)
PCA (normalised distance)
PCA (raw distance)
Randomised Binary Trees (RBT)
Single LE + holistic
For intelligent surveillance systems, automatic face recognition should be performed for both still images and video sequences. Thus, normal video-based face recognition techniques are not suitable for this task since they are designed only for video-to-video matching. In an attempt to retain the ability for still image face recognition and to be capable of still-to-video and video-to-video matching, we propose the following approaches to enhance MRH for face recognition on videos. In this section, we explore four methods that operate on features to build up a more representative model for classification, as well as four methods that operate on the distance between vectors to improve the performance. By investigating these approaches, we attempt to choose the most suitable method that takes advantage of multiframe information in a computationally inexpensive manner for image-set and video-set matching. As part of the investigation into this problem, a subset of the LFW database is used to test image-set matching, and a large-scale audiovisual database called "Mobio Biometry" (MOBIO) is used to test video-set matching.
5.1. Operation on Feature
In this approach, several methods are inspected, which utilize multiple feature vectors of the sample images in a set to build up a more representative model of faces. In other words, they attempt to extract more meaningful new features from the existing features. In the following sections, we will discuss them in more detail.
5.1.1. Feature Averaging
To extend still image face recognition to video sequences, a direct approach is to apply still image recognition to each frame in the video set. But this approach is computationally expensive and does not fully utilize the spatial and temporal information of the video. For example, to identify a face from a probe video with $F$ frames in a video database with $S$ video sequences, an exhaustive search needs to perform still image matching $F \times S \times \bar{F}$ times, where $\bar{F}$ is the average number of frames per sequence. A video of only 10 seconds contains about 300 frames (at a normal frame rate of 30 fps). This means that matching two such videos requires about 90,000 times the computation of matching two still images.
where $N$ is the number of selected frames for video $V$. By the above averaging, we statistically average both spatial and temporal information of faces. The average over frames directly integrates temporal information, and the region averaging of MRH merges spatial information.
The similarity measure between two videos is the normalized distance between the average histograms as defined in (4). With this averaging approach, recognition requires only one distance calculation per gallery video, which is comparable to still image recognition.
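A minimal sketch of feature averaging, together with the cost comparison against exhaustive frame-by-frame matching, is given below. The data layout (a video as a list of frames, each frame a list of region histograms) and the function names are assumptions made for illustration.

```python
def average_histograms(frames):
    """Average region histograms over frames: one averaged face per video."""
    n_frames = len(frames)
    averaged = []
    for r in range(len(frames[0])):
        n_bins = len(frames[0][r])
        averaged.append([sum(f[r][g] for f in frames) / n_frames
                         for g in range(n_bins)])
    return averaged

def frame_pair_matches(probe_frames, gallery_frames):
    """Number of still image matches when every probe frame meets every gallery frame."""
    return probe_frames * gallery_frames

# Toy video: two frames, two regions, two visual words (hypothetical values).
video = [
    [[0.8, 0.2], [0.1, 0.9]],
    [[0.6, 0.4], [0.3, 0.7]],
]
avg_face = average_histograms(video)

# Two 10-second videos at 30 fps: 300 x 300 frame pairs versus one averaged comparison.
exhaustive = frame_pair_matches(300, 300)
print(avg_face, exhaustive)
```

After averaging, each video is represented by a single set of region histograms, so the same distance function used for still images can be applied unchanged to still-to-video and video-to-video matching.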
5.1.2. Manifold Distance
Manifold distance is an emerging area in image-set matching . Mutual Subspace Method (MSM) was one of the earliest (and still competitive) approaches within this school of thought . In MSM the principal angle between two subspaces is considered as the similarity measure. The basic idea of principal angles has been extended into kernel principal angles  or discriminative canonical correlation  with promising results.
where $\theta$ is the principal angle between the two subspaces $\mathcal{L}_1$ and $\mathcal{L}_2$. One straightforward and numerically robust way to compute the canonical correlations is based on Singular Value Decomposition (SVD). Considering that $\mathbf{P}_1$ and $\mathbf{P}_2$ are orthonormal bases for subspaces $\mathcal{L}_1$ and $\mathcal{L}_2$, respectively, the canonical correlations are the singular values of $\mathbf{P}_1^{T}\mathbf{P}_2$. In MSM the largest eigenvalue is used as the distance between two manifolds.
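The SVD-based computation of canonical correlations can be sketched with NumPy as follows. The way each subspace basis is obtained here (the leading left singular vectors of the matrix whose columns are the feature vectors of one image set) is an assumption for illustration; the text does not specify how the bases are built, and the toy data are made up.

```python
import numpy as np

def orthonormal_basis(samples, dim):
    """Orthonormal basis of the subspace spanned by the columns of `samples`.

    samples: array of shape (n_features, n_samples), one feature vector per column.
    dim: number of basis vectors to keep (assumed choice, not from the paper).
    """
    u, _, _ = np.linalg.svd(samples, full_matrices=False)
    return u[:, :dim]

def canonical_correlations(basis_a, basis_b):
    """Cosines of the principal angles: singular values of basis_a^T basis_b."""
    s = np.linalg.svd(basis_a.T @ basis_b, compute_uv=False)
    return np.clip(s, 0.0, 1.0)  # guard against rounding slightly outside [0, 1]

# Toy image sets in a 3-D feature space: set A spans the x-y plane, set B the x-z plane.
set_a = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]])
set_b = np.array([[1.0, 0.0], [0.0, 0.0], [0.0, 1.0]])

cc = canonical_correlations(orthonormal_basis(set_a, 2), orthonormal_basis(set_b, 2))
angles_deg = np.degrees(np.arccos(cc))
print(cc, angles_deg)
```

In this toy case the two planes share the x axis, so the largest canonical correlation is 1 (principal angle of 0 degrees) and the second is 0 (principal angle of 90 degrees).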
where and are the s of image set and , respectively.
5.1.3. Affine Hull Method
5.2. Operation on Distance
where is the distance between and .
5.3. Empirical Evaluation
The above approaches for video-based face recognition are evaluated on LFW and MOBIO datasets. For fair comparison, the four methods with operation on features are actually applied on three different facial features: MRH, Local Binary Patterns (LBP), and raw pixel intensity. The four methods for operation on distance are actually applied on the defined distance between these features. In the following section, we will describe the experiments on the above two datasets individually.
5.3.1. LFW Multiple Images Set Match
The image-set matching is evaluated on subsets of LFW. We follow an image pair verification protocol similar to that of LFW. We first evaluate the image-set matching with 3 images per set. 620 image-set pairs are generated from the LFW dataset, with 310 pairs for training and 310 pairs for testing. Each pair contains two image sets with 3 images in each set. Images in the testing set are never included in the training set. In order to remove bias, the pairs generated in our experiments are balanced so that the number of matched pairs and mismatched pairs is the same. Similarly, 432 pairs (216 training pairs and 216 testing pairs) are generated for image-set matching with 4 images in a set. We test the following four methods that operate on features: feature averaging (Avg-Feature), the manifold distance by applying MSM and MMD on facial features, and the affine hull method (AHM). To comprehensively investigate the influence of operation on features, we test the following three features: Multiregion Histogram (MRH), Local Binary Patterns (LBP) [51, 64], and raw pixel intensity. For comparison, we also apply four methods that operate on the distance between vectors (features) for thorough image-set matching.
Verification results for image-set matching of LFW (columns: number of images, operation on feature, operation on distance).
MMD is slightly worse than MSM because of the limitation on the number of images (only 3 or 4 images) per set. With only a few images per set, MSM constructs a more representative subspace than the subspaces modelled by MMD, because MSM uses all available features to construct a linear subspace, whilst MMD only uses a subset of the features to construct several linear subspaces. Thus, the performance of MSM is slightly better than that of MMD on LFW. In the extreme case where only two images are available per set, MMD cannot be applied to generate linear subspaces. MMD performs better than MSM only when there are many more images in an image set. Previously reported results show an improvement of MMD over MSM with 300 to 500 frames per video. Under the constraint that only a few images (fewer than 10) are acquired for each set, MMD generally cannot perform as well as MSM.
5.3.2. MOBIO Videos Set Match
The MOBIO dataset was collected as part of a European collaborative project to facilitate research and development on robust-to-illumination face and speaker video verification systems in uncontrolled environments. The quality of mobile images is generally poor, with blur from motion and smudged lenses and changes in illumination between scenes, which is similar to the conditions experienced in CCTV videos with out-of-focus images, motion blur, and dirty camera lenses. The experiments in this paper use only the development subset of the MOBIO database. In this subset, 1,500 probe videos are each compared to every person in the gallery (20 females and 27 males), each of whom has 5 videos.
Because the MOBIO database does not provide face locations, OpenCV's Haar Feature-based Cascade Classifier is used to detect faces in each frame. The faces are then tracked over multiple frames using the Continuously Adaptive Mean-SHIFT Tracker (CAMSHIFT) with colour histograms. Once the faces are detected, eyes are further located within the face using a Haar-based classifier. If no eyes are located, they are approximated from the size of the face detected. The faces are then resized and cropped such that the eyes are centered with a 32-pixel intereye distance. For these experiments, a closely cropped face of pixels was used which excludes outer features surrounding the face such as hair and beard. In the surveillance context, such peripheral features can be easily used as disguises. Due to the low quality of the videos in the MOBIO database and the limited robustness of the face detector, 7% of all the videos yield 2 or fewer extracted face images.
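The resizing step can be illustrated with a small geometric sketch: given detected eye centres, it computes the scale factor that yields a 32-pixel intereye distance and the eye midpoint on which the crop would be centred. The function name and the eye coordinates are hypothetical, and face detection, tracking, and the actual cropping are not shown.

```python
import math

def eye_alignment(left_eye, right_eye, target_eye_distance=32.0):
    """Scale factor and eye midpoint for resizing a face to a fixed intereye distance."""
    dx = right_eye[0] - left_eye[0]
    dy = right_eye[1] - left_eye[1]
    eye_distance = math.hypot(dx, dy)
    scale = target_eye_distance / eye_distance
    midpoint = ((left_eye[0] + right_eye[0]) / 2.0,
                (left_eye[1] + right_eye[1]) / 2.0)
    return scale, midpoint

# Hypothetical eye centres (x, y) in pixels, 64 pixels apart in the original frame.
scale, midpoint = eye_alignment((100.0, 120.0), (164.0, 120.0))
print(scale, midpoint)
```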
Based on the observation in the LFW test that methods operating on features perform better than those operating on distance and that operations on MRH features outperform those on other features, in this video set match we only evaluate operation-on-feature approaches for MRH. Due to the limitations of the face detector, fewer than 2 images with faces can be extracted from some videos, and MMD is not applicable to those videos. Thus, we only test the following three methods: MSM, AHM, and Avg-Feature.
Average time for processing one video for different numbers of selected frames (column: number of frames).
Half total error rate results on MOBIO dataset obtained from .
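For reference, the half total error rate is conventionally the mean of the false acceptance rate and the false rejection rate at a given decision threshold. The sketch below assumes that a trial is accepted when its distance falls below the threshold; the distances, labels, and threshold are made-up values, not results from MOBIO.

```python
def half_total_error_rate(distances, is_genuine, threshold):
    """HTER = (FAR + FRR) / 2, accepting a trial when distance < threshold."""
    false_accepts = sum(1 for d, g in zip(distances, is_genuine)
                        if not g and d < threshold)
    false_rejects = sum(1 for d, g in zip(distances, is_genuine)
                        if g and d >= threshold)
    n_impostor = sum(1 for g in is_genuine if not g)
    n_genuine = sum(1 for g in is_genuine if g)
    far = false_accepts / n_impostor
    frr = false_rejects / n_genuine
    return 0.5 * (far + frr)

# Three genuine and three impostor trials (hypothetical distances).
distances = [0.2, 0.4, 0.9, 1.1, 0.6, 1.3]
is_genuine = [True, True, True, False, False, False]
hter = half_total_error_rate(distances, is_genuine, threshold=0.8)
print(hter)
```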
In this paper, we reviewed state-of-the-art face recognition techniques for still images and video sequences. Most of these existing approaches need well-aligned face images and only perform either still image face recognition or video-to-video matching. They are not suitable for face recognition under surveillance scenarios for the following reasons: the number of face images that can be extracted from each video is limited (around ten) due to large variations in pose and lighting; face image alignment cannot be guaranteed because of poor video quality; and computational resources are constrained by the need for real-time processing. We then proposed a local facial feature-based framework for still image and video-based face recognition under surveillance conditions. This framework is generic and capable of still-to-still, still-to-video, and video-to-video matching in real time. The approach is evaluated for still image and video-based face recognition on the LFW image dataset and the MOBIO video dataset. Experimental results show that the MRH feature is more discriminative for face recognition with illumination, pose, and expression variations and is less sensitive to alignment errors. Empirical evaluation on video-based recognition with 8 methods for operation on feature and operation on distance shows that operation on features generally performs better. When the number of images per set is small (less than 10), the best performance is achieved by Avg-Feature rather than by other recent advanced methods such as MSM, MMD, and AHM. MSM, MMD, and AHM tend to overfit when only a small number of samples is available, though they might outperform Avg-Feature with hundreds of images available per set. They are also much slower than Avg-Feature. Thus, for face recognition under surveillance scenarios, Avg-Feature is more suitable, subject to the constraints on the number of images and real-time processing.
Though experiments show that the MRH feature is more reliable than other local features, such as LBP, GJD, and SIFT, recent research has identified some more robust features, for example, Learning-based Descriptors (LE). It is worth investigating the averaging effect on these features.
Besides technical challenges, data collection is one of the main issues for research on surveillance systems. Privacy laws or policies may prevent surveillance footage being used for research even if the video is already being used for security monitoring. Careful consultation and negotiation should be carried out before any real-life trials of intelligent surveillance systems.
This project is supported by a grant from the Australian Government Department of the Prime Minister and Cabinet. NICTA is funded by the Australian Government's Backing Australia's Ability initiative, in part through the Australian Research Council.
- McCahill M, Norris C: Urbaneye: CCTV in London. Centre for Criminology and Criminal Justice, University of Hull, UK; 2002.
- Francisco G, Roberts S, Hanna K, Heubusch J: Critical infrastructure security confidence through automated thermal imaging. Infrared Technology and Applications XXXII, April 2006, Kissimmee, Fla, USA, Proceedings of SPIE 6206.
- Fuentes LM, Velastin SA: From tracking to advanced surveillance. Proceedings of the International Conference on Image Processing (ICIP '03), September 2003 121-124.
- Ziliani F, Velastin S, Porikli F, Marcenaro L, Kelliher T, Cavallaro A, Bruneaut P: Performance evaluation of event detection solutions: the CREDS experience. Proceedings of IEEE Conference on Advanced Video and Signal Based Surveillance (AVSS '05), September 2005 201-206.
- Gao Y, Leung MKH: Face recognition using line edge map. IEEE Transactions on Pattern Analysis and Machine Intelligence 2002, 24(6):764-779. 10.1109/TPAMI.2002.1008383
- Yilmaz A, Gökmen M: Eigenhill vs. eigenface and eigenedge. Proceedings of the International Conference on Pattern Recognition, 2000 827-830.
- Adini Y, Moses Y, Ullman S: Face recognition: the problem of compensation for changes in illumination direction. IEEE Transactions on Pattern Analysis and Machine Intelligence 1997, 19(7):721-732. 10.1109/34.598229
- Basri R, Jacobs DW: Lambertian reflectance and linear subspaces. IEEE Transactions on Pattern Analysis and Machine Intelligence 2003, 25(2):218-233. 10.1109/TPAMI.2003.1177153
- Georghiades AS, Belhumeur PN, Kriegman DJ: From few to many: illumination cone models for face recognition under variable lighting and pose. IEEE Transactions on Pattern Analysis and Machine Intelligence 2001, 23(6):643-660. 10.1109/34.927464
- Belhumeur PN, Kriegman DJ: What is the set of images of an object under all possible illumination conditions? International Journal of Computer Vision 1998, 28(3):245-260. 10.1023/A:1008005721484
- Beymer D, Poggio T: Face recognition from one example view. Proceedings of the 5th International Conference on Computer Vision, June 1995 500-507.
- Black MJ, Fleet DJ, Yacoob Y: Robustly estimating changes in image appearance. Computer Vision and Image Understanding 2000, 78(1):8-31. 10.1006/cviu.1999.0825
- Liu X, Chen T, Kumar BVKV: Face authentication for multiple subjects using eigenflow. Pattern Recognition 2003, 36(2):313-328. 10.1016/S0031-3203(02)00033-X
- Martínez AM: Recognizing imprecisely localized, partially occluded, and expression variant faces from a single sample per class. IEEE Transactions on Pattern Analysis and Machine Intelligence 2002, 24(6):748-763. 10.1109/TPAMI.2002.1008382
- Wiskott L, Fellous JM, Krüger N, Von Malsburg CD: Face recognition by elastic bunch graph matching. IEEE Transactions on Pattern Analysis and Machine Intelligence 1997, 19(7):775-779. 10.1109/34.598235
- Beymer D: Feature correspondence by interleaving shape and texture computations. Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition, June 1996 921-928.
- Pentland A, Moghaddam B, Starner T: View-based and modular eigenspaces for face recognition. Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition, June 1994 84-91.
- Sankaran P, Asari V: A multi-view approach on modular PCA for illumination and pose invariant face recognition. Proceedings of the 33rd Applied Imagery Pattern Recognition Workshop, October 2004 165-170.
- Gao W, Shan S, Chai X, Fu X: Virtual face image generation for illumination and pose insensitive face recognition. Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing, April 2003 776-779.
- Sanderson C, Bengio S, Gao Y: On transforming statistical models for non-frontal face verification. Pattern Recognition 2006, 39(2):288-302. 10.1016/j.patcog.2005.07.001
- Shan T, Lovell BC, Chen S: Face recognition robust to head pose from one sample image. Proceedings of the 18th International Conference on Pattern Recognition (ICPR '06), August 2006 515-518.
- Harandi MT, Nili Ahmadabadi M, Araabi BN: Optimal local basis: a reinforcement learning approach for face recognition. International Journal of Computer Vision 2009, 81(2):191-204. 10.1007/s11263-008-0161-5
- Satoh S: Comparative evaluation of face sequence matching for content-based video access. Proceedings of IEEE Conference on Automatic Face and Gesture Recognition, 2000 163-168.
- Zhou S, Chellappa R: Probabilistic human recognition from video. Proceedings of the European Conference on Computer Vision, 2002, Copenhagen, Denmark 681-697.
- Zhang Y, Martinez AM: A weighted probabilistic approach to face recognition from multiple images and video sequences. Asian Security Review 2006, 24: 626-638.
- Zhou S, Krueger V, Chellappa R: Face recognition from video: a condensation approach. Proceedings of IEEE Conference on Automatic Face and Gesture Recognition, 2002, Washington, DC, USA 221-228.
- Shakhnarovich G, Moghaddam B: Face recognition in subspaces. In Handbook of Face Recognition. Springer, New York, NY, USA; 2004.
- Lee KC, Ho J, Yang MH, Kriegman D: Video-based face recognition using probabilistic appearance manifolds. Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition, June 2003 313-320.
- Kruger V, Zhou S: Exemplar-based face recognition from video. Proceedings of the European Conference on Computer Vision, 2002, Copenhagen, Denmark 361-365.
- Topkaya IS, Bayazit NG: Improving face recognition from videos with preprocessed representative faces. Proceedings of the 23rd International Symposium on Computer and Information Sciences (ISCIS '08), October 2008.
- Tangelder J, Schouten B: Learning a sparse representation from multiple still images for online face recognition in an unconstrained environment. Proceedings of the International Conference on Pattern Recognition, 2006 3: 1087-1090.
- Vaswani N, Chellappa R: Principal components null space analysis for image and video classification. IEEE Transactions on Image Processing 2006, 15(7):1816-1830.
- Soatto S, Doretto G, Wu Y: Dynamic textures. Proceedings of the International Conference on Computer Vision, 2001, Vancouver, Canada 2: 439-446.
- Kim M, Kumar S, Pavlovic V, Rowley H: Face tracking and recognition with visual constraints in real-world videos. Proceedings of the 26th IEEE Conference on Computer Vision and Pattern Recognition (CVPR '08), June 2008.
- Viola P, Jones MJ: Robust real-time face detection. International Journal of Computer Vision 2004, 57(2):137-154.
- O'Toole AJ, Roark DA, Abdi H: Recognizing moving faces: a psychological and neural synthesis. Trends in Cognitive Sciences 2002, 6(6):261-266. 10.1016/S1364-6613(02)01908-3
- Chowdhury AR, Chellappa R, Krishnamurthy R, Vo T: 3d face reconstruction from video using a generic model. Proceedings of the International Conference on Multimedia and Expo, 2002, Lausanne, Switzerland.
- Baker S, Kanade T: Limits on super-resolution and how to break them. IEEE Transactions on Pattern Analysis and Machine Intelligence 2002, 24(9):1167-1183. 10.1109/TPAMI.2002.1033210
- Liu X, Chen T, Thornton SM: Eigenspace updating for non-stationary process and its application to face recognition. Pattern Recognition 2003, 36(9):1945-1959. 10.1016/S0031-3203(03)00057-8
- Zhao W, Chellappa R, Phillips PJ, Rosenfeld A: Face recognition: a literature survey. ACM Computing Surveys 2003, 35(4):399-458. 10.1145/954339.954342
- Wang J, Zhang C, Shum HY: Face image resolution versus face recognition performance based on two global methods. Proceedings of the Asian Conference on Computer Vision, 2004, Jeju Island, Korea.
- Rodriguez Y, Cardinaux F, Bengio S, Mariéthoz J: Measuring the performance of face localization systems. Image and Vision Computing 2006, 24(8):882-893. 10.1016/j.imavis.2006.02.012
- Sanderson C, Lovell BC: Multi-region probabilistic histograms for robust and scalable identity inference. Proceedings of the 3rd International Conference on Advances in Biometrics (ICB '09), June 2009 199-208.
- Gonzalez R, Woods R: Digital Image Processing. 3rd edition. Prentice Hall, Englewood Cliffs, NJ, USA; 2007.
- Bishop C: Pattern Recognition and Machine Learning. Springer, Berlin, Germany; 2006.
- Sanderson C, Shan T, Lovell BC: Towards pose-invariant 2D face classification for surveillance. Proceedings of the 3rd International Workshop on Analysis and Modeling of Faces and Gestures (AMFG '07), October 2007 276-289.
- Kadir T, Brady M: Saliency, scale and image description. International Journal of Computer Vision 2001, 45(2):83-105. 10.1023/A:1012460413855
- Huang GB, Ramesh M, Berg T, Learned-Miller E: Labeled Faces in the Wild: a database for studying face recognition in unconstrained environments. 2007., (07-49):
- Belhumeur PN, Hespanha JP, Kriegman DJ: Eigenfaces vs. fisherfaces: recognition using class specific linear projection. IEEE Transactions on Pattern Analysis and Machine Intelligence 1997, 19(7):711-720. 10.1109/34.598228
- Nowak E, Jurie F: Learning visual similarity measures for comparing never seen objects. Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '07), June 2007.
- Ojala T, Pietikäinen M, Mäenpää T: Multiresolution gray-scale and rotation invariant texture classification with local binary patterns. IEEE Transactions on Pattern Analysis and Machine Intelligence 2002, 24(7):971-987. 10.1109/TPAMI.2002.1017623
- Lades M, Vorbrueggen JC, Buhmann J, Lange J, v.d Malsburg Christoph C, Wuertz RP, Konen W: Distortion invariant object recognition in the dynamic link architecture. IEEE Transactions on Computers 1993, 42(3):300-311. 10.1109/12.210173
- Lowe DG: Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision 2004, 60(2):91-110.
- Cao Z, Yin Q, Tang X, Sun J: Face recognition with learning-based descriptor. Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2010, San Francisco, Calif, USA.
- Ruiz-Del-Solar J, Verschae R, Correa M: Recognition of faces in unconstrained environments: a comparative study. EURASIP Journal on Advances in Signal Processing 2009, 2009:-19.
- Marcel S, McCool C, Ahonen PM, et al.: Mobile biometry (mobio) face and speaker verification evaluation. Proceedings of the 20th International Conference on Pattern Recognition, 2010.
- Jenkins R, Burton AM: 100% Accuracy in automatic face recognition. Science 2008, 319(5862):435. 10.1126/science.1149656
- Zhao S, Zhang X, Gao Y: A comparative evaluation of average face on holistic and local face recognition approaches. Proceedings of the 19th International Conference on Pattern Recognition (ICPR '08), December 2008.
- Wang R, Shan S, Chen X, Gao W: Manifold-manifold distance with application to face recognition based on image set. Proceedings of the 26th IEEE Conference on Computer Vision and Pattern Recognition (CVPR '08), June 2008.
- Yamaguchi O, Fukui K, Maeda K: Face recognition using temporal image sequence. Proceedings of the 3rd IEEE International Conference on Automatic Face and Gesture Recognition, 1998, Nara, Japan.
- Wolf L, Shashua A: Learning over sets using kernel principal angles. Journal of Machine Learning Research 2004, 4(6):913-931.
- Kim TK, Kittler J, Cipolla R: Discriminative learning and recognition of image set classes using canonical correlations. IEEE Transactions on Pattern Analysis and Machine Intelligence 2007, 29(6):1005-1018.
- Cevikalp H, Triggs B: Face recognition based on image sets. IEEE Conference on Computer Vision and Pattern Recognition, 2010, San Francisco, Calif, USA.
- Ahonen T, Hadid A, Pietikainen M: Face recognition with local binary patterns. Proceedings of the 8th European Conference on Computer Vision (ECCV '04), 2004, Prague, Czech Republic.
- Bradski GR: Computer video face tracking for use in a perceptual user interface. Intel Technology Journal Q2 1998.
- Marcel S, McCool C, Matejka P, et al.: On the results of the first mobile biometry (mobio) face and speaker verification evaluation. Proceedings of IEEE Conference on Pattern Recognition (ICPR '10), 2010, Istanbul, Turkey.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.