Classification of extreme facial events in sign language videos
© Antonakos et al.; licensee Springer. 2014
Received: 29 May 2013
Accepted: 26 February 2014
Published: 13 March 2014
We propose a new approach for Extreme States Classification (ESC) on feature spaces of facial cues in sign language (SL) videos. The method is built upon Active Appearance Model (AAM) face tracking and feature extraction of global and local AAMs. ESC is applied on various facial cues - as, for instance, pose rotations, head movements and eye blinking - leading to the detection of extreme states such as left/right, up/down and open/closed. Given the importance of such facial events in SL analysis, we apply ESC to detect visual events on SL videos, including both American (ASL) and Greek (GSL) corpora, yielding promising qualitative and quantitative results. Further, we show the potential of ESC for assistive annotation tools and demonstrate a link of the detections with indicative higher-level linguistic events. Given the lack of facial annotated data and the fact that manual annotations are highly time-consuming, ESC results indicate that the framework can have significant impact on SL processing and analysis.
Facial events are inevitably linked with human communication and are essential for gesture and sign language (SL) comprehension. Nevertheless, from the viewpoint of both automatic visual processing and recognition, facial events are difficult to detect, describe and model. In the context of SL, this becomes more complex given the diverse range of potential facial events, such as head movements, head pose, mouthings and local actions of the eyes and brows, which can carry valuable information in parallel with the manual cues. Moreover, the above visual phenomena can occur in multiple ways and at different timescales, either at the sign or the sentence level, and are related to the meaning of a sign, the syntax or the prosody [1–4]. Thus, we focus on the detection of such low-level visual events in video sequences, which can prove important both for SL analysis and for automatic SL recognition (ASLR) [5, 6].
SL video corpora are widely employed by linguists, annotators and computer scientists for the study of SL and the training of ASLR systems. All the above require manual annotation of facial events, either for linguistic analysis or for ground truth transcriptions. However, manual annotation is conducted by experts and is a highly time-consuming task (the effort has been described as ‘enormous’, resulting in annotations of ‘only a small proportion of data’), which explains the general lack of such annotations. Simultaneously, more SL data, many of which lack facial annotations, are built or accumulated on the web [8–11]. All the above has led to efforts towards the development of automatic or semi-automatic annotation tools [12–14] for the processing of corpora.
In this article, we focus on the low-level visual detection of facial events in videos within what we call the extreme states framework. Our contributions start with the low-level detection of events such as the head pose over the yaw, pitch and roll angles, the opening/closing of the eyes and local cues of the eyebrows and the mouth; see also the example in Figure 1. This list only demonstrates indicative cases. For instance, in the case of head turn over the yaw angle, we detect the extreme states of the rotation (left/right), and for the eyes, the closed/open states. The proposed approach, referred to as Extreme States Classification (ESC), is formulated to detect and classify the extreme states of various facial events in a simple but effective and unified way. We build on the exploitation of global and local Active Appearance Models (AAMs), trained on facial regions of interest, to track and model the face and its components. After appropriate feature selection, ESC over-partitions the feature space of the cue in question and applies maximum-distance hierarchical clustering, resulting in extreme clusters, and the corresponding statistically trained models, for each cue. The framework is applied successfully to videos from different SLs, namely American sign language (ASL) and Greek sign language (GSL) corpora, showing promising results. Where facial-level annotations exist, we quantitatively evaluate the method. ESC is also applied to a multi-person still images database, showing its person-independent generalization. Finally, the potential impact of the low-level facial events detection is further explored: we highlight the link of the detections with higher-level linguistic events. Based on low-level visual detections, we detect linguistic phenomena related to the sign and sentence boundaries or the sentence structure. We also show the incorporation of ESC in assistive annotation, e.g. within environments such as ELAN.
The presented evidence renders ESC a candidate expected to have practical impact in the analysis and processing of SL data.
2 Related literature
Given the importance of facial cues, the incorporation of facial features and head gestures, and the engineering or linguistic interest in them, have recently received attention. This interest is manifested through different aspects with respect to (w.r.t.) visual processing, detection and recognition. There are methods for the direct recognition of non-manual linguistic markers [19–21], as applied to negations, conditional clauses, syntactic boundaries, topic/focus and wh-, yes/no questions. Moreover, there are methods for the detection of important facial events such as head gestures [20, 22, 23], eyebrow movement and eye blinking/squint, along with facial expression recognition [22, 24, 25] within the context of SL. A two-layer Conditional Random Field has been employed for recognizing continuously signed grammatical markers related to facial features and head movements. Metaxas et al. employ geometric and Local Binary Pattern (LBP) features on a combined 2D and 3D face tracking framework to automatically recognize linguistically significant non-manual expressions in continuous ASL videos. The challenging task of fusing manuals and non-manuals for ASLR has also received attention [5, 6, 26, 27]. Due to the time cost and the lack of annotations, there is recently a more explicit trend towards preliminary tools for semi-automatic annotation: via a recognition and a translation component at the sign level concerning manuals, by categorizing manual/non-manual components, by providing information on lexical signs and by assisting sign searching. Vogler and Goldenstein [24, 25] contributed early in this direction. Such works clearly mention the need for further work on facial cues.
More generally, unsupervised and semi-supervised approaches for facial feature extraction, event detection and classification have attracted interest. Aligned Cluster Analysis (ACA) and Hierarchical ACA (HACA) apply temporal clustering of naturally occurring facial behaviour that solves for correspondences between dynamic events. Specifically, ACA is a temporal segmentation method that combines kernel k-means with a dynamic time warping kernel, and HACA is its extension that employs a hierarchical bottom-up framework. Consequently, these methods are dynamic, in contrast to the static nature of our proposed method, which detects potential facial events on each frame of a video. Locally Linear Embedding has also been used to detect head pose. Hoey aims at the unsupervised classification of expression sequences by employing a hierarchical dynamic Bayesian network. However, the above are applied to domains such as facial expression recognition and action recognition.
Facial features are employed in tasks such as facial expression analysis and head pose estimation. A variety of approaches have been proposed for tracking and feature extraction. Many are based on deformable models, like Active Appearance Models (AAMs), due to their ability to capture shape and texture variability, providing a compact representation of facial features [5, 6, 13, 29, 36]. There is a variety of tracking methods, including Active Shape Models with Point Distribution Model [22, 37, 38], Constrained Local Models, deformable part-based models, subclass divisions and appearance-based facial feature detection. The tracking can also be based on 3D models [24, 25] or a combination of models. Numerous features are also employed, such as SIFT, canonical appearance [21, 39], LBPs and geometric distances on a face shape graph [21, 29]. Grammatical markers have also been recognized by tracking facial features based on probabilistic Principal Components Analysis (PCA)-learned shape constraints. Various pattern recognition techniques are applied for the detection and classification/recognition of facial features in SL tasks, such as Support Vector Machines [22, 39], Hidden Markov Models [5, 6] and combinations of them.
This work differs from other SL-related works. First, multiple facial events are handled in a unified way through a single framework. As shown next, this unified handling, along with the extreme-states formulation, is suitable for SL analysis in multiple ways. Second, the method detects the facial events in question at each new unseen frame, rather than performing a segmentation procedure given a whole video sequence; thus, it is static. Third, the method is inspired by and designed for SL video corpora, and the whole framework is designed with assistive automatic facial annotation tools in mind. The detection of even simple events can have a drastic impact, given the lack of annotations. This is strengthened by their relation with linguistic phenomena. From the facial-linguistic aspect, other methods aim at the recognition of the linguistic phenomena themselves in a supervised model-based manner, thus requiring linguistic annotations. Herein, we rather focus on visual phenomena for the detection of visual events and provide an interactive assistive annotation tool for their discovery in corpora, while exploring their potential link with higher-level phenomena. Given the difficulty of ASLR in continuous, spontaneous tasks, recognition-based annotation tools still have low performance, or are rather preliminary in the case of the face [13, 25]. Focusing on independent frames, without interest in the dynamics, ESC can be effective on multiple information cues that are useful for SL annotation and recognition. Thus, our results could feed salient detections to higher-level methods such as [20, 21] or to ASLR systems. Of course, the area would benefit from further incorporation of more unsupervised approaches [29, 31, 33]. Overall, different methods focus partially on some of the above aspects, and to the best of our knowledge, none of them addresses all the described issues. ESC was introduced in our earlier work.
Herein, the approach is extensively presented with mature and updated material, including the application to multiple events and more experiments. In addition, there is an updated formulation and rigorous handling of parameters - e.g. event symmetry, SPThres (see Section 4.2.2) - which allows the user to employ the framework in an unsupervised way. Finally, we further highlight linguistic and assistive annotation perspectives.
3 Visual processing: global and local AAMs for tracking and features
3.1 Active Appearance Model background
Active Appearance Models (AAMs) [44, 45] are generative statistical models of an object’s shape and texture that recover a parametric description through optimization. Until recently, mainly due to the project-out inverse compositional algorithm, AAMs were widely criticized for being inefficient and unable to generalize well under illumination and facial expression variations. However, recent research has shown that this is far from true. The employment of more efficient optimization techniques [46–48] as well as robust feature-based appearance representations [47, 49] has shown that AAMs are among the most efficient and robust methodologies for face modelling. In this paper, we take advantage of adaptive, constrained inverse compositional methods for improved performance, applied on pixel intensities. Even though other proposed AAM variations may be more successful, the main focus of this paper is the detection of facial events in SL videos, and the proposed method is independent of the AAM optimization technique in use.
In brief, we express a shape instance as $s = [x_1, y_1, \ldots, x_N, y_N]$, a $2N \times 1$ vector consisting of the coordinates $(x_i, y_i)$, $i = 1, \ldots, N$, of $N$ landmark points, and a texture instance as an $M \times 1$ vector $A$ consisting of the greyscale values of the $M$ column-wise pixels inside the shape graph. The shape model is trained by applying Principal Components Analysis (PCA) to the aligned training shapes to find the eigenshapes of maximum variance and the mean shape $s_0$. The texture model is trained similarly in order to find the corresponding eigentextures and mean texture $A_0$. Additionally, we employ the similarity transformation that controls the face’s global rotation, translation and scaling and the global affine texture transform $T_u = (u_1 + 1) I + u_2$, used for lighting invariance. $t = [t_1, \ldots, t_4]$ and $u = [u_1, u_2]$ are the corresponding parameter vectors.
Synthesis is achieved via linear combination of the eigenvectors weighted with the corresponding parameters, for both shape and texture. The concatenated shape parameters vector consists of the similarity parameters $t$ and the shape parameters. Similarly, the concatenated texture parameters vector consists of the affine texture transform parameters $u$ and the texture parameters. The piecewise affine warp function maps pixels inside the source shape $s$ into the mean shape $s_0$ using the barycentric coordinates of the Delaunay triangulation. Next, we employ both global and local AAMs, denoted with a ‘G’ or ‘L’ exponent. For more details, see the related literature, e.g. [45, 46]. Finally, the per-iteration complexity of the employed AAM fitting algorithm results in close to real-time performance of 15 fps.
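As an illustration of the linear synthesis described above, the following minimal Python sketch forms an instance as the mean plus a weighted sum of eigenvectors, i.e. $s = s_0 + \sum_i p_i s_i$ in commonly used AAM notation. The symbol $p_i$ and the function names `synthesize` and `affine_texture` are assumptions made for illustration only; the sketch is not the fitting algorithm employed in this work.

```python
from typing import List, Sequence


def synthesize(mean: Sequence[float],
               eigenvectors: Sequence[Sequence[float]],
               params: Sequence[float]) -> List[float]:
    """Return mean + sum_i params[i] * eigenvectors[i] (shape or texture)."""
    if len(eigenvectors) != len(params):
        raise ValueError("one parameter is needed per eigenvector")
    instance = [float(v) for v in mean]
    for weight, vector in zip(params, eigenvectors):
        if len(vector) != len(instance):
            raise ValueError("eigenvector length must match the mean")
        for k, value in enumerate(vector):
            instance[k] += weight * value
    return instance


def affine_texture(texture: Sequence[float], u1: float, u2: float) -> List[float]:
    """Apply the global affine texture transform T_u(A) = (u1 + 1) * A + u2."""
    return [(u1 + 1.0) * a + u2 for a in texture]


# Toy example: two landmarks (four coordinates) and two eigenshapes.
shape = synthesize([0.0, 0.0, 1.0, 1.0],
                   [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]],
                   [2.0, -1.0])
```

The same `synthesize` function serves both models, since shape and texture synthesis differ only in the mean vector and the eigenvectors supplied.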
3.2 Initialization using face and skin detection
3.2.1 Global AAM fitting results
3.3 Projection from global to local AAMs
3.4 Features and dimensionality
4 Extreme States Classification of facial events
4.1 Event symmetry and feature spaces
Next, we take advantage of the characteristics of 1D features and the continuity of the model’s deformation (Section 3.4). Since the feature values’ variation causes the respective facial event to smoothly alternate between two extreme instances, ESC aims to automatically detect these extreme states, the upper and the lower one. The instances located between the extremes are labelled as undefined or neutral depending on the event.
Given a training feature space, once the designer selects the feature best describing the event, ESC performs an unsupervised training of probabilistic models. The 1D features provide simplicity in the automatic cluster selection and real-time complexity. See Figure 7 for an example employing the GAAM’s first eigenshape parameter symmetric feature.
4.2.1 Hierarchical breakdown
ESC automatically selects representative clusters that will be used to train Gaussian distributions. These clusters must be positioned on the two edges and the centre of the 1D feature space, as shown in Figure 8. We apply agglomerative hierarchical clustering resulting in a large number of clusters, approximately half the number of the training observations. This hierarchical over-clustering eliminates the possible bias of the training feature space. It neutralizes its density differences, creating small groups that decrease the number of considered observations. In case we have a significant density imbalance between the feature’s edges of our training set, the over-clustering equalizes the observations at each edge.
Direct application of a clustering method for the automatic selection of representative clusters would take into account the inter-distances of data points, resulting in biased large-surface clusters that spread towards the centre of the feature space. If the two edges of the feature space were not equalized w.r.t. the data points’ density, we would risk one of the two clusters capturing intermediate states. Consequently, the models corresponding to the extreme states would also include some undefined/neutral cases from the centre of the feature space, increasing the percentage of false-positive detections.
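The over-clustering step can be sketched as follows for a 1D feature space. This is a minimal illustration under stated assumptions: the linkage criterion is not specified in the text, so centroid linkage is assumed here, and the function name `over_cluster_1d` is hypothetical. In 1D, the closest pair of clusters is always adjacent after sorting, so only neighbouring clusters need to be compared.

```python
from typing import List, Optional, Sequence


def over_cluster_1d(values: Sequence[float],
                    target: Optional[int] = None) -> List[List[float]]:
    """Agglomerative over-clustering of 1D values (centroid linkage assumed).

    Merging stops at `target` clusters; by default this is about half the
    number of observations, as described in the text.
    """
    clusters = [[v] for v in sorted(values)]
    if target is None:
        target = max(1, len(clusters) // 2)
    while len(clusters) > target:
        centroids = [sum(c) / len(c) for c in clusters]
        # Gap between each pair of neighbouring cluster centroids.
        gaps = [centroids[i + 1] - centroids[i]
                for i in range(len(centroids) - 1)]
        i = gaps.index(min(gaps))
        # Merge the two closest neighbouring clusters.
        clusters[i:i + 2] = [clusters[i] + clusters[i + 1]]
    return clusters


# Toy feature values with a denser upper edge.
groups = over_cluster_1d([90.5, 0.0, 51.0, 2.0, 91.0, 1.0, 50.0, 90.0])
```

On this toy input the dense upper edge collapses into a single group, which is the equalizing effect that the over-clustering is meant to provide.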
4.2.2 Cluster selection
Another reason for not applying a direct clustering method for the automatic selection of three representative clusters - two on the edges, one in the centre - is that the trained distributions for each cluster would intersect and share data points, leading to false-positive detections. We tackle this issue with a cluster selection procedure based on a maximum-distance criterion. We take advantage of the continuous geometry of the 1D feature space, according to which the extreme states are the ones with maximum distance. Thus, we automatically select appropriate clusters on the edges of the feature space and a central cluster at half the distance between them.
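A rough sketch of the maximum-distance selection idea is given below. It assumes the over-clustered groups are already available as lists of 1D values. The parameter `edge_fraction` and the function name `select_clusters` are hypothetical stand-ins introduced for illustration; they are not the SPThres parameter, whose definition is not reproduced here.

```python
from typing import Dict, List, Sequence


def select_clusters(clusters: Sequence[Sequence[float]],
                    edge_fraction: float = 0.2) -> Dict[str, List[float]]:
    """Pick edge groups and a central group from over-clustered 1D data.

    Groups whose centroid lies within `edge_fraction` of the centroid range
    from either end are taken as extremes; groups near the midpoint between
    the two extremes form the central cluster. Everything else is discarded.
    """
    centroids = [sum(c) / len(c) for c in clusters]
    low, high = min(centroids), max(centroids)
    span = high - low
    middle = low + span / 2.0
    selected: Dict[str, List[float]] = {"lower": [], "central": [], "upper": []}
    for cluster, m in zip(clusters, centroids):
        if m <= low + edge_fraction * span:
            selected["lower"].extend(cluster)
        elif m >= high - edge_fraction * span:
            selected["upper"].extend(cluster)
        elif abs(m - middle) <= edge_fraction * span / 2.0:
            selected["central"].extend(cluster)
    return selected


picked = select_clusters([[0.0, 1.0], [2.0], [50.0, 51.0], [90.0, 90.5, 91.0]])
```

Discarding the groups that fall between the selected regions leaves gaps between the three training sets, which is one way to keep the trained distributions from sharing data points.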
Facial event default configuration
4.2.3 Final clusters interpretation and training feature space
Algorithm 1 ESC training
The above training procedure is applied on a given 1D training feature space. The designer can choose between two possible types of training feature space after the appropriate feature selection. The first option is to use the feature values from a video’s frames. This has the advantage that the training set is adjusted to the facial event’s variance within the specific video. The second option is to synthesize the feature space from the AAM, by forming a linear 1D space of the selected feature’s values ranging from a minimum to a maximum value. The minimum and maximum values are selected so as to avoid distortion: a safe value range for a parameter is set according to $m_i$, the respective eigenvalue. This scenario forms an unbiased training feature space containing all possible instances with balanced density between the representative clusters.
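The second option, a synthesized linear feature space, can be sketched as below. The bound of $k\sqrt{m_i}$ with $k = 3$ is a convention often used for PCA-based shape models and is an assumption here, not a value taken from the text; the function name `synthetic_feature_space` is likewise hypothetical.

```python
import math
from typing import List


def synthetic_feature_space(eigenvalue: float,
                            num_points: int = 201,
                            k: float = 3.0) -> List[float]:
    """Evenly spaced parameter values in [-k*sqrt(m_i), +k*sqrt(m_i)].

    `k = 3` is an assumed convention; adjust it to the safe range of the
    model at hand.
    """
    if eigenvalue < 0.0:
        raise ValueError("eigenvalue must be non-negative")
    if num_points < 2:
        raise ValueError("at least two points are needed")
    bound = k * math.sqrt(eigenvalue)
    step = 2.0 * bound / (num_points - 1)
    return [-bound + i * step for i in range(num_points)]


space = synthetic_feature_space(4.0, num_points=5)
```

Because the values are evenly spaced, the two edges of the resulting feature space have equal density by construction.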
4.3 Classification and ESC character
Each observation of a testing set is classified in a specific class out of the three, based on maximum likelihood criterion. Following the example of Figure 7, the final extreme pose over the yaw angle detection is summarized in the subcaptions. ESC builds on the AAM fitting results; thus, it is a supervised method w.r.t. the landmark points annotation for AAM training. It also requires the designer’s intervention for the selection of the appropriate 1D feature space best describing the facial event, which, by using Table 1, becomes a semi-supervised task. However, given the AAM trained models and the training feature space that corresponds to the facial event, the ESC method requires no further manual intervention. In other words, given a 1D feature space, the ESC method detects extreme states of facial events in an unsupervised manner. As explained in Section 4.2.2, the symmetry type of the event leads to a default SPThres configuration (Table 1). However, the designer has the option to alter the SPThres defaults to refine the ESC performance w.r.t. the difference of precision vs. recall percentages - as further explained in Section 6.2 - which is useful in certain applications such as assistive annotation. Finally, it is highlighted that ESC does not require any facial event annotations, as opposed to other facial event detection methods.
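The maximum likelihood decision over three trained Gaussian models can be sketched as follows. This is a minimal 1D illustration; the class names, the variance floor and the function names are assumptions introduced for the example.

```python
import math
from typing import Dict, Sequence, Tuple

Gaussian = Tuple[float, float]  # (mean, variance)


def fit_gaussian(samples: Sequence[float], min_var: float = 1e-9) -> Gaussian:
    """Maximum likelihood mean and variance of a 1D sample set."""
    mean = sum(samples) / len(samples)
    var = sum((x - mean) ** 2 for x in samples) / len(samples)
    return mean, max(var, min_var)  # floor avoids division by zero


def log_likelihood(x: float, model: Gaussian) -> float:
    """Log density of x under a 1D Gaussian."""
    mean, var = model
    return -0.5 * math.log(2.0 * math.pi * var) - (x - mean) ** 2 / (2.0 * var)


def classify(x: float, models: Dict[str, Gaussian]) -> str:
    """Return the label whose Gaussian gives x the highest likelihood."""
    return max(models, key=lambda label: log_likelihood(x, models[label]))


# One Gaussian per selected cluster (toy training values).
models = {
    "lower": fit_gaussian([-6.0, -5.5, -5.0]),
    "neutral": fit_gaussian([-0.5, 0.0, 0.5]),
    "upper": fit_gaussian([5.0, 5.5, 6.0]),
}
labels = [classify(x, models) for x in (-5.2, 0.1, 7.0)]
```

Each frame is classified independently of its neighbours, which matches the static, per-frame character of the method.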
GAAM training attributes: number of frames, number of train images, number of landmark points N, number of eigenshapes N_s, number of eigentextures N_t, mean shape resolution M
6 Experimental results
Herein, we present qualitative results on GSL (Section 6.1) which lacks annotations, a quantitative comparison between ESC, supervised classification and k-means clustering on ASL (Section 6.2), a quantitative testing of the effect of AAM fitting accuracy on ESC performance (Section 6.3) and a subject-independent application on IMM (Section 6.4). Section 7 provides links with linguistic phenomena (Section 7.1) and demonstrates annotation perspectives (Section 7.2).
6.1 Qualitative results for continuous GSL
6.2 Quantitative evaluation of ESC vs. supervised classification vs. k-means on ASL
We conduct experiments to compare ESC with supervised classification and k-means clustering on the BU database, taking advantage of existing annotations. The task includes some indicative facial events from the ones presented in Section 6.1 that have the appropriate annotation labels: yaw and roll pose, left eye opening/closing and left eyebrow up/down movement. Note that we only use the non-occluded annotated frames of the ASL video and aim to compare the individual detections on each frame. We group similar annotations to end up with three labels. For example, we consider the yaw pose annotated labels right and left to be extreme and the labels slightly right and slightly left to be neutral.
Indicative ESC confusion matrix
Supervised classification
For the supervised classification, we partition the feature space into three clusters following the annotations. Subsequently, we apply uniform random sampling on these annotated sets in order to select N/3 points from each, i.e. N in total, matching the number chosen by the ESC cluster selection. Again, the remaining points constitute the testing set. The selected points are then employed to train one Gaussian distribution per cluster.
We employ the k-means algorithm in order to compare with a state-of-the-art unsupervised clustering method. The algorithm is directly applied on the same testing set as in ESC and supervised classification, requiring three clusters.
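For reference, a minimal 1D version of the k-means baseline with three clusters is sketched below. The evenly spaced initialization of the centres and the function name `kmeans_1d` are assumptions made to keep the example deterministic; the text does not specify how k-means was initialized.

```python
from typing import List, Sequence, Tuple


def kmeans_1d(values: Sequence[float], k: int = 3,
              max_iter: int = 100) -> Tuple[List[float], List[int]]:
    """Lloyd's algorithm on 1D data; returns (centres, label per value)."""
    if k < 2:
        raise ValueError("k must be at least 2")
    low, high = min(values), max(values)
    # Assumed initialization: centres evenly spaced over the value range.
    centres = [low + (high - low) * i / (k - 1) for i in range(k)]

    def nearest(v: float) -> int:
        return min(range(k), key=lambda j: abs(v - centres[j]))

    for _ in range(max_iter):
        groups: List[List[float]] = [[] for _ in range(k)]
        for v in values:
            groups[nearest(v)].append(v)
        updated = [sum(g) / len(g) if g else centres[j]
                   for j, g in enumerate(groups)]
        if updated == centres:
            break  # assignments are stable
        centres = updated
    return centres, [nearest(v) for v in values]


centres, assignment = kmeans_1d([0.0, 1.0, 2.0, 50.0, 51.0, 90.0, 91.0])
```

Unlike the cluster selection in ESC, every observation is assigned to one of the three clusters, including those lying between the extremes and the centre.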
frames labelled as extreme are true positives. This is useful in applications where we want the classification decisions to be correct rather than to correctly classify all the extreme states. By applying the default SPThres values (Table 1), ESC ensures high F-score performance, with precision/recall balance. However, Figures 14 and 15 also show that we can achieve higher precision or recall percentages with a slight adjustment of the SPThres configuration. Additionally, the above highlights the strength of ESC as a method to classify facial events in cases where manual annotations are unavailable.
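The precision, recall and F-score figures discussed above follow the usual definitions, which can be computed per frame as in this short sketch. The label strings and the function name `precision_recall_f` are illustrative assumptions.

```python
from typing import Sequence, Tuple


def precision_recall_f(predicted: Sequence[str], truth: Sequence[str],
                       positive: str = "extreme") -> Tuple[float, float, float]:
    """Frame-level precision, recall and F-score for one positive label."""
    if len(predicted) != len(truth):
        raise ValueError("predicted and truth must have the same length")
    pairs = list(zip(predicted, truth))
    tp = sum(1 for p, t in pairs if p == positive and t == positive)
    fp = sum(1 for p, t in pairs if p == positive and t != positive)
    fn = sum(1 for p, t in pairs if p != positive and t == positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f_score = 2.0 * precision * recall / total if total else 0.0
    return precision, recall, f_score


truth = ["extreme", "extreme", "extreme", "neutral", "neutral", "neutral"]
predicted = ["extreme", "extreme", "neutral", "extreme", "neutral", "neutral"]
scores = precision_recall_f(predicted, truth)
```

High precision means that few frames labelled as extreme are wrong, while high recall means that few truly extreme frames are missed; the F-score is their harmonic mean.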
6.3 ESC dependency on AAM fitting
6.4 ESC subject independency
7 Further applications
Next, we examine two practical cases that show the application of the presented approach within tasks related to SL. The first concerns linguistic phenomena as related to facial events, while the second highlights ESC application for assistive annotation of facial events.
7.1 ESC detections and linguistic phenomena
Facial cues are essential in SL articulation and comprehension. Nevertheless, the computational incorporation of information related to facial cues is more complex when compared, for instance, with handshape manual information. This is due to the multiple types of parallel facial cues, the multiple ways each cue is involved each time and the possibly different linguistic levels. Based on existing evidence [1–3, 17, 56] and observations, we account each time for a facial cue and link it with linguistic phenomena. We construct links of (1) facial cues via the corresponding ESC detections (Section 4), with (2) a few selected indicative linguistic phenomena. The phenomena include (1) sign and sentence boundaries and (2) sentence-level linguistic markers which determine the phrase structure: alternative constructions, which refer to conjunctive sentences and enumerations, when the signer enumerates objects (see Section 1 and Figure 1).
7.1.1 Eye blinking
7.1.2 Pose variation and head movements
7.1.3 Quantitative evaluation
The GSL corpus annotations provide ground truths for sign/sentence boundaries. The alternative construction and enumeration ground truths are based on our annotations via ELAN, following the descriptions in [56, p. 12]. The cues to test are pose over the yaw, pitch and roll angles, the head’s vertical translation and eye opening/closing. We build on ESC detections of a facial cue for the detection of the desired transitions. For each event, a dynamic transition label is assigned to each frame whose state differs from that of the previous frame (detection change).
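The transition labelling can be sketched in a few lines: a frame is marked whenever its detected state differs from that of the previous frame. The state names and the function name `detect_transitions` are illustrative assumptions.

```python
from typing import List, Sequence


def detect_transitions(states: Sequence[str]) -> List[int]:
    """Indices of frames whose state differs from the previous frame."""
    return [i for i in range(1, len(states)) if states[i] != states[i - 1]]


# Per-frame detections for one cue, e.g. pose over the yaw angle.
frames = ["neutral", "neutral", "left", "left", "neutral", "right"]
changes = detect_transitions(frames)
```

The resulting frame indices can then be compared against annotated boundary positions.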
Facial cue suitability for linguistic phenomena detection
7.2 Assistive application to annotation tools
As discussed (Section 1), the need for semi-supervised methods is evident given the general lack of manual annotations. These require many times the real-time duration of a SL video, let alone the required training of an annotator who is subjective and possibly error-prone after many annotation hours. ESC can be potentially employed for the benefit of annotators via assistive or semi-automatic annotation.
7.2.1 Annotator-defined extreme states
7.2.2 Annotation labels consistency and errors
7.2.3 Incorporation of results into annotation software
8 Discussion and conclusions
We present an efficient approach for the detection of facial events that are of interest within the context of processing and analysis of continuous SL videos. We formulate our framework by introducing the notion of ‘extreme states’, which is intuitive: take, for instance, left/right extremes of yaw head pose angle, up/down extremes of pitch head angle, open/close extremes of the eyes, and so on. Although simple, such events are potentially related with various SL linguistic phenomena, such as sign/sentence boundaries, role playing and dialogues, enumerations and alternative constructions, to name but a few, which are still under research. By applying the proposed approach on SL videos, we are able to detect and classify salient low-level visual events. As explained in Section 4.3, the method builds upon face tracking results and performs an unsupervised classification. Evaluations are conducted on multiple datasets. The detection accuracy is comparable with that of the supervised classification, and F-scores range between 77% and 91%, depending on the facial event. These detection results would be of great assistance for annotators, since the analysis and annotation of such events in large SL video corpora consumes many times the real-time duration of the initial videos. Moreover, via the relation with higher-level linguistic events, a few of which have been only indicatively presented, the ESC detections could further assist analysis or assistive consistency tests of existing labels.
Axes of further work concern the automatic adaptation to unknown signers, the incorporation of facial expression events and the incorporation of more linguistic phenomena of multiple levels. Although ongoing research in SL recognition is still far from the development of a complete ASLR system, the integration of facial and linguistic events in such a system is an important future step. The qualitative/quantitative evaluations of the approach on multiple databases and different SLs (GSL, ASL) and the evaluation on the multi-subject IMM database, which have all shown promising results, as well as the practical examples and intuitive applications, indicate that ESC opens perspectives with impact on the analysis, processing and automatic annotation of SL videos.
a In the parentheses, we have added comments that assist the understanding. The gloss transcriptions are ‘WALK $INDEX LINE WAIT $MANUAL’ (frames 650 to 723) and ‘FINE PASSPORT TICKET PASSPORT ID-CARD SOME WHERE GO EUROPE OR ABROAD’ (736 to 927). Gloss: the closest English word transcription corresponding to a sign. $INDEX: variable convention that refers to previous gloss WALK concerning spatial location. $MANUAL: manual classifier for spatial queue description. $EOC: end-of-clause.
This research work was supported by the project ‘COGNIMUSE’ which is implemented under the ARISTEIA Action of the Operational Program Education and Lifelong Learning and is co-funded by the European Social Fund (ESF) and Greek National Resources. It was also partially supported by the EU research projects Dicta-Sign (FP7-231135) and DIRHA (FP7-288121). The major part of this work was done when the first author was at the National Technical University of Athens, Greece. The authors want to thank I. Rodomagoulakis for his contribution and S. Theodorakis for insightful discussions.
- Sandler W: The medium and the message: prosodic interpretation of linguistic content in Israeli Sign Language. Sign Language & Linguistics John Benjamins Publishing Company 1999, 2(2):187-215.MathSciNetView ArticleGoogle Scholar
- Brentari D, Crossley L: Prosody on the hands and face. Gallaudet University Press, Sign Language & Linguistics, John Benjamins Publishing Company 2002, 5(2):105-130.Google Scholar
- Wilbur R: Eyeblinks & ASL phrase structure. Sign Language Studies. Gallaudet University Press 1994, 84(1):221-240.Google Scholar
- Wilbur R, Patschke C: Syntactic correlates of brow raise in ASL. Sign Language & Linguistics. John Benjamins Publishing Company 1999, 2(1):3-41.Google Scholar
- Von Agris U, Zieren J, Canzler U, Bauer B, Kraiss K: Recent developments in visual sign language recognition. Universal Access in the Information Society, Springer 2008, 6(4):323-362. 10.1007/s10209-007-0104-xView ArticleGoogle Scholar
- von Agris U, Knorr M, Kraiss K: The significance of facial features for automatic sign language recognition. In 8th IEEE Int. Conf. on Automatic Face & Gesture Recognition (FG). Amsterdam, The Netherlands; 17–19 Sept 2008.Google Scholar
- Johnston T, Schembri A: Issues in the creation of a digital archive of a signed language. In Sustainable Data from Digital Fieldwork: Proc. of the Conf., Sydney University Press. Sydney, Australia; 4–6 Dec 2006.Google Scholar
- Matthes S, Hanke T, Regen A, Storz J, Worseck S, Efthimiou E, Dimou AL, Braffort A, Glauert J, Safar E: Dicta-Sign – building a multilingual sign language corpus. In Proc. of the 5th Workshop on the Representation and Processing of Sign Languages: Interactions Between Corpus and Lexicon (LREC), European Language Resources Association. Istanbul, Turkey; 23–27 May 2012.Google Scholar
- Neidle C, Vogler C: A new web interface to facilitate access to corpora: development of the ASLLRP data access interface. In Proc. of the Int. Conf. on Language Resources and Evaluation (LREC), European Language Resources Association. Istanbul, Turkey; 23–27 May 2012.Google Scholar
- Dreuw P, Neidle C, Athitsos V, Sclaroff S, Ney H: Benchmark databases for video-based automatic sign language recognition. In Proc. of the Int. Conf. on Language Resources and Evaluation (LREC), European Language Resources Association. Marrakech, Morocco; 28–30 May 2008.Google Scholar
- Crasborn O, van der Kooij E, Mesch J: European cultural heritage online (ECHO): publishing sign language data on the internet. In 8th Conf. on Theoretical Issues in Sign Language Research, John Benjamins Publishing Company. Barcelona, Spain; 30 Sept–2 Oct 2004.Google Scholar
- Dreuw P, Ney H: Towards automatic sign language annotation for the elan tool. In Proc. of Int. Conf. LREC Workshop: Representation and Processing of Sign Languages, European Language Resources Association. Marrakech, Morocco; 28–30 May 2008.Google Scholar
- Hrúz M, Krn̆oul Z, Campr P, Müller L: Towards automatic annotation of sign language dictionary corpora. In Proc. of Text, speech and dialogue, Springer. Pilsen, Czech Republic; 1–5 Sept 2011.Google Scholar
- Yang R, Sarkar S, Loeding B, Karshmer A: Efficient generation of large amounts of training data for sign language recognition: a semi-automatic tool. Comput. Helping People with Special Needs 2006, 635-642.View ArticleGoogle Scholar
- Dicta-Sign Language Resources: Greek Sign Language Corpus. 31 January 2012.http://www.sign-lang.uni-hamburg.de/dicta-sign/portalGoogle Scholar
- Sze F: Blinks and intonational phrasing in Hong Kong Sign Language. In 8th Conf. on Theoretical Issues in Sign Language Research, John Benjamins Publishing Company. Barcelona, Spain; 30 Sept–2 Oct 2004.
- Pfau R: Visible prosody: spreading and stacking of non-manual markers in sign languages. In 25th West Coast Conf. on Formal Linguistics, Cascadilla Proceedings Project. Seattle, USA; 28–30 Apr 2006.
- Wittenburg P, Brugman H, Russel A, Klassmann A, Sloetjes H: ELAN: a professional framework for multimodality research. In Proc. of the Int. Conf. on Language Resources and Evaluation (LREC), European Language Resources Association. Genoa, Italy; 24–26 May 2006.
- Nguyen T, Ranganath S: Facial expressions in American Sign Language: tracking and recognition. Pattern Recognition, Elsevier 2012, 45(5):1877-1891. 10.1016/j.patcog.2011.10.026
- Nguyen T, Ranganath S: Recognizing continuous grammatical marker facial gestures in sign language video. In 10th Asian Conf. on Computer Vision, Springer. Queenstown, New Zealand; 8–12 Nov 2010.
- Metaxas D, Liu B, Yang F, Yang P, Michael N, Neidle C: Recognition of nonmanual markers in ASL using non-parametric adaptive 2D-3D face tracking. In Proc. of the Int. Conf. on Language Resources and Evaluation (LREC), European Language Resources Association. Istanbul, Turkey; 23–27 May 2012.
- Neidle C, Michael N, Nash J, Metaxas D, Bahan IE, Cook L, Duffy Q, Lee R: A method for recognition of grammatically significant head movements and facial expressions, developed through use of a linguistically annotated video corpus. In Proc. of 21st ESSLLI Workshop on Formal Approaches to Sign Languages. Bordeaux, France; 27–31 July 2009.
- Erdem U, Sclaroff S: Automatic detection of relevant head gestures in American Sign Language communication. In IEEE Proc. of 16th Int. Conf. on Pattern Recognition. Quebec, Canada; 11–15 Aug 2002.
- Vogler C, Goldenstein S: Analysis of facial expressions in American Sign Language. In Proc. of the 3rd Int. Conf. on Universal Access in Human-Computer Interaction, Springer. Las Vegas, Nevada, USA; 22–27 July 2005.
- Vogler C, Goldenstein S: Facial movement analysis in ASL. Universal Access in the Information Society, Springer 2008, 6(4):363-374. 10.1007/s10209-007-0096-6
- Sarkar S, Loeding B, Parashar A: Fusion of manual and non-manual information in American Sign Language recognition. Handbook of Pattern Recognition and Computer Vision. CRC, FL; 2010:1-20.
- Aran O, Burger T, Caplier A, Akarun L: Sequential belief-based fusion of manual and non-manual information for recognizing isolated signs. Gesture-Based Human-Computer Interaction and Simulation. Springer; 2009:134-144.
- Bartlett MS: Face image analysis by unsupervised learning and redundancy reduction. PhD thesis, University of California, San Diego, 1998.
- Zhou F, De la Torre F, Cohn JF: Unsupervised discovery of facial events. In IEEE Conf. on Computer Vision and Pattern Recognition (CVPR). San Francisco, CA, USA; 13–18 June 2010.
- Zhou F, De la Torre Frade F, Hodgins JK: Hierarchical aligned cluster analysis for temporal clustering of human motion. IEEE Trans. on Pattern Analysis and Machine Intelligence 2013, 35(3):582-596.
- Hadid A, Kouropteva O, Pietikainen M: Unsupervised learning using locally linear embedding: experiments with face pose analysis. In IEEE Proc. of 16th Int. Conf. on Pattern Recognition. Quebec, Canada; 11–15 Aug 2002.
- Hoey J: Hierarchical unsupervised learning of facial expression categories. In Proc. of IEEE Workshop on Detection and Recognition of Events in Video. Vancouver, BC, Canada; 8 July 2001.
- Niebles J, Wang H, Fei-Fei L: Unsupervised learning of human action categories using spatial-temporal words. International Journal of Computer Vision, Springer 2008, 79(3):299-318. 10.1007/s11263-007-0122-4
- Pantic M, Rothkrantz LJ: Automatic analysis of facial expressions: the state of the art. IEEE Trans. on Pattern Analysis and Machine Intelligence 2000, 22(12):1424-1445. 10.1109/34.895976
- Murphy-Chutorian E, Trivedi M: Head pose estimation in computer vision: a survey. IEEE Trans. on Pattern Analysis and Machine Intelligence 2009, 31(4):607-626.
- Lin D: Facial expression classification using PCA and hierarchical radial basis function network. Journal of Information Science and Engineering, Citeseer 2006, 22(5):1033-1046.
- Canzler U, Dziurzyk T: Extraction of non manual features for video based sign language recognition. In IAPR Workshop on Machine Vision Applications, ACM. Nara, Japan; 11–13 Dec 2002.
- Michael N, Neidle C, Metaxas D: Computer-based recognition of facial expressions in ASL: from face tracking to linguistic interpretation. In Proc. of the Int. Conf. on Language Resources and Evaluation (LREC), European Language Resources Association. Malta; 17–23 May 2010.
- Ryan A, Cohn J, Lucey S, Saragih J, Lucey P, De la Torre F, Rossi A: Automated facial expression recognition system. In IEEE 43rd Int. Carnahan Conference on Security Technology. Zürich, Switzerland; 5–8 Oct 2009.
- Zhu X, Ramanan D: Face detection, pose estimation, and landmark localization in the wild. In IEEE Conf. on Computer Vision and Pattern Recognition (CVPR). Providence, RI, USA; 16–21 June 2012.
- Ding L, Martinez A: Precise detailed detection of faces and facial features. In IEEE Conf. on Computer Vision and Pattern Recognition (CVPR). Anchorage, Alaska, USA; 24–26 June 2008.
- Ding L, Martinez A: Features versus context: an approach for precise and detailed detection and delineation of faces and facial features. IEEE Trans. on Pattern Analysis and Machine Intelligence 2010, 32(11):2022-2038.
- Antonakos E, Pitsikalis V, Rodomagoulakis I, Maragos P: Unsupervised classification of extreme facial events using active appearance models tracking for sign language videos. In IEEE Proc. of Int. Conf. on Image Processing (ICIP). Orlando, Florida, USA; 30 Sept–3 Oct 2012.
- Cootes T, Edwards G, Taylor C: Active appearance models. IEEE Trans. on Pattern Analysis and Machine Intelligence 2001, 23(6):681-685. 10.1109/34.927467
- Matthews I, Baker S: Active appearance models revisited. International Journal of Computer Vision, Springer 2004, 60(2):135-164.
- Papandreou G, Maragos P: Adaptive and constrained algorithms for inverse compositional active appearance model fitting. In IEEE Conf. on Computer Vision and Pattern Recognition (CVPR). Anchorage, Alaska, USA; 24–26 June 2008.
- Tzimiropoulos G, Alabort-i Medina J, Zafeiriou S, Pantic M: Generic active appearance models revisited. In Asian Conf. on Computer Vision, Springer. Daejeon, Korea; 5–9 Nov 2012.
- Batur A, Hayes M: Adaptive active appearance models. IEEE Trans. on Image Processing 2005, 14(11):1707-1721.
- Navarathna R, Sridharan S, Lucey S: Fourier active appearance models. In IEEE Int. Conf. on Computer Vision (ICCV). Barcelona, Spain; 6–13 Nov 2011.
- Vukadinovic D, Pantic M: Fully automatic facial feature point detection using Gabor feature based boosted classifiers. In IEEE Int. Conf. on Systems, Man and Cybernetics. Waikoloa, Hawaii, USA; 10–12 Oct 2005.
- Valstar M, Martinez B, Binefa X, Pantic M: Facial point detection using boosted regression and graph models. In IEEE Conf. on Computer Vision and Pattern Recognition (CVPR). San Francisco, CA, USA; 13–18 June 2010.
- Vezhnevets V, Sazonov V, Andreeva A: A survey on pixel-based skin color detection techniques. In Proc. Graphicon. Moscow, Russia; 2003.
- Tzoumas S: Face detection and pose estimation with applications in automatic sign language recognition. Master’s thesis, National Technical University of Athens, 2011.
- Roussos A, Theodorakis S, Pitsikalis V, Maragos P: Hand tracking and affine shape-appearance handshape sub-units in continuous sign language recognition. In 11th European Conference on Computer Vision, Workshop on Sign, Gesture and Activity (ECCV), Springer. Crete, Greece; 5–11 Sept 2010.
- Nordstrøm M, Larsen M, Sierakowski J, Stegmann M: The IMM face database - an annotated dataset of 240 face images. Inform. Math. Model 2004, 22(10):1319-1331.
- CNRS-LIMSI: Dicta-Sign Deliverable D4.5: report on the linguistic structures modelled for the Sign Wiki. Technical Report D4.5, CNRS-LIMSI (2012).
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.