
2D and 3D analysis of animal locomotion from biplanar X-ray videos using augmented active appearance models


For many fundamental problems and applications in biomechanics, biology, and robotics, an in-depth understanding of animal locomotion is essential. To analyze the locomotion of animals, high-speed X-ray videos are recorded, in which anatomical landmarks of the locomotor system are of main interest and must be located. To date, several thousand sequences have been recorded, which makes a manual annotation of all landmarks practically impossible. Therefore, an automatization of X-ray landmark tracking in locomotion scenarios is worthwhile. However, tracking all landmarks of interest is a very challenging task, as severe self-occlusions of the animal and low contrast are present in the images due to the X-ray modality. For this reason, existing approaches are currently only applicable for very specific subsets of anatomical landmarks. In contrast, our goal is to present a holistic approach which models all anatomical landmarks in one consistent, probabilistic framework. While active appearance models (AAMs) provide a reasonable global modeling framework, they yield poor fitting results when applied on the full set of landmarks. In this paper, we propose to augment the AAM fitting process by imposing constraints from various sources. We derive a general probabilistic fitting approach and show how results of subset AAMs, local tracking, anatomical knowledge, and epipolar constraints can be included. The evaluation of our approach is based on 32 real-world datasets of five bird species which contain 175,942 ground-truth landmark positions provided by human experts. We show that our method clearly outperforms standard AAM fitting and provides reasonable tracking results for all landmark types. In addition, we show that the tracking accuracy of our approach is even sufficient to provide reliable three-dimensional landmark estimates for calibrated datasets.

1 Introduction

For many fundamental problems of ongoing research in biomechanics, zoology, evolutionary biology, and robotics, the key element is a thorough knowledge on animal locomotion [18]. Ideally, this knowledge is obtained by analyzing skeletal movements of locomoting animals. While many methods have been developed over time, the state-of-the-art approach for obtaining noninvasive in vivo measurements of the locomotor system is biplanar X-ray videography. In contrast to reflective marker-based methods, it allows for unobstructed observations at an unrivaled accuracy [5]. In general, the animal to be analyzed is placed on a treadmill and filmed from a side camera view (lateral camera) and a top camera view (dorsoventral camera) at a very high frequency, usually 1,000 frames per second. A typical experimental setup is shown in Figure 1.

Figure 1

X-ray acquisition system. Biplanar X-ray acquisition system (I, treadmill; II, lateral X-ray detector; III, dorsoventral X-ray detector) used for locomotion analysis (a). Example data of a quail with typical landmarks (1 to 8, torso; 9 and 16, hip joints; 10, 11, 17, and 18, knee joints; 12, 13, 19, and 20, intertarsal joints; 14, 15, 21, and 22, feet) are shown in (b, c).

For an evaluation of acquired data, anatomical landmarks - usually skeletal joints of the locomotor system such as hip joints, knee joints, intertarsal joints, and phalangeal joints [6, 7] - have to be located in the images. Most evaluations to date solely rely on human experts (e.g., [5, 6]), which is an extremely time-consuming process and complicates the realization of large-scale studies. An automation of this process would therefore greatly benefit research in the aforementioned areas [9]. However, as almost all parts of an animal’s skeletal system undergo severe self-occlusions during locomotion (cf. Figure 1), developing fully automatic tracking methods for this application is a challenging task.

In this paper, we address the issue of landmark tracking in X-ray sequences of grounded locomotion of birds. We present a novel method which, unlike previous approaches, is able to track all landmarks used in locomotion analysis and can overcome many other practically relevant drawbacks of existing methods (see Subsection 1.2) using a unified, consistent, and probabilistic framework that combines the complementary paradigms of model-driven and data-driven tracking.

1.1 Related work

For very simple scenarios of locomotion analysis, straightforward tracking approaches such as template matching can be applied [9]. Due to severe occlusions, however, template matching and a variety of other standard methods such as optical flow/KLT and its extensions [10-12], region tracking [13, 14], and SIFT-based tracking [15] were proven to be unsuited for X-ray analyses in the challenging scenario at hand [16, 17]. A more advanced approach for skeletal tracking is based on image registration between recorded X-ray images and backprojected CT scans [3, 18, 19]. However, in most cases this method is only feasible for medical applications, as a full CT scan is necessary for each subject to be analyzed.

An alternative, completely data-driven approach for robust template tracking in X-ray sequences was recently proposed in [16]. As standard template tracking fails due to the severe occlusions, the idea is to divide the template to be tracked into certain sub-templates. For each frame, all sub-templates are matched to the target image individually, and the results of these sub-templates are then merged to obtain one consistent parameter transformation for the whole template. The important difference between [16] and existing sub-template-based approaches such as [20-22] lies in the fusion of sub-template results. While previous approaches employ a hard decision between occluded and non-occluded sub-templates, the authors in [16] use a soft decision which exploits special properties of X-ray images. It has proven to be well suited for X-ray bone tracking under moderate occlusions (e.g., for the lower leg landmarks in the side view) [16]. However, due to its data-driven nature, landmarks undergoing severe occlusions (landmarks occluded by the torso, e.g., knee landmarks of the side view or feet landmarks of the top view, cf. Figure 1) cannot be handled.

To overcome such problems of data-driven approaches, model-driven methods generally are able to estimate landmark positions - even for total occlusions - by using global context. One prominent example of global models are active appearance models (AAMs) [23-25]. Besides many applications for human face modeling (e.g., [24, 26, 27]) and medical image analysis (e.g., [28, 29]), AAMs have also been successfully applied to landmark tracking in X-ray locomotion scenarios [17, 30]. One major problem in our scenario, however, is that the movement of the animals often is very complex. As a result, especially for the lower legs, landmark configurations during locomotion substantially differ from the mean landmark configuration, i.e., the motions are non-stationary [31, 32]. As discussed in [31] and [33], this situation drastically complicates the fitting of AAM-like models. Besides the non-stationary motion, another major problem is the non-discriminative texture information of the lower leg landmarks (cf. landmarks 12 to 15 and 19 to 22 in Figure 1b), which additionally complicates the fitting process of AAMs. Thus, the aforementioned standard AAM-based approaches only work when neglecting the set of non-stationary landmarks, as in [17, 30, 34].

To combine the benefits of data-driven and model-driven methods, several hybrid models were developed over time. One straightforward example are combined local models [35], where the shape is modeled globally, as for AAMs, but the texture is modeled locally around each landmark. A recently proposed probabilistic example of this approach are discriminative Bayesian active shape models [36], where many local detectors are used to estimate a global landmark configuration. Both approaches, however, model landmark motions similarly to AAMs and are thus very likely to suffer from the same problems as well.

1.2 Motivation

As mentioned in the last subsection, data-driven [16] as well as model-driven approaches [17, 30] exist for landmark tracking in X-ray locomotion analysis. However, all previously published works in this field suffer from at least one of the following shortcomings, which is a major drawback for the usage of these methods for actual zoological and biomechanical studies:

  • Only very specific anatomical landmark subsets can be tracked, e.g., the torso landmarks [17, 30] or the lower leg landmarks [16]. In addition, certain landmarks exist which are covered by neither of the current approaches, e.g., the lower leg landmarks of the top camera view (cf. landmarks 12 to 15 and 19 to 22 in Figure 1b).

  • Although model-driven and data-driven approaches can generally complement each other (e.g., [36]), only one paradigm is used at a time [16, 17, 30, 34].

  • For data-driven approaches, merely landmarks of the side camera view are considered due to severe self-occlusions in the top camera view [16].

  • Modeling, tracking, and evaluation are performed solely using two-dimensional (2D) approaches, although an accurate camera calibration can be obtained for newly recorded datasets [16, 17, 30, 34].

  • The validation of the tracking methods is performed only on very few real-world datasets, as obtaining ground-truth landmarks is a tedious and time-consuming task [17, 30, 34].

As a consequence, while model-driven as well as data-driven approaches exist for very specific landmark subsets, neither of them alone is applicable to the full tracking problem. Simply merging their results is not a viable option either: on the one hand, the landmark subsets would be tracked independently of one another and hence would not be consistent. On the other hand, not all landmarks would be covered by these methods, for instance the lower leg landmarks in the top view (cf. Figure 1b). Our goal in this work is to overcome all drawbacks mentioned above and to present an approach which is holistic in the sense that all landmarks of the animal are modeled in one consistent framework. We base the approach on the fact that existing methods [16, 30] are complementary, i.e., the first method works well on a landmark subset for which the second method is unsuited, and vice versa. Our main idea is to unify these ‘subset approaches’ within a probabilistic framework to obtain consistent estimates for all landmarks. While AAMs applied to the full set of locomotion landmarks yield poor fitting results, they are still well suited for modeling interrelationships between landmarks. Therefore, we use AAMs as the base model for our approach. However, in contrast to standard AAMs, we augment the fitting process by imposing constraints obtained from sources such as subset methods [16, 30]. We first derive a probabilistic framework that allows AAM fitting under arbitrary types of constraints. While similar approaches such as [36] and [37] only utilize positional priors, we aim to include additional constraints, e.g., the anatomical context or the epipolar geometry of the camera setup. As opposed to existing works in this field, this framework allows us to consistently incorporate all landmarks of both camera views while combining the advantages of data-driven and model-driven approaches. In addition, we evaluate our approach based on 32 real-world datasets from three zoological studies [6, 7, 34], including 175,942 manually labeled ground-truth landmarks and birds of different morphology and locomotion characteristics, which by far exceeds the amount of data used in recent studies. An outline of our approach is shown in Figure 2.

Figure 2

Overview of augmented AAMs. Our holistic approach for landmark tracking in X-ray locomotion sequences. The fitting process of standard AAMs is augmented by imposing constraints from methods that only perform well on a subset of landmarks (‘subset AAM’ and ‘local tracking’). Additionally, further knowledge such as anatomical constraints and epipolar priors are included.

The remainder of this paper is structured as follows. First, an overview of standard AAMs is given in Section 2, as AAMs form the baseline of our method. In Section 3, we present augmented AAMs as our approach for landmark tracking in X-ray locomotion sequences. After deriving a general fitting framework, we describe the constraints used in our specific case. The validation of our approach is presented and discussed in Section 4.

2 Active appearance models

This section gives an overview of standard AAMs [23-25], which form the baseline of our augmented approach presented in Section 3. AAMs are parametric statistical models which describe the visual appearance of arbitrary object classes. The variation in object appearance is modeled by a shape component (represented by image landmarks) and a shape-free texture component. AAMs are trained from sample images with annotated landmark positions. Once learned, a trained model can be fit to unseen images automatically. In the following subsections, the basic training and fitting procedure of standard AAMs will be described.

2.1 AAM training

AAM training is based on annotated sample images, i.e., N images $I_1, \ldots, I_N$ with M corresponding landmarks $l_n = (x_{n,1}, y_{n,1}, \ldots, x_{n,M}, y_{n,M})$, $1 \le n \le N$. As a first step, the shape model is built by aligning the given shape samples $l_1, \ldots, l_N$ with respect to translation, rotation, and scale via Procrustes analysis [38, 39], resulting in shapes $s_1, \ldots, s_N$. The shape variations are then parameterized by applying principal component analysis (PCA) to the matrix $S = (s_1 - \bar{s}, \ldots, s_N - \bar{s})$, where $\bar{s} = \frac{1}{N} \sum_{n=1}^{N} s_n$ is the mean shape. The result is a linear model which describes an arbitrary shape $s$ based on its shape parameters $b_s$, the shape eigenvectors $P_s$, and the mean shape $\bar{s}$ of all samples via

$$s = \bar{s} + P_s b_s. \tag{1}$$
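The PCA-based shape model of Equation 1 can be illustrated with a minimal NumPy sketch. It assumes that the shapes have already been aligned by Procrustes analysis and are stored as rows of an array. The function names and the `var_kept` threshold are hypothetical and chosen for illustration only; they are not taken from the paper.

```python
import numpy as np

def build_shape_model(shapes, var_kept=0.95):
    """Fit s = s_mean + P_s @ b_s to aligned shapes of shape (N, 2M)."""
    s_mean = shapes.mean(axis=0)
    centered = shapes - s_mean
    # Rows of vt are the principal directions of the centered data.
    _, sing, vt = np.linalg.svd(centered, full_matrices=False)
    var = sing ** 2
    # Smallest number of modes whose cumulative variance reaches var_kept
    # (var_kept is an illustrative choice, not a value from the paper).
    k = int(np.searchsorted(np.cumsum(var) / var.sum(), var_kept) + 1)
    P_s = vt[:k].T  # (2M, k) matrix of shape eigenvectors
    return s_mean, P_s

def project(s, s_mean, P_s):
    """Shape parameters b_s of a shape s."""
    return P_s.T @ (s - s_mean)

def reconstruct(b_s, s_mean, P_s):
    """Shape described by the parameters b_s (Equation 1)."""
    return s_mean + P_s @ b_s
```

The same procedure, applied to shape-normalized texture vectors instead of shapes, would yield the texture model.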

An example of an AAM shape model is shown in Figure 3 for an animal locomotion dataset used in this paper. It demonstrates that the movements of the lower legs are very complex in both camera views and thus cannot be handled well in the fitting process of standard AAMs.

Figure 3

Shape components of an AAM trained on a bird sequence. In this example, variations of the first three shape parameters for the top view (left) and the side view (right) are shown. The movements of the lower legs are very complex in both camera views and cannot be handled well in the fitting process of standard AAMs.

The second step of AAM training consists of building a texture model. Firstly, each image $I_1, \ldots, I_N$ is warped into a common reference frame - usually the mean shape $\bar{s}$. The shape-normalized images are then vectorized, resulting in the texture vectors $g_1, \ldots, g_N$. Afterwards, the very same PCA-based procedure as for the shape model is employed, which results in the linear texture model

$$g = \bar{g} + P_g b_g, \tag{2}$$

where $g$ is an arbitrary shape-normalized texture with texture parameters $b_g$, $P_g$ are the texture eigenvectors, and $\bar{g} = \frac{1}{N} \sum_{n=1}^{N} g_n$ is the mean texture of the samples.

To obtain a combined representation of both shape and texture, the third - albeit optional - step of AAM training is to merge shape and texture parameters into one parameter set. This is achieved by concatenating the variance-normalized shape and texture parameter vectors for each training sample and again applying PCA. Therefore, each object instance can then be represented by its combined parameters $b_c$. The final parameter count, i.e., the dimension of $b_c$, is then reduced by discarding parameters which explain only a small fraction of the total variance.

2.2 AAM fitting

The goal of AAM fitting is to find the model parameter vector $\hat{b}_c$ that best fits an object instance shown in a given input image. Technically, the optimization criterion is to minimize the squared norm of the difference $\delta g = g_{\text{image}} - g_{\text{model}}$ between the given image and the synthesized appearance of the AAM instance, i.e.,

$$\hat{b}_c = \operatorname*{argmin}_{b_c} \, \delta g^\top \delta g. \tag{3}$$

In its original formulation [23-25], this problem was solved in an iterative manner by assuming a linear relationship $\delta b_c = A \, \delta g$ between the necessary model parameter changes $\delta b_c$ and the current image difference $\delta g$, where $A$ can be learned in advance. In general, however, such a simple constant relationship between $\delta b_c$ and $\delta g$ does not exist, which can lead to suboptimal fitting results [40]. An alternative optimization approach for Equation 3 is the inverse compositional/project-out algorithm [40]. By decoupling shape and texture parameters, it allows for a very efficient alignment that eliminates many drawbacks of the original AAM fitting method.

Note, however, that our augmented AAM approach presented in Section 3 is independent of the actual optimization scheme - it is possible to base it on both the additive as well as the inverse compositional methods (cf. Subsection 3.1).
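The additive update scheme can be sketched as follows. This is a toy illustration under strong assumptions: the texture residual is treated as a black-box function of the parameters, the update matrix is estimated by plain least-squares regression on random perturbations, and the demo uses a purely linear texture model without image warping, so the linear relation holds exactly. All function names, the damping factors, and the toy data are hypothetical.

```python
import numpy as np

def learn_update_matrix(residual_fn, b_ref, n_samples=200, scale=0.5, seed=0):
    """Estimate A in delta_b = A @ delta_g by regression on perturbations.

    residual_fn(b) returns delta_g = g_image - g_model(b) for a training
    image whose optimal parameters b_ref are known.
    """
    rng = np.random.default_rng(seed)
    # Displace the parameters by -d_b, so that +d_b is the needed correction.
    d_b = rng.normal(scale=scale, size=(n_samples, b_ref.size))
    d_g = np.stack([residual_fn(b_ref - db) for db in d_b])
    # Least-squares solution X of d_g @ X = d_b, hence A = X.T.
    X, *_ = np.linalg.lstsq(d_g, d_b, rcond=None)
    return X.T

def fit_additive(residual_fn, A, b_init, n_iter=10):
    """Iterate b <- b + alpha * A @ delta_g with simple step damping."""
    b = b_init.copy()
    for _ in range(n_iter):
        d_g = residual_fn(b)
        err = d_g @ d_g
        step = A @ d_g
        for alpha in (1.0, 0.5, 0.25):  # illustrative damping factors
            b_try = b + alpha * step
            r = residual_fn(b_try)
            if r @ r < err:
                b = b_try
                break
        else:
            break  # no step reduced the error: stop
    return b

# Toy demo: linear texture model with orthonormal modes and zero mean texture.
rng = np.random.default_rng(1)
P, _ = np.linalg.qr(rng.normal(size=(30, 3)))
b_true = np.array([1.0, -2.0, 0.5])
g_image = P @ b_true
residual = lambda b: g_image - P @ b
A = learn_update_matrix(residual, b_true)
b_hat = fit_additive(residual, A, np.zeros(3))
```

In a real AAM the residual depends on the parameters through the image warp, so the learned relation is only approximate and several damped iterations are typically needed.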

2.3 Multi-view extension

While standard AAMs can only be used for a single camera view, possible extensions are available for scenarios which contain more than one camera, e.g., [41] or [42]. In our case, a biplanar image acquisition is usual, although monocular sequences also exist. In addition, for many previously recorded datasets from biological studies such as [7], a calibration of the camera setup is not available. Therefore, in our locomotion scenario, it is generally not possible to apply any of the methods mentioned above, as they rely on certain assumptions about the scene. However, it is still possible to exploit relationships between multiple camera views using multi-view AAMs [43, 44], as shown in [30].

The construction of multi-view AAMs is closely related to standard AAMs. Let K denote the number of camera views. As a first step, the aligned landmark vectors $s_n^{(1)}, \ldots, s_n^{(K)}$ of all camera views are concatenated into one vector $s_n'$. Afterwards, PCA is applied to obtain the multi-view shape model in the same manner as for standard AAMs. As for the multi-view landmarks, for each training sample the texture vectors of all views are concatenated to form the vector $g_n'$ and PCA is applied. Note that this multi-view extension is used in exactly the same manner for augmented AAMs, which are presented in the following section.
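The concatenation step can be sketched in a few lines of NumPy. The example assumes two views with already aligned shape vectors stored row-wise; the helper names and the random toy data are hypothetical. Each principal mode of the joint model spans both views, which is how the model couples them.

```python
import numpy as np

def concat_views(per_view):
    """Stack per-view vectors of each sample: K arrays (N, D_k) -> (N, sum D_k)."""
    return np.hstack(per_view)

def split_views(joint, dims):
    """Undo concat_views for a single joint vector."""
    return np.split(joint, np.cumsum(dims)[:-1])

rng = np.random.default_rng(0)
top = rng.normal(size=(12, 6))   # toy data: 12 samples, 3 landmarks (top view)
side = rng.normal(size=(12, 8))  # toy data: 12 samples, 4 landmarks (side view)

joint = concat_views([top, side])
mean = joint.mean(axis=0)
# PCA on the concatenated vectors, as for a single-view shape model.
_, _, vt = np.linalg.svd(joint - mean, full_matrices=False)
# The first joint mode decomposes into one part per camera view.
mode_top, mode_side = split_views(vt[0], [6, 8])
```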

3 Augmented AAM approach

In the following, augmented AAMs, our extension of standard AAMs, are presented. As stated in the motivation (cf. Subsection 1.2), the goal is to overcome poor fitting results in cases of non-stationary shape activities [31, 32] and non-discriminative texture information, both of which are present in the locomotion analysis scenario considered in this paper. We achieve this goal by augmenting the fitting process of standard AAMs with various types of constraints. A general overview of augmented AAMs is shown in Figure 2. It depicts the different components which contribute to the final system, most of which are directly based on the given training data. An AAM trained on all landmarks of the training data forms the baseline of our augmented AAM (‘full AAM training’ in Figure 2). The fitting step of this AAM is then augmented using constraints derived from (1) a standard AAM trained only on the subset of stationary (i.e., torso and upper leg) landmarks, (2) local tracking methods for lower leg landmarks, (3) anatomical knowledge, and (4) the epipolar geometry of the scene.

In Subsection 3.1, we first derive a general framework for the inclusion of AAM fitting constraints. The remainder of this section gives a detailed description of the particular constraints used for the application to locomotion sequences. In Subsection 3.6, the necessary conditions of our approach and its ability to generalize to other scenarios are discussed.

3.1 AAM fitting with constraints

For standard AAMs, it is not possible to include further knowledge - i.e., constraints - into the fitting process. We therefore reformulate AAM fitting within a maximum a posteriori (MAP) framework, which includes the approach of [37] as a special case. By definition, the MAP estimate $\hat{b}_{c,\text{MAP}}$ of the combined AAM parameter vector maximizes the posterior probability given the observations, in our case the input image $I$ and the fitting constraints $\pi$, i.e.,

$$\hat{b}_{c,\text{MAP}} = \operatorname*{argmax}_{b_c} \, p(b_c \mid I, \pi). \tag{4}$$

By assuming conditional independence of the image data $I$ and the provided constraints $\pi$ given the parameter vector $b_c$, we can rewrite Equation 4 as

$$\hat{b}_{c,\text{MAP}} = \operatorname*{argmax}_{b_c} \, p(I \mid b_c) \cdot p(\pi \mid b_c) \cdot p(b_c). \tag{5}$$

For the first likelihood term $p(I \mid b_c)$, not the whole input image $I$ is relevant, but only its sampled version $g_{\text{image}}$, which is based on the AAM shape configuration specified by $b_c$, i.e., $p(I \mid b_c) = p(g_{\text{image}} \mid b_c)$. As for standard AAMs, we assume the fitting process to be initialized at a parameter combination close to the optimal value. The likelihood can then be modeled as a Gaussian distribution $g_{\text{image}} \mid b_c \sim \mathcal{N}(g_{\text{model}}, \Sigma_{g_{\text{image}} - g_{\text{model}}})$ or equivalently $p(g_{\text{image}} \mid b_c) = p(\delta g)$ with $\delta g \sim \mathcal{N}(0, \Sigma_{\delta g})$. The covariance matrix $\Sigma_{\delta g}$ of the texture errors can be estimated in the training step of the AAM and is usually assumed to be diagonal due to its large dimensionality (cf. Subsection 2.2).

The likelihood term $p(\pi \mid b_c)$ of Equation 5 integrates constraints into the fitting process. Here, $\pi$ is a vector which contains the differences between given target values (constraints) and the actual values based on the current AAM parameters $b_c$. We again assume a Gaussian distribution, i.e., $\pi \mid b_c \sim \mathcal{N}(0, \Sigma_\pi)$. Concrete configurations of $\pi$ for different types of priors will be presented in the following subsections. Note that if multiple prior types are used, as is the case in our scenario, Equation 5 contains one likelihood term for each prior type.

The prior term $p(b_c)$ of Equation 5 can be modeled in various ways, e.g., using a uniform distribution (resulting in a maximum likelihood estimation) or a zero-mean Gaussian distribution [37]. To favor model configurations with a low complexity, in this work we prefer the latter method.

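As a result of the above considerations, maximizing Equation 5 is equivalent to minimizing its negative log likelihood, thus

$$\hat{b}_{c,\text{MAP}} = \operatorname*{argmin}_{b_c} \, \delta g^\top \Sigma_{\delta g}^{-1} \delta g + \pi^\top \Sigma_\pi^{-1} \pi + b_c^\top \Sigma_{b_c}^{-1} b_c. \tag{6}$$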

As mentioned above, Equation 6 can be optimized using arbitrary methods. One possible approach is based on the standard additive AAM parameter update scheme [25], which is derived in [37] and is used in this work. However, it is also possible to reformulate Equation 6 - i.e., AAM fitting with constraints - for the inverse compositional/project-out approach [40], which is described in detail in [45].
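To make the structure of Equation 6 concrete, the following sketch evaluates the objective for given residuals. It only computes the cost and does not implement the parameter update of [37] or [45]. The function names are hypothetical, and covariances may be passed as a full matrix, a vector of variances (diagonal case), or a scalar (isotropic case), which is a convenience assumed here.

```python
import numpy as np

def mahalanobis_sq(x, cov):
    """Return x^T cov^{-1} x for a full, diagonal (1-D), or scalar covariance."""
    x = np.asarray(x, dtype=float)
    cov = np.asarray(cov, dtype=float)
    if cov.ndim < 2:
        # Scalar or per-component variances: no matrix inverse needed.
        return float(np.sum(x * x / cov))
    return float(x @ np.linalg.solve(cov, x))

def map_cost(delta_g, cov_g, constraints, b_c, cov_b):
    """Objective of Equation 6 with one quadratic term per constraint type.

    constraints is an iterable of (pi, cov_pi) pairs, e.g. for the anchor,
    local, anatomical, and epipolar terms.
    """
    cost = mahalanobis_sq(delta_g, cov_g)                            # texture term
    cost += sum(mahalanobis_sq(pi, cov) for pi, cov in constraints)  # constraints
    cost += mahalanobis_sq(b_c, cov_b)                               # parameter prior
    return cost
```

A small covariance makes the corresponding term dominate the sum, which is how individual constraints can be weighted against the texture error.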

3.2 Anchor AAM

The first type of constraints we use for fitting the full-body AAM are the results of an ‘anchor AAM’ or ‘subset AAM,’ which is an AAM applied on the subset of stationary landmarks, i.e., the torso and upper leg landmarks (cf. Figure 1). We include the results using the tracked landmark locations as positional constraints. Therefore, $\pi_{\text{anchor}}$ is the difference vector between target and current landmark positions. To estimate the reliability of the constrained positions and thus $\Sigma_{\pi_{\text{anchor}}}$, robust confidence measures derived from the AAM fitting process (e.g., [46] or [47]) can be applied.

3.3 Robust local tracking constraints

While standard AAMs have problems with landmarks located at distal limb segments such as the lower legs, the data-driven approach in [16] was specifically designed for tracking in X-ray sequences containing occlusions. In former studies, the method was proven to be well-suited for tracking the subset of lower leg landmarks of the side camera view, but it is inapplicable for landmarks with more severe occlusions such as the knee landmarks of the side view or feet landmarks of the top view. We include the tracking results for the lower leg landmarks as additional constraints $\pi_{\text{local}}$ into the augmented AAM. As for $\pi_{\text{anchor}}$, the vector $\pi_{\text{local}}$ is the difference between target and current landmark positions. For the estimation of the corresponding covariance matrix $\Sigma_{\pi_{\text{local}}}$, the same options as for the local detector used in [36] apply. In our case, due to the high accuracy of the local method [16], it is sufficient to use an isotropic covariance.

3.4 Anatomical constraints

For the challenging tracking scenario at hand, the inclusion of anatomical context knowledge is an important point to consider. As demonstrated in [16, 48], one possibility is to perform a segmentation of the images into relevant anatomical parts - in our case, the torso, left leg, and right leg. For the side view of the bird locomotion scenario at hand, this segmentation can be obtained in three simple steps:

  1. Global thresholding and contour finding → whole-body segment

  2. Iterative ellipse fitting on the whole body → torso segment

  3. Removing the torso segment from the whole-body segment → leg segments

Here, the main problem is to find the correct correspondence between the two leg segments in the images and their anatomical counterparts. We propose to use the anchor AAM’s training data to train a regression model which can predict the correct correspondence for the entire sequence based on the AAM’s model parameters.

To include the results of the anatomical image segmentation into the fitting process, we define $\pi_{\text{anatomical}}$ to be the vector which for each landmark $p_m = (x_m, y_m)$ contains the minimum Euclidean distance to its corresponding segment $S(m)$, i.e.,

$$\pi_{\text{anatomical},m} = \min_{q \in S(m)} d(p_m, q). \tag{7}$$

To quickly obtain values for Equation 7 during the fitting process, we precompute distance transformed images for each segment using the algorithm presented in [49, 50]. However, faster approximations of the distance transform such as [51] can also be used, as small errors in the computed distances do not affect the overall result.
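The quantity in Equation 7 can be illustrated with a brute-force NumPy sketch. It computes the distance of each landmark to the nearest pixel of its segment directly, rather than via the precomputed distance transforms described above, so it shows the definition and not the efficient implementation. The function name, the mask dictionary, and the segment labels are hypothetical.

```python
import numpy as np

def anatomical_residuals(landmarks, segment_of, masks):
    """Minimum Euclidean distance of each landmark to its segment (Equation 7).

    landmarks:  (M, 2) array of (x, y) positions in pixel coordinates
    segment_of: length-M sequence naming the segment of each landmark
    masks:      dict mapping segment name -> boolean image (rows = y, cols = x)
    """
    res = np.empty(len(landmarks))
    for m, (x, y) in enumerate(landmarks):
        ys, xs = np.nonzero(masks[segment_of[m]])
        # Brute force over all segment pixels; a distance transform would
        # replace this by a single image lookup per landmark.
        res[m] = np.sqrt((xs - x) ** 2 + (ys - y) ** 2).min()
    return res
```

A landmark lying inside its segment yields a residual of zero, so this term stops influencing the fit once all landmarks are located in their anatomical regions.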

Because anatomical region constraints can only provide a coarse estimate for individual landmark positions, for the covariance matrix $\Sigma_{\pi_{\text{anatomical}}}$, we assume a scaled identity matrix $\sigma^2 I$, where $\sigma^2$ is chosen to be substantially smaller than the covariances of other priors. This has the effect that the fitting process at first is completely driven by the anatomical constraints. When, as a result, each landmark $p_m$ is aligned to its corresponding anatomical segment $S(m)$, i.e., $p_m \in S(m)$, the vector $\pi_{\text{anatomical}}$ becomes zero and the fitting procedure is governed by other constraints.

3.5 Epipolar priors

Although a camera calibration is not available for all datasets, it is still possible to include knowledge about the camera geometry into the fitting process. We can estimate the fundamental matrix $F$ by exploiting the fact that point correspondences for the two camera views are available from the anchor AAM’s training data. For each pair $(v_n, u_n)$ of homogeneous landmarks from the top and side view, we then add the additional constraint $\pi_{\text{epipolar}}$ with

$$\pi_{\text{epipolar},n} = v_n^\top F u_n. \tag{8}$$

Equation 8 becomes zero if $v_n$ is located on the epipolar line $F u_n$ and vice versa. The covariance matrix $\Sigma_{\pi_{\text{epipolar}}}$ can be estimated by evaluating Equation 8 on the points used for the estimation of $F$.
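The epipolar residual of Equation 8 can be sketched as follows. The example does not estimate the fundamental matrix from correspondences; instead it assumes an idealized toy setup (identity intrinsics, no rotation, translation t between the cameras), for which the fundamental matrix reduces to the skew-symmetric matrix of t. The function names and the toy geometry are hypothetical.

```python
import numpy as np

def skew(t):
    """Skew-symmetric matrix [t]_x with skew(t) @ x == np.cross(t, x)."""
    return np.array([[0.0, -t[2], t[1]],
                     [t[2], 0.0, -t[0]],
                     [-t[1], t[0], 0.0]])

def epipolar_residuals(F, v, u):
    """pi_n = v_n^T F u_n for homogeneous points v and u of shape (N, 3)."""
    return np.einsum("ni,ij,nj->n", v, F, u)

# Toy geometry (assumption): second camera shifted by t, no rotation.
t = np.array([1.0, 0.0, 0.0])
F = skew(t)

rng = np.random.default_rng(0)
X = rng.uniform([-1.0, -1.0, 2.0], [1.0, 1.0, 5.0], size=(5, 3))  # 3D points
u = X / X[:, 2:3]              # projections in the first view
v = (X + t) / (X + t)[:, 2:3]  # projections in the second view

res_exact = epipolar_residuals(F, v, u)    # correct correspondences
v_off = v.copy()
v_off[:, 1] += 0.1                         # move points off their epipolar lines
res_off = epipolar_residuals(F, v_off, u)
```

Evaluating the residuals on the training correspondences and taking their empirical variance gives one simple estimate of the covariance of this term.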

3.6 Generalization to other scenarios

The presented method was specifically designed for the skeletal locomotion tracking scenario at hand. For this particular case, X-ray acquisition is a necessity, as all skeletal landmarks of interest must be observable. In addition, all parts of interest of an animal must remain in the field of view during the whole sequence, which generally implies the use of a treadmill. As the appearance of the animal is modeled using multi-view AAMs [43, 44] (cf. Subsection 2.3), the camera setup must remain static during a recording. Similarly, if a trained model is to be reused for another sequence, the recordings must share an identical camera setup. However, as for standard multi-view AAMs, the number of cameras used for a sequence is flexible - in fact, the validation of our approach presented in Section 4 includes datasets with one camera view as well as datasets with two camera views.

More generally, the main characteristics of the data which led to our approach are non-stationary landmark movements and non-discriminative local texture information of certain landmarks (cf. Subsection 1.2). Therefore, the idea of augmented active appearance models should be applicable for all scenarios (1) in which landmarks and texture can be modeled by active appearance models, (2) which suffer from the data characteristics mentioned above, and (3) for which sufficient fitting constraints can be obtained. One possible example might be a medical scenario, in which certain anatomical structures are to be tracked in an image sequence.

4 Experiments and results

The evaluation of our holistic approach for anatomical landmark tracking is performed on 32 real-world X-ray bird locomotion sequences. The datasets were recorded in the course of three large-scale zoological studies - namely [7], [6], and [34] - and comprise five species (quails, jackdaws, tinamous, bantams, and lapwings) which differ in morphology and locomotion characteristics. The acquisition of all sequences was carried out using a state-of-the-art biplanar high-speed X-ray system, based on the Neurostar X-ray device (Siemens AG, Munich, Germany). All images have a resolution of 1,536 × 1,024 pixels and were recorded at 1,000 frames per second. A total of 42,909 frames (approximately 125 GB of raw image data) was used in the course of this evaluation. Except for lapwings, all datasets have a biplanar camera setup and use the multi-view version of AAMs and augmented AAMs. Camera calibration allowing three-dimensional (3D) triangulation and evaluation of the tracking results is available for exactly one dataset. For each dataset, landmark positions manually located by human experts (biologists) are available, usually for every tenth frame of a sequence. Typical landmarks used for these datasets are depicted in Figure 1. A total of 175,942 ground-truth landmark positions were used for the comparisons presented in this paper. The actual number of ground-truth landmarks defined for each image varies per dataset and ranges from 14 to 24, with typical values being 20 landmarks per image. An overview of the employed datasets is shown in Figure 4.

Figure 4

Dataset overview. Overview of the 32 real-world bird locomotion datasets used for the evaluation of our approach. The datasets originate from three zoological studies [6, 7, 34], including birds of different morphology and locomotion characteristics.

We evaluate our approach based on the point-to-point error [52], i.e., the Euclidean error (in pixels for the 2D case and in millimeters for the 3D case) between manually located and automatically tracked landmark positions. For each sequence, an AAM was trained based on exactly one stride, using the provided landmark data. In all cases, at most ten frames of a sequence were used for AAM training. Afterwards, all frames of the sequence were tracked using our presented augmented AAM approach.
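The point-to-point error amounts to a per-landmark Euclidean distance, as in the following sketch. The function name and the array layout are hypothetical; the same code covers 2D positions in pixels and 3D positions in millimeters.

```python
import numpy as np

def point_to_point_errors(tracked, ground_truth):
    """Euclidean distance per landmark for arrays of shape (..., D), D = 2 or 3."""
    diff = np.asarray(tracked, dtype=float) - np.asarray(ground_truth, dtype=float)
    return np.linalg.norm(diff, axis=-1)

# Toy usage: two landmarks in one frame.
errors = point_to_point_errors([[3.0, 4.0], [10.0, 10.0]],
                               [[0.0, 0.0], [10.0, 10.0]])
median_error = float(np.median(errors))
```

Summary statistics such as the median are then taken over all annotated frames and landmarks of a group.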

4.1 Comparison to standard AAMs

As a proof of concept, we first compare our augmented AAMs to the results obtained by standard AAMs. For both methods, identical experimental setups were used - they only differ in the fitting method. The quantitative and qualitative comparisons for the real-world bird locomotion datasets are shown in Figure 5, grouped by camera view and bird species. For a better overview, landmarks are grouped into anatomical subsets: the torso (e.g., pelvis, furcula, and neck), upper legs (hip joints and knee joints), and lower legs (intertarsal joints and feet).

Figure 5

Comparison of our augmented AAM approach to standard AAMs. The evaluation is based on 32 real-world bird locomotion datasets comprising five species. In (a), the Euclidean point-to-point errors between tracked and ground-truth landmarks are shown as Tukey boxplots grouped by tracking method, bird species, camera view, and landmark group. In (b), qualitative tracking results are shown for the example of a typical jackdaw sequence. Our presented approach clearly outperforms standard AAMs in terms of fitting accuracy. The lower leg landmarks in particular benefit drastically from the augmented method, as the median error is consistently reduced to below 25 pixels for image sizes of 1,536 × 1,024 pixels.

From the quantitative results presented in Figure 5a, it can be seen that augmented AAMs substantially outperform standard AAMs in terms of fitting accuracy in all cases. This is particularly apparent for lower leg landmarks, where median errors of up to 150 pixels are consistently reduced to below 25 pixels for image sizes of 1,536 × 1,024 pixels. As a typical example, for all 15 quail sequences, the median point-to-point error of lower leg landmarks of the side camera view is about 110 pixels for standard AAMs and only about 20 pixels for augmented AAMs. The reason for this result is that lower leg landmarks in particular are prone to non-stationary shape movements and non-discriminative texture information, which drastically complicates standard AAM fitting but can be handled well by augmented AAMs. For other landmark groups, augmented AAMs are also clearly superior to their standard AAM counterparts: for the example of the 15 quail datasets, the median point-to-point error of torso landmarks of the top camera view is about 25 pixels for standard AAMs and about 15 pixels for augmented AAMs. The general performance disparity between the five bird species can be explained by different locomotion characteristics. For birds such as tinamous, the movement of the lower leg landmarks is less dominant compared to species such as jackdaws (cf. images in Figure 4).
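For readers unfamiliar with the Tukey boxplots used in Figure 5a, the snippet below sketches how their summary statistics are commonly derived from a list of errors: the quartiles, the whiskers at the most extreme data points within 1.5 interquartile ranges of the box, and the remaining points as outliers. The function name `tukey_boxplot_stats`, the choice of the `inclusive` quantile method, and the error values are assumptions for illustration; the plotting software used for the figure may compute quartiles slightly differently.

```python
from statistics import quantiles


def tukey_boxplot_stats(errors):
    """Quartiles, whisker ends, and outliers following Tukey's 1.5 * IQR rule."""
    data = sorted(errors)
    q1, med, q3 = quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    inside = [e for e in data if low_fence <= e <= high_fence]
    return {
        "q1": q1,
        "median": med,
        "q3": q3,
        "whisker_low": inside[0],    # smallest value within the fences
        "whisker_high": inside[-1],  # largest value within the fences
        "outliers": [e for e in data if e < low_fence or e > high_fence],
    }


# Hypothetical point-to-point errors in pixels (illustration only).
stats = tukey_boxplot_stats([10, 12, 14, 15, 18, 20, 22, 25, 110])
print(stats)
```

Because the median and the quartiles are insensitive to a few gross tracking failures, such a summary remains informative even when individual frames are tracked poorly.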

In Figure 5b, qualitative tracking results for standard AAMs and augmented AAMs are presented for the lower leg landmarks of a jackdaw. The landmarks located by standard AAMs are clearly inaccurate in most cases, while augmented AAMs give reliable results. An example video showing tracking results of standard AAMs and augmented AAMs is provided in Additional file 1.

Additional file 1: Video 1: Qualitative tracking examples. This video shows some qualitative landmark tracking results for the scenario of grounded bird locomotion. (AVI 19 MB)

The above comparison clearly shows that our augmented AAM approach, as opposed to standard AAMs, is well suited for tracking the entire set of anatomical landmarks in this challenging scenario. Based on a large-scale study which analyzes the accuracy of manually located landmarks in X-ray locomotion scenarios [34], it can be stated that the accuracy of our approach is comparable to the performance of human experts.

4.2 Comparison to non-holistic approaches

While our augmented AAM approach is holistic in the sense that all landmarks are modeled in one consistent framework, it uses constraints obtained from methods which only perform well on very specific landmark subsets (cf. Section 3).

The question that we therefore would like to address is how an augmented AAM performs in direct comparison to each of the non-holistic approaches which provide its constraints. Quantitative results of this comparison are shown in Figure 6 for the two non-holistic tracking methods:

  1. Subset AAM: standard multi-view AAM for the subset of torso and upper leg landmarks only [30]

  2. Local tracking: robust local template tracking for lower leg landmarks of the side view only [16]

Figure 6

Comparison of augmented AAMs to non-holistic methods. Subset AAM corresponds to the non-holistic method of standard AAMs applied to its subset of well-trackable landmarks (torso and upper leg, cf. Subsection 3.2), as presented in [17, 30, 34]. Local tracking denotes the non-holistic data-driven tracking approach of [16] applied to its subset of well-trackable landmarks (lower leg, cf. Subsection 3.3). On their specific landmark subsets, non-holistic methods are superior to augmented AAMs. However, augmented AAMs incorporate all landmarks in a consistent framework and also provide reliable positions for landmarks which are not covered by any of the non-holistic approaches.

It is important to note that for both cases, the evaluation is performed only on the specific landmark subset of the respective non-holistic method.

As can be seen in the top row of Figure 6, the median error of the subset AAM is between 2 pixels (tinamous, top view) and 5 pixels (quails, side view) smaller than for corresponding landmarks of the augmented AAM. For the example of quails, the median error of the side view landmarks is about 10 pixels for subset AAMs, and about 15 pixels for augmented AAMs. This effect can be explained by the fact that the subset AAM is optimized for these specific landmarks, while the augmented AAM mediates between various fitting constraints for all landmarks - even those not covered in this comparison. In addition, the shape and texture models of the augmented AAM are more complex due to the increased scope and thus are harder to optimize. The results of the second non-holistic method, robust template tracking, are presented in the bottom row of Figure 6 and show the same tendency. While local tracking is even more accurate than the subset AAM, the performance of the augmented AAM is similar for both comparisons. Here, the very same explanations as before apply.

As a result, we can state that both non-holistic methods are more accurate on their specific landmark subsets than our holistic approach. However, the holistic approach has the essential advantage that it also can reliably and consistently track landmarks which are covered by neither of the non-holistic approaches, as for instance the lower leg landmarks of the top camera view (cf. Figure 5).

4.3 Influence of constraints

As our approach combines several fitting constraints, an important aspect is the practical relevance of individual constraint types. It is to be expected that positional constraints such as local tracking priors will have a larger benefit on fitting accuracy than, e.g., anatomical constraints. However, the question is whether a combination of several constraints can improve the fitting results. We therefore compare the performance of augmented AAMs using different combinations of constraints described in Section 3.

In Figure 7, quantitative results of this analysis are depicted. Due to the large number of comparisons, results are shown only for jackdaws and tinamous, which according to Figure 5 have the worst and best tracking performance, respectively. It can be seen that torso and upper leg landmarks behave similarly in both cases. Whenever constraints of the anchor AAM are provided for these landmarks, the holistic model seems to reach its maximum accuracy and no other constraints are beneficial.

Figure 7

Influence of constraints for augmented AAM fitting. The Euclidean point-to-point errors between tracked and ground-truth landmarks are shown as Tukey boxplots grouped by the type of used constraints for AAM fitting, bird species, camera view, and landmark group. The types of used constraints range from ‘none’ (standard AAMs) to ‘all’ (augmented AAMs). Results are exemplarily shown for a difficult bird species (jackdaws) and an easy scenario (tinamous). In challenging scenarios, all constraints contribute to the final fitting performance, as can be seen for the lower leg landmarks of jackdaws.

Similarly, for lower leg landmarks, it is sufficient to use local template tracking constraints in easy scenarios (tinamous). However, in more challenging scenarios (jackdaws), all constraints contribute to the final fitting performance. In both scenarios, epipolar constraints primarily improve the results of the top view. This is mainly due to the fact that the lower leg landmarks have no positional constraints for the top view and thus have a larger inaccuracy. While anatomical constraints do not increase accuracy when used together with local constraints, they improve results of standard AAMs in complicated scenarios (jackdaws). This fits their intended purpose of providing a rough initial landmark estimate for the other constraints (cf. Section 3).

An example in Figure 7 which demonstrates all aspects of the above argumentation is the case of lower leg landmarks in the top view for jackdaws. If no constraints are used (‘none,’ standard AAMs), the median point-to-point error is larger than 125 pixels. If all but local tracking priors are employed (‘without local’), a median error of about 55 pixels is obtained. When using all priors (‘all,’ augmented AAMs), the median error is smallest, at about 25 pixels.

4.4 3D evaluation of tracking results

To allow an analysis of uncalibrated animal locomotion datasets recorded in previous biological studies such as [7], augmented AAMs do not rely on a calibrated camera setup, albeit both X-ray camera views are modeled in a consistent manner. However, for datasets having calibration information available, using 3D landmark positions instead of projected 2D positions is desired for biological evaluations (e.g., [8]). In the case of a known camera calibration, this can be achieved by triangulating the 2D tracking results of both X-ray camera views [53, 54]. Similarly, we obtain 3D ground-truth landmarks by triangulating the given 2D ground-truth landmark locations. In the following, we evaluate the 3D accuracy of the landmarks tracked with our approach in order to

  1. State whether our approach is accurate enough to produce reliable 3D results

  2. Obtain an upper error bound for pure 3D tracking methods

Currently, camera calibration is only available for exactly one of the 32 datasets presented above - namely a quail dataset having 1,841 frames which cover 24 steps. For the calibration of this dataset, a custom-built metal plate with a size of 140 mm × 60 mm × 0.5 mm was employed. It contains 18 uniquely identifiable holes which are easily detectable in both X-ray and visible light cameras. For the actual calibration, we use the method of Zhang [55]. The mean backprojection error of the intrinsic camera calibration is 0.27 pixels at an image size of 1,536 × 1,024 pixels.

In Figure 8, both qualitative and quantitative results of the 3D evaluation are presented. From the quantitative results in Figure 8a, it can be seen that the median point-to-point error of all landmark types does not exceed 5 mm. Compared to the animal’s body length of approximately 200 mm, this error (about 2.5% of the body length) is negligible for many practical biological evaluations. Additionally, this error serves as a rough upper bound for methods which perform pure 3D tracking. The largest median error (5 mm) is obtained for lower leg landmarks, which is in accordance with the results of the 2D evaluation (cf. Figure 5). The rather surprising result that upper leg landmarks have a slightly lower median error (2.7 mm) than torso landmarks (3.5 mm) is caused by 2D tracking inaccuracies in the top view of this particular dataset. To allow a visual assessment of the 3D accuracy, Figure 8b shows the reprojected landmark positions for one step of the animal which was additionally filmed with a visible light camera. A video showing these reprojected 3D landmarks is provided in Additional file 2.

Figure 8

3D evaluation of landmark tracking using augmented AAMs. The evaluation was based on one quail dataset with known camera calibration. 3D landmark positions were obtained by triangulating 2D tracking results. In (a), the 3D Euclidean point-to-point errors between triangulated tracked and triangulated ground-truth landmarks are shown as Tukey boxplots grouped by landmark group. The quail has a body length of approximately 200 mm. Based on these results, a median error of 5 mm can be seen as a rough upper bound for future methods which perform pure 3D tracking. In (b), reprojected landmarks are shown for a visible light camera to visualize the accuracy.

4.5 Implementation details

Both the augmented AAM approach presented in this work and the standard AAMs were implemented entirely in the programming language R. The robust template tracking approach of [16], which is used to provide local AAM fitting constraints, was implemented in C++ using the OpenCV library [56]. All experiments were performed on a typical desktop computer with an Intel Core i5-760 CPU at 2.80 GHz. On average, the creation of all fitting constraints for the augmented AAM was performed at 13.2 frames per second (fps) for the anchor AAM, 11.4 fps for local tracking constraints, and 2.8 fps for torso and leg distance constraints (cf. Section 3). Given these constraints, our implementation of augmented AAMs runs at 0.5 fps. Note that this value could be drastically increased by using a pure C/C++ implementation, by employing the inverse compositional/project-out optimization [40, 45] instead of the additive method [37], and by exploiting the vast parallelization capability of the approach. In the animal locomotion scenario at hand, however, real-time processing of datasets is not of primary importance.
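To put these rates into perspective, the following back-of-the-envelope sketch converts them into a per-frame processing time. It rests on two assumptions that are not stated in the text: that all stages run one after another, and that each reported rate refers to one pass over every frame. The names `STAGE_FPS`, `seconds_per_frame`, and `total_minutes` are hypothetical.

```python
# Stage throughputs in frames per second, as reported in the text.
STAGE_FPS = {
    "anchor_aam_constraints": 13.2,
    "local_tracking_constraints": 11.4,
    "distance_constraints": 2.8,
    "augmented_aam_fitting": 0.5,
}


def seconds_per_frame(stage_fps):
    """Per-frame cost if every stage handles each frame once, sequentially."""
    return sum(1.0 / fps for fps in stage_fps.values())


def total_minutes(num_frames, stage_fps):
    """Estimated wall-clock time in minutes for a whole sequence."""
    return num_frames * seconds_per_frame(stage_fps) / 60.0


per_frame = seconds_per_frame(STAGE_FPS)
print(f"{per_frame:.2f} s per frame, {1.0 / per_frame:.2f} fps end to end")
print(f"{total_minutes(1841, STAGE_FPS):.1f} min for the 1,841-frame quail dataset")
```

Under these assumptions the fitting step dominates, contributing 2 s of the roughly 2.5 s per frame, which is consistent with the remark that optimizing the fitting itself would yield the largest speed-up.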

5 Conclusions and further work

In this paper, we presented augmented active appearance models, a general approach for AAM fitting in cases of non-stationary shape motions and non-discriminative local texture information. Our method is based on a holistic, probabilistic framework which allows the inclusion of arbitrary fitting priors. We applied our approach to the challenging scenario of landmark tracking in X-ray animal locomotion sequences, for which until now only methods for specific landmark subsets existed. For this particular scenario, we presented various types of suitable fitting constraints that were included in our probabilistic framework. Extensive experiments based on 32 real-world datasets including 175,942 ground-truth landmark positions showed that our approach clearly outperforms standard AAM fitting and allows all landmarks of interest to be tracked reliably. In addition, we showed that the accuracy of our approach is sufficient to provide reliable 3D landmark estimates for calibrated datasets.

For further work, an interesting and relevant point to consider is the scenario of non-cyclic locomotion, for instance birds running over obstacles. Another important problem we want to solve is how to transfer already trained models to different tracking scenarios, such as adapting a quail model to be able to track tinamous. Both points require the adaptation of a given model to novel cases, and we plan to utilize methods from incremental learning [47] and domain adaptation for this task. Inspired by the promising results of 3D landmark estimation for calibrated datasets, another idea for further work is the inclusion of additional imaging modalities such as visible light cameras into the tracking process.


References

  1. Gatesy SM: Guineafowl hind limb function. I: cineradiographic analysis and speed effects. J. Morphol 1999,240(2):1097-4687.

  2. Fischer MS, Schilling N, Schmidt M, Haarhaus D, Witte H: Basic limb kinematics of small therian mammals. J Exp Biol 2002,205(Pt 9):1315-1338.

  3. Tashman S, Anderst W: In vivo measurement of dynamic joint motion using high speed biplane radiography and CT: application to canine ACL deficiency. J. Biomech. Eng 2003, 125: 238-245. 10.1115/1.1559896

  4. Brainerd EL, Baier DB, Gatesy SM, Hedrick TL, Metzger KA, Gilbert SL, Crisco JJ: X-ray reconstruction of moving morphology (XROMM): precision, accuracy and applications in comparative biomechanics research. J. Exp. Zool. A 2010,313A(5):262-279.

  5. Gatesy SM, Baier DB, Jenkins FA, Dial KP: Scientific rotoscoping: a morphology-based method of 3-D motion analysis and visualization. J. Exp. Zool. A 2010,313A(5):244-261.

  6. Nyakatura JA, Andrada E, Grimm N, Weise H, Fischer MS: Kinematics and center of mass mechanics during terrestrial locomotion of northern lapwings (Vanellus vanellus, Charadriiformes). J. Exp. Zool. A 2012,317(9):580-594. 10.1002/jez.1750

  7. Stößel A, Fischer MS: Comparative intralimb coordination in avian bipedal locomotion. J. Exp. Biol 2012, 215: 4055-4069. 10.1242/jeb.070458

  8. Fischer MS, Lilje K: Dogs in Motion. Dortmund: VDH; 2011.

  9. Hedrick TL: Software techniques for two- and three-dimensional kinematic measurements of biological and biomimetic systems. Bioinspir Biomim 2008,3(3):034001. 10.1088/1748-3182/3/3/034001

  10. Lucas BD, Kanade T: An iterative image registration technique with an application to stereo vision. In Proceedings of the 7th International Joint Conference on Artificial Intelligence (IJCAI ’81). Vancouver: William Kaufmann; August 1981:674-679.

  11. Shi J, Tomasi C: Good features to track. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition. Seattle; 21–23 June 1994:593-600.

  12. Baker S, Matthews I: Lucas-Kanade 20 years on: a unifying framework. Int. J. Comput. Vis 2004,56(3):221-255.

  13. Hager GD, Belhumeur PN: Efficient region tracking with parametric models of geometry and illumination. IEEE T. Pattern Anal 1998,20(10):1025-1039. 10.1109/34.722606

  14. Jurie F, Dhome M: Hyperplane approximation for template matching. IEEE T. Pattern Anal 2002,24(7):996-1000. 10.1109/TPAMI.2002.1017625

  15. Lowe DG: Distinctive image features from scale-invariant keypoints. Int. J. Comput. Vis 2004,60(2):91-110.

  16. Amthor M, Haase D, Denzler J: Fast and robust landmark tracking in X-ray locomotion sequences containing severe occlusions. In Proceedings of the Vision, Modeling and Visualization (VMV) Workshop. Magdeburg; 12–14 November 2012:15-22.

  17. Haase D, Denzler J: Anatomical landmark tracking for the analysis of animal locomotion in X-ray videos using active appearance models. In Image Analysis, ed. by A Heyden, F Kahl. Proceedings of the 17th Scandinavian Conference on Image Analysis (SCIA 2011), no. 6688 in LNCS, Ystad, May 2011 (Springer, Heidelberg, 2011), pp. 604-615.

  18. Rohlfing T, Denzler J, Gräßl C, Russakoff DB, Maurer Jr CR: Markerless real-time 3-D target region tracking by motion backprojection from projection images. IEEE T. Med. Imaging 2005,24(11):1455-1468.

  19. Miranda DL, Schwartz JB, Loomis AC, Brainerd EL, Fleming BC, Crisco JJ: Static and dynamic error of a biplanar videoradiography system using marker-based and markerless tracking techniques. J. Biomech. Eng 2011,133(12):121002. 10.1115/1.4005471

  20. Ishikawa T, Matthews I, Baker S: Efficient image alignment with outlier rejection. Technical Report CMU-RI-TR-02-27. Carnegie Mellon University Robotics Institute, 2002

  21. Jurie F, Dhome M: Real time robust template matching. Proceedings of the British Machine Vision Conference 2002, (BMVC) (British Machine Vision Association, Cardiff, 2–5 September 2002)

  22. Pan J, Hu B, Zhang JQ: Robust and accurate object tracking under various types of occlusions. IEEE Trans. Circuits Syst. Video Techn 2008,18(2):223-236.

  23. Cootes TF, Edwards GJ, Taylor CJ: Active appearance models. In Computer Vision–ECCV’98 Edited by: Burkhardt H, Neumann B. Proceedings of the 5th European Conference on Computer Vision, Freiburg, 2–6 June 1998. Lecture Notes in Computer Science, vol. 1407 (Springer, Berlin, 1998), pp. 484–498

  24. Cootes TF, Taylor CJ, Edwards GJ: Face recognition using active appearance models. In Computer Vision–ECCV’98 Edited by: Burkhardt H, Neumann B. Proceedings of the 5th European Conference on Computer Vision, Freiburg, 2–6 June 1998. Lecture Notes in Computer Science, vol. 1407 (Springer, Berlin, 1998), pp. 581–595

  25. Cootes TF, Edwards GJ, Taylor CJ: Active appearance models. IEEE T. Pattern Anal 2001,23(6):681-685. 10.1109/34.927467

  26. Ashraf AB, Lucey S, Cohn JF, Chen T, Ambadar Z, Prkachin KM, Solomon PE: The painful face - pain expression recognition using active appearance models. Im. Vis. Comp 2009,27(12):1788-1796. 10.1016/j.imavis.2009.05.007

  27. van der Maaten L, Hendriks E: Action unit classification using active appearance models and conditional random fields. Cogn. Process 2012,13(Suppl 2):S507-S518.

  28. Cootes TF, Taylor CJ: Statistical models of appearance for medical image analysis and computer vision. Medical Imaging: Image Processing, vol. 4322 ed. by M Sonka, KM Hanson. Proceedings of SPIE, Bellingham August 2001, (SPIE, Bellingham, 2001), pp. 236–248

  29. Mitchell SC, Lelieveldt BPF, van der Geest RJ, Bosch JG, Reiber JHC, Sonka M: Multistage hybrid active appearance model matching: segmentation of left and right ventricles in cardiac MR images. IEEE Trans. Med. Imaging 2001,20(5):415-423. 10.1109/42.925294

  30. Haase D, Nyakatura JA, Denzler J: Multi-view active appearance models for the X-ray based analysis of avian bipedal locomotion. Pattern Recognition ed. by R Mester, M Felsberg. Proceedings of the 33rd DAGM Symposium (DAGM), no. 6835 in LNCS, Frankfurt, 31 August to 2 September 2011, (Springer, Berlin, 2011), pp. 11–20

  31. Das S, Vaswani N: Nonstationary shape activities: dynamic models for landmark shape change and applications. IEEE T. Pattern Anal 2010,32(4):579-592.

  32. Vaswani N, Chowdhury AKR, Chellappa R: “Shape activity”: a continuous-state HMM for moving/deforming shapes with application to abnormal activity detection. IEEE Trans. Image Process 2005,14(10):1603-1616.

  33. Cootes TF, Taylor CJ, Cooper DH, Graham J: Active shape models - their training and application. Comput. Vis. Image Underst 1995, 61: 38-59. 10.1006/cviu.1995.1004

  34. Haase D, Denzler J: Comparative evaluation of human and active appearance model based tracking performance of anatomical landmarks in locomotion analysis. Proceedings of the 8th Open German-Russian Workshop Pattern Recognition and Image Understanding (OGRW-8-2011) Nizhny Novgorod, November 2011, 96-99.

  35. Cristinacce D, Cootes TF: Automatic feature localisation with constrained local models. Pattern Recognit 2008,41(10):3054-3067. 10.1016/j.patcog.2008.01.024

  36. Martins P, Caseiro R, Henriques JF: Discriminative Bayesian active shape models. In Proceedings of the 12th European Conference on Computer Vision. Florence; 7–13 October 2012:57-70.

  37. Cootes TF, Taylor CJ: Constrained active appearance models. In IEEE International Conference on Computer Vision (ICCV). Vancouver, BC; 7–14 July 2001:748-754.

  38. Bookstein FL: Landmark methods for forms without landmarks: morphometrics of group differences in outline shape. Med. Image Anal 1997,1(3):225-243. 10.1016/S1361-8415(97)85012-8

  39. Dryden IL, Mardia KV: Statistical Shape Analysis. Chichester: Wiley; 1998.

  40. Matthews I, Baker S: Active appearance models revisited. Int. J. Comput. Vis 2004,60(2):135-164.

  41. Xiao J, Baker S, Matthews I, Kanade T: Real-time combined 2D+3D active appearance models. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Washington D.C.; June 2004:535-542.

  42. Sung J, Kim D: Estimating 3D facial shape and motion from stereo image using active appearance models with stereo constraints. In Third International Conference on Image Analysis and Recognition, Póvoa de Varzim, 18–20 September 2006, Lecture Notes in Computer Science vol. 4142. Berlin: Springer; 2006:457-467.

  43. Lelieveldt B, Üzümcü M, van der Geest R, Reiber J, Sonka M: Multi-view active appearance models for consistent segmentation of multiple standard views. Int. Congr. Ser 2003, 1256: 1141-1146.

  44. Oost E, Koning G, Sonka M, Oemrawsingh PV, Reiber JHC, Lelieveldt BPF: Automated contour detection in X-ray left ventricular angiograms using multiview active appearance models and dynamic programming. IEEE T. Med. Imaging 2006,25(9):1158-1171.

  45. Papandreou G, Maragos P: Adaptive and constrained algorithms for inverse compositional active appearance model fitting. In Conference on Computer Vision and Pattern Recognition (CVPR 2008). Anchorage; 23–28 June 2008.

  46. Edwards GJ, Cootes TF, Taylor CJ: Advances in active appearance models. In IEEE International Conference on Computer Vision (ICCV) vol. 1. Kerkyra; 20–27 September 1999:137-142.

  47. Sung J, Kim D: Adaptive active appearance model with incremental learning. Pattern Recogn. Lett 2009,30(4):359-367. 10.1016/j.patrec.2008.11.006

  48. Zhou M, Liang L, Sun J, Wang Y: AAM based face tracking with temporal matching and face segmentation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR). San Francisco; 13–18 June 2010:701-708.

  49. Felzenszwalb PF, Huttenlocher DP: Distance transforms of sampled functions. Theory Comput 2012,8(19):415-428.

  50. van den Boomgaard R: Mathematical morphology: extensions towards computer vision. PhD thesis. University of Amsterdam, 1992

  51. Borgefors G: Distance transformations in digital images. Comput. Vis., Graph., Image Process 1986,34(3):344-371. 10.1016/S0734-189X(86)80047-0

  52. Stegmann MB: Active appearance models: theory, extensions and cases. Master’s thesis. Technical University of Denmark, DTU, 2000

  53. Hartley R, Zisserman A: Multiple View Geometry in Computer Vision. Cambridge: Cambridge University Press; 2003.

  54. Mittrapiyanuruk P, DeSouza GN, Kak AC: Calculating the 3D-pose of rigid-objects using active appearance models. In International Conference on Robotics and Automation (ICRA 2004). New Orleans; 26 April to 1 May 2004:5147-5152.

  55. Zhang Z: A flexible new technique for camera calibration. IEEE T. Pattern Anal 2000,22(11):1330-1334. 10.1109/34.888718

  56. Bradski G, Kaehler A: Learning OpenCV: Computer Vision with the OpenCV Library. Cambridge: O’Reilly; 2008.



Acknowledgements

The authors would like to thank Alexander Stößel from the Department of Human Evolution at the Max Planck Institute for Evolutionary Anthropology in Leipzig, Germany for providing the quail, jackdaw, and tinamou datasets. Furthermore, we would like to thank John Nyakatura from the Institute of Systematic Zoology and Evolutionary Biology with Phyletic Museum at the Friedrich Schiller University of Jena, Germany for providing the bantam and lapwing datasets as well as one additional quail dataset. This research was supported by grant DE 735/8-1 of the German Research Foundation (DFG).

Author information


Corresponding author

Correspondence to Daniel Haase.

Additional information

Competing interests

Both authors declare that they have no competing interests.

Electronic supplementary material


Additional file 2: Video 2: 3D landmark reconstruction examples. This video shows an example of 3D landmark estimation using augmented AAMs for the scenario of grounded bird locomotion. (AVI 15 MB)

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 2.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

About this article

Cite this article

Haase, D., Denzler, J. 2D and 3D analysis of animal locomotion from biplanar X-ray videos using augmented active appearance models. J Image Video Proc 2013, 45 (2013).
