
Comparison of two 3D tracking paradigms for freely flying insects


In this paper, we discuss and compare state-of-the-art 3D tracking paradigms for flying insects such as Drosophila melanogaster. If two cameras are employed to estimate the trajectories of these identically appearing objects, calculating stereo and temporal correspondences leads to an NP-hard assignment problem. Currently, two types of approaches are discussed in the literature: probabilistic approaches and global correspondence selection approaches. Both have advantages and limitations in terms of accuracy and complexity. Here, we present algorithms for both paradigms. The probabilistic approach utilizes the Kalman filter for temporal tracking. The correspondence selection approach calculates the trajectories based on an overall cost function. Limitations of both approaches are addressed by integrating a third camera to verify the consistency of the stereo pairings and to reduce the complexity of the global selection. Furthermore, a novel greedy optimization scheme is introduced for the correspondence selection approach. We compare both paradigms based on synthetic data for which ground truth is available. Results show that the global selection is more accurate, while the previously proposed tracking-by-matching (probabilistic) approach is causal and feasible for longer tracking periods and very high target densities. We further demonstrate that our extended global selection scheme outperforms current correspondence selection approaches in tracking accuracy and tracking time.

1 Introduction

The investigation of complex movement patterns of various organisms has become an integral subject of biological research. From a biological point of view, motion is the visual response to any kind of perceivable stimulation. The nervous system is responsible for the perception, the integration of the information, and the execution of the final response. One of the most popular model organisms to study how the nervous system controls locomotion is Drosophila melanogaster (i.e., fruit fly). Sophisticated genetic tools as well as advanced imaging techniques allow the functional dissection of neural circuits [1–4].

Drosophila is a holometabolous insect. In the larval stage, locomotion is confined to two dimensions, whereas the adult fly moves in two and three dimensions. Approaches dealing with crawling larvae are common practice; thus, two-dimensional (2D) tracking is well established in behavioral experiments [3, 5–8]. In addition, flies confined to 2D motion are often used in behavioral experiments [2, 9–11]. Basically, there are two ways to prevent the flies from taking off: cutting the wings [12] or using an arena with a flat ceiling [13]. Both manipulations could lead to unnatural behavior [14]. Thus, three-dimensional (3D) tracking approaches are needed to address all kinds of behavioral phenotypes.

1.1 Related work

Work on freely flying fruit flies is still in its infancy, because it requires dynamic 3D correspondence analysis [15]. This analysis involves two challenging tasks: stereo matching (i.e., correspondence between camera views) and temporal tracking (i.e., correspondence over time). Together they form the so-called general multi-index assignment problem [16]. This problem is non-deterministically polynomial-time hard (NP-hard) [16]. If all correspondences are known, triangulation is used to determine the 3D positions.

To avoid expensive multi-camera multi-target 3D tracking, existing approaches typically either track in two dimensions (no stereo matching) [2, 9–11] or track only a single target (no ambiguities over time) [14, 17]. If multi-camera multi-target 3D tracking is required, stereo matching and temporal tracking can be solved separately by accepting a decrease of tracking accuracy [18–20].

Among others, there are two fundamentally different paradigms used to capture 3D trajectories of multiple adult Drosophila. The first paradigm uses the extended Kalman filter and avoids complexity by separating stereo and temporal correspondence associations [21, 22]. Due to this separation, optimal results cannot be guaranteed, and fragmented tracks prevent the preservation of the fly identities over time. The second paradigm performs a global selection by combining both tasks to calculate the overall best assignment [23]. As a result, identity preservation can be achieved for many flies and frames. However, the number of possible combinations increases exponentially with the number of animals and time steps; thus, current solutions are only able to track for a short period.

Another probabilistic approach addresses the trade-off between identity preservation and long-term experiments [24]. The authors use the Hungarian algorithm and Kalman filtering for stereo matching and temporal correspondence association. Focusing on applicability for biologists, up to seven flies were evaluated in several experiments.

All the above-mentioned approaches focus on either tracking a few hundred targets for a short period of time or tracking fewer targets for more frames. High-density tracking is used in different research areas like particle tracking velocimetry [18, 25] and tracking bats [26, 27], bees [28] or fruit flies [20, 23]. A quantitative comparison of several three-dimensional Lagrangian particle tracking approaches for high-density situations is given in [29].

Examples of long-term tracking approaches for fruit flies are given in [21, 22, 24, 30, 31]. In a recent publication, problems like noise and low frame rates are addressed to calculate trajectories of wild mosquitoes [32]. The authors used probabilistic multi-target tracking for swarms of 6 to 25 mosquitoes. If hundreds of flies are tracked for a comparatively long period, trajectories are fragmented and the identity is not preserved. Furthermore, tracking several hundreds of flies simultaneously is not practical for most biological applications [24, 30]. Only when swarming behavior needs to be analyzed are ambiguous animals neglected, which leads to a strongly varying number of targets over time [33].

In a recent publication, multi-path branching was used to handle occlusions by employing global optimization when calculating the trajectories [34]. The algorithm was exhaustively tested for both high-density and long-term situations. Again, the tracking accuracy decreases if the number of targets and the number of frames increase simultaneously.

1.2 Proposed algorithms and comparison scheme

In this paper, we compare identity preserving 3D tracking approaches for long-term experiments considering biological usability. First, we present algorithms for both the above-mentioned paradigms (see Figure 1):

  • The previously proposed tracking-by-matching (TbM) solution [35] integrates a third camera to conduct a projection consistency check within the probabilistic approach.

  • In addition, we introduce a global correspondence selection (GCS) algorithm (extension of [23]), calculating the global search space and minimizing a cost function afterwards.

Figure 1

General comparison scheme of this paper. The new approach is highlighted in yellow.

Limitations of the TbM and the GCS approach are addressed by utilizing a third camera to verify the consistency of stereo pairings. The third camera is integrated by the so-called projection consistency [35]. As a result, the number of ambiguous temporal associations is reduced in the TbM approach. The GCS approach benefits from the projection consistency by means of a reduced overall complexity. Besides utilizing Gibbs sampling for optimization, as suggested by [23], we introduce an alternative greedy selection scheme (see Figure 1). Note that we use the term GCS in the sense of optimizing over a global search space, not in the sense of determining the global optimum of our optimization task.

We compare both paradigms, the TbM and the GCS approach, based on synthetic data; thus, the ground truth is available. Global correspondence selection was done via Gibbs sampling [23] and greedy optimization utilizing projection consistency. This leads to the comparison scheme illustrated in Figure 1.

This paper is organized as follows: In Section 2, we provide notes about notations and central equations. In particular, the projection consistency is described in detail. Algorithms are presented in Section 2.2. Section 2.2.3 describes the extensions of the GCS approach. The synthetic data and measures used for comparison are described in Section 3. All results are listed in Section 4: We compare GCS approaches with the TbM approach in Sections 4.2, 4.3, and 4.4. In addition, we compare both GCS approaches in more detail in Section 4.1. A concluding discussion of both paradigms is given in Section 5.

2 Methods

Both algorithms expect time-synchronized image streams from up to three cameras. Let $I_t^i$ represent these images of cameras $i = 1, 2, 3$ at time $t = 1, \ldots, T$. All cameras need to be calibrated; thus, the camera matrices $K^i$, rotation matrices $R^{ij}$ (from camera $i$ to camera $j$), and translation vectors $t^{ij}$ are given. Then, the fundamental matrices can be calculated by

$$F^{ij} = [K^j t^{ij}]_\times \, K^j R^{ij} (K^i)^{-1} \tag{1}$$

(for more details, see [36]). Consider a swarm of flying targets of similar appearance and small size. The centers of detected targets (i.e., blobs) in a single image $I_t^i$ are denoted by $M_t^i = \{m_{n_i,t}^i\} = \{(x, y)\}$ for $n_i = 1, \ldots, N_t^i$ targets at time $t$, where $(x, y)$ is the image coordinate of the object's centroid. The value $N_t^i$ may differ due to occlusions or noise.
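The fundamental matrix formula above can be illustrated with a minimal NumPy sketch. It assumes the convention that a point $X_i$ in the coordinate frame of camera $i$ maps to $X_j = R^{ij} X_i + t^{ij}$ in camera $j$; the function names `skew` and `fundamental_matrix` are hypothetical and not taken from the paper.

```python
import numpy as np

def skew(v):
    """Return the 3x3 cross-product matrix [v]_x, so that skew(v) @ w == np.cross(v, w)."""
    x, y, z = v
    return np.array([[0.0, -z, y],
                     [z, 0.0, -x],
                     [-y, x, 0.0]])

def fundamental_matrix(K_i, K_j, R_ij, t_ij):
    """F_ij = [K_j t_ij]_x K_j R_ij K_i^{-1} (assumes X_j = R_ij X_i + t_ij)."""
    return skew(K_j @ t_ij) @ K_j @ R_ij @ np.linalg.inv(K_i)
```

For any 3D point seen in both views, the homogeneous image points should then satisfy $m_j^T F^{ij} m_i \approx 0$ up to numerical precision, which is a convenient sanity check for a calibration.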

To calculate the 3D positions of the flies, stereo correspondences between detected blobs need to be established. Since we use three cameras, triplets of image points $(m_{n_1,t}^1, m_{n_2,t}^2, m_{n_3,t}^3)$ correspond to one target. In general, two 2D image coordinates are sufficient to calculate a single 3D position; thus, we define all possible pairs between cameras $i$ and $j$ by $H_t^{ij} = M_t^i \times M_t^j$. A pairing $s_{k,t}^{ij} \in H_t^{ij}$ could represent either a true or a false correspondence for target $k$.

2.1 Stereo matching and projection consistency

Both paradigms perform stereo matching based on epipolar geometry and verify matches using the so-called projection consistency constraint [35].

2.1.1 Stereo matching

Stereo matching is used to identify possible pairings between two respective views and thus results in possible correspondences. For matching a point $m_{n_i,t}^i$ in $I_t^i$ with a point $m_{n_j,t}^j$ in $I_t^j$, both need to be located on the same epipolar line. Given $m_{n_i,t}^i$, the corresponding epipolar line $l_{n_i}^j$ in $I_t^j$ can be calculated by

$$l_{n_i}^j = F^{ij} m_{n_i,t}^i. \tag{2}$$

Detected points from $M_t^j$ lying on this epipolar line are matched to $m_{n_i,t}^i$ and indicate possible pairings $\{s_{k,t}^{ij}\} \subseteq H_t^{ij}$.
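In practice, detections rarely lie exactly on the epipolar line, so a distance tolerance is needed. The following sketch selects the blobs of view $j$ within a pixel tolerance of the epipolar line of a blob in view $i$. The tolerance `eps` and the function name are assumptions for illustration; the paper does not state how "lying on the line" is thresholded at this step.

```python
import numpy as np

def epipolar_candidates(F_ij, m_i, M_j, eps=1.0):
    """Indices of blobs in view j within eps pixels of the epipolar line of m_i."""
    l = F_ij @ np.array([m_i[0], m_i[1], 1.0])        # line (a, b, c): a*x + b*y + c = 0
    pts = np.column_stack([M_j, np.ones(len(M_j))])   # homogeneous blob centers
    d = np.abs(pts @ l) / np.hypot(l[0], l[1])        # point-to-line distances in pixels
    return np.flatnonzero(d < eps)
```

Each returned index, combined with the query blob, corresponds to one candidate pairing in $H_t^{ij}$.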

2.1.2 Projection consistency

Since we use three calibrated cameras, triplets of 2D points $(m_{n_1,t}^1, m_{n_2,t}^2, m_{n_3,t}^3)$ located in images $I_t^1$, $I_t^2$, and $I_t^3$ correspond to the same target in 3D space. Projection consistency is applied to those triplets to verify the overall match. This constraint is satisfied if the respective projections from two 2D points $(m_{n_i,t}^i, m_{n_j,t}^j)$ into the third view $I_t^h$ are sufficiently close to $m_{n_h,t}^h$. To project $(m_{n_i,t}^i, m_{n_j,t}^j)$ into $I_t^h$, we calculate the 3D coordinate $p_t^h$ by triangulating $(m_{n_i,t}^i, m_{n_j,t}^j)$ and use the camera matrices of $I^h$ to obtain the hypothetical 2D position $\tilde{m}_t^h$ in $I_t^h$. Afterwards, we search for the closest point $m_{*,t}^h$ in $M_t^h$ by

$$m_{*,t}^h = \min_{n_h \in \{1, \ldots, N_t^h\}} \operatorname{dist}(\tilde{m}_t^h, m_{n_h,t}^h) + (l_{n_i}^h)^T m_{n_h,t}^h, \tag{3}$$

where the first summand is the Euclidean distance between the hypothetical position $\tilde{m}_t^h$ and the measured positions in $M_t^h$ and the second summand is the distance between a measured point and the epipolar line $l_{n_i}^h$ in view $I^h$ corresponding to view $I^i$. If

$$\operatorname{dist}(\tilde{m}_t^h, m_{*,t}^h) < \tau, \tag{4}$$

then blob $\tilde{m}_t^h$ (and thus the underlying pairing $(m_{n_i,t}^i, m_{n_j,t}^j)$) describes a correct stereo correspondence. $\tau$ is the threshold for the projection consistency and depends on calibration accuracy. The triplet $(m_{n_i,t}^i, m_{n_j,t}^j, m_{n_h,t}^h)$ satisfies the projection consistency if $m_{n_h,t}^h = m_{*,t}^h$ for all possible combinations $i, j, h \in \{1, 2, 3\}$ with $i \neq j \neq h$.
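The projection consistency test can be sketched as follows. This is a simplified illustration, not the authors' implementation: it uses linear (DLT) triangulation, assumes full 3×4 projection matrices `P_i`, `P_j`, `P_h` are available from the calibration, and scores candidates only by the Euclidean term of Equation 3, omitting the epipolar term. All function names are hypothetical.

```python
import numpy as np

def triangulate(P_i, P_j, m_i, m_j):
    """Linear (DLT) triangulation of one 3D point from two views."""
    A = np.vstack([m_i[0] * P_i[2] - P_i[0],
                   m_i[1] * P_i[2] - P_i[1],
                   m_j[0] * P_j[2] - P_j[0],
                   m_j[1] * P_j[2] - P_j[1]])
    X = np.linalg.svd(A)[2][-1]     # right singular vector of the smallest singular value
    return X[:3] / X[3]

def project(P, X):
    """Project a 3D point with a 3x4 camera matrix and dehomogenize."""
    x = P @ np.append(X, 1.0)
    return x[:2] / x[2]

def consistent_blob(P_i, P_j, P_h, m_i, m_j, M_h, tau):
    """Index of the blob in view h that confirms the pairing (m_i, m_j), or None."""
    m_tilde = project(P_h, triangulate(P_i, P_j, m_i, m_j))   # hypothetical 2D position
    d = np.linalg.norm(np.asarray(M_h, dtype=float) - m_tilde, axis=1)
    n = int(np.argmin(d))
    return n if d[n] < tau else None                           # threshold test (Equation 4)
```

A pairing for which `consistent_blob` returns `None` would be rejected as inconsistent; the choice of `tau` depends on the calibration accuracy, as stated above.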

2.2 Presented algorithms

To compare current state-of-the-art tracking paradigms, we introduce the TbM algorithm and a GCS algorithm. We try to overcome the limitations of probabilistic tracking, namely the separation of stereo matching and temporal tracking, by integrating projection consistency into the temporal tracking routine. Exponential complexity, arising in global correspondence selection algorithms, is avoided by reducing the global search space based on the projection consistency. Correspondence selection can be done by Gibbs sampling [37] or in a greedy manner. Since we introduce a novel selection scheme for the GCS approach (yellow box in Figure 1), we describe it in more detail. The TbM is explicitly described in [35].

2.2.1 Tracking-by-matching algorithm

As in all probabilistic tracking approaches, our tracking algorithm models the position and motion information of the targets independently from the stereo correspondences between the views. We use the unscented Kalman filter (UKF) as a Bayesian framework for 2D tracking [8, 38]. Using the notation introduced above, every target $m_{n_i,t}^i$ in every view ($i \in \{1, 2, 3\}$) is represented by its own tracker $T_k^i$ (i.e., a single UKF). Temporal tracking is achieved by assigning one of the measured targets $n_i$ to a specific tracker $k$ over time $t$ (yellow box 'temporal tracking' in Figure 2). Thus, for each new frame triplet, every UKF predicts the next possible 2D position for its target. Then, detections close to the predictions are verified with projection consistency (green box 'Verify new triplet via proj. consist.'; for details, refer to [35]). In this way, the projection consistency constraint is used to integrate stereo matching into temporal tracking. After updating all trackers $T_k^i$, this procedure is repeated as long as there are further frames available (see Figure 2).
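The predict-gate-update cycle of a single 2D tracker can be sketched as below. Note that this sketch swaps in a plain linear Kalman filter with a constant-velocity model instead of the UKF used in the paper; the noise parameters `q` and `r`, the gating radius, and all names are assumptions for illustration only.

```python
import numpy as np

class ConstantVelocityTracker:
    """Linear Kalman filter over the state (x, y, vx, vy); a simplified stand-in for one UKF."""

    def __init__(self, xy, q=1.0, r=1.0):
        self.x = np.array([xy[0], xy[1], 0.0, 0.0])
        self.P = np.eye(4) * 10.0
        self.F = np.eye(4)
        self.F[0, 2] = self.F[1, 3] = 1.0      # position += velocity per frame
        self.H = np.eye(2, 4)                  # only the position is measured
        self.Q, self.R = np.eye(4) * q, np.eye(2) * r

    def predict(self):
        """Advance the state by one frame and return the predicted 2D position."""
        self.x = self.F @ self.x
        self.P = self.F @ self.P @ self.F.T + self.Q
        return self.x[:2]

    def update(self, z):
        """Correct the state with the detection z assigned to this tracker."""
        S = self.H @ self.P @ self.H.T + self.R
        K = self.P @ self.H.T @ np.linalg.inv(S)
        self.x = self.x + K @ (np.asarray(z, dtype=float) - self.H @ self.x)
        self.P = (np.eye(4) - K @ self.H) @ self.P

def gate(prediction, detections, radius):
    """Indices of detections within `radius` pixels of the predicted position."""
    d = np.linalg.norm(np.asarray(detections, dtype=float) - prediction, axis=1)
    return np.flatnonzero(d < radius)
```

In the TbM scheme, the detections returned by `gate` in each of the three views would then be the candidates passed to the projection consistency check before the trackers are updated.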

Figure 2

Flow charts of the two 3D tracking paradigms. Temporal tracking is marked in yellow, stereo matching is marked in green, and projection consistency is highlighted by dots.

2.2.2 Global correspondence selection

Before going into formal details, the following section introduces the GCS algorithm in a top-down manner.

General workflow of the GCS approach. In contrast to the TbM approach, temporal tracking and stereo matching are not done within the main loop (compare to Figure 2). Instead, a global search space named $\mathcal{S}$ is constructed over all the accessible frames before the actual tracking (compare to box 'Construction of $\mathcal{S}$'). Afterwards, the best possible assignments are calculated by minimizing an overall cost function operating on $\mathcal{S}$.

To reduce the size of this search space, epipolar and temporal assumptions are made before considering actual correspondences. As illustrated in Figure 2, possible stereo correspondences between cameras 1 and 2 are calculated for every time step. Only blob pairings close to their respective epipolar lines are considered as possible stereo matches (box 'Calculate stereo correspondences'). If projection consistency is used (dashed box in Figure 2), invalid matches are removed or replaced via Equation 4. Note that the third camera is only used to replace incorrect matches from cameras 1 and 2. In other words, projection consistency is used to further reduce the set of possible pairings arising from $I^1$ and $I^2$. The resulting set of matches can be interpreted as a set of possible 3D positions. Given two sets of 3D positions for consecutive time steps, possible temporal assignments can be calculated (compare to box 'Calculate temporal correspondences'). If a target has no successor within a 3D neighborhood (given by the maximal flight speed), it is removed from $\mathcal{S}$.

This reduction is done for all available frames and time steps $T$. After constructing the search space, several assignments are unique. Ambiguous pairings and ambiguous temporal correspondences form natural clusters in $\mathcal{S}$. Thus, only samples inside these clusters need to be optimized (see box 'Get ambiguous clusters from $\mathcal{S}$' in Figure 2). The subsequent optimization is done by a cost function introduced below which incorporates stereo and temporal matching (Equation 6). We implemented two optimization strategies to find possible samples in the respective clusters (see box 'Greedy / Gibbs cluster optimization').

Formal definition of the search space. Let $\mathcal{P}(H_t^{ij})$ be the power set of all pairings $s_{k,t}^{ij} \in H_t^{ij}$. To avoid the additional complexity arising from pairwise pairings between three views, we use cameras 1 and 2 for stereo matching. Thus, the initial search space is constructed for $H_t^{12}$. Blobs from $I_t^3$ are only considered if projection consistency is used (see Section 2.2.3).

The subset containing $N_t$ pairings from the power set $\mathcal{P}(H_t^{12})$ is given by the set of $N_t$-permutations $\mathcal{P}_{N_t}(H_t^{12}) \subseteq \mathcal{P}(H_t^{12})$ (i.e., $\mathcal{P}_{N_t}(H_t^{12})$ contains all possible combinations of pairings with $N_t$ elements). A single set $S_t^{12} = (s_{1,t}^{12}, s_{2,t}^{12}, \ldots, s_{N_t,t}^{12}) \in \mathcal{P}_{N_t}(H_t^{12})$ is called a configuration and contains $N_t$ stereo correspondences for time step $t$. If camera indices are not necessary, we use $(s_{1,t}, s_{2,t}, \ldots, s_{N_t,t}) = S_t = S_t^{ij}$ and $H_t = H_t^{ij}$ for $i, j \in \{1, 2, 3\}$, $i \neq j$.

Let $\mathcal{S} = (\mathcal{P}(H_1), \mathcal{P}(H_2), \ldots, \mathcal{P}(H_T))$ be the set of all configurations over all time steps, or $\mathcal{S}_P = (\mathcal{P}_{N_1}(H_1), \mathcal{P}_{N_2}(H_2), \ldots, \mathcal{P}_{N_T}(H_T))$ if the number of targets per time step is known. A sequence of configurations between two time steps $t - 1$ and $t$ is denoted by $S_{t-1:t}$ and contains temporal correspondences between consecutive frames. Thus, an overall solution, containing all tracks for all flies and $T$ time steps, is given by a sequence $S_{1:T} \in \mathcal{S}$. The entire 3D trajectory of target $k$ is then given by $s_{k,1:T} = (s_{k,1}, s_{k,2}, \ldots, s_{k,T})$.

Cost function. Stereo matching and temporal tracking are incorporated into a single optimization task, solving the optimization problem

$$S_{1:T} = \arg\min_{S_{1:T} \in \mathcal{S}} f(S_{1:T}). \tag{5}$$

The cost function $f(\cdot)$ incorporates an epipolar constraint $f_E(\cdot)$ for stereo matching, kinetic coherence $f_K(\cdot)$ for temporal tracking, and a so-called conservation-observation match $f_C(\cdot)$ to punish multiple assignments. Thus, $f(\cdot)$ can be written as a sum of all the abovementioned constraints

$$f(S_{1:T}) = \alpha \sum_{t=1}^{T} f_E(S_t) + \beta \sum_{t=1}^{T} f_C(S_t, H_t) + \gamma \sum_{t=1}^{T} f_K(S_{t-1}, S_t) \tag{6}$$

with weights $\alpha$, $\beta$, and $\gamma$ (compare to [23]).

Cost function summands. Epipolar costs are defined as

$$f_E(S_t) = \sum_{k=1}^{N_t} \rho_e(s_{k,t}), \tag{7}$$

where $\rho_e(s_{k,t})$ sums the distances between the blobs $m_{k,t}^i$, $m_{k,t}^j$ from $s_{k,t}$ and their epipolar lines (compare Section 2.1.1). To avoid improbable stereo matchings, values of $f_E(\cdot)$ larger than a threshold $\varepsilon_E$ are set to $\infty$:

$$f_E(S_t) = \begin{cases} \sum_{k=1}^{N_t} \rho_e(s_{k,t}) & \forall k: \rho_e(s_{k,t}) < \varepsilon_E \\ \infty & \text{otherwise.} \end{cases}$$

The kinetic coherence

$$f_K(S_{t-1}, S_t) = \sum_{k=1}^{N_t} \rho_k(s_{k,t-1}, s_{k,t}) = \sum_{k=1}^{N_t} \operatorname{dist}(p_{k,t-1}, p_{k,t}) \tag{8}$$

calculates the Euclidean distances $\operatorname{dist}(\cdot)$ between 3D positions $p_{k,t-1}$ and $p_{k,t}$ (defined by $s_{k,t-1}$ and $s_{k,t}$). $\rho_k(\cdot)$ expects two consecutive pairings for 3D coordinate calculation. Improbable temporal connections are set to $\infty$:

$$f_K(S_{t-1}, S_t) = \begin{cases} \sum_{k=1}^{N_t} \rho_k(s_{k,t-1}, s_{k,t}) & \forall k: \rho_k(s_{k,t-1}, s_{k,t}) < \varepsilon_K \\ \infty & \text{otherwise.} \end{cases}$$

Finally, the conservation-observation match is defined as

$$f_C(S_t, H_t) = \frac{1}{N_t^1} \sum_{k=1}^{N_t^1} \left| n_c(m_{k,t}^1, S_t) - 1 \right| + \frac{1}{N_t^2} \sum_{k=1}^{N_t^2} \left| n_c(m_{k,t}^2, S_t) - 1 \right|, \tag{9}$$

where $n_c(m_{k,t}^i, S_t)$ adds up the contributions of a blob $m_{k,t}^i$ in configuration $S_t$. If the number of correspondences exceeds a threshold $\varepsilon_C$, configuration costs are set to $\infty$:

$$f_C(S_t, H_t) = \begin{cases} \rho_c(S_t, M_t^1) + \rho_c(S_t, M_t^2) & \forall k: n_c(m_{k,t}, S_t) < \varepsilon_C \\ \infty & \text{otherwise,} \end{cases}$$

where $\rho_c(S_t, M_t^i) = \frac{1}{N_t^i} \sum_{k=1}^{N_t^i} \left| n_c(m_{k,t}^i, S_t) - 1 \right|$.
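The three summands and their thresholds can be written down compactly. The sketch below assumes a hypothetical data layout that is not prescribed by the paper: `rho_e[k]` holds the precomputed epipolar distance of pairing $k$, `pairs[k]` holds the blob indices of pairing $k$ in views 1 and 2, and `p_prev[k]`, `p_curr[k]` hold the triangulated 3D positions of target $k$ at $t-1$ and $t$. The default weights and thresholds are placeholders.

```python
import math
from collections import Counter

INF = math.inf

def f_E(rho_e, eps_E):
    """Epipolar cost of one configuration; infinite if any pairing exceeds eps_E."""
    return sum(rho_e) if all(r < eps_E for r in rho_e) else INF

def f_K(p_prev, p_curr, eps_K):
    """Kinetic coherence: summed 3D displacement of each target between t-1 and t."""
    d = [math.dist(a, b) for a, b in zip(p_prev, p_curr)]
    return sum(d) if all(x < eps_K for x in d) else INF

def f_C(pairs, n_blobs_1, n_blobs_2, eps_C):
    """Conservation-observation match; pairs[k] = (blob index in view 1, in view 2)."""
    cost = 0.0
    for view, n_blobs in ((0, n_blobs_1), (1, n_blobs_2)):
        used = Counter(p[view] for p in pairs)       # n_c: how often each blob is used
        if any(c >= eps_C for c in used.values()):
            return INF
        cost += sum(abs(used[b] - 1) for b in range(n_blobs)) / n_blobs
    return cost

def delta_f(rho_e, pairs, n1, n2, p_prev, p_curr,
            weights=(1.0, 1.0, 1.0), eps=(2.0, 0.01, 3)):
    """Per-frame cost alpha*f_E + beta*f_C + gamma*f_K (weights must be non-zero)."""
    a, b, g = weights
    eps_E, eps_K, eps_C = eps
    return (a * f_E(rho_e, eps_E) + b * f_C(pairs, n1, n2, eps_C)
            + g * f_K(p_prev, p_curr, eps_K))
```

A configuration that uses every blob exactly once has $f_C = 0$; a blob that is used twice, or not at all, adds a penalty proportional to the deviation from one.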

Recursive decomposition. Equation 6 can be rewritten in a recursive manner as follows:

$$f(S_{1:t}) = f(S_{1:t-1}) + \Delta f(S_t) \tag{10}$$

$$\Delta f(S_t) = \alpha f_E(S_t) + \beta f_C(S_t, H_t) + \gamma f_K(S_{t-1}, S_t). \tag{11}$$

Thus, the whole optimization can be done by dynamic programming (for more details, see [23]).
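To illustrate how the recursive decomposition enables dynamic programming, the following sketch runs a Viterbi-style recursion over a list of admissible configurations per time step. It is a generic illustration under assumed interfaces, not the procedure of [23]: `unary(c)` stands for the terms depending on one configuration only ($\alpha f_E + \beta f_C$), and `pairwise(prev, c)` stands for the temporal term ($\gamma f_K$).

```python
def best_sequence(candidates, unary, pairwise):
    """Minimize sum_t unary(S_t) + pairwise(S_{t-1}, S_t) by dynamic programming.

    candidates[t] lists the admissible configurations at time step t.
    Returns (total cost, list with the chosen configuration per time step).
    """
    cost = [unary(c) for c in candidates[0]]
    back = []
    for t in range(1, len(candidates)):
        new_cost, ptr = [], []
        for c in candidates[t]:
            # f(S_{1:t}) = f(S_{1:t-1}) + delta_f(S_t): keep the cheapest predecessor
            j = min(range(len(cost)),
                    key=lambda i: cost[i] + pairwise(candidates[t - 1][i], c))
            new_cost.append(cost[j] + pairwise(candidates[t - 1][j], c) + unary(c))
            ptr.append(j)
        cost = new_cost
        back.append(ptr)
    # Backtrack from the cheapest final configuration
    j = min(range(len(cost)), key=cost.__getitem__)
    total, path = cost[j], [j]
    for ptr in reversed(back):
        j = ptr[j]
        path.append(j)
    path.reverse()
    return total, [candidates[t][i] for t, i in enumerate(path)]
```

The number of cost evaluations grows with the product of the candidate counts of consecutive time steps rather than with the number of complete sequences, which is the benefit of the decomposition.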

Reduction of $\mathcal{S}$. In [23], Gibbs sampling [37] is suggested to find the best possible sequence of configurations $S_{1:T} \in \mathcal{S}$. Since $\mathcal{S}$ is a set of $T$ power sets $\mathcal{P}(H_t)$, several steps are suggested to reduce the search space. First of all, sampling for solutions with $N_t$ targets for time $t$ leads to a reduced set $\mathcal{P}_{N_t}(H_t)$. Thus, we redefine the overall search space for $S_{1:T}$ by $\mathcal{S}_P = (\mathcal{P}_{N_1}(H_1), \mathcal{P}_{N_2}(H_2), \ldots, \mathcal{P}_{N_T}(H_T))$.

The set $H_t$ is reduced by rejecting pairings which do not satisfy Equation 7. In the remaining subset $H'_t \subseteq H_t$, only blob pairings $s_{k,t}$ close to the respective epipolar lines are considered. Due to the recursive decomposition given in Equation 10, the successor to $S_{t-1}$ can be selected from the $N_t$-permutation $\mathcal{P}_{N_t}(H_t)$. Since kinetic costs are limited (see Equation 8), improbable temporal correspondences can be rejected from $\mathcal{P}_{N_t}(H_t)$. Figure 3 illustrates the reduction of the cardinality for an $N_1$-permutation.

Figure 3

Example for the cardinality reduction of $\mathcal{S}$. For a given time $t$, only one target needs to be found; thus, $\mathcal{P}(H_t)$ reduces to $\mathcal{P}_{N_1}(H_t)$. The resultant search space is illustrated in column $\mathcal{P}_{N_1}(H_t)$, containing only combinations of two blobs. The red blob detections do not satisfy the epipolar constraint (illustrated via red epipolar lines). In addition, the kinetic costs exceed the limit (indicated by the search sphere around the red blob). Thus, the cardinality of $\mathcal{P}_{N_1}(H_t)$ further decreases to $\mathcal{P}_{N_1}(H'_t)$, containing a single unique pairing.

After rejecting both impossible stereo matchings and temporal correspondences, some sequences of configurations $S_{t-1:t} \in (\mathcal{P}_{N_{t-1}}(H_{t-1}), \mathcal{P}_{N_t}(H_t))$ are unique. The remaining ambiguities form natural clusters $C_{(t-\delta:t),\nu} \subseteq \mathcal{S}_P$ for $\delta + 1$ frames and $\nu$ flies. Zou et al. [23] extend ambiguous clusters by adjacent pairings. However, these pairings can again be involved in an ambiguous cluster. Since we tried to keep the identity over time, we merged the clusters in these situations until there were no ambiguous situations left before and after each cluster. In this way, the resultant clusters include overall ambiguous situations, and the domain of Equation 5 is global.
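One way to picture the merging step is as a grouping of overlapping clusters. The sketch below assumes, purely for illustration, that each cluster is represented as a set of (frame, target) nodes and that two clusters are merged whenever they share a node; the paper does not specify its data structure or the exact merge criterion at this level of detail.

```python
def merge_clusters(clusters):
    """Merge clusters (sets of (frame, target) nodes) until no two share a node."""
    merged = []                              # invariant: pairwise disjoint sets
    for c in map(set, clusters):
        overlapping = [m for m in merged if m & c]
        for m in overlapping:
            c |= m                           # absorb every cluster that touches c
            merged.remove(m)
        merged.append(c)
    return merged
```

Because the already merged clusters are kept pairwise disjoint, a single pass over the input is sufficient, including for chains of overlaps.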

Since the cluster size increases exponentially with the number of targets N and time steps T, Gibbs sampling also requires thousands of sampling steps to guarantee good results. Indeed, the authors of [23] were only able to track for less than 1 s of recording.

2.2.3 Introduced improvements for the GCS approach

Here, we introduce two extensions to improve the performance of the GCS approach:

  • Utilizing projection consistency to reject ambiguous pairings $s_{k,t}$ and thus reducing the sizes of the clusters

  • Performing optimization in a greedy manner by selecting the best successor directly based on Equation 11

GCS with projection consistency. Similar to the above introduced probabilistic tracking approach, ambiguities and wrong stereo matches increase the size of the search space $H_t^{12}$. Thus, all pairings $s_{k,t}^{12} \in H_t^{12}$ are projected into the view of the third camera $I^3$. Only pairings satisfying Equation 4 remain in $H_t^{12}$; inconsistent pairings are rejected.

An ambiguity is found if two pairings $s_{i,t}^{12}, s_{j,t}^{12}$ in the remaining $H_t^{12}$ share a single blob $m_{k,t}$. The blob $m_{*,t}^3$ satisfying Equation 4 is then used to generate a new pairing $s_{i,t}$, containing $m_{*,t}^3$ from $I_t^3$ and the unambiguous blob from $I_t^1$ or $I_t^2$. Afterwards, ambiguous pairings in $H_t^{12}$ are replaced by unique pairings $s_{i,t}$.

Let $H_t$ be the reduced subset, containing the new unique pairing $s_{i,t}$ (compare '$H$ with PC' in Figure 4). The overall search space for Equation 5 is then given by $\mathcal{S}_P = (\mathcal{P}_{N_1}(H_1), \mathcal{P}_{N_2}(H_2), \ldots, \mathcal{P}_{N_T}(H_T))$.

Figure 4

Example for pairings with and without projection consistency (PC). One target is occluded in $I^2$, and all possible combinations are generated between $I^1$ and $I^2$. Pairings that do not satisfy the PC constraint are removed, and ambiguous pairings are corrected using projection consistency.

The optimization of clusters based on Equation 5 via Gibbs sampling is described in [23]. The greedy optimization strategy is described below.

Greedy optimization. Given a cluster with ambiguities $C_{(t-\delta:t),\nu} \subseteq \mathcal{S}_P$, a sequence of configurations $S_{t-\delta:t} \in C_{(t-\delta:t),\nu}$ must be determined. Let the sequence of configurations be $S_{t-\delta:t} = (S_0, S_1, \ldots, S_\delta)$, which must be optimized for $\nu$ flies. For a given $S_i = (s_{1,i}, s_{2,i}, \ldots, s_{\nu,i})$, a successor to every pairing $s_{k,i}$ is selected based on Equation 11 by choosing the pairing $s_{*,i+1}$ with minimal costs (starting with $i = 0$ and $k = 1$). If, for example, $s_{1,0}$ is already assigned to $s_{1,1}$, a successor to $s_{2,0}$ is selected by

$$s_{2,1} = \arg\min_{s_{2,1} \in S_1} \; \alpha \sum_{i=1}^{N_1} f_E(S_1) + \beta \sum_{i=1}^{N_1} f_C(S_1, H_1) + \gamma \sum_{i=1}^{N_1} f_K(S_0, S_1).$$

This is successively done for all pairings and all configurations until every pairing in every configuration has a successor.
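The greedy successor selection can be sketched as follows. For brevity, this illustration scores successors only by the kinetic term (3D displacement) instead of the full per-frame cost of Equation 11, and it assumes that every configuration of the cluster offers at least as many candidates as there are tracks. The layout `positions[i][n]` (candidate 3D point $n$ in configuration $S_i$) and the function name are hypothetical.

```python
import math

def greedy_tracks(positions, eps_K=math.inf):
    """Link candidates across configurations by repeatedly taking the cheapest free successor.

    positions[i] is the list of candidate 3D points in configuration S_i.
    Returns one index list per track: track[i] is the chosen index in positions[i].
    """
    tracks = [[k] for k in range(len(positions[0]))]
    for i in range(len(positions) - 1):
        free = set(range(len(positions[i + 1])))
        for track in tracks:                     # k = 1, 2, ... in fixed order
            prev = positions[i][track[-1]]
            best = min(free, key=lambda n: math.dist(prev, positions[i + 1][n]),
                       default=None)
            if best is None or math.dist(prev, positions[i + 1][best]) >= eps_K:
                raise ValueError("no admissible successor left")
            track.append(best)
            free.remove(best)                    # assigned successors are excluded
    return tracks
```

Each track receives exactly one successor per configuration, so the number of cost evaluations stays polynomial in the cluster size; the price is that an early wrong choice is never revised.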

2.2.4 Complexity of the algorithms

The complexity of the GCS search space and thus the memory storage is $O(k^{NT})$ in theory ($N$ is the number of targets, $T$ is the number of time steps, and $k \le N$ denotes ambiguities after cardinality reduction), since there are $k^N$ possible configurations between two views and each of these configurations at $t$ can be combined with all configurations at $t + 1$. Optimization is only necessary for ambiguous clusters $C_{(t-\delta:t),\nu}$; therefore, $N = \nu$ specifies the number of flies in this cluster and $T = \delta + 1$ specifies the length of the cluster. Thus, the global optimum must be calculated based on $k^{NT}$ possible cluster configurations. Given the recursive decomposition of Equation 10, the complexity of the cost calculations decreases to $O(T k^N)$. The theoretical computational complexity is $O(NTL)$ for Gibbs sampling (where $L$ is the number of sampling steps) and $O(NT)$ for greedy optimization.
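A small numeric example makes the gap between these growth rates concrete. The helper names below are hypothetical, and the values are plain evaluations of the stated expressions, ignoring constant factors.

```python
def search_space_size(k, N, T):
    """Number of configuration sequences when each of T frames offers k**N configurations."""
    return k ** (N * T)

def decomposed_cost_evaluations(k, N, T):
    """Cost evaluations after the recursive decomposition: T * k**N."""
    return T * k ** N

# Example: k = 2 ambiguities, N = 3 flies, T = 4 frames
print(search_space_size(2, 3, 4))            # 4096
print(decomposed_cost_evaluations(2, 3, 4))  # 32
```

Even for this tiny cluster the full sequence space is two orders of magnitude larger than the decomposed cost table, and the gap widens exponentially with the cluster length.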

For the TbM algorithm, no extra space for tracking is needed. Thus, the memory storage complexity is $O(NT)$. The time complexity is $O((NT)^2)$. For the first three frames, an exhaustive search for triplets requires matching all detections in two views.

3 Experiments

3.1 Synthetic data

Both tracking paradigms are evaluated using synthetic data, generated by the swarm simulator introduced in [35]. The simulator generates all necessary data for tracking (i.e., rendered images and camera matrices) and evaluation (i.e., ground truth of the 2D and 3D trajectories). For our tests, three synchronized and calibrated cameras are placed around a 20 × 20 × 20 cm³ chamber. All movies are recorded with 800 × 800 pixel resolution and 150 fps. Since the beam width of the field of view is 45°, all cameras are placed 80 cm away from the cube's center. Rotations around the y-axes, for cameras 1, 2, and 3, are 0°, -120°, and 120°, respectively.

According to [39], the maximum flight speed is set to 0.8 m/s. The crawling speed is reduced by the factor 0.1, and we use a Gaussian random walk for flight movement calculation [35]. To achieve more realistic conditions and to increase the probability of occlusions and nearby targets, we integrated negative geotaxis within our random walk model. Negative geotaxis describes the tendency of Drosophila to orient themselves against the earth's gravity [40]. We integrated negative geotaxis by manipulating the randomly generated velocity in the y direction, $v_t = \Theta v_{t-1} + n_t$ (with Gaussian noise $n_t \sim N(0, \sigma^2)$ and smoothness $\Theta \in [0, 1]$). With a probability of 0.002%, the y entry of $n_t$ is forced to be zero or positive over time.
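The velocity model can be sketched in a few lines. This is one possible reading of the description, not the simulator of [35]: it interprets "over time" as meaning that, once triggered, the upward bias stays active for the rest of the sequence, and it omits the speed limit and the crawling state. The parameter values and the function name are placeholders.

```python
import numpy as np

def simulate_velocity(steps, theta=0.9, sigma=0.01, p_geotaxis=0.00002, rng=None):
    """Smoothed random walk v_t = theta * v_{t-1} + n_t with n_t ~ N(0, sigma^2) per axis."""
    rng = np.random.default_rng() if rng is None else rng
    v = np.zeros((steps, 3))
    upward = False
    for t in range(1, steps):
        n = rng.normal(0.0, sigma, size=3)
        upward = upward or rng.random() < p_geotaxis   # assumption: once triggered, stays active
        if upward:
            n[1] = abs(n[1])                           # y entry of the noise forced to be >= 0
        v[t] = theta * v[t - 1] + n
    return v
```

Integrating `v` over time (for example with `np.cumsum(v, axis=0) / fps`) would give a trajectory that drifts toward the chamber ceiling once the bias is active, which increases the chance of occlusions in later frames.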

In this way, we generated several test movies with an increasing number of targets. For most real-world locomotion experiments, 50 flies per run are sufficient; thus, we generated movies with 10 to 50 targets and 1,000 frames (approximately 6 s; Sections 4.2 and 4.1). In addition, we made a long-term movie with 50 flies over 3,000 frames (Section 4.3) and high-density movies with a few hundred flies and time steps (Section 4.4).

To guarantee identical raw data for both algorithms, the 2D positions of all views are established by a separate blob detection routine. Resultant measurements contain time steps with several occluded flies in all views (leading to changes in $N_t^i$; compare to Table 1). We also added Gaussian noise ($\sigma^2 = 0.001$ in the intensity domain [0, 1]) to the ground truth videos to simulate blob detections under realistic conditions. Figure 5 shows an example triplet of noisy images of 200 flies.

Table 1 Overview of the results
Figure 5

Example of a noisy image triplet generated by the swarm simulator.

3.2 Evaluation and comparison measure

Both paradigms are compared in terms of tracking accuracy using the correspondence and association error ($E_{ca}$) [35]. The $E_{ca}$ is defined as follows:

$$E_{ca} = \frac{N_c + N_a}{T}, \tag{12}$$

where $N_c$ is the number of incorrect stereo matches, $N_a$ is the number of false temporal associations, and $T$ is the number of frames. To calculate $N_a$ and $N_c$, all computed 3D trajectories are assigned to their respective ground truth paths. This assignment is used to calculate Euclidean distances between calculated positions and ground truth positions. If the distance is not within a tolerance, $N_c$ is incremented for each frame and time step. The temporal association value is incremented if the ID of the calculated 3D paths changes between consecutive frames.
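The error measure can be computed as sketched below. The data layout is a hypothetical one chosen for illustration: it assumes the assignment of computed tracks to ground truth identities has already been made and is available per frame in `gt_ids`.

```python
import math

def e_ca(positions, gt_ids, gt_positions, tol):
    """E_ca = (N_c + N_a) / T for a set of computed tracks.

    positions[k][t]    : computed 3D position of track k at frame t
    gt_ids[k][t]       : ground truth identity assigned to track k at frame t
    gt_positions[g][t] : true 3D position of target g at frame t
    """
    T = len(positions[0])
    n_c = n_a = 0
    for pos, ids in zip(positions, gt_ids):
        for t in range(T):
            if math.dist(pos[t], gt_positions[ids[t]][t]) > tol:
                n_c += 1                     # position not within the tolerance
            if t > 0 and ids[t] != ids[t - 1]:
                n_a += 1                     # identity changes between consecutive frames
    return (n_c + n_a) / T
```

With this definition, a value of 0 means that every track stays within the tolerance of a single ground truth path for the whole sequence.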

4 Results

We tested all combinations illustrated in Figure 1 as follows:

  • Tracking-by-matching method (named TbM)

  • GCS optimized via Gibbs sampling analogous to [23] (named Gibbs)

  • GCS with projection consistency (PC) optimized via Greedy (named Greedy PC)

General tracking results for 50 flies and over 1,000 time steps are given in Figure 6.

Figure 6

Tracking results of 50 flies over 1,000 time steps and ground truth. The GCS image was made with projection consistency and greedy selection.

Table 1 summarizes results for all approaches. The resultant $E_{ca}$ value is additionally plotted in Figure 7a.

Figure 7

Resultant $E_{ca}$, $N_c$, and $N_a$ values. (a) Resultant $E_{ca}$ value plotted for all cases from Table 1. (b) Resultant $N_c$ and $N_a$ values plotted for the TbM and the Greedy PC tests.

4.1 Gibbs sampling vs. greedy optimization

The first observation is related to the number of occlusions and maximal cluster sizes. In general, the complexity of the global search space increases with the number of targets and frames [16]. If there are only a few ambiguities (e.g., occlusions, nearby 3D paths), most of the correspondences are unique and latter optimization is only necessary for a few small clusters (compare to max|C| in Table 1).

This can be observed in all movies besides the movie with 20 flies: the maximum cluster contains 19,024 pairings and 2,140 pairings without and with projection consistency, respectively (Table 1). Thus, the tracking time increases for both GCS approaches. However, the greedy selection is still able to calculate sufficient tracks, whereas Gibbs sampling results in less reliable results. The reason is that Gibbs needs to sample one sequences of configurations in a cluster containing almost 20,000 pairings covering 18 targets for 843 frames. Given one wrong correspondence selection prevents Gibbs from converging in the global optimum. Since we sampled for 10,000 iterations, this was not possible in reasonable time.

This relationship is also observable in the long-term and high-density experiments. In contrast to Zou et al. [23], we merge overlapping clusters for both joint time steps and joint targets (see Section 2.2.3) to guarantee a global search space. Thus, given very dense situations with hundreds of flies, the natural segmentation of the clusters is no longer available. In all measurements, the cluster size of the Gibbs approach was equivalent to the overall search space so that $|C_{(t-\delta:t),\nu}| = |C_{T,N}| = |S_P|$. The subsequent optimization must therefore statistically sample one sequence of configurations out of $k^{\nu(\delta+1)}$ ($k \le \nu$ and $\nu \to N$, $\delta \to T$) possible sequences (compare to Section 2.2.4). This is why Gibbs sampling requires millions of sampling steps to calculate stable results [37], which was neither shown in [23] nor possible for our data with thousands of frames in reasonable time. Since algorithms requiring more than 4 h for only a few seconds of movie length are not suitable for biological applications, we omit these tracking results in Table 1 (indicated by n/a). Thus, Gibbs sampling for high-density or long-term situations is mainly of interest from a theoretical point of view [24].
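The growth of the search space can be illustrated numerically. The sketch below reads the number of possible sequences as k raised to the power ν(δ+1), which is an assumption about the flattened expression in the text, and works with base-10 logarithms to avoid huge integers. The value k = 2 is purely hypothetical; the 18 targets and 843 frames are the cluster dimensions reported in Section 4.1.

```python
import math


def log10_num_sequences(k, nu, delta):
    """Return log10 of k**(nu * (delta + 1)).

    k: candidate pairings per target and frame (hypothetical value in the examples)
    nu: number of targets in the cluster
    delta: the cluster spans delta + 1 frames
    """
    return nu * (delta + 1) * math.log10(k)


# Small cluster: 3 targets over 10 frames with 2 candidates each, i.e. 2**30 sequences.
small = log10_num_sequences(k=2, nu=3, delta=9)

# Large cluster from Section 4.1: 18 targets over 843 frames (k = 2 is assumed).
large = log10_num_sequences(k=2, nu=18, delta=842)

sampling_steps = math.log10(10_000)  # the 10,000 Gibbs iterations, on the same scale

print(f"small cluster: about 10^{small:.1f} sequences")
print(f"large cluster: about 10^{large:.1f} sequences")
print(f"sampling budget: 10^{sampling_steps:.0f} steps")
```

Even under this optimistic choice of k, the large cluster admits a number of sequences with several thousand decimal digits, which is consistent with the observation that 10,000 sampling steps are far from sufficient.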

4.2 TbM vs. GCS

The Greedy PC approach clearly has the best overall performance in the general experiments. The TbM approach lies between the Greedy and Gibbs solutions. Optimization of GCS without PC and via Gibbs leads to the worst results with irregular $E_{ca}$ values.

If the number of flies increases, $E_{ca}$ increases for both TbM and Greedy tracking (compare to Figure 7a). Since both measurements increase proportionally to the number of wrong correspondences between views $N_c$ and wrong associations over time $N_a$, these values are examined in Figure 7b.

As is apparent, an increasing $N_c$ value leads to high error measurements for both the TbM and the Greedy PC approach. The main cause of wrong or missing stereo correspondences is occlusion. The more flies are located in the chamber, the more occlusions arise during blob detection (compare to Table 1). Especially in later frames, occlusions arise very frequently because of the negative geotaxis (compare to Section 3.1). During the movie with 50 flies, for example, up to 4 flies are occluded for several frames in camera 1. Thus, even in situations with up to 50 flies, the target density is comparatively high. Since TbM and Greedy PC try to overcome these events using the projection consistency, both have much lower $N_c$ than the two-camera tracking solutions. Gibbs has up to 1,677 wrong stereo correspondences (data not shown). Therefore, it is not able to calculate the global optimum even after 10,000 sampling steps.

In other words, the overall search space $S_P$, containing more than 1,000 occlusions, cannot be sampled sufficiently because of the growth of the clusters (Table 1). However, Greedy PC benefits from the previously calculated overall search space $S_P$: since all possible pairings and sequences of configurations are used for cost calculations, ambiguities caused by occlusions can be corrected more frequently.

4.3 Long-term tracking

In the long-term experiment, 50 flies were tracked for 3,000 frames. Gibbs failed in this experiment because the size of the clusters increases drastically for 3,000 frames. Thus, only TbM and Greedy PC were able to track during this experiment.

In long-term movies, the TbM approach can achieve better results than the Greedy PC algorithm (Table 1). The reason for this inversion compared to the 1,000 frame results is that the size of the clusters |C| increases too much for 3,000 frames. Thus, the probability of ending up in a local optimum via greedy selection increases accordingly. However, tracking accuracy is still acceptable in the Greedy PC approach. On the other hand, TbM, as a causal method, is not affected by the length of tracking sequences.

In contrast to the TbM approach, GCS can miss targets during optimization (see Table 1). However, projection consistency reduces the number of missed targets. Furthermore, GCS optimization leads to less fragmented trajectories than the TbM approach. Whereas TbM results in 137 trajectories ($N_a$ = 99) for 50 flies (over 3,000 frames), the Greedy PC approach calculates 49 complete tracks of 50 tracks in total ($N_a$ = 7). If complete trajectories are required (i.e., the identity of the flies must be preserved over time), Greedy PC is recommended, but with the possibility of losing flies.
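Fragmentation and missed targets can be quantified per fly when ground truth is available. The following sketch is a simplified illustration and not the evaluation procedure used for Table 1: it assumes a hypothetical input in which every ground-truth fly has one tracker ID per frame (or `None` if no track covers it) and counts how many distinct track fragments cover that fly. The resulting counts are related to, but not identical with, the number of trajectories and $N_a$ discussed above.

```python
def count_fragments(ids_per_frame):
    """Count track fragments and identity switches for one ground-truth fly.

    ids_per_frame: tracker ID assigned to the fly in each frame, None if missed
    (hypothetical input format). Gaps with an unchanged ID do not start a new fragment.
    """
    fragments = 0
    switches = 0
    previous = None
    for track_id in ids_per_frame:
        if track_id is None:
            continue  # fly not covered by any track in this frame
        if previous is None:
            fragments = 1  # first frame in which the fly is tracked
        elif track_id != previous:
            fragments += 1
            switches += 1
        previous = track_id
    return fragments, switches


fragmented = count_fragments([1, 1, 1, 4, 4, None, 4, 9])  # three fragments
complete = count_fragments([7] * 8)                        # one complete track
missed = count_fragments([None] * 8)                       # target never found

print(fragmented, complete, missed)
```

Under these assumptions, a fragmented but always-detected fly (the typical TbM outcome) yields several fragments, whereas a missed fly (possible in GCS) yields zero.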

4.4 High-density tracking

To evaluate the behavior of both tracking approaches, we tracked up to 200 flies. Similar to the long-term experiment in Section 4.3, we limit our comparisons to Greedy PC and TbM.

Table 1 highlights the measurements for 100, 150, and 200 targets. We decreased the number of frames for the movies with 150 and 200 flies to limit the size of $S_P$. For up to 100 targets and 200 frames, Greedy correspondence selection is more accurate than TbM. However, given more than 100 targets, resulting in very high fly densities, TbM outperforms the GCS approach. Most importantly, there are no missing targets in the probabilistic approach, whereas Greedy was not able to find trajectories for all flies. Furthermore, TbM can achieve better overall accuracy in high-density situations in less tracking time. The only drawback of the probabilistic tracking is again the fragmentation of the trajectories: TbM calculates more tracks than Greedy PC, resulting in many identity changes.

5 Conclusion

In this paper, we discussed two tracking paradigms for identical-appearing objects such as Drosophila melanogaster in 3D. One paradigm is based on a probabilistic approach conducting tracking and matching alternately [35]. The other paradigm constructs a global search space over all targets and time steps, which is optimized in a second step [23].

Due to the high complexity of the second paradigm (GCS), we introduced two improvements, namely projection consistency and greedy optimization. In particular, projection consistency is able to reduce the overall complexity and thus improve the tracking results without leading to local optima. Since Gibbs sampling, used for GCS optimization in [23], needs thousands of iterations to guarantee good results, our greedy selection scheme outperforms Gibbs sampling. However, a globally optimal result cannot be guaranteed via greedy optimization.
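Why a greedy scheme cannot guarantee the global optimum can be seen in a toy assignment problem. The sketch below is a generic illustration and not the greedy scheme introduced in this paper: it assumes a small square cost matrix (hypothetical values), picks the cheapest non-conflicting entries one by one, and compares the result with an exhaustive search over all permutations.

```python
from itertools import permutations


def greedy_assignment(cost):
    """Repeatedly pick the cheapest remaining entry whose row and column are unused."""
    n = len(cost)
    candidates = sorted((cost[i][j], i, j) for i in range(n) for j in range(n))
    used_rows, used_cols, chosen = set(), set(), []
    for _, i, j in candidates:
        if i not in used_rows and j not in used_cols:
            chosen.append((i, j))
            used_rows.add(i)
            used_cols.add(j)
    return sorted(chosen), sum(cost[i][j] for i, j in chosen)


def optimal_assignment(cost):
    """Exhaustive search over all permutations; only feasible for tiny problems."""
    n = len(cost)
    best = min(permutations(range(n)),
               key=lambda p: sum(cost[i][p[i]] for i in range(n)))
    return [(i, best[i]) for i in range(n)], sum(cost[i][best[i]] for i in range(n))


# Hypothetical costs: rows and columns could stand for targets and candidate pairings.
cost = [[1, 2],
        [2, 10]]

greedy_pairs, greedy_cost = greedy_assignment(cost)
optimal_pairs, optimal_cost = optimal_assignment(cost)
print(greedy_pairs, greedy_cost)    # greedy commits to the cheapest entry first
print(optimal_pairs, optimal_cost)  # the global optimum avoids the expensive entry
```

In this example, the greedy selection commits to the cheapest entry and is then forced to accept the most expensive one, while the exhaustive search finds a cheaper overall assignment. The exhaustive search grows factorially and only serves as a reference here.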

We demonstrated several advantages and disadvantages of both the TbM and GCS approach. Thus, the decision which approach to use must be made carefully. If the identity of the flies is not important, TbM can be used to track for several thousands of frames (compare to Section 4.3). All flies were detected, but the trajectories were fragmented due to occlusions.

The GCS approach was not able to track all flies in all experiments: only 49 of 50 flies were detected. On the other hand, the trajectories of the detected flies were less fragmented (compare to Section 4.2). In addition, the GCS was able to resolve collisions and occlusions more frequently because of the global search space. This leads to the higher tracking accuracy illustrated in Figure 7a. If dozens of flies must be tracked for a comparatively short period, GCS outperforms TbM tracking. For very long sequences, it is the other way around.

If high fly densities are needed for a comparatively long period, the size of the global search space prevents Gibbs optimization, because it requires too many sampling iterations. In addition, greedy tracking quality decreases drastically compared to TbM (see Section 4.4). Thus, without further reductions of the global search space, probabilistic tracking is the preferable paradigm in high-density experiments.

TbM could be optimized in terms of tracking accuracy, whereas GCS could be optimized for longer tracking durations and higher target densities. Possible improvements for the TbM approach are discussed in [35]. Here, we want to focus on improvements of the GCS approach.

Currently, we use the third camera only to correct mismatches between cameras 1 and 2. The optimization scheme is still executed on pairings. Since pairwise comparison in a triplet would further reduce the search space, all optimization steps could be done on three image points.
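The role of the third camera as a verifier can be sketched as a reprojection test. The example below is a minimal sketch under stated assumptions, not the implementation used in this paper: it assumes that a candidate 3D point has already been triangulated from cameras 1 and 2, that camera 3 is described by a 3x4 projection matrix, and that a pairing is accepted if its reprojection lies within a pixel threshold of some detected blob. The names `project`, `is_consistent`, `P3`, and the threshold `max_px` are hypothetical.

```python
import math


def project(P, X):
    """Project a 3D point X = (x, y, z) with a 3x4 camera matrix P (nested lists).

    Assumes the point lies in front of the camera, so the homogeneous scale is nonzero.
    """
    xh = (X[0], X[1], X[2], 1.0)
    u, v, w = (sum(P[r][c] * xh[c] for c in range(4)) for r in range(3))
    return u / w, v / w


def is_consistent(P3, X, blobs_cam3, max_px=2.0):
    """Accept a candidate 3D point if its reprojection is close to a blob in camera 3."""
    u, v = project(P3, X)
    return any(math.hypot(u - bu, v - bv) <= max_px for bu, bv in blobs_cam3)


# Hypothetical camera 3: focal length 800 px, principal point (320, 240), at the origin.
P3 = [[800.0, 0.0, 320.0, 0.0],
      [0.0, 800.0, 240.0, 0.0],
      [0.0, 0.0, 1.0, 0.0]]

blobs_cam3 = [(360.5, 219.0), (100.0, 100.0)]  # detected blob centers in camera 3

true_pairing = (0.1, -0.05, 2.0)   # reprojects to (360, 220), close to the first blob
ghost_pairing = (0.5, 0.5, 2.0)    # reprojects to (520, 440), no blob nearby

kept = is_consistent(P3, true_pairing, blobs_cam3)
rejected = not is_consistent(P3, ghost_pairing, blobs_cam3)
```

Operating on triplets, as proposed above, would go one step further: instead of filtering pairings after the fact, the blob matched in camera 3 would become part of the candidate itself.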

The kinetic model given by Equation 8 is also a current drawback of the GCS approach. Only motion from (t - 1) is considered for time step t. Thus, a more appropriate motion model would further improve the accuracy of GCS tracking.
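One possible direction is a predictor that uses more than one previous time step. The following sketch does not reproduce Equation 8; it merely contrasts a hypothetical predictor that uses only the position at t - 1 with a constant-velocity predictor that also uses t - 2, and shows a case (made-up coordinates) in which the two choose different candidates.

```python
def predict_from_last(prev):
    """Zero-order prediction: the fly is expected at its position from t - 1."""
    return prev


def predict_constant_velocity(prev, prev2):
    """First-order prediction: x_t = x_{t-1} + (x_{t-1} - x_{t-2})."""
    return tuple(2 * a - b for a, b in zip(prev, prev2))


def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def nearest(prediction, candidates):
    """Return the name of the candidate 3D point closest to the prediction."""
    return min(candidates, key=lambda name: squared_distance(prediction, candidates[name]))


x_t2 = (0.0, 0.0, 0.0)   # position at t - 2
x_t1 = (1.0, 0.5, 0.0)   # position at t - 1

candidates = {
    "continuation": (2.1, 1.0, 0.0),  # consistent with the previous displacement
    "neighbor": (1.2, 0.3, 0.0),      # another fly that happens to be close by
}

choice_last = nearest(predict_from_last(x_t1), candidates)
choice_velocity = nearest(predict_constant_velocity(x_t1, x_t2), candidates)
print(choice_last, choice_velocity)
```

With only t - 1 available, the nearby neighbor is preferred, whereas the constant-velocity prediction favors the candidate that continues the previous motion. Whether such a model improves GCS tracking of real flies, whose flight includes abrupt turns, would have to be evaluated.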

Currently, we are developing a three-camera real-world setup to capture movies of adult Drosophila flies. Thus, we are going to test both algorithms on real video sequences, comparable to the synthetic data introduced above.

References


  1. Gohl DM, Silies MA, Gao XJ, Bhalerao S, Luongo FJ, Lin CC, Potter CJ, Clandinin TR: A versatile in vivo system for directed dissection of gene expression patterns. Nat. Methods 2011, 8(3):231-237.

  2. Katsov A: Motion processing streams in Drosophila are behaviorally specialized. Neuron 2008, 59(2):322-335.

  3. Risse B, Thomas S, Otto N, Löpmeier T, Valkov D, Jiang X, Klämbt C: FIM a novel FTIR-based imaging method for high throughput locomotion analysis. PloS one 2013, 8(1):e53963.

  4. Sokolowski MB: Drosophila: genetics meets behaviour. Nat. Rev. Genet 2001, 2(11):879-890.

  5. Gomez-Marin A, Partoune N, Stephens GJ, Louis M: Automated tracking of animal posture and movement during exploration and sensory orientation behaviors. PloS one 2012, 7(8):e41642.

  6. Grover D, Yang J, Ford D, Tavaré S, Tower J: Simultaneous tracking of movement and gene expression in multiple Drosophila melanogaster flies using GFP and DsRED fluorescent reporter transgenes. BMC Res. Notes 2009, 2(1):58.

  7. Lahiri S, Shen K, Klein M, Tang A, Kane E, Gershow M, Garrity P, Samuel ADT: Two alternating motor programs drive navigation in Drosophila larva. PloS one 2011, 6(8):e23180.

  8. Tao J, Klette R: Tracking of 2D or 3D irregular movement by a family of unscented Kalman filters. J. Inf. Convergence Commun. Eng 2012, 10(3):307-314.

  9. Dankert H, Wang L, Hoopfer ED, Anderson DJ, Perona P: Automated monitoring and analysis of social behavior in Drosophila. Nat. Methods 2009, 6(4):297-303.

  10. Martin J: A portrait of locomotor behaviour in Drosophila determined by a video-tracking paradigm. Behav. Process 2004, 67(2):207-219.

  11. Ofstad TA, Zuker CS, Reiser MB: Visual place learning in Drosophila melanogaster. Nature 2011, 474(7350):204-207.

  12. Colomb J, Reiter L, Blaszkiewicz J, Wessnitzer J: Open source tracking and analysis of adult Drosophila locomotion in Buridan’s paradigm with and without visual targets. PloS one 2012, 7(11):e42247.

  13. Valente D, Golani I, Mitra PP: Analysis of the trajectory of Drosophila melanogaster in a circular open field arena. PloS one 2007, 2(10):e1083.

  14. Fry SN, Bichsel M, Müller P, Robert D: Tracking of flying insects using pan-tilt cameras. J. Neurosci. Methods 2000, 101(1):59-67.

  15. Poore A: Multidimensional assignment formulation of data association problems arising from multitarget and multisensor tracking. Comput. Optimization Appl 1994, 3: 27-57.

  16. Burkard RE, Dell’Amico M, Martello S: Assignment Problems. Philadelphia: Society for Industrial Mathematics; 2009.

  17. Tammero LF, Dickinson MH: The influence of visual landscape on the free flight behavior of the fruit fly Drosophila melanogaster. J. Exp. Biol 2002, 205(Pt 3):327-343.

  18. Du H, Zou D, Chen YQ: Relative epipolar motion of tracked features for correspondence in binocular stereo. In ICCV Conference Proceedings, Rio de Janeiro, October 2007. Piscataway: IEEE; 2007:1-8.

  19. Engelmann D, Garbe C, Stohr M, Geißler P, Hering F, Jahne B: Stereo particle tracking. 8th International Symposium on Flow Visualisation, Sorrento, 1–4 September 1998, 1-10.

  20. Wu HS, Zhao Q, Zou D, Chen YQ: Acquiring 3D motion trajectories of large numbers of swarming animals. In IEEE 12th International Conference on Computer Vision Workshops, Kyoto, September to October 2009. Piscataway: IEEE; 2009:593-600.

  21. Grover D, Tower J, Tavaré S: O fly, where art thou. J. R. Soc. Interface 2008, 5(27):1181-1191.

  22. Straw AD, Branson K, Neumann TR, Dickinson MH: Multi-camera real-time three-dimensional tracking of multiple flying animals. J. R. Soc. Interface 2011, 8(56):395-409.

  23. Zou D, Zhao Q, Wu HS, Chen YQ: Reconstructing 3D motion trajectories of particle swarms by global correspondence selection. In IEEE 12th International Conference on Computer Vision, Kyoto, September to October 2009. Piscataway: IEEE; 2009:1578-1585.

  24. Ardekani R, Biyani A, Dalton JE, Saltz JB, Arbeitman MN, Tower J, Nuzhdin S, Tavaré S: Three-dimensional tracking and behaviour monitoring of multiple fruit flies. J. R. Soc. Interface 2013, 10(78):20120547.

  25. Pereira F, Stüer H, Graff EC, Gharib M: Two-frame 3D particle tracking. Meas. Sci. Technol 2006, 17(7):1680-1692.

  26. Theriault D, Wu Z, Hristov N, Swartz S, Breuer K, Kunz T, Betke M: Reconstruction and analysis of 3D trajectories of Brazilian free-tailed bats in flight. 2010.

  27. Wu Z, Hristov NI, Hedrick TL, Kunz TH, Betke M: Tracking a large number of objects from multiple views. In 2009 IEEE 12th International Conference on Computer Vision, Kyoto, September to October 2009. Piscataway: IEEE; 2009:1546-1553.

  28. Veeraraghavan A, Srinivasan M, Chellappa R, Baird E, Lamont R: Motion based correspondence for 3D tracking of multiple dim objects. IEEE Int. Conf. Acoustics Speech Signal Process (ICASSP) 2006, 2: pp. II.

  29. Ouellette NT, Xu H, Bodenschatz E: A quantitative study of three-dimensional Lagrangian particle tracking algorithms. Exp. Fluids 2005, 40(2):301-313.

  30. Kohlhoff KJ, Jahn TR, Lomas DA, Dobson CM, Crowther DC, Vendruscolo M: The iFly tracking system for an automated locomotor and behavioural analysis of Drosophila melanogaster. Integr. Biol. (Camb) 2011, 3(7):755-760.

  31. Zou S, Liedo P, Altamirano-Robles L, Cruz-Enriquez J, Morice A, Ingram DK, Kaub K, Papadopoulos N, Carey JR: Recording lifetime behavior and movement in an invertebrate model. PloS one 2011, 6(4):e18151.

  32. Butail S, Manoukis N, Diallo M, Ribeiro JM, Lehmann T, Paley DA: Reconstructing the flight kinematics of swarming and mating in wild mosquitoes. J. R. Soc. Interface 2012, 9(75):2624-2638.

  33. Kelley DH, Ouellette NT: Emergent dynamics of laboratory insect swarms. Sci. Rep 2013, 3.

  34. Attanasi A, Cavagna A, Del Castello L, Giardina I, Jelic A, Melillo S, Parisi L, Shen E, Silvestri E, Viale M: Tracking in three dimensions via multi-path branching. CoRR abs/1305.1495. 2013.

  35. Tao J, Risse B, Jiang X, Klette R: 3D Trajectory estimation of simulated fruit flies. In Proc 27th IVCNZ. Dunedin; 26–28 November 2012.

  36. Hartley R, Zisserman A: Multiple View Geometry in Computer Vision. Cambridge: Cambridge University Press; 2004.

  37. Geman S, Geman D: Stochastic relaxation, Gibbs distributions and the Bayesian restoration of images. IEEE Trans. Pattern Anal. Mach. Intell 1984, 6(6):721-741.

  38. Wan EA, van der Merwe R: The unscented Kalman filter for nonlinear estimation. In Adaptive Systems for Signal Processing, Communications, and Control Symposium 2000, Lake Louise, Alta, October 2000. Piscataway: IEEE; 2000:153-158.

  39. Marden JH, Wolf MR, Weber KE: Aerial performance of Drosophila melanogaster from populations selected for upwind flight ability. J. Exp. Biol 1997, 200(Pt 21):2747-2755.

  40. Gargano JW, Martin I, Bhandari P, Grotewiel MS: Rapid iterative negative geotaxis (RING): a new method for assessing age-related locomotor decline in Drosophila. Exp. Gerontol 2005, 40(5):386-395.

Acknowledgements


The authors thank S. Strothoff for the discussion and help throughout the project. Furthermore, we would like to thank the anonymous reviewers for the helpful annotations and suggestions for improvements. We acknowledge the support by the Deutsche Forschungsgemeinschaft and Open Access Publication Fund of University of Münster.

Author information

Corresponding author

Correspondence to Benjamin Risse.

Additional information

Competing interests

The authors declare that they have no competing interests.

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 2.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

About this article

Cite this article

Risse, B., Berh, D., Tao, J. et al. Comparison of two 3D tracking paradigms for freely flying insects. J Image Video Proc 2013, 57 (2013).
