# A two-fly tracker that solves occlusions by dynamic programming: computational analysis of *Drosophila* courtship behaviour

- Christian Schusterreiter^{1, 2}
- Wilfried Grossmann^{3}

**2013**:64

https://doi.org/10.1186/1687-5281-2013-64

© Schusterreiter and Grossmann; licensee Springer. 2013

**Received: **19 February 2013

**Accepted: **31 October 2013

**Published: **17 December 2013

## Abstract

This paper introduces a two-fly tracker that models and solves occlusions as an optimization problem. Automated tracking of genetic model organisms is gaining importance because geneticists and neuroscientists now have biological tools to systematically study the connection between genes, neurons and behaviour in large-scale behavioural experiments. The tracker provides automated quantification for such functional studies of *Drosophila* courtship behaviour. It enables measurement and visualization of behavioural differences in genetically modified fly pairs. The developed system provides solutions for all major challenges that were identified: arena detection, segmentation, quality control, resolving occlusions, resolving heading and detection of behaviour events. Among these challenges, resolving occlusions turned out to be particularly important, and considerable effort was invested in it. Our tests show that the system is capable of identifying flies through an entire video with an accuracy of 99.97%. This result is achieved by combining different types of local methods and modeling the global identity assignment as an optimization problem.

## Keywords

*Drosophila*; Courtship behaviour; Quantification; Pattern recognition

## 1 Introduction

### 1.1 Motivation and goals

A fundamental question in neuroscience is how genes, brains and behaviour are related: genes encode hard-wired neuronal circuits in the nervous system. For innate behaviours, such as the reproductive behaviour of insects, these neuronal circuits produce observable stereotypic motor outputs.

The fruit fly *Drosophila melanogaster* has a set of innate behaviours that are hard-wired in the nervous system. Several innate behaviours of *D.* *melanogaster* are sex-specific. In combination with the availability of genetic and molecular tools, the fruit fly is a common model organism to study how the nervous system generates behaviours.

*Drosophila* courtship is a robust and sex-specific behaviour that has been characterized through multiple genetic screens. Many genes that regulate male and female courtship behaviour have already been identified. It was a big surprise that a complex behaviour like courtship is regulated by a few sets of genes [1, 2], and it is strongly believed that these genes interact with cascades of downstream genes that regulate individual parts of the behaviour.

Currently, geneticists and neuroscientists perform large-scale experiments in order to systematically identify genes and neurons that are involved in specific steps of courtship behaviour. Quantifying these experiments and classifying different behaviours turned out to be a very time-consuming and tedious task; thus, it was the major bottleneck of large-scale behaviour screens for a long time.

Automated tools aim to support such large-scale experiments. Saving time is one important factor, but in addition, automation limits human error and extends possibilities for robust, objective and reproducible analysis.

This paper describes the development of a system, which translates courtship behaviour videos into formal descriptors using computer vision and statistical methods. The descriptors allow ethogram-like descriptions of complex courtship behaviour patterns for each fly. A special feature of our system is the identification of individual flies through the entire video by solving the occlusion problem with very high accuracy.

### 1.2 Related work

When the project was initiated, only a few trackers [3–6] existed for a different model organism called *Caenorhabditis elegans*. These trackers mainly analyzed the worm’s movements and quantified turn direction versus straight movement. They excluded frames where worms occluded each other. The only published fly tracker [7] analyzed the fly locomotion behaviour.

In 2008 Perona published an automated fly tracker for courtship and aggressive behaviour [8, 9] and initiated a transition from manual to automated scoring. Simultaneously, Schusterreiter developed a tracker [10] that measures the courtship index and captures courtship sub-behaviours.

In particular, the work of Dankert et al. [8] has similar aspects to this paper as it also introduces a two-fly tracker and tackles similar challenges. Our tracker differs mainly in three aspects. First, it was initially designed to process unseen videos that were not specifically recorded for automated tracking and therefore comes with an arsenal of quality boosting and quality control methods. Second, we invested considerable effort in tackling the occlusion problem, which was probably less critical for the application scenarios of [8]. Finally, our system offers top-down and bottom-up classifiers for courtship behaviour, while the other system offers top-down classifiers only but for both courtship and aggressive behaviours.

Branson et al. [9] quantify the behaviour of multiple flies, while our system is specifically optimized for two flies per chamber. Our system deals with flies turning their head up in *z*-direction and flies occluding each other in *z*-direction by improved software analysis, while the tracker in [9] attacks these problems by improvements of the recording setup [11] that significantly decrease difficult occlusion cases.

Hoyer et al. [12] used a tracker that quantified aggression behaviour by a user-defined lunge counter and required one of the two male flies to be painted with a white dot on the back. Similarly, the identity tracking method introduced in [13] enabled biologists to genetically mark flies by a camera-detectable fluorescence marker. In contrast, our system is in principle capable of incorporating detectable color differences but does not *require* flies to be marked.

The work introduced in this paper was developed independently from related work; however, some user-defined classifiers of the postprocessor were defined after the classifiers in [8] had been studied. Further similarities, like choosing the Hungarian algorithm for identity assignment in unoccluded sequences or circular arena detection by the Hough transform, are coincidental.

This paper introduces a two-fly tracker and focuses mainly on resolving occlusions as an optimization problem. It is organized as follows: Section 2 introduces the main components of the entire system. Section 3 starts with the basic definitions for the occlusion problem (Section 3.1) and local methods for occlusion assignments (Section 3.2). The formulation of the occlusion problem as an optimization problem is stated in Section 3.3, and a dynamic programming algorithm that solves this optimization problem is presented in Section 3.4. Results of the approach are discussed in Section 3.5. Section 4 shows that the same algorithm is capable of solving the heading problem. Finally, Section 5 provides a summary, discusses properties and results of the optimization algorithm and outlines future work.

## 2 A two-fly tracker

In general, automated tracking is a data *densification* process that takes high amounts of *video data* having low information content and turns them into low amounts of *relevant features* having high information content. Our system comes with two major steps: an *image processing step* where raw video data is transformed into a time series representation and a *pattern recognition step* where biologically relevant events are detected within that time series. The image processing part further subsumes several data transformation and data cleaning steps while the data is still in its image representation. It thus boosts quality and plausibility of image data before the time series is extracted and ensures that minimum quality standards are met. In case videos are detected to be inappropriate for downstream computation steps, they are rejected as early as possible in order to save computation time.

The system is organized into *modules* named *preprocessor*, *tracker*, *postprocessor* and *annotationTool*. The preprocessor and the tracker cover the image processing part; the postprocessor derives advanced attributes and covers the pattern recognition part. The workflow between modules is straightforward: information flows from the preprocessor through the tracker to the postprocessor module. The only two-way interacting component is the annotationTool; it interoperates with postprocessed data (*cf*. Figures 1 and 2).

The following paragraphs contain brief descriptions for each module; more details for main functionality may be found in [10].

*The preprocessor* identifies individual arenas (*cf*. Figure 1a) and boosts video quality for each frame. Quality improvement encompasses illumination correction based on an illumination correction curve (*cf*. Figure 1b) and elimination of arenas where camera movement, intruding objects or not exactly two flies were detected (*cf*. Figure 1c,h,i). In this process, approximately 2% to 5% of chamber videos are rejected. For arena detection and all further image processing two gray level pictures are essential: the so-called rigorously smoothed background (Figure 1e) and the cautiously smoothed background (Figure 1f). Arenas are detected by a circular Hough transform which identifies number, position and diameter of the arenas (Figure 1d). Arena boundaries (Figure 1g) are watched by an intrusion detector. Finally, videos are split into individual arenas (Figure 1i), each handed separately to the tracker.

*The tracker* shown in Figure 1j,k,l,m,n,o,p,q,r,s takes single arena videos and corresponding smoothened backgrounds as input (*cf*. first column of the tracker figure, Figure 1j,u,t). The second column shows the arena after subtracting the smoothened backgrounds (first and second rows, Figure 1p,k) and the results of a gradient procedure in the third row (Figure 1n). Further processing is based on the gray level histogram shown in the middle of the third column (Figure 1l). Two different thresholds are used for body and wing detection. Using the thresholds for body region together with the gradient pictures for the boundaries, we obtain a picture for the body region (Figure 1m). Using this result in connection with the threshold for wing detection, we obtain from (Figure 1p) the first image of the binarized body-wing region (Figure 1q). This image is further improved by filling holes (Figure 1r). Resulting body regions (Figure 1m) and wing regions (Figure 1r) are marked in the original image in Figure 1s. Extraction of *primary* attributes directly from images (Figure 1m) and (Figure 1r) concludes the image processing step. These attributes build the interface to the postprocessor module. We derive for both body and wing region: the number of pixels *Area*, the region *Perimeter* and the center of gravity *Centroid*, *Orientation*, *MinorAxisLength* and *MajorAxisLength* of a covering ellipse.

The tracking process is accompanied by a number of quality control steps, such as checking for intrusions into the watched boundary around a chamber (*cf*. Figure 1i) and evaluating the plausibility of tracked primary attributes.

*The postprocessor* covers the pattern recognition part of the system and searches for biologically relevant events. As a first step tracking data is *normalized* such that all attributes are comparable across different videos. From normalized primary attributes we compute secondary attributes that allow definition of behavioural patterns like following, wing extension or copulation. Figure 2 shows some of these attributes. They capture specific fly constellations (Figure 2a,e,f,g,h) or shapes (Figure 2b,c,d). Transformation of these attributes into behaviour patterns requires identification of individual flies in each video frame and detection of each fly’s head and tail. Fly identification is rather simple as long as the tracker distinguishes separate regions for each fly in so-called unoccluded frames. For sequences of unoccluded frames, fly identities are carried through successive frames by solving an assignment problem for position characteristics with the Hungarian algorithm [14].
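For illustration, the identity assignment in unoccluded frames can be sketched as follows. With exactly two flies the assignment problem has only two candidate solutions, so the result that the Hungarian algorithm would return can be found by comparing two summed costs directly. The function name `match_two_flies` and the use of Euclidean centroid distance as the cost are assumptions made for this sketch; the text only states that position characteristics are used.

```python
import math

def match_two_flies(prev, curr):
    """Match two centroids of the previous frame to two of the current frame.

    Returns (i0, i1): prev[0] is matched to curr[i0] and prev[1] to curr[i1].
    Assumption: the cost of a matching is the summed Euclidean distance.
    """
    def dist(p, q):
        return math.hypot(p[0] - q[0], p[1] - q[1])

    keep = dist(prev[0], curr[0]) + dist(prev[1], curr[1])
    swap = dist(prev[0], curr[1]) + dist(prev[1], curr[0])
    return (0, 1) if keep <= swap else (1, 0)

# Hypothetical centroids; the segmentation reports the regions in swapped order.
previous = [(10.0, 10.0), (40.0, 25.0)]
current = [(41.0, 24.0), (11.0, 12.0)]
assignment = match_two_flies(previous, current)
```

For more than two objects the two-way comparison no longer suffices, and a general assignment solver such as the Hungarian algorithm [14] is needed.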

In case of occluded frames, the frames where fly bodies overlap (occlude) each other, primary and secondary attributes are computed after solving the occlusion problem. A solution assigns a matching for fly identities before and after each occlusion (see Section 3). According to these matchings, occluded primary attributes are approximated by interpolations. Having primary attributes for every single frame, the postprocessor then determines the head and tail for each fly body (see Section 4) and computes all other secondary attributes.

The system may then apply machine-learned or user-defined classifiers to detect relevant behaviour events. Detected events are logged in color-coded *ethograms* (*cf*. Section 5.2) and Excel sheets.

*The annotationTool* interacts with the postprocessed data. It supports attribute inspections and overruling of machine decisions. Manually tweaked postprocessing data is re-postprocessed to ensure consistent data views and to avoid time-consuming recalculations during online annotations.

The screenshot in Figure 2i contains data panels on top that visualize attributes for both flies, video panels that depict video frames with tracked perimeter, automatically annotated heading and interpolated ellipses and an occlusion panel that visualizes fly identification across an occlusion. Control panels on the right are for video navigation, attribute selection and manual annotation.

We further implemented a *webinterface* to bulk-submit processing tasks to a computer cluster and to manage videos and tracking results.

## 3 Resolving occlusions

### 3.1 Problem definition

When examining social interactions, the aim is to capture behaviour especially when the individuals are close to each other. Therefore, it is necessary to identify individual flies throughout the entire video even if they overlap or occlude each other. If the two flies move close together and their body regions overlap such that the segmentation method detects only a single body region for both flies, then assigning fly identities becomes a difficult task for a computer. Even for humans, it is sometimes difficult or even impossible to allocate individuals correctly after two flies have overlapped completely.

Since it is essential to be able to allocate the individual flies for the behavioural studies, the occlusion problem was a key challenge in system development and huge effort was invested to tackle the occlusion problem.

For resolving occlusions, the sequence of all video frames *V* is partitioned into alternating *σ* and *τ* sequences. While *σ* sequences contain unoccluded frames where both fly bodies are detected separately, *τ* sequences contain occluded frames where the two fly bodies are merged into one larger region or where no flies at all were detected.

Formally, *σ* and *τ* sequences are defined as follows:

**Definition** **1**. Let *f* be a frame. Function *o*(*f*) is defined as *o*(*f*)=1 in case *f* is occluded and *o*(*f*)=0 otherwise. Let $f_{0}$ refer to an empty frame, with $o(f_{0})=1$.

**Definition** **2**. Let *V* be a sequence of frames $f_{i}$, $f_{i}\in V$. A sequence $\sigma\subseteq V$ contains a set of successive frames $f_{i}$ with $\forall f_{i}\in\sigma: o(f_{i})=0$; a sequence $\tau\subseteq V$ contains successive frames with $\forall f_{i}\in\tau: o(f_{i})=1$.

**Corollary** **1**. *The border for a partitioning Φ of V that consists of alternating σ and τ sequences is marked by* $\Delta[o(f_{i})]$. *The partitioning* $\Phi_{0}$ *of* $V_{0}=f_{0}\cup V\cup f_{0}$ *is guaranteed to start and end with a τ sequence.*
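The partitioning can be sketched in a few lines, assuming the occlusion indicator *o*(*f*) is already available as a list of 0/1 flags, one per frame. The function name `partition` and the tuple layout of the result are hypothetical choices for this sketch.

```python
from itertools import groupby

def partition(occluded):
    """Split per-frame occlusion flags o(f_i) into alternating runs.

    Returns a list of (kind, start, end) with kind "tau" or "sigma";
    start and end are inclusive indices into the padded sequence V_0.
    """
    # V_0 = f_0 + V + f_0, where the empty frame f_0 counts as occluded.
    padded = [1] + list(occluded) + [1]
    runs, pos = [], 0
    for flag, group in groupby(padded):
        length = len(list(group))
        runs.append(("tau" if flag else "sigma", pos, pos + length - 1))
        pos += length
    return runs

# Hypothetical indicator sequence for a nine-frame video.
flags = [0, 0, 1, 1, 0, 0, 0, 1, 0]
runs = partition(flags)
```

Because of the padding with the empty frame, the first and last runs are always *τ* sequences, which is the property stated in Corollary 1.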

**Definition** **3**. The occlusion problem is finding the best overall assignment of fly identities in all unoccluded sequences using observable fly attributes.

The problem is solved in two steps. First, we calculate *local scores* for the possible assignments of the fly identities in a subsequence $(\sigma_{i},\tau_{i},\sigma_{i+1})$, using only information in the occluded sequence $\tau_{i}$ and its two enclosing unoccluded sequences $\sigma_{i}$ and $\sigma_{i+1}$. These scores can be interpreted as probabilities for a matching and are determined by local methods called *t-methods*. Different local methods are introduced in Section 3.2. The occlusion sequences in Figure 3a,b depict rather trivial occlusion cases where most local methods are successful. The sequences in Figure 3c,d show cases for which different local methods may deliver divergent results.

Resolution of such ambiguities is done in the second step by (a) formalizing the occlusion problem as a global optimization problem in subsection 3.3 and (b) solving it with a dynamic programming approach (see subsection 3.4).

### 3.2 Local methods

Local methods, or t-methods, are associated with a *τ* sequence and aim to provide an assignment for the identifiers of its enclosing *σ* sequences. Each t-method computes a *t value* (also called a *t score*) for each assignment that reflects its certainty.

In general, t-methods may be differentiated into *attribute-based methods* following a *merge-and-split* approach, *point-based methods* following a *straight-through* approach [15] and *combination methods*.

#### 3.2.1 Attribute-based methods

Attribute-based methods follow a classical merge-and-split approach. The idea is simple: compare the values of known fly attributes before and after the occlusion and assign pre-occlusion flies to the best matching post-occlusion flies. The particular set of characteristic attributes that are used to re-identify flies may vary.

Although any attribute can be taken into account, we will first focus on size-based attributes. An initial motivation for size-based methods was the known size difference between male and female flies.

In general, attribute-based methods match flies according to the mean, maximum, minimum or any other aggregation of sizes before and after the occlusion.

Method siz1 aggregates an eccentricity-corrected size attribute $\text{AreaEC}=\text{Area}\sqrt{1-{\text{EccentricityC}}^{2}}$, with $\text{EccentricityC}=\frac{\text{Eccentricity}}{1+{e}^{5\cdot\text{Eccentricity}}}$, from whole *σ* sequences and compares probabilities that indicate which fly is bigger. Figure 2i (blue shape, second and third data panel) visualizes the attributes Area and AreaEC next to each other (particularly note frames 460 to 480 when the larger fly turns up in *z*-direction). When using attribute AreaEC, method siz1 solves all occlusion cases in Figure 3, while using Area directly would get sequence Figure 3c wrong.

Method posm compares Centroids from the last frame before the occlusion to the first frame after the occlusion and computes a score $v\in(-1,+1)$. The score aggregates Centroid distances that indicate a matching of fly 1 before the occlusion being fly 1 after the occlusion, ${o}_{b}^{1}\rightharpoonup {o}_{a}^{1}$ (and correspondingly ${o}_{b}^{2}\rightharpoondown {o}_{a}^{2}$), versus the opposite matching. Normalized by all involved distances, the score $v=-\frac{({o}_{b}^{1}\rightharpoonup {o}_{a}^{1})+({o}_{b}^{2}\rightharpoondown {o}_{a}^{2})-({o}_{b}^{1}\rightharpoonup {o}_{a}^{2})-({o}_{b}^{2}\rightharpoondown {o}_{a}^{1})}{({o}_{b}^{1}\rightharpoonup {o}_{a}^{1})+({o}_{b}^{2}\rightharpoondown {o}_{a}^{2})+({o}_{b}^{1}\rightharpoonup {o}_{a}^{2})+({o}_{b}^{2}\rightharpoondown {o}_{a}^{1})}$ is guaranteed to be between −1 and +1 and is negated to prefer short distances. The resulting score *v* indicates an identity assignment by sign(*v*) and the method’s certainty about this assignment by |*v*|.
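The posm score can be sketched as follows, reading each arrow term as the Euclidean distance between the two Centroids it connects. The function name `posm_score`, the argument layout and the handling of the degenerate case where all distances are zero are assumptions of this sketch.

```python
import math

def posm_score(b1, b2, a1, a2):
    """Position-based score v for Centroids before (b) and after (a) an occlusion.

    Positive v favours the matching 1->1, 2->2; negative v favours 1->2, 2->1.
    """
    def d(p, q):
        return math.hypot(p[0] - q[0], p[1] - q[1])

    same = d(b1, a1) + d(b2, a2)      # distances under the identity matching
    swapped = d(b1, a2) + d(b2, a1)   # distances under the opposite matching
    total = same + swapped
    if total == 0.0:
        return 0.0                    # assumption: no evidence either way
    return -(same - swapped) / total  # negated so short distances score high

# Hypothetical Centroids: both flies move one unit towards each other.
v = posm_score((0.0, 0.0), (10.0, 0.0), (1.0, 0.0), (9.0, 0.0))
```

In this example the identity matching costs 2 units and the opposite matching 18 units, so the score is 16/20 = 0.8 in favour of keeping the identities.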

#### 3.2.2 Point-based methods

Point-based methods follow the straight-through approach, where a point set $\mathfrak{C}$ is traced ‘straight through’ the occlusion states. The object’s perimeter turned out to be a good choice for $\mathfrak{C}$; it outperformed all other tested point set candidates in solution quality or computation time.

The aim of point-based methods is to assign identifiers from the state before the occlusion *b*, the last frame where both flies have been identified, to the state after the occlusion *a*, the first frame where both flies are identified again. For this reason, point sets $\mathfrak{C}$ are extracted before and after the occlusion and each point is associated with an identifier of the two separately detected flies. Then, for each frame during the occlusion, the point set is extracted and associated identifiers are carried over from its predecessor frame by a nearest neighbour assignment using Voronoi diagrams [16]. At the end, the identifier set carried through the occlusion $\hat{C}_{a}$ is compared with the freshly partitioned point set $\hat{C}_{a^{\prime}}$ and a score *v* is derived that reflects how well the associated identifiers of the characteristics in $\hat{C}_{a}$ and $\hat{C}_{a^{\prime}}$ match. The score aggregates the sum of identifier votes from ${\mathfrak{C}}_{a}$ that indicate mapping identifiers ${o}_{b}^{1}$ to ${o}_{a}^{1}$ and ${o}_{b}^{2}$ to ${o}_{a}^{2}$ minus the votes for mapping ${o}_{b}^{1}$ to ${o}_{a}^{2}$ and ${o}_{b}^{2}$ to ${o}_{a}^{1}$, normalized by the sum of all votes, $v=\frac{({o}_{b}^{1}\rightharpoonup {o}_{a}^{1})+({o}_{b}^{2}\rightharpoondown {o}_{a}^{2})-({o}_{b}^{1}\rightharpoonup {o}_{a}^{2})-({o}_{b}^{2}\rightharpoondown {o}_{a}^{1})}{({o}_{b}^{1}\rightharpoonup {o}_{a}^{1})+({o}_{b}^{2}\rightharpoondown {o}_{a}^{2})+({o}_{b}^{1}\rightharpoonup {o}_{a}^{2})+({o}_{b}^{2}\rightharpoondown {o}_{a}^{1})},v\in (-1,+1)$. The resulting score *v* again indicates a suggested identity assignment by sign(*v*) and the certainty about this assignment by |*v*|.
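The straight-through idea can be sketched as follows. This is a simplified illustration, not the tracker's implementation: it uses a brute-force nearest neighbour search instead of Voronoi diagrams, counts one vote per point, and all function names and the tiny point sets are hypothetical.

```python
def nearest_label(point, labelled_points):
    """Identifier of the nearest labelled point in the predecessor frame."""
    px, py = point
    nearest = min(
        labelled_points,
        key=lambda lp: (lp[0][0] - px) ** 2 + (lp[0][1] - py) ** 2,
    )
    return nearest[1]

def carry_labels(before, frames):
    """Carry identifiers frame by frame through the occlusion.

    before: list of ((x, y), identifier) from the last unoccluded frame b.
    frames: list of point lists, ending with the first unoccluded frame a.
    """
    current = before
    for points in frames:
        current = [(p, nearest_label(p, current)) for p in points]
    return current

def vote_score(carried, fresh):
    """Votes for the identity matching minus votes for the swap, normalized."""
    fresh_label = dict(fresh)
    same = sum(1 for p, label in carried if fresh_label[p] == label)
    swapped = len(carried) - same
    return (same - swapped) / len(carried)

# Hypothetical perimeter points: two flies touch briefly and separate again.
before = [((0, 0), 1), ((1, 0), 1), ((5, 0), 2), ((6, 0), 2)]
frames = [[(1, 0), (2, 0), (4, 0), (5, 0)], [(0, 1), (1, 1), (5, 1), (6, 1)]]
fresh = [((0, 1), 1), ((1, 1), 1), ((5, 1), 2), ((6, 1), 2)]
carried = carry_labels(before, frames)
v = vote_score(carried, fresh)
```

Here every carried identifier agrees with the fresh partitioning, so all votes support the identity matching.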

The major weakness of all point-based methods comes with the nearest neighbour assignment. Due to the fact that each pixel takes over the identifier of its nearest pixel in the previous frame, crossing flies are likely to be mis-scored. In fact, all mis-scores and ‘don’t know’ cases that scored with a value of 0 result from this known issue. The latter case especially comes up when the occluded region moves over longer distances. A method variant bocT therefore aims to compensate such movements by applying rigid transformation between successive frames and reduces the effect of that particular weakness.

The boc method and its variants turned out to be particularly reliable for occlusion cases like the ones in Figure 3a,b and are likely to get cases like the one in Figure 3c wrong. Although point-based methods have known difficulties when dealing with crossing flies, they still correctly solve between 90% and 95% of our test case set (see Section 3.5) and typically give low certainty values when they are wrong.

#### 3.2.3 Combining meta-methods

Combining meta-methods aim to boost scores from individual local methods by machine learning techniques. For this reason, we implemented a large number of attribute-based, point-based and other methods, and we extracted observable attributes from occluded blobs. In particular, the duration of the occlusion and its minimum number of pixels (providing information about a ‘maximal degree of occlusion’) turned out to provide good occlusion characterizations. After computation of all decision and score results from all implemented methods, a *meta-method* was trained by standard machine learning approaches. Classification and Regression Trees (CART) turned out to be a useful approach; although alternative meta-method approaches performed equally well, the tree approach was chosen because of its intuitive and easily understandable rule-based decisions.

The experiment results in Section 3.5 contain results from a cross-validated CART method where local methods, each having an accuracy of 90% to 95%, are bundled into a combined t-method with about 99% accuracy.

Alternatively, the probability-converted scores of independent methods may be combined by the *Dempster combination* [17, 18], which mathematically combines evidence from different sources into a combined degree of belief. The Dempster combination is defined as follows:

**Definition** **4**. Let $e_{1}$ and $e_{2}$ be two independent evidences. The Dempster combination of these two evidences is defined as ${e}_{1}\otimes {e}_{2}=\frac{{e}_{1}\cdot {e}_{2}}{1-K},K=(1-{e}_{1})\cdot {e}_{2}+{e}_{1}\cdot (1-{e}_{2})$.

The definition above allows evidence from multiple t-methods to be combined cumulatively into new t-probabilities. The Dempster combination may also be used to combine independent s-methods (s-methods are introduced in Section 3.3 below).
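A minimal sketch of Definition 4 and its cumulative use is given below. The function names `dempster` and `combine_all` and the example probabilities are hypothetical; raising an error for total conflict (*K* = 1) is a choice made for this sketch.

```python
from functools import reduce

def dempster(e1, e2):
    """Dempster combination of two independent evidences (Definition 4)."""
    k = (1.0 - e1) * e2 + e1 * (1.0 - e2)   # conflict term K
    if k == 1.0:
        raise ValueError("total conflict: combination is undefined")
    return (e1 * e2) / (1.0 - k)

def combine_all(evidences):
    """Cumulatively combine evidences from several methods."""
    return reduce(dempster, evidences)

# Two methods that each support the same matching with probability 0.8.
combined = dempster(0.8, 0.8)
# An undecided method (0.5) leaves the other evidence unchanged.
neutral = dempster(0.5, 0.8)
```

Two agreeing evidences of 0.8 reinforce each other to 0.64/0.68, roughly 0.94, while an evidence of 0.5 acts as the neutral element.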

### 3.3 Occlusions as global optimization problem

Local methods process cases linearly and assess occlusion sequences independently, one after another. A single wrong identity assignment is therefore propagated through the rest of the video: identities are swapped from that assignment on and remain mis-assigned up to the end of the video (*cf*. the first example in Section 3.3).

In order to overcome this error propagation problem, we complement t-methods with so-called *s-methods*. While t-methods are associated with *τ* sequences, s-methods are associated with unoccluded *σ* sequences where both flies are detected. These s-methods aim to discriminate and re-identify the two detected flies much like the merge-and-split approaches introduced in Section 3.2. However, while merge-and-split t-methods assess and compare characteristics of the *σ* sequences directly before and after an occluded *τ* sequence, the characteristics for s-methods must be *comparable during the whole video*. Similar to the t-methods, a s-method provides a *s*-*score* for each assignment that reflects its certainty.

The comparability of s-methods is a key property to overcome the error propagation problem that comes when using t-methods only and is essential for the optimization approach described in this section. The following two examples underline the difference between s- and t-methods:

The size-based method siz1 (see Section 3.2) aims to match flies before an occlusion (in sequence $\sigma_{b}$) to flies after an occlusion (in sequence $\sigma_{a}$) according to an observed size difference. The bigger fly is assigned to the bigger fly and the smaller fly is assigned to the smaller fly. Such a size-based method may easily be generalized to become a s-method, since the discriminating characteristic, the fly size, is comparable during the whole video. In other words, flies in an arbitrary unoccluded sequence $\sigma_{k}$ may be matched to flies of every other unoccluded sequence $\sigma_{i}$, such that the bigger flies are assigned to each other.

In contrast, the position-based method posm (see again Section 3.2), which aims to match flies according to their position, is *not* suitable for a s-method generalization. Obviously, longer time spans between two sequences $\sigma_{k}$ and $\sigma_{i}$ will lead to improper results.

In general, anatomical features, e.g. size or eye color, suggest suitable s-method implementations. In principle, any measurable anatomical or otherwise constant feature (like a painted mark) that discriminates the flies is applicable.

An intuitive combination of s- and t-methods would be to select scores where a s-method is absolutely sure and to then treat the corresponding identity assignments as ‘fix points.’ The t-methods may then be used only for low-score cases between these fix points. Such an approach would limit the intrinsic problem of t-methods, as mis-assignments would only be propagated up to the next fix point. The introduced optimization approach is a generalization of this idea and enables s-methods and t-methods to *correct each other*.

In order to ensure comparability of method results, their scores are converted to *probabilities* as probability values are *comparable* and *combinable with each other*.

In theory the conversion is done by empirically determining the distribution of score values per method and then deriving a value *p* from a score and the method’s specifically given score distribution. In practice, using a linear approximation turned out to lead to sufficiently accurate results for all incorporated methods (see Section 3.5).

Finally, *s* values and *t* values are defined for each method as *logodds*, which are derived from these (approximated) probability values, $v=\ln\left(\frac{p}{1-p}\right)$. Logodds inherit all comparability and combinability properties and further provide two desirable mathematical properties: (1) logodds of counter probabilities correspond to an inversion in sign, ${v}^{\prime}=\ln\left(\frac{1-p}{p}\right)=-v$, and (2) logodds are combinable by addition.
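The two properties can be checked with a short sketch. The function names `logodds` and `probability` are hypothetical; the second is simply the inverse mapping (the logistic function), included here to show that adding the logodds of two independent probabilities yields the same result as the Dempster combination of Definition 4.

```python
import math

def logodds(p):
    """Logodds v = ln(p / (1 - p)) of a probability 0 < p < 1."""
    return math.log(p / (1.0 - p))

def probability(v):
    """Inverse of logodds: the probability that corresponds to v."""
    return 1.0 / (1.0 + math.exp(-v))

p = 0.8
v = logodds(p)
v_counter = logodds(1.0 - p)             # property (1): equals -v
p_sum = probability(v + logodds(0.8))    # property (2): combine by addition
```

For two evidences of 0.8 the summed logodds map back to 0.64/0.68, which is the value the Dempster combination gives for the same inputs.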

We illustrate the scoring of *σ* and *τ* sequences by an example: After the video is split into alternating *σ* and *τ* sequences (occluded *τ* sequences are marked by gray boxes in Figure 4), the *s* and *t* scores are computed. The system incorporates the size-based method siz1m as s-method and the point-based method boc as t-method (see Section 3.2).

For each unoccluded sequence *σ*, the two detected flies are arbitrarily named fly *A* and fly *B*, and the s-method siz1m, a variant of the sign test that provides good approximations for small sample sizes, computes the probability *p* that A is the bigger fly. The logodds *s* are derived from *p* and assigned to each unoccluded sequence (see cyan values in Figure 4a). Positive values indicate that fly A is the bigger fly, negative values that B is assumed to be the bigger fly.

For each *occluded* *τ* sequence, the probability that fly A *before* the occlusion remains fly A *after* the occlusion is derived from a t-method. Method boc carries identifier information through the occlusion (*cf*. point-based methods in Section 3.2 for a more detailed description). Again, the logodds are computed, and the resulting *t* values are assigned to the occluded sequences (see brown values in Figure 4a). While *s* values correspond to the probability that fly A is the smaller fly (in Figure 4 written as ‘A is male’), *t* values correspond to the probability that fly A in the *σ* sequence *before* the occlusion corresponds to fly A in the *σ* sequence *after* the occlusion. The (potentially artificial) *τ* sequences at the beginning and the end of $V_{0}$ are assigned *t* values of 0.

Having all *s* values and *t* values in place, the occlusion resolution problem may now be treated as an optimization problem. The proposed optimization algorithm (Section 3.4) uses a *dynamic programming* approach to compute the most plausible identity assignment by maximizing $\sum s+\sum t$ under a *flip operation*.

A *flip operation* affects two occluded sequences $\tau_{i}$ and $\tau_{j}$ and all unoccluded sequences between them. Most importantly, it does *not* affect the identities in sequences *outside* these two occlusions: all sequences before $\tau_{i}$ and after $\tau_{j}$ remain unchanged.

Figure 4b depicts the flow of identifiers in Figure 4a after a flip operation between the two occlusions drawn as gray boxes. In the first occlusion, identities of the flies are swapped, which results in swapped identifiers in the sequence in the middle as well, and in the second occlusion, identities are swapped back, making ‘flip’ a *local* operation only.

Swapping identities mathematically corresponds to inverting the sign of *s* and *t* values.

In Figure 4 the flipped identities in Figure 4b have a total value $\sum s+\sum t=3.5$ and are therefore more plausible than identities in Figure 4a with $\sum s+\sum t=0.5$.

Formally, this flip operation is defined as follows:

**Definition** **5**. Let $\mathcal{S}$ and $\mathcal{T}$ be sequences of *s* values and *t* values associated with sequences *σ* and *τ*, such that *s*_{i}∈$\mathcal{S}$ denotes the *s* value for *σ*_{i} and *t*_{i}∈$\mathcal{T}$ denotes the *t* value for *τ*_{i}. The operation flip(*i*,*j*) on $\mathcal{S}$ and $\mathcal{T}$, defined as the function $(\mathcal{S},\mathcal{T})^{\prime}=\text{flip}(i,j,\mathcal{S},\mathcal{T})$, inverts the signs of *t*_{i} and *t*_{j} and of all *s*_{k}, *i*≤*k*<*j*, in between them.
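The flip operation can be illustrated with a short Python sketch. It assumes a 0-based layout in which `t[k]` belongs to the occlusion directly before the unoccluded sequence with value `s[k]`, so that `len(t) == len(s) + 1`; this layout and the function name `flip` are illustrative choices and not prescribed by the definition.

```python
from typing import List, Tuple

def flip(i: int, j: int, s: List[float], t: List[float]) -> Tuple[List[float], List[float]]:
    """Return copies of (s, t) after flip(i, j).

    Assumed layout (0-based): t[k] is the occlusion right before the
    unoccluded sequence s[k], hence len(t) == len(s) + 1.
    """
    lo, hi = min(i, j), max(i, j)
    s2, t2 = list(s), list(t)
    t2[i] = -t2[i]          # first affected occlusion
    t2[j] = -t2[j]          # second affected occlusion (undoes the first if i == j)
    for k in range(lo, hi):
        s2[k] = -s2[k]      # every unoccluded sequence between the two occlusions
    return s2, t2
```

For example, with `s = [2.0, -3.0, 1.0]` and `t = [0.0, 1.0, -0.5, 0.0]`, calling `flip(1, 2, s, t)` changes only `s[1]`, `t[1]` and `t[2]`, and the total $\sum s+\sum t$ rises from 0.5 to 5.5.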

This flip operation comes with a number of desirable mathematical properties. It is obviously commutative and associative.

**Definition** **6**. Let flip (*i*,*j*) and flip (*k*,*l*) be flip operations on *V*. The combined operation of both flips is denoted as flip (*i*,*j*)∪flip(*k*,*l*).

Since *flip* is commutative, the order of resolving the underlying individual operations does not matter. Flip is obviously semi-idempotent, $\text{flip}(i,j)\cup \text{flip}(i,j)=\varnothing $, and therefore concatenable, $\text{flip}(i,k)\cup \text{flip}(k,j)=\text{flip}(i,j)$, since $\text{flip}(k,k)\cup \text{flip}(k,k)=\varnothing $.

**Lemma** **1**. *Let* *i*,*j*,*k*,*l* *be indices with* *i*≤*j*≤*k*≤*l*. *Then* flip(*i*,*k*)∪flip(*j*,*l*)=flip(*i*,*j*)∪flip(*k*,*l*).

*Proof*. flip (*i*,*k*)∪flip(*j*,*l*)= (concatenable)

(flip (*i*,*j*)∪flip(*j*,*k*))∪(flip(*j*,*k*)∪flip(*k*,*l*)) = (associative)

flip(*i*,*j*)∪(flip(*j*,*k*)∪flip(*j*,*k*))∪flip(*k*,*l*)= (semi-idempotent)

flip(*i*,*j*)∪flip(*k*,*l*)
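The lemma can also be checked by brute force on small instances. The sketch below applies sets of flips to all-positive sign vectors and compares the results; the 0-based layout (occlusion `k` directly before unoccluded sequence `k`) and the helper names are assumptions made for illustration.

```python
from itertools import combinations_with_replacement

def apply_flips(flips, n):
    """Apply a list of flip(i, j) operations to all-positive signs.

    Returns (s_signs, t_signs) for n unoccluded and n + 1 occluded sequences.
    """
    s = [1] * n
    t = [1] * (n + 1)
    for i, j in flips:
        t[i] = -t[i]
        t[j] = -t[j]
        for k in range(min(i, j), max(i, j)):
            s[k] = -s[k]
    return s, t

def lemma_holds(n):
    """Check flip(i,k) U flip(j,l) == flip(i,j) U flip(k,l) for all i <= j <= k <= l."""
    for i, j, k, l in combinations_with_replacement(range(n + 1), 4):
        if apply_flips([(i, k), (j, l)], n) != apply_flips([(i, j), (k, l)], n):
            return False
    return True
```

Such a check is no substitute for the proof, but it makes the semi-idempotent and concatenable properties easy to experiment with.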

These properties of flip encourage the definition of a normal form ${\mathcal{F}}_{\perp}$ for a set of flip operations.

**Definition** **7**. Let *f* be a flip operation *f* = flip(*i*,*j*), $f\in \mathcal{F}$, let |*f*| denote the number of sequences affected by flip(*i*,*j*), and let $\left|\mathcal{F}\right|=\sum _{f\in \mathcal{F}}\left|f\right|$. Further, let ${\mathcal{V}}^{\prime}=\mathcal{F}\left(\mathcal{V}\right)$ denote the result of applying all flip operations in $\mathcal{F}$, and let $\mathfrak{F}$ be the infinite set of all flip operation sets that are equivalent to $\mathcal{F}$, $\mathfrak{F}=\{{\mathcal{F}}_{i}\mid {\mathcal{F}}_{i}(\mathcal{V})=\mathcal{F}(\mathcal{V})\}$. The normal form ${\mathcal{F}}_{\perp}$ of $\mathcal{F}$ is defined as the set of flip operations flip(*i*,*j*) with *i*<*j* that affects the smallest number of sequences but still delivers the same result, $\forall {\mathcal{F}}_{i},{\mathcal{F}}_{\perp}\in \mathfrak{F}:\left|{\mathcal{F}}_{\perp}\right|\le \left|{\mathcal{F}}_{i}\right|$.

**Corollary** **2**. *A normal form* ${\mathcal{F}}_{\perp}$ of $\mathcal{F}$ contains neither double-flip operations, $\text{flip}(i,j)\in {\mathcal{F}}_{\perp},\text{flip}(k,l)\in {\mathcal{F}}_{\perp}\to (i,j)\ne (k,l)$, nor flip overlaps that would contain double-flip operations, $\text{flip}(i,j)\in {\mathcal{F}}_{\perp},\text{flip}(k,l)\in {\mathcal{F}}_{\perp},i<l\to j<k$. The properties *i*<*j*, *k*<*l* and transitively *i*<*k* and *j*<*l* follow from the convention that *i*<*j* for all flip operations $\text{flip}(i,j)\in {\mathcal{F}}_{\perp}$.

**Corollary** **3**. *A normal form* ${\mathcal{F}}_{\perp}$ is sufficiently characterized by an ordered enumeration of all flip operation indices. A normal form ${\mathcal{F}}_{\perp}=\{\text{flip}(i,j),\text{flip}(k,l)\}$ may therefore be denoted as ${\mathcal{F}}_{\perp}=\{i,j,k,l\}$.

Every set of flip operations is transformable into its normal form ${\mathcal{F}}_{\perp}$ by elimination of double flips and flip overlaps and sorting of flip indices.
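One way to carry out this transformation is to count how often each index occurs as a flip endpoint: an index that occurs an even number of times cancels out, and the remaining indices, sorted and paired consecutively, give the flips of the normal form. The following sketch relies on that parity argument; the function names are hypothetical.

```python
from collections import Counter
from typing import Iterable, List, Tuple

def normal_form(flips: Iterable[Tuple[int, int]]) -> List[int]:
    """Sorted flip indices that characterize the normal form.

    Each flip(i, j) toggles occlusions i and j; an index toggled an even
    number of times cancels out, so only odd-count indices survive.
    """
    counts: Counter = Counter()
    for i, j in flips:
        counts[i] += 1
        counts[j] += 1
    return sorted(idx for idx, c in counts.items() if c % 2 == 1)

def as_flips(indices: List[int]) -> List[Tuple[int, int]]:
    """Pair consecutive indices: [i, j, k, l] -> [(i, j), (k, l)]."""
    return list(zip(indices[0::2], indices[1::2]))
```

For instance, the overlapping set {flip(0,2), flip(1,3)} is reduced to the indices 0, 1, 2, 3, that is, to flip(0,1) and flip(2,3), which is the situation covered by Lemma 1.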

### 3.4 Solving occlusions by dynamic programming

This section introduces an algorithm that solves the optimization problem modelled in the previous section using a dynamic programming approach that results in a generalization of the Viterbi algorithm.

The proposed optimization algorithm computes the most plausible identity assignment throughout the entire video by maximizing $\sum s+\sum t$ under the flip operation. Intuitively, this enables s-methods and t-methods to complement and correct each other, especially in cases where an s-method indicates certainty but a t-method does not or vice versa.

The algorithm exploits mathematical properties of the flip operations. When searching for optimal solutions it is sufficient to traverse normal forms of flip operations only. This reduces an infinite search space to an exponential search space. By sorting (commutative, non-overlapping) flip operations in ascending order, intermediate results for all flip operations up to a sequence *τ*_{k} may be reused. The dynamic programming approach therefore traverses the exponential search space within *linear time* and still guarantees to derive the shortest set of flip operations that is required to transform an arbitrary identifier initialization into the assignment with the highest global plausibility. This enables assignment of local identifiers for flies A_{i} and B_{i} to global identifiers 1 and 2 and to sort fly attributes according to global fly identifiers.

Algorithms 1, 2, 3 and 4 below provide a formal definition of the optimization approach.

#### Algorithm 1 **initialize**

#### Algorithm 2 **backtrack**

#### Algorithm 3 **bulkflip**

#### Algorithm 4 **optimizeAssignment**

1. A dynamic programming initialization step (see Algorithm 1), where *s* values and *t* values are traversed once to compute the cumulative scores *S* and *T*, such that *S*_{i,c} and *T*_{i,c} contain the best possible scores up to sequence *σ*_{i} resp. *τ*_{i−1} under the condition *c*=−1 (the current sequence is flipped and identifiers are swapped) or *c*=1 (they remain unchanged). This step exploits the mathematical properties of the flip operation in order to model the optimization problem as a dynamic programming problem instance. The cumulative score up to the first occluded sequence is initialized with *T*_{1,−1}=−*∞* and *T*_{1,+1}=0. This enforces fly 1 of the global assignment to fulfill the property of positive *s* values. The total cost of the global identity assignment is given in *T*_{n+1,1}.
2. A backtracking step (see Algorithm 2), where the chosen path that led to the assignment with the best score in *T*_{n+1,1} is reconstructed. This path determines the flip positions that sufficiently characterize ${\mathcal{F}}_{\perp}$, the desired smallest set of flip operations that transforms an arbitrary initialization into the optimal solution.
3. A bulk-flip step (see Algorithm 3), where the resulting flip operations in ${\mathcal{F}}_{\perp}$ are applied to the initially given *s* and *t* values in order to derive the flipped scores *s*^{′} and *t*^{′} of the optimal solution, $\sum {s}^{\prime}+\sum {t}^{\prime}={T}_{n+1,1}$.
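The three steps can be sketched as a two-state, Viterbi-style dynamic program. The code below is an illustrative reconstruction under stated assumptions and not a transcription of Algorithms 1 to 4: it uses a 0-based layout in which `t[k]` is the occlusion before the unoccluded sequence `s[k]` and `t[n]` is the trailing occlusion, and the name `optimize_assignment` is hypothetical.

```python
import math
from typing import List, Tuple

def optimize_assignment(s: List[float], t: List[float]) -> Tuple[float, List[int], List[int]]:
    """Maximize sum(s') + sum(t') over all flip sets (illustrative sketch).

    Returns (best_score, states, flip_indices); states[k] is +1 if sequence k
    keeps its identifiers and -1 if they are swapped.
    """
    n = len(s)
    assert len(t) == n + 1
    STATES = (+1, -1)
    # Initialization: before the first occlusion only the unflipped state is allowed.
    best = {+1: 0.0, -1: -math.inf}
    back = []  # back[k][c] = predecessor state of state c at sequence k
    for k in range(n):
        new_best, choice = {}, {}
        for c in STATES:
            # Crossing occlusion k contributes +t[k] if the state is kept and
            # -t[k] if it changes; sequence k itself contributes c * s[k].
            cands = {p: best[p] + p * c * t[k] + c * s[k] for p in STATES}
            choice[c] = max(cands, key=cands.get)
            new_best[c] = cands[choice[c]]
        best = new_best
        back.append(choice)
    # The path has to end unflipped after the trailing occlusion t[n].
    final = {p: best[p] + p * t[n] for p in STATES}
    last = max(final, key=final.get)
    total = final[last]
    # Backtracking: recover the state of every unoccluded sequence.
    states = [0] * n
    c = last
    for k in range(n - 1, -1, -1):
        states[k] = c
        c = back[k][c]
    # Flip positions are the occlusions at which the state changes.
    padded = [+1] + states + [+1]
    flips = [k for k in range(n + 1) if padded[k] != padded[k + 1]]
    return total, states, flips
```

Each sequence is visited once with a constant amount of work, which is where the linear running time comes from. The returned flip indices form the ordered enumeration that characterizes a normal form, and multiplying each `s[k]` by `states[k]` yields the flipped scores of the bulk-flip step.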

The algorithm result is applied by swapping fly objects within the time series data. For each sequence *σ*_{i}, a value $\text{swap}_{i}=\frac{{s}_{i}^{\prime}}{{s}_{i}}$ with swap_{i}∈{−1,+1} may be computed; if swap_{i}=−1, the identifiers for sequence *σ*_{i} have to be swapped.

Finally, three minor improvements are suggested: (1) All *s* and *t* values *v* that are 0 are replaced by *v*=*ε*, where *ε* is the smallest representable floating point number that can carry a sign. This replacement does not affect the algorithm result but instead keeps track of all signs for *s* and *t* values and guarantees that all divisions are defined. (2) The maximum impact of a single *s* or *t* value should be limited; the current implementation guarantees for machine-generated *s* or *t* values *v* that *ε*≤|*v*|≤*Ω* with *Ω*=20. (3) The bulk-flip step may optionally be simplified to compute and return only *k* instead of *s*^{′} and *t*^{′}, since *k*_{i} is equivalent to swap_{i}.
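Improvements (1) and (2) and the swap value can be sketched in a few lines of Python. The choice of `sys.float_info.min` for *ε* is an assumption (it is the smallest positive normalized double; the implementation described here may use a different constant), and the helper names are hypothetical.

```python
import sys

EPSILON = sys.float_info.min   # assumption: smallest positive normalized float
OMEGA = 20.0                   # cap on |v| as stated in the text

def sanitize(v: float) -> float:
    """Replace 0 by EPSILON and clamp the magnitude to [EPSILON, OMEGA]."""
    if v == 0.0:
        return EPSILON
    magnitude = min(max(abs(v), EPSILON), OMEGA)
    return magnitude if v > 0 else -magnitude

def swap_flags(s, s_flipped):
    """swap_i = s'_i / s_i; -1 means the identifiers of sequence i are swapped."""
    return [int(round(a / b)) for a, b in zip(s_flipped, s)]
```

Because no sanitized value is 0, the division in `swap_flags` is always defined, and a value of *ε* still records whether its sequence was flipped.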

The algorithm runs in linear time $\mathcal{O}\left(m\right)$ with regard to the total number of sequences *m*=|*Φ*_{0}|=2|{*τ*_{i}}|−1 and is fast enough to be computed in real time. Manually overruled *τ*_{i} sequences are assigned a *t* value of *T*_MAX=*Ω*·*m*+1 such that machine decisions cannot vote them down and the most plausible global assignment is adapted accordingly.
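Why this value is large enough follows from a short estimate, sketched here under the assumption that every machine-generated value *v*_{k} satisfies |*v*_{k}|≤*Ω* as stated above:

```latex
\sum_{k \ne i} |v_k| \;\le\; \Omega\,(m-1) \;<\; \Omega\,m + 1 \;=\; T\_\mathrm{MAX}
```

Contradicting a manually overruled *τ*_{i} would lower the total by 2·*T*_MAX, whereas re-signing all other values can raise it by at most 2*Ω*(*m*−1), so an assignment that respects the manual decision always scores higher.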

For occluded scenes, a revised certainty value *c*_{i} that reflects the global confidence of the algorithm may optionally be computed. This revised value consists of the known local certainty *t*_{i} and a global certainty value *ΔT* that is computed as a difference between global assignment costs. The algorithm computes the cost to derive an assignment ${T}_{n+1,1}^{\prime}$, where ${t}_{i}^{\prime}$ is guaranteed to be set in the opposite direction, ${t}_{i}^{\prime}=-T\text{\_MAX}\cdot \text{sign}\left({t}_{i}\right)$, and computes $\Delta T=({T}_{n+1,1}-{t}_{i})-({T}_{n+1,1}^{\prime}-{t}_{i}^{\prime})$. The total confidence *c*_{i} of the combined certainty values *t*_{i} and *ΔT* can be expressed as a probability measure. The optional computation of all confidence values runs in quadratic time $\mathcal{O}\left({m}^{2}\right)$.

### 3.5 Experimental results

During our project, we processed more than ten thousand multi-chamber videos containing more than a billion single-chamber frames. Our occlusion methods were tested on 8 randomly selected *Drosophila* courtship videos, each containing 11 chambers with male-female pairs of the same genotype. Each chamber had a diameter of 1 cm and was covered by an anti-reflecting glass plate on top. Videos were recorded from the top at 25 frames per second.

The chamber videos were preprocessed, tracked, postprocessed and manually annotated to establish a ground truth. From our 88 original chambers, 5 were rejected by the preprocessor (wrong number of flies) or due to lack of manual annotation. The remaining 83 chambers contained 8,421 occlusions and 610,919 frames of two-fly behaviour before copulation.

The identity assignment during *σ* sequences using the Hungarian algorithm turned out to be extremely reliable. We identified potential problems when flies jump (rapidly move to a random new destination within one frame); we therefore specifically detect such jump events and treat them like occlusions. In particular, identities in sequences before and after the jump event are independently assigned using the Hungarian algorithm, and global identities are then assigned using our global occlusion resolvement methods. However, if two flies jumped exactly to each other's place within a single frame, this would trick the jump detector and result in an assignment error within the *σ* sequence. We recorded videos at 25 frames per second and noticed only two such errors during the entire project, which involved tracking about one billion frames. We did not further quantify this error rate due to its rarity, and note that recording at higher frame rates would further decrease the error potential.
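For two flies, the assignment problem that the Hungarian algorithm solves has only two candidate solutions, so comparing them directly gives the same optimum. The sketch below assumes the summed Euclidean distance between consecutive positions as the cost, which is an illustrative choice; the function name `assign_two` is hypothetical.

```python
import math

def assign_two(prev, cur):
    """Match two previous positions to two current positions.

    Returns (0, 1) if identities are kept and (1, 0) if they are swapped,
    whichever minimizes the summed Euclidean distance (assumed cost).
    """
    keep = math.dist(prev[0], cur[0]) + math.dist(prev[1], cur[1])
    swap = math.dist(prev[0], cur[1]) + math.dist(prev[1], cur[0])
    return (0, 1) if keep <= swap else (1, 0)
```

The rare failure case described above is visible here: if both flies land exactly on each other's previous position, the "keep" cost is 0 and the exchange goes unnoticed.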

*σ* and *τ* sequences. The first column contains a sequence identifier; the second column, the length of each sequence in frames. The following three columns contain *s* values, respectively *t* values, of s- and t-methods siz1m (deciding based on size differences), posm (deciding based on fly positions) and bocT (deciding based on identifier-containing point sets that are ‘carried through’ an occlusion).

The remaining six columns contain identifier assignment results produced by different methods. The first three columns contain decisions of local methods only: They are combined with nothing but ‘zeros’ and therefore analyzed individually for their assignment decisions. For the last three columns, methods were combined with each other. Each assignment entry contains a <*value*> and a <*decision*> (separated by a semicolon); the <*value*> is the *s* or *t* value associated with the given <*decision*> identifier assignment. Entries with correct <*decision*>s are colored in green; incorrect assignments are bold and in red. This implicitly encodes the ground truth.

The first example depicts the typical error pattern of local t-methods. The occlusion in *τ*_{41.56} is wrongly resolved by methods posm and bocT. Method posm therefore mis-assigns identities for its following *σ* sequence *σ*_{41.57} *and all* *σ* *sequences thereafter*, up to the end of the video or another mis-assignment. Apparently, the bocT method already had a mis-assignment before *τ*_{41.56} since identifiers were swapped in *σ*_{41.55} and *σ*_{41.56} before occlusion *τ*_{41.56}. Due to the second mis-assignment, the identifiers are swapped back and result in correct *σ* sequences after the second mis-assignment.

Having the last three columns in green shows that all three of our combining methods are capable of rescuing this case. The main reason for the combined methods' success is their fundamentally different error pattern. Since combined methods involve both s- and t-methods, a t-method failure may still result in swapped identifiers, but they are typically swapped back immediately since it is not plausible to swap too many *σ* sequences despite continuous negative evidence coming from the s-method. Examples three and four depict such ‘double errors’ that are typical for combined methods.

We further want to discuss the robustness of combined methods by examining column *siz1m*:*bocT* in the first example. Although method siz1m assigns the score of *ε* (don’t know) in *σ*_{41.57} right after the sequence that bocT would get wrong and although bocT assigns *ε* in *τ*_{41.55}, the occlusion right before the troubled occlusion, the combined method still gets the whole assignment right. How is that possible?

In order to mis-assign *τ*_{41.56} according to the evidence coming from bocT, the combined method *siz1m*:*bocT* would have to do a *double*-error. The two most obvious options for that would be to either perform a flip (45.55,45.56) or a flip (45.56,45.57) operation. However, the costs for flip (45.55,45.56) are less attractive than for the no-flip case, (−(*ε*)−9.70+0.77)<(*ε*+9.70−0.77). Obviously, the high confidence of siz1m in *σ*_{41.56} makes this option unattractive, and similarly for flip (45.56,45.57), (0.77 − *ε* − 2.63) < (−0.77 + *ε* + 2.63). In this case the higher score for *τ*_{41.57} coming from method bocT itself makes the difference. Flipping even longer sequences, e.g. flip (45.56,45.58), would be even less attractive for the algorithm. The most plausible identifier assignment is determined correctly - despite wrong evidence coming from bocT in *τ*_{41.56} and two proximate *ε* values in *τ*_{41.55} and *σ*_{41.57}.

The second example depicts a similar case; this time method siz1m mis-assigns *σ*_{45.149}, but methods posm and bocT both get this case right. Again, all combined methods come up with the correct assignment as the combined evidence coming from posm or bocT is stronger than the misleading evidence from siz1m.

In the third example, in sequence *σ*_{45.49}, method siz1m is wrong and rather confident about it. In this case both combined methods *siz1m*:*posm* and *siz1m*:*bocT* would fail too; however, method *siz1m*:*posm*,*bocT*, which uses the stronger Dempster-combined evidence from posm *and* bocT, is still capable of coming up with the correct assignment.

The last example shows a case where *siz1m*:*posm*,*bocT* is wrong. Although methods siz1m, *siz1m*:*posm* and *siz1m*:*bocT* would solve the case correctly, the combined wrong evidence of posm and bocT outweighs the value coming from siz1m. The video frames of this error instance are depicted in Figure 3d.

<*s-method*>:<*t-method*>. Again, a combination with zeros is used to quantify local methods only. All evaluated methods are compared according to two quality measures: (a) the percentage of correct assignments of identifiers before an occlusion to identifiers after that occlusion and (b) the percentage of correctly assigned frames in unoccluded sequences.

**Table 1 Performance summary**

| Row | s-method : t-method | Correct τ (%) | Correct σ-frames (%) |
|---|---|---|---|
| 1 | zeros : posm | 94.28 | 51.73 |
| 2 | zeros : bocT | 91.73 | 53.90 |
| 3 | zeros : CART_{O} | 99.75 | 87.32 |
| 4 | zeros : CART_{C} | 98.96 | 73.94 |
| 5 | siz1m : zeros | 93.37 | 99.74 |
| 6 | siz1m : posm | 98.86 | 99.88 |
| 7 | siz1m : bocT | 99.17 | 99.95 |
| 8 | siz1m : posm, bocT | 99.55 | 99.97 |
| 9 | siz1m : posm∼ | 99.47 | 99.95 |
| 10 | siz1m : posm∼, bocT∼ | 99.62 | 99.97 |
| 11 | siz1m : CART_{O} | 99.9169 | 99.99 |
| 12 | siz1m : CART_{C} | 99.1331 | 99.92 |

Methods in Table 1 rows 1 to 4 involve t-methods only. Aside from the local methods posm and bocT, we further evaluated the meta-method CART, a machine learning method that uses a classification tree to come up with an assignment based on multiple *t* values and occlusion properties like an occlusion's length or its maximum overlap. Note that CART is still a t-method as it combines multiple t-methods. We provide results for overfitted CART_{O} and cross-validated CART_{C}, where 10-fold cross-validation was applied.

Although the local methods resolve 94.28% and 91.73% of occlusions correctly, they get only 51.73% and 53.90%, respectively, of unoccluded frames correct. This is due to the error propagation problem that is outlined in the first example of Figure 5. As expected, the CART approach alone cannot overcome this problem. Although combined *t* scores lead to a highly improved occlusion accuracy, the t-intrinsic error pattern still leads to low frame accuracy.

In row 5 the size-based s-method is evaluated. Although it comes with similar occlusion accuracy as the t-methods, its frame accuracy is highly improved. This is because incorporation of s-methods leads to double-error patterns where wrong occlusion assignments are immediately swapped back. Therefore, such methods typically get only single *σ* sequences wrong.

Rows 6 to 12 contain performance values for combined methods. In rows 6 to 8, methods *siz1m*:*posm*, *siz1m*:*bocT* and *siz1m*:*posm*,*bocT* show the impact of the dynamic programming approach on the combined methods' performance. As shown in the examples in Figure 5 above, s-methods and t-methods complement and correct each other. In addition to their double-error patterns that minimize the number of mis-assigned *σ* sequences, combined methods further minimize the length of mis-assigned *σ* sequences. The *s* value coming from siz1m is designed to depend on the lengths of observable *σ* sequences, such that long sequences (on which the method performs well) are given high scores and very short sequences (where the method sometimes is wrong) are given low scores. Typical error patterns for combined methods are therefore double errors that contain single *short* sequences, typically consisting of one or two frames, which explains the high frame accuracy of these combined methods.

We further evaluated the combination of methods of the same type using the Dempster combination [17, 18], and it turned out that the use of the Dempster-combined t-method *posm*,*bocT* in row 8 slightly outperformed the simpler dynamic programming combinations in rows 6 and 7.
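For reference, Dempster's rule of combination on a two-hypothesis frame can be sketched as follows. The frame {same, swapped}, the explicit 'unknown' mass on the whole frame and the function name are assumptions made for illustration; how the evaluated methods translate their scores into mass functions is not specified here.

```python
def dempster_combine(m1, m2):
    """Dempster's rule on the frame {same, swapped}.

    Each mass function is a dict with keys 'same', 'swapped' and 'unknown'
    (mass on the whole frame); values are non-negative and sum to 1.
    """
    # Conflict: one source supports 'same' while the other supports 'swapped'.
    conflict = m1["same"] * m2["swapped"] + m1["swapped"] * m2["same"]
    if conflict >= 1.0:
        raise ValueError("total conflict: the sources cannot be combined")
    norm = 1.0 - conflict
    combined = {}
    for h in ("same", "swapped"):
        combined[h] = (m1[h] * m2[h] + m1[h] * m2["unknown"]
                       + m1["unknown"] * m2[h]) / norm
    combined["unknown"] = m1["unknown"] * m2["unknown"] / norm
    return combined
```

Two sources that agree reinforce each other, while the mass left on 'unknown' shrinks with every additional source.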

Finally, the methods evaluated in rows 9 to 12 turned out to result in little or no improvement. In Section 3.3 we mention that probability values are derived from method scores using a linear approximation. In rows 9 and 10, we evaluate combinations with methods posm∼ and bocT∼, where nonlinear approximations are used to derive more precise probability values. However, it turned out that moving from *siz1m*:*posm*,*bocT* to *siz1m*:*posm*∼,*bocT*∼ corrected only nine more frames.

When combining s-methods with CART methods, it turns out that the overfitted method *siz1m*:CART_{O} outperforms all other methods; however, the cross-validated method *siz1m*:CART_{C} shows a decrease in performance. This is mostly because the CART method typically returns very confident scores that are difficult for other methods to correct. The method CART_{C} is a good example of a t-method that outperforms other t-methods in occlusion accuracy (*cf*. Table 1: rows 1, 2 and 4), but still is outperformed in terms of frame accuracy due to a lack of combinability (*cf*. Table 1: rows 8 and 12).

## 4 Other application: resolving heading

A fly body or an ellipse covering a fly body has two *ends* A and B where the fly's axis crosses the fly's perimeter. Resolving the heading problem means finding out whether end A or end B is the fly's head.

Fortunately, there are several sources of evidence from the flies’ anatomy and behaviour. First, flies typically walk in a forward direction. The movement direction of a fly may be used to predict at which side to find the head. Secondly, the flies’ wings typically point backwards. Therefore, the vector from Centroid to wCentroid may be used as a second independent predictor. Finally, the head typically does not flip by 180° within a single frame.

Interestingly, the heading problem may be reduced to the occlusion problem described in Section 3, and the optimization algorithm proposed in Section 3.4 may be applied to solve the heading problem as well.

The idea is to model every single frame as a *σ* sequence and ‘artificial gaps’ between frames as *τ* sequences. The evidences from movement and wings are incorporated as s-methods (again s-methods have to operate on attributes that are comparable through the entire video) and a known persistence constraint is incorporated as a t-method. The computed *s* values correspond to probabilities for point A being the head of the fly, and *t* values correspond to probabilities that point A in the frame before *τ* is again point A in the frame after *τ*.

In order to resolve the heading it is sufficient to define the s- and t-methods that incorporate movement, wing anatomy and persistence evidences, and then re-use the very same algorithm and framework as for occlusions.

For this purpose, the coordinates of the two endpoints *A* and *B* (after heading assignment called *Head* and *Tail*) and the centroids of the body and the wing regions, *C* and *W*, are determined for every frame.

**Definition** **8**. Let *X*_{i} denote the value of point *X* in frame *i* and $\overline{XY}$ denote the Euclidean distance between points *X* and *Y*. The *s* score score_{move} is defined as $\text{score}_{\text{move}}=\frac{\overline{{A}_{i}{C}_{i-1}}-\overline{{B}_{i}{C}_{i-1}}}{\overline{{A}_{i}{C}_{i-1}}+\overline{{B}_{i}{C}_{i-1}}}$.

**Definition** **9**. The *s* score score_{wing} is defined as $\text{score}_{\text{wing}}=\frac{\overline{AW}-\overline{BW}}{\overline{AW}+\overline{BW}}$.

**Definition** **10**. The *combined s score* is defined as score _{move⊗wing}= max(score_{move},score_{wing}).

The score score_{move} is positive whenever the fly moved in the A-direction rather than in the B-direction; the score score_{wing} is positive when A is farther from the centroid of the wing region than B, which is expected for the head because the wings point backwards. Note that −1≤score_{move}≤+1 and −1≤score_{wing}≤+1. Both scores are combined by a simple maximum aggregation. From the combined score score_{move⊗wing}, probability approximations and finally logodd values *s* may be derived as in the occlusion case.
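Definitions 8 to 10 translate directly into code. The sketch below represents points as coordinate tuples and returns 0 when both distances vanish; that guard and all function names are assumptions made for illustration.

```python
import math

def _contrast(d_a: float, d_b: float) -> float:
    """(d_a - d_b) / (d_a + d_b); defined as 0 if both distances vanish (assumption)."""
    total = d_a + d_b
    return 0.0 if total == 0.0 else (d_a - d_b) / total

def score_move(a, b, c_prev):
    """Definition 8: compares the ends A_i and B_i with the previous centroid C_{i-1}."""
    return _contrast(math.dist(a, c_prev), math.dist(b, c_prev))

def score_wing(a, b, w):
    """Definition 9: compares the ends A and B with the wing centroid W."""
    return _contrast(math.dist(a, w), math.dist(b, w))

def score_move_wing(a, b, c_prev, w):
    """Definition 10: simple maximum aggregation of both scores."""
    return max(score_move(a, b, c_prev), score_wing(a, b, w))
```

With A = (3, 0), B = (1, 0), previous centroid (1, 0) and wing centroid (1.5, 0), the movement score is 1.0 and the wing score is 0.5, so the combined score is 1.0; under these definitions a positive value favours end A as the head.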

**Definition** **11**. The *t* score score_{persist} is defined as $\text{score}_{\text{persist}}=\frac{\overline{AB}+\overline{BA}-\overline{AA}-\overline{BB}}{\overline{AB}+\overline{BA}+\overline{AA}+\overline{BB}}\cdot \text{Eccentricity}$.

The persistence score score_{persist} satisfies −Eccentricity≤score_{persist}≤+Eccentricity, with 0≤Eccentricity≤1. Rescaling the score by the Eccentricity attribute ensures low persistence scores when flies ‘turn upwards’ and thus appear round. From such a position, flies may abruptly change their heading via the *z*-direction.
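The persistence score can be sketched in the same style. The code assumes that each distance in Definition 11 is taken between an endpoint in the previous frame and an endpoint in the current frame (the definition does not spell out the frame indices), and it returns 0 if all four distances vanish; both points are assumptions, as is the function name.

```python
import math

def score_persist(a_prev, b_prev, a_cur, b_cur, eccentricity):
    """Definition 11, reading XY as the distance from X (previous frame) to Y (current frame)."""
    ab = math.dist(a_prev, b_cur)
    ba = math.dist(b_prev, a_cur)
    aa = math.dist(a_prev, a_cur)
    bb = math.dist(b_prev, b_cur)
    total = ab + ba + aa + bb
    if total == 0.0:
        return 0.0  # degenerate case, not covered by the definition
    return (ab + ba - aa - bb) / total * eccentricity
```

If both endpoints stay in place, the score equals +Eccentricity; if the labels A and B are exchanged between frames, it equals −Eccentricity.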

The optimization algorithm (see Algorithm 4) will compute the most plausible *heading* assignment for all video frames by maximizing $\sum s+\sum t$ under the flip operation introduced in Section 3.3.

For the heading case, the linear time property of the algorithm is essential since heading is typically computed for 15,000+ frames, and other algorithms, e.g. quadratic ones, would already become impractical for these problem instances.

A performance evaluation on 42,870 manually annotated heading situations resulted in 99.2% correct heading assignments. This number is consistent with the ‘occlusion accuracy’ quality measure that we observed for occlusion problem instances in Table 1 (middle column). The other quality measure in that table is not applicable to heading problem instances.

Typical heading error instances are sequences where flies actually *do* walk backwards for a longer time, e.g. due to a series of evasive maneuvers.

## 5 Conclusions

### 5.1 Summary and discussion

*Drosophila* courtship behaviour and its sub-behaviours (*cf*. Figure 7). The software includes tools for result inspection and bulk submission of videos to a computer cluster.

The identification of the two flies through the entire video was essential for the detection and assignment of biologically relevant events, and resolving occlusions with the highest possible accuracy turned out to be of particular importance. Since manual correction of occlusion assignments requires a lot of user interaction (and user attention), some effort was invested to come up with an automated solution for that problem (see Section 3).

Section 3.2 introduced different approaches for solving single occlusions. These methods are called t-methods or *local* methods as they focus on individual occlusions without further consideration of their context. Each local method suggests an identifier assignment for each occlusion case and gives a certainty score for its decision. Decisions and scores of multiple local methods may be combined in a meta-method; machine learning based meta-methods may further include observable features further characterizing occlusions.

An intrinsic problem of t-methods or combinations of them comes with their limited local perspective, which causes mis-assignments to be propagated until the end of the video. Section 3.3 introduces an optimization approach that incorporates context information. It complements t-methods by s-methods that are based on characteristics of unoccluded sequences that are comparable during the whole video. The certainty scores of s- and t-methods are turned into logodd values that are comparable to each other, such that the most plausible identity assignment for the entire video is achieved by maximizing the sum of these logodd values under a flip operation (*cf*. Section 3.3). The optimization algorithm introduced in Section 3.4 solves that optimization problem in linear time.

Within a test set of 11 male-female courtship videos with manually annotated ground truth, the introduced local methods scored 90% to 95% of the cases correctly, the combining meta-methods correctly assigned about 99% and the optimization approach assigned up to 99.62% of occlusions correctly. When it comes to correctly assigned identities in unoccluded frames, naive meta-methods suffer from the error propagation problem while the optimization approach improves its accuracy to 99.97%. In other words, from about 6 h and 45 min of the unoccluded video frames, the frames with wrong identity assignment are together about 7 s.
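The closing figure can be reproduced with a few lines of arithmetic. The sketch uses only numbers stated in this paper (25 frames per second, about 6 h 45 min of unoccluded video, 99.97% frame accuracy); the variable names are illustrative.

```python
FPS = 25                                   # frame rate stated in Section 3.5
unoccluded_seconds = 6 * 3600 + 45 * 60    # "about 6 h and 45 min"
frame_accuracy = 0.9997                    # best frame accuracy quoted above

# Share of the unoccluded footage that carries a wrong identity assignment.
wrong_seconds = unoccluded_seconds * (1.0 - frame_accuracy)
wrong_frames = wrong_seconds * FPS
print(f"{wrong_seconds:.1f} s, about {wrong_frames:.0f} frames")
```

This gives roughly 7.3 s, in line with the "about 7 s" quoted above.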

These results are achieved as the algorithm implicitly minimizes the number of mis-assigned unoccluded frames. This property is inherited from the approximation of the sign test that is used as s-method.

Further, the algorithm ‘self-corrects’ its mistakes. Potential errors typically occur *pair-wise*, one real error immediately followed by a second one that compensates the first error, as it is *not* plausible that identities are wrong from a given point to the end of the video. Wrongly assigned identities are therefore a *local* problem and do not affect the rest of the video.

These two error patterns, pair-wise errors ensuring *local* mis-assignments only and *short* non-occluded sequences as potential error domains, explain the low number of unoccluded frames that are mis-identified and result in desirable properties of the algorithm.

On many occasions, users accidentally *downgraded* the quality of ground truth that was previously automatically pre-annotated. This suggests that the automated occlusion resolvement method may, in many cases, be more reliable than a human annotator. However, the automatically derived occlusion assignments may still be manually inspected and overruled, and an annotating user may sort occlusions by machine-given confidence values.

Another desirable property of the algorithm is that it runs very efficiently (in linear time $\mathcal{O}(|{\Phi}_{0}|)$). This enables its application to large problem instances. Section 4 describes how heading assignments can be modelled as an instance of the very same optimization problem. The very same optimization algorithm then derives the most plausible heading assignment for every frame.

### 5.2 Future work

We have a system that identifies flies and extracts various attributes; we implemented more than 1,000 automatically observable shape and constellation descriptors. The system currently comes with classifiers for courtship behaviour and its sub-behaviours that allow observed behaviours to be visualized as automatically generated ethograms.

Figure 7 depicts an ethogram that visualizes courtship events for wild-type males and females. Our current courtship classifiers are sex-specific (compare left vs. right) and deliver expected results for known mutants (*cf*. [10]).

We aim to support definition and training for classifiers that capture further meaningful behaviours.

The overall system is currently being transcoded from Matlab to C++ and is optimized for performance such that it runs on a standard laptop within a reasonable computation time.

## References

1. Baker BS, Taylor BJ, Hall JC: Are complex behaviors specified by dedicated regulatory genes? Reasoning from *Drosophila*. *Cell* 2001, 105: 13-24. doi:10.1016/S0092-8674(01)00293-8
2. Dickson BJ: Wired for sex: the neurobiology of *Drosophila* mating decisions. *Science* 2008, 322: 904-909. doi:10.1126/science.1159276
3. Pierce-Shimomura JT, Dores M, Lockery SR: Analysis of the effects of turning bias on chemotaxis in *C. elegans*. *J Exp Biol* 2005, 208(Pt 24):4727-4733.
4. Cronin CJ, Mendel JE, Mukhtar S, Kim YM, Stirbl RC, Bruck J, Sternberg PW: An automated system for measuring parameters of nematode sinusoidal movement. *BMC Genet* 2005, 6(1):5. doi:10.1186/1471-2156-6-5
5. Feng Z, Cronin CJ, Wittig Jr JH, Sternberg PW, Schafer WR: An imaging system for standardized quantitative analysis of *C. elegans* behavior. *BMC Bioinformatics* 2004, 5: 115. doi:10.1186/1471-2105-5-115
6. Baek JH, Cosman P, Feng Z, Silver J, Schafer WR: Using machine vision to analyze and classify *Caenorhabditis elegans* behavioral phenotypes quantitatively. *J Neurosci. Methods* 2002, 118(1):9-21. doi:10.1016/S0165-0270(02)00117-6
7. Martin JR: A portrait of locomotor behaviour in *Drosophila* determined by a video-tracking paradigm. *Behav Processes* 2004, 67(2):207-219. doi:10.1016/j.beproc.2004.04.003
8. Dankert H, Wang L, Hoopfer ED, Anderson DJ, Perona P: Automated monitoring and analysis of social behavior in *Drosophila*. *Nature Methods* 2009, 6: 297-303. doi:10.1038/nmeth.1310
9. Branson K, Robie AA, Bender J, Perona P, Dickinson MH: High-throughput ethomics in large groups of *Drosophila*. *Nature Methods* 2009, 6: 451-457. doi:10.1038/nmeth.1328
10. Schusterreiter C: Computational analysis of *Drosophila* courtship behaviour. *Thesis, University of Vienna* 2011.
11. Simon JC, Dickinson MH: A new chamber for studying the behavior of *Drosophila*. *PLoS One* 2010, 5(1):e8793. doi:10.1371/journal.pone.0008793
12. Hoyer SC, Eckart A, Herrel A, Zars T, Fischer SA, Hardie SL, Heisenberg M: Octopamine in male aggression of *Drosophila*. *Curr. Biol* 2008, 18(3):159-167. doi:10.1016/j.cub.2007.12.052
13. Ramdya PP, Schaffter T, Floreano D, Benton R: Fluorescence Behavioral Imaging (FBI) tracks identity in heterogeneous groups of *Drosophila*. *PLoS One* 2012, 7(11).
14. Kuhn HW: The Hungarian method for the assignment problem. *Naval Res. Logistics Q* 1955, 2(1–2):83-97.
15. Gabriel P, Verly J, Piater J, Genon A: The state of the art in multiple object tracking under occlusion in video sequences. *Advanced Concepts for Intelligent Vision Systems (2003)* pp 166-173.
16. Aurenhammer F: Voronoi diagrams - a survey of a fundamental geometric data structure. *ACM Comput. Surv. (CSUR)* 1991, 23(3):345-405. doi:10.1145/116873.116880
17. Dempster AP: Upper and lower probabilities induced by a multivalued mapping. *Ann. Math. Stat* 1967, 38(2):325-339. doi:10.1214/aoms/1177698950
18. Shafer G: *A Mathematical Theory of Evidence*. Princeton University Press; 1976.

## Copyright

This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.