A two-fly tracker that solves occlusions by dynamic programming: computational analysis of Drosophila courtship behaviour
EURASIP Journal on Image and Video Processing volume 2013, Article number: 64 (2013)
Abstract
This paper introduces a two-fly tracker whose focus is on modeling and solving occlusions as an optimization problem. Automated tracking of genetic model organisms is gaining importance since geneticists and neuroscientists have biological tools to systematically study the connection between genes, neurons and behaviour by performing large-scale behavioural experiments. This paper presents a fly tracker that provides automated quantification for such functional behaviour studies on Drosophila courtship behaviour. It enables measurement and visualization of behavioural differences in genetically modified fly pairs. The developed system provides solutions for all major challenges that were identified: arena detection, segmentation, quality control, resolving occlusions, resolving heading and detection of behaviour events. Of all these challenges, resolving occlusions turned out to be particularly important, and considerable effort was invested in solving it. Our tests show that our system is capable of identifying flies through an entire video with an accuracy of 99.97%. This result is achieved by combining different types of local methods and modeling the global identity assignment as an optimization problem.
1 Introduction
1.1 Motivation and goals
A fundamental question in neuroscience is how genes, brains and behaviour are related: genes encode hardwired neuronal circuits in the nervous system. For innate behaviours, such as the reproductive behaviour of insects, these neuronal circuits produce observable, stereotypic motor outputs.
The fruit fly Drosophila melanogaster has a set of innate behaviours that are hardwired in the nervous system, and several of these behaviours are sex-specific. Together with the availability of genetic and molecular tools, this makes the fruit fly a common model organism for studying how the nervous system generates behaviours.
Drosophila courtship is a robust, sex-specific behaviour that has been characterized through multiple genetic screens, and many genes that regulate male and female courtship behaviour have already been identified. It came as a big surprise that a complex behaviour like courtship is regulated by a few sets of genes [1, 2], and it is strongly believed that these genes interact with cascades of downstream genes that regulate individual parts of the behaviour.
Geneticists and neuroscientists currently perform large-scale experiments in order to systematically identify genes and neurons that are involved in specific steps of courtship behaviour. Quantifying these experiments and classifying the different behaviours turned out to be a very time-consuming and tedious task; for a long time, it was therefore the major bottleneck of large-scale behaviour screens.
Automated tools aim to support such large-scale experiments. Saving time is one important factor, but automation also limits human error and extends the possibilities for robust, objective and reproducible analysis.
This paper describes the development of a system that translates courtship behaviour videos into formal descriptors using computer vision and statistical methods. The descriptors allow ethogram-like descriptions of complex courtship behaviour patterns for each fly. A special feature of our system is that it identifies individual flies throughout the entire video by solving the occlusion problem with very high accuracy.
1.2 Related work
When the project was initiated, only a few trackers [3–6] existed, all of them for a different model organism, Caenorhabditis elegans. These trackers mainly analyzed the worms’ movements and quantified turn direction versus straight movement, and they excluded frames in which worms occluded each other. The only published fly tracker [7] analyzed fly locomotion behaviour.
In 2008, Perona published an automated fly tracker for courtship and aggressive behaviour [8, 9] and initiated a transition from manual to automated scoring. At the same time, Schusterreiter developed a tracker [10] that measures the courtship index and captures courtship sub-behaviours.
In particular, the work of Dankert et al. [8] shares several aspects with this paper, as it also introduces a two-fly tracker and tackles similar challenges. Our tracker differs mainly in three aspects. First, it was initially designed to process unseen videos that were not specifically recorded for automated tracking, and it therefore comes with an arsenal of quality-boosting and quality control methods. Second, we spent considerable effort on the occlusion problem, which was probably less critical for the application scenarios of [8]. Finally, our system offers top-down and bottom-up classifiers for courtship behaviour, while the other system offers only top-down classifiers, but for both courtship and aggressive behaviours.
Branson et al. [9] quantify the behaviour of multiple flies, while our system is specifically optimized for two flies per chamber. Our system handles flies turning their heads up in z-direction and flies occluding each other in z-direction through improved software analysis, while the tracker of [9] addresses these problems by improving the recording setup [11], which significantly reduces difficult occlusion cases.
Hoyer et al. [12] used a tracker that quantified aggression behaviour with a user-defined lunge counter and required one of the two male flies to be painted with a white dot on its back. Similarly, the identity tracking method introduced in [13] enabled biologists to genetically mark flies with a camera-detectable fluorescence marker. In contrast, our system can in principle incorporate detectable color differences but does not require flies to be marked.
The work introduced in this paper was developed independently of related work; however, some user-defined classifiers of the postprocessor were defined after studying the classifiers in [8]. Further similarities, like choosing the Hungarian algorithm for identity assignment in unoccluded sequences or detecting circular arenas with the Hough transform, are coincidental.
This paper introduces a two-fly tracker and focuses mainly on resolving occlusions as an optimization problem. It is organized as follows: Section 2 introduces the main components of the entire system. Section 3 starts with the basic definitions for the occlusion problem (Section 3.1) and local methods for occlusion assignments (Section 3.2). The occlusion problem is stated as an optimization problem in Section 3.3, and a dynamic programming algorithm that solves this optimization problem is presented in Section 3.4. Results of the approach are discussed in Section 3.5. Section 4 shows that the same algorithm can also solve the heading problem. Finally, Section 5 provides a summary, discusses properties and results of the optimization algorithm and outlines future work.
2 A two-fly tracker
In general, automated tracking is a data densification process that takes large amounts of video data with low information content and turns them into small amounts of relevant features with high information content. Our system consists of two major steps: an image processing step, in which raw video data is transformed into a time series representation, and a pattern recognition step, in which biologically relevant events are detected within that time series. The image processing part further subsumes several data transformation and data cleaning steps that operate while the data is still in its image representation. It thus boosts the quality and plausibility of the image data before the time series is extracted and ensures that minimum quality standards are met. If videos are found to be unsuitable for downstream computation steps, they are rejected as early as possible in order to save computation time.
The system architecture consists of several modules named preprocessor, tracker, postprocessor and annotationTool. The preprocessor and the tracker cover the image processing part; the postprocessor derives advanced attributes and covers the pattern recognition part. The workflow between modules is straightforward: information flows from the preprocessor through the tracker to the postprocessor. The only component that interacts in both directions is the annotationTool, which interoperates with postprocessed data (cf. Figures 1 and 2).
The following paragraphs briefly describe each module; further details on the main functionality can be found in [10].
The preprocessor identifies individual arenas (cf. Figure 1a) and boosts video quality for each frame. Quality improvement encompasses illumination correction based on an illumination correction curve (cf. Figure 1b) and the elimination of arenas in which camera movement, intruding objects or a number of flies other than two were detected (cf. Figure 1c,h,i). In this process, approximately 2% to 5% of chamber videos are rejected. Two gray level images are essential for arena detection and all further image processing: the so-called rigorously smoothed background (Figure 1e) and the cautiously smoothed background (Figure 1f). Arenas are detected by a circular Hough transform, which identifies the number, position and diameter of the arenas (Figure 1d). Arena boundaries (Figure 1g) are watched by an intrusion detector. Finally, videos are split into individual arenas (Figure 1i), each of which is handed separately to the tracker.
The tracker, shown in Figure 1j,k,l,m,n,o,p,q,r,s, takes single-arena videos and the corresponding smoothed backgrounds as input (cf. the first column of the tracker figure, Figure 1j,u,t). The second column shows the arena after subtracting the smoothed backgrounds (first and second rows, Figure 1p,k) and the result of a gradient procedure in the third row (Figure 1n). Further processing is based on the gray level histogram shown in the middle of the third column (Figure 1l). Two different thresholds are used for body and wing detection. Using the body threshold together with the gradient image for the boundaries, we obtain an image of the body region (Figure 1m). Combining this result with the wing threshold, we obtain from Figure 1p a first binarized image of the body-wing region (Figure 1q). This image is further improved by filling holes (Figure 1r). The resulting body regions (Figure 1m) and wing regions (Figure 1r) are marked in the original image in Figure 1s. Extracting primary attributes directly from the images in Figure 1m and Figure 1r concludes the image processing step. These attributes form the interface to the postprocessor module. For both the body and the wing region, we derive the number of pixels (Area), the region Perimeter, the center of gravity (Centroid), and the Orientation, MinorAxisLength and MajorAxisLength of a covering ellipse.
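The ellipse attributes listed above can be derived from second-order image moments. The following is a minimal pure-Python sketch under that assumption: it fits the ellipse with the same normalized second central moments as the region, the definition MATLAB's `regionprops` uses for these attribute names, but the paper does not state which implementation the tracker uses. The function name `region_attributes` is hypothetical, and Perimeter is omitted for brevity.

```python
import math


def region_attributes(pixels):
    """Primary attributes of one binary region (body or wing).

    pixels: list of (row, col) coordinates of the region's pixels.
    Returns Area, Centroid (row, col), Orientation in degrees (counter-clockwise
    from the column axis), MajorAxisLength, MinorAxisLength and Eccentricity of
    the ellipse with the same normalized second central moments as the region.
    Assumed moment-based fit; not necessarily the tracker's implementation.
    """
    n = len(pixels)
    if n == 0:
        raise ValueError("empty region")
    cr = sum(r for r, _ in pixels) / n
    cc = sum(c for _, c in pixels) / n
    # x runs along columns, y upwards (image rows grow downwards).
    xs = [c - cc for _, c in pixels]
    ys = [cr - r for r, _ in pixels]
    # Normalized second central moments; 1/12 is the moment of a unit pixel.
    uxx = sum(x * x for x in xs) / n + 1 / 12
    uyy = sum(y * y for y in ys) / n + 1 / 12
    uxy = sum(x * y for x, y in zip(xs, ys)) / n
    common = math.sqrt((uxx - uyy) ** 2 + 4 * uxy ** 2)
    major = 2 * math.sqrt(2) * math.sqrt(uxx + uyy + common)
    minor = 2 * math.sqrt(2) * math.sqrt(max(uxx + uyy - common, 0.0))
    orientation = math.degrees(0.5 * math.atan2(2 * uxy, uxx - uyy))
    return {
        "Area": n,
        "Centroid": (cr, cc),
        "Orientation": orientation,
        "MajorAxisLength": major,
        "MinorAxisLength": minor,
        "Eccentricity": math.sqrt(1 - (minor / major) ** 2),
    }
```

Eccentricity is included because the size attribute of method siz1 (Section 3.2) builds on it.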
The tracking process is accompanied by a number of quality control steps, such as checking for intrusions into the watched boundary around a chamber (cf. Figure 1i) and evaluating the plausibility of the tracked primary attributes.
The postprocessor covers the pattern recognition part of the system and searches for biologically relevant events. As a first step, the tracking data is normalized such that all attributes are comparable across different videos. From the normalized primary attributes, we compute secondary attributes that allow the definition of behavioural patterns like following, wing extension or copulation. Figure 2 shows some of these attributes. They capture specific fly constellations (Figure 2a,e,f,g,h) or shapes (Figure 2b,c,d). Transforming these attributes into behaviour patterns requires identifying the individual flies in each video frame and detecting each fly’s head and tail. Fly identification is rather simple as long as the tracker distinguishes separate regions for each fly, in so-called unoccluded frames. For sequences of unoccluded frames, fly identities are carried through successive frames by solving an assignment problem for position characteristics with the Hungarian algorithm [14].
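For two flies, the frame-to-frame assignment problem described above is small enough to solve exactly by enumerating both permutations, which yields the same optimum as the Hungarian algorithm. The sketch below assumes, as a hypothetical cost, the summed Euclidean distance between Centroids; the paper only says that position characteristics are used. The function names are illustrative. For more objects, `scipy.optimize.linear_sum_assignment` solves the same linear assignment problem.

```python
import math
from itertools import permutations


def assign_identities(prev_centroids, curr_centroids):
    """Return perm such that curr_centroids[perm[k]] continues fly k.

    Assumed cost: summed Euclidean distance between matched Centroids.
    Exhaustive search finds the same optimum as the Hungarian algorithm;
    with two flies there are only two candidate permutations.
    """
    n = len(prev_centroids)
    if len(curr_centroids) != n:
        raise ValueError("both frames must contain the same number of flies")

    def cost(perm):
        return sum(math.dist(prev_centroids[k], curr_centroids[perm[k]]) for k in range(n))

    return min(permutations(range(n)), key=cost)


def carry_identities(frames):
    """Reorder per-frame detections so that index k always refers to the same fly.

    frames: unoccluded frames, each a list of (x, y) Centroids in detection order.
    """
    ordered = [list(frames[0])]
    for centroids in frames[1:]:
        perm = assign_identities(ordered[-1], centroids)
        ordered.append([centroids[p] for p in perm])
    return ordered
```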
For occluded frames, i.e. frames in which the fly bodies overlap (occlude) each other, primary and secondary attributes are computed only after the occlusion problem has been solved. A solution provides a matching of fly identities before and after each occlusion (see Section 3). Based on these matchings, the occluded primary attributes are approximated by interpolation. Once primary attributes are available for every single frame, the postprocessor determines the head and tail of each fly body (see Section 4) and computes all other secondary attributes.
The system may then apply machine-learned or user-defined classifiers to detect relevant behaviour events. Detected events are recorded in color-coded ethograms (cf. Section 5.2) and Excel sheets.
The annotationTool interacts with the postprocessed data. It supports inspecting attributes and overruling machine decisions. Manually corrected postprocessing data is re-postprocessed to ensure consistent data views and to avoid time-consuming recalculations during online annotation.
The screenshot in Figure 2i contains data panels at the top that visualize attributes for both flies; video panels that depict video frames with the tracked perimeter, the automatically annotated heading and interpolated ellipses; and an occlusion panel that visualizes fly identification across an occlusion. Control panels on the right are used for video navigation, attribute selection and manual annotation.
We further implemented a web interface to bulk-submit processing tasks to a computer cluster and to manage videos and tracking results.
3 Resolving occlusions
3.1 Problem definition
When examining social interactions, the aim is to capture behaviour especially when the individuals are close to each other. Therefore, it is necessary to identify individual flies throughout the entire video even if they overlap or occlude each other. If the two flies move close together and their body regions overlap such that the segmentation method detects only a single body region for both flies, then assigning fly identities becomes a difficult task for a computer. Even for humans, it is sometimes difficult or even impossible to allocate individuals correctly after two flies have overlapped completely.
Since being able to allocate the individual flies is essential for the behavioural studies, the occlusion problem was a key challenge in system development, and considerable effort was invested in tackling it.
Figure 3 shows four cases of occlusion to further illustrate the problem. The majority of occlusion cases are very similar to the ones depicted in Figure 3a,b; the one in Figure 3a depicts a social interaction that typically happens in occluded scenes. Our system reliably resolves such occlusions (and also detects the wing extension during occlusion in Figure 3a). A rare case where our system is wrong (Figure 3d) is further discussed in Section 3.5.
For resolving occlusions, the sequence of all video frames V is partitioned into alternating σ and τ sequences. While σ sequences contain unoccluded frames where both fly bodies are detected separately, τ sequences contain occluded frames where the two fly bodies are merged into one larger region or where no flies at all were detected.
Formally, σ and τ sequences are defined as follows:
Definition 1. Let $f$ be a frame. The function $o(f)$ is defined as $o(f)=1$ if $f$ is occluded and $o(f)=0$ otherwise. Let $f_0$ denote an empty frame, with $o(f_0)=1$.
Definition 2. Let $V$ be a sequence of frames $f_i$, $f_i\in V$. A sequence $\sigma\subseteq V$ contains a set of successive frames $f_i$ with $\forall f_i\in\sigma: o(f_i)=0$; a sequence $\tau\subseteq V$ contains successive frames with $\forall f_i\in\tau: o(f_i)=1$.
Corollary 1. The borders of a partitioning $\Phi$ of $V$ into alternating σ and τ sequences are marked by changes in $o(f_i)$, i.e. by $\Delta[o(f_i)]\neq 0$. The partitioning $\Phi_0$ of $V_0=f_0\cup V\cup f_0$ is guaranteed to start and end with a τ sequence.
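Corollary 1 translates directly into a run-length split of the occlusion indicator $o(f)$. The following Python sketch pads $V$ with the empty frame $f_0$ on both ends and returns the alternating runs of $V_0$; the function name `partition` and the half-open index convention are illustrative choices, not taken from the paper.

```python
from itertools import groupby


def partition(occluded):
    """Split a video into alternating tau (occluded) and sigma (unoccluded) runs.

    occluded: o(f) for every frame of V (truthy = occluded).
    V is padded with the empty frame f0, o(f0) = 1, on both ends (V0 of
    Corollary 1), so the result always starts and ends with a tau run.
    Returns (kind, start, stop) triples with half-open indices into V0
    (an illustrative convention: frame m of V has index m + 1).
    """
    flags = [1] + [1 if o else 0 for o in occluded] + [1]
    runs, start = [], 0
    for flag, group in groupby(flags):
        length = sum(1 for _ in group)
        runs.append(("tau" if flag else "sigma", start, start + length))
        start += length
    return runs
```

For example, `partition([0, 0, 1, 1, 0])` returns `[("tau", 0, 1), ("sigma", 1, 3), ("tau", 3, 5), ("sigma", 5, 6), ("tau", 6, 7)]`.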
Definition 3. The occlusion problem is finding the best overall assignment of fly identities in all unoccluded sequences using observable fly attributes.
The problem is solved in two steps. First, we calculate local scores for the possible assignments of the fly identities in a subsequence $(\sigma_i,\tau_i,\sigma_{i+1})$, using only information from the occluded sequence $\tau_i$ and its two enclosing unoccluded sequences $\sigma_i$ and $\sigma_{i+1}$. These scores can be interpreted as probabilities for a matching and are determined by local methods called t-methods. Different local methods are introduced in Section 3.2. The occlusion sequences in Figure 3a,b depict rather trivial occlusion cases in which most local methods are successful. The sequences in Figure 3c,d show cases for which different local methods may deliver diverging results.
Such ambiguities are resolved in the second step by (a) formalizing the occlusion problem as a global optimization problem (Section 3.3) and (b) solving it with a dynamic programming approach (Section 3.4).
3.2 Local methods
Local methods, or t-methods, are associated with a τ sequence and aim to provide an assignment for the identifiers of its enclosing σ sequences. Each t-method computes a t value, or t score, for each assignment that reflects its certainty.
In general, t-methods may be differentiated into attribute-based methods following a merge-and-split approach, point-based methods following a straight-through approach [15] and combination methods.
3.2.1 Attribute-based methods
Attribute-based methods follow a classical merge-and-split approach. Their idea is simple: compare the values of known fly attributes before and after the occlusion and assign the pre-occlusion flies to the best-matching post-occlusion flies. The particular set of characteristic attributes used to re-identify flies may vary.
Although any attribute can be taken into account, we first focus on size-based attributes. An initial motivation for size-based methods was the known size difference between male and female flies.
In general, attribute-based methods match flies according to the mean, maximum, minimum or any other aggregation of their sizes before and after the occlusion.
Method siz1 aggregates an eccentricity-corrected size attribute, $\text{AreaEC}=\text{Area}\sqrt{1-{\text{EccentricityC}}^{2}}$ with $\text{EccentricityC}=\frac{\text{Eccentricity}}{1+e^{-5\cdot\text{Eccentricity}}}$, over whole σ sequences and compares probabilities that indicate which fly is bigger. Figure 2i (blue shape, second and third data panels) shows the attributes Area and AreaEC next to each other (note particularly frames 460 to 480, when the larger fly turns up in z-direction). Using attribute AreaEC, method siz1 solves all occlusion cases in Figure 3, whereas using Area directly would get the sequence in Figure 3c wrong.
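The size attribute of siz1 can be written down directly; the sketch below reads the exponent of EccentricityC as −5·Eccentricity (an assumption, since the sign is lost in some renderings of the formula). A plausible reading of why the correction helps: for an ideal ellipse with semi-axes a ≥ b, Area·sqrt(1 − Eccentricity²) = πb², so for elongated shapes, where EccentricityC is close to Eccentricity, AreaEC mainly reflects body width rather than apparent length. The aggregate `fraction_a_bigger` is hypothetical; the paper does not specify how siz1 turns per-frame sizes into probabilities.

```python
import math


def eccentricity_c(ecc):
    """EccentricityC = Eccentricity / (1 + exp(-5 * Eccentricity)).

    The negative exponent is an assumed reading of the formula.
    """
    return ecc / (1 + math.exp(-5 * ecc))


def area_ec(area, ecc):
    """Eccentricity-corrected size AreaEC = Area * sqrt(1 - EccentricityC**2)."""
    return area * math.sqrt(1 - eccentricity_c(ecc) ** 2)


def fraction_a_bigger(frames):
    """Hypothetical siz1-style aggregate over one sigma sequence.

    frames: list of ((area_a, ecc_a), (area_b, ecc_b)), one entry per frame.
    Returns the fraction of frames in which fly A has the larger AreaEC.
    """
    if not frames:
        raise ValueError("empty sigma sequence")
    wins = sum(1 for a, b in frames if area_ec(*a) > area_ec(*b))
    return wins / len(frames)
```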
Method posm compares the Centroids in the last frame before the occlusion with those in the first frame after the occlusion and computes a score $v\in(-1,+1)$. Writing $(o_b^m\rightharpoonup o_a^n)$ for the Centroid distance between fly $m$ before and fly $n$ after the occlusion, the score aggregates the distances that indicate the matching in which fly 1 before the occlusion is fly 1 after the occlusion, $(o_b^1\rightharpoonup o_a^1)$ (and correspondingly $(o_b^2\rightharpoonup o_a^2)$), versus the opposite matching. The difference is normalized by the sum of all involved distances, which guarantees a value between −1 and +1, and negated so that short distances are preferred: $v=-\frac{(o_b^1\rightharpoonup o_a^1)+(o_b^2\rightharpoonup o_a^2)-(o_b^1\rightharpoonup o_a^2)-(o_b^2\rightharpoonup o_a^1)}{(o_b^1\rightharpoonup o_a^1)+(o_b^2\rightharpoonup o_a^2)+(o_b^1\rightharpoonup o_a^2)+(o_b^2\rightharpoonup o_a^1)}$. The resulting score indicates an identity assignment by $\operatorname{sign}(v)$ and the method’s certainty about this assignment by $|v|$.
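The posm score is a direct arithmetic on four Centroid distances. The following sketch transcribes it; the list-of-tuples input format and the function name are assumptions for illustration, as is returning 0 when all Centroids coincide.

```python
import math


def posm_score(before, after):
    """Position-based t score in [-1, +1].

    before: [(x, y) of fly 1, (x, y) of fly 2] in the last frame before the occlusion.
    after:  [(x, y) of fly 1, (x, y) of fly 2] in the first frame after it.
    v > 0 suggests that fly 1 stays fly 1; |v| is the certainty.
    """
    d11 = math.dist(before[0], after[0])
    d22 = math.dist(before[1], after[1])
    d12 = math.dist(before[0], after[1])
    d21 = math.dist(before[1], after[0])
    total = d11 + d22 + d12 + d21
    if total == 0:
        return 0.0  # assumed convention: all Centroids coincide, no evidence
    # Negated so that short distances for the "identity kept" matching give v > 0.
    return -((d11 + d22) - (d12 + d21)) / total
```

For example, `posm_score([(0, 0), (10, 0)], [(1, 0), (9, 0)])` returns 0.8: the kept-identity distances sum to 2, the swapped ones to 18.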
3.2.2 Point-based methods
Point-based methods follow the straight-through approach, in which a point set $\mathfrak{C}$ is traced ‘straight through’ the occlusion states. The object’s perimeter turned out to be a good choice for $\mathfrak{C}$; it outperformed all other tested point set candidates in solution quality or computation time.
The aim of point-based methods is to assign identifiers from the state $b$ before the occlusion (the last frame in which both flies have been identified) to the state $a$ after the occlusion (the first frame in which both flies are identified again). To this end, point sets $\mathfrak{C}$ are extracted before and after the occlusion, and each point is associated with the identifier of one of the two separately detected flies. Then, for each frame during the occlusion, the point set is extracted and the associated identifiers are carried over from the predecessor frame by a nearest neighbour assignment using Voronoi diagrams [16]. Finally, the identifier set carried through the occlusion, $\hat{\mathfrak{C}}_a$, is compared with the freshly partitioned point set $\hat{\mathfrak{C}}_{a'}$, and a score $v$ is derived that reflects how well the associated identifiers in $\hat{\mathfrak{C}}_a$ and $\hat{\mathfrak{C}}_{a'}$ match. Writing $(o_b^m\rightharpoonup o_a^n)$ for the number of identifier votes from $\hat{\mathfrak{C}}_a$ for mapping $o_b^m$ to $o_a^n$, the score takes the votes for mapping $o_b^1$ to $o_a^1$ and $o_b^2$ to $o_a^2$, minus the votes for mapping $o_b^1$ to $o_a^2$ and $o_b^2$ to $o_a^1$, normalized by the sum of all votes: $v=\frac{(o_b^1\rightharpoonup o_a^1)+(o_b^2\rightharpoonup o_a^2)-(o_b^1\rightharpoonup o_a^2)-(o_b^2\rightharpoonup o_a^1)}{(o_b^1\rightharpoonup o_a^1)+(o_b^2\rightharpoonup o_a^2)+(o_b^1\rightharpoonup o_a^2)+(o_b^2\rightharpoonup o_a^1)},\ v\in(-1,+1)$. Again, $\operatorname{sign}(v)$ indicates the suggested identity assignment and $|v|$ the certainty of this assignment.
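The straight-through idea can be sketched as a simplified, hypothetical scorer in the spirit of boc: labels are carried frame by frame with a brute-force nearest-neighbour lookup (the same result as locating each point in the Voronoi cell of the previous frame's points, without building the diagram) and finally compared with the fresh labels after the occlusion. The names, the point-list representation and the tie-breaking rule (the first nearest point wins) are assumptions; the rigid-transformation compensation of bocT is not included.

```python
import math


def carry_labels(labelled, points):
    """Give each point the fly label of its nearest labelled point.

    Brute-force nearest neighbour search, equivalent to a Voronoi-cell lookup.
    Ties go to the first labelled point (assumed rule).
    """
    return [(p, min(labelled, key=lambda q: math.dist(p, q[0]))[1]) for p in points]


def straight_through_score(before, occluded_frames, after):
    """Simplified, hypothetical boc-style t score in [-1, +1].

    before, after: lists of ((x, y), fly_id) with fly_id in {1, 2}, from the
    last frame before and the first frame after the occlusion.
    occluded_frames: list of unlabelled point lists extracted during the occlusion.
    """
    carried = before
    for points in occluded_frames:
        carried = carry_labels(carried, points)
    # Carry once more onto the freshly partitioned point set after the occlusion.
    carried = carry_labels(carried, [p for p, _ in after])
    keep = sum(1 for (_, c), (_, f) in zip(carried, after) if c == f)  # 1->1, 2->2 votes
    swap = len(after) - keep  # 1->2, 2->1 votes
    total = keep + swap
    return (keep - swap) / total if total else 0.0
```

Because each point simply inherits its nearest predecessor's label, the sketch shares the weakness discussed next: crossing flies can hand labels over to the wrong body.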
The major weakness of all point-based methods lies in the nearest neighbour assignment. Because each pixel takes over the identifier of its nearest pixel in the previous frame, crossing flies are likely to be mis-scored. In fact, all mis-scores and all ‘don’t know’ cases (scores of 0) result from this known issue. The latter case arises especially when the occluded region moves over longer distances. A method variant, bocT, therefore aims to compensate for such movements by applying a rigid transformation between successive frames, which reduces the effect of this particular weakness.
The boc method and its variants turned out to be particularly reliable for occlusion cases like the ones in Figure 3a,b and are likely to get cases like the one in Figure 3c wrong. Although point-based methods have known difficulties with crossing flies, they still correctly solve between 90% and 95% of our test case set (see Section 3.5) and typically give low certainty values when they are wrong.
3.2.3 Combining meta-methods
Combining meta-methods aim to boost the scores of individual local methods by machine learning techniques. To this end, we implemented a large number of attribute-based, point-based and other methods, and we extracted observable attributes from the occluded blobs; in particular, the duration of the occlusion and its minimum number of pixels (which provides information about a ‘maximal degree of occlusion’) turned out to characterize occlusions well. After the decisions and scores of all implemented methods had been computed, a meta-method was trained by standard machine learning approaches. Classification and Regression Trees (CART) turned out to be a useful approach; although alternative meta-method approaches performed equally well, the tree approach was chosen because of its intuitive and easily understandable rule-based decisions.
The experimental results in Section 3.5 include results from a cross-validated CART method in which local methods, each with an accuracy of 90% to 95%, are bundled into a combined t-method with about 99% accuracy.
Alternatively, the probability-converted scores of independent methods may be combined by the Dempster combination [17, 18], which mathematically combines evidence from different sources into a combined degree of belief. The Dempster combination is defined as follows:
Definition 4. Let $e_1$ and $e_2$ be two independent evidences. The Dempster combination of these two evidences is defined as $e_1\otimes e_2=\frac{e_1\cdot e_2}{1-K}$, $K=(1-e_1)\cdot e_2+e_1\cdot(1-e_2)$.
The definition above allows evidence from multiple t-methods to be combined cumulatively into new t probabilities. The Dempster combination may also be used to combine independent s-methods (introduced in Section 3.3 below).
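Definition 4 and its cumulative use can be transcribed in a few lines. The following Python sketch uses `functools.reduce` to fold several evidences; the function names are illustrative. The degenerate case $K=1$ (one evidence 0, the other 1) has no defined combination and raises an error here.

```python
from functools import reduce


def dempster(e1, e2):
    """Dempster combination of two independent evidences in [0, 1] (Definition 4)."""
    k = (1 - e1) * e2 + e1 * (1 - e2)  # conflict between the two evidences
    if k >= 1:
        raise ValueError("total conflict: evidences 0 and 1 cannot be combined")
    return e1 * e2 / (1 - k)


def combine_all(evidences):
    """Cumulatively combine evidences from several t-methods."""
    return reduce(dempster, evidences)
```

A neutral evidence of 0.5 leaves the other evidence unchanged (`dempster(0.7, 0.5)` returns 0.7 up to rounding), while two agreeing evidences of 0.8 reinforce each other to 0.64/0.68 ≈ 0.94.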
3.3 Occlusions as global optimization problem
Local methods process cases linearly and assess occlusion sequences independently, one after another. Consequently, a single wrong identity assignment is passed on through the entire video: from that assignment on, identities are swapped and therefore misassigned up to the end of the video (cf. the first example in Section 3.3).
To overcome this error propagation problem, we complement t-methods with so-called s-methods. While t-methods are associated with τ sequences, s-methods are associated with unoccluded σ sequences in which both flies are detected. These s-methods aim to discriminate and re-identify the two detected flies much like the merge-and-split approaches introduced in Section 3.2. However, while merge-and-split t-methods assess and compare characteristics of the σ sequences directly before and after an occluded τ sequence, the characteristics used by s-methods must be comparable throughout the whole video. Like a t-method, an s-method provides an s score for each assignment that reflects its certainty.
The comparability of s-methods is the key property for overcoming the error propagation problem that arises when using t-methods only, and it is essential for the optimization approach described in this section. The following two examples highlight the difference between s- and t-methods:
The size-based method siz1 (see Section 3.2) aims to match flies before an occlusion (in sequence $\sigma_b$) to flies after an occlusion (in sequence $\sigma_a$) according to an observed size difference: the bigger fly is assigned to the bigger fly and the smaller fly to the smaller fly. Such a size-based method may easily be generalized to an s-method, since the discriminating characteristic (the fly size) is comparable throughout the whole video. In other words, the flies in an arbitrary unoccluded sequence $\sigma_k$ may be matched to the flies of every other unoccluded sequence $\sigma_i$ such that the bigger flies are assigned to each other.
In contrast, the position-based method posm (see again Section 3.2), which aims to match flies according to their position, is not suitable for generalization to an s-method: since flies move, comparing positions across longer time spans between two sequences $\sigma_k$ and $\sigma_i$ leads to improper results.
In general, anatomical features, e.g. size or eye color, suggest suitable s-method implementations. In principle, any measurable anatomical or otherwise constant feature (like a painted mark) that discriminates the flies is applicable.
An intuitive combination of s- and t-methods would be to select the scores for which an s-method is absolutely sure and to treat the corresponding identity assignments as ‘fix points’. The t-methods would then be used only for low-score cases between these fix points. Such an approach would limit the intrinsic problem of t-methods, as misassignments would only be propagated up to the next fix point. The optimization approach introduced here generalizes this idea and enables s-methods and t-methods to correct each other.
In order to ensure comparability of method results, their scores are converted to probabilities as probability values are comparable and combinable with each other.
In theory, the conversion is done by empirically determining the distribution of score values for each method and then deriving a probability value $p$ from a score and the method-specific score distribution. In practice, a linear approximation turned out to give sufficiently accurate results for all incorporated methods (see Section 3.5).
Finally, s values and t values are defined for each method as log-odds derived from these (approximated) probability values, $v=\ln\left(\frac{p}{1-p}\right)$. Log-odds inherit all comparability and combinability properties and further provide two desirable mathematical properties: (1) the log-odds of the counter probability correspond to an inversion in sign, $v'=\ln\left(\frac{1-p}{p}\right)=-v$, and (2) log-odds are combinable by addition.
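Both properties of log-odds can be checked numerically. In the sketch below, `score_to_probability` is a hypothetical placeholder: the paper fits a linear approximation per method, but its coefficients are not given here, so the mapping (v + 1)/2 from a score in (−1, +1) is assumed. Property (2) coincides with the Dempster combination of Definition 4: since 1 − K = e₁e₂ + (1 − e₁)(1 − e₂), the odds of e₁ ⊗ e₂ are the product of the individual odds, so their log-odds add.

```python
import math


def score_to_probability(v):
    """Hypothetical linear calibration from a score in (-1, +1) to (0, 1).

    Placeholder for the per-method linear approximation; coefficients assumed.
    """
    return (v + 1) / 2


def log_odds(p, eps=1e-9):
    """v = ln(p / (1 - p)), with p clipped away from 0 and 1."""
    p = min(max(p, eps), 1 - eps)
    return math.log(p / (1 - p))


def probability(v):
    """Inverse of log_odds (the logistic function)."""
    return 1 / (1 + math.exp(-v))
```

For instance, adding the log-odds of two probabilities of 0.8 and mapping back with `probability` gives 16/17, the same value as their Dempster combination.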
Figure 4a illustrates by example how values are assigned to σ and τ sequences: after the video is split into alternating σ and τ sequences (occluded τ sequences are marked by gray boxes in Figure 4), the s and t scores are computed. The system incorporates the size-based method siz1m as s-method and the point-based method boc as t-method (see Section 3.2).
For each unoccluded sequence σ, the two detected flies are arbitrarily named fly A and fly B, and the s-method siz1m, a variant of the sign test that provides good approximations for small sample sizes, computes the probability p that A is the bigger fly. The log-odds s are derived from p and assigned to each unoccluded sequence (see cyan values in Figure 4a). Positive values indicate that fly A is the bigger fly, negative values that B is the bigger fly.
For each occluded τ sequence, the probability that fly A before the occlusion remains fly A after the occlusion is derived from a t-method. Method boc carries identifier information through the occlusion (cf. point-based methods in Section 3.2 for a more detailed description). Again, the log-odds are computed, and the resulting t values are assigned to the occluded sequences (see brown values in Figure 4a). While s values correspond to the probability that fly A is the smaller fly (written as ‘A is male’ in Figure 4), t values correspond to the probability that fly A in the σ sequence before the occlusion corresponds to fly A in the σ sequence after the occlusion. The (potentially artificial) τ sequences at the beginning and the end of $V_0$ are assigned t values of 0.
With all s values and t values in place, the occlusion problem can now be treated as an optimization problem. The proposed optimization algorithm (Section 3.4) uses a dynamic programming approach to compute the most plausible identity assignment by maximizing $\sum s+\sum t$ over flip operations.
A flip operation affects two occluded sequences $\tau_i$ and $\tau_j$ and all unoccluded sequences between them. Most importantly, it does not affect the identities in sequences outside these two occlusions: all sequences before $\tau_i$ and after $\tau_j$ remain unchanged.
Figure 4b depicts the flow of identifiers in Figure 4a after a flip operation between the two occlusions drawn as gray boxes. In the first occlusion, identities of the flies are swapped, which results in swapped identifiers in the sequence in the middle as well, and in the second occlusion, identities are swapped back, making ‘flip’ a local operation only.
Swapping identities mathematically corresponds to inverting the sign of s and t values.
In Figure 4 the flipped identities in Figure 4b have a total value $\sum s+\sum t=3.5$ and are therefore more plausible than identities in Figure 4a with $\sum s+\sum t=0.5$.
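The objective $\sum s+\sum t$ can be maximized with a standard two-state dynamic program. The sketch below is our own Viterbi-style formulation for illustration, not the algorithm of Section 3.4: each sequence $\sigma_k$ gets a state $x_k=+1$ (names kept) or $x_k=-1$ (names swapped), so $s_k$ contributes $x_k s_k$ and $t_k$ contributes $x_{k-1}x_k t_k$, because a τ value changes sign exactly when one of its neighbouring σ sequences is swapped. A run of swapped sequences $\sigma_i,\dots,\sigma_{j-1}$ corresponds to one flip(i,j), and all states +1 reproduce the unflipped total. The list indexing ($\tau_k$ directly precedes $\sigma_k$; $t_0$ and $t_n$ belong to the boundary τ sequences of $V_0$) and the function names are assumptions.

```python
def total_value(s, t, x):
    """Sum of s and t values after re-labelling the sigma sequences.

    s[k]: s value of sigma_k (k = 0..n-1); t[k]: t value of tau_k (k = 0..n),
    where tau_k directly precedes sigma_k (assumed indexing).
    x[k] = +1 keeps the fly names of sigma_k, x[k] = -1 swaps them.
    """
    n = len(s)
    value = sum(xk * sk for xk, sk in zip(x, s))
    # A t value changes sign iff exactly one neighbouring sigma is swapped;
    # the boundary taus tau_0 and tau_n have a single sigma neighbour.
    value += x[0] * t[0] + x[n - 1] * t[n]
    value += sum(x[k - 1] * x[k] * t[k] for k in range(1, n))
    return value


def best_relabelling(s, t):
    """Maximize total_value over all x with a two-state Viterbi recursion.

    Own illustrative formulation, not the algorithm of Section 3.4.
    Returns (best value, x).
    """
    n = len(s)
    if n == 0 or len(t) != n + 1:
        raise ValueError("need n >= 1 s values and n + 1 t values")
    states = (1, -1)
    best = {x: x * (t[0] + s[0]) for x in states}  # best prefix value ending in state x
    back = []
    for k in range(1, n):
        step, ptr = {}, {}
        for x in states:
            prev = max(states, key=lambda p: best[p] + p * x * t[k])
            step[x] = best[prev] + prev * x * t[k] + x * s[k]
            ptr[x] = prev
        best = step
        back.append(ptr)
    last = max(states, key=lambda x: best[x] + x * t[n])
    xs = [last]
    for ptr in reversed(back):
        xs.append(ptr[xs[-1]])
    xs.reverse()
    return best[last] + last * t[n], xs
```

The recursion runs in time linear in the number of σ sequences, since each step compares only two predecessor states.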
Formally, this flip operation is defined as follows:
Definition 5. Let $\mathcal{S}$ and $\mathcal{T}$ be sequences of s values and t values associated with the sequences σ and τ, such that $s_i\in\mathcal{S}$ denotes the s value of $\sigma_i$ and $t_i\in\mathcal{T}$ denotes the t value of $\tau_i$. The operation flip(i,j) on $\mathcal{S}$ and $\mathcal{T}$, defined as the function $(\mathcal{S},\mathcal{T})'=\text{flip}(i,j,\mathcal{S},\mathcal{T})$, reverts the signs of $t_i$ and $t_j$ and of all $s_k$, $i\le k<j$, in between them.
This flip operation comes with a number of desirable mathematical properties. It is obviously commutative and associative.
Definition 6. Let flip(i,j) and flip(k,l) be flip operations on $\mathcal{V}$. The combined operation of both flips is denoted as flip(i,j) ∪ flip(k,l).
Since flip is commutative, the order in which the individual operations are applied does not matter. Flip is also semi-idempotent, $\text{flip}(i,j)\cup\text{flip}(i,j)=\varnothing$, and therefore concatenable, $\text{flip}(i,k)\cup\text{flip}(k,j)=\text{flip}(i,j)$: the sign of t_k is reverted by both operations and thus remains unchanged.
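Lemma 1. Let i, j, k, l be indices for $\mathcal{V}$ with i ≤ j ≤ k ≤ l. Then flip(i,k) ∪ flip(j,l) = flip(i,j) ∪ flip(k,l).
Proof.
$$\begin{aligned}
\text{flip}(i,k)\cup\text{flip}(j,l) &= \big(\text{flip}(i,j)\cup\text{flip}(j,k)\big)\cup\big(\text{flip}(j,k)\cup\text{flip}(k,l)\big) && \text{(concatenable)}\\
&= \text{flip}(i,j)\cup\big(\text{flip}(j,k)\cup\text{flip}(j,k)\big)\cup\text{flip}(k,l) && \text{(associative)}\\
&= \text{flip}(i,j)\cup\text{flip}(k,l) && \text{(semi-idempotent)}
\end{aligned}$$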
These properties of flip encourage the definition of a normal form $\mathcal{F}_\perp$ for a set $\mathcal{F}$ of flip operations.
Definition 7. Let f = flip(i,j), $f\in\mathcal{F}$, be a flip operation, let $|f|$ denote the number of sequences affected by flip(i,j), and let $|\mathcal{F}|=\sum_{f\in\mathcal{F}}|f|$. Further, let $\mathcal{V}'=\mathcal{F}(\mathcal{V})$ denote the result of applying all flip operations in $\mathcal{F}$, and let $\mathfrak{F}$ be the infinite set of all flip operation sets that are equivalent to $\mathcal{F}$, $\mathfrak{F}=\{\mathcal{F}_i \mid \mathcal{F}_i(\mathcal{V})=\mathcal{F}(\mathcal{V})\}$. The normal form $\mathcal{F}_\perp$ of $\mathcal{F}$ is defined as the set of flip operations flip(i,j) with i < j that affects the smallest number of sequences but still delivers the same result, $\forall\,\mathcal{F}_i,\mathcal{F}_\perp\in\mathfrak{F}: |\mathcal{F}_\perp|\le|\mathcal{F}_i|$.
Corollary 2. A normal form $\mathcal{F}_\perp$ of $\mathcal{F}$ contains neither double-flip operations, $\text{flip}(i,j)\in\mathcal{F}_\perp,\ \text{flip}(k,l)\in\mathcal{F}_\perp\rightarrow(i,j)\ne(k,l)$, nor flip overlaps that would contain double-flip operations, $\text{flip}(i,j)\in\mathcal{F}_\perp,\ \text{flip}(k,l)\in\mathcal{F}_\perp,\ i<l\rightarrow j<k$. The properties i < j and k < l, and transitively i < k and j < l, follow from the convention that i < j for all flip operations $\text{flip}(i,j)\in\mathcal{F}_\perp$.
Corollary 3. A normal form $\mathcal{F}_\perp$ is sufficiently characterized by an ordered enumeration of all flip operation indices. A normal form $\mathcal{F}_\perp=\{\text{flip}(i,j),\text{flip}(k,l)\}$ may therefore be denoted as $\mathcal{F}_\perp=\{i,j,k,l\}$.
Every set of flip operations $\mathcal{F}$ can be transformed into its normal form $\mathcal{F}_\perp$ by eliminating double flips and flip overlaps and sorting the flip indices.
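One way to see why this reduction works: flip(i, j) reverts exactly the boundaries τ_i and τ_j plus the σ sequences in between, so any index that occurs as an endpoint an even number of times cancels out. The hypothetical helpers below (a sketch of ours, not the authors' code) compute the ordered index enumeration of $\mathcal{F}_\perp$ from Corollary 3 by keeping only indices with odd multiplicity and sorting them; consecutive pairs of that list are the flips of the normal form.

```python
from collections import Counter

def normal_form(flips):
    """Ordered index list {i, j, k, l, ...} of the normal form of a set of flip(i, j) operations.

    Each endpoint toggles one tau boundary, so indices used an even number of
    times cancel (semi-idempotence, concatenation, Lemma 1).
    """
    counts = Counter()
    for i, j in flips:
        counts[i] += 1
        counts[j] += 1
    return sorted(idx for idx, c in counts.items() if c % 2 == 1)

def as_pairs(indices):
    """Turn the enumeration {i, j, k, l, ...} back into flip(i, j), flip(k, l), ..."""
    return list(zip(indices[0::2], indices[1::2]))

print(as_pairs(normal_form([(1, 3), (2, 5)])))  # [(1, 2), (3, 5)]
```

The printed result is Lemma 1 with i, j, k, l = 1, 2, 3, 5: two overlapping flips become two disjoint ones that affect fewer sequences.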
3.4 Solving occlusions by dynamic programming
This section introduces an algorithm that solves the optimization problem modelled in the previous section using a dynamic programming approach that results in a generalization of the Viterbi algorithm.
The proposed optimization algorithm computes the most plausible identity assignment throughout the entire video by maximizing $\sum s+\sum t$ under the flip operation. Intuitively, this enables s-methods and t-methods to complement and correct each other, especially in cases where an s-method indicates certainty but a t-method does not, or vice versa.
The algorithm exploits the mathematical properties of the flip operation. When searching for optimal solutions, it is sufficient to traverse normal forms of flip operations only. This reduces an infinite search space to an exponential one. By sorting the (commutative, non-overlapping) flip operations in ascending order, intermediate results for all flip operations up to a sequence τ_k may be reused. The dynamic programming approach therefore traverses the exponential search space in linear time and is still guaranteed to derive the shortest set of flip operations that is required to transform an arbitrary identifier initialization into the assignment with the highest global plausibility. This enables the assignment of local identifiers for flies A_i and B_i to global identifiers 1 and 2, and the sorting of fly attributes according to global fly identifiers.
Algorithms 1, 2, 3 and 4 below provide a formal definition of the optimization approach.
Algorithm 1 initialize
Algorithm 2 backtrack
Algorithm 3 bulkflip
Algorithm 4 optimizeAssignment
The final algorithm listed as Algorithm 4 consists of the following steps:

1. A dynamic programming initialization step (see Algorithm 1), where the s values and t values are traversed once to compute the cumulative scores S and T, such that S_{i,c} and T_{i,c} contain the best possible scores up to sequence σ_i and τ_{i−1}, respectively, under the condition that the current sequence is flipped and its identifiers are swapped (c = −1) or that they remain unchanged (c = +1). This step exploits the mathematical properties of the flip operation in order to model the optimization problem as a dynamic programming problem instance. The cumulative score up to the first occluded sequence is initialized with T_{1,−1} = −∞ and T_{1,+1} = 0. This enforces that fly 1 of the global assignment has the property of positive s values. The total cost of the global identity assignment is given by T_{n+1,1}.

2. A backtracking step (see Algorithm 2), where the path that led to the assignment with the best score T_{n+1,1} is reconstructed. This path determines the flip positions that sufficiently characterize $\mathcal{F}_\perp$, the desired smallest set of flip operations that transforms an arbitrary initialization into the optimal solution.

3. A bulk-flip step (see Algorithm 3), where the resulting flip operations in $\mathcal{F}_\perp$ are applied to the initially given s and t values in order to derive the flipped scores s′ and t′ of the optimal solution, $\sum s'+\sum t'=T_{n+1,1}$.
The algorithm result is applied by swapping fly objects within the time series data. For each sequence σ_i, a value $\text{swap}_i=s'_i/s_i$ with swap_i ∈ {−1, +1} may be computed; if swap_i = −1, the identifiers for sequence σ_i have to be swapped.
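Algorithms 1 to 4 are only referenced here, not reproduced, so the following is a hypothetical sketch rather than the authors' implementation: a two-state, Viterbi-style dynamic program over the same objective. It assumes the 0-based layout τ_0, σ_0, τ_1, …, σ_{n−1}, τ_n (so len(t) == len(s) + 1), gives every σ sequence a state c ∈ {+1, −1} (keep or swap identifiers), counts s_k with sign c_k and t_k with the product of the states on both sides of τ_k, and fixes the virtual states before τ_0 and after τ_n to +1, mirroring T_{1,−1} = −∞ and T_{1,+1} = 0. The name `optimize_assignment` is made up.

```python
import math

def optimize_assignment(s, t):
    """Maximize sum(s) + sum(t) under flips with a two-state DP (sketch).

    Returns (best_total, swap, flip_indices); swap[k] is +1 or -1 for sigma_k.
    """
    n = len(s)
    if len(t) != n + 1:
        raise ValueError("expected len(t) == len(s) + 1")
    # Initialization: the virtual state before tau_0 is unflipped.
    prev = {+1: 0.0, -1: -math.inf}
    back = []  # back[k][c]: state on the left of tau_k when sigma_k has state c
    for k in range(n):
        cur, ptr = {}, {}
        for c in (+1, -1):
            stay = prev[c] + t[k]      # same state on both sides: t_k keeps its sign
            switch = prev[-c] - t[k]   # state changes at tau_k: t_k is reverted
            if stay >= switch:         # ties keep the state (no needless flip)
                cur[c], ptr[c] = stay + c * s[k], c
            else:
                cur[c], ptr[c] = switch + c * s[k], -c
        back.append(ptr)
        prev = cur
    # Close with tau_n, whose right-hand state is fixed to +1.
    if prev[+1] + t[n] >= prev[-1] - t[n]:
        best_total, state = prev[+1] + t[n], +1
    else:
        best_total, state = prev[-1] - t[n], -1
    # Backtracking: recover the swap value of every sigma sequence.
    swap = [0] * n
    for k in range(n - 1, -1, -1):
        swap[k] = state
        state = back[k][state]
    # Flip positions: tau indices where the state changes (ordered index list).
    full = [+1] + swap + [+1]
    flips = [k for k in range(n + 1) if full[k] != full[k + 1]]
    return best_total, swap, flips

s = [2.0, -1.5, 1.0]          # hypothetical values
t = [0.0, -3.0, -2.0, 0.0]
best, swap, flips = optimize_assignment(s, t)
print(best, swap, flips)      # 9.5 [1, -1, 1] [1, 2]
s_flipped = [c * v for c, v in zip(swap, s)]  # bulk flip of the s values
```

On these made-up values, σ_1 has to be swapped, and the ordered flip indices [1, 2] correspond to the normal form {flip(1, 2)}. Because the artificial boundary τ values are 0, swapping the first or last σ sequence costs nothing, so fixing the outer states to +1 does not restrict the solution. Each step does a constant amount of work, so the runtime grows linearly with the number of sequences. Ties are broken toward keeping the current state; the authors' algorithms additionally guarantee the shortest flip set, which this sketch does not attempt to prove.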
Finally, three minor improvements are suggested: (1) All s and t values v that are 0 are replaced by v = ε, where ε is the smallest representable floating point number that can carry a sign. This replacement does not affect the algorithm result but keeps track of the signs of all s and t values and guarantees that all divisions are defined. (2) The maximum impact of a single s or t value should be limited; the current implementation guarantees ε ≤ |v| ≤ Ω with Ω = 20 for machine-generated s or t values v. (3) The bulk-flip step may optionally be simplified to compute and return only k instead of s′ and t′, since k_i is equivalent to swap_i.
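Improvements (1) and (2) amount to a sign-preserving clamp of every machine-generated value. A minimal sketch, assuming that Python's `math.ulp(0.0)` (the smallest positive subnormal double, available from Python 3.9) plays the role of ε; the function name `sanitize` is hypothetical.

```python
import math

EPS = math.ulp(0.0)   # smallest positive subnormal double (assumed choice of epsilon)
OMEGA = 20.0          # cap on the impact of a single value, as stated in the text

def sanitize(v: float, eps: float = EPS, omega: float = OMEGA) -> float:
    """Clamp |v| to [eps, omega] while keeping the sign (including that of -0.0)."""
    magnitude = min(max(abs(v), eps), omega)
    return math.copysign(magnitude, v)

# A former zero keeps its sign, so swap = s' / s is always defined:
print(sanitize(-0.0) / sanitize(0.0))  # -1.0
```

Using `math.copysign` rather than a comparison with zero preserves the sign of a negative zero, which is what keeps the swap ratio meaningful after the replacement.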
The algorithm runs in linear time $\mathcal{O}(m)$ with regard to the total number of sequences $m=|\Phi_0|=2|\{\tau_i\}|-1$ and is fast enough to be computed in real time. Manually overruled τ_i sequences are assigned a t value of T_MAX = Ω·m + 1, such that machine decisions cannot vote them down, and the most plausible global assignment is adapted accordingly.
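A short bound shows why this choice of T_MAX cannot be outvoted, assuming τ_i is the only manually overruled sequence and every other value v_k is machine-generated, so that |v_k| ≤ Ω. Reverting the effective sign of t_i costs 2·T_MAX, while changing the signs of all other values can gain at most

$$2\sum_{k\neq i}|v_k|\;\le\;2\,\Omega\,(m-1)\;<\;2\,(\Omega\,m+1)=2\,T\_\text{MAX}.$$

Any assignment that contradicts the manual decision can therefore be strictly improved by one additional flip that restores it (for example, a flip that ends at the artificial final τ sequence with t value 0), so the optimal assignment always respects the manual decision.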
For occluded scenes, a revised certainty value c_i that reflects the global confidence of the algorithm may optionally be computed. This revised value combines the known local certainty t_i with a global certainty value ΔT that is computed as a difference between global assignment costs. The algorithm computes the cost of an assignment $T'_{n+1,1}$ in which $t'_i$ is guaranteed to be set in the opposite direction, $t'_i=-T\_\text{MAX}\cdot\text{sign}(t_i)$, and computes $\Delta T=(T_{n+1,1}-t_i)-(T'_{n+1,1}-t'_i)$. The total confidence c_i of the combined certainty values t_i and ΔT can be expressed as a probability measure; the optional computation of all confidence values runs in quadratic time $\mathcal{O}(m^2)$.
3.5 Experimental results
During our project, we processed more than ten thousand multi-chamber videos containing more than a billion single-chamber frames. Our occlusion methods were tested on 8 randomly selected Drosophila courtship videos, each containing 11 chambers with male-female pairs of the same genotype. Each chamber had a diameter of 1 cm and was covered by an anti-reflective glass plate. Videos were recorded from the top at 25 frames per second.
The chamber videos were preprocessed, tracked, postprocessed and manually annotated to establish a ground truth. Of our 88 original chambers, 5 were rejected, either by the preprocessor (wrong number of flies) or due to a lack of manual annotation. The remaining 83 chambers contained 8,421 occlusions and 610,919 frames of two-fly behaviour before copulation.
The identity assignment during σ sequences using the Hungarian algorithm turned out to be extremely reliable. We identified potential problems when flies jump (move rapidly to a random new position within one frame) and therefore specifically detect such jump events and treat them like occlusions. In particular, identities in the sequences before and after a jump event are assigned independently using the Hungarian algorithm, and global identities are then assigned using our global occlusion resolution methods. However, if two flies jumped exactly to each other's place within a single frame, this would trick the jump detector and result in an assignment error within the σ sequence. We recorded videos at 25 frames per second and noticed only two such errors during the entire project, which involved tracking about one billion frames. We did not quantify this error rate further because of its rarity; recording at higher frame rates would further reduce the potential for such errors.
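With only two flies, the assignment problem that the Hungarian algorithm solves has just two candidate solutions, so a sketch can compare them directly. The function below is hypothetical: it matches the two centroids of frame i−1 to those of frame i by minimal total displacement and raises a jump flag when even the best match moves farther than an assumed threshold. The actual jump detector is not described in this section.

```python
import math

def assign_two_flies(prev, curr, jump_threshold):
    """Match fly centroids of frame i-1 (prev) to the detections of frame i (curr).

    For two objects the Hungarian method reduces to comparing both assignments.
    Returns (perm, is_jump): curr[perm[0]] continues fly A, curr[perm[1]] fly B.
    jump_threshold is a hypothetical total displacement limit (e.g. pixels per frame).
    """
    keep = math.dist(prev[0], curr[0]) + math.dist(prev[1], curr[1])
    swap = math.dist(prev[0], curr[1]) + math.dist(prev[1], curr[0])
    perm = (0, 1) if keep <= swap else (1, 0)
    is_jump = min(keep, swap) > jump_threshold
    return perm, is_jump

print(assign_two_flies([(0, 0), (10, 0)], [(9.5, 0), (0.5, 0)], 5.0))  # ((1, 0), False)
print(assign_two_flies([(0, 0), (10, 0)], [(0, 40), (10, 0)], 5.0))    # ((0, 1), True)
print(assign_two_flies([(0, 0), (10, 0)], [(10, 0), (0, 0)], 5.0))     # ((1, 0), False)
```

The last call illustrates the failure mode described above: if the two flies exchange places exactly within one frame, the purely positional match sees no displacement, assigns fly A to the blob at A's old position, and the jump flag stays off.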
Figure 5 contains four examples that demonstrate error patterns and how different methods complement each other. The table rows contain alternating σ and τ sequences. The first column contains a sequence identifier; the second column, the length of each sequence in frames. The following three columns contain the s values of the s-method siz1m (deciding based on size differences) and the t values of the t-methods posm (deciding based on fly positions) and bocT (deciding based on identifier-containing point sets that are ‘carried through’ an occlusion).
The remaining six columns contain identifier assignment results produced by different methods. The first three of these contain decisions of local methods only: each is combined with nothing but ‘zeros’ and is therefore analyzed individually for its assignment decisions. For the last three columns, methods were combined with each other. Each assignment entry contains a <value> and a <decision> (separated by a semicolon); the <value> is the s or t value associated with the given <decision> identifier assignment. Entries with correct <decision>s are colored green; incorrect assignments are bold and red. This implicitly encodes the ground truth.
The first example depicts the typical error pattern of local t-methods. The occlusion τ_{41.56} is wrongly resolved by methods posm and bocT. Method posm therefore misassigns identities in the following σ sequence σ_{41.57} and all σ sequences thereafter, up to the end of the video or up to another misassignment. Apparently, the bocT method already had a misassignment before τ_{41.56}, since identifiers were swapped in σ_{41.55} and σ_{41.56} before occlusion τ_{41.56}. Due to the second misassignment, the identifiers are swapped back, resulting in correct σ sequences after the second misassignment.
That the last three columns are entirely green shows that all three of our combined methods are capable of rescuing this case. The main reason for their success is that s- and t-methods have fundamentally different error patterns. Since combined methods involve both s- and t-methods, a t-method failure may still result in swapped identifiers, but these are typically swapped back immediately, since it is not plausible to swap too many σ sequences despite continuous negative evidence coming from the s-method. Examples three and four depict such ‘double errors’ that are typical for combined methods.
We further want to discuss the robustness of combined methods by examining column siz1m:bocT in the first example. Although method siz1m assigns a score of ε (don’t know) in σ_{41.57}, right after the occlusion τ_{41.56} that bocT gets wrong, and although bocT assigns ε in τ_{41.55}, the occlusion right before the troubled one, the combined method still gets the whole assignment right. How is that possible?
In order to misassign τ_{41.56} according to the evidence coming from bocT, the combined method siz1m:bocT would have to make a double error. The two most obvious options would be to perform either a flip(41.55,41.56) or a flip(41.56,41.57) operation. However, the score for flip(41.55,41.56) is less attractive than for the no-flip case, (−ε − 9.70 + 0.77) < (ε + 9.70 − 0.77): the high confidence of siz1m in σ_{41.56} makes this option unattractive. Similarly, for flip(41.56,41.57), (0.77 − ε − 2.63) < (−0.77 + ε + 2.63); in this case the higher score for τ_{41.57} coming from method bocT itself makes the difference. Flipping even longer sequences, e.g. flip(41.56,41.58), would be even less attractive for the algorithm. The most plausible identifier assignment is therefore determined correctly, despite wrong evidence coming from bocT in τ_{41.56} and two nearby ε values in τ_{41.55} and σ_{41.57}.
The second example depicts a similar case; this time method siz1m misassigns σ_{45.149}, but methods posm and bocT both get this case right. Again, all combined methods come up with the correct assignment, as the combined evidence coming from posm or bocT is stronger than the misleading evidence from siz1m.
In the third example, method siz1m is wrong in sequence σ_{45.49}, and rather confident about it. In this case both combined methods siz1m:posm and siz1m:bocT would fail too; however, method siz1m:posm,bocT, which uses the stronger Dempster-combined evidence from posm and bocT, is still capable of coming up with the correct assignment.
The last example shows a case where siz1m:posm,bocT is wrong. Although methods siz1m, siz1m:posm and siz1m:bocT would solve the case correctly, the combined wrong evidence of posm and bocT outweighs the value coming from siz1m. The video frames of this error instance are depicted in Figure 3d.
Table 1 summarizes our performance evaluation, in which all identifier assignment methods introduced in Section 3 were applied to our annotated test set. The table compares local methods and different combined methods, named according to the scheme <s-method>:<t-method>. Again, a combination with zeros is used to quantify local methods only. All evaluated methods are compared according to two quality measures: (a) the percentage of correct assignments of identifiers before an occlusion to identifiers after that occlusion (occlusion accuracy) and (b) the percentage of correctly assigned frames in unoccluded sequences (frame accuracy).
Methods in Table 1, rows 1 to 4, involve t-methods only. Aside from the local methods posm and bocT, we further evaluated the meta-method CART, a machine learning method that uses a classification tree to come up with an assignment based on multiple t values and occlusion properties such as an occlusion's length or its maximum overlap. Note that CART is still a t-method, as it combines multiple t-methods. We provide results for an overfitted CART_O and a cross-validated CART_C, where 10-fold cross-validation was applied.
Although the local methods assign 94.28% and 91.73% of occlusions correctly, they get only 51.73% and 53.90%, respectively, of unoccluded frames correct. This is due to the error propagation problem outlined in the first example of Figure 5. As expected, the CART approach alone cannot overcome this problem. Although combined t scores lead to a highly improved occlusion accuracy, the error pattern intrinsic to t-methods still leads to low frame accuracy.
In row 5, the size-based s-method is evaluated. Although its occlusion accuracy is similar to that of the t-methods, its frame accuracy is highly improved. This is because the incorporation of s-methods leads to double-error patterns in which wrong occlusion assignments are immediately swapped back. Therefore, such methods typically get only single σ sequences wrong.
All further rows, 6 to 12, contain performance values for combined methods. In rows 6 to 8, methods siz1m:posm, siz1m:bocT and siz1m:posm,bocT show the impact of the dynamic programming approach on the performance of combined methods. As shown in the examples in Figure 5, s-methods and t-methods complement and correct each other. Besides their double-error patterns, which minimize the number of misassigned σ sequences, combined methods further minimize the length of misassigned σ sequences. The s value coming from siz1m is designed to depend on the length of the observable σ sequence, such that long sequences (on which the method performs well) are given high scores and very short sequences (where the method is sometimes wrong) are given low scores. Typical error patterns for combined methods are therefore double errors that contain single short sequences, typically consisting of one or two frames, which explains the high frame accuracy of these combined methods.
We further evaluated the combination of methods of the same type using Dempster's rule of combination [17, 18], and it turned out that the Dempster-combined t-method posm,bocT in row 8 slightly outperformed the simpler dynamic programming combinations in rows 6 and 7.
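For reference, Dempster's rule of combination on the two-element frame {same, swapped} fits in a few lines. This section does not state how the scores of posm and bocT are mapped to mass functions, so the sketch simply assumes that each source provides masses for ‘same’, ‘swapped’ and the whole frame (ignorance), summing to 1; the function name `dempster_binary` is hypothetical.

```python
def dempster_binary(m1, m2):
    """Dempster's rule on the frame {same, swapped}.

    Each argument is (m_same, m_swapped, m_either), summing to 1, where
    m_either is the mass on the whole frame (ignorance).
    """
    s1, w1, e1 = m1
    s2, w2, e2 = m2
    conflict = s1 * w2 + w1 * s2            # mass on the empty intersection
    if conflict >= 1.0:
        raise ValueError("total conflict: the sources cannot be combined")
    norm = 1.0 - conflict
    same = (s1 * s2 + s1 * e2 + e1 * s2) / norm
    swapped = (w1 * w2 + w1 * e2 + e1 * w2) / norm
    either = (e1 * e2) / norm
    return same, swapped, either

r = dempster_binary((0.6, 0.1, 0.3), (0.7, 0.2, 0.1))
print([round(x, 3) for x in r])  # [0.852, 0.111, 0.037]
```

Two sources that each moderately favour ‘same’ (0.6 and 0.7) yield combined support of about 0.85, which is the sense in which Dempster-combined evidence is stronger than either input.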
Finally, the methods evaluated in rows 9 to 12 turned out to yield little or no improvement. In Section 3.3, we mention that probability values are derived from method scores using a linear approximation. In rows 9 and 10, we evaluate combinations with methods posm∼ and bocT∼, where nonlinear approximations are used to derive more precise probability values. However, siz1m:posm∼,bocT∼ corrected only nine more frames than siz1m:posm,bocT.
When combining s-methods with CART methods, the overfitted method siz1m:CART_O outperforms all other methods; however, the cross-validated method siz1m:CART_C shows a decrease in performance. This is mostly because the CART method typically returns very confident scores that are difficult for other methods to correct. The method CART_C is a good example of a t-method that outperforms other t-methods in occlusion accuracy (cf. Table 1: rows 1, 2 and 4) but is still outperformed in terms of frame accuracy due to a lack of combinability (cf. Table 1: rows 8 and 12).
4 Other application: resolving heading
A fly body, or an ellipse covering a fly body, has two ends A and B where the fly's axis crosses the fly's perimeter. Resolving the heading problem means finding out whether end A or end B is the fly's head.
Fortunately, there are several cues from the fly's anatomy and behaviour. First, flies typically walk in a forward direction, so the movement direction of a fly may be used to predict at which end to find the head. Second, the fly's wings typically point backwards; therefore, the vector from Centroid to wCentroid (body centroid to wing centroid) may be used as a second, independent predictor. Finally, the head does not typically flip by 180° within a single frame.
Interestingly, the heading problem may be reduced to the occlusion problem described in Section 3, and the optimization algorithm proposed in Section 3.4 may be applied to solve the heading problem as well.
The idea is to model every single frame as a σ sequence and the ‘artificial gaps’ between frames as τ sequences. The evidence from movement and wings is incorporated as s-methods (again, s-methods have to operate on attributes that are comparable through the entire video), and a known persistence constraint is incorporated as a t-method. The computed s values correspond to probabilities for point A being the head of the fly, and t values correspond to probabilities that point A in the frame before τ is again point A in the frame after τ.
Figure 6 depicts how to model the heading problem as a problem instance of the same optimization problem that previously solved the occlusion problem. Note the strong similarities between Figures 4 and 6; the text written in red in Figure 6 marks the few differences.
In order to resolve the heading, it is sufficient to define the s- and t-methods that incorporate movement, wing anatomy and persistence evidence, and then reuse the very same algorithm and framework as for occlusions.
To this end, the coordinates of the two endpoints A and B (after heading assignment called Head and Tail) and the centroids of the body and wing regions, C and W, are determined for every frame.
Definition 8. Let X_i denote the value of point X in frame i and $\overline{XY}$ denote the Euclidean distance between points X and Y. The s score score_move is defined as $\text{score}_{\text{move}}=\frac{\overline{A_iC_{i-1}}-\overline{B_iC_{i-1}}}{\overline{A_iC_{i-1}}+\overline{B_iC_{i-1}}}$.
Definition 9. The s score score_wing is defined as $\text{score}_{\text{wing}}=\frac{\overline{AW}-\overline{BW}}{\overline{AW}+\overline{BW}}$.
Definition 10. The combined s score is defined as $\text{score}_{\text{move}\otimes\text{wing}}=\max(\text{score}_{\text{move}},\text{score}_{\text{wing}})$.
The score score_move will be positive whenever the fly moved in the direction of A rather than of B; the score score_wing will be positive if A is farther from the centroid of the wing region than B, as expected when A is the head. Note that −1 ≤ score_move ≤ +1 and −1 ≤ score_wing ≤ +1. Both scores are combined by a simple maximum aggregation. From the combined score score_{move⊗wing}, probability approximations and finally log-odds values s may be derived as in the occlusion case.
Definition 11. The t score score_persist is defined as $\text{score}_{\text{persist}}=\frac{\overline{AB}+\overline{BA}-\overline{AA}-\overline{BB}}{\overline{AB}+\overline{BA}+\overline{AA}+\overline{BB}}\cdot\text{Eccentricity}$, where in each distance the first point is taken from the frame before τ and the second point from the frame after τ.
The persistence score satisfies −Eccentricity ≤ score_persist ≤ +Eccentricity, with 0 ≤ Eccentricity ≤ 1. Rescaling the score by the Eccentricity attribute ensures low persistence scores when flies ‘turn upwards’ and thus appear round. From such a position, flies may abruptly change their heading via the z-direction.
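Definitions 8 to 11 can be sketched as plain functions on 2-D points. These helpers are hypothetical, not the authors' code; they assume that the wing score uses the points of the current frame, that in the persistence score the first point of each distance comes from the frame before τ and the second from the frame after τ, and that a degenerate zero denominator yields a neutral score of 0.

```python
import math

def _normalized_difference(d1: float, d2: float) -> float:
    """(d1 - d2) / (d1 + d2), in [-1, 1]; 0 for a zero denominator (assumption)."""
    total = d1 + d2
    return 0.0 if total == 0 else (d1 - d2) / total

def score_move(a_i, b_i, c_prev):
    """Definition 8: positive if end A moved away from the previous body centroid."""
    return _normalized_difference(math.dist(a_i, c_prev), math.dist(b_i, c_prev))

def score_wing(a, b, w):
    """Definition 9: positive if end A is farther from the wing centroid W than end B."""
    return _normalized_difference(math.dist(a, w), math.dist(b, w))

def score_move_wing(a_i, b_i, c_prev, w_i):
    """Definition 10: simple maximum aggregation of both s scores."""
    return max(score_move(a_i, b_i, c_prev), score_wing(a_i, b_i, w_i))

def score_persist(a_prev, b_prev, a_i, b_i, eccentricity):
    """Definition 11: positive if A and B stay close to their previous positions."""
    ab = math.dist(a_prev, b_i)
    ba = math.dist(b_prev, a_i)
    aa = math.dist(a_prev, a_i)
    bb = math.dist(b_prev, b_i)
    total = ab + ba + aa + bb
    return 0.0 if total == 0 else (ab + ba - aa - bb) / total * eccentricity

# Hypothetical fly on the x-axis, moving in +x direction with end A in front.
print(score_move((2, 0), (-2, 0), (-1, 0)), score_wing((2, 0), (-2, 0), (-1.5, 0)))  # 0.5 0.75
```

Swapping the labels of A and B in the current frame negates the persistence score, which is exactly the sign inversion that the flip operation relies on.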
The optimization algorithm (see Algorithm 4) will compute the most plausible heading assignment for all video frames by maximizing $\sum s+\sum t$ under the flip operation introduced in Section 3.3.
For the heading case, the linear-time property of the algorithm is essential, since heading is typically computed for 15,000+ frames, and other algorithms, e.g. quadratic ones, would already become unwieldy for such problem instances.
A performance evaluation on 42,870 manually annotated heading situations resulted in 99.2% correct heading assignments. This number matches the ‘occlusion accuracy’ quality measure that we observed for occlusion problem instances in Table 1 (middle column). The other quality measure in that table is not applicable to heading problem instances.
Typical heading errors occur in sequences where flies actually walk backwards for a longer time, e.g. due to a series of evasive maneuvers.
5 Conclusions
5.1 Summary and discussion
This paper introduces a two-fly tracker that provides solutions for all major challenges in insect tracking: arena detection, segmentation, resolving occlusions, resolving heading and event detection. It was designed to handle legacy videos that were originally not recorded for automated quantification and therefore provides several methods for quality boosting and quality control (see Section 2). The system automates all attention-critical parts: it checks video quality standards, detects arenas, automatically computes the video background, computes thresholds for body and wing segmentation, and computes a global identity assignment across occlusions. It is therefore designed to minimize the required user interaction, which makes it suitable for supporting the analysis of large-scale experiments. Compared to manual behaviour scoring, it provides high-throughput, robust and objective quantification of courtship behaviour videos, overcomes the lack of behaviour quantification, and enables systematic identification of genes and neurons that are critical for this behaviour. This may lead to a better understanding of how nervous systems regulate stereotypic behaviours. Compared to existing tracking software, our system is capable of dealing with quality issues that result from video contamination or human error, and it automatically boosts video quality beyond the limits of the recording conditions. We put considerable effort into the occlusion problem and into automatically identifying flies without any need to mark them. Our system is not limited to specific behaviours; it computes a large number of attributes for both identified flies that facilitate the definition or training of behaviour classifiers. It comes with machine-learned and user-defined classifiers for Drosophila courtship behaviour and its sub-behaviours (cf. Figure 7). The software includes tools for result inspection and bulk submission of videos to a computer cluster.
The identification of the two flies through the entire video was essential for the detection and assignment of biologically relevant events, and resolving occlusions with the highest possible accuracy turned out to be of particular importance. Since manual correction of occlusion assignments requires a lot of user interaction (and user attention), considerable effort was invested in an automated solution to that problem (see Section 3).
Section 3.2 introduced different approaches for solving single occlusions. These methods are called t-methods or local methods, as they focus on individual occlusions without further consideration of their context. Each local method suggests an identifier assignment for each occlusion and gives a certainty score for its decision. Decisions and scores of multiple local methods may be combined in a meta-method; machine-learning-based meta-methods may additionally include observable features that characterize the occlusions.
An intrinsic problem of t-methods, or of combinations of them, is their limited local perspective, which causes misassignments to be propagated until the end of the video. Section 3.3 introduces an optimization approach that incorporates context information. It complements t-methods with s-methods that are based on characteristics of unoccluded sequences that are comparable throughout the whole video. The certainty scores of s- and t-methods are turned into log-odds values that are comparable to each other, such that the most plausible identity assignment for the entire video is achieved by maximizing the sum of these log-odds values under a flip operation (cf. Section 3.3). The optimization algorithm introduced in Section 3.4 solves this optimization problem in linear time.
Within the test set of 83 manually annotated male-female chambers from 8 courtship videos (Section 3.5), the introduced local methods assigned 90% to 95% of the occlusions correctly, the combining meta-methods about 99%, and the optimization approach up to 99.62%. For correctly assigned identities in unoccluded frames, naive meta-methods suffer from the error propagation problem, while the optimization approach improves the accuracy to 99.97%. In other words, of about 6 h and 45 min of unoccluded video, the frames with wrong identity assignment add up to about 7 s.
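A quick arithmetic check of the last sentence, assuming that the 610,919 two-fly frames reported in Section 3.5 are the unoccluded frames meant here and that the 99.97% frame accuracy applies to all of them:

```python
frames = 610_919      # two-fly frames before copulation (Section 3.5)
fps = 25              # recording frame rate
accuracy = 0.9997     # reported frame accuracy of the optimization approach

seconds = frames / fps
hours, rem = divmod(seconds, 3600)
minutes = rem / 60
wrong_seconds = seconds * (1 - accuracy)
print(f"{int(hours)} h {minutes:.1f} min of unoccluded video, ~{wrong_seconds:.1f} s misassigned")
# prints: 6 h 47.3 min of unoccluded video, ~7.3 s misassigned
```

This gives roughly 6 h 47 min and about 7.3 s, consistent with the rounded figures in the text.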
These results are achieved because the algorithm implicitly minimizes the number of misassigned unoccluded frames. This property is inherited from the approximation of the sign test that is used as the s-method.
Further, the algorithm ‘self-corrects’ its mistakes. Potential errors typically occur pairwise, one real error immediately followed by a second one that compensates the first, as it is not plausible that identities are wrong from a given point to the end of the video. Wrongly assigned identities are therefore a local problem and do not affect the rest of the video.
These two error patterns, pairwise errors that keep misassignments local and short unoccluded sequences as the only potential error domains, explain the low number of misidentified unoccluded frames and are desirable properties of the algorithm.
On many occasions, users accidentally degraded the quality of ground truth that had previously been pre-annotated automatically. This suggests that the automated occlusion resolution method may, in many cases, be more reliable than a human annotator. However, the automatically derived occlusion assignments can still be manually inspected and overruled, and an annotating user may sort occlusions by machine-given confidence values.
Another desirable property of the algorithm is that it runs very efficiently, in linear time $\mathcal{O}(|\Phi_0|)$. This enables its application to large problem instances. Section 4 describes how heading assignment can be modelled as an instance of the very same optimization problem; the very same optimization algorithm then derives the most plausible heading assignment for every frame.
5.2 Future work
We have a system that identifies flies and extracts various attributes; we implemented more than 1,000 automatically observable shape and constellation descriptors. The system currently comes with classifiers for courtship behaviour and its sub-behaviours that allow observed behaviours to be visualized as automatically generated ethograms.
Figure 7 depicts an ethogram that visualizes courtship events for wild-type males and females. Our current courtship classifiers are sex-specific (compare left vs. right) and deliver the expected results for known mutants (cf. [10]).
We aim to support definition and training for classifiers that capture further meaningful behaviours.
The overall system is currently being transcoded from Matlab to C++ and is optimized for performance such that it runs on a standard laptop within a reasonable computation time.
References
1. Baker BS, Taylor BJ, Hall JC: Are complex behaviors specified by dedicated regulatory genes? Reasoning from Drosophila. Cell 2001, 105:13–24. 10.1016/S0092-8674(01)00293-8
2. Dickson BJ: Wired for sex: the neurobiology of Drosophila mating decisions. Science 2008, 322:904–909. 10.1126/science.1159276
3. Pierce-Shimomura JT, Dores M, Lockery SR: Analysis of the effects of turning bias on chemotaxis in C. elegans. J Exp Biol 2005, 208(Pt 24):4727–4733.
4. Cronin CJ, Mendel JE, Mukhtar S, Kim YM, Stirbl RC, Bruck J, Sternberg PW: An automated system for measuring parameters of nematode sinusoidal movement. BMC Genet 2005, 6(1):5. 10.1186/1471-2156-6-5
5. Feng Z, Cronin CJ, Wittig Jr JH, Sternberg PW, Schafer WR: An imaging system for standardized quantitative analysis of C. elegans behavior. BMC Bioinformatics 2004, 5:115. 10.1186/1471-2105-5-115
6. Baek JH, Cosman P, Feng Z, Silver J, Schafer WR: Using machine vision to analyze and classify Caenorhabditis elegans behavioral phenotypes quantitatively. J Neurosci Methods 2002, 118(1):9–21. 10.1016/S0165-0270(02)00117-6
7. Martin JR: A portrait of locomotor behaviour in Drosophila determined by a video-tracking paradigm. Behav Processes 2004, 67(2):207–219. 10.1016/j.beproc.2004.04.003
8. Dankert H, Wang L, Hoopfer ED, Anderson DJ, Perona P: Automated monitoring and analysis of social behavior in Drosophila. Nature Methods 2009, 6:297–303. 10.1038/nmeth.1310
9. Branson K, Robie AA, Bender J, Perona P, Dickinson MH: High-throughput ethomics in large groups of Drosophila. Nature Methods 2009, 6:451–457. 10.1038/nmeth.1328
10. Schusterreiter C: Computational analysis of Drosophila courtship behaviour. Thesis, University of Vienna 2011.
11. Simon JC, Dickinson MH: A new chamber for studying the behavior of Drosophila. PLoS One 2010, 5(1):e8793. 10.1371/journal.pone.0008793
12. Hoyer SC, Eckart A, Herrel A, Zars T, Fischer SA, Hardie SL, Heisenberg M: Octopamine in male aggression of Drosophila. Curr Biol 2008, 18(3):159–167. 10.1016/j.cub.2007.12.052
13. Ramdya PP, Schaffter T, Floreano D, Benton R: Fluorescence Behavioral Imaging (FBI) tracks identity in heterogeneous groups of Drosophila. PLoS One 2012, 7(11):
14. Kuhn HW: The Hungarian method for the assignment problem. Naval Res Logistics Q 1955, 2(1–2):83–97.
15. Gabriel P, Verly J, Piater J, Genon A: The state of the art in multiple object tracking under occlusion in video sequences. Advanced Concepts for Intelligent Vision Systems (2003), pp 166–173.
16. Aurenhammer F: Voronoi diagrams—a survey of a fundamental geometric data structure. ACM Comput Surv (CSUR) 1991, 23(3):345–405. 10.1145/116873.116880
17. Dempster AP: Upper and lower probabilities induced by a multivalued mapping. Ann Math Stat 1967, 38(2):325–339. 10.1214/aoms/1177698950
18. Shafer G: A Mathematical Theory of Evidence. Princeton University Press; 1976.
Competing interests
The authors declare that they have no competing interests.
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 2.0 International License (https://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
About this article
Cite this article
Schusterreiter, C., Grossmann, W. A two-fly tracker that solves occlusions by dynamic programming: computational analysis of Drosophila courtship behaviour. J Image Video Proc 2013, 64 (2013). https://doi.org/10.1186/1687-5281-2013-64
Keywords
 Fly tracker
 Ethogram
 Occlusion
 Dynamic programming
 Drosophila
 Courtship behaviour
 Quantification
 Pattern recognition