# Geometric bounding box interpolation: an alternative for efficient video annotation

Pedro Gil-Jiménez, Hilario Gómez-Moreno, Roberto López-Sastre and Saturnino Maldonado-Bascón

**2016**:8

https://doi.org/10.1186/s13640-016-0108-7

© Gil-Jiménez et al. 2016

**Received: **26 May 2015

**Accepted: **7 February 2016

**Published: **22 February 2016

## Abstract

In video annotation, instead of annotating every frame of a trajectory, usually only a sparse set of annotations is provided by the user: typically its endpoints plus some key intermediate frames, with the remaining annotations interpolated between these key frames in order to reduce the cost of the video labeling. A number of video annotation tools have been proposed, some of which are freely available, but their bounding box interpolation is mainly based on image processing techniques whose performance is highly dependent on image quality, occlusions, etc. We propose an alternative method to interpolate bounding box annotations, based on cubic splines and the geometric properties of the elements involved, rather than on image processing techniques.

The proposed algorithm is compared with other bounding box interpolation methods described in the literature, using a set of selected videos modeling different types of object and camera motion. Experiments show that the accuracy of the interpolated bounding boxes is higher than that of the other evaluated methods, especially when considering rigid objects. The main contribution of this paper concerns the bounding box interpolation step, and we believe that our design can be integrated seamlessly with any annotation tool already developed.

## 1 Introduction

The growth in image and video processing demands larger quantities of annotated training data. A still image can be annotated at different levels. At the image level, annotations consist of simply indicating the objects present in the image, without any spatial information. At the pixel level, each pixel must be assigned to a label. Intermediate solutions may consist of drawing polygons surrounding the object or simple geometric shapes, such as rectangles or ellipses. The complexity, and thus, the cost or time required to annotate an image, depends on the level at which we need to annotate it.

Large-scale image databases are useful for supervised learning of visual object appearance. There are many publicly available databases. For instance, the PASCAL VOC Challenge [1] provides a standard dataset of annotated images at the bounding box level, for object categorization and detection. The images are collected from the flickr photo-sharing web site, and annotations are done at a single annotation “party”, to ensure consistency. The ImageNet database [2] offers a collection of tens of millions of annotated images hierarchically organized, in this case, at the image level.

Many applications have been proposed in the literature for the task of image annotation, both for personal use and for massive annotation. In [3], Russell et al. proposed *LabelMe*, a web-based open tool that allows polygon-based image annotations. This tool can be accessed freely by any researcher, either to annotate their own images or to download the already annotated image collections. In [4], von Ahn and Dabbish took advantage of the *Games With A Purpose* (GWAP) philosophy to develop an online game which allows image annotation while the users are enjoying the game. In this case, images can only be annotated at the image level. The *Human Intelligence Task* (HIT) service provided by *Amazon Mechanical Turk* was explored by Sorokin and Forsyth [5] to outsource human annotation services at a low cost. This approach enables the annotation of large image databases at the bounding box level in a short time.

Video annotation can also be done at different levels. For instance, ANVIL [6] is a free tool that allows video annotations at the frame level, that is, without any spatial information. However, when object location is required, the temporal evolution of the annotated objects must also be provided. Manually annotating the object location for all the frames in the video sequence is a tedious task. This is typically solved by allowing the user to provide only a limited set of annotations for a given object, then propagating this to the rest of the frames. The frames where the annotations are supplied are called *key frames*, and this set must usually include the first and last frames where the object is visible, in order for the system to know the endpoints of the object’s trajectory.

The number of key frames needed depends on many factors, such as the object type (rigid or non-rigid), trajectory length (number of frames), type of object motion (straight, curved, or chaotic), and the accuracy required for the propagated annotations. The number and position of these key frames can be fixed or defined by the user. Vondrick et al. [7] discovered that ‘a fixed key frame schedule is significantly faster than a user-defined key frame schedule.’ However, the key frame frequency needs to be adjusted depending on the type of motion to annotate. When an object moves in a chaotic trajectory, more annotations are needed in order to accurately handle the chaotic motion. In this paper, however, we are not concerned with this issue and will use both schedules indiscriminately.

Propagation across frames has typically been performed by means of image and video processing techniques. In [8], Badrinarayanan et al. proposed a propagation scheme, given the pixel-level annotation of the first and last frames of a video sequence. The algorithm is based on a hidden Markov model and labels every pixel in all the frames between the endpoints. In [9], Liu et al. proposed a contour tracker to track the contour provided by the user in the first frame. The user can specify a time-varying depth value for each annotated object, intended to handle occlusions and 3D linear interpolation. The user may also be requested for an inspection to correct tracker mistakes. Contour tracking is also used in [10] to track annotated curves in a video sequence for *rotoscoping*, i.e., the process of tracking shapes throughout a video sequence. In [11], Kalal et al. proposed a long-term tracker of a bounding box provided by the user in the first frame. The system is able to automatically re-detect the object after a period when it is not in the field of view or after tracking failure, by learning the object appearance from the previous frames.

Propagation can also be done by means of the geometric properties of the elements involved. In this case, the user provides a set of simple (rectangle, ellipse, etc.) or complex shapes, and the system computes the coordinates of the polygon for the remaining frames. For this case, the main techniques described in the literature will be analyzed in Section 2, where different publicly available annotation tools will be introduced. Section 3 describes the alternative procedure proposed in this paper. Section 4 describes the experiments performed to compare the existing software applications. Section 5 concludes the paper.

## 2 Related works

In this paper, we propose an alternative method to propagate bounding box annotations between key frames. For that reason, we will analyze here the most popular software packages for video annotation at the bounding box level, focusing on the propagation algorithms proposed in each.

### 2.1 Available software

*VIdeo Performance Evaluation Resource* (ViPER) is one of the earliest video annotation tools provided by the research community [12]. The initial goal was to create a flexible ground truth format and a tool for ground truth generation and sharing. It is a JAVA application that allows the user to define object categories (people, cars, animals, etc.) and different attributes, including spatial ones (bounding box, bounding ellipse, points, etc.). The software includes a default linear interpolation utility. For this interpolator, if we define \(\mathbf{b}_{i}=(x_{i}, y_{i}, w_{i}, h_{i})\) to be the coordinates of a bounding box at \(t=t_{i}\), where \((x_{i}, y_{i})\) are the coordinates of the center of the bounding box and \((w_{i}, h_{i})\) are its width and height, then, given two consecutive annotations \(\mathbf{b}_{i}\) and \(\mathbf{b}_{i+1}\), the coordinates of the bounding box for a given time \(t=t_{j}\in[t_{i}, t_{i+1}]\) are

$$\mathbf{b}_{j} = \mathbf{b}_{i} + \frac{t_{j}-t_{i}}{t_{i+1}-t_{i}}\left(\mathbf{b}_{i+1}-\mathbf{b}_{i}\right) \qquad (2)$$

Linear interpolation can be regarded as the simplest approach [7] for label propagation. This method has two main drawbacks. On the one hand, linear interpolation can only handle straight trajectories, since the interpolations for adjacent segments are computed independently. On the other hand, it cannot accurately capture scale transformations due to perspective effects. A recent update can be found in [13], which allows semiautomatic video annotation. Nevertheless, that system is based on foreground/background motion detection, which limits its use to videos recorded with static cameras.
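The linear interpolation scheme described above can be sketched in a few lines of Python; the box layout (cx, cy, w, h) and the function name are our own choices, not part of ViPER:

```python
def lerp_box(b_i, b_j, t_i, t_j, t):
    """Linearly interpolate a bounding box (cx, cy, w, h) between two
    key-frame annotations given at times t_i and t_j (ViPER-style)."""
    a = (t - t_i) / (t_j - t_i)            # normalized position in [0, 1]
    return tuple(p + a * (q - p) for p, q in zip(b_i, b_j))

# Halfway between the key frames, every coordinate is the arithmetic mean.
print(lerp_box((10, 10, 40, 20), (50, 30, 20, 10), t_i=0, t_j=10, t=5))
# → (30.0, 20.0, 30.0, 15.0)
```

Note that every coordinate is interpolated independently in the image plane, which is precisely why the method cannot follow curved trajectories or perspective-induced scale changes.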

Vondrick et al. proposed in [7] a user interface to be deployed on Amazon Mechanical Turk (MT) to outsource bounding box annotations using its HIT services. In their work, they analyzed several aspects related to this HIT service, such as interface design, the evaluation of a worker’s talents, and the optimal key frame frequency. Focusing on the bounding box propagation technique, they proposed two different algorithms, the simplest one being the linear interpolation discussed above. The other method is based on dynamic programming, using a visual model of the annotated object. They extract the model from the *N* bounding boxes provided, compute a feature descriptor composed of HOG and color features, and train a linear SVM classifier from positive and negative (background) samples. The main drawback is that the algorithm is highly dependent on the image quality, and the system can even fail in the case of object occlusions. The system includes a constant to allow the visual interpolator to gracefully degrade to linear interpolation when the former becomes inaccurate.

Instead of relying on visual properties, the interpolation can be done using the geometric properties of the elements involved. In this case, when interpolating coordinates, we can use image or spatial coordinates.

Consider a point \(\mathbf{Q}\) moving in front of a camera, as in Fig. 1. At \(t=t_{i}\), the point will be located at \(\mathbf{Q}_{i}\) in spatial coordinates, and its image will be at \(\mathbf{q}_{i}\), and similarly for \(t=t_{i+1}\). If we suppose constant motion for the point, at \(t_{j}=(t_{i}+t_{i+1})/2\) the point will be located at \(\mathbf{Q}_{j}\), the intermediate position between the points in space, and its image at \(\mathbf{q}_{j}\). However, if we interpolate using image coordinates, the point \(\hat {\mathbf {q}}_{j}\) will be located at the intermediate position between the image points, and as can be seen in Fig. 1, the coordinates \(\mathbf{q}_{j}\) and \(\hat {\mathbf {q}}_{j}\) do not coincide, due to the effects of perspective.
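This perspective effect is easy to verify numerically with a one-dimensional pinhole model (a sketch under the canonical camera \(x=fX/Z\); all names are ours):

```python
def project(X, Z, f=1.0):
    """Central projection of a plane point (X, Z) onto the image line: x = f*X/Z."""
    return f * X / Z

# A point moving with constant velocity while also changing depth.
Q_i, Q_j = (0.0, 2.0), (4.0, 6.0)                # (X, Z) at t_i and t_{i+1}
Q_mid = ((Q_i[0] + Q_j[0]) / 2, (Q_i[1] + Q_j[1]) / 2)  # true spatial midpoint

q_mid_true = project(*Q_mid)                         # image of the spatial midpoint
q_mid_image = (project(*Q_i) + project(*Q_j)) / 2    # midpoint of the two images

print(q_mid_true, q_mid_image)   # 0.5 vs. 0.333…, the two do not coincide
```

When the depth \(Z\) is constant between the two instants, the two quantities coincide; the discrepancy appears exactly when the trajectory has a depth component.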

However, motion parallel to the image plane is not the general situation. For object 2 in Fig. 2, the variation in depth is large. In this situation, the assumption of parallel motion does not hold, and the accuracy of the bounding box interpolation decreases, as we will see later in Section 4.

To take into account this effect, Yuen et al. [14] pointed out that ‘a constant velocity in space does not project to a constant velocity in the image plane, due to perspective effects.’ In their work, they proposed an alternative method to project the coordinates of the annotated object to the image plane for any time between two key frames.

The position of a point at time \(t\) is computed using (3), where \((x_{0}, y_{0})\) are the coordinates of the point in the frame at \(t=0\), and \(p_{\mathrm{v}}=(x_{\mathrm{v}}, y_{\mathrm{v}})\) are the coordinates of the vanishing point computed as described above. \(\lambda(t)\) is the displacement of the point along the direction of the motion in space, and since constant motion is assumed, we have \(\lambda(t)=v\,t\), where \(v\) is the velocity of the point in space and can be computed, given a second key frame at \(t=t_{1}\), using (4).

This approach has two main drawbacks. The first concerns the computation of the vanishing point. Suppose we have two consecutive annotations at \(t=t_{i}\) and \(t=t_{i+1}\), as in Fig. 3, and we trace the lines joining the four pairs of corners. As stated before, the cross point should correspond to the vanishing point of the trajectory. However, in the general situation, annotation errors and changes in object aspect ratio will make the lines fail to meet at the same point. For some points in the last figure, some of the cross points are even located outside the image plane. Furthermore, for motion parallel to the image plane, the vanishing point will lie at infinity, so that in (3) there is an indetermination that must be resolved some other way. The second drawback is that this schema can only be employed between a pair of key frames, while assuming straight paths in between, which is not a correct assumption for object motion in general.
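The corner-line construction can be sketched with homogeneous coordinates, where the line through two points and the intersection of two lines are both cross products. Boxes are assumed to be given as corner coordinates (x1, y1, x2, y2); the helper names are ours:

```python
def cross(a, b):
    """Cross product of two homogeneous 3-vectors."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def line_through(p, q):
    """Homogeneous line through two image points."""
    return cross((p[0], p[1], 1.0), (q[0], q[1], 1.0))

def vanishing_point(box_i, box_j):
    """Estimate the vanishing point of a straight trajectory as the crossing
    of the lines joining matching corners (top-left and bottom-right here)
    of two consecutive bounding boxes (x1, y1, x2, y2)."""
    l_tl = line_through((box_i[0], box_i[1]), (box_j[0], box_j[1]))
    l_br = line_through((box_i[2], box_i[3]), (box_j[2], box_j[3]))
    vx, vy, w = cross(l_tl, l_br)
    if abs(w) < 1e-12:
        return None            # parallel motion: vanishing point at infinity
    return (vx / w, vy / w)

# A box shrinking toward the point (100, 20):
print(vanishing_point((0, 100, 180, 140), (25, 80, 160, 110)))   # → (100.0, 20.0)
```

The degenerate return value for parallel motion makes the indetermination mentioned above explicit: the code must fall back to some other interpolation in that case.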

The second drawback is solved by Lee et al. in [15, 16] by interpolation using cubic splines [17]. In their work, they proposed a tool for collaborative video annotation, mainly intended for interactive services (i.e., smart TV) in the multimedia industry. They also proposed a dynamic sampling-based cubic spline interpolation (DSCSI) algorithm to reduce the size of the trajectory data on a “clickable video”. Focusing on the interpolation part, the object position is computed using cubic spline interpolation (CSI) from a sparse set of annotations. Nevertheless, the interpolation is performed using image coordinates instead of spatial coordinates, so that, as explained above, it is not able to accurately model object trajectories with large depth variations.
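Image-coordinate spline interpolation of this kind can be illustrated with a self-contained sketch. Lee et al. use SciPy's cubic splines; here a Catmull-Rom spline (a simple cubic interpolant) stands in for `scipy.interpolate.CubicSpline`, and the box layout (cx, cy, w, h) and function names are our assumptions:

```python
def catmull_rom(p0, p1, p2, p3, u):
    """Cubic (Catmull-Rom) interpolation between p1 and p2, u in [0, 1].
    A self-contained stand-in for SciPy's CubicSpline."""
    return 0.5 * (2 * p1
                  + (-p0 + p2) * u
                  + (2 * p0 - 5 * p1 + 4 * p2 - p3) * u ** 2
                  + (-p0 + 3 * p1 - 3 * p2 + p3) * u ** 3)

def spline_box(key_boxes, seg, u):
    """Interpolate each image-plane coordinate (cx, cy, w, h) independently
    between key frames seg and seg+1; endpoint control boxes are duplicated."""
    pad = [key_boxes[0]] + list(key_boxes) + [key_boxes[-1]]
    q0, q1, q2, q3 = pad[seg:seg + 4]
    return tuple(catmull_rom(q0[k], q1[k], q2[k], q3[k], u) for k in range(4))

keys = [(0, 0, 10, 10), (10, 5, 12, 12), (20, 0, 10, 10)]
mid = spline_box(keys, 0, 0.5)     # smooth, but purely image-plane interpolation
```

The spline passes exactly through the key-frame boxes and bends smoothly between them, yet, as the text notes, it still operates on image coordinates and therefore inherits the perspective problem.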

To address these problems, in this paper we propose a new interpolation algorithm that combines a 3D reconstruction schema, similar to the one proposed in [14], with the cubic spline interpolation used in [16]. The main difference with respect to [14] is that our algorithm performs a continuous reconstruction in the spatial coordinates, as will be described in Section 3.3. This improvement allows us to use cubic splines, instead of linear interpolation, to interpolate the spatial bounding box coordinates when modeling object trajectories. The main difference with respect to [16] is that the proposed 3D reconstruction allows us to model more accurately the trajectory of objects when they exhibit large depth variations.

## 3 Geometric interpolation

Given two consecutive annotations \(\mathbf{b}_{i}\) and \(\mathbf{b}_{i+1}\) of an object, the key idea is to find the coordinates \(\mathbf{B}_{i}\) and \(\mathbf{B}_{i+1}\) of the bounding box in space. Having that, computing the coordinates \(\mathbf{b}_{j}\) of the bounding box in the image plane for any time \(t=t_{j}\) is straightforward, assuming we know the motion model of the object in space.

We will demonstrate that using only bounding box coordinates, we can perform a 3D reconstruction of the bounding box positions. If the camera’s internal parameters are not known, i.e., we are working with an uncalibrated camera, the 3D positions can be computed only up to a scale transformation. Nevertheless, we will see that actual 3D coordinates are irrelevant for bounding box interpolation, and so, no camera calibration is needed.

Prior to defining the interpolation schema proposed in this paper, we introduce briefly the basic concepts related with central projection and the camera model needed. The notation has been taken from [18] where possible. We will first describe the interpolation technique for a simpler two-dimensional case and then extend the concepts to three dimensions.

### 3.1 Two-dimensional camera model

Consider the *line camera* shown in Fig. 5. This camera projects points on the plane \(XZ\) onto an *image line*. Let the center of projection be the point \(\mathbf{C}=(X_{\mathrm{C}}, Z_{\mathrm{C}}, 1)^{\mathrm{T}}\), expressed in homogeneous coordinates, and let \(\mathbf{l}=(a, b, c)^{\mathrm{T}}\) be the homogeneous vector representing the image line. The distance from the line \(\mathbf{l}\) to the point \(\mathbf{C}\) is \(f\), which is called the *focal length* of the camera. The line from the center of projection, normal to the image line, is called the *principal axis* of the camera, and the cross point between the image line and the principal axis is called the *principal point*.

A point \(\mathbf{Q}=(X_{\mathrm{Q}}, Z_{\mathrm{Q}}, 1)^{\mathrm{T}}\) in the plane is mapped to the point \(\mathbf{q}=(x_{\mathrm{q}}, w_{\mathrm{q}})^{\mathrm{T}}\), in homogeneous coordinates, in the image line. This point will be located where the line joining the point \(\mathbf{Q}\) with the center of projection \(\mathbf{C}\) meets the image line. The homogeneous coordinates of \(\mathbf{q}\) can be computed from the *projection matrix* \(\mathrm{P}\):

$$\mathbf{q} = \mathrm{P}\,\mathbf{Q}, \qquad \mathrm{P} = \mathrm{K}\,\mathrm{R}\,[\,\mathrm{I} \mid -\widetilde{\mathbf{C}}\,]$$

where \(\mathrm{K}\) is the *internal parameter* matrix, \(\mathrm{R}\) is the 2×2 camera *rotation* matrix, \(\mathrm{I}\) is the 2×2 identity matrix, and \(\widetilde {\mathbf {C}}=\{X_{\mathrm {C}}, Z_{\mathrm {C}}\}\) is the center of projection expressed in inhomogeneous coordinates. In order to simplify the problem, suppose that the camera is placed at a canonical position, that is, suppose that the center of projection is the origin of the coordinate system, \(\mathbf{C}=(0,0,1)^{\mathrm{T}}\), and that the principal axis is the \(Z\) axis (\(\alpha=0\)), as in Fig. 6. Furthermore, we can have the principal point be the origin of the image coordinate system. In this situation, \(\mathrm{P}\) reduces to

$$\mathrm{P} = \begin{pmatrix} f & 0 & 0 \\ 0 & 1 & 0 \end{pmatrix}$$

### 3.2 Reconstruction for a single segment

Now, consider an object moving in front of a line camera from *t*=*t*
_{
i
} to *t*=*t*
_{
i+1}, as shown in Fig. 6. In this section, we will consider a trajectory defined by only one segment, that is, only two key frames, at the start and endpoints of the trajectory, without intermediate key frames. In the following sections, we will extend the method to any number of segments.

Let a and b be the two extreme points of the object, so that the depth \(Z\) for both points is the same. Under a camera matrix \(\mathrm{P}\), the point a is imaged onto the image line at \(x_{i}^{\mathrm{a}}\), and similarly point b at \(x_{i}^{\mathrm{b}}\). If the displacement of the object from \(t_{i}\) to \(t_{i+1}\) is \(\Delta_{i}=\{\Delta X_{i}, \Delta Z_{i}\}\), then the coordinates for point a at \(t=t_{i+1}\) will be \(\{X_{i}^{\mathrm{a}}+\Delta X_{i},\; Z_{i}+\Delta Z_{i}\}\), and likewise for point b.

Before solving the reconstruction problem, we must consider the role of the focal length \(f\). As can be seen in Fig. 9, for an image point \(\mathbf{q}_{i}\) at \(t=t_{i}\), if we assume the point depth to be \(Z=Z_{i}\) and use a camera with focal length \(f\), the point is back projected to \(\widetilde {\textbf {Q}}_{i}=\{X_{i}, Z_{i}\}\). Now, changing the camera focal length to \(f'=k_{\mathrm{E}}\cdot f\) while keeping the same depth \(Z_{i}\), the point coordinates will change to \(\widetilde {\textbf {Q}}'_{i}=\{X'_{i}, Z_{i}\}\). By similar triangles, we have \(X_{i}=k_{\mathrm{E}}\cdot X'_{i}\), and the same for \(t=t_{i+1}\). Since our final interest is not the reconstruction itself, but the computation of the images of intermediate positions, it can easily be seen that the interpolation is independent of the particular focal length chosen. For instance, if we interpolate the image coordinates for the point at the middle position between \(t_{i}\) and \(t_{i+1}\), assuming constant motion, the interpolated position obtained with focal length \(f\) coincides with the one obtained with focal length \(f'\); that is, both points are imaged onto the same image coordinate. Figure 9 shows a graphical example.
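This focal-length independence can be checked numerically. The sketch below back-projects two image points at an assumed depth, takes the spatial midpoint, and re-projects it, for two different focal lengths (all names are our assumptions):

```python
def back_project(x, Z, f):
    """Back-project image coordinate x to depth Z: the X such that x = f*X/Z."""
    return x * Z / f

def midpoint_image(x_i, x_j, Z_i, Z_j, f):
    """Reconstruct with an arbitrary focal length f, take the spatial midpoint
    assuming constant motion, and re-project it with the same camera."""
    X_i, X_j = back_project(x_i, Z_i, f), back_project(x_j, Z_j, f)
    X_m, Z_m = (X_i + X_j) / 2, (Z_i + Z_j) / 2
    return f * X_m / Z_m

# The interpolated image coordinate is the same for any focal length:
a = midpoint_image(0.2, 0.8, 3.0, 5.0, f=1.0)
b = midpoint_image(0.2, 0.8, 3.0, 5.0, f=7.5)
```

Algebraically, the result reduces to \((x_{i}Z_{i}+x_{j}Z_{j})/(Z_{i}+Z_{j})\), in which \(f\) cancels out, which is the property exploited in the text.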

Without loss of generality, we can therefore set \(f=1\). Reorganizing the initial equations,

The last result forms a system of linear equations in the unknowns \(\mathbf {B}_{i}=[X_{i}^{\mathrm {a}}, X_{i}^{\mathrm {b}}, Z_{i}, \Delta X_{i}, \Delta Z_{i}]\), which can be computed from its null space. Since (18) is a homogeneous system of equations, \(\mathbf{B}_{i}\) can only be found up to a constant factor \(k_{\mathrm{s}}\). From the geometric point of view, this means that the space is reconstructed up to a scale transformation. However, under central projection, two points related by a scale transformation are imaged onto the same image coordinates. This means that the particular constant factor \(k_{\mathrm{s}}\) chosen for \(\mathbf{B}_{i}\) is not important, since only the relation between the variables matters. We will come back to this concept when analyzing trajectories composed of more than one segment.

### 3.3 Reconstruction from multiple segments

Given \(N\) key frames, we could apply the same process independently for the \(N-1\) segments. However, since the computations are independent, the results would give disjoint reconstructed segments. To see this, suppose we have three different annotations for an object, \(\mathbf{b}_{i}\), \(\mathbf{b}_{i+1}\), and \(\mathbf{b}_{i+2}\), as in Fig. 10.

The reconstruction for the segment defined by \(\mathbf{b}_{i}\) and \(\mathbf{b}_{i+1}\) gives \(\mathbf{B}_{i}\) and \(\mathbf{B}_{i+1}\). Likewise, for points \(\mathbf{b}_{i+1}\) and \(\mathbf{b}_{i+2}\), we have \(\mathbf{B}'_{i+1}\) and \(\mathbf{B}'_{i+2}\). The reconstructed segments will be, in general, disjoint. Although both reconstructions are correct, they are only useful if further interpolations are performed independently for each segment, i.e., assuming straight paths between key frames. However, typical object motion is better modeled with interpolation schemes which need more samples. As we will see later, we will use cubic spline interpolation, a method which uses more than two samples to compute any interpolation. To this end, we need the reconstructed trajectory to be continuous along the whole path.

To achieve this, we can adjust the scale factor \(k_{\mathrm{s}}\), introduced before, for the second segment so as to have \(\hat {\mathbf {B}}'_{i+1}=\mathbf {B}_{i+1}\).
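The continuity constraint can be sketched as a simple rescaling of the next segment's reconstruction. The tuple layout follows the unknowns \(\mathbf{B}_{i}=[X^{\mathrm{a}}, X^{\mathrm{b}}, Z, \Delta X, \Delta Z]\); the helper names are ours:

```python
def rescale(rec, k):
    """Multiply every reconstructed coordinate by k; under central projection
    this leaves all projected image coordinates unchanged."""
    return tuple(k * v for v in rec)

def chain(prev, nxt):
    """Choose k_s for the next segment so that its starting depth equals the
    ending depth of the previous segment, making the reconstructed trajectory
    continuous. rec = (X_a, X_b, Z, dX, dZ)."""
    Z_end_prev = prev[2] + prev[4]        # Z + dZ at the shared key frame
    k_s = Z_end_prev / nxt[2]
    return rescale(nxt, k_s)

prev = (0.25, 0.5, 1.0, 0.25, 0.5)        # first segment, its own scale
nxt = (0.1, 0.2, 1.0, 0.05, 0.3)          # second segment, a different scale
chained = chain(prev, nxt)                # now starts at depth 1.5, like prev ends
```

The rescaling is harmless precisely because of the scale ambiguity discussed in the previous section: the projected images of the second segment are unchanged, but the two segments now share one consistent coordinate frame.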

### 3.4 Three-dimensional case

Extrapolating the problem to three dimensions is not straightforward. A first approach could be to perform the reconstruction independently for the horizontal and vertical coordinates, but this would yield different depth values (\(Z_{i}\) and \(\Delta Z_{i}\)) for the horizontal and vertical reconstructions. Although we can adjust the scale factor \(k_{\mathrm{s}}\) of one of the coordinates to make the depth values \(Z_{i}\) equal for both coordinates, the other value, \(\Delta Z_{i}\), will in general be different, and so will be the object depth for \(t=t_{i+1}\). Obviously, however, the depth position should be the same for both coordinates along the whole path, since it is the same point.

Consider an object annotated at \(t=t_{i}\) and \(t=t_{i+1}\), as in Fig. 12. As the object moves away from the camera, its size will decrease, and conversely as it approaches. If the aspect ratio of the annotated rectangle were kept constant between the key frames, this would be the only variation in both height and width.

In that case, both the width and the height of the bounding box can be used to recover the depth values \(Z\) and \(\Delta Z\). Applying the equation for the computation of the depth information, substituting \(s \rightarrow w\) for the horizontal coordinate and \(s \rightarrow h\) for the vertical coordinate, we have \({Z_{i}^{x}}=w_{i+1}\) and \(\Delta {Z_{i}^{x}}=w_{i}-w_{i+1}\) for the \(x\) coordinate, and \({Z_{i}^{y}}=h_{i+1}\) and \(\Delta {Z_{i}^{y}}=h_{i}-h_{i+1}\) for the \(y\) coordinate. We can use the scale correction factor \(k_{\mathrm{s}}\) introduced before to have \(\hat {Z'}_{i}^{y}={k_{i}^{s}} \cdot {Z_{i}^{y}}={Z_{i}^{x}}\), that is, the depth position at the beginning of the segment will be the same for the horizontal and vertical reconstructions. In this situation, the depth shift for the vertical reconstruction will be

$$\Delta \hat{Z'}_{i}^{y}={k_{i}^{s}} \cdot \Delta {Z_{i}^{y}} = \frac{w_{i+1}}{h_{i+1}}\left(h_{i}-h_{i+1}\right) = w_{i}-w_{i+1} = \Delta {Z_{i}^{x}}$$

that is, the computed depths for the *x* and *y* coordinates are the same, as desired. However, if the aspect ratio is not maintained, the last equation does not hold, and so, the depth shifts for the horizontal and vertical reconstructions do not coincide.

The aspect ratio of an imaged object is only maintained when considering planar rigid objects that move without rotating. But even in this hypothetical case, errors in object annotations will generally result in the bounding box aspect ratio not being maintained. When annotating non-rigid objects, the problem gets worse. For instance, consider a person standing up in one key frame, and the same person crouching in the next key frame. In that situation, the bounding box height will halve, whereas the width will be the same, or even increase.

To handle this situation, we define the scale factor of the object from \(t_{i}\) to \(t_{i+1}\) due to the change in depth and compute it from the geometric mean of its width and height ratios.

That is, the residual width and height factors are inversely proportional. Thus, to complete the system, we only have to track the last variable \(k_{i}^{\mathrm {h}}\) along the trajectory, modifying the width and height of the annotated rectangle accordingly. Since its value can change along the different segments of the trajectory, interpolation of this variable between key frames needs to be considered to eliminate abrupt size changes between segments.
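One way to read this decomposition in code is to split the size change between two key frames into an isotropic (depth-induced) part, the geometric mean of the per-axis ratios, and a residual aspect factor. This is a sketch under our own naming, not the paper's exact formulation:

```python
import math

def size_change_factors(w_i, h_i, w_j, h_j):
    """Split the bounding box size change between two key frames into an
    isotropic scale s (geometric mean of the per-axis ratios, attributed to
    the depth change) and residual aspect factors k_w, k_h with k_w*k_h = 1."""
    s = math.sqrt((w_j / w_i) * (h_j / h_i))   # isotropic part (depth change)
    k_w = (w_j / w_i) / s                      # residual change in width
    k_h = (h_j / h_i) / s                      # residual change in height
    return s, k_w, k_h

# Width halves while height stays constant (e.g. a person turning sideways):
s, k_w, k_h = size_change_factors(40, 20, 20, 20)
```

By construction the residual factors multiply to one, matching the inverse proportionality stated above, so only one of them (\(k^{\mathrm{h}}\)) needs to be tracked and interpolated.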

### 3.5 Bounding box interpolation

The process described so far allows us to perform, up to a scale transformation, a three-dimensional reconstruction of the annotated trajectory. The last step is the design of the interpolation schema to compute the bounding box coordinates for any frame. First of all, thanks to the reconstruction process described, the interpolation is not performed over the image plane coordinates but over the reconstructed spatial coordinates. Once we have the interpolated 3D coordinates, we project them over the image plane to get the new bounding box coordinates. In Section 3.1, we chose the camera to have its center of projection at the origin of the system of coordinates, with its principal axis the *Z* axis and with focal length equal to 1. Since this is the camera employed in the reconstruction, it will be the one to project the interpolated coordinates to the image plane.

To interpolate the trajectory of the object, the spatial coordinates \(X\), \(Y\), and \(Z\) need to be interpolated. Furthermore, to eliminate abrupt changes in object size, the aspect ratio correction \({k_{i}^{h}}\) also needs to be interpolated. Therefore, four cubic spline interpolators need to be implemented. Algorithm 1 summarizes the whole process needed to interpolate bounding box coordinates from a sparse set of annotations.

For trajectories with abrupt changes of motion, the user can define *trajectory breaks* at these points, so that the interpolation is computed independently on either side of the break. Figure 14 shows the improvement when breaks are included in the trajectory annotation.

## 4 Results and discussion

The interpolation algorithms evaluated are the following:

1. *Geometric cubic spline interpolation* (GC): the algorithm described in this paper.
2. *Geometric interpolation* (GI): a simplified version of the algorithm described, without cubic spline interpolation.
3. *Linear interpolation* (LI): a linear interpolator, as defined in Eq. (2). This is the interpolation used in ViPER [12].
4. *LabelMe* (LM): although the original work was designed for general polygonal annotations [14], the algorithm can easily be adapted to work with rectangles. For the computation of the vanishing point \(p_{\mathrm{v}}\), we used the cross point between the lines defined by the top-left and the bottom-right corners of consecutive rectangles. Velocity and point coordinates are computed according to (4) and (3).
5. *Cubic spline interpolation* (CS): following the original work by Lee et al. [15], both the center coordinates and the size of the bounding box in the image plane are interpolated between key frames using the cubic spline functions implemented in the SciPy library.

The videos were selected to model different types of object and camera motion (column *motion type* in Table 1), to check the algorithms against a number of situations. These include the following:

- (a) Rigid object moving in a straight path, static camera
- (b) Rigid object moving in a curved path, static camera
- (c) Static object, with camera panning
- (d) Static object, with camera translation
- (e) Static object, with camera zoom
- (f) Non-rigid objects

Table 1 Video data set summary

| Motion type | Resolution (*W*×*H*) (pixels) | Trajectory length (frames) | Bounding box size (*h*·*w*), min (pixels) | Bounding box size (*h*·*w*), max (pixels) |
|---|---|---|---|---|
| a | 352×240 | 41 | 560 | 12,212 |
| a | 352×240 | 28 | 616 | 4888 |
| a | 352×240 | 53 | 660 | 1856 |
| b | 352×240 | 55 | 1440 | 5508 |
| b | 352×240 | 76 | 1392 | 4800 |
| b | 352×240 | 52 | 7488 | 16,020 |
| b | 352×240 | 188 | 2912 | 17,472 |
| c | 640×360 | 34 | 6030 | 7056 |
| c | 960×540 | 29 | 12,920 | 15,675 |
| c | 640×360 | 17 | 3942 | 5270 |
| d | 720×576 | 89 | 240 | 4640 |
| d | 640×480 | 164 | 1176 | 6360 |
| d | 640×360 | 47 | 432 | 2480 |
| d | 640×480 | 97 | 440 | 2808 |
| e | 640×360 | 67 | 680 | 3936 |
| e | 640×480 | 33 | 13,720 | 57,200 |
| f | 640×360 | 69 | 2072 | 3192 |
| f | 640×360 | 69 | 2184 | 4292 |
| f | 640×360 | 198 | 9396 | 33,200 |

Given an interpolated bounding box and its corresponding manual annotation \(\mathbf{b}_{i}\), we calculate the *overlap error* \(e_{n}\) between them, where the subscript \(n\) indicates the interval between a pair of key frames. For instance, for \(n=1\) we remove one half of the samples of the trajectory, using the removed annotations as ground truth. Note that, since one half of the samples is removed, we can create one test trajectory by removing the odd samples and a second one by removing the even samples. The error \(e_{1}\) is then obtained as the mean value of \({e_{1}^{0}}\) and \({e_{1}^{1}}\), where the superscript 0 denotes the error when the test trajectory is obtained by keeping the even samples (that is, we keep the sample at 0, remove the sample at 1, keep the sample at 2, and so on), and similarly for the superscript 1. Likewise, \(e_{2}\) is computed by constructing test trajectories that remove \(n=2\) out of every three samples. In this case, we can have \({e_{2}^{0}}\), \({e_{2}^{1}}\), and \({e_{2}^{2}}\), and the error \(e_{2}\) is computed as the mean of these three values. We performed this computation up to the interval \(n=20\).
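The subsampling protocol can be sketched as an index-selection helper; always keeping the trajectory endpoints is our assumption, since the endpoints of a trajectory must be annotated for any interpolator to work:

```python
def subsample_keyframes(n_frames, n, offset):
    """Indices kept as key frames when removing n out of every n+1 samples,
    starting the kept pattern at `offset` (the e_n^offset protocol).
    The endpoints are always kept so the trajectory stays anchored."""
    kept = set(range(offset % (n + 1), n_frames, n + 1))
    kept |= {0, n_frames - 1}
    return sorted(kept)

# n = 1: offset 0 keeps the even samples, offset 1 the odd ones (plus endpoints).
print(subsample_keyframes(6, 1, 0))   # → [0, 2, 4, 5]
print(subsample_keyframes(6, 1, 1))   # → [0, 1, 3, 5]
```

Each interpolator is then run on the kept annotations, and the error is averaged over the removed frames and over the \(n+1\) possible offsets.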

The overall error has been computed, for each interval \(n\), as the average of the error \(e_{n}\) over all the videos used in this experiment. Note that, for \(n=1\), an error is present for all the interpolators. This error is mainly due to the process employed to annotate the videos manually, which adds a random shift to the object trajectory. Obviously, as the interval \(n\) increases, the average error also increases. However, the best performance is achieved by the interpolators based on depth recovery, since many of the annotated objects exhibit great variations in depth. In this sense, the algorithm proposed in this paper (GC) yields the best performance. The algorithm proposed by Yuen et al. (LM) and the simplified version of our algorithm without cubic spline interpolation (GI) obtain worse results. In fact, in most cases both algorithms behave quite similarly, since both are based on depth recovery and straight interpolation between key frames. Lastly, the algorithms not based on depth recovery but on interpolation of image coordinates, (LI) and (CS), yield the worst results. Table 2 shows the numerical results for the overall performance of the interpolators for different values of the interval \(n\).

Table 2 Overall interpolation error \(e_{n}\) (pixels)

| Interval | GC | GI | LI | LM | CS |
|---|---|---|---|---|---|
| 5 | 332.9 | 356.0 | 362.5 | 360.3 | 331.3 |
| 10 | 610.2 | 612.5 | 649.7 | 623.8 | 588.2 |
| 15 | 815.5 | 832.0 | 937.9 | 840.8 | 889.4 |
| 20 | 890.4 | 910.9 | 1083.2 | 946.4 | 1035.6 |

## 5 Conclusions

In this paper, we have presented an alternative method that interpolates bounding box annotations between key frames for video labeling. The method is based on a 3D reconstruction of the bounding box from the provided annotations, using the geometric properties of the elements involved. Once the 3D coordinates are computed, the interpolation of the bounding box for the remaining frames is performed in spatial coordinates, using cubic spline interpolation, and the result is finally projected onto the image plane. The algorithm has been evaluated using a selected set of video clips that includes different types of object and camera motion and compared with other interpolation algorithms proposed in the literature. The results show good performance, especially when considering rigid objects moving in trajectories with great variation in depth, where the accuracy of the interpolated bounding boxes is higher than that of the other evaluated algorithms. Since bounding box interpolation is a specific part of any annotation tool, we believe that the proposed algorithm can be a good alternative for these tools when accurate object annotations are required.

## Declarations

### Acknowledgements

This research was supported by projects CCG2014/EXP-055 and TEC2013-45183-R.

**Open Access** This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.

## References

- M Everingham, LV Gool, C Williams, J Winn, A Zisserman, The PASCAL visual object classes (VOC) challenge. Int. J. Comput. Vis. **88**, 303–338 (2010).
- J Deng, W Dong, R Socher, LJ Li, K Li, L Fei-Fei, in *IEEE Conf. Computer Vision and Pattern Recognition*. ImageNet: a large-scale hierarchical image database (IEEE, New York, 2009).
- B Russell, A Torralba, K Murphy, W Freeman, LabelMe: a database and web-based tool for image annotation. Int. J. Comput. Vis. **77**(1), 157–173 (2008).
- L von Ahn, L Dabbish, in *Proceedings of the Conference on Human Factors in Computing Systems*. Labeling images with a computer game (ACM, New York, 2004).
- A Sorokin, D Forsyth, in *IEEE Conference on Computer Vision and Pattern Recognition Workshops*. Utility data annotation with Amazon Mechanical Turk (IEEE, 2008), pp. 1–8.
- M Kipp, in *Proceedings of the 7th European Conference on Speech Communication and Technology*. Anvil - a generic annotation tool for multimodal dialogue (Aalborg, 2001), pp. 1367–1370.
- C Vondrick, D Patterson, D Ramanan, Efficiently scaling up crowdsourced video annotation. Int. J. Comput. Vis. **101**(1), 184–204 (2012).
- V Badrinarayanan, F Galasso, R Cipolla, in *IEEE Conf. on Computer Vision and Pattern Recognition*. Label propagation in video sequences (IEEE, New York, 2010).
- C Liu, WT Freeman, EH Adelson, Y Weiss, in *IEEE Conference on Computer Vision and Pattern Recognition*. Human-assisted motion annotation (IEEE, New York, 2008).
- A Agarwala, A Hertzmann, DH Salesin, SM Seitz, in *ACM Transactions on Graphics*. Keyframe-based tracking for rotoscoping and animation (ACM, New York, 2004).
- Z Kalal, K Mikolajczyk, J Matas, Tracking-learning-detection. IEEE Trans. Pattern Anal. Mach. Intell. **34**(7), 1409–1422 (2012).
- D Doermann, D Mihalcik, in *Int. Conf. on Pattern Recognition, vol. 4*. Tools and techniques for video performance evaluation (IEEE, New York, 2000).
- MA Serrano, J García, MA Patricio, JM Molina, in *Distributed Computing and Artificial Intelligence, vol. 79*. Interactive video annotation tool (Springer-Verlag, Berlin/Heidelberg, 2010), pp. 325–332.
- J Yuen, B Russell, C Liu, A Torralba, in *IEEE Int. Conf. Computer Vision*. LabelMe video: building a video database with human annotations (IEEE, New York, 2009).
- JH Lee, KS Lee, GS Jo, in *Proceedings of the International Conference on Information Science and Applications (ICISA)*. Representation method of the moving object trajectories by interpolation with dynamic sampling (IEEE, New York, 2013), pp. 1–4.
- KS Lee, AN Rosli, IA Supandi, GS Jo, Dynamic sampling-based interpolation algorithm for representation of clickable moving object in collaborative video annotation. Neurocomputing **146**, 291–300 (2014).
- M Unser, Splines: a perfect fit for signal and image processing. IEEE Signal Process. Mag. **16**(6), 22–38 (1999).
- R Hartley, A Zisserman, *Multiple View Geometry in Computer Vision*, 2nd ed. (Cambridge University Press, Cambridge, 2003).
- E Jones, T Oliphant, P Peterson, et al., SciPy: open source scientific tools for Python (2001). http://www.scipy.org/. Accessed 15 Feb 2016.
- P Gil-Jiménez, TrATVid annotation tool. http://agamenon.tsc.uah.es/Investigacion/gram/papers/Annotation. Accessed 16 Nov 2015.