Mutual kernelized correlation filters with elastic net constraint for visual tracking

Abstract

In this paper, we propose a robust visual tracking method based on mutual kernelized correlation filters with an elastic net constraint. First, two correlation filters are trained jointly in a general framework with a closed-form solution; the two filters are interrelated and interact with each other. Second, an elastic net constraint is imposed on each discriminative filter, which suppresses some interfering features. Third, a scale estimation scheme and a target re-detection scheme are adopted in our framework to deal with scale variation and tracking failure effectively. Extensive experiments on challenging tracking benchmarks demonstrate that the proposed method obtains competitive tracking performance against other state-of-the-art algorithms.

Introduction

Visual tracking is a fundamental task in computer vision with numerous applications, such as unmanned control systems, surveillance, and assisted driving. Given the position of the tracked object in the first frame, the goal of visual tracking is to estimate the position of the target precisely in the subsequent frames. Although great progress has been made in recent years [1, 2], designing a robust tracking algorithm is still a challenging problem due to negative factors such as background clutter, severe occlusion, motion blur, and illumination variation (see Fig. 1).

Fig. 1

Tracking results in challenging environments including background clutter (motorRolling), severe occlusion (Jogging-1), fast motion (skiing), and illumination change (Singer2). The tracking results of HDT, Staple, KCF, CNN-SVM, DSST, MEEM, and our tracker are shown by red, green, blue, black, magenta, cyan, and gray rectangles, respectively

Generally speaking, visual tracking methods can be divided into two categories: generative methods [3,4,5,6,7] and discriminative methods [8,9,10,11,12,13]. Generative methods build a model to represent the tracked target and search a large number of candidates for the region with the minimum reconstruction error. For example, under the particle filter framework, Mei et al. [14] developed a tracker based on sparse representation, called the ℓ1 method, which reconstructs each candidate with dictionary templates and trivial templates. The sparse representation coefficients of each candidate are computed by solving an ℓ1 minimization problem. Although the ℓ1 method demonstrated impressive tracking performance, it is very slow because of its heavy computational load. To solve this problem, Bao et al. [15] proposed a fast ℓ1 tracking method using an accelerated proximal gradient approach. Xiao et al. [16] presented a fast object tracking method by solving an ℓ2-regularized least squares problem. Wang et al. [17] developed a fast visual tracking method via a probability continuous outlier model. In contrast to generative methods, discriminative algorithms regard visual tracking as a binary classification problem that distinguishes the tracked object from the background. For example, Babenko et al. [18] trained an online discriminative classifier to separate the tracked object from the background by online multiple instance learning. Zhang et al. [19] formulated visual tracking as binary classification via a naive Bayes classifier with an online update scheme in the compressed domain.

In recent years, visual tracking methods based on correlation filters [20,21,22,23,24,25] have attracted great attention due to their real-time tracking speed and robust tracking performance. Under the correlation filter framework, a discriminative classifier is trained with a large number of densely sampled examples. These examples have a circulant structure, which allows the use of the fast Fourier transform (FFT). Bolme et al. [26] first developed a minimum output sum of squared error filter for real-time visual tracking. After that, many tracking methods based on correlation filters have been proposed to improve tracking performance. Henriques et al. [27] developed a high-speed tracker with kernelized correlation filters that can deal with multi-channel features. Danelljan et al. [28] presented a discriminative scale space tracker with a correlation filter based on a scale pyramid representation. To mitigate the unwanted boundary effect that appears in traditional correlation-based trackers, Danelljan et al. [29] proposed spatially regularized discriminative correlation filters (SRDCF) for visual tracking. Recent research has shown that features from convolutional neural networks (CNN) can improve tracking performance greatly [30,31,32,33]. Zhang et al. [34] built a simple two-layer convolutional network to learn a robust representation for visual tracking without offline training. Ma et al. [35] utilized three convolutional layers to learn a robust target appearance for visual tracking. Wang et al. [36] exploited a robust target appearance representation from the top layer to the lower layers for object tracking. Heng et al. [37] incorporated a recurrent neural network (RNN) into a CNN to improve tracking performance. He et al. [38] integrated weighted convolution responses from 10 layers and achieved very promising performance.

Although correlation filter-based trackers have obtained superior tracking performance, many of them use a single correlation filter, which limits the achievable tracking results. Figure 2 gives the precision plots and success plots of one-pass evaluation (OPE) for methods with different numbers of correlation filters on OTB-2013. Simply merging two correlation filters greatly improves tracking performance in both precision and success rate. However, there is still much room for improvement for methods using two correlation filters that are independent of each other.

Fig. 2

Precision plots and success plots of OPE by methods with different correlation filters on OTB-2013

Inspired by the above discussion, we develop a robust visual tracking method via mutual kernelized correlation filters using features from convolutional neural networks (MKCF_CNN), where each tracker works on its own and tries to correct the other one. At the same time, an elastic net constraint is imposed on each filter, which can eliminate some distracting features. Finally, the proposed tracking framework can be solved in closed form. Extensive experiments demonstrate that our method achieves promising tracking performance and is competitive with other state-of-the-art trackers.

The rest of this paper is organized as follows. Section 2 briefly summarizes the principle of visual tracking based on the kernelized correlation filter. Section 3 introduces the proposed tracking algorithm in detail. The experimental results and corresponding discussions are described in Section 4, followed by the conclusion in Section 5.

Visual tracking based on kernelized correlation filters

Henriques et al. [27] proposed a fast discriminative visual tracking method based on kernelized correlation filters (KCF). Given an n × 1 vector \( \mathbf{x}={\left[{x}_1,{x}_2,\dots, {x}_n\right]}^T \) denoting a base image, a version of x shifted by one element is \( P\mathbf{x}={\left[{x}_n,{x}_1,\dots, {x}_{n-1}\right]}^T \), where P is a permutation matrix. The full set of shifted signals of x is given by \( \left\{{P}^u\mathbf{x}\ |\ u=0,1,\dots, n-1\right\} \). The data matrix X is formed from all these cyclic shifts of x; because X is circulant, it can be diagonalized by the discrete Fourier transform (DFT):

$$ \mathbf{X}={F}^H\operatorname{diag}\left(\hat{\mathbf{x}}\right)F $$
(1)

Here, F is the DFT matrix, the superscript H denotes the Hermitian (conjugate) transpose, and \( \hat{\mathbf{x}}=\mathcal{F}\left(\mathbf{x}\right) \) is the DFT of the vector x. The goal of KCF is to find a discriminative correlation classifier f(x) over the data matrix X that separates the target object from the surrounding environment. Given the training samples and their corresponding labels (x1, y1), …, (xm, ym), the classifier f(x) is obtained by solving

$$ \underset{\mathbf{w}}{\min}\sum \limits_i{\left(f\left({\mathbf{x}}_i\right)-{y}_i\right)}^2+\lambda {\left\Vert \mathbf{w}\right\Vert}^2 $$
(2)

where λ is the regularization parameter and xi is the ith row of the data matrix X. A Gaussian function is adopted to model the labels yi: when xi is the centered target, yi is set to 1, and for the other cyclic shifts around the centered target, the labels decay smoothly to 0. For a linear classifier \( f\left(\mathbf{x}\right)={\mathbf{w}}^T\mathbf{x} \), the solution is \( \mathbf{w}={\left({\mathbf{X}}^H\mathbf{X}+\lambda \mathbf{I}\right)}^{-1}{\mathbf{X}}^H\mathbf{y} \). To obtain a more powerful model, the kernel trick is introduced into Eq. (2). The new model is rewritten as

$$ \underset{\alpha }{\min }{\left\Vert \mathbf{K}\alpha -\mathbf{y}\right\Vert}_2^2+\lambda {\alpha}^T\mathbf{K}\alpha $$
(3)

where K is an n × n kernel matrix with elements \( {K}_{ij}=k\left({\mathbf{x}}_i,{\mathbf{x}}_j\right) \). The matrix K has a circulant structure and can be diagonalized as

$$ \mathbf{K}={F}^H\operatorname{diag}\left(\hat{\mathbf{k}}\right)F $$
(4)

Here, k is the first row of matrix K. The solution α in the dual space can be given by

$$ \alpha ={\left(\mathbf{K}+\lambda \mathbf{I}\right)}^{-1}\mathbf{y} $$
(5)

where I is the identity matrix. Because the kernel matrix K is circulant, just like the data matrix X, the solution of Eq. (3) can be computed efficiently in the frequency domain, where \( {\hat{\mathbf{k}}}^{\mathbf{xx}} \) denotes the DFT of the first row of K and the division is element-wise:

$$ \hat{\alpha}=\frac{\hat{\mathbf{y}}}{{\hat{\mathbf{k}}}^{\mathbf{xx}}+\lambda } $$
(6)
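
As a sanity check on Eqs. (5) and (6), the following NumPy sketch compares the dense dual solution with its Fourier-domain counterpart for a one-dimensional signal. It assumes a linear kernel, so that K = XXᵀ and the DFT of its first row is |x̂|². The helper names (`circulant_rows`, `dual_solution_dense`, `dual_solution_fft`) and the random test signal are illustrative and are not taken from the paper.

```python
import numpy as np

def circulant_rows(x):
    """Data matrix whose i-th row is x cyclically shifted by i positions."""
    return np.stack([np.roll(x, i) for i in range(len(x))])

def dual_solution_dense(x, y, lam):
    """Eq. (5): alpha = (K + lam*I)^{-1} y, with the linear kernel K = X X^T."""
    X = circulant_rows(x)
    K = X @ X.T
    return np.linalg.solve(K + lam * np.eye(len(x)), y)

def dual_solution_fft(x, y, lam):
    """Eq. (6): element-wise division in the frequency domain."""
    k_hat = np.abs(np.fft.fft(x)) ** 2       # DFT of the first row of K (linear kernel)
    alpha_hat = np.fft.fft(y) / (k_hat + lam)
    return np.real(np.fft.ifft(alpha_hat))

rng = np.random.default_rng(0)
n = 16
x = rng.standard_normal(n)                   # hypothetical 1-D "base image"
idx = np.arange(n)
y = np.exp(-0.5 * (np.minimum(idx, n - idx) / 2.0) ** 2)  # Gaussian labels, peak at 0

alpha_dense = dual_solution_dense(x, y, lam=0.1)
alpha_fft = dual_solution_fft(x, y, lam=0.1)
print(np.allclose(alpha_dense, alpha_fft))
```

The dense route costs a full n × n linear solve, whereas the Fourier route needs only a few FFTs of length n, which is the source of the speed advantage described above.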

In the next frame, a large number of candidates, denoted as x', are extracted at the same position as in the current frame. All these candidates x' are obtained from cyclic shifts of the base image x. The response of these candidates can be computed from

$$ f\left({\mathbf{x}}^{\hbox{'}}\right)={\mathcal{F}}^{-1}\left({\hat{\mathbf{k}}}^{\hbox{'}}\circ {\hat{\alpha}}_t\right) $$
(7)

Here, \( {\mathcal{F}}^{-1} \) stands for the inverse discrete Fourier transform (IDFT), \( {\hat{\mathbf{k}}}^{\hbox{'}} \) is the kernel correlation of the candidates x' and the base image x in the frequency domain, and \( \circ \) denotes element-wise multiplication. The candidate with the largest response is chosen as the target object in the next frame.
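
The training and detection steps of Eqs. (6) and (7) can be sketched in a few lines of NumPy. This is a minimal one-dimensional illustration that assumes a linear kernel and labels that peak at index 0; the function names, the random template, and the shift of 5 samples are hypothetical choices made for the example and do not come from the paper.

```python
import numpy as np

def gaussian_labels(n, sigma=2.0):
    """Regression labels peaking at index 0 and decaying with cyclic distance."""
    idx = np.arange(n)
    dist = np.minimum(idx, n - idx)
    return np.exp(-0.5 * (dist / sigma) ** 2)

def kernel_correlation_hat(x, z):
    """Linear kernel correlation of x and z, expressed in the frequency domain."""
    return np.conj(np.fft.fft(x)) * np.fft.fft(z)

def train(x, y, lam=1e-4):
    """Eq. (6): filter coefficients in the frequency domain."""
    return np.fft.fft(y) / (kernel_correlation_hat(x, x) + lam)

def detect(alpha_hat, x, z):
    """Eq. (7): response over all cyclic shifts of the candidate patch z."""
    return np.real(np.fft.ifft(kernel_correlation_hat(x, z) * alpha_hat))

rng = np.random.default_rng(1)
x = rng.standard_normal(64)           # hypothetical 1-D template
alpha_hat = train(x, gaussian_labels(64))

z = np.roll(x, 5)                     # the "target" has moved by 5 samples
response = detect(alpha_hat, x, z)
shift = int(np.argmax(response))      # index of the largest response
print(shift)
```

Because the labels peak at index 0, the position of the largest response directly reports the displacement of the target between the two patches.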

Methods

Although the KCF method has obtained promising tracking performance, it uses only one discriminative classifier, which makes it unable to deal with complex scenes. To overcome this problem, and inspired by ensemble tracking methods, we propose mutual kernelized correlation filters with an elastic net constraint for visual tracking. Extensive experiments show that our method performs favorably against state-of-the-art methods. The flowchart of our proposed tracking framework is shown in Fig. 3.

Fig. 3

The flowchart of our proposed tracking framework

Problem statement

In order to find the best target object from a great deal of candidates, we introduce a linear regressor model in the proposed method.

$$ \underset{\mathbf{w}}{\min }{\left\Vert \mathbf{y}-\mathbf{Xw}\right\Vert}_2^2 $$
(8)

Here, X has the same definition as in KCF, y is the vector of regression labels for X, and w is the corresponding coefficient vector. To improve the performance of Eq. (8), an ℓ1 norm is adopted to regularize the coefficients w, as in the least absolute shrinkage and selection operator (LASSO) model.

$$ \underset{\mathbf{w}}{\min }{\left\Vert \mathbf{y}-\mathbf{Xw}\right\Vert}_2^2+\tau {\left\Vert \mathbf{w}\right\Vert}_1 $$
(9)

where τ is a constant weight parameter. In Eq. (9), some entries of w are driven to zero, which excludes some occluded pixels from the model, so the occluded pixels have less effect on the final regression values. However, we find that occluded pixels often cluster together in one region, and Eq. (9) cannot group such pixels with the same features. To overcome this limitation of the LASSO model, an elastic net regularization [39] is introduced in Eq. (9).

$$ \underset{\mathbf{w}}{\min }{\left\Vert \mathbf{y}-\mathbf{Xw}\right\Vert}_2^2+\lambda {\left\Vert \mathbf{w}\right\Vert}_2^2+\tau {\left\Vert \mathbf{w}\right\Vert}_1 $$
(10)

Here, λ is a constant weight parameter. The squared ℓ2 term \( {\left\Vert \mathbf{w}\right\Vert}_2^2 \) is used to group pixels with similar properties. To improve the tracking performance of our method, the kernel trick is applied to Eq. (10). The candidates are mapped to a high-dimensional feature space φ(x). Then, in the dual space, the solution w is given by a linear combination of the mapped candidates.

$$ \mathbf{w}=\sum \limits_i{\alpha}_i\varphi \left({\mathbf{x}}_i\right) $$
(11)

Equation (10) in the dual space can be described as

$$ \underset{\alpha }{\min }{\left\Vert \mathbf{y}-\mathbf{K}\alpha \right\Vert}_2^2+\lambda {\alpha}^T\mathbf{K}\alpha +\tau {\left\Vert \alpha \right\Vert}_1 $$
(12)

where K is the kernel matrix. The objective involves a quadratic term and an ℓ1 norm of α simultaneously. To compute α efficiently, an auxiliary variable β is introduced in Eq. (12).

$$ \underset{\alpha }{\min }{\left\Vert \mathbf{y}-\mathbf{K}\alpha \right\Vert}_2^2+\lambda {\alpha}^T\mathbf{K}\alpha +\tau {\left\Vert \beta \right\Vert}_1+\mu {\left\Vert \alpha -\beta \right\Vert}_2^2 $$
(13)

Here, μ is a constant weight parameter.

Mutual kernelized correlation filters

In this part, we introduce mutual kernelized correlation filters based on Eq. (13). The proposed mutual kernelized correlation filters solve the following problem:

$$ {\displaystyle \begin{array}{l}T\left({\alpha}_1,{\alpha}_2\right)=\underset{\alpha_1,{\alpha}_2}{\min }{\left\Vert \mathbf{y}-{\mathbf{K}}_1{\alpha}_1\right\Vert}_2^2+{\left\Vert \mathbf{y}-{\mathbf{K}}_2{\alpha}_2\right\Vert}_2^2+\lambda {\alpha}_1^T{\mathbf{K}}_1{\alpha}_1+\lambda {\alpha}_2^T{\mathbf{K}}_2{\alpha}_2+\tau {\left\Vert {\beta}_1\right\Vert}_1\\ {}+\tau {\left\Vert {\beta}_2\right\Vert}_1+\mu {\left\Vert {\alpha}_1-{\beta}_1\right\Vert}_2^2+\mu {\left\Vert {\alpha}_2-{\beta}_2\right\Vert}_2^2+2\rho {\left\Vert {\mathbf{K}}_1{\alpha}_1-{\mathbf{K}}_2{\alpha}_2\right\Vert}_2^2\end{array}} $$
(14)

The first two terms of Eq. (14) force each kernelized correlation filter to have the minimum squared error with respect to the desired regression labels y. The terms \( \lambda {\alpha}_1^T{\mathbf{K}}_1{\alpha}_1+\lambda {\alpha}_2^T{\mathbf{K}}_2{\alpha}_2 \) are the ℓ2 part of the elastic net regularization on the two models. The terms \( \tau {\left\Vert {\beta}_1\right\Vert}_1+\tau {\left\Vert {\beta}_2\right\Vert}_1+\mu {\left\Vert {\alpha}_1-{\beta}_1\right\Vert}_2^2+\mu {\left\Vert {\alpha}_2-{\beta}_2\right\Vert}_2^2 \) form the ℓ1 part, which is introduced to exclude the occluded pixels in the target object. The term \( 2\rho {\left\Vert {\mathbf{K}}_1{\alpha}_1-{\mathbf{K}}_2{\alpha}_2\right\Vert}_2^2 \) couples the two models by penalizing disagreement between their outputs, and ρ weights this mutual influence.

It is obvious that Eq. (14) is convex with respect to α1, α2 if β1, β2 are fixed, and vice versa. So, we propose an iterative algorithm to compute the solution α1, α2. Thus, four subproblems with respect to α1, α2, β1, β2 are given as follows

$$ {T}_1\left({\alpha}_1\right)=\underset{\alpha_1}{\min }{\left\Vert \mathbf{y}-{\mathbf{K}}_1{\alpha}_1\right\Vert}_2^2+\lambda {\alpha}_1^T{\mathbf{K}}_1{\alpha}_1+\tau {\left\Vert {\beta}_1\right\Vert}_1+\mu {\left\Vert {\alpha}_1-{\beta}_1\right\Vert}_2^2+2\rho {\left\Vert {\mathbf{K}}_1{\alpha}_1-{\mathbf{K}}_2{\alpha}_2\right\Vert}_2^2 $$
(15)
$$ {T}_2\left({\beta}_1\right)=\underset{\beta_1}{\min}\tau {\left\Vert {\beta}_1\right\Vert}_1+\mu {\left\Vert {\alpha}_1-{\beta}_1\right\Vert}_2^2 $$
(16)
$$ {T}_3\left({\alpha}_2\right)=\underset{\alpha_2}{\min }{\left\Vert \mathbf{y}-{\mathbf{K}}_2{\alpha}_2\right\Vert}_2^2+\lambda {\alpha}_2^T{\mathbf{K}}_2{\alpha}_2+\tau {\left\Vert {\beta}_2\right\Vert}_1+\mu {\left\Vert {\alpha}_2-{\beta}_2\right\Vert}_2^2+2\rho {\left\Vert {\mathbf{K}}_1{\alpha}_1-{\mathbf{K}}_2{\alpha}_2\right\Vert}_2^2 $$
(17)
$$ {T}_4\left({\beta}_2\right)=\underset{\beta_2}{\min}\tau {\left\Vert {\beta}_2\right\Vert}_1+\mu {\left\Vert {\alpha}_2-{\beta}_2\right\Vert}_2^2 $$
(18)

Setting the derivative of T1 with respect to α1 to zero, and using the symmetry of K1, Eq. (15) yields

$$ {\displaystyle \begin{array}{l}\frac{\partial {T}_1}{\partial {\alpha}_1}=-2{\mathbf{K}}_1\left(\mathbf{y}-{\mathbf{K}}_1{\alpha}_1\right)+2\lambda {\mathbf{K}}_1{\alpha}_1+4\rho {\mathbf{K}}_1\left({\mathbf{K}}_1{\alpha}_1-{\mathbf{K}}_2{\alpha}_2\right)+2\mu \left({\alpha}_1-{\beta}_1\right)\\ {}=-2{\mathbf{K}}_1\mathbf{y}+2{\mathbf{K}}_1{\mathbf{K}}_1{\alpha}_1+2\lambda {\mathbf{K}}_1{\alpha}_1+4\rho {\mathbf{K}}_1{\mathbf{K}}_1{\alpha}_1-4\rho {\mathbf{K}}_1{\mathbf{K}}_2{\alpha}_2+2\mu {\alpha}_1-2\mu {\beta}_1\\ {}=0\end{array}} $$
(19)

Dividing Eq. (19) by 2 and collecting the terms in α1 on the left-hand side, we obtain

$$ {\displaystyle \begin{array}{l}{\mathbf{K}}_1{\mathbf{K}}_1{\alpha}_1+\lambda {\mathbf{K}}_1{\alpha}_1+2\rho {\mathbf{K}}_1{\mathbf{K}}_1{\alpha}_1+\mu {\alpha}_1={\mathbf{K}}_1\mathbf{y}+2\rho {\mathbf{K}}_1{\mathbf{K}}_2{\alpha}_2+\mu {\beta}_1\\ {}\Rightarrow \left({\mathbf{K}}_1{\mathbf{K}}_1+\lambda {\mathbf{K}}_1+2\rho {\mathbf{K}}_1{\mathbf{K}}_1+\mu \mathbf{I}\right){\alpha}_1={\mathbf{K}}_1\mathbf{y}+2\rho {\mathbf{K}}_1{\mathbf{K}}_2{\alpha}_2+\mu {\beta}_1\end{array}} $$
(20)

Then, we obtain the solution for α1:

$$ {\alpha}_1={\left({\mathbf{K}}_1{\mathbf{K}}_1+\lambda {\mathbf{K}}_1+2\rho {\mathbf{K}}_1{\mathbf{K}}_1+\mu \mathbf{I}\right)}^{-1}\left({\mathbf{K}}_1\mathbf{y}+2\rho {\mathbf{K}}_1{\mathbf{K}}_2{\alpha}_2+\mu {\beta}_1\right) $$
(21)

Setting the derivative of T3 with respect to α2 to zero gives a similar solution for α2:

$$ {\alpha}_2={\left({\mathbf{K}}_2{\mathbf{K}}_2+\lambda {\mathbf{K}}_2+2\rho {\mathbf{K}}_2{\mathbf{K}}_2+\mu \mathbf{I}\right)}^{-1}\left({\mathbf{K}}_2\mathbf{y}+2\rho {\mathbf{K}}_2{\mathbf{K}}_1{\alpha}_1+\mu {\beta}_2\right) $$
(22)

Eqs. (16) and (18) are ℓ1-regularized least squares problems. Thus, the solutions β1 and β2 have a closed form, which is given element-wise by the soft shrinkage function

$$ {\beta}_1=\operatorname{sign}\left({\alpha}_1\right)\max \left(0,\left|{\alpha}_1\right|-\frac{\tau }{2\mu}\right) $$
(23)
$$ {\beta}_2=\operatorname{sign}\left({\alpha}_2\right)\max \left(0,\left|{\alpha}_2\right|-\frac{\tau }{2\mu}\right) $$
(24)
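
The soft shrinkage of Eqs. (23) and (24) is a one-line element-wise operation. The sketch below is illustrative: the function name `soft_shrink` and the sample values are assumptions made for the example, while the threshold τ/(2μ) follows the equations above.

```python
import numpy as np

def soft_shrink(alpha, tau, mu):
    """Element-wise minimizer of tau*|b| + mu*(a - b)^2, as in Eqs. (23) and (24)."""
    threshold = tau / (2.0 * mu)
    return np.sign(alpha) * np.maximum(0.0, np.abs(alpha) - threshold)

alpha = np.array([-1.0, -0.2, 0.0, 0.3, 2.0])   # hypothetical coefficients
beta = soft_shrink(alpha, tau=1.0, mu=1.0)      # threshold is 0.5
print(beta)
```

Entries whose magnitude falls below the threshold are set to zero, and the remaining entries are pulled toward zero by the threshold, which is how the ℓ1 term produces sparse coefficients.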

By substituting Eq. (4) into Eq. (21), the solution can be reformulated as follows:

$$ {\displaystyle \begin{array}{l}{\alpha}_1={\left({\mathbf{K}}_1{\mathbf{K}}_1+\lambda {\mathbf{K}}_1+2\rho {\mathbf{K}}_1{\mathbf{K}}_1+\mu \mathbf{I}\right)}^{-1}\left({\mathbf{K}}_1\mathbf{y}+2\rho {\mathbf{K}}_1{\mathbf{K}}_2{\alpha}_2+\mu {\beta}_1\right)\\ {}={\left(\left(1+2\rho \right){F}^H\operatorname{diag}\left({\hat{\mathbf{k}}}_1\circ {\hat{\mathbf{k}}}_1\right)F+\lambda {F}^H\operatorname{diag}\left({\hat{\mathbf{k}}}_1\right)F+\mu \mathbf{I}\right)}^{-1}\\ {}\times \left({F}^H\operatorname{diag}\left({\hat{\mathbf{k}}}_1\right)F\mathbf{y}+2\rho {F}^H\operatorname{diag}\left({\hat{\mathbf{k}}}_1\circ {\hat{\mathbf{k}}}_2\right)F{\alpha}_2+\mu {\beta}_1\right)\\ {}={F}^H\operatorname{diag}\left(\frac{1}{\left(1+2\rho \right){\hat{\mathbf{k}}}_1\circ {\hat{\mathbf{k}}}_1+\lambda {\hat{\mathbf{k}}}_1+\mu}\right)\operatorname{diag}\left({\hat{\mathbf{k}}}_1\right)F\mathbf{y}\\ {}+2\rho {F}^H\operatorname{diag}\left(\frac{1}{\left(1+2\rho \right){\hat{\mathbf{k}}}_1\circ {\hat{\mathbf{k}}}_1+\lambda {\hat{\mathbf{k}}}_1+\mu}\right)\operatorname{diag}\left({\hat{\mathbf{k}}}_1\circ {\hat{\mathbf{k}}}_2\right)F{\alpha}_2\\ {}+\mu {F}^H\operatorname{diag}\left(\frac{1}{\left(1+2\rho \right){\hat{\mathbf{k}}}_1\circ {\hat{\mathbf{k}}}_1+\lambda {\hat{\mathbf{k}}}_1+\mu}\right)F{\beta}_1\end{array}} $$
(25)

Multiplying both sides by F and writing \( \hat{\mathbf{y}}=F\mathbf{y} \), \( {\hat{\alpha}}_2=F{\alpha}_2 \), and \( {\hat{\beta}}_1=F{\beta}_1 \), the DFT of α1 is found by

$$ {\displaystyle \begin{array}{l}{\hat{\alpha}}_1=\operatorname{diag}\left(\frac{{\hat{\mathbf{k}}}_1}{\left(1+2\rho \right){\hat{\mathbf{k}}}_1\circ {\hat{\mathbf{k}}}_1+\lambda {\hat{\mathbf{k}}}_1+\mu}\right)\hat{\mathbf{y}}+2\rho \operatorname{diag}\left(\frac{{\hat{\mathbf{k}}}_1\circ {\hat{\mathbf{k}}}_2}{\left(1+2\rho \right){\hat{\mathbf{k}}}_1\circ {\hat{\mathbf{k}}}_1+\lambda {\hat{\mathbf{k}}}_1+\mu}\right){\hat{\alpha}}_2\\ {}+\mu \operatorname{diag}\left(\frac{1}{\left(1+2\rho \right){\hat{\mathbf{k}}}_1\circ {\hat{\mathbf{k}}}_1+\lambda {\hat{\mathbf{k}}}_1+\mu}\right){\hat{\beta}}_1\\ {}=\frac{{\hat{\mathbf{k}}}_1\circ \hat{\mathbf{y}}+2\rho {\hat{\mathbf{k}}}_1\circ {\hat{\mathbf{k}}}_2\circ {\hat{\alpha}}_2+\mu {\hat{\beta}}_1}{\left(1+2\rho \right){\hat{\mathbf{k}}}_1\circ {\hat{\mathbf{k}}}_1+\lambda {\hat{\mathbf{k}}}_1+\mu}\end{array}} $$
(26)

In the same way, with \( {\hat{\alpha}}_1=F{\alpha}_1 \) and \( {\hat{\beta}}_2=F{\beta}_2 \), the DFT of α2 is obtained from

$$ {\hat{\alpha}}_2=\frac{{\hat{\mathbf{k}}}_2\circ \hat{\mathbf{y}}+2\rho {\hat{\mathbf{k}}}_2\circ {\hat{\mathbf{k}}}_1\circ {\hat{\alpha}}_1+\mu {\hat{\beta}}_2}{\left(1+2\rho \right){\hat{\mathbf{k}}}_2\circ {\hat{\mathbf{k}}}_2+\lambda {\hat{\mathbf{k}}}_2+\mu } $$
(27)

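Here, \( {\mathbf{k}}_1 \) and \( {\mathbf{k}}_2 \) are the first rows of the kernel matrices K1 and K2, respectively, and a hat denotes the DFT of a vector.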
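
The alternating scheme described in this section can be sketched as follows for one-dimensional features. This is a minimal illustration under stated assumptions, not the authors' implementation: it uses a linear kernel, initializes each filter from its independent KCF solution in Eq. (6), updates the variables in the order β1, α1, β2, α2, and runs a fixed number of iterations. The initialization, the update order, the iteration count, and all function names are assumptions; the default parameter values are the ones reported in the experimental setup.

```python
import numpy as np

def soft_shrink(a, tau, mu):
    """Eqs. (23)/(24): element-wise soft shrinkage with threshold tau / (2*mu)."""
    return np.sign(a) * np.maximum(0.0, np.abs(a) - tau / (2.0 * mu))

def linear_kernel_hat(x):
    """DFT of the first row of the linear kernel matrix built from shifts of x."""
    return np.abs(np.fft.fft(x)) ** 2

def update_alpha_hat(k_own, k_other, y_hat, alpha_other_hat, beta_own_hat, lam, mu, rho):
    """Eqs. (26)/(27), evaluated element-wise in the frequency domain."""
    numerator = k_own * y_hat + 2.0 * rho * k_own * k_other * alpha_other_hat + mu * beta_own_hat
    denominator = (1.0 + 2.0 * rho) * k_own * k_own + lam * k_own + mu
    return numerator / denominator

def train_mutual_filters(x1, x2, y, lam=1e-4, tau=1e-5, mu=1e-4, rho=1e-3, n_iter=5):
    k1, k2 = linear_kernel_hat(x1), linear_kernel_hat(x2)
    y_hat = np.fft.fft(y)
    # Assumption: start from the independent single-filter solutions of Eq. (6).
    a1_hat = y_hat / (k1 + lam)
    a2_hat = y_hat / (k2 + lam)
    for _ in range(n_iter):  # n_iter is an illustrative choice
        # The shrinkage acts on the spatial-domain coefficients, hence the IDFT/DFT pair.
        b1_hat = np.fft.fft(soft_shrink(np.real(np.fft.ifft(a1_hat)), tau, mu))
        a1_hat = update_alpha_hat(k1, k2, y_hat, a2_hat, b1_hat, lam, mu, rho)
        b2_hat = np.fft.fft(soft_shrink(np.real(np.fft.ifft(a2_hat)), tau, mu))
        a2_hat = update_alpha_hat(k2, k1, y_hat, a1_hat, b2_hat, lam, mu, rho)
    return a1_hat, a2_hat

rng = np.random.default_rng(2)
n = 32
x1, x2 = rng.standard_normal(n), rng.standard_normal(n)  # two hypothetical feature channels
idx = np.arange(n)
y = np.exp(-0.5 * (np.minimum(idx, n - idx) / 2.0) ** 2)
a1_hat, a2_hat = train_mutual_filters(x1, x2, y)
print(a1_hat.shape, a2_hat.shape)
```

A useful consistency check is that with τ = 0 and ρ = 0 the two filters decouple and each update leaves the single-filter solution of Eq. (6) unchanged.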

Model update

To keep the proposed MKCF_CNN model robust over time, an incremental update scheme is adopted:

$$ {\displaystyle \begin{array}{l}{\alpha}_{1,t}=\left(1-\eta \right){\alpha}_{1,t-1}+\eta {\alpha}_{1,t}\\ {}{\alpha}_{2,t}=\left(1-\eta \right){\alpha}_{2,t-1}+\eta {\alpha}_{2,t}\end{array}} $$
(28)
$$ {\displaystyle \begin{array}{l}{\mathbf{x}}_{1,t}=\left(1-\eta \right){\mathbf{x}}_{1,t-1}+\eta {\mathbf{x}}_{1,t}\\ {}{\mathbf{x}}_{2,t}=\left(1-\eta \right){\mathbf{x}}_{2,t-1}+\eta {\mathbf{x}}_{2,t}\end{array}} $$
(29)

where η is a constant parameter that controls the learning rate and the subscript t denotes the tth frame. On the right-hand sides, the quantities with subscript t are the coefficients and features newly estimated from the tth frame, and the left-hand sides are the updated model. The incremental update strategy helps the model deal with abrupt changes between successive frames.
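
The update of Eqs. (28) and (29) is a linear interpolation between the stored model and the estimate from the current frame. A minimal sketch, in which the function name `interpolate` and the sample arrays are hypothetical:

```python
import numpy as np

def interpolate(previous, current, eta=0.01):
    """Running model update: (1 - eta) * previous + eta * current."""
    return (1.0 - eta) * previous + eta * current

model_alpha = np.array([1.0, 2.0])      # hypothetical stored coefficients
new_alpha = np.array([3.0, 4.0])        # hypothetical estimate from the current frame
updated = interpolate(model_alpha, new_alpha, eta=0.5)
print(updated)
```

With the small learning rate used in the experiments (η = 0.01), the stored model changes slowly, so a single unreliable frame has limited influence on it.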

Target detection

For the kernel correlation filter K1, in the tth frame, a large number of candidates, denoted as \( {\mathbf{x}}_{1,t}^{\hbox{'}} \), are extracted around the base image \( {\mathbf{x}}_{1,t-1} \), which is located at the position of the target in the (t − 1)th frame. The candidates \( {\mathbf{x}}_{1,t}^{\hbox{'}} \) have a circulant structure. Thus, the responses of these candidates are given by

$$ \mathrm{response}1={\mathcal{F}}^{-1}\left({\hat{\mathbf{k}}}_1^{\hbox{'}}\circ {\hat{\alpha}}_{1,t}\right) $$
(30)

In the same way, the responses of the candidates \( {\mathbf{x}}_{2,t}^{\hbox{'}} \) with respect to the kernel correlation filter K2 are obtained by

$$ \mathrm{response}2={\mathcal{F}}^{-1}\left({\hat{\mathbf{k}}}_2^{\hbox{'}}\circ {\hat{\alpha}}_{2,t}\right) $$
(31)

The maximum values of response1 and response2 are obtained by max(response1(:)) and max(response2(:)), respectively. If max(response1(:)) > max(response2(:)), the final response is max(response1(:)); otherwise, it is max(response2(:)). The best position of the target is then obtained according to the final response.
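
The selection rule above can be written compactly. The sketch below assumes two-dimensional response maps and takes the target position to be the location of the peak in the selected map; the function name `fuse_responses` and the toy response maps are illustrative.

```python
import numpy as np

def fuse_responses(response1, response2):
    """Return (final response, peak location, index of the selected filter)."""
    if response1.max() > response2.max():
        chosen, which = response1, 1
    else:
        chosen, which = response2, 2
    location = np.unravel_index(np.argmax(chosen), chosen.shape)
    return float(chosen.max()), tuple(int(i) for i in location), which

response1 = np.zeros((4, 4))
response1[1, 2] = 0.7                   # hypothetical peak of the first filter
response2 = np.zeros((4, 4))
response2[3, 0] = 0.4                   # hypothetical peak of the second filter
final_response, position, selected = fuse_responses(response1, response2)
print(final_response, position, selected)
```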

Convolutional neural network (CNN) features extracted from MatConvNet

Traditional handcrafted features, such as histogram of oriented gradients (HOG), SIFT, and CN, have achieved promising tracking performance in the past decade. However, these handcrafted features have been surpassed by CNN features. In [40], CNN-based representations achieved impressive results on image recognition and object detection. In [35], the outputs of three convolutional layers of the VGG-19 model, conv3-4, conv4-4, and conv5-4, were introduced to visual tracking and demonstrated powerful representation ability. Inspired by [41], we use the conv5-4 and conv4-4 convolutional layers of VGG-19 to model the appearance of the target. Features from conv5-4 carry more semantic information and can discriminate the target from a dramatically changing background. Features from conv4-4 retain more spatial detail and can locate the target precisely.

Target recovery

We adopt the EdgeBox method [42] to re-detect the target after tracking failures. A large number of object bounding box proposals Pd are generated by the EdgeBox method, and these proposals are evaluated under the correlation filter framework to decide the final tracking position. Given the position \( \left({x}_{t-1},{y}_{t-1}\right) \) of the target in the (t − 1)th frame, a set of bounding box proposals is extracted around this position in the current frame. The position of each bounding box proposal pi in the tth frame is denoted \( \left({x}_t^i,{y}_t^i\right) \). The maximum response score of each proposal pi is denoted r(pi) and is computed by Eq. (7) using the HOG feature. If the score of the tracking result in the tth frame is smaller than the threshold T0, the tracker is considered to have lost the target and the re-detection scheme is triggered. The optimal bounding box proposal in the tth frame is obtained by minimizing the following expression:

$$ {\displaystyle \begin{array}{l}\arg \underset{i}{\min }r\left({p}_t^i\right)+\alpha L\left({p}_t^i,{p}_{t-1}\right)\\ {}\kern2.75em s.t.\kern0.5em r\left({p}_t^i\right)>1.5{T}_0\end{array}} $$
(32)

where \( L\left({p}_t^i,{p}_{t-1}\right)=\exp \left(-\frac{1}{2{\sigma}^2}{\left\Vert \left({x}_t^i,{y}_t^i\right)-\left({x}_{t-1},{y}_{t-1}\right)\right\Vert}^2\right) \). The term \( L\left({p}_t^i,{p}_{t-1}\right) \) is a motion constraint between two successive frames. Here, α is a constant parameter that balances the response score and the motion constraint (it is distinct from the filter coefficients α1 and α2), and σ is the diagonal length of the initial target size.

Scale estimation

Scale estimation is very important for robust tracking. Motivated by [42], we use the EdgeBox method to deal with the scale variation that appears in sequences. Given the size \( \left({w}_{t-1},{h}_{t-1}\right) \) of the target in the (t − 1)th frame, we use the EdgeBox method to generate multi-scale bounding box proposals Ps with sizes \( s{w}_{t-1}\times s{h}_{t-1} \) in the current frame, where s is a scale factor, and reject the proposals whose intersection over union (IoU) is lower than 0.6 or higher than 0.9. For each accepted scale proposal, we compute the response score under the correlation filter framework. If the maximum response score \( \max \left\{r\left({p}_i\right)\ |\ {p}_i\in {P}_s\right\} \) is smaller than the response obtained in Section 3.4, we keep the size of the target from the (t − 1)th frame. Otherwise, we update the size of the target by the following equation:

$$ \left({w}_t,{h}_t\right)=\gamma \left({w}_t^{\ast },{h}_t^{\ast}\right)+\left(1-\gamma \right)\left({w}_{t-1},{h}_{t-1}\right) $$
(33)

where \( \left({w}_t^{\ast },{h}_t^{\ast}\right) \) is the size of the proposal with the maximum response score. γ is a constant parameter which controls the update rate.
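
The decision rule and the damped size update of Eq. (33) can be sketched as follows. The function name `update_size`, its argument names, and the sample sizes and scores are hypothetical; γ = 0.6 is the value reported in the experimental setup.

```python
def update_size(previous_size, proposal_size, proposal_score, tracker_score, gamma=0.6):
    """Keep the previous size unless the best proposal scores at least as high as the tracker."""
    if proposal_score < tracker_score:
        return previous_size
    w_prev, h_prev = previous_size
    w_star, h_star = proposal_size
    # Eq. (33): blend the proposal size with the previous size.
    return (gamma * w_star + (1.0 - gamma) * w_prev,
            gamma * h_star + (1.0 - gamma) * h_prev)

kept = update_size((40.0, 80.0), (50.0, 100.0), proposal_score=0.3, tracker_score=0.5)
resized = update_size((40.0, 80.0), (50.0, 100.0), proposal_score=0.8, tracker_score=0.5)
print(kept, resized)
```

Blending with the previous size rather than adopting the proposal size outright damps frame-to-frame jitter in the estimated scale.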

Results and discussion

In this section, we evaluate our proposed method on three public datasets: OTB-2013 [43], TColor-128 [44], and DTB70 [45]. The Matlab pseudocode and the tracking pipeline of our MKCF_CNN method are given in Tables 1 and 2, respectively. Extensive experiments demonstrate that our method achieves very appealing performance in terms of effectiveness and robustness.

Table 1 Matlab pseudo-codes of MKCF_CNN
Table 2 Tracking pipeline of MKCF_CNN method

Experimental setup

The proposed MKCF_CNN method is implemented in MATLAB on a PC equipped with an Intel Xeon E5-2640 v4 CPU, 128 GB of RAM, and a single NVIDIA GeForce GTX 1080Ti GPU. We adopt the pretrained VGGNet-19 as our feature extractor and use MatConvNet for feature generation. We train two correlation filters on the outputs of the conv4-4 and conv5-4 layers. The linear kernel is adopted in this paper. The parameters λ, τ, μ, and ρ in Eq. (14) are empirically set to \( {10}^{-4} \), \( {10}^{-5} \), \( {10}^{-4} \), and \( {10}^{-3} \), respectively. We set the update rate η in Eqs. (28) and (29) to 0.01 and the weight parameter γ in Eq. (33) to 0.6. The tracking failure threshold T0 is set to 0.2.

Evaluation metrics

We use two measurements, precision plots and success plots [46], to quantitatively assess the tracking results of our method. Precision plots show the percentage of frames in which the center location error is within a given threshold, which is set to 20 pixels. The center location error is the Euclidean distance between the tracked location and the ground truth. Success plots show the percentage of frames in which the overlap rate S is larger than a fixed threshold T1. The overlap rate is defined as \( S=\frac{\mathrm{Area}\left({B}_E\cap {B}_G\right)}{\mathrm{Area}\left({B}_E\cup {B}_G\right)} \), where ∩ and ∪ are the intersection and union operators, BE is the estimated bounding box, and BG is the ground-truth bounding box. T1 is set to 0.5 in this paper.
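
Both measurements are straightforward to compute from per-frame bounding boxes. The sketch below assumes boxes stored as rows of (x, y, w, h) with (x, y) the top-left corner; this box format, the function names, and the three toy frames are assumptions made for the example.

```python
import numpy as np

def center_errors(pred, gt):
    """Euclidean distance between box centers; boxes are rows of (x, y, w, h)."""
    pred_centers = pred[:, :2] + pred[:, 2:] / 2.0
    gt_centers = gt[:, :2] + gt[:, 2:] / 2.0
    return np.linalg.norm(pred_centers - gt_centers, axis=1)

def overlap_rates(pred, gt):
    """Overlap rate S = Area(intersection) / Area(union) for each frame."""
    left = np.maximum(pred[:, 0], gt[:, 0])
    top = np.maximum(pred[:, 1], gt[:, 1])
    right = np.minimum(pred[:, 0] + pred[:, 2], gt[:, 0] + gt[:, 2])
    bottom = np.minimum(pred[:, 1] + pred[:, 3], gt[:, 1] + gt[:, 3])
    inter = np.clip(right - left, 0.0, None) * np.clip(bottom - top, 0.0, None)
    union = pred[:, 2] * pred[:, 3] + gt[:, 2] * gt[:, 3] - inter
    return inter / union

def precision_at(pred, gt, threshold=20.0):
    """Fraction of frames whose center location error is within the threshold."""
    return float(np.mean(center_errors(pred, gt) <= threshold))

def success_at(pred, gt, threshold=0.5):
    """Fraction of frames whose overlap rate exceeds the threshold."""
    return float(np.mean(overlap_rates(pred, gt) > threshold))

pred = np.array([[0, 0, 10, 10], [0, 0, 10, 10], [0, 0, 10, 10]], dtype=float)
gt = np.array([[0, 0, 10, 10], [5, 0, 10, 10], [30, 0, 10, 10]], dtype=float)
print(precision_at(pred, gt), success_at(pred, gt))
```

In the toy data the second frame is counted as precise (center error of 5 pixels) but not as successful (overlap rate of one third), which illustrates that the two measurements capture different aspects of tracking quality.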

To evaluate the tracking performance of our method comprehensively, the challenging videos from OTB-2013 and TColor-128 are categorized with 11 attributes including background clutter (BC), deformation (DEF), fast motion (FM), in-plane rotation (IPR), illumination variation (IV), low resolution (LR), motion blur (MB), occlusion (OCC), out-of-plane rotation (OPR), out of view (OV), and scale variation (SV).

Comparison of tracking performance on OTB-2013

The OTB-2013 benchmark dataset contains 51 sequences with 11 challenging attributes. We compare our method with 9 state-of-the-art algorithms, which include deep learning tracking methods (HCFT [35], HDT [47], CNN-SVM [48], DeepSRDCF [49]) and correlation filter tracking methods (MEEM [50], Staple [51], SAMF [52], DSST [28]). Figure 4 gives the precision plots and success plots of OPE of our proposed method against other state-of-the-art methods on OTB-2013. According to Fig. 4, our MKCF_CNN tracker outperforms most of the other trackers, demonstrating its effectiveness. The proposed MKCF_CNN method achieves a 2.3% gain in precision over HCFT, which is the tracking method most closely related to ours. Meanwhile, MKCF_CNN and DeepSRDCF rank first on the success score.

Fig. 4

Precision plots and success plots of OPE of our proposed method against other state-of-the-art methods on OTB-2013

To comprehensively assess the tracking performance of our proposed MKCF_CNN tracker, we present tracking results under OPE for the 11 attributes in Figs. 5 and 6. On the 51 videos with all 11 challenging attributes, our method ranks first among the 10 evaluated trackers on precision plots. On the videos with attributes such as background clutter, deformation, in-plane rotation, illumination variation, low resolution, and out of view, MKCF_CNN ranks first among all the evaluated trackers on success plots. In the HCFT method, the outputs of the conv3-4, conv4-4, and conv5-4 layers are used as the deep features. In the HDT method, the outputs of six convolutional layers (10th–12th, 14th–16th) from VGGNet-19 are adopted as feature maps. In contrast, only two layers (conv4-4, conv5-4) from VGGNet-19 are used in our proposed method, and two mutual kernelized correlation filters are trained to interact with each other throughout the tracking process, without the fixed parameters used in HCFT or the fixed initial parameters used in HDT. From Figs. 5 and 6, it is clear that our method performs better than these most relevant methods.

Fig. 5

Precision plots of OPE with different attributes on OTB-2013

Fig. 6

Success plots of OPE with different attributes on OTB-2013

Tracking speed is very important for visual tracking. Correlation filter-based trackers run beyond real-time speed when using handcrafted features. For a tracker with a single correlation filter, the computation is dominated by the DFT and the inverse DFT, whose complexity is O(n log n), where n is the dimensionality of the features. The whole computational load of a tracker built from M correlation filters is therefore O(Mn log n), where M is the number of base trackers; M = 2 in our method and M = 3 in HCFT. For trackers under the correlation filter framework with deep features, the computational burden mainly comes from the feature extraction process. The tracking speed of our proposed method is 1.3 fps, which is slightly faster than HCFT at 1.1 fps.

Comparison of tracking performance on TColor-128

The TColor-128 dataset consists of 128 challenging color videos and is designed to assess tracking performance on color sequences. As before, we evaluate our proposed MKCF_CNN method against 9 state-of-the-art trackers: HCFT [35], COCF [41], KCF_GaussianHog [27], SRDCF [29], MUSTER [53], SAMF [52], DSST [28], Struck [54], and ASLA [55]. Figure 7 shows precision plots and success plots of OPE of our proposed method against the other state-of-the-art methods on TColor-128. Figures 8 and 9 present precision plots and success plots of OPE with different attributes on TColor-128, respectively. Our method is the best of the ten trackers on TColor-128, with a precision rate of 73.5% and a success rate of 63.1%; HCFT and COCF rank second and third, respectively. Although HCFT utilizes deep features from three layers, it does not outperform our method. COCF uses the same outputs from two layers of VGGNet-19 as our method, yet it performs worse than our MKCF_CNN tracker. This is because the scale estimation and re-detection schemes allow our method to locate the target precisely. Figures 8 and 9 demonstrate the effectiveness of our method on TColor-128 with 11 challenging attributes, where it performs best against the 9 other methods. Table 3 compares the success rates of 8 trackers. The experimental results show that our method achieves the best performance under all challenging attributes except for scale variation.
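For readers who want to see how precision and success figures of this kind are typically computed, the sketch below follows the common OTB convention [43]: precision is the fraction of frames whose center location error is within a pixel threshold (20 pixels is the usual choice), and the success score is the area under the curve of the fraction of frames whose overlap exceeds a threshold swept from 0 to 1. Whether the numbers above use exactly these settings is an assumption. The function names and the toy boxes are hypothetical and are not taken from any benchmark toolkit.

```python
import math


def center_error(a, b):
    """Euclidean distance between the centers of two (x, y, w, h) boxes."""
    ax, ay = a[0] + a[2] / 2.0, a[1] + a[3] / 2.0
    bx, by = b[0] + b[2] / 2.0, b[1] + b[3] / 2.0
    return math.hypot(ax - bx, ay - by)


def overlap(a, b):
    """Intersection over union of two (x, y, w, h) boxes."""
    iw = min(a[0] + a[2], b[0] + b[2]) - max(a[0], b[0])
    ih = min(a[1] + a[3], b[1] + b[3]) - max(a[1], b[1])
    inter = max(iw, 0.0) * max(ih, 0.0)
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0


def precision_rate(pred, gt, pixel_threshold=20.0):
    """Fraction of frames with center error within the pixel threshold."""
    hits = sum(center_error(p, g) <= pixel_threshold for p, g in zip(pred, gt))
    return hits / len(gt)


def success_auc(pred, gt, steps=20):
    """Mean success rate over overlap thresholds 0, 1/steps, ..., 1."""
    ious = [overlap(p, g) for p, g in zip(pred, gt)]
    rates = []
    for i in range(steps + 1):
        t = i / steps
        rates.append(sum(v > t for v in ious) / len(ious))
    return sum(rates) / len(rates)


# Toy two-frame sequence: one perfect prediction, one complete miss.
gt = [(0, 0, 10, 10), (0, 0, 10, 10)]
pred = [(0, 0, 10, 10), (100, 100, 10, 10)]
precision = precision_rate(pred, gt)
auc = success_auc(pred, gt)
print(precision, auc)
```

Under OPE, the tracker is initialized once from the ground truth in the first frame and these scores are then averaged over all sequences.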

Fig. 7 Precision plots and success plots of OPE of our proposed method against other state-of-the-art methods on TColor-128

Fig. 8 Precision plots of OPE with different attributes on TColor-128

Fig. 9 Success plots of OPE with different attributes on TColor-128

Table 3 The success rates of 8 trackers with 11 challenging attributes on TColor-128 dataset. The best, second best, and third best tracking results are represented in red, blue, and green, respectively

Figure 10 shows some tracking results on two sequences with severe occlusion. In the Lemming video, the toy Lemming is severely occluded by a triangular ruler while it is moving (e.g., #320, #340). The proposed method, SAMF, Struck, and OAB are robust to severe occlusion and track the Lemming target steadily. In the skating2 sequence, the female dancer undergoes obvious appearance variation and is occasionally totally occluded by the male dancer while they are skating (e.g., #150, #250). The proposed method, HCFT, and COCF, all of which use deep features, deal with the severe occlusion and appearance variation effectively.

Fig. 10 Tracking results of nine trackers on sequences Lemming and skating2, in which the targets undergo occlusion. The tracking results of ASLA, IVT, CSK, SAMF, OAB, Struck, HCFT, COCF, and our tracker are shown by red, green, blue, black, magenta, cyan, gray, dark red, and orange rectangles, respectively

Figure 11 shows some screenshots of two videos with fast motion. In the Soccer sequence, the target player keeps jumping and undergoes fast motion, background clutter, and occlusion while celebrating the victory (e.g., #36, #76, #170). IVT, Struck, CSK, ASLA, and OAB lose the target completely because of these challenging interference factors. The target in the Biker sequence undergoes fast motion and scale variation because of fast riding (e.g., #10, #100, #200). Our method performs well over the entire sequence and deals with motion blur and scale variation effectively.

Fig. 11 Tracking results of nine trackers on sequences Soccer and Biker, in which the targets undergo fast motion. The tracking results of ASLA, IVT, CSK, SAMF, OAB, Struck, HCFT, COCF, and our tracker are shown by red, green, blue, black, magenta, cyan, gray, dark red, and orange rectangles, respectively

Figure 12 illustrates some sampled tracking results on two sequences with appearance variation. The appearance of the target in the Surfing sequence changes severely while the player is surfing (e.g., #100, #125). From the tracking results, we can see that most of the trackers are able to locate the target coarsely, but only our method tracks the target more precisely. In the Bikeshow sequence, the biker cycles in the square with severe appearance variation and scale change (e.g., #20, #120, #361). The proposed method, HCFT, and COCF, all of which utilize deep features, handle appearance change better than the other methods with handcrafted features.

Fig. 12 Tracking results of nine trackers on sequences Surfing and Bikeshow, in which the targets undergo appearance variation. The tracking results of ASLA, IVT, CSK, SAMF, OAB, Struck, HCFT, COCF, and our tracker are shown by red, green, blue, black, magenta, cyan, gray, dark red, and orange rectangles, respectively

Figure 13 shows some tracking results on two sequences with background clutter. The target in the Board sequence moves through complex scenes with severe background clutter (e.g., #160, #300, #400). Our method tracks the target successfully throughout the sequence. In the Torus sequence, the target moves in a cluttered room with slight appearance variation (e.g., #100, #200, #220). We can observe that trackers with handcrafted features cannot deal with this situation and drift away to other objects.

Fig. 13 Tracking results of nine trackers on sequences Board and Torus, in which the targets undergo background clutter. The tracking results of ASLA, IVT, CSK, SAMF, OAB, Struck, HCFT, COCF, and our tracker are shown by red, green, blue, black, magenta, cyan, gray, dark red, and orange rectangles, respectively

Figure 14 shows some screenshots of tracking results on two sequences with illumination variation. In the Shaking video, a guitarist is playing on the stage with dim lights (e.g., #100, #200, #300). Although the target undergoes severe illumination variation, our method locates the target more precisely than the other trackers. In the Singer2 sequence, the singer in dark clothes performing on the stage undergoes drastic illumination variation (e.g., #110, #210, #320). We can observe that HCFT and COCF, despite using deep features, drift away from the target as a result of the drastic illumination variation. Only our method is able to track the target persistently over the whole sequence.

Fig. 14 Tracking results of nine trackers on sequences Shaking and Singer2, in which the targets undergo illumination change. The tracking results of ASLA, IVT, CSK, SAMF, OAB, Struck, HCFT, COCF, and our tracker are shown by red, green, blue, black, magenta, cyan, gray, dark red, and orange rectangles, respectively

Comparison of tracking performance on DTB

The DTB dataset consists of 70 challenging videos captured by a camera mounted on an unmanned aerial vehicle (UAV). All 70 sequences in the DTB dataset were manually annotated with 11 challenging attributes: motion blur (MB), scale variation (SV), similar objects around (SOA), aspect ratio variation (ARV), background clutter (BC), occlusion (OCC), out-of-view (OV), deformation (DEF), out-of-plane rotation (OPR), fast camera motion (FCM), and in-plane rotation (IPR). We compare our method with 9 representative trackers: HCFT [35], HDT [47], COCF [41], MEEM [50], SO-DLT [56], SRDCF [29], KCF [27], DAT [57], and DSST [28]. Figure 15 shows the overall tracking performance of OPE based on precision score and success score on the DTB dataset. The proposed tracker achieves the best tracking performance against the 9 other trackers.

Fig. 15 Precision plots and success plots of OPE of our proposed method against other state-of-the-art methods on DTB

Ablation study

Effect of mutual kernelized correlation filters

To demonstrate the effectiveness of mutual correlation filters, we compare the tracking performance of our proposed method with and without mutual correlation filters on OTB-2013. Figure 16 gives the precision plots and success plots of OPE under the two settings. Our method with mutual correlation filters achieves a precision score of 0.914, an improvement of 0.9% over the method without mutual correlation filters. In the success plots, owing to the interaction of the mutual correlation filters, the tracking performance is improved by 2.0%. Figures 17 and 18 show the tracking results on OTB-2013 with 11 challenging attributes. Our method with mutual correlation filters achieves better tracking performance on all 11 attributes in both average precision score and average success rate.

Fig. 16 Precision plots and success plots of OPE by our proposed method with mutual kernelized correlation filters and our method without mutual kernelized correlation filters on OTB-2013

Fig. 17 Average precision score of our proposed method with mutual kernelized correlation filters and our method without mutual kernelized correlation filters in terms of 11 challenging attributes on OTB-2013

Fig. 18 Average success rate of our proposed method with mutual kernelized correlation filters and our method without mutual kernelized correlation filters in terms of 11 challenging attributes on OTB-2013

Effect of elastic net constraint

Figure 19 gives the tracking results on OTB-2013 of our method with and without the elastic net constraint in terms of precision and success rate. We can observe that the proposed method with the elastic net constraint performs slightly better than the method without it. Table 4 gives the tracking results on OTB-2013 with 11 challenging attributes. Our proposed method with the elastic net constraint obtains better performance than the method without it in terms of IPR, OC, SV, OPR, and IV.
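As a reminder of what an elastic net penalty does to individual coefficients, the sketch below applies the proximal operator of the generic penalty l1*|w| + 0.5*l2*w^2 to a few scalar values. This is the textbook elastic net and not the closed-form solver of the proposed tracker; the function name `elastic_net_prox` and the parameter values are illustrative assumptions. Small coefficients are set exactly to zero by the l1 term, which is one way to read "filtering interfering features", while the l2 term shrinks the surviving coefficients uniformly.

```python
import math


def elastic_net_prox(v, l1, l2):
    """Minimizer over w of 0.5*(w - v)**2 + l1*abs(w) + 0.5*l2*w**2."""
    if abs(v) <= l1:
        return 0.0  # l1 term: coefficients below the threshold are removed
    # Soft-threshold by l1, then shrink by the l2 term.
    return math.copysign(abs(v) - l1, v) / (1.0 + l2)


coeffs = [3.0, -3.0, 0.5, -0.2]
filtered = [elastic_net_prox(c, l1=1.0, l2=1.0) for c in coeffs]
print(filtered)
```

With l2 set to zero the operator reduces to plain soft-thresholding (LASSO), and with l1 set to zero it reduces to ridge-style shrinkage.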

Fig. 19 Precision plots and success plots of OPE by our proposed method with elastic net constraint and our method without elastic net constraint on OTB-2013

Table 4 The success rates of our method with elastic net constraint and our method without elastic net constraint on OTB-2013. The best tracking results are represented in red. ENC denotes elastic net constraint

Effect of scale estimation

In this section, we investigate the tracking performance with and without the scale estimation scheme. Experimental results on OTB-2013 are shown in Figs. 20 and 21. The first picture in Fig. 20 compares the success plots of OPE on OTB-2013, and the second picture in Fig. 20 gives the success plots of OPE in terms of scale variation. Figure 21 shows the average success rate of our proposed method with and without the scale estimation scheme in terms of 11 challenging attributes on OTB-2013. It can be seen that the scale estimation mechanism improves the tracking performance greatly.

Fig. 20 The comparison of our method with scale estimation scheme and our method without scale estimation scheme on OTB-2013. The first figure demonstrates the success plots of OPE and the second one gives the success plots of OPE in terms of scale variation

Fig. 21 Average success rate of our proposed method with scale estimation scheme and our method without scale estimation scheme in terms of 11 challenging attributes on OTB-2013

Effect of re-detection module

In this section, we compare the tracking performance with and without the re-detection module on OTB-2013. The first picture in Fig. 22 compares the success plots of OPE on OTB-2013, and the second picture in Fig. 22 gives the success plots of OPE in terms of occlusion. The re-detection module is able to recover the target in case of tracking failures. Table 5 gives the tracking results on OTB-2013 in terms of 11 challenging attributes. The best tracking results are shown in red. Our method with the re-detection module achieves better tracking results on all 11 attributes except LR and DE.
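The general idea of recovering from tracking failures can be sketched as a confidence gate: when the peak response of the tracker falls below a threshold, a set of candidate boxes is scored and the best one replaces the current estimate if it is confident enough. The trigger condition and candidate generator of the proposed method are not specified in this section, so everything below (the function `track_step`, both thresholds, the candidate list, and the scoring function) is a hypothetical illustration rather than the authors' module.

```python
def track_step(peak_score, peak_box, candidates, score_fn,
               fail_threshold=0.3, accept_threshold=0.5):
    """Return (box, redetected) for one frame.

    peak_score / peak_box: result of the normal correlation filter search.
    candidates: boxes proposed by some external detector (assumption).
    score_fn: maps a candidate box to a confidence value (assumption).
    """
    if peak_score >= fail_threshold:
        return peak_box, False  # tracker is confident: keep its estimate
    if not candidates:
        return peak_box, False  # nothing to fall back on
    best = max(candidates, key=score_fn)
    if score_fn(best) >= accept_threshold:
        return best, True  # re-detection succeeded
    return peak_box, False  # candidates are not convincing either


# Toy scores standing in for filter responses on candidate regions.
scores = {(10, 10, 20, 20): 0.2, (50, 40, 20, 20): 0.8}
confident = track_step(0.9, (12, 11, 20, 20), list(scores), scores.get)
recovered = track_step(0.1, (12, 11, 20, 20), list(scores), scores.get)
print(confident, recovered)
```

Using a lower threshold to trigger re-detection than to accept a candidate keeps the module from switching targets on marginal evidence.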

Fig. 22 Comparison of our method with re-detection module and our method without re-detection module on OTB-2013. The first picture demonstrates the success plots of OPE and the second one gives the success plots of OPE in terms of occlusion

Table 5 The success rates of our method with re-detection module and our method without re-detection module on OTB-2013. The best tracking results are represented in red. RD denotes re-detection

Summary and conclusion

In this paper, we propose a novel visual tracking method based on mutual kernelized correlation filters with elastic net constraint. The proposed algorithm trains two interacting discriminative classifiers to cope with challenging environments and severe appearance variation. The elastic net constraint is imposed on the mutual kernelized correlation filters to group similar features and to alleviate the impact of outliers. Scale adaptation and a re-detection scheme are applied in our method to improve tracking performance. Extensive experimental results demonstrate that our proposed method obtains appealing tracking performance by using the interacting kernelized correlation filters with elastic net constraint. Quantitative and qualitative results show the superiority of our method in terms of effectiveness and robustness compared with other tracking algorithms.

Availability of data and materials

Data sharing not applicable to this article as no datasets were generated or analyzed during the current study.

Abbreviations

BC: Background clutter
CN: Color name
CNN: Convolutional neural networks
DEF: Deformation
DFT: Discrete Fourier transform
ENC: Elastic net constraint
FFT: Fast Fourier transform
FM: Fast motion
HOG: Histogram of oriented gradient
IPR: In-plane rotation
IV: Illumination variation
KCF: Kernelized correlation filters
LASSO: Least absolute shrinkage and selection operator
MB: Motion blur
OCC: Occlusion
OPR: Out-of-plane rotation
LR: Low resolution
OV: Out of view
RD: Re-detection
RNN: Recurrent neural network
SIFT: Scale-invariant feature transform
SRDCF: Spatially regularized discriminative correlation filters
SV: Scale variation

References

  1. A. Li, M. Lin, Y. Wu, M. Yang, S. Yan, NUS-PRO: a new visual tracking challenge. IEEE Trans. Pattern Anal. Mach. Intell. 38(2), 335–349 (2016)
  2. P. Li, D. Wang, L. Wang, H. Lu, Deep visual tracking: review and experimental comparison. Pattern Recogn. 76, 323–338 (2018)
  3. S. Zhang, X. Lan, Y. Qi, C. Yuen, Robust visual tracking via basis matching. IEEE Trans. Circuits Syst. Video Technol. 27(3), 421–430 (2017)
  4. S. Zhang, H. Zhou, F. Jiang, X. Li, Robust visual tracking using structurally random projection and weighted least squares. IEEE Trans. Circuits Syst. Video Technol. 25(11), 1749–1760 (2015)
  5. D. Wang, H. Lu, M. Yang, Robust visual tracking via least soft-threshold square. IEEE Trans. Circuits Syst. Video Technol. 26(9), 1709–1721 (2016)
  6. L. Zhang, W. Wu, T. Chen, N. Strobel, D. Comaniciu, Robust object tracking using semi-supervised appearance dictionary learning. Pattern Recogn. Lett. 62, 17–23 (2015)
  7. W. Zhong, H. Lu, M. Yang, Robust object tracking via sparse collaborative appearance model. IEEE Trans. Image Process. 23(5), 2356–2368 (2014)
  8. Y. Song, C. Ma, L. Gong, J. Zhang, R. Lau, M. Yang, in Proceedings of the IEEE International Conference on Computer Vision. CREST: convolutional residual learning for visual tracking (2017), pp. 2555–2564
  9. T. Zhang, C. Xu, M. Yang, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Multi-task correlation particle filter for robust object tracking (2017), pp. 4819–4827
  10. W. Chen, K. Zhang, Q. Liu, Robust visual tracking via patch based kernel correlation filters with adaptive multiple feature ensemble. Neurocomput. 214, 607–617 (2016)
  11. K. Zhang, X. Li, H. Song, Q. Liu, Visual tracking using spatio-temporally nonlocally regularized correlation filter. Pattern Recogn. 83, 185–195 (2018)
  12. K. Zhang, Q. Liu, J. Yang, M.-H. Yang, Visual tracking via boolean map representations. Pattern Recogn. 81, 47–160 (2018)
  13. S. Yao, Z. Zhang, G. Wang, Y. Tang, L. Zhang, in Proceedings of the European Conference on Computer Vision. Real-time visual tracking: promoting the robustness of correlation filter learning (2016), pp. 662–678
  14. X. Mei, H. Ling, in Proceedings of the IEEE International Conference on Computer Vision. Robust visual tracking using ℓ1 minimization (2009), pp. 1436–1443
  15. C. Bao, Y. Wu, H. Ling, H. Ji, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Real time robust ℓ1 tracker using accelerated proximal gradient approach (2012), pp. 1830–1837
  16. Z. Xiao, H. Lu, D. Wang, L2-RLS based object tracking. IEEE Trans. Circuits Syst. Video Technol. 24(8), 1301–1308 (2014)
  17. D. Wang, H. Lu, Fast and robust object tracking via probability continuous outlier model. IEEE Trans. Image Process. 24(12), 5166–5176 (2015)
  18. B. Babenko, M. Yang, S. Belongie, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Visual tracking with online multiple instance learning (2009), pp. 983–990
  19. K. Zhang, L. Zhang, M. Yang, Fast compressive tracking. IEEE Trans. Pattern Anal. Mach. Intell. 36(10), 2002–2015 (2014)
  20. K. Zhang, L. Zhang, Q. Liu, D. Zhang, M. Yang, in Proceedings of the European Conference on Computer Vision. Fast visual tracking via dense spatio-temporal context learning (2014), pp. 127–141
  21. M. Wang, Y. Liu, Z. Huang, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Large margin object tracking with circulant feature maps (2017), pp. 4021–4029
  22. H. Fan, H. Ling, in Proceedings of the IEEE International Conference on Computer Vision. Parallel tracking and verifying: a framework for real-time and high accuracy visual tracking (2017), pp. 5486–5494
  23. F. Li, C. Tian, W. Zuo, L. Zhang, M. Yang, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Learning spatial-temporal regularized correlation filters for visual tracking (2018), pp. 4904–4913
  24. W. Zuo, X. Wu, L. Lin, L. Zhang, M. Yang, Learning support correlation filters for visual tracking. IEEE Trans. Pattern Anal. Mach. Intell. DOI: https://doi.org/10.1109/TPAMI.2018.2829180
  25. M. Danelljan, G. Hager, F. Khan, M. Felsberg, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Adaptive decontamination of the training set: a unified formulation for discriminative visual tracking (2016), pp. 1430–1438
  26. D. Bolme, J. Beveridge, B. Draper, Y. Lui, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Visual object tracking using adaptive correlation filters (2010), pp. 2544–2550
  27. J. Henriques, R. Caseiro, P. Martins, J. Batista, High-speed tracking with kernelized correlation filters. IEEE Trans. Pattern Anal. Mach. Intell. 37(3), 583–596 (2015)
  28. M. Danelljan, G. Hager, F. Khan, M. Felsberg, Discriminative scale space tracking. IEEE Trans. Pattern Anal. Mach. Intell. 39(8), 1561–1575 (2017)
  29. M. Danelljan, G. Hager, F. Khan, M. Felsberg, in Proceedings of the IEEE International Conference on Computer Vision. Learning spatially regularized correlation filters for visual tracking (2015), pp. 4310–4318
  30. L. Bertinetto, J. Valmadre, J.F. Henriques, A. Vedaldi, P.H.S. Torr, in Proceedings of the European Conference on Computer Vision Workshops. Fully-convolutional siamese networks for object tracking (2016), pp. 850–865
  31. H. Nam, B. Han, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Learning multi-domain convolutional neural networks for visual tracking (2016), pp. 4293–4302
  32. Z. Chi, H. Li, H. Lu, M. Yang, Dual deep network for visual tracking. IEEE Trans. Image Process. 26(4), 2005–2015 (2017)
  33. S. Zhang, Y. Qi, F. Jiang, X. Lan, P. Yuen, H. Zhou, Point-to-set distance metric learning on deep representations for visual tracking. IEEE Trans. Intell. Transp. Sys. 19(1), 187–198 (2018)
  34. K. Zhang, Q. Liu, Y. Wu, M. Yang, Robust visual tracking via convolutional networks without training. IEEE Trans. Image Process. 25(4), 1779–1792 (2016)
  35. C. Ma, J. Huang, X. Yang, M. Yang, in Proceedings of the IEEE International Conference on Computer Vision. Hierarchical convolutional features for visual tracking (2015), pp. 3074–3082
  36. L. Wang, W. Ouyang, X. Wang, H. Lu, in Proceedings of the IEEE International Conference on Computer Vision. Visual tracking with fully convolutional networks (2015), pp. 3119–3127
  37. H. Fan, H. Ling, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. SANet: structure-aware network for visual tracking (2017), pp. 42–49
  38. Z. He, Y. Fan, J. Zhuang, Y. Dong, H. Bai, in Proceedings of the IEEE International Conference on Computer Vision. Correlation filters with weighted convolution responses (2017), pp. 1992–2000
  39. S. Yao, G. Wang, L. Zhang, Correlation filter learning toward peak strength for visual tracking. IEEE Trans. Cybern. 48(4), 1290–1303 (2018)
  40. K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image recognition. arXiv:1409.1556 (2015)
  41. L. Zhang, P. Suganthan, Robust visual tracking via co-trained Kernelized correlation filters. Pattern Recogn. 69, 82–93 (2017)
  42. D. Huang, L. Luo, M. Wen, Z. Chen, C. Zhang, in Proceedings of British Machine Vision Conference. Enable scale and aspect ratio adaptability in visual tracking with detection proposals (2015), pp. 185.1–185.12
  43. Y. Wu, J. Lim, M. Yang, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Online object tracking: a benchmark (2013), pp. 2411–2418
  44. P. Liang, E. Blasch, H. Ling, Encoding color information for visual tracking: algorithms and benchmark. IEEE Trans. Image Process. 24(12), 5630–5644 (2015)
  45. S. Li, D. Yeung, in AAAI Conference on Artificial Intelligence. Visual object tracking for unmanned aerial vehicles: a benchmark and new motion models (2017), pp. 4140–4146
  46. Y. Wu, J. Lim, M. Yang, Object tracking benchmark. IEEE Trans. Pattern Anal. Mach. Intell. 37(9), 1834–1848 (2015)
  47. Y. Qi, S. Zhang, L. Qin, H. Yao, Q. Huang, J. Lim, M. Yang, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Hedged deep tracking (2016), pp. 4303–4311
  48. S. Hong, T. You, S. Kwak, B. Han, in Proceedings of the 32nd International Conference on Machine Learning. Online tracking by learning discriminative saliency map with convolutional neural network (2015), pp. 597–606
  49. M. Danelljan, G. Hager, F. Khan, M. Felsberg, in Proceedings of the IEEE International Conference on Computer Vision Workshop. Convolutional features for correlation filter based visual tracking (2015), pp. 621–629
  50. J. Zhang, S. Ma, S. Sclaroff, in Proceedings of the European Conference on Computer Vision. MEEM: robust tracking via multiple experts using entropy minimization (2014), pp. 188–203
  51. L. Bertinetto, J. Valmadre, S. Golodetz, O. Miksik, P.H.S. Torr, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Staple: complementary learners for real-time tracking (2016), pp. 1401–1409
  52. Y. Li, J. Zhu, in Proceedings of the European Conference on Computer Vision. A scale adaptive kernel correlation filter tracker with feature integration (2014), pp. 254–265
  53. Z. Hong, Z. Chen, C. Wang, X. Mei, D. Prokhorov, D. Tao, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Multi-store tracker (MUSTer): a cognitive psychology inspired approach to object tracking (2015), pp. 749–758
  54. S. Hare, A. Saffari, P.H.S. Torr, in Proceedings of the IEEE International Conference on Computer Vision. Struck: structured output tracking with kernels (2011), pp. 263–270
  55. X. Jia, H. Lu, M. Yang, Visual tracking via coarse and fine structural local sparse appearance models. IEEE Trans. Image Process. 25(10), 4555–4564 (2016)
  56. N. Wang, S. Li, A. Gupta, D. Y. Yeung, Transferring rich feature hierarchies for robust visual tracking. arXiv:1501.04587 (2015)
  57. H. Possegger, T. Mauthner, H. Bischof, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. In defense of color-based model-free tracking (2015), pp. 2113–2120

Acknowledgements

The authors thank the editor and anonymous reviewers for their helpful comments and valuable suggestions.

Funding

This work was supported by A Project of Shandong Province Higher Educational Science and Technology Program under Grant No. J17KA088 and No. J16LN02, the Natural Science Foundation of Shandong Province under Grant No. ZR2015FL009 and No. ZR2019PF021, the Key Research and Development Program of Shandong Province under Grant No. 2016GGX101023, Scientific Research Fund of Binzhou University under Grant No. 2019ZD03 and Dual Service Projects of Binzhou University under Grant No. BZXYSFW201805.

Author information

HW proposed the study, conducted the experiments, and wrote the manuscript. SZ analyzed the data and revised the manuscript. Both authors read and approved the final manuscript.

Correspondence to Haijun Wang.

Ethics declarations

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.

About this article

Keywords

  • Visual tracking
  • Mutual kernelized correlation filters
  • Elastic net constraint
  • Convolutional neural networks