Mutual kernelized correlation filters with elastic net constraint for visual tracking
EURASIP Journal on Image and Video Processing volume 2019, Article number: 73 (2019)
Abstract
In this paper, we propose a robust visual tracking method based on mutual kernelized correlation filters with an elastic net constraint. First, two correlation filters are trained jointly in a general framework and solved in closed form; the two filters are interrelated and interact with each other. Second, an elastic net constraint is imposed on each discriminative filter, which filters out some interfering features. Third, scale estimation and a target redetection scheme are adopted in our framework to deal with scale variation and tracking failure effectively. Extensive experiments on challenging tracking benchmarks demonstrate that the proposed method obtains competitive tracking performance against other state-of-the-art algorithms.
Introduction
Visual tracking is a fundamental task in computer vision with numerous applications, such as unmanned control systems, surveillance, and driver assistance. Given the position of the tracked object in the first frame, the goal of visual tracking is to estimate the position of the target in the subsequent frames precisely. Although great progress has been made in recent years [1, 2], designing a robust tracking algorithm is still a challenging problem due to negative factors such as background clutter, severe occlusion, motion blur, and illumination variation (see Fig. 1).
Generally speaking, visual tracking methods can be divided into two categories: generative methods [3,4,5,6,7] and discriminative methods [8,9,10,11,12,13]. Generative methods attempt to build a model to represent the tracked target and find the region with the minimum reconstruction error among a large number of candidates. For example, under the particle filter framework, Mei et al. [14] developed a tracking method based on sparse representation, called the ℓ_{1} method, which reconstructs each candidate with dictionary templates and trivial templates. The sparse representation coefficients of each candidate are computed by solving an ℓ_{1} minimization problem. Although the ℓ_{1} method demonstrated impressive tracking performance, its tracking speed is very slow because of its heavy computational load. To solve this problem, Bao et al. [15] proposed a fast ℓ_{1} tracking method using an accelerated proximal gradient approach. Xiao et al. [16] presented a fast object tracking method by solving an ℓ_{2} regularized least squares problem. Wang et al. [17] developed a novel and fast visual tracking method via a probability continuous outlier model. In contrast to generative methods, discriminative algorithms regard visual tracking as a binary classification problem that distinguishes the tracked object from the background. For example, Babenko et al. [18] trained an online discriminative classifier to separate the tracked object from the background by online multiple instance learning. Zhang et al. [19] formulated visual tracking as binary classification via a naive Bayes classifier with an online update scheme in the compressed domain.
In recent years, visual tracking methods based on correlation filters [20,21,22,23,24,25] have attracted great attention due to their real-time tracking speed and robust tracking performance. Under the correlation filter framework, a discriminative classifier is trained with a large number of densely sampled examples. These dense samples have a circulant structure, which allows the use of the fast Fourier transform (FFT). Bolme et al. [26] first developed a minimum output sum of squared error filter for real-time visual tracking. After that, many tracking methods based on correlation filters have been proposed to improve tracking performance. Henriques et al. [27] developed a high-speed tracker with kernelized correlation filters which can deal with multichannel features. Danelljan et al. [28] presented a discriminative scale space tracker with a correlation filter based on a scale pyramid representation. To mitigate the unwanted boundary effect that appears in traditional correlation-based trackers, Danelljan et al. [29] proposed spatially regularized discriminative correlation filters (SRDCF) for visual tracking. Recent research has shown that features from convolutional neural networks (CNN) can improve tracking performance greatly [30,31,32,33]. Zhang et al. [34] built a simple two-layer convolutional network to learn a robust representation for visual tracking without offline training. Ma et al. [35] utilized three convolutional layers to learn robust target appearance for visual tracking. Wang et al. [36] exploited robust target appearance representation from the top layer to lower layers for object tracking. Heng et al. [37] incorporated a recurrent neural network (RNN) into a CNN to improve tracking performance. He et al. [38] integrated weighted convolution responses from 10 layers and achieved very promising performance.
Although correlation filter-based trackers have obtained superior tracking performance, many trackers utilize a single correlation filter and cannot achieve promising tracking results. Figure 2 gives the precision plots and success plots of one-pass evaluation (OPE) for methods with different numbers of correlation filters on OTB2013. Simply merging two correlation filters greatly improves tracking performance in both precision and success rate. However, there is still much room for improvement for methods that use two correlation filters independently of each other.
Inspired by the above discussion, we develop a robust visual tracking method via mutual kernelized correlation filters using features from convolutional neural networks (MKCF_CNN), where each tracker works on its own and tries to correct the other one. At the same time, an elastic net constraint is imposed on each filter, which can eliminate some distracting features. Finally, the proposed tracking framework can be solved in closed form. Extensive experiments demonstrate that our method achieves promising tracking performance that is competitive with other state-of-the-art trackers.
The rest of this paper is organized as follows. Section 2 briefly summarizes the principle of visual tracking based on kernelized correlation filters. Section 3 introduces the proposed tracking algorithm in detail. The experimental results and corresponding discussions are described in Section 4, followed by the conclusion in Section 5.
Visual tracking based on kernelized correlation filters
Henriques et al. [27] proposed a fast discriminative visual tracking method based on kernelized correlation filters (KCF). Given an n × 1 vector x = [x_{1}, x_{2}, …, x_{n}]^{T} denoting a base image, a shifted version of x is defined by {P^{u}x | u = 1} = [x_{n}, x_{1}, …, x_{n − 1}]^{T}, where P is a permutation matrix. The full set of shifted signals of x is given by {P^{u}x | u = 1, 2, …, n − 1}. The data matrix X is then formed from all the cyclic shifts of x and can be diagonalized by the discrete Fourier transform (DFT).
Here, F denotes the DFT matrix, the superscript H denotes the Hermitian (conjugate) transpose, and \( \hat{\mathbf{x}}=\mathcal{F}\left(\mathbf{x}\right) \) is the DFT of the vector x. The goal of KCF is to find a discriminative correlation classifier f(x) over the data matrix X for separating the target object from the surrounding environment. Given the training dataset and their corresponding labels (x_{1}, y_{1}), …, (x_{m}, y_{m}), the discriminative correlation classifier f(x) can be obtained by the following equation,
where λ is the regularization parameter and x_{i} stands for the ith row of the data matrix X. A Gaussian function is adopted to model the label y_{i}. When x_{i} is the centered target, y_{i} is set to 1. For the other cyclic shifts of x_{i} around the centered target, the labels smoothly decay to 0. The solution w can be easily obtained by w = (X^{H}X + λI)^{−1}X^{H}y. In order to get a more powerful model, the kernel trick is introduced into Eq. (2). The new model is rewritten as
where K is an n × n kernel matrix whose elements are K_{ij} = k(x_{i}, x_{j}). Matrix K has a circulant structure and can be diagonalized as
Here, k is the first row of matrix K. The solution α in the dual space can be given by
where I is an identity matrix. Like the data matrix X, the kernel matrix K is circulant. So, the solution of Eq. (3) can be efficiently computed in the frequency domain.
In the next frame, a large number of candidates, denoted as x^{'}, are extracted at the same position as in the current frame. All these candidates x^{'} are obtained from the cyclic shifts of the base image x. The response of these candidates can be computed from
Here, ℱ^{−1} stands for the inverse discrete Fourier transform (IDFT), \( \hat{\mathbf{k}}' \) is the kernel correlation of the candidates x^{'} and the base image x in the frequency domain, and ∘ denotes element-wise multiplication. The candidate with the largest response is chosen as the final target object in the next frame.
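For reference, the relations described in this section take the following standard form in the KCF formulation of [27], written in the notation used above. This is a summary under the assumption that the present paper follows [27] without modification; equation numbers are omitted.

```latex
\begin{align*}
\mathbf{X} &= \mathbf{F}\,\operatorname{diag}(\hat{\mathbf{x}})\,\mathbf{F}^{H}, \\
\min_{\mathbf{w}} &\ \sum_{i}\bigl(f(\mathbf{x}_i)-y_i\bigr)^2+\lambda\lVert\mathbf{w}\rVert_2^2,
  \qquad f(\mathbf{x})=\mathbf{w}^{T}\mathbf{x}, \\
\mathbf{K} &= \mathbf{F}\,\operatorname{diag}(\hat{\mathbf{k}})\,\mathbf{F}^{H}, \\
\boldsymbol{\alpha} &= (\mathbf{K}+\lambda\mathbf{I})^{-1}\mathbf{y},
  \qquad \hat{\boldsymbol{\alpha}}=\frac{\hat{\mathbf{y}}}{\hat{\mathbf{k}}+\lambda}.
\end{align*}
```

The last relation is what makes training cheap: the n × n inverse reduces to an element-wise division of two DFTs.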
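The training and detection steps summarized in this section can be sketched on a one-dimensional signal. The sketch below assumes a linear kernel and NumPy; the function names (`gaussian_labels`, `train`, `detect`) and the label bandwidth are illustrative choices, not part of the original method.

```python
import numpy as np

def gaussian_labels(n, sigma=2.0):
    """Regression targets: 1 at shift 0, decaying smoothly with cyclic distance."""
    shifts = np.arange(n)
    dist = np.minimum(shifts, n - shifts)
    return np.exp(-0.5 * (dist / sigma) ** 2)

def train(x, y, lam=1e-4):
    """Dual coefficients in the Fourier domain: alpha_hat = y_hat / (k_hat + lam)."""
    x_hat = np.fft.fft(x)
    k_hat = np.conj(x_hat) * x_hat  # linear-kernel autocorrelation of the base signal
    return np.fft.fft(y) / (k_hat + lam)

def detect(alpha_hat, x, z):
    """Responses for all cyclic shifts of the candidate signal z."""
    k_hat = np.conj(np.fft.fft(x)) * np.fft.fft(z)  # cross-correlation of x and z
    return np.real(np.fft.ifft(k_hat * alpha_hat))

# Toy usage: the "target" moves by 5 samples between frames.
rng = np.random.default_rng(0)
x = rng.standard_normal(64)
alpha_hat = train(x, gaussian_labels(64))
response = detect(alpha_hat, x, np.roll(x, 5))
print(int(np.argmax(response)))  # index of the peak gives the estimated shift
```

In a real tracker the signals are two-dimensional multichannel feature maps and the FFTs are taken over both spatial axes, but the algebra is the same.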
Methods
Though the KCF method has obtained promising tracking performance, only one discriminative classifier is used in this model, which makes the KCF method unable to deal with complex scenes. To overcome this problem, inspired by ensemble tracking methods, we propose mutual kernelized correlation filters with an elastic net constraint for visual tracking. Extensive experiments show that our method can perform better than the state-of-the-art methods. The flowchart of our proposed tracking framework is shown in Fig. 3.
Problem statement
In order to find the best target object among a large number of candidates, we introduce a linear regressor model in the proposed method.
Here, X has the same definition as in KCF, y denotes the regression labels of X, and w represents the corresponding coefficients. To improve the performance of Eq. (8), the ℓ_{1} norm is adopted to regularize the coefficients w, as in the least absolute shrinkage and selection operator (LASSO) model.
where τ is a constant weight parameter. In Eq. (9), some values of w are set to zero, which excludes some occluded pixels from the model, so the occluded pixels have less effect on the final regression values. However, we find that occluded pixels often cluster together in one region, and Eq. (9) cannot group such pixels with the same features. To overcome this limitation of the LASSO model, an elastic net regularization [39] is introduced in Eq. (9).
Here, λ is a constant weight parameter. The term ‖w‖_{2} is used to group pixels with similar properties. To improve the tracking performance of our method, the kernel trick is exploited in Eq. (10). The candidates are mapped to a high-dimensional feature space φ(x). Then, in the dual space, the solution w is given by a linear combination of the mapped candidates.
Equation (10) in the dual space can be described as
where K represents the kernel matrix. The solution for α involves the squared ℓ_{2} norm and the ℓ_{1} norm simultaneously. To compute α efficiently, another variable β is introduced in Eq. (12).
Here, μ is a constant weight parameter.
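The progression described in this subsection can be written out as follows. This is one consistent reading assembled from the terms named in the text (a squared-error data term, the weights τ, λ, and μ, and the kernel matrix K); the exact scaling of each term is an assumption.

```latex
\begin{align*}
\text{LASSO:}\quad & \min_{\mathbf{w}}\ \lVert \mathbf{X}\mathbf{w}-\mathbf{y}\rVert_2^2+\tau\lVert\mathbf{w}\rVert_1 \\
\text{Elastic net:}\quad & \min_{\mathbf{w}}\ \lVert \mathbf{X}\mathbf{w}-\mathbf{y}\rVert_2^2+\lambda\lVert\mathbf{w}\rVert_2^2+\tau\lVert\mathbf{w}\rVert_1 \\
\text{Dual form:}\quad & \mathbf{w}=\sum_i \alpha_i\,\varphi(\mathbf{x}_i),\qquad
  \min_{\alpha}\ \lVert\mathbf{y}-\mathbf{K}\alpha\rVert_2^2+\lambda\,\alpha^{T}\mathbf{K}\alpha+\tau\lVert\alpha\rVert_1 \\
\text{Split variable:}\quad & \min_{\alpha,\beta}\ \lVert\mathbf{y}-\mathbf{K}\alpha\rVert_2^2+\lambda\,\alpha^{T}\mathbf{K}\alpha+\tau\lVert\beta\rVert_1+\mu\lVert\alpha-\beta\rVert_2^2
\end{align*}
```

Introducing β separates the smooth quadratic terms in α from the non-smooth ℓ_{1} term, so each variable can be updated in closed form while the other is held fixed.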
Mutual kernelized correlation filters
In this part, we introduce mutual kernelized correlation filters based on Eq. (13). The proposed mutual kernelized correlation filters solve the following problem
The first two parts of Eq. (14) force each kernelized correlation filter model to have the minimum squared error with respect to the desired output regression label y. The terms \( \lambda {\alpha}_1^T{\mathbf{K}}_1{\alpha}_1+\lambda {\alpha}_2^T{\mathbf{K}}_2{\alpha}_2 \) denote the elastic net regularization on the two models, respectively. The terms \( \tau {\left\Vert {\beta}_1\right\Vert}_1+\tau {\left\Vert {\beta}_2\right\Vert}_1+\mu {\left\Vert {\alpha}_1-{\beta}_1\right\Vert}_2^2+\mu {\left\Vert {\alpha}_2-{\beta}_2\right\Vert}_2^2 \) are introduced to exclude the occluded pixels in the target object. The term \( 2\rho {\left\Vert {\mathbf{K}}_1{\alpha}_1-{\mathbf{K}}_2{\alpha}_2\right\Vert}_2^2 \) couples the two kernelized correlation filter models and weights their mutual influence.
It is obvious that Eq. (14) is convex with respect to α_{1}, α_{2} if β_{1}, β_{2} are fixed, and vice versa. So, we propose an iterative algorithm to compute the solution α_{1}, α_{2}. Thus, four subproblems with respect to α_{1}, α_{2}, β_{1}, β_{2} are given as follows
Setting the derivative of T_{1} with respect to α_{1} to zero, Eq. (15) can be rewritten as follows:
Rearranging Eq. (19), we obtain
Then, we obtain the solution α_{1}
Setting the derivative of T_{3} with respect to α_{2} to zero, a similar solution α_{2} is given as follows:
Eqs. (16) and (18) are least squares problems with ℓ_{1} norm regularization. Thus, the solutions β_{1} and β_{2} have closed forms, which are given by a soft shrinkage function
By substituting Eq. (4), Eq. (21) can be reformulated as follows:
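A minimal sketch of the soft shrinkage step follows. It assumes each β subproblem has the form min over β of τ‖β‖_{1} + μ‖α − β‖_{2}^{2}, for which the element-wise minimizer is soft thresholding with threshold τ/(2μ); the function names are illustrative.

```python
def soft_shrink(values, threshold):
    """Element-wise soft shrinkage: sign(v) * max(|v| - threshold, 0)."""
    out = []
    for v in values:
        magnitude = abs(v) - threshold
        if magnitude <= 0:
            out.append(0.0)  # small coefficients are set exactly to zero
        else:
            out.append(magnitude if v > 0 else -magnitude)
    return out

def update_beta(alpha, tau, mu):
    """Minimizer of tau*||beta||_1 + mu*||alpha - beta||_2^2 (assumed subproblem form)."""
    return soft_shrink(alpha, tau / (2.0 * mu))

print(update_beta([2.0, -0.25, -3.0], tau=1.0, mu=1.0))
```

Coefficients whose magnitude falls below the threshold are zeroed, which is how the ℓ_{1} term suppresses interfering entries.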
Then, the DFT of α_{1} is found by
In the same way, the DFT of α_{2} is obtained from
Here, k_{2} is the first row of matrix K_{2}.
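The frequency-domain update can be checked numerically against a dense linear solve. The sketch below assumes the α_{1} subproblem is T_{1} = ‖y − K_{1}α_{1}‖² + λα_{1}^{T}K_{1}α_{1} + μ‖α_{1} − β_{1}‖² + 2ρ‖K_{1}α_{1} − K_{2}α_{2}‖², with symmetric circulant kernel matrices built from a linear kernel. Setting the gradient to zero gives ((1 + 2ρ)K_{1}² + λK_{1} + μI)α_{1} = K_{1}y + μβ_{1} + 2ρK_{1}K_{2}α_{2}. This objective and all function names are assumptions made for illustration and may differ in scaling from the paper's equations.

```python
import numpy as np

def autocorr(x):
    """First row of the linear-kernel matrix over all cyclic shifts of x."""
    x_hat = np.fft.fft(x)
    return np.real(np.fft.ifft(np.conj(x_hat) * x_hat))

def circulant(k):
    """Dense circulant matrix whose first row is k."""
    n = len(k)
    idx = (np.arange(n)[None, :] - np.arange(n)[:, None]) % n
    return k[idx]

def alpha1_update_fft(k1, k2, y, beta1, alpha2, lam, mu, rho):
    """Element-wise solve in the Fourier domain (circulant matrices become diagonal)."""
    k1h, k2h = np.fft.fft(k1), np.fft.fft(k2)
    num = k1h * np.fft.fft(y) + mu * np.fft.fft(beta1) \
        + 2 * rho * k1h * k2h * np.fft.fft(alpha2)
    den = (1 + 2 * rho) * k1h * k1h + lam * k1h + mu
    return np.real(np.fft.ifft(num / den))

def alpha1_update_dense(K1, K2, y, beta1, alpha2, lam, mu, rho):
    """Same update via an explicit n x n linear system, for comparison only."""
    A = (1 + 2 * rho) * (K1 @ K1) + lam * K1 + mu * np.eye(len(y))
    b = K1 @ y + mu * beta1 + 2 * rho * (K1 @ (K2 @ alpha2))
    return np.linalg.solve(A, b)
```

The α_{2} update is symmetric, with the roles of the two filters exchanged. Alternating these updates with the shrinkage step for β_{1} and β_{2} gives the iterative scheme described above.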
Model update
To keep the proposed MKCF_CNN model up to date during tracking, an incremental update scheme is adopted,
where η is a constant parameter which controls the learning rate. The subscript t denotes the tth frame. The incremental update strategy can deal with abrupt changes between successive frames.
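An incremental update of this kind is commonly implemented as a running average. The sketch below assumes the usual KCF-style interpolation, (1 − η) times the previous model plus η times the newly estimated one; the function name is illustrative and the exact quantities the paper interpolates are not reproduced here.

```python
def interpolate(previous, current, eta=0.01):
    """Running average: (1 - eta) * previous + eta * current, element-wise."""
    return [(1.0 - eta) * p + eta * c for p, c in zip(previous, current)]

# Works for real or complex (Fourier-domain) coefficients alike.
print(interpolate([1.0, 2.0], [3.0, 2.0], eta=0.5))
```

A small η keeps the model stable against transient occlusion, while a larger η adapts faster to appearance change.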
Target detection
For the kernel correlation filter K_{1}, in the tth frame, a large number of circulant candidates, denoted as \( \mathbf{x}_{1,t}' \), are extracted around the base image x_{1, t − 1}. The base image x_{1, t − 1} is located at the position of the target in the (t − 1)th frame. Because the candidates \( \mathbf{x}_{1,t}' \) have a circulant structure, their responses are given by
In the same way, the responses of the candidates \( \mathbf{x}_{2,t}' \) with respect to the kernel correlation filter K_{2} are obtained by
The maximum values of response1 and response2 are obtained by max(response1(:)) and max(response2(:)), respectively. If max(response1(:)) > max(response2(:)), the final response is equal to max(response1(:)). Otherwise, the final response is equal to max(response2(:)). The best position of the target is obtained according to the final response.
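The selection rule above can be sketched as follows, with response maps represented as nested lists. The function name `fuse_responses` and the returned tuple layout are illustrative assumptions.

```python
def fuse_responses(response1, response2):
    """Pick the response map with the larger peak; return (peak value, (row, col))."""
    def peak(resp):
        best = None
        for r, row in enumerate(resp):
            for c, value in enumerate(row):
                if best is None or value > best[0]:
                    best = (value, (r, c))
        return best

    p1, p2 = peak(response1), peak(response2)
    # Ties go to the second filter, matching the "otherwise" branch in the text.
    return p1 if p1[0] > p2[0] else p2

print(fuse_responses([[0.1, 0.4], [0.2, 0.3]], [[0.0, 0.1], [0.9, 0.2]]))
```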
Convolutional neural network (CNN) features extracted from MatConvNet
Traditional features, such as histogram of oriented gradient (HOG), SIFT, and CN, have achieved promising tracking performance in the past decade. However, these handcrafted features have been superseded by CNN features. In [40], CNN-based representations achieved impressive results on image recognition and object detection. In [35], three convolutional layers (conv3-4, conv4-4, and conv5-4) of the VGG19 model are introduced to visual tracking and demonstrate powerful representation ability. Inspired by [41], we use the conv5-4 and conv4-4 convolutional layers of VGG19 to model the appearance of the target. Features from the conv5-4 layer carry more semantic information and can discriminate the target from a dramatically changing background. Features from the conv4-4 layer retain more spatial detail and can locate the position of the target precisely.
Target recovery
We adopt the EdgeBox method [42] to redetect the target after tracking failures. A large number of object bounding box proposals P_{d} are generated by the EdgeBox method, and these proposals are evaluated under the correlation filter framework to decide the final tracking position. Given the position (x_{t − 1}, y_{t − 1}) of the target in the (t − 1)th frame, a set of bounding box proposals is extracted around the position of the target in the current frame. The position of each bounding box proposal p_{i} is denoted \( \left({x}_t^i,{y}_t^i\right) \) in the tth frame. The maximum response score of each bounding box proposal p_{i} is given by r(p_{i}), which is computed by Eq. (7) using the HOG feature. If the score of the tracking result in the tth frame is smaller than the threshold T_{0}, the tracker is considered to have lost the target and the redetection scheme is triggered. The optimal bounding box proposal in the tth frame is obtained by minimizing the following expression:
where \( L\left({p}_t^i,{p}_{t-1}\right)=\exp \left(-\frac{1}{2{\sigma}^2}{\left\Vert \left({x}_t^i,{y}_t^i\right)-\left({x}_{t-1},{y}_{t-1}\right)\right\Vert}^2\right) \). The term \( L\left({p}_t^i,{p}_{t-1}\right) \) is a motion constraint between two successive frames. α is a constant parameter which controls the balance between the response score and the motion constraint, and σ is the diagonal length of the initial target size.
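The two ingredients of the redetection step that are fully specified in the text, the failure test against T_{0} and the motion constraint L, can be sketched as below. The sketch assumes the exponent of L is negative, so that L equals 1 for a proposal at the previous position and decays with distance (the usual Gaussian form); function names are illustrative, and the way r(p_{i}) and L are combined into the final objective is left out.

```python
import math

def needs_redetection(tracking_score, t0=0.2):
    """Trigger redetection when the tracking score drops below the threshold T0."""
    return tracking_score < t0

def motion_constraint(proposal_xy, previous_xy, sigma):
    """Gaussian motion prior between the proposal and the previous target position."""
    dx = proposal_xy[0] - previous_xy[0]
    dy = proposal_xy[1] - previous_xy[1]
    return math.exp(-(dx * dx + dy * dy) / (2.0 * sigma * sigma))

# A proposal 5 pixels away, with sigma set to a target diagonal of 5 pixels.
print(motion_constraint((13, 14), (10, 10), sigma=5.0))
```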
Scale estimation
Scale estimation is very important for robust tracking. Motivated by [42], we use the EdgeBox method to deal with the scale variation that appears in sequences. Given the size (w_{t − 1}, h_{t − 1}) of the target in the (t − 1)th frame, we use the EdgeBox method to generate multiscale bounding box proposals P_{s} with size sw_{t − 1} × sh_{t − 1} in the current frame and reject the proposals whose intersection over union (IoU) is lower than 0.6 or higher than 0.9. For each accepted scale proposal, we compute the response score under the correlation filter framework. If the maximum response score {r(p_{i}) | p_{i} ∈ P_{s}} is smaller than the response obtained in Section 3.4, we keep the size of the target from the (t − 1)th frame. Otherwise, we update the size of the target by the following equation:
where \( \left({w}_t^{\ast },{h}_t^{\ast}\right) \) is the size of the proposal with the maximum response score. γ is a constant parameter which controls the update rate.
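The proposal filtering and size update can be sketched as follows. Two points are assumptions: the IoU is taken against the previous target box, and the size update is a damped average, (1 − γ) times the previous size plus γ times the best proposal size. Boxes are (x, y, w, h) tuples and all function names are illustrative.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x, y, w, h)."""
    iw = max(0.0, min(a[0] + a[2], b[0] + b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[1] + a[3], b[1] + b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0

def keep_scale_proposals(proposals, previous_box, low=0.6, high=0.9):
    """Reject proposals whose IoU with the previous box is below low or above high."""
    return [p for p in proposals if low <= iou(p, previous_box) <= high]

def update_size(previous_size, best_size, gamma=0.6):
    """Assumed damped update: (1 - gamma) * previous + gamma * best."""
    return tuple((1.0 - gamma) * p + gamma * b
                 for p, b in zip(previous_size, best_size))

previous = (0, 0, 10, 10)
candidates = [(0, 0, 10, 10), (0, 0, 10, 8), (0, 0, 10, 5)]
print(keep_scale_proposals(candidates, previous))
```

Discarding proposals with very high IoU avoids re-scoring boxes that are nearly identical to the current estimate, while the lower bound removes implausibly large scale jumps.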
Results and discussion
In this section, we evaluate our proposed method on three public datasets: OTB2013 [43], TColor128 [44], and DTB70 [45]. The Matlab pseudocode and the tracking pipeline of our MKCF_CNN method are given in Tables 1 and 2, respectively. Extensive experiments demonstrate that our method is able to achieve very appealing performance in terms of effectiveness and robustness.
Experimental setup
The proposed MKCF_CNN method is implemented in MATLAB on a PC equipped with an Intel Xeon E5-2640 v4 CPU, 128 GB of RAM, and a single NVIDIA GeForce GTX 1080Ti. We adopt the pretrained VGGNet19 as our feature extractor and use MatConvNet for feature generation. We train two correlation filters using the outputs of the conv4-4 and conv5-4 layers. The linear kernel is adopted in this paper. The parameters λ, τ, μ, and ρ in (14) are empirically set to 10^{−4}, 10^{−5}, 10^{−4}, and 10^{−3}, respectively. We set the update rate η in (28) and (29) to 0.01 and the weight parameter γ in (33) to 0.6. The tracking failure threshold T_{0} is set to 0.2.
Evaluation metrics
We use two measurements, precision plots and success plots [46], to quantitatively assess the tracking results of our method. Precision plots show the percentage of frames in which the center location error is within a given threshold, which is set to 20 pixels. The center location error is the Euclidean distance between the tracked location and the ground truth. Success plots show the percentage of frames in which the overlap rate S is larger than a fixed threshold T_{1}. The overlap rate S is defined as \( S=\frac{\mathrm{Area}\left({B}_E\cap {B}_G\right)}{\mathrm{Area}\left({B}_E\cup {B}_G\right)} \), where ∩ and ∪ are the intersection and union operators, respectively, B_{E} denotes the estimated bounding box, and B_{G} is the ground-truth bounding box. T_{1} is set to 0.5 in this paper.
To evaluate the tracking performance of our method comprehensively, the challenging videos from OTB2013 and TColor128 are categorized with 11 attributes including background clutter (BC), deformation (DEF), fast motion (FM), in-plane rotation (IPR), illumination variation (IV), low resolution (LR), motion blur (MB), occlusion (OCC), out-of-plane rotation (OPR), out of view (OV), and scale variation (SV).
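The two metrics at their stated thresholds can be computed as in the following sketch. Boxes are assumed to be (x, y, w, h) tuples and centers (x, y) tuples; the function names are illustrative and do not come from any benchmark toolkit.

```python
import math

def overlap(est, gt):
    """Overlap rate S = Area(B_E ∩ B_G) / Area(B_E ∪ B_G) for (x, y, w, h) boxes."""
    iw = max(0.0, min(est[0] + est[2], gt[0] + gt[2]) - max(est[0], gt[0]))
    ih = max(0.0, min(est[1] + est[3], gt[1] + gt[3]) - max(est[1], gt[1]))
    inter = iw * ih
    union = est[2] * est[3] + gt[2] * gt[3] - inter
    return inter / union if union > 0 else 0.0

def precision(est_centers, gt_centers, threshold=20.0):
    """Fraction of frames whose center location error is within the threshold."""
    hits = sum(
        math.hypot(e[0] - g[0], e[1] - g[1]) <= threshold
        for e, g in zip(est_centers, gt_centers)
    )
    return hits / len(gt_centers)

def success_rate(est_boxes, gt_boxes, threshold=0.5):
    """Fraction of frames whose overlap rate exceeds the threshold."""
    hits = sum(overlap(e, g) > threshold for e, g in zip(est_boxes, gt_boxes))
    return hits / len(gt_boxes)
```

A full precision or success plot is obtained by sweeping the threshold over a range and plotting the resulting fractions.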
Comparison of tracking performance on OTB2013
The OTB2013 benchmark dataset contains 51 sequences with 11 challenging attributes. We compare our method with 9 state-of-the-art algorithms, which include deep learning tracking methods (HCFT [35], HDT [47], CNN-SVM [48], DeepSRDCF [49]) and correlation filter tracking methods (MEEM [50], Staple [51], SAMF [52], DSST [28]). Figure 4 gives the precision plots and success plots of OPE of our proposed method against the other state-of-the-art methods on OTB2013. According to Fig. 4, our MKCF_CNN tracker outperforms most of the other trackers, demonstrating the effectiveness of MKCF_CNN. The proposed MKCF_CNN method achieves a 2.3% gain in precision over HCFT, which is the tracking method most closely related to ours. Meanwhile, MKCF_CNN and DeepSRDCF rank first on the success score.
In order to comprehensively assess the tracking performance of our proposed MKCF_CNN tracker, we present tracking results under OPE regarding the 11 attributes in Figs. 5 and 6. We can observe that on the 51 videos with all the 11 challenging attributes, our method ranks first among the 10 evaluated trackers on precision plots. On the videos with attributes such as background clutter, deformation, in-plane rotation, illumination variation, low resolution, and out of view, MKCF_CNN ranks first among all the evaluated trackers on success plots. In the HCFT method, the outputs of the conv3-4, conv4-4, and conv5-4 layers are used as the deep features. In the HDT method, the outputs of six convolutional layers (10th–12th, 14th–16th) from VGGNet19 are adopted as feature maps. In contrast, only two layers (conv4-4, conv5-4) from VGGNet19 are used in our proposed method, and two mutual kernelized correlation filters are trained to interact with each other throughout the tracking process, without the fixed weighting parameters used in HCFT or the fixed initial parameters used in HDT. From Figs. 5 and 6, it is clear that our method performs better than these most relevant methods.
Tracking speed is very important for visual tracking. Correlation filter-based trackers achieve beyond real-time speed using handcrafted features. Because the DFT and inverse DFT dominate the cost, the computational complexity of a tracker with a single correlation filter is O(n log n), where n is the dimensionality of the features. The overall computational load of a tracker built from M such base filters is therefore O(Mn log n); M = 2 in our method and M = 3 in HCFT. For trackers under the correlation filter framework with deep features, the computational burden mainly comes from the feature extraction process. The tracking speed of our proposed method is 1.3 fps, which is slightly faster than HCFT at 1.1 fps.
Comparison of tracking performance on TColor128
The TColor128 dataset consists of 128 challenging color videos and is designed to assess tracking performance on color sequences. Similarly, we compare our proposed MKCF_CNN method with 9 state-of-the-art trackers, including HCFT [35], COCF [41], KCF_GaussianHog [27], SRDCF [29], MUSTER [53], SAMF [52], DSST [28], Struck [54], and ASLA [55]. Figure 7 shows the precision plots and success plots of OPE of our proposed method against the other state-of-the-art methods on TColor128. Figures 8 and 9 present the precision plots and success plots of OPE with different attributes on TColor128, respectively. Our method is the best among the ten trackers on TColor128, with a precision rate of 73.5% and a success rate of 63.1%; HCFT and COCF rank second and third, respectively. Although HCFT utilizes deep features from three layers, its performance is not better than that of our method. COCF uses the same outputs as our method from two layers of VGGNet19, and it performs worse than our MKCF_CNN tracker. This is because the scale estimation and redetection schemes are able to locate the target precisely in our method. Figures 8 and 9 demonstrate the effectiveness of our method on TColor128 with 11 challenging attributes, where our method performs best against the 9 other methods. Table 3 gives a comparison of the success rates of 8 trackers. The experimental results show that our method achieves the best performance under all challenging attributes except for scale variation.
Figure 10 shows some tracking results on two sequences with severe occlusion. In the Lemming video, the toy Lemming is severely occluded by a triangular ruler while it is moving (e.g., #320, #340). The proposed method, SAMF, Struck, and OAB are robust to severe occlusion and can track the Lemming target steadily. In the skating2 sequence, the female dancer being tracked undergoes obvious appearance variation and is occasionally totally occluded by the male dancer while they are skating (e.g., #150, #250). We can observe that the proposed method, HCFT, and COCF, which use deep features, are able to deal with the severe occlusion and appearance variation effectively.
Figure 11 demonstrates some screenshots of two videos with fast motion. In the Soccer sequence, the player target keeps jumping and undergoes fast motion, background clutter, and occlusion when celebrating the victory (e.g., #36, #76, #170). IVT, Struck, CSK, ASLA, and OAB lose the target completely because of the challenging interference factors. The target in the Biker sequence undergoes fast motion and scale variation because of fast riding (e.g., #10, #100, #200). It can be easily seen that our method performs well in the entire sequence and is able to deal with motion blur and scale variation effectively.
Figure 12 illustrates some sampled tracking results of two sequences with appearance variation. The appearance of the target in the Surfing sequence changes severely when the player is going surfing (e.g., #100, #125). From the tracking results, we can see that most of the trackers are able to locate the target coarsely. However, only our method has the ability to track the target more precisely. In the Bikeshow sequence, the biker cycles in the square with severe appearance variation and scale change (e.g., #20, #120, #361). The proposed method, HCFT and COCF utilizing deep features, handle appearance change better than the other methods with handcrafted features.
Figure 13 shows some tracking results on two sequences with background clutter. The target in the Board sequence moves in complex scenes with severe background clutter (e.g., #160, #300, #400). It can be seen that our method tracks the target successfully throughout the sequence. In the Torus sequence, the target moves in a cluttered room with slight appearance variation (e.g., #100, #200, #220). We can observe that trackers with handcrafted features cannot deal with this situation and drift away to other objects.
Figure 14 shows some screenshots of tracking results on two sequences with illumination variation. In the Shaking video, a guitarist is playing on a stage with dim lights (e.g., #100, #200, #300). Although the target undergoes severe illumination variation, our method locates the target more precisely than the other trackers. In the Singer2 sequence, the singer in dark clothes performing on the stage undergoes drastic illumination variation (e.g., #110, #210, #320). We can observe that HCFT and COCF with deep features drift away from the target as a result of the drastic illumination variation. Only our method is able to persistently track the target through the whole sequence.
Comparison of tracking performance on DTB
The DTB dataset consists of 70 challenging videos captured by a camera mounted on an unmanned aerial vehicle (UAV). All of the 70 sequences in the DTB dataset were manually annotated with 11 challenging attributes, including motion blur (MB), scale variation (SV), similar objects around (SOA), aspect ratio variation (ARV), background cluttered (BC), occlusion (OCC), out-of-view (OV), deformation (DEF), out-of-plane rotation (OPR), fast camera motion (FCM), and in-plane rotation (IPR). We compare our method with 9 representative trackers including HCFT [35], HDT [47], COCF [41], MEEM [50], SODLT [56], SRDCF [29], KCF [27], DAT [57], and DSST [28]. Figure 15 shows the overall tracking performance of OPE based on precision score and success score on the DTB dataset. We can see that the proposed tracker achieves the best tracking performance against the 9 other trackers.
Ablation study
Effect of mutual kernelized correlation filters
In order to demonstrate the effectiveness of mutual correlation filters, we investigate the tracking performance of our proposed method with mutual correlation filters and without mutual correlation filters on OTB2013. Figure 16 gives the precision plots and success plots of OPE by different settings. Our method with mutual correlation filters achieves a score of 0.914 in terms of precision and the precision performance is improved by 0.9% compared with the method without mutual correlation filters. In success plots, owing to the interaction of mutual correlation filters, the tracking performance is improved by 2.0%. Figures 17 and 18 show the tracking results on OTB2013 with 11 challenging attributes. It is obvious that our method with mutual correlation filters achieves better tracking performance in all the 11 attributes in both the average precision score and average success rate.
Effect of elastic net constraint
Figure 19 gives the tracking results on OTB2013 for our method with and without the elastic net constraint in terms of precision and success rate. We can observe that the proposed method with the elastic net constraint performs slightly better than the method without it. Table 4 shows the tracking results on OTB2013 with 11 challenging attributes. It is clear that our proposed method with the elastic net constraint obtains better performance than the method without it in terms of IPR, OCC, SV, OPR, and IV.
Effect of scale estimation
In this section, we investigate the tracking performance with scale estimation scheme and without scale estimation scheme. Experimental results conducted on OTB2013 are demonstrated in Figs. 20 and 21. The first picture in Fig. 20 shows the comparison of success plots of OPE on OTB2013 and the second picture in Fig. 20 gives the success plots of OPE in terms of scale variation. Figure 21 shows the average success rate of our proposed method with scale estimation scheme and our method without scale estimation scheme in terms of 11 challenging attributes on OTB2013. It can be seen that the scale estimation mechanism is able to improve the tracking performance greatly.
Effect of redetection module
In this section, we compare the tracking performance with and without the redetection module on OTB2013. The first picture in Fig. 22 shows the comparison of success plots of OPE on OTB2013, and the second picture in Fig. 22 gives the success plots of OPE in terms of occlusion. The redetection module is able to recover the target in case of tracking failures. Table 5 gives the tracking results on OTB2013 in terms of 11 challenging attributes. The best tracking results are shown in red. It is clear that our method with the redetection module achieves better tracking results in almost all of the 11 attributes, except for LR and DEF.
Summary and conclusion
In this paper, we propose a novel visual tracking method based on mutual kernelized correlation filters with an elastic net constraint. The proposed algorithm trains two interacting discriminative classifiers to cope with challenging environments and severe appearance variation. The elastic net constraint is imposed on the mutual kernelized correlation filters to group similar features and to alleviate the impact of outliers. Scale adaptation and a redetection scheme are applied in our method to improve tracking performance. Extensive experimental results demonstrate that our proposed method obtains appealing tracking performance by using the interacting kernelized correlation filters with the elastic net constraint. Quantitative and qualitative results show the superiority of our method in terms of effectiveness and robustness, compared with other tracking algorithms.
Availability of data and materials
Data sharing not applicable to this article as no datasets were generated or analyzed during the current study.
Abbreviations
BC: Background clutter
CN: Color name
CNN: Convolutional neural networks
DEF: Deformation
DFT: Discrete Fourier transform
ENC: Elastic net constraint
FFT: Fast Fourier transform
FM: Fast motion
HOG: Histogram of oriented gradients
IPR: In-plane rotation
IV: Illumination variation
KCF: Kernelized correlation filters
LASSO: Least absolute shrinkage and selection operator
MB: Motion blur
OCC: Occlusion
OPR: Out-of-plane rotation
LR: Low resolution
OV: Out of view
RD: Redetection
RNN: Recurrent neural network
SIFT: Scale-invariant feature transform
SRDCF: Spatially regularized discriminative correlation filters
SV: Scale variation
Acknowledgements
The authors thank the editor and anonymous reviewers for their helpful comments and valuable suggestions.
Funding
This work was supported by a Project of Shandong Province Higher Educational Science and Technology Program under Grant Nos. J17KA088 and J16LN02, the Natural Science Foundation of Shandong Province under Grant Nos. ZR2015FL009 and ZR2019PF021, the Key Research and Development Program of Shandong Province under Grant No. 2016GGX101023, the Scientific Research Fund of Binzhou University under Grant No. 2019ZD03, and the Dual Service Projects of Binzhou University under Grant No. BZXYSFW201805.
Author information
Contributions
HW proposed the study, conducted the experiments, and wrote the manuscript. SZ analyzed the data and revised the manuscript. Both authors read and approved the final manuscript.
Ethics declarations
Competing interests
The authors declare that they have no competing interests.
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
About this article
Cite this article
Wang, H., Zhang, S. Mutual kernelized correlation filters with elastic net constraint for visual tracking. J Image Video Proc. 2019, 73 (2019). https://doi.org/10.1186/s13640-019-0474-z
Keywords
 Visual tracking
 Mutual kernelized correlation filters
 Elastic net constraint
 Convolutional neural networks