Adaptive visual target tracking algorithm based on classified-patch kernel particle filter
EURASIP Journal on Image and Video Processing volume 2019, Article number: 20 (2019)
Abstract
We propose a high-performance visual target tracking (VTT) algorithm based on a classified-patch kernel particle filter (CKPF). Novel features of this VTT algorithm include sparse representations of the target template using the label-consistent K-singular value decomposition (LC-KSVD) algorithm; a Gaussian kernel density particle filter to facilitate candidate template generation and likelihood matching score evaluation; and an occlusion detection method using the anti-occlusion sparse coefficient histogram (ASCH). Experimental results validate the superior performance of the proposed tracking algorithm over state-of-the-art visual target tracking algorithms in scenarios that include occlusion, background clutter, illumination change, target rotation, and scale changes.
Introduction
Visual target tracking (VTT) [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15] is a key enabling technology for numerous emerging computer vision applications including video surveillance, navigation, human-computer interaction, augmented reality, higher-level scene understanding, and action recognition, among many others. It is a challenging task because the visual observations often suffer from interference due to occlusion, scale and shape variation, illumination variation, background clutter, and related factors.
VTT differs from conventional tracking tasks in that the observation at each time instant is a video frame and the motion trajectory is confined to the spatial coordinates in each frame. On the other hand, like conventional tracking, a VTT algorithm is divided into a prediction phase and an update phase. In the prediction phase, a motion model is incorporated to predict the target location based on the current estimate. In the update phase, a maximum likelihood (ML) estimate of the target location is sought based on observations made in the current frame. Then, an updated target position is decided based on the predicted location and the ML-estimated position. These location predictions and estimations are traditionally realized using sequential Bayesian estimation algorithms such as Kalman filters or particle filters.
Based on how the ML estimation of the target location is realized, current VTT algorithms may be categorized into two families: discriminative algorithms and generative algorithms [5]. Discriminative methods detect the presence of a tracked object using a pattern classification approach whose objective is to distinguish the foreground target from the background. For example, the multiple instance learning (MIL) [6] method puts all ambiguous positive and negative samples into bags to learn a discriminative model for visual target tracking. Generative methods detect the tracked object by searching for the region that most resembles the target model, based on templates or subspace models. In [7], a robust fragments-based tracking method is proposed to handle partial occlusions or pose changes. Every patch votes on the possible positions and scales of the target in the current frame by comparing its intensity histogram against the corresponding histogram of each image patch. However, a static appearance model of the target cannot adapt to rapid appearance changes of the target. The incremental learning visual tracking (IVT) algorithm [8] handles the problem of changing target appearance. In the template update process, a forgetting factor is introduced to ensure that less modeling power is wasted fitting older observations. The visual tracking decomposition (VTD) algorithm [9] is proposed to handle appearance and motion changes of the target that occur at the same time. In the tracking process, the observation model is decomposed into multiple basic observation models that can cover different specific target appearances. The motion model is also represented by combining multiple basic models that cover different motion types. These two types of basic models are then used to construct multiple basic trackers, each of which handles a certain change of the target.
Tracking algorithms based on the sparse model have attracted great interest lately. Mei et al. [10, 11] formulated visual target tracking as a sparse approximation problem in the particle filtering (PF) framework [12, 13]. Using a dictionary of image patches, the target template can be represented as a weighted linear combination of very few (hence sparse representation) image templates in the dictionary. The sparse representation can be estimated by solving an l_{1}-norm regularized least squares (LS) problem. In [14], a real-time robust l_{1} tracker is proposed by adding an l_{2}-norm regularization to the coefficients associated with the trivial templates, and an accelerated proximal gradient (APG) method is employed to speed up the solution. Multi-task tracking (MTT) is proposed in [15] as a multi-task sparse learning problem in a PF framework. The particles are modeled as linear combinations of dictionary image templates, and the interdependencies between particles are exploited to improve the tracking performance. In [5], an adaptive structural local sparse appearance model is proposed to locate the target more accurately by considering the spatial information of the target based on an alignment-pooling method. Moreover, incremental subspace learning and sparse representation are combined to update the template, which can adapt to the appearance change of the target with less possibility of drifting. For targets that exhibit dramatic appearance changes, a collaborative model is proposed in [16] that combines a sparsity-based discriminative classifier and a sparsity-based generative model. With this appearance model, both holistic updates and local representations are considered. Moreover, the latest observations and the original template are used to update the model and adapt to the appearance change while mitigating the drift problem.
Most dictionaries based on sparse representation theory are constructed directly from the samples of the template base or obtained by a clustering method with some constraints. The image templates in such dictionaries often lack discriminative ability. Moreover, templates updated by a single fixed update scheme cannot adapt to changes in the foreground and the background of the target. To address these concerns, in this work, we propose an adaptive visual target tracking algorithm based on a classified-patch kernel particle filter (CKPF), which has the following advantages:

(a)
Classified patches and a low-dimensional dictionary are used in the CKPF. The low-dimensional dictionary and the classification parameters (CP) are learned by the label-consistent K-SVD (LC-KSVD) [17, 18] technique. To the best of our knowledge, this is the first work to extend the LC-KSVD approach to exploit the intrinsic structure among the patches of the visual target. The image patches in the dictionary trained using LC-KSVD are more discriminative in separating the foreground from the background, and the resulting low-dimensional dictionary reduces the computational burden.

(b)
The anti-occlusion sparse coefficient histograms (ASCHs) [16] are merged into the CKPF to improve robustness to occlusion. If the reconstruction error of a patch is larger than the threshold, the patch is marked as occluded, and the corresponding sparse coefficients are replaced with zeros to reduce their negative influence.

(c)
The Gaussian kernel density (GKD) of the learned patches is considered to make the proposed algorithm more stable. The reason is that the importance of each patch within the candidate template is weighted according to its distance from the center of the template.

(d)
An adaptive template update scheme is developed to adapt to target appearance changes, improving the robustness of the tracker. This is needed because the appearance of the target often changes significantly due to illumination changes, occlusion, rotation, and scale variation. When the target is occluded, the arrived template usually cannot describe the real target effectively, so the weight of the arrived template should decrease. Otherwise, the weight should increase, because the arrived template is an accurate estimate when no other disturbance is present.
Our proposed visual target tracker differs from existing approaches [10,11,12,13,14,15,16] in several aspects, such as the dictionary learning of the local image patches by LC-KSVD, the likelihood model construction of the candidate particles, and the design of the adaptive parameter for the template update. The main contributions of this paper are threefold. (a) Classification parameters and low-dimensional patches are learned by LC-KSVD to construct the CKPF. (b) An isotropic Gaussian kernel density of the patches is proposed to produce the mixture likelihood of each candidate particle. (c) An adaptive template update scheme is proposed to adapt to target appearance changes.
The remainder of this paper is organized as follows. In Section 2, we describe the proposed adaptive visual target tracking algorithm based on CKPF. An overview of LC-KSVD is presented, and the adaptive template update scheme is developed and discussed. In Section 3, extensive simulation results comparing our proposed algorithm against existing visual target trackers are reported and the implications of these results are discussed. Conclusions and future work are presented in Section 4.
Methods
Overview of the algorithm
As shown in Fig. 1, the target is represented as a rectangular template in each frame, and the target template is scaled to 32 × 32 pixels (big red boxes on the right). The candidate target region is scaled at the same ratio before further processing. Pixels within the template are assumed to be positive samples of the target. A 4-pixel-wide strip surrounding the template is defined as the background, whose edges outside and inside of the target template are denoted by B1 and B2, i.e., the gray annular area (B2 − B1) with a width of 8 pixels is the background area. A patch is defined as a 6 × 6 square. N_{p} = 196 image patches (positive samples) are extracted from the template (foreground, target), and N_{n} = 196 patches are extracted from the background as negative samples. These extracted image patches are regularly distributed over the foreground or the background region, respectively, with overlaps as needed. Together, the positive-labeled patches represent the target and the negative-labeled patches represent the background.
Each patch is raster-scanned and converted into a 36 × 1 vector. Hence, there are 196 vectors labeled with +1 (positive samples) and 196 vectors labeled with 0 (negative samples). We denote the total number of patches by N = 392. The label-consistent K-singular value decomposition (LC-KSVD) algorithm is applied to both the 196 positive vectors and the 196 negative vectors and selects a subset of 50 vectors from each of them to form a labeled dictionary. This dictionary consists of 50 vectors with positive (+1) labels and 50 vectors with negative (0) labels. With K = 100, the dictionary may be represented by a 36 × K matrix D. The dictionary is estimated from the initial frame, where the target to be tracked is specified for the tracking algorithm. It remains unchanged until a template update operation is performed.
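The patch extraction and raster scanning described above can be sketched as follows. This is only an illustration: the sampling stride is not stated in the text, and a stride of 2 pixels is assumed here because it is the value that yields 14 × 14 = 196 patches of size 6 × 6 on a 32 × 32 template. The function name `extract_patches` and the toy template are hypothetical.

```python
def extract_patches(image, patch=6, stride=2):
    """Slide a patch x patch window over a square image and raster-scan each window."""
    size = len(image)
    vectors = []
    for top in range(0, size - patch + 1, stride):
        for left in range(0, size - patch + 1, stride):
            # Raster scan: row by row, left to right, giving a 36-element vector
            vec = [image[top + r][left + c] for r in range(patch) for c in range(patch)]
            vectors.append(vec)
    return vectors

# Toy 32 x 32 "template" whose pixel value encodes its position
template = [[32 * r + c for c in range(32)] for r in range(32)]
patches = extract_patches(template)
labels = [1] * len(patches)  # foreground patches carry the +1 label
print(len(patches), len(patches[0]))
```

Background patches would be gathered in the same way from the surrounding strip and labeled 0.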
The LC-KSVD algorithm also yields a sparse representation of each patch (36 × 1 vector) as a weighted combination of the 100 vectors selected in the dictionary. Two constraints are imposed on the potential sparse representations: (a) (discriminative constraint) Sparse vectors corresponding to foreground (or background) patches should have similar representations. This is represented by a K × K discriminative parameter matrix A. (b) (classification constraint) Class labels (+1, 0) can be reproduced from a weighted linear combination of the sparse representation. This is represented by a 2 × K classification parameter matrix W. In addition to the sparse representation of each foreground and background patch, represented by a K × N matrix X, the LC-KSVD algorithm can estimate the dictionary D, the discriminative parameter matrix A, and the classification parameter matrix W simultaneously.
Given the dictionary D and the sparse representation of the template X, tracking begins by moving into the next frame. A kernel particle filter is applied to generate 100 potential target positions at the (k + 1)th frame according to the particle representation of the state transition probability p(x_{k + 1} | x_{k}) such that E(x_{k + 1} | x_{k}) = x_{k}, where x_{k} = {x_{k}, y_{k}, θ_{k}, s_{k}, α_{k}, β_{k}} is the state vector of the target at the kth frame. The assumption is that the target motion may be described by an affine transformation: (x_{k}, y_{k}) is the target position, and θ_{k}, s_{k}, α_{k}, and β_{k} are the rotation angle, the scaling factor, the aspect ratio, and the angle of inclination, respectively. We also assume that p(x_{k + 1} | x_{k}) has a Gaussian distribution whose covariance matrix is selected based on prior knowledge of the tracking task.
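As a minimal sketch of this sampling step, the snippet below draws 100 particles around the current state, assuming independent Gaussian perturbations of the six affine parameters. The standard deviations in `sigmas`, the example state, and the function name `sample_particles` are hypothetical values chosen for illustration; in practice they would reflect prior knowledge of the tracking task.

```python
import random

def sample_particles(state, sigmas, n_particles=100, seed=0):
    """Draw particles from a Gaussian centered on the current state.

    Each affine parameter is perturbed independently, which corresponds
    to a diagonal covariance matrix.
    """
    rng = random.Random(seed)
    return [
        [rng.gauss(mean, sigma) for mean, sigma in zip(state, sigmas)]
        for _ in range(n_particles)
    ]

# State order follows the text: (x, y, theta, s, alpha, beta)
x_k = [120.0, 80.0, 0.0, 1.0, 1.0, 0.0]
sigmas = [4.0, 4.0, 0.01, 0.01, 0.002, 0.001]  # hypothetical standard deviations
particles = sample_particles(x_k, sigmas)
mean_x = sum(p[0] for p in particles) / len(particles)
```

Because the perturbations have zero mean, the sample mean of the particles stays close to the current state, in line with E(x_{k + 1} | x_{k}) = x_{k}.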
Each particle corresponds to a candidate target template. From each candidate, 196 image patches are extracted, and the corresponding sparse representation X’ is evaluated using LC-KSVD and the dictionary D. A kernel density weighted sparse coefficient similarity score (SCSS) is then applied to produce an estimate of the likelihood between the sparse representation of the template X and that of the current template candidate X’. The kernel density weighting places more weight on image patches that are closer to the center of the template and less weight on image patches at the periphery of the template. The location of the best-matched template is designated as the new target position.
Before moving into the next frame, the tracking algorithm may also adaptively update the template when occlusion of the target is detected. This is accomplished by using a sparse coefficient histogram matrix (SCHM) [16] to estimate the level of occlusion of the target. In that case, the algorithm uses the newly estimated template, or a weighted linear combination of the estimated template and the initial template, depending on the percentage of patches that are deemed occluded. With the newly updated template, the algorithm moves to the following frame.
A block diagram summarizing the above overview of the proposed algorithm is depicted in Fig. 2. It has an initialization phase where a low-dimensional label-consistent dictionary D of image patches is estimated, and the sparse representation X as well as the classification parameters W of individual patches are also computed. Next, the kernel density-based particle filter (KPF) algorithm generates candidate templates in the following frame. For each candidate template, the likelihood score is evaluated, and the maximum likelihood estimate of the target position is computed. This is followed by an adaptive template update phase where occlusion of the target is detected.
Theoretical backgrounds
LC-KSVD
The LC-KSVD dictionary learning algorithm [17, 18] in Fig. 2 can simultaneously train an overcomplete low-dimensional dictionary and a linear classifier, i.e., the obtained dictionaries have both reconstructive and discriminative abilities. The objective function is expressed as
\[ \left\langle D,W,A,X\right\rangle =\arg \underset{D,W,A,X}{\min }{\left\Vert Y-DX\right\Vert}_2^2+\alpha {\left\Vert Q-AX\right\Vert}_2^2+\beta {\left\Vert H-WX\right\Vert}_2^2\kern1em \mathrm{s.t.}\kern0.5em \forall i,\ {\left\Vert {x}_i\right\Vert}_0\le {T}_0 \] (1)
where \( Y={\left\{{y}_i\right\}}_{i=1}^N\in {\mathrm{R}}^{n\times N} \) denotes the input sample set, X = [x_{1}, x_{2}, ⋯, x_{N}] ∈ R^{K × N} denotes the coefficient matrix, D = [d_{1}, d_{2}, ⋯, d_{K}] ∈ R^{n × K} denotes the low-dimensional dictionary matrix containing K ≪ N prototype sample-atoms as columns \( {\left\{{d}_j\right\}}_{j=1}^K \), and T_{0} denotes the degree of sparsity. Q ∈ R^{K × N} denotes the sparse codes with discriminative power of Y for classification. A is a linear transformation matrix, which transforms the original sparse codes to be most discriminative in the sparse feature space. \( {\left\Vert Q-AX\right\Vert}_2^2 \) denotes the discriminative sparse code error, which forces samples with the same class label to have similar sparse representations. \( {\left\Vert H-WX\right\Vert}_2^2 \) denotes the classification error, W is the classification parameter matrix, and H contains the class labels of the input samples. α and β are scalars controlling the relative contribution of the corresponding terms [18].
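The K-SVD method [19] can be used to obtain the solutions for all the parameters simultaneously. Specifically, Eq. (1) can be rewritten as
\[ \left\langle D,W,A,X\right\rangle =\arg \underset{D,W,A,X}{\min }{\left\Vert \left(\begin{array}{c}Y\\ {}\sqrt{\alpha }Q\\ {}\sqrt{\beta }H\end{array}\right)-\left(\begin{array}{c}D\\ {}\sqrt{\alpha }A\\ {}\sqrt{\beta }W\end{array}\right)X\right\Vert}_2^2\kern1em \mathrm{s.t.}\kern0.5em \forall i,\ {\left\Vert {x}_i\right\Vert}_0\le {T}_0 \] (2)
Let \( {Y}_{\mathrm{new}}={\left({Y}^{\mathrm{T}},\sqrt{\alpha }{Q}^{\mathrm{T}},\sqrt{\beta }{H}^{\mathrm{T}}\right)}^{\mathrm{T}} \) and \( {D}_{\mathrm{new}}={\left({D}^{\mathrm{T}},\sqrt{\alpha }{A}^{\mathrm{T}},\sqrt{\beta }{W}^{\mathrm{T}}\right)}^{\mathrm{T}} \); then Eq. (2) can be expressed as
\[ \left\langle {D}_{\mathrm{new}},X\right\rangle =\arg \underset{D_{\mathrm{new}},X}{\min }{\left\Vert {Y}_{\mathrm{new}}-{D}_{\mathrm{new}}X\right\Vert}_2^2\kern1em \mathrm{s.t.}\kern0.5em \forall i,\ {\left\Vert {x}_i\right\Vert}_0\le {T}_0 \] (3)
Then D_{new} can be obtained by using the K-SVD method, i.e., D, A, and W are learned simultaneously. Further details of LC-KSVD can be found in [17, 18].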
In Eq. (1), the constraint terms make the learned dictionary better suited to representing the target. The discriminative sparse code error forces samples of the same class to have similar sparse representations, which enlarges the difference between classes of training data. Moreover, the classification error term trains a classifier that separates the foreground of the target from the background.
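The equivalence between the three-term objective and the stacked form built from Y_{new} and D_{new} can be checked numerically. The sketch below uses small random matrices in place of real patches, codes, and labels, and hypothetical weights α = 4 and β = 2; it only verifies the algebraic identity and does not perform any dictionary learning. The helper names (`matmul`, `sq_err`, `scale`, `rand`) are illustrative.

```python
import math
import random

def matmul(A, B):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def sq_err(A, B):
    """Sum of squared entries of A - B."""
    return sum((a - b) ** 2 for ra, rb in zip(A, B) for a, b in zip(ra, rb))

def scale(M, s):
    return [[s * v for v in row] for row in M]

def rand(rows, cols, rng):
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
n, K, N = 4, 6, 5        # toy sizes; the text uses n = 36, K = 100, N = 392
alpha, beta = 4.0, 2.0   # hypothetical weights
Y, D, X = rand(n, N, rng), rand(n, K, rng), rand(K, N, rng)
Q, A = rand(K, N, rng), rand(K, K, rng)
H, W = rand(2, N, rng), rand(2, K, rng)

# Stack the blocks row-wise, following the definitions of Y_new and D_new
Y_new = Y + scale(Q, math.sqrt(alpha)) + scale(H, math.sqrt(beta))
D_new = D + scale(A, math.sqrt(alpha)) + scale(W, math.sqrt(beta))

separate = (sq_err(Y, matmul(D, X))
            + alpha * sq_err(Q, matmul(A, X))
            + beta * sq_err(H, matmul(W, X)))
stacked = sq_err(Y_new, matmul(D_new, X))
print(math.isclose(separate, stacked))
```

The square roots on α and β are what make the squared error of the stacked system reproduce the original weights.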
Sparse coefficient histogram and occlusion detection
The patches of the target can be represented using the obtained low-dimensional dictionary D, and the sparse coefficients of each patch can be used to construct the histogram matrix. However, some patches in the candidate target may be occluded, in which case the coefficient histogram cannot express the features of the candidate target accurately and the target cannot be estimated accurately. To address this problem, the occlusion detection strategy of [16] is employed, based on the reconstruction error of each patch. The sparse coefficient histogram is then updated according to the occlusion detection results.
Assume that ξ_{i} denotes the sparse coefficient vector of the ith patch, we have
The sparse coefficient histogram matrix is established by concatenating the sparse coefficient vectors ξ_{i}, i.e., \( \rho =\left[{\xi}_1,{\xi}_2,\dots, {\xi}_{N_p}\right] \). If the target is partially occluded, then some of the patches of the target are occluded, and their corresponding sparse coefficients become meaningless. As a result, the sparse coefficient matrix ρ cannot express the candidate target well, which causes a large reconstruction error. Therefore, we introduce an occlusion detection mechanism to identify the occluded patches and their corresponding sparse coefficients.
A patch is marked as occluded if its reconstruction error exceeds a threshold, and the corresponding sparse coefficient vector is then reset to zero. The candidate histogram matrix after occlusion detection is defined as φ = ρ ⊙ o, where ⊙ denotes element-wise multiplication, \( o\in {R}^{\left({K}_p+{K}_n\right)\times {N}_p} \) denotes the occlusion detection matrix, and o_{i} denotes the elements of o associated with the ith patch, defined as:
\[ {o}_i=\left\{\begin{array}{ll}1,& {\varepsilon}_i<{\varepsilon}_0\\ {}0,& \mathrm{otherwise}\end{array}\right. \] (5)
where \( {\varepsilon}_i={\left\Vert {y}_i-{D}_t{\xi}_{i\_t}\right\Vert}_2^2 \) denotes the reconstruction error of the ith patch. Note that only the positive patches are used to compute the reconstruction error; therefore, D_{t} denotes the dictionary that consists only of the positive patches from the learned dictionary D, ξ_{i _ t} denotes the corresponding sparse coefficient vector with respect to D_{t}, and ε_{0} denotes the threshold on the reconstruction error of each patch. If ε_{i} ≥ ε_{0}, then the ith patch is considered occluded and the corresponding coefficient vector is set to zero.
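The masking step can be illustrated with a toy coefficient matrix. In the sketch below, the reconstruction errors are given directly rather than computed from a dictionary, and the threshold value 0.04 is a hypothetical choice; the function names are likewise illustrative.

```python
def occlusion_mask(errors, threshold):
    """Return 1 for patches that are kept (error below threshold) and 0 for occluded ones."""
    return [1 if e < threshold else 0 for e in errors]

def apply_mask(rho, mask):
    """Zero the coefficient vector (column) of every occluded patch."""
    return [[v * m for v, m in zip(row, mask)] for row in rho]

# Toy example: 3 dictionary atoms (rows) x 4 patches (columns)
rho = [[0.5, 0.1, 0.0, 0.3],
       [0.2, 0.7, 0.4, 0.0],
       [0.0, 0.0, 0.6, 0.9]]
errors = [0.01, 0.20, 0.03, 0.50]              # reconstruction error of each patch
mask = occlusion_mask(errors, threshold=0.04)  # hypothetical threshold
phi = apply_mask(rho, mask)
occlusion_degree = mask.count(0) / len(mask)   # fraction of occluded patches
```

Here the second and fourth patches exceed the threshold, so their columns in `phi` are zeroed and half of the patches are counted as occluded.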
Classified-patch kernel particle filter
Given the observation set of the target y_{1 : k} = {y_{1}, y_{2}, … , y_{k}} up to the kth frame, the target state x_{k} can be extracted via maximum a posteriori estimation, i.e., \( {\widehat{x}}_k=\arg \underset{x_k^i}{\max }p\left({x}_k^i|{y}_{1:k}\right) \), where \( {x}_k^i \) denotes the state of the ith sampled particle of the kth frame. The posterior probability \( p\left({x}_k^i|{y}_{1:k}\right) \) can be inferred by the Bayesian recursion, i.e.,
\[ p\left({x}_k|{y}_{1:k}\right)\propto p\left({y}_k|{x}_k\right)\int p\left({x}_k|{x}_{k-1}\right)p\left({x}_{k-1}|{y}_{1:k-1}\right)d{x}_{k-1} \] (6)
where p(y_{k} | x_{k}) denotes the observation model and p(x_{k} | x_{k − 1}) denotes the dynamic model, which describes the temporal correlation of the target states between consecutive frames. The affine transformation with six parameters is utilized to model the target motion between two consecutive frames. The state transition is formulated as p(x_{k} | x_{k − 1}) = N(x_{k}; x_{k − 1}, Σ), where Σ is a diagonal covariance matrix whose elements are the variances of the affine parameters.
The observation model p(y_{k} | x_{k}) denotes the likelihood of the observation y_{k} at state x_{k}. It plays an important role in robust tracking. In this paper, we aim to construct a robust likelihood model with anti-occlusion and foreground target identification abilities by merging the similarity of sparse coefficient histograms [16] and the classification information. Moreover, we consider the spatial information of each patch by using the isotropic Gaussian kernel density, which improves the stability of the proposed algorithm for visual target tracking.
The likelihood of the lth particle is expressed as
where \( {M}_{k,i}^l \) and \( {L}_{k,i}^l \) denote the likelihood of classification and the similarity function of the target histograms between the candidate and the template, respectively. \( k\left({\left\Vert \frac{y_k^l-{c}_i}{h}\right\Vert}^2\right) \) denotes the isotropic Gaussian kernel density, where c_{i} denotes the center of the ith patch, and \( {y}_k^l \) denotes the center of the lth particle in the kth frame. In other words, the distance between each patch and the center of the candidate particle is taken into account: patches far away from the center of the target are assigned smaller weights, which weakens the disturbance from patches on the edge of the target.
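The effect of the kernel weighting can be sketched as follows. The snippet assumes a Gaussian profile k(r) = exp(−r/2), a bandwidth h = 8 pixels, a patch stride of 2 pixels, and normalization of the weights to sum to one; all of these are illustrative assumptions rather than values given in the text.

```python
import math

def kernel_weights(centers, particle_center, h):
    """Weight each patch by a Gaussian profile of its scaled distance to the particle center."""
    cx, cy = particle_center
    weights = []
    for px, py in centers:
        r2 = ((px - cx) ** 2 + (py - cy) ** 2) / h ** 2
        weights.append(math.exp(-0.5 * r2))  # assumed profile k(r) = exp(-r / 2)
    total = sum(weights)
    return [w / total for w in weights]      # normalization is an illustrative choice

# Centers of 6 x 6 patches on a 32 x 32 template with an assumed stride of 2
centers = [(left + 3, top + 3) for top in range(0, 27, 2) for left in range(0, 27, 2)]
weights = kernel_weights(centers, particle_center=(16, 16), h=8.0)
```

With these settings, the corner patch `weights[0]` receives a much smaller weight than the patches nearest the template center.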
According to the histogram intersection function [16, 20], the similarity function of the ith patch of the lth particle is defined as
where \( {\varphi}_{k,i}^l \) and ψ^{i} denote the sparse coefficient histograms of the candidate target and the target template, respectively. The template histogram is computed only once for each image sequence. Moreover, the comparison between the candidate and the template should be carried out under the same occlusion condition. Therefore, the template and the ith candidate share the same occlusion detection matrix o.
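For reference, the standard histogram intersection of two histograms is the sum of their bin-wise minima. The sketch below applies it to two toy histograms; the values are made up for illustration and the function name is hypothetical.

```python
def histogram_intersection(candidate, template):
    """Sum of bin-wise minima; larger values mean more similar histograms."""
    return sum(min(c, t) for c, t in zip(candidate, template))

template_hist = [0.5, 0.3, 0.2, 0.0]
candidate_hist = [0.4, 0.4, 0.0, 0.2]
score = histogram_intersection(candidate_hist, template_hist)
self_score = histogram_intersection(template_hist, template_hist)
```

A histogram compared with itself returns its total mass, which is the largest score any candidate can reach against that template.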
The likelihood of classification of the ith patch of the lth particle is defined as
where \( {\varphi}_{k,i}^l \) is the sparse coefficient vector of the candidate patch, Γ denotes the base vector of target classification, i.e., Γ = [1, 0]^{T}, and \( \cos \angle \left(\alpha, \beta \right)=\frac{\alpha \cdot \beta }{\left|\alpha \right|\left|\beta \right|} \) denotes the cosine of the angle between two vectors.
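One way to read this definition is sketched below, under the assumption that the two-dimensional classifier response Wφ of a patch is compared with Γ through the cosine of their angle. This reading, the toy classifier matrix, and the function names are assumptions made for illustration only.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two vectors (0.0 if either has zero length)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0

def class_score(W, phi, gamma=(1.0, 0.0)):
    """Compare the linear classifier response W @ phi with the foreground vector gamma."""
    response = [sum(w * p for w, p in zip(row, phi)) for row in W]
    return cosine(response, gamma)

# Toy 2 x 4 classifier: atoms 0-1 vote foreground, atoms 2-3 vote background
W = [[1.0, 1.0, 0.0, 0.0],
     [0.0, 0.0, 1.0, 1.0]]
foreground_like = class_score(W, [0.6, 0.4, 0.0, 0.0])
background_like = class_score(W, [0.0, 0.0, 0.5, 0.5])
```

Under this reading, a patch coded mostly by foreground atoms scores close to 1, and a patch coded by background atoms scores close to 0.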
The more patches of a candidate particle that belong to the target, the better the target appearance is described. Because the selected patches may come from either target templates or background templates, a patch that belongs to the target should be given a larger weight than one that belongs to the background.
Adaptive template update
In the tracking process, the appearance of the target often changes significantly due to illumination changes, occlusion, rotation, scale variation, and so on. Therefore, the template needs to be updated appropriately. However, if the template is updated too frequently using new observations, the tracking results tend to drift away from the target due to the accumulation of errors. In particular, when the target is occluded, the latest tracking result cannot describe the real target well, which can cause the target to be lost in later frames. Conversely, tracking with fixed templates is prone to failure in dynamic scenes because it does not account for inevitable appearance changes.
In this paper, we propose an improved template histogram update scheme by combining the histogram of the first frame and the latest estimated histogram with the variable μ, i.e.,
where \( \mu ={\mathrm{e}}^{-\left(1-\frac{O_n}{O_0}\right)} \) denotes the weighting parameter, which adaptively adjusts the updated template to follow the change of the target appearance. \( {\widehat{\psi}}_n \) denotes the updated template histogram, and ψ and \( {\widehat{\varphi}}_n \) denote the template histogram of the first frame and the latest estimate, respectively. \( {O}_n=\frac{\#{\mathrm{Patch}}_{occ}}{\#\mathrm{Patch}} \) denotes the occlusion degree of the current tracking result, where #Patch_{occ} and #Patch denote the number of occluded patches and the total number of patches. O_{0} is a threshold on the degree of occlusion. Moreover, to avoid frequent template updates, we detect the occlusion state every five frames, i.e., we update the template every five frames.
During the update process, the first frame template and the newly arrived template are considered simultaneously. However, when the target is occluded, the arrived template usually cannot describe the real target effectively, so the weight given to the arrived template should decrease. Otherwise, the weight given to the arrived template should increase, because it is an accurate estimate when no other disturbance is present. In this paper, the parameter μ is set to change with the occlusion degree O_{n}, which is obtained from the reconstruction errors of the patches. If O_{n} increases, which indicates that the target may be disturbed by factors such as illumination and occlusion, the arrived template may be inaccurate; hence, the weight of the arrived template should decrease, while the weight of the first frame template should increase.
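The behavior of the weighting parameter can be illustrated numerically. The sketch below assumes the form μ = exp(−(1 − O_n/O_0)) and the threshold O_0 = 0.8 used in the experiments; the function name `update_weight` and the occluded patch counts are hypothetical. How μ enters the histogram combination follows Eq. (10) and is not reproduced here.

```python
import math

def update_weight(n_occluded, n_patches, o_threshold=0.8):
    """Return the occlusion degree O_n and the weight mu = exp(-(1 - O_n / O_0))."""
    o_n = n_occluded / n_patches
    mu = math.exp(-(1.0 - o_n / o_threshold))
    return o_n, mu

for occluded in (0, 49, 98):
    o_n, mu = update_weight(occluded, 196)
    print(f"O_n = {o_n:.2f}, mu = {mu:.3f}")
```

Under this form, μ grows monotonically with the occlusion degree and reaches 1 when O_n equals O_0.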
Experiment results
To verify the effectiveness of the proposed algorithm, some challenging sequences from the public dataset for video target tracking [1] (http://cvlab.hanyang.ac.kr/tracker_benchmark/datasets.html) are used to evaluate its performance. The main challenging features of the data are described in Table 1, including occlusion, background clutter, illumination change, target rotation, scale change, motion blur, etc. The proposed algorithm is compared with eight state-of-the-art benchmark tracking algorithms: multiple instance learning (MIL) [6], compressive tracking (CT) [21], robust fragments-based tracking (FRAG) [7], incremental visual tracking (IVT) [8], visual tracking decomposition (VTD) [9], the L1 tracker using accelerated proximal gradient (L1APG) [14], multi-task sparse learning tracking (MTT) [15], and the local sparse appearance model and K-selection (LSK) [22]. The experiments are implemented on a computer with a 2.4 GHz Intel Core i7-4700HQ processor and 8 GB of RAM. The software tool is MATLAB 2014a, and the l_{1} minimization problem is solved with the SPAMS package [23]. For each sequence, the location of the target is manually labeled in the first frame.
The learned low-dimensional dictionary consists of 50 positive templates and 50 background templates, which are obtained from the sampled templates by LC-KSVD dictionary learning. In the PF framework, 100 candidate particles are sampled and partitioned into patches in the same way as the template, and the most similar candidate particle is taken as the estimated target. The threshold of the occlusion degree in Eq. (10) is set to O_{0} = 0.8.
Qualitative evaluation
Figure 3 shows the tracking results of different algorithms when the target undergoes heavy occlusion, illumination variation, background clutter, rotation, scale change, fast motion, and motion blur.
Occlusion and illumination variation
To demonstrate the anti-occlusion and anti-illumination-variation performance of the proposed algorithm, some challenging video sequences are used in this experiment. In the (a) FaceOcc2 and (b) Woman sequences in particular, the targets are heavily occluded or partially occluded for a long time. Nevertheless, the proposed algorithm extracts the targets accurately. The reason is that the local detection strategy for occlusion and illumination changes and the adaptive template update scheme are employed, which describe and detect variations in the local details of the targets and help to decrease the influence of disturbances including occlusion, illumination change, and rotation. Moreover, the Gaussian kernel density of the patches is considered in the CKPF, which accounts for the global information of the local patches and improves the tracking performance. Taking the 181st, 273rd, and 659th frames of the FaceOcc2 sequence as examples, the target is heavily occluded by the book and the hat; the proposed algorithm has the highest tracking accuracy. In the 127th, 172nd, and 495th frames of the Woman sequence, the target is partially occluded by the car and disturbed by background clutter; some of the benchmark algorithms cannot estimate the target accurately and drift heavily, while the proposed algorithm successfully tracks the target throughout the entire sequence.
In the (c) Shaking and (d) Singer1 sequences, there is large illumination variation and some scale change. The benchmark algorithms FRAG, IVT, MTT, and CT cannot extract the target correctly and drift heavily. LSK and MIL give good estimates, but the proposed algorithm and the VTD approach give better tracking results. In the VTD algorithm, the observation model is decomposed into multiple basic observation models that cover different specific target appearances, which allows it to adapt to illumination changes; however, it has difficulty dealing with scale variation of the target, whereas the proposed algorithm handles it adaptively. Therefore, in the Singer1 sequence, the tracking results of VTD are worse than those of the proposed algorithm due to the scale variation of the target.
Background clutter
In the (f) Board, (e) Deer, and (c) Shaking sequences, the targets are disturbed by background clutter. In the Board sequence in particular, the background is complex and there is partial target rotation and fast motion. L1APG, MTT, and IVT cannot extract the target correctly due to the use of a fixed global model, while the proposed algorithm employs local patch features to describe the details of the target, and the LC-KSVD method is introduced to learn the dictionary and train the classification parameters simultaneously, which decreases the influence of the background disturbance. In the 42nd frame of the Deer sequence, there is another deer in the background. Most of the algorithms drift considerably due to the clutter. However, the proposed algorithm obtains an accurate result because the set of background models is considered simultaneously and effectively updated in the tracking process.
Rotation and scale change
In the (i) Girl and (f) Board sequences, there is heavy target rotation. In the 94th and 119th frames of the Girl sequence, the girl turns around. Heavy drift clearly exists in the results obtained by FRAG and LSK, while the proposed algorithm adapts to the target rotation owing to its effective update strategy, which considers the initial target model and the latest estimated target model simultaneously. In the 434th frame of the Girl sequence, the face of the girl is occluded by the man and the scale changes slightly during the target rotation; the proposed algorithm still obtains a good tracking result. The Board sequence leads to the same conclusion: the proposed algorithm performs well in scenarios with target rotation and scale variation.
Moreover, in the Singer1 sequences, it is clear that the scale of the target changes heavily; the proposed algorithm can obtain accurate results, because the scale parameter s_{k} is estimated simultaneously in the implement process of CKPF.
Fast motion and motion blur
In the (j) Jumping and (e) Deer sequences, the target undergoes fast motion and motion blur. For the Jumping sequence, L1APG, LSK, and MTT cannot extract the target correctly because of the motion blur, while the proposed algorithm tracks well. In the 109th and 262nd frames of the Jumping sequence, fast motion and motion blur cause some of the benchmark algorithms to drift heavily, while the proposed algorithm still gives good results. The reason is that the background templates are used to restrain the influence of the background, and the updated positive template can adapt to motion blur. The Deer sequence leads to the same conclusion.
Quantitative evaluation
Two evaluation criteria are employed to quantitatively assess the performance of the proposed algorithm. One is the average center location error (ACLE), and the other is the tracking success rate (SR). Figure 4 shows the relative position error (in pixels) between the ground-truth center and the center of the tracking result. ACLE is defined as the average of this relative position error. Let the tracking result be R_{r} and the ground truth be R_{g}; then SR is defined as ϒ = (R_{r} ∩ R_{g})/(R_{r} ∪ R_{g}), that is, the area of the intersection of the two regions divided by the area of their union. Tables 2 and 3 give the values of ACLE and SR for the different tracking algorithms.
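The two criteria can be illustrated with a minimal sketch. It assumes axis-aligned bounding boxes given as (x, y, w, h) tuples in pixels, which is a common convention but is not stated in the text; the function names are hypothetical, and averaging the per-frame overlap over a sequence is likewise an assumption about how a per-sequence SR value might be obtained.

```python
import math

def center_error(box_r, box_g):
    """Euclidean distance (pixels) between the centers of two (x, y, w, h) boxes."""
    cx_r, cy_r = box_r[0] + box_r[2] / 2.0, box_r[1] + box_r[3] / 2.0
    cx_g, cy_g = box_g[0] + box_g[2] / 2.0, box_g[1] + box_g[3] / 2.0
    return math.hypot(cx_r - cx_g, cy_r - cy_g)

def overlap_ratio(box_r, box_g):
    """Area of (R_r intersect R_g) divided by area of (R_r union R_g)."""
    x1 = max(box_r[0], box_g[0])
    y1 = max(box_r[1], box_g[1])
    x2 = min(box_r[0] + box_r[2], box_g[0] + box_g[2])
    y2 = min(box_r[1] + box_r[3], box_g[1] + box_g[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = box_r[2] * box_r[3] + box_g[2] * box_g[3] - inter
    return inter / union if union > 0 else 0.0

def acle(results, truths):
    """Average center location error over a sequence of frames."""
    if not results or len(results) != len(truths):
        raise ValueError("need two non-empty box lists of equal length")
    errors = [center_error(r, g) for r, g in zip(results, truths)]
    return sum(errors) / len(errors)

def mean_overlap(results, truths):
    """Per-sequence average of the overlap ratio (assumed aggregation)."""
    if not results or len(results) != len(truths):
        raise ValueError("need two non-empty box lists of equal length")
    ratios = [overlap_ratio(r, g) for r, g in zip(results, truths)]
    return sum(ratios) / len(ratios)
```

For example, a 10 × 10 result box shifted 5 pixels horizontally from a 10 × 10 ground-truth box has a center error of 5 pixels and an overlap ratio of 50/150 = 1/3.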
As can be seen from Fig. 4, the proposed algorithm performs better than the benchmark algorithms. Its tracking result is accurate in each frame, and its error curve is stable without large fluctuations, whereas some of the benchmark algorithms are unstable and show large errors in some frames because of the various disturbances.
From Tables 2 and 3, it is clear that the proposed algorithm adapts to most of the video sequences, achieving the best or second-best results except on the (i) Girl sequence. The performance of the proposed algorithm can be attributed to the detailed description of the local patches by the LCKSVD dictionary learning and to the adaptive template update scheme. Moreover, the Gaussian kernel density of the patches is considered as global information in CKPF. VTD can also adapt to scenarios with illumination change and slight occlusion (e.g., Shaking and Singer1) because appearance change is considered in the target template, but its performance decreases when the target undergoes rotation or motion blur (e.g., Deer, Board, and Jumping). L1APG performs well on the Girl sequence because the last tracking result is used directly as the updated template, which adapts effectively to the turning of the girl. However, L1APG cannot extract the target correctly under motion blur and illumination variation, such as in the (f) Board, (j) Jumping, (c) Shaking, and (l) Car4 sequences. For the Girl sequence, the tracking result of the proposed algorithm is not the best, but it is only slightly below those of the L1APG and MTT algorithms.
Discussion of adaptive parameter μ
To verify the effectiveness of the adaptive template update scheme, two especially challenging sequences with large appearance variation, the first 200 frames of FaceOcce2 and the first 170 frames of Woman, are chosen for this experiment. The tracking results with different constant values (e.g., 0.1, 0.4, 0.7, and 0.9) of the weighting parameter μ of Eq. (10) are compared with those obtained with the adaptive parameter value; the results are given in Table 4.
As can be seen, different constant values of μ give different ACLE and SR values: a smaller value of μ (e.g., 0.1) gives higher accuracy for the first 200 frames of the FaceOcce2 sequence, while a larger value of μ (e.g., 0.9) gives higher accuracy for the first 170 frames of the Woman sequence. The reason is that the variations of the target appearance are small from the 1st to the 140th frame of the FaceOcce2 sequence, where the updated templates rely mainly on the latest templates, but the target is severely occluded between the 141st and 190th frames, where the updated templates rely more on the template of the first frame. Therefore, the differences in tracking accuracy across the values of μ are small for this sequence. For the Woman sequence, the target appearance is slightly disturbed by background clutter between the 36th and 170th frames, and there is only partial occlusion between the 106th and 165th frames. Therefore, most of the updated templates rely mainly on the latest frame templates, and the larger value of μ gives better results. The proposed algorithm with the adaptive weight parameter obtains a good tracking result without the parameter value having to be set manually.
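The role of the weight can be sketched as follows. Eq. (10) is not reproduced in this section, so the convex-combination form below, the direction of μ (larger μ favours the latest template, as the discussion above suggests), and the function name `blend_templates` are all assumptions made for illustration; the rule the proposed algorithm uses to choose μ adaptively is not shown here.

```python
import numpy as np

def blend_templates(first_template, latest_template, mu):
    """Weighted template update (assumed form, not necessarily Eq. (10)).

    mu close to 1 -> the result follows the latest template;
    mu close to 0 -> the result stays near the first-frame template.
    """
    if not 0.0 <= mu <= 1.0:
        raise ValueError("mu must lie in [0, 1]")
    first = np.asarray(first_template, dtype=float)
    latest = np.asarray(latest_template, dtype=float)
    if first.shape != latest.shape:
        raise ValueError("templates must have the same shape")
    return mu * latest + (1.0 - mu) * first

# Toy feature vectors standing in for patch templates.
first = np.array([0.0, 0.0])
latest = np.array([10.0, 20.0])
for mu in (0.1, 0.4, 0.7, 0.9):
    print(mu, blend_templates(first, latest, mu))
```

Under this assumed form, a constant μ of 0.9 keeps the template close to the most recent appearance, which matches the behaviour described for the Woman sequence, while a constant μ of 0.1 keeps it close to the first-frame template, which helps during the severe occlusion in FaceOcce2.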
Conclusion
In this paper, we present an adaptive visual tracking algorithm based on CKPF. The template sets constructed from the local patch features of both the foreground and the background of the target are used to learn the dictionaries simultaneously. The low-dimensional dictionary and the target classification parameters are trained using LCKSVD dictionary learning. To decide the final tracking states robustly, an adaptive template update scheme is designed, and the classification information, the target candidate histogram, and the Gaussian kernel density are merged to form CKPF. The effectiveness of the proposed algorithm is demonstrated experimentally by comparison with 8 state-of-the-art trackers on 12 challenging video sequences; the experimental results show that the proposed algorithm tracks better than the benchmark methods in scenarios with occlusion, background clutter, illumination change, target rotation, and scale change. However, its computation cost is high; in the future, we would like to improve the computational efficiency by considering the reverse-low-rank representation scheme [24] and some optimal particle pruning schemes.
Abbreviations
APG: Accelerated proximal gradient
ASCH: Anti-occlusion sparse coefficient histograms
CKPF: Classified-patch kernel particle filter
CP: Classification parameters
CT: Compressive tracking
GKD: Gaussian kernel density
IVT: Incremental learning visual tracking
LCKSVD: Label-consistent K-singular value decomposition
LS: Least squares
LSK: Local sparse appearance model and K-selection
MIL: Multiple instance learning
ML: Maximum likelihood
MTT: Multi-task tracking
PF: Particle filtering
VTD: Visual tracking decomposition
VTT: Visual target tracking
References
1. Y. Wu, J. Lim, M.H. Yang, Object tracking benchmark. IEEE Trans. Pattern Anal. Mach. Intell. 37(9), 1834–1848 (2015).
2. M. Kristan, J. Matas, A. Leonardis, et al., in Proceedings of the IEEE International Conference on Computer Vision Workshops. The visual object tracking vot2015 challenge results (2015), pp. 1–23.
3. H. Fan, J. Xiang, Robust visual tracking with multi-task joint dictionary learning. IEEE Trans. Circuits Syst. Video Technol. 27(5), 1018–1030 (2017).
4. H. Li, Y. Li, F. Porikli, Deep track: learning discriminative feature representations online for robust visual tracking. IEEE Trans. Image Process. 25(4), 1834–1848 (2016).
5. X. Jia, H.C. Lu, M.H. Yang, in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition. Visual tracking via adaptive structural local sparse appearance model (IEEE Computer Society Press, Los Alamitos, 2012), pp. 1822–1829.
6. B. Babenko, M.H. Yang, S. Belongie, in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition. Visual tracking with online multiple instance learning (IEEE Computer Society Press, Los Alamitos, 2009), pp. 983–990.
7. A. Adam, E. Rivlin, I. Shimshoni, in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition. Robust fragments-based tracking using the integral histogram (IEEE Computer Society Press, Los Alamitos, 2006), pp. 798–805.
8. D.A. Ross, J. Lim, R.S. Lin, et al., Incremental learning for robust visual tracking. Int. J. Comput. Vis. 77(1–3), 125–141 (2008).
9. J. Kwon, K.M. Lee, in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition. Visual tracking decomposition (IEEE Computer Society Press, Los Alamitos, 2010), pp. 1269–1276.
10. X. Mei, H.B. Ling, in Proceedings of IEEE 12th International Conference on Computer Vision. Robust visual tracking using L1 minimization (IEEE Computer Society Press, Los Alamitos, 2009), pp. 1436–1443.
11. X. Mei, H.B. Ling, Y. Wu, et al., in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition. Minimum error bounded efficient L1 tracker with occlusion detection (IEEE Computer Society Press, Los Alamitos, 2011), pp. 1257–1264.
12. M.S. Arulampalam, S. Maskell, N. Gordon, et al., A tutorial on particle filters for online nonlinear/non-Gaussian Bayesian tracking. IEEE Trans. Signal Process. 50(2), 174–188 (2002).
13. S.P. Zhang, H.X. Yao, X. Sun, et al., Sparse coding based visual tracking: review and experimental comparison. Pattern Recogn. 46(7), 1772–1788 (2013).
14. C.L. Bao, Y. Wu, H.B. Ling, et al., in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition. Real time robust L1 tracker using accelerated proximal gradient approach (IEEE Computer Society Press, Los Alamitos, 2012), pp. 1830–1837.
15. T.Z. Zhang, B. Ghanem, S. Liu, et al., in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition. Robust visual tracking via multi-task sparse learning (IEEE Computer Society Press, Los Alamitos, 2012), pp. 2042–2049.
16. W. Zhong, H.C. Lu, M.H. Yang, in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition. Robust object tracking via sparsity-based collaborative model (IEEE Computer Society Press, Los Alamitos, 2012), pp. 1838–1845.
17. Z.L. Jiang, Z. Lin, L.S. Davis, in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition. Learning a discriminative dictionary for sparse coding via label consistent K-SVD (IEEE Computer Society Press, Los Alamitos, 2011), pp. 1697–1704.
18. Z.L. Jiang, Z. Lin, L.S. Davis, Label consistent K-SVD: learning a discriminative dictionary for recognition. IEEE Trans. Pattern Anal. Mach. Intell. 35(11), 2651–2664 (2013).
19. M. Aharon, M. Elad, A. Bruckstein, K-SVD: an algorithm for designing overcomplete dictionaries for sparse representation. IEEE Trans. Signal Process. 54(11), 4311–4322 (2006).
20. J.X. Wu, J.M. Rehg, in Proceedings of IEEE 12th International Conference on Computer Vision. Beyond the Euclidean distance: creating effective visual codebooks using the histogram intersection kernel (IEEE Computer Society Press, Los Alamitos, 2009), pp. 630–637.
21. K.H. Zhang, L. Zhang, M.H. Yang, in Proceedings of the 11th European Conference on Computer Vision. Real-time compressive tracking (IEEE Computer Society Press, Los Alamitos, 2012), pp. 864–877.
22. B.Y. Liu, J.Z. Huang, L. Yang, et al., in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition. Robust tracking using local sparse appearance model and K-selection (IEEE Computer Society Press, Los Alamitos, 2011), pp. 1313–1320.
23. J. Mairal, F. Bach, J. Ponce, et al., Online learning for matrix factorization and sparse coding. J. Mach. Learn. Res. 11(1), 19–60 (2010).
24. Y. Yang, W. Hu, Y. Xie, et al., Temporal restricted visual tracking via reverse-low-rank sparse learning. IEEE Trans. Cybern. 47(2), 485–498 (2017).
Acknowledgments
The authors would like to thank the Editor and the anonymous reviewers for their constructive suggestions.
Funding
Natural Science Foundation of Jiangsu Province (Nos. BK20181340, BK20130154), National Natural Science Foundation of China (Nos. 61305017, 61772237), and the Cyber-Physical Systems program of the U.S. National Science Foundation (CNS 1329481).
Availability of data and materials
All data and material are available.
Author information
Contributions
JY initiated the project. GZ, JY, and JL designed the algorithms, performed the experiments, and drafted the manuscript. WW and YH participated in the proposed method and analyzed the experiment results. All authors read and approved the final manuscript.
Corresponding author
Correspondence to Jinlong Yang.
Ethics declarations
Competing interests
The authors declare that they have no competing interests.
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
About this article
Keywords
 Visual target tracking
 K-singular value decomposition
 Sparse coding
 Dictionary learning
 Particle filter