### Overview of the algorithm

As shown in Fig. 1, the target is represented as a rectangular template in each frame, and the template is scaled to 32 × 32 pixels (large red boxes on the right). Candidate target regions are scaled at the same ratio before further processing. Pixels within the template are assumed to be positive samples of the target. A strip straddling the template boundary, extending 4 pixels on either side, is defined as the background; its outer and inner edges are denoted by B1 and B2, i.e., the gray annular area (B2 − B1), 8 pixels wide, is the background region. A patch is defined as a 6 × 6 square. *N*_{p} = 196 image patches (positive samples) are extracted from the template (foreground, target), and *N*_{n} = 196 patches are extracted from the background as negative samples. These patches are regularly distributed over the foreground and background regions, respectively, with overlaps as needed. Together, the positive-labeled patches represent the target and the negative-labeled patches represent the background.

Each patch is raster-scanned and converted into a 36 × 1 vector. Hence, there are 196 vectors labeled + 1 (positive samples) and 196 vectors labeled 0 (negative samples); we denote the total number of patches *N* = 392. A label-consistent K-SVD (LC-KSVD) algorithm is applied to the 196 positive and 196 negative vectors to select a subset of 50 vectors from each, forming a labeled dictionary. This dictionary consists of 50 vectors with positive (+ 1) labels and 50 vectors with negative (0) labels. Letting *K* = 100, the dictionary can be represented by a 36 × *K* matrix **D**. The dictionary is estimated from the initial frame, in which the target to be tracked is specified for the tracking algorithm, and remains unchanged until a template update operation is performed.
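The patch extraction and vectorization steps above can be sketched as follows. This is a minimal sketch: a 14 × 14 grid of top-left corners over the 32 × 32 template yields the stated *N*_{p} = 196 overlapping 6 × 6 patches, but the exact spacing/overlap used by the authors is an assumption.

```python
import numpy as np

def extract_patch_vectors(template, patch=6, grid=14):
    """Raster-scan a regular grid x grid arrangement of patch x patch
    squares from a 32 x 32 template into 36 x 1 column vectors.
    grid = 14 yields N_p = 196 patches, matching the text."""
    h, w = template.shape                              # expected 32 x 32
    ys = np.linspace(0, h - patch, grid).astype(int)   # top-left corners,
    xs = np.linspace(0, w - patch, grid).astype(int)   # overlapping as needed
    cols = [template[y:y + patch, x:x + patch].reshape(-1)
            for y in ys for x in xs]
    return np.stack(cols, axis=1)                      # 36 x 196 sample matrix

Y_pos = extract_patch_vectors(np.random.rand(32, 32))  # foreground samples
print(Y_pos.shape)                                     # (36, 196)
```

The same routine applied to the background annulus produces the 196 negative sample vectors, giving the 36 × 392 training matrix for LC-KSVD.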

The LC-KSVD algorithm also yields a sparse representation of each patch (36 × 1 vector) as a weighted combination of the 100 dictionary vectors. Two constraints are imposed on the sparse representations: (a) (discriminative constraint) sparse vectors corresponding to foreground (or background) patches should have similar representations; this is encoded by a discriminative parameter matrix **A**_{K × K}. (b) (classification constraint) class labels (+ 1, 0) should be reproducible from a weighted linear combination of the sparse representation; this is encoded by a 2 × *K* classification parameter matrix **W**. In addition to the sparse representations of the foreground and background patches, collected in a *K* × *N* matrix **X**, the LC-KSVD algorithm estimates the dictionary **D**, the discriminative parameter matrix **A**, and the classification parameter matrix **W** simultaneously.

Given the dictionary **D** and the sparse representation of the template **X**, tracking begins by moving to the next frame. A kernel particle filter generates 100 potential target positions in the (*k* + 1)th frame according to the particle representation of the state transition probability *p*(**x**_{k + 1}|**x**_{k}), such that *E*(**x**_{k + 1}|**x**_{k}) = **x**_{k}, where **x**_{k} = {*x*_{k}, *y*_{k}, *θ*_{k}, *s*_{k}, *α*_{k}, *β*_{k}} is the state vector of the target at the *k*th frame. The assumption is that the target motion can be described by an affine transformation: (*x*_{k}, *y*_{k}) is the target position, and *θ*_{k}, *s*_{k}, *α*_{k}, and *β*_{k} are the rotation angle, the scaling factor, the aspect ratio, and the angle of inclination, respectively. We also assume *p*(**x**_{k + 1}|**x**_{k}) is Gaussian, with a covariance matrix selected based on prior knowledge of the tracking task.
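The particle generation step amounts to sampling from the Gaussian transition model. A minimal sketch, with per-parameter standard deviations that are illustrative only (the paper selects them from prior knowledge of the task):

```python
import numpy as np

def propagate_particles(x_k, sigma, n_particles=100, rng=None):
    """Sample candidate states for frame k+1 from the Gaussian transition
    p(x_{k+1} | x_k) = N(x_k, diag(sigma**2)), so E(x_{k+1} | x_k) = x_k.
    State order: [x, y, theta, s, alpha, beta] (affine parameters)."""
    rng = np.random.default_rng(0) if rng is None else rng
    return x_k + rng.standard_normal((n_particles, 6)) * sigma

x_k = np.array([120.0, 80.0, 0.0, 1.0, 0.5, 0.0])    # current state
# illustrative standard deviations for each affine parameter
sigma = np.array([4.0, 4.0, 0.02, 0.01, 0.005, 0.001])
candidates = propagate_particles(x_k, sigma)         # shape (100, 6)
```

Each row of `candidates` defines one candidate template, which is then warped and scaled to 32 × 32 pixels before patch extraction.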

Each particle corresponds to a candidate target template. Then, 196 image patches are extracted, and the corresponding sparse representation **X**′ is evaluated using LC-KSVD and the dictionary **D**. A kernel density weighted sparse coefficient similarity score (SCSS) is then applied to estimate the likelihood between the sparse representation of the template **X** and that of the current candidate **X**′. The kernel density weighting places more weight on image patches closer to the center of the template and less weight on patches at its periphery. The location of the best-matched template is designated as the new target position.

Before moving to the next frame, the tracking algorithm may also adaptively update the template when occlusion of the target is detected. A sparse coefficient histogram matrix (SCHM) [16] is used to estimate the level of occlusion. If occlusion is detected, the algorithm uses the newly estimated template, or a weighted linear combination of the estimated template and the initial template, depending on the percentage of patches deemed occluded. With the updated template, the algorithm moves to the following frame.

A block diagram summarizing the above overview of the proposed algorithm is depicted in Fig. 2. In the initialization phase, a low-dimensional label-consistent dictionary **D** of image patches is estimated, and the sparse representation **X** as well as the classification parameters **W** of individual patches are computed. Next, the kernel density-based particle filter (KPF) algorithm generates candidate templates in the following frame. For each candidate template, a likelihood score is evaluated, and the maximum likelihood estimate of the target position is computed. This is followed by an adaptive template update phase in which occlusion of the target is detected.

### Theoretical background

#### LC-KSVD

The LC-KSVD dictionary learning algorithm [17, 18] in Fig. 2 simultaneously trains a low-dimensional dictionary and a linear classifier, i.e., the obtained dictionary has both reconstructive and discriminative abilities. The objective function is expressed as

$$ {\displaystyle \begin{array}{l}\left\langle D,W,A,X\right\rangle =\underset{D,W,A,X}{\arg \min }{\left\Vert Y- DX\right\Vert}_2^2+\alpha {\left\Vert Q- AX\right\Vert}_2^2\\ {}\kern5.75em +\beta {\left\Vert H- WX\right\Vert}_2^2,\kern0.75em \mathrm{s}.\mathrm{t}.\kern0.5em \forall i,{\left\Vert {x}_i\right\Vert}_0\le {T}_0\end{array}} $$

(1)

where \( Y\kern0.5em =\kern0.5em {\left\{{y}_i\right\}}_{i=1}^N\in {\mathrm{R}}^{n\times N} \) denotes the input sample set, *X* = [*x*_{1}, *x*_{2}, ⋯, *x*_{N}] ∈ *R*^{K × N} denotes the coefficient matrix, *D* = [*d*_{1}, *d*_{2}, ⋯, *d*_{K}] ∈ R^{n × K} denotes the low-dimensional dictionary matrix containing *K* ≪ *N* prototype sample-atoms as columns \( {\left\{{d}_j\right\}}_{j=1}^K \), and *T*_{0} denotes the degree of sparsity. *Q* ∈ *R*^{K × N} denotes the discriminative sparse codes of *Y* for classification. *A* is a linear transformation matrix that maps the original sparse codes to their most discriminative form in the sparse feature space. \( {\left\Vert Q- AX\right\Vert}_2^2 \) denotes the discriminative sparse-code error, which forces samples with the same class label to have similar sparse representations. \( {\left\Vert H- WX\right\Vert}_2^2 \) denotes the classification error, *W* is the classification parameter matrix, and *H* contains the class labels of the input samples. *α* and *β* are scalars controlling the relative contributions of the corresponding terms [18].

The K-SVD method [19] can be used to obtain the optimal solutions for all the parameters simultaneously. Specifically, Eq. (1) can be rewritten as

$$ \left\langle D,W,A,X\right\rangle =\underset{D,W,A,X}{\arg \min }{\left\Vert \left(\begin{array}{c}Y\\ {}\sqrt{\alpha }Q\\ {}\sqrt{\beta }H\end{array}\right)-\left(\begin{array}{c}D\\ {}\sqrt{\alpha }A\\ {}\sqrt{\beta }W\end{array}\right)X\right\Vert}_2^2,\kern0.5em \mathrm{s}.\mathrm{t}.\kern0.5em \forall i,{\left\Vert {x}_i\right\Vert}_0\le {T}_0 $$

(2)

Let \( {Y}_{\mathrm{new}}={\left({Y}^{\mathrm{T}},\sqrt{\alpha }{Q}^{\mathrm{T}},\sqrt{\beta }{H}^{\mathrm{T}}\right)}^{\mathrm{T}} \), \( {D}_{\mathrm{new}}={\left({D}^{\mathrm{T}},\sqrt{\alpha }{A}^{\mathrm{T}},\sqrt{\beta }{W}^{\mathrm{T}}\right)}^{\mathrm{T}} \), then Eq. (2) can be expressed as

$$ \left\langle {D}_{\mathrm{new}},X\right\rangle =\underset{D_{\mathrm{new}},X}{\arg \min}\left\{{\left\Vert {Y}_{\mathrm{new}}-{D}_{\mathrm{new}}X\right\Vert}_2^2\right\},\kern0.5em \mathrm{s}.\mathrm{t}.\kern0.5em \forall i,{\left\Vert {x}_i\right\Vert}_0\le {T}_0 $$

(3)

Then *D*_{new} can be obtained using the K-SVD method, i.e., *D*, *A*, and *W* are learned simultaneously. For further details on LC-KSVD, see [17, 18].
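The stacking trick of Eqs. (2)–(3) is purely a matrix construction, sketched below with the toy sizes from the text (*n* = 36, *K* = 100, *N* = 392; the values of *α* and *β* are illustrative):

```python
import numpy as np

def stack_lcksvd(Y, Q, H, D, A, W, alpha, beta):
    """Form Y_new and D_new of Eq. (3): the three penalty terms of
    Eq. (1) collapse into a single reconstruction problem
    ||Y_new - D_new X||_2^2 that standard K-SVD can solve."""
    Y_new = np.vstack([Y, np.sqrt(alpha) * Q, np.sqrt(beta) * H])
    D_new = np.vstack([D, np.sqrt(alpha) * A, np.sqrt(beta) * W])
    return Y_new, D_new

# toy sizes from the text: n = 36 features, K = 100 atoms, N = 392 samples
n, K, N = 36, 100, 392
Y, Q, H = np.ones((n, N)), np.ones((K, N)), np.ones((2, N))
D, A, W = np.ones((n, K)), np.ones((K, K)), np.ones((2, K))
Y_new, D_new = stack_lcksvd(Y, Q, H, D, A, W, alpha=4.0, beta=2.0)
print(Y_new.shape, D_new.shape)    # (138, 392) (138, 100)
```

In practice the columns of *D*_{new} are typically re-normalized before running K-SVD, and the learned *D*, *A*, and *W* are recovered by undoing that scaling [18].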

In Eq. (1), the constraint terms allow the learned dictionary to better represent the target. The discriminative sparse-code error forces samples of the same class to have similar sparse representations, which enlarges the difference between classes of training data. Moreover, the classification error term effectively trains a classifier to distinguish the foreground of the target from the background.

#### Sparse coefficient histogram and occlusion detection

The patches of the target can be represented using the learned low-dimensional dictionary *D*, and the sparse coefficients of the patches can be used to construct a histogram matrix. However, some patches in the candidate target may be occluded, in which case the coefficient histogram cannot accurately express the features of the candidate target, and the target cannot be estimated accurately. To address this problem, the occlusion detection strategy of [16] is employed, based on the reconstruction error of each patch; the sparse coefficient histogram is then updated according to the occlusion detection results.

Let *ξ*_{i} denote the sparse coefficient vector of the *i*th patch, obtained by solving

$$ \underset{\xi_i}{\min }{\left\Vert {y}_i-D{\xi}_i\right\Vert}_2^2+\lambda {\left\Vert {\xi}_i\right\Vert}_1 $$

(4)

The sparse coefficient histogram matrix is established by concatenating the sparse coefficient vectors *ξ*_{i}, i.e., \( \rho =\left[{\xi}_1,{\xi}_2,\dots, {\xi}_{N_p}\right] \). If the target is partially occluded, some of its patches are occluded and their sparse coefficients become meaningless, so the matrix *ρ* cannot express the candidate target well, causing a large reconstruction error. Therefore, we introduce an occlusion detection mechanism to identify the occluded patches and their corresponding sparse coefficients.
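Each column *ξ*_{i} of the histogram matrix is the solution of the ℓ1-regularized problem in Eq. (4). The paper does not specify a solver; below is a minimal sketch using ISTA, a generic proximal-gradient method, with illustrative `lam` and `n_iter`:

```python
import numpy as np

def ista(D, y, lam=0.01, n_iter=500):
    """Solve Eq. (4), min_xi ||y - D xi||_2^2 + lam ||xi||_1, by
    iterative soft-thresholding (ISTA)."""
    s2 = np.linalg.norm(D, 2) ** 2          # sigma_max(D)^2, sets the step size
    xi = np.zeros(D.shape[1])
    for _ in range(n_iter):
        z = xi - D.T @ (D @ xi - y) / s2    # gradient step on the quadratic
        xi = np.sign(z) * np.maximum(np.abs(z) - lam / (2 * s2), 0.0)
    return xi

rng = np.random.default_rng(0)
D = rng.standard_normal((36, 100))
D /= np.linalg.norm(D, axis=0)              # unit-norm atoms, as in K-SVD
y = 2.0 * D[:, 3]                           # patch generated by a single atom
xi = ista(D, y)                             # sparse code concentrated on atom 3
```

Stacking such solutions column-wise for the *N*_{p} patches gives the histogram matrix *ρ*.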

If the reconstruction error of a patch is larger than a threshold, the patch is marked as occluded, and its sparse coefficient vector is reset to zero. The candidate histogram matrix after occlusion detection is defined as *φ* = *ρ* ⊙ *o*, where ⊙ denotes element-wise multiplication, \( o\in {R}^{\left({K}_p+{K}_n\right)\times {N}_p} \) denotes the occlusion detection matrix, and its element *o*_{i} is defined as:

$$ {o}_i=\left\{\begin{array}{c}1,\kern1em {\varepsilon}_i<{\varepsilon}_0\\ {}0,\mathrm{otherwise}\end{array}\right. $$

(5)

where \( {\varepsilon}_i={\left\Vert {y}_i-{D}_t{\xi}_{i\_t}\right\Vert}_2^2 \) denotes the reconstruction error of the *i*th patch. Note that only the positive patches are used to compute the reconstruction error; therefore, *D*_{t} denotes the sub-dictionary consisting only of the positive atoms of the learned dictionary *D*, *ξ*_{i _ t} denotes the corresponding sparse coefficient vector with respect to *D*_{t}, and *ε*_{0} denotes the threshold on the reconstruction error of each patch. If *ε*_{i} ≥ *ε*_{0}, the *i*th patch is considered occluded and its coefficient vector is set to zero.
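The occlusion test of Eq. (5) and the masking *φ* = *ρ* ⊙ *o* can be sketched as follows; the dictionary, coefficients, and threshold `eps0` are synthetic illustrations:

```python
import numpy as np

def occlusion_mask(Y, D_t, Xi_t, eps0):
    """Eq. (5): per-patch reconstruction error against the positive
    sub-dictionary D_t; o_i = 1 if eps_i < eps0, else 0, and occluded
    patches have their coefficient vectors zeroed (phi = rho * o)."""
    eps = np.sum((Y - D_t @ Xi_t) ** 2, axis=0)   # eps_i for each patch
    o = (eps < eps0).astype(float)                # 1 = visible, 0 = occluded
    return o, Xi_t * o                            # mask, masked histogram

rng = np.random.default_rng(1)
D_t = rng.standard_normal((36, 50))
D_t /= np.linalg.norm(D_t, axis=0)
Xi = 0.1 * rng.standard_normal((50, 4))           # coefficients of 4 toy patches
Y = D_t @ Xi                                      # exact reconstructions
Y[:, 2] += 5.0                                    # heavily corrupt patch 2
o, phi = occlusion_mask(Y, D_t, Xi, eps0=1.0)
print(o)                                          # [1. 1. 0. 1.]
```

Only the corrupted patch exceeds the threshold, so only its coefficient column is zeroed in *φ*.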

### Classified-patch kernel particle filter

Given the observation set of the target *y*_{1 : k} = {*y*_{1}, *y*_{2}, … , *y*_{k}} up to the *k*th frame, the target state *x*_{k} can be obtained via maximum a posteriori estimation, i.e., \( {\widehat{x}}_k=\arg \underset{x_k^i}{\max }p\left({x}_k^i|{y}_{1:k}\right) \), where \( {x}_k^i \) denotes the state of the *i*th sampled particle in the *k*th frame. The posterior probability \( p\left({x}_k^i|{y}_{1:k}\right) \) can be inferred by the Bayesian recursion, i.e.,

$$ p\left({x}_k^i|{y}_{1:k}\right)\propto p\left({y}_k|{x}_k\right)\int p\left({x}_k|{x}_{k-1}\right)p\left({x}_{k-1}|{y}_{1:k-1}\right){dx}_{k-1} $$

(6)

where *p*(*y*_{k}| *x*_{k}) denotes the observation model and *p*(*x*_{k}| *x*_{k − 1}) denotes the dynamic model, which describes the temporal correlation of the target states between consecutive frames. An affine transformation with six parameters is used to model the target motion between two consecutive frames. The state transition is formulated as *p*(*x*_{k}| *x*_{k − 1}) = *N*(*x*_{k}; *x*_{k − 1}, Σ), where Σ is a diagonal covariance matrix whose elements are the variances of the affine parameters.

The observation model *p*(*y*_{k}| *x*_{k}) denotes the likelihood of the observation *y*_{k} at state *x*_{k} and plays an important role in robust tracking. In this paper, we construct a robust likelihood model with anti-occlusion and foreground identification abilities by merging the similarity of sparse coefficient histograms [16] with the classification information. Moreover, we incorporate the spatial information of each patch through an isotropic Gaussian kernel density, which preserves the stability of the proposed algorithm for visual target tracking.

The likelihood of the *l*th particle is expressed as

$$ {p}_l=\sum \limits_{i=1}^{N_p}k\left({\left\Vert \frac{y_k^l-{c}_i}{h}\right\Vert}^2\right){M}_{k,i}^l{L}_{k,i}^l $$

(7)

where \( {M}_{k,i}^l \) and \( {L}_{k,i}^l \) denote the classification likelihood and the similarity of the target histograms between the candidate and the template, respectively. \( k\left({\left\Vert \frac{y_k^l-{c}_i}{h}\right\Vert}^2\right) \) denotes the isotropic Gaussian kernel density, where *c*_{i} denotes the center of the *i*th patch and \( {y}_k^l \) denotes the center of the *l*th particle in the *k*th frame. Thus the distance between the patch and the candidate particle is taken into account: patches far from the center of the target are assigned smaller weights, which weakens the disturbance caused by patches at the edge of the target.

According to the histogram intersection function [16, 20], the similarity function of the *i*th patch of the *l*th particle is defined as

$$ {L}_{k,i}^l=\sum \min \left({\varphi}_{k,i}^l,{\psi}^i\right) $$

(8)

where \( {\varphi}_{k,i}^l \) and *ψ*^{i} denote the sparse coefficient histograms of the candidate target and the target template, respectively. The template histogram is computed only once for each image sequence. Moreover, the comparison between the candidate and the template should be carried out under the same occlusion condition; therefore, the template and the *i*th candidate share the same occlusion detection matrix *o*.

The likelihood of classification of the *i*th patch of the *l*th particle is defined as

$$ {M}_{k,i}^l=\cos \angle \left(W{\varphi}_{k,i}^l,\Gamma \right) $$

(9)

where \( {\varphi}_{k,i}^l \) is the sparse coefficient vector of the candidate patch, Γ denotes the base vector of target classification, i.e., Γ = [1, 0]^{T}, and \( \cos \angle \left(\alpha, \beta \right)=\frac{\alpha \cdot \beta }{\left|\alpha \right|\left|\beta \right|} \) denotes the cosine of the angle between two vectors.
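Eqs. (7)–(9) combine into one score per particle. A minimal sketch, with the kernel bandwidth `h`, the toy histogram sizes, and the toy classifier `W` as illustrative assumptions:

```python
import numpy as np

def particle_likelihood(phis, psi, centers, y_center, W, h=16.0):
    """Eq. (7): kernel-weighted sum over the N_p patches of the histogram
    similarity L (Eq. 8) times the classification likelihood M (Eq. 9).
    phis, psi: (K, N_p) candidate / template histograms
    centers:   (N_p, 2) patch centers; y_center: particle center
    W:         (2, K) classification parameter matrix"""
    gamma = np.array([1.0, 0.0])                     # base vector of target class
    d2 = np.sum((centers - y_center) ** 2, axis=1)
    kern = np.exp(-d2 / h ** 2)                      # isotropic Gaussian kernel
    p = 0.0
    for i in range(phis.shape[1]):
        L_i = np.minimum(phis[:, i], psi[:, i]).sum()      # Eq. (8)
        v = W @ phis[:, i]
        M_i = (v @ gamma) / (np.linalg.norm(v) + 1e-12)    # Eq. (9), |gamma| = 1
        p += kern[i] * M_i * L_i
    return p

# toy example: 3 patches with 4-bin histograms
psi = np.full((4, 3), 0.25)                          # template histograms
centers = np.array([[0.0, 0.0], [8.0, 0.0], [0.0, 8.0]])
W = np.array([[1.0, 1.0, 1.0, 1.0],
              [0.0, 0.0, 0.0, 0.0]])                 # toy classifier
p_match = particle_likelihood(psi, psi, centers, np.zeros(2), W)
```

A candidate identical to the template scores the kernel-weighted maximum, while a candidate whose patches the classifier assigns to the background (or whose histograms do not overlap the template's) scores near zero.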

The more patches a candidate particle contains, the better the target appearance can be described. However, the selected patches may come from either the target or the background; therefore, a patch belonging to the target should be given a larger weight than one belonging to the background.

### Adaptive template update

In the tracking process, the appearance of the target often changes significantly due to illumination changes, occlusion, rotation, scale variation, and so on. Therefore, the template needs to be updated appropriately. However, if the template is updated too frequently using new observations, the tracking results tend to drift away from the target due to the accumulation of errors. In particular, when the target is occluded, the latest tracking result cannot describe the real target well, which may cause later target estimates to be lost. Conversely, tracking with a fixed template is prone to failure in dynamic scenes, because it does not account for inevitable appearance changes.

In this paper, we propose an improved template histogram update scheme by combining the histogram of the first frame and the latest estimated histogram with the variable *μ*, i.e.,

$$ {\widehat{\psi}}_n=\left\{\begin{array}{l}\mu \psi +\left(1-\mu \right){\widehat{\varphi}}_n,\kern0.5em {O}_n<{O}_0\\ {}{\widehat{\psi}}_{n-1},\kern0.5em \mathrm{otherwise}\end{array}\right. $$

(10)

where \( \mu ={\mathrm{e}}^{-\left(1-\frac{O_n}{O_0}\right)} \) denotes the weighting parameter, which adaptively adjusts the updated template to the change of the target appearance. \( {\widehat{\psi}}_n \) denotes the updated template histogram, and *ψ* and \( {\widehat{\varphi}}_n \) denote the template histogram of the first frame and the latest estimate, respectively. \( {O}_n=\frac{\#{\mathrm{Patch}}_{occ}}{\#\mathrm{Patch}} \) denotes the occlusion degree of the current tracking result, where #Patch_{occ} and #Patch denote the number of occluded patches and the total number of patches, respectively. *O*_{0} is a threshold on the degree of occlusion. Moreover, to avoid overly frequent template updates, we check the occlusion state and update the template only every five frames.
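The update rule of Eq. (10) is a few lines of code; the threshold value `O0 = 0.8` below is illustrative, not taken from the paper:

```python
import numpy as np

def update_template(psi0, psi_prev, phi_n, n_occluded, n_patches, O0=0.8):
    """Eq. (10): blend the first-frame histogram psi0 with the latest
    estimate phi_n using mu = exp(-(1 - O_n/O_0)); keep the previous
    template when the occlusion degree O_n reaches the threshold O0."""
    O_n = n_occluded / n_patches                 # occlusion degree
    if O_n < O0:
        mu = np.exp(-(1.0 - O_n / O0))           # more occlusion -> larger mu
        return mu * psi0 + (1.0 - mu) * phi_n    # lean on the first frame
    return psi_prev                              # too occluded: do not update

psi0, phi = np.ones(4), np.zeros(4)              # toy histograms
out = update_template(psi0, np.full(4, 0.5), phi, n_occluded=0, n_patches=196)
```

With no occlusion (*O*_{n} = 0), μ = e^{−1} ≈ 0.368, so the latest estimate dominates; as *O*_{n} approaches *O*_{0}, μ approaches 1 and the first-frame template takes over, matching the behavior described below.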

During the update process, the first-frame template and the newly arrived template are considered simultaneously. However, when the target is occluded, the arrived template usually cannot describe the real target effectively, so its weight should decrease; otherwise, its weight should increase, since the arrived template is estimated accurately in the absence of other disturbances. In this paper, the parameter *μ* varies with the reconstruction error: if *O*_{n} increases, indicating that the target may be disturbed by factors such as illumination or occlusion, the arrived template may be inaccurate, so its weight should decrease while the weight of the first-frame template increases.