Procedure
Figure 1 illustrates the procedure of our method. The filters are initialized according to the given target position in the first frame. In each subsequent frame, we first crop the search area centered at the target location estimated in the previous frame and extract CNN features from different layers of a pre-trained ResNet. Second, the learned linear filters are convolved with the extracted features to generate the response maps of the different layers. These response maps are then weighted and fused into a single response map, and the target position is located at the maximum value of the fused map. After that, at the estimated target location, histogram of oriented gradients (HOG) features extracted from regions of different scales are fed to the scale filters to find the optimal target scale. Finally, the NELM and the PSR of the fused response map are computed to decide whether or not to update the filters.
Convolutional features
The convolutional feature maps from ResNet are used to encode the target appearance. As the CNN layer depth increases, the spatial resolution of the feature maps is gradually reduced. For object tracking, low resolution is not sufficient to accurately locate the target. Thus, we ignore the features from the last convolutional layer (conv5) and the fully connected layers. The features from different layers have different spatial resolutions, all of which are relatively low compared with the input image. Therefore, bilinear interpolation is used to enlarge the features to the same size:
$$ x_{i} = \sum_{k}^{}\alpha_{ik}h_{ik} $$
(1)
where h represents the original features, x represents the features enlarged by interpolation, and the interpolation weight α<sub>ik</sub> depends on the position of the output element i and its k neighboring feature values. The visualization of the features from ResNet is shown in Fig. 2.
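As a concrete illustration, the interpolation of Eq. (1) can be sketched in NumPy. The function name and the axis-wise weighting scheme are our own choices for this sketch, not part of the method:

```python
import numpy as np

def bilinear_resize(feat, out_h, out_w):
    """Enlarge a feature map (H, W, D) to (out_h, out_w, D) by bilinear
    interpolation, i.e., x_i = sum_k alpha_ik * h_ik as in Eq. (1)."""
    in_h, in_w, _ = feat.shape
    rows = np.linspace(0, in_h - 1, out_h)
    cols = np.linspace(0, in_w - 1, out_w)
    r0 = np.floor(rows).astype(int)
    c0 = np.floor(cols).astype(int)
    r1 = np.minimum(r0 + 1, in_h - 1)   # clamp at the border
    c1 = np.minimum(c0 + 1, in_w - 1)
    wr = (rows - r0)[:, None, None]     # fractional row weights
    wc = (cols - c0)[None, :, None]     # fractional column weights
    top = (1 - wc) * feat[r0][:, c0] + wc * feat[r0][:, c1]
    bot = (1 - wc) * feat[r1][:, c0] + wc * feat[r1][:, c1]
    return (1 - wr) * top + wr * bot
```

In practice this would be applied to each layer's feature map so that all layers share the spatial size of the shallowest one.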
Correlation filter
Denote xl as the feature from the conv-l layer with the size of M×N×D after bilinear interpolation operation, where M,N, and D indicate the width, height, and the number of channels, respectively. The shifted sample x(m,n),(m,n)∈{0,1,…,M−1}×{0,1,…,N−1} has a Gaussian function label \(y(m,n)=e^{-\frac {(m-M/2)^{2}+(n-N/2)^{2}}{2\delta ^{2}}}\), where δ indicates the kernel width. Correlation filters wl are obtained by minimizing the objective function:
$$ \min_{w}\|w_{l}\star x_{l}(m,n)-y(m,n)\|^{2}+\lambda\|w_{l}\|^{2} $$
(2)
where ⋆ denotes circular correlation and λ is the regularization parameter. The optimization problem can be solved in the Fourier domain, with solution:
$$ W_{l}^{d} = \frac{\bar{Y}\odot{X_{l}^{d}}}{\sum_{i=1}^{D}{X_{l}^{i}}\odot{\bar{X}_{l}^{i}}+\lambda} $$
(3)
Here, X<sub>l</sub> and Y are the fast Fourier transforms (FFT) F(x<sub>l</sub>) and F(y), respectively. The overbar represents the complex conjugate, and the symbol ⊙ denotes the element-wise product. During detection, the features of the search patch are extracted and transformed into the Fourier domain, yielding Z<sub>l</sub>. The response map at the conv-l layer is computed by:
$$ f_{l} = F^{-1}(\sum_{d=1}^{D}\bar{W}_{l}^{d}\odot{Z_{l}^{d}}) $$
(4)
where F−1 is the inverse FFT.
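Eqs. (3) and (4) can be sketched with NumPy FFTs. This is a minimal sketch under our own naming (`gaussian_label`, `train_filter`, `detect`) and a small assumed λ; it omits the cosine windowing and feature normalization a full tracker would use:

```python
import numpy as np

def gaussian_label(M, N, sigma):
    """Gaussian label y(m, n) peaked at the patch center (M/2, N/2)."""
    m, n = np.meshgrid(np.arange(M), np.arange(N), indexing="ij")
    return np.exp(-((m - M / 2) ** 2 + (n - N / 2) ** 2) / (2 * sigma ** 2))

def train_filter(x, y, lam=1e-4):
    """Eq. (3): per-channel filter W in the Fourier domain.
    x: (M, N, D) features, y: (M, N) Gaussian label."""
    X = np.fft.fft2(x, axes=(0, 1))
    Y = np.fft.fft2(y)
    denom = np.sum(X * np.conj(X), axis=2) + lam   # denominator shared by all channels
    return np.conj(Y)[..., None] * X / denom[..., None]

def detect(W, z):
    """Eq. (4): response map from filter W and search-patch features z."""
    Z = np.fft.fft2(z, axes=(0, 1))
    return np.real(np.fft.ifft2(np.sum(np.conj(W) * Z, axis=2)))
```

Detecting on the same features the filter was trained on should place the response peak at the label's center, which is a quick sanity check for the Fourier-domain algebra.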
Response map fusion based on AdaBoost
To select appropriate weights for fusing the response maps, the AdaBoost algorithm is used for adaptive weight adjustment. The error rate e<sub>l</sub> is computed between the normalized response map f<sub>l</sub> of the conv-l layer and the desired response map g, which peaks at the estimated target position in frame t−1:
$$ e_{l}^{t-1} = \text{Mean}\left(\frac{\text{abs}\left(f_{l}^{t-1}-g^{t-1}\right)}{f_{l}^{t-1}+g^{t-1}}\right) $$
(5)
where abs denotes the absolute value and Mean denotes averaging over all elements. The weight β<sub>l</sub> of the conv-l layer is:
$$ \beta_{l} = \log\frac{1-e_{l}^{t-1}}{e_{l}^{t-1}} $$
(6)
Then, at t frame, the fused response map is:
$$ f^{t}=\sum_{l=3,4}\beta_{l}f_{l}^{t} $$
(7)
The target position \((\hat {m},\hat {n})\) is estimated as:
$$ (\hat{m},\hat{n}) = \mathop{\arg\max}_{(m,n)}f^{t}(m,n) $$
(8)
Since the filters are initialized on the first frame, the filters of the different layers all locate the target correctly in that frame; in other words, they have the same error rate. Thus, the initial weights are both set to 0.5.
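The fusion steps of Eqs. (5)–(8) can be sketched as follows. The function name and the small ε guard against division by zero are our additions; `resp_prev` and `resp_cur` hold the per-layer response maps of frames t−1 and t:

```python
import numpy as np

def fuse_responses(resp_prev, g_prev, resp_cur):
    """AdaBoost-style response fusion (Eqs. 5-8).
    resp_prev: dict layer -> normalized response map of frame t-1
    g_prev:    desired response map of frame t-1
    resp_cur:  dict layer -> response map of frame t."""
    eps = 1e-12
    betas = {}
    for l, f in resp_prev.items():
        # Eq. (5): error rate against the desired response of frame t-1
        e = np.mean(np.abs(f - g_prev) / (f + g_prev + eps))
        # Eq. (6): layer weight; smaller error -> larger weight
        betas[l] = np.log((1 - e) / (e + eps))
    fused = sum(betas[l] * resp_cur[l] for l in resp_cur)   # Eq. (7)
    pos = np.unravel_index(np.argmax(fused), fused.shape)   # Eq. (8)
    return fused, pos, betas
```

A layer whose previous response closely matches the desired map receives a larger β and so dominates the fused map in the current frame.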
For scale estimation, we construct a feature pyramid centered at the estimated target position. Let P×R denote the target size in the current frame, S the size of the scale dimension, and a the scale factor. For each \(n\in \left \{\lfloor -\frac {S-1}{2}\rfloor,\dots,\lfloor \frac {S-1}{2}\rfloor \right \}\), we crop an image patch of size anP×anR and extract its HOG features; the scale response map is then computed by:
$$ R_{t+1}(n)=F^{-1}\left\{\sum_{k=1}^{K}\bar{H}_{t}^{k}(n){\odot}I_{t+1}^{k}(n)\right\} $$
(9)
where
$$ H_{t}^{k}(n)=\frac{\bar{G}(n){\odot}I_{t}^{k}(n)}{\sum_{j=1}^{K}I_{t}^{j}(n){\odot}\bar{I}_{t}^{j}(n)+\lambda_{s}} $$
(10)
where I is the FFT of the HOG features and \(\bar {G}\) is the complex conjugate of the Gaussian label. The scale index \(\hat {n}\) corresponding to the maximum value is found as:
$$ \hat{n} = \mathop{\arg\max}_{n}R_{t+1}(n) $$
(11)
Then, the best scale of the target is \(a^{\hat {n}}P{\times }a^{\hat {n}}R\).
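The scale selection of Eq. (11) reduces to an argmax over the 1-D scale response. A minimal sketch, assuming an odd number of scales S and our own function name (HOG extraction and the scale filter itself are omitted):

```python
import numpy as np

def best_scale(resp, a, P, R):
    """Eq. (11): pick the offset n_hat that maximizes the scale response,
    then return the best target size a^n_hat * P x a^n_hat * R.
    resp: 1-D response over S scales (S odd), a: scale factor."""
    S = len(resp)
    ns = np.arange(-(S - 1) // 2, (S - 1) // 2 + 1)  # scale offsets n
    n_hat = ns[np.argmax(resp)]
    return a ** n_hat * P, a ** n_hat * R
```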
Optimized update strategy with occlusion detection
The filters need to be updated to maintain their discriminative ability, as the target often undergoes appearance variations. However, when the target is occluded, the filters should not be updated with background information, or model drift may occur.
In the minimum output sum of squared error (MOSSE) filter [28], the PSR was used to characterize the state of the response map and detect tracking failure. The peak is the maximum of the response map, and the sidelobe is defined as the remaining pixels, excluding an 11×11 window around the peak. The PSR is defined as \(PSR=\frac {g^{\text {max}}-\mu }{\sigma }\), where gmax is the peak value and μ and σ are the mean and standard deviation of the sidelobe, respectively. The PSR lies between 20.0 and 60.0 under normal tracking, while it drops below 7.0 when the target is occluded or tracking fails, as shown in Fig. 3. However, when the target moves rapidly or has low resolution, the PSR also stays at a low value, as shown in Fig. 3c and d. Therefore, the PSR alone cannot accurately reflect whether the target is occluded.
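The PSR computation is straightforward to sketch; the function name is ours, and the 11×11 exclusion window follows the MOSSE definition above:

```python
import numpy as np

def psr(resp, win=11):
    """Peak-to-sidelobe ratio: (peak - mean) / std of the sidelobe,
    where the sidelobe excludes a win x win window around the peak."""
    peak = resp.max()
    pr, pc = np.unravel_index(np.argmax(resp), resp.shape)
    mask = np.ones(resp.shape, dtype=bool)
    half = win // 2
    r0, r1 = max(pr - half, 0), min(pr + half + 1, resp.shape[0])
    c0, c1 = max(pc - half, 0), min(pc + half + 1, resp.shape[1])
    mask[r0:r1, c0:c1] = False      # carve out the peak window
    side = resp[mask]
    return (peak - side.mean()) / (side.std() + 1e-12)
```

A sharp, isolated peak over a flat sidelobe yields a large PSR; a peak barely above the noise floor yields a small one.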
In this work, the NELM is employed to detect occlusion. Observing the response maps, we found that they have more local maxima when the target is occluded than when it is not. As shown in Fig. 4, the red dotted lines mark the locations of the local maxima in the 3D response map.
Let f denote the fused response map in the current frame and fmax its peak. For each local maximum \(f_{\text {loc}}^{i}\;(i\in \{1,2,3,\dots,L\})\), where L is the number of local maxima excluding fmax, the ratio between \(f_{\text {loc}}^{i}\) and fmax is \(T_{i} = \frac {f_{\text {loc}}^{i}}{f_{\text {max}}}\). Some local maxima in the response map are generated by background interference, which needs to be suppressed. Since the motion of the target between the initial frame and the second frame should be smooth, the largest ratio among the local maxima (excluding the peak, which is the target position) in the response map of the second frame is taken as the threshold γ:
$$ \gamma = \max(T_{i}) $$
(12)
In the response maps of subsequent frames, if Ti is greater than the threshold γ, then \(f_{\text {loc}}^{i}\) is recorded as an effective local maximum, and the number of effective local maxima is:
$$ \text{NELM} = \text{Card}\{T_{i}\,|\,T_{i}>\gamma\} $$
(13)
where Card represents the number of elements in a set. If effective local maxima exist, i.e., NELM > 0, and the PSR is less than the given threshold, the algorithm does not update the filters. The PSR is only used to evaluate the response map; as in MOSSE, the PSR threshold is set to 7.0. If no effective local maximum exists or the PSR is greater than the threshold, the algorithm updates the filters. In Fig. 3b, the PSR value is lower than the empirical value but the NELM is equal to zero, so target occlusion is not detected and the filters can be updated. At frame t, the filter in (3) is denoted by Wt, with numerator At and denominator Bt. The updating formulae are:
$$ A_{t} = (1-\eta_{p})A_{t-1} + \eta_{p}*\bar{Y}{\odot}X_{t} $$
(14)
$$ B_{t} = (1-\eta_{p})B_{t-1} + \eta_{p}*\sum_{d=1}^{D}X_{t}^{d}{\odot}\bar{X}_{t}^{d} $$
(15)
$$ W_{t} = \frac{A_{t}}{B_{t}+\lambda} $$
(16)
Ct and Dt represent the numerators and denominators of the filters Ht in (10), respectively. The updating formulae are:
$$ C_{t} = (1-\eta_{s})C_{t-1} + \eta_{s}*\bar{G}{\odot}I_{t} $$
(17)
$$ D_{t} = (1-\eta_{s})D_{t-1} + \eta_{s}*\sum_{k=1}^{K}I_{t}^{k}{\odot}\bar{I}_{t}^{k} $$
(18)
$$ H_{t} = \frac{C_{t}}{D_{t}+\lambda} $$
(19)
where ηp and ηs are the learning rates for Wt and Ht, respectively.
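The occlusion-gated update can be sketched end to end. This is a minimal NumPy sketch under our own names (`local_maxima`, `should_update`, `update_numerator_denominator`); the local-maximum search here checks only strict interior 8-neighborhood maxima, which is one reasonable reading of the method:

```python
import numpy as np

def local_maxima(resp):
    """Values of strict local maxima in the interior of a 2-D response map."""
    c = resp[1:-1, 1:-1]
    higher = np.ones(c.shape, dtype=bool)
    for dr in (-1, 0, 1):
        for dc in (-1, 0, 1):
            if dr == 0 and dc == 0:
                continue
            shifted = resp[1 + dr:resp.shape[0] - 1 + dr,
                           1 + dc:resp.shape[1] - 1 + dc]
            higher &= c > shifted       # strictly above every 8-neighbor
    return c[higher]

def should_update(resp, gamma, psr_value, psr_thresh=7.0):
    """Eqs. (12)-(13) plus the update rule: count effective local maxima
    (ratio to the peak above gamma, the peak itself excluded) and only
    block the update when NELM > 0 AND the PSR is below the threshold."""
    f_max = resp.max()
    peaks = local_maxima(resp)
    ratios = peaks[peaks < f_max] / f_max   # exclude the global peak
    nelm = np.count_nonzero(ratios > gamma)
    return not (nelm > 0 and psr_value < psr_thresh)

def update_numerator_denominator(A_prev, B_prev, Y, X, eta, lam=1e-4):
    """Eqs. (14)-(16); the same scheme with eta_s gives Eqs. (17)-(19).
    Y: (M, N) FFT of the label, X: (M, N, D) FFT of the features."""
    A = (1 - eta) * A_prev + eta * np.conj(Y)[..., None] * X
    B = (1 - eta) * B_prev + eta * np.sum(X * np.conj(X), axis=2)
    return A, B, A / (B[..., None] + lam)
```

Updating the numerator and denominator separately, rather than interpolating W directly, matches the MOSSE-style running-average formulation used above.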