The procedure of the proposed method is shown in Fig. 1. The green arrows indicate the process of the anti-occlusion mechanism, and the light blue arrows indicate the steps of the multi-feature response map adaptive fusion strategy. *h* is the learning rate of the filter in Eq. (12).

In this section, we first review the standard DCF tracking framework, then introduce the multi-feature response map adaptive fusion strategy and the anti-occlusion mechanism. Finally, we describe the high-confidence model update strategy.

### Overview of DCF

The correlation filter approach detects the location of the target by training a correlation filter *w*. A typical DCF model is centered on the target and is trained on an image region of size \(M \times N\); the training samples are obtained by cyclic shifts of this image patch. The objective function is as follows:

$$\mathop {\min }\limits_{w} \left\| {Xw - Y} \right\|_{2}^{2} + \lambda \left\| w \right\|_{2}^{2} ,$$

(1)

where *λ* is the regularization parameter that reduces overfitting of the model, *X* is the matrix obtained by cyclic shifts of the samples, and \(Y\) is the Gaussian label that represents the ideal output of the filter learning. The closed-form solution of the objective function can be obtained via the Fourier transform, as shown below:

$$\hat{w}_{d}^{*t} = \frac{{\hat{A}_{d}^{t} }}{{\hat{B}_{d}^{t} + \lambda }},$$

(2)

$$\hat{A}_{d}^{t} = \hat{Y} \odot \hat{X}_{d}^{*t} , \, \hat{B}_{d}^{t} = \sum\limits_{i = 1}^{D} {\hat{X}_{i}^{*t} \odot \hat{X}_{i}^{t} } ,$$

(3)

where \(\odot\) is the element-wise product, the hat symbol denotes the discrete Fourier transform (DFT) of a vector, and \(\hat{X}^{*}\) is the complex conjugate of \(\hat{X}\). \(t\) and \(d \in \left\{ {1, \ldots ,D} \right\}\) denote the frame index and the feature channel, respectively. The response map can be calculated as follows:

$$R = {\mathcal{F}}^{ - 1} \left( {\sum\limits_{d = 1}^{D} {\hat{W}_{d} \odot } \hat{X}^{*}_{d} } \right).$$

(4)

The final position of the target is obtained by finding the maximum value of the response map. Since the training samples generated by circular shifts contain rich information about the target, they can train a filter with excellent performance. However, in the process of generating negative samples from the base sample, the negative samples may have discontinuous edges, which introduces interference information into the filter. By adding a cosine window, the discontinuous regions outside the target region can be filtered out and the tracking region in the image block can be better highlighted, thus obtaining better training samples.
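The training and detection steps of Eqs. (2)–(4) can be sketched in a few lines. This is a minimal single-channel illustration using NumPy's FFT; the function names `train_dcf` and `detect`, the patch sizes, and the DSST-style response formulation \(R = \mathcal{F}^{-1}(\hat{A} \odot \hat{Z}/(\hat{B}+\lambda))\) (equivalent to Eq. (4) up to conjugation conventions) are our assumptions, not part of the original implementation:

```python
import numpy as np

def train_dcf(x, y, lam=1e-3):
    """Train a single-channel DCF in the Fourier domain (Eqs. 2-3).

    x: training patch (M x N); y: Gaussian label of the same size.
    Returns the numerator A_hat and the regularized denominator B_hat + lam.
    """
    x_hat = np.fft.fft2(x)
    y_hat = np.fft.fft2(y)
    a_hat = y_hat * np.conj(x_hat)            # A = Y (elementwise) X*  (Eq. 3, one channel)
    b_hat = np.real(np.conj(x_hat) * x_hat)   # B = X* (elementwise) X
    return a_hat, b_hat + lam

def detect(a_hat, b_hat, z):
    """Response map for a new patch z (Eq. 4, up to conjugation convention)."""
    z_hat = np.fft.fft2(z)
    return np.real(np.fft.ifft2((a_hat / b_hat) * z_hat))
```

When the filter is evaluated on its own training patch, the response approximately reproduces the Gaussian label, so the peak falls at the label center.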

### Multi-feature response map adaptive fusion strategy

Since HOG, CN, and deep features (features extracted with a CNN) have achieved impressive performance in the field of target tracking, the selection and fusion of target features has become an important research direction in target tracking. Therefore, we propose a strategy for dynamically fusing these three features within the correlation filtering framework; the main idea of the proposed tracker is to dynamically fuse multiple features to build the appearance model of the object. Figure 2 illustrates the process of feature fusion and the update of the fusion weights: the yellow, black, and blue boxes are the prediction results of the CN, HOG, and CNN feature response maps, respectively, and the red box is the prediction result obtained by fusing the three feature response maps. The weight of each single-feature response map is updated by Eq. (8). The response maps are linearly combined using the feature weights to obtain the final response map, and the fusion weight of each feature is updated according to the consistency of the result obtained from that feature's response map with the final response map.

Specifically, the method in this section uses HOG, CN, and deep features to train three independent correlation filters \({\text{w}}\), in order to build the appearance model of the object from different views. \({\text{w}}_{{{\text{hog}}}}\), \({\text{w}}_{{{\text{cn}}}}\) and \({\text{w}}_{{{\text{cnn}}}}\) represent the filters trained on the HOG, CN, and deep features extracted from the target image block, respectively. To fuse the multiple features, we let \(\beta_{{{\text{hog}}}}\), \(\beta_{{{\text{cn}}}}\) and \(\beta_{{{\text{cnn}}}}\) be the weights of \({\text{w}}_{{{\text{hog}}}}\), \({\text{w}}_{{{\text{cn}}}}\) and \({\text{w}}_{{{\text{cnn}}}}\), respectively. The filters \({\text{w}}_{{{\text{hog}}}}\), \({\text{w}}_{{{\text{cn}}}}\) and \({\text{w}}_{{{\text{cnn}}}}\) operate on the HOG, CN, and deep feature maps, respectively, and the response maps \(F_{{{\text{hog}}}}\), \(F_{{{\text{cn}}}}\) and \(F_{{{\text{cnn}}}}\) are obtained by Eq. (4). Therefore, there are three different kinds of object representations. To combine these three different and independent regression subproblems, we linearly add the response maps according to the following equation:

$$F_{{{\text{final}}}} = \beta_{{{\text{hog}}}} F_{{{\text{hog}}}} + \beta_{{{\text{cn}}}} F_{{{\text{cn}}}} + \beta_{{{\text{cnn}}}} F_{{{\text{cnn}}}} .$$

(5)

The final result can be obtained by finding the maximum response value position in \(F_{{{\text{final}}}}\). We adjust the weights of the three features based on the tracking results of the previous frame to take into account the different importance of the three features at different moments or scenes for building the target appearance model. To efficiently adjust the weights and better capture changes in the target appearance, the weight corresponding to a feature is determined by the agreement between the tracking result of that feature alone and the final tracking result \(F_{{{\text{final}}}}\).
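The fusion in Eq. (5) is a per-pixel weighted sum followed by an argmax over the fused map. A minimal sketch (the function name `fuse_responses` is illustrative):

```python
import numpy as np

def fuse_responses(responses, weights):
    """Linearly combine per-feature response maps (Eq. 5) and locate the peak.

    responses: list of 2-D response maps (same shape); weights: the betas.
    Returns the fused map F_final and the (row, col) of its maximum.
    """
    f_final = sum(b * r for b, r in zip(weights, responses))
    row, col = np.unravel_index(np.argmax(f_final), f_final.shape)
    return f_final, (row, col)
```

Shifting the weights toward one feature moves the fused prediction toward that feature's peak.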

If the tracking result obtained with a feature alone closely matches the final tracking result \(F_{{{\text{final}}}}\), then that feature models the appearance of the target well and is given a higher weight. On the other hand, if the tracking result of a feature differs greatly from the final tracking result \(F_{{{\text{final}}}}\), then the feature is not suitable for modeling the target appearance in the current tracking scenario, and it is therefore assigned a small weight. We obtain the prediction results \(P_{{{\text{hog}}}} , \, P_{{{\text{cn}}}} , \, P_{{{\text{cnn}}}}\) and \(P_{{{\text{final}}}}\) from the \(F_{{{\text{hog}}}} , \, F_{{{\text{cn}}}} , \, F_{{{\text{cnn}}}}\) and \(F_{{{\text{final}}}}\) response maps, respectively. The overlap between an individual feature's result and the final result indicates the importance of that feature in the current frame. The agreement between the multiple features and the final result can be calculated as follows:

$$O_{j}^{T} = \frac{{{\text{Area}}\left( {P_{j}^{T} \cap P_{{{\text{final}}}}^{T} } \right)}}{{{\text{Area}}\left( {P_{j}^{T} \cup P_{{{\text{final}}}}^{T} } \right)}},$$

(6)

where \(P_{j}^{T}\) denotes the tracking result of the single feature \(j\) in frame \(T\), and \(P_{{{\text{final}}}}^{T}\) is the final tracking result.
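Eq. (6) is the standard intersection-over-union of two bounding boxes. A minimal sketch, assuming boxes in `(x, y, w, h)` format (the function name `overlap_ratio` is illustrative):

```python
def overlap_ratio(box_a, box_b):
    """Intersection-over-union of two boxes (x, y, w, h), as in Eq. (6)."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    # Width and height of the intersection rectangle (clamped at zero).
    ix = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    iy = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = ix * iy
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0
```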

To improve the robustness and tracking accuracy of the system, we take into account the *n* most recent consecutive reliable frames and calculate the temporal consistency of each individual feature. The specific definition of reliable frames is given when the model update strategy is introduced. Considering the previous *n* reliable frames, the temporal consistency of a single feature is calculated as follows (\(N_{r}\) denotes the set of the previous *n* reliable frames):

$$O_{j} = \left( {1 - \alpha } \right)\left( {\sum\limits_{{t \in N_{r} }} {O_{j}^{t} } } \right) + \alpha O_{j}^{T} ,$$

(7)

where \(\alpha\) is the learning rate for the agreement between the multiple features and the final result; it is a constant that controls the influence of the current frame's feature consistency on the overall fusion weight of the feature response maps. Finally, the weight of a single feature can be calculated as follows:

$$\beta_{j} = \frac{{O_{j} }}{{\sum\nolimits_{{k \in \left\{ {\text{hog,cn,cnn}} \right\}}} {O_{k} } }}, \quad j \in \left\{ {\text{hog,cn,cnn}} \right\}.$$

(8)

The larger \(\beta_{j}\) is, the more suitable the corresponding feature is for building the target appearance model in the current scene. In the first frame, we consider all three features equally important, so we set \(\beta_{{{\text{hog}}}}\), \(\beta_{{{\text{cn}}}}\) and \(\beta_{{{\text{cnn}}}}\) all to 0.33.
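The weight update of Eqs. (7)–(8) can be sketched as follows. This is a minimal illustration in which the dictionary keys and the function name `update_weights` are our own; per Eq. (7), the history term is the sum of overlaps over the recent reliable frames weighted by \(1-\alpha\), blended with the current-frame overlap weighted by \(\alpha\), and Eq. (8) then normalizes across features:

```python
def update_weights(history, current, alpha=0.3):
    """Update per-feature fusion weights (Eqs. 7-8).

    history: dict feature -> list of overlaps O_j^t over recent reliable frames.
    current: dict feature -> overlap O_j^T in the current frame.
    Returns normalized weights beta_j that sum to 1.
    """
    # Eq. (7): temporal consistency of each feature.
    o = {j: (1 - alpha) * sum(history[j]) + alpha * current[j] for j in history}
    # Eq. (8): normalize so the weights sum to one.
    total = sum(o.values())
    return {j: o[j] / total for j in o}
```

Features whose individual predictions consistently agree with the fused result accumulate larger overlaps and thus receive larger weights.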

### Anti-occlusion mechanism

The target is often occluded by other objects during tracking, resulting in tracking failure, so handling occlusion is a crucial concern for target tracking. Extensive experiments show that the response map of the tracker exhibits multiple local peaks when the target is partially occluded. However, most current tracking algorithms select only the global peak as the final position of the target and ignore the other local peaks. In fact, the features extracted at the target location are polluted when the target is occluded, which lowers the response value at the real location of the target. In addition, the tracker model is updated in each frame, so objects that occlude the target are absorbed into the tracking model as if they were the target. Since the target model is not completely corrupted, however, the confidence obtained for the region containing the real target is still higher than that of the general background, which means that a local peak may also correspond to the real position of the target. Therefore, when there are multiple peaks in the response map, we need to evaluate whether the local peaks correspond to the real position of the target.

As shown in Fig. 3, when the tracking target is partially occluded and the localization is wrong, the response map shows multiple local peaks instead of one isolated peak (local peaks do not include the global peak). Sharp, isolated local peaks may be caused by noise, while the coordinates of arch-shaped (smooth) local peaks may be the real location of the target. Based on this characteristic, we design an accurate localization method based on the smoothness of the peaks; the smoothing function of a peak is defined as follows:

$$SC = \sum\limits_{m = - 3}^{3} {\sum\limits_{n = - 3}^{3} {\left( {R_{{\left( {x + n,y + m} \right)}} - R_{{\left( {x,y} \right)}} } \right)} } ,$$

(9)

where \(R_{{\left( {x,y} \right)}}\) represents a local peak of the response map and \(R_{{\left( {x + n,y + m} \right)}}\) represents the points near that local peak. Extensive experiments show that the performance of the algorithm is most reliable when *m* and *n* are in the range [− 3, 3]; *SC* denotes the smoothing coefficient. When the *SC* of a local peak is greater than the *SC* of the global peak, the local peak is likely to be the real position of the target. We generate a new response map by moving the center of the region of interest to the position of a non-maximal local peak and re-extracting the features, and we then select the response map with the largest response value as the final response map. To balance performance and speed, we evaluate at most the three local peaks with the highest smoothness.
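Eq. (9) can be sketched directly. Note that for a true peak every difference is non-positive, so *SC* is at most zero; an arch-shaped (smooth) peak yields a less negative *SC* than a sharp, noise-like spike. The function name, the `(x, y)` = (column, row) convention, and the wrap-around border handling are our assumptions:

```python
import numpy as np

def smoothing_coefficient(response, x, y, radius=3):
    """Smoothness of a peak at column x, row y in the response map (Eq. 9).

    Sums R(x+n, y+m) - R(x, y) over a (2*radius+1)^2 neighborhood; values
    closer to zero indicate a smoother, arch-shaped peak.
    """
    h, w = response.shape
    sc = 0.0
    for m in range(-radius, radius + 1):
        for n in range(-radius, radius + 1):
            # Wrap around at the borders (an assumption for this sketch).
            sc += response[(y + m) % h, (x + n) % w] - response[y, x]
    return sc
```

For example, a broad Gaussian bump scores higher (less negative) than an isolated single-pixel spike of the same height.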

### Model update strategy

The features extracted from the target may pollute the tracking model when the target is occluded, so the proposed tracking algorithm must decide whether or not to update the tracking model. If the tracking result of the current frame is reliable, we update the model to better represent appearance changes; if it is unreliable, we do not update the model, to avoid contaminating it. The online update of the numerator \(\hat{A}_{d}^{t}\) and denominator \(\hat{B}_{d}^{t}\) of the filter *w* is as follows:

$$\hat{A}_{d}^{t} = \left( {1 - h} \right)\hat{A}_{d}^{t - 1} + h\hat{Y} \odot \hat{X}_{d}^{*t} ,$$

(10)

$$\hat{B}_{d}^{t} = \left( {1 - h} \right)\hat{B}_{d}^{t - 1} + h\sum\limits_{i = 1}^{D} {\hat{X}_{i}^{*t} \odot \hat{X}_{i}^{t} } ,$$

(11)

where *h* is the learning rate of the filter *w* and *t* is the index of the current frame.
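The running-average update of Eqs. (10)–(11) can be sketched for a multi-channel feature map. This is an illustrative NumPy version (the function name `update_model` and the `(D, M, N)` layout are our assumptions):

```python
import numpy as np

def update_model(a_prev, b_prev, x, y, h):
    """Online update of the filter numerator/denominator (Eqs. 10-11).

    a_prev: previous numerator, shape (D, M, N); b_prev: previous denominator (M, N).
    x: current feature maps (D, M, N); y: Gaussian label (M, N); h: learning rate.
    """
    x_hat = np.fft.fft2(x, axes=(-2, -1))
    y_hat = np.fft.fft2(y)
    a_new = y_hat * np.conj(x_hat)                  # per-channel Y (elementwise) X*
    b_new = np.sum(np.conj(x_hat) * x_hat, axis=0)  # denominator summed over channels
    return (1 - h) * a_prev + h * a_new, (1 - h) * b_prev + h * b_new
```

With h = 0 the model is frozen (no contamination), and with h = 1 it is replaced entirely by the current frame's statistics.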

If the local-peak smoothing coefficient \({\text{SC}}_{{{\text{local}}}}\) of the response map is greater than the global-peak smoothing coefficient \({\text{SC}}_{{{\text{global}}}}\), the target position of the current frame contains background information and the tracking result of the current frame is unreliable; we reduce the learning rate of the model to avoid tracking failure in subsequent frames caused by model contamination. Otherwise, the tracking result of the current frame is reliable, and the model is updated with the normal learning rate. The learning rate *h* is set by Eq. (12):

$$h = \left\{ \begin{gathered} 0,\quad if \, \left( {{\text{SC}}_{{{\text{local}}}} > {\text{SC}}_{{{\text{global}}}} } \right)\& \& \left( {{\text{NLP}} \ge 1} \right) \hfill \\ \tau ,\quad otherwise \hfill \\ \end{gathered} \right.,$$

(12)

where \(\tau\) is the learning rate of the standard DCF, \({\text{SC}}_{{{\text{local}}}}\) and \({\text{SC}}_{{{\text{global}}}}\) are the smoothing coefficients of the local peak and the global peak in the response map, and \({\text{NLP}}\) denotes the number of local peaks in the response map.
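The gating rule of Eq. (12) reduces to a single conditional. A minimal sketch, in which the function name and the default \(\tau = 0.02\) are illustrative assumptions (the source does not state the value of \(\tau\)):

```python
def select_learning_rate(sc_local, sc_global, n_local_peaks, tau=0.02):
    """Choose the learning rate h following Eq. (12).

    Freeze the update (h = 0) when occlusion is suspected, i.e. when some
    local peak is smoother than the global peak AND at least one local peak
    exists; otherwise use the standard DCF learning rate tau.
    """
    if sc_local > sc_global and n_local_peaks >= 1:
        return 0.0
    return tau
```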