CF object tracking based on deep features
An object tracker generally consists of three parts, namely, an appearance model, a motion model, and an update model. The general flow of an object tracking algorithm is as follows: (1) an appearance model of each tracked object is established based on the initial information; (2) the appearance model is used to determine the location of the object in the current frame; and (3) based on the tracking result for the current frame, an update strategy is applied to the appearance model so that it adapts to changes in the object and the environment. Depending on whether the appearance model is established using background information, object tracking algorithms can be categorized into two types: generative and discriminative models [24]. Because they use both the background and object information, discriminative models generally exhibit higher tracking performance than generative models. In recent years, discriminative models based on the CF tracking framework have garnered extensive attention due to their advantageous performance and efficiency [25]. The CF algorithm generates positive and negative samples by cyclically shifting feature maps (e.g., grayscale, color, and histogram of oriented gradients (HOG) feature maps) to learn a filter h, which is then convolved with the image feature map fi:
$$ {g}_i={f}_i\ast h $$
(1)
Thus, a correlation information map is obtained. In the correlation map, the location with the maximum value is the location of the object.
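Because the samples are cyclic shifts of the feature map, Eq. (1) can be evaluated for every shift at once in the Fourier domain. The following minimal numpy sketch illustrates this for a single feature channel, treating the operator in Eq. (1) as circular correlation, as is usual in CF tracking; windowing and multi-channel features are omitted here.

```python
import numpy as np

def correlation_response(f, h):
    """Evaluate Eq. (1) over all cyclic shifts at once via the FFT.
    f: 2-D feature map (e.g., one grayscale or HOG channel);
    h: filter of the same shape. Returns the correlation response map g."""
    # Circular correlation in the spatial domain equals an element-wise product
    # of one spectrum with the conjugate of the other in the Fourier domain.
    G = np.fft.fft2(f) * np.conj(np.fft.fft2(h))
    return np.real(np.fft.ifft2(G))

# Usage: g = correlation_response(f, h); the object lies at
# np.unravel_index(np.argmax(g), g.shape).
```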
With the advancement of deep learning, deep features have gradually replaced conventional features, further improving the tracking performance of CF algorithms. The deep spatially regularized discriminative CF algorithm [26] learns a CF on a single-resolution deep feature map from the first convolutional layer. The hierarchical convolutional feature (HCF) algorithm [27] improves the tracking performance by training CFs on the features of multiple convolutional layers.
Deep learning is advantageous because, driven by data and tasks, it automatically learns how to extract features and avoids the incompleteness of hand-designed features.
The DCFNet algorithm proposed by Wang et al. [28] treats the CF as a layer of a deep network, thereby achieving end-to-end training for tracking tasks. Although these algorithms achieve relatively satisfactory tracking performance, they generally have low processing speeds and are inadequate for practical applications. The time cost of CF tracking algorithms stems from the feature extraction process as well as the online learning, detection, and update processes of the filter.
Network structure design
Figure 1 shows the tracker network structure designed in this study. The network consists of three parts, namely, a feature extraction network, a CF layer, and a response loss layer. The feature extraction network is a vertically symmetrical twin (Siamese) structure. Its upper branch is referred to as the historical branch, i.e., the branch where the location of the object is known. Its lower branch, referred to as the current branch, is the branch where the location of the object is unknown; its objective is to allow the network to learn how to search for the object in the subsequent frame when its location in the current frame is known.
Two key issues need to be addressed when using deep learning to extract features, namely, (1) how to design a suitable feature extraction network structure based on a specific task and (2) how to design a model training loss function to optimize the network parameters.
The feature extraction process of deep convolutional networks shows that shallow layers tend to capture low-level features of an object, such as its physical outline, edges, color, and texture, while the extracted features become increasingly abstract as the number of network layers increases. As the network deepens, the object positioning precision decreases. In a traffic scene, the size of a vehicle changes significantly as it moves from far to near or from near to far. Therefore, an excessively large number of network layers is unfavorable to small-scale detection and tracking. Moreover, an increase in the number of network layers increases the computational load and affects real-time application.
Inspired by this, a lightweight shallow feature extraction network was designed in this study, because shallow networks can more easily learn the features (e.g., physical outline, edges, color, and texture) of objects. This network consists of a convolutional layer, an inception module [29], two channel attention modules, and a local response normalization (LRN) layer.
- (1) Convolutional layer: this layer contains 96 3 × 3 convolution kernels with a step size of 1.
- (2) Channel attention mechanism module 1: this module recalibrates the feature maps generated by the convolutional layer, suppresses invalid features, and enhances valuable features.
- (3) Inception module: as shown in Fig. 2, the inception module combines the features of receptive fields of multiple scales (1 × 1, 3 × 3, and 5 × 5) and allows the network to determine the filter type in the convolutional layer on its own, which enriches the features learned by the network. Additionally, the 1 × 1 convolution kernels placed before the 3 × 3 and 5 × 5 convolution kernels reduce the number of feature channels and the computational load of the network structure. The dimensions of the feature channels output by the 1 × 1, 3 × 3, and 5 × 5 receptive fields are 4, 8, and 4, respectively, because the 3 × 3 receptive field outperforms the 5 × 5 and 1 × 1 receptive fields in terms of local perceptibility.
- (4) Channel attention mechanism module 2: this module recalibrates the feature maps generated by the inception module.
- (5) LRN layer: this layer performs interchannel normalization on the output feature maps, limits the magnitude of the output feature values, and renders the training process more stable. A code sketch assembling these components is given after this list.
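Putting the five components together, the following PyTorch sketch outlines the backbone. It is an illustrative sketch, not the authors' implementation: the 3-channel input, the padding values, and the 1 × 1 reduction widths inside the inception module are assumptions, and the two channel attention modules are left as nn.Identity placeholders that are expanded in the next subsection.

```python
import torch
import torch.nn as nn

class Inception(nn.Module):
    """Multi-scale branches of Fig. 2; the 1x1 reduction widths are assumptions."""
    def __init__(self, in_ch=96, red3=8, red5=4):
        super().__init__()
        self.branch1 = nn.Sequential(nn.Conv2d(in_ch, 4, 1), nn.ReLU(inplace=True))
        self.branch3 = nn.Sequential(nn.Conv2d(in_ch, red3, 1), nn.ReLU(inplace=True),
                                     nn.Conv2d(red3, 8, 3, padding=1), nn.ReLU(inplace=True))
        self.branch5 = nn.Sequential(nn.Conv2d(in_ch, red5, 1), nn.ReLU(inplace=True),
                                     nn.Conv2d(red5, 4, 5, padding=2), nn.ReLU(inplace=True))

    def forward(self, x):
        # Concatenate the 1x1, 3x3, and 5x5 outputs: 4 + 8 + 4 = 16 feature channels.
        return torch.cat([self.branch1(x), self.branch3(x), self.branch5(x)], dim=1)

class FeatureExtractor(nn.Module):
    """Lightweight shallow backbone: conv -> attention -> inception -> attention -> LRN."""
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Sequential(nn.Conv2d(3, 96, 3, stride=1, padding=1),
                                   nn.ReLU(inplace=True))
        self.att1 = nn.Identity()   # channel attention module 1 (sketched in the next subsection)
        self.inception = Inception(96)
        self.att2 = nn.Identity()   # channel attention module 2
        self.lrn = nn.LocalResponseNorm(size=5)   # interchannel normalization

    def forward(self, x):
        return self.lrn(self.att2(self.inception(self.att1(self.conv1(x)))))
```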
The rectified linear unit (ReLU) activation function is applied after each convolutional layer. This is mainly because ReLU is a piecewise linear function with relatively high forward propagation and backpropagation speeds. Additionally, ReLU has a gradient of 1 in the positive region and is therefore not prone to the vanishing or exploding gradient problem.
The conventional CF approach is adopted for the CF layer. The CF layer learns a filter based on all the cyclic shifts of the feature maps outputted by the historical branch and correlates the filter with the feature maps outputted by the current branch to generate a response map.
The response loss layer uses a two-dimensional (2D) Gaussian function with a peak at the center as the label and the L2 norm to measure the loss between the response map and the label. The following section provides a detailed introduction to the key components of the network structure.
Channel attention mechanism
In the video-based vehicle tracking process, indiscriminately searching for a vehicle in all regions within the field of view is clearly time-consuming. The attention mechanism of biological vision helps an organism quickly focus on an object of interest. Introducing an attention mechanism into computer vision to quickly focus the search on regions of the full field of view where a vehicle is likely to be located is undoubtedly favorable to vehicle detection and tracking performance [30]. In this study, a visual attention mechanism is introduced into the feature extraction network. This approach enables the network to highlight the vehicle features in the scene, suppress the background features, and improve the effectiveness of the network in representing vehicle features, thereby improving vehicle detection and tracking precision and speed. The channel attention module shown in Fig. 3 is introduced at the output of each layer [31].
This module consists of three components, namely, a squeeze operation, an excitation operation, and a scale operation.
For an input x with c1 feature channels, a feature map with c2 feature channels is output after a convolution operation. Through global pooling, the squeeze operation turns each 2D feature channel into a real number with a global receptive field, generating a one-dimensional vector whose length matches the number of feature channels. This vector characterizes the global distribution of responses over the feature channels and allows layers close to the input layer to obtain a global receptive field.
The main goal of the excitation operation is to explicitly model the correlation between channels. This step is achieved through two fully connected layers. The first fully connected layer reduces the feature dimension to 1/16th of the feature dimension of the input. After obtaining a response from the ReLU activation function, the dimension is increased to the original dimension through another fully connected layer. Ultimately, the output and input have the same feature dimension. Finally, a sigmoid activation function is used to normalize the output value to the range of 0–1. This approach has the following advantages:
- (1) It allows the structure to be nonlinear to better fit the complex correlation between channels.
- (2) It reduces the number of parameters and computational load and ensures a lightweight network.
The weights output in the previous step represent the importance of each feature channel; the scale operation then uses them to recalibrate (reweight) the feature channels.
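A compact PyTorch sketch of this squeeze–excitation–scale module is given below. The reduction ratio of 16 follows the text; everything else (layer choices, defaults) is an illustrative assumption rather than the authors' exact implementation.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Squeeze-excitation-scale channel attention module (cf. Fig. 3)."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        hidden = max(channels // reduction, 1)
        self.fc1 = nn.Linear(channels, hidden)   # reduce to 1/16 of the input dimension
        self.fc2 = nn.Linear(hidden, channels)   # restore the original dimension

    def forward(self, x):                        # x: (batch, channels, H, W)
        # Squeeze: global average pooling turns each channel into one real number.
        s = x.mean(dim=(2, 3))
        # Excitation: two fully connected layers model the inter-channel correlation;
        # the sigmoid normalizes the channel weights to the range (0, 1).
        w = torch.sigmoid(self.fc2(torch.relu(self.fc1(s))))
        # Scale: recalibrate (reweight) each feature channel by its importance.
        return x * w.view(x.size(0), -1, 1, 1)
```

In the backbone sketched in the previous subsection, the two nn.Identity placeholders would be replaced by ChannelAttention(96) and ChannelAttention(16), respectively.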
CF and response loss layers
In the CF layer, let M × N be the spatial dimensions of the input image block x in the historical branch. A feature map \( \varphi (x)\in {\mathbb{R}}^{M\times N\times D} \) is obtained for the image block using the feature extraction network. Positive and negative samples are then generated by cyclically shifting the feature map and are used to train a filter w, which is obtained by solving the minimization problem in Eq. (2):
$$ \underset{w}{\min }{\left\Vert Xw-y\right\Vert}_2^2+\lambda {\left\Vert w\right\Vert}_2^2 $$
(2)
where λ (λ ≥ 0) is a regularization coefficient and X = [x1, x2, …, xn]T is a data matrix consisting of all the positive and negative samples generated by cyclically shifting the feature map. A closed-form solution to Eq. (2) can be obtained using the least-squares method [32], as shown in Eq. (3):
$$ w={\left({X}^TX+\lambda I\right)}^{-1}{X}^Ty $$
(3)
The above equation can be rewritten as the following equation in the complex number field:
$$ w={\left({X}^HX+\lambda I\right)}^{-1}{X}^Hy $$
(4)
X is a circulant matrix; therefore, the filter in the lth (l ∈ {1, …, D}) feature channel can be written in the form of Eq. (5):
$$ {\hat{w}}_l=\frac{\hat{y}\odot {\hat{x}}_l^{\ast }}{\sum_{i=1}^D{\hat{x}}_i^{\ast}\odot {\hat{x}}_i+\lambda } $$
(5)
where ⊙ signifies the Hadamard product, •∗ signifies the complex conjugate, and \( \hat{\bullet} \) signifies the discrete Fourier transform of the vector.
In the current branch, the image block z to be searched has the same dimensions as the image block x. A feature map φ(z) is obtained for z through the feature extraction network, and the response R between φ(z) and the filter is calculated using Eq. (6):
$$ R={F}^{-1}\left({\sum}_{l=1}^D{\hat{w}}_l\odot \hat{\varphi}{(z)}_l^{\ast}\right) $$
(6)
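Assuming the feature maps have already been extracted (and windowed, if needed), Eqs. (5) and (6) reduce to a few lines of per-channel FFT arithmetic. The following numpy sketch illustrates this; the regularization value is an assumption.

```python
import numpy as np

def train_filter(feat_x, y, lam=1e-4):
    """Closed-form filter of Eq. (5). feat_x: (M, N, D) feature map from the
    historical branch; y: (M, N) Gaussian label; lam: regularization coefficient
    (its value here is an assumption)."""
    X_hat = np.fft.fft2(feat_x, axes=(0, 1))                     # per-channel DFT
    y_hat = np.fft.fft2(y)
    denom = np.sum(np.conj(X_hat) * X_hat, axis=2) + lam         # shared denominator
    return y_hat[..., None] * np.conj(X_hat) / denom[..., None]  # (M, N, D) filter

def response_map(w_hat, feat_z):
    """Response of Eq. (6) for the current-branch feature map feat_z: (M, N, D)."""
    Z_hat = np.fft.fft2(feat_z, axes=(0, 1))
    return np.real(np.fft.ifft2(np.sum(w_hat * np.conj(Z_hat), axis=2)))
```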
The loss function of the network is defined as the L2 norm between the response R and the 2D Gaussian label \( \tilde{R} \) with its peak at the center, as shown in Eq. (7):
$$ {\displaystyle \begin{array}{l}\kern2.5em L\left(\theta \right)={\left\Vert R-\tilde{R}\right\Vert}^2+\gamma {\left\Vert \theta \right\Vert}^2\\ {}s.t.\kern1.25em R={F}^{-1}\left(\sum \limits_{l=1}^D{\hat{w}}_l^{\ast}\odot {\hat{\varphi}}_l\left(z,\theta \right)\right)\\ {}\kern2.25em {\hat{w}}_l=\frac{{\hat{y}}^{\ast}\odot {\hat{\varphi}}_l\left(x,\theta \right)}{\sum_{i=1}^D{\hat{\varphi}}_i\left(x,\theta \right)\odot {\left({\hat{\varphi}}_i\left(x,\theta \right)\right)}^{\ast }+\lambda}\end{array}} $$
(7)
The forward derivation for the CF layer was provided above. To achieve end-to-end training, it is also necessary to derive the backpropagation forms. The backpropagation forms for the historical and current branches can be derived using the chain rule, as shown in Eq. (8) (see [26] for details).
$$ {\displaystyle \begin{array}{l}\frac{\partial L}{\partial {\varphi}_l(x)}={F}^{-1}\left(\frac{\partial L}{\partial {\left({\hat{\varphi}}_l(x)\right)}^{\ast }}+{\left(\frac{\partial L}{\partial \left({\hat{\varphi}}_l(x)\right)}\right)}^{\ast}\right),\\ {}\frac{\partial L}{\partial {\varphi}_l(z)}={F}^{-1}\left(\frac{\partial L}{\partial {\left({\hat{\varphi}}_l(z)\right)}^{\ast }}\right).\end{array}} $$
(8)
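As an illustration, the Gaussian label and the data term of Eq. (7) can be sketched as follows. The label bandwidth sigma is an assumption; in an end-to-end implementation, the gradients of Eq. (8) are typically supplied by a framework with differentiable FFTs rather than coded by hand.

```python
import numpy as np

def gaussian_label(M, N, sigma=2.0):
    """2-D Gaussian label with its peak at the center of an M x N response map.
    The bandwidth sigma is an assumption."""
    cols, rows = np.meshgrid(np.arange(N) - N // 2, np.arange(M) - M // 2)
    return np.exp(-(rows ** 2 + cols ** 2) / (2 * sigma ** 2))

def response_loss(R, R_label):
    """Data term of Eq. (7): squared L2 norm between the response and the label
    (the weight-decay term gamma * ||theta||^2 is left to the optimizer)."""
    return np.sum((R - R_label) ** 2)
```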
Online update and scale adaptation
In the object tracking process, changes occur in the scale and angle of the object in the image sequence. Additionally, the background where the object is located also changes with time. To achieve accurate and stable object tracking, an online update strategy must adapt to the changes in the object and background.
Figure 4 shows the online tracking process of the combined deep network–CF object tracker. First, the deep features of the object region in the first frame of the tracking video sequence are extracted and used to train an initial filter. Then, in each subsequent frame, a search region is set with the object’s location in the previous frame as the center. Deep features are extracted from this search region and input into the filter to generate a response. The location of the maximum in the response map is the location of the object in the current frame. Finally, the filter template is updated based on this new location.
In the tracking flow in Fig. 4, based on the response map, only the position of the object can be predicted, whereas changes in the scale of the object cannot be accurately perceived. If the object shrinks, the filter will learn a large amount of background information. Conversely, if the object expands, the filter will drift with the local texture of the object. To allow the tracker to adapt to scale variations, a multiscale search strategy is often adopted. The general flow of a multiscale CF tracking algorithm is described as follows:
- (1) The deep features of the tracked object region in the first frame of the video sequence are extracted, and a CF is obtained by initialization.
- (2) For each subsequent frame of the input video sequence, an image pyramid is established based on the tracked object region predicted from the previous frame. Equation (9) shows the pyramid scale factors.
$$ \left\{{a}^s|s=\left\lfloor -\frac{S-1}{2}\right\rfloor, \left\lfloor -\frac{S-3}{2}\right\rfloor, \dots, \left\lfloor \frac{S-1}{2}\right\rfloor \right\} $$
(9)
Because the filter template has a fixed size, it is necessary to uniformly normalize the multiscale images to the size of the filter template. Then, a multiscale feature map can be obtained by using the feature extraction network.
- (3) The feature maps of each scale are first processed with a window function and then correlated with the CF template. Across the response maps, the location of the maximum value is the predicted location of the object, and the corresponding scale is the scaling ratio of the object in the current frame.
- (4) The image features at the new location of the object are extracted to update the CF template. Equation (10) shows the update strategy for the filter template. This update strategy makes full use of the historical information provided by the video sequence and improves the robustness of the filter against interference from external factors (e.g., illumination and blocking objects).
$$ W=\eta W+\left(1-\eta \right){A}_i $$
(10)
where η is the learning rate and Ai is the filter calculated for the current frame.
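The sketch below puts steps (2)–(4) together. It is only an illustration of the flow: the scale step a, the number of scales S, the template size, the use of OpenCV for resizing, the value of the learning rate, and the respond helper (feature extraction followed by Eq. (6)) are all assumptions, and boundary handling of the cropped patches is omitted.

```python
import numpy as np
import cv2  # used only to resize the scaled patches

def scale_factors(a=1.02, S=5):
    """Pyramid scale factors of Eq. (9); the step a and number of scales S are assumptions."""
    return [a ** s for s in range(-(S - 1) // 2, (S - 1) // 2 + 1)]

def multiscale_search(frame, center, base_size, respond, template_size=(64, 64), a=1.02, S=5):
    """Search over the scale pyramid. `respond` maps a normalized patch to a
    response map (feature extraction + Eq. (6)) and is assumed to be given."""
    best_val, best_loc, best_scale = -np.inf, None, 1.0
    for scale in scale_factors(a, S):
        w, h = int(base_size[0] * scale), int(base_size[1] * scale)
        x0, y0 = int(center[0] - w / 2), int(center[1] - h / 2)
        patch = frame[y0:y0 + h, x0:x0 + w]          # boundary handling omitted
        patch = cv2.resize(patch, template_size)     # normalize to the filter template size
        R = respond(patch)
        if R.max() > best_val:                       # keep the scale with the highest peak
            best_val = R.max()
            best_loc = np.unravel_index(R.argmax(), R.shape)
            best_scale = scale
    return best_loc, best_scale

def update_template(W, A_i, eta=0.95):
    """Template update of Eq. (10) as written; the value of eta is an assumption."""
    return eta * W + (1 - eta) * A_i
```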
Tracker–detector-integrated object tracking
The object detector is designed to provide the initial positions of the objects to be tracked and thus supply the tracker with objects to track. With a detector, objects can be detected automatically without manually specifying the initial tracking targets, and the appearance of new objects and the disappearance of old objects during monitoring can be handled, which is a necessary link in automatic object tracking.
The object detector detects the objects in each video frame and extracts their information. For single-object tracking, tracking can be achieved by simply using the detector. For scenes with multiple objects, however, multi-object tracking cannot be realized with the detector alone because it cannot establish the correspondence of objects between two consecutive frames.
In this study, a YOLO detector–CF tracker-integrated object tracking algorithm was proposed for tracking moving objects in complex traffic scenes. First, a tracker is used to predict the locations of the objects in the subsequent frame, and observed values close to the predicted values are searched for. Additionally, the CF template is used to select the matched observed values and correct the predicted values. Then, each object that fails to be tracked due to blocking is retracked based on the correlations between spatial location, moving direction, and historical features. The experimental results demonstrate that the proposed algorithm is relatively robust and able to retrack objects that were blocked for a short period of time. Figure 5 shows the flowchart of the object tracking algorithm designed in this study, which mainly consists of two parts: (1) matching between the observed and predicted values and correction of the predicted values; and (2) processing of blocked and new objects.
Peak-to-sidelobe ratio (PSR)-based tracking quality evaluation
In ideal conditions, a CF can predict the location of an object at the current moment if its location at the previous moment is known. When tracking objects in an actual traffic scene, a tracking drift or a loss of tracked objects can often occur due to mutual blocking between objects or other factors. Therefore, determining whether a tracking drift or a loss of tracked objects has occurred and, if so, addressing the problem are the key issues in achieving robust tracking. In this study, the PSR, which is extensively used when applying CFs, was used to evaluate the tracking quality. The PSR can be calculated using Eq. (11) [33].
$$ PSR=\frac{g_{\mathrm{max}}-{\mu}_{s1}}{\sigma_{s1}} $$
(11)
where gmax is the peak value of the CF response map, and μs1 and σs1 are the mean and standard deviation, respectively, within an N × N window centered on the peak of the response map.
In this study, N was set to 12. Through testing and statistical analysis, the PSR was found to range from 5 to 10 in the normal tracking state. When PSR < 5, it can be determined that a tracking drift or loss of the tracked object has occurred. If a tracker exhibits relatively poor tracking quality or fails to match the observed values in multiple consecutive frames, it may be because the object has been blocked or has left the field of view. In this study, an effective detection region was established in the object tracking process, and the distance d of an object from the boundary of the detection region along its moving direction is calculated. When d < D (where D is a distance threshold to the boundary), the object is considered to have left the field of view. Trackers whose relatively poor tracking quality is determined not to be due to the departure of the object from the field of view, or that fail to match the observed values in multiple consecutive frames, are added to a temporary linked list. Additionally, a survival period is set; trackers past their survival period are removed.
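A minimal numpy sketch of Eq. (11) is given below; the window size follows the text, and boundary handling at the image edge is simplified for illustration.

```python
import numpy as np

def psr(response, n=12):
    """Peak-to-sidelobe ratio of Eq. (11). The statistics are computed over an
    n x n window centered on the peak (n = 12, as in the text)."""
    peak = response.max()
    r, c = np.unravel_index(response.argmax(), response.shape)
    half = n // 2
    window = response[max(r - half, 0):r + half, max(c - half, 0):c + half]
    # A small constant guards against division by zero in flat response maps.
    return (peak - window.mean()) / (window.std() + 1e-12)
```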
Matching between the observed and predicted values and the correction of the predicted values
Continuous, steady object tracking can be achieved by correcting the tracker based on the observed values for the object. Therefore, establishing the matching relationship between observed and predicted values is a key issue in achieving steady tracking. Generally, the predicted and observed values for an object are relatively close in terms of spatial distance. Matching the predicted and observed values with spatial constraints alone is sufficient for scenes with relatively low vehicle densities. However, for scenes with relatively high vehicle densities and objects that heavily block one another, as shown in Fig. 6, spatial constraints alone may easily lead to mismatching. To address this problem, observed and predicted values are matched in this study through a combination of a spatial constraint and the filter template. The spatial constraint is as follows:
$$ IOU\left({r}_d,{r}_p\right)=\frac{r_d\cap {r}_p}{r_d\cup {r}_p}> th $$
(12)
Observed values are selected as candidate matches if the IOU between them and the predicted values is greater than th. Extracting objects relatively close to the predicted values significantly reduces the search region and improves the processing efficiency. In this study, th was set to 0.2.
An image block is extracted centered on each candidate match selected in the previous step, and the deep network is used to extract its image features. A response value is then obtained by correlating these features with the filter template of the object corresponding to the predicted value. The candidate with the highest response value is taken as the final match. To correct the predicted value, whichever of the predicted value and the matched observed value has the higher peak response is selected, because the peak response represents the confidence in the predicted object, i.e.,
$$ r=\left\{\begin{array}{l}{r}_d\ if\ {g}_{r_d}\ge {g}_{r_p}\\ {}{r}_p\ else\end{array}\right. $$
(13)
where r is the final result, rd is the predicted value, rp is the observed value, and \( {g}_{r_d} \) and \( {g}_{r_p} \) are their corresponding peak response values.
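The matching and correction steps of Eqs. (12) and (13) can be sketched as follows. The respond helper, which returns the peak response of the object's filter template at a detected box, is an assumed placeholder for the feature extraction and correlation described above; the threshold th = 0.2 follows the text.

```python
import numpy as np

def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(ix2 - ix1, 0) * max(iy2 - iy1, 0)
    union = ((a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def match_and_correct(pred_box, pred_peak, detections, respond, th=0.2):
    """Spatial gating of Eq. (12) followed by template verification and the
    correction rule of Eq. (13). `respond(box)` is assumed to return the peak
    response of the object's filter template at a detected box."""
    # Keep only detections whose IOU with the prediction exceeds th (Eq. (12)).
    candidates = [d for d in detections if iou(d, pred_box) > th]
    if not candidates:
        return pred_box                     # no observation matched the prediction
    best = max(candidates, key=respond)     # candidate with the highest response
    # Eq. (13): keep the value with the higher peak response (higher confidence).
    return pred_box if pred_peak >= respond(best) else best
```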
Tracking of blocked and new objects
An unmatched observed value may correspond to a new object that has entered the field of view or to a previously blocked object that has reappeared. To retrack blocked objects, objects that meet the following conditions are first searched for on the temporary linked list:
- (1) The object on the temporary linked list is located within a circular region with the observed value as its center and R as its radius.
- (2) In a traffic scene, an object will not suddenly change its direction within a short period of time; therefore, the location where an observed value reappears should be ahead of the moving direction of the object on the temporary linked list.
For each object that meets the above conditions, an image block is extracted with the observed value as its center, and its image features are extracted using the deep network. The response map between this feature map and the filter template of the tracker is then computed, and its PSR is calculated. If the PSR of the response map is greater than th, the observed value is the same object that previously disappeared, and the tracker is moved back to the tracking linked list; otherwise, the observed value is a new object. A new object first appears on the boundary of the monitoring region; as a result, its observed value contains only part of its information, and using this rectangular box to initialize the tracker would yield unsteady tracking results. To address this boundary issue, let p(xi, yi) be the center of the object at the current moment. The distance di (i = 0, 1, 2, 3) of the object from each boundary of the image can easily be calculated. When min({di, i = 0, 1, 2, 3}) > dist, the object has completely entered the monitoring region, and only then is the tracker for this object initialized. In this study, th for rematching blocked objects was set to 6.0.
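The re-identification check for blocked objects can be sketched as follows. The fields stored for each tracker on the temporary linked list (center, unit motion direction, filter template), the search radius R, and the compute_psr helper (which evaluates Eq. (11) on the template's response to the image patch) are assumptions for illustration; psr_th = 6.0 follows the text.

```python
import numpy as np

def try_reidentify(det_center, det_patch, temp_list, compute_psr,
                   radius=50.0, psr_th=6.0):
    """Decide whether an unmatched observed value is a previously blocked object."""
    for trk in temp_list:
        offset = np.asarray(det_center, dtype=float) - np.asarray(trk["center"], dtype=float)
        if np.linalg.norm(offset) > radius:
            continue                        # condition (1): within the circular region
        if np.dot(offset, trk["direction"]) <= 0:
            continue                        # condition (2): ahead of the moving direction
        if compute_psr(trk["template"], det_patch) > psr_th:
            return trk                      # same object: move it back to the tracking list
    return None                             # otherwise the observation is a new object
```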