In this section, we introduce our algorithm for real-time object detection and tracking in embedded systems. To achieve adequate real-time performance, the algorithm obtains the object box information from the SSD model only on key frames. The object box information comprises the location and size of the object. The KCF tracking algorithm separates the object from the background within a discriminative framework to accomplish tracking. The KCF model is trained on samples obtained by cyclically shifting the region inside the object box. To prevent tracking failures from contaminating the KCF model, this paper introduces a validity detection mechanism for tracking results that evaluates whether tracking has failed and then either updates the model or retrains it from the SSD detection results. A further strategy is introduced to reduce the missed-detection rate of the SSD model in motion-blurred and illumination-variation scenes.
The overall flow of the algorithm is shown in Fig. 1. The first step is to run either the SSD object detection algorithm or the KCF object tracking algorithm on frame i (image Ii):
$$ S(I_{i})= \left\{\begin{array}{ll} SSD(I_{i}), & i \bmod N = 0 \ \text{or} \ fr = 1\\ KCF(I_{i}), & \text{otherwise} \end{array}\right. $$
(1)
where S(Ii) denotes the detection or tracking method applied to Ii, SSD(Ii) is the SSD object detection method, and KCF(Ii) is the KCF tracking method. N is a constant, set to 20 in this paper. fr is a flag that is set to 1 when the validity detection mechanism of tracking results reports a failure.
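The dispatch of Equations 1 and 2 can be summarized in a short sketch. The helpers ssd_detect, kcf_track, kcf_retrain, validity_check, and missed_detection_strategy are hypothetical stand-ins for the SSD model, the KCF tracker, and the mechanisms of Sections 2.1 and 2.2, not the paper's published code:

```python
N = 20  # detection interval N from Equation 1

def process_frame(i, frame, state):
    """Per-frame dispatch of Equation 1: SSD on key frames or after a
    tracking failure (fr = 1), KCF otherwise. All helpers here are
    hypothetical stand-ins."""
    if i % N == 0 or state.fr == 1:
        labels, confs, boxes, n = ssd_detect(frame)  # L_s(l_i, c_i, r_i, n_i)
        state.fr = 0
        if n == 0:
            # No detection: apply the Section 2.2 strategy.
            return missed_detection_strategy(frame, state)
        # Retrain KCF from cyclic shifts of the detected box.
        kcf_retrain(state, frame, boxes)
        return boxes
    box = kcf_track(state, frame)                    # L_K(r_i)
    if validity_check(frame, box, state) == 0:       # F(r_i, r_{i-1}) = 0
        state.fr = 1                                 # fall back to SSD
        labels, confs, boxes, n = ssd_detect(frame)
        return boxes
    return box
```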
Whether SSD object detection or KCF object tracking is run, the result can be expressed as:
$$ \left\{\begin{array}{ll} L_{s}\left(l_{i},c_{i},r_{i},n_{i}\right)=S(I_{i}), & S(I_{i}) = SSD(I_{i}) \ \text{or} \ F\left(r_{i},r_{i-1}\right) = 0\\ L_{K}(r_{i})=S(I_{i}), & \text{otherwise} \end{array}\right. $$
(2)
where Ls(li,ci,ri,ni) is the result of SSD object detection: li is the object category, ci is the confidence of the category, ri is the object box of the detection result, and ni is the number of detected objects. F(ri,ri−1) is the result of the validity detection mechanism of tracking results; its calculation is given in Section 2.1. LK(ri) is the result of KCF tracking.
If ni is 0, that is, no object is detected, the missed-detection reduction strategy described in Section 2.2 is applied. Otherwise, the samples needed for KCF training are obtained by cyclically shifting the image patch contained in ri, and the initial object position model is trained for subsequent tracking.
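The cyclic-shift construction can be illustrated with a toy numpy snippet; in the actual KCF formulation the shifted samples are never materialized, since the circulant structure is exploited implicitly in the Fourier domain:

```python
import numpy as np

def cyclic_shift_samples(patch, step=8, n_shifts=3):
    """Toy illustration: build virtual training samples by cyclically
    shifting the image patch inside the object box r_i. step and
    n_shifts are illustrative parameters, not taken from the paper."""
    samples = []
    for dy in range(-n_shifts, n_shifts + 1):
        for dx in range(-n_shifts, n_shifts + 1):
            samples.append(np.roll(patch, (dy * step, dx * step), axis=(0, 1)))
    return samples
```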
Validity detection mechanism of tracking results
The KCF tracking algorithm in this paper is updated by linear interpolation as shown in Equation 3 [7]:
$$ \left\{\begin{array}{l} \alpha_{t} = (1 - \eta) \times {\alpha_{t - 1}} + \eta \times {{\alpha_{t}}'}\\ {x_{t}} = (1 - \eta) \times {x_{t - 1}} + \eta \times {{x_{t}}'} \end{array}\right. $$
(3)
where η is an interpolation coefficient that characterizes how quickly the model learns from new image frames, αt is the classifier model, and xt is the object appearance template. Evidently, the KCF algorithm does not consider whether the prediction for the current frame is suitable for updating the model. When the tracking result deviates from the real object because of occlusion, motion blur, illumination variation, or other problems, Equation 3 incorporates the wrong object information into the model, gradually contaminating it and eventually causing subsequent tracking to fail. To avoid inaccurate tracking caused by model contamination, tracking failure must be detected promptly. During tracking, the difference between the object information of adjacent frames can be expressed by the correlation of the object areas. When tracking succeeds, the object regions of adjacent frames differ very little and their correlation is high. When tracking fails, the object regions of adjacent frames change greatly, and the correlation changes significantly as well. Therefore, this paper uses the correlation between the object areas of adjacent frames to judge whether tracking has failed.
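Equation 3 amounts to a per-element exponential moving average of the classifier coefficients and the appearance template; a minimal sketch, with η = 0.02 chosen only for illustration:

```python
def update_kcf_model(alpha_prev, x_prev, alpha_cur, x_cur, eta=0.02):
    """Linear-interpolation update of Equation 3: blend the newly
    estimated classifier alpha' and template x' into the running
    model. eta = 0.02 is an illustrative value, not from the paper."""
    alpha = (1 - eta) * alpha_prev + eta * alpha_cur
    x = (1 - eta) * x_prev + eta * x_cur
    return alpha, x
```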
Since the algorithm targets embedded systems, we use only low-frequency information to compute the correlation in order to preserve real-time performance. The information in an image comprises high-frequency and low-frequency components: high-frequency components describe specific details, while low-frequency components describe large-scale structure. Figure 2a is a frame randomly selected from the BlurBody video sequence in the OTB-100 (Object Tracking Benchmark) [5] dataset, and Fig. 2b is a matrix diagram of the discrete cosine transform coefficients of that image. As Fig. 2b shows, the image energy in natural scenes is concentrated in the low-frequency region. Moreover, conditions such as camera shake and fast object motion can cause motion blur, leaving insufficient high-frequency information. High-frequency information is therefore unreliable for judging the correlation of the object area.
In this paper, a perceptual hash algorithm [21] is adopted to quickly compute the hash distance between the object areas of the current and previous frames, using only low-frequency information. The hash distance is the basis for judging whether tracking fails, as shown in Equation 4.
$$ F\left(r_{i},r_{i-1}\right)= \left\{\begin{array}{ll} 1, & pd_{i,i-1} \leq H_{th}\\ 0, & pd_{i,i-1} > H_{th} \end{array}\right. $$
(4)
where F(ri,ri−1) indicates whether tracking fails on frame i, as determined from the object areas of frames i and i−1 in the video sequence, with values 1 and 0 representing tracking success and tracking failure, respectively; pdi,i−1 is the hash distance between the object areas of frames i and i−1; and Hth is the hash distance threshold.
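A sketch of the check in Equation 4, assuming one common perceptual-hash variant (32×32 grayscale resize, DCT, 8×8 low-frequency block thresholded at its median); the paper does not spell out these parameters, and the default threshold anticipates the empirical value identified in Fig. 3 below:

```python
import cv2
import numpy as np

def phash(region):
    """64-bit perceptual hash of an object region, built from the
    low-frequency DCT coefficients only (a common pHash variant;
    the paper's exact parameters are assumptions here)."""
    gray = cv2.cvtColor(region, cv2.COLOR_BGR2GRAY)
    small = cv2.resize(gray, (32, 32)).astype(np.float32)
    dct = cv2.dct(small)
    low = dct[:8, :8]                 # keep the 8x8 low-frequency block
    return (low > np.median(low)).flatten()

def tracking_valid(region_i, region_prev, h_th=15):
    """Equation 4: tracking is valid (F = 1) when the Hamming
    distance pd between the two hashes does not exceed H_th."""
    pd = np.count_nonzero(phash(region_i) != phash(region_prev))
    return 1 if pd <= h_th else 0
```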
Taking the BlurBody video sequence in the OTB-100 dataset as the test object, the hash distance pdi,i−1 between the real object area of each frame and that of the previous frame is calculated, as shown in Fig. 3.
As Fig. 3 shows, the hash distance of the object area is usually less than 15; video frames with pdi,i−1 greater than 15 typically exhibit obvious blur or camera shake, and in those frames the tracking results of the KCF algorithm deviate significantly.
Figure 4 shows the KCF algorithm tested on the BlurBody video sequence. The tracking results of frames 43, 107, and 160 are compared with the real object positions; the hash distances pd43,42, pd107,106, and pd160,159 are 9, 22, and 15, respectively. The hash distance of frame 43 is low and its tracking result is accurate, whereas the hash distance of frame 107 is high and its tracking result clearly deviates from the true object position. The hash distance pdi,i−1 thus reflects the validity of the tracking result well.
Strategy to reduce missed rate
Motion-blurred and dark scenes carry less appearance information [22]. In addition, the SSD model detects each frame independently, without considering the correlation between adjacent frames, so its missed-detection rate is high in such scenes. In this paper, image enhancement is used to recover more image detail, and an improved KCF algorithm then tracks the object in order to reduce the missed rate.
When the image is blurred or dark, the SSD model may fail to detect the object, as shown in Fig. 5. Blurring and darkening essentially subject the image to an averaging or integration operation, so an inverse (differential) operation can highlight the image details. In this paper, the Laplacian differential operator is used to sharpen the image and recover more detailed information.
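A minimal sketch of this sharpening step, assuming the classic form g = f − w∇²f with an illustrative weight:

```python
import cv2
import numpy as np

def laplacian_sharpen(image, weight=1.0):
    """Sharpen an image by subtracting its Laplacian, amplifying the
    edges and fine detail suppressed by blur or low light. The 3x3
    kernel and unit weight are illustrative choices."""
    f = image.astype(np.float32)
    lap = cv2.Laplacian(f, cv2.CV_32F, ksize=3)
    return np.clip(f - weight * lap, 0, 255).astype(np.uint8)
```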
The enhanced image is tracked by an improved KCF algorithm that incorporates color features. In the standard KCF tracking algorithm, object features are described by histograms of oriented gradients [23]. However, in images affected by blur or illumination variation, the edge information of the object is often weak. This paper therefore combines the gradient features with color features from the Lab color space, whose strong expressive power better describes the appearance of the object.
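One plausible way to obtain the Lab channels for the combined feature (the scaling and stacking details are assumptions, since the paper does not specify them):

```python
import cv2

def lab_color_channels(patch):
    """Convert a BGR patch to the Lab color space and return its three
    channels, scaled to [0, 1]. These channels can be stacked with the
    31 HOG channels to form the multi-channel input of Equation 6."""
    lab = cv2.cvtColor(patch, cv2.COLOR_BGR2LAB).astype("float32") / 255.0
    return [lab[:, :, c] for c in range(3)]
```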
In the KCF tracking algorithm, when multi-channel features of an image are used as input, let the feature vector concatenating the C channels be x=[x1,x2,⋯,xC]. The Gaussian kernel output formula of reference [7]:
$$ {\mathbf{k}^{xx^{\prime}}} = \exp \left\{ { - \frac{1}{{{\sigma^{2}}}}\left\{ {{{\left\| \mathbf{x} \right\|}^{2}} + {{\left\| {{\mathbf{x^{\prime}}}} \right\|}^{2}} - 2{{\mathcal{F}}^{- 1}}\left[ {F\left(\mathbf{x} \right) \odot {F^{\ast}}\left({{\mathbf{x^{\prime}}}} \right)} \right]} \right\}} \right\} $$
(5)
could be rewritten as:
$$ {\mathbf{k}^{xx^{\prime}}} = \exp \left\{ { - \frac{1}{{{\sigma^{2}}}}\left[ {{{\left\| \mathbf{x} \right\|}^{2}} + {{\left\| {{\mathbf{x^{\prime}}}} \right\|}^{2}} - 2{{\mathcal{F}}^{- 1}}\left({\sum\limits_{c = 1}^{C} {F_{c}}} \right)} \right]} \right\} $$
(6)
where Fc=F(xc)⊙F∗(x′c).
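A minimal numpy sketch of Equation 6, assuming channel-first feature arrays of shape (C, H, W); σ is illustrative, and reference implementations typically also normalize the squared distance by the number of elements:

```python
import numpy as np

def gaussian_kernel_multichannel(x, xp, sigma=0.5):
    """Multi-channel Gaussian kernel correlation of Equation 6.
    The per-channel spectra products F_c are summed before a single
    inverse FFT, which keeps the multi-channel case inexpensive."""
    spectral_sum = np.zeros(x.shape[1:], dtype=np.complex128)
    for c in range(x.shape[0]):
        # F_c = F(x_c) element-wise times conj(F(x'_c))
        spectral_sum += np.fft.fft2(x[c]) * np.conj(np.fft.fft2(xp[c]))
    cross = np.real(np.fft.ifft2(spectral_sum))
    dist = np.maximum((x**2).sum() + (xp**2).sum() - 2 * cross, 0)
    # Reference KCF code usually also divides dist by x.size here.
    return np.exp(-dist / sigma**2)
```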
Based on Equation 6, the object is described by the 31-channel histogram of oriented gradients feature. In the strategy to reduce the missed rate, Laplacian sharpening is first applied to the previous two frames. Then, a KCF tracking model with the Lab color feature is trained on the object in the sharpened images. Next, the object position in the current frame is predicted by the trained model. Finally, the tracking result is checked by the method described in Section 2.1. If tracking succeeds, the predicted object is returned as the result; otherwise, the object in the next frame is again detected by the SSD model. The algorithm flow is shown in Fig. 6.
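Putting the pieces together, the Fig. 6 flow can be sketched as below, reusing the earlier hypothetical helpers (laplacian_sharpen, tracking_valid) plus hypothetical train_kcf_lab, predict_kcf, and crop helpers:

```python
def missed_detection_strategy(frame, prev_frames, prev_box, state):
    """Sketch of the Fig. 6 flow, invoked when SSD returns no object
    (n_i = 0). All helpers are hypothetical stand-ins."""
    # 1. Sharpen the previous two frames to recover lost detail.
    sharp_prev = [laplacian_sharpen(f) for f in prev_frames[-2:]]
    # 2. Train a KCF model with Lab color features on the sharpened object.
    model = train_kcf_lab(sharp_prev, prev_box)
    # 3. Predict the object position in the (sharpened) current frame.
    box = predict_kcf(model, laplacian_sharpen(frame))
    # 4. Validate the prediction with the Section 2.1 mechanism.
    if tracking_valid(crop(frame, box), crop(prev_frames[-1], prev_box)):
        return box       # success: return the predicted object
    state.fr = 1         # failure: detect with SSD on the next frame
    return None
```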
To verify the feasibility of the algorithm in this section, two motion-blurred video sequences, BlurBody and BlurOwl, and two illumination-varying video sequences, Human 9 and Singer 2, were selected from the OTB-100 dataset for the following comparison experiments:
Experiment 1 The objects in the four video sequences were tracked by the unimproved KCF algorithm.
Experiment 2 Frame sequences that are clear or show no significant illumination variation were tracked by the unimproved KCF algorithm; only the sub-sequences with motion blur or illumination variation were tracked by the algorithm described in this section.
The tracking results of Experiments 1 and 2 were evaluated by two metrics: precision rate (PR) and success rate (SR). The PR and SR of Experiments 1 and 2 are shown in Fig. 7. The figure shows that, for the video sequences with motion blur and illumination variation, the improved KCF tracking algorithm achieves significantly higher PR and SR than the unimproved algorithm.