Real-time embedded object detection and tracking system in Zynq SoC

With the increasing application of computer vision in autonomous driving, robotics, and other mobile devices, implementing object detection and tracking algorithms on embedded platforms has attracted growing attention. The real-time performance and robustness of such algorithms are two central research topics and challenges in this field. To address the poor real-time tracking performance of convolutional neural networks on embedded systems and the low robustness of tracking algorithms in complex scenes, this paper proposes a fast and accurate real-time video detection and tracking algorithm suitable for embedded systems. The algorithm combines the single-shot multibox detection object detection model of deep convolutional networks with the kernelized correlation filters tracking algorithm; moreover, it accelerates the single-shot multibox detection model using a field-programmable gate array, which satisfies the real-time requirements of the algorithm on the embedded platform. To address model contamination after the kernelized correlation filters algorithm fails to track in complex scenes, a validity detection mechanism for tracking results is proposed that solves the problem of the traditional kernelized correlation filters algorithm's inability to track robustly over long periods. To address the high missed rate of the single-shot multibox detection model under motion blur or illumination variation, a strategy to reduce the missed rate is proposed that effectively reduces missed detections. Experimental results on the embedded platform show that the algorithm achieves real-time tracking of the object in video and can automatically reposition the object to continue tracking after a tracking failure.

This places demands on the performance of the embedded hardware, on algorithm selection, and on algorithm-based improvement. At present, deploying a convolutional neural network in an embedded system usually involves quantizing the weights or activation values, that is, converting the data from 32-bit floating point to a low-bit integer type, as in the binary neural network (BNN), the ternary weight network (TWN), and XNOR-Net. However, current quantization methods still fall short in the trade-off between accuracy and computational efficiency: many of them compress the network to varying degrees and save storage resources, but they cannot effectively improve computational efficiency on the hardware platform. In 2017, Xilinx proposed quantizing the weights of convolutional neural networks from 32-bit floating point to 8-bit fixed point and, adopting a hardware-software co-design method, realized hardware acceleration of the model on an FPGA, which met the real-time requirements of convolutional neural networks in embedded systems. The main research content of this paper is to implement object detection and tracking in an embedded hardware-software system. This task requires the algorithm to be both real-time and robust. To address the poor real-time tracking performance of convolutional neural networks and the low robustness of tracking algorithms in complex scenes, a fast and accurate real-time video detection and tracking algorithm for embedded systems is proposed. It is a tracking algorithm with object detection, based on the deep convolutional single-shot multibox detector (SSD) [19] model and the kernelized correlation filters (KCF) object tracking algorithm.
To improve the real-time performance of the algorithm in the embedded system, the SSD model is quantized and compressed in the Xilinx DNNDK (Deep Neural Network Development Kit) environment, and the compressed model is deployed to the embedded system through hardware-software co-design. The main work of this paper is as follows: (1) To achieve higher robustness in complex scenes, a deep learning method is applied to object detection. Owing to its high detection accuracy, the SSD model is used to locate the object. It is important to note that we are not proposing a new or improved version of SSD, but rather a method for the hardware-software co-design of embedded systems based on a system on chip (SoC).
(2) To achieve higher speed, KCF is applied to object tracking. In complex scenes involving fast movement, camera shake, or occlusion, the KCF algorithm is prone to tracking failures. After a failure, the KCF model is updated with wrong object samples, which contaminates the model and leaves it unable to continue tracking the object [20]. This paper proposes a validity detection mechanism for tracking results to judge whether tracking has failed and, accordingly, whether to update the model.
(3) Since the missed rate of the SSD model is high in scenes with motion blur and illumination variation, this paper introduces a strategy to reduce it. When a missed detection occurs during object detection, the object position in the current frame is predicted from the object positions in the previous two frames, thereby reducing the missed rate.
(4) To improve the real-time performance of the algorithm in the embedded hardware-software system, the SSD model is quantized to an 8-bit fixed-point model, the algorithm is partitioned through hardware-software co-design, and the tasks are divided between the ARM processor and the FPGA. In this way, the advantages of both are fully exploited, and real-time performance is achieved without loss of accuracy. The rest of this paper is organized as follows. Section 2 introduces the real-time detection-tracking algorithm for embedded systems, including the validity detection mechanism of tracking results and the strategy to reduce the missed rate. In Section 3, the method for hardware-software co-design of embedded systems based on SoC is described. Section 4 compares the results with those of other representative methods. Finally, the conclusion is given in Section 5.

Proposed method
In this section, we introduce our algorithm for real-time object detection and tracking in embedded systems. To achieve adequate real-time performance, the algorithm obtains the object box information from the SSD model only on key frames. The object box information includes the location and size of the object. The KCF tracking algorithm separates target from background through a discriminative framework to achieve tracking. The KCF model is trained on samples obtained by cyclically shifting the region inside the object box. To avoid contamination of the KCF model caused by tracking failure, this paper introduces a validity detection mechanism for tracking results to evaluate whether tracking has failed, and then either updates the model or retrains it from the SSD detection results. A strategy is also introduced to reduce the missed rate of the SSD model in motion-blurred and illumination-variation scenes.
The overall flow of the algorithm is shown in Fig. 1. The first step is to run either the SSD object detection algorithm or the KCF object tracking algorithm on frame i (image I_i):

S(I_i) = SSD(I_i), if i mod N = 0 or fr = 1;  S(I_i) = KCF(I_i), otherwise    (1)

where S(I_i) denotes the detection or tracking method applied to I_i, SSD(I_i) is the SSD object detection method, and KCF(I_i) is the KCF tracking method. N is a constant, set to 20 in this paper, and fr is a flag set to 1 when the validity detection mechanism of tracking results reports a failure. The output of SSD object detection or KCF object tracking can then be expressed as:

S(I_i) = L_s(l_i, c_i, r_i, n_i), if S = SSD;  S(I_i) = L_K(r_i), if S = KCF and F(r_i, r_{i−1}) = 1    (2)

where L_s(l_i, c_i, r_i, n_i) is the result of SSD object detection: l_i is the object category, c_i is the confidence of the category, r_i is the object box of the detection result, and n_i is the number of detected objects. F(r_i, r_{i−1}) is the result of the validity detection mechanism of tracking results; its calculation is given in Section 2.1. L_K(r_i) is the result of KCF tracking.
If n_i is 0 (that is, no object is detected), the strategy of reducing missed detection is used to lower the missed rate; its calculation is given in Section 2.2. Otherwise, based on the image block contained in r_i, the samples needed for KCF training are obtained by cyclic shifting, so as to train the initial object position model for subsequent tracking.
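The detect-or-track dispatch described above (SSD on every N-th frame or after a reported tracking failure, KCF otherwise) can be sketched in a few lines of Python; the function and variable names are ours, not from the paper:

```python
N = 20  # key-frame interval used in the paper

def select_method(i, fr):
    """Return which method handles frame i.

    SSD detection runs on key frames (every N-th frame) and whenever the
    validity check has flagged a tracking failure (fr == 1); KCF tracking
    runs on all other frames.
    """
    if i % N == 0 or fr == 1:
        return "SSD"
    return "KCF"
```

In the full system, a "SSD" result would also retrain the KCF model from the detected box, as described above.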

Validity detection mechanism of tracking results
The KCF tracking algorithm in this paper is updated by linear interpolation, as shown in Equation 3 [7]:

α_t = (1 − η) α_{t−1} + η α_t',  x_t = (1 − η) x_{t−1} + η x_t'    (3)

where η is an interpolation coefficient characterizing the learning ability of the model for new image frames, α_t is the classifier model, x_t is the object appearance template, and α_t' and x_t' are the classifier and template trained on the current frame alone. It can be seen that the KCF algorithm does not consider whether the prediction result of the current frame is suitable for updating the model. When the tracking result deviates from the real object because of occlusion, motion blur, illumination variation, or other problems, Equation 3 incorporates the wrong object information into the model, which gradually contaminates the tracking model and eventually leads to the failure of subsequent tracking. To avoid inaccurate tracking caused by model contamination, it is necessary to judge in time whether a tracking failure has occurred. During tracking, the difference between the object information of adjacent frames can be expressed by the correlation of the object areas. When tracking succeeds, the object areas of adjacent frames differ little and their correlation is high; when tracking fails, the object areas change greatly and the correlation changes significantly as well. Therefore, this paper uses the correlation of the object areas in adjacent frames to judge whether tracking fails.
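The linear-interpolation update of Equation 3 amounts to a two-line NumPy operation; the default learning rate below is a typical KCF choice, not a value fixed by the paper:

```python
import numpy as np

def update_model(alpha_prev, x_prev, alpha_new, x_new, eta=0.02):
    """Linear-interpolation update of the KCF classifier and template.

    alpha_new and x_new are the classifier and appearance template
    trained on the current frame alone; eta blends them into the
    running model (0.02 is an illustrative value).
    """
    alpha_t = (1 - eta) * alpha_prev + eta * alpha_new
    x_t = (1 - eta) * x_prev + eta * x_new
    return alpha_t, x_t
```

The validity check described next decides whether this update is applied at all, so a failed frame never leaks into the model.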
Considering that the target of this algorithm is an embedded system, to improve real-time performance we use only low-frequency information to calculate the correlation. The information in an image comprises high-frequency and low-frequency components: high-frequency components describe specific details, while low-frequency components describe large-scale structure. Figure 2a is a frame randomly selected from the BlurBody video sequence in the OTB-100 (Object Tracking Benchmark) [5] dataset, and Fig. 2b is a matrix diagram of the discrete cosine transform coefficients of the image. From Fig. 2b, it can be seen that the image energy in natural scenes is concentrated in the low-frequency region. In addition, conditions such as camera shake and fast object motion may cause motion blur, which leaves insufficient high-frequency information. Therefore, high-frequency information is not reliable for judging the correlation of the object areas.
In this paper, a perceptual hash algorithm [21] is used to quickly calculate the hash distance between the object area of the current frame and that of the previous frame. This process uses only low-frequency information. The hash distance is the basis for judging whether tracking fails, as shown in Equation 4:

F(r_i, r_{i−1}) = 1, if pd_{i,i−1} ≤ H_th;  F(r_i, r_{i−1}) = 0, if pd_{i,i−1} > H_th    (4)

where F(r_i, r_{i−1}) indicates whether frame i fails to track, determined from the object areas of frames i and i − 1 in the video sequence, with values 1 and 0 representing tracking success and tracking failure, respectively; pd_{i,i−1} is the hash distance between the object areas of frames i and i − 1; and H_th is the hash distance threshold.
Taking the BlurBody video sequence in the OTB-100 dataset as the test object, the hash distance pd_{i,i−1} between the real object area of each frame and that of the previous frame is calculated, as shown in Fig. 3.
It can be seen from Fig. 3 that the hash distance of the object area is usually less than 15; video frames with pd_{i,i−1} greater than 15 often exhibit obvious blurring and camera shake, and at those frames there are significant deviations in the tracking results of the KCF algorithm. Figure 4 shows the BlurBody video sequence tested by the KCF algorithm. The tracking results of frames 43, 107, and 160 are compared with the real position of the object; the corresponding hash distances pd_{43,42}, pd_{107,106}, and pd_{160,159} are 9, 22, and 15, respectively. The hash distance of frame 43 is lower and its tracking result is more accurate, whereas the hash distance of frame 107 is higher and its tracking result has clearly deviated from the true position of the object. It can be seen that the hash distance pd_{i,i−1} reflects the validity of the tracking result well.
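A minimal sketch of this pHash-style validity check, assuming square grayscale object patches (in the real pipeline the patch would first be resized, e.g. to 32 × 32) and implementing the 2-D DCT directly with NumPy; the threshold of 15 follows the analysis above, while all names are our own:

```python
import numpy as np

def dct2(a):
    """Naive orthonormal 2-D DCT-II, fine for small square patches."""
    n = a.shape[0]
    k = np.arange(n)
    basis = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * k[None, :] + 1) * k[:, None] / (2 * n))
    basis[0, :] = np.sqrt(1.0 / n)  # DC row of the DCT basis
    return basis @ a @ basis.T

def phash(patch, hash_size=8):
    """Binary hash from the top-left (low-frequency) DCT block,
    thresholded at its median."""
    coeffs = dct2(patch.astype(np.float64))
    low = coeffs[:hash_size, :hash_size]
    return (low > np.median(low)).flatten()

def hash_distance(h1, h2):
    """Hamming distance between two binary hashes."""
    return int(np.count_nonzero(h1 != h2))

def tracking_valid(patch_i, patch_prev, h_th=15):
    """F(r_i, r_{i-1}): 1 if the hash distance is within the
    threshold (tracking success), 0 otherwise."""
    return 1 if hash_distance(phash(patch_i), phash(patch_prev)) <= h_th else 0
```

Only the 8 × 8 low-frequency corner of the DCT enters the hash, which is what makes the check cheap and robust to missing high-frequency detail.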

Strategy to reduce missed rate
There is less appearance information in images from motion-blurred and dark scenes [22]. In addition, the SSD model detects each frame separately; it does not consider the correlation of adjacent frames, so its missed rate is high in these scenes. In this paper, image enhancement is applied to recover more detailed image information, and an improved KCF algorithm is then used to track the object so as to reduce the missed rate. We face the situation in which the SSD model cannot detect the object when the image is blurred or dark, as shown in Fig. 5. The essence of image blurring or darkening is that the image has undergone an averaging or integral operation, so the image can be inversely processed to highlight its details. In this paper, the Laplacian differential operator is used to sharpen the image and obtain more detailed information.
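A minimal Laplacian-sharpening sketch with NumPy; the 4-neighbour kernel and the untouched borders are our own illustrative choices, not details fixed by the paper:

```python
import numpy as np

LAPLACIAN = np.array([[0, 1, 0],
                      [1, -4, 1],
                      [0, 1, 0]], dtype=np.float64)

def sharpen(img):
    """Sharpen a grayscale image by subtracting its Laplacian.

    g = f - lap(f) boosts edges because the Laplacian responds strongly
    at intensity discontinuities; borders are left unchanged for brevity.
    """
    f = img.astype(np.float64)
    lap = np.zeros_like(f)
    # Plain 3x3 convolution over the interior of the image.
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            w = LAPLACIAN[dy + 1, dx + 1]
            if w:
                lap[1:-1, 1:-1] += w * f[1 + dy:f.shape[0] - 1 + dy,
                                         1 + dx:f.shape[1] - 1 + dx]
    return np.clip(f - lap, 0, 255).astype(np.uint8)
```

Flat regions pass through unchanged (the kernel sums to zero), while edges and fine detail are amplified before the tracker sees the frame.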
The enhanced image is tracked by an improved KCF algorithm with color features. In the KCF tracking algorithm, the object feature information is described by histograms of oriented gradients [23]. However, in images with blurring or illumination variation, the edge information of the object is often not distinct. This paper therefore extracts object information by additionally using the Lab color feature. The strong expressive ability of the Lab color space allows a better description of the object's appearance.
In the KCF tracking algorithm, when multi-channel features of an image are extracted as input, it is assumed that the feature vector concatenating the C channels is x = [x_1, x_2, · · ·, x_C]. The Gaussian kernel formula of reference [7] can then be rewritten as:

k(x, x') = exp( −(1/σ²)( ||x||² + ||x'||² − 2 C_{xx'} ) )    (5)

where

C_{xx'} = F^{−1}( Σ_{c=1}^{C} x̂_c^* ⊙ x̂'_c )    (6)

is the cross-correlation summed over channels, x̂_c denotes the discrete Fourier transform of channel x_c, ^* is the complex conjugate, ⊙ is element-wise multiplication, and F^{−1} is the inverse Fourier transform. Based on Equation 6, the object is described by the 31-channel histogram of oriented gradients feature. In the strategy to reduce the missed rate, Laplacian sharpening is first applied to the previous two frames. Then, a KCF tracking model with the Lab color feature is trained on the object in the sharpened images. Next, the object position in the current frame is predicted by the trained model. Finally, the tracking result is checked by the method described in Section 2.1. If tracking is successful, the predicted object is given as the result; otherwise, the object in the next frame continues to be detected by the SSD model. The algorithm flow is shown in Fig. 6.
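The channel-summed Gaussian kernel evaluation can be sketched with NumPy FFTs as below; the (C, H, W) feature layout and the σ value are illustrative assumptions:

```python
import numpy as np

def gaussian_kernel_correlation(x, xp, sigma=0.5):
    """Multi-channel Gaussian kernel correlation in the KCF style.

    x, xp: arrays of shape (C, H, W) holding C feature channels.
    The cross-correlation term is summed over channels in the Fourier
    domain, then the Gaussian is applied elementwise over all shifts.
    """
    xf = np.fft.fft2(x, axes=(1, 2))
    xpf = np.fft.fft2(xp, axes=(1, 2))
    # Channel-summed cross-correlation, back in the spatial domain.
    cross = np.real(np.fft.ifft2((np.conj(xf) * xpf).sum(axis=0)))
    # ||x||^2 + ||x'||^2 - 2 * cross, normalised by the number of elements.
    d = np.sum(x ** 2) + np.sum(xp ** 2) - 2.0 * cross
    d = np.maximum(d, 0) / x.size
    return np.exp(-d / (sigma ** 2))
```

The single FFT per channel is what keeps the kernel evaluation cheap enough for real-time tracking, since it computes the kernel for every cyclic shift at once.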
To verify the feasibility of the algorithm in this section, two motion-blurred video sequences, BlurBody and BlurOwl, and two illumination-varying video sequences, Human9 and Singer2, were selected from the OTB-100 dataset for the following comparison experiments. Experiment 1: All frame sequences were tracked by the unimproved KCF algorithm. Experiment 2: Clear frame sequences, or frame sequences with no significant illumination variation, were tracked by the unimproved KCF algorithm; only the frame sequences with motion blur or illumination variation were tracked by the algorithm described in this section.
The tracking results of Experiment 1 and Experiment 2 were evaluated by two indexes: precision rate (PR) and success rate (SR). The PR and SR of experiments 1 and 2 are shown in Fig. 7. It can be seen from the figure that for the video sequences of motion blur and illumination variation, the improved KCF tracking algorithm exhibits a significantly higher PR and SR than the unimproved algorithm.
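For reference, the two indexes can be computed as follows, assuming (x, y, w, h) boxes and the common OTB thresholds of 20 pixels for PR and 0.5 IoU for SR; this is a sketch, not the evaluation code used in the paper:

```python
def center_error(b1, b2):
    """Euclidean distance between the centres of two (x, y, w, h) boxes."""
    c1 = (b1[0] + b1[2] / 2, b1[1] + b1[3] / 2)
    c2 = (b2[0] + b2[2] / 2, b2[1] + b2[3] / 2)
    return ((c1[0] - c2[0]) ** 2 + (c1[1] - c2[1]) ** 2) ** 0.5

def iou(b1, b2):
    """Intersection-over-union of two (x, y, w, h) boxes."""
    x1, y1 = max(b1[0], b2[0]), max(b1[1], b2[1])
    x2 = min(b1[0] + b1[2], b2[0] + b2[2])
    y2 = min(b1[1] + b1[3], b2[1] + b2[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    union = b1[2] * b1[3] + b2[2] * b2[3] - inter
    return inter / union if union else 0.0

def pr_sr(pred, gt, px_th=20, iou_th=0.5):
    """Precision rate and success rate over paired box lists."""
    n = len(gt)
    pr = sum(center_error(p, g) <= px_th for p, g in zip(pred, gt)) / n
    sr = sum(iou(p, g) >= iou_th for p, g in zip(pred, gt)) / n
    return pr, sr
```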

Hardware and software co-design
The algorithm in this paper is implemented on a Zynq UltraScale+ MPSoC (multiprocessor system-on-chip) [24,25]. The real-time object detection and tracking algorithm based on the SSD model and KCF tracking is implemented through embedded hardware-software cooperation: the algorithm modules with simple operations, many conditional statements, and pointer operations are handled by the processing system (PS); the parts that dominate the speed of the algorithm and have a high degree of parallelism are implemented in the programmable logic (PL), which is composed of an FPGA. The hardware-software partition of the system is shown in Fig. 8. The convolution and pooling layers of the SSD model are implemented in hardware in the PL, while the softmax layer is implemented in the PS because it involves floating-point operations. In addition, other functions are implemented in the PS, such as non-maximum suppression, mapping the results back onto the image, the KCF tracking algorithm, the validity detection mechanism of tracking results, and the strategy to reduce the missed rate. The computing capability and memory bandwidth of embedded systems are limited; in addition, the weight parameters of deep neural network models often contain much redundancy. When implemented on a system on chip, this causes bottlenecks in computing and storage resources as well as considerable power consumption. In this paper, we use DNNDK (Deep Neural Network Development Kit) [26] to compress the 32-bit floating-point weights into 8-bit fixed-point weights and then compile them into an ARM-executable instruction stream. In the embedded system, the PS fetches the instructions from off-chip memory and transmits them to the on-chip memory of the PL through the bus. The instruction scheduler of the PL fetches the instructions to control the operation of the computing engine [27,28], as shown in Fig. 9.
On-chip memory is used to cache input and output data as well as temporary data produced during the computation, so as to achieve high throughput. A deep pipeline design is adopted in the computing engine, which makes full use of the parallel processing capability of the FPGA and thus improves the computing speed. The processing elements of the convolutional computing engine make full use of the fine-grained blocks in the PL, such as multipliers and adders, which makes it possible to complete the computations of the convolutional neural network efficiently.
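The 8-bit fixed-point compression mentioned above can be illustrated with a symmetric power-of-two quantizer; this is a generic sketch in the spirit of DNNDK-style fixed-point quantization, not Xilinx's actual tool flow:

```python
import numpy as np

def quantize_int8(w):
    """Quantize float32 weights to 8-bit fixed point with a
    power-of-two scale (illustrative only).

    Returns the int8 tensor and the number of fractional bits.
    """
    max_abs = np.max(np.abs(w))
    if max_abs == 0:
        return np.zeros(w.shape, dtype=np.int8), 0
    # Choose fractional bits so that max_abs still fits into [-128, 127].
    frac_bits = int(np.floor(np.log2(127.0 / max_abs)))
    scaled = (np.round(w * (1 << frac_bits)) if frac_bits >= 0
              else np.round(w / (1 << -frac_bits)))
    return np.clip(scaled, -128, 127).astype(np.int8), frac_bits

def dequantize(q, frac_bits):
    """Recover approximate float weights from the fixed-point tensor."""
    scale = float(1 << frac_bits) if frac_bits >= 0 else 1.0 / (1 << -frac_bits)
    return q.astype(np.float32) / scale
```

Because the scale is a power of two, the PL can apply it with plain bit shifts instead of multipliers, which is what makes fixed-point inference cheap in hardware.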
In this paper, the tasks of reading video frames, running the detection and tracking algorithms, and displaying video frames are implemented by the ARM processor of the PS. In single-threaded mode, the image is read first, then the target is detected and tracked, and finally the image is displayed; this keeps the utilization of the CPU and FPGA relatively low and hurts real-time performance. Reading a video frame is mainly a file I/O operation, the detection and tracking algorithm mainly uses CPU and FPGA computation, and displaying the video is mainly Ethernet transmission (the frames are shown on a computer via the X11 protocol). Since these three steps occupy different system resources, executing them simultaneously with multithreading makes full use of CPU resources and thus significantly improves efficiency, as shown in Fig. 10.
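The three-stage pipeline can be sketched with worker threads and bounded queues; the stage callables here are stand-ins for the real file I/O, PS/PL compute, and X11 display:

```python
import queue
import threading

def run_pipeline(frames, process, display, depth=4):
    """Three-stage pipeline: read -> detect/track -> display.

    Each stage runs in its own thread and hands frames onward through
    bounded queues, so file I/O, CPU/FPGA compute, and network display
    overlap instead of running back-to-back.
    """
    q_in, q_out = queue.Queue(depth), queue.Queue(depth)

    def reader():
        for f in frames:           # stands in for file I/O
            q_in.put(f)
        q_in.put(None)             # end-of-stream marker

    def worker():
        while (f := q_in.get()) is not None:
            q_out.put(process(f))  # detection/tracking stage
        q_out.put(None)

    def shower():
        while (r := q_out.get()) is not None:
            display(r)             # stands in for X11 transmission

    threads = [threading.Thread(target=t) for t in (reader, worker, shower)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```

The bounded queue depth keeps the fast stages from racing ahead of the slow one, so memory use stays constant regardless of video length.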

Overall performance comparison of algorithms
In this section, we compare the object tracking performance of the proposed algorithm with four other algorithms noted for good real-time performance and robustness: KCF, Struck, CSK, and Staple. To evaluate the performance of the proposed algorithm, the tracking algorithms, the SSD object detection model, and the algorithm itself are tested on the OTB-100 dataset. To ensure the objectivity and fairness of the experimental results, the SSD model in this algorithm is trained on the open datasets VOC2007 [29], VOC2012 [30], and Wider Face [31]. Since these three training datasets differ greatly from the OTB-100 test dataset in object categories, 14 video sequences are selected for testing: BlurBody, CarScale, Couple, Girl, Gym, Human2, Human6, Human7, Human8, Human9, Liquor, Man, Trellis, and Woman. We chose the one-pass evaluation method [5] to test the KCF, Struck, CSK, and Staple tracking algorithms, manually setting the object position in the first frame. The SSD object detection model and the algorithm in this paper are not given the initial object position; it is detected automatically by the model. Table 1 shows the comparison results of these algorithms, measured by PR, SR, and tracking speed (in frames per second). The precision plot and success plot are drawn to demonstrate the feasibility and effectiveness of the proposed algorithm, as shown in Fig. 11.
It can be seen from Table 1 and Fig. 11 that the SR of the proposed algorithm is slightly higher than the SSD object detection model, and the PR is nearly 10% higher. This is attributed to the fact that the algorithm remedies the shortcomings of the SSD model by using the strategy of reducing missed detection in the scenes involving motion blur and illumination variation, which makes it possible to track frames that the SSD model cannot detect. Compared with Struck, CSK, and KCF, the PR and SR of the proposed algorithm are much higher. This is due to the validity detection mechanism of tracking results, which can detect tracking failure in time and immediately start the SSD model to re-detect the object.
When testing processing speed, KCF, Struck, CSK, and Staple are tested directly on a PC. The experimental environment is the Windows operating system with an Intel Core i7-7700K 4.2-GHz processor and 16 GB of memory; the experimental software is MATLAB R2018b. The SSD detection model and the proposed algorithm are tested on a ZCU104 board [32]. Taking advantage of the fast running speed of KCF, the proposed algorithm achieves 36.2 frames per second on the ZCU104 board, which meets real-time requirements.

Experiment after adding training dataset
As can be seen from Table 1 and Fig. 11, the PR and SR of the proposed algorithm are not higher than those of Staple. This is because the SSD model is needed to detect the object position when the current frame is the initial frame or when tracking has failed, and the training datasets VOC2007 and VOC2012 of the SSD model differ considerably from the objects in the test video sequences: the "person" images in the test dataset are mostly taken from a road-surveillance perspective, while the "person" images in the training dataset are all taken from a horizontal perspective. In addition, the video sequence of the "bottle" category in the test dataset is very long, whereas the training dataset contains too few such images. This significant difference between the training and test datasets leads to the low accuracy of the SSD model (PR and SR of 74.51% and 75.12%, respectively), which in turn reduces the accuracy of the proposed algorithm. To verify this point, another experiment was conducted: about 200 bottle images and 200 pedestrian images from the road-surveillance perspective were added to the training dataset. It is important to note that none of the images added to the training dataset were included in the test dataset. After retraining the SSD model, the results of the contrast experiment are shown in Table 2. It can be seen that with the increase in the number of training images, the accuracy of the SSD model improved slightly, while the accuracy of the proposed algorithm improved greatly. Now the PR of the algorithm is approximately the same as that of Staple, while its SR is about 10% higher.

Experiments with specific attributes
Object tracking tests were carried out on video sequences with the attributes of occlusion, motion blur, and illumination variation. The accuracy and robustness of the algorithms under these challenging conditions are compared using the SR. The results are shown in Table 3.
It can be seen from Table 3 that, for occlusion, the SRs of the proposed algorithm and the SSD model are higher than those of the other tracking algorithms. When the object is occluded, the other tracking algorithms not only easily lose the object but also often fail to relocate it in order to continue. The algorithm in this paper, however, can relocate the object after a tracking failure through the SSD model and thus continue tracking it.
For motion blur, the SRs of the proposed algorithm and the Staple algorithm are higher than those of the other algorithms. This is attributed to the color features introduced into the strategy of reducing missed detection in this paper. For illumination variation, the SSD model exhibits high accuracy because the brightness and contrast of the training images were varied randomly during training to improve generalization. In dark illumination-variation scenes, the SR of the proposed algorithm is greatly improved compared with the SSD model, again owing to the color features introduced into the strategy of reducing missed detection.

Conclusions
This paper has presented a real-time object detection and tracking algorithm for embedded platforms. The algorithm combines the SSD object detection model of deep convolutional networks with the KCF tracking algorithm. The SSD model was accelerated with a field-programmable gate array, which allows the algorithm to run in real time on embedded platforms. To solve the problem of the traditional KCF algorithm being unable to track robustly over long periods, and specifically the problem of model contamination after it fails to track in complex scenes, a validity detection mechanism for tracking results was proposed. To solve the problem of the SSD model's high missed rate under motion blur and illumination variation, a strategy to reduce missed detections was proposed by introducing color features. The experimental results show that the overall PR of the proposed algorithm reaches 91.19%, its SR reaches 84.79%, and its frame rate reaches 36.2 frames per second. Under the occlusion, motion blur, and illumination variation attributes, the proposed algorithm exhibits higher accuracy and robustness than the other tracking algorithms. Future research should focus on improving tracking performance by tracking with deep features and on improving the hardware implementation.