On the basis of the original CT algorithm, an occlusion discrimination mechanism and a target re-location strategy are introduced in this paper to improve the accuracy and adaptability of the algorithm.
Occlusion discrimination
In the process of target tracking, the tracked target is often occluded by other targets or by the background. To handle this problem, it is necessary to determine whether the target is occluded. The target state is therefore divided into a normal state and an occlusion state in this paper, and different processing mechanisms are applied according to the state of the target.
In the CT algorithm, the confidence H(v) of every candidate target region is obtained by Eq. (4), and the region with the maximum confidence max(H(v)) is selected as the target region. From Eq. (4), we know that the value of H(v) is determined by p(vi | y = 1)/p(vi | y = 0): the closer the features of a candidate region are to those of the target, the more likely the candidate region is the target.
However, H(v) can only indicate which candidate region in the same frame is closest to the target region; it cannot detect deformation or occlusion of the target. We calculated the maximum confidence max(H(v)) for a number of frames in the car video; the resulting line diagram of max(H(v)) is shown in Fig. 1.
From Fig. 1, we can see that the max(H(v)) value drops sharply at the 127th, 136th, and 220th frames. Figure 2 shows the tracked target at these three frames.
It can be seen from Fig. 2 that all three frames exhibit different degrees of occlusion. When the tracked target is not occluded, max(H(v)) grows gradually and tends to be stable. Therefore, a large deformation or an occlusion of the target can be detected from the change of max(H(v)). Let C(k) be the discrimination degree for occlusion or large deformation at the kth frame, defined as:
$$ C(k)=\frac{H^{k-1}-{H}^k}{H^{k-1}} $$
(7)
where \( H^{k-1} \) represents the maximum confidence max(H(v)) at frame k − 1 and \( H^k \) represents the maximum confidence max(H(v)) at frame k.
Given a threshold ξ, if C(k) > ξ, the target is considered completely occluded or disappeared; if C(k) < ξ, the target is in the normal tracking state or a partial occlusion state.
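As an illustration, a minimal sketch of the discrimination test of Eq. (7) is given below; the function name and the concrete threshold value are assumptions for illustration only.

```python
# Minimal sketch of the occlusion test in Eq. (7); the threshold value
# xi = 0.2 is an illustrative assumption, not a value from this paper.
def is_occluded(h_prev, h_curr, xi=0.2):
    """h_prev, h_curr -- max(H(v)) at frames k-1 and k."""
    c_k = (h_prev - h_curr) / h_prev  # discrimination degree C(k), Eq. (7)
    return c_k > xi                   # True: completely occluded or disappeared
```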
Occlusion tracking
In general, occlusion starts from the edge of the target. During occlusion, the features of the occluded region are lost, but the features of the unoccluded region still retain the original feature information. As long as the unoccluded sub-regions can be tracked, the target can be accurately located through them. Therefore, a compressive tracking algorithm based on sub-regions is proposed in this paper. The main idea of the algorithm is as follows: the target region is divided into several sub-regions; candidate regions are created for each sub-region; the compressed features of each candidate region are extracted; H(v) is calculated by the Bayesian classifier; and the candidate region with the maximum confidence of each sub-region is selected to calculate the discrimination degree C(k). If C(k) < ξ, the target region is located according to the positions of the unoccluded sub-regions. If C(k) > ξ, the target is completely occluded or has disappeared.
The division of sub-regions has a great impact on the tracking effect. Sub-regions may be overlapping or non-overlapping, and their size can be fixed or adaptive. If a sub-region is too large, it is too sensitive to occlusion; if it is too small, it may lose target information, and too many sub-regions make the computational cost of the algorithm very high. In this paper, the target is divided into four sub-regions.
As can be seen from Fig. 3, if sub-region 1 is occluded, the upper left corner of the target area is occluded; if sub-regions 1 and 2 are occluded, the upper half of the target area is occluded, and so on. With this sub-region method, we can not only effectively determine whether the target is occluded but also locate the target according to the sub-regions that have not been occluded.
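A minimal sketch of this fixed four-quadrant partition is given below, assuming the target region is a rectangle (x, y, w, h) and following our reading of the numbering in Fig. 3 (1 = upper left, 2 = upper right).

```python
# Sketch of the four-sub-region division; the numbering follows our
# reading of Fig. 3 (1 = upper left, 2 = upper right, ...).
def split_into_quadrants(x, y, w, h):
    hw, hh = w // 2, h // 2
    return [
        (x,      y,      hw,     hh),      # sub-region 1: upper left
        (x + hw, y,      w - hw, hh),      # sub-region 2: upper right
        (x,      y + hh, hw,     h - hh),  # sub-region 3: lower left
        (x + hw, y + hh, w - hw, h - hh),  # sub-region 4: lower right
    ]
```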
Suppose \( D_i^{k-1} \), i ∈ {1, 2, 3, 4}, denotes the ith sub-region in frame k − 1. The candidate regions of sub-region i in the current frame can be defined as:
$$ {T}_i(m)=\left\{z\mid \left\Vert I(z)-{D}_i^{k-1}\right\Vert <\gamma \right\} $$
(8)
where γ denotes the neighborhood radius and Ti(m) denotes the mth candidate region of sub-region i. The compressed features V(k) of all candidate regions are computed according to Eq. (3), H(v) is computed according to Eq. (4), and the candidate region with the maximum confidence is selected to calculate C(k). If C(k) > ξ, the target is completely occluded or has disappeared, and it is located by the feature matching method; otherwise, the target is located according to the positions of the unoccluded sub-regions (those with Ci(k) < ξ).
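A sketch of the candidate sampling of Eq. (8) is shown below, assuming each sub-region is represented by the pixel position of its top-left corner; the names and bounds handling are illustrative.

```python
import numpy as np

# Sketch of Eq. (8): all positions within the neighborhood radius gamma
# of the sub-region's position in the previous frame become candidates.
def candidate_positions(prev_xy, gamma, img_shape):
    px, py = prev_xy
    g = int(np.ceil(gamma))
    candidates = []
    for dy in range(-g, g + 1):
        for dx in range(-g, g + 1):
            if dx * dx + dy * dy < gamma * gamma:  # ||z - D_i^{k-1}|| < gamma
                x, y = px + dx, py + dy
                if 0 <= x < img_shape[1] and 0 <= y < img_shape[0]:
                    candidates.append((x, y))
    return candidates
```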
After the target is successfully located, the classifier parameters need to be updated according to the CT algorithm. In order to reduce the impact of occlusion, only the unoccluded sub-regions update the probability distributions of the features. Therefore, after the target position is determined, the main steps of updating the classifier parameters are as follows:
Step 1: Calculate the confidence Hi(v) of each sub-region of the target region;
Step 2: Calculate the discrimination degree Ci(k) of each sub-region;
Step 3: Judge by the Ci(k) value whether each sub-region is occluded. If Ci(k) > ξ, the classifier of that sub-region is not updated; otherwise, it is updated as follows:
1. The positive and negative samples of the sub-regions with Ci(k) < ξ are created. The specific method is defined as:
$$ {\displaystyle \begin{array}{l}{T}_i=\left\{z\mid \left\Vert I(z)-{D}_i^k\right\Vert <\gamma \right\}\\ {}{B}_i=\left\{z\mid \alpha <\left\Vert I(z)-{D}_i^k\right\Vert <\beta \right\}\end{array}},\kern1em \gamma <\alpha <\beta $$
(9)
2. According to Eq. (3), the feature Vi can be obtained. The parameters \( {u}_i^1 \), \( {\delta}_i^1 \), \( {u}_i^0 \), and \( {\delta}_i^0 \) are calculated as follows:
$$ {\displaystyle \begin{array}{l}{u}_i^1=\frac{1}{n}\sum \limits_{m=0}^{n-1}{V}_i(m)\\ {}{\delta}_i^1=\sqrt{\frac{1}{n}\sum \limits_{m=0}^{n-1}{\left({V}_i(m)-{u}_i^1\right)}^2}\\ {}{u}_i^0=\frac{1}{n}\sum \limits_{m=0}^{n-1}{BV}_i(m)\\ {}{\delta}_i^0=\sqrt{\frac{1}{n}\sum \limits_{m=0}^{n-1}{\left({BV}_i(m)-{u}_i^0\right)}^2}\end{array}} $$
(10)
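A sketch of the parameter computation in Eq. (10) is given below, assuming the positive and negative compressed features of sub-region i are already available as 1-D arrays; the names are illustrative.

```python
import numpy as np

# Sketch of Eq. (10): Gaussian parameters of the positive features V_i
# and negative features BV_i of one unoccluded sub-region.
def gaussian_params(v_pos, v_neg):
    u1 = v_pos.mean()
    d1 = np.sqrt(((v_pos - u1) ** 2).mean())
    u0 = v_neg.mean()
    d0 = np.sqrt(((v_neg - u0) ** 2).mean())
    return u1, d1, u0, d0
```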
The model of sub-region partition and classifier updating is shown in Fig. 4. As can be seen from Fig. 4, this method can not only judge the target state effectively but is also robust to partial occlusion, local gray-level changes, and deformations.
Target re-location
The method above handles the tracking problem of partial occlusion well, but when the target is completely occluded or has disappeared, the algorithm cannot track and locate the target accurately. Therefore, the target detection mechanism of the TLD algorithm is introduced in this paper, and a target re-location method based on improved ORB feature matching is proposed. Firstly, FAST corner points are detected; then, false corner points are removed; finally, the BRIEF descriptor is used to describe the corner points.
(a) FAST corner
In [17], a pixel is taken as a corner point if, in its circular neighborhood, there are enough pixels whose gray levels differ sufficiently from that of the pixel, which is defined as:
$$ N=\sum \limits_{\forall x\in \mathrm{circle}(p)}\left|I(x)-I(p)\right|>{\varepsilon}_d $$
(11)
where I(p) is the gray value of the candidate pixel, I(x) is the gray value of any pixel on the circular boundary centered at p, and εd is the threshold. Different thresholds εd can be used to control the number of corner points; the relationship between the threshold value and the number of corners is shown in Fig. 5. In order to remove false corners quickly, εd is set to 12 in this paper.
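For illustration, FAST corners with this threshold can be obtained with OpenCV as sketched below; note that OpenCV's detector uses the segment test, which differs slightly from the summed-difference criterion of Eq. (11).

```python
import cv2

# Sketch of FAST detection with threshold 12; "frame.png" is a
# hypothetical input. Non-maximum suppression is disabled because the
# pseudo corners are removed separately in steps (b) and (c).
gray = cv2.imread("frame.png", cv2.IMREAD_GRAYSCALE)
fast = cv2.FastFeatureDetector_create(threshold=12, nonmaxSuppression=False)
keypoints = fast.detect(gray, None)
print(len(keypoints), "corner candidates")
```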
(b) Removing edge points
FAST corner points include many edge points and local non-maximum points. The curvature of an edge point is large in the direction perpendicular to the edge and small along the edge, while the principal curvature of a real corner point is large in every direction [18]. Therefore, edge points can be removed from the FAST corner points by means of the principal curvature. In this paper, the principal curvature is computed from the 2 × 2 Hessian matrix H, which is defined as:
$$ H\left(x,y\right)=\left[\begin{array}{cc}{D}_{xx}& {D}_{xy}\\ {}{D}_{xy}& {D}_{yy}\end{array}\right] $$
(12)
The four elements of H can be obtained by adjacent differences. By the properties of the Hessian matrix, the principal curvatures of H are proportional to its eigenvalues. Since the principal curvature of a real corner point is large in every direction [18], a large difference between the two eigenvalues indicates that the candidate corner point lies on an edge; otherwise, the candidate is a real corner point. Here, we do not compute the two eigenvalues directly but use their ratio. Let α be the larger eigenvalue of H and β the smaller one.
$$ {\displaystyle \begin{array}{l}\mathrm{Tr}(H)={D}_{xx}+{D}_{yy}=\alpha +\beta \\ {}\mathrm{Det}(H)={D}_{xx}{D}_{yy}-{\left({D}_{xy}\right)}^2=\alpha \beta \\ {}\mathrm{ratio}=\frac{\mathrm{Tr}{(H)}^2}{\mathrm{Det}(H)}=\frac{{\left(\alpha +\beta \right)}^2}{\alpha \beta}\end{array}} $$
(13)
In Lowe's paper [19], let α = γβ; then ratio = (γ + 1)²/γ, with γ = 10. If ratio is less than (10 + 1)²/10 = 12.1, the feature point is preserved; otherwise, it is discarded.
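A minimal sketch of this edge test, with the Hessian entries approximated by finite differences on a float grayscale image, might look as follows; the discretization is our assumption.

```python
# Sketch of Eqs. (12)-(13): reject a candidate corner whose principal
# curvature ratio indicates an edge. img must be a 2-D float array and
# (x, y) must not lie on the image border.
def is_edge_point(img, x, y, gamma=10.0):
    d_xx = img[y, x + 1] + img[y, x - 1] - 2.0 * img[y, x]
    d_yy = img[y + 1, x] + img[y - 1, x] - 2.0 * img[y, x]
    d_xy = (img[y + 1, x + 1] + img[y - 1, x - 1]
            - img[y + 1, x - 1] - img[y - 1, x + 1]) / 4.0
    tr = d_xx + d_yy
    det = d_xx * d_yy - d_xy * d_xy
    if det <= 0:                  # eigenvalues of opposite sign: not a corner
        return True
    return tr * tr / det >= (gamma + 1.0) ** 2 / gamma  # Eq. (13)
```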
(c) Removing the pseudo corner points
Edge points can be removed by step (b), but some local non-maximum points still remain. They can be further judged by computing the Laplace value of the pixels in a small neighborhood around the candidate corner point: if the candidate corner point is a Laplace extreme point, the corner point is preserved; otherwise, it is discarded [20]. The Laplace extremum is calculated as follows:
$$ L(x)=\sum \limits_{\forall \left(p,q\right)}\left(I(p)+I(q)-I(x)\right) $$
(14)
(d) The direction of the FAST corner point
FAST corner points have no orientation. In [21], the orientation of FAST feature points is obtained by the gray centroid method. The specific steps are as follows:
Firstly, the moments of the neighborhood of the feature point are computed. The (i + j)th-order moment is defined as:
$$ {M}_{ij}=\sum \limits_x\sum \limits_y{x}^i{y}^jI\left(x,y\right) $$
(15)
The centroid is then obtained from these moments:
$$ C=\left({C}_x,{C}_y\right)=\left(\frac{M_{10}}{M_{00}},\frac{M_{01}}{M_{00}}\right) $$
(16)
The orientation of the centroid is then simply:
$$ \theta =\arctan \left(\frac{C_y}{C_x}\right) $$
(17)
where \( {M}_{00}=\sum \limits_x\sum \limits_yI\left(x,y\right) \), \( {M}_{10}=\sum \limits_x\sum \limits_y xI\left(x,y\right) \), and \( {M}_{01}=\sum \limits_x\sum \limits_y yI\left(x,y\right) \).
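A sketch of Eqs. (15)-(17) is given below, computed on a square patch centered at the keypoint; the square patch is a simplifying assumption (a circular patch can be used instead).

```python
import numpy as np

# Sketch of the gray centroid orientation, Eqs. (15)-(17). patch is a
# 2-D float array centered on the keypoint; arctan2 is used so that the
# angle is resolved to the correct quadrant.
def centroid_orientation(patch):
    ys, xs = np.mgrid[0:patch.shape[0], 0:patch.shape[1]]
    xs = xs - patch.shape[1] // 2   # place the keypoint at the origin
    ys = ys - patch.shape[0] // 2
    m00 = patch.sum()               # M_00
    m10 = (xs * patch).sum()        # M_10
    m01 = (ys * patch).sum()        # M_01
    return np.arctan2(m01 / m00, m10 / m00)  # theta = arctan(C_y / C_x)
```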
(e) BRIEF descriptor
The BRIEF descriptor is a bit-string description of an image patch constructed from a set of binary intensity tests. A binary test τ is defined as:
$$ \tau \left(p;x,y\right)=\left\{\begin{array}{c}1\kern1em \mathrm{if}\kern0.5em p(x)<p(y)\\ {}0\kern1em \mathrm{otherwise}\kern4em \end{array}\right. $$
(18)
where p(⋅) denotes the gray-value function used in the binary comparison (defined below) and (x, y) is a sample pair. Each test sample is a randomly selected 5 × 5 window within a 31 × 31 pixel patch. The feature is defined as a vector of n binary tests:
$$ {f}_{n_d}(p)=\sum \limits_{1\le i\le n}{2}^{i-1}\tau \left(p;{x}_i,{y}_i\right) $$
(19)
As a result, the length of the descriptor is n, where n = 128, 256, 512, …; in this paper, n = 256. The function p(x) is computed as the sum of the gray values in the 5 × 5 window around pixel x. To improve the computational speed, the integral image method is used to compute the sum of the gray values of an image patch.
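The following sketch illustrates Eqs. (18)-(19) with n = 256 and 5 × 5 smoothing windows evaluated through an integral image; the random sampling pattern and the border assumption are ours.

```python
import numpy as np

# Sketch of the smoothed binary tests, Eqs. (18)-(19). The sample pairs
# are drawn at random here purely for illustration; the keypoint (cx, cy)
# is assumed to be at least 16 px away from the image border.
rng = np.random.default_rng(0)
pairs = rng.integers(-13, 14, size=(256, 4))  # (x1, y1, x2, y2) offsets

def brief_descriptor(gray, cx, cy):
    # integral image: ii[y, x] = sum of gray[0:y, 0:x]
    ii = np.pad(gray, ((1, 0), (1, 0))).cumsum(0).cumsum(1)

    def box5(x, y):  # p(x): gray-value sum of the 5x5 window around (x, y)
        return (ii[y + 3, x + 3] - ii[y - 2, x + 3]
                - ii[y + 3, x - 2] + ii[y - 2, x - 2])

    bits = 0
    for i, (x1, y1, x2, y2) in enumerate(pairs):
        if box5(cx + x1, cy + y1) < box5(cx + x2, cy + y2):  # tau = 1
            bits |= 1 << i                                   # Eq. (19)
    return bits
```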
The BRIEF descriptor is robust to illumination changes, but it is sensitive to noise and rotation. To address the noise sensitivity, the image is preprocessed by a Gaussian filter in the ORB algorithm. To achieve rotation invariance, for the feature set of n binary tests at locations (xi, yi), a 2 × n matrix is defined:
$$ S=\left(\begin{array}{c}{x}_1\dots \dots {x}_n\\ {}{y}_1\dots \dots {y}_n\end{array}\right) $$
(20)
The rotation matrix Rθ is generated from the FAST principal direction angle θ, which is defined as:
$$ {R}_{\theta }=\left[\begin{array}{cc}\cos \theta & \sin \theta \\ {}-\sin \theta & \cos \theta \end{array}\right] $$
(21)
Therefore, the feature set with orientation is defined as:
$$ {S}_{\theta }={R}_{\theta }S $$
(22)
Now, the new feature descriptor becomes:
$$ {g}_n\left(p,\theta \right)={f}_{n_d}(p)\mid \left({x}_i,{y}_i\right)\in {S}_{\theta } $$
(23)
The BRIEF descriptor with orientation has larger variance and a mean near 0.5, which makes its components less correlated and more discriminative [22].
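A sketch of the steering step in Eqs. (20)-(22), assuming the test locations are stored as a 2 × n array, is:

```python
import numpy as np

# Sketch of S_theta = R_theta * S, Eqs. (20)-(22): rotate the 2 x n
# matrix of test locations by the FAST orientation theta.
def steer_pattern(S, theta):
    R = np.array([[np.cos(theta),  np.sin(theta)],
                  [-np.sin(theta), np.cos(theta)]])  # Eq. (21)
    return R @ S
```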
Finally, 256 high-variance, uncorrelated binary strings are selected as the final ORB descriptor by a greedy algorithm; the specific steps are as follows (a sketch is given after the list):
1. The first element of the set T, which is composed of all binary strings, is put into the result set R.
2. Each element of the set T is compared in turn with the elements of the set R. If the correlation between them is greater than a given threshold, the binary string is discarded; otherwise, it is added to the result set R.
3. Repeat step 2 until there are 256 elements in the result set R. If the number of elements in R is less than 256, the correlation threshold is increased and the greedy algorithm is run again, until there are 256 binary strings in R.
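A sketch of this greedy selection is shown below, assuming each candidate test is represented by its vector of 0/1 outcomes over a set of training patches; the initial threshold and its increment are illustrative.

```python
import numpy as np

# Sketch of the greedy selection. T is a list of 1-D 0/1 arrays, one per
# candidate binary test; degenerate (constant) tests are assumed to have
# been removed beforehand so that the correlation is well defined.
def greedy_select(T, n_out=256, thresh=0.2, step=0.05):
    while True:
        R = [T[0]]                                   # step 1
        for t in T[1:]:                              # step 2
            if all(abs(np.corrcoef(t, r)[0, 1]) <= thresh for r in R):
                R.append(t)
            if len(R) == n_out:
                return R
        thresh += step                               # step 3: relax and retry
```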
Figure 6 shows the matching results based on ORB features. As can be seen from Fig. 6, most of the matching points fall on the correct target, and only a few are wrong.
To improve the robustness of the algorithm, the reference point of the target location is determined by the median values of the matched feature points in the horizontal and vertical directions, which is defined as:
$$ {\displaystyle \begin{array}{l}{x}^{\prime }=\mathrm{mid}\left({x}_i\right)\\ {}{y}^{\prime }=\mathrm{mid}\left({y}_i\right)\end{array}} $$
(24)
where mid(xi) represents the median value of the horizontal coordinates and mid(yi) represents the median value of the vertical coordinates.
The reference point is taken as the center of the tracking box, and the coordinates of its upper left corner are calculated according to Eq. (25), so that the position of the tracked target can be determined.
$$ {\displaystyle \begin{array}{l}\mathrm{rect}.x={x}^{\prime }-\frac{1}{2}\mathrm{width}\\ {}\mathrm{rect}.y={y}^{\prime }-\frac{1}{2}\mathrm{height}\end{array}} $$
(25)
where rect.x and rect.y denote the coordinates of the upper left corner of the tracking box, and width and height are the tracking box's width and height, respectively. Target detection based on ORB features is shown in Fig. 7.
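A minimal sketch of Eqs. (24)-(25), assuming the matched points are given as an (N, 2) array of (x, y) coordinates, is:

```python
import numpy as np

# Sketch of Eqs. (24)-(25): the reference point is the per-axis median
# of the matched points, and the tracking box is centered on it.
def relocate(matched_pts, width, height):
    x_ref = np.median(matched_pts[:, 0])  # mid(x_i)
    y_ref = np.median(matched_pts[:, 1])  # mid(y_i)
    rect_x = x_ref - width / 2.0          # Eq. (25)
    rect_y = y_ref - height / 2.0
    return rect_x, rect_y
```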
Partial matching results based on ORB features on the car video are shown in Fig. 8.
It can be seen from Fig. 8 that the method based on ORB feature matching can track the target accurately when there are enough matching points. Combined with the CT algorithm, the method in this paper can effectively improve the tracking accuracy.
When the target is re-located by ORB features, the accuracy of the position depends on the number of successfully matched feature points. When the target reappears, if its shape has changed greatly, the number of successfully matched feature points may be small, which makes accurate positioning difficult. To address this problem, a matching template library is constructed in this paper. After the target is lost, all the templates in the template library are used to search for the target. If the number of matching points between a template and the frame image is greater than a given threshold, the template is considered a matching template, and the matching template with the largest number of matching points is called the best matching template. The location of the target is then determined according to the matching points of the best matching template. To keep the matching fast, the template library must not contain too many templates; therefore, a new method for updating similar templates is proposed in this paper. Firstly, the tracking state is determined by the tracking module. If the target is in the normal tracking state, the ORB features of the target area are extracted and matched in turn with the templates in the template library. If the number of matching points is less than a given threshold t, the area is added to the template library, as shown in Fig. 9.
As the template library is updated, a set of representative templates is accumulated, which improves the robustness of the matching. To improve the positioning accuracy, a best-template matching strategy is proposed in this paper. The specific process is as follows (a sketch is given after the list):
1. After the target is lost, all the templates in the template library are used to search for the target.
2. If the number of matching points between a template in the library and the current image is greater than the given threshold, that template is a matching template;
3. If there are multiple matching templates, the template with the largest number of matching points is taken as the best matching template;
4. The location of the target is determined according to the best matching template, and the re-location is completed.
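For illustration, this search can be sketched with OpenCV's ORB implementation and a brute-force Hamming matcher; min_matches stands in for the paper's matching threshold.

```python
import cv2

# Sketch of the best-template search (steps 1-4). templates is a list of
# grayscale template images; min_matches is an assumed threshold.
def best_template_match(frame_gray, templates, min_matches=20):
    orb = cv2.ORB_create()
    bf = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    kp_f, des_f = orb.detectAndCompute(frame_gray, None)
    best, best_matches = None, []
    for tpl in templates:
        kp_t, des_t = orb.detectAndCompute(tpl, None)
        if des_t is None or des_f is None:
            continue
        matches = bf.match(des_t, des_f)
        if len(matches) > min_matches and len(matches) > len(best_matches):
            best, best_matches = tpl, matches        # steps 2-3
    # the matched points in the frame locate the target (step 4)
    pts = [kp_f[m.trainIdx].pt for m in best_matches]
    return best, pts
```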
The template matching process is shown in Fig. 10:
Algorithm flow
The flow chart of the improved CT algorithm based on target division and feature point matching is shown in Fig. 11.