Learning attention for object tracking with adversarial learning network

Artificial intelligence has been widely studied for intelligent surveillance analysis and security problems in recent years. Although many multimedia security approaches based on deep learning network models have been proposed, their performance still faces challenges that deserve in-depth research. On the one hand, the high computational complexity of current deep learning methods makes them hard to apply in real-time scenarios. On the other hand, it is difficult to obtain the specific features of a video by fine-tuning the network online with only the object state of the first frame, which fails to capture rich appearance variations of the object. To solve these two issues, this paper proposes an effective object tracking method with learning attention that localizes the object and reduces the training time in an adversarial learning framework. First, a prediction network is designed to track the object in video sequences. The object positions of the first ten frames are employed to fine-tune the prediction network, which can fully mine the specific features of an object. Second, the prediction network is integrated into the generative adversarial network framework, which randomly generates masks to capture object appearance variations by adaptively dropping out input features. Third, we present a spatial attention mechanism to improve the tracking performance. The proposed network can identify the mask that maintains the most robust features of the objects over a long temporal span. Extensive experiments on two large-scale benchmarks demonstrate that the proposed algorithm performs favorably against state-of-the-art methods.


Introduction
Nowadays, multimedia content (in particular image and video data) is widely shared over the Internet due to the rapid development of network technologies and the advent of high-end devices. Emerging technologies such as Cloud, Fog, Edge, SDN, Big Data, Internet of Things (IoT), and Deep Learning provide scalability, flexibility, agility, and ubiquity in terms of data acquisition, data storage, data management, and communications. Although a large number of multimedia forensic and security techniques have been proposed to protect multimedia data and devices and to support investigations of multimedia-related criminal cases and security incidents, a number of multimedia security issues have also emerged correspondingly, such as intelligent analysis for surveillance, copy-move forgery in digital images and videos, and biometric spoofing.
In recent years, artificial intelligence has been widely applied to a variety of difficult problems using deep learning network models, such as convolutional neural networks for steganalysis and forensics, and generative adversarial networks for coverless steganography. Surveillance technology for intelligent multimedia hiding and forensics has been a hot topic in the multimedia security community. It is the basis of advanced video processing tasks such as follow-up steganography [1], data hiding [2], JPEG compression [3], and object recognition [4], and it is a necessary prerequisite for implementing high-level intelligent behavior analysis. Object tracking is one of the fundamental tasks in intelligent surveillance technology. The aim is to localize the object throughout a video sequence given its bounding box in the first frame. Although object tracking has made great progress in recent years and some effective algorithms have been proposed to solve challenging problems in specific scenarios, many issues remain, such as occlusion and illumination changes. The topic therefore deserves in-depth investigation and is important in both academia and industry. Figure 1 shows intelligent surveillance applications.
Currently, most trackers based on deep learning network models use large-scale benchmark datasets [6,7] to train the network offline, and the sample of the first frame is used to fine-tune the network parameters online. However, training a deep network model online is challenging because the limited training samples cannot capture the diversity of the object appearance variations, and offline pre-training is very time consuming. In addition, the quality of the image is also an important factor for the training process. Existing tracking methods such as KCF [8], DLT [9], and MCPF [10] obtain the training samples by sampling around the object state. However, the positive samples extracted from each frame are highly overlapped, and they fail to capture rich appearance variations. In this work, we use the first ten frames of a video sequence to fine-tune the parameters of the deep network model. However, manually annotating the object positions in the first ten frames is impractical and time consuming. To solve the abovementioned problems, we exploit a network whose parameters have been pre-trained on large-scale benchmark datasets to predict the object position. We use this prediction network to track the object in the video sequence and automatically obtain the object positions of the first ten frames. Because the generative adversarial network has great advantages in augmenting training samples, positive and negative samples are then extracted from the first ten frames and used to fine-tune the generative adversarial network online. The proposed tracking algorithm can thus capture the changes of the object appearance in the video sequences. During tracking, the generative model occludes the image features through a randomly generated mask to enhance the diversity of positive samples, and the discriminative model employs its discriminative ability to identify the object. These features are robust enough to address the challenges of object tracking. The adversarial learning network can identify masks that retain the most robust features of the object appearance. In terms of tracking accuracy, our approach obtains a relative gain of 5.9% compared to other deep learning-based tracking approaches. The proposed tracking method can be applied to multimedia forensics and security problems. In other words, it is possible to explore object tracking techniques for various real-time multimedia security applications, such as real-time information hiding and digital forensics.
We summarize the main contributions of this work as follows:
(1) We propose an end-to-end prediction network model for object tracking that improves the tracking accuracy and reduces the computational complexity of the training process, and that jointly trains the prediction network and the spatial attention model in a generative adversarial network framework.
(2) An effective spatial attention mechanism is developed, which can adaptively generate the response maps. The feature representations are employed in the online tracking process to alleviate over-fitting. In addition, the positive and negative samples are augmented in the feature space by the generative adversarial network to capture a variety of appearance changes over a temporal span.
(3) We conduct extensive experiments on two popular benchmarks, which demonstrate that the proposed object tracking method with learning attention significantly outperforms state-of-the-art methods.
The rest of the paper is organized as follows. In Section 2, we review related work on existing object tracking algorithms. Section 3 introduces the motivation. In Section 4, we introduce our object tracking approach for intelligent surveillance analysis. The experimental settings are presented in Section 5. In Section 6, we present experimental results and discussion on two tracking benchmarks. Finally, Section 7 concludes this paper.

Related work
Object tracking is one of the fundamental tasks in computer vision and has been extensively studied over the last decade. There are extensive surveys of object tracking in the literature [11-13]. Before the emergence of object tracking algorithms based on deep learning, most tracking algorithms relied on filtering frameworks such as the Kalman filter [14] and the particle filter [15]. However, the disadvantage of these methods is that the number of particles limits the tracking speed. In this section, we review the related advances in three research streams for visual object tracking.
Fig. 1 Object tracking technique for intelligent surveillance analysis, tracking results of the proposed tracker (red), and the VITAL tracker [5] (black) on the OTB2015 benchmark

Correlation filter-based tracking
In recent years, correlation filters (CF) have been widely used in numerous applications such as object detection and recognition. They transfer the correlation operation into the Fourier domain, where it becomes element-wise multiplication. Correlation filters [8, 16-19] have attracted considerable attention due to their computational efficiency and competitive performance. Correlation filter trackers regress all the circularly shifted versions of the input features to a Gaussian function. We arrange correlation filter tracking algorithms in a hierarchy and classify them into three categories: basic correlation filter trackers, regularized correlation filter trackers, and combinations of deep learning and correlation filter trackers.
Some basic correlation filter trackers have been developed to boost tracking performance by using scale estimation. Bolme et al. [16] propose the minimum output sum of squared error (MOSSE) tracker for object tracking on grayscale images, which encodes object appearance through an adaptive correlation filter by optimizing the output sum of squared error. MOSSE can run at several hundred frames per second. In 2012, Henriques et al. [20] propose the CSK algorithm as an improvement of MOSSE, which solves the problem of a small number of training samples in the object tracking process through the circulant matrix and further improves the tracking accuracy by using the kernel technique. However, these two algorithms adopt simple grayscale features, which are easily disturbed by the external environment, resulting in inaccurate tracking results. They are further improved by the kernelized correlation filters (KCF) [8] with HOG features in the Fourier domain. KCF performs well on the OTB50 [4] benchmark in terms of tracking speed and accuracy. Scale change is also a common problem in object tracking. In [21], the DSST tracker learns adaptive multiscale correlation filters using HOG features to handle the scale and translation changes of the objects.
Regularized correlation filter trackers can improve the detection range by using different filter sizes and patch sizes. The SRDCF method [22] reduces the boundary effect problem by applying spatial weights to the CF coefficients. However, its optimization is complicated and the tracking speed is slow. To address this weakness, the CSR-DCF method [23] adds feature channel and spatial constraints on top of SRDCF and uses the augmented Lagrangian scheme to facilitate a fast FFT solution, which greatly improves the tracking accuracy and speed. The C-COT method [24] proposes a strategy for training continuous convolution filters, which facilitates the integration of multi-scale CNN features and achieves sub-pixel tracking accuracy. However, the tracking model framework still adopts the SRDCF method and the computational complexity is high. The ECO method [25] further proposes a factorized convolution scheme to reduce the computational complexity of the tracking model. UPDT [26] proposes a novel adaptive fusion approach that leverages the complementary properties of deep and shallow features to improve both robustness and accuracy. In [27], STRCF introduces temporal regularization to SRDCF with a single sample. This formulation not only serves as a reasonable approximation to SRDCF with multiple training samples, but also provides a more robust appearance model than SRDCF in the case of large appearance variations. Although these approaches achieve satisfactory performance in some constrained scenarios, they have an inherent limitation: they resort to low-level hand-crafted features, which are vulnerable in dynamic situations including illumination changes, occlusion, deformations, and background clutter.
Inspired by the success of deep learning models in object detection, recognition, and classification tasks, researchers have started to focus on combining deep learning with correlation filters. In HCF [28] and HDT [29], deep features are used to represent the object instead of hand-crafted features. It is worth noting that CFNet [30] and DCFNet [31] achieve end-to-end representation learning.

Deep learning-based tracking
Feature representations are important for object tracking. Experiments show that manually designed features (e.g., Haar-like features, histograms, HOG features) are not necessarily suitable for all objects. Deep learning-based tracking adaptively selects object features instead of relying on hand-crafted features. The popular trend is to design deep network structures and pre-train them in order to learn object-specific features.
The main problem of deep learning in tracking is the lack of training data [32,33]. Object tracking provides only the object information of the first frame as training data, and it is difficult to train a deep network model with so little data. In [9], Wang proposes the DLT tracker, which pre-trains a deep network model offline and fine-tunes the tracking model online, and which greatly alleviates the problem of insufficient training samples in tracking. SO-DLT [34] continues the strategy of DLT and addresses its shortcomings by using a CNN as the network model for deep feature extraction and classification. In FCNT [35], the authors analyze the performance of CNN features pre-trained on ImageNet [36] and design the subsequent network structure based on the analysis results. In [37], the authors utilize a two-layer convolutional neural network to learn hierarchical features from auxiliary video sequences, which takes complex object motion and object appearance changes into account.
In recent years, Siamese networks [38,39] have received more and more attention due to their two-stream identical structure. In the offline training phase, a matching score function is trained through the structure of the Siamese network. During tracking, the matching score function is then used to determine the similarity between the current object candidate state and the object template of the first frame, which improves the tracking efficiency.
The generative adversarial network (GAN) has also been widely used in many fields in recent years, such as object detection [40] and intelligent recommendation systems [41]. GAN was first proposed by Ian Goodfellow in 2014 and was originally used to generate realistic-looking images [42]. The main idea behind GAN is to have two competing neural network models. One, called the generator, takes noise as input and generates samples. The other, called the discriminator, receives samples from both the generator and the training data and tries to distinguish between the two sources. The generator and the discriminator are trained simultaneously by competing with each other.
Realizing that there is a huge difference between image classification and tracking, TADA [43] identifies the importance of each convolutional filter and selects the object-aware features based on activations for the object representation. The object-aware features are then integrated into a Siamese matching network for visual tracking. Unlike existing approaches that use extensive annotated data for supervised learning, UDT [44] is trained on large-scale unlabeled videos in an unsupervised manner. The UDT tracker achieves the baseline accuracy of fully supervised trackers, which require complete and accurate labels during training.
In this work, we apply adversarial learning to augment training samples in the feature space to capture appearance variations in the temporal domain. In addition, we can exploit robust features over a long temporal span instead of the discriminative features in individual frames.

Attention mechanisms
Attention mechanisms were first introduced in the neuroscience area [45]. They have since spread to image classification, multi-object tracking, etc. DAVT [46] employs a discriminative spatial attention scheme for visual tracking. CSR-DCF [47] utilizes color histograms to construct a foreground spatial map in the correlation filter framework, which learns the attention via an end-to-end deep network model. ACFN [48] chooses a subset from the associated correlation filters as an attention mechanism for visual tracking.

Motivation
The generative adversarial network (GAN) [42] has been widely used in object detection and semantic segmentation. A GAN mainly consists of a generative model and a discriminative model. The generative model takes noise as input, and the discriminative model takes samples from either the generative model or the training data and outputs the classification probability. This learning process can be written as:
$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim P_{data}(x)}[\log D(x)] + \mathbb{E}_{z \sim P_{noise}(z)}[\log(1 - D(G(z)))] \quad (1)$$
where $G$ denotes the generative model, $D$ represents the discriminative model, $\mathbb{E}$ is the mean operation, and $x$ and $z$ are two vectors drawn from the distributions $P_{data}(x)$ and $P_{noise}(z)$, respectively.
However, the original GAN cannot be applied directly to tracking. In our work, G predicts a weight mask that operates on the extracted features. The mask is randomly set at the beginning and gradually identifies the discriminative features through adversarial learning. The mask is generated by the G network as $G(I)$. We denote the predicted mask as $M$ and the value of its element $(i, j)$ as $M_{ij}$. We define the input image as $I$ and the value of its element $(i, j, k)$ as $I_{ijk}$. The dropout operation is written as follows:
$$I^{o}_{ijk} = I_{ijk} M_{ij} \quad (2)$$
where $I^{o}_{ijk}$ denotes the image $I$ after the dropout operation, which is passed on to the classifier.
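As an illustration of the dropout operation, the following minimal sketch multiplies every channel value at a spatial location by the mask weight at that location. It assumes a mask that is shared across channels; the function name `apply_mask` and the nested-list layout are illustrative choices, not part of the original method description.

```python
def apply_mask(features, mask):
    """Element-wise dropout: out[i][j][k] = features[i][j][k] * mask[i][j].

    features: H x W x K nested list (feature map with K channels).
    mask:     H x W nested list of weights, shared across all channels (assumption).
    """
    if len(features) != len(mask) or any(
        len(f_row) != len(m_row) for f_row, m_row in zip(features, mask)
    ):
        raise ValueError("mask must match the spatial size of the features")
    return [
        [[value * m for value in channels] for channels, m in zip(f_row, m_row)]
        for f_row, m_row in zip(features, mask)
    ]


# Toy 2 x 2 feature map with 3 channels and a mask that suppresses one location.
features = [[[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]],
            [[7.0, 8.0, 9.0], [1.0, 1.0, 1.0]]]
mask = [[1.0, 0.0],
        [0.5, 1.0]]
masked = apply_mask(features, mask)
```

A mask weight of zero removes a location entirely, while intermediate weights only attenuate it.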

Proposed object tracking method
In this section, a novel attention generative adversarial network is first presented to describe the overall training architecture. The proposed generative model takes VGG network features as input and is mainly used to capture the object appearance variations across consecutive video frames. The discriminative model is introduced as a supervisor and provides guidance on the quality of the generated object appearance details. To stabilize the training of the generative adversarial network, we use the mean squared loss to penalize the classification error at each pixel. To improve the tracking performance, a novel spatial attention mechanism is developed to adapt the offline learned deep model to online object tracking. The VGG network is used to sense the tracked object and decode the object features into attention response maps. Finally, online tracking is described, including model updating and scale handling. The object attention maps are obtained by feeding the object appearance information from the first ten frames and the remaining video frames into the generative model. The candidate with the maximum response score is regarded as the tracking result. This process is repeated for each frame until the end of the video sequence. Figure 2 shows the flowchart of the proposed tracker.
In Fig. 2, the generative model of GAN follows the encoder-decoder framework which attempts to encode the input of the object appearance into feature representation and decode it into corresponding outputs.The discriminative model is a standard convolutional neural network.

Network architecture
The network includes two branches. The lower branch of the architecture, which takes the first ten frames of a video sequence as input, is called the prediction network. We use the prediction network to track the object in frames 1 to 10 of a video sequence and obtain the object position in each frame. The features extracted from the predicted object locations are used to fine-tune the fully connected layers of the network in the upper half of the architecture. From frame 11 to the end of the video, the object feature of each frame is taken as the input of this network. The weight masks are applied to adaptively drop out input features. Adversarial learning identifies the weight mask that maintains the most robust features over a long temporal span while removing the discriminative features from individual frames.
It is worth noting that the deep learning model is initialized with the weights of a VGG-16 model pre-trained on the ImageNet benchmark for object classification. Most deep learning-based trackers use such an offline learned network and then utilize the first frame to fine-tune the network parameters during tracking. However, it is difficult to obtain the object-specific features of a video by training the deep network model only with the sample of the first frame. On the other hand, if the deep learning network is fine-tuned using the first n frames of a video, manually labeling the object position is expensive and impractical. Therefore, a prediction network is introduced into the deep learning framework, which can automatically predict the position of the object in the video sequence. The architecture of the prediction network, which has three convolution layers and two fully connected layers, is depicted in the lower part of Fig. 2. We directly use a VGG-M [49] model pre-trained on the ImageNet [36] classification task; the parameters of the convolution layers are fixed, and only the fully connected layers are fine-tuned online. The prediction network is optimized with SGD by minimizing the cross-entropy loss function as follows:
$$L_{CE} = -\frac{1}{N} \sum_{i=1}^{N} \left[ q_i \log f(p_i) + (1 - q_i) \log\left(1 - f(p_i)\right) \right] \quad (3)$$
where $p_i$ and $q_i$ denote the $i$-th training sample and its corresponding label, respectively; $f(p_i)$ is the foreground probability predicted by the network for sample $p_i$; and $N$ is the number of training samples.
The object features are extracted from the convolution layers and fed to the fully connected layers for classification. Figure 3 reports the foreground response maps predicted by using different VGG feature maps. It shows that the shallow-layer feature (Conv4-1) focuses on object details, whereas the deep-layer features (Conv4-2 and Conv4-3) are semantic features.
Finally, the sample with the highest response score in each frame is regarded as the tracking result. The prediction network is interpreted as the generative network in the generative adversarial network framework, and the samples drawn from the predicted location are used to fine-tune the fully connected layers of the generative model.
The discriminative model is employed to make the generative model produce an attention response map that is robust to occlusion, deformation, background clutter, etc. In this work, the attention response map and the corresponding RGB frame of a video sequence are taken as the input of the discriminative model.

Training
In our work, the mean squared error (MSE) is utilized to measure the difference between the estimated attention response map and the ground truth map. Given an image $I$ with $N = W \times H$ pixels, the mean squared loss can be formulated as:
$$L_{MSE} = \frac{1}{N} \sum_{j=1}^{N} \left( \hat{S}_j - S_j \right)^2 \quad (4)$$
where $\hat{S}$ and $S$ denote the attention response map and its corresponding ground truth, respectively. However, the mean squared loss focuses on pixel-level features, so the learned deep network may produce only coarse attention response maps. Training the network with an adversarial loss can therefore further improve the tracking performance. We iteratively train G and D, and the adversarial loss function is written as:
$$L_{adv} = \min_G \max_D \; \mathbb{E}_{(C, M) \sim P(C, M)}[\log D(M \cdot C)] + \mathbb{E}_{C \sim P(C)}[\log(1 - D(G(C) \cdot C))] \quad (5)$$
where $C$ is the input image feature, $G(C)$ is the mask generated by the G network, and $M$ is the actual mask identifying the discriminative feature. The dot denotes the dropout operation on the feature $C$. As described in Eq. (5), G is used to predict a weight mask $G(C)$ that operates on the extracted features. The mask is randomly initialized at the beginning, and each mask represents a specific type of appearance variation. Through the adversarial learning process, G gradually identifies the mask that degrades the performance of the classifier.
In each iteration of the training process, object features of the input frames are extracted from the convolutional layers and fed into the G network to obtain the predicted mask m*. The deep features are then multiplied by the predicted mask m* and sent into the D network. We keep the labels unchanged and train D through supervised learning. D is thus trained to rely on features that remain robust over a long temporal span rather than on features that are discriminative only in individual frames, which mitigates overfitting. G predicts different masks for different input deep features, which enables D to focus on temporally robust features without interference from the discriminative features of a single frame. Given an input image, multiple output features are created from several random masks through the dropout operation. These diversified features are sent to D for classification, and we choose the one with the highest loss. The mask corresponding to the selected feature is effective in decreasing the impact of the discriminative features. We set this mask as M in Eq. (5) and update G accordingly.
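The mask selection step described above can be sketched as follows. This is a minimal illustration under stated assumptions: the features are a single-channel map, `toy_classifier` is a hypothetical stand-in for D, the loss is the cross-entropy of a positive sample, and the number of candidate masks (9) and the drop probability (0.3) are arbitrary values chosen for the example.

```python
import math
import random


def random_mask(height, width, drop_prob, rng):
    """Binary mask: each location is dropped (0) with probability drop_prob."""
    return [[0.0 if rng.random() < drop_prob else 1.0 for _ in range(width)]
            for _ in range(height)]


def mask_features(features, mask):
    """Apply the dropout operation to a single-channel H x W feature map."""
    return [[f * m for f, m in zip(f_row, m_row)]
            for f_row, m_row in zip(features, mask)]


def positive_loss(prob, eps=1e-12):
    """Cross-entropy loss of a positive sample given its predicted probability."""
    return -math.log(max(prob, eps))


def hardest_mask(features, masks, classifier):
    """Return (mask, loss) for the candidate mask that yields the highest loss."""
    losses = [positive_loss(classifier(mask_features(features, m))) for m in masks]
    best = max(range(len(masks)), key=losses.__getitem__)
    return masks[best], losses[best]


def toy_classifier(features):
    """Hypothetical stand-in for D: a sigmoid over the summed activations."""
    total = sum(sum(row) for row in features)
    return 1.0 / (1.0 + math.exp(-total))


rng = random.Random(0)
features = [[0.2, 1.5, 0.1], [0.3, 2.0, 0.2], [0.1, 0.4, 0.1]]
masks = [random_mask(3, 3, drop_prob=0.3, rng=rng) for _ in range(9)]
mask, loss = hardest_mask(features, masks, toy_classifier)
```

In this toy setting, the selected mask is the one that hides the activations the classifier relies on most, which mirrors the role of M in the adversarial update of G.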
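Finally, we combine the MSE loss with the adversarial loss to obtain more stable and faster convergence of the GAN model. The final loss function for the adversarial training can be formulated as:
$$L = L_{adv} + \lambda L_{MSE} \quad (6)$$
where $L_{adv}$ and $L_{MSE}$ are the adversarial loss and the MSE loss, respectively, and $\lambda$ is a trade-off parameter, which we experimentally set to 1/20 in our implementation.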
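A small numerical sketch of the combined objective is given below. It assumes that the trade-off parameter weights the MSE term and that the adversarial loss value is already available as a scalar; the function names are illustrative.

```python
def mse_loss(predicted, target):
    """Mean squared error between two response maps of equal size."""
    flat_p = [v for row in predicted for v in row]
    flat_t = [v for row in target for v in row]
    if len(flat_p) != len(flat_t):
        raise ValueError("response maps must have the same number of pixels")
    return sum((p - t) ** 2 for p, t in zip(flat_p, flat_t)) / len(flat_p)


def total_loss(adversarial, mse, lam=1.0 / 20.0):
    """Combined loss; lam weighting the MSE term is an assumption of this sketch."""
    return adversarial + lam * mse


predicted_map = [[1.0, 0.0], [0.0, 1.0]]
ground_truth = [[1.0, 1.0], [0.0, 0.0]]
mse = mse_loss(predicted_map, ground_truth)   # two of four pixels differ by 1
combined = total_loss(adversarial=0.7, mse=mse)
```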

Spatial attention
Attention from the training samples can be captured to share a common attention. In practical scenarios, some attention maps are obtained by initializing them as a matrix of ones. Such maps are too restrictive because they constrain all samples and the object to share a single deep network structure. Therefore, we propose a spatial attention scheme to model the attention response map, as shown in Fig. 4. The proposed attention mechanism can capture general features and distinguish the object from the background in the video. It can encode the global information of the object and has a low computational load. The output of the attention module is passed through a global pooling layer to produce a channel-wise descriptor. Then three fully connected (FC) layers are added, in which a weight is learned for each channel by a self-gating mechanism based on channel dependence. This is followed by reweighting the original feature maps to generate the output of the attention module. The cosine similarity is utilized to measure the similarity between the current frame features $\varphi_t(p)$ and the features $\varphi_{t-1}(p)$ extracted from frame $t-1$.
If the current frame features are close to the features of the previous frame, the pixel is likely to belong to the foreground object and is assigned a larger weight; otherwise, a smaller weight is assigned to the background pixel.
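The similarity-based weighting can be illustrated with a short sketch. It computes the cosine similarity between the feature vectors of the current and previous frames at each location and uses it as the weight. Clipping negative similarities to zero and the small `eps` term are assumptions made for this example.

```python
import math


def cosine_similarity(a, b, eps=1e-12):
    """Cosine similarity between two feature vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b + eps)


def attention_weights(current, previous):
    """Per-location weights: larger when the two frames' features agree.

    current, previous: H x W x K nested lists of features from frames t and t-1.
    Negative similarities are clipped to 0 (assumption of this sketch).
    """
    return [
        [max(0.0, cosine_similarity(c, p)) for c, p in zip(c_row, p_row)]
        for c_row, p_row in zip(current, previous)
    ]


# One row with two locations: the first is stable, the second changes completely.
current = [[[1.0, 0.0], [0.0, 2.0]]]
previous = [[[2.0, 0.0], [3.0, 0.0]]]
weights = attention_weights(current, previous)
```

Locations whose features stay aligned across frames receive weights close to one, while locations whose features change direction receive weights close to zero.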

Online tracking
In this subsection, we illustrate how our tracker works for visual object tracking. We involve the generative model during training and remove it in the tracking stage.
We first draw samples from the first ten frames of a video sequence to fine-tune the generative model online. Then, we track the object through the video. Given an input frame, we generate multiple candidate proposals and extract their deep features. The deep features of the candidate proposals are fed into the classifier to obtain the probability scores. During the online update, we employ these training samples to jointly train the generative model and the discriminative model. The object tracking result is obtained by finding the maximum response score in the attention map.
Object appearance model updating plays a critical role in object tracking, and most trackers update their appearance model in each frame or at a fixed interval. However, this updating strategy may introduce background information into the object appearance model when the tracking result is inaccurate due to occlusion or illumination variations.
In this paper, we update the object appearance model with the recently obtained tracking results. First, we define a fixed-length sequence L to store the tracking result of each frame. When the length of L reaches the fixed number of elements, we update the object appearance once. In addition, model updating is performed when the condition on the number of iterations or on the maximum value of the response map is satisfied. The result with the maximum response score in L is used to update the object appearance model.
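The buffering strategy can be sketched as below. The class name `ResultBuffer`, the capacity value, and the choice to empty the buffer after each update are assumptions made for illustration; the text only states that an update is triggered once the sequence reaches its fixed length and that the result with the maximum response score is used.

```python
from collections import deque


class ResultBuffer:
    """Fixed-length store of (score, result) pairs for recent frames."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.items = deque(maxlen=capacity)

    def add(self, result, score):
        """Store one frame's result.

        Returns the result with the highest response score once the buffer
        is full, otherwise None. The buffer is emptied after each update
        (assumption of this sketch).
        """
        self.items.append((score, result))
        if len(self.items) < self.capacity:
            return None
        _, best_result = max(self.items, key=lambda item: item[0])
        self.items.clear()
        return best_result


buffer = ResultBuffer(capacity=3)
updates = [buffer.add(name, score)
           for name, score in [("frame1", 0.2), ("frame2", 0.9), ("frame3", 0.5)]]
```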
Therefore, the new object appearance model is written as:
$$T_u = \beta T_f + (1 - \beta) T_p$$
where $\beta$ is a learning parameter that is set empirically, and $T_u$ is the updated object appearance model, which is represented by a linear combination of the initial object template $T_f$ and the last updated object appearance model $T_p$. To alleviate the drift problem during tracking, the initial template is incorporated into the new observation template.
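The linear combination of templates can be illustrated with a short sketch. It assumes that the learning parameter weights the initial template and that its complement weights the last updated model; which of the two terms carries the parameter is an assumption here, and the templates are represented as flat lists of values.

```python
def update_template(initial, previous, beta):
    """Blend the initial template with the last updated appearance model.

    Assumes beta weights the initial template and (1 - beta) the previous model.
    """
    if not 0.0 <= beta <= 1.0:
        raise ValueError("beta must lie in [0, 1]")
    if len(initial) != len(previous):
        raise ValueError("templates must have the same size")
    return [beta * f + (1.0 - beta) * p for f, p in zip(initial, previous)]


initial_template = [1.0, 0.0]
previous_model = [0.0, 1.0]
updated = update_template(initial_template, previous_model, beta=0.25)
```

Keeping a fixed share of the initial template in every update limits how far the model can drift from the annotated object.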
To handle scale changes, we follow the approach in [21] and use a patch pyramid with scale factors. The proposed object tracking algorithm is summarized in Algorithm 1.

Experiments
In this section, we introduce the implementation details of the proposed tracking algorithm. We then compare our tracker with state-of-the-art trackers on two benchmarks for performance evaluation. Our experiments are performed with the MatConvNet toolbox [50] on a workstation with an E5 2.4 GHz CPU and a Quadro K2200 GPU.
In this work, the first three convolution layers of the VGG-M model are utilized as the feature extraction network. The network is pre-trained on large-scale benchmark datasets. During adversarial learning, both G and D are learned by the SGD scheme. The learning rates for training G and D are set to $10^{-3}$ and $10^{-4}$, respectively. During tracking, we draw 256 candidate samples around the object location of each frame for classification. The masks are set randomly, and the resolution of each mask is the same as that of the input features. We update the generative adversarial network with 10 iterations every 10 frames or whenever the response score of the tracked result falls below a predefined threshold. The backbone architecture is shown in Table 1.

Benchmarks
We conduct the experiments on two standard benchmarks: OTB-2013 [4] and OTB-2015 [6]. The video sequences are provided with bounding box annotations. These datasets cover various challenging aspects of the visual tracking task, such as fast motion, background clutter, deformation, occlusion, illumination variations, and low resolution. The performance of all the trackers can be well tested on these two benchmarks.

Evaluation metrics
We follow the standard evaluation metrics [6] of the two benchmarks. For the OTB-2013 and OTB-2015 benchmarks, we use the one-pass evaluation (OPE) with precision and area-under-the-curve (AUC) success rate criteria. The precision metric measures the rate of frames whose predicted locations lie within a certain threshold distance of the ground truth. The threshold distance is set to 20 pixels for all the trackers. The success rate criterion measures the overlap ratio between the predicted bounding box and the ground truth bounding box.
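The two criteria can be sketched in a few lines. This is an illustrative implementation rather than the benchmark toolkit itself: boxes are assumed to be `(x, y, width, height)` tuples, and the AUC is approximated by averaging the success rate over 21 overlap thresholds evenly spaced in [0, 1], which is an assumption of this sketch.

```python
import math


def center_error(box_a, box_b):
    """Euclidean distance between the centers of two (x, y, w, h) boxes."""
    ax, ay = box_a[0] + box_a[2] / 2.0, box_a[1] + box_a[3] / 2.0
    bx, by = box_b[0] + box_b[2] / 2.0, box_b[1] + box_b[3] / 2.0
    return math.hypot(ax - bx, ay - by)


def overlap_ratio(box_a, box_b):
    """Intersection over union of two (x, y, w, h) boxes."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2 = min(box_a[0] + box_a[2], box_b[0] + box_b[2])
    y2 = min(box_a[1] + box_a[3], box_b[1] + box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = box_a[2] * box_a[3] + box_b[2] * box_b[3] - inter
    return inter / union if union > 0 else 0.0


def precision(predicted, ground_truth, threshold=20.0):
    """Fraction of frames whose center error is within the pixel threshold."""
    hits = sum(center_error(p, g) <= threshold
               for p, g in zip(predicted, ground_truth))
    return hits / len(ground_truth)


def success_auc(predicted, ground_truth, steps=21):
    """Mean success rate over overlap thresholds evenly spaced in [0, 1]."""
    overlaps = [overlap_ratio(p, g) for p, g in zip(predicted, ground_truth)]
    thresholds = [i / (steps - 1) for i in range(steps)]
    rates = [sum(o > t for o in overlaps) / len(overlaps) for t in thresholds]
    return sum(rates) / len(rates)


ground_truth = [(0, 0, 10, 10), (0, 0, 10, 10)]
predicted = [(0, 0, 10, 10), (100, 100, 10, 10)]   # second frame is a miss
prec = precision(predicted, ground_truth)
auc = success_auc(predicted, ground_truth)
```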

Results and discussion

Quantitative evaluation
We perform quantitative evaluation on two benchmark datasets. The experimental results of the proposed tracking algorithm are reported as follows.
We evaluate all the trackers on 50 video sequences using the one-pass evaluation with distance precision and overlap success metrics. Figure 5 shows the tracking results of all compared trackers. We only show the top ten trackers for presentation clarity. The numbers listed in the legend indicate the AUC overlap success rate and the precision score at 20 pixels. Overall, the results illustrate that our tracking method outperforms the state-of-the-art trackers in both evaluation measures. The OTB-2013 dataset has 11 attributes (e.g., background clutter, occlusion, deformation, scale variation, illumination variation) that describe the different challenges in tracking. These attributes are useful for analyzing the performance of trackers in different aspects. Figure 6 shows the results of different tracking algorithms on eight main challenging attributes. It demonstrates that our tracker can effectively handle these challenges and achieve leading performance. The proposed method performs favorably against the state-of-the-art trackers when evaluated on the eight challenging factors.
Fig. 9 The maps generated by the proposed spatial attention scheme (middle row) and the saliency detection algorithm Deep Saliency [63,64] (bottom row)

OTB-2015 benchmark
For a more detailed analysis, we also compare our tracker with the state-of-the-art trackers on the OTB-2015 benchmark. Figure 7 shows that the proposed tracker performs well. Although the ECO tracker achieves good performance, the proposed tracker uses the samples of the first ten frames to train the deep network model, and it therefore leads in both precision and success rate.

Qualitative evaluation
In Fig. 8, we qualitatively report the results of four other state-of-the-art trackers (CNN-SVM, C-COT, MDNet, and ECO) and the proposed tracker on 12 challenging video sequences.
In most of the video sequences, CNN-SVM is unable to locate the object position due to the limited performance of the SVM classifier. MDNet improves on CNN-SVM through an end-to-end CNN network formulation, and it performs well on deformation (Trans), low resolution (Skiing), and fast motion (Diving). However, it does not perform well in handling out-of-plane rotation (Ironman) and occlusion (Human4). The correlation filter-based trackers such as C-COT and ECO use deep features for visual object tracking, but they fail to exploit more sophisticated deeper architectures. They perform well in handling occlusion (Human4, Box) and deformation (Trans). However, the tracked object drifts when it undergoes heavy occlusions (Bird1).
Overall, our tracker captures the appearance variations of the object by fine-tuning the network, and the adversarial learning scheme enhances the discriminative ability of the classifier. Therefore, the proposed tracker performs well in estimating both the scale and position of the object on these challenging video sequences, and it performs favorably against the state-of-the-art.
Moreover, the accurate prediction of the spatial attention response map is a key factor in our tracker. Thanks to the use of the mean squared loss and the adversarial loss, the predicted attention response maps are robust in most challenging cases. Even if the attention is not precise, the tracking results are not largely affected in our experiments. Figure 9 shows the robustness of the proposed tracking method. The red bounding boxes are our results.

Fig. 10 Failure cases of our method on "Jump" and "Coupon". Green and red bounding boxes denote the ground truth and the results from our tracker, respectively
In addition, a major concern of the proposed tracker is its computational efficiency. Our tracker largely reduces the computational burden in learning and tracking. The parameters of the deep network model can also be pre-computed in the training phase. Its tracking error grows proportionally as the number of index increases. During tracking, the parameters of the deep network model are updated at fixed intervals. This greatly accelerates the tracking process. The runtime of our tracker against other trackers is shown in Table 2.
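The fixed-interval update can be illustrated with a small sketch. The interval length is not given in this section, so the value used below is only a placeholder, and `track_step` and `update_step` are hypothetical callbacks standing in for the localization and fine-tuning routines.

```python
def run_with_fixed_interval(num_frames, interval, track_step, update_step):
    """Track every frame, refreshing the model only every `interval` frames.

    Returns the list of frame indices at which an update was triggered.
    """
    if interval <= 0:
        raise ValueError("interval must be a positive integer")
    updated_at = []
    for t in range(1, num_frames):   # frame 0 is reserved for initialization
        track_step(t)                # localize the object in frame t
        if t % interval == 0:        # hypothetical fixed schedule
            update_step(t)           # fine-tune parameters on recent samples
            updated_at.append(t)
    return updated_at


if __name__ == "__main__":
    # Placeholder callbacks; a real tracker would run the network here.
    updates = run_with_fixed_interval(
        num_frames=50,
        interval=10,
        track_step=lambda t: None,
        update_step=lambda t: None,
    )
    print(updates)
```

Because the update step is the expensive part of online tracking, running it once per interval rather than once per frame reduces the number of fine-tuning passes by roughly a factor equal to the interval.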

Feature comparison
We compare the effects of features from different layers of the deep learning network model on the OTB-2015 benchmark, as shown in Table 3. The combination of the features extracted from the conv3 and conv4 layers achieves the best results, which verifies the rationality of the feature selection strategy of the proposed tracking algorithm. The best results are in bold.

Failure cases
Although the proposed tracking algorithm achieves satisfactory performance, a few failure cases occur when the object suffers from long-term occlusions. When the tracked object reappears and becomes very small, the proposed tracking method fails to follow the object due to the limited pixels and appearance variations, which results in poor tracking performance. A feature selection strategy using the features from conv2 is able to track the object, because the features of the conv2 layer have higher resolution than the features from deeper layers. For the Biker sequence, the object suddenly moves violently beyond the search area of the proposed tracking method. Many single object trackers are not able to cope with this challenging problem in this sequence.
Our tracker fails to track objects when they have very similar appearance (e.g., the result on the "Coupon" sequence) or experience dramatic topology changes (e.g., the result on the "Jump" sequence), as shown in Fig. 10. Another limitation of our tracker is that its running speed (14.8 fps on the OTB-100 dataset) is far below real-time requirements, so it cannot be easily deployed in other products, such as mobile phones and many embedded devices. We leave these issues for further studies.

Conclusion
In this paper, we propose an effective object tracking method with learning attention. We design a prediction network which is pre-trained off-line and used to predict the object positions of a video sequence. The object positions of the first ten frames are employed to fine-tune the prediction network to obtain rich appearance variations. The positive and negative samples are also augmented. Furthermore, these object locations are captured to mine the domain-specific information through fine-tuning the generative adversarial network. We adaptively use dropout to mine the discriminative features which are originally diminished during the training process. The adaptive dropout is achieved via adversarial learning to find discriminative features according to different inputs. In addition, we present a spatial attention mechanism to improve the tracking performance. Compared with the state-of-the-art, the proposed tracking method achieves outstanding performance on two large public tracking benchmarks. Further research directions include applying the spatial attention to multi-modal applications.

4.1 Problem formulation
The existing trackers based on deep learning rely on off-line training and online fine-tuning for surveillance analysis, and only use the object information of the first frame to fine-tune the learned deep network parameters online. However, it is difficult to capture previously unseen object features from one or a few examples. In addition, the positive samples in each frame are highly spatially overlapped, and they fail to capture rich appearance variations. On the other hand, the amount of positive and negative samples needed to train a deep learning network model is nearly impossible to obtain in the real world.

Fig. 2 The architecture of the proposed object tracking algorithm

Fig. 3 Foreground response maps predicted using different VGG feature maps. (a) Input image. (b) Using Conv4-1 feature only. (c) Using the concatenation of Conv4-2 and Conv4-3 features with the proposed adversarial learning with attention

Fig. 5 Precision and success plots on the OTB-2013 dataset using the one-pass evaluation. The numbers in the legend indicate the average precision scores at 20 pixels and the AUC scores

Fig. 7 Precision and success rate plots on the OTB-2015 benchmark using the one-pass evaluation. The numbers in the legend indicate the average distance precision scores at 20 pixels and the AUC success scores

Table 1 Backbone architecture. Details of each building block are reported in square brackets

Table 2 Tracking performance and frames per second (FPS) of the state-of-the-art approaches on the OTB-100 benchmark. "-" denotes an invalid state; bold font indicates the best results

Table 3 Results of different features