Weakly supervised spatial–temporal attention network driven by tracking and consistency loss for action detection

This study proposes a novel network model for video action tube detection. The model is based on a location-interactive, weakly supervised spatial–temporal attention mechanism driven by multiple loss functions. Because annotating every target location in video frames is costly and time consuming, we propose a cross-domain weakly supervised learning method with a spatial–temporal attention mechanism for action tube detection. In the source domain, we train a newly designed multi-loss spatial–temporal attention–convolution network on the source data set, which has both object location and classification annotations. In the target domain, we introduce an internal tracking loss and a neighbor-consistency loss, and we train the network from the pre-trained model on the target data set, which has only inaccurate action temporal positions. Although the method uses no location supervision in the target domain, it outperforms typical weakly supervised methods and even shows results comparable with some recent fully supervised methods. We also visualize the activation maps, which reveal the intrinsic reason behind the higher performance of the proposed method.

The main contributions of this paper are twofold. (1) In the source domain, we construct a new multi-loss spatial–temporal attention–convolution network and train it on the source data set, which has target location and classification annotations. (2) In the target domain, we introduce the internal tracking loss and the neighbor-consistency loss. The pre-trained model is then trained on the target data set, which has only inaccurate action temporal positions. Although this is a location-unsupervised, classification-supervised method, its mAP exceeds that of typical weakly supervised methods and is even comparable with some recent fully supervised methods.
Since the proposed method uses a pre-trained model and an amount of weakly labeled data in the target domain, it is a typical weakly supervised learning method. The basic idea of the method is as follows. First, we introduce a novel location weakly supervised learning network model with a spatial-temporal attention mechanism for action tube detection. The framework structure clearly differs from that of the state-of-the-art methods.
Second, we introduce an internal tracking loss and a neighbor-consistency loss for weakly supervised learning on video sequences, for which only the action classification temporal label is needed. To our knowledge, this is the first study to apply tracking and consistency losses in a location weakly supervised setting with a spatial-temporal attention mechanism for action tube detection.
Third, we also visualize the activation maps, which reveal the intrinsic reason behind the higher performance of the proposed method.

Related works
Video processing methods have progressed through high-efficiency coding [1], object detection and tracking [2], image retrieval [3], image enhancement [4,5], and image compositing [6] in many applications. Many supervised methods exist in the action detection field. Popular detectors such as YOLO [7] and SSD [8] are representative multi-scale end-to-end models, mainly used for static images. Considering the importance of temporal information, Tran et al. [9] first proposed the C3D model, which carried the local connectivity and weight sharing of 2D convolution over to video sequence processing. Although the 3D-CNN-based U-Net [10] has a relatively large number of parameters, its video processing performance was greatly improved compared with R-CNN [11,12]. I3D [13] uses a two-stream fusion structure in which 2D and 3D convolutions are fused so that models trained on ImageNet and other static image data can be migrated to 3D video stream processing. Sun [14] decomposed a 3D convolution into a 2D convolution in the spatial direction and a 1D convolution in the time direction. This notably improves the computational efficiency; however, massive iterative training on video data is still required. To reduce the computational complexity, P3D [15] combines three different module structures. In ResNet(2+1)D [16], new convolution kernels were explored and the C3D model was optimized in terms of parameters and running speed. Video-based 3D multi-scale detection [17,18] has been widely used in video target recognition, and many open-source projects have validated its performance. Nevertheless, these algorithms are significantly affected by background factors owing to the lack of target focus. With the development of deep learning, multi-scale features and attention mechanisms have been applied to video in various applications.
Popular attention mechanisms [19][20][21] are particularly important for streaming data processing in machine learning; examples include the task-adaptive attention method [22] used in image captioning and the self-attention and multi-feature fusion method [23] used in face recognition. Inspired by human vision, the Institute for Human-Machine Communication at Munich University, Germany, proposed a fast, real-time video action detection method (You Only Watch Once, YOWO) [24], which was reported to achieve the highest efficiency at the time. It introduced a target attention mechanism based on the video keyframe into the 2D-3D fusion model through a single-stage network. This constitutes the fundamental advantage of previous research results.
There are also some weakly supervised or unsupervised studies in this field. UntrimmedNets [25] introduced a classification module for predicting the classification score for each snippet, and a selection module to select relevant video segments. In addition, STPN [26] added sparsity loss and class-specific proposals. AutoLoc [27] introduced the outer-inner contrastive loss to effectively predict temporal boundaries. W-TALC [28] and Islam and Radke [29] incorporated distance metric learning strategies, and proposed a novel average aggregation module and latent discriminative probabilities to reduce the difference between the most salient regions and the others. TSM [30] modeled each action instance as a multi-phase process to effectively characterize action instances. WSGN [31] assigned a weight to each frame prediction based on both local and global statistics. DGAM [32] used a conditional variational auto-encoder to separate the attention, action, and non-action frames. CleanNet [33] introduced an action proposal evaluator that provides pseudo-supervision by leveraging the temporal contrast in snippets. 3C-Net [34] adopted three loss terms to ensure separability, enhance discriminability, and delineate adjacent action sequences.
Moreover, BaS-Net [35] and Nguyen et al. [36] modeled background activity by introducing an auxiliary background class. However, none of these approaches explicitly resolves the issue of modeling an action instance in its entirety. Nanan [37] proposed a spatial-channel filter, and Liu et al. [38] proposed a multi-branch network in which each branch predicts distinctive action parts. HAM-Net [39] hides the most discriminative parts of a video instead of random parts. In contrast, our method uses a novel location-interactive weakly supervised learning network with a spatial-temporal attention mechanism for action tube detection. An internal interactive location tracker and a consistency loss drive the weakly supervised learning on video sequences, for which only the action classification temporal label is needed.

Framework overview
The motivation of this study is to design an attention network that needs fewer object bounding-box annotations while still achieving results comparable with some recent fully supervised methods. The classification attention maps may be disturbed by moving background objects: some inputs are predicted well while others are predicted poorly, and we cannot decide in advance which video clip or keyframe to choose as input. Therefore, to make detection more robust, the network needs to filter this noise by tracking the objects to check whether they exist continuously and keep a high confidence value in the previous frames. The overall framework is shown in Fig. 2.

Fig. 2 Overall framework. In the source domain, we trained the network on the first data set, which has both location and classification annotations. In the target domain, we trained the network from the pre-trained model on the second data set, which has only action classification temporal annotations. To ensure the continuity of the target in the video sequence, the tracking-regularization loss is calculated between the location given by a tracker and the network's predicted location. The neighbor-consistency loss pulls the features of neighboring frames in the video closer together.

1) Overall framework overview. According to Fig. 2, in the source domain, we trained the newly designed multi-loss spatial-temporal attention-convolution network on the first data set, which has both location and classification annotations. In the target domain, we introduced an internal tracking loss and a neighbor-consistency loss for weakly supervised learning on video sequences, for which only action classification temporal labels are needed, and trained the network from the pre-trained model on the second data set, which has only classification annotations. To ensure the continuity of the target in the video sequence, the tracking-regularization loss is calculated between the location given by a tracker and the network's predicted location. Intuitively, the cosine distance between the features of neighboring frames in the same video clip is small, so we introduce the neighbor-consistency loss in the model.
2) Baseline framework overview. As shown in Fig. 3, there are four branches. Branch no. 2 adopts a spatial attention mechanism for the object location in video frames, and branch no. 3 uses a channel attention mechanism to fuse the previous two network branches to obtain the total loss. In branch no. 4, the internal loss function focuses on the loss between the network's predicted location and the ground truth based on the video sequence.

Baseline network definition
Referring to Figs. 2 and 3, suppose that a video sequence is the input to the 3D-CNN network, and the original video is sampled in time as

$$X = \bigcup_{t=t_0}^{t_{N-1}} x(t) \qquad (1)$$

where $X$ denotes the video clip, $x(t)$ is a frame of the video, $\bigcup$ means that $X$ consists of the set of frames, and the range of sampling time is $[t_0, t_{N-1}]$.
The clips are fed into a 3D convolutional network such as 3D-ResNeXt-50 and 3D-ResNeXt-34 [40] to produce the output feature tensors. ResNeXt is used here to verify our model; other 3D convolutional backbones can also be used. Network branch no. 2 focuses on the object location in the video sequence and squeezes the tensor $S_{50}$ to the tensor $F_{01}$. The tensor $F_{00}$ has N frame feature groups; each group has $D'$ features, and each feature has size $H' \times W'$. Network branch no. 1 is concerned with the classification of action tubes; there we further squeeze the tensor $S_{101}$ to the tensor $F_{11}$, where $C' \times H' \times W'$ is the shape of the tensor $F_{11}$. Since $F_{01}$ and $F_{11}$ have the same feature map dimension $H' \times W'$, they can be concatenated.
Network branch no. 2 focuses only on the object location in the video sequence and is driven by the internal loss function, marked as IL. Branch no. 1 mainly handles video object behavior classification and is driven by the global loss function, marked as GL. The network parameters are therefore learned under both losses, where $\theta_j$ denotes the trainable parameter of the network model. We choose different loss functions according to the network branches, and $\alpha(t)$ is the learning rate function, used together with a hyper-parameter.
As shown in Fig. 4, we use the Gram matrix in the neural network to solve the fusion problem. In the simplified implementation of CFAM, $C = C' + C''$; FA is the result of simply concatenating the features of network branches no. 1 and no. 2; FB is the mapping feature after a 2-layer convolution; the Gram matrix transform is applied between FB and FC; and FD is the mapping feature of FC after a 2-layer convolution. $C^*$ is the final number of features, and $\beta$ is a parameter that can be learned by the network.
The reshape function transforms the dimension of the value to the same size as FB. In branch no. 3, we obtain the feature FD just before the Softmax function affected by the global loss function.
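The Gram-matrix step between FB and FC can be illustrated with a small, dependency-free sketch. It assumes the channel-attention form used by YOWO-style fusion [24]: FB is flattened to a C × (H·W) matrix F, the Gram matrix G = F Fᵀ is normalised row-wise with a softmax, and FC = β · reshape(softmax(G) F) + FB. The function names, the residual form, and the nested-list layout are assumptions made for illustration; this is not the authors' implementation.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def gram_channel_attention(fb, beta):
    """Assumed Gram-matrix channel attention.

    fb   : feature map as nested lists with shape [C][H][W]
    beta : scalar weight (learnable in a real network, fixed here)
    Returns a feature map with the same shape as fb.
    """
    c, h, w = len(fb), len(fb[0]), len(fb[0][0])
    # Flatten every channel to a vector of length H*W.
    flat = [[fb[k][y][x] for y in range(h) for x in range(w)] for k in range(c)]
    # Gram matrix: inner products between all pairs of channels.
    gram = [[sum(p * q for p, q in zip(flat[a], flat[b])) for b in range(c)]
            for a in range(c)]
    # Row-wise softmax turns the Gram matrix into channel attention weights.
    attn = [softmax(row) for row in gram]
    # Each output channel is an attention-weighted mix of all channels.
    mixed = [[sum(attn[a][b] * flat[b][n] for b in range(c)) for n in range(h * w)]
             for a in range(c)]
    # Reshape back to [C][H][W] and add the residual connection.
    return [[[beta * mixed[k][y * w + x] + fb[k][y][x] for x in range(w)]
             for y in range(h)] for k in range(c)]
```

With `beta = 0` the block reduces to the identity, so a network could start from plain concatenated features and learn how much cross-channel mixing to add.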

Internal and global loss function
In the proposed network model (see Fig. 2), there are two loss functions, namely, the external global loss function and the internal loss function, which can act on the network parameters using the gradient transfer mechanism. We next introduce the internal loss function which focuses on the loss between the predicted key-frame location and the tracking locations; then, the network can be trained under the location weakly supervised attention mechanism, for which only the action classification temporal label is needed.
• Loss function Part A: The action classification loss function is marked as $Loss_{cls}$.
• Loss function Part B: The location loss function focuses on the location loss of objects in the video clip. The single-frame loss is marked as $Loss_{loc}$, and the clip loss is marked as $Loss_{clip}$.
• Loss function Part C: The tracker-predicted location loss function focuses on the tracking location loss with the previous video sequence, marked as $L_{TRB}$.
• Loss function Part D: The neighbor-consistency loss function focuses on the consistency of neighboring features in the video sequence, marked as $L_{NCB}$.

1) Supervised cross-entropy loss
Suppose the image is split by an $S \times S$ grid. We use a cross-entropy function to compute the action classification loss, marked as $Loss_{cls}$. Here, $I^{obj}_{ij}$ indicates whether the jth prior box of the ith grid cell is responsible for the object with the class cls: $I^{obj}_{ij} = 1$ if the object center exists in the grid cell; otherwise, $I^{obj}_{ij} = 0$. $S^2$ is the total number of grid cells, and $B$ denotes the total number of candidate prior boxes. $P_i^j$ and $\hat{P}_i^j$ represent the ground truth and predicted class probability in the grid cell, respectively.
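As an illustration of how the indicator masks the cross-entropy, the following sketch assumes the common form Loss_cls = −Σ_i Σ_j I_ij^obj Σ_c P(c) log P̂(c). This exact form, the function name, and the small constant `eps` are assumptions, not taken from the paper.

```python
import math


def classification_loss(obj_mask, p_true, p_pred, eps=1e-12):
    """Masked cross-entropy over grid cells and prior boxes (assumed form).

    obj_mask[i][j] : 1 if prior box j of grid cell i is responsible for an object, else 0
    p_true[i][j]   : list of ground-truth class probabilities
    p_pred[i][j]   : list of predicted class probabilities
    eps            : small constant that guards against log(0)
    """
    loss = 0.0
    for i, row in enumerate(obj_mask):
        for j, responsible in enumerate(row):
            if not responsible:
                continue  # boxes without an object do not contribute
            for truth, pred in zip(p_true[i][j], p_pred[i][j]):
                loss -= truth * math.log(pred + eps)
    return loss
```

Only boxes whose indicator is 1 add to the loss, so cells without an object centre never push the class probabilities in any direction.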
2) Clip supervised-location loss
Suppose a single-frame loss function $Loss_{loc}$ is defined, where $I^{obj}_{ij}$ indicates whether the jth prior box of the ith grid cell is responsible for the object: $I^{obj}_{ij} = 1$ if the object center exists in the grid cell; otherwise, $I^{obj}_{ij} = 0$. $S^2$ is the total number of grid cells, $B$ denotes the total number of candidate prior boxes, and co is an adjustable parameter. The object location term denotes the location and confidence of the predicted box. Considering that the video sequence is composed of a series of frames, the video clip loss function $Loss_{clip}$ can be defined over the frames of the clip, where N is the number of frames in the video clip.

3) Tracking-regularization-based loss
The tracker location loss function focuses on the loss between the tracker-predicted and network-calculated locations in video frames. We can use KCF [41] as the tracker; other trackers can also be used in this study. In the track loss function, $Loc^{N}_{clip}$ denotes the target location in the keyframe, which comes from the output of network branch no. 4; $Loc^{i+1}_{clip}$ denotes the object locations in the (i+1)th frame of the clip, and $Loc^{i}_{clip}$ denotes the object locations in the previous frame. Note that neither tracking nor attention-based localization is certain, so neither can be taken as ground truth. This term therefore acts as an internal regularization loss, and we call it the tracking-regularization loss.
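A minimal sketch of such a tracking regularizer is given below. It assumes that the loss averages a squared distance between the location in frame i and the tracker's estimate obtained from frame i+1; the distance measure, the averaging, and the toy `shift_tracker` (a stand-in for a real tracker such as KCF) are all hypothetical choices made for illustration.

```python
def tracking_regularization_loss(locations, tracker):
    """Mean squared gap between each frame's box and the tracked box (assumed form).

    locations : list of boxes (x, y, w, h), one per frame; the last one is the keyframe
    tracker   : callable that maps the box in frame i+1 to an estimated box in frame i
    """
    if len(locations) < 2:
        return 0.0  # a single frame has no neighbor to compare with
    total = 0.0
    for i in range(len(locations) - 1):
        tracked = tracker(locations[i + 1])
        total += sum((a - b) ** 2 for a, b in zip(locations[i], tracked))
    return total / (len(locations) - 1)


def shift_tracker(box, dx=-1.0, dy=0.0):
    """Hypothetical stand-in tracker: assumes the object moved by (-dx, -dy) per frame."""
    x, y, w, h = box
    return (x + dx, y + dy, w, h)
```

When the predicted boxes follow the motion that the tracker expects, the loss is zero; a box that jumps away from the tracked position is penalized, which is the smoothing effect the regularizer is meant to provide.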

4) Neighbor-consistency-based loss
Intuitively, the cosine distance between the features of neighboring frames in the same video clip is small, so we introduce a neighbor-consistency loss in the model. A clip is written as

$$X_g = \{x_{g,0}, \ldots, x_{g,i}, \ldots, x_{g,N}\} \qquad (14)$$

where $x_{g,i}$ indicates the ith frame of the video clip and $x_{g,N}$ indicates the keyframe; we normally copy the (N-1)th frame of the clip as the keyframe. For the gth clip in the batch PK, the cosine distance between all images in $X_g$ is calculated, where $f(\cdot)$ denotes the target confidence feature of the image. The distance matrix $D_N$ is adopted to realize neighbor consistency. Intuitively, the neighbors $x_{g,i}$ and $x_{g,i+1}$ should be pulled closer. In addition, so that closer neighbors receive a larger share of the NCB loss, we use a weight $w_i$ that reflects the contribution of the ith neighbor: if the distance between $x_{g,i}$ and $x_{g,i+1}$ is large, then its contribution to $x_{g,i}$ is small. The NCB loss pulls $x_{g,i}$ and its neighbors closer, with $\epsilon$ as the scaling parameter. It makes a given object frame closer to its neighbors, which can further improve the stability of the model.
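The idea of weighting closer neighbors more heavily can be sketched as follows. The weighting w_i = exp(−d_i/ε) / Σ_j exp(−d_j/ε) and the loss Σ_i w_i d_i are assumptions chosen to match the description (large distance, small contribution); the exact formulas are not recoverable from the text, and the function names are hypothetical. Feature vectors are assumed to be non-zero.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity of two non-zero feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def neighbor_consistency_loss(features, eps=1.0):
    """Weighted sum of cosine distances between adjacent frames (assumed form).

    features : list of per-frame feature vectors f(x_{g,i}), in temporal order
    eps      : scaling parameter that controls how fast the weights decay
    """
    dists = [cosine_distance(features[i], features[i + 1])
             for i in range(len(features) - 1)]
    if not dists:
        return 0.0
    # Closer neighbors (small distance) receive larger weights.
    raw = [math.exp(-d / eps) for d in dists]
    total = sum(raw)
    weights = [r / total for r in raw]
    return sum(w * d for w, d in zip(weights, dists))
```

Because distant neighbors are down-weighted, a single abrupt frame (for example a mislocalized target) affects this loss less than it would affect a plain average of the distances.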

5) GL and IL definition
In the source domain, we defined GL and IL, where t_loc is the ground truth location in the keyframe and p_loc is the output location, which comes from the output of network branch no. 3. According to (7), the internal loss function IL directly affects the location feature of the sequence. Therefore, we can obtain more attention features of the video sequence to improve the precision of action tube detection.

Parameters about the model
In network branch no. 1, a clip of the video frame sequence, sampled in time from the original video, is fed into the 3D network as the input. Let us assume that the learning rate function of branches no. 2 and no. 4 is α(t), acting on the back-propagation process driven by the internal loss function. The learning rate function used in branches no. 1 and no. 3 is α(t) scaled by a constant less than 1.
In network branches no. 3 and no. 4, the location regression method partly follows the idea of YOLO [7]. If the input size is 416 × 416 and 32× down-sampling is used, then the grid size is 13 × 13. We also generate a 26 × 26 feature map with 16× down-sampling, or a 52 × 52 feature map with 8× down-sampling. Note that the smaller the down-sampling factor, the larger the feature map. In this process, the k-means method is also used to determine the sizes of the prior boxes from the training data set, where k is the selected number of boxes. If the number of prior boxes is five, each box has four position parameters and one confidence parameter, and the total number of categories is NumCls, then the dimension of C* is [5 × (NumCls + 4 + 1)] in network branch no. 3. The dimension of C** is [5 × (4 + 1)] in network branch no. 4, because the internal loss does not focus on the classification information. To support multi-label objects, a Softmax function is used to predict the results.
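The grid sizes and output dimensions quoted above follow from simple arithmetic, which the short sketch below reproduces. The helper names are hypothetical; the class count of 24 is used only as an example because UCF101-24 has 24 action classes.

```python
def grid_size(input_size, stride):
    """Side length of the feature map for a square input and a down-sampling stride."""
    if input_size % stride != 0:
        raise ValueError("input size must be divisible by the stride")
    return input_size // stride


def head_channels(num_boxes, num_classes):
    """Output channels per grid cell: each prior box carries
    class scores, 4 position parameters, and 1 confidence value."""
    return num_boxes * (num_classes + 4 + 1)


# Feature-map sizes for a 416 x 416 input.
sizes = {stride: grid_size(416, stride) for stride in (32, 16, 8)}

# Branch no. 3 predicts classes (example: 24 classes), branch no. 4 does not.
c_star = head_channels(num_boxes=5, num_classes=24)
c_double_star = head_channels(num_boxes=5, num_classes=0)
```

For this example `sizes` is `{32: 13, 16: 26, 8: 52}`, `c_star` is 145, and `c_double_star` is 25.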

Results and discussion
In this section, we first describe the experimental setup. We then conduct ablation studies that show the effectiveness of the different network parts. Next, we provide comparisons with several methods on the action classification and temporal action localization tasks. Finally, we analyze the intrinsic reasons for the performance improvement, including the working mechanism, the attention activation maps, and issues that need further study.

Experimental setup
We first validated our method for action classification on the J-HMDB-21 [42] and UCF101-24 [43] data sets. The UCF101-24 data set contains 24 action classes and 3207 videos, with multiple possible action instances in each video. The J-HMDB-21 data set consists of 928 short videos with 21 action categories in daily life, where each video is trimmed to a single action instance across all frames. Then, we used the THUMOS14 and ActivityNet-v1.3 data sets in the temporal action localization experiment, where THUMOS14 contains all of the UCF101 actions. THUMOS14 has 13320 trimmed videos for training, each of which includes one action, and these data include UCF101-24 with bounding box annotations. THUMOS14 also has 2500 untrimmed videos for training, each guaranteed not to include any instance of the 101 actions, and 1010 untrimmed videos for validation. The image size was 416 × 416 pixels. In this study, 32× down-sampling was used in the spatial domain to form a 13 × 13 grid. To improve the generalization ability, a spatial transformer was also used to produce a random shift of amplitude 0.1 and a random rotation of up to 10 degrees in the spatial domain. A temporal transformer was used for random sequence extraction based on 16 frames. We used the SGD optimizer with momentum and weight decay. The initial value of the learning rate was 0.05, which decreased linearly with the epoch. The hyper-parameter ρ is 0.2, is 0.1 and ε is 1.0. For a batch size of 64, at least four TITAN GPU cards or two RTX8000 GPU cards are needed for training.
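The temporal sampling and the learning-rate schedule described above can be sketched as follows. The sketch assumes that the random sequence extraction takes 16 consecutive frames and that the learning rate decays linearly to zero at the last epoch; neither detail is stated explicitly, and the function names are hypothetical.

```python
import random


def sample_clip_indices(num_frames, clip_len=16, rng=random):
    """Pick a random window of clip_len consecutive frame indices (assumed contiguous)."""
    if num_frames < clip_len:
        raise ValueError("video is shorter than the clip length")
    start = rng.randrange(num_frames - clip_len + 1)
    return list(range(start, start + clip_len))


def linear_lr(epoch, total_epochs, base_lr=0.05):
    """Learning rate that falls linearly from base_lr to 0 (assumed end point)."""
    return base_lr * (1.0 - epoch / total_epochs)
```

Passing a seeded `random.Random` instance as `rng` makes the sampled clips reproducible across runs.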
In the action classification tasks, Frame-mAP was used as the benchmark indicator. Suppose $x(t_{N-1})$ represents the keyframe of a video clip in (1), the time range of the whole video is $[t_0, t_{L-1}]$, and $X$ is a clip in the video; then there are $L - N$ clips in the video. The Frame-mAP is the mAP over all video clips in the validation data set.
In the action temporal localization tasks, the AP mainly considers the localization matching rate between the predicted frame localization and the ground truth with the same classification label. In the definition of the mAP indicator, Pred(k) indicates the predicted frame localization in all LN video frames, and real(k) indicates the ground truth. AP(cls) is the AP of a single category cls, and the mAP is the mean AP over all categories.

Ablation study
We performed ablation studies on the UCF101-24 and J-HMDB-21 data sets to show the effectiveness of each part of the loss. We used 80% of the UCF101-24 data set for training. The Frame-mAP of classification is shown in Table 1. In Table 2 and Fig. 2, to ensure the continuity of the target in the video sequence, the tracking-regularization loss and the neighbor-consistency loss were calculated in the video. "U-24→J-21" means that UCF101-24 is used in the source domain and J-HMDB-21 is used in the target domain. "Baseline + xxx" means that the "xxx" loss function is used on top of the baseline model. SCEL stands for supervised cross-entropy loss, TRBL for tracking-regularization-based loss, and NCBL for neighbor-consistency-based loss.
1) Effectiveness of ALL: In Table 1, assuming that only 30% of the data have bounding-box annotations, the model achieves only 86.8% Frame-mAP on the UCF101-24 data set in the source domain. However, if we use the other 70% of the data, without bounding-box annotations, to train the network in the target domain, then the Frame-mAP is 94.9%. Annotating every target location in the video frames is especially costly and time consuming, and the tracking loss is effective when only a small portion of the data has location labels.
2) Effectiveness of NCBL: When Baseline+SCEL+NCBL is used in training, the Frame-mAP reaches 86.3% and 90.6% on the J-HMDB-21 and UCF101-24 data sets, respectively. The NCBL loss is used to pull similar targets within a certain range closer. Unlike the TRBL loss, the NCBL loss mines the similarity of targets within the video sequence. It also illustrates the advantage of not relying completely on location labels.
3) Effectiveness of TRBL: The Frame-mAP shows no significant difference when we choose different trackers, such as MIL [44], KCF [41], and SRDCF [45]. When the model uses 70% of the data for location-unsupervised training, they achieve 94.1%, 94.9%, and 95.1% on the UCF101-24 data set, respectively. This is because the target occupies a large proportion of the image in the target data set, and generally there is no occlusion.

Experimental results of action classification
We compared the proposed method with state-of-the-art methods on the UCF101-24 and J-HMDB-21 data sets, as shown in Table 3. Using standard metrics, we report the Frame-mAP at an IOU threshold of 0.5 with 16-frame clips. The proposed method outperforms the state of the art in terms of Frame-mAP, which is improved by 2.1% and 2.7% on the two data sets, respectively. Note that the proposed method uses transfer learning: the network is trained on the target domain (UCF101-24) starting from the model pretrained on the source domain (J-HMDB-21), and it obtains a Frame-mAP of 94.8% on the UCF101-24 data set. From Table 4, we can see that our method maintains acceptable performance because of the parallel architecture consisting of classification network branch no. 1 and location network branch no. 2. At the same time, because it has two parallel 3D-CNN branches, it consumes more computing resources than some state-of-the-art models. The comparison may not be fair without considering computing complexity. However, our contribution is the introduction of the tracking loss and neighbor-consistency loss for action detection tasks; if a system needs high real-time performance, simpler backbones can be chosen.

Experimental results of temporal action localization
We also conducted experiments on temporal action localization (TAL) using the proposed method. Table 5 summarizes the performance comparisons, in terms of mAP with 16-frame clips, between the proposed method and state-of-the-art methods. We compared typical fully and weakly supervised methods. T(IOU@0.3) indicates THUMOS14 with IOU@0.3, T(IOU@0.5) indicates THUMOS14 with IOU@0.5, and A(IOU@0.5) indicates ActivityNet1.3 with IOU@0.5. Specifically, our proposed method achieves an mAP of 49.6% at an IOU threshold of 0.5 on the THUMOS14 data set. Moreover, our method outperforms the weakly supervised TAL models, and even shows results comparable with some recent fully supervised TAL methods. Note that we cannot perform the contrast experiment on the full THUMOS14 (UCF101) data set, because only its subset UCF101-24 has object bounding boxes. This means that only about 24% (3207/13320) of the THUMOS14 training data are available for fully supervised pretraining in the source domain. The same pretrained model is also used in the experiment on the ActivityNet1.3 data set.

Further analysis of the experimental results
We performed experiments for action classification and temporal action localization. The performance was better than that of recent weakly supervised methods, and even comparable with recent fully supervised methods. We also present the activation maps [50] in Fig. 5, which reveal the intrinsic reason why our method has better attention performance than the state-of-the-art video action tube detection methods. Note the following points:
• The jump action example shows that HAM-NET's attention mechanism is more likely to be disturbed by sudden or rapid object movements such as moving clouds and crowds of people. This is because HAM-NET's attention mechanism is based on the optical flow of the video frames.
• The Walking-With-Dog action example shows that HAM-NET is more likely to ignore important parts of an action, such as the presence of a dog, in cases where the training data set contains a series of similar actions, such as Skiing, Ice-Dancing, and Long-Jump. In the action classification experiment, the keyframe of the video clips is particularly important for HAM-NET to predict the correct results.
• Our method has a higher level of robustness. The internal tracker loss and neighbor-consistency loss are more efficient for weakly supervised learning based on video sequences, in which only the action classification temporal labels are needed.
Concerning the experimental results, there are four important points to explain: First, the classification temporal label is also needed in our method when the object location is achieved through weakly supervised learning; this approach is notably different from other methods.
Second, the method outperforms typical methods and some recent fully supervised methods because of the spatial-temporal attention mechanism. In other words, the attention mechanism also works well when the object location is based on weakly supervised learning.
Third, although the classification labels of the source domain may differ from those of the target domain, the pretrained model of the source domain can still be transferred, because they are all human actions with the same attention mechanism.
Finally, our contribution is the introduction of the tracking loss and neighbor-consistency loss for action detection tasks. The comparison may not be fair without considering computing complexity; if a system needs high real-time performance, simpler backbones can be chosen.
Fig. 5 Activation heat-maps are from the tensors just before the channel fusion network. The jump action example shows that HAM-NET's attention mechanism is more likely to be disturbed by sudden or rapid object movements such as moving clouds and crowds of people, because it concerns the optical flow of the video frames. The Walking-With-Dog action example shows that HAM-NET is more likely to ignore important parts of an action, such as the presence of the dog, in cases where the training data set contains a series of similar actions such as Skiing, Ice-Dancing, and Long-Jump. Our method has a higher level of robustness.

However, there is an issue that still requires further study. The tracker needs a few previous target locations of the frames. This means that if some initial locations are wrongly predicted by the network, the following loss may be affected to some extent during training. We can skip a few initial frames when training if the video is not very short. Nevertheless, this shortcoming will affect the mAP to some extent for the temporal action localization tasks. As shown in Fig. 6, the starting point may be wrongly predicted or delayed by several frames. Moreover, if the training data are randomly mixed with reverse-time video clips at a ratio of 1:1, the mAP can be further improved by 0.5% on the THUMOS14 data set. Although the training process may be affected by the initial action frames, the proposed method outperforms the state-of-the-art methods.
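The reverse-time mixing can be sketched in a few lines. The sketch assumes that every clip contributes exactly one time-reversed copy (giving the 1:1 ratio) and that the mixed set is shuffled; how labels and temporal boundaries are carried over to the reversed clips is not described in the text and is left out here. The function name is hypothetical.

```python
import random


def mix_with_reversed(clips, rng=random):
    """Return the original clips plus one time-reversed copy of each, shuffled.

    clips : list of clips, each clip being a sequence of frames in temporal order
    rng   : random module or random.Random instance used for shuffling
    """
    forward = [list(clip) for clip in clips]
    backward = [list(reversed(clip)) for clip in clips]
    mixed = forward + backward  # 1:1 ratio of original to reversed clips
    rng.shuffle(mixed)
    return mixed
```

A reversed clip places the late frames of an action at the start of the sequence, which may be one reason why this augmentation eases the tracker's dependence on the initial frames.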
In short, the ablation study shows the effectiveness of the different parts of the proposed method. In classification tasks, the proposed method outperforms the state of the art in terms of Frame-mAP, which is improved by 2.7% and 2.1% on the UCF101-24 and J-HMDB-21 data sets, respectively. In action temporal localization tasks, the proposed method achieved a higher mAP than the current best scores on the THUMOS14 data set. Moreover, the proposed method outperforms the weakly supervised TAL models, and even shows results comparable with some recent fully supervised TAL methods. Concerning the experimental results, we analyzed the intrinsic reasons for the performance improvement, including the working mechanism, the attention activation maps, and the issues that need further study.

Conclusions
We introduced a novel location-weakly supervised learning method with a spatial-temporal attention mechanism for action tube detection. The novelty is remarkable compared with previously reported methods. An internal interactive location tracker loss and a neighbor-consistency loss are designed for weakly supervised learning, in which only the classification temporal label is needed. To our knowledge, this is the first study of a location weakly supervised setting with a spatial-temporal attention mechanism for action tube detection. Although this is a location-weakly supervised, classification-supervised method, its mAP is better than that of typical weakly supervised methods, and is even comparable with some recent fully supervised methods.

TRBL  Tracking-regularization-based loss
NCBL  Neighbor-consistency-based loss
SCEL  Supervised cross-entropy loss

Fig. 6 Temporal action localization experiment on THUMOS14. The horizontal axis denotes time; we sequentially plot the ground truth, predicted localization, and prediction score.