Fig. 2
From: Weakly supervised spatial–temporal attention network driven by tracking and consistency loss for action detection

Overall framework. In the source domain, we train the network on the first dataset, which has both location and classification annotations. In the target domain, we train the network from the pre-trained model on the second dataset, which has only temporal action-classification annotations. To ensure the continuity of the target in the video sequence, the tracking-regularization loss is computed between the location produced by a tracker and the network's predicted location. The neighbor-consistency loss pulls the features of objects in neighboring frames of the video closer together.
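The two losses named in the caption can be sketched as follows. This is a minimal NumPy sketch, not the paper's implementation: the function names, the squared-error form of the tracking term, and the squared-distance form of the consistency term are all assumptions for illustration.

```python
import numpy as np

def tracking_regularization_loss(pred_boxes, track_boxes):
    # Assumed form: mean squared error between the network's predicted
    # box coordinates and the tracker's box coordinates, encouraging the
    # predicted location to follow the tracked target through the video.
    pred = np.asarray(pred_boxes, dtype=float)
    track = np.asarray(track_boxes, dtype=float)
    return float(np.mean((pred - track) ** 2))

def neighbor_consistency_loss(frame_features):
    # Assumed form: mean squared distance between feature vectors of
    # neighboring frames, pulling features of the same object closer
    # across adjacent frames of the sequence.
    f = np.asarray(frame_features, dtype=float)
    diffs = f[1:] - f[:-1]  # differences between neighboring frames
    return float(np.mean(np.sum(diffs ** 2, axis=1)))
```

For example, identical predicted and tracked boxes give a tracking loss of zero, and constant per-frame features give a consistency loss of zero; either loss grows as the prediction drifts from the tracker or as neighboring-frame features diverge.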