Fig. 3
From: Weakly supervised spatial–temporal attention network driven by tracking and consistency loss for action detection
Baseline framework. The internal loss function focuses on the loss between the network’s predicted location and the ground-truth locations based on the video sequence.