Table 4 Run time and performance comparison on the UCF101-24 dataset, measured on a single NVIDIA RTX8000 card with 16-frame video clips

From: Weakly supervised spatial–temporal attention network driven by tracking and consistency loss for action detection

Method      Speed (fps)   Frame-mAP
P3D-CTN     28            -
I3D         30            77.7
3C-Net      45            84.4
HAM-Net     29            92.1
YOWO+LFB    38            86.4
Ours        31            94.8
  1. For our method, ResNeXt-50 and ResNeXt-34 are used as its two 3D-CNN backbones.
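
To put the speed column in terms of the 16-frame clips named in the caption, the reported frame rates can be converted into an approximate per-clip processing time. The sketch below is illustrative only: it assumes that "Speed (fps)" counts frames processed per second, so that one clip takes 16 / fps seconds. The names `TABLE_4`, `clip_latency_ms`, `fastest`, and `most_accurate` are hypothetical helpers and do not come from the paper. The values are copied from Table 4, and the Frame-mAP that the table leaves blank for P3D-CTN is kept as `None`.

```python
# Values copied from Table 4 as (speed in fps, Frame-mAP).
# None marks the Frame-mAP entry that the table leaves blank.
TABLE_4 = {
    "P3D-CTN": (28, None),
    "I3D": (30, 77.7),
    "3C-Net": (45, 84.4),
    "HAM-Net": (29, 92.1),
    "YOWO+LFB": (38, 86.4),
    "Ours": (31, 94.8),
}

CLIP_FRAMES = 16  # clip length stated in the table caption


def clip_latency_ms(fps, frames=CLIP_FRAMES):
    """Approximate time in milliseconds to process one clip.

    Assumption: fps counts frames processed per second.
    """
    return 1000.0 * frames / fps


def fastest(table=TABLE_4):
    """Return the method with the highest reported speed."""
    return max(table, key=lambda method: table[method][0])


def most_accurate(table=TABLE_4):
    """Return the method with the highest reported Frame-mAP.

    Methods without a reported Frame-mAP are skipped.
    """
    reported = {m: v[1] for m, v in table.items() if v[1] is not None}
    return max(reported, key=reported.get)


if __name__ == "__main__":
    for method, (fps, frame_map) in TABLE_4.items():
        latency = clip_latency_ms(fps)
        shown_map = "-" if frame_map is None else frame_map
        print(f"{method:<10} {fps:>3} fps  {latency:6.1f} ms/clip  {shown_map}")
    print("fastest:", fastest())
    print("highest Frame-mAP:", most_accurate())
```

Under this assumption the per-clip time is a simple rescaling of the speed column, so it does not change the ordering of the methods; it only expresses the same measurements per clip rather than per frame.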