Weakly supervised spatial–temporal attention network driven by tracking and consistency loss for action detection

EURASIP Journal on Image and Video Processing

Table 5 Temporal action localization experiment

Method	Mode	T(IOU@0.3)	T(IOU@0.5)	A(IOU@0.5)
G-TAD [48]	Full	–	40.2	46.7
P-GCN [49]	Full	63.6	49.1	48.3
Nguyen [36]	Weak	46.6	26.8	–
3C-Net [34]	Weak	40.9	24.6	35.4
WSGN [31]	Weak	42.0	25.1	–
Islam [29]	Weak	46.8	29.6	35.2
BaS-Net [35]	Weak	44.6	27.0	34.5
DGAM [32]	Weak	46.8	28.8	41.0
HAM-Net [39]	Weak	50.3	31.0	41.5
Ours	Weak	64.4	49.6	52.2

The table compares mAP results, obtained with 16-frame clips, against representative fully and weakly supervised methods. T(IOU@0.3) and T(IOU@0.5) denote THUMOS14 at IoU thresholds of 0.3 and 0.5, respectively, and A(IOU@0.5) denotes ActivityNet at an IoU threshold of 0.5. Note that the proposed method is an attention network trained without object-location supervision and with classification supervision only.
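
To illustrate what the IoU thresholds in the column headers mean, the following minimal Python sketch computes the temporal IoU between a predicted and a ground-truth action segment and checks it against a threshold. The function names and the (start, end) segment representation are assumptions made for illustration and are not taken from the paper. A full mAP evaluation would additionally rank predictions by confidence and average precision over classes.

```python
def temporal_iou(pred, gt):
    """Temporal IoU of two segments, each given as (start, end) in seconds."""
    # Overlap length; clamped to zero when the segments are disjoint.
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    # Union length = sum of both lengths minus the overlap.
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def is_match(pred, gt, threshold=0.5):
    """Hypothetical helper: a prediction matches if its IoU reaches the threshold."""
    return temporal_iou(pred, gt) >= threshold


if __name__ == "__main__":
    pred, gt = (0.0, 10.0), (5.0, 15.0)
    print(temporal_iou(pred, gt))             # overlap 5 s over a 15 s union
    print(is_match(pred, gt, threshold=0.3))  # looser threshold, as in T(IOU@0.3)
    print(is_match(pred, gt, threshold=0.5))  # stricter threshold, as in T(IOU@0.5)
```

Under this sketch, the same prediction can count as correct at the 0.3 threshold and incorrect at 0.5, which is why the T(IOU@0.3) column reports higher values than the T(IOU@0.5) column for every method.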