Framework overview
The motivation of this study is to propose an attention network that requires far fewer object bounding-box annotations while still achieving results comparable to recent fully supervised methods. The classification attention maps may be disturbed by moving background objects: some input data can be predicted well while other data cannot, and we cannot decide in advance which video clip or keyframe to choose as input. Therefore, to enhance the robustness of detection, the network needs to filter this noise by tracking the objects and checking whether they exist continuously and keep high confidence values in the previous frames. The overall framework is shown in Fig. 2.
1) Overall framework overview. According to Fig. 2, in the source domain we train the newly designed multi-loss spatial–temporal attention–convolution network on the first data set, which has both location and classification annotations. In the target domain, we introduce an internal tracking loss and a neighbor-consistency loss for weakly supervised learning on video sequences, for which only action classification temporal labels are needed, and train the network from the pre-trained model on the second data set, which has only classification annotations. To ensure the continuity of the target in the video sequence, a tracking regularization loss is computed by a tracker between the tracked location and the network's predicted location. Intuitively, the features of neighboring frames within the same video clip are close in terms of cosine distance, so we also introduce a neighbor-consistency loss into the model.
2) Baseline framework overview. As shown in Fig. 3, there are four branches: branch no. 1 classifies the action, branch no. 2 adopts a spatial attention mechanism for object location in the video frames, and branch no. 3 uses a channel attention mechanism to fuse the two previous branches and obtain the total loss. In branch no. 4, the internal loss function focuses on the loss between the network's predicted location and the ground truth, based on the video sequence.
Baseline network definition
Referring to Figs. 2 and 3, suppose that a video sequence is an input to the 3D-CNN network, and the original video is sampled in time as
$$\begin{aligned} X=U\{x\left( t_0\right) ,\ \ x\left( t_1\right) ,\ \ \ldots ,\ \ \ \ x\left( t_{N-1}\right) \} \end{aligned}$$
(1)
where X denotes the clip of video, \(x\left( t\right)\) is a frame of the video, U means that X consists of the set of frames, and the range of sampling time is \([t_0,\ t_{N-1}]\).
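As a concrete illustration, the sampling in (1) can be sketched as follows (a minimal PyTorch sketch with hypothetical helper names; the paper does not prescribe an implementation):

```python
import torch

def sample_clip(frames, t0, tN_minus_1, num_frames=16):
    """Uniformly sample a clip X = {x(t_0), ..., x(t_{N-1})} over [t0, t_{N-1}].

    `frames` is assumed to be an indexable sequence of H x W x 3 uint8 arrays
    (hypothetical input format); the result follows the N x CH x H x W layout
    used later in this section.
    """
    idx = torch.linspace(t0, tN_minus_1, steps=num_frames).round().long()
    clip = torch.stack([torch.as_tensor(frames[int(i)]) for i in idx])  # N x H x W x 3
    return clip.permute(0, 3, 1, 2).float() / 255.0                     # N x 3 x H x W

# Usage with dummy frames:
frames = [torch.randint(0, 256, (224, 224, 3), dtype=torch.uint8) for _ in range(120)]
clip = sample_clip(frames, t0=0, tN_minus_1=119)
print(clip.shape)   # torch.Size([16, 3, 224, 224])
```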
The clips are fed into a 3D convolutional network, such as 3D-ResNeXt-50 and 3D-ResNeXt-101 [40], and the outputs are
$$\begin{aligned} S_{50}=3D\_ResNeXt\_50\left( X\right) \end{aligned}$$
(2)
$$\begin{aligned} S_{101}=3D\_ResNeXt\_101\left( X\right) \end{aligned}$$
(3)
where ResNeXt is used to validate our model; other 3D convolutional backbones can also be used here. Referring to network branch no. 2, which focuses on the object location in the video sequence, the tensor \(S_{50}\) is squeezed into the tensor \(F_{01}\):
$$\begin{aligned} S_{50}\rightarrow {}F_{00}\in {}R^{(N\times {}D^{'})\times {}H^{'}\times {}W^{'}}\rightarrow {}\ F_{01}\in {}R^{C^{''}\times {}H^{'}\times {}W^{'}} \end{aligned}$$
(4)
where \((N\times {}D^{'})\times {}H^{'}\times {}W^{'}\) is the shape of the tensor \(F_{00}\) which has N-Frames feature-groups, each group has \(D^{'}\) features, and each feature is the size of \(H^{'}\times {}W^{'}\), and \(C^{''}\times {}H^{'}\times {}W^{'}\) is the shape of the tensor \(F_{01}\).
Referring to network branch no. 1, which is concerned with the classification of action tubes, we further squeeze the tensor \(S_{101}\) into the tensor \(F_{11}\):
$$\begin{aligned} S_{101}\rightarrow {}F_{11}\in {}R^{C^{'}\times {}H^{'}\times {}W^{'}} \end{aligned}$$
(5)
where \(C^{'}\times {}H^{'}\times {}W^{'}\) is the shape of the tensor \(F_{11}\). Since \(F_{01}\) and \(F_{11}\) have the same feature map dimension \(H^{'}\times {}W^{'}\), they can be concatenated as follows:
$$\begin{aligned} \{F_{01},\ \ F_{11}\}\rightarrow {}FA\in {}R^{\left( C^{'}+C^{''}\right) \times {}H^{'}\times {}W^{'}} \end{aligned}$$
(6)
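To make the tensor manipulations in (4)–(6) concrete, the following PyTorch sketch reproduces the squeezing and concatenation with assumed sizes (the actual channel counts depend on the chosen backbone):

```python
import torch

# Assumed sizes: N frames, D' = 1 (as in the parameter settings below), C' = 2048,
# and a 13 x 13 feature map; the batch dimension is omitted for clarity.
N, Dp, Cp, Hp, Wp = 16, 1, 2048, 13, 13
S50 = torch.randn(N, Dp, Hp, Wp)       # branch no. 2 output, Eq. (2)
S101 = torch.randn(1, Cp, Hp, Wp)      # branch no. 1 output, temporal dim squeezed to 1

# Eq. (4): fold the N frame groups and their D' features into C'' = N * D' channels.
F01 = S50.reshape(N * Dp, Hp, Wp)      # C'' x H' x W'
# Eq. (5): drop the singleton temporal dimension.
F11 = S101.squeeze(0)                  # C'  x H' x W'
# Eq. (6): channel-wise concatenation works because the H' x W' maps match.
FA = torch.cat([F01, F11], dim=0)      # (C' + C'') x H' x W'
print(FA.shape)                        # torch.Size([2064, 13, 13])
```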
Network branch no. 2 focuses only on the object location in the video sequence and is driven by the internal loss function, marked as IL. Branch no. 1 mainly considers the classification of the video object's behavior and is driven by the global loss function, marked as GL. Therefore, the network parameters are learned as:
$$\begin{aligned} {\theta {}}_j=\left\{ \begin{array}{l}{\theta {}}_j-\alpha {}\left( t\right) \frac{\partial {}IL\left( \theta {}\right) }{\partial {}{\theta {}}_j},\ \ \ {\rm{if}}\ {\theta {}}_j\in {}{\rm{Branch}}\ 2\ {\rm{or}}\ 4\ \\ {\theta {}}_j-\lambda {}\alpha {}\left( t\right) \frac{\partial {}GL\left( \theta {}\right) }{\partial {}{\theta {}}_j},\ {\rm{if}}\ {\theta {}}_j\in {}{\rm{Branch}}\ 1\ {\rm{or}}\ 3\end{array}\right. \end{aligned}$$
(7)
where \({\theta {}}_j\) denotes the trainable parameter of the network model. We choose different loss functions according to the network branches, \(\alpha {}\left( t\right)\) is the learning rate function, and \(\lambda {}\) is a hyper-parameter.
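In practice, (7) amounts to updating the two groups of branches with different losses and different learning rates. A minimal sketch, assuming the branches are exposed as sub-modules and using toy linear layers in place of the real 3D-CNN heads:

```python
import torch
import torch.nn as nn

# Toy stand-ins for the four branches (the real branches are 3D-CNN heads).
class ToyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.branch1 = nn.Linear(8, 8)   # classification branch, driven by GL
        self.branch2 = nn.Linear(8, 8)   # location branch, driven by IL
        self.branch3 = nn.Linear(8, 8)   # fusion branch, driven by GL
        self.branch4 = nn.Linear(8, 8)   # internal-loss branch, driven by IL

model = ToyModel()
alpha, lam = 1e-3, 0.1   # alpha(t) and the hyper-parameter lambda of Eq. (7)

# Two parameter groups reproduce the two rows of Eq. (7).
optimizer = torch.optim.SGD(
    [
        {"params": list(model.branch2.parameters()) + list(model.branch4.parameters())},
        {"params": list(model.branch1.parameters()) + list(model.branch3.parameters()),
         "lr": lam * alpha},
    ],
    lr=alpha,
)

x = torch.randn(4, 8)
IL = model.branch2(x).pow(2).mean() + model.branch4(x).pow(2).mean()  # placeholder IL
GL = model.branch1(x).pow(2).mean() + model.branch3(x).pow(2).mean()  # placeholder GL

optimizer.zero_grad()
(IL + GL).backward()   # in this toy graph, IL only touches branches 2/4, GL only 1/3
optimizer.step()
```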
As shown in Fig. 4, we use the Gram matrix in the neural network to solve the fusion problem. Here, the implementation process of CFAM is simplified as follows:
$$\begin{array}{l} FA \in {R^{(C' + C'') \times H' \times W'}} \to FB \in {R^{C \times H' \times W'}}\\ \to FC \in {R^{C \times H' \times W'}} \to FD \in {R^{C* \times H' \times W'}} \end{array}$$
(8)
where \(C=C^{'}+C^{''}\), FA is the result of simply concatenating features of network branch No.1 and network branch No.2, FB is the mapping feature after 2-layer convolution, the Gram matrix transformer is used between FB and FC, and FD is the mapping feature of FC after 2-layer convolution. \(C^*\) is the final number of features.
By squeezing FB along the directions \(H^{'}\) and \(W^{'}\), we obtain the feature FF, \(FF\in {}R^{C\times {}D}\) where \(D=H^{'}\times {}W^{'}\), and the transformer between FB and FC is defined as
$$\begin{aligned} \begin{aligned} FC= \, &\beta {}\cdot {} {\rm{reshape}}\left( \frac{\exp {\left( G_{ij}\right) }}{\sum _{j=1}^C \exp {\left( G_{ij}\right) }}\cdot {}FF\right) +FB\ \\& {\rm{with}}\ G_{ij}=\sum _{k=1}^D{FF}_{ik}\cdot {}{FF}_{jk\ \ } \end{aligned} \end{aligned}$$
(9)
where \(\beta {}\) is a parameter that can be learned by the network. The reshape function transforms the result back to the same size as FB. In branch no. 3, the feature FD is obtained just before the Softmax function and is affected by the global loss function.
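The Gram-matrix transformer of (9) can be sketched as follows (only the step from FB to FC is shown; the convolutions producing FB and FD, and the value of the learnable \(\beta\), are assumptions):

```python
import torch
import torch.nn.functional as F

def cfam_gram_attention(FB, beta=0.1):
    """Gram-matrix transformer from FB to FC following Eq. (9).

    FB: tensor of shape (C, H', W'); beta is treated here as a plain scalar,
    whereas in the network it is a learnable parameter. The 2-layer convolutions
    that produce FB and FD are omitted from this sketch.
    """
    C, Hp, Wp = FB.shape
    FF = FB.reshape(C, Hp * Wp)                    # squeeze H' and W' into D = H' x W'
    G = FF @ FF.t()                                # Gram matrix, G_ij = sum_k FF_ik FF_jk
    A = F.softmax(G, dim=1)                        # exp(G_ij) / sum_j exp(G_ij)
    FC = beta * (A @ FF).reshape(C, Hp, Wp) + FB   # reshape back and add the residual FB
    return FC

# Usage with an assumed channel count:
FB = torch.randn(144, 13, 13)
print(cfam_gram_attention(FB).shape)   # torch.Size([144, 13, 13])
```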
Internal and global loss function
In the proposed network model (see Fig. 2), there are two loss functions, namely, the external global loss function and the internal loss function, which act on the network parameters through the gradient transfer mechanism. We next introduce the internal loss function, which focuses on the loss between the predicted keyframe location and the tracked locations; the network can then be trained under the location-weakly-supervised attention mechanism, for which only the action classification temporal label is needed. The loss consists of the following four parts:
- Loss function Part A: the action classification loss function, marked as \(\text{Loss}_\text{{cls}}\).
- Loss function Part B: the location loss function, which focuses on the location loss of objects in the video clip; the single-frame loss is marked as \(\text{Loss}_\text{{loc}}\), and the clip loss is marked as \(\text{Loss}_\text{{clip}}\).
- Loss function Part C: the tracker-predicted location loss function, which focuses on the tracking location loss with respect to the previous video sequence, marked as \({L}_{\rm{TRB}}\).
- Loss function Part D: the neighbor-consistency loss function, which focuses on the consistency of neighboring features in the video sequence, marked as \({L}_{\rm{NCB}}\).
1) Supervised cross-entropy loss
Suppose the image is split by an \(S\times {}S\) grid. We use a cross-entropy function to compute the action classification loss marked as \(\text{Loss}_\text{{cls}}\):
$$\begin{aligned} \begin{aligned} \text{Loss}_\text{{cls}}=-\sum _{i=0}^{S^2}\sum _{j=0}^BI_{ij}^{\rm{obj}}\left[ P_i^j\log { \left( {\hat{P}}_i^j\right) }+\left( 1-P_i^j\right) \log {\left( 1-{\hat{P}}_i^j\right) }\right] \end{aligned} \end{aligned}$$
(10)
where \(I_{ij}^{\rm{obj}}\) denotes that the jth prior box of the ith grid cell is responsible for the object with class cls; \(I_{ij}^{\rm{obj}}=1\) if the object center falls in the grid cell, and \(I_{ij}^{\rm{obj}}=0\) otherwise. \(S^2\) is the total number of grid cells, and B denotes the total number of candidate prior boxes. \(P_i^j\) and \({\hat{P}}_i^j\) represent the ground-truth and predicted class probability in the grid cell, respectively.
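A minimal sketch of (10) for a single class, with the grid size and number of prior boxes assumed:

```python
import torch

def cls_loss(I_obj, P, P_hat, eps=1e-7):
    """Binary cross-entropy classification loss of Eq. (10), single-class case.

    I_obj:  (S*S, B) indicator, 1 if the j-th prior box of cell i holds an object.
    P:      (S*S, B) ground-truth class probability for that cell/box.
    P_hat:  (S*S, B) predicted class probability.
    """
    P_hat = P_hat.clamp(eps, 1.0 - eps)   # numerical safety for the logs
    bce = P * torch.log(P_hat) + (1.0 - P) * torch.log(1.0 - P_hat)
    return -(I_obj * bce).sum()

# Usage with an assumed grid S = 13 and B = 5 prior boxes:
S, B = 13, 5
I_obj = torch.zeros(S * S, B); I_obj[85, 2] = 1.0
P = torch.zeros(S * S, B);     P[85, 2] = 1.0
P_hat = torch.rand(S * S, B)
print(cls_loss(I_obj, P, P_hat))
```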
2) Clip supervised-location loss
Suppose the single-frame location loss function is defined as
$$\begin{aligned}&{\rm{Loss}}_{\rm{loc}}={\lambda {}}_{\rm{co}}\sum _{i=0}^{S^2}\sum _{j=0}^BI_{ij}^{\rm{obj}}\left[ {\left( x_i-{\hat{x}}_i^j\right) }^2+{\left( y_i-{\hat{y}}_i^j\right) }^2\right] + \\&{\lambda {}}_{\rm{co}}\sum _{i=0}^{S^2}\sum _{j=0}^BI_{ij}^{\rm{obj}}\left[ {\left( \sqrt{w_i} -\sqrt{{\hat{w}}_i^j}\right) }^2+{\left( \sqrt{h_i}-\sqrt{{\hat{h}}_i^j}\right) }^2\right] \\&-\sum _{i=0}^{S^2}\sum _{j=0}^BI_{ij}^{\rm{obj}}\left[ C_i^j\log {\left( {\hat{C}}_i^j\right) } +\left( 1-C_i^j\right) \log {\left( 1-{\hat{C}}_i^j\right) }\right] \end{aligned}$$
(11)
where \(I_{ij}^{\rm{obj}}\) denotes that the jth prior box of the ith grid cell is responsible for the object; \(I_{ij}^{\rm{obj}}=1\) if the object center falls in the grid cell, and \(I_{ij}^{\rm{obj}}=0\) otherwise. \(S^2\) is the total number of grid cells, B denotes the total number of candidate prior boxes, and \({\lambda {}}_{\rm{co}}\) is an adjustable parameter. The object location \({(x}_i,{\ y}_i,{\ w}_i,\ \ h_i,{\ C}_i^j)\) denotes the \((\rm{center\_left, center\_top, width, height, confidence})\) of the ground-truth box, and \(({\hat{x}}_i^j,{\hat{\ y}}_i^j,\ \ {\hat{w}}_i^j,\ \ {\hat{h}}_i^j,\ \ {\hat{C}}_i^j)\) denotes the location and confidence of the predicted box.
Considering that the video sequence is composed of a series of frames, the video clip loss function can be defined as follows:
$$\begin{aligned} {\rm{Loss}}_{\rm{clip}}=\frac{1}{N}\sum _{k=0}^{N-1} {\rm{Loss}}_{\rm{loc}}\left( k\right) \end{aligned}$$
(12)
where N is the number of frames in the video clip.
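The frame loss (11) and clip loss (12) can be sketched as follows (tensor layouts and the value of \(\lambda_{\rm co}\) are assumptions):

```python
import torch

def loc_loss(I_obj, gt, pred, lambda_co=5.0, eps=1e-7):
    """Single-frame location loss of Eq. (11).

    gt, pred: tensors of shape (S*S, B, 5) holding (x, y, w, h, confidence);
    I_obj:    (S*S, B) responsibility indicator. lambda_co is adjustable.
    """
    x, y, w, h, c = gt.unbind(-1)
    xh, yh, wh, hh, ch = pred.unbind(-1)
    ch = ch.clamp(eps, 1.0 - eps)
    center = (x - xh) ** 2 + (y - yh) ** 2
    size = (w.sqrt() - wh.sqrt()) ** 2 + (h.sqrt() - hh.sqrt()) ** 2
    conf = c * torch.log(ch) + (1.0 - c) * torch.log(1.0 - ch)
    return (I_obj * (lambda_co * (center + size) - conf)).sum()

def clip_loss(I_obj_clip, gt_clip, pred_clip, lambda_co=5.0):
    """Clip loss of Eq. (12): average of Eq. (11) over the N frames of the clip."""
    N = gt_clip.shape[0]
    return sum(loc_loss(I_obj_clip[k], gt_clip[k], pred_clip[k], lambda_co)
               for k in range(N)) / N

# Usage with assumed S = 13, B = 5 and random (non-negative) boxes:
S, B, N = 13, 5, 16
I_obj = torch.zeros(S * S, B); I_obj[85, 2] = 1.0
I_obj_clip = I_obj.unsqueeze(0).expand(N, -1, -1)
gt, pred = torch.rand(N, S * S, B, 5), torch.rand(N, S * S, B, 5)
print(clip_loss(I_obj_clip, gt, pred))
```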
3) Tracking-regularization-based loss
The tracking loss function focuses on the loss between the tracker-predicted and network-calculated locations in the video frames. We can use KCF [41] as the tracker; other trackers can also be adopted in this study. The tracking loss function is defined as follows:
$$\begin{aligned} {L}_{\rm{TRB}}=\frac{1}{N}\sum _{i=0}^{N-1} {\rm{Loss}}_{\rm{loc}}\left( {\rm{Loc}}_{\rm{clip}}^i,{\rm{Tracker}}({\rm{Loc}}_{\rm{clip}}^{i+1})\right) \end{aligned}$$
(13)
where \({\rm{Loc}}_{\rm{clip}}^N\) denotes the target location in the keyframe, which comes from the output of network branch no. 4, \({\rm{Loc}}_{\rm{clip}}^{i+1}\) denotes the object location in the (i+1)th frame of the clip, and \({\rm{Loc}}_{\rm{clip}}^i\) denotes the object location in the previous frame. Note that since neither the tracking result nor the attention-based localization is certain, neither can be taken as ground truth; this term therefore acts more like an internal regularization loss, so we call it the tracking regularization loss.
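A sketch of (13) against a generic tracker interface (the tracker call is abstracted to a callable; in practice KCF also needs the frame images, which is omitted here):

```python
import torch

def tracking_regularization_loss(clip_locs, tracker_step, frame_loss):
    """L_TRB of Eq. (13), written against a generic tracker interface.

    clip_locs:    list of N+1 per-frame location tensors, index N being the keyframe
                  location predicted by branch no. 4.
    tracker_step: callable that propagates the location in frame i+1 back to a
                  predicted location in frame i (e.g. a KCF tracker).
    frame_loss:   per-frame location loss, e.g. Eq. (11) with the tracked box
                  standing in for the missing ground truth.
    """
    N = len(clip_locs) - 1
    total = sum(frame_loss(clip_locs[i], tracker_step(clip_locs[i + 1]))
                for i in range(N))
    return total / N

# Usage with dummy locations and stand-in callables (illustrative only):
locs = [torch.rand(5) for _ in range(17)]            # 16 frame locations + keyframe
loss = tracking_regularization_loss(
    locs,
    tracker_step=lambda loc: loc,                    # stand-in for the tracker prediction
    frame_loss=lambda a, b: ((a - b) ** 2).sum(),    # stand-in for Eq. (11)
)
print(loss)
```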
4) Neighbor-consistency-based loss
Intuitively, the features of neighboring frames within the same video clip are close in terms of cosine distance, so we introduce a neighbor-consistency loss into the model.
$$\begin{aligned} X_g=\{x_{g,0},...,x_{g,i},...,x_{g,N}\} \end{aligned}$$
(14)
$$\begin{aligned} d_{g,i,j}=d_c\left( x_{g,i},x_{g,j}\right) ={f\left( x_{g,i}\right) }^Tf\left( x_{g,j}\right) \end{aligned}$$
(15)
$$\begin{aligned} D_{N}=\left[ \begin{array}{ccccc} d_{0,0,1} &{} \cdots {} &{} d_{0,i,i+1} &{} \cdots {} &{} d_{0,N-1,N} \\ \cdots {} &{} \cdots {} &{} \cdots {} &{} \cdots {} &{} \cdots {} \\ d_{g,0,1} &{} \cdots {} &{} d_{g,i,i+1} &{} \cdots {} &{} d_{g,N-1,N} \\ \cdots {} &{} \cdots {} &{} \cdots {} &{} \cdots {} &{} \cdots {} \\ d_{PK-1,0,1} &{} \cdots {} &{} d_{PK-1,i,i+1} &{} \cdots {} &{} d_{PK-1,N-1,N} \end{array}\right] \end{aligned}$$
(16)
where \(x_{g,i}\) denotes the ith frame of the video clip, and \(x_{g,N}\) specifically denotes the keyframe, for which we normally copy the (N-1)th frame of the clip. For the gth clip in a batch of size PK, the cosine distance between all images in \(X_g\) is calculated, and \(f(\cdot {})\) denotes the target confidence feature of an image.
The distance matrix \(D_{N}\) is adopted to realize neighbor consistency. Intuitively, the distance between the neighbors \(x_{g,i}\) and \(x_{g,i+1}\) should be pulled closer. Moreover, to let closer neighbors take larger proportions in the NCB loss, we introduce a weight \(w_i\) that reflects the contribution of the ith neighbor: if the distance between \(x_{g,i}\) and \(x_{g,i+1}\) is large, then its contribution to \(x_{g,i}\) is small:
$$\begin{aligned} w_i=\left\{ \begin{array}{l}\frac{1}{N}\left( 1-\frac{d_c\left( x_{g,i},x_{g,i+1}\right) }{\sum _{j=0}^{N-1}d_c\left( x_{g,j},x_{g,j+1}\right) }\right) ,\forall {}\ i\in {}\{0,1,...,N-2\} \\ d_c\left( x_{g,N-1},x_{g,N}\right) ,\ \ \ \ {\rm{where}} \ x_{g,N}\ {\rm{is\ the\ keyframe}}\end{array}\right. \end{aligned}$$
(17)
To pull the distance between the \(x_{g,i}\) and its neighbors closer, the NCB loss can be formulated as
$$\begin{aligned} L_{\rm{NCB}}=-\sum _{i=0}^{N-1}w_i{\rm{log}}\frac{{\rm{exp}}{(d}_c\left( x_{g,i},x_{g,i+1}\right) /\epsilon {})}{\sum _{j=0}^{N-1}{\rm{exp}}{(d}_c\left( x_{g,j},x_{g,j+1}\right) /\epsilon {})} \end{aligned}$$
(18)
where \(\epsilon {}\) is the scaling parameter. The NCB loss makes a given object frame closer to its neighbors, which further improves the stability of the model.
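The NCB loss of (14)–(18) for a single clip can be sketched as follows (the feature dimensionality and \(\epsilon\) are assumed; features are normalized so the dot product acts as the cosine measure):

```python
import torch
import torch.nn.functional as F

def ncb_loss(feats, eps_scale=0.1):
    """Neighbor-consistency loss of Eqs. (14)-(18) for one clip.

    feats: (N+1, F) confidence features f(x_{g,0}) ... f(x_{g,N}); index N is the
    keyframe (a copy of frame N-1). eps_scale is the scaling parameter epsilon.
    """
    feats = F.normalize(feats, dim=1)                 # cosine measure via dot product
    d = (feats[:-1] * feats[1:]).sum(dim=1)           # d_c(x_i, x_{i+1}), i = 0..N-1
    N = d.shape[0]

    # Eq. (17): neighbors with larger distance contribute less ...
    w = (1.0 / N) * (1.0 - d[:-1] / d.sum())          # i = 0..N-2
    w_key = d[-1:]                                    # ... except the keyframe term
    w = torch.cat([w, w_key])

    # Eq. (18): softmax over the neighbor terms, weighted by w
    log_p = torch.log_softmax(d / eps_scale, dim=0)
    return -(w * log_p).sum()

# Usage with an assumed clip of N = 16 frames plus the copied keyframe:
feats = torch.randn(17, 256)
print(ncb_loss(feats))
```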
5) GL and IL definition
In the source domain, we define GL and IL as
$$\begin{aligned} GL={\rm{Loss}}_{\rm{cls}}+{\rm{Loss}}_{\rm{loc}}(p\_loc, t\_loc) \end{aligned}$$
(19)
$$\begin{aligned} IL={\rm{Loss}}_{\rm{clip}} \end{aligned}$$
(20)
where \(t\_loc\) is the ground-truth location in the keyframe, and \(p\_loc\) is the predicted location output by network branch no. 3. According to (7), the internal loss function IL directly affects the location features of the sequence. Therefore, we can obtain more attention features of the video sequence and improve the precision of action tube detection.
In the target domain, we define GL and IL as
$$\begin{aligned} GL={\rm{Loss}}_{\rm{cls}}+{\rm{Loss}}_{\rm{loc}}\left( {\rm{Loc}}_{\rm{clip}}^{N-1},{\rm{Tracker}}\left({\rm{Loc}}_{\rm{clip}}^{N}\right)\right) \end{aligned}$$
(21)
$$\begin{aligned} IL={L}_{\rm{TRB}} + \rho {} {L}_{\rm{NCB}} \end{aligned}$$
(22)
where \(\rho {}\) is the hyper-parameter that controls the importance of the NCB loss relative to the TRB loss. Note that in the target domain the data set only has action temporal annotations. The location interactive loss is computed from \({L}_{\rm{TRB}}\) and \({L}_{\rm{NCB}}\) based on the video sequence.
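Putting the pieces together, the two training phases combine the losses as in (19)–(22); a minimal sketch (function and argument names are illustrative, not the authors' code):

```python
# All loss terms are the ones sketched above; rho is the hyper-parameter of Eq. (22).

def source_domain_losses(loss_cls, loss_loc_keyframe, loss_clip):
    """Source domain (full annotations): GL of Eq. (19) and IL of Eq. (20)."""
    GL = loss_cls + loss_loc_keyframe      # classification + keyframe location vs. GT
    IL = loss_clip                         # per-clip supervised location loss
    return GL, IL

def target_domain_losses(loss_cls, loss_loc_tracked, l_trb, l_ncb, rho=0.5):
    """Target domain (classification labels only): GL of Eq. (21) and IL of Eq. (22)."""
    GL = loss_cls + loss_loc_tracked       # tracked box replaces the missing GT box
    IL = l_trb + rho * l_ncb               # tracking regularization + neighbor consistency
    return GL, IL
```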
Model parameters
In network branch no. 1, a clip of the video frame sequence is fed into the 3D network as input, and the original video can be sampled in time. The shape of the input data is \([N\times {}CH\times {}H\times {}W]\), where N is the length of the clip, CH is the number of image channels, H is the height of the video image, and W is the width of the video image. If 4 frames of 3-channel RGB images are sampled per second, then a 4-second clip consists of 16 frames, so \(N=16\) and \(CH=3\). The tensor \(S_{101}\) has the shape \([N^{'}\times {}C^{'}\times {}H^{'}\times {}W^{'}]\), which is squeezed by setting \(N^{'}=1\), \(H^{'}=H/32\), and \(W^{'}=W/32\). The feature dimension of the 3D-CNN output is then squeezed and transformed into the shape \({[C}^{'}\times {}H^{'}\times {}W^{'}]\). Hence, it is easy to concatenate with the output feature of network branch no. 2, because they have the same single-feature-map shape \([H^{'}\times {}W^{'}]\).
In network branch no. 2, we input the same video frame sequence as in network branch no. 1 and adopt the 3D-CNN network to generate the location feature. Given that the two network branches are computed in parallel, this method does not require additional computing time. The tensor \(S_{50}\) has the shape \([N\times {}D^{'}\times {}H^{'}\times {}W^{'}]\). We squeeze the tensor by setting \(D^{'}=1\), \(H^{'}=H/32\), and \(W^{'}=W/32\), and the output shape of network branch no. 2 is \({[C}^{''}\times {}H^{'}\times {}W^{'}]\), where \(C^{''}=N\times {}D^{'}=N\). We assume that the learning rate function of branches no. 2 and no. 4 is \(\alpha {}\left( t\right)\), acting on the back-propagation process driven by the internal loss function. The learning rate function used in branches no. 1 and no. 3 is \(\lambda {}\alpha {}\left( t\right)\), where \(\lambda {}\) is a constant less than 1.
In network branches no. 3 and no. 4, the location regression method partly follows the idea of YOLO [7]. If the input size is \(416\times {}416\) and 32× down-sampling is used, then the grid size is \(13\times {}13\). We also generate a \(26\times {}26\) feature map with 16× down-sampling, or a \(52\times {}52\) feature map with 8× down-sampling; note that the lower the down-sampling ratio, the larger the feature map. In this process, the k-means method is used to determine the sizes of the prior boxes from the training data set, where k is the selected number of prior boxes. If the number of prior boxes is five, each box has four position parameters and one confidence parameter, and the total number of categories is NumCls, then the dimension of \(C^*\) is \([5\times {}\left( {\rm{NumCls}}+4+1\right) ]\) in network branch no. 3. The dimension of \(C^{**}\) is \([5\times {}\left( 4+1\right) ]\) in network branch no. 4, because the internal loss does not focus on the classification information. To support multi-label objects, a Softmax function is used to predict the results.
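For instance, the head dimensions and grid sizes described above can be checked with a few lines (NumCls and the number of prior boxes are assumed values):

```python
# Quick check of the output-head sizes described above (assumed values).
num_cls = 24                      # number of action categories, NumCls (assumed)
num_priors = 5                    # prior boxes per grid cell (from k-means, k = 5)

C_star = num_priors * (num_cls + 4 + 1)   # branch no. 3: classes + box + confidence
C_2star = num_priors * (4 + 1)            # branch no. 4: box + confidence only

for stride in (32, 16, 8):                # 416 x 416 input, three down-sampling ratios
    grid = 416 // stride
    print(f"stride {stride}: grid {grid}x{grid}, "
          f"branch-3 channels {C_star}, branch-4 channels {C_2star}")
```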