Action recognition using length-variable edge trajectory and spatio-temporal motion skeleton descriptor
© The Author(s). 2018
- Received: 14 August 2017
- Accepted: 22 January 2018
- Published: 1 February 2018
Representing the features of different types of human action in unconstrained videos is a challenging task due to camera motion, cluttered backgrounds, and occlusions. This paper aims to obtain an effective and compact action representation with the length-variable edge trajectory (LV-ET) and the spatio-temporal motion skeleton (STMS). First, in order to better describe long-term motion information for action representation, a novel edge-based trajectory extraction strategy is introduced that tracks edge points arising from motion without limiting the trajectory length; the termination of tracking depends not only on the optical flow field but also on the position of the optical flow vector in the next frame. Thus, only a compact subset of action-related edge points in each frame is used to generate length-variable edge trajectories. Second, we observe that different types of action produce their own characteristic trajectories. A new trajectory descriptor named the spatio-temporal motion skeleton is therefore introduced: the LV-ET is first encoded using both orientation and magnitude features, and the STMS is then computed by motion clustering. Comparative experimental results on three unconstrained human action datasets demonstrate the effectiveness of our method.
- Human action recognition
- Length-variable edge trajectory
- Motion clustering
- Spatio-temporal motion skeleton
Human action recognition (HAR) is an active research topic in intelligent video analysis. It has gained extensive attention in the academic and engineering communities [1, 2] and is widely used in human-computer interaction, video surveillance, motion analysis, virtual reality, and related fields [3–7]. Usually, the realization of HAR includes two steps: the first is feature extraction based on video information; the second is classification according to the feature vectors. However, due to the presence of background clutter, partial occlusion, varying viewpoints, and camera movement, it is still a challenging task to obtain discriminative action representations from realistic videos.
Recently, trajectory-based methods have been proposed and utilized in various human action recognition approaches owing to the promising results of histogram-based descriptors [8–12]. Unlike methods that extract local features directly, trajectory-based methods extract spatio-temporal trajectories by matching feature points between adjacent frames to represent human actions [13–15]. Because they better describe motion changes and long-term target dynamics, these methods significantly outperform local representations. Therefore, this paper focuses on trajectory-based methods and especially emphasizes the following two aspects, namely, trajectory extraction and trajectory description.
In this paper, we propose to sample edge points in each frame in order to adaptively select the trajectories associated with moving targets. In addition, we do not fix the trajectory length; tracking terminates depending on whether an edge point still exists at the predicted position in the next frame. In this way, a compact set of trajectories (LV-ET) can be obtained that better represents the motion information of each action.
By introducing a spatio-temporal trajectory encoding method, a novel trajectory descriptor named STMS is designed by applying spectral clustering to the motion information, so that each video is represented as a set of distinctive trajectory skeletons carrying latent motion features.
The remainder of this paper is organized as follows: we give an overview of the related work in Section 2. We describe the details of the LV-ET and the process for extracting the STMS descriptor in Section 3 and discuss the experimental results in Section 4. Finally, the conclusion is drawn in Section 5.
Previous work shows that the most informative trajectories are extracted from the region of interest (ROI); motion and shape are two important information sources in human action videos. It is worth noting, however, that we cannot guarantee that the points inside the ROI are more informative and representative. Besides, we find that different types of action have their own movement rhythm; in other words, the evolution of different actions is unequal (e.g., "run" is a high-speed and continuous action, whereas "hug" is relatively slow and discontinuous between two persons). Therefore, the use of fixed-length trajectories [30, 23, 24, 31] is not discriminative enough to represent the various types of human actions. In addition, we observe that the spatio-temporal features of trajectories are similar within the same kind of action, whereas those between different actions are dissimilar. The HAR performance may therefore be improved by taking these observations into account.
3.1 Proposed length-variable edge trajectory
In this section, we introduce the major components of the proposed LV-ET extraction, including edge point sampling and tracking, trajectory generation, and its pruning strategy.
3.1.1 Edge point sampling
The key to edge trajectory extraction is to select the tracking points precisely. In general, spatio-temporal interest point sampling [9, 10] and dense sampling [23, 24] have been successfully applied on various occasions. However, such approaches often involve irrelevant interest points, such as background points with high complexity, which seriously affect the final recognition accuracy and reduce the efficiency of the algorithm. Unlike the sampling approaches mentioned above, we utilize the Canny detector to detect edge points, as sketched below. Moreover, in order to better capture human action motion information and edge trajectories, we also leverage optical flow to track edge points across video frames to extract trajectories.
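As an illustration of this sampling step, the sketch below uses OpenCV's Canny detector together with the 3 × 3 dilation described in Section 4.2; the function name and threshold values are illustrative assumptions, not parameters reported in the paper.

```python
import cv2
import numpy as np

def sample_edge_points(frame, low_thresh=100, high_thresh=200):
    """Detect candidate tracking points with the Canny detector.

    The Canny thresholds here are illustrative, not values from the
    paper. Returns the dilated edge mask (used later to validate
    tracked positions) and the (x, y) coordinates of the edge pixels.
    """
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    edges = cv2.Canny(gray, low_thresh, high_thresh)
    # A 3 x 3 square structuring element is used for dilation,
    # matching the setup described in Section 4.2.
    kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (3, 3))
    dilated = cv2.dilate(edges, kernel)
    ys, xs = np.nonzero(edges)
    return dilated, np.stack([xs, ys], axis=1)
```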
3.1.2 Tracking and edge trajectory generation
By inspecting the computational process, it is not difficult to see that if we compute the trajectory using only (3) and (4), the succeeding tracking points run a high risk of drifting to non-informative regions, regardless of how the initial edge points are stipulated. In consideration of this, we utilize a novel tracking strategy: when computing the succeeding trajectory point, we prejudge whether the succeeding candidate point is an edge point. In other words, we compute each frame's edge information E_t and use it to determine whether the rounded position (x_{t+1}, y_{t+1}) is an edge point in F_{t+1}. If so, we regard it as a valid sampled point in F_{t+1}. Otherwise, we consider the succeeding sampled point (x_{t+1}, y_{t+1}) not to be a valid trajectory point and terminate the tracking process at (x_t, y_t). In the next step, every edge point in F_{t+1} that is not a succeeding sampled point of F_t is used as the initial point of a new trajectory. This frame-by-frame tracking process is iterated until the last sampling points are found in the last frame.
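A minimal sketch of this per-point tracking rule follows, assuming a dense optical flow field (e.g., from cv2.calcOpticalFlowFarneback) and the dilated edge mask of the next frame; the function is hypothetical and simplified.

```python
def track_one_step(point, flow, next_edge_mask):
    """One tracking step of the LV-ET rule sketched above: the point is
    displaced by the optical flow, and the rounded target position must
    fall on an edge pixel of frame t+1, otherwise tracking terminates.

    `flow` is a dense (H, W, 2) flow field, e.g. from
    cv2.calcOpticalFlowFarneback; `next_edge_mask` is the dilated
    Canny edge image E_{t+1}. Returns the next point, or None when the
    trajectory should be terminated.
    """
    h, w = next_edge_mask.shape
    x, y = point
    dx, dy = flow[int(round(y)), int(round(x))]
    nx, ny = int(round(x + dx)), int(round(y + dy))
    if not (0 <= nx < w and 0 <= ny < h):
        return None          # left the frame: terminate the trajectory
    if next_edge_mask[ny, nx] == 0:
        return None          # not an edge point in F_{t+1}: terminate
    return (nx, ny)
```

Iterating this step frame by frame, and starting a new trajectory from every edge point of F_{t+1} that is not reached by an existing trajectory, yields the length-variable edge trajectories.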
3.1.3 Trajectory pruning strategy
3.2 Proposed spatio-temporal motion skeleton
In this section, we introduce the trajectory descriptors and the generation of spatio-temporal motion skeleton, including trajectory encoding-based similarity measurement and motion clustering.
3.2.1 Trajectory descriptors
In order to describe the motion information of each trajectory well, four descriptors are calculated in our method, namely, HOF, MBH, HOG, and the proposed STMS. The first two descriptors capture motion information from optical flow, the HOG descriptor captures local appearance information, and the STMS represents the relationship of trajectories between different types of action.
Similar to other trajectory-based methods, the HOG, HOF, and MBH descriptors of the LV-ET are also acquired from a spatio-temporal volume centered along the trajectory, which is divided into spatio-temporal cells to embed structure information. In our work, we use a volume of 24 × 24 × L pixels for each trajectory, where L is the length of the LV-ET, and divide the volume into 4 × 4 × 1 cells. The three above descriptors are then calculated with the same parameters as in the baseline trajectory methods.
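To make the cell layout concrete, the following sketch pools a HOG-style orientation histogram over the 24 × 24 × L volume around one trajectory using the 4 × 4 × 1 grid; the bin count and helper structure are illustrative assumptions rather than the exact baseline implementation, and border handling is omitted (the trajectory is assumed to stay at least 12 pixels from the frame border).

```python
import numpy as np

def hog_along_trajectory(gray_frames, traj, n_bins=8, patch=24, grid=4):
    """HOG-style histogram pooled over the 24 x 24 x L volume centred
    on a trajectory and split into a 4 x 4 x 1 cell grid (spatial
    cells only, one temporal cell). `traj` holds one (x, y) position
    per frame. Bin count and structure are illustrative assumptions.
    """
    half, cell = patch // 2, patch // grid
    hist = np.zeros((grid, grid, n_bins))
    for frame, (x, y) in zip(gray_frames, traj):
        f = frame.astype(np.float32)
        gx, gy = np.gradient(f, axis=1), np.gradient(f, axis=0)
        mag = np.hypot(gx, gy)
        ang = np.arctan2(gy, gx) % (2 * np.pi)
        bins = (ang / (2 * np.pi) * n_bins).astype(int) % n_bins
        x0, y0 = int(x) - half, int(y) - half  # assumes patch fits in frame
        for i in range(grid):
            for j in range(grid):
                rows = slice(y0 + i * cell, y0 + (i + 1) * cell)
                cols = slice(x0 + j * cell, x0 + (j + 1) * cell)
                m, b = mag[rows, cols], bins[rows, cols]
                for n in range(n_bins):
                    hist[i, j, n] += m[b == n].sum()
    v = hist.ravel()
    return v / (np.linalg.norm(v) + 1e-8)      # l2-normalized descriptor
```

The details of the proposed STMS descriptor are introduced in the next section.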
3.2.2 Trajectory encoding
3.2.3 Motion clustering and skeleton extraction
From the above analysis, we can obtain the motion similarity of the LV-ET, which is the foundation for motion clustering and motion skeleton extraction. In recent years, spectral clustering has become one of the most popular clustering algorithms, and we adopt it here to obtain the cluster centers as motion skeletons. The spectral clustering algorithm is based on graph theory: we regard every TMH as a node of an undirected graph with vertex set V = (v_1, v_2, …, v_n), and the proposed motion similarity between trajectories is quantified as the edge weight between nodes. Therefore, we can transform the motion clustering problem into a sub-graph partitioning problem:
1. Form the affinity matrix W ∈ R^{n×n} from the proposed motion similarity between trajectories, where W_ij = ΔMs_ij for i ≠ j and the diagonal elements are 0.
2. Define D to be the diagonal matrix whose i-th diagonal element is the sum of the i-th row of W, and construct the normalized affinity matrix L = D^{-1/2} W D^{-1/2}.
3. Calculate the eigenvectors x_1, x_2, …, x_k of L corresponding to its k largest eigenvalues and form the matrix X = [x_1 x_2 … x_k] ∈ R^{n×k} by stacking the eigenvectors in columns.
4. Form the matrix Y from X by renormalizing each row of X to unit length, where Y_ij = X_ij / (Σ_j X_ij^2)^{1/2}.
5. Regarding each row of Y as a point in R^k, cluster the rows into k clusters C_1, C_2, …, C_k via the K-means algorithm.
The cluster centers are regarded as the motion skeletons of the action. Each trajectory is then assigned to its nearest cluster centroid using Euclidean distance, and a k-dimensional STMS descriptor is constructed from these assignments to represent the video. In general, once we have extracted the LV-ET, we can obtain the STMS by applying the clustering procedure above.
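For illustration, here is a minimal NumPy/scikit-learn sketch of steps 1–5 above (in the Ng-Jordan-Weiss style); the function name and the small numerical-stability constants are our own additions.

```python
import numpy as np
from sklearn.cluster import KMeans

def motion_skeletons(W, k):
    """Spectral clustering of trajectories from the motion-similarity
    affinity matrix W (n x n, zero diagonal), following steps 1-5.
    Returns the k cluster centres (the motion skeletons) and the
    cluster label of each trajectory.
    """
    W = np.asarray(W, dtype=np.float64)
    d = W.sum(axis=1)                              # degree of each node
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d + 1e-12))
    L = D_inv_sqrt @ W @ D_inv_sqrt                # L = D^{-1/2} W D^{-1/2}
    eigvals, eigvecs = np.linalg.eigh(L)           # ascending eigenvalues
    X = eigvecs[:, -k:]                            # k largest eigenvalues
    Y = X / (np.linalg.norm(X, axis=1, keepdims=True) + 1e-12)
    km = KMeans(n_clusters=k, n_init=10).fit(Y)    # step 5
    return km.cluster_centers_, km.labels_
```

The k-dimensional STMS for a video is then the histogram of its trajectories' nearest-centre assignments; the experiments in Section 4.2 use k = 256.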
The UCF Sports dataset contains ten human actions: diving, golf swinging, kicking, weight lifting, horseback riding, running, skating, swinging bench, swinging side, and walking. For most action classes, there is considerable variation in action performance, human appearance, camera movement, viewpoint, illumination, and background. Besides, due to its high resolution, we resize each video to half its original spatial resolution to reduce the time consumption. Following prior work, we add a horizontally flipped version of each sequence to the dataset to increase the number of samples and use the leave-one-out setup, i.e., testing on each original sequence while training on all other sequences except the flipped version of the tested sequence. We report the average accuracy over all classes.
The YouTube dataset includes 11 action categories: basketball shooting, biking, diving, golf swinging, horseback riding, soccer juggling, swinging, tennis swinging, trampoline jumping, volleyball spiking, and walking with a dog. This dataset is very challenging due to large variations in camera motion, object appearance and pose, object scale, viewpoint, cluttered background, illumination conditions, etc. For each category, the videos are divided into 25 groups; in total, 1168 videos are used in the experiment. A leave-one-group-out setup is utilized, and the average accuracy over all classes is reported as the performance measure.
The HMDB51 dataset is collected from a variety of sources. There are a total of 6766 videos distributed over 51 action categories, which can be grouped into five types: general facial actions, facial actions with object manipulation, general body movements, body movements with object interaction, and body movements for human interaction. For evaluation, we follow the original protocol with its three distinct train-test splits and report the average accuracy over these splits.
4.2 Experimental setup
The experiment is divided into two parts: the first part is feature extraction to represent the videos, and the second part is video classification. In the first part, Farnebäck's optical flow algorithm is employed to estimate the optical flow field. During LV-ET extraction, we adopt the Canny operator to detect frame edges, and a square structuring element of 3 × 3 pixels is used to compute the dilated edge image. The LV-ET obtained through the above steps is also reused later to compute motion-related descriptors such as HOF and MBH. Moreover, in order to extract the STMS descriptor, we introduce a novel trajectory encoding method, and based on the proposed encoding, the motion similarity is computed under the Euclidean distance. Then, the spectral clustering algorithm is employed to construct 256 motion skeletons, so each video can be represented by a 256-dimensional STMS descriptor. We also extract three baseline descriptors (i.e., HOG, HOF, and MBH) from the trajectories. Empirically, we use a volume of 24 × 24 × L pixels divided into 4 × 4 × 1 cells for each trajectory when extracting the three baseline descriptors. In order to compare fairly with the baseline trajectories DT and iDT, the same parameters are utilized to extract the descriptors. Note that iDT uses human detection to separate the targets from the background; in contrast, the proposed method does not make use of human detection.
After the trajectory descriptors are computed, principal component analysis (PCA) is individually applied to reduce the dimensionality of each descriptor (i.e., HOG, HOF, MBH, and STMS) by a factor of two, as suggested in prior work, so as to better mitigate the impact of noise. Then, the Fisher vector (FV) model is adopted in this paper: for each type of descriptor extracted from the trajectories, the PCA-reduced vectors are separately encoded into a signature vector by the FV model. The projection matrix of PCA is learned using 256,000 descriptors randomly sampled from the training set. Then, we use a Gaussian mixture model (GMM) with 256 components to encode the projected descriptors and apply ℓ2 normalization to each type of descriptor to obtain the video-level representation. For each video, four types of video-level representations are computed.
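The sketch below shows a generic PCA + GMM Fisher vector encoding of this kind; it is a standard FV formulation under our own simplifications (diagonal covariances, mean and variance gradients only), not the authors' exact pipeline.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture

def fisher_vector(descs, gmm):
    """Fisher vector of local descriptors under a diagonal-covariance
    GMM: per-component gradients w.r.t. means and variances, followed
    by power and l2 normalization (a common FV recipe; the paper
    reports l2 normalization)."""
    q = gmm.predict_proba(descs)                   # (n, K) posteriors
    n = descs.shape[0]
    parts = []
    for k in range(gmm.n_components):
        diff = (descs - gmm.means_[k]) / np.sqrt(gmm.covariances_[k])
        g_mu = (q[:, k:k + 1] * diff).sum(0) / (n * np.sqrt(gmm.weights_[k]))
        g_var = (q[:, k:k + 1] * (diff ** 2 - 1)).sum(0) / (n * np.sqrt(2 * gmm.weights_[k]))
        parts.extend([g_mu, g_var])
    fv = np.concatenate(parts)
    fv = np.sign(fv) * np.sqrt(np.abs(fv))         # power normalization
    return fv / (np.linalg.norm(fv) + 1e-12)       # l2 normalization

# PCA halves each channel's dimensionality before encoding; the GMM has
# 256 components, as in Section 4.2 (training data assumed given).
# pca = PCA(n_components=train_descs.shape[1] // 2).fit(train_descs)
# gmm = GaussianMixture(256, covariance_type='diag').fit(pca.transform(train_descs))
```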
In the second part, after obtaining the high-level video representations, we use a multi-kernel learning-based support vector machine to predict the action class, where four linear kernels are used, each corresponding to one type of representation. The one-against-all approach is adopted, and the class with the highest score is predicted.
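As an illustration of the kernel combination, the sketch below averages one linear kernel per descriptor channel and trains a one-against-all SVM on the precomputed kernel; the equal weights are a stand-in for the weights that a multi-kernel learning solver would learn, and all names are our own.

```python
import numpy as np
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

def combined_linear_kernel(feats_a, feats_b, weights=None):
    """Weighted sum of one linear kernel per descriptor channel (here:
    HOG, HOF, MBH, and STMS video-level vectors). Equal weights stand
    in for learned MKL weights. Each element of feats_a / feats_b is an
    (n, dim_c) array for one channel."""
    weights = weights or [1.0 / len(feats_a)] * len(feats_a)
    return sum(w * (A @ B.T) for w, A, B in zip(weights, feats_a, feats_b))

# One-against-all classification with the precomputed combined kernel:
# K_train = combined_linear_kernel(train_feats, train_feats)
# K_test = combined_linear_kernel(test_feats, train_feats)
# clf = OneVsRestClassifier(SVC(kernel='precomputed')).fit(K_train, y_train)
# y_pred = clf.predict(K_test)
```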
4.3 Evaluation of parameters
In this subsection, we compare and analyze different edge detectors and the numbers of motion skeletons for HAR.
4.3.1 Evaluation of edge detectors
4.3.2 Evaluation of motion skeleton numbers
4.4 Comparison with baseline descriptors
Table 1 Comparison of STMS with baseline descriptors (accuracy, %)
4.5 Comparison with baseline trajectories
Table 2 Comparison of LV-ET with baseline trajectories (accuracy, %)
As given in Table 2, the HAR performance of LV-ET on the three datasets outperforms DT by 4.6, 5.5, and 7.1%, respectively. Compared with iDT, the improvements are 3.5, 3.3, and 2.0%, respectively. It is obvious that the proposed LV-ET obtains the best HAR performance among the three types of trajectories. This is because LV-ET describes the evolution features of different types of actions and uses the edge information of the target to reduce background interference and camera motion.
4.6 Evaluation of the overall recognition performance
Table 3 Per-class average accuracy for HMDB51
4.7 Comparison with state of the arts
Table 4 Comparison of the overall performance of our method and the trajectory-based methods
As given in Table 4, the proposed method achieves comparable results on all three datasets. We note that on the UCF Sports dataset, the proposed method obtains at least a 0.8% improvement and reaches 92.8% accuracy; on the HMDB51 dataset, there is at least a 1.5% improvement over the other methods, at 58.2% accuracy; on the YouTube dataset, our method outperforms the others by 0.7% and obtains 89.6% accuracy.
4.8 Evaluation of computational complexity
Table 5 Comparison of the average time consumption per video, broken down into trajectory extraction (s), descriptor computation (s), and descriptor encoding (s)
In this paper, a new trajectory generation strategy, LV-ET, is proposed and a novel descriptor, STMS, is designed for human action recognition. The LV-ET is extracted by tracking edge points across video frames based on optical flow, with the aim of better describing the evolution of different types of actions, and proves to be representative and informative. In the process of extracting the STMS, a novel encoding method for trajectory clustering is proposed, and the motion similarity is adequately considered during TMH clustering. The STMS is designed to extract the most representative trajectories of an action, which is why we call it a motion skeleton.
Through experiments on three public unconstrained datasets, we demonstrated that the proposed LV-ET outperforms the baseline approaches (i.e., DT and iDT). The STMS is likewise comparable to current trajectory-based descriptors and proved to be discriminative and complementary to existing descriptors. Note that we do not rely on background subtraction, so the method can be readily applied to unconstrained realistic action recognition.
Acknowledgements
This work was supported by the National Natural Science Foundation of China under Grants 11176016 and 60872117.
Availability of data and materials
All data are fully available without restriction.
ZKW implemented the core algorithm and drafted the manuscript. YPG reviewed and edited the manuscript. All authors discussed the results and implications, commented on the manuscript at all stages, and approved the final version.
Zhengkui Weng was born in Jiaxing, Zhejiang Province, China. He is now a Ph.D. candidate of School of Communication and Information Engineering in Shanghai University, China. His major research interests include computer vision and pattern recognition.
Yepeng Guan was born in Xiaogan, Hubei Province, China, in 1967. He received the B.S. and M.S. degrees in physical geography from the Central South University, Changsha, China, in 1990 and 1996, respectively, and the Ph.D. degree in geodetection and information technology from the Central South University, Changsha, China, in 2000. Since 2007, he has been a professor with School of Communication and Information Engineering, Shanghai University.
Ethical approval and consent to participate
Consent for publication
Competing interests
The authors declare that they have no competing interests.
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
- R Poppe, A survey on vision-based human action recognition. Image Vis. Comput. 28(6), 976–990 (2010)
- Y Yi, M Lin, Human action recognition with graph-based multiple-instance learning. Pattern Recogn. 53, 148–162 (2016)
- C Yan et al., Supervised hash coding with deep neural network for environment perception of intelligent vehicles. IEEE Trans. Intell. Transp. Syst. PP(99), 1–12 (2017)
- C Yan et al., Effective Uyghur language text detection in complex background images for traffic prompt identification. IEEE Trans. Intell. Transp. Syst. PP(99), 1–10 (2017)
- GI Parisi, C Weber, S Wermter, Self-organizing neural integration of pose-motion features for human action recognition. Front. Neurorobotics 9, 3 (2015)
- C Yan et al., A highly parallel framework for HEVC coding unit partitioning tree decision on many-core processors. IEEE Signal Process. Letters 21(5), 573–576 (2014)
- H Zheng, Z Li, Y Fu, Efficient human action recognition by luminance field trajectory and geometry information. Transplant. Proc. 42(3), 987–989 (2009)
- C Yan et al., Efficient parallel framework for HEVC motion estimation on many-core processors. IEEE Trans. Circuits Syst. Video Technol. 24(12), 2077–2089 (2014)
- J Dou, J Li, Robust human action recognition based on spatio-temporal descriptors and motion temporal templates. Optik Int. J. Light Electron Opt. 125(7), 1891–1896 (2014)
- I Laptev, On space-time interest points. Int. J. Comput. Vis. 64(2), 107–123 (2005)
- O Kliper-Gross et al., Motion interchange patterns for action recognition in unconstrained videos. Eur. Conf. Comput. Vision, 256–269 (2012)
- A Oikonomopoulos et al., Trajectory-based representation of human actions. ICMI 2006/IJCAI 2007 Int. Conf. Artificial Intell. Hum. Comput., 133–154 (2007)
- A Ciptadi, MS Goodwin, JM Rehg, Movement pattern histogram for action recognition and retrieval. Eur. Conf. Comput. Vision, 695–710 (2014)
- B Ni et al., Motion part regularization: Improving action recognition via trajectory group selection. IEEE Conf. Comput. Vision Pattern Recog., 3698–3706 (2015)
- Y Yi, H Wang, Motion keypoint trajectory and covariance descriptor for human action recognition. Vis. Comput., 1–13 (2017). https://doi.org/10.1007/s00371-016-1345-6
- JJ Seo et al., Effective and efficient human action recognition using dynamic frame skipping and trajectory rejection. Image Vision Comput. 58, 76–85 (2016)
- PF Felzenszwalb et al., Object detection with discriminatively trained part-based models. IEEE Trans. Patt. Anal. Mach. Intell. 32(9), 1627 (2010)
- J Sanchez et al., Image classification with the Fisher vector: Theory and practice. Int. J. Comput. Vis. 105(3), 222–245 (2013)
- J Sun et al., Hierarchical spatio-temporal context modeling for action recognition. Comput. Vision Pattern Recog., 2004–2011 (2009)
- P Matikainen, M Hebert, R Sukthankar, Trajectons: Action recognition through the motion analysis of tracked features. IEEE Int. Conf. Comput. Vision Workshops, 514–521 (2009)
- M Bregonzio et al., Discriminative topics modelling for action feature selection and recognition. Br. Mach. Vision Conf., 1–11 (2010)
- N Sundaram, T Brox, K Keutzer, Dense point trajectories by GPU-accelerated large displacement optical flow. Eur. Conf. Comput. Vision, 438–451 (2010)
- H Wang et al., Dense trajectories and motion boundary descriptors for action recognition. Int. J. Comput. Vis. 103(1), 60–79 (2013)
- H Wang, C Schmid, Action recognition with improved trajectories. IEEE Int. Conf. Comput. Vision, 3551–3558 (2014)
- N Dalal, B Triggs, Histograms of oriented gradients for human detection. Comput. Vision Patt. Recog. 1(12), 886–893 (2005)
- I Laptev et al., Learning realistic human actions from movies. Comput. Vision Pattern Recog., 1–8 (2008)
- A Gaidon, Recognizing activities with cluster-trees of tracklets. Br. Mach. Vision Conf., 3201–3208 (2012)
- I Atmosukarto, N Ahuja, B Ghanem, Action recognition using discriminative structured trajectory groups. Appl. Comput. Vision, 899–906 (2015)
- E Vig, M Dorr, D Cox, Space-variant descriptor sampling for action recognition based on saliency and eye movements. Eur. Conf. Comput. Vision, 84–97 (2012)
- Y Yi, Y Lin, Human action recognition with salient trajectories. Signal Process. 93(11), 2932–2941 (2013)
- I Jargalsaikhan et al., Action recognition based on sparse motion trajectories. IEEE Int. Conf. Image Process., 3982–3985 (2014)
- G Farnebäck, Two-frame motion estimation based on polynomial expansion. Scand. Conf. Image Anal., 363–370 (2003)
- A Saade, F Krzakala, Spectral clustering of graphs with the Bethe Hessian. Int. Conf. Neural Inf. Process. Syst., 406–414 (2014)
- MD Rodriguez, J Ahmed, M Shah, Action MACH: a spatio-temporal maximum average correlation height filter for action recognition. Comput. Vision Patt. Recog., 1–8 (2008)
- J Liu, J Luo, M Shah, Recognizing realistic actions from videos. Comput. Vision Patt. Recog., 1996–2003 (2009)
- H Kuehne et al., HMDB: A large video database for human motion recognition. IEEE Int. Conf. Comput. Vision, 2556–2563 (2011)
- J Foulds, E Frank, A review of multi-instance learning assumptions. Knowl. Eng. Rev. 25(1), 1–25 (2010)
- F Orabona, L Jie, B Caputo, Online-batch strongly convex multi kernel learning. Comput. Vision Patt. Recog., 787–794 (2010)
- X Wang, C Qi, Action recognition using edge trajectories and motion acceleration descriptor. Mach. Vision Appl. 27(6), 861–875 (2016)
- X Peng, Y Qiao, Q Peng, Motion boundary based sampling and 3D co-occurrence descriptors for action recognition. Image Vision Comput. 32(9), 616–628 (2014)
- J Cho et al., Robust action recognition using local motion and group sparsity. Pattern Recogn. 47(5), 1813–1825 (2014)