Unsupervised Action Classification Using Space-Time Link Analysis
© Haowei Liu et al. 2010
Received: 16 September 2009
Accepted: 26 March 2010
Published: 6 May 2010
We address the problem of unsupervised discovery of action classes in video data. In contrast to previously proposed methods for this task, we present a space-time link analysis approach which consistently matches or exceeds the performance of traditional unsupervised action categorization methods on various datasets. Our method is inspired by the recent success of link analysis techniques in the image domain. By applying these techniques in the space-time domain, we are able to naturally take into account the spatiotemporal relationships between the video features, while leveraging the power of graph matching for action classification. We present a comprehensive set of experiments demonstrating that our approach is capable of handling cluttered backgrounds, activities with subtle movements, and video data from moving cameras. State-of-the-art results are reported on standard datasets. We also demonstrate our method in a compelling surveillance application aimed at preventing fraud in retail stores.
How to automatically discover and recognize activities from video data is an important topic in computer vision. A solution to this problem will not only facilitate applications such as video retrieval and summarization, but will also improve, for example, automatic video surveillance systems and human-machine/robot communication. Beyond its importance for many practical applications, unsupervised action categorization is also important in the context of machine learning, particularly regarding how video processing approaches can enable a high-level "understanding" of the data.
Numerous techniques have been proposed to solve the action classification problem. The requirements on video analysis techniques are manifold: they must deal with cluttered backgrounds, camera motion, occlusion, and geometric and photometric variability [1, 4, 5]. Recently, unsupervised methods based on bags of visual words have become very popular, as they can achieve excellent performance on standard datasets and long surveillance videos [1, 7].
Generally, these unsupervised algorithms extract spatiotemporal feature descriptors called video words and then use document-topic models such as pLSA, LDA, or HDP to discover latent topics [1, 5, 7]. A common limitation of these models is that they usually do not consider spatiotemporal correlations among visual words unless the correlations are represented explicitly. Another general limitation is that some of these methods rely on EM-based learning, which makes incremental learning and model updating difficult.
In this paper we introduce link analysis-based techniques for unsupervised activity discovery in video data that naturally preserve the spatiotemporal topology among the video words. Link analysis techniques are well known in the data mining and information retrieval research communities and on the WWW. They were largely ignored in computer vision until their recent introduction to the community by Kim et al. [12, 13], who applied link analysis to unsupervised image clustering with impressive results.
Next, the VSN is analyzed using the link analysis techniques PageRank and structure similarity (SS). The PageRank algorithm outputs a score for each feature indicating how many similar features it has, while the structure similarity gives the likelihood of a feature being a hub node. The intuition is that genuine features should be similar to one another and thus have high ranking values. The PageRank and structure similarity scores together form an affinity matrix between all video sequences.
Here, we interpret the pairwise matching weights as votes for the importance of the nodes, which allows a quick division between consistent nodes and irrelevant ones (e.g., those from the background). Finally, as shown in Figure 1, spectral clustering is applied to the affinity matrix to identify potential action categories. Link analysis techniques have been shown to detect consistent matches (hubs) very effectively and efficiently [11, 12, 18, 19]. All computation and inference is done on the link weights between the nodes in the VSN, which makes the approach fast and efficient.
The key contributions of our work are as follows.
We extend link analysis techniques to the spatiotemporal domain and show that unsupervised discovery of action classes can greatly benefit from such an approach. For this, we apply the necessary revisions (feature representation, matching techniques, etc.) to the approach presented in prior work to make it efficiently applicable to video data. We report results that either match or exceed the performance of state-of-the-art techniques on various datasets.
We demonstrate that our approach can be applied for action clustering in real surveillance videos and show a compelling application to avoid fraud in retail stores.
The paper is organized as follows: in Section 2, we review related literature on activity recognition. Section 3 describes our approach in detail, including the spatiotemporal interest point detector, the matching process, and link analysis techniques. In Section 4, we show the performance of our approach on standard datasets and a real surveillance application. Finally, Section 5 concludes our paper.
2. Related Work
Many methods have been proposed to address the problem of action recognition and analysis in video sequences [3, 20, 21]. Specifically for human action modeling, a variety of techniques rely on tracking body parts (e.g., arms, limbs, etc.) to classify human actions [22, 23]. Classical surveys cover a significant amount of work that falls into this category. Although very promising results have been achieved recently in distinguishing activities under large viewpoint changes, it is often difficult to accurately detect and track body parts in complex environments.
Template-based approaches make use of spatiotemporal patterns to match and identify specific actions in videos. Bobick and Davis use motion history images, a.k.a. temporal templates, for action classification. Efros et al. introduce a spatiotemporal descriptor that works well on low-resolution videos. Blank et al. represent actions as space-time shape volumes for classification. Shechtman and Irani propose a similarity metric between video patches based on intensity variation. A common drawback of these template-based methods is their inability to generalize from a collection of examples and create a single template which captures the intraclass variability of an action. More recently, Rodriguez et al. address this problem using a MACH filter.
State-space models have been widely applied to short-term action recognition and to more complex behavior analysis involving object interactions and activities at multiple levels of temporal granularity. Examples include Hidden Markov Models and their variations, such as coupled HMMs and layered HMMs, stochastic grammars, and conditional random fields. The majority of these methods are supervised, requiring manual labeling of video clips. When the state space is large, the estimation of many parameters makes the learning process more difficult.
Bag of words models have recently shown great promise in action classification. These approaches in general extract sparse space-time interest points [35, 36] from the video sequences and then apply either discriminative or generative models for categorizing the activities. Highly discriminative results are obtained using SVM classifiers based on these descriptors under a supervised learning framework [36, 37]. Recently, Niebles and Fei-Fei  enhance this approach by proposing a novel model characterized as a constellation of bags-of-features, which encodes both the shape and appearance of the actor.
Unsupervised methods have also been proposed using the bag of words model (see the general discussion in Section 1). Closest to our work are probably Niebles et al.  and Wang et al. . Niebles et al.  use a generative model based on pLSA to cluster activities. Wang et al.  use a hierarchical Bayesian model to cluster activities and interactions in surveillance videos. Although these methods achieve excellent results in real world video data, they omit any global spatiotemporal structure information among the video words. More recently, Savarese et al.  used spatiotemporal correlograms to encode flexible long range temporal information into the local features.
Different from all methods thus far proposed for unsupervised action categorization, we address this problem using a link analysis-based approach. Specifically, we apply link analysis algorithms in the spatiotemporal domain to automatically discover actions in video sequences. By using link analysis, we are able to naturally take into account the spatiotemporal relationships between detected interest points, while leveraging the power of graph matching for action classification. Experimental results show that this approach compares well with state-of-the-art techniques on standard datasets and works very well in real surveillance scenarios. More details about our algorithm follow in the next section.
3. Link-Analysis for Spatiotemporal Features
In this section, we break our approach down into its major components and describe each in detail: the types of features used, shape context feature matching, PageRank, structure similarity computation, and spectral clustering. Figure 1 shows the flow chart of our approach.
3.1. Extraction of Spatio-Temporal Features
The first step of our action classification approach is to extract spatiotemporal interest points from the input video sequences. Two prominent spatiotemporal interest point detectors have been proposed by Laptev and Lindeberg and by Dollar et al.
The two parameters σ and τ correspond to the spatial and temporal scales of the detector, respectively. The frequency of the harmonic functions is given by f. In all cases we use f = 4/τ, following the original formulation of the detector.
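As an illustration, the response of such a periodic-motion detector can be sketched as follows. This is a minimal sketch in the spirit of the Dollar et al. detector, not the authors' implementation; the function names and parameter defaults are ours.

```python
import numpy as np
from scipy.ndimage import gaussian_filter, convolve1d

def gabor_pair(tau):
    """Quadrature pair of 1D temporal Gabor filters, frequency f = 4/tau."""
    f = 4.0 / tau
    half = int(np.ceil(3 * tau))
    t = np.arange(-half, half + 1, dtype=float)
    env = np.exp(-t**2 / (2 * tau**2))
    h_ev = np.cos(2 * np.pi * f * t) * env
    h_od = np.sin(2 * np.pi * f * t) * env
    return h_ev / np.linalg.norm(h_ev), h_od / np.linalg.norm(h_od)

def detector_response(video, sigma=2.0, tau=1.5):
    """video: (T, H, W) array. Smooth each frame spatially at scale sigma,
    filter temporally with the Gabor quadrature pair at scale tau, and
    return the squared response R at every pixel; interest points are
    the local maxima of R."""
    smoothed = gaussian_filter(video.astype(float), sigma=(0, sigma, sigma))
    h_ev, h_od = gabor_pair(tau)
    r_ev = convolve1d(smoothed, h_ev, axis=0, mode='nearest')
    r_od = convolve1d(smoothed, h_od, axis=0, mode='nearest')
    return r_ev**2 + r_od**2
```

Static regions produce (near-)zero response, while periodically varying intensities produce strong responses at their location, which is what makes the detector suitable for repetitive human motions.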
Alternatives to the space-time volumes are possible. For crowded scenes in surveillance data, we have successfully used spatiotemporal motion descriptors from the literature instead.
3.2. Matching Spatial-Temporal Words and Building VSN
In order to take into account the relationships between detected visual words, we apply a graph matching algorithm to each pair of sequences to determine feature-level similarities. In previous work on image clustering, quadratic matching techniques such as spectral matching are used to match nodes from two graphs by jointly considering the consistency of their feature values and their spatial arrangements. However, a direct application of these techniques is not possible here: while that work operates on sets of images, our problem is concerned with sets of videos, and quadratic matching, in particular spectral matching, is too computationally expensive for video data, where the number of features per sequence is large. Thus, for better computational efficiency, we replace the spectral matching technique with the Hungarian method, a linear assignment matching approach, but augment the original spatial-temporal features with their associated shape context descriptors.

The shape context feature was proposed by Belongie et al. for matching two shapes using sparse points extracted on their boundaries. Given a set of features, the shape context descriptor of a feature is a histogram of the relative locations of all other features with respect to itself, expressed in a polar coordinate system. In our scenario, since the activities are periodic, we only consider the spatial distributions, that is, the 2D polar coordinates of the visual features. The incorporation of shape context features discourages matches between a noisy word from the background and a legitimate one. Although the feature value of a noisy word can be very similar to that of a genuine one, its shape context descriptor will differ, because noisy words occur at random places in the video sequences while genuine features from the activities of interest are usually centered around a specific location, for example, the human body.
This way, although the Hungarian method itself does not consider the locations of the matched features, the spatial arrangement of these features is implicitly modeled by augmenting every spatial-temporal word with its shape context descriptor.
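The two steps above, computing shape context descriptors and solving the linear assignment, can be sketched as follows. This is a sketch under our own assumptions (Euclidean appearance distances, illustrative binning parameters, and a weighting factor `alpha` of our choosing), not the paper's exact implementation.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def shape_context(points, n_r=5, n_theta=12):
    """Per-point 2D polar histogram (log-radial bins) of the relative
    locations of all other points, normalized to sum to 1."""
    pts = np.asarray(points, float)
    n = len(pts)
    diff = pts[None, :, :] - pts[:, None, :]        # diff[i, j] = pts[j] - pts[i]
    r = np.linalg.norm(diff, axis=2)
    theta = np.arctan2(diff[..., 1], diff[..., 0])  # angle in [-pi, pi]
    mean_r = r[r > 0].mean()                        # scale normalization
    r_edges = np.logspace(np.log10(0.125), np.log10(2.0), n_r + 1) * mean_r
    desc = np.zeros((n, n_r * n_theta))
    for i in range(n):
        mask = np.arange(n) != i
        r_bin = np.searchsorted(r_edges, r[i, mask]) - 1
        t_bin = ((theta[i, mask] + np.pi) / (2 * np.pi) * n_theta).astype(int) % n_theta
        ok = (r_bin >= 0) & (r_bin < n_r)           # drop points outside the radial range
        np.add.at(desc[i], r_bin[ok] * n_theta + t_bin[ok], 1.0)
        desc[i] /= max(desc[i].sum(), 1e-9)
    return desc

def match_words(feats_a, pts_a, feats_b, pts_b, alpha=1.0):
    """Augment each spatio-temporal word with its shape context and
    solve a linear assignment with the Hungarian method."""
    da = np.hstack([feats_a, alpha * shape_context(pts_a)])
    db = np.hstack([feats_b, alpha * shape_context(pts_b)])
    cost = np.linalg.norm(da[:, None, :] - db[None, :, :], axis=2)
    rows, cols = linear_sum_assignment(cost)
    return rows, cols, cost[rows, cols]
```

Because the shape context is appended to the appearance descriptor, a background word whose appearance happens to resemble a genuine one still incurs a high matching cost when its spatial context differs.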
where c(i, j) is the matching cost between features i and j. In our experiments we compute the link weights from the distance between the two feature vectors, both with and without the shape context features appended. For normalizing the weights we follow the approach outlined in prior work.
The aim of the next step is to identify the strongest and most consistent features in each of the videos. We do this by extracting, for each video m, the subgraph G_m from our original VSN that contains the nodes from video m as well as all other nodes in the VSN that are connected to the nodes from m; links whose endpoints both lie outside this node set are discarded. Then, we apply PageRank to the subgraph G_m. The intuition behind the application of PageRank is that nodes that are referenced (linked) often by important nodes are considered important as well. After PageRank, the features with high ranking values are those that are highly relevant and most consistent in video m.
r = ε t + (1 − ε)(W + D)^T r,

where W is the (row-normalized) weight matrix of G_m, ε is a weighting constant set to 0.1, t is the transport vector representing the initial prior over G_m (set to the uniform distribution here), and D = d t^T, where d is the n-dimensional indicator vector identifying the nodes with zero outdegree and n is the dimension of the transport vector. The final ranking value of each node represents its relative importance in the VSN G_m.
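The fixed point of this equation can be found by standard power iteration. The following is a minimal sketch, not the authors' code; the teleport weight of 0.1 follows the text, while the tolerance and iteration cap are our choices.

```python
import numpy as np

def pagerank(W, eps=0.1, tol=1e-12, max_iter=1000):
    """PageRank by power iteration on a weighted link matrix W, where
    W[i, j] is the weight of the link i -> j. eps is the teleport
    weight; the mass of dangling nodes (zero outdegree) is
    redistributed via the uniform transport vector t."""
    n = W.shape[0]
    out = W.sum(axis=1)
    dangling = out == 0
    # row-normalize the weights into a (sub)stochastic matrix
    P = np.divide(W, out[:, None], out=np.zeros_like(W, dtype=float),
                  where=out[:, None] > 0)
    t = np.full(n, 1.0 / n)
    r = t.copy()
    for _ in range(max_iter):
        r_new = eps * t + (1 - eps) * (P.T @ r + r[dangling].sum() * t)
        if np.abs(r_new - r).sum() < tol:
            r = r_new
            break
        r = r_new
    return r
```

Nodes that receive many (or heavily weighted) incoming links from other well-ranked nodes accumulate the largest scores, which matches the intuition stated above.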
3.2.2. Structure Similarity
After computing PageRank, we evaluate the structure similarity between pairs of nodes. Here, we follow the reasoning in [12, 17]: nodes with a similar set of links, that is, nodes that are pointed to by a similar set of nodes and that point to a similar set of nodes, will most likely belong to the same category. Blondel et al. use this technique to find synonyms in text documents.
The goal of computing structure similarity is to identify which nodes in the graph are true hub nodes. To do this, we take the graph G_m and compare it with the path graph G_s (1 → 2 → 3), and see which node(s) in G_m are most similar to node 2, the center node, in G_s. Therefore, G_s is the graph we compare G_m to, and B in (5) is its adjacency matrix representation. Let S be the resulting similarity scores. To solve for S, we use the following formulation, an approach proposed in [17] to compare nodes from different graphs, to solve (6).
S_{k+1} = (B S_k A^T + B^T S_k A) / ‖B S_k A^T + B^T S_k A‖_F

for k = 0, 1, 2, …. Here, S_k is a 3 × n matrix, initially set to a matrix with all entries 1, A is the adjacency matrix of G_m, and ‖·‖_F is the Frobenius norm. Upon convergence, the structure similarity value for each neighbor of a feature i is given by its similarity to the center node of G_s. A node with a higher score shares many common nodes with i. The process is repeated for each feature, which gives us an n × n matrix S.
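The coupled iteration can be sketched as follows. This is our own sketch of the Blondel et al. update against the 3-node path graph; the iteration count (an even number, since the Blondel iteration converges on the even subsequence) is an assumption on our part.

```python
import numpy as np

def hub_scores(A, n_iter=50):
    """Structure-similarity scores of the nodes of a directed graph with
    adjacency matrix A against the center node of the path graph
    1 -> 2 -> 3 (Blondel et al.'s coupled iteration). Returns one score
    per node; hub-like nodes score highest."""
    B = np.array([[0., 1., 0.],
                  [0., 0., 1.],
                  [0., 0., 0.]])        # path graph 1 -> 2 -> 3
    n = A.shape[0]
    S = np.ones((3, n))                 # S[p, i]: similarity of node i to path node p
    for _ in range(2 * (n_iter // 2)):  # even number of iterations
        M = B @ S @ A.T + B.T @ S @ A
        S = M / np.linalg.norm(M)       # Frobenius normalization
    return S[1]                         # similarity to the center node 2
```

A node that both receives and emits links, like the center of the path graph, ends up with the highest score, which is exactly the hub behavior the text describes.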
3.2.3. Spectral Clustering
Given an n × n matrix encoding the similarity scores between n instances, spectral clustering partitions these instances into K clusters, where K is a predefined value. With the affinity matrix at hand, we apply spectral clustering on the nearest-neighbor graph to uncover the underlying activities.
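A normalized-cuts-style variant of this step can be sketched as follows. This is a generic sketch, not the specific spectral clustering implementation used in the paper; the small k-means routine with deterministic farthest-point initialization is included only to keep the example self-contained.

```python
import numpy as np

def _kmeans(X, k, n_iter=50):
    # Lloyd's k-means with deterministic farthest-point initialization
    centers = [X[0]]
    for _ in range(k - 1):
        d = np.min([np.linalg.norm(X - c, axis=1) for c in centers], axis=0)
        centers.append(X[np.argmax(d)])
    centers = np.array(centers)
    for _ in range(n_iter):
        labels = np.argmin(np.linalg.norm(X[:, None] - centers[None], axis=2), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels

def spectral_clustering(S, k):
    """Cluster n instances given an n x n affinity matrix S into k
    groups: embed the instances with the top-k eigenvectors of the
    symmetrically normalized affinity, then run k-means on the
    row-normalized embedding."""
    d = S.sum(axis=1)
    inv_sqrt = 1.0 / np.sqrt(np.maximum(d, 1e-12))
    L = inv_sqrt[:, None] * S * inv_sqrt[None, :]
    _, vecs = np.linalg.eigh((L + L.T) / 2)   # eigenvalues in ascending order
    X = vecs[:, -k:]                          # top-k eigenvectors
    X /= np.maximum(np.linalg.norm(X, axis=1, keepdims=True), 1e-12)
    return _kmeans(X, k)
```

On an affinity matrix with strong within-group and weak between-group similarities, the leading eigenvectors place each group at a distinct point in the embedding, so k-means separates them cleanly.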
4. Experimental Results

In this section, we apply our algorithm to standard datasets and show that it performs well compared to state-of-the-art approaches. In detail, we test our approach on the following:
the KTH dataset, which is the largest one,
the skating dataset, where we show that our approach is able to handle cluttered backgrounds as well as video data from a moving camera,
real-world surveillance data, where our approach was able to cope even with subtle movements.
In all the evaluations, the features are reduced to 100-D vectors using PCA. In practice, the target dimensionality could be set using cross-validation; in our case, we set it to 100 for the sake of comparison. The values of σ and τ can vary across datasets, but they are set to the same values as in previous work.
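For completeness, the dimensionality reduction step can be sketched as standard PCA via SVD. The function name and return convention below are ours, not the paper's.

```python
import numpy as np

def pca_reduce(X, d=100):
    """Project row-wise descriptors onto their top-d principal
    components (as used here to reduce the descriptors to 100-D).
    Returns the projected data, the d principal axes, and the mean."""
    mu = X.mean(axis=0)
    Xc = X - mu
    # rows of Vt are the principal axes, sorted by decreasing variance
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:d].T, Vt[:d], mu
```

The returned axes and mean allow new descriptors to be projected into the same 100-D space at test time.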
4.1. KTH Dataset
Confusion matrix for the KTH dataset. The average performance is 91.3%. "box", "hc", "hw", "j/r", and "walk" represent boxing, handclapping, handwaving, jogging/running, and walking, respectively. For example, row one means that, of all the boxing sequences, 84% are classified correctly and 16% are classified as handclapping.
4.2. Skating Dataset
As a second experiment, we apply our approach to a real-world skating dataset. We extract 24 video sequences from the dataset and apply the same process to uncover three activities: stand-spin, sit-spin, and camel-spin. The detector parameters σ and τ are adjusted for this dataset when extracting the spatiotemporal interest points, which are then described by the corresponding PCA-reduced 3D gradient descriptors.
Confusion matrix for the skating dataset. The average performance is 83.4%.
Performance comparison between different methods, from top to bottom: pLSA, our approach without shape context features (SCFs), and our approach with SCFs.
w/o shape features
with shape features
4.3. Real World Surveillance Video
Confusion matrix for the surveillance video. The average performance is 81.5%. "pick", "scan", and "drop" represent pickup, scanning, and drop, respectively.
5. Conclusion

In this paper, we proposed a link-analysis-based approach to unsupervised activity recognition. Different from previous approaches based on bag of words models, the link-analysis approach takes into account the spatiotemporal relationships between visual words in the matching process. We see this as the major reason for the good performance of our approach. Furthermore, we have tested the link-analysis approach on a variety of test videos: the KTH data, which is the largest dataset; the skating video data, where our approach demonstrated its ability to deal with cluttered backgrounds and moving cameras; and the surveillance data, where our approach was able to cope even with very subtle hand movements.
During our tests of the link-analysis approach on the different datasets, we also compared different variants, that is, (a) with shape context features (SCFs), (b) without SCFs, and (c) a state-of-the-art approach using pLSA. Future work will deal with multiple moving individuals/objects in the video data. We would also like to evaluate the performance of our approach with a better matching algorithm, for example, quadratic assignment.
This work was partially funded by the EU-project PACOPlus, IST-FP6-IP-027657.
- Wang X, Ma X, Grimson E: Unsupervised activity perception by hierarchical bayesian models. Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2007, Minneapolis, Minn, USAGoogle Scholar
- Krueger V, Kragic D, Ude A, Geib C: The meaning of action: a review on action recognition and mapping. International Journal on Advanced Robotics 2007,21(13):1473-1501.Google Scholar
- Moeslund TB, Hilton A, Krüger V: A survey of advances in vision-based human motion capture and analysis. Computer Vision and Image Understanding 2006,104(2-3):90-126. 10.1016/j.cviu.2006.08.002View ArticleGoogle Scholar
- Niebles JC, Fei-Fei L: A hierarchical model of shape and appearance for human action classification. Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2007, Minneapolis, Minn, USAGoogle Scholar
- Niebles JC, Wang H, Fei-Fei L: Unsupervised learning of human action categories using spatial-temporal words. International Journal of Computer Vision 2008,79(3):299-318. 10.1007/s11263-007-0122-4View ArticleGoogle Scholar
- Savarese S, Del Pozo A, Niebles JC, Fei-Fei L: Spatial-temporal correlations for unsupervised action classification. Proceedings of the IEEE Workshop on Motion and Video Computing, 2008, Copper Mountain, Colo, USAGoogle Scholar
- Wang X, Ma KT, Ng G-W, Grimson WEL: Trajectory analysis and semantic region modeling using a nonparametric bayesian model. Proceedings of the 26th IEEE Conference on Computer Vision and Pattern Recognition (CVPR '08), 2008Google Scholar
- Hofmann T: Probabilistic latent semantic analysis. Proceedings of the 15th Conference on Uncertainty in Artificial Intelligence (UAI '99), 1999, Stockholm, SwedenGoogle Scholar
- Blei DM, Ng AY, Jordan MI: Latent dirichlet allocation. Journal of Machine Learning Research 2003,3(4-5):993-1022.MATHGoogle Scholar
- Teh YW, Jordan MI, Beal MJ, Blei DM: Hierarchical Dirichlet processes. Journal of the American Statistical Association 2006,101(476):1566-1581. 10.1198/016214506000000302View ArticleMathSciNetMATHGoogle Scholar
- Brin S, Page L: The anatomy of a large-scale hypertextual web search engine. Proceedings of the 7th International Conference on World Wide Web, 1998 7: 107-117.Google Scholar
- Kim G, Faloutsos C, Hebert M: Unsupervised modeling of object categories using link analysis techniques. Proceedings of the 26th IEEE Conference on Computer Vision and Pattern Recognition (CVPR '08), June 2008, Anchorage, Alaska, USAGoogle Scholar
- Kim G, Faloutsos C, Hebert M: Unsupervised modeling and recognition of object categories with combination of visual contents and geometric similarity links. Proceedings of the 1st International ACM Conference on Multimedia Information Retrieval (MIR '08), 2008, British Columbia, Canada 419-426.View ArticleGoogle Scholar
- Leordeanu M, Hebert M: A spectral technique for correspondence problems using pairwise constraints. Proceedings of the IEEE International Conference on Computer Vision, 2005, Beijing, China 2: 1482-1489.Google Scholar
- Kuhn HW: The Hungarian method for the assignment problem. Naval Research Logistics Quarterly 1955, 2: 83-97.Google Scholar
- Belongie S, Malik J, Puzicha J: Shape matching and object recognition using shape contexts. IEEE Transactions on Pattern Analysis and Machine Intelligence 2002,24(4):509-522. 10.1109/34.993558View ArticleGoogle Scholar
- Blondel VD, Gajardo A, Heymans M, Senellart P, Van Dooren P: A measure of similarity between graph vertices: applications to synonym extraction and web searching. SIAM Review 2004,46(4):647-666. 10.1137/S0036144502415960View ArticleMathSciNetMATHGoogle Scholar
- Najork M, Craswell N: Efficient and effective link analysis with precomputed SALSA maps. Proceedings of the International Conference on Information and Knowledge Management, 2008, Napa Valley, Calif, USA 53-62.Google Scholar
- Thelwall M: Link Analysis: An Information Science Approach. Academic Press, San Diego, Calif, USA; 2004.Google Scholar
- Turaga P, Chellappa R, Subrahmanian VS, Udrea O: Machine recognition of human activities: a survey. IEEE Transactions on Circuits and Systems for Video Technology 2008,18(11):1473-1488.View ArticleGoogle Scholar
- Wang L, Hu W, Tan T: Recent developments in human motion analysis. Pattern Recognition 2003,36(3):585-601. 10.1016/S0031-3203(02)00100-0View ArticleGoogle Scholar
- Ramanan D, Forsyth DA: Automatic annotation of everyday movements. Proceedings of the Neural Information Processing Systems (NIPS '03), 2003Google Scholar
- Fanti C, Zelnik-Manor L, Perona P: Hybrid models for human motion recognition. Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '05), 2005, San Diego, Calif, USA 1: 1166-1173.Google Scholar
- Gavrila DM: The visual analysis of human movement: a survey. Computer Vision and Image Understanding 1999,73(1):82-98. 10.1006/cviu.1998.0716View ArticleMATHGoogle Scholar
- Ikizler N, Forsyth D: Searching video for complex activities with finite state models. Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2007, Minneapolis, Minn, USAGoogle Scholar
- Bobick AF, Davis JW: The recognition of human movement using temporal templates. IEEE Transactions on Pattern Analysis and Machine Intelligence 2001,23(3):257-267. 10.1109/34.910878View ArticleGoogle Scholar
- Efros AA, Berg AC, Mori G, Malik J: Recognizing action at a distance. Proceedings of the IEEE International Conference on Computer Vision, 2003, Nice, France 2: 726-733.View ArticleGoogle Scholar
- Blank M, Gorelick L, Shechtman E, Irani M, Basri R: Actions as space-time shapes. Proceedings of the IEEE International Conference on Computer Vision, 2005, Beijing, China 2: 1395-1402.Google Scholar
- Shechtman E, Irani M: Space-time behavior based correlation. Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2005, San Diego, Calif, USA 1: 405-412.Google Scholar
- Rodriguez MD, Ahmed J, Shah M: Action MACH: a spatio-temporal maximum average correlation height filter for action recognition. Proceedings of the 26th IEEE Conference on Computer Vision and Pattern Recognition (CVPR '08), 2008, Anchorage, Alaska, USAGoogle Scholar
- Brand M, Oliver N, Pentland A: Coupled hidden Markov models for complex action recognition. Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 1997, San Juan, Puerto Rico, USA 994-999.View ArticleGoogle Scholar
- Oliver N, Garg A, Horvitz E: Layered representations for learning and inferring office activity from multiple sensory channels. Computer Vision and Image Understanding 2004,96(2):163-180. 10.1016/j.cviu.2004.02.004View ArticleGoogle Scholar
- Bobick AF, Ivanov YA: Action recognition using probabilistic parsing. Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 1998, Santa Barbara, Calif, USA 196-202.Google Scholar
- Quattoni A, Wang S, Morency L-P, Collins M, Darrell T: Hidden conditional random fields. IEEE Transactions on Pattern Analysis and Machine Intelligence 2007,29(10):1848-1853.View ArticleGoogle Scholar
- Laptev I, Lindeberg T: Space-time interest points. Proceedings of the IEEE International Conference on Computer Vision, 2003, Nice, France 1: 432-439.View ArticleMATHGoogle Scholar
- Dollar P, Rabaud V, Cottrell G, Belongie S: Behavior recognition via sparse spatiotemporal features. Proceedings of the IEEE International Workshop on Visual Surveillance and Performance Evaluation of Tracking and Surveillance (PETS '05), 2005, Beijing, ChinaGoogle Scholar
- Schüldt C, Laptev I, Caputo B: Recognizing human actions: a local SVM approach. Proceedings of the International Conference on Pattern Recognition, 2004, Cambridge, UK 3: 32-36.Google Scholar
- Song Y, Chen W-Y, Bai H, Lin C-J, Chang EY: Parallel spectral clustering. Proceedings of the European Conference on Machine Learning (ECML '08), 2008Google Scholar
- Wang Y, Jiang H, Drew MS, Li Z-N, Mori G: Unsupervised discovery of action classes. Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '06), 2006, New York, NY, USA 2: 1654-1661.Google Scholar
- Cour T, Srinivasan P, Shi J: Balanced graph matching. Proceedings of the Advances in Neural Information Processing Systems (NIPS '06), 2006, Cambridge, Mass, USAGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.