
Unsupervised Action Classification Using Space-Time Link Analysis


We address the problem of unsupervised discovery of action classes in video data. Different from all existing methods thus far proposed for this task, we present a space-time link analysis approach which consistently matches or exceeds the performance of traditional unsupervised action categorization methods in various datasets. Our method is inspired by the recent success of link analysis techniques in the image domain. By applying these techniques in the space-time domain, we are able to naturally take into account the spatiotemporal relationships between the video features, while leveraging the power of graph matching for action classification. We present a comprehensive set of experiments demonstrating that our approach is capable of handling cluttered backgrounds, activities with subtle movements, and video data from moving cameras. State-of-the-art results are reported on standard datasets. We also demonstrate our method in a compelling surveillance application with the goal of avoiding fraud in retail stores.

1. Introduction

How to automatically discover and recognize activities in video data is an important topic in computer vision. A solution to this problem will not only facilitate applications such as video retrieval or summarization, but will also improve, for example, automatic video surveillance systems [1] and human-machine/robot communication [2]. In addition to its importance for many practical applications, unsupervised action categorization is important in the context of machine learning, particularly with respect to how video processing approaches could allow a high-level "understanding" of the data.

Numerous techniques have been proposed to solve the action classification problem [3]. Video analysis techniques must meet manifold requirements, such as dealing with cluttered backgrounds, camera motion, occlusion, and geometric and photometric variability [1, 4, 5]. Recently, unsupervised methods based on bags of visual words have become very popular, as they achieve excellent performance on standard datasets [6] and long surveillance videos [1, 7].

Generally, these unsupervised algorithms extract spatiotemporal feature descriptors called video words and then use document-topic models such as pLSA [8], LDA [9], or HDP [10] to discover latent topics [1, 5, 7]. A common limitation of these models is that they usually do not consider spatiotemporal correlations among visual words unless the correlations are represented explicitly [6]. Another general limitation is that some of these methods are EM-based learning approaches, which makes recursive learning and updating difficult.

In this paper we introduce link analysis-based techniques for unsupervised activity discovery in video data that naturally preserve the spatiotemporal topology among the video words. Link analysis techniques are known from data mining, the information retrieval research communities, and the WWW [11]. They were largely ignored in computer vision until their recent introduction to the community by Kim et al. [12, 13], who applied link analysis to unsupervised image clustering with impressive results.

Our link analysis approach for video processing is structured as follows (see Figure 1). The first step is to extract spatiotemporal features from the video data. Then, we construct a visual similarity network (VSN) [12] by computing the pairwise similarity between the features. Here, for better efficiency, we replace the spectral matching approach [14] used in [12, 13] with a combination of linear matching [15] and the shape context descriptor [16]. After all the video sequences have been matched pairwise, each feature has established links with other features. The weights of these links are given by the result of the matching, that is, by how similar two features are. The features together with the links form a giant VSN, shown as the output of the matching process in Figure 1.

Figure 1

This flow chart summarizes our approach: given a set of video sequences, we start with the extraction of spatiotemporal interest points. The interest points extracted from each sequence are then matched pairwise using the Hungarian method with shape context features incorporated. The matching scores, which encode the similarities between any two features, are then refined using PageRank and structure similarity to further enhance distinctive features while suppressing noisy ones. Finally, the two resulting refined scores are jointly fed into a clustering algorithm.

Next, the VSN is analyzed separately using two link analysis techniques, PageRank [11] and structure similarity (SS) [17]. The PageRank algorithm outputs a score for each feature indicating the number of similar features it has, while the structure similarity gives the likelihood of a feature being a hub node. The intuition is that genuine features should be similar to one another and thus have high ranking values. The PageRank and the structure similarity scores together form an affinity matrix between all video sequences.

Here, we interpret the pairwise matching weights as votes for the importance of the nodes which allows a quick division between consistent nodes and irrelevant ones (e.g., those from the background). Eventually, as shown in Figure 1, spectral clustering is applied to the affinity matrix to identify potential action categories. Link analysis techniques have been shown to be able to detect consistent matches (hubs) very effectively and efficiently [11, 12, 18, 19]. All computation and inference is done on the link weights between the nodes in the VSN which makes it fast and efficient.

The key contributions of our work are as follows.

  (i) We extend link analysis techniques to the spatiotemporal domain and show that unsupervised discovery of action classes can greatly benefit from such an approach. For this we apply necessary revisions (feature representation, matching techniques, etc.) to the approach presented in [12] to make it efficiently applicable to video data. We report results that either match or exceed the performance of the state-of-the-art techniques on various datasets.

  (ii) We demonstrate that our approach can be applied to action clustering in real surveillance videos and show a compelling application to avoid fraud in retail stores.

The paper is organized as follows: in Section 2, we review related literature on activity recognition. Section 3 describes our approach in detail, including the spatiotemporal interest point detector, the matching process, and link analysis techniques. In Section 4, we show the performance of our approach on standard datasets and a real surveillance application. Finally, Section 5 concludes our paper.

2. Related Work

Many methods have been proposed to address the problem of action recognition and analysis in video sequences [3, 20, 21]. Specifically for human action modeling, a variety of techniques rely on tracking body parts (e.g., arms and legs) to classify human actions [22, 23]. The classical review of [24] covers a significant amount of work that falls into this category. Although very promising results have been achieved recently in distinguishing activities under large viewpoint changes [25], it is often difficult to accurately detect and track body parts in complex environments.

Template-based approaches make use of spatiotemporal patterns to match and identify specific actions in videos. Bobick and Davis [26] use motion history images—a.k.a temporal templates—for action classification. Efros et al. [27] introduce a spatiotemporal descriptor that works well on low-resolution videos. Blank et al. [28] represent actions as space-time shape volumes for classification. Shechtman and Irani [29] propose a similarity metric between video patches based on intensity variation. A common drawback of these template-based methods is their inability to generalize from a collection of examples and create a single template which captures the intraclass variability of an action. More recently, Rodriguez et al. [30] address this problem using a MACH filter.

State-space models have been widely applied to short-term action recognition and to more complex behavior analysis involving object interactions and activities at multiple levels of temporal granularity. Examples include Hidden Markov Models and their variations such as coupled HMMs [31] and layered HMMs [32], stochastic grammars [33], and conditional random fields [34]. The majority of these methods are supervised, requiring manual labeling of video clips. When the state space is large, the estimation of many parameters makes the learning process more difficult.

Bag of words models have recently shown great promise in action classification. These approaches in general extract sparse space-time interest points [35, 36] from the video sequences and then apply either discriminative or generative models for categorizing the activities. Highly discriminative results are obtained using SVM classifiers based on these descriptors under a supervised learning framework [36, 37]. Recently, Niebles and Fei-Fei [4] enhance this approach by proposing a novel model characterized as a constellation of bags-of-features, which encodes both the shape and appearance of the actor.

Unsupervised methods have also been proposed using the bag of words model (see the general discussion in Section 1). Closest to our work are probably Niebles et al. [5] and Wang et al. [1]. Niebles et al. [5] use a generative model based on pLSA to cluster activities. Wang et al. [1] use a hierarchical Bayesian model to cluster activities and interactions in surveillance videos. Although these methods achieve excellent results in real world video data, they omit any global spatiotemporal structure information among the video words. More recently, Savarese et al. [6] used spatiotemporal correlograms to encode flexible long range temporal information into the local features.

Different from all methods thus far proposed for unsupervised action categorization, we address this problem using a link analysis-based approach. Specifically, we apply link analysis algorithms in the spatiotemporal domain to automatically discover actions in video sequences. By using link analysis, we are able to naturally take into account the spatiotemporal relationships between detected interest points, while leveraging the power of graph matching for action classification. Experimental results show that this approach performs well compared to the state-of-the-art techniques on standard datasets and works very well in real surveillance scenarios. More details about our algorithm follow in the next section.

3. Link-Analysis for Spatiotemporal Features

In this section, we break down our approach into its major components and introduce each of them in detail: the types of features we use, the matching of features augmented with shape context descriptors, PageRank, the structure similarity computation, and spectral clustering. Figure 1 shows the flow chart of our approach.

3.1. Extraction of Spatio-Temporal Features

The first step of our action classification approach is to extract spatiotemporal interest points from the input video sequences. The two most recent spatiotemporal interest point detectors were proposed by Laptev and Lindeberg [35] and Dollar et al. [36], respectively.

We use the interest point detector proposed by Dollar et al. [36] in order to get denser spatiotemporal visual words. For a video sequence with pixel values $I(x, y, t)$, separable linear filters are applied to the video in order to obtain the response function as follows:

$$R = (I * g * h_{ev})^2 + (I * g * h_{od})^2, \tag{1}$$

where $*$ indicates the convolution, $g(x, y; \sigma)$ is the 2D Gaussian smoothing kernel applied only along the spatial dimensions $(x, y)$, and $h_{ev}$ and $h_{od}$ are a quadrature pair of 1D Gabor filters applied temporally, which are defined as

$$h_{ev}(t; \tau, f) = -\cos(2\pi t f)\, e^{-t^2/\tau^2}, \qquad h_{od}(t; \tau, f) = -\sin(2\pi t f)\, e^{-t^2/\tau^2}. \tag{2}$$

The two parameters $\sigma$ and $\tau$ correspond to the spatial and temporal scales of the detector, respectively. The frequency of the harmonic functions is given by $f$. In all cases we use $f = 4/\tau$, as in [5].
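The detector response can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the implementation of [36]: it assumes the usual form of this detector (squared outputs of an even and an odd temporal Gabor filter applied to the spatially smoothed video, with the frequency tied to the temporal scale as 4/tau), and the function names, kernel truncation radii, and the sampling of time in whole frames are choices made for the example.

```python
import numpy as np

def gaussian_kernel_1d(sigma):
    """Normalised 1D Gaussian; applied along y and x it gives the 2D kernel."""
    radius = int(np.ceil(3 * sigma))          # truncation radius: an assumption
    x = np.arange(-radius, radius + 1)
    k = np.exp(-(x ** 2) / (2 * sigma ** 2))
    return k / k.sum()

def gabor_pair(tau, freq):
    """Quadrature pair of temporal Gabor filters (even and odd)."""
    radius = int(np.ceil(2 * tau))            # truncation radius: an assumption
    t = np.arange(-radius, radius + 1)
    envelope = np.exp(-(t ** 2) / tau ** 2)
    h_ev = -np.cos(2 * np.pi * t * freq) * envelope
    h_od = -np.sin(2 * np.pi * t * freq) * envelope
    return h_ev, h_od

def convolve_axis(volume, kernel, axis):
    """Convolve every 1D slice of `volume` along `axis` with `kernel`."""
    return np.apply_along_axis(
        lambda v: np.convolve(v, kernel, mode="same"), axis, volume)

def response_function(video, sigma, tau, freq=None):
    """video: float array of shape (T, H, W). Returns R with the same shape."""
    if freq is None:
        freq = 4.0 / tau                      # assumed coupling of frequency and scale
    g = gaussian_kernel_1d(sigma)
    smoothed = convolve_axis(convolve_axis(video, g, 1), g, 2)   # spatial only
    h_ev, h_od = gabor_pair(tau, freq)
    even = convolve_axis(smoothed, h_ev, 0)                      # temporal only
    odd = convolve_axis(smoothed, h_od, 0)
    return even ** 2 + odd ** 2

# Toy video: a small patch that flickers on and off, on a static background.
video = np.zeros((32, 16, 16))
video[::2, 7:10, 7:10] = 1.0
R = response_function(video, sigma=1.5, tau=2.5)
```

In this toy example the response at the flickering patch is far larger than at the static corner; interest points would then be taken at local maxima of `R`.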

Any region with spatially distinguishing characteristics undergoing a complex, nontranslational motion induces a strong response [36]. At these interest points, we extract spatiotemporal volumes (cuboids). Later we calculate the brightness gradients within these volumes and concatenate them to form a feature vector. PCA is then used to reduce the dimensions of these feature vectors. Figure 2 shows the extracted interest points on a few sequences from the KTH dataset [37]. Considering Figure 2(c) as an example, we can see that the interest points occur at places around the arms, where the up-and-down motion induces strong responses.
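The descriptor step described above (brightness gradients of a cuboid, concatenated and then reduced with PCA) might look as follows. This is a minimal sketch under stated assumptions: the cuboid size, the use of `np.gradient`, the SVD-based PCA, and the names `cuboid_descriptor` and `pca_reduce` are illustrative choices, not details taken from the paper.

```python
import numpy as np

def cuboid_descriptor(video, center, half_size=(2, 2, 2)):
    """Concatenated brightness gradients of the cuboid around an interest point.

    video: array (T, H, W); center: (t, y, x); half_size: hypothetical extents.
    """
    t, y, x = center
    ht, hy, hx = half_size
    cuboid = video[t - ht:t + ht + 1, y - hy:y + hy + 1, x - hx:x + hx + 1]
    g_t, g_y, g_x = np.gradient(cuboid.astype(float))
    return np.concatenate([g_x.ravel(), g_y.ravel(), g_t.ravel()])

def pca_reduce(descriptors, n_components):
    """Project the descriptors onto their first principal components."""
    centered = descriptors - descriptors.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T

# Toy data: random video and random interest point locations away from the border.
rng = np.random.default_rng(0)
video = rng.random((20, 20, 20))
centers = rng.integers(2, 18, size=(20, 3))
descriptors = np.array([cuboid_descriptor(video, c) for c in centers])
reduced = pca_reduce(descriptors, n_components=10)
```

The toy example keeps only 10 components because it has just 20 descriptors; with enough cuboids the same call would be made with the target dimensionality used in the experiments.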

Figure 2

Sample sequences with detected interest points using the approach in Section 3.1 for the KTH dataset. From (a) to (f), the activities are boxing, handclapping, handwaving, jogging, running, and walking. Note that these interest points are detected at places where complex, nontranslational motions occur.

Alternatives to the space-time volumes are possible. For crowded scenes in surveillance data, we have successfully used the spatiotemporal motion descriptors from [27].

3.2. Matching Spatial-Temporal Words and Building VSN

Suppose we have a set of $m$ video sequences, where sequence $a$ ($a = 1, \ldots, m$) has $n_a$ spatiotemporal features, and the total number of features in all sequences is $N = \sum_{a} n_a$.

In order to take into account the relationships between detected visual words, we apply a graph matching algorithm to each pair of sequences to determine feature-level similarities. In [12], quadratic matching techniques such as [14] are used to match nodes from two graphs by jointly considering the consistency of their feature values and their spatial arrangements. However, the direct application of the techniques from [12] is not possible: while the work in [12] is applied to sets of images, our problem is concerned with sets of videos, for which the techniques from [12], in particular the spectral matching [14], are too inefficient. For example, the spectral matching has a complexity of , where is the number of features. Thus, for better computational efficiency, we replace the spectral matching technique with the Hungarian method [15], a linear assignment matching approach, but augment the original spatial-temporal features with their associated shape context descriptors [16]. The shape context feature was proposed by Belongie et al. [16] for matching the shapes of two objects using sparse points extracted on their boundaries. Given a set of features, the shape context descriptor of a feature is a histogram of the relative locations of all other features with respect to itself in a polar coordinate system. In our scenario, since the activities are periodic, we only consider the spatial distributions, that is, the 2D polar coordinates of the visual features. The incorporation of shape context features discourages the matching of a noisy word from the background with a legitimate one. The reason is that, although the feature value of the noisy word could be very similar to that of a genuine one, its shape context descriptor would say otherwise: noisy words often occur at random places in the video sequences, while the genuine features from the activities of interest are usually centered around a specific location, for example, the human body. In this way, although the Hungarian method itself does not consider the locations of matched features, augmenting every spatial-temporal word with its shape context descriptor implicitly models the spatial arrangement of these features.
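The matching step can be illustrated with a small self-contained sketch: a log-polar shape context histogram per feature, a cost matrix on the augmented features, and a compact textbook implementation of the Hungarian method. Several details are assumptions made for the example rather than taken from the paper: the bin counts, the radial range, concatenating appearance and shape context with a hypothetical `weight`, and the Euclidean cost.

```python
import numpy as np

def shape_context(points, n_r=5, n_theta=12):
    """Log-polar histogram of the relative 2D locations of all other points."""
    pts = np.asarray(points, dtype=float)
    n = len(pts)
    diff = pts[None, :, :] - pts[:, None, :]            # diff[i, j] = pts[j] - pts[i]
    dist = np.hypot(diff[..., 0], diff[..., 1])
    angle = np.arctan2(diff[..., 1], diff[..., 0]) % (2 * np.pi)
    off_diag = ~np.eye(n, dtype=bool)
    scale = dist[off_diag].mean()                       # scale normalisation
    inner_edges = np.logspace(np.log10(0.125), np.log10(2.0), n_r + 1)[1:-1] * scale
    r_bin = np.searchsorted(inner_edges, dist)          # radial bin 0 .. n_r - 1
    t_bin = np.minimum((angle / (2 * np.pi) * n_theta).astype(int), n_theta - 1)
    hist = np.zeros((n, n_r * n_theta))
    for i in range(n):
        idx = (r_bin[i] * n_theta + t_bin[i])[off_diag[i]]
        hist[i] = np.bincount(idx, minlength=n_r * n_theta)
    return hist / (n - 1)

def augment(features, locations, weight=1.0):
    """Appearance features with the shape context appended (weight is hypothetical)."""
    return np.hstack([features, weight * shape_context(locations)])

def matching_cost(aug_a, aug_b):
    """Euclidean distance between every feature of a and every feature of b."""
    return np.linalg.norm(aug_a[:, None, :] - aug_b[None, :, :], axis=2)

def hungarian(cost):
    """Minimum-cost assignment of each row to a distinct column (rows <= columns)."""
    cost = np.asarray(cost, dtype=float)
    n, m = cost.shape
    assert n <= m, "transpose the matrix if there are more rows than columns"
    u, v = np.zeros(n + 1), np.zeros(m + 1)             # dual potentials
    p = np.zeros(m + 1, dtype=int)                      # p[j]: 1-based row matched to column j
    way = np.zeros(m + 1, dtype=int)
    for i in range(1, n + 1):
        p[0], j0 = i, 0
        minv = np.full(m + 1, np.inf)
        used = np.zeros(m + 1, dtype=bool)
        while True:
            used[j0] = True
            i0, delta, j1 = p[j0], np.inf, 0
            for j in range(1, m + 1):
                if not used[j]:
                    cur = cost[i0 - 1, j - 1] - u[i0] - v[j]
                    if cur < minv[j]:
                        minv[j], way[j] = cur, j0
                    if minv[j] < delta:
                        delta, j1 = minv[j], j
            for j in range(m + 1):
                if used[j]:
                    u[p[j]] += delta
                    v[j] -= delta
                else:
                    minv[j] -= delta
            j0 = j1
            if p[j0] == 0:
                break
        while j0:                                       # augment along the found path
            j1 = way[j0]
            p[j0] = p[j1]
            j0 = j1
    rows_to_cols = np.zeros(n, dtype=int)
    for j in range(1, m + 1):
        if p[j]:
            rows_to_cols[p[j] - 1] = j - 1
    return rows_to_cols

# Toy check: sequence b is a shuffled, translated copy of sequence a.
rng = np.random.default_rng(0)
feats_a, locs_a = rng.normal(size=(6, 8)), rng.uniform(0, 100, size=(6, 2))
perm = rng.permutation(6)
feats_b, locs_b = feats_a[perm], locs_a[perm] + 10.0
cost = matching_cost(augment(feats_a, locs_a), augment(feats_b, locs_b))
match = hungarian(cost)      # match[i]: feature of b assigned to feature i of a
```

Because the shape context only uses relative locations, the translated copy is matched back to its originals; the per-pair entries `cost[i, match[i]]` are what would be turned into link weights.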

Based on the pairwise matching results, and similarly to Kim et al. [12], we build a VSN in which each node represents a feature: $f_i^a$ represents the $i$th feature in the input video $a$, and $f_j^b$ represents the $j$th feature in the input video $b$. The weight of each edge is given by the similarity score between features $f_i^a$ and $f_j^b$. The similarity score $w(f_i^a, f_j^b)$ between the feature vectors $f_i^a$ and $f_j^b$ is obtained through the exponential equation

$$w(f_i^a, f_j^b) = \exp\left(-c(f_i^a, f_j^b)\right), \tag{3}$$

where $c(f_i^a, f_j^b)$ is the matching cost between features $f_i^a$ and $f_j^b$. In our experiments, we computed the link weights from the difference between the two feature vectors, both with and without shape context features. For normalizing the weights we follow the approach outlined in [12].
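As a rough illustration of how the matched pairs could be assembled into one network, the sketch below stores the VSN as a dense weight matrix over all features of all videos. It assumes directed links, a weight of the form exp(-cost), and a hypothetical tuple layout for the matches; the weight normalization of [12] is not reproduced here.

```python
import numpy as np

def build_vsn(feature_counts, matches):
    """Assemble a weight matrix over all features of all videos.

    feature_counts: number of features per video.
    matches: (video_a, i, video_b, j, cost) tuples, meaning feature i of
             video a was matched to feature j of video b with the given cost.
    """
    offsets = np.concatenate([[0], np.cumsum(feature_counts)[:-1]])
    n_total = int(np.sum(feature_counts))
    W = np.zeros((n_total, n_total))
    for a, i, b, j, cost in matches:
        W[offsets[a] + i, offsets[b] + j] = np.exp(-cost)   # low cost, strong link
    return W

# Two toy videos with 2 and 3 features and three matched pairs.
matches = [(0, 0, 1, 2, 0.1), (0, 1, 1, 0, 2.0), (1, 2, 0, 0, 0.1)]
W = build_vsn([2, 3], matches)
```

A dense matrix is used only for readability; with many videos a sparse representation would be the more natural choice.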

The intuition behind the matching algorithm and the VSN is that the number of links to and from a node reflects the cooccurrence statistics, while each link weight reflects the belief in that match. This creates a clustering effect. The hope is that (a) features from the same category tend to interconnect with each other through strong links, while only weak links exist between features from different categories, and (b) features that appear often have many links. Figure 3 shows the matching results between sequences from the same and from different categories, respectively. As one can see, sequences from different classes incur worse matching (Figure 3(b)), while the matching between sequences from the same category is more consistent and regular (Figure 3(a)).

Figure 3

The figures show the matching results between two sequences from (a) the same category and (b) different categories. Solid lines indicate matching pairs with low costs while dotted lines indicate costly matching pairs. Since the shape context features are incorporated, for two features to match well, they need to have not only similar feature values but also similar relative locations with respect to other features.

3.2.1. PageRank

The aim of the next step is to identify the strongest and most consistent features in each of the videos. We do this by extracting from our original VSN $G$ the subgraph $G_a$ that contains the nodes from the video $a$ as well as all other nodes in the VSN that are connected to the nodes from $a$: we set $M_a(i, j) = 0$ if $i \notin a$ and $j \notin a$. Then, we apply PageRank [11] to the subgraph $G_a$. The intuition behind the application of PageRank is that nodes that are referenced (linked) often by important nodes are considered important as well. After PageRank, the features with high ranking values are those that are highly relevant and most consistent in the video $a$.

In short, the PageRank algorithm generates a PageRank vector $P_a$ by solving the following equation:

$$P_a = (1 - \alpha)(M_a + D)P_a + \alpha v, \tag{4}$$

where $M_a$ is the weight matrix of $G_a$, $\alpha$ is a weighting constant set to 0.1 as in [12], $v$ is the transport vector representing the initial prior of $P_a$ (set to the uniform distribution here), and $D = v d^T$, where $d$ is the $n$-dimensional indicator vector identifying the nodes with zero outdegree and $n$ is the dimension of the transport vector. The final ranking value of each node represents its relative importance in the VSN $G_a$.
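A fixed-point equation of this kind is usually solved by power iteration. The sketch below is one possible way to do so, assuming the standard PageRank form with a uniform transport vector, a column-normalized weight matrix, and a dangling-node correction; the function name, tolerance, and the convention that `W[i, j]` is the weight of the link from node i to node j are assumptions of the example.

```python
import numpy as np

def pagerank(W, alpha=0.1, tol=1e-10, max_iter=1000):
    """Power iteration for p = (1 - alpha) (M + D) p + alpha v.

    W[i, j] is the weight of the link from node i to node j.
    """
    W = np.asarray(W, dtype=float)
    n = W.shape[0]
    out_weight = W.sum(axis=1)
    has_out = out_weight > 0
    M = np.zeros_like(W)
    # Column j of M holds the normalised outgoing weights of node j.
    M[:, has_out] = (W[has_out, :] / out_weight[has_out, None]).T
    v = np.full(n, 1.0 / n)                 # uniform transport vector
    d = (~has_out).astype(float)            # indicator of zero-outdegree nodes
    p = v.copy()
    for _ in range(max_iter):
        new_p = (1 - alpha) * (M @ p + v * (d @ p)) + alpha * v
        if np.abs(new_p - p).sum() < tol:
            return new_p
        p = new_p
    return p

# Toy graph: nodes 0, 1, and 2 all link to node 3, which has no outgoing links.
W = np.zeros((4, 4))
W[[0, 1, 2], 3] = 1.0
ranks = pagerank(W)
```

In the toy graph the node that every other node points to receives the highest rank, which mirrors the intuition that consistently matched features rank highly.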

The process is illustrated in Figure 4. Initially, as Figure 4(a) shows, we have a VSN composed of features from three sequences. We extract the subgraph with respect to the first sequence, the features of which are represented by the blue, circular nodes in Figure 4(b). Then, we apply PageRank to the subgraph to determine the relative importance of the features in the subgraph. Figure 4(c) shows the final graph after PageRank. Larger nodes are the features that are relevant with respect to sequence one.

Figure 4

The process of PageRank: (a) is the original similarity network. (b) shows the result after the subgraph extraction. Nodes of different shapes represent features from different categories. After PageRank, features that are important receive high ranking values, represented by the size of the nodes in (c). The larger a node is, the higher it ranks.

3.2.2. Structure Similarity

After computing PageRank, we evaluate the structure similarity [17] between two nodes. Here, we follow the reasoning in [12, 17]: nodes with a similar set of links, that is, nodes that are pointed to by a similar set of nodes and that point to a similar set of nodes, will most likely belong to the same category. Blondel et al. [17] use this technique to find synonyms in text documents.

The goal of computing structure similarity is to identify which nodes in the graph are true hub nodes. In order to do this, we take the graph $G_A$ we have and compare it with a graph $G_B$, and see which node(s) of $G_A$ are most similar to node 2, the center node, in $G_B$. Here, $G_B$ is the path graph $1 \rightarrow 2 \rightarrow 3$ we compare to, and (5) is the matrix representation of it. Let $S$ be the resulting similarity scores. To solve for $S$, we use the formulation in (6), an approach proposed in [17] to compare nodes from different graphs.

Given a graph $G$, we define the neighborhood graph of a node $u$ to be the subgraph $N_u$ formed by the neighboring nodes of $u$ and the edges in between them. Let $A$ be the adjacency matrix of $N_u$ and let $n$ be the number of neighbors of $u$. Then, the similarity (central score) between the vertices of $N_u$ and vertex 2 of the path graph

$$B = \begin{pmatrix} 0 & 1 & 0 \\ 0 & 0 & 1 \\ 0 & 0 & 0 \end{pmatrix} \tag{5}$$

is calculated by iteratively solving

$$S_{k+1} = \frac{B S_k A^T + B^T S_k A}{\left\| B S_k A^T + B^T S_k A \right\|_F} \tag{6}$$

for $k = 0, 1, \ldots$. Here, $S_k$ is a $3 \times n$ matrix, initially set to a matrix with all entries 1, and $\| \cdot \|_F$ is the Frobenius norm. Upon convergence, the structure similarity value for each neighbor of $u$ is given by the second row of $S$. A neighbor with a higher score shares many common nodes with $u$. The process is repeated for each feature, which gives us an $N \times N$ structure similarity matrix, where $N$ is the total number of features.

3.2.3. Spectral Clustering

By fusing the results of PageRank and structure similarity, we obtain the similarity score between sequence $a$ and sequence $b$ by


Given an $m \times m$ matrix encoding the similarity scores between $m$ instances, spectral clustering [38] groups these instances into $K$ clusters, where $K$ is a predefined value. With the affinity matrix at hand, we apply spectral clustering [38] on the nearest neighbor graph to uncover the underlying activities.
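For illustration, a common form of spectral clustering (normalized affinity, top eigenvectors, row normalization, then k-means) can be written with NumPy alone. Whether [38] uses exactly this variant is an assumption, as are the function name and the simple farthest-first k-means initialization; the sketch also works on the full affinity matrix rather than on a nearest neighbor graph.

```python
import numpy as np

def spectral_clustering(affinity, k, n_iter=50):
    """Cluster instances from a symmetric affinity matrix into k groups."""
    A = np.asarray(affinity, dtype=float)
    A = (A + A.T) / 2.0
    np.fill_diagonal(A, 0.0)
    deg = A.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(deg, 1e-12))
    L = A * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]     # D^(-1/2) A D^(-1/2)
    _, vecs = np.linalg.eigh(L)                           # ascending eigenvalues
    X = vecs[:, -k:]                                      # k largest eigenvectors
    X = X / np.maximum(np.linalg.norm(X, axis=1, keepdims=True), 1e-12)

    # k-means with a deterministic farthest-first initialisation.
    centers = [X[0]]
    for _ in range(1, k):
        d2 = np.min([((X - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(X[np.argmax(d2)])
    centers = np.array(centers)
    for _ in range(n_iter):
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = np.argmin(d2, axis=1)
        for c in range(k):
            if np.any(labels == c):
                centers[c] = X[labels == c].mean(axis=0)
    return labels

# Toy affinity: two groups of three sequences, strong within, weak across.
affinity = np.full((6, 6), 0.05)
affinity[:3, :3] = 0.9
affinity[3:, 3:] = 0.9
labels = spectral_clustering(affinity, k=2)
```

The cluster indices themselves are arbitrary; only the grouping of the sequences is meaningful.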

4. Experiments

In this section, we apply our algorithm to standard datasets and show that it performs well compared to the state-of-the-art approaches. In detail, we test our approach on the following:

  (i) the KTH dataset [37], which is the largest one,

  (ii) the skating dataset from [39], where we show that our approach is able to handle cluttered backgrounds as well as video data from a moving camera,

  (iii) real-world surveillance data, where our approach was able to cope even with subtle movements.

In all the evaluations, the features are reduced to 100D vectors using PCA. In practice, the target dimensionality could be set using cross validation. In our case, we set it to 100 for the sake of comparison. The values of the spatial and temporal scale parameters $\sigma$ and $\tau$ could vary according to the dataset, but they are set to the same values as in [5].

4.1. KTH Dataset

The KTH dataset [37] is by far the largest standard activity dataset. It consists of six categories of activities performed by twenty-five actors in four different scenarios. The feature detector parameters are set to and ; the detector results are shown in Figure 2. Each spatiotemporal patch is represented by the concatenated vector of its 3D gradients and then further reduced to 100 dimensions using PCA. We then apply our approach to cluster the video sequences; the results are shown in Figure 5. Due to the size of the database, we report the result for KTH without shape context features. The confusion matrix for the KTH dataset is shown in Table 1. Note that we lump "jogging" and "running" into one category, as we did not incorporate features such as speed to distinguish these two activities. Our approach achieves 91.3% accuracy and performs well compared to the state-of-the-art approaches (e.g., Niebles et al. [5], who also recently reported results with running and jogging lumped together).

Table 1 Confusion matrix for the KTH dataset. The average performance is 91.3%. "box", "hc", "hw", "j/r", and "walk" represent boxing, handclapping, handwaving, jogging/running, and walking, respectively. For example, row one means out of all the boxing sequences, 84% are classified correctly, and 16% are classified as handclapping.

Figure 5

Feature points with high pagerank values from the six different categories in the KTH dataset. From (a)–(f), the activities are boxing, handclapping, handwaving, jogging, running, and walking.

4.2. Skating Dataset

As a second experiment, we apply our approach to a real-world skating dataset reported in [39]. We extract 24 video sequences from the dataset and apply the same process to uncover three activities: stand-spin, sit-spin, and camel-spin. The detector parameters are set to and when extracting the spatiotemporal interest points, which are then described by the corresponding PCA-reduced 3D gradients.

Figure 6 shows sample results for different sequences from the skating dataset with detected interest points. Since the sequences are shot with cluttered backgrounds and irregular camera motions, many irrelevant interest points are detected in the background. However, after space-time link analysis is applied, most of them are removed and not considered when classifying the sequences.

Figure 6

The figure shows the detected interest points for six sequences from the three different categories of the skating dataset: (a) and (b) stand-spin; (c) and (d) sit-spin; (e) and (f) camel-spin.

The performance is considerably better when the features are augmented with their associated shape context descriptors. The reason is that, given the cluttered background in these sequences and the fact that the activities of interest are in the center of each frame, it is beneficial to filter out the spatiotemporal interest points induced by the background. Shape context features serve this purpose, as most of the time the background-induced interest points occur at random locations while the genuine features are typically located around the performer. Figure 7 shows highly ranked features for different sequences in the dataset. Note that essentially all of the interest points induced by the background are considered irrelevant. Table 2 shows the best classification result for the skating dataset with shape context features. The average performance is 83.4%, which is better than the 80.3% obtained using the state-of-the-art approach [5]. Table 3 compares the performance with and without the shape context features.

Table 2 Confusion matrix for the skating dataset. The average performance is 83.4%.

Table 3 Performance comparison between different methods from top to bottom: pLSA, without shape context features (SCF), and with shape context features (SCF).

Figure 7

The figure shows the detected feature points for six sequences from three different categories: (a) and (b) stand-spin; (c) and (d) sit-spin; (e) and (f) camel-spin.

4.3. Real World Surveillance Video

As a third experiment, we apply our approach to a real-world surveillance system deployed in large retail stores to detect fraudulent scans at the counters. The goal is to avoid retail shrink caused by cashiers who intentionally fail to enter one or more items into the transaction in an attempt to get free merchandise for the customer. We approach the problem by automatically detecting the scanning activities in the video and matching these detected events with the transaction log to uncover possible fake scans. For this experiment, we extract 27 video sequences from the dataset and show the performance of different methods. Figure 8 shows sample frames for three typical activities, that is, pickup, scan, and drop, with the detected interest points. As one can see from Figure 8, other than "drop" (Figure 8(c)), these activities only induce minor motions and usually overlap with each other. The sparsity of interest points makes it even harder to detect the "scanning" activity. Throughout the experiment, we set and to extract the interest points, based on which the 3D gradients are calculated and PCA-reduced to 100-dimensional feature vectors.

Table 4 shows the best performance achieved with shape context features. The average accuracy for the three activities is 81.5% with or without the shape context features. It would become 100% if we only cared about scan/nonscan events. It is interesting to note that for our surveillance data, the recognition performance was independent of the use of the shape context features. Looking also at the results from the skating data, the recognition performance did not profit from the use of the shape context features as much as we had expected. This might be due to the fact that the link-analysis approach already takes into account the spatial relationship between the features.
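The comparison of detected scan events with the transaction log can be pictured with a small, purely hypothetical sketch. Nothing here is taken from the deployed system: the timestamps, the tolerance, the greedy one-to-one pairing, and the function name `unmatched_scans` are all assumptions made to illustrate the idea that a detected scan without a nearby log entry is a candidate fake scan.

```python
def unmatched_scans(detected_times, logged_times, tolerance=0.5):
    """Return detected scan times that have no log entry within `tolerance` seconds.

    Each log entry can explain at most one detected scan (greedy pairing).
    """
    logged = sorted(logged_times)
    used = [False] * len(logged)
    suspicious = []
    for t in sorted(detected_times):
        hit = None
        for k, s in enumerate(logged):
            if not used[k] and abs(s - t) <= tolerance:
                hit = k
                break
        if hit is None:
            suspicious.append(t)     # scan motion seen, but nothing was registered
        else:
            used[hit] = True
    return suspicious

# Hypothetical timestamps in seconds.
detected = [2.0, 5.1, 9.4, 12.0]
logged = [2.3, 9.0, 12.2]
flags = unmatched_scans(detected, logged)
```

With these made-up timestamps, only the scan detected at 5.1 s has no matching log entry and would be flagged for review.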

Table 4 Confusion matrix for the surveillance video. The average performance is 81.5%. "pick", "scan", and "drop" represent pickup, scanning, and drop, respectively.

Figure 8

Sample frames for three typical activities at the counter. Detected interest points are shown in rectangles. (a), (b), and (c) represent pickup, scan, and drop, respectively.

5. Conclusion

In this paper, we proposed a link-analysis-based approach to unsupervised activity recognition. Different from previous approaches based on bag of words models, the link-analysis approach takes into account the spatiotemporal relationship between visual words in the matching process. We see this as the major reason for the good performance of our approach. Furthermore, we have tested the link-analysis approach on a variety of test videos: the KTH data, which is the largest dataset; the skating video data, where our approach demonstrated its ability to deal with cluttered backgrounds and moving cameras; and the surveillance data, where our approach was able to cope even with very subtle hand movements.

During our tests of the link-analysis approach on the different datasets, we also compared different approaches, that is, (a) with shape context features (SCF), (b) without SCF, and (c) the state-of-the-art approach using pLSA. Future work will deal with multiple moving individuals/objects in the video data. We would also like to evaluate the performance of our approach using a better matching algorithm, for example, quadratic assignment [40].


  1. Wang X, Ma X, Grimson E: Unsupervised activity perception by hierarchical bayesian models. Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2007, Minneapolis, Minn, USA

    Google Scholar 

  2. Krueger V, Kragic D, Ude A, Geib C: The meaning of action: a review on action recognition and mapping. International Journal on Advanced Robotics 2007,21(13):1473-1501.

    Google Scholar 

  3. Moeslund TB, Hilton A, Krüger V: A survey of advances in vision-based human motion capture and analysis. Computer Vision and Image Understanding 2006,104(2-3):90-126. 10.1016/j.cviu.2006.08.002

    Article  Google Scholar 

  4. Niebles JC, Fei-Fei L: A hierarchical model of shape and appearance for human action classification. Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2007, Minneapolis, Minn, USA

    Google Scholar 

  5. Niebles JC, Wang H, Fei-Fei L: Unsupervised learning of human action categories using spatial-temporal words. International Journal of Computer Vision 2008,79(3):299-318. 10.1007/s11263-007-0122-4

    Article  Google Scholar 

  6. Savarese S, Del Pozo A, Niebles JC, Fei-Fei L: Spatial-temporal correlations for unsupervised action classification. Proceedings of the IEEE Workshop on Motion and Video Computing, 2008, Copper Mountain, Colo, USA

    Google Scholar 

  7. Wang X, Ma KT, Ng G-W, Grimson WEL: Trajectory analysis and semantic region modeling using a nonparametric bayesian model. Proceedings of the 26th IEEE Conference on Computer Vision and Pattern Recognition (CVPR '08), 2008

    Google Scholar 

  8. Hofmann T: Probabilistic latent semantic analysis. Proceedings of the 22nd Conference on Uncertainty in Artificial Intelligence, 1999, Stockholm, Sweden

    Google Scholar 

  9. Blei DM, Ng AY, Jordan MI: Latent dirichlet allocation. Journal of Machine Learning Research 2003,3(4-5):993-1022.

    MATH  Google Scholar 

  10. Teh YW, Jordan MI, Beal MJ, Blei DM: Hierarchical Dirichlet processes. Journal of the American Statistical Association 2006,101(476):1566-1581. 10.1198/016214506000000302

  11. Brin S, Page L: The anatomy of a large-scale hypertextual web search engine. Proceedings of the 7th International Conference on World Wide Web, 1998 7: 107-117.

  12. Kim G, Faloutsos C, Hebert M: Unsupervised modeling of object categories using link analysis techniques. Proceedings of the 26th IEEE Conference on Computer Vision and Pattern Recognition (CVPR '08), June 2008, Anchorage, Alaska, USA

  13. Kim G, Faloutsos C, Hebert M: Unsupervised modeling and recognition of object categories with combination of visual contents and geometric similarity links. Proceedings of the 1st International ACM Conference on Multimedia Information Retrieval (MIR '08), 2008, British Columbia, Canada 419-426.

  14. Leordeanu M, Hebert M: A spectral technique for correspondence problems using pairwise constraints. Proceedings of the IEEE International Conference on Computer Vision, 2005, Beijing, China 2: 1482-1489.

  15. Kuhn HW: The Hungarian method for the assignment problem. Naval Research Logistics Quarterly 1955, 2.

  16. Belongie S, Malik J, Puzicha J: Shape matching and object recognition using shape contexts. IEEE Transactions on Pattern Analysis and Machine Intelligence 2002,24(4):509-522. 10.1109/34.993558

  17. Blondel VD, Gajardo A, Heymans M, Senellart P, Van Dooren P: A measure of similarity between graph vertices: applications to synonym extraction and web searching. SIAM Review 2004,46(4):647-666. 10.1137/S0036144502415960

  18. Najork M, Craswell N: Efficient and effective link analysis with precomputed SALSA maps. Proceedings of the International Conference on Information and Knowledge Management, 2008, Napa Valley, Calif, USA 53-62.

  19. Thelwall M: Link Analysis: An Information Science Approach. Academic Press, San Diego, Calif, USA; 2004.

  20. Turaga BK, Chellappa R, Subrahmanian VS, Udrea O: Machine recognition of human activities: a survey. IEEE Transactions on Circuits and Systems for Video Technology 2008,18(11):1473-1488.

  21. Wang L, Hu W, Tan T: Recent developments in human motion analysis. Pattern Recognition 2003,36(3):585-601. 10.1016/S0031-3203(02)00100-0

  22. Ramanan D, Forsyth A: Automatic annotation of everyday movements. Proceedings of the Neural Information Processing Systems (NIPS '03), 2003, Washington, DC, USA

  23. Fanti C, Zelnik-Manor L, Perona P: Hybrid models for human motion recognition. Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '05), 2005, San Diego, Calif, USA 1: 1166-1173.

  24. Gavrila DM: The visual analysis of human movement: a survey. Computer Vision and Image Understanding 1999,73(1):82-98. 10.1006/cviu.1998.0716

  25. Ikizler N, Forsyth D: Searching video for complex activities with finite state models. Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2007, Minneapolis, Minn, USA

  26. Bobick AF, Davis JW: The recognition of human movement using temporal templates. IEEE Transactions on Pattern Analysis and Machine Intelligence 2001,23(3):257-267. 10.1109/34.910878

  27. Efros AA, Berg AC, Mori G, Malik J: Recognizing action at a distance. Proceedings of the IEEE International Conference on Computer Vision, 2003, Nice, France 2: 726-733.

  28. Blank M, Gorelick L, Shechtman E, Irani M, Basri R: Actions as space-time shapes. Proceedings of the IEEE International Conference on Computer Vision, 2005, Beijing, China 2: 1395-1402.

  29. Shechtman E, Irani M: Space-time behavior based correlation. Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2005, San Diego, Calif, USA 1: 405-412.

  30. Rodriguez MD, Ahmed J, Shah M: Action MACH: a spatio-temporal maximum average correlation height filter for action recognition. Proceedings of the 26th IEEE Conference on Computer Vision and Pattern Recognition (CVPR '08), 2008, Anchorage, Alaska, USA

  31. Brand M, Oliver N, Pentland A: Coupled hidden Markov models for complex action recognition. Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 1997, San Juan, Puerto Rico, USA 994-999.

  32. Oliver N, Garg A, Horvitz E: Layered representations for learning and inferring office activity from multiple sensory channels. Computer Vision and Image Understanding 2004,96(2):163-180. 10.1016/j.cviu.2004.02.004

  33. Bobick AF, Ivanov YA: Action recognition using probabilistic parsing. Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 1998, Santa Barbara, Calif, USA 196-202.

  34. Quattoni A, Wang S, Morency L-P, Collins M, Darrell T: Hidden conditional random fields. IEEE Transactions on Pattern Analysis and Machine Intelligence 2007,29(10):1848-1853.

  35. Laptev I, Lindeberg T: Space-time interest points. Proceedings of the IEEE International Conference on Computer Vision, 2003, Nice, France 1: 432-439.

  36. Dollar P, Rabaud V, Cottrell G, Belongie S: Behavior recognition via sparse spatiotemporal features. Proceedings of the IEEE International Workshop on Visual Surveillance and Performance Evaluation of Tracking and Surveillance (PETS '05), 2005, Beijing, China

  37. Schüldt C, Laptev I, Caputo B: Recognizing human actions: a local SVM approach. Proceedings of the International Conference on Pattern Recognition, 2004, Cambridge, UK 3: 32-36.

  38. Song Y, Chen W-Y, Bai H, Lin C-J, Chang EY: Parallel spectral clustering. Proceedings of the European Conference on Machine Learning (ECML '08), 2008, Beijing, China

  39. Wang Y, Jiang H, Drew MS, Li Z-N, Mori G: Unsupervised discovery of action classes. Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '06), 2006, New York, NY, USA 2: 1654-1661.

  40. Cour T, Srinivasan P, Shi J: Balanced graph matching. Proceedings of the Advances in Neural Information Processing Systems (NIPS '06), 2006, Cambridge, Mass, USA


This work was partially funded by the EU-project PACOPlus, IST-FP6-IP-027657.

Author information

Corresponding author

Correspondence to Volker Krueger.

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 2.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

About this article

Cite this article

Liu, H., Feris, R., Krueger, V. et al. Unsupervised Action Classification Using Space-Time Link Analysis. J Image Video Proc 2010, 626324 (2010).


  • Video Sequence
  • Visual Word
  • Interest Point
  • Spectral Cluster
  • Cluttered Background