Human Motion Analysis via Statistical Motion Processing and Sequential Change Detection
EURASIP Journal on Image and Video Processing volume 2009, Article number: 652050 (2009)
The widespread use of digital multimedia in applications, such as security, surveillance, and the semantic web, has made the automated characterization of human activity necessary. In this work, a method for the characterization of multiple human activities based on statistical processing of the video data is presented. First the active pixels of the video are detected, resulting in a binary mask called the Activity Area. Sequential change detection is then applied to the data examined in order to detect at which time instants there are changes in the activity taking place. This leads to the separation of the video sequence into segments with different activities. The change times are examined for periodicity or repetitiveness in the human actions. The Activity Areas and their temporally weighted versions, the Activity History Areas, for the extracted subsequences are used for activity recognition. Experiments with a wide range of indoor and outdoor videos of various human motions, including challenging videos with dynamic backgrounds, demonstrate the proposed system's good performance.
The area of human motion analysis is one of the most active research areas in computer vision, with applications in numerous fields such as surveillance, content-based retrieval, storage, and virtual reality. A wide range of methods has been developed over the years to deal with problems like human detection, tracking, recognition, the analysis of activity in video, and the characterization of human motions.
One large category of approaches for the analysis of human motions is structure-based, using cues from the human body for tracking and action recognition. The human body can be modeled in 2D or 3D, with or without explicit shape models. Model-based methods include the representation of humans as stick figures, cardboard models, volumetric models, as well as hybrid methods that track both edges and regions. Structure-based approaches that do not use explicit models detect features, objects, or silhouettes, which are then tracked and their motion is classified. Feature-based methods are sensitive to local noise and occlusions, and the number of features is not always sufficient for tracking or recognition. Statistical shape models such as Active Contours have also been examined for human motion analysis, but they are sensitive to occlusions and require good initialization.
Another large category of approaches extracts cues about the activity taking place from motion information. One such approach examines the global shape of motion features, which are found to provide enough information for recognition. The periodicity of human motions has also been used to derive templates for each action class, but at a high computational cost, as this is based on the correlation of successive video frames. In another approach, actions are modeled by temporal templates, that is, binary and grayscale masks that characterize the area of activity. Motion Energy Images (MEIs) are binary masks indicating which pixels are active throughout the video, while Motion History Images (MHIs) are grayscale, as they incorporate history information, that is, which pixels moved most recently. This approach is computationally efficient, but cannot deal with repetitive actions, as their signatures overwrite each other in the MHI. Spatiotemporal information from the video has also been used to create "space-time shapes" which characterize human activities in space and time. However, these spatiotemporal characteristics are specific to human actions, limiting the method to this domain only. Additionally, the translational component of motions cannot be dealt with by that method.
Both structure and motion information can be taken into account for human action analysis using Hidden Markov Models (HMMs), which model the temporal evolution of events [17, 18]. However, the HMM approach requires significant training to perform well and, like all model-based methods, its performance depends on how well the chosen model parameters represent the human action.
In this work, a novel, motion-based nonparametric approach to the problem of human motion analysis is presented. Since it is not model-based, it does not suffer from sensitivity to the correct choice of model, nor is it constrained by it. Additionally, it is based on generally applicable statistical techniques, so it can be extended to a wide range of videos, in various domains. Finally, it does not require extensive training for recognition, so it is not computationally intensive, nor dependent on the training data available.
1.1. Proposed Framework
The proposed system is based on statistical processing of video data in order to detect times and locations of activity changes (Figure 1). The first stage of the system involves the extraction of the Activity Area, a binary mask of pixels which are active throughout the sequence. Only these pixels are processed in the subsequent stages, leading to a lower computational cost, and also a reduction in the possibility of errors in the motion analysis. The extraction of the Activity Area can be considered as a preprocessing step, which can be omitted for real-time processing.
The second stage of the system is one of the main novel points of this framework, as it leads to the detection of changes in activity in a non ad-hoc manner. In the current literature, temporal changes in video are only found in the context of shot detection, where the video is separated into subsequences that have been filmed in different manners. However, this separation is not always useful, as a shot may contain several activities. The proposed approach separates the video in a meaningful manner, into subsequences corresponding to different activities, by applying sequential change detection methods. The input, that is, interframe illumination variations, is processed sequentially as it arrives, to decide if a change has occurred at each frame. Thus, changes in activity can be detected in real time, and the video sequence can then be separated into segments that contain different actions. The times of change are further examined to see if periodicity or repetitiveness is present in the actions.
After the change detection step, the data in each subsequence between the detected change points is processed for more detailed analysis of the activity in it. Activity Areas and a temporally weighted version of them called the Activity History Areas are extracted for the resulting subsequences. The shape of the Activity Areas is used for recognition of the activities taking place: the outline of each Activity Area is described by the Fourier Shape Descriptors (see Section 5), which are compared to each other using the Euclidean distance, for recognition. When different activities have a similar Activity Area (e.g., a person walking and running), the Activity History Areas (AHAs) are used to discriminate between them, as they contain information about the temporal evolution of these actions. This is achieved by estimating the Mahalanobis distance between appropriate features of the AHAs, like their slope and magnitude (see Section 5 for details). It is important to note that Activity History Areas would have the same limitations as MHIs if they were applied to the entire video sequence: the repetitions of an activity would overwrite the previous activity history information, so the Activity History Area would not provide any new information. This issue is overcome in the proposed system, as the video is already divided into segments containing different activities, so that Activity History Areas are extracted for each repeating component of the motion separately, and no overwriting takes place.
2. Motion Analysis: Activity Area
In the proposed system, the interframe illumination variations are initially processed statistically in order to find the Activity Area, a binary mask similar to the MEIs, which can be used for activity recognition. Unlike the MEI, the Activity Areas are extracted via higher-order statistical processing, which makes them more robust to additive noise and small background motions. Interframe illumination variations, resulting from frame differences or optical flow estimates (both referred to as "illumination variations" in the sequel), can be mapped to the following two hypotheses:
\[ H_0: v_k^0(\bar{r}) = z_k(\bar{r}), \qquad H_1: v_k^1(\bar{r}) = u_k(\bar{r}) + z_k(\bar{r}), \tag{1} \]
where $v_k^0(\bar{r})$, $v_k^1(\bar{r})$ are the illumination variations for a static/active pixel, respectively, at frame $k$ and pixel $\bar{r}$. The term $z_k(\bar{r})$ corresponds to measurement noise and $u_k(\bar{r})$ is caused by pixel motion. The background is considered to be static, so only the pixels of moving objects correspond to $H_1$. The distribution of the measurement noise is unknown; however, it can be sufficiently well modeled by a Gaussian distribution, as in [20, 21]. In the literature, the background is often modeled by mixtures of Gaussian distributions, but this modeling is computationally costly and not reliable in the presence of significant background changes (e.g., a change in lighting), as it does not always adapt to them quickly enough. The method used here is actually robust to deviations of the data from the simple Gaussian model [23, 24], so even in such cases, it provides accurate, reliable results at a much lower computational cost.
The illumination variations of static pixels are caused by measurement noise, so their values over time should follow a Gaussian distribution. A classical test of data Gaussianity is the kurtosis, which is equal to zero for Gaussian data, and is defined for a zero-mean random variable $y$ as
\[ \mathrm{kurt}[y] = E[y^4] - 3\left(E[y^2]\right)^2. \tag{2} \]
In order to find the active pixels, that is, Activity Areas, the illumination variations at each pixel are accumulated over the entire video and their kurtosis is estimated from (2). Even if in practice the static pixels do not follow a strictly Gaussian distribution, their kurtosis is still significantly lower (by orders of magnitude) than that of active pixels. This is evident in the experimental results, where the regions of activity are indeed correctly localized, as well as in the simulations that follow.
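As an illustration of this step, the following minimal Python sketch estimates the kurtosis of each pixel's illumination variations over time and binarizes it relative to the largest value found. The function names and the relative threshold `rel_threshold` are assumptions made for the example; the text only states that relative, rather than absolute, kurtosis values are compared.

```python
from typing import List

Frame = List[List[float]]  # illumination variations of one frame, indexed [row][col]


def pixel_kurtosis(samples: List[float]) -> float:
    """Sample estimate of E[y^4] - 3 (E[y^2])^2 after removing the sample mean."""
    n = len(samples)
    mean = sum(samples) / n
    centered = [s - mean for s in samples]
    m2 = sum(c ** 2 for c in centered) / n
    m4 = sum(c ** 4 for c in centered) / n
    return m4 - 3.0 * m2 ** 2


def activity_area(variations: List[Frame], rel_threshold: float = 0.1) -> List[List[int]]:
    """Binary mask with 1 where the temporal kurtosis is large relative to the maximum.

    rel_threshold is a hypothetical tuning parameter, not a value from the paper.
    """
    rows, cols = len(variations[0]), len(variations[0][0])
    kurt = [[pixel_kurtosis([frame[r][c] for frame in variations]) for c in range(cols)]
            for r in range(rows)]
    peak = max(max(row) for row in kurt)
    if peak <= 0.0:
        # No pixel shows heavier-than-Gaussian tails, so nothing is marked active.
        return [[0] * cols for _ in range(rows)]
    return [[1 if kurt[r][c] >= rel_threshold * peak else 0 for c in range(cols)]
            for r in range(rows)]
```

A pixel that only carries small noise yields a kurtosis near or below zero, whereas a pixel with occasional large variations caused by motion yields a large positive value, so the relative threshold separates the two groups.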
As a practical example with a real sequence, we estimate the kurtosis of all active pixels and that of all static pixels, taken from the real video of a person boxing (Section 6.2), where the ground truth for the active and static pixels is extracted manually. The kurtosis values of active and static pixels are plotted in Figure 2, where it can be seen that the active pixels' kurtosis is significantly higher than that of the static pixels; note that the axis ranges of Figures 2(a) and 2(b) differ (for clarity of presentation). In the static pixels of Figure 2(b), the kurtosis is almost zero in almost all of them. It obtains higher values in some pixels, most likely due to the presence of local noise, but even these values are much lower than those of the active pixels. Indeed, the mean value of the kurtosis for the active pixels is found to be much higher than that for the static ones. Results like this motivate us to compare the relative values of pixels' kurtosis in practice, in order to determine if a pixel is active or static, rather than their absolute value.
A very common model for the background is the Mixture of Gaussians (MoG), so we compare the kurtosis of data following a Gaussian, an MoG, and an Exponential distribution. The exponential data is chosen as it is clearly non-Gaussian and will provide a measure of comparison for the other data. Monte Carlo simulations take place with multiple sample sets of data from each distribution, all of the same length. The kurtosis estimated for each sample set and for each distribution is shown in Figure 3, where it can be seen that the Gaussian and MoG data have significantly lower kurtosis values than the Exponential (non-Gaussian) data. Indeed, the average kurtosis of both the Gaussian and the MoG data is far lower than that of the Exponential data. Consequently, the kurtosis can reliably discriminate between active and static pixels even for background data that is modeled by an MoG instead of by a simple Gaussian.
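A simulation of this kind can be sketched as follows. The number of sample sets, their length, and all distribution parameters (a unit Gaussian, a two-component mixture with means at -1 and 1, and a unit-rate exponential) are assumptions chosen only for illustration, so the resulting averages are not the values reported in Figure 3, and the outcome for the mixture depends on the parameters chosen.

```python
import random
from statistics import fmean
from typing import Dict, List


def kurtosis(samples: List[float]) -> float:
    """Sample estimate of E[y^4] - 3 (E[y^2])^2 after removing the sample mean."""
    mean = fmean(samples)
    centered = [s - mean for s in samples]
    m2 = fmean(c ** 2 for c in centered)
    m4 = fmean(c ** 4 for c in centered)
    return m4 - 3.0 * m2 ** 2


def monte_carlo_kurtosis(num_sets: int = 200, set_length: int = 1000,
                         seed: int = 0) -> Dict[str, float]:
    """Average kurtosis over num_sets sample sets drawn from each distribution."""
    rng = random.Random(seed)
    draw = {
        "gaussian": lambda: rng.gauss(0.0, 1.0),
        # Hypothetical two-component mixture of Gaussians.
        "mog": lambda: rng.gauss(-1.0, 0.5) if rng.random() < 0.5 else rng.gauss(1.0, 0.5),
        "exponential": lambda: rng.expovariate(1.0),
    }
    return {
        name: fmean(kurtosis([sample() for _ in range(set_length)]) for _ in range(num_sets))
        for name, sample in draw.items()
    }
```

With these assumed parameters, calling `monte_carlo_kurtosis()` gives an average close to zero for the Gaussian data, a negative average for the mixture, and a clearly positive average for the exponential data.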
3. Motion Analysis: Activity History Area
As mentioned in Section 1.1, the Activity Area is not always sufficient for recognizing activities, as some actions can lead to Activity Areas with very similar shapes. For example, different translational motions like jogging, running, and walking have similar Activity Areas, although they evolve differently in time. Thus, information about their temporal evolution should be used to discriminate amongst them. The temporal evolution of activities is captured by the Activity History Area (AHA), which is similar to the Motion History Image (MHI), but extracted using the kurtosis, as in Section 2, rather than straightforward frame differencing. The AHA is defined from the Activity Area value (binarized kurtosis value) of each pixel at each frame.
Essentially, the AHA is a time-weighted version of the Activity Area, with higher weights given to the pixels which were active more recently. This introduces information about an activity's evolution with time, which can be particularly helpful for the classification of different actions. As an example, Figure 4 shows the Activity Area and AHA of a person running to the right and the same person running to the left. It is obvious that the direction of motion is captured by the AHA, which obtains higher values in the most recently activated pixels, but not by the Activity Area, which is a binary mask, and, therefore, can only provide spatial localization. In Figure 4, the AHA values have warmer colors (darker in grayscale) for the most recently activated pixels, while cooler colors (lighter in grayscale) represent pixels that were active in the past.
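The time weighting can be sketched in Python as below. The update rule is an assumption borrowed from the usual motion history recursion (a pixel that is active in the current frame is set to a maximum value `tau`, otherwise its value decays by one per frame down to zero); it is not necessarily the exact definition used by the authors, and `tau` is a hypothetical parameter.

```python
from typing import List

Mask = List[List[int]]  # binary Activity Area values of one frame, indexed [row][col]


def activity_history_area(masks: List[Mask], tau: int) -> List[List[int]]:
    """Time-weighted activity map: recently active pixels receive the highest values.

    Assumed recursion: value = tau if the pixel is active in the current frame,
    otherwise max(0, previous value - 1).
    """
    rows, cols = len(masks[0]), len(masks[0][0])
    aha = [[0] * cols for _ in range(rows)]
    for mask in masks:
        for r in range(rows):
            for c in range(cols):
                aha[r][c] = tau if mask[r][c] else max(0, aha[r][c] - 1)
    return aha
```

For a single active pixel that moves one column to the right per frame over four frames, with `tau=4`, the result is `[[1, 2, 3, 4]]`; the same motion to the left gives `[[4, 3, 2, 1]]`, so the direction of motion is visible in the map even though the binary Activity Area is identical in both cases.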
4. Sequential Change Detection
One of the main novel points of the proposed system is the detection of the times at which the activity taking place changes. The input data for the change detection is a sequence of illumination variations from frame $k_0$ to $k$, that is, $V_{k_0,k} = [v_{k_0}, v_{k_0+1}, \ldots, v_k]$. If only the pixels inside the Activity Area are being examined, the data from each frame contains the illumination variations of that frame's pixels, for the pixels inside the Activity Area. Thus, if the Activity Area contains $N$ pixels, we have $v_k = [v_k(1), \ldots, v_k(N)]$. In this work we examine the case where only the pixels inside the Activity Area are processed. It is considered that the data follows a distribution $f_0$ before a change occurs, and $f_1$ after the change, at an unknown time instant $k_{ch}$. This is expressed by the following two hypotheses:
\[ H_0: V_{k_0,k} \sim f_0, \qquad H_1: V_{k_0,k} \sim f_1. \tag{4} \]
At each frame $k$, $V_{k_0,k}$ is an input into a test statistic to determine whether or not a change has occurred until then, as detailed in Section 4.1. If a change is detected, only the data after that frame is processed to detect new changes, and this is repeated until the entire video has been examined.
4.1. Cumulative Sum (CUSUM) for Change Detection
The sequential change detection algorithm uses the log-likelihood ratio (LLRT) of the input data as a test statistic. For the detection of a change between frames $k_0$ and $k$, we estimate
\[ T_k = \ln \frac{f_1(V_{k_0,k})}{f_0(V_{k_0,k})} = \sum_{i=k_0}^{k} \ln \frac{f_1(v_i)}{f_0(v_i)}, \tag{5} \]
where $f_0$ and $f_1$ are the data distributions before and after the change, and it has been assumed that the frame samples are independent and identically distributed (i.i.d.) under each hypothesis, so that $f_j(V_{k_0,k}) = \prod_{i=k_0}^{k} f_j(v_i)$ for $j = 0, 1$. Similarly, it is assumed that the illumination variations of the $N$ pixels inside the Activity Area are i.i.d., so $f_j(v_i) = \prod_{n=1}^{N} f_j(v_i(n))$.
Pixels in highly textured areas can be considered to have i.i.d. values of illumination variations, as they correspond to areas of the moving object with a different appearance, which may be subject to local sources of noise, shadow, or occlusion. In homogeneous image regions that move in the same manner this assumption does not necessarily hold; however, even these pixels can be subject to local sources of noise, which remove correlations between them. The approximation of the data distribution for data that is not considered i.i.d. is very cumbersome, making this assumption necessary for practical purposes as well. Such assumptions are often made in the change detection literature to ensure tractability of the likelihood test.
Under the i.i.d. assumption, the test statistic of (5) obtains the recursive form:
\[ T_k = \max\left(0,\; T_{k-1} + \ln \frac{f_1(v_k)}{f_0(v_k)}\right), \tag{6} \]
where $v_k$ is the data from the active pixels in the current frame $k$, and the maximum with zero resets the statistic whenever the accumulated evidence for a change becomes negative. Then, (5) can also be written as
\[ T_k = \max\left(0,\; T_{k-1} + \sum_{n=1}^{N} \ln \frac{f_1(v_k(n))}{f_0(v_k(n))}\right). \tag{7} \]
A change is detected at this frame when the test statistic becomes higher than a predefined threshold. Unlike the threshold for sequential probability ratio testing [27, 28], the threshold for the CUSUM testing procedure cannot be determined in a closed form manner. It has been proven that the optimal threshold for the CUSUM test for a predefined false alarm rate is the threshold that leads to an average number of changes equal to that rate under the no-change hypothesis, that is, when there are no real changes. In the general case examined here, the optimal threshold needs to be estimated empirically from the data being analyzed. In Section 6 we provide more details about how we determine the threshold experimentally.
In practice, illumination variations of only one pixel over time do not provide enough samples to detect changes effectively, so the illumination variations of all active pixels in each frame are used. If an Activity Area contains $N$ pixels, this gives $N (k - k_0 + 1)$ samples from frame $k_0$ to $k$, which leads to improved approximations of the data distributions, as well as better change detection performance.
4.2. Data Modeling
As (6) shows, in order to implement the CUSUM test, knowledge about the family of distributions before and after the change is needed, even if the time of change itself is not known. For the case where only the pixels in the Activity Area are being examined, it is known that they are active, and hence do not follow a Gaussian distribution (see Section 2). The distribution of active pixels over time contains outliers introduced by a pixel's change in motion, which lead to a more heavy-tailed distribution than the Gaussian, such as the Laplacian or generalized Gaussian. The Laplacian distribution is given by
\[ f(x) = \frac{1}{2b} \exp\left(-\frac{|x - \mu|}{b}\right), \tag{8} \]
where $\mu$ is the data mean and $b$ is its scale, for variance $2b^2$. The tails of this distribution decay more slowly than those of the Gaussian, since its exponent contains an absolute difference instead of the difference squared. Its tails are consequently heavier, indicating that data following the Laplace distribution contains more outlier values than Gaussian data. The test statistic of (7) for the $N$ data samples $v_k(n)$ of frame $k$ can then be written as
\[ T_k = \max\left(0,\; T_{k-1} + N \ln \frac{b_0}{b_1} + \sum_{n=1}^{N} \left( \frac{|v_k(n) - \mu_0|}{b_0} - \frac{|v_k(n) - \mu_1|}{b_1} \right)\right), \tag{9} \]
where $(\mu_0, b_0)$ and $(\mu_1, b_1)$ are the Laplacian parameters before and after the change, respectively.
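As a minimal sketch of the resulting detector, the Python code below accumulates the Laplacian log-likelihood ratio of each frame's samples, which for one sample $x$ equals $\ln(b_0/b_1) + |x-\mu_0|/b_0 - |x-\mu_1|/b_1$, and reports the first frame at which the CUSUM statistic exceeds a threshold. The function names are illustrative, and the fixed threshold and known pre- and post-change parameters are simplifying assumptions; in the system described here both are estimated from the data.

```python
import math
from typing import List, Optional, Sequence, Tuple

LaplaceParams = Tuple[float, float]  # (location mu, scale b)


def laplace_llr(samples: Sequence[float], mu0: float, b0: float,
                mu1: float, b1: float) -> float:
    """Log-likelihood ratio ln f1/f0 summed over i.i.d. Laplacian samples."""
    n = len(samples)
    return n * math.log(b0 / b1) + sum(
        abs(x - mu0) / b0 - abs(x - mu1) / b1 for x in samples)


def cusum_first_change(frames: List[Sequence[float]], pre: LaplaceParams,
                       post: LaplaceParams, threshold: float) -> Optional[int]:
    """Index of the first frame whose CUSUM statistic exceeds threshold, else None."""
    statistic = 0.0
    for k, samples in enumerate(frames):
        # Recursive CUSUM update: add the frame's evidence and clip at zero.
        statistic = max(0.0, statistic + laplace_llr(samples, *pre, *post))
        if statistic > threshold:
            return k
    return None
```

Because the statistic is clipped at zero, evidence against a change does not accumulate, so the detector reacts within a few frames once the data starts to favor the post-change model.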
In order to verify the validity of the Laplacian approximation of the data, the illumination variations are modeled by the Gaussian and Laplacian distributions, and their accuracy is compared. The generalized Gaussian model is not examined, as its approximation is computationally costly and hence impractical. Figure 5 contains plots showing the distribution of the actual data in comparison with its approximation by a Gaussian and Laplacian distribution. The Root Mean Square (RMS) error between the actual empirical data distribution and the corresponding Gaussian and Laplacian model is presented in Table 1 for several videos, where it can be seen that the Laplacian distribution provides a better fit. Modeling experiments are conducted on all the videos used in the experiments, but have not been included in Table 1 for reasons of space and readability. The mean RMS error estimated from all the video sequences examined is lower for the Laplacian than for the Gaussian model, justifying the choice of the former as a better fit for our data. The data could be modeled even more accurately by heavier tailed distributions, such as alpha-stable distributions. However, these do not exist in closed form, so they cannot be used in the likelihood ratio test. A closed form distribution from the alpha-stable family, namely, the Cauchy, describes the data well in the DCT domain, but the Laplacian has been shown to better describe quantized image data.
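A comparison of this kind can be sketched as follows. The code fits both models by maximum likelihood (mean and standard deviation for the Gaussian, median and mean absolute deviation for the Laplacian) and measures the RMS error between each fitted density and a normalized histogram of the data. The number of bins and the use of bin centers for evaluating the models are assumptions, since the exact procedure behind Table 1 is not specified in the text; the data is also assumed not to be constant.

```python
import math
from statistics import fmean, median, pstdev
from typing import List, Tuple


def fit_rms(data: List[float], bins: int = 40) -> Tuple[float, float]:
    """RMS error of Gaussian and Laplacian fits against the normalized histogram."""
    lo, hi = min(data), max(data)
    width = (hi - lo) / bins
    counts = [0] * bins
    for x in data:
        counts[min(int((x - lo) / width), bins - 1)] += 1
    density = [c / (len(data) * width) for c in counts]
    centers = [lo + (i + 0.5) * width for i in range(bins)]

    # Maximum-likelihood parameters of each model.
    mu, sigma = fmean(data), pstdev(data)
    loc = median(data)
    scale = fmean(abs(x - loc) for x in data)

    gauss = [math.exp(-0.5 * ((c - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))
             for c in centers]
    laplace = [math.exp(-abs(c - loc) / scale) / (2 * scale) for c in centers]

    def rms(model: List[float]) -> float:
        return math.sqrt(fmean((d - m) ** 2 for d, m in zip(density, model)))

    return rms(gauss), rms(laplace)
```

For heavy-tailed, sharply peaked data the second returned value is the smaller one, which is the behavior the comparison above relies on.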
The proposed system detects when activities change in a video, based on sequential processing of the interframe illumination variations. After change points are detected, the subsequences between them are further processed in order to characterize and recognize the activities taking place in them. We focus on the case where there is a preprocessing stage that extracts the active pixels, as this reduces the system's overall computational cost and increases its reliability, since it does not look for activity changes in static pixels. The complete system consists of the following stages.
(1) Activity Areas are extracted to find the active pixels.
(2) The illumination variations of the pixels inside the Activity Area over time are estimated.
(3) Sequential change detection is applied to the illumination variations, to detect changes.
(4) If the change points are (nearly) equidistant, the motion is considered to be (near) periodic.
(5) The Activity Areas and Activity History Areas for the frames (subsequences) between change points are extracted. The shape of the Activity Areas and the direction and magnitude of motion are derived from the Activity History Area, to be used for recognition.
(6) False alarms are removed: if motion characteristics of successive subsequences are similar, those subsequences are merged and the change point between them is deleted.
(7) Multiple Activity Areas and Activity History Areas originating from the same activity are detected and merged if their motion and periodicity characteristics coincide.
(8) Shape descriptors of the resulting Activity Areas and motion information from the Activity History Areas are used for recognition.
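The periodicity check on the change points can be sketched in a few lines of Python. The relative tolerance `rel_tol` and the requirement of at least three change points are assumptions made for the example; the text only requires the change points to be (nearly) equidistant.

```python
from statistics import fmean
from typing import Sequence


def is_near_periodic(change_points: Sequence[int], rel_tol: float = 0.2) -> bool:
    """True if successive change points are (nearly) equidistant.

    rel_tol is a hypothetical tolerance on the deviation of each gap from the mean gap.
    """
    if len(change_points) < 3:
        return False  # at least two gaps are needed to compare spacings
    gaps = [b - a for a, b in zip(change_points, change_points[1:])]
    mean_gap = fmean(gaps)
    return all(abs(g - mean_gap) <= rel_tol * mean_gap for g in gaps)
```

For example, change points at frames 10, 30, 51, and 70 are accepted as near periodic, while change points at frames 10, 30, and 90 are not.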
The detection of different activities between change points increases the usefulness and accuracy of the system for many reasons. The proposed system avoids the drawback of "overwriting" that characterizes MHIs that are extracted using the entire sequence. In periodic motions, for example, where an activity takes place from left to right, then from right to left, and so on, all intermediate changes of direction are lost in the temporal history image if all the video frames are used. This is overcome in our approach, as Activity History Areas are estimated over segments with one kind of activity, giving a clear indication of the activity's direction and temporal evolution. This also allows the extraction of details about the activity taking place, such as the direction of translational motions, periodicity of motions like boxing, or of more complex periodic motions, containing similarly repeating components (see Section 6.2). Finally, the application of recognition techniques to the extracted sequences would not be meaningful if the sequence had not been correctly separated into subsequences with one activity each.
Both the shape of the Activity Area and motion information from the Activity History Area are used for accurate activity recognition, as detailed in the sections that follow.
5.1. Fourier Shape Descriptors of Activity Area
The shape of the Activity Areas can be described by estimating the Fourier Descriptors (FDs) of their outlines. The FDs are preferred as they provide better classification results than other shape descriptors. Additionally, they are rotation, translation, and scale invariant, and inherently capture some perceptual shape characteristics: their lower frequencies correspond to the average shape, while higher frequencies describe shape details. The FDs are derived from the Fourier Transform (FT) coefficients $F_0, F_1, \ldots, F_{L-1}$ of each shape outline's boundary coordinates. The DC component $F_0$ is not used, as it only indicates the shape position. All values are divided by the magnitude of $F_1$ to achieve scale invariance, and rotation invariance is guaranteed by using their magnitude. Thus, the FDs are given by
\[ f = \left[ \frac{|F_2|}{|F_1|}, \frac{|F_3|}{|F_1|}, \ldots, \frac{|F_{L-1}|}{|F_1|} \right]. \]
Only the first $M$ terms of the FD, corresponding to the lowest frequencies, are used in the recognition experiments, as they capture the most important shape information. The comparison of the FDs for different activities takes place by estimating their Euclidean distance, since they are scale, translation, and rotation invariant. When $M$ elements of the FDs are retained, the Euclidean distance between two FDs $f_A$, $f_B$ is given by
\[ d(f_A, f_B) = \sqrt{\sum_{i=1}^{M} \left( f_A(i) - f_B(i) \right)^2}, \]
and each activity is matched to that with the shortest Euclidean distance.
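The descriptor computation can be illustrated with the following Python sketch, which treats the ordered boundary points as complex numbers, takes their discrete Fourier transform, discards the DC term, and normalizes the coefficient magnitudes by that of the first harmonic. The function names, the direct (non-FFT) transform, and the assumption that the boundary is traversed so that the first harmonic is nonzero are choices made for the example.

```python
import cmath
import math
from typing import List, Sequence, Tuple


def fourier_descriptors(boundary: Sequence[Tuple[float, float]],
                        num_terms: int) -> List[float]:
    """Magnitudes |F_k| / |F_1| for k = 2 .. num_terms + 1 of the boundary's DFT."""
    z = [complex(x, y) for x, y in boundary]
    n = len(z)

    def coeff(k: int) -> complex:
        return sum(p * cmath.exp(-2j * math.pi * k * i / n) for i, p in enumerate(z))

    scale = abs(coeff(1))  # assumed nonzero for a properly traversed outline
    return [abs(coeff(k)) / scale for k in range(2, 2 + num_terms)]


def fd_distance(fd_a: Sequence[float], fd_b: Sequence[float]) -> float:
    """Euclidean distance between two descriptor vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(fd_a, fd_b)))
```

Translating a shape only changes the discarded DC term, rotating it multiplies every coefficient by the same unit-magnitude factor, and scaling it cancels in the ratio, so a shape and its translated, rotated, and scaled copy yield a distance of (numerically) zero.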
5.2. Activity History Area for Motion Magnitude and Direction Detection
Although the shape of Activity Areas is characteristic of many activities and is effectively used for their recognition, there also exist categories of activities with very similar Activity Areas. A characteristic example commonly encountered in practice is that of translational motions, whose Activity Area covers a linear region (horizontally, vertically, or diagonally). It is seen in Figures 6, 14(e)–14(g) that this shape is linear for different translational motions, such as walking or running, so it is insufficient for discriminating amongst them. However, this linearity property can be used to separate translations from other kinds of motions. The linearity can be derived from its mean in the horizontal direction. Activities that do not contain a translational component, such as waving, lead to a local concentration of pixel activity, which makes sense since they take place over a confined area (last image pairs of Figure 6).
In order to separate translational motions from each other, the Activity History Areas (Figure 7) are used. Motion direction and magnitude information is extracted by estimating the mean of the Activity History Area in the horizontal and vertical directions. In this work all translational motions are horizontal, so only the horizontal mean of the AHA is estimated. This mean forms a line whose slope provides valuable information about the direction and magnitude of motion.
(i) The sign of the slope shows the direction of motion: it is negative for a person moving to the left and positive for motion to the right.
(ii) The magnitude of the slope is inversely proportional to the velocity, that is, higher magnitudes correspond to slower activities.
The values of the Activity History Area are higher in pixels that were active recently; here the pixel locations correspond to the horizontal axis, and the slope is estimated by
\[ s = \frac{t_{\mathrm{last}} - t_{\mathrm{first}}}{x_{\mathrm{last}} - x_{\mathrm{first}}}, \tag{13} \]
where $t_{\mathrm{first}}$ is the frame at which the first horizontal pixel (the leftmost x location here, $x_{\mathrm{first}}$) is activated, and $t_{\mathrm{last}}$ the frame where the last horizontal pixel is activated (the rightmost x location, $x_{\mathrm{last}}$). This can be seen in Figures 8(a), 8(b) for motions to the right and left, respectively: motion to the right leads to a positive slope since the rightmost pixel is activated at the most recent frame, while motion to the left leads to a negative slope.
The Activity History Area of a fast activity (e.g., running) contains a small range of frames (from $t_{\mathrm{first}}$ to $t_{\mathrm{last}}$), since it takes place in a short time, whereas the Activity History Area of a slow activity spans more frames, since the motion lasts longer. In order to objectively discriminate between fast and slow actions, the same number of pixels must be traversed in each direction. Thus, in (13), $x_{\mathrm{last}} - x_{\mathrm{first}}$ is the same for all activities, and $|t_{\mathrm{last}} - t_{\mathrm{first}}|$ has high values for slow actions and low values for fast ones. Consequently, higher magnitudes of the slope of (13) correspond to slower motions and lower magnitudes correspond to faster ones.
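The slope estimate can be sketched in Python as the number of frames elapsed between the activation of the leftmost and rightmost active columns, divided by the number of pixels between them. The input format (one activation frame per horizontal location, with `None` for columns that were never active) is an assumption made for the example.

```python
from typing import Optional, Sequence


def history_slope(activation_frame: Sequence[Optional[float]]) -> float:
    """Slope in frames per pixel between the leftmost and rightmost active columns.

    activation_frame[x] is the frame at which column x was activated, or None if
    the column was never active (hypothetical input format).
    """
    active = [(x, t) for x, t in enumerate(activation_frame) if t is not None]
    if len(active) < 2:
        raise ValueError("need at least two active columns")
    (x_first, t_first), (x_last, t_last) = active[0], active[-1]
    return (t_last - t_first) / (x_last - x_first)
```

A person covering one pixel per frame towards the right gives a slope of 1, the same distance covered at three frames per pixel gives 3, and fast motion to the left gives -1, matching the sign and magnitude conventions described above.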
The activities examined are horizontal walking, jogging, and running, and they cover the same distance, so that the slope magnitude can be objectively used to discriminate among them. For comparison, the Activity History Area is extracted from a set of baseline translation videos, and its horizontal mean is estimated. The slope of the mean is found from (13) and its magnitude is given in Table 2 for each activity. As expected, the slope has higher values for slower motions.
For the classification of a test video, its Activity History Area is extracted, and its mean is estimated. The sign of its slope indicates whether the person is moving to the right or left and its magnitude is compared to the average slope of the three baseline categories of Table 2 using the Mahalanobis distance. For a baseline set with mean $\mu$ and covariance matrix $\Sigma$, the Mahalanobis distance of data $x$ from it is defined as $d_M(x) = \sqrt{(x - \mu)^T \Sigma^{-1} (x - \mu)}$. The Mahalanobis distance is used as a distance metric as it incorporates data covariance, which is not taken into account by the Euclidean distance. In this case the data is one dimensional (the slope) so its variance is used instead of the covariance matrix.
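In the one-dimensional case the classification reduces to a few lines, sketched below in Python. The baseline slope magnitudes in the usage example are invented placeholders, not the values of Table 2, and the use of the population variance of each baseline set is an assumption.

```python
import math
from statistics import fmean, pvariance
from typing import Dict, Sequence, Tuple


def mahalanobis_1d(x: float, mean: float, variance: float) -> float:
    """One-dimensional Mahalanobis distance |x - mean| / standard deviation."""
    return abs(x - mean) / math.sqrt(variance)


def classify_slope(slope: float,
                   baselines: Dict[str, Sequence[float]]) -> Tuple[str, str]:
    """Direction from the slope's sign, activity from the nearest baseline category.

    baselines maps an activity label to baseline slope magnitudes (hypothetical data).
    """
    direction = "right" if slope > 0 else "left"
    magnitude = abs(slope)
    distances = {
        label: mahalanobis_1d(magnitude, fmean(values), pvariance(values))
        for label, values in baselines.items()
    }
    return direction, min(distances, key=distances.get)
```

With placeholder baselines `{"run": [0.9, 1.0, 1.1], "jog": [1.8, 2.0, 2.2], "walk": [2.8, 3.0, 3.3]}`, a test slope of -2.1 is classified as jogging to the left. Dividing by the standard deviation means that a category with widely spread baseline slopes tolerates larger deviations than a tightly clustered one.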
6. Experiments for Recognition
Experiments with real videos are conducted to examine the performance of the change detection module. These videos can be found at http://mklab.iti.gr/content/temporal-templates-human-activity-recognition, so that the reader can observe the ground truth and verify the validity of the experiments. The ground truth for the times of change is extracted manually and compared to the estimated change points to evaluate the detection performance.
We model the data by a Laplacian distribution (Section 4.2) to approximate $f_0$ and $f_1$ of (5), which are unknown and need to be estimated from the data $V_{k_0,k}$ at each time $k$. The distribution $f_0$ of the "current" data is extracted from the first $w_0$ samples of $V_{k_0,k}$, in order to take into account samples that belong to the old distribution, while $f_1$ is approximated using the most recent $w_1$ samples. There could be a change during the first $w_0$ samples used to approximate $f_0$, but there is no way to determine this a priori, so there is the implicit assumption that no change takes place in the first $w_0$ frames. Currently, there is no theoretically founded way to determine the optimal length of the windows $w_0$ and $w_1$, as stated in the change detection literature. Consequently, the best possible solution is to empirically determine the window lengths that give the best change detection results for certain categories of videos, and use them accordingly. After extensive experimentation, the window lengths that give the best detection results with the fewest false alarms are selected for detecting a change between successive activities. For periodic motions, the changes occur more often, so smaller windows are used.
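The windowed parameter estimation can be sketched as follows. The Laplacian parameters are fitted by maximum likelihood (the median for the location and the mean absolute deviation from it for the scale). The function names, the window arguments `w0` and `w1`, and the requirement that the two windows do not overlap are assumptions made for the example.

```python
from statistics import fmean, median
from typing import List, Optional, Sequence, Tuple

LaplaceParams = Tuple[float, float]  # (location mu, scale b)


def fit_laplace(samples: Sequence[float]) -> LaplaceParams:
    """Maximum-likelihood Laplacian fit: median and mean absolute deviation from it."""
    mu = median(samples)
    return mu, fmean(abs(x - mu) for x in samples)


def window_models(frames: List[Sequence[float]], w0: int,
                  w1: int) -> Optional[Tuple[LaplaceParams, LaplaceParams]]:
    """Pre-change model from the first w0 frames, post-change model from the last w1.

    frames holds the active-pixel samples of each frame since the last detected
    change. Returns None until enough frames are available for both windows.
    """
    if len(frames) < w0 + w1:
        return None
    old = [x for frame in frames[:w0] for x in frame]
    new = [x for frame in frames[-w1:] for x in frame]
    return fit_laplace(old), fit_laplace(new)
```

The two fitted models can then be supplied as the pre- and post-change distributions of the likelihood ratio test. A scale estimate of zero (all samples in a window identical) would have to be guarded against before it is used as a divisor.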
At each frame, the test statistic is estimated and compared against a threshold in order to determine whether or not a change has occurred. Due to the sequential nature of the system, there is no closed form expression for this threshold, so an optimal value cannot be determined for it a priori. It is found empirically that for videos of human motions like the ones examined here, the threshold which leads to the highest detection rate with the fewest false alarms is a function of the mean and standard deviation of the test statistic up to the current frame.
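One simple way to realize a data-driven threshold of this kind is sketched below in Python. The specific form (mean plus `c` standard deviations of the past values of the test statistic), the multiplier `c`, and the warm-up length are assumptions made for illustration and should not be read as the expression used by the authors.

```python
from statistics import fmean, pstdev
from typing import Optional, Sequence


def adaptive_threshold(history: Sequence[float], c: float = 3.0) -> float:
    """Assumed threshold form: mean + c * standard deviation of past statistics."""
    return fmean(history) + c * pstdev(history)


def first_crossing(statistics: Sequence[float], c: float = 3.0,
                   warmup: int = 5) -> Optional[int]:
    """First frame whose statistic exceeds the threshold built from earlier frames."""
    history = []
    for k, value in enumerate(statistics):
        # The threshold uses only frames before k, and only after a short warm-up.
        if len(history) >= warmup and value > adaptive_threshold(history, c):
            return k
        history.append(value)
    return None
```

Because the threshold follows the statistic's own level and spread, small fluctuations during a steady activity do not trigger a detection, while a sudden rise does.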
6.1. Experiments with Translational Motions
In this section, experimental results for videos containing translational motions, namely, walking, jogging, and running, are presented. Characteristic frames of some videos, the corresponding Activity Area, and the likelihood ratio over time are shown in Figure 9, and all the videos examined can be seen at http://mklab.iti.gr/content/temporal-templates-human-activity-recognition. The Activity Areas correctly capture the pixels that are active in each video and the likelihood ratio values change at the time when the actual change occurs. The change points are correctly detected for the videos with translational motions, as shown in Table 3, but for three of the videos false alarms are also detected. These false alarms are easily eliminated by examining the average motion and its variance for each extracted subsequence, as they do not change significantly before and after a false alarm. In this manner, no false alarms remain and only the correct change points are detected, shown in bold font in Table 3 (for the cases where there were false alarms). In the table, LR indicates that an activity takes place from left to right, HD means "horizontally-diagonally", LRL is left-right-left, and LRLR is left-right-left-right. The numbers (e.g., Jog LR1) distinguish between different videos of the same activity. The last two videos, Walk LRL and Walk LRLR, have two and three change points, respectively, which are correctly detected in both cases, with no false alarms.
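The false alarm removal can be sketched in Python as below: a candidate change point is kept only if the mean or the variance of the motion differs noticeably between the subsequences on either side of it; otherwise the two subsequences are merged. The per-frame motion summary `motion`, the relative tolerance `rel_tol`, and the similarity test are assumptions made for the example.

```python
from statistics import fmean, pvariance
from typing import List, Sequence


def _similar(a: float, b: float, rel_tol: float) -> bool:
    """True if a and b differ by at most rel_tol relative to the larger magnitude."""
    return abs(a - b) <= rel_tol * max(abs(a), abs(b), 1e-12)


def prune_false_alarms(motion: Sequence[float], change_points: Sequence[int],
                       rel_tol: float = 0.25) -> List[int]:
    """Keep change points across which the motion's mean or variance changes.

    motion[k] is a per-frame summary of the motion (e.g., its average value);
    change_points are sorted frame indices strictly inside the sequence.
    """
    kept = []
    start = 0
    bounds = list(change_points) + [len(motion)]
    for cp, end in zip(bounds, bounds[1:]):
        before, after = motion[start:cp], motion[cp:end]
        unchanged = (_similar(fmean(before), fmean(after), rel_tol)
                     and _similar(pvariance(before), pvariance(after), rel_tol))
        if not unchanged:
            kept.append(cp)
            start = cp  # the next comparison starts at this confirmed change
        # Otherwise the change point is dropped and the subsequences are merged.
    return kept
```

For a sequence whose motion reverses direction only at frame 20, candidate change points at frames 10, 20, and 30 are reduced to the single change point at frame 20.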
Figures 9(e)–9(i) contain frames from a walking sequence, where the pixels around the person's neck are mistaken for static pixels, leading to two Activity Areas, one corresponding to the head and one to the body, shown in Figures 9(f), 9(g). When there is more than one Activity Area, the sequential testing is applied to each Activity Area separately, since more than one activity could be taking place. In this example, the area corresponding to the head is too small to provide enough samples for a reliable estimate of the change point, so only the likelihood ratio values for the Activity Area corresponding to the body of the person with the coat are shown in Figures 9(h), 9(i). Even in this case, the change points are correctly found.
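One way to separate a binary mask into distinct Activity Areas, and to discard areas that are too small to yield a reliable change-point estimate, is connected-component labelling. The sketch below is an illustration under assumptions: the mask is a list of rows of 0/1 values, 8-connectivity is used, and the size cutoff `min_pixels` is a hypothetical parameter; the text does not state how the areas are separated or what size counts as too small.

```python
from collections import deque

def label_activity_areas(mask, min_pixels=1):
    """Split a binary mask into 8-connected components.

    mask       : list of rows, each a list of 0/1 values (assumed format)
    min_pixels : hypothetical cutoff; smaller areas are dropped
    Returns a list of areas, each a list of (row, col) pixel coordinates.
    """
    rows, cols = len(mask), len(mask[0])
    seen = [[False] * cols for _ in range(rows)]
    areas = []
    for r in range(rows):
        for c in range(cols):
            if not mask[r][c] or seen[r][c]:
                continue
            # Breadth-first search over the active neighbours of (r, c).
            queue = deque([(r, c)])
            seen[r][c] = True
            pixels = []
            while queue:
                y, x = queue.popleft()
                pixels.append((y, x))
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and mask[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            if len(pixels) >= min_pixels:
                areas.append(pixels)
    return areas

if __name__ == "__main__":
    # Toy mask: a small "head" blob above a larger "body" blob.
    mask = [
        [1, 1, 0, 0, 0],
        [0, 0, 0, 0, 0],
        [0, 1, 1, 1, 0],
        [0, 1, 1, 1, 0],
    ]
    for area in label_activity_areas(mask, min_pixels=3):
        print(len(area), "pixels")
```

The sequential test would then be run on the samples of each returned area independently.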
6.2. Experiments with Nontranslational Motions
Combinations of nontranslational motions are examined in this section. The first video contains a person clapping, followed by a person boxing, and the second shows a person waving followed by a person clapping (see Figure 10 and http://mklab.iti.gr/content/temporal-templates-human-activity-recognition). The resulting Activity Areas contain the pixels that move in both activities and the likelihood ratio values estimated over all active pixels lead to correct change point detection. For the clapping-boxing sequence, the correct change point is detected at frame , but there are also false alarms at frames , introduced because of changes in the individual repeating activities (clapping only or boxing only). As in Section 6.1, these false alarms are eliminated by simply estimating the motion characteristics of the extracted subsequences, which undergo significant change only at frame . In the handwaving-handclapping video, the true change point is found at frame , but false alarms are also detected at frames , which are removed as before, leading to the detection of only the correct change point. It should be emphasized that the relative height of the likelihood ratio values is not taken into account for the elimination of false alarms. Instead, the motion characteristics of the resulting subsequences are measured, as explained earlier.
6.2.1. Periodic Motions
The values of the data windows , chosen for approximating respectively, affect the resolution of the system. When have higher values, they detect changes at a coarse granularity, but at the cost of missing small changes inside each individual activity. In this section, we present experiments where these windows are set to , enabling the detection of changes in repeating activities with good accuracy.
Figure 11 shows frames of the videos examined, along with the corresponding activity areas, and log-likelihood ratio values. For the Boxing and Jumping in Place videos, two activity areas are extracted, one corresponding to the upper part of the human's body and one to the legs. This is because the middle area of the body is relatively static. For those cases, each activity area is examined separately: the resulting change points for the two activity areas coincide, and the motion characteristics between these change points are the same, so these areas are (correctly) assigned to the same activity. Table 4 shows the detected change points for each video and the resulting period. The last video is more complex, containing identical subsequences of a person walking left-right-left: all change points are found, and form a pattern that repeats times.
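The step from detected change points to a period can be illustrated with a short sketch. It assumes that the period is taken as the median spacing between successive change points and that the motion is called periodic when every spacing lies within a relative tolerance of that median; both the median rule and the tolerance `rel_tol` are assumptions, and the frame numbers in the usage example are invented for illustration, not taken from Table 4.

```python
from statistics import median

def estimate_period(change_points, rel_tol=0.15):
    """Return (period, is_periodic) from sorted change-point frame indices.

    period      : median spacing between successive change points (assumed rule)
    is_periodic : True if all spacings are within rel_tol of the period
    """
    if len(change_points) < 3:
        raise ValueError("need at least three change points")
    intervals = [b - a for a, b in zip(change_points, change_points[1:])]
    period = median(intervals)
    is_periodic = all(abs(i - period) <= rel_tol * period for i in intervals)
    return period, is_periodic

if __name__ == "__main__":
    # Hypothetical change points, roughly 18 frames apart.
    print(estimate_period([12, 30, 49, 67, 85]))
```

The median is used rather than the mean so that a single missed or spurious change point has less influence on the estimated period.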
6.3. Experiments with Multiple Activity Areas
A video of two people performing different activities at different, but overlapping, time intervals is examined (Figure 12, top two rows). The Activity Area consists of two distinct binary masks, corresponding to the different activities, so the sequential change detection takes place in each area separately. For both Activity Areas, the likelihood ratios for all pixels inside them correctly locate the times of change at frames for the person walking on the left, and at frame for the person walking on the right. Two more complicated videos, with multiple but overlapping activity areas, are also examined (Figure 12, last two rows). In this case, there is only one activity area, containing more than one activity, but the proposed method can still detect the changes of each activity. This is because enough of the data being processed undergoes a change, which is then detected by the sequential likelihood test. In the first video, with the crossing ladies, changes are found at frames when one lady enters and when another leaves, respectively. In the second video with the beach scene, changes are detected at frame , when the two ladies disappear behind the umbrella, at frame when the three ladies meet, frame when one lady is hidden by the umbrella, when the girl reappears, and when the two walking ladies disappear (see http://mklab.iti.gr/content/temporal-templates-human-activity-recognition). This shows that the proposed system can handle cases of multiple activities taking place during different, possibly overlapping, intervals, with accurate results. These videos also contain dynamically moving backgrounds, and yet accurate change detection is obtained for them.
6.4. Experiments with Dynamic Backgrounds
Several challenging videos involving dynamic backgrounds are examined. Despite the moving background, the activity areas are found with accuracy, as seen in Figure 13. The change detection results are extracted from the last column of Figure 13 and are tabulated in Table 5. All change points are detected correctly, along with a few false alarms, which are in italics in Table 5. The false alarms are easily removed by comparing the motion characteristics between estimated change points: before and after a false alarm, the motion characteristics do not change, so those change points are eliminated.
7. Experiments for Recognition
Experimental results for recognition based on the Activity Area and Activity History Area information are presented here. It should be emphasized that the activity recognition results are good although there is no training stage, so the proposed method is applicable to various kinds of activity, without restrictions imposed by the training set.
7.1. Recognition Using Fourier Shape Descriptors of Activity Area
Experiments for activity recognition take place for boxing, handclapping and handwaving, with Activity Area outlines like those in Figures 14(a)–14(f). The comparison of the FDs for videos of boxing, handclapping and handwaving each, leads to the correct classification of of the boxing, of the handclapping and of the handwaving sequences, as can be seen in Table 6. This makes intuitive sense, as the outlines of the Activity Areas for the boxing videos have a blob-like shape, which is not as descriptive as the other boundaries. Indeed, the best recognition results are achieved for the handclapping videos, whose Activity Area outlines have a very characteristic shape. Additionally, the boxing and handclapping motions are more often confused with each other than with the handwaving, as expected, since the latter's Activity Area has a very distinctive shape.
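To make the comparison of Fourier Shape Descriptors concrete, the following sketch computes a common variant of FDs from an outline and compares two outlines by Euclidean distance. It is an illustration under assumptions, not necessarily the exact descriptor used here: the outline is a closed list of (x, y) points sampled counterclockwise, the DC term is dropped for translation invariance, magnitudes are divided by the first harmonic for scale invariance, and only magnitudes are kept so that rotation and starting point do not matter. The number of coefficients `n_coeffs` and the function names are hypothetical.

```python
import cmath
import math

def fourier_descriptors(outline, n_coeffs=8):
    """Normalised Fourier descriptor magnitudes of a closed outline.

    outline  : list of (x, y) boundary points, assumed counterclockwise
    n_coeffs : number of harmonics kept (hypothetical choice)
    """
    z = [complex(x, y) for x, y in outline]
    n = len(z)
    if n_coeffs >= n:
        raise ValueError("outline has too few points")
    mags = []
    for k in range(1, n_coeffs + 1):  # k = 0 (centroid) is skipped
        s = sum(z[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        mags.append(abs(s) / n)
    if mags[0] == 0:
        raise ValueError("first harmonic is zero; check outline orientation")
    return [m / mags[0] for m in mags]

def fd_distance(fd_a, fd_b):
    """Euclidean distance between two descriptor vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(fd_a, fd_b)))

def circle(radius, cx, cy, n=32):
    """Counterclockwise circle outline with n points (test shape)."""
    return [(cx + radius * math.cos(2 * math.pi * t / n),
             cy + radius * math.sin(2 * math.pi * t / n)) for t in range(n)]

def square(n_side=8):
    """Counterclockwise unit-square outline with n_side points per side."""
    pts = []
    for i in range(n_side):
        pts.append((i / n_side, 0.0))
    for i in range(n_side):
        pts.append((1.0, i / n_side))
    for i in range(n_side):
        pts.append((1.0 - i / n_side, 1.0))
    for i in range(n_side):
        pts.append((0.0, 1.0 - i / n_side))
    return pts

if __name__ == "__main__":
    small = fourier_descriptors(circle(1.0, 0.0, 0.0))
    large = fourier_descriptors(circle(3.0, 5.0, -2.0))
    box = fourier_descriptors(square())
    print(fd_distance(small, large), fd_distance(small, box))
```

A test outline would be assigned to the activity whose reference descriptors are nearest under `fd_distance`; no training stage is involved, only a set of reference outlines.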
Different methods have also used this dataset for activity recognition. In , excellent recognition results of for boxing, for clapping, and for waving are achieved. However, that method is based on extracting motion templates (motion images and motion context) using very simple processing, which would fail for more challenging sequences, like those in Section 6.4: the standard deviation of the illumination over successive video frames is estimated to find active pixels, a measure which can easily lead to false alarms in the presence of noise. In , Support Vector Machines (SVMs) are used, so training is required in their method. They achieve recognition of for boxing, but for clapping and for waving, that is, worse than our results. Finally, in  volumetric features are used, leading to a higher computational cost, but achieving recognition results of only for boxing, for clapping and for waving (which is comparable to our result). Overall our approach has a consistently good performance, with recognition rates above , despite its simplicity, low computational cost, and the fact that it does not require any training or prior knowledge.
7.2. Recognition Using Activity History Area Features
For translational motion classification, we examine the subsequences extracted from the walking, jogging, and running videos of Section 6.1 after change detection. The direction of motion in each one is correctly found for all data. The Mahalanobis distance of the slope magnitude from the test values for each video is shown in Tables 7–9, where it can be seen that correct classification is achieved in all cases, both for the direction and for the type of motion.
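For a scalar feature such as the slope magnitude, the Mahalanobis distance to a class reduces to the absolute difference from the class mean divided by the class standard deviation. The sketch below classifies a test value by the smallest such distance. The reference means and standard deviations in the usage example are invented for illustration and are not the values behind Tables 7–9; the function names are likewise hypothetical.

```python
def mahalanobis_1d(x, mean, std):
    """Mahalanobis distance of a scalar x from a class with given mean and std."""
    if std <= 0:
        raise ValueError("std must be positive")
    return abs(x - mean) / std

def classify_motion(slope, references):
    """Return (label, distance) of the class nearest to the test slope magnitude.

    references : dict mapping a label to (mean, std) of the slope magnitude
                 for that class (assumed to be available as reference values)
    """
    distances = {label: mahalanobis_1d(slope, m, s)
                 for label, (m, s) in references.items()}
    best = min(distances, key=distances.get)
    return best, distances[best]

if __name__ == "__main__":
    # Hypothetical reference statistics per motion type.
    references = {"walk": (1.0, 0.2), "jog": (2.0, 0.3), "run": (3.5, 0.5)}
    print(classify_motion(2.1, references))
```

Dividing by the class standard deviation means that a class with widely varying slopes tolerates a larger deviation than a tightly clustered one, which is the reason for using this distance instead of a plain difference.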
8. Conclusions
In this work, a novel approach for the analysis of human motion in video is presented. The kurtosis of interframe illumination variations leads to binary masks, the Activity Areas, which indicate which pixels are active throughout the video. The temporal evolution of the activities is characterized by temporally weighted versions of the Activity Areas, the Activity History Areas. Changes in the activity taking place are detected via sequential change detection, applied to the interframe illumination variations. This separates the video into sequences containing different activities, based on changes in their motion. The activity taking place in each subsequence is then characterized by the shape of its Activity Area or by its magnitude and direction, derived from the Activity History Area. For nontranslational activities, Fourier Shape Descriptors represent the shape of each Activity Area and are compared with each other for recognition. Translational motions are characterized based on their relative magnitude and direction, which are retrieved from their Activity History Areas. The combined use of these recognition techniques with the proposed sequential change detection, which separates the video into sequences containing separate activities, leads to successful recognition results at a low computational cost. Future work includes the development of more sophisticated recognition methods, so as to achieve even better recognition rates. The application of change detection is also to be extended to a wider range of videos, as it is a generally applicable method, not limited to the domain of human actions.
References
Wang L, Hu W, Tan T: Recent developments in human motion analysis. Pattern Recognition 2003,36(3):585-601. 10.1016/S0031-3203(02)00100-0
Aggarwal JK, Cai Q: Human motion analysis: a review. Computer Vision and Image Understanding 1999,73(3):428-440. 10.1006/cviu.1998.0744
Gavrila DM: The visual analysis of human movement: a survey. Computer Vision and Image Understanding 1999,73(1):82-98. 10.1006/cviu.1998.0716
Akita K: Image sequence analysis of real world human motion. Pattern Recognition 1984,17(1):73-83. 10.1016/0031-3203(84)90036-0
Haritaoglu I, Harwood D, Davis LS: W4: real-time surveillance of people and their activities. IEEE Transactions on Pattern Analysis and Machine Intelligence 2000,22(8):809-830. 10.1109/34.868683
Bottino A, Laurentini A: A silhouette based technique for the reconstruction of human movement. Computer Vision and Image Understanding 2001, 83: 79-95. 10.1006/cviu.2001.0918
Green RD, Guan L: Quantifying and recognizing human movement patterns from monocular video images-part I: a new framework for modeling human motion. IEEE Transactions on Circuits and Systems for Video Technology 2004,14(2):179-189. 10.1109/TCSVT.2003.821976
Laptev I, Lindeberg T: Space-time interest points. Proceedings of the 9th IEEE International Conference on Computer Vision (ICCV '03), October 2003, Nice, France 1: 432-439.
Oren M, Papageorgiou C, Sinha P, Osuna E, Poggio T: Pedestrian detection using wavelet templates. Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '97), June 1997, San Juan, Puerto Rico, USA 193-199.
Singh M, Basu A, Mandal M: Human activity recognition based on silhouette directionality. IEEE Transactions on Circuits and Systems for Video Technology 2008,18(9):1280-1292.
Cootes T, Taylor C, Cooper D, Graham J: Active shape models-their training and application. Computer Vision and Image Understanding 1995,61(1):38-59. 10.1006/cviu.1995.1004
Cedras C, Shah M: Motion-based recognition: a survey. Image and Vision Computing 1995,13(2):129-155. 10.1016/0262-8856(95)93154-K
Boyd J, Little J: Global versus structured interpretation of motion: moving light displays. Proceedings of the IEEE Workshop on Motion of Non-Rigid and Articulated Objects (NAM '97), 1997 18-25.
Polana R, Nelson R: Detecting activities. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR '93), June 1993, New York, NY, USA 2-7.
Bobick AF, Davis JW: The recognition of human movement using temporal templates. IEEE Transactions on Pattern Analysis and Machine Intelligence 2001,23(3):257-267. 10.1109/34.910878
Gorelick L, Blank M, Shechtman E, Irani M, Basri R: Actions as space-time shapes. IEEE Transactions on Pattern Analysis and Machine Intelligence 2007,29(12):2247-2253.
Yamato J, Ohya J, Ishii K: Recognizing human action in time sequential images using hidden Markov model. Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition (CVPR '92), 1992, The Hague, The Netherlands 379-385.
Kale A, Sundaresan A, Rajagopalan AN, et al.: Identification of humans using gait. IEEE Transactions on Image Processing 2004,13(9):1163-1173. 10.1109/TIP.2004.832865
Sun X, Chen CW, Manjunath BS: Probabilistic motion parameter models for human activity recognition. Proceedings of the International Conference on Pattern Recognition (ICPR '02), August 2002, Quebec, Canada 16(1):443-446.
Wren CR, Azarbayejani A, Darrell T, Pentland AP: Pfinder: real-time tracking of the human body. IEEE Transactions on Pattern Analysis and Machine Intelligence 1997,19(7):780-785. 10.1109/34.598236
Aach T, Dümbgen L, Mester R, Toth D: Bayesian illumination-invariant motion detection. Proceedings of the IEEE International Conference on Image Processing (ICIP '01), October 2001, Thessaloniki, Greece 3: 640-643.
Stauffer C, Grimson W: Adaptive background mixture models for real-time tracking. Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '99), June 1999, Fort Collins, Colo, USA 2: 246-252.
El Hassouni M, Cherifi H, Aboutajdine D: HOS-based image sequence noise removal. IEEE Transactions on Image Processing 2006,15(3):572-581.
Giannakis GB, Tsatsanis MK: Time-domain tests for Gaussianity and time-reversibility. IEEE Transactions on Signal Processing 1994,42(12):3460-3472. 10.1109/78.340780
Stauffer C, Grimson W: Adaptive background mixture models for real-time tracking. Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '99), June 1999, Fort Collins, Colo, USA 2: 246-252.
Page ES: Continuous inspection schemes. Biometrika 1954,41(1):100-115.
Poor HV: An Introduction to Signal Detection and Estimation. 2nd edition. Springer, New York, NY, USA; 1994.
Wald A: Sequential Analysis. Dover Publications, New York, NY, USA; 2004.
Moustakides GV: Optimal stopping times for detecting changes in distributions. Annals of Statistics 1986,14(4):1379-1387. 10.1214/aos/1176350164
Basseville M, Nikiforov I: Detection of Abrupt Changes: Theory and Application. Prentice-Hall, Englewood Cliffs, NJ, USA; 1993.
Aiazzi B, Alparone L, Baronti S: Estimation based on entropy matching for generalized Gaussian PDF modeling. IEEE Signal Processing Letters 1999,6(6):138-140. 10.1109/97.763145
Nolan JP: Stable Distributions—Models for Heavy Tailed Data. Birkhäuser, Boston, Mass, USA; 2010.
Briassouli A, Tsakalides P, Stouraitis A: Hidden messages in heavy-tails: DCT-domain watermark detection using alpha-stable models. IEEE Transactions on Multimedia 2005, 7: 700-715.
Simitopoulos D, Tsaftaris SA, Boulgouris NV, Briassouli A, Strintzis MG: Fast watermarking of MPEG-1/2 streams using compressed-domain perceptual embedding and a generalized correlator detector. EURASIP Journal on Applied Signal Processing 2004, 8: 1088-1106.
Bober M: MPEG-7 visual shape descriptors. IEEE Transactions on Circuits and Systems for Video Technology 2001,11(6):716-719. 10.1109/76.927426
Zhang DS, Lu G: A comparative study of Fourier descriptors for shape representation and retrieval. Proceedings of the 5th Asian Conference on Computer Vision (ACCV '02), January 2002, Melbourne, Australia 646-651.
Hory C, Kokaram A, Christmas WJ: Threshold learning from samples drawn from the null hypothesis for the generalized likelihood ratio CUSUM test. Proceedings of the IEEE Workshop on Machine Learning for Signal Processing, September 2005 111-116.
Nikiforov IV: A generalized change detection problem. IEEE Transactions on Information Theory 1995,41(1):171-187. 10.1109/18.370109
Zhang ZM, Hu YQ, Chan S, Chia LT: Motion context: a new representation for human action recognition. Proceedings of the European Conference on Computer Vision (ECCV '08), October 2008, Marseille, France, Lecture Notes in Computer Science 5305: 817-829.
Schuldt C, Laptev I, Caputo B: Recognizing human actions: a local SVM approach. Proceedings of the 17th International Conference on Pattern Recognition, August 2004, Cambridge, UK
Ke Y, Sukthankar R, Hebert M: Efficient visual event detection using volumetric features. Proceedings of the 10th IEEE International Conference on Computer Vision (ICCV '05), October 2005, Beijing, China 1: 166-173.
Acknowledgments
The research leading to these results has received funding from the European Community's Seventh Framework Programme FP7/2007-2013 under grant agreement FP7-214306-JUMAS, and from FP6 under contracts no. 027685-MESH and FP6-027026-K-Space.
Cite this article
Briassouli, A., Tsiminaki, V. & Kompatsiaris, I. Human Motion Analysis via Statistical Motion Processing and Sequential Change Detection. J Image Video Proc 2009, 652050 (2009). https://doi.org/10.1155/2009/652050