Real-Time Multiview Recognition of Human Gestures by Distributed Image Processing
© Toshiyuki Kirishima et al. 2010
Received: 18 March 2009
Accepted: 3 June 2009
Published: 31 August 2009
Since a gesture involves dynamic and complex motion, multiview observation and recognition are desirable. For a better representation of gestures, one first needs to know from which views a gesture should be observed. Furthermore, how the recognition results are integrated becomes increasingly important as larger numbers of camera views are considered. To investigate these problems, we propose a framework under which multiview recognition is carried out, together with an integration scheme by which the recognition results are integrated online and in real time. For performance evaluation, we use the public ViHASi (Virtual Human Action Silhouette) image database as a benchmark, as well as our Japanese sign language (JSL) image database, which contains 18 kinds of hand signs. By examining the recognition rates of each gesture for each view, we found gestures that exhibit view dependency and gestures that do not. We also found that the view dependency itself can vary depending on the target gesture set. By integrating the recognition results of different views, our swarm-based integration provides more robust and better recognition performance than individual fixed-view recognition agents.
For the symbiosis of humans and machines, various kinds of sensing devices will be either implicitly or explicitly embedded, networked, and made to function cooperatively in our future living environments [1–3]. To cover wider areas of interest, multiple cameras will have to be deployed. In general, gesture recognition systems that operate in the real world must run in real time, including the time needed for event detection, tracking, and recognition. Since the number of cameras can be very large, distributed processing of incoming images at each camera node is inevitable in order to satisfy the real-time requirement. Moreover, improvements in recognition performance can be expected by integrating the responses from the distributed processing components, but it is usually not evident how those responses should be integrated. Furthermore, since a gesture is a dynamic and complex motion, single-view observation does not necessarily guarantee better recognition performance. One needs to know from which camera views a gesture should be observed in order to quantitatively determine the optimal camera configuration and views.
2. Related Work
For the visual understanding of human gestures, a number of recognition approaches and techniques have been proposed [4–10]. Vision-based approaches usually estimate the gesture class to which an incoming image belongs by applying pattern recognition techniques. To make recognition systems more reliable and usable in our activity spaces, many approaches that employ multiple cameras have been actively developed in recent years. These approaches can be classified into the geometry-based approach and the appearance-based approach. Since depth information can be computed from multiple camera views, the geometry-based approach can estimate the three-dimensional (3D) relationship between the human body and its activity spaces. For example, the actions of multiple persons, such as walking, including their paths, can be reliably estimated [2, 10]. The appearance-based approach, on the other hand, usually focuses on a more detailed understanding of human gestures. Since a gesture is a spatiotemporal event, spatial- and temporal-domain problems need to be considered at the same time. In our previous work [14], we investigated temporal-domain problems in gesture recognition and showed that recognition performance can depend on the image sampling rate. Although there are some studies on view selection problems [15, 16], they do not deal with human gestures, nor do they study how the recognition results should be integrated when larger numbers of camera views are available. As a consequence, the camera configuration and views of most multiview gesture recognition systems are determined empirically. There is thus a fundamental need to evaluate recognition performance as a function of camera view. To deal with the above-mentioned problems, we propose (1) a framework under which recognition is performed using multiple camera views and (2) an integration scheme by which the recognition results are integrated online and in real time. The effectiveness of the framework and the integration scheme is demonstrated by the evaluation experiments.
3. Multiview Gesture Recognition
3.2. Recognition Agent
where P denotes the maximum number of visual interest points and w_p the weight for the p-th visual interest point of a recognition agent. In the recognition phase, each component of the output vector in (1) is computed as the sum of the convolution between the similarity and each protocol map, as illustrated in Figure 2. The input image is judged to belong to the gesture class that returns the largest sum of convolution.
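The decision rule can be illustrated with a minimal Python sketch. The data structures here are hypothetical stand-ins for the quantities in (1): `similarity[c]` is the per-frame similarity sequence computed for gesture class `c`, and `protocol_maps[c]` is the learned protocol map of that class; the 1-D convolution is only a sketch of the actual computation, not the authors' implementation.

```python
import numpy as np

def classify(similarity, protocol_maps):
    """Judge the input to belong to the gesture class that returns
    the biggest sum of convolution between its similarity sequence
    and its learned protocol map (cf. (1) and Figure 2)."""
    scores = {}
    for c, proto in protocol_maps.items():
        # Sum of convolution between the similarity and the protocol map.
        scores[c] = np.convolve(similarity[c], proto, mode="valid").sum()
    return max(scores, key=scores.get)
```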
3.3. Frame Rate Control Method
Generally, the actual frame rate of a gesture recognition system depends on (1) the duration of each gesture, (2) the number of gesture classes, and (3) the performance of the implemented system. In addition, recognition systems must deal with slow and unstable frame rates caused by the following factors: (1) increases in pattern matching cost, (2) an increased number of recognition agents, and (3) load fluctuations from third-party processes running under the same operating system environment.
In order to maintain the specified frame rate, a feedback control system is introduced, as shown in the bottom part of Figure 2, which dynamically selects the magnitude of the processing load. The control inputs are the pattern scanning interval S, the pattern matching interval M, and the number of effective visual interest points P. Here, S refers to the jump interval used in scanning the feature image, and M refers to the loop interval used in matching the current feature vector against the feature vectors in the reference data set. The controlled variable is the measured frame rate f (fps), and f_ref (fps) is the target frame rate. The frame rate is stabilized by controlling the load of the recognition modules: the control inputs are determined in accordance with the response from the frame rate detector, and the feedback control is applied as long as the control deviation does not fall within the minimal error range.
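As a rough illustration, the loop below adjusts the three control inputs toward lower processing load whenever the measured frame rate falls below the target, and relaxes them again when there is headroom. The step sizes, bounds, and the simple control law are our own assumptions for this sketch; the paper does not specify them here.

```python
TARGET_FPS = 30.0   # target frame rate f_ref
DEADBAND   = 1.0    # minimal error range (fps): no action inside it

# Control inputs (initial values and bounds are assumed).
scan_step  = 1      # S: jump interval in scanning the feature image
match_step = 1      # M: loop interval in matching against the reference set
n_points   = 64     # P: number of effective visual interest points

def control_step(measured_fps):
    """One iteration of the frame rate feedback loop."""
    global scan_step, match_step, n_points
    error = TARGET_FPS - measured_fps
    if abs(error) <= DEADBAND:
        return                      # control deviation within error range
    if error > 0:                   # too slow: reduce the processing load
        scan_step  = min(scan_step + 1, 8)
        match_step = min(match_step + 1, 8)
        n_points   = max(n_points - 4, 8)
    else:                           # faster than needed: restore accuracy
        scan_step  = max(scan_step - 1, 1)
        match_step = max(match_step - 1, 1)
        n_points   = min(n_points + 4, 64)
```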
The experiments are conducted on a personal computer (Core 2 Duo, 2 GHz, 2 GB memory) under the Linux operating system.
4.1. Experiment I
Target gesture sets (Part I).
In this experiment, the image acquisition agent reads out the multiview image, converts each view image into an 8-bit gray-scale image with a resolution of 80 × 60 pixels, and stores it in the shared memory area. Each recognition agent reads out the image and performs the recognition online and in real time. The experiments are carried out according to the following procedure; the acquisition step is sketched after the procedure.
Launch four recognition agents, one for each camera view, and then perform the protocol learning on the six kinds of gestures in each group. In this experiment, one recognition agent also plays the role of the integration agent. Since the ViHASi image database does not contain similar samples for each gesture, the standard samples are also used as training samples in the protocol learning.
The target frame rate of each recognition agent is set to 30 fps. Then, the frame rate control is started.
Feed the testing samples into the recognition system. For each gesture, 10 standard samples are tested.
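The acquisition step described above can be pictured as follows. This is a minimal sketch assuming OpenCV for the gray-scale conversion and Python's multiprocessing.shared_memory for the shared buffer; the buffer name, layout, and choice of libraries are our assumptions, not the authors' implementation.

```python
import cv2
import numpy as np
from multiprocessing import shared_memory

W, H = 80, 60      # image resolution used in the experiments
N_VIEWS = 4        # one slot per camera view / recognition agent

# One 8-bit gray-scale slot per view in a shared memory area.
shm = shared_memory.SharedMemory(create=True, size=N_VIEWS * W * H,
                                 name="multiview_frames")
buf = np.ndarray((N_VIEWS, H, W), dtype=np.uint8, buffer=shm.buf)

def store_view(view_index, frame_bgr):
    """Convert one camera view to an 8-bit gray-scale image of
    80 x 60 pixels and store it where the recognition agents read it."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    buf[view_index] = cv2.resize(gray, (W, H))
```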
4.2. Experiment II
Target gesture sets (Part II).
In this experiment, the image acquisition agent reads out the multiview image in the database, converts each camera view image into an 8-bit gray-scale image with a resolution of 80 × 60 pixels, and stores each gray-scale image in the shared memory area. Each recognition agent reads out the image and performs the recognition online and in real time. The experiments are carried out according to the following procedure.
Launch four recognition agents, one for each camera view, and then perform the protocol learning on the six kinds of gestures in each group. In this experiment, one recognition agent also plays the role of the integration agent. As training samples, one standard sample and one similar sample are used for the learning of each gesture.
The target frame rate of each recognition agent is set to 30 fps. Then, the frame rate control is started.
Feed the testing samples into the recognition system. For each gesture, 20 similar samples that are not used in the training phase are tested.
4.3. Experiment III
Target gesture sets (Part III).
In the above experiments, each recognition rate is computed by dividing the rate of correct answers by the sum of the rate of correct answers and the rate of wrong answers. The rate of correct answers is the ratio of the number of correct recognitions to the number of processed image frames, calculated only for the correct gesture class. The rate of wrong answers is the ratio of the number of wrong recognitions to the number of processed image frames, calculated over all gesture classes except the correct one. In this way, the recognition rate reflects the occurrence of incorrect recognitions during the evaluation. The recognition rates shown in the figures and tables are the values given by this calculation, averaged over the 10 testing samples of each gesture in Experiment I and over the 20 testing samples in Experiments II and III.
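Written out in our own notation (the paper gives this definition only in words), the recognition rate R of one testing sample is

```latex
\[
  R = \frac{r_c}{r_c + r_w}, \qquad
  r_c = \frac{N_{\text{correct}}}{N_{\text{frames}}}, \qquad
  r_w = \frac{N_{\text{wrong}}}{N_{\text{frames}}},
\]
% r_c: rate of correct answers, counted for the correct gesture class only
% r_w: rate of wrong answers, counted over all other gesture classes
% N_frames: number of processed image frames of the sample
```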
5.1. Performance on ViHASi Database
Average recognition rates for each gesture group in Experiments I, II, and III (%).
5.2. Performance on Our JSL Database
5.3. Classification by View Dependency
Classification by view dependency.
GA-A, GA-B, GA-C
GA-D, GA-E, GA-F
GB-A, GB-B, GB-C
GB-D, GB-E, GB-F
GC-A, GC-B, GC-C
GC-D, GC-E, GC-F
GD-A, GD-B, GD-C
GD-D, GD-E, GD-F
GE-A, GE-C, GE-D
GF-A, GF-B, GF-E
GD-E, GE-B, GF-E
GE-A, GD-F, GD-D
GD-B, GE-C, GF-F
5.4. Analysis on View Dependency
5.5. Quantitative Difference between ViHASi and Our JSL Image Database
In this paper, a framework has been proposed for the multiview recognition of human gestures by real-time distributed image processing. In our framework, recognition agents run in parallel for different views, and their recognition results are integrated online and in real time. In the experiments, the proposed approach was evaluated using two kinds of image databases: (1) the public ViHASi image database and (2) our original JSL image database. By examining the recognition rates of each gesture for each view, we found gestures that exhibit view dependency and gestures that do not, and the most suitable view for recognition varied depending on the gestures in each of the nine groups. More importantly, some gestures changed their view dependency when the combination of target gestures changed. Predicting the most suitable view is therefore difficult, especially when the target gesture sets are not determined beforehand, as in the case of user-defined gestures. On the whole, the integration agent demonstrated better recognition performance than the individual fixed-view recognition agents. The results presented in this paper indicate the effectiveness of our swarm-based approach to multiview gesture recognition. Future work includes the application of our approach to many-view gesture recognition in sensor network environments.
- Weiser M: Hot topics-ubiquitous computing. Computer 1993, 26(10):71-72. doi:10.1109/2.237456
- Matsuyama T, Ukita N: Real-time multitarget tracking by a cooperative distributed vision system. Proceedings of the IEEE 2002, 90(7):1136-1149. doi:10.1109/JPROC.2002.801442
- Liu R, Wang Y, Yang H, Pan W: An evolutionary system development approach in a pervasive computing environment. Proceedings of the International Conference on Cyberworlds (CW '04), November 2004, 194-199.
- Yamato J, Ohya J, Ishii K: Recognizing human action in time-sequential images using hidden Markov model. Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '92), June 1992, Champaign, Ill, USA, 379-385.
- Wren CR, Azarbayejani A, Darrell T, Pentland AP: Pfinder: real-time tracking of the human body. IEEE Transactions on Pattern Analysis and Machine Intelligence 1997, 19(7):780-785. doi:10.1109/34.598236
- Corradini A: Dynamic time warping for off-line recognition of a small gesture vocabulary. Proceedings of the IEEE ICCV Workshop on Recognition, Analysis, and Tracking of Faces and Gestures in Real-Time Systems, July 2001, 82-89.
- Dreuw P, Deselaers T, Rybach D, Keysers D, Ney H: Tracking using dynamic programming for appearance-based sign language recognition. Proceedings of the 7th International Conference on Automatic Face and Gesture Recognition (FGR '06), April 2006, 293-298.
- Hang Z, Qiuqi R: Visual gesture recognition with color segmentation and support vector machines. Proceedings of the International Conference on Signal Processing (ICSP '04), September 2004, Beijing, China, 2:1443-1446.
- Wong S-F, Cipolla R: Continuous gesture recognition using a sparse Bayesian classifier. Proceedings of the International Conference on Pattern Recognition, September 2006, 1:1084-1087.
- Jung UC, Seung HJ, Xuan DP, Jae WJ: Multiple objects tracking circuit using particle filters with multiple features. Proceedings of the International Conference on Robotics and Automation, April 2007, 4639-4644.
- Wan C, Yuan B, Miao Z: Model-based markerless human body motion capture using multiple cameras. Proceedings of the IEEE International Conference on Multimedia and Expo, July 2007, 1099-1102.
- Ahmad M, Lee S-W: HMM-based human action recognition using multiview image sequences. Proceedings of the International Conference on Pattern Recognition (ICPR '06), September 2006, 1:263-266.
- Utsumi A, Mori H, Ohya J, Yachida M: Multiple-human tracking using multiple cameras. Proceedings of the 3rd IEEE International Conference on Automatic Face and Gesture Recognition (FGR '98), April 1998, 498-503.
- Kirishima T, Manabe Y, Sato K, Chihara K: Multi-rate recognition of human gestures by concurrent frame rate control. Proceedings of the 23rd International Conference Image and Vision Computing New Zealand (IVCNZ '08), November 2008, 1-6.
- Abbasi S, Mokhtarian F: Automatic view selection in multi-view object recognition. Proceedings of the 15th International Conference on Pattern Recognition (ICPR '00), September 2000, 13-16.
- Navarro-Serment LE, Dolan JM, Khosla PK: Optimal sensor placement for cooperative distributed vision. Proceedings of the IEEE International Conference on Robotics and Automation (ICRA '04), July 2004, 939-944.
- Kirishima T, Sato K, Chihara K: Real-time gesture recognition by learning and selective control of visual interest points. IEEE Transactions on Pattern Analysis and Machine Intelligence 2005, 27(3):351-364.
- Ragheb H, Velastin S, Remagnino P, Ellis T: ViHASi: virtual human action silhouette data for the performance evaluation of silhouette-based action recognition methods. Proceedings of the 2nd ACM/IEEE International Conference on Distributed Smart Cameras (ICDSC '08), September 2008, Palo Alto, Calif, USA, 1-10.
- Hinchey MG, Sterritt R, Rouff C: Swarms and swarm intelligence. Computer 2007, 40(4):111-113.
- Fernández-Carrasco LM, Terashima-Marín H, Valenzuela-Rendón M: On the path towards autonomic computing: combining swarm intelligence and excitable media models. Proceedings of the 7th Mexican International Conference on Artificial Intelligence (MICAI '08), October 2008, 192-198.
- Saisan P, Medasani S, Owechko Y: Multi-view classifier swarms for pedestrian detection and tracking. Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '05), June 2005, San Diego, Calif, USA, 18.
- Scheutz M: Real-time hierarchical swarms for rapid adaptive multi-level pattern detection and tracking. Proceedings of the IEEE Swarm Intelligence Symposium (SIS '07), April 2007, 234-241.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.