Model-Based Hand Tracking by Chamfer Distance and Adaptive Color Learning Using Particle Filter
© C. Kerdvibulvech and H. Saito. 2009
Received: 31 January 2009
Accepted: 3 October 2009
Published: 29 December 2009
We propose a new model-based hand tracking method for recovering three-dimensional hand motion from an image sequence. We first build a three-dimensional hand model using truncated quadrics. The degrees of freedom (DOF) for each joint correspond to the DOF of a real hand. Feature extraction uses the Chamfer Distance function for the edge likelihood, while the silhouette likelihood is computed with a Bayesian classifier and the online adaptation of skin color probabilities, which lets the method deal effectively with illumination changes. Particle filtering is used to track the hand by predicting the next state of the three-dimensional hand model. Together, these techniques give the method the ability to recover automatically from tracking failures. The method can also be used to track a guitarist's hand.
Acoustic guitars are currently very popular and as a consequence, research about guitars is a popular topic in the field of computer vision for musical applications.
Maki-Patola et al.  proposed a system called "Virtual Air Guitar" using computer vision. Their aim was to create a virtual air guitar which does not require a real guitar but produces music similar to a player using a real guitar. Liarokapis  proposed an augmented reality system for guitar learners. The aim of his work is to show the augmentation (e.g., the positions where the learner should place the fingers to play the correct chords) on an electric guitar as a guide for the novice player. Motokawa and Saito  built a system called "Online Guitar Tracking" that supports a guitarist by using augmented reality. This is done by showing a virtual model of the fingers on a stringed guitar as a teaching aid for anyone learning how to play the guitar.
These systems do not aim to detect the fingering and hand posture that a player is actually using (a pair of gloves is tracked in the Virtual Air Guitar, and graphics information is overlaid on captured video in the two augmented reality systems). Our approach differs from most of this research.
A challenge in tracking the hand of a guitar player is that, while playing the guitar, the fingers are not stretched out separately. Existing model-based hand tracking methods are therefore not directly applicable to a guitarist's hand, because the fingers are usually bent while playing. Moreover, the background is dynamic and nonuniform (e.g., guitar neck and natural scene), which makes background segmentation more difficult. Also, for many classical guitars, the colors of frets and strings are very similar to skin color. As a result, tracking the hand correctly is not an easy task.
To begin with, we construct a three-dimensional hand model using truncated quadrics as building blocks, approximating the anatomy of a real human hand. A hierarchical model with 27 degrees of freedom (DOF) is used. The DOF for each joint correspond to the DOF of a real hand. Then we extract corresponding features (edges and silhouette) from the three-dimensional hand model and the input image. Canny edge detection is used to extract the edges, and the Chamfer Distance function is then used for the edge likelihood. The silhouette is determined by using a Bayesian classifier and the online adaptation of skin color probabilities [11, 12]. Thanks to the online adaptation, the method copes well with illumination changes. Following this, the particle filter is applied to track the hand by predicting the next state of the three-dimensional hand model. As a result, the system enables us to visually track the hand, and it can be applied to track the hand of a guitarist.
The advantage of the particle filter is that the tracker is able to initialize and recover. A particle filter uses many state vectors (particles) to represent possible solutions at each time instant. If the hand moves so fast that the track is lost, the tracker can automatically recover from the failure. Also, previous work has used a template-based method. In that work, the range of allowed hand motion is limited by the number of templates that need to be stored. Such a method therefore requires enough templates to be created manually in advance; otherwise, its coverage will not be sufficient for the hand's movements. In contrast, applying a particle filter avoids the cumbersome process of manually generating many templates.
The work most similar to ours is by De la Gorce et al. They focused on recovering the geometric and photometric pose parameters of a skinned hand model from monocular image sequences. However, they do not aim to apply model-based hand tracking to a guitarist's hand, where the guitar neck appears in the background. Their work assumes that the background image is basically static and was obtained from a frame in which the hand was not visible. In other words, when they define the image synthesis process, they assume that the background image is known (if the background is not static, they relax this constraint by assuming that the color distribution associated with the background is available). In guitarist hand tracking the situation is different: the background is dynamic because the guitar neck is not fixed relative to the camera. Because the guitar moves, we cannot simply use background subtraction against an initial frame in which the hand is not visible. In addition, for many classical guitars, the colors of frets and strings are similar to skin color, which makes it difficult to segment the hand from the guitar neck. A method for robust hand segmentation from the guitar neck is therefore needed. For this reason, in this paper we apply a Bayesian classifier and the online adaptation of skin color probabilities [11, 12] to robustly segment the hand region from the guitar neck. This approach also deals well with illumination changes.
2.1. Construct 3D Hand Model
and from this the clipping parameters for the cone are found as
2.2. Feature Extraction
This section explains the feature extraction used within the algorithm. The methods explained in this section are applied to both the hand model image and the captured image. The likelihood relates observations in the image to the unknown state. The likelihoods we use are based on the edge map of the image (edge) as well as pixel color values (silhouette). These features have proved useful for detecting and tracking hands in previous work. Therefore, the joint likelihood is approximated as the product of the edge likelihood and the silhouette likelihood.
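In symbols, one way to write such a joint likelihood is sketched below. It assumes that the edge and silhouette observations are conditionally independent given the state; the symbols $\mathbf{z}$, $\mathbf{x}$, $\mathbf{z}_{\text{edge}}$, and $\mathbf{z}_{\text{sil}}$ are notation introduced here for illustration and are not recovered from the original text.

```latex
% Assumed notation: x is the hand state, z = (z_edge, z_sil) the observation.
p(\mathbf{z} \mid \mathbf{x}) \approx
  p(\mathbf{z}_{\text{edge}} \mid \mathbf{x}) \,
  p(\mathbf{z}_{\text{sil}} \mid \mathbf{x})
```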
2.2.1. Edge Likelihood
We first extract features by considering the edge likelihood. The edge likelihood term is based on the chamfer distance function. Given the set of template points $A = \{a_i\}_{i=1}^{N_a}$ and the set of Canny edge points $B = \{b_j\}_{j=1}^{N_b}$, a quadratic chamfer distance function is given by the average of the squared distances between each point of A and its closest point in B:

$$d(A, B) = \frac{1}{N_a} \sum_{a \in A} \min_{b \in B} \lVert a - b \rVert^2 .$$
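The quadratic chamfer distance described above can be sketched in a few lines of Python. This is a minimal brute-force illustration, not the implementation used in the experiments: the function names are ours, and the exponential mapping from distance to likelihood, together with its `sigma` parameter, is a hypothetical choice that the text does not specify.

```python
import math


def quadratic_chamfer_distance(template_pts, edge_pts):
    """Average squared distance from each template point to its nearest edge point."""
    if not template_pts or not edge_pts:
        raise ValueError("both point sets must be non-empty")
    total = 0.0
    for ax, ay in template_pts:
        # Brute-force nearest neighbour; a precomputed distance transform of
        # the edge map would normally replace this inner loop.
        total += min((ax - bx) ** 2 + (ay - by) ** 2 for bx, by in edge_pts)
    return total / len(template_pts)


def edge_likelihood(template_pts, edge_pts, sigma=5.0):
    """Hypothetical mapping from chamfer distance to an unnormalized likelihood."""
    d = quadratic_chamfer_distance(template_pts, edge_pts)
    return math.exp(-d / (2.0 * sigma ** 2))


template = [(0, 0), (2, 0)]   # projected model contour points (set A)
edges = [(0, 1), (2, 3)]      # Canny edge points (set B)
print(quadratic_chamfer_distance(template, edges))  # prints 3.0
```

In the example, the nearest squared distances are 1 and 5, so their average is 3. A smaller distance means the projected model contour lies closer to the image edges.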
2.2.2. Silhouette Likelihood
We also perform feature extraction by determining the silhouette likelihood. The silhouette is calculated by using a Bayesian classifier and the online adaptation of skin color probabilities [11, 12].
The learning process has two phases. In the first phase, the color probability is obtained from a small number of training images during an offline preprocess. In the second phase, we gradually update the probability automatically and adaptively from the additional training data images. The adapting process can be disabled as soon as the achieved training is deemed sufficient.
Therefore, this method allows us to get accurate color probability of the skin from only a small set of manually prepared training images. This is because the additional skin region does not need to be segmented manually. Also, because of the adaptive learning, it can be used robustly with changing illumination during the online operation.
Because our method learns the color probability of the hand adaptively, the background of the test images does not have to stay the same. When the background changes suddenly, the segmentation may be error prone at first, but once several frames have been learned adaptively, the segmentation recovers.
Learning from Training Data Set
During an offline phase, a small set of training input images (20 images) is selected on which a human operator manually segments skin regions. The color representation used in this process is YUV . However, the Y-component of this representation is not employed for two reasons. Firstly, the Y-component corresponds to the illumination of an image pixel. By omitting this component, the developed classifier becomes less sensitive to illumination changes. Secondly, compared to a 3D color representation (YUV), a 2D color representation (UV) is lower in dimensions and, therefore, less demanding in terms of memory storage and processing costs.
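As an illustration of dropping the Y-component, the sketch below converts an 8-bit RGB pixel to its (U, V) pair and quantizes it into a 2D lookup-table index. The conversion uses the common analog YUV (BT.601) weights, which we assume here because the text does not state which YUV variant is used. The function names and the bin count of 32 are hypothetical.

```python
def rgb_to_uv(r, g, b):
    """Convert 8-bit RGB to the (U, V) chrominance pair, discarding luma Y."""
    y = 0.299 * r + 0.587 * g + 0.114 * b  # luma, needed only to derive U and V
    u = 0.492 * (b - y)
    v = 0.877 * (r - y)
    return u, v


def uv_bin(r, g, b, bins=32):
    """Quantize (U, V) into a 2D table index; the bin count is a hypothetical choice."""
    u, v = rgb_to_uv(r, g, b)
    # Approximate extreme magnitudes of U and V for 8-bit RGB input.
    u_max, v_max = 0.436 * 255, 0.615 * 255
    iu = min(bins - 1, max(0, int((u + u_max) / (2 * u_max) * bins)))
    iv = min(bins - 1, max(0, int((v + v_max) / (2 * v_max) * bins)))
    return iu, iv


print(rgb_to_uv(200, 150, 120))
print(uv_bin(200, 150, 120))
```

A 32 x 32 table has 1024 entries, compared with 32768 for a three-dimensional table at the same resolution, which is the storage saving the second reason refers to.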
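From the training set, we compute the conditional probability $P(c \mid s)$ of skin having color c, defined as the ratio of the number of occurrences of color c within the skin-colored areas to the number of skin-colored image points in the training set. Together with the prior probability $P(s)$ of skin and the prior probability $P(c)$ of color c, Bayes' rule gives the probability $P(s \mid c)$ of a color c being a skin color:

$$P(s \mid c) = \frac{P(c \mid s)\, P(s)}{P(c)}. \quad (1)$$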
This equation determines the probability of a certain image pixel being skin-colored using a lookup table indexed with the pixel's color. The resulting probability map is then thresholded with two thresholds, $T_{max}$ and $T_{min}$: all pixels with probability $P(s \mid c) > T_{max}$ are considered skin-colored (these pixels constitute seeds of potential skin-colored blobs), and image pixels with probability $P(s \mid c) > T_{min}$ that are neighbors of skin-colored image pixels are recursively added to each color blob. The rationale behind this region growing operation is that an image pixel with a relatively low probability of being skin-colored should still be considered skin-colored when it is a neighbor of an image pixel with a high probability of being skin-colored. The values for $T_{max}$ and $T_{min}$ should be determined by test experiments (we use 0.5 and 0.15, respectively, in the experiments in this paper). A standard connected component labelling algorithm (i.e., depth-first search) is then responsible for assigning different labels to the image pixels of different blobs. Size filtering on the derived connected components is also performed to eliminate small isolated blobs that are attributed to noise and do not correspond to interesting skin-colored regions. Each of the remaining connected components corresponds to a skin-colored blob.
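The two-threshold region growing, labelling, and size filtering described above can be sketched as follows. This is an illustrative version under our own assumptions: it uses 4-connectivity, an iterative depth-first search, and a hypothetical `min_size` parameter. The thresholds default to the 0.5 and 0.15 quoted above.

```python
def grow_skin_blobs(prob, t_max=0.5, t_min=0.15, min_size=2):
    """Label skin blobs: seeds above t_max, grown through neighbours above t_min."""
    rows, cols = len(prob), len(prob[0])
    labels = [[0] * cols for _ in range(rows)]
    sizes = {}
    next_label = 1
    for r in range(rows):
        for c in range(cols):
            if prob[r][c] <= t_max or labels[r][c]:
                continue  # not a seed, or already absorbed into a blob
            # Iterative depth-first search starting from a new seed pixel.
            stack = [(r, c)]
            labels[r][c] = next_label
            size = 0
            while stack:
                y, x = stack.pop()
                size += 1
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and not labels[ny][nx] and prob[ny][nx] > t_min):
                        labels[ny][nx] = next_label
                        stack.append((ny, nx))
            sizes[next_label] = size
            next_label += 1
    # Size filtering: erase blobs that are too small to be a hand.
    for r in range(rows):
        for c in range(cols):
            if labels[r][c] and sizes[labels[r][c]] < min_size:
                labels[r][c] = 0
    kept = {label: n for label, n in sizes.items() if n >= min_size}
    return labels, kept


prob_map = [
    [0.9, 0.2, 0.0, 0.0],
    [0.1, 0.3, 0.0, 0.6],
    [0.0, 0.0, 0.0, 0.0],
]
labels, kept = grow_skin_blobs(prob_map)
print(kept)  # prints {1: 3}
```

In the example, the seed at the top-left grows through two lower-probability neighbours into a blob of three pixels, while the isolated seed on the right is removed by size filtering.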
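The adapted probability is computed as

$$P_A(s \mid c) = \gamma P(s \mid c) + (1 - \gamma) P_W(s \mid c),$$

where $\gamma$ is a sensitivity parameter that controls the influence of the training set in the detection process, $P_A(s \mid c)$ represents the adapted probability of a color c being a skin color, and $P(s \mid c)$ and $P_W(s \mid c)$ are both given by (1) but involve prior probabilities that have been computed from the whole training set (for $P(s \mid c)$) and from the detection results in the last W frames (for $P_W(s \mid c)$). In our implementation, $\gamma$ and $W$ are set to fixed values.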
Thus, the hand skin color probability can be determined adaptively. By using online adaptation of skin color probabilities, the classifier is easily able to cope with considerable illumination changes. Example hand segmentation is illustrated in Figure 6.
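The two learning phases can be illustrated with a small sketch. The first function builds the skin probability table from labeled pixels with Bayes' rule; the second blends the offline table with one estimated from recent frames. The blend is assumed here to be a convex combination weighted by a sensitivity parameter, following the adaptation scheme of [11, 12]. The function names, the default `gamma`, and the treatment of unseen colors as probability zero are our assumptions.

```python
from collections import Counter


def skin_probability_table(pixels):
    """Estimate P(s|c) with Bayes' rule from (color, is_skin) training pairs."""
    color_counts = Counter(color for color, _ in pixels)
    skin_counts = Counter(color for color, is_skin in pixels if is_skin)
    n_total = len(pixels)
    n_skin = sum(skin_counts.values())
    if n_skin == 0:
        return {}
    p_s = n_skin / n_total                       # prior P(s)
    table = {}
    for color, n_c in color_counts.items():
        p_c = n_c / n_total                      # prior P(c)
        p_c_given_s = skin_counts[color] / n_skin  # likelihood P(c|s)
        table[color] = p_c_given_s * p_s / p_c
    return table


def adapt(p_train, p_recent, gamma=0.8):
    """Blend the offline table with the table from the last W frames.

    gamma is the sensitivity parameter; colors missing from a table are
    treated as probability 0.0 (an assumption of this sketch).
    """
    colors = set(p_train) | set(p_recent)
    return {c: gamma * p_train.get(c, 0.0) + (1.0 - gamma) * p_recent.get(c, 0.0)
            for c in colors}


training = [((10, 20), True), ((10, 20), True), ((10, 20), False), ((30, 40), False)]
offline = skin_probability_table(training)
recent = {(10, 20): 1.0}
print(adapt(offline, recent, gamma=0.5))
```

With a larger `gamma` the classifier trusts the manually segmented training set more; with a smaller one it follows recent detections more closely, which is what lets it track illumination changes.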
In this way, given the shape template and the observed silhouette image, the likelihood function is defined from the ratio difference of the overlapped areas, which are calculated with the adaptive hand segmentation algorithm.
2.3. Particle Filter Tracking
Particle filtering is a useful tool for tracking objects in clutter, with the advantage of automatic recovery from tracking failures. We apply a particle filter to track the hand. In our method, each particle represents one candidate state of the hand model, with one component for each DOF. We determine the probability-density function from the edge likelihood and the silhouette likelihood, as explained in Sections 2.2.1 and 2.2.2, respectively. The calculation is based on the following analysis.
Given that the process at each time-step is an iteration of factored sampling, the output of an iteration will be a weighted, time-stamped sample-set, denoted by $\{s_t^{(n)}, n = 1, \ldots, N\}$ with weights $\pi_t^{(n)}$, representing approximately the probability-density function $p(x_t)$ at time t, where N is the size of the sample set, $s_t^{(n)}$ is defined as the position of the nth particle at time t, and $x_t$ represents the position of the hand model at time t.
In the first stage (the selection stage), a sample $s'^{(n)}_t$ is chosen from the sample-set $\{s_{t-1}^{(n)}\}$ with probabilities $\pi_{t-1}^{(j)}$, where $c_{t-1}^{(j)}$ is the cumulative weight. This is done by generating a uniformly distributed random number $r \in [0, 1]$. We find the smallest j for which $c_{t-1}^{(j)} \ge r$ using binary search, and then set $s'^{(n)}_t = s_{t-1}^{(j)}$.
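The selection stage can be sketched with Python's standard `bisect` module, which performs the binary search over the cumulative weights. The function names are ours, and the code illustrates the sampling idea rather than the tracker's actual implementation.

```python
import bisect
import random


def cumulative_weights(weights):
    """Normalize the weights and return their running sum."""
    total = sum(weights)
    cum, running = [], 0.0
    for w in weights:
        running += w / total
        cum.append(running)
    cum[-1] = 1.0  # guard against floating-point rounding in the last entry
    return cum


def select_samples(particles, cum, rng=random):
    """Draw len(particles) samples with probability proportional to weight."""
    selected = []
    for _ in range(len(particles)):
        r = rng.random()                 # uniform random number in [0, 1)
        j = bisect.bisect_left(cum, r)   # smallest j with cum[j] >= r
        selected.append(particles[j])
    return selected


particles = ["a", "b", "c"]
cum = cumulative_weights([1.0, 1.0, 2.0])
print(cum)  # prints [0.25, 0.5, 1.0]
print(select_samples(particles, cum, random.Random(0)))
```

Particles with large weights occupy wide intervals of the cumulative distribution and are therefore selected more often, while low-weight particles tend to die out.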
Here the noise is drawn from a Gaussian distribution, and the remaining term is the propagation function. We have tried different propagation functions (e.g., a constant velocity motion model and an acceleration motion model), but our experimental results show that neither gives a significant improvement. A possible reason is that the hand frequently changes direction while being tracked, so the velocities or accelerations calculated in the previous frame do not give an accurate prediction for the next frame. We therefore use only the noise term when defining the propagation in (9).
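Next, we update the cumulative probability, which can be calculated from the normalized weights $\pi_t^{(n)}$ using

$$c_t^{(0)} = 0, \qquad c_t^{(n)} = c_t^{(n-1)} + \pi_t^{(n)}, \quad n = 1, \ldots, N.$$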
The hand can then be tracked, with automatic recovery from tracking failures.
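The remaining stages of one iteration (propagation by Gaussian noise only, weighting by the likelihood, normalization, and the cumulative update) can be sketched as follows. The `likelihood` callable stands in for the combined edge and silhouette likelihood, and `sigma` is a hypothetical noise scale. The uniform-weight fallback when all likelihoods are zero is our own safeguard.

```python
import random


def propagate_and_weigh(selected, likelihood, sigma=1.0, rng=random):
    """Diffuse selected samples with Gaussian noise, then weight and accumulate."""
    # Prediction: no velocity or acceleration model, only additive noise.
    particles = [[x + rng.gauss(0.0, sigma) for x in s] for s in selected]
    # Measurement: evaluate the (unnormalized) likelihood of each particle.
    raw = [likelihood(p) for p in particles]
    total = sum(raw)
    n = len(particles)
    if total > 0.0:
        weights = [w / total for w in raw]
    else:
        weights = [1.0 / n] * n  # fallback when every likelihood is zero
    # Cumulative probabilities used by the next selection stage.
    cum, running = [], 0.0
    for w in weights:
        running += w
        cum.append(running)
    return particles, weights, cum


def toy_likelihood(state):
    """Stand-in likelihood that peaks when the state is at the origin."""
    return 1.0 / (1.0 + sum(x * x for x in state))


selected = [[0.0, 0.0] for _ in range(4)]
particles, weights, cum = propagate_and_weigh(selected, toy_likelihood,
                                              sigma=0.5, rng=random.Random(1))
print(weights)
```

Because the number of particles needed grows with the state dimension, the noise scale and particle count have to be tuned together for the reduced-DOF model used in the experiments.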
If the tracking certainty is lower than the threshold we set, the tracker returns to the initial state (as described in Particle Filter Tracking). This means that the particles are distributed afresh for tracking. Thus, even if recent tracking results are not perfect, the hand can still be tracked.
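A minimal sketch of this recovery rule is given below. The text does not define how the tracking certainty is measured or how the initial distribution is chosen, so both are assumptions here: `certainty` is any scalar score supplied by the caller, and the particles are redistributed uniformly within hypothetical per-dimension `bounds`.

```python
import random


def reinitialize_if_lost(particles, certainty, threshold, bounds, rng=random):
    """Return (particles, was_reset); redistribute particles when certainty is low."""
    if certainty >= threshold:
        return particles, False
    # Tracking is considered lost: spread the particles over the allowed range.
    fresh = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in particles]
    return fresh, True


current = [[0.0, 0.0] for _ in range(3)]
bounds = [(-1.0, 1.0), (5.0, 6.0)]   # hypothetical range for each state dimension
print(reinitialize_if_lost(current, 0.9, 0.5, bounds)[1])  # prints False
print(reinitialize_if_lost(current, 0.1, 0.5, bounds, random.Random(0))[1])  # prints True
```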
It is possible to use all 27 DOF in tracking, but computing the result for each frame would take too long (many particles are required for tracking in such high dimensions). Thus, in our experiments we limit the movement of the hand in the input video to a small number of DOF. Then, we track the hand movement in this reduced dimension.
Figures 8 (a) and (b) show the results of tracking global hand motion together with finger articulation. The images are shown with projected corresponding three-dimensional hand models (green color). For these sequences, the three-dimensional hand model has 11 DOF.
As seen in Figure 8(a), at the start of the experiment the hand moves from left to right and then back from right to left. It can be seen that the global hand motions are tracked successfully in this sequence.
Figure 8(b) shows the result of hand tracking, when bending down the forefinger toward the palm. For this sequence, the range of global hand motion is restricted to a smaller region, but it still has 11 DOF. It can be observed that the proposed tracker successfully recovers these motions.
The camera is calibrated, so the intrinsic parameters are known before the system starts. The camera is then registered to the world coordinate system through ARTag (Augmented Reality Tag). Hence, with the projection matrix known, we can use the particle filter to register the three-dimensional hand model onto the input images. The model converges to the hand automatically by fitting based on silhouette and edge cues. Because the ARTag marker is placed on the guitar neck, the origin of the world coordinate system is defined on the guitar neck. In this way, we know the projection matrix of the camera at every frame, and so we can track the three-dimensional hand model using the proposed method.
In our method, we assume that the biggest skin-colored region found in the images is the hand blob. When we segment the skin area from the images, we remove small noise by size filtering: any region smaller than the threshold is removed. After that, we choose the biggest remaining area as the hand blob. Our assumption is therefore that the hand region has to be the largest area of skin found in the images. In this way, even when other regions with a color similar to skin appear in the scene, our method can deal with them. Similarly, if there is more than one hand or there are faces in the images (but the hand of interest is the biggest skin area), the hand can still be tracked.
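The largest-blob assumption can be sketched as a small selection step over an already labeled image, in which 0 marks the background and positive integers mark blobs. The function name and the `min_size` parameter are illustrative assumptions.

```python
from collections import Counter


def largest_blob_mask(labels, min_size=1):
    """Return a binary mask of the largest labeled blob, or None if none qualifies."""
    counts = Counter(v for row in labels for v in row if v != 0)
    if not counts:
        return None
    label, size = counts.most_common(1)[0]
    if size < min_size:
        return None  # even the largest region is treated as noise
    return [[1 if v == label else 0 for v in row] for row in labels]


labeled = [
    [1, 1, 0],
    [2, 0, 1],
]
print(largest_blob_mask(labeled))  # prints [[1, 1, 0], [0, 0, 1]]
```

In the example, blob 1 covers three pixels and blob 2 only one, so only blob 1 is kept as the hand candidate.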
Table: Error performance in terms of fingertip localization measured against manually labeled ground truth with different numbers of particles. Columns: number of particles; execution time (seconds per frame); mean distance error (pixels).
In this paper, we have developed a system that tracks the hand by using a model-based approach. We construct the three-dimensional hand model from a set of quadrics. We then use a quadratic chamfer distance function to calculate the edge likelihood, and online adaptive color learning to calculate the silhouette likelihood. Following this, a particle filter is used to track the hand. This implementation can be used to further improve hand tracking applications for guitarists. Although we believe that our system can produce accurate output, the current system is limited by finger occlusion and guitar neck occlusion, which sometimes occur when playing the guitar in real life. Another limitation is the high dimension of the state space (DOF): the number of particles required increases with the dimension of the state space. Therefore, some improvement or optimization should be considered to make the system faster. In the future, we intend to address these limitations.
This work is supported in part by a Grant-in-Aid for the Global Center of Excellence for High-Level Global Cooperation for Leading-Edge Platform on Access Spaces from the Ministry of Education, Culture, Sports, Science and Technology in Japan. The work presented in this paper is partially supported by CREST, JST (Research Area: Foundation of technology supporting the creation of digital media contents).
- Maki-Patola T, Laitinen J, Kanerva A, Takala T: Experiments with virtual reality instruments. Proceedings of the International Conference on New Interfaces for Musical Expression (NIME '05), May 2005, Vancouver, Canada 11-16.
- Liarokapis F: Augmented reality scenarios for guitar learning. Proceedings of the International Conference on Eurographics UK Theory and Practice of Computer Graphics, 2005, Canterbury, UK 163-170.
- Motokawa Y, Saito H: Support system for guitar playing using augmented reality display. Proceedings of the IEEE and ACM International Symposium on Mixed and Augmented Reality (ISMAR '06), October 2006, Santa Barbara, Calif, USA 243-244.
- Kerdvibulvech C, Saito H: Vision-based detection of guitar players' fingertips without markers. Proceedings of the IEEE International Conference on Computer Graphics, Imaging and Visualization (CGIV '07), August 2007, Bangkok, Thailand 419-428.
- Imai A, Shimada N, Shirai Y: Hand posture estimation in complex backgrounds by considering mis-match of model. Proceedings of the Asian Conference on Computer Vision (ACCV '07), November 2007, Tokyo, Japan 596-607.
- Iwai Y, Yagi Y, Yachida M: A system for 3D motion and position estimation of hand from monocular image sequence. Proceedings of the U.S.-Japan Graduate Student Forum in Robotics, November 1996, Osaka, Japan 12-15.
- Iwai Y, Yagi Y, Yachida M: Estimation of hand motion and position from monocular image sequence. Proceedings of the Asian Conference on Computer Vision (ACCV '95), December 1995, Singapore 230-234.
- Stenger B: Model-based hand tracking using a hierarchical Bayesian filter, Ph.D. thesis. University of Cambridge, Cambridge, UK; 2004.
- Stenger B, Thayananthan A, Torr PHS, Cipolla R: Model-based hand tracking using a hierarchical Bayesian filter. IEEE Transactions on Pattern Analysis and Machine Intelligence 2006, 28(9):1372-1384.
- Borgefors G: Hierarchical chamfer matching: a parametric edge matching algorithm. IEEE Transactions on Pattern Analysis and Machine Intelligence 1988, 10(6):849-865. doi:10.1109/34.9107
- Argyros AA, Lourakis MIA: Tracking skin-colored objects in real-time. Invited contribution to the "Cutting edge robotics book". Advanced Robotic Systems International 2005, 77-90.
- Argyros AA, Lourakis MIA: Tracking multiple colored blobs with a moving camera. Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition (CVPR '05), June 2005, San Diego, Calif, USA 2:
- Isard M, Blake A: Condensation—conditional density propagation for visual tracking. International Journal on Computer Vision 1998, 29(1):5-28. doi:10.1023/A:1008078328650
- De la Gorce M, Paragios N, Fleet DJ: Model-based hand tracking with texture, shading and self-occlusions. Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition (CVPR '08), June 2008, Anchorage, Alaska, USA 1-8.
- Hartley R, Zisserman A: Multiple View Geometry in Computer Vision. Cambridge University Press, Cambridge, UK; 2004.
- Erol A, Bebis G, Nicolescu M, Boyle RD, Twombly X: A review on vision-based full DOF hand motion estimation. Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, June 2005, San Diego, Calif, USA 3: 75.
- Lockton R, Fitzgibbon AW: Real-time gesture recognition using deterministic boosting. Proceedings of the British Machine Vision Conference (BMVC '02), September 2002, Cardiff, UK 2: 817-826.
- Jack K: Video Demystified: A Handbook for the Digital Engineer. 4th edition. Elsevier Science and Technology Books; 2004.
- Fiala M: ARTag, a fiducial marker system using digital techniques. Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition (CVPR '05), June 2005, San Diego, Calif, USA 590-596.
- Kerdvibulvech C, Saito H: Real-time guitar chord estimation by stereo cameras for supporting guitarists. Proceedings of the International Workshop on Advanced Image Technology (IWAIT '07), January 2007, Bangkok, Thailand 256-261.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.