This section describes the presented framework for hand gesture recognition using a computer vision-based scheme. Figure 1 shows a general schematic outline of the framework, which works as follows. As a preliminary preprocessing step, the hand region of interest (ROI) is segmented from the input depth-map image, and the basic morphological operations of dilation, erosion, opening, and closing are applied to refine the segmented ROI into a clean binary mask containing only the hand gesture. A set of representative shape descriptors, comprising boundary-based features (e.g., Fourier descriptors and curvature features) and region-based features (e.g., moment invariants and moment-based features), is then extracted from the hand silhouette (or contour). The resulting one-dimensional (1D) feature vector is fed into a one-vs-all SVM classifier for gesture classification. The SVM classifier is trained and evaluated on a real-world dataset [30] of static hand gestures acquired with a Creative Senz3D consumer depth camera. The subsections below detail each component of the implemented recognition system.
Hand region segmentation
As mentioned previously, the objective of this preliminary step is to accurately detect and segment the gesture region of interest (i.e., the hand region) from the input depth-map image. To accomplish this, the depth samples corresponding to the hand region are extracted from the depth map. In the present acquisition setting, the hand performing the gesture is the object nearest to the depth camera, so the segmentation algorithm can locate it using adaptive thresholding and hierarchical connected-component analysis on the 2D depth map. More specifically, the point closest to the depth camera in the acquired depth map is detected first. To this end, let Z={zi, i=1,…,n} be the set of 3D points captured by the depth sensor, and let D(Z)={d(zi), i=1,…,n} be the corresponding depth map. Then d(zmin) is the depth of the sample nearest to the camera.
Due to image noise and other disturbing factors (e.g., image artifacts), an isolated artifact might be incorrectly identified as the closest point. To tackle this problem, the presented system accepts a candidate closest point only if a sufficiently large number of depth samples exists in a neighborhood of a certain size (e.g., 5×5) around it; a predefined threshold on this count is used to decide among candidate closest points.
The cardinality of the set of depth samples around each candidate is compared with this threshold. If it does not exceed the threshold, the next candidate closest point is checked, and the verification is repeated until the threshold is exceeded. Once the closest point zmin is found, the set of all points whose depth is within a threshold τ1 of d(zmin) and whose distance from zmin is smaller than a second threshold τ2 is computed as follows,
$$ \Omega=\{z_{i} | d(z_{i})-d(z_{min})<\tau_{1}\wedge\|z_{i}-z_{min}\|<\tau_{2}\} $$
(1)
where ∧ denotes the logical AND operator. It should be emphasized that proper values of the above two thresholds are often configured manually based on the size of the hands in the dataset images. In our experiments, we varied these thresholds and empirically found the "optimal" values of τ1 and τ2 to be 10 and 40, respectively. Furthermore, to prevent the detection of extremely small non-hand objects and to avoid confusing isolated artifacts with the hand, two additional filters are applied to Ω. First, an adaptive morphological open-close filter is iteratively applied to the resulting binary image to remove very small objects while selectively retaining objects of relatively large size or characteristic shape. Such a filter is realized by a cascade of erosion and dilation operations with locally adaptive structuring elements that preserve the geometrical features of the image objects.
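To make the procedure concrete, the following sketch (Python with NumPy and OpenCV, not the original implementation) illustrates the closest-point search with neighborhood validation, the thresholding of Eq. (1), and the morphological open-close refinement. The neighborhood count, the use of an image-plane distance for τ2, and all constants are illustrative assumptions.

```python
import numpy as np
import cv2

TAU_1 = 10           # depth tolerance around the closest point (Eq. 1)
TAU_2 = 40           # spatial radius around the closest point (Eq. 1)
MIN_NEIGHBOURS = 15  # minimum valid samples in a 5x5 window around a candidate (assumed)

def segment_hand(depth_map):
    depth = depth_map.astype(np.float32)
    valid = depth > 0
    # Count the valid depth samples in a 5x5 neighbourhood of every pixel.
    support = cv2.boxFilter(valid.astype(np.float32), -1, (5, 5), normalize=False)
    # A candidate closest point must be supported by enough neighbours,
    # which rejects isolated artefacts.
    candidates = np.where(valid & (support >= MIN_NEIGHBOURS), depth, np.inf)
    y_min, x_min = np.unravel_index(np.argmin(candidates), candidates.shape)
    d_min = depth[y_min, x_min]

    # Eq. (1): keep samples close to z_min both in depth and (here) in the image plane.
    ys, xs = np.indices(depth.shape)
    close_in_depth = (depth - d_min) < TAU_1
    close_in_plane = np.hypot(ys - y_min, xs - x_min) < TAU_2
    omega = (valid & close_in_depth & close_in_plane).astype(np.uint8) * 255

    # Morphological open-close refinement: remove small blobs, fill small holes.
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (5, 5))
    omega = cv2.morphologyEx(omega, cv2.MORPH_OPEN, kernel)
    omega = cv2.morphologyEx(omega, cv2.MORPH_CLOSE, kernel)
    return omega
```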
In the same refinement stage, a second, size-adaptive filter is applied to remove isolated components (i.e., amorphous objects) that do not meet a minimum pixel count. More specifically, after this filter is applied, very small objects whose size is below a predefined threshold (e.g., 5% of the image size) are eliminated from the binary hand image. Once almost all unwanted components and isolated artifacts have been filtered out, an improved Canny edge detection algorithm [31] is applied to extract high-contrast hand-region boundaries; in this variant, the Gaussian filter is replaced by a self-adaptive filter, and a morphological operator refines the detected edge points to obtain a single-pixel-wide edge. Figure 2 shows an example of applying the hand segmentation scheme to extract the hand ROI from an input depth image. As the figure shows, the scheme can accurately segment the hand from the scene objects and from the other body parts. Furthermore, the largest circle inscribed in the hand contour can be used to reliably locate the palm centroid, and the fingertips correspond closely to the start and end contour points of the convexity defects, so they can be detected from the hand contour and its convexity defects [32, 33].
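The palm-centroid and fingertip localization could be sketched as follows; the distance-transform palm localization and the defect-depth criterion are assumptions made for illustration, not the exact procedure of [32, 33].

```python
import numpy as np
import cv2

def analyse_hand(mask):
    # mask: binary (0/255) hand image produced by the segmentation step.
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_NONE)
    contour = max(contours, key=cv2.contourArea)

    # Palm centroid: centre of the largest circle inscribed in the hand region,
    # i.e. the location of the maximum of the distance transform.
    dist = cv2.distanceTransform(mask, cv2.DIST_L2, 5)
    _, palm_radius, _, palm_centre = cv2.minMaxLoc(dist)

    # Fingertip candidates: start/end contour points of sufficiently deep
    # convexity defects (the 0.5 factor is an illustrative assumption).
    hull = cv2.convexHull(contour, returnPoints=False)
    defects = cv2.convexityDefects(contour, hull)
    fingertips = []
    if defects is not None:
        for start, end, _far, depth in defects[:, 0]:
            if depth / 256.0 > palm_radius * 0.5:
                fingertips.append(tuple(contour[start][0]))
                fingertips.append(tuple(contour[end][0]))
    return contour, palm_centre, palm_radius, fingertips
```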
Feature extraction
Broadly speaking, feature extraction is a fundamental, yet often the most challenging and time-consuming, stage of any pattern recognition approach. A wide variety of informative and distinctive features can be employed, alone or in combination, for hand gesture recognition. In this work, shape features that describe the structure of the segmented hand silhouettes are employed to represent the different static hand poses. As robust shape-based features, Fourier descriptors, invariant moments, and curvature features are used, as they allow a much better characterization of static hand gestures. The following subsections describe in more detail how these features are extracted from the segmented hand regions.
Fourier descriptors
Fourier descriptors (FDs) of a given 2D shape rely on the fact that any 2D shape boundary (i.e., contour) can be represented by a periodic complex sequence zi=xi+jyi, where xi, yi, i=0,1,…,n−1 denote the x and y coordinates of the contour points. The discrete Fourier transform coefficients are then determined as follows,
$$ a_{k}=\frac{1}{n}\sum_{i=0}^{n-1}z_{i}\,\exp\left(-\frac{j2\pi ik}{n}\right),\; k=0,1,\ldots,n-1 $$
(2)
The adapted Fourier descriptors are obtained from the coefficients ak, for instance, by discarding the first two coefficients a0 and a1 and dividing the magnitudes of the remaining coefficients by |a1|, as follows,
$$ b_{k}=\frac{|a_{k+2}|}{|a_{1}|}, k=0,1,\ldots,n-3 $$
(3)
It is straightforward to see that this choice of coefficients will not only guarantee that the generated descriptors are invariant with respect to shape translation, scale and rotation, but also ensure that they are independent of the choice of the starting point on the shape border.
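A minimal sketch of Eqs. (2)-(3) in Python/NumPy is given below; the number of retained descriptors is an illustrative choice.

```python
import numpy as np

def fourier_descriptors(contour_xy, n_descriptors=20):
    # contour_xy: (n, 2) array of boundary points (x_i, y_i).
    z = contour_xy[:, 0] + 1j * contour_xy[:, 1]
    a = np.fft.fft(z) / len(z)            # Eq. (2): DFT coefficients a_k
    b = np.abs(a[2:]) / np.abs(a[1])      # Eq. (3): drop a_0, a_1 and scale by |a_1|
    return b[:n_descriptors]
```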
Shape moments
Invariant moments, which are essentially a set of nonlinear moment functions, are often used in pattern recognition applications as regional descriptors that capture the global geometric characteristics of an object shape within a given frame/image. This set of nonlinear functions can easily be derived from the regular moments. The spatial moments of order (p+q) of a given object shape f(x,y) are defined as follows:
$$ M_{pq}=\int\!\!\int_{\mathbf{C}} x^{p} y^{q}f(x,y)dxdy $$
(4)
It is easy to see that the spatial moments in Eq. (4) are, in general, not invariant under the main geometrical transformations such as translation, rotation, and scale change. To obtain invariance under translation, these functions can be computed with respect to the centroid of the object shape as follows,
$$ \mu_{pq}=\int\!\!\int(x-\bar{x})^{p}(y-\bar{y})^{q} f(x,y)dxdy $$
(5)
where \((\bar{x}, \bar{y})\) is the centroid, i.e., the center of gravity of the shape. The normalized central moments are then given by:
$$ \eta_{pq}=\frac{\mu_{pq}}{\mu_{00}^{\alpha}},\quad \alpha=\frac{p+q}{2}+1 $$
(6)
On the basis of the above normalized central moments, a simple set of nonlinear moment functions [34] can be defined, which is invariant to translation, rotation and scale changes:
$$ \begin{aligned}
\phi_{1}&=\eta_{20}+\eta_{02}\\
\phi_{2}&=(\eta_{20}-\eta_{02})^{2}+(2\eta_{11})^{2}\\
\phi_{3}&=(\eta_{30}-3\eta_{12})^{2}+(3\eta_{21}-\eta_{03})^{2}\\
\phi_{4}&=(\eta_{30}+\eta_{12})^{2}+(\eta_{03}+\eta_{21})^{2}\\
\phi_{5}&=(\eta_{30}-3\eta_{12})(\eta_{30}+\eta_{12})\left[(\eta_{30}+\eta_{12})^{2}-3(\eta_{03}+\eta_{21})^{2}\right]\\
&\quad+(3\eta_{21}-\eta_{03})(\eta_{21}+\eta_{03})\left[3(\eta_{30}+\eta_{12})^{2}-(\eta_{03}+\eta_{21})^{2}\right]\\
\phi_{6}&=(\eta_{20}-\eta_{02})\left[(\eta_{30}+\eta_{12})^{2}-(\eta_{03}+\eta_{21})^{2}\right]+4\eta_{11}(\eta_{30}+\eta_{12})(\eta_{03}+\eta_{21})\\
\phi_{7}&=(3\eta_{21}-\eta_{03})(\eta_{30}+\eta_{12})\left[(\eta_{30}+\eta_{12})^{2}-3(\eta_{03}+\eta_{21})^{2}\right]\\
&\quad-(\eta_{30}-3\eta_{12})(\eta_{03}+\eta_{21})\left[3(\eta_{30}+\eta_{12})^{2}-(\eta_{03}+\eta_{21})^{2}\right]
\end{aligned} $$
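These invariants can be computed directly from the binary hand mask; a minimal sketch using OpenCV, which implements the same set of seven invariants in cv2.HuMoments, is given below.

```python
import cv2

def hu_moment_features(mask):
    m = cv2.moments(mask, binaryImage=True)  # spatial, central and normalized moments
    return cv2.HuMoments(m).flatten()        # the seven invariants phi_1 ... phi_7
```

Since the invariants typically span several orders of magnitude, a logarithmic rescaling is often applied in practice before normalization.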
Curvature features
In a way quite analogous to the Mel Frequency Cepstral Coefficients (MFCCs) widely used in speech and speaker recognition, a concise set of shape descriptors can be computed from the cepstrum of the shape curvature, as follows. First, the shape curvature is computed along the hand contour from its Freeman chain-code representation [35]. Then, the cepstrum of the curvature signal is obtained by means of the discrete Fourier transform. Finally, a certain number of the largest coefficients is selected and appended to the feature vector. Numerous experiments have shown that a small set of cepstrum coefficients suffices to reconstruct the curvature function, with a compression ratio of up to 10:1 with respect to the original signal length [36].
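An illustrative sketch of this descriptor is given below; it uses a turning-angle curvature estimate followed by a DFT and selection of the largest coefficients, and is not the exact implementation of [35, 36].

```python
import numpy as np

def curvature_descriptors(contour_xy, n_coeffs=16):
    # Turning angle between successive contour segments as a curvature estimate
    # (a chain-code-style approximation on the closed contour).
    d = np.diff(contour_xy, axis=0, append=contour_xy[:1])
    angles = np.arctan2(d[:, 1], d[:, 0])
    curvature = np.angle(np.exp(1j * np.diff(angles, append=angles[:1])))

    spectrum = np.fft.rfft(curvature)
    # Keep the n_coeffs largest-magnitude coefficients as the descriptor.
    idx = np.argsort(np.abs(spectrum))[::-1][:n_coeffs]
    return np.abs(spectrum[np.sort(idx)])
```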
Moment-based features
In addition to the previous features, a set of additional features derived from the central moments can be extracted. The analogy between image moments and mechanical moments provides a useful interpretation of the second-order central moments μ11, μ02, and μ20: they form the components of the inertia tensor of the object's rotation about its center of gravity:
$$ \mathcal{J}=\begin{bmatrix} \mu_{20} & -\mu_{11}\\ -\mu_{11} & \mu_{02} \end{bmatrix} $$
(7)
Based on the inertia tensor analogy, a set of invariant features can be extracted from the second-order central moments. For instance, the principal inertia axes are obtained as the square roots of the eigenvalues of the inertia tensor,
$$ \lambda_{1,2}=\sqrt{\frac{1}{2}\left((\mu_{20}+\mu_{02})\pm \left((\mu_{20}-\mu_{02})^{2}+4\mu_{11}^{2}\right)^{1/2}\right)} $$
(8)
where λ1 and λ2 correspond to the semi-major and semi-minor axes, respectively, of the ellipse that can be intuitively considered as a fairly good approximation of the hand object. Furthermore, the orientation of the hand object defined as the tilt angle between the x-axis and the axis around which the hand object can be rotated with minimal inertia can also be obtained as follows:
$$ \varphi=\frac{1}{2} \arctan \left(\frac{2\mu_{11}}{\mu_{20}-\mu_{02}}\right) $$
(9)
where φ is the angle between the x-axis and the semi-major axis. The principal value of the arctangent is expressed in radians, in the interval \(\left[-\frac{\pi}{2},\frac{\pi}{2}\right]\). A variety of other shape features, such as the eccentricity ε and the roundness κ, also convey shape identification information or provide some perceptual representation of the hand shape. The roundness (or circularity) κ relates the perimeter of an object to its area: it is obtained by dividing the square of the perimeter ℓ by 4π times the area A of the object. From a geometric perspective, of all shapes the circle encloses the maximum area for a given perimeter, so this ratio attains its minimum for a circle. Thus, κ is given as follows,
$$ \kappa = \frac{\ell^{2}}{4\pi A} $$
(10)
Notably κ=1 for a circle, whereas for other objects κ>1. The eccentricity ε can readily be calculated from the second-order central moments as follows:
$$ \varepsilon = \frac{(\mu_{20}-\mu_{02})^{2}+4\mu_{11}^{2}}{(\mu_{20}+\mu_{02})^{2}} $$
(11)
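The moment-based features of Eqs. (7)-(11) can be gathered from OpenCV's central moments as in the following sketch; this is a hedged illustration rather than the authors' code, and the use of arctan2 and of contour-based area/perimeter are implementation choices.

```python
import numpy as np
import cv2

def moment_based_features(mask, contour):
    m = cv2.moments(mask, binaryImage=True)
    mu20, mu02, mu11 = m["mu20"], m["mu02"], m["mu11"]

    # Eq. (8): semi-major/semi-minor axes as square roots of the eigenvalues
    # of the inertia tensor of Eq. (7).
    common = np.sqrt((mu20 - mu02) ** 2 + 4.0 * mu11 ** 2)
    lam1 = np.sqrt(0.5 * ((mu20 + mu02) + common))
    lam2 = np.sqrt(0.5 * ((mu20 + mu02) - common))

    # Eq. (9): orientation of the major axis (arctan2 avoids division by zero).
    phi = 0.5 * np.arctan2(2.0 * mu11, mu20 - mu02)

    # Eqs. (10)-(11): roundness and eccentricity.
    area = cv2.contourArea(contour)
    perimeter = cv2.arcLength(contour, True)
    kappa = perimeter ** 2 / (4.0 * np.pi * area)
    epsilon = ((mu20 - mu02) ** 2 + 4.0 * mu11 ** 2) / (mu20 + mu02) ** 2

    return np.array([lam1, lam2, phi, kappa, epsilon])
```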
All the computed descriptor vectors are concatenated into a single feature vector for each hand posture (see Fig. 3). The resulting feature vectors are normalized to the range [0,1] by a linear transformation so that all features contribute on a comparable scale to the learning model. The normalized feature vectors, which encode much of the shape information, are then fed into the classification stage for supervised gesture recognition.
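A minimal sketch of this step, assuming scikit-learn's MinMaxScaler for the linear rescaling to [0,1] (fitted on the training vectors only):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

def build_feature_vector(fourier_desc, hu_desc, curvature_desc, moment_desc):
    # Concatenate the individual descriptors into one 1D feature vector.
    return np.concatenate([fourier_desc, hu_desc, curvature_desc, moment_desc])

scaler = MinMaxScaler()                       # linear rescaling of each feature to [0, 1]
# X_train = scaler.fit_transform(X_train)     # fit on the training vectors only
# X_test  = scaler.transform(X_test)
```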
Gesture classification
This section explains the classification module, the last stage of the proposed hand gesture recognition framework, which aims at classifying static gesture images based on the shape feature descriptors described in the previous section. The gesture recognition task is naturally modeled as a multi-class classification problem with one class per gesture category, and the goal is to assign a class label to a given gesture image. To this end, a set of one-vs-all nonlinear SVMs (Support Vector Machines) with RBF (Radial Basis Function) kernels is trained, one for each gesture category. Each SVM learns to separate the images of its gesture category from all others. The final decision on the output category for a given image is made according to the classifier with the highest score.
Broadly speaking, a wide variety of machine learning (ML) algorithms could be used to learn a model that recognizes hand gesture patterns robustly and effectively. In this work, we propose to apply an ensemble of SVMs to hand gesture classification as a first step towards integration into a full gesture recognition framework. The motivation behind using SVMs is their strong generalization capability and their well-deserved reputation for high accuracy. Moreover, thanks to the kernel trick, SVMs can operate in very high-dimensional feature spaces with a computational overhead that is largely independent of the dimensionality of that space, and they require only a small number of hyperparameters to be tuned. All this makes SVMs very promising and highly competitive with other classification techniques in pattern recognition.
SVMs were originally designed [37] to handle dichotomic (two-class) problems in a higher-dimensional feature space, where an optimal maximum-margin separating hyperplane is constructed. To determine the maximum margin, two parallel hyperplanes are built, one on each side of the separating hyperplane, as illustrated in Fig. 4. The primary aim of the SVM is then to find the separating hyperplane that maximizes the distance between these two parallel hyperplanes. Intuitively, a good separation is achieved by the hyperplane with the largest distance to the closest training points of either class (the so-called functional margin), since in general the larger the margin, the lower the generalization error of the classifier. Formally, let \( \mathcal{D}=\{(\mathbf{x}_i,y_i)\;|\;\mathbf{x}_i\in\mathbb{R}^{d},\,y_i\in\{-1,+1\}\} \) be a training set. Cortes and Vapnik [37] argued that this problem is best addressed by allowing some data points to violate the margin constraints. These violations are elegantly formulated through positive slack variables ξi and a penalty parameter C≥0 that penalizes margin violations. The optimal maximum-margin separating hyperplane is then obtained by solving the following quadratic programming (QP) problem:
$$ \min_{\boldsymbol{\beta},\beta_{0}}\quad\frac{1}{2}\|\boldsymbol{\beta}\|^{2}+C\sum_{i}\xi_{i} $$
(12)
subject to \(y_{i}(\langle\boldsymbol{x}_{i},\boldsymbol{\beta}\rangle+\beta_{0})\geq 1-\xi_{i}\) and \(\xi_{i}\geq 0\) for all \(i\).
Geometrically, β∈ℝd is the normal vector of the separating hyperplane. The offset parameter β0 is included in the optimization so that the hyperplane is not forced through the origin, which would restrict the practicality of the solution. From a computational point of view, it is usually most convenient to solve the SVM in its dual formulation, by first forming the Lagrangian associated with the problem and then optimizing over the Lagrange multipliers αi instead of the primal variables. The decision function (also known as the hypothesis) of the classifier is described by the weight vector \(\boldsymbol{\beta}=\sum_{i}\alpha_{i}y_{i}\boldsymbol{x}_{i},\ 0\leq \alpha_{i} \leq C\). The training instances xi with non-zero αi lie on or inside the margin of their class; these points are called support vectors, since they alone define or "support" the maximum-margin hyperplane and are closest to it.
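This relation between the dual coefficients and the weight vector can be verified numerically; the toy example below (a hypothetical two-blob dataset, not the gesture data) recovers the weight vector of a linear SVM from its support vectors with scikit-learn, which stores the products αi yi in dual_coef_.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.datasets import make_blobs

X, y = make_blobs(n_samples=100, centers=2, random_state=0)
clf = SVC(kernel="linear", C=1.0).fit(X, y)

beta = clf.dual_coef_ @ clf.support_vectors_   # beta = sum_i alpha_i y_i x_i
assert np.allclose(beta, clf.coef_)            # matches the primal weight vector
```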
In the presented framework, several hand gesture classes are defined, and one-vs-all SVMs with an RBF kernel are trained independently to learn the pattern of each class, based on the shape features extracted from the gesture images in the training set. Among the most popular and powerful kernels, we have selected the one most relevant to our work, the RBF kernel (also referred to as the Gaussian kernel), given as follows:
$$ \kappa(\boldsymbol{x},\boldsymbol{y})=\exp\left(-\|\boldsymbol{x}-\boldsymbol{y}\|^{2}/\left(2\sigma^{2}\right)\right) $$
(13)
where σ is the kernel width, which can be seen as a tuning or smoothing parameter. It is pertinent to mention here that SVMs with RBF kernels have emerged as a flexible and powerful tool for building predictive models that can handle non-linearly separable data, by mapping the input feature space to a higher-dimensional feature space (the kernel space) in which the data may become more easily separable or better arranged. More specifically, a linear boundary is constructed in this higher-dimensional space to perform the separation; when brought back to the input space, this boundary becomes non-linear and can effectively separate data of different classes. Another point worth mentioning is that, when using the RBF kernel, the parameters C and γ (with γ = 1/(2σ2)) need to be properly tuned. Therefore, several tests were performed in order to establish optimum values for these two parameters. Notably, improper selection of these parameters makes the classification model prone to either overfitting or underfitting, which in turn leads to poor generalization and poor classification performance.
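A hedged sketch of the resulting one-vs-all RBF-SVM classifier with a grid search over C and γ is shown below; the parameter ranges and the use of scikit-learn are illustrative assumptions, not the settings of the reported experiments.

```python
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
from sklearn.multiclass import OneVsRestClassifier

# Illustrative search grid over the penalty C and the kernel width gamma = 1/(2*sigma^2).
param_grid = {"estimator__C": [0.1, 1, 10, 100],
              "estimator__gamma": [0.01, 0.1, 1, 10]}

ovr_svm = OneVsRestClassifier(SVC(kernel="rbf"))
search = GridSearchCV(ovr_svm, param_grid, cv=5)
# search.fit(X_train, y_train)       # X_train: normalized shape feature vectors
# y_pred = search.predict(X_test)    # label of the highest-scoring binary SVM
```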