Improved Viola-Jones face detection algorithm based on HoloLens
EURASIP Journal on Image and Video Processing volume 2019, Article number: 41 (2019)
Abstract
Face detection on Microsoft HoloLens can currently be achieved only by remotely calling a face detection interface. This approach is constrained by the network, so detection is slow and fails to meet real-time requirements. This paper proposes an improved Viola-Jones face detection algorithm for HoloLens. It upgrades the classical Viola-Jones algorithm with an expanded set of Haar-like rectangle features to improve detection performance, and accelerates detection with two-dimensional convolution separation and image re-sampling. The detection accuracy of the improved algorithm is on average 12% higher than that of the existing face detection interface, and its detection speed is about four times as high. Moreover, the HoloLens depth camera enables 3D face detection and location, and the device's gaze, voice, and gesture interaction free the hands, making human-computer interaction easier and less burdensome. HoloLens equipped with the real-time video face detection algorithm detailed in this paper can be applied in fields such as social contact, public security, and business management.
1 Introduction
Face detection technology uses a computer, rather than a person, to extract effective detection information from a face and identify whose it is [1, 2]. A face is difficult to forge, cannot be lost, is always carried by its owner, and is easy to use. Face-based authentication therefore overcomes the shortcomings of traditional identity authentication methods and provides a more secure and reliable mechanism. As a basic application of artificial intelligence, face recognition has been widely used in public safety and enterprise management. It has reduced industry costs to a certain extent, improved service efficiency and management, and been widely recognized by people from all walks of life [3].
As human-computer interaction has developed, users have increasingly come to accept interaction modes that are more practical, more humane, and more intelligent. Applying human-computer interaction technology to graphics and images aims at a better user experience. Face detection can reduce the number of hands-on operations and offers a way of interacting that differs from touching a screen with a finger, truly freeing the hands and making human-computer interaction more convenient and less burdensome [4]. However, the mainstream interactive devices used with face detection are still the traditional keyboard, mouse, and touch screen, and these devices have many limitations. The keyboard can only enter text, and the mouse can only perform simple operations such as moving the cursor, clicking, and controlling orientation; neither can exchange higher-level, semantically richer information with the user. The touch screen lets the fingers interact directly with the interface, and its WYSIWYG interaction mode is far more efficient than the traditional mouse and keyboard. Even so, finger touch still has limitations, such as cumbersome operation and the hand obstructing the user's view.
On the other hand, face detection on mobile devices is often slow and struggles to meet real-time requirements. Zhenyu et al. [5] proposed a real-time face detection method based on optical flow estimation for video face detection on mobile devices. Its accuracy is close to that of a convolutional neural network (CNN) [6] method and its speed basically meets the requirements of real-time monitoring. However, although it can be applied to medium- and low-end devices, it cannot maintain that performance at high resolution. Approximately 90% of the face detection devices on the market acquire images at the front end and run face detection on a back-end server. These devices rely heavily on the network, which slows detection down and makes them unsuitable wherever the network is poor, worsening the application effect [7]. Yu's [8] face terminal identity recognition simulation technology for mobile device network security does not specify which kinds of mobile device it suits, nor was it verified on Microsoft HoloLens. Face detection on Microsoft HoloLens can currently be achieved only by remotely calling a face detection application programming interface (API), which is constrained by the network, so detection is slow and fails to meet real-time requirements. This paper introduces a face detection algorithm for the Microsoft HoloLens holographic glasses. The algorithm upgrades the classical Viola-Jones algorithm [9,10,11,12] with an expanded set of Haar-like [13] rectangle features and is accelerated with two-dimensional convolution separation and image re-sampling. In addition, the HoloLens depth camera is used for 3D face detection and location, so that face detection runs locally on HoloLens. The experimental results show that the algorithm outperforms the existing Microsoft Azure Face API [14] in both detection accuracy and speed.
HoloLens supplements and superimposes real and virtual information, creating an environment that is half real and half virtual. HoloLens equipped with face detection not only improves the user experience but also contributes to a smarter and easier life, and it can be applied in fields such as social contact, public security, and business management.
2 HoloLens overview
Microsoft HoloLens was developed as the first untethered, computer-controlled holographic smart glasses. It enables interaction between the user and digital data, and between the user and holographic images placed in the real world [15, 16]. Figure 1 shows the appearance of HoloLens.
As a mixed reality (MR) device, HoloLens provides three distinctive human-computer interaction modes: gaze, gesture, and voice [17,18,19,20], GGV for short. Working together, the three modes let the user operate freely in the MR environment. Figure 2 illustrates GGV.
HoloLens mixes holographic scenes with the real world, zooming virtual objects in and out just as in the real world, so that the user perceives the holographic scenes as part of the real world. Figure 3 shows the HoloLens hardware in detail. HoloLens is equipped with an inertial measurement unit, an ambient light sensor, and four environment-sensing cameras along with a depth-sensing camera. Together they scan and map the current space in real time, identifying planes, walls, desks, and other large objects. HoloLens also has a self-developed holographic processing unit for real-time scanning, massive data processing, tracking, and spatial anchoring.
Face detection on HoloLens can only be performed by remotely calling the Microsoft Azure Face API, which is slow and limited in detection accuracy and is therefore inconvenient in practical applications.
3 Classical Viola-Jones face detection method
The classical Viola-Jones algorithm combines shape and edge, facial feature, template matching, and other statistical models with AdaBoost. First, Haar-like rectangle features are used to describe facial features, and feature evaluation is accelerated with the integral image [21,22,23,24,25,26]. Then the AdaBoost algorithm [27,28,29] is used to build weak and strong classifiers, which are chained into a screening cascade classifier [30, 31] that eliminates non-face images and improves accuracy.
3.1 Haar-like rectangle feature
As shown in Fig. 4, Haar features are classified into three categories: edge features (two adjacent rectangles), line features (three adjacent rectangles), and center and diagonal features (four adjacent rectangles). These are combined into feature templates.
Each feature template contains white and black rectangles, and the feature value of the template is defined as the sum of the pixels in the white rectangles minus the sum of the pixels in the black rectangles. The Haar feature value therefore reflects changes in the grayscale of the image. Some facial characteristics can be described simply by rectangle features: as shown in Fig. 5, the eyes are darker than the cheeks, the sides of the nose are darker than the bridge of the nose, and the mouth is darker than its surroundings. Judging by features is more effective than judging by individual pixels, and it is also faster. However, rectangle features are sensitive only to simple graphical structures such as edges and line segments, so they can describe only structures with specific orientations (horizontal, vertical, diagonal).
3.2 Integral image
Computing a Haar-like feature requires summing all the pixels in a rectangular region. The Viola-Jones face detection algorithm does this with the integral image. The integral image value at any point equals the sum of all pixels above and to the left of that point, the point itself included. As shown in Fig. 6, once the integral image has been built in a single traversal of the image, the pixel sum of any region can be read from it, which greatly speeds up the computation of feature values. Let SAT(x, y) be the integral image value at point (x, y) and I(x', y') be the gray value of pixel (x', y') in the original image; then:
$$ SAT(x, y) = \sum_{x' \le x,\; y' \le y} I(x', y') \tag{1} $$
Traversing the image from left to right and from top to bottom gives the following recursion, with SAT(-1, y) = SAT(x, -1) = 0:
$$ SAT(x, y) = SAT(x, y-1) + SAT(x-1, y) - SAT(x-1, y-1) + I(x, y) \tag{2} $$
In the same way, the pixel sum of any rectangular region in the image can be obtained. As shown in Fig. 7, let the upper left corner of the rectangle be (x, y) and its width and height be w and h, so that the rectangle is denoted (x, y, w, h). Its pixel sum follows from four integral image values:
$$ Sum(x, y, w, h) = SAT(x+w-1, y+h-1) - SAT(x-1, y+h-1) - SAT(x+w-1, y-1) + SAT(x-1, y-1) \tag{3} $$
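The integral image idea described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the function names `integral_image`, `rect_sum`, and `edge_feature` are hypothetical, the image is a plain list of rows, and the edge feature shown takes the left half of the window as the white rectangle and the right half as the black one.

```python
def integral_image(img):
    """Summed-area table: sat[y][x] is the sum of img[y'][x'] for x' <= x, y' <= y."""
    h, w = len(img), len(img[0])
    sat = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            up = sat[y - 1][x] if y > 0 else 0
            left = sat[y][x - 1] if x > 0 else 0
            diag = sat[y - 1][x - 1] if x > 0 and y > 0 else 0
            # One pass, left to right and top to bottom.
            sat[y][x] = up + left - diag + img[y][x]
    return sat


def rect_sum(sat, x, y, w, h):
    """Pixel sum of the rectangle with upper left corner (x, y), width w, height h."""
    def at(xx, yy):
        # Values outside the image (index -1) count as zero.
        return sat[yy][xx] if xx >= 0 and yy >= 0 else 0
    return (at(x + w - 1, y + h - 1) - at(x - 1, y + h - 1)
            - at(x + w - 1, y - 1) + at(x - 1, y - 1))


def edge_feature(sat, x, y, w, h):
    """Two-rectangle (edge) feature: white left half minus black right half."""
    half = w // 2
    return rect_sum(sat, x, y, half, h) - rect_sum(sat, x + half, y, half, h)


img = [[1, 2, 3, 4],
       [5, 6, 7, 8],
       [9, 10, 11, 12]]
sat = integral_image(img)
print(rect_sum(sat, 1, 1, 2, 2))      # 6 + 7 + 10 + 11 = 34
print(edge_feature(sat, 0, 0, 4, 3))  # 33 - 45 = -12
```

Whatever the rectangle size, `rect_sum` touches only four table entries, which is why feature evaluation becomes cheap once the table exists.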
3.3 AdaBoost algorithm
The AdaBoost (Ada: adaptive; Boost: boosting) algorithm performs feature selection and classifier training at the same time. It is an iterative algorithm whose core idea is to train different classifiers (weak classifiers) on the same training set and then assemble them into a stronger final classifier (strong classifier). The algorithm works by changing the data distribution: it sets the weight of each sample according to whether that sample was classified correctly in the current round and according to the accuracy of the previous overall classification. The data set with the modified weights is passed to the next classifier for training. Finally, the classifiers obtained in all rounds are fused into the final decision classifier. In this way AdaBoost discards unnecessary features and concentrates training on the key training data.
A weak classifier judges the input to be a face when the feature value of the input image exceeds a threshold, so training the optimal weak classifier amounts to finding the appropriate threshold for that classifier.
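The threshold search and the sample re-weighting described above can be illustrated with a minimal sketch. It assumes one common formulation of the boosting round used with Viola-Jones detectors (a threshold "stump" with a polarity, and correctly classified samples scaled by beta = err / (1 - err)); the function names and the toy data are hypothetical and are not taken from the paper.

```python
def predict(value, theta, polarity):
    """Weak classifier: 1 (face) when the feature value is on the 'face' side of theta."""
    return 1 if polarity * value > polarity * theta else 0


def train_stump(values, labels, weights):
    """Try every observed value as threshold and keep the lowest weighted error."""
    best = None
    for theta in sorted(set(values)):
        for polarity in (1, -1):
            err = sum(w for v, y, w in zip(values, labels, weights)
                      if predict(v, theta, polarity) != y)
            if best is None or err < best[0]:
                best = (err, theta, polarity)
    return best  # (weighted error, threshold, polarity)


def reweight(values, labels, weights, err, theta, polarity):
    """Shrink the weights of correctly classified samples, then normalise.

    Assumes 0 < err < 1; a perfect stump (err == 0) would need special handling.
    """
    beta = err / (1.0 - err)
    new = [w * (beta if predict(v, theta, polarity) == y else 1.0)
           for v, y, w in zip(values, labels, weights)]
    total = sum(new)
    return [w / total for w in new]


# Toy feature values for six samples; label 1 = face, 0 = non-face.
values = [1, 2, 3, 4, 5, 6]
labels = [0, 0, 1, 0, 1, 1]
weights = [1.0 / 6] * 6

err, theta, polarity = train_stump(values, labels, weights)
new_weights = reweight(values, labels, weights, err, theta, polarity)
print(theta, polarity)   # threshold 2 with polarity 1 misclassifies only value 4
print(new_weights[3])    # the misclassified sample now carries about half the weight
```

After one round the single misclassified sample dominates the distribution, so the next weak classifier is pushed toward a feature or threshold that handles it.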
In ordinary images, regions containing human faces occupy only a small part of the whole image. If every local region had to be evaluated against every feature, the computation would be very heavy and time-consuming. To save computing time, effort should be concentrated on the regions most likely to contain a face.
In the cascade classifier architecture, each stage contains a strong classifier. All rectangle features are divided into groups, and each group supplies the features used at one stage of the cascade. Each stage decides whether the input region is a human face; if it is not, the region is discarded immediately. Only regions judged to be possible faces are passed to the next stage, where more complex classifiers examine them further. The flowchart is shown in Fig. 8.
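The early-rejection behaviour of the cascade can be sketched as follows. This is a minimal illustration under assumptions: each stage is represented as a list of weak classifiers `(feature_index, theta, polarity, alpha)` plus a stage threshold, and all numbers are made up for the example rather than trained values.

```python
def cascade_predict(window_features, stages):
    """Return True only if the window passes every stage of the cascade."""
    for weak_classifiers, stage_threshold in stages:
        score = 0.0
        for index, theta, polarity, alpha in weak_classifiers:
            # A weak classifier votes with weight alpha when it fires.
            if polarity * window_features[index] > polarity * theta:
                score += alpha
        if score < stage_threshold:
            return False  # rejected here; later stages are never evaluated
    return True


# Hypothetical two-stage cascade: a cheap first stage, a stricter second stage.
stages = [
    ([(0, 5.0, 1, 1.0)], 0.5),
    ([(1, 2.0, -1, 0.7), (2, 0.0, 1, 0.6)], 1.0),
]

print(cascade_predict([6.0, 1.0, 3.0], stages))  # True: passes both stages
print(cascade_predict([4.0, 1.0, 3.0], stages))  # False: rejected at stage 1
print(cascade_predict([6.0, 3.0, 3.0], stages))  # False: rejected at stage 2
```

Because most windows in an ordinary image fail the first, cheapest stage, the expensive later stages run on only a small fraction of the windows.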
4 Algorithm improvement
4.1 Haar-like rectangle feature expansion
For a rectangle rotated by 45°, we define RSAT(x, y) as the integral image value at point (x, y) taken over the region bounded by the two 45° lines that run from (x, y) toward the upper left and the lower left. For the shaded part shown in Fig. 10, I(x', y') is the gray value of any pixel (x', y') of the original image inside that region.
According to the definition of integral image, then:
Similarly, by traversing from left to right and from top to bottom, the following recursive formulas can be obtained, so that the values of all pixels in the shadow region of the graph can be calculated at one time:
As shown in Fig. 11, the 45° rotated rectangle is denoted (x, y, w, h, 45°), where (x, y) is the coordinate of its topmost vertex and w and h are measured as the horizontal and vertical distances to the rightmost vertex. In the same way as for upright rectangles, the pixel sum Sum(x, y, w, h, 45°) of the 45° rotated rectangle can be derived and calculated. See Eq. (6).
With this formula, the pixel sum of a rotated rectangle can be obtained rapidly from the rotated integral image, so that feature values can be calculated for faces in different poses.
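A minimal sketch of a rotated integral image is given below. It follows one common formulation from the extended Haar-feature literature and may differ in indexing from the equations in this paper: RSAT(x, y) is taken as the sum of all pixels (x', y') with x' <= x - |y - y'|, and the rectangle (x, y, w, h, 45°) is taken to have its top vertex at (x, y), its right vertex at (x + w, y + w), and its left vertex at (x - h, y + h). Instead of a pixel-order recursion, the table is built with an ordinary prefix sum in the rotated coordinates u = x + y and v = x - y, which is an implementation choice made here for brevity. All function names are hypothetical.

```python
def rotated_integral(img):
    """Return rsat(x, y): sum of img[y'][x'] over all pixels with x' <= x - |y - y'|."""
    h, w = len(img), len(img[0])
    n = w + h - 1  # u = x + y in [0, n - 1]; v = x - y, shifted by h - 1 into [0, n - 1]
    grid = [[0] * n for _ in range(n)]
    for y in range(h):
        for x in range(w):
            grid[x + y][x - y + h - 1] += img[y][x]
    # Ordinary 2D prefix sum over (u, v); the 45-degree cone becomes a quadrant.
    for u in range(n):
        for v in range(n):
            grid[u][v] += ((grid[u - 1][v] if u else 0)
                           + (grid[u][v - 1] if v else 0)
                           - (grid[u - 1][v - 1] if u and v else 0))

    def rsat(x, y):
        u, v = x + y, x - y + h - 1
        if u < 0 or v < 0:
            return 0
        return grid[min(u, n - 1)][min(v, n - 1)]

    return rsat


def rotated_rect_sum(rsat, x, y, w, h):
    """Pixel sum of the 45-degree rectangle with top vertex (x, y) and sides w, h."""
    return (rsat(x + w, y + w) + rsat(x - h, y + h)
            - rsat(x, y) - rsat(x + w - h, y + w + h))


ones = [[1] * 5 for _ in range(5)]
rsat = rotated_integral(ones)
print(rsat(2, 2))                          # 9 pixels lie in the cone left of (2, 2)
print(rotated_rect_sum(rsat, 2, 0, 1, 1))  # 2 pixels: (2, 1) and (3, 1)
```

As with the upright case, a rotated rectangle sum needs only four table lookups once the table has been built.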
4.2 Acceleration by two-dimensional convolution separation
The algorithm in this paper selects candidate rectangles by checking whether their actual edge density exceeds a threshold. The image is first converted to grayscale. Second, its vertical and horizontal gradients are computed with separated two-dimensional convolution. Then the Sobel operator is used to measure edge density and locate the image edges, from which an integral image of edge density is built so that the edge density of any rectangle can be computed later. The procedure is detailed in Fig. 12.
For an image of size M × N and a filter of size P × Q, direct (non-separated) convolution takes MNPQ multiply-add operations. With separation, the first pass takes MNP operations and the second MNQ, that is, MN(P + Q) in total, so the operation count falls by a factor of PQ/(P + Q). For a 3 × 3 filter this factor is 9/6, so convolution separation makes this step 1.5 times faster.
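The operation count above can be checked with a small sketch. It assumes the horizontal Sobel kernel, which factors into the column vector [1, 2, 1] and the row vector [-1, 0, 1]; the helper names are hypothetical, and the "valid" output region is used so that no border handling is needed.

```python
def conv2d_direct(img, kernel):
    """Direct 2D correlation over the valid region: P*Q multiplies per output pixel."""
    kh, kw = len(kernel), len(kernel[0])
    h, w = len(img), len(img[0])
    return [[sum(kernel[j][i] * img[y + j][x + i]
                 for j in range(kh) for i in range(kw))
             for x in range(w - kw + 1)]
            for y in range(h - kh + 1)]


def conv2d_separable(img, col, row):
    """Same result in two 1D passes: P + Q multiplies per output pixel."""
    h, w = len(img), len(img[0])
    # Horizontal pass with the row vector.
    tmp = [[sum(row[i] * img[y][x + i] for i in range(len(row)))
            for x in range(w - len(row) + 1)]
           for y in range(h)]
    # Vertical pass with the column vector.
    return [[sum(col[j] * tmp[y + j][x] for j in range(len(col)))
             for x in range(len(tmp[0]))]
            for y in range(h - len(col) + 1)]


def separation_speedup(p, q):
    """Ratio of multiply-add counts, PQ / (P + Q)."""
    return p * q / (p + q)


sobel_x = [[-1, 0, 1],
           [-2, 0, 2],
           [-1, 0, 1]]
col, row = [1, 2, 1], [-1, 0, 1]  # sobel_x[j][i] == col[j] * row[i]

img = [[(x * x + 3 * y) % 11 for x in range(6)] for y in range(5)]
print(conv2d_direct(img, sobel_x) == conv2d_separable(img, col, row))  # True
print(separation_speedup(3, 3))                                        # 1.5
```

The gain grows with filter size: the same ratio gives 2.5 for a 5 × 5 filter.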
4.3 Acceleration by image re-sampling technique
Images captured through the PhotoCapture API of HoloLens measure 2048 × 1152 pixels, which produces a very large number of candidate feature rectangles and is impractical for detection. Bi-linear image re-sampling is therefore used to extract a low-resolution image from the high-resolution one, scaling the image down to 1/2 of its width and, with the aspect ratio preserved, 1/2 of its height. This leaves one quarter of the pixels, so detection is theoretically four times faster.
Consequently, when convolution separation and image re-sampling are used together with a 3 × 3 filter, the theoretical detection speed is 1.5 × 4 = 6 times the original.
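The re-sampling step and the resulting arithmetic can be sketched as below. The resize routine is a generic bilinear interpolation with pixel-center alignment, which is an assumption: the paper does not state which alignment convention its implementation uses. The function name `resize_bilinear` is hypothetical.

```python
import math


def resize_bilinear(img, new_w, new_h):
    """Bilinear re-sampling of a grayscale image given as a list of rows."""
    h, w = len(img), len(img[0])
    out = []
    for j in range(new_h):
        sy = (j + 0.5) * h / new_h - 0.5          # source row of this output pixel
        y0 = max(0, min(h - 1, math.floor(sy)))
        y1 = min(h - 1, y0 + 1)
        fy = min(max(sy - y0, 0.0), 1.0)
        row_out = []
        for i in range(new_w):
            sx = (i + 0.5) * w / new_w - 0.5      # source column of this output pixel
            x0 = max(0, min(w - 1, math.floor(sx)))
            x1 = min(w - 1, x0 + 1)
            fx = min(max(sx - x0, 0.0), 1.0)
            top = img[y0][x0] * (1 - fx) + img[y0][x1] * fx
            bottom = img[y1][x0] * (1 - fx) + img[y1][x1] * fx
            row_out.append(top * (1 - fy) + bottom * fy)
        out.append(row_out)
    return out


img = [[0, 2, 4, 6],
       [2, 4, 6, 8],
       [4, 6, 8, 10],
       [6, 8, 10, 12]]
small = resize_bilinear(img, 2, 2)
print(small)  # [[2.0, 6.0], [6.0, 10.0]]: each value is the mean of a 2 x 2 block

# Theoretical gains quoted in the text.
pixel_ratio = (2048 * 1152) / (1024 * 576)   # 4.0 after halving width and height
combined = pixel_ratio * (3 * 3) / (3 + 3)   # 6.0 with a separated 3 x 3 filter
print(pixel_ratio, combined)
```

With this alignment, halving the size makes every output pixel the average of a 2 × 2 block, so fine detail is smoothed rather than simply dropped.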
5 Realization of face detection based on HoloLens
The algorithm proceeds as follows. An image is captured by the HoloLens device and reduced with the image re-sampling method. The gray value of each pixel is calculated, and edge detection is accelerated with the convolution separation method. Each candidate rectangle is compared with the templates to decide whether it is rotated, and its integral image value is computed accordingly. The feature values are then summed following the Viola-Jones algorithm and compared with the threshold of the current stage. If the sum is below the threshold, the rectangle is judged to be a non-face and "false" is returned. If the sum exceeds the threshold, the summation and comparison are repeated at the next stage. Once the rectangle has passed every stage of the cascade classifier, it is judged to be a face and "true" is returned. The complete procedure is shown in Fig. 13.
Figure 14 shows a face in the scene detected by HoloLens running the algorithm described in this paper, together with the displayed detection information. HoloLens marks the detected face model with a 3D grid.
After this detection, the face is located as (x, y, width, height) in the 2D image. The center point of the face rectangle is then calculated, and a ray cast through it is tested for collision; the collision point gives the 3D coordinates of the face in the real world. In this way the 2D image coordinates are converted from camera coordinates to world coordinates in space, as detailed in Fig. 15.
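The geometry of this 2D-to-3D step can be sketched with a simple pinhole camera model. This is a simplified stand-in, not the HoloLens ray-casting pipeline: the intrinsics `fx`, `fy`, `cx`, `cy`, the depth value, and the camera-to-world matrix are all hypothetical numbers chosen for the example, the depth is assumed to come from the ray collision, and axis conventions (for instance image y pointing down) are ignored.

```python
def face_center(box):
    """Center of a face rectangle given as (x, y, width, height) in pixels."""
    x, y, w, h = box
    return (x + w / 2.0, y + h / 2.0)


def pixel_to_camera(u, v, depth, fx, fy, cx, cy):
    """Back-project pixel (u, v) at the given depth into camera coordinates."""
    return ((u - cx) / fx * depth, (v - cy) / fy * depth, depth)


def camera_to_world(point, matrix):
    """Apply a 4 x 4 camera-to-world transform (rotation and translation)."""
    x, y, z = point
    return tuple(matrix[r][0] * x + matrix[r][1] * y + matrix[r][2] * z + matrix[r][3]
                 for r in range(3))


# Assumed values for illustration only.
box = (900, 500, 200, 100)                 # detected face rectangle
fx = fy = 1000.0                           # focal lengths in pixels
cx, cy = 1024.0, 576.0                     # principal point of a 2048 x 1152 image
depth = 2.0                                # distance along the ray to the hit point
cam_to_world = [[1.0, 0.0, 0.0, 0.0],
                [0.0, 1.0, 0.0, 1.5],      # camera held 1.5 units above the origin
                [0.0, 0.0, 1.0, 0.0],
                [0.0, 0.0, 0.0, 1.0]]

u, v = face_center(box)
p_cam = pixel_to_camera(u, v, depth, fx, fy, cx, cy)
p_world = camera_to_world(p_cam, cam_to_world)
print((u, v))
print(p_world)
```

If detection runs on the re-sampled image, the rectangle would first have to be scaled back to full-resolution pixel coordinates before this step.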
Figure 16 shows the 3D face detection result, which is drawn as a wireframe cube in 3D space.
6 Results and discussion
The algorithm proposed in this paper was tested on the commonly used ORL, Yale, AR, Stanford, JAFFE, and CIT face image databases. Results before and after acceleration and optimization were compared with each other and with the Microsoft Azure Face API. The databases cover feature changes in expression; illumination, expression, and glasses; illumination, expression, and scarf; face calibration; expressions of Asian women; and color illumination. Image sizes range from 60 × 60 to 179 × 118, and the number of face images per database ranges from 165 to 2600. The data analysis and comparison focus on the number of missed faces and the detection time. For the experiment, a face detection program based on the proposed algorithm was first developed with the Unity game engine and C# scripts in Visual Studio, and the program and data were then transferred to the HoloLens headset over a LAN. After these preparations, the experimenter wore the HoloLens headset to run face detection. The experimental results are shown in Table 1. Because requests to the Microsoft Azure Face API are sent on average once every 3 s, the total time of detection via the network, counting the request interval, is far longer than that of local detection.
The experimental results show the following:
Acceleration and optimization speed up detection without affecting detection accuracy. Specifically, detection on the six databases is 3.5, 3.8, 4.0, 5.1, 3.7, and 3.9 times faster respectively, 4.0 times on average.
On a database with a large number of images, such as the AR face image database, and on a database with small images, such as the Yale face image database, network detection through the Microsoft Azure Face API takes 464 and 221 times as long as local detection, respectively.
Disregarding the Face API request interval, local detection is generally 4.1 times as fast as detection via the network, 9.8 times as fast when the database contains many images, and 20 times as fast when the images are small.
Moreover, the miss rate of face detection with this algorithm is lower than that of network detection through the Microsoft Azure Face API, and its detection accuracy is 12% higher on average.
7 Conclusions
In recent years, with the rapid progress of science and technology, face detection has moved from science fiction films into many areas of daily life, such as identity comparison, access control, personal computer unlocking, and retail payment. This paper combines a face detection algorithm with the HoloLens holographic glasses by optimizing the classical Viola-Jones algorithm with Haar-like rectangle feature expansion, two-dimensional convolution separation, and image re-sampling. With this improved Viola-Jones algorithm, face detection runs locally on HoloLens with higher speed and accuracy. Moreover, the HoloLens depth camera is used for 3D detection and spatial location of the face. The experimental results show that, compared with the Face API, local detection is generally 4.1 times as fast as detection via the network, 9.8 times as fast when the database contains many images, and 20 times as fast when the images are small.
HoloLens-based face detection, which supplements and superimposes real and virtual information, not only improves the user experience but also contributes to a smarter and easier life, and it can be applied in fields such as social contact, public security, and business management.
Abbreviations
API: Application programming interface
LAN: Local area network
References
1. Xia Xueting, Hu Zhengfei, Pan Lingyun, Indoor automatic lighting control system with OpenCV face detection. Comput. Technol. Dev. 27(04), 184–187 (2017)
2. Zahid Mahmood, Tauseef Ali, Shahid Khattak, Laiq Hasan, Samee U. Khan, Automatic player detection and identification for sports entertainment applications. Pattern Anal. Applic. 18(4), 971–982 (2015)
3. M.M. Sawant, K.M. Bhurchandi, Age invariant face recognition: a survey on facial aging databases techniques and effect of aging. Artificial Intelligence Review (1), 1–28 (2018)
4. Luo Chang, Chen Chen, Overview of patent technology of face recognition technology applied to graphical user interface. China Invent. Patent 14(10), 52–56 (2017)
5. Wei Zhenyu, Wen Chang, Xie Kai, He Jianbiao, Real-time face detection on mobile device with optical flow estimation. J. Comput. Appl. 38(04), 1146–1150 (2018)
6. H. Jiang, E. Learned-Miller, Face detection with the faster R-CNN, in Proceedings of the 2017 12th IEEE International Conference on Automatic Face & Gesture Recognition (IEEE Computer Society, Washington, DC, 2017), pp. 650–657
7. Yu Wei, Zhu Qiuyu, Real-time people counting method based on face detection and tracking in embedded platform. Ind. Control Comput. 30(12), 18–20+23 (2017)
8. Han Yu, Simulation of terminal face recognition for confirming identity based on mobile network security. Comput. Simul. 34(10), 352–356 (2017)
9. Jia Haipeng, Zhang Yunquan, Yuan Liang, Li Shigang, Research of Viola-Jones face detection algorithm performance optimization based on OpenCL. Chin. J. Comput. 39(09), 1775–1789 (2016)
10. Song Yanhui, Wang Wenyong, Cheng Xiaochun, Amelioration to the face recognition algorithm based on the Viola-Jones frame. J. N. E. Normal Univ. (Natural Science Edition) (03), 24–27 (2005)
11. Peng Mingsha, Liu Cuixiang, Face detection in video motion region based on Viola-Jones algorithm. Electron. Des. Eng. 23(21), 15–17 (2015)
12. Zhu Jing, Making real-time character animation based on Viola-Jones algorithm. J. Xichang Coll. (Natural Science Edition) 29(02), 62–64 (2015)
13. Zhengping Wu, Jie Yang, Haibo Liu, Qingnian Zhang, A real-time object tracking via L2-RLS and compressed Haar-like features matching. Multimed. Tools Appl. 75(15), 9427–9443 (2016)
14. K. Maheshwari, N. Nalini, Facial recognition enabled smart door using Microsoft Face API. Int. J. Eng. Trends Appl. (IJETA) 5(4), 1–4 (2017)
15. M.L. Gottmer, Merging Reality and Virtuality with Microsoft HoloLens (2015)
16. P. Hockett, T. Ingleby, Augmented Reality with Hololens: Experiential Architectures Embedded in the Real World (2017)
17. N. Cui, P. Kharel, V. Gruev, Augmented Reality with Microsoft HoloLens Holograms for Near Infrared Fluorescence Based Image Guided Surgery, in SPIE BiOS (2017), p. 100490I
18. R. Furlan, The future of augmented reality: Hololens – Microsoft’s AR headset shines despite rough edges [Resources_Tools and Toys]. IEEE Spectr. 53(6), 21–21 (2016)
19. C. Müller, M. Krone, M. Huber, et al., Interactive Molecular Graphics for Augmented Reality Using HoloLens. Human Molecular Genetics 9(2), 221–233 (2018)
20. M.T. Prochaska, S.F. Hohmann, M. Modes, et al., Trends in troponin-only testing for AMI in academic teaching hospitals and the impact of choosing wisely?. J. Hosp. Med. 12(12), 957 (2017)
21. Chen Fangfei, Feng Rui, Improved LATCH based on integral image. Comput. Syst. Appl. 26(5), 145–149 (2017)
22. C.-M. Vong, K.I. Tai, C.-M. Pun, P.-K. Wong, Fast and accurate face detection by sparse Bayesian extreme learning machine. Neural Comput. Applic. 26(5), 1149–1156 (2015)
23. G. Hermosilla, J. Ruiz-del-Solar, R. Verschae, An enhanced representation of thermal faces for improving local appearance-based face recognition. Intell. Autom. Soft Comput. 23(1), 1–12 (2017)
24. X.W. Li, D.H. Kim, S.J. Cho, S.T. Kim, Integral imaging based 3-D image encryption algorithm combined with cellular automata. J. Appl. Res. Technol. 11(4), 549–558 (2013)
25. D. Puchala, K. Stokfiszewski, Numerical accuracy of integral images computation algorithms. IET Image Processing 12(1), 31–41 (2018)
26. X.W. Li, Q.H. Wang, S.T. Kim, et al., Encrypting 2D/3D image using improved lensless integral imaging in Fresnel domain. Optics Communications 381, 260–270 (2016)
27. Liao Hongwen, Zhou Delong, Overview of AdaBoost and its improved algorithm. Comput. Syst. Appl. 21(05), 240–244 (2012)
28. Ye Jianfeng, Wang Huaming, An automatic face recognition method using AdaBoost detection and SOM. J. Harbin Eng. Univ. 39(01), 129–134 (2018)
29. Ju Zhiyong, Peng Yanni, Detection of dynamic pedestrians based on AdaBoost algorithm and improved frame difference method. Software Guide 16(09), 50–54 (2017)
30. Yu Wei, Zhu Qiuyu, Gao Xiang, Wang Hui, Multi-level cascade fast face detection based on ARM. Ind. Control Comput. 30(09), 29–31 (2017)
31. Jiang Weijian, Guo Gongde, Lai Zhiming, An improved Adaboost algorithm based on new Haar-like feature for face detection. J. Shandong Univ. (Engineering Science) 44(02), 43–48 (2014)
32. S.K. Pavani, D. Delgado, A.F. Frangi, Haar-like features with optimally weighted rectangles for rapid object detection. Pattern Recognition 43(1), 160–172 (2010)
Acknowledgements
The authors thank the editor and anonymous reviewers for their helpful comments and valuable suggestions. We would also like to acknowledge all our team members. These authors contributed equally to this work.
About the authors
Jing Huang was born in Hengyang, Hunan, P.R. China, in 1967. She received the Ph.D. degree in Computer Science from the Macau University of Science and Technology, Macao, China, in 2006. Since 2005, she has been a teacher, associate professor, and professor at Beijing Normal University (Zhuhai), Guangdong, China. Her current research interests include computer graphics and image processing, and virtual reality.
Yunyi Shang was born in Guangzhou, Guangdong, P.R. China, in 1995. She received the Bachelor's degree from Beijing Normal University (Zhuhai), Guangdong, China, in 2018. She is now studying for a Master's degree at the Macau University of Science and Technology. Her research interests include computer graphics and image processing, and deep learning.
Hai Chen was born in Anshun, Guizhou, P.R. China, in 1974. She received the Master's degree from Beijing WuZi University, Beijing, China, in 2009. She works at the School of Information Technology, Beijing Normal University, Zhuhai, China. Her research interests include intelligent speech recognition and telemedicine for respiratory diseases.
Funding
This research was supported by the Natural Science Foundation of Guangdong Province under Grant No. 2016A030313384 and by the Major Scientific Research Project for Universities of Guangdong Province (No. 2017KTSCX207).
Availability of data and materials
Please contact author for data requests.
Competing interests
The authors declare that they have no competing interests.
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Cite this article
Huang, J., Shang, Y. & Chen, H. Improved Viola-Jones face detection algorithm based on HoloLens. J Image Video Proc. 2019, 41 (2019). https://doi.org/10.1186/s13640-019-0435-6