
# A simplified nonlinear regression method for human height estimation in video surveillance

*EURASIP Journal on Image and Video Processing*
**volume 2015**, Article number: 32 (2015)

## Abstract

This paper presents a simple camera calibration method for estimating human height in video surveillance. Given that most cameras for video surveillance are installed in high positions at a slightly tilted angle, it is possible to retain only three calibration parameters in the original camera model, namely the focal length, the tilting angle and the camera height. These parameters can be directly estimated using a nonlinear regression model from the observed head and foot points of a walking human instead of estimating the vanishing line and point in the image, which is extremely sensitive to noise in practice. With only three unknown parameters, the nonlinear regression model can fit data efficiently. The experimental results show that the proposed method can predict the human height with a mean absolute error of only about 1.39 cm from ground truth data.

## Introduction

Advances in the image resolution and quality of digital cameras in the last few years have increased the image analysis capability of modern video surveillance systems. Estimating human height is an essential task in video surveillance because it enables many practical applications such as soft biometrics and forensic analyses [1–6]. The key idea behind this technology is a camera calibration system containing a set of parameters for transforming real-world coordinates into image coordinates and vice versa. It is natural to associate a walking or standing human with the camera calibration problem in the context of video surveillance for the following two reasons: a walking or standing person is basically vertical, and his or her height is known. Several camera calibration methods based on walking humans have been proposed. Most such methods rely on estimating vanishing points from a walking human. However, as Micusik et al. reported in [7], estimating the vanishing points is usually the bottleneck of these methods because it is extremely sensitive to noise.

Lv et al. [8, 9] proposed a self-calibration method for estimating a camera’s intrinsic and extrinsic parameters. Their method of computing calibration parameters relies on three vanishing points that can be estimated from a set of automatically extracted head-feet pairs in the video. The initial projection matrix is then refined by minimizing the distance between the original and reprojected head points by using a nonlinear optimization algorithm. Lv’s work has inspired many similar methods [7, 10–13].

Krahnstoever et al. [10] introduced a homology-based method. Homology is the transformation from the foot plane to the head plane and contains all geometric information necessary to recover the whole projective matrix in the camera model. The initial projective matrix is updated using a Bayesian framework to obtain estimated parameters. Junejo et al. [11] employed a homology-based method to recover a projective matrix with some modification in the outlier removal stage in which outliers are removed by the Rayleigh quotient.

As reported in Micusik et al. [7], a drawback of the aforementioned methods is that they rely on estimating three vanishing points, which is usually the bottleneck of these approaches because it is extremely sensitive to noise. Even negligible inaccuracy in a vanishing point can cause huge inaccuracy in the estimate of the focal length, by up to 100 %. Therefore, these approaches have limited use in practice. To overcome this problem, Micusik et al. introduced an improved approach based on the quadratic eigenvalue problem without estimating the vanishing points. According to their experiment, this approach outperforms other approaches such as vanishing point-based or the standard eight point-based approaches.

Liu et al. [12, 13] proposed a more automated method for calibrating surveillance cameras based on prior knowledge of the distribution of human heights. The main idea behind this method is based on the observation that objects (pedestrians) in a scene are all roughly the same height. This method is practical in applications that do not require highly precise camera calibration.

Recently, many studies have verified the robustness of camera calibration methods based on vanishing points [14–16]. These methods assume that images are from a “Manhattan” scene with orthogonal structures and estimate the vanishing points from the scene. Given vanishing points and reference height information, the human height can be computed straightforwardly. The proposed method provides an alternative solution that does not rely on computing vanishing points. This is useful in cases where vanishing points are difficult to compute.

This study presents a camera calibration method for estimating the human height in video surveillance. In order to provide the best field of view, most surveillance cameras are set at high locations with low tilt angles. Only three camera parameters, namely, the focal length, the tilting angle, and the camera height, remain effective with this installation. In the proposed method, these parameters are directly calculated using a nonlinear regression model from the observed head and foot points of a walking human instead of estimating the vanishing line and points, which are extremely sensitive to noise in practice. Unlike other methods that estimate all parameters in the camera model, the proposed method estimates only three parameters. With only three unknown parameters, the nonlinear regression model can provide an efficient fit to the data. The experimental results show that the proposed method can predict the human height with a mean absolute error of only about 1.39 cm from ground truth data.

The main advantage of the proposed method is that it provides the simplest solution for the camera calibration problem in video surveillance in comparison to other methods. More specifically, the proposed method 1) does not require the vanishing line, which is generally difficult to estimate and generates many errors in practice, 2) takes only three parameters (the focal length, the tilting angle, and the camera height), and 3) uses no calibration objects, including parallel or perpendicular lines on the ground. These advantages are increasingly valuable because a growing number of surveillance cameras are being installed and the proposed method can save a lot of time in calibrating them.

The rest of this paper is organized as follows: Section 2 describes the proposed method for calibrating cameras and estimating the human height in video surveillance. Section 3 presents the experimental results from the walking human and the ruler-based evaluations. Section 4 analyzes errors, and Section 5 concludes the paper.

## Proposed method

The original pin-hole camera model consists of five intrinsic and six extrinsic parameters. The intrinsic parameters describe inner properties of the camera, such as the focal length and skewness, and the extrinsic parameters describe the translation and rotation of the camera center from the world coordinate system to the camera coordinate system [17]. The original camera model is given by

\[ \mathbf{P} = \mathbf{K}\,[\,\mathbf{R} \mid \mathbf{t}\,] \tag{1} \]

where **K** is the intrinsic parameter matrix, **R** is the rotation matrix, and **t** is the translation vector.

Most cameras for video surveillance are installed at high positions with a slightly tilted angle to ensure the best field of view. Figure 1 shows this type of camera installation and the coordinate system. In this installation, the rotation angles about the *Y*- and *Z*-axes (also known as pan and roll) can be assumed to be 0, and the translations along the *X*- and *Z*-axes can also be assumed to be 0. Therefore, the original camera model **P** can be simplified as

where **R**_{X} is the rotation matrix of the camera tilt and **c**_{Y} is the translation vector along the Y direction. To further reduce the number of calibration parameters in **K**, zero skew, a unit aspect ratio, and a known principal point [0,0]^{T} are assumed. Then the camera matrix **P** can be written as

where *f* is the focal length, *θ* is the tilt angle, and *c* is the height of the camera. These three parameters can determine the mapping from world coordinates [X,Y,Z]^{T} to image coordinates [x,y,w]^{T} as follows:

which can be represented in Cartesian coordinates as

The walking human in the camera view provides a set of head and foot points in the image plane and a physical height. A walking human is perpendicular to the ground. Therefore, the y-coordinate offers more information than the x-coordinate and can be associated with the physical height of the human. In this regard, the bottom equation in Eq. (5) gives a basic relationship between world coordinates Y and Z and the image coordinate y:

if cos*θ*≠0,

Each head-foot pair of y-coordinates, denoted as y_{h} and y_{f}, can be measured from the image, so applying Eq. (7) to each pair provides a set of equations with three unknowns:

where Y_{f}=0 and Y_{h} is the Y coordinate of the head, which is the known physical height of the human, and Z_{f} and Z_{h} are the Z coordinates of the foot and the head, respectively. In practice, measuring Z requires additional grids or objects on the ground and is more difficult than measuring Y, which is the known height of the human. Because a walking or standing person is essentially vertical, Z_{h} = Z_{f}. The variable Z in Eq. (8) is therefore eliminated by solving the foot equation for Z_{f} and substituting the result for Z_{h} in the head equation. This yields an equation containing only y_{f} and y_{h}:
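The elimination step can be illustrated with a short worked derivation. This is only a sketch under an assumed convention, and the authors' own notation and signs may differ: the camera center sits at \((0, c, 0)^{T}\), the tilt is a rotation by \(\theta\) about the X-axis (negative when the camera looks down), the principal point is the image origin with the image y-axis pointing up, and the person is vertical so that the head and foot share one depth Z.

```latex
% Assumed model (not taken from the paper): P = K R_X(\theta) [ I | -C ],
% with K = diag(f, f, 1) and C = (0, c, 0)^T.
\begin{align*}
y &= f\,\frac{(Y - c)\cos\theta - Z\sin\theta}{(Y - c)\sin\theta + Z\cos\theta}
  && \text{(projection of a point at height } Y \text{ and depth } Z\text{)} \\
Z &= c\,\frac{y_f\sin\theta - f\cos\theta}{y_f\cos\theta + f\sin\theta}
  && \text{(foot equation with } Y_f = 0\text{, solved for } Z\text{)} \\
y_h &= f\,\frac{Y_h\cos\theta\,(y_f\cos\theta + f\sin\theta) - c\,y_f}
               {Y_h\sin\theta\,(y_f\cos\theta + f\sin\theta) - c\,f}
  && \text{(head equation after substituting } Z\text{)}
\end{align*}
```

As a sanity check, setting \(Y_h = 0\) in the last line reduces it to \(y_h = y_f\), as expected for a point on the ground.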

For real data, the right-hand side of Eq. (9) actually gives an estimated value of y_{h} which is denoted as the estimation function \({\hat {\mathrm {y}}}_{\mathrm {h}}\). This function takes two arguments y_{f} and Y_{h}:

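Given that real data always come with noise, the observed value of y_{h} can be rewritten as

\[ \mathrm{y}_{\mathrm{h}} = \hat{\mathrm{y}}_{\mathrm{h}}(\mathrm{y}_{\mathrm{f}}, \mathrm{Y}_{\mathrm{h}}) + \varepsilon \tag{11} \]

where *ε* is the error produced by the calibration parameters. Minimizing *ε* gives the optimal parameters.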
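To make the estimation function and its error term concrete, the following Python sketch evaluates a head-position predictor and the residuals *ε* on synthetic observations. The closed form is derived under an assumed convention (camera center at (0, c, 0), tilt *θ* about the X-axis and negative when looking down, image y-axis pointing up) and may differ in sign or form from the authors' equations. The names `project_y`, `predict_head_y`, and `residuals`, as well as all numeric values, are hypothetical.

```python
import math

def project_y(Y, Z, f, theta, c):
    """Image y of a world point at height Y and depth Z (assumed convention)."""
    s, k = math.sin(theta), math.cos(theta)
    return f * ((Y - c) * k - Z * s) / ((Y - c) * s + Z * k)

def predict_head_y(y_f, Y_h, f, theta, c):
    """Predicted head y from the observed foot y and the known height Y_h.

    The depth Z has been eliminated, assuming the head and the foot
    lie at the same depth (a vertical person).
    """
    s, k = math.sin(theta), math.cos(theta)
    a = y_f * k + f * s
    return f * (Y_h * k * a - c * y_f) / (Y_h * s * a - c * f)

def residuals(observations, f, theta, c):
    """Error eps = observed y_h minus predicted y_h, one per (y_f, y_h, Y_h)."""
    return [y_h - predict_head_y(y_f, Y_h, f, theta, c)
            for (y_f, y_h, Y_h) in observations]

# Synthetic, noise-free observations of a 170 cm person at three depths.
f, theta, c = 550.0, math.radians(-35.0), 280.0
obs = [(project_y(0.0, Z, f, theta, c), project_y(170.0, Z, f, theta, c), 170.0)
       for Z in (300.0, 500.0, 800.0)]
eps = residuals(obs, f, theta, c)
print(eps)  # residuals vanish (up to rounding) at the generating parameters
```

With real annotations the residuals do not vanish, and the calibration searches for the parameters that make them as small as possible.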

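Note that Eq. (10) has a nonlinear form and its parameters can be found by a nonlinear regression that minimizes the sum of squared errors over all observed head-foot pairs *i*:

\[ (\hat{f}, \hat{\theta}, \hat{c}) = \underset{f,\,\theta,\,c}{\arg\min} \sum_{i} \left( \mathrm{y}_{\mathrm{h},i} - \hat{\mathrm{y}}_{\mathrm{h}}(\mathrm{y}_{\mathrm{f},i}, \mathrm{Y}_{\mathrm{h},i}) \right)^{2} \tag{12} \]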

There are many algorithms for solving this type of problem, including the Levenberg-Marquardt algorithm. Initial parameters *θ*_{0} and *c*_{0} can be easily approximated through visual measurement, and *f*_{0} can be set as 0.5–1.5 times the image height if the real-world length unit is the centimeter.
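The fitting step can be sketched with a small, hand-rolled Levenberg-Marquardt loop in pure Python. This is a minimal illustration, not the authors' implementation. It assumes a particular projection convention (camera center at (0, c, 0), tilt *θ* about the X-axis in radians and negative when looking down, image y-axis pointing up), uses a finite-difference Jacobian, and runs on synthetic noise-free data. All function names and the "true" parameters are hypothetical; the initial guess reuses the default values 720, −30 (taken here as degrees), and 300.

```python
import math

def project_y(Y, Z, f, theta, c):
    """Image y of a world point at height Y, depth Z (assumed convention)."""
    s, k = math.sin(theta), math.cos(theta)
    return f * ((Y - c) * k - Z * s) / ((Y - c) * s + Z * k)

def predict_head_y(y_f, Y_h, f, theta, c):
    """Predicted head y from the foot y and the known height Y_h."""
    s, k = math.sin(theta), math.cos(theta)
    a = y_f * k + f * s
    return f * (Y_h * k * a - c * y_f) / (Y_h * s * a - c * f)

def residuals(p, data):
    return [y_h - predict_head_y(y_f, Y_h, *p) for y_f, y_h, Y_h in data]

def cost(p, data):
    return sum(r * r for r in residuals(p, data))

def jacobian(p, data):
    """Central-difference Jacobian of the residuals (rows: observations)."""
    cols = []
    for j in range(3):
        h = 1e-6 * max(1.0, abs(p[j]))
        up, down = list(p), list(p)
        up[j] += h
        down[j] -= h
        cols.append([(a - b) / (2.0 * h)
                     for a, b in zip(residuals(up, data), residuals(down, data))])
    return [list(row) for row in zip(*cols)]

def solve3(A, b):
    """Solve a 3x3 system by Gaussian elimination with partial pivoting."""
    M = [A[i][:] + [b[i]] for i in range(3)]
    for i in range(3):
        pivot = max(range(i, 3), key=lambda r: abs(M[r][i]))
        M[i], M[pivot] = M[pivot], M[i]
        for r in range(i + 1, 3):
            factor = M[r][i] / M[i][i]
            for col in range(i, 4):
                M[r][col] -= factor * M[i][col]
    x = [0.0, 0.0, 0.0]
    for i in (2, 1, 0):
        x[i] = (M[i][3] - sum(M[i][j] * x[j] for j in range(i + 1, 3))) / M[i][i]
    return x

def fit_lm(data, p0, iterations=100, lam=1e-3):
    """Levenberg-Marquardt with Marquardt scaling; keeps only improving steps."""
    p, best = list(p0), cost(p0, data)
    for _ in range(iterations):
        r, J = residuals(p, data), jacobian(p, data)
        JtJ = [[sum(row[a] * row[b] for row in J) for b in range(3)]
               for a in range(3)]
        Jtr = [sum(row[a] * ri for row, ri in zip(J, r)) for a in range(3)]
        A = [[JtJ[a][b] + (lam * JtJ[a][a] if a == b else 0.0) for b in range(3)]
             for a in range(3)]
        step = solve3(A, [-g for g in Jtr])
        trial = [pi + di for pi, di in zip(p, step)]
        trial_cost = cost(trial, data)
        if trial_cost < best:   # accept the step and trust the model more
            p, best, lam = trial, trial_cost, lam * 0.1
        else:                   # reject the step and increase the damping
            lam *= 10.0
    return p, best

# Synthetic data: a 170 cm person seen at several depths (hypothetical camera).
true_params = (600.0, math.radians(-35.0), 280.0)
data = [(project_y(0.0, Z, *true_params), project_y(170.0, Z, *true_params), 170.0)
        for Z in range(300, 1000, 100)]
p0 = (720.0, math.radians(-30.0), 300.0)
initial_cost = cost(p0, data)
fitted, final_cost = fit_lm(data, p0)
print(fitted, initial_cost, final_cost)
```

Because a step is kept only when it lowers the sum of squared residuals, the final cost never exceeds the initial one. Whether the generating parameters are recovered depends on the initial guess and on how widely the observations are spread across the image. In practice, a tested library optimizer would normally replace the hand-written loop.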

Once calibration parameters \(\hat {f},\hat {\theta },\hat {c}\) of a camera are obtained, the physical height of a person can be estimated from a pair of head and foot points observed from the image. As in the case of Eq. (10), the estimated physical height \(\hat {\mathrm {Y}}\) can be written as a function of y_{f} and y_{h}:
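A minimal sketch of such a height function is given here. It is derived under an assumed convention (camera center at (0, c, 0), tilt *θ* in radians about the X-axis and negative when looking down, image y-axis pointing up, head and foot at the same depth) and is an illustration rather than the authors' published equation. The function name `estimate_height` and the pixel coordinates in the example are hypothetical, and the tilt −38.6 is taken to be in degrees.

```python
import math

def estimate_height(y_f, y_h, f, theta, c):
    """Physical height from the image y of the foot and the head.

    Obtained by solving the assumed head-projection relation for Y after
    eliminating the depth Z with the foot relation (Y_f = 0).
    """
    s, k = math.sin(theta), math.cos(theta)
    return c * f * (y_h - y_f) / ((y_f * k + f * s) * (y_h * s - f * k))

# Example: the optimized parameters reported for one camera in the experiments,
# combined with made-up foot and head coordinates (pixels, origin at the center).
f_hat, theta_hat, c_hat = 547.7, math.radians(-38.6), 270.2
print(estimate_height(98.7, 282.3, f_hat, theta_hat, c_hat))
```

The result is in the same length unit as the camera height *c*, and identical head and foot coordinates give a height of zero.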

## Experimental results

### Experiment setup

Two types of experiments were conducted to evaluate the accuracy and robustness of the proposed method: 1) an evaluation based on the walking human and 2) an evaluation based on the ruler. A dataset was collected from a video surveillance site in use. Cameras at the site were installed at entrances and corridors of a building as well as at an outside parking lot. The video resolution for this dataset was 1280 × 720. For each camera, 15 pairs of points were collected in the ruler-based evaluation, and 5–30 pairs of points were collected in the walking human-based evaluation. These points were collected across a broad range of the camera view, covering both near and far positions.

Figure 2 shows the camera calibration process using the proposed method. First, head and foot points were marked and saved, as shown in Fig. 2a. Initial values of the focal length, the tilt angle, and the camera height were set to 720, −30, and 300 by default or to values obtained by the visual measurement of the camera location. The parameters were then estimated by the nonlinear regression method described in Section 2; specifically, this experiment adopted the Levenberg-Marquardt algorithm. The optimized parameters \(\hat {f}\), \(\hat {\theta }\), and \(\hat {c}\) for this camera were 547.7, −38.6, and 270.2, respectively.

Figure 2c plots the relationship between the observed value of y_{h} and the estimated value of \(\hat {\mathrm {y}}_{\mathrm {h}}\) with respect to the observed value of y_{f}. Note that the slope of the initially estimated value of \(\hat {\mathrm {y}}_{\mathrm {h}}\) was very close to that of the observed y_{h} but that the scale diverged. This is because the visually approximated height and tilt can be relatively accurate, whereas the focal length cannot.

### Walking human-based evaluation

In the evaluation based on the walking human, the video dataset consisted of 11 subjects and 9 cameras. The height of each subject was measured with shoes before recording the video. Head and foot points were manually marked in the videos, although there are many algorithms for automatic human detection. Some studies have suggested that the height of the human is more accurate in the phase in which the two feet cross each other [9]. The manual marking in this experiment followed this observation. Some videos and marking results are shown in Fig. 3. Each camera was calibrated using measurements of subject 1, and the error was evaluated on the remaining subjects.

In the experiment, the mean and standard deviation of heights were computed from observations from each camera. Figure 4 shows the height estimation error in the evaluation based on the walking human, including the distribution and the limits of agreement (LOA; also known as the Bland-Altman plot). The purpose of the LOA is to investigate the difference between the true height (measured with a ruler) and the estimated height. The mean absolute error is 1.39 cm, and the standard deviation is 1.91 cm. The 95 % limits of agreement are 3.32 and −4.71 cm, which are computed as the mean difference ± 1.96 standard deviations. Table 1 provides the detailed results. Figure 5 shows no correlation between the height estimation error (cm) and the x- and y-coordinates, which indicates that the height estimation error does not depend on the position of the human.
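The summary statistics used in this evaluation can be computed as in the short Python sketch below. It assumes the error is defined as estimated minus true height and that the sample (n − 1) standard deviation is used; neither choice is stated explicitly in the text. The function name `agreement_stats` and the example heights are hypothetical and are not data from the experiments.

```python
import statistics

def agreement_stats(true_heights, estimated_heights):
    """Mean absolute error, bias, SD, and 95 % limits of agreement (cm)."""
    diffs = [est - true for true, est in zip(true_heights, estimated_heights)]
    bias = statistics.mean(diffs)    # mean difference
    sd = statistics.stdev(diffs)     # sample standard deviation (assumed)
    return {
        "mae": statistics.mean([abs(d) for d in diffs]),
        "bias": bias,
        "sd": sd,
        "loa": (bias - 1.96 * sd, bias + 1.96 * sd),
    }

# Made-up heights in centimeters, for illustration only.
stats = agreement_stats([170.0, 180.0, 160.0], [171.0, 179.0, 161.0])
print(stats)
```

A bias close to zero, with limits of agreement roughly symmetric around it, indicates that the estimates are not systematically too high or too low.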

Table 2 compares the results for the proposed method with the existing methods. The empty fields indicate no available result (N/A) in these studies. Mean absolute error, standard deviation, and maximum error were chosen as measures for comparing accuracy. As shown in the table, the proposed method provided more accurate height estimates while requiring only a walking human as the calibration object.

### Ruler-based evaluation

In the ruler-based evaluation, a vertical ruler was used instead of a walking human in order to isolate the error caused by the wrong annotation of head and foot points. Figure 6 shows the devices and the data collection procedure. The experiment was conducted in an indoor environment to clearly identify ruler labels. Calibration parameters were estimated by measuring 200 cm, and the measurement error was evaluated by measuring 160 to 210 cm in 10-cm increments. Figure 7 shows the height estimation error in the ruler-based evaluation, including the distribution and the LOA. The mean absolute error is 0.42 cm, and the standard deviation is 0.54 cm. The 95 % limits of agreement are 1.25 and −0.98 cm, which are computed as the mean difference ± 1.96 standard deviations. The mean difference is −0.10 cm, which is close to 0, indicating that the measurement error is not biased.

## Discussion

### Lens distortion correction

Lens distortion causes substantial error at the edges of the recorded area, particularly in some wide-angle cameras. To solve this problem, a commonly used radial distortion correction method [18] was applied. The image coordinates were converted into distortion-free coordinates before the calibration.
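The conversion to distortion-free coordinates can be sketched as follows. The text does not spell out the distortion model, so this example assumes a two-coefficient radial polynomial applied to normalized image coordinates and inverts it by fixed-point iteration, which is one common approach. The function names and the coefficients `k1` and `k2` are hypothetical; in practice the coefficients would come from a separate lens calibration.

```python
def distort(x, y, k1, k2):
    """Apply the assumed radial model to ideal normalized coordinates."""
    r2 = x * x + y * y
    gain = 1.0 + k1 * r2 + k2 * r2 * r2
    return x * gain, y * gain

def undistort(x_d, y_d, k1, k2, iterations=20):
    """Recover ideal coordinates from distorted ones by fixed-point iteration."""
    x, y = x_d, y_d  # start from the distorted point
    for _ in range(iterations):
        r2 = x * x + y * y
        gain = 1.0 + k1 * r2 + k2 * r2 * r2
        x, y = x_d / gain, y_d / gain
    return x, y

# Round trip with made-up coefficients for a mildly distorting lens.
k1, k2 = -0.2, 0.05
x_d, y_d = distort(0.3, 0.2, k1, k2)
x_u, y_u = undistort(x_d, y_d, k1, k2)
print(x_u, y_u)
```

The iteration converges quickly when the distortion is mild. Strongly distorting lenses may need more iterations or a different inversion scheme.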

### Ground surface

Some cameras are placed lower than the height of adult subjects, such that their main function is to recognize the face. The proposed method can be applied without any modification. The condition for using the proposed method is that the camera pan and roll are both equal to 0. If this condition is satisfied and the human’s head/foot points are observable, then calibration parameters can be estimated in the same way as general cases. In such cases, the camera height *c* is lower than that of adult subjects.

Another case arises when the ground is not at a single level or the floor is not flat. In such cases, substantial height estimation error results. A possible solution is to treat each level as a separate floor and to calibrate it separately.

### Pose of the walking human

Some subjects habitually walk with a bowed head. Figure 8 shows that subject 11 walked with a forward-leaning pose in cameras 3, 5, and 9. The estimation errors are −0.14, −4.84, and −7.93 cm, respectively. The results show that walking with a leaning pose leads to an underestimation of the subject height.

## Conclusions

This paper proposes a simple camera calibration method for estimating the human height in video surveillance. The proposed method requires neither a special calibration object nor a special pattern on the ground, such as parallel or perpendicular lines. Only three parameters are retained in the camera model, making the estimation of parameters more efficient. In addition, the proposed method does not rely on computing vanishing points, which are difficult to estimate in practice.

The experimental results show that the proposed method can predict the human height from observed head and foot points in the video, with a mean absolute error of only about 1.39 cm from ground truth data in the walking human-based evaluation.

The proposed method can be integrated with automated human detection methods to fully perform autocalibration, and this provides a useful avenue for future research. In addition, future research should introduce lens distortion parameters to a simplified camera model.

## References

1. A Dantcheva, C Velardo, A D’Angelo, JL Dugelay, Bag of soft biometrics for person identification. Multimed. Tools. Appl. **51**(2), 739–777 (2011).
2. B Hoogeboom, I Alberink, M Goos, Body height measurements in images. J. Forensic. Sci. **54**(6), 1365–1375 (2009).
3. N Ramstrand, S Ramstrand, P Brolund, K Norell, P Bergstrom, Relative effects of posture and activity on human height estimation from surveillance footage. Forensic. Sci. Int. **212**(1–3), 27–31 (2011).
4. D Reid, M Nixon, S Stevenage, Soft biometrics; human identification using comparative descriptions. IEEE Trans. Pattern. Anal. Mach. Intell. **36**(6), 1216–1228 (2014).
5. P Tome, J Fierrez, R Vera-Rodriguez, M Nixon, Soft biometrics and their application in person recognition at a distance. IEEE Trans. Inf. Forensics. Secur. **9**(3), 464–475 (2014).
6. SX Yang, PK Larsen, T Alkjaer, B Juul-Kristensen, EB Simonsen, N Lynnerup, Height estimations based on eye measurements throughout a gait cycle. Forensic. Sci. Int. **236**(0), 170–174 (2014).
7. B Micusik, T Pajdla, in *IEEE Conference on Computer Vision and Pattern Recognition (CVPR)*. Simultaneous surveillance camera calibration and foot-head homology estimation from human detections (IEEE, 2010), pp. 1562–1569.
8. F Lv, T Zhao, R Nevatia, in *International Conference on Pattern Recognition (ICPR)*, 1. Self-calibration of a camera from video of a walking human (IEEE, 2002), pp. 562–567.
9. F Lv, T Zhao, R Nevatia, Camera calibration from video of a walking human. IEEE Trans. Pattern. Anal. Mach. Intell. **28**(9), 1513–1518 (2006).
10. N Krahnstoever, PR Mendonca, in *International Conference on Computer Vision (ICCV)*. Bayesian autocalibration for surveillance (IEEE, 2005).
11. I Junejo, H Foroosh, in *IEEE International Conference on Video and Signal Based Surveillance (AVSS)*. Robust auto-calibration from pedestrians (IEEE, 2006).
12. J Liu, RT Collins, Y Liu, in *British Machine Vision Conference, Dundee*. Surveillance camera autocalibration based on pedestrian height distributions (BMVA, 2011).
13. J Liu, RT Collins, Y Liu, Robust autocalibration for a surveillance camera network. IEEE Work. Appl. Comp. Vis. (WACV), 433–440 (2013).
14. JP Tardif, in *IEEE Conference on Computer Vision and Pattern Recognition (CVPR)*. Non-iterative approach for fast and accurate vanishing point detection (2009), pp. 1250–1257.
15. E Tretyak, O Barinova, P Kohli, V Lempitsky, Geometric image parsing in man-made environments. Int. J. Comput. Vis. **97**(3), 305–321 (2012).
16. H Wildenauer, A Hanbury, in *IEEE Conference on Computer Vision and Pattern Recognition (CVPR)*. Robust camera self-calibration from monocular images of Manhattan worlds (IEEE, 2012), pp. 2831–2838.
17. R Hartley, A Zisserman, *Multiple View Geometry in Computer Vision* (Cambridge University Press, New York, 2004).
18. Z Zhengyou, A flexible new technique for camera calibration. Pattern. Anal. Mach. Intell. IEEE Trans. **22**(11), 1330–1334 (2000).
19. K-Z Lee, in *IEEE Conference on Computer and Robot Vision (CRV)*. A simple calibration approach to single view height estimation (IEEE, 2012), pp. 161–166.
20. AC Gallagher, AC Blose, T Chen, in *International Conference on Computer Vision (ICCV)*. Jointly estimating demographics and height with a calibrated camera (IEEE, 2009), pp. 1187–1194.
21. E Jeges, I Kispal, Z Hornak, in *Human System Interactions*. Measuring human height using calibrated cameras (IEEE, 2008), pp. 755–760.

## Acknowledgements

This work was supported by Institute for Information & Communications Technology Promotion (IITP) grant funded by the Korea government (MSIP) (B0101-15-1282-00010002, Suspicious pedestrian tracking using multiple fixed cameras).

The source code and the dataset are available for download at https://github.com/lishengzhe/ccvs.

## Additional information

### Competing interests

The authors declare that they have no competing interests.

## Rights and permissions

**Open Access** This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.

## About this article

### Keywords

- Camera calibration
- Soft biometrics
- Human height estimation
- Video surveillance