This section introduces the algorithm used to produce the disparity map and depth measurements while the cameras track an object, without the need to constantly recalibrate the system, and describes how the geometry is updated online. The method of updating the configuration of the system has two stages. The first stage is an offline calibration using Zhang’s calibration algorithm [7], whose output is the perspective projection matrix (PPM) and distortion matrix for each camera, together with the translation and rotation matrices between the left and right cameras. The PPM and distortion matrix contain the internal parameters of each camera, and these parameters remain fixed at all times. The translation and rotation matrices contain the external parameters of the system and change constantly.
Figure 1 shows the external parameters of the system. As is frequently done in computer vision, the origin of the left camera is taken as the origin of the system [20]. Therefore, the essential matrix describes the rotation and translation from the left image to the right image. In the offline stage, calibration was performed under different geometric configurations. This process is used to find the relationship between the rotation angle in the image space and in the platform space, and to apply the same mapping to the translation.
The second stage of the calibration is online calibration, where the generated relationship between the image space and the platform space is used to update the essential matrix. The essential matrix is used in the rectification process.
Single-camera model
We start with a single-camera model that describes a pinhole camera system. This model is also used to describe the CMOS sensors in the cameras used in this project. The center of the camera is O, which is the center of the Euclidean coordinate system. The image plane π is perpendicular to the z-axis, and the distance between the origin and the image plane is the focal length f.
Suppose a point W with coordinates W = [X Y Z]^{T} located in front of the camera. A projection point w = [x y]^{T} forms on the image plane when we draw a line from W to the camera origin O. This creates a mapping from 3D space to 2D space. Using homogeneous coordinates to map between the points, we get Eq. (1):
$$ w=PW, $$
(1)
where W = [X Y Z 1]^{T} and w = [x y 1]^{T} are homogeneous vectors and P is the camera projection matrix.
The camera projection matrix P contains the internal and external parameters:
$$ P=A\left[R\ t\right], $$
(2)
where A is a 3 × 3 matrix describing the internal properties of the camera (Eq. (3)), where α_{x} and α_{y} are the focal lengths in pixels in the x and y directions, respectively, and s is a skew parameter, which, in most new cameras, is zero [18]. R and t are external parameters that describe the transformation between the camera and world coordinate frames, where R is a 3 × 3 rotation matrix of rank 3 and t is a translation vector.
$$ A=\left[\begin{array}{ccc}{\alpha}_x& s& {x}_0\\ {}0& {\alpha}_y& {y}_0\\ {}0& 0& 1\end{array}\right]. $$
(3)
The calibration process for a single camera relies on Eq. (1) to relate the point coordinates w and W, where the image coordinates are found by applying corner detection and the world coordinates are given by measuring the distances between the corners of a checkerboard. Once these point pairs are known, the camera projection matrix can be determined algebraically. A well-known algorithm for finding P is that of Zhang (2000) [7].
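As an illustration of Eqs. (1)–(3), the forward projection can be sketched in a few lines of Python. This is a minimal sketch, not the calibration itself: the intrinsic matrix A below is an assumed example (skew s = 0, no lens distortion).

```python
def project(A, R, t, W):
    """Project a 3D world point W through P = A[R t] (pinhole model, no distortion)."""
    # Transform the point into camera coordinates: Xc = R W + t
    Xc = [sum(R[i][k] * W[k] for k in range(3)) + t[i] for i in range(3)]
    # Perspective division and application of the intrinsic matrix A (Eq. (3), s = 0)
    x = A[0][0] * Xc[0] / Xc[2] + A[0][2]
    y = A[1][1] * Xc[1] / Xc[2] + A[1][2]
    return x, y

# Hypothetical intrinsics: 800 px focal lengths, principal point (320, 240)
A = [[800, 0, 320], [0, 800, 240], [0, 0, 1]]
I3 = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
w = project(A, I3, [0, 0, 0], [0.1, 0.05, 1.0])  # point 1 m in front of the camera
```

Zhang's algorithm effectively inverts this mapping: given many (w, W) pairs from the checkerboard, it recovers A, R, and t.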
Stereo model
In the two-camera model, the same process as that for a single camera is applied. In this section, the parameters with subscript letters l and r refer to the left and right camera models, respectively. Figure 2 shows the model studied in this section. The distance between the two camera origins is B and is referred to as the baseline. Supposing that both cameras look at the same point in the world W = [X Y Z]^{T}, a point w will be projected onto both image planes as w_{l} = [x_{l} y_{l}]^{T} and w_{r} = [x_{r} y_{r}]^{T}.
From the model, a plane is formed when O_{l}, W, and O_{r} are connected; this plane is called the epipolar plane. If we know w_{l}, we can find w_{r} by searching along the line l_{r} = e_{r} × w_{r}, called the epipolar line. Writing the cross product with the skew-symmetric matrix [e_{r}]_{×}, we have l_{r} = e_{r} × w_{r} = [e_{r}]_{×} w_{r}, and because w_{l} maps to w_{r}, we have the relation w_{r} = H w_{l}, where H is a 3 × 3 homography matrix of rank 3 that describes the mapping between the two points. Combining both equations gives l_{r} = [e_{r}]_{×} H w_{l} = F w_{l}, where F = [e_{r}]_{×} H is called the fundamental matrix [21].
The fundamental matrix (F) can be extended to include the camera projection matrices, as shown in Eq. (4), where \( {P}_l^{+} \) is the pseudoinverse of P_{l}. The fundamental matrix encodes the internal and external parameters of the stereo vision system. F is a 3 × 3 matrix of rank 2.
$$ F=\left[{e}_r\right]\times {P}_r\ {P}_l^{+}. $$
(4)
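The algebra behind F can be checked numerically. The sketch below uses arbitrary example values for H, e_{r}, and w_{l}; it verifies that a point w_{r} = H w_{l} always satisfies w_{r}^{T} F w_{l} = 0 when F = [e_{r}]_{×} H, since F w_{l} = e_{r} × w_{r} is orthogonal to w_{r}.

```python
def skew(v):
    """Skew-symmetric matrix [v]x such that [v]x u = v x u."""
    x, y, z = v
    return [[0, -z, y], [z, 0, -x], [-y, x, 0]]

def matmul(Am, Bm):
    return [[sum(Am[i][k] * Bm[k][j] for k in range(3)) for j in range(3)] for i in range(3)]

def matvec(M, v):
    return [sum(M[i][k] * v[k] for k in range(3)) for i in range(3)]

# Arbitrary homography, right epipole, and left image point (homogeneous coordinates)
H = [[1.0, 0.1, 5.0], [0.0, 1.2, -3.0], [0.0, 0.0, 1.0]]
e_r = [4.0, 2.0, 1.0]
w_l = [10.0, 20.0, 1.0]

F = matmul(skew(e_r), H)      # F = [e_r]x H
w_r = matvec(H, w_l)          # corresponding right image point
residual = sum(w_r[i] * matvec(F, w_l)[i] for i in range(3))  # w_r^T F w_l
```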
For a stereo vision rig, the projection camera matrices satisfy Eqs. (5) and (6), where R and t represent the rotation and translation between the left and right origins. O_{l} is the origin of the rig.
$$ {P}_l=\left[I\ 0\right] $$
(5)
$$ {P}_r=\left[R\ t\right]. $$
(6)
The fundamental matrix should satisfy Eq. (7), where the corresponding point w_{r} lies on the epipolar line l_{r} = Fw_{l} [21]:
$$ {w}_r^TF{w}_l=0. $$
(7)
When Eqs. (5) and (6) are expressed in normalized coordinates, substituting them into Eq. (4) yields Eq. (8):
$$ E={\left[t\right]}_{\times }\ R=R{\left[{R}^Tt\right]}_{\times }. $$
(8)
The essential matrix (E) describes the transformation between the left and right origins in normalized image coordinates. The E matrix has similar properties to the F matrix in its correspondence between \( {\widehat{w}}_l \) and \( {\widehat{w}}_r \) in normalized coordinates [21]:
$$ {\widehat{w}}_r^TE{\widehat{w}}_l=0. $$
(9)
The essential matrix is used to compute the distance to the point W(X, Y, Z) seen by both cameras. Using the essential matrix means that there are six degrees of freedom: three from the rotation and three from the translation. In our system, the rotation angle around the y-axis and the translation along the baseline are not fixed. These two parameters were selected because they change the visual field of the cameras.
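The two forms of Eq. (8) can be checked numerically for exactly this geometry: a rotation about the y-axis and a translation along the baseline. The angle θ = 0.3 rad and baseline b = 0.1 m below are assumed example values, not measured parameters.

```python
import math

def skew(v):
    x, y, z = v
    return [[0, -z, y], [z, 0, -x], [-y, x, 0]]

def matmul(Am, Bm):
    return [[sum(Am[i][k] * Bm[k][j] for k in range(3)) for j in range(3)] for i in range(3)]

def transpose(M):
    return [list(row) for row in zip(*M)]

def matvec(M, v):
    return [sum(M[i][k] * v[k] for k in range(3)) for i in range(3)]

theta, b = 0.3, 0.1                     # assumed verge angle (rad) and baseline (m)
c, s = math.cos(theta), math.sin(theta)
R = [[c, 0, s], [0, 1, 0], [-s, 0, c]]  # pure rotation about the y-axis
t = [b, 0, 0]                           # pure translation along the baseline

E1 = matmul(skew(t), R)                        # E = [t]x R
E2 = matmul(R, skew(matvec(transpose(R), t)))  # E = R [R^T t]x
```

Both expressions yield the same matrix, confirming the identity used when decomposing E.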
The calibration process used in stereo vision is the same: a checkerboard is used as the reference for points in the world coordinates, and image processing is used to find the points in the image coordinates. Calibration is first done on each camera separately to find its projection matrix, and these matrices are then used to calculate the essential matrix and thus the external geometric parameters between the cameras.
Rectification algorithm
The disparity is the difference in position between corresponding points in the left and right images. The calibration process generates the parameters used to rectify the images, where rectification is the transformation of the left and right images that makes their epipolar lines horizontal and coincident. The rectification process used in this study is based on Bouguet’s algorithm [20].
The process starts by splitting the rotation matrix R, which rotates the right image into the left image, into two rotation matrices, R_{l} and R_{r}, one per image. These two matrices rotate the left and right images by half a rotation each. This makes the two image planes coplanar, but their rows are not yet aligned with the baseline. Therefore, we find a correction matrix that rotates the epipoles to infinity and aligns the epipolar lines horizontally with the baseline.
In the stereo model, the left camera is assumed to be the origin of the system. Starting from the epipole \( {e}_{1_l} \) in the left image and connecting it to the epipole \( {e}_{1_r} \) in the right image, the translation along the baseline defines the translation vector T. This leads to Eq. (10):
$$ {e}_1=\frac{T}{\left\Vert T\right\Vert }. $$
(10)
The vector e_{2} is orthogonal to e_{1} and to the principal ray; it can be obtained from a cross product of e_{1} with the optical-axis direction and normalized, giving Eq. (11):
$$ {e}_2=\frac{{\left[-{T}_y\ {T}_x\ 0\right]}^T}{\sqrt{T_x^2+{T}_y^2}}. $$
(11)
The last vector is e_{3}, which is orthogonal to e_{1} and e_{2}, and can be calculated via a cross product:
$$ {e}_3={e}_1\times {e}_2. $$
(12)
Now, we stack these vectors into the correction matrix R_{corr}, which sends the epipoles to infinity and makes the epipolar lines parallel with the baseline by rotating the images about their projection centers.
$$ {R}_{\mathrm{corr}}=\left[\begin{array}{c}{e}_1^T\\ {}{e}_2^T\\ {}{e}_3^T\end{array}\right]. $$
(13)
R_{corr} is multiplied by the split rotation matrix to form correction rotation matrices for the left and right images.
$$ {R}_{l_{\mathrm{corr}}}={R}_{\mathrm{corr}}\ {R}_l $$
(14)
$$ {R}_{r_{\mathrm{corr}}}={R}_{\mathrm{corr}}\ {R}_r. $$
(15)
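The construction of Eqs. (10)–(13) can be sketched directly from the translation vector T. The T used below is an assumed example (mostly along x, with small mechanical offsets), not a calibrated value.

```python
import math

def rcorr_from_translation(T):
    """Build the correction rotation R_corr of Eq. (13) from the translation T."""
    norm = math.sqrt(sum(v * v for v in T))
    e1 = [v / norm for v in T]                  # Eq. (10): unit vector along the baseline
    denom = math.sqrt(T[0] ** 2 + T[1] ** 2)
    e2 = [-T[1] / denom, T[0] / denom, 0.0]     # Eq. (11): orthogonal to e1
    e3 = [e1[1] * e2[2] - e1[2] * e2[1],        # Eq. (12): e3 = e1 x e2
          e1[2] * e2[0] - e1[0] * e2[2],
          e1[0] * e2[1] - e1[1] * e2[0]]
    return [e1, e2, e3]                         # Eq. (13): rows of R_corr

# Assumed translation: mostly along x with small mechanical offsets
R_corr = rcorr_from_translation([0.1, 0.004, 0.002])
```

Applying R_corr to T maps the baseline onto the x-axis (the second and third components become zero), which is exactly what makes the epipolar lines horizontal.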
This shows that a rotation matrix and a translation vector are required to rectify an image. Both are obtained from the essential matrix, i.e., decomposing the essential matrix yields the rotation and translation matrices.
Online geometry update
This subsection integrates the above discussion to generate a relationship between the image angle and the motor encoder angle. Mapping directly from motor space to image space, i.e., using the encoder angle as the image angle, leads to errors [22].
As explained in the above section, the process is divided into two parts: an offline calibration process and an online geometry update. The offline calibration calculates the essential matrix and the internal parameters of the cameras. The essential matrix is decomposed to generate the rotation and translation matrices. The translation matrix is a pure translation from the left to right camera origins.
In theory, the rotation matrix should be a pure rotation around the y-axis. In reality, this assumption does not hold because of how the cameras are mounted on the platform and how the camera sensors are installed. The calibration result therefore returns a rotation matrix that also contains small rotations around the x- and z-axes, i.e., three angles in total. The complete rotation matrix is the product of the individual rotation matrices in XYZ order:
$$ R={R}_x\left(\psi \right)\times {R}_y\left(\theta \right)\times {R}_z\left(\phi \right). $$
(16)
The rotation matrix is solved to recover the individual angles, which are recorded as the image-space angles. The most important angle is θ_{img}, the rotation around the y-axis.
The calibration process was done 30 times with different configurations (different verge angles), and each time the encoder verge angle θ_{encoder} was recorded. The complete 30-configuration calibration set constituted one run, and 20 runs were performed. The calibration data are used to generate a linear relationship between the encoder angle and the image angle:
$$ {\theta}_{\mathrm{img}}=e+\eta \times {\theta}_{\mathrm{encoder}}, $$
(17)
where e refers to the error due to mechanical misalignment and lens distortion, and η is an estimated factor that corrects the encoder angle.
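The parameters of Eq. (17) can be estimated by ordinary least squares over the recorded (θ_encoder, θ_img) pairs. The data below are illustrative, not measured values.

```python
def fit_encoder_to_image(theta_enc, theta_img):
    """Least-squares estimate of the offset e and scale eta in Eq. (17)."""
    n = len(theta_enc)
    mx = sum(theta_enc) / n
    my = sum(theta_img) / n
    sxx = sum((x - mx) ** 2 for x in theta_enc)
    sxy = sum((x - mx) * (y - my) for x, y in zip(theta_enc, theta_img))
    eta = sxy / sxx           # slope: correction factor for the encoder angle
    e = my - eta * mx         # intercept: misalignment/distortion offset
    return e, eta

# Illustrative data: image angle = 0.5 deg offset + 1.1 x encoder angle
enc = [0.0, 2.0, 4.0, 6.0, 8.0]
img = [0.5 + 1.1 * x for x in enc]
e, eta = fit_encoder_to_image(enc, img)
```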
Disparity
After the rectification of the system, the generated left and right images are used to compute the disparity map. Correspondence is then established, following the extensive literature, for example [23]. The primary function of correspondence is to find, for each point in the left image, the matching point in the right image and then calculate the difference along the x-axis. These differences are called the disparity.
The semi-global matching (SGM) algorithm [24] is used in this study to compute the disparity map of the rectified images. SGM approximates a global smoothness constraint by aggregating pixelwise matching costs along multiple search directions, where the matching cost used in SGM is mutual information, chosen to overcome issues with lighting, different exposure times, and reflections [25]. The pixelwise method calculates the final disparity by summing the costs of the disparities along different scan-line directions, which enforces some smoothness in the disparity.
$$ E(D)=\sum \limits_p\left(C\left(p,{D}_p\right)+\sum \limits_{q\in {N}_p}{P}_1T\left[\left|{D}_p-{D}_q\right|=1\right]+\sum \limits_{q\in {N}_p}{P}_2T\left[\left|{D}_p-{D}_q\right|>1\right]\right). $$
(18)
Equation (18) represents the cost function minimized by SGM, where p and q are pixel indices in the image, C(p, D_{p}) is the intensity-based matching cost, N_{p} is the neighborhood of pixel p, and P_{1} and P_{2} are penalties on disparity changes, where P_{1} penalizes changes equal to 1 and P_{2} penalizes changes greater than 1 [26].
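To make the role of P_1 and P_2 concrete, the cost aggregation of Eq. (18) along a single scan-line direction can be sketched as a dynamic program. This is a simplified illustration that takes a raw cost volume as input; the full SGM sums several directions and uses a mutual-information cost.

```python
def aggregate_scanline(cost, P1, P2):
    """Aggregate matching costs along one direction with the P1/P2 penalties of Eq. (18).

    cost[p][d] is the raw matching cost of pixel p at disparity d.
    """
    n, ndisp = len(cost), len(cost[0])
    L = [row[:] for row in cost]
    for p in range(1, n):
        prev = L[p - 1]
        min_prev = min(prev)
        for d in range(ndisp):
            best = prev[d]                              # no disparity change
            if d > 0:
                best = min(best, prev[d - 1] + P1)      # change of exactly 1
            if d + 1 < ndisp:
                best = min(best, prev[d + 1] + P1)
            best = min(best, min_prev + P2)             # any larger jump
            L[p][d] = cost[p][d] + best - min_prev      # subtract min to keep values bounded
    return L

# Tiny two-disparity example; the final disparity is the argmin of each row of L
L = aggregate_scanline([[0, 5], [5, 0], [0, 5]], P1=1, P2=2)
```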
The disparity map is used to transform each pixel from the 2D image coordinates into the 3D world coordinates [X Y Z]^{T} relative to the camera origin. This is done using the triangulation approach in Eqs. (19)–(21), where x and y represent the coordinates of the object in the image frame relative to the principal point, b is the baseline, d is the disparity, and f represents the focal length.
$$ Z=f\ast \frac{b}{d} $$
(19)
$$ X=x\ast \frac{Z}{f} $$
(20)
$$ Y=y\ast \frac{Z}{f}. $$
(21)
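The triangulation of Eqs. (19)–(21) is a few arithmetic operations per pixel. The sketch below assumes x and y are measured in pixels relative to the principal point, f in pixels, and b in metres; the numbers are illustrative.

```python
def pixel_to_world(x, y, d, f, b):
    """Triangulate a pixel with disparity d into camera-frame coordinates (Eqs. (19)-(21))."""
    Z = f * b / d        # Eq. (19): depth from disparity
    X = x * Z / f        # Eq. (20)
    Y = y * Z / f        # Eq. (21)
    return X, Y, Z

# Assumed example: f = 800 px, b = 0.1 m, disparity 40 px at pixel offset (80, 40)
X, Y, Z = pixel_to_world(80, 40, 40, 800, 0.1)
```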
Experiment
The platform used in this work is explained in detail in our previous work [27]. The setup of the experimental system was divided into two configurations. The first configuration collected the data for the calibration process to find the actual parameters of the platform. The second configuration evaluated the new calibration algorithm.
Collecting data
This section explains the process of obtaining the data used to extract the parameters of the active stereo vision system. Exploring the parameters of the system and comparing the image space to the motor space required generating data for different platform setups, i.e., different verge angles and baselines; 30 verge-angle configurations and five baseline configurations were selected.
In each configuration, a calibration process was performed as explained in Section 2.2 to find the parameters of the system using a calibration board. The board consists of an 8 × 6 array of black and white squares, each 34.5 mm in height and width. Corner detection was applied to locate the 48 internal corners of the board. For a robust calibration, 15 images of the calibration board were taken at various positions and orientations, as recommended by Bradski and Kaehler (2008).
To accelerate and improve the collection of data, the calibration process was automated using a Baxter robot, as explained in [27]. Automating the calibration reduced the time required by a factor of three and improved the calibration result.
Figure 3 shows the data collection setup, where the platform was installed in front of Baxter at a distance of 2 m, and the calibration board was fixed on the arm of the robot. A total of 40 positions and orientations of the board were prerecorded using the Baxter teaching methods. A desktop PC was used to control Baxter, and a laptop was used to control the platform and perform the calibration process. A UDP connection was used to communicate between the PC and the laptop.
Figure 4 shows a flowchart of the calibration process. The process starts by setting the verge angle. The second step is to detect the corners and move the arm to a new position; this step is repeated until 15 sets of images have been captured with the corners successfully detected. The calibration is then run and its quality evaluated: if the projection error is less than 0.1, the calibration is considered successful and the system moves to a new configuration; otherwise, the process repeats until the requirement is met. This algorithm was repeated 20 times to generate data for the analysis. The same process was used to calibrate the baseline.
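The accept/retry logic in the flowchart can be sketched as a small control loop. Here `run_calibration` is a hypothetical stand-in for the capture-and-calibrate step (in the real system it would wrap the Zhang calibration over 15 image pairs) that returns a projection error.

```python
def calibrate_until_converged(run_calibration, threshold=0.1, max_attempts=20):
    """Repeat calibration until the projection error drops below the threshold."""
    for attempt in range(1, max_attempts + 1):
        error = run_calibration()    # hypothetical: capture 15 image pairs and calibrate
        if error < threshold:
            return attempt, error    # success: move to the next configuration
    raise RuntimeError("calibration did not converge")

# Illustrative stub: the error improves as the board poses are re-captured
errors = iter([0.45, 0.17, 0.06])
result = calibrate_until_converged(lambda: next(errors))
```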
Rectification
The calibration algorithm results in a rectified image pair in which the epipolar lines of the left and right images become collinear and parallel with the horizontal axis. To measure the performance of this rectification, a projection error measurement was used as described in [21]. The projection error is defined as the difference between the y-coordinate of a point in the left image and the y-coordinate of the corresponding point in the right image, as shown in Fig. 5 [28].
A calibration board was placed in different locations and orientations at distances between 1.5 and 2.5 m from the platform. This allowed us to obtain more data and to evaluate the calibration algorithm more accurately. As explained in Section 2.4, the geometry of the system needs to be updated when the configuration of the platform changes, and rectified images should then be generated. The rectified images are the output of the calibration algorithm, and these two images are used to evaluate the quality of the calibration. The evaluation algorithm uses the calibration board to detect the corners of the left and right images and then calculate the root mean square error (RMS), i.e., Eq. (22). The output value is in units of pixels.
$$ \mathrm{error}=\sqrt{\frac{1}{n}\sum \limits_{i=1}^n{\left({y}_{l_i}-{y}_{r_i}\right)}^2} $$
(22)
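The evaluation of Eq. (22) over the detected corner pairs can be sketched as follows, assuming `yl` and `yr` hold the y-coordinates (in pixels) of matched corners from the left and right rectified images; the values below are illustrative.

```python
import math

def rectification_rms(yl, yr):
    """RMS vertical error (in pixels) between matched corners of a rectified pair."""
    n = len(yl)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(yl, yr)) / n)

# Illustrative corner y-coordinates from a left/right image pair
err = rectification_rms([120.0, 188.5, 250.0], [120.5, 188.0, 250.0])
```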
Surface comparison
The data generated from the disparity map are used to create a 3D point cloud, relative to the system origin, that captures the physical dimensions of the scene. These data are used to evaluate the quality of the point cloud generated by the system. A spherical object was placed in front of the system, and a 3D point cloud was generated for this sphere. These data were then compared to the ground truth of the sphere, which was generated using a 3D model.
The iterative closest point (ICP) algorithm was used to translate and rotate the source point cloud onto the reference by minimizing the differences [29]; that is, ICP was used to align the two point clouds. ICP performs four steps in the alignment process, as described in the work of Rusinkiewicz and Levoy (2001) [30]:

1. Establish correspondences between the points, where the selection strategy starts by sampling points with a uniform distribution.

2. Use singular value decomposition to compute the rotation and translation between the reference and source point clouds.

3. Apply the rotation and translation to the registered point cloud.

4. Calculate the error between the corresponding points using the sum of squared differences (SSD).
The above steps were repeated until the error reached the threshold value.
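As a minimal illustration of steps 1–4, the loop can be reduced to 2-D, where the rigid fit of step 2 has a closed form instead of a full SVD. The clouds below are toy example values; this is a sketch of the loop structure, not the library implementation used in the experiments.

```python
import math

def icp_2d(src, ref, iters=10):
    """Minimal 2-D ICP loop: match, fit a rigid transform, apply it, score with SSD."""
    src = [list(p) for p in src]
    for _ in range(iters):
        # Step 1: nearest-neighbour correspondences (every point used uniformly)
        pairs = [(p, min(ref, key=lambda q: (q[0] - p[0]) ** 2 + (q[1] - p[1]) ** 2))
                 for p in src]
        n = len(pairs)
        # Step 2: closed-form 2-D rigid fit (the 2-D analogue of the SVD step)
        mpx = sum(p[0] for p, _ in pairs) / n
        mpy = sum(p[1] for p, _ in pairs) / n
        mqx = sum(q[0] for _, q in pairs) / n
        mqy = sum(q[1] for _, q in pairs) / n
        sxx = sum((p[0] - mpx) * (q[0] - mqx) + (p[1] - mpy) * (q[1] - mqy) for p, q in pairs)
        sxy = sum((p[0] - mpx) * (q[1] - mqy) - (p[1] - mpy) * (q[0] - mqx) for p, q in pairs)
        th = math.atan2(sxy, sxx)
        c, s = math.cos(th), math.sin(th)
        tx = mqx - (c * mpx - s * mpy)
        ty = mqy - (s * mpx + c * mpy)
        # Step 3: apply the transform to the registered cloud
        src = [[c * x - s * y + tx, s * x + c * y + ty] for x, y in src]
    # Step 4: SSD error between each point and its nearest reference point
    ssd = sum(min((q[0] - p[0]) ** 2 + (q[1] - p[1]) ** 2 for q in ref) for p in src)
    return src, ssd

ref = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
src = [(0.3, -0.2), (1.3, -0.2), (0.3, 0.8)]  # ref shifted by (0.3, -0.2)
aligned, ssd = icp_2d(src, ref)
```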
To evaluate the sample point cloud (S) generated by the platform, it was compared to the reference point cloud generated from a model, which we refer to as the ground truth (G). The Euclidean distance, Eq. (23), is computed between each point in the sample and the nearest point in the reference point cloud, and the differences between the sample and the ground truth are summarized as the RMS of these distances:
$$ \mathrm{RM}{\mathrm{S}}_{\mathrm{error}}=\sqrt{{\left({S}_x-{G}_x\right)}^2+{\left({S}_y-{G}_y\right)}^2+{\left({S}_z-{G}_z\right)}^2}. $$
(23)
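Evaluated against the nearest ground-truth point for each sample point and averaged, Eq. (23) can be sketched as follows; the two point clouds below are toy values for illustration.

```python
import math

def cloud_rms(sample, ground):
    """RMS of the Euclidean distance from each sample point to its nearest ground-truth point."""
    def d2(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))
    return math.sqrt(sum(min(d2(s, g) for g in ground) for s in sample) / len(sample))

# Toy clouds: one sample point sits 0.5 mm off the surface, the other matches exactly
rms = cloud_rms([(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
                [(0.0, 0.0, 0.5), (1.0, 0.0, 0.0)])
```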
In the experiment, three spheres were used, with different diameters (80, 125, and 150 mm). CAD software was used to generate the ground truth, which was then converted to a point cloud. These point clouds were set to have a subsampling between points equal to 1 mm in all directions (Fig. 6a).
The generated point cloud from the platform is shown in Fig. 6b prior to postprocessing to remove the surrounding points that do not belong to the sphere; the postprocessing was done using the Point Cloud Library [31]. The setup of the experiment is shown in Fig. 7.
The data were collected at different configurations (verge angles from − 6° to 12° and baselines from 55 to 250 mm) while the ball was placed at different positions between 1 and 2.5 m from the platform. A set of 10 samples was taken at each configuration.