Efficient cost aggregation for feature-vector-based wide-baseline stereo matching
- Xiaoming Peng^{1, 2},
- Abdesselam Bouzerdoum^{1, 3} and
- Son Lam Phung^{1}
https://doi.org/10.1186/s13640-018-0249-y
© The Author(s) 2018
- Received: 27 April 2017
- Accepted: 22 January 2018
- Published: 11 April 2018
Abstract
In stereo matching applications, local cost aggregation techniques are usually preferred over global methods due to their speed and ease of implementation. Local methods make implicit smoothness assumptions by aggregating costs within a finite window; however, cost aggregation is a time-consuming process. Furthermore, most existing local methods are based on pixel intensity values, and hence are not efficient with feature vectors used in wide-baseline stereo matching. In this paper, a new cost aggregation method is proposed, where a Per-Column Cost matrix is combined with a feature-vector-based weighting strategy to achieve both matching accuracy and computational efficiency. Here, the proposed cost aggregation method is applied with the DAISY feature descriptor for wide-baseline stereo matching; however, this method can also be applied to a fast growing number of stereo matching techniques that are based on feature descriptors. A performance comparison with several benchmark local cost aggregation approaches is presented, along with a thorough analysis of the time and storage complexity of the proposed method.
Keywords
- Stereo matching
- Cost aggregation
- Feature vector
- DAISY
1 Introduction
Estimating depth from a pair of stereo images is a long-standing problem in computer vision. Its aim is to find a dense correspondence map between a pair of stereo images to generate either a disparity map (for rectified stereo pairs), or a depth map (for known camera calibration parameters). Stereo algorithms generally involve the following four steps: (i) matching cost computation, (ii) cost aggregation, (iii) disparity computation or optimization, and (iv) disparity refinement [1]. Both local and global approaches need to perform the matching cost computation step, but they differ in the treatment of smoothness constraints. Local methods make implicit smoothness assumptions by aggregating costs within a finite window. Global approaches, by contrast, make explicit smoothness assumptions by combining the data and smoothness terms into a cost function, which is subsequently optimized using an iterative procedure. The most commonly used optimization methods for global approaches include Expectation-Maximization (EM) [2], cooperative optimization [3, 4], Graph Cuts (GC) [5], Max-Product Loopy Belief Propagation (LBP) [6], and Tree-Reweighted Message Passing (TRW) [7]. The last three methods are categorized as energy minimization for Markov Random Fields (MRFs) [8]. In practical applications, local approaches are preferred to their global counterparts due to their speed and ease of implementation.
Existing short-baseline stereo matching methods, which are mostly based on pixel intensity values, perform reasonably well and are quite fast. Di Stefano et al. proposed a local method which achieved real-time speed using Single Instruction Multiple Data (SIMD) implementation [9]. Tombari et al. compared fourteen cost aggregation techniques in terms of accuracy and computation cost [10]. They found that cost aggregation methods using adaptive weights are among the most accurate. Hirschmüller proposed a semi-global method based on mutual information [11]. Based on the gestalt principles, Yoon and Kweon developed an edge-preserving bilateral filter for stereo matching [12]. Subsequently, Mattoccia et al. proposed a symmetric adaptive weighting strategy using two independent spatial and range filters [13]. Instead of adopting an exact weighting strategy, Min et al. addressed the cost aggregation issue by introducing two approximations [14]. In another approach, Hosni et al. formulated the stereo matching problem in a cost-volume filtering manner [15]. The cost volume is a three-dimensional (3D) array that stores the costs for choosing a label (i.e. disparity value in stereo matching) at a given pixel. To maintain boundaries in the filtered output of the guidance image (the left image of a stereo pair), the filter weights are chosen to be those of the guided filter [16]. In [17], Yang developed a Minimum-Spanning-Tree-based cost aggregation method, which avoids the local optimality caused by manually specifying the size of the support window. Inspired by this work, Mei et al. introduced a segment-tree-based cost aggregation method [18]. More recently, Zhang et al. showed that the different cost aggregation methods essentially differ in the choice of similarity kernels and can be reformulated in a unified optimization framework [19].
Matching measures directly constructed using pixel intensity values, such as Sum of Absolute Differences (SAD), Sum of Squared Differences (SSD), and Normalized Cross Correlation (NCC), lack robustness to large perspective distortions. These measures are not suitable for wide-baseline stereo matching, where there are significant variations in the viewpoints. A better alternative to pixel-intensity-based cost aggregation is to employ local feature descriptors. In recent years, many new feature detectors and descriptors have been developed, including BRIEF [20], BRISK [21], FREAK [22] and ORB [23]. At the same time, several studies analyzed the performance of feature vectors on different tasks. Heinly et al. compared the performance of five descriptors (BRIEF, BRISK, ORB, SIFT [24], and SURF [25]) in feature detection and description [26]. Khan et al. used precision-recall curves and the Wilcoxon signed rank test to compare the performance of thirteen feature vectors [27]. They found that SIFT is the most accurate performer in both image recognition and feature matching applications. The increasingly abundant feature descriptors provide alternatives to pixel-intensity-based similarity measures in dense matching scenarios, as shown by Tola et al. [28] and Liu et al. [29]. Tola et al. showed that the DAISY feature descriptor, SIFT and SURF all outperform the NCC and pixel difference in wide-baseline stereo matching [28]. To accelerate the cost aggregation step using the BRIEF feature descriptor, Zhang et al. incorporated binary masks into the cost aggregation term [30]. The binary masks are constructed using the Sum of Absolute Differences between two pixels in the CIELAB color space. However, their binary masks can only be paired with binary feature vectors like BRIEF. Thus, this method is not applicable to general real-valued feature vectors such as SIFT and DAISY.
Stereo matching using feature vectors has two difficulties. First, feature-vector-based matching costs are much more computationally intensive than pixel-intensity-based ones. There are several pixel-intensity-based strategies for reducing cost aggregation [13–16]. However, they do not work well when directly applied to feature vectors. For example, the computational load of the similarity kernels in the bilateral filtering methods [12, 13] and the tree-based methods [17, 18] is quite low when involving only pixel-wise intensity differences. However, their computational load is very high if feature-vector-based similarity measures are used. Second, storing feature vectors for each image pixel requires a large amount of memory. To facilitate the repetitive testing of different pixel correspondences, it is desirable to store the per-pixel feature vectors, as done in SIFT flow [29]. As an example of storage cost, to store a 200-element DAISY feature vector in double-precision floating-point format for each pixel of a pair of high definition images of size 1920×1080 pixels, we need approximately (2×1920×1080×8×200)/1024^{3}≈6.18 GB of storage. Obviously, this is not practical on memory-limited systems.
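The storage estimate above can be reproduced with a few lines of arithmetic. The following is a minimal sketch; the function name and its parameters are illustrative and not part of the original work.

```python
def descriptor_storage_gib(width, height, descriptor_len,
                           bytes_per_value=8, num_images=2):
    """Memory in GiB needed to keep one dense descriptor per pixel.

    bytes_per_value=8 corresponds to double precision, and num_images=2
    to the left and right images of a stereo pair.
    """
    total_bytes = num_images * width * height * descriptor_len * bytes_per_value
    return total_bytes / 1024 ** 3


# 200-element DAISY vectors for a pair of 1920x1080 images, as in the text.
full_hd_double = descriptor_storage_gib(1920, 1080, 200)
# Same setting in single precision, shown only for comparison.
full_hd_single = descriptor_storage_gib(1920, 1080, 200, bytes_per_value=4)

print(f"double precision: {full_hd_double:.2f} GiB")
print(f"single precision: {full_hd_single:.2f} GiB")
```

Halving the precision halves the footprint, but the requirement still grows linearly with both the image area and the descriptor length.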
Very recently, deep learning has been used to compare image pairs for stereo matching [31–34]. In this approach, a convolutional neural network (CNN) is deployed to compute the matching cost between a pair of image patches from the left and right images. Zagoruyko and Komodakis extracted feature descriptors from image patches at the branches of their Siamese network [31]. Their feature descriptor can be used as an alternative to hand-crafted feature descriptors, such as SIFT and DAISY. Žbontar and LeCun developed a convolutional neural network that directly outputs the matching cost between a pair of image patches [32]. The matching cost is then combined with a cross-based cost aggregation method. Chen et al. proposed a multi-scale deep embedding model to extract features from a pair of image patches [33]. The inner product of the features is large if the pair of image patches matches well, and vice versa. Luo et al. adopted a four-layer Siamese architecture for their CNN [34]. However, they observed that simply predicting the most likely configuration for every pixel using only the CNN output is not competitive with other modern stereo algorithms. To achieve a better performance, they combined their CNN with semi-global block matching and sophisticated post-processing. All these deep learning methods rely on the availability of a large pool of annotated pairs of image patches to learn a mapping between them. To address this issue, Mayer et al. established a synthetic dataset containing 35,000 stereo image pairs with ground truth disparity, optical flow, and scene flow [35].
In this paper, we propose a local cost aggregation method that operates on feature vectors. The proposed method has two major contributions. First, we develop a feature-vector-based weighting strategy, which can be computed much more efficiently than conventional bilateral filtering, but with only a slight reduction in accuracy. Second, we propose a new concept called the Per-Column Cost (PCC) matrix to share the aggregated costs across different disparities during cost aggregation. This is in sharp contrast to other cost aggregation strategies, such as Integral Images [36] and Box-Filtering [9], which are limited to a single disparity map. Although we use the DAISY feature to instantiate the proposed method, other feature vectors can also be applied.
The rest of the paper is organized as follows. In Section 2, the cost aggregation problem is formulated under the filtering framework, followed by the delineation of the proposed method. In Section 3, experimental results are presented, followed by a comprehensive analysis and discussion of the results. In Section 4, conclusions and future directions are given.
2 Cost aggregation method using feature vectors
In this section, we first cast cost aggregation as a filtering problem. Then, we analyze why existing pixel-intensity-based cost aggregation methods are not suitable for feature vectors. Finally, we describe the proposed method in detail.
2.1 Cost aggregation using a filtering framework
The proposed method works on a pair of rectified stereo images, where the search of corresponding points on the stereo image pair is constrained on the horizontal image scanlines. Consider a pair of left image I_{ l } and right image I_{ r }. The aim is to find a pixel q=(x+d,y) in I_{ r } which corresponds to a pixel p=(x,y) in I_{ l }, where d is the disparity between the pixel pair, \(d \in \mathcal {D} =\left [d_{\text {min}},d_{\text {max}}\right ]\). Let I_{ l }(x,y) denote the intensity value (or the vector of color values) at location (x,y) in image I_{ l }. The search for pixel q is carried out along a horizontal scan line. For a chosen feature vector f of length L, the dissimilarity between pixels p and q is computed by comparing f(I_{ l };p) and f(I_{ r };q), which denote the feature vectors extracted at pixels p and q in I_{ l } and I_{ r }, respectively. Here, we do not restrict the type of feature vector f, so long as we can derive a scalar dissimilarity measure between f(I_{ l };p) and f(I_{ r };q). For simplicity and without loss of generality, we assume that images I_{ l } and I_{ r } have identical sizes (H rows, W columns). Furthermore, the search range \(\mathcal {D}\) has M discrete integer values, and is the same for all the pixels p in I_{ l }. The dissimilarity between two feature vectors f_{1} and f_{2} is denoted as \(c\left (\mathbf {f}_{1},\mathbf {f}_{2}\right)\).
where 1≤x≤W, 1≤y≤H, and d_{min}≤d≤d_{max}. The parameters τ_{1} and τ_{2} are two thresholds, and α balances the color and gradient terms.
where u is the mean vector, Σ is the covariance matrix of the pixels in the support window, E denotes the identity matrix, and ε is a smoothness parameter. Note that for both filters, the filter kernel varies according to the coordinates of the center pixel (x,y). In the tree-based cost aggregation methods [17, 18], a connected, undirected graph G=(V,E) is established for I_{ l }, where the set of vertices V is the image pixels and the edges E connect neighboring pixels. In all these kernels, the weight between two pixels is defined as the difference between their intensity values.
One important reason that the state-of-the-art cost aggregation methods run very fast is that their similarity kernels can be computed very efficiently, mostly with simple arithmetic operations. However, this is not the case when feature vectors rather than pixel intensity values are used. For example, if the pixel difference term ∥I_{ l }(x,y)−I_{ l }(x+i,y+j)∥_{2} in Eq. (3) is replaced with the feature-vector-based dissimilarity term c(f(I_{ l };x,y),f(I_{ l };x+i,y+j)), we would incur a sharp increase in the computation.
2.2 Feature-vector-based cost aggregation
The x-th column of \(F_{l}^{y}\) contains the feature vector computed at pixel (x,y) in the left image I_{ l }. The x-th column of \(F_{r}^{y}\) contains the feature vector computed at pixel (x,y) in the right image I_{ r }. To compute c(f(I_{ l };x,y),f(I_{ r };x+d,y)), the feature vectors f(I_{ l };x,y) and f(I_{ r };x+d,y) are retrieved directly from \(F_{l}^{y}\) and \(F_{r}^{y}\) rather than being re-computed. Next, we present in detail the proposed method for cost aggregation.
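The row-wise lookup described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the per-row matrices are represented as Python lists with one feature vector per column, indices are 0-based (the paper uses 1-based coordinates), and the Euclidean distance is only a placeholder for the dissimilarity c, which the paper leaves open.

```python
import math


def euclidean(f1, f2):
    """Placeholder dissimilarity c(f1, f2); any scalar measure can be used."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(f1, f2)))


def row_costs(F_l_y, F_r_y, d_min, d_max, c=euclidean):
    """Return costs[x][k] = c(F_l_y[x], F_r_y[x + d]) with d = d_min + k.

    F_l_y and F_r_y hold the precomputed feature vectors of row y, so no
    descriptor is recomputed here. Candidates whose match x + d falls
    outside the image are marked with None.
    """
    W = len(F_l_y)
    costs = []
    for x in range(W):
        row = []
        for d in range(d_min, d_max + 1):
            xr = x + d
            row.append(c(F_l_y[x], F_r_y[xr]) if 0 <= xr < W else None)
        costs.append(row)
    return costs


# Toy row: the right row equals the left row shifted by two columns.
F_l = [[float(x), 1.0] for x in range(6)]
F_r = [[float(x - 2), 1.0] for x in range(6)]
costs = row_costs(F_l, F_r, d_min=0, d_max=3)

# Best disparity for column 1, ignoring out-of-image candidates.
valid = [(v, k) for k, v in enumerate(costs[1]) if v is not None]
best_d = min(valid)[1]
print(best_d)
```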
2.2.1 Feature-vector-based weighting strategy
where −w≤i,j≤w, and parameters σ_{1} and σ_{2} control the spatial and color similarity. From Eq. (7), if i=0, G^{bf}(i,j)=1. In other words, all the pixels in the middle column of the support window share the same weights, and act as “center” or “reference” pixels. As in the case of K^{bf}(i,j), the weights of G^{bf}(i,j) are normalized so that \(\sum _{i,j}G^{\text {bf}}(i,j)=1\).
The rationale of the proposed bilateral filter can be explained as follows. Because the feature vector f(I_{ l };x,y) describes a local area around a pixel (x,y) instead of just the pixel itself, the cost c(f(I_{ l };x,y),f(I_{ l };x+i,y+j)) is less spatially sensitive than ∥I_{ l }(x,y)−I_{ l }(x+i,y+j)∥_{2}. This property enables us to use multiple center pixels for the proposed bilateral filter instead of a single center pixel as for the conventional bilateral filter. Furthermore, Eq. (7) is computed along the horizontal scan lines. If a single center is used, the term c(f(I_{ l };x,y),f(I_{ l };x+i,y+j)) needs to be computed across the scan lines, leading to a significant increase in the storage requirement. As confirmed later in Section 3.4, the proposed multi-center bilateral filter operates more efficiently while achieving a similar stereo matching accuracy compared with the single-center case.
2.2.2 Per-column cost matrix
where the subscript y indicates that it is constructed for the y-th row.
An important property of Γ_{ y } is that it is shared across different disparities, as opposed to Integral Images and Box-Filtering, which are limited to a single disparity level.
Next, we incorporate the weighting strategy described in Section 2.2.1. From Eq. (8), each element of Γ_{ y } accumulates the unweighted costs from one column of the support window. However, the weights in Eq. (7) are computed pixel-wise. To address this incompatibility, we compute the average of the weights within one column of the support window. This averaged weight is used as a common weight for the accumulated costs within one column of the support window.
where the common weight for column s is denoted as \(\bar {K}_{s}=\frac {1}{2w+1}\sum _{t=1}^{2w+1}K_{s,t}\). Note that \(\sum _{s=1}^{2w+1}(2w+1)\bar {K}_{s}=\sum _{s=1}^{2w+1}\sum _{t=1}^{2w+1}K_{s,t}=1\).
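The column-averaged weight and the identity noted above can be checked numerically. The sketch below uses a hypothetical 3×3 support window (w=1) with made-up raw weights; only the averaging step and the normalization identity are taken from the text.

```python
def column_averaged_weights(K):
    """Per-column mean of the window weights.

    K[s][t] is the normalized weight of the pixel in column s and row t of
    the (2w+1) x (2w+1) support window. The result K_bar[s] is the common
    weight shared by all accumulated costs of column s.
    """
    n = len(K)  # n = 2w + 1
    return [sum(K[s]) / n for s in range(n)]


# Hypothetical raw weights for a 3x3 window, stored column by column.
raw = [[1.0, 2.0, 1.0],
       [4.0, 4.0, 4.0],
       [2.0, 1.0, 3.0]]
total = sum(sum(col) for col in raw)
K = [[v / total for v in col] for col in raw]  # weights now sum to one

K_bar = column_averaged_weights(K)
n = len(K)
# (2w+1) * K_bar[s] is the column sum, so the total recovers sum(K) = 1.
check = sum(n * kb for kb in K_bar)
print(K_bar, check)
```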
An estimated disparity d is considered reliable if its aggregated cost a(d) is smaller than (2w+1)^{2} τ, where τ is a predefined threshold. Otherwise, we regard the pixel as occluded. A large aggregated cost indicates the feature vectors are not well matched, usually because the local regions lack distinctive visual appearance.
2.2.3 Computation complexity
Before discussing the algorithm complexity, we first distinguish between three types of operations involved in our method: (a) feature vector computation, (b) feature vector comparison, and (c) basic floating-point arithmetic operation. Operation (a) is the computation of a feature vector at a given pixel. Operation (b) is the computation of the dissimilarity between two feature vectors. Operation (c) obviously is much less computationally-intensive than Operations (a) and (b). For this reason, we use O_{ a }(1), O_{ b }(1) and O_{ c }(1) to denote the time complexity for each of these three operations, respectively.
The computation of the feature vectors for all pixels in the left and right image has a time complexity of O_{ a }(2×W×H). To construct the cost volume array C, the term c(f(I_{ l };x,y),f(I_{ r };x+d,y)) needs to be computed for each location (x,y) in the left image and each disparity candidate d in \(\mathcal {D}\), which leads to a time complexity of O_{ b }(W×H×M). Similarly, the construction of array P needs a time complexity of O_{ b }(W×H×[2w+1]).
The initial construction of matrix Γ has a time complexity of \(O_{c}\left ((2w+1)^{2} \times W \times M\right)\). Note that W×M is the number of valid entries in Γ. Then, updating each valid entry requires only three single floating-point arithmetic operations, see Eq. (8). Thus, we have a time complexity of O_{ c }(3×W×(H−2w−1)×M)≈O_{ c }(3×W×H×M) for this purpose. Similarly, the construction and update of matrix Φ require a time complexity of \(O_{c} \left (\frac {(2w+1)^{2}}{2} \times W\right)\) and O_{ c }(3/2×W×H×(2w+1)), respectively. Note that the 1/2 factor is due to the symmetry of Φ.
Once Φ and Γ are computed, for each pixel (x,y) in the left image, the evaluation of a needs 3×(2w+1)×M floating-point arithmetic operations. The evaluation of w requires an extra (2w+2) floating-point arithmetic operations, which is negligible compared with that of evaluating a. Therefore, for all the pixels in the left image, the time complexity with respect to this step is O_{ c }(3×(2w+1)×W×H×M).
Now we analyze the storage required for implementing the algorithm. The cost volume array C and the array P require storage of W×H×M and W×H×(2w+1) floating-point numbers, respectively. They account for the largest share of storage requirement. However, if storage is limited, this requirement can be significantly reduced. In fact, only the top (2w+1) rows participate in the initial construction of Γ, and we only need a small portion of C corresponding to these (2w+1) rows, which consists of (2w+1)×W×M floating-point numbers. Afterwards, Γ is iteratively updated by “removing” the oldest contributing “row” of C and adding a new one, which means we only need to keep two “rows” of C, or 2×W×M floating-point numbers. Summarizing these two cases, it would be sufficient to use (2w+1)×W×M floating-point numbers to dynamically keep those “rows” of C that are needed for the construction or update of Γ. Similarly, (2w+1)^{2}×W floating-point numbers are required for the construction or update of P.
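The "remove the oldest row, add a new one" update described above can be sketched as a sliding column sum. This is a simplified illustration, not the authors' data layout: the accumulated costs are indexed here by (column, disparity index), whereas the paper stores Γ as a matrix with W×M valid entries, and the toy cost values are made up.

```python
def init_gamma(C_rows):
    """Accumulate the costs of the first (2w+1) rows, column by column.

    C_rows[j][x][k] is the matching cost at row j, column x, and the k-th
    disparity candidate.
    """
    W, M = len(C_rows[0]), len(C_rows[0][0])
    return [[sum(row[x][k] for row in C_rows) for k in range(M)]
            for x in range(W)]


def slide_gamma(gamma, oldest_row, newest_row):
    """Move the window down by one row using only two rows of C."""
    for x, col in enumerate(gamma):
        for k in range(len(col)):
            col[k] += newest_row[x][k] - oldest_row[x][k]
    return gamma


# Toy cost volume with integer-valued floats so that sums are exact.
w, H, W, M = 1, 6, 4, 3
C = [[[float((7 * y + 3 * x + 5 * k) % 11) for k in range(M)]
      for x in range(W)] for y in range(H)]

gamma_slid = init_gamma(C[0:2 * w + 1])
for top in range(1, H - 2 * w):
    gamma_slid = slide_gamma(gamma_slid, C[top - 1], C[top + 2 * w])

# Direct recomputation for the last window position, for comparison.
last_top = H - 2 * w - 1
gamma_direct = init_gamma(C[last_top:last_top + 2 * w + 1])
print(gamma_slid == gamma_direct)
```

After initialization, each step touches only the row leaving the window and the row entering it, which is why a small rolling buffer of rows of C is sufficient.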
Time and storage complexity of the proposed feature-vector-based cost aggregation algorithm
Time complexity | |
---|---|
Feature vector computation | O_{ a }(2×W×H) |
Feature vector comparison | O_{ b }(W×H×M) for C and O_{ b }(W×H×(2w+1)) for P |
Construction and update of Γ | \(O_{c}\left ((2w+1)^{2} \times W \times M\right)\) and O_{ c }(3×W×H×M), respectively |
Construction and update of Φ | \(O_{c} \left (\frac {(2w+1)^{2}}{2} \times W\right)\) and O_{ c }(3/2×W×H×(2w+1)), respectively |
Cost aggregation for all pixels | O_{ c }(3×(2w+1)×W×H×M) |
Storage complexity | Unit: Number of floating-point numbers |
The cost volume array C | Maximum: W×H×M; minimum: (2w+1)×W×M |
Array P | Maximum: W×H×(2w+1); minimum: (2w+1)^{2}×W |
Matrices F_{ l } and F_{ r } | L×W entries, depending on the type of feature vector |
Matrices Γ and Φ | W ^{2} |
Vector a | M |
Vector w | 2w+1 |
3 Results and discussion
In this section, we first introduce the test data in Section 3.1 and the DAISY descriptor in Section 3.2. Then, we explain how to select the parameters of the proposed method in Section 3.3. Next, we compare the proposed feature-vector-based weighting strategy with several other weighting strategies in Section 3.4. Finally, we compare the performance of the proposed cost aggregation method with two benchmark methods on two datasets in Section 3.5.
The proposed algorithm was implemented using C++. The experiments were done on a desktop computer equipped with an Intel Core i7-4770@3.40 GHz CPU, 8 GB memory, and 64-bit Windows 7 Enterprise operating system.
3.1 Wide-baseline stereo image data
The experiments in this work used two groups of wide baseline stereo image data: i) the Fountain and HerzJesu dataset; and ii) the 2014 Middlebury Stereo dataset.
3.1.1 The fountain and HerzJesu dataset
The first group of test data is from the public dataset released by Strecha et al. [37]. Specifically, we used two data sets: the “Fountain” data set and the “HerzJesu” data set. Both data sets contain gray-scale images of size 768×512 pixels, along with their ground-truth depth maps and occlusion maps. The camera calibration parameters associated with each gray-scale image are available, allowing these images to be rectified. The rectified images are of size 768×512 pixels.
For the “Fountain” data set, eleven wide-baseline stereo images were used in our experiments. One stereo image was used for the parameter selection experiment, presented in Section 3.3. The other ten stereo images were divided into two sub-sets, denoted as Fountain-A and Fountain-B. Each sub-set contained five consecutive images. For each sub-set, one image was considered as the left image while the other four images were considered as the right images. This was repeated five times, giving twenty stereo pairs for each sub-set. For the “HerzJesu” dataset, five images were selected to form another sub-set. In all, the first group of test data contained 60 stereo pairs. For brevity, we call this set “Fountain and HerzJesu Dataset” hereafter.
3.1.2 The 2014 Middlebury stereo dataset
The second group of test data is from the 2014 Middlebury Stereo Dataset [38]. The stereo image pairs in this dataset were generated using a structured lighting system, and were meant to present new challenges for the next generation of stereo algorithms. Of the 33 stereo image pairs in the 2014 Middlebury Stereo Dataset, only 23 of them have accompanying ground-truth disparity maps. Therefore, we used these 23 stereo pairs in quarter resolution to form the second group of test data: 1) adirondack, 2) jadeplant, 3) motorcycle, 4) piano, 5) pipes, 6) playroom, 7) playtable, 8) recycle, 9) shelves, 10) vintage, 11) backpack, 12) bicycle1, 13) cable, 14) classroom1, 15) couch, 16) flowers, 17) mask, 18) shopvac, 19) sticks, 20) storage, 21) sword1, 22) sword2, and 23) umbrella.
Because this group has no accompanying occlusion maps, the performance of a given method is measured by the overall disparity estimation accuracy, which is defined as the fraction of pixels with correctly estimated disparity values. If the difference between the estimated disparity and ground-truth disparity of a pixel is less than two pixels, the disparity is considered as correctly estimated.
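The accuracy measure defined above can be sketched as follows. Skipping pixels whose ground truth is missing (represented here by None) is an assumption of this sketch; the text does not say how such pixels are handled.

```python
def disparity_accuracy(estimated, ground_truth, tolerance=2.0):
    """Fraction of pixels whose absolute disparity error is below tolerance.

    Both inputs are lists of rows. Pixels with ground truth None are
    ignored, which is an assumption of this sketch.
    """
    correct = 0
    counted = 0
    for est_row, gt_row in zip(estimated, ground_truth):
        for est, gt in zip(est_row, gt_row):
            if gt is None:
                continue
            counted += 1
            if abs(est - gt) < tolerance:
                correct += 1
    return correct / counted if counted else 0.0


# Toy 2x2 example: errors are 0.5, 2.5, 1.9, and 2.0 pixels.
est = [[10.0, 12.5], [7.0, 3.0]]
gt = [[10.5, 15.0], [8.9, 5.0]]
acc = disparity_accuracy(est, gt)
print(acc)
```

An error of exactly two pixels is counted as incorrect, because the criterion requires the difference to be less than two pixels.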
3.2 Implementation of the DAISY feature vector
In this work, we selected the DAISY feature descriptor to implement and test the proposed method. The DAISY descriptor gets its name from its flower-like shape. The center of the flower is located at the center of an image patch. There are Q concentric rings surrounding the flower center, each ring containing T evenly distributed circles. These Q×T circles form the flower petals. The interested reader is referred to Fig. 6 of [28] for a visual appearance of the DAISY descriptor. The flower center and petals are each described by a histogram of length H, which is the convolved orientation map computed at the flower center or a petal. Thus, a DAISY descriptor contains H×(Q×T+1) elements. Our experiments used the default parameters published in [28]: Q=3, T=8, and H=8, for a feature vector of 200 elements.
The DAISY feature descriptor has been shown to outperform SIFT and SURF in wide-baseline stereo matching [28]. In addition, DAISY is more computationally efficient than SIFT because it reuses the descriptor computation of other pixels. A disadvantage of the DAISY descriptor is that it is not scale- and rotation-invariant. However, since we use rectified images as input, the scale and rotation disparities between the stereo image pair are mostly compensated during the image rectification step. A C++ implementation of the DAISY descriptor is publicly available from http://cvlab.epfl.ch/software/daisy.
where S is the total number of non-occluded histograms, and \(\mathbf {f}_{1}^{k}\) and \(\mathbf {f}_{2}^{k}\) are the k-th normalized histograms of f_{1} and f_{2}, respectively [28]. Of the (Q×T+1)=25 histograms of a DAISY descriptor, some may be occluded because their corresponding petals lie outside the image plane. Hence, only the non-occluded histograms are used for matching. Each of the non-occluded histograms is normalized to unity norm. The dissimilarity measure c(f_{1},f_{2}) ranges between 0 (perfect match) and 2 (complete non-match).
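One plausible reading of this measure is the mean Euclidean distance between corresponding unit-normalized histograms, taken over the non-occluded ones, as in the DAISY formulation [28]. The sketch below implements that reading. The Euclidean form, the function names, and the visibility mask are assumptions of this sketch, since the defining equation is not reproduced here.

```python
import math


def normalize(hist):
    """Scale a histogram to unit Euclidean norm (all-zero input is kept)."""
    norm = math.sqrt(sum(v * v for v in hist))
    return [v / norm for v in hist] if norm > 0 else list(hist)


def daisy_dissimilarity(f1, f2, visible):
    """Mean distance between matching histograms of two descriptors.

    f1 and f2 are lists of histograms (25 with the default parameters).
    visible[k] is False when the k-th histogram is occluded, that is, its
    petal lies outside the image in either descriptor.
    """
    dists = []
    for h1, h2, ok in zip(f1, f2, visible):
        if not ok:
            continue
        a, b = normalize(h1), normalize(h2)
        dists.append(math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b))))
    if not dists:
        raise ValueError("no non-occluded histograms to compare")
    return sum(dists) / len(dists)


# Two toy descriptors with three histograms each; the last one is occluded.
f_a = [[1.0, 0.0], [2.0, 2.0], [5.0, 1.0]]
f_b = [[3.0, 0.0], [1.0, 1.0], [0.0, 9.0]]
same = daisy_dissimilarity(f_a, f_b, [True, True, False])
# Orthogonal histograms in the first slot only.
orth = daisy_dissimilarity([[1.0, 0.0]], [[0.0, 1.0]], [True])
print(same, orth)
```

Because every histogram has unit norm, each per-histogram distance under this reading cannot exceed 2, which agrees with the stated range.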
3.3 Parameter selection for the proposed method
3.4 Analysis of weighting strategies
In this simple experiment, we analyzed the proposed feature-vector-based weighting strategy and three other weighting strategies. The experiment used the two stereo pairs presented in Section 3.3.
We first compared the proposed method with two weighting strategies: i) the conventional bilateral filter with intensity difference [12]; ii) the bilateral filter with a single center. The first weighting strategy is represented by Eq. (3). The second weighting strategy is an extension of Eq. (3), by replacing the term ∥I_{ l }(x,y)−I_{ l } (x+i,y+j)∥_{2} with the feature-vector-based dissimilarity term c(f(I_{ l };x,y),f(I_{ l };x+i,y+j)).
For fair comparison between the proposed method and the first two strategies, the DAISY feature vector was used to compute the initial cost volume in Eq. (1), instead of the truncated absolute differences. This way, the performance differences in cost aggregation were solely determined by the different weighting strategies.
Next, we analyzed another weighting strategy, which is based on the guided-image filter [15]. The cost volume was computed in two ways. One way was to use a combination of the truncated absolute difference of the color and the gradient, as in [15]. The other way was to use the DAISY feature vector. On the two stereo pairs, this strategy did not produce good disparity estimates, even using both ways of computing the cost volumes. This result suggests that the weighting strategy based on the guided-image filter may not be suitable for wide-baseline stereo matching.
3.5 Comparison with other methods
In this section, we compare, using the two datasets described in Section 3.1, the proposed cost aggregation method with two benchmark methods: Min et al.’s method [14], and the census transform [39, 40].
Min et al. developed an approximate strategy to optimize cost aggregation [14]. This strategy, originally based on the truncated absolute difference (TAD) matching cost, consists of two parts: disparity candidate selection and joint histogram-based aggregation. In our implementation, the strategy was extended to feature vectors, by replacing the TAD matching costs with the dissimilarities between feature vectors. To enable a fair comparison, the DAISY descriptors were stored in the memory for the disparity candidate selection.
The census transform is one of the most popular techniques to compute matching costs for stereo vision. This method creates an encoded bit string for the pixels in a window. If the intensity of a pixel is lower than that of the center pixel of the window, the corresponding bit is set to one; otherwise, it is set to zero. This way, the census transform describes the spatial structure in the window. A census-transformed image pair is matched by computing the Hamming distance between the bit strings. Our C++ implementation of the census transform was adapted from Banks and Corke’s source code that accompanies their publication [41]. In the experiments, we used a census window size of 31×31 pixels, which is compatible with the computation of the DAISY feature vector.
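The census encoding and its Hamming-distance matching cost can be sketched as follows. This is a generic illustration and not the adapted Banks and Corke code; the bit ordering (row-major, center skipped) and the function names are choices made for this sketch, and the caller is assumed to keep the window inside the image.

```python
def census_bits(image, cx, cy, radius):
    """Census bit string of the window centered at (cx, cy), as an integer.

    A bit is one when the neighbor's intensity is lower than the center
    pixel's intensity, and zero otherwise. The center itself is skipped.
    """
    center = image[cy][cx]
    bits = 0
    for y in range(cy - radius, cy + radius + 1):
        for x in range(cx - radius, cx + radius + 1):
            if x == cx and y == cy:
                continue
            bits = (bits << 1) | (1 if image[y][x] < center else 0)
    return bits


def hamming(a, b):
    """Number of differing bits between two census strings."""
    return bin(a ^ b).count("1")


# Toy 3x3 patches (radius 1); they differ only in the top-middle pixel.
left = [[9, 5, 7],
        [4, 6, 8],
        [1, 6, 3]]
right = [[9, 7, 7],
         [4, 6, 8],
         [1, 6, 3]]

code_l = census_bits(left, 1, 1, 1)
code_r = census_bits(right, 1, 1, 1)
cost = hamming(code_l, code_r)
print(bin(code_l), bin(code_r), cost)
```

A 31×31 window corresponds to radius 15 and yields a 960-bit string, which Python's arbitrary-precision integers hold without special handling.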
Both benchmark methods rely on the left-right crosscheck to detect pixels with unreliable disparities. That is, the disparity maps for both the left and right images are computed. First, for a pixel p in the left image, its counterpart q in the right image is found. Then, for pixel q in the right image, its counterpart p^{′} in the left image is found. Finally, the disparity value for pixel p is considered as correctly estimated if ∥p−p^{′}∥≤ε, where ε is a small threshold. For a fair comparison, we also applied the left-right crosscheck with the proposed method. An evaluation of different values for ε indicated that ε=2 gave a good trade-off between the precision and recall rates. Therefore, in the following experiments we selected ε=2.
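The crosscheck can be sketched as below for integer disparity maps. The sign convention is an assumption of this sketch: the left map stores d such that q=(x+d,y), following Section 2.1, and the right map stores the signed offset that leads from q back to p′ in the left image.

```python
def left_right_crosscheck(disp_left, disp_right, eps=2):
    """Mark left-image pixels whose disparity survives the crosscheck.

    disp_left[y][x] = d maps p = (x, y) to q = (x + d, y) in the right
    image. disp_right[y][xq] = d' maps q back to p' = (xq + d', y).
    A pixel is reliable when |x - x_p'| <= eps. Matches that fall outside
    the image are treated as unreliable.
    """
    H, W = len(disp_left), len(disp_left[0])
    reliable = [[False] * W for _ in range(H)]
    for y in range(H):
        for x in range(W):
            xq = x + disp_left[y][x]
            if not 0 <= xq < W:
                continue
            xp = xq + disp_right[y][xq]
            reliable[y][x] = abs(x - xp) <= eps
    return reliable


# Toy single-row example with six columns.
disp_left = [[2, 2, 2, 2, 0, 0]]
disp_right = [[0, 0, -2, -2, -5, -2]]
mask = left_right_crosscheck(disp_left, disp_right, eps=2)
print(mask[0])
```

In the toy row, columns 2 and 4 both map to right column 4, whose return offset lands far from either of them, so both are rejected.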
3.5.1 Comparison on the fountain and HerzJesu dataset
Performance comparison of three methods on the fountain and HerzJesu dataset
3.5.2 Comparison on the 2014 Middlebury stereo dataset
Note that the 2014 Middlebury Stereo Dataset contains less texture compared with the Fountain and HerzJesu Dataset. This result indicates that the proposed method and the census transform are more stable for textureless scenes.
4 Conclusion
In this paper, a feature-vector-based cost aggregation algorithm was proposed for wide-baseline stereo matching, and evaluated using the DAISY feature vector. The proposed algorithm improved the efficiency of cost aggregation by combining a Per-Column-Cost matrix and a feature-vector-based weighting strategy. The paper also presented a detailed analysis of both time and storage complexity of the proposed method. The new method was extensively tested and compared with two benchmark methods on two wide-baseline datasets. With growing research in feature detectors and visual descriptors, it can be envisaged that the proposed method will be attractive for stereo matching applications where feature vectors are used.
Among several possibilities, one direction for future work is to further accelerate the proposed method. For example, once the Per-Column-Cost matrix Γ and the Per-Column-Weight matrix Φ are computed, the disparity values for the pixels in a row can be computed in parallel. Another direction is to combine the proposed method with feature vectors that are found via deep learning [31–34].
Declarations
Acknowledgements
We gratefully thank the anonymous reviewers and Associate Editor for the constructive and detailed comments that helped improve the paper.
Funding
This research was partly supported by two grants from University of Electronic Science and Technology of China (Grant No. LJT115010701037 and Grant No. Y02002010701026), and a grant from the Australian Research Council.
Availability of data and materials
The Fountain and HerzJesu Dataset is available from http://cvlab.epfl.ch/software/daisy. The 2014 Middlebury Stereo Dataset is available from http://vision.middlebury.edu/stereo/data/2014.
Authors’ contributions
The original idea of the research was proposed by XP, but was largely inspired by his discussions with AB. SLP contributed to the experiment analysis and paper writing. All three authors worked closely during the preparation and revision of the manuscript. All authors read and approved the final manuscript.
Competing interests
The authors declare that they have no competing interests.
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
References
- D Scharstein, R Szeliski, A taxonomy and evaluation of dense two-frame stereo correspondence algorithms. Int. J. Comput. Vis. 47(1-3), 7–42 (2002).
- C Strecha, R Fransens, L Van Gool, in Proc. Computer Vision and Pattern Recognition (CVPR). Combined depth and outlier estimation in multi-view stereo, (2006), pp. 2394–2401.
- X Huang, in Proc. 26th DAGM Symposium. Cooperative optimization for energy minimization in computer vision: a case study of stereo matching, (2004), pp. 302–309.
- Z Wang, Z Zheng, in Proc. Computer Vision and Pattern Recognition (CVPR). A region based stereo matching algorithm using cooperative optimization, (2008), pp. 1–8.
- Y Boykov, O Veksler, R Zabih, Fast approximate energy minimization via graph cuts. IEEE Trans. Pattern Anal. Mach. Intell. 23(11), 1222–1239 (2001).
- JS Yedidia, WT Freeman, Y Weiss, in Advances in Neural Information Processing Systems (NIPS). Generalized belief propagation, (2000), pp. 689–695.
- MJ Wainwright, TS Jaakkola, AS Willsky, Map estimation via agreement on trees: message-passing and linear programming. IEEE Trans. Inf. Theory. 51(11), 3697–3717 (2005).
- R Szeliski, R Zabih, D Scharstein, O Veksler, V Kolmogorov, A Agarwala, M Tappen, C Rother, A comparative study of energy minimization methods for Markov random fields with smoothness-based priors. IEEE Trans. Pattern Anal. Mach. Intell. 30(6), 1068–1080 (2008).
- L Di Stefano, M Marchionni, S Mattoccia, A fast area-based stereo matching algorithm. Image Vis. Comput. 22(12), 983–1005 (2004).
- F Tombari, S Mattoccia, L Di Stefano, E Addimanda, in Proc. Computer Vision and Pattern Recognition (CVPR). Classification and evaluation of cost aggregation methods for stereo correspondence, (2008), pp. 1–8.
- H Hirschmüller, Stereo processing by semiglobal matching and mutual information. IEEE Trans. Pattern Anal. Mach. Intell. 30(2), 328–341 (2008).
- KJ Yoon, IS Kweon, Adaptive support-weight approach for correspondence search. IEEE Trans. Pattern Anal. Mach. Intell. 28(5), 650–656 (2006).
- S Mattoccia, S Giardino, A Gambini, in Proc. Asian Conference on Computer Vision (ACCV). Accurate and efficient cost aggregation strategy for stereo correspondence based on approximated joint bilateral filtering, (2009), pp. 371–380.
- D Min, J Lu, MN Do, in Proc. International Conference on Computer Vision (ICCV). A revisit to cost aggregation in stereo matching: how far can we reduce its computational redundancy? (2011), pp. 1567–1574.
- A Hosni, C Rhemann, M Bleyer, C Rother, M Gelautz, Fast cost-volume filtering for visual correspondence and beyond. IEEE Trans. Pattern Anal. Mach. Intell. 35(2), 504–511 (2013).
- K He, J Sun, X Tang, Guided image filtering. IEEE Trans. Pattern Anal. Mach. Intell. 35(6), 1397–1409 (2013).
- Q Yang, in Proc. Computer Vision and Pattern Recognition (CVPR). A non-local cost aggregation method for stereo matching, (2012), pp. 1402–1409.
- X Mei, X Sun, W Dong, H Wang, X Zhang, in Proc. Computer Vision and Pattern Recognition (CVPR). Segment-tree based cost aggregation for stereo matching, (2013), pp. 313–320.
- K Zhang, Y Fang, D Min, L Sun, S Yang, S Yan, Q Tian, in Proc. Computer Vision and Pattern Recognition (CVPR). Cross-scale cost aggregation for stereo matching, (2014), pp. 1590–1597.
- M Calonder, V Lepetit, M Ozuysal, T Trzcinski, C Strecha, P Fua, BRIEF: computing a local binary descriptor very fast. IEEE Trans. Pattern Anal. Mach. Intell. 34(7), 1281–1298 (2012).
- S Leutenegger, M Chli, RY Siegwart, in Proc. International Conference on Computer Vision (ICCV). BRISK: Binary robust invariant scalable keypoints, (2011), pp. 2548–2555.
- A Alahi, R Ortiz, P Vandergheynst, in Proc. Computer Vision and Pattern Recognition (CVPR). FREAK: fast retina keypoint, (2012), pp. 510–517.
- E Rublee, V Rabaud, K Konolige, G Bradski, in Proc. International Conference on Computer Vision (ICCV). ORB: an efficient alternative to SIFT or SURF, (2011), pp. 2564–2571.
- DG Lowe, Distinctive image features from scale-invariant keypoints. Int. J. Comput. Vis. 60(2), 91–110 (2004).
- H Bay, A Ess, T Tuytelaars, L Van Gool, Speeded-up robust features (SURF). Comput. Vis. Image Understanding. 110(3), 346–359 (2008).
- J Heinly, E Dunn, JM Frahm, in Proc. European Conference on Computer Vision (ECCV). Comparative evaluation of binary features, (2012), pp. 759–773.
- N Khan, B McCane, S Mills, Better than SIFT? Mach. Vis. Appl. 26(6), 819–836 (2015).
- E Tola, V Lepetit, P Fua, DAISY: an efficient dense descriptor applied to wide-baseline stereo. IEEE Trans. Pattern Anal. Mach. Intell. 32(5), 815–830 (2010).
- C Liu, J Yuen, A Torralba, SIFT flow: dense correspondence across scenes and its applications. IEEE Trans. Pattern Anal. Mach. Intell. 33(5), 978–994 (2011).
- K Zhang, J Li, Y Li, W Hu, L Sun, S Yang, in Proc. International Conference on Pattern Recognition (ICPR). Binary stereo matching, (2012), pp. 356–359.
- S Zagoruyko, N Komodakis, in Proc. Computer Vision and Pattern Recognition (CVPR). Learning to compare image patches via convolutional neural networks, (2015), pp. 4353–4361.
- J Žbontar, Y LeCun, in Proc. Computer Vision and Pattern Recognition (CVPR). Computing the stereo matching cost with a convolutional neural network, (2015), pp. 1592–1599.
- Z Chen, X Sun, L Wang, Y Yu, C Huang, in Proc. International Conference on Computer Vision (ICCV). A deep visual correspondence embedding model for stereo matching costs, (2015), pp. 972–980.
- W Luo, AG Schwing, R Urtasun, in Proc. Computer Vision and Pattern Recognition (CVPR). Efficient deep learning for stereo matching, (2016), pp. 5695–5703.
- N Mayer, E Ilg, P Hausser, P Fischer, D Cremers, A Dosovitskiy, T Brox, in Proc. Computer Vision and Pattern Recognition (CVPR). A large dataset to train convolutional networks for disparity, optical flow, and scene flow estimation, (2016), pp. 4040–4048.
- F Crow, Summed-area tables for texture mapping. Comput. Graphics. 18(3), 207–212 (1984).
- C Strecha, W von Hansen, L Van Gool, P Fua, U Thoennessen, in Proc. Computer Vision and Pattern Recognition (CVPR). On benchmarking camera calibration and multi-view stereo for high resolution imagery, (2008), pp. 1–8.
- D Scharstein, H Hirschmüller, Y Kitajima, G Krathwohl, N Nesic, X Wang, P Westling, in Proc. German Conference on Pattern Recognition (GCPR). High-resolution stereo datasets with subpixel-accurate ground truth, (2014), pp. 31–42.
- R Zabih, J Wood, in Proc. European Conference on Computer Vision (ECCV). Non-parametric local transforms for computing visual correspondence, (1994), pp. 151–158.
- H Hirschmüller, D Scharstein, Evaluation of stereo matching costs on images with radiometric differences. IEEE Trans. Pattern Anal. Mach. Intell. 31(9), 1582–1599 (2009).
- J Banks, P Corke, Quantitative evaluation of matching methods and validity measures for stereo vision. Int. J. Robot. Res. 20(7), 512–532 (2001).