Shape-reserved stereo matching with segment-based cost aggregation and dual-path refinement

Stereo matching is one of the most important topics in computer vision and aims at generating precise depth maps for various applications. Its major challenge is to suppress the errors that inevitably occur in smooth, occluded, and discontinuous regions. The stereo matching system proposed in this paper exploits segment-based superpixels for both matching cost selection and cost aggregation. After classifying pixels into edge and smooth regions and selecting a suitable matching cost for each, we introduce segment-based adaptive support weights for cost aggregation, rather than relying on color similarity and spatial proximity alone. The proposed dual-path depth refinement uses the cross-based support region and texture features to correct inaccurate disparities with iterative procedures, improving the depth maps while preserving object shapes. In particular, for the leftmost and rightmost border regions, the segment-based refinement greatly reduces the mismatched disparity holes. Experimental results show that the proposed system achieves more accurate depth maps than conventional methods.

"disparity". The disparity becomes larger as an object moves toward the observer [11]. By parsing the disparities of the left and right views, we can extend this geometrical principle to estimate the distance between a viewed object and the observer. To obtain the depth map efficiently, we propose a local stereo matching method that saves computation, since the depth values mostly depend on the objects in the scene. By using segmentation information, the proposed stereo matching system can not only improve the aggregation efficiency but also recover missing objects. The basic idea of stereo matching is briefly reviewed in Section 2. Section 3 presents the designs of the proposed stereo matching system. Section 4 shows the experimental results achieved by the proposed and other methods. Finally, conclusions are given in Section 5.

Local stereo matching methods
Generally, stereo matching algorithms can be classified into three major categories: global, local, and semi-global approaches. The global approach uses a data term and a smoothness term to construct an energy function and computes the depth map globally. Graph cut [12], belief propagation [13,14], and dynamic programming [15] are typical global stereo matching algorithms. Recently, deep learning approaches have been proposed to estimate depth maps [16]; however, they are data dependent. In this paper, we focus on the design of local stereo matching approaches for computational reasons and to avoid the problems of data dependency.

Local stereo matching process
The typical local stereo matching process shown in Fig. 1 mainly contains matching cost computation, cost aggregation, disparity decision, and disparity refinement stages.

Matching cost computation
To evaluate the pixel-based matching status, several well-known costs are used for disparity estimation. The sum of absolute differences (SAD) [17] over the color components is the most common cost in stereo matching. The SAD cost for finding the left disparity map can be formulated as

C_SAD(p, d) = Σ_{c∈{R,G,B}} |I_l^c(p) − I_r^c(p′)|, (1)

where p = (x, y) is the pixel position in the left image and p′ = (x − d, y) denotes its corresponding pixel position in the right image with disparity d. In this paper, I_l^c and I_r^c represent the color intensities of the left and right images in the RGB domain, respectively. Besides the SAD cost, the gradient similarity can also measure the variations of textured images. The gradient cost for searching the left disparity map can be expressed as

C_grad(p, d) = |∇I_l(p) − ∇I_r(p′)|, (2)

where the gradient operator ∇ combines the horizontal and vertical differences between the central pixel and its neighboring pixels in the cross relation.
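Under the notation above, the two costs can be sketched in NumPy as follows. This is a minimal sketch: the clamping of border pixels shifted out of view and the exact gradient stencil are simplifying assumptions, not the paper's implementation.

```python
import numpy as np

def sad_cost(left, right, d):
    """SAD cost of Eq. (1): sum of absolute RGB differences between the
    left pixel p = (x, y) and the right pixel p' = (x - d, y).
    Pixels shifted past the left border are clamped (an assumption)."""
    shifted = np.roll(right, d, axis=1)     # shifted[:, x] = right[:, x - d]
    if d > 0:
        shifted[:, :d] = right[:, :1]       # clamp the left border
    return np.abs(left.astype(float) - shifted.astype(float)).sum(axis=2)

def gradient_cost(left_gray, right_gray, d):
    """Gradient cost of Eq. (2): absolute difference of the combined
    horizontal + vertical central differences of the two views."""
    def grad(img):
        img = img.astype(float)
        return np.gradient(img, axis=1) + np.gradient(img, axis=0)
    shifted = np.roll(right_gray, d, axis=1)
    if d > 0:
        shifted[:, :d] = right_gray[:, :1]
    return np.abs(grad(left_gray) - grad(shifted))
```

Both functions return a per-pixel cost map for one disparity hypothesis; stacking them over all d in the search range yields the cost volume used later.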
Besides, the census transform, which encodes the slight intensity variations within a small block, achieves robust performance against minor intensity changes. Figure 2 shows the traditional census transform and the modified mini-census transform [18]. The modified mini-census transform only selects a few specified representative pixels in the block to reduce redundant information.
To describe the above census transforms precisely, the binary result of comparing a neighboring pixel to the central pixel at p is expressed as

ξ(p, q) = 1 if I(q) < I(p); 0 otherwise, (3)

where q denotes the position of the neighboring pixel in the block. A bit-wise concatenation is applied to obtain the census transform as

CT(p) = ⊗_{q∈W} ξ(p, q), (4)

where ⊗ expresses the bit-wise concatenation operator and W is the block containing the selected neighboring pixels. Thus, the census cost, measured as the Hamming distance between the census transforms of the left and right views, is expressed by

C_census(p, d) = Ham(CT_l(p), CT_r(p′)) = Σ bits of CT_l(p) ⊕ CT_r(p′), (5)

where ⊕ is the bit-wise XOR operation. The modified mini-census transform [18] needs fewer computations and is more robust against noise than the traditional census transform.
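A compact NumPy sketch of Eqs. (3)-(5) follows. The bit pattern is driven by an offset list, so the mini-census is obtained simply by passing a short list of representative offsets; the exact sampling pattern of [18] is an assumption here.

```python
import numpy as np

def census_bits(img, offsets):
    """Census transform of Eqs. (3)-(4): one bit per sampled neighbor,
    set when the neighbor is darker than the center pixel. `offsets`
    lists the (dy, dx) positions of the sampled neighbors."""
    bits = np.zeros(img.shape + (len(offsets),), dtype=np.uint8)
    for i, (dy, dx) in enumerate(offsets):
        nb = np.roll(np.roll(img, -dy, axis=0), -dx, axis=1)
        bits[..., i] = (nb < img)
    return bits

def census_cost(bits_l, bits_r, d):
    """Census cost of Eq. (5): Hamming distance between the left bit
    string at x and the right bit string at x - d."""
    return (bits_l ^ np.roll(bits_r, d, axis=1)).sum(axis=2)
```

The XOR-and-sum in `census_cost` is exactly the Hamming distance of Eq. (5), computed for every pixel at once.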

Cost aggregation
Once the cost of the paired pixels in the stereo images is calculated, cost aggregation is further applied to achieve more robust results by including more pixels that share the same tendency. For local stereo matching, window-based aggregation considers the similarities of the surrounding pixels in a designated window [19][20][21][22][23][24][25]. The ideal window is designed to include, as far as possible, only the nearby pixels that belong to the same object. For example, the adaptive support weights [21] based on color similarity and spatial proximity are noted as

w(p, q) = exp(−(Δc_pq/γ_c + Δg_pq/γ_g)), (6)

where Δc_pq and Δg_pq denote the color difference and the geometric distance, respectively, and γ_c and γ_g are control factors that map Δc_pq and Δg_pq into weights. The color similarity is controlled by Δc_pq, which can be represented as the Euclidean distance between the colors of p and q,

Δc_pq = ||I(p) − I(q)||, (7)

while the geometric distance weight is controlled by Δg_pq, which is given as

Δg_pq = sqrt((x_p − x_q)^2 + (y_p − y_q)^2), (8)

where (x_p, y_p) and (x_q, y_q) are the coordinates of pixels p and q, respectively. Besides the pixel-wise adaptive support weights, the segmentation concept has also been used to modify the weights and increase the matching reliability. The segment-based adaptive support weight [22,23] can be expressed as

w(p, q) = 1 if q ∈ S_p; exp(−Δc_pq/γ_c) otherwise, (9)

where S_p is the segment on which p lies. In (9), the weight is set to the largest value, i.e., 1.0, if the neighboring pixel is in the same segment as the target pixel, while the weights of the remaining pixels are determined by color similarity.
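The two weighting schemes can be sketched as small helper functions; the γ values below are illustrative, not the papers' tuned settings.

```python
import numpy as np

def asw_weight(c_p, c_q, pos_p, pos_q, gamma_c=10.0, gamma_g=10.0):
    """Adaptive support weight of Eq. (6), using the Euclidean color
    difference of Eq. (7) and the geometric distance of Eq. (8)."""
    dc = np.linalg.norm(np.subtract(c_p, c_q, dtype=float))
    dg = np.hypot(pos_p[0] - pos_q[0], pos_p[1] - pos_q[1])
    return float(np.exp(-(dc / gamma_c + dg / gamma_g)))

def segment_asw_weight(c_p, c_q, seg_p, seg_q, gamma_c=10.0):
    """Segment-based variant of Eq. (9): weight 1 inside the same
    segment, otherwise fall back to color similarity."""
    if seg_p == seg_q:
        return 1.0
    dc = np.linalg.norm(np.subtract(c_p, c_q, dtype=float))
    return float(np.exp(-dc / gamma_c))
```

Note how the segment-based weight degenerates to a pure color weight outside the segment, which is the behavior the proposed method later refines.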
After the weight of each pixel in the window has been calculated, the aggregated cost over all pixel costs becomes

C_A(p, p′) = Σ_{q∈W_p, q′∈W_p′} w(p, q) w(p′, q′) C(q, q′) / Σ_{q∈W_p, q′∈W_p′} w(p, q) w(p′, q′), (10)

where C(p, p′) is the initial cost, which could be the SAD cost, gradient cost, or census cost stated in (1), (2), or (5), respectively. Of course, a combined cost with different weighting ratios is also possible. In (10), q and q′ are the neighboring pixels of p and p′ in the windows of the target and the reference views, respectively.
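Given precomputed support-weight maps for the two windows, the aggregation of Eq. (10) reduces to a weighted average, as this sketch shows:

```python
import numpy as np

def aggregate_cost(init_costs, w_target, w_ref):
    """Aggregated cost of Eq. (10): a weighted average of the initial
    per-pixel costs over the support window, using the support-weight
    maps of both the target and the reference views."""
    weights = w_target * w_ref          # joint weight w(p,q) * w(p',q')
    return float((weights * init_costs).sum() / weights.sum())
```

With uniform weights this collapses to a plain window average, which is why well-shaped weights matter near depth discontinuities.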

Raw disparity estimation
To obtain the raw disparity map, disparity estimation is executed after cost aggregation. It is common to utilize the winner-take-all (WTA) strategy as the criterion of disparity estimation. The WTA selection can be formulated as

d_p = arg min_{d∈R_d} C_A(p, p′), (11)

where R_d is the disparity search range. In the WTA process, we estimate the raw disparity by choosing the disparity with the smallest aggregated cost. The raw disparity map d_p then needs to be refined in the final disparity refinement process.
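On a full cost volume, Eq. (11) is a single reduction over the disparity axis:

```python
import numpy as np

def winner_take_all(cost_volume):
    """WTA of Eq. (11): per pixel, pick the disparity index with the
    smallest aggregated cost from an (H, W, D) cost volume."""
    return np.argmin(cost_volume, axis=2)
```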

Disparity refinement
Usually, the raw disparity map contains mismatched disparities near object boundaries, due to occlusion, and in regions with smooth textures, where exact matches are hard to find. Thus, a suitable disparity refinement technique is required to remove the mismatched disparities. First, we identify the mismatched pixels by a left-right consistency check (LRC), which tests whether the disparities of the left and right views are consistent.
The LRC detection rule is normally stated as

L(x, y) = 1 if |d_l(x, y) − d_r(x − d_l(x, y), y)| ≤ σ_0; 0 otherwise, (12)

where d_l and d_r are the disparities of the left and right views, respectively, and σ_0 is the tolerance for detecting wrong disparities. To correct the mismatched pixels with L(x, y) = 0, several disparity refinement methods have been proposed [26][27][28][29][30][31]. Usually, the mismatched pixels are classified into large and small hole regions. For small hole regions, a background filling algorithm is used to improve the rough disparity map. For big hole regions, the four-step hole filling method searches for the nearest reliable pixel in the neighboring regions [31].
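A sketch of the LRC test of Eq. (12), where out-of-image lookups are clamped to the border (a boundary assumption):

```python
import numpy as np

def lrc_mask(d_left, d_right, sigma0=1):
    """LRC of Eq. (12): a left-view pixel passes when the right view,
    looked up at x - d_l(x, y), reports a disparity within sigma0."""
    h, w = d_left.shape
    mask = np.zeros((h, w), dtype=bool)
    xs = np.arange(w)
    for y in range(h):
        xr = np.clip(xs - d_left[y], 0, w - 1)      # matching column
        mask[y] = np.abs(d_left[y] - d_right[y, xr]) <= sigma0
    return mask
```

Pixels where the mask is False correspond to L(x, y) = 0 and are handed to the refinement stage.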

Simple linear iterative clustering
It is noted that the disparity values within one object tend to be the same. To estimate the disparity correctly, precise segmentation of the objects helps to improve accuracy, since precise object boundaries can guide the estimation of the disparity map. However, precise object segmentation is a computation-consuming process for both the left and right images. For stereo matching, we only need to perform a localized segmentation in small regions; in other words, we only need to identify the superpixels, which are collections of adjacent and homogeneous pixels of the images. A superpixel, as a segment, provides more structural information than a single pixel.
In this paper, we adopt simple linear iterative clustering (SLIC) [32], which adapts the k-means clustering method to group pixels into superpixels efficiently. The SLIC method works in the five-dimensional space {l_i, a_i, b_i, x_i, y_i} and localizes the search range of the ith pixel to an area around the cluster center to reduce computation, where (l, a, b) is the pixel color vector defined in the CIELAB color space and (x, y) is the pixel position. The SLIC distance between the ith pixel and the cluster center considers both color similarity and spatial proximity, which are respectively denoted as

d_c = sqrt((l_k − l_i)^2 + (a_k − a_i)^2 + (b_k − b_i)^2) (13)

and

d_s = sqrt((x_k − x_i)^2 + (y_k − y_i)^2), (14)

where {l_k, a_k, b_k, x_k, y_k} is the cluster center. The k-means clustering is then applied to achieve superpixel segmentation. With the SLIC method, the segmentation results provide more matching information for local stereo matching algorithms.
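The two distances of Eqs. (13)-(14) are combined into one measure before the k-means assignment; the fusion rule and the values of the compactness m and grid interval S below follow the common SLIC formulation and are illustrative assumptions:

```python
import numpy as np

def slic_distance(pixel, center, m=10.0, S=10.0):
    """SLIC distance: color distance d_c of Eq. (13) in CIELAB and
    spatial distance d_s of Eq. (14), fused as
    D = sqrt(d_c^2 + (d_s / S)^2 * m^2).
    Both points are (l, a, b, x, y) tuples."""
    p, c = np.asarray(pixel, float), np.asarray(center, float)
    d_c = np.linalg.norm(p[:3] - c[:3])
    d_s = np.linalg.norm(p[3:] - c[3:])
    return float(np.sqrt(d_c**2 + (d_s / S)**2 * m**2))
```

A larger m makes the clusters more compact spatially; a larger S (the sampling grid interval) relaxes the spatial penalty.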

The proposed stereo matching system
Compared with the traditional method depicted in Fig. 1, the corresponding functional diagram of the proposed stereo matching system is shown in Fig. 3, which uses SLIC-based cost aggregation for estimating accurate left and right depth maps.
To exhibit the uses of the SLIC segmentation information, Fig. 4 shows two novel kernels: the adaptive stereo matching computation unit first estimates the left and right raw disparity maps, while the dual-path refinement unit further enhances them into accurate disparity maps. These kernels are described in the following subsections. Figure 5 shows the diagram of the proposed adaptive stereo matching computation unit, which includes the adaptive cost selection among the gradient, census, and SAD costs, the SLIC-based cost aggregation with left and right SLIC segmentation information, and the two-level winner-take-all strategy to estimate the left and right raw disparity maps.

Adaptive cost selection
To estimate the similarity between pixels in the left image and their corresponding pixels in the right image, an initial cost computation is necessary. First, we detect the edge regions in the color image by the Sobel operator, so that we can classify each pixel into an edge or non-edge region. For edge regions, we use the gradient cost as the initial cost. For non-edge regions, we further classify each pixel as lying in a smooth or non-smooth region; here, we utilize the cross-based window [22] to identify the smooth regions. The criterion for the adaptive selection among the SAD, gradient, and census costs is shown in Fig. 6. If the pixel is classified into the edge region, the gradient similarity stated in (2) is used since the variation in the color image is large. If the pixel is classified into the non-edge region, we use the cross-based window to further verify whether the pixel lies in a smooth region or not. To find a smooth plane, we calculate the largest left span in one direction as

r* = max_{r∈[1,L]} ( r · Π_{i∈[1,r]} δ(p, p_i) ), (15)

where the indicator function is defined by

δ(p_1, p_2) = 1 if max_{c∈{R,G,B}} |I_c(p_1) − I_c(p_2)| ≤ τ; 0 otherwise, (16)

to evaluate the color similarity of pixels. In (15) and (16), p_i is the ith pixel extended in that direction and L is the maximum arm length. Once the largest span r* is derived, we define the left arm length h_p^− = max(r*, 1). Similarly, we can process the other three directions to obtain the four arm lengths {h_p^−, h_p^+, v_p^−, v_p^+} of the pixel p. The two orthogonal cross segments are given as

H(p) = {(x, y) | x ∈ [x_p − h_p^−, x_p + h_p^+], y = y_p},
V(p) = {(x, y) | x = x_p, y ∈ [y_p − v_p^−, y_p + v_p^+]}. (17)

After computing the pixel-wise cross decision results, we can obtain a shape-adaptive full support region U(p) for the pixel at p. The support region is the union of the horizontal segments of the pixels on the vertical segment and is defined as

U(p) = ∪_{p_v∈V(p)} H(p_v), (18)

where p_v is a support pixel located on the vertical segment V(p). If the area of the cross-based plane is more than 80% of the intact window, we classify the pixel as lying in a smooth region. Once the pixel is classified into the smooth region, we use the census cost defined in (5) for stereo matching.
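The arm computation of Eqs. (15)-(16) can be sketched for one direction as follows; the color tolerance tau and the maximum arm length are illustrative settings, not the paper's values:

```python
import numpy as np

def left_arm(img, y, x, tau=20.0, max_len=17):
    """Left arm h_p^- of the cross-based window (Eqs. (15)-(16)):
    extend leftward while every visited pixel stays within the color
    tolerance tau of the anchor p, then take max(r*, 1)."""
    r_star = 0
    for i in range(1, max_len + 1):
        if x - i < 0:                       # image border reached
            break
        diff = np.max(np.abs(img[y, x - i].astype(float) -
                             img[y, x].astype(float)))
        if diff > tau:                      # indicator of Eq. (16) fails
            break
        r_star = i
    return max(r_star, 1)
```

Running the same loop in the other three directions yields the four arm lengths, from which the cross segments of Eq. (17) and the support region of Eq. (18) follow directly.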
On the contrary, if the pixel is classified into the non-smooth region, the SAD cost stated in (1) is used. Figure 7 shows the raw disparity maps achieved by the directly combined initial cost and by the adaptively selected initial cost. By considering the different texture features, the proposed adaptive cost selection achieves better raw disparity maps in both complex texture regions and smooth regions.

Cost aggregation with SLIC-based ASW
For cost aggregation, we use adaptive support weights (ASWs) that are determined by SLIC segmentation information [32]. Within each segmented superpixel, the pixels in the same segment should receive higher weights during aggregation; aggregation weights built on the superpixel concept perform better than geometry and color similarities computed in a pixel-by-pixel fashion. First, we segment the color image into K levels by the SLIC segmentation algorithm; the segments at a higher level form a more complex segmentation map. From low to high levels, if a neighboring pixel and the center pixel lie in the same segmented superpixel, the pair is prone to have higher similarity and should be given a higher weight. Figure 8 shows the segmentation images at different levels. For the K-level system, the proposed SLIC-based adaptive weight is determined by the segmentation dissimilarity N_s(p, q), defined as

N_s(p, q) = Σ_{k=1}^{K} T[S_p^k ≠ S_q^k], (21)

where T[·] is an indicator function whose value equals 1 when p and q are not in the same segment at the kth level, and 0 otherwise. In (21), S_p^k and S_q^k are the segmentation labels of pixels p and q at the kth level, respectively. To avoid ambiguity among dissimilar pixels, we adapt the weight based on the color difference to increase the accuracy whenever the dissimilarity exceeds half of the total levels. The SLIC-based adaptive weights yield a more reasonable aggregation cost and improve the disparity estimation compared with the cost aggregation weights stated in (10).
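The level-counting dissimilarity of Eq. (21) is straightforward to compute once the K label maps are available; a minimal sketch:

```python
import numpy as np

def seg_dissimilarity(level_labels, p, q):
    """Segmentation dissimilarity N_s(p, q) of Eq. (21): the number of
    SLIC levels at which pixels p and q carry different superpixel
    labels. `level_labels` is a list of K label maps, one per level."""
    return sum(int(lab[p] != lab[q]) for lab in level_labels)
```

A small N_s means the pair agrees across most segmentation scales and deserves a high aggregation weight; when N_s exceeds K/2, the method falls back to the color-based weighting described above.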
It is noted that the proposed adaptive weights reduce the distortion among similar pixels while keeping the sensitivity in complex texture regions. If we use only the segmentation similarity, called SLIC-only, without the adaptive weights controlled by color changes, the variations of the weights cannot capture the detailed differences. Figure 9 shows two distributions of the adaptive weights along the x-axis obtained by the proposed method (blue) and by SLIC-only (red); where the weights coincide, they are shown in mixed (purple) color. The weights obtained by SLIC-only can hardly separate the differences in complex regions since they are nearly equal and of low value. As a result, the proposed adaptive weights obtained in (22) successfully avoid such ambiguous conditions by producing weights with large variation.

Two-level WTA strategy
Normally, the WTA strategy selects the disparity value with the minimum cost. However, more than one disparity may share the same minimum cost, or several disparities may have similarly small costs. To avoid inaccurate disparity decisions, we modify WTA into a two-level procedure. First, we check every pixel as

d(p) = arg min_{d∈R_d} C_A(p, p′) if N(C_min) ≤ 3; 256 otherwise, (23)

where N(·) represents the number of disparities that share the same minimum cost. If more than 3 candidates share the same minimum cost, we replace d(p) by 256 to label the pixel p as an unstable point. To deal with the unstable points, we use window-based histogram voting to select the correct disparity. For each unstable pixel p, a histogram H_p(d) of the stable points surrounding p in a window is created, and the disparity d_v(p) of the histogram bin with the highest count replaces the unstable point, i.e., d_v(p) = arg max_d H_p(d). After the disparity of each pixel is found, we adjust the scale of the disparities to generate the raw depth maps. Generally, the left and right images have slight intensity differences unless the viewed object is flat and perpendicular to the paired cameras, so the minimum matching cost might not find the correct matching point. With the proposed method, the true disparity can be obtained more precisely.
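The two-level procedure can be sketched end to end as follows; the tie threshold and the voting window radius are taken from the description above, while marking holes with 256 follows the paper's convention:

```python
import numpy as np

def two_level_wta(cost_volume, max_ties=3, win=1):
    """Two-level WTA sketch (Eq. (23)): pixels whose minimum cost is
    shared by more than `max_ties` disparities are marked unstable (256)
    and then repaired by histogram voting over the stable neighbors in
    a (2*win+1)^2 window."""
    d = np.argmin(cost_volume, axis=2)
    cmin = cost_volume.min(axis=2)
    ties = (cost_volume == cmin[..., None]).sum(axis=2)
    d[ties > max_ties] = 256                     # label unstable points
    out = d.copy()
    for y, x in zip(*np.where(d == 256)):
        patch = d[max(0, y - win):y + win + 1, max(0, x - win):x + win + 1]
        stable = patch[patch != 256]
        if stable.size:                          # most frequent stable disparity
            vals, counts = np.unique(stable, return_counts=True)
            out[y, x] = vals[np.argmax(counts)]
    return out
```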

Iterative dual-path depth refinement
The estimated raw disparity map usually contains mismatches near object boundaries and in smooth regions, where it is hard to preserve the object shapes. Thus, we propose an iterative dual-path refinement algorithm that refines the raw disparity maps into high-precision, shape-preserved depth maps.
To find the mismatched disparities, we first label the disparity map by the modified LRC of (24), which extends (12) by including a color tolerance in addition to the disparity similarity. For the pixels with L(x, y) = 0, the mismatched pixels are further categorized into two types: small holes and big holes. If the width of the mismatched region in the target view is less than 2 pixels, we classify it as a small hole; otherwise, the mismatched pixels are classified as big holes. Figure 10 shows the flow chart of the proposed iterative dual-path refinement.

Small hole filling
Since the mismatched region contains only small holes, the color image helps to find the accurate disparity through its texture information. Here, we utilize the maximum-weighted candidate to find the correct disparity. Using the color similarity and spatial proximity within a correction window, we calculate the weight of each pixel as

w_pq = exp(−(Δc_pq/γ_c + Δs_pq/γ_s)), (25)

where Δc_pq and Δs_pq are the color difference in the RGB domain and the spatial difference, respectively. We then analyze the disparity distribution with the calculated weights: the disparity with the largest accumulated weight in the weighted histogram becomes the final disparity, which is written as

d_out(p) = arg max_d Σ_{q∈Ω} w_pq · T[d_q = d], (26)

where Ω is the correction window region and d_out is the final refined disparity.
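A sketch of this weighted-histogram filling follows. Holes are assumed to be marked with negative disparities, and the γ values are illustrative:

```python
import numpy as np

def fill_small_hole(disp, color, y, x, win=2, gamma_c=10.0, gamma_s=10.0):
    """Small-hole filling sketch: weight each valid neighbor by color and
    spatial similarity (Eq. (25)), accumulate a weighted disparity
    histogram, and return the disparity with the largest total weight
    (Eq. (26)). Holes are marked with disp < 0 (an assumption)."""
    hist = {}
    h, w = disp.shape
    for yy in range(max(0, y - win), min(h, y + win + 1)):
        for xx in range(max(0, x - win), min(w, x + win + 1)):
            if disp[yy, xx] < 0:            # skip other holes
                continue
            dc = np.linalg.norm(color[y, x].astype(float) -
                                color[yy, xx].astype(float))
            ds = np.hypot(y - yy, x - xx)
            wgt = np.exp(-(dc / gamma_c + ds / gamma_s))
            hist[int(disp[yy, xx])] = hist.get(int(disp[yy, xx]), 0.0) + wgt
    return max(hist, key=hist.get) if hist else disp[y, x]
```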

Dual-path big hole filling
For big holes, finding the correct disparity from the surrounding pixels alone is not suitable. We first classify the occlusion region into non-border occlusion and leftmost/rightmost border occlusion, and design a two-path hole filling for the two cases. For non-border regions, the holes, which are induced by the occlusion of foreground objects, should be filled with the background disparity. For border regions, on the other hand, the holes can only be inferred from the target color image. The flow chart of the occlusion region refinement in two paths is shown in Fig. 11, and the processing details are described as follows. For non-border hole filling, the mismatched pixels are usually filled directly with the background information. To obtain a more accurate disparity map, we make use of similar pixels in the background of the color image. First, we calculate the color similarity to find the most similar pixel on the same horizontal line among Q background pixels (extended to the left side), and then we assign its corresponding disparity to the occluded pixel as

d_out(x, y) = d(x − i*, y), where i* = arg min_{i∈[1,Q]} ΔC(x, x − i), (27)

where ΔC(x, x − i) denotes the color difference between the target hole at x and the horizontal-left background pixel at x − i. Though some residual wrong disparities may remain after the proposed non-border hole filling, they are resolved by the iterative steps. The regular background hole filling and the matched-color background hole filling are illustrated in Fig. 12 a and b, respectively: instead of filling the hole with the disparity of the nearby background (light blue) pixel, we fill it with the disparity of the matched-color background (yellow) pixel.
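The matched-color background fill of Eq. (27) amounts to a one-row search, as this sketch shows; Q is an illustrative search length:

```python
import numpy as np

def fill_nonborder_hole(disp, color, y, x, Q=8):
    """Non-border big-hole filling of Eq. (27): among the Q pixels to the
    left on the same row, copy the disparity of the one whose color best
    matches the hole pixel."""
    best_i, best_dc = None, np.inf
    for i in range(1, Q + 1):
        if x - i < 0:                       # image border reached
            break
        dc = np.linalg.norm(color[y, x].astype(float) -
                            color[y, x - i].astype(float))
        if dc < best_dc:                    # best color match so far
            best_dc, best_i = dc, i
    return disp[y, x - best_i] if best_i is not None else disp[y, x]
```

Unlike the regular background fill, which would copy the nearest background disparity, this picks the disparity of the color-matched pixel, which is the behavior shown in Fig. 12b.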
For the leftmost border regions of the target (left) disparity map, as shown in Fig. 13 a and b, we can only refer to the target (left) color image to fill the holes of the target (left) disparity map, since no matching information exists in the reference (right) view: the object in the leftmost region of the right image disappears entirely, so there is no clue for finding the corresponding disparities of these unknown regions. Thus, we can only use the leftmost part of the color image to infer the holes as well as possible. Fortunately, the SLIC segments of the color image have already been computed for the determination of the ASWs, as shown in Fig. 13c, which shows the localized superpixels. We use the concept that pixels in the same superpixel should have the same depth value. For better inference, we associate the localized SLIC superpixels for border big-hole filling with the following procedure. First, we merge the localized superpixels that have similar texture color information, as shown in Fig. 13d, gathering superpixels into larger megapixels that are treated as object-like segments. Secondly, we horizontally extend the known and reliable disparities leftward to all the hole pixels sharing the same megapixel, which yields some filled megapixels. Thirdly, we perform disparity histogram voting for the isolated megapixels that do not contain any known disparity: starting from the lowest pixel of the isolated megapixel, we choose the disparity with the largest histogram count from the adjacent filled megapixel. Finally, the hole regions at the border can be reproduced with clear objects and edges. The refined left disparity map is shown in Fig. 13f.

Experimental results
The proposed stereo matching system was implemented in MATLAB R2016a and tested on a PC with an Intel Core i5-8400 CPU and 16 GB RAM.
The experimental evaluation uses the Middlebury 2003 [33], 2005 [34], and 2014 [35] datasets. The test images, which include Cones, Teddy, Tsukuba, and Venus, are shown in Table 1, while the test images with higher resolutions and higher disparity levels are exhibited in Table 2.

Results achieved by the proposed system
As shown in Fig. 14, the raw and refined disparity maps achieved by the proposed adaptive stereo matching and dual-path refinement methods for Cones, Teddy, Tsukuba, and Venus test images are exhibited in the first and second rows, respectively.

Comparisons with other approaches
For performance evaluation, we compare the proposed method with other stereo matching algorithms, including the adaptive support weight (ASW) [21], segmentation-based adaptive support weight (S-ASW) [23], plant leaf stereo matching (LF-SM) [36], edge-based stereo matching (E-SM) [37], the stereo matching method implemented on a GPU platform [31], AdaStereo [38], the SGM variant tMGM from a comparative evaluation of SGM methods for dense stereo matching [39], the learning-based disparity estimation iResNet [40], and DeepPruner [41]. Tables 3 and 4 show that the performance of the proposed multi-scale ASW is superior to the traditional ASW and the other methods. Table 4 shows that we achieve better performance than some deep learning-based methods on the training set, even though the training set favors the deep learning approaches. Although the average error rate is close to that of S-ASW, our method utilizes more information from the segmentation instead of merely assigning a weight of 1 within the same segment. Thanks to the refinement steps, the edge areas of the depth maps can be reasonably reconstructed.
With the proposed algorithm, the disparity maps are accurate, which helps to improve the performance of the DIBR system for multi-view synthesis [42]. Figures 15, 16, 17 and 18 show the results achieved by the referenced methods for the Cones, Teddy, Tsukuba, and Venus images, respectively.

Conclusions
In this paper, we proposed a segment-based adaptive stereo matching algorithm and a dual-path segment-based disparity refinement method. The former provides a reasonably good raw disparity map, and the latter effectively enhances the raw disparity map into a high-quality one. The contributions of the proposed method include the adaptive cost selection, the segment-based adaptive weights for cost aggregation, the two-level WTA strategy, and the dual-path depth refinement. For small holes, the depth refinement uses the maximum-weighted candidate for the best filling process. For non-border big holes, the background filling strategy is adopted by considering color and proximity information. For border holes, the megapixel-based filling process is suggested to achieve better results. The proposed stereo matching system, tested on the Middlebury stereo datasets, shows the best performance among all compared methods; especially in the edge areas of the depth maps, it can reasonably reconstruct the depth values of the objects. The experimental results show that the proposed system can produce high-quality depth maps for 3D video broadcasting with the 3D-HEVC [43,44] and CTDP-HEVC [45,46] formats. Compared with the deep learning-based approaches, the proposed system can be applied to various databases without retraining.

Fig. 15 Estimated disparity maps of Cones achieved by a Kuo et al. [27], b Kuo [28], c Hsieh [29], d Sun [31], e ASW [21], f S-ASW [23], g LF-SM [36], and h the proposed method
Fig. 16 Estimated disparity maps of Teddy achieved by a Kuo et al. [27], b Kuo [28], c Hsieh [29], d Sun [31], e ASW [21], f S-ASW [23], g LF-SM [36], and h the proposed method
Fig. 17 Estimated disparity maps of Tsukuba achieved by a Kuo et al. [27], b Kuo [28], c Hsieh [29], d Sun [31], e ASW [21], f S-ASW [23], g LF-SM [36], and h the proposed method
Fig. 18 Estimated disparity maps of Venus achieved by a Kuo et al. [27], b Kuo [28], c Hsieh [29], d Sun [31], e ASW [21], f S-ASW [23], g LF-SM [36], and h the proposed method