Real-time stereo matching architecture based on 2D MRF model: a memory-efficient systolic array
- Sungchan Park^{1},
- Chao Chen^{1},
- Hong Jeong^{1} and
- Sang Hyun Han^{1}
https://doi.org/10.1186/1687-5281-2011-4
© Park et al; licensee Springer. 2011
Received: 24 January 2011
Accepted: 17 August 2011
Published: 17 August 2011
Abstract
Computer vision applications increasingly need stereopsis that delivers not only accurate distance estimates but also a fast and compact physical implementation. Global energy minimization techniques provide remarkably precise results, but they suffer from huge computational complexity. One of the main challenges is to parallelize the iterative computation while solving the memory access problem between a large external memory and a massive array of processors. Our memory reduction scheme yields a remarkable memory saving, and our new architecture is a systolic array. By cascading multiple chips, the array can be expanded to cope with various image resolutions. We have realized it using FPGA technology. Our architecture requires 19 times less memory than the global minimization technique, which is a principal step toward real-time chip implementation of various iterative image processing algorithms, such as optical flow and image restoration, with tiny and distributed memory resources.
1 Introduction
The stereo matching problem is to find the corresponding points in a pair of images portraying the same scene. The underlying principle is that two cameras separated by a baseline capture slightly dissimilar views of the same scene. Finding the corresponding pairs is known to be the most challenging step in the binocular stereo problem.
The local method, typically window correlation or dynamic programming (DP), examines only subimages and obtains local minima as solutions. It inherently needs relatively few operations and little memory, which makes it the popular approach in real-time DSP systems [2, 3] and parallel VLSI chips [4–7]. The local method can be easily realized in a massively parallel structure, as shown in Table 1. Nevertheless, there are many situations where this method may fail: occlusion, uniform texture, ambiguity in low-texture regions, etc. Furthermore, the window method tends to yield blurred results around object boundaries.
In contrast, the global method, typically graph cut [8, 9] or BP [10–12], deals with whole images and seeks an approximate global minimum. This approach has the advantage of a low error rate but tends to need huge computational loads and memory resources. Recently, some researchers realized BP on a PC aided by specialized parallel processors on a GPU graphics card [13]. As described in Table 1, this so-called real-time BP yields reasonable results only at a small throughput (MDE/s). The specialized GPU relies on high-speed clocks and a small number of processors, so it cannot be regarded as a fully parallel architecture, and its throughput is limited. Nevertheless, such systems are successfully used in real-time computer vision [14]. There is no fully parallel system with enough computational power (MDE/s) for high-resolution images or fast frame rates. Further, there is no genuinely compact hardware dedicated to global stereo matching in real time. Most of the existing systems are impractical in terms of size, power requirement, and expense and are not suitable for compact applications such as robot vision.
Consider a one-chip solution with a systolic array and an efficient memory configuration. To avoid the huge memory, we previously implemented BP on an FPGA with a reduced memory size [15], in a manner similar to the hierarchical iteration sequence of [16]. In this paper, we adopt the iteration filter (IF) scheme [16] for our architecture and, by taking the message propagation direction into account, halve its memory requirement; we call the resulting method "fast belief propagation (FBP)". Based on this method, we built a fully parallel architecture that is efficient in memory usage and equivalent to the original belief propagation (BP) method in terms of accuracy.
For a real-time application that requires small and compact hardware, GPU- and CPU-based systems are poorly suited because of their bulky size. We used this architecture to build a stereo vision chip and observed the expected performance: real-time operation and small memory for high-precision depth images.
The remainder of this paper is organized as follows. Section 2 explains the background of the belief propagation. Section 3 defines a layer structure and explains an FBP sequence. A new iteration filter algorithm considering iteration directions is described in Section 4. For a VLSI realization, Section 5 suggests a parallel architecture and its memory complexity. Experiments are presented in Section 6. Section 7 draws conclusions on our newly developed architecture.
2 Review of belief propagation
D(d_{p}) is the data cost for the node p having the state d_{p}. Similarly, V(d_{p}, d_{q}) is the edge cost for a pair of neighbor nodes p and q having states d_{p} and d_{q}, respectively.
We assume parallel optics without loss of generality. Then, stereo matching simply involves finding the point (p_{0}, p_{1} + d_{p}) in the right image that corresponds to the point (p_{0}, p_{1}) in the left image. Thus, the hidden state d_{p} represents the offset between the corresponding pixels, which is called the disparity.
where C_{d} and K_{d} are a weighting factor and an upper bound of the cost, respectively. This upper bound makes the data cost robust to occlusions and artifacts that violate the common assumption of uniform ambient brightness.
where C_{v} and K_{v} are similarly defined constants.
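The cost equations themselves are not reproduced in this text, so the following is only a minimal sketch of one common choice that fits the description above: a truncated linear model, as widely used in BP stereo (for example in [10]). The function names, the absolute-difference form, and the list-of-rows image layout are assumptions made for illustration, not the authors' exact definitions.

```python
def data_cost(left, right, p0, p1, d, c_d, k_d):
    """Assumed data cost: weighted absolute intensity difference, capped at k_d.

    Compares the left pixel (p0, p1) with the right pixel (p0, p1 + d),
    following the correspondence convention stated in the text.
    """
    diff = abs(left[p0][p1] - right[p0][p1 + d])
    return min(c_d * diff, k_d)


def edge_cost(d_p, d_q, c_v, k_v):
    """Assumed edge cost: weighted disparity difference, capped at k_v."""
    return min(c_v * abs(d_p - d_q), k_v)


# Tiny one-row example with made-up intensities.
left_row = [[10, 20, 30]]
right_row = [[10, 10, 20]]
matched = data_cost(left_row, right_row, 0, 0, 1, c_d=2, k_d=15)
capped = data_cost(left_row, right_row, 0, 0, 2, c_d=2, k_d=15)
```

The cap is what keeps a single badly mismatched pixel, such as an occluded one, from dominating the total energy.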
If the memory complexity at each node is B bits, the overall memory size is $\sum_{k=0}^{K-1} B\,(N/2^{k})(M/2^{k})$ bits.
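As a quick check of this sum, the sketch below evaluates it directly. The function name and the sample values are hypothetical; only the formula comes from the text.

```python
def hierarchical_memory_bits(b_bits, n, m, num_levels):
    """Sum B * (N / 2^k) * (M / 2^k) over levels k = 0 .. K - 1."""
    return sum(b_bits * (n / 2**k) * (m / 2**k) for k in range(num_levels))


# Made-up sizes: an 8 x 8 grid with 1 bit per node.
two_levels = hierarchical_memory_bits(1, 8, 8, 2)
four_levels = hierarchical_memory_bits(1, 8, 8, 4)
```

Because each level holds a quarter of the nodes of the level below it, the total stays below 4/3 of the 0th-level memory however many levels are used.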
3 The proposed fast belief propagation sequence
In this section, we propose our FBP algorithm and architecture, which enable us to run BP on an FPGA with tiny distributed RAMs and achieve a remarkable memory reduction: the required memory is half that of the Iteration Filter memory reduction scheme [16]. Before proceeding, readers are encouraged to study the Iteration Filter scheme [16], which differs entirely from the normal iteration sequence and achieves a large memory reduction. We redesign the Iteration Filter algorithm and implement it on the FPGA.
where (N(p), l - 1) and M((N(p), l - 1)) = {M(u, l - 1)|u ∈ N(p)} represent the neighbor nodes and their message costs in the buffer, respectively.
In the initialization stage, each node p observes the input to obtain the data cost D(p, 0). Afterward, in every iteration l, each node calculates the new message M(p, l) according to the update function f(·) and then stores it in the buffer, where it serves as M(p, l - 1) in the next iteration.
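The buffered iteration described above can be sketched generically as follows. This is not the FBP sequence itself, only the ordinary synchronous scheme it starts from; the dictionary-based graph, the function names, and in particular `toy_update` are placeholders chosen for illustration and do not reproduce the BP message rule.

```python
def run_synchronous_iterations(data_cost, neighbors, update, num_iterations,
                               initial_message=0):
    """Iterate M(p, l) = update(messages of N(p) at l - 1, D(p)) for l = 1 .. L."""
    # Buffer holding M(p, l - 1) for every node p.
    previous = {p: initial_message for p in data_cost}
    for _ in range(num_iterations):
        # Each new message reads only the previous buffer, so all nodes
        # could be updated in parallel within one iteration.
        current = {
            p: update([previous[u] for u in neighbors[p]], data_cost[p])
            for p in data_cost
        }
        previous = current
    return previous


def toy_update(neighbor_messages, cost):
    """Placeholder update: own cost plus the smallest neighbor message."""
    return cost + min(neighbor_messages)


# Three-node chain 0 - 1 - 2 with made-up data costs.
chain_neighbors = {0: [1], 1: [0, 2], 2: [1]}
chain_costs = {0: 1, 1: 5, 2: 3}
result = run_synchronous_iterations(chain_costs, chain_neighbors, toy_update, 2)
```

Note that this plain scheme keeps a buffer entry for every node of the image, which is the memory cost the layer structure is designed to avoid.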
Algorithm 1: FBP algorithm
Q(p_{0}, l) and Q(p_{0}, l - 1) belong to Q(p_{0}). Hence, given the layer buffers Q(p_{0} - 2) and Q(p_{0} - 1) and the local buffer Q(p_{0}, l - 1), the costs in Q(p_{0}, l) are updated recursively at each layer l; this sequence is described in Figure 6a, b, and c. That is, given M(Q(p_{0} - 1)), M(Q(p_{0} - 2)), and D(Q(p_{0} - 1)), we can calculate M(Q(p_{0})). The new costs in the local buffer are then stored in the layer buffer so that the next set Q(p_{0} + 1) can be processed at the next time step. This sequence shifts the layer buffer along the p_{0} axis. Then, for p_{0} from 0 to N + L - 1, we obtain the final iterated message M(Q(p_{0}, L)). For example, as shown in Figure 6b and c, the location of the buffer changes from Q(p_{0} = 5) to Q(p_{0} = 6) under our sequence.
In the hierarchical case, as shown in Figure 6d, we can construct a hierarchical layer structure by considering the hierarchical iterations. At each level, we can follow the FBP sequence, provided that the two-by-two scale change between levels is taken into account. Please refer to [16] for the detailed hierarchical memory reduction scheme of IF.
If we approximate the total memory by that of the 0th level, the reduction rate amounts to N/(2L^{0} + 1) when 2L^{0} ≪ N. In summary, the update sequence is effective whenever N, one of the image size components, is large and L^{0}, the iteration number, is small.
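To see how this approximate rate behaves, a minimal sketch follows. The function name and the sample values of N and L^{0} are made up for illustration and are not measurements from the paper.

```python
def approximate_reduction_rate(n, l0):
    """Approximate memory reduction N / (2 * L0 + 1), meaningful when 2 * L0 << N."""
    return n / (2 * l0 + 1)


# Made-up values: the rate grows with N and shrinks as the iteration number grows.
rate_few_iterations = approximate_reduction_rate(210, 10)
rate_many_iterations = approximate_reduction_rate(210, 52)
```

With N = 210, ten iterations give a rate of 10, while 52 iterations give only 2, which illustrates why a small iteration number matters for this scheme.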
4 New iteration sequence considering the iteration direction
Number of messages stored at each node in the buffer
Buffer | Access(Δ) directions | Access(Δ) no. | Store(Δ) directions | Store(Δ) no. |
---|---|---|---|---|
Q(p_{0}, l - 1) | [+1 0] | 1 | [±1 0], [0 ±1] | 4 |
Q(p_{0} - 1, l - 1) | [0 ±1] | 2 | [-1 0], [0 ±1] | 3 |
Q(p_{0} - 2, l - 1) | [-1 0] | 1 | [-1 0] | 1 |
As explained in the FBP algorithm, at each update time, the buffer is shifted along the p_{0} axis as it is updated with the new costs. The newly updated messages and data costs in the local buffer are stored in the layer buffer for the processing of the next set Q(p_{0} + 1). Thus, if the messages from all possible directions are saved in the local buffer, some of them can be transferred to Q(p_{0} - 1, l - 1). At the same time, some old costs in Q(p_{0} - 1, l - 1) are moved to Q(p_{0} - 2, l - 1) in a similar way. With this scheme, the number of propagation directions to be stored in the buffer is given in the Store(Δ) part of Table 2.
FBP buffer size
Buffer | Message | Data cost |
---|---|---|
Layer buffer Q(p_{0} - 2) | B_{m}SML | 0 |
Layer buffer Q(p_{0} - 1) | 3B_{m}SML | B_{D}SML |
Local buffer Q(p_{0} - (l - 1), l - 1) | 4B_{m}SM | B_{D}SM |
Total | 4B_{m}SM(L + 1) | B_{D}SM(L + 1) |
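The rows of the buffer-size table can be added up mechanically, as in the sketch below. The function name is hypothetical, and the sample call uses S = 32, B_{m} = 7, B_{D} = 10, and L = 10 together with an assumed M = 160 purely for illustration.

```python
def fbp_buffer_bits(b_m, b_d, s, m, num_iterations):
    """Return (message_bits, data_bits) by summing the three buffer rows."""
    per_layer = s * m  # S states times M nodes
    rows = [
        # (message bits, data cost bits)
        (b_m * per_layer * num_iterations, 0),                                  # layer buffer Q(p0 - 2)
        (3 * b_m * per_layer * num_iterations, b_d * per_layer * num_iterations),  # layer buffer Q(p0 - 1)
        (4 * b_m * per_layer, b_d * per_layer),                                 # local buffer
    ]
    message_bits = sum(row[0] for row in rows)
    data_bits = sum(row[1] for row in rows)
    return message_bits, data_bits


# Assumed example values; M = 160 is an illustrative choice.
message_bits, data_bits = fbp_buffer_bits(b_m=7, b_d=10, s=32, m=160, num_iterations=10)
```

The row sums agree with the closed forms in the Total row, 4B_{m}SM(L + 1) for messages and B_{D}SM(L + 1) for data costs.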
Comparing Equations 20 and 22, the value a changes from two to one. Therefore, by exploiting the propagation direction of BP, we need only half the memory of the iteration filter [16].
5 Systolic VLSI architecture
Since (M/2^{k}) nodes are handled by (M/2^{k}) processors in parallel on the $p_{1}^{k}$ axis, the total required clocks are reduced from $\sum_{k=0}^{K-1} 6S\,(M/2^{k})(N/2^{k})$ to $\sum_{k=0}^{K-1} 6S L^{k}\,(N/2^{k})$. As a whole, each PE calculates the messages in parallel by accessing the local buffer or the layer buffer which is located in the neighboring PEs or PE groups.
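The parallel clock count quoted above can be evaluated with a short sketch. The function names are hypothetical, and the sample call assumes N = 480, S = 32, iteration numbers (L^{0}, L^{1}, L^{2}, L^{3}) = (10, 8, 8, 8), and a 25 MHz clock purely for illustration; which image dimension plays the role of N is an assumption here.

```python
def parallel_clocks(s, n, iterations_per_level):
    """Sum 6 * S * L^k * (N / 2^k) over levels, with iterations_per_level[k] = L^k."""
    return sum(
        6 * s * l_k * (n / 2**k)
        for k, l_k in enumerate(iterations_per_level)
    )


def frames_per_second(clock_hz, clocks_per_frame):
    """Upper bound on the frame rate if one frame takes clocks_per_frame clocks."""
    return clock_hz / clocks_per_frame


# Assumed example values (see the lead-in above).
clocks = parallel_clocks(s=32, n=480, iterations_per_level=[10, 8, 8, 8])
fps_bound = frames_per_second(25e6, clocks)
```

The count no longer depends on M, since the M/2^{k} nodes along that axis are processed by separate PEs at the same time.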
6 Experimental results
Our new architecture has been tested by both a simulation and FPGA realization.
6.1 Software simulation
First, we verify our VLSI algorithm on the Middlebury data set with a software simulation. In the previous sections, we presented a new architecture that is equivalent to hierarchical BP (HBP) in terms of its input-output relationship and that forms a systolic array with a small memory space. Hence, it is suitable for VLSI implementation.
where $\widehat{d}$ is the estimated disparity, d_{True} is the true disparity, Pm is the image area excluding the occluded part, and N is the number of pixels in that area. This error is the fraction of pixels whose disparity error is larger than 1.
Figure 16 shows the relationship between the iteration layers and the average memory reduction rate of FBP compared with HBP, where the same iteration number, (L, L, L, L), is applied for each layer. Owing to the hierarchical scheme, the iteration converged at around 28 iterations and yielded a maximum error of 0.8%. The remarkable result, though, is the memory reduction, which is around 32 times. In fact, even less memory is possible at the cost of a higher error rate. Thus, this architecture makes the performance scalable between space and accuracy.
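The error equation is not reproduced in this text, so the sketch below follows only the verbal definition above: count the non-occluded pixels whose absolute disparity error exceeds 1 and divide by the number of non-occluded pixels. The function name, the boolean mask representation of Pm, and the sample data are assumptions for illustration.

```python
def bad_pixel_rate(estimated, truth, non_occluded, threshold=1.0):
    """Fraction of non-occluded pixels with |d_hat - d_true| > threshold."""
    errors = 0
    count = 0
    for est_row, true_row, mask_row in zip(estimated, truth, non_occluded):
        for d_hat, d_true, visible in zip(est_row, true_row, mask_row):
            if not visible:
                continue  # occluded pixels are excluded from the area Pm
            count += 1
            if abs(d_hat - d_true) > threshold:
                errors += 1
    if count == 0:
        raise ValueError("the non-occluded area is empty")
    return errors / count


# Made-up 2 x 2 example: one of three visible pixels is off by more than 1.
estimated = [[1, 2], [5, 9]]
truth = [[1, 4], [5, 9.5]]
mask = [[True, True], [True, False]]
rate = bad_pixel_rate(estimated, truth, mask)
```

Multiplying the returned fraction by 100 gives the percentage form used when error rates are reported.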
Disparity error comparison of several real-time methods (%)
6.2 FPGA implementation
We implemented the architecture in VHDL on an FPGA with the following specifications: S = 32, B_{m} = 7, B_{D} = 10, (L^{3}, L^{2}, L^{1}, L^{0}) = (8, 8, 8, 10), and 15 frames/sec for a 160 × 240 or 160 × 480 image.
Comparisons of computation time between the real-time systems
Spec | System | Image | Levels | fps |
---|---|---|---|---|
Our FBP, One FPGA | FPGA, Virtex2 | 160 × 480 | 32 | 15 |
Two FPGAs | FPGA, Virtex2 | 320 × 480 | 32 | 15 |
Semi-global matching [17] | FPGA, Virtex5 | 640 × 480 | 128 | 103 |
Local matching [22] | FPGA, Virtex5 | 640 × 480 | 64 | 230 |
Accelerated BP [21] | FPGA, Virtex2 | 256 × 240 | 16 | 25 |
Real-time BP [13] | GPU, Geforce 7900 | 320 × 240 | 16 | 16 |
Real-time DP [20] | CPU, MMX | 320 × 240 | 100 | 26.7 |
Trellis DP [19] | FPGA, Virtex2 | 320 × 240 | 128 | 30 |
Comparisons of hardware spec. between the real-time systems
Spec | System | clock | PEs | Int. Mem. | Ext. Mem. |
---|---|---|---|---|---|
Our FBP, One FPGA | Virtex2 | 25 MHz | 128 | 3.3 Mb | No |
Two FPGAs | Virtex2 | 25 MHz | 256 | 6.6 Mb | No |
Semi-global matching [17] | Virtex5 | 133 MHz | 30 | 3.3 Mb | Yes |
Local matching [22] | Virtex5 | 93 MHz | 64 | 5.8 Mb | No |
Real-time BP [13] | Geforce 7900 | 670 MHz | 26 | NA | 62 Mb |
Accelerated BP[21] | Virtex2 | 65 MHz | 24 | 2 Mb | 9 Mb |
Real-time DP [20] | MMX | NA | NA | NA | Yes |
Trellis DP [19] | Virtex2 | 50 MHz | 128 | Yes | No |
For higher resolutions, we need to increase the computational power. This is possible by simply cascading several chips in proportion to the image size or by increasing the clock speed.
Additional hardware specifications used in our system
Spec. (Resource usage percentage) | |
---|---|
FPGA | Xilinx Virtex II pro-100 |
Number of multipliers | 0 |
Number of dividers | 0 |
Number of slice flip flops | 30,585 (34%) |
Number of 4 input LUTs | 46,812 (53%) |
7 Conclusions
In this paper, a new architecture for the global stereo matching algorithm has been presented. The key idea is to rearrange the computation order in BP to obtain a parallel and memory-efficient structure. As the results show, our system uses 19 times less memory than ordinary BP. The memory space can be traded off against the iteration number. The architecture is also scalable in terms of image size; the regular structure can be easily expanded by cascading identical modules.
When applied to binocular stereo vision, this architecture shows the ability to process stereo matching in real time. Experimental results confirm that this array architecture easily provides high throughput at a low clock speed, where a small number of iterations is guaranteed by the hierarchical iteration scheme.
In the future, we plan to realize this architecture as a small and compact ASIC chip. Beyond programmable chips, we expect that a real-time chip with a very large number of PEs can achieve higher resolution and a lower error rate. Unlike bulky GPU and CPU systems, implementing the complex stereo matching system on a compact chip may enable many real-time vision applications.
Furthermore, if we change the message and data cost model, our memory-efficient architecture can be applied to other BP-based problems such as motion estimation and image restoration [10]. The combination of parallel processing and efficient memory usage makes it possible to implement a compact VLSI chip. More general iterative algorithms in which each pixel communicates only with its neighboring pixels, such as GBP typical cut [18], can also be considered. As explained in [16], if we apply the IF scheme to these algorithms, we can reduce their memory resources to a tiny size. Thus, if they have simple update logic for the iteration, fully parallel VLSI architectures may be realizable.
Declarations
Acknowledgements
This work was supported by the following funds: the Brain Korea 21 project and the Ministry of Knowledge Economy, Korea, under the Core Technology Development for Breakthrough of Robot Vision Research support program supervised by the National IT Industry Promotion Agency.
References
- Scharstein D, Szeliski R: A taxonomy and evaluation of dense two-frame stereo correspondence algorithms. Int J Comput Vision 2002, 47(1-3): 7-42.
- Kanade T, et al.: A stereo machine for video-rate dense depth mapping and its new applications. Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition 1996.
- Konolige K: Small vision systems: Hardware and implementation. Proceedings of Eighth International Symposium Robotics Research 1997.
- Corke P, Dunn P: Real-time stereopsis using FPGAs. IEEE TENCON. Speech and Image Technologies for Computing and Telecommunications 1997, 235-238.
- Hariyama M, et al.: Architecture of a stereo matching VLSI processor based on hierarchically parallel memory access. The 2004 47th Midwest Symposium on Circuits and Systems 2004, 2: II245-II247.
- Kimura S, et al.: A convolver-based real-time stereo machine (SAZAN). Proceedings of Computer Vision and Pattern Recognition 1999, 1: 457-463.
- Woodfill J, Von Herzen B: Real-time stereo vision on the PARTS reconfigurable computer. IEEE Workshop FPGAs for Custom Computing Machines 1997, 242-250.
- Kolmogorov V, Zabih R: Computing visual correspondence with occlusions using graph cuts. ICCV 2001, 2: 508-515.
- Xiao J, Shah M: Motion layer extraction in the presence of occlusion using graph cuts. IEEE Trans Pattern Anal Mach Intell 2005, 27(10): 1644-1659.
- Felzenszwalb PF, Huttenlocher DR: Efficient belief propagation for early vision. Proceedings of the 2004 IEEE Computer Society Conference on Computer Vision and Pattern Recognition 2004, 1: I261-I268.
- Zheng NN, Sun J, Shum HY: Stereo matching using belief propagation. IEEE Trans Pattern Anal Mach Intell 2003, 25(7): 787-800. 10.1109/TPAMI.2003.1206509
- MacCormick J, Isard M: Estimating disparity and occlusions in stereo video sequences. Asian Conference on Computer Vision (ACCV) 2006, 32-41.
- Yang Q, et al.: Real-time global stereo matching using hierarchical belief propagation. The British Machine Vision Conference 2006.
- Mignotte M, Jodoin P-M, St-Amour J-F: Markovian energy-based computer vision algorithms on graphics hardware. ICIAP'05, LNCS 2005, 3617: 592-603.
- Park S, Chen C, Jeong H: VLSI Architecture for MRF Based Stereo Matching. 7th International Workshop SAMOS 2007, 55-64.
- Park S, Jeong H: Memory-efficient iterative process on a two-dimensional first-order regular graph. Opt Lett 2008, 33(1).
- Banz Christian, et al.: Real-time stereo vision system using semi-global matching disparity estimation: Architecture and FPGA-implementation. International Conference on Embedded Computer Systems (SAMOS) 2010, 93-101.
- Shental N, et al.: Learning and inferring image segmentations using the GBP typical cut algorithm. ICCV 2003, 1243-1250.
- Park S, Jeong H: Real-time stereo vision FPGA chip with low error rate. International Conference on Multimedia and Ubiquitous Engineering 2007, 751-756.
- Forstmann S, et al.: Real-time stereo by using dynamic programming. CVPR, Workshop on Real-Time 3D Sensors and Their Use 2004.
- Park S, Jeong H: High-speed parallel very large scale integration architecture for global stereo matching. J Electron Imaging 2008, 17(1): 010501. 10.1117/1.2892680
- Jin Seunghun, et al.: FPGA design and implementation of a real-time stereo vision system. IEEE Trans Circuits Syst Video Technol 2010, 20(1): 15-26.
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.