Open Access

High-performance hardware architectures for multi-level lifting-based discrete wavelet transform

  • Anand D Darji¹ (email author)
  • Shailendra Singh Kushwah²
  • Shabbir N Merchant¹
  • Arun N Chandorkar¹
EURASIP Journal on Image and Video Processing 2014, 2014:47

https://doi.org/10.1186/1687-5281-2014-47

Received: 26 January 2014

Accepted: 29 September 2014

Published: 4 October 2014

Abstract

In this paper, three hardware-efficient architectures to perform multi-level 2-D discrete wavelet transform (DWT) using lifting (5, 3) and (9, 7) filters are presented. They are classified as folded multi-level architecture (FMA), pipelined multi-level architecture (PMA), and recursive multi-level architecture (RMA). An efficient FMA is proposed using a dual-input Z-scan block (B1) with 100% hardware utilization efficiency (HUE). A modular PMA is proposed with the help of block (B1) and a dual-input raster scan block (B2), with 60% to 75% HUE. Blocks B1 and B2 are micro-pipelined to achieve a critical path of a single adder and a single multiplier for the lifting (5, 3) and (9, 7) filters, respectively. The clock gating technique is used in the PMA to save power and area. A hardware-efficient RMA is proposed with the help of block (B1) and a single-input recursive block (B3). Block (B3) uses only a single processing element to compute both predict and update; thus, 50% of the multipliers and adders are saved. Processing dual inputs per clock cycle minimizes the total frame computing cycles, latency, and on-chip line buffers. The PMA for five-level 2-D wavelet decomposition is synthesized using Xilinx ISE 10.1 for the Virtex-5 XC5VLX110T field-programmable gate array (FPGA) target device (Xilinx, Inc., San Jose, CA, USA). The proposed PMA achieves a high operating frequency due to pipelining and significantly reduces the total computing cycles as compared to the existing multi-level architectures. The RMA for three-level 2-D wavelet decomposition is synthesized using Xilinx ISE 10.1 for the Virtex-4 VFX100 FPGA target device.

Keywords

Clock gating; DWT; Dual-scan architecture; Folding; FPGA; Lifting

1 Introduction

In recent years, the multi-level two-dimensional discrete wavelet transform (2-D DWT) has been used in many applications, such as image and video compression (JPEG 2000 and MPEG-4), implantable neuroprosthetics, biometrics, image processing, and signal analysis, owing to the good energy compaction in higher-level DWT coefficients. To meet application constraints such as speed, power, and area, there has been a huge demand for hardware-efficient VLSI architectures. DWT provides a high compression ratio without the blocking artifacts that deprive a reconstructed image of the desired smoothness and continuity. However, implementation of convolution-based DWT has many practical obstacles, such as higher computational complexity and larger memory requirements. Therefore, Sweldens et al. [1] proposed the lifting wavelet, also known as the second-generation wavelet. DWT can be implemented using the convolution scheme as well as the lifting scheme; the computational complexity and memory requirements of the lifting scheme are considerably lower than those of convolution. Several architectures have been proposed to perform lifting-based DWT, which differ in the number of multipliers, adders, registers, and line buffers required and in the scanning scheme adopted. The 2-D DWT can be computed by applying the 1-D DWT row-wise, which produces low-frequency (L) and high-frequency (H) sub-bands, and then processing these sub-bands column-wise to compute one approximate (LL) and three detail (LH, HL, HH) sub-bands.
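The row-then-column computation described above can be sketched as a small software model. This is an illustration only, not the paper's hardware: it uses the standard (5, 3) lifting coefficients α = -1/2 and β = 1/4 (given in Section 3.2) with zero boundary extension, and names the sub-bands so that the column transform of the row-lowpass output yields LL and LH.

```python
# Software sketch of separable 2-D DWT: 1-D (5,3) lifting along each row
# (producing L and H halves), then along each column of those halves,
# giving the LL, LH, HL, HH sub-bands. Zero boundary extension is assumed.
ALPHA, BETA = -0.5, 0.25  # (5,3) predict/update coefficients

def lift53(x):
    """One level of 1-D (5,3) lifting; returns (low, high) half-length lists."""
    n = len(x)
    g = lambda k: x[k] if 0 <= k < n else 0.0
    high = [g(2*i + 1) + ALPHA * (g(2*i) + g(2*i + 2)) for i in range(n // 2)]
    gh = lambda k: high[k] if 0 <= k < len(high) else 0.0
    low = [g(2*i) + BETA * (gh(i) + gh(i - 1)) for i in range(n // 2)]
    return low, high

def dwt2d_level(img):
    """img: N x N list of lists (N even) -> (LL, LH, HL, HH), each N/2 x N/2."""
    rows = [lift53(r) for r in img]
    L = [l for l, _ in rows]  # row-wise low-pass half
    H = [h for _, h in rows]  # row-wise high-pass half
    def cols(mat):
        lifted = [lift53([row[c] for row in mat]) for c in range(len(mat[0]))]
        lo = [[lifted[c][0][r] for c in range(len(lifted))] for r in range(len(mat) // 2)]
        hi = [[lifted[c][1][r] for c in range(len(lifted))] for r in range(len(mat) // 2)]
        return lo, hi
    LL, LH = cols(L)  # column lifting of the row-lowpass sub-band
    HL, HH = cols(H)  # column lifting of the row-highpass sub-band
    return LL, LH, HL, HH
```

On a constant image, all detail sub-bands are zero away from the boundary, which is the energy-compaction behaviour the paragraph refers to.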

Jou et al. proposed an architecture with a straightforward implementation of the lifting steps, which therefore has a long critical path [2]. An efficient pipelined architecture having a critical path of only one multiplier has been proposed by Wu and Lin [3] by merging the predict and update stages. Lai et al. implemented a dual-scan 2-D DWT design based on the algorithm proposed by Wu et al., with a critical path delay of one multiplier and a throughput of 2 inputs/2 outputs, at the cost of more pipeline registers [4]. A dual-scan architecture with one multiplier in the critical path has also been proposed by Zhang et al., at the cost of a complex control path [5]. Hsia et al. [6] proposed a memory-efficient dual-scan 2-D lifting DWT architecture with a temporal buffer of 4N and a critical path of two multipliers and four adders. Recently, a dual-scan parallel flipping architecture was introduced with a critical path of one multiplier, fewer pipeline registers, and a simple control path [7].

Several lifting-based 2-D DWT multi-level architectures have been suggested for efficient VLSI implementation [3, 8-15], taking into consideration various aspects like memory, power, and speed. These architectures can be classified as folded architectures [8], parallel architectures [14], and recursive architectures [16]. Andra et al. [8] proposed a simple folded architecture to perform several stages of 2-D DWT with the help of memory. A systolic architecture was proposed by Huang et al. [9] for a DWT filter with finite length. Multi-level 2-D DWT decomposition has been implemented by recursive architectures (RA) [10, 16], but these approaches demand a large amount of frame buffer to store the intermediate LL output and also require a complex control path. Wu and Lin [3] proposed a folded scheme, where multi-level DWT computation is performed level by level, with memory and a single processing element. Unlike the RA, the folded architecture uses simple control circuitry and has 100% hardware utilization efficiency (HUE). Folded architectures consist of memory and a pair of 1-D DWT modules, a row processing unit (RPU) and a column processing unit (CPU). Mohanty and Meher [12] proposed a multi-level architecture for high throughput at the cost of a larger number of adders and multipliers. Xiong et al. [13] proposed two line-based high-speed architectures employing parallel and pipelining techniques, which can perform j-level decomposition of an N × N image in approximately 2(1 - 4^-j)N²/3 and (1 - 4^-j)N²/3 clock cycles, respectively. Hsia et al. [17] proposed a symmetric mask-based scheme to compute the 2-D integer lifting DWT, where a separate mask is used for each sub-band. Mask-based algorithms do not require temporal buffers, but they are not suitable for area-efficient implementation due to their large adder and multiplier requirements. A memory-efficient architecture with low latency was proposed by Lai et al. [4].
Al-Sulaifanie et al. [18] designed an architecture that is independent of the input image size, with moderate speed and HUE. Recently, Aziz and Pham [15] proposed a parallel architecture for lifting (5, 3) multi-level 2-D DWT with a single processing unit to calculate both predict and update values. However, this architecture requires 4N line buffers for single-level decomposition and has lower HUE because alternate clock cycles are wasted. Memory management is key to the design of multi-level 2-D DWT architectures. A lifting-based multi-level 2-D DWT architecture using an overlapped stripe-based scanning method has also been suggested [19].

Three types of memory buffer are generally used in any 2-D DWT architecture: frame, transposition, and temporal memory. Frame memory is required to store intermediate LL coefficients, transposition memory is mainly required for storing row processor output, and temporal memory is required to store partial results during column processing. Temporal and transposition memories are on-chip, while the frame memory is on- or off-chip, depending upon the architecture. The size of the transposition memory is determined by the method adopted to scan external memory. Different scanning techniques have been proposed, such as line-based, block-based, and stripe-based [14, 19-21].

This paper is organized as follows. Section 2 provides a brief overview of the lifting scheme. In Section 3, the design of three efficient multi-level architectures and their modules is discussed. Performance comparison, field-programmable gate array (FPGA) implementation, and timing analysis are described in Section 4. Conclusions are presented in Section 5.

2 Lifting scheme

The lifting scheme is a hardware-efficient technique to perform DWT. It operates entirely in the spatial domain and has many advantages over the convolution method of computing DWT, such as in-place computation, symmetric forward and inverse transforms, and perfect reconstruction. The basic principle of the lifting scheme is to factorize the polyphase matrix of a wavelet filter into a sequence of alternating upper and lower triangular matrices and a diagonal matrix, converting the filter implementation into banded matrix multiplications [1], as shown by (1) and (2). Here, h_e(z) and g_e(z) (h_o(z) and g_o(z)) represent the even (odd) parts of the low-pass and high-pass filters, respectively; s_i(z) and t_i(z) denote the predict and update lifting polynomials; and K and 1/K are scale normalization factors. Factorization of a classical wavelet filter into lifting steps reduces the computational complexity by up to 50%.
P(z) = \begin{pmatrix} h_e(z) & h_o(z) \\ g_e(z) & g_o(z) \end{pmatrix}
(1)
P(z) = \prod_{i=1}^{m} \begin{pmatrix} 1 & s_i(z) \\ 0 & 1 \end{pmatrix} \begin{pmatrix} 1 & 0 \\ t_i(z) & 1 \end{pmatrix} \begin{pmatrix} K & 0 \\ 0 & 1/K \end{pmatrix}
(2)
In general, every perfect-reconstruction filter bank can be expressed in terms of lifting steps. Figure 1 shows a block diagram of the forward lifting scheme used to compute the 1-D DWT. It is composed of three stages: split, predict, and update. In the split stage, the input sequence is divided into two subsets, an even-indexed sequence and an odd-indexed sequence. During the second stage, the even-indexed sequence is used to predict the odd-indexed sequence. Two consecutive even-indexed samples and one odd-indexed sample of the input X(i) are used to calculate a high-pass (detail) coefficient W_H(i), as given by (3). Two consecutive high-pass coefficients (present and previous) and one even-indexed input sample are used to calculate a low-pass (approximate) coefficient W_L(i), as given by (4).
W_H(i) = X(2i-1) + \alpha \left[ X(2i) + X(2i-2) \right]
(3)
W_L(i) = X(2i-2) + \beta \left[ W_H(i) + W_H(i-1) \right]
(4)
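As a concrete check of (3) and (4), the snippet below evaluates one coefficient pair directly from the equations, with α = -1/2 and β = 1/4 (the (5, 3) values given in Section 3.2), 1-based indexing as in the text, and X treated as 0 outside its range (zero boundary extension, as used in Section 3.3).

```python
# Direct transcription of equations (3) and (4) for one coefficient pair.
ALPHA, BETA = -0.5, 0.25  # lifting (5,3) predict/update coefficients

def coeff_pair(X, i):
    """Return (W_H(i), W_L(i)) for a 1-based sample sequence X(1..n)."""
    x = lambda k: X[k - 1] if 1 <= k <= len(X) else 0.0
    # (3): one odd-indexed sample predicted from its two even neighbours
    w_h = lambda j: x(2*j - 1) + ALPHA * (x(2*j) + x(2*j - 2)) if j >= 1 else 0.0
    # (4): even-indexed sample updated with two consecutive detail coefficients
    w_l = x(2*i - 2) + BETA * (w_h(i) + w_h(i - 1))
    return w_h(i), w_l
```

For a linear ramp such as X = 1, 2, 3, ..., the interior detail coefficients W_H(i) come out as zero, which is exactly the prediction property the lifting step is built on.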
Figure 1

Lifting scheme[11].

3 Proposed architectures

In this section, the proposed dual-input Z-scan block (B1), dual-input raster scan block (B2), and single-input recursive block (B3) are discussed. All these blocks are designed around the lifting wavelet to perform single-level 2-D DWT. Blocks B1 and B2 are designed to process two inputs and generate two outputs at every clock cycle. The total number of clock cycles required to perform one-level decomposition of an N × N image is N²/2, without considering latency. Three novel architectures for multi-level 2-D DWT are then presented, i.e., the folded multi-level architecture (FMA), the pipelined multi-level architecture (PMA), and the recursive multi-level architecture (RMA). The FMA is composed of block (B1) and N²/4 off-chip memory, whereas the PMA is composed of block (B1) to perform first-level decomposition and blocks (B2) for higher levels of decomposition. The RMA is composed of block (B1) and block (B3) to compute the first-level and higher-level decompositions, respectively.

The three architectures target different VLSI optimization criteria, such as area, power, speed, throughput, and memory. The FMA is a straightforward design that requires simple control but has lower throughput and demands N²/4 memory. The PMA is designed to satisfy the need for high throughput at the cost of area, whereas the RMA gives moderate throughput and utilizes moderate area. The RMA gives throughput higher than the FMA but lower than the PMA. Therefore, for applications that demand high throughput, the PMA can be deployed. For applications with no memory constraint that need simple control, the FMA design can be used. The RMA can be deployed where area and power are constrained but reasonably high throughput is still needed.

3.1 Scanning schemes

Pixels are accessed by block B1 in a dual-input Z-scan manner as shown in Figure 2. In this method, data scanning is optimized for simultaneous operation on two rows, which produces the coefficients required for vertical filtering, such that the latency involved in calculating 2-D DWT coefficients with boundary treatment is decreased and becomes independent of the image size. Two pixel values are read from the first row in a single clock cycle and processed by the 1-D DWT architecture in the next clock cycle; during that same cycle, two values are read from the next row. Block B2 accesses pixels in a dual-input raster scan manner as shown in Figure 3.
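The Z-scan order can be expressed as a small address generator; the pattern below is inferred from the data flow of Table 2 (two pixels of row r, then two pixels of row r + 1, advancing two columns at a time before moving to the next row pair), so it is a sketch of Figure 2 rather than the exact hardware counter.

```python
# Dual-input Z-scan address generator (inferred from Table 2, 1-based indices).
def z_scan_pairs(n):
    """Yield ((row, col), (row, col + 1)) pixel pairs in Z-scan order."""
    for r0 in range(1, n + 1, 2):      # row pairs (1,2), (3,4), ...
        for c in range(1, n + 1, 2):   # two columns per clock cycle
            for r in (r0, r0 + 1):     # alternate between the two rows
                yield (r, c), (r, c + 1)

pairs = list(z_scan_pairs(256))
```

The generator emits N²/2 pairs in total, matching the N²/2 clock cycles per one-level decomposition stated in Section 3, and its pair number 257 is (X3,1 : X3,2), matching clock 257 of Table 2.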
Figure 2

Dual-input Z-scanning technique.

Figure 3

Dual-input raster scanning technique.

3.2 Predict/update module

The main processing element in blocks B1, B2, and B3 is the predict/update module. Both the RPU and CPU consist of pipelined predict and update blocks to reduce the critical path. The predict and update operations have the same data path, so they can be realized as one generic processing element, as shown in Figure 4. 1-D DWT using the lifting (5, 3) filter requires multiplication by two filter coefficients, α and β, whose values are -1/2 and 1/4, respectively. Coefficient multiplication is implemented as an arithmetic right shift by one and two bits, respectively. We have used a simple and power-efficient hardwired scaling unit (HSU), in which shifters are replaced by hardwired connections. This optimizes the speed without compromising power and area. The HSU performing divide-by-two and divide-by-four operations on signed input is shown in Figure 5. In the case of (9, 7) lifting, multipliers are used to reduce the quantization noise generated due to the fractional values of the filter coefficients.
Figure 4

Generic architecture for predict/update module.

Figure 5

Hardwired scaling unit (HSU). (a) Divide by two. (b) Divide by four.

Both predict and update consist of two adders, two delay registers, and one multiplier or HSU. The predict/update module takes three inputs and produces two outputs per clock cycle. The entire predict/update operation is divided into two stages. In the first pipeline stage, in1 and in2 are added and stored in register D1, and at the same time in3 is stored in D2. In the second stage, the contents of D2 and the shifted or multiplied value of D1 are added to compute the predict value, as shown in Table 1. Registers D1 and D2 pipeline the predict/update module. Thus, the predict/update module takes two clock cycles to compute a predict/update coefficient.
Table 1

Data flow of predict/update module

Clock | Input          | D1          | D2   | P/U
----- | -------------- | ----------- | ---- | ------
1     | X1,1:X1,3:X1,2 |             |      |
2     | X1,3:X1,5:X1,4 | X1,1 + X1,3 | X1,2 |
3     | X1,5:X1,7:X1,6 | X1,3 + X1,5 | X1,4 | P/U1,1
4     | X1,7:X1,9:X1,8 | X1,5 + X1,7 | X1,6 | P/U1,2
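The two-cycle timing of Table 1 can be mimicked by a small behavioral model. This is a sketch of the register timing only, not the RTL: the scaling coefficient (α = -1/2) and the registering of the stage-2 result are assumptions chosen to reproduce the Table 1 schedule, where a coefficient becomes visible two cycles after its inputs.

```python
# Behavioral model of the pipelined predict/update module: stage 1 registers
# in1 + in2 into D1 and in3 into D2; stage 2 registers coeff*D1 + D2.
def predict_update_pipe(inputs, coeff=-0.5):
    """inputs: list of (in1, in2, in3); returns the per-cycle visible output
    (None while the pipeline is still filling)."""
    d1 = d2 = out_reg = None
    visible = []
    for in1, in2, in3 in inputs:
        visible.append(out_reg)                        # value seen this cycle
        nxt = None if d1 is None else coeff * d1 + d2  # stage-2 combinational
        d1, d2, out_reg = in1 + in2, in3, nxt          # clock edge: register all
    return visible

# Table 1 inputs for a linear row X1 = 1, 2, 3, ...: predict outputs are 0.
outs = predict_update_pipe([(1, 3, 2), (3, 5, 4), (5, 7, 6), (7, 9, 8)])
```

The first valid coefficient appears in cycle 3, matching the P/U column of Table 1.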

3.3 Dual-input Z-scan block (B1)

The proposed Z-scan-based block (B1) for lifting (5, 3) is composed of an RPU, a transpose unit (TU), and a CPU, as shown in Figure 6. The RPU and CPU each use two processing elements, for the predict and update operations. This block uses 2N line buffers and 8 registers for temporary storage. The structure of the CPU is identical to that of the RPU, except for the temporal buffers: two buffers of length N are used in the CPU. These buffers are initialized to zero for the zero boundary extension technique used for boundary treatment. Pixels are accessed in Z-scan manner as shown in Figure 2 and processed by the RPU to produce high-pass (H) and low-pass (L) 1-D coefficients in a single clock cycle. The 1-D coefficients are then given to the TU for transposition, so that the CPU can process them to produce 2-D DWT coefficients. Block (B1) takes two inputs and produces two 2-D DWT coefficients, (LL, LH) and (HL, HH), on alternate clock cycles. The predict module of the RPU uses three inputs: Din1, Din2, and a delayed version of Din2. Delay registers D5 and D6 are used to synchronize the predict output with Din2 (the input signal of the update module), because the predict operation consumes two clock cycles, as shown in Table 1. The RPU and CPU therefore take 6 and N + 4 clock cycles, respectively, and the TU consumes two clock cycles; hence, the latency of block (B1) to compute 2-D DWT is N + 12 clock cycles, as shown in Table 2. The latency reduces to only 12 clock cycles if the outputs produced by boundary treatment are counted. The advantage of the Z-scan is 100% HUE with a simpler control path than [10] and [13], which also achieve 100% HUE. The simple control path further reduces the power and area requirements.
Figure 6

Dual-input Z-scan block (B1) for lifting (5, 3) filter.

Table 2

Data flow of block (B1): image size 256 × 256

Clk | Input       | 1-D DWT output | 2-D DWT output
--- | ----------- | -------------- | --------------
1   | X1,1:X1,2   |                |
2   | X2,1:X2,2   |                |
3   | X1,3:X1,4   |                |
4   | X2,3:X2,4   |                |
5   | X1,5:X1,6   |                |
6   | X2,5:X2,6   | L1,1:H1,2      |
7   | X1,7:X1,8   | L2,1:H2,2      |
8   | X2,7:X2,8   | L1,2:H1,3      |
9   | X1,9:X1,10  | L2,2:H2,3      |
10  | X2,9:X2,10  | L1,3:H1,4      |
... | ...         | ...            |
257 | X3,1:X3,2   | L1,254:H1,255  |
258 | X4,1:X4,2   | L2,254:H2,255  |
... | ...         | ...            |
268 | X4,11:X4,12 | L3,3:H3,4      | LL1,1:LH1,2
269 | X3,13:X3,14 | L4,3:H4,4      | HL2,1:HH2,2
270 | X4,13:X4,14 | L3,4:H3,5      | LL1,2:LH1,3
The CPU accepts column-wise data, which is managed through the efficient TU architecture described in [4]. The TU exploits the decimation characteristic of the lifting scheme. It is composed of four registers and two 4 × 2 multiplexers. The input/output sequences and block diagram of the TU are shown in Figure 7, where S is the select input of the multiplexers. Block (B1) for the lifting (9, 7) filter can be obtained by using two instances of the predict and update modules with 4N memory buffers.
Figure 7

Transpose unit[4].

3.4 Dual-input raster scan block (B2)

A novel block (B2) with dual-input raster scan is proposed for the lifting (5, 3) DWT operation, as shown in Figure 8. Block (B2) is composed of one RPU and two CPUs. It takes two inputs at the rising edge of clk_2-j in raster scan manner and produces four outputs at every clock cycle. A TU is not required because of the two dedicated 1-D CPUs. The RPU of block (B2) works in the same manner as that of block (B1) and produces two 1-D DWT coefficients (L and H). The 1-D coefficients are then given to the two independent CPU blocks to produce 2-D DWT coefficients, as shown in Figure 8. Each CPU consists of three line buffers, LB1, LB2, and LB3, of length N/2 to process an N × N image. The predict operation of the CPU uses two previous rows of 1-D coefficients stored in LB1 and LB2, along with one current on-line coefficient as the third-row input from the RPU. Past predict values are stored in LB3 for the update operation. Therefore, the line buffers used by one CPU amount to 1.5N, and since block (B2) uses two instances of the CPU, a total of 3N line buffers is required. Block (B2) for the jth-level calculation uses two clock signals, clk_2-j and clk_2-j_pixel_valid. The RPU of block (B2) uses only clk_2-j, while the CPU uses both clock signals. The RPU remains busy in each clock cycle processing two incoming pixels per clock, which gives 100% HUE for the RPU. The CPU uses three rows of 1-D coefficients to produce one row of valid 2-D coefficients, as shown in Figure 9. But the RPU of block (B2) produces two rows of 1-D high and low coefficients, respectively, so the CPU has to wait N/2 clock cycles to save the required 1-D coefficients in line buffers LB1 and LB2 before starting the predict operation. Due to this, the output from the CPU of block (B2) is valid for N/2 clock cycles and remains constant for the next N/2 clock cycles, as shown in Figure 10. The gated clock clk_2-j_pixel_valid is used to save an extra N/2 buffer, which would otherwise be required in addition to LB3, during the intervals when the CPU does not produce valid coefficients.
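As a sanity check on the buffer accounting above, a quick tally (a sketch based on the sizes stated in the text: 2N for block B1 in Section 3.3, plus 3 × row-width per level-j block B2, where the row width N/2^(j-1) is inferred from the decimation between levels) reproduces the ≈4.8N figure quoted for the five-level PMA in Section 4.

```python
# Tally of PMA on-chip line buffers: 2N for block B1 (level 1), plus, for
# each level j >= 2, two CPUs whose line buffers total 3 times the level's
# row width N/2**(j-1) (1.5 * width per CPU).
def pma_line_buffers(n, levels=5):
    total = 2 * n
    for j in range(2, levels + 1):
        total += 3 * (n / 2 ** (j - 1))
    return total

ratio = pma_line_buffers(256) / 256  # 4.8125, i.e. the ~4.8N of Section 4
```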
Figure 8

Dual-input raster scan block (B2).

Figure 9

Data flow graph of CPU of block (B2) for lifting (5, 3) filter.

Figure 10

Block (B2) output control (image size 256 × 256).

3.5 Single-input recursive block (B3)

The single-input recursive block (B3), consisting of an RPU and a CPU, is shown in Figure 11. The RPU and CPU each use only a single processing unit (pre/upd) to compute predict and update alternately. The RPU, shown in Figure 12, consists of a register bank with eight registers and a predict/update module. The data in the register bank are right-shifted at every clock cycle. The predict/update module receives inputs from the register bank and performs the predict and update operations alternately, i.e., predict for two clock cycles and update for the next two clock cycles. The mapping of the register bank to the inputs of the predict/update module is shown in Table 3. The flag input determines whether a predict or update operation is performed: the flag remains '0' for two clock cycles and '1' for the next two clock cycles. The input pixel (on-line) is given to register R1, and outputs are taken from registers R4 and R8. The output of the predict/update module is written into registers R3 and R7. The complete data flow of the register bank of the RPU is shown in Figure 13.
Figure 11

Block diagram of recursive block B3.

Figure 12

RPU of recursive block B3 for lifting (5, 3) filter.

Table 3

Mapping of register bank and predict/update module

Flag | Operation | in1     | in2 | in3
---- | --------- | ------- | --- | ---
0    | Predict   | On-line | R4  | R2
1    | Update    | R4      | R8  | R6

Figure 13

Data flow of register bank of block B3.

The CPU consists of line buffers and a predict/update module, as shown in Figure 14, and works in a similar manner to the RPU. It contains four line buffers of size N to store 1-D coefficients and intermediate predict values, and the data flow of this memory bank is similar to that of the register bank. The 1-D coefficients for the first and second rows of input (1st and 2nd) are stored in line buffers LB1 and LB2, respectively. Then, when the first 1-D coefficient of the third row (3rd) arrives, the predict/update module starts operating on the third row (on-line) together with the data of LB1 and LB2, as shown in Figure 15. The predict/update module computes a predict value and stores it in the first location of LB3; at the same time, the CPU stores the third row of 1-D coefficients in LB2, whose current contents (the second row) are no longer needed. From then on, as the CPU stores the 1-D coefficients of the third row in LB2, the predict values are stored in LB3, moving from left to right, until all columns are processed. When the fourth row is on-line, the predict/update module computes update values from the outputs of LB1, LB3, and LB4; at the same time, the fourth row of 1-D coefficients is stored in LB1, whose current contents (the first row) are no longer needed. The process continues until all 2-D coefficients are generated for the input image. The proposed memory organization is highly efficient in that only four line buffers are used to store all intermediate 1-D coefficients and predict values. The RMA for lifting (9, 7) can be designed using two instances of the RPU and CPU in block (B3).
Figure 14

CPU of recursive block B3 for lifting (5, 3) filter.

Figure 15

Data flow for line buffers LB1, LB2, LB3, and LB4 of block B3.

3.6 Multi-level design

3.6.1 Folded multi-level architecture

A block diagram of the proposed folded multi-level architecture is shown in Figure 16. It is composed of the dual-input Z-scan block (B1), a multiplexer, and a memory module (RAM). The multiplexer selects input data either from the image, for the first-level decomposition, or from the RAM, for higher-level decompositions. Block (B1) takes two inputs and produces two outputs. Since LL coefficients are produced on alternate clock cycles, the output of block (B1) is saved in the RAM at half the clock rate. The complete data flow is shown in Table 2. A RAM of length N²/4 is used to store the jth-level LL coefficients to obtain the (j + 1)th level of decomposition. In this scheme, the next level of decomposition can start only after completion of the previous level, known as level-by-level decomposition. This procedure is repeated until the desired level of decomposition is obtained. The FMA for the lifting (9, 7) filter is implemented by replicating the predict and update modules in block (B1).
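Assuming the FMA processes each level at two pixels per cycle (the dual-input block B1), the level-by-level cycle count forms a geometric series whose sum matches the closed form 2N²(1 - 4^-j)/3 listed for the FMA in Table 6; the check below verifies this numerically.

```python
# Level k of the FMA consumes (N/2**(k-1))**2 / 2 cycles with dual inputs;
# summing over j levels reproduces the closed form 2*N**2*(1 - 4**(-j))/3.
def fma_cycles(n, j):
    return sum((n / 2 ** (k - 1)) ** 2 / 2 for k in range(1, j + 1))

def fma_closed_form(n, j):
    return 2 * n ** 2 * (1 - 4.0 ** (-j)) / 3
```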
Figure 16

Folded multi-level architecture (FMA).

3.6.2 Pipelined multi-level architecture (PMA)

The proposed PMA to perform five-level lifting (5, 3)-based 2-D DWT is shown in Figure 17. The architecture is composed of one instance of the dual-input Z-scan block (B1) and four instances of the dual-input raster scan block (B2). Here, blocks (B1) and (B2) perform the first-level and higher-level 2-D DWT decompositions, respectively. In the proposed PMA, all 2-D DWT modules operate in parallel with dual inputs. Each single-level processor computes 2-D DWT independently and outputs the low-frequency (LL) coefficients to the next level. The architecture takes two 8-bit inputs and produces four 16-bit output coefficients. The control path selects the number of decomposition levels according to the needs of the application, as described in Table 4. Due to the decimation property of the wavelet transform, the total computation at each level is reduced to one fourth of that of the preceding stage. To maintain dual-input processing at every level of decomposition, the clock rate is set to half that of the preceding stage through a timing control circuit. A serial-input parallel-output (SIPO) register is used between the 2-D DWT modules for pipelining, as shown in Figure 17. The SIPO block is composed of two registers, which are filled with the serial input coming from the previous pipeline stage at the rising edge of clock clk2-(j-1); the output is taken at half the clock rate (clk2-j), as shown in Figure 17. The PMA for the lifting (9, 7) filter can be implemented by replicating the predict and update modules twice in blocks (B1) and (B2).
Figure 17

PMA block diagram for five-level DWT.

Table 4

Function of control pin in PMA

Control (ctr) | Level of decomposition
------------- | ----------------------
000           | 1
001           | 2
010           | 3
011           | 4
100           | 5

Clock gating is a well-known method to reduce dynamic power dissipation in synchronous digital circuits by adding logic that prunes the clock tree (clock distribution network). Pruning the clock disables portions of the circuitry and thus saves power. Clock gating also saves significant die area, as it removes a large number of multiplexers and replaces them with clock-gating logic. Five clock signals, clk2-1, clk2-2, clk2-3, clk2-4, and clk2-5, are generated from clk. Clock clk2-j runs at 2-j times the rate of clk, where j indicates the decomposition level. The enable signal has a period of N/2, and the signals enable1, enable2, and enable3 are derived from it with periods N, 2N, and 4N, respectively. clk_2-2 is generated by a logical AND of clk2-2 and enable. clk_2-3 is generated by a logical AND of clk2-3, enable, and enable1, and clk_2-4 and clk_2-5 are generated in the same way. clk_2-2_pixel_valid is generated by a logical AND of clk2-2, enable, and enable1; similarly, clk_2-3_pixel_valid is generated from clk2-3, enable, enable1, and enable2, as shown in Figure 18. The clocks and enables used by the second-level and third-level blocks (B2) are shown in Figure 17. Every block (B2) operates on two gated clock signals, clk_2-j and clk_2-j_pixel_valid.
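A toy model of the clock division alone (the enable-based gating that masks further edges is omitted from this sketch) shows the intended rate relationship: each level's block B2 is clocked at half the rate of the preceding level.

```python
# Divided clocks clk2^-j modelled on a master-clock period counter:
# clk2^-j produces one rising edge every 2**j master periods.
def rising_edges(master_periods, j):
    return master_periods // 2 ** j

edges = [rising_edges(1 << 10, j) for j in range(1, 6)]  # levels 1..5
```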
Figure 18

Timing diagram of clock, gated clocks, and enable signals for lifting (5, 3)-based PMA. N is the input image width/height.

3.6.3 Recursive multi-level architecture

The proposed RMA is composed of blocks (B1) and (B3), as shown in Figure 19. Block (B1) is employed to process the input image to obtain the first-level 2-D DWT coefficients, and block (B3) works recursively to compute the multi-level coefficients. The RMA requires line buffers for temporary storage of the coefficients generated by block (B3). A multiplexer is used at the input of block (B3) to select the input sample either from block (B1) or from the line buffers. Careful management of the switching of this multiplexer utilizes the idle interleaved clock cycles of block (B3) to process higher-level coefficients, such that maximum hardware utilization is achieved.
Figure 19

Block diagram of RMA.

Block (B1) produces an LL coefficient on alternate clock cycles, which is fed to block (B3) for multi-level processing. Block (B3) is designed to operate on one sample per clock cycle. It also has a feedback mechanism that brings the higher-level LL coefficients back to the input to compute the next higher-level coefficients. A buffer of size less than N/2 is used to store intermediate LL coefficients. This buffer is divided into segments of different lengths, such as N/4, N/8, N/16, ..., each storing one row of LL coefficients. These buffered LL coefficients are serially provided to block (B3) to obtain the next higher-level DWT coefficients. A multiplexer, operated by the control signal ctr_2, provides the first- and higher-level LL coefficients on alternate clock cycles. Thus, every clock cycle is utilized to process the first- and higher-level LL coefficients alternately. Valid multi-level coefficients are available at every fourth clock cycle. The amount of data decreases fourfold at each level of operation, i.e., from N² to N²/4 at the first level, to N²/16 at the second level, and so on. This inherent compression property is exploited for pushing the higher-level coefficients into the buffers.
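The N/4 + N/8 + N/16 + ... partition described above is a geometric sum that always stays below N/2, which is consistent with the "buffer of size less than N/2" claim; the small check below verifies this for a 256 × 256 image (one row segment per level above the first, as stated in the text).

```python
# Intermediate-LL buffer of the RMA: one row segment of length N/2**j for
# each level j = 2..levels; the geometric sum always stays below N/2.
def rma_buffer(n, levels):
    return sum(n // 2 ** j for j in range(2, levels + 1))
```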

4 Performance analysis and comparison

In this section, the performance of the architectures is evaluated based on the following parameters: HUE, speed, computing time, output latency, line buffers, complexity of the control circuit, number of adders and multipliers, system power consumption, configurable logic block (CLB) slices, critical path delay, memory, and maximum frequency of operation. The %HUE is defined by (5).
\%\,\text{HUE} = \frac{\text{Number of blocks in use}}{\text{Total number of blocks}} \times 100
(5)

Pipelining is applied between the predict and update stages in the FMA, PMA, and RMA to reduce the critical path delay from the conventional 4Ta + 2Tm to only Ta + Ts for lifting (5, 3) and Tm for lifting (9, 7), at the cost of a latency of a few clock cycles. A total of 2N on-chip line buffers is required in the proposed FMA and PMA for lifting (5, 3) at the j = 1 level, which is the lowest among existing architectures. The latency of the proposed scheme is N cycles (without boundary treatment). The proposed RMA utilizes only one processing element to calculate both predict and update, resulting in a 50% reduction in the number of adders and multipliers.

4.1 Hardware complexity and timing analysis

Hardware complexity is mainly sensitive to the number of adders, multipliers, on-chip buffers, and the control path. A performance comparison of 2-D DWT architectures is given in Table 5. The hardware complexity of folded multi-level structures reported in the literature [3, 8, 10, 13, 22] is compared in Table 6 and Table 7 for j levels of decomposition. The HA structure in [13] appears to be the best in terms of computing cycles, but it requires 8 multipliers and 16 adders. The proposed FMA has only half the area requirement in terms of on-chip memory buffers, adders/subtractors, and multipliers/shifters as compared to the HA. Moreover, the critical path delay of the proposed FMA for lifting (5, 3) is Ta + Ts, which is the lowest among similar architectures. The structure of Wu and Lin [3] appears to be the best for lifting (9, 7) in terms of adders, multipliers, and critical path delay, but it requires 5.5N on-chip buffers for processing as compared to the proposed FMA. The critical path delay of the proposed FMA for (9, 7) is Ta + Tm, and it requires only 4N on-chip buffers. The hardware and time complexity of the proposed PMA for the lifting (5, 3) filter is compared in Table 8 with architectures of similar specifications described in the literature, such as [15] and [23]. The proposed PMA not only has the lowest critical path but also requires fewer computing cycles, as shown in Table 8. The extra two cycles in the total computing cycles are added due to transposition in block (B1). It must also be observed that the architecture proposed by Hasan et al. [23] utilizes fewer multipliers and adders but requires more line buffers to compute multi-level 2-D DWT. It is evident from Table 9 that the PMA uses 4.8N line buffers, while the architecture given in [15] requires 7.7N line buffers for five-level DWT decomposition using the lifting (5, 3) filter.
\frac{N^2}{2} + 10 + \left(\frac{N}{2} + 8\right) + \left(\frac{N}{4} + 8\right) + \left(\frac{N}{8} + 8\right) + \left(\frac{N}{16} + 8\right) = \frac{N^2}{2} + \frac{15N}{16} + 42
(6)
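Reading (6) term by term (an interpretation of the terms as written: N²/2 + 10 cycles from the first level plus N/2^(j-1) + 8 cycles contributed by each higher level j = 2..5), the sum collapses to the stated closed form, which the snippet below verifies numerically.

```python
# Numeric check of equation (6) for an N x N frame.
def pma_cycles(n):
    total = n ** 2 / 2 + 10              # first level (block B1)
    for j in range(2, 6):                # higher levels (blocks B2)
        total += n / 2 ** (j - 1) + 8
    return total

def pma_closed_form(n):
    return n ** 2 / 2 + 15 * n / 16 + 42
```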
Table 5

Performance comparison of 2-D DWT architectures

Architecture | Mul. | Add. | Buff. | C.P.      | Thr. | C.C.  | HUE (%)
------------ | ---- | ---- | ----- | --------- | ---- | ----- | -------
DSA[10]      | 12   | 16   | 4N    | 4Tm + 8Ta | 1    | N²/2  | 100
Wu[3]        | 6    | 8    | 4N    | Tm        | 1    | N²    | 100
FA[13]       | 10   | 16   | 5.5N  | Tm + 2Ta  | 2    | N²/2  | 100
HA[13]       | 18   | 32   | 5.5N  | Tm + 2Ta  | 4    | N²/4  | 100
Lai[4]       | 10   | 16   | 4N    | Tm        | 2    | N²/2  | 100
Zhang[5]     | 10   | 16   | 4N    | Tm        | 2    | N²/2  | -
Hsia[6]      | 0    | 16   | 4N    | 2Tm + 4Ta | 2    | 3N²/4 | -
Darji[7]     | 10   | 16   | 4N    | Tm        | 2    | N²/2  | 100

Mul., multipliers; Add., adders; Buff., buffers; C.P., critical path; Thr., throughput; C.C., computing cycles; P, parallel factor; HUE, hardware utilization efficiency.

Table 6 Hardware and time complexity comparison of the proposed FMA for lifting (5, 3) filter

| Architecture  | Multipliers/shifters | Adders/subtractors | On-chip memory | Off-chip memory | Computing time for j levels | Output latency | Critical path delay | HUE (%) |
|---------------|----------------------|--------------------|----------------|-----------------|-----------------------------|----------------|---------------------|---------|
| Andra [8]     | 4                    | 8                  | N² + 4N        | 0               | 2N²(1 - 4^(-j))/3           | 2N             | 2Ta + 2Ts           | 100     |
| Wu [3]        | 4                    | 8                  | 5N             | N²/4            | 2N²(1 - 4^(-j))/3           | 2N             | 2Ta + Tm            | 100     |
| Barua [22]    | 4                    | 8                  | 5N             | N²/4            | 2N²(1 - 4^(-j))/3           | 5N             | Tm                  | 100     |
| Xiong [13] FA | 4                    | 8                  | 3.5N           | N²/4            | 2N²(1 - 4^(-j))/3           | N              | 2Ta + Tm            | 100     |
| Xiong [13] HA | 8                    | 16                 | 3.5N           | N²/4            | N²(1 - 4^(-j))/3            | N              | 2Ta + Tm            | 100     |
| FMA           | 4                    | 8                  | 2N             | N²/4            | 2N²(1 - 4^(-j))/3           | N              | Ta + Ts             | 100     |

Input image size N × N and j ≥ 1. Ta, adder delay; Tm, multiplier delay; Ts, shifter delay.

Table 7 Comparison of hardware and time complexity of the proposed FMA for lifting (9, 7) filter

| Architecture  | Multipliers/shifters | Adders/subtractors | On-chip memory | Off-chip memory | Output latency | Computing time for j levels | Critical path delay | HUE (%) |
|---------------|----------------------|--------------------|----------------|-----------------|----------------|-----------------------------|---------------------|---------|
| Andra [8]     | 32                   | 32                 | N²             | 0               | N²/2           | 4N²(1 - 4^(-j))/3           | 4Ta + 2Tm           | 100     |
| Wu [3]        | 6                    | 8                  | 5.5N           | N²/4            | -              | 2N²(1 - 4^(-j))/3           | Tm                  | 100     |
| Barua [22]    | 12                   | 16                 | 7N             | N²/4            | 7N             | 2N²(1 - 4^(-j))/3           | 2Ta + Tm            | 100     |
| Xiong [13] FA | 10                   | 16                 | 5.5N           | N²/4            | 2N             | 2N²(1 - 4^(-j))/3           | 2Ta + Tm            | 100     |
| Xiong [13] HA | 18                   | 32                 | 5.5N           | N²/4            | N              | N²(1 - 4^(-j))/3            | 2Ta + Tm            | 100     |
| FMA           | 10                   | 16                 | 4N             | N²/4            | N              | 2N²(1 - 4^(-j))/3           | Ta + Tm             | 100     |

Input image size N × N and j ≥ 1. Ta, adder delay; Tm, multiplier delay; Ts, shifter delay; -, not available.

Table 8 Comparison of hardware and time complexity of the proposed PMA for lifting (5, 3) filter

| Architecture | Multipliers/shifters | Adders/subtractors | On-chip memory (j = 1) | Off-chip memory | Computing cycles for j levels | Latency (j = 1) | Critical path delay | HUE (%) |
|--------------|----------------------|--------------------|------------------------|-----------------|-------------------------------|-----------------|---------------------|---------|
| Hasan [23]   | 2j                   | j                  | 3N                     | 0               | O(N²)                         | 3N              | 2Ta + Ts            | 100     |
| Aziz [15]    | 2j                   | 4j                 | 4N                     | 0               | Σ_{m=1}^{j} [1 + 3N/2^(m-1) + N²/2^(2(m-1))] | 3N + 1 | 2Ta | 50 to 60 |
| PMA          | 4 + 6(j - 1)         | 8 + 12(j - 1)      | 2N                     | 0               | 2 + Σ_{m=1}^{j} [10 + N/2^(m-1) + N²/2^(2m-1)] | N + 12 | Ta + Ts | 60 to 75 |

Input image size N × N and j ≥ 1. Ta, adder delay; Tm, multiplier delay; Ts, shifter delay; j, level of decomposition.

Table 9 Line buffer comparison for PMA for lifting (5, 3) filter

| Architecture | Memory for j = 1 | Memory for j = 5                        |
|--------------|------------------|-----------------------------------------|
| Aziz [15]    | 4N               | 4N + 2N + N + N/2 + N/4 ≈ 7.7N          |
| PMA          | 2N               | 2N + 3N/2 + 3N/4 + 3N/8 + 3N/16 ≈ 4.8N  |

Input image size N × N.
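As a quick sanity check, the line buffer totals in Table 9 can be reproduced numerically. This is an illustrative sketch only; the function names are ours and not part of either design:

```python
# Line buffer totals for five-level lifting (5, 3) decomposition (Table 9).
# N is the image dimension; per-level buffer counts are taken from Table 9.

def aziz_buffers(N, levels=5):
    # Aziz [15]: 4N at the first level, then 2N, N, N/2, N/4 for later levels.
    sizes = [4 * N, 2 * N, N, N / 2, N / 4]
    return sum(sizes[:levels])

def pma_buffers(N, levels=5):
    # Proposed PMA: 2N at the first level, then 3N/2^m for levels m = 1..4.
    return 2 * N + sum(3 * N / 2 ** m for m in range(1, levels))

N = 512
print(aziz_buffers(N) / N)  # 7.75   (reported as ~7.7N)
print(pma_buffers(N) / N)   # 4.8125 (reported as ~4.8N)
```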

The total number of computing cycles required by the PMA for five-level decomposition of an N × N image is given in (6). All single-level processors (B1 and B2) work in parallel in the PMA. Block (B1) consumes N²/2 + 10 cycles, and the remaining clocks are consumed by the four consecutive (B2) blocks. Thus, most of the processing is completed within N²/2 clock cycles, and only a few additional cycles, i.e., (N/2 + 8), (N/4 + 8), (N/8 + 8), and (N/16 + 8), are needed to compute the further levels.
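The per-level cycle breakdown above can be checked against the closed form in (6) with a few lines of code (an illustrative sketch; the function names are ours):

```python
# Total PMA computing cycles for five-level decomposition of an N x N image.
# Sums the per-level terms behind (6): N^2/2 + 10 for block B1, then
# N/2^k + 8 for each of the four consecutive B2 blocks.

def pma_cycles(N):
    cycles = N ** 2 // 2 + 10                            # block B1
    cycles += sum(N // 2 ** k + 8 for k in range(1, 5))  # four B2 blocks
    return cycles

def closed_form(N):
    # Right-hand side of (6): N^2/2 + 15N/16 + 42.
    return N ** 2 // 2 + 15 * N // 16 + 42

N = 512
print(pma_cycles(N), closed_form(N))  # both 131594
```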

An architecture comparison of the proposed PMA with [14, 19, 21] for three-level 2-D DWT using the lifting (9, 7) filter is shown in Table 10. The proposed architecture uses the lowest number of multipliers and adders among the compared designs. The architectures in [14, 19, 21] do not use any line buffers, but at the cost of more computation logic and a more complex control path.
Table 10 Comparison of hardware and time complexity of the proposed PMA for lifting (9, 7) filter

| Architecture        | Mohanty [14]     | Mohanty [21] | Hu [19]      | Proposed PMA  |
|---------------------|------------------|--------------|--------------|---------------|
| Scheme              | LT               | CV           | LT           | LT            |
| Multipliers         | 6Px3             | 189          | 105S/8 + 6   | 32            |
| Adders              | 32Px3/3          | 294          | 21S + 12     | 64            |
| Registers           | N(11x2 + 10x5)   | 21N/4 + 443  | 3N + 341S/8  | 48            |
| Line buffers        | 0                | 0            | 0            | 4N + 24N      |
| ACT                 | N²/P             | N²/16        | N²/(2S)      | N²/2 + N²/2   |
| Critical path delay | 2Ta + Tm         | Tm           | Ta + Tm      | Ta + Tm       |

Input image size N × N and j = 3. LT, lifting-based; CV, convolution-based; P, number of samples processed per clock cycle; S, strip size; Ta, adder delay; Tm, multiplier delay; J = min(log₂M, log₂N); x1 = 2/3 × (1 - 4^(-L)); x2 = 1 - 2^(-L); x3 = 1 - 2^(-2L).

The hardware complexity of the proposed RMA for lifting (5, 3) and (9, 7) is shown in Table 11. The RMA for lifting (5, 3) uses a total of six multipliers, of which four are used in block (B1) and two in block (B3). In the case of the RMA for the lifting (9, 7) filter, a total of 16 multipliers are required, of which 10 are utilized in block (B1) and 6 in block (B3); the two additional multipliers are required to scale the output coefficients. Block (B1) uses a total of 2N and 4N line buffers for the lifting (5, 3) and (9, 7) filters, respectively, so the numbers of buffers and multipliers required for the (9, 7) filter are double those for (5, 3). The HUE of block (B1) is 100%, but block (B3) has some interleaved clock cycles, which results in an overall HUE of 75% for the RMA.
Table 11 Hardware and time complexity of the proposed RMA

| Architecture | Mult./shift. | Add./sub. | On-chip memory (j = 1) | Off-chip memory | Output latency | Critical path delay | HUE (%) |
|--------------|--------------|-----------|------------------------|-----------------|----------------|---------------------|---------|
| RMA (5, 3)   | 6            | 12        | 6.5N                   | 0               | 2N             | 2Ta + Tm            | 75      |
| RMA (9, 7)   | 16           | 24        | 12.5N                  | 0               | 4N             | 2Ta + Tm            | 75      |

Input image size N × N and j ≥ 1. Ta, adder delay; Tm, multiplier delay; Ts, shifter delay.

The hardware and time complexity of the proposed RMA for lifting (9, 7) is compared with the architectures proposed in [10] and [16] in Table 12. The proposed RMA uses a total of 16 multipliers, of which block (B1) uses 10 and block (B3) uses 6. The numbers of adders utilized by blocks (B1) and (B3) are 16 and 8, respectively. Block (B1) uses 4N line buffers, block (B3) consumes 8N, and 0.5N are used to store intermediate-level coefficients, resulting in a total of 12.5N line buffers in the RMA design. The proposed RMA is simple and reuses the same processing element to compute the predict and update values. It requires fewer multipliers, fewer adders, and less computing time than [16]. The architecture of Liao et al. [10] requires the lowest number of multipliers and adders because it uses only one 2-D DWT block, but it needs a more complex control path than the proposed RMA.
Table 12 Comparison of hardware and time complexity of the proposed RMA for (9, 7) with existing architectures

| Arch.      | Mult. | Add. | Line buffers           | Control complexity | Computing time |
|------------|-------|------|------------------------|--------------------|----------------|
| Liao [10]  | 12    | 16   | 10N(1 - 2^(-j))        | Medium             | N²/2           |
| Xiong [16] | 28    | 48   | 10N(1 - 2^(-j)) + 0.5N | Medium             | N²/4           |
| RMA        | 16    | 24   | 12.5N                  | Simple             | N²/2           |

Image size N × N and j ≥ 1. j, level of decomposition.
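The single-processing-element reuse in the RMA is possible because both lifting steps of the (5, 3) filter share the same arithmetic form, y = center + α(left + right), with α = -1/2 for predict and α = 1/4 for update. The following 1-D software sketch illustrates this shared form; it is ours for illustration only (the border handling by sample repetition is an assumption, not the paper's extension scheme):

```python
# One level of 1-D lifting (5, 3) DWT. Both steps apply the same operation
# pe(center, left, right, alpha) = center + alpha * (left + right), which is
# why a single processing element can be time-multiplexed between predict
# and update, as in the proposed RMA.

def pe(center, left, right, alpha):
    # The shared predict/update processing element.
    return center + alpha * (left + right)

def lifting_53(x):
    n = len(x)
    # Predict: detail d[i] = x[2i+1] - (x[2i] + x[2i+2]) / 2
    d = [pe(x[2 * i + 1], x[2 * i], x[min(2 * i + 2, n - 1)], -0.5)
         for i in range(n // 2)]
    # Update: approximation s[i] = x[2i] + (d[i-1] + d[i]) / 4
    s = [pe(x[2 * i], d[max(i - 1, 0)], d[i], 0.25)
         for i in range(n // 2)]
    return s, d

s, d = lifting_53([4, 4, 4, 4, 4, 4, 4, 4])
print(d)  # [0.0, 0.0, 0.0, 0.0]  constant input has zero detail
print(s)  # [4.0, 4.0, 4.0, 4.0]
```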

4.2 FPGA Implementation

The proposed FMA for the lifting (5, 3) and (9, 7) filters is implemented on a Xilinx Virtex-5 XC5VLX110T FPGA target device (Xilinx, Inc., San Jose, CA, USA), and the results are reported in Table 13. The maximum utilization frequency (MUF) of the FMA is very high as a result of its very low critical path delay, i.e., Ta + Ts for the lifting (5, 3) filter and Ta + Tm for the (9, 7) filter.
Table 13 FPGA synthesis results for FMA: image size 256 × 256

| Architecture    | FMA (5, 3)       | FMA (9, 7)       |
|-----------------|------------------|------------------|
| FPGA            | Virtex-5         | Virtex-5         |
| Device          | 5VLX110TFF1136-3 | 5VLX110TFF1136-3 |
| Slice LUTs      | 494 (0%)         | 1,008 (1%)       |
| Slice registers | 633 (0%)         | 1,091 (1%)       |
| MUF (MHz)       | 537              | 210              |

MUF, maximum utilization frequency.

The proposed PMA for five-level lifting (5, 3) DWT is synthesized using Xilinx ISE 10.1 for the Xilinx Virtex-5 XC5VLX110T FPGA target device, and the results are reported in Table 14. A 16-bit word length is used in the proposed scheme to allow higher-level decomposition without overflow. Since a pipelined processor element is used in the design, an operating frequency of 537 MHz is obtained. The CLB count reported in [15] is lower than that of the proposed scheme because of its folded processor element, but that approach leads to lower throughput. Throughput is calculated by (7). The proposed PMA utilizes 1,178 slices and provides a throughput of 4,080 frames/s for lifting (5, 3) at 512 × 512 frame resolution, which is almost five times that of [15]. Power is estimated using Xilinx XPower at a 100-MHz frequency and reported in Table 14. It is apparent from this comparison that the proposed scheme consumes less power than [15] because of the dual-scan technique and the smaller number of line buffers used in the design implementation.
Throughput = Fmax / (total clocks required for a frame transform)
(7)
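Plugging the five-level cycle count from (6) into (7) reproduces the reported frame rate (an illustrative sketch; the function name is ours):

```python
# Throughput from (7): frames per second = Fmax / clocks per frame,
# with clocks per frame for the five-level PMA taken from (6).

def frames_per_second(f_max_hz, N):
    clocks = N ** 2 // 2 + 15 * N // 16 + 42  # five-level PMA, from (6)
    return f_max_hz / clocks

# Fmax = 537 MHz, 512 x 512 frame -> 131,594 clocks per frame.
print(int(frames_per_second(537e6, 512)))  # 4080, matching Table 14
```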
Table 14 FPGA synthesis results of the proposed PMA for lifting (5, 3) 2-D DWT

| Architecture             | Dynamic power (mW) | Quiescent power (mW) | Total power (mW) | Frequency (MHz) | CLB slices (j = 1) | CLB slices (j = 5) | Throughput (frames/s) |
|--------------------------|--------------------|----------------------|------------------|-----------------|--------------------|--------------------|-----------------------|
| Aziz [15]                | 33.85              | 1,186.94             | 1,220.79         | 221.44          | 206                | 1,052              | 835                   |
| PMA without clock gating | 29.6               | 980.8                | 1,010.6          | 539             | 412                | 1,329              | 4,080                 |
| PMA with clock gating    | 28                 | 980.8                | 1,008.2          | 539             | 342                | 1,178              | 4,080                 |

Virtex-5 XC5VLX110T FPGA at 100 MHz; power reported for j = 1. Image size 512 × 512.

The proposed RMA is synthesized for the lifting (5, 3) and (9, 7) filters using Xilinx ISE 10.1 for the Xilinx Virtex-4 XC4VFX100 FPGA target device, and the results are reported in Table 15. The RMA implementation uses 1,822 (4%) and 1,040 (2%) slices for the lifting (9, 7) and (5, 3) filters, respectively. The outputs of the FMA using the lifting (9, 7) filter with j = 1 and of the RMA using the lifting (9, 7) filter with j = 3 are shown in Figure 20. The output of the PMA for the lifting (5, 3) filter for j = 1 to j = 5 levels for the input image Cameraman (256 × 256) is shown in Figure 21.
Table 15 Comparison of FPGA synthesis results for RMA: image size 256 × 256 and j = 3

| Arch.     | Liao [10]        | Xiong [16]       | RMA (5, 3)       | RMA (9, 7)       |
|-----------|------------------|------------------|------------------|------------------|
| FPGA      | Virtex-4         | Virtex-4         | Virtex-4         | Virtex-4         |
| Device    | 4VFX100FF1152-12 | 4VFX100FF1152-12 | 4VFX100FF1152-12 | 4VFX100FF1152-12 |
| LE/slices | 1,180            | 2,532            | 1,040            | 1,822            |

LE, logic elements or cells.

Figure 20

Multi-level output for input image size 256 × 256. (a) FMA lifting (9, 7) and j = 1. (b) RMA lifting (9, 7) and j = 3.

Figure 21

PMA output. (a) Original image. (b) One-level, (c) two-level, (d) three-level, (e) four-level, and (f) five-level lifting (5, 3) 2-D DWT decomposition of image size 256 × 256.

5 Conclusions

In this paper, we have proposed high-performance FMA, PMA, and RMA with a dual-pixel scanning method for computing multi-level 2-D DWT. The architectures are compared on the basis of resources utilized and speed. Micro-pipelining is employed in the predict/update processor element to reduce the critical path to Ta + Ts and Ta + Tm for the lifting (5, 3) and (9, 7) filters, respectively. Optimized single-level 2-D DWT blocks (B1), (B2), and (B3) are proposed to design the multi-level architectures. The proposed FMA for lifting (5, 3) and lifting (9, 7) uses only 2N and 4N line buffers, respectively. The proposed PMA is simple, regular, and modular, and can be cascaded for n-level decomposition. The PMA for lifting (5, 3) has a critical path delay of Ta + Ts. Moreover, it requires only 4.8N line buffers for five-level decomposition, reducing the line buffer requirement by approximately 50% compared with similar designs. The proposed RMA uses 16 multipliers and 24 adders for n-level decomposition; moreover, its line buffer requirement is independent of the level of decomposition. The proposed architectures are implemented on Xilinx Virtex family devices. The proposed FMA and PMA operate at a frequency of 537 MHz, which is sufficient to handle 518 full-HD frames of 1,920 × 1,080 resolution per second. The proposed PMA, when implemented on FPGA for five-level DWT, utilizes 1,178 slices and provides a throughput of 4,080 frames (512 × 512) per second, which is almost five times that of the existing design. The proposed RMA uses a unique buffer management scheme and only a single processing element for computing predict and update, saving area and power. The Xilinx Virtex-4 implementation of the RMA uses 1,822 (4%) and 1,040 (2%) slices for the lifting (9, 7) and (5, 3) filters, respectively.

The FPGA implementations of the proposed schemes show higher operating frequency, lower latency, and lower power compared with other architectures of the same specifications. The proposed designs can be used in practical applications with power, area, and speed constraints, and are suitable for high-speed real-time systems such as image de-noising, online video streaming, watermarking, compression, and multi-resolution analysis.

Declarations

Authors’ Affiliations

(1)
Department of Electrical Engineering, Indian Institute of Technology Bombay
(2)
Electronics Engineering Department, S. V. National Institute of Technology

References

  1. Daubechies I, Sweldens W: Factoring wavelet transforms into lifting steps. J. Fourier Anal. Appl. 1998, 4:247-269. 10.1007/BF02476026
  2. Jou J-M, Shiau Y-H, Liu C-C: Efficient VLSI architectures for the bi-orthogonal wavelet transform by filter bank and lifting scheme. In Proc. IEEE International Symposium on Circuits and Systems (ISCAS). Sydney, New South Wales; 06-09 May 2001:529-532.
  3. Wu B, Lin C: A high-performance and memory-efficient pipeline architecture for the 5/3 and 9/7 discrete wavelet transform of JPEG2000 codec. IEEE Trans. Circuits Syst. Video Technol. 2005, 15(12):1615-1628.
  4. Lai Y-K, Chen L-F, Shih Y-C: A high-performance and memory-efficient VLSI architecture with parallel scanning method for 2-D lifting-based discrete wavelet transform. IEEE Trans. Consum. Electron. 2009, 55(2):400-407.
  5. Zhang W, Jiang Z, Gao Z, Liu Y: An efficient VLSI architecture for lifting-based discrete wavelet transform. IEEE Trans. Circuits Syst. II 2012, 59(3):158-162.
  6. Hsia C-H, Chiang J-S, Guo J-M: Memory-efficient hardware architecture of 2-D dual-mode lifting-based discrete wavelet transform. IEEE Trans. Circuits Syst. Video Technol. 2013, 25(4):671-683.
  7. Darji A, Agrawal S, Oza A, Sinha V, Verma A, Merchant SN, Chandorkar A: Dual-scan parallel flipping architecture for a lifting-based 2-D discrete wavelet transform. IEEE Trans. Circuits Syst. II, Exp. Briefs 2014, 61(6):433-437.
  8. Andra K, Chakrabarti C, Acharya T: A VLSI architecture for lifting-based forward and inverse wavelet transform. IEEE Trans. Signal Process. 2002, 50(4):966-977. 10.1109/78.992147
  9. Huang C-T, Tseng P-C, Chen L-G: Efficient VLSI architectures of lifting-based discrete wavelet transform by systematic design method. In Proceedings of the IEEE International Symposium on Circuits and Systems. Scottsdale, Arizona; 26-29 May 2002:565-568.
  10. Liao H, Mandal MK, Cockburn BF: Efficient architectures for 1-D and 2-D lifting-based wavelet transforms. IEEE Trans. Signal Process. 2004, 52(5):1315-1326. 10.1109/TSP.2004.826175
  11. Chen P-Y: VLSI implementation for one-dimensional multilevel lifting-based wavelet transform. IEEE Trans. Comput. 2004, 53(4):386-398. 10.1109/TC.2004.1268396
  12. Mohanty BK, Meher PK: VLSI architecture for high-speed/low-power implementation of multilevel lifting DWT. In Proceedings of the IEEE Asia Pacific Conference on Circuits and Systems. Singapore; 4-7 Dec 2006:458-461.
  13. Xiong C, Tian J, Liu J: Efficient architectures for two-dimensional discrete wavelet transform using lifting scheme. IEEE Trans. Image Process. 2007, 16(3):607-614.
  14. Mohanty BK, Meher PK: Memory efficient modular VLSI architecture for high-throughput and low-latency implementation of multilevel lifting 2-D DWT. IEEE Trans. Signal Process. 2011, 59(5):2072-2084.
  15. Aziz SM, Pham DM: Efficient parallel architecture for multi-level forward discrete wavelet transform processors. J. Comp. Elect. Eng. 2012, 38:1325-1335. 10.1016/j.compeleceng.2012.05.009
  16. Xiong C-Y, Tian J-W, Liu J: Efficient high-speed/low-power line-based architecture for two-dimensional discrete wavelet transform using lifting scheme. IEEE Trans. Circuits Syst. Video Technol. 2006, 16(2):309-316.
  17. Hsia C-H, Guo J-M, Chiang J-S: Improved low-complexity algorithm for 2-D integer lifting-based discrete wavelet transform using symmetric mask-based scheme. IEEE Trans. Circuits Syst. Video Technol. 2009, 19(8):1202-1208.
  18. Al-Sulaifanie AK, Ahmadi A, Zwolinski M: Very large scale integration architecture for integer wavelet transform. IET Comp. Digital Tech. 2010, 4(6):471-483. 10.1049/iet-cdt.2009.0021
  19. Hu Y, Jong CC: A memory-efficient high-throughput architecture for lifting-based multi-level 2-D DWT. IEEE Trans. Signal Process. 2013, 61(20):4975-4987.
  20. Angelopoulou ME, Masselos K, Cheung PY, Andreopoulos Y: Implementation and comparison of the 5/3 lifting 2-D discrete wavelet transform computation schedules on FPGAs. J. Signal Process. Syst. 2008, 51(1):3-21. 10.1007/s11265-007-0139-5
  21. Mohanty BK, Meher PK: Memory-efficient high-speed convolution-based generic structure for multilevel 2-D DWT. IEEE Trans. Circuits Syst. Video Technol. 2013, 23(2):353-363.
  22. Barua S, Carletta JE, Kotteri KA, Bell AE: An efficient architecture for lifting-based two-dimensional discrete wavelet transforms. Integration, VLSI J. 2005, 38(3):341-352. 10.1016/j.vlsi.2004.07.010
  23. Varshney H, Hasan M, Jain S: Energy efficient novel architectures for the lifting-based discrete wavelet transform. IET Image Process. 2007, 1(3):305-310. 10.1049/iet-ipr:20060140

Copyright

© Darji et al.; licensee Springer. 2014

This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited.