- Research
- Open Access
High-performance hardware architectures for multi-level lifting-based discrete wavelet transform
- Anand D Darji^{1},
- Shailendra Singh Kushwah^{2},
- Shabbir N Merchant^{1} and
- Arun N Chandorkar^{1}
https://doi.org/10.1186/1687-5281-2014-47
© Darji et al.; licensee Springer. 2014
- Received: 26 January 2014
- Accepted: 29 September 2014
- Published: 4 October 2014
Abstract
In this paper, three hardware-efficient architectures for multi-level 2-D discrete wavelet transform (DWT) using lifting (5, 3) and (9, 7) filters are presented. They are classified as folded multi-level architecture (FMA), pipelined multi-level architecture (PMA), and recursive multi-level architecture (RMA). An efficient FMA is proposed using a dual-input Z-scan block (B1) with 100% hardware utilization efficiency (HUE). A modular PMA is proposed using block (B1) and a dual-input raster scan block (B2) with 60% to 75% HUE. Blocks B1 and B2 are micro-pipelined so that the critical path is a single adder for the lifting (5, 3) filter and a single multiplier for the lifting (9, 7) filter. Clock gating is used in PMA to save power and area. A hardware-efficient RMA is proposed using block (B1) and a single-input recursive block (B3). Block (B3) uses a single processing element to compute both predict and update; thus, 50% of the multipliers and adders are saved. Processing two inputs per clock cycle reduces the total frame computing cycles, the latency, and the on-chip line buffers. PMA for five-level 2-D wavelet decomposition is synthesized using Xilinx ISE 10.1 for the Virtex-5 XC5VLX110T field-programmable gate array (FPGA) target device (Xilinx, Inc., San Jose, CA, USA). The proposed PMA achieves a high operating frequency owing to pipelining. Moreover, this approach significantly reduces the total computing cycles compared with existing multi-level architectures. RMA for three-level 2-D wavelet decomposition is synthesized using Xilinx ISE 10.1 for the Virtex-4 VFX100 FPGA target device.
Keywords
- Clock gating
- DWT
- Dual-scan architecture
- Folding
- FPGA
- Lifting
1 Introduction
In recent years, the multi-level two-dimensional discrete wavelet transform (2-D DWT) has been used in many applications, such as image and video compression (JPEG 2000 and MPEG-4), implantable neuroprosthetics, biometrics, image processing, and signal analysis, owing to the good energy compaction in higher-level DWT coefficients. To meet application constraints such as speed, power, and area, there is a strong demand for hardware-efficient VLSI architectures. DWT provides a high compression ratio without the blocking artifacts that deprive a reconstructed image of the desired smoothness and continuity. DWT can be implemented using either the convolution scheme or the lifting scheme. However, convolution-based DWT faces practical obstacles, such as higher computational complexity and larger memory requirement. Therefore, Sweldens et al.[1] proposed the lifting wavelet, which is also known as the second-generation wavelet. The computational complexity and memory requirement of the lifting scheme are much lower than those of convolution. Several architectures have been proposed to perform lifting-based DWT; they differ in the number of multipliers, adders, registers, and line buffers required and in the scanning scheme adopted. 2-D DWT can be computed by applying 1-D DWT row-wise, which produces low-frequency (L) and high-frequency (H) sub-bands, and then processing these sub-bands column-wise to compute one approximate (LL) and three detail (LH, HL, HH) coefficients.
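To make the row-then-column decomposition concrete, the following is a minimal Python sketch of one decomposition level using the reversible lifting (5, 3) filter as defined for JPEG 2000. It is an illustration under stated assumptions, not the hardware data path proposed in this paper: the function names `lift53_1d` and `dwt53_2d`, the symmetric boundary extension, and the sub-band naming (column low-pass of L gives LL, column high-pass of L gives LH) are choices made for the sketch.

```python
def lift53_1d(x):
    """One level of the reversible (5, 3) lifting transform on an even-length list of ints.

    Returns (low, high). Boundaries use symmetric extension (an assumption of this sketch).
    """
    n = len(x)
    assert n % 2 == 0 and n >= 2
    half = n // 2
    even, odd = x[0::2], x[1::2]
    # Predict: high[i] = odd[i] - floor((even[i] + even[i + 1]) / 2)
    high = [odd[i] - ((even[i] + even[min(i + 1, half - 1)]) >> 1) for i in range(half)]
    # Update: low[i] = even[i] + floor((high[i - 1] + high[i] + 2) / 4)
    low = [even[i] + ((high[max(i - 1, 0)] + high[i] + 2) >> 2) for i in range(half)]
    return low, high


def _column_pass(block):
    """Apply lift53_1d to every column of a row-major block; return (low, high) blocks."""
    cols = [lift53_1d(list(c)) for c in zip(*block)]
    low = [list(r) for r in zip(*[c[0] for c in cols])]
    high = [list(r) for r in zip(*[c[1] for c in cols])]
    return low, high


def dwt53_2d(img):
    """One-level 2-D DWT: row pass producing L and H, then a column pass on each."""
    rows = [lift53_1d(list(r)) for r in img]
    L = [lo for lo, _ in rows]
    H = [hi for _, hi in rows]
    LL, LH = _column_pass(L)
    HL, HH = _column_pass(H)
    return LL, LH, HL, HH
```

For a constant image, all detail sub-bands are zero and LL reproduces the constant, which is a quick sanity check of the predict and update steps.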
Jou et al. proposed an architecture with a straightforward implementation of the lifting steps; this architecture therefore has a long critical path[2]. An efficient pipeline architecture with a critical path of only one multiplier was proposed by Wu and Lin[3] by merging the predict and update stages. Lai et al. implemented a dual-scan 2-D DWT design based on the algorithm proposed by Wu et al., with a critical path delay of one multiplier and a throughput of 2-input/2-output at the cost of more pipeline registers[4]. A dual-scan architecture with one multiplier as the critical path was also proposed by Zhang et al. at the cost of a complex control path[5]. Hsia et al.[6] proposed a memory-efficient dual-scan 2-D lifting DWT architecture with a temporal buffer of 4N and a critical path of two multipliers and four adders. Recently, a dual-scan parallel flipping architecture was introduced with a critical path of one multiplier, fewer pipeline registers, and a simple control path[7].
Several lifting-based multi-level 2-D DWT architectures have been suggested for efficient VLSI implementation[3, 8–15], taking into consideration aspects such as memory, power, and speed. These architectures can be classified as folded architectures[8], parallel architectures[14], and recursive architectures[16]. Andra et al.[8] proposed a simple folded architecture that performs several stages of 2-D DWT with the help of memory. A systolic architecture was proposed by Huang et al.[9] for a DWT filter with finite length. Multi-level 2-D DWT decomposition has been implemented by recursive architecture (RA)[10, 16], but these approaches demand a large amount of frame buffer to store the intermediate LL output and also require a complex control path. Wu and Lin[3] proposed a folded scheme, where multi-level DWT computation is performed level by level with memory and a single processing element. Unlike RA, the folded architecture uses simple control circuitry and has 100% hardware utilization efficiency (HUE). Folded architectures consist of memory and a pair of 1-D DWT modules, a row processing unit (RPU) and a column processing unit (CPU). Mohanty and Meher[12] proposed a multi-level architecture that achieves high throughput at the cost of a larger number of adders and multipliers. Xiong et al.[13] proposed two line-based high-speed architectures employing parallel and pipelining techniques, which perform j-level decomposition of an N × N image in approximately 2(1 - 4^{-j})N^{2}/3 and (1 - 4^{-j})N^{2}/3 clock cycles, respectively. Hsia et al.[17] proposed a symmetric mask-based scheme to compute 2-D integer lifting DWT, where a separate mask is used for each sub-band. Mask-based algorithms do not require temporal buffers, but they are not suitable for area-efficient implementation because they require a large number of adders and multipliers. A memory-efficient architecture with low latency was proposed by Lai et al.[4].
Al-Sulaifanie et al.[18] designed an architecture that is independent of the input image size, with moderate speed and HUE. Recently, Aziz and Pham[15] proposed a parallel architecture for lifting (5, 3) multi-level 2-D DWT with a single processing unit to calculate both predict and update values. However, this architecture requires 4N line buffers for single-level decomposition and has lower HUE because alternate clock cycles are wasted. Memory management is key to designing multi-level 2-D DWT architectures. A lifting-based multi-level 2-D DWT architecture using an overlapped stripe-based scanning method has also been suggested[19].
Three types of memory buffer are generally used in any 2-D DWT architecture, i.e., frame, transposition, and temporal memory. Frame memory is required to store intermediate LL coefficients, transposition memory is mainly required for storing the row processor output, and temporal memory is required to store partial results during column processing. Temporal and transposition memories are on-chip, while the frame memory is on- or off-chip, depending upon the architecture. The size of the transposition memory is determined by the method adopted to scan the external memory. Different scanning techniques have been proposed, such as line-based, block-based, and stripe-based[14, 19–21].
This paper is organized as follows. Section 2 provides a brief overview of lifting scheme. In Section 3, design of three efficient multi-level architectures and their modules are discussed. Performance comparison, field-programmable gate array (FPGA) implementation and timing analysis are described in Section 4. Conclusions are presented in Section 5.
2 Lifting scheme
3 Proposed architectures
In this section, the proposed dual-input Z-scan block (B1), dual-input raster scan block (B2), and single-input recursive block (B3) are discussed. All these blocks are designed around the lifting wavelet to perform single-level 2-D DWT. Blocks B1 and B2 are designed to process two inputs and generate two outputs at every clock cycle. The total number of clock cycles required to perform one-level decomposition of an N × N image is N^{2}/2 without considering latency. Then, three novel architectures for multi-level 2-D DWT are presented, i.e., folded multi-level architecture (FMA), pipelined multi-level architecture (PMA), and recursive multi-level architecture (RMA). The FMA is composed of a block (B1) and N^{2}/4 off-chip memory, whereas the PMA is composed of a block (B1) to perform first-level decomposition and blocks (B2) for the higher levels of decomposition. The RMA is composed of block (B1) and block (B3) to compute first-level and higher-level decomposition, respectively.
Three different architectures are suggested based on different VLSI optimization criteria such as area, power, speed, throughput, and memory. FMA is a straightforward design that requires simple control but has lower throughput and demands N^{2}/4 memory. PMA is designed to satisfy the need for high throughput at the cost of area, whereas RMA gives moderate throughput (higher than FMA but lower than PMA) and utilizes moderate area. Therefore, PMA can be deployed in applications that demand high throughput. FMA suits applications that have no memory constraint but need simple control. RMA can be deployed where area and power are constrained but throughput higher than that of FMA is still needed.
3.1 Scanning schemes
3.2 Predict/update module
Data flow of predict/update module
Clock | Input | D1 | D2 | P/U |
---|---|---|---|---|
1 | X_{1,1}:X_{1,3}:X_{1,2} | | | |
2 | X_{1,3}:X_{1,5}:X_{1,4} | X_{1,1} + X_{1,3} | X_{1,2} | |
3 | X_{1,5}:X_{1,7}:X_{1,6} | X_{1,3} + X_{1,5} | X_{1,4} | P/U_{1,1} |
4 | X_{1,7}:X_{1,9}:X_{1,8} | X_{1,5} + X_{1,7} | X_{1,6} | P/U_{1,2} |
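The timing in the data-flow table above can be reproduced with a small behavioral model. The sketch below is only an illustration: it assumes two register stages (D1/D2, then P/U) that all update on the same clock edge, and it uses the integer (5, 3) predict step as the combining function. The function name `predict_update_pipeline` and the sample values are hypothetical.

```python
def predict_update_pipeline(triples, combine):
    """Behavioral model of the two-stage predict/update data flow.

    triples: per-clock inputs (left_even, right_even, odd).
    combine: function(d1, d2) giving the P/U value (assumed, e.g. the (5, 3) predict).
    Returns a list of (clock, D1, D2, P/U) register contents visible during each clock.
    """
    d1 = d2 = pu = None
    trace = []
    for clock, (left, right, odd) in enumerate(triples, start=1):
        trace.append((clock, d1, d2, pu))
        # Clock edge: stage 2 consumes the old D1/D2, stage 1 captures the new input.
        pu = combine(d1, d2) if d1 is not None else None
        d1, d2 = left + right, odd
    return trace


# Hypothetical row X_{1,1}..X_{1,9}; inputs are grouped as in the table (X1:X3:X2, ...).
row = [3, 8, 5, 2, 7, 6, 1, 4, 9]
triples = [(row[i], row[i + 2], row[i + 1]) for i in range(0, 7, 2)]
predict_53 = lambda s, x: x - (s >> 1)  # integer (5, 3) predict step
trace = predict_update_pipeline(triples, predict_53)
```

In this model, D1 and D2 hold the first input group at clock 2 and the first P/U result appears at clock 3, matching the table.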
3.3 Dual-input Z-scan block (B1)
Data flow of block ( B1 ): image size 256×256
Clk | Input | 1-D DWT output | 2-D DWT output |
---|---|---|---|
1 | X_{1,1}:X_{1,2} | | |
2 | X_{2,1}:X_{2,2} | | |
3 | X_{1,3}:X_{1,4} | | |
4 | X_{2,3}:X_{2,4} | | |
5 | X_{1,5}:X_{1,6} | | |
6 | X_{2,3}:X_{2,4} | L_{1,1}:H_{1,2} | |
7 | X_{1,5}:X_{1,6} | L_{2,1}:H_{2,2} | |
8 | X_{2,5}:X_{2,6} | L_{1,2}:H_{1,3} | |
9 | X_{1,7}:X_{1,8} | L_{2,2}:H_{2,3} | |
10 | X_{2,7}:X_{2,8} | L_{1,3}:H_{1,4} | |
… | … | … | |
257 | X_{3,1}:X_{3,2} | L_{1,254}:H_{1,255} | |
258 | X_{4,1}:X_{4,2} | L_{2,254}:H_{2,255} | |
… | … | … | |
268 | X_{4,11}:X_{4,12} | L_{3,3}:H_{3,4} | LL_{1,1}:LH_{1,2} |
269 | X_{3,13}:X_{3,14} | L_{4,3}:H_{4,4} | HL_{2,1}:HH_{2,2} |
270 | X_{4,13}:X_{4,14} | L_{3,4}:H_{3,5} | LL_{1,2}:LH_{1,3} |
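The input ordering of the dual-input Z-scan can be sketched as a generator. This is a hedged reconstruction inferred from the input column at clocks 1 to 5 and 257 to 269 of the table above (two horizontally adjacent pixels per clock, alternating between the two rows of a row pair); the function name `z_scan` and the 1-based indexing are assumptions of the sketch, and the pipeline latency of the outputs is not modeled.

```python
def z_scan(n):
    """Yield (clock, (row, col), (row, col + 1)) for an n x n image, 1-based indices.

    Two pixels are read per clock; the scan alternates between the two rows of
    each row pair before moving on to the next pair of columns.
    """
    clock = 0
    for top in range(1, n, 2):          # row pairs (1, 2), (3, 4), ...
        for col in range(1, n, 2):      # column pairs (1, 2), (3, 4), ...
            for row in (top, top + 1):
                clock += 1
                yield clock, (row, col), (row, col + 1)


schedule = {clk: (a, b) for clk, a, b in z_scan(256)}
```

With this ordering, a 256 × 256 image takes 256 clocks per row pair and N^{2}/2 = 32,768 clocks in total, in line with the cycle count stated for one-level decomposition.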
3.4 Dual-input raster scan block (B2)
3.5 Single-input recursive block (B3)
Mapping of register bank and predict/update module
Flag | Operation | Predict/update module in1 | Predict/update module in2 | Predict/update module in3 |
---|---|---|---|---|
0 | Predict | On-line | R4 | R2 |
1 | Update | R4 | R8 | R6 |
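The mapping table above amounts to a flag-controlled multiplexer in front of the shared predict/update module. A minimal sketch of that selection follows; the register names come from the table, while `OPERAND_MAP`, `select_operands`, and the dictionary model of the register bank are hypothetical names introduced for illustration.

```python
# Operand routing copied from the mapping table: flag -> (operation, sources of in1..in3).
OPERAND_MAP = {
    0: ("Predict", ("On-line", "R4", "R2")),
    1: ("Update", ("R4", "R8", "R6")),
}


def select_operands(flag, online_sample, register_bank):
    """Return (operation, (in1, in2, in3)) for the shared predict/update module."""
    operation, sources = OPERAND_MAP[flag]
    values = tuple(
        online_sample if src == "On-line" else register_bank[src] for src in sources
    )
    return operation, values


# Hypothetical register contents, only to show the routing.
regs = {"R2": 2, "R4": 4, "R6": 6, "R8": 8}
```

Because one module serves both operations on alternate flags, only one set of arithmetic units is needed, which is the source of the saving claimed for block (B3).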
3.6 Multi-level design
3.6.1 Folded multi-level architecture
3.6.2 Pipelined multi-level architecture (PMA)
Function of control pin in PMA
Control (ctr) | Level of decomposition |
---|---|
000 | 1 |
001 | 2 |
010 | 3 |
011 | 4 |
100 | 5 |
3.6.3 Recursive multi-level architecture
Block (B1) produces an LL coefficient on alternate clock cycles, which is fed to block (B3) for multi-level processing. Block (B3) is designed to process one sample per clock cycle. Block (B3) also has a feedback mechanism that brings the higher-level LL coefficients back to its input to compute the next higher-level coefficients. A buffer of size less than N/2 is used to store intermediate LL coefficients. This buffer is divided into segments of different lengths, such as N/4, N/8, N/16, …, each storing one row of LL coefficients. These buffered LL coefficients are provided serially to block (B3) to obtain the next higher-level DWT coefficients. A multiplexer, operated by the control signal ctr_2, provides first-level and higher-level LL coefficients on alternate clock cycles. Thus, every clock cycle is utilized to process first-level and higher-level LL coefficients alternately. Valid multi-level coefficients are available at every fourth clock cycle. The depth of the data decreases fourfold at each level of operation, i.e., the first level reduces the data depth from N^{2} to N^{2}/4, the second level to N^{2}/16, and so on. This property of inherent compression is utilized for pushing the higher-level coefficients into buffers.
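The buffer partition and the fourfold reduction in data depth can be checked numerically. The sketch below assumes N is a power of two; `buffer_segments` and `ll_depth` are hypothetical helper names, and the number of segments is left as a parameter because the text does not fix it.

```python
def buffer_segments(n, count):
    """Segment lengths N/4, N/8, N/16, ... for the intermediate LL row buffer."""
    return [(n // 4) >> k for k in range(count)]


def ll_depth(n, level):
    """Number of LL coefficients remaining after `level` decomposition levels."""
    return n * n // 4 ** level


segments = buffer_segments(256, 3)   # [64, 32, 16]
total = sum(segments)                # 112, which is below N/2 = 128
```

Since N/4 + N/8 + N/16 + … is a geometric series bounded by N/2, the total stays below N/2 for any number of segments, consistent with the buffer size stated above.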
4 Performance analysis and comparison
Pipelining is done between the predict and update stages in FMA, PMA, and RMA to reduce the critical path delay from the conventional 4Ta + 2Tm to only Ta + Ts for lifting (5, 3) and Tm for lifting (9, 7), at the cost of a latency of a few clock cycles, where Ta, Tm, and Ts denote the adder, multiplier, and shifter delays, respectively. A total of 2N on-chip line buffers is required in the proposed FMA and PMA for lifting (5, 3) and j = 1 level, which is the lowest among the existing architectures. The latency of the proposed scheme is N cycles (without boundary treatment). The proposed RMA utilizes only one processing element to calculate both predict and update, resulting in a 50% reduction in the number of adders and multipliers.
4.1 Hardware complexity and timing analysis
Performance comparison of 2-D DWT architectures
Architecture | Multipliers | Adders | Buffer | Critical path | Throughput | Computing cycles | HUE (%) |
---|---|---|---|---|---|---|---|
DSA[10] | 12 | 16 | 4N | 4T_{m} + 8T_{a} | 1 | N^{2}/2 | 100 |
Wu[3] | 6 | 8 | 4N | T_{m} | 1 | N^{2} | 100 |
FA[13] | 10 | 16 | 5.5N | T_{m} + 2T_{a} | 2 | N^{2}/2 | 100 |
HA[13] | 18 | 32 | 5.5N | T_{m} + 2T_{a} | 4 | N^{2}/4 | 100 |
Lai[4] | 10 | 16 | 4N | T_{m} | 2 | N^{2}/2 | 100 |
Zhang[5] | 10 | 16 | 4N | T_{m} | 2 | N^{2}/2 | - |
Hsia[6] | 0 | 16 | 4N | 2T_{m} + 4T_{a} | 2 | 3N^{2}/4 | - |
Darji[7] | 10 | 16 | 4N | T_{m} | 2 | N^{2}/2 | 100 |
Hardware and time complexity comparison of the proposed FMA for lifting (5, 3) filter
Architecture | Multipliers/shifters | Adders/subtractors | On-chip memory | Off-chip memory | Computing time for j levels | Output latency | Critical path delay | HUE (%) |
---|---|---|---|---|---|---|---|---|
Andra[8] | 4 | 8 | N^{2} + 4N | 0 | 2N^{2}(1 - 4^{-j})/3 | 2N | 2Ta + 2Ts | 100 |
Wu[3] | 4 | 8 | 5N | N^{2}/4 | 2N^{2}(1 - 4^{-j})/3 | 2N | 2Ta + Tm | 100 |
Barua[22] | 4 | 8 | 5N | N^{2}/4 | 2N^{2}(1 - 4^{-j})/3 | 5N | Tm | 100 |
Xiong[13] FA | 4 | 8 | 3.5N | N^{2}/4 | 2N^{2}(1 - 4^{-j})/3 | N | 2Ta + Tm | 100 |
Xiong[13] HA | 8 | 16 | 3.5N | N^{2}/4 | N^{2}(1 - 4^{-j})/3 | N | 2Ta + Tm | 100 |
FMA | 4 | 8 | 2N | N^{2}/4 | 2N^{2}(1 - 4^{-j})/3 | N | Ta + Ts | 100 |
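The computing-time expression shared by most rows of the table above follows from a geometric series. As a brief derivation, assuming a folded design that processes level l on an (N/2^{l-1}) × (N/2^{l-1}) image at two inputs per clock cycle and ignoring latency (the symbol T_j for the total is introduced here only for convenience):

```latex
T_j \;=\; \sum_{l=1}^{j} \frac{1}{2}\left(\frac{N}{2^{\,l-1}}\right)^{2}
    \;=\; \frac{N^{2}}{2}\sum_{l=0}^{j-1} 4^{-l}
    \;=\; \frac{N^{2}}{2}\cdot\frac{1-4^{-j}}{1-\tfrac{1}{4}}
    \;=\; \frac{2N^{2}\left(1-4^{-j}\right)}{3}
```

An architecture that accepts four inputs per cycle halves this to N^{2}(1 - 4^{-j})/3, which matches the entry for the HA design.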
Comparison of hardware and time complexity of the proposed FMA for lifting (9, 7) filter
Architecture | Multipliers/shifters | Adders/subtractors | On-chip memory | Off-chip memory | Output latency | Computing time for j levels | Critical path delay | HUE (%) |
---|---|---|---|---|---|---|---|---|
Andra[8] | 32 | 32 | N^{2} | 0 | N^{2}/2 | 4N^{2}(1 - 4^{-j})/3 | 4Ta + 2Tm | 100 |
Wu[3] | 6 | 8 | 5.5N | N^{2}/4 | ∼ | 2N^{2}(1 - 4^{-j})/3 | Tm | 100 |
Barua[22] | 12 | 16 | 7N | N^{2}/4 | 7N | 2N^{2}(1 - 4^{-j})/3 | 2Ta + Tm | 100 |
Xiong[13] FA | 10 | 16 | 5.5N | N^{2}/4 | 2N | 2N^{2}(1 - 4^{-j})/3 | 2Ta + Tm | 100 |
Xiong[13] HA | 18 | 32 | 5.5N | N^{2}/4 | N | N^{2}(1 - 4^{-j})/3 | 2Ta + Tm | 100 |
FMA | 10 | 16 | 4N | N^{2}/4 | N | 2N^{2}(1 - 4^{-j})/3 | Ta + Tm | 100 |
Comparison of hardware and time complexity of the proposed PMA for lifting (5, 3) filter
Architecture | Multipliers/shifters | Adders/subtractors | On-chip memory (j = 1) | Off-chip memory (j = 1) | Computing cycles for j levels | Latency (j = 1) | Critical path delay | HUE (%) |
---|---|---|---|---|---|---|---|---|
Hasan[23] | 2j | j | 3N | 0 | O(N^{2}) | 3N | 2Ta + Ts | 100 |
Aziz[15] | 2j | 4j | 4N | 0 | $\sum_{m=1}^{j}\left(1+\frac{3N}{2^{m-1}}+\frac{N^{2}}{2^{2(m-1)}}\right)$ | 3N + 1 | 2Ta | 50 to 60 |
PMA | 4 + 6(j - 1) | 8 + 12(j - 1) | 2N | 0 | $2+\sum_{m=1}^{j}\left(10+\frac{N}{2^{m-1}}+\frac{N^{2}}{2^{2m-1}}\right)$ | N + 12 | Ta + Ts | 60 to 75 |
Line buffer comparison for PMA for lifting (5, 3) filter
Architecture | Memory for j = 1 | Memory for j = 5 |
---|---|---|
Aziz[15] | 4N | $4N+2N+N+\frac{N}{2}+\frac{N}{4}\approx 7.7N$ |
PMA | 2N | $2N+\frac{3N}{2}+\frac{3N}{4}+\frac{3N}{8}+\frac{3N}{16}\approx 4.8N$ |
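The five-level totals in the table above can be verified with exact rational arithmetic. The sketch below simply sums the per-level terms listed in the table, expressed as multiples of N; the function names are hypothetical, and extending the same pattern to other values of j is an assumption.

```python
from fractions import Fraction


def aziz_line_buffers(levels):
    """Line buffers of Aziz[15] in units of N: 4N at level 1, halving at each level."""
    return sum(Fraction(4, 2 ** m) for m in range(levels))


def pma_line_buffers(levels):
    """Line buffers of PMA in units of N: 2N for B1, then 3N/2, 3N/4, ... per B2 stage."""
    return Fraction(2) + sum(Fraction(3, 2 ** m) for m in range(1, levels))


aziz_total = aziz_line_buffers(5)   # 31/4  = 7.75
pma_total = pma_line_buffers(5)     # 77/16 = 4.8125
```

The exact sums are 7.75N and 4.8125N, which the table rounds to about 7.7N and 4.8N.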
The total number of computing cycles required by PMA for five-level decomposition of an image (size N × N) is shown in (6). All single-level processors (B1 and B2) work in parallel in PMA. Block (B1) utilizes $\left(\frac{N^{2}}{2}+10\right)$ clock cycles, and the remaining clock cycles are consumed in the four consecutive blocks (B2). So, most of the processing is done within N^{2}/2 clock cycles, and only a small number of additional cycles, i.e., (N/2 + 8), (N/4 + 8), (N/8 + 8), and (N/16 + 8), are needed to compute the further levels.
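As a rough numerical illustration of the cycle counts quoted in the paragraph above, the following sketch adds the B1 term and the per-stage B2 terms. It uses only the expressions stated in that paragraph; the function name `pma_cycles` and the generalization to a number of levels other than five are assumptions.

```python
def pma_cycles(n, levels=5):
    """Cycle count per frame from the text: (N^2/2 + 10) for B1 plus (N/2^k + 8) per B2 stage."""
    total = n * n // 2 + 10            # block B1, first level
    for k in range(1, levels):         # blocks B2, levels 2..levels
        total += n // 2 ** k + 8
    return total


cycles_512 = pma_cycles(512)   # 131,072 + 10 + 264 + 136 + 72 + 40
overhead = cycles_512 - 512 * 512 // 2
```

For a 512 × 512 image, the higher levels and pipeline fill add only a few hundred cycles to the N^{2}/2 needed by the first level.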
Comparison of hardware and time complexity of the proposed PMA for lifting (9, 7) filter
Architecture | Mohanty[14] | Mohanty[21] | Hu[19] | Proposed PMA |
---|---|---|---|---|
Scheme | LT | CV | LT | LT |
Multiplier | 6Px_{3} | 189 | $\frac{105S}{8}+6$ | 32 |
Adder | 32Px_{3}/3 | 294 | 21S+12 | 64 |
Registers | N(11x_{2} + 10x_{5}) | $\frac{21N}{4}+443$ | $3N+\frac{341S}{8}$ | 48 |
Line buffers | 0 | 0 | 0 | 4N + 24N |
ACT | $\frac{N^{2}}{P}$ | $\frac{N^{2}}{16}$ | $\frac{N^{2}}{2S}$ | $\frac{N^{2}}{2}+\sum \frac{N}{2^{2}}$ |
Critical path delay | 2Ta + Tm | ≈Tm | Ta + Tm | Ta + Tm |
Hardware and time complexity of the proposed RMA
Architecture | Multipliers/shifters | Adders/subtractors | On-chip memory (j = 1) | Off-chip memory (j = 1) | Output latency | Critical path delay | HUE (%) |
---|---|---|---|---|---|---|---|
RMA (5, 3) | 6 | 12 | 6.5N | 0 | 2N | 2Ta + Tm | 75 |
RMA (9, 7) | 16 | 24 | 12.5N | 0 | 4N | 2Ta + Tm | 75 |
4.2 FPGA Implementation
FPGA synthesis results for FMA: image size 256×256
Architecture | FMA (5, 3) | FMA (9, 7) |
---|---|---|
FPGA | Virtex-5 | Virtex-5 |
Device | 5VLX110TFF1136-3 | 5VLX110TFF1136-3 |
Slice LUTs | 494 (0%) | 1008 (1%) |
Slice registers | 633 (0%) | 1091 (1%) |
MUF (MHz) | 537 | 210 |
FPGA synthesis results of the proposed PMA for lifting (5, 3) 2-D DWT
Architecture | Dynamic power for j = 1 (mW) | Quiescent power for j = 1 (mW) | Total power for j = 1 (mW) | Frequency (MHz) | CLB slice count for j = 1 | CLB slice count for j = 5 | Throughput (frames/s) |
---|---|---|---|---|---|---|---|
Aziz[15] | 33.85 | 1,186.94 | 1,220.79 | 221.44 | 206 | 1,052 | 835 |
PMA without clk gating | 29.6 | 980.8 | 1,010.6 | 539 | 412 | 1,329 | 4,080 |
PMA with clk gating | 28 | 980.8 | 1,008.2 | 539 | 342 | 1,178 | 4,080 |
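Frame throughput follows from dividing the clock frequency by the cycles needed per frame. The sketch below is a back-of-the-envelope check rather than a reproduction of the authors' measurement: the cycle count for 512 × 512 frames uses the B1 and B2 terms given in Section 4.1, the full-HD estimate assumes two pixels per clock with no overhead, and `frames_per_second` is a hypothetical helper name.

```python
def frames_per_second(freq_hz, cycles_per_frame):
    """Frames processed per second at a given clock frequency."""
    return freq_hz / cycles_per_frame


# Five-level PMA on 512 x 512 frames: (N^2/2 + 10) plus (N/2^k + 8) for k = 1..4.
n = 512
cycles_512 = n * n // 2 + 10 + sum(n // 2 ** k + 8 for k in range(1, 5))
fps_512 = frames_per_second(539e6, cycles_512)

# Full-HD estimate at 537 MHz, two pixels per clock, overhead ignored (assumption).
fps_full_hd = frames_per_second(537e6, 1920 * 1080 / 2)
```

Under these assumptions the model gives a little under 4,100 frames/s for 512 × 512 frames, close to the 4,080 frames/s reported in the table, and about 518 frames/s for full HD, the figure quoted in the conclusions.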
5 Conclusions
In this paper, we have proposed high-performance FMA, PMA, and RMA with a dual-pixel scanning method for computing multi-level 2-D DWT. The architectures are compared on the basis of resources utilized and speed. Micro-pipelining is employed in the predict/update processing element to reduce the critical path to Ta + Ts and Ta + Tm for lifting (5, 3) and (9, 7) filters, respectively. Optimized single-level 2-D DWT blocks (B1), (B2), and (B3) are proposed for designing the multi-level architectures. The proposed FMA for lifting (5, 3) and lifting (9, 7) uses only 2N and 4N line buffers, respectively. The proposed PMA is simple, regular, and modular and can be cascaded for n-level decomposition. The PMA for lifting (5, 3) has a critical path delay of Ta + Ts. Moreover, it requires only 4.8N line buffers for five-level decomposition, thus reducing the line buffer requirement by approximately 50% compared with other similar designs. The proposed RMA uses 16 multipliers and 24 adders for n-level decomposition. Moreover, its line buffer requirement is independent of the level of decomposition. The proposed architectures are implemented on Xilinx Virtex family devices. The proposed FMA and PMA operate at a frequency of 537 MHz, which is sufficient to handle 518 full-HD frames (1,920 × 1,080 resolution) per second. The proposed PMA, when implemented on FPGA for five-level DWT, utilizes 1,178 slices and provides a throughput of 4,080 frames (512 × 512) per second, which is almost five times that of the existing design. The proposed RMA uses unique buffer management and only a single processing element for computing predict and update to save area and power. The Xilinx Virtex-4 implementation of RMA uses 1,822 (4%) and 1,040 (2%) slices for lifting (9, 7) and (5, 3) filters, respectively.
FPGA implementation of the proposed schemes shows higher operating frequency, lower latency, and lower power compared with other architectures with the same specifications. The proposed designs can be used for practical applications with power, area, and speed constraints. The proposed architectures are suitable for high-speed real-time systems such as image de-noising, on-line video streaming, watermarking, compression, and multi-resolution analysis.
Declarations
Authors’ Affiliations
References
- Daubechies I, Sweldens W: Factoring wavelet transforms into lifting steps. J. Fourier Anal. Appl 1998, 4: 247-269. 10.1007/BF02476026
- Jou J-M, Shiau Y-H, Liu C-C: Efficient VLSI architectures for the bi-orthogonal wavelet transform by filter bank and lifting scheme. In Proc. IEEE International Symposium on Circuits Systems (ISCAS). Sydney, New South Wales; 06–09 May 2001:529-532.
- Wu B, Lin C: A high-performance and memory-efficient pipeline architecture for the 5/3 and 9/7 discrete wavelet transform of JPEG2000 codec. IEEE Trans. Circuits Syst. Video Technol 2005, 15(12):1615-1628.
- Lai Y-K, Chen L-F, Shih Y-C: A high-performance and memory-efficient VLSI architecture with parallel scanning method for 2-D lifting-based discrete wavelet transform. IEEE Trans. Consum. Electron 2009, 55(2):400-407.
- Zhang W, Jiang Z, Gao Z, Liu Y: An efficient VLSI architecture for lifting-based discrete wavelet transform. IEEE Trans. Circuits Syst. II 2012, 59(3):158-162.
- Hsia C-H, Chiang J-S, Guo J-M: Memory-efficient hardware architecture of 2-D dual-mode lifting-based discrete wavelet transform. IEEE Trans. Circuits Syst. Video Technol 2013, 25(4):671-683.
- Darji A, Agrawal S, Oza A, Sinha V, Verma A, Merchant SN, Chandorkar A: Dual-scan parallel flipping architecture for a lifting-based 2-D discrete wavelet transform. IEEE Trans. Circuits Syst. II, Exp. Briefs 2014, 61(6):433-437.
- Andra K, Chakrabarti C, Acharya T: A VLSI architecture for lifting-based forward and inverse wavelet transform. IEEE Trans. Signal Process 2002, 50(4):966-977. 10.1109/78.992147
- Huang C-T, Tseng P-C, Chen L-G: Efficient VLSI architectures of lifting-based discrete wavelet transform by systematic design method. In Proceedings of the IEEE International Symposium on Circuits and Systems. Scottsdale, Arizona; 26–29 May 2002:565-568.
- Liao H, Mandal MK, Cockburn BF: Efficient architectures for 1-D and 2-D lifting-based wavelet transforms. IEEE Trans. Signal Process 2004, 52(5):1315-1326. 10.1109/TSP.2004.826175
- Chen P-Y: VLSI implementation for one-dimensional multilevel lifting-based wavelet transform. IEEE Trans. Comput 2004, 53(4):386-398. 10.1109/TC.2004.1268396
- Mohanty BK, Meher PK: VLSI architecture for high-speed/low-power implementation of multilevel lifting DWT. In Proceedings of the IEEE Asia Pacific Conference on Circuits and Systems. Singapore; 4–7 Dec 2006:458-461.
- Xiong C, Tian J, Liu J: Efficient architectures for two-dimensional discrete wavelet transform using lifting scheme. IEEE Trans. Image Process 2007, 16(3):607-614.
- Mohanty BK, Meher PK: Memory efficient modular VLSI architecture for high-throughput and low-latency implementation of multilevel lifting 2-D DWT. IEEE Trans. Signal Process 2011, 59(5):2072-2084.
- Aziz SM, Pham DM: Efficient parallel architecture for multi-level forward discrete wavelet transform processors. J. Comp. Elect. Eng 2012, 38: 1325-1335. 10.1016/j.compeleceng.2012.05.009
- Xiong C-Y, Tian J-W, Liu J: Efficient high-speed/low-power line-based architecture for two-dimensional discrete wavelet transform using lifting scheme. IEEE Trans. Circuits Syst. Video Technol 2006, 16(2):309-316.
- Hsia C-H, Guo J-M, Chiang J-S: Improved low-complexity algorithm for 2-D integer lifting-based discrete wavelet transform using symmetric mask-based scheme. IEEE Trans. Circuits Syst. Video Technol 2009, 19(8):1202-1208.
- Al-Sulaifanie AK, Ahmadi A, Zwolinski M: Very large scale integration architecture for integer wavelet transform. J. IET Comp. Digital Tech 2010, 4(6):471-483. 10.1049/iet-cdt.2009.0021
- Hu Y, Jong CC: A memory-efficient high-throughput architecture for lifting-based multi-level 2-D DWT. IEEE Trans. Signal Process 2013, 61(20):4975-4987.
- Angelopoulou ME, Masselos K, Cheung PY, Andreopoulos Y: Implementation and comparison of the 5/3 lifting 2-D discrete wavelet transform computation schedules on FPGAs. J. Signal Process. Syst 2008, 51(1):3-21. 10.1007/s11265-007-0139-5
- Mohanty BK, Meher PK: Memory-efficient high-speed convolution-based generic structure for multilevel 2-D DWT. IEEE Trans. Circuits Syst. Video Technol 2013, 23(2):353-363.
- Barua S, Carletta JE, Kotteri KA, Bell AE: An efficient architecture for lifting-based two-dimensional discrete wavelet transforms. J. Integration, VLSI J 2005, 38(3):341-352. 10.1016/j.vlsi.2004.07.010
- Varshney H, Hasan M, Jain S: Energy efficient novel architectures for the lifting-based discrete wavelet transform. IET Image Process 2007, 1(3):305-310. 10.1049/iet-ipr:20060140
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited.