# Power-optimized log-based image processing system

- Vaithiyanathan Dhandapani^{1} (Email author)
- Seshasayanan Ramachandran^{1}

**2014**:37

https://doi.org/10.1186/1687-5281-2014-37

© Dhandapani and Ramachandran; licensee Springer. 2014

**Received:** 21 March 2014

**Accepted:** 4 July 2014

**Published:** 21 July 2014

## Abstract

The continuous development of devices such as mobile phones and digital cameras has led to a higher amount of research being dedicated to the image processing field. Today's image-acquiring tools require battery-operated power, and hence, power optimization becomes a major factor to be considered in the hardware implementation of image systems. This paper proposes an image processing system which utilizes set partitioning in hierarchical trees (SPIHT)-integrated discrete wavelet transform (DWT) structure for image processing. The overall advantage of this proposal is achieved by modifying the arithmetic units in the DWT structure. Utilizing a logarithm-based floating point unit (FPU) in the DWT computation structures, the logarithmic number system (LNS) adaptation in the arithmetic unit results in overall accuracy enhancement with reduced area and power consumption. To ensure the versatility of the proposal and for further evaluating the performance and correctness of the structure, the model is implemented using Xilinx and Altera field-programmable gate array (FPGA) devices. The analyses obtained from the implementation show that the structure incorporated with the log-based FPU is 25% more accurate with 47% reduced power consumption than the integer-styled FPU incorporated DWTs, along with enhanced speed and optimal area utilization.

### Keywords

Discrete wavelet transform (DWT); Lifting scheme; Log principles; Floating point unit (FPU); Set partition in hierarchical trees (SPIHT); Image coding; Field-programmable gate array (FPGA) implementation; Real-time processing

## 1 Introduction

Discrete wavelet transform (DWT) is increasingly being used for image coding. In particular, biorthogonal symmetric wavelets have shown remarkable abilities in still image compression. Hence, this paper proposes an image processing system that focuses on the biorthogonal 9/7 DWT structure. DWT has traditionally been implemented using the convolution method. This implementation demands a large number of computations and a large amount of storage, which are not desirable for high-speed or low-power applications. Sweldens [1] proposed a new mathematical formulation for wavelet transformation based on spatial construction of the wavelets, and a very versatile scheme for its factorization has been suggested in [2]. This approach is called the lifting-based wavelet transform. The main feature of the lifting-based DWT scheme is to break up the high-pass and low-pass filters into a sequence of upper and lower triangular matrices and convert the filter implementation into banded matrix multiplications. This scheme has several advantages over the convolution technique, including ‘in-place’ computation of the DWT and symmetric forward and inverse transforms. The DWTs implemented using the lifting scheme in the JPEG 2000 standard are the biorthogonal lossless 5/3 integer and the lossy 9/7 floating point filter banks. Numerous architectures have been proposed in order to provide low-power, high-speed, and area-efficient hardware implementation for DWT computation [3–16]. Shi et al. [6] proposed an efficient folded architecture (EFA) with low hardware complexity. The flipping structure is another important DWT architecture that was proposed by Huang et al. [7]. A high-speed, reduced-area two-dimensional (2-D) DWT architecture was proposed by Zhang et al. [10]. While most of these architectures concentrate on optimizing the critical path, only some of them, such as Lee et al. [16], deal not only with the internal data path but also with coefficient precision optimization.

This paper focuses on the lossy biorthogonal 9/7 lifting-based DWT, which has higher computational complexity because of its floating point computations. Implementing this structure in hardware requires additional complex hardware to handle the floating point computations. This demands a separate processing unit, which leads to the design of the floating point unit (FPU). In the existing FPUs, the arithmetic computations are still the same as ordinary arithmetic logic unit (ALU) operations, so the FPU acts as an additional support for the normal ALU. An island-style architecture with embedded FPUs was proposed by Beauchamp et al. [17], while a coarse-grained FPU was suggested by Ho et al. [18]. Even et al. [19] suggested a multiplier that operates on either single-precision or double-precision floating point numbers. An optimized FPU in a hybrid FPGA was suggested by Yu et al. [20] and a configurable multimode FPU for FPGAs by Chong and Parameswaran [21]. Performance improvement and optimization of these models have been studied and employed in each successive development. However, while these models fine-tune the FPU in terms of area, there were no suggestions for power reduction or accuracy enhancement. Anand et al. [22] proposed a log lookup table (LUT)-based FPU, which utilizes a logarithmic principle to achieve good accuracy with reduced power consumption. However, this model has some serious drawbacks, which include increased delay and additional memory for the log LUT handling. These factors affect the performance in terms of area and speed. Hence, the proposed scheme suggests an efficient model for performing floating point operations that reduces power consumption by reducing the operation complexities using log conversion [23]. This reduces the overall computation burden, as the process is simply a numerical transformation to the logarithmic domain. Thus, a reduction in power consumption and increased accuracy are attained with optimal area usage [24]. Floating point numerals cannot be mapped directly, and hence, a standardized form is adopted by using the IEEE 754 single-precision floating point standard [25]. An optimized DWT architecture with log-based FPU is proposed, and a preliminary version of this work was presented in [26]. This paper revises the external memory access and gives a more accurate and detailed error analysis together with the simulation results.

After the lifting-based DWT was introduced, several coding algorithms were proposed to code the wavelet coefficients efficiently while taking storage space and redundancy into consideration. These algorithms include embedded zerotree wavelet (EZW), embedded block coding with optimized truncation (EBCOT), and set partitioning in hierarchical trees (SPIHT). Among these, the SPIHT algorithm is preferable because of its low computational complexity and better image compression performance. The SPIHT coding, proposed by Said and Pearlman in 1996 [27], does not require arithmetic coding and provides a cheaper and faster hardware solution. It was modified by Wheeler and Pearlman [28], who introduced a no list SPIHT (NLS) to reduce memory usage. Later, Corsonello et al. [29] proposed a low-cost implementation of NLS in order to improve the coding speed. The work in [30] modified the scanning process and utilized fixed memory allocation for the data list to reduce the hardware complexity. In order to achieve high throughput, Cheng et al. [31] proposed a modified SPIHT that processes a 4 × 4 bit plane in 1 cycle. Fry and Hauck [32] improved on this model with a bit plane parallel SPIHT encoder architecture to further increase the throughput. In 2013, Jin and Lee [33] proposed a block-based pass-parallel SPIHT (BPS) algorithm, which employs pipelining and parallelism. This scheme has the highest throughput among the existing architectures. Hence, we adopt the BPS in our image processing core.

This proposal introduces an enhanced image processing system, which utilizes a low-power DWT structure along with a log-based FPU and BPS coder. The optimized decomposition level of DWT is selected based on performance parameters such as peak signal-to-noise ratio, compression ratio, and computational complexity. To examine the specific hardware performance and trade-offs associated with the solutions presented here, the architecture is first verified in Matlab for the image parameters. In addition to this, the hardware implementation is carried out using Verilog hardware description language (HDL) and synthesized using Xilinx and Altera FPGA families to verify its device level performance based on VLSI parameters.

The rest of the paper is organized as follows. Section 2 gives the background on the lifting-based discrete wavelet transform and SPIHT coding techniques. Section 3 presents the hardware implementation of the forward 2-D DWT with a modified computation unit adopting the log-based FPU and SPIHT coders. Section 4 describes the experimental setup for the proposed real-time image processing system and compares the performance of the proposed architecture with that of other existing architectures. Conclusions and final remarks are given in Section 5.

## 2 Background

### 2.1 Discrete wavelet transform

#### 2.1.1 Lifting scheme

Let *h*(*z*) and *g*(*z*) be the low-pass and high-pass synthesis filters, respectively. The corresponding polyphase matrices are defined as:

1. *Splitting*. The original signal *X*(*n*) is split into odd and even sequences (lazy wavelet transform):

   $X_e(n) = X(2n)$ (4)

   $X_o(n) = X(2n+1)$ (5)

2. *Lifting*. It consists of one or more steps *m* of the form:

   (a) *Predict/Dual lifting*. If *X*(*n*) possesses local correlation, then *X*_{e}(*n*) and *X*_{o}(*n*) also have local correlation. Therefore, one subset is used to predict the other subset. In the prediction step, the filtered even array is used to predict the odd array. The new odd array is redefined as the difference between the existing array and the predicted one.

   $D(n) = X_o(n) - s_i(X_e(n))$ (6)

   (b) *Update/Primal lifting*. To eliminate aliasing, which appears due to the downsampling of the original signal, the even array is updated using the filtered new odd array.

   $A(n) = X_e(n) + t_i(D(n))$ (7)

   After *m* pairs of prediction and update steps, the even samples become the low-frequency component while the odd samples become the high-frequency component.

3. *Normalization/Scaling*. After *m* lifting steps, scaling coefficients *K* and 1/*K* are applied respectively to the even and odd samples in order to obtain the low-pass subband and high-pass subband.

For the biorthogonal 9/7 wavelet, four lifting steps and one scaling can be used, where *s*_{1}(*z*) = *α*(1 + *z*^{-1}), *s*_{2}(*z*) = *γ*(1 + *z*^{-1}), *t*_{1}(*z*) = *β*(1 + *z*), and *t*_{2}(*z*) = *δ*(1 + *z*). The parameters *α*, *β*, *γ*, and *δ* are two-tap symmetric filter coefficients and *K* and 1/*K* are scaling factors.

where *α* = -1.586134342, *β* = -0.05298011854, *γ* = 0.8829110762, *δ* = 0.4435068522, and *K* = 1.149604398
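The split, lifting, and scaling steps can be illustrated with a minimal 1-D sketch that uses the coefficient values listed above. It is not the hardware data path of this paper. It assumes the common additive lifting form, in which the signed coefficients are applied directly, and periodic extension at the signal boundaries; the function names `forward_97` and `inverse_97` are hypothetical.

```python
# Lifting coefficients for the biorthogonal 9/7 wavelet, as listed in the text.
ALPHA = -1.586134342
BETA = -0.05298011854
GAMMA = 0.8829110762
DELTA = 0.4435068522
K = 1.149604398


def _predict(odd, even, c):
    """Return odd[n] + c * (even[n] + even[n + 1]) with periodic wrap-around."""
    n = len(odd)
    return [odd[i] + c * (even[i] + even[(i + 1) % n]) for i in range(n)]


def _update(even, odd, c):
    """Return even[n] + c * (odd[n - 1] + odd[n]) with periodic wrap-around."""
    n = len(even)
    return [even[i] + c * (odd[i - 1] + odd[i]) for i in range(n)]


def forward_97(x):
    """One level of a 1-D 9/7 lifting transform on an even-length signal."""
    if len(x) % 2:
        raise ValueError("signal length must be even")
    even, odd = x[0::2], x[1::2]        # split (lazy wavelet transform)
    odd = _predict(odd, even, ALPHA)    # first predict step
    even = _update(even, odd, BETA)     # first update step
    odd = _predict(odd, even, GAMMA)    # second predict step
    even = _update(even, odd, DELTA)    # second update step
    approx = [K * v for v in even]      # scale the even (low-pass) branch by K
    detail = [v / K for v in odd]       # scale the odd (high-pass) branch by 1/K
    return approx, detail


def inverse_97(approx, detail):
    """Undo forward_97 by running the same steps backwards with flipped signs."""
    even = [v / K for v in approx]
    odd = [v * K for v in detail]
    even = _update(even, odd, -DELTA)
    odd = _predict(odd, even, -GAMMA)
    even = _update(even, odd, -BETA)
    odd = _predict(odd, even, -ALPHA)
    x = [0.0] * (2 * len(even))
    x[0::2], x[1::2] = even, odd        # merge even and odd samples
    return x
```

Because every lifting step is undone by the same step with the opposite sign, the inverse reuses the forward helpers, which is the 'in-place' and symmetry property mentioned in the introduction.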

The original data to be filtered is denoted by *X*(*n*), and the outputs are *a*_{i} and *d*_{i}, which are the approximation coefficients and detail coefficients, respectively. We focus on the implementation issue of the lifting-based DWT, which yields higher computational complexity with floating point computation. Hence, we suggest an efficient model for performing the floating point operation that reduces power by reducing the operating complexity through log conversion [22, 23].

### 2.2 Set partition in hierarchical trees

For a given set *T*, SPIHT defines a function of significance, which indicates whether the set *T* has pixels larger than a given threshold. *S*_{n}(*T*), the significance of set *T* in the *n*th bit plane, is defined as in Equation 14.

$S_n(T) = \begin{cases} 1, & \max_{(i,j) \in T} |w(i,j)| \ge 2^n \\ 0, & \text{otherwise} \end{cases}$ (14)

Note: *w*(*i*, *j*) is the coefficient value at position (*i*, *j*) in the wavelet domain, *T* stands for the set of coefficients, and *S*_{n}(*T*) is the significance state of *T* at bit plane *n*.

When *S*_{n}(*T*) is ‘0’, *T* is called an insignificant set. Otherwise, *T* is called a significant set. An insignificant set can be represented as a single bit ‘0’. A significant set is partitioned into subsets, and their significance has to be tested again based on the zerotree hypothesis. SPIHT encodes a given set *T* and its descendants (denoted by *D*(*T*)) together by checking the significance of *T* ∪ *D*(*T*) and by representing *T* ∪ *D*(*T*) as a single symbol ‘0’ if *T* ∪ *D*(*T*) is insignificant. On the other hand, if *T* ∪ *D*(*T*) is significant, *T* has to be partitioned into subsets and each subset is tested independently.
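The significance test and the idea of re-testing subsets can be sketched as follows. This is a toy illustration only: it assumes the standard SPIHT threshold of 2^n for bit plane *n*, and it splits a significant set into two halves of a flat list, which is a simplification and not the spatial orientation tree partitioning that SPIHT actually uses. The names `significance` and `encode_set` are hypothetical.

```python
def significance(coeffs, n):
    """Return 1 if any coefficient magnitude in the set reaches 2**n, else 0.

    Assumes a non-empty set and the standard SPIHT threshold of 2**n.
    """
    return 1 if max(abs(w) for w in coeffs) >= (1 << n) else 0


def encode_set(coeffs, n):
    """Toy encoder: '0' for an insignificant set, otherwise '1' followed by
    the codes of its two halves, each tested independently.

    Halving a flat list is a stand-in for the tree-based partitioning.
    """
    if significance(coeffs, n) == 0:
        return "0"                      # whole set costs a single bit
    if len(coeffs) == 1:
        return "1"                      # a single significant coefficient
    mid = len(coeffs) // 2
    return "1" + encode_set(coeffs[:mid], n) + encode_set(coeffs[mid:], n)
```

The sketch shows why large insignificant regions are cheap to code: a set with no coefficient above the threshold is represented by one bit regardless of its size.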

The spatial orientation trees are illustrated in Figure 2b for a 16 × 16 image transformed by three levels of discrete wavelet decomposition. Each level is divided into four subbands. The subband a_{2}a_{2} is divided into four groups of 2 × 2 coefficients. In each group, each of the four coefficients becomes the root of a spatial orientation tree. The square denoted by *R* in Figure 2a represents the subband a_{3}a_{3} (low-pass subband) in Figure 2b, which corresponds to the root. In order to increase the speed of both the encoder and decoder, we adopt the BPS algorithm [33] for our image processing core. The BPS algorithm modifies the processing order of the original SPIHT algorithm so that an image is partitioned into multiple blocks, and the coefficient trees are local to these blocks. Furthermore, BPS employs pipelining and parallelism, which gives the highest throughput among the existing architectures.

## 3 Proposed architecture

### 3.1 Discrete wavelet transform core

*T*_{mul} + 2*T*_{adder}).

In hardware implementation, the multiplier occupies a large amount of hardware resources. In order to provide a low-power, high-speed, and area-efficient multiplier for DWT computation, Shi et al. [6] adopted shift-add operations to optimize the multiplications, since the coefficients of wavelet filters are constant. Zhang et al. [35] used the dedicated 18-bit multiplier block present in the FPGA. In spite of the numerous methods that have been proposed, the overall latency of the circuit still depends on the multiplier. Hence, it is necessary to modify the multiplier structure in order to achieve minimum area and computation time. Furthermore, the accuracy also depends on the floating point lifting coefficients and their arithmetic operations. These three factors (area, computation time, and accuracy) demand modification of the computation units in the DWT architecture. Hence, this paper proposes a new computational unit based on the logarithmic principle in order to achieve minimal computation time with optimal area consumption. Moreover, adoption of the log principle results in good power reduction, mainly because of reduced operator and operand strengths. In the next subsection, the log-based floating point unit is discussed.

**a**_{1} and detail **d**_{1} coefficients to the local memory. Once a sufficient number of rows have been processed, the column processor starts vertical filtering, which consists of the same six computing modules. It fetches the approximation coefficients as the inputs from the local memory and generates four subbands: **a**_{1}**a**_{1}, **a**_{1}**d**_{1}, **d**_{1}**a**_{1}, and **d**_{1}**d**_{1}. These four subbands are written back to the external memory in row-wise order. Multiple-level decomposition is performed on this architecture in non-interleaved fashion, and results between levels are stored in the external memory. For the higher levels, an approximation subband is read from the external memory and four higher level subbands are generated using the same computing modules. This operation continues until the desired levels of wavelet decomposition are finished, as shown in Figure 6. As the real-time image processing core requires high performance, we adopt a highly pipelined, log-based FPU for implementing the lifting steps.
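The row-then-column flow and the non-interleaved multi-level decomposition can be sketched in software as below. This is an illustration of the data flow only, not of the pipelined hardware. A simple pairwise average and difference (`haar_1d`) stands in for the 9/7 lifting modules so that the block stays short, and the subband key order (first letter from the row pass, second from the column pass) is an assumption. All function names are hypothetical.

```python
def haar_1d(signal):
    """Placeholder 1-D transform: pairwise average (approx) and difference (detail)."""
    approx = [(signal[i] + signal[i + 1]) / 2 for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) / 2 for i in range(0, len(signal), 2)]
    return approx, detail


def transpose(matrix):
    """Swap rows and columns of a list-of-lists matrix."""
    return [list(col) for col in zip(*matrix)]


def decompose_2d(image, transform_1d=haar_1d):
    """One decomposition level: row pass, then column pass on each half."""
    row_a, row_d = [], []
    for row in image:                      # row processing
        a, d = transform_1d(row)
        row_a.append(a)
        row_d.append(d)

    def column_pass(half):                 # column processing
        low, high = [], []
        for col in transpose(half):
            a, d = transform_1d(col)
            low.append(a)
            high.append(d)
        return transpose(low), transpose(high)

    aa, ad = column_pass(row_a)
    da, dd = column_pass(row_d)
    return {"aa": aa, "ad": ad, "da": da, "dd": dd}


def multilevel(image, levels, transform_1d=haar_1d):
    """Non-interleaved multi-level decomposition: only 'aa' is transformed again."""
    results, current = [], image
    for _ in range(levels):
        bands = decompose_2d(current, transform_1d)
        results.append(bands)
        current = bands["aa"]              # next level reads the approximation subband
    return results
```

Passing a different `transform_1d` callable with the same return shape (approximation list, detail list) swaps in another filter bank without changing the row and column flow.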

### Log-based floating point unit

*X* is divided into three parts: 1 sign bit (*s*), 8 exponent bits (*E*), and 23 mantissa bits (*m*). This is represented as

$X = (-1)^{s} \times 1.m \times 2^{E-127}$

The log-based arithmetic unit embedded in the designed FPU uses a carry save adder for all arithmetic operations. It relies on simple log principles, along with operational switches that select the inputs according to the requested operation.

If the add operator is fed to the switch, the addition is carried out by adding or subtracting the mantissa bits according to the exponent and sign bits. First, the difference of the two exponents is calculated, and the larger exponent is set as the tentative exponent of the result. If the difference is nonzero, the mantissa of the operand with the smaller exponent is shifted right by that difference. According to the sign bits, addition (if equal) or subtraction (if unequal) is performed on the mantissas to get the tentative mantissa of the result. The mantissa result is then normalized and rounded. If rounding causes an overflow, the mantissa is shifted right and the exponent is incremented by 1. The sign bit of the result is the sign bit of the larger operand.

Similarly, the multiplication procedure is chosen when the multiply operator is fed to the operator switch. The data path of the multiplier component of this FPU is simplified, because the computation involves only mapping and addition, which reduces the number of stages involved in multiplication. The mantissas of the input data are mapped to the corresponding logarithmic numbers in the LUT, and the logarithms are added. If there is any overflow, the result is shifted right; the sum is then mapped through the antilogarithm LUT to obtain the mantissa of the result. The exponent of the result is obtained by adding the exponent bits, and the sign bit of the result is obtained by XOR-ing both sign bits.
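The multiplication path can be mimicked in software as a rough sketch. It is not the FPU design of this paper: `math.log2` and a power of two stand in for the log and antilog LUTs, the `log_bits` parameter models the logarithm word size, and the inputs are assumed to be normalized, nonzero single-precision values whose product stays in range (zeros, denormals, infinities, and NaNs are not handled). All function names are hypothetical.

```python
import math
import struct


def unpack_float32(x):
    """Split an IEEE 754 single-precision value into (sign, exponent, mantissa)."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    return bits >> 31, (bits >> 23) & 0xFF, bits & 0x7FFFFF


def pack_float32(sign, exponent, mantissa):
    """Assemble the three bit fields back into a Python float."""
    bits = (sign << 31) | (exponent << 23) | mantissa
    return struct.unpack(">f", struct.pack(">I", bits))[0]


def log_multiply(a, b, log_bits=21):
    """Multiply two normalized float32 values by adding logs of their significands."""
    sa, ea, ma = unpack_float32(a)
    sb, eb, mb = unpack_float32(b)
    scale = 1 << log_bits
    # Stand-in for the log LUT: log2(1.m) quantized to log_bits fractional bits.
    la = round(math.log2(1 + ma / 2**23) * scale)
    lb = round(math.log2(1 + mb / 2**23) * scale)
    total = la + lb                       # adding logarithms replaces the multiply
    carry, frac = divmod(total, scale)    # carry is nonzero when the product is >= 2
    # Stand-in for the antilog LUT: recover the 23-bit mantissa field.
    mantissa = round((2 ** (frac / scale) - 1) * 2**23)
    if mantissa >> 23:                    # rounding reached 2.0, so renormalize
        mantissa, carry = 0, carry + 1
    exponent = ea + eb - 127 + carry      # biased exponents add, minus one bias
    return pack_float32(sa ^ sb, exponent, mantissa)
```

For example, `log_multiply(-1.586134342, 0.75)` approximates the product of the first lifting coefficient and a sample; lowering `log_bits` trades accuracy for a smaller logarithm word.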

### 3.2 Block-based parallel-pipelined SPIHT

(*n* + 1)th bit plane, the *n*th bits of the pixels are categorized and processed by one of the three passes. Insignificant pixels classified by the (*n* + 1)th bit plane are encoded by IPP, whereas significant pixels are processed by SPP. The main goal of each pass is the generation of an appropriate bit stream according to the wavelet coefficient information. If a set in this pass is classified as a significant set in the *n*th bit plane, it is decomposed into smaller sets until the smaller sets become insignificant or they correspond to single pixels. If the smaller sets are insignificant, they are handled by ISP. If the smaller sets correspond to single pixels, they are handled by either IPP or SPP, depending on their significance.

In the original SPIHT algorithm, three linked lists are maintained for processing the ISP, IPP, and SPP. In each pass, the entries in the linked list are processed in first-in first-out (FIFO) order. This FIFO order creates a large overhead, which slows down the computation speed of the SPIHT algorithm. To speed up the algorithm, sets and pixels are visited in the Morton order, as shown in Figure 2b, and processed by the appropriate pass. This modified algorithm, called Morton order SPIHT, is relatively easy to implement in hardware, with a slight degradation of the compression efficiency when compared with the original SPIHT.

The block diagram of the block-based parallel-pipelined SPIHT architecture is shown in Figure 10. The 8 × 8 block of the discrete wavelet transformed image is given as the input and sliced into eight planes. The most significant bit (MSB) plane is given to the insignificant pixel pass in the first clock cycle, which finds the significance of each macro and minor block. In the second clock cycle, the insignificant bit planes are given for sorting. The sorting pass updates the insignificant sorting pass. Using the significance bit stream from the insignificant sorting pass, the refining pass (RP) codes the significant micro blocks and gives the coded output. When all the blocks in the 8 × 8 coefficient block become significant, the controller block stops the sorting pass (SP) and, hence, the unnecessary updating of insignificant sorting passes is removed. Thus, the pipelined ISP along with the parallel RP and SP increases the throughput.
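Two of the operations mentioned above, visiting coefficients in Morton order and slicing a block into bit planes, can be sketched as follows. This is a generic illustration, not the BPS hardware. Which coordinate occupies the lower interleaved bit is an assumption here (the column), and the function names are hypothetical. Coefficients are assumed to be integers, and only their magnitudes are sliced.

```python
def morton_index(row, col, bits=3):
    """Interleave the bits of (row, col) into a Z-order (Morton) index."""
    index = 0
    for b in range(bits):
        index |= ((col >> b) & 1) << (2 * b)        # column bit goes to the even position
        index |= ((row >> b) & 1) << (2 * b + 1)    # row bit goes to the odd position
    return index


def morton_scan(size=8):
    """Return all (row, col) positions of a size x size block in Morton order."""
    bits = (size - 1).bit_length()
    cells = [(r, c) for r in range(size) for c in range(size)]
    return sorted(cells, key=lambda rc: morton_index(rc[0], rc[1], bits))


def bit_planes(block, planes=8):
    """Slice integer coefficient magnitudes into bit planes, MSB plane first."""
    return [[[(abs(v) >> p) & 1 for v in row] for row in block]
            for p in range(planes - 1, -1, -1)]
```

In Morton order, each 2 × 2 group of neighbouring positions is visited together before moving to the next group, which keeps the coefficients of one small block adjacent in the scan.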

## 4 Experiment results and analysis

The overall performance of the proposed image processing system is analyzed in this section. As DWT has a wide range of applications in various fields, the proposed system utilizes its efficiency for enhanced image handling and offers good improvement in speed and area consumption. Moreover, the accuracy of the output is also dealt with by modifying the computation parts in the DWT structure. This utilizes logarithmic principle and, hence, yields a good reduction in power. Furthermore, at each level of DWT, precision also depends on decomposition at that stage. Hence, it is necessary to select an optimized level of DWT. During the experimentation of this proposal, the optimized level of DWT is selected based on performance parameters such as peak signal-to-noise ratio (PSNR), compression ratio (CR), and wavelet decomposition computation complexity. The architecture is first verified using Matlab for the image parameters and then implemented in hardware to analyze its hardware efficiency.

### 4.1 Image parameters analysis

**aa** subband obtained after one level of decomposition. To evaluate the performance of the proposed architecture, each image was decomposed into different levels with the B9/7 wavelet transform, and the transform coefficients were coded using the SPIHT algorithm with different compression ratios. The reconstructed image was compared with the original image, and the PSNR values were computed using Equation 16 and are presented in Table 1 and Figure 11.

**Table 1 PSNR values for different decomposition levels**

| Image | CR | Level 1 | Level 2 | Level 3 | Level 4 | Level 5 | Level 6 | Level 7 | Level 8 |
|---|---|---|---|---|---|---|---|---|---|
| Lena | 0.2 | 6.6086 | 10.0639 | 13.9454 | 21.9417 | 25.0162 | 23.8757 | 13.8499 | 7.8100 |
| | 0.6 | 10.7395 | 15.2456 | 22.5876 | 29.7732 | 31.2794 | 26.8079 | 14.0972 | 7.8751 |
| | 0.8 | 10.7395 | 15.2456 | 25.7146 | 32.1411 | 33.5822 | 27.3527 | 14.1218 | 7.8796 |
| | 1 | 10.7395 | 19.2277 | 28.5985 | 34.1943 | 35.3906 | 27.6906 | 14.1390 | 7.8837 |
| Woman | 0.2 | 6.3568 | 9.9488 | 14.1874 | 19.1769 | 20.7100 | 20.3191 | 12.1483 | 8.0200 |
| | 0.6 | 10.3732 | 15.0825 | 19.5321 | 23.7927 | 24.8295 | 23.2527 | 12.4849 | 8.1484 |
| | 0.8 | 10.3732 | 15.0825 | 21.4108 | 25.2709 | 25.9779 | 23.8922 | 12.5394 | 8.1693 |
| | 1 | 10.3732 | 17.8149 | 23.0358 | 26.2344 | 26.7792 | 24.3555 | 12.5718 | 8.1816 |
| Mandrill | 0.2 | 5.3449 | 9.6256 | 13.6566 | 19.0756 | 20.1384 | 19.8964 | 12.7907 | 6.3448 |
| | 0.6 | 10.0327 | 14.5846 | 19.2745 | 21.7600 | 22.3938 | 21.8662 | 13.0871 | 6.4114 |
| | 0.8 | 10.0327 | 14.5846 | 20.4391 | 22.8376 | 23.5433 | 22.8538 | 13.1886 | 6.4328 |
| | 1 | 10.0327 | 17.4308 | 21.3911 | 23.9240 | 24.5149 | 23.6136 | 13.2669 | 6.4500 |

*E*^{2}_{ms} is the sample mean squared error as follows:

$E^{2}_{ms} = \frac{1}{N^{2}} \sum_{i=1}^{N} \sum_{j=1}^{N} \left( X(i,j) - Y(i,j) \right)^{2}$

where *X*(*i*, *j*) represents the original N × N image and *Y*(*i*, *j*) represents the reconstructed image.
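The PSNR comparison between the original and reconstructed images can be sketched as below. This assumes the standard PSNR definition in decibels with a peak value of 255 for 8-bit images, which may differ in detail from the paper's Equation 16; the function names are hypothetical, and images are plain lists of rows.

```python
import math


def mean_squared_error(original, reconstructed):
    """Sample mean squared error between two equally sized images."""
    n_rows, n_cols = len(original), len(original[0])
    total = sum((original[i][j] - reconstructed[i][j]) ** 2
                for i in range(n_rows) for j in range(n_cols))
    return total / (n_rows * n_cols)


def psnr(original, reconstructed, peak=255.0):
    """Peak signal-to-noise ratio in dB; infinite for identical images."""
    mse = mean_squared_error(original, reconstructed)
    if mse == 0:
        return math.inf
    return 10.0 * math.log10(peak ** 2 / mse)
```

A higher PSNR indicates a reconstruction closer to the original, which is how the decomposition levels in the PSNR table are compared.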

### 4.2 Numerical accuracy analysis

This work is also concerned with precision, which is the most important factor of this design. As the B9/7 DWT structure utilizes floating point coefficients, the accuracy of the result mainly depends on the fractional computational values. Hence, the results obtained with normal integer computation units in the DWT suffer from poor accuracy. Adding floating point operation units increases the accuracy, but it also increases the area and delay overhead. Hence, a logarithm-based FPU is integrated with the DWT structure to achieve a good reduction in area with a higher improvement in accuracy. As the whole model depends on the log values, the accuracy of the log values is directly related to the accuracy of the result. Furthermore, as the standard IEEE 754 single-precision format has 23 mantissa bits, the accuracy also depends on the correctness of these bits. So, in the experimental phase, accuracy is analyzed in two ways: output accuracy and bit-level accuracy. As accuracy is mostly discussed in terms of its opposite, the error rate is taken into consideration when discussing accuracy.

where, *t* is the number of mantissa bits.

**Output accuracy percentage computation**

Percent of error for the Wallace tree multiplier and for the logarithm-based multiplier at each logarithm word size.

| Number of output bits | Wallace tree multiplier | Log-based, 6 bits | Log-based, 9 bits | Log-based, 12 bits | Log-based, 15 bits | Log-based, 18 bits | Log-based, 21 bits |
|---|---|---|---|---|---|---|---|
| | 40.9% | 58.02% | 44.8% | 25.63% | 18.53% | 7.91% | 3.86% |
| | 40.23% | 51.17% | 32.11% | 18.13% | 11.68% | 6.87% | 3.15% |
| | 38.56% | 30.89% | 26.41% | 13.28% | 9.13% | 5.31% | 2.74% |
| | 36.85% | 29.13% | 18.95% | 9.58% | 5.18% | 2.68% | 2.03% |
| | 35.03% | 11.16% | 7.58% | 5.91% | 2.86% | 1.98% | 1.06% |
| | 33.26% | 8.19% | 5.39% | 3.32% | 1.86% | 0.93% | 0.58% |
| | 31.68% | 5.11% | 2.68% | 2.53% | 1.03% | 0.49% | 0.26% |

**Output bit error rate computation**

Average bit change error for the Wallace tree multiplier and for the logarithm-based multiplier at each logarithm word size.

| Number of output bits | Wallace tree multiplier | Log-based, 6 bits | Log-based, 9 bits | Log-based, 12 bits | Log-based, 15 bits | Log-based, 18 bits | Log-based, 21 bits |
|---|---|---|---|---|---|---|---|
| | 12.56 | 18.56 | 15.78 | 14.08 | 9.58 | 3.56 | 1.89 |
| | 11.87 | 16.59 | 13.86 | 11.18 | 7.38 | 1.95 | 0.76 |
| | 11.85 | 13.48 | 11.85 | 9.87 | 3.85 | 0.78 | 0.54 |
| | 11.34 | 10.56 | 10.03 | 7.65 | 0.96 | 0.53 | 0.02 |
| | 10.93 | 10.06 | 9.53 | 4.86 | 0.08 | 0.12 | 0.0085 |
| | 10.58 | 8.85 | 7.96 | 2.95 | 0.03 | 0.0085 | 0.0053 |
| | 10.87 | 7.32 | 4.58 | 0.08 | 0.01 | 0.0031 | 0.0017 |

### 4.3 Hardware analysis

The log-based floating point computation achieves superior accuracy when compared with normal floating point arithmetic computation. Hence, the computation unit based on the log principle is integrated with the biorthogonal DWT structure, which is then implemented in FPGA to analyze its performance in hardware. The analyses were done in two different FPGA environments to show the versatility of the proposed idea, as no built-in IP cores were used.

#### 4.3.1 Hardware result analysis based on Xilinx device

**Hardware utilization comparison**

| Virtex 6 FPGA parameters | Integer multiplier-based DWT | Log multiplier-based DWT |
|---|---|---|
| | 42 | 55 |
| | 21% | 45.6% |
| | 18.56 ns | 13.23 ns |
| | 34 mW | 18 mW |

#### 4.3.2 Hardware result analysis based on Altera device

**Altera level analysis**

| Cores | Parameters | Cyclone (EP1C12Q240C6) | Cyclone II (EP2C70F896I8) | Cyclone III (EP3C10F256C6) | Cyclone IV E (EP4CE115F29C7) |
|---|---|---|---|---|---|
| DWT | Combinational functions | 2,893 | 2,688 | 2,688 | 2,688 |
| | Logic registers | N/A | 235 | 235 | 235 |
| | Memory (bits) | 71,445 | 71,445 | 71,445 | 71,445 |
| | | 71.8 | 58.58 | 475.29 | 1,210.65 |
| DWT with BPS | Combinational functions | 4,114 | 4,072 | 4,071 | 4,071 |
| | Logic registers | N/A | 791 | 791 | 791 |
| | Memory (bits) | 163,840 | 163,840 | 163,840 | 163,840 |
| | | 71.8 | 58.58 | 475.29 | 1,210.65 |

**Sync signal specification**

| Signals | Value ‘0’ | Value ‘1’ |
|---|---|---|
| | Normal mode | Resets and initializes memory |
| | Sync to address controller | Sync to R/W control |
| | Resets DWT-SPIHT and switches to input mode | Activates DWT, sequentially controls R/W sync, and initiates SPIHT |
| | Resets IDWT-ISPIHT and switches to input mode | Activates ISPIHT, sequentially controls R/W sync, and initiates IDWT |
| | Standby for VGA process | Activates VGA process and VGA CLK |

## 5 Conclusions

This paper has proposed an enhanced image processing system utilizing a DWT structure with log-based floating point computation units and SPIHT coders. Efficient decomposition levels of the DWT and SPIHT algorithms have to be chosen for the hardware implementation. From the detailed analysis performed with various test images, it is found that five-level decomposition in the DWT together with block-based parallel-pipelined SPIHT gives a good PSNR value irrespective of the compression ratio. This paper adopted a modified arithmetic unit in the DWT structure to achieve good accuracy with minimum latency and power. The modification targets the computation units in the DWT structure, which are merely integer-styled operation units. As floating point operations are much more complex than integer-based operations, the complexity of the computation hardware also increases, which degrades the efficiency of DWT operations. Hence, this paper introduced a log-based computation structure to reduce the strength of the operations. Furthermore, the results show that the accuracy of the DWT increases, as rounding-off errors are fewer with log transformations. The overall structure achieved a 25% improvement in accuracy with the proposed log-based FPUs. In addition, the use of LNS in the model provides a 47% power reduction, as the overall signal activity and strength are reduced. Hence, the proposed structure features high speed, good accuracy, and low power utilization. Thus, the adoption of this structure in the proposed image processing system results in good hardware optimization. Moreover, the model was implemented in different FPGAs to test its robustness and versatility. This shows that the model is well suited for portable image analyzing gadgets.

## References

- Sweldens W: The lifting scheme: a custom-design construction of biorthogonal wavelets. *Appl. Comput. Harmon. Anal.* 1996, 3(2):186-200. 10.1006/acha.1996.0015
- Daubechies I, Sweldens W: Factoring wavelet transforms into lifting schemes. *J. Fourier Anal. Appl.* 1998, 4(3):247-269. 10.1007/BF02476026
- Acharya T, Chakrabarti C: A survey on lifting-based discrete wavelet transform architectures. *J. VLSI Signal Process.* 2006, 42:321-339. 10.1007/s11266-006-4191-3
- Barua S, Carletta JE, Kotteri KA, Bell AE: An efficient architecture for lifting-based two-dimensional discrete wavelet transforms. *Integr. VLSI J.* 2005, 38(3):341-352. 10.1016/j.vlsi.2004.07.010
- Andra K, Chakrabarti C, Acharya T: A VLSI architecture for lifting-based forward and inverse wavelet transform. *IEEE Trans. Signal Process.* 2002, 50(4):966-977. 10.1109/78.992147
- Shi G, Liu W, Zhang L, Li F: An efficient folded architecture for lifting-based discrete wavelet transform. *IEEE Trans. Circuits Syst.-II* 2009, 56(4):290-294.
- Huang CT, Tseng PC, Chen LG: Flipping structure: an efficient VLSI architecture of lifting based discrete wavelet transform. *IEEE Trans. Signal Process.* 2004, 52(4):1080-1088. 10.1109/TSP.2004.823509
- Kim J, Park T: High performance VLSI architecture of 2D discrete wavelet transform with scalable lattice structure. *World Acad. Sci. Eng. Technol.* 2009, 54:591-596.
- Jiang W, Ortega A: Lifting factorization-based discrete wavelet transform architecture design. *IEEE Trans. Circuits Syst. Video Technol.* 2001, 11(5):651-657. 10.1109/76.920194
- Zhang W, Jiang Z, Gao Z, Liu Y: An efficient VLSI architecture for lifting-based discrete wavelet transform. *IEEE Trans. Circuits Syst.–II* 2012, 59(3):158-162.
- Cheng C, Parhi KK: High-speed VLSI implementation of 2-D discrete wavelet transform. *IEEE Trans. Signal Process.* 2008, 56(1):393-403.
- Tian X, Wu L, Tan YH, Tian JW: Efficient multi-input/multi-output VLSI architecture for two-dimensional lifting-based discrete wavelet transform. *IEEE Trans. Comput.* 2011, 60(8):1207-1211.
- Wu BF, Hu YQ: An efficient VLSI implementation of the discrete wavelet transforms using embedded instruction codes for symmetric filters. *IEEE Trans. Circuits Syst. Video Technol.* 2003, 13(9):936-943. 10.1109/TCSVT.2003.816509
- Zhang C, Wang C, Ahmad MO: A pipeline VLSI architecture for fast computation of the 2-D discrete wavelet transform. *IEEE Trans. Circuits Syst.–I* 2012, 59(8):1775-1785.
- Lan X, Zheng N, Liu Y: Low-power and high-speed VLSI architecture for lifting-based forward and inverse wavelet transform. *IEEE Trans. Consum. Electron.* 2005, 51(2):379-386. 10.1109/TCE.2005.1467975
- Lee DU, Kim LW, Villasenor JD: Precision-aware self-quantizing hardware architecture for the discrete wavelet transform. *IEEE Trans. Image Process.* 2012, 21(2):768-777.
- Beauchamp MJ, Hauck S, Underwood KD, Hemmert KS: Architectural modification to enhance the floating-point performance of FPGAs. *IEEE Trans. Very Large Scale Integr. (VLSI) Syst.* 2008, 16(2):177-187.
- Ho CH, Yu CW, Leong PHW, Luk W, Wilton SJE: Floating-point FPGA: architecture and modeling. *IEEE Trans. Very Large Scale Integr. (VLSI) Syst.* 2009, 17(12):1709-1718.
- Even G, Mueller SM, Seidel P-M: A dual precision IEEE floating-point multiplier. *Integr. VLSI J.* 2000, 29(2):167-180. 10.1016/S0167-9260(00)00006-7
- Yu CW, Smith AM, Luk W, Leong PHW, Wilton SJE: Optimizing floating point units in Hybrid FPGAs. *IEEE Trans. Very Large Scale Integr. (VLSI) Syst.* 2012, 20(7):45-65.
- Chong YJ, Parameswaran S: Configurable multimode embedded floating-point units for FPGAs.
*IEEE Trans. Very Large Scale Integer (VLSI) Syst.*2011, 19(11):2033-2044.View ArticleGoogle Scholar - Anand TH, Vaithiyanathan D, Seshasayanan R: Optimized architecture for floating point computation unit. In
*Int. conf. on emerging trends in VLSI, embedded sys., nano elec. and tele. sys*. Thiruvannamalai, India; 2013:1-5.Google Scholar - Paul S, Jayakumar N, Khatri SP: A fast hardware approach for approximate, efficient logarithm and antilogarithm computations.
*IEEE Trans. Very Large Scale Integer (VLSI) Syst.*2009, 17(2):269-277.View ArticleGoogle Scholar - Paliouras V, Karagianni K, Stouraitis T: Error bounds for floating-point polynomial interpolators.
*IEE Electron. Lett.*1999, 35(3):195-197. 10.1049/el:19990143View ArticleGoogle Scholar - IEEE standard for floating-point arithmetic, IEEE Std754-2008. IEEE Inc, New York, NY, USA; 2008:1-70. doi:10.1109/IEEESTD.2008.4610935Google Scholar
- Vaithiyanathan D, Seshasayanan R: High speed low power DWT structure with log based FPU in FPGAs. In *International conference on green computing, communication and conservation of energy (ICGCE 2013)*. Chennai, India; 2013:308-313. doi:10.1109/ICGCE.2013.6823451
- Said A, Pearlman WA: A new fast and efficient image codec based on set partitioning in hierarchical trees. *IEEE Trans. Circuits Syst. Video Technol.* 1996, 6(3):243-250. 10.1109/76.499834
- Wheeler FW, Pearlman WA: SPIHT image compression without lists. In *IEEE international conference on acoustics, speech, and signal processing (ICASSP), vol. 4*. 2000, 2047-2050.
- Corsonello P, Perri S, Staino G, Lanuzza M, Cocorullo G: Low bit rate image compression core for onboard space applications. *IEEE Trans. Circuits Syst. Video Technol.* 2006, 16(1):114-128.
- Jyotheswar J, Mahapatra S: Efficient FPGA implementation of DWT and modified SPIHT for lossless image compression. *J. Syst. Arch.* 2007, 53: 369-378. 10.1016/j.sysarc.2006.11.009
- Cheng CC, Tseng PC, Chen LG: Multimode embedded compression codec engine for power-aware video coding system. *IEEE Trans. Circuits Syst. Video Technol.* 2009, 19(2):141-150.
- Fry T, Hauck S: SPIHT image compression on FPGAs. *IEEE Trans. Circuits Syst. Video Technol.* 2005, 15(9):1138-1147.
- Jin Y, Lee HJ: A block-based pass-parallel SPIHT algorithm. *IEEE Trans. Circuits Syst. Video Technol.* 2012, 22(7):1064-1075.
- Zervas ND, Anagnostopoulos GP, Spiliotopoulos V, Andreopoulos Y, Goutis CE: Evaluation of design alternatives for the 2D-discrete wavelet transform. *IEEE Trans. Circuits Syst. Video Technol.* 2001, 11: 1246-1262. 10.1109/76.974679
- Zhang C, Long Y, Kurdahi F: A hierarchical pipelining architecture and FPGA implementation for lifting-based 2-D DWT. *J. Real-Time Image Proc.* 2007, 2: 281-291. 10.1007/s11554-007-0057-6
- *The USC-SIPI image database*. Univ. Southern California, Signal and Image Processing Inst.; 2011. Available: http://sipi.usc.edu/database/
- *Virtex-6 FPGA data sheet*. Xilinx, Inc, San Jose, CA, USA; 2012. http://www.xilinx.com/support/documentation/data_sheets/ds150.pdf. Accessed 18 Feb 2013
- Corsonello P, Perri S, Zicari P, Cocorullo G: Microprocessor-based FPGA implementation of SPIHT image compression system. *Microprocessors and Microsystems* 2005, 29(6):299-305. 10.1016/j.micpro.2004.08.013
- Chew LW, Chia WC, Ang L-M, Seng KP: Very low-memory wavelet compression architecture using strip-based processing for implementation in wireless sensor networks. *EURASIP J. Embed. Syst.* 2009.
- Liu K, Belyaev E, Guo J: VLSI architecture of arithmetic coder used in SPIHT. *IEEE Trans. Very Large Scale Integr. (VLSI) Syst.* 2012, 20(4):697-710.
- *DE2-115 FPGA board data sheet*. Altera Corporation, San Jose, California, USA; 2010. ftp://ftp.altera.com/up/pub/Altera_Material/12.1/Boards/DE2-115/DE2_115_User_Manual.pdf. Accessed 15 March 2013

## Copyright

This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited.