Implementation of computation-reduced DCT using a novel method
- K. K. Senthilkumar^{1},
- K. Kunaraj^{2} and
- R. Seshasayanan^{1}
https://doi.org/10.1186/s13640-015-0088-z
© Senthilkumar et al. 2015
Received: 6 January 2015
Accepted: 20 October 2015
Published: 6 November 2015
Abstract
The discrete cosine transform (DCT) plays a very important role in lossy compression, where it represents the pixel values of an image using fewer coefficients. Recently, many algorithms have been devised to compute the DCT. In the initial stage of image compression, the image is generally subdivided into smaller subblocks, and these subblocks are converted into DCT coefficients. In this paper, we present a novel DCT architecture that reduces power consumption by decreasing the computational complexity based on the correlation between two successive rows. The unwanted forward DCT computations in each 8 × 8 sub-image are eliminated, which significantly reduces the forward DCT computation for the whole image. The algorithm is verified with various high- and less-correlated images, and the results show that image quality is not much affected when only the most significant 4 bits per pixel are considered for row comparison. The proposed architecture is synthesized using Cadence SoC Encounter® with the TSMC 180 nm standard cell library. This architecture consumes 1.257 mW instead of 8.027 mW when the pixels of two rows differ very little. The experimental results show that the proposed DCT architecture reduces the average power consumption by 50.02 % and the total processing time by 61.4 % for high-correlated images. For less-correlated images, the reductions in power consumption and total processing time are 23.63 and 35 %, respectively.
Keywords
DCT IDCT Image compression FPGA ASIC
1 Introduction
Image compression is the process of reducing the size of the binary representation of a graphics file without degrading the quality of the image to an objectionable level. This reduction allows more images to be stored in the same amount of storage. It also decreases the transmission time for images sent over technologies such as the internet [1]. The discrete cosine transform (DCT), which is the most widely used technique for image compression, was initially defined in [1]. It emerged as a revolutionary standard when compared with the other existing transforms. After that, an algorithm for computing the fast DCT (FDCT) was introduced by Chen et al. in [2], based on matrix decomposition of the orthogonal basis function of the cosine transform. This method took (3N/2)(log_{2} N − 1) + 2 real additions and N log_{2} N − 3N/2 + 4 real multiplications, which is approximately six times faster than the conventional approach. Further, a new algorithm was introduced for the 2^{N}-point DCT in [3]. This algorithm uses only half of the number of multiplications required by the existing efficient algorithms (12 multiplications and 29 additions), and it makes the system simpler by decomposing the N-point inverse DCT (IDCT) into the sum of two N/2-point IDCTs. A recursive algorithm for DCT [4] was presented with a structure that allows the generation of the next higher order DCT from two identical lower order DCTs to reduce the number of adders and multipliers (12 multiplications and 29 additions). Loeffler et al. came up with a practical fast 1-D DCT algorithm [5] in which the number of multiplications was reduced to 11 by inverting add/subtract modules and by finding an equivalent for the rotation block (only 3 additions and 3 multiplications per block instead of 4 multiplications and 2 additions). Following these contributions to DCT implementation, many further algorithms were introduced to optimize the DCT.
In recent years, the idea of implementing the DCT using CORDIC (co-ordinate rotation digital computer) [6], which needs only shift and add arithmetic with look-up tables, was analyzed for efficient hardware implementation. Another technique called distributed arithmetic (DA) was devised [7], which computes multiplication as distributed over bit-level memories and adders. A read-only memory (ROM)-free 1-D DCT architecture based on the DA method, with reduced area and power, was discussed in [8]. In [9], an unsigned constant coefficient multiplication was done by moving two negative signs to the next adder to make them positive, and it was implemented using multiplier-less operation. The prime N-length DCT was divided into similar cyclic convolution structures, and the DCT was implemented using a systolic array structure [10]. The technique used in [11] reduced the resource usage and increased the maximum frequency by rearranging the ADD blocks to the consecutive stages. Many algorithms were also devised to eliminate the use of multipliers by using shift and addition operations. A technique that uses Ramanujan numbers for calculating cosine values and Chebyshev-type recursion to compute the DCT [12] was also proposed. A low-power multiplier-less DCT was presented in [13]; it reduces the switching power consumption by around 26 % by removing unnecessary arithmetic operations on unused bits during the CORDIC calculations. The complexity of DCT computation was reduced in [14] by optimizing the Loeffler DCT based on the CORDIC algorithm; this reduces the 11 multiply and 29 add operations to 38 add and 16 shift operations without losing quality. A low-power design technique was presented in [15], which eliminates the DCT computation of low-energy macro blocks. A technique was presented to reduce the complexity of multiplications in DCT [16] by using differential pixels in 8 × 8 blocks of the input image matrix.
Based on differences of 64 DCT coefficients, separate operand bit-widths were used for different frequency components to reduce computation energy [17]. Various low-power design techniques such as dual voltage, dual frequency, and clock gating were used in the DCT architecture to reduce the power consumption [18].
This paper proposes a new architecture that computes the DCT based on the difference between the pixels of two rows, and thereby reduces the computations and power consumption of the DCT. The paper is organized as follows: The most common DCT implementation strategies are discussed in Section 2. The conventional image compression technique using DCT and the proposed comparative input method (CIM), which eliminates the unwanted DCT computations, are discussed in Section 3. The simulation results, performance, and comparative analysis of the proposed DCT are given in Section 4, and Section 5 concludes the research findings.
2 Existing algorithms for DCT implementation
- (i) Direct 2-D computation and
- (ii) Decomposition into two 1-D DCTs using separability.
Here, f(x) is the 1-D row of input pixels, and the cosine term is the orthonormal basis function. F(u) is the 1-D DCT output, and D(u) is the normalizing factor.
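As a minimal sketch of the definitions above, the direct 1-D transform and the separable row-column 2-D transform can be written as follows. This assumes the standard orthonormal DCT-II, with D(0) = sqrt(1/N) and D(u) = sqrt(2/N) for u > 0; the paper's own equation is not reproduced in this text, so the normalization is an assumption, and the function names are illustrative only.

```python
import math

def dct_1d(f):
    """Direct N-point DCT-II of the sequence f (orthonormal scaling assumed)."""
    n = len(f)
    out = []
    for u in range(n):
        # D(u): normalizing factor (assumed orthonormal form)
        d = math.sqrt(1.0 / n) if u == 0 else math.sqrt(2.0 / n)
        s = sum(f[x] * math.cos((2 * x + 1) * u * math.pi / (2 * n))
                for x in range(n))
        out.append(d * s)
    return out

def dct_2d(block):
    """Separable 2-D DCT: 1-D DCT on every row, then on every column."""
    rows = [dct_1d(r) for r in block]
    cols = [dct_1d(list(c)) for c in zip(*rows)]  # transform each column
    return [list(r) for r in zip(*cols)]          # transpose back

# A flat 8 x 8 block has all of its energy in the DC coefficient.
flat = [[100] * 8 for _ in range(8)]
coeffs = dct_2d(flat)
```

The row-column form needs 2N one-dimensional transforms per N × N block, which is why a cheaper 1-D DCT directly reduces the cost of the 2-D transform.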
To implement the DCT, the modified Lee’s algorithm [3] and Chen’s algorithm [2] are used in this paper. Lee’s algorithm utilizes three levels of mathematical decomposition to calculate the DCT in a simpler way. Compared to Chen’s algorithm, Lee’s method reduces the computational complexity of calculating DCT coefficients by 46 %. Both algorithms are simulated using Matlab and an EDA tool. To prove the hardware efficiency of the proposed algorithm, the architecture is implemented in a field programmable gate array (FPGA). The design entry is made through the Verilog hardware description language (HDL), simulated in Xilinx ISim, and synthesized using Xilinx XST.
2.1 Fast algorithm
X(n) corresponds to the 1-D input values, and Y(n) corresponds to the 1-D output values. The number of computations involved is (3N/2)(log_{2} N − 1) + 2 additions and N log_{2} N − 3N/2 + 4 multiplications. Hence, for N = 8, it requires (3 · 8/2)(3 − 1) + 2 = 26 additions and 8 · 3 − 12 + 4 = 16 multiplications.
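The two operation-count expressions above can be evaluated for any power-of-two N. The short sketch below does only that arithmetic; the function name is illustrative and not taken from the paper.

```python
def chen_op_counts(n):
    """Return (additions, multiplications) for an n-point fast DCT,
    using the two expressions quoted in the text; n must be a power of two."""
    if n < 2 or n & (n - 1):
        raise ValueError("n must be a power of two")
    log_n = n.bit_length() - 1               # exact log2 for powers of two
    additions = (3 * n // 2) * (log_n - 1) + 2
    multiplications = n * log_n - 3 * n // 2 + 4
    return additions, multiplications

adds, mults = chen_op_counts(8)
```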
3 Image compression using DCT
After the CIM-based DCT computation, the following steps are carried out to compress the image. The DCT coefficients are quantized to a pre-determined level to reduce psycho-visual redundancy. Zigzag scanning then orders the quantized coefficients from low to high frequency, so that the high-frequency coefficients are grouped at the end of the sequence, and the scanned coefficients are encoded to reduce coding redundancy.
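The quantization and zigzag steps can be sketched as follows. The paper does not give its quantization table in this text, so a single uniform step size (`step`, a hypothetical parameter) stands in for it, and the scan order is the conventional JPEG-style zigzag rather than anything confirmed by the source.

```python
def zigzag_indices(n=8):
    """(row, col) pairs of an n x n block in conventional zigzag order."""
    order = []
    for s in range(2 * n - 1):                      # s = row + col (anti-diagonal)
        rows = range(max(0, s - n + 1), min(s, n - 1) + 1)
        diag = [(r, s - r) for r in rows]           # top-right to bottom-left
        if s % 2 == 0:
            diag.reverse()                          # even diagonals run upwards
        order.extend(diag)
    return order

def quantize(coeffs, step=16):
    """Uniform quantization with one step size (assumption, not the paper's table)."""
    return [[int(round(c / step)) for c in row] for row in coeffs]

def zigzag_scan(block):
    """Flatten a square block into a low-to-high-frequency sequence."""
    return [block[r][c] for r, c in zigzag_indices(len(block))]

# Tiny 2 x 2 illustration: the DC term comes first, the highest frequency last.
scanned = zigzag_scan(quantize([[800.0, 7.0], [-9.0, 0.0]], step=16))
```

After quantization most high-frequency entries are zero, so the zigzag sequence tends to end in a long run of zeros that an entropy coder can represent compactly.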
3.1 DCT computation through CIM
The comparative input method is a new approach that compares two adjacent rows in an N × N sub-image while calculating the forward 1-D DCT. Initially, the 8 × 8 sub-image blocks are obtained through the subdivision process. In general, every row of the sub-image (an array of eight elements) is applied as input to the 1-D DCT to obtain an output array of eight DCT coefficients.
Here, the threshold value depends on the number of bits considered for row comparison. If the absolute difference between corresponding pixels in X_{m} and X_{m−1} is less than or equal to the given threshold (T), the rows are considered matching; otherwise, they are considered non-matching. With these assumptions, Eq. (3) is used to eliminate the DCT computation (Y_{m}) for a particular row if that row (X_{m}) matches the previous row (X_{m−1}). Based on the image quality required on reconstruction, the threshold value is selected as 1, 3, 7, or 15 for efficient hardware implementation. Choosing a higher threshold value slightly reduces the quality of the reconstructed image.
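A minimal sketch of the row test described above, under two stated assumptions: a row counts as matching only when every one of its pixel pairs is within T, and ignoring N low-order bits corresponds to T = 2^N − 1 (which reproduces the listed thresholds 1, 3, 7, and 15). Neither helper name comes from the paper.

```python
def threshold_for_ignored_bits(n_bits):
    """Assumed mapping from ignored low-order bits to threshold: 2**n - 1."""
    return (1 << n_bits) - 1

def rows_match(current, previous, threshold):
    """True when every corresponding pixel pair differs by at most `threshold`."""
    return all(abs(a - b) <= threshold for a, b in zip(current, previous))

prev_row = [120, 121, 119, 118, 122, 125, 124, 123]
curr_row = [121, 123, 118, 117, 124, 126, 121, 122]   # largest difference is 3

match_t3 = rows_match(curr_row, prev_row, threshold_for_ignored_bits(2))  # T = 3
match_t1 = rows_match(curr_row, prev_row, threshold_for_ignored_bits(1))  # T = 1
```

With T = 0 the test reduces to exact equality of the two rows, which corresponds to the case where all bits are considered.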
3.2 Proposed DCT architecture using CIM
- 1.
Row-comparator
- 2.
DCT power controller
- 3.
DCT computation unit
- 4.
Output selection block
- 5.
Memory
Initially, each row from the 8 × 8 sub-image is sent to the row comparator block. The row comparator block compares all eight pixels of the current row with those of the previous row. Based on the output of the row comparator block, the DCT power control block activates or deactivates the DCT core. Thus, the main function of the DCT power control block is to control the power input given to the DCT architecture. If it receives a “high” signal, it disables the power supplied to the DCT architecture; otherwise, it enables the power input. Hence, if two rows of an 8 × 8 sub-image are equal, the DCT need not be computed for the current row, and thus, significant power reduction is achieved. The output selection block provides either the buffered pre-computed DCT coefficients of the previous row or the output of the DCT core for the current row, based on the input provided by the row comparator. Finally, the DCT coefficients of the 8 × 8 sub-image are stored in a RAM for further processing.
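The data flow through the five blocks can be modelled behaviourally as below. This is a software sketch, not the authors' hardware description: it assumes the first row of each sub-image is always computed, that a row matches when all pixel pairs are within the threshold, and that the buffer holds the coefficients most recently written out. All names are illustrative.

```python
import math

def dct_1d(row):
    """Orthonormal 8-point (N-point) DCT-II, standing in for the DCT core."""
    n = len(row)
    return [
        (math.sqrt(1.0 / n) if u == 0 else math.sqrt(2.0 / n))
        * sum(row[x] * math.cos((2 * x + 1) * u * math.pi / (2 * n)) for x in range(n))
        for u in range(n)
    ]

def cim_block_dct(block, threshold):
    """Return (coefficient rows, number of times the DCT core was activated)."""
    memory = []        # stands in for the RAM holding the block's coefficients
    buffered = None    # coefficients of the previous row (output selection input)
    previous = None    # previous pixel row (row comparator input)
    activations = 0
    for row in block:
        match = previous is not None and all(
            abs(a - b) <= threshold for a, b in zip(row, previous))
        if match:
            out = list(buffered)        # core disabled, reuse buffered coefficients
        else:
            out = dct_1d(row)           # core enabled for this row
            activations += 1
        memory.append(out)
        buffered, previous = out, row
    return memory, activations

smooth = [[50, 52, 54, 56, 58, 60, 62, 64] for _ in range(8)]
coeff_rows, core_runs = cim_block_dct(smooth, threshold=3)
```

For the smooth block above only the first row reaches the DCT core, so `core_runs` is 1 and the remaining seven rows reuse the buffered coefficients.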
3.3 Average power consumption (P_{av})
3.4 Processing time (T_{pr})
4 Results and discussions
Table 1 Comparison of MSE and PSNR of the reconstructed images for different numbers of bits (N) ignored for row comparison
Sl. no. | Name of the image | MSE (N = 0) | PSNR in dB (N = 0) | MSE (N = 1) | PSNR in dB (N = 1) | MSE (N = 2) | PSNR in dB (N = 2) | MSE (N = 3) | PSNR in dB (N = 3) | MSE (N = 4) | PSNR in dB (N = 4) |
---|---|---|---|---|---|---|---|---|---|---|---|
1 | Lena | 4.0e-28 | 321.941 | 0.1156 | 57.5 | 0.5805 | 50.493 | 3.029 | 42.958 | 16.1493 | 36.049 |
2 | Cameraman | 6.5e-28 | 319.962 | 0.0805 | 59.073 | 0.4055 | 52.051 | 1.9449 | 45.241 | 11.1596 | 37.654 |
3 | Rice | 3.5e-28 | 322.577 | 0.0653 | 59.983 | 0.4505 | 51.594 | 3.4139 | 42.798 | 17.4703 | 35.707 |
4 | Mandrill | 5.1e-28 | 321.013 | 0.0382 | 62.311 | 0.1977 | 55.171 | 2.1792 | 44.747 | 20.2722 | 35.019 |
5 | Pirate | 3.8e-28 | 322.261 | 0.0351 | 62.678 | 0.235 | 54.42 | 2.2731 | 44.564 | 17.2967 | 35.751 |
6 | Peppers | 4.0e-28 | 322.021 | 0.0486 | 61.262 | 0.3784 | 52.351 | 4.1263 | 41.975 | 21.7505 | 34.756 |
Figure 7b shows the corresponding PSNR of the reconstructed images as given in Table 1. The chart shows a degradation in image quality as the number of bits ignored for row comparison increases. Comparing the various inputs, the PSNR for the Cameraman image is higher than that of the other images for 4-bit elimination, which indicates a better output quality.
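The relation between the MSE and PSNR columns can be checked with the usual definition PSNR = 10 log10(peak² / MSE). The peak value of 255 (8-bit pixels) is an assumption, since the formula is not stated in this text, but it is consistent with the tabulated pairs: an MSE of 0.1156 gives about 57.50 dB.

```python
import math

def psnr_db(mse, peak=255):
    """Peak signal-to-noise ratio in dB for a given mean squared error."""
    return 10.0 * math.log10(peak ** 2 / mse)

lena_n1 = psnr_db(0.1156)     # Lena, 1 bit ignored
lena_n4 = psnr_db(16.1493)    # Lena, 4 bits ignored
```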
Table 2 Number of rows repeated for various images after N bits are ignored for row comparison
Image | N = 0 | N = 1 | N = 2 | N = 3 | N = 4 |
---|---|---|---|---|---|
Lena | 268 | 1255 | 2496 | 3911 | 5382 |
Cameraman | 809 | 1868 | 2556 | 3213 | 4077 |
Rice | 34 | 357 | 1304 | 2774 | 4310 |
Mandrill | 21 | 122 | 495 | 1374 | 3220 |
Pirate | 62 | 224 | 693 | 1653 | 3292 |
Peppers | 10 | 199 | 964 | 2811 | 4728 |
4.1 FPGA implementation of DCT using CIM
Table 3 Device utilization and timing summary of DCT architecture with and without comparison block
Hardware utilization | Chen’s algorithm, with comparative block | Chen’s algorithm, without comparative block | Lee’s algorithm, with comparative block | Lee’s algorithm, without comparative block |
---|---|---|---|---|
Number of slices (4656) | 255 | 193 | 165 | 103 |
Number of four input LUTs (9312) | 428 | 373 | 242 | 187 |
Number of bonded IOBs (190) | 138 | 136 | 138 | 136 |
Maximum combinational path delay (ns) | 22.542 | 21.603 | 21.238 | 20.297 |
Table 4 MSE and PSNR of the reconstructed images for various numbers of bits (N) ignored for row comparison
Sl. no. | Name of the image | MSE (N = 0) | PSNR in dB (N = 0) | MSE (N = 1) | PSNR in dB (N = 1) | MSE (N = 2) | PSNR in dB (N = 2) | MSE (N = 3) | PSNR in dB (N = 3) | MSE (N = 4) | PSNR in dB (N = 4) |
---|---|---|---|---|---|---|---|---|---|---|---|
1 | Lena | 0.0004 | 82.003 | 0.131 | 56.951 | 0.651 | 49.997 | 3.672 | 42.482 | 18.645 | 35.425 |
2 | Cameraman | 0.0005 | 80.888 | 0.090 | 58.613 | 0.453 | 51.568 | 2.259 | 44.592 | 12.870 | 37.035 |
3 | Rice | 0.0002 | 86.370 | 0.072 | 59.539 | 0.495 | 51.184 | 3.765 | 42.374 | 19.689 | 35.189 |
4 | Mandrill | 0.0009 | 78.400 | 0.043 | 61.837 | 0.224 | 54.632 | 2.491 | 44.167 | 23.358 | 34.446 |
5 | Pirate | 0.0002 | 84.707 | 0.039 | 62.265 | 0.268 | 53.848 | 2.567 | 44.036 | 19.779 | 35.169 |
6 | Peppers | 0.0004 | 82.568 | 0.054 | 60.791 | 0.418 | 51.921 | 4.596 | 41.507 | 25.046 | 34.143 |
4.2 ASIC implementation of proposed DCT
Table 5 Comparison of gate count and power consumption
Description | DCT architecture (Lee’s algorithm) | DCT with proposed row comparison unit |
---|---|---|
Gate counts | 1251 | 1656 |
Cell area (μm^{2}) | 35992 | 57829 |
Average power consumption (mW) | 8.0157 | 9.2727 |
Table 6 Power reduction in the proposed DCT for various numbers of bits (N) ignored for row comparison
Image | Comparison unit power consumption (mW) | Power consumed by DCT alone (mW) | Proposed DCT power in mW (N = 0) | Power variation in % (N = 0) | Proposed DCT power in mW (N = 1) | Power variation in % (N = 1) | Proposed DCT power in mW (N = 2) | Power variation in % (N = 2) | Proposed DCT power in mW (N = 3) | Power variation in % (N = 3) | Proposed DCT power in mW (N = 4) | Power variation in % (N = 4) |
---|---|---|---|---|---|---|---|---|---|---|---|---|
Lena | 1.26 | 8.02 | 9.01 | −12.4 | 8.04 | −0.4 | 6.83 | 14.8 | 5.45 | 32.1 | 4.01 | 50.0 |
Cameraman | 1.26 | 8.02 | 8.48 | −5.8 | 7.44 | 7.1 | 6.77 | 15.5 | 6.13 | 23.5 | 5.28 | 34.1 |
Rice | 1.26 | 8.02 | 9.24 | −15.3 | 8.92 | −11.3 | 8 | 0.2 | 6.56 | 18.2 | 5.06 | 36.9 |
Mandrill | 1.26 | 8.02 | 9.25 | −15.4 | 9.15 | −14.2 | 8.79 | −9.6 | 7.93 | 1.1 | 6.12 | 23.6 |
Pirate | 1.26 | 8.02 | 9.21 | −14.9 | 9.05 | −12.9 | 8.59 | −7.2 | 7.65 | 4.5 | 6.05 | 24.5 |
Peppers | 1.26 | 8.02 | 9.26 | −15.6 | 9.08 | −13.3 | 8.33 | −3.9 | 6.52 | 18.6 | 4.65 | 42.0 |
If all the bits are considered when comparing two pixel values, the power consumption of the proposed DCT (the average power, p_{av}) is higher than the normal DCT power consumption (p_{α}) while computing the DCT of a sub-image. In this case, the power consumed by the comparison unit is greater than the power saved by row elimination over the complete image. Hence, the percentage power variations are negative. When 1 bit is ignored for comparison, the power consumption of the proposed DCT is still higher than that of the normal DCT for all the images except the Cameraman image, which has a large number of repeated rows (1868). Hence, the percentage power variation is positive for that image alone. When 2 bits are ignored, power reduction may be achieved, depending on the number of repeated rows in the image. By ignoring 3 or 4 bits for row comparison, significant power reduction can be achieved, as is clear from the values given in Table 6.
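The trade-off described above can be illustrated with a simple model. It is an assumption rather than the paper's stated equation: the comparison unit is taken to draw 1.26 mW for every row, the DCT core 8.02 mW only for non-repeated rows, and each image is taken to contain 8192 eight-pixel rows (a 256 × 256 image, which is not stated in this text). Under those assumptions the repeated-row counts tabulated earlier land close to the tabulated power figures.

```python
TOTAL_ROWS = 8192   # assumption: 256 x 256 image = 32 x 32 sub-images x 8 rows

def average_power_mw(repeated_rows, p_compare=1.26, p_dct=8.02,
                     total_rows=TOTAL_ROWS):
    """Comparator always on; DCT core powered only for non-repeated rows."""
    active_fraction = 1.0 - repeated_rows / total_rows
    return p_compare + p_dct * active_fraction

def power_variation_percent(p_avg, p_dct=8.02):
    """Positive values mean the proposed design uses less power than the DCT alone."""
    return 100.0 * (p_dct - p_avg) / p_dct

lena_n0 = average_power_mw(268)     # all bits compared: few repeated rows
lena_n4 = average_power_mw(5382)    # 4 bits ignored: many repeated rows
saving_n4 = power_variation_percent(lena_n4)
```

In this model the break-even point is reached when the fraction of repeated rows exceeds 1.26/8.02, roughly 16 % of the rows; below that, the comparator costs more than the skipped DCT computations save.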
Table 7 Reduction in the processing time of the proposed DCT architecture for various numbers of bits (N) ignored for row comparison
Name of image | Processing time of comparison unit (ns) | Processing time of DCT alone (ns) | Total processing time in μs (N = 0) | Processing time reduction in % (N = 0) | Total processing time in μs (N = 1) | Processing time reduction in % (N = 1) | Total processing time in μs (N = 2) | Processing time reduction in % (N = 2) | Total processing time in μs (N = 3) | Processing time reduction in % (N = 3) | Total processing time in μs (N = 4) | Processing time reduction in % (N = 4) |
---|---|---|---|---|---|---|---|---|---|---|---|---|
Lena | 0.9 | 21.6 | 178.9 | −1.0 | 157.6 | 11.0 | 130.7 | 26.1 | 100.2 | 43.4 | 68.4 | 61.4 |
Cameraman | 0.9 | 21.6 | 167.2 | 6.0 | 144.3 | 18.0 | 129.5 | 26.9 | 115.3 | 34.9 | 96.6 | 45.4 |
Rice | 0.9 | 21.6 | 183.9 | −4.0 | 177.0 | 0.0 | 156.5 | 11.6 | 124.7 | 29.5 | 91.6 | 48.3 |
Mandrill | 0.9 | 21.6 | 184.2 | −4.0 | 182.0 | −3.0 | 174.0 | 1.7 | 155.0 | 12.4 | 115.1 | 35.0 |
Pirate | 0.9 | 21.6 | 183.3 | −4.0 | 179.8 | −2.0 | 169.7 | 4.1 | 149.0 | 15.8 | 113.6 | 35.8 |
Peppers | 0.9 | 21.6 | 184.5 | −4.0 | 180.4 | −2.0 | 163.8 | 7.4 | 123.9 | 30.0 | 82.5 | 53.4 |
5 Conclusions
In this paper, we have proposed a novel method of DCT computation for lossy image compression. The 1-D DCT of a row is computed only when required, based on the difference between the pixel values of adjacent rows. By adopting this methodology, a large number of computations are eliminated when only 5 or 4 bits of each pixel are taken for row comparison. The proposed method is verified with various high- and less-correlated images. The results show that image quality is maintained at a good level even though 4 of the 8 bits in a pixel are ignored for row comparison. The pixel comparison method is implemented in both FPGA and ASIC environments, and it eliminates a maximum of 65 % of DCT computations in the Lena image and 39 % in the Mandrill image when 4 bits are ignored for row comparison. The proposed architecture consumes 1.257 mW instead of 8.027 mW, with 24.4 % additional hardware cost, when the pixels of two rows differ very little. The experimental results show that the power consumption of the proposed DCT architecture is reduced to 4.01 mW for highly correlated images and 6.12 mW for less-correlated images without much affecting the image quality. This corresponds to a maximum power reduction of 50.02 % and a minimum power reduction of 23.63 % relative to the original DCT implementation.
Declarations
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
References
- N. Ahmed, T. Natarajan, K.R. Rao, Discrete cosine transform. IEEE T Comput 23(2), 90–93 (1974)
- W.H. Chen, C.H. Smith, S.C. Fralick, A fast computational algorithm for the discrete cosine transform. IEEE T Commun 25(9), 1004–1009 (1977)
- B. Lee, A new algorithm to compute the discrete cosine transform. IEEE T Acoust Speech P 32(6), 1243–1245 (1984)
- H.S. Hou, A fast recursive algorithm for computing the discrete cosine transform. IEEE T Acoust Speech 35(10), 1455–1461 (1987)
- C. Loeffler, A. Ligtenberg, G.S. Moschytz, Practical fast 1-D DCT algorithms with 11 multiplications. Proc Int Conf Acoust Speech Signal Process 2, 988–991 (1989)
- J. Rohit Kumar, Design and FPGA implementation of CORDIC-based 8-point 1D DCT processor, in E thesis, Department of Electronics and Communication Engineering, National Institute of Technology (Session, Rourkela, 2010)
- V.K. Sharma, K.K. Mahapatra, C. Umesh, An efficient distributed arithmetic based VLSI architecture for DCT. Proc Int Conf Dev Commun, 1–5 (2011). https://doi.org/10.1109/ICDECOM.2011.5738484
- A. Shaofeng, C. Wang, A computation structure for 2-D DCT watermarking. IEEE Int Midwest Symposium Circ Syst, 577–580 (2009). https://doi.org/10.1109/MWSCAS.2009.5236026
- M.E. Aakif, S. Belkouch, M.M. Hassani, Low power and fast DCT architecture using multiplier-less method. Proc Int Conf Faible Tension Faible Consommation, 63–66 (2011). https://doi.org/10.1109/FTFC.2011.5948920
- C. Chao, P. Keshab, A novel systolic array structure for DCT. IEEE Trans Circ Systems-II 52(7), 366–368 (2005)
- S. Belkouch, M.E. Aakif, A. Ait Ouahman, Improved implementation of a modified discrete cosine transform on low-cost FPGA. Int Symposium on I/V Comm Mobile Network, 1–4 (2010). https://doi.org/10.1109/ISVC.2010.5656248
- K.S. Geetha, M. Uttara Kumari, A new multiplierless discrete cosine transform based on the Ramanujan ordered numbers for image coding. Int J Signal Proc Image Proc Pattern Recognit 3(4), 1–14 (2010)
- J. Hyeonuk, K. Jinsang, C. Won-Kyung, Low-power multiplierless DCT architecture using image correlation. IEEE T Consum Electr 50(1), 262–267 (2004)
- C.C. Sun, S.J. Ruan, B. Heyne, Goetze, Low-power and high-quality Cordic-based Loeffler DCT for signal processing. IET Trans Circ Dev Syst 1(6), 453–461 (2007)
- H. Dong Sam, Low power design of DCT and IDCT for low bit rate video codecs. IEEE T Multimedia 6(6), 414–422 (2004)
- A.P. Vinod, D. Rajan, A. Singla, Differential pixel-based low-power and high-speed implementation of DCT for on-board satellite image processing. IET-Circ Dev Syst 1(6), 444–450 (2007)
- P. Jongsun, H.C. Jung, K. Roy, Dynamic bit-width adaptation in DCT: an approach to trade off image quality and computation energy. IEEE T VLSI Syst 18(5), 787–793 (2010)
- S.P. Mohanty, K. Balakrishnan, A dual voltage-frequency VLSI chip for image watermarking in DCT domain. IEEE T Circuits-II: Express Briefs 53(5), 394–398 (2006)
- P. Jongsun, K. Roy, A low power reconfigurable DCT architecture to trade off image quality for computational complexity. Acoust Speech Signal Process 2004. Proc (ICASSP '04), IEEE Int Conf 5, V-17-20 (2004)
- L. Zhenwei, P. Silong, M. Hong, W. Qiang, A reconfigurable DCT architecture for multimedia applications. Congr Image Sign Proc CISP '08 1, 360–364 (2008)
- M.-W. Lee, J.-H. Yoon, J. Park, Reconfigurable CORDIC-based low-power DCT architecture based on data priority. IEEE T VLSI Syst 22(5), 1060–1068 (2014)
- J.M. Rabaey, A. Chandrakasan, B. Nikolic, Digital integrated circuits, in Pearson Education-Engineering & Technology, 2nd edn., 2003