Implementation of computation-reduced DCT using a novel method
EURASIP Journal on Image and Video Processingvolume 2015, Article number: 34 (2015)
The discrete cosine transform (DCT) performs a very important role in the application of lossy compression for representing the pixel values of an image using lesser number of coefficients. Recently, many algorithms have been devised to compute DCT. In the initial stage of image compression, the image is generally subdivided into smaller subblocks, and these subblocks are converted into DCT coefficients. In this paper, we present a novel DCT architecture that reduces the power consumption by decreasing the computational complexity based on the correlation between two successive rows. The unwanted forward DCT computations in each 8 × 8 sub-image are eliminated, thereby making a significant reduction of forward DCT computation for the whole image. This algorithm is verified with various high- and less-correlated images, and the result shows that image quality is not much affected when only the most significant 4 bits per pixel are considered for row comparison. The proposed architecture is synthesized using Cadence SoC Encounter® with TSMC 180 nm standard cell library. This architecture consumes 1.257 mW power instead of 8.027 mW when the pixels of two rows have very less difference. The experimental result shows that the proposed DCT architecture reduces the average power consumption by 50.02 % and the total processing time by 61.4 % for high-correlated images. For less-correlated images, the reduction in power consumption and the total processing time is 23.63 and 35 %, respectively.
Image compression is a process of reducing the size of representation of graphics file in binary format without affecting the quality of the image to an objectionable level. This reduction helps to store more images for the same amount of storage device. It also decreases the transmission time for images to be sent over the various technologies like internet . The discrete cosine transform (DCT) which is the most widely used technique for image compression was initially defined in . It came up as a revolutionary standard when compared with the other existing transforms. After that, an algorithm for computing Fast DCT (FDCT) was introduced by Chen et al., in  which was based on matrix decomposition of the orthogonal basis function of the cosine transform. This method took (3 N/2)(log2 N − 1) + 2 real additions and N log2 N − 3 N/2 + 4 real multiplications, and this is approximately six times faster than the conventional approach. Further, a new algorithm was introduced for the 2Npoint DCT as in . This algorithm uses only half of the number of multiplications required by the existing efficient algorithms (12 multiplications and 29 additions), and it makes the system simpler by decomposing the N-point Inverse DCT (IDCT) into the sum of two N/2-point IDCTs. A recursive algorithm for DCT  was presented with a structure that allows the generation of the next higher order DCT from two identical lower order DCTs to reduce the number of adders and multipliers (12 multiplications and 29 additions). Loffler came up with a practical fast 1-D DCT algorithm  in which the number of multiplications was reduced to 11 by inverting add/subtract modules and found an equivalence for the rotation block (only 3 additions and 3 multiplications per block instead of 4 multiplications and 2 additions). Following these contributions in DCT implementation, many algorithms were constantly introduced to optimize the DCT.
In recent years, the idea of implementing DCT using CORDIC (co-ordinate rotation digital computer)  using only shift and add arithmetic with look-up tables was analyzed for efficient hardware implementation. Another technique called distributed arithmetic (DA) was devised  which computes multiplication as distributed over bit-level memories and adders. Read-only memory (ROM) free 1-D DCT architecture was discussed in , and this architecture is based on DA method with reduced area and power reduction. As in , an unsigned constant coefficient multiplication was done by moving two negative signs to the next adder to make them positive, and it was implemented using multiplier-less operation. The prime N-length DCT was divided into similar cyclic convolution structures, and the DCT was implemented using systolic array structure . The technique used in  reduced the resource usage and increased the maximum frequency by rearranging the ADD blocks to the consecutive stages. Also, to eliminate the use of multipliers by using shift and addition operations, many algorithms were devised. The technique which uses Ramanujan numbers for calculating cosine values and uses Chebyshev type recursion to compute DCT  was also proposed. A low power multiplier-less DCT was presented in , and it reduces the switching power consumption around 26 % by removing unnecessary arithmetic operations on unused bits during the CORDlC calculations. The complexity of DCT computation was reduced in  by optimizing the Loeffler DCT, based on the CORDlC algorithm. Further, it reduces the 11 multiply and 29 add operations to 38 add and 16 shift operations without losing quality. A low power design technique was presented in , which eliminates DCT computation of low energy macro block. A technique was presented to reduce the complexity of multiplications in DCT  by using differential pixels in 8 × 8 blocks of input image matrix. Based on differences of 64 DCT coefficients, separate operand bit-widths were used for different frequency components to reduce computation energy . Various low-power design techniques such as dual voltage, dual frequency, and clock gating were used in the DCT architecture to reduce the power consumption .
This paper proposes a new architecture that computes the DCT, based on the difference between pixels of two rows, and also, it reduces the computations and power consumption of DCT. The paper is organized as follows: The most common DCT implementation strategies are discussed in Section 2. The conventional image compression technique using DCT and the proposed comparative input method (CIM) which eliminates the unwanted DCT computations are discussed in Section 3. The simulation results, performance, and comparative analysis of the proposed DCT is given in Section 4, and Section 5 concludes the research findings.
Existing algorithms for DCT implementation
Generally, the two methods used for computing 2-D DCT are
Direct 2-D computation and
Decomposition into two 1-D DCTs using seperability.
The proposed method adopts the second approach to compute the 2-D DCT. The row transformation is initially applied to obtain a 1-D output and then applying it the next time along the column yields the 2-D output as shown in Fig. 1. In hardware implementation of 2-D DCT, the inputs can be obtained by storing them in random access memory (RAM), and then, it is given to the 1-D computation module. After the computation, the output is stored in a transposition buffer before it is given to the 1-D block again. This is illustrated in Fig. 2.
The 2-D DCT is given by Eq. (1).
In the Eq. (1), f(x,y) is the input matrix of pixels representing the N × N sub-image. F(u,v) is the corresponding 2-D DCT output coefficients. D(u) and D(v) are the normalizing factors. Both the cosine terms represent the orthonormal basis functions of the cosine functions used to map the input pixels into the transformed coefficients. The input values should be multiplied with the orthonormal basis functions and the normalizing factor to get the DCTcoefficients. The 1-D DCT is given in the Eq. (2).
Here, f(x) is the 1-D row of input pixels, and the cosine term is the orthonormal basis function. F(u) is the 1-D DCT output, and D(u) is the normalizing factor.
To implement the DCT, modified Lee’s algorithm  and Chen’s algorithm  are used in this paper. Lee’s algorithm utilizes three levels of mathematical decomposition to calculate DCT in a simpler method. Compared to Chen’s algorithm, Lee’s method reduces the computational complexity of calculating DCT coefficients by 46 %. Both the algorithms are simulated using Matlab and EDA tool. To prove the hardware efficiency of the proposed algorithm, the architecture is implemented in field programmable gate array (FPGA). The design entry is made through Verilog hardware description language (HDL), simulated in Xilinx ISim, and synthesized using Xilinx XST.
The algorithm proposed by Chen et al.  to compute forward DCT is called “Fast algorithm.” The computation is done similar to the method shown in Fig. 2 by computing 1-D DCT, transposing it, and then computing 2-D DCT. For a 2-D DCT, the 8 × 8 transformation matrix corresponding to the 8 × 8 basis function is given by
In Chen’s algorithm, the 8 × 8 transformation matrix is decomposed into two 4 × 4 matrices. This is done by considering the input values which should be multiplied with common coefficients (in the transformation matrix). After decomposing the 8 × 8 transformation matrix, the two 4 × 4 transformation matrices obtained are
The X(n) corresponds to the 1-D input values, and Y(n) corresponds to the 1-D output values. The number of computations involved is (3 N/2)(log2 N − 1) + 2 additions and Nlog2 N − 3 N/2 + 4 multiplications. Hence, for N = 8, it requires 16 multiplications and 12 additions.
Image compression using DCT
The overall image compression using the proposed CIM is carried out by performing the steps shown in Fig. 3. Initially, the input image is subdivided into smaller sub-images of size 2n, so that the correlation (redundancy) between the adjacent pixels in the sub-image will reduce the number of DCT coefficients. In general, both the level of computational complexity and compression increases as the sub-image size increases. The most popular sub-image sizes are 8 × 8 and 16 × 16, and we consider sub-image size of 8 × 8 to have optimal computational complexity. Also, the frequency transformations like DCT are good at compressing smooth areas with low frequency content, but quite bad at compressing high frequency contents.
After performing CIM-based DCT computation, the following steps for the compression of the image are carried out. The DCT coefficients are quantized to a pre-determined level to reduce psycho-visual redundancy. Zigzag scanning ensures the scanning of high-frequency DCT coefficients, and the scanned coefficients are encoded to reduce coding redundancy.
DCT computation through CIM
The comparative input method is a new approach of comparing two adjacent rows in an N × N sub-image while calculating the forward 1-D DCT. Initially, the 8 × 8 block of the sub-images are obtained through subdivision process. In general, every row of the sub-image (an array of eight elements) is applied as input to the 1-D DCT to obtain an output array of eight DCT coefficients.
For a considered sub-image, the DCT is computed for the first row. From the second row onwards, pixels in each row are compared with the previous row of pixels. If all the pixels of a row is found to be nearly same as the pixels in the previous row, the DCT computation need not be performed for the second row. Instead, the previous row’s 1-D DCT coefficients can be used for the current second row without any need for computation. Otherwise, the pixels are considered as non-matching and the comparison fails. For this case, 1-D DCT is applied again for the particular row to obtain a new DCT coefficients. This procedure is applied for all the remaining rows of the 8 × 8 sub-image. By following this row comparison, a large number of computations are eliminated. Figure 4 shows the above discussed comparison method for DCT computation.
Consider X m is the mth row of the given image, X m(n) is the nth pixel corresponding to the mth row of the original image. Similarly, Y m is the DCT output for mth row of the image, Y m(n) is the nth DCT coefficient corresponding to the mth row. Thus, the DCT coefficient is computed as follows:
Here, the threshold value depends on the number of bits considered for row comparison. If the absolute difference between any of the pixels in X m and X m -1 is less than or equal to the given threshold (T) value, it is considered as matching otherwise it is assumed to be non-matching. With these assumptions, the Eq. (3) is used to eliminate the DCT computation (Y m) for that particular row, if the row (X m) is matched with previous row (X m -1). Based on the required image quality while reconstruction, the threshold value is selected as 1 or 3 or 7 or 15 for efficient hardware implementation. Choosing higher threshold value slightly reduces the image quality while reconstruction.
Proposed DCT architecture using CIM
The proposed 1-D DCT architecture is implemented using CIM to perform the forward DCT, and it is shown in Fig. 5. The main components of the proposed system are
DCT power controller
DCT computation unit
Output selection block
Initially, each row from the 8 × 8 sub-image is sent to the row comparator block. The row comparator block compares all the eight pixels of the current row with the previous row. Based on the output of the row comparator block, the DCT power control block activates or deactivates the DCT core. Thus, the main function of the DCT power control block is to control the power input given to the DCT architecture. If it receives a “high” signal, it disables the power to be supplied to the DCT architecture else it enables the power input. Hence, if the two rows of an 8 × 8 sub-image are equal, the DCT need not be computed for the current row, and thus, significant power reduction is achieved. Also, the output selection block provides the buffered pre-computed DCT coefficients of the previous row or the output of the DCT core of the current row based on the input provided by the row comparator. Finally, the DCT coefficients of the 8 × 8 sub-image are stored in a RAM for further processing.
Average power consumption (P av)
Consider P α as the average power consumption of DCT without the row comparison unit and P β as the average power consumption of all the other units excluding DCT core which is used for comparing rows. N com is the total number of 8 × 1 rows available in the image. N rep is the number of rows of 8 × 1 pixels having similar pixel values and excluding the first row; N non-rep is the number of rows of 8 × 1 pixels having dissimilar pixel values, and also, it includes first row having similar pixel values. P av is the average power consumed by the proposed DCT architecture and is given by Eq. (4). The Eq. (5) provides the percentage of power reduction that is obtained using proposed DCT architecture when compared with the regular DCT implementation.
Processing time (T pr)
Consider T α as the time required to process a 1-D DCT for a single row and T β as the time required to process a 1-D DCT for a single row when the current row matches with previous row. N com is the number of 1-D computations involved in an image, N rep is the number of 1-D computations repeated, N non-rep is the number of non-repeated 1-D computation, and T tot is the total time required to process the 1-D DCT for an image by the proposed DCT architecture, and it is given by Eq. (6). The Eq. (7) shows the percentage processing time (T pr) reduction using proposed DCT architecture comparing with regular DCT implementation.
Results and discussions
The proposed algorithm is implemented in Matlab R2013a with various test standard images viz. Lena, Cameraman, etc., to compute the DCT coefficients. Further IDCT is calculated for the computed DCT coefficients of the test images, and then, the quality metrics like mean squared error (MSE) and peak signal to noise ratio (PSNR) are calculated. Figure 6 shows the reconstructed images simulated using Matlab and the MSE; PSNR values are also listed for each image. The experiment is conducted for different cases of various thresholds based on the number of MSBs considered for row comparison.
Different images with wide variations in the intensity are considered for computing DCT using the proposed method. Performance comparison for various images obtained from the proposed DCT computation is given in Fig. 6 along with the output images. From Fig. 6, it is clear that the output image is exactly same as the input image when 0 bit is ignoredfor row comparison. When the number of bits ignored per pixel increases, the image quality decreases slightly. The comparison of MSE and PSNR values obtained for each case of different input images are given in Table 1. For all the output images, the MSE value increases when the number of bits per pixel ignored for row comparison increases and makes the PSNR to decrease.
A plot between the MSE values and the number of bits ignored per pixel corresponding to various images is depicted in Fig. 7a. The MSE values are less till the number of bits ignored per pixel is less than 3. When 3 bits are ignored, the MSE is considerable and if the number of bits ignored becomes 4, the MSE value becomes significantly high. This is because in the latter case, the difference between the two pixels (for row comparison) becomes 15 which causes a significant difference in the DCT coefficient.
Figure 7b shows the corresponding PSNR of the reconstructed images as given in Table 1. The chart shows a degradation in the image quality as the number of bits ignored for row comparison increases. When the comparison is made between the various inputs, it can be seen that the PSNR for the Cameraman image is high, compared with the other images for 4-bit eliminationand hence a better output quality.
To calculate the reduction in the computational complexity, we have found the number of repeated rows for which DCT needs not be calculated. The repeated number of rows for various number of elminated bit for row comparison is shown in Table 2.
Figure 8 plots the number of repeated rows as given in Table 2. Based on the homogeneity of pixels in the sub-images, the interpixel redundancy varies, and hence, the number of repeated rows changes for each image.
FPGA implementation of DCT using CIM
FPGA consists of large number of configurable logic blocks (CLBs) connected together through connection matrix to form any complicated high speed digital systems. FPGA implementation is a suitable solution for testing the performance of the proposed architecture before the development of ASIC. All the hardware subblocks of the design is developed with the Verilog HDL and verified with the Xilinx ISim simulator. Figure 9 shows the simulation results of DCT using ISim, and these results are verified with the results obtained usingMatlab. After the functional verification, the design is synthesized using Xilinx XST with the Spartan 3e FPGA (3s500eft256-4) as the target device. Table 3 shows the device utilization summary for both the Chen’s and Lee’s algorithm with and without comparative blocks.
The input image is converted into its equivalent binary value using Matlab, and it is given as input to the proposed algorithm implemented using HDL. Using XILINX ISim simulator, the simulation is performed and the corresponding output is stored in a file for further processing. The stored output file is converted into image using Matlab, to calculate image quality metrics like MSE and PSNR. Table 4 shows the MSE and the PSNR values of the reconstructed image from the DCT coefficients computed by simulating the HDL design in XILINX ISim.
Figure 10 plots the MSE of the reconstructed image from the DCT coefficients calculated by simulating the algorithm in XILINX ISim simulator. The simulation is performed by truncating the least significant bits (LSB) of sizes ranging from 1 to 4 (N = 1 to 4) for row comparison. As N is increased from 1 to 4, the computational complexity reduces and the MSE increases as shown in Fig. 10. There is a small variation in the MSE and PSNR values obtained from hardware and software implementations.
Figure 11 shows the PSNR of the reconstructed image from the DCT coefficients obtained by simulating the algorithm in XILINX ISim simulator. The PSNR of the reconstructed image reduces as the number of bits for row comparison, N is increased.
The FPGA hardware resource utilization and the maximum combinational path delay of two DCT architectures are shown in Fig. 12a, b, respectively. From the device utilization summary, 32.25 % of FPGA hardware resource is utilized additionally, as the CIM block is included. But, the CIM block eliminates a maximum of 65 % DCT computations for Cameraman image and 39 % reduction for mandrill image.
ASIC implementation of proposed DCT
The proposed DCT architecture with Lee’s algorithm is synthesized using Cadence SoC Encounter® with TSMC 180 nm library. Table 5 provides the gate count and the power consumption for regular and proposed DCT architecturewith CIM block.
The proposed DCT core consumes more cell area as given in Table 5 due to the additional CIM block for reducing the overall computational complexity. When two rows are termed similar, DCT coefficients need not be computed for the later row, and the DCT coefficients of the previous row can be used. This eliminates the need to use the DCT core, and only the CIM block is active. Hence, using this proposed method, 1251 gates are idle and the DCT is computed with 1.257 mW instead of 8.0157 mW power consumption as the DCT core is disabled. Table 6 compares the power consumption of the regular DCT, the proposed DCT architecture. In Table 6, the power consumption for ignoring 0, 1, 2, 3, and 4 bits are calculated using the formula given in Eq. (4), and the percentage power variations are calculated using Eq. (5).
If all the bits are considered for comparing two pixel values, the proposed DCT power consumption (the average power (p av )) is higher than the normal DCT power consumption (p α ) while computing DCT of a sub-image. The power consumed by the comparison unit is greater than the power saved by row elimination in total for the complete image while all the bits are considered. Hence, the percentage power reduction are negative. Whereas in case of ignoring 1 bit for comprison, the power consumption for the proposed DCT is higher than that for normal DCT for all the images except the Cameraman image since it has a great reductions in the number of repeated rows (1868). Hence, the percentage power variations are positive for that image alone. Even in case of ignoring 2 bits, the power reduction may be achieved and it depends on the number of repeated row in an image. Perhaps, by ignoring 3 or 4 bits for row comparison, significant power reduction can be achieved, and it is clear from the values given in Table 6.
The Fig. 13 plots the percentage power reduction for various images corresponding to various number of bits eliminated for row comparison to avoid the DCT computation.
If two rows matches in pixel value, the DCT coefficients need not be computed and the DCT coefficients of the previous row can be retained. This eliminates the latency in computing DCT for the current row. Based on the correlation between the rows, the overall computational time can be greatly eliminated. If the DCT core is disabled, the overall time consumption to calculate DCT coefficients for a single pixel row is equal to the latency introduced by the comparator block. The power reduction is achieved by disabling the DCT core power supply when the two row values are same. This can be done using simple buffer and inverter circuit as shown in Fig. 14. The sizing of the CMOS  inverter which controls the power input to the DCT core is based on the power consumed by the DCT core and its input capacitance which depends on the technology.
The Table 7 shows the comparison of the regular DCT processing time and the proposed DCT architecture processing time. In the Table 7, the processing time for ignoring 0, 1, 2, 3, and 4 bits are calculated using the formula given in Eq. (6) and the percentage processing time reduction is calculated as given in Eq. (7).
Table 8 provides the maximum PSNR achieved and maximum power reduction while reconstructing Lena image from its DCT coefficient. Also, it compares the PSNR and power reduction (%) of various techniques available in the literature for the same image. The proposed method reduces the power by 50 % compared to other methods while achieving a maximum PSNR of 35.425 dB for Lena image.
In this paper, we have proposed a novel method for DCT computation for lossy image compression. 1-D DCT computation is computed for a row, and it is based on the difference between the pixel values of adjacent rows. By adopting this methodology, a larger number of computations are reduced when 5 and 4 bits of pixels are taken for row comparison. The proposed method is verified with various high- and less-correlated images. The results show that image quality is maintained to good level even though 4 bits are removed from 8 bits in a pixel for row comparison. The pixel comparison method is implemented in both FPGA as well as ASIC environment, and it eliminates maximum of 65 % of DCT computations in Cameraman image and 39% in mandrill image when 4 bits are eliminated for row comparison. The proposed architecture consumes 1.257 mW power instead of 8.027 mW with 24.4 % of additional hardware cost when the pixels of two rows have very less difference. The experimental result shows that the power consumption proposed DCT architecture is reduced to 4.01 mW for highly uncorrelated images and 6.02 mW for less-correlated images without much affecting the image quality. This achieves maximum power reduction of 50.02 % and minimum power reduction of 23.63 % of original DCT implementation.
N. Ahmed, T. Natarjan, K.R. Rao, Discrete cosine transform. IEEE T Comput 23(2), 90–93 (1974)
W.H. Chen, C.H. Smith, S.C. Fralick, A fast computational algorithm for the discrete cosine transform. IEEE T Commun 25(9), 1004–1009 (1977)
B. Lee, A new algorithm to compute the discrete cosine transform. IEEE T Acoust Speech P 32(6), 1243–1245 (1984)
H.S. Hou, A fast recursive algorithm for computing the discrete cosine transform. IEEE T Acoust Speech 35(10), 1455–1461 (1987)
C. Loeffler, A. Lightenberg, G.S. Moschytz, Practical fast 1–D DCT algorithms with 11 multiplications. Proc Int Conf Acoust Speech Signal Process 2, 988–991 (1989)
J. Rohit Kumar, Design and FPGA implementation of CORDIC-based 8-point 1D DCT processor, in E thesis, Department of Electronics and Communication Engineering, National Institute of Technology (Session, Rourkela, 2010)
VK Sharma, KK Mahapatra, C Umesh, An efficient distributed arithmetic based VLSI architecture for DCT. Proc Int Conf Dev Commun. 1–5 (2011). doi:10.1109/ICDECOM.2011.5738484
A Shaofeng, C Wang, A computation structure for 2-D DCT watermarking. IEEE Int. Midwest Symposium Circ. Syst. 577–580. (2009). doi:10.1109/MWSCAS.2009.5236026
ME Aakif, S Belkouch, MM Hassani, Low power and fast DCT architecture using multiplier-less method. Proc. Int. Conf. Faible Tension Faible Consommation. 63–66 (2011). doi:10.1109/FTFC.2011.5948920
C. Chao, P. Keshab, A novel systolic array structure for DCT. IEEE Trans Circ Systems—II 52(7), 366–368 (2005)
S Belkouch, ME Aakif, A Ait Ouahman, Improved implementation of a modified discrete cosine transform on low-cost FPGA. Int. Symposium on I/V Comm Mobile Network. 1–4 (2010). doi:10.1109/ISVC.2010.5656248
K.S. Geetha, M. Uttara Kumari, A new multiplierless discrete cosine transform based on the Ramanujan ordered numbers for image coding. Int J Signal Proc Image Proc Pattern Recognit 3(4), 1–14 (2010)
J. Hyeonuk, K. Jinsang, C. Won-Kyung, Low-power multiltiplierless DCT architecture using image correlation. IEEE T Consum Electr 50(1), 262–267 (2004)
C.C. Sun, S.J. Ruan, B. Heyne, Goetze, Low-power and high-quality Cordic-based Loeffler DCT for signal processing. IET Trans Circ Dev Syst 1(6), 453–461 (2007)
H. Dong Sam, Low power design of DCT and IDCT for low bit rate video codecs. IEEE T Multimedia 6(6), 414–422 (2004)
A.P. Vinod, D. Rajan, A. Singla, Differential pixel-based low-power and high-speed implementation of DCT for on-board satellite image processing. IET-Circ Dev Syst 1(6), 444–450 (2007)
P. Jongsun, H.C. Jung, K. Roy, Dynamic bit-width adaptation in DCT: an approach to trade off image quality and computation energy. IEEE T VLSI Syst 18(5), 787–793 (2010)
S.P. Mohanty, K. Balakrishnan, A dual voltage-frequency VLSI chip for image watermarking in DCT domain. IEEE T Circuits-II: Express Briefs 53(5), 394–398 (2006)
P. Jongsun, K. Roy, A low power reconfigurable DCT architecture to trade off image quality for computational complexity. Acoust Speech Signal Process 2004. Proc (ICASSP '04), IEEE Int Conf 5, V-17-20 (2004)
L. Zhenwei, P. Silong, M. Hong, W. Qiang, A reconfigurable DCT architecture for multimedia applications. Congr Image Sign Proc CISP '08 1, 360–364 (2008)
M.-W. Lee, J.-H. Yoon, J. Park, Reconfigurable CORDIC-based low-power DCT architecture based on data priority. IEEE T VLSI Syst 22(5), 1060–1068 (2014)
J.M. Rabaey, A. Chandrakasan, B. Nikolic, Digital integrated circuits, in Pearson Education-Engineering & Technology, 2nd edn., 2003
The authors declare that they have no competing interests.