Approximate calculation of 8-point DCT for various scenarios of practical applications

In this paper, based on a parametric model of the discrete cosine transform (DCT) matrix, and using an exhaustive search of the parameter space, we seek the best approximations of the 8-point DCT at given computational complexities, taking into account three different scenarios of practical usage. The possible parameter values are selected in such a way that the resulting transforms are multiplierless approximations, i.e., they require only additions and bit-shift operations. The considered usage scenarios cover the cases where the approximation of DCT is used: (i) at the data compression stage, (ii) at the decompression stage, and (iii) at both the compression and decompression stages. The effectiveness of the generated approximations is compared with that of popular known approximations of the 8-point DCT of the same class (i.e., multiplierless approximations). In addition, we perform a series of experiments in lossy compression of natural images using the popular JPEG standard. The obtained results are presented and discussed. It should be noted that in the overwhelming majority of cases the generated approximations are better than the known ones, e.g., in the asymmetric scenarios even by more than 3 dB starting from an entropy of 2 bits per pixel. In the last part of the paper, we investigate the possibility of hardware implementation of the generated approximations in Field-Programmable Gate Array (FPGA) circuits. The results in the form of resource and energy consumption are presented and commented on. The experimental outcomes confirm the assumption that the considered class of transformations is characterized by low resource utilization.

which has good properties in terms of compression of highly correlated data. However, DCT requires floating-point arithmetic involving multiplications, which can be prohibitive for battery-operated embedded systems and low-cost hardware realizations. Hence, in recent years we can observe a high interest of the scientific community in searching for novel, computationally effective approximations of DCT (e.g., see [1]-[17]). Since modern image compression standards partition input images into 8×8-pixel blocks and assume the separability of the two-dimensional discrete cosine transform, only the 8-point DCT has to be taken into consideration in the process of searching for effective approximations.
In this paper, we propose a family of highly effective approximations of the 8-point DCT that can be used in three different scenarios of practical application. In order to find the required approximations, we use a parametric model of the 8-point DCT matrix, browse the space of its parameters with an exhaustive search, and, at the same time, evaluate the quality of the found approximations with the use of the proposed transform quality measures. The best approximations found are ordered by non-decreasing computational complexity and presented in the form of a specific dictionary of transform approximations. It should be noted that the proposed approximations of the 8-point DCT require only additions and bit-shift operations involving integer representations of the input and resulting data. This facilitates their implementation in embedded systems based on simple central processing units that do not support floating-point operations, and also imposes significantly smaller requirements on dedicated hardware implementations. The considered scenarios of practical usage cover the cases where an approximation of DCT is used: (i) only at the data compression stage, (ii) only at the decompression stage, and (iii) at both the compression and decompression stages. Taking into account three scenarios is an additional contribution of this paper, as other authors have not considered scenarios (i) and (ii).
The organization of the paper is as follows. In Section 2, we describe the theoretical background regarding modern image compression standards based on block quantization. The well-known approximations of the 8-point DCT are listed and described in Section 3. In Section 4, we introduce the parametric model of the 8-point DCT, define the considered scenarios of practical usage, and formulate the quality measures used in the process of searching for the best approximations of the 8-point DCT. In the same section, we present the best approximations found. The results of practical experiments involving natural images and verifying the effectiveness of the proposed approximations can be found in Section 5. In Section 6, we address the aspects of hardware realizations.

Theoretical background
The block quantization scheme (see Fig. 1) in its two-dimensional variant is the core of modern standards for lossy compression of static images and video sequences [18]. The main purpose of block quantization is to decorrelate input data in order to apply simple scalar quantizers instead of more computationally demanding vector quantization. The process of data decorrelation can be implemented with the aid of a linear transformation (matrix V in Fig. 1a). The linear transformation that allows for perfect decorrelation is the Karhunen-Loève transform (KLT) [19]. The elements of vectors in the KLT domain are decorrelated and, according to the central limit theorem, their values are normally distributed. The lack of correlation and the normal distribution of the random variables guarantee their statistical independence, which, in turn, justifies the application of scalar quantization to individual elements of the data vectors. Despite its optimality, KLT is not preferred in practical applications due to the following disadvantages: (i) the lack of fast algorithms, resulting in a high computational complexity of order O(N²), which is typical for matrix-by-vector multiplication; (ii) dependence on the input data; and (iii) the specific hat-like shape of the first base vector, which causes undesirable visible artifacts at high compression levels.
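The decorrelation property of KLT can be illustrated numerically. The following sketch (an illustration, not part of the method studied in this paper) builds the autocovariance matrix of a first-order Markov process and shows that the eigenvector basis of this matrix, i.e., the KLT, diagonalizes it:

```python
import numpy as np

# Autocovariance of a first-order Markov (AR(1)) process with rho = 0.95
N, rho = 8, 0.95
Rx = rho ** np.abs(np.subtract.outer(np.arange(N), np.arange(N)))

# KLT basis: eigenvectors of the autocovariance matrix (as rows of V),
# ordered by decreasing eigenvalue
eigvals, eigvecs = np.linalg.eigh(Rx)
V = eigvecs[:, ::-1].T

# In the KLT domain the autocovariance becomes diagonal: perfect decorrelation
Ry = V @ Rx @ V.T
off_diag = Ry - np.diag(np.diag(Ry))
print(np.max(np.abs(off_diag)))  # ~0 up to numerical precision
```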
In the light of this, in practical applications the fast approximation of KLT in the form of the discrete cosine transform (DCT) is used most often. DCT has the following important features (see [19]): (i) it is asymptotically convergent to KLT for highly correlated data (i.e., for first-order Markov random processes with correlation ρ → 1), and hence it is independent of the input data; (ii) its first base vector is equivalued and allows the average of image pixels to be calculated, which in the case of two-dimensional lossy compression results in square-like but more pleasing to the eye artifacts; and (iii) it can be calculated with a fast discrete cosine transform algorithm (FCT) of O(N log₂ N) computational complexity (see [19]). The elements of the matrix V of the one-dimensional DCT in the case of N-element input vectors can be defined as follows:

v(k, n) = α(k) · cos((2n + 1)kπ / (2N)),   (1)

for k, n = 0, 1, . . . , N − 1, with α(k) = √(1/N) for k = 0 and α(k) = √(2/N) otherwise. The second stage of block quantization (see Fig. 1a) is scalar quantization based on rounding the values of the vector elements previously divided by the quantization factors. The final stage is entropy coding, which allows the data size to be reduced further by taking into account some statistical redundancy in the data resulting from the normal distribution of the random variables and the possible sequences of zero-valued coefficients obtained after quantization. For this purpose, in popular solutions (e.g., the Joint Photographic Experts Group (JPEG) and Moving Picture Experts Group (MPEG) compression standards), first-order entropy coding in the form of Huffman codes, as well as higher-order entropy coding based on Run-Length Encoding (RLE), is used most often (cf. [20,21]).
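Definition (1) can be implemented directly as a sanity check (illustrative code, not part of the proposed method):

```python
import numpy as np

def dct_matrix(N: int) -> np.ndarray:
    """Matrix V of the N-point DCT-II: v(k, n) = alpha(k) * cos((2n+1)*k*pi / (2N))."""
    k = np.arange(N).reshape(-1, 1)
    n = np.arange(N).reshape(1, -1)
    alpha = np.where(k == 0, np.sqrt(1.0 / N), np.sqrt(2.0 / N))
    return alpha * np.cos(np.pi * (2 * n + 1) * k / (2 * N))

V = dct_matrix(8)
# DCT is orthogonal: V @ V.T equals the identity up to floating-point error,
# and the first base vector is equivalued
print(np.allclose(V @ V.T, np.eye(8)))  # True
```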
The existing image and video compression standards assume separable models of natural images, which means that the two-dimensional decorrelating transforms can also be separable. The separability of the transformation results in a simpler computational scheme. In particular, if we take DCT, then its two-dimensional variant can be calculated with the use of the one-dimensional transform defined by (1) in a simple row-column approach (see [19]). In such a case, it is enough to transform the rows (columns) of an image and then the columns (rows) of the results obtained in the first step. If X is an input image, then its representation Y in the domain of the two-dimensional DCT can be calculated as Y = VXV^T. If we now compare the diagrams in Fig. 1a and b, the only difference lies in the data dimensionality and the preparation of the input data. In the two-dimensional case, an input image is divided into smaller blocks of M × M pixels (usually M = 8), which is a compromise between data decorrelation and computational complexity. In such a case, the obtained reduction in the number of calculations can be estimated as O(log₂ N / log₂ M), where N is the size of a square image. Two-dimensional block quantization finds wide application in popular standards for lossy compression of images and videos, namely JPEG [20][21][22], MJPEG, MPEG-1 and -2 [20,21,23], H.261 [20], H.263 [24], and H.264 [25].
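The row-column approach can be sketched as follows (illustrative code; the 8 × 8 block values are arbitrary):

```python
import numpy as np

def dct_matrix(N):
    k = np.arange(N).reshape(-1, 1)
    n = np.arange(N).reshape(1, -1)
    alpha = np.where(k == 0, np.sqrt(1.0 / N), np.sqrt(2.0 / N))
    return alpha * np.cos(np.pi * (2 * n + 1) * k / (2 * N))

V = dct_matrix(8)
rng = np.random.default_rng(0)
X = rng.integers(0, 256, size=(8, 8)).astype(float)  # one 8x8 image block

# Two-dimensional DCT of the block: Y = V X V^T
Y = V @ X @ V.T
# Identical to transforming the rows first and then the columns
Y_alt = V @ (V @ X.T).T
print(np.allclose(Y, Y_alt))  # True
```

Since V is orthogonal, the block is recovered as X = V^T Y V, mirroring the inverse step of block quantization.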
Since the two-dimensional DCT can be computed using the one-dimensional transform, the optimization process aiming at the reduction of computational complexity can be focused on the one-dimensional case only. Moreover, if we assume image partitioning into blocks of size 8 × 8 pixels, then the optimization applies only to the computational procedure of the 8-point DCT. Direct calculation of the 8-point DCT, i.e., based on its definition as y = Vx, where V is defined as in (1), requires 64 multiplications and 56 additions. The fast and exact algorithm proposed by Loeffler et al. [1] reduces these numbers to 11 multiplications and 29 additions. It was shown by Duhamel and H'Mida that the theoretical lower bound on the number of multiplications of the 8-point DCT equals precisely 11 (cf. [2,3]). However, in the case of block quantization, some of the multiplications required by DCT can be combined with the multiplications needed to implement scalar quantization. The practical number of multiplications required by the 8-point DCT can then be reduced further, down to 5 multiplications, as in the Arai, Agui, and Nakajima variant of fast DCT presented in paper [4]. Although the number of arithmetic operations, in particular multiplications, can be significantly reduced when compared to the definition-based approach, it does not change the fact that the remaining multiplications operate on rational numbers. Hardware realizations of rational-number arithmetic can be prohibitive for low-power and battery-operated systems. One possible solution is offered by multiplierless approximations of the 8-point DCT that require only additions and bit-shift operations. The simplicity of such operations makes it possible to carry out the calculations using integer arithmetic (i.e., natural binary code or two's complement representation). In turn, integer arithmetic translates into significantly simpler hardware implementations and lower power consumption of the synthesized circuits.
The mentioned features of the class of approximate DCTs explain the high interest of the scientific community in developing new approximations that are closer to DCT than the known ones but retain the same small computational complexity. For example, we can indicate approximations with a very low number of additions, i.e., smaller than or equal to 14 (see papers [5,12,16]); approximations that require, besides additions, bit-shift operations, both in moderate numbers (cf. [7,8,13]); and approximations where the number of additions is greater than or equal to 24 (see [6,10,15]).

Popular approximations of 8-point DCT
In this section, we present a brief overview of the popular approximations of the 8-point DCT studied in the remaining part of the paper. The considered approximations are listed and described in chronological order. In Table 1, we provide a short summary of the transforms, including the abbreviated names of the approximations, the names of the authors and references to the papers where the approximations were formulated, the numbers of additions and bit-shift operations, and whether the proposed transform has the orthogonality property.

SDCT2001 approximation
This is historically one of the first known approximations of the 8-point discrete cosine transform. It was proposed in 2001 by T. I. Haweel in his research paper [6]. The apt idea behind the form of the matrix of this approximation can be briefly described as: apply the signum function to the matrix of the 8-point DCT. If the value of a DCT matrix element is positive, then the resulting value is 1. In turn, if this value is negative, then the resulting value is −1. Because the base vectors of the resulting matrix are square waves, this transform is called the square-wave DCT (SDCT). Its direct computation requires 56 additions. However, due to the symmetry of the row vectors of the cosine transform, the matrix of SDCT can be factorized into a product of matrices requiring only 24 additions. It should also be noted that this approximation is not orthogonal. But, as shown in paper [6], its inverse can be calculated with only additions and bit-shift operations. The SDCT is very often chosen as a starting transform for the development of novel, computationally effective approximations of the 8-point DCT.
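The construction can be sketched in a few lines (illustrative code; `dct_matrix` implements definition (1)):

```python
import numpy as np

def dct_matrix(N):
    k = np.arange(N).reshape(-1, 1)
    n = np.arange(N).reshape(1, -1)
    alpha = np.where(k == 0, np.sqrt(1.0 / N), np.sqrt(2.0 / N))
    return alpha * np.cos(np.pi * (2 * n + 1) * k / (2 * N))

# SDCT: the signum of every element of the 8-point DCT matrix
S = np.sign(dct_matrix(8)).astype(int)
print(np.unique(S))                          # only -1 and +1: square-wave rows
print(np.allclose(S @ S.T, 8 * np.eye(8)))   # False: SDCT is not orthogonal
```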

BAS2008I approximation
This is the first of several approximations formulated by Bouguezel et al. (see [7][8][9][10]). This transform was proposed in [7]. It is constructed on the basis of the well-known SDCT approximation by introducing zero- and 1/2-valued elements into its matrix. It can be verified that the resulting matrix is orthogonal, which simplifies the construction of its inverse (as the transposed matrix of the forward transform). Moreover, as shown in paper [7], the matrix of this approximation can be factorized into a product of three sparse matrices requiring, respectively: 8 additions; 6 additions; and 4 additions and 2 bit-shift operations. Hence, it is possible to calculate this transform with a total of 18 additions and 2 bit-shift operations. On the basis of the orthogonality property, it is obvious that the inverse transformation can be calculated with precisely the same number of arithmetic operations.

BAS2008II approximation
In paper [8], the authors of the BAS2008I transform propose another approximation of DCT, taking as a starting point the well-known SDCT approximation. The proposed transform was obtained by appropriately setting to zero some of the entries of the SDCT matrix, which resulted in a novel approximation that can be calculated as a product of four sparse matrices requiring 8, 6, 6, and 1 additions, respectively. This gives a total of 21 additions. Although the transform itself is not orthogonal, its inverse matrix can be determined with the same number of additions and three additional bit-shift operations.

BAS2010 approximation
This approximation is a multiplication-free transform of any size N > 4 (see [10]). For the case of N = 8, it can be effectively calculated with the aid of radix-2-like structures, and it requires 24 additions and 4 bit-shift operations. Moreover, its matrix is orthogonal, which allows the inverse matrix to be found immediately by transposition.

BAS2011 approximation
This transform (proposed in [9]) is the only one in the group of considered well-known approximations that is parametrized (excluding the parametric models used in papers [15,16] in order to find appropriate approximations). It was constructed on the basis of the approximation presented in paper [7] (BAS2008I) by introducing an arbitrary parameter into the matrix of that transformation and by performing some permutations of the rows of the matrix. As a result, a parametric transform U_a, defined by a matrix parametrized with a single parameter a, was obtained. Hence, this transform can be treated as a parametric variant of approximation BAS2008I. It should be noted that the U_a matrix is orthogonal regardless of the value of parameter a. This feature greatly simplifies the calculation of the inverse transformation. Of course, the value of parameter a should be selected in such a way that multiplication operations are not required. In paper [9], the values a ∈ {0, 1/2, 1} were considered. Moreover, a direct calculation of the U_a transform may require 36 additions; however, the U_a matrix can be factorized into a product of three sparse matrices requiring, respectively: 8 additions; 6 additions; and 2 additions for a = 0 or 4 additions for a ≠ 0, plus 2 bit-shift operations when a ≠ 0 and a ≠ 1.

CB2011 approximation
On the basis of the matrix of the 8-point discrete cosine transform, a novel approximation obtained by means of a rounding-off operation was formulated and proposed by Cintra and Bayer in paper [11]. The rounding-off operation, when applied to the DCT matrix scaled by 2, allows one to obtain, in contrast to the signum function, an approximation containing zeros. As shown in paper [11], the matrix of the obtained transform can be further decomposed into a product of three sparse matrices with the following numbers of additions: 8, 12, and 2. This gives a total of 22 additions. It should be noted that the matrix of the considered approximation is not orthogonal, which means that the inverse transformation cannot be simply calculated as a transposition of the matrix of the forward transform.
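The rounding construction can be sketched analogously (illustrative code; `dct_matrix` implements definition (1), and the sign conventions of the published CB2011 matrix may differ):

```python
import numpy as np

def dct_matrix(N):
    k = np.arange(N).reshape(-1, 1)
    n = np.arange(N).reshape(1, -1)
    alpha = np.where(k == 0, np.sqrt(1.0 / N), np.sqrt(2.0 / N))
    return alpha * np.cos(np.pi * (2 * n + 1) * k / (2 * N))

# CB2011 idea: round the DCT matrix scaled by 2; zeros appear where the
# signum-based SDCT would still place +/-1
T = np.round(2 * dct_matrix(8)).astype(int)
print(np.unique(T))                  # entries restricted to {-1, 0, 1}
print(np.count_nonzero(T == 0) > 0)  # True: the approximation contains zeros
```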

BC2012 approximation
In paper [12], Bayer and Cintra propose a novel approximation of the 8-point DCT that requires only 14 additions. It is based on the previous approximation formulated in [11] and is constructed by replacing aptly selected elements of the CB2011 matrix with zeros. The elements of the resulting matrix belong only to {−1, 0, 1}, and hence no bit-shift operations are required. Moreover, the transform matrix can be factorized into a product of three sparse matrices requiring 8, 4, and 2 additions, respectively. It can also be easily verified that the obtained transformation is orthogonal.

PMCBR2012 approximation
This transform, proposed by Potluri et al. in paper [15], was obtained in an optimization process aimed at minimizing the differences between the transfer functions of selected row vectors of the DCT and of the approximation being sought, under the following assumptions: (i) a parametric model of the transform matrix, (ii) the elements of the transform matrix can only take values in {0, ±1, ±2}, and (iii) the resulting transform must be orthogonal. The exhaustive search procedure allowed the authors to find an approximation characterized by low computational complexity that requires only additions and bit-shift operations. It was shown in paper [15] that the transform matrix can be factorized into a product of three sparse matrices requiring, respectively: 8 additions; 12 additions and 4 bit-shifts; and 4 additions and 2 bit-shifts. This gives a total of 24 additions and 6 bit-shifts.

PS2012 approximation
The considered approximation of the 8-point DCT was proposed by Puchala and Stokfiszewski in [13]. It is based on the approach presented previously in [14], which, in order to approximate the 8-point DCT, uses two 4-point DCTs operating on the outputs of four 2-point Haar transforms that take as inputs adjacent pairs of elements of the input vectors. This means that one 4-point DCT operates on the low-frequency band component of the input data (averaged input data), while the second one takes as input the high-frequency band, i.e., the details of the input signal. In the approximate transform from paper [13], the first 4-point DCT is calculated in an approximate way using only additions and bit-shift operations, while the second one is removed. This allowed the authors to obtain a multiplication-free approximation which is orthogonal. The matrix of this transform can be factorized into a product of four matrices which require, respectively: 8 additions; 4 additions; 4 additions; and 2 additions and 2 bit-shifts. Hence, the total number of arithmetical operations equals 18 additions and 2 bit-shift operations.

DR2014 approximation
This is a very low computational complexity approximation of DCT, proposed in paper [5]. Since it requires only 12 additions and no bit-shift operations, it is characterized by the smallest number of arithmetical operations among the DCT approximations considered in this paper. The proposed approximation is not orthogonal, but the authors formulate an inverse matrix characterized by the same computational effectiveness. Moreover, the aspects of hardware implementation of the proposed transform are also discussed in paper [5]. For example, a pipelined implementation of this transform in an FPGA circuit (Xilinx Virtex-7 series) requires 132 LUT tables and 134 logic elements, with a data propagation delay of 3.247 ns. This can be viewed as a very small demand for hardware resources.

PMCBKE2014 approximation
This transform is another example of an approximation obtained in the process of exhaustive search using a parametric model of the transform matrix. It was proposed in paper [16] by Potluri et al. Also in this case, the optimization criterion was to minimize the arithmetic complexity under similar constraints as in [15], i.e., (i) a specific parametric model of the transform matrix was imposed, (ii) the matrix elements can take values from the set {0, ±1, ±2}, and (iii) the transform matrix must be orthogonal. The obtained approximation of the 8-point DCT can be decomposed into a product of three sparse matrices requiring 4, 2, and 8 additions, respectively. This gives a total of 14 additions required to calculate this transformation.

Methods
In this section, we describe the parametric model used to search for approximations of the 8-point DCT and formulate the specific requirements that must be met by the model in order to obtain invertible and orthogonal matrices. Next, taking into account three different scenarios of practical usage, we introduce measures of transform quality, called quality indexes, which include the error components resulting from the approximation of the 8-point DCT and from the quantization of transform coefficients (i.e., energy compaction capability). Then, based on the formulated quality indexes and with the use of the proposed procedure, we search for approximate variants of the 8-point DCT operating on model input signals. The obtained results are listed and arranged in order of non-decreasing computational complexity.

Considered parametric model
In order to make the search for approximations of the 8-point DCT possible, in our research we take advantage of the model formulated previously in paper [16]. The model is constructed on the basis of the matrix of the 8-point DCT (see formula (1)) by replacing its identical elements with symbolic parameters. Since the matrix of the 8-point DCT has seven distinct element values, seven unique parameters, i.e., {a, b, c, d, e, f, g}, are required by the model. The obtained model can thus be described in matrix form as: …, where D is a diagonal scaling matrix required to guarantee the unit length of the row vectors. Its main diagonal contains the elements …, and further on, in the case of the remaining two vectors, we have … . It can easily be verified that it is possible to factorize matrix U into the product of the following four matrices: … (3), where matrix U₃ computes the sums and differences of the elements of the input vector, U₂ is a block-diagonal matrix that contains the parameters of the model, i.e., …, with O₄ being a zero-valued matrix of size 4 × 4, I₄ standing for an identity matrix, and J₄ defined as …, and the remaining matrix P₁ describes the permutation required to restore the proper order of the row vectors of the resulting matrix U; this matrix takes the form … . The 4 × 4 matrices A and B, which are the block-diagonal elements of matrix U₂, hold the symbolic parameters of the model, i.e., … . Matrix A can be factorized further, with the application of a divide-and-conquer strategy, into the product of four sparse matrices of the following form: … (4), where I₂ is an identity matrix, and P₂ and P₃ are permutation matrices defined as … . In the case of the factorization of the 8-point DCT, matrix B is a 4-point cosine transform of type four, and as such it could be factorized.
However, such a factorization (i) would not give a significant reduction in the number of arithmetic operations, and (ii) would not guarantee the assumed form of matrix B for arbitrary values of the free parameters, in particular for integer powers of 2. Hence, the final construction of factorization (3) proposed and adopted in this paper can be described graphically as shown in Fig. 2. An analysis of the computational structure of the proposed factorization (see Fig. 2) allows us to determine the numbers of additions and multiplications required by the matrix product described by formula (3). These numbers depend strictly on the values of the parameters, but in the worst case they can be upper bounded by 28 additions and 22 multiplications. When compared with direct matrix-by-vector multiplication, which requires 56 additions and 64 multiplications, the structure from Fig. 2 undeniably describes an approach with reduced computational complexity. However, it should be emphasized that we are interested in such values of the parameters {a, b, c, d, e, f, g} that do not require multiplications but only much simpler bit-shift operations. Then, if the scaling operations described by the matrix D can be combined with the multiplications required at the quantization step, the resulting structure becomes multiplication free. In such a case, the only operations required to calculate the approximation of the 8-point DCT are additions (upper bounded by 28 operations) and bit-shift operations (with a number less than or equal to 22). In the following part of the paper, we assume the acceptable values of the parameters to be specified by the set {1/8, 1/4, 1/2, 0, 1, 2}. Multiplications by such values can be implemented simply as right bit-shift operations by 3, 2, or 1 bits, or a left bit-shift by 1 bit (except for the values {0, 1}, which do not require any additional actions).
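The correspondence between the admissible parameter values and single bit-shift operations can be sketched in integer arithmetic (illustrative code; note that an arithmetic right shift truncates toward negative infinity, so fixed-point designs usually pre-scale the data):

```python
# Multiplication by the admissible parameter values {1/8, 1/4, 1/2, 0, 1, 2}
# using only bit-shifts on integer (two's complement) data
def shift_mul(x: int, p: float) -> int:
    shifts = {0.125: lambda v: v >> 3,   # 1/8: right shift by 3 bits
              0.25:  lambda v: v >> 2,   # 1/4: right shift by 2 bits
              0.5:   lambda v: v >> 1,   # 1/2: right shift by 1 bit
              0:     lambda v: 0,        # 0 and 1 need no operation at all
              1:     lambda v: v,
              2:     lambda v: v << 1}   # 2: left shift by 1 bit
    return shifts[p](x)

print(shift_mul(96, 0.25), shift_mul(96, 2))  # 24 192
```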
With such assumptions, all operations required by matrix B, i.e., additions and bit-shifts, can be implemented according to the parallel computational scheme presented in Fig. 3. Here, all bit-shift operations can be calculated independently in a parallel manner, while additions can be realized in a mixed parallel-sequential mode based on a reduction (sweep-up) approach. Such a scheme, which requires at most 12 additions and 16 bit-shifts, is characterized by high step efficiency. As we can see, with such a realization of the scheme from Fig. 3, only three sequential steps are required. This can be exceptionally important in the case of Graphics Processing Unit (GPU) based implementations [17,26,27], i.e., massively parallel computations, or implementations accelerated with Single Instruction, Multiple Data (SIMD) extensions [28,29], i.e., vector parallelism. Similarly, in the case of hardware implementations, it translates into relatively short data flow paths.

Invertibility conditions
In order to avoid trivial solutions or singularities within the obtained results, we assume throughout this paper that the matrices U of transforms generated with the considered parametric model are invertible. It is obvious that the invertibility of matrix U depends strictly on the values of the model parameters. Matrix U will be invertible only when all of the components of factorization (3) are also invertible. It can easily be verified that det(U₃) ≠ 0 (to be precise, det(U₃) = 16), and on the basis of the definition of a permutation matrix we also have det(P₁) = 1. The determinants of the remaining two matrices, D and U₂, depend on the values of the model parameters. Matrix U₂ will be invertible only if its component matrices A and B are invertible. Based on factorization (4), it can be seen that matrix A is invertible if (a ≠ 0) and (b ≠ 0 or c ≠ 0). For the remaining matrices in formula (4), we have det(P₂) = det(P₃) = 1, and the determinant of the matrix (built with identity matrices I₂) calculating the sums and differences of data elements equals 4. In the case of matrix B, it is not so straightforward to provide general conditions of invertibility. However, for the considered, i.e., reduced, set of acceptable parameter values, it can be proved, even by exhaustive search, that matrix B will be invertible if at least one of its parameters is not equal to zero, i.e., d ≠ 0 or e ≠ 0 or f ≠ 0 or g ≠ 0. It can then easily be verified that if the above conditions of invertibility of the U₂ matrix are met, we also have det(D) ≠ 0, and as a consequence det(U) ≠ 0.
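The exhaustive invertibility check for matrix B can be sketched as follows. Note that the explicit layout of B used below is our assumption, modeled on the odd part of the standard DCT-II factorization, since the paper's matrix is not reproduced here:

```python
import itertools
import numpy as np

# Assumed layout of B (an assumption, not the paper's exact matrix):
#   B = [[ d,  e,  f,  g],
#        [ e, -g, -d, -f],
#        [ f, -d,  g,  e],
#        [ g, -f,  e, -d]]
def B_of(d, e, f, g):
    return np.array([[d, e, f, g],
                     [e, -g, -d, -f],
                     [f, -d, g, e],
                     [g, -f, e, -d]], dtype=float)

values = [1 / 8, 1 / 4, 1 / 2, 0, 1, 2]  # the admissible parameter values
singular = [c for c in itertools.product(values, repeat=4)
            if abs(np.linalg.det(B_of(*c))) < 1e-6]
# Under this layout, B is singular only for the all-zero parameter choice,
# consistent with the condition stated above
print(singular)  # [(0, 0, 0, 0)]
```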

Orthogonality conditions
By orthogonality of matrix U, we understand the condition UU^T = I, where (·)^T stands for matrix transposition and I is an identity matrix. It should be noted that the role of matrix D in factorization (3) is to normalize the lengths of the base vectors of the resulting transform. Hence, the orthogonality condition will be met if only the following product is diagonal: … . Since D is diagonal and P₁ is a permutation matrix, the formulated demand can be reduced to the simpler form requiring that the product (U₂U₂^T) be diagonal. Now, on the basis of the definition of matrix U₂, we can observe that the reduced demand will be fulfilled if both products (AA^T) and (BB^T) give diagonal matrices. Next, it is easy to verify that the first product (AA^T) is always diagonal, regardless of the values of the parameters {a, b, c}. In the second case, we can formulate a general condition guaranteeing that (BB^T) is diagonal. It takes the following form: … (5). However, with the assumed set {1/8, 1/4, 1/2, 0, 1, 2} of acceptable parameter values, it can be shown that condition (5) will be satisfied if at least one of the parameters {d, e, f, g} is equal to zero (of course, excluding the case when all parameters are zeros). In this way, the necessary conditions for invertibility and orthogonality of matrix U have been formulated. In the further part of the section, we formulate the procedure used in the process of exhaustive browsing of the parameters' space.
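The diagonality test underlying the orthogonality condition can be sketched as follows (illustrative code; the layout of B is again our assumption, modeled on the odd part of the DCT-II factorization, and with the exact cosine values it reproduces the orthogonal DCT case):

```python
import numpy as np

# Assumed layout of B (an assumption, not the paper's exact matrix)
def B_of(d, e, f, g):
    return np.array([[d, e, f, g],
                     [e, -g, -d, -f],
                     [f, -d, g, e],
                     [g, -f, e, -d]], dtype=float)

def is_diagonal(M, tol=1e-9):
    """True if all off-diagonal entries of M are (numerically) zero."""
    return bool(np.all(np.abs(M - np.diag(np.diag(M))) < tol))

# With the exact DCT cosines, B B^T is diagonal (the DCT itself is orthogonal)
c = [np.cos(k * np.pi / 16) for k in (1, 3, 5, 7)]
print(is_diagonal(B_of(*c) @ B_of(*c).T))  # True
# Any candidate parameter choice can be screened with the same test:
print(is_diagonal(B_of(1, 0, 0, 0) @ B_of(1, 0, 0, 0).T))  # True
```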

Procedure for generating the requested approximations
According to the basic assumption of the paper, the requested approximations of the 8-point DCT are generated with the formulated parametric model in the process of exhaustive browsing of the parameters' space. The model takes 7 parameters whose set of acceptable values {1/8, 1/4, 1/2, 0, 1, 2} consists of 6 elements. This gives 6⁷ = 279936 possible combinations of parameter values. Hence, to make the analysis of results simpler, the list of generated approximations is sorted in order of non-decreasing number of additions, and, within the same number of additions, in order of non-decreasing number of bit-shift operations (typically bit-shifts are less computationally demanding than additions). Moreover, the approximations that have a quality-index value higher than or equal to (the smaller the value, the better) that of approximations with lower computational complexity are excluded from the list. In the further part of the paper, the appropriate quality indexes of approximations for each of the considered scenarios of practical usage are formulated (see Section 3.2). The approximation-generating procedure can be described as follows:
…
5. Add the generated approximation U, with all additional data, to the resulting list.
6. Once the list of approximations is generated, sort its elements in order of non-decreasing number of additions, and, within the same number of additions, sort them locally in order of non-decreasing number of bit-shift operations.
7. Exclude from the list the approximations characterized by a quality-index value higher than or equal to that of any element placed closer to the beginning of the sorted list.
8. Finish.
After executing the above procedure, we obtain the requested list of approximations of 8-point DCT.
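Steps 6 and 7 of the procedure can be sketched as follows (illustrative code; the candidate tuples and quality-index values below are hypothetical, as the real quality indexes are scenario-specific and formulated later):

```python
# Each candidate is a (additions, bit-shifts, quality_index, label) tuple;
# the labels and numbers here are hypothetical placeholders
candidates = [(18, 2, 0.40, "A"), (14, 0, 0.55, "B"),
              (22, 4, 0.35, "C"), (18, 2, 0.60, "D")]

# Step 6: sort by non-decreasing additions, then non-decreasing bit-shifts
candidates.sort(key=lambda t: (t[0], t[1]))

# Step 7: keep a candidate only if its quality index (smaller is better)
# strictly improves on everything cheaper that was already kept
kept, best = [], float("inf")
for adds, shifts, q, label in candidates:
    if q < best:
        kept.append(label)
        best = q

print(kept)  # ['B', 'A', 'C']
```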

Considered scenarios of data compression and decompression
In this paper, we consider three scenarios for the compression and decompression of data. The first two scenarios are called asymmetric because two different transforms, i.e., an approximation of 8-point DCT and the 8-point DCT itself, are used at the compression and decompression stages. The last scenario uses the approximate DCT at both stages (the approximate transform U for compression and its inverse U^{-1} or transpose U^T for decompression of data). We also assume that separable Markov fields with high correlation are a good model of natural images, and we choose block quantization as the tool for compression and decompression of images. It is then natural that the two-dimensional transform used by block quantization should also be separable, which means that only a one-dimensional transform is required for its calculation. In the one-dimensional case, the equivalent data model is the first-order Markov process. Hence, in the remaining part of this section, we consider only the one-dimensional case to determine the quality indexes describing the effectiveness of the found approximations.

Asymmetric scenario of type I
The asymmetric scenario of type I assumes the use of an approximation of DCT (described by matrix U) for the compression of data and of DCT (described by matrix V) at the decompression stage (see Fig. 4, where a simplified diagram of block quantization is shown). An example of a practical application of such a scenario might be systems in which the compression step must be implemented in a simplified and computationally effective way, e.g., by battery-operated hardware, while at the same time it is not economically justified to modify the data decompression algorithms built into the receiving devices, which operate using DCT. If the column vector x denotes the samples of the input data (modeled as realizations of a first-order Markov process), then the autocovariance matrix of the signal can be calculated as R_x = E(xx^T), where E(·) denotes the expected value operator. We assume in this paper that R_x is the autocovariance matrix of a first-order Markov process with correlation coefficient ρ = 0.95. In Fig. 4, the column vector η represents the quantization noise, which is added to the signal in the U transform domain as a result of quantization, wherein R_η = E(ηη^T) = σ²_η·I, with I being an identity matrix and σ²_η denoting the variance of the quantization noise. In the case of optimal bit allocation applied to scalar quantizers (see [18,19]) with a given bit budget of N·b bits, where N is the size of the U transform and b describes the average number of bits per one random variable (element of y), the variance of the quantization noise can be calculated as σ²_η = κ·2^(−2b)·π(U). It should be noted that κ is a constant depending on the probability distribution of the elements of vector y (we assume a normal distribution and κ = 5.33), and π(U) describes the geometric mean of the variances of the random variables in the U transform domain. We have then:

π(U) = (∏_{i=0}^{N−1} (R_y)_ii)^(1/N),

where R_y is the autocovariance matrix in the U transform domain, i.e., R_y = E(yy^T) with y = Ux, and by (R_y)_ii, for i = 0, 1, ..., N − 1, we denote the diagonal elements of matrix R_y, i.e., the above-mentioned variances of the random variables in the domain of the U transformation. In addition, it is assumed further on that there is no correlation between the elements of vectors x and η, which results in E(xη^T) = E(ηx^T) = O, where O is the zero matrix. In the case of the asymmetric scenario of type I, the output vector takes the form z = V^T(Ux + η) (cf. Fig. 4). Let tr(·) denote the trace of a matrix. Then the mean squared error (MSE) of data reconstruction is defined as:

MSE = E((x − z)^T (x − z)).

Let W = V^T·U − I; then x − z = −(Wx + V^T·η), and we have:

MSE = tr(E((Wx + V^T η)(Wx + V^T η)^T)).

Based on the assumptions that E(xx^T) = R_x, E(ηη^T) = R_η, and E(xη^T) = E(ηx^T) = O, we can write:

MSE = tr(W R_x W^T) + tr(V^T R_η V),

which, after taking into account the definition of the R_η matrix and the fact that VV^T = I, results in the formula:

MSE = tr(W R_x W^T) + N·σ²_η.    (7)

From formula (7) it can be observed that in this scenario the MSE is a sum of two components. The first one,

ε_A(U) = tr(W R_x W^T),

describes the error resulting from the approximation of DCT by the U transform (we have ε_A(U) = 0 for U = V), and the second component, defined as

Q(U, b) = N·σ²_η = N·κ·2^(−2b)·π(U),

describes the quantization error resulting from the assumed budget of N·b bits and the coding properties of the U transform, expressed by the product π(U). Thus:

MSE = ε_A(U) + Q(U, b).    (8)

The last step is to formulate the quality index that allows quantifying the efficiency of the U transform by taking into account both components of the MSE (cf. formula (8)). However, the value of the second component, i.e., Q(U, b), depends on the average number of bits b assigned to each element of vector y. The proposed solution is to take the quantization error into account in the mean sense, i.e., calculated as the average value over a fixed set of b_i values, where we assume b_i = (i + 1)/2 bits for i = 0, 1, ..., L − 1, with L = 12. Then the proposed index can be formulated as:

χ1(U) = ε_A(U) + (1/L)·∑_{i=0}^{L−1} Q(U, b_i).    (9)

It should be noted that the smaller the value obtained with formula (9), the higher the efficiency of the U transform.
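Assuming the index has the form χ1(U) = ε_A(U) + (1/L)·∑ N·κ·2^(−2b_i)·π(U), with π taken as the geometric mean of the transform-domain variances (a reconstruction from the surrounding definitions, not the paper's own code), it can be evaluated numerically:

```python
import numpy as np

N, RHO, KAPPA, L = 8, 0.95, 5.33, 12
BITS = [(i + 1) / 2 for i in range(L)]            # average bit budgets b_i

def dct_matrix(n=N):
    """Orthonormal DCT-II matrix (the reference transform V)."""
    j, i = np.meshgrid(np.arange(n), np.arange(n))
    V = np.cos(np.pi * (2 * j + 1) * i / (2 * n))
    V[0, :] *= np.sqrt(1.0 / n)
    V[1:, :] *= np.sqrt(2.0 / n)
    return V

def markov_cov(n=N, rho=RHO):
    """(R_x)_ij = rho^|i - j| for a first-order Markov process."""
    idx = np.arange(n)
    return rho ** np.abs(idx[:, None] - idx[None, :])

def pi_of(U):
    """Geometric mean of the variances in the U transform domain."""
    Ry = U @ markov_cov() @ U.T
    return float(np.prod(np.diag(Ry)) ** (1.0 / N))

def chi1(U):
    """Quality index (9) for the asymmetric scenario of type I."""
    V, Rx = dct_matrix(), markov_cov()
    W = V.T @ U - np.eye(N)                       # approximation error term
    eps_a = float(np.trace(W @ Rx @ W.T))
    q = [N * KAPPA * 2 ** (-2 * b) * pi_of(U) for b in BITS]
    return eps_a + sum(q) / L
```

For U = V the approximation component vanishes, `pi_of(V)` reproduces the value π(V) ≈ 0.131 quoted in the tables, and `chi1(V)` falls just below the best generated value χ1 ≈ 0.48, as expected for the exact DCT.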
With the formulated transform quality index, it is possible to analyze the known approximations and to generate approximations based on the adopted parametric model. First, the known approximations of 8-point DCT, described in Section 2, are analyzed. The obtained results are collected in Table 2, where, in addition to the values of the quality index χ1(U), the values of the approximation error ε_A(U) and of the product π(U) are also given. The analysis of the results from Table 2 shows that the efficiency of the approximate transforms increases with increasing computational complexity, which is fully expected. For the worst case, i.e., the DR2014 transform (only 12 additions), we have χ1(U) ≈ 15.15, while the best transform, i.e., PMCBR2012 (24 additions and 6 bit-shifts), can be described by χ1(U) ≈ 0.57. However, this is not a rule, since we can indicate transforms with lower computational complexity that give results better than transforms with higher complexity, e.g., CB2011, which is better in the sense of the χ1(U) index than the SDCT2001 and BAS2010 approximations. Especially noteworthy here is the CB2011 approximation, which can be characterized by high efficiency (χ1(U) ≈ 0.62) in relation to its computational complexity (only 22 additions). It can be described by a good approximation of DCT (ε_A(U) ≈ 0.08) and also by high effectiveness in quantization (π(U) ≈ 0.15).
Second, the procedure described in Section 3 was used. Considering the visually pleasant artifacts generated during image compression by the first base vector with identical values, we assume a constant value of the parameter a = 1. The generated approximations are presented in Table 3 in order of increasing computational complexity, and thus also of increasing effectiveness understood in the sense of the χ1(U) index. It should be noted that the approximations APRXI.1 and APRXI.7 correspond to known transforms, namely BC2012 and CB2011. In the case of the PMCBKE2014, BAS2008III, PS2012, BAS2008II, SDCT2001, BAS2010, and PMCBR2012 approximations, it was possible to find transforms with higher effectiveness that are also characterized by identical or smaller computational complexity. The best result (χ1(U) ≈ 0.48) was obtained with 28 additions and 10 bit-shift operations by APRXI.11 (double-underlined row of the table). This approximation is very close to DCT (see ε_A(U) = 0.010708) and can also be characterized by high performance in block quantization (cf. π(U) = 0.132910 and π(V) = 0.131042). The featured approximation, i.e., the one with a high ratio of effectiveness to computational complexity, is APRXI.8 (underlined row of Table 3). The mentioned transform reaches the value χ1(U) ≈ 0.55 of the quality index with 22 additions and 4 bit-shift operations.

Asymmetric scenario of type II
The block diagram of the asymmetric scenario of type II is presented in Fig. 5. Here, we also have two different transforms at the compression and decompression stages; however, the difference from the scenario of type I lies in the usage of the DCT approximation at the decompression stage. Such a scenario may find practical application in the case of image preview devices with low computing power and designed for low energy consumption. This in turn implies the necessity of constructing simple algorithmic solutions. By noting that z = U^T(Vx + η), and using an analogous mathematical derivation, we get the formula for the MSE in the considered case, which takes the following form:

MSE = tr(W R_x W^T) + (tr(U U^T)/N)·Q(V, b), where W = U^T V − I.    (10)

On the basis of Eq. (10), it is possible to formulate the quality index χ2(U) for the considered scenario. Then we have:

χ2(U) = ε_A(U) + (tr(U U^T)/N)·(1/L)·∑_{i=0}^{L−1} Q(V, b_i),    (11)

with b_i = (i + 1)/2 for i = 0, 1, ..., L − 1 and L = 12. The measurements of the effectiveness of the known DCT approximations are presented in Table 4. The obtained results are similar to those previously presented (see Table 2), with such differences that the BC2012 transform is better than BAS2008III (a = 0.0, a = 0.5, a = 1.0), BAS2008II outperformed SDCT2001 and BAS2010, similarly SDCT2001 gave better results than BAS2010, and CB2011 outperformed SDCT2001. The mentioned transforms can be characterized by better approximation properties (i.e., a smaller ε_A(U) component) but worse performance in the sense of block quantization (i.e., higher values of the π(U) product). However, in this scenario, the coding transformation is DCT, and therefore the values of the π(U) product do not affect the MSE. The highest and the lowest efficiency was obtained by the PMCBR2012 (i.e., χ2(U) ≈ 0.51) and DR2014 (i.e., χ2(U) ≈ 13.58) transformations, respectively.
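Assuming the index has the form χ2(U) = ε_A(U) + (tr(U U^T)/N)·(1/L)·∑ N·κ·2^(−2b_i)·π(V), with π again taken as the geometric mean of the transform-domain variances (a reconstruction from the surrounding definitions), a numerical sketch looks as follows:

```python
import numpy as np

N, RHO, KAPPA, L = 8, 0.95, 5.33, 12
BITS = [(i + 1) / 2 for i in range(L)]

def dct_matrix(n=N):
    """Orthonormal DCT-II matrix (the reference transform V)."""
    j, i = np.meshgrid(np.arange(n), np.arange(n))
    V = np.cos(np.pi * (2 * j + 1) * i / (2 * n))
    V[0, :] *= np.sqrt(1.0 / n)
    V[1:, :] *= np.sqrt(2.0 / n)
    return V

def markov_cov(n=N, rho=RHO):
    """(R_x)_ij = rho^|i - j| for a first-order Markov process."""
    idx = np.arange(n)
    return rho ** np.abs(idx[:, None] - idx[None, :])

def chi2(U):
    """Quality index (11): DCT (V) codes and U^T decodes, so the
    quantization term depends on pi(V) and on the factor tr(U U^T)/N."""
    V, Rx = dct_matrix(), markov_cov()
    W = U.T @ V - np.eye(N)                       # approximation error term
    eps_a = float(np.trace(W @ Rx @ W.T))
    pi_v = float(np.prod(np.diag(V @ Rx @ V.T)) ** (1.0 / N))
    scale = float(np.trace(U @ U.T)) / N
    q = [N * KAPPA * 2 ** (-2 * b) * pi_v for b in BITS]
    return eps_a + scale * sum(q) / L
```

Because the coding transform is the exact DCT here, `chi2` of any candidate U is bounded from below by the value obtained for U = V, which is consistent with the best generated result χ2 ≈ 0.47 lying just above that bound.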
The results in the effectiveness of the approximations generated with the procedure proposed in Section 3 are collected in Table 5. Their analysis shows that for the computational complexities (adds/shifts) of (18/0) and (24/6), it was possible to obtain better results than with the known approximations of the same computational complexities. For the complexities (14/0) and (22/0), the procedure generated the known approximations in the form of the BC2012 and CB2011 transforms, respectively. The best result (i.e., χ2(U) ≈ 0.47) was obtained with 28 additions and 10 bit-shift operations (see APRXII.9 in the double-underlined row of Table 5). The featured transformation, i.e., the one characterized by good performance in the sense of the χ2(U) index at moderate computational complexity, is APRXII.6, described by the sixth row of Table 5 (underlined row). It reaches χ2(U) ≈ 0.52 with only 22 additions and 4 bit-shifts, which is better than almost all of the considered known approximations, including those with higher computational complexities (the exception is the PMCBR2012 transform, but here the results are comparable).

Symmetric scenario
The last of the considered scenarios concerns a symmetrical case, i.e., the one where the same transformation is used at the compression and decompression stages. The simplified scheme of block quantization for this scenario is shown in Fig. 6.
It should be noted that at the decompression stage we have the Ū transform, which is assumed for derivation purposes. The Ū matrix, depending on the properties of the U transform, can take one of the following values: Ū = U^T for orthogonal transforms, or Ū = U^{-1} when the inverse transformation can be calculated in a computationally effective way (i.e., only with additions and bit-shift operations). In this case, the approximation of DCT is understood as the pursuit of a standard of both high efficiency and a fast computational structure, which DCT undoubtedly is. Further on, it is easy to verify that in this scenario we have z = Ū(Ux + η). Then, with mathematical derivations analogous to those in both previously discussed scenarios, we arrive at the following formula for the MSE:

MSE = tr(W R_x W^T) + (tr(Ū Ū^T)/N)·Q(U, b), where W = Ū U − I,    (12)

in which the first component describes the approximation error and the second one the quantization error. On the basis of formula (12), we may construct the quality index of the form:

χ3(U, Ū) = ε_A(U, Ū) + (tr(Ū Ū^T)/N)·(1/L)·∑_{i=0}^{L−1} Q(U, b_i),    (13)

with b_i = (i + 1)/2 for i = 0, 1, ..., L − 1 and L = 12. The benchmark results of the known approximations for the considered scenario are collected in Table 6. In this experiment, we assume as the Ū matrix the transpose of U, i.e., Ū = U^T, for orthogonal transforms, and in the case of the non-orthogonal approximations, i.e., DR2014, BAS2008II, and SDCT2001, we explore two possible variants, i.e., Ū = U^T and Ū = U^{-1} (the second variant is possible because, for all non-orthogonal approximations, computationally effective inverse transformations have been formulated in the literature). An analysis of the χ3(U, Ū) values from Table 6 shows that the best result, i.e., the smallest index value, was obtained with the PMCBR2012 approximation, while the worst result belongs to the least computationally demanding DR2014 transform. Moreover, in the case of the non-orthogonal approximations, it is more advantageous to use the transposed transform matrix, i.e., Ū = U^T, at the decompression stage instead of its inverse, i.e., Ū = U^{-1}, for the DR2014 and BAS2008II approximations.
In the case of SDCT2001, we can draw the contrary conclusion. Such twofold behavior of the mentioned non-orthogonal approximations of DCT is a direct consequence of the balance in the proportions between the values taken by the approximation and quantization error components (see Table 6). The results obtained in generating the DCT approximations with the use of the parametric model and the proposed procedure are presented in Table 7. In this experiment, at the decompression stage we used the transposed matrix of the U transformation, i.e., Ū = U^T, which corresponds to the heuristic U^T ≈ U^{-1}. The reason for this is that, in the considered parametric model, it is not possible in the general case to calculate the matrix inverse to B using only additions and bit-shift operations. An analysis of the obtained results reveals that: (i) for the computational complexities (adds/bit-shifts) of (14/0), (22/0), and (24/0) it was possible to find better or equivalent approximations in the sense of the χ3(U, Ū) index, and starting from the computational complexity of (24/2) the generated approximations allowed obtaining results even better than the best known approximation, PMCBR2012, while in the remaining cases the found transforms are characterized by worse effectiveness than the known approximations with comparable computational complexities; (ii) the best found approximation, APRXIII.12, requires 28 additions and 10 bit-shifts and is not orthogonal, but its approximation error component takes a relatively small value; (iii) the featured approximation, APRXIII.9, is described by the ninth row of Table 7; it requires 24 additions and 2 bit-shifts and allows obtaining results better than all known approximations (see the underlined row); and (iv) since during the experiment it was assumed that Ū = U^T, it is clear that for all generated approximations the term tr(U^T U)/N equals one.
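Assuming the index has the form χ3(U, Ū) = ε_A(U, Ū) + (tr(Ū Ū^T)/N)·(1/L)·∑ N·κ·2^(−2b_i)·π(U) (a reconstruction from the surrounding definitions), the two decompression choices Ū = U^T and Ū = U^{-1} can be compared numerically; for an orthogonal transform they coincide:

```python
import numpy as np

N, RHO, KAPPA, L = 8, 0.95, 5.33, 12
BITS = [(i + 1) / 2 for i in range(L)]

def dct_matrix(n=N):
    """Orthonormal DCT-II matrix."""
    j, i = np.meshgrid(np.arange(n), np.arange(n))
    V = np.cos(np.pi * (2 * j + 1) * i / (2 * n))
    V[0, :] *= np.sqrt(1.0 / n)
    V[1:, :] *= np.sqrt(2.0 / n)
    return V

def markov_cov(n=N, rho=RHO):
    """(R_x)_ij = rho^|i - j| for a first-order Markov process."""
    idx = np.arange(n)
    return rho ** np.abs(idx[:, None] - idx[None, :])

def chi3(U, U_bar):
    """Quality index (13): U codes and U_bar (U^T or U^{-1}) decodes;
    quantization acts in the U domain, hence pi(U) appears."""
    Rx = markov_cov()
    W = U_bar @ U - np.eye(N)
    eps_a = float(np.trace(W @ Rx @ W.T))
    pi_u = float(np.prod(np.diag(U @ Rx @ U.T)) ** (1.0 / N))
    scale = float(np.trace(U_bar @ U_bar.T)) / N
    q = [N * KAPPA * 2 ** (-2 * b) * pi_u for b in BITS]
    return eps_a + scale * sum(q) / L
```

For a non-orthogonal U, evaluating `chi3(U, U.T)` against `chi3(U, np.linalg.inv(U))` reproduces the kind of comparison carried out in Table 6 for DR2014, BAS2008II, and SDCT2001.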

Results and discussion
The analysis of the results of the first experiment, concerning the asymmetric scenario of type I, shows that the best results were obtained with the generated approximation APRXI.11, which fully corresponds to the results of the previous experiments (see Table 3). For example, the advantage of that approximation over the best among the others for the entropy value of 1.5 bits per pixel was at the level of 2.75 dB for the "Lena.bmp" image, 0.6 dB for the "Mandrill.bmp" image, and at the level of 2 dB on average. Equally important is the fact that the PSNR value for this approximation constantly increases with the increase of entropy, which indicates a good approximation of the DCT transform. The performance of the best known approximation, PMCBR2012, and the featured APRXI.8 is comparable; however, PMCBR2012 always gives better results: e.g., for "Lena.bmp" its advantage is highest at 0.4 to 0.9 bpp, at the level of 0.3 dB; for "Mandrill.bmp" it constantly grows, reaching the highest observed value of 1.3 dB at 4 bpp; and on average it is better by about 0.25 dB starting from 0.8 bpp. It should be noted, however, that the featured approximation requires two fewer additions and two fewer bit-shift operations. The second best of the known approximations, i.e., the CB2011 transform, gave results worse than PMCBR2012 and the generated approximations (for example, in the average sense, the smallest difference is about 2 dB at 2 bpp and grows with the increase of the entropy). However, this transform is the least computationally demanding among the transforms considered in this experiment. It should also be noted that the generated approximations give much better results for images with high correlation (cf. the results obtained with the "Lena.bmp" and "Mandrill.bmp" images). The possible reason for this is the assumption of a high correlation coefficient ρ = 0.95 in the transform-generating procedure (see Section 3). The subjective analysis of the performance of the DCT approximations based on the results from Figs. 16, 17, and 18 allows drawing conclusions corresponding to the results presented in the form of the PSNR plots. It can be seen that the least annoying artifacts appear with the APRXI.11 approximation, while in the case of the other approximations the artifacts in the form of vertical and horizontal lines are especially visible in places of sharp edges in the image (see the hat's brim or the rim of the eye). Subjectively, these artifacts are most visible for the PMCBR2012 and CB2011 approximations.
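The PSNR figures discussed throughout this section follow the standard definition for 8-bit images; for reference, a minimal sketch:

```python
import numpy as np

def psnr(original, reconstructed, peak=255.0):
    """Peak signal-to-noise ratio in dB between two 8-bit images."""
    diff = original.astype(np.float64) - reconstructed.astype(np.float64)
    mse = np.mean(diff ** 2)
    return float('inf') if mse == 0.0 else 10.0 * np.log10(peak ** 2 / mse)
```

A uniform error of one gray level over the whole image corresponds to about 48.13 dB, which gives a feel for the scale of the 0.3-2.75 dB differences quoted above.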
In the second experiment, regarding the asymmetric scenario of type II, the obtained results are analogous to those from the first experiment (see Figs. 10, 11, and 12). The best results were obtained with the generated approximation APRXII.9, and among the known approximations the PMCBR2012 transform can be characterized by the highest effectiveness. The featured approximation APRXII.6, with 22 additions and 4 bit-shift operations, allowed obtaining results comparable to the known PMCBR2012 transform (even better for the "Lena.bmp" image and very close in the averaged sense) but with smaller computational complexity. The second best among the known approximations, i.e., CB2011, gave average results worse by up to about 2 dB (starting from 0.8 bpp entropy). The characteristics and strength of the visible artifacts in images in the case of the subjective performance evaluation were analogous to those observed in the previous scenario. Hence, we present such results only for the case of 1.5 bpp entropy (see Fig. 19).
In the case of the symmetric scenario, it can be seen that the differences between the results obtained with the various approximations of DCT are very small. In addition, based on the averaged results (see Fig. 15), we can conclude that the choice of the best approximation depends strictly on the value of entropy. For entropies smaller than around 1.8 bpp, APRXIII.12 guarantees the highest effectiveness (e.g., higher by 0.5 dB at 0.5 bpp). But starting from 1.8 bpp its efficiency drops, and the best results can be obtained with the known BAS2010 approximation. It can thus be stated that APRXIII.12 is the best at high compression ratios. The high efficiency of this approximation observed in the first interval of entropy variability is a consequence of the small value of the π(U) component, which translates into small values of the quantization error. But the considered approximation is not orthogonal, its approximation error component limits the achievable quality at higher entropies, and APRXIII.12 is the most computationally complex among the considered approximations. Finally, it should be noted that the featured approximation APRXIII.9 allows obtaining results close to those acquired with the best among the known approximations, i.e., BAS2010, and better than, but very close to, those characteristic of the PMCBR2012 transform, and all of that with smaller computational demands. An analysis of the visible artifacts (see exemplary results in Fig. 20) reveals comparable characteristics of the considered approximations.

Hardware realizations
Bearing in mind the fact that hardware implementations of DCT and its approximations in FPGAs or Application-Specific Integrated Circuits (ASICs) are significant areas of practical realization, in this section we consider the hardware designs of selected generated approximations, i.e., APRXI.1, APRXI.7, APRXI.8 (APRXII.6), APRXI.10, APRXI.11, and also APRXII.9 (APRXIII.12) and APRXIII.9. For these transforms, the hardware requirements for FPGA implementations are specified, including the required number of logical elements and the predicted energy consumption (see Table 8). In addition, for the APRXIII.9 transform, a detailed description of its design for an FPGA circuit is provided. The obtained results in hardware utilization by the proposed approximations are compared to the equivalent results obtained for the well-known approximations, i.e., PMCBKE2014, BC2012, CB2011, BAS2010, and PMCBR2012. The aim of this part of the paper is to verify the feasibility of hardware implementations of the considered approximations. Such transforms, as we know, use only additions and bit-shift operations. This feature should translate directly into low energy and resource consumption. The APRXIII.9 approximation is the featured transform from the experiment concerning the symmetric scenario. This transformation can be described by the parameters {1, 1, 1/2, 0, 1, 1, 1}, and it requires 24 additions and 2 bit-shift operations. Since a two-dimensional transformation can be realized with one-dimensional transforms, i.e., one pass for rows and one for columns, here we consider only the hardware implementation of the one-dimensional 8-point transformation, but in two variants, regarding rows and columns respectively. Further on, we assume that an input image is represented as a two-dimensional array with 8-bit integer values describing pixel colors (i.e., luminance or one of the chrominance components, interchangeably).
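The row-column realization mentioned above can be illustrated in a few lines (here `U` stands for any 8×8 one-dimensional transform matrix; the function name is illustrative):

```python
import numpy as np

def transform_2d(U, X):
    """Separable 2-D transform via the row-column scheme: a 1-D transform
    of every row of X, followed by a 1-D transform of every column."""
    rows = X @ U.T        # row pass: each row of X multiplied by U
    return U @ rows       # column pass: each column of the result by U
```

By associativity, this equals the direct form U·X·U^T, which is why the hardware needs only the two one-dimensional variants (row and column) described in the text.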
On this basis, we can assume that pixel data is stored in the 8-bit two's complement number format, i.e., as integers from the interval [−128, 127]. It should be noted, however, that in the two's complement number format, the addition or subtraction of two n-bit arguments in general requires n + 1 bits to represent the result. This is the reason why the width of the data representation must increase at the successive stages of the data flow diagram of the transform implementation structure (cf. Fig. 21). For example, in the considered case, for the row variant the input data is 8 bits wide, while the output results require 12 bits. In the column variant, the width of the data changes from 12 to 16 bits. This is the only difference between the row and column designs of the logic circuits, which of course should result in proportionally higher demands on hardware resources for the column transformation.
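The sign-extension and width-growth rules used by the design can be sketched in Python (the helper names are hypothetical; in the FPGA these operations are simple wiring):

```python
def sign_extend(value, from_bits, to_bits):
    """Re-encode a two's-complement bit pattern in a wider format by
    duplicating its most significant bit (MSB)."""
    v = value & ((1 << from_bits) - 1)
    if v & (1 << (from_bits - 1)):        # MSB set -> negative number
        v -= 1 << from_bits
    return v & ((1 << to_bits) - 1)

def to_int(v, bits):
    """Decode a two's-complement bit pattern back to a Python int."""
    return v - (1 << bits) if v & (1 << (bits - 1)) else v
```

For example, the 8-bit pattern 0xFF (the value −1) sign-extends to the 9-bit pattern 0x1FF, still −1; and the extreme 8-bit sums −128 + (−128) = −256 and 127 + 127 = 254 both fit in 9 bits, which motivates the one-bit growth per adder stage.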
The data flow diagram of the hardware implementation of the APRXIII.9 transform is presented in Fig. 21. It will be analyzed in terms of the transform operating on 8-bit image data (i.e., the first pass of the row-column order). First, the input data in the form of the vector [x(0), x(1), ..., x(7)] with eight elements goes through the first stage, composed of eight adders, A00 to A07, and eight registers built with D flip-flops, i.e., D00 to D07. The first stage is the implementation of the U_3 matrix from the considered parametric model. In the case of the row transformation, the input data is represented in the 8-bit two's complement format. Before the operations of the first stage are performed, the data must be extended to 9 bits by simple duplication of the most significant bit (MSB). As a result, the adders A00 to A07 and the registers D00 to D07 are 9 bits wide. All Dnm registers in this implementation are used to enable calculations in pipeline mode. The second stage is composed of eight adders and twelve registers. The input data, which is now 9 bits wide, must be extended to a 10-bit representation by duplicating the MSBs. The registers D18, D19, D1A, and D1B are needed to delay by one clock cycle the outputs of D04 to D07, which are required at the third stage. Further on, at the third stage, we have eight adders A20 to A27, two bit-shift registers S0 and S1, and eight data-holding registers D20 to D27. For the sake of the efficiency of the hardware implementation (i.e., the reduced data width at the first two stages), the mentioned bit-shift registers S0 and S1 shift the data by one bit to the left, which corresponds to the following set of parameters: {1, 2, 1, 0, 1, 1, 1}. From the point of view of the parametric model, this is precisely the same transformation as the one obtained with the parametrization {1, 1, 1/2, 0, 1, 1, 1}.
Since, both left bit-shift and addition operations increase the data size each by one bit, then the input data for this stage must be transformed in an analogous way (by duplicating their MSBs) from 10-bit to 12 bit representation. At the same time this is the width of the output data, i.e., [ y(0), y(1), . . . , y (7)]. It should be noted that bitshift operations by a constant number of bits can be simply hardwired and thus do not constitute an additional resource burden.