
# Implementation of fast HEVC encoder based on SIMD and data-level parallelism

Yong-Jo Ahn^{1}, Tae-Jin Hwang^{1}, Dong-Gyu Sim^{1} and Woo-Jin Han^{2}

**2014**:16

https://doi.org/10.1186/1687-5281-2014-16

© Ahn et al.; licensee Springer. 2014

**Received: **18 June 2013

**Accepted: **5 March 2014

**Published: **26 March 2014

## Abstract

This paper presents several optimization algorithms for a High Efficiency Video Coding (HEVC) encoder based on single instruction multiple data (SIMD) operations and data-level parallelism. Based on the analysis of the computational complexity of HEVC encoder, we found that interpolation filter, cost function, and transform take around 68% of the total computation, on average. In this paper, several software optimization techniques, including frame-level interpolation filter and SIMD implementation for those computationally intensive parts, are presented for a fast HEVC encoder. In addition, we propose a slice-level parallelization and its load-balancing algorithm on multi-core platforms from the estimated computational load of each slice during the encoding process. The encoding speed of the proposed parallelized HEVC encoder is accelerated by approximately ten times compared to the HEVC reference model (HM) software, with minimal loss of coding efficiency.


## 1 Introduction

Along with the development of multimedia and hardware technologies, the demand for high-resolution video services with better quality has been increasing. The demand for ultrahigh definition (UHD) video services is now emerging, with resolutions higher than that of full high definition (FHD) by a factor of 4 or more. In response to these market demands, the ISO/IEC Moving Picture Experts Group (MPEG) and the ITU-T Video Coding Experts Group (VCEG) organized the Joint Collaborative Team on Video Coding (JCT-VC) and standardized High Efficiency Video Coding (HEVC), which targets twice the coding efficiency of H.264/AVC [1]. In the near future, HEVC is expected to be employed for many video applications, such as video broadcasting and video communications.

Historically, MPEG-x and H.26x video compression standards employ the macro-block (MB) as one basic processing unit [2], and its size is 16 × 16. However, HEVC supports larger sizes of the basic processing unit, called coding tree unit (CTU), from 8 × 8 to 64 × 64. A CTU is split into multiple coding units (CU), in a quad-tree fashion [3]. Along with the CU, the prediction unit (PU) and transform unit (TU) are defined, and their sizes and shapes are more diverse than the prior standard technologies [4, 5]. On top of them, many advanced coding tools that improve prediction, transform, and loop filtering are employed to double the compression performance compared with H.264/AVC. However, the computation requirement of HEVC is known to be significantly higher than that of H.264/AVC because HEVC has more prediction modes, larger block size, longer interpolation filter, and so forth.

Typically, a huge number of rate-distortion (RD) cost computations are required on the encoder side to find the best mode over block sizes from 64 × 64 down to 8 × 8. In terms of applications, HEVC is expected to be employed for ultrahigh-resolution video services. In such cases, fast video coders are required to process more data with a given processing power, so parallelization techniques are crucial, particularly with multiple low-power processors or platforms. A single instruction multiple data (SIMD) implementation of the most time-consuming modules of the HM 6.2 encoder was proposed in [6]. That work implemented the cost functions, transformation, and interpolation filter with SIMD and reported average time savings of approximately 50% to 80%, depending on the module. Wavefront parallel processing (WPP) for HEVC encoders and decoders was introduced in [7]; for the decoder, a parallel speed-up by a factor of 3 was achieved. The acceleration factor of wavefront parallelism generally saturates at 2 or 3 due to data communication overhead and the prolog and epilog parts. We are not aware of any work that combines all of these parallel algorithms in a well-harmonized way for fast HEVC encoders. In this paper, we focus on load-balanced slice parallelization together with an optimized implementation of HEVC. This paper presents several optimization techniques using SIMD operations for the RD cost computations and transforms for variable block sizes. In addition, motion estimation is efficiently implemented with frame-based processing to reduce redundant filtering. For data-level parallelization, this paper demonstrates how to allocate encoding jobs to all the available cores through the use of complexity estimation. As a result, it is possible to achieve load-balanced slice parallelism in HEVC encoders and significantly reduce the average encoding time. With all the proposed techniques, the optimized HEVC encoder achieves a 90.1% average time saving within a 3.0% Bjontegaard distortion (BD) rate increase compared to the HM 9.0 reference software.

The paper is organized as follows. Section 2 presents a complexity analysis of HEVC encoder, and Section 3 introduces basic data-level parallelisms for video encoders. In Section 4, the SIMD optimization for cost functions and transform, as well as frame-level implementation of interpolation filter, is explained in detail. A slice-level parallelization technique with a load-balancing property is proposed in Section 5. Section 6 shows the performance and numerical analysis of the proposed techniques. Finally, Section 7 concludes the work, with further research topics.

## 2 HEVC and its complexity analysis

In this section, the complexity of the HEVC encoder is investigated, and critical modules are identified based on the complexity analysis. In this work, the HM 9.0 reference software [13] was used for the HEVC encoder analysis; it also served as the base software for our optimization. An HEVC encoder can be broadly modularized into five parts: entropy coding, intra prediction, inter prediction, transform quantization, and loop filter. A cycle analyzer, Intel® VTune™ Amplifier XE 2013 [14], on an Intel® Core™ i7-3960K processor was employed to measure the number of cycles for each module under the random access (RA) and low-delay (LD) test configurations of the common test conditions [15]. Note that class B (1,920 × 1,080) and class C (832 × 480) sequences were used.

**Percentages of computational cycles of HM 9.0 encoder**

Module | RA (%) | LD (%) | Average (%) |
---|---|---|---|
Entropy coding | 2.98 | 2.40 | 2.69 |
Intra prediction | 2.25 | 1.95 | 2.10 |
Inter prediction | 79.03 | 82.23 | 80.63 |
Transform quantization | 14.48 | 12.50 | 13.49 |
In-loop filter (de-blocking filter) | 0.08 | 0.08 | 0.08 |
In-loop filter (sample adaptive offset) | 0.10 | 0.10 | 0.10 |
Others | 1.28 | 0.93 | 1.10 |

**Percentages of computational cycles, depending on CU sizes and modes**

Size | Mode | RA (%) | LD (%) | Average (%) | Ratio in each CU size (%) |
---|---|---|---|---|---|
64 × 64 | Intra | 2.1 | 1.0 | 1.6 | 5.6 |
64 × 64 | Inter | 19.0 | 31.9 | 25.5 | 82.3 |
64 × 64 | Skip | 3.9 | 3.4 | 3.7 | 12.1 |
32 × 32 | Intra | 1.9 | 0.7 | 1.3 | 4.5 |
32 × 32 | Inter | 25.0 | 27.4 | 26.2 | 83.4 |
32 × 32 | Skip | 4.5 | 3.2 | 3.9 | 12.2 |
16 × 16 | Intra | 2.3 | 0.2 | 1.3 | 4.4 |
16 × 16 | Inter | 17.0 | 12.5 | 14.8 | 82.9 |
16 × 16 | Skip | 3.2 | 1.7 | 2.5 | 12.7 |
8 × 8 | Intra | 2.4 | 0.4 | 1.4 | 13.5 |
8 × 8 | Inter | 8.7 | 4.9 | 6.8 | 73.7 |
8 × 8 | Skip | 1.7 | 0.6 | 1.2 | 12.8 |

**Percentages of computational cycles of top four functions**

Module | RA (%) | LD (%) | Average (%) |
---|---|---|---|
Interpolation filter | 35.74 | 36.00 | 35.87 |
SATD | 12.99 | 18.57 | 15.78 |
SAD | 15.33 | 13.26 | 14.30 |
Transform/inverse transform | 3.52 | 3.08 | 3.30 |
Total (%) | 67.58 | 70.91 | 69.25 |

## 3 Data-level parallelization of video encoders

Data-level and function-level parallelization approaches are widely used for high-speed video codecs. In particular, function-level parallel processing is frequently used for hard-wired implementations. However, function-level parallel processing is not easy to implement, mainly because of the difficulty of load balancing and the longer development period. Data-level parallel processing is relatively easy to employ for video encoders because the processing flow is the same for all the data. Data-level parallelism for HEVC can be applied at the CU, slice, and frame levels. In addition, HEVC contains a parallel tool, called tile, which divides a picture into multiple rectangles [16]. In tile partitioning, the number of CTUs adjacent to partition boundaries is smaller than with slices. As a result, tile partitioning can yield a slightly lower loss in compression efficiency than an implementation with the same number of slices [17].

For parallel implementations, we need to consider several factors, such as throughput and core scalability, as well as coding efficiency. Note that the core scalability means how much we need to change an implementation, depending on an increasing or decreasing number of cores. In addition, the throughput can be improved with parallel processing as compared with the single processing unit. However, many video coding algorithms, in general, have dependencies among neighboring coding units, neighboring frame, earlier-coded syntaxes, and so on. At the same time, we need to consider the coding efficiency degradation from the parallelization. Even though the throughput can be improved with parallel processing, it is not desirable that the coding efficiency is significantly degraded. Regarding the core scalability, it is better to employ a scalable parallelization method that can be easily extended for an increasing number of cores. If not, we are required to change the implementation, depending on the number of cores.

The 2D wavefront algorithm [18] has been used for the parallelization of video coding in CTU level. This coding tool does not impact the coding gain, but there is a limitation in the parallelization factor, even with many cores, due to coding dependence. Frame-level parallelization can be also used for non-reference bidirectional pictures; however, it depends on the encoding of reference structures.
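The limit on the wavefront parallelization factor can be illustrated with an idealized schedule. The sketch below is not taken from the paper or from HM: it assumes that every CTU takes one time step, that the number of cores is unlimited, and that a CTU may start once its left neighbor and the above-right CTU of the previous row are finished, so that CTU (r, c) starts at step c + 2r. The function names are hypothetical.

```python
def wavefront_active_ctus(rows, cols):
    """CTUs processed in parallel at each step of an idealized wavefront.

    Assumes rows >= 1, one time step per CTU, and unlimited cores.
    """
    steps = cols + 2 * (rows - 1)
    active = []
    for t in range(steps):
        # CTU (r, c) runs at step t when c + 2*r == t and 0 <= c < cols.
        active.append(sum(1 for r in range(rows) if 0 <= t - 2 * r < cols))
    return active


def average_parallelism(rows, cols):
    """Average number of busy cores over the whole picture."""
    active = wavefront_active_ctus(rows, cols)
    return sum(active) / len(active)
```

For a 1,920 × 1,080 picture with 64 × 64 CTUs (17 rows of 30 CTUs), this model needs 62 steps for 510 CTUs, an average of about 8.2 busy cores regardless of how many are available. The ramp-up and ramp-down steps correspond to the prolog and epilog, and the model ignores communication overhead, so practical speed-ups are lower.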

As mentioned before, slice-level parallelism has relatively high coding losses of around 2% to 4% compared to tile-level parallelism and wavefront processing [19]. However, slice-level parallelism has the advantage that slice partitioning is more flexible and accurate than tile partitioning, because the number of CTUs in each slice can be adjusted. Note that the tiles within the same row and column must use the same tile height and width, respectively. Slice-level parallelism with fine-grained load balancing can yield additional encoding speed-up compared to the tile level. WPP has the advantage that the coding loss from parallelization is relatively small compared to other parallelization methods. However, the acceleration factor of WPP is not as high as that of slice- or tile-level parallelism because WPP has a prolog and an epilog during which some of the cores are idle. It is not easy to keep all the cores busy with WPP on average. In our work, slice-level parallelism was chosen for acceleration. In addition, slice partitioning is widely used for packetizing bitstreams for error resilience in practical video encoders and services.

There are two main criteria for dividing a picture into multiple slices. One is an equal bitrate, and the other is the same number of CTUs for all the slices. The first one cannot be easily employed for parallel encoding because the target bits cannot be defined prior to actual encoding. For the second method, we can simply assign the same number of CTUs to each slice.
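The second criterion, an equal number of CTUs per slice, can be sketched as follows. This is a minimal illustration rather than the HM implementation; the function name and the choice to give the remainder CTUs to the first slices are assumptions.

```python
def uniform_slice_sizes(width, height, ctu_size, n_slices):
    """Split the CTUs of a picture into n_slices slices of near-equal size."""
    cols = -(-width // ctu_size)   # ceiling division: partial CTUs count
    rows = -(-height // ctu_size)
    total = cols * rows
    base, extra = divmod(total, n_slices)
    # The first `extra` slices receive one additional CTU (assumed policy).
    return [base + (1 if k < extra else 0) for k in range(n_slices)]
```

For example, a 1,920 × 1,080 picture with 64 × 64 CTUs has 30 × 17 = 510 CTUs, so four slices receive 128, 128, 127, and 127 CTUs. An 832 × 480 picture has 13 × 8 = 104 CTUs, giving 26 CTUs per slice.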

## 4 Optimization for fast HEVC encoder

In this section, two software optimization methods, frame-level processing and SIMD implementation, are presented for the three most complex functions. The proposed software optimization methods accelerate HEVC encoders without any bitrate increase.

### 4.1 Frame-level interpolation filter in HEVC encoder

The HEVC DCT-based interpolation filter (DCT-IF), which is used to obtain samples at fractional positions, is the most complex function, especially for motion estimation in encoders. Instead of the 6-tap and bilinear interpolation filters of H.264/AVC, HEVC adopts an 8(7)-tap DCT-IF for luminance components and a 4-tap DCT-IF for chrominance components [20]. Furthermore, all of the fractional position samples are derived by increasing the number of filter taps without intermediate rounding operations, which reduces potential rounding errors compared to H.264/AVC. In order to determine the optimal CU size and coding modes, the HM encoder uses a recursive scheme for the RD optimization process. In particular, the PU-level interpolation filter redundantly accesses the same memory positions many times. Excessive memory accesses significantly increase encoding time due to limited memory bandwidth. In fact, the DCT-IF occupies approximately 30% to 35% of the total cycles in the HM encoder. We adopt a frame-level interpolation filter to reduce redundant memory accesses. The frame-level interpolation filter avoids the redundant execution that occurs in the RD optimization process and enables parallel processing because neighboring blocks become independent. However, it requires additional memory for 15 fractional samples per integer sample over an entire frame. In addition, SIMD instructions and multi-threaded processing using OpenMP and GPUs can be easily used for fast encoding.
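The frame-level idea can be sketched as a one-time pass that builds all 16 quarter-sample luma planes (the integer plane plus the 15 fractional ones) for a reference frame, so that motion estimation only performs lookups afterwards. The tap values are the HEVC luma DCT-IF coefficients. Everything else is an assumption made for illustration: the function names, the edge clamping, 8-bit samples, and a single final rounding shift in place of the standard's exact shift stages.

```python
# HEVC luma DCT-IF taps for quarter-sample phases 0..3 (each row sums to 64).
LUMA_TAPS = {
    0: (0, 0, 0, 64, 0, 0, 0, 0),
    1: (-1, 4, -10, 58, 17, -5, 1, 0),
    2: (-1, 4, -11, 40, 40, -11, 4, -1),
    3: (0, 1, -5, 17, 58, -10, 4, -1),
}


def _clamp(v, lo, hi):
    return max(lo, min(hi, v))


def interpolate_plane(frame, fx, fy):
    """Separable 8-tap filtering of a whole frame at fractional phase (fx, fy)."""
    h, w = len(frame), len(frame[0])
    tx, ty = LUMA_TAPS[fx], LUMA_TAPS[fy]
    # Horizontal pass, kept at full precision (no intermediate rounding).
    tmp = [
        [sum(t * row[_clamp(x + k - 3, 0, w - 1)] for k, t in enumerate(tx))
         for x in range(w)]
        for row in frame
    ]
    out = []
    for y in range(h):
        line = []
        for x in range(w):
            acc = sum(t * tmp[_clamp(y + k - 3, 0, h - 1)][x]
                      for k, t in enumerate(ty))
            # Simplified: one rounding shift for the combined gain of 64 * 64.
            line.append(_clamp((acc + 2048) >> 12, 0, 255))
        out.append(line)
    return out


def build_fractional_planes(frame):
    """Compute all 16 phase planes once per reference frame."""
    return {(fx, fy): interpolate_plane(frame, fx, fy)
            for fy in range(4) for fx in range(4)}
```

With the planes cached per frame, a PU-level search at any quarter-sample motion vector reads from `planes[(fx, fy)]` instead of re-filtering the same positions, at the memory cost noted above.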

### 4.2 SIMD implementation of cost function and transformation

In the SAD and SATD cost functions, $i$ and $j$ are the pixel indices, and their ranges are determined by the block size. $O(i,j)$ and $P(i,j)$ are the original and predicted pixel values, respectively. Note that $H(i,j)$ is the Hadamard transformation of the prediction error, $O(i,j) - P(i,j)$ [8]. Because only addition and subtraction operations are involved in the cost functions, SATD can yield an accurate cost in the transform domain with relatively small complexity compared to DCT. Since both cost functions apply the same operations to multiple data, vector instructions are quite useful for reducing the required clock cycles. This work uses SSE2 and SSE3 instructions defined in the Intel SIMD architecture, which are widely employed for many DSP processors [21].

In the case of the SAD operation, we employed *PSADBW* (packed sum of absolute differences), *PACKUSWB* (pack with unsigned saturation), and *PADDD* (add packed double word integers) instructions. Sixteen absolute differences can be computed and accumulated by a single *PSADBW* instruction. Figure 4 shows how to compute SAD with SIMD instructions. Data packing is conducted with sixteen 16-bit original pixels ($i_x$) and sixteen 16-bit reference pixels ($j_x$) using the *PACKUSWB* instruction. For 8-bit internal bit depth, the data packing is conducted to form 16-bit short data. For 10-bit internal bit depth, the data packing process is not required. Sixteen original pixels and reference pixels are packed into two 128-bit registers, and *PSADBW* is performed. The computed SAD from $i_0 - j_0$ to $i_7 - j_7$ is stored in the lower 16 bits, and the SAD from $i_8 - j_8$ to $i_{15} - j_{15}$ is stored at bit positions 64 to 79. Acceleration of 4 × 4 to 64 × 64 SAD computations can be achieved using the aforementioned instructions based on instruction-level parallelism.

The 4 × 4 and 8 × 8 SATD operations are implemented using interleaving instructions, such as *PUNPCKLQDQ* (unpack low-order quad words), *PUNPCKHWD* (unpack high-order words), and *PUNPCKLWD* (unpack low-order words), and arithmetic instructions, such as *PADDW* (add packed word integers), *PSUBW* (subtract packed word integers), and *PABSW* (packed absolute value).

The transform is implemented using interleaving instructions, such as *PUNPCKLDQ* (unpack low-order double words), *PUNPCKHDQ* (unpack high-order double words), *PUNPCKLQDQ*, and *PUNPCKHQDQ* (unpack high-order quad words), arithmetic instructions, such as *PMADDWD* (packed multiply and add) and *PADDD*, and shift instructions, such as *PSRAD* (packed shift right arithmetic). For the HEVC forward and backward transformation, we need to consider the data range and the center value in computing matrix multiplications, unlike the SAD and SATD implementations. Figure 5 shows how to compute the HEVC inverse transform using SIMD instructions. Data packing is conducted with sixteen 16-bit coefficients ($c_x$) using the *PUNPCKLWD* instruction. The 16 coefficients are packed into two 128-bit registers. To reorder the coefficients, the packed coefficient signals are repacked using the *PUNPCKLQDQ* and *PUNPCKHQDQ* instructions. Repacked coefficients and the kernel ($k_x$) of the inverse transform are multiplied for eight 16-bit data in 128-bit registers. Then, the results of the multiplications are added into 128-bit registers using the *PMADDWD* instruction. Finally, the results of *PMADDWD* are added into the 128-bit destination register to compute the inverse-transformed residuals using the *PHADD* instruction.

Input data for the transformation range from −255 to 255. As a result, the data should be represented by at least 9 bits. The data ranges of the coefficients of the HEVC transform kernels depend on the size of the transform kernels. However, they can be represented in 8 bits for the 32 × 32 kernel because they range from −90 to 90 [22]. For the computation of one transform coefficient, the required number of addition and multiplication operations is as large as the size of the transform kernel along the horizontal and vertical directions. A downscale should be employed to keep 16 bits in every operation for each direction. To avoid overflow and underflow, four 32-bit data should be packed into the 128-bit integer register of SSE2. In addition, the transform matrix is transposed in advance to reduce memory read/write operations.
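The register layouts described in this section can be checked against a scalar model. The sketch below emulates the arithmetic of *PSADBW* and *PMADDWD* on plain Python lists and adds a scalar 4 × 4 SATD as a reference. It is not the authors' SIMD code: the function names are hypothetical, the 128-bit register is modelled as a Python integer, and any SATD normalization applied by an encoder is omitted.

```python
def psadbw(a, b):
    """Model of SSE2 PSADBW on two vectors of 16 unsigned bytes.

    Returns a 128-bit integer: the sum of |a[k] - b[k]| for k = 0..7 in
    bits 0-15 and the sum for k = 8..15 in bits 64-79.
    """
    if len(a) != 16 or len(b) != 16:
        raise ValueError("PSADBW operates on 16 byte lanes")
    low = sum(abs(x - y) for x, y in zip(a[:8], b[:8]))
    high = sum(abs(x - y) for x, y in zip(a[8:], b[8:]))
    return low | (high << 64)


def sad_from_psadbw(reg):
    """Combine the two partial sums held in a PSADBW result."""
    return (reg & 0xFFFF) + ((reg >> 64) & 0xFFFF)


def pmaddwd(a, b):
    """Model of PMADDWD: 8 signed 16-bit lanes -> 4 sums of adjacent products."""
    return [a[k] * b[k] + a[k + 1] * b[k + 1] for k in range(0, 8, 2)]


def hadamard4(v):
    """4-point Hadamard butterfly using only additions and subtractions."""
    a, b, c, d = v
    s0, s1, d0, d1 = a + b, c + d, a - b, c - d
    return [s0 + s1, d0 + d1, s0 - s1, d0 - d1]


def satd4x4(orig, pred):
    """Sum of absolute 2-D Hadamard coefficients of a 4 x 4 prediction error."""
    diff = [[o - p for o, p in zip(ro, rp)] for ro, rp in zip(orig, pred)]
    rows = [hadamard4(r) for r in diff]
    cols = [hadamard4(list(c)) for c in zip(*rows)]
    return sum(abs(x) for col in cols for x in col)
```

A scalar model of this kind is mainly useful as a bit-exact reference when validating a vectorized implementation block by block.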

## 5 Proposed slice-level parallelism with load balance

To reduce the computational load of the RD optimization, early termination and mode competition algorithms have been adopted in the HM reference software [23–25]. However, these fast encoding algorithms cause different encoding complexities among different slices. To maximize the parallelism of the data-level task partition, an accurate load balance for slice parallelization is required. Several works [26, 27] have addressed accurate load balancing for slice parallelization. In Zhang's algorithm [26], adaptive data partitioning for MPEG-2 video encoders was proposed, adjusting computational loads based on the complexity of a previously encoded frame of the same picture type. In Jung's algorithm [27], an adaptive slice partition algorithm was proposed that uses early-decided coding modes for macro-blocks in H.264/AVC. In that algorithm, a quantitative model was designed to estimate the computational load associated with each candidate MB mode group. However, in order to apply slice-level parallelism to an HEVC encoder, we need to focus on CTU structures, variable block sizes, and coding modes. In this section, a complexity estimation model and an adaptive slice partition algorithm are proposed to achieve load-balanced slice parallelization.

### 5.1 Complexity estimation model

In the complexity estimation model, $R(s,m)$ and $r(s,m)$ represent the complexity per unit and the complexity ratio of each CU size and mode, respectively. $w(s)$ and $\mathrm{CEM}(s,m)$ are the width of the CU size and the complexity estimation model in Table 4, respectively. Note that NF is a normalization factor for fixed-point operation. The complexity of the $l$th CTU is defined in terms of $(s, m \mid l)$, which represents the selected mode for the CTU. $S$ and $M$ are defined by {64 × 64, 32 × 32, 16 × 16, 8 × 8} and {Skip, Inter, Intra}, respectively. The predicted complexity for each slice is computed by summing the complexities of its CTUs.

**Normalized complexity for variable CU size and mode**

CU size | Skip | Inter | Intra |
---|---|---|---|
64 × 64 | 109 | 760 | 52 |
32 × 32 | 42 | 280 | 16 |
16 × 16 | 9 | 71 | 3 |
8 × 8 | 2 | 19 | 1 |

**Pearson product moment correlations of the actual and predicted times**

Class | Sequence name | Pearson product moment correlation |
---|---|---|
Class A (2,560 × 1,600) | Traffic | 0.9495 |
Class A (2,560 × 1,600) | PeopleOnStreet | 0.9083 |
Class B (1,920 × 1,080) | Kimono | 0.9859 |
Class B (1,920 × 1,080) | ParkScene | 0.9689 |
Class B (1,920 × 1,080) | Cactus | 0.9382 |
Class B (1,920 × 1,080) | BasketballDrive | 0.9456 |
Class B (1,920 × 1,080) | BQTerrace | 0.9093 |
Class C (832 × 480) | BasketballDrill | 0.9568 |
Class C (832 × 480) | BQMall | 0.9723 |
Class C (832 × 480) | PartyScene | 0.9326 |
Class C (832 × 480) | RaceHorses | 0.9484 |
Average | | 0.9469 |
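The way the normalized complexity table feeds a slice-level prediction can be sketched as a table lookup followed by two summations. The sketch assumes that each table entry is the cost of one CU of the given size and mode, and that a CTU is described by the list of (CU size, mode) pairs selected inside it. The names `CEM`, `ctu_complexity`, and `slice_complexity` are illustrative and do not reproduce the paper's exact equations.

```python
# Normalized complexity per CU, copied from the table of CU sizes and modes.
CEM = {
    64: {"skip": 109, "inter": 760, "intra": 52},
    32: {"skip": 42, "inter": 280, "intra": 16},
    16: {"skip": 9, "inter": 71, "intra": 3},
    8: {"skip": 2, "inter": 19, "intra": 1},
}


def ctu_complexity(cus):
    """Predicted complexity of one CTU from its selected (size, mode) pairs."""
    return sum(CEM[size][mode] for size, mode in cus)


def slice_complexity(ctus):
    """Predicted complexity of a slice as the sum over its CTUs."""
    return sum(ctu_complexity(cus) for cus in ctus)
```

Under these assumptions, a CTU coded as a single 64 × 64 skip CU is predicted at 109, while a CTU split into four 32 × 32 inter CUs is predicted at 1,120, which shows how unevenly the load can be distributed across CTUs.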

### 5.2 Adaptive slice partitioning using characteristics of temporal layers

The number of CTUs in the $k$th slice, $L(k)$, and the offset to control the number of CTUs in a slice, $\mathit{offset}(k)$, are defined for each slice. Here, $i$ is the frame index, $j$ is the temporal layer index, and $k$ is the slice index. Also, $N$ is the number of slices in a frame, and $\mathrm{CTU}_{inFrame}$ is the number of CTUs in the frame. In Equation 9, the CTU offset for each slice is set to the additional number of CTUs. The proposed algorithm adaptively partitions slices according to the difference between the ideal complexity for each slice and the ratio of predicted complexity, which yields the speed-up of slice-level parallelism. Figure 8 shows the actual encoding time and the predicted encoding time using the proposed load-balanced slice parallelization. The complexity load is quite well balanced compared to that shown in Figure 6. In addition, the maximum difference between the ratios of the actual encoding time and the predicted one is 0.09363, and the minimum difference is 4 × 10^{−5}.
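The effect of load-balanced partitioning can be sketched with a simple greedy split over predicted per-CTU complexities. This is a stand-in technique, not the paper's offset-based equations: it assumes that a complexity estimate is available for every CTU (for instance, from a previously encoded frame of the same temporal layer) and cuts the raster-ordered CTU sequence so that each slice receives roughly 1/N of the total. The function name is hypothetical.

```python
def balanced_slice_sizes(ctu_cost, n_slices):
    """Number of CTUs per slice so that predicted slice costs are near-equal.

    ctu_cost: predicted complexity of each CTU in raster order.
    Every slice receives at least one CTU.
    """
    n = len(ctu_cost)
    if not 1 <= n_slices <= n:
        raise ValueError("need 1 <= n_slices <= number of CTUs")
    total = sum(ctu_cost)
    sizes, start, acc = [], 0, 0
    for k in range(1, n_slices):
        end = start
        last_allowed = n - (n_slices - k)  # leave one CTU per remaining slice
        # Grow slice k until the cumulative cost would pass k/N of the total.
        while end < last_allowed and (
            end == start or (acc + ctu_cost[end]) * n_slices <= total * k
        ):
            acc += ctu_cost[end]
            end += 1
        sizes.append(end - start)
        start = end
    sizes.append(n - start)
    return sizes
```

With uniform costs the result equals the uniform partition. With costs `[4, 4, 1, 1, 1, 1, 2, 2]` and two slices, the split is 2 and 6 CTUs, so both slices carry a predicted cost of 8.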

## 6 Experimental results

- (a) According to HEVC common test condition [15]
- (b) Profile: HEVC main profile (MP) [1]
- (c) Level: Level 4.1 [1]
- (d) Encoding structure: RA and LD
- (e) QP value: 22, 27, 32, 37
- (f) Test sequences: HEVC common test sequences (classes B and C) in Table 6

**Test sequences**

Class | Sequence number | Sequence name | Frame count | Frame rate |
---|---|---|---|---|
Class B (1,920 × 1,080) | S01 | Kimono | 240 | 24 |
Class B (1,920 × 1,080) | S02 | ParkScene | 240 | 24 |
Class B (1,920 × 1,080) | S03 | Cactus | 500 | 50 |
Class B (1,920 × 1,080) | S04 | BasketballDrive | 500 | 50 |
Class B (1,920 × 1,080) | S05 | BQTerrace | 600 | 60 |
Class C (832 × 480) | S06 | BasketballDrill | 500 | 50 |
Class C (832 × 480) | S07 | BQMall | 600 | 60 |
Class C (832 × 480) | S08 | PartyScene | 500 | 50 |
Class C (832 × 480) | S09 | RaceHorses | 300 | 30 |

$\mathrm{AML}_{anchor}$ is the average of the maximum ratio of complexity load over all the slices for the anchor, and $\mathrm{AML}_{proposed}$ is the average of the maximum ratio of complexity load over all the slices for the proposed algorithm. In addition, BD-BR (%) for bitrate increase, BD-PSNR (dB) for objective quality decrease, and ATS (%) for average time saving were evaluated. Note that the ATS is defined by

$$\mathrm{ATS}\,(\%) = \frac{\mathrm{Etime}_{anchor} - \mathrm{Etime}_{proposed}}{\mathrm{Etime}_{anchor}} \times 100$$

where $\mathrm{Etime}_{anchor}$ is the encoding time of the anchor encoder and $\mathrm{Etime}_{proposed}$ is the encoding time of the proposed method.
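The relation between ATS and the speed-up factor can be made explicit with a small helper. The sketch assumes the usual definition of time saving, the difference between the anchor and proposed encoding times divided by the anchor time and expressed as a percentage; the function names are illustrative.

```python
def average_time_saving(etime_anchor, etime_proposed):
    """Time saving in percent relative to the anchor encoding time."""
    return (etime_anchor - etime_proposed) / etime_anchor * 100.0


def speedup_from_ats(ats_percent):
    """Speed-up factor implied by a time saving given in percent."""
    return 100.0 / (100.0 - ats_percent)
```

Under this definition, a time saving of 90% corresponds to a speed-up of ten times, and a saving of 70% to about 3.3 times.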

First, the ATS comparison between the anchor and the proposed software optimizations is shown. Second, the coding efficiency of the slice parallelism using OpenMP is presented for the four-slice case. Third, the coding efficiency of the proposed load-balanced slice parallelism is presented. Finally, the coding efficiency of the overall proposed encoder based on software optimization and parallelization is evaluated and compared to the HM 9.0 reference encoder.

**HM 9.0 vs. optimized HEVC encoder software**

Class | Sequence | RA: SIMD (A) | RA: Frame-level IF (B) | RA: A + B | LD: SIMD (A) | LD: Frame-level IF (B) | LD: A + B |
---|---|---|---|---|---|---|---|
B | S01 | 14.13 | 17.79 | 31.92 | 15.74 | 19.44 | 35.18 |
B | S02 | 12.38 | 20.18 | 32.56 | 14.78 | 21.10 | 35.88 |
B | S03 | 14.09 | 19.56 | 33.65 | 16.23 | 20.26 | 36.49 |
B | S04 | 15.16 | 16.85 | 32.01 | 17.62 | 17.12 | 34.74 |
B | S05 | 11.93 | 20.35 | 32.28 | 13.59 | 21.58 | 35.17 |
C | S06 | 14.33 | 18.51 | 32.84 | 16.49 | 19.60 | 35.99 |
C | S07 | 13.84 | 20.90 | 34.74 | 16.02 | 20.95 | 36.97 |
C | S08 | 11.88 | 18.49 | 30.37 | 13.44 | 19.94 | 33.38 |
C | S09 | 14.67 | 15.03 | 29.70 | 17.23 | 15.54 | 32.77 |
Average (B) | | 13.54 | 18.95 | 32.48 | 15.59 | 19.90 | 35.49 |
Average (C) | | 13.68 | 18.23 | 31.91 | 15.80 | 18.98 | 34.78 |

**HM 9.0 vs. slice parallelization using OpenMP**

Class | Sequence | RA: BD-BR (%) | RA: BD-PSNR (dB) | RA: ATS (%) | LD: BD-BR (%) | LD: BD-PSNR (dB) | LD: ATS (%) |
---|---|---|---|---|---|---|---|
B | S01 | 1.79 | −0.05 | 70.25 | 1.49 | −0.05 | 70.60 |
B | S02 | 1.00 | −0.03 | 71.53 | 0.89 | −0.03 | 70.61 |
B | S03 | 1.38 | −0.03 | 71.08 | 1.29 | −0.03 | 71.60 |
B | S04 | 2.06 | −0.05 | 68.03 | 1.53 | −0.04 | 68.65 |
B | S05 | 1.39 | −0.02 | 70.27 | 1.35 | −0.03 | 70.23 |
C | S06 | 3.41 | −0.14 | 68.95 | 2.60 | −0.10 | 69.26 |
C | S07 | 3.78 | −0.14 | 66.98 | 2.89 | −0.11 | 66.98 |
C | S08 | 1.58 | −0.07 | 68.04 | 1.45 | −0.06 | 69.61 |
C | S09 | 2.96 | −0.11 | 68.53 | 2.03 | −0.08 | 68.75 |
Average (B) | | 1.52 | −0.04 | 70.23 | 1.31 | −0.03 | 70.34 |
Average (C) | | 2.93 | −0.12 | 68.13 | 2.24 | −0.09 | 68.65 |

**BD-BR, ATS, and ALS for slice and load-balanced slice parallelization**

Class | Sequence | RA: BD-BR (%) | RA: ATS (%) | RA: ALS (%) | LD: BD-BR (%) | LD: ATS (%) | LD: ALS (%) |
---|---|---|---|---|---|---|---|
B | S01 | −0.01 | 13.44 | 16.33 | −0.04 | 12.03 | 11.72 |
B | S02 | 0.01 | 11.31 | 14.94 | −0.02 | 12.44 | 13.25 |
B | S03 | 0.05 | 0.46 | 2.95 | −0.01 | −0.16 | 14.86 |
B | S04 | 0.16 | 18.05 | 22.58 | 0.05 | 17.82 | 20.83 |
B | S05 | −0.01 | 10.90 | 11.16 | −0.08 | 14.63 | 16.18 |
C | S06 | 0.16 | 5.64 | 6.29 | 0.10 | 6.50 | 7.88 |
C | S07 | 0.33 | 17.71 | 15.99 | 0.23 | 19.13 | 18.55 |
C | S08 | 0.18 | 8.01 | 10.17 | 0.08 | 8.02 | 8.89 |
C | S09 | −0.01 | 8.72 | 10.62 | 0.02 | 8.90 | 9.65 |
Average (B) | | 0.04 | 10.83 | 13.59 | −0.02 | 11.35 | 15.37 |
Average (C) | | 0.17 | 10.02 | 10.77 | 0.11 | 10.64 | 11.24 |

**HM 9.0 vs. the proposed accelerated and parallelized HEVC encoder**

Class | Sequence | RA: BD-BR (%) | RA: BD-PSNR (dB) | RA: ATS (%) | LD: BD-BR (%) | LD: BD-PSNR (dB) | LD: ATS (%) |
---|---|---|---|---|---|---|---|
B | S01 | 3.88 | −0.12 | 90.13 | 3.29 | −0.10 | 89.34 |
B | S02 | 3.45 | −0.11 | 91.33 | 3.56 | −0.11 | 90.47 |
B | S03 | 4.81 | −0.10 | 89.12 | 3.96 | −0.09 | 88.77 |
B | S04 | 4.34 | −0.10 | 88.26 | 3.08 | −0.07 | 87.93 |
B | S05 | 4.52 | −0.07 | 91.28 | 3.32 | −0.06 | 90.19 |
C | S06 | 5.44 | −0.22 | 86.86 | 3.84 | −0.15 | 86.72 |
C | S07 | 7.46 | −0.28 | 88.92 | 5.41 | −0.21 | 88.31 |
C | S08 | 4.30 | −0.18 | 86.75 | 3.65 | −0.15 | 86.10 |
C | S09 | 6.10 | −0.23 | 85.44 | 3.76 | −0.15 | 85.46 |
Average (B) | | 4.20 | −0.10 | 90.02 | 3.44 | −0.09 | 89.34 |
Average (C) | | 5.83 | −0.23 | 86.99 | 4.17 | −0.17 | 86.65 |

**BD-BR, BD-PSNR, and ATS of the proposed HEVC encoder for class A (2,560 × 1,600)**

Sequence | RA: BD-BR (%) | RA: BD-PSNR (dB) | RA: ATS (%) | LD: BD-BR (%) | LD: BD-PSNR (dB) | LD: ATS (%) |
---|---|---|---|---|---|---|
Traffic | 3.66 | −0.12 | 91.14 | 3.72 | −0.17 | 85.48 |
PeopleOnStreet | 2.80 | −0.09 | 90.43 | 2.29 | −0.11 | 85.80 |
Average | 3.23 | −0.11 | 90.79 | 3.01 | −0.14 | 85.64 |

## 7 Conclusions

In this paper, the computational complexity of the HM 9.0 encoder was analyzed for acceleration and parallelization of the HEVC encoder. We identified five key modules of the HM 9.0 encoder and the functions that require the dominant share of computing cycles. Based on the complexity analysis, two software optimization methods were used for acceleration: the frame-level interpolation filter and SIMD implementation. In addition, load-balanced slice parallelization was proposed. The software optimization methods achieve an average time saving of 33.56% without any coding loss. In addition, load balancing for the slice parallelization method achieves an average time saving of about 10% compared to uniform slice partitioning. The overall average time saving of the proposed HEVC encoder is approximately 90% compared to HM 9.0, with acceptable coding loss. An HEVC encoder with the proposed methods can compress full HD videos at approximately 1 fps in a commercial PC environment without any hardware acceleration.

Further study will be focused on additional software optimization, fast encoding algorithm, and tile-level parallel processing for real-time encoder of HEVC.

## Declarations

### Acknowledgements

This research was partly supported by the IT R&D program of MSIP/KEIT [10039199, A Study on Core Technologies of Perceptual Quality based Scalable 3D Video Codecs], the MSIP (Ministry of Science, ICT & Future Planning), Korea, under the ITRC (Information Technology Research Center) support program (NIPA-2013-H0301-13-1011) supervised by the NIPA (National IT Industry Promotion Agency), and the grant from the Seoul R&BD Programs (SS110004M0229111).


## References

1. Bross B, Han W-J, Sullivan GJ, Ohm JR, Wiegand T: *High Efficiency Video Coding (HEVC) text specification draft 9, ITU-T/ISO/IEC Joint Collaborative Team on Video Coding (JCT-VC) document JCTVC-K1003*. 2012.
2. ITU-T and ISO/IEC JTC 1: Advanced video coding for generic audiovisual services, ITU-T Rec. H.264 and ISO/IEC 14496–10 (MPEG-4 AVC), versions 1-16, 2003-2012.
3. Samet H: The quadtree and related hierarchical data structures. *ACM Comput Surv (CSUR)* 1984, 16(2):187-260. 10.1145/356924.356930
4. Han W-J, Min J, Kim I-K, Alshina E, Alshin A, Lee T, Chen J, Seregin V, Lee S, Hong YM, Cheon MS, Shlyakhov N, McCann K, Davies T, Park JH: Improved video compression efficiency through flexible unit representation and corresponding extension of coding tools. *Circuits Syst Video Technol, IEEE Trans* 2010, 20(12):1709-1720.
5. Wiegand T, Ohm J-R, Sullivan GJ, Han W-J, Joshi R, Tan TK, Ugur K: Special section on the joint call for proposals on High Efficiency Video Coding (HEVC) standardization. *Circuits Syst Video Technol, IEEE Trans* 2010, 20(12):1661-1666.
6. Chen K, Duan Y, Yan L, Sun J, Guo Z: *Efficient SIMD optimization of HEVC encoder over X86 processors, in Signal & Information Processing Association Annual Summit and Conference (APSIPA ASC)*. Hollywood, CA: Asia-Pacific; 2012:1-4.
7. Clare G, Henry F, Pateux S: *Wavefront parallel processing for HEVC encoding and decoding, ITU-T/ISO/IEC Joint Collaborative Team on Video Coding (JCT-VC) document JCTVC-F274*. 2011.
8. Kim I-K, McCann K, Sugimoto K, Bross B, Han W-J: *HM9: High Efficiency Video Coding (HEVC) test model 9 encoder description, ITU-T/ISO/IEC Joint Collaborative Team on Video Coding (JCT-VC) document JCTVC-K1002*. 2012.
9. McCann K, Han WJ, Kim IK, Min JH, Alshina E, Alshin A, Lee T, Chen J, Seregin V, Lee S, Hong YM, Cheon MS, Shlyakhov N: Samsung's response to the call for proposals on video compression technology, ITU-T/ISO/IEC Joint Collaborative Team on Video Coding (JCT-VC) document JCTVC-A124. 2010.
10. De Forni R, Taubman D: On the benefits of leaf merging in quad-tree motion models. *IEEE Int Conf Image Process* 2005, 2005: 858-861.
11. Jung J, Bross B, Chen P, Han W-J: *Description of core experiment 9: MV coding and skip/merge operations, ITU-T/ISO/IEC Joint Collaborative Team on Video Coding (JCT-VC) document JCTVC-D609*. 2011.
12. Yuan Y, Zheng X, Peng X, Xu J, Kim IK, Liu L, Wang Y, Cao X, Lai C, Zheng J, He Y, Yu H: CE2: non-square quadtree transform for symmetric and asymmetric motion partition, ITU-T/ISO/IEC Joint Collaborative Team on Video Coding (JCT-VC) document JCTVC-F412. 2011.
13. Joint Collaborative Team on Video Coding (JCT-VC) of ITU-T SG 16 WP 3 and ISO/IEC JTC 1/SC 29/WG 11: HM-9.0 reference software. 2014.
14. VTune™ Amplifier XE 2013 from Intel. 2014. http://software.intel.com/en-us/articles/intel-vtune-amplifier-xe/
15. Bossen F: *Common HM test conditions and software reference configuration, ITU-T/ISO/IEC Joint Collaborative Team on Video Coding (JCT-VC) document JCTVC-K1100*. 2012.
16. Sullivan GJ, Ohm JR, Han WJ, Wiegand T: Overview of the High Efficiency Video Coding (HEVC) standard. *IEEE Transactions on Circuits and Systems for Video Technology* 2012, 22(12):1649-1668.
17. Fuldseth A, Horowitz M, Xu S, Segall A, Zhou M: *Tiles, ITU-T/ISO/IEC Joint Collaborative Team on Video Coding (JCT-VC) document JCTVC-F335*. 2011.
18. Henry F, Pateux S: *Wavefront parallel processing, ITU-T/ISO/IEC Joint Collaborative Team on Video Coding (JCT-VC) document JCTVC-E196*. 2011.
19. Chi CC, Alvarez-Mesa M, Juurlink B, Clare G, Henry F, Pateux S, Schierl T: Parallel scalability and efficiency of HEVC parallelization approaches. *IEEE Transactions on Circuits and Systems for Video Technology* 2012, 22(12):1827-1838.
20. Alshin A, Alshina E, Park JH, Han WJ: *DCT based interpolation filter for motion compensation in HEVC, in Proceedings of the SPIE 8499 Applications of Digital Image Processing XXXV*. CA: San Diego; 2012.
21. Intel: Intel 64 and IA-32 architectures software developer manuals. 2014. http://www.intel.com/content/www/us/en/processors/architectures-software-developer-manuals.html
22. Budagavi M, Sze V: *Unified forward + inverse transform architecture for HEVC, in 19th IEEE International Conference on Image Processing (ICIP), 30 September to 3 October 2012*. Florida, USA: Orlando; 2012:209-212.
23. Gweon RH, Lee Y-L, Lim J: *Early termination of CU encoding to reduce HEVC complexity, ITU-T/ISO/IEC Joint Collaborative Team on Video Coding (JCT-VC) document JCTVC-F045*. 2011.
24. Choi K, Jang ES: *Coding tree pruning based CU early termination, ITU-T/ISO/IEC Joint Collaborative Team on Video Coding (JCT-VC) document JCTVC-F092*. 2011.
25. Yang J, Kim J, Won K, Lee H, Jeon B: *Early skip detection for HEVC, ITU-T/ISO/IEC Joint Collaborative Team on Video Coding (JCT-VC) document JCTVC-G543*. 2011.
26. Zhang N, Wu C-H: Study on adaptive job assignment for multiprocessor implementation of MPEG2 video encoding. *IEEE Trans. Ind. Electron* 1997, 44(5):726-734. 10.1109/41.633481
27. Jung B, Jeon B: Adaptive slice-level parallelism for H.264/AVC encoding using pre macroblock mode selection. *J Vis Commun Image Representation* 2008, 19(8):558-572.
28. Bjontegaard G: *Document VCEG-M33: calculation of average PSNR differences between RD-curves, ITU-T VCEG Meeting*. Texas, USA: Austin; 2001.
29. Tian X, Chen Y-K, Girkar M, Ge S, Lienhart R, Shah S: *Exploring the use of hyper-threading technology for multimedia applications with Intel® OpenMP compiler, in Proceedings of International Symposium on Parallel and Distributed Processing 2003*. France: Nice; 2003.
30. Sankaraiah S, Shuan LH, Eswaran C, Abdullah J: Performance optimization of video coding process on multi-core platform using GOP level parallelism. *Int J Parallel Program Springer* 2013, 1-17.

## Copyright

This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.