CU splitting early termination based on weighted SVM


High efficiency video coding (HEVC) is the latest video coding standard that has been developed by JCT-VC. It employs plenty of efficient coding algorithms (e.g., highly flexible quad-tree coding block partitioning), and outperforms H.264/AVC by 35–43% bitrate reduction. However, it imposes enormous computational complexity on encoder due to the optimization processing in the efficient coding tools, especially the rate distortion optimization on coding unit (CU), prediction unit, and transform unit. In this article, we propose a CU splitting early termination algorithm to reduce the heavy computational burden on encoder. CU splitting is modeled as a binary classification problem, on which a support vector machine (SVM) is applied. In order to reduce the impact of outliers as well as to maintain the RD performance while a misclassification occurs, RD loss due to misclassification is introduced as weights in SVM training. Efficient and representative features are extracted and optimized by a wrapper approach to eliminate dependency on video content as well as on encoding configurations. Experimental results show that the proposed algorithm can achieve about 44.7% complexity reduction on average with only 1.35% BD-rate increase under the “random access” configuration, and 41.9% time saving with 1.66% BD-rate increase under the “low delay” setting, compared with the HEVC reference software.

1. Introduction

High definition (HD) and ultra-high definition (UHD) video contents have become increasingly popular worldwide, thus the demand of video compression technologies that can provide higher coding efficiency over HD/UHD videos can be envisioned in near future. In view of this, high efficiency video coding (HEVC) standard is being developed by the Joint Collaborative Team on Video Coding [1], which is established by the ITU-T Video Coding Experts Group and the ISO/IEC Moving Picture Experts Group. HEVC outperforms H.264/AVC high profile by 35–43% bitrate reduction at the same reconstructed video quality [2]. HEVC inherits the well-known block-based hybrid coding scheme [3] used by previous coding standards, e.g., H.264/AVC, and extends the framework by introducing highly flexible quad-tree coding block partitioning. The quad-tree coding block partitioning consists of newly brought concepts of coding unit (CU), prediction unit (PU), and transform unit (TU). CU is the basic unit of region splitting used for inter/intra coding, which extends the traditional concept of macroblock (MB) based on a hierarchical structure with block size varying from 64 × 64 to 8 × 8 pixels. A CU is allowed to recursively be split into four smaller CUs of equal size. In this manner, a picture is represented by a content-adaptive coding tree structure comprised of CU blocks with different sizes. PU is the basic unit used for prediction process in a rectangular shape. One PU can be encoded with one of the modes in candidate set, which is similar to MB mode of H.264/AVC in spirit. The pixels in one PU share prediction information, e.g., modes, motion vectors (MV), and reference index. TU is the basic unit for transform and quantization. TU is defined in a similar way as CU, and its size varies from 4 × 4 to 32 × 32. 
As reported in [4, 5], the flexible data structure representation (extending the MB size up to 64 × 64) introduced over 10% bitrate saving in comparison with the 16 × 16-based configuration in H.264/AVC, since the flexibility of block partitioning can effectively deal with the diversity of picture content.

However, the flexibility of block partitioning in HEVC imposes a significant computational burden on the encoder, which must search for the optimal combination of CU, PU, and TU sizes. Thus, reducing the complexity while maintaining the coding performance is crucial for practical implementation of the new standard. Research on accelerating the encoder of the HEVC test model (HM) is emerging. A fast intra mode decision algorithm was proposed in [6], which used the direction information of neighboring blocks to reduce the number of directions taking part in the rate distortion optimization (RDO) process. To reduce the computational complexity of TU size selection, a fast algorithm for residual quad-tree mode decision was proposed in [7]. In addition, the depth-first decision process for TU size selection in HM was replaced by a merge-and-split decision process, which also reduces unnecessary computation by using the inheritance property of zero-blocks and early termination schemes for non-zero blocks.

In this article, we focus on CU size selection for HEVC. A content-based fast CU decision algorithm was developed for HEVC TMuC (test model under consideration) [8], which analyzed the ratio of utilized CUs to total number of CUs in different depth in frame level and skipped the rarely used CUs with specified depths. Information of neighboring and co-located CUs was used to skip CUs in unnecessary depth in CU level. The algorithm investigated temporal and spatial correlations of CU depth, and designed different thresholds to control the number of CU depths to be evaluated. However, the correlations were data dependent and the ratio was affected by encoding configurations, such as the hierarchical depth in hierarchical prediction structure. Spatial correlation of CU depth as well as the probability that neighboring CUs were SKIP mode was considered in [9] to design an adaptive weighting factor, which was used to adjust the threshold in early terminating the following RD calculations of the current CU. In [10], a method for complexity controlling was proposed by limiting the number of coding decision tests and comparisons according to temporal correlations. All these related works explored the spatial correlations and/or temporal correlations of CU depth to eliminate specific CU depths with a trivial impact on RD performance. However, they were not robust enough due to diversity of the content. It is necessary to consider more statistics so as to get a more accurate and stable model to simplify the CU splitting.

In the field of accelerating the encoder of H.264/AVC as well as its extensions, various properties were investigated and employed to simplify mode decision. A nearly sufficient condition for early zero-block detection is constructed based on the analysis of prediction error to speed up the motion estimation of H.264/AVC JM reference software in [11]. It indicated that prediction error offered a valuable clue about encoder acceleration. Spatial and temporal correlations were exploited to predict the skip mode [12] to reduce encoder complexity. In [13, 14], distribution of MV in an MB was chosen as a feature to predict the optimal mode other than performing exhaustive search over all modes. A hierarchical algorithm proposed in [15] categorized all type of modes into three levels which were triggered on by evaluating SAD (which is between current MB and its co-located MB), high-frequency energy in DCT domain, and RD cost of mode P-8 × 8. In [16], a fast mode decision algorithm named motion activity-based mode decision was proposed. It classified MBs into different classes by pre-defined thresholds and motion activity. Each class corresponded to different number of modes to be checked. Tiesong et al. [17] projected encoding modes onto a 2D map and an optimal 2D map was predicted using spatial and temporal information. Then, a priority-based mode candidate list was constructed based on the optimal 2D map and mode decision was performed starting with the most important mode in the candidate list with early termination conditions. In such a way, the number of modes to be evaluated was reduced and acceleration was achieved. Changsung and Kuo [18] presented a feature-based fast inter/intra mode decision algorithm. This algorithm computed three features regarding spatial and temporal correlations with which to determine inter or intra mode to use. 
The feature space was partitioned into three regions, i.e., risk-free, risk-tolerable, and risk-intolerable regions, by checking the RD loss due to a wrong mode decision and the probability distribution of inter/intra modes. Depending on the region, mechanisms of different complexity were applied for the final mode decision. Martinez-Enriquze et al. [19] analyzed the conditional pdfs of every mode and estimated the RD cost to decide the optimal mode. A fast stereo video encoding algorithm based on a hierarchical two-stage neural network was proposed in [20]. Local properties of the input data and the prediction error were extracted as input features to train a neural network designed to predict the optimal partition mode. SVMs were also introduced in the study of fast mode decision [21, 22]. However, all MBs were treated equally in the classification problem, and the RD performance of each MB was ignored. In general, these works exploited various mode-related features to predict the optimal mode or to reduce the number of modes to be evaluated. The features included spatial and temporal correlations, the gradient or high-frequency energy, the RD cost of a specific mode, motion activity, and local properties, such as the prediction error or SAD/sum of absolute transformed differences (SATD).

As shown in previous research, the CU size selection process based on RD optimization can be unacceptably time-consuming for practical implementation, which is further analyzed in Section 2. To solve this problem, we propose a method that uses machine learning to accelerate the CU size selection process. By properly modeling the problem and applying a machine learning algorithm, our method predicts the optimal CU splitting decision instead of searching exhaustively over all possibilities. To derive a more accurate model of the CU splitting decision, the RD difference is introduced as weights in the SVM training procedure to alleviate the RD performance degradation due to misclassification. Furthermore, various features are extracted from the input video as well as from earlier encoded data, and an optimal feature subset is derived by a wrapper feature selection algorithm.

The rest of the article is organized as follows. We briefly go through CU size selection process of HM, and present the motivation of the proposed algorithm in Section 2. In Section 3, we elaborate the modeling of the CU splitting problem and its solution based on a machine learning algorithm, i.e., SVM. Experimental results in Section 4 demonstrate the effectiveness of the proposed algorithm, and Section 5 concludes the article.

2. CU size optimization in HM

To adapt to the diversity of picture content, flexible quad-tree coding block partitioning is adopted in HEVC, which enables the use of CU, PU, and TU. The concept of CU is analogous to that of MB in previous standards, e.g., H.264/AVC. It is the basic unit for intra/inter coding and is always square in shape. Pictures are divided into largest CUs (LCUs), and each LCU can be split into four equal-sized CUs, which can be further split recursively up to the maximum allowable hierarchical depth. In this manner, the LCU is constructed as a quad-tree of CUs of different sizes, as shown in Figure 1. At a leaf node of the quad-tree, the CU can be encoded in SKIP, inter, or intra mode. The partitioning size of SKIP mode is 2N × 2N, which means that the PU size of SKIP mode equals the CU size; a CU encoded in inter mode can be treated as one PU or partitioned into several PUs, as specified by the partitioning mode: Part_2N × 2N, Part_2N × N, Part_N × 2N, (Part_N × N), Part_2N × nU, Part_2N × nD, Part_nL × 2N, and Part_nR × 2N; and a CU in intra mode can be treated as one PU of size 2N × 2N, or partitioned into four N × N PUs. A simple example of PUs in one CU is shown in Figure 1, as highlighted by the green square. The PU, whose shape is given by the partition size, is the basic unit that carries the prediction information. In order to match the boundaries of real objects in a picture, the shape of a PU is not restricted to being square, e.g., 2N × N is allowed. TU is defined for the transform and quantization process. The shape of a TU depends on the PU. When the PU is square, the TU is also square and its size varies from 4 × 4 to 32 × 32 luma samples. When the PU is non-square, the TU is also non-square and takes a size of 32 × 8, 8 × 32, 16 × 4, or 4 × 16 luma samples. One CU may contain one or more PUs. Likewise, one CU may contain one or more TUs, which are arranged in a quad-tree structure as shown in Figure 1.

Figure 1

Relationship of CU, PU, and TU.

As explained in the previous paragraph, one LCU can be coded into a rather complex quad-tree to adapt to various video contents. Furthermore, CUs with different depths may be coded in different prediction modes, different partitioning modes, and different transform sizes. To derive the optimal CU-level coding parameters, an exhaustive search method is employed by evaluating the RD costs of all possible combinations of CU size, PU size, and TU size. The RDO of CU size is illustrated in Figure 2. It needs a total of 85 RD calculations when the CU size varies from 64 × 64 to 8 × 8 (1 + 4 + 16 + 64 CUs at depths zero to three). Obviously, such an RD-based optimization method introduces significant complexity on the encoder. In fact, it is unnecessary to search exhaustively over all possible CU sizes, since some CU sizes do not yield much rate distortion improvement, and the encoder can be accelerated by terminating the CU splitting decision process early. As shown in Figure 3, “flat” or “homogeneous” regions, e.g., the floor, are more likely to be encoded in large CUs. Areas containing moving objects or object boundaries, e.g., the net and the basketball, are usually split into small CUs. Motivated by this observation, we model the CU splitting decision as a binary classification problem.
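The count of 85 RD calculations can be checked with a few lines of code. The following is a minimal sketch, assuming one RD evaluation per CU node of a fully split quad-tree; the function name `rd_evaluations` and its parameters are illustrative and do not come from HM.

```python
def rd_evaluations(lcu_size: int = 64, min_cu_size: int = 8) -> int:
    """Count the CU nodes of a fully split quad-tree (one RD evaluation per node)."""
    total = 0
    size = lcu_size
    nodes_at_depth = 1  # a single LCU at depth zero
    while size >= min_cu_size:
        total += nodes_at_depth
        nodes_at_depth *= 4  # every CU splits into four equal-sized sub-CUs
        size //= 2           # each split halves the CU width and height
    return total


print(rd_evaluations())  # 1 + 4 + 16 + 64 = 85
```

Every split decision that is predicted correctly at a shallow depth removes the whole sub-tree below it from this count, which is where the complexity reduction comes from.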

Figure 2

RDO on CU size in HM.

Figure 3

CU representation of frame 40 of sequence “BasketballDrill” optimized by HM5.2.

3. CU splitting early termination algorithm based on weighted SVM

3.1. Problem formulation

As the flexible representation of coding data places a heavy burden on the encoder, we propose to terminate CU splitting early to avoid unnecessary trials. We model CU splitting as a binary classification problem (a CU that is not split into four sub-CUs is assigned the label +1; otherwise, −1 is assigned) and tackle the classification problem with an SVM [23]. As a widely used machine learning algorithm, SVM is based on the idea of structural risk minimization (SRM), and it has successfully been applied to a number of real-world problems, such as face recognition, text categorization, and object detection in machine vision. The main idea behind SVM is to derive a unique separating hyperplane that maximizes the margin between two classes. Given l training data points

$$\{x_i, y_i\}_{i=1}^{l}, \quad x_i \in \mathbb{R}^N, \quad y_i \in \{-1, 1\}, \tag{1}$$

where $\{x_i, y_i\}$ is the $i$th training sample, i.e., the $i$th CU, $x_i$ is the input feature vector, and $y_i$ is the class label indicating whether the CU is split or not. The membership decision rule is based on the function defined in Equation (2), where $f(x)$ represents the discriminant function associated with the hyperplane.

$$f(x) = w^T \phi(x) + b, \tag{2}$$

where $\phi(\cdot)$ is a nonlinear operator that maps the input $x_i$ into a higher-dimensional space; inner products in that space are evaluated through the kernel function.

Mathematically, this hyperplane can be constructed by minimizing the following cost function

$$J(w) = \frac{1}{2} w^T w = \frac{1}{2}\|w\|^2 = \frac{1}{2}\sum_{j} w_j^2 \tag{3}$$

with constraints

$$y_i\left(w^T \phi(x_i) + b\right) \ge 1. \tag{4}$$

For a non-separable case, the classification problem is generalized by introducing slack variables $\xi_i$ and a user-defined regularization parameter $C$. The classification problem is then to minimize the following quantity

$$J(w) = \frac{1}{2} w^T w + C \sum_{i=1}^{l} \xi_i \tag{5}$$

subject to

$$y_i\left(w^T \phi(x_i) + b\right) \ge 1 - \xi_i, \quad \xi_i \ge 0. \tag{6}$$

The modified cost function in Equation (5) is the so-called structural risk, which balances the empirical risk (i.e., the training errors reflected by the second term) with model complexity (the first term) [24]. It has been proven that the solution to the optimization problem of Equation (5) under the constraint of Equation (6) is given by the saddle point of Lagrange function

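$$\Gamma(w, b, \alpha, \xi, \beta) = \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{l} \xi_i - \sum_{i=1}^{l} \alpha_i \left[ y_i\left(w^T \phi(x_i) + b\right) - 1 + \xi_i \right] - \sum_{i=1}^{l} \beta_i \xi_i, \tag{7}$$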

where $\alpha_i$ and $\beta_i$ are the Lagrange multipliers associated with the constraints in Equation (6).

The Lagrange multipliers are obtained by solving the maximization problem

$$\alpha^* = \arg\max_{\alpha} \left\{ \sum_{i=1}^{l} \alpha_i - \frac{1}{2} \sum_{i=1}^{l} \sum_{j=1}^{l} \alpha_i \alpha_j y_i y_j K(x_i, x_j) \right\} \tag{8}$$

subject to

$$\sum_{i=1}^{l} \alpha_i y_i = 0, \quad 0 \le \alpha_i \le C, \quad i = 1, 2, \ldots, l, \tag{9}$$

where $K(x_i, x) = \phi^T(x_i)\phi(x)$ is the kernel function. The decision function can equivalently be expressed as

$$\operatorname{sign}\left(f(x)\right) = \operatorname{sign}\left( \sum_{i=1}^{l} \alpha_i^* y_i K(x_i, x) + b \right). \tag{10}$$

Equation (10) shows that the multiplier $\alpha_i$ associated with training point $x_i$ expresses the strength with which that point is embedded in the final decision function. Notice that the nonlinear mapping $\phi(\cdot)$ never appears explicitly in either the training or the decision; only the kernel does. In general, the kernel takes a linear, polynomial, radial basis function (RBF), or sigmoid form. In this article, we use the RBF kernel, since it can handle the case in which the relation between class labels and the input vector is nonlinear as well as the linear case. Furthermore, the model complexity of the RBF kernel is lower than that of the polynomial kernel, and the RBF kernel has fewer numerical difficulties [25].
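The decision rule of Equation (10) can be illustrated with a small sketch. It assumes the usual RBF form K(u, v) = exp(−γ‖u − v‖²); the value of γ, the two support vectors, and their multipliers are toy values chosen for illustration only and are not taken from the trained models.

```python
import math
from typing import Sequence


def rbf_kernel(u: Sequence[float], v: Sequence[float], gamma: float) -> float:
    """RBF kernel K(u, v) = exp(-gamma * ||u - v||^2) (assumed standard form)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-gamma * squared_distance)


def decision_value(x, support_vectors, labels, alphas, b, gamma):
    """f(x) = sum_i alpha_i * y_i * K(x_i, x) + b, as in Equation (10)."""
    return sum(
        alpha * y * rbf_kernel(sv, x, gamma)
        for sv, y, alpha in zip(support_vectors, labels, alphas)
    ) + b


def predict(x, support_vectors, labels, alphas, b, gamma):
    """Return +1 (do not split) or -1 (split) from the sign of f(x)."""
    value = decision_value(x, support_vectors, labels, alphas, b, gamma)
    return 1 if value >= 0 else -1


# Toy model: one support vector per class, satisfying sum_i alpha_i * y_i = 0.
svs = [[0.0, 0.0], [2.0, 2.0]]
ys = [1, -1]
alphas = [1.0, 1.0]

print(predict([0.1, 0.1], svs, ys, alphas, b=0.0, gamma=0.5))  # 1
print(predict([1.9, 2.0], svs, ys, alphas, b=0.0, gamma=0.5))  # -1
```

Only kernel evaluations against the support vectors are needed at prediction time, so the mapping φ(·) is never computed.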

3.2. Proposed CU splitting early termination algorithm

The proposed CU splitting early termination algorithm is shown in Figure 4. At each CU depth, the encoder first performs the rate and distortion calculation of SKIP mode and of inter mode with Part_2N × 2N (denoted as inter 2N × 2N mode hereafter), and meanwhile extracts the required features, i.e., the input vector $x$ of the SVM. Then, an offline-trained SVM CU splitting model is loaded, which predicts the class label of the current CU from the extracted features. Based on the predicted class label, the encoder decides whether to perform RD trials on CU splitting. The offline-trained SVM model is obtained by SVM training with weights on the training samples, where each weight is the difference of RD costs caused by a misclassification. As long as the CU splitting predictor is accurate, early termination of RD trials on CU splitting can substantially reduce computational complexity while maintaining RD performance.
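The control flow described above can be sketched as a recursive traversal. This is a simplified illustration and not the HM implementation: `count_visited_cus`, `features_of`, and `predictors` are hypothetical names, a CU is identified by its path in the quad-tree, and the sketch only counts visited CUs instead of performing real RD calculations.

```python
from typing import Callable, Dict, List, Tuple

Path = Tuple[int, ...]                   # position of a CU in the quad-tree
Predictor = Callable[[List[float]], int]  # returns +1 (no split) or -1 (split)


def count_visited_cus(
    path: Path,
    max_depth: int,
    features_of: Callable[[Path], List[float]],
    predictors: Dict[int, Predictor],
) -> int:
    """Count the CUs evaluated when splitting is terminated early by a predictor."""
    depth = len(path)
    visited = 1  # SKIP and inter 2Nx2N are evaluated for the current CU
    if depth == max_depth:
        return visited  # smallest CU size, no further split possible
    predictor = predictors.get(depth)  # one model per CU depth (hypothetical store)
    if predictor is not None and predictor(features_of(path)) == 1:
        return visited  # predicted "not split": skip RD trials on the sub-CUs
    for child in range(4):
        visited += count_visited_cus(path + (child,), max_depth, features_of, predictors)
    return visited


# Without any predictor the full quad-tree is searched (85 CUs for depths 0 to 3).
print(count_visited_cus((), 3, lambda p: [0.0], {}))
```

In this sketch a label of +1 at depth zero reduces the visited CUs from 85 to one, which is the best case for early termination.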

Figure 4

Proposed CU early termination algorithm based on SVM.

3.3. CU splitting early termination algorithm based on weighted SVM

3.3.1. Off-line training and weights generation

In the field of machine learning, accuracy is one of the most important measurements for classification algorithms. However, in this scenario, not only the ratio of correct classification, but also the loss of RD performance introduced by misclassifications is important.

For some CUs, the RD cost of coding as four sub-CUs and the RD cost of coding as one CU are almost the same. Misclassification of such CUs results in negligible RD degradation. On the contrary, for CUs where coding as four sub-CUs greatly outperforms coding as one CU, misclassification leads to a large RD loss. Different CUs are therefore of different importance. It is improper to treat samples with different RD performance equally in the training process, because the optimal hyperplane is then skewed by the “unimportant” samples, which effectively act as outliers. The desired SVM predictor should predict the class label as accurately as possible while keeping the RD loss as low as possible. Based on this observation, we suggest introducing weights into the SVM training process, i.e., assigning different weights to training samples.

$$\{x_i, y_i, W_i\}_{i=1}^{l}, \quad x_i \in \mathbb{R}^N, \quad y_i \in \{-1, 1\}, \quad W_i \in \mathbb{R}, \tag{11}$$

where the weights are defined as the percentage of RD cost increased due to misclassification, which is

$$W_i = \begin{cases} \dfrac{C_i(s) - C_i(n)}{C_i(n)}, & \text{when the CU is actually encoded as one CU}, \\[2mm] \dfrac{C_i(n) - C_i(s)}{C_i(s)}, & \text{otherwise}, \end{cases} \tag{12}$$

where $C_i(s)$ and $C_i(n)$ are the RD cost of splitting the CU into four sub-CUs and the RD cost of not splitting the CU, respectively. A CU with a small difference in RD cost is assigned a small weight, while a CU with a large difference in RD cost is assigned a large weight. Note that the weights are needed only in the training procedure; they are no longer needed when the trained model is used to predict the class label in the encoding process.
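The weight definition can be made concrete with a short sketch. It assumes that "actually encoded as one CU" means the non-split RD cost is not larger than the split RD cost; the function name `label_and_weight`, the tie handling, and the cost values are illustrative assumptions.

```python
from typing import Tuple


def label_and_weight(cost_split: float, cost_no_split: float) -> Tuple[int, float]:
    """Return (class label, training weight) for one CU sample.

    The weight is the relative RD cost increase that a misclassification
    of this CU would cause.
    """
    if cost_no_split <= cost_split:
        # The encoder keeps one CU (label +1); wrongly splitting would cost more.
        return 1, (cost_split - cost_no_split) / cost_no_split
    # The encoder splits into four sub-CUs (label -1).
    return -1, (cost_no_split - cost_split) / cost_split


# Near-tie: a misclassification barely matters, so the weight is small.
print(label_and_weight(cost_split=1050.0, cost_no_split=1000.0))
# Clear win for splitting: a misclassification is expensive, so the weight is large.
print(label_and_weight(cost_split=800.0, cost_no_split=1000.0))
```

Both branches yield non-negative weights, so samples whose two RD costs nearly coincide contribute almost nothing to the training objective.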

Then the standard SVM optimization problem in Equation (5) becomes

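$$J(w) = \frac{1}{2} w^T w + C \sum_{i=1}^{l} W_i \xi_i, \tag{13}$$

and the Lagrange multipliers are obtained from

$$\alpha^* = \arg\max_{\alpha} \left\{ \sum_{i=1}^{l} \alpha_i - \frac{1}{2} \sum_{i=1}^{l} \sum_{j=1}^{l} \alpha_i \alpha_j y_i y_j K(x_i, x_j) \right\} \tag{14}$$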

subject to

$$\sum_{i=1}^{l} \alpha_i y_i = 0, \quad 0 \le \alpha_i \le C W_i, \quad i = 1, 2, \ldots, l. \tag{15}$$

The upper bound of each $\alpha_i$ is thus the sample-dependent value $C W_i$ instead of the constant value $C$. Consequently, CUs with a larger RD cost difference between coding as one CU and coding as four sub-CUs affect the optimal hyperplane more through their larger weights $W_i$.
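The sample-dependent bound can be checked with a short derivation. The following sketch assumes that the weighted problem keeps the constraints of Equation (6) and that all weights are non-negative; the symbol $\Gamma_W$ for the weighted Lagrange function is introduced here for illustration only.

```latex
\begin{align*}
\Gamma_W(w, b, \alpha, \xi, \beta) &= \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{l} W_i \xi_i
  - \sum_{i=1}^{l} \alpha_i \left[ y_i\left(w^T\phi(x_i) + b\right) - 1 + \xi_i \right]
  - \sum_{i=1}^{l} \beta_i \xi_i, \\
\frac{\partial \Gamma_W}{\partial w} = 0 &\;\Rightarrow\; w = \sum_{i=1}^{l} \alpha_i y_i \phi(x_i), \\
\frac{\partial \Gamma_W}{\partial b} = 0 &\;\Rightarrow\; \sum_{i=1}^{l} \alpha_i y_i = 0, \\
\frac{\partial \Gamma_W}{\partial \xi_i} = 0 &\;\Rightarrow\; \alpha_i + \beta_i = C W_i .
\end{align*}
```

Since $\alpha_i \ge 0$ and $\beta_i \ge 0$, the last condition gives $0 \le \alpha_i \le C W_i$. Substituting the stationarity conditions back cancels every term in $\xi_i$, so the dual objective keeps the same form as in the unweighted case and only the box constraint changes.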

3.3.2. Feature selection

We introduce several representative features related to CU splitting. Selecting effective and relevant features is crucial for classification. Good features help reduce training time as well as utilization time, defy the curse of dimensionality to improve prediction performance, and reduce storage requirements [26]. To select the features that are useful for building a good SVM predictor, there are usually two types of feature selection approaches: filter and wrapper approaches. In this article, we suggest using a wrapper method based on F-score [27]. Filter methods based on correlation or mutual information ranking [21] are easy to implement; however, selecting the most relevant variables is usually suboptimal for building a predictor, particularly if the variables are redundant. A wrapper method assesses a subset of features according to its usefulness to a given predictor, which is better suited to this scenario. However, the number of subsets grows exponentially with the number of features, so an exhaustive search is not practical. Therefore, we propose to rank all features first by F-score and then perform a greedy search based on the ranking. F-score, as defined in Equation (16), is a simple metric that measures the discrimination of two sets of real numbers.

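$$F(i) = \frac{\left(\bar{x}_i^{(+)} - \bar{x}_i\right)^2 + \left(\bar{x}_i^{(-)} - \bar{x}_i\right)^2}{\dfrac{1}{n_+ - 1}\sum_{k=1}^{n_+}\left(x_{k,i}^{(+)} - \bar{x}_i^{(+)}\right)^2 + \dfrac{1}{n_- - 1}\sum_{k=1}^{n_-}\left(x_{k,i}^{(-)} - \bar{x}_i^{(-)}\right)^2}, \tag{16}$$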

where $\bar{x}_i$, $\bar{x}_i^{(+)}$, and $\bar{x}_i^{(-)}$ are the averages of the $i$th feature of the input vector $x$ over the whole, positive, and negative training samples, respectively. $x_{k,i}^{(+)}$ is the $i$th feature of the $k$th positive sample and $x_{k,i}^{(-)}$ is the $i$th feature of the $k$th negative sample. $n_+$ and $n_-$ are the total numbers of positive and negative training samples. The larger the F-score, the more discriminative the feature is likely to be. F-score is easy to calculate and is easily coupled with the SVM training process. The procedure of the wrapper approach is summarized in the following four steps:

  (1) Collect training samples by running the HEVC reference software HM6.0.

  (2) Calculate the F-score of every feature in the training set and sort the features in descending order of F-score.

  (3) Start from a subset $F$ that contains only the feature with the highest F-score.

    (a) Randomly divide the training set into $S_{tr}$ and $S_{cv}$.

    (b) Train an SVM model using $S_{tr}$.

    (c) Predict $S_{cv}$ and obtain the cross validation (CV) result (based on accuracy rate).

    (d) Add the remaining feature with the highest F-score to subset $F$ and repeat the steps in (3) until all features are evaluated, or terminate the process early once a predefined maximum number of features is reached.

  (4) Find the optimal feature subset with the lowest validation error.
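The F-score ranking and the greedy search of the steps above can be sketched as follows. The sketch assumes labels of +1 and −1 and at least two samples with non-zero variance per class; `cv_accuracy` is a hypothetical callback standing in for steps (a) to (c), i.e., training an SVM on one part of the data and measuring accuracy on the rest.

```python
from statistics import mean
from typing import Callable, List, Sequence, Tuple


def f_score(pos: Sequence[float], neg: Sequence[float]) -> float:
    """F-score of one feature from its values in positive and negative samples."""
    overall, pos_mean, neg_mean = mean(list(pos) + list(neg)), mean(pos), mean(neg)
    between = (pos_mean - overall) ** 2 + (neg_mean - overall) ** 2
    within = (
        sum((v - pos_mean) ** 2 for v in pos) / (len(pos) - 1)
        + sum((v - neg_mean) ** 2 for v in neg) / (len(neg) - 1)
    )
    return between / within


def rank_features(X: List[List[float]], y: List[int]) -> Tuple[List[int], List[float]]:
    """Return feature indices sorted by descending F-score, plus the scores."""
    scores = []
    for i in range(len(X[0])):
        pos = [row[i] for row, label in zip(X, y) if label == 1]
        neg = [row[i] for row, label in zip(X, y) if label == -1]
        scores.append(f_score(pos, neg))
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order, scores


def greedy_wrapper(
    order: List[int],
    cv_accuracy: Callable[[List[int]], float],  # hypothetical train-and-validate step
    max_features: int,
) -> Tuple[List[int], float]:
    """Grow the subset in F-score order and keep the subset with the best CV accuracy."""
    best_subset: List[int] = []
    best_accuracy = -1.0
    subset: List[int] = []
    for feature in order[:max_features]:
        subset.append(feature)
        accuracy = cv_accuracy(list(subset))
        if accuracy > best_accuracy:
            best_subset, best_accuracy = list(subset), accuracy
    return best_subset, best_accuracy


# Toy data: feature 0 separates the classes, feature 1 does not.
X = [[1.0, 5.0], [1.2, 3.0], [0.9, 4.0], [3.0, 4.5], [3.1, 3.5], [2.9, 4.0]]
y = [1, 1, 1, -1, -1, -1]
order, scores = rank_features(X, y)
print(order)  # [0, 1]
```

Because candidates are added strictly in F-score order, the search trains at most one model per feature instead of one per subset.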

To set up a rich feature set, diverse features are introduced and evaluated. Furthermore, the dependency on video content can be reduced by considering as many features as possible and then optimizing the feature subset. The features we consider as potential candidates are summarized as follows.

  • Prediction error-related features, such as SATD and CBF, denoted as $x_{std}$, $x_{vrs}$, and $x_{cbf}$. $x_{std}$ is defined as the SATD between the prediction and the original pixel values, and $x_{vrs}$ is the variance of the four SATDs of the sub-blocks. $x_{cbf}$ is the coded block flag (CBF) of the inter 2N × 2N mode. CBF indicates the complexity of the prediction error under a specific quantization parameter (QP). As discussed in [1115], these features are correlated with CU partitioning.

  • CU depth information of the context [8], denoted as $x_{sl}$, $x_{sa}$, and $x_{tp}$. $x_{sl}$ and $x_{sa}$ are the CU depths of the left-neighboring and above-neighboring CUs, respectively. $x_{tp}$ is the CU depth of the co-located CU. Since there is substantial correlation in the spatial and temporal domains of a video signal, such context provides very good information.

  • Gradient magnitude of the current CU [18], denoted as $x_{gm}$. It is the sum of the gradients of all pixels in the current CU obtained by applying the Sobel operator, which reveals the flatness of the CU.

  • Motion consistency-related feature [13, 14], denoted as $x_{mc}$, which is defined as the variance of the MVs of the four sub-blocks in inter N × N mode. Regions with inconsistent motion activities are more likely to be encoded in small CUs.

  • RD cost difference between skip and inter 2N × 2N mode, denoted as $x_{drc}$. If the skip mode is better than inter 2N × 2N, the CU is likely to be background and it may not be necessary to partition the CU into smaller ones. On the contrary, if inter 2N × 2N mode is better, it may be better to apply a smaller partition mode or a smaller CU size.

  • Side information in RD cost, denoted as $x_{si}$. Small motion partitions provide good RD performance for blocks with high motion activity or rich content. However, more bits must be spent to signal the side information. Therefore, the percentage of side information in the total RD cost of inter 2N × 2N mode gives a good indication of the optimal CU size.

  • Hierarchical structure-related feature, denoted as $x_{hrc}$. For the hierarchical prediction structure in HEVC, a small CU size is preferred for frames with low temporal depth, and a large CU size is more likely to be optimal for frames with high temporal depth.
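As an illustration of one of the candidate features, the gradient magnitude of a CU can be sketched as below. This is an assumption-laden example: it approximates the per-pixel gradient magnitude by |Gx| + |Gy| with 3 × 3 Sobel kernels and skips the border pixels, neither of which is specified in the text, and the function name `gradient_magnitude` is illustrative.

```python
from typing import List


def gradient_magnitude(block: List[List[int]]) -> int:
    """Sum of Sobel gradient magnitudes (|Gx| + |Gy|) over the interior pixels."""
    height, width = len(block), len(block[0])
    total = 0
    for r in range(1, height - 1):
        for c in range(1, width - 1):
            # Horizontal gradient: right column minus left column.
            gx = (block[r - 1][c + 1] + 2 * block[r][c + 1] + block[r + 1][c + 1]) - (
                block[r - 1][c - 1] + 2 * block[r][c - 1] + block[r + 1][c - 1]
            )
            # Vertical gradient: bottom row minus top row.
            gy = (block[r + 1][c - 1] + 2 * block[r + 1][c] + block[r + 1][c + 1]) - (
                block[r - 1][c - 1] + 2 * block[r - 1][c] + block[r - 1][c + 1]
            )
            total += abs(gx) + abs(gy)
    return total


flat = [[128] * 4 for _ in range(4)]          # homogeneous block
edge = [[0, 0, 10, 10] for _ in range(4)]     # block with a vertical edge
print(gradient_magnitude(flat))  # 0
print(gradient_magnitude(edge))  # 160
```

A flat block yields zero and a block containing an edge yields a large value, which matches the observation that homogeneous regions tend to be coded with large CUs.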

All the above-mentioned candidate features are evaluated and an effective feature subset is formed by the proposed wrapper approach based on F-score. The experimental results on feature selection are presented in Tables 1 and 2. Although some of the features are correlated, the wrapper method can select the features useful to the predictor regardless of correlation, as discussed in [26]. The video sequences used in feature selection are “Cactus”, “BQMall”, and “FourPeople”, and the training samples are collected by running HM6.0 [28] under common test conditions. Table 1 presents the F-scores of the features at different CU depths. The CBF information $x_{cbf}$ and the side information in RD cost $x_{si}$ exhibit relatively high F-scores and give good information about CU splitting. In contrast, the F-score of $x_{hrc}$ is rather low, and this feature is therefore excluded from the input vector in the feature selection. Table 2 presents the feature subsets in the selection procedure and the corresponding CV. The CV is nearly the same when the number of features is greater than five. However, extracting more features takes more time, and the SVM predictor becomes more complex as the number of features rises. Setting the number of features to five, as shown in Table 2, is a good balance between accuracy and the additional complexity introduced by feature extraction and SVM prediction. The optimized feature subsets are {$x_{cbf}$, $x_{si}$, $x_{tp}$, $x_{drc}$, $x_{std}$}, {$x_{cbf}$, $x_{si}$, $x_{tp}$, $x_{drc}$, $x_{std}$}, and {$x_{cbf}$, $x_{si}$, $x_{tp}$, $x_{gm}$, $x_{std}$} for CU depth zero (CU 64 × 64), one (CU 32 × 32), and two (CU 16 × 16), respectively. Since the optimal feature subsets differ between CU depths, the proposed CU splitting early termination models are trained separately for each CU depth. The overhead introduced by feature extraction is almost negligible, since most of the features can be derived while calculating the RD cost of the SKIP and inter 2N × 2N modes.

Table 1 F -score of features in different CU depth
Table 2 CV of different feature subsets

4. Experimental results

4.1. Experimental results on the proposed CU splitting early termination algorithm

To verify the efficiency of the proposed CU splitting early termination algorithm, we conduct comprehensive experiments by comparing the proposed algorithm with HEVC reference software HM6.0. The encoding configuration exactly follows what is recommended in [29] and the test sequences in the experiments cover a variety of content. The sequences we use to train the SVM predictor model are “Cactus”, “BQMall”, and “FourPeople”, denoted as TS1 (training set 1) and they are not used in performance comparison anymore. The offline training process is carried out by the SVM training software [30] and the proposed CU early termination algorithm is incorporated into HEVC reference software HM6.0.

To evaluate the performance of the proposed algorithm, two metrics are used in Tables  3 and 4: the average BD-rate (BDBR) [31] difference between the proposed algorithm and HM6.0, and the time reduction ratio which is defined as

$$\Delta T = \frac{T_{HM} - T_p}{T_{HM}} \times 100\%, \tag{17}$$

where $T_{HM}$ and $T_p$ are the total encoding times of the HM6.0 encoder and the proposed encoder, respectively. The actual encoding time is measured on a workstation with a 2.93-GHz processor and 8 GB of RAM. In Tables 3 and 4, we present the RD performance and the computational complexity of the proposed algorithm and the anchor under the “Random Access, main” and “Low Delay, main” configurations.

Table 3 Complexity and RD performance comparison in TS1 (average of 4 QP points)
Table 4 Complexity and RD performance comparison in TS1 (data per QP)

Regarding complexity, the proposed algorithm achieves a maximum running-time reduction of 73.7% with respect to HM6.0, with an average of 44.7%, under the “Random Access, main” configuration, as shown in Tables 3 and 4. In Table 3, the column “ΔT” is the average ΔT over 4 QP points. Concerning the RD performance, the algorithm loses 1.35% in terms of BD-rate on average, with a worst case of 1.8% for sequence “Traffic”. The RD loss is not significant. For the “Low Delay, main” configuration, also shown in Tables 3 and 4, the proposed algorithm behaves very similarly to the “Random Access, main” case and reduces the complexity by 41.9% with a 1.66% BD-rate loss on average. Table 4 lists part of the experimental results under different QPs. As can be seen, more complexity reduction is achieved in the low bitrate scenario (i.e., at high QP values). In such cases, larger CUs are more efficient in RD performance than smaller CUs and account for a high percentage of all CUs. The proposed algorithm accurately terminates the RDO procedure early at large CU sizes and avoids unnecessary RD calculations at small CU sizes. Therefore, greater complexity reduction can be achieved in the low bitrate case than in the high bitrate case.

To verify that the choice of training set does not noticeably affect the performance of the proposed algorithm, an additional experiment is conducted. Three different sequences (“ParkScene”, “BasketballDrill”, and “Johnny”, denoted as TS2) are used to train the offline model used in the encoding process. The encoding configurations are the same as in the previous experiments. The metrics used in Table 5 are the same as those in Table 3. As shown in Table 5, similar RD performance and complexity reduction are obtained with a different training set.

Table 5 Complexity and RD performance comparison in TS2 (average of 4 QP points)

Both the weighted SVM training algorithm and the wrapper feature selection algorithm have been designed to provide the ability to generalize. First, the weighted SVM is based on the SRM principle, as opposed to the traditional empirical risk minimization principle employed by conventional learning algorithms. SRM minimizes an upper bound on the expected risk, which gives the SVM a strong ability to generalize. Introducing the RD difference as weights reduces the influence of outliers. In other words, training samples with little RD performance degradation due to misclassification are “almost excluded” by being assigned small weights, and more attention is paid to “important” samples. Second, a large number of relevant features are evaluated and assessed. Diversity of features reduces the risk of dependence on the training set. The feature selection algorithm chooses the optimal feature subset based on CV error to ensure that the optimal subset does not depend on a specific training set. Therefore, the algorithm performs stably.

4.2. Additional overhead of SVM classification

SVM classification imposes additional computational complexity on the encoder. Experiments are conducted to quantify this overhead. Table 6 presents the total time spent predicting class labels in column “Total SVM” and the total time to encode the sequences with the proposed algorithm in column “Encode Time”. As shown in column “percentage”, the computational overhead is not critical (less than 5%), especially in the low-bitrate cases. Predicting the class labels of 16 × 16 CUs costs a little more time because there are more 16 × 16 CUs.

Table 6 Computational complexity overheads of SVM prediction

5. Conclusion

In this article, a CU splitting early termination algorithm is proposed. The CU splitting optimization in HEVC is formulated as a binary classification problem and is solved by support vector classification. In order to maintain the RD performance of the CU splitting early termination algorithm, the RD loss due to misclassification is introduced as the weighting factor of the training samples in the offline training procedure, so that the training method pays special attention to CUs that are prone to degrade RD performance when a suboptimal partition is used. Furthermore, diverse features are considered, such as the correlation between CUs in both the spatial and temporal domains, prediction errors, motion activities, and the RD cost of modes. To select the optimal feature subset, a wrapper feature selection approach is carried out. It embeds the model training into the selection process, and a simple greedy search is performed based on F-score ranking. In this way, the proposed algorithm performs well and stably across different configurations and various video contents. Since the CU splitting early termination model is trained offline and the optimal feature subset is small, the proposed algorithm is computationally simple. As demonstrated by the experimental results, the proposed algorithm achieves a 44.7% reduction in computational complexity with a 1.35% BD-rate increase in the “Random Access, main” configuration, and a 41.9% complexity reduction with a 1.66% BD-rate increase in the “Low Delay, main” configuration.
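
The wrapper selection summarized above can be sketched as follows. This is a minimal illustration, assuming the commonly used two-class F-score (between-class separation divided by the sum of the within-class sample variances) and a simple greedy rule that keeps a feature only if the validation error drops. The leave-one-out nearest-class-mean classifier is a hypothetical stand-in for the SVM cross-validation step, and the data are made up.

```python
from statistics import mean, variance


def f_score(pos_values, neg_values):
    """F-score of one feature; assumes non-zero within-class variance."""
    overall = mean(pos_values + neg_values)
    between = ((mean(pos_values) - overall) ** 2
               + (mean(neg_values) - overall) ** 2)
    within = variance(pos_values) + variance(neg_values)  # n - 1 denominators
    return between / within


def rank_features(samples, labels):
    """Return feature indices sorted by decreasing F-score."""
    scores = []
    for j in range(len(samples[0])):
        pos = [x[j] for x, y in zip(samples, labels) if y == 1]
        neg = [x[j] for x, y in zip(samples, labels) if y == -1]
        scores.append((f_score(pos, neg), j))
    return [j for _, j in sorted(scores, reverse=True)]


def loo_centroid_error(samples, labels, subset):
    """Leave-one-out error of a nearest-class-mean classifier (stand-in)."""
    errors = 0
    for i, (x, y) in enumerate(zip(samples, labels)):
        rest = [(s, l) for k, (s, l) in enumerate(zip(samples, labels)) if k != i]
        dist = {}
        for cls in (1, -1):
            members = [s for s, l in rest if l == cls]
            centroid = [mean(s[j] for s in members) for j in subset]
            dist[cls] = sum((x[j] - c) ** 2 for j, c in zip(subset, centroid))
        errors += (1 if dist[1] <= dist[-1] else -1) != y
    return errors / len(samples)


def greedy_select(ranked, cv_error):
    """Add features in rank order, keeping each only if the error drops."""
    chosen, best = [], float("inf")
    for j in ranked:
        err = cv_error(chosen + [j])
        if err < best:
            chosen, best = chosen + [j], err
    return chosen


# Hypothetical data: feature 0 separates the classes, feature 2 is noise.
samples = [[5.0, 2.0, 1.0], [6.0, 3.0, 3.0], [7.0, 2.5, 2.0],
           [1.0, 2.0, 2.0], [2.0, 1.0, 1.0], [3.0, 1.5, 3.0]]
labels = [1, 1, 1, -1, -1, -1]

ranked = rank_features(samples, labels)
selected = greedy_select(ranked, lambda s: loo_centroid_error(samples, labels, s))
print(ranked, selected)
```

Because the model is retrained for every candidate subset, the selection criterion reflects classifier error rather than the ranking score alone, which is the defining property of a wrapper approach.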


References

  1. ITU-T SG16 Q6 and ISO/IEC JTC1/SC29/WG11: Joint Call for Proposals on Video Compression Technology. ITU-T SG16 Q6 document VCEG-AM91 and ISO/IEC JTC1/SC29/WG11 document N11113, Kyoto, Japan; 2010.
  2. Bin L, Sullivan GJ, Jizheng X: Comparison of compression performance of HEVC working draft 5 with AVC high profile. ITU-T/ISO/IEC Joint Collaborative Team on Video Coding (JCT-VC) document JCTVC-H0360, in 8th Meeting of JCT-VC, San Jose, USA; 2012.
  3. Bross B, Han W-J, Sullivan GJ, Ohm J-R, Wiegand T: High efficiency video coding (HEVC) text specification draft 6. ITU-T/ISO/IEC Joint Collaborative Team on Video Coding (JCT-VC) document JCTVC-H1003, in 8th Meeting of JCT-VC, San Jose, USA; 2012.
  4. Kim J, Kim M, Kim H-Y, Sato K, Shen X, Yu L, Choi K, Jang ES, Bross B, Han W-J, Jo J-K, Park S-N, Sim DG, Oh S-J: JCTVC TE9: Report on large block structure testing. ITU-T/ISO/IEC Joint Collaborative Team on Video Coding (JCT-VC) document JCTVC-C067, in 3rd Meeting of JCT-VC, Guangzhou, China; 2010.
  5. Qualcomm Inc: Video Coding Using Extended Block Sizes, ITU-T Q.6/SG16 document COM16-C123-E. VCEG 36th Meeting, Geneva, Switzerland; 2009.
  6. Liang Z, Li Z, Siwei M, Debin Z: Fast mode decision algorithm for intra prediction in HEVC. 2011 IEEE Visual Communications and Image Processing (VCIP), Tainan; 2011:1-4.
  7. Su-Wei T, Hsueh-Ming H, Yi-Fu C: Fast mode decision algorithm for residual quad-tree coding in HEVC. 2011 IEEE Visual Communications and Image Processing (VCIP), Tainan; 2011:1-4.
  8. Jie L, Lei S, Ikenaga T, Sakaida S: Content based hierarchical fast coding unit decision algorithm for HEVC. 1st edition. 2011 International Conference on Multimedia and Signal Processing (CMSP), Guilin, Guangxi; 2011:56-59.
  9. Jongho K, Seyoon J, Sukhee C, Jin Soo C: Adaptive coding unit early termination algorithm for HEVC. 2012 IEEE International Conference on Consumer Electronics (ICCE), Las Vegas, NV; 2012:261-262.
  10. Correa G, Assuncao P, Agostini L, da Silva Cruz LA: Complexity control of high efficiency video encoders for power-constrained devices. IEEE Trans Consum Electron 2011, 57(4):1866-1874. 10.1109/TCE.2011.6131165
  11. Lee YM, Tsai YJ, Lin Y: Improved motion estimation using early zero-block detection. EURASIP J Image Video Process 2008, 2008: 524793. 10.1155/2008/524793
  12. Byung-Gyu K: Novel inter-mode decision algorithm based on macroblock (MB) tracking for the P-slice in H.264/AVC video coding. IEEE Trans Circuits Syst Video Technol 2008, 18(2):273-279. 10.1109/TCSVT.2008.918121
  13. Tien-Ying K, Chen-Hung C: Fast variable block size motion estimation for H.264 using likelihood and correlation of motion field. IEEE Trans Circuits Syst Video Technol 2006, 16(10):1185-1195. 10.1109/TCSVT.2006.883512
  14. Zhi L, Liquan S, Zhaoyang Z: An efficient inter mode decision algorithm based on motion homogeneity for H.264/AVC. IEEE Trans Circuits Syst Video Technol 2009, 19(1):128-132. 10.1109/TCSVT.2008.2005804
  15. Yu ACW, Martin GR, Heechan P: Fast inter-mode selection in the H.264/AVC standard using a hierarchical decision process. IEEE Trans Circuits Syst Video Technol 2008, 18(2):186-195. 10.1109/TCSVT.2007.913970
  16. Huanqiang Z, Canhui C, Kai-Kuang M: Fast mode decision for H.264/AVC based on macroblock motion activity. IEEE Trans Circuits Syst Video Technol 2009, 19(4):491-499. 10.1109/TCSVT.2009.2014014
  17. Tiesong Z, Hanli W, Kwong S, Kuo C-CJ: Fast mode decision based on mode adaptation. IEEE Trans Circuits Syst Video Technol 2010, 20(5):697-705. 10.1109/TCSVT.2010.2045812
  18. Changsung K, Kuo C-CJ: Feature-based intra-/inter coding mode selection for H.264/AVC. IEEE Trans Circuits Syst Video Technol 2007, 17(4):441-453. 10.1109/TCSVT.2006.888829
  19. Martinez-Enriquez D, Jimenez-Moreno A, Diaz-de-Maria F: An adaptive algorithm for fast inter mode decision in the H.264/AVC video coding standard. IEEE Trans Consum Electron 2010, 56(2):826-834. 10.1109/TCE.2010.5506008
  20. Jui-Chiu C, Wei-Chih C, Lien-Ming L, Kuo-Feng H, Wen-Nung L: A fast H.264/AVC-based stereo video encoding algorithm based on hierarchical two-stage neural classification. IEEE J Sel Topics Signal Process 2011, 5(2):309-320. 10.1109/JSTSP.2010.2066956
  21. Chen-Kuo C, Wei-Hau P, Chiuan H, Shin-Shan Z, Shang-Hong L: Fast H.264 encoding based on statistical learning. IEEE Trans Circuits Syst Video Technol 2011, 21(9):1304-1315. 10.1109/TCSVT.2011.2147250
  22. Jaeil K, Munchurl K, Sangjin H, In-joon C, Changsub P: Block-mode classification using SVMs for early termination of block mode decision in H.264|MPEG-4 part 10 AVC. Seventh International Conference on Advances in Pattern Recognition, ICAPR'09, Kolkata; 2009:83-86.
  23. Corinna C, Vapnik V: Support-vector networks. Mach Learn 1995, 20(3):273-297.
  24. Scholkopf B, Burges C, Smola A: Advances in Kernel Methods: Support Vector Learning. MIT Press, Cambridge, MA; 1999.
  25. Hsu CW, Chang CC, Lin CJ: A practical guide to support vector classification, Tech. rep. Department of Computer Science, National Taiwan University; 2003.
  26. Isabelle G, André E: An introduction to variable and feature selection. J Mach Learn Res 2003, 3: 1157-1182.
  27. Chen YW, Lin CJ: Combining SVMs with Various Feature Selection Strategies. Springer, New York; 2006.
  28. HM Software.
  29. Bossen F: Common test conditions and software reference configurations, ITU-T/ISO/IEC Joint Collaborative Team on Video Coding (JCT-VC) document JCTVC-H1100. 8th meeting of JCT-VC, San Jose, USA; 2012.
  30. Chih-Chung C, Chih-Jen L: LIBSVM: a library for support vector machines. ACM Trans Intell Syst Technol 2011, 2(27):1-27.
  31. Bjontegaard G: Improvements of the BD-PSNR model, ITU-T SG16/Q6 document VCEG-AI11. 35th VCEG Meeting, Berlin, Germany; 2008.



This work is supported by the National Basic Research Program of China (973) under Grant No. 2009CB320903 and Specialized Research Fund for the Doctoral Program of Higher Education (SRFDP) No. 20120101110032.

Author information



Corresponding author

Correspondence to Lu Yu.

Additional information

Competing interests

The authors declare that they have no competing interests.


Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 2.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.


About this article

Cite this article

Shen, X., Yu, L. CU splitting early termination based on weighted SVM. J Image Video Proc 2013, 4 (2013).



  • HEVC
  • fast coding unit decision
  • classification
  • SVM
  • feature selection.