CU splitting early termination based on weighted SVM

High efficiency video coding (HEVC) is the latest video coding standard developed by JCT-VC. It employs many efficient coding algorithms (e.g., highly flexible quad-tree coding block partitioning) and outperforms H.264/AVC by 35–43% in bitrate reduction. However, it imposes enormous computational complexity on the encoder because of the optimization performed in these coding tools, especially the rate distortion optimization over coding unit (CU), prediction unit, and transform unit. In this article, we propose a CU splitting early termination algorithm to reduce the heavy computational burden on the encoder. CU splitting is modeled as a binary classification problem, to which a support vector machine (SVM) is applied. To reduce the impact of outliers and to maintain the RD performance when a misclassification occurs, the RD loss due to misclassification is introduced as weights in SVM training. Efficient and representative features are extracted and optimized by a wrapper approach to eliminate dependency on video content as well as on encoding configurations. Experimental results show that the proposed algorithm achieves about 44.7% complexity reduction on average with only a 1.35% BD-rate increase under the "random access" configuration, and 41.9% time saving with a 1.66% BD-rate increase under the "low delay" setting, compared with the HEVC reference software.


Introduction
High definition (HD) and ultra-high definition (UHD) video content has become increasingly popular worldwide, so demand for video compression technologies that provide higher coding efficiency for HD/UHD video can be expected in the near future. In view of this, the high efficiency video coding (HEVC) standard is being developed by the Joint Collaborative Team on Video Coding [1], which was established by the ITU-T Video Coding Experts Group and the ISO/IEC Moving Picture Experts Group. HEVC outperforms H.264/AVC high profile by 35–43% in bitrate reduction at the same reconstructed video quality [2]. HEVC inherits the well-known block-based hybrid coding scheme [3] used by previous coding standards, e.g., H.264/AVC, and extends the framework by introducing highly flexible quad-tree coding block partitioning. This partitioning consists of the newly introduced concepts of coding unit (CU), prediction unit (PU), and transform unit (TU). The CU is the basic unit of region splitting used for inter/intra coding; it extends the traditional concept of the macroblock (MB) with a hierarchical structure in which the block size varies from 64 × 64 to 8 × 8 pixels. A CU can be recursively split into four smaller CUs of equal size. In this manner, a picture is represented by a content-adaptive coding tree structure composed of CU blocks of different sizes. The PU is the basic unit used for the prediction process and is rectangular in shape. One PU can be encoded with one of the modes in a candidate set, which is similar in spirit to the MB mode of H.264/AVC. The pixels in one PU share prediction information, e.g., modes, motion vectors (MV), and reference index. The TU is the basic unit for transform and quantization. It is defined in a similar way to the CU, and its size varies from 4 × 4 to 32 × 32.
As reported in [4,5], the flexible data structure representation (extending the MB size up to 64 × 64) introduced over 10% bitrate saving in comparison with the 16 × 16-based configuration in H.264/AVC, since the flexibility of block partitioning can effectively deal with the diversity of picture content.
However, the flexibility of block partitioning in HEVC imposes a significant computational burden on the encoder, which must search for the optimal combination of CU, PU, and TU sizes. Thus, reducing the complexity while maintaining the coding performance is crucial for practical implementation of the new standard. Research on accelerating the HEVC test model (HM) encoder is emerging. A fast intra mode decision algorithm was proposed in [6]; it uses the direction information of neighboring blocks to reduce the number of directions that take part in the rate distortion optimization (RDO) process. To reduce the computational complexity of TU size selection, a fast algorithm for residual quadtree mode decision was proposed in [7]. In addition, the depth-first decision process for TU size selection in HM was replaced by a merge-and-split decision process, which also reduces unnecessary computation by using the inheritance property of zero-blocks and early termination schemes for non-zero blocks.
In this article, we focus on CU size selection for HEVC. A content-based fast CU decision algorithm was developed for HEVC TMuC (test model under consideration) [8]. At the frame level, it analyzed the ratio of utilized CUs to the total number of CUs at each depth and skipped the rarely used depths; at the CU level, it used information from neighboring and co-located CUs to skip unnecessary depths. The algorithm investigated temporal and spatial correlations of CU depth and designed different thresholds to control the number of CU depths to be evaluated. However, the correlations were data dependent, and the ratio was affected by encoding configurations, such as the hierarchical depth in a hierarchical prediction structure. In [9], the spatial correlation of CU depth and the probability that neighboring CUs were coded in SKIP mode were used to design an adaptive weighting factor, which adjusted the threshold for early termination of the remaining RD calculations of the current CU. In [10], a complexity control method was proposed that limits the number of coding decision tests and comparisons according to temporal correlations. All these works exploited the spatial and/or temporal correlations of CU depth to eliminate specific CU depths with a trivial impact on RD performance. However, they were not robust enough because of the diversity of video content. More statistics need to be considered to obtain a more accurate and stable model for simplifying CU splitting.
In the field of accelerating H.264/AVC encoders and their extensions, various properties have been investigated and employed to simplify mode decision. In [11], a nearly sufficient condition for early zero-block detection was constructed from an analysis of the prediction error to speed up motion estimation in the H.264/AVC JM reference software. It indicated that the prediction error offers a valuable clue for encoder acceleration. Spatial and temporal correlations were exploited to predict the skip mode [12] and thus reduce encoder complexity. In [13,14], the distribution of MVs in an MB was chosen as a feature to predict the optimal mode rather than performing an exhaustive search over all modes. A hierarchical algorithm proposed in [15] categorized all types of modes into three levels, which were triggered by evaluating the SAD between the current MB and its co-located MB, the high-frequency energy in the DCT domain, and the RD cost of mode P-8 × 8. In [16], a fast mode decision algorithm named motion activity-based mode decision was proposed. It classified MBs into different classes using predefined thresholds and motion activity. Each class corresponded to a different number of modes to be checked. Tiesong et al. [17] projected encoding modes onto a 2D map, and an optimal 2D map was predicted using spatial and temporal information. Then, a priority-based mode candidate list was constructed from the optimal 2D map, and mode decision was performed starting with the most important mode in the candidate list, with early termination conditions. In this way, the number of modes to be evaluated was reduced and acceleration was achieved. Changsung and Kuo [18] presented a feature-based fast inter/intra mode decision algorithm. This algorithm computed three features related to spatial and temporal correlations, which were used to determine whether to use inter or intra mode. The feature space was partitioned into three regions, i.e., risk-free, risk-tolerable, and risk-intolerable regions, by checking the RD loss due to a wrong mode decision and the probability distribution of inter/intra modes. Depending on the region, mechanisms of different complexity were applied for the final mode decision. Martinez-Enriquze et al. [19] analyzed the conditional pdfs for every mode and estimated the RD cost to decide the optimal mode. A fast stereo video encoding algorithm based on a hierarchical two-stage neural network was proposed in [20]. Local properties of the input data and the prediction error were extracted as input features to train a neural network designed to predict the optimal partition mode. SVMs were also introduced in the study of fast mode decision [21,22]. However, MBs were treated equally in the classification problem, and the RD performance of an MB was ignored. In general, these works exploited various mode-related features to predict the optimal mode or to reduce the number of modes to be evaluated. The features included spatial and temporal correlations, the gradient or high-frequency energy, the RD cost of a specific mode, motion activity, and local properties, such as the prediction error or the SAD/sum of absolute transformed differences (SATD).
As shown in previous research, the CU size selection process based on RD optimization can be unacceptably time-consuming for practical implementation; this is further analyzed in Section 2. To solve this problem, we propose a method that uses machine learning to accelerate the CU size selection process. By properly modeling the problem and applying a machine learning algorithm, our method predicts the optimal CU splitting decision accurately instead of searching exhaustively over all possibilities. To derive a more accurate model, the RD difference is introduced as weights in the SVM training procedure to alleviate the RD performance degradation caused by misclassification. Furthermore, various features are extracted from the input video as well as from previously encoded data, and an optimal feature subset is derived by a wrapper feature selection algorithm.
The rest of the article is organized as follows. We briefly go through CU size selection process of HM, and present the motivation of the proposed algorithm in Section 2. In Section 3, we elaborate the modeling of the CU splitting problem and its solution based on a machine learning algorithm, i.e., SVM. Experimental results in Section 4 demonstrate the effectiveness of the proposed algorithm, and Section 5 concludes the article.

CU size optimization in HM
To adapt to the diversity of picture content, flexible quad-tree coding block partitioning is adopted in HEVC, which enables the use of CU, PU, and TU. The concept of the CU is analogous to the MB in previous standards, e.g., H.264/AVC. It is the basic unit for intra/inter coding and is always square in shape. Pictures are divided into largest CUs (LCUs), and each LCU can be split into four equal-sized CUs, which can be further split recursively up to the maximal allowable hierarchical depth. In this manner, the LCU is constructed as a quad-tree of CUs of different sizes, as shown in Figure 1. At a leaf node of the quad-tree, the CU can be encoded in SKIP, inter, or intra mode. The partitioning size of SKIP mode is 2N × 2N, which means that the PU size of SKIP mode equals the CU size; a CU encoded in inter mode can be treated as one PU or partitioned into several PUs, which is specified by partitioning mode: , and Part_nR × 2N; and a CU in intra mode can be treated as one PU of size 2N × 2N or partitioned into four N × N PUs. A simple example of the PUs in one CU is shown in Figure 1, as highlighted by the green square. The PU corresponding to each partition size is the basic unit that carries the prediction information. To match the boundaries of real objects in a picture, the shape of a PU is not restricted to being square, e.g., 2N × N is allowed. The TU is defined for the transform and quantization process. The shape of the TU depends on the PU. When the PU is square, the TU is also square and its size varies from 4 × 4 to 32 × 32 luma samples. When the PU is non-square, the TU is also non-square and takes a size of 32 × 8, 8 × 32, 16 × 4, or 4 × 16 luma samples. One CU may contain one or more PUs. Likewise, one CU may contain one or more TUs, which are arranged in a quadtree structure as shown in Figure 1.
As explained in the previous paragraph, one LCU can be coded as a rather complex quad-tree to adapt to various video contents. Furthermore, CUs at different depths may be coded with different prediction modes, partitioning modes, and transform sizes. To derive the optimal CU-level coding parameters, an exhaustive search is employed that evaluates the RD costs of all possible combinations of CU size, PU size, and TU size. The RDO of CU size is illustrated in Figure 2. It needs a total of 85 RD calculations (1 + 4 + 16 + 64, one per CU at each of the four depths) when the CU size varies from 64 × 64 to 8 × 8. Such RD-based optimization introduces significant complexity on the encoder. In fact, an exhaustive search over all possible CU sizes is unnecessary, since some CU sizes do not yield much rate distortion improvement, and the encoder can be accelerated by terminating the CU splitting decision process early. As shown in Figure 3, "flat" or "homogeneous" regions, e.g., the floor, are more likely to be encoded in large CUs. Areas containing moving objects or object boundaries, e.g., the net and the basketball, are usually split into small CUs. Motivated by this observation, we model the CU splitting decision as a binary classification problem.
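The count of 85 RD calculations follows directly from the quad-tree geometry. The short sketch below, which is only an illustration and not part of HM, sums the number of CUs at each depth of one LCU; the function name and its parameters are chosen here for the example.

```python
def count_cu_evaluations(lcu_size: int = 64, min_cu_size: int = 8) -> int:
    """Count the CUs visited by an exhaustive quad-tree search of one LCU."""
    total = 0
    size = lcu_size
    cus_at_depth = 1  # a single CU covers the LCU at depth 0
    while size >= min_cu_size:
        total += cus_at_depth
        cus_at_depth *= 4  # every CU splits into four equal sub-CUs
        size //= 2         # each sub-CU has half the width and height
    return total


print(count_cu_evaluations())  # 1 + 4 + 16 + 64 = 85
```

Every CU that is predicted as non-split removes its entire subtree from this count, which is where the complexity saving of early termination comes from.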

Problem formulation
As the flexible representation of coding data places a heavy burden on the encoder, we propose to terminate CU splitting early to avoid unnecessary trials. We model CU splitting as a binary classification problem (i.e., a CU that is not split into four sub-parts is assigned the label +1; otherwise −1 is assigned) and tackle the classification problem with SVM [23]. As a widely used machine learning algorithm, SVM is based on the idea of structural risk minimization (SRM) and has been applied successfully to a number of real-world problems, such as face recognition, text categorization, and object detection in machine vision. The main idea behind SVM is to derive a unique separating hyperplane that maximizes the margin between two classes. Given $l$ training data points

$$\{(x_i, y_i)\}_{i=1}^{l}, \qquad x_i \in \mathbb{R}^n, \quad y_i \in \{+1, -1\}$$

where $(x_i, y_i)$ is the $i$th training sample, i.e., the $i$th CU, $x_i$ is the input feature vector, and $y_i$ is the class label indicating whether the CU is split. The membership decision rule is based on the function defined in Equation (2), where $f(x)$ represents the discriminant function associated with the hyperplane:

$$f(x) = w^{T} \phi(x) + b \qquad (2)$$

where $\phi(\cdot)$ is a nonlinear operator that maps the input $x_i$ into a higher-dimensional space, and $K(x_i, x_j) = \phi(x_i)^{T} \phi(x_j)$ is the kernel function. The predicted class label is the sign of $f(x)$.
Mathematically, this hyperplane can be constructed by minimizing the following cost function with constraints:

$$\min_{w,\,b} \; \frac{1}{2} \|w\|^{2} \quad \text{subject to} \quad y_i \left( w^{T} \phi(x_i) + b \right) \ge 1, \quad i = 1, \ldots, l$$

For a non-separable case, the classification problem is generalized by introducing slack variables $\xi_i$ and a user-defined regularization parameter $C$. Then the classification problem is to minimize the following quantity

$$\frac{1}{2} \|w\|^{2} + C \sum_{i=1}^{l} \xi_i \qquad (5)$$

subject to

$$y_i \left( w^{T} \phi(x_i) + b \right) \ge 1 - \xi_i, \qquad \xi_i \ge 0, \quad i = 1, \ldots, l \qquad (6)$$

The modified cost function in Equation (5) is the so-called structural risk, which balances the empirical risk (i.e., the training errors reflected by the second term) with model complexity (the first term) [24]. It has been proven that the solution to the optimization problem of Equation (5) under the constraint of Equation (6) is given by the saddle point of the Lagrange function

$$L(w, b, \xi, \alpha, \beta) = \frac{1}{2} \|w\|^{2} + C \sum_{i=1}^{l} \xi_i - \sum_{i=1}^{l} \alpha_i \left[ y_i \left( w^{T} \phi(x_i) + b \right) - 1 + \xi_i \right] - \sum_{i=1}^{l} \beta_i \xi_i$$

where $\alpha_i$ and $\beta_i$ are Lagrange multipliers associated with the constraints in Equation (6).
The Lagrange multipliers are solved by maximizing

$$\sum_{i=1}^{l} \alpha_i - \frac{1}{2} \sum_{i=1}^{l} \sum_{j=1}^{l} \alpha_i \alpha_j y_i y_j K(x_i, x_j)$$

subject to

$$\sum_{i=1}^{l} \alpha_i y_i = 0, \qquad 0 \le \alpha_i \le C, \quad i = 1, \ldots, l$$

The decision function can equivalently be expressed as

$$f(x) = \sum_{i=1}^{l} \alpha_i y_i K(x_i, x) + b \qquad (10)$$

It is clear from Equation (10) that the $\alpha_i$ associated with training point $x_i$ expresses the strength with which that point is embedded in the final decision function. Notice that the nonlinear mapping $\phi(\cdot)$ never appears explicitly in the training or in the decision; only the kernel is evaluated. In general, the kernel takes a linear, polynomial, radial basis function (RBF), or sigmoid form. In this article, we use the RBF kernel, since it can handle the case in which the relation between class labels and the input vector is nonlinear as well as the linear case. Furthermore, the RBF kernel has lower model complexity than the polynomial kernel and fewer numerical difficulties [25].
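To make the kernel form of the decision function in Equation (10) concrete, the following minimal sketch evaluates it with an RBF kernel, taken here in its usual form exp(−γ‖u − v‖²). The support vectors, multipliers, bias, and γ are toy values invented for illustration; they are not taken from the trained CU splitting model, and the function names are hypothetical.

```python
import math
from typing import Sequence


def rbf_kernel(u: Sequence[float], v: Sequence[float], gamma: float) -> float:
    """RBF kernel K(u, v) = exp(-gamma * ||u - v||^2)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-gamma * squared_distance)


def decision_value(x, support_vectors, labels, alphas, bias, gamma) -> float:
    """Kernel expansion: sum_i alpha_i * y_i * K(x_i, x) + b."""
    return sum(
        alpha * y * rbf_kernel(sv, x, gamma)
        for sv, y, alpha in zip(support_vectors, labels, alphas)
    ) + bias


def predict_label(x, support_vectors, labels, alphas, bias, gamma) -> int:
    """+1 means 'do not split the CU', -1 means 'split' (label convention of the text)."""
    value = decision_value(x, support_vectors, labels, alphas, bias, gamma)
    return 1 if value >= 0 else -1


# Toy model with two support vectors (hypothetical values, not a trained model).
toy_svs = [[0.0, 0.0], [2.0, 2.0]]
toy_labels = [+1, -1]
toy_alphas = [1.0, 1.0]
print(predict_label([0.1, 0.1], toy_svs, toy_labels, toy_alphas, 0.0, 0.5))
print(predict_label([1.9, 1.9], toy_svs, toy_labels, toy_alphas, 0.0, 0.5))
```

Only kernel evaluations against the support vectors are needed at prediction time, so the cost per CU grows with the number of support vectors and the feature dimension.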

Proposed CU splitting early termination algorithm
The proposed CU splitting early termination algorithm is shown in Figure 4. At each CU depth, the encoder first performs the rate and distortion calculation for SKIP mode and for inter mode with Part_2N × 2N (denoted as inter 2N × 2N mode hereafter), and extracts the required features, i.e., the input vector x of the SVM, during this evaluation. Then, an offline-trained SVM CU splitting model is loaded, which predicts the class label of the current CU from the extracted features. Based on the predicted class label, the encoder decides whether to perform RD trials on CU splitting. The offline-trained SVM model is obtained by an SVM training procedure that weights the training samples; the weights are defined as the difference in RD cost caused by misclassification. As long as the CU splitting predictor is accurate, early termination of the RD trials on CU splitting removes much of the computational complexity while maintaining RD performance.
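The control flow described above can be sketched as a recursive traversal of one LCU. This is a simplified illustration rather than the HM implementation: `visit_cu` and `predicts_non_split` are hypothetical names, the predictor callback stands in for feature extraction plus SVM prediction, and the sketch only records which CUs are evaluated instead of computing RD costs.

```python
from typing import Callable, List, Tuple

CU = Tuple[int, int, int]  # (x, y, size) in luma samples


def visit_cu(
    x: int,
    y: int,
    size: int,
    min_size: int,
    predicts_non_split: Callable[[CU], bool],
    visited: List[CU],
) -> None:
    """Traverse the CU quad-tree, stopping wherever the predictor says 'non-split'."""
    # Stands in for the SKIP / inter 2Nx2N evaluation and feature extraction.
    visited.append((x, y, size))
    if size <= min_size:
        return  # smallest CU: no further split is possible
    if predicts_non_split((x, y, size)):
        return  # early termination: skip RD trials on the four sub-CUs
    half = size // 2
    for dy in (0, half):
        for dx in (0, half):
            visit_cu(x + dx, y + dy, half, min_size, predicts_non_split, visited)


# Example: every 32x32 CU except the top-left one is predicted as non-split.
evaluated: List[CU] = []
visit_cu(0, 0, 64, 8, lambda cu: cu[2] == 32 and (cu[0], cu[1]) != (0, 0), evaluated)
print(len(evaluated))  # 1 + 4 + 4 + 16 = 25 CUs instead of 85
```

With a predictor that never terminates, the traversal visits all 85 CUs, which corresponds to the exhaustive search of the unmodified encoder.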

Off-line training and weights generation
In the field of machine learning, accuracy is one of the most important measurements for classification algorithms. However, in this scenario, not only the ratio of correct classification, but also the loss of RD performance introduced by misclassifications is important.
For some CUs, the RD costs of coding four sub-CUs and of coding one CU are almost the same. Misclassifying such CUs results in negligible RD degradation. In contrast, for CUs where coding four sub-CUs greatly outperforms coding one CU, misclassification leads to considerable RD loss. Different CUs are therefore of different importance. It is improper to treat samples with different RD performance equally in the training process, because the optimal hyperplane will be skewed by the "unimportant" samples, i.e., these samples act as outliers. The desired SVM predictor should predict the class label as accurately as possible and keep the RD loss as low as possible. Based on this observation, we suggest introducing weights into the SVM training process, i.e., assigning different weights to training samples.
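Each training sample $(x_i, y_i)$ is associated with a weight $W_i$, where the weights are defined as the percentage of RD cost increase due to misclassification, which is

$$W_i = \frac{C_i(n) - C_i(s)}{C_i(s)}$$

when the CU is actually split into four sub-CUs, and

$$W_i = \frac{C_i(s) - C_i(n)}{C_i(n)}$$

when the CU is actually encoded in one CU, where $C_i(s)$ and $C_i(n)$ are the RD cost of splitting the CU into four sub-CUs and the RD cost of the non-split CU, respectively. A CU with a small RD cost difference is assigned a small weight, while a CU with a large RD cost difference is assigned a large weight. Note that the weights are needed only in the training procedure, not when the trained model is used to predict the class label during encoding. With these weights, the standard SVM optimization problem in Equation (5) becomes

$$\min_{w,\,b,\,\xi} \; \frac{1}{2} \|w\|^{2} + C \sum_{i=1}^{l} W_i \xi_i$$

and the solution of the problem is obtained by maximizing

$$\sum_{i=1}^{l} \alpha_i - \frac{1}{2} \sum_{i=1}^{l} \sum_{j=1}^{l} \alpha_i \alpha_j y_i y_j K(x_i, x_j)$$

subject to

$$\sum_{i=1}^{l} \alpha_i y_i = 0, \qquad 0 \le \alpha_i \le C W_i, \quad i = 1, \ldots, l$$

The upper bounds of $\alpha_i$ are the dynamic boundaries $C W_i$ instead of a constant value $C$. CUs with a larger RD cost difference between being encoded as one CU and as four sub-CUs therefore affect the optimal hyperplane more, through a larger weight $W_i$.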
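The weight of a training sample can be computed from the two RD costs that the exhaustive encoder already produces. The sketch below reads "percentage of RD cost increase due to misclassification" as the relative cost increase of the wrong decision over the better one, and assumes the ground-truth label is given by the lower RD cost; both are interpretations, and the function names are hypothetical.

```python
from typing import List, Tuple


def label_and_weight(cost_split: float, cost_non_split: float) -> Tuple[int, float]:
    """Return (label, weight) for one training CU.

    label  +1: the non-split CU has the lower (or equal) RD cost
    label  -1: splitting into four sub-CUs has the lower RD cost
    weight   : relative RD cost increase if the CU were misclassified
    """
    if cost_non_split <= cost_split:
        return +1, (cost_split - cost_non_split) / cost_non_split
    return -1, (cost_non_split - cost_split) / cost_split


def alpha_upper_bounds(weights: List[float], c: float) -> List[float]:
    """Per-sample upper bounds C * W_i on the Lagrange multipliers."""
    return [c * w for w in weights]


# Hypothetical RD costs: a near-tie gets a tiny weight, a clear win a large one.
print(label_and_weight(cost_split=100.0, cost_non_split=101.0))
print(label_and_weight(cost_split=150.0, cost_non_split=100.0))
```

Some SVM packages accept such per-sample weights directly; for instance, scikit-learn's `SVC.fit` takes a `sample_weight` argument that rescales C per sample. Whether the training software used in this work exposes the weights in the same way is not stated here.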

Feature selection
We introduce several representative features related to CU splitting. Selecting effective and relevant features is crucial for classification. Good features help reduce training time as well as utilization time, defy the curse of dimensionality to improve prediction performance, and reduce storage requirements [26]. There are usually two types of feature selection approaches for building a good SVM predictor: filter approaches and wrapper approaches. In this article, we suggest using a wrapper method based on F-score [27]. Filter methods based on correlation or mutual information ranking [21] are easy to implement; however, selecting the most relevant variables is usually suboptimal for building a predictor, particularly if the variables are redundant. A wrapper method assesses a subset of features according to its usefulness to a given predictor, which is better suited to this scenario. However, the number of subsets becomes extremely large as the number of features increases, so an exhaustive search is not practical. Therefore, we propose to rank all features first by F-score and to perform a greedy search based on the ranked results. The F-score, as defined in Equation (16), is a simple metric that measures the discrimination of two sets of real numbers.

$$F(i) = \frac{\left( \bar{x}_i^{+} - \bar{x}_i \right)^{2} + \left( \bar{x}_i^{-} - \bar{x}_i \right)^{2}}{\dfrac{1}{n_{+} - 1} \sum_{k=1}^{n_{+}} \left( x_{k,i}^{+} - \bar{x}_i^{+} \right)^{2} + \dfrac{1}{n_{-} - 1} \sum_{k=1}^{n_{-}} \left( x_{k,i}^{-} - \bar{x}_i^{-} \right)^{2}} \qquad (16)$$

where $\bar{x}_i$, $\bar{x}_i^{+}$, and $\bar{x}_i^{-}$ are the averages of the $i$th feature of the input vector $x$ over the whole, positive, and negative training samples, respectively. $x_{k,i}^{+}$ is the $i$th feature of the $k$th positive sample and $x_{k,i}^{-}$ is the $i$th feature of the $k$th negative sample. $n_{+}$ and $n_{-}$ are the total numbers of positive and negative training samples. The larger the F-score, the more discriminative the feature is likely to be. The F-score is easy to calculate and couples well with the SVM training process. The procedure of the wrapper approach is summarized in the following four steps: (1) Collect training samples by running the HEVC reference software HM6.0. (2) Calculate the F-score of every feature in the training set and sort the features in descending order of F-score.
To set up a rich feature set, diverse features are introduced and evaluated. Furthermore, the dependency on video content can be eliminated by considering as many features as possible and then optimizing the feature subset. The features we consider as potential candidates are summarized as follows.
Prediction error-related features, such as SATD and CBF, denoted as x_std, x_vrs, and x_cbf. x_std is defined as the SATD between the prediction and the original pixel values, and x_vrs is the variance of the four SATDs of the sub-blocks. x_cbf is the coded block flag (CBF) of the inter 2N × 2N mode. CBF indicates the complexity of the prediction error under a specific quantization parameter (QP). As discussed in [11–15], these features are correlated with CU partitioning.
CU depth information of the context [8], denoted as x_sl, x_sa, and x_tp. x_sl and x_sa are the CU depths of the left-neighboring and above-neighboring CUs, respectively. x_tp is the CU depth of the co-located CU. Since the video signal is substantially correlated in the spatial and temporal domains, this context is highly informative.
Gradient magnitude of the current CU [18], denoted as x_gm. It is the sum of the gradients of all pixels in the current CU, obtained by applying the Sobel operator, and reveals the flatness of the CU.
Motion consistency-related feature [13,14], denoted as x_mc, which is defined as the variance of the MVs of the four sub-blocks in inter N × N mode. Regions with inconsistent motion activities are more likely to be encoded in small CUs.
RD cost difference between skip and inter 2N × 2N mode, denoted as x_drc. If the skip mode is better than inter 2N × 2N, the CU is likely to be background and it may not be necessary to partition the CU into smaller ones. Conversely, if inter 2N × 2N mode is better, a smaller partition mode or a smaller CU size may be preferable.
Side information in RD cost, denoted as x_si. Small motion partitions provide good RD performance for blocks with high motion activity or rich content. However, more bits must be spent to signal the side information. Therefore, the percentage of side information in the total RD cost of inter 2N × 2N mode gives a good indication of the optimal CU size.
Hierarchical structure-related feature, denoted as x_hrc. For the hierarchical prediction structure in HEVC, a small CU size is preferred for frames with low temporal depth, and a large CU size is more likely to be optimal for frames with high temporal depth.
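As an example of how one of these candidates might be computed, the sketch below derives a gradient feature in the spirit of x_gm from a block of luma samples with the 3 × 3 Sobel operator. Two details are assumptions made for the illustration and are not specified above: the per-pixel gradient is approximated as |Gx| + |Gy|, and border pixels are skipped so that no samples outside the CU are needed.

```python
from typing import List


def sobel_gradient_sum(block: List[List[int]]) -> int:
    """Sum of |Gx| + |Gy| over the interior pixels of a 2-D luma block."""
    height, width = len(block), len(block[0])
    total = 0
    for r in range(1, height - 1):
        for c in range(1, width - 1):
            # Horizontal Sobel response: right column minus left column.
            gx = (block[r - 1][c + 1] + 2 * block[r][c + 1] + block[r + 1][c + 1]
                  - block[r - 1][c - 1] - 2 * block[r][c - 1] - block[r + 1][c - 1])
            # Vertical Sobel response: bottom row minus top row.
            gy = (block[r + 1][c - 1] + 2 * block[r + 1][c] + block[r + 1][c + 1]
                  - block[r - 1][c - 1] - 2 * block[r - 1][c] - block[r - 1][c + 1])
            total += abs(gx) + abs(gy)
    return total


flat_block = [[128] * 4 for _ in range(4)]
edge_block = [[0, 0, 10, 10] for _ in range(4)]  # vertical edge in the middle
print(sobel_gradient_sum(flat_block))  # 0: a flat block has no gradient
print(sobel_gradient_sum(edge_block))  # 160: four interior pixels, each with |Gx| = 40
```

A flat block yields zero while a block containing an edge yields a large value, which matches the role of the feature as a flatness indicator.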
All the above-mentioned candidate features are evaluated, and an effective feature subset is formed by the proposed wrapper approach based on F-score. The experimental results on feature selection are presented below. Although some of the features are correlated, the wrapper method can select the features that are useful to the predictor regardless of correlation, as discussed in [26]. The video sequences used in feature selection are "Cactus", "BQMall", and "FourPeople", and the training samples are collected by running HM6.0 [28] under common test conditions. Table 1 presents the F-scores of the features at different CU depths. The CBF information x_cbf and the side information in RD cost x_si exhibit relatively high F-scores and are informative about CU splitting. In contrast, the F-score of x_hrc is rather low, so it is excluded from the input vector in the feature selection. Table 2 presents the feature subsets in the selection procedure and the corresponding cross-validation (CV) results. The CV result is nearly the same when the number of features is greater than five. However, extracting the features takes more time and the SVM predictor becomes more complex as the number of features rises. Setting the number of features to five, as shown in Table 2, is a good balance between accuracy and the additional complexity introduced by feature extraction and SVM prediction. The optimized feature subsets are [x_cbf, x_si, x_tp, x_drc, x_std], [x_cbf, x_si, x_tp, x_drc, x_std], and [x_cbf, x_si, x_tp, x_gm, x_std] for CU depth zero (CU 64 × 64), one (CU 32 × 32), and two (CU 16 × 16), respectively. Since the optimal feature subsets differ across CU depths, the proposed CU splitting early termination models are trained separately for each CU depth. The overhead introduced by feature extraction is almost negligible, since most features can be derived while calculating the RD cost of the SKIP and inter 2N × 2N modes.
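The F-score ranking that underlies Table 1 can be sketched in a few lines. The code below follows the F-score definition of [27] referenced as Equation (16); the sample values and the feature names `feat_a` and `feat_b` are toy placeholders, not measurements from the training sequences.

```python
from statistics import mean
from typing import Dict, List, Sequence, Tuple


def f_score(pos: Sequence[float], neg: Sequence[float]) -> float:
    """F-score of one feature given its values on positive and negative samples.

    Requires at least two samples per class and non-zero within-class variance.
    """
    all_mean = mean(list(pos) + list(neg))
    pos_mean, neg_mean = mean(pos), mean(neg)
    between = (pos_mean - all_mean) ** 2 + (neg_mean - all_mean) ** 2
    within = (sum((v - pos_mean) ** 2 for v in pos) / (len(pos) - 1)
              + sum((v - neg_mean) ** 2 for v in neg) / (len(neg) - 1))
    return between / within


def rank_features(
    positive: List[List[float]],
    negative: List[List[float]],
    names: List[str],
) -> Tuple[List[str], Dict[str, float]]:
    """Score every feature column and sort the names by descending F-score."""
    scores = {
        name: f_score([s[i] for s in positive], [s[i] for s in negative])
        for i, name in enumerate(names)
    }
    return sorted(names, key=lambda n: scores[n], reverse=True), scores


# Toy data: feat_a separates the classes, feat_b does not.
positive_samples = [[1.0, 4.0], [2.0, 5.0], [3.0, 6.0]]
negative_samples = [[7.0, 4.0], [8.0, 5.0], [9.0, 6.0]]
ranked, scores = rank_features(positive_samples, negative_samples, ["feat_a", "feat_b"])
print(ranked, scores)
```

The ranked list is the input to the greedy search: candidate subsets are formed from the best-ranked features, and each subset is assessed by training an SVM and checking its cross-validation result.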

Experimental results on the proposed CU splitting early termination algorithm
To verify the efficiency of the proposed CU splitting early termination algorithm, we conduct comprehensive experiments comparing the proposed algorithm with the HEVC reference software HM6.0. The encoding configuration follows the recommendations in [29] exactly, and the test sequences cover a variety of content. The sequences used to train the SVM predictor model are "Cactus", "BQMall", and "FourPeople", denoted as TS1 (training set 1); they are excluded from the performance comparison. The offline training is carried out with the SVM training software [30], and the proposed CU early termination algorithm is incorporated into HM6.0.
To evaluate the performance of the proposed algorithm, two metrics are used in Tables 3 and 4: the average BD-rate (BDBR) [31] difference between the proposed algorithm and HM6.0, and the time reduction ratio, which is defined as

$$\Delta T = \frac{T_{HM} - T_{p}}{T_{HM}} \times 100\%$$

where $T_{HM}$ and $T_{p}$ are the total encoding times of the HM6.0 encoder and the proposed encoder, respectively. The actual encoding time is measured on a workstation with a 2.93-GHz processor and 8 GB of RAM. In Tables 3 and 4, we present the RD performance and the computational complexity of the proposed algorithm and the anchor under the "Random Access, main" and "Low Delay, main" configurations. Regarding complexity, the proposed algorithm achieves a maximum running-time reduction of 73.7% with respect to HM6.0, with an average of 44.7%, under the "Random Access, main" configuration, as shown in Tables 3 and 4. In Table 3, the column "ΔT" is the average ΔT over 4 QP points. Concerning the RD performance, the algorithm loses 1.35% in terms of BD-rate on average, with a worst case of 1.8% for the sequence "Traffic". The RD loss is not significant. For the "Low Delay, main" configuration, as shown in Tables 3 and 4, the proposed algorithm behaves very similarly to the "Random Access, main" case and reduces the complexity by 41.9% with a 1.66% BD-rate loss on average. Table 4 lists part of the experimental results under different QPs. As can be seen, more complexity reduction is achieved in the low bitrate scenario (i.e., with high QP values). In such cases, larger CUs are more efficient in RD performance than smaller CUs, and large CUs account for a high percentage. The proposed algorithm accurately terminates the RDO procedure early at large CU sizes and avoids unnecessary RD calculations at small CU sizes. Therefore, greater complexity reduction is achieved in the low bitrate case than in the high bitrate case.
To verify that a different training set does not affect the performance of the proposed algorithm, an additional experiment is conducted. Three different sequences ("ParkScene", "BasketballDrill", and "Johnny", denoted as TS2) are used to train the offline model used in the encoding process. The encoding configurations are the same as in the previous experiments. The metrics used in Table 5 are the same as those in Table 3. As shown in Table 5, similar RD performance and complexity reduction are obtained with a different training set.
Both the weighted SVM training algorithm and the wrapper feature selection algorithm are designed to generalize. First, the weighted SVM is based on the SRM principle, as opposed to the traditional empirical risk minimization principle employed by conventional learning algorithms. SRM minimizes an upper bound on the expected risk, which gives the SVM a strong ability to generalize. Introducing the RD difference as weights reduces the influence of outliers. In other words, training samples with little RD performance degradation due to misclassification are "almost excluded" by being assigned small weights, and more attention is paid to "important" samples. Second, a large number of relevant features are evaluated and assessed. The diversity of features lowers the chance of dependence on the training set. The feature selection algorithm chooses the optimal feature subset based on CV error to ensure that the subset does not depend on a specific training set. Therefore, the algorithm performs stably.
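The time reduction metric is simple to compute once the encoding times are logged. The sketch below takes ΔT as the relative encoding time saved with respect to HM6.0, expressed in percent, and averages it over several QP points as done for the "ΔT" column of Table 3. The timings in the example are invented placeholders, not measurements from the experiments, and the function names are hypothetical.

```python
from typing import List, Tuple


def time_reduction_percent(t_hm: float, t_proposed: float) -> float:
    """Relative encoding time saved with respect to the HM anchor, in percent."""
    return (t_hm - t_proposed) / t_hm * 100.0


def average_time_reduction(timings: List[Tuple[float, float]]) -> float:
    """Average the per-QP reductions; each entry is (T_HM, T_p) in seconds."""
    reductions = [time_reduction_percent(t_hm, t_p) for t_hm, t_p in timings]
    return sum(reductions) / len(reductions)


# Hypothetical (T_HM, T_p) pairs for four QP points, lowest QP first.
example_timings = [(1000.0, 600.0), (800.0, 440.0), (600.0, 300.0), (400.0, 180.0)]
print(round(average_time_reduction(example_timings), 2))
```

Note that averaging the per-QP percentages is not the same as computing one ratio from the summed encoding times; the sketch follows the per-QP averaging described for Table 3.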

Additional overhead of SVM classification
SVM classification imposes additional computational complexity on the encoder. Some experiments are conducted to investigate this overhead. Table 6 presents the total time spent predicting class labels in the column "Total SVM" and the total time spent encoding the sequences with the proposed algorithm in the column "Encode Time". As shown in the column "percentage", the computational overhead is not critical, especially in the low bitrate cases, at less than 5%. Predicting the class labels of 16 × 16 CUs costs a little more time, as there are more 16 × 16 CUs.

Conclusion
In this article, a CU splitting early termination algorithm is proposed. The CU splitting optimization in HEVC is formulated as a binary classification problem and is solved by support vector classification. To maintain the RD performance of the CU splitting early termination algorithm, the RD loss due to misclassification is introduced as a weighting factor for training samples in the offline training procedure, so that the training method pays special attention to CUs that are prone to degrade RD performance when a suboptimal partition is used. Furthermore, diverse features are considered, such as the correlation between CUs in both the spatial and temporal domains, prediction errors, motion activities, and the RD cost of modes. To select the optimal feature subset, a wrapper feature selection approach is carried out. It embeds the model training into the selection process, and a simple greedy search is performed based on F-score ranking. In this way, the proposed algorithm performs well and stably across different configurations and various video contents. Since the CU splitting early termination model is trained offline and the optimal feature subset is small, the proposed algorithm is computationally simple. As demonstrated by the experimental results, the proposed algorithm achieves a 44.7% reduction in computational complexity with a 1.35% BD-rate increase in the "Random Access, main" configuration and a 41.9% complexity reduction with a 1.66% BD-rate increase in the "Low Delay, main" configuration.