Fast CU decision method based on texture characteristics and decision tree for depth map intra-coding
EURASIP Journal on Image and Video Processing volume 2024, Article number: 34 (2024)
Abstract
As the demand for higher quality 3D video continues to grow, the 3D extension of the High Efficiency Video Coding standard (3D-HEVC) is gradually becoming unable to meet users' needs. Versatile Video Coding (H.266/VVC), the latest video coding standard, adopts a quadtree with nested multi-type tree (QTMT) partitioning structure to which existing fast coding unit (CU) partitioning methods for depth maps cannot be applied. We therefore design a fast intra-frame CU partitioning algorithm for VVC 3D video depth maps. The proposed algorithm consists of two sub-algorithms applied in sequence. First, by analyzing the relationship between image entropy, variance, and depth map CU partitioning, we establish a bi-criterion decision algorithm that determines whether the texture complexity of the current CU is low enough to terminate its partitioning early. Then, for CUs that the first algorithm determines need further partitioning, we use a decision tree model based on the Light Gradient Boosting Machine (LGBM) to predict the partition direction whose Rate-Distortion Optimization (RDO) calculation can be skipped, avoiding unnecessary RDO calculations in that direction. Experiments show that the proposed algorithm reduces the complexity of VVC 3D video intra-coding by 47.65% with a negligible 0.23% increase in Bjøntegaard Delta Bitrate (BDBR), outperforming other advanced methods.
1 Introduction
Video data plays an increasingly important role in modern life, work, and learning, and the demand for video quality continues to rise, particularly in terms of clarity and frame rate. After decades of progress in video coding technology, obtaining high-quality 2D video is no longer difficult. To enhance the viewing experience, however, viewers also want a sense of realism, and traditional 2D video cannot provide an immersive visual experience. Because 3D video offers multiple viewpoints [1], it can naturally bring viewers into a multidimensional, intuitive, and dynamic real scene, and its immersive visual effects are therefore becoming increasingly popular with the general public. 3D video is applied in many fields, such as medicine, education, the military, and entertainment, and has shown significant advantages in each. To meet the growing needs of users and researchers in these fields, the multi-view video plus depth (MVD) format [2] has gained popularity. This technology encodes and transmits texture and depth maps from multiple viewpoints [3]. Multi-view video, as the name suggests, is captured by multiple cameras aligned in a row and shooting the same scene at a certain distance. Encoding multiple videos inevitably results in greater data volume and increased encoding complexity. Therefore, to further promote the development of the 3D video industry and ensure an immersive visual experience, more and more researchers are dedicating themselves to the study of 3D video.
In order to meet the demand for high-definition video, increase the compression efficiency of 3D video and develop a new generation of video encoding technology, the Joint Collaborative Team on 3D Video Coding Extension (JCT-3V) [4] was established. In 2015, JCT-3V jointly released 3D-HEVC video encoding standard in the third version of HEVC. 3D-HEVC adopts the MVD format with multiple viewpoints and depth, as well as a series of advanced 3D video encoding technologies [5]. Among them, MVD collects texture maps and corresponding depth map sequences at the same location and time under limited viewpoints. After a series of encoding, Depth Image Based Rendering (DIBR) technology is used to generate the required virtual viewpoints around a few collected viewpoints, which not only enhances the spatial viewing experience, but greatly reduces the number of encoded viewpoints. In 2020, the latest video coding standard VVC [6] emerged and gained widespread attention from researchers in related fields. In terms of encoding efficiency, VVC outperforms the previous generation coding standard HEVC, by reducing the video bit rate by half. To increase compression efficiency, VVC applies a highly refined QTMT partitioning structure [7] to better fit the texture information of the image. Compared to the single encoding unit partitioning method used in HEVC, namely quad-tree (QT) block partitioning structure, the QTMT partitioning mode is more efficient and flexible. For example, in the Multi-Type Tree (MTT) partition structure, a 32 × 32 CU can be divided horizontally or vertically into two rectangular blocks of the same size, known as a Binary Tree (BT) split, which includes a binary tree partition in the horizontal direction (BTH) and a binary tree partition in the vertical direction (BTV). 
Alternatively, it can be divided into three rectangular blocks in a ratio of 1:2:1 from left to right or from top to bottom, which means that the Ternary Tree (TT) split includes the horizontal ternary tree partition (TTH) and the vertical ternary tree partition (TTV). At the same time, the sub-blocks obtained from BT or TT partitioning are allowed to continue using BT or TT partitioning, but QT partitioning cannot be used anymore. Figure 1 shows a schematic diagram of QTMT partitioning. In addition, VVC has made optimizations in various aspects such as intra-prediction [8], inter-prediction, transformation, and quantization through the use of new encoding tools.
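The split geometry described above can be illustrated with a short sketch. It assumes, as a labeling convention, that a "horizontal" split uses a horizontal dividing line (sub-blocks stacked top to bottom) and a "vertical" split uses a vertical dividing line (sub-blocks side by side); the function name `split_sizes` is hypothetical and not part of any encoder API.

```python
def split_sizes(width, height, mode):
    """Return the (width, height) of each sub-block produced by one split mode."""
    if mode == "QT":   # quadtree: four equal quadrants
        return [(width // 2, height // 2)] * 4
    if mode == "BTH":  # horizontal binary split: top and bottom halves
        return [(width, height // 2)] * 2
    if mode == "BTV":  # vertical binary split: left and right halves
        return [(width // 2, height)] * 2
    if mode == "TTH":  # horizontal ternary split, 1:2:1 from top to bottom
        return [(width, height // 4), (width, height // 2), (width, height // 4)]
    if mode == "TTV":  # vertical ternary split, 1:2:1 from left to right
        return [(width // 4, height), (width // 2, height), (width // 4, height)]
    raise ValueError(f"unknown split mode: {mode}")


# Sub-block sizes for the 32 x 32 CU used as the example in the text.
for mode in ("QT", "BTH", "BTV", "TTH", "TTV"):
    print(mode, split_sizes(32, 32, mode))
```

Every mode covers the parent CU exactly, so the sub-block areas always sum to the area of the original block.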
In contrast to 2D video, 3D video requires the encoding of texture maps from three viewpoints together with their corresponding depth maps. Figure 2 shows a texture map and its corresponding depth map. Texture maps mainly describe the texture details of real objects and backgrounds in the video, while depth maps mainly reflect the distance between the camera and the object [9], so the corresponding texture map can be mapped into three dimensions through the depth map. This mapping is a key part of virtual viewpoint synthesis. Because depth maps represent the distance of object positions, they contain many flat areas and abrupt edges. The grayscale values of flat areas fluctuate within a small range [10]; these areas are usually the interior of an object or the background of the scene. Abrupt edges are often object outlines, where pixel values change significantly and the area appears sharp [11]. As a result, the encoding characteristics of depth maps differ significantly from those of color texture maps, and traditional fast algorithms for 2D video encoding cannot effectively preserve the encoding efficiency and quality of depth maps. Although many fast algorithms have been proposed in recent years to reduce the complexity of 3D-HEVC encoding [12], they are not suitable for VVC because of its new QTMT partitioning structure. Developing a fast algorithm for depth maps under the new standard would therefore contribute significantly to the application and spread of higher quality 3D video.
Homogeneous areas prevail in depth maps [13], and CUs in these areas are more likely to select the no-split mode as the final partitioning mode. It follows that if CUs in flat areas can be accurately identified early as not requiring division, the unnecessary calculations for the other partitioning modes can be avoided. The main contributions of this paper are as follows: (1) By analyzing the relationship between image entropy, variance, and depth map texture complexity, we establish a bi-criterion decision algorithm to determine whether the texture complexity of the current CU is low enough to terminate its partitioning process in advance. (2) We propose a Decision Tree (DT) model algorithm dedicated to skipping unlikely partition directions, which predicts the partitioning direction of the CUs that the bi-criterion decision algorithm determines need to be divided. Skipping the horizontal or vertical division of these CUs removes some of the RDO calculation steps and thus reduces complexity.
The remainder of this paper is organized as follows. Section 2 describes related work on depth map encoding, such as CU partitioning algorithms for VVC 2D videos and fast CU partitioning algorithms for 3D-HEVC. Section 3 presents the two sub-algorithms proposed in this paper. Section 4 elaborates on the experimental results. Finally, Sect. 5 summarizes this article.
2 Related works
2.1 Fast video coding methods for H.266/VVC 2D video coding
In recent years, many studies have sought to reduce the encoding complexity caused by the QTMT partitioning structure. Zhao et al. [14] utilized the spatial–temporal correlation of adjacent encoding units in VVC intra-frame encoding to first construct a depth prediction model based on Deep Feature Fusion (D-DFF) to predict the optimal partitioning depth of a CU, and then constructed a partition mode prediction model based on Probability Estimation (P-PBE) to select several optimal partitioning modes. This algorithm can skip some unnecessary partitioning modes, thus reducing coding time. Lin et al. [15] found that, in addition to the information of the current CU and adjacent CUs, the characteristics of sub-CUs can also be used to speed up the VVC intra-frame partitioning process. Specifically, for each partitioning mode of a CU, the authors compute the Structural Similarity Index Metric Variation (SSIMV), whose magnitude reflects the suitability of that partitioning mode. This method filters out a specific number of partition modes and leaves the rest to the original RDO algorithm for the final decision. Shu et al. [16] first use a detection algorithm to divide the video into multiple scenes. Then, three partition-related features are extracted from the first frame of each scene to train five online Support Vector Machine (SVM) models. These SVM models act as binary classifiers that decide whether to skip the QT, BTV, BTH, TTV, or TTH partitioning of a CU in the remaining frames of the scene. It should be pointed out that the algorithm applies only to CUs of size 32 × 32. Abdallah et al. [17] constructed a hierarchical convolutional neural network with early exit, and designed three bi-thresholds for CU sizes of 64 × 64, 32 × 32, and 16 × 16, aiming to reduce coding time while maintaining coding quality. This method judges in advance, by threshold comparison, whether these CUs will be split by QT, which accelerates the partitioning process. Saldanha et al. [18] selected important features to train five LGBM binary classifiers, one for each of the five partitioning modes of coding units in VVC intra-frame encoding: BTH, BTV, TTH, TTV, and QT. Each LGBM classifier determines whether to skip the corresponding partitioning mode to reduce the complexity of the encoder. The algorithm can also be flexibly adjusted to trade reduced coding complexity against loss of coding quality. Tissier et al. [19] proposed a two-stage method combining a convolutional neural network (CNN) and DT to reduce the complexity of QTMT-based coding unit partitioning in VVC intra-coding. First, the CNN is used to predict the spatial characteristics of CUs of size 64 × 64, and the model output is a probability vector. Then, at each partition depth, the probability vector and the quantization parameter (QP) are input into the DT LGBM model, which outputs the probabilities of all possible partition modes. The N partition modes with the highest probability are selected. Finally, the original encoder algorithm determines which of the N modes is the most suitable. It should be pointed out that sixteen decision trees need to be trained, one for each of the 16 CU sizes that can be further divided in VVC. In addition, N can be adjusted to trade reduced coding complexity against loss of coding quality.
2.2 Fast methods for 3D-HEVC depth map coding
Depth maps play a vital role in 3D videos. In the past decade, many research works have focused on depth map intra-frame CU partitioning in 3D-HEVC and achieved significant results. Reference [20] proposed a method based on static decision trees to replace the original RDO process. Because I-frames are encoded differently from P- and B-frames, the authors designed several DT models for each CU size, i.e., 16 × 16, 32 × 32, and 64 × 64, applying data mining and machine learning techniques to determine whether the encoding unit is further divided. In reference [21], Liu et al. proposed a combination of traditional methods and a deep learning algorithm to reduce the complexity of 3D-HEVC depth map encoding. First, a database of encoding unit partitions was established. Then a deep edge classification network was constructed to classify each coding tree unit (CTU) into one of two categories: simple edge CTU or complicated edge CTU. Finally, a traditional method is used to correct the classification results of the CNN, effectively reducing coding time while ensuring the quality of synthesized viewpoints. In reference [22], Fu et al. introduced corner points as a feature to speed up the encoding process. This fast algorithm has three parts: a Quadtree depth Limited Strategy to accelerate the CU partitioning process, a Prediction Unit (PU) Decision (PUD) to speed up the PU partitioning process, and a Fast Intra-Mode Decision (FIMD) to speed up the mode decision for depth maps. By using a flexible corner selection technique, the algorithm adapts better to different quantization parameters and text content, thus balancing encoding complexity against the quality of synthesized viewpoints. To speed up the coding process of 3D-HEVC depth maps, reference [23] proposed a handcrafted method based on the texture attributes of depth maps to reduce the complexity of CU partitioning. Specifically, this method calculates the total sum of squares and the RD cost and compares them with a preset threshold to infer whether the division of coding units can be terminated early. Hamout and Elyousfi [24] adopt the automatic merging probabilistic clustering method (AM-PCM), exploiting the characteristics of encoding units in the flat regions of the depth map to skip the selection of traditional texture map prediction modes and Depth Modeling Modes (DMM). This approach is based on tensor features and statistical analysis to speed up the mode decision process in depth maps. Lin et al. [25] exploited the human visual perception system to accelerate depth map coding in 3D-HEVC. This method first makes three preparations. Initially, two thresholds are used to divide the depth map into three categories. Next, an edge discriminator determines which category the edge direction of the coding unit belongs to. Finally, the just noticeable depth difference is used to obtain the edge information that humans can perceive. On this basis, a handcrafted method was designed to accelerate the intra-mode decision and the early termination of CU partitioning.
3 Proposed method
Depth map coding for VVC 3D video still uses the QTMT structure. For CU partitioning, this means that if the original algorithm, namely the RDO search, is used at each partition depth, the encoder performs redundant and complex calculations. Although the original CU partitioning algorithm achieves very high encoding quality, its computational complexity is hard to accommodate in practice. It is therefore necessary to study an algorithm that significantly reduces coding time while minimizing the loss in coding quality. In this section, by analyzing the relationship between the partitioning results of CUs and texture complexity, we first propose a bi-criterion decision algorithm to infer whether the current CU should continue to be split. Then, for CUs that need to be further partitioned, we treat the determination of the partition direction as a binary classification problem and use a DT to determine in which direction the CU needs to be partitioned, thereby reducing encoding time. Our proposed method also takes prediction accuracy into account, thus achieving a good balance between encoding quality and encoding efficiency.
3.1 Bi-criterion decision algorithm based on texture characteristics
From the perspective of video encoding, a depth map is a grayscale image in which each grayscale value represents the distance between the camera and the object at the corresponding position in the scene. Grayscale values range from 0 to 255. The closer the object is to the camera, the closer the grayscale value is to 255 and the closer the color is to white. In flat areas, the color of the depth map tends to be consistent, which means that its grayscale values are distributed within a small range. In depth maps, the magnitude of the overall grayscale variation indicates texture complexity, which is closely related to the division of encoding units. It is generally believed that CUs in complex areas are finely divided to fit the texture of the image, while CUs in simple areas are coarsely divided. Figure 3 shows the partitioning result of a depth map from the Shark video sequence, which illustrates this phenomenon.
To a certain extent, the information entropy of an image reflects the complexity of the depth map [26]. The larger the image entropy, the more complex the encoding unit is and the more likely it is to be further divided. Conversely, when the image entropy is small, the probability of the CU being further divided is lower. In this algorithm, we use a threshold th1 as the boundary for deciding whether the texture of the current CU is complicated. If the entropy of the CU is greater than th1, the CU tends to continue partitioning. Otherwise, the partitioning process of the current CU is terminated in advance. The information entropy is represented by the following equation [26]:
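The displayed equation is not reproduced in this text. Assuming the standard Shannon form, which agrees with the definitions of \(j\) and \(P_{j}\) that follow, it can be written as below. The symbol \(E\) and the base-2 logarithm are assumptions made here (the base only rescales the value compared against th1), and terms with \(P_{j} = 0\) are taken as zero.

\[
E = - \sum_{j = 0}^{255} P_{j} \log_{2} P_{j}
\]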
where \(j\) represents a certain grayscale value within the CU area, ranging from 0 to 255, and \(P_{j}\) represents the probability of this grayscale value appearing within the CU area.
Variance measures the degree of dispersion of a set of data. For depth maps, a larger variance implies a higher texture complexity of the current encoding unit [27] and thus a higher probability of being further divided, and vice versa. We can therefore use variance to speed up the depth map coding process. Similarly, we set a threshold th2 as the boundary for deciding whether the CU has a complicated texture. If the variance of the CU is greater than th2, the current CU tends to continue partitioning. Otherwise, the partitioning process of the current CU is terminated in advance. The variance is represented by the following equation:
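The displayed equation is not reproduced in this text. Assuming the usual population variance over all pixels of the CU, which agrees with the symbols \(W\), \(H\), \(p\left( {i,j} \right)\), and \(p_{mean}\) defined next, it can be written as below. The symbol \(Var\), the index ranges, and the normalization by \(W \times H\) are assumptions made here.

\[
Var = \frac{1}{W \times H} \sum_{i = 1}^{W} \sum_{j = 1}^{H} \left( p\left( {i,j} \right) - p_{mean} \right)^{2}
\]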
where W and H represent the width and height of the CU, respectively, \(p\left( {i,j} \right)\) denotes the gray value at position \(\left( {i,j} \right)\), and \(p_{mean}\) represents the average of all grayscale values within the CU.
Many previous methods used only a single criterion to classify CUs, but a single criterion sometimes cannot distinguish CUs well, which leads to many erroneous judgments and ultimately to a decrease in coding quality. For example, the first criterion we introduce, image entropy, cannot distinguish CUs well when pixel values in the CU region change slowly. In this case, however, the second criterion, variance, can still correctly judge the CU as having a simple texture, compensating for the shortcoming of the first criterion. Through experimental analysis, we obtained the average performance of the algorithm using a single criterion and using both criteria. With image entropy alone, the algorithm saved 36.34% of encoding time, but BDBR increased by 0.45%. With variance alone, the algorithm saved 37.51% of encoding time, but BDBR increased by 0.56%. With both criteria, the algorithm saved 36.15% of encoding time and BDBR increased by only 0.19%, a clearly better trade-off. To overcome the shortcomings of a single criterion and increase the accuracy of the algorithm, we therefore propose a handcrafted bi-criterion decision algorithm to decide whether to continue dividing a CU. The algorithm flowchart is shown in Figure 4, and the detailed procedure of the bi-criterion decision method is as follows:
Step 1: Determine whether current CU can continue to be partitioned according to the VVC Test Model (VTM) rules. If so, perform the next step; otherwise, proceed to step 4.
Step 2: Calculate the entropy value of current CU and determine if it is greater than th1. If it is greater than th1, perform the next step; otherwise, proceed to step 4.
Step 3: Calculate the variance of current CU and determine if it is greater than th2. If it is greater than th2, exit this algorithm and proceed to the second algorithm proposed in this article. Otherwise, proceed to step 4.
Step 4: The CU is judged unlikely to be split. Terminate its partitioning process and move on to the next CU.
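The four steps above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function names are hypothetical, the entropy uses a base-2 logarithm and the variance is normalized by the pixel count (both assumptions), and the default thresholds are the values 0.6 and 8 that the paper selects for th1 and th2.

```python
import math
from collections import Counter

TH1 = 0.6  # entropy threshold th1 chosen in the paper
TH2 = 8.0  # variance threshold th2 chosen in the paper


def entropy(block):
    """Shannon entropy (base 2, an assumption) of the gray values in a 2D block."""
    pixels = [p for row in block for p in row]
    n = len(pixels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(pixels).values())


def variance(block):
    """Population variance of the gray values in a 2D block."""
    pixels = [p for row in block for p in row]
    mean = sum(pixels) / len(pixels)
    return sum((p - mean) ** 2 for p in pixels) / len(pixels)


def should_terminate(block, can_split=True, th1=TH1, th2=TH2):
    """Return True if partitioning of this CU stops early (Step 4)."""
    if not can_split:            # Step 1: encoder rules forbid further splitting
        return True
    if entropy(block) <= th1:    # Step 2: entropy not greater than th1
        return True
    if variance(block) <= th2:   # Step 3: variance not greater than th2
        return True
    return False                 # complex texture: hand over to the second algorithm


flat = [[128] * 8 for _ in range(8)]                                  # constant depth
ramp = [[100, 100, 101, 101, 102, 102, 103, 103] for _ in range(8)]  # slow change
edge = [[50] * 4 + [200] * 4 for _ in range(8)]                       # sharp outline

for name, cu in (("flat", flat), ("ramp", ramp), ("edge", edge)):
    print(name, should_terminate(cu))
```

The `ramp` block corresponds to the slowly changing case discussed earlier: its entropy exceeds th1, but its variance stays below th2, so the second criterion still stops the partitioning.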
The preset values of the two thresholds th1 and th2 in our proposed bi-criterion algorithm are crucial. A good threshold must keep coding quality close to that of the original encoder algorithm while also removing a significant amount of computational complexity. To obtain reliable and accurate thresholds, we chose three representative video sequences from the official 3D video test sequences, namely "Balloons", "Shark", and "Poznan_street". The selected sequences cover multiple contents and resolutions to enhance the generalization ability of the algorithm. Following the order of the algorithm, we first obtain the value of th1 and then the value of th2, which makes full use of the advantages of the bi-criterion decision algorithm. Figure 5 shows the trade-off between maintaining encoding quality and reducing encoding time for different th1 values. The abscissa represents the value of th1, and the ordinates of the two graphs represent encoding quality loss and encoding time savings, respectively. BDBR in the figure denotes the loss of coding quality relative to the original algorithm, while time savings (TS) denotes the reduction in encoding time relative to the original algorithm. Similarly, Fig. 6 shows the same trade-off for different th2 values.
In this paper, the thresholds th1 and th2 are chosen according to whether they achieve a good balance between reducing coding complexity and maintaining coding quality. Based on this principle, we ultimately chose 0.6 as the value of th1 and 8 as the value of th2. When a CU meets the threshold condition, its texture complexity is relatively low, and the complex partition mode decision step can be skipped to reduce the overall video encoding time. In addition, because the thresholds need to be applied to the eight official video sequences, the three sequences we selected cover two resolutions (1024 × 768 and 1920 × 1088) and two frame rates (25 and 30), which together span the characteristics of all eight videos. These three sequences are therefore representative of the eight official sequences, and the thresholds obtained from them support the generalization ability of our proposed bi-criterion decision algorithm.
Of course, it should be pointed out that the thresholds might need to be adjusted based on further tests.
3.2 Decision trees method for predicting partition direction
A large number of algorithms based on Machine Learning (ML) and Deep Learning (DL) have been applied in various scientific research fields in the past decade [28], including video coding. In encoding unit partitioning, they usually act as classifiers that allow the encoder to exclude certain partitioning modes in advance and thereby reduce encoding complexity. Many machine learning methods, such as decision trees and random forests, as well as deep learning algorithms such as CNN, Long Short-Term Memory (LSTM), and reinforcement learning (RL), have been applied to research on reducing coding complexity. Although these algorithms perform well, they also have drawbacks. Deep learning-based methods such as CNN usually need a lot of training data and consume substantial computing resources, making them difficult to apply in practice. Machine learning algorithms such as SVM and DT often face challenges such as long processing times and weak generalization ability. Light Gradient Boosting Machine (LGBM) is an open-source gradient boosting framework developed by Microsoft. Compared with other machine learning algorithms, LGBM offers lower memory use, high accuracy, and shorter inference time. These advantages prompted us to propose an algorithm based on the DT LGBM model to accelerate encoding unit partitioning in VVC 3D videos. Previous fast algorithms have mostly targeted HEVC, its 3D extension, or VVC for 2D videos. To our knowledge, our proposed algorithm based on the DT LGBM model is the first to be applied to accelerate intra-frame encoding in VVC 3D videos.
The MTT partitioning structure newly introduced in VVC includes two directions, horizontal and vertical, and each direction has two partitioning modes, binary tree partitioning and ternary tree partitioning. This greatly increases the computational complexity of RDO during coding: if the original RDO search is used, the encoder requires a significant amount of computation at each partition depth, which is hard to accommodate in practice despite the very high encoding quality it achieves. Through extensive research, it has been found that the texture direction of a CU is often related to its partitioning direction. For example, when the texture direction of a CU is horizontal and partitioning is required, horizontal partitioning is often performed; when the texture direction is vertical, vertical partitioning is often performed. Accordingly, we propose using the DT LGBM model as a binary classifier to determine the partitioning direction of CUs. The algorithm flowchart is shown in Fig. 7. Not all CU sizes allow both horizontal and vertical partitions: for example, CUs of size 64 × 64 have only two modes, QT partitioning and non-partitioning, and CUs of size 4 × 8 have only two modes, horizontal binary tree partitioning and non-partitioning. To ensure the effectiveness of the algorithm, we restrict the set of CU sizes to 32 × 32, 32 × 16, 16 × 32, 16 × 16, 16 × 8, and 8 × 16. To alleviate the local optimization problem of the model and enhance the generalization ability of the algorithm, we set a risk interval [29]. When the predicted probability falls within the risk interval, the prediction is treated as unreliable, and the original encoder algorithm decides the CU partition mode. The risk interval lies between two thresholds th3 and th4, which are chosen, in keeping with the aim of this paper, to reach a good balance between coding efficiency and coding quality. Since these two thresholds are set in the same way as the thresholds in the first algorithm, the procedure is not repeated here. In the end, we used 0.38 as the value of th3 and 0.62 as the value of th4. The input of this algorithm is the set of CUs that the first algorithm determines need further partitioning. The detailed steps of our proposed algorithm based on DT LGBM are as follows:
Step 1: Determine if the size of current CU is in the size set. If it is, proceed to the next step. Otherwise, proceed to the original RDO process.
Step 2: Obtain the prediction probability through the DT LGBM model and determine whether it is greater than th3 and less than th4. If it is true, proceed to the original RDO process. Otherwise, proceed to step 3.
Step 3: Determine whether the current CU will skip horizontal or vertical partitioning based on the predicted probability value.
Step 4: After further partitioning of CU, proceed to the coding unit partitioning mode decision process of next partitioning depth, and re-enter the bi-criterion decision algorithm.
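The decision logic in Steps 1 to 3 can be sketched as follows. The text does not state which class the model's probability refers to, so this sketch assumes it is the probability that the CU should be partitioned horizontally; the function name and the returned labels are hypothetical. The thresholds are the paper's th3 = 0.38 and th4 = 0.62.

```python
TH3, TH4 = 0.38, 0.62  # bounds of the risk interval chosen in the paper

# CU sizes (width, height) handled by the direction classifier.
SIZE_SET = {(32, 32), (32, 16), (16, 32), (16, 16), (16, 8), (8, 16)}


def direction_decision(width, height, prob_horizontal, th3=TH3, th4=TH4):
    """Decide which partition direction, if any, can be skipped.

    prob_horizontal is ASSUMED to be the classifier's probability that the
    CU is split horizontally; the paper does not specify the label encoding.
    """
    if (width, height) not in SIZE_SET:   # Step 1: size not covered by the model
        return "full_rdo"
    if th3 < prob_horizontal < th4:       # Step 2: inside the risk interval
        return "full_rdo"
    if prob_horizontal >= th4:            # Step 3: confident horizontal texture
        return "skip_vertical"
    return "skip_horizontal"              # Step 3: confident vertical texture


print(direction_decision(64, 64, 0.90))  # size outside the set
print(direction_decision(32, 32, 0.50))  # uncertain prediction
print(direction_decision(32, 32, 0.90))
print(direction_decision(16, 8, 0.10))
```

Widening the risk interval sends more CUs back to the full RDO search, which protects coding quality at the cost of smaller time savings.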
3.2.1 Feature analysis and selection
For machine learning algorithms, feature engineering is extremely important, and good features will make the model perform well. For this purpose, we analyzed and selected some features that are closely related to the direction of coding unit division for depth maps using new partitioning techniques. Finally, we chose to introduce these features one by one and provide calculation formulas below:
1. Gradient is often used to describe the rate of change of image pixels in a certain direction. From the perspective of coding unit division, gradient values have a significant relationship with texture direction, thus it has a significant connection with the division direction of CUs. We use the Sobel operator to calculate the four gradient-related features we have selected: the normalized gradients (\(G_{{{\text{ng}}}}\)), the normalized maximum gradient magnitude (\(G_{{{\text{nmg}}}}\)), the average gradients in the horizontal direction (\(G_{agh}\)), and the average gradients in the vertical direction (\(G_{agv}\)). The calculation of gradients in two directions is as follows:
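The displayed equation is not reproduced in this text. Based on the symbols defined next, it plausibly has the form below, where \(*\) is assumed here to denote the sum of element-wise products between the \(3 \times 3\) Sobel matrix and the \(3 \times 3\) pixel matrix \(A\).

\[
grad_{dir} = Sobel_{dir} * A
\]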
where \(grad_{dir}\) represents the gradient value of the encoding unit in each of the two directions, \(Sobel_{dir}\) represents the Sobel matrix for the corresponding direction, and \(A\) denotes the pixel matrix around a pixel in the current CU of the depth map. A CU with height H and width W has a total of (W-2) × (H-2) such pixel matrices, which are composed as follows:
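The displayed matrix is not reproduced in this text. From the description that follows, it is the \(3 \times 3\) neighborhood centered on \(p\left( {i,j} \right)\); the ordering of the indices (first index along rows) is an assumption made here.

\[
A = \begin{bmatrix}
p\left( {i - 1,j - 1} \right) & p\left( {i - 1,j} \right) & p\left( {i - 1,j + 1} \right) \\
p\left( {i,j - 1} \right) & p\left( {i,j} \right) & p\left( {i,j + 1} \right) \\
p\left( {i + 1,j - 1} \right) & p\left( {i + 1,j} \right) & p\left( {i + 1,j + 1} \right)
\end{bmatrix}
\]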
where \(p\left( {i,j} \right)\) represents the gray value at position \(\left( {i,j} \right)\), and the rest of the values in the matrix are the gray values at the adjacent positions of coordinates \(\left( {i,j} \right)\).
The two matrices corresponding to the two directions of the Sobel operator are as follows:
The calculation formulas for the four features we have selected are as follows:
where W and H represent the width and length of the current CU. Because these formulas evaluate the pixel matrix only at the \(\left( {W - 2} \right) \times \left( {H - 2} \right)\) interior positions of the CU, we set their denominators to \(\left( {W - 2} \right) \times \left( {H - 2} \right)\) so that the features are normalized consistently.
2. Variance. Variance is commonly used to represent the dispersion of a set of data. In previous fast intra-frame algorithms for encoding standards, variance is often used as an important feature, and it remains applicable to intra-frame encoding of VVC 3D video depth maps. To distinguish MT partitioning in different directions, we need more detailed variance features. We use two features: the difference in variance of sub-CUs in the horizontal MT partitioning mode (\(VarH\)) and the difference in variance of sub-CUs in the vertical MT partitioning mode (\(VarV\)). The calculation formulas for both are as follows:
where \(VarBH\) and \(VarTH\) represent the difference in variance of the sub-CUs of the CU in the HBT and HTT partition modes, respectively, and similarly \(VarBV\) and \(VarTV\) represent the difference in variance of the sub-CUs of the CU in the VBT and VTT partition modes, respectively. Their calculation formulas are as follows:
Figure 8 shows the structure of CU partitioning sub-blocks in four MT modes.
3. Block information. Some block information of the current encoding unit should also be considered, including length and width. Our DT LGBM model can accelerate the partitioning process of multiple CU sizes. In addition, we learned from [30] that the ratio of CU length to width can also affect the direction of CU partitioning. Therefore, we use Block Shape Rotation (BSR) as a feature to distinguish the direction of CU partitioning. Its expression is as follows:
The quantization parameter (QP) plays an important role in intra-frame encoding. Generally speaking, when QP increases, it will lead to deeper CU partitioning. Therefore, QP should be a feature of our model.
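Part of the feature extraction can be sketched in plain Python. The sketch makes several assumptions: the standard 3 × 3 Sobel kernels are used, the average gradients are taken as mean absolute Sobel responses over the (W-2) × (H-2) interior positions, and the binary-split variance features are taken as the absolute difference between the variances of the two halves. The function names are hypothetical, and which kernel the paper labels "horizontal" is also an assumption. The normalized gradient features, the ternary-split variances, and BSR are left out because their exact expressions are not stated in the text.

```python
from statistics import pvariance

# Standard 3x3 Sobel kernels (assumed labelling of the two directions).
SOBEL_H = ((-1, 0, 1), (-2, 0, 2), (-1, 0, 1))   # intensity change along a row
SOBEL_V = ((-1, -2, -1), (0, 0, 0), (1, 2, 1))   # intensity change along a column


def average_gradients(cu):
    """Return (G_agh, G_agv) as mean absolute Sobel responses (assumed form)."""
    h, w = len(cu), len(cu[0])
    if h < 3 or w < 3:
        raise ValueError("CU must be at least 3x3")
    sum_h = sum_v = 0
    for i in range(1, h - 1):            # interior rows only
        for j in range(1, w - 1):        # interior columns only
            gh = gv = 0
            for di in range(3):
                for dj in range(3):
                    p = cu[i + di - 1][j + dj - 1]
                    gh += SOBEL_H[di][dj] * p
                    gv += SOBEL_V[di][dj] * p
            sum_h += abs(gh)
            sum_v += abs(gv)
    n = (w - 2) * (h - 2)                # denominator described in the text
    return sum_h / n, sum_v / n


def _variance(rows):
    """Population variance of all pixels in a list of rows."""
    return pvariance([p for row in rows for p in row])


def binary_split_variance_diffs(cu):
    """Return (VarBH, VarBV) as |variance difference| of the two halves (assumed form)."""
    h, w = len(cu), len(cu[0])
    top, bottom = cu[: h // 2], cu[h // 2:]          # HBT: split by a horizontal line
    left = [row[: w // 2] for row in cu]             # VBT: split by a vertical line
    right = [row[w // 2:] for row in cu]
    return (abs(_variance(top) - _variance(bottom)),
            abs(_variance(left) - _variance(right)))
```

For a block containing a single vertical edge, the row-direction response dominates and the column-direction response vanishes, which is the kind of asymmetry a direction classifier can exploit.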
3.2.2 Model framework analysis
We used the official JCT-3V standard video sequences [31] to train and test our proposed DT LGBM model. Table 1 shows their detailed information. Figure 9 shows the main steps of constructing our DT LGBM model. First, in the feature extraction stage, the unmodified VTM encoder encodes some training sequences to collect the data needed for training the model. These data include the selected feature information, encoder attributes, and CU partitioning results. Subsequently, in the preprocessing stage, the dataset is balanced and suitable features are selected to determine the direction of CU partitioning. During the model training phase, these features are input into our DT LGBM model, and we use the Optuna tool with the Tree-structured Parzen Estimator approach to select the optimal values for the hyperparameters of the model. Finally, our proposed model is embedded in the original VTM encoder to replace part of the RDO process, and test sequences are used to evaluate how much this method reduces encoding complexity compared to the original algorithm. The final experimental results are presented in Section 4.
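The balancing step in the preprocessing stage can be illustrated as follows. The text does not say how the dataset is balanced, so this sketch assumes random undersampling of the larger class; the function name `undersample` and the label values are hypothetical.

```python
import random
from collections import defaultdict


def undersample(samples, labels, seed=0):
    """Return a shuffled list of (sample, label) pairs with equal class counts.

    Assumption: balancing is done by randomly discarding samples from the
    larger classes until every class has as many samples as the smallest one.
    """
    if len(samples) != len(labels):
        raise ValueError("samples and labels must have the same length")
    if not samples:
        return []

    rng = random.Random(seed)            # fixed seed keeps the split reproducible
    by_label = defaultdict(list)
    for sample, label in zip(samples, labels):
        by_label[label].append(sample)

    smallest = min(len(group) for group in by_label.values())
    balanced = []
    for label, group in by_label.items():
        for sample in rng.sample(group, smallest):
            balanced.append((sample, label))
    rng.shuffle(balanced)
    return balanced
```

Undersampling is only one option; class weighting in the classifier would be an alternative that keeps every sample.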
4 Experiments and analysis
We present the final simulation results to demonstrate how much the proposed algorithm reduces the complexity of depth map coding. The proposed approach is embedded in the original VTM encoder, and the modified encoder is run on the test video sequences to obtain the performance results. We followed the common test conditions (CTC) [32] to set the quantization parameters used to encode depth maps (QP (depth) {34, 39, 42, 45}). All results in Tables 3, 4 and 5 are averages over these QPs. The hardware and software environment of the experiment is shown in Table 2, and the eight video sequences used for testing are shown in Table 1. We use two criteria, namely time savings (TS) and BDBR, to evaluate the proposed algorithm. TS represents how much encoding time the proposed method saves; the larger this value, the more the method shortens the intra-frame coding time of depth maps. BDBR represents the bitrate increase at equal quality, that is, the coding efficiency lost in exchange for the reduced complexity; the smaller this value, the less the proposed algorithm degrades coding quality. The expression for TS is as follows:
where \(T_{pro}\) represents the encoding time required for 3D video depth maps with our proposed algorithm, and \(T_{ori}\) represents the encoding time required with the anchor algorithm.
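As a worked illustration of the TS metric, the sketch below assumes the conventional definition, the relative reduction in encoding time expressed as a percentage, and averages it over the four depth QPs. The timing values in the usage note are made up for illustration and are not measurements from the paper.

```python
def time_saving(t_ori: float, t_pro: float) -> float:
    """Relative encoding-time reduction in percent (assumed conventional form)."""
    if t_ori <= 0:
        raise ValueError("anchor encoding time must be positive")
    return 100.0 * (t_ori - t_pro) / t_ori


def average_time_saving(timings) -> float:
    """Average TS over (t_ori, t_pro) pairs, e.g. one pair per QP."""
    timings = list(timings)
    if not timings:
        raise ValueError("at least one timing pair is required")
    return sum(time_saving(t_ori, t_pro) for t_ori, t_pro in timings) / len(timings)
```

For example, with hypothetical anchor and proposed times of (100 s, 60 s) and (200 s, 100 s), the per-run savings are 40% and 50%, so `average_time_saving([(100, 60), (200, 100)])` gives 45%.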
4.1 Performance analysis
The algorithm we propose includes a bi-criterion decision algorithm and an algorithm based on the DT LGBM model that skips a certain partition direction in advance. The bi-criterion decision algorithm terminates the partitioning of CUs with simple texture early and thereby avoids their RDO process, while the partition direction skip algorithm based on the DT LGBM model identifies early the partition direction that is unlikely for CUs that need to be partitioned, thus avoiding some RDO calculations. Therefore, both sub-algorithms reduce the coding complexity of depth maps under the new partitioning techniques. Table 3 shows the performance of the two sub-algorithms and of the overall algorithm compared to the VTM anchor. The bi-criterion decision algorithm performs well, with an average encoding time saving of 36.15% over the eight video sequences and a BDBR increase of only 0.19%. This indicates that the bi-criterion decision algorithm can effectively avoid the partitioning process of CUs with simple textures in flat regions. Because depth maps contain more flat regions and fewer sharp regions, 3D video depth maps contain many CUs with simple texture, which means that many CUs do not need to be partitioned. Our first sub-algorithm exploits this property of depth maps to achieve good results. The table also shows that the second sub-algorithm, based on the DT LGBM model, saves 27.49% of encoding time compared to the anchor, while BDBR increases by only 0.26%. This indicates that the sub-algorithm can effectively skip unlikely partition directions in advance, reducing coding complexity.
Table 3 also reports the coding results of the entire algorithm, which combines the two sub-algorithms proposed in this paper, namely the bi-criterion decision algorithm and the early partition direction skip algorithm based on the DT LGBM model. The two sub-algorithms complement each other, so the entire algorithm achieves better results than either one alone. Overall, the coding time over all video sequences decreased by 47.65% on average, while BDBR increased by only 0.23%. Among the sequences, Poznan_Hall2 achieved the highest encoding time saving of 56.47%, as the algorithm exploits the texture characteristics of flat regions in depth maps. Moreover, the BDBR of no video sequence increased by more than 0.3%, indicating that the algorithm greatly improves encoding efficiency with almost no impact on video encoding quality.
4.2 Comparison of algorithm performance
To demonstrate the advantages of the algorithm proposed in this article, we compare its final simulation results with those of several advanced algorithms. These include a fast intra-frame CU partitioning algorithm for 3D-HEVC proposed by Hamout in [33], a fast RDO algorithm for 3D video depth maps proposed by Huo in [34], and a bi-layer texture discriminant algorithm aimed at reducing intra-frame encoding complexity for 3D-HEVC proposed by Zuo in [35]. These three methods are all fast intra-frame CU partitioning algorithms for 3D-HEVC. In addition, we compared our proposed algorithm with fast intra-frame CU partitioning algorithms for VVC: a fast intra-frame CU partitioning algorithm for VVC 3D depth maps proposed by Wang et al. [36], an ensemble clustering approach proposed by Song et al. [37] to accelerate the partitioning decision process for CUs of different sizes, and an Extra trees-based method proposed by Wang et al. [38] to reduce the intra-frame encoding complexity of VVC 3D depth maps. Tables 4 and 5 present in detail the experimental results of these six papers and of this paper. Figures 10 and 11 show the performance comparison of these algorithms.
The proposed algorithm performs better overall than the three algorithms for 3D-HEVC. In terms of coding quality, the BDBR of the proposed method is only 0.02% higher than that of the algorithm proposed by Hamout [33], while it provides an additional 7.45% reduction in encoding time; on the GT_Fly sequence in particular, the encoding complexity decreases by an additional 13.78% on average. Huo et al. [34] proposed a fast RDO method that speeds up depth map encoding and achieves good encoding quality: BDBR increased by only 0.07%, and encoding time was reduced by 24.8%. Our proposed approach saves an average of 22.85% more coding time than this algorithm, while the bit rate increases by only 0.16%, which is negligible given the significant gain in encoding efficiency. For HD (1920 × 1088) video sequences in particular, the algorithm in this article greatly reduces complexity, providing an additional 28.42% of encoding time savings on average. The algorithm proposed by Zuo [35] targets both texture and depth map encoding in 3D-HEVC; in this paper, we compare only its depth map results with the proposed algorithm. It comes close to the proposed algorithm in reducing encoding complexity, but our proposed algorithm maintains better encoding quality, with a significantly smaller increase in BDBR.
Compared with the three fast intra-frame CU partitioning algorithms for VVC, our proposed method still performs better. In terms of maintaining the encoding quality of intra-frame CU partitioning in depth maps, the BDBR of our method increases by only 0.23%, which is better than the other three fast intra-frame CU partitioning algorithms for VVC. In terms of reducing encoding complexity, [36] proposed a method based on two neural network models to accelerate depth map encoding, saving 43.23% of encoding time. Our proposed method saves an average of 4.42% more encoding time than that algorithm, and an average of 3.41% more encoding time than the methods of [37] and [38]. For HD video sequences, the algorithm proposed in this paper saves an average of 7.77% more encoding time than [37]. Therefore, as Tables 4 and 5 and Figures 10 and 11 show, our proposed algorithm exhibits better encoding performance than other advanced methods for 3D video depth map intra-frame encoding, achieving a good balance between coding quality and encoding efficiency.
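The gaps quoted in this comparison can be cross-checked with a few lines of arithmetic. The sketch treats each stated gap as a difference in percentage points against the proposed method's averages (47.65% time saving, 0.23% BDBR). The function name `implied_reference` is hypothetical, and the values it returns are implied by the text rather than copied from Tables 4 and 5.

```python
from decimal import Decimal

PROPOSED_TS = Decimal("47.65")    # average time saving of the proposed method (%)
PROPOSED_BDBR = Decimal("0.23")   # average BDBR increase of the proposed method (%)


def implied_reference(ts_gap, bdbr_gap=None):
    """Return (time saving, BDBR) implied for a reference method.

    ts_gap   -- extra time saving of the proposed method, in percentage points.
    bdbr_gap -- extra BDBR of the proposed method, or None if not stated.
    """
    ts = PROPOSED_TS - Decimal(ts_gap)
    bdbr = None if bdbr_gap is None else PROPOSED_BDBR - Decimal(bdbr_gap)
    return ts, bdbr


hamout = implied_reference("7.45", "0.02")   # gaps stated for [33]
huo = implied_reference("22.85", "0.16")     # gaps stated for [34]
wang = implied_reference("4.42")             # time gap stated for [36]
```

The implied values for [34] (24.80% and 0.07%) and for [36] (43.23%) match the figures quoted directly in the text, so the stated gaps are internally consistent; for [33] the same arithmetic implies 40.20% time saving and 0.21% BDBR.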
5 Conclusions
In this paper, to efficiently reduce the intra-frame encoding complexity of VVC 3D video depth maps, we propose an algorithm that combines a handcrafted decision rule with machine learning. The algorithm consists of two parts: a bi-criterion decision algorithm that determines whether CUs can skip partitioning in advance, and an algorithm based on the DT LGBM model that determines whether horizontal or vertical partitioning can be skipped for CUs that need partitioning. Together, they significantly reduce unnecessary RDO processes and thereby improve encoding efficiency. The final simulation results show that our algorithm not only reduces coding time substantially, but also preserves good coding quality. Compared with the original algorithm, it reduces coding time by 47.65%, with a negligible coding quality loss of 0.23% in BDBR. Compared with other advanced algorithms, it also shows a good balance between coding quality and efficiency. Therefore, it can be concluded that the proposed method can contribute to the practical application of VVC 3D videos.
Availability of data and materials
The conclusion and comparison data of this article are included within the article.
References
Y.-L. Chan et al., Overview of current development in depth map coding of 3D video and its future. IET Signal Process. 14(1), 1–14 (2020)
J. Lei et al., Deep multi-domain prediction for 3D video coding. IEEE Trans. Broadcast. 67(4), 813–823 (2021)
M. Wien et al., Standardization status of immersive video coding. IEEE J. Emerg. Select. Topics Circuits Syst. 9(1), 5–17 (2019)
K. Müller et al., 3D high-efficiency video coding for multi-view video and depth data. IEEE Trans. Image Process. 22(9), 3366–3378 (2013)
A. Smolic, P. Kauff, Interactive 3-D video representation and coding technologies. Proc. IEEE 93(1), 98–110 (2005)
B. Bross et al., Overview of the versatile video coding (VVC) standard and its applications. IEEE Trans. Circuits Syst. Video Technol. 31(10), 3736–3764 (2021)
B. Bross et al., Developments in international video coding standardization after AVC, with an overview of versatile video coding (VVC). Proc. IEEE 109(9), 1463–1493 (2021)
X. Zhao et al., Transform coding in the VVC standard. IEEE Trans. Circuits Syst. Video Technol. 31(10), 3878–3890 (2021)
A. Smolic et al., 3D video and free viewpoint video-technologies, applications and MPEG standards, in 2006 IEEE International Conference on Multimedia and Expo (IEEE, Piscataway, 2006)
C. Lee, Y.-S. Ho, View synthesis using depth map for 3D video, in Proceedings: APSIPA ASC 2009: Asia-Pacific Signal and Information Processing Association, 2009 Annual Summit and Conference (2009)
S. Smirnov, A. Gotchev, K. Egiazarian, Methods for depth-map filtering in view-plus-depth 3D video representation. EURASIP J. Adv. Signal Process. 2012(1), 1–21 (2012)
Q. Zhang et al., Low-complexity depth map compression in HEVC-based 3D video coding. EURASIP J. Image Video Process. 2015, 1–14 (2015)
B.T. Oh, J. Lee, D.-S. Park, Depth map coding based on synthesized view distortion function. IEEE J. Select. Topics Signal Process. 5(7), 1344–1352 (2011)
T. Zhao et al., Efficient VVC intra prediction based on deep feature fusion and probability estimation. IEEE Trans. Multimedia (2022). https://doi.org/10.1109/TMM.2022.3208516
J. Lin et al., SSIM-variation-based complexity optimization for versatile video coding. IEEE Signal Process. Lett. 29, 2617–2621 (2022)
C. Shu, C. Yang, P. An, An online SVM based VVC intra fast partition algorithm with pre-scene-cut detection, in 2022 IEEE International Symposium on Circuits and Systems (ISCAS) (IEEE, Piscataway, 2022)
B. Abdallah et al., Low-complexity QTMT partition based on deep neural network for Versatile Video Coding. Signal Image Video Process. 15(6), 1153–1160 (2021)
M. Saldanha et al., Configurable fast block partitioning for VVC intra coding using light gradient boosting machine. IEEE Trans. Circuits Syst. Video Technol. 32(6), 3947–3960 (2021)
A. Tissier et al., Machine learning based efficient QT-MTT partitioning scheme for VVC intra encoders. IEEE Trans. Circuits Syst. Video Technol. (2023)
M. Saldanha et al., Fast 3D-HEVC depth map encoding using machine learning. IEEE Trans. Circuits Syst. Video Technol. 30(3), 850–861 (2019)
C. Liu, K. Jia, P. Liu, Fast depth intra coding based on depth edge classification network in 3D-HEVC. IEEE Trans. Broadcast. 68(1), 97–109 (2021)
C.-H. Fu et al., Efficient depth intra frame coding in 3D-HEVC by corner points. IEEE Trans. Image Process. 30, 1608–1622 (2020)
Y.-C. Hsu, Acceleration of depth intra coding for 3D-HEVC by efficient early termination algorithm, in 2018 IEEE Asia Pacific Conference on Circuits and Systems (APCCAS) (IEEE, 2018)
J.-R. Lin et al., Visual perception based algorithm for fast depth intra coding of 3D-HEVC. IEEE Trans. Multimedia 24, 1707–1720 (2021)
H. Hamout, A. Elyousfi, Fast depth map intra coding for 3D video compression-based tensor feature extraction and data analysis. IEEE Trans. Circuits Syst. Video Technol. 30(7), 1933–1945 (2019)
H. Bai et al., Depth image coding using entropy-based adaptive measurement allocation. Entropy 16(12), 6590–6601 (2014)
F. Shao et al., Depth map coding for view synthesis based on distortion analyses. IEEE J. Emerg. Select. Topics Circuits Syst. 4(1), 106–117 (2014)
K. Choi, A study on fast and low-complexity algorithms for versatile video coding. Sensors 22(22), 8990 (2022)
T. Amestoy et al., Tunable VVC frame partitioning based on lightweight machine learning. IEEE Trans. Image Process. 29, 1313–1328 (2019)
S.-H. Park, J.-W. Kang, Fast multi-type tree partitioning for versatile video coding using a lightweight neural network. IEEE Trans. Multimedia 23, 4388–4399 (2020)
J. Zhang, Ghost town fly 3DV sequence for purposes of 3DV standardization. ISO/IEC JTC1/SC29/WG11, Doc. M20027 (2011)
K. Müller, A. Vetro, Common test conditions of 3DV core experiments. Joint Collaborative Team on 3D Video Coding Extensions (JCT-3V), Document JCT3V-G1100, 7th Meeting, San Jose, CA, USA (2014)
H. Hamout, A. Elyousfi, A computation complexity reduction of the size decision algorithm in 3D-HEVC depth map intracoding. Adv. Multimedia (2022). https://doi.org/10.1155/2022/3507201
J. Huo et al., Fast rate-distortion optimization for depth maps in 3-D video coding. IEEE Trans. Broadcast. 69(1), 21–32 (2022)
J. Zuo et al., Bi-layer texture discriminant fast depth intra coding for 3D-HEVC. IEEE Access 7, 34265–34274 (2019)
F. Wang, Z. Wang, Q. Zhang, CNN-LNN based fast CU partitioning decision for VVC 3D video depth map intra coding. IEEE Access (2023). https://doi.org/10.1109/ACCESS.2023.3305266
W. Song, G. Li, Q. Zhang, Fast algorithm for CU size decision based on ensemble clustering for intra coding of VVC 3D video depth map. Electronics 12(14), 3098 (2023)
F. Wang, Z. Wang, Q. Zhang, Efficient CU decision algorithm for VVC 3D video depth map using GLCM and extra trees. Electronics 12(18), 3914 (2023)
Funding
This work was supported in part by the National Natural Science Foundation of China (Nos. 61771432 and 61302118), the Basic Research Projects of Education Department of Henan (No. 21zx003), the Key Projects Natural Science Foundation of Henan (No. 232300421150), the Scientific and Technological Project of Henan Province (No. 232102211014), and the Postgraduate Education Reform and Quality Improvement Project of Henan Province (No. YJS2023JC08).
Author information
Contributions
Conceptualization, L.S. and A.Y.; methodology, L.S.; software, A.Y.; validation, L.S., Q.Z. and A.Y.; formal analysis, A.Y.; investigation, A.Y.; resources, Q.Z.; data curation, A.Y.; writing—original draft preparation, A.Y.; writing—review and editing, L.S.; visualization, L.S.; supervision, Q.Z.; project administration, Q.Z.; funding acquisition, Q.Z. All authors have read and agreed to the published version of the manuscript.
Ethics declarations
Competing interests
The authors declare no competing interests.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Si, L., Yan, A. & Zhang, Q. Fast CU decision method based on texture characteristics and decision tree for depth map intra-coding. J Image Video Proc. 2024, 34 (2024). https://doi.org/10.1186/s13640-024-00651-2
DOI: https://doi.org/10.1186/s13640-024-00651-2