An image-guided network for depth edge enhancement

With the rapid development of 3D coding and display technologies, numerous applications are emerging that target immersive entertainment. High-accuracy depth maps play a crucial role in achieving a good 3D visual experience. However, depth maps retrieved from most devices still suffer from inaccuracies at object boundaries, so a depth enhancement system is usually needed to correct the errors. Recent work applying deep learning to depth enhancement has shown promising improvements. In this paper, we propose a deep depth enhancement network that effectively corrects inaccurate depth using color images as a guide. The proposed network contains both depth and image branches, and combines a new set of features from the image branch with those from the depth branch. Experimental results show that the proposed system achieves better depth correction performance than state-of-the-art networks. The ablation study reveals that the proposed loss functions, used together with image information, can effectively enhance depth map accuracy.

Depth maps are commonly acquired either by depth sensors [3] or by stereo matching of image pairs [4,5]. However, depth maps generated by depth sensors are often noisy, have missing depth values, and tend to be misaligned with the object boundaries in the color image. Depth maps estimated by stereo matching often contain errors in occluded and flat regions. Recently, much work has been done to predict depth maps from 2D videos, for example by using monocular depth estimation networks [6][7][8] or by creating key depth frames manually and then interpolating the depth maps [9]. The generated depth maps suffer either from blurred edges, as shown in Fig. 1, or from unavoidable errors caused by manual annotation along object boundaries. To obtain accurate, high-quality depth maps, a depth enhancement system is needed to remove the noise and correct the depth errors.
Without loss of generality, we consider a typical scenario in which a depth map of limited quality, containing noise and errors, is retrieved from a pair of stereo color images. Multiple approaches have been proposed to enhance depth map quality by exploiting the pixel correlation and structural relationship between the color image and the depth map. However, when the color image contains highly textured objects, enhancement based on this correlation often causes texture-like artifacts in the depth map. Traditional depth enhancement approaches apply adaptive filters to smooth or repair noisy depth maps. Gaussian filtering-based methods [10] estimate missing depth values from the known neighboring depth values. Joint bilateral filtering (JBF)-based methods [11,12], an extension of bilateral filtering [13], use information from the color frame to enhance the depth map more accurately. However, these frameworks usually generate blurry depth results, since they cannot be trained to focus on the areas around object boundaries.
Recently, deep learning has been introduced in various image processing applications, such as image super-resolution [14,15], image restoration [16], image denoising [17,18], and depth denoising with graph networks [19]. The denoising and enhancement convolutional neural network (DE-CNN) [20] and the image-guided method [21] have also been adopted for depth enhancement.
In this paper, we propose an end-to-end framework for depth enhancement that takes a color frame and a noisy depth map as inputs and outputs the enhanced depth map. The rest of the paper is organized as follows. In Sect. 2, we briefly review related work, including the deep residual convolutional neural network (DRECNN) [22], the residual dense network [23], and focal loss [24]. In Sect. 3, we describe in detail the proposed image-guided depth enhancement (IGDE) system. In Sect. 4, we present the experimental results. Finally, we draw conclusions in Sect. 5.

Related work
The DRECNN is a typical framework that performs depth enhancement well. It first learns the underlying correlation between the depth map and the color image, and then applies the learned correlation to enhance the quality of the depth map. As shown in Fig. 2, the DRECNN architecture is divided into a depth branch, an intensity branch, and a fusion module. The depth and intensity branches have the same structure, consisting of one set of convolution and ReLU layers followed by seven sets of convolution, batch normalization, and ReLU layers, to retrieve the depth and intensity feature maps. The fusion module applies eleven sets of convolution, batch normalization, and ReLU layers and a final convolution layer to retrieve the fusing coefficient maps. Following the concept of the image-guided filter [21], the filter output $\tilde{I}_i$ is given as
$$\tilde{I}_i = a_k I_i + b_k, \quad \forall i \in \omega_k, \tag{1}$$
where $I$ is the input guidance image, and $a_k$ and $b_k$ are the linear coefficients assumed to be constant in the window $\omega_k$. Extending this concept, the DRECNN uses the fusion module to retrieve the pixel-level fusing coefficient maps $a$ and $b$. As shown in Fig. 2, the residual depth map is then obtained from these coefficient maps together with $Y$ and $D$, where $Y$ is the luminance of the color image and $D$ is the depth map. The DRECNN effectively improves depth enhancement and avoids overfitting by finding a linear model supervised by the ground-truth label.
After AlexNet [25] was proposed, state-of-the-art CNN architectures commonly adopted a large number of layers. However, simply adding convolutional layers does not guarantee better results because of the vanishing gradient problem. Batch normalization partially solves this problem. ResNet [23], which uses residual blocks that add the original input to the output through a shortcut connection, effectively solves the degradation problem caused by increasing the number of network layers. Since then, residual blocks have been modified to build various high-performance networks.
EDSR [26] removes batch normalization to speed up convergence. DenseNet [27] achieves similar results with a much smaller number of parameters. SRDenseNet [28] applies DenseNet to solve image super-resolution effectively. Figure 3 shows the structures of the residual block, dense block, and residual dense block. For image super-resolution, the residual dense network (RDN) [15], shown in Fig. 4, is composed of multiple residual dense blocks (RDBs). The features generated by the preceding convolution layers are concatenated and passed to a 1 × 1 convolution layer to reduce the number of channels. The RDN consists of N RDBs, which map the low-resolution (LR) image to the high-resolution (HR) image. The RDN preserves the details of the LR image and performs suitable image corrections to obtain the HR image.
When training CNNs, it is important to design suitable loss functions. Mean square error (MSE) and mean absolute error (MAE) losses are often used in regression problems, while cross-entropy is used in classification problems. To improve the speed and direction of network convergence, special loss functions have been proposed. For the binary classification problem, the cross-entropy loss is given as
$$\mathrm{CE}(p, y) = \begin{cases} -\log(p), & \text{if } y = +1, \\ -\log(1 - p), & \text{otherwise}, \end{cases} \tag{3}$$
where $y \in \{+1, -1\}$ denotes the label and $p$ represents the predicted probability that the sample belongs to class $+1$. Summing the cross-entropy over all samples gives the loss of the network. To correct the imbalance of binary samples, the focal loss used in RetinaNet [24] is given as
$$\mathrm{FL}(p_t) = -\alpha (1 - p_t)^{\gamma} \log(p_t), \tag{4}$$
where $p_t = p$ if $y = +1$ and $p_t = 1 - p$ otherwise, $\gamma > 0$ is a focusing parameter that reduces the relative loss for well-classified examples with $p_t > 0.5$, and $\alpha$ is the shared weight that balances positive and negative samples. Focal loss allows one-stage object detectors, such as YOLO [31] and SSD [32], to achieve higher performance in comparison with two-stage object detectors, such as Faster R-CNN [29] and R-FCN [30]. Because a one-stage detector faces a large imbalance between the numbers of positive and negative samples during training, $\alpha$ is used together with the modulating factor $(1 - p_t)^{\gamma}$ to reduce the influence of negative samples. The modulating factor reduces the weight of easy-to-classify samples so that the network pays more attention to difficult-to-classify samples. The effectiveness of focal loss has been demonstrated in many advanced networks.
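To make the effect of the modulating factor concrete, the cross-entropy and focal loss described above can be sketched in NumPy as follows. This is an illustrative sketch only: the function names are hypothetical, a single shared weight `alpha` is used as in the text, and the default values of `alpha` and `gamma` are example settings rather than values prescribed for a particular detector.

```python
import numpy as np

def cross_entropy(p, y):
    """Binary cross-entropy for labels y in {+1, -1}, where p = P(y = +1)."""
    p = np.clip(np.asarray(p, dtype=np.float64), 1e-12, 1.0 - 1e-12)
    # p_t is the probability assigned to the true class.
    p_t = np.where(np.asarray(y) == 1, p, 1.0 - p)
    return -np.log(p_t)

def focal_loss(p, y, alpha=0.25, gamma=2.0):
    """Focal loss with a shared weight alpha and focusing parameter gamma."""
    p = np.clip(np.asarray(p, dtype=np.float64), 1e-12, 1.0 - 1e-12)
    p_t = np.where(np.asarray(y) == 1, p, 1.0 - p)
    # (1 - p_t)^gamma shrinks the loss of well-classified samples.
    return -alpha * (1.0 - p_t) ** gamma * np.log(p_t)

# An easy positive sample (p = 0.9) versus a hard one (p = 0.1).
easy = focal_loss(0.9, +1)
hard = focal_loss(0.1, +1)
```

With `gamma = 0` and `alpha = 1`, the focal loss reduces to the cross-entropy. With `gamma = 2`, the loss of the easy sample is scaled by (1 − 0.9)² = 0.01 relative to its weighted cross-entropy, so hard samples dominate the total loss.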

The methods
The guided image filter [21] uses the correlation between the color image and the depth map to enhance the noisy depth map. However, images with complex textures often degrade the depth map with ghosting textures. Learning-based methods mitigate this strong influence of image texture, but the enhanced depth maps often still contain inaccurate depth at object boundaries. To address this issue, we propose an end-to-end depth map enhancement system that focuses mainly on correcting the depth edges.

The IGDE network
The proposed image-guided depth enhancement (IGDE) network, shown in Fig. 5, consists of two feature extractors, one fusion module, and one depth refinement module. It has been noted that task-adaptive attention [33] and multi-feature fusion [34] help improve the performance of image captioning and recognition, respectively. We employ the residual dense network (RDN) as the backbone of the feature extractors and the depth refinement module. We extract features from the image and depth frames and concatenate them to form the fused feature, which is sent to the depth refinement module to obtain the enhanced depth map. Figure 6 shows the detailed structure of the feature extractor. In the early stages of the network, the low-level features of the image frame are concatenated with the low-level features of the depth map. In the ablation study, we examine how many layers of low-level image features should be concatenated.
At the end of the network, the depth refinement module converts the fused feature into the enhanced depth map. Here, we use the same residual dense network as the backbone of the depth refinement module. The features obtained by the residual dense network are restored to a depth map by a 1 × 1 convolution. The detailed architecture of the depth refinement module is shown in Fig. 7. All convolutions in the module are followed by ReLU activations, which introduce nonlinearity into the network.

Loss functions
To train the proposed IGDE network, we aim to refine the noisy depth values at object boundaries so that they are consistent with the image frame. Typically, a depth loss function accumulates the loss over all pixels in the depth map. Since the number of object boundary pixels is much lower than the number of pixels in the whole image, the impact of errors at object boundaries is often diluted. We therefore propose a depth focal loss that assigns lower weights to pixels with smaller depth deviations and higher weights to pixels with larger deviations. In addition, we design a Sobel loss to emphasize the depth deviation at object boundaries. The total loss function, including the depth loss $L_{depth}$, the depth focal loss $L_{focal}$, and the Sobel loss $L_{sobel}$, becomes
$$L_{total} = \rho\, L_{depth}(d, d^*) + \mu\, L_{focal}(d, d^*) + \lambda\, L_{sobel}(d, d^*, M_{sobel}), \tag{5}$$
where $d$ and $d^*$ are the predicted depth value and the corresponding ground truth, respectively, $M_{sobel}$ is the mask that focuses on the object boundaries, and $\rho$, $\mu$, and $\lambda$ are the weighting factors of the losses. We set all of them to 1. The details of each loss function are described as follows.

Depth loss
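To minimize the difference between the predicted depth map $d$ and the corresponding ground truth $d^*$, we use the L1 loss
$$L_{depth} = \frac{1}{HW} \sum_{i=1}^{H} \sum_{j=1}^{W} \left| d_{i,j} - d^*_{i,j} \right|, \tag{6}$$
where $i$ and $j$ denote the pixel indices, and $H$ and $W$ are the height and width of the depth maps, respectively.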
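As a small worked example, the L1 depth loss can be sketched in NumPy as below. The sketch assumes that the loss is averaged over all H × W pixels; the function name `depth_l1_loss` is hypothetical and is not taken from the original implementation.

```python
import numpy as np

def depth_l1_loss(d, d_star):
    """Mean absolute difference between predicted and ground-truth depth maps."""
    d = np.asarray(d, dtype=np.float64)
    d_star = np.asarray(d_star, dtype=np.float64)
    if d.shape != d_star.shape:
        raise ValueError("predicted and ground-truth depth maps must have the same shape")
    # Sum of |d - d*| over all pixels, divided by H * W.
    return np.abs(d - d_star).mean()

# A 2 x 2 example: absolute errors are 0, 5, 0, and 10, so the loss is 15 / 4.
pred = np.array([[10.0, 20.0], [30.0, 40.0]])
gt = np.array([[10.0, 25.0], [30.0, 30.0]])
loss = depth_l1_loss(pred, gt)  # 3.75
```

Because every pixel contributes equally, a few large errors along object boundaries change this loss only slightly, which is the motivation for the two additional losses.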

Depth focal loss
When refining a noisy depth map, the erroneous depth pixels occupy only a small portion of the map. We aim to train the network to focus more on the erroneous pixels than on the correct ones. Hence, we suggest the depth focal loss in (7), which is built on the per-pixel term $e_{i,j}$ defined in (8), where $\alpha$ and $\gamma$ are the shared weight and the focusing parameter, respectively. Currently, we set $\alpha$ to 0.25 and $\gamma$ to 2. Similar to the focal loss [24], $e_{i,j}$ can be treated as the probability that the predicted depth is correct, which is close to 1 in most cases. In (8), the ratio of the difference between the predicted and ground-truth values to the maximum depth value 255 plays a role similar to that of positive and negative samples in the focal loss. With the depth focal loss, our network places more emphasis on the pixels with a higher error ratio, which makes the training results more accurate.
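One possible reading of the depth focal loss can be sketched in NumPy as follows. The exact form is an assumption made for illustration: we take $e_{i,j} = 1 - |d_{i,j} - d^*_{i,j}| / 255$ as the per-pixel "probability of being correct" and apply the focal loss form of [24] to it, averaged over all pixels. The function name and the small constant `eps` are hypothetical.

```python
import numpy as np

def depth_focal_loss(d, d_star, alpha=0.25, gamma=2.0, max_depth=255.0, eps=1e-6):
    """Focal-style depth loss (assumed form): larger error ratios get larger weights."""
    d = np.asarray(d, dtype=np.float64)
    d_star = np.asarray(d_star, dtype=np.float64)
    # Error ratio relative to the maximum depth value, in [0, 1].
    ratio = np.abs(d - d_star) / max_depth
    # Assumed per-pixel term: close to 1 when the prediction is nearly correct.
    e = np.clip(1.0 - ratio, eps, 1.0)
    # Modulating factor (1 - e)^gamma suppresses pixels with small deviations.
    per_pixel = -alpha * (1.0 - e) ** gamma * np.log(e)
    return per_pixel.mean()

gt = np.zeros((2, 2))
small_error = depth_focal_loss(np.full((2, 2), 5.0), gt)
large_error = depth_focal_loss(np.full((2, 2), 50.0), gt)
```

Under this assumed form, a tenfold increase in the per-pixel deviation increases the loss by far more than a factor of ten, so the few pixels with large deviations receive most of the gradient.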

Sobel loss
Since the depth map contains errors mostly near object boundaries, we design a Sobel loss to make the network focus more on areas close to object boundaries. The Sobel loss in (9) weights the depth errors using $M_{sobel}$, a depth edge mask, and $\beta$, a parameter that controls the importance of the edge area. We set $M_{sobel}$ to 1 for pixels close to object boundaries and to 0 for the remaining pixels. Here, we choose $\beta = 0.9$.
To compute the depth edge mask, as shown in Fig. 8, we first apply Sobel edge detection to the input depth map to obtain the Sobel edges. Then, the detected edges are expanded by a dilation operator. Finally, we apply Otsu thresholding to compute the depth edge mask. In (9), the depth error pixels inside the depth edge mask are weighted by 0.9, while those outside the mask are weighted by 0.1. The parameter $\beta$ can be adjusted to change the weight of the area near the depth edges relative to the rest of the image.
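The mask construction and the edge-weighted loss described above can be sketched in NumPy as follows. Several details are assumptions made for illustration: a 3 × 3 Sobel kernel, a square dilation window of radius 1, a 256-bin Otsu threshold computed on the dilated gradient magnitude, and an absolute-error term averaged over all pixels. All function names are hypothetical.

```python
import numpy as np

def sobel_magnitude(depth):
    """Gradient magnitude from 3x3 Sobel kernels with edge-replicated borders."""
    p = np.pad(np.asarray(depth, dtype=np.float64), 1, mode="edge")
    gx = (p[:-2, 2:] + 2 * p[1:-1, 2:] + p[2:, 2:]) - (p[:-2, :-2] + 2 * p[1:-1, :-2] + p[2:, :-2])
    gy = (p[2:, :-2] + 2 * p[2:, 1:-1] + p[2:, 2:]) - (p[:-2, :-2] + 2 * p[:-2, 1:-1] + p[:-2, 2:])
    return np.hypot(gx, gy)

def dilate(img, radius=1):
    """Grey-scale dilation with a (2 * radius + 1) x (2 * radius + 1) square window."""
    h, w = img.shape
    p = np.pad(img, radius, mode="edge")
    out = np.full_like(img, -np.inf)
    for dy in range(2 * radius + 1):
        for dx in range(2 * radius + 1):
            out = np.maximum(out, p[dy:dy + h, dx:dx + w])
    return out

def otsu_threshold(values, bins=256):
    """Threshold that maximizes the between-class variance of the histogram."""
    hist, edges = np.histogram(values, bins=bins)
    centers = 0.5 * (edges[:-1] + edges[1:])
    w0 = np.cumsum(hist)            # number of samples at or below each bin
    w1 = w0[-1] - w0                # number of samples above each bin
    s0 = np.cumsum(hist * centers)
    m0 = s0 / np.maximum(w0, 1)     # mean of the lower class
    m1 = (s0[-1] - s0) / np.maximum(w1, 1)  # mean of the upper class
    return centers[np.argmax(w0 * w1 * (m0 - m1) ** 2)]

def depth_edge_mask(depth, radius=1):
    """Sobel edges, then dilation, then Otsu thresholding; returns a 0/1 mask."""
    edges = dilate(sobel_magnitude(depth), radius)
    return (edges > otsu_threshold(edges)).astype(np.float64)

def sobel_loss(d, d_star, mask, beta=0.9):
    """Absolute error weighted by beta inside the mask and (1 - beta) outside."""
    weights = beta * mask + (1.0 - beta) * (1.0 - mask)
    return (weights * np.abs(np.asarray(d, float) - np.asarray(d_star, float))).mean()

# A vertical step edge between columns 3 and 4 of an 8 x 8 depth map.
depth = np.zeros((8, 8))
depth[:, 4:] = 100.0
mask = depth_edge_mask(depth)
```

For this step edge, the mask covers a band of columns around the discontinuity, so an error of the same size costs nine times more inside the band than outside it when $\beta = 0.9$.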

Results and discussions
The proposed IGDE system is implemented in Python 3.7 with CUDA 10.2, cuDNN 7.6.5, and the TensorFlow-GPU 1.15.0 library. For hardware, we use a personal computer with an Intel Core i7-9700K CPU (3.6 GHz to 4.9 GHz) and 32 GB of 3200 MHz RAM. An NVIDIA GeForce RTX 2080 Ti GPU with 11 GB of memory is used to accelerate the training of the proposed IGDE system.
We evaluate the proposed IGDE system on the Scene Flow dataset [26], a large-scale synthetic dataset containing the Flyingthings3D, Monkaa, and Driving sub-datasets. Some examples from the dataset are shown in Fig. 9. Compared to other datasets, it has more accurate ground-truth depth maps, since they are generated from virtual images. The images in the dataset are divided into 70,908 training images and 8,740 testing images with H = 540 and W = 960. We crop the images to H = 512 and W = 960 for the proposed network.
To simulate depth maps with erroneous edges, we randomly inflate or reduce the depth values at object boundaries in all depth maps. We then take the simulated depth maps and the corresponding color images as inputs to the proposed system. During training, we use a batch size of 2 and randomly crop the images to H = 280 and W = 480. To improve the prediction accuracy, we normalize the input images by dividing them by 255. We train our network with a learning rate of 0.0001 for 50 epochs. Figure 10 shows the visual results of testing on the Flyingthings3D sub-dataset. The depth map refined by the proposed IGDE system performs well at object boundaries.

Visualization performance of the network
A comparison of the error maps before and after refinement shows that the number of error points is significantly reduced.
To test the effectiveness of the proposed network, the trained network is directly applied to the Middlebury dataset. Figure 11a and b show four original natural images and their corresponding ground-truth depth maps with unknown holes, respectively; these depth maps are treated as noisy depth maps. The holes are first filled by simply extending the known depth values vertically from the bottom, and the result is then enhanced by the proposed IGDE system. Figure 11c and d show the enhanced depth maps and the error depth maps, respectively. Since the exact depth values in these unknown holes are not available for natural images, the subjective quality cannot be assessed as it is for the graphic images. However, the depth maps refined by the proposed IGDE system show reasonably good objective quality. The IGDE system can enhance the depth maps of natural images successfully. For a detailed evaluation of performance, we present numerical comparisons with other methods in the next sub-section.

Comparisons with quality measures
We compare the performance of the proposed IGDE system with three depth refinement networks, namely, the denoising and enhancement CNN (DE-CNN) [20], the deep residual enhancement CNN (DRECNN) [22], and the depth enhancement network with color-based prediction network (DEN + CBPN) [35]. The single-branch DE-CNN concatenates the depth map and color image as its input, while the proposed IGDE, DRECNN, and DEN + CBPN systems use two branches and fuse the depth and image features in different ways. Without ground-truth values, a no-reference quality measure [36] can be used instead. With the ground-truth depth maps, Table 1 shows the comparison results on the Scene Flow testing set [37]. We use common quality measures, namely PSNR in dB, SSIM, and RMSE, to evaluate the performance of all networks. In addition, we calculate $\mathrm{PSNR}_t$ and $\mathrm{PSNR}_f$ for the correct and erroneous depth pixels of the prediction results, respectively. Table 1 shows that the proposed IGDE achieves the best results, which are marked in bold. In Tables 2 and 3, the best results are also marked in bold. The PSNR, $\mathrm{PSNR}_t$, and $\mathrm{PSNR}_f$ in dB are defined as follows:
$$\mathrm{PSNR} = 10 \cdot \log_{10} \frac{\mathrm{MAX}^2}{\mathrm{MSE}_{all}} = 10 \cdot \log_{10} \frac{255^2}{\mathrm{MSE}_{all}}, \tag{10}$$
$$\mathrm{PSNR}_t = 10 \cdot \log_{10} \frac{255^2}{\mathrm{MSE}_t}, \tag{11}$$
$$\mathrm{PSNR}_f = 10 \cdot \log_{10} \frac{255^2}{\mathrm{MSE}_f}, \tag{12}$$
where $\mathrm{MSE}_{all}$, $\mathrm{MSE}_t$, and $\mathrm{MSE}_f$ denote the mean square error of all, correct, and incorrect depth pixels, respectively. The structural similarity (SSIM) measure is defined as
$$\mathrm{SSIM}(x, y) = l(x, y) \cdot c(x, y) \cdot s(x, y), \tag{13}$$
where $l(x, y)$, $c(x, y)$, and $s(x, y)$ denote the luminance, contrast, and structure measures of $x$ and $y$, which are respectively defined as
$$l(x, y) = \frac{2 u_x u_y + C_1}{u_x^2 + u_y^2 + C_1}, \quad c(x, y) = \frac{2 \sigma_x \sigma_y + C_2}{\sigma_x^2 + \sigma_y^2 + C_2}, \quad s(x, y) = \frac{\sigma_{xy} + C_3}{\sigma_x \sigma_y + C_3},$$
where $u_x$ and $u_y$ are the averages of $x$ and $y$; $\sigma_x$ and $\sigma_y$ represent the standard deviations of $x$ and $y$, respectively; $\sigma_{xy}$ denotes the covariance of $x$ and $y$; and $C_1$, $C_2$, and $C_3$ are constants that stabilize the division when the denominator is weak. The RMSE is defined as
$$\mathrm{RMSE} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left( d_i - d_i^* \right)^2},$$
where $N$ denotes the total number of pixels in the prediction result, and $d_i^*$ and $d_i$ indicate the $i$th ground-truth depth value and the $i$th predicted depth value, respectively.
Fig. 11 Visualization results on Middlebury dataset: a original color images, b ground-truth depth maps with unknown holes, c enhanced depth maps by the proposed IGDE system, d the absolute difference between the original and refined depth maps
We implemented the networks of the three selected depth refinement methods ourselves because their source code is not available, and we trained them with our training configuration. Based on the comparison results, the proposed IGDE system achieves the best performance.
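The PSNR-based measures and the RMSE can be sketched in NumPy as follows. One detail is an assumption made for illustration: a pixel is counted as "correct" when the noisy input depth already equals the ground truth, and as "erroneous" otherwise. The function names are hypothetical, and SSIM is omitted for brevity.

```python
import numpy as np

def psnr(pred, gt, mask=None, max_val=255.0):
    """PSNR in dB over all pixels, or over the pixels selected by a boolean mask."""
    sq = (np.asarray(pred, dtype=np.float64) - np.asarray(gt, dtype=np.float64)) ** 2
    if mask is not None:
        sq = sq[mask]
    if sq.size == 0:
        return float("nan")  # no pixels in this class
    mse = sq.mean()
    return float("inf") if mse == 0 else float(10.0 * np.log10(max_val ** 2 / mse))

def rmse(pred, gt):
    """Root mean square error over all pixels."""
    diff = np.asarray(pred, dtype=np.float64) - np.asarray(gt, dtype=np.float64)
    return float(np.sqrt((diff ** 2).mean()))

def depth_metrics(pred, gt, noisy):
    """PSNR, PSNR_t (correct pixels), PSNR_f (erroneous pixels), and RMSE."""
    # Assumption: pixels untouched by the simulated noise are the "correct" ones.
    correct = np.asarray(noisy) == np.asarray(gt)
    return {
        "PSNR": psnr(pred, gt),
        "PSNR_t": psnr(pred, gt, correct),
        "PSNR_f": psnr(pred, gt, ~correct),
        "RMSE": rmse(pred, gt),
    }

# Toy example: one erroneous input pixel whose error is reduced from 50 to 5.
gt = np.zeros((4, 4))
noisy = gt.copy()
noisy[0, 0] = 50.0
pred = gt.copy()
pred[0, 0] = 5.0
metrics = depth_metrics(pred, gt, noisy)
```

Reporting $\mathrm{PSNR}_t$ and $\mathrm{PSNR}_f$ separately shows whether a network repairs the erroneous pixels without damaging the pixels that were already correct, which a single PSNR over all pixels cannot reveal.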

Ablation study
We evaluate the performance of the proposed IGDE system under different settings. First, we train the IGDE network with different sets of loss functions to verify that the depth focal loss and the Sobel loss improve the network predictions. The prediction results with and without the proposed loss functions are shown in Table 2. The comparison shows that the two proposed loss functions clearly help achieve better results.
To demonstrate the effectiveness of concatenating three layers of low-level features from the image branch with those of the depth branch, we also reduce the number of concatenated layers of low-level features. Since the deeper layers of the color image frame are more important to the depth map, we remove the shallower layers of color information first. The comparison results are shown in Table 3. The results show that concatenating three different layers of color information to the depth branch generates the best prediction results.

Conclusion
In this paper, we propose an image-guided depth enhancement system that extracts features from color images to enhance the depth values at object boundaries through a residual dense network. To enable the network to focus more on enhancing the depth values at object boundaries, we propose a Sobel loss that increases the weight of object edges.
Drawing on the concept of focal loss used in object detection, we further propose a depth focal loss to improve the prediction performance of the network. In addition, including color information in the first half of the depth branch benefits depth map restoration. To create a training dataset from the Scene Flow dataset, we simulate the situation in which the depth values at object boundaries are mismatched with the color image. Trained on this dataset and compared with other advanced methods, the proposed IGDE system obtains the best prediction results on the evaluated quality measures.
Finally, the ablation study shows that each component proposed in this paper effectively improves the prediction results.