An image-guided network for depth edge enhancement
EURASIP Journal on Image and Video Processing volume 2022, Article number: 6 (2022)
Abstract
With the rapid development of 3D coding and display technologies, numerous applications are emerging that target immersive entertainment. High-accuracy depth maps play a crucial role in achieving a high-quality 3D visual experience. However, depth maps retrieved from most devices still suffer from inaccuracies at object boundaries, so a depth enhancement system is usually needed to correct the errors. Recent work applying deep learning to depth enhancement has shown promising improvements. In this paper, we propose a deep depth enhancement network that effectively corrects inaccurate depth using color images as a guide. The proposed network contains both depth and image branches, where we combine a new set of features from the image branch with those from the depth branch. Experimental results show that the proposed system achieves better depth correction performance than state-of-the-art networks. The ablation study reveals that the proposed loss functions, combined with the use of image information, effectively enhance depth map accuracy.
Introduction
With the rapid development of 3D media and display technologies, 3D immersive multimedia services are being explored to address new demands. For example, 3D virtual reality has been widely used in various fields, such as medicine, games, and education. In recent years, an increasing number of 3D movies have been made commercially available in cinemas. As display technology improves, glasses-based 3D displays for watching stereo 3D videos are expected to be replaced eventually by naked-eye 3D displays, which require no glasses.
Supporting naked-eye 3D displays requires multiple view images. In depth image-based rendering (DIBR) [1], 3D information is represented by a color image frame and its corresponding depth map. Pixel-based perspective multi-view images are then generated from the color image frame and the depth map. Multi-view information has also been shown to help image retrieval [2]. To achieve a high-quality and comfortable 3D viewing experience, it is crucial to have an accurate depth map.
Depth maps are commonly acquired either by depth sensors [3] or by stereo matching of image pairs [4, 5]. However, the depth map generated by depth sensors is often noisy, has missing depth values, and tends to be misaligned with the object boundaries in the color image. The depth map estimated by stereo matching methods often contains errors in occluded and flat regions. Recently, much work has been done to predict depth maps from 2D videos, for example by using monocular depth estimation networks [6,7,8] or by creating key depth frames manually and then interpolating the depth maps [9]. The generated depth map suffers either from blurred edges, as shown in Fig. 1, or from other inevitable errors caused by manual annotation along object boundaries. To achieve high-quality, accurate depth maps, a depth enhancement system is needed to remove the noise and correct the depth errors.
Without loss of generality, we consider a typical scenario in which a limited-quality depth map containing noise and errors is retrieved from a pair of stereo color images. Multiple approaches have been proposed to further enhance the depth map quality by exploring the pixel correlation and structural relationship between the color image and the depth map. However, when the color image contains highly textured objects, enhancement based on correlation calculation often causes texture-like artifacts in the depth map. Traditional depth enhancement approaches apply adaptive filters to smooth or repair noisy depth maps. Gaussian filtering-based methods [10] estimate the missing depth values from the known neighboring depth values. Joint bilateral filtering (JBF)-based methods [11, 12], an extension of bilateral filtering [13], use information from the color frame to obtain more accurate depth map enhancement. However, these frameworks usually generate blurry depth results since they cannot be trained to focus on the area around object boundaries.
Recently, deep learning has been introduced in various image processing applications, such as image super-resolution [14, 15], image restoration [16], image denoising [17, 18], and depth denoising by graph networks [19]. The denoising and enhancement convolutional neural network (DECNN) [20] and the image-guided method [21] have also been adopted for depth enhancement.
In this paper, we propose an end-to-end framework for depth enhancement that takes a color frame and a noisy depth map as inputs and outputs the enhanced depth map. The rest of the paper is organized as follows. In Sect. 2, we briefly review related work, including the deep residual convolution neural network (DRECNN) [22], residual and residual dense networks [15, 23], and the focal loss [24]. In Sect. 3, we describe in detail the proposed image-guided depth enhancement (IGDE) system. In Sect. 4, we present the experimental results. Finally, we draw conclusions in Sect. 5.
Related work
The DRECNN is a typical framework that performs depth enhancement well. It first learns the underlying correlation between the depth map and the color image, and then applies the learned correlation to enhance the quality of the depth map. As shown in Fig. 2, the DRECNN architecture is divided into a depth branch, an intensity branch, and a fusion module. The depth and intensity branches have the same structure, consisting of one set of convolution and ReLU layers followed by seven sets of convolution, batch normalization, and ReLU layers, to retrieve the depth and intensity feature maps. The fusion module applies eleven sets of convolution, batch normalization, and ReLU layers and a final convolution layer to retrieve the fusing coefficient maps. Following the concept of the image-guided filter [21], the filter output \(\tilde{I}_{i}\) is given as
\[\tilde{I}_{i} = a_{k} I_{i} + b_{k}, \quad \forall i \in \omega_{k},\]
where I is the input guidance image, and a_{k} and b_{k} are the linear coefficients assumed to be constant in the window ω_{k}. Extending this concept, the DRECNN with the fusion module retrieves the pixel-level fusing coefficient maps a and b. As shown in Fig. 2, the residual depth map can be obtained by
where Y is the luminance of the color image and D is the depth map. The DRECNN effectively improves depth enhancement and avoids the overfitting problem by finding a linear model supervised by the ground-truth label.
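The local linear model behind the guided filter can be illustrated with a short NumPy sketch. It fits a_{k} and b_{k} on a single window using the closed-form least-squares solution from the guided filter paper [21]; the function name `fit_local_linear`, the argument names, and the regularizer `eps` are illustrative assumptions, and the sketch is not the DRECNN fusion module, which predicts the coefficient maps with a network instead.

```python
import numpy as np

def fit_local_linear(I_win, p_win, eps=1e-3):
    """Fit q = a * I + b on one window (guided filter closed form).

    I_win: guidance values in the window (e.g. luminance).
    p_win: values to be filtered in the same window (e.g. depth).
    eps:   regularizer that keeps `a` small in flat guidance regions.
    """
    mean_I = I_win.mean()
    mean_p = p_win.mean()
    cov_Ip = ((I_win - mean_I) * (p_win - mean_p)).mean()
    var_I = I_win.var()
    a = cov_Ip / (var_I + eps)
    b = mean_p - a * mean_I
    return a, b

# Usage: a window where the target is an exact linear function of the guide.
I_win = np.array([[0.0, 1.0], [2.0, 3.0]])
p_win = 2.0 * I_win + 3.0
a, b = fit_local_linear(I_win, p_win, eps=0.0)
```

With `eps=0.0` the fit recovers the slope and offset of the linear relation; a positive `eps` trades fidelity for smoothing.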
After AlexNet [25] was proposed, state-of-the-art CNN architectures commonly adopted a large number of layers. However, simply adding convolutional layers does not guarantee better results because of the vanishing gradient problem, which batch normalization only partially solves. ResNet [23], which uses residual blocks that add the original input to the output through a shortcut connection, effectively solves the degradation problem caused by increasing the number of network layers. Since then, residual blocks have been modified to build various high-performance networks. EDSR [26] removes batch normalization to speed up convergence. DenseNet [27] achieves similar results with a much smaller number of parameters. SRDenseNet [28] applies DenseNet to image super-resolution effectively. Figure 3 shows the structures of the residual block, dense block, and residual dense block.
For image super-resolution, the architecture of the residual dense network (RDN) [15], shown in Fig. 4, is composed of multiple residual dense blocks. The features generated by the preceding convolution layers are concatenated and passed to a 1 × 1 convolution layer that reduces the number of channels. The RDN consists of N residual dense blocks (RDBs), which transform the low-resolution (LR) image into the high-resolution (HR) image. The RDN preserves the details of the LR image and performs suitable image corrections to obtain the HR image.
In deep learning CNNs, it is important to design suitable loss functions to train the target network. Loss functions based on the mean square error (MSE) and mean absolute error (MAE) are often used in regression problems, while the cross-entropy is used in classification problems. To improve the speed and direction of network convergence, special loss functions have been proposed. Taking binary classification as an example, the cross-entropy loss is given as
\[\mathrm{CE}(p, y) = \begin{cases} -\log(p), & y = +1, \\ -\log(1 - p), & y = -1, \end{cases}\]
where y ∈ {+1, −1} denotes the label and p represents the predicted probability that the sample belongs to class +1. Summing the cross-entropy over all samples gives the loss of the network. To correct the imbalance between binary samples, the focal loss used in RetinaNet [24] is defined as
\[\mathrm{FL}(p_{t}) = -\alpha (1 - p_{t})^{\gamma} \log(p_{t}), \qquad p_{t} = \begin{cases} p, & y = +1, \\ 1 - p, & y = -1, \end{cases}\]
where γ > 0 is a focusing parameter that reduces the relative loss for well-classified examples with p_{t} > 0.5, and α is the shared weight that balances positive and negative samples. In comparison with two-stage object detectors such as Faster R-CNN [29] and R-FCN [30], the focal loss helps one-stage object detectors such as YOLO [31] and SSD [32] obtain higher performance. Because a one-stage detector sees a large imbalance between the numbers of positive and negative samples during training, α is used to reduce the influence of negative samples, together with the modulating factor (1 − p_{t})^{γ}. The modulating factor reduces the weight of easy-to-classify samples so that the network pays more attention to difficult-to-classify samples. The effectiveness of the focal loss has been demonstrated in many advanced networks.
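The effect of the modulating factor can be checked numerically with a minimal NumPy sketch. It uses the single shared weight α described above rather than the class-dependent α_{t} of the RetinaNet paper; the function names and the clipping constant `eps` are illustrative assumptions.

```python
import numpy as np

def _p_t(p, y, eps=1e-7):
    """Probability assigned to the true class, clipped to avoid log(0)."""
    p_t = np.where(y == 1, p, 1.0 - p)
    return np.clip(p_t, eps, 1.0)

def cross_entropy(p, y):
    """Per-sample binary cross-entropy for labels y in {+1, -1}."""
    return -np.log(_p_t(p, y))

def focal_loss(p, y, alpha=0.25, gamma=2.0):
    """Per-sample focal loss: cross-entropy scaled by alpha * (1 - p_t)^gamma."""
    p_t = _p_t(p, y)
    return -alpha * (1.0 - p_t) ** gamma * np.log(p_t)

# Usage: one easy and one harder positive sample.
p = np.array([0.9, 0.6])
y = np.array([1, 1])
ratio = focal_loss(p, y) / cross_entropy(p, y)  # alpha * (1 - p_t)^gamma
```

The ratio shows how strongly each sample is down-weighted relative to plain cross-entropy: the easy sample (p = 0.9) is suppressed far more than the harder one (p = 0.6).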
The methods
The guided image filter [21] uses the correlation between color and depth maps to enhance the noisy depth map. However, images with complex textures often degrade the depth map with ghosting textures. Learning-based methods mitigate the strong influence of image texture, but the enhanced depth maps often still contain inaccurate depth at object boundaries. To address this issue, we propose an end-to-end depth map enhancement system that focuses mainly on correcting the depth edges.
The IGDE network
The proposed image-guided depth enhancement (IGDE) network, as shown in Fig. 5, consists of two feature extractors, one fusion module, and one depth refinement module. It has been noted that task-adaptive attention [33] and multi-feature fusion [34] can improve the performance of image captioning and recognition, respectively. We employ the residual dense network (RDN) as the backbone of the feature extractor and the depth refinement module. We extract features from the image and depth frames, and concatenate them as the fused feature. The fused feature is sent to the depth refinement module to obtain the enhanced depth map.
Figure 6 shows the detailed structure of the feature extractor. In the early stages of the network, the low-level features of the image frame are concatenated with the low-level features of the depth map. In the experimental section, we examine the number of layers at which low-level image features are concatenated.
At the end of the network, we convert the fused feature into the enhanced depth map with the depth refinement module. Here, we use the same residual dense network as the backbone of the depth refinement module. The features obtained by the residual dense network are restored to a depth map by a 1 × 1 convolution. The detailed architecture of the depth refinement module is shown in Fig. 7. All convolutions in the module are followed by ReLU activations to introduce nonlinearity.
Loss functions
To train the proposed IGDE network, we refine the noisy depth values at object boundaries to be consistent with the image frame. Typically, a depth loss function accumulates the depth error over all pixels in the depth map. Since the number of object boundary pixels is much lower than the number of pixels in the whole image, the impact of errors at object boundaries is often diluted. We propose to add a special depth focal loss that assigns lower weights to pixels with smaller depth deviations and higher weights to pixels with larger deviations. In addition, we design a Sobel loss to emphasize the depth deviation at object boundaries. The total loss function, including depth loss L_{depth}, depth focal loss L_{focal}, and Sobel loss L_{sobel}, becomes
where d and d^{*} are the predicted depth value and the corresponding ground truth, respectively; M_{sobel} is the mask that focuses on the object boundaries; and ρ, μ, and λ are the weighting factors of the losses. We set all of them to 1. The details of each loss function are described as follows.
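Read together with the definitions above, the total loss is presumably a weighted sum of the three terms. The expression below is a sketch under that assumption; the symbol L_{total} and the pairing of ρ, μ, and λ with the individual losses (taken in the order in which the losses are listed) are not confirmed by the text. Because all three weights are set to 1, the pairing does not change the configuration used in the experiments.

```latex
L_{total} = \rho \, L_{depth}(d, d^{*})
          + \mu \, L_{focal}(d, d^{*})
          + \lambda \, L_{sobel}(d, d^{*}, M_{sobel})
```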
Depth loss
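To minimize the difference between the predicted depth map d and the corresponding ground truth d^{*}, we use the L1 loss as
\[L_{depth} = \frac{1}{HW} \sum_{i=1}^{H} \sum_{j=1}^{W} \left| d_{i,j} - d^{*}_{i,j} \right|,\]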
where i and j denote the pixel indices, and H and W are the height and width of depth maps, respectively.
Depth focal loss
When refining a noisy depth map, the erroneous depth pixels occupy only a small portion of the depth map. We aim to train the network to focus more on the erroneous pixels than on the correct ones. Hence, we suggest the depth focal loss as
with
where α and γ are the shared weight and the focusing parameter, respectively. Currently, we set α to 0.25 and γ to 2. Similar to the focal loss [24], e_{i,j} can be treated as the probability that the predicted depth is correct, which is close to 1 in most cases. In (8), the ratio of the difference between the predicted and ground-truth values to the maximum depth value 255 plays a role similar to that of positive and negative samples in the focal loss. With the depth focal loss, our network places more emphasis on the pixels with a higher error ratio, which makes the training results more accurate.
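One plausible reading of this description can be sketched in NumPy. The sketch assumes e_{i,j} = 1 − |d_{i,j} − d^{*}_{i,j}| / 255 and a per-pixel term −α (1 − e_{i,j})^{γ} log(e_{i,j}) averaged over the map; both forms, the function name, and the clipping constant `eps` are assumptions made for illustration, not the authors' exact definition.

```python
import numpy as np

def depth_focal_loss(d, d_star, alpha=0.25, gamma=2.0, max_depth=255.0, eps=1e-7):
    """Focal-style depth loss (assumed form).

    e is near 1 where the prediction matches the ground truth, so the
    factor (1 - e)^gamma suppresses well-predicted pixels and lets the
    pixels with large deviations dominate the average.
    """
    ratio = np.abs(d.astype(np.float64) - d_star.astype(np.float64)) / max_depth
    e = np.clip(1.0 - ratio, eps, 1.0)  # clip to avoid log(0)
    per_pixel = -alpha * (1.0 - e) ** gamma * np.log(e)
    return per_pixel.mean()

# Usage: the same ground truth with a small and a large uniform deviation.
gt = np.zeros((4, 4))
loss_small = depth_focal_loss(gt + 5.0, gt)
loss_large = depth_focal_loss(gt + 50.0, gt)
```

Under these assumptions a tenfold larger deviation increases the loss by far more than a factor of ten, which is the intended focusing behavior.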
Sobel loss
Since the depth map contains errors mostly near object boundaries, we design a Sobel loss to make the network focus more on areas close to object boundaries. The Sobel loss is expressed as
where M_{sobel} is a depth edge mask and β is the parameter used to control the importance of the edge area. We set M_{sobel} to 1 for pixels close to object boundaries and to 0 for the remaining pixels. Here, we choose β = 0.9.
To compute the depth edge mask, as shown in Fig. 8, we first perform Sobel edge detection on the input depth map to obtain the Sobel edges. Then, the detected edges are expanded by a dilation operator. Finally, we apply the Otsu thresholding method to compute the depth edge mask. In (9), the depth error pixels inside the depth edge mask are weighted by 0.9, while those outside the depth edge mask are weighted by 0.1. The parameter β can be used to adjust the weight of the area near the depth edges relative to the rest of the area.
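The mask construction and the weighting described above can be sketched as follows. This is a minimal NumPy sketch, not the authors' implementation: the function names, the 3 × 3 dilation window, the 256-bin histogram used for Otsu thresholding, and the use of an absolute (L1) error inside the Sobel loss are all assumptions. Only the order of the steps (Sobel, dilation, Otsu) and the β and 1 − β weights come from the text.

```python
import numpy as np

def sobel_magnitude(depth):
    """Gradient magnitude from 3x3 Sobel kernels with edge-replicated padding."""
    p = np.pad(depth.astype(np.float64), 1, mode="edge")
    gx = (p[:-2, 2:] + 2 * p[1:-1, 2:] + p[2:, 2:]
          - p[:-2, :-2] - 2 * p[1:-1, :-2] - p[2:, :-2])
    gy = (p[2:, :-2] + 2 * p[2:, 1:-1] + p[2:, 2:]
          - p[:-2, :-2] - 2 * p[:-2, 1:-1] - p[:-2, 2:])
    return np.hypot(gx, gy)

def dilate(img, radius=1):
    """Grey-scale dilation: maximum over a (2r+1) x (2r+1) neighbourhood."""
    p = np.pad(img, radius, mode="edge")
    h, w = img.shape
    out = img.copy()
    for dy in range(2 * radius + 1):
        for dx in range(2 * radius + 1):
            out = np.maximum(out, p[dy:dy + h, dx:dx + w])
    return out

def otsu_threshold(values, bins=256):
    """Threshold that maximises the between-class variance of the histogram."""
    hist, edges = np.histogram(values, bins=bins)
    centers = 0.5 * (edges[:-1] + edges[1:])
    w0 = np.cumsum(hist).astype(np.float64)   # pixels at or below each bin
    w1 = w0[-1] - w0                          # pixels above each bin
    m0 = np.cumsum(hist * centers)
    mu0 = m0 / np.maximum(w0, 1)
    mu1 = (m0[-1] - m0) / np.maximum(w1, 1)
    between = w0 * w1 * (mu0 - mu1) ** 2
    return centers[np.argmax(between)]

def depth_edge_mask(depth, radius=1):
    """Sobel edges -> dilation -> Otsu threshold -> binary mask (1 near edges)."""
    edges = dilate(sobel_magnitude(depth), radius)
    return (edges > otsu_threshold(edges)).astype(np.float64)

def sobel_loss(d, d_star, mask, beta=0.9):
    """Absolute error weighted by beta inside the mask and 1 - beta outside."""
    weight = beta * mask + (1.0 - beta) * (1.0 - mask)
    return (weight * np.abs(d - d_star)).mean()

# Usage: a vertical step edge between columns 3 and 4 of an 8 x 8 depth map.
depth = np.zeros((8, 8))
depth[:, 4:] = 100.0
mask = depth_edge_mask(depth)
```

In this toy example the mask covers a band of columns around the step, so an error placed inside the band contributes nine times as much to the loss as the same error placed in a flat region when β = 0.9.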
Results and discussions
The proposed IGDE system is implemented in Python 3.7 with CUDA 10.2, cuDNN 7.6.5, and the TensorFlow-GPU 1.15.0 learning library. The hardware is a personal computer with an Intel Core i7-9700K CPU (3.6 GHz to 4.9 GHz) and 32 GB of 3200 MHz RAM. An NVIDIA GeForce RTX 2080 Ti GPU with 11 GB of memory is used to accelerate the training of the proposed IGDE system.
We evaluate the proposed IGDE system on the Scene Flow dataset [37], which is a large-scale synthetic dataset containing the Flyingthings3D, Monkaa, and Driving subdatasets. Some selected examples from the dataset are shown in Fig. 9. Compared to other datasets, it has more accurate ground-truth depth maps since they are generated from virtual images. The images in the dataset are divided into 70,908 training images and 8740 testing images with H = 540 and W = 960. We crop the images to H = 512 and W = 960 for the proposed network.
To simulate depth maps with erroneous edges, we randomly inflate or reduce depth values at object boundaries in all depth maps. We then take the simulated depth maps and the corresponding color maps as input to the proposed system. During training, we use a batch size of 2 and randomly crop the images to H = 280 and W = 480. To improve the prediction accuracy, we normalize the input images by dividing them by 255. We train our network with a learning rate of 0.0001 for 50 epochs.
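The cropping and normalization steps can be sketched as below. The sketch assumes that the color frame and the depth map are cropped at the same location and that both are divided by 255; the function name `random_paired_crop` and its parameters are hypothetical, and the boundary perturbation itself is not reproduced because the text does not specify it in detail.

```python
import numpy as np

def random_paired_crop(image, depth, crop_h=280, crop_w=480, rng=None):
    """Crop the same random window from a color frame and its depth map,
    then scale both to [0, 1] by dividing by 255."""
    rng = np.random.default_rng() if rng is None else rng
    h, w = depth.shape[:2]
    top = int(rng.integers(0, h - crop_h + 1))
    left = int(rng.integers(0, w - crop_w + 1))
    img = image[top:top + crop_h, left:left + crop_w].astype(np.float32) / 255.0
    dep = depth[top:top + crop_h, left:left + crop_w].astype(np.float32) / 255.0
    return img, dep

# Usage: one 512 x 960 training pair with 8-bit values.
rng = np.random.default_rng(0)
image = np.full((512, 960, 3), 255, dtype=np.uint8)
depth = np.full((512, 960), 128, dtype=np.uint8)
img, dep = random_paired_crop(image, depth, rng=rng)
```

Using one shared window for both inputs keeps the color edges and depth edges spatially aligned, which the image-guided branches rely on.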
Visualization performance of the network
Figure 10 shows the visual results of testing on the Flyingthings3D subdataset. The depth map refined by the proposed IGDE system is accurate at object boundaries. A comparison of the error maps before and after refinement shows that the number of error points is significantly reduced.
To test the effectiveness of the proposed network on other data, the trained network is directly applied to the Middlebury dataset. Figure 11a and b show four original natural images and their corresponding ground-truth depth maps with unknown holes, respectively; the latter are treated as noisy depth maps. The holes are first filled by simply extending the known depth values vertically from the bottom, and the result is then enhanced by the proposed IGDE system. Figure 11c and d show the enhanced depth maps and the error depth maps, respectively. For natural images, the exact depth values in the unknown holes are not available, so an evaluation like the one performed on the graphic images is not possible there. Nevertheless, the depth maps refined by the proposed IGDE system show reasonably good quality, and the IGDE system can enhance the depth maps of natural images successfully. For a detailed evaluation of the performance, we present numerical comparisons with other methods in the next subsection.
Comparisons with quality measures
We compare the performance of the proposed IGDE system with three depth refinement networks, namely, the denoising and enhancement CNN (DECNN) [20], the deep residual enhancement CNN (DRECNN) [22], and the depth enhancement network with color-based prediction network (DEN + CBPN) [35]. The single-branch DECNN concatenates the depth map and color image as the input, while the proposed IGDE, DRECNN, and DEN + CBPN systems use two branches and fuse the depth and image features in different ways. When ground-truth values are unavailable, a no-reference quality measure [36] can be used instead. With the ground-truth depth maps, Table 1 shows the comparison results on the Scene Flow testing set [37]. We use common quality measures, namely PSNR in dB, SSIM, and RMSE, to evaluate the performance of all networks. In addition, we calculate PSNR_{t} and PSNR_{f} for the correct and erroneous depth pixels of the prediction results, respectively. Table 1 shows that the proposed IGDE achieves the best results, which are marked in boldface. In Tables 2 and 3, the best results are also marked in boldface.
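The PSNR, PSNR_{t}, and PSNR_{f} in dB are defined as follows:
\[\mathrm{PSNR} = 10 \log_{10} \frac{255^{2}}{\mathrm{MSE}_{all}}, \qquad \mathrm{PSNR}_{t} = 10 \log_{10} \frac{255^{2}}{\mathrm{MSE}_{t}}, \qquad \mathrm{PSNR}_{f} = 10 \log_{10} \frac{255^{2}}{\mathrm{MSE}_{f}},\]
where MSE_{all}, MSE_{t}, and MSE_{f} denote the mean square error of all, correct, and incorrect depth pixels, respectively. The structural similarity (SSIM) measure is defined as
\[\mathrm{SSIM}(x, y) = l(x, y) \cdot c(x, y) \cdot s(x, y),\]
where l(x, y), c(x, y), and s(x, y) denote the luminance, contrast, and structure measures of x and y, which are, respectively, defined as
\[l(x, y) = \frac{2 u_{x} u_{y} + C_{1}}{u_{x}^{2} + u_{y}^{2} + C_{1}}, \qquad c(x, y) = \frac{2 \sigma_{x} \sigma_{y} + C_{2}}{\sigma_{x}^{2} + \sigma_{y}^{2} + C_{2}}, \qquad s(x, y) = \frac{\sigma_{xy} + C_{3}}{\sigma_{x} \sigma_{y} + C_{3}},\]
where u_{x} and u_{y} are the averages of x and y; σ_{x} and σ_{y} represent the standard deviations of x and y, respectively; σ_{xy} denotes the covariance of x and y; and C_{1}, C_{2}, and C_{3} are constants that stabilize the division when the denominator is weak. The RMSE is defined as
\[\mathrm{RMSE} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left( d_{i} - d_{i}^{*} \right)^{2}},\]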
where N denotes the total number of pixels in the prediction result, and d_{i}^{*} and d_{i} indicate the i-th ground-truth depth value and the i-th predicted depth value, respectively. Because their source code is not available, we implemented the networks of the three selected depth refinement methods ourselves and trained them with our training configuration. Based on the comparison results, the proposed IGDE system achieves the best performance.
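The PSNR variants and the RMSE can be computed with a few lines of NumPy. In this sketch, the peak value 255 follows the maximum depth value stated earlier, while the way correct and incorrect pixels are identified (by comparing the noisy input depth with the ground truth) is an assumption, as are the function and variable names.

```python
import numpy as np

def psnr(pred, gt, region=None, peak=255.0):
    """PSNR in dB over all pixels, or over a boolean `region` if given.
    `region` should select at least one pixel."""
    err = (pred.astype(np.float64) - gt.astype(np.float64)) ** 2
    if region is not None:
        err = err[region]
    mse = err.mean()
    return float("inf") if mse == 0 else 10.0 * np.log10(peak ** 2 / mse)

def rmse(pred, gt):
    """Root mean square error over all pixels."""
    diff = pred.astype(np.float64) - gt.astype(np.float64)
    return np.sqrt((diff ** 2).mean())

# Usage: one pixel of the noisy input is wrong; the prediction only partly fixes it.
gt = np.zeros((4, 4))
noisy = gt.copy()
noisy[0, 0] = 50.0
pred = gt.copy()
pred[0, 0] = 10.0

correct = noisy == gt                 # assumed definition of "correct" pixels
psnr_all = psnr(pred, gt)
psnr_t = psnr(pred, gt, correct)      # pixels that were already correct
psnr_f = psnr(pred, gt, ~correct)     # pixels that were erroneous in the input
```

Separating the two regions shows whether a network repairs the erroneous pixels (PSNR_{f}) without disturbing the pixels that were already correct (PSNR_{t}).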
Ablation study
We evaluate the performance of the proposed IGDE system with different settings. First, we train the IGDE network with different sets of loss functions to verify that the depth focal loss and the Sobel loss improve the network's predictions. The prediction results with and without the proposed loss functions are shown in Table 2. The comparison shows that the two proposed loss functions clearly help achieve better results.
To demonstrate the effectiveness of concatenating three layers of low-level features of the image branch with those of the depth branch, we also reduce the number of concatenated layers of low-level features. Since the deeper layers of the color image features are more important to the depth map, we remove the color information fed to the depth branch starting from the shallow layers. The comparison results are shown in Table 3. The results show that concatenating three different layers of color map information to the depth branch generates the best prediction results.
Conclusion
In this paper, we propose an image-guided depth enhancement system that extracts the features of color images to enhance the depth values at object boundaries through the residual dense network. To enable the network to focus more on enhancing the depth values at object boundaries, we propose the Sobel loss, which increases the weight of object edges. Building on the concept of the focal loss used in object detection, we further propose the depth focal loss to improve the prediction performance of the network. In addition, including color information in the first half of the depth branch benefits depth map restoration. To create a training dataset from the Scene Flow dataset, we simulate the situation in which the depth values at object boundaries are intentionally mismatched with the color map. When trained on this dataset and compared with other advanced methods, the proposed IGDE system obtains the best prediction results across the reported quality measures. Finally, the ablation study shows that each component proposed in this paper effectively improves the prediction results.
Availability of data and materials
The color images with corresponding depth maps are obtained from the Scene Flow dataset [37]. The datasets generated for the current study are available from the corresponding author on reasonable request.
Abbreviations
3D: Three-dimensional
DIBR: Depth image-based rendering
2D: Two-dimensional
JBF: Joint bilateral filtering
SAD: Summation of absolute differences
DECNN: Denoising and enhancement convolutional neural network
DRECNN: Deep residual convolution neural network
IGDE: Image-guided depth enhancement
ReLU: Rectified linear unit
AlexNet: Alex network
CNN: Convolutional neural network
EDSR: Enhanced deep residual network
DenseNet: Densely connected network
SRDenseNet: Super-resolution DenseNet
RDN: Residual dense network
RDB: Residual dense block
LR: Low resolution
HR: High resolution
MSE: Mean square error
MAE: Mean absolute error
R-CNN: Region-based convolutional neural networks
R-FCN: Region-based fully convolutional network
YOLO: You only look once
SSD: Single shot multibox detector
L1: L1 norm
CUDA: Compute unified device architecture
cuDNN: CUDA^{®} deep neural network
GPU: Graphics processing unit
PSNR: Peak signal-to-noise ratio
dB: Decibel
SSIM: Structural similarity
RMSE: Root mean square error
References
[1] S.C. Chan, H.Y. Shum, K.T. Ng, Image-based rendering and synthesis: technological advances and challenges. IEEE Signal Process. Mag. 24(6), 22–33 (2007). https://doi.org/10.1109/Msp.2007.905702
[2] C. Yan, B. Gong, Y. Wei, Y. Gao, Deep multi-view enhancement hashing for image retrieval. IEEE Trans. Pattern Anal. Mach. Intell. 43(4), 1445–1451 (2021). https://doi.org/10.1109/TPAMI.2020.2975798
[3] K. Tang, L. Shi, S. Guo, S. Pan, H. Xing, S. Su, P. Guo, Z. Chen and Y. He, “Vision locating method based RGB-D camera for amphibious spherical robots”, in IEEE International Conference on Mechatronics and Automation (ICMA), (2017)
[4] H.M. Zhu, J.H. Yin, D. Yuan, SVCV: segmentation volume combined with cost volume for stereo matching. IET Comput. Vision 11(8), 733–743 (2017). https://doi.org/10.1049/iet-cvi.2016.0446
[5] N.Y.C. Chang, T.H. Tsai, B.H. Hsu, Y.C. Chen, T.S. Chang, Algorithm and architecture of disparity estimation with mini-census adaptive support weight. IEEE Trans. Circuits Syst. Video Technol. 20(6), 792–805 (2010). https://doi.org/10.1109/Tcsvt.2010.2045814
[6] A. Roy and S. Todorovic, “Monocular depth estimation using neural regression forest”, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, (2016)
[7] C. Godard, O. Mac Aodha and G.J. Brostow, “Unsupervised monocular depth estimation with left-right consistency”, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, (2017)
[8] S. Kumari, R.R. Jha, A. Bhavsar and A. Nigam, “Autodepth: Single image depth map estimation via residual cnn encoder-decoder and stacked hourglass”, in IEEE International Conference on Image Processing (ICIP), (2019)
[9] H.M. Wang, C.H. Huang, J.F. Yang, Block-based depth maps interpolation for efficient multiview content generation. IEEE Trans. Circuits Syst. Video Technol. 21(12), 1847–1858 (2011)
[10] K.R. Vijayanagar, M. Loghman and J. Kim, “Refinement of depth maps generated by low-cost depth sensors”, in International SoC Design Conference (ISOCC), (2012)
[11] O.P. Gangwal and B. Djapic, “Real-time implementation of depth map postprocessing for 3DTV in dedicated hardware”, in Digest of Technical Papers International Conference on Consumer Electronics (ICCE), (2010)
[12] J. Kopf, M.F. Cohen, D. Lischinski, M. Uyttendaele, Joint bilateral upsampling. ACM Trans. Graph. (ToG) 26(3), 96 (2007)
[13] C. Tomasi and R. Manduchi, “Bilateral filtering for gray and color images”, in Sixth International Conference on Computer Vision (IEEE Cat. No. 98CH36271), (1998)
[14] C. Dong, C.C. Loy, K. He, X. Tang, Image super-resolution using deep convolutional networks. IEEE Trans. Pattern Anal. Mach. Intell. 38(2), 295–307 (2015)
[15] Y. Zhang, Y. Tian, Y. Kong, B. Zhong and Y. Fu, “Residual dense network for image super-resolution”, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, (2018)
[16] Y.T. Zhou, R. Chellappa, A. Vaid, B.K. Jenkins, Image restoration using a neural network. IEEE Trans. Acoust. Speech Signal Process. 36(7), 1141–1151 (1988)
[17] K. Zhang, W. Zuo, Y. Chen, D. Meng, L. Zhang, Beyond a Gaussian Denoiser: residual learning of deep CNN for image denoising. IEEE Trans. Image Process. 26(7), 3142–3155 (2017)
[18] H.C. Burger, C.J. Schuler and S. Harmeling, “Image denoising: Can plain neural networks compete with BM3D?”, in IEEE Conference on Computer Vision and Pattern Recognition, (2012)
[19] C. Yan, Z. Li, Y. Zhang, Y. Liu, X. Ji, Y. Zhang, Depth image denoising using nuclear norm and learning graph model. ACM Trans. Multimed. Comput. Commun. Appl. 16(4), 1–17 (2020). https://doi.org/10.1145/3404374
[20] X. Zhang and R. Wu, “Fast depth image denoising and enhancement using a deep convolutional network”, in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), (2016)
[21] K. He, J. Sun, X. Tang, Guided image filtering. IEEE Trans. Pattern Anal. Mach. Intell. 35(6), 1397–1409 (2012)
[22] J. Zhu, J. Zhang, Y. Cao and Z. Wang, “Image guided depth enhancement via deep fusion and local linear regularization”, in IEEE International Conference on Image Processing (ICIP), (2017)
[23] K. He, X. Zhang, S. Ren and J. Sun, “Deep residual learning for image recognition”, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, (2016)
[24] T.Y. Lin, P. Goyal, R. Girshick, K. He and P. Dollár, “Focal loss for dense object detection”, in Proceedings of the IEEE International Conference on Computer Vision, (2017)
[25] A. Krizhevsky, I. Sutskever, G.E. Hinton, Imagenet classification with deep convolutional neural networks. Adv. Neural. Inf. Process. Syst. 25, 1097–1105 (2012)
[26] B. Lim, S. Son, H. Kim, S. Nah and K. Mu Lee, “Enhanced deep residual networks for single image super-resolution”, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, (2017)
[27] G. Huang, Z. Liu, L. Van Der Maaten and K.Q. Weinberger, “Densely connected convolutional networks”, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, (2017)
[28] T. Tong, G. Li, X. Liu and Q. Gao, “Image super-resolution using dense skip connections”, in Proceedings of the IEEE International Conference on Computer Vision, (2017)
[29] S. Ren, K. He, R. Girshick, J. Sun, Faster R-CNN: towards real-time object detection with region proposal networks. Adv. Neural. Inf. Process. Syst. 28, 91–99 (2015)
[30] J. Dai, Y. Li, K. He and J. Sun, “R-FCN: Object detection via region-based fully convolutional networks”, in Advances in Neural Information Processing Systems, (2016)
[31] J. Redmon, S. Divvala, R. Girshick and A. Farhadi, “You only look once: Unified, real-time object detection”, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, (2016)
[32] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.Y. Fu and A.C. Berg, “SSD: Single shot multibox detector”, in European Conference on Computer Vision, (2016)
[33] C. Yan, Y. Hao, L. Li, J. Yin, A. Liu, Z. Mao, Z. Chen, X. Gao, Task-adaptive attention for image captioning. IEEE Trans. Circuits Syst. Video Technol. 32(1), 43–51 (2022). https://doi.org/10.1109/TCSVT.2021.3067449
[34] C. Yan, L. Meng, L. Li, J. Zhang, J. Yin, J. Zhang, Z. Wang, B. Zheng, Age-invariant face recognition by multi-feature fusion and decomposition with self-attention. ACM Trans. Multimed. Comput. Commun. Appl. 18(1), 1–18 (2022). https://doi.org/10.1145/3472810
[35] W. Zhou, X. Li and D. Reynolds, “Guided deep network for depth map super-resolution: How much can color help?”, in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), (2017)
[36] C. Yan, T. Teng, Y. Liu, Y. Zhang, H. Wang, X. Ji, Precise no-reference image quality evaluation based on distortion identification. ACM Trans. Multimed. Comput. Commun. Appl. 17(3), 1–21 (2021). https://doi.org/10.1145/3468872
[37] N. Mayer, E. Ilg, P. Hausser, P. Fischer, D. Cremers, A. Dosovitskiy and T. Brox, “A large dataset to train convolutional networks for disparity, optical flow, and scene flow estimation”, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, (2016)
Funding
This work was partially supported by the Ministry of Science and Technology under Grants MOST 109-2218-E-006-032 and 110-2218-E-006-025-MBK, and by Qualcomm, USA, under Grant SOW#NAT435536.
Author information
Contributions
All the authors have made contributions to the current work. KT Lee devised the image processing study plan, participated in developing the proposed system, and drafted the manuscript. ER Liu carried out the software simulations, conducted the experiments, and collected the data. JF Yang and L Hong conceived of the study, participated in its design and coordination, and helped revise the manuscript. All the authors read and approved the final manuscript.
Ethics declarations
Competing interests
The authors declare that they have no competing financial interests. All simulations were completed at National Cheng Kung University.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Lee, KT., Liu, ER., Yang, JF. et al. An image-guided network for depth edge enhancement. J Image Video Proc. 2022, 6 (2022). https://doi.org/10.1186/s13640-022-00583-9
DOI: https://doi.org/10.1186/s13640-022-00583-9
Keywords
 Depth map
 Deep convolutional neural network
 Image-guided depth enhancement