Zoom motion estimation for color and depth videos using depth information

Abstract

In this paper, two methods of zoom motion estimation for color and depth videos using depth information are proposed. Zoom motion is estimated independently for the color and depth videos. Zoom in the color video is handled by scaling in the spatial domain, whereas the depth video is scaled in both the spatial and depth domains. For the color video, unlike existing zoom motion estimation methods that test all possible zoom ratios for a current block, the proposed method determines the zoom ratio as the ratio of the average depth values of the current and reference blocks. The reference block is then resized by the zoom ratio and mapped to the current block. For the depth video, the reference block is first scaled in the spatial direction by the same methodology used for the color video and then scaled by the ratio of the distances from the camera to the objects. Compared to the conventional motion estimation method, the proposed method reduces MSE by up to about 30% for the color video and up to about 85% for the depth video.

1 Introduction

Intelligent surveillance systems that monitor the behavior of objects are operated in various places for public safety. These systems can use not only conventional RGB videos but also infrared and depth videos to acquire additional information. To operate intelligent surveillance systems that transmit these videos, an efficient encoding method is required for the various types of video.

In video coding standards such as H.264/AVC [1–4] and H.265/HEVC [5, 6], various methods for removing redundancies are used to compress color video. Temporal redundancy is one type of redundancy in video, and it is efficiently removed by motion estimation of objects across frames. The block matching algorithm (BMA) [7, 8] has been adopted as the motion estimation method in the video coding standards. BMA estimates object motion accurately when the object size is fixed across frames. However, conventional motion estimation methods based on BMA have the limitation that they estimate object motion inaccurately when the object size changes, because the reference block has the same size as the current block.
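For illustration, the following is a minimal sketch of a full-search BMA over grayscale frames stored as NumPy arrays; the function name and interface are illustrative and not part of any standard.

```python
import numpy as np

def full_search_bma(cur_frame, ref_frame, block_xy, block_size=8, search_range=15):
    """Full-search block matching: return the motion vector (dy, dx) that
    minimizes the sum of squared errors (SSE) for one current block."""
    y0, x0 = block_xy
    cur_block = cur_frame[y0:y0 + block_size, x0:x0 + block_size].astype(np.float64)
    h, w = ref_frame.shape
    best_mv, best_sse = (0, 0), np.inf
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            ry, rx = y0 + dy, x0 + dx
            if ry < 0 or rx < 0 or ry + block_size > h or rx + block_size > w:
                continue  # skip candidates outside the reference picture
            ref_block = ref_frame[ry:ry + block_size, rx:rx + block_size].astype(np.float64)
            sse = np.sum((cur_block - ref_block) ** 2)
            if sse < best_sse:
                best_sse, best_mv = sse, (dy, dx)
    return best_mv, best_sse
```

Note that the reference block has the same size as the current block, which is precisely the limitation described above when the object size changes.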

In order to estimate various types of object motion including zoom, in which the object size changes, object motion models such as affine [9–11], perspective [12], polynomial [13], or elastic [14] models can be applied. However, motion estimation methods based on these models have high computational complexity because the model parameters must be computed for each object. An improved affine model, in which the number of parameters is reduced from six to four, has been introduced to alleviate this problem [15, 16]. Instead of computing model parameters, a method of introducing a zoom ratio into the conventional BMA [17] has been proposed. However, the search range of zoom ratios must be limited since the number of possible zoom ratios is infinite. To reduce the complexity of the zoom ratio search, a diamond search has been applied to it [18]. Methods [19–21] that determine the zoom ratio instead of searching for it have also been studied, as follows. Superiori [19] observes that the directions of motion vectors (MVs) tend to align with the direction from the border to the center of the object when the object has zoom motion. Takada et al. [20] propose improving coding efficiency by analyzing the MVs of a coded video to calculate zoom ratios and then re-coding the video; this method can only be applied to already coded video. Shukla et al. [21] propose finding warping vectors in the vertical and horizontal directions instead of using the conventional BMA. Shen et al. [22] propose a motion estimation method that extracts and matches scale-invariant feature transform (SIFT) features, which are robust to rotation and scaling. Luo et al. [23] propose a motion compensation method that detects feature points in the reference and current frames with the speeded-up robust features (SURF) algorithm and finds corresponding image projections by the perspective-n-point method. Qi et al. [24] propose a 3D motion estimation method that predicts a future scene based on 3D motion decomposition. Wu et al. [25] introduce a K-means clustering algorithm to improve the performance of motion estimation.

In this paper, a zoom motion estimation method for color video using depth information is first proposed. Each pixel value in a depth video represents the distance from the depth camera to the object. Applications of depth video have been researched in various fields such as face recognition [26–28], simultaneous localization and mapping [29, 30], object tracking [31–35], and people tracking [36–38]. The proposed method determines the zoom ratio as the ratio of the representative depth values of the current and reference blocks, where the representative depth value is the average of the depth values in each block. Then, the reference block size is determined by multiplying the current block size by the zoom ratio. The reference block is scaled to the current block size by spatial interpolation, and the two blocks are compared in order to find the optimal reference block.

A motion estimation method for depth video is also proposed in this paper. In depth video coding, intra-prediction has been studied [39–43], but studies on inter-prediction are insufficient. When an object in a depth video has zoom motion, not only the size but also the depth values of the object are scaled according to the zoom ratio. In order to accurately estimate zoom motion for depth video, we propose a 3D scaling method that simultaneously scales the 2D spatial size and the depth values of the reference block. The spatial scaling is similar to the method for the color video; after the spatial scaling, the depth values in the reference block are also scaled by multiplying them by the zoom ratio.

The contributions of the proposed method are as follows. For color video encoding, the proposed method reduces the computational complexity of determining the zoom ratio by computing it as a ratio of depth values. For depth video encoding, the proposed method improves the accuracy of motion estimation by considering the change of the depth values when the object has zoom motion.

This paper is organized as follows. The proposed method is described in Section 2. In Section 3, we present simulation results that show the improvement in motion estimation accuracy achieved by the proposed method. Finally, Section 4 concludes the paper.

2 Proposed method

2.1 Relationship between depth values and object size

The size of an object in a captured picture is approximately inversely proportional to its distance from the camera. To clarify the relationship between the object size and the depth value of the depth frame, the object width in captured pictures is measured while moving a diamond-shaped object from 1 m to 4 m at intervals of 0.5 m, as shown in Fig. 1.

Fig. 1 Measurement of relationship between distance and width of object. a 4 m and b 1 m

The relationship between the width and distance of the object is described as shown in Fig. 2. The measured relationship can be approximated with a fitting equation as follows:

$$ P=\frac{\beta}{d^{\alpha}}, $$
(1)

where P is the number of pixels of the object width, indicated by the red arrow in Fig. 1, d is the distance from the camera, and α and β are constants. In the case of Fig. 2, α and β are measured as 0.965 and 214.59, respectively.

Fig. 2 Approximation of relationship between distance and width of object
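As an illustration of how α and β in Eq. (1) could be obtained, the sketch below fits the power law to hypothetical (distance, width) pairs by linear regression in log–log space; the sample values are placeholders, not the authors' measurements.

```python
import numpy as np

# Hypothetical (distance in m, object width in pixels) measurements;
# the actual values behind Fig. 2 are not reproduced here.
d = np.array([1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0])
p = np.array([215.0, 145.0, 110.0, 88.0, 74.0, 63.0, 56.0])

# Fit P = beta / d**alpha by linearizing: log P = log beta - alpha * log d.
slope, intercept = np.polyfit(np.log(d), np.log(p), 1)
alpha, beta = -slope, np.exp(intercept)
print(f"alpha ~= {alpha:.3f}, beta ~= {beta:.2f}")
```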

2.2 Zoom motion estimation for color video

When an object has zoom motion between the reference and current pictures, its size in the picture changes according to the distance it moves toward or away from the camera. Therefore, the size of the reference block should be determined from this distance in order to estimate motion that includes zooming. The depth information provides the distance from the camera at each pixel, so the zoom ratio between the current and reference blocks can be calculated from the depth information. The averages of the depth values in the current and reference blocks are taken as the distances of the respective blocks. If the zoom ratio s is defined as the ratio of the numbers of pixels in the reference and current blocks, s is calculated by substituting these numbers into Eq. (1) as follows:

$$ \left. s=\frac{P_{\text{ref}}}{P_{\text{cur}}}=\frac{\beta}{(\overline{d_{\text{ref}}})^{\alpha}} \middle/ \frac{\beta}{(\overline{d_{\text{cur}}})^{\alpha}} \right., $$
(2)

where \(\overline {d_{\text {cur}}}\) and \(\overline {d_{\text {ref}}}\) mean the representative depth values of the current and reference blocks, respectively, and \(P_{\text{cur}}\) and \(P_{\text{ref}}\) mean the numbers of pixels of the current and reference blocks, respectively. A simplified expression of Eq. (2) is as follows:

$$ s=\left(\frac{\overline{d_{\text{cur}}}}{\overline{d_{\text{ref}}}}\right)^{\alpha}. $$
(3)

When the size of the current block is m×n, the size of the reference block is determined as sm×sn. The reference block is scaled by interpolation so that its size is equal to the size of the current block. Figure 3 shows a flowchart of the proposed zoom motion estimation for the color video, and Fig. 4 shows the processes of the proposed method.
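The block-level procedure of Figs. 3 and 4 could be sketched as follows, under the assumptions of floor rounding for the reference block size (consistent with the 8×8 to 7×7 example below) and bilinear interpolation for the spatial rescaling; the function name and interface are illustrative.

```python
import numpy as np
import cv2  # used here only for bilinear resizing (an assumed interpolation choice)

def zoom_match_error(cur_block, cur_depth_block, ref_frame, ref_depth_frame,
                     ref_xy, alpha=0.965):
    """Matching error for one candidate reference position using the
    proposed zoom scaling for color video (boundary handling omitted)."""
    m, n = cur_block.shape
    ry, rx = ref_xy

    # Zoom ratio from the representative (average) depth values, Eq. (3).
    # The averaging region for the reference depth is assumed to be the
    # co-located m x n block.
    d_cur = np.mean(cur_depth_block)
    d_ref = np.mean(ref_depth_frame[ry:ry + m, rx:rx + n])
    s = (d_cur / d_ref) ** alpha

    # Reference block of size floor(s*m) x floor(s*n), rescaled back to m x n.
    sm, sn = max(1, int(s * m)), max(1, int(s * n))
    ref_block = ref_frame[ry:ry + sm, rx:rx + sn].astype(np.float64)
    scaled = cv2.resize(ref_block, (n, m), interpolation=cv2.INTER_LINEAR)

    mse = np.mean((cur_block.astype(np.float64) - scaled) ** 2)
    return mse, s
```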

Fig. 3 Flowchart of proposed method for color video

Fig. 4 Processes of proposed method for color video

Figure 5 shows an example of zoom motion estimation for the color video. The areas surrounded by the red rectangles in Fig. 5 a and b are the 8×8 current and reference blocks, respectively, and Fig. 5 c and d show the depth values of each block. \(\overline {d_{\text {cur}}}\) and \(\overline {d_{\text {ref}}}\) of the 8×8 current and reference blocks in the depth pictures are about 2322.312 and 2469.523, respectively, so s is calculated as about 0.940 when α is set to 0.965. Therefore, the size of the reference color block is determined as 7×7. Then, the 7×7 reference color block is scaled so that its size equals the current block size. The mean square errors (MSEs) of the conventional and proposed motion estimation methods are about 169.734 and 74.609, respectively. These results show that the proposed zoom motion estimation method is more accurate when the object in the video has zoom motion.
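A quick check of the quoted numbers, assuming the exponent form of Eq. (3) and floor rounding of the scaled block size:

```python
d_cur, d_ref, alpha = 2322.312, 2469.523, 0.965
s = (d_cur / d_ref) ** alpha   # about 0.94, matching the reported zoom ratio
ref_size = int(s * 8)          # floor(about 7.5) = 7, i.e., a 7x7 reference block
print(round(s, 3), ref_size)
```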

Fig. 5 Zoom motion estimation for color video. a Current picture, b reference picture, c current block, d 7×7 reference block, and e size-scaled reference block

2.3 Zoom motion estimation for depth video

In a depth video, the distance of an object from the depth camera changes when the object has zoom motion, so the depth values of the object also change, as shown in Fig. 6. Therefore, not only the size but also the depth values of the object should be considered in the zoom motion estimation for depth video.

Fig. 6 Pixel values in depth pictures including an object a when object moves in parallel and b when object has zoom motion

A 3D scaling method is introduced for the zoom motion estimation of depth video. 3D scaling means that depth-axis scaling is added to the 2D spatial scaling of the block size. The flowchart of 3D scaling is shown in Fig. 7.

Fig. 7 Flowchart of proposed method for depth video

In 3D scaling, the zoom ratio calculation and the determination of the reference block size are the same as in the zoom motion estimation for color video described above. Then, the depth values of the size-scaled reference block are scaled by the following equation:

$$ R_{i}(i,j)=s\times R(i,j), $$
(4)

where \(R(i,j)\) and \(R_{i}(i,j)\) mean the original and scaled depth values at position (i,j), respectively.
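A minimal sketch of the 3D scaling step, assuming the same bilinear spatial rescaling as for the color video; the function name is illustrative.

```python
import numpy as np
import cv2

def scale_depth_block_3d(ref_depth_block, s, target_shape):
    """3D scaling of a reference depth block: spatial rescaling to the
    current block size followed by depth-value scaling as in Eq. (4)."""
    m, n = target_shape
    spatially_scaled = cv2.resize(ref_depth_block.astype(np.float64), (n, m),
                                  interpolation=cv2.INTER_LINEAR)
    return s * spatially_scaled  # scale each depth value by the zoom ratio
```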

Figure 8 shows an example of zoom motion estimation for the depth video. The areas surrounded by the red rectangles in Fig. 8 a and b are the 8×8 current and reference blocks, respectively, and Fig. 8 c and d show the depth values of each block. \(\overline {d_{\text {cur}}}\) and \(\overline {d_{\text {ref}}}\) of the 8×8 blocks are about 679.625 and 776.969, respectively, so s is calculated as about 0.874 when α is set to 0.965. The reference block size is therefore determined as 7×7, as shown in Fig. 8 e, when the current block size is 8×8. Then, the 7×7 reference block is scaled by the spatial scaling so that its size equals the current block size. After that, the depth values in the spatially scaled reference block are scaled as shown in Fig. 8 g. The MSEs of the conventional and proposed methods are about 9482.97 and 3.48, respectively. These results show that the 3D scaling improves the accuracy of motion estimation for the depth video.

Fig. 8 3D scaling in proposed method for depth video. a Color current picture, b color reference picture, c current block, d reference block, e 7×7 reference block, f size-scaled block, and g value-scaled block

2.4 Zoom motion estimation for variable-size block

The video coding standards provide variable-size blocks, which group blocks with similar MVs in order to reduce the number of coding blocks. In the motion estimation of H.264/AVC [1–4], the allowed block sizes are 16×16, 16×8, 8×16, and 8×8 when the macroblock size is 16×16, and 8×8, 8×4, 4×8, and 4×4 when the sub-macroblock size is 8×8. Figure 9 shows the division of a macroblock into variable-size blocks. The modes of the variable-size blocks are determined by comparing the sums of absolute errors (SAEs) or sums of square errors (SSEs) of each candidate partition.

Fig. 9 Variable-size block in H.264/AVC

In addition, the variable-size block can solve the problem that a representative depth value is difficult to determine for a mixed block containing both a foreground object and background. For such a block, the representative depth value becomes an average of the foreground and background depth values, which leads to an inaccurate zoom ratio. This problem can be solved by dividing the block into smaller blocks so that each block contains only background or only a foreground object.

The proposed method can also estimate motion for variable-size blocks. The variable-size block is applied independently to the color and depth videos. When the sample block size is 16×16, the SAE of the original 16×16 block is compared with the sums of the SAEs of the blocks partitioned into 16×8, 8×16, and 8×8. In motion estimation for partitioned blocks, the cost of coding the MV of each partition should also be considered. For the comparison between the 16×16 block and the 16×8 partitions, the SAEs are compared as follows:

$$ \text{SAE}_{16\times16}\geq \sum {\text{SAE}_{16\times8}}+T_{16\times8}, $$
(5)

where \(\text{SAE}_{16\times16}\) and \(\text{SAE}_{16\times8}\) mean the SAEs of the original block and of a 16×8 partition, respectively, and \(T_{16\times8}\) means a threshold that accounts for the MV coding cost. If the 16×16 sample block satisfies Eq. (5), the block is partitioned into 16×8 blocks.
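The partition test of Eq. (5) could be sketched as follows for the 16×16 versus 16×8 case, using the threshold value adopted in Section 3 (16²/2) as a default; the function name is illustrative.

```python
def should_partition_16x8(sae_16x16, sae_16x8_top, sae_16x8_bottom,
                          t_16x8=16 * 16 // 2):
    """Eq. (5): partition the 16x16 block into two 16x8 blocks only if the
    SAE saving exceeds the threshold accounting for the extra MV cost."""
    return sae_16x16 >= (sae_16x8_top + sae_16x8_bottom) + t_16x8
```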

3 Results and discussion

In order to measure the accuracy of the proposed zoom motion estimation, we use depth video datasets [44] in which the camera moves forward or backward, as shown in Fig. 10 a and b, and we capture videos in which one or two people move back and forth while the camera position is fixed, as shown in Fig. 10 c and d. The videos in Fig. 10 a and b are captured by a Microsoft Kinect, and the videos in Fig. 10 c and d are captured by an Intel RealSense D435. The resolutions of the color and depth videos are 640 × 480. We use the 30 consecutive frames with the most prominent zoom motion in each video. The reference picture is separated from the current picture by a picture gap. The full-search method is applied as the search method for BMA. The search range is set to ± 15, and the sample block sizes are set to 8×8 and 16×16. α in Eq. (3) is set to 0.965. In the color videos, only the gray channel is used. The search precision is limited to 1/2 pixel for the color videos and 1 pixel for the depth videos.

Fig. 10 Color and depth videos for simulation. a Bedroom, b basement, c a man, and d two men

In the proposed method, RD optimization can be used to determine the motion estimation mode. However, this paper does not discuss the coding method of depth video. Therefore, the estimation mode for each block is selected by the following criterion:

$$ \text{SSE}_{\text{ME}} > \text{SSE}_{\text{ZME}} +T_{\text{mode}}, $$
(6)

where \(\text{SSE}_{\text{ME}}\) and \(\text{SSE}_{\text{ZME}}\) mean the SSEs of the conventional and proposed methods, respectively. If a block satisfies Eq. (6), the motion estimation mode of this block is selected as zoom motion estimation. In this simulation, \(T_{\text{mode}}\) is determined by the following equation:

$$ T_{\text{mode}}=2mn, $$
(7)

where m and n mean the height and width of a current block, respectively.
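A sketch of the mode decision of Eqs. (6) and (7); the function name is illustrative.

```python
def select_zoom_mode(sse_me, sse_zme, m, n):
    """Eqs. (6)-(7): choose zoom motion estimation only when it beats
    conventional motion estimation by more than T_mode = 2*m*n."""
    t_mode = 2 * m * n
    return sse_me > sse_zme + t_mode  # True -> zoom motion estimation mode
```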

Figures 11 and 12 show the MSEs of motion estimation for the color videos by the conventional and proposed methods. The picture gap between the current and reference pictures is 1. The proposed method improves the motion estimation accuracy.

Fig. 11 Comparison of MSEs between conventional and proposed methods for color videos (8×8 block size)

Fig. 12 Comparison of MSEs between conventional and proposed methods for color videos (16×16 block size)

Tables 1 and 2 show the average MSEs according to the frame gap between the current and reference pictures. In Tables 1 and 2, \(\overline {\text {MSE}_{\text {ME}}}\) and \(\overline {\text {MSE}_{\text {ZME}}}\) mean the average MSEs of the conventional and proposed motion estimation methods, and \(\Delta \overline {\text {MSE}}\) means the MSE improvement by the proposed zoom motion estimation. As the picture gap between the current and reference pictures increases, more blocks are selected in the zoom motion estimation mode. In the color videos, blocks containing object boundary regions are mainly selected for the zoom motion estimation mode. This indicates that, when the color video has zoom motion, object boundary regions are particularly affected in the conventional motion estimation method.

Table 1 Averages of MSEs in color video according to frame gap in 8×8 block size
Table 2 Averages of MSEs in color video according to frame gap in 16×16 block size

Figures 13 and 14 show the MSEs of motion estimation for the depth videos by the conventional and proposed methods. The picture gap between the current and reference pictures is 1. The accuracy improvement of the proposed method is larger than for the color videos. Figure 15 shows the zoom ratios in the proposed zoom motion estimation for the depth videos. The zoom motion estimation mode is selected for almost all the areas where zoom motion occurs.

Fig. 13 Comparison of MSEs between conventional and proposed methods for depth videos (8×8 block size)

Fig. 14 Comparison of MSEs between conventional and proposed methods for depth videos (16×16 block size)

Fig. 15 Zoom ratios for simulation depth videos. a 8×8 block size and b 16×16 block size

Tables 3 and 4 show the average MSEs according to the picture gap between the current and reference pictures. Similar to the color videos, as the picture gap increases, more blocks are selected in the zoom motion estimation mode.

Table 3 Averages of MSEs in depth video according to frame gap in 8×8 block size
Table 4 Averages of MSEs in depth video according to frame gap in 16×16 block size

Estimation accuracies and the reduction in the number of MVs achieved by the variable-size block are measured in Tables 5, 6, 7, and 8. The thresholds of the block partition in Eq. (5) are set as follows: \(T_{16\times8}\) and \(T_{8\times16}\) are set to \(16^{2}/2\), \(T_{8\times8}\) is set to \(16^{2}\), \(T_{8\times4}\) and \(T_{4\times8}\) are set to \(8^{2}/2\), and \(T_{4\times4}\) is set to \(8^{2}\). Tables 5, 6, 7, and 8 show the MSEs and the number of blocks of each size for a variable-size block allowing block sizes of 16×16, 16×8, 8×16, and 8×8, and for a variable-size block allowing block sizes of 8×8, 8×4, 4×8, and 4×4. In Tables 5 and 6, \(\text{MSE}_{\text{VB}}\) means the MSE of the variable-size block, and \(\text{MSE}_{16\times16}\), \(\text{MSE}_{8\times8}\), and \(\text{MSE}_{4\times4}\) mean the MSEs of the fixed-size blocks. In Tables 7 and 8, notations such as \(\text{MV}_{16\times16}\) and \(\text{MV}_{16\times8}\) mean the numbers of MVs of each size in the variable-size block, \(\text{MV}_{\text{fixed}(8\times8)}\) means the number of MVs in the fixed-size block, and \(\sum \text{MV}_{\text{VB}}\) means the total number of MVs in the variable-size block. The MSEs of the variable-size block are similar to those of the fixed-size block whose size equals the smallest allowed size. The number of MVs is reduced by up to about 40% compared to the fixed-size block.

Table 5 Comparison of MSEs between variable- and fixed-size blocks (16×16, 16×8, 8×16, and 8×8)
Table 6 Comparison of MSEs between variable- and fixed-size blocks (8×8, 8×4, 4×8, and 4×4)
Table 7 Comparison of a number of MVs between variable- and fixed-size blocks (16×16, 16×8, 8×16, and 8×8)
Table 8 Comparison of a number of MVs between variable- and fixed-size blocks (8×8, 8×4, 4×8, and 4×4)

4 Conclusions

In this paper, we proposed a method of calculating the zoom ratio for the zoom motion estimation of color video by using depth information. We also proposed a zoom motion estimation method for depth video. We measured the improvement in MSE when the proposed method was applied separately to the color and depth videos. The simulation results showed that MSE is reduced by up to about 30% for the color video and by up to about 85% for the depth video. Furthermore, zoom motion estimation with variable-size blocks greatly reduces the number of motion vectors.

Some of the conventional methods of zoom motion estimation determine the zoom ratio by extracting and matching object features that are robust to zooming. Other methods determine the zoom ratio by detecting the pattern of zoom motion from the directions and magnitudes of MVs. In yet other methods, an optimal zoom ratio is found by scaling the reference block over the range of possible zoom ratios. However, these conventional methods of determining the zoom ratio have high computational complexity. In contrast, the computation of the zoom ratio is simple in the proposed method, since it only requires calculating the ratio of depth values between the reference and current blocks.

The motion estimation method proposed in this paper is expected to be applicable to the video coding standards. A method of encoding the zoom motion vector should be studied further in future work. Further research is also required to obtain optimal coding efficiency by considering both the additional bits for transmitting the zoom motion vector and the coding gain from the reduced motion estimation residual.

Availability of data and materials

The dataset used during the current study is the NYU Depth Dataset V2 [44] and is available at https://cs.nyu.edu/~silberman/datasets/nyu_depth_v2.html.

Abbreviations

BMA:

Block matching algorithm

MSE:

Mean square error

MV:

Motion vector

SAE:

Sum of absolute error

SIFT:

Scale-invariant feature transform

SSE:

Sum of square error

SURF:

Speeded-up robust features

References

  1. H.264: Advanced Video Coding for Generic Audiovisual Services. ITU-T Rec. H.264. https://www.itu.int/rec/T-REC-H.264/en. Accessed 1 June 2019.

  2. I. E. G. Richardson, H.264 and MPEG-4 Video Compression: Video Coding for Next-Generation Multimedia (Wiley, NJ, 2003).

  3. T. Wiegand, G. J. Sullivan, G. Bjontegaard, A. Luthra, Overview of the H.264/AVC video coding standard. IEEE Trans. Circ. Syst. Video Technol.13(7), 560–576 (2003).

  4. S. K. Kwon, A. Tamhankar, K. R. Rao, Overview of H.264/MPEG-4 part 10. J. Vis. Commun. Image Represent.17(2), 186–216 (2006).

  5. G. J. Sullivan, J. R. Ohm, W. J. Han, T. Wiegand, Overview of the high efficiency video coding (HEVC) standard. IEEE Trans. Circ. Syst. Video Technol.22(12), 1649–1668 (2012).

  6. D. Patel, T. Lad, D. Shah, Review on intra-prediction in high efficiency video coding (HEVC) standard. Int. J. Comput. Appl.132(13), 26–29 (2015).

  7. H. G. Musmann, P. Pirsch, H. Grallert, Advances in picture coding. Proc. IEEE. 73(4), 523–548 (1985).

  8. J. Jain, A. Jain, Displacement measurement and its application in interframe image coding. IEEE Trans. Commun.29(12), 1799–1808 (1981).

  9. H. Jozawa, K. Kamikura, A. Sagata, H. Kotera, H. Watanabe, Two-stage motion compensation using adaptive global mc and local affine mc. IEEE Trans. Circ. Syst. Video Technol.7(1), 75–85 (1997).

  10. T. Wiegand, E. Steinbach, B. Girod, Affine multipicture motion-compensated prediction. IEEE Trans. Circ. Syst. Video Technol.15(2), 197–209 (2005).

  11. R. C. Kordasiewicz, M. D. Gallant, S. Shirani, Affine motion prediction based on translational motion vectors. IEEE Trans. Circ. Syst. Video Technol.17(10), 1388–1394 (2007).

  12. Y. Nakaya, H. Harashima, Motion compensation based on spatial transformations. IEEE Trans. Circ. Syst. Video Technol.4(3), 339–356 (1994).

  13. M. Karczewicz, J. Nieweglowski, J. Lainema, O. Kalevo, in Proceedings of First International Workshop on Wireless Image/Video Communications. Video coding using motion compensation with polynomial motion vector fields, (1996), pp. 26–31. https://doi.org/10.1109/wivc.1996.624638.

  14. M. R. Pickering, M. R. Frater, J. F. Arnold, in 2006 International Conference on Image Processing. Enhanced motion compensation using elastic image registration, (2006), pp. 1061–1064. https://doi.org/10.1109/icip.2006.312738.

  15. L. Li, H. Li, D. Liu, Z. Li, H. Yang, S. Lin, H. Chen, F. Wu, An efficient four-parameter affine motion model for video coding. IEEE Trans. Circ. Syst. Video Technol.28(8), 1934–1948 (2018). https://doi.org/10.1109/TCSVT.2017.2699919.

  16. N. Zhang, X. Fan, D. Zhao, W. Gao, Merge mode for deformable block motion information derivation. IEEE Trans. Circ. Syst. Video Technol.27(11), 2437–2449 (2017). https://doi.org/10.1109/TCSVT.2016.2589818.

  17. L. Po, K. Wong, K. Cheung, K. Ng, Subsampled block-matching for zoom motion compensated prediction. IEEE Trans. Circ. Syst. Video Technol.20(11), 1625–1637 (2010).

  18. H. S. Kim, J. H. Lee, C. K. Kim, B. G. Kim, Zoom motion estimation using block-based fast local area scaling. IEEE Trans. Circ. Syst. Video Technol.22(9), 1280–1291 (2012).

  19. L. Superiori, M. Rupp, in 2009 10th Workshop on Image Analysis for Multimedia Interactive Services. Detection of pan and zoom in soccer sequences based on H.264/AVC motion information, (2009), pp. 41–44. https://doi.org/10.1109/wiamis.2009.5031427.

  20. R. Takada, S. Orihashi, Y. Matsuo, J. Katto, in 2015 IEEE International Conference on Consumer Electronics (ICCE). Improvement of 8k UHDTV picture quality for H.265/HEVC by global zoom estimation, (2015), pp. 58–59. https://doi.org/10.1109/icce.2015.7066317.

  21. D. Shukla, R. K. Jha, A. Ojha, Unsteady camera zoom stabilization using slope estimation over interest warping vectors. Pattern Recogn. Lett.68:, 197–204 (2015).

  22. X. Shen, J. Wang, Q. Yang, P. Chen, F. Liang, in 2017 IEEE Visual Communications and Image Processing (VCIP). Feature based inter prediction optimization for non-translational video coding in cloud, (2017), pp. 1–4. https://doi.org/10.1109/vcip.2017.8305066.

  23. G. Luo, Y. Zhu, Z. Weng, Z. Li, A disocclusion inpainting framework for depth-based view synthesis. IEEE Trans. Pattern Anal. Mach. Intell. (Early Access), 1–14 (2019). https://doi.org/10.1109/tpami.2019.2899837.

  24. X. Qi, Z. Liu, Q. Chen, J. Jia, in 2019 IEEE Conference on Computer Vision and Pattern Recognition. 3D motion decomposition for RGBD future dynamic scene synthesis, (2019), pp. 7673–7682. https://doi.org/10.1109/cvpr.2019.00786.

  25. M. Wu, X. Li, C. Liu, M. Liu, N. Zhao, J. Wang, X. Wan, Z. Rao, L. Zhu, Robust global motion estimation for video security based on improved k-means clustering. J. Ambient Intell. Humanized Comput.10(2), 439–448 (2019).

  26. G. Fanelli, M. Dantone, L. Van Gool, in 2013 10th IEEE International Conference and Workshops on Automatic Face and Gesture Recognition (FG). Real time 3d face alignment with random forests-based active appearance models, (2013), pp. 1–8. https://doi.org/10.1109/fg.2013.6553713.

  27. M. Dantone, J. Gall, G. Fanelli, L. Van Gool, in 2012 IEEE Conference on Computer Vision and Pattern Recognition. Real-time facial feature detection using conditional regression forests, (2012), pp. 2578–2585.

  28. R. Min, N. Kose, J. Dugelay, Kinectfacedb: A kinect database for face recognition. IEEE Trans. Syst. Man Cybernet. Syst.44(11), 1534–1548 (2014).

  29. J. Sturm, N. Engelhard, F. Endres, W. Burgard, D. Cremers, in 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems. A benchmark for the evaluation of rgb-d slam systems, (2012), pp. 573–580. https://doi.org/10.1109/iros.2012.6385773.

  30. F. Pomerleau, S. Magnenat, F. Colas, M. Liu, R. Siegwart, in 2011 IEEE/RSJ International Conference on Intelligent Robots and Systems. Tracking a depth camera: Parameter exploration for fast ICP, (2011), pp. 3824–3829. https://doi.org/10.1109/iros.2011.6094861.

  31. M. Siddiqui, G. Medioni, in 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition - Workshops. Human pose estimation from a single view point, real-time range sensor, (2010), pp. 1–8. https://doi.org/10.1109/cvprw.2010.5543618.

  32. R. Muñoz Salinas, R. Medina Carnicer, F. J. Madrid Cuevas, A. Carmona Poyato, Depth silhouettes for gesture recognition. Pattern Recogn. Lett.29(3), 319–329 (2008).

  33. P. Suryanarayan, A. Subramanian, D. Mandalapu, in 2010 20th International Conference on Pattern Recognition. Dynamic hand pose recognition using depth data, (2010), pp. 3105–3108. https://doi.org/10.1109/icpr.2010.760.

  34. J. Preis, M. Kessel, M. Werner, C. Linnhoff-Popien, in 1st International Workshop on Kinect in Pervasive Computing. Gait recognition with kinect (Newcastle, UK, 2012), pp. 1–4.

  35. S. Song, J. Xiao, in 2013 IEEE International Conference on Computer Vision. Tracking revisited using RGBD camera: Unified benchmark and baselines, (2013), pp. 233–240. https://doi.org/10.1109/iccv.2013.36.

  36. J. Sung, C. Ponce, B. Selman, A. Saxena, in Workshops at the Twenty-fifth AAAI Conference on Artificial Intelligence. Human activity detection from RGBD images, (2011), pp. 47–55.

  37. L. Spinello, K. O. Arras, in 2011 IEEE/RSJ International Conference on Intelligent Robots and Systems. People detection in RGB-D data, (2011), pp. 3838–3843. https://doi.org/10.1109/iros.2011.6095074.

  38. M. Luber, L. Spinello, K. O. Arras, in 2011 IEEE/RSJ International Conference on Intelligent Robots and Systems. People tracking in RGB-D data with on-line boosted target models, (2011), pp. 3844–3849. https://doi.org/10.1109/iros.2011.6095075.

  39. K. Lai, L. Bo, X. Ren, D. Fox, in 2011 IEEE International Conference on Robotics and Automation. A large-scale hierarchical multi-view RGB-D object dataset, (2011), pp. 1817–1824. https://doi.org/10.1109/icra.2011.5980382.

  40. S. Gasparrini, E. Cippitelli, E. Gambi, S. Spinsante, J. Wåhslén, I. Orhan, T. Lindh, in International Conference on ICT Innovations. Proposal and experimental evaluation of fall detection solution based on wearable and depth data fusion (Springer, 2015), pp. 99–108. https://doi.org/10.1007/978-3-319-25733-4_11.

  41. P. Ammirato, P. Poirson, E. Park, J. Košecká, A. C. Berg, in 2017 IEEE International Conference on Robotics and Automation (ICRA). A dataset for developing and benchmarking active vision, (2017), pp. 1378–1385. https://doi.org/10.1109/icra.2017.7989164.

  42. M. Kraft, M. Nowicki, A. Schmidt, M. Fularz, P. Skrzypczyński, Toward evaluation of visual navigation algorithms on RGB-D data from the first- and second-generation kinect. Mach. Vis. Appl.28(1-2), 61–74 (2016).

  43. D. S. Lee, S. K. Kwon, Intra prediction of depth picture with plane modeling. Symmetry. 10(12), 715 (2018).

  44. N. Silberman, D. Hoiem, P. Kohli, R. Fergus, in European Conference on Computer Vision. Indoor segmentation and support inference from RGBD images (Springer, 2012), pp. 746–760. https://doi.org/10.1007/978-3-642-33715-4_54.

Acknowledgements

Not applicable.

Funding

Not applicable.

Author information

Contributions

All authors took part in the discussion of the work described in this paper. The authors read and approved the final manuscript.

Corresponding author

Correspondence to Soon-kak Kwon.

Ethics declarations

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Cite this article

Kwon, Sk., Lee, Ds. Zoom motion estimation for color and depth videos using depth information. J Image Video Proc. 2020, 11 (2020). https://doi.org/10.1186/s13640-020-00499-2
