A fast 3D scene reconstructing method using continuous video
 Bo-Yi Sung^{1} and
 Chang-Hong Lin^{1}
https://doi.org/10.1186/s13640-017-0168-3
© The Author(s). 2017
Received: 2 August 2016
Accepted: 8 February 2017
Published: 22 February 2017
Abstract
Accurate 3D measuring systems have thrived in the past few years. Most of them are based on laser scanners because laser scanners can acquire 3D information directly and precisely in real time. However, compared with conventional cameras, this kind of equipment is usually expensive and not commonly available to consumers. Moreover, laser scanners easily interfere with other sensors of the same type. On the other hand, computer vision-based 3D measuring techniques use stereo matching to acquire the cameras’ relative positions and then estimate the 3D locations of points in the image. Because this kind of system needs additional computation to estimate the 3D information, systems with real-time capability often rely on heavy parallelism, which prevents implementation on mobile devices.
Inspired by structure from motion systems, we propose a system that reconstructs sparse feature points into a 3D point cloud from a mono video sequence so as to achieve higher computational efficiency. The system keeps tracking all detected feature points and calculates both the number of these feature points and their moving distances. We use only the key frames to estimate the current position of the camera in order to reduce the computational load and the noise interference on the system. Furthermore, to avoid duplicate 3D points, the system reconstructs a 2D point only when the point shifts out of the boundary of the camera view. In our experiments, we show that our system can be implemented on tablets and can achieve state-of-the-art accuracy with a denser point cloud at high speed.
Keywords
1 Introduction
3D reconstruction techniques have been widely promoted in recent years. They have many applications, such as object modeling and 3D printing, architecture, robotics, augmented reality, medicine, archaeology, or simply 3D records as an alternative to camera photos. There are many 3D acquisition methods, such as ultrasound [1], synthetic aperture radar (SAR) [2], and the well-known LIDAR system [3].
Recently, 3D applications based on RGB-D cameras (IR laser) have improved dramatically, such as methods using Microsoft Kinect [4, 5] and ASUS Xtion [6, 7], which are capable of constructing an exceedingly precise and dense 3D point cloud in real time with the assistance of GPGPU (general-purpose computing on graphics processing units). However, since RGB-D cameras and laser scanners are not commonly available products, both stereo cameras [8–10] and mono cameras [11–15] are used to achieve the same functionality as an RGB-D camera, at the cost of implementing depth-estimating methods.
Among the mono camera methods, [12, 13] are designed for building a 3D point cloud from multiple 2D images. These methods usually take a long time to compute the relations between input images in order to acquire a large number of cloud points with high accuracy. On the other hand, the SLAM (simultaneous localization and mapping) system [11] is designed for AR applications. Although the system can acquire an accurate camera position in real time, it tracks only a small number of feature points, so it is not well suited to reconstructing a 3D point cloud. Therefore, we seek a system that is able to build a 3D point cloud in real time.
In this paper, we present a fast 3D reconstruction system that is suitable for 3D scenery recording. Most structure from motion methods [12, 13] take minutes or even hours to finish their work. Since the 3D reconstruction of a vast area, such as a whole city district, is not our requirement, we concentrate on reconstructing a single scene at high speed. First, in order to improve the speed of the overall process, we apply a key frame selection technique, which can significantly reduce the processing time of pose estimation by removing duplicate information from neighboring frames in a video sequence. Second, we maintain the precision of the result while the overall computation time is reduced. Because we only drop neighboring frames, which carry almost the same information as each other, the data kept by the system can be updated at the time when it is truly needed. Third, the proposed system avoids reconstructing duplicated points. Duplicated points appear because of the instability of feature points, which can be caused by blur, heavy motion, or illuminance changes in the image. Avoiding duplicated points reduces the time needed for triangulation and saves a lot of memory in comparison with systems that do not avoid them.
With these improvements, the proposed system can be implemented on a single CPU. The computation speed reaches 5 to 15 fps, depending on the size of the video sequence and the number of feature points.
2 Related works
Providing a better experience for users has been an active goal of multimedia research. Images can not only be aggregated into a 3D scene but can also be animated in order to provide more vivid and interesting visuals. Such works are nowadays widely implemented on mobile devices, such as Apple Live Photos, Instagram Boomerang, and the “Cinemagraphs” project (available: http://cinemagraphs.com/). Most of them require capturing more than one photograph. However, there are studies that require only one still image to reconstruct the motion of an object in the photograph. One study [16] creates cloud motion, which is much harder than animating rigid objects. On the other hand, in order to reconstruct a 3D scene, localizing the object is critical even if the system uses an RGB-D camera. There are studies dedicated to finding more efficient ways to track an object. The study [17] compares the performance of systems that use early or late fusion with SVM (support vector machine) or deep learning classifiers. In the case of hand gesture recognition, the research [18] uses deep learning to enable the system to track a moving hand at a faster speed without losing precision. While this research tracks a hand captured by a stationary camera, 3D reconstruction methods conversely require tracking stationary objects captured by a moving camera. A lot of work exists that reconstructs a 3D scene or tracks the camera positions. Since we concentrate on methods using mono cameras, only methods using mono vision are briefly described in the following paragraphs.
One solution that can reconstruct a dense depth map is to segment a photo into superpixels [19, 20]. Based on the local appearance of the photo together with global constraints, these methods build the most likely 3D structure for each segment and use it to build the depth map. Even though these systems give excellent results, the results are still not precise enough and are restricted to a single view, which is not suitable for building the scenery around an area.
Instead of building a depth map for an image, a method that estimates the positions of a camera can also reconstruct the 3D points from the estimated camera positions. SLAM [11, 21] is the process by which a system incrementally builds a consistent map of its environment and uses this map to compute its own location at the same time. These methods not only work in real time but also maintain the position of the camera precisely. However, in order to do both simultaneously in real time, most of the approaches track only a few feature points, while we are interested in a method that tracks feature points as densely as possible.
Similar to the SLAM methods, a method that also estimates the positions of the cameras from video was demonstrated by Akbarzadeh et al. [22]. The main purpose of their method is to reconstruct the 3D urban scene rather than to maintain the camera track in the map. On the other hand, 3D reconstruction from multiple photos, which can reconstruct scenery as large as the entire city of Rome, was originally demonstrated by N. Snavely et al. [12, 13]. This kind of technique, which processes photo sequences into a 3D point cloud by studying the coherence between photos, is called structure from motion (SfM). It basically computes the relative camera positions between all related photos. After every relative camera position is found, the scheme uses these matrices to reconstruct all feature points by triangulation. Although the results of the method proposed by N. Snavely et al. [12, 13] were very impressive, their method requires a very long time to finish calculating all required matrices. Hence, other SfM methods, namely VisualSFM [14] and OpenMVG [15], were proposed to improve the processing speed and the robustness of the system. VisualSFM [14] uses preemptive feature matching, incremental structure from motion, and retriangulation. Preemptive feature matching can greatly speed up the process because it first sorts all feature points and matches only the first h feature points of each photo. The scheme does not proceed to match all feature points unless the number of successful matches among the first h features is greater than a defined threshold. Incremental structure from motion also saves processing time by not performing bundle adjustment every time a new camera is added. Instead, it performs bundle adjustment only when the number of points has increased by a certain ratio. The retriangulation technique reduces the camera drift caused by bad relative camera poses, which may have a low ratio of common points. These bad camera poses are retriangulated after a sufficient amount of data has been obtained from newly added cameras.
OpenMVG [15] also contains an incremental structure from motion technique. In addition, its authors proposed a new iterative sampling method called a contrario Random Sample Consensus (AC-RANSAC) as a substitute for the original RANSAC in order to achieve higher precision and better performance. AC-RANSAC uses the “a contrario” methodology to find a model that best fits the data with a threshold T that adapts automatically to the noise. Hence, it is able to find a model and its associated noise level without a fixed threshold.
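The preemptive feature matching described above for VisualSFM [14] can be illustrated with a minimal sketch. It is not the VisualSFM implementation: the feature layout (a scale and a descriptor), sorting by decreasing scale, the brute-force nearest-neighbor matcher, and the parameter names h, threshold, and max_dist are all assumptions made for illustration.

```python
import math
from typing import List, Optional, Sequence, Tuple

# Hypothetical feature layout: (scale, descriptor).
Feature = Tuple[float, Sequence[float]]


def match(feats_a: List[Feature], feats_b: List[Feature],
          max_dist: float) -> List[Tuple[int, int]]:
    """Brute-force matcher: pair each feature in A with its nearest
    descriptor in B if that descriptor lies within max_dist."""
    pairs = []
    for i, (_, da) in enumerate(feats_a):
        best_j, best_d = -1, max_dist
        for j, (_, db) in enumerate(feats_b):
            d = math.dist(da, db)
            if d <= best_d:
                best_j, best_d = j, d
        if best_j >= 0:
            pairs.append((i, best_j))
    return pairs


def preemptive_match(feats_a: List[Feature], feats_b: List[Feature],
                     h: int = 100, threshold: int = 4,
                     max_dist: float = 0.5) -> Optional[List[Tuple[int, int]]]:
    """Match only the first h features of each image; run the full
    matching only if more than `threshold` of them agree."""
    # Assumption: features are sorted by decreasing scale.
    top_a = sorted(feats_a, key=lambda f: f[0], reverse=True)[:h]
    top_b = sorted(feats_b, key=lambda f: f[0], reverse=True)[:h]
    if len(match(top_a, top_b, max_dist)) <= threshold:
        return None  # image pair rejected without full matching
    return match(feats_a, feats_b, max_dist)


img_a = [(3.0, (0.0, 0.0)), (2.0, (1.0, 1.0)), (1.0, (5.0, 5.0))]
img_b = [(2.5, (0.0, 0.1)), (2.0, (1.0, 0.9)), (0.5, (5.0, 5.1))]
img_c = [(2.5, (9.0, 9.0)), (2.0, (8.0, 8.0)), (0.5, (5.0, 5.1))]

accepted = preemptive_match(img_a, img_b, h=2, threshold=1)
rejected = preemptive_match(img_a, img_c, h=2, threshold=1)
```

In this toy data, the pair (img_a, img_c) is rejected even though a full matching would have found one correspondence, which is the trade-off the preemptive test accepts in exchange for speed.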
In this paper, we are interested in reconstructing the observed points in as much detail as possible from the video sequence while sacrificing very little real-time performance.
3 Proposed method
3.1 Feature processing loop
In this section, the feature processing loop is described. The procedure first performs feature detection and matching and then iterates to search for the points found in the previous frame. To improve efficiency, it also decides whether a frame is a key frame. The key frames are then used in the motion data processing.

The average feature distance in pixels must be more than ImageWidth/3.

The number of feature point matches must not drop dramatically with respect to the previous match.
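The two key frame conditions above can be sketched as a simple test. This is only an illustration under stated assumptions: it assumes that both conditions must hold together, that the feature distance is measured from the previous key frame, and that a "dramatic" drop means losing more than a hypothetical fraction max_drop_ratio of the previous matches. None of these details are specified in the text.

```python
from math import hypot
from typing import List, Tuple

Point = Tuple[float, float]


def average_feature_distance(prev_pts: List[Point],
                             curr_pts: List[Point]) -> float:
    """Mean pixel displacement between corresponding feature points."""
    if not prev_pts or len(prev_pts) != len(curr_pts):
        return 0.0
    total = sum(hypot(cx - px, cy - py)
                for (px, py), (cx, cy) in zip(prev_pts, curr_pts))
    return total / len(prev_pts)


def is_key_frame(prev_pts: List[Point], curr_pts: List[Point],
                 prev_match_count: int, image_width: int,
                 max_drop_ratio: float = 0.5) -> bool:
    """Assumed rule: the features moved far enough AND the number of
    matches did not drop by more than max_drop_ratio (hypothetical)."""
    moved_enough = (average_feature_distance(prev_pts, curr_pts)
                    > image_width / 3.0)
    matches_stable = len(curr_pts) >= (1.0 - max_drop_ratio) * prev_match_count
    return moved_enough and matches_stable


prev = [(0.0, 0.0), (10.0, 10.0)]
far = [(300.0, 0.0), (310.0, 10.0)]    # average displacement of 300 px
near = [(5.0, 0.0), (15.0, 10.0)]      # average displacement of 5 px

key_far = is_key_frame(prev, far, prev_match_count=2, image_width=640)
key_near = is_key_frame(prev, near, prev_match_count=2, image_width=640)
key_lost = is_key_frame(prev, far, prev_match_count=10, image_width=640)
```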
3.2 Pose estimation
The projection of a 3D point onto the image is given by Eq. (1), q = PQ, where Q = (Q_{1}, Q_{2}, Q_{3}, 1) is the representation of a 3D point in homogeneous coordinates, q = (q_{1}, q_{2}, 1) is the representation of the corresponding image point, and P is the 3 × 4 projection matrix. The projection matrix P can be decomposed into an intrinsic matrix and an extrinsic matrix.
The intrinsic matrix K describes the geometric properties of a camera and projects points from camera coordinates to image coordinates. K is composed of the focal length, the principal point, and the skew parameter. In short, to obtain a projection matrix, it is necessary to obtain the intrinsic and extrinsic matrices in advance. Because the intrinsic matrix K is related only to the camera settings, K can be acquired by calibrating the camera. The extrinsic matrix [R|t], which denotes the coordinate transformation from 3D world coordinates to 3D camera coordinates by the rotation R and translation t, is what is estimated in this section. The extrinsic matrix describes a camera’s “pose”, including the camera’s rotation, pan-and-tilt, and location c in world coordinates. On the other hand, the fundamental matrix F is the algebraic representation of epipolar geometry, which is the projective geometry between two views. Therefore, every extrinsic matrix can be derived from the known relation between cameras, assuming that the first camera is located at the origin, [R|t] = [I|0].
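As a small illustration of the decomposition P = K[R|t], the following sketch builds a projection matrix for a camera at the origin and projects one homogeneous 3D point. The intrinsic values (focal length 800, principal point (320, 240), zero skew) and the test point are hypothetical and are not taken from the experiments.

```python
from typing import List, Sequence

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product of two nested-list matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def projection_matrix(K: Matrix, R: Matrix, t: Sequence[float]) -> Matrix:
    """Compose the 3x4 projection matrix P = K [R | t]."""
    Rt = [list(R[i]) + [t[i]] for i in range(3)]
    return matmul(K, Rt)


def project(P: Matrix, Q: Sequence[float]) -> List[float]:
    """Project a homogeneous 3D point Q (4-vector) and normalize the
    result so that its last coordinate equals one."""
    x = [sum(P[i][j] * Q[j] for j in range(4)) for i in range(3)]
    return [x[0] / x[2], x[1] / x[2], 1.0]


# Hypothetical intrinsics: focal length 800, principal point (320, 240).
K = [[800.0, 0.0, 320.0],
     [0.0, 800.0, 240.0],
     [0.0, 0.0, 1.0]]
I3 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

P0 = projection_matrix(K, I3, [0.0, 0.0, 0.0])   # first camera, [I|0]
q = project(P0, [0.5, -0.25, 2.0, 1.0])
```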
Based on the essential matrix constraint [29], a 3 × 3 matrix is an essential matrix if and only if two of its singular values are equal and the third is zero. In Eq. (4), SR = (UZU^{T})(UWV^{T}) = U(ZW)V^{T}, the product ZW equals Σ = diag(1, 1, 0), so the singular values satisfy this constraint as required.
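For reference, the product ZW can be checked directly. The block below assumes that Z and W are the standard skew-symmetric and orthogonal matrices used by Hartley and Zisserman [29]; this is an assumption, since their definitions are not written out here.

```latex
W = \begin{pmatrix} 0 & -1 & 0 \\ 1 & 0 & 0 \\ 0 & 0 & 1 \end{pmatrix},
\qquad
Z = \begin{pmatrix} 0 & 1 & 0 \\ -1 & 0 & 0 \\ 0 & 0 & 0 \end{pmatrix},
\qquad
ZW = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 0 \end{pmatrix}
   = \operatorname{diag}(1, 1, 0)
```

With these choices, SR = U diag(1, 1, 0) V^{T} has two equal singular values and one zero singular value.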
Corresponding to an essential matrix, there are four possible solutions for the extrinsic matrix because of the two possible choices of R and the unknown sign of t. This means that the translation vector from the first to the second camera can be reversed and that the camera can be rotated by 180° about the baseline. To decide among the four solutions, it is sufficient to test them with the input points from the previous procedure and see which solution reconstructs the most points in front of both cameras. Reconstruction is done by a simple triangulation, assuming that the first camera is located at the origin [I|0] and the second at [R|t]. Triangulation is described in Section 3.4.
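The selection among the four candidate poses can be sketched as follows. For brevity, the sketch does not use the triangulation of Section 3.4; instead it recovers the two ray depths by a small least-squares solve on normalized image coordinates (the intrinsic matrix K is assumed to be already removed). The candidate poses and the test points are hypothetical.

```python
from typing import List, Optional, Sequence, Tuple

Vec = Sequence[float]
Mat = Sequence[Sequence[float]]


def dot(a: Vec, b: Vec) -> float:
    return sum(x * y for x, y in zip(a, b))


def ray_depths(R: Mat, t: Vec, x1: Vec, x2: Vec) -> Optional[Tuple[float, float]]:
    """Least-squares depths (d1, d2) such that d2 * x2 ~= R (d1 * x1) + t,
    where x1 and x2 are normalized homogeneous image points."""
    a = [dot(row, x1) for row in R]      # first ray rotated into camera 2
    aa, bb, ab = dot(a, a), dot(x2, x2), dot(a, x2)
    det = aa * bb - ab * ab
    if abs(det) < 1e-12:                 # nearly parallel rays
        return None
    at, bt = dot(a, t), dot(x2, t)
    d1 = (ab * bt - bb * at) / det
    d2 = (aa * bt - ab * at) / det
    return d1, d2


def count_in_front(R: Mat, t: Vec, matches) -> int:
    """Number of matches reconstructed in front of both cameras."""
    count = 0
    for x1, x2 in matches:
        depths = ray_depths(R, t, x1, x2)
        if depths is not None and depths[0] > 0 and depths[1] > 0:
            count += 1
    return count


def select_pose(candidates, matches) -> int:
    """Index of the candidate with the most points in front of both cameras."""
    counts = [count_in_front(R, t, matches) for R, t in candidates]
    return max(range(len(candidates)), key=counts.__getitem__)


# Hypothetical ground truth: no rotation, translation t = (-1, 0, 0).
I3 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
TWIST = [[1.0, 0.0, 0.0], [0.0, -1.0, 0.0], [0.0, 0.0, -1.0]]  # 180 deg about the baseline
t = [-1.0, 0.0, 0.0]
neg_t = [1.0, 0.0, 0.0]

points = [(0.0, 0.0, 5.0), (1.0, 1.0, 4.0), (-1.0, 0.5, 6.0)]
matches = [((X / Z, Y / Z, 1.0), ((X + t[0]) / Z, Y / Z, 1.0))
           for X, Y, Z in points]

candidates = [(I3, t), (I3, neg_t), (TWIST, t), (TWIST, neg_t)]
best = select_pose(candidates, matches)
```

Only the first candidate places all three points in front of both cameras; reversing t flips both depths, and the 180° twist flips exactly one of them.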
3.3 Point tracking
Besides categorizing every input point, the system also checks the track vector and finds the tracks that have not been touched in the categorizing step. A track that has not been touched is one whose corresponding feature point was not found in the current key frame by the procedure in Section 3.1. In this case, we assume that the point has shifted out of the boundary of the camera view and will not appear any more. These untouched tracks are erased from the vector and pushed to the next procedure in Section 3.4, which performs triangulation and reconstructs the 3D points.
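The handling of untouched tracks can be sketched with a small data structure. The layout below is an assumption: each feature is assumed to carry a persistent integer id from the feature processing loop, and each track stores only its first and last key frame observations, which are the two views later used for triangulation. Tracks seen in only one key frame are discarded in this sketch because they cannot be triangulated.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

Point = Tuple[float, float]


@dataclass
class Track:
    first_frame: int
    first_pt: Point
    last_frame: int
    last_pt: Point


def update_tracks(tracks: Dict[int, Track],
                  observations: Dict[int, Point],
                  frame_idx: int) -> List[Track]:
    """Update the tracks with the features of the current key frame and
    return the finished tracks (those not touched in this key frame)."""
    for fid, pt in observations.items():
        if fid in tracks:                      # existing track: extend it
            tracks[fid].last_frame = frame_idx
            tracks[fid].last_pt = pt
        else:                                  # new feature: start a track
            tracks[fid] = Track(frame_idx, pt, frame_idx, pt)

    finished = []
    for fid in [f for f in tracks if f not in observations]:
        track = tracks.pop(fid)                # erase the untouched track
        if track.last_frame > track.first_frame:
            finished.append(track)             # ready for triangulation
    return finished


tracks: Dict[int, Track] = {}
update_tracks(tracks, {1: (0.0, 0.0), 2: (5.0, 5.0)}, frame_idx=0)
update_tracks(tracks, {1: (10.0, 0.0), 2: (6.0, 5.0)}, frame_idx=1)
done = update_tracks(tracks, {1: (20.0, 0.0)}, frame_idx=2)
```

Because each 2D point is handed to triangulation only once, when its track ends, the same scene point is not reconstructed repeatedly.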
3.4 Triangulation
The triangulation process reconstructs a 3D point from a pair of known cameras and the corresponding 2D points. Recalling Eq. (1), although the wanted 3D point could be calculated directly by back-projecting the 2D points to 3D, the solution will not be sufficiently accurate because there are errors in the measured points q and q'. This means that, in general, no point Q exists that satisfies q = PQ and q' = P'Q simultaneously, where q' and P' represent the 2D point and the projection matrix of the second camera. The points q and q' also do not exactly satisfy the epipolar constraint q'^{T}Fq = 0. Therefore, the direct linear transformation (DLT) [29] is used to obtain a solution closer to the ideal Q.
The first two equations from Eq. (9) are included for each camera, which provides a total of four equations in four homogeneous unknowns. This homogeneous linear system YQ = 0 can be solved by taking Q from the null space of Y. Because we are using homogeneous coordinates, the solution Q, which is a 4-vector, needs to be normalized so that its last coordinate equals one.
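Assuming that Eq. (9) is the usual cross-product form q × (PQ) = 0 from Hartley and Zisserman [29], the matrix Y of the linear system can be written out as below. Here p^{iT} and p'^{iT} denote the i-th rows of P and P'; this row notation is an assumption introduced for illustration.

```latex
Y \, Q = 0,
\qquad
Y = \begin{pmatrix}
q_{1}\, p^{3T} - p^{1T} \\
q_{2}\, p^{3T} - p^{2T} \\
q'_{1}\, p'^{3T} - p'^{1T} \\
q'_{2}\, p'^{3T} - p'^{2T}
\end{pmatrix}
```

With noisy measurements Y generally has full rank, so in practice Q is taken as the singular vector of Y associated with its smallest singular value and then scaled so that its last coordinate equals one.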
The inputs of the triangulation procedure are the measured point matches from Section 3.1 and the estimated projection matrices from Section 3.2, and both may include noise. In this case, the 3D point cannot be reconstructed at its ideal location. The reconstructed 3D point will lie within the area between the rays from the cameras through the measured points. The more parallel these rays become, the larger this area becomes. This means that small camera movements in all six directions may cause a poor triangulation solution. The small-movement problem can mostly be solved by the point tracking in Section 3.3 because we always use the first and the last key frames in which a feature point was found to do the triangulation. This ensures that the triangulation procedure always uses the key frames with the longest camera distance. However, the reconstructed 3D points might still contain noise. If precision is the priority, it is necessary to do a further refinement on these reconstructed 3D points at the end of the program.
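The effect of the camera distance on the rays can be illustrated by computing the angle between the two viewing rays of a point. The camera centers and the point below are hypothetical; the sketch only shows that a short baseline gives nearly parallel rays, which is the poorly conditioned case described above.

```python
import math
from typing import Sequence


def ray_angle_deg(c1: Sequence[float], c2: Sequence[float],
                  point: Sequence[float]) -> float:
    """Angle in degrees between the rays from camera centers c1 and c2
    to the same 3D point."""
    r1 = [p - c for p, c in zip(point, c1)]
    r2 = [p - c for p, c in zip(point, c2)]
    dot = sum(a * b for a, b in zip(r1, r2))
    n1 = math.sqrt(sum(a * a for a in r1))
    n2 = math.sqrt(sum(b * b for b in r2))
    cos_angle = max(-1.0, min(1.0, dot / (n1 * n2)))  # guard against rounding
    return math.degrees(math.acos(cos_angle))


point = (0.0, 0.0, 5.0)
origin = (0.0, 0.0, 0.0)

# Neighboring frames: short baseline, nearly parallel rays.
small = ray_angle_deg(origin, (0.1, 0.0, 0.0), point)
# First and last key frame of a track: longer baseline, wider angle.
wide = ray_angle_deg(origin, (2.0, 0.0, 0.0), point)
```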
3.5 Bundle adjustment
Bundle adjustment is a method that solves the problem of simultaneously refining the 3D coordinates, the parameters of the camera motion, and the characteristics of the cameras. As described in Sections 3.2 and 3.4, if the image measurements are noisy, the camera poses are not perfectly precise and Eq. (1), q = PQ, will not be satisfied exactly. In this case, an optimization is needed to minimize the reprojection error between the observed and the predicted image points.
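In its commonly used form, the reprojection error minimized by bundle adjustment can be written as below. The indices and the distance function are notational assumptions: P_{i} is the projection matrix of camera i, Q_{j} is the j-th 3D point, q_{ij} is the observed image of Q_{j} in camera i, and d(·,·) is the Euclidean distance in the image after normalizing the homogeneous coordinates.

```latex
\min_{\{P_{i}\},\,\{Q_{j}\}} \; \sum_{i} \sum_{j} d\!\left(P_{i} Q_{j},\; q_{ij}\right)^{2}
```

The sum runs only over the pairs (i, j) for which point j is actually observed in camera i.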
Where I is the identity matrix and λ is the adjusting factor that varies from iteration to iteration. If the error decreases, λ becomes smaller; otherwise, it becomes larger.
Where U = [A^{T}A] × diag(1 + λ), V = [B^{T}B] × diag(1 + λ), and W = [A^{T}B].
After δ_{P} and δ_{Q} are computed, the parameters P and Q are replaced by (P + δ_{P}) and (Q + δ_{Q}), respectively, and a new error vector ϵ is computed from these new parameters. If the error decreases, the system scales the factor λ down and proceeds to the next iteration. Otherwise, it reverts to the old parameter values and tries again with λ scaled up. The iteration continues until the error falls below the threshold or the maximum number of iterations is reached.
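The accept-or-revert loop described above can be sketched on a deliberately small problem. This is a simplification and not the full bundle adjustment: the cameras are held fixed and only one point (X, Z) is refined, each camera is a hypothetical 1D camera on the x-axis that observes u = (X − c)/Z, the error vector is taken as predicted minus observed (a sign convention assumed here), and the scaling factor of 10 for λ is also an assumption.

```python
from typing import List, Tuple

CAMERA_X = [0.0, 1.0, 2.0]            # hypothetical camera centers on the x-axis
OBSERVED = [0.125, -0.125, -0.375]    # noise-free images of the point (0.5, 4.0)


def residuals(X: float, Z: float) -> List[float]:
    """Error vector: predicted minus observed image coordinates."""
    return [(X - c) / Z - u for c, u in zip(CAMERA_X, OBSERVED)]


def jacobian(X: float, Z: float) -> List[Tuple[float, float]]:
    """Rows (d/dX, d/dZ) of the predicted coordinate for each camera."""
    return [(1.0 / Z, -(X - c) / (Z * Z)) for c in CAMERA_X]


def squared_error(eps: List[float]) -> float:
    return sum(e * e for e in eps)


def refine_point(X: float, Z: float, lam: float = 1e-3,
                 max_iters: int = 100, tol: float = 1e-16) -> Tuple[float, float]:
    eps = residuals(X, Z)
    err = squared_error(eps)
    for _ in range(max_iters):
        if err < tol:
            break
        J = jacobian(X, Z)
        # Normal equations with the diagonal scaled by (1 + lambda).
        n11 = sum(a * a for a, _ in J) * (1.0 + lam)
        n22 = sum(b * b for _, b in J) * (1.0 + lam)
        n12 = sum(a * b for a, b in J)
        g1 = -sum(a * e for (a, _), e in zip(J, eps))
        g2 = -sum(b * e for (_, b), e in zip(J, eps))
        det = n11 * n22 - n12 * n12
        dX = (n22 * g1 - n12 * g2) / det
        dZ = (n11 * g2 - n12 * g1) / det
        new_eps = residuals(X + dX, Z + dZ)
        new_err = squared_error(new_eps)
        if new_err < err:      # accept the step and scale lambda down
            X, Z, eps, err = X + dX, Z + dZ, new_eps, new_err
            lam /= 10.0
        else:                  # keep the old values and scale lambda up
            lam *= 10.0
    return X, Z


X_ref, Z_ref = refine_point(0.0, 2.0)
```

Starting from the rough guess (0, 2), the loop moves toward the point (0.5, 4.0) that generated the observations. A full bundle adjustment applies the same update to all camera and point parameters at once, using the block structure U, V, and W.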
4 Experimental results
The system is tested on a Surface tablet with an Intel Core i3-4020Y running at 1.5 GHz and on a personal computer with an Intel Core i7-4770 CPU running at 3.4 GHz. In the experiments, our system runs faster than 1 fps on the tablet and 5 fps on the PC, depending on the size of the input video, the number of feature points, and the moving speed of the camera. We compare our method with other 3D reconstruction methods, namely the SfM methods [12, 14, 15]. These SfM methods focus on precision and were designed for reconstructing vast scenery. Hence, they always take minutes or even hours to finish the whole process. The first one we test is Bundler: SfM for unordered image collections [12, 13]. It is a well-known SfM method that can handle a large number of images with different intrinsic parameters by checking the EXIF data of every photo. VisualSFM [14] and OpenMVG [15] are similar methods that can also handle a large number of images, but with different feature detection, matching, tracking, outlier removal, and distortion recovery techniques. Furthermore, they also improve the speed with multithreading. In contrast, for a continuous video taken by a single moving camera, we provide a much faster method that runs in a single thread and is able to obtain a compromise solution in real time.
Four different video sequences are used for testing. Two videos were taken indoors and the other two were taken outdoors. The sequences were captured by a handheld camera, and every frame was extracted into a PNG file. We first compare the timing results in Section 4.1 and then compare the results of 3D reconstruction in Section 4.2.
4.1 Timing results
The timing results of using all frames (examined on PC)
The timing results of using partial frames and the proposed method using all frames (on PC)
Sequence | Frames (SfM methods / proposed) | Bundler [12] | VisualSFM [14] | OpenMVG [15] | Proposed
Sequence 1 | 70 / 349 | 5 min 52 s | 2 min 20 s | 34 s | 1 min 24 s
Sequence 2 | 67 / 534 | 2 min 57 s | 2 min 18 s | 32 s | 1 min 12 s
Sequence 3 | 160 / 1601 | 46 min 03 s | 11 min 19 s | 1 min 19 s | 5 min 22 s
Sequence 4 | 33 / 486 | 2 min 08 s | 1 min 49 s | 15 s | 1 min 24 s
The timing results of the proposed method using all frames (on tablet)
 | Sequence 1 | Sequence 2 | Sequence 3 | Sequence 4
Timing | 51 min 36 s | 42 min 6 min | 7 min 17 s | 1 min 24 s
Time cost per frame with all four video sequences
 | Sequence 1 (PC) | Sequence 1 (tablet) | Sequence 2 (PC) | Sequence 2 (tablet) | Sequence 3 (PC) | Sequence 3 (tablet) | Sequence 4 (PC) | Sequence 4 (tablet)
3D points | 18,129 | 15,561 | 15,057 | 22,841 | 35,717 | 36,528 | 39,682 | 36,012
Feature processing (ms/frame) | 72.94 | 211.95 | 56.25 | 204.81 | 171.20 | 589.85 | 122.27 | 390.18
Pose estimation (ms/keyframe) | 117.88 | 170.85 | 549.53 | 2300.3 | 120.28 | 286.22 | 135.42 | 352.04
Point tracking (ms/keyframe) | 57.45 | 81.92 | 51.59 | 166.38 | 45.26 | 169.89 | 64.75 | 158.944
Triangulation (ms/keyframe) | 17.07 | 53.67 | 6.62 | 18.06 | 20.94 | 81.60 | 33.96 | 55.5213
Average FPS | 11.06 | 4.17 | 8.21 | 2.02 | 5.34 | 1.57 | 6.68 | 2.18
4.2 3D reconstruction results
The 3D reconstruction results are tested with four different video sequences, sequence 1 through sequence 4, and the results of each method are shown as point clouds in the figures in this section.
Sequence 1 is an indoor video sequence that contains 349 frames, as illustrated in Fig. 5. The reconstructed point clouds are illustrated in Fig. 6. We can see that the results of Bundler and OpenMVG are indistinct, while VisualSFM gives a rather precise result. The proposed method is also able to give a precise result.
5 Discussion
Based on our investigation, methods that use only cameras work more slowly than methods with additional equipment because vision-based methods need time to calculate the depth information. For mono vision systems, the most important issues are precision and speed because the actual camera position can only be estimated from changes between camera frames. Most of the SfM methods [12, 15] use the well-known SIFT feature detection and description method, which costs much more time to process, and this is why SLAM [11] systems usually use the FAST corner detector, which helps them achieve real-time performance. SLAM systems maintain their robustness by loop closure detection using the map they build. In the proposed method, we also use the FAST corner detector to accelerate the process. Moreover, unlike SLAM systems, the proposed method does not build a map because doing so usually requires parallelizing the program, which makes it less suitable for implementation on mobile devices. Since our main priorities are processing speed and point density, we simply maintain precision by carefully choosing the key frames, so the area that the proposed method can reconstruct is smaller than that of SfM or SLAM methods. However, among the systems that build a point cloud, the proposed method uses a continuous video, processes it very fast, and maintains precision within a certain area.
6 Conclusions
Over the years, the computer vision community has devoted much effort to improving the quality of reconstructed 3D point clouds. As part of this effort, we have demonstrated a system that generates an accurate point cloud at high speed. Compared with other existing 3D reconstruction methods, we are able to use only a mono video sequence as input on a single CPU and reconstruct a 3D point cloud that is as dense as possible.
The most critical part of a mono 3D reconstruction method is the heavy load of estimating the camera position. Unlike in a stereo system, the camera position can only be inferred from the projections between frames. The proposed method is able to lower the load of estimating the camera position while losing very little precision. Furthermore, lacking a camera baseline as a reference, the estimated camera position usually comes not only with noise but also with an ambiguous scale between the pixels and the real world. Consequently, although the proposed system is able to reconstruct the scenery within an area, it will also encounter some scale drift when the video sequence is recorded over a very long distance.
Last but not least, we hope that this kind of fast and accurate 3D reconstruction algorithm can be promoted and become a readily available tool for artists, architects, engineers, and anyone who wants to build 3D scenery.
Declarations
Acknowledgements
The authors would like to thank the Ministry of Science and Technology in Taiwan for supporting this research under the project MOST 104-2220-E-011-001.
Authors’ contributions
BYS carried out the algorithm studies, platform implementation and the simulation and drafted the manuscript. CHL participated in the algorithm studies and helped to draft the manuscript. Both authors read and approved the final manuscript.
Competing interests
The authors declare that they have no competing interests.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
Authors’ Affiliations
References
1. A Fenster, DB Downey, Fast parametric elastic image registration. IEEE Eng Med Biol Mag 15(6), 41–51 (2002)
2. JM Lopez-Sanchez, J Fortuny-Guasch, 3D radar imaging using range migration techniques. IEEE Trans. Antennas Propag. 48(5), 728–737 (2002)
3. B Douillard, J Underwood, N Kuntz, V Vlaskine, A Quadros, P Morton, A Frenkel, On the Segmentation of 3D LIDAR Point Clouds (IEEE International Conference on Robotics and Automation, Shanghai, 2011), pp. 2798–2805
4. F Endres, J Hess, N Engelhard, J Sturm, An Evaluation of the RGB-D SLAM System (IEEE International Conference on Robotics and Automation (ICRA), Saint Paul, 2012), pp. 1691–1696
5. J Chen, D Bautembach, S Izadi, Scalable real-time volumetric surface reconstruction. ACM Trans. Graph. - SIGGRAPH 2013 Conference Proceedings 32(4), 1–16 (2013)
6. QY Zhou, V Koltun, Dense scene reconstruction with points of interest. ACM Trans. Graph. - SIGGRAPH 2013 Conference Proceedings 32(4), 1–8 (2013)
7. M Nießner, M Zollhöfer, S Izadi, M Stamminger, Real-time 3D reconstruction at scale using voxel hashing. ACM Trans. Graph. - Proceedings of ACM SIGGRAPH Asia 32(6), 1–11 (2013)
8. SM Seitz, B Curless, J Diebel, D Scharstein, A Comparison and Evaluation of Multi-View Stereo Reconstruction Algorithms. IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2006, pp. 519–528
9. B Micusik, J Kosecka, Piecewise Planar City 3D Modeling from Street View Panoramic Sequences (IEEE Conference on Computer Vision and Pattern Recognition, Miami, 2009), pp. 2906–2912
10. A Geiger, J Ziegler, C Stiller, StereoScan: Dense 3d Reconstruction in Real-time (IEEE Intelligent Vehicles Symposium, Baden-Baden, 2011), pp. 963–968
11. G Klein, D Murray, Parallel Tracking and Mapping for Small AR Workspaces. ISMAR '07 Proceedings of the 6th IEEE and ACM International Symposium on Mixed and Augmented Reality, 2007, pp. 1–10
12. N Snavely, SM Seitz, R Szeliski, Modeling the world from internet photo collections. Int. J. Comput. Vis. 80(2), 189–210 (2008)
13. S Agarwal, Y Furukawa, N Snavely, I Simon, B Curless, SM Seitz, R Szeliski, Building Rome in a day. Commun. ACM 54(10), 105–112 (2011)
14. C Wu, Towards Linear-Time Incremental Structure from Motion (International Conference on 3D Vision, Seattle, 2013), pp. 127–134
15. P Moulon, P Monasse, R Marlet, Global Fusion of Relative Motions for Robust, Accurate and Scalable Structure from Motion (IEEE International Conference on Computer Vision, Sydney, 2013), pp. 3248–3255
16. WC Jhou, WH Cheng, Animating still landscape photographs through cloud motion creation. IEEE Trans. Multimedia 18(1), 4–13 (2016). doi:10.1109/TMM.2015.2500031
17. J Sanchez-Riera, KL Hua, YS Hsiao, T Lim, SC Hidayati, WH Cheng, A comparative study of data fusion for RGB-D based visual recognition. Pattern Recogn. Lett. 73, 1–6 (2016)
18. J Sanchez-Riera, YS Hsiao, T Lim, KL Hua, WH Cheng, A Robust Tracking Algorithm for 3D Hand Gesture with Rapid Hand Motion Through Deep Learning (IEEE International Conference on Multimedia and Expo Workshops (ICMEW), Chengdu, 2014), pp. 1–6. doi:10.1109/ICMEW.2014.6890556
19. A Saxena, M Sun, AY Ng, Learning 3D Scene Structure from a Single Still Image (IEEE 11th International Conference on Computer Vision, Rio de Janeiro, 2007), pp. 1–8
20. A Gupta, AA Efros, M Hebert, Blocks World Revisited: Image Understanding Using Qualitative Geometry and Mechanics. Computer Vision - ECCV 2010: 11th European Conference on Computer Vision, 2010, pp. 482–496
21. AJ Davison, ID Reid, ND Molton, O Stasse, MonoSLAM: real-time single camera SLAM. IEEE Trans Pattern Anal Mach Intell 29(6), 1052–1067 (2007)
22. A Akbarzadeh, JM Frahm, P Mordohai, B Clipp, C Engels, D Gallup, M Pollefeys, Towards Urban 3D Reconstruction from Video (Third International Symposium on 3D Data Processing, Visualization, and Transmission, Chapel Hill, 2006), pp. 1–8
23. DG Lowe, Distinctive image features from scale-invariant keypoints. Int. J. Comput. Vis. 60, 91–110 (2004)
24. H Bay, T Tuytelaars, LV Gool, Speeded-up robust features (SURF). Comput. Vis. Image Underst. 110(3), 346–359 (2008)
25. E Rublee, V Rabaud, K Konolige, G Bradski, ORB: an Efficient Alternative to SIFT or SURF. International Conference on Computer Vision (IEEE, Barcelona, 2011), pp. 2564–2571. doi:10.1109/ICCV.2011.6126544
26. PF Alcantarilla, J Nuevo, A Bartoli, Fast Explicit Diffusion for Accelerated Features in Nonlinear Scale Spaces (British Machine Vision Conference (BMVC), Bristol, 2013), pp. 1–11
27. E Rosten, T Drummond, Machine Learning for High-Speed Corner Detection. Proceedings of the 9th European conference on Computer Vision, 1, 2006, pp. 105–119
28. JY Bouguet, Pyramidal Implementation of the Lucas Kanade Feature Tracker. Intel Corporation, Microprocessor Research Labs, 2000, pp. 1–9
29. R Hartley, A Zisserman, Multiple View Geometry in Computer Vision (second ed) (Cambridge, Cambridge University Press The Edinburgh Building, 2004)
30. MA Fischler, RC Bolles, Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. Commun. ACM 26, 381–395 (1981). doi:10.1145/358669.358692
31. C Zach, Robust bundle adjustment revisited. Comput. Vis. 8693, 772–787 (2014). doi:10.1007/978-3-319-10602-1_50