ORB-SLAM: A Versatile and Accurate Monocular SLAM System

   Abstract—This paper presents ORB-SLAM, a feature-based monocular simultaneous localization and mapping (SLAM) system that operates in real time, in small and large indoor and outdoor environments. The system is robust to severe motion clutter, allows wide baseline loop closing and relocalization, and includes full automatic initialization. Building on excellent algorithms of recent years, we designed from scratch a novel system that uses the same features for all SLAM tasks: tracking, mapping, relocalization, and loop closing. A survival of the fittest strategy that selects the points and keyframes of the reconstruction leads to excellent robustness and generates a compact and trackable map that only grows if the scene content changes, allowing lifelong operation. We present an exhaustive evaluation in 27 sequences from the most popular datasets. ORB-SLAM achieves unprecedented performance with respect to other state-of-the-art monocular SLAM approaches. For the benefit of the community, we make the source code public.

   Index Terms—Lifelong mapping, localization, monocular vision, recognition, simultaneous localization and mapping (SLAM).

                              I. INTRODUCTION

   BUNDLE adjustment (BA) is known to provide accurate estimates of camera localizations as well as a sparse geometrical reconstruction [1], [2], given that a strong network of matches and good initial guesses are provided. For a long time, this approach was considered unaffordable for real-time applications such as visual simultaneous localization and mapping (visual SLAM). Visual SLAM has the goal of estimating the camera trajectory while reconstructing the environment. Now, we know that to achieve accurate results at nonprohibitive computational cost, a real-time SLAM algorithm has to provide BA with the following.
   1) Corresponding observations of scene features (map points) among a subset of selected frames (keyframes).
   2) As complexity grows with the number of keyframes, their selection should avoid unnecessary redundancy.
   3) A strong network configuration of keyframes and points to produce accurate results, that is, a well spread set of keyframes observing points with significant parallax and with plenty of loop closure matches.
   4) An initial estimation of the keyframe poses and point locations for the nonlinear optimization.
   5) A local map in exploration where optimization is focused to achieve scalability.
   6) The ability to perform fast global optimizations (e.g., pose graph) to close loops in real time.
   The first real-time application of BA was the visual odometry work of Mouragon et al. [3], followed by the ground-breaking SLAM work of Klein and Murray [4], known as parallel tracking and mapping (PTAM). This algorithm, while limited to small-scale operation, provides simple but effective methods for keyframe selection, feature matching, point triangulation, camera localization for every frame, and relocalization after tracking failure. Unfortunately, several factors severely limit its application: lack of loop closing and adequate handling of occlusions, low invariance to viewpoint of the relocalization, and the need for human intervention for map bootstrapping.
   In this study, we build on the main ideas of PTAM, the place recognition work of Gálvez-López and Tardós [5], the scale-aware loop closing of Strasdat et al. [6], and the use of covisibility information for large-scale operation [7], [8], to design from scratch ORB-SLAM, a novel monocular SLAM system whose main contributions are as follows.
   1) Use of the same features for all tasks: tracking, mapping, relocalization, and loop closing. This makes our system more efficient, simple, and reliable. We use ORB features [9], which allow real-time performance without GPUs, providing good invariance to changes in viewpoint and illumination.
   2) Real-time operation in large environments. Thanks to the use of a covisibility graph, tracking and mapping are focused in a local covisible area, independent of global map size.
   3) Real-time loop closing based on the optimization of a pose graph that we call the Essential Graph. It is built from a spanning tree maintained by the system, loop closure links, and strong edges from the covisibility graph.
   4) Real-time camera relocalization with significant invariance to viewpoint and illumination. This allows recovery from tracking failure and also enhances map reuse.
   5) A new automatic and robust initialization procedure based on model selection that permits creating an initial map of planar and nonplanar scenes.
   6) A survival of the fittest approach to map point and keyframe selection that is generous in the spawning but very restrictive in the culling. This policy improves tracking robustness and enhances lifelong operation because redundant keyframes are discarded.

   Manuscript received April 28, 2015; accepted July 27, 2015. Date of publication August 24, 2015; date of current version September 30, 2015. This paper was recommended for publication by Associate Editor D. Scaramuzza and Editor D. Fox upon evaluation of the reviewers' comments. This work was supported by the Dirección General de Investigación of Spain under Project DPI2012-32168, the Ministerio de Educación Scholarship FPU13/04175, and Gobierno de Aragón Scholarship B121/13.
   The authors are with Instituto de Investigación en Ingeniería de Aragón (I3A), Universidad de Zaragoza, 50018 Zaragoza, Spain (e-mail: raulmur@unizar.es; josemari@unizar.es; tardos@unizar.es).
   Digital Object Identifier 10.1109/TRO.2015.2463671
1148                                                                                   IEEE TRANSACTIONS ON ROBOTICS, VOL. 31, NO. 5, OCTOBER 2015
   We present an extensive evaluation in popular public datasets from indoor and outdoor environments, including hand-held, car, and robot sequences. Notably, we achieve better camera localization accuracy than the state of the art in direct methods [10], which optimize directly over pixel intensities instead of feature reprojection errors. We include a discussion in Section IX-B on the possible causes that can make feature-based methods more accurate than direct methods.
   The loop closing and relocalization methods presented here are based on our previous work [11]. A preliminary version of the system was presented in [12]. In the current paper, we add the initialization method, the Essential Graph, and perfect all methods involved. We also describe in detail all building blocks and perform an exhaustive experimental validation.
   To the best of our knowledge, this is the most complete and reliable solution to monocular SLAM, and for the benefit of the community, we make the source code public. Demonstration videos and the code can be found on our project webpage.1

   1 http://webdiis.unizar.es/~raulmur/orbslam

                           II. RELATED WORK

A. Place Recognition
   The survey by Williams et al. [13] compared several approaches for place recognition and concluded that techniques based on appearance, that is, image-to-image matching, scale better in large environments than map-to-map or image-to-map methods. Within appearance-based methods, bag-of-words techniques [14], such as the probabilistic approach FAB-MAP [15], are to the fore because of their high efficiency. DBoW2 [5] used for the first time bags of binary words obtained from BRIEF descriptors [16] along with the very efficient FAST feature detector [17]. This reduced the time needed for feature extraction by more than an order of magnitude compared with the SURF [18] and SIFT [19] features that had been used in bag-of-words approaches until then. Although the system proved very efficient and robust, the use of BRIEF, which is neither rotation nor scale invariant, limited the system to in-plane trajectories and loop detection from similar viewpoints. In our previous work [11], we proposed a bag-of-words place recognizer built on DBoW2 with ORB [9]. ORB are binary features invariant to rotation and scale (in a certain range), resulting in a very fast recognizer with good invariance to viewpoint. We demonstrated the high recall and robustness of the recognizer in four different datasets, requiring less than 39 ms (including feature extraction) to retrieve a loop candidate from a 10 K image database. In this study, we use an improved version of that place recognizer, using covisibility information and returning several hypotheses when querying the database instead of just the best match.

B. Map Initialization
   Monocular SLAM requires a procedure to create an initial map because depth cannot be recovered from a single image. One way to solve the problem is to initially track a known structure [20]. In the context of filtering approaches, points can be initialized with high uncertainty in depth using an inverse depth parameterization [21], which hopefully will later converge to their real positions. The recent semidense work of Engel et al. [10] follows a similar approach, initializing the depth of the pixels to a random value with high variance.
   Initialization methods from two views either assume local scene planarity [4], [22] and recover the relative camera pose from a homography using the method of Faugeras and Lustman [23], or compute an essential matrix [24], [25] that models planar and general scenes, using the five-point algorithm of Nistér [26], which requires dealing with multiple solutions. Both reconstruction methods are not well constrained under low parallax and suffer from a twofold ambiguity if all points of a planar scene are closer to one of the camera centers [27]. On the other hand, if a nonplanar scene is seen with parallax, a unique fundamental matrix can be computed with the eight-point algorithm [2], and the relative camera pose can be recovered without ambiguity.
   In Section IV, we present a new automatic approach based on model selection between a homography for planar scenes and a fundamental matrix for nonplanar scenes. A statistical approach to model selection was proposed by Torr et al. [28]. Under a similar rationale, we have developed a heuristic initialization algorithm that takes into account the risk of selecting a fundamental matrix in close-to-degenerate cases (i.e., planar, nearly planar, and low parallax), favoring the selection of the homography. In the planar case, for the sake of safe operation, we refrain from initializing if the solution has a twofold ambiguity, as a corrupted solution could be selected. We delay the initialization until the method produces a unique solution with significant parallax.

C. Monocular Simultaneous Localization and Mapping
   Monocular SLAM was initially solved by filtering [20], [21], [29], [30]. In that approach, every frame is processed by the filter to jointly estimate the map feature locations and the camera pose. It has the drawbacks of wasting computation in processing consecutive frames with little new information and the accumulation of linearization errors. On the other hand, keyframe-based approaches [3], [4] estimate the map using only selected frames (keyframes), allowing more costly but accurate BA optimizations, as mapping is not tied to frame rate. Strasdat et al. [31] demonstrated that keyframe-based techniques are more accurate than filtering for the same computational cost.
   The most representative keyframe-based SLAM system is probably PTAM by Klein and Murray [4]. It was the first work to introduce the idea of splitting camera tracking and mapping into parallel threads and demonstrated to be successful for real-time augmented reality applications in small environments. The original version was later improved with edge features, a rotation estimation step during tracking, and a better relocalization method [32]. The map points of PTAM correspond to FAST corners matched by patch correlation. This makes the points only useful for tracking but not for place recognition. In fact, PTAM does not detect large loops, and the relocalization is based on the correlation of low-resolution thumbnails of the keyframes, yielding a low invariance to viewpoint.
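As a concrete illustration of the eight-point alternative discussed in Section II-B, the fundamental matrix can be estimated from eight or more correspondences by solving a homogeneous linear system. The following NumPy sketch (our illustration of the classical normalized eight-point algorithm [2], not code from the ORB-SLAM release) makes the procedure explicit:

```python
import numpy as np

def normalize(pts):
    # Shift to the centroid and scale so the mean distance from the
    # origin is sqrt(2) (Hartley normalization improves conditioning).
    c = pts.mean(axis=0)
    s = np.sqrt(2) / np.mean(np.linalg.norm(pts - c, axis=1))
    T = np.array([[s, 0.0, -s * c[0]], [0.0, s, -s * c[1]], [0.0, 0.0, 1.0]])
    ph = np.hstack([pts, np.ones((len(pts), 1))])
    return (T @ ph.T).T, T

def eight_point(x1, x2):
    """Estimate F from n >= 8 correspondences so that x2^T F x1 = 0.

    x1, x2: (n, 2) arrays of matched image points.
    """
    n1, T1 = normalize(x1)
    n2, T2 = normalize(x2)
    # Each correspondence contributes one row of the system A f = 0.
    A = np.array([[u2 * u1, u2 * v1, u2, v2 * u1, v2 * v1, v2, u1, v1, 1.0]
                  for (u1, v1, _), (u2, v2, _) in zip(n1, n2)])
    _, _, Vt = np.linalg.svd(A)
    F = Vt[-1].reshape(3, 3)
    # Enforce rank 2 by zeroing the smallest singular value.
    U, S, Vt = np.linalg.svd(F)
    F = U @ np.diag([S[0], S[1], 0.0]) @ Vt
    return T2.T @ F @ T1  # undo the normalization
```

With exact correspondences from a nonplanar scene seen with sufficient parallax, the recovered F satisfies the epipolar constraint for every match, in line with the unambiguous recovery described above.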
   Strasdat et al. [6] presented a large-scale monocular SLAM system with a front-end based on optical flow implemented on a GPU, followed by FAST feature matching and motion-only BA, and a back-end based on sliding-window BA. Loop closures were solved with a pose graph optimization with similarity constraints [7 degrees of freedom (DoF)], which was able to correct the scale drift appearing in monocular SLAM. From this work, we take the idea of loop closing with 7-DoF pose graph optimization and apply it to the Essential Graph defined in Section III-D.
   Strasdat et al. [7] used the front-end of PTAM, but performed the tracking only in a local map retrieved from a covisibility graph. They proposed a double-window optimization back-end that continuously performs BA in the inner window and pose graph optimization in a limited-size outer window. However, loop closing is only effective if the size of the outer window is large enough to include the whole loop. In our system, we take advantage of the excellent ideas of using a local map based on covisibility and building the pose graph from the covisibility graph, but apply them in a totally redesigned front-end and back-end. Another difference is that, instead of using specific features for loop detection (SURF), we perform the place recognition on the same tracked and mapped features, obtaining robust frame-rate relocalization and loop detection.
   Pirker et al. [33] proposed CD-SLAM, a very complete system including loop closing, relocalization, large-scale operation, and efforts to work in dynamic environments. However, map initialization is not mentioned. The lack of a public implementation does not allow us to perform a comparison of accuracy, robustness, or large-scale capabilities.
   The visual odometry of Song et al. [34] uses ORB features for tracking and a temporal sliding-window BA back-end. In comparison, our system is more general, as theirs does not have global relocalization or loop closing and does not reuse the map. They also use the known distance from the camera to the ground to limit monocular scale drift.
   Lim et al. [25], in work published after we submitted the preliminary version of this work [12], also use the same features for tracking, mapping, and loop detection. However, the choice of BRIEF limits the system to in-plane trajectories. Their system only tracks points from the last keyframe; therefore, the map is not reused if revisited (similar to visual odometry) and has the problem of growing unbounded. We qualitatively compare our results with this approach in Section VIII-E.
   The recent work of Engel et al. [10], known as LSD-SLAM, is able to build large-scale semidense maps, using direct methods (i.e., optimization directly over image pixel intensities) instead of BA over features. Their results are very impressive, as the system is able to operate in real time, without GPU acceleration, building a semidense map, with more potential applications for robotics than the sparse output generated by feature-based SLAM. Nevertheless, they still need features for loop detection, and their camera localization accuracy is significantly lower than in our system and PTAM, as we show experimentally in Section VIII-B. This surprising result is discussed in Section IX-B.
   Halfway between direct and feature-based methods is the semidirect visual odometry SVO of Forster et al. [22]. Without requiring feature extraction in every frame, it is able to operate at high frame rates, obtaining impressive results on quadrotors. However, no loop detection is performed, and the current implementation is mainly intended for downward-looking cameras.
   Finally, we want to discuss keyframe selection. All visual SLAM works in the literature agree that running BA with all the points and all the frames is not feasible. The work of Strasdat et al. [31] showed that the most cost-effective approach is to keep as many points as possible, while keeping only nonredundant keyframes. The PTAM approach was to insert keyframes very cautiously to avoid an excessive growth of the computational complexity. This restrictive keyframe insertion policy makes the tracking fail in hard exploration conditions. Our survival of the fittest strategy achieves unprecedented robustness in difficult scenarios by inserting keyframes as quickly as possible and later removing the redundant ones, to avoid the extra cost.

                        III. SYSTEM OVERVIEW

A. Feature Choice
   One of the main design ideas in our system is that the same features used by the mapping and tracking are used for place recognition to perform frame-rate relocalization and loop detection. This makes our system efficient and avoids the need to interpolate the depth of the recognition features from near SLAM features as in previous works [6], [7]. We require features that can be extracted in much less than 33 ms per image, which excludes the popular SIFT (∼300 ms) [19], SURF (∼300 ms) [18], and the recent A-KAZE (∼100 ms) [35]. To obtain general place recognition capabilities, we require rotation invariance, which excludes BRIEF [16] and LDB [36].
   We chose ORB [9], which are oriented multiscale FAST corners with an associated 256-bit descriptor. They are extremely fast to compute and match, while they have good invariance to viewpoint. This allows us to match them with wide baselines, boosting the accuracy of BA. We have already shown the good performance of ORB for place recognition in [11]. While our current implementation makes use of ORB, the techniques proposed are not restricted to these features.

B. Three Threads: Tracking, Local Mapping, and Loop Closing
   Our system, see an overview in Fig. 1, incorporates three threads that run in parallel: tracking, local mapping, and loop closing. The tracking is in charge of localizing the camera with every frame and deciding when to insert a new keyframe. We first perform an initial feature matching with the previous frame and optimize the pose using motion-only BA. If the tracking is lost (e.g., due to occlusions or abrupt movements), the place recognition module is used to perform a global relocalization. Once there is an initial estimation of the camera pose and feature matchings, a local visible map is retrieved using the covisibility graph of keyframes that is maintained by the system [see Fig. 2(a) and (b)]. Then, matches with the local map points are
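To make concrete why binary descriptors such as ORB are so cheap to match (Section III-A), the following sketch brute-force matches two sets of 256-bit descriptors (stored as 32 uint8 bytes each) by Hamming distance. This is our illustration, not ORB-SLAM's matcher; the max_dist and ratio thresholds are assumed values:

```python
import numpy as np

def hamming(d1, d2):
    # 256-bit descriptors stored as 32 uint8 values: XOR, then popcount.
    return int(np.unpackbits(np.bitwise_xor(d1, d2)).sum())

def match(desc_a, desc_b, max_dist=50, ratio=0.9):
    """Brute-force matching with a nearest-neighbor ratio test.

    desc_a, desc_b: (n, 32) uint8 arrays, at least two rows in desc_b.
    Returns (index_a, index_b, distance) triples; the max_dist and
    ratio values are illustrative assumptions, not from the paper.
    """
    matches = []
    for i, da in enumerate(desc_a):
        dists = np.array([hamming(da, db) for db in desc_b])
        order = np.argsort(dists)
        best, second = int(order[0]), int(order[1])
        # Accept only close and unambiguous nearest neighbors.
        if dists[best] < max_dist and dists[best] < ratio * dists[second]:
            matches.append((i, best, int(dists[best])))
    return matches
```

The cost per candidate pair is a single XOR plus a popcount, which is why wide-baseline brute-force matching of ORB remains compatible with real-time operation.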
searched by reprojection, and the camera pose is optimized again with all matches. Finally, the tracking thread decides if a new keyframe is inserted. All the tracking steps are explained in detail in Section V. The novel procedure to create an initial map is presented in Section IV.
   The local mapping processes new keyframes and performs local BA to achieve an optimal reconstruction in the surroundings of the camera pose. New correspondences for unmatched ORB features in the new keyframe are searched in connected keyframes in the covisibility graph to triangulate new points. Some time after creation, based on the information gathered during the tracking, an exigent point culling policy is applied in order to retain only high-quality points. The local mapping is also in charge of culling redundant keyframes. We explain all local mapping steps in detail in Section VI.
   The loop closing searches for loops with every new keyframe. If a loop is detected, we compute a similarity transformation that informs about the drift accumulated in the loop. Then, both sides of the loop are aligned and duplicated points are fused. Finally, a pose graph optimization over similarity constraints [6] is performed to achieve global consistency. The main novelty is that we perform the optimization over the Essential Graph, a sparser subgraph of the covisibility graph, which is explained in Section III-D. The loop detection and correction steps are explained in detail in Section VII.
   We use the Levenberg–Marquardt algorithm implemented in g2o [37] to carry out all optimizations. In the Appendix, we describe the error terms, cost functions, and variables involved in each optimization.

Fig. 1. ORB-SLAM system overview, showing all the steps performed by the tracking, local mapping, and loop closing threads. The main components of the place recognition module and the map are also shown.

Fig. 2. Reconstruction and graphs in the sequence fr3_long_office_household from the TUM RGB-D Benchmark [38]. (a) Keyframes (blue), current camera (green), map points (black, red), and current local map points (red). (b) Covisibility graph. (c) Spanning tree (green) and loop closure (red). (d) Essential Graph.

C. Map Points, Keyframes, and Their Selection
   Each map point pi stores the following:
   1) its 3-D position Xw,i in the world coordinate system;
   2) the viewing direction ni, which is the mean unit vector of all its viewing directions (the rays that join the point with the optical centers of the keyframes that observe it);
   3) a representative ORB descriptor Di, which is the associated ORB descriptor whose Hamming distance is minimum with respect to all other associated descriptors in the keyframes in which the point is observed;
   4) the maximum dmax and minimum dmin distances at which the point can be observed, according to the scale invariance limits of the ORB features.
   Each keyframe Ki stores the following:
   1) the camera pose Tiw, which is a rigid body transformation that transforms points from the world to the camera coordinate system;
   2) the camera intrinsics, including focal length and principal point;
   3) all the ORB features extracted in the frame, associated or not with a map point, whose coordinates are undistorted if a distortion model is provided.
   Map points and keyframes are created with a generous policy, while a later very exigent culling mechanism is in charge of detecting redundant keyframes and wrongly matched or untrackable map points. This permits a flexible map expansion during exploration, which boosts tracking robustness under hard conditions (e.g., rotations, fast movements), while the map size is bounded during continual revisits to the same environment, i.e., lifelong operation. Additionally, our maps contain very few outliers compared with PTAM, at the expense of containing fewer points. Culling procedures of map points and keyframes are explained in Sections VI-B and VI-E, respectively.
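The representative descriptor of item 3 above can be computed directly from the descriptors associated with a map point. The sketch below is our reading of the criterion, not the released implementation: it returns the descriptor minimizing the summed Hamming distance to all the others (an implementation could equally minimize the median distance):

```python
import numpy as np

def representative_descriptor(descriptors):
    """Pick the descriptor that best represents all observations of a point.

    descriptors: (n, 32) uint8 array, one 256-bit ORB descriptor per
    keyframe observing the map point. Returns the row whose summed
    Hamming distance to the other rows is minimal.
    """
    bits = np.unpackbits(descriptors, axis=1)          # (n, 256) of 0/1
    # Pairwise Hamming distances via broadcasting on the unpacked bits.
    dist = (bits[:, None, :] != bits[None, :, :]).sum(axis=2)
    return descriptors[int(np.argmin(dist.sum(axis=1)))]
```

Caching one such representative per map point lets later matching (tracking, relocalization, loop detection) compare a candidate feature against a single descriptor rather than against every observation.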
D. Covisibility Graph and Essential Graph
   Covisibility information between keyframes is very useful in several tasks of our system and is represented as an undirected weighted graph as in [7]. Each node is a keyframe, and an edge between two keyframes exists if they share observations of the same map points (at least 15), the weight θ of the edge being the number of common map points.
   In order to correct a loop, we perform a pose graph optimization [6] that distributes the loop closing error along the graph. In order not to include all the edges provided by the covisibility graph, which can be very dense, we propose to build an Essential Graph that retains all the nodes (keyframes) but fewer edges, still preserving a strong network that yields accurate results. The system incrementally builds a spanning tree from the initial keyframe, which provides a connected subgraph of the covisibility graph with a minimal number of edges. When a new keyframe is inserted, it is included in the tree linked to the keyframe that shares most point observations, and when a keyframe is erased by the culling policy, the system updates the links affected by that keyframe. The Essential Graph contains the spanning tree, the subset of edges from the covisibility graph with high covisibility (θmin = 100), and the loop closure edges, resulting in a strong network of cameras. Fig. 2 shows an example of a covisibility graph, spanning tree, and associated Essential Graph. As shown in the experiments of Section VIII-E, when performing the pose graph optimization, the solution is so accurate that an additional full BA optimization barely improves it. The efficiency of the Essential Graph and the influence of θmin are shown at the end of Section VIII-E.

E. Bags of Words Place Recognition
   An additional benefit of the bags of words representation for feature matching was reported in [5]. When we want to compute the correspondences between two sets of ORB features, we can constrain the brute-force matching to only those features that belong to the same node in the vocabulary tree at a certain level (we select the second out of six), speeding up the search. We use this trick when searching matches for triangulating new points, and at loop detection and relocalization. We also refine the correspondences with an orientation consistency test (see [11] for details) that discards outliers, ensuring a coherent rotation for all correspondences.

                 IV. AUTOMATIC MAP INITIALIZATION
   The goal of the map initialization is to compute the relative pose between two frames to triangulate an initial set of map points. This method should be independent of the scene (planar or general) and should not require human intervention to select a good two-view configuration, i.e., a configuration with significant parallax. We propose to compute in parallel two geometrical models: a homography assuming a planar scene and a fundamental matrix assuming a nonplanar scene. We then use a heuristic to select a model and try to recover the relative pose with a specific method for the selected model. Our method only initializes when it is certain that the two-view configuration is safe, detecting low-parallax cases and the well-known twofold planar ambiguity [27], avoiding the initialization of a corrupted map. The steps of our algorithm are as follows.
   1) Find initial correspondences: Extract ORB features (only at the finest scale) in the current frame Fc and search for matches xc ↔ xr in the reference frame Fr. If not enough matches are found, reset the reference frame.
                                                                               2) Parallel computation of the two models: Compute in par-
   The system has embedded a bags of words place recogni-                          allel threads a homography Hcr and a fundamental matrix
tion module, based on DBoW22 [5], to perform loop detection                        Fcr as
and relocalization. Visual words are just a discretization of the
descriptor space, which is known as the visual vocabulary. The                                    xc = Hcr xr ,      xTc Fcr xr = 0               (1)
vocabulary is created offline with the ORB descriptors extracted                   with the normalized DLT and eight-point algorithms, re-
from a large set of images. If the images are general enough, the                  spectively, as explained in [2] inside a RANSAC scheme.
same vocabulary can be used for different environments getting                     To make homogeneous the procedure for both models,
a good performance, as shown in our previous work [11]. The                        the number of iterations is prefixed and the same for both
system builds incrementally a database that contains an invert                     models, along with the points to be used at each iteration:
index, which stores for each visual word in the vocabulary, in                     eight for the fundamental matrix, and four of them for
which keyframes it has been seen, so that querying the database                    the homography. At each iteration, we compute a score
can be done very efficiently. The database is also updated when                    SM for each model M (H for the homography, F for the
a keyframe is deleted by the culling procedure.                                    fundamental matrix)
   Because there exists visual overlap between keyframes, when                                                                      
querying the database, there will not exist a unique keyframe                                    SM =          ρM d2cr (xic , xir , M )
with a high score. The original DBoW2 took this overlapping                                                  i
into account, adding up the score of images that are close in time.                                                                   
                                                                                                            + ρM (d2r c xic , xir , M )
This has the limitation of not including keyframes viewing the                                              
same place but inserted at a different time. Instead, we group                                               Γ − d2 , if d2 < TM
those keyframes that are connected in the covisibility graph.                                 ρM (d2 ) =                                          (2)
                                                                                                              0,           if d2 ≥ TM
In addition, our database returns all keyframe matches whose
scores are higher than the 75% of the best score.                                  where d2cr and d2r c are the symmetric transfer errors [2]
                                                                                   from one frame to the other. TM is the outlier rejection
                                                                                   threshold based on the χ2 test at 95% (TH = 5.99,
  2 https://github.com/dorian3d/DBoW2                                              TF = 3.84, assuming a standard deviation of 1 pixel in
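For illustration, the scoring in (2) can be sketched as follows. This is a minimal Python sketch, not the released C++ implementation; the helper names and the choice Γ = T_H (which makes the inlier contributions of both models comparable) are assumptions of the sketch.

```python
# Sketch of the truncated symmetric-transfer-error score S_M of Eq. (2).
# Assumes the squared transfer errors d^2 are precomputed per correspondence.

T_H = 5.99   # homography threshold: chi^2 at 95%, 1-pixel measurement std
T_F = 3.84   # fundamental-matrix threshold

GAMMA = T_H  # assumed constant of the truncated score (illustrative choice)

def rho(d2, t_m, gamma=GAMMA):
    """rho_M of Eq. (2): Gamma - d^2 for inliers (d^2 < T_M), else 0."""
    return gamma - d2 if d2 < t_m else 0.0

def score_model(d2_cr, d2_rc, t_m):
    """S_M: add the truncated scores of both transfer directions."""
    return sum(rho(a, t_m) + rho(b, t_m) for a, b in zip(d2_cr, d2_rc))

# Example: two correspondences; the second is an outlier in one direction.
s_h = score_model([1.0, 10.0], [0.5, 0.2], T_H)
```

The two scores S_H and S_F computed this way then feed the model-selection heuristic described above.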
Authorized licensed use limited to: ULAKBIM UASL - MARMARA UNIVERSITY. Downloaded on May 15,2020 at 19:34:33 UTC from IEEE Xplore. Restrictions apply.
1152                                                                                   IEEE TRANSACTIONS ON ROBOTICS, VOL. 31, NO. 5, OCTOBER 2015
MUR-ARTAL et al.: ORB-SLAM: A VERSATILE AND ACCURATE MONOCULAR SLAM SYSTEM                                                                       1153
B. Initial Pose Estimation From Previous Frame

   If tracking was successful in the last frame, we use a constant velocity motion model to predict the camera pose and perform a guided search of the map points observed in the last frame. If not enough matches are found (i.e., the motion model is clearly violated), we use a wider search of the map points around their position in the last frame. The pose is then optimized with the found correspondences.

C. Initial Pose Estimation via Global Relocalization

   If the tracking is lost, we convert the frame into bag of words and query the recognition database for keyframe candidates for global relocalization. We compute correspondences with ORB features associated with map points in each keyframe, as explained in Section III-E. We then alternately perform RANSAC iterations for each keyframe and try to find a camera pose using the PnP algorithm [41]. If we find a camera pose with enough inliers, we optimize the pose and perform a guided search of more matches with the map points of the candidate keyframe. Finally, the camera pose is optimized again, and if it is supported by enough inliers, the tracking procedure continues.

D. Track Local Map

   Once we have an estimation of the camera pose and an initial set of feature matches, we can project the map into the frame and search for more map point correspondences. To bound the complexity in large maps, we only project a local map. This local map contains the set of keyframes K1 that share map points with the current frame, and a set K2 of neighbors of the keyframes K1 in the covisibility graph. The local map also has a reference keyframe Kref ∈ K1, which shares the most map points with the current frame. Each map point seen in K1 and K2 is then searched in the current frame as follows.
   1) Compute the map point projection x in the current frame. Discard the point if x lies outside the image bounds.
   2) Compute the angle between the current viewing ray v and the map point mean viewing direction n. Discard the point if v · n < cos(60°).
   3) Compute the distance d from the map point to the camera center. Discard the point if d is outside the scale invariance region of the map point, d ∉ [dmin, dmax].
   4) Compute the scale in the frame by the ratio d/dmin.
   5) Compare the representative descriptor D of the map point with the still unmatched ORB features in the frame, at the predicted scale and near x, and associate the map point with the best match.
   The camera pose is finally optimized with all the map points found in the frame.

E. New Keyframe Decision

   The last step is to decide whether the current frame spawns a new keyframe. As there is a mechanism in the local mapping to cull redundant keyframes, we try to insert keyframes as fast as possible, because that makes the tracking more robust to challenging camera movements, typically rotations. To insert a new keyframe, all the following conditions must be met.
   1) More than 20 frames must have passed since the last global relocalization.
   2) Local mapping is idle, or more than 20 frames have passed since the last keyframe insertion.
   3) The current frame tracks at least 50 points.
   4) The current frame tracks fewer than 90% of the points of Kref.
   Instead of using a distance criterion to other keyframes, as PTAM does, we impose a minimum visual change (condition 4). Condition 1 ensures a good relocalization and condition 3 good tracking. If a keyframe is inserted while the local mapping is busy (second part of condition 2), a signal is sent to stop local BA so that it can process the new keyframe as soon as possible.

                        VI. LOCAL MAPPING

   In this section, we describe the steps performed by the local mapping with every new keyframe Ki.

A. Keyframe Insertion

   First, we update the covisibility graph, adding a new node for Ki and updating the edges resulting from the map points shared with other keyframes. We then update the spanning tree, linking Ki with the keyframe with the most points in common. We then compute the bags of words representation of the keyframe, which will help in the data association for triangulating new points.

B. Recent Map Points Culling

   Map points, in order to be retained in the map, must pass a restrictive test during the first three keyframes after creation, which ensures that they are trackable and not wrongly triangulated due to spurious data association. A point must fulfill these two conditions.
   1) The tracking must find the point in more than 25% of the frames in which it is predicted to be visible.
   2) If more than one keyframe has passed since map point creation, the point must be observed from at least three keyframes.
   Once a map point has passed this test, it can only be removed if at any time it is observed from fewer than three keyframes. This can happen when keyframes are culled and when local BA discards outlier observations. This policy makes our map contain very few outliers.

C. New Map Point Creation

   New map points are created by triangulating ORB features from connected keyframes Kc in the covisibility graph. For each unmatched ORB feature in Ki, we search for a match with another unmatched point in another keyframe. This matching is done as explained in Section III-E, and matches that do not fulfill the epipolar constraint are discarded. ORB pairs are triangulated, and to accept the new points, positive depth in both cameras, parallax, reprojection error, and scale consistency are checked. Initially, a map point is observed from two keyframes, but it could be
matched in others; therefore, it is projected into the rest of the connected keyframes, and correspondences are searched as detailed in Section V-D.

D. Local Bundle Adjustment

   The local BA optimizes the currently processed keyframe Ki, all the keyframes connected to it in the covisibility graph Kc, and all the map points seen by those keyframes. All other keyframes that see those points but are not connected to the currently processed keyframe are included in the optimization but remain fixed. Observations that are marked as outliers are discarded at the middle and at the end of the optimization. See the Appendix for more details about this optimization.

E. Local Keyframe Culling

   In order to maintain a compact reconstruction, the local mapping tries to detect redundant keyframes and delete them. This is beneficial as BA complexity grows with the number of keyframes, but also because it enables lifelong operation in the same environment, as the number of keyframes will not grow unbounded unless the visual content in the scene changes. We discard all the keyframes in Kc for which 90% of the map points have been seen in at least three other keyframes in the same or a finer scale. The scale condition ensures that map points keep the keyframes from which they are measured with the most accuracy. This policy was inspired by the one proposed in the work of Tan et al. [24], where keyframes were discarded after a process of change detection.

                        VII. LOOP CLOSING

   The loop closing thread takes Ki, the last keyframe processed by the local mapping, and tries to detect and close loops. The steps are described next.

A. Loop Candidates Detection

   First, we compute the similarity between the bag of words vector of Ki and all its neighbors in the covisibility graph (θmin = 30) and retain the lowest score smin. Then, we query the recognition database and discard all those keyframes whose score is lower than smin. This operation gains robustness in a way similar to the normalizing score in DBoW2, which is computed from the previous image, but here we use covisibility information. In addition, all those keyframes directly connected to Ki are discarded from the results. To accept a loop candidate, we must detect three consecutive loop candidates that are consistent (keyframes connected in the covisibility graph). There can be several loop candidates if there are several places with a similar appearance to Ki.

B. Compute the Similarity Transformation

   In monocular SLAM, there are seven DoFs in which the map can drift: three translations, three rotations, and a scale factor [6]. Therefore, to close a loop, we need to compute a similarity transformation from the current keyframe Ki to the loop keyframe Kl that informs us about the error accumulated in the loop. The computation of this similarity will also serve as a geometrical validation of the loop.

   We first compute correspondences between ORB features associated with map points in the current keyframe and the loop candidate keyframes, following the procedure explained in Section III-E. At this point, we have 3-D-to-3-D correspondences for each loop candidate. We alternately perform RANSAC iterations with each candidate, trying to find a similarity transformation using the method of Horn [42]. If we find a similarity Sil with enough inliers, we optimize it (see the Appendix) and perform a guided search of more correspondences. We optimize it again, and if Sil is supported by enough inliers, the loop with Kl is accepted.

C. Loop Fusion

   The first step in the loop correction is to fuse duplicated map points and insert new edges in the covisibility graph that will attach the loop closure. First, the current keyframe pose Tiw is corrected with the similarity transformation Sil, and this correction is propagated to all the neighbors of Ki, concatenating transformations, so that both sides of the loop get aligned. All map points seen by the loop keyframe and its neighbors are projected into Ki and its neighbors, and matches are searched in a narrow area around the projection, as done in Section V-D. All those map points matched, and those that were inliers in the computation of Sil, are fused. All keyframes involved in the fusion update their edges in the covisibility graph, effectively creating edges that attach the loop closure.

D. Essential Graph Optimization

   To effectively close the loop, we perform a pose graph optimization over the Essential Graph, described in Section III-D, that distributes the loop closing error along the graph. The optimization is performed over similarity transformations to correct the scale drift [6]. The error terms and cost function are detailed in the Appendix. After the optimization, each map point is transformed according to the correction of one of the keyframes that observes it.

                        VIII. EXPERIMENTS

   We have performed an extensive experimental validation of our system in the large robot sequence of NewCollege [39], evaluating the general performance of the system; in 16 handheld indoor sequences of the TUM RGB-D benchmark [38], evaluating the localization accuracy, relocalization, and lifelong capabilities; and in 10 car outdoor sequences from the KITTI dataset [40], evaluating real-time large-scale operation, localization accuracy, and efficiency of the pose graph optimization.

   Our system runs in real time and processes the images exactly at the frame rate at which they were acquired. We have carried out all experiments on an Intel Core i7-4700MQ (four cores @ 2.40 GHz) with 8 GB RAM. ORB-SLAM has three main threads that run in parallel with other tasks from ROS and the operating system, which introduces some randomness in the results. For this reason, in some experiments, we report the median over several runs.
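As a concrete illustration of the candidate filtering in Section VII-A, here is a minimal sketch. The data structures (score dictionaries, keyframe ids) are illustrative assumptions, and the additional requirement of three consecutive consistent detections is omitted.

```python
# Sketch of Section VII-A candidate filtering: keep database matches whose
# bag-of-words score reaches s_min (the lowest score among covisibility
# neighbors) and drop keyframes directly connected to the query keyframe.

def loop_candidates(db_scores, neighbor_scores, connected):
    """Return candidate keyframe ids passing the s_min and connectivity tests."""
    if not neighbor_scores:       # no covisible neighbors: no reference score
        return []
    s_min = min(neighbor_scores)
    return [kf for kf, score in db_scores.items()
            if score >= s_min and kf not in connected]
```

In the full method, a filtered candidate is only accepted after consistent detections in three consecutive keyframes, which this sketch does not model.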
Fig. 5. Map before and after a loop closure in the NewCollege sequence. The loop closure match is drawn in blue, the trajectory in green, and the local map for the tracking at that moment in red. The local map is extended along both sides of the loop after it is closed.

A. System Performance in the NewCollege Dataset

   The NewCollege dataset [39] contains a 2.2-km sequence from a robot traversing a campus and adjacent parks. The sequence is recorded by a stereo camera at 20 frames/s and a resolution of 512 × 382. It contains several loops and fast rotations that make the sequence quite challenging for monocular vision. To the best of our knowledge, there is no other monocular system in the literature able to process this whole sequence. For example, Strasdat et al. [7], despite being able to close loops and work in large-scale environments, only showed monocular results for a small part of this sequence.

   As an example of our loop closing procedure, we show in Fig. 4 the detection of a loop with the inliers that support the similarity transformation. Fig. 5 shows the reconstruction before and after the loop closure. In red, the local map is shown, which after the loop closure extends along both sides of the loop closure. The whole map after processing the full sequence at its real frame rate is shown in Fig. 6. The big loop on the right does not perfectly align because it was traversed in opposite directions, and the place recognizer was not able to find loop closures.

                              TABLE I
              TRACKING AND MAPPING TIMES IN NEWCOLLEGE

   We have extracted statistics of the times spent by each thread in this experiment. Table I shows the results for the tracking and the local mapping. Tracking works at frame rates around 25–30 Hz, the most demanding task being tracking the local map. If needed, this time could be reduced by limiting the number of keyframes included in the local map. In the local mapping thread, the most demanding task is local BA. The local BA time varies depending on whether the robot is exploring or in a well-mapped area, because during exploration BA is interrupted if tracking inserts a new keyframe, as explained in Section V-E. When new keyframes are not needed, local BA performs a generous number of prefixed iterations.

   Table II shows the results for each of the six loop closures found. It can be seen that the loop detection time increases sublinearly with the number of keyframes. This is due to the efficient querying of the database, which only compares the subset of images with words in common, demonstrating the potential of bags of words for place recognition. Our Essential Graph includes around five times as many edges as keyframes, which is a quite sparse graph.
                              TABLE II
                 LOOP CLOSING TIMES IN NEWCOLLEGE

Loop   KeyFrames   Essential    Candidates       Similarity            Fusion   Essential Graph      Total
                   Graph Edges  Detection (ms)   Transformation (ms)   (s)      Optimization (s)     (s)
1        287         1347         4.71             20.77               0.20       0.26               0.51
2       1082         5950         4.14             17.98               0.39       1.06               1.52
3       1279         7128         9.82             31.29               0.95       1.26               2.27
4       2648        12547        12.37             30.36               0.97       2.30               3.33
5       3150        16033        14.71             41.28               1.73       2.80               4.60
6       4496        21797        13.52             48.68               0.97       3.62               4.69
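The sparsity observation, that the Essential Graph holds around five times as many edges as keyframes, can be checked directly against Table II. This snippet merely restates the table's numbers:

```python
# (keyframes, essential-graph edges) for the six loop closures of Table II
table_ii = [(287, 1347), (1082, 5950), (1279, 7128),
            (2648, 12547), (3150, 16033), (4496, 21797)]

ratios = [edges / kfs for kfs, edges in table_ii]
# every closure stays close to five edges per keyframe
assert all(4.5 < r < 5.6 for r in ratios)
```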
Fig. 7. Relocalization experiment in fr2_xyz. Map is initially created during          Fig. 8. Example of challenging relocalizations (severe scale change, dynamic
the first 30 s of the sequence (KFs). The goal is to relocalize subsequent frames.     objects) that our system successfully found in the relocalization experiments.
Successful relocalizations (R) of our system and PTAM are shown. The ground
truth (GT) is only shown for the frames to relocalize.
                               TABLE IV
              RESULTS FOR THE RELOCALIZATION EXPERIMENTS
 System         KFs      RMSE (cm)      Recall (%)    RMSE (cm)      Max. Error (cm)
                           fr2_xyz. 2769 frames to relocalize
 PTAM            37          0.19          34.9           0.26            1.52
 ORB-SLAM        24          0.19          78.4           0.38            1.67
                                                                                       Fig. 9. Lifelong experiment in a static environment where the camera is
                       fr3_walking_xyz. 859 frames to relocalize
                                                                                       always looking at the same place from different viewpoints. PTAM is always
 PTAM            34          0.83           0.0            –                –          inserting keyframes, while ORB-SLAM is able to prune redundant keyframes
 ORB-SLAM        31          0.82          77.9           1.32            4.95         and maintains a bounded-size map.
of the recovered poses. We perform the same experiment with PTAM for comparison. Fig. 7 shows the keyframes used to create the initial map, the poses of the relocalized frames, and the ground truth for those frames. It can be seen that PTAM is only able to relocalize frames that are near the keyframes, due to the limited invariance of its relocalization method. Table IV shows the recall and the error with respect to the ground truth. ORB-SLAM accurately relocalizes more than twice as many frames as PTAM. In the second experiment, we create an initial map with sequence fr3_sitting_xyz and try to relocalize all frames from fr3_walking_xyz. This is a challenging experiment, as there are large occlusions due to people moving in the scene. Here, PTAM finds no relocalizations, while our system relocalizes 78% of the frames, as can be seen in Table IV. Fig. 8 shows some examples of challenging relocalizations performed by our system in these experiments.

D. Lifelong Experiment in the TUM RGB-D Benchmark

   Previous relocalization experiments have shown that our system is able to localize in a map from very different viewpoints and robustly under moderate dynamic changes. This property, in conjunction with our keyframe culling procedure, allows us to operate lifelong in the same environment under different viewpoints and some dynamic changes.
   In the case of a completely static scenario, our system is able to keep the number of keyframes bounded even if the camera is looking at the scene from different viewpoints. We demonstrate this in a custom sequence where the camera looks at the same desk for 93 s while performing a trajectory so that the viewpoint is always changing. We compare the evolution of the number of keyframes in our map with those generated by PTAM in Fig. 9. It can be seen that PTAM is always inserting keyframes, while our mechanism to prune redundant keyframes makes their number saturate.
   While lifelong operation in a static scenario should be a requirement of any SLAM system, more interesting is the case where dynamic changes occur. We analyze the behavior of our system in such a scenario by running consecutively the dynamic sequences from fr3: sitting_xyz, sitting_halfsphere, sitting_rpy, walking_xyz, walking_halfsphere, and walking_rpy. All the sequences focus the camera on the same desk but perform different trajectories, while people are moving and change some objects like chairs. Fig. 10(a) shows the evolution of the
Authorized licensed use limited to: ULAKBIM UASL - MARMARA UNIVERSITY. Downloaded on May 15,2020 at 19:34:33 UTC from IEEE Xplore. Restrictions apply.
1158                                                                                   IEEE TRANSACTIONS ON ROBOTICS, VOL. 31, NO. 5, OCTOBER 2015
total number of keyframes in the map, and Fig. 10(b) shows for
each keyframe its frame of creation and destruction, showing
how long the keyframes have survived in the map. It can be seen
that during the first two sequences, the map size grows as all the
views of the scene are being seen for the first time. In Fig. 10(b),
we can see that several keyframes created during these two
first sequences are maintained in the map during the whole
experiment. During the sequences sitting_rpy and walking_xyz, the map does not grow, because the map created so far explains the scene well. In contrast, during the last two sequences, more keyframes are inserted, showing that there are some novelties in the scene that were not yet represented, probably due to dynamic changes. Finally, Fig. 10(c) shows a histogram of the keyframes according to the time they have survived, relative to the remaining time of the sequence from their moment of creation. It can be seen that most of the keyframes are destroyed
by the culling procedure soon after creation, and only a small subset survives until the end of the experiment. On the one hand,
this shows that our system has a generous keyframe spawning
policy, which is very useful when performing abrupt motions
in exploration. On the other hand, the system is eventually able
to select a small representative subset of those keyframes.
   In these lifelong experiments, we have shown that our map grows with the content of the scene but not with time, and is able to store the dynamic changes of the scene, which could be useful to perform some scene understanding by accumulating experience in an environment.
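The pruning behavior analyzed above can be made concrete. In ORB-SLAM (Section VI of the paper), a keyframe is discarded when 90% of its map points are seen by at least three other keyframes at the same or finer scale; the sketch below implements a scale-agnostic simplification of that rule, with illustrative data structures of our choosing:

```python
def redundant_keyframes(observations, min_observers=3, ratio=0.9):
    """observations: dict keyframe_id -> set of map-point ids it observes.

    A keyframe is deemed redundant if at least `ratio` of its map points
    are also observed by >= min_observers other keyframes.  (The real
    system additionally requires the same or a finer scale; that check
    is omitted here.)"""
    # Count, for each map point, how many keyframes observe it.
    seen_by = {}
    for kf, pts in observations.items():
        for p in pts:
            seen_by[p] = seen_by.get(p, 0) + 1
    redundant = []
    for kf, pts in observations.items():
        if not pts:
            continue
        # Points also covered by enough *other* keyframes.
        well_covered = sum(1 for p in pts if seen_by[p] - 1 >= min_observers)
        if well_covered / len(pts) >= ratio:
            redundant.append(kf)
    return redundant
```

With four keyframes all watching the same points, every one of them is flagged as redundant, while a keyframe holding unique observations is kept, which is the saturation behavior seen in Fig. 9.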
Fig. 11. Sequences 00, 05, and 07 from the odometry benchmark of the KITTI dataset. (Left) Points and keyframe trajectory. (Center) Trajectory and ground truth. (Right) Trajectory after 20 iterations of full BA. The output of our system is quite accurate, and it can be slightly improved with some iterations of BA.
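Fig. 11 compares estimated trajectories against ground truth; as noted in the conclusions, monocular estimates must first be aligned with the ground truth, including scale. A minimal sketch of such an alignment using the closed-form Umeyama similarity method (the function names, and the choice of Umeyama's method, are our illustration; the paper does not specify its alignment code):

```python
import numpy as np

def align_similarity(est, gt):
    """Closed-form similarity (scale, rotation, translation) aligning
    estimated positions `est` (N x 3) to ground truth `gt` (N x 3),
    following Umeyama's method.  Returns the aligned estimate."""
    mu_e, mu_g = est.mean(0), gt.mean(0)
    E, G = est - mu_e, gt - mu_g
    # Cross-covariance between ground truth and estimate.
    U, D, Vt = np.linalg.svd(G.T @ E / len(est))
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[2, 2] = -1.0           # avoid a reflection
    R = U @ S @ Vt
    s = np.trace(np.diag(D) @ S) / (E ** 2).mean(0).sum()
    t = mu_g - s * R @ mu_e
    return (s * (R @ est.T)).T + t

def ate_rmse(est, gt):
    """Absolute trajectory RMSE after similarity alignment."""
    aligned = align_similarity(est, gt)
    return float(np.sqrt(((aligned - gt) ** 2).sum(1).mean()))
```

Applying an arbitrary similarity transform to a ground-truth trajectory and feeding it back through `ate_rmse` should recover an error of essentially zero, since the alignment absorbs the unknown scale, rotation, and translation.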
   Finally, we wanted to show the efficacy of our loop closing approach and the influence of the θ_min threshold used to include edges in the essential graph. We selected sequence 09 (a very long sequence with a loop closure at the end) and, in the same execution, evaluated different loop closing strategies. Table VI shows the keyframe trajectory RMSE and the time spent in the optimization in different cases: without loop closing; directly applying a full BA (20 or 100 iterations); applying only pose graph optimization (ten iterations with different numbers of edges); and applying pose graph optimization followed by full BA. Fig. 13 shows the output trajectory of the different methods. The results clearly show that before loop closure, the solution is so far from the optimum that BA has convergence problems. Even after 100 iterations, the error is still very high. On the other hand, essential graph optimization shows fast convergence and more accurate results. It can be seen that the choice of θ_min has no significant effect on accuracy, but decreasing the number of edges can significantly reduce the time. Performing an additional BA after the pose graph optimization slightly improves the accuracy while substantially increasing the time.

                      IX. CONCLUSION AND DISCUSSION

A. Conclusions

   In this study, we have presented a new monocular SLAM system with a detailed description of its building blocks and an exhaustive evaluation in public datasets. Our system has demonstrated that it can process sequences from indoor and outdoor scenes and from car, robot, and hand-held motions. The accuracy of the system is typically below 1 cm in small indoor scenarios and a few meters in large outdoor scenarios (once we have aligned the scale with the ground truth).
   Currently, PTAM by Klein and Murray [4] is considered the most accurate SLAM method from monocular video in real time. It is no coincidence that the back-end of PTAM is BA, which is well known to be the gold standard method for the offline Structure from Motion problem [2]. One of the main successes of PTAM, and of the earlier work of Mouragnon et al. [3], was to bring that knowledge into the robotics SLAM community and demonstrate its real-time performance. The main contribution of our work is to expand the versatility of PTAM to environments that are intractable for that system. To achieve this, we have designed
Fig. 12. ORB-SLAM keyframe trajectories in sequences 02, 03, 04, 06, 08, 09, and 10 from the odometry benchmark of the KITTI dataset. Sequence 08 does not contain loops, and drift (especially in scale) is not corrected. (a) Sequence 02. (b) Sequence 03. (c) Sequence 04. (d) Sequence 06. (e) Sequence 08. (f) Sequence 09. (g) Sequence 10.
from scratch a new monocular SLAM system with some new ideas and algorithms, but also incorporating excellent works developed in the past few years, such as the loop detection of Gálvez-López and Tardós [5], the loop closing procedure and covisibility graph of Strasdat et al. [6], [7], the optimization framework g2o by Kuemmerle et al. [37], and ORB features by Rublee et al. [9]. To the best of our knowledge, no other system has been demonstrated to work in as many different scenarios and with such accuracy. Therefore, our system is currently the most reliable and complete solution for monocular SLAM. Our novel policy to spawn and cull keyframes permits creating keyframes every few frames, which are eventually removed when considered redundant. This flexible map expansion is really useful in poorly conditioned exploration trajectories, i.e., close to pure

                                  TABLE V
                 RESULTS OF OUR SYSTEM IN THE KITTI DATASET

                                           ORB-SLAM     + Global BA (20 its.)
 Sequence    Dimension (m×m)     KFs     RMSE (m)     RMSE (m)     Time BA (s)
 KITTI 00       564 × 496       1391       6.68         5.33          24.83
 KITTI 01      1157 × 1827         X          X            X              X
 KITTI 02       599 × 946       1801      21.75        21.28          30.07
 KITTI 03       471 × 199        250       1.59         1.51           4.88
 KITTI 04       0.5 × 394        108       1.79         1.62           1.58
 KITTI 05       479 × 426        820       8.23         4.85          15.20
 KITTI 06        23 × 457        373      14.68        12.34           7.78
 KITTI 07       191 × 209        351       3.36         2.26           6.28
 KITTI 08       808 × 391       1473      46.58        46.68          25.60
 KITTI 09       465 × 568        653       7.62         6.62          11.33
 KITTI 10       671 × 177        411       8.68         8.80           7.64
                             TABLE VI
          COMPARISON OF LOOP CLOSING STRATEGIES IN KITTI 09

which a dense and accurate map of the scene can be built. A first effort in this line is presented in [47].
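The contrast between full BA and pose graph optimization reported in Table VI has a simple structural explanation: the pose graph problem is small, and fixing one keyframe removes the gauge freedom. A deliberately simplified 1-D linear analogue (real pose graphs optimize over Sim(3); this toy of ours only illustrates how a loop closure constraint redistributes accumulated drift once the gauge is fixed):

```python
import numpy as np

def pose_graph_1d(odometry, loop_closure):
    """Least-squares 1-D pose graph: poses x_0..x_n with x_0 fixed at 0
    (removing the gauge freedom), odometry constraints
    x_{i+1} - x_i = odometry[i], and one loop constraint
    x_n - x_0 = loop_closure.  Unit information matrices throughout."""
    n = len(odometry)
    A, b = [], []
    for i, u in enumerate(odometry):      # odometry edges
        row = np.zeros(n)                 # unknowns are x_1..x_n
        if i > 0:
            row[i - 1] = -1.0
        row[i] = 1.0
        A.append(row)
        b.append(u)
    row = np.zeros(n)                     # loop closure edge x_n - x_0
    row[n - 1] = 1.0
    A.append(row)
    b.append(loop_closure)
    x, *_ = np.linalg.lstsq(np.array(A), np.array(b), rcond=None)
    return np.concatenate(([0.0], x))
```

For four unit odometry steps around a loop that actually closes (x_4 = x_0), the least-squares solution spreads the 4-unit drift evenly over all edges, which is the qualitative behavior of the essential graph optimization above.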
      where Λ_{i,j} is the information matrix of the edge, which, as in [48], we set to the identity. We fix the loop closure keyframe to fix the 7 degrees of gauge freedom. Although this method is a rough approximation of a full BA, we demonstrate experimentally in Section VIII-E that it has significantly faster and better convergence than BA.
   3) Relative Sim(3) Optimization: Given a set of n matches i ⇒ j (keypoints and their associated 3-D map points) between keyframe 1 and keyframe 2, we want to optimize the relative Sim(3) transformation S_{12} (see Section VII-B) that minimizes the reprojection error in both images as

         e_1 = x_{1,i} − π_1(S_{12}, X_{2,j})
         e_2 = x_{2,j} − π_2(S_{12}^{−1}, X_{1,i})                         (10)

      and the cost function to minimize is

         C = Σ_n [ ρ_h(e_1^T Ω_{1,i}^{−1} e_1) + ρ_h(e_2^T Ω_{2,j}^{−1} e_2) ]   (11)

      where Ω_{1,i} and Ω_{2,j} are the covariance matrices associated with the scale at which the keypoints in images 1 and 2 were detected. In this optimization, the points are fixed.

                                   REFERENCES

 [1] B. Triggs, P. F. McLauchlan, R. I. Hartley, and A. W. Fitzgibbon, "Bundle adjustment: A modern synthesis," in Vision Algorithms: Theory and Practice. New York, NY, USA: Springer, 2000, pp. 298–372.
 [2] R. Hartley and A. Zisserman, Multiple View Geometry in Computer Vision, 2nd ed. Cambridge, U.K.: Cambridge Univ. Press, 2004.
 [3] E. Mouragnon, M. Lhuillier, M. Dhome, F. Dekeyser, and P. Sayd, "Real time localization and 3D reconstruction," in Proc. IEEE Comput. Soc. Conf. Comput. Vision Pattern Recog., 2006, vol. 1, pp. 363–370.
 [4] G. Klein and D. Murray, "Parallel tracking and mapping for small AR workspaces," in Proc. IEEE ACM Int. Symp. Mixed Augmented Reality, Nara, Japan, Nov. 2007, pp. 225–234.
 [5] D. Gálvez-López and J. D. Tardós, "Bags of binary words for fast place recognition in image sequences," IEEE Trans. Robot., vol. 28, no. 5, pp. 1188–1197, Oct. 2012.
 [6] H. Strasdat, J. M. M. Montiel, and A. J. Davison, "Scale drift-aware large scale monocular SLAM," presented at the Proc. Robot.: Sci. Syst., Zaragoza, Spain, Jun. 2010.
 [7] H. Strasdat, A. J. Davison, J. M. M. Montiel, and K. Konolige, "Double window optimisation for constant time visual SLAM," in Proc. IEEE Int. Conf. Comput. Vision, Barcelona, Spain, Nov. 2011, pp. 2352–2359.
 [8] C. Mei, G. Sibley, and P. Newman, "Closing loops without places," in Proc. IEEE/RSJ Int. Conf. Intell. Robots Syst., Taipei, Taiwan, Oct. 2010, pp. 3738–3744.
 [9] E. Rublee, V. Rabaud, K. Konolige, and G. Bradski, "ORB: An efficient alternative to SIFT or SURF," in Proc. IEEE Int. Conf. Comput. Vision, Barcelona, Spain, Nov. 2011, pp. 2564–2571.
[10] J. Engel, T. Schöps, and D. Cremers, "LSD-SLAM: Large-scale direct monocular SLAM," in Proc. Eur. Conf. Comput. Vision, Zurich, Switzerland, Sep. 2014, pp. 834–849.
[11] R. Mur-Artal and J. D. Tardós, "Fast relocalisation and loop closing in keyframe-based SLAM," in Proc. IEEE Int. Conf. Robot. Autom., Hong Kong, Jun. 2014, pp. 846–853.
[12] R. Mur-Artal and J. D. Tardós, "ORB-SLAM: Tracking and mapping recognizable features," presented at the MVIGRO Workshop Robot. Sci. Syst., Berkeley, CA, USA, Jul. 2014.
[13] B. Williams, M. Cummins, J. Neira, P. Newman, I. Reid, and J. D. Tardós, "A comparison of loop closing techniques in monocular SLAM," Robot. Auton. Syst., vol. 57, no. 12, pp. 1188–1197, 2009.
[14] D. Nistér and H. Stewénius, "Scalable recognition with a vocabulary tree," in Proc. IEEE Comput. Soc. Conf. Comput. Vision Pattern Recog., New York, NY, USA, Jun. 2006, vol. 2, pp. 2161–2168.
[15] M. Cummins and P. Newman, "Appearance-only SLAM at large scale with FAB-MAP 2.0," Int. J. Robot. Res., vol. 30, no. 9, pp. 1100–1123, 2011.
[16] M. Calonder, V. Lepetit, C. Strecha, and P. Fua, "BRIEF: Binary robust independent elementary features," in Proc. Eur. Conf. Comput. Vision, Hersonissos, Greece, Sep. 2010, pp. 778–792.
[17] E. Rosten and T. Drummond, "Machine learning for high-speed corner detection," in Proc. Eur. Conf. Comput. Vision, Graz, Austria, May 2006, pp. 430–443.
[18] H. Bay, T. Tuytelaars, and L. Van Gool, "SURF: Speeded up robust features," in Proc. Eur. Conf. Comput. Vision, Graz, Austria, May 2006, pp. 404–417.
[19] D. G. Lowe, "Distinctive image features from scale-invariant keypoints," Int. J. Comput. Vision, vol. 60, no. 2, pp. 91–110, 2004.
[20] A. J. Davison, I. D. Reid, N. D. Molton, and O. Stasse, "MonoSLAM: Real-time single camera SLAM," IEEE Trans. Pattern Anal. Mach. Intell., vol. 29, no. 6, pp. 1052–1067, Jun. 2007.
[21] J. Civera, A. J. Davison, and J. M. M. Montiel, "Inverse depth parametrization for monocular SLAM," IEEE Trans. Robot., vol. 24, no. 5, pp. 932–945, Oct. 2008.
[22] C. Forster, M. Pizzoli, and D. Scaramuzza, "SVO: Fast semi-direct monocular visual odometry," in Proc. IEEE Int. Conf. Robot. Autom., Hong Kong, Jun. 2014, pp. 15–22.
[23] O. D. Faugeras and F. Lustman, "Motion and structure from motion in a piecewise planar environment," Int. J. Pattern Recog. Artif. Intell., vol. 2, no. 3, pp. 485–508, 1988.
[24] W. Tan, H. Liu, Z. Dong, G. Zhang, and H. Bao, "Robust monocular SLAM in dynamic environments," in Proc. IEEE Int. Symp. Mixed Augmented Reality, Adelaide, Australia, Oct. 2013, pp. 209–218.
[25] H. Lim, J. Lim, and H. J. Kim, "Real-time 6-DOF monocular visual SLAM in a large-scale environment," in Proc. IEEE Int. Conf. Robot. Autom., Hong Kong, Jun. 2014, pp. 1532–1539.
[26] D. Nistér, "An efficient solution to the five-point relative pose problem," IEEE Trans. Pattern Anal. Mach. Intell., vol. 26, no. 6, pp. 756–770, Jun. 2004.
[27] H. Longuet-Higgins, "The reconstruction of a plane surface from two perspective projections," Proc. Roy. Soc. London Ser. B, Biol. Sci., vol. 227, no. 1249, pp. 399–410, 1986.
[28] P. H. Torr, A. W. Fitzgibbon, and A. Zisserman, "The problem of degeneracy in structure and motion recovery from uncalibrated image sequences," Int. J. Comput. Vision, vol. 32, no. 1, pp. 27–44, 1999.
[29] A. Chiuso, P. Favaro, H. Jin, and S. Soatto, "Structure from motion causally integrated over time," IEEE Trans. Pattern Anal. Mach. Intell., vol. 24, no. 4, pp. 523–535, Apr. 2002.
[30] E. Eade and T. Drummond, "Scalable monocular SLAM," in Proc. IEEE Comput. Soc. Conf. Comput. Vision Pattern Recog., New York, NY, USA, Jun. 2006, vol. 1, pp. 469–476.
[31] H. Strasdat, J. M. M. Montiel, and A. J. Davison, "Visual SLAM: Why filter?" Image Vision Comput., vol. 30, no. 2, pp. 65–77, 2012.
[32] G. Klein and D. Murray, "Improving the agility of keyframe-based SLAM," in Proc. Eur. Conf. Comput. Vision, Marseille, France, Oct. 2008, pp. 802–815.
[33] K. Pirker, M. Ruther, and H. Bischof, "CD SLAM: Continuous localization and mapping in a dynamic world," in Proc. IEEE/RSJ Int. Conf. Intell. Robots Syst., San Francisco, CA, USA, Sep. 2011, pp. 3990–3997.
[34] S. Song, M. Chandraker, and C. C. Guest, "Parallel, real-time monocular visual odometry," in Proc. IEEE Int. Conf. Robot. Autom., 2013, pp. 4698–4705.
[35] P. F. Alcantarilla, J. Nuevo, and A. Bartoli, "Fast explicit diffusion for accelerated features in nonlinear scale spaces," presented at the Brit. Mach. Vision Conf., Bristol, U.K., 2013.
[36] X. Yang and K.-T. Cheng, "LDB: An ultra-fast feature for scalable augmented reality on mobile devices," in Proc. IEEE Int. Symp. Mixed Augmented Reality, 2012, pp. 49–57.
[37] R. Kuemmerle, G. Grisetti, H. Strasdat, K. Konolige, and W. Burgard, "g2o: A general framework for graph optimization," in Proc. IEEE Int. Conf. Robot. Autom., Shanghai, China, May 2011, pp. 3607–3613.
[38] J. Sturm, N. Engelhard, F. Endres, W. Burgard, and D. Cremers, "A benchmark for the evaluation of RGB-D SLAM systems," in Proc. IEEE/RSJ Int. Conf. Intell. Robots Syst., Vilamoura, Portugal, Oct. 2012, pp. 573–580.
[39] M. Smith, I. Baldwin, W. Churchill, R. Paul, and P. Newman, "The new college vision and laser data set," Int. J. Robot. Res., vol. 28, no. 5, pp. 595–599, 2009.
[40] A. Geiger, P. Lenz, C. Stiller, and R. Urtasun, "Vision meets robotics: The KITTI dataset," Int. J. Robot. Res., vol. 32, no. 11, pp. 1231–1237, 2013.
[41] V. Lepetit, F. Moreno-Noguer, and P. Fua, "EPnP: An accurate O(n) solution to the PnP problem," Int. J. Comput. Vision, vol. 81, no. 2, pp. 155–166, 2009.
[42] B. K. P. Horn, "Closed-form solution of absolute orientation using unit quaternions," J. Opt. Soc. Amer. A, vol. 4, no. 4, pp. 629–642, 1987.
[43] F. Endres, J. Hess, J. Sturm, D. Cremers, and W. Burgard, "3-D mapping with an RGB-D camera," IEEE Trans. Robot., vol. 30, no. 1, pp. 177–187, Feb. 2014.
[44] R. A. Newcombe, S. J. Lovegrove, and A. J. Davison, "DTAM: Dense tracking and mapping in real-time," in Proc. IEEE Int. Conf. Comput. Vision, Barcelona, Spain, Nov. 2011, pp. 2320–2327.
[45] S. Lovegrove, A. J. Davison, and J. Ibanez-Guzmán, "Accurate visual odometry from a rear parking camera," in Proc. IEEE Intell. Vehicles Symp., 2011, pp. 788–793.
[46] P. H. Torr and A. Zisserman, "Feature based methods for structure and motion estimation," in Vision Algorithms: Theory and Practice. New York, NY, USA: Springer, 2000, pp. 278–294.
[47] R. Mur-Artal and J. D. Tardós, "Probabilistic semi-dense mapping from highly accurate feature-based monocular SLAM," presented at the Proc. Robot.: Sci. Syst., Rome, Italy, Jul. 2015.
[48] H. Strasdat, "Local accuracy and global consistency for efficient visual SLAM," Ph.D. dissertation, Imperial College London, London, U.K., Oct. 2012.

J. M. M. Montiel (M'15) was born in Arnedo, Spain, in 1967. He received the M.S. and Ph.D. degrees in electrical engineering from Universidad de Zaragoza, Zaragoza, Spain, in 1992 and 1996, respectively.
   He is currently a Full Professor with the Departamento de Informática e Ingeniería de Sistemas, Universidad de Zaragoza, where he is in charge of perception and computer vision research grants and courses. His interests include real-time vision localization and semantic mapping for rigid and nonrigid environments, and the transference of this technology to robotic and nonrobotic application domains.
   Dr. Montiel is a Member of the I3A Robotics, Perception, and Real-Time Group, Universidad de Zaragoza. He has been awarded several Spanish MEC grants to fund research with the University of Oxford, U.K., and with Imperial College London, U.K.
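The error terms of Eq. (10) and the robust cost of Eq. (11) can be sketched numerically. In this illustration of ours, a Sim(3) is represented as a triple (s, R, t), both cameras use an ideal unit-focal-length pinhole projection, the covariances Ω are identities, and ρ_h is the Huber kernel; all of these simplifications are assumptions, not the paper's implementation:

```python
import numpy as np

def sim3_apply(s, R, t, X):
    """Apply a Sim(3) (scale s, rotation R, translation t) to a 3-D point."""
    return s * R @ X + t

def sim3_inv(s, R, t):
    """Inverse of a Sim(3): X = (1/s) R^T (Y - t)."""
    return 1.0 / s, R.T, -(1.0 / s) * (R.T @ t)

def project(X):
    """Ideal pinhole projection (unit focal length, zero principal point)."""
    return X[:2] / X[2]

def huber(e2, delta=1.0):
    """Huber kernel rho_h applied to a squared error e2."""
    return e2 if e2 <= delta ** 2 else 2 * delta * np.sqrt(e2) - delta ** 2

def sim3_cost(s, R, t, matches):
    """matches: list of (x1, X1, x2, X2) with image points x (2,) and
    3-D points X (3,) in each keyframe.  Evaluates Eqs. (10)-(11)
    with identity covariances."""
    si, Ri, ti = sim3_inv(s, R, t)
    cost = 0.0
    for x1, X1, x2, X2 in matches:
        e1 = x1 - project(sim3_apply(s, R, t, X2))     # Eq. (10), e_1
        e2 = x2 - project(sim3_apply(si, Ri, ti, X1))  # Eq. (10), e_2
        cost += huber(e1 @ e1) + huber(e2 @ e2)        # Eq. (11)
    return cost
```

For a match generated exactly by the true S_12, both residuals vanish and the cost is zero; an optimizer such as g2o would minimize this cost over S_12 with the map points held fixed, as stated above.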