Towards a benchmark for RGB-D SLAM evaluation
Jürgen Sturm1, Stéphane Magnenat2, Nikolas Engelhard3, François Pomerleau2,
Francis Colas2, Daniel Cremers1, Roland Siegwart2, Wolfram Burgard3
Abstract— We provide a large dataset containing RGB-D image sequences and the corresponding ground-truth camera trajectories, with the goal of establishing a benchmark for the evaluation of visual SLAM systems. Our dataset contains the color and depth images of a Microsoft Kinect sensor together with the ground-truth trajectory of camera poses. The data was recorded at full frame rate (30 Hz) and sensor resolution (640×480). The ground-truth trajectory was obtained from a high-accuracy motion-capture system with eight high-speed tracking cameras (100 Hz). Further, we provide the accelerometer data from the Kinect. Finally, we propose an evaluation criterion for measuring the quality of the estimated camera trajectory of visual SLAM systems.
I. INTRODUCTION

Simultaneous localization and mapping (SLAM) has a long history in robotics and computer-vision research [11], [6], [1], [15], [7], [4]. Different sensor modalities have been explored in the past, including 2D laser scanners [12], [3], 3D scanners [14], [16], monocular cameras [13], [7], [9], [19], [20] and stereo systems [8]. Recently, low-cost RGB-D sensors have become available, of which the most prominent one is the Microsoft Kinect. Such sensors provide both color images and dense depth maps at video frame rates. Henry et al. [5] were the first to use the Kinect sensor in a 3D SLAM system. Others have followed [2], and we expect to see more approaches using RGB-D data for visual SLAM in the near future.

Various datasets and benchmarks have been proposed for laser- and camera-based SLAM, such as the Freiburg, Intel and New College datasets [18], [17]. However, until now, no suitable dataset or benchmark existed that could be used to evaluate, measure, and compare the performance of RGB-D SLAM systems. As we consider objective evaluation methods to be highly important for measuring progress in the field (and for demonstrating this progress in a verifiable way), we decided to provide such a dataset. To the best of our knowledge, this is the first RGB-D dataset for visual SLAM benchmarking.
1 Jürgen Sturm and Daniel Cremers are with the Computer Vision and Pattern Recognition Group, Computer Science Department, Technical University of Munich, Germany. {sturmju,cremers}@in.tum.de
2 S. Magnenat, F. Pomerleau, F. Colas and R. Siegwart are with the Autonomous Systems Lab, ETH Zurich, Switzerland. {stephane.magnenat,francis.colas}@mavt.ethz.ch and f.pomerleau@gmail.com
3 Nikolas Engelhard and Wolfram Burgard are with the Autonomous Intelligent Systems Lab, Computer Science Department, University of Freiburg, Germany. {engelhar,burgard}@informatik.uni-freiburg.de
[Fig. 1: The office environment and the experimental setup in which the RGB-D dataset with ground-truth camera poses was recorded. (a) Typical office scene. (b) Motion capture system. (c) Microsoft Kinect sensor with reflective markers. (d) Checkerboard with reflective markers used for calibration.]

II. EXPERIMENTAL SETUP AND DATA ACQUISITION

We acquired a large set of data recordings containing both the RGB-D data from the Kinect and the ground-truth estimates from the mocap system. We moved the Kinect along different trajectories in typical office environments (see Fig. 1a). The recordings differ in their translational and angular velocities (fast/slow movements) and in the size of the environment (one desk, several desks, whole room). We also acquired data for three specific trajectories for debugging purposes, i.e., we moved the Kinect (more or less) individually along the x/y/z-axes and rotated it individually around the x/y/z-axes.

We captured both the color and depth images from an off-the-shelf Microsoft Kinect sensor using PrimeSense's OpenNI driver. All data was logged at the full resolution (640×480) and full frame rate (30 Hz) of the sensor on a Linux laptop running Ubuntu 10.10 and ROS Diamondback. Further, we recorded IMU data from the accelerometer in the Kinect at 500 Hz and also read out the internal sensor parameters from the Kinect factory calibration.

In addition, we obtained the camera trajectory by using an external motion capture system from MotionAnalysis running at 100 Hz (see Fig. 1b). We attached reflective targets to the Kinect (see Fig. 1c) and used a modified checkerboard for calibration (Fig. 1d) to obtain the transformation between the optical frame of the Kinect sensor and the coordinate system of the motion capture system. Finally, we also video-taped all recordings with an external video camera to capture the camera motion and the environment from a different viewpoint.
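The paper does not spell out how this calibrated transformation is applied; conceptually, the ground-truth pose of the camera follows by chaining the marker-body pose reported by the mocap system with the fixed transform from the marker body to the camera's optical frame. The following Python sketch illustrates that composition with 4×4 homogeneous matrices; all names and numerical values are hypothetical and serve only as an example.

import numpy as np

def pose_to_matrix(t, q):
    """Convert a translation t = (tx, ty, tz) and a unit quaternion
    q = (qx, qy, qz, qw) into a 4x4 homogeneous transformation."""
    qx, qy, qz, qw = q
    R = np.array([
        [1 - 2*(qy*qy + qz*qz), 2*(qx*qy - qz*qw),     2*(qx*qz + qy*qw)],
        [2*(qx*qy + qz*qw),     1 - 2*(qx*qx + qz*qz), 2*(qy*qz - qx*qw)],
        [2*(qx*qz - qy*qw),     2*(qy*qz + qx*qw),     1 - 2*(qx*qx + qy*qy)],
    ])
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = t
    return T

# T_world_markers: pose of the reflective marker body as reported by the
# mocap system. T_markers_optical: fixed transform from the marker body to
# the camera optical frame, as obtained from the checkerboard calibration.
# Both values below are made up for illustration, not taken from the dataset.
T_world_markers = pose_to_matrix((1.20, 0.35, 1.10), (0.0, 0.0, 0.0, 1.0))
T_markers_optical = pose_to_matrix((0.02, -0.01, 0.04), (0.0, 0.0, 0.0, 1.0))

# Ground-truth pose of the camera optical frame in mocap coordinates.
T_world_optical = T_world_markers @ T_markers_optical
print(T_world_optical)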
The original data has been recorded as a ROS bag file. In total, we collected 50 GB of Kinect data, divided into nine separate sequences. The dataset is available online under the Creative Commons Attribution license at

https://cvpr.in.tum.de/research/datasets/rgbd-dataset

In addition to further information about the data formats, the website contains videos for simple visual inspection of the dataset.
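As a minimal sketch of how such a bag file could be inspected, the following Python snippet uses the standard rosbag API to count messages and report the first time stamp per topic. The file name and the topic names are assumptions made for illustration; the actual topic names are documented on the dataset website.

# Requires a ROS installation (e.g. Diamondback-era) with the rosbag package.
import rosbag

bag = rosbag.Bag('rgbd_dataset_sequence.bag')  # hypothetical file name

first_stamp = {}
counts = {}
for topic, msg, t in bag.read_messages(
        topics=['/camera/rgb/image_color', '/camera/depth/image', '/imu']):
    counts[topic] = counts.get(topic, 0) + 1
    first_stamp.setdefault(topic, t.to_sec())

for topic in sorted(counts):
    print('%-28s %6d msgs, first stamp %.6f'
          % (topic, counts[topic], first_stamp[topic]))

bag.close()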
III. EVALUATION

For evaluating visual SLAM algorithms on our dataset, we propose a metric similar to the one introduced in [10]. The general idea is to compute the relative error between the true and the estimated motion w.r.t. the optical frame of the RGB camera. As we have ground-truth pose information for all time indices, we propose to compute the error as the sum of squared differences between the estimated and the true relative motion from time i to time i + ∆, i.e.,

\mathrm{error} = \sum_{i=1}^{n} \bigl[ (\hat{x}_{i+\Delta} \ominus \hat{x}_i) \ominus (x_{i+\Delta} \ominus x_i) \bigr]^2 \qquad (1)

where i = 1, ..., n are the time indices at which ground-truth information is available, ∆ is a free parameter that corresponds to the time scale, x_i is the ground-truth pose at time index i, x̂_i is the estimated pose at time index i, and ⊖ denotes the inverse motion composition operator. If the estimated trajectory has missing values, i.e., there are time steps i_{j_1}, ..., i_{j_m} for which no pose x̂_i could be estimated, the ratio of missing poses m/n should be stated as well.

All data necessary to evaluate our metric is contained in the dataset. We plan to release a Python script that computes these measures automatically given the estimated trajectory and the respective dataset. To prevent (future) approaches from being over-fitted to the dataset, we recorded all scenes twice and held back the ground-truth trajectory for these secondary recordings. With this, we plan to provide a comparative offline evaluation benchmark for visual SLAM systems.
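Until that script is released, the following sketch shows one possible instantiation of the metric of Eq. (1): each pose is represented as a 4×4 homogeneous matrix, the operator ⊖ is implemented as relative motion composition, and the squared term is taken as the squared translational magnitude of the residual motion. This pose representation and this particular choice of distance are our assumptions; the official script may differ.

import numpy as np

def relative(T_a, T_b):
    """Motion from pose T_a to pose T_b, i.e. the inverse motion
    composition operator applied to 4x4 homogeneous matrices."""
    return np.linalg.inv(T_a) @ T_b

def trajectory_error(gt, est, delta=1):
    """Sum of squared residuals in the spirit of Eq. (1).

    gt, est: lists of 4x4 ground-truth / estimated camera poses, where
    est[i] may be None if no pose could be estimated at time index i.
    Returns the error and the ratio m/n of missing poses."""
    n = len(gt)
    missing = sum(1 for T in est if T is None)
    error = 0.0
    for i in range(n - delta):
        if est[i] is None or est[i + delta] is None:
            continue
        rel_est = relative(est[i], est[i + delta])
        rel_gt = relative(gt[i], gt[i + delta])
        residual = relative(rel_gt, rel_est)
        # Squared translational magnitude of the residual motion.
        error += float(np.dot(residual[:3, 3], residual[:3, 3]))
    return error, missing / float(n)

# Tiny usage example with hypothetical poses (pure translations along x).
def translation(x):
    T = np.eye(4)
    T[0, 3] = x
    return T

gt = [translation(0.10 * i) for i in range(5)]
est = [translation(0.11 * i) for i in range(5)]  # small simulated drift
print(trajectory_error(gt, est, delta=1))

A rotational variant could accumulate the angular magnitude of the residual rotation in the same loop instead of, or in addition to, the translational term.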
IV. CONCLUSIONS

In this paper, we have presented a novel RGB-D dataset for benchmarking visual SLAM algorithms. The dataset contains color images, depth maps, and associated ground-truth camera pose information. Further, we proposed an evaluation metric that can be used to assess the performance of a visual SLAM system. We thus provide a benchmark that allows researchers to objectively evaluate visual SLAM systems. Our next step is to evaluate our own system [2] on this dataset in order to provide a baseline for future implementations and evaluations. In this way, we hope to detect (and resolve) potential problems in our current dataset, such as calibration and synchronization issues between the Kinect and our mocap system, as well as the effects of motion blur and the rolling shutter of the Kinect. Furthermore, we want to investigate ways to measure the performance of a SLAM system not only in terms of the accuracy of the estimated camera trajectory, but also in terms of the quality of the resulting map of the environment.

REFERENCES

[1] F. Dellaert. Square root SAM. In Proc. of Robotics: Science and Systems (RSS), Cambridge, MA, USA, 2005.
[2] N. Engelhard, F. Endres, J. Hess, J. Sturm, and W. Burgard. Real-time 3D visual SLAM with a hand-held RGB-D camera. In Proc. of the RGB-D Workshop on 3D Perception in Robotics at the European Robotics Forum, Vasteras, Sweden, 2011.
[3] G. Grisetti, C. Stachniss, and W. Burgard. Improved techniques for grid mapping with Rao-Blackwellized particle filters. IEEE Transactions on Robotics (T-RO), 23:34–46, 2007.
[4] G. Grisetti, C. Stachniss, and W. Burgard. Non-linear constraint network optimization for efficient map learning. IEEE Transactions on Intelligent Transportation Systems, 10(3):428–439, 2009.
[5] P. Henry, M. Krainin, E. Herbst, X. Ren, and D. Fox. RGB-D mapping: Using depth cameras for dense 3D modeling of indoor environments. In Proc. of the Intl. Symp. on Experimental Robotics (ISER), Delhi, India, 2010.
[6] H. Jin, P. Favaro, and S. Soatto. Real-time 3-D motion and structure of point features: Front-end system for vision-based control and interaction. In Proc. of the IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2000.
[7] G. Klein and D. Murray. Parallel tracking and mapping for small AR workspaces. In Proc. of the IEEE and ACM Intl. Symposium on Mixed and Augmented Reality (ISMAR), Nara, Japan, 2007.
[8] K. Konolige, M. Agrawal, R.C. Bolles, C. Cowan, M. Fischler, and B.P. Gerkey. Outdoor mapping and navigation using stereo vision. In Intl. Symp. on Experimental Robotics (ISER), 2007.
[9] K. Konolige and J. Bowman. Towards lifelong visual maps. In Proc. of the IEEE/RSJ Intl. Conf. on Intelligent Robots and Systems (IROS), pages 1156–1163, 2009.
[10] R. Kümmerle, B. Steder, C. Dornhege, M. Ruhnke, G. Grisetti, C. Stachniss, and A. Kleiner. On measuring the accuracy of SLAM algorithms. Autonomous Robots, 27:387–407, 2009.
[11] F. Lu and E. Milios. Globally consistent range scan alignment for environment mapping. Autonomous Robots, 4(4):333–349, 1997.
[12] M. Montemerlo, S. Thrun, D. Koller, and B. Wegbreit. FastSLAM: A factored solution to the simultaneous localization and mapping problem. In Proc. of the AAAI National Conference on Artificial Intelligence, Edmonton, Canada, 2002.
[13] D. Nistér. Preemptive RANSAC for live structure and motion estimation. Machine Vision and Applications, 16:321–329, 2005.
[14] A. Nüchter, K. Lingemann, J. Hertzberg, and H. Surmann. 6D SLAM – 3D mapping outdoor environments. Journal of Field Robotics, 24:699–722, August 2007.
[15] E. Olson, J. Leonard, and S. Teller. Fast iterative optimization of pose graphs with poor initial estimates. In Proc. of the IEEE Intl. Conf. on Robotics and Automation (ICRA), 2006.
[16] B. Pitzer, S. Kammel, C. DuHadway, and J. Becker. Automatic reconstruction of textured 3D models. In Proc. of the IEEE Intl. Conf. on Robotics and Automation (ICRA), 2010.
[17] M. Smith, I. Baldwin, W. Churchill, R. Paul, and P. Newman. The New College vision and laser data set. Intl. Journal of Robotics Research (IJRR), 28(5):595–599, 2009.
[18] C. Stachniss, P. Beeson, D. Hähnel, M. Bosse, J. Leonard, B. Steder, R. Kümmerle, C. Dornhege, M. Ruhnke, G. Grisetti, and A. Kleiner. Laser-based SLAM datasets and benchmarks at http://openslam.org.
[19] H. Strasdat, J. M. M. Montiel, and A. Davison. Scale drift-aware large scale monocular SLAM. In Proc. of Robotics: Science and Systems (RSS), Zaragoza, Spain, 2010.
[20] J. Stühmer, S. Gumhold, and D. Cremers. Real-time dense geometry from a handheld camera. In Proc. of the DAGM Symposium on Pattern Recognition (DAGM), Darmstadt, Germany, 2010.