MATEC Web of Conferences 139, 00007 (2017)    DOI: 10.1051/matecconf/201713900007
ICMITE 2017
CNN-Based Vision Model for Obstacle Avoidance of Mobile Robot

Canglong Liu1,2, Bin Zheng2*, Chunyang Wang1*, Yongting Zhao2, Shun Fu2, Haochen Li2
1 School of Electronic and Information Engineering, Changchun University of Science and Technology, Changchun, 130022, China
2 Chongqing Institute of Green and Intelligent Technology, Chinese Academy of Sciences, Chongqing, 400714, China
Abstract. Exploration of a known or unknown environment is an essential capability for a mobile robot. In this paper, we study the obstacle avoidance problem for a mobile robot in an indoor environment. We present an end-to-end learning model based on a Convolutional Neural Network (CNN) that takes only the raw image obtained from a camera as input and converts the raw pixels directly into steering commands: turn left, turn right and go straight. Training data were collected with a remotely controlled mobile robot that a human operator drove to explore a structured environment without colliding with obstacles. The neural network was trained under the Caffe framework, and the predicted commands are executed through the Robot Operating System (ROS). We analyse how datasets collected from environments with different markings affect the training process, and we designed several real-time detection experiments. The final test results show that the accuracy can be improved by adding markings to a structured environment and that our model achieves high accuracy in obstacle avoidance for mobile robots.
Keywords. End-to-End Learning; CNN; Obstacle Avoidance; ROS
1 Introduction

With the continuous development of science and technology, mobile robots have been widely used in many fields, such as daily-life services, industrial production, education, entertainment and military applications. Mobile robot technology involves control theory, mechanical design and computer technology, and the ability of a mobile robot to navigate and avoid obstacles is an important indicator of its intelligence. Autonomous navigation and obstacle avoidance usually require the robot to be equipped with range sensors and to rely on complex algorithms[1]. Typical range sensors include laser sensors, ultrasonic sensors and visual sensors, but each has its own limitations: lasers, for example, are relatively expensive, and traditional vision-based algorithms are relatively complex.
In recent years, with the development of machine learning and especially deep learning[2], enabling a robot to avoid obstacles through self-learning has become a research hotspot[3]. Deep learning is an end-to-end learning approach that learns a mapping from input to output through a deep network; in other words, the system automatically learns the characteristics of the data when large amounts of data are fed into the algorithm. Levine et al.[4] demonstrated that end-to-end learning is superior to the traditional approach with fixed vision layers. They presented an end-to-end application of a CNN for robot motion planning: an end-to-end convolutional neural network that takes the image obtained from a camera as input and outputs the commands that control the robot arm.
Gaya et al. showed an application of CNN-based automatic obstacle avoidance for Autonomous Underwater Vehicles (AUVs) [1]. However, they did not test the model in real time, and the network is not end-to-end because it still requires intermediate processing steps. Nicolai et al. used a deep learning approach for odometry estimation based on lidar data[5], but their approach needs laser data from an expensive LIDAR. Giusti et al. studied the problem of a quadrotor following a forest trail with a CNN[6]. They acquired the dataset with a hiker wearing three head-mounted cameras and trained a CNN model to output control commands for a quadrotor following the trail along a specific path. Ross et al. [7] proposed a method that learns a left/right controller for Micro Aerial Vehicles (MAVs). They used monocular vision as the only sensor and extracted features from the raw image, so that the MAV can autonomously navigate through a forest environment. However, the system only produces left and right motions; it has to be operated by a person whenever a forward motion command is needed.
Compared with traditional mobile robot obstacle avoidance methods, obstacle avoidance based on end-to-end learning greatly simplifies the computation: the system generates steering commands directly from the original pixels[8]. For the problem of a robot exploring an environment, the traditional pipeline is sense-plan-act, whereas our work tries to use an end-to-end model. The main difference between traditional algorithms and the end-to-end model lies in the path planning.
* Corresponding authors: zhengbin@cigit.ac.cn, wangchunyang19@cust.edu.cn
© The Authors, published by EDP Sciences. This is an open access article distributed under the terms of the Creative Commons Attribution
License 4.0 (http://creativecommons.org/licenses/by/4.0/).
Traditional algorithms are complicated and time-consuming and usually require several sensors. By contrast, our method skips the traditional planning step, i.e., it follows sense-act[9].
In this paper, we present a mobile robot that explores an indoor environment, and the system can rapidly learn features for avoiding obstacles. We do not require a high-resolution camera, only a cheap and common one.
The paper is organized as follows: Section 2 introduces the vision model for obstacle avoidance. Our experiments are presented in Section 3, which covers the hardware platform, the datasets, the trained model, a real-time test and other training details. Finally, conclusions and prospects are given in Section 4.

2 Vision model of obstacle avoidance

Deep convolutional neural networks have made great progress with the development of large-scale computation and GPUs, and more and more researchers take advantage of CNNs to solve machine learning problems for robots.
We control the robot with a CNN that takes RGB images as input and has three output classes: go straight, turn left and turn right. The network architecture is shown in Fig. 1; it consists of 5 convolutional layers and 3 fully connected layers. Features are extracted and learned from our dataset using Caffe[10], an open-source deep learning framework. The network is based on AlexNet, the winner of ILSVRC-2012 presented by Krizhevsky et al.[11]. The last fully connected layer was adjusted to 3 nodes, consistent with the three steering commands of the system (a minimal sketch of such a network definition is given at the end of Section 3.2).

Figure 1. Structure of the CNN

The mobile robot system executes the instructions predicted by the CNN through the Robot Operating System (ROS). ROS is an open-source framework for robots that provides an operating-system-like layer for heterogeneous computer clusters[12].

Figure 2. The flow chart of running

We use a package named ROS_caffe, which provides a bridge between Caffe and ROS. The flow chart is shown in Fig. 2: the raw image is acquired by ROS from the robot, the trained CNN model classifies the image to decide which direction the robot will go next, and the specific instructions are executed by ROS.

3 Experiments and analysis

3.1 Hardware platform and environment

We built a light-weight, small mobile robot based on the iRobot Roomba and ROS, which is an open-source framework. ROS provides a variety of software packages that can be applied to robots and played an important role in the design of the mobile robot control system. In addition, the system was equipped with a common camera and a mini computer.

Figure 3. Test environment: (a) raw KT foam board; (b) with black tape markings

To demonstrate the problem more effectively, we constructed two types of structured indoor environment from KT foam board. The first type, shown in Fig. 3(a), was constructed using raw KT foam board. In the second type, shown in Fig. 3(b), we added some black tape markings. In addition, we placed a table in the center of the environment as the major obstacle.

3.2 Datasets

A CNN needs a massive dataset in order to train an effective model. In each of the two environments, we drove the robot to explore without colliding with obstacles, and at the same time we recorded the images and control commands with the rosbag tool in ROS.

Table 1. The number of training images

              TL      TR      GS      Total
  Dataset 1   1892    1955    1888    5735
  Dataset 2   1901    1999    2038    5938

We set the label for each frame by matching the timestamps of the control commands with the timestamps of the images; the labels are Turn left (TL), Turn right (TR) and Go straight (GS). Dataset 1 and dataset 2 were collected in the first and second type of environment, respectively. We sampled 5735 images for dataset 1 and 5938 images for dataset 2; the detailed numbers are shown in Table 1. Dataset 1 was split into disjoint sets of 5057 images for training and 678 images for testing, and dataset 2 into 5178 images for training and 760 images for testing.
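As an illustration of this labelling step, the short script below matches each recorded image to the control command closest to it in time and maps the command to TL/TR/GS. It is a minimal sketch rather than the authors' tool: the topic names (/camera/image_raw, /cmd_vel), the bag file name and the angular-velocity threshold are assumptions for illustration.

    # Minimal labelling sketch (assumed topic names and 0.2 rad/s threshold).
    import rosbag

    bag = rosbag.Bag('exploration.bag')

    # Collect (time, angular velocity) pairs from the command topic.
    commands = [(t.to_sec(), msg.angular.z)
                for _, msg, t in bag.read_messages(topics=['/cmd_vel'])]

    def label_for(stamp):
        # Use the command closest in time to the image timestamp.
        _, wz = min(commands, key=lambda c: abs(c[0] - stamp))
        if wz > 0.2:
            return 'TL'    # turning left
        if wz < -0.2:
            return 'TR'    # turning right
        return 'GS'        # going straight

    labels = []
    for _, img, t in bag.read_messages(topics=['/camera/image_raw']):
        labels.append((t.to_sec(), label_for(t.to_sec())))
    bag.close()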
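For the network described in Section 2, the following is a minimal sketch of an AlexNet-style definition with the last fully connected layer reduced to the three steering classes, written with Caffe's Python NetSpec interface. The exact layer sizes, the LMDB path and the omission of ReLU/dropout layers after most layers are simplifications for illustration; the paper does not publish its prototxt.

    # Sketch of an AlexNet-style net with a 3-way output (TL/TR/GS).
    from caffe import layers as L, params as P, NetSpec

    def steering_net(lmdb_path, batch_size=64):
        n = NetSpec()
        # Data layer reading the labelled frames (hypothetical LMDB path).
        n.data, n.label = L.Data(source=lmdb_path, backend=P.Data.LMDB,
                                 batch_size=batch_size, ntop=2)
        # Five convolutional layers, as in AlexNet (activations mostly omitted).
        n.conv1 = L.Convolution(n.data, num_output=96, kernel_size=11, stride=4)
        n.relu1 = L.ReLU(n.conv1, in_place=True)
        n.pool1 = L.Pooling(n.relu1, kernel_size=3, stride=2, pool=P.Pooling.MAX)
        n.conv2 = L.Convolution(n.pool1, num_output=256, kernel_size=5, pad=2)
        n.pool2 = L.Pooling(n.conv2, kernel_size=3, stride=2, pool=P.Pooling.MAX)
        n.conv3 = L.Convolution(n.pool2, num_output=384, kernel_size=3, pad=1)
        n.conv4 = L.Convolution(n.conv3, num_output=384, kernel_size=3, pad=1)
        n.conv5 = L.Convolution(n.conv4, num_output=256, kernel_size=3, pad=1)
        n.pool5 = L.Pooling(n.conv5, kernel_size=3, stride=2, pool=P.Pooling.MAX)
        # Three fully connected layers; the last one has 3 outputs.
        n.fc6 = L.InnerProduct(n.pool5, num_output=4096)
        n.fc7 = L.InnerProduct(n.fc6, num_output=4096)
        n.fc8 = L.InnerProduct(n.fc7, num_output=3)
        n.loss = L.SoftmaxWithLoss(n.fc8, n.label)
        return n.to_proto()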
3.3 Training results

We trained the model shown in Fig. 1 on a workstation equipped with an NVIDIA GTX 1080 GPU and NVIDIA cuDNN. After training, the network can generate TL/TR/GS commands directly from the camera images.

3.3.1 Training curves

The learning curves, plotted in Fig. 4, show the test accuracy and the training and test loss against the number of iterations. In Fig. 4(a), the test accuracy eventually reaches 81.72%, and in Fig. 4(b) the model achieves a test accuracy of 93.21%.

Figure 4. Training curves: (a) training curve of dataset 1; (b) training curve of dataset 2

From the training curves we can see that the performance on dataset 2 significantly outperforms that on dataset 1. Comparing the two environments, we observe that the more easily the environment can be identified, the higher the accuracy, which is similar to how a human recognizes an environment. The accuracy can therefore be improved by changing the environment markings in the constructed environment.

3.3.2 The confusion matrix

The confusion matrix of the dataset 1 classification results is shown in Fig. 5. The overall accuracy on dataset 1 is 81.72%. The accuracy of the GS class is the highest, at 96.08%, while the TL and TR classes have lower accuracy. In addition, the TR class is often misclassified as the TL class.

Figure 5. The confusion matrix of the dataset 1 classification result

3.3.3 Test results for samples

We tested some randomly selected sample images with the trained model, as shown in Fig. 6. The average network prediction time per frame is 4.03 ms. The raw input images are shown in the first row, and the second row shows the corresponding predicted outputs, which include turn left/right and go straight.
In the histograms, the red marks represent the response probabilities of the three classes. We observe that the CNN model is effective: it can extract valid information from the raw input and predict the correct commands.

Figure 6. Test results for sample images (a)-(c)
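To make the online prediction step concrete (the per-frame forward pass used above and in the real-time test of Section 3.4), the following is a minimal sketch of a ROS node that classifies each camera frame with the trained Caffe model and publishes the corresponding steering command. This is not the authors' ROS_caffe code: the topic names, model file names, output blob name and velocity values are assumptions for illustration.

    # Sketch of the classify-and-steer loop (assumed topics and file names).
    import caffe
    import cv2
    import rospy
    from cv_bridge import CvBridge
    from geometry_msgs.msg import Twist
    from sensor_msgs.msg import Image

    rospy.init_node('cnn_obstacle_avoidance')
    net = caffe.Net('deploy.prototxt', 'steering.caffemodel', caffe.TEST)
    bridge = CvBridge()
    pub = rospy.Publisher('/cmd_vel', Twist, queue_size=1)

    # Assumed class order 0=TL, 1=TR, 2=GS -> (linear m/s, angular rad/s).
    ACTIONS = {0: (0.0, 0.5), 1: (0.0, -0.5), 2: (0.2, 0.0)}

    def on_image(msg):
        frame = bridge.imgmsg_to_cv2(msg, desired_encoding='bgr8')
        # Resize to the (assumed) network input size and reorder to (1, C, H, W).
        blob = cv2.resize(frame, (227, 227)).transpose(2, 0, 1)[None, ...]
        net.blobs['data'].reshape(*blob.shape)
        net.blobs['data'].data[...] = blob
        probs = net.forward()['prob'][0]   # assumes a softmax blob named 'prob'
        linear, angular = ACTIONS[int(probs.argmax())]
        cmd = Twist()
        cmd.linear.x, cmd.angular.z = linear, angular
        pub.publish(cmd)

    rospy.Subscriber('/camera/image_raw', Image, on_image, queue_size=1)
    rospy.spin()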
3.3.4 Comparison

We compare our results with others in order to show the effectiveness of the method. As shown in Table 2, Lei Tai et al. tested a deep-network solution for obstacle avoidance in an indoor environment and reported an overall accuracy of 80.2%[13]. In comparison with their method and results, our tests were carried out in indoor environments both without and with markings. The final test results show that our model achieves high accuracy in obstacle avoidance for mobile robots and that the accuracy can be improved by adding markings to the environment.

Table 2. Accuracy of obstacle avoidance

              Dataset 1   Dataset 2   [13]
  Accuracy    81.72%      93.21%      80.2%

3.4 Real-time test

In our real-time test experiment, we predicted the commands with the model trained on dataset 2. The robot executes these outputs as constant rotational and/or translational velocities. We used the ROS_caffe package, which provides a bridge between Caffe and ROS, and the system executed the instructions predicted by the CNN through ROS.
In the test, the robot predicted the instruction for the corresponding action with the trained model and performed the action through ROS. Throughout the whole experiment, the robot successfully avoided the square table in the center of the environment and did not hit the surrounding KT board. The network outputs predicted during real-time testing are shown in Fig. 7. The experiment achieves the goal of mobile robot obstacle avoidance while exploring the environment.

Figure 7. Real-time predictions of the network

4 Conclusions and prospect

In this paper, we presented an approach for mobile robot environment exploration and obstacle avoidance using end-to-end learning based on a CNN. A deep neural network was trained with the collected datasets and converts RGB images directly into steering commands. We also discussed how the accuracy can be improved by changing the environment markings in a constructed environment. The real-time test experiments show that our approach can achieve high accuracy in obstacle avoidance for mobile robots.
In future work, we will attempt the task in more complex settings, including dynamic and unstructured environments. We will address robot navigation tasks based on the CNN model and aim to make a significant contribution to the development of intelligent mobile robot navigation.

References
1. Kovács L. Visual Monocular Obstacle Avoidance for Small Unmanned Vehicles[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops. 2016: 59-66.
2. Deng L, Yu D. Deep learning: methods and applications[J]. Foundations and Trends in Signal Processing, 2014, 7(3-4): 197-387.
3. Jia B, Feng W, Zhu M. Obstacle detection in single images with deep neural networks[J]. Signal, Image and Video Processing, 2016, 10(6): 1033-1040.
4. Levine S, Finn C, Darrell T, et al. End-to-end training of deep visuomotor policies[J]. arXiv preprint arXiv:1504.00702, 2015.
5. Nicolai A, Skeele R, Eriksen C, et al. Deep learning for laser based odometry estimation[J].
6. Giusti A, Guzzi J, Cireşan D C, et al. A machine learning approach to visual perception of forest trails for mobile robots[J]. IEEE Robotics and Automation Letters, 2016, 1(2): 661-667.
7. Ross S, Melik-Barkhudarov N, Shankar K S, et al. Learning monocular reactive UAV control in cluttered natural environments[C]//Robotics and Automation (ICRA), 2013 IEEE International Conference on. IEEE, 2013: 1765-1772.
8. LeCun Y, Muller U, Ben J, et al. Off-road obstacle avoidance through end-to-end learning[C]//NIPS. 2005: 739-746.
9. Pfeiffer M, Schaeuble M, Nieto J, et al. From perception to decision: A data-driven approach to end-to-end motion planning for autonomous ground robots[J]. arXiv preprint arXiv:1609.07910, 2016.
10. Jia Y, Shelhamer E, Donahue J, et al. Caffe: Convolutional architecture for fast feature embedding[C]//Proceedings of the 22nd ACM international conference on Multimedia. ACM, 2014: 675-678.
11. Krizhevsky A, Sutskever I, Hinton G E. ImageNet classification with deep convolutional neural networks[C]//Advances in neural information processing systems. 2012: 1097-1105.
12. Quigley M, Conley K, Gerkey B, et al. ROS: an open-source Robot Operating System[C]//ICRA Workshop on Open Source Software. Kobe, Japan, 2009.
13. Tai L, Li S, Liu M. A deep-network solution towards model-less obstacle avoidance[C]//Intelligent Robots and Systems (IROS), 2016 IEEE/RSJ International Conference on. IEEE, 2016: 2759-2764.