
Model-Based Design for Visual Localization via Stereoscopic

Video Processing

By

Bryan Mah

A Research Paper Submitted

in

Partial Fulfillment

of the

Requirements for the Degree of

MASTER OF SCIENCE

in

Electrical Engineering

Approved by:

PROF
(Dr. Daniel S. Kaputa, Research Advisor)

PROF
(Dr. Sohail A. Dianat, Department Head)

DEPARTMENT OF ELECTRICAL AND MICROELECTRONIC ENGINEERING

KATE GLEASON COLLEGE OF ENGINEERING

ROCHESTER INSTITUTE OF TECHNOLOGY

ROCHESTER, NEW YORK

May 2017
Contents

1 Introduction 1

2 Background 4
2.1 Image Processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2.2 Feature Extraction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2.3 Stereo Vision . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.3.1 Triangulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.4 Model-Based Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.5 Literature Review . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8

3 Design Methodology 11
3.1 Basic Design Flow . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
3.2 Hardware & Software . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
3.3 Design Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
3.3.1 Color Detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
3.3.2 Localization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
3.3.3 Counter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17

4 Results 19
4.1 Depth (X-varying) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
4.2 Horizontal (Y-varying) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
4.3 Vertical (Z-varying) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
4.4 Disparity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
4.5 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28

5 Conclusion 30
5.1 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30

Bibliography 32

List of Figures

2.1 Camera Geometry for Stereo Vision [3] . . . . . . . . . . . . . . . . . . . . . . . 5

3.1 Basic Design Flow . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11


3.2 Raspberry Pi 3B . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
3.3 ELP 5 Megapixel USB Webcam . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
3.4 System Setup and View . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
3.5 Visual Localization Model-Based Design . . . . . . . . . . . . . . . . . . . . . . 14
3.6 Color Detection System Block . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
3.7 Localization System Block . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
3.8 Counter System Block . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18

4.1 Results for X = 15 cm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19


4.2 Results for X = 30 cm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
4.3 Results for X = 45 cm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
4.4 Results for X = 75 cm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
4.5 Results for X = 233 cm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
4.6 Numerical Analysis of Depth Accuracy . . . . . . . . . . . . . . . . . . . . . . . 22
4.7 Results for Y = -5 cm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
4.8 Results for Y = 1 cm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
4.9 Results for Y = 8 cm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
4.10 Numerical Analysis of Horizontal Accuracy . . . . . . . . . . . . . . . . . . . . . 24
4.11 Results for Z = -4.6 cm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
4.12 Results for Z = 1.9 cm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
4.13 Results for Z = 8 cm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
4.14 Numerical Analysis of Vertical Accuracy . . . . . . . . . . . . . . . . . . . . . . 26

4.15 Numerical Analysis of Disparity . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
4.16 Distance Approximation Based on ’b’ . . . . . . . . . . . . . . . . . . . . . . . . 27

List of Tables

3.1 Software on Remote Machine . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14


3.2 HSV Threshold Values for Color Red . . . . . . . . . . . . . . . . . . . . . . . . 16

Nomenclature

ARM: Advanced RISC Machines

b: distance between the centers of two camera lenses

CMOS: Complementary Metal-Oxide Semiconductor

CPU: Central Processing Unit

d: disparity

f: focal length

FAST: Features from Accelerated Segment Test

FPGA: Field Programmable Gate Array

FPS: Frames Per Second

GPS: Global Positioning System

GPU: Graphics Processing Unit

HDL: Hardware Description Language

HSV: Hue-Saturation-Value

Hz: Hertz

IMU: Inertial Measurement Unit

IO: Input-Output

RAM: Random Access Memory

RGB: Red-Green-Blue

RISC: Reduced Instruction Set Computing

SIFT: Scale Invariant Feature Transform

SoC: System on Chip

SURF: Speeded Up Robust Features

UAV: Unmanned Aerial Vehicle

USB: Universal Serial Bus

Chapter 1
Introduction

As demand for space exploration and drone technology grows, robot developers are challenged to devise new methods of localization for unmanned systems. Localization is the ability to determine and track the position and orientation of a system. Robotic systems require location awareness in order to maneuver safely through their environment. Methods of localization can be divided into two classes: global and local. Global localization utilizes mapping to determine the absolute position of the system, whereas local localization determines relative position based on nearby objects in the environment [1]. Absolute positioning is easily calculated with satellite data from a Global Positioning System (GPS), making it an effective localization tool outdoors. Relative positioning is generally determined from onboard sensors such as proximity sensors, encoders, and Inertial Measurement Units (IMUs). While both methods are useful for robot localization, they become less effective as robotic applications become more complex. In the area of space exploration, industries are investing heavily in unmanned and autonomous systems to navigate beyond Earth; however, away from Earth, GPS is unavailable for determining absolute position, so other methods must be used to determine the location of the system. Even with GPS, signal integrity generally degrades in obstructed environments such as inside buildings, within forests, and in mountainous regions. This disadvantage is apparent in unmanned aerial technology, where stabilization relies on GPS signals. In both applications, different localization techniques must be applied to determine the position of the system, which requires sensors that gather data about the nearby environment. However, many sensors accumulate error and uncertainty in their measurements, and in a mobile application these errors can quickly produce a distorted picture of the environment surrounding the system. This is where visual localization can overcome the weaknesses of conventional sensors.

Visual localization uses images to evaluate the relative location of the system, which provides greater capability to interpret the nearby environment. These systems have a unique problem, though: the obtained images contain more information than the system may need. For this reason, vision systems require some form of image processing to limit the data interpreted from the images. Image analysis is needed to identify features and/or determine the location of the system in space. From color tracking to edge detection, a camera sensor allows an unmanned or autonomous system to recognize key features that describe the landscape of the environment. The downfall of visual localization is that features such as distance and orientation cannot easily be determined with a single camera, because the image projects a 3D environment onto a 2D space, resulting in a loss of depth perception. One proven method of evaluating position with respect to an object is to use two cameras to produce stereoscopic images. Through triangulation, the distance of an object from the system can be accurately calculated. The advantage of using stereoscopic imaging, or stereo vision, for localization is the ability to maintain accurate results regardless of positional changes to the system. Proximity sensors, for example, require a wave (e.g., sonar, infrared, or laser) to be transmitted away from the system while the sensor waits to receive the returning signal, and any change in position during that interval introduces error into the measurement. Position changes are negligible in a stereo vision system as long as the images are updated at a reasonable frame rate. While some may argue that laser range finders are capable of faster sampling rates than vision systems, the current high price of these sensors makes it difficult to develop low-cost localization systems. Vision systems can be the bridge between accurate localization and relatively low cost. Therefore, both stereo vision and visual localization research are essential for the advancement of unmanned and autonomous robotics.
Progress in these fields can take years before it is applied to different robotic platforms. Stereo vision concepts, for example, have been studied for decades, but their implementation is not common because of their complex design and high computational requirements. There is no clear best way to approach these implementations because of the vast choice of computing platforms, specifically microprocessors, Field Programmable Gate Arrays (FPGAs), and Graphics Processing Units (GPUs). Within each category of platforms, there are numerous products to choose from, all with their own advantages and disadvantages. To eliminate the need to develop algorithms for specific architectures, a study in model-based design should be performed for visual localization applications. By developing a single design model to perform visual localization, the target platform no longer affects the algorithm implementation. As visual localization algorithms become more complex, model-based design will allow for a quick turnaround from algorithm to cross-platform application. The increased productivity offered by model-based design will be essential for rapidly developing vision systems suitable for both unmanned and autonomous robotic operations.
In this research, the performance of visual localization is analyzed through the implementa-
tion of stereoscopic video processing within model-based design. Accuracy is evaluated to ensure
the system model is theoretically sound. Disparity tests are performed to determine the optimal
hardware configuration for a dual camera system. Frame rate and resolution are examined to determine the effectiveness of the communication scheme behind model-based design.

Chapter 2
Background

2.1 Image Processing

The fundamental requirement to perform visual localization revolves around image pro-
cessing. Image processing is a form of signal processing where an image is treated as a multi-
dimensional signal. Through image processing, a set of parameters can be calculated based on the
input image [2]. These parameters can be used to alter a system’s behavior or update information
regarding the environment. The type and accuracy of parameters returned from the image process-
ing stem from the operations performed on the image. Filtering, for example, can reduce noise
within the image to produce more accurate data about the environment. Edge detection is another
technique of image processing which can provide useful data regarding a system’s desired or un-
desired movement. In order to implement visual localization, it is necessary to process the images
of the environment; otherwise, data cannot be interpreted from the images.
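
To make these operations concrete, the short MATLAB listing below applies a Gaussian filter and Canny edge detection to a single frame using the Image Processing Toolbox. It is only an illustrative sketch: the file name and the filter parameters are placeholders and are not values used in this work.

% Illustrative sketch of basic image processing operations (placeholder file name).
rgbImage  = imread('scene.png');        % acquire an image of the environment
grayImage = rgb2gray(rgbImage);         % reduce to a single intensity channel

% Filtering: a Gaussian blur suppresses pixel-level noise before analysis
filtered = imgaussfilt(grayImage, 2);   % standard deviation of 2 pixels

% Edge detection: extract a binary map of intensity discontinuities
edges = edge(filtered, 'Canny');        % logical image, 1 marks an edge pixel

imshowpair(filtered, edges, 'montage'); % compare the filtered image and its edges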

2.2 Feature Extraction

Data from images should reflect various features detected within an image. Features are
recognizable structures of elements in the environment. Low-level features are considered geo-
metric primitives such as lines, corners, and blobs. High-level features are viewed as objects such
as doors, tables, and trash cans [2, 3]. Extracting features from an image allows the system to understand whether an object is in its path, whether it has reached a certain end goal, and so on. Not all features are useful for localization; therefore, many factors must be considered to ensure the most appropriate features are selected for gathering environmental data. For example, edge detection would be useful in an office building because there are many lines that can be extracted, but the same edge detection is not necessarily useful on Mars. Although a single image can be used to detect features, it cannot provide the distance of an object from the system. While the environment is indeed
three-dimensional, the camera receives an image of that same environment as a two-dimensional
signal, thus losing all information related to depth. Without depth, localization cannot be attained
from this method. This obstacle is handled through the implementation of stereo vision.

2.3 Stereo Vision

Stereo vision involves using two different images of the environment to gain additional in-
formation that may have been lost in a single image. Information such as depth can be recovered
from using stereoscopic images. For implementation, two cameras are used varying in either cam-
era geometry or camera viewpoint. Each camera simultaneously produces a distinct image of the
environment.

Figure 2.1: Camera Geometry for Stereo Vision [3]

As shown in Figure 2.1, the two cameras are parallel to each other and face the same direction. The lenses are separated by a distance, b. Each camera has its respective focal length, f. The object clearly appears in different locations in the two camera images, as described by the dotted lines in Figure 2.1.

2.3.1 Triangulation

In order to correctly identify the location of an object with respect to the system, triangulation is used to match objects between two similar images. As shown in Figure 2.1, an object appears in different coordinate locations in the two camera images. The distance between the two locations, also known as the disparity, d, is needed to determine the 3D position of the object. Disparity is calculated using Equation (2.1).

\[
d = x_l - x_r \tag{2.1}
\]

where $x_l$ and $x_r$ refer to the horizontal coordinates in the left and right images, respectively. In other words, the disparity is formed by subtracting the horizontal coordinate of a pixel in the right image from that of its corresponding pixel in the left image.
According to Siegwart [2, 3], the following observations can be made regarding disparity:

• Depth is inversely proportional to disparity

• Disparity is proportional to b

• As b is increased, some objects may have no disparity measurement because they may not appear in the field of view of both cameras

Using what is known about disparity, the position of an object with relation to the stereo vision
system can be calculated. Equations (2.2)-(2.4) describe how to determine the 3D position of an
object with respect to the system [3].

\[
x = \frac{b \cdot (x_l + x_r)}{2d} \tag{2.2}
\]

\[
y = \frac{b \cdot (y_l + y_r)}{2d} \tag{2.3}
\]

\[
z = \frac{b \cdot f}{d} \tag{2.4}
\]

where $x_l$ and $x_r$ are the same as in Equation (2.1), and $y_l$ and $y_r$ refer to the vertical coordinates in the left and right images, respectively. For Equations (2.2)-(2.4), the XYZ coordinate system is given in Figure 2.1: x and y are the horizontal and vertical axes relative to the vision system, respectively, and z is the depth relative to the vision system.
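
For concreteness, the following MATLAB sketch evaluates Equations (2.1)-(2.4) for a single matched point pair in the coordinate frame of Figure 2.1. The function name and the assumed units are illustrative; in particular, f must be expressed in pixels for Equation (2.4) to return depth in the same units as b.

function [x, y, z] = triangulate(pL, pR, b, f)
% TRIANGULATE  3D position of a matched point pair (Figure 2.1 frame).
%   pL = [xl yl] and pR = [xr yr] are the pixel coordinates of the same
%   feature in the left and right images, b is the lens separation, and
%   f is the focal length expressed in pixels.
    xl = pL(1);  yl = pL(2);
    xr = pR(1);  yr = pR(2);

    d = xl - xr;                  % disparity, Equation (2.1)
    if d == 0
        error('Zero disparity: the point cannot be triangulated.');
    end

    x = b * (xl + xr) / (2 * d);  % Equation (2.2)
    y = b * (yl + yr) / (2 * d);  % Equation (2.3)
    z = b * f / d;                % Equation (2.4), depth
end

For example, with b = 0.06 m and f = 820 pixels (illustrative values only), a feature at [182 125] in the left image and [158 124] in the right image has a disparity of 24 pixels and a depth of z = 0.06 · 820 / 24 ≈ 2.05 m.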

2.4 Model-Based Design

Mathworks has developed a series of design tools related to model-based simulation to promote rapid product development. These tools allow designers to create a system using function blocks that describe the desired behavior. If the system is stable and error-free, code generation can be performed for either software, through C code, or hardware, through a Hardware Description Language (HDL), depending on the target architecture. This gives a developer a head start on development, as changes to the design model directly change the code generated for the system. Additionally, simulations can be performed without external hardware to prevent system damage. Once simulation results are as expected, the design model can be converted to code for the target architecture.
The ability to target any desired architecture stems from Simulink Coder. This Mathworks
compilation tool generates and executes C and C++ code from Simulink models [4]. The coder
customizes the generated code based on the target architecture so that the programmer does not have
to do it. Therefore, any embedded system can utilize a Simulink model. Similarly, Mathworks also has an HDL Coder tool that generates HDL code for FPGA programming. Mathworks has developed a strong set of tools for any user needing to implement systems quickly, with or without a desired target architecture.
An advantage of model-based design is real-time tuning of model parameters. Before
model-based design, programmers would have to end their simulation, rewrite part of their code,
rebuild their code, and then re-upload to their hardware. Mathworks has done away with much of
this tedious process by allowing programmers to change parts of their model while the simulation
is running in what is known as External Mode. With the model deployed to the hardware, various
parameters can be adjusted to modify system behavior, to improve a system function, or to check
for hardware/software limitations.
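
As a small illustration of this workflow, the hypothetical snippet below changes a block parameter from the MATLAB command line while a model runs in External Mode. The model name and block path are placeholders rather than the actual model of Chapter 3, and the model is assumed to have already been built for its target hardware; the same parameters can also be changed from the block dialogs while connected.

% Hypothetical run-time tuning in External Mode; names are placeholders.
set_param('visLocModel', 'SimulationMode', 'external');    % select External Mode
set_param('visLocModel', 'SimulationCommand', 'connect');  % connect to the target
set_param('visLocModel', 'SimulationCommand', 'start');    % start execution on the hardware

% While the model runs, retune a Constant block that holds the lens
% separation b; the new value takes effect without rebuilding the model.
set_param('visLocModel/Baseline b', 'Value', '0.0509');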

2.5 Literature Review

Many studies without model-based design can be found in the fields of visual localization
and stereo image processing. This literature review presents the work of several researchers in applications related to stereo visual localization. Different feature extraction methods are used in
each work, but all implementations involve a dual camera system.
Dai et al. [5] demonstrated a theoretical design for nonlinear control of a quad-rotor plat-
form using a stereo vision system. Trajectory tracking was performed in simulation and experiment
to verify the feasibility of such a nonlinear controller. Feature points are extracted from the environ-
ment to gather an estimation of the robot position. The experiment performed utilized a resolution
of 320x240 at 20 Frames Per Second (FPS). The vision system was able to measure not only 3D
position but also yaw angle and velocities of the quad-rotor. The results prove that stereo vision
can be used to control quad-rotor platforms with robust performance and low accumulation error.
Natarajan et al. [6] explore color-region-based segmentation in a stereo vision application to develop a 3D reconstruction of real-world objects. Various real-world objects found in a refrigerator are used to test the performance of the system on an assistive robot. The system was required to handle cluttered environments to be considered a robust implementation. Based on the
segmentation results, the algorithm was capable of detecting individual objects of different colors
even with clutter. The experiment also showed accurate results in size, relative position, radius,
and volume.
Lefaudeux and Nashashibi [7] implemented a real-time stereo vision application to localize
static and moving objects. They utilized Features from Accelerated Segment Test (FAST) corner
detection through the OpenCV library to implement feature extraction. Feature tracking was per-
formed using not only the current images but also the previous images. The algorithm used was
able to generate a top-down view of the reconstructed environment. It is noted that the feature
extraction is the slowest part of the algorithm developed.
Chandra and Prihatmanto [8] examined stereo visual odometry using OpenCV for a hu-
manoid robot. The researchers recognized the need for an odometry system on their robotic system;
therefore, they implemented a vision system which could provide environmental data in 3D space.
The humanoid robot, named Nao, was interfaced with a stereo vision system consisting of two Log-
itech C920 high definition cameras. The results concluded that the stereo vision system is capable
of tracking position and orientation with an average error of 10%. However, a few problems were
encountered. Because the humanoid robot lacked mounting points for the cameras, careful design
consideration was needed to ensure optimal implementation. Additionally, motion blur and lag led
to inconsistent results and occasional loss of data.
Ruppelt and Trommer [9] proposed a motion estimation algorithm for use in a pedestrian
navigation system. An analysis is performed in outdoor, indoor, and dark indoor environments
to test the performance and robustness of the algorithm in conjunction with an inertial navigation
system. The implemented hardware consisted of two components mounted on the torso and foot
of a pedestrian. The torso attachment houses an IMU, two GPS receivers, a microprocessor, and
two cameras; the foot attachment contains a secondary IMU. Using this system implementation,
various sample consensus techniques are analyzed to determine the best estimation strategy in terms
of accuracy, robustness, and speed. The researchers were able to perform long and short trajectories with the system and accurately determine the start and end points with a precision level of roughly 0.9 meters in both indoor and outdoor environments.
Lentaris et al. [10] evaluated various visual odometry algorithms using a hardware and software co-design methodology in applications related to navigation on Mars. The research focused heavily on integrating FPGAs into visual localization implementations to greatly reduce the execution time of vision algorithms. The proposed system utilized both a processor and
FPGA to handle different operations depending on computational intensity and potential for par-
allel processing. The communication between the Central Processing Unit (CPU) and the FPGA
required a custom device driver to perform read/write operations in Linux. The results of their work
indicated a speedup factor of 16x over a CPU implementation at a relatively low cost in terms of
lookup tables and memory.

Chapter 3
Design Methodology

3.1 Basic Design Flow

For the proposed system, two parallel cameras are used to view the target environment.
These cameras will send images to the computing platform for image processing. The image is
converted to a color space less susceptible to ambient lighting. Thresholding then occurs to check whether a pixel falls within the range of a particular color. To ensure that small clusters of pixels caused by noise are not detected, filtering is performed using morphological operations. All detected objects undergo centroid analysis to determine where in the image they are centrally located. Based on the centroids of these objects, the computer calculates the relative XYZ-coordinates of each object with respect to the center of the system. The end objective is to identify the desired objects by drawing circles around them and displaying their relative locations as text. The basic
design flow can be seen visually in Figure 3.1.

Figure 3.1: Basic Design Flow

As seen in Figure 3.1, any color can be targeted for feature detection. For the scope of this research,
the desired color is red. Also, while Figure 3.1 shows three separate end goals, the actual output of
the system combines all three end goals into a single video display.

3.2 Hardware & Software

The computing platform used is a single Raspberry Pi 3B shown below in Figure 3.2.

Figure 3.2: Raspberry Pi 3B

Because of its 1.2 GHz quad-core Advanced RISC (Reduced Instruction Set Computing) Machine (ARM) processor and the MATLAB/Simulink hardware support packages developed for it by Mathworks, the Raspberry Pi is chosen for this stereo vision application.
The camera system is composed of two ELP 5-megapixel webcams.

Figure 3.3: ELP 5 Megapixel USB Webcam

The camera model shown in Figure 3.3 is capable of producing a video stream at 5 megapixels
with a claimed frame rate of 15 FPS. It uses a 1/4” Complementary Metal-Oxide Semiconductor

(CMOS) OV5640 sensor with a view angle of 60 degrees. Because of the type of sensor and its
advertised compatibility with the Raspberry Pi, this camera model is chosen for the vision system.
The two ELP webcams connect to the Raspberry Pi via Universal Serial Bus (USB). The
cameras come with their own Linux driver, so integrating the vision system with the computing platform is not difficult. Each camera registers as a separate device with the operating system. While the Raspberry Pi performs all the computations and video processing, the design model (discussed in the next section) runs on a remote machine. A connection to the Raspberry Pi is necessary to ensure communication between the remote machine and the Pi. For maximum speed and bandwidth, a wired Ethernet connection is used directly between the Raspberry Pi and the remote machine rather than a wireless connection.
Although Figure 2.1 has a specific coordinate system, the coordinate system used in this
research is described in Figure 3.4.

(a) Camera View (b) View of System

Figure 3.4: System Setup and View

The coordinate system in Figure 3.4 shows that depth is in the +x-direction, signified by the blue
arrow. Horizontal to the camera system is considered the y-direction with the green arrow indi-
cating the +y-direction. Lastly, vertical to the camera system is considered the z-direction with
the red arrow indicating the +z-direction. Equations (2.2)-(2.4) will change based on a rotation
corresponding to the updated coordinate system.

Software utilized in this application can be found below in Table 3.1.

Software       Add-On                                               Version
MATLAB 2016b   MATLAB                                               9.1
               Simulink                                             8.8
               Computer Vision System Toolbox                       7.2
               Simulink Hardware Support Package for Raspberry Pi   17.1
               DSP System Toolbox                                   9.3
               Embedded Coder                                       6.11
               Simulink Coder                                       8.11

Table 3.1: Software on Remote Machine

The Raspberry Pi runs a Linux-based operating system known as Raspbian Jessie. The
Raspbian release is version 8 and the Linux kernel is version 4.9.24.

3.3 Design Model

Through MATLAB and Simulink, the design model created for this visual localization ap-
plication is found below.

Figure 3.5: Visual Localization Model-Based Design

As shown in Figure 3.5, the inputs to the model are the images received from the two ELP cameras connected to the Raspberry Pi. The images are treated as separate color signals for the red, green, and blue channels. The size of the inputs is based on the resolution of the image frame. For initial implementation purposes, the resolution starts at 320x240. The output of the model is a video display portraying the post-processed image with the localization of each colored object displayed. The output image is also treated as separate color signals.

3.3.1 Color Detection

The input images directly enter the Color Detection block. This is a custom-made system
which performs color thresholding and centroid calculation for all distinct objects.

Figure 3.6: Color Detection System Block

As shown in Figure 3.6, the individual color channels are combined to form one multi-dimensional image. This is necessary because the Color Space Conversion block requires single or double data inputs, and the "Convert image to single" block must receive an [M x N x 3] signal. The image is then converted from the Red-Green-Blue (RGB) color space to the Hue-Saturation-Value (HSV) color space for the thresholding process. Afterwards, the converted HSV image is broken down into separate hue, saturation, and value channels. The threshold values were chosen manually after careful manipulation and verification against the HSV color map. These values can be found in Table 3.2.

Component              Minimum   Maximum
Hue (0 - 180)          0         10
                       170       180
Saturation (0 - 255)   100       255
Value (0 - 255)        70        255

Table 3.2: HSV Threshold Values for Color Red

Because red appears at both the low and high ends of the hue scale, there are two hue thresholds. While the thresholds found in Table 3.2 are not absolute values, they serve as a good baseline for the color red. The thresholding produces a 2D binary image where each pixel either meets or does not meet the threshold requirements (a value of 1 or 0, respectively). To ensure the binary image does not contain small pixel noise, sequential closing and opening morphological operations are performed. Closing involves a dilation followed by an erosion operation, whereas opening is the reverse. Erosion slides a kernel over the image and sets a pixel to 0 unless all pixels under the kernel are equal to 1, while dilation sets a pixel to 1 if at least one pixel under the kernel is equal to 1 [11]. Through these morphological operations, small pixel noise is removed from the image while recognized objects are reinforced. The last step passes the binary image through a Blob Analysis block to calculate the centroids and squared diameters of all identified objects. The centroids are used as the major features of each red object for localization. Additionally, for visual purposes, circles are drawn around the identified objects in the video display.
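
The listing below is a minimal MATLAB equivalent of this processing chain for one RGB frame stored in a uint8 array named frame. It is a sketch of the same steps rather than the generated Simulink code: rgb2hsv returns values in the range [0, 1], so the Table 3.2 thresholds are rescaled; the structuring-element size is chosen only for illustration; and regionprops with EquivDiameter stands in for the Blob Analysis block's squared-diameter output.

% Sketch of the Color Detection stage for a single uint8 RGB frame.
hsv = rgb2hsv(im2double(frame));
H = hsv(:,:,1);  S = hsv(:,:,2);  V = hsv(:,:,3);

% Two hue bands because red wraps around both ends of the hue scale
redMask = ((H <= 10/180) | (H >= 170/180)) & ...
          (S >= 100/255) & (V >= 70/255);

% Sequential closing then opening removes small pixel noise
se = strel('disk', 3);                      % kernel size chosen for illustration
redMask = imopen(imclose(redMask, se), se);

% Centroids (and a size measure) of the remaining blobs
stats = regionprops(redMask, 'Centroid', 'EquivDiameter');

% Draw circles around the detections for the video display
annotated = frame;
if ~isempty(stats)
    centroids = cat(1, stats.Centroid);     % one [x y] row per detected object
    radii     = cat(1, stats.EquivDiameter) / 2;
    annotated = insertShape(frame, 'Circle', [centroids radii], 'LineWidth', 2);
end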

3.3.2 Localization

When centroid data is gathered from the Color Detection blocks, localization can begin.
Figure 3.7 provides the custom system that performs the localization calculations for each object.

Figure 3.7: Localization System Block

As seen in Figure 3.7, the inputs to the system include the centroid data from both images as well as the resolution. $x_r$, $x_l$, and $y_l$ are extracted from the centroid data. Disparity is calculated using Equation (2.1). The XYZ localization is then calculated in the updated coordinate system: x, y, and z are based on Equation (2.4), Equation (2.2), and Equation (2.3), respectively. The focal length and the distance between the lenses are held as separate constant blocks for model tuning. The equations given in Siegwart's book [3] reference the origin of the image at the bottom-left corner. Since it is desired to have the center of the system as the origin, the equations for y and z are modified to reflect an origin centered in the middle of the images. This is done by finding the width and height of the image and dividing each by 2 to find the vertical and horizontal center points. Depth, x, is not affected by the choice of origin.
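
A compact MATLAB sketch of these calculations is given below. The function name is illustrative, and the sign convention on the vertical axis is an assumption (image rows increase downward, so z is negated here); the exact arrangement of blocks in the Simulink model may differ.

function [x, y, z] = localizeObject(cL, cR, imgSize, b, f)
% LOCALIZEOBJECT  XYZ position in the Figure 3.4 frame (x is depth).
%   cL = [xl yl] and cR = [xr yr] are centroid pixel coordinates from the
%   left and right images, imgSize = [width height] (e.g. [320 240]),
%   b is the lens separation, and f is the focal length in pixels.

    % Re-reference the pixel coordinates to the image center
    cx = imgSize(1) / 2;
    cy = imgSize(2) / 2;
    xl = cL(1) - cx;   xr = cR(1) - cx;
    yl = cL(2) - cy;   yr = cR(2) - cy;

    d = xl - xr;                   % disparity, Equation (2.1)

    x = b * f / d;                 % depth, rotated form of Equation (2.4)
    y = b * (xl + xr) / (2 * d);   % horizontal offset, cf. Equation (2.2)
    z = -b * (yl + yr) / (2 * d);  % vertical offset, cf. Equation (2.3);
                                   % negated because image rows grow downward
end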

3.3.3 Counter

The desired output of the model is the received image with the centroids clearly marked and a bounding box to identify the general locations where red is detected. Furthermore, the output should display the respective XYZ location of each individual object. It was determined that the Simulink block which outputs text onto an image cannot currently handle multiple instances of text; therefore, a solution is devised so that the XYZ locations are displayed one at a time based on a custom Counter block.

Figure 3.8: Counter System Block

As shown in Figure 3.8, the system is composed of a standard Counter block with a relational operation used to determine the reset state. Because the counter needs to increment up to the number of objects in view, the system does not have a static reset value; it needs a dynamic reset value to handle a different number of objects at any given time. The input of the system block is the maximum number of objects. The output is the index value of one of the objects in view, which controls which XYZ position is displayed and where the text is displayed. For example, if the index value is 0, the first object has its XYZ location displayed near its centroid. The index values are zero-based; therefore, the equality check first requires subtracting 1. To avoid an algebraic loop, a delay is added. If the previous index value is less than the maximum number of objects, the counter continues to increment; when the previous index value is greater than or equal to the maximum number of objects, the counter resets to 0. Rate transition blocks are needed before and after the counter because the XYZ text should not switch as fast as the sampling rate. Switching at the sampling rate does not give adequate time to read the XYZ location; therefore, the counter increments at one-tenth of the sampling rate.
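
The behavior of this block can be summarized by the short function below, written in the style of a MATLAB Function block. The function name is illustrative; the actual model implements the same logic with a standard Counter block, a relational operator, and a delay.

function idx = objectIndexCounter(numObjects)
% OBJECTINDEXCOUNTER  Zero-based index that cycles through detected objects.
%   The index advances each time the function is called (i.e., at the
%   counter's slower sample rate) and wraps back to 0 once every object
%   in the current frame has been annotated.

    persistent count
    if isempty(count)
        count = 0;
    end

    idx = count;                    % index of the object to annotate now

    if count >= numObjects - 1      % dynamic reset threshold (zero-based)
        count = 0;                  % wrap around after the last object
    else
        count = count + 1;          % otherwise keep incrementing
    end
end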

Chapter 4
Results

All results are obtained at an image resolution of 320x240. At this resolution, the
model processes 5-10 FPS at a sampling rate of 10 Hertz (Hz). When attempting to increase the
sampling rate to 20-30 Hz, the model only processes 10-18 FPS. Unfortunately, the model is unable
to run at a higher resolution. At higher resolutions, the model will build and initialize, but it will
not start executing regardless of the sampling rate.

4.1 Depth (X-varying)

Depth accuracy is tested based on the distance away from the system. Both the Y- and Z-coordinates are kept constant while the X-coordinate is varied.

(a) Left Camera View at X = 15 cm (b) Right Camera View at X = 15 cm

Figure 4.1: Results for X = 15 cm

(a) Left Camera View at X = 30 cm (b) Right Camera View at X = 30 cm

Figure 4.2: Results for X = 30 cm

(a) Left Camera View at X = 45 cm (b) Right Camera View at X = 45 cm

Figure 4.3: Results for X = 45 cm

(a) Left Camera View at X = 75 cm (b) Right Camera View at X = 75 cm

Figure 4.4: Results for X = 75 cm

(a) Left Camera View at X = 233 cm (b) Right Camera View at X = 233 cm

Figure 4.5: Results for X = 233 cm

(a) Depth Accuracy Compared to Ideal (b) Depth Percent Error

Figure 4.6: Numerical Analysis of Depth Accuracy

As shown in Figures 4.1-4.6, depth is recovered with minimal error. While it is apparent that the accuracy of the depth estimate gradually diminishes as the object moves further away, a similar observation can be made for the horizontal and vertical accuracies. Even though both the Y- and Z-coordinates are kept constant, the localization in those dimensions becomes less accurate. This may be a symptom of the disparity being too small to give an accurate measurement. Additionally, the color tracking may be misinterpreting the center of the red object.

4.2 Horizontal (Y-varying)

Horizontal accuracy is tested based on the horizontal offset from the center of the system.
Both the X- and Z-coordinates are kept constant while the Y-coordinate is varied.

(a) Left Camera View at Y = -5 cm (b) Right Camera View at Y = -5 cm

Figure 4.7: Results for Y = -5 cm

(a) Left Camera View at Y = 1 cm (b) Right Camera View at Y = 1 cm

Figure 4.8: Results for Y = 1 cm

(a) Left Camera View at Y = 8 cm (b) Right Camera View at Y = 8 cm

Figure 4.9: Results for Y = 8 cm

(a) Horizontal Accuracy Compared to Ideal (b) Horizontal Percent Error

Figure 4.10: Numerical Analysis of Horizontal Accuracy

As shown in Figures 4.7-4.10, horizontal accuracy is maintained within 6% error. It is noted that the horizontal error is likely to remain low since the horizontal field of view is relatively small.
The horizontal field of view can increase if the distance b increases, but an increase in b also affects
the disparity and depth accuracy.

4.3 Vertical (Z-varying)

Vertical accuracy is tested based on the vertical offset from the center of the system. Both
the X- and Y-coordinates are kept constant while the Z-coordinate is varied.

(a) Left Camera View at Z = -4.6 cm (b) Right Camera View at Z = -4.6 cm

Figure 4.11: Results for Z = -4.6 cm

(a) Left Camera View at Z = 1.9 cm (b) Right Camera View at Z = 1.9 cm

Figure 4.12: Results for Z = 1.9 cm

(a) Left Camera View at Z = 8 cm (b) Right Camera View at Z = 8 cm

Figure 4.13: Results for Z = 8 cm

(a) Vertical Accuracy Compared to Ideal (b) Vertical Percent Error

Figure 4.14: Numerical Analysis of Vertical Accuracy

As shown in Figures 4.11-4.14, vertical error is held to under 6%. Once again, it is noted that the vertical error should remain low because of the small vertical field of view. The vertical field of view cannot be increased since it depends only on the view angle of the camera lenses.

4.4 Disparity

Disparity is tested to determine what distance between lenses is most appropriate for appli-
cations.

Figure 4.15: Numerical Analysis of Disparity

As shown in Figure 4.15, the disparity increases as depth decreases. Additionally, as the distance between the lenses, b, increases, the disparity also increases. However, it is evident from the figure that the disparity converges to approximately 11-12 pixels as the distance increases.

Figure 4.16: Distance Approximation Based on ’b’

Figure 4.16 demonstrates that while the disparity changes according to Figure 4.15, the approximation for long-distance objects also changes. The reference for this test is a red object placed 233 cm away from the system. As the distance between the lenses increases, so does the approximation. However, the approximation is found to be accurate when b is roughly equal to 5.09 cm. This separation gives a percent error of 1.29%.
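
One way to interpret this result follows directly from the depth equation. With depth along the x-axis of Figure 3.4, Equation (2.4) becomes $\hat{x} = b \cdot f / d$. If the measured disparity of the distant target levels off near a roughly constant value $d_{\text{sat}}$ (here approximately 11-12) while b keeps growing, the depth estimate scales almost linearly with b,

\[
\hat{x} = \frac{b \cdot f}{d} \approx \frac{b \cdot f}{d_{\text{sat}}} \propto b,
\]

which is why only one lens separation, found here to be roughly 5.09 cm, reproduces the true 233 cm distance.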

4.5 Discussion

Based on the current tests, it is readily apparent that even visual localization is prone to some error. Images can be noisy because they contain a large amount of data that can be interpreted even though it may not be useful. Substantial filtering and accurate feature extraction are needed to mitigate this error. Even though this application does have apparent error, the error is under 10 percent at a one-meter range. This may seem underwhelming considering that laser range finders are capable of much larger ranges, but this vision application does not require wave propagation. This is a slight advantage since wave propagation may be ineffective for applications involving fast mobile robots and space exploration. While laser range finders have fast sampling times to minimize wave propagation error, the relative cost of implementing laser sensors generally exceeds the cost of vision sensors.
Manufacturing quality can also introduce error into the measurements. While the camera models are the same, camera manufacturing has some variation too: the focal lengths may not be equal or accurate, the mounting angle may differ within each camera housing, power may be supplied unequally, and so on. Because these cameras are technically independent of each other, there is likely some additional error between the two devices compared to a single dual-camera unit. These errors are all systematic and can be reduced by incorporating more precise camera systems.
Parameter tuning is performed to help improve the color detection thresholds from Table
3.2. While the original thresholds did allow for the detection of red color, random noise would
appear in the output display. Upon adjusting the thresholds, the random noise could be ignored,
resulting in accurate object detection. Further testing was performed by changing the distance b between the camera lenses. Whenever the parameter changed physically in the setup, it also had to be accounted for in the design model. As shown in Equations (2.2)-(2.4), b directly affects the results of visual localization. The convenience of changing the parameter in real time allowed for
quick turnaround in testing. As b changed, it was evident that the resulting localization of objects
would change to the expected values.

One downfall to model-based design is the lack of bandwidth and speed associated with
the MATLAB Input-Output (IO) server. Because the model has to communicate with the Raspberry Pi via the MATLAB server and vice versa, the execution speed is quite slow. The model is able to handle 5-10 FPS at 320x240, but attempting to increase the resolution results in a failure of model execution: the model completes the build and initialization, but then stalls at time T = 0. One possible cause is the amount of Random Access Memory (RAM) available in the Raspberry Pi; 1 gigabyte of RAM may not provide enough memory for the Raspberry Pi to perform all tasks.
These tests did provide a proof of concept for model-based visual localization. Disregard-
ing resolution and frame rate, this application is capable of flexible feature extraction and of performing the data analysis necessary for visual localization. The distance between the camera lenses found in these tests can be used to obtain relatively accurate measurements out to two meters. Although
various military and space applications require better accuracy, this is indeed a stepping stone for
model-based visual localization.

Chapter 5
Conclusion

As demonstrated by this research, model-based design can be applied to visual localiza-
tion applications. The ability to tune parameters in real-time allows for quick hardware specifica-
tion changes such as camera separation or software modifications such as color thresholding. The
capabilities of model-based design focus heavily on quick customizations; however, the future of
vision systems will continue to require better bandwidth and speed. Without these improvements,
visual localization through model-based design will remain stalled because of low resolution and low frame rate. If these problems are solved and the communication between the design model and the processing platform is enhanced, work on visual localization in unmanned and autonomous robotics will advance along with it.
In a world where robotic technologies require a known location in 3D space, the need for high-quality sensor subsystems will continue to grow. Many look to laser technology and vision systems to take on these challenges as Unmanned Aerial Vehicles (UAVs) become more prominent
in diverse applications. Laser range finders will likely remain as one of the most capable sensors in
the market, but vision systems will continue to compete in performance relative to cost. As current
computers grow in speed and efficiency, the performance of vision systems will also grow.

5.1 Future Work

Some of the main concerns related to model-based design for this application involve speed and bandwidth. With the Raspberry Pi capable of only 10 FPS at 320x240 resolution, visual localization is not yet viable as a robotic subsystem. The communication between the MATLAB IO server and the Raspberry Pi is too slow to handle the amount of image processing necessary for visual localization. Work has to be done to provide added speed and bandwidth before model-based design can work for these types of applications.
While the resolution used during testing was acceptable at the time, the ultimate goal should be
to maximize resolution and frame rate. Additional computing resources are needed to increase these
specifications. With the rise of System on Chip (SoC) platforms, computing platforms now have
access to fast reconfigurable hardware through their onboard FPGA. These platforms utilize two
processors as well as an FPGA on a single board. An FPGA could potentially speed up the image
processing enough to bump the resolution up to 5 megapixels. This is because an FPGA can be
configured for parallel data computing, meaning streams of data can be processed simultaneously
at very fast speeds. The problem with moving forward with this solution is that model-based design cannot generate all the code necessary for the FPGA to run; not all function blocks utilized in this research application can be applied directly to an FPGA.
Lastly, the method of feature extraction can be changed to improve visual localization.
Color detection is one of many simple feature extraction techniques. Features can also be found us-
ing edge detection, corner detection, Hough transforms, and so on. Additionally, there are algorithms
developed for computer vision to extract features such as Speeded Up Robust Features (SURF)
or Scale-Invariant Feature Transform (SIFT). These feature extraction tools would likely produce a more accurate visual localization system.

Bibliography

[1] A. J. G. de Oliveira, “Autonomous vehicles control and instrumentation,” tech. rep., Instituto de Sistemas e Robótica − Pólo do Porto, 2001.

[2] R. Siegwart, I. R. Nourbakhsh, and D. Scaramuzza, Introduction to Autonomous Mobile Robots. The MIT Press, 2011.

[3] R. Siegwart and I. R. Nourbakhsh, Introduction to Autonomous Mobile Robots. The MIT Press, 2004.

[4] “Simulink Coder - MATLAB & Simulink,” 2017.

[5] F. Dai, K. Wang, and P. Lin, “A stereo camera-equipped quadrotor platform for vision based nonlinear control,” in 2016 IEEE International Conference on Robotics and Biomimetics (ROBIO), pp. 1864–1869, Dec. 2016.

[6] S. K. Natarajan, D. Ristic-Durrant, A. Leu, and A. Gräser, “Robust stereo-vision based 3D modelling of real-world objects for assistive robotic applications,” in 2011 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 786–792, Sept. 2011.

[7] B. Lefaudeux and F. Nashashibi, “Real-time visual perception: detection and localisation of static and moving objects from a moving stereo rig,” in 2012 15th International IEEE Conference on Intelligent Transportation Systems, pp. 522–527, Sept. 2012.

[8] J. Chandra and A. S. Prihatmanto, “Stereo visual odometry system design on humanoid robot Nao,” in 2016 6th International Conference on System Engineering and Technology (ICSET), pp. 34–38, Oct. 2016.

[9] J. Ruppelt and G. F. Trommer, “Stereo-camera visual odometry for outdoor areas and in dark indoor environments,” IEEE Aerospace and Electronic Systems Magazine, vol. 31, pp. 4–12, Nov. 2016.

[10] G. Lentaris, I. Stamoulias, D. Soudris, and M. Lourakis, “HW/SW codesign and FPGA acceleration of visual odometry algorithms for rover navigation on Mars,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 26, pp. 1563–1577, Aug. 2016.

[11] A. Mordvintsev and A. K., OpenCV-Python Tutorials Documentation, 2017.

