Principles of Robot Autonomy II
Course overview and intro to
machine learning for robot autonomy
Team
• Instructors
  • Jeannette Bohg, Assistant Professor, CS
  • Marco Pavone, Associate Professor, AA and CS/EE (by courtesy)
  • Dorsa Sadigh, Assistant Professor, CS and EE
• CAs
  • Annie Xie (Head CA)
  • Yilun Hao
From automation…
…to autonomy
• Waymo self-driving car
• Intuitive DaVinci surgical robot
• Apollo robot at MPI for Intelligent Systems
• Astrobee (NASA)
• Boston Dynamics Spot Mini
• Zipline
From Principles of Robot Autonomy I:
the see-think-act cycle
[Diagram: the see-think-act cycle. Sensing provides raw data to information extraction, which builds a local map; localization and map building combine this with knowledge to produce the robot's position, an environmental model, and a global map; decision making and motion planning turn mission goals into a trajectory; trajectory execution sends actuator commands through actuation back into the real-world environment.]
Outstanding questions and new trends
• How do we build models for complex tasks? Can we use data / prior
experience?
• How should the robot reason about actively interacting with the environment?
• And how should the robot reason when interacting with other
decision-making agents?
Course goals
• Obtain a fundamental understanding of advanced principles of
robot autonomy, including:
1. robot learning
2. physical interaction with the environment
3. interaction with humans
Course structure
• Three modules, roughly of equal length
1. learning-based control and perception
2. interaction with the physical environment
3. interaction with humans
• Requirements
• AA 174A / AA 274A / CS 237A / EE 260A
• CS 106A or equivalent; CS 106B highly recommended
• CME 100 or equivalent (for linear algebra)
• CME 106 or equivalent (for probability theory)
Logistics
• Lectures: Monday and Wednesday, 1:30pm - 2:50pm
• Information about office hours available in the Syllabus:
http://web.stanford.edu/class/cs237b/pdfs/syllabus.pdf
• Course websites:
• https://cs237b.stanford.edu (course content and announcements)
• https://canvas.stanford.edu/courses/166005 (course-related discussions)
• https://www.gradescope.com/courses/481536/ (HW submissions)
• https://canvas.stanford.edu/courses/166005 (Panopto Course Videos)
• To contact the teaching staff, use the email: cs237b-win2223-staff@lists.stanford.edu
Grading and units
• Course grade calculation
• (60%) homework
• (40%) midterm exams (for each student, the lowest exam grade will be
dropped)
• (extra 5%) participation on EdStem
• Units: 3 or 4. Taking the class for 4 units additionally requires presenting a paper at the end of the quarter
Schedule
Intro to Machine Learning (ML)
• Aim
• Present and motivate modern ML techniques
• Courses at Stanford
• EE 104: Introduction to Machine Learning
• CS 229: Machine Learning
• Reference
• Hastie, Tibshirani, and Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction (2009). Available at: https://web.stanford.edu/~hastie/ElemStatLearn/
Machine learning
• Supervised learning (classification, regression)
  • Given $(x^1, y^1), \ldots, (x^n, y^n)$, choose a function $f(x) = y$
    • $x^i$ = data point, $y^i$ = class/value
• Unsupervised learning (clustering, dimensionality reduction)
  • Given $x^1, x^2, \ldots, x^n$, find patterns in the data
Supervised learning
[Example figures: regression and classification]
Learning models
• Parametric models: e.g., linear regression, linear classifiers
• Non-parametric models: e.g., spline fitting, k-nearest neighbors
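A minimal NumPy sketch (not from the slides) contrasting a parametric model (linear regression, a fixed number of parameters) with a non-parametric one (k-nearest-neighbor regression, which keeps the whole training set). The toy sine-wave data and k = 5 are illustrative choices.

```python
import numpy as np

# Toy 1-D dataset (illustrative): noisy samples of a sine wave.
rng = np.random.default_rng(0)
x_train = np.sort(rng.uniform(0, 2 * np.pi, 40))
y_train = np.sin(x_train) + 0.1 * rng.standard_normal(40)

# Parametric model: linear regression y ~ a*x + b (two parameters, fixed form).
X = np.stack([x_train, np.ones_like(x_train)], axis=1)
a, b = np.linalg.lstsq(X, y_train, rcond=None)[0]

# Non-parametric model: k-nearest-neighbor regression (stores the whole dataset).
def knn_predict(x_query, k=5):
    dists = np.abs(x_train - x_query)
    nearest = np.argsort(dists)[:k]
    return y_train[nearest].mean()

x0 = 1.0
print("linear model:", a * x0 + b, " k-NN:", knn_predict(x0))
```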
Loss functions
In selecting $f(x) \approx y$ we need a quality metric, i.e., a loss function to minimize.

• Regression
  • $\ell^2$ loss: $\sum_i |f(x^i) - y^i|^2$
  • $\ell^1$ loss: $\sum_i |f(x^i) - y^i|$
• Classification
  • 0-1 loss: $\sum_i \mathbf{1}\{f(x^i) \neq y^i\}$
  • Cross-entropy loss: $-\sum_i (y^i)^T \log(f(x^i))$
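A minimal NumPy sketch of the four losses above; the function names are my own, and the cross-entropy version assumes $f$ outputs class probabilities and $y^i$ is one-hot.

```python
import numpy as np

def l2_loss(y_pred, y_true):
    # Sum of squared errors over the dataset.
    return np.sum((y_pred - y_true) ** 2)

def l1_loss(y_pred, y_true):
    # Sum of absolute errors over the dataset.
    return np.sum(np.abs(y_pred - y_true))

def zero_one_loss(labels_pred, labels_true):
    # Number of misclassified examples.
    return np.sum(labels_pred != labels_true)

def cross_entropy_loss(probs_pred, onehot_true, eps=1e-12):
    # probs_pred: rows are predicted class distributions; onehot_true: one-hot rows.
    return -np.sum(onehot_true * np.log(probs_pred + eps))
```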
Machine learning as optimization
How can we choose the best (loss minimizing) parameters to fit our
training data?*
Analytical solution (example: linear least squares)

$$\begin{bmatrix} y_{11} & y_{21} \\ y_{12} & y_{22} \\ \vdots & \vdots \\ y_{1n} & y_{2n} \end{bmatrix} \approx \begin{bmatrix} x_{11} & x_{12} & \cdots & x_{1k} \\ x_{21} & x_{22} & \cdots & x_{2k} \\ \vdots & \vdots & \ddots & \vdots \\ x_{n1} & x_{n2} & \cdots & x_{nk} \end{bmatrix} \begin{bmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \\ \vdots & \vdots \\ a_{k1} & a_{k2} \end{bmatrix}$$

For $f_A(x) = xA$ with the $\ell^2$ loss, the minimizer is $\hat{A} = (X^T X)^{-1} X^T Y$.

Numerical optimization (example: gradient descent)
* we’ll come back to worrying about test data
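A minimal NumPy sketch comparing the two routes on the same problem: the closed-form least-squares solution versus plain gradient descent on the $\ell^2$ loss. The synthetic data, learning rate, and iteration count are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 3))          # n = 100 samples, k = 3 features
A_true = np.array([[1.0], [-2.0], [0.5]])  # ground-truth parameters
Y = X @ A_true + 0.01 * rng.standard_normal((100, 1))

# Analytical solution: A_hat = (X^T X)^{-1} X^T Y
A_closed = np.linalg.solve(X.T @ X, X.T @ Y)

# Numerical optimization: gradient descent on L(A) = ||XA - Y||^2
A_gd = np.zeros((3, 1))
lr = 1e-3
for _ in range(5000):
    grad = 2 * X.T @ (X @ A_gd - Y)
    A_gd -= lr * grad

print(np.round(A_closed.ravel(), 3), np.round(A_gd.ravel(), 3))  # both near A_true
```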
Stochastic optimization
Our loss function is defined over the entire training dataset:

$$L = \frac{1}{n}\sum_{i=1}^{n} \left\| f(x^i) - y^i \right\|^2 = \frac{1}{n}\sum_{i=1}^{n} L_i$$

Computing $\nabla L$ could be very computationally intensive. We approximate it on a random mini-batch (stochastic gradient descent; many other variants exist):

$$\nabla L \approx \frac{1}{|S|} \sum_{i \in S \subset \{1,\ldots,n\}} \nabla L_i$$
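A minimal NumPy sketch of stochastic (mini-batch) gradient descent for the same linear model; the batch size of 32 and learning rate are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 3))
A_true = np.array([1.0, -2.0, 0.5])
y = X @ A_true + 0.01 * rng.standard_normal(1000)

A = np.zeros(3)
lr, batch_size = 0.05, 32
for step in range(2000):
    # Approximate the full gradient with a random subset S of the data.
    S = rng.choice(len(X), size=batch_size, replace=False)
    residual = X[S] @ A - y[S]
    grad = 2 * X[S].T @ residual / batch_size
    A -= lr * grad

print(np.round(A, 3))  # should be close to A_true
```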
Regularization
To avoid overfitting on the training data, we may add additional
terms to the loss function to penalize “model complexity”
• $\ell^2$ regularization $\|A\|_2$: often corresponds to a Gaussian prior on the parameters $A$
• $\ell^1$ regularization $\|A\|_1$: often encourages sparsity in $A$ (easier to interpret/explain)
• The weight on the regularization term is a hyperparameter: it trades off data fit against model complexity
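A minimal NumPy sketch of both penalties in practice: ridge regression ($\ell^2$ penalty, closed form) and a simple proximal-gradient loop for the $\ell^1$ penalty. The data, the weight lam, and the step size are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 10))
y = X[:, 0] - 2 * X[:, 1] + 0.1 * rng.standard_normal(50)  # only 2 features matter

lam = 1.0
k = X.shape[1]

# l2 ("ridge"): minimize ||XA - y||^2 + lam*||A||^2, closed-form solution below.
A_ridge = np.linalg.solve(X.T @ X + lam * np.eye(k), X.T @ y)

# l1 has no closed form; one simple approach is proximal gradient descent
# (a gradient step followed by soft-thresholding).
A_lasso = np.zeros(k)
lr = 1e-3
for _ in range(5000):
    grad = 2 * X.T @ (X @ A_lasso - y)
    A_lasso = A_lasso - lr * grad
    A_lasso = np.sign(A_lasso) * np.maximum(np.abs(A_lasso) - lr * lam, 0.0)

print(np.round(A_ridge, 2))
print(np.round(A_lasso, 2))  # l1 pushes irrelevant coefficients toward zero
```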
Generalizing linear models
Linear regression/classification can be very powerful when given the right features.
$$\begin{bmatrix} y_1 \\ y_2 \\ y_3 \\ \vdots \\ y_n \end{bmatrix} \approx \begin{bmatrix} 1 & x_1 & x_1^2 & \cdots & x_1^m \\ 1 & x_2 & x_2^2 & \cdots & x_2^m \\ 1 & x_3 & x_3^2 & \cdots & x_3^m \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_n & x_n^2 & \cdots & x_n^m \end{bmatrix} \begin{bmatrix} a_0 \\ a_1 \\ a_2 \\ \vdots \\ a_m \end{bmatrix}$$
Nonlinearity via basis functions (as in the polynomial expansion above); Eigenfaces are another classic example of engineered features.
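A minimal NumPy sketch of the idea above: expand a scalar input into polynomial basis features and reuse the same linear least-squares machinery. The sine-wave data and degree m = 5 are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(-3, 3, 60))
y = np.sin(x) + 0.1 * rng.standard_normal(60)

m = 5  # polynomial degree (illustrative choice)
Phi = np.stack([x**j for j in range(m + 1)], axis=1)   # n x (m+1) feature matrix
a = np.linalg.lstsq(Phi, y, rcond=None)[0]             # same linear machinery as before

x_test = np.array([0.5, 1.5])
Phi_test = np.stack([x_test**j for j in range(m + 1)], axis=1)
print(Phi_test @ a, np.sin(x_test))                    # prediction vs. ground truth
```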
Feature extraction
[Figure: features obtained by human ingenuity (hand-engineered) vs. by gradient descent (learned)]
Perceptron – analogy to a neuron
Biologists are apparently somewhat skeptical of the analogy.
Just the math: $y = f(xw + b)$ (with the input $x$ as a row vector)
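A minimal NumPy sketch of a single perceptron-style neuron $y = f(xw + b)$ with a step activation; the weights and input are made-up numbers.

```python
import numpy as np

def step(z):
    # Original perceptron: binary output.
    return (z > 0).astype(float)

x = np.array([[0.5, -1.0, 2.0]])      # 1 x 3 row-vector input
w = np.array([[0.1], [0.4], [0.3]])   # 3 x 1 weight column
b = -0.2

y = step(x @ w + b)
print(y)  # [[1.]] since 0.05 - 0.4 + 0.6 - 0.2 = 0.05 > 0
```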
Single layer neural network
Original perceptron: binary inputs, binary output. Stacking several such neurons on the same input gives a single layer:

$$y_1^i = f(x^i w_1 + b_1), \quad y_2^i = f(x^i w_2 + b_2), \quad y_3^i = f(x^i w_3 + b_3), \quad y_4^i = f(x^i w_4 + b_4)$$

or, collecting the weight vectors as columns of a matrix $W$: $y = f(xW + b)$.
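A minimal NumPy sketch of the vectorized form $y = f(xW + b)$, computing several neurons at once with one matrix multiply; the sigmoid activation and layer sizes are illustrative choices.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([[0.5, -1.0, 2.0]])                       # 1 x 3 input (row vector)
W = np.random.default_rng(0).standard_normal((3, 4))   # 3 inputs -> 4 neurons
b = np.zeros((1, 4))

y = sigmoid(x @ W + b)   # 1 x 4 output, one value per neuron
print(y.shape)           # (1, 4)
```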
Multi-layer neural network
Also known as the Multilayer Perceptron (MLP)
Also known as the foundations of DEEP LEARNING
$$h_1 = f_1(x W_1 + b_1), \quad h_2 = f_2(h_1 W_2 + b_2), \quad y = f_3(h_2 W_3 + b_3)$$
Like the brain, we’re connecting neurons to each other sequentially
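A minimal NumPy sketch of the three-layer forward pass written above; the weights are random and ReLU is used for $f_1$ and $f_2$ just to make the example runnable.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(z):
    return np.maximum(z, 0.0)

x = rng.standard_normal((1, 8))                          # 1 x 8 input
W1, b1 = rng.standard_normal((8, 16)), np.zeros((1, 16))
W2, b2 = rng.standard_normal((16, 16)), np.zeros((1, 16))
W3, b3 = rng.standard_normal((16, 2)), np.zeros((1, 2))

h1 = relu(x @ W1 + b1)    # h1 = f1(x W1 + b1)
h2 = relu(h1 @ W2 + b2)   # h2 = f2(h1 W2 + b2)
y = h2 @ W3 + b3          # y = f3(h2 W3 + b3), with f3 the identity here
print(y.shape)            # (1, 2)
```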
Activation functions
Why not simply $y = ((xW_1 + b_1)W_2 + b_2)W_3 + b_3$?

We can't use only linear layers: the composition collapses to a single linear map,

$$y = x W_1 W_2 W_3 + (b_1 W_2 W_3 + b_2 W_3 + b_3),$$

so the activation functions must be nonlinear (common choices: sigmoid, tanh, ReLU).

Secret theme: all of these functions are super easy to differentiate.
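A minimal NumPy sketch of three common activation functions and their derivatives; note how each derivative reuses the forward value, which is the "super easy to differentiate" point.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def d_sigmoid(z):
    s = sigmoid(z)
    return s * (1.0 - s)          # derivative in terms of the forward value

def tanh(z):
    return np.tanh(z)

def d_tanh(z):
    return 1.0 - np.tanh(z) ** 2  # likewise reuses tanh(z)

def relu(z):
    return np.maximum(z, 0.0)

def d_relu(z):
    return (z > 0).astype(float)  # just a mask

z = np.linspace(-3, 3, 7)
print(relu(z), d_relu(z))
```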
Training neural networks
We want to use some variant of gradient descent. How do we compute gradients? The chain rule:

$$\nabla (f \circ g)(x) = \big((Dg)(x)\big)^T (\nabla f)(g(x))$$

1. Sample a batch of data
2. Forward propagate it through the network to compute the loss
3. Backpropagate to calculate the gradient of the loss with respect to the weights/biases
4. Update these parameters using SGD

Backpropagation reuses the intermediate results of forward propagation together with the "easy"-to-differentiate activation functions, so the gradient computation is just a sequence of matrix multiplications.
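A minimal NumPy sketch of the four steps above for a small two-layer network with an $\ell^2$ loss, with the backward pass written out by hand; shapes, data, and hyperparameters are illustrative, and real code would use an autodiff framework (e.g., TensorFlow, covered next lecture).

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((32, 4))   # batch of 32 inputs
y = rng.standard_normal((32, 1))   # targets

W1, b1 = 0.1 * rng.standard_normal((4, 8)), np.zeros((1, 8))
W2, b2 = 0.1 * rng.standard_normal((8, 1)), np.zeros((1, 1))
lr = 0.01

for step in range(500):
    # 1-2. Forward pass: cache intermediate results for reuse in the backward pass.
    z1 = x @ W1 + b1
    h1 = np.maximum(z1, 0.0)             # ReLU
    y_hat = h1 @ W2 + b2
    loss = np.mean((y_hat - y) ** 2)

    # 3. Backward pass: chain rule, each step a matrix multiplication.
    d_yhat = 2 * (y_hat - y) / len(x)    # dL/dy_hat
    dW2 = h1.T @ d_yhat
    db2 = d_yhat.sum(axis=0, keepdims=True)
    d_h1 = d_yhat @ W2.T
    d_z1 = d_h1 * (z1 > 0)               # ReLU derivative
    dW1 = x.T @ d_z1
    db1 = d_z1.sum(axis=0, keepdims=True)

    # 4. Parameter update (plain gradient step; SGD would subsample the batch).
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2
```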
Training neural networks
Lots of regularization tricks:
• Dropout: randomly zero out some neurons on each forward pass (see the sketch below)
• Data augmentation: transform the input data to artificially expand the training set
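A minimal NumPy sketch of (inverted) dropout applied to a layer's activations during training; the keep probability of 0.8 is an illustrative choice.

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(h, p_keep=0.8, training=True):
    if not training:
        return h                   # at test time, use all neurons
    mask = rng.random(h.shape) < p_keep
    return h * mask / p_keep       # rescale so the expected activation is unchanged

h = rng.standard_normal((2, 6))
print(dropout(h))                  # some entries zeroed, the rest scaled up
```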
Neural networks example
http://playground.tensorflow.org/
Next time
• NNs and TensorFlow tutorial
• Markov decision processes