Lecture 2: UC Berkeley CS182 Lecture Notes

Introduction to Machine Learning

Designing, Visualizing and Understanding Deep Neural Networks

CS W182/282A
Instructor: Sergey Levine
UC Berkeley
How do we formulate learning problems?
Different types of learning problems:
➢ Supervised learning: input → [object label]
➢ Unsupervised learning: unlabeled data → representation
➢ Reinforcement learning: actions → rewards
Supervised learning
Given: a dataset $\mathcal{D} = \{(x_1, y_1), \dots, (x_n, y_n)\}$ of inputs (e.g., images) paired with labels, learn a function $f_\theta(x) \approx y$ (e.g., image → [object label]).

Questions to answer:
➢ How do we represent $f_\theta$?
➢ How do we measure whether one $f_\theta$ is better than another?
➢ How do we find the best $f_\theta$?
Unsupervised learning
Given: unlabeled data, learn a representation (what does that mean?)
➢ Generative modeling: GANs, VAEs, pixel RNN, etc.
➢ Self-supervised representation learning
Reinforcement learning

Agent (e.g.)          Actions                   Observations       Rewards
an animal             muscle contractions       sight, smell       food
a robot               motor current or torque   camera images      task success measure (e.g., running speed)
a business            what to purchase          inventory levels   profit
Reinforcement learning
But many other application areas too!
➢ Education (recommend which topic to study next)
➢ YouTube recommendations!
➢ Ad placement
➢ Healthcare (recommending treatments)
[Figure: robot learning locomotion, Haarnoja et al., 2019]
Let’s start with supervised learning…
Supervised learning
Given: a dataset of inputs paired with ground truth labels (e.g., image → [object label]).

The overwhelming majority of machine learning that is used in industry is supervised learning

➢ Encompasses all prediction/recognition models trained from ground truth data


➢ Multi-billion $/year industry!
➢ Simple basic principles
Example supervised learning problems
Given an input, predict an output:

Predict…                Based on…
category of object      image
sentence in French      sentence in English
presence of disease     X-ray image
text of a phrase        audio utterance
Prediction is difficult
[Figure: handwritten digit images; each row shows the model's predicted probability for each label 0-9.]

image   0     1    2     3     4     5     6     7    8     9
5?      0%    0%   0%    0%    0%    90%   8%    0%   2%    0%
9?      4%    0%   0%    0%    11%   0%    4%    0%   6%    75%
3?      5%    0%   0%    40%   0%    30%   20%   0%   5%    0%
4?      5%    0%   0%    0%    50%   0%    3%    0%   2%    40%
0?      70%   0%   20%   0%    0%    0%    0%    0%   10%   0%

Predicting probabilities
Often makes more sense than predicting discrete labels

We’ll see later why it is also easier to learn, due to smoothness


Intuitively, we can’t change a discrete label “a tiny bit,” it’s all or nothing
But we can change a probability “a tiny bit”

Conditional probabilities
Given: $x$, a random variable representing the input (why is it a random variable? because inputs are drawn from some underlying distribution), and $y$, a random variable representing the output. We want to learn $p(y \mid x)$.

Chain rule: $p(x, y) = p(x)\,p(y \mid x)$
Definition of conditional probability: $p(y \mid x) = \dfrac{p(x, y)}{p(x)}$
How do we represent it?
A computer program takes the input (e.g., an image) and outputs a probability for each possible [object label]:

0     1    2    3    4    5     6    7    8    9
0%    0%   0%   0%   0%   90%   8%   0%   2%   0%

10 possible labels, so output 10 numbers (that are positive and sum to 1.0).
How do we represent it?
Why any function? The program can output any real-valued vector $z = f_\theta(x)$, then pass it through a function that turns those values into probabilities that are positive and sum to 1. This could be any (ideally one-to-one & onto) such function.

The exponential $\exp(z)$ is especially convenient because it's one-to-one & onto: it maps the entire real number line to the entire set of positive reals (but don't overthink it; any function with these properties would work).
How do we represent it?
$$p_\theta(y = i \mid x) = \frac{\exp(f_{\theta,i}(x))}{\sum_{j} \exp(f_{\theta,j}(x))}$$

The $\exp$ makes each output positive; dividing by the sum makes the outputs sum to 1.
There is nothing magical about this, and it's not the only way to do it. We just need to get the numbers to be positive and sum to 1!

The softmax in general
$$\operatorname{softmax}(z)_i = \frac{\exp(z_i)}{\sum_{j} \exp(z_j)}$$

Example output over 10 labels:
0     1    2    3    4    5     6    7    8    9
0%    0%   0%   0%   0%   90%   8%   0%   2%   0%
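To make this concrete, here is a minimal numpy sketch of the softmax (my illustration, not code from the lecture); subtracting the max is a standard numerical-stability trick:

```python
import numpy as np

def softmax(z):
    """Map an arbitrary real vector z to a probability vector.

    Subtracting max(z) first doesn't change the output (the constant
    cancels in numerator and denominator) but prevents overflow in exp.
    """
    z = z - np.max(z)
    e = np.exp(z)
    return e / np.sum(e)

logits = np.array([1.0, 3.0, 0.5])
print(softmax(logits))        # positive numbers that sum to 1.0
print(softmax(10 * logits))   # scaled-up logits: nearly one-hot ("hard" max)
```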
An illustration: 2D case
[Figure: softmax probabilities over a 2D input space.]
An illustration: 1D case
[Figure: along one dimension, predictions go from "definitely blue" to "not sure" near the boundary to "definitely red"; probability increases exponentially as we move away from the boundary, divided by the normalizer.]
Why is it called a softmax? Because it's a "soft" (differentiable) version of the max: as the inputs are scaled up, the output approaches a one-hot vector that puts all its probability on the largest entry ($\operatorname{softmax}(\alpha z) \to$ one-hot of $\arg\max_i z_i$ as $\alpha \to \infty$).
Loss functions
So far… a computer program takes the input and outputs [object probability]. This program has learned parameters $\theta$, so we write it as $p_\theta(y \mid x)$.
The machine learning method
for solving any problem ever
How do we represent the "program"?
1. Define your model class. We (mostly) did this in the last section (though we'll spend a lot more time on this later).
2. Define your loss function. How do we measure if one model in your model class is better than another?
3. Pick your optimizer. How do we search the model class to find the model that minimizes the loss function?
4. Run it on a big GPU.


Aside: Marr's levels of analysis
computational     "why?"    e.g., the loss function
algorithmic       "what?"   e.g., the model
implementation    "how?"    e.g., the optimization algorithm ("on which GPU?")

There are many variants on this basic idea…


Back to the recipe: step 2, define your loss function.

How is the dataset "generated"?
Assume each input is sampled from some probability distribution over photos, $x \sim p(x)$, and each label is sampled from a conditional probability distribution over labels, $y \sim p(y \mid x)$.

Training set: $\mathcal{D} = \{(x_1, y_1), \dots, (x_n, y_n)\}$, each $(x_i, y_i)$ sampled independently from $p(x)\,p(y \mid x)$.

Maximum likelihood estimation (MLE):
$$\theta^\star = \arg\max_\theta \sum_{i=1}^n \log p_\theta(y_i \mid x_i)$$

Negative log-likelihood (NLL):
$$\mathcal{L}(\theta) = -\sum_{i=1}^n \log p_\theta(y_i \mid x_i)$$

This is our loss function!
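As an illustration (a hypothetical example, not from the slides), the NLL is easy to compute given predicted probabilities and true labels:

```python
import numpy as np

# Hypothetical model outputs: predicted class probabilities for 4 examples
# over 3 classes (these numbers are made up for illustration).
probs = np.array([
    [0.90, 0.05, 0.05],
    [0.10, 0.80, 0.10],
    [0.30, 0.30, 0.40],
    [0.25, 0.50, 0.25],
])
labels = np.array([0, 1, 2, 0])  # ground-truth class indices

# NLL: sum (or mean) of -log p_theta(y_i | x_i) over the training set.
per_example = -np.log(probs[np.arange(len(labels)), labels])
print(per_example.sum())   # total NLL
print(per_example.mean())  # average NLL, the form usually minimized in practice
```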
Loss functions aside: cross-entropy
In general, the cross-entropy between two distributions $p$ and $q$ is
$$H(p, q) = -\mathbb{E}_{y \sim p}[\log q(y)] = -\sum_{y} p(y) \log q(y).$$
Examples: the NLL loss above is the cross-entropy between the empirical distribution of the training labels and the model's distribution $p_\theta(y \mid x)$.
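To make the connection explicit (a short derivation of my own, not from the slides): treating each label as a one-hot target distribution, the per-example cross-entropy is exactly the per-example NLL term.

```latex
% For a one-hot target p_i that puts all its mass on the true label y_i,
% the cross-entropy against the model reduces to a single log-probability:
H\big(p_i, p_\theta(\cdot \mid x_i)\big)
  = -\sum_{y} p_i(y) \log p_\theta(y \mid x_i)
  = -\log p_\theta(y_i \mid x_i),
% so summing per-example cross-entropies over the dataset recovers the NLL.
```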
Optimization
Back to the recipe: step 3, pick your optimizer.


The loss "landscape"
[Figure: the loss $\mathcal{L}(\theta)$ plotted as a surface over the parameters $\theta$.]

Gradient descent update:
$$\theta \leftarrow \theta - \alpha \nabla_\theta \mathcal{L}(\theta)$$
where $\alpha$ is some small constant called the "learning rate" or "step size".
Gradient descent
In 1D: negative slope = go to the right; positive slope = go to the left.

In general, the gradient is the vector of partial derivatives,
$$\nabla_\theta \mathcal{L}(\theta) = \left[\frac{\partial \mathcal{L}}{\partial \theta_1},\; \frac{\partial \mathcal{L}}{\partial \theta_2},\; \dots\right],$$
and for each dimension we go in the direction opposite the slope along that dimension.
We'll go into a lot more detail about gradient descent and related methods in a later lecture!
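In the meantime, here is a minimal sketch of the update rule on a toy quadratic loss (the loss function and numbers are illustrative, not from the lecture):

```python
import numpy as np

# A minimal sketch of gradient descent on a toy 2D loss,
# L(theta) = theta_0^2 + 5 * theta_1^2 (an illustrative stand-in
# for a real training loss).
def loss(theta):
    return theta[0] ** 2 + 5.0 * theta[1] ** 2

def grad(theta):
    # The gradient: one partial derivative per dimension.
    return np.array([2.0 * theta[0], 10.0 * theta[1]])

alpha = 0.05                      # learning rate / step size
theta = np.array([2.0, -1.5])     # arbitrary starting point
for _ in range(100):
    theta = theta - alpha * grad(theta)  # step opposite the gradient

print(theta, loss(theta))  # both should be close to zero
```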
Back to the recipe one more time: with the model class, loss function, and optimizer defined, let's work through a concrete example.


Logistic regression
Model class: $f_\theta(x) = Wx$, where $W$ is a matrix, so $p_\theta(y \mid x) = \operatorname{softmax}(Wx)$.

Special case: binary classification. With two classes,
$$p_\theta(y = 1 \mid x) = \frac{\exp(w_1^\top x)}{\exp(w_0^\top x) + \exp(w_1^\top x)} = \frac{1}{1 + \exp\!\big(-(w_1 - w_0)^\top x\big)} = \sigma\!\big((w_1 - w_0)^\top x\big).$$
This is called the logistic equation, also referred to as a sigmoid.
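Putting the pieces together, here is a small illustrative sketch (the synthetic data and hyperparameters are my own choices, not from the lecture) that trains binary logistic regression with gradient descent on the NLL:

```python
import numpy as np

# A small end-to-end sketch: binary logistic regression trained by
# gradient descent on the average NLL, using synthetic data.
rng = np.random.default_rng(0)

n = 200
X = rng.normal(size=(n, 2))
true_w = np.array([2.0, -1.0])
y = (X @ true_w + 0.5 * rng.normal(size=n) > 0).astype(float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w = np.zeros(2)
alpha = 0.1
for _ in range(500):
    p = sigmoid(X @ w)         # p_theta(y = 1 | x) for every example
    g = X.T @ (p - y) / n      # gradient of the average NLL w.r.t. w
    w = w - alpha * g          # gradient descent step

accuracy = np.mean((sigmoid(X @ w) > 0.5) == y)
print(w, accuracy)  # w roughly aligned with true_w (up to scale)
```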
Empirical risk and true risk
True risk: $\mathbb{E}_{x \sim p(x),\, y \sim p(y \mid x)}\big[\delta\big(f_\theta(x) \neq y\big)\big]$, where $\delta$ is 1 if wrong, 0 if right.
Empirical risk: $\frac{1}{n}\sum_{i=1}^n \delta\big(f_\theta(x_i) \neq y_i\big)$, the same quantity averaged over the training set.
Is the empirical risk a good approximation of the true risk?
Empirical risk minimization

Overfitting: when the empirical risk is low, but the true risk is high
➢ can happen if the dataset is too small
➢ can happen if the model is too powerful (has too many parameters/capacity)

Underfitting: when the empirical risk is high, and the true risk is high
➢ can happen if the model is too weak (has too few parameters/capacity)
➢ can happen if your optimizer is not configured well (e.g., wrong learning rate)

This is very important, and we will discuss this in much more detail later!
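For intuition, here is a small sketch (my own toy example, with squared error standing in for the zero-one loss above) showing empirical vs. approximate true risk as model capacity grows:

```python
import numpy as np

# Overfitting/underfitting with polynomial regression on synthetic data.
rng = np.random.default_rng(0)

def true_fn(x):
    return np.sin(2 * np.pi * x)

# Small training set: the empirical risk is computed on this.
x_train = rng.uniform(0, 1, 10)
y_train = true_fn(x_train) + 0.1 * rng.normal(size=10)

# Large held-out set: its error approximates the true risk.
x_test = rng.uniform(0, 1, 10000)
y_test = true_fn(x_test) + 0.1 * rng.normal(size=10000)

for degree in [1, 3, 9]:
    coeffs = np.polyfit(x_train, y_train, degree)
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    # degree 1 underfits (both risks high); degree 9 overfits
    # (empirical risk near zero, true risk high)
    print(f"degree {degree}: empirical {train_mse:.4f}, true ~ {test_mse:.4f}")
```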
Summary
1. Define your model class

2. Define your loss function

3. Pick your optimizer

4. Run it on a big GPU
