Lecture 3
UC Berkeley CS182 Lecture Notes

Bias, Variance, and Regularization

Designing, Visualizing and Understanding Deep Neural Networks

CS W182/282A
Instructor: Sergey Levine
UC Berkeley
Will we get the right answer?
Empirical risk and true risk

True risk: $R(\theta) = E_{x \sim p(x),\, y \sim p(y \mid x)}\left[ L(x, y, \theta) \right]$, e.g., with the zero-one loss: 1 if wrong, 0 if right
Empirical risk: $\hat{R}(\theta) = \frac{1}{n} \sum_{i=1}^n L(x_i, y_i, \theta)$

Is this a good approximation of the true risk?

Empirical risk minimization: $\hat{\theta} \leftarrow \arg\min_\theta \frac{1}{n} \sum_{i=1}^n L(x_i, y_i, \theta)$

Overfitting: when the empirical risk is low, but the true risk is high
can happen if the dataset is too small
can happen if the model is too powerful (has too many parameters/capacity)

Underfitting: when the empirical risk is high, and the true risk is high
can happen if the model is too weak (has too few parameters/capacity)
can happen if your optimizer is not configured well (e.g., wrong learning rate)
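To make this concrete, here is a minimal sketch (assuming a toy 1-D sine-wave dataset and numpy's polynomial least-squares fit, both illustrative choices rather than anything from the lecture) showing how capacity drives the two failure modes:

```python
import numpy as np

rng = np.random.default_rng(0)

def true_fn(x):
    # the "true function" for this toy demo (an assumption for illustration)
    return np.sin(2 * np.pi * x)

# a small dataset: easy to overfit
x_train = rng.uniform(0, 1, size=10)
y_train = true_fn(x_train) + rng.normal(0, 0.2, size=10)
x_eval = np.linspace(0, 1, 200)

for degree in (1, 3, 9):
    coeffs = np.polyfit(x_train, y_train, degree)   # empirical risk minimization (least squares)
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)        # empirical risk
    true_mse = np.mean((np.polyval(coeffs, x_eval) - true_fn(x_eval)) ** 2)  # proxy for true risk
    print(f"degree {degree}: train MSE {train_mse:.3f}, 'true' MSE {true_mse:.3f}")
# degree 1 underfits (both risks high); degree 9 overfits (train near zero, true risk high)
```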
Let’s analyze error!

Last time, we discussed classification: a computer program maps an input to an object label (or to a distribution of object probabilities). This time, we’ll focus on regression: the computer program outputs a continuous number, or, more generally, a continuous distribution, typically a normal (Gaussian) distribution. All this stuff applies to classification too; it’s just simpler to derive for regression.


Let’s analyze error!

If we assume a normal (Gaussian) output distribution with fixed variance, the negative log-likelihood is
$-\log p(y \mid x) = \frac{1}{2\sigma^2} \| f_\theta(x) - y \|^2 + \text{const}$
Also the same as the mean squared error (MSE) loss! It’s a bit easier to analyze, but we can analyze other losses too.
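As a quick sanity check, here is a small sketch (with made-up numbers and an assumed fixed output variance) confirming that the Gaussian negative log-likelihood is just a scaled and shifted MSE:

```python
import numpy as np

y = np.array([0.5, -1.2, 2.0])        # made-up targets
mean = np.array([0.4, -1.0, 1.5])     # hypothetical network outputs (the Gaussian means)
sigma = 1.0                           # assumed fixed output standard deviation

# negative log-likelihood of y under N(mean, sigma^2)
nll = 0.5 * np.log(2 * np.pi * sigma**2) + (y - mean) ** 2 / (2 * sigma**2)
mse = (y - mean) ** 2

# NLL = MSE / (2 sigma^2) + constant, so minimizing one minimizes the other
print(np.allclose(nll - mse / (2 * sigma**2), 0.5 * np.log(2 * np.pi * sigma**2)))  # True
```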

Let’s analyze error!
Let’s try to understand overfitting and underfitting more formally

Question: how does the error change for different training sets?
Why is this question important?

overfitting:
• The training data is fitted well
• The true function is fitted poorly
• The learned function looks different each time!

underfitting:
• The training data is fitted poorly
• The true function is fitted poorly
• The learned function looks similar, even if we pool together all the datasets!
Let’s analyze error!
What is the expected error, given a distribution over datasets?

$E_{\mathcal{D} \sim p(\mathcal{D})}\left[ \| f_{\mathcal{D}}(x) - f(x) \|^2 \right] = \sum_{\mathcal{D}} p(\mathcal{D}) \, \| f_{\mathcal{D}}(x) - f(x) \|^2$

(the expected value of the error w.r.t. the data distribution is a sum over all possible datasets, weighted by their probability)
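We can’t literally sum over all possible datasets, but we can approximate the expectation by sampling. A sketch, reusing the hypothetical sine-wave setup from above:

```python
import numpy as np

rng = np.random.default_rng(0)
true_fn = lambda x: np.sin(2 * np.pi * x)
x_eval = np.linspace(0, 1, 200)

errors = []
for _ in range(500):                      # sample many datasets D ~ p(D)
    x = rng.uniform(0, 1, size=10)
    y = true_fn(x) + rng.normal(0, 0.2, size=10)
    coeffs = np.polyfit(x, y, 3)          # train f_D on this particular dataset
    errors.append(np.mean((np.polyval(coeffs, x_eval) - true_fn(x_eval)) ** 2))

print("expected error ~", np.mean(errors))   # Monte Carlo estimate of the expectation
```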
Let’s analyze error!

Why do we care about this quantity?

We want to understand how well our algorithm does independently of the particular (random) choice of dataset.

This is very important if we want to improve our algorithm!

[figure: learned functions across many sampled datasets; the overfitting case varies wildly from dataset to dataset, while the underfitting case barely changes]
Bias-variance tradeoff

Let $\bar{f}(x) = E_{\mathcal{D}}[f_{\mathcal{D}}(x)]$ be the average prediction over datasets. The expected error decomposes as

$E_{\mathcal{D}}\left[ \| f_{\mathcal{D}}(x) - f(x) \|^2 \right] = \underbrace{E_{\mathcal{D}}\left[ \| f_{\mathcal{D}}(x) - \bar{f}(x) \|^2 \right]}_{\text{Variance}} + \underbrace{\| \bar{f}(x) - f(x) \|^2}_{\text{Bias}^2}$

Variance: regardless of what the true function is, how much does our prediction change with the dataset?
Bias: this error doesn’t go away no matter how much data we have!
Bias-variance tradeoff

If variance is too high, we have too little data/too complex a function class/etc. => this is overfitting

If bias is too high, we have an insufficiently complex function class => this is underfitting
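Continuing the toy setup, a sketch that estimates the two terms separately: the variance as the spread of $f_{\mathcal{D}}$ around the average prediction $\bar{f}$, and the bias squared as the gap between $\bar{f}$ and the true function:

```python
import numpy as np

rng = np.random.default_rng(0)
true_fn = lambda x: np.sin(2 * np.pi * x)
x_eval = np.linspace(0, 1, 200)

preds = []
for _ in range(500):                              # resample datasets, retrain each time
    x = rng.uniform(0, 1, size=10)
    y = true_fn(x) + rng.normal(0, 0.2, size=10)
    preds.append(np.polyval(np.polyfit(x, y, 3), x_eval))
preds = np.array(preds)                           # shape: (num datasets, num eval points)

f_bar = preds.mean(axis=0)                        # average prediction over datasets
variance = np.mean((preds - f_bar) ** 2)          # E_D[ ||f_D - f_bar||^2 ]
bias_sq = np.mean((f_bar - true_fn(x_eval)) ** 2) # ||f_bar - f||^2
total = np.mean((preds - true_fn(x_eval)) ** 2)   # E_D[ ||f_D - f||^2 ]

print(f"variance + bias^2 = {variance + bias_sq:.4f} ~ total error = {total:.4f}")
```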

How do we regulate the bias-variance tradeoff?


Regularization
How to regulate bias/variance?

Get more data: addresses variance, but has no effect on bias.
Change your model class: e.g., from 12th-degree polynomials to linear functions.

Can we “smoothly” restrict the model class? Can we construct a “continuous knob” for complexity?

Regularization
Regularization: something we add to the loss function to reduce variance
Bayesian interpretation: it can be regarded as a prior on parameters (but this is not the only interpretation!)

High-level intuition:
When we have high variance, it’s because the data doesn’t give enough information to identify the parameters.
If there is not enough information in the data, can we give more information through the loss function?
If we provide enough information to disambiguate between (almost) equally good models, we can pick the best one.

[figure: several candidate fits that all have zero training error; what makes one of them better than the others?]


The Bayesian perspective
Regularization: something we add to the loss function to reduce variance
Bayesian interpretation: it can be regarded as a prior on parameters (but this is not the only interpretation!)

$\hat{\theta} = \arg\max_\theta \log p(\mathcal{D} \mid \theta) + \log p(\theta)$

We’ve seen the first part before: it’s the log-likelihood, the negative of our usual loss. The new part is the prior $p(\theta)$, and this is the bit we get to choose. Can we pick a prior that makes the smoother function more likely?

Remember: $\log p(\mathcal{D} \mid \theta)$ is just shorthand for $\sum_i \log p(y_i \mid x_i, \theta)$, and the objective comes from $p(\theta \mid \mathcal{D}) \propto p(\mathcal{D} \mid \theta)\, p(\theta)$.


Example: regularized linear regression
Can we pick a prior that makes the smoother function more likely?

What kind of distribution assigns higher probabilities to small numbers? A zero-mean Gaussian: $p(\theta) = \mathcal{N}(0, \sigma^2 I)$, which gives $\log p(\theta) = -\frac{1}{2\sigma^2} \|\theta\|^2 + \text{const}$.

Wiggly, overfit curves typically require large coefficients; if we only allow small coefficients, the best fit might be a smoother one.

Putting it together: $\hat{\theta} = \arg\max_\theta \sum_i \log p(y_i \mid x_i, \theta) - \lambda \|\theta\|^2$

$\lambda$ is a “hyperparameter”: in principle it is determined by $\sigma^2$, but we don’t care, we’ll just select it directly.
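A minimal sketch of this objective for linear regression with polynomial features (ridge regression), using the closed-form solution; the data and the choice of $\lambda$ are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, size=10)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, size=10)

degree, lam = 9, 1e-3                     # lam is the hyperparameter we select directly
X = np.vander(x, degree + 1)              # polynomial feature matrix

# closed-form ridge solution: theta = (X^T X + lam I)^{-1} X^T y
theta = np.linalg.solve(X.T @ X + lam * np.eye(degree + 1), X.T @ y)

print("largest |coefficient| with the penalty:   ", np.abs(theta).max())
print("largest |coefficient| without the penalty:", np.abs(np.polyfit(x, y, degree)).max())
# the Gaussian prior (L2 penalty) keeps the coefficients small, so the fit is smoother
```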


Example: regularized logistic regression

[figure: “what we wanted” vs. “what we got”; technically every point is classified correctly, but the learned decision boundary is not the one we wanted]

Same prior, but now for a classification problem:
$\hat{\theta} = \arg\max_\theta \sum_i \log p(y_i \mid x_i, \theta) - \lambda \|\theta\|^2$

This is sometimes called weight decay.

Other examples of regularizers (we’ll discuss some of these later):

“L1 regularization”: $\lambda \|\theta\|_1$, which creates a preference for zeroing out dimensions!
“L2 regularization”: $\lambda \|\theta\|_2^2$, the squared-norm penalty we used above

Dropout: a special type of regularizer for neural networks
Gradient penalty: a special type of regularizer for GANs
…lots of other choices
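For instance, a sketch of L2 regularization in a classification setting (toy, linearly separable data, all numbers illustrative): plain gradient descent on the regularized logistic loss, where the extra lam * theta term in the update is exactly the “weight decay”:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)     # toy, linearly separable labels

theta, lam, lr = np.zeros(2), 0.1, 0.5
for _ in range(1000):
    p = 1 / (1 + np.exp(-X @ theta))          # predicted probabilities
    grad = X.T @ (p - y) / len(y)             # gradient of the average logistic NLL
    theta -= lr * (grad + lam * theta)        # "+ lam * theta" is the weight decay term

print(theta)  # stays bounded; without the penalty, separable data pushes ||theta|| to infinity
```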
Other perspectives
Regularization: something we add to the loss function to reduce variance

Bayesian perspective: the regularizer is prior knowledge about parameters

Numerical perspective: the regularizer makes underdetermined problems well-determined

Optimization perspective: the regularizer makes the loss landscape easier to search
paradoxically, regularizers can sometimes reduce underfitting if it was due to poor optimization!
especially common with GANs

In machine learning, any “heuristic” term added to the loss that doesn’t depend on the data is generally called a regularizer.

Regularizers introduce “hyperparameters” that we have to select in order for them to work well.
Training sets and test sets
Some questions…
How do we know if we are overfitting or underfitting?

How do we select which algorithm to use?

How do we select hyperparameters?

One idea: choose whatever makes the loss low?

But we can’t diagnose overfitting by looking at the training loss!
The machine learning workflow

Split the dataset:

training set: use this for training (it is used to select the parameters)
validation set: reserve this for selecting hyperparameters, adding/removing features, and tweaking your model class (it is used to select the hyperparameters)
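A sketch of the split itself (the proportions are illustrative; the test set is introduced a couple of slides below):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
idx = rng.permutation(n)                  # shuffle before splitting

train_idx = idx[:700]                     # ~70%: used to select the parameters
val_idx = idx[700:850]                    # ~15%: used to select the hyperparameters
test_idx = idx[850:]                      # ~15%: used only to report final performance

print(len(train_idx), len(val_idx), len(test_idx))  # 700 150 150
```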
Learning curves

[figure: two plots of loss vs. # of gradient descent steps; the training loss keeps decreasing, while the validation loss levels off. The error that remains no matter how long we train: this is the bias!]

Question: can we stop here? How do we know when to stop?
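One standard answer is early stopping: keep the parameters from the step with the best validation loss, and stop once it hasn’t improved for a while. A self-contained sketch on a toy linear regression problem (all numbers illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))
w_true = rng.normal(size=20)
y = X @ w_true + rng.normal(0, 1.0, size=200)

X_tr, y_tr = X[:100], y[:100]             # training set: used for the gradient steps
X_val, y_val = X[100:], y[100:]           # validation set: used only to decide when to stop

w = np.zeros(20)
best_val, best_w, patience, bad = np.inf, w.copy(), 50, 0
for step in range(5000):
    w -= 0.01 * X_tr.T @ (X_tr @ w - y_tr) / len(y_tr)   # one gradient step on train MSE
    val = np.mean((X_val @ w - y_val) ** 2)              # learning curve on the validation set
    if val < best_val - 1e-6:
        best_val, best_w, bad = val, w.copy(), 0
    else:
        bad += 1
        if bad >= patience:               # validation loss stopped improving: stop here
            break

print(f"stopped at step {step}, best validation MSE {best_val:.3f}")
```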
The final exam

We followed the recipe, now what? How good is our final classifier?

We can’t just report performance on the validation set. That’s no good: we already used the validation set to pick hyperparameters!

What if we reserve another set for a final exam (a kind of… validation validation set!)?
The machine learning workflow

training set: used to select the parameters
validation set: used to select the hyperparameters
test set: used only to report final performance


Summary and takeaways
➢ Where do errors come from?
▪ Variance: too much capacity, not enough information in the data to find the right parameters
▪ Bias: too little capacity, not enough representational power to represent the true function
▪ Error = Variance + Bias^2
▪ Overfitting = too much variance
▪ Underfitting = too much bias
➢ How can we trade off bias and variance?
▪ Select your model class carefully
▪ Select your features carefully
▪ Regularization: stuff we add to the loss to reduce variance
➢ How do we select hyperparameters?
▪ Training/validation split
▪ Training set is for optimization (learning)
▪ Validation set is for selecting hyperparameters
▪ Test set is for reporting final results and nothing else!
