Bias, Variance, and Regularization
Designing, Visualizing and Understanding Deep Neural Networks
CS W182/282A
Instructor: Sergey Levine
UC Berkeley
Will we get the right answer?
Empirical risk and true risk
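Zero-one loss: 1 if wrong, 0 if right
The empirical risk approximates the true risk with an average over the training set: is this a good approximation?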
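The equations on this slide did not survive extraction. One standard way to write the two quantities is sketched below; the symbols $\mathcal{L}$, $f_\theta$, $\theta$, $p$ and $n$ are assumed notation, not taken from the slide text.

```latex
% Zero-one loss: 1 if wrong, 0 if right
\mathcal{L}(x, y, \theta) =
  \begin{cases} 0 & \text{if } f_\theta(x) = y \\ 1 & \text{otherwise} \end{cases}

% True risk: expected loss under the data distribution p(x, y)
R(\theta) = E_{(x, y) \sim p}\left[ \mathcal{L}(x, y, \theta) \right]

% Empirical risk: average loss over the n training examples
\hat{R}(\theta) = \frac{1}{n} \sum_{i=1}^{n} \mathcal{L}(x_i, y_i, \theta) \approx R(\theta)
```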
Empirical risk minimization
Overfitting: when the empirical risk is low, but the true risk is high
can happen if the dataset is too small
can happen if the model is too powerful (has too many parameters/capacity)
Underfitting: when the empirical risk is high, and the true risk is high
can happen if the model is too weak (has too few parameters/capacity)
can happen if your optimizer is not configured well (e.g., wrong learning rate)
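The two failure modes above can be reproduced in a small experiment. This is a minimal sketch, assuming a sine curve as the true function, polynomial regression as the model class, and a fresh noisy sample as a stand-in for the true risk; none of these choices come from the slides.

```python
import numpy as np

def true_fn(x):
    # Hypothetical "true function" used only for this illustration.
    return np.sin(2 * np.pi * x)

def fit_and_score(degree, n_train=10, noise=0.2, seed=0):
    """Fit a polynomial of the given degree; return (empirical risk, estimated true risk)."""
    rng = np.random.default_rng(seed)
    x_train = np.linspace(0.0, 1.0, n_train)
    y_train = true_fn(x_train) + noise * rng.normal(size=n_train)
    # A large held-out sample approximates the true risk.
    x_test = rng.uniform(0.0, 1.0, 500)
    y_test = true_fn(x_test) + noise * rng.normal(size=500)
    coeffs = np.polyfit(x_train, y_train, degree)
    train_mse = float(np.mean((np.polyval(coeffs, x_train) - y_train) ** 2))
    test_mse = float(np.mean((np.polyval(coeffs, x_test) - y_test) ** 2))
    return train_mse, test_mse

for degree in (1, 3, 9):
    train_mse, test_mse = fit_and_score(degree)
    print(f"degree={degree}: empirical risk={train_mse:.4f}, held-out risk={test_mse:.4f}")
```

With 10 training points, the degree-9 polynomial can pass through every point (low empirical risk) while the degree-1 fit cannot follow the curve at all (high empirical risk).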
Let’s analyze error!
Last time, we discussed classification: a computer program maps the input to an [object label], or more precisely an [object probability]
This time, we’ll focus on regression: a computer program maps the input to a continuous number, or more precisely a continuous distribution, e.g., a normal (Gaussian) distribution
All this stuff applies to classification too, it’s just simpler to derive for regression
Let’s analyze error!
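With a fixed-variance normal (Gaussian) output distribution, the negative log-likelihood is also the same as the mean squared error (MSE) loss, up to scaling and constants!
It’s a bit easier to analyze, but we can analyze other losses too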
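The link between the Gaussian likelihood and the MSE loss can be written out as follows. This is a sketch under the assumption of an identity covariance; $f_\theta$ is assumed notation for the model's predicted mean.

```latex
p_\theta(y \mid x) = \mathcal{N}\left( f_\theta(x), I \right)

\log p_\theta(y \mid x) = -\frac{1}{2} \left\| f_\theta(x) - y \right\|^2 + \text{const}
```

Maximizing the log-likelihood is therefore the same as minimizing the squared error between $f_\theta(x)$ and $y$.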
Let’s analyze error!
Let’s try to understand overfitting and underfitting more formally
Question: how does the error change for different training sets?
Why is this question important?
overfitting
• The training data is fitted well
• The true function is fitted poorly
• The learned function looks different each time!
underfitting
• The training data is fitted poorly
• The true function is fitted poorly
• The learned function looks similar, even if we pool together all the datasets!
Let’s analyze error!
What is the expected error, given a distribution over datasets?
expected value of error w.r.t. data distribution
sum over all possible datasets
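The annotations above describe an expectation that can be written as below. The notation is assumed: $f_{\mathcal{D}}(x)$ is the function learned from dataset $\mathcal{D}$, and $f(x)$ is the true function.

```latex
E_{\mathcal{D} \sim p(\mathcal{D})}\left[ \left\| f_{\mathcal{D}}(x) - f(x) \right\|^2 \right]
  = \sum_{\mathcal{D}} p(\mathcal{D}) \left\| f_{\mathcal{D}}(x) - f(x) \right\|^2
```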
Let’s analyze error!
Why do we care about this quantity?
We want to understand how well our algorithm does independently
of the particular (random) choice of dataset
This is very important if we want to improve our algorithm!
[Figure panels: overfitting, underfitting]
Bias-variance tradeoff
Variance: regardless of what the true function is, how much does our prediction change with the dataset?
Bias^2: this error doesn’t go away no matter how much data we have!
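The decomposition behind these two terms can be derived in a few lines. This is a sketch in assumed notation: $f_{\mathcal{D}}(x)$ is the function learned from dataset $\mathcal{D}$, $f(x)$ is the true function, and $\bar{f}(x)$ is the average prediction over datasets.

```latex
\bar{f}(x) = E_{\mathcal{D}}\left[ f_{\mathcal{D}}(x) \right]

E_{\mathcal{D}}\left[ \| f_{\mathcal{D}}(x) - f(x) \|^2 \right]
  = E_{\mathcal{D}}\left[ \| (f_{\mathcal{D}}(x) - \bar{f}(x)) + (\bar{f}(x) - f(x)) \|^2 \right]

  = \underbrace{E_{\mathcal{D}}\left[ \| f_{\mathcal{D}}(x) - \bar{f}(x) \|^2 \right]}_{\text{Variance}}
  + 2 \, E_{\mathcal{D}}\left[ f_{\mathcal{D}}(x) - \bar{f}(x) \right]^\top (\bar{f}(x) - f(x))
  + \underbrace{\| \bar{f}(x) - f(x) \|^2}_{\text{Bias}^2}
```

The middle term vanishes because $E_{\mathcal{D}}[f_{\mathcal{D}}(x) - \bar{f}(x)] = 0$ by the definition of $\bar{f}$, which leaves Error = Variance + Bias^2.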
Bias-variance tradeoff
If variance is too high, we have too little data/too complex a function class/etc. => this is overfitting
If bias is too high, we have an insufficiently complex function class => this is underfitting
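Bias and variance can also be estimated numerically by training on many resampled datasets. The sketch below assumes a sine curve as the true function and polynomial regression as the model class; both are illustrative choices, not from the slides.

```python
import numpy as np

def true_fn(x):
    # Hypothetical true function for the simulation.
    return np.sin(2 * np.pi * x)

def bias_variance(degree, n_datasets=200, n_train=15, noise=0.3, seed=0):
    """Estimate (bias^2, variance, expected error) of a polynomial fit over many datasets."""
    rng = np.random.default_rng(seed)
    x_train = np.linspace(0.0, 1.0, n_train)
    x_eval = np.linspace(0.05, 0.95, 50)
    preds = np.empty((n_datasets, x_eval.size))
    for i in range(n_datasets):
        # Each dataset shares the inputs but has freshly drawn noise.
        y_train = true_fn(x_train) + noise * rng.normal(size=n_train)
        preds[i] = np.polyval(np.polyfit(x_train, y_train, degree), x_eval)
    mean_pred = preds.mean(axis=0)  # average prediction over datasets
    variance = float(np.mean((preds - mean_pred) ** 2))
    bias_sq = float(np.mean((mean_pred - true_fn(x_eval)) ** 2))
    error = float(np.mean((preds - true_fn(x_eval)) ** 2))
    return bias_sq, variance, error

for degree in (1, 3, 7):
    b, v, e = bias_variance(degree)
    print(f"degree={degree}: bias^2={b:.4f}, variance={v:.4f}, error={e:.4f}")
```

In this setup the low-degree model is dominated by bias and the high-degree model by variance, and the estimated error equals the sum of the two estimated terms.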
How do we regulate the bias-variance tradeoff?
Regularization
How to regulate bias/variance?
Get more data
addresses variance
has no effect on bias
Change your model class, e.g., from 12th-degree polynomials to linear functions
Can we “smoothly” restrict the model class?
Can we construct a “continuous knob” for complexity?
Regularization
Regularization: something we add to the loss function to reduce variance
Bayesian interpretation: could be regarded as a prior on parameters (but this is not the only interpretation!)
High level intuition:
When we have high variance, it’s because the data doesn’t give enough information to identify parameters
If there is not enough information in the data, can we give more information through the loss function?
If we provide enough information to disambiguate between (almost) equally good models, we can pick the best one
what makes this one better?
all of these solutions have zero training error
The Bayesian perspective
what is this part?
we’ve seen this part before!
Can we pick a prior that
makes the smoother
function more likely?
remember: this is just shorthand for
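The equations these annotations point to did not survive extraction. One standard way to write the Bayesian view is sketched below, in assumed notation ($\mathcal{D}$ for the dataset, $\theta$ for the parameters), and assuming i.i.d. data whose inputs do not depend on $\theta$.

```latex
p(\theta \mid \mathcal{D})
  = \frac{p(\mathcal{D} \mid \theta) \, p(\theta)}{p(\mathcal{D})}
  \propto p(\mathcal{D} \mid \theta) \, p(\theta)

\hat{\theta}
  = \arg\max_\theta \left[ \log p(\mathcal{D} \mid \theta) + \log p(\theta) \right]
  = \arg\min_\theta \left[ -\sum_{i=1}^{n} \log p(y_i \mid x_i, \theta) - \log p(\theta) \right]
```

The first term is the familiar negative log-likelihood loss; the second term, the negative log-prior, plays the role of the regularizer.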
we choose this bit
Example: regularized linear regression
Can we pick a prior that makes the smoother function more likely?
what kind of distribution assigns higher probabilities to small numbers?
a very wiggly fit typically requires large coefficients
if we only allow small coefficients, the best fit might be smoother
the regularization weight is a “hyperparameter” (in principle it follows from the prior, but we don’t care, we’ll just select it directly)
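A zero-mean Gaussian prior on the parameters has a log-density proportional to the negative squared norm, so it adds a penalty of the form lam * ||theta||^2 to the squared-error loss. The sketch below solves that penalized least-squares problem in closed form; the helper names, the polynomial features and the sine data are assumptions made for illustration.

```python
import numpy as np

def poly_features(x, degree):
    # Columns are 1, x, x^2, ..., x^degree.
    return np.vander(x, degree + 1, increasing=True)

def ridge_fit(X, y, lam):
    """Minimize sum_i (x_i^T theta - y_i)^2 + lam * ||theta||^2 in closed form."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 10)
y = np.sin(2 * np.pi * x) + 0.2 * rng.normal(size=x.size)
X = poly_features(x, degree=9)

theta_weak = ridge_fit(X, y, lam=1e-8)    # almost unregularized
theta_strong = ridge_fit(X, y, lam=1e-2)  # stronger preference for small weights
print("weight norm, weak regularizer:  ", np.linalg.norm(theta_weak))
print("weight norm, strong regularizer:", np.linalg.norm(theta_strong))
```

Increasing `lam` shrinks the coefficients at the cost of a slightly higher training error, which is the “continuous knob” on complexity.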
Example: regularized logistic regression
[Figure: what we wanted vs. what we got]
technically every point is classified correctly
same prior, but now for a classification problem
this is sometimes called weight decay
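The same squared-norm penalty can be added to the logistic regression loss. The sketch below uses plain gradient descent on a tiny, linearly separable toy dataset; the data, learning rate and step count are assumptions chosen for illustration. The penalty's gradient, 2 * lam * theta, shrinks the weights a little at every step, which is where the name weight decay comes from.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, lam, lr=0.1, steps=2000):
    """Gradient descent on mean negative log-likelihood + lam * ||theta||^2."""
    theta = np.zeros(X.shape[1])
    for _ in range(steps):
        p = sigmoid(X @ theta)
        grad_nll = X.T @ (p - y) / len(y)  # gradient of the data term
        grad_reg = 2.0 * lam * theta       # gradient of the L2 penalty
        theta -= lr * (grad_nll + grad_reg)
    return theta

# Toy separable data: one feature plus a constant bias column.
X = np.array([[-2.0, 1.0], [-1.0, 1.0], [1.0, 1.0], [2.0, 1.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])

theta_plain = fit_logistic(X, y, lam=0.0)
theta_decay = fit_logistic(X, y, lam=0.1)
print("no regularizer:", theta_plain)
print("weight decay:  ", theta_decay)
```

Both models classify every point correctly, but without the penalty the weights keep growing on separable data, while the regularized weights settle at a finite value.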
Other examples of regularizers (we’ll discuss some of these later):
“L1 regularization”: creates a preference for zeroing out dimensions!
“L2 regularization”
Dropout: a special type of regularizer for neural networks
Gradient penalty: a special type of regularizer for GANs
…lots of other choices
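For reference, the two norm-based regularizers named in the list above are usually written as follows, with $\lambda$ as the regularization weight (assumed notation):

```latex
% L1 regularization
\lambda \| \theta \|_1 = \lambda \sum_i | \theta_i |

% L2 regularization
\lambda \| \theta \|_2^2 = \lambda \sum_i \theta_i^2
```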
Other perspectives
Regularization: something we add to the loss function to reduce variance
Bayesian perspective: the regularizer is prior knowledge about parameters
Numerical perspective: the regularizer makes underdetermined problems well-determined
Optimization perspective: the regularizer makes the loss landscape easier to search
paradoxically, regularizers can sometimes reduce underfitting if it was due to poor optimization!
especially common with GANs
In machine learning, any “heuristic” term added to the loss that doesn’t depend on data is generally called a regularizer
Regularizers introduce hyperparameters that we have to select in order for them to work well
Training sets and test sets
Some questions…
How do we know if we are overfitting or underfitting?
How do we select which algorithm to use?
How do we select hyperparameters?
One idea: choose whatever makes the loss low
But we can’t diagnose overfitting by looking at the training loss!
The machine learning workflow
the dataset is split into:
training set: use this for training
validation set: reserve this for…
…selecting hyperparameters
…adding/removing features
…tweaking your model class
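The workflow above can be sketched in code: hold out part of the dataset, train on the rest, and pick the hyperparameter with the lowest validation loss. The ridge regression model, the synthetic data and the function names here are assumptions for illustration only.

```python
import numpy as np

def ridge_fit(X, y, lam):
    # Closed-form minimizer of squared error + lam * ||theta||^2.
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def mse(X, y, theta):
    return float(np.mean((X @ theta - y) ** 2))

def select_lambda(X, y, lambdas, val_fraction=0.25, seed=0):
    """Train on the training set for each lam; score on the validation set."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    n_val = int(len(y) * val_fraction)
    val_idx, train_idx = idx[:n_val], idx[n_val:]
    scores = {}
    for lam in lambdas:
        theta = ridge_fit(X[train_idx], y[train_idx], lam)  # training set only
        scores[lam] = mse(X[val_idx], y[val_idx], theta)    # validation set only
    best = min(scores, key=scores.get)
    return best, scores

# Synthetic data: noisy sine curve with degree-9 polynomial features.
rng = np.random.default_rng(1)
x = rng.uniform(0.0, 1.0, 40)
y = np.sin(2 * np.pi * x) + 0.2 * rng.normal(size=x.size)
X = np.vander(x, 10, increasing=True)

lambdas = [1e-6, 1e-4, 1e-2, 1.0]
best_lam, val_scores = select_lambda(X, y, lambdas)
print("validation MSE per lambda:", val_scores)
print("selected lambda:", best_lam)
```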
The machine learning workflow
the dataset
training set: used to select…
validation set: used to select…
Learning curves
[Figure: two plots of loss vs. # of gradient descent steps; annotation: this is the bias!]
Question: can we stop here?
How do we know when to stop?
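One common answer to the stopping question, not stated in the extracted slide text, is early stopping: track the validation loss during training and stop once it has not improved for a while. The function below is a minimal sketch of that rule; the name `early_stopping` and the `patience` parameter are hypothetical.

```python
def early_stopping(val_losses, patience=3):
    """Return (best_step, best_loss), scanning until `patience` checks pass without improvement."""
    best_step, best_loss = 0, float("inf")
    for step, loss in enumerate(val_losses):
        if loss < best_loss:
            best_step, best_loss = step, loss
        elif step - best_step >= patience:
            break  # validation loss has stalled: stop training here
    return best_step, best_loss

# Example validation curve: improves, then starts to rise as the model overfits.
curve = [1.0, 0.8, 0.6, 0.7, 0.65, 0.9, 0.5]
print(early_stopping(curve, patience=3))
```

In practice the model parameters from `best_step` would be saved and restored; the rule never sees values recorded after it stops.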
The final exam
We followed the recipe, now what?
How good is our final classifier?
Reporting the validation loss is no good: we already used the validation set to pick hyperparameters!
What if we reserve another set for a final exam (a kind of… validation validation set!)
[Figure: the dataset, split into training set and validation set]
The machine learning workflow
the dataset
training set: used to select…
validation set: used to select…
test set: used only to report final performance
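The three-way split can be sketched as a single shuffle of the example indices. The function name and the default fractions below are assumptions; the only requirement from the workflow is that the three sets do not overlap.

```python
import numpy as np

def three_way_split(n, val_fraction=0.15, test_fraction=0.15, seed=0):
    """Shuffle indices 0..n-1 and cut them into train / validation / test parts."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n)
    n_test = round(n * test_fraction)
    n_val = round(n * val_fraction)
    test_idx = idx[:n_test]                # touched once, for the final report
    val_idx = idx[n_test:n_test + n_val]   # hyperparameters and model choices
    train_idx = idx[n_test + n_val:]       # optimization (learning)
    return train_idx, val_idx, test_idx

train_idx, val_idx, test_idx = three_way_split(100)
print(len(train_idx), len(val_idx), len(test_idx))
```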
Summary and takeaways
➢ Where do errors come from?
▪ Variance: too much capacity, not enough information in the data to find the right parameters
▪ Bias: too little capacity, not enough representational power to represent the true function
▪ Error = Variance + Bias^2
▪ Overfitting = too much variance
▪ Underfitting = too much bias
➢ How can we trade off bias and variance?
▪ Select your model class carefully
▪ Select your features carefully
▪ Regularization: stuff we add to the loss to reduce variance
➢ How do we select hyperparameters?
▪ Training/validation split
▪ Training set is for optimization (learning)
▪ Validation set is for selecting hyperparameters
▪ Test set is for reporting final results and nothing else!