Lec 3

UC Berkeley CS182 Lecture Notes

Uploaded by

Phạm Thạch Thanh Trúc

Bias, Variance, and Regularization

Designing, Visualizing and Understanding Deep Neural Networks

CS W182/282A
Instructor: Sergey Levine
UC Berkeley
Will we get the right answer?
Empirical risk and true risk
Zero-one loss: 1 if wrong, 0 if right

is this a good approximation?
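
The quantities named on this slide can be written out as follows. This is a sketch in standard notation, which is an assumption here because the slide's own formulas did not survive: $f_\theta$ is the model, $\mathcal{L}$ the loss, and $(x_i, y_i)$ are the $n$ training examples.

```latex
\begin{align*}
\text{zero-one loss:}\quad & \mathcal{L}(x, y, \theta) =
  \begin{cases} 1 & \text{if } f_\theta(x) \neq y \\ 0 & \text{if } f_\theta(x) = y \end{cases} \\
\text{true risk:}\quad & R(\theta) =
  \mathbb{E}_{x \sim p(x),\, y \sim p(y \mid x)}\big[\mathcal{L}(x, y, \theta)\big] \\
\text{empirical risk:}\quad & \hat{R}(\theta) =
  \frac{1}{n} \sum_{i=1}^{n} \mathcal{L}(x_i, y_i, \theta)
\end{align*}
```

The empirical risk is a sample average standing in for the expectation, which is why the slide asks whether it is a good approximation.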

Empirical risk minimization

Overfitting: when the empirical risk is low, but the true risk is high
can happen if the dataset is too small
can happen if the model is too powerful (has too many parameters/capacity)

Underfitting: when the empirical risk is high, and the true risk is high
can happen if the model is too weak (has too few parameters/capacity)
can happen if your optimizer is not configured well (e.g., wrong learning rate)
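
A minimal sketch of both failure modes, assuming a made-up regression problem: the ground-truth function `true_fn`, the noise level, and the polynomial degrees are all illustrative choices, not from the lecture. The "true risk" is only estimated, using a large fresh sample.

```python
import numpy as np

def true_fn(x):
    # Hypothetical ground-truth function, chosen only for illustration.
    return np.sin(np.pi * x)

def fit_and_score(degree, x_train, y_train, x_fresh, y_fresh):
    """Fit a polynomial by least squares; return (empirical risk, estimated true risk)."""
    coeffs = np.polyfit(x_train, y_train, degree)
    empirical = float(np.mean((np.polyval(coeffs, x_train) - y_train) ** 2))
    estimated_true = float(np.mean((np.polyval(coeffs, x_fresh) - y_fresh) ** 2))
    return empirical, estimated_true

rng = np.random.default_rng(0)
noise = 0.1

# Small training set: 16 evenly spaced inputs with noisy labels.
x_train = np.linspace(-1.0, 1.0, 16)
y_train = true_fn(x_train) + noise * rng.normal(size=x_train.size)

# Large fresh sample from the same distribution, standing in for the true risk.
x_fresh = rng.uniform(-1.0, 1.0, size=20000)
y_fresh = true_fn(x_fresh) + noise * rng.normal(size=x_fresh.size)

# Degree 1 is too weak, degree 15 has as many parameters as data points.
results = {d: fit_and_score(d, x_train, y_train, x_fresh, y_fresh) for d in (1, 3, 15)}
for degree, (empirical, estimated_true) in results.items():
    print(f"degree {degree:2d}: empirical risk {empirical:.4f}, "
          f"estimated true risk {estimated_true:.4f}")
```

Under these assumptions, degree 1 should show both risks high (underfitting), while degree 15 should drive the empirical risk to nearly zero with a much larger estimated true risk (overfitting).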
Let’s analyze error!
Last time, we discussed classification: a computer program outputs an [object label] or an [object probability]
This time, we’ll focus on regression: a computer program outputs a continuous number or a continuous distribution, e.g., a normal (Gaussian) distribution
All this stuff applies to classification too, it’s just simpler to derive for regression
Let’s analyze error!

Also the same as the mean squared error (MSE) loss!

a bit easier to analyze, but we can analyze other losses too
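
One way to see the MSE connection, sketched under the assumption that the model outputs the mean $f_\theta(x)$ of a Gaussian with identity covariance (the notation is ours, not necessarily the slide's):

```latex
\begin{align*}
p_\theta(y \mid x) &= \mathcal{N}\big(y;\, f_\theta(x),\, I\big) \\
\log p_\theta(y \mid x) &= -\tfrac{1}{2}\,\lVert f_\theta(x) - y \rVert^2 + \text{const} \\
-\sum_{i=1}^{n} \log p_\theta(y_i \mid x_i) &=
  \tfrac{1}{2} \sum_{i=1}^{n} \lVert f_\theta(x_i) - y_i \rVert^2 + \text{const}
\end{align*}
```

Minimizing the negative log-likelihood therefore minimizes the sum of squared errors, up to a constant factor and offset.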

Overfitting: when the empirical risk is low, but the true risk is high
can happen if the dataset is too small
can happen if the model is too powerful (has too many parameters/capacity)

Question: how does the error change for different training sets?
Why is this question important?

Overfitting:
• The training data is fitted well
• The true function is fitted poorly
• The learned function looks different each time!

Underfitting:
• The training data is fitted poorly
• The true function is fitted poorly
• The learned function looks similar, even if we pool together all the datasets!
Let’s analyze error!
What is the expected error, given a distribution over datasets?

expected value of error w.r.t. data distribution

sum over all possible datasets
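
Written out, with $f_{\mathcal{D}}$ denoting the function learned from dataset $\mathcal{D}$ and $f$ the true function (notation assumed for this sketch):

```latex
\mathbb{E}_{\mathcal{D} \sim p(\mathcal{D})}\big[\lVert f_{\mathcal{D}}(x) - f(x) \rVert^2\big]
  = \sum_{\mathcal{D}} p(\mathcal{D})\, \lVert f_{\mathcal{D}}(x) - f(x) \rVert^2
```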
Let’s analyze error!

Why do we care about this quantity?

We want to understand how well our algorithm does independently of the particular (random) choice of dataset

This is very important if we want to improve our algorithm!

[Figure: overfitting vs. underfitting]
Bias-variance tradeoff

Variance: regardless of what the true function is, how much does our prediction change with the dataset?
Bias: this error doesn’t go away no matter how much data we have!
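
The decomposition behind these two terms can be sketched as follows, assuming $\bar{f}(x)$ denotes the prediction averaged over datasets (our notation):

```latex
\begin{align*}
\bar{f}(x) &= \mathbb{E}_{\mathcal{D}}\big[f_{\mathcal{D}}(x)\big] \\
\mathbb{E}_{\mathcal{D}}\big[\lVert f_{\mathcal{D}}(x) - f(x) \rVert^2\big]
  &= \mathbb{E}_{\mathcal{D}}\big[\lVert (f_{\mathcal{D}}(x) - \bar{f}(x)) + (\bar{f}(x) - f(x)) \rVert^2\big] \\
  &= \mathbb{E}_{\mathcal{D}}\big[\lVert f_{\mathcal{D}}(x) - \bar{f}(x) \rVert^2\big]
     + 2\, \mathbb{E}_{\mathcal{D}}\big[f_{\mathcal{D}}(x) - \bar{f}(x)\big]^{\top} \big(\bar{f}(x) - f(x)\big)
     + \lVert \bar{f}(x) - f(x) \rVert^2 \\
  &= \underbrace{\mathbb{E}_{\mathcal{D}}\big[\lVert f_{\mathcal{D}}(x) - \bar{f}(x) \rVert^2\big]}_{\text{Variance}}
     + \underbrace{\lVert \bar{f}(x) - f(x) \rVert^2}_{\text{Bias}^2}
\end{align*}
```

The cross term vanishes because $\mathbb{E}_{\mathcal{D}}[f_{\mathcal{D}}(x) - \bar{f}(x)] = 0$ by the definition of $\bar{f}$.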
Bias-variance tradeoff

If variance is too high, we have too little data/too complex a function class/etc. => this is overfitting

If bias is too high, we have an insufficiently complex function class => this is underfitting
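
A small simulation can make the two regimes concrete. This is a sketch under stated assumptions: the target `true_fn`, the noise level, and the polynomial degrees are made up, and every dataset shares the same inputs and differs only in its label noise (a simplification of sampling whole datasets).

```python
import numpy as np

def true_fn(x):
    # Hypothetical true function, for illustration only.
    return np.sin(np.pi * x)

def bias_variance(degree, n_datasets=200, n_points=20, noise=0.3, seed=0):
    """Estimate (expected error, variance, bias^2) of polynomial least squares."""
    rng = np.random.default_rng(seed)
    x_train = np.linspace(-1.0, 1.0, n_points)  # shared inputs (simplifying assumption)
    x_eval = np.linspace(-1.0, 1.0, 50)
    target = true_fn(x_eval)
    preds = np.empty((n_datasets, x_eval.size))
    for k in range(n_datasets):
        # Each dataset gets fresh label noise, then its own fitted function.
        y_train = true_fn(x_train) + noise * rng.normal(size=n_points)
        preds[k] = np.polyval(np.polyfit(x_train, y_train, degree), x_eval)
    mean_pred = preds.mean(axis=0)  # average prediction over datasets
    variance = float(np.mean((preds - mean_pred) ** 2))
    bias_sq = float(np.mean((mean_pred - target) ** 2))
    error = float(np.mean((preds - target) ** 2))
    return error, variance, bias_sq

for degree in (1, 9):
    error, variance, bias_sq = bias_variance(degree)
    print(f"degree {degree}: error {error:.4f} = variance {variance:.4f} "
          f"+ bias^2 {bias_sq:.4f}")
```

With these choices the linear model should be dominated by bias and the degree 9 model by variance, and in both cases the estimated error equals variance plus squared bias.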

How do we regulate the bias-variance tradeoff?

Regularization
How to regulate bias/variance?
Get more data: addresses variance, has no effect on bias

Change your model class, e.g., from 12th degree polynomials to linear functions

Can we “smoothly” restrict the model class?

Can we construct a “continuous knob” for complexity?

Regularization
Regularization: something we add to the loss function to reduce variance
Bayesian interpretation: could be regarded as a prior on parameters (but this is not the only interpretation!)

High level intuition:

When we have high variance, it’s because the data doesn’t give enough information to identify parameters
If there is not enough information in the data, can we give more information through the loss function?
If we provide enough information to disambiguate between (almost) equally good models, we can pick the best one

what makes this one better?

all of these solutions have zero training error

The Bayesian perspective
Regularization: something we add to the loss function to reduce variance
Bayesian interpretation: could be regarded as a prior on parameters (but this is not the only interpretation!)

what is this part?

we’ve seen this part before!

Can we pick a prior that makes the smoother function more likely?

remember: this is just shorthand for

we choose this bit
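
The annotations on this slide point at terms of a formula that did not survive. A plausible reconstruction, offered as an assumption and not as the slide's exact notation, is Bayes' rule applied to the parameters:

```latex
\begin{align*}
p(\theta \mid \mathcal{D}) &= \frac{p(\mathcal{D} \mid \theta)\, p(\theta)}{p(\mathcal{D})}
  \propto p(\mathcal{D} \mid \theta)\, p(\theta) \\
p(\mathcal{D} \mid \theta) &= \prod_{i=1}^{n} p(x_i)\, p_\theta(y_i \mid x_i) \\
\theta^\star &= \arg\min_\theta \; -\sum_{i=1}^{n} \log p_\theta(y_i \mid x_i) - \log p(\theta)
\end{align*}
```

Under this reading, the likelihood $p(\mathcal{D} \mid \theta)$ is the part seen before, and the prior $p(\theta)$ is the part we get to choose. Its negative log plays the role of the regularizer.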

Example: regularized linear regression
Can we pick a prior that makes the smoother function more likely?

what kind of distribution assigns higher probabilities to small numbers?

this kind of thing typically requires large coefficients
if we only allow small coefficients, best fit might be more like this
“hyperparameter” (but we don’t care, we’ll just select it directly)
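
One common answer to the question above is a zero-mean Gaussian prior: with $p(\theta) = \mathcal{N}(0, \sigma^2 I)$, $-\log p(\theta) = \lVert\theta\rVert^2 / (2\sigma^2) + \text{const}$, so the penalty is $\lambda \lVert\theta\rVert^2$ with $\lambda = 1/(2\sigma^2)$. The sketch below assumes this prior, polynomial features, and made-up data; `ridge_fit` is a hypothetical helper name.

```python
import numpy as np

def ridge_fit(features, targets, lam):
    """Minimize ||features @ theta - targets||^2 + lam * ||theta||^2 in closed form."""
    n_features = features.shape[1]
    gram = features.T @ features + lam * np.eye(n_features)
    return np.linalg.solve(gram, features.T @ targets)

rng = np.random.default_rng(0)
x = np.linspace(-1.0, 1.0, 12)
y = np.sin(np.pi * x) + 0.1 * rng.normal(size=x.size)  # illustrative data
features = np.vander(x, 10, increasing=True)  # columns 1, x, ..., x^9

lams = (1e-6, 1e-2, 1.0)
norms = []
for lam in lams:
    theta = ridge_fit(features, y, lam)
    norms.append(float(np.linalg.norm(theta)))
    print(f"lambda {lam:g}: ||theta|| = {norms[-1]:.3f}")
```

Larger values of `lam` shrink the coefficients, which is the "only allow small coefficients" effect described on the slide. In practice `lam` is selected directly instead of being derived from a prior variance.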

Example: regularized logistic regression
[Figure: “what we wanted” vs. “what we got”]

technically every point is classified correctly

same prior, but now for a classification problem

this is sometimes called weight decay
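
A minimal sketch of why the name fits, assuming plain gradient descent on the mean negative log-likelihood plus `lam * ||theta||^2`. The data, step size, and `lam` are made-up values. The penalty's gradient, `2 * lam * theta`, shrinks ("decays") the weights at every step.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logreg(features, labels, lam, steps=2000, lr=0.1):
    """Gradient descent on mean NLL + lam * ||theta||^2."""
    theta = np.zeros(features.shape[1])
    for _ in range(steps):
        grad_nll = features.T @ (sigmoid(features @ theta) - labels) / len(labels)
        # The 2 * lam * theta term pulls the weights toward zero each step.
        theta = theta - lr * (grad_nll + 2.0 * lam * theta)
    return theta

# Illustrative, well-separated two-class data in 2D.
rng = np.random.default_rng(0)
n_per_class = 20
x_neg = rng.normal(loc=-2.0, scale=0.5, size=(n_per_class, 2))
x_pos = rng.normal(loc=2.0, scale=0.5, size=(n_per_class, 2))
inputs = np.vstack([x_neg, x_pos])
features = np.hstack([inputs, np.ones((2 * n_per_class, 1))])  # append a bias feature
labels = np.concatenate([np.zeros(n_per_class), np.ones(n_per_class)])

theta_plain = train_logreg(features, labels, lam=0.0)
theta_decay = train_logreg(features, labels, lam=0.1)
accuracy_decay = float(np.mean((features @ theta_decay > 0) == (labels == 1)))

print("||theta|| without regularizer:", np.linalg.norm(theta_plain))
print("||theta|| with weight decay:  ", np.linalg.norm(theta_decay))
print("training accuracy with weight decay:", accuracy_decay)
```

On separable data the unregularized weights keep growing as training continues, while the regularized weights settle at a finite norm and still separate the classes.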

Other examples of regularizers (we’ll discuss some of these later):

“L1 regularization”: creates a preference for zeroing out dimensions!
“L2 regularization”

Dropout: a special type of regularizer for neural networks
Gradient penalty: a special type of regularizer for GANs
…lots of other choices
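
For reference, the two norm penalties named above are usually written as follows (standard definitions, with $\lambda$ the regularization strength):

```latex
\begin{align*}
\text{L1:}\quad & \lambda \lVert \theta \rVert_1 = \lambda \sum_i |\theta_i| \\
\text{L2:}\quad & \lambda \lVert \theta \rVert_2^2 = \lambda \sum_i \theta_i^2
\end{align*}
```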
Other perspectives
Regularization: something we add to the loss function to reduce variance

Bayesian perspective: the regularizer is prior knowledge about parameters

Numerical perspective: the regularizer makes underdetermined problems well-determined

Optimization perspective: the regularizer makes the loss landscape easier to search
paradoxically, regularizers can sometimes reduce underfitting if it was due to poor optimization!
especially common with GANs

In machine learning, any “heuristic” term added to the loss that doesn’t depend on data is generally called a regularizer
Regularizers introduce hyperparameters that we have to select in order for them to work well
Training sets and test sets
Some questions…
How do we know if we are overfitting or underfitting?

How do we select which algorithm to use?

How do we select hyperparameters?

One idea: choose whatever makes the loss low

Can’t diagnose overfitting by looking at the training loss!
The machine learning workflow
the dataset

training set: use this for training

validation set: reserve this for…
…selecting hyperparameters
…adding/removing features
…tweaking your model class
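
A sketch of this workflow for one hyperparameter, assuming ridge regression on polynomial features. The data, the 40/20 split, and the candidate values in `lams` are illustrative, and `ridge_fit` and `mse` are hypothetical helper names.

```python
import numpy as np

def ridge_fit(features, targets, lam):
    """Minimize ||features @ theta - targets||^2 + lam * ||theta||^2 in closed form."""
    gram = features.T @ features + lam * np.eye(features.shape[1])
    return np.linalg.solve(gram, features.T @ targets)

def mse(features, targets, theta):
    return float(np.mean((features @ theta - targets) ** 2))

rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=60)
y = np.sin(np.pi * x) + 0.3 * rng.normal(size=x.size)  # illustrative data
features = np.vander(x, 10, increasing=True)

# Shuffle once, then hold out part of the data as the validation set.
perm = rng.permutation(x.size)
train_idx, val_idx = perm[:40], perm[40:]

lams = [1e-6, 1e-4, 1e-2, 1.0, 100.0]
train_losses, val_losses = [], []
for lam in lams:
    theta = ridge_fit(features[train_idx], y[train_idx], lam)  # fit on training set only
    train_losses.append(mse(features[train_idx], y[train_idx], theta))
    val_losses.append(mse(features[val_idx], y[val_idx], theta))

best_lam = lams[int(np.argmin(val_losses))]  # select on validation loss
print("training losses:  ", train_losses)
print("validation losses:", val_losses)
print("selected lambda:  ", best_lam)
```

The training loss alone always favors the weakest regularizer, which is why the selection uses the validation loss.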
The machine learning workflow
the dataset

training set: used to select…

validation set: used to select…
Learning curves
[Figure: two learning curves plotting loss against # of gradient descent steps; one annotation reads “this is the bias!”]

Question: can we stop here?

How do we know when to stop?
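
One common answer is early stopping on the validation loss. The sketch below assumes a linear-in-features model trained by gradient descent. The `patience` rule, the step size, and the data are illustrative choices and not something the slides specify.

```python
import numpy as np

def mse(features, targets, theta):
    return float(np.mean((features @ theta - targets) ** 2))

def train_with_early_stopping(f_tr, y_tr, f_val, y_val,
                              lr=0.1, max_steps=5000, patience=100):
    """Gradient descent on training MSE; keep the parameters with the best validation MSE."""
    theta = np.zeros(f_tr.shape[1])
    best_theta, best_val, best_step = theta.copy(), mse(f_val, y_val, theta), 0
    for step in range(1, max_steps + 1):
        grad = 2.0 * f_tr.T @ (f_tr @ theta - y_tr) / len(y_tr)
        theta = theta - lr * grad
        val = mse(f_val, y_val, theta)
        if val < best_val:
            best_theta, best_val, best_step = theta.copy(), val, step
        elif step - best_step >= patience:
            break  # validation loss has not improved for `patience` steps
    return best_theta, best_val, best_step

rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=60)
y = np.sin(np.pi * x) + 0.3 * rng.normal(size=x.size)  # illustrative data
features = np.vander(x, 10, increasing=True)
f_tr, y_tr = features[:40], y[:40]
f_val, y_val = features[40:], y[40:]

initial_val = mse(f_val, y_val, np.zeros(features.shape[1]))
best_theta, best_val, best_step = train_with_early_stopping(f_tr, y_tr, f_val, y_val)
print(f"validation loss {initial_val:.4f} -> {best_val:.4f} at step {best_step}")
```

Because the stopping point is chosen with the validation set, the number of steps is effectively one more hyperparameter selected on it.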
The final exam
We followed the recipe, now what?
How good is our final classifier?

That’s no good – we already used the validation set to pick hyperparameters!

What if we reserve another set for a final exam (a kind of… validation validation set!)

[Diagram: the dataset, split into training set and validation set]
The machine learning workflow
the dataset

training set: used to select…

validation set: used to select…

test set: used only to report final performance
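
A minimal sketch of the three-way split, assuming a random shuffle and a 70/15/15 division. The fractions and the helper name `split_indices` are illustrative choices, not from the lecture.

```python
import numpy as np

def split_indices(n_examples, val_fraction=0.15, test_fraction=0.15, seed=0):
    """Return disjoint (train, validation, test) index arrays covering all examples."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(n_examples)
    n_test = int(round(test_fraction * n_examples))
    n_val = int(round(val_fraction * n_examples))
    test_idx = perm[:n_test]                # touched once, to report final performance
    val_idx = perm[n_test:n_test + n_val]   # for hyperparameters and model choices
    train_idx = perm[n_test + n_val:]       # for optimization (learning)
    return train_idx, val_idx, test_idx

train_idx, val_idx, test_idx = split_indices(1000)
print(len(train_idx), len(val_idx), len(test_idx))
```

The split is made once, before any training, so that no decision about the model ever sees the test indices.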

Summary and takeaways
➢ Where do errors come from?
▪ Variance: too much capacity, not enough information in the data to find the right parameters
▪ Bias: too little capacity, not enough representational power to represent the true function
▪ Error = Variance + Bias^2
▪ Overfitting = too much variance
▪ Underfitting = too much bias
➢ How can we trade off bias and variance?
▪ Select your model class carefully
▪ Select your features carefully
▪ Regularization: stuff we add to the loss to reduce variance
➢ How do we select hyperparameters?
▪ Training/validation split
▪ Training set is for optimization (learning)
▪ Validation set is for selecting hyperparameters
▪ Test set is for reporting final results and nothing else!
