WiSe 2023/24
Deep Learning 1
Lecture 7: Loss Functions
Outline
Recap: Formulating the learning problem
Loss functions for regression
▶ 0/1 loss, squared loss, absolute loss, logcosh
▶ Incorporating predictive uncertainty
Loss functions for classification
▶ 0/1 loss, perceptron loss, log loss
▶ Extensions to multiple classes
Practical Aspects
▶ Utility-based loss functions
▶ Incorporating data quality
▶ Multiple tasks
1/28
Formulating the Learning Problem
The objective to minimize is often defined as the average over the training
data of a loss function ℓ, measuring for each instance i the discrepancy
between the prediction yi = f(xi, θ) and the ground-truth ti:

E(θ) = (1/N) ∑_{i=1}^{N} ℓ(yi, ti)
Two factors influence the learned model f :
▶ What data is available for training the model (Lectures 5 and 6).
▶ The choice of loss function, e.g. whether larger errors are penalized
more than small errors (today's lecture).
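To make the setup concrete, here is a minimal sketch (my own illustration, not from the lecture) of minimizing the average loss E(θ) by gradient descent, for a one-parameter linear model f(x, θ) = θ·x and the squared loss:

```python
# Toy dataset: targets generated by theta* = 2
xs = [1.0, 2.0, 3.0]
ts = [2.0, 4.0, 6.0]

theta = 0.0
lr = 0.05
for _ in range(200):
    # gradient of E(theta) = (1/N) sum_i (theta*x_i - t_i)^2 w.r.t. theta
    grad = sum(2 * (theta * x - t) * x for x, t in zip(xs, ts)) / len(xs)
    theta -= lr * grad

print(round(theta, 3))  # → 2.0
```

The loop implements exactly the objective above: compute the average per-instance gradient, then step against it.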
2/28
Part 1: Loss Functions for Regression
3/28
Regression Losses
Observations:
▶ In numerous applications, one needs to predict real values (e.g. age of
an organism, expected durability of a component, energy of a physical
system, value of a good, product of a chemical reaction, yield of a
machine, temperature next week, etc.).
▶ For these applications, labels are provided as real-valued targets t ∈ R,
and one needs to choose a loss function that quantifies well the
difference between such a target value and the prediction f (x) ∈ R.
Several considerations for designing ℓ:
▶ What is the cost of making certain types of errors? Are small errors
tolerated? Are big errors more costly?
▶ What is the quality of the ground-truth target values in the dataset?
Are there some outliers?
4/28
The 0/1 Loss
Function to minimize:

ℓ(y, t) = { 0 if −ϵ ≤ y − t ≤ ϵ
          { 1 otherwise

[Figure: ℓ as a function of y − t: zero on the acceptable interval [−ϵ, ϵ], one outside.]
Advantages:
▶ Tolerant to some small task-irrelevant discrepancies (→ does not need
to fit the data exactly) and can therefore accommodate simple,
better-generalizing models.
▶ Not affected by potential outliers in the data (just treat them as
regular errors).
Disadvantage:
▶ The gradient of that loss function is almost always zero → impossible
to optimize via gradient descent.
5/28
The Squared Loss
Function to minimize:

ℓ(y, t) = (y − t)²

[Figure: quadratic loss as a function of y − t; small errors around y − t = 0 incur little cost.]
Advantages:
▶ Tolerant to some small task-irrelevant discrepancies.
▶ Unlike the 0/1 loss, gradients are non-zero almost everywhere. This
makes this loss easy to optimize.
Disadvantage:
▶ Strongly affected by outliers (errors grow quadratically).
6/28
The Absolute Loss
Function to minimize:

ℓ(y, t) = |y − t|

[Figure: absolute loss as a function of y − t.]
Advantages:
▶ Compared to the squared error, less affected by outliers (errors grow
only linearly).
▶ Non-zero gradients → easy to optimize.
Disadvantage:
▶ Unlike the 0/1 loss and the square error, it is not tolerant to small
errors (small errors incur a non-negligible cost).
7/28
The Log-Cosh Loss
Function to minimize:

ℓ(y, t) = (1/β) log cosh(β · (y − t))

with β a positive-valued hyperparameter.

[Figure: log-cosh loss as a function of y − t: quadratic near zero, linear for large errors.]
Advantages:
▶ Tolerant to some small task-irrelevant discrepancies.
▶ Non-zero gradients everywhere (except when the prediction is correct).
This makes this loss easy to optimize.
▶ Only mildly affected by outliers (error grows linearly).
8/28
Regression Losses
Systematic comparison

                          optimizable   outlier-robust   ϵ-tolerant
0/1 loss                       ✗              ✓               ✓
squared loss (y − t)²          ✓              ✗               ✓
absolute loss |y − t|          ✓              ✓               ✗
log-cosh loss                  ✓              ✓               ✓
Note:
▶ Many further loss functions have been proposed in the literature (e.g.
Huber's loss, ϵ-sensitive loss, etc.). They often implement similar
desirable properties as the log-cosh loss.
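The trade-offs in the table can be checked numerically. The following sketch (function names are my own) evaluates the squared, absolute, and log-cosh losses on a small residual and on an outlier-sized residual:

```python
import math

def squared(r):
    return r ** 2

def absolute(r):
    return abs(r)

def logcosh(r, beta=1.0):
    return math.log(math.cosh(beta * r)) / beta

# Small residual: squared and log-cosh are tolerant (near-zero cost),
# the absolute loss already charges |r|.
small = 0.1
print(round(squared(small), 4), round(absolute(small), 4),
      round(logcosh(small), 4))  # → 0.01 0.1 0.005

# Outlier-sized residual: the squared loss explodes quadratically,
# absolute and log-cosh grow only linearly.
big = 10.0
print(round(squared(big), 2), round(absolute(big), 2),
      round(logcosh(big), 2))  # → 100.0 10.0 9.31
```

For large residuals, log cosh(βr)/β ≈ |r| − log(2)/β, which is the linear growth mentioned on the previous slide.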
9/28
Regression Losses: Adding Predictive Uncertainty
Idea:
▶ Let the network output consist of two variables µ, σ , representing the
parameters of some probability distribution modeling the labels t, for
example a normal distribution y ∼ N (µ, σ).
▶ We can then define the log-likelihood function, which we would like to
maximize w.r.t. the parameters of the network:

log p(y = t | µ, σ) = −(t − µ)² / (2σ²) − log(√(2π) σ)
10/28
Regression Losses: Adding Predictive Uncertainty
Objective to maximize:
log p(y = t | µ, σ) = −(t − µ)² / (2σ²) − log(√(2π) σ)
Observation:
▶ The objective has a gradient w.r.t. µ and σ (as long as the scale σ is
positive and not too small). To ensure positivity, one can use a special
activation function to produce σ, e.g. the softplus σ = log(1 + exp(·)).
▶ If we set σ constant (i.e. disconnect it from the rest of the network),
the model reduces to an application of the squared error loss function.
However, if we learn σ, the latter provides us with an indication of
prediction uncertainty.
▶ If we choose different data distributions, we recover different loss
functions (e.g. the Laplace distribution yields the absolute loss, and the
hyperbolic secant distribution yields the log-cosh loss).
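A small sketch of this objective (my own illustration; the softplus for σ follows the slide's suggestion):

```python
import math

def softplus(z):
    # maps any real pre-activation to a positive scale
    return math.log(1.0 + math.exp(z))

def gaussian_nll(t, mu, sigma):
    # negative log-likelihood of t under N(mu, sigma), i.e. the loss
    # corresponding to the log-likelihood above (sign flipped)
    return (t - mu) ** 2 / (2 * sigma ** 2) + math.log(math.sqrt(2 * math.pi) * sigma)

sigma = softplus(0.0)   # ≈ 0.693, guaranteed positive
loss = gaussian_nll(t=1.5, mu=1.0, sigma=sigma)
print(round(loss, 4))

# With sigma held fixed, only the (t - mu)^2 term depends on the network,
# so the objective reduces to the squared error up to scale and constant.
```

In practice, `mu` and the pre-activation fed to `softplus` would both be outputs of the network.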
11/28
Part 2: Loss Functions for Classification
12/28
Classification Losses
Observations:
▶ Classification is perhaps the most common scenario in machine
learning (e.g. detecting whether some tissue is cancerous or not,
determining whether to grant access to some resource,
detecting if some text is positive or negative, etc.)
▶ For these applications, labels are provided as elements of a set,
typically t ∈ {−1, 1} for binary classification or t ∈ {1, 2, . . . , C} for
multi-class classification.
▶ However, the output of the neural network is, as in the regression
case, real-valued. For binary classification, it is typically a real-valued
scalar whose sign gives the class. The classification is then
correct if and only if:

((y > 0) ∧ (t = 1)) ∨ ((y < 0) ∧ (t = −1))

and this can be written more compactly as:

y · t > 0
13/28
0/1 Loss
Function to minimize:

ℓ(y, t) = { 0 if y · t > 0
          { 1 if y · t < 0

[Figure: ℓ as a function of y · t: one for incorrect decisions, zero for correct decisions.]
Properties:
▶ Using the 0/1 loss function is equivalent to minimizing the average
classification error on the training data.
▶ If the training data exactly corresponded to the test distribution,
then the optimization objective would exactly maximize what we are
interested in, i.e. the classification accuracy.
Problem:
▶ The loss function has gradient zero everywhere ⇒ It can't be
optimized via gradient descent.
14/28
Perceptron Loss
Function to minimize:

ℓ(y, t) = { 0    if y · t > 0
          { |y|  if y · t < 0

Note that it can also be formulated more compactly as ℓ(y, t) = max(0, −y · t).

[Figure: ℓ as a function of y · t: zero for correct decisions, growing linearly for incorrect ones.]
Advantage:
▶ The gradient is non-zero for misclassifications and indicates how to adapt
the model to reduce the classification errors.
▶ Remains fairly capable of dealing with misclassified data (like the 0/1
loss), because the error only grows linearly with y.
Disadvantage:
▶ Training stops as soon as training points are on the correct side of the
decision boundary → unlikely to generalize well to new data points
(the 0/1 loss function has the same problem).

[Figure: decision boundary learned with the perceptron loss (iteration 31); the boundary lies close to the training points.]
15/28
Log-Loss
Function to minimize:

ℓ(y, t) = log(1 + exp(−y · t))

[Figure: ℓ as a function of y · t: growing linearly for incorrect decisions, decaying smoothly to zero for confident correct decisions.]

Advantages:
▶ Penalizes points that are correctly classified if the neural network
output is too close to the threshold. This pushes the decision boundary
away from the training data and provides intrinsic regularization
properties.

[Figure: decision boundary learned with the log-loss (iteration 999); the boundary keeps a margin from the training points.]
16/28
Log-Loss
Probabilistic interpretation:
Assuming the following mapping from neural network output y to class
probabilities
p = ( exp(−y) / (1 + exp(−y)), exp(y) / (1 + exp(y)) ),

minimizing the log-loss is equivalent to minimizing the cross-entropy
H(q, p) where q = (1_{t<0}, 1_{t>0}) is a one-hot vector encoding the class.
Proof:

H(q, p) = − ∑_{i=1}^{2} qi log pi
        = −q1 log p1 − q2 log p2
        = −1_{t<0} log( e^{−y} / (1 + e^{−y}) ) − 1_{t>0} log( e^{y} / (1 + e^{y}) )
        = −log( e^{yt} / (1 + e^{yt}) )
        = −log( 1 / (1 + e^{−yt}) )
        = log(1 + e^{−yt})
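The derivation can also be checked numerically; the following script (my own verification, not part of the lecture) compares the two sides for a few values of y and t:

```python
import math

def log_loss(y, t):
    return math.log(1.0 + math.exp(-y * t))

def cross_entropy(y, t):
    # sigmoid-mapped class probabilities, as defined above
    p_neg = math.exp(-y) / (1.0 + math.exp(-y))   # p(class -1)
    p_pos = math.exp(y) / (1.0 + math.exp(y))     # p(class +1)
    q = (1.0, 0.0) if t < 0 else (0.0, 1.0)       # one-hot encoding of t
    return -(q[0] * math.log(p_neg) + q[1] * math.log(p_pos))

for y in (-2.0, -0.5, 0.3, 1.7):
    for t in (-1, 1):
        assert abs(log_loss(y, t) - cross_entropy(y, t)) < 1e-12
print("log-loss matches cross-entropy")
```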
17/28
Classification Losses
Systematic comparison

                                   optimizable   mislabeling-robust   builds margin
0/1 loss                                ✗                ✓                  ✗
perceptron loss max(0, −yt)             ✓                ✓                  ✗
log loss log(1 + exp(−yt))              ✓                ✓                  ✓
18/28
Handling Multiple Classes
Blueprint:
▶ Build a neural network with as many outputs
as there are classes, call them y1 , . . . , yC .
▶ Classify as k = arg max[y1 , . . . , yC ].
Observation:
▶ The 0/1 loss function can then be straightforwardly generalized to the
multi-class case as:

        t=1   t=2   ...   t=C
k=1      0     1    ...    1
k=2      1     0    ...    1
...     ...   ...   ...   ...
k=C      1     1    ...    0

▶ However, this generalization of the 0/1 loss suffers from the same
problems as the original 0/1 loss, that is, the difficulty to optimize it,
and the fact that it does not promote margins between the
data/predictions and the decision boundary.
19/28
Handling Multiple Classes
Generalizing the log-loss to multiple classes:
▶ Let y1, . . . , yC be the C outputs of our network. Mapping these scores
to a probability vector via the softmax function

pi = exp(yi) / ∑_{j=1}^{C} exp(yj)

and constructing a one-hot encoding q of the class label t, we define
the loss function as the cross-entropy H(q, p), i.e.

ℓ(y, t) = H(q, p) = − ∑_{i=1}^{C} qi log pi
                  = − log pt
                  = log ∑_{j=1}^{C} exp(yj) − yt

which can be interpreted as the difference between the evidence found
by the neural network for all classes and the evidence found by the
neural network for the target class.
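The last equality can be verified numerically; this small script (my own check; class indices start at 0 here) compares the cross-entropy with the log-sum-exp form:

```python
import math

def softmax(ys):
    z = sum(math.exp(y) for y in ys)
    return [math.exp(y) / z for y in ys]

def cross_entropy(ys, t):
    # -log p_t with p given by the softmax of the network outputs
    return -math.log(softmax(ys)[t])

def logsumexp_form(ys, t):
    # log sum_j exp(y_j) - y_t, as derived above
    return math.log(sum(math.exp(y) for y in ys)) - ys[t]

ys = [2.0, -1.0, 0.5]
for t in range(3):
    assert abs(cross_entropy(ys, t) - logsumexp_form(ys, t)) < 1e-12
print("cross-entropy = logsumexp(y) - y_t")
```

(For numerical stability with large outputs, practical implementations subtract max(y) before exponentiating; omitted here for clarity.)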
20/28
Part 3: Practical Aspects
21/28
Practical Aspect 1: Non-Uniform Misclassification Costs
Example: medical diagnosis.
▶ Assume one type of error is much more costly than another, e.g. missing
the detection of a disease.

                        Actual
Predicted       No infection   Infection
No infection         0           10000
Infection          2000             0
Approach for the 0/1 loss:
▶ To reflect this cost structure, the 0/1 loss can be straightforwardly
enhanced by replacing the 1s in the loss function by the actual costs.
▶ Minimizing the loss function is then equivalent to minimizing the
expected cost (or maximizing utility).
Approach for other losses:
▶ When the loss has a probabilistic interpretation (e.g. log-loss), one can
treat the predicted probabilities p(y = k) as `ground-truth' and estimate
the expected cost for class i as ∑_{k=1}^{C} cost(choose i | k) · p(y = k).
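A sketch of this decision rule, using the cost values from the table above (the decision minimizes expected cost, equivalently maximizing utility):

```python
# cost[i][k] = cost of choosing class i when the true class is k;
# class 0 = "no infection", class 1 = "infection" (values from the table)
cost = [[0, 10000],
        [2000, 0]]

def expected_cost_decision(p):
    """p[k] = predicted probability of class k; returns the cheapest class."""
    exp_costs = [sum(cost[i][k] * p[k] for k in range(len(p)))
                 for i in range(len(cost))]
    return min(range(len(exp_costs)), key=exp_costs.__getitem__)

# Even with only 30% predicted probability of infection, the asymmetric
# costs make "infection" the lower-cost decision (3000 vs. 1400).
print(expected_cost_decision([0.7, 0.3]))  # → 1
```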
22/28
Practical Aspect 2: Labels of Varying Quality
[Figure: dataset with low-quality and high-quality labels.]
Examples:
▶ Non-expert vs. expert labeler, outcome of a physics simulation
with/without approximations, noisy/clean measurement of an
experimental outcome.
Idea:
▶ In the presence of two similar instances with diverging labels, focus on
the high-quality one. Low-quality labels remain useful in regions with
scarce data.
23/28
Practical Aspect 2: Labels of Varying Quality
Idea (cont.):
▶ Use a different loss function for different data points, e.g. associate to
instance i the loss function:

ℓi(y, t) = Ci · ℓ(y, t)

where Ci is a multiplicative factor set large if i is a high-quality data
point and small if i is low-quality.
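A minimal sketch of this weighting scheme (the quality factors Ci and the data values are illustrative, not from the lecture):

```python
def squared(y, t):
    return (y - t) ** 2

def weighted_objective(preds, targets, weights):
    # average of C_i * l(y_i, t_i) over the dataset
    return sum(c * squared(y, t)
               for y, t, c in zip(preds, targets, weights)) / len(preds)

preds   = [1.0, 2.0, 3.0]
targets = [1.5, 2.0, 0.0]    # last label is assumed noisy
weights = [1.0, 1.0, 0.1]    # down-weight the low-quality label

print(round(weighted_objective(preds, targets, weights), 4))
```

With the down-weighting, the noisy third instance contributes 0.9 instead of 9 to the sum, so it no longer dominates the objective.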
24/28
Practical Aspect 3: Multiple Tasks
In practice, we may want the same neural network to perform several tasks
simultaneously, e.g. multiple binary classication tasks, or some additional
regression tasks.
Example: (New J. Phys. 15 095003, 2013)
Denoting by t = (t1, . . . , tL) the vector of targets for the L different tasks,
and building a neural network with the corresponding number of outputs
y = (y1, . . . , yL), we can define the loss function

ℓ(y, t) = ∑_{j=1}^{L} ℓj(yj, tj)

where ℓj is the loss function chosen for solving task j.
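A sketch of such a combined objective (my own example, pairing a squared loss for a regression task with a log-loss for a binary classification task):

```python
import math

def squared(y, t):
    return (y - t) ** 2

def log_loss(y, t):
    return math.log(1.0 + math.exp(-y * t))

def multi_task_loss(ys, ts, losses):
    # sum of per-task losses l_j(y_j, t_j), as in the formula above
    return sum(l(y, t) for l, y, t in zip(losses, ys, ts))

ys = [2.3, 0.8]    # network outputs for the two tasks
ts = [2.0, 1]      # regression target, binary class label
print(round(multi_task_loss(ys, ts, [squared, log_loss]), 4))  # → 0.4611
```

In practice, the per-task losses may need rescaling so that no single task dominates the gradient.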
25/28
Practical Aspect 3: Multiple Tasks
Remark 1:
▶ When the different tasks are regression tasks (with similar scale and
weighting), and when applying the squared loss or the absolute loss to
these different tasks, the multi-task loss takes the respective forms:

E(y, t) = ∑_{l=1}^{L} (yl − tl)² = ∥y − t∥²

E(y, t) = ∑_{l=1}^{L} |yl − tl| = ∥y − t∥₁.
Remark 2:
▶ We distinguish between multi-class classification and multiple binary
classification tasks. For example, in image recognition, there are
typically multiple objects in one image, and one often prefers to
indicate for each object its presence or absence rather than to
associate a single class to the image.
26/28
Summary
27/28
Summary
▶ Lectures 5 and 6 have highlighted that the actual data on which we train
the model plays an important role. In Lecture 7, we have demonstrated
that an equally important role is played by the way we specify the
errors of the model through particular choices of a loss function ℓ.
▶ Many loss functions exist for tasks such as regression, binary
classification, multi-class classification, multi-task learning, etc.
▶ Loss functions must be designed by taking multiple aspects into
account, such as the ability to account for mislabelings, the ability to
tolerate some noise, and the ability to support efficient optimization.
▶ Loss functions can be defined flexibly to address practical aspects such
as the presence of asymmetric misclassification costs, subsets of the
data with different quality, or the presence of multiple subtasks.
28/28