WiSe 2023/24
Deep Learning 1
Lecture 7: Loss Functions
Outline
Recap: Formulating the learning problem
Loss functions for regression
▶ 0/1 loss, squared loss, absolute loss, logcosh
▶ Incorporating predictive uncertainty
Loss functions for classification
▶ 0/1 loss, perceptron loss, log loss
▶ Extensions to multiple classes
Practical Aspects
▶ Utility-based loss functions
▶ Incorporating data quality
▶ Multiple tasks
1/28
Formulating the Learning Problem
The objective to minimize is often defined as the average over the training
data of a loss function ℓ, measuring for each instance i the discrepancy
between the prediction yi = f(xi, θ) and the ground-truth ti:

E(θ) = (1/N) ∑_{i=1}^{N} ℓ(yi, ti)
Two factors influence the learned model f :
▶ What data is available for training the model (Lectures 5 and 6).
▶ The choice of loss function, e.g. whether larger errors are penalized
more than small errors (today's lecture).
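To make the setup concrete, here is a minimal sketch (my own illustration, not from the lecture) of minimizing the average loss E(θ) by gradient descent, for a one-parameter linear model f(x, θ) = θ·x and the squared loss:

```python
# Toy dataset: targets generated by theta* = 2
xs = [1.0, 2.0, 3.0]
ts = [2.0, 4.0, 6.0]

theta = 0.0
lr = 0.05
for _ in range(200):
    # gradient of E(theta) = (1/N) sum_i (theta*x_i - t_i)^2 w.r.t. theta
    grad = sum(2 * (theta * x - t) * x for x, t in zip(xs, ts)) / len(xs)
    theta -= lr * grad

print(round(theta, 3))  # → 2.0
```

The loop implements exactly the objective above: compute the average per-instance gradient, then step against it.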
2/28
Part 1: Loss Functions for Regression
3/28
Regression Losses
Observations:
▶ In numerous applications, one needs to predict real values (e.g. age of
an organism, expected durability of a component, energy of a physical
system, value of a good, product of a chemical reaction, yield of a
machine, temperature next week, etc.).
▶ For these applications, labels are provided as real-valued targets t ∈ R,
and one needs to choose a loss function that quantifies well the
difference between such a target value and the prediction f (x) ∈ R.
Several considerations for designing ℓ:
▶ What is the cost of making certain types of errors? Are small errors
tolerated? Are big errors more costly?
▶ What is the quality of the ground-truth target values in the dataset?
Are there some outliers?
4/28
The 0/1 Loss
Function to minimize:

ℓ(y, t) = { 0 if −ϵ ≤ y − t ≤ ϵ
          { 1 otherwise

[Figure: ℓ as a function of y − t: zero on the acceptable interval [−ϵ, ϵ], one outside.]
Advantages:
▶ Tolerant to some small task-irrelevant discrepancies (→ does not need
to fit the data exactly) and can therefore accommodate simple,
better-generalizing models.
▶ Not affected by potential outliers in the data (just treat them as
regular errors).
Disadvantage:
▶ The gradient of that loss function is almost always zero → impossible
to optimize via gradient descent.
5/28
The Squared Loss
Function to minimize:

ℓ(y, t) = (y − t)²

[Figure: quadratic loss as a function of y − t; small errors around y − t = 0 incur little cost.]
Advantages:
▶ Tolerant to some small task-irrelevant discrepancies.
▶ Unlike the 0/1 loss, gradients are non-zero almost everywhere. This
makes this loss easy to optimize.
Disadvantage:
▶ Strongly affected by outliers (errors grow quadratically).
6/28
The Absolute Loss
Function to minimize:

ℓ(y, t) = |y − t|

[Figure: absolute loss as a function of y − t.]
Advantages:
▶ Compared to the squared error, less affected by outliers (errors grow
only linearly).
▶ Non-zero gradients → easy to optimize.
Disadvantage:
▶ Unlike the 0/1 loss and the square error, it is not tolerant to small
errors (small errors incur a non-negligible cost).
7/28
The Log-Cosh Loss
Function to minimize:

ℓ(y, t) = (1/β) log cosh(β · (y − t))

with β a positive-valued hyperparameter.

[Figure: log-cosh loss as a function of y − t: quadratic near zero, linear for large errors.]
Advantages:
▶ Tolerant to some small task-irrelevant discrepancies.
▶ Non-zero gradients everywhere (except when the prediction is correct).
This makes this loss easy to optimize.
▶ Only mildly affected by outliers (error grows linearly).
8/28
Regression Losses
Systematic comparison

                          optimizable   outlier-robust   ϵ-tolerant
0/1 loss                       ✗              ✓               ✓
squared loss (y − t)²          ✓              ✗               ✓
absolute loss |y − t|          ✓              ✓               ✗
log-cosh loss                  ✓              ✓               ✓
Note:
▶ Many further loss functions have been proposed in the literature (e.g.
Huber's loss, ϵ-sensitive loss, etc.). They often implement similar
desirable properties as the log-cosh loss.
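The trade-offs in the table can be checked numerically. The following sketch (function names are my own) evaluates the squared, absolute, and log-cosh losses on a small residual and on an outlier-sized residual:

```python
import math

def squared(r):
    return r ** 2

def absolute(r):
    return abs(r)

def logcosh(r, beta=1.0):
    return math.log(math.cosh(beta * r)) / beta

# Small residual: squared and log-cosh are tolerant (near-zero cost),
# the absolute loss already charges |r|.
small = 0.1
print(round(squared(small), 4), round(absolute(small), 4),
      round(logcosh(small), 4))  # → 0.01 0.1 0.005

# Outlier-sized residual: the squared loss explodes quadratically,
# absolute and log-cosh grow only linearly.
big = 10.0
print(round(squared(big), 2), round(absolute(big), 2),
      round(logcosh(big), 2))  # → 100.0 10.0 9.31
```

For large residuals, log cosh(βr)/β ≈ |r| − log(2)/β, which is the linear growth mentioned on the previous slide.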
9/28
Regression Losses: Adding Predictive Uncertainty
Idea:
▶ Let the network output consist of two variables µ, σ , representing the
parameters of some probability distribution modeling the labels t, for
example a normal distribution y ∼ N (µ, σ).
▶ We can then define the log-likelihood function, which we would like to
maximize w.r.t. the parameters of the network:

log p(y = t | µ, σ) = −(t − µ)² / (2σ²) − log(√(2π) σ)
10/28
Regression Losses: Adding Predictive Uncertainty
Objective to maximize:
log p(y = t | µ, σ) = −(t − µ)² / (2σ²) − log(√(2π) σ)
Observation:
▶ The objective has a gradient w.r.t. µ and σ (as long as the scale σ is
positive and not too small). To ensure positivity, one can use a special
activation function to produce σ, e.g. the softplus σ = log(1 + exp(·)).
▶ If we set σ constant (i.e. disconnect it from the rest of the network),
the model reduces to an application of the squared error loss function.
However, if we learn σ, the latter provides us with an indication of
prediction uncertainty.
▶ If we choose different data distributions, we recover different loss
functions (e.g. the Laplace distribution yields the absolute loss, and the
hyperbolic secant distribution yields the log-cosh loss).
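A small sketch of this objective (my own illustration; the softplus for σ follows the slide's suggestion):

```python
import math

def softplus(z):
    # maps any real pre-activation to a positive scale
    return math.log(1.0 + math.exp(z))

def gaussian_nll(t, mu, sigma):
    # negative log-likelihood of t under N(mu, sigma), i.e. the loss
    # corresponding to the log-likelihood above (sign flipped)
    return (t - mu) ** 2 / (2 * sigma ** 2) + math.log(math.sqrt(2 * math.pi) * sigma)

sigma = softplus(0.0)   # ≈ 0.693, guaranteed positive
loss = gaussian_nll(t=1.5, mu=1.0, sigma=sigma)
print(round(loss, 4))

# With sigma held fixed, only the (t - mu)^2 term depends on the network,
# so the objective reduces to the squared error up to scale and constant.
```

In practice, `mu` and the pre-activation fed to `softplus` would both be outputs of the network.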
11/28
Part 2: Loss Functions for Classification
12/28
Classification Losses
Observations:
▶ Classification is perhaps the most common scenario in machine
learning (e.g. detecting whether some tissue is cancerous or not,
determining whether to grant access to some resource,
detecting if some text is positive or negative, etc.)
▶ For these applications, labels are provided as elements of a set,
typically t ∈ {−1, 1} for binary classification or t ∈ {1, 2, . . . , C} for
multi-class classification.
▶ However, the output of the neural network is, as in the regression
case, real-valued. For binary classification, it is typically a real-valued
scalar whose sign gives the class. The classification is then
correct if and only if:

((y > 0) ∧ (t = 1)) ∨ ((y < 0) ∧ (t = −1))

and this can be written more compactly as:

y · t > 0
13/28
0/1 Loss
Function to minimize:

ℓ(y, t) = { 0 if y · t > 0
          { 1 if y · t < 0

[Figure: ℓ as a function of y · t: one for incorrect decisions, zero for correct decisions.]
Properties:
▶ Using the 0/1 loss function is equivalent to minimizing the average
classification error on the training data.
▶ If the training data exactly corresponded to the test distribution,
then the optimization objective would exactly maximize what we are
interested in, i.e. the classification accuracy.
Problem:
▶ The loss function has gradient zero everywhere ⇒ It can't be
optimized via gradient descent.
14/28
Perceptron Loss
Function to minimize:

ℓ(y, t) = { 0    if y · t > 0
          { |y|  if y · t < 0

Note that it can also be formulated more compactly as ℓ(y, t) = max(0, −y · t).

[Figure: ℓ as a function of y · t: zero for correct decisions, growing linearly for incorrect ones.]
Advantage:
▶ The gradient is non-zero for misclassifications and indicates how to adapt
the model to reduce the classification errors.
▶ Remains fairly capable of dealing with misclassified data (like the 0/1
loss), because the error only grows linearly with y.
Disadvantage:
▶ Training stops as soon as training points are on the correct side of the
decision boundary → unlikely to generalize well to new data points
(the 0/1 loss function has the same problem).

[Figure: decision boundary learned with the perceptron loss (iteration 31); the boundary lies close to the training points.]
15/28
Log-Loss
Function to minimize:

ℓ(y, t) = log(1 + exp(−y · t))

[Figure: ℓ as a function of y · t: growing linearly for incorrect decisions, decaying smoothly to zero for confident correct decisions.]

Advantages:
▶ Penalizes points that are correctly classified if the neural network
output is too close to the threshold. This pushes the decision boundary
away from the training data and provides intrinsic regularization
properties.

[Figure: decision boundary learned with the log-loss (iteration 999); the boundary keeps a margin from the training points.]
16/28
Log-Loss
Probabilistic interpretation:
Assuming the following mapping from neural network output y to class
probabilities
p = ( exp(−y) / (1 + exp(−y)), exp(y) / (1 + exp(y)) ),

minimizing the log-loss is equivalent to minimizing the cross-entropy
H(q, p) where q = (1_{t<0}, 1_{t>0}) is a one-hot vector encoding the class.
Proof:

H(q, p) = − ∑_{i=1}^{2} qi log pi
        = −q1 log p1 − q2 log p2
        = −1_{t<0} log( e^{−y} / (1 + e^{−y}) ) − 1_{t>0} log( e^{y} / (1 + e^{y}) )
        = −log( e^{yt} / (1 + e^{yt}) )
        = −log( 1 / (1 + e^{−yt}) )
        = log(1 + e^{−yt})
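The derivation can also be checked numerically; the following script (my own verification, not part of the lecture) compares the two sides for a few values of y and t:

```python
import math

def log_loss(y, t):
    return math.log(1.0 + math.exp(-y * t))

def cross_entropy(y, t):
    # sigmoid-mapped class probabilities, as defined above
    p_neg = math.exp(-y) / (1.0 + math.exp(-y))   # p(class -1)
    p_pos = math.exp(y) / (1.0 + math.exp(y))     # p(class +1)
    q = (1.0, 0.0) if t < 0 else (0.0, 1.0)       # one-hot encoding of t
    return -(q[0] * math.log(p_neg) + q[1] * math.log(p_pos))

for y in (-2.0, -0.5, 0.3, 1.7):
    for t in (-1, 1):
        assert abs(log_loss(y, t) - cross_entropy(y, t)) < 1e-12
print("log-loss matches cross-entropy")
```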
17/28
Classification Losses
Systematic comparison

                                   optimizable   mislabeling-robust   builds margin
0/1 loss                                ✗                ✓                  ✗
perceptron loss max(0, −yt)             ✓                ✓                  ✗
log loss log(1 + exp(−yt))              ✓                ✓                  ✓
18/28
Handling Multiple Classes
Blueprint:
▶ Build a neural network with as many outputs
as there are classes, call them y1 , . . . , yC .
▶ Classify as k = arg max[y1 , . . . , yC ].
Observation:
▶ The 0/1 loss function can then be straightforwardly generalized to the
multi-class case as:

        t=1   t=2   ...   t=C
k=1      0     1    ...    1
k=2      1     0    ...    1
...     ...   ...   ...   ...
k=C      1     1    ...    0

▶ However, this generalization of the 0/1 loss suffers from the same
problems as the original 0/1 loss, that is, the difficulty to optimize it,
and the fact that it does not promote margins between the
data/predictions and the decision boundary.
19/28
Handling Multiple Classes
Generalizing the log-loss to multiple classes:
▶ Let y1, . . . , yC be the C outputs of our network. Mapping these scores
to a probability vector via the softmax function

pi = exp(yi) / ∑_{j=1}^{C} exp(yj)

and constructing a one-hot encoding q of the class label t, we define
the loss function as the cross-entropy H(q, p), i.e.

ℓ(y, t) = H(q, p) = − ∑_{i=1}^{C} qi log pi
                  = − log pt
                  = log ∑_{j=1}^{C} exp(yj) − yt

which can be interpreted as the difference between the evidence found
by the neural network for all classes and the evidence found by the
neural network for the target class.
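The last equality can be verified numerically; this small script (my own check; class indices start at 0 here) compares the cross-entropy with the log-sum-exp form:

```python
import math

def softmax(ys):
    z = sum(math.exp(y) for y in ys)
    return [math.exp(y) / z for y in ys]

def cross_entropy(ys, t):
    # -log p_t with p given by the softmax of the network outputs
    return -math.log(softmax(ys)[t])

def logsumexp_form(ys, t):
    # log sum_j exp(y_j) - y_t, as derived above
    return math.log(sum(math.exp(y) for y in ys)) - ys[t]

ys = [2.0, -1.0, 0.5]
for t in range(3):
    assert abs(cross_entropy(ys, t) - logsumexp_form(ys, t)) < 1e-12
print("cross-entropy = logsumexp(y) - y_t")
```

(For numerical stability with large outputs, practical implementations subtract max(y) before exponentiating; omitted here for clarity.)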
20/28
Part 3: Practical Aspects
21/28
Practical Aspect 1: Non-Uniform Misclassification Costs
Example: medical diagnosis.
▶ Assume one type of error is much more costly than another, e.g. missing
the detection of a disease.

                        Actual
Predicted       No infection   Infection
No infection         0           10000
Infection          2000             0
Approach for the 0/1 loss:
▶ To reflect this cost structure, the 0/1 loss can be straightforwardly
enhanced by replacing the 1s in the loss function by the actual costs.
▶ Minimizing the loss function is then equivalent to minimizing the
expected cost (or maximizing utility).
Approach for other losses:
▶ When the loss has a probabilistic interpretation (e.g. log-loss), one can
treat the predicted probabilities p(y = k) as `ground-truth' and estimate
the expected cost for class i as ∑_{k=1}^{C} cost(choose i | k) · p(y = k).
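A sketch of this decision rule, using the cost values from the table above (the decision minimizes expected cost, equivalently maximizing utility):

```python
# cost[i][k] = cost of choosing class i when the true class is k;
# class 0 = "no infection", class 1 = "infection" (values from the table)
cost = [[0, 10000],
        [2000, 0]]

def expected_cost_decision(p):
    """p[k] = predicted probability of class k; returns the cheapest class."""
    exp_costs = [sum(cost[i][k] * p[k] for k in range(len(p)))
                 for i in range(len(cost))]
    return min(range(len(exp_costs)), key=exp_costs.__getitem__)

# Even with only 30% predicted probability of infection, the asymmetric
# costs make "infection" the lower-cost decision (3000 vs. 1400).
print(expected_cost_decision([0.7, 0.3]))  # → 1
```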
22/28
Practical Aspect 2: Labels of Varying Quality
[Figure: dataset with low-quality and high-quality labels.]
Examples:
▶ Non-expert vs. expert labeler, outcome of a physics simulation
with/without approximations, noisy/clean measurement of an
experimental outcome.
Idea:
▶ In the presence of two similar instances with diverging labels, focus on
the high-quality one. Low-quality labels remain useful in regions with
scarce data.
23/28
Practical Aspect 2: Labels of Varying Quality
Idea (cont.):
▶ Use a different loss function for different data points, e.g. associate to
instance i the loss function:

ℓi(y, t) = Ci · ℓ(y, t)

where Ci is a multiplicative factor set large if i is a high-quality data
point and small if i is low-quality.
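A minimal sketch of this weighting scheme (the quality factors Ci and the data values are illustrative, not from the lecture):

```python
def squared(y, t):
    return (y - t) ** 2

def weighted_objective(preds, targets, weights):
    # average of C_i * l(y_i, t_i) over the dataset
    return sum(c * squared(y, t)
               for y, t, c in zip(preds, targets, weights)) / len(preds)

preds   = [1.0, 2.0, 3.0]
targets = [1.5, 2.0, 0.0]    # last label is assumed noisy
weights = [1.0, 1.0, 0.1]    # down-weight the low-quality label

print(round(weighted_objective(preds, targets, weights), 4))
```

With the down-weighting, the noisy third instance contributes 0.9 instead of 9 to the sum, so it no longer dominates the objective.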
24/28
Practical Aspect 3: Multiple Tasks
In practice, we may want the same neural network to perform several tasks
simultaneously, e.g. multiple binary classication tasks, or some additional
regression tasks.
Example: (New J. Phys. 15 095003, 2013)
Denoting by t = (t1, . . . , tL) the vector of targets for the L different tasks,
and building a neural network with the corresponding number of outputs
y = (y1, . . . , yL), we can define the loss function

ℓ(y, t) = ∑_{j=1}^{L} ℓj(yj, tj)

where ℓj is the loss function chosen for solving task j.
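A sketch of such a combined objective (my own example, pairing a squared loss for a regression task with a log-loss for a binary classification task):

```python
import math

def squared(y, t):
    return (y - t) ** 2

def log_loss(y, t):
    return math.log(1.0 + math.exp(-y * t))

def multi_task_loss(ys, ts, losses):
    # sum of per-task losses l_j(y_j, t_j), as in the formula above
    return sum(l(y, t) for l, y, t in zip(losses, ys, ts))

ys = [2.3, 0.8]    # network outputs for the two tasks
ts = [2.0, 1]      # regression target, binary class label
print(round(multi_task_loss(ys, ts, [squared, log_loss]), 4))  # → 0.4611
```

In practice, the per-task losses may need rescaling so that no single task dominates the gradient.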
25/28
Practical Aspect 3: Multiple Tasks
Remark 1:
▶ When the different tasks are regression tasks (with similar scale and
weighting), and when applying the squared loss or the absolute loss to
these different tasks, the multi-task loss takes the respective forms:

E(y, t) = ∑_{l=1}^{L} (yl − tl)² = ∥y − t∥²

E(y, t) = ∑_{l=1}^{L} |yl − tl| = ∥y − t∥₁.
Remark 2:
▶ We distinguish between multi-class classification and multiple binary
classification tasks. For example, in image recognition, there are
typically multiple objects in one image, and one often prefers to
indicate for each object its presence or absence rather than to
associate a single class to the image.
26/28
Summary
27/28
Summary
▶ Lectures 5 and 6 have highlighted that the actual data on which we train
the model plays an important role. In Lecture 7, we have demonstrated
that an equally important role is played by the way we specify the
errors of the model through particular choices of a loss function ℓ.
▶ Many loss functions exist for tasks such as regression, binary
classification, multi-class classification, multi-task learning, etc.
▶ Loss functions must be designed by taking multiple aspects into
account, such as the ability to account for mislabelings, the ability to
tolerate some noise, and the ability to support efficient optimization.
▶ Loss functions can be defined flexibly to address practical aspects such
as the presence of asymmetric misclassification costs, subsets of the
data with different quality, or the presence of multiple subtasks.
28/28