Artificial Intelligence 2
CI 2 - S3
Pr. Hamza Alami
Academic year: 2024/2025
Outline
1. Recap
2. Backpropagation algorithm
3. Loss functions
Recap
• What is the difference between AI, ML, and DL?
• What is a perceptron, MLP, universal approximator?
• What is the difference between Heaviside and sigmoid functions?
• What is the algorithm used to optimize NNs?
Outline
1. Recap
2. Backpropagation algorithm
3. Loss functions
Backpropagation Algorithm
• Let's consider the following network:

[Figure: a chain of single-neuron layers. The input X feeds the first layer; each layer l has a weight w^(l), a bias b^(l), and a sigmoid activation, computing z^(l) = w^(l) a^(l-1) + b^(l) and a^(l) = σ(z^(l)); the last activation a^(L) is the prediction ŷ fed to the loss function.]

$\mathbb{L} = \text{loss\_function} = \frac{1}{2}\sum (\hat{y} - \bar{y})^2, \qquad \hat{y} = a^{(L)}$
Backpropagation Algorithm
• We have the following:

$z^{(l)} = w^{(l)} a^{(l-1)} + b^{(l)}$
$a^{(l)} = \sigma\left(z^{(l)}\right)$
$\mathbb{L} = \text{loss\_function} = \frac{1}{2}\sum (\hat{y} - \bar{y})^2, \qquad \hat{y} = a^{(L)}$
$\delta^{(l)} = \frac{\partial \mathbb{L}}{\partial z^{(l)}}$

• And:

$\frac{\partial \mathbb{L}}{\partial w^{(l)}} = \frac{\partial \mathbb{L}}{\partial z^{(l)}} \frac{\partial z^{(l)}}{\partial w^{(l)}} = \delta^{(l)} \frac{\partial z^{(l)}}{\partial w^{(l)}} = \delta^{(l)} a^{(l-1)}$

• And:

$\frac{\partial \mathbb{L}}{\partial b^{(l)}} = \frac{\partial \mathbb{L}}{\partial z^{(l)}} \frac{\partial z^{(l)}}{\partial b^{(l)}} = \delta^{(l)} \frac{\partial z^{(l)}}{\partial b^{(l)}} = \delta^{(l)}$
Backpropagation Algorithm
• In the case of the last layer:

$\frac{\partial \mathbb{L}}{\partial w^{(L)}} = \frac{\partial \mathbb{L}}{\partial z^{(L)}} \frac{\partial z^{(L)}}{\partial w^{(L)}} = \delta^{(L)} \frac{\partial z^{(L)}}{\partial w^{(L)}} = \delta^{(L)} a^{(L-1)}$

$\delta^{(L)} = \frac{\partial \mathbb{L}}{\partial a^{(L)}} \frac{\partial a^{(L)}}{\partial z^{(L)}} = (\hat{y} - \bar{y})\, a^{(L)} \left(1 - a^{(L)}\right)$

$\frac{\partial \mathbb{L}}{\partial b^{(L)}} = \frac{\partial \mathbb{L}}{\partial z^{(L)}} \frac{\partial z^{(L)}}{\partial b^{(L)}} = \delta^{(L)}$

with $\mathbb{L} = \text{loss\_function} = \frac{1}{2}\sum (\hat{y} - \bar{y})^2$ and $\hat{y} = a^{(L)}$.
Backpropagation Algorithm
• In the case of an arbitrary layer:

$\frac{\partial \mathbb{L}}{\partial w^{(l)}} = \frac{\partial \mathbb{L}}{\partial z^{(l)}} \frac{\partial z^{(l)}}{\partial w^{(l)}} = \delta^{(l)} \frac{\partial z^{(l)}}{\partial w^{(l)}} = \delta^{(l)} a^{(l-1)}$

$\delta^{(l)} = \frac{\partial \mathbb{L}}{\partial z^{(l+1)}} \frac{\partial z^{(l+1)}}{\partial a^{(l)}} \frac{\partial a^{(l)}}{\partial z^{(l)}} = \delta^{(l+1)} w^{(l+1)}\, a^{(l)} \left(1 - a^{(l)}\right)$

$\frac{\partial \mathbb{L}}{\partial b^{(l)}} = \frac{\partial \mathbb{L}}{\partial z^{(l)}} \frac{\partial z^{(l)}}{\partial b^{(l)}} = \delta^{(l)}$

with $\mathbb{L} = \text{loss\_function} = \frac{1}{2}\sum (\hat{y} - \bar{y})^2$ and $\hat{y} = a^{(L)}$.
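To make the recurrences concrete, here is a minimal NumPy sketch for a chain of single-neuron layers, following the formulas above (sigmoid activations, squared-error loss). The depth, the weight and bias values, and the sample (x, ȳ) are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Chain network with one neuron per layer, as in the slides:
# z^(l) = w^(l) a^(l-1) + b^(l),  a^(l) = sigmoid(z^(l)),  L = 1/2 (y_hat - y_bar)^2
w = [0.5, -0.3, 0.8]          # illustrative weights w^(0), w^(1), w^(2)
b = [0.1, 0.2, -0.1]          # illustrative biases
x, y_bar = 1.5, 1.0           # one training sample

# Forward pass: store activations for reuse in the backward pass
a = [x]
for wl, bl in zip(w, b):
    a.append(sigmoid(wl * a[-1] + bl))
y_hat = a[-1]

# Backward pass using the slide recurrences
L = len(w)
delta = [0.0] * L
delta[L - 1] = (y_hat - y_bar) * a[L] * (1 - a[L])            # delta^(L)
for l in range(L - 2, -1, -1):
    delta[l] = delta[l + 1] * w[l + 1] * a[l + 1] * (1 - a[l + 1])

grad_w = [delta[l] * a[l] for l in range(L)]                   # dL/dw^(l) = delta^(l) a^(l-1)
grad_b = delta[:]                                              # dL/db^(l) = delta^(l)
print(grad_w, grad_b)
```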
Backpropagation Algorithm
• Now let's consider arbitrary inputs and hidden layers:

$z^{(l)} = W^{(l)} a^{(l-1)} + b^{(l)}$
$a^{(l)} = \sigma\left(z^{(l)}\right)$
$\mathbb{L} = \text{loss\_function} = \frac{1}{2}\left\lVert a^{(L)} - \bar{y}\right\rVert^2$

• Now that a layer has multiple neurons, we have one auxiliary variable per neuron:

$\delta_j^{(l)} = \frac{\partial \mathbb{L}}{\partial z_j^{(l)}}, \qquad \forall j \in \{0, \dots, \text{number of neurons in layer } l\}$
Backpropagation Algorithm
• Auxiliary variable in the case of the last layer:

$\delta_j^{(L)} = \frac{\partial \mathbb{L}}{\partial a_j^{(L)}} \frac{\partial a_j^{(L)}}{\partial z_j^{(L)}} = \left(a_j^{(L)} - \bar{y}_j\right) a_j^{(L)} \left(1 - a_j^{(L)}\right)$

• In vector form:

$\delta^{(L)} = \left(a^{(L)} - \bar{y}\right) \odot a^{(L)} \odot \left(1 - a^{(L)}\right)$

with $\mathbb{L} = \text{loss\_function} = \frac{1}{2}\left\lVert a^{(L)} - \bar{y}\right\rVert^2$.
Backpropagation Algorithm
• Auxiliary variable in the case of an arbitrary layer (each $a_j^{(l)}$ affects all pre-activations $z_0^{(l+1)}, z_1^{(l+1)}, z_2^{(l+1)}, \dots$ of the next layer):

$\delta_j^{(l)} = \frac{\partial \mathbb{L}}{\partial z_j^{(l)}} = \sum_k \frac{\partial \mathbb{L}}{\partial z_k^{(l+1)}} \frac{\partial z_k^{(l+1)}}{\partial a_j^{(l)}} \frac{\partial a_j^{(l)}}{\partial z_j^{(l)}} = \sum_k \delta_k^{(l+1)} w_{kj}^{(l+1)}\, a_j^{(l)} \left(1 - a_j^{(l)}\right)$

• In vector form:

$\delta^{(l)} = \left(\left(W^{(l+1)}\right)^T \delta^{(l+1)}\right) \odot a^{(l)} \odot \left(1 - a^{(l)}\right)$

with $\mathbb{L} = \text{loss\_function} = \frac{1}{2}\left\lVert a^{(L)} - \bar{y}\right\rVert^2$.
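The vectorized equations translate directly into code. The sketch below assumes a small fully connected network with sigmoid activations and the squared-error loss from the slides; the layer sizes, random initialization, and sample values are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
sizes = [2, 2, 1]                      # illustrative: 2 inputs, one hidden layer of 2, 1 output
W = [rng.standard_normal((sizes[l + 1], sizes[l])) for l in range(len(sizes) - 1)]
b = [np.zeros(sizes[l + 1]) for l in range(len(sizes) - 1)]

x = np.array([1.5, -0.5])
y_bar = np.array([1.0])

# Forward pass: z^(l) = W^(l) a^(l-1) + b^(l),  a^(l) = sigmoid(z^(l))
a = [x]
for Wl, bl in zip(W, b):
    a.append(sigmoid(Wl @ a[-1] + bl))

# Backward pass using the vectorized recurrences
L = len(W)
delta = [None] * L
delta[L - 1] = (a[L] - y_bar) * a[L] * (1 - a[L])                  # delta^(L)
for l in range(L - 2, -1, -1):
    delta[l] = (W[l + 1].T @ delta[l + 1]) * a[l + 1] * (1 - a[l + 1])

grad_W = [np.outer(delta[l], a[l]) for l in range(L)]              # dL/dW^(l) = delta^(l) a^(l-1)^T
grad_b = delta                                                     # dL/db^(l) = delta^(l)
```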
Backpropagation Algorithm
• Let's consider the following network:

[Figure: a network with two inputs X1 and X2, one hidden layer of two sigmoid neurons (weights $w_{00}^{(0)}, w_{01}^{(0)}, w_{10}^{(0)}, w_{11}^{(0)}$, biases $b_0^{(0)}, b_1^{(0)}$, outputs $a_0^{(0)}, a_1^{(0)}$), and one sigmoid output neuron (weights $w_{00}^{(1)}, w_{10}^{(1)}$, bias $b_0^{(1)}$, output $a_0^{(1)}$) fed to the loss function.]

$\mathbb{L} = \text{loss\_function} = \frac{1}{2}\sum (\hat{y} - \bar{y})^2, \qquad \hat{y} = a_0^{(1)}$
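For a network like the 2-2-1 example above, the backpropagated gradients can be sanity-checked against finite differences of the loss. The sketch below does this for one hidden-layer weight; the helper function name and the chosen values are assumptions for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(W, b, x, y_bar):
    # Forward pass through the 2-2-1 network of the figure
    a = x
    for Wl, bl in zip(W, b):
        a = sigmoid(Wl @ a + bl)
    return 0.5 * np.sum((a - y_bar) ** 2)

rng = np.random.default_rng(1)
W = [rng.standard_normal((2, 2)), rng.standard_normal((1, 2))]
b = [np.zeros(2), np.zeros(1)]
x, y_bar = np.array([0.5, -1.0]), np.array([1.0])

# Backprop gradient of the hidden-layer weight w_00^(0)
a0 = sigmoid(W[0] @ x + b[0])
a1 = sigmoid(W[1] @ a0 + b[1])
delta1 = (a1 - y_bar) * a1 * (1 - a1)
delta0 = (W[1].T @ delta1) * a0 * (1 - a0)
grad_backprop = delta0[0] * x[0]                 # dL/dw_00^(0) = delta_0^(0) * x_0

# Finite-difference estimate of the same gradient
eps = 1e-6
W_plus = [W[0].copy(), W[1].copy()]
W_plus[0][0, 0] += eps
grad_numeric = (loss(W_plus, b, x, y_bar) - loss(W, b, x, y_bar)) / eps

print(grad_backprop, grad_numeric)               # the two values should be nearly equal
```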
Outline
1. Recap
2. Backpropagation algorithm
3. Loss functions
So far so good
• Neural networks (NNs) are essentially function approximators
• In the case of supervised learning, NNs model the decision boundaries
between classes
• Training NNs is an iterative process which aims to find the weights that
correctly predict the training outputs
• The loss function measures the deviation of the NN's outputs from the
desired outputs
Loss function
• The loss function can be viewed as a surface in a high dimensional space
• A loss function can be described by the equation:
$\mathbb{L}(w, b) = \sum_{i=0}^{n-1} \mathbb{L}\left( f\left(x^{(i)} \mid w, b\right), \bar{y}^{(i)} \right) = \sum_{i=0}^{n-1} \mathbb{L}\left( y^{(i)}, \bar{y}^{(i)} \mid w, b \right)$
Loss function
[Figure: visualization of a loss surface]
Loss function and global minima
• The global minimum of the loss function on the training data is not
actually what provides the best generalization to the test data
• A study suggests that global minima are in practice irrelevant, as they often
lead to overfitting
“Choromanska, Anna, et al. "The loss surfaces of multilayer
networks." Artificial intelligence and statistics. PMLR, 2015.”
Regression loss
• Used to solve regression problems
• The squared L2 norm of the difference between the model predictions and the
expected outputs (MSE)
• Related regression losses: RMSE, MAE
$\mathbb{L} = \frac{1}{2}\left\lVert a^{(L)} - \bar{y}\right\rVert^2$
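A minimal NumPy sketch of this regression loss, also returning its gradient with respect to the network output (the term that seeds backpropagation); the example arrays are illustrative assumptions.

```python
import numpy as np

def mse_loss(a_L, y_bar):
    """Half squared L2 norm between predictions a_L and targets y_bar, and its gradient."""
    diff = a_L - y_bar
    loss = 0.5 * np.sum(diff ** 2)
    grad = diff                      # dL/da_L = a_L - y_bar
    return loss, grad

a_L = np.array([0.8, 0.1])           # illustrative network outputs
y_bar = np.array([1.0, 0.0])         # illustrative targets
print(mse_loss(a_L, y_bar))
```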
Cross entropy loss
[Figure: a NN outputs predicted class probabilities, compared with the expected ones]
    Predicted        Expected
    p(cat)           1
    p(dog)           0
    p(airplane)      0
    p(automobile)    0

How do we effectively and quantitatively estimate the deviation of the
predicted classes from the expected classes?
Cross entropy loss
[Figure: predicted probability distributions compared with the expected classes, for good predictions and for bad predictions]
Cross entropy loss
$\text{cross\_entropy} = -\sum_i p_{\text{expected}}(i)\, \log\left(p_{\text{predicted}}(i)\right)$

• If $p_{\text{expected}}(i)$ is close to 1 and $p_{\text{predicted}}(i)$ is close to 1, the CE is close to 0
• If $p_{\text{expected}}(i)$ is close to 1 and $p_{\text{predicted}}(i)$ is close to 0, then $\log\left(p_{\text{predicted}}(i)\right)$ will be close to $-\infty$ and the CE will be very high
• If $p_{\text{expected}}(i)$ is close to 0, then it will not contribute to the CE
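A minimal NumPy sketch of this cross entropy; the probability vectors are illustrative assumptions, and a small epsilon is added only to avoid log(0).

```python
import numpy as np

def cross_entropy(p_expected, p_predicted, eps=1e-12):
    """CE = -sum_i p_expected(i) * log(p_predicted(i))."""
    return -np.sum(p_expected * np.log(p_predicted + eps))

p_expected = np.array([1.0, 0.0, 0.0, 0.0])      # one-hot: the true class is "cat"
good = np.array([0.9, 0.05, 0.03, 0.02])         # confident and correct -> small CE
bad = np.array([0.01, 0.9, 0.05, 0.04])          # confident and wrong   -> large CE
print(cross_entropy(p_expected, good))           # ~0.105
print(cross_entropy(p_expected, bad))            # ~4.6
```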
Binary Cross entropy loss
• When the number of classes is 2, considering $p_0$ the predicted
probability of class 0, then $p_1 = 1 - p_0$
• Thus the binary cross entropy is defined as:

$\text{BCE} = -p_{\text{expected}}(0)\, \log(p_0) - \left(1 - p_{\text{expected}}(0)\right) \log(1 - p_0)$

• Writing $y$ for the predicted probability and $\bar{y}$ for the expected one, the BCE is minimized when the prediction matches the target:

$-\frac{\partial \mathbb{L}}{\partial y} = \bar{y}\,\frac{1}{y} - (1 - \bar{y})\,\frac{1}{1 - y} = \frac{\bar{y}}{y} - \frac{1 - \bar{y}}{1 - y} = 0 \implies y = \bar{y}$

• Note that the minimum value is not necessarily 0, e.g. $-0.25\log(0.25) - 0.75\log(0.75) = 0.56 \neq 0$
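A short numerical check, under the same notation, that the BCE is minimized at y = ȳ and that its minimum value can be non-zero (0.56 for ȳ = 0.25, as on the slide); the grid of candidate predictions is an illustrative assumption.

```python
import numpy as np

def bce(y, y_bar, eps=1e-12):
    """Binary cross entropy between prediction y and target probability y_bar."""
    return -y_bar * np.log(y + eps) - (1 - y_bar) * np.log(1 - y + eps)

y_bar = 0.25
y_grid = np.linspace(0.001, 0.999, 999)          # candidate predictions
losses = bce(y_grid, y_bar)
best = y_grid[np.argmin(losses)]
print(best, losses.min())                        # minimum near y = 0.25, value ~0.56
```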
Softmax cross entropy loss
• Considering the previous example of classes cat, dog, airplane, and
automobile, a score vector may be [9, 10, 0.1, −3]
• The scores are unbounded: they can be any real number
• NNs behave better when the loss function involves a bounded set of
numbers in the same range
• The Softmax converts unbounded scores into probabilities
Softmax cross entropy loss
• Given the scores vector $s = [s_0, s_1, \dots, s_{N-1}]$, the corresponding Softmax vector is:

$\text{softmax}(s) = \left[ \frac{e^{s_0}}{\sum_k e^{s_k}}, \frac{e^{s_1}}{\sum_k e^{s_k}}, \dots, \frac{e^{s_{N-1}}}{\sum_k e^{s_k}} \right]$

• The sum of the Softmax vector is 1
• An element of the Softmax vector represents the probability of a class
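A minimal NumPy sketch of the Softmax; subtracting the maximum score before exponentiating is a standard numerical-stability trick added here (it does not change the result, since the Softmax is shift-invariant).

```python
import numpy as np

def softmax(s):
    """Convert a vector of unbounded scores into probabilities that sum to 1."""
    shifted = s - np.max(s)          # stability trick: softmax is shift-invariant
    e = np.exp(shifted)
    return e / np.sum(e)

scores = np.array([9.0, 10.0, 0.1, -3.0])   # cat, dog, airplane, automobile (slide example)
probs = softmax(scores)
print(probs, probs.sum())                   # probabilities summing to 1
```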
Softmax cross entropy loss
• Why Softmax?
• The Softmax is a smooth (differentiable) approximation of the one-hot argmax (argmax_onehot):

$p = [9.99, 10] \implies \text{argmax}_{\text{onehot}}(p) = [0, 1]$
$q = [10, 9.99] \implies \text{argmax}_{\text{onehot}}(q) = [1, 0]$   (far from each other)

$\text{softmax}(p) = [0.4975, 0.5025]$
$\text{softmax}(q) = [0.5025, 0.4975]$   (close to each other)
Softmax cross entropy loss
• The Softmax outputs probabilities, so it is used in the last layer
• The Softmax probabilities are then used to compute the cross entropy loss
• Various deep learning libraries combine the Softmax and the CE loss in one
operation
• That combination tends to be more numerically stable
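A minimal sketch of such a combined operation for a one-hot target, using the log-sum-exp form so that log(softmax) is never computed in two unstable steps; the function name and score values are assumptions, not a specific library API.

```python
import numpy as np

def softmax_cross_entropy(scores, target_index):
    """CE of softmax(scores) against a one-hot target, computed in a stable way."""
    shifted = scores - np.max(scores)                 # shift for numerical stability
    log_sum_exp = np.log(np.sum(np.exp(shifted)))
    # CE = -log softmax(scores)[target] = log_sum_exp - shifted[target]
    return log_sum_exp - shifted[target_index]

scores = np.array([9.0, 10.0, 0.1, -3.0])
print(softmax_cross_entropy(scores, target_index=0))  # true class "cat"
```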
Focal loss
• Not all training data are equally important, and we need to use the available
data wisely
• The idea behind the focal loss is to focus more on the examples on which the
model is not doing well
• The focal loss can achieve better performance in the case of imbalanced
data
Focal loss
• Let's consider the binary CE:

$\mathbb{L}(y, \bar{y}) = \begin{cases} -\log(y) & \text{if the expected class is } 1 \\ -\log(1 - y) & \text{if the expected class is } 0 \end{cases}$

• The focal loss is defined as:

$\mathbb{L}(y, \bar{y}) = \begin{cases} -(1 - y)^{\gamma} \log(y) & \text{if the expected class is } 1 \\ -y^{\gamma} \log(1 - y) & \text{if the expected class is } 0 \end{cases}$
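A minimal NumPy sketch of this binary focal loss; the value of γ and the example predictions are illustrative assumptions.

```python
import numpy as np

def focal_loss(y, expected_class, gamma=2.0, eps=1e-12):
    """Binary focal loss: down-weights examples that are already well classified."""
    if expected_class == 1:
        return -((1 - y) ** gamma) * np.log(y + eps)
    return -(y ** gamma) * np.log(1 - y + eps)

# A well-classified example contributes much less than with plain BCE,
# while a badly classified one keeps a large loss.
print(focal_loss(0.9, expected_class=1))   # small: (0.1)^2 * -log(0.9)
print(focal_loss(0.1, expected_class=1))   # large: (0.9)^2 * -log(0.1)
```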
Hinge loss
• A hinge loss function increases if the goodness criterion is not satisfied
and becomes 0 if the criterion is satisfied

"If you are not my friend, the distance between
us can vary from small to large.
But I don't distinguish between friends. All my
friends are at a distance 0 from me."

[Image: a hinged door]
Hinge loss
• The hinge loss is defined by the equation:

$\mathbb{L} = \sum_{j=0,\, j \neq c}^{N-1} \max\left(0,\, y_j - y_c + m\right)$

where $y_j$ is the score of an incorrect class, $y_c$ is the score of the correct class, and $m$ is the margin.

• In the case of a good output: $y_j + m < y_c \implies \max(0,\, y_j - y_c + m) = 0$
• In the case of a bad output: $y_j + m > y_c \implies \max(0,\, y_j - y_c + m) = y_j - y_c + m$
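A minimal NumPy sketch of this hinge loss; the margin value and the score vector (reusing the earlier cat/dog/airplane/automobile example) are illustrative assumptions.

```python
import numpy as np

def hinge_loss(scores, correct_class, margin=1.0):
    """Sum of max(0, y_j - y_c + m) over all incorrect classes j."""
    y_c = scores[correct_class]
    margins = np.maximum(0.0, scores - y_c + margin)
    margins[correct_class] = 0.0          # the correct class does not contribute
    return np.sum(margins)

scores = np.array([9.0, 10.0, 0.1, -3.0])   # cat, dog, airplane, automobile
print(hinge_loss(scores, correct_class=0))  # correct class "cat": only "dog" violates the margin
```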