Machine Learning
Basics
Di He
Supervised learning
• Data are drawn from a distribution 𝑃(𝑋, 𝑌)
• 𝑋: an image, a sentence, a piece of audio, …
• 𝑌: positive/negative, category, …
• A stack of neural network layers maps 𝑋 to 𝑌
• 𝑓∗(𝑥): the ground truth mapping; 𝑓𝜃(𝑥): the approximated mapping
• The difficulty: obtain a good 𝜃 from finite samples
Generative modeling (unsupervised)
• 𝑃(𝑋): the distribution of images, sentences, audio, …
• From samples {𝑥𝑖}, learn an approximation 𝑃𝜃(𝑋)
• Use 𝑃𝜃(𝑋) to generate images, sentences, audio, …
• The difficulty: going from the finite samples {𝑥𝑖} to 𝑃𝜃(𝑋)
Simple examples
• Ground truth: 𝑋 ~ Normal(0, 1)
• Samples 𝑥𝑖: 0.156, −1.237, 0.894, 1.502, −0.671, −0.201, 0.327, 1.101, −0.942, −0.555
• Hypothesis space: 𝑋 ~ Normal(𝜃, 1)
• How to estimate 𝜃?
  • Which 𝜃 can generate {𝑥𝑖} with the highest probability: 𝑃𝜃({𝑥𝑖}) = Π𝑖=1:𝑁 𝑃𝜃(𝑥𝑖)
  • Known as the maximum likelihood method: 𝜃 = argmax𝜃 𝑃𝜃({𝑥𝑖}) = (∑𝑖 𝑥𝑖)/𝑁 = 0.037 (verified in the sketch below)
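A minimal numpy sketch (not from the slides; the sample values are copied from above) that checks the closed-form answer against a brute-force scan over candidate 𝜃:

```python
import numpy as np

# Samples from the slide, drawn from the (unknown) ground truth Normal(0, 1).
x = np.array([0.156, -1.237, 0.894, 1.502, -0.671,
              -0.201, 0.327, 1.101, -0.942, -0.555])

def log_likelihood(theta, x):
    # log P_theta({x_i}) for the hypothesis X ~ Normal(theta, 1),
    # dropping the constant -N/2 * log(2*pi), which does not affect the argmax.
    return -0.5 * np.sum((x - theta) ** 2)

# Closed form: for Normal(theta, 1), the maximizer is the sample mean.
theta_mle = x.mean()                       # ~0.037, as on the slide

# Numerical check: scan a grid of candidate thetas.
grid = np.linspace(-1.0, 1.0, 2001)
theta_grid = grid[np.argmax([log_likelihood(t, x) for t in grid])]

print(theta_mle, theta_grid)               # both close to 0.037
```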
Simple examples: what is not simple
• The ground truth 𝑋 ~ Normal(0, 1) is what we will never know
• The hypothesis space 𝑋 ~ Normal(𝜃, 1) is what has to be highly complex in realistic problems
• We seek computational methods instead of writing down analytical solutions
Arthur Samuel
• In 1952, Arthur Samuel developed a program playing checkers.
• The program was able to observe positions and learn an implicit model that gives better moves for later cases.
• With that program, Samuel claimed that machines can go beyond their written code and learn patterns like human beings.
• Samuel coined the term “machine learning” in 1959.
Basic Machine Learning Concepts
• The goal: to learn a model from experience/data
• Training data {𝑥𝑖}𝑖=1:𝑛 ∈ 𝑋ⁿ
• Model 𝑓𝜃: noise (𝑅^𝑑) → 𝑋
• Test/inference/prediction
  • Sample a noise vector 𝜖 (or a set of them)
  • Output 𝑓𝜃(𝜖)
• Training: empirical loss minimization
  min𝜃 ∑𝑖=1:𝑛 𝑙(𝑥𝑖, 𝜃)
• Negative log likelihood: 𝑙(𝑥𝑖, 𝜃) = −log 𝑃(𝑥𝑖; 𝜃) (sketched below)
• Other surrogate losses will be introduced during the course
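As a small illustration tying these bullets together, here is a numpy sketch of the negative log likelihood for the toy Gaussian model from the earlier example; the model and the comparison values are assumptions for illustration only:

```python
import numpy as np

# Empirical loss minimization with the negative log likelihood,
# for the toy model X ~ Normal(theta, 1).
x = np.array([0.156, -1.237, 0.894, 1.502, -0.671,
              -0.201, 0.327, 1.101, -0.942, -0.555])

def nll(x_i, theta):
    # l(x_i, theta) = -log P(x_i; theta) for a unit-variance Gaussian
    return 0.5 * np.log(2 * np.pi) + 0.5 * (x_i - theta) ** 2

def empirical_loss(theta, x):
    # sum_i l(x_i, theta), the objective minimized during training
    return np.sum(nll(x, theta))

# The MLE from the earlier slide should beat any other candidate theta.
print(empirical_loss(0.037, x) < empirical_loss(1.0, x))  # True
```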
What does 𝑓𝜃 look like?
Frank Rosenblatt
• In 1957, Frank Rosenblatt designed the first neural network for computers (the perceptron), which simulates the thought processes of the human brain.
Marvin Minsky
• In 1969, Minsky posed the famous XOR problem and showed the inability of the perceptron to handle such linearly inseparable data distributions.
• It was Minsky's attack on the NN community; thereafter, NN research lay dormant until the 1980s.
The perceptron is too simple; more complicated models are needed to handle complex problems…
Paul Werbos
• Paul Werbos suggested using the Multi-Layer Perceptron (MLP) in 1981 and proposed the Backpropagation (BP) algorithm for training neural networks. This new architecture solved the XOR challenge.
• Following Werbos’ new ideas, neural network researchers successively presented different MLP architectures and a number of BP variants for effective training.
Geoffrey Hinton, Yann LeCun, Jürgen Schmidhuber
• Geoffrey Hinton contributed a lot to practical backpropagation algorithms (1986) and Boltzmann Machines (1983).
• Yann LeCun was the first to train a convolutional neural network on images of handwritten digits (1989).
• Jürgen Schmidhuber invented a new type of recurrent neural network called Long Short-Term Memory, or LSTM (1997).
Neural networks are black boxes, and therefore difficult to interpret…
Neural networks are data-hungry. When there is only a small amount of training data, they will overfit…
Ross Quinlan
• Decision trees were proposed by Ross Quinlan, more specifically the ID3 algorithm.
• ID3 found more real-life use cases thanks to its simple rules and clear inference.
• After ID3, many alternatives and improvements have been explored by the community (e.g., ID4, Regression Trees, CART, …), and it is still one of the active topics in ML.
Decision Trees
• ID3 algorithm (sketched in code below)
  • Take all unused attributes and compute their entropy with respect to the training samples
  • Choose the attribute for which entropy is minimum (or, equivalently, information gain is maximum)
  • Make a node containing that attribute
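A toy Python sketch of the attribute choice only (not a full ID3 implementation); the two-attribute data set is hypothetical:

```python
import math
from collections import Counter

def entropy(labels):
    # Shannon entropy of a list of class labels.
    counts = Counter(labels)
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def split_entropy(rows, labels, attr):
    # Weighted entropy of the labels after splitting on one attribute;
    # minimizing this maximizes information gain.
    n = len(rows)
    total = 0.0
    for value in {row[attr] for row in rows}:
        subset = [lab for row, lab in zip(rows, labels) if row[attr] == value]
        total += len(subset) / n * entropy(subset)
    return total

# Hypothetical toy data: two binary attributes, one binary label.
rows = [{"a": 0, "b": 0}, {"a": 0, "b": 1}, {"a": 1, "b": 0}, {"a": 1, "b": 1}]
labels = ["no", "no", "yes", "yes"]

best = min(["a", "b"], key=lambda attr: split_entropy(rows, labels, attr))
print(best)  # "a": splitting on it makes the label subsets pure
```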
Vladimir Vapnik
• The Support Vector Machine (SVM) was proposed by Vapnik and Cortes in 1995, with very strong theoretical standing and empirical results.
• SVM took the lead on many tasks previously occupied by NN models. In addition, SVM was able to exploit the profound knowledge of convex optimization, margin-based generalization theory, and kernels, giving it an edge over NN models.
• The ML community split into two camps: NN advocates and SVM advocates.
Support Vector Machines
• Basic idea
  • The decision boundary should be as far away from the data of both classes as possible
  • We should maximize the margin 𝑚
• SVM can be solved efficiently in its dual form, whose solution relies only on the so-called support vectors.
• SVM can be kernelized to handle non-separable cases (see the sketch below).
(Figure: two classes, Class 1 and Class 2, separated by a decision boundary with margin 𝑚.)
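A minimal sketch using scikit-learn (assumed available; not part of the slides). The support_vectors_ attribute exposes the points that alone determine the boundary:

```python
import numpy as np
from sklearn.svm import SVC

# Two linearly separable classes (toy data).
X = np.array([[0.0, 0.0], [0.0, 1.0], [3.0, 3.0], [3.0, 4.0]])
y = np.array([0, 0, 1, 1])

clf = SVC(kernel="linear", C=1e6)   # large C approximates a hard margin
clf.fit(X, y)
print(clf.support_vectors_)          # only the margin-defining points
# A nonlinear kernel, e.g. SVC(kernel="rbf"), handles non-separable data.
```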
Revival of Neural Networks (Deep Learning)
A Neuron
• No hidden layer, a single output unit
• 𝑜(𝑥) = max{∑𝑖 𝑤𝑖𝑥𝑖, 0}
Stack of layers
• 𝑊: weight matrix; 𝑏: bias vector
• 𝑦𝑖 = 𝑓(∑𝑗 𝑊𝑖𝑗𝑥𝑗 + 𝑏𝑖) = 𝑓(𝑊𝑖ᵀ𝑥 + 𝑏𝑖)
Stack of layers
• ℎ⁰ = 𝑓(𝑤⁰𝑥 + 𝑏⁰), ℎ¹ = 𝑓(𝑤¹ℎ⁰ + 𝑏¹), ℎ² = 𝑓(𝑤²ℎ¹ + 𝑏²), 𝑦 = 𝑓(𝑤³ℎ² + 𝑏³) (sketched below)
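A minimal numpy sketch of the formulas above, a single ReLU neuron and a stack of layers; the layer sizes are illustrative assumptions, not from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(z):                      # f(z) = max(z, 0), elementwise
    return np.maximum(z, 0.0)

def neuron(w, x):                 # o(x) = max{sum_i w_i x_i, 0}
    return relu(w @ x)

def layer(W, b, h):               # y = f(W h + b)
    return relu(W @ h + b)

x = rng.normal(size=8)
print(neuron(rng.normal(size=8), x))        # a single output unit

# Stack of layers: h0, h1, h2, then the output y.
sizes = [8, 16, 16, 16, 1]
params = [(0.1 * rng.normal(size=(m, n)), np.zeros(m))
          for n, m in zip(sizes[:-1], sizes[1:])]
h = x
for W, b in params:
    h = layer(W, b, h)
print(h.shape)                              # (1,)
```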
Universal Approximation Theorem
• A feed-forward network with a single hidden layer containing a finite number of neurons can approximate continuous functions on compact subsets of 𝑅ⁿ, under mild assumptions on the activation function.
Activations
• Sigmoid: 𝜎(𝑥) = 1/(1 + 𝑒^(−𝑥))
• Tanh: tanh(𝑥) = 2𝜎(2𝑥) − 1 = (𝑒^𝑥 − 𝑒^(−𝑥))/(𝑒^𝑥 + 𝑒^(−𝑥)) (both sketched below)
• Not used in common scenarios such as image and language processing
• Still popularly used in some specific scenarios such as Neural ODE/PDE
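A small sketch of both activations, checking the identity tanh(𝑥) = 2𝜎(2𝑥) − 1 numerically:

```python
import numpy as np

def sigmoid(x):
    # sigma(x) = 1 / (1 + e^{-x})
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    # tanh(x) = 2 * sigmoid(2x) - 1, as on the slide
    return 2.0 * sigmoid(2.0 * x) - 1.0

x = np.linspace(-4, 4, 9)
print(np.allclose(tanh(x), np.tanh(x)))  # True
```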
Activations
• Rectified linear units: 𝑓(𝑧) = max(0, 𝑧) = max(0, 𝑊ᵀ𝑥 + 𝑏)
• General form: 𝑓(𝑧) = max(0, 𝑧) + 𝛼 min(0, 𝑧) (the whole family is sketched below)
  • Absolute value rectification: 𝛼 = −1, so 𝑓(𝑧) = |𝑧|
  • Leaky ReLU: fixes 𝛼 to a small value like 0.01
  • Parametric ReLU: learns 𝛼
• Why ReLU fails in Neural ODE/PDE
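A sketch of the whole family through the shared form 𝑓(𝑧) = max(0, 𝑧) + 𝛼 min(0, 𝑧):

```python
import numpy as np

def rectifier(z, alpha=0.0):
    # f(z) = max(0, z) + alpha * min(0, z)
    return np.maximum(0.0, z) + alpha * np.minimum(0.0, z)

z = np.array([-2.0, -0.5, 0.0, 1.5])
relu       = rectifier(z)               # alpha = 0
abs_rect   = rectifier(z, alpha=-1.0)   # |z|
leaky_relu = rectifier(z, alpha=0.01)   # small fixed negative slope
print(abs_rect)                          # [2.  0.5 0.  1.5]
# A parametric ReLU would treat alpha as a learned parameter.
```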
Implementation
How to find good 𝜃?
Loss function
• The loss function (more accurately, the data loss) measures the goodness of 𝜃, e.g., how likely the model is to generate the data it sees during training:
  min𝜃 ∑𝑖=1:𝑛 𝑙(𝑥𝑖, 𝜃)
• The way we estimate the value (weights) of 𝜃 is called optimization
Loss surface
• Imagine a “high-dimensional curve” where
  • the x-axis is the value(s) of 𝜃
  • the y-axis is the loss at 𝜃
• This curve is called the “loss surface”
• The goal is to find the “basin”
Optimization is another course
• Key concepts
  • Conditions for optimality
  • Convergence
  • Convergence rate
  • Duality
  • …
• What deep learning people care about
  • Convex optimization problems
  • Non-convex optimization problems
  • Minimax problems
The standard optimizer in deep learning
• Intuition of gradient descent
  • Randomly put a ball on the loss surface.
  • The ball moves in the direction that reduces the loss fastest.
• Direction: −𝛻𝐿(𝜃)
• Update: 𝜃 ← 𝜃 − 𝜖𝛻𝐿(𝜃) (sketched below)
https://towardsdatascience.com/a-visual-explanation-of-gradient-descent-methods-momentum-adagrad-rmsprop-adam-f898b102325c
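A runnable sketch of the update rule on the toy Gaussian NLL from earlier, where 𝛻𝐿(𝜃) = ∑𝑖(𝜃 − 𝑥𝑖); the starting point and learning rate are assumptions:

```python
import numpy as np

# L(theta) = sum_i 0.5 * (x_i - theta)^2, so grad L(theta) = sum_i (theta - x_i).
x = np.array([0.156, -1.237, 0.894, 1.502, -0.671,
              -0.201, 0.327, 1.101, -0.942, -0.555])

def grad_L(theta):
    return np.sum(theta - x)

theta, lr = -1.0, 0.05               # random start, learning rate (epsilon)
for _ in range(100):
    theta = theta - lr * grad_L(theta)   # theta <- theta - eps * grad L(theta)
print(theta)                         # converges to the sample mean, ~0.037
```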
From GD to Stochastic GD
• Disadvantage of GD: huge cost of computing even a single gradient in large-scale machine learning problems.
• GD: sweeps through the training set, computes the gradient, and performs one update: 𝜃 ← 𝜃 − 𝜖𝛻𝐿(𝜃)
• SGD: sweeps through the training set and performs an update for each training example: 𝜃 ← 𝜃 − 𝜖𝑘𝛻𝐿𝑘(𝜃)
• A sufficient condition on the learning rates to ensure convergence:
  ∑𝑘=1:∞ 𝜖𝑘 = ∞ and ∑𝑘=1:∞ 𝜖𝑘² < ∞
• Step decay: decay the learning rate by 0.5 every 5 epochs, by 0.1 every 20 epochs, or based on validation error
• 1/t decay: 𝜖𝑘 = 𝜖0/(1 + 𝑘𝑡)
• Linear decay to zero (all three schedules are sketched below)
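Sketches of the three schedules; 𝜖0 and the decay constants are assumed values:

```python
eps0 = 0.1  # assumed initial learning rate

def step_decay(epoch):
    # Halve every 5 epochs (one of the variants mentioned above).
    return eps0 * (0.5 ** (epoch // 5))

def inv_t_decay(k, decay=0.01):
    # 1/t decay: eps_k = eps0 / (1 + decay * k); decay constant is assumed.
    return eps0 / (1.0 + decay * k)

def linear_decay(k, total_steps=1000):
    # Linear decay to zero over a fixed step budget.
    return eps0 * max(0.0, 1.0 - k / total_steps)

print(step_decay(12), inv_t_decay(100), linear_decay(500))
```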
Minibatch Stochastic Gradient Descent (a runnable version follows below)
Input: learning rate 𝜖𝑘 and initial model parameter 𝜃
While stopping criterion not met do
  Sample a minibatch of 𝑚 samples {𝑥⁽ⁱ⁾}𝑖=1:𝑚 from the training data
  Compute the gradient 𝑔 = (1/𝑚) 𝛻𝜃 ∑𝑖=1:𝑚 𝐿(𝑥⁽ⁱ⁾; 𝜃)
  Update the model 𝜃 ← 𝜃 − 𝜖𝑘𝑔
End while
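A runnable numpy version of the pseudocode on synthetic data; the batch size, learning rate, and data distribution are assumptions:

```python
import numpy as np

# Minibatch SGD on the toy Gaussian NLL, where grad L(x; theta) = theta - x.
rng = np.random.default_rng(0)
data = rng.normal(loc=0.5, scale=1.0, size=1000)     # synthetic training set

theta, m, eps = 0.0, 32, 0.1
for step in range(500):
    batch = rng.choice(data, size=m, replace=False)  # sample a minibatch
    g = np.mean(theta - batch)                       # (1/m) sum_i grad L
    theta -= eps * g                                 # theta <- theta - eps * g
print(theta)   # close to the data mean, ~0.5
```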
Summary about SGD Algorithms
A low-loss model may not be a good model
(Figure: generated samples vs. training samples.)
Overfitting (memorization)
• Overfitting refers to the phenomenon where the gap between training loss (performance) and test loss (performance) is too large.
http://www.deeplearningbook.org/contents/ml.html
Tools to avoid overfitting
“Any modification we make to a learning algorithm that is intended to reduce its generalization error but not its training error.”
• DropOut
• Weight decay
• Early stopping (weight decay and early stopping are sketched below)
• Pre-training
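A toy sketch of two of these tools on the earlier Gaussian example: weight decay adds a penalty 𝜆𝜃 to the gradient, and early stopping halts when validation loss stops improving; 𝜆, the patience, and the data are assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
train = rng.normal(loc=0.5, size=50)   # synthetic train/validation splits
valid = rng.normal(loc=0.5, size=50)

def loss(theta, xs):
    return 0.5 * np.mean((xs - theta) ** 2)

theta, lr, lam = 0.0, 0.1, 0.1
best_val, best_theta, patience = np.inf, theta, 0
for step in range(1000):
    g = np.mean(theta - train) + lam * theta   # data gradient + weight decay
    theta -= lr * g
    val = loss(theta, valid)
    if val < best_val - 1e-6:                  # validation still improving
        best_val, best_theta, patience = val, theta, 0
    else:
        patience += 1
        if patience >= 10:                     # early stopping
            break
print(best_theta)
```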
Any Questions?