
Machine Learning Basics

Di He
Supervised learning

• Data are drawn from a distribution 𝑃(𝑋, 𝑌).
• 𝑋: an image, a sentence, a piece of audio, …
• 𝑌: a label (positive? negative?), a category, …
• A stack of neural network layers maps 𝑋 to 𝑌.
Supervised learning

• 𝑓∗(𝑥): the ground truth mapping from 𝑋 to 𝑌.
• 𝑓𝜃(𝑥): the approximated mapping learned from data.
• The difficulty: obtain a good 𝜃 from finite samples.
Generative modeling (unsupervised)

• 𝑃(𝑋): the distribution of images, sentences, audio, …
• We only observe samples {𝑥𝑖} drawn from 𝑃(𝑋).
• A model 𝑃𝜃(𝑋) approximates 𝑃(𝑋) and is used to generate images, sentences, audio, …
• The difficulty: make 𝑃𝜃(𝑋) close to 𝑃(𝑋) given only the finite samples {𝑥𝑖}.
Simple examples

• Ground truth: 𝑋 ~ Normal(0, 1). This is what we will never know.
• Observed samples 𝑥𝑖:
  0.156  −1.237  0.894  1.502  −0.671
  −0.201  0.327  1.101  −0.942  −0.555
• Hypothesis space: 𝑋 ~ Normal(𝜃, 1). In realistic problems, this is what has to be highly complex.
• How to estimate 𝜃: ask which 𝜃 can generate {𝑥𝑖} with the highest probability,
  𝑃𝜃({𝑥𝑖}) = Π𝑖=1:𝑁 𝑃𝜃(𝑥𝑖)
• This is known as the maximum likelihood method. Here,
  𝜃 = argmax𝜃 𝑃𝜃({𝑥𝑖}) = (∑𝑖 𝑥𝑖)/𝑁 = 0.037
• What is not simple in practice: real samples (images, sentences, audio) are far more complex than these scalar draws.
• We seek computational methods instead of writing down analytical solutions.
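As a sanity check, the maximum likelihood estimate above can be computed directly from the ten samples. A minimal NumPy sketch (not from the slides; for Normal(𝜃, 1) the MLE is the sample mean):

    # MLE for theta in X ~ Normal(theta, 1): maximizing the likelihood of
    # i.i.d. samples is equivalent to minimizing sum_i (x_i - theta)^2,
    # whose minimizer is the sample mean.
    import numpy as np

    samples = np.array([0.156, -1.237, 0.894, 1.502, -0.671,
                        -0.201, 0.327, 1.101, -0.942, -0.555])
    theta_mle = samples.mean()
    print(round(theta_mle, 3))  # 0.037, matching the slide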
Arthur Samuel

• In 1952, Arthur Samuel developed a program that plays checkers.
• The program was able to observe positions and learn an implicit model that gives better moves in later games.
• With that program, Samuel claimed that machines can go beyond their written code and learn patterns like human beings.
• Samuel coined the term “machine learning” in 1959.
Basic Machine Learning Concepts

• The goal: to learn a model from experiences/data.
• Training data: {𝑥𝑖}𝑖=1:𝑛 ∈ 𝑋ⁿ
• Model: 𝑓𝜃 : noise (𝑅ᵈ) ⇒ 𝑋
• Test/inference/prediction:
  • Sample a noise 𝜖 (or a set of noises).
  • Output 𝑓𝜃(𝜖).
• Training: empirical loss minimization,
  min𝜃 ∑𝑖=1:𝑛 𝑙(𝑥𝑖, 𝜃)
• Negative log likelihood: 𝑙(𝑥𝑖, 𝜃) = −log 𝑃(𝑥𝑖; 𝜃)
• Other surrogate losses will be introduced during the course.
A concrete rendering of these pieces follows.
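A hedged sketch reusing the Gaussian example from earlier (the grid search is purely illustrative, not part of the slides): empirical loss minimization with the negative log likelihood recovers the MLE.

    import numpy as np

    def nll(x, theta):
        # -log P(x; theta) for X ~ Normal(theta, 1), dropping the additive
        # constant 0.5*log(2*pi), which does not affect the minimizer.
        return 0.5 * (x - theta) ** 2

    def empirical_loss(samples, theta):
        # sum_{i=1..n} l(x_i, theta)
        return sum(nll(x, theta) for x in samples)

    samples = np.array([0.156, -1.237, 0.894, 1.502, -0.671,
                        -0.201, 0.327, 1.101, -0.942, -0.555])
    thetas = np.linspace(-1.0, 1.0, 2001)
    losses = [empirical_loss(samples, t) for t in thetas]
    print(thetas[int(np.argmin(losses))])  # ~0.037, the MLE from before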
What does 𝑓𝜃 look like?
Frank Rosenblatt

• In 1957, Frank Rosenblatt designed the first neural network for computers (the perceptron), which simulates the thought processes of the human brain.
Marvin Minsky

• In 1969, Minsky proposed the famous XOR problem and showed the inability of the perceptron to handle such linearly non-separable data distributions.
• It was Minsky's attack on the NN community. Thereafter, NN research would lie dormant until the 1980s.
The perceptron is too simple; more complicated models are needed to handle complex problems…
Paul Werbos

• Paul Werbos suggested using the Multi-Layer Perceptron (MLP) in 1981 and proposed the Backpropagation (BP) algorithm for training neural networks. This new architecture solved the XOR challenge.
• Following Werbos' new ideas, neural network researchers successively presented different MLP architectures and a number of BP variants for effective training.
Geoffrey Hinton, Yann LeCun, Jürgen Schmidhuber

• Geoffrey Hinton contributed a lot to practical backpropagation algorithms (1986) and Boltzmann Machines (1983).
• Yann LeCun was the first to train a convolutional neural network on images of handwritten digits (1989).
• Jürgen Schmidhuber invented a new type of recurrent neural network called Long Short-Term Memory, or LSTM (1997).
Neural networks are black boxes, and therefore difficult to interpret…

Neural networks are data-hungry: when there is only a small number of training data, they will overfit…
Ross Quinlan

• Decision trees were proposed by Ross Quinlan, more specifically the ID3 algorithm.
• ID3 found many real-life use cases thanks to its simple rules and clear inference.
• After ID3, many alternatives and improvements have been explored by the community (e.g. ID4, Regression Trees, CART, …), and it is still one of the active topics in ML.
Decision Trees

• ID3 Algorithm
  • Take all unused attributes and compute their entropy with respect to the test samples.
  • Choose the attribute for which entropy is minimum (or, equivalently, information gain is maximum).
  • Make a node containing that attribute.
A sketch of the attribute-selection step follows.
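A hedged Python sketch (the data layout, a list of attribute dictionaries plus a parallel label list, is an assumption for illustration):

    from collections import Counter
    import math

    def entropy(labels):
        # H = -sum_c p_c * log2(p_c) over the label classes
        n = len(labels)
        return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

    def information_gain(rows, labels, attribute):
        # Expected reduction in label entropy after splitting on `attribute`.
        remainder = 0.0
        for value in set(row[attribute] for row in rows):
            subset = [l for row, l in zip(rows, labels) if row[attribute] == value]
            remainder += len(subset) / len(labels) * entropy(subset)
        return entropy(labels) - remainder

    def best_attribute(rows, labels, unused):
        # ID3 picks the attribute with maximum information gain.
        return max(unused, key=lambda a: information_gain(rows, labels, a))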
Vladimir Vapnik

• The Support Vector Machine (SVM) was proposed by Vapnik and Cortes in 1995, with very strong theoretical standing and empirical results.
• SVM took the lead on many tasks previously dominated by NN models. In addition, SVM could exploit the profound knowledge of convex optimization, generalization margin theory, and kernels, giving it an edge over NN models.
• The ML community split into two camps: NN advocates and SVM advocates.
Support Vector Machines

• Basic idea
  • The decision boundary should be as far away from the data of both classes as possible.
  • We should maximize the margin 𝑚.
• SVM can be solved efficiently in its dual form, whose solution relies only on the so-called support vectors.
• SVM can be kernelized to handle linearly non-separable cases.

[Figure: two classes (Class 1, Class 2) separated by a maximum-margin boundary with margin 𝑚]
Revival of Neural Networks (Deep Learning)

A Neuron

• No hidden layer, a single output unit:
  𝑜(𝑥) = max{∑𝑖 𝑤𝑖 𝑥𝑖, 0}
Stack of layers

• 𝑊: weight matrix, 𝑏: bias vector
  𝑦𝑖 = 𝑓(∑𝑗 𝑊𝑖𝑗 𝑥𝑗 + 𝑏𝑖) = 𝑓(𝑊𝑖ᵀ𝑥 + 𝑏𝑖)
Stack of layers

ℎ⁰ = 𝑓(𝑤⁰𝑥 + 𝑏⁰)   ℎ¹ = 𝑓(𝑤¹ℎ⁰ + 𝑏¹)
ℎ² = 𝑓(𝑤²ℎ¹ + 𝑏²)   𝑦 = 𝑓(𝑤³ℎ² + 𝑏³)
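A hedged NumPy sketch of this stacked forward pass (the layer sizes, random initialization, and ReLU choice of 𝑓 are illustrative assumptions):

    import numpy as np

    def f(z):
        return np.maximum(z, 0.0)  # ReLU as the activation f

    rng = np.random.default_rng(0)
    sizes = [8, 16, 16, 16, 1]  # input, h0, h1, h2, y
    params = [(rng.standard_normal((m, n)) * 0.1, np.zeros(m))
              for n, m in zip(sizes[:-1], sizes[1:])]

    def forward(x):
        h = x
        for W, b in params:  # h^k = f(w^k h^{k-1} + b^k), ending with y
            h = f(W @ h + b)
        return h

    y = forward(rng.standard_normal(8))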
Universal Approximation Theorem

• A feed-forward network with a single hidden layer containing a finite number of neurons can approximate continuous functions on compact subsets of 𝑅ⁿ, under mild assumptions on the activation function.
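A small numerical illustration of the theorem (everything here, from the target sin(𝑥) to the random-feature construction, is an assumption for demonstration): fix a single hidden layer of random sigmoid units, solve only for the output weights by least squares, and the uniform error on a compact interval is typically small.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    rng = np.random.default_rng(0)
    n_hidden = 200
    W = rng.normal(0.0, 4.0, size=n_hidden)      # random input weights
    b = rng.uniform(-12.0, 12.0, size=n_hidden)  # random biases

    x = np.linspace(-3.0, 3.0, 400)
    H = sigmoid(np.outer(x, W) + b)              # hidden activations, shape (400, 200)
    a, *_ = np.linalg.lstsq(H, np.sin(x), rcond=None)  # output weights
    print(np.max(np.abs(H @ a - np.sin(x))))     # small uniform error on [-3, 3]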
Activations

• Sigmoid: 𝜎(𝑥) = 1 / (1 + 𝑒⁻ˣ)
• Tanh: tanh(𝑥) = 2𝜎(2𝑥) − 1 = (𝑒ˣ − 𝑒⁻ˣ) / (𝑒ˣ + 𝑒⁻ˣ)
• Not used in common scenarios such as image and language processing.
• Still popularly used in some specific scenarios such as Neural ODEs/PDEs.
Activations

• Rectified linear units (ReLU): 𝑓(𝑧) = max(0, 𝑧) = max(0, 𝑊ᵀ𝑥 + 𝑏)
• Generalization: 𝑓(𝑧) = max(0, 𝑧) + 𝛼 min(0, 𝑧)
  • Absolute value rectification: 𝛼 = −1, so 𝑓(𝑧) = |𝑧|
  • Leaky ReLU: fixes 𝛼 to a small value like 0.01
  • Parametric ReLU: learns 𝛼
• Why ReLU fails in Neural ODEs/PDEs: one reason is that ReLU is piecewise linear, so its second- and higher-order derivatives vanish almost everywhere, which is problematic when the loss involves derivatives of the network.
Sketches of these activations follow.
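Minimal NumPy sketches of the activations from the last two slides:

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def tanh(x):
        return 2.0 * sigmoid(2.0 * x) - 1.0   # identical to np.tanh(x)

    def relu(z):
        return np.maximum(0.0, z)

    def leaky_relu(z, alpha=0.01):
        # f(z) = max(0, z) + alpha * min(0, z); alpha = -1 gives |z|,
        # and a learned alpha gives the parametric ReLU.
        return np.maximum(0.0, z) + alpha * np.minimum(0.0, z)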
Implementation

How to find good 𝜃?
Loss function

• The loss function (more accurately, the data loss) measures the goodness of 𝜃, e.g., how likely the model is to generate the data it sees during training:
  min𝜃 ∑𝑖 𝑙(𝑥𝑖, 𝜃)
• The way we estimate the value (weight) of 𝜃 is called optimization.
Loss surface

• If we can draw a “high-dimensional curve”, where
  • the x-axis is the value(s) of 𝜃, and
  • the y-axis is the loss at 𝜃,
• this curve is called the “loss surface”.
• The goal is to find the “basin”.
Optimization is another course

• Key concepts
  • Conditions for optimality
  • Convergence
  • Convergence rate
  • Duality
  • …
• What deep learning practitioners care about
  • Convex optimization problems
  • Non-convex optimization problems
  • Minimax problems
The standard optimizer in deep learning

• Intuition of gradient descent
  • Randomly put a ball on the loss surface.
  • The ball moves in the direction that reduces the loss fastest.

Direction: −𝛻𝐿(𝜃)
Update: 𝜃 ← 𝜃 − 𝜖𝛻𝐿(𝜃)

https://towardsdatascience.com/a-visual-explanation-of-gradient-descent-methods-momentum-adagrad-rmsprop-adam-f898b102325c
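A hedged sketch of the update rule on the earlier Gaussian example, where 𝐿(𝜃) = ∑𝑖 ½(𝑥𝑖 − 𝜃)² so 𝛻𝐿(𝜃) = ∑𝑖 (𝜃 − 𝑥𝑖); the initial point and step size are assumptions:

    import numpy as np

    samples = np.array([0.156, -1.237, 0.894, 1.502, -0.671,
                        -0.201, 0.327, 1.101, -0.942, -0.555])

    def grad_L(theta):
        # gradient of L(theta) = sum_i 0.5 * (x_i - theta)^2
        return np.sum(theta - samples)

    theta, eps = 1.0, 0.05
    for _ in range(100):
        theta = theta - eps * grad_L(theta)  # theta <- theta - eps * grad L(theta)
    print(round(theta, 3))  # rolls down to the "basin" at the sample mean, ~0.037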
From GD to Stochastic GD

• Disadvantage of GD: huge cost of computing a single gradient in large-scale machine learning problems.
• GD: sweeps through the training set, computes the gradient, and performs one update:
  𝜃 ← 𝜃 − 𝜖𝛻𝐿(𝜃)
• SGD: sweeps through the training set, computes the gradient, and performs an update for each training example:
  𝜃 ← 𝜃 − 𝜖𝑘𝛻𝑙𝑘(𝜃)
• A sufficient condition on the learning rates 𝜖𝑘 to ensure convergence:
  ∑𝑘=1:∞ 𝜖𝑘 = ∞,  ∑𝑘=1:∞ 𝜖𝑘² < ∞
• Common learning-rate schedules (sketched below):
  • Step decay: decay the learning rate by 0.5 every 5 epochs, by 0.1 every 20 epochs, or by validation error.
  • 1/t decay: 𝜖𝑘 = 𝜖₀/(1 + 𝑘𝑡)
  • Linear decay to zero.
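Hedged sketches of the three schedules (𝜖₀ and the decay constants are illustrative values; note that the 1/t schedule satisfies the sufficient condition, since ∑𝜖𝑘 diverges while ∑𝜖𝑘² converges):

    def step_decay(epoch, eps0=0.1):
        # halve the learning rate every 5 epochs
        return eps0 * (0.5 ** (epoch // 5))

    def one_over_t_decay(k, eps0=0.1, t=0.01):
        # eps_k = eps0 / (1 + k*t): sum eps_k = inf, sum eps_k^2 < inf
        return eps0 / (1.0 + k * t)

    def linear_decay(k, total_steps, eps0=0.1):
        # decay linearly from eps0 to zero over total_steps updates
        return eps0 * max(0.0, 1.0 - k / total_steps)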
Minibatch Stochastic Gradient Descent

Input: learning rate 𝜖𝑘 and initial model parameter 𝜃
While stopping criterion not met do
  Sample a minibatch of 𝑚 samples {𝑥⁽ⁱ⁾}, 𝑖 ∈ [𝑚], from the training data
  Compute the gradient 𝑔 = (1/𝑚) 𝛻𝜃 ∑𝑖=1:𝑚 𝐿(𝑥⁽ⁱ⁾; 𝜃)
  Update the model 𝜃 ← 𝜃 − 𝜖𝑘𝑔
End while
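A hedged Python rendering of this loop, once more on the Normal(𝜃, 1) example (the synthetic data, batch size, and learning-rate schedule are assumptions):

    import numpy as np

    rng = np.random.default_rng(0)
    samples = rng.normal(0.0, 1.0, size=1000)   # synthetic training data

    def minibatch_grad(theta, batch):
        # g = (1/m) * grad_theta sum_i 0.5 * (x_i - theta)^2 = mean(theta - x_i)
        return np.mean(theta - batch)

    theta, m = 1.0, 32
    for k in range(1, 501):                     # stopping criterion: fixed steps
        eps_k = 0.5 / (1.0 + 0.01 * k)          # decaying learning rate
        batch = rng.choice(samples, size=m, replace=False)
        theta -= eps_k * minibatch_grad(theta, batch)
    print(theta)  # close to the true mean 0.0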
Summary about SGD Algorithms
A low-loss model may not be a good model

[Figure: “Training” examples vs. “Generation” outputs]
Overfitting (memorization)

• Overfitting refers to the phenomenon where the gap between training loss (performance) and test loss (performance) is too large.

http://www.deeplearningbook.org/contents/ml.html
Tools to avoid overfitting

“Any modification we make to a learning algorithm that is intended to reduce its generalization error but not its training error.”

• DropOut
• Weight decay
• Early stopping
• Pre-training
Sketches of weight decay and early stopping follow.
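Both are easy to state in code; hedged sketches (names and hyperparameters are illustrative, not from the slides):

    import numpy as np

    def loss_with_weight_decay(data_loss, theta, lam=1e-4):
        # Weight decay: add an L2 penalty so large weights raise the loss.
        return data_loss + lam * np.sum(theta ** 2)

    def should_stop_early(val_losses, patience=5):
        # Early stopping: stop when the validation loss has not improved
        # for `patience` consecutive evaluations.
        best = int(np.argmin(val_losses))
        return len(val_losses) - 1 - best >= patience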
Any Questions?
