Machine Learning
Basics
Di He
Supervised learning
• Data are drawn from a distribution 𝑃(𝑋, 𝑌)
• 𝑋: an image, a sentence, a piece of audio, …
• 𝑌: positive/negative, category, …
• A stack of neural network layers maps 𝑋 to 𝑌
• 𝑓∗(𝑥): the ground truth mapping; 𝑓𝜃(𝑥): the approximated mapping
• The difficulty: obtain a good 𝜃 from finite samples
Generative modeling (unsupervised)
• 𝑃(𝑋): the distribution of images, sentences, audio, …
• From samples {𝑥𝑖}, learn an approximation 𝑃𝜃(𝑋)
• Use 𝑃𝜃(𝑋) to generate images, sentences, audio, …
• The difficulty: going from the finite samples {𝑥𝑖} to 𝑃𝜃(𝑋)
Simple examples
• Ground truth: 𝑋 ~ Normal(0, 1)
• Samples 𝑥𝑖: 0.156, −1.237, 0.894, 1.502, −0.671, −0.201, 0.327, 1.101, −0.942, −0.555
• Hypothesis space: 𝑋 ~ Normal(𝜃, 1)
• How to estimate 𝜃?
  • Which 𝜃 can generate {𝑥𝑖} with the highest probability: 𝑃𝜃({𝑥𝑖}) = Π𝑖=1:𝑁 𝑃𝜃(𝑥𝑖)
  • Known as the maximum likelihood method: 𝜃 = argmax𝜃 𝑃𝜃({𝑥𝑖}) = (∑𝑖 𝑥𝑖)/𝑁 = 0.037 (verified in the sketch below)
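A minimal numpy sketch (not from the slides; the sample values are copied from above) that checks the closed-form answer against a brute-force scan over candidate 𝜃:

```python
import numpy as np

# Samples from the slide, drawn from the (unknown) ground truth Normal(0, 1).
x = np.array([0.156, -1.237, 0.894, 1.502, -0.671,
              -0.201, 0.327, 1.101, -0.942, -0.555])

def log_likelihood(theta, x):
    # log P_theta({x_i}) for the hypothesis X ~ Normal(theta, 1),
    # dropping the constant -N/2 * log(2*pi), which does not affect the argmax.
    return -0.5 * np.sum((x - theta) ** 2)

# Closed form: for Normal(theta, 1), the maximizer is the sample mean.
theta_mle = x.mean()                       # ~0.037, as on the slide

# Numerical check: scan a grid of candidate thetas.
grid = np.linspace(-1.0, 1.0, 2001)
theta_grid = grid[np.argmax([log_likelihood(t, x) for t in grid])]

print(theta_mle, theta_grid)               # both close to 0.037
```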
Simple examples: what is not simple
• The ground truth 𝑋 ~ Normal(0, 1) is what we will never know
• The hypothesis space 𝑋 ~ Normal(𝜃, 1) is what has to be highly complex in realistic problems
• We seek computational methods instead of writing down analytical solutions
Arthur Samuel
• In 1952, Arthur Samuel developed a program playing checkers.
• The program was able to observe positions and learn an implicit model that gives better moves for later cases.
• With that program, Samuel claimed that machines can go beyond their written code and learn patterns like human beings.
• Samuel coined the term “machine learning” in 1959.
Basic Machine Learning Concepts
• The goal: to learn a model from experience/data
• Training data {𝑥𝑖}𝑖=1:𝑛 ∈ 𝑋ⁿ
• Model 𝑓𝜃: noise (𝑅^𝑑) → 𝑋
• Test/inference/prediction
  • Sample a noise vector 𝜖 (or a set of them)
  • Output 𝑓𝜃(𝜖)
• Training: empirical loss minimization
  min𝜃 ∑𝑖=1:𝑛 𝑙(𝑥𝑖, 𝜃)
• Negative log likelihood: 𝑙(𝑥𝑖, 𝜃) = −log 𝑃(𝑥𝑖; 𝜃) (sketched below)
• Other surrogate losses will be introduced during the course
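As a small illustration tying these bullets together, here is a numpy sketch of the negative log likelihood for the toy Gaussian model from the earlier example; the model and the comparison values are assumptions for illustration only:

```python
import numpy as np

# Empirical loss minimization with the negative log likelihood,
# for the toy model X ~ Normal(theta, 1).
x = np.array([0.156, -1.237, 0.894, 1.502, -0.671,
              -0.201, 0.327, 1.101, -0.942, -0.555])

def nll(x_i, theta):
    # l(x_i, theta) = -log P(x_i; theta) for a unit-variance Gaussian
    return 0.5 * np.log(2 * np.pi) + 0.5 * (x_i - theta) ** 2

def empirical_loss(theta, x):
    # sum_i l(x_i, theta), the objective minimized during training
    return np.sum(nll(x, theta))

# The MLE from the earlier slide should beat any other candidate theta.
print(empirical_loss(0.037, x) < empirical_loss(1.0, x))  # True
```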
What does 𝑓𝜃 look like?
Frank Rosenblatt
• In 1957, Frank Rosenblatt designed the first neural network for computers (the perceptron), which simulates the thought processes of the human brain.
Marvin Minsky
• In 1969, Minsky posed the famous XOR problem and showed the inability of the perceptron to handle such linearly inseparable data distributions.
• It was Minsky's attack on the NN community; thereafter, NN research lay dormant until the 1980s.
The perceptron is too simple; more complicated models are needed to handle complex problems…
Paul Werbos
• Paul Werbos suggested using the Multi-Layer Perceptron (MLP) in 1981 and proposed the Backpropagation (BP) algorithm for training neural networks. This new architecture solved the XOR challenge.
• Following Werbos’ new ideas, neural network researchers successively presented different MLP architectures and a number of BP variants for effective training.
Geoffrey Hinton, Yann LeCun, Jürgen Schmidhuber
• Geoffrey Hinton contributed a lot to practical backpropagation algorithms (1986) and Boltzmann Machines (1983).
• Yann LeCun was the first to train a convolutional neural network on images of handwritten digits (1989).
• Jürgen Schmidhuber invented a new type of recurrent neural network called Long Short-Term Memory, or LSTM (1997).
Neural networks are black boxes, and therefore difficult to interpret…
Neural networks are data-hungry. When there is only a small amount of training data, they will overfit…
Ross Quinlan
• Decision trees were proposed by Ross Quinlan, more specifically the ID3 algorithm.
• ID3 found more real-life use cases thanks to its simple rules and clear inference.
• After ID3, many alternatives and improvements have been explored by the community (e.g., ID4, Regression Trees, CART, …), and it is still one of the active topics in ML.
Decision Trees
• ID3 algorithm (sketched in code below)
  • Take all unused attributes and compute their entropy with respect to the training samples
  • Choose the attribute for which entropy is minimum (or, equivalently, information gain is maximum)
  • Make a node containing that attribute
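A toy Python sketch of the attribute choice only (not a full ID3 implementation); the two-attribute data set is hypothetical:

```python
import math
from collections import Counter

def entropy(labels):
    # Shannon entropy of a list of class labels.
    counts = Counter(labels)
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def split_entropy(rows, labels, attr):
    # Weighted entropy of the labels after splitting on one attribute;
    # minimizing this maximizes information gain.
    n = len(rows)
    total = 0.0
    for value in {row[attr] for row in rows}:
        subset = [lab for row, lab in zip(rows, labels) if row[attr] == value]
        total += len(subset) / n * entropy(subset)
    return total

# Hypothetical toy data: two binary attributes, one binary label.
rows = [{"a": 0, "b": 0}, {"a": 0, "b": 1}, {"a": 1, "b": 0}, {"a": 1, "b": 1}]
labels = ["no", "no", "yes", "yes"]

best = min(["a", "b"], key=lambda attr: split_entropy(rows, labels, attr))
print(best)  # "a": splitting on it makes the label subsets pure
```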
Vladimir Vapnik
• The Support Vector Machine (SVM) was proposed by Vapnik and Cortes in 1995, with very strong theoretical standing and empirical results.
• SVM took the lead on many tasks previously occupied by NN models. In addition, SVM was able to exploit the profound knowledge of convex optimization, margin-based generalization theory, and kernels, giving it an edge over NN models.
• The ML community split into two camps: NN advocates and SVM advocates.
Support Vector Machines
• Basic idea
  • The decision boundary should be as far away from the data of both classes as possible
  • We should maximize the margin 𝑚
• SVM can be solved efficiently in its dual form, whose solution relies only on the so-called support vectors.
• SVM can be kernelized to handle non-separable cases (see the sketch below).
(Figure: two classes, Class 1 and Class 2, separated by a decision boundary with margin 𝑚.)
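A minimal sketch using scikit-learn (assumed available; not part of the slides). The support_vectors_ attribute exposes the points that alone determine the boundary:

```python
import numpy as np
from sklearn.svm import SVC

# Two linearly separable classes (toy data).
X = np.array([[0.0, 0.0], [0.0, 1.0], [3.0, 3.0], [3.0, 4.0]])
y = np.array([0, 0, 1, 1])

clf = SVC(kernel="linear", C=1e6)   # large C approximates a hard margin
clf.fit(X, y)
print(clf.support_vectors_)          # only the margin-defining points
# A nonlinear kernel, e.g. SVC(kernel="rbf"), handles non-separable data.
```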
Revival of Neural Networks (Deep Learning)
A Neuron
• No hidden layer, a single output unit
• 𝑜(𝑥) = max{∑𝑖 𝑤𝑖𝑥𝑖, 0}
Stack of layers
• 𝑊: weight matrix; 𝑏: bias vector
• 𝑦𝑖 = 𝑓(∑𝑗 𝑊𝑖𝑗𝑥𝑗 + 𝑏𝑖) = 𝑓(𝑊𝑖ᵀ𝑥 + 𝑏𝑖)
Stack of layers
• ℎ⁰ = 𝑓(𝑤⁰𝑥 + 𝑏⁰), ℎ¹ = 𝑓(𝑤¹ℎ⁰ + 𝑏¹), ℎ² = 𝑓(𝑤²ℎ¹ + 𝑏²), 𝑦 = 𝑓(𝑤³ℎ² + 𝑏³) (sketched below)
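A minimal numpy sketch of the formulas above, a single ReLU neuron and a stack of layers; the layer sizes are illustrative assumptions, not from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(z):                      # f(z) = max(z, 0), elementwise
    return np.maximum(z, 0.0)

def neuron(w, x):                 # o(x) = max{sum_i w_i x_i, 0}
    return relu(w @ x)

def layer(W, b, h):               # y = f(W h + b)
    return relu(W @ h + b)

x = rng.normal(size=8)
print(neuron(rng.normal(size=8), x))        # a single output unit

# Stack of layers: h0, h1, h2, then the output y.
sizes = [8, 16, 16, 16, 1]
params = [(0.1 * rng.normal(size=(m, n)), np.zeros(m))
          for n, m in zip(sizes[:-1], sizes[1:])]
h = x
for W, b in params:
    h = layer(W, b, h)
print(h.shape)                              # (1,)
```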
Universal Approximation Theorem
• A feed-forward network with a single hidden layer containing a finite number of neurons can approximate continuous functions on compact subsets of 𝑅ⁿ, under mild assumptions on the activation function.
Activations
• Sigmoid: 𝜎(𝑥) = 1/(1 + 𝑒^(−𝑥))
• Tanh: tanh(𝑥) = 2𝜎(2𝑥) − 1 = (𝑒^𝑥 − 𝑒^(−𝑥))/(𝑒^𝑥 + 𝑒^(−𝑥)) (both sketched below)
• Not used in common scenarios such as image and language processing
• Still popularly used in some specific scenarios such as Neural ODE/PDE
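A small sketch of both activations, checking the identity tanh(𝑥) = 2𝜎(2𝑥) − 1 numerically:

```python
import numpy as np

def sigmoid(x):
    # sigma(x) = 1 / (1 + e^{-x})
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    # tanh(x) = 2 * sigmoid(2x) - 1, as on the slide
    return 2.0 * sigmoid(2.0 * x) - 1.0

x = np.linspace(-4, 4, 9)
print(np.allclose(tanh(x), np.tanh(x)))  # True
```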
Activations
• Rectified linear units: 𝑓(𝑧) = max(0, 𝑧) = max(0, 𝑊ᵀ𝑥 + 𝑏)
• General form: 𝑓(𝑧) = max(0, 𝑧) + 𝛼 min(0, 𝑧) (the whole family is sketched below)
  • Absolute value rectification: 𝛼 = −1, so 𝑓(𝑧) = |𝑧|
  • Leaky ReLU: fixes 𝛼 to a small value like 0.01
  • Parametric ReLU: learns 𝛼
• Why ReLU fails in Neural ODE/PDE
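A sketch of the whole family through the shared form 𝑓(𝑧) = max(0, 𝑧) + 𝛼 min(0, 𝑧):

```python
import numpy as np

def rectifier(z, alpha=0.0):
    # f(z) = max(0, z) + alpha * min(0, z)
    return np.maximum(0.0, z) + alpha * np.minimum(0.0, z)

z = np.array([-2.0, -0.5, 0.0, 1.5])
relu       = rectifier(z)               # alpha = 0
abs_rect   = rectifier(z, alpha=-1.0)   # |z|
leaky_relu = rectifier(z, alpha=0.01)   # small fixed negative slope
print(abs_rect)                          # [2.  0.5 0.  1.5]
# A parametric ReLU would treat alpha as a learned parameter.
```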
Implementation
How to find good 𝜃?
Loss function
• The loss function (more accurately, the data loss) measures the goodness of 𝜃, e.g., how likely the model is to generate the data it sees during training:
  min𝜃 ∑𝑖=1:𝑛 𝑙(𝑥𝑖, 𝜃)
• The way we estimate the value (weights) of 𝜃 is called optimization
Loss surface
• Imagine a “high-dimensional curve” where
  • the x-axis is the value(s) of 𝜃
  • the y-axis is the loss at 𝜃
• This curve is called the “loss surface”
• The goal is to find the “basin”
Optimization is another course
• Key concepts
  • Conditions for optimality
  • Convergence
  • Convergence rate
  • Duality
  • …
• What deep learning people care about
  • Convex optimization problems
  • Non-convex optimization problems
  • Minimax problems
The standard optimizer in deep learning
• Intuition of gradient descent
  • Randomly put a ball on the loss surface.
  • The ball moves in the direction that reduces the loss fastest.
• Direction: −𝛻𝐿(𝜃)
• Update: 𝜃 ← 𝜃 − 𝜖𝛻𝐿(𝜃) (sketched below)
https://towardsdatascience.com/a-visual-explanation-of-gradient-descent-methods-momentum-adagrad-rmsprop-adam-f898b102325c
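A runnable sketch of the update rule on the toy Gaussian NLL from earlier, where 𝛻𝐿(𝜃) = ∑𝑖(𝜃 − 𝑥𝑖); the starting point and learning rate are assumptions:

```python
import numpy as np

# L(theta) = sum_i 0.5 * (x_i - theta)^2, so grad L(theta) = sum_i (theta - x_i).
x = np.array([0.156, -1.237, 0.894, 1.502, -0.671,
              -0.201, 0.327, 1.101, -0.942, -0.555])

def grad_L(theta):
    return np.sum(theta - x)

theta, lr = -1.0, 0.05               # random start, learning rate (epsilon)
for _ in range(100):
    theta = theta - lr * grad_L(theta)   # theta <- theta - eps * grad L(theta)
print(theta)                         # converges to the sample mean, ~0.037
```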
From GD to Stochastic GD
• Disadvantage of GD: huge cost of computing even a single gradient in large-scale machine learning problems.
• GD: sweeps through the training set, computes the gradient, and performs one update: 𝜃 ← 𝜃 − 𝜖𝛻𝐿(𝜃)
• SGD: sweeps through the training set and performs an update for each training example: 𝜃 ← 𝜃 − 𝜖𝑘𝛻𝐿𝑘(𝜃)
• A sufficient condition on the learning rates to ensure convergence:
  ∑𝑘=1:∞ 𝜖𝑘 = ∞ and ∑𝑘=1:∞ 𝜖𝑘² < ∞
• Step decay: decay the learning rate by 0.5 every 5 epochs, by 0.1 every 20 epochs, or based on validation error
• 1/t decay: 𝜖𝑘 = 𝜖0/(1 + 𝑘𝑡)
• Linear decay to zero (all three schedules are sketched below)
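Sketches of the three schedules; 𝜖0 and the decay constants are assumed values:

```python
eps0 = 0.1  # assumed initial learning rate

def step_decay(epoch):
    # Halve every 5 epochs (one of the variants mentioned above).
    return eps0 * (0.5 ** (epoch // 5))

def inv_t_decay(k, decay=0.01):
    # 1/t decay: eps_k = eps0 / (1 + decay * k); decay constant is assumed.
    return eps0 / (1.0 + decay * k)

def linear_decay(k, total_steps=1000):
    # Linear decay to zero over a fixed step budget.
    return eps0 * max(0.0, 1.0 - k / total_steps)

print(step_decay(12), inv_t_decay(100), linear_decay(500))
```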
Minibatch Stochastic Gradient Descent (a runnable version follows below)
Input: learning rate 𝜖𝑘 and initial model parameter 𝜃
While stopping criterion not met do
  Sample a minibatch of 𝑚 samples {𝑥⁽ⁱ⁾}𝑖=1:𝑚 from the training data
  Compute the gradient 𝑔 = (1/𝑚) 𝛻𝜃 ∑𝑖=1:𝑚 𝐿(𝑥⁽ⁱ⁾; 𝜃)
  Update the model 𝜃 ← 𝜃 − 𝜖𝑘𝑔
End while
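A runnable numpy version of the pseudocode on synthetic data; the batch size, learning rate, and data distribution are assumptions:

```python
import numpy as np

# Minibatch SGD on the toy Gaussian NLL, where grad L(x; theta) = theta - x.
rng = np.random.default_rng(0)
data = rng.normal(loc=0.5, scale=1.0, size=1000)     # synthetic training set

theta, m, eps = 0.0, 32, 0.1
for step in range(500):
    batch = rng.choice(data, size=m, replace=False)  # sample a minibatch
    g = np.mean(theta - batch)                       # (1/m) sum_i grad L
    theta -= eps * g                                 # theta <- theta - eps * g
print(theta)   # close to the data mean, ~0.5
```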
Summary about SGD Algorithms
A low-loss model may not be a good model
(Figure: generated samples vs. training samples.)
Overfitting (memorization)
• Overfitting refers to the phenomenon where the gap between training loss (performance) and test loss (performance) is too large.
http://www.deeplearningbook.org/contents/ml.html
Tools to avoid overfitting
“Any modification we make to a learning algorithm that is intended to reduce its generalization error but not its training error.”
• DropOut
• Weight decay
• Early stopping (weight decay and early stopping are sketched below)
• Pre-training
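A toy sketch of two of these tools on the earlier Gaussian example: weight decay adds a penalty 𝜆𝜃 to the gradient, and early stopping halts when validation loss stops improving; 𝜆, the patience, and the data are assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
train = rng.normal(loc=0.5, size=50)   # synthetic train/validation splits
valid = rng.normal(loc=0.5, size=50)

def loss(theta, xs):
    return 0.5 * np.mean((xs - theta) ** 2)

theta, lr, lam = 0.0, 0.1, 0.1
best_val, best_theta, patience = np.inf, theta, 0
for step in range(1000):
    g = np.mean(theta - train) + lam * theta   # data gradient + weight decay
    theta -= lr * g
    val = loss(theta, valid)
    if val < best_val - 1e-6:                  # validation still improving
        best_val, best_theta, patience = val, theta, 0
    else:
        patience += 1
        if patience >= 10:                     # early stopping
            break
print(best_theta)
```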
Any Questions?