Math Review
Lecture 1
Machine Learning Decal
Hosted by Machine Learning at Berkeley
Overview
Agenda
• Let’s build a rocket
• Linear Algebra
• Probability and Information Theory
• Numerical Computation
• Python and Numpy Review Demo
• Questions
Let’s build a rocket
Designing an ML algorithm is like building a rocket:
• Fuel: Data
• The design of the rocket's propulsion system: Linear Algebra
• The knowledge of physics and chemistry needed to ensure that the combustion of fuel provides enough thrust: Statistics (Probability and Information Theory)
• The principles of engineering that close the gap between the ideal design and reality: Numerical Computation
Linear Algebra
Scalars
A scalar is a single number
Integers Z, real numbers R, rational numbers Q, etc.
Example notation: Italic font x, y, m, n, a
Vectors
A vector is a 1-D array of numbers:
\vec{v} = \begin{bmatrix} v_1 \\ v_2 \\ \vdots \\ v_m \end{bmatrix}    (0.1)
The entries can be integers Z, real numbers R, rational numbers Q, binary, etc.
Example notation for type and size:
\mathbb{Z}^N    (0.2)
Matrices
A matrix is a 2-D array of numbers:
M = \begin{bmatrix} M_{1,1} & M_{1,2} \\ M_{2,1} & M_{2,2} \end{bmatrix}    (0.3)
Example notation for type and shape:
M \in \mathbb{Q}^{m \times n}    (0.4)
Tensors
A tensor is an n-D array of numbers; it generalizes scalars (0-D), vectors (1-D), and matrices (2-D) to an arbitrary number of dimensions.
Matrix Transpose
(A^T)_{i,j} = A_{j,i}
The transpose of a matrix is like a reflection along the main diagonal.
A = \begin{bmatrix} A_{1,1} & A_{1,2} \\ A_{2,1} & A_{2,2} \\ A_{3,1} & A_{3,2} \end{bmatrix} \implies A^T = \begin{bmatrix} A_{1,1} & A_{2,1} & A_{3,1} \\ A_{1,2} & A_{2,2} & A_{3,2} \end{bmatrix}    (0.5)
(AB)^T = B^T A^T
Matrix Multiplication (Dot Product)
C = AB
C_{i,j} = \sum_k A_{i,k} B_{k,j}
Element-wise multiplication: Hadamard product A \odot B
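A quick numpy sketch (not from the slides; example values are mine) contrasting the matrix product with the Hadamard product:

import numpy as np

A = np.array([[1., 2.], [3., 4.]])
B = np.array([[5., 6.], [7., 8.]])

C = A @ B   # matrix product: C[i, j] = sum_k A[i, k] * B[k, j]
H = A * B   # Hadamard (element-wise) product: H[i, j] = A[i, j] * B[i, j]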
Identity Matrix
I_3 = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}
\forall x \in \mathbb{R}^n, \; I_n x = x
Systems of Equations
Solving Systems of Equations using Matrix Inversion
A^{-1} A = I_n
Numerically unstable, but useful for abstract analysis.
Ax = b
A^{-1} A x = A^{-1} b
I_n x = A^{-1} b
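In practice one rarely forms A^{-1} explicitly. A small numpy sketch (values chosen arbitrarily) comparing the explicit inverse with a factorization-based solver:

import numpy as np

A = np.array([[3., 1.], [1., 2.]])
b = np.array([9., 8.])

x_inv = np.linalg.inv(A) @ b     # explicit inverse: fine for analysis, numerically risky in general
x_solve = np.linalg.solve(A, b)  # preferred: solves Ax = b without forming A^{-1}
assert np.allclose(x_inv, x_solve)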
Invertibility
A matrix can't be inverted if it has:
• More rows than columns
• More columns than rows
• Redundant rows/columns (linearly dependent, i.e. low rank)
Norms
1) Functions that measure the "size" of a vector
2) Similar to the distance between zero and the point represented by the vector
• f(x) = 0 \implies x = 0
• f(x + y) \leq f(x) + f(y) (the triangle inequality)
• \forall a \in \mathbb{R}, \; f(ax) = |a| f(x)
Norms
• L^p norm:
  ||x||_p = \left( \sum_i |x_i|^p \right)^{1/p}
• L^1 norm, p = 1:
  ||x||_1 = \sum_i |x_i|
• L^2 norm, p = 2 (most popular):
  ||x||_2 = \left( \sum_i |x_i|^2 \right)^{1/2}
• Max norm, p \to \infty:
  ||x||_\infty = \max_i |x_i|
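A small numpy check (my own example vector) that the formulas above match np.linalg.norm:

import numpy as np

x = np.array([3., -4., 0.])

l1 = np.sum(np.abs(x))        # L1 norm: 7.0
l2 = np.sqrt(np.sum(x ** 2))  # L2 norm: 5.0
linf = np.max(np.abs(x))      # max norm: 4.0

assert np.isclose(l1, np.linalg.norm(x, 1))
assert np.isclose(l2, np.linalg.norm(x, 2))
assert np.isclose(linf, np.linalg.norm(x, np.inf))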
Special Matrices and Vectors
• Unit vector:
  ||x||_2 = 1
• Symmetric Matrix:
  A = A^T
• Orthogonal Matrix:
  A^T A = A A^T = I
  A^{-1} = A^T
Eigendecomposition
• Eigenvector and eigenvalue:
  Av = \lambda v
• Eigendecomposition of a diagonalizable matrix:
  A = V \, \mathrm{diag}(\lambda) \, V^{-1}
• Every real symmetric matrix has a real, orthogonal eigendecomposition:
  A = Q D Q^T
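A brief numpy illustration (matrix chosen arbitrarily) of the symmetric case A = Q D Q^T:

import numpy as np

A = np.array([[2., 1.], [1., 2.]])  # real symmetric matrix

lam, Q = np.linalg.eigh(A)          # eigenvalues and orthonormal eigenvectors
A_rebuilt = Q @ np.diag(lam) @ Q.T  # reconstruct A = Q D Q^T
assert np.allclose(A, A_rebuilt)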
Effect of Eigenvalues
Singular Value Decomposition
Similar to eigendecomposition but more general: the matrix need not be square.
A = U D V^T
Moore-Penrose Pseudoinverse
x = A^+ y
If the equation Ax = y has:
• Exactly one solution: this is the same as the inverse.
• No solution: this gives the solution with the smallest error ||Ax - y||_2.
• Many solutions: this gives the solution with the smallest norm ||x||_2.
Use the SVD to compute the pseudoinverse:
A^+ = V D^+ U^T
D^+: take the reciprocal of the non-zero entries of D, then transpose.
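A numpy sketch (data made up) of an overdetermined system solved via the pseudoinverse, checked against the SVD construction:

import numpy as np

A = np.array([[1., 0.], [0., 1.], [1., 1.]])  # more rows than columns: no exact inverse
y = np.array([1., 2., 4.])

x = np.linalg.pinv(A) @ y                     # least-squares solution x = A^+ y

# Same result from A^+ = V D^+ U^T
U, s, Vt = np.linalg.svd(A, full_matrices=False)
x_svd = Vt.T @ np.diag(1.0 / s) @ U.T @ y     # reciprocal of the non-zero singular values
assert np.allclose(x, x_svd)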
Trace
The trace is the sum of the diagonal entries of a matrix:
\mathrm{Tr}(A) = \sum_i A_{i,i}
Useful identities:
\mathrm{Tr}(A) = \mathrm{Tr}(A^T)
\mathrm{Tr}(ABC) = \mathrm{Tr}(CAB) = \mathrm{Tr}(BCA)
Probability and Information Theory
Probability Mass Function
The domain of P must be the set of all possible states of x.
\forall x \in X, \; 0 \leq P(x) \leq 1
\sum_{x \in X} P(x) = 1
Example: uniform distribution over k states
P(X = x) = \frac{1}{k}
Probability Density Function
The domain of p must be the set of all possible states of x.
\forall x \in X, \; p(x) \geq 0
Note: we do not require p(x) \leq 1
\int p(x) \, dx = 1
Example: uniform distribution on [a, b]
u(x; a, b) = \frac{1}{b - a}
Computing Marginal Probability with the Sum Rule
\forall x \in X, \; P(X = x) = \sum_y P(X = x, Y = y)
p(x) = \int p(x, y) \, dy
Conditional Probability
P(Y = y \mid X = x) = \frac{P(Y = y, X = x)}{P(X = x)}
Bayes' Rule:
P(Y \mid X) = \frac{P(Y) P(X \mid Y)}{P(X)}
Chain Rule of Probability
P(x_1, \ldots, x_n) = P(x_1) \prod_{i=2}^{n} P(x_i \mid x_1, \ldots, x_{i-1})
Markov property:
P(x_i \mid x_{i-1}, \ldots, x_1) = P(x_i \mid x_{i-1}) \quad \text{for } i > 1
P(x_n, \ldots, x_1) = P(x_1) \prod_{i=2}^{n} P(x_i \mid x_{i-1}) = P(x_1) P(x_2 \mid x_1) \cdots P(x_n \mid x_{n-1})
Independence
\forall x \in X, y \in Y: \; p(X = x, Y = y) = p(X = x) \, p(Y = y)
Conditional independence:
\forall x \in X, y \in Y, z \in Z: \; p(X = x, Y = y \mid Z = z) = p(X = x \mid Z = z) \, p(Y = y \mid Z = z)
Expectation
\mathbb{E}_{x \sim P}[f(x)] = \sum_x P(x) f(x)
\mathbb{E}_{x \sim p}[f(x)] = \int p(x) f(x) \, dx
Linearity of expectations:
\mathbb{E}_x[\alpha f(x) + \beta g(x)] = \alpha \mathbb{E}_x[f(x)] + \beta \mathbb{E}_x[g(x)]
Variance and Covariance
\mathrm{Var}(f(x)) = \mathbb{E}[(f(x) - \mathbb{E}[f(x)])^2]
\mathrm{Cov}(f(x), g(y)) = \mathbb{E}[(f(x) - \mathbb{E}[f(x)])(g(y) - \mathbb{E}[g(y)])]
Covariance matrix:
\mathrm{Cov}(\mathbf{x})_{i,j} = \mathrm{Cov}(x_i, x_j)
Bernoulli Distribution
P(X = 1) = \phi
P(X = 0) = 1 - \phi
P(X = x) = \phi^x (1 - \phi)^{1-x}
\mathbb{E}_x[X] = \phi
\mathrm{Var}_x(X) = \phi(1 - \phi)
Gaussian Distribution
Parametrized by variance:
\mathcal{N}(x; \mu, \sigma^2) = \sqrt{\frac{1}{2\pi\sigma^2}} \exp\left(-\frac{1}{2\sigma^2}(x - \mu)^2\right)
Parametrized by precision:
\mathcal{N}(x; \mu, \beta^{-1}) = \sqrt{\frac{\beta}{2\pi}} \exp\left(-\frac{1}{2}\beta(x - \mu)^2\right)
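A quick numpy check (example values mine) that the two parameterizations agree when the precision is β = 1/σ²:

import numpy as np

def gaussian_pdf(x, mu, sigma2):
    # N(x; mu, sigma^2), parameterized by the variance
    return np.sqrt(1.0 / (2.0 * np.pi * sigma2)) * np.exp(-(x - mu) ** 2 / (2.0 * sigma2))

def gaussian_pdf_precision(x, mu, beta):
    # N(x; mu, beta^{-1}), parameterized by the precision beta = 1 / sigma^2
    return np.sqrt(beta / (2.0 * np.pi)) * np.exp(-0.5 * beta * (x - mu) ** 2)

x = np.linspace(-3.0, 3.0, 7)
assert np.allclose(gaussian_pdf(x, 0.0, 2.0), gaussian_pdf_precision(x, 0.0, 0.5))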
Gaussian Distribution
Multivariate Gaussian
\mathcal{N}(x; \mu, \Sigma) = \sqrt{\frac{1}{(2\pi)^n \det(\Sigma)}} \exp\left(-\frac{1}{2}(x - \mu)^T \Sigma^{-1} (x - \mu)\right)
Example covariance matrices:
\Sigma = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} \quad \Sigma = \begin{bmatrix} 0.5 & 0 \\ 0 & 1 \end{bmatrix} \quad \Sigma = \begin{bmatrix} 1 & 0 \\ 0 & 2 \end{bmatrix}
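A small sampling sketch (sample size arbitrary) showing how the three example covariance matrices shape the spread of draws:

import numpy as np

rng = np.random.default_rng(0)
mu = np.zeros(2)

# the three example covariance matrices from the slide
for Sigma in ([[1.0, 0.0], [0.0, 1.0]],
              [[0.5, 0.0], [0.0, 1.0]],
              [[1.0, 0.0], [0.0, 2.0]]):
    samples = rng.multivariate_normal(mu, Sigma, size=5000)
    print(np.cov(samples.T))   # empirical covariance, close to Sigma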
Empirical Distribution
\hat{p}(x) = \frac{1}{m} \sum_{i=1}^{m} \delta(x - x^{(i)})
Mixture Distribution
P(x) = \sum_i P(c = i) \, P(x \mid c = i)
Activation Functions
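The original slide is a plot; which activations it shows is not recoverable from the text, but a minimal numpy sketch of the usual candidates (sigmoid, tanh, ReLU) looks like:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))   # squashes to (0, 1); saturates for large |z|

def relu(z):
    return np.maximum(0.0, z)         # zero for negative inputs, identity otherwise

z = np.linspace(-5.0, 5.0, 101)
print(sigmoid(z).min(), sigmoid(z).max())
print(np.tanh(z).min(), np.tanh(z).max())   # tanh squashes to (-1, 1)
print(relu(z).min(), relu(z).max())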
Vanishing Gradient Problem
Information Theory
Information:
I(x) = -\log P(x)
Shannon's Entropy:
H(X) = \mathbb{E}_{X \sim P}[I(x)] = -\mathbb{E}_{X \sim P}[\log P(x)]
KL divergence:
D_{KL}(P \| Q) = \mathbb{E}_{X \sim P}\left[\log \frac{P(x)}{Q(x)}\right] = \mathbb{E}_{X \sim P}[\log P(x) - \log Q(x)]
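A small numpy sketch (distributions chosen arbitrarily, natural log) of these quantities for discrete distributions with strictly positive probabilities; it also previews the asymmetry shown on the next slide:

import numpy as np

def entropy(p):
    # H(P) = -sum_x P(x) log P(x), in nats
    return -np.sum(p * np.log(p))

def kl(p, q):
    # D_KL(P || Q) = sum_x P(x) (log P(x) - log Q(x))
    return np.sum(p * (np.log(p) - np.log(q)))

p = np.array([0.5, 0.5])
q = np.array([0.9, 0.1])
print(entropy(p))          # log 2 ≈ 0.693 nats
print(kl(p, q), kl(q, p))  # generally not equal: KL divergence is asymmetric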
Entropy of a Bernoulli Variable
KL Divergence is Asymmetric
Directed Model
p(a, b, c, d, e) = p(a) \, p(b \mid a) \, p(c \mid a, b) \, p(d \mid b) \, p(e \mid c)
Undirected Model
p(a, b, c, d, e) = \frac{1}{Z} \, \phi^{(1)}(a, b, c) \, \phi^{(2)}(b, d) \, \phi^{(3)}(c, e)
Numerical Computation
Numerical concerns for implementation
Algorithms are often specified in terms of real numbers, but \mathbb{R} cannot be represented exactly on a finite computer.
To implement deep learning algorithms with a finite number of bits, we rely on iterative optimization:
• Gradient descent
• Curvature
Gradient
Gradient Descent
Gradient Descent Algorithm
repeat until convergence {
  \theta_0 := \theta_0 - \alpha \frac{1}{m} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)})
  \theta_1 := \theta_1 - \alpha \frac{1}{m} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)}) \cdot x^{(i)}
}
(update \theta_0 and \theta_1 simultaneously)
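As a concrete sketch (data and hyperparameters invented here), the same update for a linear model h_\theta(x) = \theta_0 + \theta_1 x in numpy:

import numpy as np

# Toy data: y ≈ 2 + 3x plus a little noise (values chosen arbitrarily)
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=100)
y = 2.0 + 3.0 * x + 0.1 * rng.normal(size=100)

theta0, theta1 = 0.0, 0.0
alpha = 0.1                      # learning rate
for _ in range(2000):
    h = theta0 + theta1 * x      # predictions h_theta(x^(i))
    error = h - y
    grad0 = np.mean(error)       # (1/m) sum_i (h - y)
    grad1 = np.mean(error * x)   # (1/m) sum_i (h - y) * x
    # simultaneous update of theta0 and theta1
    theta0, theta1 = theta0 - alpha * grad0, theta1 - alpha * grad1

print(theta0, theta1)            # should approach 2 and 3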
Approximate Optimization
Usually don't even reach a local minimum
Optimization: Pure math vs Deep Learning
Pure Math (calculus: setting the derivative to zero, Lagrange multipliers)
• Find literally the smallest value of f(x)
• Or maybe: find some critical point of f(x) where the value is locally smallest, by solving equations
Deep Learning
• Just iteratively decrease the value of f(x) until a point of convergence is approached
Critical Points
Directional Second Derivatives
Predicting optimal step size using Taylor series
f(x^{(0)} - \epsilon g) \approx f(x^{(0)}) - \epsilon g^T g + \frac{1}{2} \epsilon^2 g^T H g
\epsilon^* = \frac{g^T g}{g^T H g}
Numerator: big gradients speed you up.
Denominator: big eigenvalues slow you down if you align with their eigenvectors.
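A tiny numpy illustration (quadratic and starting point made up) of the predicted optimal step size for f(x) = ½ xᵀHx:

import numpy as np

H = np.diag([1.0, 100.0])         # Hessian of f(x) = 0.5 x^T H x (ill-conditioned on purpose)
x0 = np.array([1.0, 1.0])

g = H @ x0                        # gradient of f at x0
eps_star = (g @ g) / (g @ H @ g)  # step size predicted by the second-order Taylor expansion

x1 = x0 - eps_star * g            # one gradient step using the predicted step size
print(eps_star, x1)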
Gradient Descent and Poor Conditioning
Neural net visualization
At the end of learning:
• the gradient is still large
• the curvature is huge
Python and Numpy Review Demo
Questions