Lec1 Mathreview

The document outlines a lecture on machine learning, covering topics such as linear algebra, probability, information theory, and numerical computation. It uses the analogy of building a rocket to explain the components of machine learning algorithms, emphasizing the importance of data, mathematical principles, and computational techniques. Additionally, it includes a review of Python and Numpy for practical implementation in machine learning tasks.


Math Review

Lecture 1

Machine Learning Decal


Hosted by Machine Learning at Berkeley

1
Overview

Agenda
Let’s build a rocket
Linear Algebra
Probability and Information Theory
Numerical Computation
Python and Numpy Review Demo
Questions

2
Let’s build a rocket
Designing an ML algorithm is like building a rocket

• Fuel: Data
• The design of the rocket propulsion system: Linear Algebra
• The knowledge of physics and chemistry needed to ensure that the combustion of fuel provides enough thrust: Statistics (Probability and Information Theory)
• The principles of engineering to close the gap between the ideal and the reality: Numerical Computation

3
Linear Algebra
Scalars

A scalar is a single number


Integers Z, real numbers R, rational numbers Q, etc.
Example notation: Italic font x, y, m, n, a

4
Vectors

A vector is a 1-D array of numbers:

\vec{v} = \begin{bmatrix} v_1 \\ v_2 \\ \vdots \\ v_m \end{bmatrix}   (0.1)

The entries can be integers Z, real numbers R, rational numbers Q, binary, etc.

Example notation for type and size:

\mathbb{Z}^N   (0.2)

5
Matrices

A matrix is a 2-D array of numbers:

M = \begin{bmatrix} M_{1,1} & M_{1,2} \\ M_{2,1} & M_{2,2} \end{bmatrix}   (0.3)

Example notation for type and shape:

M \in \mathbb{Q}^{m \times n}   (0.4)

6
Tensors

A tensor is an n-D array of numbers and can have an arbitrary number of dimensions (axes).
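A minimal NumPy sketch of the scalar/vector/matrix/tensor hierarchy (the particular shapes here are arbitrary choices for illustration):

import numpy as np

s = np.float64(3.0)                      # scalar
v = np.array([1, 2, 3])                  # vector: 1-D array, shape (3,)
M = np.array([[1, 2], [3, 4]])           # matrix: 2-D array, shape (2, 2)
T = np.zeros((2, 3, 4))                  # tensor: here a 3-D array, shape (2, 3, 4)
print(v.ndim, M.ndim, T.ndim)            # 1 2 3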

7
Matrix Transpose

(A^T)_{i,j} = A_{j,i}

The transpose of a matrix is like a reflection along the main diagonal.

A = \begin{bmatrix} A_{1,1} & A_{1,2} \\ A_{2,1} & A_{2,2} \\ A_{3,1} & A_{3,2} \end{bmatrix} \implies A^T = \begin{bmatrix} A_{1,1} & A_{2,1} & A_{3,1} \\ A_{1,2} & A_{2,2} & A_{3,2} \end{bmatrix}   (0.5)

(AB)^T = B^T A^T
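A quick NumPy check of these identities (the shapes below are arbitrary choices for illustration):

import numpy as np

A = np.arange(6).reshape(3, 2)            # 3x2 matrix
B = np.random.rand(2, 4)                  # 2x4 matrix

print(A.T.shape)                          # transpose flips the shape to (2, 3)
print(np.allclose((A @ B).T, B.T @ A.T))  # (AB)^T == B^T A^T -> True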

8
Matrix Multiplication (Dot Product)

C = AB

C_{i,j} = \sum_k A_{i,k} B_{k,j}

Element-wise multiplication (Hadamard product): A \odot B
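In NumPy the two products are written differently; a minimal sketch with small example matrices:

import numpy as np

A = np.array([[1., 2.], [3., 4.]])
B = np.array([[5., 6.], [7., 8.]])

C = A @ B          # matrix (dot) product: C[i, j] = sum_k A[i, k] * B[k, j]
H = A * B          # Hadamard (element-wise) product
print(C)
print(H)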


9
Identity Matrix

I_3 = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}

\forall x \in \mathbb{R}^n, \; I_n x = x

10
Systems of Equations

11
Solving Systems of Equations using Matrix Inversion

A^{-1} A = I_n

Numerically unstable, but useful for abstract analysis.

Ax = b

A^{-1} A x = A^{-1} b

I_n x = A^{-1} b

12
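A small NumPy sketch contrasting the explicit inverse with a factorization-based solver (the matrix and right-hand side are made up for illustration):

import numpy as np

A = np.array([[3., 1.], [1., 2.]])
b = np.array([9., 8.])

x_inv   = np.linalg.inv(A) @ b      # explicit inverse: fine for analysis, less stable numerically
x_solve = np.linalg.solve(A, b)     # factorization-based solve: preferred in practice
print(np.allclose(x_inv, x_solve))  # True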
Invertibility

A matrix can't be inverted if it has:

• More rows than columns
• More columns than rows
• Redundant rows/columns (linearly dependent / low rank)

13
Norms

1) Functions that measure the "size" of a vector
2) Similar to the distance between zero and the point represented by the vector

• f(x) = 0 \implies x = 0
• f(x + y) \le f(x) + f(y) (the triangle inequality)
• \forall a \in \mathbb{R}, \; f(ax) = |a| f(x)

14
Norms

• L^p norm:
  ||x||_p = \left( \sum_i |x_i|^p \right)^{1/p}

• L^1 norm, p = 1:
  ||x||_1 = \sum_i |x_i|

• L^2 norm, p = 2 (most popular):
  ||x||_2 = \left( \sum_i |x_i|^2 \right)^{1/2}

• Max norm, infinite p:
  ||x||_\infty = \max_i |x_i|
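These norms can be checked with np.linalg.norm; a minimal sketch with a made-up vector:

import numpy as np

x = np.array([3., -4., 12.])

print(np.linalg.norm(x, 1))       # L1 norm: sum of absolute values -> 19.0
print(np.linalg.norm(x, 2))       # L2 norm: Euclidean length -> 13.0
print(np.linalg.norm(x, np.inf))  # max norm: largest |x_i| -> 12.0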

15
Special Matrices and Vectors

• Unit vector:
  ||x||_2 = 1

• Symmetric matrix:
  A = A^T

• Orthogonal matrix:
  A^T A = A A^T = I
  A^{-1} = A^T

16
Eigendecomposition

• Eigenvector and eigenvalue:
  Av = \lambda v

• Eigendecomposition of a diagonalizable matrix:
  A = V \, \mathrm{diag}(\lambda) \, V^{-1}

• Every real symmetric matrix has a real, orthogonal eigendecomposition:
  A = Q D Q^T
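A minimal NumPy sketch for a small symmetric matrix (chosen arbitrarily for illustration), using np.linalg.eigh:

import numpy as np

A = np.array([[2., 1.], [1., 2.]])           # real symmetric matrix

w, Q = np.linalg.eigh(A)                     # eigenvalues w, orthogonal eigenvectors Q
print(w)                                     # [1. 3.]
print(np.allclose(A, Q @ np.diag(w) @ Q.T))  # A == Q D Q^T -> True
print(np.allclose(Q.T @ Q, np.eye(2)))       # Q is orthogonal -> True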

17
Effect of Eigenvalues

18
Singular Value Decomposition

Similar to eigendecomposition but more general: the matrix need not be square.

A = U D V^T
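A minimal NumPy sketch on a random non-square matrix:

import numpy as np

A = np.random.rand(4, 3)                 # non-square matrix

U, s, Vt = np.linalg.svd(A, full_matrices=False)
D = np.diag(s)                           # singular values on the diagonal
print(np.allclose(A, U @ D @ Vt))        # A == U D V^T -> True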

19
Moore-Penrose Pseudoinverse

x = A^+ y

If the equation Ax = y has:

• Exactly one solution: this is the same as the inverse.
• No solution: this gives us the solution with the smallest error ||Ax - y||_2
• Many solutions: this gives us the solution with the smallest norm ||x||_2
• Use the SVD to compute the pseudoinverse:
  A^+ = V D^+ U^T
  D^+: take the reciprocal of the non-zero entries (then transpose)
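A minimal NumPy sketch on a random overdetermined system (more equations than unknowns), comparing np.linalg.pinv with the least-squares solver:

import numpy as np

A = np.random.rand(5, 3)                      # tall matrix: more equations than unknowns
y = np.random.rand(5)

x = np.linalg.pinv(A) @ y                     # least-squares solution via the pseudoinverse
x_ls, *_ = np.linalg.lstsq(A, y, rcond=None)
print(np.allclose(x, x_ls))                   # both give the minimum-error solution -> True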
21
Trace

Trace is the sum of the diagonal entries of a matrix.

Tr(A) = \sum_i A_{i,i}

Useful identities:

Tr(A) = Tr(A^T)

Tr(ABC) = Tr(CAB) = Tr(BCA)
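A quick NumPy check of both identities on random matrices:

import numpy as np

A, B, C = (np.random.rand(3, 3) for _ in range(3))

print(np.isclose(np.trace(A), np.trace(A.T)))                # True
print(np.isclose(np.trace(A @ B @ C), np.trace(C @ A @ B)))  # cyclic permutation -> True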

22
Probability and Information Theory
Probability Mass Function

The domain of P must be the set of all possible states of x.

\forall x \in \mathcal{X}, \; 0 \le P(x) \le 1

\sum_{x \in \mathcal{X}} P(x) = 1

Example: uniform distribution over k states

P(X = x) = \frac{1}{k}

23
Probability Density Function

The domain of p must be the set of all possible states of x.

\forall x \in \mathcal{X}, \; p(x) \ge 0

Note: we do not require p(x) \le 1

\int p(x) \, dx = 1

Example: uniform distribution on [a, b]

u(x; a, b) = \frac{1}{b - a}

24
Computing Marginal Probability with the Sum Rule

\forall x \in \mathcal{X}, \; P(X = x) = \sum_y P(X = x, Y = y)

p(x) = \int p(x, y) \, dy

25
Conditional Probability

P(Y = y | X = x) = \frac{P(Y = y, X = x)}{P(X = x)}

Bayes' Rule:

P(Y | X) = \frac{P(Y) \, P(X | Y)}{P(X)}
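A small worked example of Bayes' rule; the numbers are hypothetical and chosen only for illustration:

# Hypothetical numbers: a test with 99% sensitivity, 95% specificity,
# and a 1% base rate of the condition.
p_d = 0.01                     # P(Y): prior probability of the condition
p_pos_given_d = 0.99           # P(X | Y): test positive given condition
p_pos_given_not_d = 0.05       # P(X | not Y): false positive rate

p_pos = p_pos_given_d * p_d + p_pos_given_not_d * (1 - p_d)   # P(X) via the sum rule
p_d_given_pos = p_pos_given_d * p_d / p_pos                   # Bayes' rule
print(p_d_given_pos)           # ~0.167: a positive test is far from conclusive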
26
Chain Rule of Probability

P(x_1, ..., x_n) = P(x_1) \prod_{i=2}^{n} P(x_i | x_1, ..., x_{i-1})

Markov property:

P(x_i | x_{i-1}, ..., x_1) = P(x_i | x_{i-1}) for i > 1

P(x_n, ..., x_1) = P(x_1) \prod_{i=2}^{n} P(x_i | x_{i-1}) = P(x_1) P(x_2 | x_1) \cdots P(x_n | x_{n-1})

27
Independence

\forall x \in \mathcal{X}, y \in \mathcal{Y}:

p(X = x, Y = y) = p(X = x) \, p(Y = y)

Conditional Independence:

\forall x \in \mathcal{X}, y \in \mathcal{Y}, z \in \mathcal{Z}:

p(X = x, Y = y | Z = z) = p(X = x | Z = z) \, p(Y = y | Z = z)

28
Expectation

E_{x \sim P}[f(x)] = \sum_x P(x) f(x)

E_{x \sim p}[f(x)] = \int p(x) f(x) \, dx

Linearity of Expectations:

E_x[\alpha f(x) + \beta g(x)] = \alpha E_x[f(x)] + \beta E_x[g(x)]
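A Monte Carlo sketch: with enough samples, the sample mean approximates the expectation, and linearity holds exactly for sample averages:

import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=1_000_000)   # samples from U(0, 1)

print(x.mean())          # ~0.5, estimate of E[x]
print((x**2).mean())     # ~0.333, estimate of E[x^2] (exact value is 1/3)
# linearity: E[2 f(x) + 3 g(x)] = 2 E[f(x)] + 3 E[g(x)]
print(np.isclose((2*x**2 + 3*x).mean(), 2*(x**2).mean() + 3*x.mean()))  # True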

29
Variance and Covariance

Var(f(x)) = E[(f(x) - E[f(x)])^2]

Cov(f(x), g(y)) = E[(f(x) - E[f(x)]) (g(y) - E[g(y)])]

Covariance matrix:

Cov(x)_{i,j} = Cov(x_i, x_j)
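A minimal NumPy sketch estimating a covariance matrix from samples (the correlation structure below is made up for illustration):

import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((10_000, 3))   # 10k samples of a 3-dimensional variable
X[:, 2] = X[:, 0] + 0.1 * X[:, 1]      # make the third coordinate correlated with the others

print(np.var(X[:, 0]))                 # ~1.0
print(np.cov(X, rowvar=False))         # 3x3 covariance matrix, Cov(x)_{i,j} = Cov(x_i, x_j)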

30
Bernoulli Distribution

P(X = 1) = \phi

P(X = 0) = 1 - \phi

P(X = x) = \phi^x (1 - \phi)^{1-x}

E_x[X] = \phi

Var_x(X) = \phi (1 - \phi)

31
Gaussian Distribution

Parametrized by variance:

N(x; \mu, \sigma^2) = \sqrt{\frac{1}{2\pi\sigma^2}} \, \exp\!\left(-\frac{1}{2\sigma^2} (x - \mu)^2\right)

Parametrized by precision:

N(x; \mu, \beta^{-1}) = \sqrt{\frac{\beta}{2\pi}} \, \exp\!\left(-\frac{1}{2} \beta (x - \mu)^2\right)
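A minimal sketch of the variance-parametrized density, cross-checked against SciPy (assumed to be installed) at one point:

import numpy as np
from scipy.stats import norm   # assumes SciPy is available, used only as a cross-check

def gaussian_pdf(x, mu=0.0, sigma2=1.0):
    # density of N(x; mu, sigma^2), written directly from the variance form above
    return np.sqrt(1.0 / (2 * np.pi * sigma2)) * np.exp(-(x - mu) ** 2 / (2 * sigma2))

print(gaussian_pdf(0.5))                              # ~0.352
print(np.isclose(gaussian_pdf(0.5), norm.pdf(0.5)))   # True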

32
Gaussian Distribution

33
Multivariate Gaussian

N(x; \mu, \Sigma) = \sqrt{\frac{1}{(2\pi)^n \det(\Sigma)}} \, \exp\!\left(-\frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu)\right)

\Sigma = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} \qquad
\Sigma = \begin{bmatrix} 0.5 & 0 \\ 0 & 1 \end{bmatrix} \qquad
\Sigma = \begin{bmatrix} 1 & 0 \\ 0 & 2 \end{bmatrix}

34
Empirical Distribution

\hat{p}(x) = \frac{1}{m} \sum_{i=1}^{m} \delta(x - x^{(i)})

35
Mixture Distribution

P(x) = \sum_i P(c = i) \, P(x | c = i)

36
Activation Functions

37
Vanishing Gradient Problem

38
Information Theory

Information:

I(x) = -\log P(x)

Shannon's Entropy:

H(X) = E_{X \sim P}[I(x)] = -E_{X \sim P}[\log P(x)]

KL divergence:

D_{KL}(P \| Q) = E_{X \sim P}\!\left[\log \frac{P(x)}{Q(x)}\right] = E_{X \sim P}[\log P(x) - \log Q(x)]
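A minimal sketch of these quantities for discrete distributions (natural log, so values are in nats); the distributions below are made up for illustration:

import numpy as np

def entropy(p):
    # Shannon entropy: H(P) = -sum_x P(x) log P(x)
    p = np.asarray(p)
    return -np.sum(p * np.log(p))

def kl_divergence(p, q):
    # D_KL(P || Q) = sum_x P(x) (log P(x) - log Q(x)); note it is not symmetric
    p, q = np.asarray(p), np.asarray(q)
    return np.sum(p * (np.log(p) - np.log(q)))

p = np.array([0.5, 0.5])
q = np.array([0.9, 0.1])
print(entropy(p))                                  # log 2 ~ 0.693
print(kl_divergence(p, q), kl_divergence(q, p))    # different values -> asymmetric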

39
Entropy of a Bernoulli Variable

40
KL Divergence is Asymmetric

41
Directed Model

p(a, b, c, d, e) = p(a) \, p(b | a) \, p(c | a, b) \, p(d | b) \, p(e | c)

42
Undirected Model

p(a, b, c, d, e) = \frac{1}{Z} \, \phi^{(1)}(a, b, c) \, \phi^{(2)}(b, d) \, \phi^{(3)}(c, e)

43
Numerical Computation
Numerical concerns for implementation

Algorithms are often specified in terms of real numbers, but R cannot be implemented on a finite computer.

To implement deep learning algorithms with a finite number of bits, we need iterative optimization:

• Gradient descent
• Curvature

44
Gradient

45
Gradient Descent

46
Gradient Descent Algorithm

repeat until convergence {

  \theta_0 := \theta_0 - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)

  \theta_1 := \theta_1 - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) \cdot x^{(i)}

}

Update \theta_0 and \theta_1 simultaneously.
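A minimal NumPy sketch of this update rule for the one-feature linear model h_theta(x) = theta_0 + theta_1 * x, run on synthetic data with known parameters:

import numpy as np

def gradient_descent(x, y, alpha=0.5, iters=1000):
    # batch gradient descent for h(x) = theta0 + theta1 * x,
    # updating theta0 and theta1 simultaneously as in the pseudocode above
    theta0, theta1 = 0.0, 0.0
    for _ in range(iters):
        err = theta0 + theta1 * x - y
        grad0 = err.mean()
        grad1 = (err * x).mean()
        theta0 -= alpha * grad0          # simultaneous update: both gradients are
        theta1 -= alpha * grad1          # computed before either parameter changes
    return theta0, theta1

x = np.linspace(0, 1, 50)
y = 2.0 + 3.0 * x                        # synthetic data with known parameters
print(gradient_descent(x, y))            # ~(2.0, 3.0)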

47
Approximate Optimization

48
Usually don’t even reach a local minimum

49
Optimization: Pure math vs Deep Learning

Pure Math (Calculus: setting the derivative to zero / Lagrange multipliers)

• Find literally the smallest value of f(x)
• Or maybe: find some critical point of f(x) where the value is locally smallest, by solving equations

Deep Learning

• Just decrease the value of f(x) a lot, iteratively, until a point of convergence is approached

50
Critical Points

51
Directional Second Derivatives

52
Predicting optimal step size using Taylor series

f(x^{(0)} - \epsilon g) \approx f(x^{(0)}) - \epsilon \, g^T g + \frac{1}{2} \epsilon^2 \, g^T H g

\epsilon^* = \frac{g^T g}{g^T H g}

Numerator: Big gradients speed you up
Denominator: Big eigenvalues slow you down if you align with their eigenvectors
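A small numeric sketch of the optimal step size under the quadratic model; the Hessian and gradient below are hypothetical:

import numpy as np

H = np.array([[5.0, 0.0], [0.0, 0.5]])   # hypothetical Hessian with very different eigenvalues
g = np.array([1.0, 1.0])                 # hypothetical gradient

eps_star = (g @ g) / (g @ H @ g)         # optimal step along -g under the quadratic model
print(eps_star)                          # ~0.36: the large eigenvalue keeps the step small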

53
Gradient Descent and Poor Conditioning

54
Neural net visualization

At the end of learning:

• gradient is still large
• curvature is huge

55
Python and Numpy Review Demo
Questions
