Math Review
Lecture 1
Machine Learning Decal
Hosted by Machine Learning at Berkeley
Overview
Agenda
• Let’s build a rocket
• Linear Algebra
• Probability and Information Theory
• Numerical Computation
• Python and Numpy Review Demo
• Questions
Let’s build a rocket
Designing an ML algorithm is like building a rocket:
• Fuel: Data
• The design of the rocket's propulsion system: Linear Algebra
• The knowledge of physics and chemistry needed to ensure that the combustion of fuel provides enough thrust: Statistics (Probability and Information Theory)
• The principles of engineering that close the gap between the ideal design and reality: Numerical Computation
Linear Algebra
Scalars
A scalar is a single number
Integers Z, real numbers R, rational numbers Q, etc.
Example notation: Italic font x, y, m, n, a
Vectors
A vector is a 1-D array of numbers:
\vec{v} = \begin{bmatrix} v_1 \\ v_2 \\ \vdots \\ v_m \end{bmatrix}    (0.1)
The entries can be integers Z, real numbers R, rational numbers Q, binary, etc.
Example notation for type and size:
\mathbb{Z}^N    (0.2)
Matrices
A matrix is a 2-D array of numbers:
M = \begin{bmatrix} M_{1,1} & M_{1,2} \\ M_{2,1} & M_{2,2} \end{bmatrix}    (0.3)
Example notation for type and shape:
M \in \mathbb{Q}^{m \times n}    (0.4)
Tensors
A tensor is an n-D array of numbers; it generalizes scalars (0-D), vectors (1-D), and matrices (2-D) to an arbitrary number of dimensions.
Matrix Transpose
(A^T)_{i,j} = A_{j,i}
The transpose of a matrix is like a reflection along the main diagonal.
A = \begin{bmatrix} A_{1,1} & A_{1,2} \\ A_{2,1} & A_{2,2} \\ A_{3,1} & A_{3,2} \end{bmatrix} \implies A^T = \begin{bmatrix} A_{1,1} & A_{2,1} & A_{3,1} \\ A_{1,2} & A_{2,2} & A_{3,2} \end{bmatrix}    (0.5)
(AB)^T = B^T A^T
Matrix Multiplication (Dot Product)
C = AB
C_{i,j} = \sum_k A_{i,k} B_{k,j}
Element-wise multiplication: Hadamard product A \odot B
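A quick numpy sketch (not from the slides; example values are mine) contrasting the matrix product with the Hadamard product:

import numpy as np

A = np.array([[1., 2.], [3., 4.]])
B = np.array([[5., 6.], [7., 8.]])

C = A @ B   # matrix product: C[i, j] = sum_k A[i, k] * B[k, j]
H = A * B   # Hadamard (element-wise) product: H[i, j] = A[i, j] * B[i, j]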
Identity Matrix
I_3 = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}
\forall x \in \mathbb{R}^n, \; I_n x = x
Systems of Equations
Solving Systems of Equations using Matrix Inversion
A^{-1} A = I_n
Numerically unstable, but useful for abstract analysis.
Ax = b
A^{-1} A x = A^{-1} b
I_n x = A^{-1} b
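In practice one rarely forms A^{-1} explicitly. A small numpy sketch (values chosen arbitrarily) comparing the explicit inverse with a factorization-based solver:

import numpy as np

A = np.array([[3., 1.], [1., 2.]])
b = np.array([9., 8.])

x_inv = np.linalg.inv(A) @ b     # explicit inverse: fine for analysis, numerically risky in general
x_solve = np.linalg.solve(A, b)  # preferred: solves Ax = b without forming A^{-1}
assert np.allclose(x_inv, x_solve)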
Invertibility
A matrix can't be inverted if it has:
• More rows than columns
• More columns than rows
• Redundant rows/columns (linearly dependent, i.e. low rank)
Norms
1) Functions that measure the "size" of a vector
2) Similar to the distance between zero and the point represented by the vector
• f(x) = 0 \implies x = 0
• f(x + y) \leq f(x) + f(y) (the triangle inequality)
• \forall a \in \mathbb{R}, \; f(ax) = |a| f(x)
Norms
• L^p norm:
  ||x||_p = \left( \sum_i |x_i|^p \right)^{1/p}
• L^1 norm, p = 1:
  ||x||_1 = \sum_i |x_i|
• L^2 norm, p = 2 (most popular):
  ||x||_2 = \left( \sum_i |x_i|^2 \right)^{1/2}
• Max norm, p \to \infty:
  ||x||_\infty = \max_i |x_i|
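A small numpy check (my own example vector) that the formulas above match np.linalg.norm:

import numpy as np

x = np.array([3., -4., 0.])

l1 = np.sum(np.abs(x))        # L1 norm: 7.0
l2 = np.sqrt(np.sum(x ** 2))  # L2 norm: 5.0
linf = np.max(np.abs(x))      # max norm: 4.0

assert np.isclose(l1, np.linalg.norm(x, 1))
assert np.isclose(l2, np.linalg.norm(x, 2))
assert np.isclose(linf, np.linalg.norm(x, np.inf))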
Special Matrices and Vectors
• Unit vector:
  ||x||_2 = 1
• Symmetric Matrix:
  A = A^T
• Orthogonal Matrix:
  A^T A = A A^T = I
  A^{-1} = A^T
Eigendecomposition
• Eigenvector and eigenvalue:
  Av = \lambda v
• Eigendecomposition of a diagonalizable matrix:
  A = V \, \mathrm{diag}(\lambda) \, V^{-1}
• Every real symmetric matrix has a real, orthogonal eigendecomposition:
  A = Q D Q^T
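A brief numpy illustration (matrix chosen arbitrarily) of the symmetric case A = Q D Q^T:

import numpy as np

A = np.array([[2., 1.], [1., 2.]])  # real symmetric matrix

lam, Q = np.linalg.eigh(A)          # eigenvalues and orthonormal eigenvectors
A_rebuilt = Q @ np.diag(lam) @ Q.T  # reconstruct A = Q D Q^T
assert np.allclose(A, A_rebuilt)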
Effect of Eigenvalues
Singular Value Decomposition
Similar to eigendecomposition but more general: the matrix need not be square.
A = U D V^T
Moore-Penrose Pseudoinverse
x = A^+ y
If the equation Ax = y has:
• Exactly one solution: this is the same as the inverse.
• No solution: this gives the solution with the smallest error ||Ax - y||_2.
• Many solutions: this gives the solution with the smallest norm ||x||_2.
Use the SVD to compute the pseudoinverse:
A^+ = V D^+ U^T
D^+: take the reciprocal of the non-zero entries of D, then transpose.
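A numpy sketch (data made up) of an overdetermined system solved via the pseudoinverse, checked against the SVD construction:

import numpy as np

A = np.array([[1., 0.], [0., 1.], [1., 1.]])  # more rows than columns: no exact inverse
y = np.array([1., 2., 4.])

x = np.linalg.pinv(A) @ y                     # least-squares solution x = A^+ y

# Same result from A^+ = V D^+ U^T
U, s, Vt = np.linalg.svd(A, full_matrices=False)
x_svd = Vt.T @ np.diag(1.0 / s) @ U.T @ y     # reciprocal of the non-zero singular values
assert np.allclose(x, x_svd)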
Trace
The trace is the sum of the diagonal entries of a matrix:
\mathrm{Tr}(A) = \sum_i A_{i,i}
Useful identities:
\mathrm{Tr}(A) = \mathrm{Tr}(A^T)
\mathrm{Tr}(ABC) = \mathrm{Tr}(CAB) = \mathrm{Tr}(BCA)
Probability and Information Theory
Probability Mass Function
The domain of P must be the set of all possible states of x.
\forall x \in X, \; 0 \leq P(x) \leq 1
\sum_{x \in X} P(x) = 1
Example: uniform distribution over k states
P(X = x) = \frac{1}{k}
Probability Density Function
The domain of p must be the set of all possible states of x.
\forall x \in X, \; p(x) \geq 0
Note: we do not require p(x) \leq 1
\int p(x) \, dx = 1
Example: uniform distribution on [a, b]
u(x; a, b) = \frac{1}{b - a}
Computing Marginal Probability with the Sum Rule
\forall x \in X, \; P(X = x) = \sum_y P(X = x, Y = y)
p(x) = \int p(x, y) \, dy
Conditional Probability
P(Y = y \mid X = x) = \frac{P(Y = y, X = x)}{P(X = x)}
Bayes' Rule:
P(Y \mid X) = \frac{P(Y) P(X \mid Y)}{P(X)}
Chain Rule of Probability
P(x_1, \ldots, x_n) = P(x_1) \prod_{i=2}^{n} P(x_i \mid x_1, \ldots, x_{i-1})
Markov property:
P(x_i \mid x_{i-1}, \ldots, x_1) = P(x_i \mid x_{i-1}) \quad \text{for } i > 1
P(x_n, \ldots, x_1) = P(x_1) \prod_{i=2}^{n} P(x_i \mid x_{i-1}) = P(x_1) P(x_2 \mid x_1) \cdots P(x_n \mid x_{n-1})
Independence
\forall x \in X, y \in Y: \; p(X = x, Y = y) = p(X = x) \, p(Y = y)
Conditional independence:
\forall x \in X, y \in Y, z \in Z: \; p(X = x, Y = y \mid Z = z) = p(X = x \mid Z = z) \, p(Y = y \mid Z = z)
Expectation
\mathbb{E}_{x \sim P}[f(x)] = \sum_x P(x) f(x)
\mathbb{E}_{x \sim p}[f(x)] = \int p(x) f(x) \, dx
Linearity of expectations:
\mathbb{E}_x[\alpha f(x) + \beta g(x)] = \alpha \mathbb{E}_x[f(x)] + \beta \mathbb{E}_x[g(x)]
Variance and Covariance
\mathrm{Var}(f(x)) = \mathbb{E}[(f(x) - \mathbb{E}[f(x)])^2]
\mathrm{Cov}(f(x), g(y)) = \mathbb{E}[(f(x) - \mathbb{E}[f(x)])(g(y) - \mathbb{E}[g(y)])]
Covariance matrix:
\mathrm{Cov}(\mathbf{x})_{i,j} = \mathrm{Cov}(x_i, x_j)
Bernoulli Distribution
P(X = 1) = \phi
P(X = 0) = 1 - \phi
P(X = x) = \phi^x (1 - \phi)^{1-x}
\mathbb{E}_x[X] = \phi
\mathrm{Var}_x(X) = \phi(1 - \phi)
Gaussian Distribution
Parametrized by variance:
\mathcal{N}(x; \mu, \sigma^2) = \sqrt{\frac{1}{2\pi\sigma^2}} \exp\left(-\frac{1}{2\sigma^2}(x - \mu)^2\right)
Parametrized by precision:
\mathcal{N}(x; \mu, \beta^{-1}) = \sqrt{\frac{\beta}{2\pi}} \exp\left(-\frac{1}{2}\beta(x - \mu)^2\right)
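A quick numpy check (example values mine) that the two parameterizations agree when the precision is β = 1/σ²:

import numpy as np

def gaussian_pdf(x, mu, sigma2):
    # N(x; mu, sigma^2), parameterized by the variance
    return np.sqrt(1.0 / (2.0 * np.pi * sigma2)) * np.exp(-(x - mu) ** 2 / (2.0 * sigma2))

def gaussian_pdf_precision(x, mu, beta):
    # N(x; mu, beta^{-1}), parameterized by the precision beta = 1 / sigma^2
    return np.sqrt(beta / (2.0 * np.pi)) * np.exp(-0.5 * beta * (x - mu) ** 2)

x = np.linspace(-3.0, 3.0, 7)
assert np.allclose(gaussian_pdf(x, 0.0, 2.0), gaussian_pdf_precision(x, 0.0, 0.5))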
Gaussian Distribution
Multivariate Gaussian
\mathcal{N}(x; \mu, \Sigma) = \sqrt{\frac{1}{(2\pi)^n \det(\Sigma)}} \exp\left(-\frac{1}{2}(x - \mu)^T \Sigma^{-1} (x - \mu)\right)
Example covariance matrices:
\Sigma = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} \quad \Sigma = \begin{bmatrix} 0.5 & 0 \\ 0 & 1 \end{bmatrix} \quad \Sigma = \begin{bmatrix} 1 & 0 \\ 0 & 2 \end{bmatrix}
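A small sampling sketch (sample size arbitrary) showing how the three example covariance matrices shape the spread of draws:

import numpy as np

rng = np.random.default_rng(0)
mu = np.zeros(2)

# the three example covariance matrices from the slide
for Sigma in ([[1.0, 0.0], [0.0, 1.0]],
              [[0.5, 0.0], [0.0, 1.0]],
              [[1.0, 0.0], [0.0, 2.0]]):
    samples = rng.multivariate_normal(mu, Sigma, size=5000)
    print(np.cov(samples.T))   # empirical covariance, close to Sigma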
Empirical Distribution
\hat{p}(x) = \frac{1}{m} \sum_{i=1}^{m} \delta(x - x^{(i)})
Mixture Distribution
P(x) = \sum_i P(c = i) \, P(x \mid c = i)
Activation Functions
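The original slide is a plot; which activations it shows is not recoverable from the text, but a minimal numpy sketch of the usual candidates (sigmoid, tanh, ReLU) looks like:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))   # squashes to (0, 1); saturates for large |z|

def relu(z):
    return np.maximum(0.0, z)         # zero for negative inputs, identity otherwise

z = np.linspace(-5.0, 5.0, 101)
print(sigmoid(z).min(), sigmoid(z).max())
print(np.tanh(z).min(), np.tanh(z).max())   # tanh squashes to (-1, 1)
print(relu(z).min(), relu(z).max())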
Vanishing Gradient Problem
Information Theory
Information:
I(x) = -\log P(x)
Shannon's Entropy:
H(X) = \mathbb{E}_{X \sim P}[I(x)] = -\mathbb{E}_{X \sim P}[\log P(x)]
KL divergence:
D_{KL}(P \| Q) = \mathbb{E}_{X \sim P}\left[\log \frac{P(x)}{Q(x)}\right] = \mathbb{E}_{X \sim P}[\log P(x) - \log Q(x)]
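A small numpy sketch (distributions chosen arbitrarily, natural log) of these quantities for discrete distributions with strictly positive probabilities; it also previews the asymmetry shown on the next slide:

import numpy as np

def entropy(p):
    # H(P) = -sum_x P(x) log P(x), in nats
    return -np.sum(p * np.log(p))

def kl(p, q):
    # D_KL(P || Q) = sum_x P(x) (log P(x) - log Q(x))
    return np.sum(p * (np.log(p) - np.log(q)))

p = np.array([0.5, 0.5])
q = np.array([0.9, 0.1])
print(entropy(p))          # log 2 ≈ 0.693 nats
print(kl(p, q), kl(q, p))  # generally not equal: KL divergence is asymmetric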
Entropy of a Bernoulli Variable
KL Divergence is Asymmetric
Directed Model
p(a, b, c, d, e) = p(a) \, p(b \mid a) \, p(c \mid a, b) \, p(d \mid b) \, p(e \mid c)
Undirected Model
p(a, b, c, d, e) = \frac{1}{Z} \, \phi^{(1)}(a, b, c) \, \phi^{(2)}(b, d) \, \phi^{(3)}(c, e)
Numerical Computation
Numerical concerns for implementation
Algorithms are often specified in terms of real numbers, but \mathbb{R} cannot be represented exactly on a finite computer.
To implement deep learning algorithms with a finite number of bits, we rely on iterative optimization:
• Gradient descent
• Curvature
Gradient
Gradient Descent
Gradient Descent Algorithm
repeat until convergence {
  \theta_0 := \theta_0 - \alpha \frac{1}{m} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)})
  \theta_1 := \theta_1 - \alpha \frac{1}{m} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)}) \cdot x^{(i)}
}
(update \theta_0 and \theta_1 simultaneously)
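As a concrete sketch (data and hyperparameters invented here), the same update for a linear model h_\theta(x) = \theta_0 + \theta_1 x in numpy:

import numpy as np

# Toy data: y ≈ 2 + 3x plus a little noise (values chosen arbitrarily)
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=100)
y = 2.0 + 3.0 * x + 0.1 * rng.normal(size=100)

theta0, theta1 = 0.0, 0.0
alpha = 0.1                      # learning rate
for _ in range(2000):
    h = theta0 + theta1 * x      # predictions h_theta(x^(i))
    error = h - y
    grad0 = np.mean(error)       # (1/m) sum_i (h - y)
    grad1 = np.mean(error * x)   # (1/m) sum_i (h - y) * x
    # simultaneous update of theta0 and theta1
    theta0, theta1 = theta0 - alpha * grad0, theta1 - alpha * grad1

print(theta0, theta1)            # should approach 2 and 3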
Approximate Optimization
Usually don't even reach a local minimum
Optimization: Pure math vs Deep Learning
Pure Math (calculus: setting the derivative to zero, Lagrange multipliers)
• Find literally the smallest value of f(x)
• Or maybe: find some critical point of f(x) where the value is locally smallest, by solving equations
Deep Learning
• Just iteratively decrease the value of f(x) until a point of convergence is approached
Critical Points
Directional Second Derivatives
Predicting optimal step size using Taylor series
f(x^{(0)} - \epsilon g) \approx f(x^{(0)}) - \epsilon g^T g + \frac{1}{2} \epsilon^2 g^T H g
\epsilon^* = \frac{g^T g}{g^T H g}
Numerator: big gradients speed you up.
Denominator: big eigenvalues slow you down if you align with their eigenvectors.
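A tiny numpy illustration (quadratic and starting point made up) of the predicted optimal step size for f(x) = ½ xᵀHx:

import numpy as np

H = np.diag([1.0, 100.0])         # Hessian of f(x) = 0.5 x^T H x (ill-conditioned on purpose)
x0 = np.array([1.0, 1.0])

g = H @ x0                        # gradient of f at x0
eps_star = (g @ g) / (g @ H @ g)  # step size predicted by the second-order Taylor expansion

x1 = x0 - eps_star * g            # one gradient step using the predicted step size
print(eps_star, x1)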
Gradient Descent and Poor Conditioning
Neural net visualization
At the end of learning:
• the gradient is still large
• the curvature is huge
Python and Numpy Review Demo
Questions