MSE 546: Advanced Machine Learning
Sirisha Rambhatla
University of Waterloo
Lecture 2
Math Background: Probability, Statistics and Information Theory
Outline
1 Math Background: Probability
2 Math Background: Statistics
3 Math Background: Information Theory
4 Reading
Math Background: Probability
Outline
1 Math Background: Probability
Motivation
Definitions
Random Variables
Distribution Function
Multivariate Distributions
Bayes Rule
Independence of Random Variables
2 Math Background: Statistics
3 Math Background: Information Theory
4 Reading
Math Background: Probability Motivation
Conditional Generation of Data
Prompt: “Alice in wonderland. Down the rabbit hole. Alice was getting
so very bored with her sister on the riverbank.”
Demos: DALL·E (https://dalledemo.com/) [1, 2]; Stable Diffusion (https://stablediffusion.fr/, demo on GitHub) [3]
Math Background: Probability Definitions
Probability: Terms and Notation
Sample Space: a set of all possible outcomes or realizations of some
random trial.
Example: Toss a coin twice; the sample space is Ω = {HH, HT, TH, TT}.
Event: A subset of the sample space.
Example: The event that at least one toss is a head is A = {HH, HT, TH}.
Probability: We assign a real number P (A) to each event A, called the
probability of A.
Example: In the coin-tossing example, P(A) = 3/4.
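To make the counting concrete, here is a minimal Python sketch (an illustration, not part of the original slides) that enumerates the two-toss sample space and recovers P(A) = 3/4, assuming equally likely outcomes:

    from itertools import product

    # Sample space for two coin tosses; each of the 4 outcomes is equally likely.
    omega = list(product("HT", repeat=2))   # [('H','H'), ('H','T'), ('T','H'), ('T','T')]

    # Event A: at least one toss is a head.
    A = [w for w in omega if "H" in w]

    print(len(A) / len(omega))              # 0.75 = 3/4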
Math Background: Probability Definitions
Probability Axioms
The probability P must satisfy three axioms:
1 Non-negativity: P (A) ≥ 0 for every A;
Probabilities cannot be negative!
2 Unit Measure: P(Ω) = 1;
The probability of the entire sample space is 1,
i.e., "something happens" and there are no events outside of the sample space.
3 Mutually exclusive events: If A_1, A_2, . . . are disjoint, then
P(∪_{i=1}^∞ A_i) = Σ_{i=1}^∞ P(A_i)
disjoint: A_i ∧ A_j = ∅ for i ≠ j, where ∧ is logical AND;
∪ is "union", which can be represented by ∨, logical OR.
Helps us to analyze events of interest and convert logical operations
into arithmetic
Math Background: Probability Definitions
Probability: Terms and Notation
Probability of a union of two events: The probability of event A or B
happening is given by:
P (A ∨ B) = P (A) + P (B) − P (A ∧ B)
Math Background: Probability Random Variables
Random Variables
Definition: A random variable is a function X : Ω → R that assigns a real number X(ω) to each outcome ω.
Example: In coin tossing, we let H → 1 and T → 0. The event "at least one toss is a head" can then be written as X_1 + X_2 > 0, where X_1 and X_2 are the random variables (i.e., 1 or 0) corresponding to the first toss and the second toss, respectively.
Two Types: Discrete (e.g. Bernoulli in Coin toss) and Continuous (e.g.
Gaussian)
Math Background: Probability Random Variables
From Random Variables to Data
Data: The data are specific realizations of random variables. For example,
(X1 = 1, X2 = 0), (X1 = 1, X2 = 1), (X1 = 0, X2 = 0)
are 3 observations from the coin toss experiments (note that each
experiment involves tossing twice).
Statistic: A statistic is any function of the data or random variables, e.g., the mean, variance, etc.
Math Background: Probability Distribution Function
Distribution Function: Discrete Random Variable
Definition: Suppose X is a random variable and x is a specific value that
it can take, then
For a discrete r.v. X, the probability mass function is defined as
f_X(x) = P(X = x): the probability that X takes the value x
Example: For a fair coin, P(X = 1) = 1/2, where X is either 0 ('T') or 1 ('H').
Math Background: Probability Distribution Function
Distribution Function: Continuous Random Variable
Definition: Suppose X is a random variable and x is a specific value that
it can take, then
For a continuous r.v. X, f_X(x) ≥ 0 is the probability density function; for every a ≤ b:
P(a ≤ X ≤ b) = ∫_a^b f(x) dx
where ∫_{−∞}^{∞} f(x) dx = 1. Note: for continuous distributions P(X = x) = 0!
Cumulative distribution function (CDF) of X: F_X(x) = P(X ≤ x). If F(x) is differentiable everywhere, f(x) = F′(x).
Math Background: Probability Distribution Function
Expectation
Expected Values
Of a function g(·) of a discrete r.v. X,
E[g(X)] = Σ_{x∈X} g(x) f(x)
Example: E[g(X)] with g(x) = x for tossing a fair coin is
µ = 1 × P(X = 1) + 0 × P(X = 0) = 1/2
Of a function g(·) of a continuous r.v. X,
E[g(X)] = ∫_{−∞}^{∞} g(x) f(x) dx
Math Background: Probability Distribution Function
Expectation
Mean and Variance: µ = E[X] is the mean; var[X] = E[(X − µ)²] is the variance. Hence, we have var[X] = E[X²] − µ².
Example: The variance in our coin-tossing example is
var[X] = E[(X − µ)²] = (1 − 1/2)² P(X = 1) + (0 − 1/2)² P(X = 0) = 1/4
Linearity of Expectation: For r.v.s X_1, . . . , X_n:
E[Σ_{i=1}^n X_i] = Σ_{i=1}^n E[X_i]
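A quick simulation sketch (coins encoded as 0/1; the seed and sample size are arbitrary choices) that checks the mean, the variance, and linearity of expectation numerically:

    import numpy as np

    rng = np.random.default_rng(0)
    x1 = rng.integers(0, 2, 100_000)   # first toss: 0 ('T') or 1 ('H')
    x2 = rng.integers(0, 2, 100_000)   # second toss

    print(x1.mean())          # ≈ 0.50 = µ
    print(x1.var())           # ≈ 0.25 = var[X]
    print((x1 + x2).mean())   # ≈ 1.00 = E[X1] + E[X2], by linearity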
Math Background: Probability Distribution Function
Common Distributions
Discrete variable               Probability function                  Mean
Uniform X ∼ U[1, . . . , N]     1/N                                   (N + 1)/2
Binomial X ∼ Bin(n, p)          (n choose x) p^x (1 − p)^(n−x)        np
Geometric X ∼ Geom(p)           (1 − p)^(x−1) p                       1/p
Poisson X ∼ Poisson(λ)          e^(−λ) λ^x / x!                       λ

Continuous variable             Probability density function          Mean
Uniform X ∼ U(a, b)             1/(b − a)                             (a + b)/2
Gaussian X ∼ N(µ, σ²)           (1/(√(2π) σ)) exp(−(x − µ)²/(2σ²))    µ
Gamma X ∼ Γ(α, β) (x ≥ 0)       x^(α−1) e^(−x/β) / (Γ(α) β^α)         αβ
Exponential X ∼ Exp(β)          (1/β) e^(−x/β)                        β
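As a sanity check on the table, one can draw samples from each distribution and compare sample means against the listed formulas. A minimal numpy sketch (the parameter values and seed are arbitrary choices):

    import numpy as np

    rng = np.random.default_rng(0)
    m = 100_000

    checks = {
        "Binomial, np = 3.0":       rng.binomial(10, 0.3, m),
        "Geometric, 1/p = 2.0":     rng.geometric(0.5, m),
        "Poisson, λ = 4.0":         rng.poisson(4.0, m),
        "Uniform, (a+b)/2 = 3.5":   rng.uniform(2.0, 5.0, m),
        "Gaussian, µ = 1.0":        rng.normal(1.0, 2.0, m),
        "Exponential, β = 2.0":     rng.exponential(2.0, m),
    }
    for name, x in checks.items():
        print(f"{name}: sample mean ≈ {x.mean():.3f}")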
Math Background: Probability Multivariate Distributions
Multivariate Distributions
Dealing with two random variables
fX,Y (x, y) = P (X = x, Y = y) : probability of X taking x and Y taking y
Example: Let X represent 'Vertical Jump Height' and Y represent the player (either 'SN' or 'LJ').
P(X = 40 inches, Y = 'LJ'):
the probability that the vertical jump height is 40 inches AND the player is LJ
Math Background: Probability Multivariate Distributions
Multivariate Distributions: Marginal Distribution
Marginal distribution
P(X = x) = Σ_y P(X = x, Y = y),    P(Y = y) = Σ_x P(X = x, Y = y)
Example: Represents the probability of the vertical jump being x (irrespective of who the player is).
Discrete Case:
f_X(x) = P(X = x) = Σ_y P(X = x, Y = y) = Σ_y f_{X,Y}(x, y)
Continuous Case:
f_X(x) = ∫ f_{X,Y}(x, y) dy
Math Background: Probability Multivariate Distributions
Multivariate Distributions: Conditional Distribution
Conditional distribution
P (X = x|Y = y)
Represents the probability that vertical jump height is x GIVEN that the
player is y.
P (Y = y|X = x)
Represents the probability that the player is y GIVEN that the vertical
jump height is x.
Computed as
f_{X|Y}(x|y) = P(X = x | Y = y) = P(X = x, Y = y) / P(Y = y) = f_{X,Y}(x, y) / f_Y(y)
Math Background: Probability Multivariate Distributions
Multivariate Distributions: An Example
Vertical Jump Height (in)    Player    # of times
35                           LJ        20
35                           SN        10
20                           LJ        5
20                           SN        10
Jointly:
P(X = 35, Y = LJ) = 20/(20 + 10 + 5 + 10) = 20/45 = 4/9
Marginally:
P(X = 35) = 30/45 = 2/3,    P(Y = SN) = 20/45 = 4/9
Conditionally:
P(Y = LJ | X = 35) = 20/(10 + 20) = 2/3, which is also (4/9) ÷ (2/3)
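The bookkeeping above is easy to mechanize. A small Python sketch (with the counts hard-coded from the table) that recovers the joint, marginal, and conditional probabilities:

    # Counts from the table: (height, player) -> number of times observed.
    counts = {(35, "LJ"): 20, (35, "SN"): 10, (20, "LJ"): 5, (20, "SN"): 10}
    total = sum(counts.values())                              # 45

    # Joint distribution P(X = x, Y = y).
    joint = {xy: c / total for xy, c in counts.items()}

    # Marginal P(X = 35): sum the joint over the player.
    p_x35 = sum(p for (x, y), p in joint.items() if x == 35)  # 30/45 = 2/3

    # Conditional P(Y = LJ | X = 35) = P(X = 35, Y = LJ) / P(X = 35).
    print(joint[(35, "LJ")] / p_x35)                          # 2/3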
Math Background: Probability Bayes Rule
Bayes Rule
Consider the conditional:
P(X = x | Y = y) = P(X = x, Y = y) / P(Y = y)
This relationship can be used to derive the relationship between the two conditionals P(X = x | Y = y) and P(Y = y | X = x). This is Bayes Rule, i.e.
P(Y = y | X = x) = P(X = x, Y = y) / P(X = x) = P(X = x | Y = y) P(Y = y) / P(X = x)
Here we used the fact that
P(X = x, Y = y) = P(X = x | Y = y) P(Y = y)
Math Background: Probability Bayes Rule
Bayes Rule
Law of Total Probability: Relates the conditional to the marginal.
For the discrete case, say X takes the values x_1, x_2, . . .; the marginal can be written as
P(Y = y) = Σ_x P(X = x, Y = y) = Σ_x P(Y = y | X = x) P(X = x)
Therefore, we have
f_Y(y) = Σ_j f_{Y|X}(y|x_j) f_X(x_j)
Math Background: Probability Bayes Rule
Bayes Rule
(Simple Form)
P(X|Y) = P(Y|X) P(X) / P(Y)
(Discrete Random Variables)
f_{X|Y}(x_i|y) = f_{Y|X}(y|x_i) f_X(x_i) / Σ_j f_{Y|X}(y|x_j) f_X(x_j)
(Continuous Random Variables)
f_{X|Y}(x|y) = f_{Y|X}(y|x) f_X(x) / ∫ f_{Y|X}(y|x) f_X(x) dx
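A minimal numeric sketch of the discrete form. The scenario (a rare condition and an imperfect test, with made-up numbers) is purely illustrative:

    import numpy as np

    prior      = np.array([0.99, 0.01])   # P(X): [healthy, sick]
    likelihood = np.array([0.05, 0.90])   # P(Y = positive | X) for each state

    # Bayes rule: posterior ∝ likelihood × prior; the denominator is the
    # law-of-total-probability sum over the states of X.
    posterior = likelihood * prior / np.sum(likelihood * prior)
    print(posterior)                      # ≈ [0.846, 0.154]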
Math Background: Probability Independence of Random Variables
Independence
Independent Variables X and Y are independent if and only if:
P (X = x, Y = y) = P (X = x)P (Y = y)
or fX,Y (x, y) = fX (x)fY (y) for all values x and y.
IID variables: Independent and identically distributed (IID) random
variables are drawn from the same distribution and are all mutually
independent.
Math Background: Probability Independence of Random Variables
Correlation
Covariance:
cov(X, Y) = E[(X − µ_X)(Y − µ_Y)]
Correlation coefficient:
corr(X, Y) = cov(X, Y)/(σ_X σ_Y)
Independence ⇒ Uncorrelated (corr(X, Y) = 0).
However, the reverse is generally not true: e.g., for X symmetric around 0 and Y = X², corr(X, Y) = 0 even though Y is a deterministic function of X.
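A short simulation of the counterexample mentioned above (Y = X² with X symmetric around zero), sketched with numpy:

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.normal(0.0, 1.0, 1_000_000)
    y = x ** 2                          # fully determined by x, hence dependent

    print(np.corrcoef(x, y)[0, 1])      # ≈ 0: uncorrelated, yet not independent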
Math Background: Statistics
Outline
1 Math Background: Probability
2 Math Background: Statistics
Motivation
Point Estimates
Bayesian Statistics
3 Math Background: Information Theory
4 Reading
Math Background: Statistics Motivation
Model fitting from Data
The process of estimating parameters θ from D is called model fitting, or
training, and is at the heart of machine learning.
[Figures: fitting a line (linear model) to data vs. fitting a polynomial (degree d > 1)]
BUT
How do we know which model to fit?
How do we learn the parameters of these models?
Math Background: Statistics Motivation
Statistics: Preliminaries
Suppose X_1, . . . , X_N are random variables. You may have seen empirical estimates of the mean and variance as follows:
Sample Mean:
X̄ = (1/N) Σ_{i=1}^N X_i
Sample Variance:
S²_{N−1} = (1/(N − 1)) Σ_{i=1}^N (X_i − X̄)²
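A minimal numpy sketch of both estimators (ddof=1 gives the N − 1 denominator of the sample variance; the data are made up):

    import numpy as np

    x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

    print(x.mean())         # sample mean: 5.0
    print(x.var(ddof=1))    # sample variance with N − 1 denominator: ≈ 4.571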
Math Background: Statistics Point Estimates
Point Estimation
Definition The point estimator θ̂N is a function of samples X1 , . . . , XN
that approximates a parameter θ∗ of the distribution of Xi .
Bias: The bias of an estimator is
bias(θ̂_N) = E_θ[θ̂_N] − θ*
An estimator is unbiased if E_θ[θ̂_N] = θ*.
Why care about unbiasedness? If an estimator is unbiased then it gives
you the exact parameter of interest (in expectation)!
Math Background: Statistics Point Estimates
Unbiased Estimators
Is unbiasedness enough?
Example: An estimator that looks at only one data point is also unbiased:
θ̂(D) = x1
What is the problem with it? It will not generalize to new data!
Math Background: Statistics Point Estimates
Variance is also important
So the variance of an estimator is also important:
Var[θ̂] := E[θ̂²] − (E[θ̂])²
How low can this go? This is answered by the celebrated Cramér-Rao Lower Bound. (Out of the scope of this course.)
Math Background: Statistics Point Estimates
Bias-Variance Trade-off
A fundamental trade-off that needs to be made when picking a method for parameter estimation, assuming that the goal is to minimize the mean squared error (MSE) of an estimate.
Let
θ∗ be the true parameter
θ̂ = θ̂(D) denote the estimator
θ̄ = E[θ̂] be its expected value
mean squared error (MSE) := E[(θ̂ − θ*)²]
E[(θ̂ − θ*)²] = E[(θ̂ − θ̄ + θ̄ − θ*)²] = E[(θ̂ − θ̄)²] + (θ̄ − θ*)² = Var[θ̂] + Bias²(θ̂)
(the cross term vanishes because E[θ̂ − θ̄] = 0)
Take-away: Assuming our goal is to minimize squared error, it might be
wise to use a biased estimator, so long as it reduces our variance by more
than the square of the bias.
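A small simulation sketch of this take-away, comparing the unbiased sample variance (N − 1 denominator) with the biased MLE version (N denominator); the Gaussian setup, sample size, and seed are arbitrary choices:

    import numpy as np

    rng = np.random.default_rng(0)
    true_var, n, trials = 4.0, 10, 200_000
    data = rng.normal(0.0, np.sqrt(true_var), size=(trials, n))

    for name, ddof in [("unbiased (N-1)", 1), ("MLE (N)", 0)]:
        est  = data.var(axis=1, ddof=ddof)
        bias = est.mean() - true_var
        mse  = np.mean((est - true_var) ** 2)
        print(f"{name}: bias ≈ {bias:+.3f}, MSE ≈ {mse:.3f}")
    # The biased MLE estimator achieves the lower MSE in this setup.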
Math Background: Statistics Point Estimates
Maximum Likelihood Estimation
Motivation: Most machine learning training boils down to an optimization problem of the form
θ̂ = argmin_θ L(θ)
Maximum Likelihood Estimation picks the parameters θ that assign the highest probability to the training data:
θ̂_mle := argmax_θ p(D|θ)
This is a point estimate, since it returns a single value of θ.
Math Background: Statistics Point Estimates
Maximum Likelihood Estimation
We usually assume the training examples D = {(x_n, y_n)} for n ∈ {1, . . . , N} are independently sampled from the same distribution (iid assumption), so the (conditional) likelihood p(D|θ) becomes
p(D|θ) := ∏_{n=1}^N p(y_n | x_n, θ)
It is easier to work with the log likelihood:
θ̂_mle := argmax_θ Σ_{n=1}^N log p(y_n | x_n, θ)
Math Background: Statistics Point Estimates
Maximum Likelihood Estimation
We prefer posing this as minimization of the negative log likelihood (NLL(θ)):
θ̂_mle = argmin_θ NLL(θ)                                  (1)
      = argmin_θ − Σ_{n=1}^N log p(y_n | x_n, θ)          (2)
Aside: It can be shown that the MLE achieves the Cramér-Rao lower bound, and hence has the smallest asymptotic variance of any unbiased estimator (it is asymptotically optimal).
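A minimal sketch of MLE by direct NLL minimization for a Bernoulli coin model, where the closed-form answer is simply the sample mean, so the numerical result is easy to verify (the data are made up):

    import numpy as np
    from scipy.optimize import minimize_scalar

    y = np.array([1, 0, 1, 1, 0, 1, 1, 1])          # 6 heads out of 8 tosses

    def nll(p):
        # Negative log likelihood of iid Bernoulli(p) observations.
        return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

    res = minimize_scalar(nll, bounds=(1e-6, 1 - 1e-6), method="bounded")
    print(res.x, y.mean())                          # both ≈ 0.75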
Math Background: Statistics Point Estimates
Empirical Risk Minimization
We can generalize MLE by replacing the (conditional) log loss term in NLL(θ) (1) with any loss function ℓ(y_n, θ; x_n) to get
L(θ) = (1/N) Σ_{n=1}^N ℓ(y_n, θ; x_n)                     (3)
This is known as empirical risk minimization or ERM, since it is the expected loss, where the expectation is taken wrt the empirical distribution.
Math Background: Statistics Point Estimates
Maximum A Posteriori (MAP) Estimation
Problem with MLE and ERM is Overfitting: they pick parameters that minimize loss on the training set, but this may not result in a model that has low loss on future data.
Main solution: Add a penalty term to NLL(θ) (MLE) or the ERM objective.
Maximum A Posteriori (MAP) Estimation: This penalty takes the form of assuming a prior p(θ) for the parameters θ, giving
L(θ) = −(log p(D|θ) + log p(θ))                           (4)
Minimizing this objective is equivalent to maximizing the log posterior:
θ̂ = argmax_θ log p(θ|D) = argmax_θ [log p(D|θ) + log p(θ) − const]   (5)
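Continuing the Bernoulli sketch above, a MAP version with an assumed Beta(2, 2) prior on p; the log prior acts exactly as the penalty term in eq. (4) and pulls the estimate toward 1/2:

    import numpy as np
    from scipy.optimize import minimize_scalar

    y = np.array([1, 0, 1, 1, 0, 1, 1, 1])          # 6 heads out of 8 tosses
    a, b = 2.0, 2.0                                 # assumed Beta(2, 2) prior

    def neg_log_posterior(p):
        nll       = -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
        log_prior = (a - 1) * np.log(p) + (b - 1) * np.log(1 - p)
        return nll - log_prior                      # penalized NLL, cf. eq. (4)

    res = minimize_scalar(neg_log_posterior, bounds=(1e-6, 1 - 1e-6), method="bounded")
    print(res.x)    # ≈ 0.70 = (6 + a − 1)/(8 + a + b − 2), vs. the MLE 0.75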
Math Background: Statistics Bayesian Statistics
Relationship to Bayesian Inference
The point-estimate way of thinking does not consider the uncertainty in the parameters themselves.
Inference: Modeling uncertainty about parameters using a probability
distribution (as opposed to just computing a point estimate)
The Bayesian way of doing this is by using the posterior distribution
p(θ|D), and for this we:
start with the prior distribution p(θ)
define a likelihood function p(D|θ)
use Bayes Rule to compute the posterior
p(θ|D) = p(θ) p(D|θ) / p(D)                               (6)
where the marginal likelihood p(D) is computed by marginalizing out the unknown θ.
Methods such as variational inference originate from this formulation.
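For the coin example this whole pipeline has a closed form: with a Beta prior and a Bernoulli likelihood, the posterior in eq. (6) is again a Beta distribution. A conjugate-prior sketch (prior and data assumed as in the earlier MLE/MAP sketches):

    from scipy import stats

    a, b, heads, n = 2, 2, 6, 8                        # Beta(2, 2) prior; 6/8 heads
    posterior = stats.beta(a + heads, b + n - heads)   # posterior is Beta(8, 4)

    print(posterior.mean())          # ≈ 0.667: posterior mean of θ
    print(posterior.interval(0.95))  # 95% credible interval, the uncertainty a point estimate discards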
Math Background: Information Theory
Outline
1 Math Background: Probability
2 Math Background: Statistics
3 Math Background: Information Theory
Motivation
4 Reading
Math Background: Information Theory Motivation
Learning to Generate Data
Comparing distributions is at the heart of learning generative models!
[Figure: learning in Generative Adversarial Networks]
Math Background: Information Theory Motivation
The Discrete Case: Shannon Entropy
Suppose X is a random variable which can take one of m values x_1, . . . , x_m, with probability P(X = x_i) = p_i for i ∈ {1, . . . , m}.
Entropy (Shannon Entropy) is the average amount of surprise in a random variable's outcome:
H(X) = − Σ_{i=1}^m p_i log_b p_i
"High entropy" means X comes from a near-uniform (boring) distribution;
"Low entropy" means X comes from a varied (peaks and valleys) distribution.
Math Background: Information Theory Motivation
Shannon Entropy: An Example
Entropy of a fair coin flip: For a fair coin, p_i = 0.5, so
H(X) = − Σ_{i=1}^2 0.5 log_2 0.5 = 1
Entropy of a biased coin flip: Say p_1 = 0.8; then
H(X) = −0.8 log_2 0.8 − 0.2 log_2 0.2 ≈ 0.7219
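A minimal sketch of the entropy computation, reproducing both numbers above:

    import numpy as np

    def entropy(p, base=2):
        # Shannon entropy H(X) = −Σ p_i log_b p_i (terms with p_i = 0 contribute 0).
        p = np.asarray(p, dtype=float)
        p = p[p > 0]
        return -np.sum(p * np.log(p)) / np.log(base)

    print(entropy([0.5, 0.5]))   # 1.0      (fair coin)
    print(entropy([0.8, 0.2]))   # ≈ 0.7219 (biased coin)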
Math Background: Information Theory Motivation
Conditional Entropy and Mutual Information
Conditional Entropy is the remaining entropy of a random variable Y given that the value of another random variable X is known:
H(Y|X) = Σ_{i=1}^m p(X = x_i) H(Y | X = x_i) = − Σ_{i=1}^m Σ_{j=1}^n p(x_i, y_j) log p(y_j | x_i)
Mutual Information: if Y must be transmitted, how much information about Y still needs to be transmitted if both sides already know X?
Equivalently: how many bits on average would be saved if both ends of the line knew X?
I(Y; X) = H(Y) − H(Y|X)
Notice that I(Y; X) = I(X; Y) = H(X) + H(Y) − H(X, Y)
These concepts are directly used in learning Decision Trees.
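A numeric sketch of these quantities from a small joint table (the 2×2 joint distribution is made up for illustration); it uses the chain rule H(Y|X) = H(X, Y) − H(X):

    import numpy as np

    def H(p):
        p = np.asarray(p, dtype=float).ravel()
        p = p[p > 0]
        return -np.sum(p * np.log2(p))

    pxy = np.array([[0.4, 0.1],     # assumed joint p(x, y): rows x, columns y
                    [0.1, 0.4]])
    px, py = pxy.sum(axis=1), pxy.sum(axis=0)

    H_y_given_x = H(pxy) - H(px)    # chain rule for conditional entropy
    I = H(py) - H_y_given_x         # mutual information
    print(H(py), H_y_given_x, I)    # 1.0, ≈ 0.722, ≈ 0.278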
Math Background: Information Theory Motivation
Kullback-Leibler (KL) Divergence
Kullback-Leibler divergence is a measure of distance between two distributions: a "true" distribution p(X) and an arbitrary distribution q(X):
KL(p||q) = Σ_x p(x) log (p(x)/q(x))
Cross Entropy is the expected number of bits we need to represent data coming from distribution p if we use codes based on distribution q:
H_ce(p, q) = − Σ_{i=1}^m p_i log_b q_i
Math Background: Information Theory Motivation
Relationship between KL Divergence and Cross-Entropy
Kullback-Leibler divergence can be written as
KL(p||q) = Σ_x p(x) log (p(x)/q(x))
         = Σ_x p(x) log p(x) − Σ_x p(x) log q(x)
         = −H(p) + H_ce(p, q)
where p(x) is the "true" distribution and q(x) is an arbitrary distribution.
Cross Entropy is the expected number of bits we need to represent data coming from distribution p if we use codes based on distribution q:
H_ce(p, q) = − Σ_x p(x) log_b q(x)
Therefore, KL divergence is the extra bits needed to use distribution q to represent p. It is essentially a way to compare distributions!
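A one-screen check of the identity KL(p||q) = H_ce(p, q) − H(p), with two made-up coin distributions:

    import numpy as np

    p = np.array([0.8, 0.2])    # "true" distribution
    q = np.array([0.5, 0.5])    # arbitrary model distribution

    H_p  = -np.sum(p * np.log2(p))      # entropy of p,   ≈ 0.7219
    H_ce = -np.sum(p * np.log2(q))      # cross entropy,  = 1.0
    kl   = np.sum(p * np.log2(p / q))   # KL divergence

    print(kl, H_ce - H_p)               # both ≈ 0.2781: the "extra bits"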
Reading
Outline
1 Math Background: Probability
2 Math Background: Statistics
3 Math Background: Information Theory
4 Reading
Reading
Reading
PiML1
Chapter 2 – 2.1, 2.2 (2.2.1 - 2.2.4, 2.2.5.1 - 2.2.5.3), and 2.3.
Chapter 3 – Eq. (3.1), Eq. (3.7), Sections 3.1.3 and 3.1.4
Chapter 4 – Sections 4.1, 4.2.1, 4.2.2, 4.3 (intro), 4.5 (intro about MAP), 4.6 (intro), 4.7.6 (intro), 4.7.6.1, 4.7.6.2
Chapter 5 – Section 5.1.6.1
Chapter 6 – 6.1 (6.1.1-6.1.4, 6.1.6. (intro)), 6.2 (6.2.1-6.2.2), 6.3
(6.3.1 - 6.3.2)
Reading
References I
[1] Alec Radford et al. “Learning Transferable Visual Models From
Natural Language Supervision”. In: Proceedings of the 38th
International Conference on Machine Learning. Ed. by Marina Meila
and Tong Zhang. Vol. 139. Proceedings of Machine Learning
Research. PMLR, 18–24 Jul 2021, pp. 8748–8763. url:
https://proceedings.mlr.press/v139/radford21a.html.
[2] Aditya Ramesh et al. “Hierarchical text-conditional image generation
with clip latents”. In: arXiv preprint arXiv:2204.06125 (2022).
[3] Robin Rombach et al. High-Resolution Image Synthesis with Latent
Diffusion Models. 2021. arXiv: 2112.10752 [cs.CV].