Lecture 7: Convergence and Limit Theorems
▶ Almost sure convergence
▶ Convergence in mean square
▶ Convergence in probability
▶ Convergence in distribution
Motivation: Estimating Mean
Want to estimate the mean E[X] of a distribution.
Generate X1 , X2 , . . . , Xn i.i.d. drawn from the distribution and compute the sample mean
Sn = (1/n) ∑_{i=1}^n Xi.
Does Sn converge to E[X]? In what sense? Note that Sn is a random variable.
Example: X1 , X2 , . . . , Xn i.i.d. N (0, 1)
▶ Generate 6 sets of X1 , X2 , . . . , Xn (sample paths).
▶ Note that every sample path appears to be converging to the mean 0 as n increases.
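This experiment can be sketched in a few lines (standard-library Python; the seed, path count, and horizon are illustrative choices):

```python
import random

random.seed(0)

# Simulate 6 sample paths of the running sample mean Sn of i.i.d. N(0, 1)
# draws and report Sn early (n = 10) and late (n = 10000) along each path.
n = 10_000
final_means = []
for path in range(6):
    total = 0.0
    s10 = 0.0
    for i in range(1, n + 1):
        total += random.gauss(0.0, 1.0)
        if i == 10:
            s10 = total / i
    final_means.append(total / n)
    print(f"path {path}: S_10 = {s10:+.3f}   S_10000 = {final_means[-1]:+.3f}")
```

Typically every path's late mean sits within a few hundredths of 0, while the early means scatter much more widely.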
Almost Sure Convergence
Recall that a sequence of numbers x1 , x2 , . . . converges to x if given any ϵ > 0, there
exists an n(ϵ) such that for all n ≥ n(ϵ), |xn − x| < ϵ. Equivalently, max_{i≥n} |xi − x| → 0 as n → ∞.
Consider a sequence of random variables X1, X2, . . . defined on the same probability space (Ω, F, P). For every ω ∈ Ω, we get a sample path X1(ω), X2(ω), . . ., which is a sequence of numbers.
We say Xn → X almost surely (a.s.) or with probability 1 (w.p. 1) if
P({ω : lim_{n→∞} Xn(ω) = X(ω)}) = 1.
This means that the set of ω such that the sample path or sequence X1 (ω), X2 (ω), . . .
converges to X(ω) has probability 1.
Lemma. P(max_{i≥n} |Xi − X| > ϵ) → 0 for any ϵ > 0 iff Xn → X a.s.
We prove the forward direction. We will see that the converse holds when discussing
convergence in probability.
Let Mn = max_{i≥n} |Xi − X|, which is a decreasing sequence bounded below by 0. Therefore Mn ↓ M for some M.
Then, for all ϵ > 0, P(M > ϵ) ≤ P(Mn > ϵ) → 0 as n → ∞, so P(M > ϵ) = 0 for every ϵ > 0. By continuity of P, P(M > 0) = lim_{k→∞} P(M > 1/k) = 0, i.e., P(M = 0) = 1 and Mn → 0 a.s. This is equivalent to saying that Xn → X a.s.
Example †: X1, X2, . . . i.i.d. ∼ Bern(1/2). Let Yn = 2^n ∏_{i=1}^n Xi. Show that the sequence Yn converges to 0 a.s.
For 0 < ϵ < 1, we have

P(max_{i≥n} |Yi − 0| > ϵ) ≤ P(Xi = 1 ∀ i ≤ n) = 1/2^n → 0

as n → ∞.
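A quick Monte Carlo sanity check on this bound (a sketch; the choice n = 3 and the trial count are arbitrary):

```python
import math
import random

random.seed(0)

# For Example †, Yn = 2^n * X1 * ... * Xn is nonzero only when
# X1 = ... = Xn = 1, an event of probability 2^{-n}.
n, trials = 3, 200_000
nonzero = sum(
    2**n * math.prod(random.randint(0, 1) for _ in range(n)) != 0
    for _ in range(trials)
)
frac = nonzero / trials
print(frac)  # should be close to 2^{-3} = 0.125
```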
An important example of a.s. convergence is the Strong Law of Large Numbers (SLLN): If X1, X2, . . . are i.i.d. (in fact pairwise independence suffices) with finite mean µ, then
Sn = (1/n) ∑_{i=1}^n Xi → µ a.s.
as n → ∞. Proof of the SLLN is beyond the scope of this course. See [Github:
https://github.com/wptay/aipt].
Convergence in Mean Square
We say Xn → X in mean square or L2 if
lim_{n→∞} E[(Xn − X)^2] = 0.
Let X1, X2, . . . be i.i.d. with finite mean E[X] and finite variance var(X). Then Sn → E[X] in mean square.
E[(Sn − E[X])^2] = E[((1/n) ∑_{i=1}^n (Xi − E[X]))^2]
                 = (1/n^2) E[∑_{i,j} (Xi − E[X])(Xj − E[X])]
                 = (1/n^2) ∑_{i=1}^n E[(Xi − E[X])^2]    (∵ the Xi's are independent)
                 = (1/n) var(X) → 0
as n → ∞.
Note that the proof works even if the Xi's are only pairwise independent, or even just uncorrelated.
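A numerical sanity check of the 1/n rate (a sketch with Bern(1/2) variables, for which var(X) = 1/4; the sample sizes are arbitrary):

```python
import random

random.seed(4)

# For i.i.d. Bern(1/2), the derivation gives E[(Sn - E[X])^2] = var(X)/n = 1/(4n).
def sample_mean(n: int) -> float:
    return sum(random.randint(0, 1) for _ in range(n)) / n

n, reps = 50, 20_000
mse = sum((sample_mean(n) - 0.5) ** 2 for _ in range(reps)) / reps
print(mse, 0.25 / n)  # empirical mean-square error vs. var(X)/n = 0.005
```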
Convergence in mean square does not imply convergence a.s.
Example ♣: Let X1, X2, . . . be independent random variables such that

Xn = 0 w.p. 1 − 1/n,
     1 w.p. 1/n.
E[(Xn − 0)^2] = 1/n → 0 as n → ∞. Therefore, this sequence converges to 0 in mean square.
For any 0 < ϵ < 1, using the inequalities log(1 − x) ≤ −x and ∑_{i=n}^m 1/i ≥ log(m/n),

P(max_{i≥n} |Xi| < ϵ) = lim_{m→∞} P(max_{n≤i≤m} |Xi| < ϵ) = lim_{m→∞} ∏_{i=n}^m (1 − 1/i)
                      = lim_{m→∞} exp(∑_{i=n}^m log(1 − 1/i)) ≤ lim_{m→∞} exp(−∑_{i=n}^m 1/i) ≤ lim_{m→∞} n/m = 0.
Therefore, it does not converge a.s.
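In fact the product ∏_{i=n}^m (1 − 1/i) = ∏_{i=n}^m (i − 1)/i telescopes exactly to (n − 1)/m, which makes the limit easy to check numerically (a sketch; the values of n and m are arbitrary):

```python
# For Example ♣, P(max_{n<=i<=m} |Xi| < eps) = prod_{i=n}^{m} (1 - 1/i),
# which telescopes to (n - 1)/m and hence vanishes as m -> infinity.
def tail_all_zero_prob(n: int, m: int) -> float:
    """Probability that Xn = X_{n+1} = ... = Xm = 0."""
    prob = 1.0
    for i in range(n, m + 1):
        prob *= 1.0 - 1.0 / i
    return prob

n = 10
for m in [100, 10_000, 1_000_000]:
    print(m, tail_all_zero_prob(n, m), (n - 1) / m)
```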
Convergence a.s. does not imply convergence in mean square.
From Example †, Yn converges to 0 a.s., but

E[(Yn − 0)^2] = 2^{2n} · (1/2^n) = 2^n → ∞,

so the sequence does not converge in mean square.
Convergence in Probability
A sequence X1 , X2 , . . . converges to a random variable X in probability if for any ϵ > 0,
lim_{n→∞} P(|Xn − X| > ϵ) = 0.
If Xn → X a.s., then Xn → X in probability.
For proof, see [ Github: https://github.com/wptay/aipt ].
If Xn → X a.s., then 1{|Xn − X| > ϵ} → 0 a.s. The Dominated Convergence Theorem tells us that

E[1{|Xn − X| > ϵ}] = P(|Xn − X| > ϵ) → 0.
Example ♣:

P(|Xn − 0| > ϵ) = 1/n → 0,
so the sequence converges in probability. But we saw before that it does not converge a.s.
If Xn → X in mean square, then Xn → X in probability.
From the Markov inequality,
P(|Xn − X| > ϵ) ≤ (1/ϵ^2) E[(Xn − X)^2] → 0.
The converse is not true. Example:

Xn = 0 w.p. 1 − 1/n,
     n w.p. 1/n.
Convergence in probability:
P(|Xn − 0| > ϵ) = 1/n → 0.
But,
E[(Xn − 0)^2] = n^2 · (1/n) = n → ∞.
Convergence in probability is weaker than both convergence a.s. and in mean square.
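A Monte Carlo sketch of this counterexample (the trial count and the choice n = 50 are arbitrary):

```python
import random

random.seed(1)

# Xn = n w.p. 1/n and 0 otherwise: P(|Xn| > eps) = 1/n is small,
# yet E[Xn^2] = n^2 * (1/n) = n blows up.
def sample_xn(n: int) -> int:
    return n if random.random() < 1.0 / n else 0

n, trials = 50, 100_000
draws = [sample_xn(n) for _ in range(trials)]
frac_nonzero = sum(x != 0 for x in draws) / trials
second_moment = sum(x * x for x in draws) / trials
print(frac_nonzero)    # close to 1/n = 0.02
print(second_moment)   # close to n = 50
```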
Weak Law of Large Numbers
Suppose X1, X2, . . . are such that E[Xi] = 0, E[Xi^2] = σ^2 < ∞ and E[Xi Xj] ≤ 0 for i ≠ j. Then,
(1/n) ∑_{i=1}^n Xi → 0 in probability as n → ∞.
Proof:
We have
E[((1/n) ∑_{i=1}^n Xi)^2] = (1/n^2) E[∑_i Xi^2 + 2 ∑_{i<j} Xi Xj]
                          ≤ (1/n^2) ∑_i E[Xi^2] = σ^2/n.
From Chebyshev’s inequality, we then have for any ϵ > 0,
P(|(1/n) ∑_{i=1}^n Xi| > ϵ) ≤ (1/ϵ^2) (σ^2/n) → 0,
as n → ∞.
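A quick empirical check of this Chebyshev bound (a sketch using Uniform(−1, 1) variables, which have mean 0 and σ^2 = 1/3; independence gives E[Xi Xj] = 0 ≤ 0):

```python
import random

random.seed(2)

# Chebyshev bound from the WLLN proof: P(|(1/n) sum Xi| > eps) <= sigma^2/(eps^2 n).
def sample_mean(n: int) -> float:
    return sum(random.uniform(-1, 1) for _ in range(n)) / n

n, eps, trials = 100, 0.1, 2_000
exceed = sum(abs(sample_mean(n)) > eps for _ in range(trials)) / trials
bound = (1 / 3) / (eps**2 * n)
print(exceed, bound)  # the empirical frequency stays below the bound of 1/3
```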
Convergence in Distribution
We say that Xn → X in distribution if for all continuous bounded functions f,

E[f(Xn)] → E[f(X)].
Equivalently, Xn → X in distribution ⇐⇒ FXn(t) → FX(t) for all continuity points t of FX(·). [Github: https://github.com/wptay/aipt]
Central Limit Theorem (CLT) for i.i.d. sequences.
Consider i.i.d. random variables X1, X2, . . ., with E[Xi] = µ and var(Xi) = σ^2 < ∞. Let Sn = (1/n) ∑_{i=1}^n Xi. Then

Zn = (Sn − µ)/(σ/√n) → N(0, 1) in distribution.
CLT Example
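A simulation sketch of the CLT with exponential Xi (µ = σ = 1, heavily skewed), comparing P(Zn ≤ 1) with Φ(1) ≈ 0.8413 (the sample sizes are illustrative choices):

```python
import math
import random

random.seed(3)

# CLT check: Xi ~ Exp(1), so mu = sigma = 1 and Zn = (Sn - 1) * sqrt(n).
n, reps = 400, 5_000
below = 0
for _ in range(reps):
    s_n = sum(random.expovariate(1.0) for _ in range(n)) / n
    if (s_n - 1.0) * math.sqrt(n) <= 1.0:
        below += 1
frac = below / reps
print(frac)  # should be near Phi(1) ~ 0.8413 despite the skewed Xi
```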
Characteristic Functions
Given a random variable X, the characteristic function of X is φ : R → C given by

φ(t) = E[e^{itX}] = E[cos(tX)] + i E[sin(tX)],

where i = √−1.
When X has a pdf, this is the Fourier transform of the pdf of X.
Example: The characteristic function of the Gaussian distribution N(µ, σ^2) is

φ(t) = e^{iµt − σ^2 t^2/2}.

In particular, when µ = 0, σ^2 = 1,

φ(t) = e^{−t^2/2}.
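A numerical check of this example (a sketch; the integration grid is an arbitrary choice): computing E[cos(tX)] + i E[sin(tX)] for X ∼ N(0, 1) by quadrature should reproduce e^{−t^2/2}.

```python
import math

# phi(t) = E[cos(tX)] + i E[sin(tX)] for X ~ N(0, 1), via midpoint quadrature.
def char_fn(t: float, half_width: float = 10.0, steps: int = 100_000) -> complex:
    dx = 2 * half_width / steps
    re = im = 0.0
    for k in range(steps):
        x = -half_width + (k + 0.5) * dx          # midpoint rule
        w = math.exp(-x * x / 2) / math.sqrt(2 * math.pi) * dx
        re += math.cos(t * x) * w
        im += math.sin(t * x) * w
    return complex(re, im)

for t in [0.5, 1.0, 2.0]:
    print(t, char_fn(t), math.exp(-t * t / 2))
```

The imaginary part vanishes (up to rounding) because the standard normal pdf is symmetric.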
Fourier Inversion
Suppose that φ(t) is the characteristic function of a r.v. X. If ∫ |φ(t)| dt < ∞, then X has pdf

fX(x) = (1/2π) ∫ φ(t) e^{−itx} dt.
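A numerical sketch of the inversion formula, recovering the N(0, 1) pdf from φ(t) = e^{−t^2/2} (the grid choices are arbitrary):

```python
import math

# f_X(x) = (1/2pi) * integral phi(t) e^{-itx} dt with phi(t) = exp(-t^2/2);
# only the cosine part survives because phi is real and even.
def inv_pdf(x: float, half_width: float = 30.0, steps: int = 300_000) -> float:
    dt = 2 * half_width / steps
    total = 0.0
    for k in range(steps):
        t = -half_width + (k + 0.5) * dt          # midpoint rule
        total += math.exp(-t * t / 2) * math.cos(t * x) * dt
    return total / (2 * math.pi)

for x in [0.0, 1.0, 2.0]:
    print(x, inv_pdf(x), math.exp(-x * x / 2) / math.sqrt(2 * math.pi))
```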
If Xn → X in distribution, then φn(t) → φ(t) for all t ∈ R.
This follows because φn(t) = E[e^{itXn}] = E[cos(tXn)] + i E[sin(tXn)], cos(tx) and sin(tx) are bounded continuous functions, and Xn → X in distribution.
The converse is not true. The sequence Xn ∼ N(0, n) has characteristic function φn(t) = e^{−nt^2/2} → 0 as n → ∞ for all t ≠ 0, while φn(0) = 1 for all n. Therefore, φn(t) converges pointwise.
But for all x ∈ R,

P(Xn < x) = (1/2)(1 + erf(x/√(2n))) → 1/2

as n → ∞, which implies that this sequence does not converge in distribution (the constant limit 1/2 is not a valid cdf).
Levy’s Continuity Theorem
Suppose Xn has characteristic function φn(t) → φ(t) for every t, and φ(t) is continuous at t = 0. Then there exists a r.v. X such that φ(t) is the characteristic function of X and Xn → X in distribution.
For proof, see [Github: https://github.com/wptay/aipt].
Proof of CLT for i.i.d. sequence. Without loss of generality, we assume that µ = 0, σ = 1. We have

Zn = (1/√n) ∑_{j=1}^n Xj.

Let

φn(t) = E[exp(it ∑_{j=1}^n Xj/√n)]
      = ∏_{j=1}^n E[exp(it Xj/√n)]    (∵ the Xj's are independent)
      = (E[e^{itX1/√n}])^n            (∵ the Xj's are identically distributed)
      = (φ1(t/√n))^n.
CLT Proof
From the Taylor series expansion, we have

φ1(t) = φ1(0) + φ1′(0) t + (φ1′′(0)/2!) t^2 + o(t^2).

Since φ1(0) = 1, φ1′(0) = E[iX1 e^{i0X1}] = i E[X1] = 0, and φ1′′(0) = E[(iX1)^2] = −1, we obtain

φ1(t) = 1 − t^2/2 + o(t^2).
Therefore,

φn(t) = (φ1(t/√n))^n = (1 − t^2/(2n) + o(t^2/n))^n → e^{−t^2/2}

as n → ∞,
the characteristic function of N (0, 1) (certainly continuous at t = 0). From Levy’s Continuity Theorem,
we then have
Zn → N(0, 1) in distribution.
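The last limit can be illustrated concretely (a sketch): for Rademacher X1 (±1 with probability 1/2 each, so µ = 0, σ = 1), φ1(t) = cos(t), and (φ1(t/√n))^n should approach e^{−t^2/2} as n grows.

```python
import math

# (phi_1(t/sqrt(n)))^n with phi_1(t) = cos(t) for Rademacher X1.
t = 1.5
target = math.exp(-t**2 / 2)
for n in [10, 1_000, 100_000]:
    approx = math.cos(t / math.sqrt(n)) ** n
    print(n, approx, target)
```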
CLT for Discrete R.V.
Example: X1, X2, . . . are i.i.d. ∼ Bern(1/2). Note that

Zn = (∑_{i=1}^n Xi − n/2)/(√n/2)

is discrete and has no pdf. But its cdf converges to the Gaussian cdf.
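Since the cdf of Zn here is an exact binomial sum, the convergence can be checked without simulation (a sketch; the grid of n values is arbitrary):

```python
import math

# P(Zn <= z) = P(sum Xi <= n/2 + z*sqrt(n)/2), an exact Binomial(n, 1/2) sum.
def binom_cdf_zn(n: int, z: float) -> float:
    k_max = math.floor(n / 2 + z * math.sqrt(n) / 2)
    return sum(math.comb(n, k) for k in range(k_max + 1)) / 2**n

def std_normal_cdf(z: float) -> float:
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

z = 1.0
for n in [10, 100, 2_000]:
    print(n, binom_cdf_zn(n, z), std_normal_cdf(z))
```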
CLT for Random Vectors
The CLT applies to i.i.d. sequences of random vectors.
Let X1 , X2 , . . . be i.i.d. with each having finite mean µ and non-singular covariance
matrix Σ.
Zn = (1/√n) ∑_{i=1}^n (Xi − µ) → N(0, Σ) in distribution.