
Lecture 2

Vivek Borkar

IIT BOMBAY

August 2020

Formally, a random variable X taking values in a space S is, at the very least, a map X : Ω → S. In other words, what we see is some manifestation of the underlying sample point ω through this map.

Examples: coin toss (S = {head, tail}), roll of a die (S = {1, 2, 3, 4, 5, 6}), temperature tomorrow noon (S = [22°, 38°]), · · ·

We need to make sense of probabilities of sets such as {ω ∈ Ω : X(ω) ∈ A} for suitable A, denoted by {X ∈ A} (i.e., these should be events, equivalently, members of F). For example, A could be the singleton {head} for the coin toss, or the interval [22°, 38°] for the temperature tomorrow at noon.

Notation: P(X ∈ A) for P({ω ∈ Ω : X(ω) ∈ A}), colloquially, 'the probability that X is in A'.

If we wish to specify P(X ∈ A), it makes sense to have P(X ∈ A^c) also defined, in fact as 1 − P(X ∈ A). If we assign probabilities to X ∈ A_i, i ≥ 1, it also makes sense to assign probabilities to X ∈ ∪_i A_i (X ∈ A_1 or X ∈ A_2 or · · ·) and to X ∈ ∩_i A_i (X ∈ A_1 and X ∈ A_2 and · · ·). It also makes sense to let P(X ∈ S) = 1 and P(X ∈ ∅) = 0.

⟹ We want the collection of subsets A for which we assign probabilities to {X ∈ A} to be a σ-field.

Equip S with a σ-field S (typically the Borel σ-field B(S)). We want to assign probabilities to sets of the type X^{-1}(A) := {ω ∈ Ω : X(ω) ∈ A} for A ∈ S.

⟹ For A ∈ S, we need X^{-1}(A) := {ω ∈ Ω : X(ω) ∈ A} ∈ F.

An X that satisfies this requirement, i.e.,

X : (Ω, F) → (S, B(S)) with X^{-1}(A) ∈ F ∀ A ∈ B(S),

is called a random variable (≈ a measurable map (Ω, F) → (S, S)).

We treat X, X′ as identical if X = X′ a.s. (an equivalence relation ⟹ a random variable is really an 'equivalence class').

The simplest random variable is a binary random variable taking the value 1 on (say) A ∈ F and 0 on A^c: the indicator random variable I_A(ω) := 1 if ω ∈ A and 0 otherwise.

A simple random variable X takes finitely many distinct values, say x_1, · · ·, x_n ∈ S. Letting A_i = X^{-1}({x_i}) ∈ F, the A_i's are disjoint (without loss of generality) and X = Σ_{i=1}^{n} x_i I_{A_i}. When S = R^d, we consider only nonzero x_i's.

An elementary random variable is X = Σ_i x_i I_{A_i} for a countably infinite collection {x_i} ⊂ S and disjoint {A_i} ⊂ F.

From now on, we stick to S = R, since the same reasoning applies to S = R^d, d > 1, componentwise.

Lemma: Let X ≥ 0 be a real random variable on (Ω, F, P). Then there exist simple random variables {X_n} and elementary random variables {X_n′} on (Ω, F, P) such that (i) X_n ↑ X pointwise, and (ii) X_n′ ↑ X uniformly pointwise.

Proof: Let

X_n := Σ_{k=0}^{n2^n − 1} (k/2^n) I{k/2^n ≤ X(ω) < (k+1)/2^n},

X_n′ := Σ_{k=0}^{∞} (k/2^n) I{k/2^n ≤ X(ω) < (k+1)/2^n}.

The partitions of R given by

{ [k/2^n, (k+1)/2^n), 0 ≤ k < n2^n, [n, ∞) }, n ≥ 1,

are nested, i.e., the (n+1)st partition is a refinement of the nth partition in the sense that each set in the nth partition is a union of sets in the (n+1)st partition. Hence X_n is non-decreasing. Likewise for X_n′, n ≥ 1.
Now,

0 ≤ X(ω) − X_n(ω) ≤ 2^{-n} I{X(ω) ≤ n} + X(ω) I{X(ω) > n}.

Both terms on the right → 0, the first because 2^{-n} ↓ 0 and the second because I{X(ω) > n} = 0 from some n on for each ω. Likewise,

0 ≤ X(ω) − X_n′(ω) ≤ 2^{-n} ∀ n.

The claim follows.

Example: Consider, for example, n = 1, 2, and the probability space ([0, 1], B([0, 1]), ℓ), where ℓ is the Lebesgue measure (length). Suppose X(ω) = ω², ω ∈ [0, 1]. Then

X_1(ω) = (1/4) I{1/2 ≤ ω < 1},
X_2(ω) = Σ_{i=1}^{3} (i/4)² I{i/4 ≤ ω < (i+1)/4}.

Thus when we go from n = 1 to n = 2, the function value jumps from 0 to 1/16 on [1/4, 1/2) and from 1/4 to 9/16 on [3/4, 1).

Corollary: For a real random variable X on (Ω, F, P), there exist simple random variables {X_n} and elementary random variables {X_n′} on (Ω, F, P) such that (i) X_n → X and (ii) X_n′ → X uniformly, pointwise in both cases.

Proof: Write X = X^+ − X^- for X^+ := max(X, 0), X^- := −min(X, 0), and apply the above lemma to X^+, X^-.

In fact, the converse also holds: limits of random variables are again random variables. We establish a more general statement. Say that X_n → X a.s. if P(X_n → X) = 1. This requires C := {X_n converges} ∈ F. But

C = ∩_{k=1}^{∞} ∩_{ℓ=1}^{∞} ∪_{n=1}^{∞} ∩_{m=n}^{∞} { |X_m − X_{m+ℓ}| < 1/k } ∈ F.

The set on the right is simply 'the set of ω ∈ Ω for which, for all k ≥ 1, there exists an n ≥ 1 such that for all m, i ≥ n, we have |X_m − X_i| < 1/k'; in other words, the set on which {X_n} is Cauchy, therefore convergent.

Let X = (lim_{n→∞} X_n) I_C. Then for a ≤ 0,

{X < a} = D := C ∩ ( ∪_{k=1}^{∞} ∪_{n=1}^{∞} ∩_{m=n}^{∞} { X_m < a − 1/k } ) ∈ F,

and for a > 0, {X < a} = D ∪ C^c.

The collection of sets A ⊂ R for which {X ∈ A} ∈ F forms a σ-field (this follows from the fact that for a map f : S_1 → S_2, f^{-1} preserves set operations). It contains sets of the type (−∞, a), a ∈ R. Thus it contains the σ-field generated by such sets, viz., the Borel σ-field B of R. In other words, X^{-1}(A) ∈ F for all Borel sets A ⊂ R. Hence X is a random variable.

APPENDIX ON COMPLETENESS:

A sequence {x_n} in a metric space S is Cauchy in the metric d if lim_{m,n↑∞} d(x_n, x_m) = 0. The metric d is complete, equivalently (S, d) is complete, if every Cauchy sequence converges. This property may hold for one metric and not for another even when the two are equivalent (i.e., lead to the same open sets).

For example, S = (0, 1] is not complete for d(x, y) = |x − y|, but is for d′(x, y) = |1/x − 1/y|.

Proof: The sequence x_n = 1/n is Cauchy in d but does not converge in S. (It is not Cauchy in d′.) So d is not complete. But d′ is, because {y_n} is Cauchy/convergent in d′ ⟺ {1/y_n} is Cauchy/convergent in d on [1, ∞).

(∗) Note also that any Cauchy sequence can have at most one limit point: if x_{n(k)} → x*, then lim_{m,k↑∞} |x_{n(k)} − x_m| = 0 ⟹ x_n → x*.

(†) (Bolzano–Weierstrass theorem) A bounded sequence has a convergent subsequence (i.e., at least one limit point).

Proof: Suppose {x_n} ⊂ I_0 := [a, b], −∞ < a < b < ∞. Divide I_0 into two equal closed halves, say I_0′ = [a, a + (b−a)/2] and I_0″ = [a + (b−a)/2, b]. At least one of them contains infinitely many of the x_n; call it I_1. Repeating this generates I_1, I_2, · · ·, such that there is some x_{n(k)} in I_k = [a_k, b_k], where a_k ↑, b_k ↓, |a_k − b_k| = (b − a)/2^k ↓ 0. So a_k, b_k → x* = max_k a_k = min_k b_k. Then x_{n(k)} → x*.

Proof that R^d is complete: It suffices to consider d = 1. Let {x_m} ⊂ R with lim_{n,m↑∞} |x_n − x_m| = 0. Then given ε > 0, for sufficiently large n we have |x_n − x_m| < ε ∀ m ≥ n, hence {x_k} is bounded. The claim follows by (∗), (†).
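The dyadic construction in the Lemma is easy to visualize numerically. Below is a minimal Python sketch (assuming NumPy; the function names are ours, used only for illustration) that builds X_n for the running example X(ω) = ω² on Ω = [0, 1] and checks that the approximation error is at most 2^{-n}:

    import numpy as np

    def X(omega):
        return omega ** 2                 # the random variable of the running example

    def X_n(omega, n):
        # simple approximation from the Lemma: floor X down to the dyadic grid k/2^n,
        # with value 0 on {X >= n}, exactly as in the construction
        k = np.floor(X(omega) * 2 ** n) / 2 ** n
        return np.where(X(omega) < n, k, 0.0)

    omega = np.linspace(0.0, 1.0, 10001)
    for n in (2, 4, 6, 8):
        err = np.max(X(omega) - X_n(omega, n))
        print(n, err)                     # here X <= 1 < n, so the error is at most 2^{-n}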
σ(X) := {X^{-1}(A) : A ∈ B} is a σ-field, the 'σ-field generated by X', denoted σ(X). Equivalently, it is the smallest sub-σ-field of F with respect to which X is measurable (i.e., which contains X^{-1}(A), A ∈ B), or the intersection of all sub-σ-fields of F with respect to which X is measurable. (Prove this equivalence.)

What about a possibly uncountably infinite collection of real random variables, say X_α, α ∈ I, where I is an uncountable index set? Let σ(X_α, α ∈ I) denote the smallest sub-σ-field of F with respect to which all the X_α are measurable. Let C denote the collection of countable subsets of I. Then

Lemma: σ(X_α, α ∈ I) = ∪_{J ∈ C} σ(X_α, α ∈ J).

Proof: For notational ease, write σ_J = σ(X_α, α ∈ J) for J ⊂ I, and let Λ := the right hand side above. Then Λ ⊂ σ_I. But σ_I is the smallest σ-field with respect to which all X_α, α ∈ I, are measurable. Thus it suffices to show that Λ is a σ-field.

1. Ω, ∅ ∈ σ_J ∀ J ∈ C, so Ω, ∅ ∈ Λ.

2. If A ∈ Λ, then A ∈ σ_J for some J ∈ C, therefore A^c ∈ σ_J for this J. So A^c ∈ Λ.

3. If A_i ∈ σ_{J_i} for some J_i ∈ C, then ∪_i A_i ∈ σ_J for J = ∪_i J_i ⊂ I, and J, being a countable union of countable sets, is countable, i.e., J ∈ C. Thus σ_J ⊂ Λ, so ∪_i A_i ∈ Λ.

This proves that Λ is a σ-field, completing the proof.

This result shows that σ_I contains only those sets that can be described in terms of countably many indices in I. For example, the set {max_α X_α < 1} is, in general, not in this σ-field. This is a serious limitation when, e.g., we consider I = [0, T] (say) with t ∈ I interpreted as time, that is, when X_t, t ∈ [0, T], is a random process. One then has to go beyond this simple construct.

Another important result in this context is the following.

Theorem: Let X, Y be random variables taking values in R^d, R^m resp., defined on (Ω, F, P). If σ(Y) ⊂ σ(X), then Y = h(X) for some measurable h : R^d → R^m. (The converse is easy – prove it.)

Proof. Special case: Assume Y is elementary, i.e., Y = Σ_{i=1}^{∞} a_i I_{A_i} for a disjoint family of events {A_i} that partitions Ω. Since σ(Y) ⊂ σ(X), A_i = {ω : Y(ω) = a_i} must equal {ω : X(ω) ∈ B_i} for some B_i in the Borel σ-field of R^d. Also, the {A_i} are disjoint. Define disjoint sets {C_n} by

C_1 = B_1,  C_n = B_n ∖ ∪_{m=1}^{n−1} B_m, n ≥ 2.

Then ∪_{m=1}^{n} C_m = ∪_{m=1}^{n} B_m ∀ n. Furthermore,

{ω : X(ω) ∈ C_n} = {ω : X(ω) ∈ B_n ∖ ∪_{m=1}^{n−1} B_m}
= {ω : X(ω) ∈ B_n} ∖ ∪_{m=1}^{n−1} {ω : X(ω) ∈ B_m}
= A_n ∖ ∪_{m=1}^{n−1} A_m
= A_n,

because the {A_i} are disjoint. Define h(x) = a_i for x ∈ C_i ∀ i. Then

Y = Σ_i a_i I_{A_i}
= Σ_i a_i I{ω : X(ω) ∈ C_i}   (because A_i = {ω : X(ω) ∈ C_i})
= Σ_i h(X(ω)) I{ω : X(ω) ∈ C_i}   (because h(x) = a_i for x ∈ C_i)
= h(X(ω)) Σ_i I_{A_i}
= h(X(ω)).
General case: Without loss of generality, let m = 1, since for m > 1 we can apply the argument componentwise. Consider the elementary random variables

Y_n := Σ_{k=−∞}^{∞} (k/2^n) I{ω : k/2^n ≤ Y(ω) < (k+1)/2^n}, n ≥ 1.

Then 0 ≤ Y − Y_n ≤ 2^{-n} → 0 as n ↑ ∞. By the foregoing, Y_n = h_n(X(ω)) for suitable h_n : R^d → R. Hence

Y(ω) = lim_{n↑∞} Y_n(ω) = lim_{n↑∞} h_n(X(ω)).

Thus Range(X(·)) ⊂ C := {x : lim_{n↑∞} h_n(x) exists}. Define

h(x) := lim_{n↑∞} h_n(x) for x ∈ C,  and h(x) := an arbitrary y* ∈ R^m for x ∉ C.

Then Y = h(X). To see that h is measurable, note that h = lim_{n↑∞} (h_n I_C + y* I_{C^c}). Since products and limits of measurable functions are measurable, the claim will follow if we show that I_C is measurable, i.e., that the set C is measurable. But

C = ∩_k ∪_ℓ ∩_{n≥ℓ} ∩_m { x : ‖h_n(x) − h_{n+m}(x)‖ < 1/k }.

This is because the set on the right hand side is the set of x for which 'for all k ≥ 1, there exists ℓ ≥ 1 such that for all n ≥ ℓ and m ≥ 1, ‖h_n(x) − h_{n+m}(x)‖ < 1/k'. In other words, C is the set on which the sequence {h_n(x)} is Cauchy, equivalently convergent (because in R^m, a sequence is Cauchy if and only if it converges). From the above expression, it is clear that C is measurable.

This also works for d = ∞, i.e., for countably infinitely many random variables, in exactly the same way, and, thanks to the Lemma above, for arbitrary collections of real random variables.

Consider an R^d-valued random variable X on a probability space (Ω, F, P). We then have X^{-1}(A) ∈ F ∀ A ∈ B. Define a probability measure µ on (R^d, B) by

µ(A) = P(X^{-1}(A)) := P(X ∈ A), A ∈ B.

It is easy to check that this is indeed a probability measure:

1. µ(R^d) = P(Ω) = 1, µ(∅) = P(∅) = 0.

2. For A ∈ B, µ(A^c) = P(X^{-1}(A^c)) = P((X^{-1}(A))^c) = 1 − P(X^{-1}(A)) = 1 − µ(A).

3. For disjoint A_i, i ≥ 1, in B, the sets X^{-1}(A_i), i ≥ 1, are disjoint in F, and

µ(∪_i A_i) = P(X^{-1}(∪_i A_i)) = P(∪_i X^{-1}(A_i)) = Σ_i P(X^{-1}(A_i)) = Σ_i µ(A_i).

µ is called the 'law' of X.

For 1 ≤ d < ∞, X = [X_1, · · ·, X_d] and x = [x_1, · · ·, x_d] ∈ R^d,

x ∈ R^d ↦ F(x) := µ({y ∈ R^d : y_i ≤ x_i, 1 ≤ i ≤ d}) = P(X_i ≤ x_i ∀ i) ∈ [0, 1]

is the familiar 'distribution function' of X. F satisfies the conditions:

1. lim_{min_i x_i ↑ ∞} F(x) = 1, lim_{x_i ↓ −∞} F(x) = 0 ∀ i,

2. F(·) is non-decreasing and right continuous, with left limits existing in each argument.

The latter follows because F is monotone and bounded, so it can have only jump discontinuities. Also, it has at most countably many jumps. (Prove this.)

Consider an R^d-valued random variable X for some 1 ≤ d < ∞, and suppose two different persons have two different 'models' for it, i.e.,

one views it as X = f(ω) for some measurable f : Ω → R^d defined on a probability space (Ω, F, P),

the other views it as X = g(ω′) for some measurable g : Ω′ → R^d defined on a probability space (Ω′, F′, P′).

Then σ(X) in the first case is {f^{-1}(A), A ∈ B} ⊂ F, and in the second case is {g^{-1}(A), A ∈ B} ⊂ F′. If we identify f^{-1}(A) ↔ g^{-1}(A), we have a one-one correspondence between the two. So for all practical purposes they are indistinguishable. That is, as far as analyzing X is concerned, it does not matter what else is there in the two probability spaces. In this sense, the nonuniqueness of the 'representation' of X as a random variable on some probability space is unimportant. This easily extends to d = ∞, and again, thanks to the earlier lemma, to arbitrary collections of real random variables. This verbal intuition can be given a precise mathematical formulation, but this will suffice for the present course. The key message is that what exactly the underlying probability space is does not matter; the fact that it is there suffices for most purposes.

This also extends to infinite collections of random variables and, more importantly, to random variables taking values in more general spaces. For example, Brownian motion is realized by constructing a suitable probability measure, called the Wiener measure, on the space of continuous functions.

In fact, one can go a step further and make a very simple choice. For an R^d-valued random variable X with law µ, take Ω := R^d, F := B, and P := µ, with the random variable defined as the identity map X : ω ↦ ω. This choice of (Ω, F, P) is called the canonical space and X is said to be canonically realized.

Lecture 3

Vivek Borkar

IIT BOMBAY

August 2020

Expectation. The expectation of an indicator random variable I_A, denoted E[I_A], is

1 × P(A) + 0 × P(A^c) = P(A).

Extend this to simple random variables by linearity: for X = Σ_{i=1}^{m} a_i I_{A_i} with a_i ∈ R and disjoint A_i ∈ F, 1 ≤ i ≤ m, m ≥ 1,

E[X] := Σ_{i=1}^{m} a_i P(A_i).

We use this as the definition of the expectation of a simple random variable. Example: for the roll of a die, it is (1/6)×1 + (1/6)×2 + (1/6)×3 + (1/6)×4 + (1/6)×5 + (1/6)×6 = 3.5.

Now let X be a non-negative random variable and define

X̄_n(ω) := Σ_{k=0}^{n2^n − 1} (k/2^n) I{k/2^n ≤ X(ω) < (k+1)/2^n}, n ≥ 1,

its simple function approximations, satisfying 0 ≤ X̄_n ↑ X a.s. X̄_n is constant over the sets of the partition

Π_n := { X^{-1}([k/2^n, (k+1)/2^n)), 0 ≤ k ≤ n2^n − 1, X^{-1}([n, ∞)) },

which are formed by taking inverse images under X of the partition of [0, ∞) given by

Π̃_n := { [k/2^n, (k+1)/2^n), 0 ≤ k ≤ n2^n − 1, [n, ∞) }.

Furthermore, the partition Π̃_{n+1} is a refinement of Π̃_n in the sense that each set in Π̃_n is the union of sets in Π̃_{n+1}. Using this, it can be directly verified that X̄_n ↑ X and, furthermore, that E[X̄_n] is non-decreasing in n.
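As a quick check of the definition E[X] = Σ_i a_i P(A_i) for simple random variables, here is a minimal Python sketch (the helper name expect_simple is ours, introduced only for this illustration); it reproduces the die-roll value 3.5:

    def expect_simple(values, probs):
        # expectation of a simple random variable taking value a_i on a set of probability P(A_i)
        assert abs(sum(probs) - 1.0) < 1e-12 and all(p >= 0 for p in probs)
        return sum(a * p for a, p in zip(values, probs))

    print(expect_simple([1, 2, 3, 4, 5, 6], [1 / 6] * 6))   # 3.5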
Continuing with the example from the previous lecture, let n = 1, 2, and the probability space ([0, 1], B([0, 1]), ℓ), where ℓ is the Lebesgue measure (length). Suppose X(ω) = ω², ω ∈ [0, 1]. Then

X̄_1(ω) = (1/2) I{1/2 ≤ ω² ≤ 1},
X̄_2(ω) = Σ_{i=1}^{3} (i/4) I{i/4 ≤ ω² < (i+1)/4}.

Thus when we go from n = 1 to n = 2, the function value jumps from 0 to 1/4 on [1/2, 1/√2) and from 1/2 to 3/4 on [√(3/4), 1). The value of E[X̄_n] then increases by

(1/4)(1/√2 − 1/2) + (1/4)(1 − √(3/4)).

We define

E[X] := lim_{n↑∞} E[X̄_n].

Being the limit of a non-negative, non-decreasing sequence, this is well defined, with +∞ as a possible value.

If X is a real valued random variable that is not necessarily non-negative, we define E[X] := E[X^+] − E[X^-] whenever at least one of E[X^+], E[X^-] is finite, with ±∞ as possible values. Declare it as undefined otherwise, i.e., if both E[X^+] and E[X^-] are +∞.

If E[X] is defined and finite, clearly so is E[|X|] = E[X^+] + E[X^-]. Such random variables are said to be integrable.

For R^d-valued random variables, the expectation is defined componentwise.

Some properties of expectation:

Linearity: If X, Y are R^d-valued random variables and a, b ∈ R, then E[aX + bY] = aE[X] + bE[Y] whenever both sides are well defined.

Positivity: X ≥ 0 ⟹ E[X] ≥ 0.

These are properties of integrals. The expectation is precisely the Lebesgue integral of the measurable function X : (Ω, F) → (R, B) with respect to the measure P on (Ω, F). This is distinct from the classical Riemann integral.

In defining the expectation, what we have done is to partition into intervals the range of X rather than its domain, multiply the lower value of each interval I by the (probability) measure of X^{-1}(I), and sum over all intervals. As the partitions become arbitrarily fine, we get E[X] as the limit.

For comparison, consider Ω = [0, 1] with its Borel σ-field and P := the Lebesgue measure (length). If you are given a piecewise continuous, bounded (for simplicity) f : [0, 1] → R, then its Riemann integral ∫_0^1 f(x)dx is defined as follows. Consider a partition Π := {0 = x_0 < x_1 < · · · < x_n = 1}, n ≥ 2, of the domain [0, 1] of f. Define the upper and lower Riemann sums

Σ_{i=0}^{n−1} ( max_{x_i ≤ x ≤ x_{i+1}} f(x) ) (x_{i+1} − x_i)  and  Σ_{i=0}^{n−1} ( min_{x_i ≤ x ≤ x_{i+1}} f(x) ) (x_{i+1} − x_i).

Let |Π| := max_i (x_{i+1} − x_i). If the two sums approach each other as |Π| → 0, we call the common limit the Riemann integral ∫_0^1 f(x)dx. If not, the integral is undefined.

To compare with the Riemann integral above, let Q := the set of rationals in [0, 1]. While we won't go into the details thereof, both Q and Q^c are measurable and their Lebesgue measures are resp. 0 and 1, the latter being a consequence of the fact that Q is countable. Let f = I_Q. Since every interval of nonzero length contains both a rational and an irrational, it follows that the upper Riemann sum is always 1 and the lower Riemann sum is always 0, so the Riemann integral is undefined. But the Lebesgue integral is 1 × 0 + 0 × 1 = 0, which is perfectly well defined.

Thus Lebesgue integration enlarges the class of functions that we can integrate. This generality carries over to expectations, because they are simply Lebesgue integrals with respect to a probability measure.
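For the running example X(ω) = ω² on ([0, 1], B([0, 1]), ℓ), the increasing limit E[X] = lim_n E[X̄_n] can be computed directly, since X̄_n takes the value k/2^n on the set {k/2^n ≤ ω² < (k+1)/2^n}, whose length is √((k+1)/2^n) − √(k/2^n) (intersected with [0, 1]). A minimal Python sketch of this computation (assuming NumPy; an illustration only) shows E[X̄_n] increasing towards ∫_0^1 ω² dω = 1/3:

    import numpy as np

    def E_Xbar(n):
        # E[Xbar_n] = sum_k (k/2^n) * ell({omega : k/2^n <= omega^2 < (k+1)/2^n})
        total = 0.0
        for k in range(n * 2 ** n):
            lo = np.sqrt(min(k / 2 ** n, 1.0))
            hi = np.sqrt(min((k + 1) / 2 ** n, 1.0))
            total += (k / 2 ** n) * (hi - lo)
        return total

    for n in (1, 2, 4, 8, 10):
        print(n, E_Xbar(n))    # non-decreasing, approaching 1/3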

The following analogy will help fix the ideas. Suppose you receive money bills of denominations 10, 20, 50, 100 in the following order: 20, 20, 10, 50, 100, 10, 10, 20, 50, 20, 100, 10, 10, 10, 100, 50, 20, 50, 20, 10, 50, 50, 20, 10, 20. If you add them sequentially, you arrive at the total amount of 840. If you take them to the bank, the teller will separate the different denominations first, then calculate 8 × 10 + 8 × 20 + 6 × 50 + 3 × 100 = 840. You performed Riemann integration; the teller performed Lebesgue integration.

We need:

Lemma: If Z_n, n = 1, 2, · · ·, ∞, are simple random variables with 0 ≤ Z_n ↑ Z_∞ a.s., then E[Z_n] ↑ E[Z_∞].

Proof: Since Z_∞ is simple, it is bounded. Let M > 0 be a bound on Z_∞ and hence on Z_n, n ≥ 1. Then, using linearity and positivity of expectation,

E[Z_∞] − E[Z_n] = E[Z_∞ − Z_n] ≥ 0, and

0 ≤ E[Z_∞] − E[Z_n] = E[(Z_∞ − Z_n) I{Z_∞ − Z_n < 1/k}] + E[(Z_∞ − Z_n) I{Z_∞ − Z_n ≥ 1/k}]
≤ 1/k + M P(Z_∞ − Z_n ≥ 1/k)
→ 1/k as n ↑ ∞.

Letting k ↑ ∞, the claim follows.

This definition of E[X] does not depend on the choice of the approximations {X̄_n}: suppose {Ȳ_n} is another sequence of simple random variables with 0 ≤ Ȳ_n ↑ X a.s.

Theorem: E[Ȳ_n] ↑ E[X].

Proof of the above theorem: Define W_{n,m} := X̄_n ∧ Ȳ_m, m, n ≥ 1. Since X̄_n ↑ X ≥ Ȳ_m a.s., W_{n,m} ↑ Ȳ_m a.s. as n ↑ ∞. Similarly, since Ȳ_m ↑ X ≥ X̄_n a.s. as m ↑ ∞, W_{n,m} ↑ X̄_n a.s. as m ↑ ∞. Hence

lim_{n↑∞} E[X̄_n] ≥ lim_{n↑∞} E[W_{n,m}]   (because X̄_n ≥ W_{n,m} ∀ m)
= E[Ȳ_m]   (by the Lemma)
⟹ lim_{n↑∞} E[X̄_n] ≥ lim_{m↑∞} E[Ȳ_m].

A symmetric argument leads to lim_{m↑∞} E[Ȳ_m] ≥ lim_{n↑∞} E[X̄_n]. Thus

lim_{m↑∞} E[Ȳ_m] = lim_{n↑∞} E[X̄_n] = E[X].

With hindsight, we may define E[X] for a random variable X ≥ 0 as

sup{ E[X̃] : 0 ≤ X̃ ≤ X, X̃ is simple }

and extend it to a general X as E[X^+] − E[X^-] as before. This is how it is usually done in measure theory / real analysis textbooks.

We often use the alternative notations ∫ X(ω) P(dω), or ∫ X(ω) dP(ω), or simply ∫ X dP.

Next we list some simple but extremely useful properties.

1. Change of variables formula: Let X be an R^d-valued integrable random variable with law µ and f : R^d → R a measurable function such that the real random variable f(X) is integrable. Then

E[f(X)] := ∫ f(X(ω)) dP(ω) = ∫ f(x) µ(dx).

For f = I_A for some A ∈ B(R^d), this reduces to P(X^{-1}(A)) = µ(A), which is how µ was defined. It extends to simple f by linearity and to general f by the familiar approximation argument.
2. Chebyshev inequality: for a random variable X ≥ 0 and 0 < x ∈ R,

P(X ≥ x) ≤ E[X]/x.

This follows from the fact

E[X] ≥ E[X I{X ≥ x}] ≥ E[x I{X ≥ x}] = x P(X ≥ x).

You are invited to verify each step using the properties of expectation.

3. Jensen's inequality: If f : R^d → R is convex, then

E[f(X)] ≥ f(E[X]),

assuming both sides are well defined. To see this, use the fact that f(x) = sup_{g∈G} g(x), where G := {h : h ≤ f and h is affine} (i.e., h(x) = aᵀx + b for some a ∈ R^d, b ∈ R). Thus for h ∈ G, E[f(X)] ≥ E[h(X)] = h(E[X]). Hence E[f(X)] ≥ sup_{h∈G} h(E[X]) = f(E[X]).

For non-negative X, we have the following convenient formula.

Theorem: If X ≥ 0 a.s., then E[X^p] = p ∫_0^∞ t^{p−1} P(X ≥ t) dt, 1 ≤ p < ∞.

Proof: We have

E[X^p] = ∫ X^p dP = ∫ ∫_0^X p t^{p−1} dt dP = p ∫ ∫ I{X ≥ t} t^{p−1} dt dP = p ∫ t^{p−1} ∫ I{X ≥ t} dP dt = p ∫ t^{p−1} P(X ≥ t) dt.

For non-negative integer valued X, this leads to E[X] = Σ_{n=1}^{∞} P(X ≥ n).

One can also define joint moments, such as E[X^p Y^q Z^r] for real random variables X, Y, Z and p, q, r ≥ 1. Some familiar cases are:

the variance var(X) of a real random variable X, which is its second centred moment;

the standard deviation := the square root thereof; and

the covariance matrix of R^d-valued random variables X, Y, given by E[(X − E[X])(Y − E[Y])ᵀ], the expectation being componentwise.

For a real valued random variable X and p ≥ 1, define the following whenever the expectation is well defined:

1. p-th moment E[X^p]

2. p-th centred moment E[(X − E[X])^p]

3. p-th absolute moment E[|X|^p]

4. p-th centred absolute moment E[|X − E[X]|^p]

For R^d-valued random variables, the last two can be defined with ‖·‖ replacing |·|.

Let X be a real random variable and 1 < p, q < ∞ with 1/p + 1/q = 1. We also allow p = 1, q = ∞. Define

‖X‖_p := (E[|X|^p])^{1/p}, 1 ≤ p < ∞;  ‖X‖_∞ := ess sup |X|,

where 'ess sup', which stands for 'essential supremum', is the smallest C ≥ 0 such that |X| ≤ C a.s. (Recall that random variables are a.s. equivalence classes.)

We have the following two useful inequalities.

Theorem (Minkowski inequality): Let X, Y be random variables on some probability space with ‖X‖_p, ‖Y‖_p < ∞ for some p, 1 ≤ p ≤ ∞. Then

‖X + Y‖_p ≤ ‖X‖_p + ‖Y‖_p.   (1)

Proof: The claim is easy when p = 1 or p = ∞, or when ‖X‖_p or ‖Y‖_p is zero. So let 1 < p < ∞ and ‖X‖_p, ‖Y‖_p > 0. Let X_0 := X/‖X‖_p and Y_0 := Y/‖Y‖_p, so that ‖X_0‖_p = ‖Y_0‖_p = 1. Then for λ := ‖X‖_p/(‖X‖_p + ‖Y‖_p),

|X + Y|^p ≤ (|X| + |Y|)^p = (‖X‖_p |X_0| + ‖Y‖_p |Y_0|)^p = (‖X‖_p + ‖Y‖_p)^p ( λ|X_0| + (1 − λ)|Y_0| )^p ≤ (‖X‖_p + ‖Y‖_p)^p ( λ|X_0|^p + (1 − λ)|Y_0|^p )

by convexity of the map x ↦ x^p. Taking expectations and using E[|X_0|^p] = E[|Y_0|^p] = 1,

E[|X + Y|^p] ≤ (‖X‖_p + ‖Y‖_p)^p ( λ + (1 − λ) ) = (‖X‖_p + ‖Y‖_p)^p,

i.e., ‖X + Y‖_p ≤ ‖X‖_p + ‖Y‖_p. This completes the proof.
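A quick numerical sanity check of the Chebyshev and Jensen inequalities (an added sketch, assuming NumPy; the exponential sample is an arbitrary choice with E[X] = 1):

    import numpy as np

    rng = np.random.default_rng(1)
    X = rng.exponential(scale=1.0, size=100_000)   # X >= 0 with E[X] = 1

    # Chebyshev inequality: P(X >= x) <= E[X] / x
    for x in (1.0, 2.0, 5.0):
        print(x, (X >= x).mean(), X.mean() / x)

    # Jensen's inequality with the convex map f(x) = x^2:  E[X^2] >= (E[X])^2
    print((X ** 2).mean(), X.mean() ** 2)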
Corollary: If X, Y ≠ 0 a.s. and 1 < p < ∞ in the above, then equality holds in (1) if and only if X = λY for some λ ∈ R∖{0}.

Proof: This follows from the strict convexity of the map x ↦ x^p for 1 < p < ∞.

Let 1 ≤ p ≤ ∞ and ‖X‖_p, ‖Y‖_p < ∞. Then:

1. ‖X‖_p ≥ 0, and ‖X‖_p = 0 if and only if X = 0 a.s., because in that case, for each a > 0,

P(|X| ≥ a) ≤ ‖X‖_p^p / a^p = 0 ⟹ X = 0 a.s.

2. ‖bX‖_p = |b| ‖X‖_p.

Finally, (1) holds. Together, these characterize ‖·‖_p as a norm. Thus the space of real random variables X on a probability space (Ω, F, P) with E[|X|^p] < ∞, for 1 ≤ p ≤ ∞, is a normed linear space. It is also complete, i.e., whenever ‖X_n − X_m‖_p → 0 as m, n ↑ ∞, there exists a random variable X on (Ω, F, P) such that ‖X_n − X‖_p → 0. This makes it a complete normed linear space, i.e., a Banach space. This space is denoted L^p(Ω, F, P).

Theorem (Hölder inequality): For 1 ≤ p ≤ ∞ and q as above, if random variables X, Y satisfy 0 < ‖X‖_p, ‖Y‖_q < ∞, then

|E[XY]| ≤ ‖X‖_p ‖Y‖_q.   (2)

Proof: The case p = 1, q = ∞ is easy, as is the case when one of X, Y is zero a.s. Let 1 < p < ∞. Since we may replace X, Y by |X|, |Y| without loss of generality, we take X, Y ≥ 0 and not a.s. 0. Define Z := Y^{q−1} = Y^{q/p}, implying Y = Z^{p−1} = Z^{p/q}. Then for t ∈ R,

p t X Y = p t X Z^{p−1} ≤ (Z + tX)^p − Z^p

by the convexity of x ↦ x^p. Taking expectations,

p t E[XY] ≤ E[|Z + tX|^p] − E[|Z|^p] ≤ (‖Z‖_p + t‖X‖_p)^p − ‖Z‖_p^p,

where we use (1) for the second inequality. Dividing by t > 0 and letting t ↓ 0, we get

p E[XY] ≤ p ‖X‖_p ‖Z‖_p^{p−1} = p ‖X‖_p ‖Y‖_q.

The claim follows.

As in the case of (1), equality holds for 1 < p < ∞ if and only if X = λY for some λ ∈ R∖{0}. For p = q = 2, this reduces to the familiar Cauchy–Schwarz inequality.
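Both norm inequalities are easy to check numerically on samples, with ‖·‖_p replaced by its empirical version. A small illustration (ours, assuming NumPy; the sample distributions are arbitrary):

    import numpy as np

    rng = np.random.default_rng(2)
    X = rng.standard_normal(50_000)
    Y = rng.exponential(size=50_000)

    def norm(Z, p):
        # empirical p-th moment norm of a sample
        return (np.abs(Z) ** p).mean() ** (1.0 / p)

    p, q = 3.0, 1.5                                        # conjugate: 1/p + 1/q = 1
    print(norm(X + Y, p), norm(X, p) + norm(Y, p))         # Minkowski: lhs <= rhs
    print(abs((X * Y).mean()), norm(X, p) * norm(Y, q))    # Hoelder:   lhs <= rhs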


Lecture 4

Vivek Borkar

IIT BOMBAY

August 2020

Bounded convergence theorem:

Theorem: Let X_n, n ≥ 1, be real random variables satisfying |X_n| ≤ K a.s. for some constant K < ∞, and X_n → X a.s. Then E[X_n] → E[X].

Proof: We need the following lemma.

Lemma: X_n → X a.s. in R^d ⟹ P(‖X_n − X‖ ≥ ε) → 0 for any ε > 0.

Proof: This follows from:

0 = P({X_n → X}^c)
≥ P(‖X_n − X‖ ≥ ε i.o.)
= P(∩_{n≥1} ∪_{m≥n} {‖X_m − X‖ ≥ ε})
= lim_{N↑∞} P(∩_{n=1}^{N} ∪_{m≥n} {‖X_m − X‖ ≥ ε})
= lim_{N↑∞} P(∪_{m≥N} {‖X_m − X‖ ≥ ε})
≥ lim sup_{n↑∞} P(‖X_n − X‖ ≥ ε)
≥ 0.

Returning to the proof of the theorem, clearly |X| ≤ K a.s. For ε > 0,

|E[X_n] − E[X]| = |E[X_n − X]|
≤ E[|X_n − X|]   (by Jensen's inequality)
= E[|X_n − X| I{|X_n − X| ≥ ε}] + E[|X_n − X| I{|X_n − X| < ε}]
≤ ε + E[(|X_n| + |X|) I{|X_n − X| ≥ ε}]
≤ ε + 2K P(|X_n − X| ≥ ε)
→ ε as n ↑ ∞,

by the above lemma. Since ε can be made arbitrarily small, the claim follows.

The 'limit infimum' or 'liminf' of a real sequence {x_n} is defined as

lim inf_{n↑∞} x_n = lim_{n↑∞} inf_{m≥n} x_m = sup_{n≥1} inf_{m≥n} x_m.

It is always well defined in R ∪ {±∞}. Similarly, the 'limit supremum' or 'limsup' of a real sequence {x_n} is defined as

lim sup_{n↑∞} x_n = lim_{n↑∞} sup_{m≥n} x_m = inf_{n≥1} sup_{m≥n} x_m.

It is always well defined in R ∪ {±∞}. Clearly, lim sup x_n ≥ lim inf x_n, and when they are equal, lim_{n↑∞} x_n exists and equals either of them.

Fatou's lemma:

Theorem: Let X_n ≥ 0 be integrable real random variables with X_n → X a.s. Then lim inf_{n↑∞} E[X_n] ≥ E[X].

Proof: We need the following lemma.

Lemma: Let X ≥ 0 be integrable. Then E[X ∧ N] ↑ E[X] as N ↑ ∞.

Proof: Let ε > 0. Define

X_N := Σ_{k=0}^{N2^N − 1} (k/2^N) I{k/2^N ≤ X < (k+1)/2^N}.

Note that X_N ≤ X ∧ N. Pick N sufficiently large so that

|E[X] − E[X_N]| = E[X] − E[X_N] < ε,

which is possible by our construction of the expectation. Thus

0 ≤ E[X] − E[X ∧ N] = E[X] − E[X_N] + E[X_N] − E[X ∧ N] ≤ E[X] − E[X_N] < ε.

Since ε > 0 is arbitrary, the claim follows. This result also holds if E[X] = ∞.

Thus, since X_n ∧ N → X ∧ N a.s.,

lim inf_{n↑∞} E[X_n] ≥ lim inf_{n↑∞} E[X_n ∧ N] = lim_{n↑∞} E[X_n ∧ N] = E[X ∧ N],

where the last two equalities follow from the bounded convergence theorem. Now let N ↑ ∞ on the right hand side and use the lemma above to conclude.

Monotone convergence theorem:

Theorem: Let X_n, n ≥ 1, be integrable real random variables with X_n ↑ X a.s. Then E[X_n] ↑ E[X].

Proof: By replacing X_n by X_n − X_1 if necessary, we may assume without loss of generality that X_n ≥ 0 a.s. Then by Fatou's lemma,

lim inf_{n↑∞} E[X_n] ≥ E[X].

On the other hand, X_n ≤ X implies E[X_n] ≤ E[X]. Thus

lim sup_{n↑∞} E[X_n] ≤ E[X].

But lim sup ≥ lim inf. Combining these inequalities, the claim follows.
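The lemma E[X ∧ N] ↑ E[X] (and hence monotone convergence) is easy to see numerically for a heavy-tailed variable, where the truncation level N visibly matters. A minimal sketch (assuming NumPy; the Pareto-type variable is an arbitrary choice with E[X] = 2):

    import numpy as np

    rng = np.random.default_rng(3)
    U = rng.random(1_000_000)
    X = U ** (-0.5)          # P(X > x) = x^{-2} for x >= 1, so E[X] = 2, heavy right tail

    for N in (1, 2, 5, 20, 100):
        print(N, np.minimum(X, N).mean())   # E[X ^ N] = 2 - 1/N, increasing to E[X] = 2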
Dominated convergence theorem:

Theorem: Let {X_n}, X, Y be real random variables with X_n → X a.s. and |X_n| ≤ Y with E[Y] < ∞. Then E[X_n] → E[X].

Proof: By Fatou's lemma,

E[Y] − lim sup_{n↑∞} E[X_n] = lim inf_{n↑∞} E[Y − X_n] ≥ E[Y − X] = E[Y] − E[X],

implying lim sup_{n↑∞} E[X_n] ≤ E[X]. Similarly,

lim inf_{n↑∞} E[X_n] + E[Y] = lim inf_{n↑∞} E[X_n + Y] ≥ E[X + Y] = E[X] + E[Y],

implying lim inf_{n↑∞} E[X_n] ≥ E[X]. Thus

E[X] ≤ lim inf_{n↑∞} E[X_n] ≤ lim sup_{n↑∞} E[X_n] ≤ E[X].

Hence all inequalities must be equalities, implying the claim.

Uniform integrability:

A family {X_α, α ∈ I} of integrable random variables is said to be uniformly integrable ('u.i.' for short) if

lim_{a↑∞} sup_α E[|X_α| I{|X_α| > a}] = 0.

The following two facts are immediate (prove them):

1. If {X_α, α ∈ I} is uniformly integrable and Y_1, · · ·, Y_k are integrable random variables, then {X_α, α ∈ I; Y_1, · · ·, Y_k} is uniformly integrable.

2. If {X_α, α ∈ I} is uniformly integrable, then sup_α E[|X_α|] < ∞. To see this, fix a > 0 such that

sup_α E[|X_α| I{|X_α| > a}] := M < ∞.

Then

sup_α E[|X_α|] ≤ sup_α E[|X_α| I{|X_α| > a}] + sup_α E[|X_α| I{|X_α| ≤ a}] ≤ M + a < ∞.

The following equivalence serves as an alternative definition of uniform integrability, sometimes more convenient to use.

Theorem: {X_α, α ∈ I} are uniformly integrable if and only if

sup_α E[|X_α|] < ∞  and  lim_{P(A)→0} sup_{α∈I} ∫_A |X_α| dP = 0.   (2)

Theorem: Let {X_n, n ≥ 1} be uniformly integrable and X_n → X_∞ a.s. Then E[|X_n − X_∞|] → 0.

Proof: Fix ε > 0. Then

E[|X_n − X_∞|] = E[|X_n − X_∞| I{|X_n − X_∞| > ε}] + E[|X_n − X_∞| I{|X_n − X_∞| ≤ ε}]
≤ ε + E[(|X_n| + |X_∞|) I{|X_n − X_∞| > ε}]
≤ ε + 2 sup_{1≤j≤∞} E[|X_j| I{|X_n − X_∞| > ε}]
≤ ε + 2 sup_{1≤j≤∞} ( E[|X_j| I{|X_n − X_∞| > ε} I{|X_j| ≤ a}] + E[|X_j| I{|X_n − X_∞| > ε} I{|X_j| > a}] )
≤ ε + 2a P(|X_n − X_∞| > ε) + 2 sup_{1≤j≤∞} E[|X_j| I{|X_j| > a}].

Thus

0 ≤ lim inf_{n↑∞} E[|X_n − X_∞|] ≤ lim sup_{n↑∞} E[|X_n − X_∞|] ≤ ε + 2 sup_{1≤j≤∞} E[|X_j| I{|X_j| > a}].

Let ε ↓ 0 and a ↑ ∞ to conclude.

The following is easy to prove (prove it):

Lemma: If E[|X_n − X|] → 0, then {X_n} (or {X_n, n ≥ 1; X}) are u.i.

Another useful fact:

Theorem: Let {X_n, 1 ≤ n ≤ ∞} be non-negative integrable random variables with X_n → X_∞ a.s. Then E[|X_n − X_∞|] → 0 if and only if E[X_n] → E[X_∞].
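The role of the domination / uniform integrability hypotheses is well illustrated by the classical escaping-mass example X_n = n·I_{(0, 1/n]} on ([0, 1], Lebesgue): X_n → 0 a.s., yet E[X_n] ≡ 1, and the family is not uniformly integrable. A minimal simulation sketch (ours, assuming NumPy):

    import numpy as np

    omega = np.random.default_rng(4).random(1_000_000)   # uniform samples on [0, 1]

    def X(n):
        return n * (omega <= 1.0 / n)       # X_n = n on (0, 1/n], and 0 elsewhere

    for n in (1, 10, 100, 10_000):
        Xn = X(n)
        # P(X_n != 0) -> 0 (so X_n -> 0 a.s.), but E[X_n] stays near 1 for every n;
        # E[|X_n| I{|X_n| > a}] = 1 for all n > a, so there is no uniform integrability.
        print(n, (Xn != 0).mean(), Xn.mean())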
Proof: Clearly,

|E[X_n] − E[X_∞]| ≤ E[|X_n − X_∞|] → 0.

Conversely, 0 ≤ (X_∞ − X_n)^+ ≤ X_∞. Hence by the dominated convergence theorem,

E[(X_∞ − X_n)^+] → 0.

Then, since |X_∞ − X_n| = X_n − X_∞ + 2(X_∞ − X_n)^+,

E[|X_n − X_∞|] = 2E[(X_∞ − X_n)^+] + E[X_n] − E[X_∞] → 0.

This completes the proof.

Finally, we have an important characterization of uniform integrability due to de la Vallée Poussin.

Theorem: {X_α, α ∈ I} is uniformly integrable if and only if there exists a G : [0, ∞) → [0, ∞) satisfying

lim_{t↑∞} G(t)/t = ∞  and  M := sup_{α∈I} E[G(|X_α|)] < ∞.

For example, sup_α E[|X_α|^a] < ∞ for some a > 1 suffices to ensure that {X_α, α ∈ I} are u.i.

Proof ('if' part only): Suppose the latter condition holds. Pick ε > 0 and set C := M/ε. Choose a > 0 large enough so that G(t)/t ≥ C for t ≥ a. Then

|X_α| ≥ a ⟹ G(|X_α|)/|X_α| ≥ C ⟹ |X_α| ≤ G(|X_α|)/C.

Hence

sup_α E[|X_α| I{|X_α| ≥ a}] ≤ sup_α E[G(|X_α|)]/C = M/C = ε.

The claim follows.

Lecture 5

Vivek Borkar

IIT BOMBAY

Sept. 2020

"Uniform continuity": f is uniformly continuous if, given ε > 0, we can find δ > 0 such that ‖x − y‖ < δ ⟹ |f(x) − f(y)| < ε. Contrast this with 'continuity at x': f is continuous at x if, given ε > 0, we can find δ > 0 such that ‖x − y‖ < δ ⟹ |f(x) − f(y)| < ε. 'Continuous' ⟺ 'continuous at x ∀ x'; here δ can depend on x.

We say X_n → X

• almost surely (a.s.), or 'with probability 1' ⟺ P({ω : X_n(ω) → X(ω)}) = 1,

• in probability ⟺ P(‖X_n − X‖ ≥ ε) → 0 ∀ ε > 0,

• in q-th mean (1 ≤ q ≤ ∞) ⟺ E[‖X_n − X‖^q] → 0,

• in law / distribution ⟺ E[f(X_n)] → E[f(X)] for all bounded continuous f.

1. a.s. ⟹ in probability: This was proved in the last lecture; the proof is repeated here for convenience.

0 = P(‖X_n − X‖ ≥ ε i.o.)
= P(∩_{n≥1} ∪_{m≥n} {‖X_m − X‖ ≥ ε})
= lim_{n↑∞} P(∪_{m≥n} {‖X_m − X‖ ≥ ε})
≥ lim sup_{n↑∞} P(‖X_n − X‖ ≥ ε)
≥ 0.

2. in q-th mean ⟹ in probability:

P(‖X_n − X‖ ≥ ε) ≤ E[‖X_n − X‖^q]/ε^q → 0.

3. in probability ⟹ in law: It suffices to consider uniformly continuous f (to be proved later). Let |f(·)| ≤ K < ∞. Pick δ > 0 and take ε > 0 such that ‖x − y‖ < ε ⟹ |f(x) − f(y)| < δ. Then

|E[f(X_n)] − E[f(X)]| ≤ E[|f(X_n) − f(X)| I{‖X_n − X‖ ≥ ε}] + E[|f(X_n) − f(X)| I{‖X_n − X‖ < ε}]
≤ 2K P(‖X_n − X‖ ≥ ε) + δ
→ δ as n ↑ ∞.

Since δ > 0 can be made arbitrarily small, the claim follows.

4. 'in probability' or 'in q-th mean' does not imply 'a.s.': Take Ω = [0, 1), P = the Lebesgue measure. Define X_1 = I_{[0,1/2)}, X_2 = I_{[1/2,1)}, X_3 = I_{[0,1/4)}, X_4 = I_{[1/4,1/2)}, X_5 = I_{[1/2,3/4)}, X_6 = I_{[3/4,1)}, X_7 = I_{[0,1/8)}, and so on. Then P(|X_n| ≥ ε) → 0 ∀ ε > 0, but lim sup_{n↑∞} X_n(ω) = 1 ≠ 0 = X(ω) ∀ ω. (See the simulation sketch below.)

5. 'in probability' or 'a.s.' does not imply in q-th mean: Take X_n = n^{1/q} I_{[0,1/n]}. Then X_n → X := 0 a.s., but E[|X_n − X|^q] ≡ 1 ≠ 0. For a.s. convergence to imply convergence in q-th mean, one needs precisely the uniform integrability of |X_n|^q.
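The 'moving blocks' sequence in item 4 is easy to simulate. The sketch below (ours, assuming NumPy is not even needed; the indexing helper block is hypothetical, introduced only for this illustration) makes X_n the indicator of an interval of length 2^{-k} sweeping across [0, 1): P(X_n ≠ 0) → 0, yet every fixed ω is hit infinitely often, so X_n(ω) does not converge.

    def block(n):
        # X_1 = I[0,1/2), X_2 = I[1/2,1), X_3 = I[0,1/4), X_4 = I[1/4,1/2), ...
        k, start = 1, 1
        while n >= start + 2 ** k:
            start += 2 ** k
            k += 1
        j = n - start                     # which sub-interval of length 2^{-k}
        return j / 2 ** k, (j + 1) / 2 ** k

    print([(n, block(n)) for n in range(1, 7)])   # the first few blocks, as listed above

    omega = 0.3
    hits = [n for n in range(1, 200) if block(n)[0] <= omega < block(n)[1]]
    print(hits)   # omega = 0.3 falls in a block at every level k: X_n(omega) = 1 i.o.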
Theorem: X_n → X in probability if and only if every subsequence of {X_n} has a further subsequence that converges a.s. to X.

Proof: Suppose X_n → X in probability and {X_{n(k)}} is a subsequence of {X_n}. Then X′_k := X_{n(k)} → X in probability, so we can find a subsequence {X′_{k(m)}} of {X′_k} such that

P(|X′_{k(m)} − X| ≥ 2^{-m}) ≤ 2^{-m}.

Then

Σ_m P(|X′_{k(m)} − X| ≥ 2^{-m}) ≤ Σ_m 2^{-m} < ∞,

so by the Borel–Cantelli lemma, almost surely |X′_{k(m)} − X| ≤ 2^{-m} for all sufficiently large m, i.e., X′_{k(m)} → X a.s.

Conversely, suppose every subsequence of {X_n} has a further subsequence that converges a.s. to X, but X_n does not converge to X in probability. Then there exist ε, δ > 0 and a subsequence {n(k)} ⊂ {n} such that P(|X_{n(k)} − X| ≥ ε) > δ ∀ k. But by hypothesis, this subsequence has a further subsequence that converges a.s. (hence in probability) to X, a contradiction. The claim follows.

Convergence in law (or distribution) is NOT really a convergence of random variables, but of their laws, i.e., of probability measures. In particular, in the definition, the {X_n} need not even be defined on the same probability space.

Recall: G ⊂ R^d is an open set if ∀ x ∈ G there is an r > 0 such that {y : ‖y − x‖ < r} ⊂ G. Finite intersections and arbitrary unions of open sets are open. F ⊂ R^d is a closed set if F^c is open; equivalently, if x_n ∈ F, x_n → x ⟹ x ∈ F. Finite unions and arbitrary intersections of closed sets are closed.

Given A ⊂ R^d, the interior of A, denoted int(A), is the largest open set contained in A (it can be empty), equivalently, the union of all open sets contained in A. The closure of A, denoted Ā, is the smallest closed set containing A, alternatively, the intersection of all closed sets containing A; it is nonempty if A is nonempty. Clearly, int(A) ⊂ A ⊂ Ā. The boundary of A, denoted ∂A, is defined as Ā ∖ int(A).

Lemma (Borel–Cantelli lemma, first half): Let A_n ∈ F, n ≥ 1, satisfy Σ_n P(A_n) < ∞. Then

P(A_n i.o.) = P(∩_{n≥1} ∪_{m≥n} A_m) = 0.

Proof: Since

Σ_n P(A_n) = Σ_n E[I_{A_n}] = E[Σ_n I_{A_n}],

where the second equality follows from the monotone convergence theorem, we have

Σ_n P(A_n) < ∞ ⟹ Σ_n I_{A_n} < ∞ a.s.

This is exactly equivalent to the claim.

Consider the space of all real random variables on (Ω, F, P). Define on it a metric

d(X, Y) := E[|X − Y| ∧ 1].

Theorem: The metric d(·, ·) is complete, and convergence with respect to d coincides with convergence in probability.

The a.s. convergence cannot, in general, be metrized.

Denote by C_b(R^d) the space of bounded continuous functions R^d → R and by P(R^d) the space of probability measures on R^d.

Theorem (Portmanteau theorem): Let µ_n, n ≥ 1; µ ∈ P(R^d). Then the following are equivalent.

0. µ_n → µ in P(R^d).

1. ∫ f dµ_n → ∫ f dµ ∀ f ∈ C_b(R^d).

2. ∫ f dµ_n → ∫ f dµ ∀ f ∈ C_b(R^d) that are uniformly continuous with respect to the Euclidean norm.

3. lim inf_{n↑∞} µ_n(G) ≥ µ(G) ∀ open sets G ⊂ R^d.

4. lim sup_{n↑∞} µ_n(F) ≤ µ(F) ∀ closed sets F ⊂ R^d.

5. lim_{n↑∞} µ_n(A) = µ(A) ∀ sets A ∈ B(R^d) for which µ(∂A) = 0.
Proof: Note that:

1. ⟹ 2.: uniformly continuous ⟹ continuous.

3. ⟺ 4.: by complementation: take F := G^c.

2. ⟹ 3.: For an open G ⊂ R^d, the set

G_ε := {x ∈ G : inf_{y∈∂G} ‖x − y‖ ≥ ε} ⊂ G

is nonempty for ε > 0 sufficiently small, and is closed. The function f_ε : R^d → [0, 1] given by

f_ε(x) := inf_{y∈G^c} ‖x − y‖ / ( inf_{y∈G^c} ‖x − y‖ + inf_{y∈G_ε} ‖x − y‖ )

is uniformly continuous with respect to ‖·‖ (prove it), is 1 on G_ε and 0 outside G, and f_ε ↑ I_G as ε ↓ 0. Then

lim inf_{n↑∞} µ_n(G) ≥ lim inf_{n↑∞} ∫ f_ε dµ_n = lim_{n↑∞} ∫ f_ε dµ_n = ∫ f_ε dµ.

Letting ε ↓ 0 in the rightmost term, we get

lim inf_{n↑∞} µ_n(G) ≥ lim_{ε↓0} ∫ f_ε dµ = ∫ I_G dµ = µ(G)

by the Monotone Convergence Theorem. Thus 2. ⟹ 3.

3. + 4. ⟹ 5.: By applying 3. and 4. to, resp., the interior and the closure of A:

µ(A) = µ(Ā) ≥ lim sup_{n↑∞} µ_n(Ā) ≥ lim sup_{n↑∞} µ_n(A) ≥ lim inf_{n↑∞} µ_n(A) ≥ lim inf_{n↑∞} µ_n(int(A)) ≥ µ(int(A)) = µ(A).

5. ⟹ 1.: Let f ∈ C_b(R^d) and, for a given ε > 0, choose a_0 < a_1 < · · · < a_N, N ≥ 1, such that

1. a_0 = −sup_x |f(x)| − ε, a_N = sup_x |f(x)| + ε,

2. µ({x : f(x) = a_i}) = 0 ∀ i, and

3. |a_i − a_{i−1}| < ε ∀ 1 ≤ i ≤ N.

This is always possible because the set of a for which µ({x : f(x) = a}) > 0 is at most countable and can be avoided. Define B_i := {x ∈ R^d : a_{i−1} ≤ f(x) < a_i}, 1 ≤ i ≤ N. Then

1. ∂B_i ⊂ {x ∈ R^d : f(x) ∈ {a_{i−1}, a_i}}, and therefore µ(∂B_i) = 0 ∀ i,

2. the B_i's are disjoint and ∪_i B_i = R^d,

3. |f(x) − Σ_{i=1}^{N} a_{i−1} I_{B_i}(x)| ≤ ε by construction.

Then

lim sup_{n↑∞} | ∫ f dµ_n − ∫ f dµ |
≤ 2ε + lim sup_{n↑∞} | ∫ (Σ_{i=1}^{N} a_{i−1} I_{B_i}) dµ_n − ∫ (Σ_{i=1}^{N} a_{i−1} I_{B_i}) dµ |
≤ 2ε + Σ_{i=1}^{N} |a_{i−1}| ( lim sup_{n↑∞} |µ_n(B_i) − µ(B_i)| )
= 2ε.

Since ε > 0 was arbitrary, 1. follows.

Corollary: In 3. above, it suffices to consider bounded open G.

Proof: Suppose that for all bounded open G,

lim inf_{n↑∞} µ_n(G) ≥ µ(G).   (1)

Consider an unbounded open set G. Let G_n := G ∩ {x : ‖x‖ < n}, n ≥ 1. Then the {G_n} are bounded open, so

lim inf_{n↑∞} µ_n(G) ≥ lim inf_{n↑∞} µ_n(G_m) ≥ µ(G_m)

for m ≥ 1. Since G = ∪_m G_m and G_m ⊂ G_{m+1}, letting m ↑ ∞ on the right hand side gives µ(G_m) ↑ µ(G); hence (1) holds for an arbitrary open set G.
Corollary: There exists a countable family {f_i} ⊂ C_b(R^d) such that

∫ f_i dµ_n → ∫ f_i dµ ∀ i ⟹ µ_n → µ.

Proof: Let G be open. For any x ∈ G, we can find a rational y ∈ G (i.e., with rational coordinates) and a rational r > 0 such that the open ball {z ∈ R^d : ‖z − y‖ < r} contains x and is contained in G. Thus G is the union of all open balls with rational centres and rational radii contained in it. The finite unions of such balls form a countable collection G, and an argument analogous to the above shows that it suffices to verify 3. for sets in G. As in the proof of 3., each set B in G has an associated family g_i^B, i ≥ 1, of continuous functions R^d → [0, 1] such that g_i^B = 0 on B^c for all i and g_i^B ↑ I_B as i ↑ ∞. The collection {g_i^B, i ≥ 1, B ∈ G} is countable. Suppose

∫ g_i^B dµ_n → ∫ g_i^B dµ ∀ i, B.

Then

lim inf_{n↑∞} µ_n(B) ≥ lim_{n↑∞} ∫ g_i^B dµ_n = ∫ g_i^B dµ ↑ µ(B) as i ↑ ∞, ∀ B ∈ G
⟹ lim inf_{n↑∞} µ_n(G) ≥ µ(G) ∀ G open.

The claim follows.

Any countable set {f_i, i ≥ 1} ⊂ C_b(R^d) with the above property is called a convergence determining class. Define a metric on P(R^d) by

d(µ, ν) := Σ_{i=1}^{∞} 2^{-i} ( | ∫ f_i dµ − ∫ f_i dν | ∧ 1 ).

This metric is compatible with convergence in law, i.e., µ_n → µ if and only if d(µ_n, µ) → 0. Unfortunately this metric is not complete, so we usually use other metrics. (It is complete for P(D), D ⊂ R^d closed and bounded.)

Fact: if d(·, ·) is a metric, then d(·, ·) ∧ 1 and d(·, ·)/(1 + d(·, ·)) are equivalent bounded metrics.

One such convenient metric is

ρ(µ, ν) := inf_{X ≈ µ, Y ≈ ν} E[‖X − Y‖ ∧ 1],

where 'X ≈ µ' stands for 'X has law µ' and likewise for Y, ν, and the infimum on the right is over all pairs of random variables (X, Y) such that X has law µ and Y has law ν. In fact, for p ≥ 1,

W_p(µ, ν) := inf_{X ≈ µ, Y ≈ ν} E[‖X − Y‖^p]^{1/p}

defines a metric on {µ ∈ P(R^d) : ∫ ‖x‖^p dµ(x) < ∞}. These are called the Wasserstein-p metrics. They are complete metrics.

Skorohod's theorem: Consider a real random variable X whose distribution function F is continuous and strictly increasing. Then its inverse F^{-1} : [0, 1] → R is well defined. For U := F(X),

P(F(X) ≤ x) = P(X ≤ F^{-1}(x)) = F(F^{-1}(x)) = x,

i.e., U is uniformly distributed on [0, 1]. Therefore X = F^{-1}(U).

Now consider µ_n → µ_∞ in P(R^d) and F_n, 1 ≤ n ≤ ∞, the corresponding distribution functions, assumed to be continuous and strictly increasing. Then F_n → F_∞ pointwise. Let U := a uniformly distributed random variable on the probability space ([0, 1], B([0, 1]), P), P := the Lebesgue measure, defined as U(ω) = ω. Then X_n := F_n^{-1}(U) has law µ_n for 1 ≤ n ≤ ∞, and X_n → X_∞ a.s. (in fact, pointwise).

The following theorem, due to Skorohod, generalizes this claim.

Theorem: If µ_n → µ_∞ in P(R^d), then there exist R^d-valued random variables X_n, 1 ≤ n ≤ ∞, such that the law of X_n is µ_n for 1 ≤ n ≤ ∞ and X_n → X_∞ a.s.

Scheffé's theorem below gives a stronger form of convergence.

Theorem: Suppose µ_n ∈ P(R^d), 1 ≤ n ≤ ∞, have densities p_n(·) resp. (i.e., ∫ f(x) dµ_n(x) = ∫ f(x) p_n(x) dx for all bounded measurable f), and p_n(x) → p_∞(x) for all x outside a set of zero Lebesgue measure. Then

∫ |p_n(x) − p_∞(x)| dx → 0.

This is the so-called 'total variation' convergence.
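The quantile-coupling idea behind Skorohod's theorem can be seen concretely. In the sketch below (assuming NumPy; the choice of exponential laws is ours, purely illustrative), µ_n = Exp(1 + 1/n) converges in law to Exp(1), and the variables X_n := F_n^{-1}(U), all driven by a single uniform U, converge pointwise to X_∞ = F_∞^{-1}(U), even though nothing forces the originally given random variables to live on one probability space:

    import numpy as np

    rng = np.random.default_rng(5)
    U = rng.random(100_000)                      # one uniform variable drives everything

    def F_inv(u, lam):
        return -np.log1p(-u) / lam               # inverse CDF of Exp(lam)

    X_inf = F_inv(U, 1.0)
    for n in (1, 10, 100, 1000):
        X_n = F_inv(U, 1.0 + 1.0 / n)            # X_n has law Exp(1 + 1/n) -> Exp(1) in law
        print(n, np.max(np.abs(X_n - X_inf)))    # sup over the sample shrinks: pointwise convergence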
The characteristic function of µ ∈ P(R^d) is defined as the complex valued function

t ∈ R^d ↦ φ_µ(t) := ∫ e^{i⟨t,x⟩} dµ(x),

i.e., the Fourier transform of µ. If X is a random variable with law µ, then this is simply E[e^{i⟨t,X⟩}]. The following properties are immediate.

1. φ_µ(0) = 1, |φ_µ(t)| ≤ 1, φ_µ(−t) = the complex conjugate of φ_µ(t).

2. φ_µ is uniformly continuous:

|φ_µ(t + h) − φ_µ(t)| ≤ E[|e^{i⟨t+h,X⟩} − e^{i⟨t,X⟩}|] = E[|e^{i⟨h,X⟩} − 1|] → 0 as ‖h‖ ↓ 0.

3. φ_µ is positive definite in the sense that for n ≥ 1; t_1, · · ·, t_n ∈ R^d; c_1, · · ·, c_n ∈ C,

Σ_{i,j} c_i φ(t_i − t_j) c̄_j ≥ 0.

This follows from the observation that for a d-dimensional random variable X with law µ, the left hand side is

E[ Σ_{i,j} c_i e^{i⟨t_i,X⟩} (c_j e^{i⟨t_j,X⟩})̄ ] = E[ | Σ_i c_i e^{i⟨t_i,X⟩} |² ] ≥ 0.

4. For a d-dimensional random variable X with law µ, a ∈ R, b ∈ R^d and ν := the law of aX + b,

φ_ν(t) = e^{i⟨t,b⟩} φ_µ(at).

Any φ : R^d → C satisfying the first three conditions is necessarily the characteristic function of some µ ∈ P(R^d) (Bochner's theorem). The next two results make this a very useful notion.

Lemma: If φ_µ = φ_ν for µ, ν ∈ P(R^d), then µ = ν.

Proof: We need the Fourier transform of the Gaussian function:

(1/(2π)^d) ∫ e^{i⟨x,y⟩} e^{-σ²‖y‖²/2} dy = (1/√((2π)^d σ^{2d})) e^{-‖x‖²/(2σ²)}

for σ > 0. Replace x by x − z and integrate both sides with respect to µ(dx). Interchanging the order of integrals,

(1/(2π)^d) ∫ φ_µ(y) e^{-i⟨z,y⟩} e^{-σ²‖y‖²/2} dy = (1/√((2π)^d σ^{2d})) ∫ e^{-‖x−z‖²/(2σ²)} µ(dx).

An analogous equation holds for ν. Since φ_µ = φ_ν, we have

(1/√((2π)^d σ^{2d})) ∫ e^{-‖x−z‖²/(2σ²)} µ(dx) = (1/√((2π)^d σ^{2d})) ∫ e^{-‖x−z‖²/(2σ²)} ν(dx).

Hence for bounded continuous f : R^d → R,

(1/√((2π)^d σ^{2d})) ∫∫ f(z) e^{-‖x−z‖²/(2σ²)} µ(dx) dz = (1/√((2π)^d σ^{2d})) ∫∫ f(z) e^{-‖x−z‖²/(2σ²)} ν(dx) dz.

Interchanging the order of integration and letting σ ↓ 0, we get

∫ f dµ = ∫ f dν.

This holds for all bounded continuous f. Thus µ = ν.
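Characteristic functions are easy to estimate by Monte Carlo as sample averages of e^{itX}. A small sketch (ours, assuming NumPy) compares the empirical characteristic function of standard Gaussian samples with the known transform e^{-t²/2}, illustrating the definition and, in spirit, the uniqueness lemma:

    import numpy as np

    rng = np.random.default_rng(6)
    X = rng.standard_normal(200_000)

    for t in (0.0, 0.5, 1.0, 2.0):
        phi_hat = np.exp(1j * t * X).mean()       # empirical E[exp(i t X)]
        print(t, phi_hat, np.exp(-t * t / 2))     # exact characteristic function of N(0, 1)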

The next theorem, known as the Lévy continuity theorem, is the key theorem of this topic, being a standard tool for establishing convergence in law.

Theorem: If µ_n → µ in P(R^d), then φ_{µ_n} → φ_µ. Conversely, if φ_{µ_n} → some φ : R^d → C and φ is continuous at the origin, then φ = φ_µ for some µ ∈ P(R^d) and µ_n → µ in P(R^d).

Proof: If µ_n → µ in P(R^d), then by Skorohod's theorem there exist a probability space (Ω, F, P) and d-dimensional random variables X_n, n ≥ 1; X defined on it such that the law of X_n is µ_n ∀ n, the law of X is µ, and X_n → X a.s. Then

|φ_{µ_n}(t) − φ_µ(t)| = |E[e^{i⟨X_n,t⟩}] − E[e^{i⟨X,t⟩}]| ≤ E[|e^{i⟨X_n − X, t⟩} − 1|] → 0.

The proof of the converse is more technical and is omitted.

Lecture 6

Vivek Borkar

IIT BOMBAY

September 2020

Independence

Intuitively, two events are independent if the occurrence or non-occurrence of one does not affect the probability of the other. To make this precise, we shall jump ahead a bit and, given two events A, B ∈ F with P(B) > 0, define the conditional probability of A given B as

P(A|B) := P(AB)/P(B),

where we use the standard abbreviation AB for A ∩ B.

The idea is that if you already know that the given sample point ω is in B, you can restrict your sample space to B. The probability of any event C ∈ F with this extra knowledge should then be proportional to P(CB): 'proportional to' instead of 'equal to' because the latter would lead to the anomaly that the probability of the entire sample space itself would then be P(B), which can be < 1. Therefore normalization by P(B) is needed. This motivates the above definition, which can be equivalently written as

P(AB) = P(A|B) P(B).

Having defined this, it makes sense to say that the occurrence of B does not affect the probability of A if P(A|B) = P(A), that is, P(AB) = P(A)P(B). This equation is symmetric in A, B, so we can say that A and B are independent of each other if it holds.

More generally, we say that a finite collection of events A_1, · · ·, A_n ∈ F is independent if

P(∩_{k=1}^{m} A_{i_k}) = Π_{k=1}^{m} P(A_{i_k})

for every subcollection {i_1, · · ·, i_m} ⊂ {1, · · ·, n}. Even more generally, we define independence of an arbitrary collection of events by the requirement that every finite subcollection thereof be independent.

The collection is pairwise independent if any A_i, A_j, j ≠ i, are independent. This is automatically true if the collection is independent, but the converse is not true. Consider, e.g., two independent coin tosses X, Y corresponding to outcomes ±1 with equal probability 1/2, corresponding to head or tail: each of the four outcomes {X = x, Y = y} has probability 1/4 for x, y = ±1. Then the 'composite coin' Z = XY takes values ±1 with equal probability and can be seen to be independent of X or of Y, but not of the pair (X, Y).

We can write P(AB) = P(A)P(B) as

E[I_A I_B] = E[I_A] E[I_B]

and say that the indicator random variables I_A, I_B are independent. More generally, we say that R^d-valued random variables X_1, · · ·, X_n are independent if

P(X_{i_k} ∈ A_{i_k}, 1 ≤ k ≤ m) = Π_{k=1}^{m} P(X_{i_k} ∈ A_{i_k}) ∀ A_{i_k} ∈ B(R^d), 1 ≤ k ≤ m,

and all subcollections {i_1, · · ·, i_m} ⊂ {1, · · ·, n}. This is equivalent to

E[ Π_{k=1}^{m} f_{i_k}(X_{i_k}) ] = Π_{k=1}^{m} E[ f_{i_k}(X_{i_k}) ]

for all measurable f_i : R^d → R, 1 ≤ i ≤ n, for which the above expectations are well defined, and all subsets {i_1, · · ·, i_m} ⊂ {1, · · ·, n}. This reduces to the previous definition for f_i = I_{A_i}, and conversely, the previous definition implies this by an approximation argument via simple functions and induction. Again, an arbitrary collection of random variables is independent if all its finite subcollections are.

A measure theoretic definition equivalent to the above is as follows. Let µ ∈ P((R^d)^n) be the joint law of (X_1, · · ·, X_n) and µ_i ∈ P(R^d) the law of X_i (called the 'i-th marginal') for 1 ≤ i ≤ n; then µ(dx_1 · · · dx_n) = Π_{i=1}^{n} µ_i(dx_i) in the sense that

µ( Π_{i=1}^{n} A_i ) = Π_{i=1}^{n} µ_i(A_i) for all choices of A_i ∈ B(R^d), 1 ≤ i ≤ n.

The equivalence is obvious. Such µ are called 'product measures'.

Some consequences of independence

Let {X_n} be independent R^d-valued random variables and define the σ-field F* := ∩_{n≥1} σ(X_n, X_{n+1}, · · ·). This is called the tail σ-field, and any event A ∈ F* is called a tail event.

Theorem (Kolmogorov's zero-one law): A ∈ F* ⟹ P(A) = 0 or 1.

Proof (Sketch): If A ∈ F*, then A ∈ σ(X_m, m > n) ∀ n

⟹ A is independent of σ(X_m, m ≤ n) ∀ n

⟹ A is independent of σ(X_m, m ≥ 1), and therefore of F* (prove this)

⟹ A is independent of itself

⟹ P(A) = P(A ∩ A) = P(A)²

⟹ P(A) = 0 or 1.
Theorem (Borel–Cantelli lemma, contd.): If {A_n, n ≥ 1} are independent, then Σ_n P(A_n) = ∞ ⟹ P(A_n i.o.) = 1.

Proof:

E[e^{-Σ_{m=1}^{n} I_{A_m}}] = E[ Π_{m=1}^{n} e^{-I_{A_m}} ]
= Π_{m=1}^{n} E[e^{-I_{A_m}}]
= Π_{m=1}^{n} ( e^{-1} P(A_m) + (1 − P(A_m)) )
= Π_{m=1}^{n} ( 1 − (1 − e^{-1}) P(A_m) )
≤ e^{-(1 − e^{-1}) Σ_{m=1}^{n} P(A_m)}
↓ 0 as n ↑ ∞.

By the monotone convergence theorem, it then follows that

E[e^{-Σ_{m=1}^{∞} I_{A_m}}] = lim_{n↑∞} E[e^{-Σ_{m=1}^{n} I_{A_m}}] = 0,

implying Σ_{m=1}^{∞} I_{A_m} = ∞ a.s., i.e., P(A_n i.o.) = 1.

Let {X_n} be independent random variables and S_n := Σ_{m=1}^{n} X_m. Then the following result of Lévy holds.

Theorem: S_n converges in law ⟺ it converges in probability ⟺ it converges a.s.

Limit theorems for independent random variables

1. Law of Large Numbers, 'weak' and 'strong': convergence of arithmetic means of a sequence of zero mean random variables to zero in probability / a.s. ('typical' behaviour).

2. Central Limit Theorem: spread (fluctuations) of sums of independent random variables around the 'typical' behaviour captured by the laws of large numbers, i.e., the limiting distribution of suitably normalized fluctuations around the mean.

3. The law of iterated logarithms: the growth of the asymptotic upper and lower envelopes of the sums in the almost sure sense.

4. Cramér's theorem: captures how the probabilities of untypical events decay to zero. This is an instance of 'large deviations' theory, which quantifies probabilities of large departures from the typical behaviour.

Suppose {X_n, n ≥ 1} are real integrable random variables with mean zero. (The general case of nonzero means can be reduced to the case of zero means by replacing X_n with X_n − E[X_n] for n ≥ 1.) Let S_0 = 0, S_n = Σ_{m=1}^{n} X_m, n ≥ 1. We begin with the weak law of large numbers.

Theorem (Weak Law of Large Numbers): Let {X_n} be zero mean, pairwise independent and uniformly integrable. Then S_n/n → 0 in probability.

Proof: Let ε > 0. We first consider the special case when sup_n E[X_n²] ≤ K < ∞ for some K. Then, since by pairwise independence E[X_i X_j] = E[X_i] E[X_j] = 0 for i ≠ j,

P(|S_n/n| ≥ ε) ≤ E[S_n²]/(n²ε²) = Σ_{m=1}^{n} E[X_m²]/(n²ε²) ≤ K/(nε²) → 0.

For the general case, pick N > 0 large enough so that

sup_{1≤n<∞} E[|X_n| I{|X_n| > N}] < η

for a prescribed η > 0. Let X̃_n := X_n I{|X_n| ≤ N}, n ≥ 1, and S̃_n := Σ_{m=1}^{n} (X̃_m − E[X̃_m]). Then

E[|X̃_n − E[X̃_n]|²] ≤ 4N² < ∞,

and therefore, by the 'special case' above,

P(|S̃_n/n| ≥ ε/2) → 0.

For n ≥ 1, let X̂_n := X_n − X̃_n = X_n I{|X_n| > N} and Ŝ_n := Σ_{m=1}^{n} (X̂_m − E[X̂_m]). Then

P(|Ŝ_n/n| ≥ ε/2) ≤ 2E[|Ŝ_n|]/(nε) ≤ 8 sup_{1≤n<∞} E[|X_n| I{|X_n| > N}] / ε < 8η/ε.

Hence

lim sup_{n↑∞} P(|S_n/n| ≥ ε) ≤ lim sup_{n↑∞} P(|S̃_n/n| ≥ ε/2) + lim sup_{n↑∞} P(|Ŝ_n/n| ≥ ε/2) ≤ 8η/ε.

Since η > 0 was arbitrary, the claim follows.

Theorem (Strong Law of Large Numbers): Let {X_n} as above be pairwise independent with sup_n E[|X_n|²] ≤ K < ∞ for some K. Then S_n/n → 0 a.s.

Proof: Let ε > 0. Note that

Σ_{n=1}^{∞} P(|S_{n²}/n²| ≥ ε) ≤ Σ_{n=1}^{∞} n²K/(n⁴ε²) < ∞.

By the Borel–Cantelli lemma, lim sup_{n↑∞} |S_{n²}/n²| ≤ ε a.s. Considering ε = 1/k, k = 1, 2, · · ·, we have

S_{n²}/n² → 0 a.s.   (1)
Now Glivenko-Cantelli theorem:
0 1 log 'µn (t)
1
X B
supn2m<(n+1)2 Sm Sn 2
P @ ✏C
A h p i
n=1 n2 Let {Xn} be i.i.d. with common distribution function F . = log E eit(Sn na)/( n)
 n
2 Y h p i
1
X E supn2m<(n+1)2 Sm Sn 2 = log E eit(Xm a)/( n) (by independence)
 Define the ‘empirical distribution function’ m=1
n=1 n 4 ✏2 n
X h p i
" #
P(n+1)2 1 2 n = log E eit(Xm a)/( n)
E Sm Sn 2 1 X m=1
1
X m=n2 Fn(x) := I{Xn  x}. n " ✓ ◆#
 n m=1 X it(Xm a) 1 2 1
n 4 ✏2 = log E 1 + p t (Xm a)2 + O
n=1 m=1 n 2 2n n3/2
1 (2n + 1)2K Then ✓ ◆!
X n
X t2 1
 < 1. = log 1 +O
n=1 n 4 ✏2 sup |Fn(x) F (x)| ! 0 a.s. m=1 2n n3/2
x t2 ⇣ ⌘
⇡ + O n 1/2 .
2

Uniform Law of Large Numbers:


Arguing as before using the Borel-Cantelli lemma,
Sm Sn 2 Pn t2
supn2m<(n+1)2 n2 Let Sn := m=1(f (Xm) E[f (Xm)]), {Xm} i.i.d. zero Hence 'µn (t) ! e 2, which is the characteristic function
! 0 a.s. (2)
n2 mean and f bounded continuous. Uniform LLN gives of gaussian distribution with zero mean and unit variance,
Let [m] := max{n2 : n 1, n2  m < (n + 1)2}. Then conditions on families A of functions for which denoted N (0, 1). By the Levy continuity theorem, this
combining (1) and (2), we have Sn leads to:
lim sup ! 0 a.s.
S[m] Sm S[m] n"1 f 2A n
Sm
= + ! 0 a.s.
[m] [m] [m] Example: Vapnik-Chervonenkis Theorem Theorem Snpna
n
! N (0, 1) in law.

Sm Sm [m]
=) = ⇥ ! 0 Empirical Risk Minimization: justifies minimization of
m [m] m
a.s. 1 Pn
empirical loss n
2 m=1 L(Xm, Ym, ✓) in place of E [L(Xn, Yn, ✓)]
where {(Xn, Yn)} are i.i.d. input-output pairs.

We state next a much more general version due to Lind-


For pairwise independent and identically distributed
Central Limit Theorem berg and Feller, which gives both necessary and sufficient
random variables, a result of Etemadi establishes SLLN
conditions. Let {Xn} be independent zero mean with
under the weaker requirement that Xn be integrable. h i P
Consider {Xn} independent and identically distributed E Xn2 = n 2 < 1 8n. Define s2 := n
n
2
m=1 m, n 1. Sup-

with mean a and variance 2. pose s2


n " 1. Either of the following, which can be shown
Corollary If {Xn} are pairwise independent and identi-
to be equivalent, constitute the ‘Lindberg condition’.
cally distributed with E [X1] = 1, then Snn ! 1 a.s.
n Z ⇣ ⌘
Defining Sn, n 1, as before, let µn denote the law of X 2 dP = o s2
Xm 8✏ > 0.
n
Snpna for n 1. m=1 {|Xm|>✏sm}
Proof Let 0 < N < 1. Then n n Z
X 2 dP = o s2
⇣ ⌘
Pn
Xm n 8✏ > 0.
Sn m=1 Xm ^N m=1 {|Xm|>✏sn}
lim inf lim = E [X1 ^ N ] . ‘Uniform CLT’ also possible.
n"1 n n"1 n
Let N " 1 on the right to conclude. 2 Theorem (Central Limit Theorem) Lindberg’s condition
holds if and only if Ssnn ! N (0, 1) in law and snn ! 0.
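Both the law of large numbers and the central limit theorem are visible in a few lines of simulation (an added sketch assuming NumPy; the uniform summands are an arbitrary choice with a = 1/2, σ² = 1/12): S_n/n concentrates near a, while the normalized sums (S_n − na)/(σ√n) have empirical variance close to 1 and roughly Gaussian tail fractions.

    import numpy as np

    rng = np.random.default_rng(7)
    a, sigma = 0.5, np.sqrt(1.0 / 12.0)          # mean and std of Uniform(0, 1)

    for n in (10, 100, 2500):
        S = rng.random((4000, n)).sum(axis=1)    # 4000 independent copies of S_n
        Z = (S - n * a) / (sigma * np.sqrt(n))   # normalized fluctuations
        print(n,
              np.mean(np.abs(S / n - a)),        # LLN: S_n/n is close to a
              Z.var(),                           # CLT: variance of Z is close to 1
              np.mean(Z > 1.96))                 # CLT: upper tail fraction near 0.025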
Law of Iterated Logarithms

With the same notation as above, we also have:

Theorem: Suppose that for some ε ∈ (0, 1) and C > 0,

Σ_{m=1}^{n} E[|X_m|³] / s_n³ ≤ C / s_n^{1+ε}.

Then

lim sup_{n↑∞} S_n / √(2 s_n² log log s_n) = 1 a.s.,
lim inf_{n↑∞} S_n / √(2 s_n² log log s_n) = −1 a.s.

Cramér's theorem: Let {X_n} be independent and identically distributed with mean a. Define I : R → R ∪ {∞} by

I(x) := sup_θ ( θx − log E[e^{θX_1}] ).

I(·) is known as the rate function.

Properties of I(·):

1. I(x) ≥ 0·x − log E[e^{0·X_1}] = 0.

2. I, being a pointwise supremum of affine functions, is convex.

3. By Jensen's inequality, E[e^{θX_1}] ≥ e^{θa}. Taking logarithms, θa − log E[e^{θX_1}] ≤ 0, so I(a) ≤ 0. But I(·) ≥ 0, so I(a) = 0 = min_x I(x).

4. I(·) is decreasing on (−∞, a] and increasing on [a, ∞).

Theorem: Suppose I(θ) < ∞ ∀ θ ∈ R. Then for any B ∈ B(R),

−inf_{B°} I ≤ lim inf_{n↑∞} (1/n) log P(S_n/n ∈ B) ≤ lim sup_{n↑∞} (1/n) log P(S_n/n ∈ B) ≤ −inf_{B̄} I,

where B° denotes the interior and B̄ the closure of B. In particular, if a ∉ B̄, then the probability that S_n/n ∈ B decays at least at the exponential rate inf_{B̄} I. If, in addition, inf_{B°} I = inf_{B̄} I =: δ, then P(S_n/n ∈ B) ≈ e^{-nδ} modulo lower order terms (logarithmic equivalence – exact asymptotics are given by the Bahadur–Rao theorem, which supplies the pre-factor).

More generally, the 'Large Deviations Principle' (S. R. S. Varadhan): a sequence of probability measures µ_ε ∈ P(S), ε > 0, satisfies the large deviations principle with rate function I if for all Borel sets B in S,

−inf_{B°} I ≤ lim inf_{ε↓0} ε log µ_ε(B) ≤ lim sup_{ε↓0} ε log µ_ε(B) ≤ −inf_{B̄} I.

Laplace–Varadhan principle:

lim_{ε→0} ε log E[e^{F(X)/ε}] = max_x ( F(x) − I(x) ).

This connects probabilistic phenomena with variational principles, i.e., optimization problems. Intuition: go for the dominant term; e.g., for c_i > 0, 1 ≤ i ≤ N,

lim_{n→∞} (1/n) log( Σ_{m=1}^{N} c_m^n ) = log( max_m c_m ).

Various large deviations results:

1. Associated with the strong law of large numbers for IID random variables: Cramér's theorem.

2. Associated with a.s. convergence of empirical measures of IID random variables to their common law: Sanov's theorem.

3. Extensions of the above to Markov chains.

4. 'Small noise limit' of stochastic differential equations as the noise goes to zero: Freidlin–Wentzell theory.

5. Thermodynamic limits of lattice models in statistical mechanics: variational principles of thermodynamics.

6. Associated with 'fluid limits' of point processes.

7. Associated with 'mean field limits': Dawson and Gärtner.

Lecture 7

Vivek Borkar

IIT BOMBAY

September 2020
Lecture 7

Vivek Borkar

IIT BOMBAY

September 2020

Given a probability space (Ω, F, P), for A, B ∈ F with P(B) > 0, the conditional probability of A given B is P(A|B) := P(A ∩ B)/P(B):

1. We interpret 'given B', i.e., 'given that the sample point ω is in B', as restricting the sample space to B.

2. The conditional probability then restricts the set A to B, i.e., replaces it by P(A ∩ B), and then normalizes by P(B) to define the conditional probability P(A|B) := P(A ∩ B)/P(B), so that the new 'sample space' B satisfies P(B|B) = 1.

More generally, consider a partition D := {B_1, ···, B_m} ⊂ F of Ω with P(B_i) > 0 ∀ i.

Suppose we are told which particular bin B_i the sample point ω falls in. Then we can say that 'when ω ∈ B_i, the conditional probability of A ∈ F is P(A|B_i) := P(A ∩ B_i)/P(B_i)'.

In other words, we can define a random variable P(A|G), where G := the σ-field generated by {B_1, ···, B_m}, as

P(A|G) := Σ_{i=1}^m (P(A ∩ B_i)/P(B_i)) I_{B_i},

and call it the 'conditional probability of A given G'.

G consists of ∅, the B_i's and ∪_{i∈S} B_i for all possible S ⊂ D. Thus for any C = ∪_{i∈S} B_i ∈ G, S ⊂ D, we have

∫_C P(A|G) dP = ∫_C (Σ_{i=1}^m (P(A ∩ B_i)/P(B_i)) I_{B_i}) dP
= Σ_{i∈S} (P(A ∩ B_i)/P(B_i)) P(B_i)
= Σ_{i∈S} P(A ∩ B_i)
= P(A ∩ C).

Let X denote an integrable random variable on (Ω, F, P) and G a sub-σ-field of F. The foregoing suggests the following definition:

Definition: The conditional expectation of X given G is defined as the G-measurable random variable, denoted E[X|G], that satisfies

∫_C X dP = ∫_C E[X|G] dP ∀ C ∈ G.

Does such a random variable exist? If so, is it unique?

If Y, Z were two G-measurable random variables satisfying the definition, we would have

∫_C Y dP = ∫_C Z dP ∀ C ∈ G ⟹ ∫_C (Y − Z) dP = 0 ∀ C ∈ G.

Taking C := {Y − Z > 0} ∈ G, we have

∫_{Y−Z>0} (Y − Z) dP = E[(Y − Z) I_{Y−Z>0}] = 0,

which implies Y ≤ Z a.s. A symmetric argument leads to Y ≥ Z a.s., so Y = Z a.s. ⟹ a.s. uniqueness.
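Before the general existence argument below, here is a minimal numerical check (not from the lecture) of the finite-partition case: the defining identity ∫_C P(A|G) dP = P(A ∩ C) is verified directly. The sample space, the probability weights p, the partition blocks and the event A are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Finite sample space Omega = {0,...,11}. The weights p, the partition blocks
# B_1, B_2, B_3 and the event A are illustrative assumptions.
p = rng.dirichlet(np.ones(12))                        # P({omega})
blocks = [np.arange(0, 4), np.arange(4, 9), np.arange(9, 12)]

A = np.zeros(12, dtype=bool)
A[[1, 2, 5, 10, 11]] = True                           # indicator of the event A

# P(A|G) = sum_i P(A & B_i)/P(B_i) * I_{B_i}, a G-measurable random variable
condA = np.zeros(12)
for B in blocks:
    mask = np.zeros(12, dtype=bool)
    mask[B] = True
    condA[mask] = p[A & mask].sum() / p[mask].sum()

# Defining property on C = B_1 union B_3 in G:  integral_C P(A|G) dP = P(A & C)
C = np.zeros(12, dtype=bool)
C[blocks[0]] = True
C[blocks[2]] = True
print((condA * p)[C].sum(), p[A & C].sum())           # the two numbers agree
```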
Digression: Projection theorem:

Let H be a Hilbert space over the reals, i.e., a vector space with an inner product ⟨·,·⟩ such that the associated norm x ∈ H ↦ ‖x‖ := √⟨x, x⟩ is complete.

Theorem Given a closed subspace K of H and x ∈ H, there exists a unique x* ∈ K such that

‖x − x*‖ = min_{y∈K} ‖x − y‖.

Proof We first prove the uniqueness. If x̂ is another point in K attaining the above minimum, then x*, x̂ are equidistant from x, and x' = (x* + x̂)/2 ∈ K will necessarily be at a strictly smaller distance from x, a contradiction. Thus x*, if it exists, is unique.

For existence, let x ∉ K, otherwise there is nothing to prove. Let c := inf_{y∈K} ‖x − y‖ and let x_n ∈ K, n ≥ 1, be such that ‖x − x_n‖ ↓ c. By the parallelogram law,

‖(x_n − x) + (x_m − x)‖² + ‖(x_n − x) − (x_m − x)‖² = 2‖x_n − x‖² + 2‖x_m − x‖²
⟹ ‖x_n − x_m‖² = 2‖x_n − x‖² + 2‖x_m − x‖² − 4‖(x_n + x_m)/2 − x‖²
≤ 2‖x_n − x‖² + 2‖x_m − x‖² − 4c²
→ 2c² + 2c² − 4c² = 0 as n, m ↑ ∞.

Then {x_n} is Cauchy and therefore x_n → some x̌ ∈ K (K being closed). It follows that ‖x − x̌‖ = c. Set x* = x̌.

Corollary For x, x* as above, ⟨x − x*, y − x*⟩ = 0 ∀ y ∈ K.

Proof If not, either y − x* or 2x* − y ∈ K makes a strictly acute angle with x − x*, say the former. Then for points sufficiently close to x* on the line segment joining y and x*, the distance from x is strictly less than ‖x − x*‖, a contradiction.

Let H = L²(Ω, F, P), i.e., the space of random variables X on (Ω, F, P) that are square-integrable, i.e., E[X²] < ∞.

With the inner product ⟨X, Y⟩ := E[XY], it is a Hilbert space.

Let G ⊂ F be a sub-σ-field of F. Then L²(Ω, G, P) is a closed subspace of H.

For X ∈ H, let X̂ denote its projection to L²(Ω, G, P).

By the above corollary (applied with x = X, x* = X̂, and y = Y or y = θ := the zero vector),

E[(X − X̂)Y] = E[(X − X̂)(Y − X̂)] − E[(X − X̂)(θ − X̂)] = 0 for Y ∈ L²(Ω, G, P).

That is, E[XY] = E[X̂Y].

Letting Y = I_C for some C ∈ G, we have

∫_C X dP = E[X I_C] = E[X̂ I_C] = ∫_C X̂ dP.

Thus we can define E[X|G] := X̂, proving existence for square-integrable X.

Next, let X ≥ Y in H. Then for C ∈ G,

∫_C E[X|G] dP = ∫_C X dP ≥ ∫_C Y dP = ∫_C E[Y|G] dP
⟹ ∫_C (E[X|G] − E[Y|G]) dP ≥ 0.

Taking C := {E[X|G] < E[Y|G]} ∈ G, we get a contradiction unless P(C) = 0. Hence

X ≥ Y a.s. ⟹ E[X|G] ≥ E[Y|G] a.s.

Now let X ≥ 0 be integrable. Then for N ≥ 1,

∫_C X ∧ N dP = ∫_C E[X ∧ N|G] dP ∀ C ∈ G,

and, by the monotonicity just proved, E[X ∧ N|G] is non-decreasing in N. Let E[X|G] := lim_{N↑∞} E[X ∧ N|G], which is well defined as a G-measurable [0, ∞]-valued random variable.

Letting N ↑ ∞ in the above equality and using the monotone convergence theorem, we have

∫_C X dP = ∫_C E[X|G] dP ∀ C ∈ G.

This defines E[X|G] (necessarily < ∞ a.s.) for non-negative integrable X. For general integrable X, define

E[X|G] := E[X⁺|G] − E[X⁻|G].

Theorem For an integrable random variable X on (Ω, F, P) and a sub-σ-field G ⊂ F, there exists an a.s. unique integrable G-measurable random variable E[X|G] such that ∫_C X dP = ∫_C E[X|G] dP for all C ∈ G.

A simpler but far from elementary proof of existence of E[X|G] uses

the Radon-Nikodym theorem: Suppose µ is a positive measure on (Ω, F) and ν a signed measure on (Ω, F) such that ν is absolutely continuous w.r.t. µ, i.e.,

µ(A) = 0, A ∈ F ⟹ ν(A) = 0.

Then there exists an F-measurable function Z : Ω → R such that

∫ f dν = ∫ f Z dµ ∀ F-measurable f.

Define a signed measure µ on (Ω, G) by µ(C) = ∫_C X dP for C ∈ G. Let P₀ denote the restriction of P to (Ω, G). Then for C ∈ G, P₀(C) = 0 ⟹ µ(C) = 0.

By the Radon-Nikodym theorem, there exists an integrable random variable E[X|G] on (Ω, G, P₀) such that µ(C) = ∫_C E[X|G] dP, i.e., ∫_C X dP = ∫_C E[X|G] dP for all C ∈ G.

We call E[X|G] the conditional expectation of X given G.
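The L² construction above identifies E[X|G] with an orthogonal projection. A minimal sketch of this picture on a finite sample space (the weights p, the partition labels generating G, and X are illustrative assumptions): the projection is the block-wise conditional mean, and the orthogonality relation E[(X − E[X|G])Y] = 0 holds for every G-measurable Y.

```python
import numpy as np

rng = np.random.default_rng(1)

# Finite sample space with weights p; G is generated by the partition encoded in
# `labels`; X is an arbitrary random variable.  All numbers are illustrative.
p = rng.dirichlet(np.ones(10))
labels = np.array([0, 0, 0, 1, 1, 1, 1, 2, 2, 2])     # block index of each omega
X = rng.normal(size=10)

# E[X|G] = projection of X onto the functions constant on each block
#        = block-wise conditional mean E[I_B X] / P(B).
cond_exp = np.zeros(10)
for b in np.unique(labels):
    m = labels == b
    cond_exp[m] = (p[m] * X[m]).sum() / p[m].sum()

# Orthogonality E[(X - E[X|G]) Y] = 0 for G-measurable Y; it suffices to check
# Y = I_B for each block B.
for b in np.unique(labels):
    Y = (labels == b).astype(float)
    print(b, float(np.dot(p, (X - cond_exp) * Y)))    # ~ 0 up to rounding
```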
For A ∈ F, we define P(A|G) := E[I_A|G] and call it the conditional probability of A given G.

Note that these are G-measurable random variables.

When G = σ(Y_α, α ∈ I) for some collection {Y_α, α ∈ I} of random variables indexed by some index set I, we write E[X|Y_α, α ∈ I] and P(A|Y_α, α ∈ I) resp., and call these the conditional expectation, resp. probability, given Y_α, α ∈ I.

Properties of conditional expectations

Fix a probability space (Ω, F, P) and let G′ ⊂ G ⊂ F be sub-σ-fields. X, Y, W, Z denote integrable random variables on (Ω, F, P). In what follows, all expectations and conditional expectations are assumed to be well defined.

1. (Monotonicity) X ≥ Y a.s. ⟹ E[X|G] ≥ E[Y|G] a.s. (Already proved.)

2. (Linearity) If X, Y are G-measurable,

E[XW + YZ|G] = X E[W|G] + Y E[Z|G].

For C ∈ G, we have

∫_C E[W + Z|G] dP = ∫_C (W + Z) dP = ∫_C W dP + ∫_C Z dP = ∫_C E[W|G] dP + ∫_C E[Z|G] dP,

so E[W + Z|G] = E[W|G] + E[Z|G].

Thus it suffices to prove that E[XW|G] = X E[W|G]. This is immediate for X = I_C, C ∈ G, because for any A ∈ G,

∫_A E[I_C W|G] dP = ∫_A I_C W dP = ∫_{A∩C} W dP = ∫_{A∩C} E[W|G] dP = ∫_A I_C E[W|G] dP.

This extends to simple X by linearity of expectations. For general X, it follows by a standard approximation argument. A special case of interest is when X, Y are constants.

3. (Filtering property) E[E[X|G]|G′] = E[X|G′].

This follows from the fact that C ∈ G′ ⟹ C ∈ G, so

∫_C E[E[X|G]|G′] dP = ∫_C E[X|G] dP = ∫_C X dP = ∫_C E[X|G′] dP.

As an illustration, consider G′ = σ(B_i, 1 ≤ i ≤ n) with the B_i's disjoint, and G = σ(B_ij, 1 ≤ j ≤ n_i, 1 ≤ i ≤ n) with B_i = ∪_j B_ij, where the B_ij's are disjoint.

Then on B_ij,

E[X|G] = E[I_{B_ij} X]/P(B_ij),

and on B_i,

E[X|G′] = E[I_{B_i} X]/P(B_i).

Therefore

E[X|G′] = Σ_i (E[I_{B_i} X]/P(B_i)) I_{B_i}
= Σ_i (Σ_j (E[I_{B_ij} X]/P(B_ij)) (P(B_ij)/P(B_i))) I_{B_i}
= Σ_i (Σ_j (E[I_{B_ij} X]/P(B_ij)) E[I_{B_ij}|G′]) I_{B_i}
= Σ_i E[Σ_j (E[I_{B_ij} X]/P(B_ij)) I_{B_ij} | G′] I_{B_i}
= E[E[X|G]|G′].
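A quick numerical check of this nested-partition computation (not from the lecture); the probability vector p, the refinement encoded by the label arrays `fine` and `coarse`, and X are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)

# Nested partitions: `coarse` generates G', `fine` refines it and generates G
# (each fine block sits inside a single coarse block).  p and X are illustrative.
p = rng.dirichlet(np.ones(12))
fine   = np.array([0, 0, 1, 1, 2, 2, 2, 3, 3, 4, 4, 5])
coarse = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 2, 2, 2])
X = rng.normal(size=12)

def cond_exp(X, labels, p):
    """E[X | sigma(partition)] = block-wise mean E[I_B X]/P(B)."""
    out = np.zeros_like(X)
    for b in np.unique(labels):
        m = labels == b
        out[m] = (p[m] * X[m]).sum() / p[m].sum()
    return out

lhs = cond_exp(cond_exp(X, fine, p), coarse, p)   # E[ E[X|G] | G' ]
rhs = cond_exp(X, coarse, p)                      # E[X|G']
print(np.max(np.abs(lhs - rhs)))                  # ~ 0: filtering property
```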
The first two properties suggest that, conditioned on a sub-σ-field,

1. conditional expectation is like a 'random integral', and

2. G-measurable random variables are 'treated like constants'.

We now make this intuition more precise.

Let Q ⊂ R denote the set of rationals. Let A₀ ⊂ B(R^d) denote the collection of sets of the form

{x = [x₁, ···, x_d] ∈ R^d : a_i < (or ≤) x_i < (or ≤) b_i, 1 ≤ i ≤ d},

for a_i ∈ Q ∪ {−∞}, b_i ∈ Q ∪ {∞}, with the understanding that for a_i = −∞ (resp., b_i = ∞) only the strict inequality is allowed.

Let A := the collection of finite unions of sets in A₀. Then A is countable.

Lemma A is closed under finite unions, intersections and complementation, i.e., it is a Boolean algebra. Furthermore, σ(A) = B(R^d).

Proof Clearly, ∅, R^d ∈ A. Closure under finite unions follows from the definition; that under finite intersections follows from the identity

(∪_i A_i) ∩ (∪_j B_j) = ∪_{i,j} (A_i ∩ B_j)

and the fact that A_i ∩ B_j ∈ A₀ if A_i, B_j ∈ A₀. Closure under complementation follows from the latter and the fact that A ∈ A₀ implies A^c ∈ A.

Theorem Given an integrable R^d-valued random variable X on (Ω, F, P) and a sub-σ-field G ⊂ F, there exists a P(R^d)-valued random variable µ such that E[f(X)|G] = ∫ f dµ a.s. for all bounded measurable f : R^d → R.

Proof Define µ(A) := P(X ∈ A|G) for A ∈ A.

By the properties of conditional expectations proved above, it easily follows that:

µ(∅) = 0 a.s., µ(R^d) = 1 a.s. (i.e., outside a set N₀ ∈ F with P(N₀) = 0, µ(∅) = 0 and µ(R^d) = 1),

µ(A^c) = 1 − µ(A) a.s. for A ∈ A (i.e., outside some N_A ∈ F with P(N_A) = 0), and,

for any B := {A₁, ···, A_n disjoint} ⊂ A, n ≥ 1,

P(∪_{i=1}^n A_i|G) = Σ_{i=1}^n P(A_i|G)

a.s. (i.e., outside some N′_B ∈ F with P(N′_B) = 0).

Since A is countable, we can take a common zero probability set N = N₀ ∪ (∪_A N_A) ∪ (∪_B N′_B) ∈ F outside which this holds regardless of the choice of A, {A_i, 1 ≤ i ≤ n} in A.

Then for each ω ∉ N, µ extends to a unique probability measure on σ(A) = B(R^d) by the Carathéodory extension theorem, denoted by µ again. This µ satisfies the desired conditions.

Formally, the regular conditional law µ of X given G is defined as the a.s. unique P(R^d)-valued random variable µ(ω, ·) such that:

1. for each A ∈ B(R^d), µ(·, A) is a G-measurable random variable,

2. for each ω ∈ Ω, µ(ω, ·) is a probability measure on (R^d, B(R^d)), and,

3. P(X ∈ A|G) = µ(·, A) a.s. ∀ A ∈ B(R^d).

To make the dependence of µ on G explicit, we write µ(·|G). Thus µ(A|G) = P(X ∈ A|G) a.s. and ∫ f(x) dµ(x|G) = E[f(X)|G] a.s.

When G = σ(X_α, α ∈ I), we write µ(·|X_α, α ∈ I).

We call µ(·|G) (resp., µ(·|X_α, α ∈ I)) the regular conditional law, or simply the conditional law, of X given G (resp., given X_α, α ∈ I).

The function

x = [x₁, ···, x_d] ∈ R^d ↦ F(x₁, ···, x_d|G) := µ(∏_{i=1}^d (−∞, x_i]|G)

is called the conditional distribution of X given G, or given X_α, α ∈ I, when appropriate.

From this theorem and already known properties of integrals with respect to probability measures, many other properties follow.

Theorem Let X, Y, X_n, 1 ≤ n ≤ ∞, be integrable real-valued random variables on (Ω, F, P) and G ⊂ F a sub-σ-field. Then the following hold.

1. (Conditional Chebyshev's inequality) P(|X| ≥ ε|G) ≤ E[|X||G]/ε a.s. for ε > 0.

2. (Conditional Jensen's inequality) For convex f : R → R, E[f(X)|G] ≥ f(E[X|G]) a.s.

3. (Conditional Fatou's lemma) If 0 ≤ X_n → X_∞ a.s., then lim inf_{n↑∞} E[X_n|G] ≥ E[X_∞|G] a.s.

4. (Conditional dominated convergence theorem) If X_n → X_∞ a.s. and |X_n| ≤ Y a.s. with E[Y|G] < ∞ a.s., then E[X_n|G] → E[X_∞|G] a.s.

5. (Conditional Minkowski inequality) Let ‖X‖_p, ‖Y‖_p < ∞ for some p, 1 ≤ p ≤ ∞. Then

E[|X + Y|^p|G]^{1/p} ≤ E[|X|^p|G]^{1/p} + E[|Y|^p|G]^{1/p}.

6. (Conditional Hölder inequality) For 1 ≤ p ≤ ∞ and 1/p + 1/q = 1, if 0 < ‖X‖_p, ‖Y‖_q < ∞, then

|E[XY|G]| ≤ E[|X|^p|G]^{1/p} E[|Y|^q|G]^{1/q}.

Conditional independence

We say that random variables X_i taking values in R^{d_i}, d_i ≥ 1, for 1 ≤ i ≤ n, are conditionally independent given a sub-σ-field G of F if any of the following three equivalent conditions holds:

1. P(X_i ∈ A_i, 1 ≤ i ≤ n|G) = ∏_{i=1}^n P(X_i ∈ A_i|G) ∀ A_i ∈ B(R^{d_i}), 1 ≤ i ≤ n.

2. E[∏_{i=1}^n f_i(X_i)|G] = ∏_{i=1}^n E[f_i(X_i)|G] ∀ bounded measurable f_i : R^{d_i} → R.

3. The regular conditional law of [X₁, ···, X_n] given G is a.s. the product of the regular conditional laws of the individual X_i's given G.

Events A_i, 1 ≤ i ≤ n, in F are conditionally independent given G if their indicators are.

Sub-σ-fields F_i, 1 ≤ i ≤ n, of F are conditionally independent given G if any collection A_i ∈ F_i, 1 ≤ i ≤ n, is.

Clearly, random variables X₁, ···, X_n are conditionally independent given G if σ(X_i), 1 ≤ i ≤ n, are.

An arbitrary family of random variables, events, or sub-σ-fields is (conditionally) independent (given G) if every finite subfamily thereof is.

In fact, one can speak of a mixed collection of events, random variables, and sub-σ-fields being independent given G in the obvious manner.
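A toy check of definition 1 (not from the lecture): Z takes values in {0, 1} with an assumed law pZ, and given Z the pair (X₁, X₂) is drawn independently as Bernoulli(q[Z]); the product form is built into the construction, so this is a consistency check computed with exact finite probabilities.

```python
import numpy as np

# pZ and q are illustrative assumptions; given Z = z, X1 and X2 are independent
# Bernoulli(q[z]), so the pair is conditionally independent given sigma(Z).
pZ = np.array([0.3, 0.7])
q  = np.array([0.2, 0.9])

# joint[z, a, b] = P(Z = z, X1 = a, X2 = b)
joint = np.zeros((2, 2, 2))
for z in range(2):
    for a in range(2):
        for b in range(2):
            joint[z, a, b] = pZ[z] * (q[z] if a else 1 - q[z]) * (q[z] if b else 1 - q[z])

for z in range(2):
    pz = joint[z].sum()
    for a in range(2):
        for b in range(2):
            lhs = joint[z, a, b] / pz                                    # P(X1=a, X2=b | Z=z)
            rhs = (joint[z, a, :].sum() / pz) * (joint[z, :, b].sum() / pz)
            assert abs(lhs - rhs) < 1e-12
print("product form of conditional probabilities verified")
```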
Theorem Let F_i, i = 1, 2, 3, be sub-σ-fields of F. Then the following are equivalent.

1. F₁, F₃ are conditionally independent given F₂.

2. For every integrable F₃-measurable Y, E[Y|F₁ ∨ F₂] = E[Y|F₂].

3. For every integrable F₁-measurable Z, E[Z|F₂ ∨ F₃] = E[Z|F₂].

Proof 1. ⟹ 2.

Let A_i ∈ F_i, i = 1, 2. Then

E[I_{A₁} I_{A₂} E[Y|F₁ ∨ F₂]] = E[E[I_{A₁} I_{A₂} Y|F₁ ∨ F₂]]
= E[I_{A₁} I_{A₂} Y]
= E[E[I_{A₁} I_{A₂} Y|F₂]]
= E[I_{A₂} E[I_{A₁} Y|F₂]]
= E[I_{A₂} E[I_{A₁}|F₂] E[Y|F₂]]
= E[E[I_{A₁} I_{A₂} E[Y|F₂]|F₂]]
= E[I_{A₁} I_{A₂} E[Y|F₂]],

where the fifth equality uses 1. (Y being F₃-measurable).

It is easy to check that the sets C ∈ F₁ ∨ F₂ satisfying

∫_C E[Y|F₁ ∨ F₂] dP = ∫_C E[Y|F₂] dP

form a λ-system (Dynkin system). By the foregoing, this λ-system includes the π-system

A := {A₁ ∩ A₂ : A_i ∈ F_i, i = 1, 2}.

Therefore, by the π-λ theorem, it includes σ(A), viz., F₁ ∨ F₂.

This implies 2.

2. ⟹ 1.

Let Y₁, Y₃ be integrable random variables measurable with respect to F₁, F₃ resp., such that Y₁Y₃ is integrable. Then

E[Y₁Y₃|F₂] = E[E[Y₁Y₃|F₁ ∨ F₂]|F₂]
= E[Y₁ E[Y₃|F₁ ∨ F₂]|F₂]
= E[Y₁ E[Y₃|F₂]|F₂]
= E[Y₁|F₂] E[Y₃|F₂].

Taking Y₁ = I_{A₁}, Y₃ = I_{A₃} for A₁ ∈ F₁, A₃ ∈ F₃ shows that F₁, F₃ are conditionally independent given F₂.

The equivalence 1. ⟺ 3. follows by a symmetric argument.

Observe that only the first definition is symmetric in F₁ and F₃.

Example: {X_n} satisfies the Markov property if for all n, k ≥ 1 and Borel A₁, ···, A_k,

P(X_{n+1} ∈ A₁, ···, X_{n+k} ∈ A_k|X_m, m ≤ n) = P(X_{n+1} ∈ A₁, ···, X_{n+k} ∈ A_k|X_n)

⟺ the 'future' X_m, m > n, and the 'past' X_m, m < n, are conditionally independent given the present X_n.
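A small exact check of the Markov property for a 3-state chain (the transition matrix P and the initial law mu0 are illustrative assumptions): the conditional law of X_2 given (X_0, X_1) does not depend on X_0, i.e., the future is conditionally independent of the past given the present.

```python
import numpy as np

# Transition matrix P and initial law mu0 are illustrative assumptions.
P = np.array([[0.5, 0.3, 0.2],
              [0.1, 0.6, 0.3],
              [0.4, 0.4, 0.2]])
mu0 = np.array([0.2, 0.5, 0.3])

# joint[a, b, c] = P(X_0 = a, X_1 = b, X_2 = c) = mu0[a] P[a,b] P[b,c]
joint = mu0[:, None, None] * P[:, :, None] * P[None, :, :]

for a in range(3):
    for b in range(3):
        given_past    = joint[a, b, :] / joint[a, b, :].sum()              # P(X_2=. | X_0=a, X_1=b)
        given_present = joint[:, b, :].sum(axis=0) / joint[:, b, :].sum()  # P(X_2=. | X_1=b)
        assert np.allclose(given_past, given_present)
print("Markov property: P(X_2 | X_0, X_1) = P(X_2 | X_1)")
```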
