
Advanced Probability Theory


Thierry Lévy

2020-21

Contents
1 A reminder of elementary probability theory
  1.1 Probability spaces
  1.2 Independence
  1.3 Random variables
  1.4 Integration of random variables
  1.5 Convergence of random variables

2 Conditional expectation
  2.1 Definition and fundamental examples
  2.2 Uniqueness and existence
  2.3 Main properties

3 Martingales
  3.1 Martingales and conserved quantities
  3.2 Definition
  3.3 The case where the filtration is generated by partitions
  3.4 Stochastic integration
  3.5 Almost sure convergence
  3.6 Branching processes
  3.7 Convergence in L^1
  3.8 Stopping times
  3.9 Convergence in L^p for p > 1
  3.10 Square-integrable martingales
  3.11 Uniform integrability
  3.12 Backward martingales

4 Markov chains
  4.1 Introduction
  4.2 First definition and first properties
  4.3 Construction of Markov chains
  4.4 The Markov property
  4.5 Recurrent and transient states
  4.6 Invariant measures
  4.7 Brief summary
  4.8 The ergodic theorem

Sorbonne Université – LPSM – 4, place Jussieu – F-75005 Paris
thierry.levy@sorbonne-universite.fr

1 A reminder of elementary probability theory
1.1 Probability spaces
In order to describe an experiment from the point of view of probability theory, one considers a
probability space, that is, a triple (Ω, A, P), where
• Ω is a set: the set of all possible outcomes of the experiment,

• A is a σ-field on Ω: the set of subsets of Ω that it is possible to consider, to name, to think of,

• P is a probability measure on (Ω, A), which to each A ∈ A associates the probability P(A)
that the outcome of the experiment belongs to A.
By definition, the fact that A is a σ-field means that A is a subset of the set P(Ω) of all
subsets of Ω (in other words, the elements of A are subsets of Ω) such that
• A contains the empty set ∅,

• for all A ∈ A, the set Ω \ A also belongs to A,


• for every sequence (A_n)_{n≥0} of elements of A, the union ⋃_{n≥0} A_n also belongs to A.
The fact that P is a probability measure on (Ω, A) means that P is a function P : A → [0, 1]
which satisfies
• P(Ω) = 1,

• for every sequence (A_n)_{n≥0} of pairwise disjoint elements of A,

  P(⋃_{n≥0} A_n) = ∑_{n≥0} P(A_n).

Exercise 1.1 Let (Ω, A, P) be a probability space. Prove that P(∅) = 0.

The first fundamental example of probability space is (Ω, P(Ω), P), where Ω is a finite set,
endowed with the σ-field of all subsets of Ω, and P is the uniform measure defined, for all A ⊂ Ω,
by P(A) = |A|/|Ω|. Here, |A| denotes the number of elements of A.
Another fundamental example is ([0, 1], B[0,1] , λ), where B[0,1] is the σ-field of Borel subsets
of the real interval [0, 1], that is, the smallest σ-field on [0, 1] which contains all open subsets of
[0, 1], and λ is the Lebesgue measure on ([0, 1], B[0,1] ), the unique measure on this measurable
space which for all a, b such that 0 ≤ a ≤ b ≤ 1 satisfies λ([a, b]) = b − a.
Let us agree on the following question of terminology: some authors call a set countable if
it has the cardinality of N. For us, a countable set is a set which has the cardinality of a subset
of N. The difference is that for us, a finite set is countable.

Exercise 1.2 Let (Ω, A, P) be a probability space. An element ω ∈ Ω is called an atom of P if {ω} ∈ A and P({ω}) > 0. Can you give an upper bound on the number of atoms of P of mass at least 1/10 ? Prove that the set of atoms of P is a countable subset of Ω. Prove that {ω ∈ Ω : ω is not an atom of P} belongs to A.

Exercise 1.3 How many σ-fields does there exist on the set {1, 2, 3} ? Define, for all n ≥ 0, B_n as the number of σ-fields on a finite set with n elements. Prove that for all n ≥ 0, the equality B_{n+1} = ∑_{k=0}^{n} \binom{n}{k} B_k holds. The number B_n is called the n-th Bell number.
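For the reader who wants to see the recursion at work numerically, here is a small Python sketch, an addition to these notes; it assumes Python 3.8 or later for math.comb.

from math import comb

def bell_numbers(n_max):
    # B_0 = 1, then B_{n+1} = sum over k = 0..n of C(n, k) * B_k
    B = [1]
    for n in range(n_max):
        B.append(sum(comb(n, k) * B[k] for k in range(n + 1)))
    return B

print(bell_numbers(5))  # [1, 1, 2, 5, 15, 52]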

The most important technical property of a probability measure is the following. Consider a
sequence (An )n≥0 of elements of A which is non-decreasing in the sense that A0 ⊂ A1 ⊂ A2 ⊂ . . ..
Then

  P(⋃_{n≥0} A_n) = sup{P(A_n) : n ≥ 0} = lim_{n→∞} P(A_n).     (1)

Exercise 1.4 Why does the limit on the right-hand side exist ? Prove the two equalities.

Exercise 1.5 Prove that for any non-increasing sequence (A_n)_{n≥0} of elements of A one has P(⋂_{n≥0} A_n) = lim_{n→∞} P(A_n). Is this true on any measure space ? Is (1) true on any measure space ?

Another important property is this one: for every sequence (A_n)_{n≥0} of elements of A, the inequality

  P(⋃_{n≥0} A_n) ≤ ∑_{n≥0} P(A_n)

holds.
Yet another important result is formulated in the next exercise.

Exercise 1.6 (Borel-Cantelli lemma) Let (A_n)_{n≥0} be a sequence of elements of A. How would you describe in words the set S = ⋂_{p≥0} ⋃_{n≥p} A_n ? Prove that S belongs to A. Assume now that ∑_{n≥0} P(A_n) < ∞ and compute P(S).
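As a numerical illustration, an addition to these notes and not the proof the exercise asks for, one can take independent events A_n with the hypothetical choice P(A_n) = 1/n², whose sum is finite, and observe that almost every realisation stops hitting the events after a moderate rank.

import random

def last_occurrence(n_max=2000):
    # A_n = {U_n < 1/n^2} with independent uniform U_n; the P(A_n) have a finite sum
    return max((n for n in range(1, n_max + 1) if random.random() < 1.0 / n ** 2), default=0)

runs = [last_occurrence() for _ in range(1000)]
print(sum(1 for m in runs if m > 100) / len(runs))  # estimates P(A_n occurs for some n > 100): about 0.01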

Exercise 1.7 Prove that for any family (A_i)_{i∈I} of elements of A which are pairwise disjoint and such that ⋃_{i∈I} A_i belongs to A, the inequality

  P(⋃_{i∈I} A_i) ≥ ∑_{i∈I} P(A_i)

holds, where the right-hand side is defined by

  ∑_{i∈I} P(A_i) = sup{ ∑_{i∈F} P(A_i) : F ⊂ I, F finite }.

Note that the index set I may be uncountable. Can you think of an example where the inequality
is strict ?

Exercise 1.8 Let (Ω, A) be a measurable space. Prove that a function P : A → [0, 1] is a
probability measure if and only if the following conditions hold :

• P(∅) = 0 and P(Ω) = 1,

• for all A, B ∈ A, P(A) + P(B) = P(A ∪ B) + P(A ∩ B),

• for every sequence (A_n)_{n≥0} of elements of A,

  P(⋃_{p≥0} ⋂_{n≥p} A_n) ≤ lim inf_{n→∞} P(A_n).

Is it possible to remove the assumption that P(∅) = 0 from this list ?

1.2 Independence
Let us now turn to the concept of independence. Two events A and B are independent (with
respect to P) if P(A ∩ B) = P(A)P(B).

Exercise 1.9 What can you say about an event which is independent of itself ?

In order to define the independence of more than two events, we need to be more careful.
We say that A1 , . . . , An are independent if for all k ∈ {2, . . . , n} and all choice of 1 ≤ i1 < . . . <
ik ≤ n, the equality P(Ai1 ∩ . . . ∩ Aik ) = P(Ai1 ) . . . P(Aik ) holds.

Exercise 1.10 Find three events A, B, C such that P(A ∩ B ∩ C) = P(A)P(B)P(C) but A and
B are not independent. Can you find an example where P(A)P(B)P(C) > 0?

It turns out that the best definition is in terms of σ-fields. Let (Ω, A, P) be a probability
space. We say that n sub-σ-fields B1 , . . . , Bn of A are independent if for all choice of events
B1 ∈ B1 , . . . , Bn ∈ Bn , the equality P(B1 ∩ . . . ∩ Bn ) = P(B1 ) . . . P(Bn ) holds. We say that an
arbitrary family (Bi )i∈I of sub-σ-fields of A is independent if any finite sub-family Bi1 , . . . , Bin
is independent.

Exercise 1.11 Let C be a subset of P(Ω). Prove that there exists a unique σ-field A on Ω such
that C ⊂ A and, for all σ-field B on Ω such that C ⊂ B, one has A ⊂ B. In words, A is the
smallest σ-field on Ω which contains C. It is called the σ-field generated by C and it is denoted
by σ(C).
For example, the Borel σ-field of R is the σ-field generated by the class of open subsets of R.
Compute σ(∅) and, for all A ∈ A, σ({A}). Let Ω = A_1 ∪ . . . ∪ A_n be a partition of Ω. This means that the events A_1, . . . , A_n are non-empty and pairwise disjoint. Describe σ({A_1, . . . , A_n}). What can you say if instead of a finite partition we consider a countable partition Ω = ⋃_{n≥0} A_n ? An arbitrary partition Ω = ⋃_{i∈I} A_i ? What is the σ-field on R generated by {{t} : t ∈ R} ?

In the following exercise, you will show that our previous two definitions of independence
are consistent.

Exercise 1.12 Let A1 , . . . , An be n events. Prove that it is equivalent to say that the events
A1 , . . . , An are independent or to say that the σ-fields σ({A1 }), . . . , σ({An }) are independent.

The following example warns us against an easily made mistake.

Exercise 1.13 Let us toss a coin twice. Consider the events “the first toss gives tails”, “the second toss gives tails”, “the two tosses give the same result”. Prove that any two of these three events are independent, but that the three together are not independent.
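Since the underlying Ω has only four outcomes, the claim can be checked by brute-force enumeration; the following Python sketch, an illustration added to these notes, encodes tails as 1.

from itertools import product

omega = list(product([0, 1], repeat=2))   # two fair coin tosses, uniform probability
P = lambda E: sum(1 for w in omega if E(w)) / len(omega)

A = lambda w: w[0] == 1                   # the first toss gives tails
B = lambda w: w[1] == 1                   # the second toss gives tails
C = lambda w: w[0] == w[1]                # the two tosses give the same result

for E, F in [(A, B), (A, C), (B, C)]:
    print(P(lambda w: E(w) and F(w)) == P(E) * P(F))            # True three times
print(P(lambda w: A(w) and B(w) and C(w)), P(A) * P(B) * P(C))  # 0.25 versus 0.125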

Exercise 1.14 For n ≥ 2, set Ω = {ω = (ω_1, . . . , ω_n) ∈ {−1, 1}^n : ω_1 · · · ω_n = 1} and consider the uniform probability P on (Ω, P(Ω)). For each i ∈ {1, . . . , n}, define A_i = {ω ∈ Ω : ω_i = 1}.
Compute, for all k ∈ {1, . . . , n} and all 1 ≤ i1 < . . . < ik ≤ n, the probability P(Ai1 ∩ . . . ∩ Aik ).
What is this an example of ? (“Of a silly exercise” is not the expected answer.)

1.3 Random variables


A random variable is the mathematical notion that represents a quantity whose value depends
on the outcome of the experiment which is performed. Formally, a (real-valued) random variable
on a probability space (Ω, A, P) is a measurable function X : (Ω, A) → (R, BR ). Measurable
means by definition that for all B ∈ BR , the set X −1 (B) = {ω ∈ Ω : X(ω) ∈ B} belongs to A.
For example, the function X : [0, 1] → R defined by X(t) = t2 − 1 is a random variable on
the probability space ([0, 1], B[0,1] , λ). Random variables are in fact rarely defined by formulas
like this one, and this example is rather atypical.

Exercise 1.15 Consider a random variable X : (Ω, A) → (R, BR ). Prove that {X −1 (B) : B ∈
BR } is a sub-σ-field of A. It is called the σ-field generated by X and it is denoted by σ(X).

Exercise 1.16 Let X : Ω → R be a function. For all x ∈ R, consider the subset

{X ≤ x} = X −1 ((−∞, x]) = {ω ∈ Ω : X(ω) ≤ x}

of Ω. Prove that X is a random variable if and only if for all a ∈ R, one has {X ≤ a} ∈ A.

It is sometimes convenient to allow infinite values for random variables. To do so, one considers the extended real line R̄ = [−∞, +∞]. This set is endowed with the σ-field B_R̄ = σ(B_R ∪ {{−∞}, {+∞}}).

Exercise 1.17 Prove that a subset A of R̄ belongs to B_R̄ if and only if A ∩ R belongs to B_R. If you are familiar with the abstract notion of a topology on a set: is B_R̄ the Borel σ-field of a topology on R̄, that is, the σ-field generated by the class of all open subsets of R̄ ?

Just as for events, we say that n random variables X1 , . . . , Xn defined on the same probability
space are independent if the σ-fields σ(X1 ), . . . , σ(Xn ) are independent. More generally, we say
that a family (Xi )i∈I of random variables is independent if every finite sub-family Xi1 , . . . , Xin
is independent.

Exercise 1.18 Let A be an event. Prove that the indicator function 1A defined by 1A (ω) = 1
if ω ∈ A and 1A (ω) = 0 otherwise, is a random variable. Prove that it is equivalent to say
that the events A1 , . . . , An are independent or to say that the random variables 1A1 , . . . , 1An are
independent.

Exercise 1.19 What can you say about a random variable which is independent of itself ?

Exercise 1.20 Is it equivalent to say that a finite family of random variables are pairwise
independent and to say that they are independent ?

Let X : (Ω, A, P) → (R, B_R) be a random variable. The function P ◦ X^{−1} : B_R → [0, 1] defined by P ◦ X^{−1}(B) = P(X^{−1}(B)) is a probability measure on (R, B_R).

Exercise 1.21 Prove this assertion.

The probability measure P ◦ X −1 is called the distribution, or the law, of X. It is also often
denoted by PX . The line

P ◦ X −1 (B) = PX (B) = P({ω ∈ Ω : X(ω) ∈ B}) = P({X ∈ B}) = P(X ∈ B)

gives five common notations for the same quantity. Let us emphasize that {X ∈ B} is a notation
for {ω ∈ Ω : X(ω) ∈ B}, that is, for X −1 (B).
The distribution of a real-valued random variable is thus a probability measure on the real
line. The following rough classification of these probability measures is often used. We say that
the distribution of X is discrete, or atomic, if there exists a sequence x1 , x2 , . . . of atoms of PX ,
finite or infinite (but in any case countable, see Exercise 1.2), such that PX ({x1 }) + PX ({x2 }) +
. . . = 1. In other words, there exists a countable subset C ⊂ R such that P(X ∈ C) = 1. The
typical cases are C = N and C = Z, and we then speak of integer-valued random variables, but
there are also other interesting cases of discrete random variables. In this course, we will define
and use many random variables with values in N ∪ {∞}.
The distribution of a discrete random variable is completely described by a countable set
C = {x1 , x2 , . . .} such that P(X ∈ C) = 1 and the data, for each i ≥ 1, of the probability
pi = P(X = xi ). It is the framework of elementary probability theory, especially when C is
finite.
A random variable X such that PX has no atom, that is, such that P(X = x) = 0 for all
x ∈ R, is said to be diffuse.

Exercise 1.22 Find a diffuse real-valued random variable. Find a real-valued random variable
which is neither diffuse nor discrete.

Among diffuse random variables, there is the practically very important class of random
variables for which PX is absolutely continuous with respect to the Lebesgue measure. This
means that for all Borel subset N of R such that λ(N ) = 0, one has P(X ∈ N ) = 0. In this case,
since we are working with probability measures, which are σ-finite, the Radon-Nikodym theorem
can be applied to produce a non-negative measurable function f : (R, BR ) → (R+ , BR+ ) which
has the property that for all Borel subset B of R, the equality
  P_X(B) = P(X ∈ B) = ∫_B f dλ = ∫_R f 1_B dλ

holds. This situation is extremely convenient, because the computation of probabilities is re-
duced to the computation of ordinary integrals with respect to the Lebesgue measure.
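As an illustration, an addition to these notes, one can take for f the density x ↦ e^{−x} 1_{x≥0} of the exponential distribution of parameter 1 and compare the integral of f over an interval with the frequency observed on a simulated sample.

import math, random

a, b = 0.5, 2.0
exact = math.exp(-a) - math.exp(-b)   # integral of f(x) = exp(-x) over [a, b]

# sampling from the law f dλ by inversion: -log(U) with U uniform on (0, 1]
sample = [-math.log(1.0 - random.random()) for _ in range(200_000)]
mc = sum(1 for x in sample if a <= x <= b) / len(sample)
print(exact, mc)   # the empirical frequency approaches the integral of the density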
Our quick zoology of probability measures on R is summarised by Figure 1 below.

Exercise 1.23 Let f : R → R_+ be a measurable function. Prove that the function µ : A → [0, +∞] defined by µ(A) = ∫_A f dλ is a measure on (R, B_R). We shall write µ = f λ. Under which condition on f is µ a probability measure ?
On (R, BR ), consider the Lebesgue measure λ and the counting measure κ, which by definition
is such that κ(A) is the number of elements of A if A is finite, and κ(A) = +∞ if A is
infinite. Prove that λ is absolutely continuous with respect to κ. Prove that there does not exist
a measurable function f : R → R+ such that λ = f κ. Is this not in contradiction with the
Radon-Nikodym theorem ?

Exercise 1.24 Prove that every probability measure on (R, BR ) can be written in a unique way
as the sum of a discrete measure and a diffuse measure.

Figure 1: A diagram of the set of probability measures on R, showing the classes: discrete; diffuse, with the sub-class of measures absolutely continuous w.r.t. Lebesgue; and neither discrete nor diffuse. Do you agree with this partition ? Can you find a probability measure in each class ?

1.4 Integration of random variables


Consider a real-valued random variable X. If X admits an integral on (Ω, A, P) in the sense of
Lebesgue’s integration theory, we say that X admits an expectation, which is defined by
  E[X] = ∫_Ω X dP.

The following exercises help you to remember the key points of the theory of integration.

Exercise 1.25 (Definition of the integral) Let (Ω, A, P) be a probability space. Recall the
definition of the integral with respect to P of a non-negative measurable function X : (Ω, A) →
(R+ , BR+ ), for example as a supremum of elementary integrals of simple functions. Consider
an arbitrary measurable real-valued function X : (Ω, A) → (R, BR ). Define X + = max(X, 0)
and X^− = max(−X, 0), so that X = X^+ − X^−. Explain how to define the integral of X if at least one of the two functions X^+ and X^− has a finite integral. What is the integral of X if both
X + and X − have infinite integrals ? When is it the case that |X| has a finite integral ? (In this
case, we say that X is integrable).

Let us emphasize that a non-negative random variable always admits an expectation, possibly
equal to +∞.

Exercise 1.26 (Convergence theorems) Recall the statement of the monotone convergence
theorem. Does the theorem also hold for a decreasing sequence of functions ? Deduce Fatou’s
lemma from the monotone convergence theorem and then the dominated convergence theorem
from Fatou’s lemma. (You may find that this is not as difficult as you expect or remember. The
monotone convergence theorem is the single most important convergence theorem of the theory of
integration, and the other convergence theorems follow relatively easily from it.)

A few inequalities are of crucial importance. The simplest one, but very useful, is the Markov
inequality. It says that for a non-negative random variable X and a non-negative real a, one has

aP(X ≥ a) ≤ E[X].

Exercise 1.27 Prove the Markov inequality.
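Before (or after) proving it, the inequality can be observed on simulated data; the following sketch, an addition to these notes, uses an exponential random variable as test case.

import random

sample = [random.expovariate(1.0) for _ in range(100_000)]   # non-negative, with E[X] = 1
mean = sum(sample) / len(sample)
for a in (0.5, 1.0, 2.0, 5.0):
    lhs = a * sum(1 for x in sample if x >= a) / len(sample)  # empirical a * P(X >= a)
    print(a, round(lhs, 4), "<=", round(mean, 4))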

The Hölder inequality states that if X and Y are non-negative random variables and if
p, q > 1 are two real numbers such that 1/p + 1/q = 1, then

  E[XY] ≤ E[X^p]^{1/p} E[Y^q]^{1/q}.

The case where p = q = 2 is an instance of the Cauchy-Schwarz inequality.


One says that a random variable X admits a moment of order p ≥ 1 if the random variable
|X|p is integrable.

Exercise 1.28 Prove that if 1 ≤ p < p′ and if a random variable X admits a moment of order p′, then it also admits a moment of order p. Find an explicit bound for E[|X|^p].

Exercise 1.29 Let X be a non-negative random variable. Prove that the function from [1, ∞)
to [0, ∞] defined by p ↦ log E[X^p] is convex on [1, ∞) (one says that the function p ↦ E[X^p] is
log-convex). Check that the statement and your proof of it make sense and are correct even if
some or all moments of X are infinite.

Exercise 1.30 Let X be a non-negative random variable. With the usual agreement that inf ∅ =
+∞, define r = inf{p ∈ (0, ∞) : E[X p ] = ∞}. Prove by giving examples that r can be any
element of [0, ∞]. Is it true that r = ∞ implies that X is bounded ? When 0 < r < ∞, is it
always true, sometimes true and sometimes false, or never true that E[X r ] < ∞ ?

Another important inequality is Minkowski’s inequality. It states that for all p ≥ 1 and for
any two non-negative random variables X and Y , the inequality
  E[(X + Y)^p]^{1/p} ≤ E[X^p]^{1/p} + E[Y^p]^{1/p}

holds. In particular, the sum of two random variables which admit a moment of order p also
admits a moment of order p.
This suggests to define, for all real p ≥ 1, the set Lp (Ω, A, P) as the set of real-valued random
variables on Ω which admit a moment of order p, quotiented by the subspace formed by the
random variables X such that E[|X|p ] = 0.

Exercise 1.31 Check that L^p(Ω, A, P) is a vector space and that the function X ↦ ‖X‖_p = E[|X|^p]^{1/p} is a norm on L^p(Ω, A, P).

Exercise 1.32 Let X be a random variable. With the agreement that log 0 = −∞ and log(+∞) =
+∞, prove that
  lim_{p→∞} (1/p) log E[|X|^p] = log ‖X‖_∞,

where the norm on the right-hand side is defined by

  ‖X‖_∞ = inf{M ∈ [0, +∞] : P(|X| ≤ M) = 1}.

Is it true that lim_{p→∞} ‖X‖_p = ‖X‖_∞ ?

It is a very important fact that the normed vector space (L^p(Ω, A, P), ‖·‖_p) is complete. In particular, (L^2(Ω, A, P), ‖·‖_2) is a Hilbert space, since the norm ‖·‖_2 is induced by the scalar product ⟨X, Y⟩ = E[XY].

1.5 Convergence of random variables


Let us conclude this reminder of classical probability theory by defining three notions of conver-
gence of random variables. Let (Xn )n≥0 be a sequence of real-valued random variables defined
on (Ω, A, P). Let X be a random variable on the same probability space.
We say that the sequence (X_n)_{n≥0} converges almost surely towards X if there exists an event N ⊂ Ω such that P(N) = 0 and, for all ω ∉ N, the sequence (X_n(ω))_{n≥0} converges to X(ω).
This is the pointwise convergence outside a negligible event.
Let p ≥ 1 be fixed. We say that the sequence (X_n)_{n≥0} converges towards X in L^p if X admits a moment of order p and ‖X_n − X‖_p converges to 0. This is the norm convergence in the Banach space L^p(Ω, A, P).
We say that the sequence (Xn )n≥0 converges in probability towards X if for all ε > 0, the
sequence of probabilities P(|Xn − X| > ε) converges to 0 as n tends to infinity.
The next exercises are perhaps more substantial than some of the earlier ones.

Exercise 1.33 (Modes of convergence) Prove that a sequence which converges almost surely
or in Lp for some p ≥ 1 also converges in probability towards the same limit. Prove that a
sequence of random variables has at most one limit in any of the three modes of convergence
which we have defined.
Prove that from a sequence which converges in probability one can extract a sub-sequence
which converges almost surely.
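A classical family of examples behind the last question, added here as an illustration, is the sequence of sliding dyadic blocks on ([0, 1), B_{[0,1)}, λ): for 2^k ≤ n < 2^{k+1}, let X_n be the indicator of the interval [(n − 2^k)/2^k, (n − 2^k + 1)/2^k). Then P(X_n = 1) = 2^{−k} tends to 0, so X_n → 0 in probability, yet (X_n(t))_{n≥1} takes the value 1 once in each dyadic generation, hence converges at no point t; the sub-sequence (X_{2^k})_{k≥0} does converge to 0 at every t > 0. The sketch below makes this visible.

def X(n, t):
    # for 2^k <= n < 2^{k+1}, X_n is the indicator of [(n - 2^k)/2^k, (n - 2^k + 1)/2^k)
    k = n.bit_length() - 1
    j = n - 2 ** k
    return 1 if j / 2 ** k <= t < (j + 1) / 2 ** k else 0

t = 0.3
print([n for n in range(1, 128) if X(n, t) == 1])               # one hit per dyadic generation
print([2.0 ** -(n.bit_length() - 1) for n in (10, 100, 1000)])  # P(X_n = 1) tends to 0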

Exercise 1.34 (L^p is complete) Let (X_n)_{n≥1} be a Cauchy sequence in the normed vector space (L^p(Ω, A, P), ‖·‖_p).
Prove that there exists a sub-sequence (X_{n_k})_{k≥1} of (X_n)_{n≥1} such that for all k ≥ 1 and for all l ≥ k, one has ‖X_{n_k} − X_{n_l}‖_p^p ≤ 2^{−k}. Prove that for almost all ω ∈ Ω, the sequence (X_{n_k}(ω))_{k≥1} is a Cauchy sequence. Denote by X(ω) its limit.
Use Fatou’s lemma to prove that X belongs to L^p and that the sequence (X_{n_k})_{k≥1} converges in L^p to X. Prove that the sequence (X_n)_{n≥1} converges in L^p to X.

The fact that Lp is complete allows one to use the machinery of functional analysis. In
particular, L2 is a Hilbert space as we already mentioned. This has at least two important
consequences.
Firstly, given any closed linear subspace F of L2 (Ω, A, P), the orthogonal subspace F ⊥ is a
closed subspace which satisfies F ⊕ F ⊥ = L2 (Ω, A, P). In particular, there exists an orthogonal
projection pF whose image is exactly F .
Secondly, any continuous linear form on L^2(Ω, A, P) is of the form Y ↦ E[XY] for some
(uniquely defined) X ∈ L2 (Ω, A, P). This is called the Riesz representation theorem.

Exercise 1.35 (Radon-Nikodym theorem) Let (Ω, A) be a measurable space. Let P and Q be two probability measures on (Ω, A). Assume that Q ≪ P, that is, Q is absolutely continuous with respect to P, which means that for all A ∈ A, P(A) = 0 implies Q(A) = 0.
Define the measure µ = P + Q on (Ω, A). Note that this is not a probability measure. Prove that L^2(Ω, A, µ) is a subspace of L^1(Ω, A, Q) and that the mapping E_Q : L^2(Ω, A, µ) → R defined by E_Q[X] = ∫_Ω X dQ is a continuous linear form. Deduce that there exists δ ∈ L^2(Ω, A, µ) such that for all X ∈ L^2(Ω, A, µ), one has E_Q[X] = ∫_Ω Xδ dµ.
Prove that δ ∈ [0, 1) P-almost surely. Set D = δ/(1 − δ). Prove that for all non-negative X : Ω → R_+, one has ∫_Ω X dQ = ∫_Ω XD dP. Prove that the same equality holds for any X ∈ L^1(Ω, A, Q).

2 Conditional expectation
There are three main parts in this chapter. The first is the definition of conditional expectation.
We will make a lot of comments about it, in order to help you to relate this rather abstract
definition to what you already know. The second part is the proof that conditional expectation
exists and is unique. Uniqueness is easy, but in my opinion more instructive than many books
indicate. Existence is not easy and is commonly proved in one of two different ways. We explain one; the other can, for example, be found in Durrett’s book. The third part is a list of properties which will allow you to manipulate the concept of conditional expectation in practical computations. My own experience of mathematics is that learning a new notion involves going back and forth between applying rules without questioning them too much and watching them at work, on the one hand, and thinking about the structure of the theory and going deeper into the understanding of the rules, on the other hand.

2.1 Definition and fundamental examples


Definition 2.1 Let (Ω, A, P) be a probability space. Let B be a sub-σ-field of A, that is, a σ-field
on Ω such that B ⊂ A. Let X : (Ω, A, P) → (R, BR ) be an integrable random variable.
A conditional expectation of X given B is an integrable real-valued random variable Y :
(Ω, A, P) → (R, BR ) such that the following two conditions hold.
1. Y is measurable with respect to B.
2. For all B ∈ B, one has ∫_B X dP = ∫_B Y dP.

We shall prove that a conditional expectation of X given B always exists and is unique al-
most surely, and we shall denote it by E[X|B].

Exercise 2.1 Assume that B = {∅, Ω}. Prove that a conditional expectation of X given B must
be a constant random variable and find the unique possible value of this constant.

Let us start with an example. Suppose that B is generated by an event C ∈ A, so that B = {∅, C, C^c, Ω}. Let us assume that P(C) is neither 0 nor 1. If a conditional expectation of X given B exists, it must be of the form

  Y = α 1_C + β 1_{C^c},

because this is (up to modification on a negligible set) the form of the most general B-measurable
random variable. The second condition which Y must satisfy applied with B = C yields
Z Z
αP(C) = Y dP = X dP,
C C

so that
1
Z
α= X dP.
P(C) C
Similarly,
1
Z
β= X dP.
P(C c ) Cc

The second condition is satisfied for B = ∅ and a short computation shows that, with the
choices of α and β above, it is also satisfied for B = Ω. We have thus proved that a conditional
expectation of X given B exists and is unique. It is equal to
   
  E[X|B] = ( (1/P(C)) ∫_C X dP ) 1_C + ( (1/P(C^c)) ∫_{C^c} X dP ) 1_{C^c}.

We see the connection with elementary conditional probabilities: the value of E[X|B] on C is
nothing but the expectation of X under the conditional probability PC = P(·|C) defined on A
by
  P_C(A) = P(A ∩ C) / P(C).
Let us try to formulate this with words. We know what the expectation of X with respect
to the conditional probability P_C is, as well as its expectation with respect to the conditional probability P_{C^c}. These are two real numbers. Now the conditional expectation E[X|B] is a
random variable on Ω, a real-valued function on Ω. Consider a point ω in Ω. Think of the
σ-field B as encoding the information to which we have access. What do we know about ω ? We
know whether it belongs to C or to C c , and no more. If it belongs to C, then the conditional
expectation of X given B evaluated at ω is equal to the expectation of X with respect to the
conditional probability PC . If it belongs to C c , the same conclusion holds with C replaced by
C c.
What we have done with two sets C and C c can be generalised to an arbitrary finite or
countably infinite number of sets. This is the object of the next exercise.

Exercise 2.2 Let {C1 , C2 , . . .} be a partition of Ω into at least two (and possibly countably
infinitely many) disjoint events with positive probability. Let B = σ({C1 , C2 , . . .}) be the σ-
algebra generated by this partition. Describe B. Prove that E[X|B] exists and is unique, and is
given by the formula

  E[X|B] = ∑_{i≥1} ( (1/P(C_i)) ∫_{C_i} X dP ) 1_{C_i}.
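The formula lends itself to a direct numerical check. In the following sketch, an addition to these notes, Ω = [0, 1) with the Lebesgue measure is approximated by a uniform sample, X(ω) = ω², and the partition consists of the four quarter-intervals; the printed numbers approximate ((i + 1)³ − i³)/48, the value of E[X|B] on C_i.

import random

random.seed(0)
omega = [random.random() for _ in range(100_000)]   # stands for Omega = [0, 1) with Lebesgue measure
X = lambda w: w * w
piece = lambda w: int(4 * w)                        # C_i = [i/4, (i+1)/4) for i = 0, 1, 2, 3

for i in range(4):
    values = [X(w) for w in omega if piece(w) == i]
    print(i, sum(values) / len(values))             # empirical (1/P(C_i)) * integral of X over C_i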

This first discussion revealed the relation between the definition of E[X|B] and the elementary
notion of conditional probability. However, it is not general enough for many purposes, and
cannot be generalised to the case of an arbitrary sub-σ-field B. Indeed, many interesting σ-fields
are not generated by a partition.

Exercise 2.3 Consider on R the Borel σ-field BR and define

C = {C ∈ BR : C + 2π = C},

where for any subset C of R, we denote by C + 2π the result of the translation of C by 2π, that
is, C + 2π = {x + 2π : x ∈ C}. Prove that C is a sub-σ-field of BR . Is C generated by a partition
of R ?

Exercise 2.4 Prove that every σ-field on a countable set is generated by a partition. Is it true
that if a set Ω has the property that every σ-field on Ω is generated by a partition, then Ω is
countable ?

There is another situation where we can study by hand the existence of a conditional expec-
tation. It is a case where, among other things that we will explain in a minute, we assume that
B is generated by a real random variable, say Z : (Ω, A, P) → (R, BR ). Recall from Exercise
1.15 the definition of σ(Z), the σ-field generated by Z.
Let us make the assumption that the random vector (X, Z) has a distribution which admits
a density f(X,Z) with respect to the Lebesgue measure on R2 . This means that for all Borel
subset D of R2 , we have
  P((X, Z) ∈ D) = ∫_D f_{(X,Z)}(u, v) du dv.

In this situation, we know that the distribution of Z admits a density with respect to the
Lebesgue measure on R which is the function fZ given (for almost every v ∈ R) by
  f_Z(v) = ∫_R f_{(X,Z)}(u, v) du.

Now let us look for E[X|σ(Z)], which is also denoted by E[X|Z]. It is useful to understand what
it means for a random variable to be measurable with respect to σ(Z).

Exercise 2.5 In this exercise, we work with sets and maps between sets, without σ-fields and
without any notion of measurability. Let Ω be a set. Let Y, Z : Ω → R be two maps. Prove that
Y is constant on every non-empty subset of Ω of the form Z^{−1}({t}), t ∈ R, if and only if there
exists a map h : R → R such that Y = h ◦ Z.

Proposition 2.2 Let Y, Z be real-valued random variables on a probability space (Ω, A, P). The
random variable Y is measurable with respect to σ(Z) if and only if there exists a Borel measur-
able function h : R → R such that Y = h(Z) = h ◦ Z.

The proof of this proposition is not essential for our purposes. I give it here for the sake of
completeness and pleasure.

Proof. If Y = h(Z) for some measurable function h, then for all B ∈ BR , we have Y −1 (B) =
(h ◦ Z)−1 (B) = Z −1 (h−1 (B)) ∈ σ(Z), so that Y is measurable with respect to σ(Z).
The most interesting part is the converse. Assume that Y is measurable with respect to
σ(Z). Let us also assume, for a start, that Y is non-negative.
We shall use the binary expansion of real numbers:

14.5 (decimal) = 1110.1 (binary).

Let us observe two things. Firstly, in the binary expansion of a real x, the digit just to the left
of the dot, the 0 in the example above, is the parity of the integer part of x. Let us write it
⌊x⌋ mod 2. Secondly, for all n ∈ Z, the n-th digit of the binary expansion of x, that is, the digit which is the coefficient of 2^n, is the digit located just to the left of the dot in the expansion of 2^{−n} x. Putting these remarks together, we find that

  x = ∑_{n∈Z} 2^{−n} (⌊2^n x⌋ mod 2).

We can write the same formula for a non-negative random variable

  Y = ∑_{n∈Z} 2^{−n} (⌊2^n Y⌋ mod 2) = lim_{N→+∞} ∑_{n=−N}^{N} 2^{−n} (⌊2^n Y⌋ mod 2).

We now use the assumption. Since Y is σ(Z)-measurable, the set C_n = {⌊2^n Y⌋ mod 2 = 1} belongs to σ(Z) for all n ∈ Z. There thus exists a Borel subset B_n of R such that C_n = Z^{−1}(B_n). Define now a function h on R by setting

  h = ∑_{n∈Z} 2^{−n} 1_{B_n}.

Then for all ω ∈ Ω,

  h(Z(ω)) = ∑_{n∈Z} 2^{−n} 1_{B_n}(Z(ω)) = ∑_{n∈Z} 2^{−n} 1_{C_n}(ω) = ∑_{n∈Z} 2^{−n} (⌊2^n Y(ω)⌋ mod 2) = Y(ω).

Observe that h may take the value +∞, but on the set {h = +∞} we can give it the value 0
without changing the last computation, because Y takes real values.
Finally, we have expressed Y explicitly as a function of Z. The case where Y can take neg-
ative values is treated as usual by decomposing Y into its non-negative and non-positive parts.

Exercise 2.6 Is the σ-field C defined in Exercise 2.3 of the form σ(Z) for some measurable
function Z : R → R ?

Let us come back to our original problem of computing E[X|Z]. Thanks to the proposition
which we have just proved, we know that we are looking for a random variable of the form
h(Z) for some function h. This function must satisfy the second defining property of conditional
expectation for all B ∈ σ(Z). By definition of σ(Z), each B ∈ σ(Z) is of the form B = Z −1 (E)
for some Borel subset E of R. The condition to be satisfied is
  ∫_B h(Z) dP = ∫_B X dP,

which, since B = Z^{−1}(E), can be written as

  ∫_E h(v) f_Z(v) dv = ∫_{R×E} u f_{(X,Z)}(u, v) du dv.

The last line was deduced from the previous one by a short computation that you might want
to carefully check. Now the right-hand side can be written as
  ∫_E ( ∫_R u f_{(X,Z)}(u, v) du ) dv.

Provided the denominator is not zero, we see that the function h given by
  h(v) = ( ∫_R u f_{(X,Z)}(u, v) du ) / f_Z(v) = ( ∫_R u f_{(X,Z)}(u, v) du ) / ( ∫_R f_{(X,Z)}(u, v) du )

satisfies the desired relation. Hence, h(Z) is a conditional expectation of X given σ(Z).
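To see the formula at work, here is a numerical sketch, an addition to these notes, with the hypothetical joint density f_{(X,Z)}(u, v) = e^{−v} 1_{0≤u≤v}, for which the formula above gives h(v) = v/2.

import numpy as np

f = lambda u, v: np.exp(-v) * ((u >= 0) & (u <= v))   # joint density, zero outside 0 <= u <= v

u = np.linspace(0.0, 50.0, 500_001)                   # integration grid for the variable u

def h(v):
    w = f(u, v)                      # the slice u -> f_{(X,Z)}(u, v)
    return (u * w).sum() / w.sum()   # ratio of Riemann sums; the mesh size cancels

for v in (1.0, 2.0, 4.0):
    print(v, h(v))                   # close to v/2, that is, E[X|Z] = Z/2 for this density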
To what extent is h unique ? The integral of hfZ with respect to the Lebesgue measure over
each Borel subset of R is prescribed. This means that the integral of h with respect to PZ over
each Borel subset of R is prescribed. If h0 has the same integral over each Borel subset, then h
and h0 coincide PZ -almost surely. Let us state this as an exercise.

Exercise 2.7 On a probability space (Ω, A, P), let X and X′ be two random variables such that for all A ∈ A one has ∫_A X dP = ∫_A X′ dP. Prove that P(X = X′) = 1.

2.2 Uniqueness and existence


The result of Exercise 2.7 is elementary and we may not pay great attention to it, but in the
context of conditional expectation, it is very important. Let us see why. Random variables are
equivalence classes of measurable functions and therefore cannot be evaluated at a point of Ω.
Indeed, two measurable functions which are equal almost everywhere may differ at any point
which is not an atom of the underlying probability measure. In contrast, they can be integrated
on any event, and this result tells us that this is all there is: a random variable is not meant
to be evaluated, but it is certainly made to be integrated over events, and it is characterised by
the collection of its integrals over all events.
This is important precisely because the conditional expectation E[X|B] is defined not by
its values at the points of Ω, but by the values of its integrals over all events of B. It is an
instance where a random variable is truly seen as an equivalence class of measurable functions,
with no preferred choice of a specific measurable function in this class. This is in a sense the
most intrinsic possible definition of a random variable.
These remarks hopefully make the structure of the definition of the conditional expectation
E[X|B] more natural. The first condition says that it is a B-measurable random variable, and
the second condition specifies its integral over every single event of B.
By now, it should have become almost obvious that the conditional expectation is unique.

Lemma 2.3 Recall the notation of Definition 2.1. If Y and Y 0 are conditional expectations of
X given B, then Y = Y 0 almost surely.

Proof. Indeed, Y and Y 0 are both measurable with respect to B and they have the same integral
over every event belonging to B.

We have settled the question of the uniqueness, but not yet that of the existence of the
conditional expectation. In order to do this, we will introduce a new and quite different point of
view on the conditional expectation. This point of view is usually explained in terms of “the best
prediction about X which can be made using the information available in B”. This is certainly
correct, but there is something about this phrasing that I have always found strange, and I will
try to explain it in a convincing way.
Let us start with a comment on the usual expectation. Apart from its definition, we can
understand the expectation of a random variable through the strong law of large numbers. This
law tells us that the arithmetic mean of a sufficiently large sample of independent copies of our
random variable will be arbitrarily close to its expectation. However, this can hardly be called a
prediction, let alone a best prediction. In which sense does the expectation of a random variable
constitute a prediction about this variable ? If we think about a die that we roll, there is little

hope that it will actually give the value 3.5. A similar comment could be made about a uniform
random variable on the interval [0, 1], and about many other random variables. The words best
prediction should not be heard in the naive sense of likeliest outcome.
There is however a precise and simple sense in which the expectation of a random variable
constitutes a best prediction of its outcome.

Exercise 2.8 Let X be a square-integrable random variable. Prove that for every real number c,
the following equality holds:

E[(X − c)2 ] = Var(X) + (c − E[X])2 .

Deduce that there is, among all constant random variables, one which is closer than any
other to X in the normed vector space (L^2(Ω, A, P), ‖·‖_2).

If we think of E[X] not as a number but as a constant random variable (and this is very much
in the spirit of our study of E[X|B] in the case where B = {∅, C, C c , Ω}), then it is, among all
the constant random variables, the one which is closest to X in the L2 norm. Recalling Exercise
2.1, and introducing B0 = {∅, Ω}, we can reformulate this as follows: E[X|B0 ] is, among all
random variables which are measurable with respect to B0 , the one which is closest to X in the
L2 distance.
The key to the existence of the conditional expectation is that this is true not just for the
σ-field B0 , but for any other sub-σ-field of A.

Theorem 2.4 Let X : (Ω, A, P) → (R, B_R) be a square-integrable random variable. Let B be a sub-σ-field of A.
There exists, among all square-integrable random variables which are measurable with respect
to B, one which is closer to X in the L2 distance than any other. This random variable is
moreover a conditional expectation of X given B.

Proof. The space of square-integrable random variables which are measurable with respect to B
is the subspace L2 (Ω, B, P) of L2 (Ω, A, P). Since the L2 distance on L2 (Ω, B, P) is the distance
induced by the ambient space L2 (Ω, A, P) and since L2 (Ω, B, P) endowed with this distance is
complete, it is a closed subspace of L2 (Ω, A, P).
Let p : L2 (Ω, A, P) → L2 (Ω, B, P) be the orthogonal projection, which exists precisely because
L2 (Ω, B, P) is a closed linear subspace. Then from the general theory of the geometry of Hilbert
spaces, we know that p(X) is the element of L2 (Ω, B, P) which is closest to X in the L2 distance.
Let us now prove that p(X) is a conditional expectation of X given B. To start with, p(X)
is measurable with respect to B. Then, let us consider an event B ∈ B. We have
  ∫_B p(X) dP − ∫_B X dP = ∫_Ω (p(X) − X) 1_B dP = ⟨p(X) − X, 1_B⟩_{L^2(Ω,A,P)}.

But on one hand, 1B is an element of L2 (Ω, B, P). On the other hand, since p is an orthogonal
projection, p(X) − X is orthogonal to L2 (Ω, B, P). Hence, the scalar product of the right-hand
side is equal to 0, so that

  ∫_B p(X) dP = ∫_B X dP,
proving that p(X) is the conditional expectation of X given B.
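The geometric mechanism of this proof can be observed numerically. In the following sketch, an addition to these notes, B is generated by the partition {[0, 1/2), [1/2, 1)} of Ω = [0, 1), the projection replaces X by its mean on each piece, and the scalar product ⟨p(X) − X, 1_B⟩ vanishes up to sampling error.

import random

random.seed(2)
omega = [random.random() for _ in range(50_000)]   # uniform sample standing for ([0, 1), Lebesgue)
X = lambda w: w ** 2
piece = lambda w: 0 if w < 0.5 else 1              # the two atoms of the partition

# p(X): the mean of X on each piece of the partition
sums, counts = [0.0, 0.0], [0, 0]
for w in omega:
    i = piece(w); sums[i] += X(w); counts[i] += 1
means = [s / c for s, c in zip(sums, counts)]
pX = lambda w: means[piece(w)]

# empirical scalar product <p(X) - X, 1_B> for B = [0, 1/2): essentially zero
print(sum(pX(w) - X(w) for w in omega if piece(w) == 0) / len(omega))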

We have proved the existence for all square-integrable random variables. It remains to go
from square-integrable to integrable random variables (recall that L2 (Ω, A, P) ⊂ L1 (Ω, A, P)).

Theorem 2.5 Let X : (Ω, A, P) → (R, B_R) be an integrable random variable. Let B be a sub-σ-field of A. There exists a conditional expectation of X given B.

We first need to prove a fundamental property of the conditional expectation, which is its
positivity. It is more easily done after proving its linearity. We do both in the following lemma.

Lemma 2.6 Let X1 , X2 : (Ω, A, P) → (R, BR ) be two random variables. Let B be a sub-σ-field
of A. Assume that X1 and X2 admit conditional expectations with respect to B.
1. For all reals a, b, the random variable aX1 + bX2 admits a conditional expectation given
B and E[aX1 + bX2 |B] = aE[X1 |B] + bE[X2 |B].
2. If 0 ≤ X1 ≤ X2 almost surely, then 0 ≤ E[X1 |B] ≤ E[X2 |B] almost surely.

Proof. 1. The random variable aE[X1 |B] + bE[X2 |B] is integrable and B-measurable. The
linearity of the integral implies immediately that it is a conditional expectation of aX1 + bX2
given B.
2. Thanks to the linearity, it suffices to prove that 0 ≤ X almost surely implies 0 ≤ E[X|B]
almost surely. Consider an integer k ≥ 1 and the event B_k = {E[X|B] ≤ −1/k}. It belongs to B. Hence, we have the inequalities

  0 ≤ ∫_{B_k} X dP = ∫_{B_k} E[X|B] dP ≤ −(1/k) P(B_k).

This implies P(B_k) = 0. Hence, the event B = {E[X|B] < 0} satisfies

  P(B) = P(⋃_{k≥1} B_k) ≤ ∑_{k≥1} P(B_k) = 0.

Finally, E[X|B] is non-negative almost surely.

Let us now turn to the proof of the theorem.

Proof. Let us write X = X + − X − , where X + = max(X, 0) and X − = (−X)+ . Let us treat the
case of X + first.
For each n ≥ 1, let us consider the random variable min(X^+, n). It is non-negative and bounded by n. It is in particular square-integrable, so that it admits a conditional expectation given B.
For each m ≤ n, we have 0 ≤ min(X^+, m) ≤ min(X^+, n), so that, by the lemma, 0 ≤ E[min(X^+, m)|B] ≤ E[min(X^+, n)|B]. Hence, the sequence of random variables (E[min(X^+, n)|B])_{n≥1} is non-decreasing. In particular, it has a limit Y_+ towards which it converges almost surely.
On the other hand, the sequence (min(X^+, n))_{n≥1} is also non-decreasing and converges to X^+. Hence, for all B ∈ B, the monotone convergence theorem applied to the equality
  ∫_B E[min(X^+, n)|B] dP = ∫_B min(X^+, n) dP

yields

  ∫_B Y_+ dP = ∫_B X^+ dP.

In particular, since Y_+ is non-negative, ∫_Ω Y_+ dP = ∫_Ω X^+ dP < +∞ and Y_+ is integrable.

Since Y+ is the almost sure limit of a sequence of B-measurable random variables, it is also
B-measurable. Finally, the equality above proves that it is the conditional expectation of X +
given B.
The same argument proves that X − admits a conditional expectation given B, which we
denote by Y− .
Finally, we claim that Y = Y+ − Y− is the conditional expectation of X given B. Indeed, Y
is integrable, B-measurable, and for all B ∈ B, we have
  ∫_B Y dP = ∫_B (Y_+ − Y_−) dP = ∫_B Y_+ dP − ∫_B Y_− dP = ∫_B X^+ dP − ∫_B X^− dP = ∫_B X dP.

This concludes the proof.

Let us summarise our approach. We analysed the definition of the conditional expectation
in a few simple cases and this led us to an understanding of the uniqueness of the conditional
expectation. We then proved its existence by using a strategy of proof which is not uncommon
when one is dealing with L1 spaces: we first proved the existence for L2 random variables, using
the rich geometry of Hilbert spaces, and then extended our result to L1 by an argument of
approximation.

Exercise 2.9 Is it true that among all constant random variables, E[X] is closer to X than any
other in the L1 distance? Could we have characterised E[X|B] for an integrable random variable
by a property similar to the one we used for square-integrable random variables?

Exercise 2.10 Let X be a random variable that is uniformly distributed on the finite set {−2, −1, 1, 10}.
Compute the expectation of X and draw the graph of the function c ↦ E[|X − c|].
For a general integrable random variable, describe the set of reals c that minimise E[|X − c|].
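To draw the graph, one can simply evaluate the function on a grid; here is a small sketch, an addition to these notes. Note how the set of minimisers compares with E[X].

support = [-2.0, -1.0, 1.0, 10.0]                               # X uniform on this set
g = lambda c: sum(abs(x - c) for x in support) / len(support)   # g(c) = E[|X - c|]

for c in [-3.0 + 0.5 * k for k in range(17)]:
    print(round(c, 1), g(c))   # scan the values and look where the minimum is attained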

Exercise 2.11 Let X be a bounded random variable. Prove that for all p ∈ (1, ∞), the function
c ↦ ‖X − c‖_{L^p} attains its minimum for a unique value of c, which we denote by e(X; p). What can you say about the function p ↦ e(X; p)? Is it continuous? Does it have a limit as p tends
to infinity? As p tends to 1? Here is a part of the graph of this function for a random variable
X whose distribution is a slightly perturbed version of the distribution of the previous exercise:
(Graph of e(X; p) for p ranging from 2 to 6, with values between −2 and 0.)

2.3 Main properties
We are now ready for the third part of this section, where we gather the useful properties of the
conditional expectation.

Theorem 2.7 Let (Ω, A, P) be a probability space. Let X, Y, X_1, X_2, . . . be integrable random variables on this space. Let a, b be real numbers. Let B, C be sub-σ-fields of A.
1. E[aX + bY |B] = aE[X|B] + bE[Y |B].
2. If X is B-measurable and XY is integrable, then E[XY |B] = XE[Y |B].
3. If X is B-measurable, then E[X|B] = X.
4. If X is independent of B, then E[X|B] = E[X] almost surely.
5. If C ⊂ B, then E[E[X|B]|C] = E[X|C].
6. E[E[X|B]] = E[X].
7. If X ≥ 0 almost surely, then E[X|B] ≥ 0 almost surely.
8. If (X_n)_{n≥1} is a non-decreasing sequence of non-negative random variables, converging to an integrable random variable X, then E[X_n|B] → E[X|B] almost surely as n → ∞.
9. If (X_n)_{n≥1} is a sequence of non-negative random variables, then E[lim inf X_n|B] ≤ lim inf E[X_n|B].
10. If (X_n)_{n≥1} is a sequence of non-negative random variables converging almost surely to a random variable X, and if there exists an integrable random variable Y such that for all n ≥ 1 one has |X_n| ≤ Y, then E[X_n|B] → E[X|B] almost surely as n → ∞.

Proving this theorem yourself is one of the best exercises that you can do at this stage. None of the first ten properties is difficult to prove. Among these, the least simple is the second, in that it is the least directly deduced from the definition. You should prove it after you have proved the eighth point, starting with the case where X is an indicator function, then a simple function, and finally using an approximation argument.
For the eleventh property, use the fact that a convex function is the supremum of all the affine functions which lie below it:

  φ(x) = sup{ ax + b : a, b ∈ R and ay + b ≤ φ(y) for all y ∈ R }.

Here are some more exercises about conditional expectation. When nothing is specified,
X, Y, . . . are integrable random variables on a probability space.

Exercise 2.12 Let X be a standard Gaussian random variable. Compute E[X|X 2 ].

Exercise 2.13 Assume that E[X|B] is a constant random variable. Prove that this constant is
E[X]. Is X necessarily independent of B ?

Exercise 2.14 Let a be a real number. Let X be a random variable whose distribution admits
the density

  f_X(x) = (x + a)² e^{−x²/2} / ((1 + a²) √(2π))

with respect to the Lebesgue measure (check that fX is indeed a density function). Try to under-
stand before computing it that the sign of E[X|X 2 ] is almost surely constant, and to guess how
it depends on a. Then compute E[X|X 2 ].
For every real number t, compute

  (t f_X(t) − t f_X(−t)) / (f_X(t) + f_X(−t)).

Do you see a relation between the two computations ?

Exercise 2.15 Which relations can you find between E[X|Y²] and E[X | |Y|] ?

Exercise 2.16 Last year a student told me that E[X|X 2 ] = 0 if and only if X and −X have
the same distribution. Do you think he was right ?

Exercise 2.17 Choose p ∈ (0, 1). Let X and Y be integer-valued random variables such that
for all k, l ∈ N, one has
  P(X = k, Y = l) = (p/e)^k (k^l / l!) (1 − p).
Compute E[Y |X].

Exercise 2.18 Let (X, Y ) be a two-dimensional random vector whose distribution admits the
density
  f_{(X,Y)}(s, t) = e^{−s} 1_{[0,s]}(t)
with respect to the Lebesgue measure on R2 . Compute E[X|Y ] and E[Y |X].

Exercise 2.19 (Do this exercise only if you know the definition and main properties of Gaussian
random vectors.) Let (X, Y ) be a two-dimensional Gaussian random vector. Prove that there
exists a real number a such that X − aY is independent of Y . Prove that there exists a real
number b such that E[X|Y ] = aY + b.

Exercise 2.20 Let N be a bounded integer-valued random variable. Let (Xn )n≥1 be a sequence
of identically distributed integrable random variables. Assume that N, X1 , X2 , . . . are indepen-
dent. Set
  S = ∑_{n=1}^{N} X_n.

Prove that S is integrable and compute its expectation.

Exercise 2.21 Consider the probability space ([0, 1), B[0,1) , λ), where λ is the Lebesgue measure.
Choose n ≥ 0 and consider the σ-field
  
  F_n = σ( [k/2^n, (k+1)/2^n) : k ∈ {0, . . . , 2^n − 1} ).

Define the random variable X : [0, 1) → R by setting X(t) = t for all t ∈ [0, 1). Compute
E[X|Fn ].

Exercise 2.22 Consider a random vector (X, Y, Z) whose distribution admits the density

  f_{(X,Y,Z)}(x, y, z) = c e^{−z−2x} 1_{0≤x≤y≤z}

with respect to the Lebesgue measure on R3 , where c is a real constant. Determine the value of
c, then compute E[X|Y, Z] and E[X|Y ].

Exercise 2.23 Let X and Y be two square-integrable random variables such that E[X|Y ] = Y
and E[Y |X] = X. Prove that X = Y .

Exercise 2.24 Consider a probability space (Ω, A, P). Let B be a sub-σ-field of A. Propose
a definition of the conditional expectation E[X|B] for every random variable X : (Ω, A) →
[0, +∞], without any assumption of integrability. Investigate the existence and uniqueness of
this conditional expectation, as well as its properties, in the spirit of Theorem 2.7.

Exercise 2.25 Let (Ω, A, P) be a probability space. Let B be a sub-σ-field of A. Check that
L∞ (Ω, B, P) acts by multiplication on L1 (Ω, A, P), in the sense that for all X ∈ L1 (Ω, A, P) and
all Z ∈ L∞ (Ω, B, P), the product ZX belongs to L1 (Ω, A, P). Check that L∞ (Ω, B, P) acts also
by multiplication on L1 (Ω, B, P).
Find all linear maps
E : L1 (Ω, A, P) → L1 (Ω, B, P)
which preserve the expectation and respect the action by multiplication of L∞ (Ω, B, P) in the
sense that for all X ∈ L1 (Ω, A, P) and all Z ∈ L∞ (Ω, B, P),

E(ZX) = ZE(X).

In particular, check that E is automatically continuous.


Find all continuous linear maps Ẽ : L1 (Ω, A, P) → L1 (Ω, B, P) which respect the action by
multiplication of L∞ (Ω, B, P).

3 Martingales
3.1 Martingales and conserved quantities
In the study of a deterministic dynamical system, a mechanical system for example, it is very
important to identify quantities which stay constant through the evolution of the system. For
example, let us think of the motion of a pendulum.

Figure 2: A pendulum of length ℓ and mass m making an angle θ with the vertical. Newton’s law reads mℓθ̈ = −mg sin θ, that is, θ̈ = −(g/ℓ) sin θ, where g = 9.81 m s⁻² is the gravitational field at the surface of the Earth.

The differential equation θ̈ = −(g/ℓ) sin θ is not easy to solve explicitly. Let us introduce the function E = (1/2) mℓθ̇² − mg cos θ. It is immediately checked that Ė = 0, and a very detailed study
of the motion of the pendulum can be done from this single observation.
Figure 3: The curve (θ, θ̇) corresponding to any possible motion of the pendulum is one of these curves
(the right and left vertical sides of the picture, corresponding to θ = π and θ = −π, should be identified).
Which are the four radically different possible kinds of motion ?

In the evolution of a random dynamical system, it is unlikely that any quantity remains
constant. However, certain quantities remain constant on average. For example, consider the
simple random walk on Z. That is, let (Xn )n≥1 be a sequence of i.i.d. random variables such
that P(X_1 = 1) = P(X_1 = −1) = 1/2. Set S_0 = 0 and, for all n ≥ 1, S_n = X_1 + · · · + X_n.

Figure 4: A sample path of the random dynamical system (Sn )n≥0 , also known as the simple random
walk.
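A sample path like the one in Figure 4 can be produced in a few lines; here is a sketch, an addition to these notes.

import random

def walk(n_steps):
    # simple random walk: S_0 = 0 and steps of +1 or -1 with probability 1/2 each
    path, S = [0], 0
    for _ in range(n_steps):
        S += random.choice((-1, 1))
        path.append(S)
    return path

print(walk(20))   # the values S_0, ..., S_20 of one realisation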

It is intuitively clear, and not difficult to prove, that the only functions f : Z → R such
that f (Sn ) does not depend on n are the constant functions. There is nothing which, in this
situation, plays exactly the same role as the mechanical energy in the case of the pendulum.

Exercise 3.1 Find a non-constant function f : N × Z → R such that f (n, Sn ) is constant.

However, you know that Sn itself is constant on average. More precisely, E[Sn ] = 0 does not
depend on n. This observation is interesting but it is quite crude and, with some thought, it
can be turned into a subtle and very fruitful one.
Indeed, look at the picture above. We know the value of Sn for n ∈ {0, . . . , 20}. What can
we say about S_21 ? Since S_20 = 4, it has probability 1/2 to be equal to 3, and probability 1/2 to be equal to 5. In particular, its expected value is (1/2)(3 + 5) = 4. Here, the words “expected value” do not of course refer
to the expectation, for we know that E[S21 ] = 0. They refer to the conditional expectation
of S_21 given the information to which we have access. This information consists of the values
of S0 (ω), . . . , S20 (ω) for the particular ω which corresponds to the particular realisation of the
infinite experiment of which we see the first twenty steps.
Hence, the sentence “the expected value of S21 is 4” refers to the value at ω of the conditional
expectation of S21 given the σ-field generated by S0 , . . . , S20 . It means that

E[S21 |σ(S0 , . . . , S20 )](ω) = 4 = S20 (ω).

How do we evaluate a random variable, which is only defined almost everywhere, at the point
ω? Well, the information to which we have access, contained in the picture above, describes
exactly one event of the σ-field σ(S0 , . . . , S20 ), namely the event

{S0 = 0, S1 = −1, S2 = −2, S3 = −3, S4 = −2, S5 = −1, . . . , S19 = 3, S20 = 4}.

We know that ω belongs to this event. We also know that E[S21 |σ(S0 , . . . , S20 )] is constant on
this event, equal to 4, which happens to be also the value of S20 .
Finally, the random dynamical system (Sn )n≥0 has the following property: if for some n ≥
0 you observe S0 , . . . , Sn , then the values which you observe define an event, on which the

(conditional) average of Sn+1 equals the value of Sn . This is beautifully expressed by the
formula
∀n ≥ 0, E[Sn+1 |S0 , . . . , Sn ] = Sn .
This is the main defining property of a martingale.
Martingales play for random dynamical systems the role played by conserved quantities for
deterministic dynamical systems. For example, knowing that (Sn )n≥0 is a martingale will allow
us to answer the question : what is the probability that the first time at which Sn hits the value
a is smaller than the first time at which Sn hits the value b ? You can try to find the answer
for a = 2 and b = −1 and see that it is in general not easy to determine this probability.

Exercise 3.2 Check that E[S_n²] = n. What do you think about the process (S_n² − n)_{n≥0} ?
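The identity E[S_n²] = n can be tested by simulation before being proved; a Monte Carlo sketch, added to these notes:

import random

def S(n):
    # position of the simple random walk at time n
    return sum(random.choice((-1, 1)) for _ in range(n))

for n in (5, 20, 100):
    estimate = sum(S(n) ** 2 for _ in range(20_000)) / 20_000
    print(n, estimate)   # empirical value of E[S_n^2], close to n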

In the classical case, it is sometimes not possible, or difficult, to identify a conserved quantity.
This is for example the case for the pendulum if we take friction into account. But then, one can
prove that Ė ≤ 0 and this is already a very interesting piece of information. Similarly, martingales have
variants called sub- and super-martingales, which correspond to Lyapunov functions of classical
dynamical systems. Let us now turn to the systematic study of this class of processes.

3.2 Definition
Definition 3.1 Let (Ω, A) be a measurable space. A filtration on (Ω, A) is a non-decreasing
sequence of sub-σ-fields of A, that is, a sequence of sub-σ-fields of A such that the inclusions

F0 ⊂ F1 ⊂ . . . ⊂ Fn ⊂ Fn+1 ⊂ . . . ⊂ A

hold. A probability space (Ω, A, P) endowed with a filtration (Fn )n≥0 is called a filtered probability
space.

Exercise 3.3 Recall the notation of Exercise 2.21. Prove that (Fn )n≥0 is a filtration. Is ∪n≥0 Fn a σ-field on [0, 1) ?

The ambient σ-field A can always be replaced by σ(∪n≥0 Fn ), and this is why we will
sometimes not mention it in our definitions.
An important example of filtration is that generated by a sequence of random variables.
Given a sequence X = (Xn )n≥0 of random variables (and we will also call such a sequence a
stochastic process), we can form a filtration (FnX )n≥0 by setting, for all n ≥ 0,

FnX = σ(X0 , . . . , Xn ).

Definition 3.2 A stochastic process X = (Xn )n≥0 defined on a filtered probability space (Ω, A, (Fn )n≥0 , P)
is said to be adapted if for all n ≥ 0, the random variable Xn is measurable with respect to Fn .

With the current notation, you can check that X is adapted if and only if FnX ⊂ Fn for all
n ≥ 0.

Definition 3.3 (Martingales) Let (Ω, A, (Fn )n≥0 , P) be a filtered probability space. Let X =
(Xn )n≥0 be a stochastic process defined on this probability space. One calls X a martingale if
the following conditions are satisfied.
1. X is adapted to (Fn )n≥0 .
2. For all n ≥ 0, Xn is integrable.
3. For all n ≥ 0,
E[Xn+1 |Fn ] = Xn . (MG)
One calls X a supermartingale if it satisfies the same conditions, but with the equality (MG)
replaced by the inequality
E[Xn+1 |Fn ] ≤ Xn .
One calls X a submartingale if it satisfies the same conditions, with (MG) replaced by the other
inequality
E[Xn+1 |Fn ] ≥ Xn .

The simple random walk is a fundamental example of martingale.

Exercise 3.4 Consider the simple random walk S = (Sn )n≥0 as defined in the previous section.
Recall in particular the i.i.d. sequence (Xn )n≥1 . Prove that the filtrations (FnX )n≥0 and (FnS )n≥0
are equal. Prove that S is a martingale with respect to (FnS )n≥0 . What do you think about (Sn²)n≥1 ?

Exercise 3.5 Let X = (Xn )n≥0 be a martingale with respect to the filtration (Fn )n≥0 . Compute,
for all n, m ≥ 0, the conditional expectation E[Xn |Fm ]. Prove that E[Xn ] = E[X0 ] for all n ≥ 0.

Exercise 3.6 Prove that if X is a submartingale (resp. a supermartingale), then the sequence
(E[Xn ])n≥0 is non-decreasing (resp. non-increasing). Mind the misleading vocabulary !

Exercise 3.7 Let X = (Xn )n≥0 be a stochastic process. Prove that if X is a martingale with
respect to a filtration (Fn )n≥0 , then X is a martingale with respect to (FnX )n≥0 . Is the converse
true ? What about supermartingales and submartingales ?

Another important example is the following. Consider an integrable random variable Z and
a filtration (Fn )n≥0 and define, for all n ≥ 0, Xn = E[Z|Fn ]. Then, for all n ≥ 0, Xn is
Fn -measurable and integrable, by definition of the conditional expectation. Thanks to the
property labelled 5 in Theorem 2.7, we have

E[Xn+1 |Fn ] = E[E[Z|Fn+1 ]|Fn ] = E[Z|Fn ] = Xn .

Hence, (Xn )n≥0 is a martingale with respect to the filtration (Fn )n≥0 .
A martingale of the form (E[Z|Fn ])n≥0 for some integrable random variable Z is sometimes
called a closed martingale.

Exercise 3.8 Is the simple random walk S a closed martingale ?

Exercise 3.9 Let again S be the simple random walk. Define T = (Tn )n≥0 by setting Tn = Sn
if n ≤ 8888 and Tn = S8888 if n > 8888. Is T a martingale with respect to the filtration (FnS )n≥0
? Is it a closed martingale ?

Proposition 3.4 Let (Xn )n≥0 and (Yn )n≥0 be supermartingales. Let a be a real number.
1. If a is positive, then (aXn )n≥0 is a supermartingale.
2. If a is negative, then (aXn )n≥0 is a submartingale.
3. (Xn + Yn )n≥0 is a supermartingale.
The proposition obtained by interchanging the words supermartingale and submartingale is also
true. In particular, any linear combination of martingales is a martingale.

The proof of this proposition is a verification. For the last sentence, observe that a stochastic
process is a martingale if and only if it is both a supermartingale and a submartingale.
As a consequence of the conditional Jensen inequality (property 11 in Proposition 2.7), we
have the following result.

Proposition 3.5 Let (Xn )n≥0 be a martingale. Let φ : R → R be a convex function such that
for all n ≥ 0, φ(Xn ) is integrable. Then (φ(Xn ))n≥0 is a submartingale.
The proposition also holds if X is a submartingale and φ is non-decreasing.

Proof. Let us treat the case where X is a submartingale. The process (φ(Xn ))n≥0 is adapted
and integrable. For all n ≥ 0, we have

E[φ(Xn+1 )|Fn ] ≥ φ(E[Xn+1 |Fn ]) ≥ φ(Xn ),

where the first inequality is Jensen inequality and the second a consequence of the fact that φ
is non-decreasing. This proves that (φ(Xn ))n≥0 is a submartingale.
The case where X is a martingale is similar and easier.

3.3 The case where the filtration is generated by partitions


Let us consider a probability space (Ω, A, P). On this probability space, let us consider a filtration
(Fn )n≥0 and let us make the assumption that each σ-field Fn is generated by a finite partition
of Ω. In this case, the partition which generates Fn is uniquely defined (as the set of elements of
Fn \{∅} which are minimal for inclusion) and we will denote it by Πn . The inclusion Fn ⊂ Fn+1
is equivalent to the fact that each element of Πn is the union of some elements of Πn+1 (one
says that Πn+1 is a finer partition of Ω than Πn ). For the sake of simplicity, let us also assume
that F0 = {∅, Ω}, that is, Π0 = {Ω}.

Exercise 3.10 Check rigorously the two assertions made in the last paragraph.
There is a natural genealogical structure on ∪n≥0 Πn . Indeed, any set B ∈ Πn+1 is included
in exactly one set A ∈ Πn , and we could say that A is the father of B. This relation of fatherhood
gives rise to a genealogical tree whose nodes are the elements of ∪n≥0 Πn . In order to follow
this discussion more easily, it may be useful to have a look at the picture on the next page.
Let us introduce a notation for the elements of ∪n≥0 Πn which takes this genealogical structure
into account. The idea is that each element of Πn will be labelled by a word of integers of length
n which contains all its genealogy.
To start with, F0 is generated by Π0 = {Ω}. There is a unique word of length 0, it is the
empty word, and we will accordingly set A∅ = Ω.

Now, F1 is generated by a partition Π1 = {A1 , . . . , An∅ } of Ω. The number n∅ happens to
be the total number of blocks of Π1 , but it is also more importantly the number of blocks of Π1
contained in A∅ . It is the number of children of A∅ .
Then, the inductive rule is that the label of an element of Πn+1 is a word of n + 1 integers
whose n first letters are the label of the father of that element (the father belongs to Πn ) and
whose last letter is the rank of this child in the progeny of the father. As you can guess and
see on the picture, there is some freedom in the attribution of the last letter, which amounts
to a choice of a total order on the set of elements of Πn+1 which are contained in a particular
element of Πn .

Let us now consider a process X = (Xn )n≥0 which is a martingale with respect to the
filtration (Fn )n≥0 . For each n ≥ 0, Xn is measurable with respect to Fn and this is equivalent
to the fact that Xn is constant on each block Ai1 ...in of Πn . Let us denote by xi1 ...in the value
which Xn takes on Ai1 ...in .
The martingale property can be easily formulated in terms of the numbers x_{i1 ...in }. Indeed,
for each n ≥ 0, the conditional expectation E[Xn+1 |Fn ] is constant on each block Ai1 ...in of Πn ,
taking the value ∑_{j=1}^{k} x_{i1 ...in j} P(Ai1 ...in j )/P(Ai1 ...in ), where k is the number of children of Ai1 ...in .
The martingale property is thus equivalent to the equality

x_{i1 ...in } = ∑_{j=1}^{k} x_{i1 ...in j} P(Ai1 ...in j )/P(Ai1 ...in ) = ( ∑_{j=1}^{k} x_{i1 ...in j} P(Ai1 ...in j ) ) / ( ∑_{j=1}^{k} P(Ai1 ...in j ) ).

To summarise this discussion:


• we are considering the case of a filtration in which each σ-field is generated by a finite
partition of Ω,
• there is a genealogical structure on the set of all blocks of the partitions which generate
the σ-fields of our filtration,
• a stochastic process adapted to this filtration is specified by its value on each of these
blocks, that is, by a real number attached to each node of the genealogical tree,
• each node of the tree has a natural weight, which is the proportion of the probability of its
father that it represents,
• a stochastic process adapted to this filtration is a martingale if and only if the value attached
to each node is equal to the weighted average of the values attached to each of its children.
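
The last point of this summary is easy to test on a toy example. Here is a minimal sketch (in Python; the tree, its probabilities and the values of X2 are numbers of our own choosing) which fills in the values of X1 and X0 forced by the martingale property.

prob = {(): 1.0, (1,): 0.5, (2,): 0.5, (1, 1): 0.25, (1, 2): 0.25, (2, 1): 0.5}
value = {(1, 1): 4.0, (1, 2): -1.0, (2, 1): 2.0}   # values of X_2 on the blocks of Pi_2

def weighted_average(node, children):
    # The value a martingale must take at 'node': the average of the children's values,
    # each child weighted by the proportion of its father's probability it represents.
    return sum(value[c] * prob[c] for c in children) / prob[node]

value[(1,)] = weighted_average((1,), [(1, 1), (1, 2)])   # (4 - 1) * 0.25 / 0.5 = 1.5
value[(2,)] = weighted_average((2,), [(2, 1)])           # 2.0
value[()] = weighted_average((), [(1,), (2,)])           # 1.75 = E[X_0]
print(value)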

Exercise 3.11 Recall the notation of Exercise 2.21. Draw the first levels of the genealogical
tree of the filtration (Fn )n≥0 . Instead of labelling the sets by the integers 1, 2, 3, . . . use the
integers starting from 0, so that Π1 = {A0 , A1 , . . .}, Π2 = {A00 , A01 , . . . , A10 , A11 , . . .}. Can you
characterise the real numbers which belong to the set A0100110 ? To the set Ai1 ...in ?
Can you explain how martingales with respect to the filtration (Fn )n≥0 can be put in one-
to-one correspondence with certain ways of writing real numbers in the circles of the following
picture? What about submartingales and supermartingales?

Exercise 3.12 Let X be a non-negative martingale. By this we mean that for all n ≥ 0, Xn ≥ 0
almost surely. Prove that for all n and m such that 0 ≤ n ≤ m, one has Xm = 0 on the event
{Xn = 0}, that is, Xm 1{Xn =0} = 0.


Figure 5: This illustrates the tree structure which underlies a filtration in which every σ-field is generated
by a finite partition of Ω. A martingale is a process X = (Xn )n≥0 such that on each block of the partition
which generates Fn , Xn is constant, and the average of Xn+1 is equal to the value of Xn .

One usually phrases this property by saying that when a non-negative martingale hits zero, it
stays equal to zero forever. This formulation implicitly considers the index n as a time variable,
as is customary for stochastic processes.
Is this property also true of submartingales? Of supermartingales?

Figure 6: An infinite binary tree.

3.4 Stochastic integration


Recall the definition of the simple random walk (Sn )n≥0 . You can think of it as representing the
(random) unfolding of a game. At each step of the game, a coin is tossed, and according to the
result, the player wins or loses 1. There is a variant of the game where the player is allowed to
bet any real number x before the coin is tossed. Then his or her gain or loss will be x, depending
on the coin. Of course, if the game is to be fair, the player is only allowed to bet before the coin
is tossed. This means that the amount which he or she bets for the n-th run of the game can
be decided only on the basis of the results of the n − 1 first runs. This notion of fair betting is
encoded by the following definition.

Definition 3.6 Let (Ω, A, (Fn )n≥0 , P) be a filtered probability space. A stochastic process H =
(Hn )n≥1 is said to be previsible if for each n ≥ 1, the random variable Hn is measurable with
respect to Fn−1 .

Exercise 3.13 Check that a previsible process is adapted. What can you say about a previsible
martingale ?

Let (Xn )n≥0 be a martingale, which we think of as representing the successive states of fortune
of a player who bets 1 at each turn of the game. Let us consider another player who bets in
a fair way, according to the values of a previsible process (Hn )n≥1 . What is his fortune at time
n?
Let us be more precise: X0 is the initial fortune of the player and Xn his fortune after the
n-th turn of the game. The gain of this first player during the n-th turn is thus Xn − Xn−1 .
The second player bets Hn just before this n-th turn, and gets Hn (Xn − Xn−1 ). Let us suppose
that the second player starts with nothing. His fortune after the n-th turn is H1 (X1 − X0 ) +
. . . + Hn (Xn − Xn−1 ).

Definition 3.7 (Discrete stochastic integral) Let (Ω, A, (Fn )n≥0 , P) be a filtered probability
space. Let X = (Xn )n≥0 be a martingale (or a sub- or super-martingale) and H = (Hn )n≥1 a
previsible process, both with respect to the filtration (Fn )n≥0 . The stochastic integral of H with
respect to X is the process H • X = ((H • X)n )n≥0 defined by (H • X)0 = 0 and, for all n ≥ 1,
n
X
(H • X)n = Hk (Xk − Xk−1 ).
k=1
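
Computationally, the stochastic integral is nothing more than a cumulative sum of bet-weighted increments, as the following sketch (ours, in Python) makes explicit.

def stochastic_integral(H, X):
    # (H . X)_n = sum_{k=1}^{n} H_k (X_k - X_{k-1}),
    # with H = [H_1, ..., H_N] and X = [X_0, X_1, ..., X_N].
    out = [0.0]
    for k in range(1, len(X)):
        out.append(out[-1] + H[k - 1] * (X[k] - X[k - 1]))
    return out

print(stochastic_integral([1, 1, 0], [0, 1, 2, 1]))   # [0.0, 1.0, 2.0, 2.0]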
The process H • X is also sometimes denoted by ∫ H dX. The strength of this construction
lies in the following result, which says that if the correct assumptions of integrability are satisfied,
then H • X is still a martingale when X is a martingale.

Theorem 3.8 Let (Ω, A, (Fn )n≥0 , P) be a filtered probability space. Let X = (Xn )n≥0 be a
martingale or a sub- or super-martingale and H = (Hn )n≥1 be a previsible process.
1. If X is a martingale and each random variable Hn is bounded, then H • X is a martingale.
2. If X is a supermartingale and each random variable Hn is bounded and non-negative, then
H • X is a supermartingale.

3. In the two previous assertions, the assumption “each random variable Hn is bounded” can
be replaced by “all random variables Xn and Hn are square-integrable”.

Proof. For all n ≥ 1, the random variable (H • X)n is a function of the random variables
X0 , . . . , Xn , H1 , . . . , Hn which all are Fn -measurable. Thus, (H • X)n is Fn -measurable and
H • X is adapted.
The product of a bounded random variable with an integrable random variable is an inte-
grable random variable. Also, by Hölder inequality, the product of two square-integrable random
variables is an integrable random variable. Thus, in all cases considered in the statement, the
random variable (H • X)n is integrable for all n ≥ 0.
Let us now check the main relation. Let us assume that X is a martingale. Then, for all
n ≥ 0, we must compute the conditional expectation

E[(H • X)n+1 |Fn ] = E[H1 (X1 − X0 ) + . . . + Hn (Xn − Xn−1 ) + Hn+1 (Xn+1 − Xn )|Fn ]
= (H • X)n + E[Hn+1 (Xn+1 − Xn )|Fn ]   (the first n terms are Fn -measurable)
= (H • X)n + Hn+1 E[Xn+1 − Xn |Fn ]   (Hn+1 is Fn -measurable; Thm. 2.7, 2)
= (H • X)n , since E[Xn+1 − Xn |Fn ] = 0 because X is a martingale.

If X is a supermartingale and H is non-negative, then only the last line changes. Indeed, in this
case, E[Xn+1 − Xn |Fn ] ≥ 0, hence Hn+1 E[Xn+1 − Xn |Fn ] ≥ 0 and finally E[(H • X)n+1 |Fn ] ≥
(H • X)n .

Exercise 3.14 Choose an integer N ≥ 1 and define Hn to be the constant random variable
equal to 1 if n ≤ N , and to 0 if n > N . Describe H • X in terms of X.

Exercise 3.15 (The clever gambler) A clever gambler plays the fair coin tossing game de-
scribed at the beginning of this section. He thinks: “The coin is random, hence it doesn’t like
to repeat itself. Let me bet 1 on heads for the next turn whenever the coin gives tails, and con-
versely”. Define rigourously the previsible process H which corresponds to his strategy (you will
have to make a choice for H1 ). Describe as precisely as you can the distribution of the stochastic
process H • S. Is our player quite as clever as he thinks?
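
Before solving the exercise, you may enjoy watching the strategy at work. The sketch below (in Python; taking H1 = 0 is our choice, one of several possible) simulates the gambler's winnings (H • S)n .

import random

def clever_gambler_winnings(n_steps=20, seed=1):
    # Simulate the coins X_k and the winnings (H . S)_n for the strategy
    # H_n = -X_{n-1}, that is, bet 1 against the result of the previous coin.
    rng = random.Random(seed)
    winnings, w, last_coin = [0], 0, None
    for _ in range(n_steps):
        coin = rng.choice((1, -1))
        bet = -last_coin if last_coin is not None else 0   # our choice H_1 = 0
        w += bet * coin
        winnings.append(w)
        last_coin = coin
    return winnings

print(clever_gambler_winnings())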

Exercise 3.16 Recall once again the notation of Exercise 2.21. For all n ≥ 0, write Xn =
E[X|Fn ]. Prove that for every martingale M with respect to the filtration (Fn )n≥0 such that
M0 = 0, there exists a previsible process H such that M = H • X. It may be useful to have in
mind the point of view of Exercise 3.11.

3.5 Almost sure convergence


As a beautiful application of the construction which we made in the previous section, let us
prove of one of the fundamental theorems on martingales. In the next statement, we use the
notation Xn+ = max(Xn , 0) for the positive part of a random variable Xn .

Theorem 3.9 (Almost sure convergence) Let (Ω, A, (Fn )n≥0 , P) be a filtered probability space.
Let X = (Xn )n≥0 be a submartingale such that sup{E[Xn+ ] : n ≥ 0} < ∞. Then there exists an
integrable random variable X∞ such that the sequence (Xn )n≥0 converges almost surely to X∞ .

This theorem is stated for submartingales. There is also a version for supermartingales,
which one reads by replacing X by −X: the assumption must be replaced by sup{E[Xn− ] : n ≥
0} < ∞. For martingales, which are both submartingales and supermartingales, any of the two
assumptions implies the conclusion.
The following corollary may seem slightly less general than Theorem 3.9, but it is in fact
equivalent to it, and perhaps more easily remembered.

Corollary 3.10 A supermartingale (resp. submartingale, resp. martingale) which is bounded
in L1 converges almost surely.

The following exercise explains why the assumptions of Theorem 3.9 and Corollary 3.10 are
equivalent. For the sake of practising the definitions, it is formulated in terms of supermartin-
gales.

Exercise 3.17 Let X be a supermartingale. Prove that the sequence (E[Xn− ])n≥0 is non-decreasing.
Prove that sup{E[Xn− ] : n ≥ 0} is finite if and only if X is bounded in L1 . Can we replace Xn−
by Xn+ in this statement ?

Exercise 3.18 Let X be a (sub-, super-) martingale bounded in L1 . Prove that the almost sure
limit of X is an integrable random variable.

A useful special case of Theorem 3.9 is the following. It is easily remembered by analogy
with the fact that a non-increasing sequence of non-negative real numbers is convergent.

Corollary 3.11 A non-negative supermartingale converges almost surely towards a limit which
is an integrable random variable.

Proof. For all n ≥ 0, we have
E[|Xn |] = E[Xn ] ≤ E[X0 ],
so that a non-negative supermartingale is bounded in L1 . The result now follows from Theorem
3.9.

The main tool for proving the almost sure convergence is the notion of upcrossing of an
interval. Let us first define it for a deterministic sequence. Let x = (xn )n≥0 be a sequence of
real numbers. Let a < b be two real numbers. We call upcrossing of [a, b] by the sequence x any
couple (s, t) of integers such that

s < t and xs < a < b < xt .

We say that two upcrossings (s1 , t1 ) and (s2 , t2 ) occur successively if t1 < s2 or t2 < s1 .

Exercise 3.19 Prove that the sequence x does not converge to an element of [−∞, +∞] if and
only if there exist two rational numbers a and b such that a < b and such that there are infinitely
many successive upcrossings of [a, b] by x.

For all N ≥ 1, let us denote by uN (x; a, b) the number of successive upcrossings of [a, b] by
x before time N . More rigorously, uN (x; a, b) is the largest integer k such that there exist 2k
integers s1 , t1 , . . . , sk , tk such that

0 ≤ s1 < t1 < . . . < sk < tk ≤ N and for all i ∈ {1, . . . , k}, xsi < a < b < xti .

The sequence (uN (x; a, b))N ≥1 is non-decreasing and we call u∞ (x; a, b) its limit, which belongs
to N ∪ {+∞}.
The statement of the last exercise is that x converges to an element of [−∞, +∞] if and only
if u∞ (x; a, b) is finite for all rational a and b.
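
Counting uN (x; a, b) on a finite sequence is a simple greedy scan, very much in the spirit of the gambler in the proof of Doob's upcrossing lemma below: start watching when the sequence drops strictly below a, and record a completed upcrossing when it then rises strictly above b. Here is a sketch (ours, in Python).

def upcrossings(x, a, b):
    # Number of successive upcrossings of [a, b] completed by the finite sequence x,
    # that is, u_N(x; a, b) with N = len(x) - 1; we assume a < b.
    count, below = 0, False
    for value in x:
        if not below and value < a:
            below = True     # a new upcrossing may begin here
        elif below and value > b:
            count += 1       # ... and it is completed here
            below = False
    return count

print(upcrossings([0, -1, 2, -2, 3], a=0, b=1))   # 2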

Let us now turn to the probabilistic case. Let X be a stochastic process. We define for all
N ≥ 1 and for all real numbers a < b the number of upcrossings UN (X; a, b) of [a, b] by X before
time N , which is now an integer-valued random variable. We define U∞ (X; a, b) as the almost
sure limit of the non-decreasing sequence of random variables (UN (X; a, b))N ≥1 .
In order to prove Theorem 3.9, we are going to prove that its assumptions imply that
U∞ (X; a, b) < +∞ for all rational a < b almost surely. One needs to pay attention to the order
of the quantifiers “for all rational a < b” and “almost surely”. Fortunately, there are countably
many intervals with rational endpoints, and the two statements

∀a < b ∈ Q, P(U∞ (X; a, b) < ∞) = 1 and P(∀a < b ∈ Q, U∞ (X; a, b) < ∞) = 1

are equivalent.

Exercise 3.20 Check this equivalence.

The main ingredient of the proof is an estimation of this number of upcrossings, which we
will deduce from Theorem 3.8 by an appropriate choice of the previsible process H.

Proposition 3.12 (Doob’s upcrossing lemma) Let X be a supermartingale. Let a, b be two
real numbers such that a < b. For all n ≥ 1, one has
E[Un (X; a, b)] ≤ (1/(b − a)) E[(Xn − a)− ].
Proof. Let us think in the same terms as before Definition 3.7. Here is the line of reasoning of
a very clever gambler. “I know that the game is fair, so that whenever Xn has reached a value
which is too low, it will have to compensate and increase until it takes again higher values. Let
me choose two levels a and b, which I call respectively the low level and the high level. I will
wait until the first time X reaches a level below a. Then I will start betting 1 at each turn,
and stop as soon as X reaches a level above b. And then I will repeat this strategy. I am quite
confident that I will make a lot of money that way.”
Well, we know by Theorem 3.8 that our gambler is, on average, not going to make money,
but rather to lose money. However he is going to help us proving Doob’s upcrossing lemma,
which is certainly more important.
Let us define the previsible process H which corresponds to his strategy. We define it
inductively. First set
H1 = 1{X0 <a} ,
because we start betting at the first turn if and only if X0 is below the low level a. Then suppose
H1 , . . . , Hn−1 have been defined, for some n ≥ 2. If Hn−1 = 1, then X has recently reached a
level below a and we are currently betting until it exceeds b. Thus, Hn = 1 unless X has just
reached such a level, that is, unless Xn−1 > b. If on the contrary Hn−1 = 0, then we are waiting
until X passes below a. We thus have Hn = 0, unless Xn−1 < a. Altogether, we set

Hn = 1{Hn−1 =1} 1{Xn−1 ≤b} + 1{Hn−1 =0} 1{Xn−1 <a} .

From this definition it follows that H1 is σ(X0 )-measurable, hence F0 -measurable, and for all
n ≥ 2, Hn is σ(Hn−1 , Xn−1 )-measurable. By induction, it follows that Hn is Fn−1 -measurable.
Finally, H = (Hn )n≥1 is previsible, and since it is bounded and non-negative by definition,
the second assertion of Theorem 3.8 applies, allowing us to claim that H • X is a supermartingale.
Incidentally, this is why our very clever gambler is losing money on average : his winnings
are given by H • X, but E[(H • X)n ] is a non-increasing sequence.
But the main point is the following inequality : for all n ≥ 1, we have

(H • X)n ≥ (b − a)Un (X; a, b) − (Xn − a)− . (2)

The first term is what motivates our gambler: each upcrossing of [a, b] by X returns him at least
b − a. The second term is what he forgot to take into account, and what restores the equity.
Sometimes, X will reach a level below a, and it will reach much lower values before it rises again
to a level higher than b. In the mean time, our gambler will have reached the limit of his own
solvability.
For those of you who want to read a more rigorous proof of (2), let us introduce the sequence of times at which the
gambler changes his bet. More precisely, let us set

S1 = inf{n ≥ 0 : Hn+1 = 1} = inf{n ≥ 0 : Xn < a}.

Then, set
T1 = inf{n ≥ S1 : Hn+1 = 0} = inf{n ≥ S1 : Xn > b}.


Figure 7: A sample path of X and the corresponding sample path of H • X.

The first upcrossing of [a, b] by X is thus the random interval [S1 , T1 ]. During this interval, the gambler gains XT1 − XS1 .
Then, suppose S1 , T1 , . . . , Sk , Tk defined. We set

Sk+1 = inf{n ≥ Tk : Hn+1 = 1} = inf{n ≥ Tk : Xn < a},
Tk+1 = inf{n ≥ Sk+1 : Hn+1 = 0} = inf{n ≥ Sk+1 : Xn > b}.

Any of these random times can be infinite. If this occurs, the next times are not defined.
Now let us consider an integer n. There is, among the random times which we have defined, one which is largest among
those which are smaller than n. There are two cases, depending on whether this last time is an S or a T .
1. The largest time is Sk+1 for some k. In this case, Un (X; a, b) = k. Moreover,

(H • X)n = (XT1 − XS1 ) + . . . + (XTk − XSk ) + (Xn − XSk+1 ).

The first k terms are larger than b − a because XTl > b and XSl < a for all l ∈ {1, . . . , k}. The last term is larger than
Xn − a, hence than −(Xn − a)− . Indeed, recall that for all real number x, we use the notation x− = max(−x, 0), so that
the inequality x ≥ −x− holds.
Thus, we have
(H • X)n ≥ k(b − a) − (Xn − a)− = (b − a)Un (X; a, b) − (Xn − a)−
and (2) is proved.
2. The largest time is Tk for some k. In this case,

(H • X)n = (XT1 − XS1 ) + . . . + (XTk − XSk ) ≥ (b − a)Un (X; a, b)

and (2) is also proved.


3. There is actually a third case, where S1 > n. In this case, both sides of (2) are equal to zero.

Now let us take the expectation on both sides of (2). We find

(b − a)E[Un (X; a, b)] − E[(Xn − a)− ] ≤ E[(H • X)n ] ≤ E[(H • X)0 ] = 0,

the last inequality being due to the fact that H • X is a supermartingale. This concludes the
proof of the proposition.

Proof. (Theorem 3.9) Let X be a supermartingale such that sup{E[Xn− ] : n ≥ 0} < ∞. Let us
prove that it converges almost surely.

Consider a, b ∈ Q such that a < b. For all n ≥ 1, Doob’s lemma reads
E[Un (X; a, b)] ≤ (1/(b − a)) E[(Xn − a)− ].
On one hand, for all real number x, we have (x − a)− ≤ x− + |a|. On the other hand, the non-
decreasing sequence (Un (X; a, b))n≥1 converges to U∞ (X; a, b), so that the monotone convergence
theorem entails
E[U∞ (X; a, b)] ≤ (1/(b − a)) ( sup{E[Xn− ] : n ≥ 0} + |a| ) < ∞.
Hence,
P(U∞ (X; a, b) = ∞) = 0.
We have proved that for all rationals a < b, there are almost surely only finitely many upcrossings
of [a, b] by X. By the equivalence checked in Exercise 3.20, it follows that

P(∀a < b ∈ Q, U∞ (X; a, b) < ∞) = 1.

This in turn, by Exercise 3.19, implies that (Xn )n≥0 converges almost surely. Let us denote by
X∞ its limit. There remains to prove that X∞ is integrable.
Observe that for all n ≥ 0, we have E[X0 ] ≥ E[Xn ] = E[Xn+ ] − E[Xn− ], so that

E[Xn+ ] ≤ E[X0 ] + E[Xn− ] ≤ E[X0 ] + sup{E[Xn− ] : n ≥ 0}.

Hence, E[Xn+ ] is also bounded, and we deduce that E[|Xn |] is bounded, by E[X0 ] + 2 sup{E[Xn− ] :
n ≥ 0}.
The sequence (Xn )n≥0 is bounded in L1 and converges almost surely to X∞ . Hence, by
Fatou’s lemma,

E[|X∞ |] = E[lim inf |Xn |] ≤ lim inf E[|Xn |] ≤ sup{E[|Xn |] : n ≥ 0} < ∞.

In other words, X∞ is integrable and the proof of the theorem is finished.

Let us summarise the strategy of the proof.


1. We expressed the convergence of a sequence of real numbers in terms of upcrossings of
intervals.
2. By integrating a well-chosen previsible process against our supermartingale, we derived a
bound on the number of its upcrossings of a fixed interval.
3. Thanks to the countability of Q, we deduced that a supermartingale which is bounded in
L1 has almost surely only finitely many upcrossings of every interval with rational endpoints.
4. We concluded that a supermartingale bounded in L1 converges almost surely, and Fatou’s
lemma allowed us to prove that the limit is integrable.

Exercise 3.21 Let X be a supermartingale bounded in L1 . For all real numbers a, b such that
a < b, define
Na,b = {U∞ (X; a, b) = ∞}.
Compute P(∪a<b Na,b ), where the union is taken over all real numbers a < b.

Exercise 3.22 Consider a sequence (Yn )n≥1 of independent random variables such that for all
n ≥ 1, one has
P(Yn = 0) = 1 − 1/n², P(Yn = eⁿ ) = 1/(2n²) and P(Yn = −eⁿ ) = 1/(2n²).
Set X0 = 0 and, for all n ≥ 1, Xn = Y1 + . . . + Yn .
Prove that (Xn )n≥0 is a martingale in its natural filtration. Prove that (Xn )n≥0 converges
almost surely to a random variable X∞ . Prove that X∞ is not integrable (Hint: consider the
events Jn = {Y1 = e, Yn = eⁿ and Yk = 0 for all k ∉ {1, n}}).
Exercise 3.23 Let S = (Sn )n≥0 be the simple random walk. Define H = (Hn )n≥1 inductively
by setting H1 = 1 and for all n ≥ 2,
Hn = 1{Hn−1 =1} 1{Sn−1 ≠ −1} .
Explain with words what the process H • S is. Prove that H • S is a martingale. Using the
following fact (which you can admit if you don’t know it):
P(inf{k > 0 : Sk = −1} < ∞) = 1,
prove that H • S converges almost surely to a random variable C. What is this random variable
? Is there convergence in L1 of the sequence ((H • S)n )n≥1 towards C ?

3.6 Branching processes


Branching processes are a very important class of stochastic processes, and a rich source of
examples of martingales. They constitute a simple model for the evolution of the size of a
population, according to the following simple scheme.
The population starts with a certain number ` ≥ 1 of individuals, who constitute the 0-th
generation. Then, at each step of the process, each individual of the existing population gives
birth to a certain number of children, and disappears. The number of children of each individual
is random, it has the same distribution for each individual, and the numbers of children of the
individuals living at a certain generation are independent. Let us give a more formal definition
of the process (Xn )n≥0 , where Xn is the size of the population at time n.
Let (Zn,k )n,k≥0 be a family of independent identically distributed integer-valued random
variables. Let us exclude the cases where these random variables are 0 almost surely, and 1
almost surely. Let us define inductively X0 = ` and, for each n ≥ 0,
Xn+1 = ∑_{k=1}^{Xn} Zn,k .

Let us define a filtration by setting F0 = {∅, Ω} and, for all n ≥ 1,


Fn = σ(Zm,k : m < n, k ≥ 1).
Let m denote the common expectation of the random variables Zn,k , that is, the average number
of children of an individual of our population. It is a positive real number or +∞, for the case
P(Zn,k = 0) = 1 is excluded. Let us assume that m < +∞.
The following result is extremely helpful in the study of branching processes, and it is an
instance of the general idea that it is useful in the study of a random dynamical system to
identify that certain quantities are martingales.

Proposition 3.13 The process (m−n Xn )n≥0 is a martingale with respect to the filtration (Fn )n≥0 .

Proof. By definition, X0 is F0 -measurable. Then, for all n ≥ 1, Xn is a function of Xn−1 and


{Zn−1,k : k ≥ 1}. By induction, and by definition of Fn , it follows that Xn is Fn -measurable.
Thus, (Xn )n≥0 is adapted, and so is (m−n Xn )n≥0 .
Let us now prove by induction on n that Xn is integrable. For n = 0, this is true. Let us
assume that Xn is integrable. Then, since Xn+1 is non-negative, we may consider its expectation,
and we must prove that it is not equal to ∞. We have

E[Xn+1 ] = E[ ∑_{x=0}^{∞} ∑_{k=1}^{x} Zn,k 1{Xn =x} ].

Since Xn is Fn -measurable and the random variables {Zn,k : k ≥ 1} are independent of Fn , we


find
E[Xn+1 ] = ∑_{x=0}^{∞} ∑_{k=1}^{x} E[Zn,k ] P(Xn = x) = ∑_{x=0}^{∞} x m P(Xn = x) = mE[Xn ].

Thus, E[Xn ] = mⁿ ` is finite.


Let us finally compute E[Xn+1 |Fn ]. We have
E[Xn+1 |Fn ] = E[ ∑_{k=1}^{Xn} Zn,k | Fn ] = ∑_{k=1}^{Xn} E[Zn,k ] = mXn .

It follows from this result that (m−n Xn )n≥0 is a martingale.

Exercise 3.24 Check the passage from the first to the second line in the last computation of
this proof.
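
To see this martingale at work, you can simulate the process. In the sketch below (in Python; the offspring law, 0 children with probability 1/4 and 2 children with probability 3/4, is our own choice, giving m = 3/2), the printed sequence m^{-n} Xn should stabilise as n grows.

import random

def branching_path(offspring, n_gen=15, initial=1, seed=0):
    # One trajectory (X_0, ..., X_{n_gen}); 'offspring' returns one copy of Z_{n,k}.
    rng = random.Random(seed)
    path = [initial]
    for _ in range(n_gen):
        path.append(sum(offspring(rng) for _ in range(path[-1])))
    return path

# P(Z = 0) = 1/4, P(Z = 2) = 3/4, so m = 3/2:
path = branching_path(lambda rng: rng.choice((0, 2, 2, 2)))
print([x / 1.5 ** n for n, x in enumerate(path)])   # the martingale m^{-n} X_n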

We may apply Corollary 3.11 to the non-negative martingale (m−n Xn )n≥0 , which thus con-
verges almost surely towards an integrable random variable Y . There are three fairly different
cases, depending on the value of m.

• The case m < 1. In this case, the almost sure convergence of (m−n Xn )n≥0 to Y implies
that almost surely, Xn = mⁿ (m⁻ⁿ Xn ) converges to 0 × Y = 0. Since (Xn )n≥0 is an
integer-valued process, this forces it to be almost surely stationary with limiting value 0.

Indeed, a sequence of integers is convergent if and only if it is stationary¹. Let us write
our conclusion in symbols:

P(∃N ≥ 1, ∀n ≥ N, Xn = 0) = 1.

In words, our conclusion is that when m < 1, the population gets extinct with probability
1. This case is called the subcritical case.

• The case m = 1. In this case, (Xn )n≥1 itself converges almost surely to Y . Because X is
integer-valued, this convergence implies that it is almost surely stationary. In symbols:

P(∃p ≥ 0, ∃N ≥ 1, ∀n ≥ N, Xn = p) = 1.

For each integer p ≥ 0, we can consider the event

Sp = {∃N ≥ 1, ∀n ≥ N, Xn = p}

on which X is stationary at p. We claim that for all p ≥ 1, the event Sp has probability
0. Indeed, choose p ≥ 1 and rewrite the event Sp as
Sp = {∃N ≥ 1, ∀n ≥ N, Zn,1 + . . . + Zn,p = p} = ∪_{N ≥1} ∩_{n≥N} {Zn,1 + . . . + Zn,p = p}.

The events ({Zn,1 + . . . + Zn,p = p})n≥1 are independent and they all have the same
probability. Since the case P(Zn,k = 1) = 1 is excluded, and since E[Zn,k ] = 1, we must
have P(Zn,k = 0) > 0 (check this assertion !). So, in particular,

P(Zn,1 + . . . + Zn,p = p) ≤ 1 − P(Zn,1 + . . . + Zn,p = 0) = 1 − P(Zn,k = 0)^p < 1.

The version of Borel-Cantelli’s lemma for independent events now implies that P(Sp ) = 0.
Since for all p ≥ 1, the event on which X is stationary with limiting value p has probability
0, it must be that X converges to 0 almost surely. In this case too, the population gets
extinct with probability 1. This case is called the critical case.

• The case m > 1. In this supercritical case, which we will not treat in detail, one can show
that there is a positive probability that the population survives forever.

Exercise 3.25 Prove that in the case where m = 1, the martingale (Xn )n≥0 does not converge
in L1 to its almost sure limit. How do you intuitively understand this fact ?
Exercise 3.26 Prove that in a subcritical branching process, the total population ∑n≥0 Xn
not only finite almost surely, but has a finite expectation. Compute this expectation. What about
the critical case ?
¹ Let me spell this out, since it is so important in the present problem. Let a = (an )n≥0 be a sequence
of elements of Z. If a is stationary, it is of course convergent. Let us prove the converse. Assume that a is
convergent. Then it is in particular Cauchy. Applying the definition of a Cauchy sequence with ε = 1/2, we find
that there exists an integer N such that for all n ≥ N , we have |an − aN | < 1/2. Since an and aN are both integers,
they must be equal. Hence, a is stationary after rank N .

Figure 8: Twenty sample paths of X when ` = 1 and P(Zn,k = 0) = P(Zn,k = 2) = 1/2.

Exercise 3.27 This exercise is a preparation for the next. Prove that every random variable Y
with values in N satisfies the inequality

P(Y ≥ 1) ≥ E[Y ]² / E[Y ²].

Exercise 3.28 Prove that for a subcritical branching process, there exists a positive real α > 0
such that for all n ≥ 0,
P(Xn ≥ 1) ≤ e^(−αn) .
In words, the probability of survival of the population up to time n decays exponentially fast.
On the other hand, consider a critical branching process such that the random variables Zn,k
are square-integrable. Set σ 2 = Var(Zn,k ). Prove that

E[Xn²] = n`σ² + `²

and deduce that there exists a positive real β > 0 such that
P(Xn ≥ 1) ≥ β/n.

3.7 Convergence in L1
Critical branching processes are examples of non-negative martingales which converge almost
surely (as they must according to Corollary 3.11), but not in L1 (see Exercise 3.25). In this
section, we clarify what it means for a martingale to be convergent in L1 .

Theorem 3.14 Let (Ω, A, (Fn )n≥0 , P) be a filtered probability space. Let X = (Xn )n≥0 be a
martingale. The following two conditions are equivalent:
1. The martingale X converges to a random variable X∞ almost surely and in L1 .
2. There exists an integrable random variable Z such that Xn = E[Z|Fn ] for all n ≥ 0.
Moreover, if these conditions are satisfied, then one can take Z = X∞ in 2.

Proof. 1 ⇒ 2. Choose n ≥ 0. For all m ≥ n, we have Xn = E[Xm |Fn ]. Moreover,

E [|E[Xm |Fn ] − E[X∞ |Fn ]|] = E [|E[Xm − X∞ |Fn ]|]


≤ E [E[|Xm − X∞ ||Fn ]]
= E[|Xm − X∞ |],

so that E[Xm |Fn ] converges to E[X∞ |Fn ] in L1 as m tends to infinity. Since E[Xm |Fn ] does in
fact not depend on m ≥ n, we even have E[Xm |Fn ] = E[X∞ |Fn ]. Finally,

Xn = E[X∞ |Fn ].

2 ⇒ 1. For all n ≥ 0, one has

E[|Xn |] = E[|E[Z|Fn ]|] ≤ E[E[|Z||Fn ]] = E[|Z|].

Hence, (Xn )n≥0 is bounded in L1 and, thanks to Theorem 3.9, converges almost surely to an
integrable random variable X∞ . We need to prove that (Xn )n≥0 converges in L1 to X∞ .
Let us do this first in the case where Z is bounded by a constant M , that is, under the
assumption that P(|Z| ≤ M ) = 1. Then, for all n ≥ 1, the positivity of the conditional
expectation implies P(|Xn | ≤ M ) = 1. We can conclude, by the dominated convergence theorem,
that (Xn )n≥1 converges in L1 to X∞ .
Let us now treat the general case. Let us fix ε > 0. There exists a constant M > 0 such that

E[ |Z − Z 1{|Z|≤M } | ] < ε.

Let us choose such an M and set Z̃ = Z 1|Z|≤M . For all n ≥ 0, let us define X̃n = E[Z̃|Fn ].
Then for all n ≥ 0,

E[|Xn − X̃n |] = E[|E[Z − Z̃|Fn ]|] ≤ E[|Z − Z̃|] < ε.

Since Z̃ is a bounded random variable, we know from our study of the bounded case that the
martingale (X̃n )n≥0 converges in L1 . Thus, it is a Cauchy sequence in L1 , which is to say that
there exists n0 ≥ 1 such that for all n, m ≥ n0 , one has

E[|X̃n − X̃m |] < ε.

Then, for all n, m ≥ n0 , we have

E[|Xn − Xm |] ≤ E[|Xn − X̃n |] + E[|X̃n − X̃m |] + E[|X̃m − Xm |] < 3ε.

Hence, the sequence (Xn )n≥0 is Cauchy in L1 . Thus, it converges in L1 , and it must be towards
X∞ .

In this proof, we used several facts which it may be useful to review.

Exercise 3.29 Prove that ‖E[X|F] − E[Y |F]‖L1 ≤ ‖X − Y ‖L1 . In other words, as a mapping
of L1 into itself, conditional expectation is 1-Lipschitz. Prove in fact that, as a linear mapping
from L1 into itself, it has norm exactly 1. What about the norm of E[·|F] as a linear operator
on Lp ?

Exercise 3.30 Let Z be an integrable random variable. Let ε > 0 be fixed. Prove that there
exists M > 0 such that ‖Z − Z 1{|Z|≤M } ‖L1 < ε.

Exercise 3.31 Prove that if a sequence of random variables converges almost surely and in L1 ,
it must be to the same limit.

The next exercise provides us with a refinement of the preceding theorem.

Exercise 3.32 Let (Ω, A, (Fn )n≥0 , P) be a filtered probability space. Let Z be an integrable
random variable. Let X∞ be the almost sure and L1 limit of the martingale (E[Z|Fn ])n≥0 . Does
the equality X∞ = Z necessarily hold ?
Define the σ-field F∞ = σ(∪n≥0 Fn ). Prove that X∞ is F∞ -measurable. Prove that for all
A ∈ ∪n≥0 Fn , one has

∫_A Z dP = ∫_A X∞ dP.
Prove, using the monotone class theorem, that the same equality holds for all A ∈ F∞ . What is
the conclusion of this argument ?

So far, we have studied the almost sure convergence of martingales, observed that it is not
always a convergence in L1 , and understood that it is equivalent for a martingale to be a closed
martingale or to be convergent in L1 . If time allows, we shall come back to the question of
convergence in L1 . Before that, we want to study convergence in Lp for p > 1. As often, the
case p = 2 is particularly pleasant, but there is a very beautiful theory for general p which relies
on Doob’s maximal inequality. In order to study it, we will need a fundamental tool in the study
of martingales, which is the theory of stopping. This is the subject of the next section.

3.8 Stopping times


Let us remember our understanding of a martingale in the picture involving a gambler. A
gambler does not usually play forever, he stops playing at a certain point. The reason why he
stops may be a mixture of various ingredients : lack of time, lack of money, bad luck, sense
that he has won all that he could, time for dinner. In any case, the time at which the gambler
stops playing can depend on how the game went : it is normally a random variable. However,
if things are to be fair, the decision of stopping cannot depend on any information about the
future of the game. The random time at which the player stops must thus have the following
property : the decision of stopping or not at time n must be taken in a deterministic way from
the information available at time n. The following definition encodes this property.

Definition 3.15 Let (Ω, A, (Fn )n≥0 , P) be a filtered probability space. A random variable T :
Ω → N ∪ {∞} is a stopping time if for all n ≥ 0, the event {T = n} belongs to Fn .

For example, a random time which is almost surely constant is a stopping time.

Exercise 3.33 Prove that if T is a stopping time, then {T = ∞} belongs to A, and in fact to σ(∪n≥0 Fn ).

Exercise 3.34 Prove that T is a stopping time if and only if for all n ≥ 0, the event {T ≤ n}
belongs to Fn .

Observe that the definition of a stopping time does not involve the probability P. It only
depends on the filtration. Here are a few simple but useful properties of stopping times.

Lemma 3.16 Let (Ω, A, (Fn )n≥0 , P) be a filtered probability space. Let S, T be stopping times.
Then S ∧ T = min(S, T ), S ∨ T = max(S, T ), S + T are stopping times.

The proof is left as an exercise. A fundamental example of stopping time is the hitting time
of a Borel subset by an adapted process.

Exercise 3.35 Let X = (Xn )n≥0 be an adapted process. Let B be a Borel subset of R. Prove
that H = inf{n ≥ 0 : Xn ∈ B}, with the convention inf ∅ = ∞, is a stopping time.

As their name indicates, stopping times are meant to stop processes, in particular martin-
gales. In fact, any integer-valued random variable can be used to stop any stochastic process,
according to the following definition.

Definition 3.17 Let X = (Xn )n≥0 be a stochastic process. Let T be a random variable with
values in N ∪ {∞}. The stopped process X T = (XT ∧n )n≥0 is defined by setting, for all n ≥ 0
and all ω ∈ Ω,

XT ∧n (ω) = XT (ω)∧n (ω) = Xn (ω) if T (ω) > n, and Xm (ω) if T (ω) = m ≤ n.
If T is finite almost surely, then the random variable XT is well defined by the formula

XT (ω) = XT (ω) (ω).

Observe that in general, for XT to be defined, one needs to make sure that T is finite almost
surely. If one knows that the sequence (Xn )n≥0 converges almost surely to X∞ , one might
however remove this assumption and set XT = X∞ on the event {T = ∞}.
The first important result about stopping of martingales by stopping times is the following
and we will deduce it from Theorem 3.8.

Proposition 3.18 Let X be a supermartingale. Let T be a stopping time. The process X T is
a supermartingale.

Proof. In the picture of the gambler, stopping a process at time T amounts to playing 1 until
time T and then playing 0 forever. Let us define accordingly, for all n ≥ 1,

Hn = 1{T ≥n} = 1{T ≤n−1}c .

We play 1 during the n-th turn if and only if we did not decide to stop just after the end of the
n − 1-th turn. By construction, H is previsible. It is bounded and non-negative. Thus, H • X is
a supermartingale. Moreover, for almost all ω, we have
(H • X)n (ω) = ∑_{k=1}^{n} Hk (ω)(Xk (ω) − Xk−1 (ω))
= ∑_{k=1}^{n} 1{T (ω)≥k} (Xk (ω) − Xk−1 (ω))
= ∑_{k=1}^{T (ω)∧n} (Xk (ω) − Xk−1 (ω))
= XT (ω)∧n (ω) − X0 (ω).

Thus, X T itself is a supermartingale.

In particular, E[XT ∧n ] ≤ E[X0 ] and if X is a martingale, the equality holds : one has
E[XT ∧n ] = E[X0 ]. It is tempting to let n tend to infinity in this equation, but one must be very
cautious in doing so. In order to be able to say that E[XT ] = E[X0 ], one must usually combine
properties of X and T , in one of many possible ways. Let us see what can go wrong.
Firstly, as we already mentioned, the sequence (XT ∧n )n≥0 may not converge almost surely.
Indeed, the event {T = +∞} may have positive probability, and X may not converge on
it. However, if T < ∞ almost surely, it is true that XT ∧n converges almost surely to XT .
Nevertheless, this convergence may not happen in L1 . The classical counter-example is that of
the simple random walk (see Exercise 3.23). Indeed, let T be the hitting time of −1 for S, that
is, T = inf{n : Sn = −1}. Then T is finite almost surely, and S T converges almost surely to −1,
although 0 = E[ST ∧n ] ≠ E[ST ] = −1.
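
This failure is easy to witness numerically. In the sketch below (ours, in Python), the empirical mean of ST ∧n stays close to 0 = E[S0 ] for every fixed n, even though ST = −1 identically.

import random

def mean_stopped_walk(n=100, trials=20_000, seed=0):
    # Empirical mean of S_{T ^ n}, where T = inf{k > 0 : S_k = -1}.
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        s = 0
        for _ in range(n):
            if s == -1:
                break              # the walk is stopped: S_{T ^ n} = -1
            s += rng.choice((1, -1))
        total += s
    return total / trials

print(mean_stopped_walk())   # close to 0, although S_T = -1 almost surely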
Let us give three simple sets of conditions under which the expected result holds.

Theorem 3.19 Let X be a supermartingale. Let T be a stopping time. If one of the following
conditions is satisfied :
1. T is bounded,
2. T is integrable and there exists M > 0 such that for all n ≥ 0, |Xn+1 − Xn | ≤ M ,
3. T is almost surely finite and X T is bounded,
then XT is integrable and the inequality E[XT ] ≤ E[X0 ] holds.

Observe that the three successive sets of assumptions are in decreasing order of strength on
T , and of increasing order of strength on X.

Proof. 1. If T is bounded by an integer N , then E[X0 ] ≥ E[XT ∧N ] = E[XT ].

2. Since T is integrable, it is almost surely finite, so that XT ∧n converges almost surely to
XT . Moreover, for all n ≥ 1,

|XT ∧n − X0 | ≤ ∑_{k=0}^{(T ∧n)−1} |Xk+1 − Xk | ≤ M T.

Hence, the sequence (XT ∧n )n≥0 is dominated by the integrable random variable M T + |X0 | and
the dominated convergence theorem allows us to conclude that XT ∧n converges in L1 to XT . In

particular, XT is integrable and E[XT ] = limn→∞ E[XT ∧n ] ≤ E[X0 ].

3. If T is finite almost surely, then XT ∧n converges almost surely to XT . If moreover X T is
bounded, then the dominated convergence theorem ensures that the convergence holds in L1 , and
we conclude as in the previous case.

Exercise 3.36 Let S be the simple random walk. Consider two integers a and b such that
a < 0 < b. Prove that S will almost surely visit the set {a, b} (you can prove this without using
any sophisticated results on the simple random walk). Compute the probability that S hits a
before hitting b. Application : a gambler starts with a fortune of 1. He has decided to stop
playing as soon as his fortune reaches 100, or when he has no money anymore. What is the
probability that he returns home with empty pockets ?

Exercise 3.37 Prove that a random variable T with values in N ∪ {+∞} is a stopping time if
and only if the process H = (Hn )n≥1 defined by Hn = 1{T ≥n} is previsible.

Exercise 3.38 Let H = (Hn )n≥1 be a stochastic process. Prove that the following two assertions
are equivalent.
1. H is adapted.
2. For every adapted process X, the process H • X is adapted.

Exercise 3.39 Let T be a random variable with values in N ∪ {+∞}. Prove that the following
assertions are equivalent.
1. For all n ≥ 0, the event {T = n} belongs to Fn+1 .
2. For every adapted process X, the stopped process X T is adapted.
3. For every martingale X, the stopped process X T is adapted.

3.9 Convergence in Lp for p > 1


In this section, we shall prove that a martingale which is bounded in Lp for some real
p > 1 converges almost surely and in Lp . We already know that it converges almost surely, for
boundedness in Lp implies boundedness in L1 . The point is to establish the convergence in Lp .
For this, we shall use Doob’s maximal inequality, for which we need the following lemma.

Lemma 3.20 Let X be a submartingale. Let S and T be two bounded stopping times such that
S ≤ T almost surely. Then E[XS ] ≤ E[XT ].

Proof. Let our gambler start playing at time S and stop at time T . This amounts to defining,
for all n ≥ 1,
Hn = 1{S<n≤T } = 1{S≤n−1} 1{T ≤n−1}c .
The way we wrote it makes it clear that H = (Hn )n≥0 is previsible. The same computation as
in the proof of Proposition 3.18 yields

(H • X)n = XT ∧n − XS∧n .

Let N be an integer such that S ≤ T ≤ N almost surely. Then (H • X)N = XT − XS . Since H
is bounded and non-negative, H • X is a submartingale, so that

E[XT − XS ] = E[(H • X)N ] ≥ E[(H • X)0 ] = 0,

as expected.

Let us use this lemma to prove Doob’s maximal inequality.

Proposition 3.21 (Doob’s maximal inequality) Let X be a submartingale. For all integer
n ≥ 0 and all real number a, the following inequalities hold :
aP( sup0≤k≤n Xk ≥ a ) ≤ E[Xn 1{sup0≤k≤n Xk ≥a} ] ≤ E[Xn+ ].

Proof. Choose a real number a. Define T as the first hitting time of [a, +∞) for X:

T = inf{n ≥ 0 : Xn ≥ a}.

Consider now an integer n ≥ 0. The random variable XT ∧n is equal to XT , hence greater or
equal to a, on the event {T ≤ n}. On the complement of this event, XT ∧n is equal to Xn .
Observing that the events {T ≤ n} and {sup0≤k≤n Xk ≥ a} are the same, we thus find

XT ∧n ≥ a1{sup0≤k≤n Xk ≥a} + Xn 1{sup0≤k≤n Xk <a} .

Taking the expectation of both sides of this inequality, we find

aP( sup0≤k≤n Xk ≥ a) ≤ E[XT ∧n ] − E[Xn 1{sup0≤k≤n Xk <a} ].

The previous lemma applied to the bounded stopping times T ∧ n and n yields the inequality
E[XT ∧n ] ≤ E[Xn ], which combined with the previous one implies that

aP( sup0≤k≤n Xk ≥ a) ≤ E[Xn (1 − 1{sup0≤k≤n Xk <a} )] = E[Xn 1{sup0≤k≤n Xk ≥a} ].

This proves the first inequality. The second follows from an instance of the inequality Z 1A ≤ Z +
which is true for any random variable Z and any event A.

Although one would quite naturally apply this result with a positive value for a, it seems
that the result and the proof do not depend on any assumption on the sign of a. It is interesting
to take a moment to think about what the theorem says when a = 0 (and even in this case the
theorem is not trivial), and when a < 0.
Let us now take one further step towards the Lp convergence. For this, let us introduce a
notation. If X = (Xn )n≥0 is a stochastic process, let us define its maximal process X ∗ = (Xn∗ )n≥0
by setting, for all n ≥ 0,
Xn∗ = sup0≤k≤n |Xk |.

Figure 9: The graph of the function p ↦ (p/(p − 1))^p .

Proposition 3.22 Let X be a martingale. Let p > 1 be a real number. For all integer n ≥ 1,
one has

E[(Xn∗ )^p ] ≤ (p/(p − 1))^p E[|Xn |^p ].

Observe that we are only considering non-negative quantities in this statement. Hence, we
do not need to assume that X belongs to Lp .
Proof. By the previous proposition applied to the submartingale (|Xn |)n≥0 , we have, for all
a > 0,
aP(Xn∗ ≥ a) ≤ E[|Xn |1{Xn∗ ≥a} ].
Let us multiply both sides of this inequality by a^(p−2) and integrate with respect to a from 0 to
+∞. On the left, we find
∫_0^{+∞} a^(p−1) P(Xn∗ ≥ a) da = E[ ∫_0^{Xn∗} a^(p−1) da ] = (1/p) E[(Xn∗ )^p ].

On the right, we find


" #
Z +∞ Z Xn∗
a p−2
E[|Xn |1{Xn∗ >a} ] da = E |Xn | a p−2
da
0 0
1
= E[|Xn |(Xn∗ )p−1 ]
p−1
1 1 p−1
≤ E[|Xn |p ] p E[(Xn∗ )p ] p ,
p−1
where the last inequality is Hölder’s inequality. Combining the two expressions, we find the
expected inequality.

As n tends to infinity, Xn∗ increases and converges almost surely to



X∞∗ = sup{|Xn | : n ≥ 0}.

Let us now prove the convergence result in Lp .

Theorem 3.23 Let X be a martingale. Let p > 1 be a real number. Assume that X is bounded
in Lp . Then X converges almost surely and in Lp towards a random variable X∞ which satisfies

E[|X∞ |p ] = sup{E[|Xn |p ] : n ≥ 0}.

Moreover, one has

E[(X∞∗ )^p ] ≤ (p/(p − 1))^p E[|X∞ |^p ].

Proof. Let us assume that X is bounded in Lp . In particular, X is bounded in L1 , hence it
converges almost surely to a random variable X∞ .
The previous proposition, the fact that (Xn∗ )n≥0 is a non-decreasing sequence which converges
towards X∞∗ and the monotone convergence theorem imply that

E[(X∞∗ )^p ] ≤ (p/(p − 1))^p sup{E[|Xn |^p ] : n ≥ 0} < ∞.

Hence, the almost sure convergence of Xn to X∞ is dominated in Lp by X∞∗ . By the dominated

convergence theorem, the convergence holds in Lp . In particular, E[|X∞ |p ] is the limit of the
sequence (E[|Xn |p ])n≥0 which, by Jensen’s inequality, is non-decreasing. Hence,

E[|X∞ |p ] = sup{E[|Xn |p ] : n ≥ 0}.

The last inequality follows immediately.
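
As a sanity check, the two sides of the inequality of Proposition 3.22 can be compared on simulations of the simple random walk, which is bounded in no Lp but for which the proposition holds at each fixed n. The sketch below (ours, in Python) estimates both sides for p = 2.

import random

def doob_lp_check(n=50, p=2.0, trials=20_000, seed=0):
    # Estimate E[(X_n^*)^p] and (p/(p-1))^p E[|X_n|^p] for the simple random walk.
    rng = random.Random(seed)
    lhs = rhs = 0.0
    for _ in range(trials):
        s, max_abs = 0, 0
        for _ in range(n):
            s += rng.choice((1, -1))
            max_abs = max(max_abs, abs(s))
        lhs += max_abs ** p
        rhs += abs(s) ** p
    c = (p / (p - 1)) ** p
    return lhs / trials, c * rhs / trials   # the first value should not exceed the second

print(doob_lp_check())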

The results of this section, as well as those of Section 3.5, fall in a category that one could call
“rigidity results” for martingales. Doob’s upcrossing lemma, which was the main result needed
to prove the almost sure convergence of martingales bounded in L1 , can be rephrased by saying
that if a martingale oscillates a lot, then its L1 norm becomes large. Doob’s maximal inequality
says that if the largest value taken by a martingale is large in Lp , then the martingale itself
is large in Lp . Neither of these results hold, even in very weak forms, for arbitrary stochastic
processes.

Exercise 3.40 The two equalities


∑_{n=1}^{∞} 1/n = ∞ and ∑_{n=1}^{∞} (−1)^n /n = − log 2

are well known. Consider now a sequence (εn )n≥1 of i.i.d. random variables such that P(ε1 = 1) = P(ε1 = −1) = 1/2. Study the random series

∑_{n=1}^{∞} εn /n and, more generally, ∑_{n=1}^{∞} εn /n^s ,

where s is an arbitrary complex number.

3.10 Square-integrable martingales
As often in analysis when something can be done for all Lp spaces, the case p = 2 enjoys special
properties. Martingales are no exception, the more so since the conditional expectation, which
lies at the principle of the notion of martingale, is closely related to the geometry of L2 . Let us
illustrate this by the following elementary but useful result.

Proposition 3.24 The increments of a square-integrable martingale are orthogonal in L2 . More
precisely, if X = (Xn )n≥0 is a martingale on (Ω, A, (Fn )n≥0 , P) such that E[Xn²] < ∞ for all
n ≥ 0, then for all integers m, n, p such that m ≤ n ≤ p, one has

E[(Xn − Xm )(Xp − Xn )] = 0.

Proof. Consider three integers m ≤ n ≤ p. We have

E[(Xn − Xm )(Xp − Xn )] = E[E[(Xn − Xm )(Xp − Xn )|Fn ]]
= E[(Xn − Xm )E[Xp − Xn |Fn ]]
= E[(Xn − Xm ) · 0]
= 0,

as expected.

Exercise 3.41 Is the converse true ? Is any square-integrable process X = (Xn )n≥0 such that
for all m ≤ n ≤ p one has E[(Xn − Xm )(Xp − Xn )] = 0 necessarily a martingale ?

Exercise 3.42 Let H be a Hilbert space. A curve x : R → H, t 7→ xt is called a helix if for all
reals s ≤ t ≤ u, the vectors xt − xs and xu − xt are perpendicular. Let x = (xt )t≥0 be a helix
in a Hilbert space H. Prove that for all reals r ≤ s ≤ t ≤ u, the vectors xs − xr and xu − xt
are perpendicular. Prove that H is infinite dimensional. Find an example of a helix in your
favorite separable infinite-dimensional Hilbert space, and prove that there exists a helix in any
infinite-dimensional Hilbert space.

A collection of orthogonal vectors in a Hilbert space begs for an application of the Pythagorean theorem.

Proposition 3.25 Let X be a square-integrable martingale. For all n ≥ 0, one has


E[Xn²] = E[X0²] + ∑_{k=0}^{n−1} E[(Xk+1 − Xk )²].

In particular, X is bounded in L2 if and only if

∑_{k=0}^{∞} E[(Xk+1 − Xk )²] < +∞.

Proof. This follows immediately from the previous proposition and the Pythagorean theorem.

Exercise 3.43 Prove that if (Zn )n≥0 is a sequence of independent random variables such that
E[Zn ] = 0 for all n ≥ 0 and the series ∑n≥0 Var(Zn ) converges, then the series ∑n≥0 Zn
converges almost surely and in L2 .

The following result is fundamental. It is not specific to the square-integrable case, but we
will see that it has very important consequences in this case.

Proposition 3.26 (Doob’s decomposition) Let (Ω, A, (Fn )n≥0 , P) be a filtered probability
space. Let X = (Xn )n≥0 be an adapted integrable stochastic process.
1. There exists a martingale M = (Mn )n≥0 with M0 = 0 and a previsible process A =
(An )n≥0 such that for all n ≥ 0, one has

Xn = X0 + Mn + An .

Moreover, this decomposition is unique.


2. The process X is a submartingale if and only if the process A is non-decreasing, that is,
for all n ≥ 0, P(An ≤ An+1 ) = 1.

Proof. If such a decomposition exists, then we must have A0 = 0 and, for all n ≥ 0,

E[Xn+1 − Xn |Fn ] = An+1 − An .

This formula defines inductively a previsible process (An )n≥0 . Moreover, we must have Mn =
Xn − X0 − An and the equality

E[(Xn+1 − An+1 ) − (Xn − An )|Fn ] = 0

shows that (Mn )n≥0 is a martingale.


The second assertion follows immediately from the first line of computation above.

For square-integrable martingales, this decomposition leads to the definition of a very im-
portant object.

Definition 3.27 Let X be a square-integrable martingale. Let Xn² = X0² + Mn + An be the
Doob decomposition of the integrable submartingale (Xn²)n≥0 . The process (An )n≥0 is called the
increasing process associated to X and is often denoted by ⟨X⟩ = (⟨X⟩n )n≥0 .
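
For the simple random walk S, we have ⟨S⟩n = n: indeed, E[Sn+1² |Fn ] = Sn² + 1, so that Sn² − n is a martingale and An = n is previsible and non-decreasing (compare with Exercise 3.2). The little exact check below (ours, in Python) verifies this identity on every atom of Fn by enumerating all coin sequences.

from itertools import product

def check_bracket(n=5):
    # On every atom of F_n, check that E[S_{n+1}^2 | F_n] = S_n^2 + 1.
    for coins in product((1, -1), repeat=n):
        s = sum(coins)                                   # value of S_n on this atom
        cond_mean = ((s + 1) ** 2 + (s - 1) ** 2) / 2    # average over the two continuations
        assert cond_mean == s ** 2 + 1
    print("E[S_{n+1}^2 | F_n] - S_n^2 = 1 on every atom, so <S>_n = n")

check_bracket()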

The process ⟨X⟩ plays a crucial role in the study of continuous-time martingales and in
the construction of the stochastic integral, one of the main results being the continuous-time
analogue of the following property.

Exercise 3.44 Let X be a martingale. Let H be a bounded previsible process. Prove that for
all n ≥ 0,
⟨H • X⟩n = ∑_{k=1}^{n} Hk² (⟨X⟩k − ⟨X⟩k−1 ).

This can be written informally as

( ∫_0^· H dX )_n = ∫_0^n H² d⟨X⟩.

Using the increasing process associated with a square-integrable martingale, we can refine
the theorem of convergence. For this, let us introduce

⟨X⟩∞ = lim_{n→∞} ⟨X⟩n ,

which exists almost surely as the limit of a non-decreasing sequence.

Exercise 3.45 Check that E[Xn²] = E[X0²] + E[⟨X⟩n ] for all n ≥ 0. Prove that X is bounded in L2 if
and only if ⟨X⟩∞ is integrable.

Theorem 3.28 Let X be a square-integrable martingale. The sequence (Xn )n≥0 converges
almost surely on the event {⟨X⟩∞ < ∞}.

Proof. Choose an integer k ≥ 0 and define

Tk = inf{n ≥ 0 : ⟨X⟩n+1 > k}.

It is a stopping time, because it is the hitting time of the Borel subset (k, +∞) of R by the
adapted process (⟨X⟩n+1 )n≥0 . We claim that the process ⟨X⟩Tk is the increasing process asso-
ciated to X Tk .
Firstly, since X 2 − hXi is a martingale and Tk a stopping time, the process

(X 2 − hXi)Tk = (X Tk )2 − hXiTk

is a martingale. It remains to prove that hXiTk is previsible.


Choose n ≥ 1 and B a Borel subset of $\mathbb{R}$. Then
\begin{align*}
\{\langle X\rangle^{T_k}_n \in B\} &= \{\langle X\rangle_{T_k \wedge n} \in B\}\\
&= \big(\{T_k \ge n\} \cap \{\langle X\rangle_n \in B\}\big) \cup \big(\{T_k < n\} \cap \{\langle X\rangle_{T_k} \in B\}\big)\\
&= \big(\{T_k \le n-1\}^c \cap \{\langle X\rangle_n \in B\}\big) \cup \bigcup_{m=0}^{n-1} \big(\{T_k = m\} \cap \{\langle X\rangle_m \in B\}\big)
\end{align*}
and this way of writing the event makes it apparent that it belongs to $\mathcal{F}_{n-1}$.
The increasing process of the martingale $X^{T_k}$ is bounded by k by construction, so that this martingale is bounded in $L^2$, from which it follows that it converges almost surely.
On the event $\{\langle X\rangle_\infty \le k\}$, the stopping time $T_k$ is equal to $\infty$, and the processes X and $X^{T_k}$ are equal. Hence, X itself converges almost surely on $\{\langle X\rangle_\infty \le k\}$.
Finally, the equality $\{\langle X\rangle_\infty < \infty\} = \bigcup_{k\ge 1} \{\langle X\rangle_\infty \le k\}$ implies that X converges almost surely on the whole event $\{\langle X\rangle_\infty < \infty\}$.

Exercise 3.46 Give an example of a sequence $(z_n)_{n\ge 0}$ of real numbers such that $\sum_{n\ge 0} z_n^2$ converges, but $\sum_{n\ge 0} z_n$ does not.
Let $(Z_n)_{n\ge 0}$ be a sequence of independent square-integrable random variables such that $E[Z_n] = 0$ for all n ≥ 0. Set $X_0 = 0$ and, for all n ≥ 1, $X_n = Z_1 + \ldots + Z_n$. Compute $\langle X\rangle$. What does the last theorem say in this situation?

3.11 Uniform integrability
In this section, we will give a much more detailed answer to the question of when a martingale which is bounded in $L^1$ converges in $L^1$. The crucial tool is the notion of uniform integrability.
Definition 3.29 Let (Ω, A, P) be a probability space. A family $(X_i)_{i\in I}$ of integrable random variables is said to be uniformly integrable if for all ε > 0 there exists M > 0 such that
\[
\forall i \in I, \quad E\big[|X_i| \mathbf{1}_{\{|X_i| > M\}}\big] < \varepsilon.
\]
An equivalent formulation of this definition is
\[
\lim_{M\to\infty} \sup_{i\in I} E\big[|X_i| \mathbf{1}_{\{|X_i| > M\}}\big] = 0.
\]

Exercise 3.47 Prove that a uniformly integrable family is bounded in L1 .


Exercise 3.48 Let (Xi )i∈I be a family of integrable random variables. Prove that it is uniformly
integrable as soon as one of the following assumptions is satisfied.
• The index set I is finite.
• There exists an integrable random variable Z such that |Xi | ≤ Z for all i ∈ I.
• The random variable supi∈I |Xi | is integrable.
• There exists p > 1 such that the family (Xi )i∈I is bounded in Lp .
Consider the probability space $([0,1], \mathcal{B}_{[0,1]}, \lambda)$. For each n ≥ 1, set $m = \lfloor \log_2 n\rfloor$, so that $n = 2^m + k$ with $k \in \{0, \ldots, 2^m - 1\}$, and define the random variable
\[
X_n = \frac{2^m}{\log(m+1)} \mathbf{1}_{[\frac{k}{2^m}, \frac{k+1}{2^m}]}.
\]
Prove that the family $(X_n)_{n\ge 1}$ is uniformly integrable but does not satisfy any of the properties above. Incidentally, does the sequence $(X_n)_{n\ge 1}$ converge, and in which sense?
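For this last family everything is explicit, so the uniform integrability can at least be explored numerically. The following Python sketch is an exploration, not a substitute for the proof; it starts at m = 1 to avoid the degenerate value log 1 = 0 in the displayed formula:

import numpy as np

# For n = 2^m + k, X_n takes the single nonzero value c_m = 2^m/log(m+1)
# on a set of measure 2^{-m}, so E[X_n 1_{X_n > M}] equals 1/log(m+1)
# when c_m > M, and 0 otherwise.
def tail_sup(M, m_max=200):
    sup = 0.0
    for m in range(1, m_max):
        c_m = 2.0**m / np.log(m + 1)
        if c_m > M:
            sup = max(sup, 1.0 / np.log(m + 1))
    return sup

# sup_n E[X_n 1_{X_n > M}] tends to 0 as M grows: uniform integrability.
for M in [1e1, 1e3, 1e6, 1e12]:
    print(M, tail_sup(M))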
The name uniform integrability is justified by the following proposition.
Proposition 3.30 Let (Ω, A, P) be a probability space. Let (Xi )i∈I be a family of random vari-
ables which is bounded in L1 . The following two assertions are equivalent.
1. The family (Xi )i∈I is uniformly integrable.
2. For all ε > 0 there exists δ > 0 such that for all A ∈ A with P(A) < δ, one has
\[
\forall i \in I, \quad \int_A |X_i| \, dP < \varepsilon.
\]
Proof. 1 ⇒ 2. Choose ε > 0. Since the family $(X_i)_{i\in I}$ is uniformly integrable, there exists a real M > 0 such that for all i ∈ I, $E[|X_i| \mathbf{1}_{\{|X_i|>M\}}] < \frac{\varepsilon}{2}$. Let A be any event in $\mathcal{A}$ such that $P(A) < \frac{\varepsilon}{2M}$. Then
\begin{align*}
\int_A |X_i| \, dP &= \int_{A \cap \{|X_i| \le M\}} |X_i| \, dP + \int_{A \cap \{|X_i| > M\}} |X_i| \, dP\\
&\le \int_A M \, dP + \int_{\{|X_i| > M\}} |X_i| \, dP\\
&< M\, P(A) + \frac{\varepsilon}{2}\\
&< \varepsilon.
\end{align*}

2 ⇒ 1. Let K > 0 be such that $E[|X_i|] \le K$ for all i ∈ I. Choose ε > 0. Let δ > 0 be such that for all A ∈ A, P(A) < δ implies $\int_A |X_i| \, dP < \varepsilon$ for all i ∈ I. Set $M = \frac{2K}{\delta}$. Then for all i ∈ I,
\[
P(|X_i| > M) \le \frac{1}{M} \int_{\{|X_i| > M\}} |X_i| \, dP \le \frac{K}{M} = \frac{\delta}{2} < \delta,
\]
so that
\[
\int_{\{|X_i| > M\}} |X_i| \, dP < \varepsilon,
\]
and the proof is finished.

Exercise 3.49 When did we use the assumption that the family $(X_i)_{i\in I}$ is bounded in $L^1$? Prove that if the probability space (Ω, A, P) is such that Ω is a finite set, then the second assertion is true for any family of random variables. Under which assumption on the probability space (Ω, A, P) can one deduce from the second assertion that the family $(X_i)_{i\in I}$ is bounded in $L^1$?

Corollary 3.31 Let (Ω, A, P) be a probability space. Let Z be an integrable random variable. The family $\{E[Z|\mathcal{B}] : \mathcal{B} \text{ sub-}\sigma\text{-field of } \mathcal{A}\}$ is uniformly integrable.

Proof. For each sub-σ-field $\mathcal{B}$ of $\mathcal{A}$, the random variable $E[Z|\mathcal{B}]$ is integrable. Let us now choose ε > 0. Thanks to the uniform integrability of the family which consists of the single random variable Z, let us consider δ > 0 such that P(A) < δ implies $\int_A |Z| \, dP < \varepsilon$. Set $M = \frac{2}{\delta} E[|Z|]$. Let $\mathcal{B}$ be a sub-σ-field of $\mathcal{A}$. We claim that
\[
E\big[|E[Z|\mathcal{B}]| \mathbf{1}_{\{|E[Z|\mathcal{B}]| > M\}}\big] < \varepsilon.
\]
Indeed, we have
\[
P(|E[Z|\mathcal{B}]| > M) \le \frac{1}{M} E\big[|E[Z|\mathcal{B}]|\big] \le \frac{1}{M} E\big[E[|Z|\,|\,\mathcal{B}]\big] = \frac{1}{M} E[|Z|] = \frac{\delta}{2} < \delta.
\]
Hence, since the event $\{|E[Z|\mathcal{B}]| > M\}$ belongs to $\mathcal{B}$ and has probability less than δ,
\[
\int_{\{|E[Z|\mathcal{B}]| > M\}} |E[Z|\mathcal{B}]| \, dP \le \int_{\{|E[Z|\mathcal{B}]| > M\}} |Z| \, dP < \varepsilon
\]
and the result is proved.
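A concrete picture of this corollary — a Python sketch under assumptions of our own choosing, namely $Z(t) = 1/(2\sqrt{t})$ on $([0,1], \lambda)$ and the dyadic σ-fields $\mathcal{B}_m$ generated by the intervals $[j 2^{-m}, (j+1)2^{-m})$, for which $E[Z|\mathcal{B}_m]$ is a block average:

import numpy as np

# Z(t) = 1/(2*sqrt(t)) on [0,1]: integrable (its integral is 1) but unbounded.
# We approximate it on a fine dyadic grid.
n = 2**20
t = (np.arange(n) + 0.5) / n
Z = 0.5 / np.sqrt(t)

def cond_exp(Z, m):
    # E[Z | B_m]: average of Z over each dyadic block of length 2^-m.
    blocks = Z.reshape(2**m, -1).mean(axis=1)
    return np.repeat(blocks, n // 2**m)

# The tails E[E[Z|B_m] 1_{> M}] stay small simultaneously for all m
# (here Z >= 0, so no absolute values are needed), which is the uniform
# integrability asserted by the corollary.
M = 5.0
for m in [2, 5, 10, 15]:
    E_m = cond_exp(Z, m)
    print(m, np.mean(E_m * (E_m > M)))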

The reason why uniform integrability is so useful in the context of convergence of martingales
is the following result, which is a stronger form of the dominated convergence theorem (but it
is only valid on finite measure spaces).

Theorem 3.32 Let (Xn )n≥0 be a sequence of integrable random variables. Let X∞ be a random
variable. The following assertions are equivalent.
1. The sequence (Xn )n≥0 converges in L1 to X∞ .
2. The sequence (Xn )n≥0 is uniformly integrable and converges in probability to X∞ .

Proof. 1 ⇒ 2. We know that convergence in $L^1$ implies convergence in probability. Let us prove that $(X_n)_{n\ge 0}$ is uniformly integrable.
Choose ε > 0. Let $\delta_1 > 0$ be such that $P(A) < \delta_1$ implies $\int_A |X_\infty| \, dP < \frac{\varepsilon}{2}$. Let $n_0$ be such that for all $n \ge n_0$, $E[|X_n - X_\infty|] < \frac{\varepsilon}{2}$. Let $\delta_2 > 0$ be such that for all $n \le n_0$ and all A with $P(A) < \delta_2$, we have $\int_A |X_n| \, dP < \varepsilon$.
Now set $\delta = \min(\delta_1, \delta_2)$. Choose A ∈ $\mathcal{A}$ such that P(A) < δ. Choose n ≥ 0. If $n \le n_0$, then since $P(A) < \delta_2$, we have $\int_A |X_n| \, dP < \varepsilon$. If $n \ge n_0$, then
\[
\int_A |X_n| \, dP \le \int_A |X_n - X_\infty| \, dP + \int_A |X_\infty| \, dP < E[|X_n - X_\infty|] + \frac{\varepsilon}{2} < \varepsilon.
\]
Since the sequence converges in $L^1$, it is bounded in $L^1$, and Proposition 3.30 now shows that it is uniformly integrable.

2 ⇒ 1. From Proposition 3.30 it follows that the family $(X_n - X_m)_{n,m\ge 0}$ is uniformly integrable. Choose ε > 0. Let M > 0 be such that $E[|X_n - X_m| \mathbf{1}_{\{|X_n - X_m| > M\}}] < \varepsilon$ for all n, m ≥ 0. Thus,
\begin{align*}
E[|X_n - X_m|] &= E\big[|X_n - X_m| \mathbf{1}_{\{|X_n - X_m| < \varepsilon\}}\big] + E\big[|X_n - X_m| \mathbf{1}_{\{\varepsilon \le |X_n - X_m| \le M\}}\big]\\
&\qquad + E\big[|X_n - X_m| \mathbf{1}_{\{|X_n - X_m| > M\}}\big]\\
&\le 2\varepsilon + E\big[|X_n - X_m| \mathbf{1}_{\{\varepsilon \le |X_n - X_m| \le M\}}\big]\\
&\le 2\varepsilon + M\, P(|X_n - X_m| \ge \varepsilon).
\end{align*}
Let $n_0$ be such that for all $n \ge n_0$, $P(|X_n - X_\infty| \ge \frac{\varepsilon}{2}) < \frac{\varepsilon}{2M}$. Then for all $n, m \ge n_0$, we have
\[
P(|X_n - X_m| \ge \varepsilon) \le P\Big(|X_n - X_\infty| \ge \frac{\varepsilon}{2}\Big) + P\Big(|X_m - X_\infty| \ge \frac{\varepsilon}{2}\Big) < \frac{\varepsilon}{M}.
\]
Thus, for all $n, m \ge n_0$ we have
\[
E[|X_n - X_m|] \le 3\varepsilon.
\]
The sequence $(X_n)_{n\ge 0}$ is a Cauchy sequence in $L^1$. Hence, it converges in $L^1$, and since it also converges in probability to $X_\infty$, its limit must be $X_\infty$.

The proof of the following result is left as an exercise.

Theorem 3.33 Let X = (Xn)n≥0 be a martingale. The following assertions are equivalent.
1. The martingale X converges in $L^1$.
2. The family $(X_n)_{n\ge 0}$ is uniformly integrable.
3. There exists an integrable random variable Z such that $X_n = E[Z|\mathcal{F}_n]$ for all n ≥ 0.
Moreover, if any of these assertions holds, then the martingale converges almost surely.

We would like to complete this picture by extending the stopping theorem to the case of
uniformly integrable martingales. For this, we need to introduce the notion of σ-field associated
with a stopping time.

Definition 3.34 Let (Ω, A, (Fn )n≥0 , P) be a filtered probability space. Let T be a stopping time.
The collection of events

FT = {A ∈ A : ∀n ≥ 0, A ∩ {T = n} ∈ Fn }

is a sub-σ-field of A called the σ-field of the past up to time T.

Exercise 3.50 Prove that FT is indeed a σ-field. Compute FT when T = n almost surely.
Compute FT when T = ∞ almost surely.
Exercise 3.51 Prove that a subset A of Ω belongs to $\mathcal{F}_T$ if and only if there exist an event $A_\infty \in \mathcal{A}$ and a sequence $(A_n)_{n\ge 0}$ of events with $A_n \in \mathcal{F}_n$ for every n ≥ 0, such that
\[
A = (A_\infty \cap \{T = \infty\}) \cup \bigcup_{n=0}^{\infty} (A_n \cap \{T = n\}).
\]

Lemma 3.35 Let S and T be two stopping times. If S ≤ T, then $\mathcal{F}_S \subset \mathcal{F}_T$.


Proof. Consider A ∈ $\mathcal{F}_S$. Choose n ≥ 0. Then the event
\[
A \cap \{T = n\} = \bigcup_{k=0}^{n} (A \cap \{T = n\} \cap \{S = k\}) = \bigcup_{k=0}^{n} \big((A \cap \{S = k\}) \cap \{T = n\}\big)
\]
belongs to $\mathcal{F}_n$, because $A \cap \{S = k\} \in \mathcal{F}_k \subset \mathcal{F}_n$ for all k ≤ n. Hence, A belongs to $\mathcal{F}_T$.

Lemma 3.36 Let X = (Xn )n≥0 be an adapted stochastic process. Let T be a stopping time. If
one of the following assumptions is satisfied:
1. T is finite almost surely,
2. Xn converges almost surely to X∞ ,
then XT is well defined and FT -measurable.
Proof. We know already that XT is well defined under any of the two assumptions. Let us prove
that XT is FT-measurable. For this, let us consider a Borel subset B of R and an integer n ≥ 0.
We have
{XT ∈ B} ∩ {T = n} = {Xn ∈ B} ∩ {T = n} ∈ Fn .
Since this holds for all n ≥ 0, {XT ∈ B} belongs to FT .

Let us now state and prove the stopping theorem for uniformly integrable martingales.
Theorem 3.37 Let X be a uniformly integrable martingale. Let T be a stopping time. Then
XT = E[X∞ |FT ].
In particular, E[XT ] = E[X∞ ] = E[Xn ] for all n ≥ 0. Moreover, if S and T are two stopping
times such that S ≤ T , then
XS = E[XT |FS ].
Proof. Let us check that XT is integrable. We have
\begin{align*}
E[|X_T|] &= \sum_{n=0}^{\infty} E\big[|X_n| \mathbf{1}_{\{T=n\}}\big] + E\big[|X_\infty| \mathbf{1}_{\{T=\infty\}}\big]\\
&\le \sum_{n=0}^{\infty} E\big[E[|X_\infty| \,|\, \mathcal{F}_n] \mathbf{1}_{\{T=n\}}\big] + E\big[|X_\infty| \mathbf{1}_{\{T=\infty\}}\big]\\
&= \sum_{n=0}^{\infty} E\big[|X_\infty| \mathbf{1}_{\{T=n\}}\big] + E\big[|X_\infty| \mathbf{1}_{\{T=\infty\}}\big]\\
&= E[|X_\infty|] < \infty,
\end{align*}
where the inequality uses $|X_n| = |E[X_\infty|\mathcal{F}_n]| \le E[|X_\infty|\,|\,\mathcal{F}_n]$.

Let us now choose A ∈ $\mathcal{F}_T$. We have
\begin{align*}
E[X_\infty \mathbf{1}_A] &= \sum_{n=0}^{\infty} E\big[X_\infty \mathbf{1}_{A \cap \{T=n\}}\big] + E\big[X_\infty \mathbf{1}_{A \cap \{T=\infty\}}\big]\\
&= \sum_{n=0}^{\infty} E\big[E[X_\infty|\mathcal{F}_n] \mathbf{1}_{A \cap \{T=n\}}\big] + E\big[X_\infty \mathbf{1}_{A \cap \{T=\infty\}}\big]\\
&= \sum_{n=0}^{\infty} E\big[X_n \mathbf{1}_{A \cap \{T=n\}}\big] + E\big[X_\infty \mathbf{1}_{A \cap \{T=\infty\}}\big]\\
&= E[X_T \mathbf{1}_A].
\end{align*}
This proves that $X_T = E[X_\infty|\mathcal{F}_T]$.


It follows that E[XT ] = E[X∞ ]. We already know that E[Xn ] = E[E[X∞ |Fn ]] = E[X∞ ].
Finally, if S ≤ T , then since FS ⊂ FT , we have

XS = E[X∞ |FS ] = E[E[X∞ |FT ]|FS ] = E[XT |FS ]

and the proof is finished.

Exercise 3.52 Let X be a uniformly integrable martingale. Let T be a stopping time. Prove
that XT ∧n converges almost surely and in L1 to XT .

3.12 Backward martingales


The name backward martingales may be misleading. We are not going to run time backwards
but rather allow time to be unbounded below.

Definition 3.38 Let (Ω, A, P) be a probability space. A backward filtration is a non-decreasing


sequence (Fn )n≤0 of sub-σ-fields of A:

. . . ⊂ F−n−1 ⊂ F−n ⊂ . . . ⊂ F−1 ⊂ F0 .

A backward martingale on the backward filtered probability space (Ω, A, (Fn )n≤0 , P) is a sequence
X = (Xn )n≤0 such that for all n ≤ 0, Xn is integrable and Fn -measurable, and

E[Xn |Fn−1 ] = Xn−1 .

Having time unbounded below instead of unbounded above makes a huge difference.

Lemma 3.39 A backward martingale is uniformly integrable.

Proof. Indeed, we have for all n ≤ 0 the equality Xn = E[X0 |Fn ], and the result is a consequence
of Corollary 3.31.

It is thus not too surprising that the problem of convergence of backward martingales is
simpler than that of martingales.

Theorem 3.40 Let X be a backward martingale. Set $\mathcal{F}_{-\infty} = \bigcap_{n\le 0} \mathcal{F}_n$. Then, as n tends to $-\infty$, $X_n$ converges almost surely and in $L^1$ to $E[X_0|\mathcal{F}_{-\infty}]$.
Proof. The proof that X converges almost surely is the same as that for usual martingales.
Indeed, for all N ≤ 0, the sequence XN , XN +1 , . . . , X0 is a usual martingale and for all a < b,
the number of upcrossings of this martingale can be estimated by Doob’s upcrossing lemma.
As N tends to −∞, this number of upcrossings converges to the number of upcrossings of the
full backward martingale X, and turns out to be finite almost surely for all a < b. Thus, Xn
converges almost surely as n tends to −∞, towards a random variable which we denote by X−∞ .
Since, by the preceding lemma, X is uniformly integrable, it also converges to X−∞ in L1 .
The random variable $X_{-\infty}$ is $\mathcal{F}_n$-measurable for each n ≤ 0, because it is the limit of the sequence $X_n, X_{n-1}, X_{n-2}, \ldots$. Thus, $X_{-\infty}$ is $\mathcal{F}_{-\infty}$-measurable. Now if A belongs to $\mathcal{F}_{-\infty}$, then the $L^1$ convergence of $X_n$ to $X_{-\infty}$ implies
\[
\int_A X_{-\infty} \, dP = \lim_{n\to-\infty} \int_A X_n \, dP = \lim_{n\to-\infty} \int_A E[X_0|\mathcal{F}_n] \, dP = \lim_{n\to-\infty} \int_A X_0 \, dP = \int_A X_0 \, dP,
\]
because A belongs to $\mathcal{F}_n$ for all n ≤ 0. Hence, $X_{-\infty} = E[X_0|\mathcal{F}_{-\infty}]$ and the proof is finished.

This result of convergence has several spectacular consequences, including a proof of the
strong law of large numbers. This proof however requires some preparatory results.
Lemma 3.41 Let $X, Y_1, \ldots, Y_n$ and $X', Y_1', \ldots, Y_n'$ be random variables. Assume that the random vectors $(X, Y_1, \ldots, Y_n)$ and $(X', Y_1', \ldots, Y_n')$ have the same distribution. Assume that X and X′ are integrable. Assume finally that $h : \mathbb{R}^n \to \mathbb{R}$ is a measurable function such that
\[
E[X|\sigma(Y_1, \ldots, Y_n)] = h(Y_1, \ldots, Y_n).
\]
Then
\[
E[X'|\sigma(Y_1', \ldots, Y_n')] = h(Y_1', \ldots, Y_n'),
\]
with the same function h.
Proof. Let B be a Borel subset of $\mathbb{R}^n$. The assumptions imply that
\[
\int_\Omega X \mathbf{1}_B(Y_1, \ldots, Y_n) \, dP = \int_\Omega h(Y_1, \ldots, Y_n) \mathbf{1}_B(Y_1, \ldots, Y_n) \, dP.
\]
Denoting by μ the distribution of $(X, Y_1, \ldots, Y_n)$ (which is a probability measure on $\mathbb{R}^{n+1}$) and by ν the distribution of $(Y_1, \ldots, Y_n)$ (which is a probability measure on $\mathbb{R}^n$), this rewrites as
\[
\int_{\mathbb{R}^{n+1}} x \mathbf{1}_B(y_1, \ldots, y_n) \, \mu(dx, dy_1, \ldots, dy_n) = \int_{\mathbb{R}^n} h(y_1, \ldots, y_n) \mathbf{1}_B(y_1, \ldots, y_n) \, \nu(dy_1, \ldots, dy_n).
\]
Since $(X, Y_1, \ldots, Y_n)$ and $(X', Y_1', \ldots, Y_n')$ have the same distribution, the last equality also says that
\[
\int_\Omega X' \mathbf{1}_B(Y_1', \ldots, Y_n') \, dP = \int_\Omega h(Y_1', \ldots, Y_n') \mathbf{1}_B(Y_1', \ldots, Y_n') \, dP,
\]
that is,
\[
E[X'|\sigma(Y_1', \ldots, Y_n')] = h(Y_1', \ldots, Y_n')
\]
as expected.

This lemma has the following consequence.

Lemma 3.42 Let $(X_n)_{n\ge 1}$ be an i.i.d. sequence of integrable random variables. For all n ≥ 1, set $S_n = X_1 + \ldots + X_n$. Then for all n ≥ 1 and all $k \in \{1, \ldots, n\}$,
\[
E[X_k|S_n] = \frac{S_n}{n} = E[X_k|\sigma(S_n, S_{n+1}, S_{n+2}, \ldots)].
\]
Proof. For all $k, l \in \{1, \ldots, n\}$, the vectors $(X_k, S_n)$ and $(X_l, S_n)$ have the same distribution. Indeed, assuming k < l, the vector $(X_1, \ldots, X_n)$ and the vector obtained from it by exchanging $X_k$ and $X_l$ have the same distribution, so that $(X_k, S_n)$ and $(X_l, S_n)$, which are respectively obtained from these two vectors by applying the function
\[
(x_1, \ldots, x_n) \mapsto (x_k, x_1 + \ldots + x_n),
\]
also have the same distribution.
Let $h : \mathbb{R} \to \mathbb{R}$ be a measurable function such that $E[X_1|S_n] = h(S_n)$. By the previous lemma, we have $E[X_k|S_n] = h(S_n)$ for all $k \in \{1, \ldots, n\}$. Hence,
\[
S_n = E[S_n|S_n] = \sum_{k=1}^{n} E[X_k|S_n] = n\, h(S_n),
\]
so that $h(S_n) = \frac{S_n}{n}$. This proves the first equality.


In order to prove the second, observe that $\sigma(S_n, S_{n+1}, \ldots) = \sigma(S_n, X_{n+1}, X_{n+2}, \ldots)$. We need to prove that for every event $A \in \sigma(S_n, X_{n+1}, X_{n+2}, \ldots)$, we have
\[
E[X_k \mathbf{1}_A] = \frac{1}{n} E[S_n \mathbf{1}_A].
\]
Let us denote by $\mathcal{C}$ the class of events $A \in \mathcal{A}$ for which this equality holds. It is a λ-system. Let us consider an event $B \in \sigma(S_n)$ and an event $C \in \sigma(X_{n+1}, X_{n+2}, \ldots)$. Since the σ-fields $\sigma(X_k, S_n)$ and $\sigma(X_{n+1}, X_{n+2}, \ldots)$ are independent, we have
\[
E[X_k \mathbf{1}_B \mathbf{1}_C] = E[X_k \mathbf{1}_B] E[\mathbf{1}_C] = \frac{1}{n} E[S_n \mathbf{1}_B] E[\mathbf{1}_C] = \frac{1}{n} E[S_n \mathbf{1}_B \mathbf{1}_C].
\]
Hence, the λ-system $\mathcal{C}$ contains the π-system of all events of the form B ∩ C with $B \in \sigma(S_n)$ and $C \in \sigma(X_{n+1}, X_{n+2}, \ldots)$. Thus, by the monotone class theorem, it contains $\sigma(S_n, X_{n+1}, X_{n+2}, \ldots)$ and the second equality is proved.

Theorem 3.43 (Strong law of large numbers) Let $(X_n)_{n\ge 1}$ be an i.i.d. sequence of integrable random variables. Then as n tends to infinity,
\[
\frac{X_1 + \ldots + X_n}{n} \longrightarrow E[X_1]
\]
almost surely and in $L^1$.

Proof. Set $S_0 = 0$ and, for all n ≥ 1, $S_n = X_1 + \ldots + X_n$. For all n ≥ 0, set $\mathcal{F}_{-n} = \sigma(S_n, S_{n+1}, \ldots) = \sigma(S_k : k \ge n)$. Then $(\mathcal{F}_n)_{n\le 0}$ is a backward filtration. For each n ≥ 1, we have
\[
E\Big[\frac{S_n}{n} \,\Big|\, \mathcal{F}_{-n-1}\Big] = \frac{1}{n} \sum_{k=1}^{n} E[X_k|\mathcal{F}_{-n-1}] = \frac{1}{n} \sum_{k=1}^{n} E[X_k|S_{n+1}, S_{n+2}, \ldots] = \frac{S_{n+1}}{n+1}.
\]

In other words, $\big(\frac{S_{-n}}{-n}\big)_{n\le -1}$ is a backward martingale with respect to the backward filtration $(\mathcal{F}_n)_{n\le 0}$. This implies that the sequence $\big(\frac{S_n}{n}\big)_{n\ge 1}$ converges almost surely and in $L^1$. Its limit is
\[
Y = \lim_{n\to\infty} \frac{X_1 + \ldots + X_n}{n},
\]
but for each $n_0 \ge 1$, it is also
\[
Y = \lim_{n\to\infty} \frac{X_{n_0} + \ldots + X_n}{n}
\]
almost surely, so that Y is measurable with respect to $\sigma(X_n : n \ge n_0)$. Since this holds for all $n_0 \ge 1$, the random variable Y is measurable with respect to the tail σ-field
\[
\bigcap_{n\ge 1} \sigma(X_k : k \ge n)
\]
and Kolmogorov's 0-1 law asserts that this σ-field is trivial. Hence, the limit Y is a constant. Since $E[Y] = E[X_1]$, this constant must be $E[X_1]$.
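A sanity check of the theorem by simulation — the exponential distribution below is an arbitrary choice of integrable law:

import numpy as np

rng = np.random.default_rng(2)

# i.i.d. integrable random variables: exponential with mean E[X_1] = 1.
n = 100_000
X = rng.exponential(scale=1.0, size=n)
running_mean = np.cumsum(X) / np.arange(1, n + 1)

# The running mean settles at E[X_1] = 1, as the strong law predicts.
print(running_mean[[99, 9_999, 99_999]])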

4 Markov chains
4.1 Introduction
Just as martingales, Markov chains are a class of stochastic processes whose definition involves
conditional expectation in a crucial way. However, in contrast with martingales, which are a
technical tool for the study of dynamical systems, Markov chains are a model for actual random
phenomena. The class of Markov chains, or more generally Markov processes, has the main
features of a good and successful model, namely: simplicity and versatility. It is mathematically
simple enough to be studied in great detail, and allows one to give interesting models of a whole
range of real situations.
Both points should however be nuanced. On one hand, the general theory of Markov pro-
cesses is a very sophisticated theory, of which we are going to study a few fundamental ideas in
the simplest interesting situation. On the other hand, Markovian models have their limits when
applied to reality, and must often be enhanced to fit any particular actual random phenomenon.
The general idea of Markov processes is that of a random motion in a so-called state space,
that we will denote by E. This random motion can either be thought of as the random motion of
a particle in the space E, or as the random evolution in time of the state of a system, each point
of E describing such a state. The second point of view is more general than the first, since the
position of a particle is a special case of the state of a system. In any case, the main defining
property of the random motion, or random evolution, is that it is memoryless, in the sense that
the way it moves, or evolves, immediately after a given instant of time depends on the history of
the movement, or evolution, only through its present location, or state. Another way of stating
this property is to say that at every instant of time, the future evolution is independent of the
past evolution conditional on the present state.
Markov processes come in several variants, depending on the time being discrete or continu-
ous, and on the space E being discrete or continuous. We are going to study the simplest case,
where E is discrete and time is discrete. Although technically simple, this case will allow us to
meet most of the fundamental ideas of the theory of Markov processes.

4.2 First definition and first properties
In this chapter, we will call state space and denote by E an arbitrary non-empty finite or
countable set. Whenever it is necessary, this set will be endowed with the σ-field P(E) of all
subsets of E.
A Markov chain is a sequence of random variables with values in E which satisfies a certain
property which we described as absence of memory. In dealing with this property, we will make
use of conditional expectations in a way that differs slightly from the way we discussed it in the
first chapter of these notes.
Given a probability space (Ω, A, P), a sub-σ-field B of A and an event A ∈ A, we will use
the notation
P(A|B) = E[1A |B]
for the conditional probability of A given B. Let us emphasize that this is a random variable,
and not a number. In the simple case where B is the σ-field {∅, B, B c , Ω} generated by a single
event B such that 0 < P(B) < 1, we have (see Section 2.1)
\[
P(A|\mathcal{B}) = \frac{P(A \cap B)}{P(B)} \mathbf{1}_B + \frac{P(A \cap B^c)}{P(B^c)} \mathbf{1}_{B^c}
\]
almost surely. The number $\frac{P(A \cap B)}{P(B)}$ is traditionally denoted by P(A|B), and called the conditional probability of A given B, with the important precaution that it is defined only when P(B) > 0.
A situation that we will meet often is summarised in the following exercise.

Exercise 4.1 Let A be an event and Y a random variable with values in a finite or countable
set D endowed with the σ-field P(D). Suppose that we want to understand the conditional
probability P(A|σ(Y )), which incidentally is often denoted by P(A|Y ). What is usually easy to
compute is, for an element y ∈ D, the conditional probability P(A|{Y = y}), which is usually
denoted by P(A|Y = y). However, this conditional probability is defined only if P(Y = y) > 0.
Prove that
\[
P(A|Y) = \sum_{\substack{y \in D\\ P(Y=y) > 0}} P(A|Y=y)\, \mathbf{1}_{\{Y=y\}}.
\]

The point here is that we consider in the sum only those values y for which P(Y = y) > 0.

Apart from the conditional expectation, the main ingredient in the definition of a Markov
chain is the transition kernel.

Definition 4.1 A transition kernel on E is a function
\[
P : E \times E \longrightarrow [0,1], \qquad (x, y) \longmapsto p(x, y)
\]
such that
\[
\forall x \in E, \quad \sum_{y \in E} p(x, y) = 1.
\]

The value of P at a couple (x, y) is usually denoted by p(x, y), but also sometimes by P (x, y).
Informally, conditional on the system currently being in the state x, the probability of this state becoming y at the next instant of time is p(x, y).

Exercise 4.2 Check that it is equivalent to consider a transition kernel on E or to consider a
map E → Prob(E) from E into the set of all probability measures on (E, P(E)).

A transition kernel can be thought of as, and indeed, when E is finite, identified with a matrix
whose rows and columns are indexed by elements of E. The conditions on this matrix are that
it should have non-negative entries and that the sum of the entries in each row should be 1.
Just as matrices, transition kernels can be multiplied: if P and P′ are transition kernels, then $P P' : E \times E \to [0, +\infty]$ defined by
\[
\forall x, y \in E, \quad (P P')(x, y) = \sum_{z \in E} P(x, z) P'(z, y)
\]
is a transition kernel.

Exercise 4.3 Check this statement, and check that the product thus defined is associative on the
set of transition kernels, in the sense that if P, P 0 , P 00 are transition kernels, then P (P 0 P 00 ) =
(P P 0 )P 00 .

The transition kernel $I : E \times E \to [0,1]$ defined by
\[
I(x, y) = \delta_{x,y} = \begin{cases} 1 & \text{if } x = y,\\ 0 & \text{otherwise} \end{cases}
\]

is the unit element of the associative product on the set of all transition kernels. It is of course
the identity matrix if E is finite, and the natural analogue of it when E is infinite.
Given a transition kernel P, we define for every non-negative integer n a new transition kernel $P^n$ by setting
\[
P^0 = I \quad \text{and, for all } n \ge 1, \quad P^n = \underbrace{P \cdots P}_{n \text{ times}}.
\]
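In concrete terms, when E is finite a transition kernel is simply a matrix and $P^n$ is a matrix power. A minimal Python sketch — the three-state kernel below is made up for the illustration:

import numpy as np

# A made-up transition kernel on E = {0, 1, 2}.
P = np.array([[0.5, 0.5, 0.0],
              [0.1, 0.6, 0.3],
              [0.0, 0.4, 0.6]])

# Kernel axioms: non-negative entries and unit row sums.
assert (P >= 0).all() and np.allclose(P.sum(axis=1), 1.0)

# P^n is an honest matrix power, and its rows still sum to 1.
Pn = np.linalg.matrix_power(P, 10)
print(Pn, Pn.sum(axis=1))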

We can now give a first definition of a Markov chain.

Definition 4.2 (Markov chains, first definition) Let E be a non-empty, finite or countable
set endowed with the σ-field of all its subsets. Let P be a transition kernel on E. Let (Ω, A, P)
be a probability space. Let X = (Xn )n≥0 be a sequence of random variables defined on this
probability space, with values in E.
The sequence X is a Markov chain on E with transition kernel P if the following condition
holds:
∀n ≥ 0, ∀y ∈ E, P(Xn+1 = y|X0 , . . . , Xn ) = p(Xn , y). (3)

According to the result of Exercise 4.1, this condition can be written in a slightly longer
but more elementary way as follows: for all n ≥ 0, all y ∈ E and all x0 , . . . , xn ∈ E such that
P(X0 = x0 , . . . , Xn = xn ) > 0,

P(Xn+1 = y|X0 = x0 , . . . , Xn = xn ) = p(xn , y).

Let us immediately give an equivalent characterisation of Markov chains.

Proposition 4.3 With the notation of Definition 4.2, X is a Markov chain on E with transition
kernel P if and only if the following condition holds:

∀n ≥ 0, ∀x0 , . . . , xn ∈ E, P(X0 = x0 , . . . , Xn = xn ) = P(X0 = x0 )p(x0 , x1 ) . . . p(xn−1 , xn ). (4)

Proof. Let us start with the ‘only if’ part, that is, the implication ⇒. We prove (4) by induction
on n. For n = 0, it reduces to P(X0 = x0 ) = P(X0 = x0 ), which is true. Let us now assume
that (4) has been proved up to rank n − 1 for some n ≥ 1 and let us consider x0 , . . . , xn in E. If
P(X0 = x0 , . . . , Xn−1 = xn−1 ) = 0, then both sides of (4) vanish, the left-hand side because it
is the probability of an event included in a negligible event, and the right-hand side because the
product of all factors except the last, being equal by induction to P(X0 = x0 , . . . , Xn−1 = xn−1 ),
is equal to 0.
If, on the other hand, $P(X_0 = x_0, \ldots, X_{n-1} = x_{n-1}) > 0$, then
\begin{align*}
P(X_0 = x_0, \ldots, X_n = x_n) &= P(X_0 = x_0, \ldots, X_{n-1} = x_{n-1})\, P(X_n = x_n|X_0 = x_0, \ldots, X_{n-1} = x_{n-1})\\
&= P(X_0 = x_0)\, p(x_0, x_1) \ldots p(x_{n-2}, x_{n-1})\, p(x_{n-1}, x_n),
\end{align*}
as expected.
Let us now prove the ‘if’ part, that is, the implication ⇐. According to the remark made after the definition of Markov chains, it suffices to consider $x_0, \ldots, x_n, y$ in E such that $P(X_0 = x_0, \ldots, X_n = x_n) > 0$ and compute $P(X_{n+1} = y|X_0 = x_0, \ldots, X_n = x_n)$. Thanks to (4), this conditional probability is equal to
\[
\frac{P(X_0 = x_0)\, p(x_0, x_1) \ldots p(x_{n-1}, x_n)\, p(x_n, y)}{P(X_0 = x_0)\, p(x_0, x_1) \ldots p(x_{n-1}, x_n)} = p(x_n, y),
\]
and this finishes the proof.

It follows from this proposition that if X is a Markov chain on E with transition kernel P ,
then
∀n ≥ 0, ∀y ∈ E, P(Xn+1 = y|Xn ) = p(Xn , y). (5)
However, it is important to realise that this property is much weaker than the property which
defines Markov chains. Exercise 4.5 below gives an example of a process that is not a Markov
chain but satisfies (5).

Exercise 4.4 Check the fact that (5) holds for a Markov chain.

Exercise 4.5 Consider the state space E = {a, b, c}. Define a random variable Z with values in E such that $P(Z = b) = P(Z = c) = \frac{1}{2}$. Define a sequence $X = (X_n)_{n\ge 0}$ of random variables with values in E such that
\[
\forall n \ge 0, \quad X_n = \begin{cases} a & \text{if } n \text{ is even},\\ Z & \text{if } n \text{ is odd}. \end{cases}
\]
Prove that there exists a unique transition kernel P such that X satisfies the property (5), and prove that X is not a Markov chain with transition kernel P.

Let us give two simple consequences of this characterisation of Markov chains.
Proposition 4.4 Let X be a Markov chain on E with transition kernel P. For all n ≥ 0 and all y ∈ E, one has
\[
P(X_n = y|X_0) = P^n(X_0, y).
\]
Proof. We need to prove that for every element $x_0$ of E such that $P(X_0 = x_0) > 0$, the equality $P(X_n = y|X_0 = x_0) = P^n(x_0, y)$ holds. But for such an $x_0$, we have
\begin{align*}
P(X_0 = x_0)\, P(X_n = y|X_0 = x_0) &= P(X_0 = x_0, X_n = y)\\
&= \sum_{x_1, \ldots, x_{n-1} \in E} P(X_0 = x_0, \ldots, X_{n-1} = x_{n-1}, X_n = y)\\
&= P(X_0 = x_0) \sum_{x_1, \ldots, x_{n-1} \in E} p(x_0, x_1) \ldots p(x_{n-1}, y)\\
&= P(X_0 = x_0)\, P^n(x_0, y),
\end{align*}
as expected.

Exercise 4.6 Let X be a Markov chain on E with transition kernel P. Check that for every integer $\ell \ge 0$, the sequence $(X_{n\ell})_{n\ge 0}$ is a Markov chain on E with transition kernel $P^\ell$.
Proposition 4.5 Let X be a Markov chain on E with transition kernel P . Let N ≥ 0 be an
integer. For every n ≥ 0, set Yn = XN +n . Then Y = (Yn )n≥0 is a Markov chain on E with
transition kernel P .
Proof. Let us consider n ≥ 0 and $y_0, \ldots, y_n \in E$. Firstly, we know from the previous proposition that
\begin{align*}
P(Y_0 = y_0) &= \sum_{x_0 \in E} P(X_0 = x_0, Y_0 = y_0)\\
&= \sum_{x_0 \in E} P(X_0 = x_0, X_N = y_0)\\
&= \sum_{x_0 \in E} P(X_0 = x_0)\, P^N(x_0, y_0).
\end{align*}
Secondly,
\begin{align*}
P(Y_0 = y_0, \ldots, Y_n = y_n) &= P(X_N = y_0, \ldots, X_{N+n} = y_n)\\
&= \sum_{x_0, \ldots, x_{N-1} \in E} P(X_0 = x_0, \ldots, X_{N-1} = x_{N-1}, X_N = y_0, \ldots, X_{N+n} = y_n)\\
&= \sum_{x_0, \ldots, x_{N-1} \in E} P(X_0 = x_0)\, p(x_0, x_1) \ldots p(x_{N-1}, y_0)\, p(y_0, y_1) \ldots p(y_{n-1}, y_n)\\
&= \sum_{x_0 \in E} P(X_0 = x_0)\, P^N(x_0, y_0)\, p(y_0, y_1) \ldots p(y_{n-1}, y_n)\\
&= P(Y_0 = y_0)\, p(y_0, y_1) \ldots p(y_{n-1}, y_n),
\end{align*}
from which we see that Y is a Markov chain with transition kernel P.

Exercise 4.7 Check that for a Markov chain X with transition kernel P , one has, for all
integers 0 ≤ n ≤ m and all y ∈ E

P(Xm = y|Xn ) = P m−n (Xn , y).

Let us conclude this first section by giving a few examples of Markov chains.
• Independent random variables. Let X = (Xn )n≥0 be a sequence of i.i.d. random
variables with values in E, with common distribution µ. Then X is a Markov chain with
transition kernel P given by

∀x, y ∈ E, P (x, y) = µ(y),

where µ(y) is a notation for µ({y}).

• Random walks on Zd . Let d ≥ 1 be an integer. Let µ be a probability measure on


Zd . Let (ξi )i≥1 be an i.i.d. sequence of random variables with values in Zd and common
distribution µ. Let X0 be a random variable with values in Zd independent of (ξi : i ≥ 1).
For every n ≥ 1, set
Xn = X0 + ξ1 + . . . + ξn .
Then X = (Xn )n≥0 is a Markov chain on Zd with transition kernel P given by

∀x, y ∈ E, P (x, y) = µ(y − x).

This Markov chain is called the random walk on Zd with jump distribution µ.
Let $(e_1, \ldots, e_d)$ denote the canonical basis of $\mathbb{Z}^d$. In the special case where
\[
\mu = \frac{1}{2d} \sum_{i=1}^{d} (\delta_{e_i} + \delta_{-e_i}),
\]

the random walk is called the simple random walk on Zd . The simple random walk is one
of the most classical objects of the theory of probability.

• Random walk on a graph. Let A be a subset of the set P2 (E) of pairs of elements of
E. We think of E as the set of vertices of a graph, and of each element {x, y} ∈ A as an
unoriented edge joining the vertices x and y.
For each x ∈ E, we define the set Nx of neighbours of x in the graph (E, A) by

Nx = {y ∈ E : {x, y} ∈ A}.

We make the assumption that for all x ∈ E, the set Nx is non-empty and finite:

∀x ∈ E, 0 < |Nx | < ∞.

The random walk on the graph (E, A) is the Markov chain with transition kernel P given by
\[
\forall x, y \in E, \quad P(x, y) = \begin{cases} \frac{1}{|N_x|} & \text{if } y \in N_x,\\ 0 & \text{otherwise.} \end{cases}
\]
We have not yet proved that such a Markov chain exists; we will do so soon.

• Branching processes. Recall from Section 3.6 the definition of a branching process.
With the notation used there, the sequence (Xn )n≥0 is a Markov chain on N with transition
kernel
∀n, m ∈ N, P (n, m) = µ∗n (m),
where µ is the reproduction law of the branching process, that is, the common distribution
of all the variables (Zn,k )n,k≥0 , and for every integer n ≥ 0, µ∗n is the n-th convolution
power of µ, that is, the distribution of the sum of n independent random variables with
distribution µ, for example Z0,1 + . . . + Z0,n . In particular, µ∗0 = δ0 .

Exercise 4.8 Check that a sequence of i.i.d. random variables, a random walk on Zd , a branch-
ing process, are indeed Markov chains with the claimed kernels.
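As a complement to this exercise, here is how the random-walk-on-a-graph kernel can be assembled directly from the neighbour sets — a Python sketch on a made-up four-vertex graph:

import numpy as np

# A made-up graph on E = {0, 1, 2, 3}: the path 0-1-2-3 plus the edge 1-3.
E = range(4)
edges = [(0, 1), (1, 2), (2, 3), (1, 3)]
neighbours = {x: {y for (a, b) in edges for y in (a, b)
                  if x in (a, b) and y != x} for x in E}

# P(x, y) = 1/|N_x| if y is a neighbour of x, and 0 otherwise.
P = np.array([[1.0 / len(neighbours[x]) if y in neighbours[x] else 0.0
               for y in E] for x in E])
assert np.allclose(P.sum(axis=1), 1.0)  # it is indeed a transition kernel
print(P)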

4.3 Construction of Markov chains


In this section, we prove that on any finite or countable state space, any transition kernel is the
transition kernel of a Markov chain.

Proposition 4.6 Let E be a finite or countable set. Let P be a transition kernel on E. Let x0
be an element of E. There exists a probability space (Ω, A, P) and a sequence X = (Xn )n≥0 of
E-valued random variables defined on this probability space such that X0 = x0 a.s. and X is a
Markov chain with transition kernel P .

In order to construct such a Markov chain, we will need a source of randomness, which will be provided by the next proposition.

Proposition 4.7 There exists a probability space (Ω, A, P) and a sequence U = (Un )n≥0 of i.i.d.
random variables with common distribution equal to the uniform distribution on the interval
[0, 1].

It may be that you have always taken this existence result for granted: in this case, you
should take a moment to think about the fact that it actually needs a proof.

Proof. Take $(\Omega, \mathcal{A}, P) = ([0,1], \mathcal{B}_{[0,1]}, \lambda)$, where λ is the Lebesgue measure. For all n ≥ 1, define a random variable $B_n$ on this probability space by setting
\[
\forall t \in [0,1], \quad B_n(t) = \lfloor 2^n t\rfloor - 2\lfloor 2^{n-1} t\rfloor = \sum_{k=0}^{2^{n-1}-1} \mathbf{1}_{[\frac{2k+1}{2^n}, \frac{2k+2}{2^n})}(t).
\]

Then $(B_n)_{n\ge 1}$ is an i.i.d. sequence with Bernoulli distribution of parameter $\frac{1}{2}$:
\[
\forall n \ge 1, \quad P(B_n = 0) = P(B_n = 1) = \frac{1}{2}.
\]
Now, for all n ≥ 1, let $p_n$ denote the n-th prime number, so that $p_1 = 2$, $p_2 = 3$, $p_3 = 5$ and so on. For all n ≥ 0, define
\[
U_n = \sum_{k=1}^{\infty} 2^{-k} B_{p_{n+1}^k}.
\]

Then, by computing for instance its characteristic function, one checks that Un is uniformly dis-
tributed on [0, 1] for all n ≥ 0. Moreover, each variable Un is built from a subset of the variables
(Br )r≥0 that is disjoint from the subset used to build all the other Um ’s. Thus, (Un )n≥0 is an
independent sequence of random variables.

We now turn to the proof of the proposition.

Proof of Proposition 4.6. Take (Ω, A, P) = ([0, 1], B[0,1] , λ) as before and consider a sequence
(Un )n≥0 of i.i.d. random variables with uniform distribution on [0, 1].
We claim that there exists a function F : E × [0, 1] → E with the property that

∀x, y ∈ E, λ({t ∈ [0, 1] : F (x, t) = y}) = p(x, y).

In order to build such a function, we order E and write its elements as $E = \{y_1, y_2, \ldots\}$ in an arbitrary fashion. Now choose $x \in E$ and $t \in [0,1)$. There exists a unique integer k ≥ 1 such that
\[
\sum_{l=1}^{k-1} p(x, y_l) \le t < \sum_{l=1}^{k} p(x, y_l)
\]
and we define
\[
F(x, t) = y_k.
\]
With this definition,
\[
\{t \in [0,1) : F(x, t) = y_k\} = \Big[\sum_{l=1}^{k-1} p(x, y_l), \sum_{l=1}^{k} p(x, y_l)\Big)
\]
is indeed a subset of [0,1] with Lebesgue measure $p(x, y_k)$.


Let us now define $X_0 = x_0$ and inductively, for all n ≥ 0,
\[
X_{n+1} = F(X_n, U_n).
\]
Let $k_0$ be the integer such that $x_0 = y_{k_0}$. Then, for all n ≥ 1 and all $k_1, \ldots, k_n \ge 1$,
\begin{align*}
P(X_0 = y_{k_0}, X_1 = y_{k_1}, \ldots, X_n = y_{k_n}) &= \prod_{m=1}^{n} \lambda\Big(\Big[\sum_{l=1}^{k_m - 1} p(y_{k_{m-1}}, y_l), \sum_{l=1}^{k_m} p(y_{k_{m-1}}, y_l)\Big)\Big)\\
&= p(y_{k_0}, y_{k_1}) \ldots p(y_{k_{n-1}}, y_{k_n})
\end{align*}
and X is a Markov chain issued from $y_{k_0} = x_0$ with transition kernel P.
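The function F of this proof is the familiar inverse-CDF sampler: partition [0,1) into consecutive intervals of lengths $p(x, y_1), p(x, y_2), \ldots$ and read off the interval in which the uniform variable lands. A Python sketch of the construction — the kernel is again made up:

import numpy as np

rng = np.random.default_rng(3)

# A made-up transition kernel on E = {0, 1, 2}.
P = np.array([[0.5, 0.5, 0.0],
              [0.1, 0.6, 0.3],
              [0.0, 0.4, 0.6]])

def F(x, t):
    # Unique k with sum_{l<k} p(x,l) <= t < sum_{l<=k} p(x,l) (0-based states).
    return int(np.searchsorted(np.cumsum(P[x]), t, side='right'))

# X_{n+1} = F(X_n, U_n) with i.i.d. uniforms U_n: a Markov chain with kernel P.
x, path = 0, [0]
for U in rng.random(20):
    x = F(x, U)
    path.append(x)
print(path)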

We would like now to take a slightly different point of view on Markov chains. So far, we thought of a Markov chain as a sequence of random variables with values in E, and we would like to become familiar with the idea that it is, or at least can be thought of as, a single random variable with values in the space $E^{\mathbb{N}}$ of sequences of elements of E. Informally, this amounts to realising that $X = (X_n)_{n\ge 0}$ is really a function of two variables, namely $(n, \omega) \mapsto X_n(\omega)$, and changing the priority between the two variables n and ω.

In order to make this idea more precise, we need to define a measurable space with underlying set $E^{\mathbb{N}}$. Concretely, we need to endow the set
\[
E^{\mathbb{N}} = \{\omega = (\omega_i)_{i\ge 0} : \forall i \ge 0, \omega_i \in E\}
\]
with a σ-field. For this, we start by listing natural functions on this set that we would want to be measurable.

Definition 4.8 Let E be a finite or countable set. On the set $E^{\mathbb{N}}$, which in the present context is called the canonical space, we define for each n ≥ 0 the function
\[
\widehat{X}_n : E^{\mathbb{N}} \longrightarrow E, \qquad \omega = (\omega_i)_{i\ge 0} \longmapsto \widehat{X}_n(\omega) = \omega_n.
\]
The collection $\widehat{X} = (\widehat{X}_n)_{n\ge 0}$ of E-valued functions on $E^{\mathbb{N}}$ is called the canonical process.

We think of an element ω of $E^{\mathbb{N}}$ as the complete record of all the successive positions of our particle (or states of our system), from the origin to the end of times. Then, $\widehat{X}_n(\omega)$ is the position at time n of our particle.
In the next definition, recall that the set E is endowed with the σ-field P(E).

Definition 4.9 For every n ≥ 0, we define on $E^{\mathbb{N}}$ the σ-field
\[
\mathcal{C}_n = \sigma(\widehat{X}_0, \ldots, \widehat{X}_n).
\]
The elements of the π-system
\[
\bigcup_{n=0}^{\infty} \mathcal{C}_n
\]
are called cylinders on $E^{\mathbb{N}}$, and the σ-field
\[
\mathcal{C} = \sigma\Big(\bigcup_{n=0}^{\infty} \mathcal{C}_n\Big) = \sigma(\widehat{X}_n : n \ge 0)
\]
is called the cylinder σ-field on $E^{\mathbb{N}}$.

In the present context, $\mathcal{C}$ turns out to be the natural σ-field on $E^{\mathbb{N}}$. This means, in English, and without being too precise, that the functions of a trajectory that we want to consider (and to call measurable) are those which are expressible in terms of finitely many successive positions of the trajectory, or which can be approximated in an appropriate sense by such functions.
It is a useful observation that for all n ≥ 0, the σ-field $\mathcal{C}_n$ is generated by the partition
\begin{align*}
E^{\mathbb{N}} &= \bigsqcup_{x_0, \ldots, x_n \in E} \{\widehat{X}_0 = x_0, \ldots, \widehat{X}_n = x_n\}\\
&= \bigsqcup_{x_0, \ldots, x_n \in E} \{\omega \in E^{\mathbb{N}} : \omega_0 = x_0, \ldots, \omega_n = x_n\}\\
&= \bigsqcup_{x_0, \ldots, x_n \in E} \{x_0\} \times \ldots \times \{x_n\} \times E^{\mathbb{N}\setminus\{0,\ldots,n\}}.
\end{align*}

Lemma 4.10 A map f from a measurable space (Ω, A) into $(E^{\mathbb{N}}, \mathcal{C})$ is measurable if and only if for all n ≥ 0, the map $\widehat{X}_n \circ f$ is measurable from (Ω, A) to $(E, \mathcal{P}(E))$.
Proof. The ‘only if’ part of the statement is a consequence of the fact that a composition of measurable maps is measurable. Let us prove the ‘if’ part: let us assume that for all n ≥ 0, the map $\widehat{X}_n \circ f$ is measurable. Let us define the class
\[
\mathcal{I} = \{C \in \mathcal{C} : f^{-1}(C) \in \mathcal{A}\}
\]
of subsets of $E^{\mathbb{N}}$. This is sometimes called the image σ-field² of $\mathcal{A}$ by f. It is indeed a σ-field (check it!) and it is the largest sub-σ-field of $\mathcal{C}$ that makes f measurable. Of course, we want to show that $\mathcal{I} = \mathcal{C}$. For this, it suffices to show that $\mathcal{C}_n \subset \mathcal{I}$ for all n ≥ 0. According to the observation made just before stating the present lemma, it is enough to prove that for all n ≥ 0 and all $x_0, \ldots, x_n \in E$, the set $\{\widehat{X}_0 = x_0, \ldots, \widehat{X}_n = x_n\}$ belongs to $\mathcal{I}$. This is indeed the case, because
\begin{align*}
f^{-1}(\{\widehat{X}_0 = x_0, \ldots, \widehat{X}_n = x_n\}) &= f^{-1}\Big(\bigcap_{i=0}^{n} \{\widehat{X}_i = x_i\}\Big)\\
&= \bigcap_{i=0}^{n} f^{-1}(\{\widehat{X}_i = x_i\})\\
&= \bigcap_{i=0}^{n} (\widehat{X}_i \circ f)^{-1}(\{x_i\})
\end{align*}
belongs to $\mathcal{A}$ by assumption.

We are now in possession of a filtered measurable space with an adapted process on it, namely $(E^{\mathbb{N}}, \mathcal{C}, (\mathcal{C}_n)_{n\ge 0}, \widehat{X} = (\widehat{X}_n)_{n\ge 0})$. The last lemma will allow us to connect this nice space with our more familiar notion of sequence of E-valued random variables.

Definition 4.11 Let (Ω, A, P) be a probability space on which we are given a sequence $X = (X_n)_{n\ge 0}$ of E-valued random variables. The trajectory map of this stochastic process is the map
\[
X : \Omega \longrightarrow E^{\mathbb{N}}, \qquad \omega \longmapsto (X_n(\omega))_{n\ge 0}.
\]

A technical remark is in order at this point. Each random variable $X_n$ is really an equivalence class of functions, any two of which coincide outside a P-negligible subset of Ω. Because we are considering a countable collection of random variables, the map X is indeed well defined P-almost surely, in the sense that two distinct choices of a sequence of representatives of the sequence of random variables $(X_n)_{n\ge 0}$ yield two functions X that may be distinct, but coincide outside a P-negligible subset of Ω. This is one respect in which the theory of continuous time random processes is technically more demanding than its discrete time analogue.
It follows immediately from Lemma 4.10 and from the identity
\[
\forall n \ge 0, \quad X_n = \widehat{X}_n \circ X
\]
that the trajectory map is measurable with respect to the σ-fields $\mathcal{A}$ and $\mathcal{C}$.
² More precisely, $\mathcal{I}$ is the intersection of $\mathcal{C}$ and the image σ-field of $\mathcal{A}$ by f.

Exercise 4.9 What is the trajectory map of the canonical process $(\widehat{X}_n)_{n\ge 0}$ defined on the canonical space $(E^{\mathbb{N}}, \mathcal{C})$?

We can now give an upgraded and more canonical version of Proposition 4.6.

Theorem 4.12 Let E be a finite or countable set. Let P be a transition kernel on E. Let x be an element of E. There exists on the measurable space $(E^{\mathbb{N}}, \mathcal{C})$ a unique probability measure $\widehat{P}_x$ such that $\widehat{X}_0 = x$ $\widehat{P}_x$-a.s. and under $\widehat{P}_x$, the canonical process $\widehat{X} = (\widehat{X}_n)_{n\ge 0}$ is a Markov chain on E with transition kernel P.

Proof. To prove the existence of $\widehat{P}_x$, let us apply Proposition 4.6 to find a probability space (Ω, A, P) and a Markov chain $X = (X_n)_{n\ge 0}$ defined on this space, issued from x and with transition kernel P. Let us consider the trajectory map X of this process and define
\[
\widehat{P}_x = P \circ X^{-1}.
\]
In English, $\widehat{P}_x$ is the law of the trajectory of the process $X = (X_n)_{n\ge 0}$. Let us check that $(\widehat{X}_n)_{n\ge 0}$ is under $\widehat{P}_x$ a Markov chain on E issued from x with transition kernel P. For this, let us choose n ≥ 0 and $x_0, \ldots, x_n$ in E. According to Proposition 4.3, we must prove that
\[
\widehat{P}_x(\widehat{X}_0 = x_0, \ldots, \widehat{X}_n = x_n) = \delta_{x,x_0}\, p(x_0, x_1) \ldots p(x_{n-1}, x_n).
\]
The following computation is elementary in that it consists exclusively in unfolding definitions:
\begin{align*}
\widehat{P}_x(\widehat{X}_0 = x_0, \ldots, \widehat{X}_n = x_n) &= P\big(X^{-1}(\{\widehat{X}_0 = x_0, \ldots, \widehat{X}_n = x_n\})\big)\\
&= P\big(X^{-1}(\{\widehat{X}_0 = x_0\}) \cap \ldots \cap X^{-1}(\{\widehat{X}_n = x_n\})\big)\\
&= P\big((\widehat{X}_0 \circ X)^{-1}(\{x_0\}) \cap \ldots \cap (\widehat{X}_n \circ X)^{-1}(\{x_n\})\big)\\
&= P\big(X_0^{-1}(\{x_0\}) \cap \ldots \cap X_n^{-1}(\{x_n\})\big)\\
&= P(X_0 = x_0, \ldots, X_n = x_n)\\
&= \delta_{x,x_0}\, p(x_0, x_1) \ldots p(x_{n-1}, x_n),
\end{align*}
as expected³.
Let us turn to the uniqueness of $\widehat{P}_x$. Suppose that $\widehat{Q}_x$ is a probability measure on $(E^{\mathbb{N}}, \mathcal{C})$ under which the canonical process is a Markov chain on E issued from x and with transition kernel P. Then for all n ≥ 0 and all $x_0, \ldots, x_n \in E$, we have
\[
\widehat{P}_x(\widehat{X}_0 = x_0, \ldots, \widehat{X}_n = x_n) = \widehat{Q}_x(\widehat{X}_0 = x_0, \ldots, \widehat{X}_n = x_n).
\]
According to the remark made just before the statement of Lemma 4.10, this implies that for all n ≥ 0, the probability measures $\widehat{P}_x$ and $\widehat{Q}_x$ agree on $\mathcal{C}_n$. Hence, they agree on the class $\bigcup_{n\ge 0} \mathcal{C}_n$ of all cylinder sets. We conclude the proof by applying the monotone class theorem, as follows.
On any measurable space, the class of measurable sets on which two probability measures agree is a monotone class (also called a λ-system). Applying this general fact in our particular situation, we find that the class of all elements of $\mathcal{C}$ on which $\widehat{P}_x$ and $\widehat{Q}_x$ agree is a λ-system.
³ The last computation is a proof of the fact, which we could have used directly, that, because $\widehat{P}_x = P \circ X^{-1}$ and $(\widehat{X}_0, \ldots, \widehat{X}_n) \circ X = (X_0, \ldots, X_n)$, the random variable $(\widehat{X}_0, \ldots, \widehat{X}_n)$ has under $\widehat{P}_x$ the same distribution as the random variable $(X_0, \ldots, X_n)$ under P.

We just proved that this λ-system contains $\bigcup_{n\ge 0} \mathcal{C}_n$, which is a π-system. The monotone class theorem states that a λ-system which contains a π-system also contains the σ-field generated by this π-system. Hence, the class of sets on which $\widehat{P}_x$ and $\widehat{Q}_x$ agree contains the σ-field generated by all cylinder sets, which, by definition, is $\mathcal{C}$. Hence, $\widehat{P}_x$ and $\widehat{Q}_x$ agree on $\mathcal{C}$: they are equal.

We are now in a position to give a second, more sophisticated definition of a Markov chain.

Definition 4.13 (Markov chains, second definition) Let E be a non-empty, finite or count-
able set. Let P be a transition kernel on E. A Markov chain with transition kernel P on E is
a quintuple (Ω, A, (Fn )n≥0 , (Px )x∈E , X = (Xn )n≥0 ) in which (Ω, A, (Fn )n≥0 ) is a filtered mea-
surable space on which X = (Xn )n≥0 is an adapted E-valued process which, for all x, is under
Px , and in the sense of Definition 4.2, a Markov chain on E issued from x and with transition
kernel P .

Our discussion of the canonical space shows that every transition kernel on E is the transition
kernel of a Markov chain on E in this more sophisticated sense.
Given a Markov chain in the sense of the definition above, we define, for every probability measure µ on E, the measure $P_\mu$ by
\[
P_\mu = \int_E P_x \, d\mu(x) = \sum_{x \in E} \mu(\{x\}) P_x.
\]

Under Pµ , the process X is a Markov chain with initial distribution µ and transition kernel P .
There is a last improvement that we should make to our definition. There is a slight in-
consistency in Definition 4.13 because we are given the filtration (Fn )n≥0 on our measurable
space, and in Definition 4.2, and more precisely in Equation (3), we use the natural filtration
of the process (Xn )n≥0 . We are assuming that the process (Xn )n≥0 is adapted with respect to
the filtration (Fn )n≥0 , which means by definition that its natural filtration is included in the
filtration (Fn )n≥0 , but there is no reason why these two filtrations should be equal. This leads
us to the following, final definition of a Markov chain.

Definition 4.14 (Markov chains, third definition) Let E be a non-empty, finite or count-
able set. Let P be a transition kernel on E. A Markov chain with transition kernel P on E is a
quintuple (Ω, A, (Fn )n≥0 , (Px )x∈E , X = (Xn )n≥0 ) in which (Ω, A, (Fn )n≥0 ) is a filtered measur-
able space on which X = (Xn )n≥0 is an adapted E-valued process such that for all x, y ∈ E and
all n ≥ 0, one has
Px (Xn+1 = y|Fn ) = p(Xn , y).

It is important to see that a Markov chain in the third sense is a Markov chain in the second sense. Indeed, for a Markov chain in the third sense, we have, for all x, y ∈ E and n ≥ 0,
\begin{align*}
P_x(X_{n+1} = y|X_0, \ldots, X_n) &= E_x\big[P_x(X_{n+1} = y|\mathcal{F}_n)\,\big|\, X_0, \ldots, X_n\big] = E_x[p(X_n, y)|X_0, \ldots, X_n]\\
&= p(X_n, y).
\end{align*}

On the other hand, the ‘if’ part of Proposition 4.3 is not true any more: it is still true for a Markov chain in the third sense on E with transition matrix P that for all n ≥ 0 and all x0 , . . . , xn ∈ E one has

Px0 (X0 = x0 , . . . , Xn = xn ) = p(x0 , x1 ) . . . p(xn−1 , xn ) (6)

but knowing that this equality holds does not imply that (Xn)n≥0 is a Markov chain in the third sense: this depends indeed on the filtration (Fn)n≥0.

Exercise 4.10 Consider a Markov chain in the third sense in the situation where for all n ≥ 0,
one has Fn = σ(X0 , . . . , Xn+1 ). Prove that the transition matrix of this Markov chain can only
take the values 0 and 1.

Exercise 4.11 Consider a Markov chain in the second sense with a transition matrix which
takes at least one value in the interval (0, 1). Assume that for all n ≥ 0, the random variable
Xn+1 is measurable with respect to Fn . Prove that the process (Xn )n≥0 satisfies (6) but is not a
Markov chain in the third sense with respect to the filtration (Fn )n≥0 .

Exercise 4.12 Consider a Markov chain in the third sense. Prove that for all x ∈ E, all n, m ≥ 0
and all xn , . . . , xn+m ∈ E, we have Px -almost surely

Px (Xn = xn , . . . , Xn+m = xn+m |Fn ) = 1{Xn =xn } p(xn , xn+1 ) . . . p(xn+m−1 , xn+m ).

The distinction between the second and third definitions of Markov chains is a bit subtle, and can be ignored on first reading. In what follows, we stick to the second definition and make some comments to indicate how certain proofs should be modified to fit the third definition.

4.4 The Markov property


The essence of a Markov chain is the fact that it is memoryless, a property that can be rephrased
by saying that it starts afresh at every instant. The Markov property expresses this absence of
memory in a very effective way, and extends it to random times: we will prove that, in a certain
sense, a Markov chain starts afresh at every stopping time.
In order to articulate the Markov property, we need to introduce one last piece of structure
on the canonical space.

Definition 4.15 The shift operator on the canonical space is the map
\[
\theta : E^{\mathbb{N}} \longrightarrow E^{\mathbb{N}}, \qquad \omega = (\omega_i)_{i\ge 0} \longmapsto \theta(\omega) = (\omega_{i+1})_{i\ge 0}.
\]
We also define $\theta_0 = \mathrm{id}_{E^{\mathbb{N}}}$ and, for all n ≥ 2,
\[
\theta_n = \underbrace{\theta \circ \ldots \circ \theta}_{n \text{ times}}.
\]
For all n ≥ 0, we have simply $\theta_n(\omega) = (\omega_{n+i})_{i\ge 0}$.

Exercise 4.13 Check that for all n ≥ 0, the map θn is measurable with respect to the σ-field C.

Finally, let us introduce a classical piece of notation. Let us consider a Markov chain in the
sophisticated sense of Definition 4.13. For all x ∈ E, we naturally denote by Ex the expectation
with respect to Px . Let us extend this notation as follows. Assume that Z is an E-valued

random variable on (Ω, A). Then, for every non-negative random variable Y on (Ω, A), we define $E_Z[Y]$ by the formula
\[
E_Z[Y] = \sum_{x \in E} E_x[Y]\, \mathbf{1}_{\{Z=x\}}.
\]
This can also be written
\[
E_Z[Y] = h(Z), \quad \text{where } h(x) = E_x[Y].
\]
Let us emphasise that $E_Z[Y]$ is not a number in general, but a random variable.
We can now state the Markov property. The following statement is called the weak Markov property because it involves ‘only’ deterministic times. Immediately after proving it, we shall strengthen it into the so-called strong Markov property, which covers the case of random times.
Theorem 4.16 (Weak Markov property) Let P be a transition kernel on the state space E.
Let (Ω, A, (Fn )n≥0 , (Px )x∈E , X = (Xn )n≥0 ) be a Markov chain on E with transition kernel P .
Let $F : (E^{\mathbb{N}}, \mathcal{C}) \to (\mathbb{R}_+, \mathcal{B}_{\mathbb{R}_+})$ be a measurable non-negative function. For every integer n ≥ 0
and all x ∈ E, we have the equality
Ex [F (θn (X))|Fn ] = EXn [F (X)] Px -a.s. (7)
of non-negative random variables on (Ω, A, Px ).
Proof. The right-hand side of (7) is a function of Xn , so that it is Fn -measurable. What remains
to prove is that for every A ∈ Fn , both sides of (7) have the same Px -integral over A. Since both
sides of (7) depend linearly on F , and since any measurable positive function can be written as
the pointwise limit of an increasing sequence of simple functions, it is enough to prove (7) when
F is an indicator function, that is, F = 1C for some C ∈ C. Thus, we need to prove that
∀A ∈ Fn , ∀C ∈ C, Ex [1C (θn (X))1A ] = Ex [EXn [1C (X)]1A ]. (8)
Here we use a monotone class argument to reduce the problem further. The class of all sets C ∈ $\mathcal{C}$ such that the last equality holds is a λ-system. In order to prove that it contains $\mathcal{C}$, it suffices to prove that it contains the π-system $\bigcup_{m\ge 0} \mathcal{C}_m$, which generates the σ-field $\mathcal{C}$. Thus, it is enough to prove that the equality holds when $C \in \mathcal{C}_m$ for an arbitrary m. Finally, according
to the remark made before Lemma 4.10, it suffices to prove that for all choices of m ≥ 0,
x0 , . . . , xm ∈ E, y0 , . . . , yn ∈ E, the equality holds for
\[
A = \{X_0 = x_0, \ldots, X_n = x_n\} \quad \text{and} \quad C = \{\widehat{X}_0 = y_0, \ldots, \widehat{X}_m = y_m\}.
\]

In this case, we can compute both sides of (8). The left-hand side is equal to
Ex [1C (θn (X))1A ] = Px (X0 = x0 , . . . , Xn = xn , Xn = y0 , . . . , Xn+m = ym )
= δx,x0 p(x0 , x1 ) . . . p(xn−1 , xn )δxn ,y0 p(y0 , y1 ) . . . p(ym−1 , ym ).
The right-hand side is
Ex [EXn [1C (X)]1A ] = Ex [Exn [1C (X)]1A ]
= Ex [1A ]Exn [1C (X)]
= Px (X0 = x0 , . . . , Xn = xn )Pxn (X0 = y0 , . . . , Xm = ym )
= δx,x0 p(x0 , x1 ) . . . p(xn−1 , xn )δxn ,y0 p(y0 , y1 ) . . . p(ym−1 , ym ),
and the equality is proved.

Exercise 4.14 Prove the weak Markov property for Markov chains in the third sense (Definition
4.14). Hint: use the result of Exercise 4.12. You will see that the proof is slightly easier than
the proof above.
Let us extend the Markov property to random times. For this, we need to remember the
definition of a stopping time (see Definition 3.15) and the definition of the σ-field of events prior
to a stopping time (see Definition 3.34). We need also to define the shift by a stopping time.
For this, let us consider again a Markov chain in the sense of Definition 4.13. Let T be a stopping time on the measurable space $(\Omega, \mathcal{A}, (\mathcal{F}_n)_{n\ge 0})$. On the event $\{T < \infty\}$, we denote by $\theta_T(X)$ the map from $(\Omega, \mathcal{A})$ to $(E^{\mathbb{N}}, \mathcal{C})$ defined by
\[
(\theta_T(X))(\omega) = \theta_{T(\omega)}(X(\omega)) = (X_{T(\omega)+i}(\omega))_{i\ge 0}.
\]
Finally, let us agree on the following convention: if $h : \mathbb{N} \to \mathbb{R}$ is a function and T is a stopping time, we use the notation $h(T) \mathbf{1}_{\{T<\infty\}}$, although T might take the value ∞ and h(∞) is not defined, to indicate
\[
h(T) \mathbf{1}_{\{T<\infty\}} = \sum_{n\ge 0} h(n) \mathbf{1}_{\{T=n\}}.
\]

To be clear, this random variable vanishes on the event {T = ∞}. With all this preparation, we
can state the strong Markov property.
Theorem 4.17 (Strong Markov property) Let P be a transition kernel on the state space
E. Let (Ω, A, (Fn )n≥0 , (Px )x∈E , X = (Xn )n≥0 ) be a Markov chain on E with transition kernel
P.
Let $F : (E^{\mathbb{N}}, \mathcal{C}) \to (\mathbb{R}_+, \mathcal{B}_{\mathbb{R}_+})$ be a measurable non-negative function. Let T be a stopping
time on (Ω, A, (Fn )n≥0 ). For all x ∈ E, we have the equality
Ex [F (θT (X))1{T <∞} |FT ] = EXT [F (X)]1{T <∞} Px -a.s. (9)
of non-negative random variables on (Ω, A, Px ).
Proof. Just as in the proof of the weak Markov property, the right-hand side of (9), which is a
function of XT , is FT -measurable, and it suffices to prove that for every A ∈ FT , both sides of
(9) have the same Px -integral over A.
Let us choose A ∈ FT . For all n ≥ 0, we have
Ex [F (θT (X))1{T =n} 1A ] = Ex [F (θT (X))1{T =n}∩A ].
Using the weak Markov property and the fact that {T = n} ∩ A belongs to Fn , we find
Ex [F (θT (X))1{T =n} 1A ] = Ex [EXn [F (X)]1{T =n}∩A ] = Ex [EXT [F (X)]1{T =n} 1A ].
Summing this equality over n, we find
Ex [F (θT (X))1{T <∞} 1A ] = Ex [EXT [F (X)]1{T <∞} 1A ],
and the proof is finished.

We will see over time that the theorem that we just proved expresses the Markov property
in a very convenient and powerful way. For the moment, let us give a simple consequence.

Corollary 4.18 We use the notation of Theorem 4.17. Let x, y be elements of E and let T be
a stopping time such that T < ∞ and XT = y Px -a.s. Then under Px , θT (X) is independent of
FT and has the same distribution as X under Py .

Proof. The statement is equivalent to saying that for every non-negative measurable function F on $E^{\mathbb{N}}$ and all $B \in \mathcal{F}_T$,
Ex [F (θT (X))1B ] = Px (B)Ey [F (X)].
Let us compute the expectation on the left-hand side by conditioning with respect to FT and
using the strong Markov property. We find

Ex [F (θT (X))1B ] = Ex [EXT [F (X)]1B ] = Ex [Ey [F (X)]1B ] = Px (B)Ey [F (X)],

as expected.

4.5 Recurrent and transient states


Equipped with the powerful tools developed in the last section, we can embark on the study of
the recurrence properties of the states of a Markov chain. From this point on, and until further
notice, we fix a Markov chain (Ω, A, (Fn )n≥0 , (Px )x∈E , X = (Xn )n≥0 ) on the state space E with
transition kernel P .
For every state x ∈ E, let us introduce a random variable $N_x$ with values in N ∪ {∞} defined by
\[
N_x = \sum_{n=0}^{\infty} \mathbf{1}_{\{X_n = x\}}
\]
and a stopping time
\[
T_x = \inf\{n \ge 1 : X_n = x\}.
\]
The random variable Nx is the total number of visits of the chain to the state x and the stopping
time Tx is the first return time of the chain to x. Note that Tx ≥ 1 by construction, and the
usual convention inf ∅ = ∞ applies.
It is a simple but important observation that the following equality of events holds:

{Nx ≥ 1{X0 =x} + 1} = {Tx < ∞}.

These two events can be described as ‘the chain visits the state x at least once strictly after time 0’.
We will prove that, starting from x, either the Markov chain visits x infinitely often almost
surely, or it visits x a finite number of times which moreover has finite expectation.
Before that, let us introduce the canonical versions of $N_x$ and $T_x$: let us define on the canonical space
\[
\widehat{N}_x = \sum_{n=0}^{\infty} \mathbf{1}_{\{\widehat{X}_n = x\}} \quad \text{and} \quad \widehat{T}_x = \inf\{n \ge 1 : \widehat{X}_n = x\}.
\]
We have $N_x = \widehat{N}_x(X)$ and $T_x = \widehat{T}_x(X)$.

Proposition 4.19 Let x be an element of E. Exactly one of the following two situations occurs.
1. $P_x(T_x < \infty) = 1$. In this case, $N_x = \infty$ $P_x$-a.s. and one says that x is recurrent.
2. $P_x(T_x < \infty) < 1$. In this case, $N_x < \infty$ $P_x$-a.s. and one says that x is transient. Moreover, in this case,
\[
E_x[N_x] = \frac{1}{P_x(T_x = \infty)}.
\]
Proof. Let k ≥ 1 be an integer and let us compute $P_x(N_x \ge k+1)$:
\begin{align*}
P_x(N_x \ge k+1) &= P_x(\widehat{N}_x(X) \ge k+1)\\
&= P_x(T_x < \infty, \widehat{N}_x(\theta_{T_x}(X)) \ge k)\\
&= E_x\big[\mathbf{1}_{\{\widehat{N}_x \ge k\}}(\theta_{T_x}(X))\, \mathbf{1}_{\{T_x < \infty\}}\big]\\
&= E_x\big[E_{X_{T_x}}\big[\mathbf{1}_{\{\widehat{N}_x \ge k\}}(X)\big]\, \mathbf{1}_{\{T_x < \infty\}}\big]\\
&= P_x(N_x \ge k)\, P_x(T_x < \infty).
\end{align*}
Since $P_x(N_x \ge 1) = 1$, we get by induction
\[
\forall k \ge 1, \quad P_x(N_x \ge k) = P_x(T_x < \infty)^{k-1}.
\]
If $P_x(T_x < \infty) = 1$, we deduce that $P_x(N_x = \infty) = 1$. This is the recurrent case.
On the other hand, if $P_x(T_x < \infty) < 1$, then
\[
E_x[N_x] = \sum_{k\ge 1} P_x(N_x \ge k) = \sum_{k\ge 0} P_x(T_x < \infty)^k = \frac{1}{1 - P_x(T_x < \infty)},
\]
so that $E_x[N_x] < \infty$ and $N_x$ is finite $P_x$-almost surely.
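A Monte Carlo check of the transient case — indicative only, since the simulation uses a finite horizon: for the biased walk on Z with steps +1 with probability 3/4 and −1 with probability 1/4, the classical first-step computation gives $P_0(T_0 < \infty) = \frac{3}{4}\cdot\frac{1/4}{3/4} + \frac{1}{4}\cdot 1 = \frac{1}{2}$, hence $E_0[N_0] = 2$.

import numpy as np

rng = np.random.default_rng(4)

# Biased walk on Z: step +1 with probability 3/4, -1 with probability 1/4.
p, horizon, n_paths = 0.75, 500, 10_000
steps = np.where(rng.random((n_paths, horizon)) < p, 1, -1)
S = np.cumsum(steps, axis=1)

# N_0 counts all visits to 0, including time 0 (finite-horizon approximation;
# with drift +1/2, visits beyond the horizon are negligible).
N0 = 1 + (S == 0).sum(axis=1)
print(N0.mean())  # close to E_0[N_0] = 1 / P_0(T_0 = infinity) = 2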

The dichotomy between recurrent and transient states suggests the definition of the following
function.
Definition 4.20 The Green function of the chain is the function G : E × E → [0, ∞] defined
by
∀x, y ∈ E, G(x, y) = Ex [Ny ].
Proposition 4.21 1. For all x, y ∈ E, we have
\[
G(x, y) = \sum_{n=0}^{\infty} P^n(x, y).
\]
2. The state x ∈ E is recurrent if and only if G(x, x) = ∞.
3. If x ≠ y, then
\[
G(x, y) = P_x(T_y < \infty)\, G(y, y).
\]
In particular, G(x, y) ≤ G(y, y).
4. For all x ∈ E,
\[
G(x, x) = 1 + P_x(T_x < \infty)\, G(x, x).
\]
Proof. 1. By definition of G and the monotone convergence theorem,
\[
G(x, y) = \sum_{n=0}^{\infty} E_x[\mathbf{1}_{\{X_n = y\}}] = \sum_{n=0}^{\infty} P_x(X_n = y) = \sum_{n=0}^{\infty} P^n(x, y).
\]
2. By the previous proposition, x is recurrent if and only if $E_x[N_x] = \infty$, that is, if and only if G(x, x) = ∞.
3. Under $P_x$, we have almost surely $N_y = \widehat{N}_y(\theta_{T_y}(X))\, \mathbf{1}_{\{T_y < \infty\}}$, so that
\[
G(x, y) = E_x[N_y] = E_x\big[E_{X_{T_y}}[\widehat{N}_y(X)]\, \mathbf{1}_{\{T_y < \infty\}}\big] = P_x(T_y < \infty)\, E_y[N_y] = P_x(T_y < \infty)\, G(y, y).
\]
4. Under $P_x$, we have almost surely $N_x = 1 + \widehat{N}_x(\theta_{T_x}(X))\, \mathbf{1}_{\{T_x < \infty\}}$, and the relation on the Green functions follows.
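When E is finite, the Green function restricted to the transient states can be computed in closed form: if Q denotes the sub-matrix of P indexed by the transient states, then a path between two transient states never passes through a recurrent state (once in a recurrence class, the chain stays there), so $G(x,y) = \sum_{n\ge 0} Q^n(x,y) = (I-Q)^{-1}(x,y)$, the classical fundamental matrix. A Python sketch on a made-up absorbing chain:

import numpy as np

# Gambler's-ruin-type chain on {0, 1, 2, 3}: 0 and 3 absorbing, fair steps inside.
P = np.array([[1.0, 0.0, 0.0, 0.0],
              [0.5, 0.0, 0.5, 0.0],
              [0.0, 0.5, 0.0, 0.5],
              [0.0, 0.0, 0.0, 1.0]])

# Q = P restricted to the transient states {1, 2}.
Q = P[1:3, 1:3]
G_exact = np.linalg.inv(np.eye(2) - Q)  # sum_{n >= 0} Q^n

# Compare with the truncated series sum_{n=0}^{199} P^n(x, y) on {1, 2}.
G_series = sum(np.linalg.matrix_power(P, n) for n in range(200))[1:3, 1:3]
print(G_exact)
print(G_series)  # agrees with G_exact up to truncation error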

The Green function allows us to define a binary relation on E: given two states x and y, we say that x leads to y, and we write x → y, if G(x, y) > 0. Thus, x leads to y if and only if, starting from x, there is a positive probability⁴ of visiting y. If x ≠ y, then the following four assertions are equivalent:
\[
x \to y \iff G(x, y) > 0 \iff P_x(N_y \ge 1) > 0 \iff P_x(T_y < \infty) > 0.
\]
Lemma 4.22 On E, the binary relation → is reflexive and transitive.

Proof. For every x ∈ E, one has G(x, x) ≥ 1, so that x → x. This means that → is a reflexive
relation.
To prove that it is transitive, consider three states x, y and z such that x → y and y → z.
If any two of these states are equal, then it is true that x → z. On the other hand, if they are pairwise distinct, then
\begin{align*}
G(x, z) \ge P_x(N_z \ge 1) = P_x(T_z < \infty) &\ge P_x(T_y < \infty, \widehat{T}_z(\theta_{T_y}(X)) < \infty)\\
&= P_x(T_y < \infty)\, P_y(T_z < \infty)\\
&= P_x(N_y \ge 1)\, P_y(N_z \ge 1) > 0
\end{align*}
and x → z.

It is not true in general that the relation → is symmetric. We will now prove that its
restriction to the subset of all recurrent states is symmetric.

Proposition 4.23 Let x and y be two states. Assume that x is recurrent and x leads to y.
Then y is recurrent, y leads to x, and

Px (Ty < ∞) = Py (Tx < ∞) = 1.


⁴ The notation may be misleading: x → y means that starting from x, a visit to y occurs with positive probability. It does not mean that starting from x, a visit to y is certain. This is the difference between $P_x(N_y \ge 1) > 0$ and $P_x(N_y \ge 1) = 1$.

Proof. If y = x, then there is nothing to prove. Let us assume that y ≠ x. Since x is recurrent, we have $N_x = \infty$ $P_x$-a.s., while on the event $\{T_y < \infty, \widehat{T}_x(\theta_{T_y}(X)) = \infty\}$ the chain visits x only finitely many times. Hence, by the strong Markov property,
\[
0 = P_x(T_y < \infty, \widehat{T}_x(\theta_{T_y}(X)) = \infty) = P_x(T_y < \infty)\, P_y(T_x = \infty).
\]
Since, by assumption, $P_x(T_y < \infty) > 0$, we find $P_y(T_x = \infty) = 0$, that is, $P_y(T_x < \infty) = 1$. In particular, y → x.
Since G(x, y) > 0 and G(y, x) > 0, there exist two integers a, b ≥ 1 such that
\[
P^a(x, y) > 0 \quad \text{and} \quad P^b(y, x) > 0.
\]
Hence,
\[
G(y, y) = \sum_{n\ge 0} P^n(y, y) \ge \sum_{n\ge a+b} P^n(y, y) \ge \sum_{c\ge 0} P^b(y, x)\, P^c(x, x)\, P^a(x, y) = P^b(y, x)\, G(x, x)\, P^a(x, y) = \infty
\]
and y is recurrent.
Since y is recurrent and y → x, we proved already that $P_x(T_y < \infty) = 1$.

Let us write x ∼ y if x → y and y → x. The relation ∼ is an equivalence relation on E. It follows from the last proposition that the relations → and ∼ coincide on the set of all recurrent states. Note also that this proposition implies that if x is recurrent and y is transient, then G(x, y) = 0.

Proposition 4.24 Under the assumptions of the previous proposition, we have

Px (Ny = ∞) = Py (Nx = ∞) = 1.

In particular,
G(x, y) = G(y, x) = ∞.

Proof. Indeed, y is recurrent and for all k ≥ 1,

Px (Ny ≥ k) = Px (Ty < ∞)Py (Ny ≥ k) = 1.

Hence, Px (Ny = ∞) = 1.

We can now state the following result which summarises our study.

Theorem 4.25 (Classification of states) Let R be the subset of E consisting of all recurrent states. Let

R = ⊔_{i∈I} Ri

be the partition of R into equivalence classes for the relation ∼. Consider a state x ∈ E.
1. If x is recurrent, let i ∈ I be such that x ∈ Ri . Then Px -a.s.,

Ny = ∞ for all y ∈ Ri and Nz = 0 for all z ∈ E \ Ri .

In English, starting from a recurrent state, the chain stays forever in the class of its initial state
and visits infinitely often every state of this class.
2. If x is transient, define T = inf{n ≥ 0 : Xn ∈ R}. Then

Px({T = ∞ and Ny < ∞ for all y ∈ E} ∪ {T < ∞ and ∃j ∈ I, ∀n ≥ T, Xn ∈ Rj}) = 1.
In English, starting from a transient state, either the chain never visits a recurrent state, in
which case it visits every state a finite number of times, or it eventually visits a recurrent state,
in which case it gets stuck forever in the class of the first recurrent state which it visits.

The equivalence classes of R under the relation ∼ are called the recurrence classes of the
chain.

Proof. 1. Consider y ∈ Ri. Then, according to Proposition 4.24, we have Px(Ny = ∞) = 1.
Consider now z ∈ E \ Ri. If z is transient, then G(x, z) = 0 by the remark following Proposition 4.23, since x is recurrent. If z is recurrent, then z does not belong to the class of x, so x does not lead to z, and G(x, z) = 0.
2. Consider a transient state y. By the third assertion of Proposition 4.21, G(x, y) ≤ G(y, y) < ∞. In particular, Ny is finite Px-almost surely.
On the event {T < ∞}, let J be the random element of I such that XT ∈ RJ. Then, by the strong Markov property,

Px(T < ∞ and ∀n ≥ T, Xn ∈ RJ) = Ex[1{T < ∞} PXT(∀n ≥ 0, Xn ∈ RJ)].

By the first part of the theorem, and since XT ∈ RJ by definition of J, we have PXT(∀n ≥ 0, Xn ∈ RJ) = 1 on the event {T < ∞}. Thus,

Px(T < ∞ and ∀n ≥ T, Xn ∈ RJ) = Px(T < ∞),

which is what we wanted to prove.

Definition 4.26 The Markov chain is said to be irreducible if x → y for all x, y ∈ E.

Corollary 4.27 Let us assume that the Markov chain is irreducible. Then we are in exactly
one of the following two situations.

• All states are recurrent, there is only one recurrence class and

Px (∀y ∈ E, Ny = ∞) = 1.

• All states are transient and

Px (∀y ∈ E, Ny < ∞) = 1.

If E is a finite set, then we are in the first situation.

Proof. If there is one recurrent state, then by Proposition 4.23, all states are recurrent and there
is only one class. Moreover, Proposition 4.24 implies that every state is visited infinitely often
Px -almost surely for every x ∈ E.
If there is no recurrent state, then all states are transient and, for all states x, y, we have
G(x, y) ≤ G(y, y) < ∞, so that Ny is finite Px -almost surely.
Finally, if E is finite, then along every trajectory at least one state must be visited infinitely often, which rules out the second situation; we must therefore be in the first one.

In the first situation, one says that the Markov chain is irreducible and recurrent. Let us
study an important example.

Theorem 4.28 (Random walks on Z) Let µ be a probability measure on Z. Let P be the transition kernel of the random walk on Z with jump distribution µ:

∀x, y ∈ Z, p(x, y) = µ(y − x).

Let ξ be a random variable with distribution µ. Let us assume that E[|ξ|] < ∞.
1. If E[ξ] ≠ 0, then every state is transient.
2. If E[ξ] = 0, then every state is recurrent. Moreover, the chain is irreducible if and only if
the subgroup of Z generated by {x ∈ Z : µ(x) > 0} is Z itself.

Proof. 1. For all x ∈ Z, the strong law of large numbers implies that |Xn | tends to +∞ as n
tends to infinity, Px -almost surely. Hence, Nx is finite Px -almost surely and x is transient.
2. The problem is invariant under translation, so that all states are of the same nature. It is thus sufficient to prove that 0 is recurrent.
Let us consider an integer p ≥ 0 and a positive real ε > 0, both to be specified later, and let us estimate

Σ_{|x|≤εp} G(0, x)

in two ways. Firstly, we want to say that this sum is not too big. For this, we argue that for every x ∈ Z, we have G(x, x) = G(0, 0) by translation invariance, and

G(0, x) ≤ G(x, x) = G(0, 0).

Thus,

Σ_{|x|≤εp} G(0, x) ≤ (2εp + 1) G(0, 0).

Secondly, we want to say that the same sum is not too small. For this, we use the weak law of large numbers, according to which Xn/n converges to 0 in probability. This implies that there exists n0, which depends on ε, such that for all n ≥ n0,

P0(|Xn| ≤ εn) ≥ 1/2.

For all p ≥ n0, we then have

Σ_{|x|≤εp} G(0, x) = Σ_{|x|≤εp} Σ_{n=0}^{∞} P^n(0, x)
                   ≥ Σ_{|x|≤εp} Σ_{n=n0}^{p} P^n(0, x)
                   = Σ_{n=n0}^{p} P0(|Xn| ≤ εp)
                   ≥ Σ_{n=n0}^{p} P0(|Xn| ≤ εn)
                   ≥ (p − n0 + 1)/2,

using for the second inequality that {|Xn| ≤ εn} ⊂ {|Xn| ≤ εp} when n ≤ p.
Thus, we proved that for all ε > 0, there exists an n0 such that for all p ≥ n0,

(p − n0 + 1)/2 ≤ Σ_{|x|≤εp} G(0, x) ≤ (2εp + 1) G(0, 0).

Thus, for all ε > 0, we have

G(0, 0) ≥ lim_{p→∞} (p − n0 + 1)/(4εp + 2) = 1/(4ε).

Letting ε tend to 0, this implies that G(0, 0) = ∞, and 0 is recurrent.
It remains to study the irreducibility of the chain. Let us denote by S the set {x ∈ Z : µ(x) > 0}.
Let us first assume that the chain is irreducible. Then in particular 0 leads to 1, which means that there exists n ≥ 1 such that P^n(0, 1) > 0. This in turn means that there exist integers x1, . . . , xn−1 such that p(0, x1) p(x1, x2) . . . p(xn−1, 1) > 0. Thus, S contains the integers x1, x2 − x1, . . . , xn−1 − xn−2, 1 − xn−1, which add up to 1. The subgroup of Z generated by S is thus Z.
Conversely, let us assume that S generates Z. Let us choose z ∈ Z and prove that 0 → z. There exist x1, . . . , xk and y1, . . . , yl in S such that

z = x1 + . . . + xk − y1 − . . . − yl.

Let us write x = x1 + . . . + xk and y = y1 + . . . + yl, so that x = z + y. Now, we have on one hand

P^k(0, x) ≥ µ(x1) . . . µ(xk) > 0,

so that 0 → x, and on the other hand

P^l(z, x) ≥ p(z, z + y1) . . . p(z + y1 + . . . + yl−1, x) = µ(y1) . . . µ(yl) > 0,

so that z → x. Since z is recurrent, it follows that x → z, and finally 0 → z.
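Before turning to the exercise, here is a Monte-Carlo sketch of ours (the parameters and names are arbitrary choices) contrasting the two cases of the theorem for nearest-neighbour walks: we estimate P0(T0 ≤ horizon), which should approach 1 in the centred case and, for the drifted walk, stay near the true return probability, which a gambler's ruin computation gives as 2(1 − p) when p > 1/2.

import numpy as np

rng = np.random.default_rng(1)

def return_frequency(p_up, horizon=5_000, trials=1_000):
    # Fraction of walks started at 0 that return to 0 within `horizon` steps.
    hits = 0
    for _ in range(trials):
        x = 0
        for _ in range(horizon):
            x += 1 if rng.random() < p_up else -1
            if x == 0:
                hits += 1
                break
    return hits / trials

print(return_frequency(0.5))   # recurrent case: tends to 1 as the horizon grows
print(return_frequency(0.7))   # transient case: stays near 2(1 - 0.7) = 0.6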

Exercise 4.15 With the notation of the theorem, give an example of a probability measure µ
such that the random walk is transient, but such that the support of µ generates Z as an additive
group.

4.6 Invariant measures
As in the previous section, we fix once and for all a Markov chain

(Ω, A, (Fn )n≥0 , (Px )x∈E , X = (Xn )n≥0 )

on the state space E with transition kernel P .

Definition 4.29 A measure µ on E is an invariant measure of the transition kernel P if µ is not the zero measure, µ gives finite mass to every singleton, and

∀y ∈ E, Σ_{x∈E} µ(x) p(x, y) = µ(y). (10)

The conditions that µ is not zero and gives finite mass to every singleton can be written symbolically as

∃x ∈ E, µ(x) > 0 and ∀x ∈ E, µ(x) < ∞.

The relation (10) can be written matricially, as follows. Let us remember that P can be thought of as a possibly infinite square matrix with as many rows and columns as E has elements. Let us think of a measure on E as a row, the width (or length) of which is |E|, the number of elements of E. In a dual fashion, it is appropriate to think of a function on E as a column with height |E|. Then, if µ is a measure and f a function on E, the matrix product µf is a 1 × 1 matrix, that is, a number, which is none other than the integral of f with respect to µ:

µf = ∫_E f dµ.

Exercise 4.16 Let f be a function on E, seen as a column. The matrix product P f is well
defined and it is a column. Thus, it represents a function on E. What is this function?

Equation (10) can be written


µP = µ.
This shows that there is a purely linear-algebraic approach to the study of invariant measures. Indeed, if E is finite, µ is an invariant measure if and only if µ*, the column obtained by transposing µ, is an eigenvector with non-negative coefficients of the transposed kernel P*, associated with the eigenvalue 1. The piece of linear algebra which deals with stochastic matrices is called Perron-Frobenius theory, and we shall discuss it later.
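To make this linear-algebraic point of view concrete, here is a short Python sketch of ours (assuming numpy; the 3 × 3 stochastic matrix is a hypothetical example): an invariant probability of a finite chain is obtained as a left eigenvector of P for the eigenvalue 1, that is, an eigenvector of the transpose P*.

import numpy as np

# Hypothetical stochastic matrix on a three-point state space.
P = np.array([
    [0.9, 0.1, 0.0],
    [0.4, 0.4, 0.2],
    [0.1, 0.0, 0.9],
])

eigvals, eigvecs = np.linalg.eig(P.T)   # eigenvectors of P* = left eigenvectors of P
k = np.argmin(np.abs(eigvals - 1.0))    # locate the eigenvalue 1
pi = np.real(eigvecs[:, k])
pi = pi / pi.sum()                      # normalise to a probability measure

print(pi)            # an invariant probability
print(pi @ P - pi)   # numerically zero: pi P = pi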
Let us give an example of an invariant measure. Let us consider a random walk on Z^d, with arbitrary jump distribution. Thus, η is a probability measure on Z^d and for all x, y ∈ Z^d, we have p(x, y) = η(y − x). Then the equality

Σ_{x∈Z^d} p(x, y) = Σ_{x∈Z^d} η(y − x) = Σ_{x∈Z^d} η(x) = 1,

valid for all y ∈ Z^d, shows that the counting measure µ, given by

∀x ∈ Z^d, µ(x) = 1,
is an invariant measure of this Markov chain.
It is an elementary observation that any positive multiple of an invariant measure is still an invariant measure. If there exists an invariant measure µ with the property that µ(E) < ∞, it is natural to normalise it and to define

π = µ/µ(E),

which is an invariant probability measure on E. Then, for every bounded function f on E, the matricial relation

πPf = πf

can be written in terms of the Markov chain as

Eπ[f(X1)] = ∫_E f dπ,

where Eπ denotes the expectation with respect to the probability measure

Pπ = Σ_{x∈E} π(x) Px.

In other words, if the initial distribution of the Markov chain is an invariant probability measure,
then the Markov chain is stationary, in the sense that the distribution of Xn does not depend
on n.
There is another instructive way of thinking of (10). Suppose that we consider the evolution
of a very large assembly of particles (or sand grains, or people). At the initial time, the quantity
of particles (or sand, or people) that is present at each state x is measured by the number µ(x).
For example, µ(x) is the number of litres of water present at the state x at time 0. Then,
between time 0 and time 1, from every state x, and for every state y, a proportion p(x, y) of
the water present at x moves to y. Thus, on one hand, the water received by the state y in this
process is

Σ_{x∈E} µ(x) p(x, y).

On the other hand, the water lost by the same state y is

Σ_{x∈E} µ(y) p(y, x),

which incidentally is equal to µ(y). To say that µ is invariant is to say that the water received
by y exactly compensates the water lost by y: this is the equality expressed by (10).
One could ask for more, and demand that in the process just described, for all states x and y, the water received by y from x exactly compensates the water sent from y to x.

Definition 4.30 The measure µ on E is said to be reversible with respect to P if µ is not the
zero measure, µ gives finite mass to every singleton, and

∀x, y ∈ E, µ(x)p(x, y) = µ(y)p(y, x). (11)

Our discussion should make it clear that the following statement holds.

Proposition 4.31 A reversible measure is an invariant measure.

Exercise 4.17 Prove this proposition directly.

Let us give an example of a reversible measure. Let us consider on Z the random walk with
jump distribution η given by

η(1) = p and η(−1) = q = 1 − p

for some p ∈ (0, 1). This is called the p-biased random walk on Z.
Then the measure µ given by

µ(i) = (p/q)^i (12)

is reversible. If p ≠ q, that is, if p ≠ 1/2, this measure is not proportional to the counting measure on Z, which we already know to be invariant.
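A two-line numerical check of (11) on this example (a sketch of ours; the value p = 0.7 is arbitrary): the mass flowing from i to i + 1 under µ equals the mass flowing back.

p = 0.7
q = 1.0 - p
mu = lambda i: (p / q) ** i   # the reversible measure (12)

for i in range(-5, 5):
    flow_right = mu(i) * p     # mass sent from i to i + 1
    flow_left = mu(i + 1) * q  # mass sent from i + 1 back to i
    assert abs(flow_right - flow_left) < 1e-12
print("detailed balance holds on the sampled window")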

Exercise 4.18 Describe, for every p ∈ (0, 1), the set of all invariant measures of the p-biased
random walk on Z.

Exercise 4.19 Recall the random walk on a graph described on page 63. Prove that the formula

µ(x) = |Ax |

defines an invariant measure on E.

When a reversible measure exists, it is in general much easier to find by solving (11) than
by solving (10). In other words, when looking for an invariant measure, one should always start
by looking for a reversible measure.

Exercise 4.20 Draw a small connected graph. Find an invariant probability measure for the
random walk on this graph by solving (10) (for this to be tractable by hand, your graph should
not have more than a few vertices, a dozen would typically be too much, unless your graph has a
lot of symmetry and you find a way to use it). Compare with the effort needed to find a reversible
measure for the same random walk.

Another instance of the fact that reversible measures are nice to work with is given by the
following exercise.

Exercise 4.21 Under the assumption that the Markov chain is irreducible, prove that any two
reversible measures are proportional.
Is it true, under the same assumption of irreducibility, that any two invariant measures are
proportional?

We will soon prove that if the chain is irreducible and recurrent, then any two invariant
measures are proportional, but this will require more than a two-line proof.
Let us leave reversible measures aside and consider general invariant measures again. The
fundamental result is the following.

Theorem 4.32 Let x ∈ E be a recurrent state. The formula

µ(y) = Ex[ Σ_{i=0}^{Tx−1} 1{Xi = y} ]

defines an invariant measure on E. Moreover, the support of this measure is

{y ∈ E : µ(y) > 0} = {y ∈ E : x → y} = {y ∈ E : x ∼ y},

the communication class of x.

Proof. The first observation is that µ(x) = 1 by definition of Tx. Hence, µ is not identically zero. Let us prove that µ satisfies (10). To start with, whether y = x or y ≠ x, we have

µ(y) = Ex[ Σ_{i=1}^{Tx} 1{Xi = y} ].

Now, let us compute by considering not only the i-th state of the walk, but also the state immediately before:

µ(y) = Ex[ Σ_{i=1}^{Tx} Σ_{z∈E} 1{Xi−1 = z} 1{Xi = y} ]
     = Σ_{i≥1} Σ_{z∈E} Ex[1{Xi−1 = z} 1{Xi = y} 1{i ≤ Tx}].

Now, for all i ≥ 1 and y, z ∈ E, we have

Ex[1{Xi−1 = z} 1{Xi = y} 1{i ≤ Tx}] = Ex[ Ex[1{Xi−1 = z} 1{Xi = y} 1{i ≤ Tx} | Fi−1] ].

Using the fact that {i ≤ Tx} = {Tx ≤ i − 1}^c belongs to Fi−1, we find

Ex[1{Xi−1 = z} 1{Xi = y} 1{i ≤ Tx}] = Ex[1{Xi−1 = z} Ex[1{Xi = y} | Fi−1] 1{i ≤ Tx}]
                                    = Ex[1{Xi−1 = z} p(Xi−1, y) 1{i ≤ Tx}]
                                    = p(z, y) Ex[1{Xi−1 = z} 1{i ≤ Tx}].
Thus,

µ(y) = Σ_{z∈E} p(z, y) Σ_{i≥1} Ex[1{Xi−1 = z} 1{i ≤ Tx}]
     = Σ_{z∈E} p(z, y) Ex[ Σ_{i=1}^{Tx} 1{Xi−1 = z} ]
     = Σ_{z∈E} p(z, y) Ex[ Σ_{i=0}^{Tx−1} 1{Xi = z} ]
     = Σ_{z∈E} µ(z) p(z, y).

Note that, by construction, for all y ∈ E,

µ(y) ≤ Ex[ Σ_{i=0}^{∞} 1{Xi = y} ] = Ex[Ny] = G(x, y).

In particular, if x does not lead to y, then µ(y) = 0.


Now let y be a state such that x → y and y → x. Thus, there exist integers n and m such that P^n(x, y) > 0 and P^m(y, x) > 0. Since µP^n = µP^m = µ, we have on one hand

µ(y) = Σ_{z∈E} µ(z) P^n(z, y) ≥ µ(x) P^n(x, y) > 0,

so that µ(y) > 0, and on the other hand

1 = µ(x) = Σ_{z∈E} µ(z) P^m(z, x) ≥ µ(y) P^m(y, x),

so that µ(y) < ∞.


This concludes the proof that µ is an invariant measure with support equal to the communication class of x.
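Here is a Monte-Carlo sketch of the theorem (ours; the kernel, the seed and the number of excursions are arbitrary choices): we estimate µ(y) by averaging the occupation counts of simulated excursions from x, and then check that µP is close to µ.

import numpy as np

rng = np.random.default_rng(2)
P = np.array([
    [0.2, 0.5, 0.3],
    [0.6, 0.1, 0.3],
    [0.5, 0.4, 0.1],
])
x, n_excursions = 0, 50_000
counts = np.zeros(3)
for _ in range(n_excursions):
    state = x
    while True:
        counts[state] += 1                 # a visit at some time i < T_x
        state = rng.choice(3, p=P[state])
        if state == x:                     # first return to x: the excursion ends
            break
mu = counts / n_excursions                 # note that mu(x) = 1 by construction
print(mu)
print(mu @ P)                              # approximately equal to mu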

Exercise 4.22 Does every Markov chain admit an invariant measure?

Exercise 4.23 Prove that any invariant measure of an irreducible Markov chain gives a positive
mass to every state. In other words, such a measure µ satisfies ∀x ∈ E, µ(x) > 0.

The second fundamental result is the following.

Theorem 4.33 Assume that the Markov chain is irreducible and recurrent. Then any two
invariant measures are proportional.

Proof. Let µ be an invariant measure. Let x ∈ E be such that µ(x) > 0. Dividing µ by µ(x),
we may and will assume that µ(x) = 1. This normalisation being made, we want to prove that
µ is equal to the measure ν defined by

∀y ∈ E, ν(y) = Ex[ Σ_{i=0}^{Tx−1} 1{Xi = y} ].

For this, we will prove that µ ≥ ν, in the sense that

∀y ∈ E, µ(y) ≥ ν(y). (13)

Let us take this inequality for granted for one moment and explain how it implies µ = ν. The point is that, up to the fact that it may be identically zero, µ − ν is an invariant measure of our Markov chain which gives mass 0 to the state x, and which therefore must vanish identically⁵. More precisely, consider y ∈ E and an integer m such that P^m(y, x) > 0 (such an m exists by irreducibility). We have

0 = (µ − ν)(x) = Σ_{z∈E} (µ − ν)(z) P^m(z, x) ≥ (µ − ν)(y) P^m(y, x),

⁵ This may be a good time to try Exercise 4.23 if you have not done so yet.

from which it follows that µ(y) = ν(y).
Let us now prove (13). For this, we prove that for all n ≥ 0 and all y ∈ E,

µ(y) ≥ Ex[ Σ_{i=0}^{(Tx−1)∧n} 1{Xi = y} ]. (14)

If y = x, then both sides are equal to 1, for all n ≥ 0, and the inequality holds.
Let us assume that y ≠ x and prove the result by induction on n. If n = 0, the right-hand side of (14) is equal to 0 and the inequality holds. Let us assume that the result is proved at rank n. Then, proceeding in a way that is very similar to the proof of Theorem 4.32, we have
   
Ex[ Σ_{i=0}^{(Tx−1)∧(n+1)} 1{Xi = y} ] = Ex[ Σ_{i=1}^{Tx∧(n+1)} 1{Xi = y} ]
    = Σ_{z∈E} Σ_{i=1}^{n+1} Ex[1{Xi−1 = z} 1{Xi = y} 1{i ≤ Tx}]
    = Σ_{z∈E} Ex[ Σ_{i=1}^{Tx∧(n+1)} 1{Xi−1 = z} ] p(z, y)
    = Σ_{z∈E} Ex[ Σ_{i=0}^{(Tx−1)∧n} 1{Xi = z} ] p(z, y)
    ≤ Σ_{z∈E} µ(z) p(z, y)
    = µ(y),

where the last inequality is the induction hypothesis.
Using the monotone convergence theorem to let n tend to infinity in (14), we obtain (13), and
the proof is finished.

Exercise 4.24 Compare your understanding of the invariant measures of the p-biased random
walk on Z with the previous theorem.

We proved that an irreducible recurrent Markov chain admits, up to a multiplicative constant, a unique invariant measure. We will distinguish between the case where these invariant measures are finite and the case where they are infinite.

Proposition 4.34 Assume that the chain is irreducible and recurrent. Then exactly one of the
following two situations occurs.
1. All invariant measures have infinite total mass and for all x ∈ E, we have Ex [Tx ] = ∞. In
this case, the Markov chain is called null recurrent.
2. All invariant measures have finite total mass. There exists a unique invariant probability
measure π. For all x ∈ E, we have
π(x) > 0,  Ex[Tx] < ∞  and  π(x) = 1/Ex[Tx].
In this case, the chain is called positive recurrent.

Proof. Let x be a state and let µ be the unique invariant measure on E such that µ(x) = 1, which is given by Theorem 4.32. We have

µ(E) = Σ_{y∈E} Ex[ Σ_{i=0}^{Tx−1} 1{Xi = y} ] = Ex[ Σ_{i=0}^{Tx−1} Σ_{y∈E} 1{Xi = y} ] = Ex[Tx].

Since all invariant measures are proportional, they are either all finite, or all infinite. If they are all infinite, the previous computation shows that Ex[Tx] = ∞ for all x ∈ E.
If they are all finite, the same computation shows that Ex[Tx] is finite for all x ∈ E. Moreover, the unique invariant probability measure is

π = µ/µ(E) = µ/Ex[Tx],

from which it follows that

π(x) = 1/Ex[Tx],

and the proof is finished.
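A quick numerical check of the relation π(x) Ex[Tx] = 1 (a sketch of ours, on the same hypothetical three-state kernel as above): π is computed by linear algebra and Ex[Tx] by simulation.

import numpy as np

rng = np.random.default_rng(3)
P = np.array([
    [0.2, 0.5, 0.3],
    [0.6, 0.1, 0.3],
    [0.5, 0.4, 0.1],
])

# Invariant probability: left eigenvector of P for the eigenvalue 1.
w, V = np.linalg.eig(P.T)
pi = np.real(V[:, np.argmin(np.abs(w - 1.0))])
pi = pi / pi.sum()

def mean_return_time(x, trials=50_000):
    total = 0
    for _ in range(trials):
        state, t = x, 0
        while True:
            state = rng.choice(3, p=P[state])
            t += 1
            if state == x:
                break
        total += t
    return total / trials

x = 0
print(pi[x] * mean_return_time(x))   # should be close to 1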

Proposition 4.35 Assume that the chain is irreducible. If there exists an invariant probability
measure, then the chain is recurrent.

Proof. Let y be a state such that π(y) > 0. Let us compute G(y, y). Remember that for every state x, we have G(x, y) ≤ G(y, y). Thus,

G(y, y) = Σ_{x∈E} π(x) G(y, y) ≥ Σ_{x∈E} π(x) G(x, y).

Now,

Σ_{x∈E} π(x) G(x, y) = Σ_{x∈E} Σ_{n=0}^{∞} π(x) P^n(x, y) = Σ_{n=0}^{∞} (πP^n)(y) = Σ_{n=0}^{∞} π(y) = ∞.

Thus, G(y, y) = ∞ and y is recurrent. Since the chain is irreducible, all states are then recurrent; and since the chain admits an invariant probability measure, it is in fact positive recurrent.

4.7 Brief summary


Let us summarise the results of the last two sections. Firstly, about recurrence, transience, and
communication between states.

• The state space E of a Markov chain is partitioned into two disjoint subsets: the set of
recurrent states and the set of transient states. Recall that x is recurrent, by definition, if
and only if
Px (Tx < ∞) = 1.

• On E, there is the relation → defined by

x → y ⇔ G(x, y) > 0.

It is reflexive and transitive. There is also the relation ∼ defined by

x ∼ y ⇔ (x → y and y → x).

It is an equivalence relation, the classes of which are called the communication classes. If
E is itself a communication class, the chain is said to be irreducible.

• A communication class consists either exclusively of recurrent states, or exclusively of transient states. One speaks of recurrent and transient classes.

• Given two communication classes C and D, the existence of x ∈ C and y ∈ D such that x → y implies that for every x ∈ C and every y ∈ D, one has x → y. In this case, one writes C ⪰ D. This defines a binary relation on the set E/∼ of communication classes.

• The relation ⪰ on E/∼ is a (partial) order⁶, of which the recurrent classes are minimal elements. Let us emphasize that there can exist transient minimal classes. It is also possible that there is no minimal class at all.

Now about invariant measures.

• The support of any invariant measure is a union of communication classes. Moreover, if the support of an invariant measure contains a class C, then it also contains every class D such that C ⪰ D.

Let us now consider irreducible chains. There are three cases: transient, null recurrent and
positive recurrent.

• Transient irreducible chains can have no invariant measure at all, a unique (up to multi-
plication) invariant measure, or several non-proportional invariant measures. In any case,
any invariant measure of a transient irreducible chain has infinite total mass.

• Recurrent irreducible chains admit, up to multiplication, exactly one invariant measure.

• Null recurrent chains are the recurrent irreducible chains for which all invariant measures are infinite. The expected return time to every state is infinite⁷.

• Positive recurrent chains are the recurrent irreducible chains for which all invariant mea-
sures are finite, or equivalently for which there exists an invariant probability measure. If
π is this invariant probability measure, then the relation

π(x)Ex [Tx ] = 1

holds for every state x.


⁶ As the notation suggests, we think of C as being 'larger' than D if C ⪰ D.
⁷ It is, however, not true that the expected hitting time of every state starting from every other state is infinite.

4.8 The ergodic theorem
In this section and the next, we will study the asymptotic behaviour of our Markov chain when
time tends to infinity.
The main question is to determine, for all x and y, the behaviour as n tends to infinity of

P^n(x, y) = Px(Xn = y).

If y is transient, then G(x, y) ≤ G(y, y) < ∞, so that

lim_{n→∞} P^n(x, y) = 0.
With recurrent states, the situation is more interesting.
Theorem 4.36 Consider a Markov chain that is irreducible and recurrent. Let µ be an invariant measure of this chain. Let f : E → R+ and g : E → R+ be two non-negative real-valued functions on E. Assume that ∫_E g dµ > 0 and that at least one of the functions f and g has finite integral with respect to µ. Then for all x ∈ E,

(Σ_{i=0}^{n} f(Xi)) / (Σ_{i=0}^{n} g(Xi)) −→ (∫_E f dµ) / (∫_E g dµ)   Px-a.s. as n → ∞.

Proof. Let x ∈ E be a state. Let us introduce the sequence of stopping times

Tx^(0) = 0 and, for all k ≥ 1, Tx^(k) = inf{n > Tx^(k−1) : Xn = x}.

Note that for all k ≥ 1,

Tx^(k) = Tx^(k−1) + T̂x(θ_{Tx^(k−1)}(X)).

In particular, Tx^(1) = Tx and, by the strong Markov property, the increments (Tx^(k) − Tx^(k−1))_{k≥1} form, under Px, a sequence of independent and identically distributed random variables.
Since x is recurrent, we have

Px(∀k ≥ 0, Tx^(k) < ∞) = 1.
For all k ≥ 1, let us define

ξk = Σ_{n=Tx^(k−1)}^{Tx^(k)−1} f(Xn),

the sum of the values of f on the states visited by the chain during its k-th excursion from x.
The Markov property implies that the sequence (ξk)k≥1 of non-negative random variables is i.i.d. under Px. The strong law of large numbers asserts that

(ξ1 + . . . + ξk)/k −→ Ex[ξ1]   Px-a.s. as k → ∞.
On the other hand,

Ex[ξ1] = Ex[ Σ_{n=0}^{Tx−1} Σ_{y∈E} 1{Xn = y} f(y) ]
       = Σ_{y∈E} Ex[ Σ_{n=0}^{Tx−1} 1{Xn = y} ] f(y)
       = ∫_E f dν,

where ν is the unique invariant measure on E such that ν(x) = 1. This measure is none other than µ/µ(x), so that

Ex[ξ1] = (1/µ(x)) ∫_E f dµ.
For all n ≥ 0, let us define

Nx,n = Σ_{i=0}^{n} 1{Xi = x},

the number of visits to x up to time n, which we think of as the index of the excursion from x in progress at time n. Since x is recurrent, we have

Nx,n −→ +∞   Px-a.s. as n → ∞.

By construction, we have, for all n ≥ 0,

Tx^(Nx,n − 1) ≤ n < Tx^(Nx,n)   Px-a.s.

Thus, for all n ≥ 0, we have

ξ1 + . . . + ξ_{Nx,n − 1} ≤ Σ_{i=0}^{n} f(Xi) ≤ ξ1 + . . . + ξ_{Nx,n}.

Dividing by Nx,n and letting n tend to infinity, we find

(1/Nx,n) Σ_{i=0}^{n} f(Xi) −→ (1/µ(x)) ∫_E f dµ   Px-a.s. as n → ∞.

The same argument applies to g, and taking the quotient of the two almost sure convergences yields the announced result.

Corollary 4.37 Suppose that the Markov chain is irreducible and recurrent.
1. Assume that the chain is positive recurrent. Let π denote the unique invariant probability measure. Then for every probability measure ν on E and all y ∈ E, we have

(1/n) Σ_{i=0}^{n−1} 1{Xi = y} −→ π(y)   Pν-a.s. as n → ∞.

In particular, for every non-negative function f on E,

(1/n) Σ_{i=0}^{n−1} f(Xi) −→ ∫_E f dπ   Pν-a.s. as n → ∞.

2. Assume that the chain is null recurrent. Then for every probability measure ν on E and all y ∈ E, we have

(1/n) Σ_{i=0}^{n−1} 1{Xi = y} −→ 0   Pν-a.s. as n → ∞.

Proof. It suffices to apply the ergodic theorem with g = 1: in the positive recurrent case with µ = π, and in the null recurrent case with f = 1{y}, noting that ∫_E g dµ = µ(E) = ∞, so that the limit of the ratio is 0.

This corollary explains the names positive recurrent and null recurrent: in the positive recurrent case, the chain spends a positive proportion of the time in each state, whereas in the null recurrent case, the asymptotic proportion of time spent in any given state is 0.
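A simulation sketch of the corollary (ours; the kernel is the same hypothetical example as before): the empirical occupation frequencies along one long trajectory approach π.

import numpy as np

rng = np.random.default_rng(4)
P = np.array([
    [0.2, 0.5, 0.3],
    [0.6, 0.1, 0.3],
    [0.5, 0.4, 0.1],
])
w, V = np.linalg.eig(P.T)
pi = np.real(V[:, np.argmin(np.abs(w - 1.0))])
pi = pi / pi.sum()

n_steps, state = 100_000, 0
visits = np.zeros(3)
for _ in range(n_steps):
    visits[state] += 1
    state = rng.choice(3, p=P[state])

print(visits / n_steps)   # empirical frequencies of the three states
print(pi)                 # they should be close to pi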
This corollary also shows, in the positive recurrent case, that for all x, y ∈ E, the sequence (P^n(x, y))_{n≥0} converges in the sense of Cesàro to π(y). We will now study when this convergence holds in the usual sense.
Consider, as an example, the transition kernel

P = ( 0 1 )
    ( 1 0 )

on the set E = {1, 2}. The invariant probability measure of this Markov chain is the uniform measure (1/2)δ1 + (1/2)δ2, but under P1 for instance, the distribution of the chain does not converge to the invariant distribution. Instead, it is alternately equal to δ1 and δ2. This example shows that a phenomenon of cyclicity, or periodicity, can prevent the convergence of a Markov chain to its invariant measure. We will study this phenomenon and prove that it is the only possible obstruction to the convergence in distribution of a Markov chain.
We will need a small amount of arithmetic. Firstly, we say that a subset I of N is a semigroup
if it contains 0 and is stable under addition:

∀x, y ∈ I, x + y ∈ I.

We will need two properties of semigroups.

Lemma 4.38 Let I ⊂ N be a semigroup. Let d be the g.c.d. of the elements of I.

1. The subgroup of Z generated by I is the set

I − I = {x − y : x, y ∈ I}.

2. The equality I − I = dZ holds.


3. If d = 1, then there exists an integer n0 such that I contains every integer larger than n0 .

Proof. 1. The subgroup of Z generated by I certainly contains I − I. Our claim is thus equivalent to the fact that I − I is a subgroup of Z. Since I − I contains 0, it suffices to check that it is stable by subtraction. But for all n, m, n′, m′ ∈ I, we have

(n − m) − (n′ − m′) = (n + m′) − (n′ + m) ∈ I − I,

since n + m′ and n′ + m belong to I.

2. There exists an integer e such that I − I = eZ. On one hand, since 0 belongs to I, we
have I ⊂ I − I ⊂ eZ, so that e is a common divisor of I. On the other hand, any common
divisor of I is also a common divisor of I − I, hence of e. Thus, e = d.
3. Let us assume that d = 1. Then by the first assertion, I contains two consecutive integers,
say i and i + 1. Hence, I, being stable under addition, contains the set

{ai + b(i + 1) : a, b ≥ 0}.

We claim that this set contains all integers larger than i². Indeed, given n > i², divide n by i² and then the remainder by i, to write n = qi² + ri + s with q ≥ 1 and r, s ∈ {0, . . . , i − 1}. Then r − s > −i, so that qi + (r − s) ≥ 0, and the equality

n = (qi + (r − s))i + s(i + 1)

shows that n belongs to I.
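A tiny computational check of the third assertion (a sketch of ours; the generating set {3, 5}, with g.c.d. 1, is an arbitrary example):

def semigroup_up_to(generators, N):
    # Elements of the additive semigroup generated by `generators`
    # (together with 0) that do not exceed N.
    reachable = {0}
    for n in range(1, N + 1):
        if any(n - g in reachable for g in generators):
            reachable.add(n)
    return reachable

S = semigroup_up_to([3, 5], 30)
print(sorted(set(range(31)) - S))   # [1, 2, 4, 7]: every n >= 8 lies in S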

Definition 4.39 Let x ∈ E be a recurrent state. Define the set

Ix = {n ≥ 0 : P^n(x, x) > 0}.

The g.c.d. of this set is called the period of x and it is denoted by dx.

The assumption that x is recurrent implies that Ix is an infinite set: since each P^n(x, x) ≤ 1,

∞ = G(x, x) = Σ_{n≥0} P^n(x, x) = Σ_{n∈Ix} P^n(x, x) ≤ |Ix|.

In particular, Ix ≠ {0} and its g.c.d. is well defined.


Let us observe that Ix contains 0 and is stable under addition. Indeed, for all n, m ≥ 0, we have

P^{n+m}(x, x) ≥ P^n(x, x) P^m(x, x),

so that n + m belongs to Ix as soon as n and m do. In particular, according to Lemma 4.38,

Ix − Ix = dx Z.
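For a finite chain, the period can be computed numerically by truncating the g.c.d. at a finite horizon; here is a sketch of ours on the two-state cyclic kernel, which has period 2.

from math import gcd
import numpy as np

P = np.array([[0.0, 1.0],
              [1.0, 0.0]])   # the two-state cyclic kernel

def period(P, x, horizon=50):
    # g.c.d. of {1 <= n <= horizon : P^n(x, x) > 0}, a truncation of d_x.
    d, Pn = 0, np.eye(len(P))
    for n in range(1, horizon + 1):
        Pn = Pn @ P
        if Pn[x, x] > 0:
            d = gcd(d, n)
    return d

print(period(P, 0))   # prints 2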

Proposition 4.40 If the chain is irreducible and recurrent, then all states have the same period.

Proof. Consider two states x and y. By irreducibility, there exist integers k and l such that P^k(x, y) > 0 and P^l(y, x) > 0. It follows that

k + Iy + l ⊂ Ix.

Thus, for all n, m ∈ Iy, we have

n − m = (k + n + l) − (k + m + l) ∈ Ix − Ix = dx Z.

Since 0 ∈ Iy, this shows that dx divides every element of Iy, so that dx divides dy. By symmetry, dy divides dx, and dx = dy.

Definition 4.41 An irreducible recurrent chain is said to be aperiodic if the common period of
all states is 1.

Proposition 4.42 Assume that the Markov chain is irreducible, recurrent and aperiodic. For
all x, y ∈ E, there exists an integer n0 such that for all n ≥ n0 , P n (x, y) > 0.

Proof. Consider x, y ∈ E. Let n2 ≥ 0 be such that P^{n2}(x, y) > 0. Such an integer n2 exists because the chain is irreducible. Then, since dx = 1, there exists n1 such that Ix contains every integer m ≥ n1. Set n0 = n1 + n2. For all n ≥ n0, write n = m + n2 with m ≥ n1. Then

P^n(x, y) ≥ P^m(x, x) P^{n2}(x, y) > 0,

and the result is proved.

Theorem 4.43 Assume that the chain is irreducible, positive recurrent, and aperiodic. Then, for all x ∈ E, we have

lim_{n→∞} Σ_{y∈E} |P^n(x, y) − π(y)| = 0,

where π is the unique invariant probability.

Proof. Let us define the Markov kernel P̄ on E × E by

p̄((x1, x2), (y1, y2)) = p(x1, y1) p(x2, y2).

Let (Ω̄, F̄, (F̄n)n≥0, (P̄(x1,x2))_{(x1,x2)∈E²}, X̄ = (Xn^1, Xn^2)n≥0) be a Markov chain with transition kernel P̄.
This chain X̄ is irreducible. Indeed, consider (x1, x2) and (y1, y2) in E × E. By aperiodicity and Proposition 4.42, there exist n1 and n2 such that for all n ≥ n1 (resp. n ≥ n2), we have P^n(x1, y1) > 0 (resp. P^n(x2, y2) > 0). For n = max(n1, n2), we have P̄^n((x1, x2), (y1, y2)) > 0.
Moreover, the probability measure π ⊗ π is invariant for this chain. Indeed, for all (y1, y2) ∈ E²,

Σ_{(x1,x2)∈E×E} (π ⊗ π)(x1, x2) p̄((x1, x2), (y1, y2)) = Σ_{(x1,x2)∈E×E} π(x1) p(x1, y1) π(x2) p(x2, y2)
    = π(y1) π(y2)
    = (π ⊗ π)(y1, y2).

Since the chain X̄ is irreducible and admits an invariant probability measure, it is positive recurrent by Propositions 4.35 and 4.34.
For all x, y ∈ E,

P^n(x, y) − π(y) = Pπ⊗δx(Xn^2 = y) − Pπ⊗δx(Xn^1 = y) = Eπ⊗δx[1{Xn^2 = y} − 1{Xn^1 = y}].

Let us consider the stopping time

T = inf{n ≥ 0 : Xn^1 = Xn^2}.

We have

P^n(x, y) − π(y) = Eπ⊗δx[1{T > n} (1{Xn^2 = y} − 1{Xn^1 = y})]
    + Σ_{k=0}^{n} Σ_{z∈E} Eπ⊗δx[1{T = k, Xk^1 = Xk^2 = z} (1{Xn^2 = y} − 1{Xn^1 = y})].

For k ∈ {0, . . . , n} and z ∈ E, the Markov property at time k gives

Eπ⊗δx[1{T = k, Xk^1 = Xk^2 = z} 1{Xn^2 = y}] = Eπ⊗δx[1{T = k, Xk^1 = Xk^2 = z}] P^{n−k}(z, y)
    = Eπ⊗δx[1{T = k, Xk^1 = Xk^2 = z} 1{Xn^1 = y}],

so that the double sum of the last expression vanishes. Thus,


X
|P n (x, y) − π(y)| ≤ 2Pπ⊗δx (T > n).
y∈E

Since the chain X̄ is irreducible and recurrent, it almost surely visits the diagonal {(z, z) : z ∈ E}, so T is finite Pπ⊗δx-almost surely, and the last quantity tends to 0 as n tends to infinity.
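A numerical sketch of the theorem (ours; the kernel is the same hypothetical aperiodic example as in the previous sketches): the quantity Σ_y |P^n(x, y) − π(y)| decays to 0, in fact geometrically fast here.

import numpy as np

P = np.array([
    [0.2, 0.5, 0.3],
    [0.6, 0.1, 0.3],
    [0.5, 0.4, 0.1],
])
w, V = np.linalg.eig(P.T)
pi = np.real(V[:, np.argmin(np.abs(w - 1.0))])
pi = pi / pi.sum()

for n in (1, 5, 10, 20):
    Pn = np.linalg.matrix_power(P, n)
    print(n, np.abs(Pn[0] - pi).sum())   # sum over y of |P^n(0, y) - pi(y)|, tending to 0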
