
Stochastic Processes

Lecture Notes

Hanbaek Lyu

Department of Mathematics, University of Wisconsin-Madison, WI 53706

Email address: hlyu@math.wisc.edu

www.hanbaeklyu.com
Contents

Chapter 1. Probability Space, Random Variables, Limit Theorems 4


1.1. Probability measure and probability space 4
1.2. Independence 6
1.3. Random variables 7
1.4. Conditional expectation 14
1.5. Conditional variance 16
1.6. Elementary limit theorems 18

Chapter 2. Renewal Processes 21


2.1. Definition of renewal processes and Renewal SLLN 21
2.2. Renewal reward processes and Renewal Reward SLLN 23

Chapter 3. Markov chains 25


3.1. Definition and examples 25
3.2. Stationary distribution and examples 29
3.3. Markov property to the infinite future 34
3.4. Strong Markov Property 35
3.5. Recurrence and Transience 37
3.6. Existence and uniqueness of stationary distribution 49
3.7. Uniqueness of stationary distribution 50
3.8. Limit theorems for Markov chains 54
3.9. Convergence to the stationary distribution 57
3.10. Markov chain Monte Carlo 60

Chapter 4. Martingales 65
4.1. Definition and examples of martingales 65
4.2. Gambling strategies and stopping times 69
4.3. Optional stopping time theorem and its applications 72
4.4. Martingale limit theorem 75
4.5. Branching processes 76
4.6. Additional properties of martingales (Optional*) 79

Chapter 5. Introduction to mathematical finance (Optional*) 82


5.1. Hedging and replication in the two-state world 82
5.2. The fundamental theorem of asset pricing 83
5.3. The binomial model 85

Chapter 6. Poisson Processes 91


6.1. Recap of exponential and Poisson random variables 91
6.2. Poisson processes as an arrival process 94
6.3. Memoryless property and stopping times 94
6.4. Merging and splitting of Poisson process 98
6.5. Poisson process and Poisson point process 100

6.6. Poisson process conditioned on the number of arrivals 102


6.7. Nonhomogeneous Poisson process 103
6.8. PP via asymptotic properties of the counting process (Optional*) 104

Chapter 7. Renewal Processes: Examples and Queuing Theory 107


7.1. Age and residual life 107
7.2. Examples of Renewal Processes 109
7.3. Little’s Law 112

Chapter 8. Continuous-time Markov Chains 118


8.1. Definition of CTMC 118
8.2. Construction of CTMC 121
8.3. Evolution of the transition probability function 126
8.4. Stationary distribution and limit theorems for CTMC 129

Bibliography 133
CHAPTER 1

Probability Space, Random Variables, Limit Theorems

1.1. Probability measure and probability space


We begin by idealizing our situation. A probability space is defined to be a triple (Ω, F, P), where
Ω = Sample space, F = Events, P = Probability measure. (1)
This is our idealized world where we can precisely measure uncertainty of all possible events. First, Ω is
a set called the sample space. This is the collection of all possible outcomes that we can observe (e.g., the
six sides of a die). Second, we are going to perform some experiment on Ω (e.g., rolling a die), and the
outcome ω ∈ Ω could belong to any subset E of Ω, which we call an event. Let us denote the collection
of all events E ⊆ Ω by F. When Ω is finite or countable, we usually take F = 2^Ω, the collection of all subsets of Ω. If Ω is uncountable (e.g., Ω = R), then F must be a special collection of subsets of Ω called a σ-algebra. In this course, we will from now on simply take F = 2^Ω and not worry about σ-algebras.
Lastly, P is a probability measure on the measurable space (Ω, F), which gives a quantification P(E)
of how likely it is that the event E occurs out of our experiment. Mathematically, it is a function P : F → R
such that for each event E ∈ F (recall E ⊆ Ω), it assigns a real number P(E ) ∈ R and satisfies the following
Kolmogorov axioms:
(i) For each event E ∈ F , P(E ) ∈ [0, 1];
(ii) P(Ω) = 1;
(iii) (σ-additivity) For a countable collection of disjoint events A_1, A_2, ··· ∈ F,

P(⋃_{i=1}^∞ A_i) = ∑_{i=1}^∞ P(A_i). (2)

When Ω is finite, one can replace (iii) above with the following simpler condition: If two events E_1, E_2 ⊆ Ω are disjoint, then P(E_1 ∪ E_2) = P(E_1) + P(E_2).

Exercise 1.1.1. Let (Ω, F, P) be a probability space. Show the following.
(i) Let E = {x_1, x_2, ··· , x_k} ⊆ Ω be an event. Then P(E) = ∑_{i=1}^k P({x_i}).
(ii) ∑_{x∈Ω} P({x}) = 1.
(iii) (subadditivity) Let A_1, A_2, ··· ∈ F be arbitrary. Then

P(⋃_{i=1}^∞ A_i) ≤ ∑_{i=1}^∞ P(A_i). (3)

Exercise 1.1.2. Let P be a probability measure on sample space Ω. Show the following.
(i) Let E = {x_1, x_2, ··· , x_k} ⊆ Ω be an event. Then P(E) = ∑_{i=1}^k P({x_i}).
(ii) ∑_{x∈Ω} P({x}) = 1.

Of course, there could be many (in fact, infinitely many) different probability measures on the same
sample space.

Exercise 1.1.3 (coin flip). Let Ω = {H, T} be a sample space. Fix a parameter p ∈ [0, 1], and define a function P_p : 2^Ω → [0, 1] by P_p(∅) = 0, P_p({H}) = p, P_p({T}) = 1 − p, P_p({H, T}) = 1. Verify that P_p is a probability measure on Ω for each value of p.

A typical way of constructing a probability measure is to specify how likely it is to see each individual element in Ω. Namely, let f : Ω → [0, 1] be a function that sums up to 1, i.e., ∑_{x∈Ω} f(x) = 1. Define a function P : F → [0, 1] by

P(E) = ∑_{ω∈E} f(ω). (4)

Then this is a probability measure on Ω, and f is called a probability distribution on Ω. For instance, the probability distribution on {H, T} we used to define P_p in Exercise 1.1.3 is f(H) = p and f(T) = 1 − p.
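For a concrete illustration, here is a minimal Python sketch of the construction (4) on a finite sample space; the helper name make_measure is ours, chosen only for illustration.

```python
# A minimal sketch of constructing the probability measure in (4):
# P(E) = sum of f(w) over w in E, for a distribution f on a finite sample space.

def make_measure(f):
    """Given a dict f mapping outcomes to probabilities summing to 1,
    return the probability measure P defined by (4)."""
    assert abs(sum(f.values()) - 1.0) < 1e-12
    def P(event):
        return sum(f[w] for w in event)
    return P

# The coin-flip measure P_p of Exercise 1.1.3, here with p = 0.3:
P = make_measure({"H": 0.3, "T": 0.7})
print(P(set()), P({"H"}), P({"T"}), P({"H", "T"}))  # 0.0 0.3 0.7 1.0
```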

Example 1.1.4 (Uniform probability measure). Let Ω = {1, 2, · · · , m} be a sample space and let P be the
uniform probability measure on Ω, that is,
P({x}) = 1/m ∀x ∈ Ω. (5)
Then for the event A = {1, 2, 3}, we have

P(A) = P({1} ∪ {2} ∪ {3}) (6)
= P({1}) + P({2}) + P({3}) (7)
= 1/m + 1/m + 1/m = 3/m. (8)

Likewise, if A ⊆ Ω is any event and if we let |A| denote the size (number of elements) of A, then

P(A) = |A|/m. (9)
For example, let Ω = {1, 2, 3, 4, 5, 6}2 be the sample space of a roll of two fair dice. Let A be the event that
the sum of two dice is 5. Then
A = {(1, 4), (2, 3), (3, 2), (4, 1)}, (10)
so |A| = 4. Hence P(A) = 4/36 = 1/9. ▲
Exercise 1.1.5. Show that the function P : F → [0, 1] defined in (4) is a probability measure on Ω. Con-
versely, show that every probability measure on a finite sample space Ω can be defined in this way.

Remark 1.1.6 (General probability space). A probability space does not need to be finite, but we need a
more careful definition in that case. For example, if we take Ω to be the unit interval [0, 1], then we have
to be careful in deciding which subset E ⊆ Ω can be an ‘event’: not every subset of Ω can be an event. A
proper definition of a general probability space is beyond the scope of this course.

Exercise 1.1.7. Let (Ω, F , P) be a probability space and let A ⊆ Ω be an event. Show that P(A c ) = 1− P(A).

Example 1.1.8 (Roll of two dice). Suppose we roll two dice and let X and Y be the outcome of each die.
Say all possible joint outcomes are equally likely. The sample space for the roll of a single die can be
written as {1, 2, 3, 4, 5, 6}, so the sample space for rolling two dice can be written as Ω = {1, 2, 3, 4, 5, 6}2 .
In pictures, think of a 6-by-6 square grid where each node represents a unique outcome (x, y) of the roll. By assumption, each outcome has probability 1/36. Namely,
P((X , Y ) = (x, y)) = 1/36 ∀1 ≤ x, y ≤ 6. (11)
This gives the uniform probability distribution on our sample space Ω (see Example 1.1.4).
We can compute various probabilities for this experiment. For example,
P(at least one die is even) = 1 − P(both dice are odd) (12)
= 1 − P({1, 3, 5} × {1, 3, 5}) = 1 − 9/36 = 3/4, (13)
where for the first equality we have used the complementary probability in Exercise 1.1.7.
Now think about the following question: What is the most probable value for the sum X + Y? By considering diagonal lines x + y = k in the 2-dimensional plane for different values of k, we find

P(X + Y = k) = (# of intersections between the line x + y = k and Ω) / 36. (14)

For example, P(X + Y = 2) = 1/36 and P(X + Y = 7) = 6/36 = 1/6. Moreover, from Figure 1, it is clear that the number of intersections is maximized when the diagonal line x + y = k passes through the extreme points (1, 6) and (6, 1). Hence 7 is the most probable value for X + Y, with the probability being 1/6.

FIGURE 1. Sample space representation for the roll of two independent fair dice and events of fixed sum of two dice.

1.2. Independence
When knowing something about one event does not yield any information about the other, we say the two events are independent. To make this statement a bit more precise, suppose Bob is betting $5 on whether an event E_1 occurs. Suppose Alice tells him that some other event E_2 holds true. If Bob can somehow leverage this knowledge to increase his chance of winning, then we should say the two events are not independent.
Formally, we say two events E 1 and E 2 are independent if
P(E 1 ∩ E 2 ) = P(E 1 ) P(E 2 ). (15)
We say two events or RVs are dependent if they are not independent.
Example 1.2.1. Flip two fair coins at the same time and let X and Y be their outcomes, labeled by H and T. Clearly knowing about one coin does not give any information about the other. For instance, the first coin lands on heads with probability 1/2. Whether the first coin lands on heads or not, the second coin will land on heads with probability 1/2. So

P(X = H and Y = H) = (1/2) · (1/2) = P(X = H) P(Y = H). (16)

Exercise 1.2.2. Suppose two events A 1 and A 2 are independent and suppose that P(A 2 ) > 0. Show that
P(A 1 | A 2 ) = P(A 1 ). (17)

1.3. Random variables


In this section, we introduce the notion of random variables along with a number of important ex-
amples. We also introduce expectation and variance as summary statistics for random variables, and
learn how to compute them.

1.3.1. Discrete random variables. Given a finite probability space (Ω, P), a (discrete) random vari-
able (RV) is any real-valued function X : Ω → R. We can think of it as the outcome of some experiment on
Ω (e.g., height of a randomly selected friend). We often forget the original probability space and specify
a RV by its probability mass function (PMF) f_X : R → [0, 1],

f X (x) = P(X = x) = P({ω ∈ Ω | X (ω) = x}). (18)

Namely, P(X = x) is the likelihood that the RV X takes value x.

Example 1.3.1. Say you win $1 if a fair coin lands heads and lose $1 if it lands tails. We can set up our
probability space (Ω, P) by Ω = {H , T } and P = uniform probability measure on Ω. The RV X : Ω → R for
this game is X (H ) = 1 and X (T ) = −1. The PMF of X is given by f X (1) = P(X = 1) = P({H }) = 1/2 and
likewise f X (−1) = 1/2.

Exercise 1.3.2. Let (Ω, P) be a probability space and X : Ω → R be a RV. Let f_X be the PMF of X, that is, f_X(x) = P(X = x) for all x. Show that f_X adds up to 1, that is,

∑_x f_X(x) = 1, (19)

where the summation runs over all numerical values x that X can take.

There are two useful statistics of a RV to summarize its two most important properties: Its average
and uncertainty. First, if one has to guess the value of a RV X , what would be the best choice? It is the
expectation (or mean) of X , defined as below:
E(X) = ∑_x x P(X = x). (20)

Example 1.3.3 (Doubling strategy). Suppose we bet $x on a game where we have to predict whether a fair coin flip comes up heads. We win $x on heads and lose $x on tails. Suppose we are playing the 'doubling strategy'. Namely, we start betting $1, and until the first time we win, we double our bet; upon the first win, we quit the game. Let X be the random variable giving the net gain of the overall game. How can we evaluate this strategy?
Let N be the random number of coin flips we encounter until we see the first head. For instance,

P(N = 1) = P({H}) = 1/2, (21)
P(N = 2) = P({(T, H)}) = 1/2², (22)
P(N = 3) = P({(T, T, H)}) = 1/2³. (23)

In general, we have

P(N = k) = 1/2^k. (24)

Note that on the event that N = k (i.e., we bet k times), the net gain X | {N = k} is

X | {N = k} = (−1) + (−2) + (−2²) + (−2³) + ··· + (−2^{k−1}) + 2^k (25)
= −(2^k − 1) + 2^k = 1. (26)

Since the last expression does not depend on k, we conclude that

P(X = 1) = ∑_{k=1}^∞ P(X = 1 | N = k) P(N = k) (27)
= ∑_{k=1}^∞ P(N = k) = 1. (28)

Hence our net gain is $1 with probability 1. In particular, the expected net gain is also 1:
E[X ] = 1 · P(X = 1) = 1. (29)
What if we use tripling strategy? ▲
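Before moving on, here is a quick Python simulation of the doubling strategy, a sanity check rather than a proof; it should print True since every run nets exactly $1.

```python
# A quick simulation of the doubling strategy of Example 1.3.3: bet $1, double
# the bet after each loss, and quit at the first win. The net gain is always 1.
import random

def doubling_strategy():
    bet, net = 1, 0
    while True:
        if random.random() < 0.5:   # heads: win the current bet and quit
            return net + bet
        net -= bet                  # tails: lose the current bet
        bet *= 2                    # and double it for the next flip

print(all(doubling_strategy() == 1 for _ in range(10_000)))  # True
```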
Exercise 1.3.4. For any RV X and real numbers a, b ∈ R, show that
E[aX + b] = a E[X ] + b. (30)

Exercise 1.3.5 (Tail sum formula for expectation). Let X be a RV taking values on positive integers.
(i) For any x, show that

P(X ≥ x) = ∑_{y=x}^∞ P(X = y). (31)

(ii) Use (i) and Fubini's theorem to show

∑_{x=1}^∞ P(X ≥ x) = ∑_{y=1}^∞ ∑_{x=1}^y P(X = y). (32)

(iii) From (ii), conclude that

E(X) = ∑_{x=1}^∞ P(X ≥ x). (33)

On the other hand, say you play two different games where in the first game, you win or lose $1 depending on a fair coin flip, and in the second game, you win or lose $10. In both games, your expected winning is 0. But the two games are different in how much the outcome fluctuates around the mean. This notion of fluctuation is captured by the following quantity called variance:

Var(X) = E[(X − E(X))²]. (34)
Namely, it is the expected squared difference between X and its expectation E(X ).
The following exercise ties the expectation and the variance of a RV into a problem of finding a point
estimator that minimizes the mean squared error.

Exercise 1.3.6 (Variance as minimum MSE). Let X be a RV. Let x̂ ∈ R be a number, which we consider
as a ‘guess’ (or ‘estimator’ in Statistics) of X . Let E[(X − x̂)2 ] be the mean squared error (MSE) of this
estimation.
(i) Show that

E[(X − x̂)²] = E[X²] − 2x̂ E[X] + x̂² (35)
= (x̂ − E[X])² + E[X²] − E[X]² (36)
= (x̂ − E[X])² + Var(X). (37)
(ii) Conclude that the MSE is minimized when x̂ = E[X ] and the global minimum is Var(X ). In this sense,
E[X ] is the ‘best guess’ for X and Var(X ) is the corresponding MSE.

Here are some of the simplest and yet most important discrete RVs.

Exercise 1.3.7 (Bernoulli RV). A RV X is a Bernoulli variable with (success) probability p ∈ [0, 1] if it takes
value 1 with probability p and 0 with probability 1 − p. In this case we write X ∼ Bernoulli(p). Show that
E(X ) = p and Var(X ) = p(1 − p).
Exercise 1.3.8 (Indicator variables). Let (Ω, P) be a probability space and let E ⊆ Ω be an event. The
indicator variable of the event E , which is denoted by 1E , is the RV such that 1E (ω) = 1 if ω ∈ E and
1E (ω) = 0 if ω ∈ E c . Show that 1E is a Bernoulli variable with success probability p = P(E ).
Example 1.3.9 (Binomial RV). Let X_1, X_2, ··· , X_n be independent and identically distributed Bernoulli(p) variables. Let X = X_1 + ··· + X_n. One can think of flipping the same probability p coin n times. Then X is the total number of heads. Note that X has the following PMF

P(X = k) = (n choose k) p^k (1 − p)^{n−k} (38)

for integers 0 ≤ k ≤ n, and P(X = k) = 0 otherwise. We say X follows the Binomial distribution with parameters n and p, and write X ∼ Binomial(n, p).
We can compute the mean and variance of X using the above PMF directly, but it is much easier to
break it up into Bernoulli variables and use linearity. Recall that X i ∼ Bernoulli(p) and we have E(X i ) = p
and Var(X i ) = p(1 − p) for each 1 ≤ i ≤ n (from Exercise 1.3.7). So by linearity of expectation (Exercise
1.3.22),
E(X ) = E(X 1 + · · · + X n ) = E(X 1 ) + · · · + E(X n ) = np. (39)
On the other hand, since X i ’s are independent, variance of X is the sum of variance of X i ’s (Exercise
1.3.31) so
Var(X ) = Var(X 1 + · · · + X n ) = Var(X 1 ) + · · · + Var(X n ) = np(1 − p). (40)
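As a sanity check, the following Python sketch estimates the mean and variance of a Binomial(n, p) sample built exactly as above, as a sum of Bernoulli(p) variables; the parameter values are illustrative.

```python
# A simulation check of Example 1.3.9: a Binomial(n, p) sample, built as a sum of
# n Bernoulli(p) variables, has empirical mean near np and variance near np(1-p).
import random

n, p, trials = 20, 0.3, 100_000
samples = [sum(random.random() < p for _ in range(n)) for _ in range(trials)]
mean = sum(samples) / trials
var = sum((x - mean) ** 2 for x in samples) / trials
print(mean, n * p)           # both close to 6.0
print(var, n * p * (1 - p))  # both close to 4.2
```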

Example 1.3.10 (Geometric RV). Suppose we flip a probability p coin until it lands heads. Let X be the total number of trials until the first time we see heads. Then in order for X = k, the first k − 1 flips must land on tails and the kth flip should land on heads. Since the flips are independent of each other,

P(X = k) = P({T, T, ··· , T, H}) = (1 − p)^{k−1} p. (41)

This is valid for k a positive integer, and P(X = k) = 0 otherwise. Such a RV is called a Geometric RV with (success) parameter p, and we write X ∼ Geom(p).
The mean and variance of X can be easily computed using its moment generating function, which we will learn soon in this course. For a direct computation, note that

E(X) − (1 − p)E(X) = [(1 − p)⁰p + 2(1 − p)¹p + 3(1 − p)²p + 4(1 − p)³p + ···] (42)
− [(1 − p)¹p + 2(1 − p)²p + 3(1 − p)³p + ···] (43)
= (1 − p)⁰p + (1 − p)¹p + (1 − p)²p + (1 − p)³p + ··· (44)
= p / (1 − (1 − p)) = 1, (45)

where we recognized the series after the second equality as a geometric series. This gives

E(X) = 1/p. (46)

(In fact, one can apply Exercise 1.3.5 and quickly compute the expectation of a Geometric RV.)

Exercise 1.3.11. Let X ∼ Geom(p). Use a similar computation as we had in Example 1.3.10 to show E(X²) = (2 − p)/p². Using the fact that E(X) = 1/p, conclude that Var(X) = (1 − p)/p².
Example 1.3.12 (Poisson RV). A RV X is a Poisson RV with rate λ > 0 if

P(X = k) = λ^k e^{−λ} / k! (47)

for all nonnegative integers k ≥ 0. We write X ∼ Poisson(λ). A computation shows E(X) = Var(X) = λ.

1.3.2. Continuous Random Variables. So far we have only considered discrete RVs, which take either finitely many or countably many values. While there are many examples of discrete RVs, there are also many instances of RVs which vary continuously (e.g., temperature, height, weight, price, etc.).
To define a discrete RV, it was enough to specify its PMF. For a continuous RV, the probability density function (PDF) plays an analogous role to the PMF. We also need to replace the summation ∑ with an integral ∫ dx.
Namely, X is a continuous RV if there is a function f_X : R → [0, ∞) such that the probability that X takes a value from an interval (a, b] is given by integrating f_X over (a, b]:

P(X ∈ (a, b]) = ∫_a^b f_X(x) dx. (48)
The cumulative distribution function (CDF) of a RV X (either discrete or continuous), denoted by F_X, is defined by

F_X(x) = P(X ≤ x). (49)

By definition of PDF, we get

F_X(x) = ∫_{−∞}^x f_X(t) dt. (50)
Conversely, PDFs can be obtained by differentiating corresponding CDFs.

Exercise 1.3.13. Let X be a continuous RV with PDF f_X. Let a be a continuity point of f_X, that is, f_X is continuous at a. Show that F_X(x) is differentiable at x = a and

(dF_X/dx)|_{x=a} = f_X(a). (51)
The expectation of a continuous RV X with PDF f_X is defined by

E(X) = ∫_{−∞}^∞ x f_X(x) dx, (52)

and its variance Var(X) is defined by the same formula (34).

Exercise 1.3.14 (Tail sum formula for expectation). Let X be a continuous RV with PDF f_X and suppose f_X(x) = 0 for all x < 0. Use Fubini's theorem to show that

E(X) = ∫_0^∞ P(X ≥ t) dt. (53)

Example 1.3.15 (Higher moments). Let X be a continuous RV with PDF f_X and suppose f_X(x) = 0 for all x < 0. We will show that for any real number α > 0,

E[X^α] = ∫_0^∞ x^α f_X(x) dx. (54)

First, use Exercise 1.3.14 to write

E[X^α] = ∫_0^∞ P(X^α ≥ x) dx (55)
= ∫_0^∞ P(X ≥ x^{1/α}) dx (56)
= ∫_0^∞ ∫_{x^{1/α}}^∞ f_X(t) dt dx. (57)

We then use Fubini's theorem to change the order of the integrals. This gives

E[X^α] = ∫_0^∞ ∫_{x^{1/α}}^∞ f_X(t) dt dx = ∫_0^∞ ∫_0^{t^α} f_X(t) dx dt = ∫_0^∞ t^α f_X(t) dt, (58)

as desired. ▲

Exercise 1.3.16. Let X be a continuous RV with PDF f_X and suppose f_X(x) = 0 for all x < 0. Let g : R → R be a strictly increasing function. Use Fubini's theorem and the tail sum formula for expectation to show

E[g(X)] = ∫_0^∞ g(x) f_X(x) dx. (59)

Next, we introduce three important continuous RVs.

Example 1.3.17 (Uniform RV). X is a uniform RV on the interval [a, b] (denoted by X ∼ Uniform([a, b])) if it has PDF

f_X(x) = (1/(b − a)) 1(a ≤ x ≤ b). (60)

An easy computation gives its CDF:

P(X ≤ x) = 0 for x < a; (x − a)/(b − a) for a ≤ x ≤ b; 1 for x > b. (61)

Exercise 1.3.18. Let X ∼ Uniform([a, b]). Show that

E(X) = (a + b)/2,  Var(X) = (b − a)²/12. (62)
Example 1.3.19 (Exponential RV). X is an exponential RV with rate λ (denoted by X ∼ Exp(λ)) if it has PDF

f_X(x) = λe^{−λx} 1(x ≥ 0). (63)

Integrating the PDF gives its CDF

P(X ≤ x) = (1 − e^{−λx}) 1(x ≥ 0). (64)

Using Exercise 1.3.14, we can compute

E(X) = ∫_0^∞ e^{−λt} dt = [−e^{−λt}/λ]_0^∞ = 1/λ. (65)

Exercise 1.3.20. Let X ∼ Exp(λ). Show that E(X) = 1/λ directly using definition (52). Also show that Var(X) = 1/λ².

Example 1.3.21 (Normal RV). X is a normal RV with mean µ and variance σ² (denoted by X ∼ N(µ, σ²)) if it has PDF

f_X(x) = (1/√(2πσ²)) e^{−(x−µ)²/(2σ²)}. (66)

If µ = 0 and σ² = 1, then X is called a standard normal RV. Note that if X ∼ N(µ, σ²), then Y := X − µ has PDF

f_Y(x) = (1/√(2πσ²)) e^{−x²/(2σ²)}. (67)

Since this is an even function, it follows that E(Y) = 0. Hence E(X) = µ. Using the 'Gaussian integral', one can also show that Var(X) = σ². ▲

1.3.3. Expectation and variance of sums of RVs. In this subsection, we learn how to compute the expectation and variance of sums of RVs. For expectation, we will see that we can swap the order of summation and expectation. This is called the linearity of expectation, whose importance cannot be overemphasized.

Exercise 1.3.22 (Linearity of expectation). In this exercise, we will show that the expectation of a sum of RVs is the sum of the expectations of the individual RVs.
(i) Let X and Y be RVs. Show that

∑_y P(X = x, Y = y) = P(X = x). (68)

(ii) Verify the following steps:

E(X + Y) = ∑_z z P(X + Y = z) (69)
= ∑_z ∑_{x,y: x+y=z} (x + y) P(X = x, Y = y) (70)
= ∑_{x,y} (x + y) P(X = x, Y = y) (71)
= ∑_{x,y} x P(X = x, Y = y) + ∑_{x,y} y P(X = x, Y = y) (72)
= ∑_x x (∑_y P(X = x, Y = y)) + ∑_y y (∑_x P(X = x, Y = y)) (73)
= ∑_x x P(X = x) + ∑_y y P(Y = y) (74)
= E(X) + E(Y). (75)

(iii) Use induction to show that for any RVs X 1 , X 2 , · · · , X n , we have

E(X 1 + X 2 + · · · + X n ) = E(X 1 ) + E(X 2 ) + · · · + E(X n ). (76)

Our first application of linearity of expectation is the following useful formula for variance.

Exercise 1.3.23. For any RV X, show that

Var(X) = E(X²) − E(X)². (77)

Next, we will see a nice application of linearity of expectation.

Exercise 1.3.24 (Inclusion-exclusion). Let (Ω, F, P) be a probability space. Let A_1, A_2, ··· , A_k ⊆ Ω. We will show the following inclusion-exclusion principle:

P(⋃_{i=1}^k A_i) = ∑_{i_1=1}^k P(A_{i_1}) − ∑_{1≤i_1<i_2≤k} P(A_{i_1} ∩ A_{i_2}) (78)
+ ∑_{1≤i_1<i_2<i_3≤k} P(A_{i_1} ∩ A_{i_2} ∩ A_{i_3}) − ··· + (−1)^{k+1} P(⋂_{i=1}^k A_i). (79)

The method is the so-called 'indicator trick'. For each 1 ≤ i ≤ k, let X_i = 1(A_i) be the indicator variable for the event A_i. Consider the following RV

Y = (1 − X_1)(1 − X_2) ··· (1 − X_k). (80)

(i) By expanding the product and using linearity of expectation, show that

E[Y] = 1 − ∑_{i_1=1}^k E[X_{i_1}] + ∑_{1≤i_1<i_2≤k} E[X_{i_1} X_{i_2}] (81)
− ∑_{1≤i_1<i_2<i_3≤k} E[X_{i_1} X_{i_2} X_{i_3}] + ··· + (−1)^k E[X_1 X_2 ··· X_k]. (82)

(ii) Show that Y is the indicator variable of the event ⋂_{i=1}^k A_i^c. Conclude that

E[Y] = P(⋂_{i=1}^k A_i^c) = 1 − P(⋃_{i=1}^k A_i). (83)

(iii) From (i) and (ii), deduce the inclusion-exclusion principle.


Exercise 1.3.25 (Finite second moment implies finite first moment). Let X be a RV with E[X²] < ∞.
(i) Show that

E[|X|] = E[|X| 1(|X| ≤ 1)] + E[|X| 1(|X| > 1)] ≤ 1 + E[X² 1(|X| > 1)] ≤ 1 + E[X²] < ∞. (84)

Deduce that E[X] ∈ (−∞, ∞).
(ii) Deduce that Var(X) < ∞.
Next, we study how we can simplify the variance of sums of RVs. Unlike expectation, we cannot
exchange the order of summation and variance in general.
Example 1.3.26. Flip two coins at the same time. The sample space for this experiment is Ω = {H, T}² and the probability measure P on Ω is given by

P({(H, H)}) = P({(T, T)}) = 0, (85)
P({(H, T)}) = P({(T, H)}) = 1/2. (86)

Namely, somehow we know for sure that the two coins will come up on opposite sides, with equal probability for the two possibilities.
Define a RV X : Ω → {−1, 1} by X = 1 if the first coin comes up a head and X = −1 if the first coin comes up a tail. (87) Define another RV Y : Ω → {−1, 1} similarly for the second coin. Then

E[X] = 1 · P(X = 1) + (−1) P(X = −1) (88)
= P({(H, H), (H, T)}) − P({(T, H), (T, T)}) = 0, (89)
E[X²] = 1² · P(X = 1) + (−1)² P(X = −1) = 1, (90)
Var(X) = E[X²] − E[X]² = 1. (91)

A similar computation shows Var(Y) = 1. However, note that X + Y takes value 0 with probability 1. Hence Var(X + Y) = 0. Thus 0 = Var(X + Y) ≠ Var(X) + Var(Y) = 2. ▲
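The following small Python computation verifies this example numerically, and also anticipates the covariance identity of Exercise 1.3.27 below: here Cov(X, Y) = −1, so Var(X + Y) = 1 + 1 + 2·(−1) = 0.

```python
# A numeric check of Example 1.3.26: for the perfectly anticorrelated coins,
# Cov(X, Y) = -1 and Var(X + Y) = 0, even though Var(X) = Var(Y) = 1.
outcomes = [("H", "T"), ("T", "H")]   # the only outcomes with positive probability
probs = [0.5, 0.5]
val = {"H": 1, "T": -1}

def E(f):
    """Expectation of f over the finite probability space above."""
    return sum(p * f(w) for p, w in zip(probs, outcomes))

EX = E(lambda w: val[w[0]])
EY = E(lambda w: val[w[1]])
cov = E(lambda w: val[w[0]] * val[w[1]]) - EX * EY
s_mean = E(lambda w: val[w[0]] + val[w[1]])
var_sum = E(lambda w: (val[w[0]] + val[w[1]]) ** 2) - s_mean ** 2
print(cov, var_sum)  # -1.0 0.0
```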
In general, when we try to write down the variance of a sum of RVs, in addition to the variance of each RV, we have an extra contribution from each pair of RVs called the covariance.
Exercise 1.3.27. In this exercise, we will see how we can express the variance of sums of RVs. For two RVs
X and Y , define their covariance Cov(X , Y ) by
Cov(X , Y ) = E(X Y ) − E(X )E(Y ). (92)
(i) Use Exercises 1.3.23 and 1.3.22 to show that
Var(X + Y ) = Var(X ) + Var(Y ) + 2Cov(X , Y ). (93)

(ii) Use induction to show that for RVs X_1, X_2, ··· , X_n,

Var(∑_{i=1}^n X_i) = ∑_{i=1}^n Var(X_i) + 2 ∑_{1≤i<j≤n} Cov(X_i, X_j). (94)

When knowing something about one RV does not yield any information about the other, we say the two
RVs are independent. Formally, we say two RVs X and Y are independent if for any two subsets A 1 , A 2 ⊆ R,
P(X ∈ A 1 and Y ∈ A 2 ) = P(X ∈ A 1 )P(Y ∈ A 2 ). (95)
We say two events or RVs are dependent if they are not independent.
Exercise 1.3.28. Suppose two RVs X and Y are independent. Then for any subsets A 1 , A 2 ⊆ R such that
P(Y ∈ A 2 ) > 0, show that
P(X ∈ A 1 | Y ∈ A 2 ) = P(X ∈ A 1 ). (96)
Example 1.3.29. Flip two fair coins at the same time, and let X = 1 if the first coin lands heads and X = −1 if it lands tails. Let Y be a similar RV for the second coin. Suppose each of the four outcomes in Ω = {H, T}² is equally likely. Clearly knowing about one coin does not give any information about the other. For instance, the first coin lands on heads with probability 1/2. Whether the first coin lands on heads or not, the second coin will land on heads with probability 1/2. So

P(X = 1 and Y = 1) = (1/2) · (1/2) = P(X = 1) P(Y = 1). (97)

One can verify the above equation for all possible outcomes. Hence X and Y are independent. ▲
Exercise 1.3.30. Let X, Y be two independent RVs. Show that

E[X Y] = E[X] E[Y]. (98)
In the following exercise, we will see that we can swap summation and variance as long as the RVs
we are adding up are independent.
Exercise 1.3.31. Recall the definition of covariance given in Exercise 1.3.27.
(i) Show that if two RVs X and Y are independent, then Cov(X, Y) = 0.
(ii) Use Exercise 1.3.27 to conclude that if X 1 , · · · , X n are independent RVs, then
Var(X 1 + · · · + X n ) = Var(X 1 ) + · · · + Var(X n ). (99)

1.4. Conditional expectation


Let X , Y be discrete RVs. Recall that the expectation E(X ) is the ‘best guess’ on the value of X when
we do not have any prior knowledge on X . But suppose we have observed that some possibly related RV
Y takes value y. What should be our best guess on X , leveraging this added information? This is called
the conditional expectation of X given Y = y, which is defined by
E[X | Y = y] = ∑_x x P(X = x | Y = y). (100)
This best guess on X given Y = y, of course, depends on y. So it is a function in y. Now if we do not
know what value Y might take, then we omit y and E[X |Y ] becomes a RV, which is called the conditional
expectation of X given Y .
Exercise 1.4.1. Let X , Y be discrete RVs. Show that for any function g : R → R,
E X [X g (Y ) | Y ] = g (Y )E X [X | Y ]. (101)
Exercise 1.4.2 (Iterated expectation). Let X , Y be discrete RVs. Use Fubini’s theorem to show that
E[X ] = EY [E X [X |Y ]]. (102)

In case X or Y is a continuous RV, we simply replace the sum by an integral and the PMF by the PDF. For instance, if both X and Y are continuous with PDFs f_X, f_Y and joint PDF f_{X,Y}, then

E[X | Y = y] = ∫_{−∞}^∞ x f_{X|Y=y}(x) dx, (103)

where f_{X|Y=y} is the conditional PDF of X given Y = y defined by

f_{X|Y=y}(x) = f_{X,Y}(x, y) / f_Y(y). (104)

To summarize how we compute the iterated expectations when we condition on a discrete or a continuous RV:

E[E[X | Y]] = ∑_y E[X | Y = y] P(Y = y) if Y is discrete;
E[E[X | Y]] = ∫_{−∞}^∞ E[X | Y = y] f_Y(y) dy if Y is continuous. (105)
An important use of iterated expectation is that we can compute probabilities using conditioning,
since probability of an event is simply the expectation of the corresponding indicator variable.

Exercise 1.4.3 (Iterated expectation for probability). Let X, Y be RVs.
(i) For any subset A ⊆ R, show that P(X ∈ A) = E[1(X ∈ A)].
(ii) By using iterated expectation, show that

P(X ∈ A) = E_Y[P(X ∈ A | Y)]. (106)

Proposition 1.4.4. Let X be a discrete random variable and Y = (Y1 , . . . , Yk ) a discrete random vector. Let
g : Rk → R be a function.
(i) If X and Y are independent, then E[X | Y] = E[X ].
(ii) Assume that E[|g (Y)|] < ∞. Then E[g (Y) | Y] = g (Y).
(iii) Assume that E[|X g (Y)|] < ∞. Then E[X g (Y) | Y] = E[X | Y] g (Y).

PROOF. To show (i), note that

E[X | Y = y] = ∑_x x P(X = x | Y = y) = ∑_x x P(X = x) = E[X], (107)

where the second equality follows from the independence of X and Y.
To show (ii), note that

E[g(Y) | Y = y] = ∑_{y'} g(y') P(Y = y' | Y = y) = g(y) P(Y = y | Y = y) = g(y). (108)

Since this holds for all y ∈ R^k, the assertion holds.
Lastly, to show (iii), note that

E[X g(Y) | Y = y] = ∑_{x, y'} x g(y') P(X = x, Y = y' | Y = y) (109)
= ∑_x x g(y) P(X = x, Y = y | Y = y) (110)
= ∑_x x g(y) P(X = x | Y = y) (111)
= g(y) ∑_x x P(X = x | Y = y) (112)
= g(y) E[X | Y = y]. (113)

Since this holds for all y ∈ R^k, the assertion holds. □



1.5. Conditional variance


As we have defined conditional expectation, we can define the variance of a RV X given that another RV Y takes a particular value. Recall that the (unconditioned) variance of X is defined by

Var(X) = E[(X − E(X))²]. (114)

Note that there are two places where we take expectation. Given Y, we should improve both expectations, so the conditional variance of X given Y is defined by

Var(X | Y) = E[(X − E[X | Y])² | Y]. (115)
Proposition 1.5.1. Let X and Y be RVs. Then

Var(X | Y) = E[X² | Y] − E[X | Y]². (116)

PROOF. Using linearity of conditional expectation and the fact that E[X | Y] is not random given Y,

Var(X | Y) = E[X² − 2X E[X | Y] + E[X | Y]² | Y] (117)
= E[X² | Y] − E[2X E[X | Y] | Y] + E[E[X | Y]² | Y] (118)
= E[X² | Y] − E[X | Y] E[2X | Y] + E[X | Y]² E[1 | Y] (119)
= E[X² | Y] − 2E[X | Y]² + E[X | Y]² (120)
= E[X² | Y] − E[X | Y]². (121)
□
The following exercise explains in what sense the conditional expectation E[X | Y] is the best guess on X given Y, and shows that the minimum possible mean squared error is exactly the conditional variance Var(X | Y).

Exercise 1.5.2. Let X, Y be RVs. For any function g : R → R, consider g(Y) as an estimator of X. Let E[(X − g(Y))² | Y] be the mean squared error. Recall that the conditional variance Var(X | Y) is defined by

Var(X | Y) = E[(X − E[X | Y])² | Y]. (122)

(i) Show that

E[(X − g(Y))² | Y] = E[X² | Y] − 2g(Y) E[X | Y] + g(Y)² (123)
= (g(Y) − E[X | Y])² + E[X² | Y] − E[X | Y]² (124)
= (g(Y) − E[X | Y])² + Var(X | Y). (125)

(ii) Conclude that the mean squared error is minimized when g(Y) = E[X | Y] and the global minimum is Var(X | Y).
(iii) Show the law of total variance:

Var(X) = E(Var(X | Y)) + Var(E[X | Y]). (126)
Next, we study how we can decompose the variance of X by conditioning on Y .
Proposition 1.5.3 (Law of total variance). Let X and Y be RVs. Then
Var(X ) = E(Var(X | Y )) + Var(E[X | Y ]). (127)
PROOF. Using the previous result, iterated expectation, and linearity of expectation, we have

Var(X) = E(X²) − (E(X))² (128)
= E_Y(E(X² | Y)) − (E_Y(E(X | Y)))² (129)
= E_Y(Var(X | Y) + (E(X | Y))²) − (E_Y(E(X | Y)))² (130)
= E_Y(Var(X | Y)) + [E_Y((E(X | Y))²) − (E_Y(E(X | Y)))²] (131)
= E_Y(Var(X | Y)) + Var_Y(E(X | Y)). (132)
□

Here is a handwavy explanation of why the above is true. Given Y, we should measure the fluctuation of X | Y around the conditional expectation E[X | Y], and this is measured by Var(X | Y). Since we don't know Y, we average over all Y, giving E(Var(X | Y)). But the reference point E[X | Y] itself varies with Y, so we should also measure its own fluctuation by Var(E[X | Y]). These fluctuations add up nicely, as in the Pythagorean theorem, because E[X | Y] is an optimal estimator, so that these two fluctuations are 'orthogonal'.

Exercise 1.5.4. Let X , Y be RVs and denote X̂ = E[X |Y ], meaning that X̂ is an estimator of X given Y . Let
X̃ = X̂ − X be the estimation error.
(i) Show that X̂ is an unbiased estimator of X , that is, E( X̂ ) = E(X ).
(ii) Show that E[ X̂ |Y ] = X̂ . Hence knowing Y does not improve our current best guess X̂ .
(iii) Show that E[ X̃ ] = 0.
(iv) Show that Cov( X̂ , X̃ ) = 0. Conclude that

Var(X ) = Var( X̂ ) + Var( X̃ ). (133)

Exercise 1.5.5. Let X, Y be RVs. Write X̄ = E[X | Y] and X̃ = X − E[X | Y] so that X = X̄ + X̃. Here X̄ is the estimate of X given Y and X̃ is the estimation error.
(i) Using Exercise 1.5.4 (iii) and iterated expectation, show that

E[X̃²] = E[Var(X | Y)]. (134)

(ii) Using Exercise 1.5.4 (iv), conclude that

Var(X) = E(Var(X | Y)) + Var(E[X | Y]). (135)

Example 1.5.6. Let Y ∼ Uniform([0, 1]) and X ∼ Binomial(n, Y). Since X | Y = y ∼ Binomial(n, y), we have E[X | Y = y] = ny and Var(X | Y = y) = ny(1 − y). Also, since Y ∼ Uniform([0, 1]), we have

Var(E[X | Y]) = Var(nY) = n²/12. (136)

So by iterated expectation, we get

E(X) = E_Y(E[X | Y]) = ∫_0^1 ny dy = n/2. (137)

On the other hand, by the law of total variance,

Var(X) = E(Var(X | Y)) + Var(E(X | Y)) (138)
= ∫_0^1 ny(1 − y) dy + Var(nY) (139)
= n [y²/2 − y³/3]_0^1 + n²/12 (140)
= n²/12 + n/6. (141)
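A quick Monte Carlo check of this example (with the illustrative choice n = 12, so that n/2 = 6 and n²/12 + n/6 = 14):

```python
# A simulation check of Example 1.5.6: sample Y ~ Uniform([0,1]), then
# X ~ Binomial(n, Y); the sample mean and variance of X should approach
# n/2 and n^2/12 + n/6 (for n = 12: 6 and 14).
import random

n, trials = 12, 200_000
xs = []
for _ in range(trials):
    y = random.random()                                # Y ~ Uniform([0,1])
    xs.append(sum(random.random() < y for _ in range(n)))  # X | Y ~ Binomial(n, Y)
m = sum(xs) / trials
v = sum((x - m) ** 2 for x in xs) / trials
print(m, v)  # approximately 6 and 14
```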

Exercise 1.5.7. Suppose we have a stick of length L. Break it into two pieces at a uniformly chosen point
and let X 1 be the length of the longer piece. Break this longer piece into two pieces at a uniformly chosen
point and let X 2 be the length of the longer one. Define X 3 , X 4 , · · · in a similar way.
(i) Let U ∼ Uniform([0, L]). Show that X 1 takes values from [L/2, L], and that X 1 = max(U , L −U ).

(ii) From (i), deduce that for any L/2 ≤ x ≤ L, we have

P(X_1 ≥ x) = P(U ≥ x or L − U ≥ x) = P(U ≥ x) + P(U ≤ L − x) = 2(L − x)/L. (142)

Conclude that X_1 ∼ Uniform([L/2, L]). What is E[X_1]?
(iii) Show that X_2 ∼ Uniform([x_1/2, x_1]) conditional on X_1 = x_1. Using iterated expectation, show that E[X_2] = (3/4)² L.
(iv) In general, show that X_{n+1} | X_n ∼ Uniform([X_n/2, X_n]). Conclude that E[X_n] = (3/4)^n L.
Exercise 1.5.8 (Exercise 1.5.7 continued). Let X_1, X_2, ··· , X_n be as in Exercise 1.5.7.
(i) Show that Var(X_1) = L²/48.
(ii) Show that Var(X_2) = (7/12) Var(X_1) + (1/48) E(X_1)².
(iii) Show that Var(X_{n+1}) = (7/12) Var(X_n) + (1/48) E(X_n)² for any n ≥ 1.
(iv) Using Exercise 1.5.7, show that the following recursion on the variance holds:

Var(X_{n+1}) = (7/12) Var(X_n) + (1/48)(9/16)^n L². (143)

Furthermore, compute Var(X_2) and Var(X_3).
(v)* Let A_n = (16/9)^n Var(X_n). Show that the A_n's satisfy

A_{n+1} + L² = (28/27)(A_n + L²). (144)

(vi)* Show that A_n = (28/27)^{n−1}(A_1 + L²) − L² for all n ≥ 1.
(vii)* Conclude that

Var(X_n) = [(7/12)^n − (9/16)^n] L². (145)
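A short simulation of the stick-breaking scheme (with L = 1) can be used to check the formulas above numerically, e.g. E[X_3] = (3/4)³ ≈ 0.4219:

```python
# A simulation of the stick breaking in Exercises 1.5.7-1.5.8 (with L = 1):
# repeatedly break the longer piece at a uniform point and keep the longer part.
import random

def longer_pieces(n_steps, L=1.0):
    x = L
    out = []
    for _ in range(n_steps):
        u = random.uniform(0, x)
        x = max(u, x - u)   # X_{n+1}: length of the longer piece
        out.append(x)
    return out

trials, n = 100_000, 3
mean_xn = sum(longer_pieces(n)[-1] for _ in range(trials)) / trials
print(mean_xn, 0.75 ** n)  # both near 0.4219
```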

1.6. Elementary limit theorems


The primary subject in this section is sequences of independent and identically distributed (i.i.d.) RVs and their partial sums. Namely, let X_1, X_2, ··· be an (infinite) sequence of i.i.d. RVs, and define their nth partial sum S_n = X_1 + X_2 + ··· + X_n for all n ≥ 1. If we call X_i the ith step size or increment, then the sequence of RVs (S_n)_{n≥1} is called a random walk, where we usually set S_0 = 0. Think of X_i as the gain or loss after betting once in a casino. Then S_n is the net gain of fortune after betting n times. Of course there are ups and downs in the short term, but what we want to analyze using probability theory is the long-term behavior of the random walk (S_n)_{n≥1}. Results of this type are called limit theorems.
Suppose each increment X_k has a finite mean µ. Then by linearity of expectation and independence of the increments, we have

E(S_n/n) = E[S_n]/n = µ, (146)
Var(S_n/n) = Var(S_n)/n² = n Var(X_1)/n² = Var(X_1)/n. (147)

So the sample mean S_n/n has constant expectation and shrinking variance. Hence it makes sense to guess that it should behave as the constant µ, without taking the expectation. That is,

lim_{n→∞} S_n/n = µ. (148)

But this expression is shaky, since the left hand side is a limit of RVs while the right hand side is a constant. In what sense do the random sample means converge to µ? This is the content of the laws of large numbers, for which we will prove a weak and a strong version.
The first limit theorem we will encounter is called the Weak Law of Large Numbers (WLLN), which is
stated below:

FIGURE 2. Simulation of simple random walks

Theorem 1.6.1 (WLLN). Let (X_k)_{k≥1} be i.i.d. RVs with mean µ < ∞ and let S_n = ∑_{k=1}^n X_k, n ≥ 1, be a random walk. Then for any positive constant ε > 0,

lim_{n→∞} P(|S_n/n − µ| > ε) = 0. (149)
In words, the probability that the sample mean S n /n is not within ε distance from its expectation µ
decays to zero as n tends to infinity. In this case, we say the sequence of RVs (S n /n)n≥1 converges to µ in
probability.
The second version of the law of large numbers is called the strong law of large numbers (SLLN), which is available if the increments have finite variance.

Theorem 1.6.2 (SLLN). Let (X_k)_{k≥1} be i.i.d. RVs and let S_n = ∑_{k=1}^n X_k, n ≥ 1, be a random walk. Suppose E[X_1] = µ < ∞ and E[X_1²] < ∞. Then

P(lim_{n→∞} S_n/n = µ) = 1. (150)

To make sense of this, notice that the limit of the sample means, lim_{n→∞} S_n/n, is itself a RV. Then the SLLN says that this RV is well defined and its value is µ with probability 1. In this case, we say the sequence of RVs (S_n/n)_{n≥1} converges to µ with probability 1 or almost surely.
Perhaps one of the most celebrated theorems in probability theory is the central limit theorem (CLT), which describes how the sample mean S_n/n "fluctuates" around its mean µ. From (147), if we denote σ² = Var(X_1) < ∞, we know that Var(S_n/n) = σ²/n → 0 as n → ∞. So the fluctuation decays as we add up more increments. To see the effect of the fluctuation, we first center the sample mean by subtracting its expectation and "zoom in" by dividing by the standard deviation σ/√n. This is where the name 'central limit' comes from: it describes the limit of centered random walks.
Theorem 1.6.3 (CLT). Let (X_k)_{k≥1} be i.i.d. RVs and let S_n = ∑_{k=1}^n X_k, n ≥ 1, be a random walk. Suppose E[X_1] = µ < ∞ and Var(X_1) = σ² < ∞. Let Z ∼ N(0, 1) be a standard normal RV and define

Z_n = (S_n − µn)/(σ√n) = (S_n/n − µ)/(σ/√n). (151)

Then for all z ∈ R,

lim_{n→∞} P(Z_n ≤ z) = P(Z ≤ z) = (1/√(2π)) ∫_{−∞}^z e^{−x²/2} dx. (152)

In words, the centered and rescaled RV Z_n is asymptotically distributed as a standard normal RV Z ∼ N(0, 1). In this case, we say Z_n converges to Z as n → ∞ in distribution. This is a remarkable result since, as long as the increments X_k have finite mean and variance, it does not matter which distribution they follow: the 'central limit' always looks like a standard normal distribution. We will not delve into proof details in this course, but we do mention that in a standard proof of the CLT, one first computes the moment generating function (MGF) of S_n, Taylor-expands it up to the second order term, and then uses the uniqueness of MGFs.
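The CLT is easy to see numerically. The following sketch standardizes a Bernoulli(1/2) random walk and estimates P(Z_n ≤ 1), which should be close to Φ(1) ≈ 0.8413:

```python
# An illustrative check of the CLT with Bernoulli(1/2) increments: the
# standardized sum Z_n is approximately standard normal for large n.
import random, math

n, trials = 1_000, 100_000
mu, sigma = 0.5, 0.5          # mean and std dev of a Bernoulli(1/2) increment
count = 0
for _ in range(trials):
    s = sum(random.random() < 0.5 for _ in range(n))   # S_n
    z = (s - mu * n) / (sigma * math.sqrt(n))          # Z_n as in (151)
    count += (z <= 1)
print(count / trials)  # close to Phi(1) = 0.8413
```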

CHAPTER 2

Renewal Processes
2.1. Definition of renewal processes and Renewal SLLN
An arrival process is a sequence of strictly increasing RVs 0 < T_1 < T_2 < ··· with the convention of setting T_0 = 0. For each integer k ≥ 1, its kth inter-arrival time is defined by τ_k = T_k − T_{k−1}. For a given arrival process (T_k)_{k≥1}, the associated counting process (N(t))_{t≥0} is defined by

N(t) = ∑_{k=1}^∞ 1(T_k ≤ t) = #(arrivals up to time t). (153)

Note that these three processes (arrival times, inter-arrival times, and counting) determine each other:

(T_k)_{k≥1} ⟺ (τ_k)_{k≥1} ⟺ (N(t))_{t≥0}. (154)


Exercise 2.1.1. Let (T_k)_{k≥1} be any arrival process and let (N(t))_{t≥0} be its associated counting process. Show that these two processes determine each other by the following relation:

{T_n ≤ t} = {N(t) ≥ n}. (155)

In words, the nth customer arrives by time t if and only if at least n customers arrive up to time t.

FIGURE 1. Illustration of a continuous-time arrival process (T_k)_{k≥1} and its associated counting process (N(t))_{t≥0}. The τ_k's denote inter-arrival times. N(t) ≡ 3 for T_3 < t ≤ T_4.

Exercise 2.1.2 (Well-definedness of the counting process). Let (N(t))_{t≥0} denote the counting process of an arrival process with inter-arrival times (τ_k)_{k≥1} whose expectations are at least some positive constant δ > 0 (but not necessarily independent or identically distributed). Recall that N(t) and the τ_k's are associated as

N(t) = ∑_{n=1}^∞ 1(τ_1 + ··· + τ_n ≤ t). (156)

(i) Show that T_{N(t)} ≤ t for all t ≥ 0. Deduce that

E[N(t)/t] ≤ E[N(t)/T_{N(t)}] ≤ 1/E[T_{N(t)}/N(t)] ≤ 1/δ, (157)

where the middle inequality is by Jensen's inequality.

(ii) From (i), deduce that E[N (t )] ≤ t /δ < ∞ for all t ≥ 0. Furthermore, deduce that P(N (t ) < ∞) = 1 for
all t ≥ 0.

Definition 2.1.3 (Renewal process). A counting process (N (t ))t ≥0 is called a renewal process if its inter-
arrival times τ1 , τ2 , · · · are i.i.d. with E[τ1 ] < ∞.

A cornerstone in the theory of renewal processes is the following strong law of large numbers for
renewal processes. We first recall the strong law of large numbers.
Theorem 2.1.4 (SLLN). Let (X_k)_{k≥1} be i.i.d. RVs and let S_n = ∑_{k=1}^n X_k, n ≥ 1, be a random walk. Suppose E[|X_1|] < ∞ and let E[X_1] = µ. Then

P(lim_{n→∞} S_n/n = µ) = 1. (158)

PROOF. See Durrett [Dur10, Thm. 2.4.1]. □


Exercise 2.1.5. Let (X_k)_{k≥1} be i.i.d. RVs and let S_n = ∑_{k=1}^n X_k, n ≥ 1, be a random walk. Suppose E[|X_1|] < ∞ and let E[X_1] = µ. Let (N(t))_{t≥0} be any counting process such that N(t) → ∞ as t → ∞. Show that

P(lim_{t→∞} S_{N(t)}/N(t) = µ) = 1. (159)

Theorem 2.1.6 (Renewal SLLN). Let (T_k)_{k≥0} be a renewal process and let (τ_k)_{k≥0} and (N(t))_{t≥0} be the associated inter-arrival times and counting process, respectively. Let E[τ_k] = µ be the mean inter-arrival time. If 0 < µ < ∞, then

P(lim_{t→∞} N(t)/t = 1/µ) = 1. (160)
PROOF. First, write T_k = τ_1 + τ_2 + ··· + τ_k. Since the inter-arrival times are i.i.d. with mean µ < ∞, the strong law of large numbers implies

P(lim_{k→∞} T_k/k = µ) = 1. (161)

Next, fix t ≥ 0 and let N(t) = n, so that there are n arrivals in total up to time t. Then the nth arrival time T_n must occur by time t, whereas the (n+1)st arrival time T_{n+1} must occur after time t. Hence T_n ≤ t < T_{n+1}. In general, we have

T_{N(t)} ≤ t < T_{N(t)+1}. (162)

Dividing by N(t), we get

T_{N(t)}/N(t) ≤ t/N(t) < (T_{N(t)+1}/(N(t)+1)) · ((N(t)+1)/N(t)). (163)

To take the limit as t → ∞, we note that P(T_k < ∞) = 1 for all k since E[τ_k] = µ < ∞. This yields N(t) ≥ k for all large enough t. Since k was arbitrary, this yields N(t) ↗ ∞ as t → ∞ with probability 1. Therefore, according to (161) and Exercise 2.1.5, we get

P(lim_{t→∞} T_{N(t)}/N(t) = µ) = P(lim_{t→∞} T_{N(t)+1}/(N(t)+1) = µ) = P(lim_{t→∞} (N(t)+1)/N(t) = 1) = 1. (164)

Hence (163) gives

P(lim_{t→∞} t/N(t) = µ) = 1. (165)

Since µ > 0, we can take the reciprocal inside the above probability. This shows the assertion. □
FIGURE 2. Illustration of the inequalities (163).

Example 2.1.7 (Poisson process). Suppose (N(t))_{t≥0} is a renewal process whose inter-arrival times (τ_k)_{k≥1} have the Exp(λ) distribution for some λ > 0. In this case, we call (N(t))_{t≥0} a "Poisson process with arrival rate λ". Note that the mean inter-arrival time is E[τ_1] = 1/λ, so the renewal SLLN yields

P(lim_{t→∞} N(t)/t = λ) = 1. (166)

Namely, with probability 1, we tend to see about λt arrivals during [0, t] as t → ∞. In other words, we tend to see λ arrivals during an interval of unit length. Hence it makes sense to call the parameter λ the 'arrival rate' of the process. ▲
𝑁(𝑡) = 3
2.2. Renewal reward processes and Renewal Reward SLLN
In this section, we consider a renewal process together with rewards, which are given at each arrival time. This simple extension of renewal processes will greatly improve the applicability of our theory.
Let (Tk )k≥0 be a renewal process and let (τk )k≥0 and (N (t ))t ≥0 be the associated inter-arrival times
and counting process, respectively. Suppose we have a sequence of rewards (Y_k)_{k≥1}, where a reward of Y_k is given at the kth arrival time T_k. We define the reward process (R(t))_{t≥0} associated with the sequence (τ_k, Y_k)_{k≥1} as

R(t) = ∑_{k=1}^{N(t)} Y_k. (167)
Namely, upon the kth arrival at time Tk = τ1 + · · · + τk , we receive a reward of Yk . Then R(t ) is the total
reward up to time t .
As we looked at the average number of arrivals N (t )/t as t → ∞, a natural quantity to look at for the
reward process is the ‘average reward’ R(t )/t as t → ∞. Intuitively, since everything refreshes upon new
arrivals, we should expect
R(t)/t → (expected reward during one 'cycle') / (expected duration of one 'cycle') (168)
as t → ∞ almost surely. This is made precise by the following result.
Theorem 2.2.1 (Renewal reward SLLN). Let (R(t))_{t≥0} be the reward process associated with an i.i.d. sequence of inter-arrival times (τ_k)_{k≥1} and an i.i.d. sequence of rewards (Y_k)_{k≥1}. Suppose E[Y_k] < ∞ and E[τ_k] ∈ (0, ∞). Then

P(lim_{t→∞} R(t)/t = E[Y_1]/E[τ_1]) = 1. (169)

PROOF. Let (τ_k)_{k≥0} denote the inter-arrival times for the renewal process (T_k)_{k≥0}. Note that

R(t)/t = ((1/N(t)) ∑_{k=1}^{N(t)} Y_k) · (N(t)/t). (170)

Hence by Exercise 2.1.5, the 'average reward' in the bracket converges to E[Y_1] almost surely. Moreover, the average number of arrivals N(t)/t converges to 1/E[τ_1] by Theorem 2.1.6. Hence the assertion follows. □
Remark 2.2.2. Theorem 2.1.6 can be obtained as a special case of the above reward version of the SLLN, simply by choosing Y_k ≡ 1 for all k ≥ 1, so that R(t) = N(t).
Example 2.2.3 (Long run car costs). This example is excerpted from [Dur99]. Mr. White does not drive the same car more than t* years, where t* > 0 is some fixed number. He changes to a new car when the old one breaks down or reaches t* years. Let X_k be the lifetime of the kth car that Mr. White drives; these are i.i.d. with finite expectation. Let τ_k be the duration of his kth car. According to his policy, we have

τ_k = min(X_k, t*). (171)

Let T_k = τ_1 + ··· + τ_k be the time that Mr. White is done with the kth car. Then (T_k)_{k≥0} is a renewal process. Note that the expected running time for the kth car is

E[τ_k] = E[τ_k | X_k < t*] P(X_k < t*) + E[τ_k | X_k ≥ t*] P(X_k ≥ t*) (172)
= E[X_k | X_k < t*] P(X_k < t*) + t* P(X_k ≥ t*). (173)

Suppose that the car cost g during each cycle is given by

g(t) = A + B if t < t*, and g(t) = A if t ≥ t*. (174)

Namely, if the car breaks down by t* years, then Mr. White has to pay A + B dollars; otherwise, the cost is only A dollars. Then the expected cost for one cycle is

E[g(τ_k)] = A + B P(τ_k < t*) = A + B P(X_k < t*). (175)

Thus by Theorem 2.2.1, the long-run car cost of Mr. White is

lim_{t→∞} R(t)/t = E[g(τ_k)]/E[τ_k] = (A + B P(X_k < t*)) / (E[X_k | X_k < t*] P(X_k < t*) + t* P(X_k ≥ t*)). (176)

For a more concrete example, let X_k ∼ Uniform([0, 10]) and let A = 10 and B = 3. Then

E[g(τ_k)] = 10 + 3t*/10. (177)

On the other hand,

E[τ_k] = E[X_k | X_k < t*] P(X_k < t*) + t* P(X_k ≥ t*) (178)
= (t*/2)(t*/10) + t* (10 − t*)/10 = t* − (t*)²/20. (179)

Here, for E[X_k | X_k < t*] = t*/2, observe that a uniform RV over [0, 10] conditioned on being in [0, t*] is uniformly distributed over [0, t*]. This yields

E[g(τ_k)]/E[τ_k] = (10 + 0.3t*) / (t*(1 − t*/20)). (180)

Lastly, in order to minimize the above long-run cost, we differentiate it in t* and find the global minimum. A straightforward computation shows that the long-run cost is minimized at

t* = (−1 + √1.6)/0.03 ≈ 8.83. (181)

Thus the optimal strategy for Mr. White in this situation is to drive each car up to 8.83 years. ▲
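The optimal t* can also be found numerically; the following sketch minimizes (180) by a simple grid search:

```python
# A numeric check of Example 2.2.3: minimize the long-run cost
# (10 + 0.3 t) / (t (1 - t/20)) over t in (0, 10] by a grid search.
def cost(t):
    return (10 + 0.3 * t) / (t * (1 - t / 20))

ts = [i / 1000 for i in range(1, 10001)]   # grid on (0, 10]
t_star = min(ts, key=cost)
print(t_star, cost(t_star))  # t* ~ 8.83, matching (-1 + sqrt(1.6)) / 0.03
```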
CHAPTER 3

Markov chains

Say we would like to model the USD price of bitcoin. We could observe the actual price at every hour and record it by a sequence of real numbers x_1, x_2, ···. However, it is more interesting to build a 'model' that could predict the price of bitcoin at time t, or at least give some meaningful insight into how the actual bitcoin price behaves over time. Since there are so many factors affecting the price at every time, it might be reasonable that the price at time t should be given by a certain RV, say X_t. Then our sequence of predictions would be a sequence of RVs, (X_t)_{t≥0}. This is an example of a stochastic process. Here 'process' means that we are not interested in just a single RV, but in their sequence as a whole; 'stochastic' means that the way the RVs evolve in time might be random.
In this note, we will be studying a very important class of stochastic processes called Markov chains. The importance of Markov chains lies in two places: 1) they are applicable to a wide range of physical, biological, social, and economical phenomena, and 2) the theory is well-established and we can actually compute and make predictions using the models.

3.1. Definition and examples


Roughly speaking, Markov processes are used to model temporally changing systems where future
state only depends on the current state. For instance, if the price of bitcoin tomorrow depends only on
its price today, then bitcoin price can be modeled as a Markov process. (Of course, the entire history of
price often affects decisions of buyers/sellers so it may not be a realistic assumption.)
Even though Markov processes can be defined in vast generality, we concentrate on the simplest setting where the state and time are both discrete. Let Ω = {1, 2, ··· , m} be a finite set, which we call the
state space. Consider a sequence (X t )t ≥0 of Ω-valued RVs, which we call a chain. We call the value of X t
the state of the chain at time t . In order to narrow down the way the chain (X t )t ≥0 behaves, we introduce
the following properties:
(i) (Markov property) The distribution of X t +1 given the history X 0 , X 1 , · · · , X t depends only on X t . That
is, for any values of j 0 , j 1 , · · · , j t , k ∈ Ω,

P(X t +1 = k | X t = j t , X t −1 = j t −1 , · · · , X 1 = j 1 , X 0 = j 0 ) = P(X t +1 = k | X t = j t ). (182)

(ii) (Time-homogeneity) The transition probabilities

p i j = P(X t +1 = j | X t = i ) i, j ∈ Ω (183)

do not depend on t .
When the chain (X t )t ≥0 satisfies the above two properties, we say it is a (discrete-time and time-homogeneous)
Markov chain. We define the transition matrix P to be the m × m matrix of transition probabilities:

P = (p_{ij})_{1≤i,j≤m} =
[ p_{11}  p_{12}  ···  p_{1m}
  p_{21}  p_{22}  ···  p_{2m}
  ⋮       ⋮            ⋮
  p_{m1}  p_{m2}  ···  p_{mm} ]. (184)


Finally, since the state X t of the chain is a RV, we represent its probability mass function (PMF) via a row
vector
rt = [P(X t = 1), P(X t = 2), · · · , P(X t = m)]. (185)
Example 3.1.1. Let Ω = {1, 2} and let (X_t)_{t≥0} be a Markov chain on Ω with the following transition matrix

P =
[ p_{11}  p_{12}
  p_{21}  p_{22} ]. (186)

We can also represent this Markov chain pictorially as in Figure 1, which is called the 'state space diagram' of the chain (X_t)_{t≥0}.

FIGURE 1. State space diagram of a 2-state Markov chain

For a concrete example, suppose


p 11 = 0.2, p 12 = 0.8, p 21 = 0.6, p 22 = 0.4. (187)
If the initial state of the chain X 0 is 1, then
P(X 1 = 1) = P(X 1 = 1 | X 0 = 1)P(X 0 = 1) + P(X 1 = 1 | X 0 = 2)P(X 0 = 2) (188)
= P(X 1 = 1 | X 0 = 1) = p 11 = 0.2 (189)
and similarly,
P(X 1 = 2) = P(X 1 = 2 | X 0 = 1)P(X 0 = 1) + P(X 1 = 2 | X 0 = 2)P(X 0 = 2) (190)
= P(X 1 = 2 | X 0 = 1) = p 12 = 0.8. (191)
Also we can compute the distribution of X 2 . For example,
P(X 2 = 1) = P(X 2 = 1 | X 1 = 1)P(X 1 = 1) + P(X 2 = 1 | X 1 = 2)P(X 1 = 2) (192)
= p 11 P(X 1 = 1) + p 21 P(X 1 = 2) (193)
= 0.2 · 0.2 + 0.6 · 0.8 = 0.04 + 0.48 = 0.52. (194)
In general, the distribution of X t +1 can be computed from that of X t via a simple linear algebra. Note
that for i = 1, 2,
P(X t +1 = i ) = P(X t +1 = i | X t = 1)P(X t = 1) + P(X t +1 = i | X t = 2)P(X t = 2) (195)
= p 1i P(X t = 1) + p 2i P(X t = 2). (196)
This can be written as
[P(X_{t+1} = 1), P(X_{t+1} = 2)] = [P(X_t = 1), P(X_t = 2)]
[ p_{11}  p_{12}
  p_{21}  p_{22} ]. (197)
That is, if we represent the distribution of X_t as a row vector, then the distribution of X_{t+1} is obtained by multiplying this row vector by the transition matrix P on the right. ▲
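The computation above is easy to mechanize. The following Python sketch iterates the update r_{t+1} = r_t P for the transition matrix in (187), reproducing P(X_2 = 1) = 0.52:

```python
# A minimal sketch of the distribution update r_{t+1} = r_t P for the
# 2-state chain of Example 3.1.1, starting from X_0 = 1.
def step(r, P):
    m = len(P)
    return [sum(r[j] * P[j][i] for j in range(m)) for i in range(m)]

P = [[0.2, 0.8],
     [0.6, 0.4]]
r = [1.0, 0.0]          # X_0 = 1 with probability 1
for t in range(3):
    r = step(r, P)
    print(t + 1, r)     # t=1: [0.2, 0.8]; t=2: [0.52, 0.48]; ...
```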
Example 3.1.2 (Gambler's ruin). Suppose a gambler has a fortune of k dollars initially and starts gambling. At each time he wins or loses 1 dollar independently with probability p and 1 − p, respectively. The game ends when his fortune reaches either 0 or N dollars. What is the probability that he wins N dollars and goes home happy?
We use Markov chains to model his fortune after betting t times. Namely, let Ω = {0, 1, 2, ··· , N} be the state space. Let (X_t)_{t≥0} be a sequence of RVs where X_t is the gambler's fortune after betting t times. We first draw the state space diagram for N = 4 below.
FIGURE 2. State space diagram of a 5-state gambler's chain

Next, we can write down its transition probabilities as
P(X_{t+1} = k + 1 | X_t = k) = p for 1 ≤ k < N,
P(X_{t+1} = k − 1 | X_t = k) = 1 − p for 1 ≤ k < N,
P(X_{t+1} = 0 | X_t = 0) = 1,
P(X_{t+1} = N | X_t = N) = 1. (198)

For example, the transition matrix P for N = 5 is given by

P =
[ 1    0    0    0    0    0
  1−p  0    p    0    0    0
  0    1−p  0    p    0    0
  0    0    1−p  0    p    0
  0    0    0    1−p  0    p
  0    0    0    0    0    1 ]. (199)
We call the resulting Markov chain (X t )t ≥0 the gambler’s chain. ▲
Example 3.1.3 (Ehrenfest Chain). This chain is originated from the physics literature as a model for two
cubical volumes of air connected by a thin 𝑝tunnel. Suppose there𝑃 are total N indistinguishable balls split
𝑝
1 0
into two “urns” A and B . At each1step, we pick up one 2 of the N balls uniformly
3 4 move
at random, and 1 it to
1−𝑝
the other urn. Let X t denote the number1of − balls
𝑝 1 − t𝑝 steps. This is a Markov chain called the
in urn A after
Ehrenfest chain. (See the state space diagram in Figure 3.)

1 3/4 1/2 1/4


0 1 2 3 4
1/4 1/2 3/4 1

F IGURE 3. State space diagram of the Ehrenfest chain with 4 balls

It is easy to figure out the transition probabilities by considering different cases. If X t = k, then urn
B has N − k balls at time t . If 0 < k < N , then with probability k/N we move one ball from A to B and
with probability (N − k)/N we move one from B to A. If k = 0, then we must pick up a ball from urn
B so X t +1 = 1 with probability 1. If k = N , then we must move one from A to B and X t +1 = N − 1 with
probability 1. Hence, the transition kernel is given by


P(X t +1 = k + 1 | X t = k) = (N − k)/N ∀0 ≤ k < N


P(X
t +1 = k − 1 | X t = k) = k/N ∀0 < k ≤ N
(200)

P (X +1 = 1 | X = 0) = 1


t t

P(X t +1 = N − 1 | X t = N ) = 1.
For example, the transition matrix P for N = 5 is given by
 
0 1 0 0 0 0
1/5 0 4/5 0 0 
 0 
 
 0 2/5 0 3/5 0 0 
P = . (201)
 0 0 3/5 0 2/5 0 
 
 0 0 0 4/5 0 1/5
0 0 0 0 1 0
3.1. DEFINITION AND EXAMPLES 28


Exercise 3.1.4. Repeat rolling two four sided dices with numbers 1, 2, 3, and 4 on them. Let Yk be the
some of the two dice at the kth roll. Let S n = Y1 + Y2 + · · · + Yn be the total of the first n rolls, and define
X t = S t (mod6). Show that (X t )t ≥0 is a Markov chain on the state space Ω = {0, 1, 2, 3, 4, 5}. Furthermore,
identify its transition matrix.
We generalize our observation in Example 3.1.1 in the following exercise.
Exercise 3.1.5. Let (X t )t ≥0 be a Markov chain on state space Ω = {1, 2, · · · , m} with transition matrix P =
(p i j )1≤i , j ≤m . Let rt = [P(X t = 1), · · · , P(X t = m)] denote the row vector of the distribution of X t .
(i) Show that for each i ∈ Ω,
X
m
P(X t +1 = i ) = p j i P(X t = j ). (202)
j =1

(ii) Show that for each t ≥ 0,


rt +1 = rt P. (203)
(iii) Show by induction that for each t ≥ 0,
r t = r0 P t . (204)
Exercise 3.1.6. Let (X t )t ≥0 be a Markov chain on state space Ω = {1, 2, · · · , m} with transition matrix P =
(p i j )1≤i , j ≤m .
(i) (Multi-step transition prob.) Show that for each x, y ∈ Ω and a, b ≥ 1,
¡ ¢
P X a+b = y | X a = x = P b (x, y). (205)
Hint: Use Exercise 3.1.5.
(ii) (The Chapman-Kolmogorov eq.) Show that for each x, y ∈ Ω and n, m ≥ 1,
X n
P n+m (x, y) = P (x, z)P m (z, y). (206)
z∈Ω

While right-multiplication of P advances a given row vector of distribution one step forward in time,
left-multiplication of P on a column vector computes the expectation of a given function with respect to
the future distribution. This point is clarified in the following exercise.
Exercise 3.1.7. Let (X t )t ≥0 be a Markov chain on a state space Ω = {1, 2, · · · , m} with transition matrix P .
Let f : Ω → R be a function. Suppose that if the chain X t has state x at time t , then we get a ‘reward’ of
f (x). Let rt = [P(X t = 1), · · · , P(X t = m)] be the distribution of X t . Let v = [ f (1), f (2), · · · , f (m)]T be the
column vector representing the reward function f .
(i) Show that the expected reward at time t is given by
X
m
E[ f (X t )] = f (i )P(X t = i ) = rt v. (207)
i =1
(ii) Use part (i) and Exercise 3.1.5 to show that
E[ f (X t )] = r0 P t v. (208)
Pt
(iii) The total reward up to time t is a RV given by R t = k=0 f (X t ). Show that
E[R t ] = r0 (I + P + P 2 + · · · + P t )v. (209)
Exercise 3.1.8. Suppose that the probability it rains today is 0.4 if neither of the last two days was rainy,
but 0.5 if at least one of the last two days was rainy. Let Ω = {S, R}, where S=sunny and R=rainy. Let Wt
be the weather of day t .
(i) Show that (Wt )t ≥0 is not a Markov chain.
3.2. STATIONARY DISTRIBUTION AND EXAMPLES 29

(ii) Expand the state space into the set of pairs Σ := Ω2 . For each t ≥ 0, define X t = (Wt −1 ,Wt ) ∈ Σ. Show
that (X t )t ≥0 is a Markov chain on Σ. Identify its transition matrix.
(iii) What is the two-step transition matrix?
(iv) What is the probability that it will rain on Wednesday if it didn’t rain on Sunday and Monday?

3.2. Stationary distribution and examples


3.2.1. Stationary distribution. Let (X t )t ≥0 be a Markov chain on state space Ω = {1, 2, · · · , m} with
transition matrix P . If π is a distribution on Ω such that

π = πP, (210)

then we say π is a stationary distribution of the Markov chain (X t )t ≥0 .


In Exercise 3.1.5, we observed that we can simply multiply the transition matrix P to a given row
vector rt of distribution on the state space Ω in order to get the next distribution rt +1 . Hence if the initial
distribution of the chain is π, then its distribution is invariant in time.

Example 3.2.1. Consider the 2-state Markov chain (X t )t ≥0 with transition matrix (as in Exercise 3.1.1)
· ¸
0.2 0.8
P= . (211)
0.6 0.4

Then π = [3/7, 4/7] is a stationary distribution of X t . Indeed,


· ¸
0.2 0.8
[3/7, 4/7] = [3/7, 4/7] . (212)
0.6 0.4

Furthermore, this is the unique stationary distribution. To see this, let π = [π1 , π2 ] be a stationary distri-
bution of X t . Then π = πP gives

0.2π1 + 0.6π2 = π1 (213)


0.8π1 + 0.4π2 = π2 . (214)

These equations lead to

4π1 = 3π2 . (215)

Since π is a probability distribution, π1 + π2 = 1 so π = [3/7, 4/7] is the only solution. This shows the
uniqueness of the stationary distribution for X t . ▲
A Markov chain may have multiple stationary distributions, as the following example illustrates.

Example 3.2.2. Let (X t )t ≥0 be a 2-state Markov chain with transition matrix


· ¸
1 0
P= . (216)
0 1

Then any distribution π = [p, 1 − p] is a stationary distribution for the chain (X t )t ≥0 .

Stationary distributions are closely related with eigenvectors and eigenvalues of the transition matrix
P . Namely, by taking transpose,

πT = P T πT . (217)

Hence, the transpose of any stationary distribution is an eigenvector of P T associated with eigenvalue 1.
More properties of stationary distribution in this line of thought are given by the following exercise.
3.2. STATIONARY DISTRIBUTION AND EXAMPLES 30

Exercise 3.2.3. Given a matrix A, a row vector v, and a real number λ, we say v is a left eigenvector of A
associated with left eigenvalue λ if
vA = λv. (218)
If v is a column vector and if Av = λv, then we say v is a (right) eigenvector associated with (right)
eigenvalue λ. Let (X t )t ≥0 be a Markov chain on state space Ω = {1, 2, · · · , m} with transition matrix P =
(p i j )1≤i , j ≤m .
(i) Show that a distribution π on Ω is a stationary distribution for the chain (X t )t ≥0 if and only if it is a
left eigenvector of P associated with left eigenvalue 1.
(ii) Show that 1 is a right eigenvalue of P with right eigenvector [1, 1, · · · , 1]T .
(iii) Recall that a square matrix and its transpose have the same (right) eigenvalues and corresponding
(right) eigenspaces have the same dimension. Show that the Markov chain (X t )t ≥0 has a unique
stationary distribution if any only if [1, 1, · · · , 1]T spans the (right) eigenspace of P associated
with (right) eigenvalue 1.
In the following exercise, we compute the stationary distribution of the so-called birth-death chain.
Exercise 3.2.4 (Birth-Death chain). Let Ω = {0, 1, 2, · · · , N } be the state space. Let (X t )t ≥0 be a Markov
chain on Ω with transition probabilities


 P(X t +1 = k + 1 | X t = k) = p ∀0 ≤ k < N


P(X
t +1 = k − 1 | X t = k) = 1 − p ∀1 ≤ k ≤ N
𝑝 (219)

 P(X +1 = 0 | X = 0) = 1 − p
𝑝 1  2

t 𝑝 t

𝑝 P(X t +1 = N | X t = N ) = p.
This is called a Birth-Death chain. Its state space diagram is as below.

𝑝 𝑝 𝑃 𝑝
1−𝑝 0 1 2 3 4 𝑝
1−𝑝 1−𝑝 1−𝑝 1−𝑝

F IGURE 4. State space diagram of a 5-state Birth-Death chain

(i) Let π = [π0 , π1 , · · · , πN ] be a distribution on Ω. Show that π is a stationary distribution of the Birth-
Death chain if and only if it satisfy the following ‘balance equation’
pπk = (1 − p)πk+1 0 ≤ k < N. (220)
(ii) Let ρ = p/(1 − p). From (ii), deduce that πk = ρ k π0 for all 0 ≤ k < N .
(iii) Using the normalization condition π0 + π1 + · · · + πN = 1, show that π0 = 1/(1 + ρ + ρ 2 + · · · + ρ N ).
Conclude that
ρk
πk = 0 ≤ k ≤ N. (221)
1 + ρ + ρ2 + · · · + ρ N
Conclude that the Birth-Death chain has a unique stationary distribution given by (221), which
becomes the uniform distribution on Ω when p = 1/2.
Exercise 3.2.5 (Success run chain). Consider a Markov chain (X n )n≥0 on the state space Ω = Z≥0 of non-
negative integers with the following transition matrix P :
P (x, x + 1) = p, P (x, 0) = 1 − p for all x ∈ Z≥0 , (222)
where p ∈ [0, 1] is a fixed parameter. In words, one flips independent coins with success probability p;
every time it comes up heads with probability p X n increases by 1, and when it comes up tails, X n resets
back to 0.
3.2. STATIONARY DISTRIBUTION AND EXAMPLES 31

(i) Let π : Z≥0 → [0, ∞) be a function that satisfies πP = π. Show that the following recursion holds:
X

π(0) = (1 − p) π(k) (223)
k=0
π(1) = pπ(0) (224)
π(2) = pπ(1) (225)
..
. (226)

Deduce that π(k) = p k π(0) for all k ≥ 1.


(ii) Show that π is an stationary distribution on Z≥0 if and only if π(k) = (1 − p)p k for k ≥ 0. (Such π
is called the Geometric(p) distribution that puts probability p k (1 − p) on each k ∈ Z≥0 . The
number of success run of probability-p𝑁 (𝑡)coin flips has this distribution.)

Next, 𝑝 Rate
𝑁(𝑡) we take a look at an example of an 𝜆important
= 𝑝𝜆 class of Markov chains, which is called the ran-
dom walk on graphs. This is the basis of many algorithms involving machine learning on networks (e.g.,
Google’s
Rate 𝜆PageRank). 1 − 𝑝 𝑁 (𝑡)
Example 3.2.6 (Random walk on graphs). We first introduce some notions in graph theory. A graph G
Rate 𝜆 = (1 − 𝑝)𝜆
consists of a pair (V, E ) of sets of nodes V and edges E ⊆ V 2 . A graph G can be concisely represented as a
|V | × |V | matrix AG , which is called the adjacency matrix of G. Namely, the (i , j ) entry of AG is defined by
AG (i , j ) = 1(nodes i and j are adjacent in G). (227)
We say G is simple if (i , j ) ∈ E implies ( j , i ) ∈ E and (i , i ) ∉ E for all i ∈ V . For a simple graph G = (V, E ), we
2
say a node j is adjacent to i if (i , j ) ∈ E . We denote degG (i ) the number of neighbors of i in G, which we
𝑁 (𝑡)
call the degree of i .
1
Consider we𝜆hop around the nodes of a given simple graph G = (V, E ): at each time, we jump from
Rate 𝑁(𝑡)
0
one node to one of the neighbors withor equal probability. For instance, if we are currently at node 2 and if
2 is adjacent𝑁to(𝑡)
3, 5, and 6, then we jump to one of Rate the three
𝜆 = 𝜆 neighbors
+𝜆 with probability 1/3. The location
of this jump process at time t can be described as a Markov chain. Namely, a Markov chain (X t )t ≥0 on
the node setRate
V is𝜆 called a random walk on G if
AG (i , j )
P(X t +1 = j | X t = i ) = . (228)
degG (i )
Note that its transition matrix P is obtained by normalizing each row of the adjacency matrix AG by the
corresponding degree.

3 4
1

𝐺

F IGURE 5. A 4-node simple graph G, its adjacency matrix AG , and associated random walk tran-
sition matrix P

2 What is the stationary distribution of random walk on G? There could be many (see Exercise 3.7.1),
but here is a typical one that always works. Define a probability distribution π on V by
degG (i ) degG (i )
π(i ) = P = . (229)
2 j ∈V degG (i ) 2|E |
3.2. STATIONARY DISTRIBUTION AND EXAMPLES 32

Namely, the probability that X t = i is proportional to the degree of node i . Then π is a stationary distri-
bution for P . Indeed,
X X degG (i ) AG (i , j )
π( j )P ( j , i ) = (230)
i ∈V i ∈V 2|E | degG (i )
1 X degG ( j )
= AG (i , j ) = = π( j ). (231)
2|E | i ∈V 2|E |
Hence if we perform a random walk on facebook network, then we are about ten times more likely to
visit a person of 100 friends than to visit a person of 10 friends. ▲
Next, we will discuss two classes of Markov chains for which it is very easy to compute stationary
distributions.

3.2.2. Reversible chains. A probability distribution µ on Ω is reversible for transition matrix P : Ω2 →


[0, 1] if the following detailed balance equation is satisfied:

µ(x) P (x, y) = µ(y) P (y, x) for all x, y ∈ Ω. (232)

Below we observe that reversiblity implies stationarity.

Proposition 3.2.7. Let P be a transition matrix on state space Ω. If a probability distribution µ on Ω is


reversible for P , then it is also stationary for P .

P ROOF. Note that


£ ¤ X reversibility X X
µP (x) = µ(y)P (y, x) = µ(x)P (x, y) = µ(x) P (x, y) = µ(x). (233)
y∈Ω y∈Ω y∈Ω


Example 3.2.8 (Mass transport). In order to give some concrete intuition, consider the following thought
experiment. Imagine Ω is the note set of some graph, and µ(x) represents the mass of sand at node x for
each x ∈ Ω. Overnight, a truck runs around the graph and transport P (x, y) proportion of sand µ(x) at
node x to node y. Hence

(Amount of mass transported from x to y) = µ(x)P (x, y). (234)

Let µ0 denote the new distribution of sands on Ω. Observe that


X
(Total amount of mass transported from x) = µ(x)P (x, y) (235)
y∈Ω
X
= µ(x) P (x, y) = µ(x). (236)
y∈Ω

Hence the new sand distribution µ0 is given by


X
µ0 (x) = (Total amount of mass transported into x) = µ(y)P (y, x). (237)
y∈Ω

In other words, µ0 = µP . From this, we see that

µ = µ0 ⇐⇒ µ = µP ⇐⇒ µ is stationary for P. (238)

On the other hand, the detailed balance equation (232) can be interpreted as

(mass sent from x to y) = (mass sent from y to x). (239)

From this, it is obvious why a reversible distribution is also stationary. ▲


3.2. STATIONARY DISTRIBUTION AND EXAMPLES 33

A reversible Markov chain is a Markov chain whose initial distribution is reversible. Namely, let
(X n )n≥0 denote a Markov chain on state space Ω with transition matrix P . Let the initial state X 0 be
distributed as a reversible distribution µ. By Proposition 3.2.7, X n is a stationary process, meaning that
the distribution of X n does not change in n. However, we can say more than just stationarity. Namely,
one cannot distinguish if X n is run backward in time. More precisely,
Pµ (X 0 = x 0 , X 1 = x 1 , . . . , X n = x n ) = Pµ (X n = x 0 , X n−1 = x 1 , . . . , X 0 = x n ) (240)
for all n ≥ 1 and states x 0 , . . . , x n .
In practice, reversibility is easier to check than stationarity. So showing reveribility could be handy
to stationarity.

Example 3.2.9 (SSRW on graphs is reversible). Let (X n )n≥0 be a simple symmetric random walk G =
deg(x)
(V, E ). Let the initial state X 0 be distribued as the distribution µ(x) = 2|E | . From Example 3.2.6, we
already know that µ is a stationary distribution. However, the detailed balance equation is also easily
verified:
deg(x) AG (x, y) deg(y) AG (y, x)
µ(x) P (x, y) = = = µ(y) P (y, x). (241)
2|E | deg(x) 2|E | deg(y)
Hence X n is not just stationary, but also reversible. Imagine X n follows its trajectory backwards in time.
The above detailed balance equation makes it so that the backward random walk still follows the same
rule: Choose a neighbor uniformly at random and move. ▲
3.2.3. Doubly stochastic chains. Let P be the transition matrix of a Markov chain (X n )n≥0 on a state
space Ω. Recall that the row sums of P are always 1, since each row gives the transition probabilities
from a fixed state to all other possible states. We say the Markov chain (X n )n≥0 (as well as P ) is doubly
stochastic if the column sums of P are also always 1.

Definition 3.2.10. A matrix P : Ω × Ω → [0, ∞) is doubly stochastic if its row and column sums are all one,
that is, for all x, y ∈ Ω,
X X
P (x, z) = P (z, y) = 1. (242)
z∈Ω z∈Ω

For doubly stochastic Markov chains, the uniform distribution on Ω (which makes sense if Ω is finite)
is always a stationary distribution.

Proposition 3.2.11. Let (X n )n≥0 be a doubly stochastic Markov chain on a finite state space Ω. Then
π = Uniform(Ω) is a stationary distribution for the chain.

P ROOF. Let P denote the transition matrix of the chain (X n )n≥0 . Note that π(x) = 1/|Ω| for all x ∈ Ω.
Then, viewing π as a row vector, for each x ∈ Ω,
X 1 X 1
(πP )(x) = π(z)P (z, x) = P (z, x) = = π(x). (243)
z∈Ω |Ω| z∈Ω |Ω|
Hence π is stationary for P . □
Example 3.2.12 (Random Walk on a ring). Let (X n )n≥0 denote a random walk on the integer ring Z/m Z,
which consists of integers 0, . . . , m −1 and we view m ≡ 0 modulo m. The chain moves by 1 clockwise with
probability p and 1 counterclockwise with probability 1 − p. Then it is easy to verify the corresponding
transition matrix is doubly stochastic. Hence by Proposition 3.2.11, the uniform distribution is a station-
ary distribution. This is quite resonable since the process is invariant under ‘rotation’. ▲
Exercise 3.2.13. Suppose a Markov chain has the uniform distribution as a stationary distribution. Does
it have to be a doubly stochastic chain?
3.3. MARKOV PROPERTY TO THE INFINITE FUTURE 34

3.3. Markov property to the infinite future


The Markov property (182) states that, knowing the state X n at time n, the distribution of the imme-
diate future X n+1 does not depend on the past X 0 , . . . , X n−1 . How about one step ahead, X n+2 ? As one can
guess, the distribution of X n+2 also does not depend on the past X 0 , . . . , X n−1 given the present. As usual,
the argument is by conditioning on the ‘missing step’ X n+1 and applying the markov property (182):
P(X n+2 = j n+2 | X n = j n , . . . , X 0 = j 0 ) (244)
X
= P(X n+2 = j n+2 | X n+1 = j n+1 , X n = j n , . . . , X 0 = j 0 )P(X n+1 = j n+1 | X n = j n , . . . , X 0 = j 0 ) (245)
j n+1
X
= P(X n+2 = j n+2 | X n+1 = j n+1 )P(X n+1 = j n+1 | X n = j n ) (246)
j n+1
X
= P(X n+2 = j n+2 , X n+1 = j n+1 | X n = j n ) (247)
j n+1
= P(X n+2 = j n+2 | X n = j n ). (248)
Note that the second equality applies the Markov property to each conditional probabilites to forget the
past information. Using a similar argument, one can show that any distant future, X n+k for k ≥ 1, does
not depend on the past X 0 , . . . , X n−1 given X n . In fact, this is ture for ‘infinite future’ as well, meaning the
entire process (X n+k )k≥1 . Here we give a formal statement.

Proposition 3.3.1. Let (X n )n≥0 be a Markov chain on a finite state space Ω with transition matrix P . Then
P ((X n+1 , X n+2 , . . . ) ∈ U | X n = x, (X 0 , . . . , X n−1 ) ∈ B ) = P ((X n+1 , X n+2 , . . . ) ∈ U | X n = x) . (249)

P ROOF. Omitted. □
Example 3.3.2 (Absorption in Gambler’s ruin). Recall the Gambler’s ruin Markov chain X n , which is on
the state space Ω = {0, 1, . . . , N } and trasition matrix P with P (i , i + 1) = p for i ∈ {0, . . . , N − 1}, P (i − 1, i ) =
1−p for i ∈ {1, . . . , N }, P (0, 0) = 1P (N , N ) = 1. We will assume the initial fortune is X 0 = x. We will compute
the probability of winning, that is, the probability that the Gambler’s fortune X n absorbs into N rather
than to 0. As a function of x, denote
h(x) = Px (state N is reached before 0) (250)
:= P(state N is reached before 0 | X 0 = x). (251)
The goal is to solve for h(x). Note that it has the following bounary values
h(0) = 0, h(N ) = 1. (252)
If we start at x ∈ {1, . . . , N − 1}, then the Markov chain goes to x + 1 with probability p and renews as a
Gambler’s ruin chain with initial state x + 1; with probability q := 1 − p, it goes to x − 1 and renews there.
Hence it should be evident that
h(x) = ph(x + 1) + qh(x) for 0 < x < N . (253)
Noting that p + q = 1, (254) can be re-written as
ph(x) + qh(x) = ph(x + 1) + qh(x) for 0 < x < N . (254)
This gives
q
h(x + 1) − h(x) = (h(x) − h(x − 1)) for 0 < x < N , (255)
p
which is a multiplicative recursion for the differences h(x + 1) − h(x). It follows that
h(x + 1) − h(x) = (q/p)x h(1) for 0 < x < N . (256)
3.4. STRONG MARKOV PROPERTY 35

Summing over x and using one of the boundary condition h(0) = 0, we get
X
x
h(x + 1) = h(1) (q/p)x for 0 < x < N (257)
k=0
( 1−(q/p)x+1
h(1) 1−(q/p) if p 6= q
= (258)
h(1)(x + 1) if p = q = 1/2.

So it only remains to figure out h(1). For this, we use the remaining boundary condition h(N ) = 1 by
setting x = N − 1:
( 1−(q/p)N
h(1) 1−(q/p) if p 6= q
1 = h(N ) = (259)
h(1)N if p = q = 1/2.
1−(q/p)
It follows that h(1) = 1−(q/p)N
for p 6= q and h(1) = 1/N for p = q = 1/2. Thus, we conclude that
( 1−(q/p)x
1−(q/p)N
if p 6= q
h(x) = (260)
x/N if p = q = 1/2.

For instance, say we start the gamble with x = 50 dollor and win the whole game once we accumulate
N = 100 dollors before hitting 0 dollor. If each game is fair (p = q = 1/2), the probabilty of winning the
game is h(50) = 50/100 = 0.5. If it is a slightly unfavorable game with p = 49/100 and q = 51/100, then
1−(51/49)50
the probability of winning is h(1) = 1−(51/49) 100 ≈ 0.11917. So, just make each game slightly unfavorable

and your casino is on business! ▲

3.4. Strong Markov Property


In this section, we will discuss an extension of the Markov property stopped at random times. Namely,
we would like to restart a Markov chain (X n )n≥1 at a (possibly random) time T and would like to say the
future process (X n )n>T is independent of the past (X n )0≤n<T given the present state X T . Can we do this
for any random time T ? The short answer is no. For an illustration, let (X n )n≥0 be the gambler’s ruin
chain in Example 3.1.2 with initial fortune X 0 = 1 and cap N = 100. Consider three (possibly) random
times:
1. T = T1 = 50;
2. T = T2 = first time to reach $50;
3. T = T3 = first time to reach $50 that i immediately followed by $51.
We will consider the probability of jumping from 50 to 51 from time T to T + 1. If we can really
restart the Markov chain at time T , this transition probability should all be p, the probability of winning
one round of gamble. This is indeed the case for T = T1 = 50 by using time-homogenity. On the other
hand, it fails to be true for T = T3 , since X T3 +1 = 51 with probability 1 by the definition of T3 . In other
words, the Markov chain is ‘forced’ to transition to 51 at time T3 , not following its transition probabilities.
Roughly speaking, this change of behavior happends becuase knowning the value of T3 contains some
information about the chain at future times. Finally, how about T = T2 ? At first glance, knowing the value
of T2 should not exert any information on how it would behave in the following steps, since the value
of T2 is completely determined by the history of the chain at prior times n ≤ T2 . This leads us to the
definition of ‘stopping times’, which we define for a general stochastic process and not necessarily only
for Markov chains.

Definition 3.4.1 (Stopping time). Let (X n )n≥0 be a stochastic process defined on a probability space
(Ω, F , P). Let T be a random variable taking valus from Z≥0 ∪ {∞}. Then T is a stopping time (for the
3.4. STRONG MARKOV PROPERTY 36

process (X n )n≥0 ) if for each nonnegative integer n, there is a subset C n ⊆ Ωn+1 such that the following
equality of events holds:
{T = n} = {(X 0 , X 1 , . . . , X n ) ∈ C n }. (261)
In words, T is a stopping time for (X n )n≥0 if one can determine if T = n by checking a certain condi-
tion for the history X 0 , X 1 , . . . , X n , without using any information about the future states X n+1 , X n+2 , . . . .
In this sense, the random time T3 defined above is not a stopping time, but T2 is, since we can write
{T2 = n} = {X 0 6= 50, X 1 6= 50, . . . , X n−1 6= 50, X n = 50}. (262)
Example 3.4.2 (First hitting time is a stopping time). Let (X n )n≥0 be a stochastic process defined on a
probability space (Ω, F , P). Let A ⊆ Ω be a subset of the sample space. We will define τ A to be the ‘first
hitting time’ of A, which is the first time that the process X n enters into states in A:
τ A := inf{n ≥ 0 | X n ∈ A}. (263)
Here ‘inf’ stands for infimum, which agrees with the minimum if the set is nonempty, and conventionally
taken to be ∞ if the set is empty. For example, τ A = ∞ means the process X n never enters the states in A.
In order to see that τ A is indeed a stopping time, we simply mimic what we did in (262): For each n ≥ 0,
{τ A = n} = {X 0 6= A, X 1 6= A, . . . , X n−1 6= A, X n ∈ A}. (264)
Similarly, define T A to be the first time after time zero to hit some state in A:
T A := inf{n ≥ 1 | X n ∈ A}. (265)
As before, T A is also a stopping time since, for each n ≥ 0,
{T A = n} = {X 1 6= A, X 1 6= A, . . . , X n−1 6= A, X n ∈ A}. (266)
Hence, in general, first hitting times are stopping times. ▲.
Exercise 3.4.3. Let (X n )n≥0 be a stochastic process defined on a probability space (Ω, F , P). Let A ⊆ Ω be
a subset of the sample space. For each k ≥ 1, let T A(k) denote the kth time that the process X n visits some
state in A. That is,
(
(m) inf {n > T A(m−1) : X n ∈ A} if T A(m−1) < ∞
TA = (267)
∞ otherwise.

Show that T A(k) is a stopping time for all k ≥ 1.


Now that we have introduced stopping times, we can state the strong Markov property of Markov
chains, which states that one can restart Markov chains at stopping times.
Proposition 3.4.4 (Strong Markov property). Let (X n )n≥0 be a Markov chain on a finite state space Ω
and let T denote a stopping time. Let A be any event determined by (X 0 , X 1 , . . . , X T ). Then for any states
x, u 1 , . . . , u n ∈ Ω,
P(X T +1 = u 1 , . . . , X T +m = u m | T < ∞, A, X T = x) = P(X 1 = u 1 , . . . , X m = u m | X 0 = x). (268)
P ROOF. We will condition on the value of T and apply the usual Markov property. For instance,
suppose T = n. Then
P(X T +1 = u 1 , . . . , X T +m = u m | T = n, A, X T = x) = P(X n+1 = u 1 , . . . , X n+m = u m | A, X n = x) (269)
= P(X n+1 = u 1 , . . . , X n+m = u m | X n = x) (270)
= P(X 1 = u 1 , . . . , X m = u m | X 0 = x), (271)
where the last equality uses time-homogeneity of the Markov chain and the second equality uses the
Marko property. (Note that, A is determined by the history X 0 , . . . , X n . Hence we can decompose the
event A ∩ {X n = x} by A 0 ∩ {X n = x}, where A 0 is an event determined by X 0 , . . . , X n−1 . Then we can forget
A 0 in the conditioning by the Markov property.)
3.5. RECURRENCE AND TRANSIENCE 37

Now, in generall, we do not know the value of T . Hence we will use iterated expectation for probabil-
ity (Exercise 1.4.3) to condition on the value of T and then average over all values of T :
P(X T +1 = u 1 , . . . , X T +m = u m | T < ∞, A, X T = x) (272)
= ET [P(X T +1 = u 1 , . . . , X T +m = u m | T, A, X T = x)] (273)
= ET [P(X 1 = u 1 , . . . , X m = u m | X 0 = x)] (274)
= P(X 1 = u 1 , . . . , X m = u m | X 0 = x). (275)
Here, the third equailty follows from the observation we made in the first paragraph: If we know the value
of T < ∞, the assertion holds. □
Example 3.4.5. Let (X n )n≥0 be the gambler’s ruin chain in Example 3.1.2 with initial fortune X 0 = 1 and
cap N = 100. Let T = T2 denote the first hitting time of $50. We have seen that it is a stopping time. So by
the strong Markov property,
P(X T +1 = 51 | X T = 50) = P(X 1 = 51 | X 0 = 50) = p. (276)
▲.

3.5. Recurrence and Transience


We now have the sufficient tools to study the behavior of Markov chains. Throughout this section,
we fix a Markov chain (X n )n≥0 on finite state space Ω with transition matrix P . Here are some quick
highlights.
Elementary behavior of Markov chains:
(i) For each state x ∈ Ω, the Markov chain X n either visits x infinitely often with probability one (we say
x is ‘recurrent’) or only finitely many times with probability one (we say x is ‘transient’).
(ii) Starting from any state, the Markov chain X n must enter the set of recurrent states eventually. Once
entering a recurrent state, it will never go back to any transient states.
(iii) The set of recurrent states is further decomposed into ‘communicating classes’. The Markov chain
eventually enters exactly one communicating class and stays there forever, visiting every states
in that class infinitley often.

3.5.1. Definition of recurrence/transitions and equivalent conditions. There are multiple ways to
define recurrence/transience of states under a Markov chain: using 1) return probabilities; 2) infinitely
many returns; 3) expected number of returns. Our first approach is via return probabilities. Namely, for
each state x ∈ Ω, define
T x = T x(1) := inf {n ≥ 1 : X n = x}, (277)
which is the first time n ≥ 1 at which the chain X n hits the state x. In Example 3.4.2, we have seen that T x
is a stopping time for each x ∈ Ω. For states x, y ∈ Ω, let
ρ x y = P(T y < ∞ | X 0 = x), (278)
which is the probability that the chain X n eventually visits y starting from x.
Definition 3.5.1. Let (X n )n≥0 be a Markov chain on finite state space Ω. Each state x ∈ Ω is said to be
recurrent if ρ xx = 1 and transient if ρ xx < 1.
Exercise 3.5.2. Show the following implications:
ρ xx > 0 ⇐⇒ P m (x, x) > 0 for some m ≥ 1. (279)
Hint: Use
³ ¯ ´ µ∞ ¯ ¶ ∞ ³ ¯ ´
¯ [ ¯ X ¯
sup P X n = x ¯ X 0 = x ≤ P {X n = x} ¯ X 0 = x ≤ P Xn = x ¯ X0 = x . (280)
n≥1 n=1 n=1
3.5. RECURRENCE AND TRANSIENCE 38

Proposition 3.5.3. Fix states x, y ∈ Ω and let T y(m) denote the time of mth visit to y, as defined in Exercise
3.4.3. Then for each m ≥ 1,

P(T y(m) < ∞ | X 0 = x) = ρ x y ρ m−1


yy . (281)

P ROOF. We first note that the assertion for m = 1 follows from the definition of ρ x y :

P(T y(1) < ∞ | X 0 = x) = ρ x y . (282)

Next, suppose the assertion holds for m = k. We deduce the assertino for m = k + 1 by using the strong
Markov property. Namely, note that since T y(k+1) > T y(k) , T y(k+1) < ∞ implies T y(k) < ∞. Also, note that
X T (k) = y by definition. Hence,
y

P(T y(k+1) < ∞ | X 0 = x) = P(T y(m+1) < ∞, T y(k) < ∞ | X 0 = x) (283)


= P(T y(m+1) < ∞ | T y(k) < ∞, X 0 = x) P(T y(k) < ∞ | X 0 = x) (284)
= P(T y(m+1) < ∞ | X T (k) = y, T y(k) < ∞, X 0 = x) P(T y(k) < ∞ | X 0 = x) (285)
y

= P(T y(1) < ∞ | X 0 = y) P(T y(k) < ∞ | X 0 = x) (286)


= ρ y y (ρ x y ρ k−1 k
y y ) = ρx y ρ y y , (287)

where the last two equalities use the strong Markov property and the induction hypothesis, respectively.

Theorem 3.5.4. The following hold:


(i) x ∈ Ω is recurrent if and only if P(T x(m) < ∞ for all m ≥ 1 | X 0 = x) = 1.
(ii) x ∈ Ω is transient if and only if P(T x(m) < ∞ for all m ≥ 1 | X 0 = x) = 0.

P ROOF. Denote the probability in the assertion as p ∗ . Note that for each m ≥ 1, by monotonicity of
events,

p ∗ = P(T x(m) < ∞ for all m ≥ 1 | X 0 = x) ≤ P(T x(m) < ∞ | X 0 = x) = ρ m


xx . (288)
∗ ∗
Hence if ρ xx < 1, ρ m
xx → 0 as m → ∞ so the p = 0. Hence p = 0 if x is transient.
On the other hand, suppose x is recurrent. Then by subadditivity (3),
µ ∞ ¯ ¶
[ (m) ¯ X
∞ ¡ ¢
P {T x = ∞} ¯ X 0 = x ≤ P T x(m) = ∞ | X 0 = x = 0. (289)
m=1 m=1

Hence the probability on the left hand side equals zero. By taking the complement,
µ ∞ ¯ ¶ µ ∞ ¯ ¶
[ (m) ¯ \ (m) ¯
1 = 1−P {T x = ∞} ¯ X 0 = x = P {T x < ∞} ¯ X 0 = x = p ∗ . (290)
m=1 m=1

This shows p ∗ = 1 if x is recurrent.


Finally, the converse implications are automatic, since each state x ∈ Ω is either recurrent or tran-
sient, mutually exclusive, by definition. (Note that this also implies p ∗ is either 0 or 1. In other words,
with probability one, either x is visited infinitely often or only finitely many times.) □

Next, consider the following counting variable


X
∞ X

N y := 1(X n = y) = 1(T y(m) < ∞), (291)
n=1 m=1
3.5. RECURRENCE AND TRANSIENCE 39

which equals the total number of visits to state y ∈ Ω by the Markov chain. By using Fubini’s theorem,
one can write
· ∞ ¯ ¸ h ¯ i
X ¯ X

¯
E[N y | X 0 = x] = E 1(T y(m) < ∞) ¯ X 0 = x = E 1(T y(m) < ∞) ¯ X 0 = x (292)
m=1 m=1
X∞ ³ ´
= P T y(m) < ∞ | X 0 = x (293)
m=1
X∞
= ρ x y ρ m−1
yy . (294)
m=1

Note that the last expression equals zero if ρ x y = 0, infinity if ρ x y > 0 and ρ y y = 1, and a converging
geometric series if ρ x y > 0 and ρ y y < 1. Hence we obtain the following result:

Proposition 3.5.5. The following hold:




 if ρ x y = 0
0
E[N y | X 0 = x] = ∞ if ρ x y > 0 and y is recurrent (295)


 ρx y if ρ x y > 0 and y is transient.
1−ρ y y

Now we can give an alternative characterization of the recurrence of a state in terms of the sum of
multi-step transition probabilities.

Theorem 3.5.6. For each x, y ∈ Ω, it holds that


X

E[N y | X 0 = x] = P n (x, y). (296)
n=1
P
Furthermore, a state x ∈ Ω is recurrent if and only if E[N x | X 0 = x] = ∞ m
m=1 P (x, x) = ∞.
P
P ROOF. Recall that N y = ∞ n=1 1(X n = y). Hence
·∞ ¯ ¸ ∞ h ¯ i
X ¯ X ¯
E[N y | X 0 = x] = E 1(X n = y) ¯ X 0 = x = E 1(X n = y) ¯ X 0 = x (297)
n=1 n=1
X∞ ³ ¯ ´ X∞
¯
= P Xn = y ¯ X0 = x = P n (x, y), (298)
n=1 n=1

verifying the first assertion. But Proposition 3.5.5 implies that x is recurrent if and only if E[N x | X 0 = x] =
∞. Hence the assertion follows. □

3.5.2. Recurrence and transience of random walks on Zd . In this subsection, we study recurrence
and transience of simple random walks on the integer lattice Zd in d -dimensions. As a set, Zd consists
of the d tuple of integer coordinates: Zd = {(a 1 , . . . , a d ) | a 1 , . . . , a d ∈ Z}. We will view it as a graph by
giving ‘nearest neighbor edges’. Namely, if we have two points x, y ∈ Zd , then we connect x and y by an
edge if and only if they differ by 1 at exactly a single coordiante. For example, in Z2 , (1, 2) is adjacent to
(2, 1) but not adjacent to (2, 3). As a graph, Z2 looks like a square grid in R2 . Similarly, Z3 would be the
3-dimensional square lattice consisting of the grid points (a, b, c) for a, b, c ∈ Z.
A random ralk on any graph is simple if it only moves along edges and do not hop multiple edges,
and symmetric if at any node, there is an equal chance to pick any neighbor to move to1. Hence a simple
symmetric random walk (SSRW) on Z moves either up or down with equal probability; SSRW on Z2
chooses one of the 4 directions with equal probability and moves along one edge; SRW on Zd has 2d
directions to choose from at each iteration with equal probability.

1In this lecture note, we assume all random walks on graphs are simple.
3.5. RECURRENCE AND TRANSIENCE 40

F IGURE 6. A sample path of RW on Z3 . Figure excerpted from http://www.cmat.edu.uy/


~lessa/resource/randomwalknotes.pdf

The goal of the following serious of examples is to show the following Polya’s random walk theorem2:
SSRW on Zd is recurrent if d ≤ 2 and transient for d ≥ 3. (299)
In Example 3.5.7, we have seen that SRW on Z is recurrent. For higher dimensions, we will compute the
sum of return probabilities as we did before.

Example 3.5.7 (Recurrence and transience of simple random walk on Z). Let (X n )n≥0 denote a simple
random walk on the 1-dimensional integer lattice Z. Namely, it is a Markov chain with state space Z
and transition matrix P such that P (x, x + 1) = p and P (x, x − 1) = 1 − p for all x ∈ Z. Here p ∈ [0, 1] is a
parameter that governs the bias in one-step transition of the walker. Simply put, the walker moves up by
1 with probability p and down by 1 with probability 1 − p. Suppose X 0 = 0, meaning that the walkter is
originally at the origin.
The question we ask is as follows: Is the origin transient or recurrent? The answer depends on p.
If p > 1/2, then the walker has a positive bias, so it would be natural to guess the origin is transient –
it will drift away toward the infinity and will never get back to the origin after some time. Likewise, if
p < 1/2, it will drift away toward negative infinity and the origin should be transient. Lastly, if the walker
is symmetric, p = 1/2, maybe the origin is recurrent. We will justify these guesses using Theorem 3.5.6.
The key observation here is the following combinatorial fact: In order to return to the origin, the
walker must take an even number of steps with equal number of up-steps and down-steps. It follows
that
à !
2n n (2n)!
P (X 2n = 0 | X 0 = 0) = p (1 − p)n = (p(1 − p))n . (300)
n n!n!
We will need to understand how n! grows as n grows. We will apply Stirling’s approximation, which states
that
p
n! ∼ (n/e)n 2πn, (301)
where a n ∼ b n means a n /b n → 1 as n → ∞. Using this, one can show that
p
(2n/e)2n 4πn 1
P (X 2n = 0 | X 0 = 0) ∼ (p(1 − p))n = p (4p(1 − p))n . (302)
2n
(n/e) (2πn) πn
p
Now consider the symmetric case p = 1/2. Then P(X 2n = 0) ∼ 1/ πn. This implies that for some
p
constant c > 0, P(X 2n = 0) ≥ c/ πn for all n ≥ 1. Now in order to see the recurrence of the origin, we

2
Kakutani at UCLA colloquium said: “A drunk man will eventually find his way home, but a drunk bird may get lost forever”
3.5. RECURRENCE AND TRANSIENCE 41

compute the sum of all return probabilities:


X∞ X∞ X∞ c
P (X n = 0 | X 0 = 0) = P (X 2n = 0 | X 0 = 0) ≥ p = ∞, (303)
n=0 n=0 n=0 πn
where the first equality uses the fact that the walker cannot return to the origin in odd number of steps,
P p
and the last equality uses the fact that ∞ n=0 1/ n = ∞. Hence by Theorem 3.5.6, we deduce that the
origin is recurrent if p = 1/2. By Theorem 3.5.4 this means that the walker returns to the origin infinitely
many times with probability one. By Proposition 3.5.5, this also means that the expected number of visits
to the origin is infinite.
On the other hand, suppose p 6= 1/2. Recall that the function p 7→ p(1−p) attains its global maximum
of 1/4 at p = 1/2. Hence 0 ≤ 4p(1 − p) < 1 if p ∈ [0, 1] and p 6= 1/2. In this case, for some constant c > 0,
P (X 2n = 0 | X 0 = 0) ≤ cρ n for all n ≥ 1, where ρ = 4p(1 − p) ∈ (0, 1). Hence
X
∞ X
∞ X

P (X n = 0 | X 0 = 0) = P (X 2n = 0 | X 0 = 0) ≤ cρ n < ∞. (304)
n=0 n=0 n=0
So by Theorem 3.5.6, we deduce that the origin is transient if p 6= 1/2. By Theorem 3.5.4 this means that
the walker returns to the origin only finitely many times with probability one. By Proposition 3.5.5, this
also means that the expected number of visits to the origin is finite. ▲
Example 3.5.8 (SSRW on Z2 is recurrent (“A drunk man eventually finds way home”)). Let Xn denote
a SSRW on Z2 , which chooses one of the 4 steps [1, 0], [−1, 0], [0, 1], [0, −1] independently with equal
probability at each move. It would be easy to analyze the return probability of Xn if we can treat its two
coordinates as independent random walks on Z, since we have already analyze one-dimensional random
walks. However, the two coordinates of Xn are not independent. For example, if its first coordinate
increased or decreased, then we know that its second coordinate did not change.

F IGURE 7. A sample path of RW on Z2 . Projecting it on y = ±x shows that it has the same distri-
bution as a pair of independent random walks on Z. Figure excerpted from http://www.cmat.
edu.uy/~lessa/resource/randomwalknotes.pdf

In order to handle this issue, we are going to compare Xn with the following auxiliary process Yn =
[Yn(1) , Yn(2) ], where Yn(1) and Yn(2) are independent SSRW on Z. So each Yn(i ) changes by ±1 with equal
probability, independently for each i = 1, 2:
(i ) (i )
P(Yn+1 − Yn(i ) = 1) = P(Yn+1 − Yn(i ) = −1) = 1/2 for i = 1, 2. (305)

Now, since Yn(1) and Yn(2) change by ±1 at the same time, the vector Yn = [Yn(1) , Yn(2) ] on Z2 changes by [1, 1],
[−1, 1], [1, −1], [−1, −1] with equal probability. This is exactly the original SSRW Xn rotated by 45 degrees
3.5. RECURRENCE AND TRANSIENCE 42

p
and scaled by 2, so that it lives on the ‘checkerboard lattice’ {(a, b) | a, b ∈ Z, a + b is even} ⊆ Z2 . From
this observation, it follows that the probability of Yn returning to the origin is the same as the probability
that Xn returning to the origin. This gives

P (Xn = [0, 0] | X0 = [0, 0]) = P (Yn = [0, 0] | Y0 = [0, 0]) (306)


³ ´
= P Yn(1) = 0, Yn(2) = 0 | Y0(1) = 0, Y0(2) = 0 (307)
³ ´ ³ ´
= P Yn(1) = 0 | Y0(1) = 0 P Yn(2) = 0 | Y0(2) = 0 (308)
³ ´2
= P Yn(1) = 0 | Y0(1) = 0 . (309)

where the last equality is due to the independence between Yn(1) and Yn(2) . By using the analysis in Exam-
ple 3.5.7, we get
1
P (X2n = [0, 0] | X0 = [0, 0]) ∼ . (310)
πn
To conclude, note that Xn can only return to the origin at even number of steps, and the above asymptotic
yields that for some constant c > 0, P (X2n = [0, 0] | X0 = [0, 0]) ≥ c/(πn) for all n ≥ 1. Hence
X
∞ X
∞ X∞ c
P (Xn = 0 | X0 = 0) = P (X2n = 0 | X0 = 0) ≥ = ∞. (311)
n=0 n=0 n=0 πn

Therefore Xn is recurrent by Theorem 3.5.6. ▲

Example 3.5.9 (SSRW on Zd for d ≥ 3 is transient (“A drunk bird never finds way home”)). Let Xn denote
a SSRW on Zd with d ≥ 3, which chooses one of the 2d steps [±1, 0, . . . , 0], [0, ±1, 0, . . . , 0], . . . [0, . . . , 0, ±1]
independently with equal probability at each move. We will first apply a similar argument as in Example
3.5.8 and try to show that Xn is transient. This simple argument almost works, but we will see why it fails.
As we did in the two dimensional case, consider a process Yn = [Yn(1) , . . . , Yn(d ) ] on Zd where Yn(i ) ’s are
independent SSRWs on Z. So each Yn(i ) changes by ±1 with equal probability, independently for each
i = 1, . . . , d . As before, note that

P (Xn = [0, . . . , 0] | X0 = [0, . . . , 0])“ = ”P (Yn = [0, . . . , 0] | Y0 = [0, . . . , 0]) (312)


³ ´
= P Yn(1) = 0, . . . , Yn(d ) = 0 | Y0(1) = 0, . . . Yn(d ) = 0 (313)
³ ´d
= P Yn(1) = 0 | Y0(1) = 0 . (314)

Then by using the analysis in Example 3.5.7,


¶ µ
1 d
P (X2n = 0 | X0 = 0) ∼ p . (315)
πn
P
The above probabilities are not summable if d ≥ 3, as ∞ n=1 n
−d /2
= ∞ for d ≥ 3. Hence by Theorem 3.5.6,
conclude that Xn is transient.
The reason that the above argument fails is that the quoted equality in (313) is not true. For exam-
ple, set d = 3. Then there are 6 steps that Xn takes: [±1, 0, 0, ], [0, ±1, 0], and [0, 0, ±1]. However, Yn can
take 23 = 8 distinct steps in the integer lattice Z3 (see Figure 8). For a correct argument, we rely on a
combinatorial argument we employed in Example 3.5.7, which we will demonstrate in the case of d = 3.
Let d = 3. Note that the walk can only return to the origin in even number of steps. Let a, b, c denote
the number of [1, 0, 0], [0, 1, 0], [0, 0, 1]-steps that Xn takes in the first 2n steps, starting from the origin. In
order for X2n = 0, we need to have a +b +c = n and the number of [−1, 0, 0], [0, −1, 0], [0, 0, −1]-steps must
3.5. RECURRENCE AND TRANSIENCE 43

F IGURE 8. A triple of independent SSRWs on Z has 8 possible moves {−1, 1}3

be a, b, c, respectively. Hence
µ ¶2n
X (2n)! 1
P(Xn = 0 | X0 = 0) = (316)
0≤a,b,c (a!b!c!)2 6
a+b+c=n
à !µ ¶ µ ¶2 µ ¶2n
2n 1 2n X n! 1
= , (317)
n 2 0≤a,b,c (a!b!c!) 3
a+b+c=n

(2n)! ¡2n ¢ ³ n!
´2
where we have used the identity (a!b!c!)2
= n (a!b!c!) . Now note that
µ ¶n
X n! 1
1= , (318)
0≤a,b,c (a!b!c!) 3
a+b+c=n

since the right hand side equals the total probability of putting n balls into three bins uniformly at ran-
dom. Also, observe that
n! n!
≤ (319)
(a!b!c!) bn/3c!bn/3c!bn/3c!
for all a, b, c ≥ 0 with a + b + c = n and bn/3c is the largest smaller integer than n/3. In words, one has the
most amount of arrangements of n balls into three bins if the number of balls in each bins are as equal
as possible. Using this,
à !µ ¶ µ ¶2 µ ¶2n
2n 1 2n X n! 1
P (X2n = 0) = (320)
n 2 0≤a,b,c (a!b!c!) 3
a+b+c=n
à !µ ¶ µ ¶n
2n 1 2n n! 1
≤ (321)
n 2 bn/3c!bn/3c!bn/3c! 3
p µ ¶n
1 (n/e)n 2πn 1
∼p p = O(n −3/2 ). (322)
πn (n/3e)n 2πn/3 3 3

P P∞
From this, one can deduce that ∞ n=1 P(Xn = 0 | X0 = 0) ≤ n=1 cn
−3/2
< ∞. Thus, by Theorem 3.5.6,
conclude that Xn is transient. ▲
3.5.3. The set of recurrent states is closed. In this subsection, we will see that a Markov chain will
eventually enter the set of recurrent states and will never leave it. This is proved in the following state-
ment.
3.5. RECURRENCE AND TRANSIENCE 44

Definition 3.5.10 (Closed sets). Fix a subset of states A ⊆ Ω. We say A is closed if ρ x y = 0 for all x ∈ A and
y ∈ Ω \ A.
Proposition 3.5.11. For distinct x, y ∈ Ω, the following hold:
(i) Suppose ρ x y > 0 and ρ y x < 1. Then x is transient.
(ii) Suppose ρ x y > 0 and ρ xx = 1. Then ρ y x = 1 and y is recurrent.
(iii) Suppose x is recurrent and y is transient. Then ρ x y = 0. In particular, the set of recurrent states is
closed.
P ROOF. To show (i), suppose ρ x y > 0 and ρ y x < 1. Heuristically, this means that the chain can go
from x to y with some positive probability, but there is a positive chance 1 − ρ y x > 0 that it will never
return to x. Then it does not visit x any more, so x should be transient. To make this argument rigorous,
we note that
1 − ρ xx = P(X n never visits x | X 0 = x) (323)
≥ P(X n visits y and never returns to x | X 0 = x) (324)
= P(X n never returns to x after time T y | X 0 = x) P(T y < ∞ | X 0 = x) (325)
= P(X n never returns to x | X 0 = y) ρ x y . (326)
= (1 − ρ y x )ρ x y > 0, (327)
where we have used the strong Markov property to get the third equality. This shows ρ xx < 1, so x is
transient.
Next, in order to show (ii), suppose ρ x y > 0 and ρ xx = 1. In this case, we first observe that ρ y x = 1,
since otherwise by part (i) x must be transient, which contradicts the assumption that ρ xx = 1. Hence,
heuristically, assuming X 0 = y, the chain almost surely visits x and there is a positive chance ρ y x > 0 to
return to y; if it fails, then it will eventually return to y and by the strong Markov property we can start
the game anew. Among this infinitely many independent trials, it should succeed at least once since the
success probability is ρ x y ρ x y = ρ x y > 0.
More rigorously, since ρ x y , ρ y x > 0, there are integers k, ℓ ≥ 1 such that P (k) (x, y) > 0 and P ℓ (y, x) > 0
by Exercise 3.5.2. Hence
X∞ X
∞ X

P n (y, y) ≥ P k+n+ℓ (y, y) ≥ P k (y, x)P n (x, x)P ℓ (y, x) (328)
n=1 n=1 n=1
µ ¶
X

= P k (y, x) P n (x, x) P ℓ (y, x) = ∞. (329)
n=1
P
The last equality follows since P (k) (x, y), P ℓ (y, x) > 0 and ∞ n
n=1 P (x, x) = ∞ as x is recurrent (see Theorem
3.5.4). Then by Theorem 3.5.4, we deduce that y is also recurrent.
Lastly, (iii) is a direct consequence of (ii) (proof by contradiction). Heuristically, since x is visited
infinitely often, any positive chance to access y every time x will make y also recurrent. This implies that
the set of all recurrent states is closed. □
3.5.4. Cannonical decomposition of the state space. In this subsection, we discuss a generic struc-
ture about the state space Ω of a Markov chain.
Definition 3.5.12 (Accessibility). For two states x, y ∈ Ω, we say y is accessible from x, and denote as
x −→ y, if P n (x, x) > 0 for some n ≥ 0. If x and y are mutually accessible from each other, we say x and y
communicate and dnoete as x ←→ y.
From Exercise 3.5.2, It is easy to observe that x −→ y if and only if ρ x y > 0 or x = y. The following
exercise shows that the communication relation ←→ is an equivalence relation.
Exercise 3.5.13. Show that the binary relation ‘←→’ between the states defined in Definition 3.5.14 is an
equivalence relation. Namely, it satisfies the following properties:
3.5. RECURRENCE AND TRANSIENCE 45

(1) (reflexivity) For each x ∈ Ω, x ←→ x.


(2) (symmetry) For each x, y ∈ Ω, x ←→ y if and only if y ←→ x.
(3) (translativity) For each x, y, z ∈ Ω, x ←→ y and y ←→ z implies x ←→ z.
(Hint: Using Exercise 3.1.6, show that x −→ y and y −→ z implies x −→ z.)

Next, we define irreducible and closed subsets. In words, a set of states is called ‘irreducible’ if all
states there communicate, and ‘closed’ if it never leaves there once it gets there.

Definition 3.5.14 (Irreducible sets). Fix a subset of states A ⊆ Ω. We say A is irreducible if x ←→ y for all
x, y ∈ A and closed if ρ x y = 0 for all x ∈ A and y ∈ Ω \ A.

Now for each state x ∈ Ω, define its communicating class R x by

R x = {y ∈ Ω | x ←→ y}. (330)

A direct consequence of communication ←→ being an equivalence relation between states is that the
state space Ω is partitioned into pairwise disjoint communicating classes. While transient states may
form multiple communicating classes, it is customary to put them in one subset as stated in Theorem
3.5.15.

Theorem 3.5.15 (Cannonical decomposition of the state space). The state space Ω of a Markov chain can
be decomposed as
à !
[
Ω = T tR = T t Ri , (331)
i

where T is the set of all transient states, R is the set of all recurrent states, and {R i } is at most a countable
collection of pairwise disjoint, closed, and irreducible sets of recurrent states.

P ROOF. The first decomposition Ω = T t R is trivial by definition of transience and recurrence. We


S
will further decompose R into the distjoint union of communicating classes i R i . Indeed, note that
each recurrent state x must belong to some communicating class R i , as certainly x ∈ R x . Also, two
distinct recurrent states x, y ∈ Ω should either be in the same communicating class or belongs to disjoint
communicating classes by the transitivity of communication (see Exercise 3.5.13).
It remains to show that each communicating class R i is closed and irreducible. First, the irreducibil-
ity follows from the symmetry of communication (Exercise 3.5.13). Lastly, it remains to show that ρ x y = 0
if x ∈ R i and y ∈ Ω \ R i for each i . This is essentially a consequence of Proposition 3.5.11. Namely, if
y ∈ T , then ρ x y = 0 by Proposition 3.5.11 (iii). If y ∈ R j for some j 6= i , then ρ x y = 0 since otherwise
ρ y x = 1 by Proposition 3.5.11 (ii), which means x and y communicate so they must belong to the same
communicating class, a contradiction. □

Consequently, the transition matrix P can be written as the following block form (after permutting
the rows and columns):
R R2 ··· Rm T
 1 
R1 PR1 O O
R T R2  O O 
· ¸  PR2 O 
R PR O  .. ,
P= = ...  . ..
. O O  (332)
T S PT  
Rm  O O PR O 
T S1 S2 ··· Sm PT

where the last expression assumes that there are only finitely many communicating recurrent classes
R1, . . . , Rm .
3.5. RECURRENCE AND TRANSIENCE 46

Example 3.5.16 (Gambler’s ruin in cannonical form). Consider gambler’s ruin with state space Ω =
{0, 1, 2, 3, 4, 5} and transition matrix P given as (199). The set recurrent states is R = {0, 5} and T =
{1, 2, 3, 4} is the set of transient states (why?). Then a canonical form for the transition matrix is
0 5 1 2 3 4
 
0 0 1 0 0 0 0
 0 
5  1 0 0 0 0 
 
P= 1 1 − p 0 0 p 0 0 .
  (333)
2  0 0 1−p 0 p 0 
 
3  0 0 0 1−p 0 p 
4 0 p 0 0 1−p 0


Definition 3.5.17. Fix a Markov chain (X n )n≥1 on state space Ω with transition matrix P . We say the
chain (as well as P ) is said to be irreducible if Ω is irreducible, and recurrent if it is rreducible and all
states are recurrent.
Proposition 3.5.18. The following hold:
(i) A finite closed set A ⊆ Ω contains at least one recurrent state. In particular, if Ω is finite, then it contains
at least one recurrent state.
(ii) If A ⊆ Ω is finite, closed, and irreducible, then all states in A are recurrent.
(iii) A finite state Markov chain is recurrent if it is irreducible.
P
P ROOF. Let N y denote the total number of visits to y during [0, ∞). Note that y∈A N y counts the
total number of visits to any state in A, and it is infinite since A is closed. But then
" #
X X £ ¤
∞=E Ny = E Ny . (334)
y∈A y∈A

Since the last expression is a finite sum, it implies that E[N y ] = ∞ for some y ∈ A. Then such y ∈ A is
recurrent by Theorem 3.5.6. For the second part, note that Ω is a closed subset of Ω, so it must contain a
recurrent state by what we have just shown. This shows (i).
To show (ii), suppose A ⊆ Ω is finite, closed, and irreducible. According to (i), A contains at least one
recurrent state. But then by the irreducibility and Proposition 3.5.11, all states in A must be recurrent.
This shows (ii).
Lastly, (iii) follows immediately from (ii). □
3.5.5. Genearlized Gambler’s ruin and expected time until absorption. In this subsection, we gen-
eralize the Gambler’s ruin example (Example 3.1.2) and see how we can analyze various statistics related
to the first recurrent state ever reached.
Namely, let (X n )n≥0 be a Markov chain on a finite state space Ω with transition matrix P . We may
decompose Ω = T t R, where T is the set of all transient states, R is the set of all recurrent states. In
Proposition 3.5.18, we have seen that any finite-state Markov chain contains at least one recurrent state,
so R must be nonempty. We also assume that T is nonempty (just like gambler’s ruin). Define
TR = inf{n ≥ 0 | X n ∈ R} = first time to be in R. (335)
Since eventually X n must enter R (why?), TR is finite almost surely. Then
X TR = first recurrent state ever reached. (336)
We are interested in solving the following matrix U : T × R → [0, 1], where
U (x, y) = P(X TR = y | X 0 = x). (337)
3.5. RECURRENCE AND TRANSIENCE 47

Namely, U (x, y) is the probability that, starting from a transient state x, the first ever recurrent state
reached is y.
In order for our analysis, we may use the elementary decomposition of P in (332):
R T
· ¸
R PR O
P= . (338)
T S PT

Proposition 3.5.19. U = S + P T U . In particular, U = (I − P T )−1 S.


P ROOF. As we did in the analysis of Gambler’s ruin in Example 3.3.2, we will use a ‘first step analysis’,
which is simply to condition on the state X 1 after the first move. Fix x ∈ T and y ∈ R. By conditioning
on X 1 , we first write
U (x, y) = P(X TR = y | X 0 = x) (339)
X
= P(X TR = y | X 1 = z, X 0 = x)P (x, z) (340)
z∈Ω
X
= P (x, y) + P(X TR = y | X 0 = z)P (x, z) (341)
z∈T
X
= S(x, y) + P T (x, z)U (z, y). (342)
z∈T
We will justify (341) in more detail below. The last inequality above follows easily by using definitions
of S and P R . Before delving into details, we note that the above result can be translated into the matrix
equation U = S + P R U . This gives
(I − P R )U = S. (343)
Using Exercise 3.5.20, we can derive U = (I − P T )−1 S, as desired.
Now we justify (341). For the first term, we note that
X
P(X TR = y | X 1 = z, X 0 = x)P (x, z) = P(X TR = y | X 1 = y, X 0 = x)P (x, y) = P (x, y). (344)
z∈R
Namely, if the chain jumps to a recurrent state z ∈ R at time 1, then the only contribution to U (x, y)
is when z = y. Second, we note that, by conditioning on the values of the stopping time TR and using
strong Markov property,
X

P(X TR = y | X 1 = z, X 0 = x) = P(X TR = y | X 1 = z, X 0 = x) (345)
n=2
X∞
= P(X 2 ∈ T , . . . , X n−1 ∈ T , X n = y | TR = n, X 1 = z, X 0 = x) (346)
n=2
X∞
= P(X 1 ∈ T , . . . , X n−2 ∈ T , X n−1 = y | TR = n − 1, X 0 = z) (347)
n=2
= P(X TR = y | X 0 = z). (348)

−1 P∞ n
Exercise 3.5.20. Show that I − P R is invertible and (I − P R ) = n=0 (P R ) .
Next, we turn our attention to the expected time until first enterring the set of recurrent states.
Namely, we are interested in solving the following column vector W : T → [0, 1], where
W (x) = E[TR | X 0 = x]. (349)
So W (x) is the expect time to first hit any recurrent state starting from a transient state x.
Proposition 3.5.21. W = 1T +P T W . In particular, W = (I −P T )−1 1T . Here 1T denotes the |T |-dimensional
column vector of all ones.
3.5. RECURRENCE AND TRANSIENCE 48

P ROOF. As before, we use the ‘first step analysis’. Fix x ∈ T and y ∈ R. By conditioning on X 1 , we
first write
W (x) = E[TR | X 0 = x] (350)
X
= E[TR | X 1 = z, X 0 = x]P (x, z) (351)
z∈Ω
X
= 1+ E[TR | X 0 = z]P (x, z) (352)
z∈Ω
X
= 1+ E[TR | X 0 = z]P (x, z) (353)
z∈T
X
= 1T (x) + P T (x, z)W (z). (354)
z∈T
The justification (353) is similar as that of (341). This gives
(I − P R )W = 1T . (355)
Using Exercise 3.5.20, we can derive W = (I − P T )−1 1T , as desired. □
Exercise 3.5.22. Consider a Markov chain (X n )n≥0 on state space Ω = {1, 2, 3, 4, 5, 6, 7, 8, 9} whose transi-
tion probabilities are given by the following state-space diagram:

F IGURE 9. A triple of independent SSRWs on Z has 8 possible moves {−1, 1}3

(i) Show that T = {1, 2, 3, 4} is the set of transient states and R = {5, 6, 7, 8, 9} is the set of recurrent states.
(ii) Let U denote the matrix T × R → [0, 1] where U (x, y) is the probability that the chain first hits y
among all recurrent states, starting from x. Compute U . You may use
 −1  
1 −1/5 0 0 120 24 6 2
 0 −1/4 0   10 
 1  = 1  5 20 30 . (356)
 0 0 1 −1/3 119  20 4 120 40 
−1/2 0 1 0 60 12 3 120
(iii) Compute ρ 1,5 and ρ 1,6 . (Hint: Use U in part (ii).)
(iv) For each x ∈ T , let W (x) denote the expected hitting time of any recurrent state starting from x.
Compute W : T → [0, ∞).
(v) Compute all stationary distributions of the Markov chain.
3.6. EXISTENCE AND UNIEUQNESS OF STATIONARY DISTRIBUTION 49

3.6. Existence and unieuqness of stationary distribution


Does a Markov chain always have a stationary distribution? If yes, when there is a unique station-
ary distribution? In Exercise 3.2.3 (only for finite state spaces) we have exploited the relation between
stationary distribution and eigenvectors associated with eigenvalue 1 to study existence and unique-
ness of stationary distribution. Namely, the all-one column vector is a right-eigenvector associated with
eigenvalue 1 of a given transition matrix P . Hence its transpose P T has an eigenvector v associated with
eigenvalue 1. That is, P T v = v. Then taking transpose, we get v T P = v T , so π = v T /D, where D is the
sum of all entries in v, is a stationary distribution of P . For the uniqueness, see part (iii) of Exercise 3.2.3.
However, the linear algebra approach does not provide much intuition for the behavior of the Markov chain itself. In this section, we give an alternative, constructive argument to show that every
Markov chain has a stationary distribution. This is valuable since we can write down what the stationary
distribution is, whereas the linear algebra argument doesn’t give this information.
For each x, y ∈ Ω, let Vn (x, y) denote the number of visits to y in the first n steps given that X 0 = x,
V_n(x, y) := Σ_{k=1}^n 1(X_k = y | X_0 = x). (357)

We will see that with reasonable assumptions, a Markov chain will admit a unique stationary distribution
given by
π(y) := lim_{n→∞} (1/n) E[V_n(x, y)], (358)

which is the expected proportion of time spent at y starting from x; it turns out that it does not depend on x. The main result in this section, Theorem 3.6.4, states that when the Markov chain satisfies certain reasonable conditions, (358) is indeed a stationary distribution of the Markov chain. However, it is not a priori clear that the limit in (358) exists. Instead of taking the limiting frequency of visits to y,
we will instead construct a measure by only counting visits to y during a single ‘excursion’ from x to x,
which occurs during the time interval [0, T x ]. As we will see later, such a construction will give a valid
probability distribution if and only if E[T x | X 0 = x] is finite. For a further discussion, we introduce the
following notion.

Definition 3.6.1. Let (X n )n≥0 be a Markov chain on a state space Ω. We say a recurrent state x ∈ Ω is
positive recurrent if E[T x | X 0 = x] < ∞ and null recurrent if E[T x | X 0 = x] = ∞.

Proposition 3.6.2 (Positive recurrence is a class property). Let x, y be two communicating states. If one is positive recurrent, then so is the other.

Example 3.6.3 (SRW on Z is null recurrent). In Example 3.5.7, we have seen that SRW on Z is recurrent, which is equivalent to P(T_x < ∞ | X_0 = x) = 1. However, this does not necessarily mean that the expected return time E[T_x | X_0 = x] is finite. This is indeed the case for SRW on Z, which turns out to be null recurrent. For this, we remark the following asymptotic of the 'persistence probability' of SRW on Z:
P(X_1 ≥ 1, X_2 ≥ 1, . . . , X_n ≥ 1 | X_0 = 0) ∼ √(2/(πn)). (359)
(For reference, see, e.g., [LS19].) It follows that, by the tail-sum formula for the expectation of nonnegative RVs,
E[T_0 | X_0 = 0] = Σ_{n=1}^∞ P(T_0 ≥ n) ∼ Σ_{n=1}^∞ √(2/(πn)) = ∞. (360)
Thus the origin is null recurrent. By symmetry, all other states are also null recurrent. ▲
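A quick Monte Carlo experiment (an illustrative sketch, not part of the notes) can corroborate the tail asymptotic used in (360): the fraction of simple symmetric walks that have not returned to 0 by time n should be close to √(2/(πn)).

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials = 1000, 20_000
steps = rng.choice(np.array([-1, 1], dtype=np.int16), size=(trials, n))
walks = steps.cumsum(axis=1, dtype=np.int16)
no_return = np.all(walks != 0, axis=1).mean()   # empirical P(T_0 > n)
print(no_return, np.sqrt(2 / (np.pi * n)))      # both approximately 0.025
```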
The full theorem regarding the existence and uniqueness of the stationary distribution is the follow-
ing:

Theorem 3.6.4 (Existence and uniqueness of stationary distribution). Assume P is irreducible. Then P
has a stationary distribution if and only if all states are positive recurrent. In this case the stationary
distribution π is unique and its values satisfy
π(x) = 1 / E[T_x | X_0 = x]  ∀x ∈ Ω. (361)
Corollary 3.6.5. Assume Ω is finite and P is irreducible. Then P has a unique stationary distribution π
given by (361).
PROOF. By Theorem 3.6.4, we only need to verify that all states in Ω are positive recurrent. This is done in Exercise 3.6.6. □
Exercise 3.6.6 (Finite irreducible implies positive recurrence). Let (X n )n≥0 be an irreducible Markov
chain on a finite state space Ω with transition matrix P . Let X 0 = x ∈ Ω. For any y ∈ Ω and t ≥ 0, de-
fine the first hitting time of y by
T y = min{k ≥ 1 | X k = y}. (362)
We will show that, for all x, y ∈ Ω,
E[T y | X 0 = x] < ∞. (363)
That is, the expected hitting time of (X n )n≥0 to any state starting from any state is finite.
(i) Recall that by the irreducibility and Exercise 3.5.2, for any z, y ∈ Ω, P r (z, y) > 0 for some r ≥ 1 (r may
depend on z and y). Let
d = max_{z,y∈Ω} min{r ≥ 1 | P^r(z, y) > 0}. (364)

Show that there exists some constant δ > 0 such that


min_{y,z∈Ω} P((X_n) visits y at some time during [0, d] | X_0 = z) ≥ δ > 0. (365)

(ii) Show that for any k ≥ 1, (You may use strong Markov property)
P(T y > kd ) ≤ (1 − δ)P(T y > (k − 1)d ). (366)
Use induction to conclude that
P(T y > kd ) ≤ (1 − δ)k . (367)
(iii) Noting that P(T y ≥ t ) is a non-increasing function in t , show that
E[T_y] = Σ_{t=1}^∞ P(T_y ≥ t) ≤ d Σ_{k=0}^∞ P(T_y ≥ kd) ≤ d Σ_{k=0}^∞ (1 − δ)^k = d/δ < ∞. (368)
Conclude that all states in a finite irreducible Markov chain are positive recurrent.

3.7. Uniqueness of stationary distribution


In establishing Theorem 3.6.4, we first consider the uniqueness of the stationary distribution and when it should hold. Before we state our main result, we take a look at random walks on disconnected graphs and see that they must have multiple stationary distributions.
Exercise 3.7.1 (Random walk on disconnected graphs). Let G = (V, E ) be a graph with two disjoint com-
ponents G 1 and G 2 . Let (X n )n≥0 be a random walk on G and denote by P its transition matrix.
(i) Let P_1 and P_2 be the transition matrices for random walks on G_1 and G_2, respectively. Let V_1 = {1, 2, · · · , n} and V_2 = {n + 1, n + 2, · · · , n + m} be the sets of nodes in G_1 and G_2. Then show that P is of the following block diagonal form
P = \begin{pmatrix} P_1 & O \\ O & P_2 \end{pmatrix}. (369)

(ii) Let π_1 = [π_1^{(1)}, · · · , π_1^{(n)}] and π_2 = [π_2^{(1)}, · · · , π_2^{(m)}] be any stationary distributions for P_1 and P_2. Let
π̄_1 = [π_1^{(1)}, · · · , π_1^{(n)}, 0, · · · , 0] (370)
π̄_2 = [0, · · · , 0, π_2^{(1)}, · · · , π_2^{(m)}] (371)
be distributions on V = V_1 ∪ V_2 = {1, 2, · · · , n, n + 1, · · · , n + m}. Show that for any λ ∈ [0, 1], π := λπ̄_1 + (1 − λ)π̄_2 is a stationary distribution for P.
(iii) Recall that the random walk on an arbitrary graph has a stationary distribution in which the probability of each node is proportional to its degree. Conclude that the random walk (X_n)_{n≥0} on G has infinitely many stationary distributions.
A graph G is said to be connected if there exists a 'path' between any two given nodes. A path in G is a sequence of distinct nodes v_0, v_1, · · · , v_k such that consecutive nodes are adjacent in G. In Exercise 3.7.1, we have seen that the random walk on a disconnected graph has infinitely many stationary distributions. What about random walks on connected graphs? We will see that they have a unique stationary distribution, which must be the typical one that we know.
Proposition 3.7.2 (RW on connected graphs). Let G = (V, E ) be a connected graph. Then the random walk
on G has a unique stationary distribution π, which is given by
π(i) = deg_G(i) / (2|E|). (372)
P ROOF. Let P denote the transition matrix of random walk on G. Let V = {1, 2, · · · , N } and let E 1 be
the column eigenspace of P associated with eigenvalue 1. That is,
E 1 = {v ∈ RN | P v = v}. (373)
Note that [1, 1, · · · , 1]^T ∈ E_1, so E_1 has dimension ≥ 1. According to Exercise 3.2.3, it suffices to show that E_1 has dimension 1. This is equivalent to saying that E_1 is spanned by [1, · · · , 1]^T.
The argument uses the ‘maximum principle’ for ‘harmonic functions’. Let x ∈ E 1 be arbitrary. We will
view x as a function on the node set V . So we want to show that x is a constant function on V . Since V
is finite, there is some node k where x attains its global maximum. Suppose node k has some neighbor j
such that x( j ) < x(k). Then from the eigenvector condition P x = x,
x(k) = Σ_{i=1}^N P(k, i) x(i) (374)
= P(k, j) x(j) + Σ_{i≠j} P(k, i) x(i) (375)
< P(k, j) x(k) + Σ_{i≠j} P(k, i) x(i) (376)
≤ P(k, j) x(k) + Σ_{i≠j} P(k, i) x(k) = x(k), (377)
which is a contradiction. This implies that x also attains its global maximum at every neighbor of k. Applying the same argument to the neighbors of k and repeating, we see that x must be constant on the 'connected component' of G containing k. But since G is connected, it follows that x is constant on V. This shows the assertion. □
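Proposition 3.7.2 is easy to check numerically on any small connected graph; the following sketch (illustrative; the 4-node graph is an arbitrary choice) verifies that π(i) = deg_G(i)/(2|E|) satisfies πP = π.

```python
import numpy as np

A = np.array([[0, 1, 1, 0],   # adjacency matrix: triangle 0-1-2 with leaf 3 on node 2
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]])
deg = A.sum(axis=1)
P = A / deg[:, None]          # random-walk transition matrix
pi = deg / deg.sum()          # deg.sum() equals 2|E|
print(np.allclose(pi @ P, pi), pi)   # True [0.25  0.25  0.375 0.125]
```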
If we carefully examine the proof of the above result, we find that we haven't really used the values of the entries of the transition matrix P. Rather, we only used an abstract property following from the connectivity of G: we can reach any node from any other node. We extract this as a general property of Markov chains.
Definition 3.7.3 (Irreducibility). Let P be the transition matrix of a Markov chain (X n )n≥0 on a finite state
space Ω. We say the chain (or P ) is irreducible if for any i , j ∈ Ω,
P(X k = j | X 0 = i ) > 0 (378)

for some integer k ≥ 0.

In words, a Markov chain is irreducible if every state is ‘accessible’ from any other state.

Exercise 3.7.4 (RW on connected graphs is irreducible). Let G = (V, E) be a connected graph. Let (X_n)_{n≥0} be a random walk on G and denote by P its transition matrix. Let v_0, v_1, · · · , v_k be a path in G, so that consecutive nodes are adjacent. Show that
P^k(v_0, v_k) = P(X_k = v_k | X_0 = v_0) ≥ P(v_0, v_1) P(v_1, v_2) · · · P(v_{k−1}, v_k) (379)
= (1/deg_G(v_0)) (1/deg_G(v_1)) · · · (1/deg_G(v_{k−1})) > 0. (380)
Conclude that random walks on connected graphs are irreducible.

Now we state the general uniqueness theorem of stationary distribution.

Lemma 3.7.5 (Uniqueness of stationary distribution). Let (X_n)_{n≥0} be an irreducible Markov chain. Then it has at most one stationary distribution.

P ROOF. Exercise. Mimic the proof of Proposition 3.7.2. □

3.7.1. Construction of stationary distribution (Optional*). The gist of the construction of a stationary distribution using the visit counts V_n(x, y) is given in the following lemma, which constructs a stationary distribution for general Markov chains with a positive recurrent state.

Lemma 3.7.6. Consider a Markov chain (X_n)_{n≥0} on Ω with transition matrix P. Fix a recurrent state x ∈ Ω and for each y ∈ Ω, define
λ_x(y) := E[V_{T_x}(x, y)], (381)
which is the expected number of visits to y during [0, T_x] (a single excursion from x to x).
(i) For each x ∈ Ω, λ_x(x) = 1 and Σ_{y∈Ω} λ_x(y) = E[T_x | X_0 = x].
(ii) For each x ∈ Ω, λ_x = λ_x P.
(iii) If x is positive recurrent, then π_x := λ_x / E[T_x | X_0 = x] is a stationary distribution for P.

PROOF. To show (i), note that V_{T_x}(x, x) = 1, so λ_x(x) = 1. Also,
Σ_{y∈Ω} λ_x(y) = Σ_{y∈Ω} E[V_{T_x}(x, y)] = Σ_{y∈Ω} E[Σ_{n=1}^{T_x} 1(X_n = y) | X_0 = x] (382)
= E[Σ_{y∈Ω} Σ_{n=1}^{T_x} 1(X_n = y) | X_0 = x] (383)
= E[Σ_{n=1}^{T_x} (Σ_{y∈Ω} 1(X_n = y)) | X_0 = x] = E[Σ_{n=1}^{T_x} 1 | X_0 = x] = E[T_x | X_0 = x], (384)
where the third equality uses Fubini’s theorem for nonnegative summands.
To show (ii), our goal is to verify for each y ∈ Ω,
Σ_{z∈Ω} λ_x(z) P(z, y) = λ_x(y). (385)

Indeed, first observe that
E[Σ_{n=1}^{T_x} 1(X_n = z) | X_0 = x] = E[Σ_{n=1}^∞ 1(X_n = z, T_x ≥ n) | X_0 = x] (386)
= Σ_{n=1}^∞ P(X_n = z, T_x ≥ n | X_0 = x) (387)
= Σ_{n=1}^∞ P(X_1 ≠ x, . . . , X_{n−1} ≠ x, X_n = z | X_0 = x), (388)
where for the first equality, we have converted a random number of summands into an infinite sum with an additional indicator. Next, by time-homogeneity and the Markov property, we observe that
P(X_1 ≠ x, . . . , X_{n−1} ≠ x, X_n = z | X_0 = x) P(z, y) (389)
= P(X_1 ≠ x, . . . , X_{n−1} ≠ x, X_n = z | X_0 = x) P(X_{n+1} = y | X_n = z) (390)
= P(X_1 ≠ x, . . . , X_{n−1} ≠ x, X_n = z | X_0 = x) P(X_{n+1} = y | X_1 ≠ x, . . . , X_{n−1} ≠ x, X_n = z, X_0 = x) (391)
= P(X_1 ≠ x, . . . , X_{n−1} ≠ x, X_n = z, X_{n+1} = y | X_0 = x). (392)

Hence, noting that λ_x(x) = 1 from part (i),
Σ_{z∈Ω} λ_x(z) P(z, y) = P(x, y) + Σ_{z∈Ω∖{x}} E[Σ_{n=1}^{T_x} 1(X_n = z) | X_0 = x] P(z, y) (393)
= P(x, y) + Σ_{z∈Ω∖{x}} Σ_{n=1}^∞ P(X_1 ≠ x, . . . , X_{n−1} ≠ x, X_n = z, X_{n+1} = y | X_0 = x) (394)
= P(x, y) + Σ_{n=1}^∞ Σ_{z∈Ω∖{x}} P(X_1 ≠ x, . . . , X_{n−1} ≠ x, X_n = z, X_{n+1} = y | X_0 = x) (395)
= P(x, y) + Σ_{n=1}^∞ P(X_1 ≠ x, . . . , X_{n−1} ≠ x, X_n ≠ x, X_{n+1} = y | X_0 = x) (396)
= Σ_{n=1}^∞ P(X_1 ≠ x, . . . , X_{n−1} ≠ x, X_n = y | X_0 = x) (397)
= E[Σ_{n=1}^{T_x} 1(X_n = y) | X_0 = x] = λ_x(y), (398)
where in (397) we re-indexed the sum in (396) by shifting n by one and absorbed the term P(x, y) as the n = 1 summand.

This shows (ii).


Lastly, if x is positive recurrent, then one can normalize λx by the total sum E[T x | X 0 = x] < ∞ (see
(i)) to obtain a probability distribution πx that is invariant under right-multiplication of P . This shows
(iii). □
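The excursion construction can also be seen in simulation. The sketch below (illustrative; the 3-state chain is hypothetical, with stationary distribution [1/3, 1/3, 1/3]) accumulates visit counts over many excursions from x and normalizes by the total excursion time, which by Lemma 3.7.6 recovers π_x.

```python
import numpy as np

rng = np.random.default_rng(1)
P = np.array([[0.1, 0.9, 0.0],
              [0.4, 0.1, 0.5],
              [0.5, 0.0, 0.5]])
x, n_excursions = 0, 50_000
visits, total_time = np.zeros(3), 0
for _ in range(n_excursions):
    state = x
    while True:                            # one excursion from x back to x
        state = rng.choice(3, p=P[state])
        visits[state] += 1
        total_time += 1
        if state == x:
            break
print(visits / total_time)                 # ≈ [1/3, 1/3, 1/3], the stationary pi
```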
The following result shows the (more important) half of Theorem 3.6.4.
Theorem 3.7.7. Consider an irreducible Markov chain (X n )n≥0 on Ω with transition matrix P . Suppose
all states are positive recurrent. Then there is a unique stationary distribution π for P given by (361).
P ROOF. According to Lemma 3.7.6, for each x ∈ Ω, we have stationary distribution πx defined in
Lemma 3.7.6 (iii). But by Lemma 3.7.5, there is a unique stationary distribution for P , which we will
denote as π. Then for each x ∈ Ω, it holds that
π(x) = π_x(x) = λ_x(x) / E[T_x | X_0 = x] = 1 / E[T_x | X_0 = x]. (399)
This shows the assertion. □

In the following proposition, we make a simple observation that any stationary distribution will take
value zero on transient states.

Proposition 3.7.8 (Stationary distribution is supported on recurrent states). Suppose a transition matrix P on Ω has a stationary distribution π. Then for each x ∈ Ω, π(x) > 0 implies that x is recurrent.

PROOF. Suppose π(x) > 0. Note that π = πP^n for all n ≥ 1. Hence
∞ = Σ_{n=1}^∞ π(x) = Σ_{n=1}^∞ Σ_{z∈Ω} π(z) P^n(z, x) = Σ_{z∈Ω} Σ_{n=1}^∞ π(z) P^n(z, x) (400)
= Σ_{z∈Ω} π(z) Σ_{n=1}^∞ P^n(z, x) (401)
= Σ_{z∈Ω} π(z) E[N_x | X_0 = z], (402)

where we have used Fubini’s theorem to switch the order of double summation of nonnegative terms and
the last equality uses (296). This implies that there exists some z ∈ Ω such that π(z) > 0 and E[N x | X 0 =
z] = ∞. But by Proposition 3.5.5 this implies that x is recurrent (as well as ρ zx > 0), as desired. □

P ROOF OF T HEOREM 3.6.4. According to Theorem 3.7.7, it suffices to show that, assuming irreducibil-
ity and the existence of stationary distribution π, all states are positive recurrent. Indeed, since π is a
probability distribution, π(x) > 0 for some x ∈ Ω, so by Proposition 3.7.8, x is recurrent. But since P is
irreducible, it follows that all states are recurrent. Next, we need the following lemma:
Lemma: For an irreducible and recurrent Markov chain, any two invariant measures for P are constant multiples of each other. That is, if λ = λP and µ = µP, then λ = cµ for some c > 0.
For the proof of the above lemma, see [SV, Thm. 2.67]. Now, by Lemma 3.7.6, λx is an invariant measure
for P for each x ∈ Ω. In particular, we have λx = c x π for some c x > 0. Then
E[T_x | X_0 = x] = Σ_{y∈Ω} λ_x(y) = c_x Σ_{y∈Ω} π(y) = c_x < ∞, (403)

so x is positive recurrent. Since this holds for all x, we conclude that all states are positive recurrent. □

PROOF OF PROPOSITION 3.6.2. Suppose two states x, y ∈ Ω communicate and x is positive recurrent. By restricting to the communicating class that contains x and y, we may assume that the Markov chain is irreducible and recurrent. By Lemma 3.7.6, there is a stationary distribution, say, π_x. Then by a similar argument as in the last part of the proof of Theorem 3.6.4, we have λ_y = c_y π_x for some c_y > 0, and this implies that y is also positive recurrent. □

3.8. Limit theorems for Markov chains


In this section, we establish a Markov chain version of the strong law of large numbers. Namely, consider the sequence of RVs f(X_0), f(X_1), . . . where (X_n)_{n≥0} is a Markov chain and f : Ω → R is a function. Can a version of the law of large numbers hold for the random increments f(X_0), f(X_1), . . .? Namely, does the average of such increments converge to 'the' expected value? These are certainly not i.i.d. since they
are driven by the underlying Markov chain X n . However, if the chain starts from a positive recurrent
state, say x, then one can group those increments according to the excursions of the chain. By the strong
Markov property such excursions are i.i.d., so we can use the usual laws of large numbers for i.i.d. incre-
ments. This is the basic idea behind the proof of the following result, which states that “time-average of
a function converges to a space-average given by the stationary distribution”.

Theorem 3.8.1 (SLLN for finite Markov chains). Let (X_n)_{n≥0} be an irreducible Markov chain on a finite state space Ω with transition matrix P and arbitrary initial distribution. Let f : Ω → R denote a function. Then
almost surely,
lim_{n→∞} (1/n) Σ_{k=1}^n f(X_k) = E_{X∼π}[f(X)] = Σ_{x∈Ω} f(x) π(x), (404)

where π denotes the unique stationary distribution for P .


PROOF. Since Ω is finite and P is irreducible, all states are positive recurrent (see Exercise 3.6.6) and there exists a unique stationary distribution π (see Theorem 3.6.4). Let V_x(n) = Σ_{k=1}^n 1(X_k = x) denote the number of visits to x in the first n steps of the Markov chain. The key step in converting the time-average into a space-average is the following observation: for each x ∈ Ω, f(x) is added up V_x(n) times:
Σ_{k=1}^n f(X_k) = Σ_{k=1}^n f(X_k) Σ_{x∈Ω} 1(X_k = x) = Σ_{x∈Ω} Σ_{k=1}^n f(X_k) 1(X_k = x) (405)
= Σ_{x∈Ω} Σ_{k=1}^n f(x) 1(X_k = x) (406)
= Σ_{x∈Ω} f(x) Σ_{k=1}^n 1(X_k = x) (407)
= Σ_{x∈Ω} f(x) V_x(n). (408)
Now dividing by n and taking the limit n → ∞, we get
lim_{n→∞} (1/n) Σ_{k=1}^n f(X_k) = lim_{n→∞} Σ_{x∈Ω} f(x) V_x(n)/n = Σ_{x∈Ω} f(x) lim_{n→∞} V_x(n)/n = Σ_{x∈Ω} f(x) π(x), (409)

where the second equality (exchanging the limit and the summation) uses the fact that Ω is finite (it would be non-trivial if Ω were infinite) and the last equality uses Exercise 3.8.2. □
Exercise 3.8.2 (Convergence of empirical distribution). Let (X t )t ≥0 be an irreducible Markov chain on Ω
with a unique stationary distribution π. For each x ∈ Ω, define the following quantities
T_x^{(k)} := time of the kth visit to x (410)
V_x(n) := Σ_{k=1}^n 1(X_k = x) = number of visits to x during [0, n]. (411)
Here X 0 follows an arbitrary initial distribution µ on Ω.
(i) Let τ_k = T_x^{(k)} − T_x^{(k−1)} for k ≥ 2. Show that the τ_k's are i.i.d. and 0 < E[τ_k] < ∞. (For the i.i.d. part, use the strong Markov property. For the finite expectation part, see Exercise 3.6.6 for the finite case and Theorem 3.6.4 for the general case.)
(ii) Use the Markov property and the strong law of large numbers to show
P(lim_{n→∞} (τ_2 + · · · + τ_n)/n = E[τ_2]) = 1. (412)
Furthermore, noting that T_x^{(n)} = T_x^{(1)} + τ_2 + · · · + τ_n, conclude that
P(lim_{n→∞} T_x^{(n)}/n = E[τ_2]) = 1. (413)

(iii) Let V_x(t) = V(t) be the number of visits to x up to time t. Using the fact that E[τ_k] < ∞, show that V(t) → ∞ as t → ∞ a.s. Also, noting that T_x^{(V(n))} ≤ n < T_x^{(V(n)+1)}, use (ii) to deduce
P(lim_{n→∞} V_x(n)/n = 1/E[τ_2]) = 1. (414)

Furthermore, use Theorem 3.6.4 to deduce that
P(lim_{t→∞} V_x(t)/t = π(x)) = 1. (415)
Example 3.8.3. Let (X_n)_{n≥0} be a Markov chain on Ω = {0, 1} with transition matrix
P = \begin{pmatrix} 1/4 & 3/4 \\ 2/3 & 1/3 \end{pmatrix}. (416)
Given a function f : Ω → R, we would like to find the limit
lim_{n→∞} (1/n) Σ_{k=1}^n f(X_k). (417)
To this end, first note that P is irreducible. Since Ω is finite, it has a unique stationary distribution. More explicitly, one can solve π = πP for a probability distribution π on Ω and see that π must be [8/17, 9/17]. Then by Theorem 3.8.1, almost surely,
lim_{n→∞} (1/n) Σ_{k=1}^n f(X_k) = f(0)π(0) + f(1)π(1) = (8/17) f(0) + (9/17) f(1). (418)
Note that this holds regardless of the initial distribution of the Markov chain. ▲
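One may check this numerically; the following sketch (illustrative, with an arbitrary test function f) simulates the chain and compares the time-average with the predicted space-average.

```python
import numpy as np

rng = np.random.default_rng(2)
P = np.array([[1/4, 3/4], [2/3, 1/3]])
f = np.array([5.0, -2.0])               # arbitrary test values f(0), f(1)
state, total, n = 0, 0.0, 500_000
for _ in range(n):
    state = rng.choice(2, p=P[state])   # one step of the chain
    total += f[state]
print(total / n, (8/17) * f[0] + (9/17) * f[1])   # both ≈ 22/17 ≈ 1.294
```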
The Markov chain SLLN stated in Theorem 3.8.1 holds for finite irreducible chains. When the state space is countably infinite, the analogous result continues to hold, provided we augment irreducibility with positive recurrence:
Theorem 3.8.4 (SLLN for countable Markov chains). Let (X_n)_{n≥0} be an irreducible and positive recurrent Markov chain on a state space Ω with transition matrix P and unique stationary distribution π. Let f : Ω → R denote a function that is integrable w.r.t. π, that is, Σ_{x∈Ω} |f(x)| π(x) < ∞. Then almost surely, for an arbitrary initial distribution,
lim_{n→∞} (1/n) Σ_{k=1}^n f(X_k) = E_{X∼π}[f(X)] = Σ_{x∈Ω} f(x) π(x). (419)

P ROOF. Omitted. See [SV, Thm 2.94]. □


Example 3.8.5 (Success run). Consider the success run chain in Exercise 3.2.5: It is a Markov chain (X_n)_{n≥0} on the state space Ω = Z_{≥0} of nonnegative integers with the following transition matrix P:
P(x, x + 1) = p,  P(x, 0) = 1 − p  for all x ∈ Z_{≥0}, (420)
where p ∈ [0, 1] is a fixed parameter. We have shown that it has a unique stationary distribution π with π(k) = (1 − p)p^k for k = 0, 1, 2, . . .. Hence for a function f : Z_{≥0} → R, by Theorem 3.8.4, we have
lim_{n→∞} (1/n) Σ_{k=1}^n f(X_k) = Σ_{k=0}^∞ f(k)(1 − p)p^k a.s. (421)
We take the following two special examples:
(a) Suppose one gets a dollar at each success, but only when the current success run has length at least three. What is the long-term rate of reward, relative to the number of trials? In this case we set
f(k) = 1(k ≥ 3). (422)
Then the long-term rate of reward is
lim_{n→∞} (1/n) Σ_{k=1}^n f(X_k) = Σ_{k=3}^∞ (1 − p)p^k = p^3 a.s. (423)
(b) What is the average length of the success run in the long run? In this case we take f(k) = k and
lim_{n→∞} (1/n) Σ_{k=1}^n f(X_k) = Σ_{k=1}^∞ k(1 − p)p^k = p/(1 − p) a.s. (424) ▲
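As a sanity check on part (a), a short simulation (an illustrative sketch) of the success run chain shows that the empirical fraction of time spent at states ≥ 3 matches p^3.

```python
import numpy as np

rng = np.random.default_rng(3)
p, n = 0.6, 1_000_000
x, hits = 0, 0
for _ in range(n):
    x = x + 1 if rng.random() < p else 0   # success-run transition (420)
    hits += (x >= 3)
print(hits / n, p ** 3)                    # both ≈ 0.216
```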

3.9. Convergence to the stationary distribution


Let (X_n)_{n≥0} be an irreducible Markov chain on a finite state space Ω. By Theorem 3.6.4, we know that the chain has a unique stationary distribution π. Hence if we denote by π_t the distribution of X_t, then we should expect that π_t 'converges' to π as t → ∞ in some sense. We will prove a precise version of this claim in this section.
We start with an example, where we show the convergence of a 2-state chain using a diagonalization
of its transition matrix.

Exercise 3.9.1. Let (X_n)_{n≥0} be a Markov chain on Ω = {1, 2} with the following transition matrix
P = \begin{pmatrix} 0.2 & 0.8 \\ 0.6 & 0.4 \end{pmatrix}. (425)
(i) Show that P admits the following diagonalization
P = \begin{pmatrix} 1 & -4/3 \\ 1 & 1 \end{pmatrix} \begin{pmatrix} 1 & 0 \\ 0 & -2/5 \end{pmatrix} \begin{pmatrix} 1 & -4/3 \\ 1 & 1 \end{pmatrix}^{-1}. (426)
(ii) Show that P^t admits the following diagonalization
P^t = \begin{pmatrix} 1 & -4/3 \\ 1 & 1 \end{pmatrix} \begin{pmatrix} 1 & 0 \\ 0 & (-2/5)^t \end{pmatrix} \begin{pmatrix} 1 & -4/3 \\ 1 & 1 \end{pmatrix}^{-1}. (427)
(iii) Let r_t denote the row vector of the distribution of X_t. Show that
r_t = r_0 \begin{pmatrix} 1 & -4/3 \\ 1 & 1 \end{pmatrix} \begin{pmatrix} 1 & 0 \\ 0 & (-2/5)^t \end{pmatrix} \begin{pmatrix} 1 & -4/3 \\ 1 & 1 \end{pmatrix}^{-1}. (428)
Also show that
lim_{t→∞} r_t = r_0 \begin{pmatrix} 3/7 & 4/7 \\ 3/7 & 4/7 \end{pmatrix} = [3/7, 4/7]. (429)
(iv) Show that π = [3/7, 4/7] is the unique stationary distribution for P. Conclude that regardless of the initial distribution r_0, the distribution of the Markov chain (X_n)_{n≥0} converges to [3/7, 4/7].
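The conclusion of the exercise is visible directly from matrix powers; the following check (illustrative) shows that P^t rapidly approaches the rank-one matrix with rows [3/7, 4/7].

```python
import numpy as np

P = np.array([[0.2, 0.8], [0.6, 0.4]])
print(np.linalg.matrix_power(P, 50))   # both rows ≈ [0.4285714, 0.5714286]
print(3 / 7, 4 / 7)
```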

Next, we observe that an irreducible MC may not always converge to the stationary distribution. The
key issue there is the notion of ‘periodicity’.

Example 3.9.2. Let (X n )n≥0 be a 2-state MC on state space Ω = {0, 1} with transition matrix
P = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}. (430)
Namely, the chain deterministically alternates between the two states. Note that it is irreducible and has
a unique stationary distribution
π = [1/2, 1/2]. (431)
Let π_t be the distribution of X_t, where the initial distribution is given by π_0 = [1, 0]. Then we have
π1 = [0, 1] (432)
π2 = [1, 0] (433)
π3 = [0, 1] (434)
π4 = [1, 0], (435)
and so on. Hence the distributions πn do not converge to the stationary distribution π. ▲

Example 3.9.3 (RW on torus). Let Z_n be the set of integers modulo n. Let G = (V, E) be a graph where V = Z_n × Z_n and two nodes u = (u_1, u_2) and v = (v_1, v_2) are adjacent if and only if
|u_1 − v_1| + |u_2 − v_2| = 1, (436)
where the differences are computed modulo n. Such a graph G is called the n × n torus and we write G = Z_n × Z_n. Intuitively, it is obtained from the n × n square grid by adding boundary edges to wrap around (see Figure 10 left).

FIGURE 10. (Left) Torus graph (Figure excerpted from [LP17]). (Right) RW on torus G = Z_6 × Z_6 has period 2.

Now let (X t )t ≥0 be a random walk on G. Since G is connected, X t is irreducible. Since all nodes in
G have degree 4, the uniform distribution on Zn × Zn , which we denote by π, is the unique stationary
distribution of X t . Let πt denote the distribution of X t .
For instance, consider G = Z_6 × Z_6. As illustrated in Figure 10, observe that if X_0 is one of the red nodes (where the sum of coordinates is even), then X_t is at a red node for every even t and at a black node (where the sum of coordinates is odd) for every odd t. Hence π_t is supported only on the 'even' nodes at even times and on the 'odd' nodes at odd times. Therefore π_t does not converge in any sense to the uniform distribution π. ▲
The key issue in the 2-periodicity of the RW on G = Z_6 × Z_6 is that it takes an even number of steps to return to any given node. Generalizing this observation, we introduce the following notion of periodicity.

Definition 3.9.4. Let P be the transition matrix of a Markov chain (X t )t ≥0 on a finite state space Ω. For
each state x ∈ Ω, let T (x) = {t ≥ 1 | P t (x, x) > 0} be the set of times when it is possible for the chain to
return to starting state x. We define the period of x by the greatest common divisor of T (x). We say the
chain X t is aperiodic if all states have period 1.

Example 3.9.5. Let T(x) = {4, 6, 8, 10, · · · }. Then the period of x is 2, even though it is not possible to go from x to x in 2 steps. If T(x) = {4, 6, 8, 10, · · · } ∪ {3, 6, 9, 12, · · · }, then the period of x is 1. This means that the return times to x are irregular. For the RW on G = Z_6 × Z_6 in Example 3.9.3, all nodes have period 2. ▲
Exercise 3.9.6 (Aperiodicity of RW on graphs). Let (X t )t ≥0 be a random walk on a connected graph G =
(V, E ).
(i) Show that all nodes have the same period.
(ii) If G contains an odd cycle C (e.g., triangle), show that all nodes in C have period 1.
(iii) Show that X t is aperiodic if and only if G contains an odd cycle.
(iv)* Show that X t is periodic if and only if G is bipartite. (A graph G is bipartite if there exists a partition
V = A ∪ B of nodes such that all edges are between A and B .)

Remark 3.9.7. If (X t )t ≥0 is an irreducible Markov chain on a finite state space Ω, then all states x ∈ Ω
must have the same period. The argument is similar to that for Exercise 3.9.6 (i).

Exercise 3.9.8 (Aperiodicity implies irreducibility for the product chain). Let P be an irreducible transi-
tion matrix for a Markov chain on Ω. Let Q be a matrix on Ω × Ω defined by
Q((x, y), (z, w)) = P (x, z) P (y, w) for x, y, z, w ∈ Ω. (437)
(i) Consider a Markov chain Z t := (X t , Y t ) on Ω × Ω that evolves by independently evolving its two coor-
dinates according to P . Show that Q above is the transition matrix for Z t .
(ii) Show that Q is irreducible if P is aperiodic.
We now state the convergence theorem for random walks on graphs.
Theorem 3.9.9 (Convergence of RW on graphs). Let (X_t)_{t≥0} be a random walk on a connected graph G = (V, E) with an odd cycle. Let π denote the unique stationary distribution. Then for each x, y ∈ V,
lim_{t→∞} P(X_t = y | X_0 = x) = π(y). (438)

P ROOF. Since G contains an odd cycle, random walk on G is aperiodic according to Exercise 3.9.6.
Let (Y_t)_{t≥0} be another RW on G. We evolve X_t and Y_t independently until they meet, after which we evolve them in unison. Namely, let τ be the 'meeting time' of these two walkers:
τ = min{t ≥ 0 | X_t = Y_t}. (439)
Then we let X_t = Y_t for all t ≥ τ. Still, each of the two processes, viewed on its own, behaves exactly as a random walk on G should. (This is a 'coupling' between the two random walks.)
Now suppose X_0 = x for some x ∈ V and Y_0 is drawn from the stationary distribution π. In particular, the distribution of Y_t is π for all t. Let π_t denote the distribution of X_t. Then for any y ∈ V,
|π_t(y) − π(y)| = |P(X_t = y) − P(Y_t = y)| (440)
= |P(X_t = y, Y_t = y) + P(X_t = y, Y_t ≠ y) − P(X_t = y, Y_t = y) − P(X_t ≠ y, Y_t = y)| (441)
≤ P(X_t = y, Y_t ≠ y) + P(X_t ≠ y, Y_t = y) (442)
≤ P(X_t ≠ Y_t) (443)
= P(τ > t). (444)
Hence it suffices to show that P(τ > t) → 0 as t → ∞, as this will yield |π_t(y) − π(y)| → 0 as t → ∞. This follows from the fact that P(τ < ∞) = 1, that is, the two independent random walks on G eventually meet with probability 1 (see Exercise 3.9.12). □
The following exercise shows that if a RW on G is irreducible and aperiodic, then it is possible to
reach any node from any other node in a fixed number of steps.
Exercise 3.9.10. Let (X t )t ≥0 be a RW on a connected graph G = (V, E ). Let P denote its transition matrix.
Suppose G contains an odd cycle, so that the walk is irreducible and aperiodic. For each x ∈ V , let T (x)
denote the set of all possible return times to the state x.
(i) For any x ∈ V , show that a, b ∈ T (x) implies a + b ∈ T (x).
(ii) For any x ∈ V , show that T (x) contains 2 and some odd integer b.
(iii) For each x ∈ V , let b x be the smallest odd integer contained in T (x). Show that m ∈ T (x) whenever
m ≥ bx .
(iv) Let b ∗ = maxx∈V b x . Show that m ∈ T (x) for all x ∈ V whenever m ≥ b ∗ .
(v) Let r = |V | + b ∗ . Show that P r (x, y) > 0 for all x, y ∈ V .
With a little more work, one can also show a similar statement for general irreducible and aperiodic
Markov chains.
Exercise 3.9.11. Let P be the transition matrix of a Markov chain (X t )t ≥0 on a finite state space Ω. Show
that (i) implies (ii):
(i) P is irreducible and aperiodic.

(ii) There exists an integer r ≥ 0 such that every entry of P r is positive.


Furthermore, show that (ii) implies (i) if X t is a RW on some graph G = (V, E ).
(Hint for (i)⇒(ii): You may use the fact that if a subset T of positive integers N is closed under ad-
dition with gcd(T ) = 1, then it must contain all but finitely many integers. Proof is similar to Exercise
3.9.10, but a bit more number-theoretic. See [LP17, Lem. 1.27].)
Exercise 3.9.12. Let (X t ) and (Y t ) be independent random walks on a connected graph G = (V, E ). Sup-
pose that G contains an odd cycle. Let P be the transition matrix of the random walk on G.
(i) Let t ≥ 0 be arbitrary. Use Exercise 3.9.10 to deduce that for some integer r ≥ 1, we have
P(X_{t+r} = Y_{t+r} | X_t = x, Y_t = y) = Σ_{z∈V} P(X_{t+r} = z = Y_{t+r} | X_t = x, Y_t = y) (445)
= Σ_{z∈V} P(X_{t+r} = z | X_t = x) P(Y_{t+r} = z | Y_t = y) (446)
= Σ_{z∈V} P^r(x, z) P^r(y, z) > 0. (447)
(ii) Let r ≥ 1 be as in (i) and let
δ = min_{x,y∈V} Σ_{z∈V} P^r(x, z) P^r(y, z) > 0. (448)
Use (i) and the Markov property to show that in every r steps, X_t and Y_t meet with probability ≥ δ > 0.
(iii) Let τ be the first time that X_t and Y_t meet. From (ii) and the Markov property, deduce that
P(τ ≥ kr) = P(τ ≥ r) P(X_t and Y_t never meet during [r, kr) | τ ≥ r) (449)
≤ (1 − δ) P(X_t and Y_t never meet during [r, kr) | τ ≥ r). (450)
By an induction on k, conclude that
P(τ ≥ kr) ≤ (1 − δ)^k → 0 as k → ∞. (451)
The general convergence theorem is stated below.
Theorem 3.9.13 (Convergence Thm). Let (X_t)_{t≥0} be an irreducible aperiodic Markov chain on a finite state space Ω. Let π denote the unique stationary distribution of X_t. Then for each x, y ∈ Ω,
lim_{t→∞} P(X_t = y | X_0 = x) = π(y). (452)

PROOF. As always, there is a linear algebra proof of this result (see, e.g., [LP17, Thm. 4.9]). Alternatively, mimic the argument for Theorem 3.9.9 for a coupling-based proof. □

3.10. Markov chain Monte Carlo


So far, we were given a Markov chain (X t )t ≥0 on a finite state space Ω and studied existence and
uniqueness of its stationary distribution and convergence to it. In this section, we will consider the
reverse problem. Namely, given a distribution π on a sample space Ω, can we construct a Markov chain
(X_t)_{t≥0} such that π is a stationary distribution? If in addition the chain is irreducible and aperiodic, then
by the convergence theorem (Theorem 3.9.13), we know that the distribution πt of X t converges to π.
Hence if we run the chain for long enough, the state of the chain is asymptotically distributed as π.
In other words, we can sample a random element of Ω according to the prescribed distribution π by
emulating it through a suitable Markov chain. This method of sampling is called Markov chain Monte
Carlo (MCMC).
Example 3.10.1 (Uniform distribution on regular graphs). Let G = (V, E ) be a connected regular graph,
meaning that all nodes have the same degree. Let µ be the uniform distribution on the node set V . How
can we sample a random node according to µ? If we have a list of all nodes, then we can label them
from 1 to N = |V|, choose a random number between 1 and N, and find the corresponding node. But often-

FIGURE 11. MCMC simulation of Ising model on 200 by 200 torus at temperature T = 1 (left), 2 (middle), and 5 (right).

times, we do not have the full list of nodes, especially when we want to sample a random node from a
social network. Given only local information (set of neighbors for each given node), can we still sample
a uniform random node from G?
One answer is given by random walk. Indeed, random walks on graphs are defined by only using
local information of the underlying graph: Choose a random neighbor and move there. Moreover, since
G is connected, there is a unique stationary distribution π for the walk, which is given by
π(x) = deg_G(x) / (2|E|). (453)
Since G is regular, any two nodes have the same degree, so π(x) = π(y) for all x, y ∈ V . This means π
equals the uniform distribution µ on V . Hence the sampling algorithm we propose is as follows:
(∗) Run a random walk on G for t ≫ 1 steps, and take the random node that the walk sits on.
However, there is a possible issue of convergence. Namely, if the graph G does not contain any odd
cycle, then random walk on G is periodic (see Exercise 3.9.6), so we are not guaranteed to have conver-
gence. We can overcome this by using a lazy random walk instead, which is introduced in Exercise 3.10.2.
We know that the lazy RW on G is irreducible, aperiodic, and has the same set of stationary distributions as the ordinary RW on G. Hence we can apply the sampling algorithm (∗) above to the lazy random walk on G to sample a uniform random node in G. ▲
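The following sketch (illustrative; the 6-cycle is a hypothetical example of a regular graph) implements the sampling scheme (∗) with the lazy walk, using only neighborhood queries. The 6-cycle is bipartite, so laziness is exactly what guarantees convergence here.

```python
import numpy as np

rng = np.random.default_rng(4)
neighbors = {i: [(i - 1) % 6, (i + 1) % 6] for i in range(6)}   # 6-cycle, 2-regular

def lazy_rw_sample(start, t):
    x = start
    for _ in range(t):
        if rng.random() < 0.5:             # lazy step: stay put
            continue
        nbrs = neighbors[x]
        x = nbrs[rng.integers(len(nbrs))]  # move to a uniform random neighbor
    return x

samples = [lazy_rw_sample(0, 200) for _ in range(6000)]
print(np.bincount(samples, minlength=6) / 6000)   # ≈ [1/6, ..., 1/6]
```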
Exercise 3.10.2 (Lazy RW on graphs). Let G = (V, E ) be a graph. Let (X t )t ≥0 be a Markov chain on the
node set V with transition probabilities


P(X_{t+1} = j | X_t = i) = \begin{cases} 1/2 & \text{if } j = i \\ 1/(2 deg_G(i)) & \text{if } j \text{ is adjacent to } i \\ 0 & \text{otherwise}. \end{cases} (454)

This chain is called the lazy random walk on G. In words, the usual random walker on G now flips a fair
coin to decide whether it stays at the same node or makes a move to one of its neighbors.
(i) Show that for any connected graph G, the lazy random walk on G is irreducible and aperiodic.
(ii) Let P be the transition matrix for the usual random walk on G. Show that the following matrix
Q = (P + I)/2 (455)
is the transition matrix for the lazy random walk on G.
(iii) For any distribution π on V, show that πQ = π if and only if πP = π. Conclude that the usual and lazy random walks on G have the same set of stationary distributions.

Example 3.10.3 (Finding local minima). Let G = (V, E ) be a connected graph and let f : V → [0, ∞) be
a ‘cost’ function. The objective is to find a node x ∗ ∈ V such that f takes global minimum at x ∗ . This
problem has a lot of applications in machine learning, for example. Note that if the domain V is very
large, then an exhaustive search is too expensive to use.
Here is a simple form of the popular stochastic gradient descent algorithm, which lies at the heart of many important machine learning algorithms.
(i) Initialize the first guess X 0 = x 0 ∈ V .
(ii) Suppose X t = x ∈ V is chosen. Let

D t = {y a neighbor of x or x itself | f (y) ≤ f (x)}. (456)

Define X t +1 to be a uniform random node from D t .


(iii) The algorithm terminates if it finds a local minimum.
In words, at each step we move to a random neighbor that could possibly decrease the current value of f. It is easy to see that one always converges to a local minimum, which may not be a global minimum. In a very complex machine learning task (e.g., training a deep neural network), this is often good enough. Is this a Markov chain? Irreducible? Aperiodic? Stationary distribution? ▲
There is a neat approach to finding the global minimum. The idea is to allow uphill moves with a small probability.

Example 3.10.4 (Finding the global minimum). Let G = (V, E) be a connected regular graph and let f : V → [0, ∞) be a cost function. Let
V^∗ = {x ∈ V | f(x) = min_{y∈V} f(y)} (457)
be the set of all nodes where f attains its global minimum.
Fix a parameter λ ∈ (0, 1], and define a probability distribution π_λ on V by
π_λ(x) = λ^{f(x)} / Z_λ, (458)
where Z_λ = Σ_{x∈V} λ^{f(x)} is the normalizing constant. Since π_λ(x) is decreasing in f(x), it favors nodes x for which f(x) is small.
for which f (x) is small.
Let (X_t)_{t≥0} be a Markov chain on V, whose transition is defined as follows. If X_t = x, then let y be a uniform random neighbor of x. If f(y) ≤ f(x), then move to y; if f(y) > f(x), then move to y with probability λ^{f(y)−f(x)} and stay at x with probability 1 − λ^{f(y)−f(x)}. We analyze this MC below:
(i) (Irreducibility) Since G is connected and we are allowing any move (either downhill or uphill) we can
go from one node to any other in some number of steps. Hence the chain (X t )t ≥0 is irreducible.
(ii) (Aperiodicity) By (i) and Remark 3.9.7, all nodes have the same period. Moreover, let x ∈ V ∗ be an
arbitrary node where f takes global minimum. Then all return times are possible, so x has
period 1. Hence all nodes have period 1, so the chain is aperiodic.
(iii) (Stationarity) Here we show that πλ is a stationary distribution of the chain. To do this, we first need
to write down the transition matrix P . Namely, if we let AG (y, z) denote the indicator that y and
z are adjacent, then
P(x, y) = \begin{cases} (A_G(x, y)/deg_G(x)) \min(1, λ^{f(y)−f(x)}) & \text{if } x ≠ y \\ 1 − Σ_{y'≠x} P(x, y') & \text{if } y = x. \end{cases} (459)
To show π_λ P = π_λ, it suffices to show for any y ∈ V that
Σ_{z∈V} π_λ(z) P(z, y) = π_λ(y). (460)

Note that
Σ_{z∈V} π_λ(z) P(z, y) = π_λ(y) P(y, y) + Σ_{z≠y} π_λ(z) P(z, y) (461)
= π_λ(y) − Σ_{z≠y} π_λ(y) P(y, z) + Σ_{z≠y} π_λ(z) P(z, y). (462)
Hence it suffices to show that
π_λ(y) P(y, z) = π_λ(z) P(z, y) (463)
for any z ≠ y. Indeed, considering the two cases f(z) ≤ f(y) and f(z) > f(y), we have
π_λ(y) P(y, z) = (λ^{f(y)}/Z_λ) (A_G(y, z)/deg_G(y)) min(1, λ^{f(z)−f(y)}) = (1/Z_λ) (A_G(y, z)/deg_G(y)) λ^{max(f(y), f(z))}, (464)
π_λ(z) P(z, y) = (λ^{f(z)}/Z_λ) (A_G(z, y)/deg_G(z)) min(1, λ^{f(y)−f(z)}) = (1/Z_λ) (A_G(z, y)/deg_G(z)) λ^{max(f(y), f(z))}. (465)
Now since AG (z, y) = AG (y, z) and we are assuming G is a regular graph, this yields (463), as
desired. Hence πλ is a stationary distribution for the chain (X t )t ≥0 .
(iv) (Convergence) By (i), (iii), Theorem 3.6.4, we see that πλ is the unique stationary distribution for the
chain X t . Furthermore, by (i)-(iii) and Theorem 3.9.13, we conclude that the distribution of X t
converges to πλ .
(v) (Global minimum) Let f^∗ = min_{x∈V} f(x) be the global minimum of f. Note that by definition of V^∗, we have f(x) = f^∗ for any x ∈ V^∗. Then observe that
lim_{λ→0} π_λ(x) = lim_{λ→0} λ^{f(x)} / Σ_{y∈V} λ^{f(y)} = lim_{λ→0} (λ^{f(x)}/λ^{f^∗}) / (Σ_{y∈V} λ^{f(y)}/λ^{f^∗}) (466)
= lim_{λ→0} λ^{f(x)−f^∗} / (|V^∗| + Σ_{y∈V∖V^∗} λ^{f(y)−f^∗}) = 1(x ∈ V^∗) / |V^∗|. (467)
Thus for λ very small, π_λ is approximately the uniform distribution on the set V^∗ of all nodes where f attains its global minimum. ▲
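The chain of Example 3.10.4 is simple to implement; here is an illustrative sketch on the cycle Z_30 (a 2-regular graph) with a hypothetical cost f having a unique global minimizer at node 7. For small λ, the chain spends most of its time there.

```python
import numpy as np

rng = np.random.default_rng(5)
N, lam = 30, 0.05
f = lambda v: (v - 7) % N                  # hypothetical cost; minimum at v = 7
x, counts = 0, np.zeros(N)
for _ in range(200_000):
    y = (x + rng.choice([-1, 1])) % N      # uniform random neighbor on the cycle
    if f(y) <= f(x) or rng.random() < lam ** (f(y) - f(x)):
        x = y                              # downhill always; uphill w.p. lambda^(f(y)-f(x))
    counts[x] += 1
print(int(counts.argmax()))                # 7, the global minimizer
```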
Exercise 3.10.5 (Detailed Balance equation). Let P be a transition matrix of a Markov chain on state
space Ω. Suppose π is a probability distribution on Ω that satisfies the following detailed balance equa-
tion
π(x)P (x, y) = π(y)P (y, x) for all x, y ∈ Ω. (468)
(i) Show that for all x ∈ Ω,
Σ_{y∈Ω} π(y) P(y, x) = Σ_{y∈Ω} π(x) P(x, y) = π(x) Σ_{y∈Ω} P(x, y) = π(x). (469)

(ii) Conclude that π is a stationary distribution for P , that is, πP = π.
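A minimal numerical check (illustrative; the 3-state chain is hypothetical) of this exercise:

```python
import numpy as np

P = np.array([[0.50, 0.50, 0.00],
              [0.25, 0.50, 0.25],
              [0.00, 0.50, 0.50]])
pi = np.array([0.25, 0.50, 0.25])
balanced = all(np.isclose(pi[x] * P[x, y], pi[y] * P[y, x])
               for x in range(3) for y in range(3))   # detailed balance (468)
print(balanced, np.allclose(pi @ P, pi))              # True True
```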


Exercise 3.10.6 (Metropolis-Hastings algorithm). Let P be a transition matrix of a Markov chain on state
space Ω = {1, 2, · · · , m}. Let π be a probability distribution on Ω, which is not necessarily a stationary
distribution for P . Our goal is to design a Markov chain on Ω that has π as a stationary distribution.
Below we will derive the famous Metropolis-Hastings algorithm for MCMC sampling.
Fix an m × m matrix A with entries in [0, 1]. Consider a Markov chain (X_t)_{t≥0} on Ω that uses an additional
rejection step on top of the transition matrix P as follows:
(Generation) Suppose the current location X t = a. Generate a candidate b ∈ Ω from the conditional
distribution P (a, ·).
(Rejection) Flip an independent coin with success probability A(a, b). Upon success, accept the pro-
posed move and set X t +1 = b; Otherwise, reject the move and set X t +1 = a.
Here, the (a, b) entry A(a, b) is called the acceptance probability of the move a → b.
(i) Let Q denote the transition matrix of the chain (X_t)_{t≥0} defined above. Show that
Q(a, b) = \begin{cases} P(a, b) A(a, b) & \text{if } b ≠ a \\ 1 − Σ_{c≠a} P(a, c) A(a, c) & \text{if } b = a. \end{cases} (470)
1 − c:c6=a P (a, c)A(a, c) if b = a.

(ii) Show that π(x)Q(x, y) = π(y)Q(y, x) ∀x, y ∈ Ω, x ≠ y implies πQ = π. Deduce that if
π(x) P(x, y) A(x, y) = π(y) P(y, x) A(y, x) ∀x, y ∈ Ω, x ≠ y, (471)
then π is a stationary distribution for (X t )t ≥0 .
(iii) Since we also want the Markov chain to converge fast, we want to choose the acceptance probability A(a, b) ∈ [0, 1] as large as possible for each a, b ∈ Ω. Show that the following choice (denoting α ∧ β := min(α, β))
A(x, y) = (π(y) P(y, x)) / (π(x) P(x, y)) ∧ 1 ∀x, y ∈ Ω, x ≠ y (472)
satisfies the condition in (ii) and that each A(x, y) is maximized for all x ≠ y.
(iv) Let (Y_t)_{t≥0} be a random walk on the 5-wheel graph G = (V, E) as shown in Figure 12. Show that
π = [5/20, 3/20, 3/20, 3/20, 3/20, 3/20]
is the unique stationary distribution of Y_t. Apply the Metropolis-Hastings algorithm derived in (i)-(iii) above to modify Y_t to obtain a new Markov chain X_t on V that converges to Uniform(V) in distribution.

FIGURE 12. 5-wheel graph G
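A sketch of part (iv) in code (illustrative; node 0 is taken to be the hub and 1, 2, . . . , 5 the outer cycle): for the uniform target, the acceptance probability (472) reduces to min(1, deg(x)/deg(y)).

```python
import numpy as np

rng = np.random.default_rng(6)
neighbors = {0: [1, 2, 3, 4, 5]}                     # hub is adjacent to all rim nodes
for i in range(1, 6):
    neighbors[i] = [0, 1 + i % 5, 1 + (i + 3) % 5]   # hub plus two cycle neighbors
x, counts = 0, np.zeros(6)
for _ in range(300_000):
    nb = neighbors[x]
    y = nb[rng.integers(len(nb))]                    # proposal: P(x, y) = 1/deg(x)
    if rng.random() < min(1.0, len(nb) / len(neighbors[y])):
        x = y                                        # accept w.p. min(1, deg(x)/deg(y))
    counts[x] += 1
print(counts / counts.sum())                         # ≈ [1/6, ..., 1/6]
```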
CHAPTER 4

Martingales

In order to properly develop our discussion on martingales in the following sections, we need to
generalize the notion of conditional expectation E[X | Y ] of a RV X given another RV Y . Recall that this
was the a collection of ‘best guesses’ of X given Y = y for all y. But what if we only know, say, Y ≥ 1? Can
we condition on this event as well?
More concretely, suppose Y takes values from {1, 2, 3}. Regarding Y , the following outcomes are pos-
sible:

E Y := {{Y = 1}, {Y = 2}, {Y = 3}, {Y = 1, 2}, {Y = 2, 3}, {Y = 1, 3}, {Y = 1, 2, 3}}. (473)

For instance, the information {Y = 1, 2} could yield some nontrivial implication on the value of X, so our best guess in this scenario should be
E[X | {Y = 1, 2}] = Σ_x x P(X = x | {Y = 1, 2}). (474)

More generally, for each A ∈ E_Y, the best guess of X given A is the following conditional expectation
E[X | A] = Σ_x x P(X = x | A). (475)

Now, what if we don’t know which event in the collection E Y to occur? As we did before to define
E[X | Y ] from E[X | Y = y] by simply not specifying what value y that Y takes, we simply do not specify
which event A ∈ E A to occur. Namely,

E[X | E Y ] = best guess on X given the information in E Y . (476)

In general, this could be defined for any collection of events E in place of E_Y. Mathematically, we understand E[X | E] as1

E[X | E ] = the collection of E[X | A] for all A ∈ E . (477)

4.1. Definition and examples of martingales


Let (X t )t ≥0 be the sequence of observations of the price of a particular stock over time. Suppose that
an investor has a strategy to adjust his portfolio (M t )t ≥0 according to the observation (X t )t ≥0 . Namely,

M t = Net value of portfolio after observing (X k )0≤k≤t . (478)

We are interested in the long-term behavior of the ‘portfolio process’ (M t )t ≥0 . Martingales provide a very
nice framework for this purpose.
Martingales are a class of stochastic processes whose expected increment conditioned on the past is always zero. Recall that the simple symmetric random walk has this property, since its increments are i.i.d. and have mean zero. Martingales do not assume any kind of independence between increments, but it turns out that we can proceed quite far with just the unbiased conditional increment property.

1For more details, see [Dur10].


In order to define martingales properly, we need to introduce the notion of 'information up to time t'. Imagine we are observing the stock market over time. We define
E_t := collection of all possible events we can observe at time t (479)
F_t := ∪_{k=1}^t E_k = collection of all possible events we can observe up to time t. (480)

In words, E t is the information available at time t and Ft contains all possible information that we can
obtain by observing the market up to time t . We call Ft the information up to time t . As a collection of
events, Ft needs to satisfy the following properties2:
(i) (closed under complementation) A ∈ F_t =⇒ A^c ∈ F_t;
(ii) (closed under countable union) A_1, A_2, A_3, · · · ∈ F_t =⇒ ∪_{k=1}^∞ A_k ∈ F_t.
Note that as we gain more and more information, we have
Fs ⊆ Ft ∀t ≥ s ≥ 0. (481)
In other words, (F_t)_{t≥0} is an increasing family of information, which we call a filtration. The role of a filtration is to specify what kind of information is observable or not, as time passes by.

Example 4.1.1. Suppose (Ft )t ≥0 is a filtration generated by observing the stock price (X t )t ≥0 of company
A in New York. Namely, E t consists of the information on the values of the stock price X t at day t . Given
F10 , we know the actual values of X 0 , X 1 , · · · , X 10 . For instance, X 8 is not random given F10 , but X 11 could
still be random. On the other hand, if (Y_t)_{t≥0} is the stock price of company B in Hong Kong, then we may have only partial information about Y_0, · · · , Y_10 given F_10. ▲
Now we define martingales.

Definition 4.1.2. Let (F_n)_{n≥0} be a filtration and (M_n)_{n≥0} a discrete-time stochastic process. We call (M_n)_{n≥0} a martingale with respect to (F_n)_{n≥0} if the following conditions are satisfied: For all n ≥ 0,
(i) (finite expectation) E[|M_n|] < ∞.
(ii) (measurability3) {M_n = m} ∈ F_n for all m ∈ R.
(iii) (conditional increments) E[M_{n+1} − M_n | A] = 0 for all A ∈ F_n.
If (iii) holds with "=" replaced by "≤" (resp., "≥"), then (M_n)_{n≥0} is called a supermartingale (resp., submartingale) with respect to (F_n)_{n≥0}.

We will abbreviate the condition (iii) as


E[M n+1 − M n | Fn ] = 0. (482)
In many cases, we work with filtrations generated by observing a single reference process. Let (X n )n≥1
denote a discrete-time stochastic process on Ω and for each n, let Fn = σ(X 1 , . . . , X n ) denote the collec-
tion of all events that can be determined by the values of X 1 , . . . , X n . Then (Fn )n≥1 is called the filtration
generated by the process (X n )n≥1 . In this case, the conditions (ii)-(iii) in Definition 4.1.2 reduce to the
following simpler conditions:
(ii)’ (measurability) M n is a function of X 0 , X 1 , . . . , X n .
(iii)’ (conditional increments) E[M n+1 − M n | X 0 , X 1 , . . . , X n ] = 0.
That is, in Definition 4.1.2 (iii), we only need to consider events A ∈ Fn = σ(X 0 , . . . , X n ) of the form
A = {X 0 = x 0 , X 1 = x 1 , . . . , X n = x n }, (483)
which specifies the entire trejectory of the reference process (X n )n≥1 . See Execise 4.1.3 for more details.

2We are requiring F_t to be a σ-algebra, but we avoid using this terminology.
3In this case, we say "M_n is measurable w.r.t. F_n", but we avoid using this terminology.

Exercise 4.1.3 (Martingale w.r.t. the filtration generated by a stochastic process). Let (X n )n≥1 be a discrete-
time stochastic process on Ω. Let (M n )n≥1 denote a discrete-time stochastic process satisfying the fol-
lowing property:
(a) E[|M n |] < ∞ for n ≥ 1;
(b) M n is a function of X 1 , . . . , X n for n ≥ 1;
(c) E[M n+1 − M n | (X 1 , . . . , X n )] = 0 for n ≥ 1.
Then show that (M_n)_{n≥1} is a martingale (in the sense of Definition 4.1.2) with respect to the filtration (F_n)_{n≥1}, where F_n = σ(X_1, . . . , X_n).
Hint: The key step is to show that, for A ∈ Fn , E[M n+1 − M n | A] = E[E[M n+1 − M n | A, X 1 , . . . , X n ]] = 0.
In order to apply property (c) here, justify the following identities
E[M n+1 − M n | A, X 1 , . . . , X n ] = E[(M n+1 − M n )1((X 1 , . . . , X n ) ∈ A) | X 1 , . . . , X n ] (484)
= 1((X 1 , . . . , X n ) ∈ A) E[(M n+1 − M n ) | X 1 , . . . , X n ]. (485)
If a martingale models a fair gambling strategy, then one can think of supermartingales and submartingales as unfavorable and favorable gambling strategies, respectively. For instance, the expected winnings when gambling on an unfavorable game should be non-increasing in time. This is an immediate consequence of the definition and iterated expectation.
Proposition 4.1.4. Let (M t )t ≥0 be a stochastic process and let (Ft )t ≥0 be a filtration.
(i) If (M t )t ≥0 is a supermartingale w.r.t. filtration (Ft )t ≥0 , then E[M n ] ≤ E[M m ] for all n ≥ m ≥ 0.
(ii) If (M t )t ≥0 is a submartingale w.r.t. filtration (Ft )t ≥0 , then E[M n ] ≥ E[M m ] for all n ≥ m ≥ 0.
(iii) If (M t )t ≥0 is a martingale w.r.t. filtration (Ft )t ≥0 , then E[M n ] = E[M m ] for all n ≥ m ≥ 0.
PROOF. (ii) and (iii) follow directly from (i), so we only show (i). Let (M_t)_{t≥0} be a supermartingale w.r.t. (F_t)_{t≥0}. Recall that for each m ∈ R, {M_t = m} ∈ F_t. Hence
E[M_{t+1} − M_t | M_t = m] ≤ 0 (486)
since M_t is a supermartingale. Hence by conditioning on the values of M_t,
E[M_{t+1} − M_t] = E[E[M_{t+1} − M_t | M_t]] ≤ 0. (487)
Then the assertion follows by induction. □
In order to get familiar with martingales, it is helpful to envision them as a kind of simple symmetric
random walk. In general, one can subtract off the mean of a given random walk to make it a martingale.
Example 4.1.5 (Random walks). Let (X t )t ≥1 be a sequence of i.i.d. increments with E[X i ] = µ < ∞. Let
S t = S 0 + X 1 + · · · + X t . Then (S t )t ≥0 is called a random walk. Define a stochastic process (M t )t ≥0 by
M t = S t − µt . (488)
For each t ≥ 0, let Ft be the information obtained by observing S 0 , S 1 , · · · , S t . Then (M t )t ≥0 is a martingale
with respect to the filtration (Ft )t ≥0 . Indeed, we have
E[|M_t|] = E[|S_t − µt|] ≤ E[|S_t| + |µ|t] = E[|S_t|] + |µ|t < ∞, (489)
and for any m ∈ R,
{M_t = m} = {S_t − µt = m} = {S_t = m + µt} ∈ F_t. (490)
Furthermore, since X_{t+1} is independent from S_0, · · · , S_t, it is also independent from any A ∈ F_t. Hence
E[M_{t+1} − M_t | X_0, . . . , X_t] = E[X_{t+1} − µ | X_0, . . . , X_t] (491)
= E[X_{t+1} − µ] = E[X_{t+1}] − µ = 0. (492)


Example 4.1.6 (Products of indep. RVs). Let (X t )t ≥0 be a sequence of independent RVs such that X t ≥ 0
and E[X t ] = 1 for all t ≥ 0. For each t ≥ 0, let Ft be the information obtained by observing M 0 , X 0 , · · · , X t .
Define
M t = M0 X 1 X 2 · · · X t , (493)
where M 0 is a constant. Then (M t )t ≥0 is a martingale with respect to (Ft )t ≥0 . Indeed, the assumption
implies E[|M t |] < ∞ and that {M t = m} ∈ Ft for all m ∈ R since M t is determined by M 0 , X 1 · · · , X t . Fur-
thermore, since X t +1 is independent from X 1 , · · · , X t , for each A ∈ Ft ,
E[M t +1 − M t | X 0 , . . . , X t ] = E[M t X t +1 − M t | X 0 , . . . , X t ] (494)
= E[(X t +1 − 1)(M 0 X 1 · · · X t ) | X 0 , . . . , X t ] (495)
= E[X t +1 − 1 | X 0 , . . . , X t ] E[(M 0 X 1 · · · X t ) | X 0 , . . . , X t ] (496)
= E[X t +1 − 1] E[(M 0 X 1 · · · X t ) | X 0 , . . . , X t ] = 0. (497)
This multiplicative model is reasonable for the stock market since changes in stock prices are believed to be proportional to the current stock price. Moreover, it also guarantees that the price stays positive, in contrast to additive models. ▲
Exercise 4.1.7 (Exponential martingale). Let (X t )t ≥0 be a sequence of i.i.d. RVs such that their moment
generating function exists, namely, there exists θ > 0 for which
φ(θ) := E[exp(θX k )] < ∞ ∀k ≥ 0. (498)
Let S t = S 0 + X 1 + · · · + X t . Define
M_t = exp(θ S_t) / φ(θ)^t. (499)
Show that (M t )t ≥0 is a martingale with respect to filtration (Ft )t ≥0 , where Ft is the information obtained
by observing S 0 , · · · , S t .
The following lemma allows us to provide many examples of martingales from Markov chains.
Lemma 4.1.8. Let (X t )t ≥0 be a Markov chain on state space Ω with transition matrix P . For each t ≥ 0, let
f t be a bounded function Ω → R such that
f_t(x) = Σ_{y∈Ω} P(x, y) f_{t+1}(y) ∀x ∈ Ω. (500)

Then M t = f t (X t ) defines a martingale with respect to filtration (Ft )t ≥0 , where Ft = σ(X 0 , . . . , X t ).


P ROOF. Since f t is bounded, E[| f t (M t )|] < ∞. By definition, M t is given as a function f t (X t ) of the
reference process (X n )n≥0 . For the conditional increment property, note that for any x 0 , x 1 , . . . , x t −1 , x ∈ Ω,
E[M t +1 − M t | X 0 = x 0 , . . . , X t = x t −1 , X t = x] = E[M t +1 − M t | X t = x] (501)
"Ã ! #
X
=E P (x, y) f t +1 (y) − f t (x) = 0. (502)
y∈Ω

This shows the assertion. □


Example 4.1.9 (Simple random walk). Let (X t )t ≥1 be a sequence of i.i.d. RVs with
P(X k = 1) = p, P(X k = −1) = 1 − p. (503)
Let S t = S 0 + X 1 + · · · + X t . Note that (S t )t ≥0 is a Markov chain on Z. Define
M_t = ((1 − p)/p)^{S_t}. (504)
Then (M t )t ≥0 is a martingale with respect to filtration (Ft )t ≥0 , where Ft is the information obtained by
observing S 0 , · · · , S t .

In order to see this, define a function h(x) = ((1 − p)/p)^x. According to Lemma 4.1.8 (with f_t = h for all t), it suffices to show that h is a harmonic function with respect to the Gambler's chain. (One can also check that M_t is a martingale by Example 4.1.6.) Namely,
Σ_{y∈Z} P(x, y) h(y) = p h(x + 1) + (1 − p) h(x − 1) (505)
= p ((1 − p)/p)^{x+1} + (1 − p) ((1 − p)/p)^{x−1} (506)
= (1 − p) ((1 − p)/p)^x + p ((1 − p)/p)^x = ((1 − p)/p)^x = h(x). (507)
Hence by Lemma 4.1.8, (M_t)_{t≥0} is a martingale with respect to the filtration (F_t)_{t≥0}. ▲
Example 4.1.10 (Simple symmetric random walk). Let (X t )t ≥1 be a sequence of i.i.d. RVs with
P(X k = 1) = P(X k = −1) = 1/2. (508)
Let S t = S 0 + X 1 + · · · + X t . Note that (S t )t ≥0 is a Markov chain on Z. Define
M t = S 2t − t . (509)
Then (M t )t ≥0 is a martingale with respect to (S t )t ≥0 .
One can check the martingale condition directly as follows:
E[(S_{t+1}^2 − (t + 1)) − (S_t^2 − t) | X_0, . . . , X_t] = E[(S_t + X_{t+1})^2 − S_t^2 − 1 | X_0, . . . , X_t] (510)
= E[2 S_t X_{t+1} + X_{t+1}^2 − 1 | X_0, . . . , X_t] (here X_{t+1}^2 = 1) (511)
= 2 S_t E[X_{t+1} | X_0, . . . , X_t] = 0, (512)
where the last equality follows since E[X t +1 ] = 0 and X t +1 is independent from X 0 , . . . , X t .
One can also check the martingale condition for S_t^2 − t using Lemma 4.1.8. Namely, for each t ≥ 0, define a function f_t : Z → R by f_t(x) = x^2 − t. By Lemma 4.1.8, it suffices to check that f_t(x) is the average of f_{t+1}(y) with respect to the transition matrix of S_t. Namely,
Σ_{y∈Z} P(x, y) f_{t+1}(y) = (1/2) f_{t+1}(x + 1) + (1/2) f_{t+1}(x − 1) (513)
= ((x + 1)^2 − (t + 1))/2 + ((x − 1)^2 − (t + 1))/2 = x^2 − t = f_t(x). (514)
Hence by Lemma 4.1.8, (M t )t ≥0 is a martingale with respect to filtration (Ft )t ≥0 , where Ft is the infor-
mation obtained by observing S 0 , · · · , S t . ▲
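A quick simulation (illustrative) confirms the constant-mean property that Proposition 4.1.4(iii) predicts for this martingale: with S_0 = 0, E[S_t^2 − t] = 0 for every t.

```python
import numpy as np

rng = np.random.default_rng(7)
steps = rng.choice([-1, 1], size=(100_000, 50))   # 100k SSRW paths of length 50
S = steps.cumsum(axis=1)
M = S ** 2 - np.arange(1, 51)                     # M_t = S_t^2 - t
print(M.mean(axis=0)[[0, 9, 49]])                 # all ≈ 0
```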

4.2. Gambling strategies and stopping times


In this section, we derive some properties of martingales in relation to stopping times viewed as gambling strategies. The following observation is natural if we think of a supermartingale as a random walk with negative drift, or as betting on an unfavorable game.
Next, we prove one of the most famous results in martingale theory, which says
To formulate this, let (M t )t ≥0 be a supermartingale w.r.t. (X t )t ≥0 . Think of (X t )t ≥0 as the progression of
stock market, and (M t )t ≥0 is the price of the stock of company A. Say we have an investment strategy so
that we can determine
H t +1 = Amount of stock of company A that we hold during [t , t + 1), (516)

based on the observations X_0, · · · , X_t . Then the stock price will change from M_t to M_{t+1}, so
    H_{t+1}(M_{t+1} − M_t) = net gain incurred between time t and t + 1.   (517)
Hence if we let W_0 denote the initial fortune, then
    W_t := W_0 + Σ_{k=1}^t H_k (M_k − M_{k−1}) = total fortune at time t .   (518)

The following theorem tells us that, if (M t )t ≥0 is declining stock on average (supermartingale), then no
matter what strategy (H t )t ≥0 we use, we will always lose our fortune.

Theorem 4.2.1. Let (M t )t ≥0 be a supermartingale w.r.t. a filtration (Ft )t ≥0 . Suppose (H t )t ≥0 is such that
(i) (Predictability) {H t +1 = h} ∈ Ft for all h ∈ R.
(ii) (Boundedness) 0 ≤ H t ≤ c t for some constant c t ≥ 0 for all t ≥ 0.
Let (Wt )t ≥0 be as defined at (518). Then (Wt )t ≥0 is a supermartingale w.r.t. the filtration (Ft )t ≥0 .

P ROOF. By (ii) and the triangle inequality,
    |W_t| ≤ |W_0| + Σ_{k=1}^t c_k (|M_k| + |M_{k−1}|).   (519)
Taking expectations shows E[|W_t|] < ∞. Moreover, for each w ∈ R,
    {W_t = w} = { W_0 + Σ_{k=1}^t H_k (M_k − M_{k−1}) = w } ∈ F_t   (520)
since the event on the right hand side can be written as the union of suitable events involving the values
of H_0, . . . , H_t and M_0, . . . , M_t , which are all events belonging to F_t .
Lastly, fix A ∈ F_t . For each h ∈ [0, c_{t+1}], since {H_{t+1} = h} ∈ F_t , we also have A ∩ {H_{t+1} = h} ∈ F_t . Since
(M_t)_{t≥0} is a supermartingale, we have
    E[W_{t+1} − W_t | A ∩ {H_{t+1} = h}] = E[H_{t+1}(M_{t+1} − M_t) | A ∩ {H_{t+1} = h}]   (521)
        = h E[M_{t+1} − M_t | A ∩ {H_{t+1} = h}] ≤ 0.   (522)
Hence, by conditioning on the values of H_{t+1}, we get
    E[W_{t+1} − W_t | A] = E[ E[W_{t+1} − W_t | A, H_{t+1}] | A ] ≤ 0.   (523)

This shows the assertion. □


Next, we define stopping times with respect to a given filtration.

Definition 4.2.2. A random variable T taking values in {0, 1, 2, . . . } ∪ {∞} is a stopping time w.r.t. a
filtration (F_t)_{t≥0} if
    {T = s}, {T ≠ s} ∈ F_s  ∀s ≥ 0.   (526)

Note that the definition of stopping times w.r.t. a process (X_n)_{n≥1} (see Definition 3.4.1) is a special
case of the above definition when F_n = σ(X_1, . . . , X_n) (see Exercise 4.1.3).

Example 4.2.3. Let (N_1(t))_{t≥0} and (N_2(t))_{t≥0} be the counting processes of two renewal processes that are
not necessarily independent. For each t ≥ 0, let F_t be the information obtained by observing (N_i(s))_{0≤s≤t}
for i = 1, 2. Let T_k^{(i)} denote the kth arrival time of the i th process.

Clearly, for each i ∈ {1, 2} and k ≥ 0, T_k^{(i)} is a stopping time w.r.t. the filtration (F_t)_{t≥0}. Namely, for
each t ≥ 0,
    {T_k^{(i)} = t} = {N^{(i)}(t) = k, N^{(i)}(t−) = k − 1} ∈ F_t .   (527)
Also, let
    T̃_k := kth arrival time for the merged process N_1 + N_2 .   (528)
This is also a stopping time w.r.t. the filtration (F_t)_{t≥0}. Namely, for each t ≥ 0,
    {T̃_k = t} = {N^{(1)}(t) + N^{(2)}(t) = k, N^{(1)}(t−) + N^{(2)}(t−) = k − 1} ∈ F_t .   (529)

On the other hand, let (N (3) (t ))t ≥0 be the counting process of another renewal process, which is
independent from the other processes. Let Tk(3) denote the kth arrival time of this process. Is this a
stopping time w.r.t. the filtration (Ft )t ≥0 ? No, since for each t ≥ 0, the event

{Tk(3) = t } = {N (3) (t ) = k, N (3) (t − ) = k − 1} (530)

cannot be determined from observing the other two processes N (1) and N (2) up to time t . ▲
Example 4.2.4 (Constant betting up to a stopping time). Let (M_t)_{t≥0} be a supermartingale w.r.t. (X_t)_{t≥0}.
Let T be a stopping time for (X_t)_{t≥0}. Consider the gambling strategy of betting $1 up to time T . That is,
    H_k = 1(k ≤ T) = 1 if k ≤ T, and = 0 if k > T.   (531)
Since T is a stopping time, the above H_k defines a predictable sequence as in Theorem 4.2.1. Then the
wealth at time t is
    W_t = W_0 + Σ_{k=1}^t H_k (M_k − M_{k−1}) = W_0 + M_{T∧t} − M_0 ,   (532)
where T ∧ t = min(T, t). The above relation can be easily verified by considering whether t > T or t ≤
T . According to Theorem 4.2.1, we know that W_t is a supermartingale. It follows that M_{T∧t} is also a
supermartingale. ▲
Theorem 4.2.5 (Stopped martingale is a martingale). Let (M t )t ≥0 be a supermartingale w.r.t. a filtration
(Ft )t ≥0 . Let T be a stopping time for (Ft )t ≥0 . Then M T ∧t is a supermartingale w.r.t. (Ft )t ≥0 . Furthermore,

E[M T ∧t ] ≤ E[M 0 ] ∀t ≥ 0. (533)

In particular, if (M t )t ≥0 is a martingale, then

E[M T ∧t ] = E[M 0 ] ∀t ≥ 0. (534)

P ROOF. Let W_t be as defined in Example 4.2.4 with W_0 = M_0, so that W_t = M_{T∧t}. Since W_t is a
supermartingale by Theorem 4.2.1, it follows that M_{T∧t} is also a supermartingale. Then by Proposition 4.1.4, we have
    E[M_{T∧t}] = E[W_t] ≤ E[W_0] = E[M_0].   (535)
The same argument holds for the submartingale case with the reversed inequalities. Then the martingale
case follows. □
Exercise 4.2.6. Let (M_t)_{t≥0} be a supermartingale w.r.t. a filtration (F_t)_{t≥0}. We will directly show that
(M_{T∧t})_{t≥0} is a supermartingale w.r.t. the filtration (F_t)_{t≥0}.
(i) For each fixed A ∈ F_t , show that
    A ∩ {T ≤ t} = ∪_{k=0}^t ( A ∩ {T = k} ) ∈ F_t ,   (536)
    A ∩ {T ≥ t + 1} = ∩_{k=0}^t ( A ∩ {T ≠ k} ) ∈ F_t .   (537)
(ii) Use conditioning on whether T ≤ t or not to show
    E[M_{T∧(t+1)} − M_{T∧t} | A] = E[M_T − M_T | A ∩ {T ≤ t}] P(T ≤ t | A)   (538)
        + E[M_{t+1} − M_t | A ∩ {T ≥ t + 1}] P(T ≥ t + 1 | A)   (539)
        ≤ 0.   (540)
Conclude that (M_{T∧t})_{t≥0} is a supermartingale w.r.t. the filtration (F_t)_{t≥0}.

4.3. Optional stopping time theorem and its applications


If a martingale is bounded up to a stopping time, then the expected value of the martingale at the stop-
ping time equals the initial expectation. This observation will be useful in many applications.

Theorem 4.3.1 (Optional stopping). Suppose (M_t)_{t≥0} is a martingale and T is a stopping time with respect
to the same filtration (F_t)_{t≥0} such that P(T < ∞) = 1. If any of the following conditions hold, then E[M_T] =
E[M_0].
(i) (Bounded stopping time) There is a constant c > 0 such that P(T ≤ c) = 1;
(ii) (Bounded stopped martingale) There is a constant c > 0 such that P(|M_{T∧n}| ≤ c) = 1 for all n ≥ 1;
(iii) (Bounded conditional increments) E[T] < ∞ and there is a constant c > 0 such that
E[|M_{T∧(n+1)} − M_{T∧n}| | F_n] ≤ c for all n ≥ 1.

Example 4.3.2 (Gambler’s ruin). Let (X_t)_{t≥1} be i.i.d. RVs with
    P(X = 1) = p,  P(X = −1) = 1 − p.   (541)
Let S_t = S_0 + X_1 + · · · + X_t with S_0 = x. Let F_t be the information obtained by observing S_0, · · · , S_t . Let
h(x) = ((1 − p)/p)^x as in Example 4.1.9. We have seen that M_t := h(S_t) is a martingale w.r.t. the filtration
(F_t)_{t≥0}.
Fix integers a < S_0 < b, and define
    T = min{t ≥ 0 | S_t = a or S_t = b}.   (542)
So T is the first time that the random walk hits a or b. Note that T is a stopping time, and since the walk
moves by ±1 steps starting inside the finite interval (a, b), it must exit in finite time, so P(T < ∞) = 1.
Moreover, S_T, S_{T∧t} ∈ [a, b], so the bounded stopped martingale condition in the optional stopping theorem
(Theorem 4.3.1) is satisfied. Hence by that theorem, E[h(S_T)] = E[h(S_0)] = h(x), so we have
    h(x) = ((1 − p)/p)^a P(S_T = a) + ((1 − p)/p)^b (1 − P(S_T = a)).   (543)
Assuming p ≠ 1/2 and solving for P(S_T = a), this gives
    P(S_T = a) = [ ((1 − p)/p)^b − ((1 − p)/p)^x ] / [ ((1 − p)/p)^b − ((1 − p)/p)^a ].   (544)
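The formula (544) can be checked by direct simulation. Below is a minimal Python sketch (ours; the parameter values x = 0, a = −5, b = 5, p = 0.45 are arbitrary choices for illustration):

import random

def ruin_prob_exact(x, a, b, p):
    """P(S_T = a) from (544): probability of exiting at a, started at x, p != 1/2."""
    r = (1 - p) / p
    return (r**b - r**x) / (r**b - r**a)

def ruin_prob_mc(x, a, b, p, n_trials=50_000):
    """Monte Carlo estimate of P(S_T = a) for the asymmetric walk."""
    hits = 0
    for _ in range(n_trials):
        s = x
        while a < s < b:
            s += 1 if random.random() < p else -1
        hits += (s == a)
    return hits / n_trials

x, a, b, p = 0, -5, 5, 0.45
print(ruin_prob_exact(x, a, b, p))  # exact value from (544)
print(ruin_prob_mc(x, a, b, p))     # simulation estimate; should be close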


Example 4.3.3 (Duration of fair games). Let (X_t)_{t≥1} be i.i.d. RVs with
    P(X = 1) = P(X = −1) = 1/2.   (545)
Let S_t = S_0 + X_1 + · · · + X_t with S_0 = 0. Let F_t be the information obtained by observing S_0, · · · , S_t . Note
that (S_t)_{t≥0} itself is a martingale w.r.t. the filtration (F_t)_{t≥0}. Also, we have seen that M_t := S_t^2 − t is a
martingale w.r.t. the filtration (F_t)_{t≥0} in Example 4.1.10.
Fix integers a < 0 < b, and define
    T = min{t ≥ 0 | S_t = a or S_t = b}.   (546)
Again T is a stopping time w.r.t. the filtration (F_t)_{t≥0}. As before, we also have P(T < ∞) = 1 and S_T, S_{T∧t} ∈
[a, b]. Hence by Theorem 4.3.1, we get
    0 = E[S_0] = E[S_T] = a P(S_T = a) + b P(S_T = b).   (547)
Since P(S_T = b) = 1 − P(S_T = a), the above equation yields
    P(S_T = a) = b/(b − a),  P(S_T = b) = −a/(b − a).   (548)
Next, we will again apply the optional stopping theorem (Theorem 4.3.1) for the martingale M_t = S_t^2 − t with
stopping time T . In order to do so, we check the bounded conditional increment condition in Theorem 4.3.1.
(Note that E[T] < ∞: within every block of b − a consecutive steps, the walk exits (a, b) with probability at least
2^{−(b−a)}, so T has a geometric tail.) We have
    |M_{T∧(t+1)} − M_{T∧t}| = |S_{T∧(t+1)}^2 − S_{T∧t}^2 − (T ∧ (t + 1) − T ∧ t)|   (549)
        ≤ S_{T∧(t+1)}^2 + S_{T∧t}^2 + 1   (550)
        ≤ 2 max(a^2, b^2) + 1,   (551)
where the last inequality above uses the fact that S_{T∧n} ∈ [a, b] for all n ≥ 1. Hence by optional stopping,
    0 = E[M_0] = E[M_T] = E[S_T^2 − T] = E[S_T^2] − E[T]   (552)
        = a^2 P(S_T = a) + b^2 P(S_T = b) − E[T]   (553)
        = a^2 b/(b − a) − a b^2/(b − a) − E[T]  (by (548))   (554)
        = −ab − E[T].   (555)
Therefore we deduce
    E[T] = −ab > 0.   (556)
Observe that for each fixed a < 0, E[T] = |ab| → ∞ as b → ∞. In other words, if one starts gambling
on fair coin flips, each time winning or losing $1, then the expected time to reach a fixed loss level a with no
upper stopping level is infinite. In terms of random walk, this shows that the first return time of a simple symmetric
random walk has infinite expected value (recall null recurrence of SSRW on Z in Example 3.6.3). In terms
of the M/M/1 queue, this shows that if the arrival and service rates are the same, then there is no stationary
distribution for the queue size. ▲
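The identity E[T] = −ab is also easy to check by simulation. A minimal Python sketch (ours; the values a = −4, b = 6 are arbitrary):

import random

def mean_exit_time(a, b, n_trials=20_000):
    """Monte Carlo estimate of E[T], the exit time of (a, b) for SSRW started at 0."""
    total = 0
    for _ in range(n_trials):
        s, steps = 0, 0
        while a < s < b:
            s += random.choice((-1, 1))
            steps += 1
        total += steps
    return total / n_trials

a, b = -4, 6
print(mean_exit_time(a, b), -a * b)  # both should be close to 24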
Next, we recall the following computation of the normal MGF.

Exercise 4.3.4 (MGF of normal RVs). Let X ∼ N(µ, σ^2) and Z ∼ N(0, 1). Using the fact that E[e^{tZ}] = e^{t^2/2},
show that
    E[e^{tX}] = exp(σ^2 t^2/2 + tµ).   (557)
The following example illustrates how martingales can be used in a risk management situation.
The following example illustrates how martingales can be used in a risk management situation.
Example 4.3.5 (Cramér’s estimate of ruin). Let S_n denote the total assets of an insurance company at
the end of year n. During year n, premiums totaling c > 0 dollars are received, while claims totaling Y_n
dollars are paid. Hence
    S_n = S_{n−1} + c − Y_n .   (558)
Let X_n = c − Y_n be the net profit during year n. We may assume that the X_n’s are i.i.d. with distribution
N(µ, σ^2), where µ > 0. We are interested in estimating the probability of bankruptcy. Namely, let
B = {S n < 0 for some n ≥ 1} = {Bankruptcy}. (559)
We will show that
    P(B) ≤ exp(−2µS_0/σ^2).   (560)
Hence, the company will avoid bankruptcy by maximizing the mean profit per year µ and the initial asset
S 0 , while minimizing the uncertainty σ of profit per year.
To begin, notice that S n = S 0 + X 1 + · · · + X n is a random walk. Let Ft be the information obtained by
observing S 0 , S 1 , · · · , S t . According to Exercise 4.3.4, we have
    φ(θ) := E[e^{θX_k}] = exp(σ^2θ^2/2 + θµ) < ∞.   (561)
Hence we can use the following exponential martingale (see Exercise 4.1.7)
    M_n(θ) := exp(θS_n)/φ(θ)^n   (562)
with respect to the filtration (Ft )t ≥0 . Moreover, note that
φ(−2µ/σ2 ) = exp(2µ2 /σ2 − 2µ2 /σ2 ) = exp(0) = 1. (563)
Hence
    M_n := M_n(−2µ/σ^2) = exp(−2µS_n/σ^2)   (564)
is a martingale w.r.t. the filtration (Ft )t ≥0 .
Let T = min{k ≥ 1 | S k < 0} be the first time of hitting negative asset, which is a stopping time w.r.t.
the filtration (Ft )t ≥0 . Also note that
    M_T = exp(−2µS_T/σ^2) > 1  a.s.,   (565)
since S_T < 0. Since S_t is a random walk with positive drift (µ > 0), P(T = ∞) > 0. Hence we instead stop the
martingale at the truncated stopping time T ∧ t and use Theorem 4.2.5, which gives E[M_{T∧t}] = E[M_0]. This yields
    exp(−2µS_0/σ^2) = E[M_0] = E[M_{T∧t}] = E[M_T | T ≤ t] P(T ≤ t) + E[M_t | T > t] P(T > t)   (566)
        ≥ E[M_T | T ≤ t] P(T ≤ t) ≥ P(T ≤ t).   (567)
The first inequality holds since M_t ≥ 0 for all t ≥ 0, and the second since M_T > 1 on the event {T ≤ t} by (565). To
finish, note that P(T ≤ t) is the probability of bankruptcy by time t . Hence
    P(B) = P(T < ∞) = lim_{n→∞} P(T ≤ n) ≤ exp(−2µS_0/σ^2).   (568)
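The bound (560) can be illustrated numerically. The following Python sketch (ours; the parameters and the finite simulation horizon are arbitrary choices) estimates the bankruptcy probability by Monte Carlo; truncating at a finite horizon can only underestimate P(B), so the estimate should stay below the Cramér bound.

import math, random

def ruin_prob_mc(s0, mu, sigma, horizon=500, n_trials=10_000):
    """Fraction of paths S_n = s0 + X_1 + ... + X_n, X_i ~ N(mu, sigma^2),
    that drop below 0 within the given horizon."""
    ruined = 0
    for _ in range(n_trials):
        s = s0
        for _ in range(horizon):
            s += random.gauss(mu, sigma)
            if s < 0:
                ruined += 1
                break
    return ruined / n_trials

s0, mu, sigma = 2.0, 0.5, 1.0
print(ruin_prob_mc(s0, mu, sigma))         # empirical ruin probability
print(math.exp(-2 * mu * s0 / sigma**2))   # Cramer bound e^{-2} ~ 0.135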

P ROOF OF T HEOREM 4.3.1. (Optional*) Since P(T < ∞) = 1, it holds that lim_{t→∞} T ∧ t = T almost
surely. Hence by Theorem 4.2.5, one would have
    E[M_0] = lim_{t→∞} E[M_{T∧t}] =? E[ lim_{t→∞} M_{T∧t} ] = E[M_T],   (569)
provided the middle equality (marked with ?) is verified.
To this end, fix t > 0, and by conditioning on whether T > t or not, we can write
    E[M_T] = E[M_T | T ≤ t] P(T ≤ t) + E[M_T | T > t] P(T > t)   (570)
        = E[M_{T∧t} | T ≤ t] P(T ≤ t) + E[M_T | T > t] P(T > t).   (571)

Similarly, we also write
    E[M_{T∧t}] = E[M_{T∧t} | T ≤ t] P(T ≤ t) + E[M_{T∧t} | T > t] P(T > t).   (572)
Subtracting these two equations, we get
    |E[M_T] − E[M_{T∧t}]| ≤ |E[M_T | T > t]| P(T > t) + |E[M_{T∧t} | T > t]| P(T > t).   (573)
Now suppose (i) holds. Then P(T > t) = 0 for all t > c. Hence (573) gives
    lim_{t→∞} |E[M_T] − E[M_{T∧t}]| = 0,   (574)
as desired. Similarly, suppose (ii) holds. Then |M_{T∧t}| ≤ c for all t ≥ 0 (and also |M_T| ≤ c, since P(T < ∞) = 1), so (573) gives
    |E[M_T] − E[M_{T∧t}]| ≤ 2c P(T > t)   (575)
and the right hand side tends to zero as t → ∞.
Lastly, suppose (iii) holds. Since P(T < ∞) = 1, we have T ∧ n → T a.s. as n → ∞, so M_{T∧n} → M_T a.s.
as n → ∞. In order to show lim_{n→∞} E[M_{T∧n}] = E[M_T], we use the dominated convergence
theorem (see [Dur10, Thm. 1.5.8]). For that, we only need to find a dominating and integrable RV M̄, that
is, |M_{T∧n}| ≤ M̄ for all n ≥ 1 (dominating) and E[M̄] < ∞ (integrable). For this, we use the triangle inequality
to construct a dominating RV as follows:
    M̄ := |M_0| + Σ_{n=1}^∞ |M_n − M_{n−1}| 1(T ≥ n).   (576)
Indeed, by the triangle inequality,
    |M_{T∧n}| = | M_0 + Σ_{k=1}^n (M_k − M_{k−1}) 1(k ≤ T) | ≤ M̄.   (577)
On the other hand, from (iii) and iterated expectation, noting that {T ≥ n} = {T > n − 1} ∈ F_{n−1} and that
M_n − M_{n−1} = M_{T∧n} − M_{T∧(n−1)} on this event,
    E[M̄] = E[|M_0|] + Σ_{n=1}^∞ E[ |M_n − M_{n−1}| 1(T ≥ n) ]   (578)
        = E[|M_0|] + Σ_{n=1}^∞ E[ E[ |M_n − M_{n−1}| 1(T > n − 1) | F_{n−1} ] ]   (579)
        = E[|M_0|] + Σ_{n=1}^∞ E[ 1(T > n − 1) E[ |M_{T∧n} − M_{T∧(n−1)}| | F_{n−1} ] ]  (inner conditional expectation ≤ c)   (580)
        ≤ E[|M_0|] + c Σ_{n=1}^∞ E[1(T > n − 1)]   (581)
        = E[|M_0|] + c Σ_{n=1}^∞ P(T ≥ n) = E[|M_0|] + c E[T] < ∞,   (582)
where the last sum is finite by the assumption E[T] < ∞ in (iii). This is enough to conclude by the dominated
convergence theorem. □

4.4. Martingale limit theorem


Recall the familiar fact in analysis that a non-decreasing sequence of reals bounded from above con-
verges; a non-increasing sequence of reals bounded from below converges. Can we have a random vari-
able version of this fact? Consider the two extreme cases:
(1) A sequence of a.s. non-decreasing and bdd RVs: X 1 ≤ X 2 ≤ · · · ≤ c. In this case indeed we have
limn→∞ X n = X ∞ for some RV X ∞ . But this setting is too restrictive for broad applications.
(2) A sequence of RVs with non-decreasing and bdd expectations: E[X_1] ≤ E[X_2] ≤ · · · ≤ c. In this case X_n
does not always converge to a limiting RV. (Counterexample: X_n = B_n − 1/n, where the B_n are i.i.d. Bernoulli(1/2).)

It turns out that the submartingale condition is in between the two extreme cases above and is flexible
enough to have many applications. This is stated in the following “martingale convergence theorem”,
which is a martingale analogue of these facts:
Theorem 4.4.1 (Martingale convergence theorem). If (M n )n≥1 is a submartingale that is bounded from
above, that is, there exists a constant c > 0 such that P(M n < c) = 1 for all n ≥ 1, then there exists a random
variable M ∞ that is almost surely finite, and limn→∞ M n = M ∞ with probability one.
P ROOF. See [Dur10, Thm. 4.2.11]. □
By flipping signs, the supermartingale counterpart of the above theorem also holds. More
specifically, an important corollary of Theorem 4.4.1 is the following.

Corollary 4.4.2. If (M_n)_{n≥1} is a supermartingale and M_n ≥ 0 for n ≥ 1, then M_n converges almost surely
to some random variable M_∞ such that P(0 ≤ M_∞ < ∞) = 1.

The following corollary is immediate from Theorem 4.4.1 together with the observation that a martingale is
both a submartingale and a supermartingale.
Corollary 4.4.3 (Limit theorem for bounded martingales). Let (M n )n≥0 be a martingale with respect to
(X n )n≥0 . Suppose that either all M n are bounded from above or from below almost surely. Then there
exists a random variable M ∞ that is almost surely finite, and limn→∞ M n = M ∞ with probability one.
Example 4.4.4 (SSRW on Z). Let (S n )n≥0 with S 0 = 0 denote simple symmetric random walk on Z. We
will cook up both examples and non-examples of the martingale convergence theorem.
First, (S n )n≥0 is indeed a martingale. However, it is not bounded from above and also from below.
That is, there is no constant c ∈ R such that either S n > c for all n ≥ 1 or S n < c for all n ≥ 1 almost
surely. (Indeed, recall that S n is recurrent, so it visits every integer infinitely often with probability 1.)
Thus Corollary 4.4.2 (as well as Theorem 4.4.1) cannot be applied. Indeed, since S n+1 − S n ∈ {−1, 1} for all
n ≥ 1, S n cannot converge to any single random variable.
Second, by stopping the random walk, we can create a convergent martingale. Let T = inf{n ≥ 1 | S_n =
1} denote the first hitting time of 1. Then T is a stopping time, so by Theorem 4.2.5, M_n = S_{T∧n} is a
martingale. The martingale satisfies M_n ≤ 1, since S_n takes values in {. . . , −2, −1, 0, 1} until time T (recall
that S_0 = 0) and it stays at 1 after time T . Hence by Corollary 4.4.3, M_n → M_∞ for some almost surely finite
RV M_∞.
We can in fact identify the limiting RV M_∞. Recall that T < ∞ with probability 1 since the SSRW S_n is
recurrent (see Example 3.5.7 and Theorem 3.5.4). Thus T ∧ n → T a.s. as n → ∞. Hence lim_{n→∞} M_n =
lim_{n→∞} S_{T∧n} = S_T = 1. So M_∞ = 1 with probability 1. ▲

4.5. Branching processes


We introduce the Galton-Watson branching process (X_n)_{n≥0} as follows. It is a simple population
growth model, where X_n denotes the population at generation n. Let X_0 = 1, meaning that initially
there is a single living individual. At each discrete time step, each individual in the previous generation gives
birth to a (nonnegative) random number of offspring and dies. Let
    Z_{n,i} := number of offspring of the i th individual at generation n,  1 ≤ i ≤ X_n .   (583)
Then (X_n)_{n≥1} satisfies the following recursion:
    X_{n+1} = Σ_{i=1}^{X_n} Z_{n,i} ,  n ≥ 0.   (584)
In order to analyze this process, we assume that the offspring numbers Z_{n,i} are i.i.d. with common
distribution [β_0, β_1, . . . ] on Z_{≥0}:
    P(Z_{n,i} = k) = β_k ,  k ∈ Z_{≥0} .   (585)

We denote the mean offspring number by µ, that is,
    µ = E[Z_{n,i}] = Σ_{k=0}^∞ k β_k .   (586)
Then we claim that
    M_n := µ^{−n} X_n ,  n ≥ 1   (587)
is a martingale with respect to the filtration generated by all offspring numbers up to the current genera-
tion.

Proposition 4.5.1. M_n = µ^{−n} X_n is a martingale with respect to the filtration F_n = σ(X_0, . . . , X_n).
P ROOF. First we verify the conditional increment condition:
    E[M_{n+1} | X_0, X_1, . . . , X_n] = µ^{−n−1} E[Z_{n,1} + · · · + Z_{n,X_n} | X_0, X_1, . . . , X_n] = µ^{−n−1} · µ X_n = M_n .   (588)
Since the offspring numbers Z_{n,i} are nonnegative, M_n ≥ 0, so by iterated expectation it follows that
    E[|M_{n+1}|] = E[M_{n+1}] = E[ E[M_{n+1} | X_0, X_1, . . . , X_n] ] = E[M_n] = · · · = E[M_0] = X_0 = 1 < ∞.   (589)
Lastly, M_n is clearly measurable with respect to the offspring numbers Z_{k,i} for 0 ≤ k < n. Hence M_n is
a martingale. □
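As a quick numerical illustration of Proposition 4.5.1, one can simulate the branching process and check that E[M_n] = E[µ^{−n} X_n] stays equal to 1. A minimal Python sketch (ours; the offspring distribution below is an arbitrary choice):

import random

def next_generation(x, beta):
    """One Galton-Watson step: each of x individuals reproduces independently
    with offspring distribution beta = [beta_0, beta_1, ...]."""
    ks = range(len(beta))
    return sum(random.choices(ks, weights=beta)[0] for _ in range(x))

def estimate_mean_Mn(n, beta, n_trials=10_000):
    """Monte Carlo estimate of E[mu^{-n} X_n], starting from X_0 = 1."""
    mu = sum(k * b for k, b in enumerate(beta))
    total = 0.0
    for _ in range(n_trials):
        x = 1
        for _ in range(n):
            x = next_generation(x, beta)
        total += x / mu**n
    return total / n_trials

beta = [0.25, 0.25, 0.5]  # offspring distribution with mean mu = 1.25
for n in (1, 3, 6):
    print(n, round(estimate_mean_Mn(n, beta), 3))  # each should be close to 1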
As an application, by Proposition 4.1.4, we have
E[X n ] = µn E[M n ] = µn E[M 0 ] = µn . (590)
Hence, if the mean offspring number µ is > 1, then the expected population E[X n ] grows exponentially as
µn (supercritical branching process); if µ < 1, then E[X n ] decays to zero exponentially as µn (subcritical
branching process); if µ = 1, then E[X n ] = 1 for all n ≥ 1 (critical branching process).
Since M_n = µ^{−n} X_n is a nonnegative martingale, it should converge to a limit lim_{n→∞} M_n = M_∞ by
Corollary 4.4.2. This means that we can describe the generation-n population X_n as
    X_n ≈ µ^n M_∞ .   (591)
Therefore, not only in expectation as in (590), but also trajectory-wise, X_n grows/decays exponentially
fast depending on whether µ > 1 or µ < 1.
However, there is a subtle point here: what if M_∞ = 0 with some positive probability? Then µ^n M_∞
does not necessarily diverge almost surely even if µ > 1. In fact, there are two cases depending on
whether the ‘extinction event’ occurs. Namely, define
    Extinction := {X_n = 0 for all sufficiently large n} = ‘extinction event’,   (592)
    π := P(Extinction) = ‘extinction probability’.   (593)
Then from (591), we can deduce
    Either the extinction event occurs or X_n grows exponentially in n.   (594)
Below we identify the limit M_∞, starting from the subcritical case µ < 1. Here, we will observe that
the extinction event occurs almost surely. This should not be surprising since each individual gives birth to fewer than
one child on average, so the population should die out.

Proposition 4.5.2. If µ < 1, then X_n = 0 for all sufficiently large n almost surely. (That is, P(Extinction) = 1.) Hence M_n =
µ^{−n} X_n → 0 as n → ∞ almost surely.
P ROOF. Since M_n = µ^{−n} X_n is a martingale, E[M_n] = E[M_0] = 1, so we have E[X_n] = µ^n. Now recalling
that X_n takes values in Z_{≥0}, by Markov’s inequality,
    P(X_n > 0) = P(X_n ≥ 1) ≤ E[X_n] = µ^n .   (595)
Thus, since 0 < µ < 1,
    Σ_{n=1}^∞ P(X_n > 0) ≤ Σ_{n=1}^∞ µ^n = µ/(1 − µ) < ∞.   (596)
Then by the Borel-Cantelli lemma, X_n > 0 for only finitely many n almost surely. In other words, X_n = 0 for
all sufficiently large n almost surely. □
Proposition 4.5.3. If µ = 1 and β_1 < 1, then X_n = 0 for all sufficiently large n almost surely. (That is, P(Extinction) = 1.)

P ROOF. Since µ = 1, X_n = µ^{−n} X_n itself is a nonnegative martingale, so it should converge: X_n → X_∞
for some a.s. finite RV X_∞. But since X_n takes values in the nonnegative integers, we must have
X_n = X_∞ for all large n. Since β_1 < 1, X_n cannot stay at the same positive value forever with probability one.
Hence X_∞ = 0 almost surely. □
Now suppose µ > 1. For the simple cases, note that β_0 = P(Z_{n,i} = 0) = 1 implies immediate extinction,
while β_0 = 0 implies no extinction. (Also note that, in general, π ≥ β_0, so β_0 > 0 implies π > 0.) In order to
analyze the intermediate case β_0 ∈ (0, 1), we will use a generating function argument. Recall that
    g(s) := Σ_{k=0}^∞ s^k β_k = generating function of the offspring distribution.   (597)
Then by one-step analysis and using a ‘renewal property of the branching process’, we can observe the follow-
ing recursion:
    π = β_0 + Σ_{k=1}^∞ β_k π^k = g(π).   (598)
Namely, the sub-population trees descending from the generation-1 individuals follow exactly the same
distribution as the whole population tree, so the probability that each such subtree goes extinct is again
π, and these events are all independent. The above equation tells us that the extinction probability π is a
fixed point of the generating function g of the offspring distribution.

F IGURE 1. Generating functions g(s) for the cases µ < 1 (left), µ = 1 (middle), µ > 1 (right) in
bold; the lighter solid line is the function y = x. The dotted lines in the left and right
plots are tangent lines to the generating functions at s = 1−. The y-intercept is g(0) = β_0, and
the x-coordinates of the intersections between the curve and the line y = x are the corresponding
extinction probabilities π. Figure excerpted from [SV].

With a more refined analysis, we can show the following:


Proposition 4.5.4. If µ > 1 and X 0 = 1, then the extinction probability π is the only solution of π = g (π) in
[0, 1). In particular,
P(The population survives forever) = 1 − π > 0. (599)
P ROOF. See [Dur10, Thm 4.3.12] □

Finally, recall that, on the event that there is no extinction, X_n grows exponentially fast as µ^n M_∞, and
on this event we also know M_∞ > 0, since M_∞ = 0 would force X_n → 0, contradicting the no-extinction event.

Example 4.5.5. Consider a branching process with offspring distribution Poisson(λ). The generating
function of the offspring distribution can be computed as
    g(s) = Σ_{k=0}^∞ s^k λ^k e^{−λ}/k! = e^{−λ} Σ_{k=0}^∞ (sλ)^k/k! = e^{−λ} e^{sλ} = e^{λ(s−1)} ,   (600)
where we have used the Taylor expansion of the exponential function.
Now, suppose λ ≤ 1. Then the mean offspring number is λ ≤ 1, so the extinction probability π is 1.
On the other hand, if λ > 1, then π is the unique solution of π = e^{λ(π−1)} in (0, 1). This equation cannot be
solved analytically, but with some computational aid, we can see π ≈ 0.2 when λ = 2.
In fact, there is a good algorithm that computes the fixed point π with exponentially improving preci-
sion. Namely, consider generating a sequence (π_n)_{n≥0} by π_{n+1} = g(π_n) = e^{λ(π_n−1)}. For instance, set λ = 2
and π_0 = 0.5. Then π_1 = 0.3679, π_2 = 0.2824, π_3 = 0.2381, π_4 = 0.2179, π_5 = 0.2093, π_6 = 0.2057, and so
on. ▲
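The fixed-point iteration described above takes only a few lines of code. A minimal Python sketch (ours), reproducing the λ = 2 computation:

import math

def extinction_prob(lam, pi0=0.5, n_iter=100):
    """Iterate pi <- g(pi) = exp(lam * (pi - 1)) for Poisson(lam) offspring."""
    pi = pi0
    for _ in range(n_iter):
        pi = math.exp(lam * (pi - 1))
    return pi

print(extinction_prob(2.0))  # ~0.2032, the fixed point of pi = e^{2(pi-1)} in (0, 1)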

4.6. Additional properties of martingales (Optional*)


In this section, we study some of the basic properties of martingales. We begin with the relation
between martingale increments and convex functions. Recall that a function φ : R → R is convex if

φ(λx + (1 − λ)y) ≤ λφ(x) + (1 − λ)φ(y),  ∀ λ ∈ [0, 1] and x, y ∈ R.   (601)

For instance, φ(x) = x 2 and exp(x) are convex functions.

Exercise 4.6.1 (Jensen’s inequality). Let X be any RV with E[X ] < ∞. Let φ : R → R be any convex function,
that is,

φ(λx + (1 − λ)y) ≤ λφ(x) + (1 − λ)φ(y),  ∀ λ ∈ [0, 1] and x, y ∈ R.   (602)

Jensen’s inequality states that

φ(E[X ]) ≤ E[φ(X )]. (603)

(i) Let c := E[X ] < ∞. Show that there exists a line f (x) = ax + b such that f (c) = φ(c) and φ(x) ≥ f (x) for
all x ∈ R.
(ii) Verify the following and prove Jensen’s inequality:

E[φ(X )] ≥ E[ f (X )] = a E[X ] + b = f (c) = φ(c) = φ(E[X ]). (604)

(iii) Let X be a RV, A an event, and φ a convex function as before. Show Jensen’s inequality for the
conditional expectation:

φ(E[X |A]) ≤ E[φ(X ) | A]. (605)

Proposition 4.6.2. Let (M t )t ≥0 be a submartingale with respect to a filtration (Ft )t ≥0 . Let φ : R → R be a


convex function. Suppose that E[|φ(M t )|] < ∞ for all t ≥ 0.
(i) If (M t )t ≥0 is a martingale, then (φ(M t ))t ≥0 is a submartingale w.r.t. (Ft )t ≥0 .
(ii) If φ : R → R is a non-decreasing function, then (φ(M t ))t ≥0 is a submartingale w.r.t. (Ft )t ≥0 .

P ROOF. We first show (i). Fix A ∈ F_t . Since (M_t)_{t≥0} is a martingale, for each A′ ∈ F_t ,
    E[M_{t+1} | A′] − E[M_t | A′] = E[M_{t+1} − M_t | A′] = 0.   (606)
Note that for each m ∈ R, since {M_t = m} ∈ F_t , by using Jensen’s inequality for the conditional expectation,
    E[φ(M_{t+1}) − φ(M_t) | A ∩ {M_t = m}] = E[φ(M_{t+1}) | A ∩ {M_t = m}] − φ(m)   (607)
        ≥ φ(E[M_{t+1} | A ∩ {M_t = m}]) − φ(m)   (608)
        = φ(E[M_t | A ∩ {M_t = m}]) − φ(m) = φ(m) − φ(m) = 0.   (609)
Then by iterated expectation and using the fact that A ∩ {M_t = m} ∈ F_t ,
    E[φ(M_{t+1}) − φ(M_t) | A] = E[ E[φ(M_{t+1}) − φ(M_t) | A, M_t] | A ] ≥ 0.   (610)
So (φ(M_t))_{t≥0} is a submartingale. This shows (i).
To show (ii), note that since (M_t)_{t≥0} is a submartingale, for any A′ ∈ F_t ,
    E[M_{t+1} | A′] − E[M_t | A′] = E[M_{t+1} − M_t | A′] ≥ 0.   (611)
By Jensen’s inequality and since φ is non-decreasing,
    E[φ(M_{t+1}) | A′] ≥ φ(E[M_{t+1} | A′]) ≥ φ(E[M_t | A′]).   (612)
Then following a similar argument as before shows that (φ(M_t))_{t≥0} is a submartingale. □
Example 4.6.3. Let (M_t)_{t≥0} be a martingale with E[M_t^2] < ∞ for all t ≥ 0. Since φ(x) = x^2 is a convex function, (M_t^2)_{t≥0} is a submartin-
gale. ▲
The following observation is a martingale analogue of the formula E[X 2 ] − E[X ]2 = Var(X ).

Proposition 4.6.4. If (M_t)_{t≥0} is a martingale with respect to a filtration (F_t)_{t≥0}, then we have
    E[M_{t+1}^2 − M_t^2 | F_t] = E[(M_{t+1} − M_t)^2 | F_t] ≥ 0.   (613)

P ROOF. By the martingale condition, we have
    E[M_{t+1} − M_t | F_t] = 0.   (614)
Hence by expanding the square and using E[M_t^2 | F_t] = M_t^2, we get
    E[(M_{t+1} − M_t)^2 | F_t] = E[M_{t+1}^2 | F_t] − 2M_t E[M_{t+1} | F_t] + M_t^2   (615)
        = E[M_{t+1}^2 | F_t] − M_t^2 = E[M_{t+1}^2 − M_t^2 | F_t]. □   (616)

Exercise 4.6.5 (Long range martingale condition). Let (M_t)_{t≥0} be a martingale with respect to a filtration
(F_t)_{t≥0}. For any 0 ≤ k < n, we will show that
    E[M_n − M_k | F_k] = 0.   (617)
(i) Suppose (617) holds for some fixed 0 ≤ k < n. For each A ∈ F_k , show that
    E[M_{n+1} − M_k | A] = E[M_{n+1} − M_n | A] + E[M_n − M_k | A]   (618)
        = E[M_{n+1} − M_n | A] = 0.   (619)
(ii) Conclude (617) for all 0 ≤ k < n by induction.

Proposition 4.6.6 (Orthogonality of martingale increments). Let (M_t)_{t≥0} be a martingale with respect to
a filtration (F_t)_{t≥0}.
(i) For each 0 ≤ j ≤ k < n, we have
    E[(M_n − M_k) M_j] = 0.   (620)
(ii) For each 0 ≤ i ≤ j ≤ k < n, we have
    E[(M_n − M_k)(M_j − M_i)] = 0.   (621)

P ROOF. To show (i), we condition on the values of M_j . Noting that {M_j = m} ∈ F_j ⊆ F_k for
all m ∈ R, we have
    E[(M_n − M_k) M_j | M_j = m] = m E[M_n − M_k | M_j = m] = 0,   (622)
where we have used Exercise 4.6.5 for the last equality. Hence
    E[(M_n − M_k) M_j] = E[ E[(M_n − M_k) M_j | M_j] ] = 0.   (623)
Moreover, (ii) follows from (i) immediately since
    E[(M_n − M_k)(M_j − M_i)] = E[(M_n − M_k) M_j] − E[(M_n − M_k) M_i].   (624)
This shows the assertion. □
Exercise 4.6.7 (Pythagorean theorem for martingales). Let (M_t)_{t≥0} be a martingale with respect to a fil-
tration (F_t)_{t≥0}. Use Proposition 4.6.6 to show that
    E[(M_t − M_0)^2] = Σ_{k=1}^t E[(M_k − M_{k−1})^2].   (625)
CHAPTER 5

Introduction to mathematical finance (Optional*)

We introduce basic notions of no arbitrage principle, binomial model, and connect to martingales.

5.1. Hedging and replication in the two-state world


We model the market as a probability space (Ω, P), where Ω consists of sample paths ω of the market,
each of which describes a particular time-evolution scenario. For each event E ⊆ Ω, P(E) gives the probability
that the event E occurs. A portfolio is a collection of assets that one has at a particular time. The value
of a portfolio A at time t is denoted by V_t^A. If t denotes the current time, then V_t^A is a known quantity.
However, at a future time T ≥ t , V_T^A depends on how the market evolves during [t, T], so it is a random
variable. Also recall the definition of an arbitrage portfolio:

Definition 5.1.1. A portfolio A at current time t is said to be an arbitrage portfolio if its value V^A satisfies
the following:
(i) V_t^A ≤ 0.
(ii) There exists a future time T ≥ t such that P(V_T^A ≥ 0) = 1 and P(V_T^A > 0) > 0.

Example 5.1.2 (A 1-step binomial model). Suppose we have an asset with price (S_t)_{t≥0}. Consider a Eu-
ropean call option at time t = 0 with strike K = 110 and maturity t = 1 (year). Suppose that S_0 = 100 and
at time 1, S_1 takes one of the two values 120 and 90 according to a certain distribution. One can imagine
flipping a coin with unknown probability, and according to whether it lands heads (H) or tails (T), the
stock value S_1 takes values S_1(H) = 120 and S_1(T) = 90. Assume an annually compounded interest rate
r = 4%. Can we determine its current value c = C_{110}(0, 1)? We will show c = 14/3.12 ≈ 4.49 by using two
arguments: hedging and replication.

F IGURE 1. 1-step binomial model: S_0 = 100; S_1(H) = 120, S_1(T) = 90; r = 4%; t = 0 to t = 1.

Here we give a ‘hedging argument’ for option pricing. Consider the following portfolio at time t = 0:
Portfolio A: [x shares of the stock] + [y European call options with strike 110 and maturity 1].

The cost of entering into this portfolio (at time t = 0) is 100x + cy. Hence the profit of this portfolio takes
the following two values:
    V_1^A(H) − (100x + cy)(1.04) = [120x + y(120 − 110)^+] − [104x + (1.04)cy] = 16x + (10 − (1.04)c)y,   (626)
    V_1^A(T) − (100x + cy)(1.04) = [90x + y(90 − 110)^+] − [104x + (1.04)cy] = −14x − (1.04)cy.
For a perfect hedge, consider choosing the values of x and y such that the profit of this portfolio
at maturity is the same for the two outcomes of the stock. Hence we must have
    16x + (10 − (1.04)c)y = −14x − (1.04)cy.   (627)
Solving this, we find
    3x + y = 0.   (628)
Hence if the above equation is satisfied, the profit of portfolio A is
    V_1^A − (104x + (1.04)cy) = −14x − (1.04)c(−3x) = ((3.12)c − 14) x.   (629)
If (3.12)c > 14, then portfolio A is an arbitrage portfolio; if (3.12)c < 14, then the ‘dual’ of portfolio A,
which consists of −x shares of the stock and −y European call options, is an arbitrage portfolio. Hence,
assuming no-arbitrage, the only possible value of c is c = 14/(3.12). ▲

5.2. The fundamental theorem of asset pricing


The observation we made in Example 5.1.2 can be generalized into the so-called ‘fundamental theo-
rem of asset pricing’. For its setup, consider a market with n different time evolutions ω_1, · · · , ω_n
between time t = 0 and t = 1, each occurring with a positive probability. Suppose there are assets A^{(1)}, · · · , A^{(m)},
whose price at time t is given by S_t^{(i)} for i = 1, 2, · · · , m (see Figure 2). For each 1 ≤ i ≤ m and 1 ≤ j ≤ n,
define
    α_{i,j} = ( profit at time t = 1 of buying one share of asset A^{(i)}
               at time t = 0 when the market evolves via ω_j ).   (630)
Let A = (α_{i,j}) denote the (m × n) matrix of profits:
    A := [ α_{1,1}  α_{1,2}  · · ·  α_{1,n}
           α_{2,1}  α_{2,2}  · · ·  α_{2,n}
             ⋮        ⋮      · · ·    ⋮
           α_{m,1}  α_{m,2}  · · ·  α_{m,n} ].   (631)
Consider the following portfolio at time t = 0:
    Portfolio A: [x_1 shares of asset A^{(1)}] + · · · + [x_m shares of asset A^{(m)}].

F IGURE 2. 1-step n-state model for asset A^{(i)}: the price moves from S_0^{(i)} at t = 0 to S_1^{(i)}(ω_j) at t = 1 under scenario ω_j , with profit α_{i,j}.



Theorem 5.2.1 (The fundamental theorem of asset pricing). Consider portfolio A and the profit matrix
A = (α_{i,j}) as above. Then exactly one of the following holds:
(i) There exists an investment allocation (x_1, · · · , x_m) such that portfolio A is an arbitrage portfolio; that
is, the n-dimensional row vector
    [x_1, x_2, · · · , x_m] A   (632)
has nonnegative coordinates and at least one strictly positive coordinate.
(ii) There exists a strictly positive probability distribution p* = (p_1*, · · · , p_n*) under which the expected
profit of each asset is zero:
    A (p_1*, · · · , p_n*)′ = (0, · · · , 0)′,  i.e.,  Σ_{j=1}^n α_{i,j} p_j* = 0 for all 1 ≤ i ≤ m,   (633)
where ′ denotes the transpose.
Remark 5.2.2. Theorem 5.2.1 (i) states that the portfolio A is an arbitrage portfolio for some (x 1 , · · · , x m ).
The probability distribution p∗ in the above theorem is called the risk-neutral probability distribution.
Hence Theorem 5.2.1 states that there is no way to make A into an arbitrage portfolio if and only if there
exists a risk-neutral probability distribution under which the expected profit of each asset A (i ) is zero.
Example 5.2.3 (Example 5.1.2 revisited). Consider the situation described in Example 5.1.2. Let A^{(1)} be
the asset of price (S_t)_{t≥0} and let A^{(2)} denote the European call with strike K = 110 on this asset with maturity
T = 1. Then the matrix A of profits is given by
    A = [ 16             −14
          10 − (1.04)c   −(1.04)c ],   (634)
where c = C_{110}(0, 1) denotes the price of this European option. Assuming no-arbitrage, the fundamental
theorem implies that there exists a risk-neutral probability distribution p* = (p_1*, p_2*) such that A(p*)′ = 0′.
Namely,
    16 p_1* − 14 p_2* = 0,
    (10 − (1.04)c) p_1* − (1.04)c p_2* = 0.   (635)
Since p_1* + p_2* = 1, the first equation implies p_1* = 7/15 and p_2* = 8/15. Then from the second equation,
we get
    (1.04)c = 10 p_1* = 14/3.   (636)
This gives c = 14/(3.12). ▲
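The computation in Examples 5.1.2 and 5.2.3 can be packaged into a short routine, which can also be reused for Exercise 5.2.4 below. A minimal Python sketch (ours; the function name is for illustration): it derives the risk-neutral probabilities from the condition that the expected discounted stock profit is zero, and then prices the call.

def risk_neutral_call_price(S0, SH, ST, r, K):
    """1-step risk-neutral price of a European call with strike K."""
    # risk-neutral probabilities: expected discounted stock profit is zero
    p1 = (S0 * (1 + r) - ST) / (SH - ST)  # 'up' probability
    p2 = 1 - p1
    payoff_H, payoff_T = max(SH - K, 0), max(ST - K, 0)
    return (p1 * payoff_H + p2 * payoff_T) / (1 + r)

print(risk_neutral_call_price(100, 120, 90, 0.04, 110))  # 14/3.12 = 4.4871...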
Exercise 5.2.4. Rework Examples 5.1.2 and 5.2.3 with following parameters:
S 0 = 100, S 1 (H ) = 130, S 1 (T ) = 80, r = 5%, K = 110. (637)
P ROOF OF T HEOREM 5.2.1. Suppose (i) holds with x = (x_1, · · · , x_m). We want to show that (ii) cannot
hold. Fix a strictly positive probability distribution p = (p_1, · · · , p_n)′, where ′ denotes the transpose, so that
p is an n-dimensional column vector. By (i), we have
    x(Ap) = (xA)p > 0.   (638)
It follows that Ap cannot be the zero vector in R^n. Hence (633) cannot hold for p* = p, as desired.
Next, suppose that (ii) holds for some strictly positive probability distribution p* = (p_1*, · · · , p_n*)′. We
use a linear algebra argument to show that (i) does not hold. To each m-dimensional row vector x =
(x_1, · · · , x_m) ∈ R^m, one can associate the n-dimensional row vector xA ∈ R^n. The condition (633) says
that Ap* = 0. Hence for each x ∈ R^m, by using associativity of matrix multiplication,
    (xA)p* = x(Ap*) = x0 = 0.   (639)
This shows that the image of the linear map x ↦ xA, which is a linear subspace of R^n, is orthogonal
to the strictly positive vector p*. Since any nonzero vector with nonnegative coordinates has a strictly
positive inner product with p*, this linear subspace intersects the orthant {(y_1, · · · , y_n) | y_1, · · · , y_n ≥ 0}
only at the origin. This shows that (i) does not hold, as desired. □
5.3. The binomial model

5.3.1. 1-step binomial model. Suppose we have an asset with price (S_t)_{t≥0}. Assume that at a future
time t = 1, the stock price S_1 can take two values S_1(H) = S_0u and S_1(T) = S_0d according to an invisible
coin flip, where u, d > 0 are multiplicative factors for upward and downward moves of the stock price
during the period [0, 1]. Assume the aggregated interest rate during [0, 1] is r > 0, so that the value of
the ZCB (Zero Coupon Bond) maturing at 1 is given by Z(0, 1) = 1/(1 + r).
Consider a general European option on this stock, whose value at time t = 0, 1 is denoted Vt . We
would like to determine its initial value (price at t = 0) V0 in terms of its payoff V1 . In the previous section,
we have seen that there are three ways to proceed: 1) hedging argument, 2) replication argument, and 3)
risk-neutral probability distribution.

F IGURE 3. 1-step binomial model with a general European option: the stock moves from S_0 to S_1(H) = S_0u or S_1(T) = S_0d, and the option value from V_0 to V_1(H) or V_1(T), with interest rate r over t = 0 to t = 1.

Proposition 5.3.1. In the above binomial model, the following hold.
(i) There exists a risk-neutral probability distribution p* = (p_1*, p_2*) if and only if
    0 < d < 1 + r < u.   (640)
Furthermore, if the above condition holds, p* is uniquely given by
    p_1* = ((1 + r) − d)/(u − d),  p_2* = (u − (1 + r))/(u − d).   (641)
(ii) Suppose (640) holds. Then the initial value V_0 of the European option is given by
    V_0 = (1/(1 + r)) E_{p*}[V_1] = (1/(1 + r)) ( V_1(H) ((1 + r) − d)/(u − d) + V_1(T) (u − (1 + r))/(u − d) ).   (642)
P ROOF. To begin, we first need to compute the (2 × 2) profit matrix A = (α_{i,j}), whose rows and
columns correspond to the two kinds of assets (stock and European option) and the outcomes (coin flips),
respectively. We find
    A = [ S_1(H) − S_0(1 + r)   S_1(T) − S_0(1 + r)
          V_1(H) − V_0(1 + r)   V_1(T) − V_0(1 + r) ].   (643)
The risk-neutral probability p* = (p_1*, p_2*)′ satisfies Ap* = 0, so
    [S_1(H) − S_0(1 + r)] p_1* + [S_1(T) − S_0(1 + r)] p_2* = 0,
    [V_1(H) − V_0(1 + r)] p_1* + [V_1(T) − V_0(1 + r)] p_2* = 0.   (644)
Using the fact that p_1* + p_2* = 1, the first equation gives
    p_1* = (S_0(1 + r) − S_1(T))/(S_1(H) − S_1(T)) = ((1 + r) − d)/(u − d),
    p_2* = (S_1(H) − S_0(1 + r))/(S_1(H) − S_1(T)) = (u − (1 + r))/(u − d).   (645)
Hence the desired expression for the risk-neutral probabilities p_1* and p_2* holds. Note that this gives a
strictly positive probability distribution if and only if (640) holds. This shows (i). (Why does this condition
make sense?)
Assuming (640), the second equation in (644) then gives
    V_0 (1 + r) = V_1(H) p_1* + V_1(T) p_2*.   (646)
The right hand side can be regarded as the expectation E_{p*}[V_1] of the value V_1 of the European option at
time t = 1 under the risk-neutral probability distribution p*. Then (ii) follows from (i). □
Proposition 5.3.2. In the 1-step binomial model as before, consider the following portfolios:
    Portfolio A: [Δ_0 shares of the stock] + [short one European option],
    Portfolio B: [Δ_0 shares of the stock] + [x cash].
Then the following hold:
(i) Portfolio A is perfectly hedged (i.e., has constant payoff at time t = 1) if and only if
    Δ_0 = (V_1(H) − V_1(T))/(S_1(H) − S_1(T)).   (647)
(ii) Portfolio B replicates long one European option if and only if we have (647) and
    x = (1/(1 + r)) (S_1(H)V_1(T) − S_1(T)V_1(H))/(S_1(H) − S_1(T)).   (648)
Furthermore,
    V_0 = x + Δ_0 S_0 = (1/(1 + r)) ( V_1(H) ((1 + r) − d)/(u − d) + V_1(T) (u − (1 + r))/(u − d) ).   (649)
P ROOF. To show (i), we equate the two payoffs of portfolio A at time t = 1 and obtain
    Δ_0 S_1(H) − V_1(H) = Δ_0 S_1(T) − V_1(T).   (650)
Solving this for Δ_0 shows the assertion.
To show (ii), note that portfolio B replicates long one European option if and only if
    Δ_0 S_1(H) + x(1 + r) = V_1(H),
    Δ_0 S_1(T) + x(1 + r) = V_1(T),   (651)
or in matrix form,
    ( S_1(H)  1 + r ; S_1(T)  1 + r ) (Δ_0, x)′ = (V_1(H), V_1(T))′.   (652)
This is equivalent to
    (Δ_0, x)′ = [1/((1 + r)(S_1(H) − S_1(T)))] ( 1 + r  −(1 + r) ; −S_1(T)  S_1(H) ) (V_1(H), V_1(T))′,   (653)
which is also equivalent to (647) and (648), as desired.

Lastly, suppose portfolio B replicates long one European option. Then by the monotonicity theorem, the
initial value V_0 of the European option should equal the value of portfolio B at time t = 0. This shows
    V_0 = x + Δ_0 S_0 .   (654)
Using (647) and (648), we also have
    x + Δ_0 S_0 = (1/(1 + r)) (S_1(H)V_1(T) − S_1(T)V_1(H))/(S_1(H) − S_1(T)) + (V_1(H)S_0 − V_1(T)S_0)/(S_1(H) − S_1(T))   (655)
        = (1/(1 + r)) ( V_1(H) (S_0(1 + r) − S_1(T))/(S_1(H) − S_1(T)) + V_1(T) (S_1(H) − S_0(1 + r))/(S_1(H) − S_1(T)) )   (656)
        = (1/(1 + r)) ( V_1(H) ((1 + r) − d)/(u − d) + V_1(T) (u − (1 + r))/(u − d) ).   (657)
This shows (ii). (Remark: by using Proposition 5.3.1, one can avoid using the monotonicity theorem
here.) □
Example 5.3.3 (Excerpted from [Dur99]). Suppose a stock is selling for $60 today. A month from now it
will be either $80 or $50, i.e., u = 4/3 and d = 5/6. Assume the interest rate is r = 1/18 for this period. Then
according to Proposition 5.3.1, the risk-neutral probability distribution p* = (p_1*, p_2*) is given by
    p_1* = ((1 + 1/18) − 5/6)/(4/3 − 5/6) = 4/9,  p_2* = 5/9.   (658)
Now consider a European call option on this stock with strike K = 65 maturing in a month. Then
V_1(H) = (80 − 65)^+ = 15 and V_1(T) = (50 − 65)^+ = 0. By Proposition 5.3.1, the initial value V_0 of this
European option is
    V_0 = (1/(1 + 1/18)) E_{p*}[V_1] = (18/19) · 15 · (4/9) = 120/19 ≈ 6.3158.   (659)
Working in an investment bank, you were able to sell 10,000 calls to a customer at a slightly higher price
of $6.5 each, receiving an up-front payment of $65,000. At maturity, the overall profit is given by
    (19/18) · $65,000 − 10,000 · (80 − 65)^+ = −$81,389  if the stock goes up,   (660)
    (19/18) · $65,000 − 10,000 · (50 − 65)^+ = +$68,611  if the stock goes down.

Being worried about losing a huge amount if the stock goes up, you decide to hedge and lock in the
profit. According to Proposition 5.3.2, the hedge ratio Δ_0 is given by
    Δ_0 = ((80 − 65)^+ − (50 − 65)^+)/(80 − 50) = 15/30 = 1/2.   (661)
Since you have shorted 10,000 calls, this means you need to buy 5,000 shares of the stock, owing 5,000 ·
$60 − $65,000 = $235,000 to the bank. This forms a portfolio of
    [5,000 shares of stock] + [10,000 short calls].   (662)
The overall profit at maturity is then
    5,000 · $80 − 10,000 · (80 − 65)^+ − (19/18) · $235,000 = $1,944  if the stock goes up,   (663)
    5,000 · $50 − 10,000 · (50 − 65)^+ − (19/18) · $235,000 = $1,944  if the stock goes down.
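The profit computations (660) and (663) are easy to verify in code. A minimal Python sketch (ours, not part of [Dur99]):

def hedged_pnl(stock_final, shares, calls_short, strike, borrowed, r):
    """Profit at maturity: stock value minus call payoff owed minus loan repayment."""
    call_payoff = max(stock_final - strike, 0)
    return shares * stock_final - calls_short * call_payoff - (1 + r) * borrowed

r = 1 / 18
for s1 in (80, 50):
    print(s1, round(hedged_pnl(s1, 5_000, 10_000, 65, 235_000, r)))  # ~$1,944 either way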

Exercise 5.3.4. Rework Example 5.3.3 with the following parameters:
    S_0 = 50,  S_1(H) = 70,  S_1(T) = 40,  r = 4%,  K = 60.   (664)



5.3.2. The N-step binomial model. In this subsection, we consider the general N-step binomial
model. Namely, starting at the current time t = 0, we flip N coins at times t = 1, 2, · · · , N to determine the
market evolution. More precisely, the sample space of outcomes is now Ω = {H, T}^N, which consists
of strings of length N of H’s and T’s. We assume a constant interest rate for each period [k, k + 1],
k = 0, 1, · · · , N − 1.

F IGURE 4. Illustration of the N-step binomial model. ω is a sample path of the market evolution
in the first k steps. In the next period [k, k + 1], either an up or a down evolution occurs, and the
path is extended to ωH or ωT accordingly. S_t and V_t denote the stock price and the option value.

The following result for the general N-step binomial model is a direct analog of the 1-step case we
have discussed in the previous subsection. To better understand its second part, recall the probabilistic
interpretation of the 1-step value formula in Proposition 5.3.1.

Proposition 5.3.5. Consider the N-step binomial model as above. Consider a European option on this
stock with value (V_t)_{0≤t≤N}.
(i) For each integer 0 ≤ k < N and a sample path ω ∈ {H, T}^k for the first k steps, define the risk-neutral
probability distribution p*(ω) = (p_1*(ω), p_2*(ω)) by
    p_1*(ω) = ((1 + r)S_k(ω) − S_{k+1}(ωT))/(S_{k+1}(ωH) − S_{k+1}(ωT)),
    p_2*(ω) = (S_{k+1}(ωH) − (1 + r)S_k(ω))/(S_{k+1}(ωH) − S_{k+1}(ωT)).   (665)
If 0 < p_1*(ω) < 1, then
    V_k(ω) = (1/(1 + r)) E_{p*(ω)}[V_{k+1} | first k coin flips = ω]   (666)
        = (1/(1 + r)) ( V_{k+1}(ωH) p_1*(ω) + V_{k+1}(ωT) p_2*(ω) ).   (667)
(ii) Consider N consecutive coin flips such that, given any sequence x_1 x_2 · · · x_k of the first k flips, the (k +
1)st coin lands on heads with probability p_1*(x_1 x_2 · · · x_k). Let P* denote the induced probability
measure (risk-neutral probability measure) on the sample space Ω = {H, T}^N. Then
    V_0 = (1/(1 + r)^N) E_{P*}[V_N].   (668)
P ROOF. The argument for (i) is exactly the same as in the proof of Proposition 5.3.1. For (ii) we use an
induction on the number of steps. The base case is verified by (i). For the induction step, we first use (i)
to write
    V_0 = (1/(1 + r)) ( V_1(H) p_1*(∅) + V_1(T) p_2*(∅) ),   (669)
where ∅ denotes the empty sequence of coin flips. Let X_1, X_2, · · · , X_N be a sequence of N (random) coin flips
given by the risk-neutral probability measure P*. Denote the expectation under P* by E*. By the induction
hypothesis, we have
    V_1(H) = (1/(1 + r)^{N−1}) E*[V_N | X_1 = H],  V_1(T) = (1/(1 + r)^{N−1}) E*[V_N | X_1 = T].   (670)
Hence we have
    V_0 = (1/(1 + r)^N) ( E*[V_N | X_1 = H] p_1*(∅) + E*[V_N | X_1 = T] p_2*(∅) )   (671)
        = (1/(1 + r)^N) E*[E*[V_N | X_1]] = (1/(1 + r)^N) E*[V_N],   (672)
where we have used iterated expectation for the last equality. This shows the assertion. □
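Formula (668) together with the one-step recursion (666) yields the standard backward-induction pricing algorithm. Below is a minimal Python sketch (ours), specialized to constant up/down factors u, d, so that p*(ω) does not depend on ω:

def binomial_price(S0, u, d, r, N, payoff):
    """European option price by backward induction (Proposition 5.3.5)."""
    p1 = ((1 + r) - d) / (u - d)  # risk-neutral up-probability (constant here)
    p2 = (u - (1 + r)) / (u - d)
    # values at maturity, indexed by the number j of up-moves out of N
    values = [payoff(S0 * u**j * d**(N - j)) for j in range(N + 1)]
    for step in range(N, 0, -1):  # fold back one period at a time, as in (666)
        values = [(p1 * values[j + 1] + p2 * values[j]) / (1 + r)
                  for j in range(step)]
    return values[0]

call = lambda s: max(s - 110, 0)
print(binomial_price(100, 1.2, 0.9, 0.04, 1, call))  # 1-step: 14/3.12 = 4.4871...
print(binomial_price(100, 1.2, 0.9, 0.04, 3, call))  # 3-step price of the same call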

5.3.3. The N-step binomial model revisited. In this subsection, we revisit the N-step binomial
model in the framework of martingales. First recall the model. Starting at the current time t = 0, we flip
N coins at times t = 1, 2, · · · , N to determine the market evolution. The sample space of outcomes is
Ω = {H, T}^N, which consists of strings of length N of H’s and T’s. We assume a constant interest rate
for each period [k, k + 1], k = 0, 1, · · · , N − 1.
Let F_t denote the information that we can obtain by observing the market up to time t . For instance,
F_t contains the information of the first t coin flips, the stock prices S_0, S_1, · · · , S_t , the European option values
V_0, V_1, · · · , V_t , and so on. Then (F_t)_{t≥0} defines a natural filtration for the N-step binomial model. Below
we reformulate Proposition 5.3.5.

Proposition 5.3.6. Consider the N-step binomial model as above. Let P* denote the risk-neutral proba-
bility measure defined in Proposition 5.3.5. Consider a European option on this stock with value (V_t)_{0≤t≤N}.
(i) The process ((1 + r)^{−t} V_t)_{0≤t≤N} forms a martingale with respect to the filtration (F_t)_{t≥0} under the risk-
neutral probability measure P*. That is,
    E_{P*}[ (1 + r)^{−(t+1)} V_{t+1} − (1 + r)^{−t} V_t | F_t ] = 0,   (673)
which is also equivalent to
    V_t = (1/(1 + r)) E_{P*}[ V_{t+1} | F_t ].   (674)
(ii) We have
    V_0 = E_{P*}[ (1 + r)^{−N} V_N ].   (675)

P ROOF. To show (i), we first note that, conditioning on the information F_t up to time t , we know all
the coin flips, stock prices, and European option values up to time t . Hence, by Proposition 5.3.5(i), we have
    E_{P*}[ (1 + r)^{−(t+1)} V_{t+1} | F_t ] = (1 + r)^{−(t+1)} E_{P*}[ V_{t+1} | F_t ]   (676)
        = (1 + r)^{−(t+1)} (1 + r) V_t = (1 + r)^{−t} V_t .   (677)
This shows that (1 + r)^{−t} V_t is a martingale with respect to the natural filtration (F_t)_{t≥0} under the risk-
neutral probability measure P*.
Now (ii) follows from (i) and Exercise 4.6.5. Namely, for each A ∈ F_0, by Exercise 4.6.5,
    E_{P*}[ (1 + r)^{−N} V_N − (1 + r)^0 V_0 | F_0 ] = 0.   (678)
This yields
    V_0 = E_{P*}[V_0 | F_0] = E_{P*}[ (1 + r)^{−N} V_N | F_0 ],   (679)
as desired. □

Exercise 5.3.7 (Risk-neutral probabilities make the stock price a martingale). Consider the N-step bi-
nomial model with stock price (S_t)_{0≤t≤N}. Let P* denote the risk-neutral probability measure defined in
Proposition 5.3.5.
(i) Show that the discounted stock price (1 + r)^{−t} S_t forms a martingale with respect to the filtration
(F_t)_{t≥0} under the risk-neutral probability measure P*. That is,
    E_{P*}[ (1 + r)^{−(t+1)} S_{t+1} − (1 + r)^{−t} S_t | F_t ] = 0,   (680)
which is also equivalent to
    S_t = (1/(1 + r)) E_{P*}[ S_{t+1} | F_t ].   (681)
(Hint: use the fundamental theorem of asset pricing.)
(ii) Show that
    S_0 = E_{P*}[ (1 + r)^{−N} S_N ].   (682)
CHAPTER 6

Poisson Processes

In this chapter, we introduce a particular class of renewal processes called “Poisson processes”,
which are arrival processes whose inter-arrival times are i.i.d. exponential RVs. Due to the nice properties
of the exponential distribution, Poisson processes enjoy what is called the ‘memoryless property’, which says
that Poisson processes can be restarted at any stopping time, not necessarily at arrival times. Recall that
renewal processes have the renewal property, which means they can be restarted at arrival times. Hence the
memoryless property is a stronger version of the renewal property, and it is enjoyed only by
Poisson processes. This allows one to analyze Poisson processes in greater detail than general renewal
processes.

6.1. Recap of exponential and Poisson random variables


In this section, we study two important properties of exponential and Poisson random variables,
which will be crucial when we study Poisson processes from the following section.

Example 6.1.1 (Exponential RV). X is an exponential RV with rate λ (denoted by X ∼ Exp(λ)) if it has
PDF

f X (x) = λe −λx 1(x ≥ 0). (683)

Integrating the PDF gives its CDF

P(X ≤ x) = (1 − e −λx )1(x ≥ 0). (684)

The following complementary CDF of the exponential RV will be useful:
    P(X ≥ x) = e^{−λx}  for x ≥ 0.   (685)

Exercise 6.1.2. Let X ∼ Exp(λ). Show that E(X ) = 1/λ and Var(X ) = 1/λ2 .

Minimum of independent exponential RVs is again an exponential RV.

Exercise 6.1.3. Let X 1 ∼ Exp(λ1 ) and X 2 ∼ Exp(λ2 ) and suppose they are independent. Define Y =
min(X 1 , X 2 ). Show that Y ∼ Exp(λ1 + λ2 ). (Hint: Compute P(Y ≥ y).)

The following example is sometimes called the ‘competing exponentials’.

Example 6.1.4. Let X_1 ∼ Exp(λ_1) and X_2 ∼ Exp(λ_2) be independent exponential RVs. We will show that
    P(X_1 < X_2) = λ_1/(λ_1 + λ_2)   (686)
using iterated expectation. Using iterated expectation for probability (see Exercise 1.4.3),
    P(X_1 < X_2) = E[P(X_1 < X_2 | X_1)]   (687)
        = ∫_0^∞ P(X_1 < X_2 | X_1 = x_1) λ_1 e^{−λ_1 x_1} dx_1   (688)
        = ∫_0^∞ P(X_2 > x_1) λ_1 e^{−λ_1 x_1} dx_1   (689)
        = ∫_0^∞ λ_1 e^{−λ_2 x_1} e^{−λ_1 x_1} dx_1   (690)
        = ∫_0^∞ λ_1 e^{−(λ_1+λ_2) x_1} dx_1 = λ_1/(λ_1 + λ_2).   (691)

When we add up independent continuous RVs, we can compute the PDF of the resulting RV by using
the convolution formula or moment generating functions. In the following exercise, we compute the
PDF of the sum of independent exponentials.
Exercise 6.1.5 (Sum of i.i.d. Exp is Erlang). Let X 1 , X 2 , · · · , X n ∼ Exp(λ) be independent exponential RVs.
(i) Show that f X 1 +X 2 (z) = λ2 ze −λz 1(z ≥ 0).
(ii) Show that f X 1 +X 2 +X 3 (z) = 2−1 λ3 z 2 e −λz 1(z ≥ 0).
(iii) Let S_n = X_1 + X_2 + · · · + X_n . Use induction to show that S_n ∼ Erlang(n, λ), that is,
    f_{S_n}(z) = λ^n z^{n−1} e^{−λz}/(n − 1)!  for z ≥ 0.   (692)
Hint: Let X, Y be independent continuous RVs. Then the PDF of X + Y is given by the convolution
formula: f_{X+Y}(z) = ∫_{−∞}^∞ f_X(x) f_Y(z − x) dx.
Remark: For parameters (r, λ), a continuous random variable X is said to have the Gamma distribution
with parameters (r, λ) (and we denote X ∼ Gamma(r, λ)) if
    f_X(z) = λ^r z^{r−1} e^{−λz}/Γ(r)  1(z ≥ 0),   (693)
where Γ(z) is the Euler Gamma function defined by
    Γ(z) = ∫_0^∞ x^{z−1} e^{−x} dx.   (694)
By induction and integration by parts, one can show that Γ(n) = (n − 1)! for n ∈ N. Hence
Gamma(r, λ) = Erlang(r, λ) whenever r is a positive integer. So we can also say that S_n ∼ Gamma(n, λ).
Exponential RVs will be the building blocks of Poisson processes because of their ‘memoryless
property’.
Exercise 6.1.6 (Memoryless property of exponential RV). A continuous positive RV X is said to have the mem-
oryless property if
    P(X ≥ t_1 + t_2) = P(X ≥ t_1) P(X ≥ t_2)  ∀t_1, t_2 ≥ 0.   (695)
(i) Show that (695) is equivalent to
    P(X ≥ t_1 + t_2 | X ≥ t_2) = P(X ≥ t_1)  ∀t_1, t_2 ≥ 0.   (696)
(ii) Show that exponential RVs have memoryless property.
(iii) Suppose X is continuous, positive, and memoryless. Let g (t ) = log P(X ≥ t ). Show that g is contin-
uous at 0 and
g (x + y) = g (x) + g (y) for all x, y ≥ 0. (697)
Using Exercise 6.1.7, conclude that X must be an exponential RV.
6.1. RECAP OF EXPONENTIAL AND POISSON RANDOM VARIABLES 93

Exercise 6.1.7. Let g : R≥0 → R be a function with the property that g (x + y) = g (x) + g (y) for all x, y ≥ 0.
Further assume that g is continuous at 0. In this exercise, we will show that g (x) = c x for some constant
c.
(i) Show that g (0) = g (0 + 0) = g (0) + g (0). Deduce that g (0) = 0.
(ii) Show that for all integers n ≥ 1, g (n) = ng (1).
(iii) Show that for all integers n, m ≥ 1,
ng (1) = g (n · 1) = g (m(n/m)) = mg (n/m). (698)
Deduce that for all nonnegative rational numbers r , we have g (r ) = r g (1).
(iv) Show that g is continuous.
(v) Let x be nonnegative real number. Let r k be a sequence of rational numbers such that r k → x as
k → ∞. By using (iii) and (iv), show that
    g(x) = g( lim_{k→∞} r_k ) = lim_{k→∞} g(r_k) = g(1) lim_{k→∞} r_k = x · g(1).   (699)

Lastly, we introduce the Poisson RVs and record some of their nice properties.

Example 6.1.8 (Poisson RV). A RV X is a Poisson RV with rate λ > 0 if


    P(X = k) = λ^k e^{−λ}/k!   (700)
for all nonnegative integers k ≥ 0. We write X ∼ Poisson(λ).

Exercise 6.1.9. Let X ∼ Poisson(λ). Show that E(X ) = Var(X ) = λ.

Exercise 6.1.10 (Sum of ind. Poisson RVs is Poisson). Let X ∼ Poisson(λ1 ) and Y ∼ Poisson(λ2 ) be inde-
pendent Poisson RVs. Show that X + Y ∼ Poisson(λ1 + λ2 ).

Example 6.1.11 (MGF of Poisson RV). Let X ∼ Poisson(λ). Then using the Taylor expansion of the expo-
nential function, we can compute the moment generating function (MGF) of X :
    E[e^{tX}] = Σ_{k=0}^∞ e^{kt} λ^k e^{−λ}/k! = e^{−λ} Σ_{k=0}^∞ (e^t λ)^k/k! = e^{−λ} e^{λe^t} = e^{λ(e^t − 1)}.   (701)

Proposition 6.1.12 (Splitting a Poisson RV). Suppose there are N ∼ Poisson(λ) balls in a box. Each ball
gets a color i ∈ {1, . . . , m} with probability p i independently, where p 1 + · · · + p m = 1. Let Ni denote the
number of balls of color i . Then Ni ∼ Poisson(λp i ) for 1 ≤ i ≤ m and N1 , . . . , Nm are independent.

P ROOF. We directly compute the joint distribution of N_1, . . . , N_m . The key is to recognize that the
joint distribution of (N_1, . . . , N_m) conditional on N = n is multinomial. With n := k_1 + · · · + k_m ,
    P(N_1 = k_1, . . . , N_m = k_m) = P(N = n) P(N_1 = k_1, . . . , N_m = k_m | N = n)   (702)
        = (λ^n e^{−λ}/n!) · (n!/(k_1! · · · k_m!)) p_1^{k_1} · · · p_m^{k_m}   (703)
        = ( (p_1λ)^{k_1} e^{−p_1λ}/k_1! ) · · · ( (p_mλ)^{k_m} e^{−p_mλ}/k_m! )   (704)
        = P(Poisson(p_1λ) = k_1) · · · P(Poisson(p_mλ) = k_m),   (705)
where we have used the fact that λ = λ(p_1 + · · · + p_m). Hence the joint distribution of (N_1, . . . , N_m) fac-
torizes into the product of Poisson distributions with parameters p_kλ for 1 ≤ k ≤ m. This is enough to
conclude the independence of N_1, . . . , N_m as well as their claimed Poisson distributions. □
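The splitting property can be tested empirically. The sketch below (ours; it only checks the means E[N_i] = λp_i, not the full independence claim, and the sampler and parameter values are arbitrary choices) samples N by inverse-CDF and colors each ball independently:

import math, random

def poisson_sample(lam):
    """Inverse-CDF sampling of a Poisson(lam) random variable."""
    u, k, p = random.random(), 0, math.exp(-lam)
    cdf = p
    while u > cdf:
        k += 1
        p *= lam / k
        cdf += p
    return k

lam, probs, n_trials = 6.0, (0.5, 0.3, 0.2), 50_000
sums = [0, 0, 0]
for _ in range(n_trials):
    for _ in range(poisson_sample(lam)):  # color each of the N balls independently
        i = random.choices((0, 1, 2), weights=probs)[0]
        sums[i] += 1
print([s / n_trials for s in sums])  # should be close to [3.0, 1.8, 1.2] = lam * p_i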

6.2. Poisson processes as an arrival process

Recall the definition of arrival and renewal processes in Section 2.1. Below we define Poisson process.
Definition 6.2.1 (Poisson process). An arrival process (Tk )k≥1 is a Poisson process of rate λ if its inter-
arrival times are i.i.d. Exp(λ) RVs. In this case, we denote (Tk )k≥1 ∼ PP(λ) and equivalently (N (t ))t ≥0 ∼
PP(λ) for its associated counting process (N (t ))t ≥0 .
The choice of exponential inter-arrival times is special due to the memoryless property of exponen-
tial RVs (Exercise 6.1.6).
Exercise 6.2.2. Let (Tk )k≥1 be a Poisson process with rate λ. Show that E[Tk ] = k/λ and Var(Tk ) = k/λ2 .
Furthermore, show that Tk ∼ Erlang(k, λ), that is,
f_{T_k}(z) = λ^k z^{k−1} e^{−λz} / (k−1)!. (706)
The following exercise explains what is ‘Poisson’ about the Poisson process.
Exercise 6.2.3. Let (Tk )k≥1 be a Poisson process with rate λ and let (N (t ))t ≥0 be the associated counting
process. We will show that N (t ) ∼ Poisson(λt ).
(i) Using the relation {Tn ≤ t } = {N (t ) ≥ n} and Exercise 6.2.2, show that
P(N(t) ≥ n) = P(T_n ≤ t) = ∫_0^t λ^n z^{n−1} e^{−λz}/(n−1)! dz. (707)
(ii) Let G(t) = Σ_{m=n}^∞ (λt)^m e^{−λt}/m! = P(Poisson(λt) ≥ n). Show that
(d/dt) G(t) = λ^n t^{n−1} e^{−λt}/(n−1)! = (d/dt) P(T_n ≤ t). (708)
Conclude that G(t ) = P(Tn ≤ t ).
(iii) From (i) and (ii), conclude that N (t ) ∼ Poisson(λt ).

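As a complement to Exercise 6.2.3, one can verify N(t) ∼ Poisson(λt) by direct simulation: build arrival times from i.i.d. Exp(λ) inter-arrival times and compare the empirical mean and variance of N(t) with λt. A sketch assuming NumPy; all parameters are illustrative.

import numpy as np

rng = np.random.default_rng(2)
lam, t, trials = 1.5, 4.0, 10**5
# 40 inter-arrivals per trial: their total exceeds t with overwhelming probability.
tau = rng.exponential(1 / lam, size=(trials, 40))
T = np.cumsum(tau, axis=1)                # arrival times T_1, T_2, ...
N_t = (T <= t).sum(axis=1)                # N(t) = number of arrivals in [0, t]

print(N_t.mean(), N_t.var(), lam * t)     # mean and variance both ~ lam*t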
6.3. Memoryless property and stopping times


Let (N(t))_{t≥0} ∼ PP(λ). For any fixed u ≥ 0, (N(u+t) − N(u))_{t≥0} is a counting process itself, which counts the number of arrivals during the interval (u, u+t]. We call this the counting process (N(t))_{t≥0} restarted at time u. One of the remarkable properties of a Poisson process is that its counting process restarted at any given time u is again the counting process of a Poisson process, which is independent of what has happened up to time u. This is called the memoryless property of Poisson processes.
The memoryless property of Poisson processes applies not only to deterministic restart times, but also to a class of random times called 'stopping times'. We introduce this notion in greater generality.
Definition 6.3.1 (Stopping time). Let Ft be the collection of all events that one could observe by time
t 1. The sequence (Ft )t ≥0 is called a filtration2. A random variable T ≥ 0 is a stopping time w.r.t. a given
filtration (Ft )t ≥0 if for each u ≥ 0,
{T ≤ u} ∈ F_u (equivalently, {T > u} ∈ F_u). (709)
The condition above is interpreted as follows: By using all information we gathered up to time u, we
can determine whether T ≤ u or T > u, not using any future information that is yet to be observed.
Theorem 6.3.2 (Memoryless property of PP). Let (N (t ))t ≥0 ∼ PP(λ) and let T ≥ 0 be a stopping time w.r.t.
a filtration (Ft )t ≥0 such that Ft contains all information of (N (u))0≤u≤t . Then the following holds.
1Proper name: σ-field generated by information up to time t
2Since one can only gain information, we have F_a ⊆ F_b for each 0 ≤ a ≤ b.

(i) (N (T + t ) − N (T ))t ≥0 ∼ PP(λ).


(ii) (N (t ))0≤t ≤T and (N (T + t ) − N (T ))t ≥0 are independent.
The following is a very useful observation, which allows us to restart a Poisson process at any stop-
ping time and still get an independent Poisson process.
Lemma 6.3.3. Let (X (t ))t ≥0 be any stochastic processes and let T be a stopping time for (X (t ))t ≥0 . Suppose
the followings hold:
(i) (Independent increments) For any u ≥ 0, the processes (X (t ))0≤t ≤u and (X (t ) − X (u))t >u are indepen-
dent.
(ii) (Stationary increments) The distribution of the process (X (t ) − X (u))t >u does not depend on u (sta-
tionary increment).
Then (X (t ) − X (T ))t >T has the same distribution as (X (t ) − X (0))t >0 and is independent of (X (t ))t ≤T .
According to Lemma 6.3.3, Theorem 6.3.2 follows once we show that the counting process (N (t ))t ≥0
of PP(λ) satisfies the independent and stationary increment properties stated in Lemma 6.3.3. This is
verified by the following proposition.
Proposition 6.3.4 (PP has independent and stationary increments). Let (Tk )k≥1 be a Poisson process of
rate λ and let (N (t ))t ≥0 be the associated counting process.
(i) For any t ≥ 0, let Z (t ) = inf{s > 0 : N (t + s) > N (t )} be the waiting time for the first arrival after time t .
Then Z (t ) ∼ Exp(λ) and it is independent of the process up to time t .
(ii) For any t ≥ 0, (N(t+s) − N(t))_{s≥0} is the counting process of a Poisson process of rate λ, which is independent of the past process (N(u))_{0≤u≤t}.
Exercise 6.3.5 (Sum of independent Poisson RV’s is Poisson). Let (Tk )k≥1 be a Poisson process with rate
λ and let (N (t ))t ≥0 be the associated counting process. Fix t , s ≥ 0.
(i) Use memoryless property to show that N (t ) and N (t + s) − N (t ) are independent Poisson RVs of rates
λt and λs.
(ii) Note that the total number of arrivals during [0, t + s] can be divided into the number of arrivals
during [0, t ] and [t , t + s]. Conclude that if X ∼ Poisson(λt ) and Y ∼ Poisson(λs) and if they are
independent, then X + Y ∼ Poisson(λ(t + s)).
Exercise 6.3.6 (Poisson process restarted at stopping times). Let (Tk )k≥1 ∼ PP(λ) and let (N (t ))t ≥0 be its
counting process.
(i) Let T be any stopping time for (N (t ))t ≥0 . Use the memoryless property of Poisson processes and
Lemma 6.3.3 to show that (N (t ) − N (T ))t ≥T is the counting process of a PP(λ), which is inde-
pendent of (N (t ))0≤t <T .
(ii) Let T be the first time that we see three arrivals during a unit time. That is,
T = inf{t ≥ 0 | N (t ) − N (t − 1) = 3}. (710)
Show that T is a stopping time. According to (i), (N (t ) − N (T ))t ≥T is the counting process of a
PP(λ), which is independent of what have happened during [0, T ].
(iii) Let T be the first time t that there is no arrival during the interval [t , t + 1]. Is T a stopping time? Is
(N (t ) − N (T ))t ≥T the counting process of a PP(λ)?
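A simple numerical illustration of Theorem 6.3.2 takes the deterministic stopping time T = u: the waiting time from time u until the next arrival should again be Exp(λ). This is only a sanity-check sketch (NumPy assumed; all parameters are illustrative).

import numpy as np

rng = np.random.default_rng(3)
lam, u, trials = 2.0, 3.0, 10**5
tau = rng.exponential(1 / lam, size=(trials, 30))   # ample inter-arrivals per trial
T = np.cumsum(tau, axis=1)
first_after = np.argmax(T > u, axis=1)              # index of first arrival after u
Z_u = T[np.arange(trials), first_after] - u         # residual waiting time at u
print(Z_u.mean(), Z_u.var(), 1 / lam, 1 / lam**2)   # matches the Exp(lam) moments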
PROOF OF LEMMA 6.3.3. (Optional*) Let E, E′ be any events for real stochastic processes. For any
stochastic process W = (W (t ))t ≥0 , we write W ∈ E if the event E occurs for the process W . We want to
show
P((X(t) − X(T))_{t>T} ∈ E | (X(t))_{0≤t≤T}) = P((X(t) − X(T))_{t>T} ∈ E). (711)
We show the above equation under the conditioning on {T = u}, and then undo the conditioning by
iterated expectation.

Since T is a stopping time for (X (t ))t ≥0 , for any u ≥ 0, the event


{(X(t))_{0≤t≤T} ∈ E′, T = u} = {(X(t))_{0≤t≤u} ∈ E′, T = u} (712)
is determined by the process (X(t))_{0≤t≤u} up to time u. Hence by the independent increments assumption (i), the above event is independent of the event {(X(t) − X(u))_{t>u} ∈ E}. This gives that, for any u ≥ 0,
P((X(t) − X(T))_{t>T} ∈ E | (X(t))_{0≤t≤T} ∈ E′, T = u) (713)
= P((X(t) − X(u))_{t>u} ∈ E | (X(t))_{0≤t≤u} ∈ E′, T = u) (714)
= P((X(t) − X(u))_{t>u} ∈ E) (715)
= P((X(t) − X(0))_{t>0} ∈ E), (716)
where for the last equality, we have used the stationary increments assumption (ii).
Hence by iterated expectation,
P((X(t) − X(T))_{t>T} ∈ E | (X(t))_{0≤t≤T} ∈ E′) (717)
= E[ P((X(t) − X(T))_{t>T} ∈ E | (X(t))_{0≤t≤T} ∈ E′, T) | (X(t))_{0≤t≤T} ∈ E′ ] (718)
= E[ P((X(t) − X(0))_{t>0} ∈ E) | (X(t))_{0≤t≤T} ∈ E′ ] (719)
= P((X(t) − X(0))_{t>0} ∈ E). (720)

To finish the proof, let E′ be the entire sample space. Then the above yields
P((X(t) − X(T))_{t>T} ∈ E) = P((X(t) − X(0))_{t>0} ∈ E). (721)
This shows that (X(t) − X(T))_{t>T} has the same distribution as (X(t) − X(0))_{t>0}. Furthermore, we conclude
P((X(t) − X(T))_{t>T} ∈ E | (X(t))_{0≤t≤T} ∈ E′) = P((X(t) − X(0))_{t>0} ∈ E) (722)
= P((X(t) − X(T))_{t>T} ∈ E). (723)
Since E, E′ were arbitrary, this also shows that (X(t) − X(T))_{t>T} and (X(t))_{0≤t≤T} are independent. □

PROOF OF PROPOSITION 6.3.4. We first show (ii), assuming (i) (which is proved below without using (ii)). Note that
T_{N(t)} ≤ t < T_{N(t)+1}. (724)
Hence we may write
Z(t) = T_{N(t)+1} − t = τ_{N(t)+1} − (t − T_{N(t)}). (725)
Namely, Z(t) is the remaining portion of the (N(t)+1)st inter-arrival time τ_{N(t)+1} after we waste the first t − T_{N(t)} of it. (See Figure 1.)
Now consider restarting the arrival process (T_k)_{k≥0} at time t. The first inter-arrival time is T_{N(t)+1} − t = Z(t), which is Exp(λ) and independent from the past by (i). The second inter-arrival time is T_{N(t)+2} − T_{N(t)+1}, which is Exp(λ) and is independent of everything else by assumption. Likewise, the following inter-arrival times for this restarted arrival process are i.i.d. Exp(λ) variables. This shows (ii).
Next, we show (i). Let E be any event for the counting process (N(s))_{0≤s≤t} up to time t. In order to show that the remaining waiting time Z(t) and the past process up to time t are independent and Z(t) ∼ Exp(λ), we want to show that
P(Z(t) ≥ x | (N(s))_{0≤s≤t} ∈ E) = P(Z(t) ≥ x) = e^{−λx} (726)


for any x ≥ 0. To this end,

FIGURE 1. Assuming N(t) = 3 and T_3 = s ≤ t, we have Z = τ_4 − (t − s). By the memoryless property of the exponential RV, Z follows Exp(λ) under this conditioning.

As can be seen from (725), Z(t) depends on three random variables: τ_{N(t)+1}, N(t), and T_{N(t)}. To show (726), we argue by conditioning on the last two RVs and using iterated expectation. Using (725), note that
P(Z(t) ≥ x | (N(s))_{0≤s≤t} ∈ E, N(t) = n, T_{N(t)} = s) (727)
= P(τ_{n+1} − (t − s) ≥ x | (N(s))_{0≤s≤t} ∈ E, N(t) = n, T_n = s) (728)
= P(τ_{n+1} − (t − s) ≥ x | (N(s))_{0≤s≤t} ∈ E, T_{n+1} > t, T_n = s) (729)
= P(τ_{n+1} − (t − s) ≥ x | (N(s))_{0≤s≤t} ∈ E, τ_{n+1} > t − s, T_n = s). (730)

Conditioned on N(t) = n, the event that (N(s))_{0≤s≤t} ∈ E is determined by the arrival times T_1, · · · , T_n and the fact that T_{n+1} > t. Hence we can rewrite
{(N(s))_{0≤s≤t} ∈ E, τ_{n+1} > t − s, T_n = s} = {(τ_1, · · · , τ_n) ∈ E′, τ_{n+1} > t − s} (731)
for some event E′ to be satisfied by the first n inter-arrival times. Since the inter-arrival times are independent, this gives
P(Z(t) ≥ x | (N(s))_{0≤s≤t} ∈ E, N(t) = n, T_{N(t)} = s) (732)
= P(τ_{n+1} − (t − s) ≥ x | τ_{n+1} ≥ t − s) (733)
= P(τ_{n+1} ≥ x) = e^{−λx}, (734)


where we have used the memoryless property of exponential variables. Hence by iterated expectation,
P(Z(t) ≥ x | (N(s))_{0≤s≤t} ∈ E, N(t) = n) = E_{T_{N(t)}}[ P(Z(t) ≥ x | (N(s))_{0≤s≤t} ∈ E, N(t) = n, T_{N(t)}) ] (735)
= E_{T_{N(t)}}[e^{−λx}] = e^{−λx}. (736)


By using iterated expectation once more,
P(Z(t) ≥ x | (N(s))_{0≤s≤t} ∈ E) = E_{N(t)}[ P(Z(t) ≥ x | (N(s))_{0≤s≤t} ∈ E, N(t)) ] (737)
= E_{N(t)}[e^{−λx}] = e^{−λx}. (738)


By taking E to be the entire sample space, this also gives
P(Z(t) ≥ x) = e^{−λx}. (739)
This shows (726). □
6.4. Merging and splitting of Poisson process

If customers arrive at a bank according to a PP(λ) and each one is male or female independently with probability q and 1 − q, then the 'thinned out' process of only male customers is a Poisson process of rate qλ; the process of female customers is a Poisson process of rate (1 − q)λ.
FIGURE 2. Merging two independent Poisson processes of rates λ_1 and λ_2 gives a new Poisson process of rate λ_1 + λ_2.
The reverse operation, in which two independent PPs are combined into one, is called 'merging'. Namely, imagine customers arrive at a register through two doors A and B independently according to PPs of rates λ_A and λ_B, respectively. Then the combined arrival process of all customers is again a PP with the added rate λ_A + λ_B.
FIGURE 3. Splitting of a Poisson process N(t) of rate λ according to an independent Bernoulli process of parameter p.
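Both operations are easy to check by simulation before proving them. The sketch below (NumPy assumed; the rates λ_1, λ_2, the horizon t, and the coin bias p are arbitrary) merges two independent PPs and then thins the merged stream with Bernoulli(p) coins, comparing the empirical mean gaps with 1/(λ_1 + λ_2) and 1/(p(λ_1 + λ_2)).

import numpy as np

rng = np.random.default_rng(4)
lam1, lam2, t, p = 2.0, 3.0, 50.0, 0.4

def pp_arrivals(lam, t, rng):
    # Arrival times of a PP(lam) on [0, t] via cumulative Exp(lam) gaps.
    tau = rng.exponential(1 / lam, size=int(3 * lam * t) + 50)
    T = np.cumsum(tau)
    return T[T <= t]

A, B = pp_arrivals(lam1, t, rng), pp_arrivals(lam2, t, rng)
merged = np.sort(np.concatenate([A, B]))            # merging
print(np.diff(merged).mean(), 1 / (lam1 + lam2))    # gaps ~ Exp(lam1 + lam2)

coins = rng.random(len(merged)) < p                 # splitting with Bernoulli(p) coins
print(np.diff(merged[coins]).mean(), 1 / (p * (lam1 + lam2)))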
Exercise 6.4.1 (Excerpted from [BT02]). Transmitters A and B independently send messages to a single receiver according to Poisson processes with rates λ_A = 3 and λ_B = 4 (messages per minute). Each message (regardless of the source) contains a random number of words with PMF
P(1 word) = 2/6, P(2 words) = 3/6, P(3 words) = 1/6, (740)
which is independent of everything else.
(i) Find P(a total of nine messages are received during [0, t]).
(ii) Let M(t) be the total number of words received during [0, t]. Find E[M(t)].
(iii) Let T be the first time that the receiver receives exactly three messages consisting of three words from transmitter A. Find the distribution of T.
(iv) Compute P(exactly seven messages out of the first ten messages are from A).
Exercise 6.4.2 (Order statistics of i.i.d. Exp RVs). One hundred light bulbs are simultaneously put on a life test. Suppose the lifetimes of the individual light bulbs are independent Exp(λ) RVs. Let T_k be the kth time that some light bulb fails. We will find the distribution of T_k using Poisson processes.
(i) Think of T_1 as the first arrival time among 100 independent PPs of rate λ. Show that T_1 ∼ Exp(100λ).
(ii) After time T_1, there are 99 remaining light bulbs. Using the memoryless property, argue that T_2 − T_1 is the first arrival time of 99 independent PPs of rate λ. Show that T_2 − T_1 ∼ Exp(99λ) and that T_2 − T_1 is independent of T_1.

(iii) As in the coupon collector problem, we break up


T_k = τ_1 + τ_2 + · · · + τ_k, (741)
where τ_i = T_i − T_{i−1} with τ_1 = T_1. Note that τ_i is the waiting time between the (i−1)st and ith failures.
Using the ideas in (i) and (ii), show that the τ_i's are independent and τ_i ∼ Exp((101 − i)λ). Deduce that
E[T_k] = (1/λ) (1/100 + 1/99 + · · · + 1/(100 − k + 1)), (742)
Var[T_k] = (1/λ²) (1/100² + 1/99² + · · · + 1/(100 − k + 1)²). (743)
(iv) Let X 1 , X 2 , · · · , X 100 be i.i.d. Exp(λ) variables. Let X (1) < X (2) < · · · < X (100) be their order statistics, that
is, X (k) is the kth smallest among the X i ’s. Show that X (k) has the same distribution as Tk , the
kth time some light bulb fails. (So we know what it is from the previous parts.)

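A quick simulation can corroborate (742): the empirical mean of the kth order statistic of 100 i.i.d. Exp(λ) lifetimes should match the partial harmonic sum. This is an illustrative sketch assuming NumPy.

import numpy as np

rng = np.random.default_rng(5)
lam, n, k, trials = 1.0, 100, 10, 10**5
X = rng.exponential(1 / lam, size=(trials, n))      # bulb lifetimes
T_k = np.sort(X, axis=1)[:, k - 1]                  # kth order statistic X_(k)
harmonic = sum(1 / (n - i) for i in range(k))       # 1/100 + 1/99 + ... + 1/91
print(T_k.mean(), harmonic / lam)                   # the two should nearly agree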
In the next two exercises, we rigorously justify merging and splitting of Poisson processes.

Exercise 6.4.3 (Merging of independent PPs). Let (N1 (t ))t ≥0 and (N2 (t ))t ≥0 be the counting processes of
two independent PPs of rates λ1 and λ2 , respectively. Define a new counting process (N (t ))t ≥0 by
N (t ) = N1 (t ) + N2 (t ). (744)
In this exercise, we show that (N(t))_{t≥0} ∼ PP(λ_1 + λ_2).
(i) Let τ_k^{(1)}, τ_k^{(2)}, and τ_k be the kth inter-arrival times of the counting processes (N_1(t))_{t≥0}, (N_2(t))_{t≥0}, and (N(t))_{t≥0}, respectively. Show that τ_1 = min(τ_1^{(1)}, τ_1^{(2)}). Conclude that τ_1 ∼ Exp(λ_1 + λ_2).
(ii) Let Tk be the kth arrival time for the joint process (N (t ))t ≥0 . Use memoryless property of PP and
Exercise 6.3.6 to deduce that N1 and N2 restarted at time Tk are independent PPs of rates λ1
and λ2 , which are also independent from the past (before time Tk ).
(iii) From (ii), show that
τ_{k+1} = min(τ̃_1, τ̃_2), (745)
where τ̃_1 is the waiting time for the first arrival after time T_k for N_1, and similarly for τ̃_2. Deduce that τ_{k+1} ∼ Exp(λ_1 + λ_2) and it is independent of τ_1, · · · , τ_k. Conclude that (N(t))_{t≥0} ∼ PP(λ_1 + λ_2).

Exercise 6.4.4 (Splitting of PP). Let (N(t))_{t≥0} be the counting process of a PP(λ), and let (X_k)_{k≥1} be a sequence of i.i.d. Bernoulli(p) RVs. We define two counting processes (N_1(t))_{t≥0} and (N_2(t))_{t≥0} by
N_1(t) = Σ_{k=1}^∞ 1(T_k ≤ t) 1(X_k = 1) = #(arrivals with coin landing on heads up to time t), (746)
N_2(t) = Σ_{k=1}^∞ 1(T_k ≤ t) 1(X_k = 0) = #(arrivals with coin landing on tails up to time t). (747)
In this exercise, we show that (N_1(t))_{t≥0} ∼ PP(pλ) and (N_2(t))_{t≥0} ∼ PP((1 − p)λ).
(i) Let τ_k and τ_k^{(1)} be the kth inter-arrival times of the counting processes (N(t))_{t≥0} and (N_1(t))_{t≥0}, respectively. Let Y_k be the location of the kth 1 in (X_k)_{k≥1}. Show that
τ_1^{(1)} = Σ_{i=1}^{Y_1} τ_i. (748)

(ii) Show that
τ_2^{(1)} = Σ_{i=Y_1+1}^{Y_2} τ_i. (749)
6.5. POISSON PROCESS AND POISSON POINT PROCESS 100

(iii) Show that in general,
τ_k^{(1)} = Σ_{i=Y_{k−1}+1}^{Y_k} τ_i. (750)

(iv) Recall that the Y_k − Y_{k−1}'s are i.i.d. Geom(p) RVs. Use Example 6.4.5 and (iii) to deduce that the τ_k^{(1)}'s are i.i.d. Exp(pλ) RVs. Conclude that (N_1(t))_{t≥0} ∼ PP(pλ). (The same argument shows (N_2(t))_{t≥0} ∼ PP((1 − p)λ).)
Example 6.4.5 (Sum of geometric number of Exp. is Exp.). Let X_i ∼ Exp(λ) for i ≥ 1 and let N ∼ Geom(p). Let Y = Σ_{k=1}^N X_k. Suppose all RVs are independent. Then Y ∼ Exp(pλ).
To see this, recall that their moment generating functions are
M_{X_1}(t) = λ/(λ − t), M_N(t) = p e^t / (1 − (1 − p)e^t). (751)
Hence (see Remark 6.4.6)
M_Y(t) = M_N(log M_{X_1}(t)) = p (λ/(λ − t)) / (1 − (1 − p)(λ/(λ − t))) = pλ / ((λ − t) − λ(1 − p)) = pλ / (pλ − t). (752)
Notice that this is the MGF of an Exp(pλ) variable. Thus by uniqueness, we conclude that Y ∼ Exp(pλ).

Remark 6.4.6. Let Y = X 1 + X 2 + · · · + X N , where X i ’s are i.i.d. RVs and N is an independent RV taking
values from positive integers. By iterated expectation, we have
M_Y(t) = E[e^{tY}] = E[e^{tX_1} e^{tX_2} · · · e^{tX_N}] (753)
= E_N[ E[e^{tX_1} e^{tX_2} · · · e^{tX_N} | N] ] (754)
= E_N[ E[e^{tX_1}]^N ] (755)
= E_N[ M_{X_1}(t)^N ] (756)
= E_N[ e^{N log M_{X_1}(t)} ] (757)
= M_N(log M_{X_1}(t)). (758)
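Example 6.4.5 can also be confirmed by Monte Carlo. In the sketch below (NumPy assumed; λ and p are illustrative), the Geom(p) number N of i.i.d. Exp(λ) summands is drawn via the Gamma distribution, using the fact that a sum of n i.i.d. Exp(λ) RVs is Gamma(n, 1/λ), i.e., Erlang (cf. Exercise 6.2.2).

import numpy as np

rng = np.random.default_rng(6)
lam, p, trials = 2.0, 0.25, 10**6
N = rng.geometric(p, size=trials)            # Geom(p), taking values 1, 2, 3, ...
Y = rng.gamma(shape=N, scale=1 / lam)        # same law as a sum of N i.i.d. Exp(lam)
print(Y.mean(), Y.var())                     # compare with the Exp(p*lam) moments:
print(1 / (p * lam), 1 / (p * lam) ** 2)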

6.5. Poisson process and Poisson point process


In Section 6.2, we have defined an arrival process (Tk )k≥1 to be a Poisson process of rate λ if its
inter-arrival times are i.i.d. Exp(λ) variables (Definition 6.2.1). In this section, we provide equivalent
definitions of Poisson processes in terms of the associated counting process (see (153)). This new per-
spective has many complementary advantages. Most importantly, this allows us to define the time-
inhomogeneous Poisson processes, where the rate of arrival changes in time.
An arrival process gives rise to the notion of a 'size' of each subset A ⊆ R as the number of arrival times that occur in A. This is called the associated 'counting measure', as defined below.
Definition 6.5.1. Let (T_n)_{n≥0}, T_0 = 0, denote the sequence of arrival times of an arrival process. The associated counting measure of the arrival process is the function N(A) of subsets A ⊆ R that gives the number of arrival times that lie in the set A:
N(A) := Σ_{n=1}^∞ 1(T_n ∈ A) = # of arrival times in A. (759)

With the above definition, we can think of the associated counting process (N(t))_{t≥0} as N(t) = N([0, t]). Then, taking A = (a, b] for 0 ≤ a < b, we can write N(A) = N((a, b]) as
N((a, b]) = N(b) − N(a) = (# of arrivals during [0, b]) − (# of arrivals during [0, a]). (760)

Definition 6.5.2 (Poisson point process). A counting measure N on R is called a Poisson point process of
intensity λ (denoted N ∼ PPP(λ)) if it has the following properties:
(i) For each 0 ≤ a < b, N ((a, b]) ∼ Poisson(λ(b − a)).
(ii) For disjoint intervals (a_1, b_1], . . . , (a_n, b_n], the random variables N((a_1, b_1]), . . . , N((a_n, b_n]) are independent.

The next proposition describes the properties of the counting measure of the Poisson process.

Proposition 6.5.3 (PP implies PPP). Let (N(t))_{t≥0} ∼ PP(λ) be the counting process of a Poisson process of rate λ. Then, viewed as a counting measure via N((a, b]) := N(b) − N(a), N is a Poisson point process with intensity λ.

PROOF. This follows from the memoryless property of the Poisson process. Namely, restarting the PP at time t = a, we get N((a, b]) = N(b) − N(a) ∼ Poisson(λ(b − a)), since (N(a + t) − N(a))_{t≥0} is a PP(λ) by Proposition 6.3.4. Also, fix disjoint intervals (a_1, b_1], . . . , (a_n, b_n]. Without loss of generality, we may assume that a_1 ≤ b_1 ≤ a_2 ≤ b_2 ≤ · · · ≤ a_n ≤ b_n. Then restarting the process at t = b_1 and using the memoryless property of PP, we see that N((a_1, b_1]) is independent of N((a_k, b_k]) for 2 ≤ k ≤ n. Similarly, restarting the process at time t = b_2, we see that N((a_2, b_2]) is independent of N((a_k, b_k]) for 3 ≤ k ≤ n. Proceeding in the same way, we see that the RVs N((a_1, b_1]), . . . , N((a_n, b_n]) are independent. □

Exercise 6.5.4 (PPP implies PP). Let N ∼ PPP(λ). Consider the induced counting process N (t ) := N ([0, t ])
for t ≥ 0. Let Tk = inf{u ≥ 0 | N (u) = k} be the kth ‘arrival time’ and let τk = Tk − Tk−1 be the kth ‘inter-
arrival time’.
(i) Use the fact that Tk is a stopping time for (N (t ))t ≥0 and Lemma 6.3.3 to deduce that (N (Tk + t ) −
N (Tk ))t ≥0 is the counting process of a PP(λ) that is independent of (N (t ))t ≤Tk .
(ii) Let Z (t ) = inf{u ≥ 0 | N (t + u) > N (t )} be the waiting time for the first arrival after time t . Show that
Z (t ) ∼ Exp(λ) for all t ≥ 0.
(iii) Use (ii) and conditioning on Tk−1 to show that τk ∼ Exp(λ) for all k ≥ 1.

Recall that in Exercise 6.4.4, we have established splitting of PP by directly analyzing the super-inter-
arrival times obtained by merging a geometric number of Exp(λ) inter-arrival times. In the following
exercise, we give steps to obtain a similar statement using PPP instead.

Exercise 6.5.5 (Splitting of PP via PPP). Let N ∼ PPP(λ) and let T1 < T2 < . . . denote the associated arrival
times (or particle locations), where for any interval I , the counting measure N (I ) of I equals the number
of Tk ’s such that Tk ∈ I . Now suppose each particle at location Tk for k ≥ 1 is either red with probability p
and blue with probability 1−p, independently of everything else. Define two induced counting measures
N1 and N2 by

N1 (I ) = # of red points in I , (761)


N2 (I ) = # of blue points in I . (762)

(i) Show that N_1 ∼ PPP(pλ) and N_2 ∼ PPP((1 − p)λ). (Use Proposition 6.1.12.)
(ii) For each interval I, show that N_1(I) and N_2(I) are independent. (Use Proposition 6.1.12.)
(iii) Show that N_1 and N_2 are independent. Namely, for intervals I_1 and I_2 (not necessarily disjoint), show that N_1(I_1) and N_2(I_2) are independent. (Hint: Decompose N_1(I_1) = N_1(I_1 \ I_2) + N_1(I_1 ∩ I_2) and similarly for N_2(I_2). N_1(I_1 \ I_2) and N_2(I_2 \ I_1) are independent by the independence property of N, since (I_1 \ I_2) and (I_2 \ I_1) are disjoint. Use (ii) to conclude.)
(iv) By using Exercise 6.5.4, conclude that splitting a PP(λ) with independent Bernoulli(p) coin flips gives
two independent PP(pλ) and PP((1 − p)λ).

6.6. Poisson process conditioned on the number of arrivals


Suppose we know there are a total of 10 arrivals during [0, 1] of a PP(λ). What can we tell about the arrival times T_1 < · · · < T_10? Clearly they all lie in the interval [0, 1]. It turns out that, if we sample n = 10 points U_1, . . . , U_n independently and uniformly at random from [0, 1], then the leftmost value is T_1, the one to its right is T_2, and so on, in distribution. Note that in this description, there is no involvement of the rate parameter λ. In other words, the role of the rate parameter λ in PP(λ) is only to modulate the number of arrivals in a given interval; once we know the number, the specific arrival times within that interval are generic and do not depend on λ.
To make the above sketch precise, recall the following notion of order statistics. Let U_1, . . . , U_n denote n RVs. Let U_{n;k} denote the kth smallest among them for 1 ≤ k ≤ n. Then U_{n;1} ≤ · · · ≤ U_{n;n}, which are called the order statistics of U_1, · · · , U_n.

Proposition 6.6.1. Let (Tn )n≥1 denote the arrival times of (N (t ))t ≥0 ∼ PP(λ).
(i) The joint distribution of (T1 , . . . , Tn ) conditional on N (t ) = n is the same as the joint distribution of the
order statistics of n i.i.d. Uniform([0, t ]) RVs.
(ii) The joint PDF of (T_1, . . . , T_n) conditional on N(t) = n is given by
f_{(T_1,...,T_n)|N(t)=n}(t_1, . . . , t_n) = (n!/t^n) 1(0 ≤ t_1 < · · · < t_n ≤ t). (763)
PROOF. We first show (ii); we give a 'physics' style proof here. First note that 0 < T_1 < · · · < T_n < t with probability 1: the ordering follows from the definition of arrival times, and T_n < t since we are conditioning on N(t) = n. Hence we only need to verify the formula (763) for 0 < t_1 < · · · < t_n < t. Fix such t_i's. Consider an ε-box at the point (t_1, . . . , t_n): (t_1, t_1 + ε] × · · · × (t_n, t_n + ε]. We will assume that 0 < ε < min(t_1, t_2 − t_1, . . . , t_n − t_{n−1}, t − t_n), so that the intervals (0, t_1], (t_1, t_1 + ε], (t_1 + ε, t_2], . . . , (t_n, t_n + ε], (t_n + ε, t] are disjoint. By the mean value theorem for integrals,
lim_{ε→0} (1/ε^n) P(T_1 ∈ (t_1, t_1 + ε], . . . , T_n ∈ (t_n, t_n + ε] | N(t) = n) = f_{(T_1,...,T_n)|N(t)=n}(t_1, . . . , t_n). (764)

Observe that
P(T_1 ∈ (t_1, t_1 + ε], . . . , T_n ∈ (t_n, t_n + ε] | N(t) = n) (765)
= P(N((0, t_1]) = 0, N((t_1, t_1 + ε]) = 1, N((t_1 + ε, t_2]) = 0, . . . , N((t_n, t_n + ε]) = 1, N((t_n + ε, t]) = 0 | N(t) = n) (766)
= P(N((0, t_1]) = 0, N((t_1, t_1 + ε]) = 1, . . . , N((t_n, t_n + ε]) = 1, N((t_n + ε, t]) = 0) / P(N(t) = n) (767)
= (n! / ((λt)^n e^{−λt})) · e^{−λt_1} ((λε)e^{−λε}) (e^{−λ(t_2−t_1−ε)}) ((λε)e^{−λε}) (e^{−λ(t_3−t_2−ε)}) · · · ((λε)e^{−λε}) (e^{−λ(t−t_n−ε)}) (768)
= (n! / ((λt)^n e^{−λt})) · (λε)^n e^{−λt} (769)
= (n!/t^n) ε^n, (770)
where the factor (λε)e^{−λε} appears n times, once for each interval (t_i, t_i + ε]. Hence dividing both sides by ε^n (recalling (764)) and taking the limit ε → 0, we obtain the desired result.
Next, we deduce (i). Let U_1, . . . , U_n be i.i.d. Uniform([0, t]) RVs. Then the joint PDF of (U_1, . . . , U_n) is constant over the cube [0, t]^n with value 1/t^n. But for each choice of values 0 < t_1 < · · · < t_n < t for the order statistics U_{n;1}, . . . , U_{n;n}, there are n! ways that these values can be realized by U_1, . . . , U_n, and each case has the same probability by symmetry. Hence the joint PDF of U_{n;1}, . . . , U_{n;n} is constant on the region {(t_1, . . . , t_n) : 0 < t_1 < · · · < t_n < t} with constant value n!/t^n. This is exactly the same joint PDF as in (ii). □

Example 6.6.2. Let N ∼ PPP(λ). Let 0 < r < t, and let us compute P(T_2 < r | N(t) = 2), the probability that the second arrival is by time r given that there are a total of two arrivals up to time t. By using Proposition 6.6.1, we have
f_{(T_1,T_2)|N(t)=2}(t_1, t_2) = (2/t²) 1(0 < t_1 < t_2 < t). (771)
Hence we get
P(T_2 < r | N(t) = 2) = ∫_0^r ∫_0^{t_2} (2/t²) dt_1 dt_2 = (2/t²)(r²/2) = r²/t². (772)
To go one step further, by differentiating the above by r , we can obtain the conditional PDF of T2 given
N (t ) = 2:
f_{T_2|N(t)=2}(x) = (2x/t²) 1(0 < x < t). (773)

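The conditional uniformity is also visible in simulation: conditioning a PP(λ) on {N(t) = 2}, the estimate of P(T_2 < r | N(t) = 2) should be near r²/t². A sanity-check sketch (NumPy assumed; parameters illustrative):

import numpy as np

rng = np.random.default_rng(7)
lam, t, r, trials = 1.0, 5.0, 3.0, 2 * 10**5
tau = rng.exponential(1 / lam, size=(trials, 30))
T = np.cumsum(tau, axis=1)
N_t = (T <= t).sum(axis=1)
cond = N_t == 2                               # keep only trials with N(t) = 2
T2 = T[cond, 1]                               # second arrival time in those trials
print(np.mean(T2 < r), (r / t) ** 2)          # both close to r^2/t^2 = 0.36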
6.7. Nonhomogeneous Poisson process


In this section, we introduce Poisson processes with a time-varying rate λ(t).
Example 6.7.1. Consider a counting process (N(t))_{t≥0}, which follows PP(λ_1) on the interval [1, 2), PP(λ_2) on
interval [2, 3), and PP(λ3 ) on interval [3, 4]. Further assume that the increments N (2) − N (1), N (3) − N (2),
and N (4) − N (3) are independent. Then what is the distribution of the total number of arrivals during
[1, 4]? Since we can add independent Poisson RVs and get a Poisson RV with added rates, we get
N (4) − N (1) = [N (4) − N (3)] + [N (3) − N (2)] + [N (2) − N (1)] ∼ Poisson(λ1 + λ2 + λ3 ). (774)
Note that the combined rate λ1 + λ2 + λ3 can be seen as the integral of the step function f (t ) = λ1 1(t ∈
[1, 2)) + λ2 1(t ∈ [2, 3)) + λ3 1(t ∈ [3, 4]). ▲
In general, suppose we have a concatenation of Poisson processes on disjoint intervals of very small lengths. Then N(t) − N(s) can be seen as the sum of independent increments over the interval [s, t], and by additivity of independent Poisson increments, it follows a Poisson distribution with rate given by a 'Riemann sum'. As the lengths of the intervals go to zero, this Riemann sum of rates tends to the integral of the rate function λ(r) over the interval [s, t]. This suggests the following definition of nonhomogeneous Poisson processes.
Definition 6.7.2. An arrival process (T_k)_{k≥1} is said to be a Poisson process with rate λ(t) if its counting process (N(t))_{t≥0} satisfies the following properties:
(i) N (0) = 0.
(ii) (N (t ))t ≥0 has independent increments.
(iii) For any 0 ≤ s < t , N (t ) − N (s) ∼ Poisson(µ) where
µ = ∫_s^t λ(r) dr. (775)
Example 6.7.3. A store opens at 8 AM. From 8 until 10 AM, customers arrive at a Poisson rate of four per
hour. Between 10 AM and 12 PM, they arrive at a Poisson rate of eight per hour. From 12 PM to 2 PM the
arrival rate increases steadily from eight per hour at 12 PM to ten per hour at 2 PM; and from 2 PM to 5 PM, the arrival rate drops steadily from ten per hour at 2 PM to four per hour at 5 PM. Let us determine the probability distribution of the number of customers that enter the store on a given day (between 8 AM and 5 PM).
Let λ(t) be the rate function described above, with t measured in hours after midnight, and let N(t) be a nonhomogeneous Poisson process with this rate function. From the description above, we can compute m = ∫_8^{17} λ(s) ds = 63. Hence
N(17) − N(8) ∼ Poisson(63). (776)
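A standard way to simulate such a nonhomogeneous process is thinning: generate a PP(λ_max) with λ_max ≥ λ(s) for all s and keep an arrival at time s with probability λ(s)/λ_max. The sketch below applies this recipe to the store example (NumPy assumed; the thinning construction is standard but not developed in these notes) and checks that the daily count has mean and variance near 63.

import numpy as np

def lam(s):
    # Arrival rate (customers/hour) at clock time s, for 8 <= s <= 17.
    if s < 10: return 4.0
    if s < 12: return 8.0
    if s < 14: return 8.0 + (s - 12.0)        # increases 8 -> 10 linearly
    return 10.0 - 2.0 * (s - 14.0)            # decreases 10 -> 4 linearly

rng = np.random.default_rng(8)
lam_max, counts = 10.0, []
for _ in range(5000):
    n, s = 0, 8.0
    while True:
        s += rng.exponential(1 / lam_max)     # candidate arrival from PP(lam_max)
        if s > 17.0:
            break
        if rng.random() < lam(s) / lam_max:   # keep with probability lam(s)/lam_max
            n += 1
    counts.append(n)
counts = np.asarray(counts)
print(counts.mean(), counts.var())            # both ~ 63, as in (776)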


Exercise 6.7.4. Let (Tk )k≥1 ∼ PP(λ(t )). Let (τk )k≥1 be the inter-arrival times.
(i) Let Z(t) be the waiting time for the first arrival after time t. Show that
P(Z(t) ≥ x) = exp(−∫_t^{t+x} λ(r) dr). (777)
(ii) From (i), deduce that τ_1 has PDF
f_{τ_1}(t) = λ(t) e^{−∫_0^t λ(r) dr}. (778)
(iii) Denote µ(t) = ∫_0^t λ(s) ds. Use (i) and conditioning on τ_1 to show
P(τ_2 > x) = E_{τ_1}[ P(τ_2 > x | τ_1) ] (779)
= ∫_0^∞ P(τ_2 > x | τ_1 = t) f_{τ_1}(t) dt (780)
= ∫_0^∞ e^{−(µ(t+x)−µ(t))} λ(t) e^{−µ(t)} dt (781)
= ∫_0^∞ λ(t) e^{−µ(t+x)} dt. (782)
Conclude that τ1 and τ2 do not necessarily have the same distribution.
The following exercise shows that the nonhomogeneous Poisson process with rate λ(t ) can be ob-
tained by a time change of the usual Poisson process of rate 1.
Exercise 6.7.5. Let (N0 (t ))t ≥0 be the counting process of a Poisson process of rate 1. Let λ(t ) denote a
non-negative function of t , and let
m(t) = ∫_0^t λ(s) ds. (783)
Define N (t ) by
N (t ) = N0 (m(t )) = # arrivals during [0, m(t )]. (784)
Show that (N (t ))t ≥0 is the counting process of a Poisson process of rate λ(t ).
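The time change also gives a direct simulation recipe: generate rate-1 arrival times S_1 < S_2 < · · · and map them back through m^{−1}. Below is a sketch for the store example of Example 6.7.3 (NumPy assumed); the piecewise formula for m is computed from that example's rate function, and m is inverted numerically on a grid.

import numpy as np

def m(t):
    # m(t) = int_8^t lam(s) ds for the store's rate function, 8 <= t <= 17.
    if t <= 10: return 4.0 * (t - 8)
    if t <= 12: return 8.0 + 8.0 * (t - 10)
    if t <= 14: return 24.0 + 8.0 * (t - 12) + 0.5 * (t - 12) ** 2
    return 42.0 + 10.0 * (t - 14) - (t - 14) ** 2

rng = np.random.default_rng(9)
grid = np.linspace(8.0, 17.0, 2001)               # grid for inverting m numerically
m_grid = np.array([m(u) for u in grid])           # increasing since lam(s) > 0
S = np.cumsum(rng.exponential(1.0, size=200))     # rate-1 arrival times of N_0
S = S[S <= m_grid[-1]]                            # keep those with S_k <= m(17) = 63
clock_times = np.interp(S, m_grid, grid)          # T_k = m^{-1}(S_k)
print(len(clock_times))                           # one day's arrival count, ~ Poisson(63)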

6.8. PP via asymptotic properties of the counting process (Optional*)


In this section, we give yet another definition of Poisson process in terms of the asymptotic proper-
ties of its counting process. For this, we need something called the ‘small-o’ notation. We say a function
f (t ) is of order o(t ) or write f (t ) = o(t ) if
lim_{t→0} f(t)/t = 0. (785)
Definition 6.8.1 (Def of PP-counting-asymptotic). Suppose a counting process (N (t ))t ≥0 satisfies the fol-
lowing conditions (and denote (N (t ))t ≥0 ∼ PP0 (λ)):
(i) N (0) = 0;
(ii) P(N (t ) = 0) = 1 − λt + o(t );
(iii) P(N (t ) = 1) = λt + o(t );
(iv) P(N (t ) ≥ 2) = o(t );
(v) (Independent increment) For any t , s ≥ 0, N (t + s) − N (t ) is independent of (N (u))u≤t ;
(vi) (Stationary increment) For any t, s ≥ 0, the distribution of N(t + s) − N(t) does not depend on t.
It is easy to see that our usual definition of Poisson process in Definition 6.2.1 satisfies the properties
(i)-(vi) above.
Proposition 6.8.2. Let (N (t ))t ≥0 ∼ PP(λ) (see Definition 6.2.1). Then (N (t ))t ≥0 ∼ PP0 (λ).

PROOF. Using the Taylor expansion of the exponential function, note that
e^{−λt} = 1 − λt + o(t) (786)
for all t > 0. Hence
P(N(t) = n) = ((λt)^n/n!) e^{−λt} = (1 − λt + o(t)) (λt)^n/n!. (787)
So plugging in n = 0 and 1 gives (ii) and (iii). For (iv), we use (ii) and (iii) to get
P(N (t ) ≥ 2) = 1 − P(N (t ) ≤ 1) = 1 − (1 − λt + o(t )) − (λt + o(t )) = o(t ). (788)
Lastly, (v) and (vi) follow from the memoryless property of the Poisson process (Proposition 6.3.4). □

Next, we consider the converse implication. We will break this into several exercises.

Exercise 6.8.3. Let (N(t))_{t≥0} be a Poisson process with rate λ > 0 in the sense of Definition 6.8.1. In this exercise, we will show that P(N(t) = 0) = e^{−λt}.
(i) Use independent/stationary increment properties to show that
P(N (t + h) = 0) = P(N (t ) = 0, N (t + h) − N (t ) = 0) (789)
= P(N (t ) = 0)P(N (t + h) − N (t ) = 0) (790)
= P(N (t ) = 0)(1 − λh + o(h)). (791)
(ii) Denote f_0(t) = P(N(t) = 0). Use (i) to show that
(f_0(t + h) − f_0(t))/h = (−λ + o(h)/h) f_0(t). (792)
By taking the limit as h → 0, show that f_0(t) satisfies the differential equation
d f_0(t)/dt = −λ f_0(t). (793)
(iii) Conclude that P(N(t) = 0) = e^{−λt}.

Next, we generalize the ideas used in the previous exercise to compute the distribution of N (t ).

Exercise 6.8.4. Let (N(t))_{t≥0} be a Poisson process with rate λ > 0 in the sense of Definition 6.8.1. Denote f_n(t) = P(N(t) = n) for each n ≥ 0.
(i) Show that
P(N (t ) ≤ n − 2, N (t + h) = n) ≤ P(N (t + h) − N (t ) ≥ 2). (794)
Conclude that
P(N (t ) ≤ n − 2, N (t + h) = n) = o(h). (795)
(ii) Use (i) and independent/stationary increment properties to show that
f n (t + h) = P(N (t + h) = n) = P(N (t ) = n, N (t + h) − N (t ) = 0) (796)
+ P(N (t ) = n − 1, N (t + h) − N (t ) = 1) (797)
+ P(N (t ) ≤ n − 2, N (t + h) = n) (798)
= f n (t )(1 − λh + o(h)) + f n−1 (t )(λh + o(h)) + o(h). (799)
(iii) Use (ii) to show that the following differential equation holds:
d f_n(t)/dt = −λ f_n(t) + λ f_{n−1}(t). (800)

(iv) By multiplying (800) by the integrating factor µ(t) = e^{λt}, show that
(e^{λt} f_n(t))′ = λ e^{λt} f_{n−1}(t). (801)
For n ≥ 1, use the initial condition f_n(0) = P(N(0) = n) = 0 to derive the recursive equation
f_n(t) = λ e^{−λt} ∫_0^t e^{λs} f_{n−1}(s) ds. (802)
(v) Use induction to conclude that f_n(t) = (λt)^n e^{−λt}/n!.
(vi) Conclude that for all t , s ≥ 0 and n ≥ 0,
N (t + s) − N (s) ∼ Poisson(λt ). (803)
CHAPTER 7

Renewal Processes: Examples and Queuing Theory

7.1. Age and residual life


Suppose buses arrive at a stop according to a renewal process with arrival times (Tn )n≥1 and counting
process (N (t ))t ≥0 . Assume that we get to the stop at time s. How much time has passed from the last
bus? How long should we wait for the next bus? These quantities are called the age and the residual life, respectively. Below we give precise definitions:
Z_s := T_{N(s)+1} − s = residual life (waiting time until the next bus from time s), (804)
A_s := s − T_{N(s)} = age (time passed from the previous bus until time s).
What can we say about the above random variables for fixed s or as s → ∞?
We first address the above question for the special case of Poisson processes.
Proposition 7.1.1. Let (N (t ))t ≥0 ∼ PP(λ). Then the age and residual life, A s and Z s in (804) satisfy the
following.
(i) For each fixed s > 0, Z s and A s are independent with distributions Z s ∼ Exp(λ) and A s ∼ τ ∧ s, where
τ ∼ Exp(λ).
(ii) As s → ∞, the random vector (A_s, Z_s) converges in distribution to a pair of independent Exp(λ) RVs.
P ROOF. We first show (i). By the memoryless property of PP, once we restart the process at time s, the
future process (N (s + t ) − N (s))t ≥0 is another PP(λ) that is independent of the past process (N (t ))0≤t ≤s .
Since Z s and A s are determined by the future and the past processes, respectively, we see that they are
independent. Also, Z s is exactly the first inter-arrival time of the restarted PP(λ), so it has the distribution
Exp(λ). For A s , note that for x ≥ 0,
P(A_s ≥ x) = P(no arrival during [s − x, s]) 1(0 ≤ x ≤ s) (805)
= e^{−λx} 1(0 ≤ x ≤ s). (806)
Note that, if τ ∼ Exp(λ), then for x ≥ 0,
P(τ ∧ s ≥ x) = P(τ ≥ x) 1(0 ≤ x ≤ s) = e^{−λx} 1(0 ≤ x ≤ s) = P(A_s ≥ x). (807)
It follows that A s and τ ∧ s have the same distribution, as desired.
Next, (ii) follows from (i). Indeed, as s → ∞, the distribution of Z s remains the same and the distri-
bution of A s converges to Exp(λ), and these RVs remain independent for all s. □
Therefore, if buses arrive according to a PP(λ), then the expected waiting time until the next bus is always E[Z_s] = 1/λ, regardless of s. So, regardless of how many students are already waiting at the stop, you may expect your ride in 1/λ time on average!
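A numerical sanity check of Proposition 7.1.1(i) (NumPy assumed; λ, s, and the sample sizes are illustrative): the residual life should have mean 1/λ, the age should have mean E[τ ∧ s] = (1 − e^{−λs})/λ, and the two should be uncorrelated.

import numpy as np

rng = np.random.default_rng(10)
lam, s, trials = 1.0, 2.5, 10**5
tau = rng.exponential(1 / lam, size=(trials, 40))
T = np.cumsum(tau, axis=1)
idx = np.argmax(T > s, axis=1)                    # first arrival strictly after s
Z = T[np.arange(trials), idx] - s                 # residual life Z_s
prev = np.where(idx > 0, T[np.arange(trials), idx - 1], 0.0)
A = s - prev                                      # age A_s (equals s if no arrival yet)
print(Z.mean(), 1 / lam)                          # residual life ~ Exp(lam)
print(A.mean(), (1 - np.exp(-lam * s)) / lam)     # age ~ min(tau, s) in distribution
print(np.corrcoef(A, Z)[0, 1])                    # ~ 0, consistent with independence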
Next, we discuss the age and residual life for general arrival processes. Notice that, in the proof of Proposition 7.1.1, the use of the memoryless property of PPs was crucial, and this property is not available for general arrival processes. Hence we can expect the answers in the general case to be not as simple as in the Poisson case.
Fix 0 ≤ x ≤ s and y ≥ 0. Note that
P(A s > x, Z s > y) = P(no arrival during [s − x, s + y]). (808)


The right hand side is e^{−λ(x+y)} for a PP(λ), but it is not easy to evaluate for general renewal processes. Hence, we ask an easier question instead. In the long run, what fraction of time do we have A_s > x and Z_s > y? Namely, this is the fraction of time at which we missed the previous bus by more than x and should wait more than y until the next bus.

Proposition 7.1.2. Let (Tn )n≥1 denote the arrival times of a renewal process and suppose the inter-arrival
times (τn )n≥1 have finite expectation. Let A s and Z s denote the age and the residual life as in (804). Then
lim_{t→∞} (1/t) ∫_0^t 1(A_s > x, Z_s > y) ds = E[(τ_1 − (x + y))^+] / E[τ_1] = (1/E[τ_1]) ∫_{x+y}^∞ P(τ_1 ≥ z) dz. (809)
In particular, by taking x = 0 or y = 0, we have
lim_{t→∞} (1/t) ∫_0^t 1(A_s > x) ds = (1/E[τ_1]) ∫_x^∞ P(τ_1 ≥ z) dz, (810)
lim_{t→∞} (1/t) ∫_0^t 1(Z_s > y) ds = (1/E[τ_1]) ∫_y^∞ P(τ_1 ≥ z) dz. (811)
PROOF. We will frame this problem as a renewal reward process. Namely, the renewal process is given by the same arrival times (T_n)_{n≥1} of the buses, each 'cycle' is the inter-arrival interval (T_{n−1}, T_n], and the 'reward' from that interval is the amount of time that the event {A_s > x, Z_s > y} holds during that interval, which is given by
R_n := ∫_{T_{n−1}}^{T_n} 1(A_s > x, Z_s > y) ds. (812)

By the renewal property, note that R n ’s are i.i.d. with common expectation E[R 1 ], which is the numerator
in the right hand side above. Hence by the renewal reward SLLN (Theorem 2.2.1), we can deduce
lim_{t→∞} (1/t) ∫_0^t 1(A_s > x, Z_s > y) ds = E[R_1] / E[τ_1]. (813)
Furthermore, observe that
1(A_s > x, Z_s > y, s ∈ [0, T_1]) = 1(no arrival during [s − x, s + y], s ∈ [0, T_1]) (814)
= 1([s − x, s + y] ⊆ [0, T_1]) (815)
= 1(x ≤ s ≤ T_1 − y) 1(T_1 ≥ x + y). (816)
Therefore, we get
R_1 = (T_1 − (x + y)) 1(T_1 ≥ x + y) = (τ_1 − (x + y))^+. (817)
Using the tail-sum formula for the expectation of nonnegative random variables,
E[R_1] = E[(τ_1 − (x + y))^+] = ∫_0^∞ P((τ_1 − (x + y))^+ ≥ z) dz = ∫_{x+y}^∞ P(τ_1 ≥ z) dz. (818)

Hence the assertion follows. □


Next, if we impose the additional restriction on the inter-arrival times that they do not take values in an arithmetic progression, then the long-run averages in Proposition 7.1.2 give the limiting distributions of the age and residual life.

Theorem 7.1.3. Let (Tn )n≥1 denote the arrival times of a renewal process and suppose the inter-arrival
times (τn )n≥1 have finite expectation. Let A s and Z s denote the age and the residual life as in (804). Further
assume that, for every a > 0,
P(τ_1 ∈ {a, 2a, 3a, . . . }) < 1. (819)

Then the following hold:
lim_{t→∞} P(A_t > x, Z_t > y) = (1/E[τ_1]) ∫_{x+y}^∞ P(τ_1 ≥ z) dz, (820)
lim_{t→∞} P(A_t > x) = (1/E[τ_1]) ∫_x^∞ P(τ_1 ≥ z) dz, (821)
lim_{t→∞} P(Z_t > y) = (1/E[τ_1]) ∫_y^∞ P(τ_1 ≥ z) dz. (822)
P ROOF. Omitted. □

7.2. Examples of Renewal Processes


Example 7.2.1. Let (X t )t ≥0 be an irreducible Markov chain on a finite state space Ω. Let X 0 = x ∈ Ω and
let Tk be the kth time that the chain returns to x. By the strong Markov property, the inter-arrival times
τk = Tk − Tk−1 are i.i.d. We also know that E[τk ] < ∞ since hitting times of finite-state irreducible chains
have finite mean (see Exercise 4.6 in Lecture note 1). Hence (Tk )k≥0 is a renewal process. ▲
Example 7.2.2. Let (ξ_k)_{k≥1} be a sequence of i.i.d. RVs with
P(ξ_k = 1) = P(ξ_k = −1) = 1/2. (823)

Define a random walk (S n )n≥0 by S 0 = 0 and S n = ξ1 + · · · + ξn . Let Tk denote the kth time that the walk
visits the origin. Note that one can view S n as a Markov chain on state space Ω = Z. Hence again by
the strong Markov property, the inter-arrival times τk = Tk − Tk−1 are i.i.d. However, later we will see
that E[τk ] = ∞. Hence (Tk )k≥0 is not a renewal process as we defined above. However, we can still call
it a ‘renewal process with infinite expected inter-arrival times’. This will make more sense since, albeit
having infinite mean, the inter-arrival times are finite almost surely. In other words, S n always returns to
the origin with probability 1.
How do we know that the walk S n will eventually return back to the origin? That is, do we know that
the inter-arrival times are finite almost surely? It is easy to see that S n is an irreducible Markov chain.
However, since the state space Ω for S n is infinite, this does not guarantee that S n returns to zero with
probability 1.
In order to see this, recall the Gambler’s ruin problem (Exercise 7.1 in Lecture note 1). Namely, let
(X t )t ≥0 be a Markov chain on state space ΩN = {0, 1, · · · , N } with transition probabilities

P(X t +1 = k + 1 | X t = k) = p, P(X t +1 = k − 1 | X t = k) = 1 − p ∀0 ≤ k < N . (824)

Let ρ = (1 − p)/p. We have seen that, for any 0 < i < N ,

P(X_t hits N before 0 | X_0 = i) = (1 + ρ + · · · + ρ^{i−1}) / (1 + ρ + · · · + ρ^{N−1}). (825)
For our case, let p = 1/2 so that ρ = 1. We can view X_t as the random walk S_n while it lies in the interval [0, N]. Hence

P(S n hits N before 0 | S 0 = i ) = i /N . (826)

Now by taking the limit N → ∞, which makes the right barrier at N fade away, we deduce
P(S_n never hits 0 | S_0 = i) = lim_{N→∞} P(S_n hits N before 0 | S_0 = i) = 0. (827)

This shows that S_n eventually hits 0 with probability 1 starting from any initial state S_0 = i > 0. Since S_n is symmetric, the same conclusion holds for negative starting locations. Hence the inter-arrival times are finite with probability 1. ▲

Exercise 7.2.5 (Markov chain with continuous-time jumps). Let (X k )k≥0 be an irreducible and aperi-
odic Markov chain on state space Ω = {1, 2, · · · , m} with transition matrix P = (p i j ). Let π be the unique
stationary distribution of the chain.
Suppose the chain spends an independent amount of time at each state x ∈ Ω, whose distribution F x
may depend only on x. For each real t ≥ 0, let Y (t ) ∈ Ω denote the state of the chain at time t .
(i) Fix x ∈ Ω, and let T_k^{(x)} denote the kth time that the Markov process (Y(t))_{t≥0} returns to x. Let (τ_k^{(x)})_{k≥1} and (N^{(x)}(t))_{t≥0} be the associated inter-arrival times and the counting process, respectively. Then
N^{(x)}(t) = number of visits to x that (Y(t))_{t≥0} makes up to time t. (833)
Show that (T_k^{(x)})_{k≥2} is a renewal process. Then, using the renewal SLLN, show that
P( lim_{t→∞} N^{(x)}(t)/t = 1/E[τ_1^{(x)}] ) = 1. (834)

(ii) Let Tk denote the kth time that the Markov process (Y (t ))t ≥0 jumps. Let (τk )k≥1 and (N (t ))t ≥0 be the
associated inter-arrival times and the counting process, respectively. Note that
N(t) = N^{(1)}(t) + N^{(2)}(t) + · · · + N^{(m)}(t). (835)
Use (i) to derive that
P( lim_{t→∞} N(t)/t = 1/E[τ_1^{(1)}] + · · · + 1/E[τ_1^{(m)}] ) = 1. (836)
(iii) Using the Markov chain SLLN, show that
à !
1 Xn
P lim 1(X k = x) = π(x) = 1. (837)
n→∞ n
k=1

From this, deduce
P( lim_{t→∞} N^{(x)}(t)/N(t) = π(x) ) = 1. (838)

(iv) From the previous parts, conclude that
π(x) = (1/E[τ_1^{(x)}]) / ( 1/E[τ_1^{(1)}] + · · · + 1/E[τ_1^{(m)}] ). (839)

In particular, suppose the chain spends unit time at each state x ∈ Ω. Then each τ_1^{(x)} equals the discrete return time of the Markov chain (X_k) to x, so we know π(x) = 1/E[τ_1^{(x)}]. Check that the above formula is consistent in this case.

Exercise 7.2.6 (Alternating renewal process). Let (τ_k)_{k≥1} be a sequence of independent RVs where
E[τ_{2k−1}] = µ_1, E[τ_{2k}] = µ_2 for all k ≥ 1. (840)
Define an arrival process (T_k)_{k≥1} by T_k = τ_1 + · · · + τ_k for all k ≥ 1.
(i) Is (Tk )k≥1 a renewal process?
(ii) Let (X k )k≥0 be a Markov chain on state space Ω = {1, 2} with transition matrix
P = [ 0 1 ; 1 0 ]. (841)
Show that the chain has π = [1/2, 1/2] as the unique stationary distribution.
(iii) Suppose the chain spends time τ_k at state X_k ∈ Ω in between the (k−1)st and kth jumps. For each real t ≥ 0, let Y(t) ∈ Ω denote the state of the chain at time t. Let N(t) be the number of jumps that (Y(t))_{t≥0} makes up to time t. Define
R^{(1)}(t) = Σ_{k=1}^{N(t)} τ_k 1(X_k = 1), (842)

which is the total amount of time that (Y(t))_{t≥0} spends at state 1. Use Exercise 7.2.5 to deduce
P( lim_{t→∞} R^{(1)}(t)/t = µ_1/(µ_1 + µ_2) ) = 1. (843)
Exercise 7.2.7 (Poisson janitor). (Excerpted from [Dur99]) A light bulb has a random lifespan with distribution F and mean µ_F. A janitor arrives at times according to a PP(λ), checks the bulb, and replaces it if it is burnt out. Suppose all bulbs have independent lifespans with the same distribution F.
(i) Let Tk be the kth time that the janitor arrives and replaces the bulb. Show that (Tk )k≥0 with T0 = 0 is
a renewal process.

(ii) Let (τ_k)_{k≥1} be the inter-arrival times of the renewal process defined in (i). Use the memoryless property of Poisson processes to show that

E[τk ] = µF + 1/λ ∀k ≥ 1. (844)

(iii) Let N (t ) be the number of bulbs replaced up to time t . Show that


P( lim_{t→∞} N(t)/t = 1/(µ_F + 1/λ) ) = 1. (845)
(iv) Let B(t) be the total duration that the bulb is working up to time t, that is,
B(t) = ∫_0^t 1(bulb is on at time s) ds. (846)

Use the renewal reward SLLN to show that
P( lim_{t→∞} B(t)/t = µ_F/(µ_F + 1/λ) ) = 1. (847)
(v) Let V(t) denote the total number of visits that the janitor has made by time t. Show that
P( lim_{t→∞} N(t)/V(t) = (1/λ)/(µ_F + 1/λ) ) = 1. (848)
That is, the fraction of visits at which the janitor replaces the bulb converges to (1/λ)/(µ_F + 1/λ) almost surely, which by (iv) is also the long-run fraction of time that the bulb is off.

7.3. Little’s Law


In this section, we will learn one of the cornerstones of queuing theory, which is called Little’s Law.
Roughly speaking, this reads

(average size of the system) = (arrival rate) × (average time spent in the system), (849)

or ℓ = λw for short. The power of Little’s law lies in its applicability in a very general situation; one can
even choose a portion of a gigantic queuing network and apply Little’s law to analyze the local behavior
of the system. We also remark that Little's law applies to deterministic queuing systems; for stochastic ones satisfying certain conditions, it holds with probability 1.
Consider a queuing system, where the kth customer arrives at time t_k, spends w_k units of time, and then exits the system. We assume no two customers arrive at the same time, that is, t_1 < t_2 < · · · . Let N(t)
and N d (t ) denote the number of arrivals and departures up to time t , respectively. Finally, let L(t ) denote
the number of customers in the system at time t . To summarize:

t_k = time that the kth customer enters the system, (850)
w_k = time that the kth customer spends in the system until exit, (851)
N(t) = Σ_{k=1}^∞ 1(t_k ≤ t) = number of arrivals up to time t, (852)
N^d(t) = Σ_{k=1}^∞ 1(t_k + w_k ≤ t) = number of departures up to time t, (853)
L(t) = N(t) − N^d(t) = size of the system at time t. (854)




FIGURE 1. Arrival times, wait times, number of arrivals, and number of departures.

FIGURE 2. System size at time t.

Now we introduce three key quantities which describe the average behavior of the system. Define the following quantities, whenever the limits exist:
ℓ = lim_{t→∞} (1/t) ∫_0^t L(s) ds = average size of the queue, (855)
λ = lim_{t→∞} N(t)/t = arrival rate, (856)
w = lim_{t→∞} (1/N(t)) Σ_{k=1}^{N(t)} w_k = average wait time. (857)

Little’s law gives a very simple relation between the above three average quantities.

Theorem 7.3.1 (Little’s law). If both λ and w exist and are finite, then so does ℓ and
ℓ = λw. (858)

P ROOF. Note that


The kth customer is in the queue at time t ⇐⇒ t k ≤ t < t k + w k . (859)
Hence we may write
L(t) = Σ_{k=1}^∞ 1(t_k ≤ t < t_k + w_k). (860)

Thus by Fubini’s theorem,


ZT ZT X

L(t ) d t = 1(t k ≤ t < t k + w k ) d t (861)
0 0k=1
∞ ZT
X
= 1(t k ≤ t < t k + w k ) d t (862)
k=1 0
NX
(T )
= min(T, t k + w k ) − t k . (863)
k=1
This yields
Σ_{k=1}^{N^d(T)} w_k ≤ ∫_0^T L(t) dt ≤ Σ_{k=1}^{N(T)} w_k. (864)
Now observe that
lim_{T→∞} (1/T) Σ_{k=1}^{N(T)} w_k = lim_{T→∞} (N(T)/T) · (1/N(T)) Σ_{k=1}^{N(T)} w_k = λw. (865)
On the other hand, using Exercise 7.3.2, we also have
lim_{T→∞} (1/T) Σ_{k=1}^{N^d(T)} w_k = lim_{T→∞} (N^d(T)/T) · (1/N^d(T)) Σ_{k=1}^{N^d(T)} w_k = λw. (866)
Hence the assertion follows from (864). □
Exercise 7.3.2. Let (t k )k≥0 be a sequence of arrival times and let w k be the time that the kth customer
spends in the system. Also, let N (t ) and N d (t ) denote the number of arrivals and departures up to time
t. Suppose the following limits exist and are finite:
λ := lim_{t→∞} N(t)/t, w := lim_{n→∞} (1/n) Σ_{k=1}^n w_k. (867)
We will show that
lim_{t→∞} N^d(t)/t = λ. (868)
(i) Write
w_n/n = (1/n) Σ_{k=1}^n w_k − ((n − 1)/n) · (1/(n − 1)) Σ_{k=1}^{n−1} w_k. (869)
Deduce that lim_{n→∞} w_n/n = 0. Hence for each δ > 0, there exists M(δ) > 0 such that
w_n ≤ δn for all n ≥ M(δ). (870)
(ii) Show that for each ε > 0, there exists T1 (ε) > 0 such that for all t > T1 (ε),
(λ − ε)εt ≤ N (εt ) ≤ (λ + ε)εt , (871)
(λ − ε)(1 − ε)t ≤ N ((1 − ε)t ) ≤ (λ + ε)(1 − ε)t . (872)
Also deduce that for all t > T1 (ε),
εt ≤ t_n < (1 − ε)t =⇒ (λ − ε)εt ≤ n ≤ (λ + ε)(1 − ε)t. (873)
(iii) Using (i) and (ii), show that
t > max( T_1(ε), (1/((λ − ε)ε)) · M( ε/(2(λ + ε)(1 − ε)) ) ) and εt ≤ t_n < (1 − ε)t =⇒ w_n ≤ (ε/2)t. (874)
Deduce that for each t large enough, there are at least (λ − ε − 2ελ)t arrivals during [εt , (1 − ε)t ]
and they all depart by time t .

FIGURE 3. All customers arriving during [εt, (1 − ε)t] depart by time t.

(iv) From (iii), show that for each ε > 0,
liminf_{t→∞} N^d(t)/t ≥ λ − ε − 2ελ. (875)
Finally, conclude (868).

Example 7.3.3 (Housing Market). Suppose the local real estate agency in Westwood estimates that it takes 120 days on average to sell a house; this number fluctuates with the economy and season, but it has been fairly stable over the past decade. We found out that on any given day last year, the number of houses for sale ranged from 20 to 30, with an average of 25. What can we say about the average number of transactions last year?
In order to apply Little's law, we view the housing market as a queuing system. Namely, we regard houses being put up for sale as arrivals to the system. The queue consists of unsold houses, and when houses are sold, we regard them as exiting the queue. Now from the description above, we set the average wait time w = 120 days and the average queue length ℓ = 25. Then by Little's law, we infer that the arrival rate is λ = ℓ/w = 25/120 houses per day, or roughly 75 houses per year. ▲
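Little's law can also be checked on a synthetic queue. The sketch below (NumPy assumed; all parameters arbitrary) simulates Poisson arrivals with i.i.d. sojourn times and estimates ℓ via the identity (1/T) ∫_0^T L(t) dt = (1/T) Σ_k [min(T, t_k + w_k) − t_k] derived in the proof of Theorem 7.3.1.

import numpy as np

rng = np.random.default_rng(12)
lam, T = 3.0, 10**4
t_arr = np.cumsum(rng.exponential(1 / lam, size=int(2 * lam * T)))
t_arr = t_arr[t_arr <= T]                     # arrival times in [0, T]
w = rng.uniform(0.5, 1.5, size=len(t_arr))    # sojourn times, mean 1.0 (arbitrary)

ell = np.sum(np.minimum(T, t_arr + w) - t_arr) / T   # time-average system size
lam_hat = len(t_arr) / T                             # empirical arrival rate
w_hat = w.mean()                                     # empirical mean sojourn time
print(ell, lam_hat * w_hat)                          # ell ~ lam * w, per Little's law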
Exercise 7.3.4 (SLLN for weighted sum). Let (X k )k≥1 be a sequence of i.i.d. RVs with finite variance. Let
(w k )k≥1 be a sequence of real numbers such that
w̄ = lim_{n→∞} (1/n) Σ_{k=1}^n w_k < ∞, lim_{n→∞} (1/n) Σ_{k=1}^n w_k² < ∞. (876)
Define S_n = Σ_{k=1}^n X_k w_k. In this exercise, we will show that, almost surely,
lim_{n→∞} S_n/n = E[X_1] w̄. (877)
(i) Write
S_n/n = (1/n) Σ_{k=1}^n E[X_k] w_k + (1/n) Σ_{k=1}^n (X_k − E[X_k]) w_k. (878)
Show that it suffices to show the assertion assuming E[X k ] = 0 for all k ≥ 1. We may assume this
for the following steps.
(ii) Use Chebyshef’s inequality to show that
à !
−2 2
X
n
2
P (S n ≥ t ) ≤ t E[X 1 ] wk . (879)
k=1

(iii) Use (ii) to conclude that
P( |S_n|/n ≥ 1/log n ) ≤ ((log n)²/n) E[X_1²] ( (1/n) Σ_{k=1}^n w_k² ). (880)
Deduce that
Σ_{n=1}^∞ P( S_n²/n² ≥ 1/(2 log n) ) < ∞. (881)

By the Borel-Cantelli lemma, this yields
P( lim_{n→∞} S_n²/n² = 0 ) = 1. (882)

(iv) Use (iii) to show that, for any sequence (n_k) of integers such that n_k → ∞ as k → ∞, we can choose a further subsequence (n_{k(r)}) such that
P( lim_{r→∞} S_{n_{k(r)}}²/n_{k(r)}² = 0 ) = 1. (883)

This implies that, almost surely as n → ∞,
0 = liminf_{n→∞} S_n/n ≤ limsup_{n→∞} S_n/n = 0. (884)
This implies lim_{n→∞} S_n/n = 0 almost surely. This shows (877).

Exercise 7.3.5 (Expected load in the server). Consider a single-server queuing system, which is determined by the arrival times (t_k)_{k≥1} and the total times spent in the system (w_k)_{k≥1} (a.k.a. sojourn times). Let N(t) and N^d(t) denote the number of arrivals and departures up to time t, respectively.
Each customer may wait for w̃_k time in the queue, and then spends s̃_k time in the server to get serviced, so that

w k = w̃ k + s̃ k . (885)

We may assume the following limits exist:

λ := lim_{t→∞} N(t)/t, w := lim_{n→∞} (1/n) Σ_{k=1}^n w_k, s̃² := lim_{n→∞} (1/n) Σ_{k=1}^n s̃_k² < ∞. (886)

(i) Show that the kth customer is in the server if and only if

t k + w̃ k ≤ t < t k + w̃ k + s̃ k . (887)

(ii) Let R(t ) denote the remaining service time of the current customer in the server. Use (i) to show that
R(t) = Σ_{k=1}^∞ (t_k + w̃_k + s̃_k − t) 1(t_k + w̃_k ≤ t < t_k + w̃_k + s̃_k). (888)

(iii) Use Fubini’s theorem to justify the following steps:


∫_0^T R(t) dt = Σ_{k=1}^{N(T)} ∫_0^T (t_k + w̃_k + s̃_k − t) 1(t_k + w̃_k ≤ t < t_k + w̃_k + s̃_k) dt (889)
= Σ_{k=1}^{N(T)} ∫_0^{min(s̃_k, T − t_k − w̃_k)} (s̃_k − u) du. (890)

(iv) From (iii), deduce


Σ_{k=1}^{N^d(T)} s̃_k²/2 ≤ ∫_0^T R(t) dt ≤ Σ_{k=1}^{N(T)} s̃_k²/2. (891)
Finally, derive the following formula for the average load in the server:
r := lim_{T→∞} (1/T) ∫_0^T R(t) dt = λ s̃²/2. (892)

Exercise 7.3.6 (Pollaczek-Khinchine formula). Consider a G/G/1 queue, where arrivals are given by a renewal process of rate λ and service times are i.i.d. copies of a RV S with finite mean and variance. We use the following notation:
T_k = kth arrival time, (893)
W̃_k = time that the kth customer spends in the queue, (894)
S̃_k = time that the kth customer spends in the server, (895)
W̃(t) = remaining time until exit of the last customer in the queue at time t. (896)
Note that lim_{h↓0} W̃(T_k − h) equals the waiting time W̃_k of the kth customer in the queue. We will assume that the following limit exists almost surely:
lim_{n→∞} (1/n) Σ_{k=1}^n W̃_k < ∞. (897)
The goal of this exercise is to show the following Pollaczek-Khinchine formula for the mean waiting time: almost surely,
w̃ := lim_{t→∞} (1/t) ∫_0^t W̃(s) ds = λ E[S²] / (2(1 − λ E[S])). (898)
(i) Let S(t) denote the sum of service times of all customers in the queue at time t. Show that
S(t) = \sum_{k=1}^{\infty} \tilde{S}_k\, 1(T_k \leq t < T_k + \tilde{W}_k). (899)
(ii) Let N(t) and N_d(t) denote the number of arrivals and departures up to time t. Use Fubini's theorem to show that
\int_0^T S(t)\, dt = \sum_{k=1}^{\infty} \tilde{S}_k \int_0^T 1(T_k \leq t < T_k + \tilde{W}_k)\, dt (900)
= \sum_{k=1}^{N(T)} \tilde{S}_k \left[ \min(T,\, T_k + \tilde{W}_k) - T_k \right]. (901)
Also, deduce that
\sum_{k=1}^{N_d(T)} \tilde{S}_k \tilde{W}_k \leq \int_0^T S(t)\, dt \leq \sum_{k=1}^{N(T)} \tilde{S}_k \tilde{W}_k. (902)
(iii) From (ii) and Exercise 7.3.4, show that
\lim_{T\to\infty} \frac{1}{T}\int_0^T S(t)\, dt = \lambda\, E[S]\, \tilde{w}. (903)
(iv) Let R(t) denote the remaining service time of the current customer in the server. Then \tilde{W}(t) = R(t) + S(t). Using (iii) and Exercise 7.3.5, conclude the PK formula (898).
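The PK formula can also be checked by simulation. The sketch below assumes an M/M/1 special case and uses the standard Lindley recursion for FIFO queueing delays, \tilde{W}_{k+1} = max(0, \tilde{W}_k + \tilde{S}_k − A_{k+1}), where A_{k+1} is the (k+1)th interarrival time; identifying the long-run customer average of \tilde{W}_k with the time average \tilde{w} is an extra step, justified for Poisson arrivals by the PASTA property.

import numpy as np

rng = np.random.default_rng(2)
lam, mu, n = 0.8, 1.0, 10**6            # rho = lam * E[S] = 0.8 < 1

A = rng.exponential(1 / lam, size=n)     # interarrival times
S = rng.exponential(1 / mu, size=n)      # service times
W = np.empty(n)
W[0] = 0.0
for k in range(n - 1):                   # Lindley recursion for FIFO waiting times
    W[k + 1] = max(0.0, W[k] + S[k] - A[k + 1])

ES, ES2 = 1 / mu, 2 / mu**2
print(W.mean(), lam * ES2 / (2 * (1 - lam * ES)))   # both ~ 4.0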
CHAPTER 8

Continuous-time Markov Chains

In this chapter, we introduce continuous-time Markov chains (CTMCs) and study various properties and examples. The continuous-time extension of MCs has a wide range of applications in simulating chemical or biological systems and also serves as a building block for Brownian motion.

8.1. Definition of CTMC


A continuous-time process (X_t)_{t≥0} on state space Ω is said to have the Markov property if for every t_0 > 0 and x ∈ Ω, the distribution of the future process (X_t)_{t≥t_0} given the present state X_{t_0} = x depends only on x and not on the past process (X_t)_{0≤t<t_0}. Moreover, as in the discrete-time case, we require that the transition probabilities do not depend on the time at which the transition takes place. These two properties are stated below:
(i) (Continuous-time Markov property) The distribution of X_t given previous states X_{s_0}, X_{s_1}, ..., X_{s_n}, s_0 < s_1 < ... < s_n < t, depends only on X_{s_n}. That is, for every 0 ≤ s_0 < ... < s_n < t and x_0, ..., x_n, y ∈ Ω,
P\left( X_t = y \,\middle|\, X_{s_n} = x_n, \ldots, X_{s_0} = x_0 \right) = P\left( X_t = y \,\middle|\, X_{s_n} = x_n \right). (904)

(ii) (Time-homogeneity) There exists a transition probability function (t, x, y) ↦ P_t(x, y) for t ≥ 0, x, y ∈ Ω such that for each 0 ≤ s ≤ t and x, y ∈ Ω,
P_{t-s}(x, y) = P\left( X_t = y \,\middle|\, X_s = x \right). (905)
A continuous-time process (X_t)_{t≥0} on state space Ω is called a continuous-time Markov chain (CTMC) if it satisfies the continuous-time Markov property. A time-homogeneous CTMC (X_t)_{t≥0} on state space Ω with transition probability function P_t is a CTMC that satisfies time-homogeneity with the transition probability function P_t.
The following proposition gives basic properties of the transition probability function.
Proposition 8.1.1. The transition probability function P_t of a time-homogeneous CTMC (X_t)_{t≥0} on state space Ω satisfies the following properties:
(i) P_t(x, y) ≥ 0 and \sum_{y\in\Omega} P_t(x, y) = 1 for all t ≥ 0 and x, y ∈ Ω;
(ii) P_0(x, y) = 1(x = y) for x, y ∈ Ω;
(iii) (Chapman-Kolmogorov eq.) P_{t+s}(x, z) = \sum_{y\in\Omega} P_t(y, z) P_s(x, y) for all s, t ≥ 0 and x, z ∈ Ω.

PROOF. Note that P_t(x, y) = P( X_t = y | X_0 = x ) is a conditional probability, so (i) and (ii) follow from properties of conditional probability. To see (iii),
P_{t+s}(x, z) = P\left( X_{t+s} = z \,\middle|\, X_0 = x \right) (906)
= \sum_{y\in\Omega} P\left( X_{t+s} = z, X_s = y \,\middle|\, X_0 = x \right) (907)
= \sum_{y\in\Omega} P\left( X_{t+s} = z \,\middle|\, X_s = y, X_0 = x \right) P\left( X_s = y \,\middle|\, X_0 = x \right) (908)
= \sum_{y\in\Omega} P\left( X_{t+s} = z \,\middle|\, X_s = y \right) P_s(x, y) (909)
= \sum_{y\in\Omega} P\left( X_t = z \,\middle|\, X_0 = y \right) P_s(x, y) (910)
= \sum_{y\in\Omega} P_t(y, z) P_s(x, y), (911)
where the last two equalities follow from the continuous-time Markov property and time-homogeneity. □
Remark 8.1.2. Proposition 8.1.1 (iii) is called the 'semigroup' property or the Chapman-Kolmogorov equation.
Example 8.1.3 (Homogeneous Poisson Process). Let (N(t))_{t≥0} ∼ PP(λ). We will show that it is a time-homogeneous CTMC on state space Z_{≥0}. Note that
P\left( N(t) = y \,\middle|\, N(s_n) = x_n, \ldots, N(s_0) = x_0 \right) (912)
= P\left( N((s_n, t]) = y - x_n \,\middle|\, N(s_0) = x_0, N((s_0, s_1]) = x_1 - x_0, \ldots, N((s_{n-1}, s_n]) = x_n - x_{n-1} \right) (913)
= P\left( N((s_n, t]) = y - x_n \right) (914)
= P\left( \mathrm{Poisson}(\lambda(t - s_n)) = y - x_n \right) (915)
= 1(y - x_n \geq 0)\, \frac{(\lambda(t - s_n))^{y - x_n} e^{-\lambda(t - s_n)}}{(y - x_n)!}, (916)
where for the second equality we used the independent increment property of the Poisson counting process N over disjoint intervals. An identical computation shows
P\left( N(t) = y \,\middle|\, N(s_n) = x_n \right) = 1(y - x_n \geq 0)\, \frac{(\lambda(t - s_n))^{y - x_n} e^{-\lambda(t - s_n)}}{(y - x_n)!}. (917)
Hence we verify the continuous-time Markov property as well as time-homogeneity, where the last expression gives the transition probability function:
P_t(x, y) = 1(y \geq x)\, \frac{(\lambda t)^{y - x} e^{-\lambda t}}{(y - x)!} = P\left( \mathrm{Poisson}(\lambda t) = y - x \right). (918)
The indicator 1(y ≥ x) above reflects that the counting process N(t) can only increase. ▲
Example 8.1.4 (Discrete-time Markov chain run by a Poisson process). Let (Y_n)_{n≥1} denote a discrete-time Markov chain on a countable state space Ω with transition matrix P. Suppose we have a Poisson arrival process (N(t))_{t≥0} ∼ PP(λ) independent of the chain (Y_n)_{n≥1}. We define a continuous-time process (X_t)_{t≥0} on the same state space Ω by
X_t = Y_{N(t)}. (919)
We call (X_t)_{t≥0} the Markov chain (Y_n)_{n≥1} run by the Poisson process N ∼ PP(λ). For instance, if there have been ten arrivals up to time t (N(t) = 10), then X_t = Y_{10}. In other words, X_t is a version of Y_n where we make the jumps given by Y_n at the arrival times T_k of the PP. Alternatively, the continuous-time chain X_t waits an independent Exp(λ) time between each jump (inter-arrival time of the PP) and makes the jumps modulated by Y_n.
We will show that (X_t)_{t≥0} is a time-homogeneous CTMC with transition probability function
P_t(x, y) = \sum_{k=0}^{\infty} \frac{(\lambda t)^k e^{-\lambda t}}{k!} P^k(x, y). (920)
Namely, the chain X_t jumps a random number of times during a period of length t, which has distribution Poisson(λt). Once this number is given, say k, then one has to make discrete-time jumps from x to y in k steps according to the transition matrix P, which occurs with probability P^k(x, y). Since k is random, the right hand side of (920) takes the form of the expectation of P^k(x, y), where k ∼ Poisson(λt).¹
Fix times 0 ≤ s_0 < s_1 < ... < s_{n−1} < s_n (= s) < t and states x_0, ..., x_{n−1}, x_n (= x), y ∈ Ω. Note that the numbers of arrivals during the disjoint intervals [0, s_0], (s_0, s_1], ..., (s_{n−1}, s_n], (s_n, t] are independent and follow Poisson distributions with rate λ times the lengths of the corresponding intervals. We will partition the probability of interest according to the values of these arrival counts. Namely, fix integers k_0, ..., k_{n+1} ≥ 0, and denote t_i := k_0 + ... + k_i for 0 ≤ i ≤ n. Then note that
P\left( Y_{N(t)} = y,\, N((s, t]) = k_{n+1} \,\middle|\, N(s_0) = k_0,\, Y_{N(s_0)} = x_0,\, N((s_0, s_1]) = k_1, \ldots, N((s_{n-1}, s_n]) = k_n,\, Y_{N(s)} = x \right)
= P\left( Y_{t_n + k_{n+1}} = y,\, N((s, t]) = k_{n+1} \,\middle|\, N(s_0) = k_0,\, Y_{t_0} = x_0,\, N((s_0, s_1]) = k_1, \ldots, N((s_{n-1}, s_n]) = k_n,\, Y_{t_n} = x \right)
= P\left( Y_{t_n + k_{n+1}} = y,\, N((s, t]) = k_{n+1} \,\middle|\, N(s_0) = k_0,\, N((s_0, s_1]) = k_1, \ldots, N((s_{n-1}, s_n]) = k_n,\, Y_{t_n} = x \right)
= P\left( Y_{t_n + k_{n+1}} = y,\, N((s, t]) = k_{n+1} \,\middle|\, Y_{t_n} = x \right)
= P\left( Y_{t_n + k_{n+1}} = y \,\middle|\, Y_{t_n} = x \right) P\left( N((s, t]) = k_{n+1} \right)
= P^{k_{n+1}}(x, y)\, P\left( N((s, t]) = k_{n+1} \right),
where the second equality uses the Markov property of Y_n, and the following equalities use the independence of N from Y_n as well as the independent increment property of N over disjoint intervals. Since the above holds for all k_0, ..., k_n ≥ 0, by averaging out the counting variables N(s_0), N((s_0, s_1]), ..., N((s_{n−1}, s_n]), we get
P\left( Y_{N(t)} = y,\, N((s, t]) = k_{n+1} \,\middle|\, Y_{N(s_0)} = x_0, \ldots, Y_{N(s)} = x \right) = P^{k_{n+1}}(x, y)\, P\left( N((s, t]) = k_{n+1} \right).

Thus, by summing the above expressions over k_{n+1} ≥ 0 and using the definition X_t = Y_{N(t)}, we obtain
P\left( X_t = y \,\middle|\, X_{s_0} = x_0, \ldots, X_{s_n} = x \right) = \sum_{k=0}^{\infty} P^k(x, y)\, P\left( N((s_n, t]) = k \right). (921)
Since the right hand side above does not depend on x_0, ..., x_{n−1}, averaging out X_{s_0}, ..., X_{s_{n−1}} gives
P\left( X_t = y \,\middle|\, X_{s_n} = x \right) = \sum_{k=0}^{\infty} P^k(x, y)\, P\left( N((s_n, t]) = k \right). (922)
This verifies both the continuous-time Markov property and the time-homogeneity. Noting that N((s_n, t]) ∼ Poisson(λ(t − s_n)), we have also obtained the desired transition probability function in (920). ▲
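The formula (920) is easy to test numerically. The sketch below uses a hypothetical two-state transition matrix P (our choice) and compares a truncation of the series (920) against a direct Monte Carlo simulation of X_t = Y_{N(t)}.

import numpy as np

rng = np.random.default_rng(3)
P = np.array([[0.9, 0.1],
              [0.5, 0.5]])
lam, t = 1.0, 2.0

# Truncated series (920): Poisson(lam*t) puts negligible mass beyond ~60 terms here.
Pt = np.zeros((2, 2))
Pk = np.eye(2)                       # P^k, starting from k = 0
weight = np.exp(-lam * t)            # P(Poisson(lam*t) = k), starting from k = 0
for k in range(60):
    Pt += weight * Pk
    Pk = Pk @ P
    weight *= lam * t / (k + 1)

# Monte Carlo: run the chain from x = 0 for N(t) ~ Poisson(lam*t) steps.
trials = 200_000
hits = 0
for _ in range(trials):
    state = 0
    for _ in range(rng.poisson(lam * t)):
        state = 0 if rng.random() < P[state, 0] else 1
    hits += (state == 0)
print(Pt[0, 0], hits / trials)       # should agree up to Monte Carlo error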
Example 8.1.5. Let (Y_n)_{n≥1} be a discrete-time Markov chain on state space Z_{≥0} with deterministic jumps P(Y_{n+1} = Y_n + 1 | Y_n) = 1 for n ≥ 1. Suppose we have an independent Poisson arrival process (N(t))_{t≥0} ∼ PP(λ), and let X_t = Y_{N(t)}. Then by Example 8.1.4, (X_t)_{t≥0} is a time-homogeneous CTMC on Z_{≥0}. In fact, in this case,
X_t = Y_{N(t)} = N(t). (923)
Hence the Poisson counting process N(t), viewed as a CTMC, is a special case of a MC run by a PP. ▲
Example 8.1.6 (Continuous-time SSRW). Let (Y_n)_{n≥1} be a SSRW on Z with Y_0 = 0. That is, P(Y_{n+1} − Y_n = 1) = P(Y_{n+1} − Y_n = −1) = 1/2 for n ≥ 1. Suppose (N(t))_{t≥0} ∼ PP(1) is an independent Poisson counting process and let X_t = Y_{N(t)}. In Example 8.1.4, we have seen that (X_t)_{t≥0} is a time-homogeneous CTMC on Z. Probabilistically, X_t operates as follows: we wait for an independent Exp(1) amount of time, flip a fair coin independently of everything else, and move up or down by 1 according to the outcome.

¹The formula (920) is also called the 'Poissonization' (Poisson average) of the quantity P^k(x, y).

Let P denote the transition matrix of the SSRW Y_n on Z. From (920), we have
P_t(x, x) = \sum_{k=0}^{\infty} \frac{t^k e^{-t}}{k!} P^k(x, x) (924)
= \sum_{n=0}^{\infty} \frac{t^{2n} e^{-t}}{(2n)!} P^{2n}(x, x) (925)
= \sum_{n=0}^{\infty} \frac{t^{2n} e^{-t}}{(2n)!}\, 2^{-2n} \binom{2n}{n} (926)
= e^{-t} \sum_{n=0}^{\infty} \frac{(t/2)^{2n}}{(n!)^2}. (927)
Note that for small t, we have P_t(x, x) = 1 − t + O(t²), so P_t(x, x) ≈ 1 for small t. (A similar observation holds for general CTMCs. Check!) ▲
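In fact, the series (927) equals e^{−t} I_0(t), where I_0 is the modified Bessel function of the first kind; this identification is an observation we add here, not from the notes. The snippet below compares a truncation of (927) with scipy's Bessel function and with the small-t expansion 1 − t.

import math
import numpy as np
from scipy.special import iv   # modified Bessel function I_nu

for t in [0.1, 0.5, 1.0, 2.0]:
    series = sum((t / 2) ** (2 * n) / math.factorial(n) ** 2 for n in range(40))
    print(t, np.exp(-t) * series, np.exp(-t) * iv(0, t), 1 - t)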

8.2. Construction of CTMC


8.2.1. Three constructions of CTMC. In this subsection, we give three constructions of CTMC that satisfy the continuous-time Markov property as well as time-homogeneity. Each construction has its own advantages. Throughout this subsection, Ω denotes the state space of the CTMC that we are going to construct.
Definition 8.2.1 (Construction I: Poisson clocks on edges). Suppose we have the following objects:
(i) (Initial distribution) A probability distribution µ on Ω;
(ii) (Jump rates) A matrix q : Ω × Ω → [0, ∞) with q(x, x) ≡ 0.
We then generate a continuous-time process (X_t)_{t≥0} as follows:
1. Sample X_0 ∈ Ω according to µ.
2. At each edge (x, y) ∈ Ω², place a 'Poisson clock', which rings according to PP(q(x, y)).
3. If at any time X_t = x, then X_t stays there until the first time some clock on (x, y) rings, and then jumps to y.
4. Go to 3.
Remark: If the jump rate q(x, y) of the edge (x, y) is large, then it is more likely to make the jump x → y at x. At each time at x, there is a competition between exponential variables of distribution Exp(q(x, y)) for y ≠ x. The jump rate matrix q does not need to have row sums equal to 1 (as for Markov transition matrices). This construction is also known as the graphical construction.
Definition 8.2.2 (Construction II: Jump chain and holding times). Suppose we have the following objects:
(i) (Initial distribution) A probability distribution µ on Ω;
(ii) (Routing matrix) A transition matrix R for a (discrete-time) Markov chain on Ω;
(iii) (Exponential holding parameter) A function λ : Ω → [0, ∞).
We then generate a continuous-time process (X_t)_{t≥0} as follows:
1. Sample X_0 ∈ Ω according to µ.
2. If at any time X_t = x, then sample τ ∼ Exp(λ(x)), wait at x until time t + τ, sample y ∼ R(x, ·), and jump to y.
3. Go to 2.
Note that if at any time X_t = x and λ(x) = 0, then X_t stays at x forever thereafter; that is, x is an absorbing state. This is because τ = ∞ almost surely if τ ∼ Exp(0).
Remark: If x ∈ Ω is absorbing, then we can set λ(x) = 0 and R(x, x) = 1. Otherwise, we should have λ(x) > 0 and X_t eventually jumps to some y ≠ x. Since jumping back to x itself does not give a proper jump, we can set R(x, x) = 0 without loss of generality.

Example 8.2.3. Suppose Ω = Z, X 0 = 0, λ(x) ≡ 1, and R(x, y) = 1(y = x + 1). What is (X t )t ≥0 ?


Example 8.2.4. Suppose Ω = {0, 1}, X 0 = 0, λ(x) ≡ 1, and R(x, y) = 1(y 6= x). What is (X t )t ≥0 ?
Exercise 8.2.5. Show that Definitions 8.2.2 and 8.2.1 are equivalent under the relation
q(x, y) = λ(x)\, R(x, y) for x ≠ y in Ω, (928)
assuming the routing matrix R satisfies R(x, x) ≡ 0. You may argue with the following steps.
(i) (From Poisson clocks) R(x, y) equals the probability that the Poisson clock on (x, y) rings first, which is \frac{q(x, y)}{\sum_{y' \neq x} q(x, y')} by competing exponentials (see Exercise 6.1.4). On the other hand, the holding time τ_x at x is the minimum of exponential RVs with parameters q(x, y) for y ≠ x, so by Exercise 6.1.3, τ_x ∼ Exp(λ(x)), where λ(x) = \sum_{y \neq x} q(x, y). Combine these results.
(ii) (From holding times) Define the jump rates q(x, y), x ≠ y, according to (928). Since R(x, x) ≡ 0, one can show λ(x) = \sum_{y \neq x} q(x, y) and R(x, y) = \frac{q(x, y)}{\sum_{y' \neq x} q(x, y')}. Use the argument in (i) to conclude consistency.
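Both constructions suggest the same simulation recipe, often called the Gillespie (direct) method: at state x, wait an Exp(λ(x)) time and then route according to R(x, ·) = q(x, ·)/λ(x). Below is a minimal sketch for a finite state space with hypothetical jump rates q (our choice of numbers).

import numpy as np

rng = np.random.default_rng(4)

q = np.array([[0.0, 1.0, 0.0],
              [2.0, 0.0, 1.0],
              [0.0, 3.0, 0.0]])          # hypothetical jump rates, q(x, x) = 0

def simulate_ctmc(x0, t_end):
    """Return jump times and visited states of the CTMC up to time t_end."""
    t, x = 0.0, x0
    times, states = [0.0], [x0]
    while True:
        lam_x = q[x].sum()               # holding parameter lambda(x)
        if lam_x == 0:                   # absorbing state
            break
        t += rng.exponential(1 / lam_x)  # holding time ~ Exp(lambda(x))
        if t > t_end:
            break
        x = rng.choice(len(q), p=q[x] / lam_x)   # routing matrix R(x, .)
        times.append(t)
        states.append(int(x))
    return times, states

times, states = simulate_ctmc(0, 10.0)
print(list(zip(times, states))[:5])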
Definition 8.2.6 (Construction III: MC run by PP). Suppose we have the following objects:
(i) (Initial distribution) A probability distribution µ on Ω;
(ii) (PP) A Poisson arrival process (T_n)_{n≥1} ∼ PP(α);
(iii) (Jump rates) A matrix q : Ω × Ω → [0, ∞) with q(x, x) ≡ 0; further assume \sum_{y \neq x} q(x, y) \leq \alpha for all x.
We then generate a continuous-time process (X_t)_{t≥0} as follows:
1. Sample X_0 ∈ Ω according to µ.
2. If at any time X_t = x, then X_t stays there until time T_{N(t)+1}. At time T_{N(t)+1}, X_t jumps to y with probability q(x, y)/α, and stays at x with probability 1 − \sum_{y \neq x} q(x, y)/\alpha.
3. Go to 2.
Proposition 8.2.7. Consider a CTMC (X_t)_{t≥0} as in Definition 8.2.6. Define a transition matrix P as
P(x, y) = \begin{cases} q(x, y)/\alpha & \text{if } y \neq x \\ 1 - \frac{1}{\alpha} \sum_{z \in \Omega \setminus \{x\}} q(x, z) & \text{if } y = x. \end{cases} (929)
Then (X_t)_{t≥0} is the Markov chain with transition matrix P run by PP(α) as in Example 8.1.4. Namely, if (Y_n)_{n≥1} is a Markov chain with transition matrix P and if (N(t))_{t≥0} ∼ PP(α), then X_t = Y_{N(t)}. Furthermore, the transition probability function P_t(x, y) is given by
P_t(x, y) = \sum_{k=0}^{\infty} \frac{(\alpha t)^k e^{-\alpha t}}{k!} P^k(x, y). (930)

PROOF. The first part of the assertion is clear from the construction. The transition probability function formula in (930) has been shown in the computation given in Example 8.1.4 to derive (920). □
Exercise 8.2.8. Show that Definitions 8.2.6 and 8.2.1 are equivalent assuming that there exists α > 0 such that \sum_{y \neq x} q(x, y) \leq \alpha for all x. You may argue with the following steps.
(i) Under the hypothesis, one can define the process (X_t)_{t≥0} in two ways. Suppose X_t = x and X_{t-} ≠ x. It suffices to show that the two definitions agree in distribution until the next jump.
(ii) (Holding time at x) (Poisson clocks) By merging all PP(q(x, y)) for y ≠ x, we see that X_t at x gets a clock ring according to PP(λ(x)) with λ(x) := \sum_{y \neq x} q(x, y). In particular, X_t waits at x for an Exp(λ(x)) amount of time. (Poisson arrivals) On the other hand, by thinning of PP(α) with probability λ(x)/α, the holding time at x follows Exp(λ(x)).
(iii) (Jump distribution) (From Poisson clocks) The probability of jumping from x to y equals the probability that the Poisson clock on (x, y) rings first, which is \frac{q(x, y)}{\sum_{y' \neq x} q(x, y')} by competing exponentials (see Exercise 6.1.4). (From Poisson arrivals) Once we know that we are jumping to some y ≠ x, the probability that the target is y is \frac{q(x, y)}{\sum_{y' \neq x} q(x, y')}.

Now, we will show that the equivalent continuous-time processes (X_t)_{t≥0} in Definitions 8.2.2-8.2.6 are time-homogeneous CTMCs if we can ensure that there is no explosion of jumps, that is, no infinite number of jumps during a finite time interval.
Theorem 8.2.9. The continuous-time process (X_t)_{t≥0} in Definition 8.2.2 is a time-homogeneous CTMC if there is no explosion of jumps. A sufficient condition is the following:
There exists α > 0 such that \sum_{y \neq x} q(x, y) \leq \alpha for all x ∈ Ω. (931)
PROOF. Omitted. One can imitate the proof of the renewal SLLN (Theorem 2.1.6). □
Example 8.2.10 (M/M/1 queue). Suppose customers arrive at a single server according to a Poisson process with rate λ > 0. Customers get serviced one by one according to the first-in-first-out ordering, and suppose there is no cap on the queue. Finally, assume that each customer at the top of the queue takes an independent Exp(µ) time to get serviced and exit the queue.
This is called the M/M/1 queue in queuing theory. Here the name of this model follows Kendall's notation, where the first 'M' stands for memorylessness of the arrival process, the second 'M' stands for memorylessness of the service times, and '1' means there is a single server. One can think of the M/M/c queue with c servers, in general.
The M/M/1 queue can be described nicely as a CTMC with Poisson clocks. The state space is Ω = Z_{≥0}. Since customers arrive according to PP(λ) and existing customers depart according to PP(µ), the rightward transitions have jump rate λ and the leftward transitions have jump rate µ. At 0, which represents that there is no customer currently in the queue, the only possible jump is to 1 and it has jump rate λ. Hence we have the following jump rate diagram determining the M/M/1 queue:

[Diagram: states 0, 1, 2, 3, 4, ... with rightward arrows of rate λ and leftward arrows of rate µ]

FIGURE 1. Jump rate diagram of the M/M/1 queue
As in the construction of CTMC using jump chains and holding times (Definition 8.2.2), we can also analyze the jump chain of the M/M/1 queue. According to Exercise 8.2.5, we can compute the exponential holding parameter λ(·) and the routing matrix R as
λ(0) = λ, λ(x) = λ + µ for x ≥ 1, (932)
R(0, 1) = 1, R(x, x+1) = \frac{\lambda}{\lambda + \mu} for x ≥ 1, R(x, x−1) = \frac{\mu}{\lambda + \mu} for x ≥ 1. (933)

[Diagram: jump chain on states 0, 1, 2, 3, 4, ... with R(0, 1) = 1, rightward transition probabilities λ/(µ + λ), and leftward transition probabilities µ/(µ + λ)]

FIGURE 2. State space diagram of the jump chain of the M/M/1 queue

For instance, the above figure gives the state-space diagram for the jump chain with transition matrix R. ▲
Example 8.2.11 (Analyzing the jump chain of the M/M/1 queue). Consider the M/M/1 queue in Example 8.2.10. One of the main quantities of interest is the number of customers waiting in the queue at time t, which we denote by Y(t). Then (Y(t))_{t≥0} is a continuous-time stochastic process, which changes by 1 whenever a new customer arrives or the top customer in the queue leaves the system. In fact, this system can be modeled as a Markov chain if we only keep track of the times when the queue state changes. Namely, let T_1 < T_2 < ... denote the times when the queue length changes, and let X_k := Y(T_k). Then (X_k)_{k≥0} forms a Markov chain. In fact, it is the Birth-Death chain we have seen before.
(i) Let (T_k^a)_{k≥0} ∼ PP(λ) and (T_k^d)_{k≥0} ∼ PP(µ). These are the sequences of arrival and departure times, respectively. Let T̃_i be the ith smallest time among all such arrival and departure times. In the next section, we will learn that (T̃_i)_{i≥0} ∼ PP(λ + µ). This is called 'merging' of two independent Poisson processes (see Exercise 6.4.3). Note that T̃_i is the ith time that 'something happens' to the queue.
(ii) Define a Markov chain (X_k)_{k≥0} on state space Ω = {0, 1, 2, ...} by
X_k = Y(T̃_k). (934)
Namely, X_k is the number of customers in the queue at the kth time that something happens to the queue.
(iii) What is the probability that X_2 = 2 given X_1 = 1? As soon as a new customer arrives at time T_1, she gets serviced, and it takes an independent σ_1 ∼ Exp(µ) time. Let τ_1 be the inter-arrival time between the first and second customers. Then by Example 6.1.4 (competing exponentials),
P(X_2 = 2 | X_1 = 1) = P(new customer arrives before the previous customer exits) (935)
= P(τ_1 < σ_1) = \frac{\lambda}{\mu + \lambda}. (936)
Similarly,
P(X_2 = 0 | X_1 = 1) = P(new customer arrives after the previous customer exits) (937)
= P(τ_1 > σ_1) = \frac{\mu}{\mu + \lambda}. (938)
(iv) In general, consider what has to happen for X_{k+1} = n + 1 given X_k = n ≥ 1:
P(X_{k+1} = n + 1 | X_k = n) = P\left( \text{remaining service time after time } T_k > \text{remaining time until first arrival after time } T_k \right). (939)
Note that the remaining service time after time T_k is still an Exp(µ) variable due to the memoryless property of exponential RVs. Moreover, by the memoryless property of Poisson processes, the arrival process restarted at time T_k is a Poisson(λ) process that is independent of the past. Hence (939) is the probability that an Exp(λ) RV is less than an independent Exp(µ) RV. Thus we are back to the same computation as in (iii). A similar argument holds for the other possibility P(X_{k+1} = n − 1 | X_k = n).
(v) (Stationary distribution) From (i)-(iv), we conclude that (X_k)_{k≥0} is a Birth-Death chain on state space Ω = {0, 1, 2, ...}. By computing the transition matrix P (which is of infinite size!) and solving the stationary equations, one obtains that the stationary distribution of the M/M/1 queue is unique and is a geometric distribution. Namely, if we write ρ = λ/µ, then
π(n) = ρ^n (1 − ρ) for all n ≥ 0. (940)
Namely, π is the (shifted) geometric distribution with parameter ρ, which is well-defined if and only if ρ < 1, that is, µ > λ. In words, the rate of service times should be larger than that of the arrival process in order for the queue not to blow up. ▲
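A simulation sketch of the CTMC description above (the parameter values below are our choice, with ρ = 1/2): the long-run fraction of time the queue spends in state n should match π(n) = ρⁿ(1 − ρ).

import numpy as np

rng = np.random.default_rng(5)
lam, mu = 0.5, 1.0                  # rho = lam / mu = 1/2
rho = lam / mu

t, x, t_end = 0.0, 0, 100_000.0
occ = np.zeros(200)                 # occupation times; 200 states is ample for rho = 1/2
while t < t_end:
    rate = lam if x == 0 else lam + mu          # holding parameter lambda(x)
    hold = rng.exponential(1 / rate)
    occ[x] += min(hold, t_end - t)
    t += hold
    if x == 0 or rng.random() < lam / (lam + mu):
        x = min(x + 1, len(occ) - 1)            # arrival (the cap never binds in practice)
    else:
        x -= 1                                   # departure
occ /= occ.sum()
print(occ[:5])                                   # empirical fractions of time
print([(1 - rho) * rho**n for n in range(5)])    # pi(n) = rho^n (1 - rho)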

Exercise 8.2.12 (RW perspective of the M/M/1 queue). Consider an M/M/1 queue with service rate µ and arrival rate λ. Denote p = λ/(µ + λ). Let (X_k)_{k≥0} be the Markov chain where X_k is the number of customers in the queue at the kth time that either a departure or an arrival occurs (cf. [Dur10, Example 6.2.4]).
(i) Let (ξ_k)_{k≥1} be a sequence of i.i.d. RVs such that
P(ξ_k = 1) = p, P(ξ_k = −1) = 1 − p. (941)

Define a sequence of RVs (Z_k)_{k≥0} with Z_0 = 0 by
Z_{k+1} = \max(0,\, Z_k + \xi_{k+1}). (942)
Show that X_k and Z_k have the same distribution for all k ≥ 1.
(ii) Define a simple random walk (S_n)_{n≥0} by S_n = ξ_1 + ... + ξ_n for n ≥ 1 and S_0 = 0. Show that Z_k can also be written as
Z_k = S_k − \min_{0 \leq i \leq k} S_i. (943)
(iii) The simple random walk (S_n)_{n≥0} is called subcritical if p < 1/2, critical if p = 1/2, and supercritical if p > 1/2. Below is a plot of (Z_k)_{k≥0} when p < 1/2, in which case Z_k is more likely to decrease than to increase when Z_k ≥ 1. Does it make sense that the M/M/1 queue has a unique stationary distribution for p < 1/2? Draw plots of Z_k for the critical and supercritical cases. Convince yourself that the M/M/1 queue should not have a stationary distribution for p > 1/2. How about the critical case?

FIGURE 3. Plot of Z_k = S_k − min_{0≤i≤k} S_i for the subcritical case p < 1/2.
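Plots like Figure 3 can be reproduced with a few lines (a sketch; the paths are random, so the picture varies by seed):

import numpy as np
import matplotlib.pyplot as plt

# Sample paths of Z_k = S_k - min_{0<=i<=k} S_i for the three regimes of p.
rng = np.random.default_rng(6)
n = 2000
fig, axes = plt.subplots(1, 3, figsize=(12, 3))
for ax, p in zip(axes, [0.4, 0.5, 0.6]):  # subcritical, critical, supercritical
    xi = np.where(rng.random(n) < p, 1, -1)
    S = np.concatenate([[0], np.cumsum(xi)])
    Z = S - np.minimum.accumulate(S)      # reflected walk, cf. (943)
    ax.plot(Z)
    ax.set_title(f"p = {p}")
plt.show()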

8.2.2. Completeness of the CTMC construction. We start with some general discussion on what kind of information we need to specify, and we first make the following assumption on the type of continuous-time processes we consider.
Assumption 8.2.13. Let (X_t)_{t≥0} denote a continuous-time process on a state space Ω. We may assume that there is a sequence of jump times 0 < T_1 < T_2 < ... such that X_t ∈ Ω is constant on each interval (0, T_1], (T_1, T_2], ....²

²Not all CTMCs satisfy this assumption; for instance, Brownian motion changes its state at every single time instant almost surely. However, such 'smooth' CTMCs can be obtained by a suitable limiting procedure from the 'discrete' CTMCs that we consider here.
Since the process X_t can change its state only at jump times, we can record its states at jump times as a discrete-time stochastic process on Ω, as in the following definition.
Definition 8.2.14. Let (X_t)_{t≥0} denote a continuous-time process on a state space Ω with jump times (T_n)_{n≥1}. Define the discrete-time process (Z_n)_{n≥1}, which is called the jump chain of (X_t)_{t≥0}, by Z_n = X_{T_n}.
Conversely, a sequence of jump times and a jump chain determine a continuous-time process.
Definition 8.2.15. Suppose we have a jump chain (Z_n)_{n≥1} and jump times (T_n)_{n≥1}. Define the associated continuous-time process (X_t)_{t≥0} by X_t = Z_{N(t)}, where N(t) = \sum_{k=1}^{\infty} 1(T_k \leq t) is the associated counting process.
Proposition 8.2.16. Suppose we have a jump chain (Z_n)_{n≥1} and jump times (T_n)_{n≥1}. Let (X_t)_{t≥0} denote the associated continuous-time process as in Definition 8.2.15. Then (X_t)_{t≥0} has jump times (T_n)_{n≥1} and jump chain (Z_n)_{n≥1} in the sense of Definition 8.2.14.
PROOF. Follows from definition. □
Proposition 8.2.17. Suppose (X t )t ≥0 is a continuous-time process on Ω with jump times (Tn )n≥1 . Further
assume that (X t )t ≥0 satisfies the continuous-time Markov property.
(i) The associated jump chain (Zn )n≥1 is a discrete-time Markov chain on Ω.


(ii) Conditional on Z_n, the holding time τ_n := T_{n+1} − T_n at state Z_n ∈ Ω is independent from the past and has an exponential distribution.

PROOF. For (i), by the continuous-time Markov property, the future states depend only on the present state, so the jump chain (Z_k)_{k≥1} is a discrete-time Markov chain. For (ii), note first that a continuous-time version of the strong Markov property holds (see the proof of Proposition 3.4.4) and that the jump times are stopping times, since
\{T_n \leq s\} = \{(X_u)_{0 \leq u \leq s} \text{ makes at least } n \text{ jumps}\} \in \sigma(X_u : 0 \leq u \leq s). (944)
The amount of time τ_n = T_{n+1} − T_n spent at state x := Z_n = X_{T_n} should be memoryless due to the continuous-time Markov property, so it has to be an exponential RV (see Exercise 6.1.6) with some parameter λ(x). Finally, note that
P(τ_n ≥ s | Z_n, Z_{n−1}, ..., Z_1) = P\left( X_t \equiv Z_n \text{ for } t \in [T_n, T_n + s] \,\middle|\, X_{T_n}, X_{T_{n-1}}, \ldots, X_{T_1} \right) (945)
= P\left( X_t \equiv Z_n \text{ for } t \in [T_n, T_n + s] \,\middle|\, X_{T_n} \right) (946)
= P(τ_n ≥ s | Z_n), (947)
where we have used the continuous-time version of the strong Markov property and the fact that the jump times are stopping times. □

8.3. Evolution of the transition probability function


In this section, we give a differential equation point of view of CTMC, deriving various quantities such as the jump rates and the exponential holding parameters from the transition probability function.
8.3.1. Differential Equations for TPF.
Proposition 8.3.1 (Rates from TPF). Consider a CTMC with bounded jump rates q(x, y) ≤ α for all x ≠ y ∈ Ω. Let (t, x, y) ↦ P_t(x, y) be its transition probability function. Then the following hold:
(i) (Jump rate) q(x, y) = \lim_{h\to 0+} \frac{P_h(x, y)}{h} = \frac{d}{dt} P_t(x, y) \Big|_{t=0+} for x ≠ y ∈ Ω;
(ii) (Holding parameter) λ(x) = \lim_{h\to 0+} \frac{1 - P_h(x, x)}{h} = -\frac{d}{dt} P_t(x, x) \Big|_{t=0+} for x ∈ Ω.

PROOF. We prove the assertion only for the bounded jump rate case: q(x, y) ≤ α for all x ≠ y ∈ Ω. In this case, we can construct the CTMC using Definition 8.2.6 (MC run by PP), so we can use the formula (930). Suppose x ≠ y, so that P(x, y) = q(x, y)/α. Note that P^0(x, y) = I(x, y) = 1(x = y) = 0. Hence
P_t(x, y) = \sum_{k=0}^{\infty} \frac{(\alpha t)^k e^{-\alpha t}}{k!} P^k(x, y) (948)
= \frac{q(x, y)}{\alpha} (\alpha t) e^{-\alpha t} + \sum_{k=2}^{\infty} \frac{(\alpha t)^k e^{-\alpha t}}{k!} P^k(x, y). (949)
Hence
\frac{P_t(x, y)}{t} = e^{-\alpha t} q(x, y) + \alpha^2 t \sum_{k=2}^{\infty} \frac{(\alpha t)^{k-2} e^{-\alpha t}}{k!} P^k(x, y). (950)
Note that 0 \leq \sum_{k=2}^{\infty} \frac{(\alpha t)^{k-2} e^{-\alpha t}}{k!} P^k(x, y) \leq \sum_{k=2}^{\infty} (\alpha t)^{k-2} = \frac{1}{1 - \alpha t} for t ∈ (0, 1/α). Thus the last term above vanishes as t → 0+. This shows (i).
We show (ii) similarly. Since P^0(x, x) = I(x, x) = 1(x = x) = 1, we have
P_t(x, x) = e^{-\alpha t} + \alpha t \sum_{k=1}^{\infty} \frac{(\alpha t)^{k-1} e^{-\alpha t}}{k!} P^k(x, x) (951)
= e^{-\alpha t} + \alpha t e^{-\alpha t} P(x, x) + \alpha^2 t^2 \sum_{k=2}^{\infty} \frac{(\alpha t)^{k-2} e^{-\alpha t}}{k!} P^k(x, x). (952)
Subtract 1, divide both sides by t, and take the limit t → 0+. The last term vanishes as before, and recall that P(x, x) = 1 - \frac{1}{\alpha} \sum_{y \neq x} q(x, y) = 1 - \lambda(x)/\alpha. So we get
\lim_{t\to 0+} \frac{P_t(x, x) - 1}{t} = \lim_{t\to 0+} \left[ \frac{e^{-\alpha t} - 1}{t} + \alpha e^{-\alpha t} P(x, x) \right] = -\alpha + \alpha \left( 1 - \frac{\lambda(x)}{\alpha} \right) = -\lambda(x). (953)
This shows the second part of the assertion. □
Definition 8.3.2 (Generator matrix). Consider a CTMC with jump rates {q(x, y) | x ≠ y ∈ Ω}. Define the generator matrix Q : Ω² → R by
Q(x, y) = \begin{cases} q(x, y) & \text{if } y \neq x \\ -\lambda(x) = -\sum_{y' \neq x} q(x, y') & \text{if } y = x. \end{cases} (954)

Example 8.3.3. Consider a CTMC X_t = Y_{N(t)}, which is a discrete-time Markov chain (Y_n)_{n≥1} with transition matrix P run by a Poisson process (N(t))_{t≥0} ∼ PP(α). Then its generator matrix Q is given by
Q(x, y) = \begin{cases} q(x, y) = \alpha P(x, y) & \text{if } y \neq x \\ -\lambda(x) = \alpha (P(x, x) - 1) & \text{if } y = x, \end{cases} (955)
where the second equality follows from (929). In matrix form, we have
Q = \alpha (P - I). (956)


Theorem 8.3.4 (Transition probability function and the generator matrix). Consider a CTMC with bounded jump rates q(x, y) ≤ α for all x ≠ y ∈ Ω. Let (t, x, y) ↦ P_t(x, y) be its transition probability function and let Q denote its generator matrix. Then the following hold:
(i) (Forward Kolmogorov eq.) Ṗ_t = P_t Q for all t ≥ 0 with initial condition P_0 = I;
(ii) (Backward Kolmogorov eq.) Ṗ_t = Q P_t for all t ≥ 0 with initial condition P_0 = I;
where Ṗ_t := \frac{d}{dt} P_t denotes the entrywise time derivative of P_t.

PROOF. We only give a proof that is rigorous for a finite state space Ω. Recall the Chapman-Kolmogorov equation in Proposition 8.1.1 (iii). When Ω is finite, P_t is a matrix-valued function in t, so we can view it as the following matrix equation:
P_t P_s = P_{t+s}. (957)
Now for the forward equation, note that
\frac{1}{h}(P_{t+h} - P_t) = \frac{1}{h}(P_t P_h - P_t) = P_t \frac{1}{h}(P_h - P_0). (958)
Taking the limit h → 0+, we obtain the forward equation Ṗ_t = P_t Q, where the identity Q = Ṗ_0 follows from Proposition 8.3.1 and the definition of the generator matrix Q. For the backward equation, note that
\frac{1}{h}(P_{t+h} - P_t) = \frac{1}{h}(P_h P_t - P_t) = \frac{1}{h}(P_h - P_0) P_t (959)
and take the limit h → 0+. □
Remark 8.3.5. Compare the forward and backward Kolmogorov equations with the following equations for discrete-time MC transition matrices:
P^{t+1} = P P^t = P^t P. (960)

8.3.2. Kolmogorov's equation for finite state spaces. Consider the following differential equation:
\dot{x}(t) = \alpha x(t), \qquad x(0) = 1. (961)
In order to solve it, we multiply by the integrating factor e^{−αt} > 0 and use the product rule for derivatives:
\dot{x}(t) = \alpha x(t) \Leftrightarrow \dot{x}(t) - \alpha x(t) = 0 \Leftrightarrow e^{-\alpha t}\dot{x}(t) - \alpha e^{-\alpha t} x(t) = 0 (962)
\Leftrightarrow \frac{d}{dt}\left[ e^{-\alpha t} x(t) \right] = 0 (963)
\Leftrightarrow e^{-\alpha t} x(t) \equiv C, (964)
where C is a constant. By using the initial condition x(0) = 1, we find C = 1. Hence x(t) = e^{αt}. Conversely, such a solution satisfies the original initial value problem (961). Hence it follows that
\dot{x}(t) = \alpha x(t),\ x(0) = 1 \iff x(t) = e^{\alpha t}. (965)
We notice that Theorem 8.3.4 gives a matrix version of the analogous differential equation (961). So, is it true that the transition probability function P_t is also given by the 'exponential function' e^{tQ}, where Q is the generator matrix? It turns out to be true, as we will see shortly, but for that we first need to define exponential functions for matrices and discuss some basic properties. The idea is to adapt the Taylor expansion of the exponential function: e^x = \sum_{k=0}^{\infty} x^k/k!.

Definition 8.3.6 (Matrix exponential). Let A be a square matrix. Then we define
e^A := \sum_{k=0}^{\infty} \frac{A^k}{k!}. (966)
Matrix exponentials have familiar properties such as e^{A+B} = e^A e^B whenever AB = BA. (Check!)
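The commutativity caveat is easy to see numerically; the following sketch compares both sides for a random non-commuting pair and for the commuting pair (A, 2A).

import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(7)
A = rng.normal(size=(3, 3))
B = rng.normal(size=(3, 3))
print(np.allclose(expm(A + B), expm(A) @ expm(B)))          # False: A, B need not commute
print(np.allclose(expm(A + 2 * A), expm(A) @ expm(2 * A)))  # True: A and 2A commute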

Remark 8.3.7. The matrix exponential e^A converges absolutely entrywise. To see this, we use a matrix norm ‖·‖ with the submultiplicative property ‖AB‖ ≤ ‖A‖‖B‖. Then we have
\|e^A\| \leq \sum_{k=0}^{\infty} \frac{\|A\|^k}{k!} = e^{\|A\|} < \infty. (967)
Examples of such matrix norms include the Frobenius norm \|A\|_F = \big( \sum_{ij} A_{ij}^2 \big)^{1/2} and the matrix 2-norm \|A\|_2 = \sup_{x \neq 0} \|Ax\|_2 / \|x\|_2, where ‖·‖₂ denotes the usual Euclidean norm for vectors. The matrix 2-norm measures the maximum amount that A stretches a given unit vector; we also have ‖A‖₂ = σ₁(A), the largest singular value of A.

Theorem 8.3.8. Let Q be the generator matrix of a continuous-time Markov chain with finite state space Ω. Then both the Kolmogorov forward and backward equations in Theorem 8.3.4 have the unique solution
P_t = e^{tQ} = \sum_{k=0}^{\infty} \frac{t^k}{k!} Q^k. (968)

PROOF. Convergent power series can be differentiated term by term. This applies also to a matrix-valued series. So we can verify that e^{tQ} indeed solves both Kolmogorov equations:
\frac{d}{dt} \sum_{k=0}^{\infty} \frac{t^k}{k!} Q^k = \sum_{k=1}^{\infty} \frac{t^{k-1}}{(k-1)!} Q^k = Q \left( \sum_{k=1}^{\infty} \frac{t^{k-1}}{(k-1)!} Q^{k-1} \right) = \left( \sum_{k=1}^{\infty} \frac{t^{k-1}}{(k-1)!} Q^{k-1} \right) Q (969)
= Q e^{tQ} = e^{tQ} Q. (970)
For the initial condition, note that e^{0 \cdot Q} = I by the definition. Hence P_t = e^{tQ} satisfies both Kolmogorov equations.
For the converse direction, we can mimic the integrating factor argument as follows:
\dot{P}_t = Q P_t \Leftrightarrow \dot{P}_t - Q P_t = O \Leftrightarrow e^{-tQ}\dot{P}_t - e^{-tQ} Q P_t = O (971)
\Leftrightarrow e^{-tQ}\dot{P}_t - Q e^{-tQ} P_t = O (972)
\Leftrightarrow \frac{d}{dt}\left[ e^{-tQ} P_t \right] = O (973)
\Leftrightarrow e^{-tQ} P_t \equiv C, (974)
where the third equivalence uses the commutativity Q \left( \sum_{k=0}^{\infty} \frac{t^k}{k!} Q^k \right) = \sum_{k=0}^{\infty} \frac{t^k}{k!} Q^{k+1} = \left( \sum_{k=0}^{\infty} \frac{t^k}{k!} Q^k \right) Q and the fourth equivalence uses the product rule for differentiation entrywise. Using the initial condition P_0 = I, we find C = I and hence P_t = e^{tQ}, as desired. □
Example 8.3.9 (MC run by PP: TPF). Let (Y_n)_{n≥1} denote a discrete-time Markov chain on a countable state space Ω with transition matrix P. Suppose we have a Poisson arrival process (N(t))_{t≥0} ∼ PP(λ) independent of the chain (Y_n)_{n≥1}. We define a continuous-time process (X_t)_{t≥0} on the same state space Ω by
X_t = Y_{N(t)}. (975)
Recall that in Example 8.1.4, we derived using a conditioning argument the transition probability function P_t of (X_t)_{t≥0} as
P_t(x, y) = \sum_{k=0}^{\infty} \frac{(\lambda t)^k e^{-\lambda t}}{k!} P^k(x, y). (976)
We can use Theorem 8.3.8 to re-derive this. Note that the generator matrix Q of X_t is given by Q = λ(P − I) (see Example 8.3.3). Since λtP and −λtI commute, P_t = e^{tQ} = e^{\lambda t (P - I)} = e^{-\lambda t} e^{\lambda t P} = e^{-\lambda t} \sum_{k=0}^{\infty} \frac{(\lambda t)^k}{k!} P^k, which recovers (976). ▲
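Numerically, the two formulas for P_t agree; the sketch below compares scipy's matrix exponential e^{tQ} with a truncation of the Poissonization series (976) for a hypothetical two-state P (our choice of numbers).

import numpy as np
from scipy.linalg import expm

P = np.array([[0.9, 0.1],
              [0.5, 0.5]])
lam, t = 1.0, 2.0
Q = lam * (P - np.eye(2))                # generator of the MC run by PP(lam)

Pt_series = np.zeros((2, 2))
Pk, weight = np.eye(2), np.exp(-lam * t)
for k in range(60):                      # truncated Poissonization series (976)
    Pt_series += weight * Pk
    Pk = Pk @ P
    weight *= lam * t / (k + 1)

print(np.allclose(expm(t * Q), Pt_series))   # True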

8.4. Stationary distribution and limit theorems for CTMC


8.4.1. Stationary distribution for CTMC.

Definition 8.4.1. A probability measure π on the state space Ω is an invariant distribution (also stationary distribution) for the continuous-time Markov chain with transition probability function P_t if
πP_t = π for all t ≥ 0. (977)

Example 8.4.2 (Continuation of Example 8.3.9). Let X_t = Y_{N(t)}, where Y_n is a MC on Ω with transition matrix P and (N(t))_{t≥0} ∼ PP(α) is an independent Poisson process. Since the clock N(t) ticks at a constant average rate, the switch between discrete and continuous time does not involve a time distortion. As a result, X_t and Y_n have the same invariant distributions:
πP_t = π ⟺ πP = π. (978)
This can be rigorously justified by using the formula (976). ▲

Theorem 8.4.3. The probability distribution π is invariant for a (non-explosive) continuous-time Markov chain if and only if
πQ = 0. (979)
By using the definition of Q in (954), this is equivalent to the following system of equations:
\sum_{x: x \neq y} \pi(x) q(x, y) = \pi(y) \lambda(y) = \pi(y) \sum_{x: x \neq y} q(y, x) \quad \text{for all } y \in \Omega. (980)
(Incoming probability flow = outgoing probability flow.)

PROOF. First assume πP_t ≡ π. Then its derivative in t is zero, so
0 = \frac{d}{dt} \pi P_t = \pi \frac{d}{dt} P_t = \pi Q P_t, (981)
where we used the Kolmogorov backward equation. Evaluating at t = 0 and using P_0 = I, we get πQ = 0.
Conversely, suppose πQ = 0. Then
\frac{d}{dt} \pi P_t = \pi \frac{d}{dt} P_t = \pi Q P_t \equiv 0, (982)
so πP_t is constant in t. By using P_0 = I, we then get πP_t = πP_0 = π. □
For many classes of CTMCs, invariant distributions can be computed by solving a detailed balance equation, as in the discrete-time MC case.
Definition 8.4.4. A probability distribution π on the state space Ω of a CTMC with transition probability function P_t is a reversible distribution for (X_t)_{t≥0} if the detailed balance equations hold:
π(x) P_t(x, y) = π(y) P_t(y, x) for all t ≥ 0 and x, y ∈ Ω. (983)
As in the discrete-time case, reversible probability distributions are in fact stationary. Also, we can simply use the jump rates q(x, y) in place of the transition probability function P_t(x, y) in (983). These facts are justified in the following proposition.
Proposition 8.4.5. Let (X_t)_{t≥0} denote a CTMC on a state space Ω with transition probability function P_t.
(i) If π is a reversible distribution for (X_t)_{t≥0}, then it is also an invariant distribution.
(ii) A distribution π is a reversible distribution for (X_t)_{t≥0} if and only if
π(x) q(x, y) = π(y) q(y, x) for all x, y ∈ Ω. (984)

PROOF. For (i), we mimic the proof of the corresponding statement for discrete MCs (Proposition 3.2.7). Namely, by reversibility,
[\pi P_t](x) = \sum_{y\in\Omega} \pi(y) P_t(y, x) = \sum_{y\in\Omega} \pi(x) P_t(x, y) = \pi(x) \sum_{y\in\Omega} P_t(x, y) = \pi(x). (985)
That is, we have shown πP_t = π, so π is an invariant distribution for (X_t)_{t≥0}.
Next, we show (ii). First suppose (983) holds. By differentiating both sides, plugging in t = 0, and using the Kolmogorov equations in Theorem 8.3.4, we get
π(x) Q(x, y) = π(y) Q(y, x) for x, y ∈ Ω, (986)
where Q is the generator matrix defined in terms of the jump rates q as in (954). If x ≠ y, then Q(x, y) = q(x, y) and Q(y, x) = q(y, x), so (984) holds. If x = y, then (984) holds trivially.
Conversely, suppose (984) holds. Then by the definition of the generator matrix Q in (954), we have (986). It follows that
π(x) Q^k(x, y) = π(y) Q^k(y, x) for k ∈ Z_{≥0} and x, y ∈ Ω. (987)
Indeed, the above is true for k = 0 since Q^0 = I, and for k = 1 due to (986). Suppose (987) holds for some k ≥ 1. Then
\pi(x) Q^{k+1}(x, y) = \sum_{z\in\Omega} \pi(x) Q^k(x, z) Q(z, y) (988)
= \sum_{z\in\Omega} \pi(z) Q^k(z, x) Q(z, y) (989)
= \sum_{z\in\Omega} Q^k(z, x) \pi(y) Q(y, z) (990)
= \pi(y) \sum_{z\in\Omega} Q(y, z) Q^k(z, x) = \pi(y) Q^{k+1}(y, x), (991)
where (989) uses the induction hypothesis and (990) uses (986).

This completes the induction for (987).


Now by using (987) and Theorem 8.3.8,
\pi(x) P_t(x, y) = \pi(x) e^{tQ}(x, y) = \sum_{k=0}^{\infty} \frac{t^k}{k!} \pi(x) Q^k(x, y) = \sum_{k=0}^{\infty} \frac{t^k}{k!} \pi(y) Q^k(y, x) (992)
= \pi(y) e^{tQ}(y, x) = \pi(y) P_t(y, x). (993)
This shows the assertion. □

8.4.2. Limit theorems for CTMC. In this subsection, we study some basic limit theorems for CTMC and how to use stationary distributions to describe various asymptotic behaviors of CTMC.
Definition 8.4.6. A CTMC (X_t)_{t≥0} on Ω with transition probability function P_t is irreducible if for all x, y ∈ Ω, P_t(x, y) > 0 for some t ≥ 0.
Proposition 8.4.7. A CTMC (X_t)_{t≥0} on Ω is irreducible if and only if its jump chain is irreducible.
PROOF. Exercise. □
Theorem 8.4.8 (SLLN for CTMC). An irreducible continuous-time Markov chain with an invariant distribution π satisfies the following limit theorems.
(i) (Limit distribution) For any initial distribution µ and any state y,
\lim_{t\to\infty} P_\mu(X_t = y) = \pi(y). (994)
(ii) (SLLN) Let f : Ω → R be a function such that E_{X∼π}[|f(X)|] = \sum_x \pi(x)|f(x)| < \infty. Then for any initial distribution µ, the limit below holds with probability one:
\lim_{t\to\infty} \frac{1}{t} \int_0^t f(X_s)\, ds = \sum_x \pi(x) f(x) = E_{X\sim\pi}[f(X)]. (995)
In particular, by taking f(y) = 1(y = x),
(Asymptotic fraction of time spent at x) = \lim_{t\to\infty} \frac{1}{t} \int_0^t 1(X_s = x)\, ds = \pi(x). (996)

PROOF. Omitted. □
Think about why 'aperiodicity' is not needed for CTMCs in order for the convergence to the stationary distribution in Theorem 8.4.8 (i) to hold.

Theorem 8.4.9. Suppose π is an invariant distribution for an irreducible continuous-time Markov chain. Then for all x ∈ Ω,
\pi(x) = \frac{1}{\lambda(x)\, E[T_x \mid X_0 = x]}, (997)
where T_x denotes the first return time to x after leaving x (i.e., T_x = inf{t > 0 | X_t = x and X_s ≠ x for some s < t}) and λ(x) denotes the exponential holding parameter.

PROOF. Let T_x^{(k)} denote the kth return time to x, assuming X_0 = x. These are stopping times with respect to the CTMC (X_t)_{t≥0}. By the strong Markov property, the cycle durations are i.i.d. and hence define a renewal process. By Theorem 8.4.8 (ii) and the renewal reward SLLN (Theorem 2.2.1),
\pi(x) = \lim_{T\to\infty} \frac{1}{T} \int_0^T 1(X_s = x)\, ds = \frac{E[\text{time spent at } x \text{ per cycle}]}{E[\text{duration of a cycle}]} = \frac{1/\lambda(x)}{E[T_x \mid X_0 = x]}. (998)
This shows the assertion. □
Example 8.4.10. Consider a CTMC (X_t)_{t≥0} on the state space Ω = {0, 1, 2, 3} whose jump rates q(x, y) are given by the diagram in Figure 4. Recall that for each state x, the exponential holding parameter λ(x) is given by λ(x) = \sum_{y \neq x} q(x, y). Hence we have λ(0) = 1, λ(1) = 2, λ(2) = 3, and λ(3) = 3. Its generator matrix Q (see (954)) is given by
Q = \begin{bmatrix} -1 & 1 & 0 & 0 \\ 1 & -2 & 1 & 0 \\ 0 & 2 & -3 & 1 \\ 0 & 0 & 3 & -3 \end{bmatrix}. (999)
This CTMC is irreducible since the jump rate diagram gives a strongly connected directed graph. That is, for any two states x, y, there is a path from x to y with jumps of positive rates. Any such path has a positive probability of occurring, so the CTMC can move from x to y with positive probability. (Or derive the routing matrix, draw the state space diagram of the jump chain, and see that it is irreducible.)

[Diagram: states 0 ⇄ 1 ⇄ 2 ⇄ 3 with rightward rates q(0,1) = q(1,2) = q(2,3) = 1 and leftward rates q(1,0) = 1, q(2,1) = 2, q(3,2) = 3]

FIGURE 4. A jump rate diagram

By Theorem 8.4.8 (i), there is a unique stationary distribution π. By Theorem 8.4.3, π is determined by πQ = 0. Solving this equation for π = [π_0, π_1, π_2, π_3], we get
−π_0 + π_1 = 0 (1000)
π_0 − 2π_1 + 2π_2 = 0 (1001)
π_1 − 3π_2 + 3π_3 = 0 (1002)
π_2 − 3π_3 = 0. (1003)
From the first and last equations, we get π_0 = π_1 and π_2 = 3π_3. Plugging these into the middle two equations, we get
−π_1 + 2π_2 = 0 (1004)
π_1 − 2π_2 = 0. (1005)
Thus π_1 = 2π_2, so π_2 = π_0/2 and π_3 = π_2/3 = π_0/6. Since
1 = π_0 + π_1 + π_2 + π_3 = π_0 (1 + 1 + (1/2) + (1/6)) = π_0 \cdot \frac{6 + 6 + 3 + 1}{6} = \frac{16}{6}\, π_0, (1006)
we get π_0 = 3/8. Hence we conclude
π = \left[ \frac{3}{8},\ \frac{3}{8},\ \frac{3}{16},\ \frac{1}{16} \right]. (1007)
Now by Theorem 8.4.8 (ii), the asymptotic fraction of time that X_t spends at each state x is π(x). For example,
(Asymptotic fraction of time spent not at 0) = π(1) + π(2) + π(3) = \frac{5}{8}. (1008)
Moreover, by Theorem 8.4.9, we can also find the expected return time to each state x:
\pi(x) = \frac{1}{\lambda(x)\, E[T_x \mid X_0 = x]}. (1009)
For example, E[T_0 | X_0 = 0] = 1/(π(0)λ(0)) = 8/3. So starting from 0, the expected time to return to 0 is 8/3. ▲
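The computations in this example are easy to verify numerically: π spans the left null space of Q, πP_t = π for every t, and the expected return times are 1/(π(x)λ(x)). (The chain here is in fact a birth-death chain, so one can also check the detailed balance condition (984) directly.)

import numpy as np
from scipy.linalg import expm, null_space

Q = np.array([[-1.0,  1.0,  0.0,  0.0],
              [ 1.0, -2.0,  1.0,  0.0],
              [ 0.0,  2.0, -3.0,  1.0],
              [ 0.0,  0.0,  3.0, -3.0]])

pi = null_space(Q.T).ravel()       # left null vector of Q, i.e., pi Q = 0
pi = pi / pi.sum()                 # normalize (also fixes an overall sign)
print(pi)                          # ~ [0.375, 0.375, 0.1875, 0.0625] = [3/8, 3/8, 3/16, 1/16]
print(np.allclose(pi @ expm(5.0 * Q), pi))   # pi P_t = pi, here for t = 5
lam = -np.diag(Q)                  # holding parameters lambda(x)
print(1 / (pi * lam))              # expected return times; first entry ~ 8/3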
Bibliography

[BT02] Dimitri P. Bertsekas and John N. Tsitsiklis, Introduction to probability, vol. 1, Athena Scientific, Belmont, MA, 2002.
[Dur99] Rick Durrett, Essentials of stochastic processes, vol. 1, Springer, 1999.
[Dur10] Rick Durrett, Probability: theory and examples, Cambridge University Press, 2010.
[LP17] David A. Levin and Yuval Peres, Markov chains and mixing times, vol. 107, American Mathematical Soc., 2017.
[LS19] Hanbaek Lyu and David Sivakoff, Persistence of sums of correlated increments and clustering in cellular automata, Stochastic Processes and their Applications 129 (2019), no. 4, 1132-1152.
[SV] Timo Seppäläinen and Benedek Valkó, Introduction to stochastic processes.
