Stochastic Notes
Lecture Notes
Hanbaek Lyu
www.hanbaeklyu.com
Contents
Chapter 4. Martingales
4.1. Definition and examples of martingales
4.2. Gambling strategies and stopping times
4.3. Optional stopping time theorem and its applications
4.4. Martingale limit theorem
4.5. Branching processes
4.6. Additional properties of martingales (Optional*)
Bibliography
CHAPTER 1
When Ω is finite, one can replace (iii) above with the following simpler condition: If two events E_1, E_2 ⊆ Ω are disjoint, then P(E_1 ∪ E_2) = P(E_1) + P(E_2).
Exercise 1.1.2. Let P be a probability measure on sample space Ω. Show the following.
(i) Let E = {x_1, x_2, · · · , x_k} ⊆ Ω be an event. Then P(E) = ∑_{i=1}^k P({x_i}).
(ii) ∑_{x∈Ω} P({x}) = 1.
Of course, there could be many (in fact, infinitely many) different probability measures on the same
sample space.
Exercise 1.1.3 (coin flip). Let Ω = {H, T} be a sample space. Fix a parameter p ∈ [0, 1], and define a function P_p : 2^Ω → [0, 1] by P_p(∅) = 0, P_p({H}) = p, P_p({T}) = 1 − p, P_p({H, T}) = 1. Verify that P_p is a probability measure on Ω for each value of p.
A typical way of constructing a probability measure is to specify how likely it is to see each individual
element in Ω. Namely, let f : Ω → [0, 1] be a function that sums up to 1, i.e., ∑_{x∈Ω} f(x) = 1. Define a function P : F → [0, 1] by
P(E) = ∑_{ω∈E} f(ω). (4)
Then this is a probability measure on Ω, and f is called a probability distribution on Ω. For instance, the
probability distribution on {H , T } we used to define Pp in Exercise 1.1.3 is f (H ) = p and f (T ) = 1 − p.
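For concreteness, here is a short Python sketch of how (4) turns a distribution f into a probability measure. (The sketch is illustrative and not part of the notes; the particular distribution f below, with p = 0.3, is made up.)

f = {"H": 0.3, "T": 0.7}   # a made-up distribution on Omega = {H, T}

def P(event):
    # P(E) = sum of f(omega) over omega in E, as in equation (4)
    return sum(f[omega] for omega in event)

print(P({"H"}))        # 0.3
print(P({"H", "T"}))   # 1.0, so P(Omega) = 1 as required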
Example 1.1.4 (Uniform probability measure). Let Ω = {1, 2, · · · , m} be a sample space and let P be the
uniform probability measure on Ω, that is,
P({x}) = 1/m ∀x ∈ Ω. (5)
Then for the event A = {1, 2, 3}, we have
P(A) = P({1} ∪ {2} ∪ {3}) (6)
= P({1}) + P({2}) + P({3}) (7)
= 1/m + 1/m + 1/m = 3/m. (8)
Likewise, if A ⊆ Ω is any event and if we let |A| denote the size (number of elements) of A, then
P(A) = |A|/m. (9)
For example, let Ω = {1, 2, 3, 4, 5, 6}2 be the sample space of a roll of two fair dice. Let A be the event that
the sum of two dice is 5. Then
A = {(1, 4), (2, 3), (3, 2), (4, 1)}, (10)
so |A| = 4. Hence P(A) = 4/36 = 1/9. ▲
Exercise 1.1.5. Show that the function P : F → [0, 1] defined in (4) is a probability measure on Ω. Con-
versely, show that every probability measure on a finite sample space Ω can be defined in this way.
Remark 1.1.6 (General probability space). A probability space does not need to be finite, but we need a
more careful definition in that case. For example, if we take Ω to be the unit interval [0, 1], then we have
to be careful in deciding which subset E ⊆ Ω can be an ‘event’: not every subset of Ω can be an event. A
proper definition of general probability space is out of the scope of this course.
Exercise 1.1.7. Let (Ω, F , P) be a probability space and let A ⊆ Ω be an event. Show that P(A c ) = 1− P(A).
Example 1.1.8 (Roll of two dice). Suppose we roll two dice and let X and Y be the outcome of each die.
Say all possible joint outcomes are equally likely. The sample space for the roll of a single die can be
written as {1, 2, 3, 4, 5, 6}, so the sample space for rolling two dice can be written as Ω = {1, 2, 3, 4, 5, 6}2 .
In pictures, think of a 6-by-6 square grid where each node represents a unique outcome (x, y) of the roll. By assumption, each outcome has probability 1/36. Namely,
P((X , Y ) = (x, y)) = 1/36 ∀1 ≤ x, y ≤ 6. (11)
This gives the uniform probability distribution on our sample space Ω (see Example 1.1.4).
We can compute various probabilities for this experiment. For example,
P(at least one die is even) = 1 − P(both dice are odd) (12)
= 1 − P({1, 3, 5} × {1, 3, 5}) = 1 − 9/36 = 3/4, (13)
where for the first equality we have used the complementary probability in Exercise 1.1.7.
Now think about the following question: What is the most probable value for the sum X + Y? By considering diagonal lines x + y = k in the 2-dimensional plane for different values of k, we find
P(X + Y = k) = (# of intersections between the line x + y = k and Ω)/36. (14)
For example, P(X + Y = 2) = 1/36 and P(X + Y = 7) = 6/36 = 1/6. Moreover, from Figure 1, it is clear that the number of intersections is maximized when the diagonal line x + y = k passes through the extreme points (1, 6) and (6, 1). Hence 7 is the most probable value for X + Y, with the probability being 1/6.
FIGURE 1. Sample space representation for the roll of two independent fair dice and the events of fixed sum of the two dice.
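As a quick sanity check of (14), one can enumerate all 36 outcomes in Python; the short sketch below (illustrative, not part of the notes) recovers P(X + Y = 2) = 1/36 and confirms that 7 is the most probable sum.

from collections import Counter

# Count how many of the 36 equally likely outcomes give each sum k
counts = Counter(x + y for x in range(1, 7) for y in range(1, 7))
dist = {k: c / 36 for k, c in sorted(counts.items())}
print(dist[2], dist[7])          # 0.0277... = 1/36 and 0.1666... = 6/36
print(max(dist, key=dist.get))   # 7, the most probable value of X + Y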
1.2. Independence
When knowing something about one event does not yield any information of the other, we say the
two events are independent. To make this statement a bit more precise, suppose Bob is betting $5 for
whether an event E 1 occurs or not. Suppose Alice tells him that some other event E 2 holds true. If Bob
can somehow leverage this knowledge to increase his chance of winning, then we should say the two
events are not independent.
Formally, we say two events E 1 and E 2 are independent if
P(E 1 ∩ E 2 ) = P(E 1 ) P(E 2 ). (15)
We say two events or RVs are dependent if they are not independent.
Example 1.2.1. Flip two fair coins at the same time and let X and Y be their outcome, labeled by H and
T . Clearly knowing about one coin does not give any information of the other. For instance, the first coin
lands on heads with probability 1/2. Whether the first coin lands on heads or not, the second coin will
land on heads with probability 1/2. So
P(X = H and Y = H) = (1/2) · (1/2) = P(X = H) P(Y = H). (16)
▲
Exercise 1.2.2. Suppose two events A 1 and A 2 are independent and suppose that P(A 2 ) > 0. Show that
P(A 1 | A 2 ) = P(A 1 ). (17)
1.3. Random variables
1.3.1. Discrete random variables. Given a finite probability space (Ω, P), a (discrete) random vari-
able (RV) is any real-valued function X : Ω → R. We can think of it as the outcome of some experiment on
Ω (e.g., height of a randomly selected friend). We often forget the original probability space and specify
a RV by its probability mass function (PMF) f_X : R → [0, 1], defined by f_X(x) = P(X = x).
Example 1.3.1. Say you win $1 if a fair coin lands heads and lose $1 if it lands tails. We can set up our
probability space (Ω, P) by Ω = {H , T } and P = uniform probability measure on Ω. The RV X : Ω → R for
this game is X (H ) = 1 and X (T ) = −1. The PMF of X is given by f X (1) = P(X = 1) = P({H }) = 1/2 and
likewise f X (−1) = 1/2.
Exercise 1.3.2. Let (Ω, P) be a probability space and X : Ω → R be a RV. Let f X be the PMF of X , that is,
f X (x) = P(X = x) for all x. Show that f X adds up to 1, that is,
∑_x f_X(x) = 1, (19)
where the summation runs over all numerical values x that X can take.
There are two useful statistics of a RV to summarize its two most important properties: Its average
and uncertainty. First, if one has to guess the value of a RV X , what would be the best choice? It is the
expectation (or mean) of X , defined as below:
E(X) = ∑_x x P(X = x). (20)
Example 1.3.3 (Doubling strategy). Suppose we bet $x on a game where we have to predict whether a
fair coin flip comes up heads. We win $x on heads and lose $x on tails. Suppose we are playing the
‘doubling strategy’. Namely, we start betting $1, and until the first time we win, we double our bet; upon
the first win, we quit the game. Let X be the random variable giving the net gain of the overall game. How
can we evaluate this strategy?
Let N be the random number of coin flips we have to encounter until we see the first head. For instance, P(N = 1) = 1/2 and P(N = 2) = 1/4. In general, we have P(N = k) = 2^{−k} for all k ≥ 1. Note that on the event that N = k (i.e., we bet k times), the net gain X | {N = k} is
−(1 + 2 + · · · + 2^{k−2}) + 2^{k−1} = −(2^{k−1} − 1) + 2^{k−1} = 1.
Hence our net gain is $1 with probability 1. In particular, the expected net gain is also 1:
E[X ] = 1 · P(X = 1) = 1. (29)
What if we use the tripling strategy? ▲
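A quick way to convince oneself that the doubling strategy always nets exactly $1 is to simulate it; the following Python sketch (illustrative only) plays the strategy many times:

import random

def doubling_strategy():
    # Bet 1, 2, 4, ... on fair coin flips until the first head, then quit.
    bet, losses = 1, 0
    while True:
        if random.random() < 0.5:      # heads: win the current bet
            return bet - losses        # net gain of the whole game
        losses += bet                  # tails: lose the current bet
        bet *= 2                       # ... and double it

print(all(doubling_strategy() == 1 for _ in range(10_000)))   # True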
Exercise 1.3.4. For any RV X and real numbers a, b ∈ R, show that
E[aX + b] = a E[X ] + b. (30)
Exercise 1.3.5 (Tail sum formula for expectation). Let X be a RV taking values on positive integers.
(i) For any x, show that
P(X ≥ x) = ∑_{y=x}^∞ P(X = y). (31)
On the other hand, say you play two different games where in the first game, you win or lose $1
depending on a fair coin flip, and in the second game, you win or lose $10. In both games, your expected
winning is 0. But the two games are different in how much the outcome fluctuates around the mean.
This notion of fluctuation is captured by the following quantity called variance:
Var(X ) = E[(X − E(X ))2 ]. (34)
Namely, it is the expected squared difference between X and its expectation E(X ).
The following exercise ties the expectation and the variance of a RV into a problem of finding a point
estimator that minimizes the mean squared error.
Exercise 1.3.6 (Variance as minimum MSE). Let X be a RV. Let x̂ ∈ R be a number, which we consider
as a ‘guess’ (or ‘estimator’ in Statistics) of X . Let E[(X − x̂)2 ] be the mean squared error (MSE) of this
estimation.
(i) Show that
E[(X − x̂)^2] = E[X^2] − 2x̂ E[X] + x̂^2 (35)
= (x̂ − E[X])^2 + E[X^2] − E[X]^2 (36)
= (x̂ − E[X])^2 + Var(X). (37)
(ii) Conclude that the MSE is minimized when x̂ = E[X ] and the global minimum is Var(X ). In this sense,
E[X ] is the ‘best guess’ for X and Var(X ) is the corresponding MSE.
Here are some of the simplest and yet most important discrete RVs.
Exercise 1.3.7 (Bernoulli RV). A RV X is a Bernoulli variable with (success) probability p ∈ [0, 1] if it takes
value 1 with probability p and 0 with probability 1 − p. In this case we write X ∼ Bernoulli(p). Show that
E(X ) = p and Var(X ) = p(1 − p).
Exercise 1.3.8 (Indicator variables). Let (Ω, P) be a probability space and let E ⊆ Ω be an event. The
indicator variable of the event E , which is denoted by 1E , is the RV such that 1E (ω) = 1 if ω ∈ E and
1E (ω) = 0 if ω ∈ E c . Show that 1E is a Bernoulli variable with success probability p = P(E ).
Example 1.3.9 (Binomial RV). Let X 1 , X 2 , · · · , X n be independent and identically distributed Bernoulli p
variables. Let X = X 1 + · · · + X n . One can think of flipping the same probability p coin n times. Then X is
the total number of heads. Note that X has the following PMF
P(X = k) = (n choose k) p^k (1 − p)^{n−k} (38)
for integers 0 ≤ k ≤ n, and P(X = k) = 0 otherwise. We say X follows the Binomial distribution with
parameters n and p, and write X ∼ Binomial(n, p).
We can compute the mean and variance of X using the above PMF directly, but it is much easier to
break it up into Bernoulli variables and use linearity. Recall that X i ∼ Bernoulli(p) and we have E(X i ) = p
and Var(X i ) = p(1 − p) for each 1 ≤ i ≤ n (from Exercise 1.3.7). So by linearity of expectation (Exercise
1.3.22),
E(X ) = E(X 1 + · · · + X n ) = E(X 1 ) + · · · + E(X n ) = np. (39)
On the other hand, since X i ’s are independent, variance of X is the sum of variance of X i ’s (Exercise
1.3.31) so
Var(X ) = Var(X 1 + · · · + X n ) = Var(X 1 ) + · · · + Var(X n ) = np(1 − p). (40)
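As a numerical sanity check of (39) and (40), one can simulate X as a sum of n Bernoulli(p) variables; the sketch below (illustrative, with made-up parameters n = 20 and p = 0.3) should print values close to np = 6 and np(1 − p) = 4.2.

import random

n, p, trials = 20, 0.3, 100_000
samples = [sum(random.random() < p for _ in range(n)) for _ in range(trials)]
mean = sum(samples) / trials
var = sum((x - mean) ** 2 for x in samples) / trials
print(mean, var)   # approximately 6.0 and 4.2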
Example 1.3.10 (Geometric RV). Suppose we flip a probability p coin until it lands heads. Let X be the
total number of trials until the first time we see heads. Then in order for X = k, the first k − 1 flips must
land on tails and the kth flip should land on heads. Since the flips are independent of each other,
P(X = k) = P({T, T, · · · , T, H}) = (1 − p)^{k−1} p. (41)
This is valid for k positive integer, and P(X = k) = 0 otherwise. Such a RV is called a Geometric RV with
(success) parameter p, and we write X ∼ Geom(p).
The mean and variance of X can be easily computed using its moment generating function, which
we will learn soon in this course. For their direct computation, note that
E(X) − (1 − p)E(X) = [(1 − p)^0 p + 2(1 − p)^1 p + 3(1 − p)^2 p + 4(1 − p)^3 p + · · ·] (42)
− [(1 − p)^1 p + 2(1 − p)^2 p + 3(1 − p)^3 p + · · ·] (43)
= (1 − p)^0 p + (1 − p)^1 p + (1 − p)^2 p + (1 − p)^3 p + · · · (44)
= p/(1 − (1 − p)) = 1, (45)
where we recognized the series after the second equality as a geometric series. This gives
E(X ) = 1/p. (46)
(In fact, one can apply Exercise 1.3.5 and quickly compute the expectation of a Geometric RV.)
Exercise 1.3.11. Let X ∼ Geom(p). Use a similar computation as we had in Example 1.3.10 to show
E(X 2 ) = (2 − p)/p 2 . Using the fact that E(X ) = 1/p, conclude that Var(X ) = (1 − p)/p 2 .
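One can also check E(X) = 1/p and Var(X) = (1 − p)/p^2 by simulation; the sketch below (illustrative, with the made-up parameter p = 0.25) should print values near 4 and 12.

import random

def geometric(p):
    # Flip a probability-p coin until the first success; return the number of flips.
    k = 1
    while random.random() >= p:
        k += 1
    return k

p, trials = 0.25, 100_000
samples = [geometric(p) for _ in range(trials)]
mean = sum(samples) / trials
var = sum((x - mean) ** 2 for x in samples) / trials
print(mean, var)   # approximately 1/p = 4 and (1 - p)/p^2 = 12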
Example 1.3.12 (Poisson RV). A RV X is a Poisson RV with rate λ > 0 if
P(X = k) = λ^k e^{−λ}/k! (47)
for all nonnegative integers k ≥ 0. We write X ∼ Poisson(λ). A computation shows E(X ) = Var(X ) = λ.
1.3.2. Continuous Random Variables. So far we have only considered discrete RVs, which take either finitely many or countably many values. While there are many examples of discrete RVs, there are also many instances of RVs which vary continuously (e.g., temperature, height, weight, price, etc.). To define a discrete RV, it was enough to specify its PMF. For a continuous RV, the probability density function (PDF) plays an analogous role to the PMF. We also need to replace the summation ∑ with an integral ∫ dx.
Namely, X is a continuous RV if there is a function f_X : R → [0, ∞) such that for any interval (a, b], the probability that X takes a value in (a, b] is given by integrating f_X over the interval:
P(X ∈ (a, b]) = ∫_a^b f_X(x) dx. (48)
The cumulative distribution function (CDF) of a RV X (either discrete or continuous), denoted by F X , is
defined by
F X (x) = P(X ≤ x). (49)
By definition of PDF, we get
F_X(x) = ∫_{−∞}^x f_X(t) dt. (50)
Conversely, PDFs can be obtained by differentiating corresponding CDFs.
Exercise 1.3.13. Let X be a continuous RV with PDF f X . Let a be a continuity point of f X , that is, f X is
continuous at a. Show that F X (x) is differentiable at x = a and
(dF_X/dx)|_{x=a} = f_X(a). (51)
The expectation of a continuous RV X with pdf f X is defined by
E(X) = ∫_{−∞}^∞ x f_X(x) dx, (52)
and its variance Var(X ) is defined by the same formula (34).
Exercise 1.3.14 (Tail sum formula for expectation). Let X be a continuous RV with PDF f X and suppose
f X (x) = 0 for all x < 0. Use Fubini’s theorem to show that
E(X) = ∫_0^∞ P(X ≥ t) dt. (53)
Example 1.3.15 (Higher moments). Let X be a continuous RV with PDF f X and suppose f X (x) = 0 for all
x < 0. We will show that for any real number α > 0,
E[X^α] = ∫_0^∞ x^α f_X(x) dx. (54)
First, use Exercise 1.3.14 to write
E[X^α] = ∫_0^∞ P(X^α ≥ x) dx (55)
= ∫_0^∞ P(X ≥ x^{1/α}) dx (56)
= ∫_0^∞ ∫_{x^{1/α}}^∞ f_X(t) dt dx. (57)
We then use Fubini's theorem to change the order of integration. This gives
E[X^α] = ∫_0^∞ ∫_{x^{1/α}}^∞ f_X(t) dt dx = ∫_0^∞ ∫_0^{t^α} f_X(t) dx dt = ∫_0^∞ t^α f_X(t) dt, (58)
as desired. ▲
Exercise 1.3.16. Let X be a continuous RV with PDF f X and suppose f X (x) = 0 for all x < 0. Let g : R → R
be a strictly increasing function. Use Fubini’s theorem and tail sum formula for expectation to show
E[g(X)] = ∫_0^∞ g(x) f_X(x) dx. (59)
Example 1.3.17 (Uniform RV). X is a uniform RV on the interval [a, b] (denoted by X ∼ Uniform([a, b]))
if it has PDF
f_X(x) = (1/(b − a)) 1(a ≤ x ≤ b). (60)
An easy computation gives its CDF:
P(X ≤ x) = 0 for x < a, (x − a)/(b − a) for a ≤ x ≤ b, and 1 for x > b. (61)
Its mean and variance are
E(X) = (a + b)/2, Var(X) = (b − a)^2/12. (62)
Example 1.3.19 (Exponential RV). X is an exponential RV with rate λ (denoted by X ∼ Exp(λ)) if it has
PDF
f_X(x) = λ e^{−λx} 1(x ≥ 0). (63)
Integrating the PDF gives its CDF: F_X(x) = (1 − e^{−λx}) 1(x ≥ 0).
Exercise 1.3.20. Let X ∼ Exp(λ). Show that E(X ) = 1/λ directly using definition (52). Also show that
Var(X ) = 1/λ2 .
Example 1.3.21 (Normal RV). X is a normal RV with mean µ and variance σ2 (denoted by X ∼ N (µ, σ2 ))
if it has PDF
f_X(x) = (1/√(2πσ^2)) e^{−(x−µ)^2/(2σ^2)}. (66)
If µ = 0 and σ2 = 1, then X is called a standard normal RV. Note that if X ∼ N (µ, σ2 ), then Y := X − µ has
PDF
1 −x
2
f Y (x) = p e 2σ2 . (67)
2πσ2
Since this is an even function, it follows that E(Y ) = 0. Hence E(X ) = µ. Using the ‘Gaussian integral’, one
can also show that Var(X ) = σ2 . ▲
1.3.3. Expectation and variance of sums of RVs. In this subsection, we learn how we can compute
expectation and variance for sums of RVs. For expectation, we will see that we can swap the order of summation and expectation. This is called the linearity of expectation, whose importance cannot be
overemphasized.
Exercise 1.3.22 (Linearity of expectation). In this exercise, we will show that the expectation of sum of
RVs is the sum of expectation of individual RVs.
(i) Let X and Y be RVs. Show that
∑_y P(X = x, Y = y) = P(X = x). (68)
Our first application of linearity of expectation is the following useful formula for variance.
Y = (1 − X 1 )(1 − X 2 ) · · · (1 − X k ). (80)
(i) By expanding the product and using linearity of expectation, show that
E[Y] = 1 − ∑_{i_1=1}^k E[X_{i_1}] + ∑_{1≤i_1<i_2≤k} E[X_{i_1} X_{i_2}] (81)
− ∑_{1≤i_1<i_2<i_3≤k} E[X_{i_1} X_{i_2} X_{i_3}] + · · · + (−1)^k E[X_1 X_2 · · · X_k]. (82)
(ii) Show that Y is the indicator variable of the event ∩_{i=1}^k A_i^c. Conclude that
E[Y] = P(∩_{i=1}^k A_i^c) = 1 − P(∪_{i=1}^k A_i). (83)
When knowing something about one RV does not yield any information of the other, we say the two
RVs are independent. Formally, we say two RVs X and Y are independent if for any two subsets A 1 , A 2 ⊆ R,
P(X ∈ A 1 and Y ∈ A 2 ) = P(X ∈ A 1 )P(Y ∈ A 2 ). (95)
We say two events or RVs are dependent if they are not independent.
Exercise 1.3.28. Suppose two RVs X and Y are independent. Then for any subsets A 1 , A 2 ⊆ R such that
P(Y ∈ A 2 ) > 0, show that
P(X ∈ A 1 | Y ∈ A 2 ) = P(X ∈ A 1 ). (96)
Example 1.3.29. Flip two fair coins at the same time, and let X = 1 if the first coin lands heads and
X = −1 if it lands tails. Let Y be a similar RV for the second coin. Suppose each of the four outcomes
in Ω = {H, T}^2 are equally likely. Clearly knowing about one coin does not give any information of the other. For instance, the first coin lands on heads with probability 1/2. Whether the first coin lands on heads or not, the second coin will land on heads with probability 1/2. So
P(X = 1 and Y = 1) = (1/2) · (1/2) = P(X = 1) P(Y = 1). (97)
One can verify the above equation for all possible outcomes. Hence X and Y are independent. ▲
Exercise 1.3.30. Let X, Y be two independent RVs. Show that
E[X Y ] = E[X ]E[Y ]. (98)
In the following exercise, we will see that we can swap summation and variance as long as the RVs
we are adding up are independent.
Exercise 1.3.31. Recall the definition of covariance given in Exercise 1.3.27.
(i) Show that if two RVs X and Y are independent, then Cov(X , Y ) = 0
(ii) Use Exercise 1.3.27 to conclude that if X 1 , · · · , X n are independent RVs, then
Var(X 1 + · · · + X n ) = Var(X 1 ) + · · · + Var(X n ). (99)
In case when X or Y are continuous RVs, we simply replace the sum by integral and PMF by PDF. For
instance, if both X and Y are continuous with PDFs f X , f Y and joint PDF f X ,Y , then
E[X | Y = y] = ∫_{−∞}^∞ x f_{X|Y=y}(x) dx, (103)
where f X |Y =y is the conditional PDF of X given Y = y defined by
f_{X|Y=y}(x) = f_{X,Y}(x, y)/f_Y(y). (104)
To summarize how we compute the iterated expectations when we condition on discrete and continuous
RV:
E[E[X | Y]] = ∑_y E[X | Y = y] P(Y = y) if Y is discrete, and
E[E[X | Y]] = ∫_{−∞}^∞ E[X | Y = y] f_Y(y) dy if Y is continuous. (105)
An important use of iterated expectation is that we can compute probabilities using conditioning,
since probability of an event is simply the expectation of the corresponding indicator variable.
Proposition 1.4.4. Let X be a discrete random variable and Y = (Y1 , . . . , Yk ) a discrete random vector. Let
g : Rk → R be a function.
(i) If X and Y are independent, then E[X | Y] = E[X ].
(ii) Assume that E[|g (Y)|] < ∞. Then E[g (Y) | Y] = g (Y).
(iii) Assume that E[|X g (Y)|] < ∞. Then E[X g (Y) | Y] = E[X | Y] g (Y).
Here is a handwavy explanation of why the law of total variance, Var(X) = E[Var(X | Y)] + Var(E[X | Y]), holds. Given Y, we should measure the fluctuation of X | Y around the conditional expectation E[X | Y], and this is measured by Var(X | Y). Since we don't know Y, we average over all Y, giving E[Var(X | Y)]. But the reference point E[X | Y] itself varies with Y, so we should also measure its own fluctuation by Var(E[X | Y]). These fluctuations add up nicely, as in the Pythagorean theorem, because E[X | Y] is an optimal estimator, so these two fluctuations are 'orthogonal'.
Exercise 1.5.4. Let X , Y be RVs and denote X̂ = E[X |Y ], meaning that X̂ is an estimator of X given Y . Let
X̃ = X̂ − X be the estimation error.
(i) Show that X̂ is an unbiased estimator of X , that is, E( X̂ ) = E(X ).
(ii) Show that E[ X̂ |Y ] = X̂ . Hence knowing Y does not improve our current best guess X̂ .
(iii) Show that E[ X̃ ] = 0.
(iv) Show that Cov( X̂ , X̃ ) = 0. Conclude that
Exercise 1.5.5. Let X , Y be RVs. Write X̄ = E[X | Y ] and X̃ = X − E[X | Y ] so that X = X̄ + X̃ . Here X̄ is the
estimate of X given Y and X̃ is the estimation error.
(i) Using Exercise 1.5.4 (iii) and iterated expectation, show that
Example 1.5.6. Let Y ∼ Uniform([0, 1]) and X ∼ Binomial(n, Y ). Since X | Y = y ∼ Binomial(n, y), we
have E[X | Y = y] = n y and Var(X | Y = y) = n y(1 − y). Also, since Y ∼ Uniform([0, 1]), we have
Var(E[X | Y]) = Var(nY) = n^2/12. (136)
So by iterated expectation, we get
E(X) = E_Y(E[X | Y]) = ∫_0^1 ny dy = n/2. (137)
On the other hand, by law of total variance,
Theorem 1.6.1 (WLLN). Let (X_k)_{k≥1} be i.i.d. RVs with mean µ < ∞ and let S_n = ∑_{k=1}^n X_k, n ≥ 1, be a random walk. Then for any positive constant ε > 0,
lim_{n→∞} P(|S_n/n − µ| > ε) = 0. (149)
In words, the probability that the sample mean S n /n is not within ε distance from its expectation µ
decays to zero as n tends to infinity. In this case, we say the sequence of RVs (S n /n)n≥1 converges to µ in
probability.
The second version of the law of large numbers is called the strong law of large numbers (SLLN), which is available if the increments have finite variance.
Theorem 1.6.2 (SLLN). Let (X_k)_{k≥1} be i.i.d. RVs and let S_n = ∑_{k=1}^n X_k, n ≥ 1, be a random walk. Suppose E[X_1] = µ < ∞ and E[X_1^2] < ∞. Then
P(lim_{n→∞} S_n/n = µ) = 1. (150)
To make sense out of this, notice that the limit of the sample mean lim_{n→∞} S_n/n is itself a RV. Then SLLN says
that this RV is well defined and its value is µ with probability 1. In this case, we say the sequence of RVs
(S n /n)n≥1 converges to µ with probability 1 or almost surely.
Perhaps one of the most celebrated theorems in probability theory is the central limit theorem (CLT), which describes how the sample mean S_n/n “fluctuates” around its mean µ. From (147), if we denote σ^2 = Var(X_1) < ∞, we know that Var(S_n/n) = σ^2/n → 0 as n → ∞. So the fluctuation decays as we add up more increments. To see the effect of fluctuation, we first center the sample mean by subtracting its expectation and “zoom in” by dividing by the standard deviation σ/√n. This is where the name ‘central limit’ comes from: it describes the limit of centered random walks.
Theorem 1.6.3 (CLT). Let (X_k)_{k≥1} be i.i.d. RVs and let S_n = ∑_{k=1}^n X_k, n ≥ 1, be a random walk. Suppose E[X_1] = µ < ∞ and Var(X_1) = σ^2 < ∞. Let Z ∼ N(0, 1) be a standard normal RV and define
Z_n = (S_n − µn)/(σ√n) = (S_n/n − µ)/(σ/√n). (151)
Then the distribution of Z_n converges to that of Z as n → ∞.
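The following Python sketch (illustrative only; Uniform(0, 1) increments with µ = 1/2 and σ^2 = 1/12 are a made-up choice) illustrates the CLT numerically: the fraction of samples of Z_n below 1 should be close to Φ(1) ≈ 0.8413, where Φ is the standard normal CDF.

import math, random

def Z_n(n, mu=0.5, sigma=math.sqrt(1 / 12)):
    # Z_n = (S_n - mu*n)/(sigma*sqrt(n)) for i.i.d. Uniform(0,1) increments
    s = sum(random.random() for _ in range(n))
    return (s - mu * n) / (sigma * math.sqrt(n))

zs = [Z_n(500) for _ in range(20_000)]
print(sum(z <= 1.0 for z in zs) / len(zs))   # approximately 0.8413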
CHAPTER 2
Renewal Processes
2.1. Definition of renewal processes and Renewal SLLN
An arrival process is a sequence of strictly increasing RVs 0 < T1 < T2 < · · · with the convention of
setting T0 = 0. For each integer k ≥ 1, its kth inter-arrival time is defined by τk = Tk − Tk−1 . For a given
arrival process (T_k)_{k≥1}, the associated counting process (N(t))_{t≥0} is defined by
N(t) = ∑_{k=1}^∞ 1(T_k ≤ t) = #(arrivals up to time t). (153)
Note that these three processes (arrival times, inter-arrival times, and counting) determine each other: for instance, T_k = τ_1 + · · · + τ_k and N(t) = max{k ≥ 0 : T_k ≤ t}.
FIGURE 1. Illustration of a continuous-time arrival process (T_k)_{k≥1} and its associated counting process (N(t))_{t≥0}. The τ_k's denote inter-arrival times. N(t) ≡ 3 for T_3 < t ≤ T_4.
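The correspondence (153) is easy to implement. The Python sketch below (illustrative; the Exp inter-arrival times with mean 2 are a made-up choice) builds the arrival times T_k from inter-arrival times τ_k and evaluates N(t):

import random

def sample_arrivals(n, mean_tau=2.0):
    # Arrival times T_k = tau_1 + ... + tau_k from i.i.d. inter-arrival times
    arrivals, T = [], 0.0
    for _ in range(n):
        T += random.expovariate(1 / mean_tau)   # one inter-arrival time tau_k
        arrivals.append(T)
    return arrivals

def N(t, arrivals):
    return sum(T_k <= t for T_k in arrivals)    # number of arrivals up to time t

arrivals = sample_arrivals(10_000)
print(N(1000.0, arrivals) / 1000.0)   # near 1/2 = 1/E[tau_1], anticipating the renewal SLLN below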
Exercise 2.1.2 (Well-definedness of the counting process). Let (N (t ))t ≥0 denote the counting process of
an arrival process with inter-arrival times (τk )k≥1 whose expectations are at least some positive constant
δ > 0 (but not necessarily independent or identically distributed). Recall that N (t ) and τk ’s are associated
as
N(t) = ∑_{n=1}^∞ 1(τ_1 + · · · + τ_n ≤ t). (156)
(i) Show that T_{N(t)} ≤ t for all t ≥ 0. Deduce that
E[N(t)/t] ≤ E[N(t)/T_{N(t)}] ≤ 1/E[T_{N(t)}/N(t)] ≤ 1/δ, (157)
where the middle inequality follows from Jensen's inequality.
(ii) From (i), deduce that E[N (t )] ≤ t /δ < ∞ for all t ≥ 0. Furthermore, deduce that P(N (t ) < ∞) = 1 for
all t ≥ 0.
Definition 2.1.3 (Renewal process). A counting process (N (t ))t ≥0 is called a renewal process if its inter-
arrival times τ1 , τ2 , · · · are i.i.d. with E[τ1 ] < ∞.
A cornerstone in the theory of renewal processes is the following strong law of large numbers for
renewal processes. We first recall the strong law of large numbers.
Theorem 2.1.4 (SLLN). Let (X_k)_{k≥1} be i.i.d. RVs and let S_n = ∑_{k=1}^n X_k, n ≥ 1, be a random walk. Suppose E[|X_1|] < ∞ and let E[X_1] = µ. Then
P(lim_{n→∞} S_n/n = µ) = 1. (158)
Theorem 2.1.6 (Renewal SLLN). Let (Tk )k≥0 be a renewal process and let (τk )k≥0 and (N (t ))t ≥0 be the
associated inter-arrival times and counting process, respectively. Let E[τk ] = µ be the mean inter-arrival
time. If 0 < µ < ∞, then
P(lim_{t→∞} N(t)/t = 1/µ) = 1. (160)
PROOF. First, write T_k = τ_1 + τ_2 + · · · + τ_k. Since the inter-arrival times are i.i.d. with mean µ < ∞, the strong law of large numbers implies
P(lim_{k→∞} T_k/k = µ) = 1. (161)
Next, fix t ≥ 0 and let N(t) = n, so that there are n arrivals in total up to time t. Then the nth arrival time T_n must occur by time t, whereas the (n + 1)st arrival time T_{n+1} must occur after time t. Hence T_n ≤ t < T_{n+1}. In general, we have
T_{N(t)} ≤ t < T_{N(t)+1}. (162)
Dividing by N(t), we get
T_{N(t)}/N(t) ≤ t/N(t) < (T_{N(t)+1}/(N(t)+1)) · ((N(t)+1)/N(t)). (163)
To take the limit as t → ∞, we note that P(T_k < ∞) = 1 for all k since E[τ_k] = µ < ∞. This yields N(t) ≥ k for all large enough t. Since k was arbitrary, this yields N(t) ↗ ∞ as t → ∞ with probability 1. Therefore, according to (161) and Exercise 2.1.5, we get
P(lim_{t→∞} T_{N(t)}/N(t) = µ) = P(lim_{t→∞} T_{N(t)+1}/(N(t)+1) = µ) = P(lim_{t→∞} (N(t)+1)/N(t) = 1) = 1. (164)
Since µ > 0, we can take the reciprocal inside the above probability. This shows the assertion. □
FIGURE 2. Illustration of the inequalities (163).
Example 2.1.7 (Poisson process). Suppose (N(t))_{t≥0} is a renewal process whose inter-arrival times (τ_k)_{k≥1} have the Exp(λ) distribution for some λ > 0. In this case, we call (N(t))_{t≥0} a “Poisson process with arrival rate λ”. Note that the mean inter-arrival time is E[τ_1] = 1/λ, so the renewal SLLN yields
P(lim_{t→∞} N(t)/t = λ) = 1. (166)
Namely, with probability 1, we tend to see about λt arrivals during [0, t] as t → ∞. In other words, we tend to see λ arrivals during an interval of unit length. Hence it makes sense to call the parameter λ the ‘arrival rate’ of the process. ▲
2.2. Renewal reward processes and Renewal Reward SLLN
In this section, we consider a renewal process together with rewards, which are given at each arrival time. This simple extension of the renewal processes will greatly improve the applicability of our theory.
Let (Tk )k≥0 be a renewal process and let (τk )k≥0 and (N (t ))t ≥0 be the associated inter-arrival times
and counting process, respectively. Suppose we have a sequence of rewards (Y_k)_{k≥1}, where we regard a reward of Y_k as given at the kth arrival time T_k. We define the reward process (R(t))_{t≥0} associated with the sequence (τ_k, Y_k)_{k≥1} as
R(t) = ∑_{k=1}^{N(t)} Y_k. (167)
Namely, upon the kth arrival at time Tk = τ1 + · · · + τk , we receive a reward of Yk . Then R(t ) is the total
reward up to time t .
As we looked at the average number of arrivals N (t )/t as t → ∞, a natural quantity to look at for the
reward process is the ‘average reward’ R(t )/t as t → ∞. Intuitively, since everything refreshes upon new
arrivals, we should expect
R(t)/t → (expected reward during one ‘cycle’)/(expected duration of one ‘cycle’) (168)
as t → ∞ almost surely. This is made precise by the following result.
Theorem 2.2.1 (Renewal reward SLLN). Let (R(t ))t ≥0 be the reward process associated with an i.i.d. se-
quence of inter-arrival times (τk )k≥1 and an i.i.d. sequence of rewards (Yk )k≥1 . Suppose E[Yk ] < ∞ and
E[τk ] ∈ (0, ∞). Then
P(lim_{t→∞} R(t)/t = E[Y_1]/E[τ_1]) = 1. (169)
P ROOF. Let (τk )k≥0 denote the inter-arrival times for the renewal process (Tk )k≥0 . Note that
R(t)/t = ((1/N(t)) ∑_{k=1}^{N(t)} Y_k) · (N(t)/t). (170)
Hence by Exercise 2.1.5, the ‘average reward’ up to time t in the bracket converges to E[Y1 ] almost surely.
Moreover, the average number of arrivals N (t )/t converges to 1/E[τ1 ] by Theorem 2.1.6. Hence the as-
sertion follows. □
Remark 2.2.2. Theorem 2.1.6 can be obtained as a special case of the above reward version of SLLN, simply by choosing Y_k ≡ 1 for k ≥ 1 so that R(t) = N(t).
Example 2.2.3 (Long run car costs). This example is excerpted from [Dur99]. Mr. White does not drive the same car for more than t* years, where t* > 0 is some fixed number. He changes to a new car when the old one breaks down or reaches t* years. Let X_k be the lifetime of the kth car that Mr. White drives; these lifetimes are i.i.d. with finite expectation. Let τ_k be the duration of his kth car. According to his policy, we have
τk = min(X k , t ∗ ). (171)
Let Tk = τ1 +· · ·+τk be the time that Mr. White is done with the kth car. Then (Tk )k≥0 is a renewal process.
Note that the expected running time for the kth car is
E[τ_k] = E[τ_k | X_k < t*] P(X_k < t*) + E[τ_k | X_k ≥ t*] P(X_k ≥ t*) (172)
= E[X_k | X_k < t*] P(X_k < t*) + t* P(X_k ≥ t*). (173)
Suppose that the car cost g during each cycle is given by
g(t) = A + B if t < t*, and g(t) = A if t ≥ t*. (174)
Namely, if the car breaks down by t* years, then Mr. White has to pay A + B dollars; otherwise, the cost is only A dollars. Then the expected cost for one cycle is
E[g (τk )] = A + B P(τk < t ∗ ) = A + B P(X k < t ∗ ). (175)
Thus by Theorem 2.2.1, the long-run car cost of Mr. White is
lim_{t→∞} R(t)/t = E[g(τ_k)]/E[τ_k] = (A + B P(X_k < t*)) / (E[X_k | X_k < t*] P(X_k < t*) + t* P(X_k ≥ t*)). (176)
For a more concrete example, let X_k ∼ Uniform([0, 10]) and let A = 10 and B = 3. Then
E[g (τk )] = 10 + 3t ∗ /10. (177)
On the other hand,
E[τ_k] = E[X_k | X_k < t*] P(X_k < t*) + t* P(X_k ≥ t*) (178)
= (t*/2)(t*/10) + t* (10 − t*)/10 = t* − (t*)^2/20. (179)
Here E[X_k | X_k < t*] = t*/2 because a uniform RV over [0, 10] conditioned on being in [0, t*] is uniformly distributed over [0, t*]. This yields
E[g(τ_k)]/E[τ_k] = (10 + 0.3 t*)/(t* (1 − t*/20)). (180)
Lastly, in order to minimize the above long-run cost, we differentiate it in t* and find the global minimum. A straightforward computation shows that the long-run cost is minimized at
t* = (−1 + √1.6)/0.03 ≈ 8.83. (181)
Thus the optimal strategy for Mr. White in this situation is to drive each car up to 8.83 years. ▲
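One can confirm (181) numerically by minimizing the long-run cost (180) over a grid; the Python sketch below (illustrative only) should report a minimizer near 8.83.

def long_run_cost(t):
    # Long-run cost (10 + 0.3 t)/(t (1 - t/20)) from equation (180)
    return (10 + 0.3 * t) / (t * (1 - t / 20))

ts = [i / 1000 for i in range(1, 10_000)]   # grid on (0, 10)
best = min(ts, key=long_run_cost)
print(best, long_run_cost(best))            # approximately 8.83 years and 2.56 dollars/year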
CHAPTER 3
Markov chains
Say we would like to model the USD price of bitcoin. We could observe the actual price at every
hour and record it by a sequence of real numbers x 1 , x 2 , · · · . However, it is more interesting to build a
‘model’ that could predict the price of bitcoin at time t, or at least give some meaningful insight into how the
actual bitcoin price behaves over time. Since there are so many factors affecting the price at every time, it
might be reasonable that the price at time t should be given by a certain RV, say X t . Then our sequence of
predictions would be a sequence of RVs, (X_t)_{t≥0}. This is an example of a stochastic process. Here ‘process’ means that we are interested not in just a single RV, but in the sequence as a whole; ‘stochastic’ means that the way the RVs evolve in time might be random.
In this note, we will be studying a very important class of stochastic processes called Markov chains.
The importance of Markov chains lies in two places: 1) They are applicable to a wide range of physical, biological, social, and economic phenomena, and 2) the theory is well-established and we can actually
compute and make predictions using the models.
p i j = P(X t +1 = j | X t = i ) i, j ∈ Ω (183)
do not depend on t .
When the chain (X t )t ≥0 satisfies the above two properties, we say it is a (discrete-time and time-homogeneous)
Markov chain. We define the transition matrix P to be the m × m matrix of transition probabilities:
P = (p_{ij})_{1≤i,j≤m} =
[ p_11  p_12  · · ·  p_1m ]
[ p_21  p_22  · · ·  p_2m ]
[  ⋮     ⋮    ⋱      ⋮   ]
[ p_m1  p_m2  · · ·  p_mm ]. (184)
Finally, since the state X t of the chain is a RV, we represent its probability mass function (PMF) via a row
vector
rt = [P(X t = 1), P(X t = 2), · · · , P(X t = m)]. (185)
Example 3.1.1. Let Ω = {1, 2} and let (X t )t ≥0 be a Markov chain on Ω with the following transition matrix
P = [ p_11  p_12 ]
    [ p_21  p_22 ]. (186)
We can also represent this Markov chain pictorially as in Figure 4, which is called the ‘state space diagram’
of the chain (X t )t ≥0 .
Example 3.1.2 (Gambler's ruin). Fix an integer N ≥ 1 and p ∈ [0, 1], and consider a gambler whose fortune takes values in Ω = {0, 1, · · · , N}: at each bet, the fortune increases by 1 with probability p and decreases by 1 with probability 1 − p, and the game stops once the fortune hits 0 or N (both absorbing). The transition probabilities are
P(X_{t+1} = k + 1 | X_t = k) = p for 1 ≤ k < N
P(X_{t+1} = k | X_t = k + 1) = 1 − p for 1 ≤ k < N
P(X_{t+1} = 0 | X_t = 0) = 1
P(X_{t+1} = N | X_t = N) = 1. (198)
For example, the transition matrix P for N = 5 is given by
1 0 0 0 0 0
1 − p 0
0 p 0 0
0 1−p 0 p 0 0
P = . (199)
𝑝 0 0 1−p 0 p 0
𝑝
0 𝑝 0 0 1 − p 0 p
1 2
𝑝 0 0 0 0 0 1
We call the resulting Markov chain (X t )t ≥0 the gambler’s chain. ▲
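For reference, here is a short Python sketch (illustrative, not part of the notes) that builds the transition matrix (199) of the gambler's chain for general N and p:

def gamblers_chain(N, p):
    # (N+1) x (N+1) transition matrix with absorbing states 0 and N
    P = [[0.0] * (N + 1) for _ in range(N + 1)]
    P[0][0] = P[N][N] = 1.0
    for k in range(1, N):
        P[k][k + 1] = p        # move up with probability p
        P[k][k - 1] = 1 - p    # move down with probability 1 - p
    return P

for row in gamblers_chain(5, 0.4):
    print(row)                 # reproduces (199) with p = 0.4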
Example 3.1.3 (Ehrenfest Chain). This chain originated in the physics literature as a model for two cubical volumes of air connected by a thin tunnel. Suppose there are in total N indistinguishable balls split into two “urns” A and B. At each step, we pick up one of the N balls uniformly at random and move it to the other urn. Let X_t denote the number of balls in urn A after t steps. This is a Markov chain called the Ehrenfest chain. (See the state space diagram in Figure 3.)
It is easy to figure out the transition probabilities by considering different cases. If X t = k, then urn
B has N − k balls at time t . If 0 < k < N , then with probability k/N we move one ball from A to B and
with probability (N − k)/N we move one from B to A. If k = 0, then we must pick up a ball from urn
B so X t +1 = 1 with probability 1. If k = N , then we must move one from A to B and X t +1 = N − 1 with
probability 1. Hence, the transition kernel is given by
P(X_{t+1} = k + 1 | X_t = k) = (N − k)/N for 0 ≤ k < N
P(X_{t+1} = k − 1 | X_t = k) = k/N for 0 < k ≤ N
P(X_{t+1} = 1 | X_t = 0) = 1
P(X_{t+1} = N − 1 | X_t = N) = 1. (200)
For example, the transition matrix P for N = 5 is given by
0 1 0 0 0 0
1/5 0 4/5 0 0
0
0 2/5 0 3/5 0 0
P = . (201)
0 0 3/5 0 2/5 0
0 0 0 4/5 0 1/5
0 0 0 0 1 0
▲
Exercise 3.1.4. Repeat rolling two four-sided dice with the numbers 1, 2, 3, and 4 on them. Let Y_k be the sum of the two dice at the kth roll. Let S_n = Y_1 + Y_2 + · · · + Y_n be the total of the first n rolls, and define X_t = S_t (mod 6). Show that (X_t)_{t≥0} is a Markov chain on the state space Ω = {0, 1, 2, 3, 4, 5}. Furthermore,
identify its transition matrix.
We generalize our observation in Example 3.1.1 in the following exercise.
Exercise 3.1.5. Let (X t )t ≥0 be a Markov chain on state space Ω = {1, 2, · · · , m} with transition matrix P =
(p i j )1≤i , j ≤m . Let rt = [P(X t = 1), · · · , P(X t = m)] denote the row vector of the distribution of X t .
(i) Show that for each i ∈ Ω,
P(X_{t+1} = i) = ∑_{j=1}^m p_{ji} P(X_t = j). (202)
While right-multiplication of P advances a given row vector of distribution one step forward in time,
left-multiplication of P on a column vector computes the expectation of a given function with respect to
the future distribution. This point is clarified in the following exercise.
Exercise 3.1.7. Let (X t )t ≥0 be a Markov chain on a state space Ω = {1, 2, · · · , m} with transition matrix P .
Let f : Ω → R be a function. Suppose that if the chain X t has state x at time t , then we get a ‘reward’ of
f (x). Let rt = [P(X t = 1), · · · , P(X t = m)] be the distribution of X t . Let v = [ f (1), f (2), · · · , f (m)]T be the
column vector representing the reward function f .
(i) Show that the expected reward at time t is given by
E[f(X_t)] = ∑_{i=1}^m f(i) P(X_t = i) = r_t v. (207)
(ii) Use part (i) and Exercise 3.1.5 to show that
E[ f (X t )] = r0 P t v. (208)
(iii) The total reward up to time t is a RV given by R_t = ∑_{k=0}^t f(X_k). Show that
E[R_t] = r_0 (I + P + P^2 + · · · + P^t) v. (209)
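Formulas (208) and (209) translate directly into a few lines of numpy. In the sketch below (illustrative; the transition matrix and the reward vector v are made-up examples), r_0 P^t v gives the expected reward at time t, and the partial sum of matrix powers gives E[R_t]:

import numpy as np

P = np.array([[0.2, 0.8], [0.6, 0.4]])   # a made-up 2-state transition matrix
r0 = np.array([1.0, 0.0])                # start at state 1
v = np.array([5.0, -1.0])                # made-up rewards f(1) = 5, f(2) = -1

t = 10
print(r0 @ np.linalg.matrix_power(P, t) @ v)               # E[f(X_t)], eq. (208)
S = sum(np.linalg.matrix_power(P, k) for k in range(t + 1))
print(r0 @ S @ v)                                          # E[R_t], eq. (209)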
Exercise 3.1.8. Suppose that the probability it rains today is 0.4 if neither of the last two days was rainy,
but 0.5 if at least one of the last two days was rainy. Let Ω = {S, R}, where S=sunny and R=rainy. Let Wt
be the weather of day t .
(i) Show that (Wt )t ≥0 is not a Markov chain.
(ii) Expand the state space into the set of pairs Σ := Ω2 . For each t ≥ 0, define X t = (Wt −1 ,Wt ) ∈ Σ. Show
that (X t )t ≥0 is a Markov chain on Σ. Identify its transition matrix.
(iii) What is the two-step transition matrix?
(iv) What is the probability that it will rain on Wednesday if it didn’t rain on Sunday and Monday?
We say a probability distribution π on Ω (viewed as a row vector) is a stationary distribution for the chain (X_t)_{t≥0} if
π = πP. (210)
Example 3.2.1. Consider the 2-state Markov chain (X_t)_{t≥0} with transition matrix (as in Example 3.1.1)
P = [ 0.2  0.8 ]
    [ 0.6  0.4 ]. (211)
Furthermore, this is the unique stationary distribution. To see this, let π = [π_1, π_2] be a stationary distribution of X_t. Then π = πP gives π_1 = 0.2 π_1 + 0.6 π_2 and π_2 = 0.8 π_1 + 0.4 π_2, that is, 0.8 π_1 = 0.6 π_2. Since π is a probability distribution, π_1 + π_2 = 1, so π = [3/7, 4/7] is the only solution. This shows the
uniqueness of the stationary distribution for X t . ▲
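One can also find π numerically as a left eigenvector of P with eigenvalue 1 (this connection is made precise right below); the numpy sketch here (illustrative only) recovers π ≈ [3/7, 4/7]:

import numpy as np

P = np.array([[0.2, 0.8], [0.6, 0.4]])
eigvals, eigvecs = np.linalg.eig(P.T)      # right eigenvectors of P^T are left eigenvectors of P
k = int(np.argmin(np.abs(eigvals - 1.0)))  # locate the eigenvalue 1
pi = np.real(eigvecs[:, k])
pi = pi / pi.sum()                         # normalize to a probability vector
print(pi)                                  # approximately [0.4286, 0.5714] = [3/7, 4/7]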
A Markov chain may have multiple stationary distributions, as the following example illustrates.
Stationary distributions are closely related to eigenvectors and eigenvalues of the transition matrix P. Namely, by taking transposes,
π^T = P^T π^T. (217)
Hence, the transpose of any stationary distribution is an eigenvector of P^T associated with eigenvalue 1.
More properties of stationary distribution in this line of thought are given by the following exercise.
Exercise 3.2.3. Given a matrix A, a row vector v, and a real number λ, we say v is a left eigenvector of A
associated with left eigenvalue λ if
vA = λv. (218)
If v is a column vector and if Av = λv, then we say v is a (right) eigenvector associated with (right)
eigenvalue λ. Let (X t )t ≥0 be a Markov chain on state space Ω = {1, 2, · · · , m} with transition matrix P =
(p i j )1≤i , j ≤m .
(i) Show that a distribution π on Ω is a stationary distribution for the chain (X t )t ≥0 if and only if it is a
left eigenvector of P associated with left eigenvalue 1.
(ii) Show that 1 is a right eigenvalue of P with right eigenvector [1, 1, · · · , 1]T .
(iii) Recall that a square matrix and its transpose have the same (right) eigenvalues and corresponding
(right) eigenspaces have the same dimension. Show that the Markov chain (X t )t ≥0 has a unique
stationary distribution if and only if [1, 1, · · · , 1]^T spans the (right) eigenspace of P associated
with (right) eigenvalue 1.
In the following exercise, we compute the stationary distribution of the so-called birth-death chain.
Exercise 3.2.4 (Birth-Death chain). Let Ω = {0, 1, 2, · · · , N } be the state space. Let (X t )t ≥0 be a Markov
chain on Ω with transition probabilities
P(X_{t+1} = k + 1 | X_t = k) = p for 0 ≤ k < N
P(X_{t+1} = k − 1 | X_t = k) = 1 − p for 1 ≤ k ≤ N
P(X_{t+1} = 0 | X_t = 0) = 1 − p
P(X_{t+1} = N | X_t = N) = p. (219)
This is called a Birth-Death chain. Its state space diagram is as below: the states 0, 1, · · · , N sit on a line, each arrow to the right carries probability p and each arrow to the left carries probability 1 − p, with self-loops of probability 1 − p at 0 and p at N.
(i) Let π = [π_0, π_1, · · · , π_N] be a distribution on Ω. Show that π is a stationary distribution of the Birth-Death chain if and only if it satisfies the following ‘balance equation’:
p π_k = (1 − p) π_{k+1} for 0 ≤ k < N. (220)
(ii) Let ρ = p/(1 − p). From (i), deduce that π_k = ρ^k π_0 for all 0 ≤ k ≤ N.
(iii) Using the normalization condition π0 + π1 + · · · + πN = 1, show that π0 = 1/(1 + ρ + ρ 2 + · · · + ρ N ).
Conclude that
π_k = ρ^k / (1 + ρ + ρ^2 + · · · + ρ^N) for 0 ≤ k ≤ N. (221)
Conclude that the Birth-Death chain has a unique stationary distribution given by (221), which
becomes the uniform distribution on Ω when p = 1/2.
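A quick numerical check of (221): the numpy sketch below (illustrative, with made-up values N = 5 and p = 0.3) builds the Birth-Death transition matrix and verifies πP = π for π_k ∝ ρ^k.

import numpy as np

N, p = 5, 0.3
P = np.zeros((N + 1, N + 1))
P[0, 0], P[N, N] = 1 - p, p        # boundary self-loops as in (219)
for k in range(N):
    P[k, k + 1] = p                # birth with probability p
for k in range(1, N + 1):
    P[k, k - 1] = 1 - p            # death with probability 1 - p
rho = p / (1 - p)
pi = np.array([rho ** k for k in range(N + 1)])
pi = pi / pi.sum()                 # formula (221)
print(np.allclose(pi @ P, pi))     # True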
Exercise 3.2.5 (Success run chain). Consider a Markov chain (X n )n≥0 on the state space Ω = Z≥0 of non-
negative integers with the following transition matrix P :
P (x, x + 1) = p, P (x, 0) = 1 − p for all x ∈ Z≥0 , (222)
where p ∈ [0, 1] is a fixed parameter. In words, one flips independent coins with success probability p; every time the coin comes up heads (with probability p), X_n increases by 1, and when it comes up tails, X_n resets back to 0.
(i) Let π : Z≥0 → [0, ∞) be a function that satisfies πP = π. Show that the following recursion holds:
π(0) = (1 − p) ∑_{k=0}^∞ π(k) (223)
π(1) = p π(0) (224)
π(2) = p π(1) (225)
and so on. (226)
Next, we take a look at an example of an important class of Markov chains, called the random walk on graphs. This is the basis of many algorithms involving machine learning on networks (e.g., Google's PageRank).
Example 3.2.6 (Random walk on graphs). We first introduce some notions in graph theory. A graph G consists of a pair (V, E) of a set of nodes V and a set of edges E ⊆ V^2. A graph G can be concisely represented as a |V| × |V| matrix A_G, which is called the adjacency matrix of G. Namely, the (i, j) entry of A_G is defined by
A_G(i, j) = 1(nodes i and j are adjacent in G). (227)
We say G is simple if (i, j) ∈ E implies (j, i) ∈ E and (i, i) ∉ E for all i ∈ V. For a simple graph G = (V, E), we say a node j is adjacent to i if (i, j) ∈ E. We denote by deg_G(i) the number of neighbors of i in G, which we call the degree of i.
Suppose we hop around the nodes of a given simple graph G = (V, E): at each time, we jump from one node to one of its neighbors with equal probability. For instance, if we are currently at node 2 and if 2 is adjacent to 3, 5, and 6, then we jump to one of these three neighbors with probability 1/3. The location of this jump process at time t can be described as a Markov chain. Namely, a Markov chain (X_t)_{t≥0} on the node set V is called a random walk on G if
P(X_{t+1} = j | X_t = i) = A_G(i, j)/deg_G(i). (228)
Note that its transition matrix P is obtained by normalizing each row of the adjacency matrix A_G by the corresponding degree.
FIGURE 5. A 4-node simple graph G, its adjacency matrix A_G, and the associated random walk transition matrix P.
What is the stationary distribution of a random walk on G? There could be many (see Exercise 3.7.1), but here is a typical one that always works. Define a probability distribution π on V by
π(i) = deg_G(i) / ∑_{j∈V} deg_G(j) = deg_G(i)/(2|E|). (229)
Namely, the probability that X t = i is proportional to the degree of node i . Then π is a stationary distri-
bution for P . Indeed,
∑_{i∈V} π(i) P(i, j) = ∑_{i∈V} (deg_G(i)/(2|E|)) · (A_G(i, j)/deg_G(i)) (230)
= (1/(2|E|)) ∑_{i∈V} A_G(i, j) = deg_G(j)/(2|E|) = π(j). (231)
Hence if we perform a random walk on the Facebook network, then (in the long run) we are about ten times more likely to visit a person with 100 friends than a person with 10 friends. ▲
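The computation (230)-(231) can be verified numerically on any small graph; the numpy sketch below (illustrative; the 4-node graph is made up) normalizes the adjacency rows as in (228) and checks that π from (229) is stationary.

import numpy as np

A = np.array([[0, 1, 1, 0],      # a made-up 4-node simple graph
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
deg = A.sum(axis=1)
P = A / deg[:, None]             # P(i, j) = A_G(i, j)/deg_G(i), eq. (228)
pi = deg / deg.sum()             # pi(i) = deg_G(i)/(2|E|), eq. (229)
print(np.allclose(pi @ P, pi))   # True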
Next, we will discuss two classes of Markov chains for which it is very easy to compute stationary
distributions.
Example 3.2.8 (Mass transport). In order to give some concrete intuition, consider the following thought experiment. Imagine Ω is the node set of some graph, and µ(x) represents the mass of sand at node x for each x ∈ Ω. Overnight, a truck runs around the graph and transports a P(x, y) proportion of the sand µ(x) at node x to node y. Hence the amount of sand at node y the next morning is ∑_{x∈Ω} µ(x) P(x, y) = (µP)(y).
On the other hand, the detailed balance equation (232) can be interpreted as saying that the amount of sand moved overnight from x to y, namely µ(x) P(x, y), equals the amount µ(y) P(y, x) moved from y to x.
A reversible Markov chain is a Markov chain whose initial distribution is reversible. Namely, let
(X n )n≥0 denote a Markov chain on state space Ω with transition matrix P . Let the initial state X 0 be
distributed as a reversible distribution µ. By Proposition 3.2.7, X n is a stationary process, meaning that
the distribution of X_n does not change in n. However, we can say more than just stationarity. Namely, one cannot distinguish whether (X_n) is run forward or backward in time. More precisely,
Pµ (X 0 = x 0 , X 1 = x 1 , . . . , X n = x n ) = Pµ (X n = x 0 , X n−1 = x 1 , . . . , X 0 = x n ) (240)
for all n ≥ 1 and states x 0 , . . . , x n .
In practice, reversibility is easier to check than stationarity. So showing reversibility can be a handy way to establish stationarity.
Example 3.2.9 (SSRW on graphs is reversible). Let (X_n)_{n≥0} be a simple symmetric random walk on G = (V, E). Let the initial state X_0 be distributed as the distribution µ(x) = deg(x)/(2|E|). From Example 3.2.6, we already know that µ is a stationary distribution. However, the detailed balance equation is also easily verified:
µ(x) P(x, y) = (deg(x)/(2|E|)) · (A_G(x, y)/deg(x)) = (deg(y)/(2|E|)) · (A_G(y, x)/deg(y)) = µ(y) P(y, x). (241)
Hence X n is not just stationary, but also reversible. Imagine X n follows its trajectory backwards in time.
The above detailed balance equation makes it so that the backward random walk still follows the same
rule: Choose a neighbor uniformly at random and move. ▲
3.2.3. Doubly stochastic chains. Let P be the transition matrix of a Markov chain (X n )n≥0 on a state
space Ω. Recall that the row sums of P are always 1, since each row gives the transition probabilities
from a fixed state to all other possible states. We say the Markov chain (X n )n≥0 (as well as P ) is doubly
stochastic if the column sums of P are also always 1.
Definition 3.2.10. A matrix P : Ω × Ω → [0, ∞) is doubly stochastic if its row and column sums are all one,
that is, for all x, y ∈ Ω,
∑_{z∈Ω} P(x, z) = ∑_{z∈Ω} P(z, y) = 1. (242)
For doubly stochastic Markov chains, the uniform distribution on Ω (which makes sense if Ω is finite)
is always a stationary distribution.
Proposition 3.2.11. Let (X n )n≥0 be a doubly stochastic Markov chain on a finite state space Ω. Then
π = Uniform(Ω) is a stationary distribution for the chain.
P ROOF. Let P denote the transition matrix of the chain (X n )n≥0 . Note that π(x) = 1/|Ω| for all x ∈ Ω.
Then, viewing π as a row vector, for each x ∈ Ω,
(πP)(x) = ∑_{z∈Ω} π(z) P(z, x) = (1/|Ω|) ∑_{z∈Ω} P(z, x) = 1/|Ω| = π(x). (243)
Hence π is stationary for P . □
Example 3.2.12 (Random Walk on a ring). Let (X n )n≥0 denote a random walk on the integer ring Z/m Z,
which consists of the integers 0, . . . , m − 1, where we view m ≡ 0 modulo m. The chain moves by 1 clockwise with probability p and by 1 counterclockwise with probability 1 − p. Then it is easy to verify that the corresponding transition matrix is doubly stochastic. Hence by Proposition 3.2.11, the uniform distribution is a stationary distribution. This is quite reasonable since the process is invariant under ‘rotation’. ▲
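A short numerical check of Example 3.2.12 (illustrative; m = 6 and p = 0.7 are made-up values): the transition matrix of the ring walk has unit column sums, and the uniform distribution is stationary.

import numpy as np

m, p = 6, 0.7
P = np.zeros((m, m))
for x in range(m):
    P[x, (x + 1) % m] = p                    # clockwise step
    P[x, (x - 1) % m] = 1 - p                # counterclockwise step
print(np.allclose(P.sum(axis=0), 1.0))       # column sums are 1: doubly stochastic
pi = np.full(m, 1 / m)
print(np.allclose(pi @ P, pi))               # uniform distribution is stationary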
Exercise 3.2.13. Suppose a Markov chain has the uniform distribution as a stationary distribution. Does
it have to be a doubly stochastic chain?
3.3. Markov property to the infinite future
Proposition 3.3.1. Let (X n )n≥0 be a Markov chain on a finite state space Ω with transition matrix P . Then
P ((X n+1 , X n+2 , . . . ) ∈ U | X n = x, (X 0 , . . . , X n−1 ) ∈ B ) = P ((X n+1 , X n+2 , . . . ) ∈ U | X n = x) . (249)
P ROOF. Omitted. □
Example 3.3.2 (Absorption in Gambler's ruin). Recall the Gambler's ruin Markov chain X_n, which is on the state space Ω = {0, 1, . . . , N} and has transition matrix P with P(i, i+1) = p and P(i, i−1) = 1 − p for i ∈ {1, . . . , N − 1}, and P(0, 0) = P(N, N) = 1. We will assume the initial fortune is X_0 = x. We will compute the probability of winning, that is, the probability that the Gambler's fortune X_n is absorbed at N rather than at 0. As a function of x, denote
h(x) = P_x(state N is reached before 0) (250)
:= P(state N is reached before 0 | X_0 = x). (251)
The goal is to solve for h(x). Note that it has the following boundary values:
h(0) = 0, h(N) = 1. (252)
If we start at x ∈ {1, . . . , N − 1}, then the Markov chain goes to x + 1 with probability p and renews as a Gambler's ruin chain with initial state x + 1; with probability q := 1 − p, it goes to x − 1 and renews there. Hence it should be evident that
h(x) = p h(x + 1) + q h(x − 1) for 0 < x < N. (253)
Noting that p + q = 1, (253) can be re-written as
p h(x) + q h(x) = p h(x + 1) + q h(x − 1) for 0 < x < N. (254)
This gives
h(x + 1) − h(x) = (q/p)(h(x) − h(x − 1)) for 0 < x < N, (255)
which is a multiplicative recursion for the differences h(x + 1) − h(x). It follows that
h(x + 1) − h(x) = (q/p)^x h(1) for 0 < x < N. (256)
Summing over x and using the boundary condition h(0) = 0, we get
h(x + 1) = h(1) ∑_{k=0}^x (q/p)^k for 0 < x < N (257)
= h(1) (1 − (q/p)^{x+1})/(1 − (q/p)) if p ≠ q, and h(1)(x + 1) if p = q = 1/2. (258)
So it only remains to figure out h(1). For this, we use the remaining boundary condition h(N) = 1 by setting x = N − 1:
1 = h(N) = h(1) (1 − (q/p)^N)/(1 − (q/p)) if p ≠ q, and 1 = h(N) = h(1) N if p = q = 1/2. (259)
It follows that h(1) = (1 − (q/p))/(1 − (q/p)^N) for p ≠ q and h(1) = 1/N for p = q = 1/2. Thus, we conclude that
h(x) = (1 − (q/p)^x)/(1 − (q/p)^N) if p ≠ q, and h(x) = x/N if p = q = 1/2. (260)
For instance, say we start the gamble with x = 50 dollars and win the whole game once we accumulate N = 100 dollars before hitting 0 dollars. If each game is fair (p = q = 1/2), the probability of winning the game is h(50) = 50/100 = 0.5. If it is a slightly unfavorable game with p = 49/100 and q = 51/100, then the probability of winning is h(50) = (1 − (51/49)^50)/(1 − (51/49)^100) ≈ 0.11917. So, making each game even slightly unfavorable drastically cuts down the player's chance of winning the whole game.
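Formula (260) is easy to test by simulation. The Python sketch below (illustrative only; it runs a modest number of games, so the output is approximate) compares the simulated winning probability with h(50) for the slightly unfavorable game:

import random

def win_probability(x, N, p, trials=2_000):
    wins = 0
    for _ in range(trials):
        fortune = x
        while 0 < fortune < N:                    # play until absorption at 0 or N
            fortune += 1 if random.random() < p else -1
        wins += (fortune == N)
    return wins / trials

p, q, N = 0.49, 0.51, 100
h50 = (1 - (q / p) ** 50) / (1 - (q / p) ** N)    # formula (260) at x = 50
print(win_probability(50, N, p), h50)             # both approximately 0.119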
Definition 3.4.1 (Stopping time). Let (X_n)_{n≥0} be a stochastic process defined on a probability space (Ω, F, P). Let T be a random variable taking values in Z_{≥0} ∪ {∞}. Then T is a stopping time (for the process (X_n)_{n≥0}) if for each nonnegative integer n, there is a subset C_n ⊆ Ω^{n+1} such that the following
equality of events holds:
{T = n} = {(X 0 , X 1 , . . . , X n ) ∈ C n }. (261)
In words, T is a stopping time for (X n )n≥0 if one can determine if T = n by checking a certain condi-
tion for the history X 0 , X 1 , . . . , X n , without using any information about the future states X n+1 , X n+2 , . . . .
In this sense, the random time T3 defined above is not a stopping time, but T2 is, since we can write
{T2 = n} = {X 0 6= 50, X 1 6= 50, . . . , X n−1 6= 50, X n = 50}. (262)
Example 3.4.2 (First hitting time is a stopping time). Let (X n )n≥0 be a stochastic process defined on a
probability space (Ω, F , P). Let A ⊆ Ω be a subset of the sample space. We will define τ A to be the ‘first
hitting time’ of A, which is the first time that the process X n enters into states in A:
τ A := inf{n ≥ 0 | X n ∈ A}. (263)
Here ‘inf’ stands for infimum, which agrees with the minimum if the set is nonempty, and conventionally
taken to be ∞ if the set is empty. For example, τ A = ∞ means the process X n never enters the states in A.
In order to see that τ A is indeed a stopping time, we simply mimic what we did in (262): For each n ≥ 0,
{τ_A = n} = {X_0 ∉ A, X_1 ∉ A, . . . , X_{n−1} ∉ A, X_n ∈ A}. (264)
Similarly, define T A to be the first time after time zero to hit some state in A:
T A := inf{n ≥ 1 | X n ∈ A}. (265)
As before, T_A is also a stopping time since, for each n ≥ 1,
{T_A = n} = {X_1 ∉ A, X_2 ∉ A, . . . , X_{n−1} ∉ A, X_n ∈ A}. (266)
Hence, in general, first hitting times are stopping times. ▲.
Exercise 3.4.3. Let (X n )n≥0 be a stochastic process defined on a probability space (Ω, F , P). Let A ⊆ Ω be
a subset of the sample space. For each m ≥ 1, let T_A^{(m)} denote the mth time that the process X_n visits some state in A. That is,
T_A^{(m)} = inf{n > T_A^{(m−1)} : X_n ∈ A} if T_A^{(m−1)} < ∞, and T_A^{(m)} = ∞ otherwise. (267)
Now, in general, we do not know the value of T. Hence we will use iterated expectation for probability (Exercise 1.4.3) to condition on the value of T and then average over all values of T:
P(X T +1 = u 1 , . . . , X T +m = u m | T < ∞, A, X T = x) (272)
= ET [P(X T +1 = u 1 , . . . , X T +m = u m | T, A, X T = x)] (273)
= ET [P(X 1 = u 1 , . . . , X m = u m | X 0 = x)] (274)
= P(X 1 = u 1 , . . . , X m = u m | X 0 = x). (275)
Here, the third equality follows from the observation we made in the first paragraph: if we know the value of T < ∞, the assertion holds. □
Example 3.4.5. Let (X n )n≥0 be the gambler’s ruin chain in Example 3.1.2 with initial fortune X 0 = 1 and
cap N = 100. Let T = T2 denote the first hitting time of $50. We have seen that it is a stopping time. So by
the strong Markov property,
P(X T +1 = 51 | X T = 50) = P(X 1 = 51 | X 0 = 50) = p. (276)
▲.
3.5. Recurrence and transience
3.5.1. Definition of recurrence/transience and equivalent conditions. There are multiple ways to define recurrence/transience of states under a Markov chain: using 1) return probabilities; 2) infinitely many returns; 3) expected number of returns. Our first approach is via return probabilities. Namely, for each state x ∈ Ω, define
T_x = T_x^{(1)} := inf{n ≥ 1 : X_n = x}, (277)
which is the first time n ≥ 1 at which the chain X n hits the state x. In Example 3.4.2, we have seen that T x
is a stopping time for each x ∈ Ω. For states x, y ∈ Ω, let
ρ x y = P(T y < ∞ | X 0 = x), (278)
which is the probability that the chain X n eventually visits y starting from x.
Definition 3.5.1. Let (X n )n≥0 be a Markov chain on finite state space Ω. Each state x ∈ Ω is said to be
recurrent if ρ xx = 1 and transient if ρ xx < 1.
Exercise 3.5.2. Show the following implications:
ρ_{xx} > 0 ⟺ P^m(x, x) > 0 for some m ≥ 1. (279)
Hint: Use
sup_{n≥1} P(X_n = x | X_0 = x) ≤ P(∪_{n=1}^∞ {X_n = x} | X_0 = x) ≤ ∑_{n=1}^∞ P(X_n = x | X_0 = x). (280)
Proposition 3.5.3. Fix states x, y ∈ Ω and let T_y^{(m)} denote the time of the mth visit to y, as defined in Exercise 3.4.3. Then for each m ≥ 1,
P(T_y^{(m)} < ∞ | X_0 = x) = ρ_{xy} ρ_{yy}^{m−1}.
PROOF. We first note that the assertion for m = 1 follows from the definition of ρ_{xy}: P(T_y^{(1)} < ∞ | X_0 = x) = ρ_{xy}. Next, suppose the assertion holds for m = k. We deduce the assertion for m = k + 1 by using the strong Markov property. Namely, note that since T_y^{(k+1)} > T_y^{(k)}, T_y^{(k+1)} < ∞ implies T_y^{(k)} < ∞. Also, note that X_{T_y^{(k)}} = y by definition. Hence,
P(T_y^{(k+1)} < ∞ | X_0 = x) = P(T_y^{(k+1)} < ∞ | T_y^{(k)} < ∞, X_0 = x) · P(T_y^{(k)} < ∞ | X_0 = x)
= ρ_{yy} · P(T_y^{(k)} < ∞ | X_0 = x)
= ρ_{yy} · ρ_{xy} ρ_{yy}^{k−1} = ρ_{xy} ρ_{yy}^k,
where the last two equalities use the strong Markov property and the induction hypothesis, respectively.
□
PROOF. Denote the probability in the assertion as p*. Note that for each m ≥ 1, by monotonicity of events,
Hence the probability on the left hand side equals zero. By taking the complement,
1 = 1 − P(∪_{m=1}^∞ {T_x^{(m)} = ∞} | X_0 = x) = P(∩_{m=1}^∞ {T_x^{(m)} < ∞} | X_0 = x) = p*. (290)
Define N_y := ∑_{m=1}^∞ 1(T_y^{(m)} < ∞), which equals the total number of visits to state y ∈ Ω by the Markov chain. By using Fubini's theorem, one can write
E[N_y | X_0 = x] = E[∑_{m=1}^∞ 1(T_y^{(m)} < ∞) | X_0 = x] = ∑_{m=1}^∞ E[1(T_y^{(m)} < ∞) | X_0 = x] (292)
= ∑_{m=1}^∞ P(T_y^{(m)} < ∞ | X_0 = x) (293)
= ∑_{m=1}^∞ ρ_{xy} ρ_{yy}^{m−1}. (294)
m=1
Note that the last expression equals zero if ρ x y = 0, infinity if ρ x y > 0 and ρ y y = 1, and a converging
geometric series if ρ x y > 0 and ρ y y < 1. Hence we obtain the following result:
Now we can give an alternative characterization of the recurrence of a state in terms of the sum of
multi-step transition probabilities.
verifying the first assertion. But Proposition 3.5.5 implies that x is recurrent if and only if E[N x | X 0 = x] =
∞. Hence the assertion follows. □
3.5.2. Recurrence and transience of random walks on Zd . In this subsection, we study recurrence
and transience of simple random walks on the integer lattice Zd in d -dimensions. As a set, Zd consists
of the d tuple of integer coordinates: Zd = {(a 1 , . . . , a d ) | a 1 , . . . , a d ∈ Z}. We will view it as a graph by
giving ‘nearest neighbor edges’. Namely, if we have two points x, y ∈ Zd , then we connect x and y by an
edge if and only if they differ by 1 at exactly a single coordinate. For example, in Z2 , (1, 2) is adjacent to
(2, 2) but not adjacent to (2, 3). As a graph, Z2 looks like a square grid in R2 . Similarly, Z3 would be the
3-dimensional square lattice consisting of the grid points (a, b, c) for a, b, c ∈ Z.
A random walk on any graph is simple if it only moves along edges and does not hop over multiple edges,
and symmetric if at any node, there is an equal chance to pick any neighbor to move to1. Hence a simple
symmetric random walk (SSRW) on Z moves either up or down with equal probability; SSRW on Z2
chooses one of the 4 directions with equal probability and moves along one edge; SSRW on Zd has 2d
directions to choose from at each step, each with equal probability.
1In this lecture note, we assume all random walks on graphs are simple.
The goal of the following series of examples is to show the following random walk theorem of Pólya2:
SSRW on Zd is recurrent if d ≤ 2 and transient if d ≥ 3. (299)
In Example 3.5.7 below, we will see that SSRW on Z is recurrent. For higher dimensions, we will compute
the sum of return probabilities as we did before.
Example 3.5.7 (Recurrence and transience of simple random walk on Z). Let (X n )n≥0 denote a simple
random walk on the 1-dimensional integer lattice Z. Namely, it is a Markov chain with state space Z
and transition matrix P such that P (x, x + 1) = p and P (x, x − 1) = 1 − p for all x ∈ Z. Here p ∈ [0, 1] is a
parameter that governs the bias in one-step transition of the walker. Simply put, the walker moves up by
1 with probability p and down by 1 with probability 1 − p. Suppose X_0 = 0, meaning that the walker
starts at the origin.
The question we ask is as follows: Is the origin transient or recurrent? The answer depends on p.
If p > 1/2, then the walker has a positive bias, so it would be natural to guess the origin is transient –
it will drift away toward the infinity and will never get back to the origin after some time. Likewise, if
p < 1/2, it will drift away toward negative infinity and the origin should be transient. Lastly, if the walker
is symmetric, p = 1/2, maybe the origin is recurrent. We will justify these guesses using Theorem 3.5.6.
The key observation here is the following combinatorial fact: In order to return to the origin, the
walker must take an even number of steps with equal number of up-steps and down-steps. It follows
that
P(X_{2n} = 0 | X_0 = 0) = \binom{2n}{n} p^n (1 − p)^n = ((2n)!/(n! n!)) (p(1 − p))^n. (300)
We will need to understand how n! grows as n grows. We will apply Stirling’s approximation, which states
that
n! ∼ (n/e)^n √(2πn), (301)
where a n ∼ b n means a n /b n → 1 as n → ∞. Using this, one can show that
P(X_{2n} = 0 | X_0 = 0) ∼ ((2n/e)^{2n} √(4πn)) / ((n/e)^{2n} · 2πn) · (p(1 − p))^n = (1/√(πn)) (4p(1 − p))^n. (302)
Now consider the symmetric case p = 1/2. Then P(X_{2n} = 0) ∼ 1/√(πn). This implies that for some
constant c > 0, P(X_{2n} = 0) ≥ c/√(πn) for all n ≥ 1. Now in order to see the recurrence of the origin, we
sum the return probabilities:
∑_{n=1}^∞ P(X_n = 0 | X_0 = 0) = ∑_{n=1}^∞ P(X_{2n} = 0 | X_0 = 0) ≥ ∑_{n=1}^∞ c/√(πn) = ∞,
so the origin is recurrent by Theorem 3.5.6. On the other hand, if p ≠ 1/2, then 4p(1 − p) < 1, so by (302)
the return probabilities decay geometrically and their sum is finite. Hence the origin is transient in this
case, again by Theorem 3.5.6. ▲
2Kakutani, at a UCLA colloquium, said: “A drunk man will eventually find his way home, but a drunk bird may get lost forever.”
F IGURE 7. A sample path of RW on Z2 . Projecting it on y = ±x shows that it has the same distri-
bution as a pair of independent random walks on Z. Figure excerpted from http://www.cmat.
edu.uy/~lessa/resource/randomwalknotes.pdf
Example 3.5.8 (SSRW on Z2 is recurrent). Let Xn denote a SSRW on Z2 started at the origin. The two
coordinates of Xn are not independent, since exactly one of them changes at each step. In order to
handle this issue, we are going to compare Xn with the following auxiliary process Yn = [Yn(1) , Yn(2) ],
where Yn(1) and Yn(2) are independent SSRWs on Z. So each Yn(i) changes by ±1 with equal probability,
independently for each i = 1, 2:
P(Y_{n+1}^{(i)} − Y_n^{(i)} = 1) = P(Y_{n+1}^{(i)} − Y_n^{(i)} = −1) = 1/2 for i = 1, 2. (305)
Now, since Yn(1) and Yn(2) change by ±1 at the same time, the vector Yn = [Yn(1) , Yn(2) ] on Z2 changes by [1, 1],
[−1, 1], [1, −1], [−1, −1] with equal probability. This is exactly the original SSRW Xn rotated by 45 degrees
and scaled by √2, so that it lives on the ‘checkerboard lattice’ {(a, b) | a, b ∈ Z, a + b is even} ⊆ Z2 . From
this observation, it follows that the probability of Yn returning to the origin is the same as the probability
of Xn returning to the origin. This gives
P(X_{2n} = [0, 0] | X_0 = [0, 0]) = P(Y_{2n} = [0, 0] | Y_0 = [0, 0]) = P(Y_{2n}^{(1)} = 0) P(Y_{2n}^{(2)} = 0),
where the last equality is due to the independence between Yn(1) and Yn(2) . By using the analysis in Exam-
ple 3.5.7, we get
P(X_{2n} = [0, 0] | X_0 = [0, 0]) ∼ 1/(πn). (310)
To conclude, note that Xn can only return to the origin after an even number of steps, and the above
asymptotic yields that for some constant c > 0, P(X_{2n} = [0, 0] | X_0 = [0, 0]) ≥ c/(πn) for all n ≥ 1. Hence
∑_{n=1}^∞ P(X_n = 0 | X_0 = 0) = ∑_{n=1}^∞ P(X_{2n} = 0 | X_0 = 0) ≥ ∑_{n=1}^∞ c/(πn) = ∞. (311)
Therefore the origin is recurrent by Theorem 3.5.6. ▲
Example 3.5.9 (SSRW on Zd for d ≥ 3 is transient (“A drunk bird never finds way home”)). Let Xn denote
a SSRW on Zd with d ≥ 3, which chooses one of the 2d steps [±1, 0, . . . , 0], [0, ±1, 0, . . . , 0], . . . [0, . . . , 0, ±1]
independently with equal probability at each move. We will first try the same comparison as in Example
3.5.8: take Yn = [Yn(1) , . . . , Yn(d) ] with Yn(i) independent SSRWs on Z, each changing by ±1 with equal
probability, independently for each i = 1, . . . , d. This simple argument almost works, but for d ≥ 3 the
diagonal process Yn is no longer a rotation of Xn , so the comparison fails. We therefore compute the
return probabilities of Xn directly for d = 3. In order to have X_{2n} = 0, the walk must take equally many
steps in the +e_i and −e_i directions for each coordinate i = 1, 2, 3; let these common counts
be a, b, c, respectively. Hence
P(X_{2n} = 0 | X_0 = 0) = ∑_{a,b,c≥0, a+b+c=n} ((2n)!/(a! b! c!)²) (1/6)^{2n} (316)
= \binom{2n}{n} (1/2)^{2n} ∑_{a,b,c≥0, a+b+c=n} (n!/(a! b! c!))² (1/3)^{2n}, (317)
where we have used the identity (2n)!/(a! b! c!)² = \binom{2n}{n} (n!/(a! b! c!))². Now note that
1 = ∑_{a,b,c≥0, a+b+c=n} (n!/(a! b! c!)) (1/3)^n, (318)
since the right hand side equals the total probability of putting n balls into three bins uniformly at ran-
dom. Also, observe that
n!/(a! b! c!) ≤ n!/(⌊n/3⌋! ⌊n/3⌋! ⌊n/3⌋!) (319)
for all a, b, c ≥ 0 with a + b + c = n, where ⌊n/3⌋ denotes the largest integer not exceeding n/3. In words,
the number of arrangements of n balls into three bins is largest when the bins are as equal as possible.
Using this,
P(X_{2n} = 0) = \binom{2n}{n} (1/2)^{2n} ∑_{a,b,c≥0, a+b+c=n} (n!/(a! b! c!))² (1/3)^{2n} (320)
≤ \binom{2n}{n} (1/2)^{2n} (n!/(⌊n/3⌋!)³) (1/3)^n (321)
∼ (1/√(πn)) · ((n/e)^n √(2πn)) / ((n/(3e))^{n/3} √(2πn/3))³ · (1/3)^n = O(n^{−3/2}). (322)
From this, one can deduce that ∑_{n=1}^∞ P(X_n = 0 | X_0 = 0) ≤ ∑_{n=1}^∞ c n^{−3/2} < ∞ for some
constant c > 0. Thus, by Theorem 3.5.6, we conclude that Xn is transient. ▲
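As a sanity check on Pólya's theorem, the following Monte Carlo sketch (the horizon and trial counts are arbitrary choices) estimates the probability that SSRW on Zd returns to the origin within a finite horizon: the finite-horizon estimates drift toward 1 for d = 1, 2 and remain clearly below 1 for d = 3.

    # Monte Carlo sketch of Polya's theorem: estimate the probability that a
    # SSRW on Z^d returns to the origin within `steps` steps.
    import random

    def return_prob(d, steps=10_000, trials=2_000):
        hits = 0
        for _ in range(trials):
            pos = [0] * d
            for _ in range(steps):
                i = random.randrange(d)            # pick a coordinate uniformly
                pos[i] += random.choice((-1, 1))   # move +-1 along it
                if all(c == 0 for c in pos):       # returned to the origin
                    hits += 1
                    break
        return hits / trials

    for d in (1, 2, 3):
        print(d, return_prob(d))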
3.5.3. The set of recurrent states is closed. In this subsection, we will see that a Markov chain will
eventually enter the set of recurrent states and will never leave it. This is proved in the following state-
ment.
Definition 3.5.10 (Closed sets). Fix a subset of states A ⊆ Ω. We say A is closed if ρ x y = 0 for all x ∈ A and
y ∈ Ω \ A.
Proposition 3.5.11. For distinct x, y ∈ Ω, the following hold:
(i) Suppose ρ x y > 0 and ρ y x < 1. Then x is transient.
(ii) Suppose ρ x y > 0 and ρ xx = 1. Then ρ y x = 1 and y is recurrent.
(iii) Suppose x is recurrent and y is transient. Then ρ x y = 0. In particular, the set of recurrent states is
closed.
P ROOF. To show (i), suppose ρ x y > 0 and ρ y x < 1. Heuristically, this means that the chain can go
from x to y with some positive probability, but there is a positive chance 1 − ρ y x > 0 that it will never
return to x. Then it does not visit x any more, so x should be transient. To make this argument rigorous,
we note that
1 − ρ xx = P(X n never visits x | X 0 = x) (323)
≥ P(X n visits y and never returns to x | X 0 = x) (324)
= P(X n never returns to x after time T y | X 0 = x) P(T y < ∞ | X 0 = x) (325)
= P(X n never returns to x | X 0 = y) ρ x y . (326)
= (1 − ρ y x )ρ x y > 0, (327)
where we have used the strong Markov property to get the third equality. This shows ρ xx < 1, so x is
transient.
Next, in order to show (ii), suppose ρ_xy > 0 and ρ_xx = 1. In this case, we first observe that ρ_yx = 1,
since otherwise by part (i) x must be transient, which contradicts the assumption that ρ_xx = 1. Hence,
heuristically, assuming X_0 = y, the chain almost surely visits x, and from x there is a positive chance
ρ_xy > 0 of returning to y; if a trial fails, the chain eventually returns to x, and by the strong Markov
property we can start the game anew. Among these infinitely many independent trials, it should succeed
at least once, since the success probability per trial is ρ_yx ρ_xy = ρ_xy > 0.
More rigorously, since ρ_xy , ρ_yx > 0, there are integers k, ℓ ≥ 1 such that P^k(x, y) > 0 and P^ℓ(y, x) > 0
by Exercise 3.5.2. Hence
∑_{n=1}^∞ P^n(y, y) ≥ ∑_{n=1}^∞ P^{ℓ+n+k}(y, y) ≥ ∑_{n=1}^∞ P^ℓ(y, x) P^n(x, x) P^k(x, y) (328)
= P^ℓ(y, x) ( ∑_{n=1}^∞ P^n(x, x) ) P^k(x, y) = ∞. (329)
The last equality follows since P^k(x, y), P^ℓ(y, x) > 0 and ∑_{n=1}^∞ P^n(x, x) = ∞ as x is recurrent (see
Theorem 3.5.6). Then by Theorem 3.5.6, we deduce that y is also recurrent.
Lastly, (iii) is a direct consequence of (ii), by contradiction: if x is recurrent, y is transient, and ρ_xy > 0,
then (ii) would imply that y is recurrent, a contradiction. Hence ρ_xy = 0. This implies that the set of all
recurrent states is closed. □
3.5.4. Canonical decomposition of the state space. In this subsection, we discuss a generic struc-
ture of the state space Ω of a Markov chain.
Definition 3.5.12 (Accessibility). For two states x, y ∈ Ω, we say y is accessible from x, and denote
x −→ y, if P^n(x, y) > 0 for some n ≥ 0. If x and y are mutually accessible from each other, we say x and y
communicate and denote x ←→ y.
From Exercise 3.5.2, it is easy to observe that x −→ y if and only if ρ_xy > 0 or x = y. The following
exercise shows that the communication relation ←→ is an equivalence relation.
Exercise 3.5.13. Show that the binary relation ‘←→’ between the states defined in Definition 3.5.12 is an
equivalence relation. Namely, it satisfies the following properties:
(i) (reflexivity) x ←→ x for all x ∈ Ω;
(ii) (symmetry) x ←→ y implies y ←→ x;
(iii) (transitivity) x ←→ y and y ←→ z imply x ←→ z.
Next, we define irreducible and closed subsets. In words, a set of states is called ‘irreducible’ if all
states there communicate, and ‘closed’ if it never leaves there once it gets there.
Definition 3.5.14 (Irreducible sets). Fix a subset of states A ⊆ Ω. We say A is irreducible if x ←→ y for all
x, y ∈ A and closed if ρ x y = 0 for all x ∈ A and y ∈ Ω \ A.
For each x ∈ Ω, let R_x denote the communicating class of x:
R_x = {y ∈ Ω | x ←→ y}. (330)
A direct consequence of communication ←→ being an equivalence relation between states is that the
state space Ω is partitioned into pairwise disjoint communicating classes. While transient states may
form multiple communicating classes, it is customary to put them in one subset as stated in Theorem
3.5.15.
Theorem 3.5.15 (Canonical decomposition of the state space). The state space Ω of a Markov chain can
be decomposed as
Ω = T ⊔ R = T ⊔ ( ⊔_i R_i ), (331)
where T is the set of all transient states, R is the set of all recurrent states, and {R i } is at most a countable
collection of pairwise disjoint, closed, and irreducible sets of recurrent states.
Consequently, the transition matrix P can be written in the following block form (after permuting
the rows and columns):
          R     T                 R_1       R_2      ···   R_m      T
P =  R [ P_R    O  ]   =   R_1 [ P_{R_1}    O        ···    O       O   ]
     T [  S    P_T ]       R_2 [   O       P_{R_2}   ···    O       O   ]
                            ⋮  [   ⋮         ⋮        ⋱     ⋮       ⋮   ]
                           R_m [   O         O       ···  P_{R_m}   O   ]
                            T  [  S_1       S_2      ···   S_m     P_T  ],   (332)
where the last expression assumes that there are only finitely many communicating recurrent classes
R1, . . . , Rm .
Example 3.5.16 (Gambler's ruin in canonical form). Consider gambler's ruin with state space Ω =
{0, 1, 2, 3, 4, 5} and transition matrix P given as (199). The set of recurrent states is R = {0, 5} and T =
{1, 2, 3, 4} is the set of transient states (why?). Then a canonical form for the transition matrix is
         0     5     1     2     3     4
    0 [  1     0     0     0     0     0  ]
    5 [  0     1     0     0     0     0  ]
P = 1 [ 1−p    0     0     p     0     0  ]   (333)
    2 [  0     0    1−p    0     p     0  ]
    3 [  0     0     0    1−p    0     p  ]
    4 [  0     p     0     0    1−p    0  ].
▲
Definition 3.5.17. Fix a Markov chain (X_n)_{n≥0} on state space Ω with transition matrix P. The chain
(as well as P) is said to be irreducible if Ω is irreducible, and recurrent if it is irreducible and all states
are recurrent.
Proposition 3.5.18. The following hold:
(i) A finite closed set A ⊆ Ω contains at least one recurrent state. In particular, if Ω is finite, then it contains
at least one recurrent state.
(ii) If A ⊆ Ω is finite, closed, and irreducible, then all states in A are recurrent.
(iii) A finite state Markov chain is recurrent if it is irreducible.
P ROOF. Let N_y denote the total number of visits to y during [0, ∞). Note that ∑_{y∈A} N_y counts the
total number of visits to any state in A, and it is infinite since A is closed: once the chain is in A, it stays
in A forever. But then
∞ = E[ ∑_{y∈A} N_y ] = ∑_{y∈A} E[N_y]. (334)
Since the last expression is a finite sum, it implies that E[N y ] = ∞ for some y ∈ A. Then such y ∈ A is
recurrent by Theorem 3.5.6. For the second part, note that Ω is a closed subset of Ω, so it must contain a
recurrent state by what we have just shown. This shows (i).
To show (ii), suppose A ⊆ Ω is finite, closed, and irreducible. According to (i), A contains at least one
recurrent state. But then by the irreducibility and Proposition 3.5.11, all states in A must be recurrent.
This shows (ii).
Lastly, (iii) follows immediately from (ii). □
3.5.5. Generalized gambler's ruin and expected time until absorption. In this subsection, we gen-
eralize the gambler's ruin example (Example 3.1.2) and see how we can analyze various statistics related
to the first recurrent state ever reached.
Namely, let (X n )n≥0 be a Markov chain on a finite state space Ω with transition matrix P . We may
decompose Ω = T t R, where T is the set of all transient states, R is the set of all recurrent states. In
Proposition 3.5.18, we have seen that any finite-state Markov chain contains at least one recurrent state,
so R must be nonempty. We also assume that T is nonempty (just like gambler’s ruin). Define
TR = inf{n ≥ 0 | X n ∈ R} = first time to be in R. (335)
Since eventually X n must enter R (why?), TR is finite almost surely. Then
X TR = first recurrent state ever reached. (336)
We are interested in computing the following matrix U : T × R → [0, 1], where
U(x, y) = P(X_{T_R} = y | X_0 = x). (337)
Namely, U (x, y) is the probability that, starting from a transient state x, the first ever recurrent state
reached is y.
For our analysis, we use the canonical decomposition of P in (332):
         R     T
P =  R [ P_R    O  ]
     T [  S    P_T ]. (338)
P ROOF. As before, we use ‘first step analysis’. Fix x ∈ T. By conditioning on X_1, we first write
W(x) = E[T_R | X_0 = x] (350)
= ∑_{z∈Ω} E[T_R | X_1 = z, X_0 = x] P(x, z) (351)
= 1 + ∑_{z∈Ω} E[T_R | X_0 = z] P(x, z) (352)
= 1 + ∑_{z∈T} E[T_R | X_0 = z] P(x, z) (353)
= 1_T(x) + ∑_{z∈T} P_T(x, z) W(z). (354)
The justification of (353) is similar to that of (341): T_R = 0 when X_0 ∈ R. This gives
(I − P_T) W = 1_T. (355)
Using Exercise 3.5.20, we can derive W = (I − P_T)^{−1} 1_T, as desired. □
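As a numerical companion, the sketch below computes W = (I − P_T)^{−1} 1_T for the gambler's ruin chain of Example 3.5.16, together with the hitting-distribution matrix U; here we use the standard first-step identity (I − P_T)U = S, derived in the same way as (350)–(354).

    # Absorption statistics for the gambler's ruin chain of Example 3.5.16.
    # U = (I - P_T)^{-1} S and W = (I - P_T)^{-1} 1, with R = (0, 5), T = (1, 2, 3, 4).
    import numpy as np

    p = 0.5
    S = np.array([[1 - p, 0],     # from state 1: one step to 0 w.p. 1-p
                  [0,     0],
                  [0,     0],
                  [0,     p]])    # from state 4: one step to 5 w.p. p
    PT = np.array([[0,     p,     0,     0],
                   [1 - p, 0,     p,     0],
                   [0,     1 - p, 0,     p],
                   [0,     0,     1 - p, 0]])

    I = np.eye(4)
    U = np.linalg.solve(I - PT, S)             # U(x, y) for x in T, y in R
    W = np.linalg.solve(I - PT, np.ones(4))    # W(x) = E[T_R | X_0 = x]
    print(U)   # for p = 1/2: ruin probabilities 4/5, 3/5, 2/5, 1/5
    print(W)   # for p = 1/2: expected absorption times 4, 6, 6, 4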
Exercise 3.5.22. Consider a Markov chain (X n )n≥0 on state space Ω = {1, 2, 3, 4, 5, 6, 7, 8, 9} whose transi-
tion probabilities are given by the following state-space diagram:
(i) Show that T = {1, 2, 3, 4} is the set of transient states and R = {5, 6, 7, 8, 9} is the set of recurrent states.
(ii) Let U denote the matrix T × R → [0, 1] where U (x, y) is the probability that the chain first hits y
among all recurrent states, starting from x. Compute U . You may use
[  1    −1/5    0      0   ]⁻¹             [ 120   24     6     2  ]
[  0     1     −1/4    0   ]     = (1/119) [  5   120    30    10  ]
[  0     0      1     −1/3 ]               [ 20    4    120    40  ]
[ −1/2   0      0      1   ]               [ 60   12     3    120  ].   (356)
(iii) Compute ρ 1,5 and ρ 1,6 . (Hint: Use U in part (ii).)
(iv) For each x ∈ T , let W (x) denote the expected hitting time of any recurrent state starting from x.
Compute W : T → [0, ∞).
(v) Compute all stationary distributions of the Markov chain.
We will see that with reasonable assumptions, a Markov chain will admit a unique stationary distribution
given by
π(y) := lim_{n→∞} (1/n) E[V_n(x, y)], (358)
which is the expected long-run proportion of time spent at y starting from x; it turns out that this does not
depend on x. The main result in this section, Theorem 3.6.4, states that when the Markov chain satisfies
certain reasonable conditions, (358) is indeed a stationary distribution of the Markov chain. However,
it is not a priori clear that the limit in (358) exists. Instead of taking the limiting frequency of visits to y,
we will instead construct a measure by only counting visits to y during a single ‘excursion’ from x to x,
which occurs during the time interval [0, T x ]. As we will see later, such a construction will give a valid
probability distribution if and only if E[T x | X 0 = x] is finite. For a further discussion, we introduce the
following notion.
Definition 3.6.1. Let (X n )n≥0 be a Markov chain on a state space Ω. We say a recurrent state x ∈ Ω is
positive recurrent if E[T x | X 0 = x] < ∞ and null recurrent if E[T x | X 0 = x] = ∞.
Proposition 3.6.2 (Positive recurrence is a class property). Let x, y be two communicating states. If one
is positive recurrent, then so is the other.
Example 3.6.3 (SSRW on Z is null recurrent). In Example 3.5.7, we have seen that SSRW on Z is recurrent,
which is equivalent to P(T_x < ∞ | X_0 = x) = 1. However, this does not necessarily mean that the
expected return time E[T_x | X_0 = x] is finite. Indeed, SSRW on Z turns out to be null recurrent. For this,
we recall the following asymptotic for the ‘persistence probability’ of SSRW on Z:
P(X_1 ≥ 1, X_2 ≥ 1, . . . , X_n ≥ 1 | X_0 = 0) ∼ √(2/(πn)). (359)
(For reference, see, e.g., [LS19].) It follows that, by the tail-sum formula for the expectation of nonnega-
tive RVs,
E[T_0 | X_0 = 0] = ∑_{n=1}^∞ P(T_0 ≥ n) ∼ ∑_{n=1}^∞ √(2/(πn)) = ∞. (360)
Thus the origin is null recurrent. By symmetry, all other states are also null recurrent. ▲
The full theorem regarding the existence and uniqueness of the stationary distribution is the follow-
ing:
Theorem 3.6.4 (Existence and uniqueness of stationary distribution). Assume P is irreducible. Then P
has a stationary distribution if and only if all states are positive recurrent. In this case the stationary
distribution π is unique and its values satisfy
π(x) = 1/E[T_x | X_0 = x] ∀x ∈ Ω. (361)
Corollary 3.6.5. Assume Ω is finite and P is irreducible. Then P has a unique stationary distribution π
given by (361).
P ROOF. By Theorem 3.6.4, we only need to verify that all states in Ω are positive recurrent. This is done
in Exercise 3.6.6. □
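Formula (361) is easy to test by simulation. The sketch below uses the 2-state chain that also appears in Exercise 3.9.1 (stationary distribution [3/7, 4/7]); the empirical mean return times should approach 7/3 and 7/4.

    # Simulation sketch of (361): mean return time to x should match 1/pi(x).
    import random

    P = [[0.2, 0.8],
         [0.6, 0.4]]

    def mean_return_time(x, trials=100_000):
        total = 0
        for _ in range(trials):
            state, steps = x, 0
            while True:
                state = 0 if random.random() < P[state][0] else 1
                steps += 1
                if state == x:
                    break
            total += steps
        return total / trials

    print(mean_return_time(0))  # close to 1/(3/7) = 7/3 ~ 2.333
    print(mean_return_time(1))  # close to 1/(4/7) = 7/4 = 1.75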
Exercise 3.6.6 (Finite irreducible implies positive recurrence). Let (X n )n≥0 be an irreducible Markov
chain on a finite state space Ω with transition matrix P . Let X 0 = x ∈ Ω. For any y ∈ Ω and t ≥ 0, de-
fine the first hitting time of y by
T y = min{k ≥ 1 | X k = y}. (362)
We will show that, for all x, y ∈ Ω,
E[T y | X 0 = x] < ∞. (363)
That is, the expected hitting time of (X n )n≥0 to any state starting from any state is finite.
(i) Recall that by the irreducibility and Exercise 3.5.2, for any z, y ∈ Ω, P^r(z, y) > 0 for some r ≥ 1 (r may
depend on z and y). Let
d = max_{z,y∈Ω} min{r ≥ 1 | P^r(z, y) > 0}. (364)
Show that d < ∞ and that there exists δ > 0 such that P(T_y ≤ d | X_0 = z) ≥ δ for all z, y ∈ Ω. (365)
(ii) Show that for any k ≥ 1 (you may use the strong Markov property),
P(T_y > kd) ≤ (1 − δ) P(T_y > (k − 1)d). (366)
Use induction to conclude that
P(T_y > kd) ≤ (1 − δ)^k. (367)
(iii) Noting that P(T_y ≥ t) is a non-increasing function in t, show that
E[T_y] = ∑_{t=1}^∞ P(T_y ≥ t) ≤ d ∑_{k=0}^∞ P(T_y ≥ kd) ≤ d ∑_{k=0}^∞ (1 − δ)^k = d/δ < ∞. (368)
Conclude that all states of a finite irreducible Markov chain are positive recurrent.
which is a contradiction. This implies that x also attains its global maximum there. Applying the same
argument to the neighbors of k, and repeating, we see that x must be constant on the ‘connected
component’ of G containing k. But since G is connected, it follows that x is constant on V. This shows
the assertion. □
If we carefully examine the proof of the above result, we find that we haven't really used the values of the en-
tries of the transition matrix P. Rather, we only used an abstract property following from the connectivity
of G: we can reach any node from any other node. We extract this as a general property of Markov chains.
Definition 3.7.3 (Irreducibility). Let P be the transition matrix of a Markov chain (X_n)_{n≥0} on a finite state
space Ω. We say the chain (or P) is irreducible if for any i, j ∈ Ω,
P(X_k = j | X_0 = i) > 0 for some k ≥ 0. (378)
In words, a Markov chain is irreducible if every state is ‘accessible’ from any other state.
Exercise 3.7.4 (RW on connected graphs is irreducible). Let G = (V, E) be a connected graph. Let (X_n)_{n≥0}
be a random walk on G and denote by P its transition matrix. Let v_0, v_1, · · · , v_k be a sequence of nodes
such that consecutive nodes are adjacent in G. Show that P^k(v_0, v_k) > 0, and conclude that P is irreducible.
Lemma 3.7.5 (Uniqueness of stationary distribution). Let (X_n)_{n≥0} be an irreducible Markov chain. Then
it has at most one stationary distribution.
3.7.1. Construction of stationary distribution (Optional*). The gist of the construction of a sta-
tionary distribution using the visit counts V_n(x, y) is given in the following lemma. It constructs a
stationary distribution for general Markov chains with a positive recurrent state.
Lemma 3.7.6. Consider a Markov chain (X n )n≥0 on Ω with transition matrix P . Fix a recurrent state x ∈ Ω
and for each y ∈ Ω, define
λ_x(y) := E[ V_{T_x}(x, y) ], (381)
which is the expected number of visits to y during [0, T_x] (a single excursion from x to x).
(i) For each x ∈ Ω, λ_x(x) = 1 and ∑_{y∈Ω} λ_x(y) = E[T_x | X_0 = x].
(ii) For each x ∈ Ω, λ_x = λ_x P.
(iii) If x is positive recurrent, then π_x := λ_x / E[T_x | X_0 = x] is a stationary distribution for P.
where the third equality uses Fubini’s theorem for nonnegative summands.
To show (ii), our goal is to verify for each y ∈ Ω,
∑_{z∈Ω} λ_x(z) P(z, y) = λ_x(y). (385)
In the following proposition, we make a simple observation that any stationary distribution will take
value zero on transient states.
Proposition 3.7.8 (Stationary distribution is supported on recurrent states). Let P be a transition matrix
on Ω with a stationary distribution π. Then for each x ∈ Ω, π(x) > 0 implies that x is recurrent.
where we have used Fubini’s theorem to switch the order of double summation of nonnegative terms and
the last equality uses (296). This implies that there exists some z ∈ Ω such that π(z) > 0 and E[N x | X 0 =
z] = ∞. But by Proposition 3.5.5 this implies that x is recurrent (as well as ρ zx > 0), as desired. □
P ROOF OF T HEOREM 3.6.4. According to Theorem 3.7.7, it suffices to show that, assuming irreducibil-
ity and the existence of stationary distribution π, all states are positive recurrent. Indeed, since π is a
probability distribution, π(x) > 0 for some x ∈ Ω, so by Proposition 3.7.8, x is recurrent. But since P is
irreducible, it follows that all states are recurrent. Next, we need the following lemma:
Lemma: For an irreducible and recurrent Markov chain, any two invariant measures for P are constant
multiples of each other. That is, if λ = λP and µ = µP, then λ = cµ for some c > 0.
For the proof of the above lemma, see [SV, Thm. 2.67]. Now, by Lemma 3.7.6, λx is an invariant measure
for P for each x ∈ Ω. In particular, we have λx = c x π for some c x > 0. Then
E[T_x | X_0 = x] = ∑_{y∈Ω} λ_x(y) = c_x ∑_{y∈Ω} π(y) = c_x < ∞, (403)
so x is positive recurrent. Since this holds for all x, we conclude that all states are positive recurrent. □
P ROOF OF P ROPOSITION 3.6.2. Suppose two states x, y ∈ Ω communicate and x is positive recur-
rent. By restricting to the communicating class that contains x and y, we may assume that the Markov
chain is irreducible and recurrent. By Lemma 3.7.6, there is a stationary distribution, say, π_x. Then by a
similar argument as in the last part of the proof of Theorem 3.6.4, we have λ_y = c_y π_x for some c_y > 0,
and this implies that y is also positive recurrent. □
Theorem 3.8.1 (SLLN for finite Markov chains). Let (X n )n≥0 be an irreducible Markov chain on finite state
space Ω with transition matrix P with arbitrary initial distribution. Let f : Ω → R denote a function. Then
almost surely,
lim_{n→∞} (1/n) ∑_{k=1}^n f(X_k) = E_{X∼π}[f(X)] = ∑_{x∈Ω} f(x) π(x), (404)
where the second equality of exchanging limit and summation uses the fact that Ω is finite (it would be
non-trivial if Ω is infinite) and the last equality uses Exercise 3.8.2. □
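Here is a quick empirical check of Theorem 3.8.1 (the 2-state chain is the one from Exercise 3.9.1, and the test function f is an arbitrary choice): the time average of f along one long trajectory approaches ∑_x f(x)π(x) regardless of the initial state.

    # SLLN for Markov chains: running average of f(X_k) vs sum_x f(x) pi(x).
    import random

    P = [[0.2, 0.8],
         [0.6, 0.4]]
    f = [1.0, 5.0]                  # hypothetical test function f(0), f(1)

    state, acc, n = 0, 0.0, 1_000_000
    for _ in range(n):
        state = 0 if random.random() < P[state][0] else 1
        acc += f[state]

    print(acc / n)                  # ~ 1*(3/7) + 5*(4/7) = 23/7 ~ 3.2857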
Exercise 3.8.2 (Convergence of empirical distribution). Let (X t )t ≥0 be an irreducible Markov chain on Ω
with a unique stationary distribution π. For each x ∈ Ω, define the following quantities
T_x^(k) := time of the kth visit to x, (410)
V_x(n) := ∑_{k=1}^n 1(X_k = x) = number of visits to x during [0, n]. (411)
Here X 0 follows an arbitrary initial distribution µ on Ω.
(i) Let τk = T x(k) − T x(k−1) for k ≥ 2. Show that τk ’s are i.i.d. and 0 < E[τk ] < ∞. (For the i.i.d. part, use
strong Markov property. For the finite expectation part, see Exercise 3.6.6 for the finite case and
Theorem 3.6.4 for the general case.)
(ii) Use the Markov property and the strong law of large numbers to show
P( lim_{n→∞} (τ_2 + · · · + τ_n)/n = E[τ_2] ) = 1. (412)
Furthermore, noting that T_x^(n) = T_x^(1) + τ_2 + · · · + τ_n, conclude that
P( lim_{n→∞} T_x^(n)/n = E[τ_2] ) = 1. (413)
(iii) Let V_x(t) = V(t) be the number of visits to x up to time t. Using the fact that E[τ_k] < ∞, show that
V(t) → ∞ as t → ∞ a.s. Also, noting that T_x^{(V(n))} ≤ n < T_x^{(V(n)+1)}, use (ii) to deduce
P( lim_{n→∞} V_x(n)/n = 1/E[τ_2] ) = 1. (414)
Exercise 3.9.1. Let (X n )n≥0 be a Markov chain on Ω = {1, 2} with the following transition matrix
P = [ 0.2  0.8 ]
    [ 0.6  0.4 ].   (425)
(i) Show that P admits the following diagonalization
P = [ 1  −4/3 ] [ 1    0   ] [ 1  −4/3 ]⁻¹
    [ 1    1  ] [ 0  −2/5  ] [ 1    1  ].   (426)
(ii) Show that P t admits the following diagonalization
P^t = [ 1  −4/3 ] [ 1       0     ] [ 1  −4/3 ]⁻¹
      [ 1    1  ] [ 0  (−2/5)^t   ] [ 1    1  ].   (427)
(iii) Let rt denote the row vector of distribution of X t . Show that
r_t = r_0 [ 1  −4/3 ] [ 1       0     ] [ 1  −4/3 ]⁻¹
          [ 1    1  ] [ 0  (−2/5)^t   ] [ 1    1  ].   (428)
Also show that
lim_{t→∞} r_t = r_0 [ 3/7  4/7 ]
                    [ 3/7  4/7 ] = [3/7, 4/7].   (429)
(iv) Show that π = [3/7, 4/7] is the unique stationary distribution for P . Conclude that regardless of the
initial distribution r0 , the distribution of the Markov chain (X n )n≥0 converges to [3/7, 4/7].
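The diagonalization in Exercise 3.9.1 can be checked numerically; the sketch below recovers π as the normalized left eigenvector of P for eigenvalue 1 and shows P^t collapsing to the rank-one matrix with rows π.

    # Numerical companion to Exercise 3.9.1.
    import numpy as np

    P = np.array([[0.2, 0.8],
                  [0.6, 0.4]])

    eigvals, V = np.linalg.eig(P.T)        # left eigenvectors of P = eigenvectors of P^T
    pi = V[:, np.argmax(eigvals.real)].real
    pi /= pi.sum()
    print(pi)                              # [3/7, 4/7] ~ [0.4286, 0.5714]

    print(np.linalg.matrix_power(P, 50))   # both rows ~ pi; the (-2/5)^t term has died out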
Next, we observe that an irreducible MC may not always converge to the stationary distribution. The
key issue there is the notion of ‘periodicity’.
Example 3.9.2. Let (X n )n≥0 be a 2-state MC on state space Ω = {0, 1} with transition matrix
P = [ 0  1 ]
    [ 1  0 ].   (430)
Namely, the chain deterministically alternates between the two states. Note that it is irreducible and has
a unique stationary distribution
π = [1/2, 1/2]. (431)
Let πt be the distribution of X n , where the initial distribution is given by π0 = [1, 0]. Then we have
π1 = [0, 1] (432)
π2 = [1, 0] (433)
π3 = [0, 1] (434)
π4 = [1, 0], (435)
and so on. Hence the distributions πn do not converge to the stationary distribution π. ▲
Example 3.9.3 (RW on torus). Let Z_n be the set of integers modulo n. Let G = (V, E) be a graph where
V = Z_n × Z_n and two nodes u = (u_1, u_2) and v = (v_1, v_2) are adjacent if and only if
|u_1 − v_1| + |u_2 − v_2| = 1, (436)
where the differences are computed modulo n. Such a graph G is called the n × n torus and we write
G = Z_n × Z_n. Intuitively, it is obtained from the n × n square grid by adding boundary edges to wrap
around (see Figure 10 left).
F IGURE 10. (Left) Torus graph (Figure excerpted from [LP17]). (Right) RW on torus G = Z6 × Z6
has period 2.
Now let (X t )t ≥0 be a random walk on G. Since G is connected, X t is irreducible. Since all nodes in
G have degree 4, the uniform distribution on Zn × Zn , which we denote by π, is the unique stationary
distribution of X t . Let πt denote the distribution of X t .
For instance, consider G = Z_6 × Z_6. As illustrated in Figure 10 (right), observe that if X_0 is one of the
red nodes (where the sum of coordinates is even), then X_t is at a red node at every even time t and at a
black node (where the sum of coordinates is odd) at every odd time t. Hence, π_t is supported only on
the ‘even’ nodes at even times and on the ‘odd’ nodes at odd times. Hence π_t does not converge in any
sense to the uniform distribution π. ▲
The key issue in the 2-periodicity of RW on G = Z_6 × Z_6 is that it takes an even number of steps to return
to any given node. Generalizing this observation, we introduce the following notion of periodicity.
Definition 3.9.4. Let P be the transition matrix of a Markov chain (X t )t ≥0 on a finite state space Ω. For
each state x ∈ Ω, let T (x) = {t ≥ 1 | P t (x, x) > 0} be the set of times when it is possible for the chain to
return to starting state x. We define the period of x by the greatest common divisor of T (x). We say the
chain X t is aperiodic if all states have period 1.
Example 3.9.5. Let T(x) = {4, 6, 8, 10, · · · }. Then the period of x is 2, even though it is not possible to go
from x to x in 2 steps. If T(x) = {4, 6, 8, 10, · · · } ∪ {3, 6, 9, 12, · · · }, then the period of x is 1; the return
times to x are irregular. For the RW on G = Z_6 × Z_6 in Example 3.9.3, all nodes have period 2. ▲
Exercise 3.9.6 (Aperiodicity of RW on graphs). Let (X t )t ≥0 be a random walk on a connected graph G =
(V, E ).
(i) Show that all nodes have the same period.
(ii) If G contains an odd cycle C (e.g., triangle), show that all nodes in C have period 1.
(iii) Show that X t is aperiodic if and only if G contains an odd cycle.
(iv)* Show that X t is periodic if and only if G is bipartite. (A graph G is bipartite if there exists a partition
V = A ∪ B of nodes such that all edges are between A and B .)
Remark 3.9.7. If (X t )t ≥0 is an irreducible Markov chain on a finite state space Ω, then all states x ∈ Ω
must have the same period. The argument is similar to that for Exercise 3.9.6 (i).
Exercise 3.9.8 (Aperiodicity implies irreducibility for the product chain). Let P be an irreducible transi-
tion matrix for a Markov chain on Ω. Let Q be a matrix on Ω × Ω defined by
Q((x, y), (z, w)) = P (x, z) P (y, w) for x, y, z, w ∈ Ω. (437)
(i) Consider a Markov chain Z t := (X t , Y t ) on Ω × Ω that evolves by independently evolving its two coor-
dinates according to P . Show that Q above is the transition matrix for Z t .
(ii) Show that Q is irreducible if P is aperiodic.
We now state the convergence theorem for random walks on graphs.
Theorem 3.9.9 (Convergence of RW on graphs). Let (X_t)_{t≥0} be a random walk on a connected graph
G = (V, E) with an odd cycle. Let π denote the unique stationary distribution. Then for each x, y ∈ V,
lim P(X t = y | X 0 = x) = π(y). (438)
t →∞
P ROOF. Since G contains an odd cycle, random walk on G is aperiodic according to Exercise 3.9.6.
Let (Y_t)_{t≥0} be another RW on G. We evolve X_t and Y_t independently until they meet, after which we
evolve them in unison. Namely, let τ be the ‘meeting time’ of the two walkers:
τ = min{t ≥ 0 | X_t = Y_t}. (439)
Then we let X_t = Y_t for all t ≥ τ. Still, if we observe only one of the two random walks, it behaves exactly
as it should. (This is a ‘coupling’ between the two random walks.)
Now suppose X 0 = x for some x ∈ V and Y0 is drawn from the stationary distribution π. In particular,
the distribution of Y t is π for all t . Let πt denote the distribution of X t . Then for any y ∈ V ,
|π_t(y) − π(y)| = |P(X_t = y) − P(Y_t = y)| (440)
= |P(X_t = y, Y_t = y) + P(X_t = y, Y_t ≠ y) − P(X_t = y, Y_t = y) − P(X_t ≠ y, Y_t = y)| (441)
≤ P(X_t = y, Y_t ≠ y) + P(X_t ≠ y, Y_t = y) (442)
≤ P(X_t ≠ Y_t) (443)
= P(τ > t). (444)
Hence it suffices to show that P(τ > t) → 0 as t → ∞, as this will yield |π_t(y) − π(y)| → 0 as t → ∞. This
follows from the fact that P(τ < ∞) = 1, that is, the two independent random walks on G eventually meet
with probability 1 (see Exercise 3.9.12). □
The following exercise shows that if a RW on G is irreducible and aperiodic, then it is possible to
reach any node from any other node in a fixed number of steps.
Exercise 3.9.10. Let (X t )t ≥0 be a RW on a connected graph G = (V, E ). Let P denote its transition matrix.
Suppose G contains an odd cycle, so that the walk is irreducible and aperiodic. For each x ∈ V , let T (x)
denote the set of all possible return times to the state x.
(i) For any x ∈ V , show that a, b ∈ T (x) implies a + b ∈ T (x).
(ii) For any x ∈ V , show that T (x) contains 2 and some odd integer b.
(iii) For each x ∈ V , let b x be the smallest odd integer contained in T (x). Show that m ∈ T (x) whenever
m ≥ bx .
(iv) Let b ∗ = maxx∈V b x . Show that m ∈ T (x) for all x ∈ V whenever m ≥ b ∗ .
(v) Let r = |V | + b ∗ . Show that P r (x, y) > 0 for all x, y ∈ V .
With a little more work, one can also show a similar statement for general irreducible and aperiodic
Markov chains.
Exercise 3.9.11. Let P be the transition matrix of a Markov chain (X_t)_{t≥0} on a finite state space Ω. Show
that (i) implies (ii):
(i) P is irreducible and aperiodic.
(ii) There exists an integer r ≥ 1 such that P^r(x, y) > 0 for all x, y ∈ Ω.
Exercise 3.9.12 (Two independent copies eventually meet). Let (X_t)_{t≥0} and (Y_t)_{t≥0} be independent
copies of an irreducible and aperiodic Markov chain on a finite state space Ω. We will show that they
meet in finite time almost surely.
(i) Use Exercise 3.9.11 to find an integer r ≥ 1 such that P^r(x, y) > 0 for all x, y ∈ Ω.
(ii) Use (i) and the Markov property to show that in every r steps, X_t and Y_t meet with probability ≥ δ
for some constant δ > 0.
(iii) Let τ be the first time that X_t and Y_t meet. From (ii) and the Markov property, deduce that
P(τ ≥ kr) = P(τ ≥ r) P(X_t and Y_t never meet during [r, kr) | X_{r−1} ≠ Y_{r−1}) (449)
≤ (1 − δ) P(X_t and Y_t never meet during [r, kr) | X_{r−1} ≠ Y_{r−1}). (450)
By an induction on k, conclude that
P(τ ≥ kr ) ≤ (1 − δ)k → 0 as k → ∞. (451)
The general convergence theorem is stated below.
Theorem 3.9.13 (Convergence theorem). Let (X_t)_{t≥0} be an irreducible aperiodic Markov chain on a finite
state space Ω. Let π denote the unique stationary distribution of X_t. Then for each x, y ∈ Ω,
lim_{t→∞} P(X_t = y | X_0 = x) = π(y). (452)
P ROOF. There is a linear-algebraic proof of this result (see, e.g., [LP17, Thm. 4.9]). For a coupling-based
proof, mimic the argument for Theorem 3.9.9. □
F IGURE 11. MCMC simulation of Ising model on 200 by 200 torus at temperature T = 1 (left), 2
(middle), and 5 (right).
Oftentimes, we do not have the full list of nodes, especially when we want to sample a random node from a
social network. Given only local information (the set of neighbors of each given node), can we still sample
a uniform random node from G?
One answer is given by random walk. Indeed, random walks on graphs are defined by only using
local information of the underlying graph: Choose a random neighbor and move there. Moreover, since
G is connected, there is a unique stationary distribution π for the walk, which is given by
π(x) = deg_G(x)/(2|E|). (453)
Since G is regular, any two nodes have the same degree, so π(x) = π(y) for all x, y ∈ V . This means π
equals the uniform distribution µ on V . Hence the sampling algorithm we propose is as follows:
(∗) Run a random walk on G for t ≫ 1 steps, and take the random node that the walk sits on.
However, there is a possible issue of convergence. Namely, if the graph G does not contain any odd
cycle, then random walk on G is periodic (see Exercise 3.9.6), so we are not guaranteed to have conver-
gence. We can overcome this by using a lazy random walk instead, which is introduced in Exercise 3.10.2.
We know that the lazy RW on G is irreducible, aperiodic, and has the same set of stationary distributions
as the ordinary RW on G. Hence we can apply the sampling algorithm (∗) above to the lazy random walk
on G to sample a uniform random node of G. ▲
Exercise 3.10.2 (Lazy RW on graphs). Let G = (V, E ) be a graph. Let (X t )t ≥0 be a Markov chain on the
node set V with transition probabilities
P(X_{t+1} = j | X_t = i) =  1/2              if j = i,
                            1/(2 deg_G(i))   if j is adjacent to i,   (454)
                            0                otherwise.
This chain is called the lazy random walk on G. In words, the usual random walker on G now flips a fair
coin to decide whether it stays at the same node or make a move to one of its neighbors.
(i) Show that for any connected graph G, the lazy random walk on G is irreducible and aperiodic.
(ii) Let P be the transition matrix for the usual random walk on G. Show that the following matrix
Q = (P + I)/2 (455)
is the transition matrix for the lazy random walk on G.
(iii) For any distribution π on V, show that πQ = π if and only if πP = π. Conclude that the usual and
lazy random walks on G have the same set of stationary distributions.
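Here is a minimal sketch of the sampling scheme (∗) run with the lazy random walk, on a hypothetical 4-cycle (a regular bipartite graph, so the plain walk would be periodic while the lazy walk still converges): the empirical distribution of the terminal node is close to uniform.

    # Lazy random walk sampling on a small hypothetical graph (a 4-cycle),
    # given by adjacency lists.
    import random
    from collections import Counter

    adj = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [0, 2]}

    def lazy_walk_sample(start, steps=1_000):
        x = start
        for _ in range(steps):
            if random.random() < 0.5:      # lazy step: stay put
                continue
            x = random.choice(adj[x])      # otherwise move to a uniform neighbor
        return x

    counts = Counter(lazy_walk_sample(0) for _ in range(20_000))
    print(counts)  # roughly uniform over the 4 nodes, since the cycle is regular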
Example 3.10.3 (Finding local minima). Let G = (V, E) be a connected graph and let f : V → [0, ∞) be
a ‘cost’ function. The objective is to find a node x* ∈ V at which f attains its global minimum. This
problem has many applications in machine learning, for example. Note that if the domain V is very
large, then an exhaustive search is too expensive to use.
Here is a simple form of the popular algorithm of stochastic gradient descent, which lies at the heart of
many important machine learning algorithms.
(i) Initialize the first guess X_0 = x_0 ∈ V.
(ii) Suppose X_t = x ∈ V is chosen. Let y be a uniform random neighbor of x; if f(y) ≤ f(x), set
X_{t+1} = y, and otherwise set X_{t+1} = x. Note that such a chain can get trapped at a local
minimum of f, which motivates the next example.
Example 3.10.4 (Finding global minimum). Let G = (V, E ) be a connected regular graph and let f : V →
[0, ∞) be a cost function. Let
V* = { x ∈ V | f(x) = min_{y∈V} f(y) } (457)
denote the set of global minimizers of f. Fix a parameter λ ∈ (0, 1) and define a probability distribution
π_λ on V by
π_λ(x) = λ^{f(x)}/Z_λ, (458)
where Z_λ = ∑_{x∈V} λ^{f(x)} is the normalizing constant. Since π_λ(x) is decreasing in f(x), it favors nodes x
for which f(x) is small.
Let (X_t)_{t≥0} be a Markov chain on V, whose transition is defined as follows. If X_t = x, then let y be
a uniform random neighbor of x. If f(y) ≤ f(x), then move to y; if f(y) > f(x), then move to y with
probability λ^{f(y)−f(x)} and stay at x with probability 1 − λ^{f(y)−f(x)}. We analyze this MC below:
(i) (Irreducibility) Since G is connected and we are allowing any move (either downhill or uphill), we can
go from one node to any other in some number of steps. Hence the chain (X_t)_{t≥0} is irreducible.
(ii) (Aperiodicity) By (i) and Remark 3.9.7, all nodes have the same period. Moreover, let x ∈ V ∗ be an
arbitrary node where f takes global minimum. Then all return times are possible, so x has
period 1. Hence all nodes have period 1, so the chain is aperiodic.
(iii) (Stationarity) Here we show that π_λ is a stationary distribution of the chain. To do this, we first need
to write down the transition matrix P. Namely, if we let A_G(y, z) denote the indicator that y and
z are adjacent, then
P(x, y) = (A_G(x, y)/deg_G(x)) min(1, λ^{f(y)−f(x)}) if x ≠ y,
and P(x, x) = 1 − ∑_{y≠x} P(x, y). Note that
∑_{z∈V} π_λ(z) P(z, y) = π_λ(y) P(y, y) + ∑_{z≠y} π_λ(z) P(z, y) (461)
= π_λ(y) − ∑_{z≠y} π_λ(y) P(y, z) + ∑_{z≠y} π_λ(z) P(z, y). (462)
One can check the detailed balance equations π_λ(z) P(z, y) = π_λ(y) P(y, z) for all z ≠ y, so the last two
sums cancel and ∑_{z∈V} π_λ(z) P(z, y) = π_λ(y); that is, π_λ is stationary.
(ii) Show that π(x)Q(x, y) = π(y)Q(y, x) ∀x, y ∈ Ω, x ≠ y implies πQ = π. Deduce that if
π(x)P(x, y)A(x, y) = π(y)P(y, x)A(y, x) ∀x, y ∈ Ω, x ≠ y, (471)
then π is a stationary distribution for (X t )t ≥0 .
(iii) Since we also want the Markov chain to converge fast, we want to choose the acceptance probability
A(a, b) ∈ [0, 1] as large as possible for each a, b ∈ Ω. Show that the following choice (denoting
α ∧ β := min(α, β))
A(x, y) = (π(y)P(y, x))/(π(x)P(x, y)) ∧ 1 ∀x, y ∈ Ω, x ≠ y (472)
satisfies the condition in (ii) and each A(x, y) is maximized for all x ≠ y.
(iv) Let (Y t )t ≥0 be a random walk on the 5-wheel graph G = (V, E ) as shown in Figure 12. Show that π =
CHAPTER 4
Martingales
In order to properly develop our discussion of martingales in the following sections, we need to
generalize the notion of conditional expectation E[X | Y] of a RV X given another RV Y. Recall that this
was the collection of ‘best guesses’ of X given Y = y for all y. But what if we only know, say, Y ≥ 1? Can
we condition on this event as well?
More concretely, suppose Y takes values from {1, 2, 3}. Regarding Y, the possible events are
∅, {Y = 1}, {Y = 2}, {Y = 3}, {Y ∈ {1, 2}}, {Y ∈ {1, 3}}, {Y ∈ {2, 3}}, {Y ∈ {1, 2, 3}};
denote this collection of events by E_Y. For instance, the information {Y ∈ {1, 2}} could yield some
nontrivial implication on the value of X, so our best guess in this scenario should be
E[X | {Y ∈ {1, 2}}] = ∑_x x P(X = x | {Y ∈ {1, 2}}). (474)
More generally, for each A ∈ E_Y, the best guess of X given A is the following conditional expectation:
E[X | A] = ∑_x x P(X = x | A). (475)
Now, what if we don't know which event in the collection E_Y will occur? As we did before to define
E[X | Y] from E[X | Y = y] by simply not specifying the value y that Y takes, we simply do not specify
which event A ∈ E_Y occurs. Namely, E[X | E_Y] denotes the best guess of X given whichever event in
E_Y occurs. In general, this could be defined for any collection of events E in place of E_Y.
We are interested in the long-term behavior of the ‘portfolio process’ (M t )t ≥0 . Martingales provide a very
nice framework for this purpose.
A martingale is a stochastic process whose expected increment, conditioned on the past, is
always zero. Recall that the simple symmetric random walk has this property, since its increments are
i.i.d. with mean zero. Martingales do not assume any kind of independence between increments, but
it turns out that we can proceed quite far with just the unbiased conditional increment property.
In order to define martingales properly, we need to introduce the notion of ‘information up to time
t ’. Imagine we are observing the stock market starting from time t . We define
E_t := collection of all possible events we can observe at time t, (479)
F_t := ⋃_{k=1}^t E_k = collection of all possible events we can observe up to time t. (480)
In words, E t is the information available at time t and Ft contains all possible information that we can
obtain by observing the market up to time t . We call Ft the information up to time t . As a collection of
events, Ft needs to satisfy the following properties2:
(i) (closed under complementation) A ∈ F_t ⟹ A^c ∈ F_t;
(ii) (closed under countable union) A_1, A_2, A_3, · · · ∈ F_t ⟹ ⋃_{k=1}^∞ A_k ∈ F_t.
Note that as we gain more and more information, we have
Fs ⊆ Ft ∀t ≥ s ≥ 0. (481)
In other words, (F_t)_{t≥0} is an increasing family of collections of events, which we call a filtration. The role
of a filtration is to specify what kind of information is observable as time passes.
Example 4.1.1. Suppose (Ft )t ≥0 is a filtration generated by observing the stock price (X t )t ≥0 of company
A in New York. Namely, E t consists of the information on the values of the stock price X t at day t . Given
F10 , we know the actual values of X 0 , X 1 , · · · , X 10 . For instance, X 8 is not random given F10 , but X 11 could
still be random. On the other hand, if (Y_t)_{t≥0} is the stock price of company B in Hong Kong, then we may
have only partial information about Y_0, · · · , Y_10 given F_10. ▲
Now we define martingales.
Definition 4.1.2. Let (Fn )n≥0 be a filtration and (M n )n≥0 be discrete-time stochastic processes. We call
(M n )n≥0 a martingale with respect to (Fn )n≥0 if the following conditions are satisfied: For all n ≥ 0,
(i) (finite expectation) E[|M n |] < ∞.
(ii) (measurability 3) {M n = m} ∈ Fn for all m ∈ R .
(iii) (conditional increments) E[M n+1 − M n | A] = 0 for all A ∈ Fn .
If (iii) holds with “=” replaced by “≤” (resp., “≥”), then (M_n)_{n≥0} is called a supermartingale (resp., sub-
martingale) with respect to (F_n)_{n≥0}.
Exercise 4.1.3 (Martingale w.r.t. the filtration generated by a stochastic process). Let (X n )n≥1 be a discrete-
time stochastic process on Ω. Let (M n )n≥1 denote a discrete-time stochastic process satisfying the fol-
lowing property:
(a) E[|M n |] < ∞ for n ≥ 1;
(b) M n is a function of X 1 , . . . , X n for n ≥ 1;
(c) E[M n+1 − M n | (X 1 , . . . , X n )] = 0 for n ≥ 1.
Then show that (M_n)_{n≥1} is a martingale (in the sense of Definition 4.1.2) with respect to the filtration
(F_n)_{n≥1}, where F_n = σ(X_1, . . . , X_n).
Hint: The key step is to show that, for A ∈ Fn , E[M n+1 − M n | A] = E[E[M n+1 − M n | A, X 1 , . . . , X n ]] = 0.
In order to apply property (c) here, justify the following identities
E[M n+1 − M n | A, X 1 , . . . , X n ] = E[(M n+1 − M n )1((X 1 , . . . , X n ) ∈ A) | X 1 , . . . , X n ] (484)
= 1((X 1 , . . . , X n ) ∈ A) E[(M n+1 − M n ) | X 1 , . . . , X n ]. (485)
If a martingale is a fair gambling strategy, then one can think of supermartingales and submartingales as
unfavorable and favorable gambling strategies, respectively. For instance, the expected winnings when
gambling on an unfavorable game should be non-increasing in time. This is an immediate consequence
of the definition and iterated expectation.
Proposition 4.1.4. Let (M t )t ≥0 be a stochastic process and let (Ft )t ≥0 be a filtration.
(i) If (M t )t ≥0 is a supermartingale w.r.t. filtration (Ft )t ≥0 , then E[M n ] ≤ E[M m ] for all n ≥ m ≥ 0.
(ii) If (M t )t ≥0 is a submartingale w.r.t. filtration (Ft )t ≥0 , then E[M n ] ≥ E[M m ] for all n ≥ m ≥ 0.
(iii) If (M t )t ≥0 is a martingale w.r.t. filtration (Ft )t ≥0 , then E[M n ] = E[M m ] for all n ≥ m ≥ 0.
P ROOF. (ii) and (iii) follow directly from (i), so we only show (i). Let (M_t)_{t≥0} be a supermartingale
w.r.t. (F_t)_{t≥0}. Recall that for each m ∈ R, {M_t = m} ∈ F_t. Hence
E[M t +1 − M t | M t = m] ≤ 0 (486)
since M t is a supermartingale. Hence by conditioning on the values of M t ,
E[M t +1 − M t ] = E[E[M t +1 − M t | M t ]] ≤ 0. (487)
Then the assertion follows by an induction. □
In order to get familiar with martingales, it is helpful to envision them as a kind of simple symmetric
random walk. In general, one can subtract off the mean of a given random walk to make it a martingale.
Example 4.1.5 (Random walks). Let (X_t)_{t≥1} be a sequence of i.i.d. increments with E[|X_i|] < ∞ and
E[X_i] = µ. Let S_t = S_0 + X_1 + · · · + X_t. Then (S_t)_{t≥0} is called a random walk. Define a stochastic
process (M_t)_{t≥0} by
M t = S t − µt . (488)
For each t ≥ 0, let Ft be the information obtained by observing S 0 , S 1 , · · · , S t . Then (M t )t ≥0 is a martingale
with respect to the filtration (Ft )t ≥0 . Indeed, we have
E[|M_t|] = E[|S_t − µt|] ≤ E[|S_t|] + |µ|t < ∞, (489)
and for any m ∈ R,
{M t = m} = {S t − µt = m} = {S t = m + µt } ∈ Ft . (490)
Furthermore, since X_{t+1} is independent of S_0, · · · , S_t, it is also independent of any A ∈ F_t. Hence
E[M t +1 − M t | X 0 , . . . , X t ] = E[X t +1 − µ | X 0 , . . . , X t ] (491)
= E[X t +1 − µ] = E[X t +1 ] − µ = 0. (492)
▲
Example 4.1.6 (Products of indep. RVs). Let (X t )t ≥0 be a sequence of independent RVs such that X t ≥ 0
and E[X t ] = 1 for all t ≥ 0. For each t ≥ 0, let Ft be the information obtained by observing M 0 , X 0 , · · · , X t .
Define
M t = M0 X 1 X 2 · · · X t , (493)
where M 0 is a constant. Then (M t )t ≥0 is a martingale with respect to (Ft )t ≥0 . Indeed, the assumption
implies E[|M t |] < ∞ and that {M t = m} ∈ Ft for all m ∈ R since M t is determined by M 0 , X 1 · · · , X t . Fur-
thermore, since X t +1 is independent from X 1 , · · · , X t , for each A ∈ Ft ,
E[M t +1 − M t | X 0 , . . . , X t ] = E[M t X t +1 − M t | X 0 , . . . , X t ] (494)
= E[(X t +1 − 1)(M 0 X 1 · · · X t ) | X 0 , . . . , X t ] (495)
= E[X t +1 − 1 | X 0 , . . . , X t ] E[(M 0 X 1 · · · X t ) | X 0 , . . . , X t ] (496)
= E[X t +1 − 1] E[(M 0 X 1 · · · X t ) | X 0 , . . . , X t ] = 0. (497)
This multiplicative model is reasonable for the stock market, since changes in stock prices are
believed to be proportional to the current stock price. Moreover, it also guarantees that the price
stays positive, in contrast to additive models. ▲
Exercise 4.1.7 (Exponential martingale). Let (X t )t ≥0 be a sequence of i.i.d. RVs such that their moment
generating function exists, namely, there exists θ > 0 for which
φ(θ) := E[exp(θX k )] < ∞ ∀k ≥ 0. (498)
Let S t = S 0 + X 1 + · · · + X t . Define
M t = exp(θS t )/φ(θ)t . (499)
Show that (M t )t ≥0 is a martingale with respect to filtration (Ft )t ≥0 , where Ft is the information obtained
by observing S 0 , · · · , S t .
The following lemma allows us to provide many examples of martingales from Markov chains.
Lemma 4.1.8. Let (X t )t ≥0 be a Markov chain on state space Ω with transition matrix P . For each t ≥ 0, let
f_t be a bounded function Ω → R such that
f_t(x) = ∑_{y∈Ω} P(x, y) f_{t+1}(y) ∀x ∈ Ω. (500)
Then M_t := f_t(X_t) defines a martingale with respect to the filtration (F_t)_{t≥0} generated by (X_t)_{t≥0}.
For example, consider the gambler's chain (S_t)_{t≥0}, which moves up by 1 with probability p and down
by 1 with probability 1 − p, and let M_t := ((1 − p)/p)^{S_t}.
In order to see that (M_t)_{t≥0} is a martingale, define a function h(x) = ((1 − p)/p)^x. According to
Lemma 4.1.8, it suffices to show that h is a harmonic function with respect to the gambler's chain. (One
can also check that M_t is a martingale by Example 4.1.6.) Namely,
∑_{y∈Z} P(x, y) h(y) = p h(x + 1) + (1 − p) h(x − 1) (505)
= p ((1 − p)/p)^{x+1} + (1 − p) ((1 − p)/p)^{x−1} (506)
= (1 − p) ((1 − p)/p)^x + p ((1 − p)/p)^x = ((1 − p)/p)^x = h(x). (507)
Hence by Lemma 4.1.8, (M t )t ≥0 is a martingale with respect to the filtration (Ft )t ≥0 . ▲
Example 4.1.10 (Simple symmetric random walk). Let (X t )t ≥1 be a sequence of i.i.d. RVs with
P(X k = 1) = P(X k = −1) = 1/2. (508)
Let S t = S 0 + X 1 + · · · + X t . Note that (S t )t ≥0 is a Markov chain on Z. Define
M t = S 2t − t . (509)
Then (M t )t ≥0 is a martingale with respect to (S t )t ≥0 .
One can check the martingale condition directly as follow:
E[(S_{t+1}² − (t + 1)) − (S_t² − t) | X_0, . . . , X_t] = E[(S_t + X_{t+1})² − S_t² − 1 | X_0, . . . , X_t] (510)
= E[2 S_t X_{t+1} + X_{t+1}² − 1 | X_0, . . . , X_t] (511)
= 2 S_t E[X_{t+1} | X_0, . . . , X_t] = 0 (using X_{t+1}² = 1), (512)
where the last equality follows since E[X t +1 ] = 0 and X t +1 is independent from X 0 , . . . , X t .
One can also check the martingale condition for S 2t − t using Lemma 4.1.8. Namely, for each t ≥ 0,
define a function f t : Z → R by f t (x) = x 2 − t . By Lemma 4.1.8, it suffices to check if f t (x) is the average of
f t +1 (y) with respect to the transition matrix of S t . Namely,
∑_{y∈Z} P(x, y) f_{t+1}(y) = (1/2) f_{t+1}(x + 1) + (1/2) f_{t+1}(x − 1) (513)
= ((x + 1)² − (t + 1))/2 + ((x − 1)² − (t + 1))/2 = x² − t = f_t(x). (514)
Hence by Lemma 4.1.8, (M t )t ≥0 is a martingale with respect to filtration (Ft )t ≥0 , where Ft is the infor-
mation obtained by observing S 0 , · · · , S t . ▲
Suppose we hold H_{t+1} shares of the stock during the time interval (t, t + 1], where H_{t+1} is chosen
based on the observation X_0, · · · , X_t. Then the stock price will change from M_t to M_{t+1}, so
H_{t+1}(M_{t+1} − M_t) = net gain incurred between time t and t + 1. (517)
Hence if we let W_0 denote the initial fortune, then
W_t := W_0 + ∑_{k=1}^t H_k (M_k − M_{k−1}) = total fortune at time t. (518)
The following theorem tells us that, if (M_t)_{t≥0} is a declining stock on average (a supermartingale), then
no matter what strategy (H_t)_{t≥0} we use, our expected fortune cannot increase.
Theorem 4.2.1. Let (M t )t ≥0 be a supermartingale w.r.t. a filtration (Ft )t ≥0 . Suppose (H t )t ≥0 is such that
(i) (Predictability) {H t +1 = h} ∈ Ft for all h ∈ R.
(ii) (Boundedness) 0 ≤ H t ≤ c t for some constant c t ≥ 0 for all t ≥ 0.
Let (Wt )t ≥0 be as defined at (518). Then (Wt )t ≥0 is a supermartingale w.r.t. the filtration (Ft )t ≥0 .
since the event on the right hand side can be written as the union of suitable events involving the values
of H0 , . . . , H t and M 0 , . . . , M t , which are all events belonging to Ft .
Lastly, fix A ∈ F_t. For each h ∈ [0, c_t], since {H_{t+1} = h} ∈ F_t, we also have A ∩ {H_{t+1} = h} ∈ F_t. Since
M_t is a supermartingale, we have
E[W_{t+1} − W_t | A ∩ {H_{t+1} = h}] = E[H_{t+1}(M_{t+1} − M_t) | A ∩ {H_{t+1} = h}] (521)
= h E[M_{t+1} − M_t | A ∩ {H_{t+1} = h}] ≤ 0. (522)
Hence, by conditioning on the values of H_{t+1}, we get
E[W_{t+1} − W_t | A] = E[ E[W_{t+1} − W_t | A, H_{t+1}] | A ] ≤ 0. (523)
This shows that (W_t)_{t≥0} is a supermartingale. □
Definition 4.2.2. A random variable T taking values in {0, 1, 2, . . . } ∪ {∞} is a stopping time w.r.t. a
filtration (F_t)_{t≥0} if
{T = s}, {T ≠ s} ∈ F_s ∀s ≥ 0. (526)
Note that the definition of stopping times w.r.t. a process (X_n)_{n≥1} (see Definition 3.4.1) is a special
case of the above definition with F_n = σ(X_1, . . . , X_n) (see Exercise 4.1.3).
Example 4.2.3. Let (N1 (t ))t ≥0 and (N2 (t ))t ≥0 be the counting processes of two renewal processes that are
not necessarily independent. For each t ≥ 0, let Ft be the information obtained by observing (Ni (t ))0≤s≤t
for i = 1, 2. Let Tk(i ) denote the kth arrival time for the i th process.
Clearly, for each i ∈ {1, 2} and k ≥ 0, T_k^(i) is a stopping time w.r.t. the filtration (F_t)_{t≥0}. Namely, for
each t ≥ 0,
{T_k^(i) = t} = {N^(i)(t) = k, N^(i)(t⁻) = k − 1} ∈ F_t. (527)
Also, let
This is also a stopping time w.r.t. the filtration (Ft )t ≥0 . Namely, for each t ≥ 0,
On the other hand, let (N (3) (t ))t ≥0 be the counting process of another renewal process, which is
independent from the other processes. Let Tk(3) denote the kth arrival time of this process. Is this a
stopping time w.r.t. the filtration (Ft )t ≥0 ? No, since for each t ≥ 0, the event
cannot be determined from observing the other two processes N (1) and N (2) up to time t . ▲
Example 4.2.4 (Constant betting up to a stopping time). Let (M t )t ≥0 be a supermartingale w.r.t. (X t )t ≥0 .
Let T be a stopping time for (X t )t ≥0 . Consider the gambling strategy of betting $1 up to time T . That is,
H_k = 1(k ≤ T) = { 1 if k ≤ T; 0 if k > T }. (531)
Since T is a stopping time, the above H_k defines a predictable sequence as in Theorem 4.2.1. Then the
wealth at time t is
W_t = W_0 + ∑_{k=1}^t H_k (M_k − M_{k−1}) = W_0 + M_{T∧t} − M_0, (532)
where T ∧ t = min(T, t). The above relation can be easily verified by considering whether t > T or t ≤
T. According to Theorem 4.2.1, we know that W_t is a supermartingale. It follows that M_{T∧t} is also a
supermartingale. ▲
Theorem 4.2.5 (Stopped martingale is a martingale). Let (M t )t ≥0 be a supermartingale w.r.t. a filtration
(Ft )t ≥0 . Let T be a stopping time for (Ft )t ≥0 . Then M T ∧t is a supermartingale w.r.t. (Ft )t ≥0 . Furthermore,
The same argument holds for the submartingale case with the reversed inequalities. Then the martingale
case follows. □
Exercise 4.2.6. Let (M t )t ≥0 be a supermartingale w.r.t. a filtration (Ft )t ≥0 . We will directly show that
(M T ∧t )t ≥0 is a supermartingale w.r.t. the filtration (Ft )t ≥0 .
Theorem 4.3.1 (Optional stopping). Suppose (M_t)_{t≥0} is a martingale and T is a stopping time with respect
to the same filtration (F_n)_{n≥1} such that P(T < ∞) = 1. If any of the following conditions hold, then E[M_T] =
E[M_0].
(i) (Bounded stopping time) There is a constant c > 0 such that P(T ≤ c) = 1;
(ii) (Bounded stopped martingale) There is a constant c > 0 such that P(M T ∧n ≤ c) = 1 for all n ≥ 1;
(iii) (Bounded conditional increments) There is a constant c > 0 such that E[|M T ∧(n+1) − M T ∧n | | Fn ] ≤ c
for all n ≥ 1.
Let X n = c − Yn be the net profit during year n. We may assume that X n ’s are i.i.d. with distribution
N (µ, σ2 ). We are interested in estimating the probability of bankruptcy. Namely, let
B = {S n < 0 for some n ≥ 1} = {Bankruptcy}. (559)
We will show that
P(B) ≤ exp(−2µS_0/σ²). (560)
Hence, the company will avoid bankruptcy by maximizing the mean profit per year µ and the initial asset
S 0 , while minimizing the uncertainty σ of profit per year.
To begin, notice that S n = S 0 + X 1 + · · · + X n is a random walk. Let Ft be the information obtained by
observing S 0 , S 1 , · · · , S t . According to Exercise 4.3.4, we have
φ(θ) := E[e^{θX_k}] = exp(σ²θ²/2 + θµ) < ∞. (561)
Hence we can use the following exponential martingale (see Exercise 4.1.7)
M_n(θ) := exp(θS_n)/φ(θ)^n (562)
with respect to the filtration (Ft )t ≥0 . Moreover, note that
φ(−2µ/σ²) = exp(2µ²/σ² − 2µ²/σ²) = exp(0) = 1. (563)
Hence
M_n := M_n(−2µ/σ²) = exp(−2µS_n/σ²) (564)
is a martingale w.r.t. the filtration (Ft )t ≥0 .
Let T = min{k ≥ 1 | S k < 0} be the first time of hitting negative asset, which is a stopping time w.r.t.
the filtration (Ft )t ≥0 . Also note that
M_T = exp(−2µS_T/σ²) > 1 a.s., (565)
since S T < 0. Since S t is a supercritical random walk (µ > 0), P(T = ∞) > 0. Hence we need to stop the
martingale at truncated stopping time T ∧ t , using Theorem 4.3.1. Noting that M t ≥ 0 for all t ≥ 0, this
yields
exp(−2µS_0/σ²) = E[M_0] = E[M_{T∧t}] = E[M_T | T ≤ t] P(T ≤ t) + E[M_t | T > t] P(T > t) (566)
≥ E[M_T | T ≤ t] P(T ≤ t) ≥ P(T ≤ t). (567)
Note that the last inequality follows since M_T > 1 almost surely on the event {T ≤ t} (as S_T < 0) and
M_t ≥ 0 for all t ≥ 0. To finish, note that P(T ≤ t) is the probability of bankruptcy by time t. Hence
P(B) = P(T < ∞) = lim_{n→∞} P(T ≤ n) ≤ exp(−2µS_0/σ²). (568)
▲
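The bound (568) is easy to test by simulation. Below is a hedged Monte Carlo sketch (our own code, not from the notes); the horizon and parameter values are arbitrary choices. It samples the Gaussian random walk S_n, records whether it ever goes negative, and compares the estimate with exp(−2µS_0/σ²).

import random, math

def bankrupt_by(t_max, S0, mu, sigma, rng):
    """Run S_n = S_0 + X_1 + ... + X_n, X_i ~ N(mu, sigma^2);
    return True if S_n < 0 for some n <= t_max."""
    S = S0
    for _ in range(t_max):
        S += rng.gauss(mu, sigma)
        if S < 0:
            return True
    return False

rng = random.Random(0)
S0, mu, sigma, trials = 2.0, 0.5, 1.0, 20000
est = sum(bankrupt_by(2000, S0, mu, sigma, rng) for _ in range(trials)) / trials
bound = math.exp(-2 * mu * S0 / sigma**2)
print(f"P(B) ~ {est:.4f} <= exp(-2*mu*S0/sigma^2) = {bound:.4f}")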
P ROOF OF T HEOREM 4.3.1. (Optional*) Since P(T < ∞) = 1, it holds that lim_{t→∞} T ∧ t = T almost surely. Hence by Theorem 4.2.5, one would have

E[M_0] = lim_{t→∞} E[M_{T∧t}] (?)= E[lim_{t→∞} M_{T∧t}] = E[M_T], (569)

provided the middle equality (marked (?)) is verified.
To this end, fix t > 0, and by conditioning on whether T > t or not, we can write
E[M T ] = E[M T | T ≤ t ]P(T ≤ t ) + E[M T | T > t ]P(T > t ) (570)
= E[M T ∧t | T ≤ t ]P(T ≤ t ) + E[M T | T > t ]P(T > t ). (571)
4.4. MARTINGALE LIMIT THEOREM 75
as desired. Similarly, suppose (ii) holds. Then |M_{T∧t}| ≤ c for all t ≥ 0, so (573) gives
|E[M T ] − E[M T ∧t ]| ≤ 2c P(T > t ) (575)
and the right hand side tends to zero as t → ∞.
Lastly, suppose (iii) holds. Since P(T < ∞) = 1, we have T ∧ n → T a.s. as n → ∞, so M_{T∧n} → M_T a.s. as n → ∞. In order to show lim_{n→∞} E[M_{T∧n}] = E[M_T], we use the dominated convergence theorem (see [Dur10, Thm. 1.5.8]). For that, we only need to find a dominating and integrable RV M, that is, |M_{T∧n}| ≤ M for all n ≥ 1 (dominating) and E[|M|] < ∞ (integrable). For this, we use the triangle inequality to construct a dominating RV as follows:

M := |M_0| + Σ_{n=1}^∞ |M_n − M_{n−1}| 1(T ≥ n). (576)
It turns out that the submartingale condition sits between these two extreme cases and is flexible enough to have many applications. The following “martingale convergence theorem” is a martingale analogue of these facts:
Theorem 4.4.1 (Martingale convergence theorem). If (M n )n≥1 is a submartingale that is bounded from
above, that is, there exists a constant c > 0 such that P(M n < c) = 1 for all n ≥ 1, then there exists a random
variable M ∞ that is almost surely finite, and limn→∞ M n = M ∞ with probability one.
P ROOF. See [Dur10, Thm. 4.2.11]. □
By taking the negative sign, the supermartingale counterpart of the above theorem also holds. More specifically, an important corollary of Theorem 4.4.1 is the following.
Corollary 4.4.2. If (M n )n≥1 is a supermartingale and M n ≥ 0 for n ≥ 1, then M n converges almost surely
to some random variable M ∞ such that P(0 ≤ M ∞ < ∞) = 1.
The following corollary is immediate from Theorem 4.4.1 and the observation that a martingale is both a submartingale and a supermartingale.
Corollary 4.4.3 (Limit theorem for bounded martingales). Let (M_n)_{n≥0} be a martingale with respect to (X_n)_{n≥0}. Suppose that the M_n are uniformly bounded either from above or from below almost surely. Then there exists a random variable M_∞ that is almost surely finite, and lim_{n→∞} M_n = M_∞ with probability one.
Example 4.4.4 (SSRW on Z). Let (S n )n≥0 with S 0 = 0 denote simple symmetric random walk on Z. We
will cook up both examples and non-examples of the martingale convergence theorem.
First, (S_n)_{n≥0} is indeed a martingale. However, it is bounded neither from above nor from below. That is, there is no constant c ∈ R such that either S_n > c for all n ≥ 1 or S_n < c for all n ≥ 1 almost surely. (Indeed, recall that S_n is recurrent, so it visits every integer infinitely often with probability 1.) Thus Corollary 4.4.3 (as well as Theorem 4.4.1) cannot be applied. Indeed, since S_{n+1} − S_n ∈ {−1, 1} for all n ≥ 1, S_n cannot converge to any single random variable.
Second, by stopping the random walk, we can create a convergent martingale. Let T = inf{n ≥ 1 | S_n = 1} denote the first hitting time of 1. Then T is a stopping time, so by Theorem 4.2.5, M_n = S_{T∧n} is a martingale. The martingale satisfies M_n ≤ 1, since S_n takes values in {..., −2, −1, 0, 1} until time T (recall that S_0 = 0) and stays at 1 after time T. Hence by Corollary 4.4.3, M_n → M_∞ for some almost surely finite RV M_∞.
We can in fact identify the limiting RV M_∞. Recall that T < ∞ with probability 1 since the SSRW S_n is recurrent (see Example 3.5.7 and Theorem 3.5.4). Thus T ∧ n → T a.s. as n → ∞. Hence lim_{n→∞} M_n = lim_{n→∞} S_{T∧n} = S_T = 1. So M_∞ = 1 with probability 1. ▲
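This example is also instructive numerically. The following sketch (ours, under the assumptions of the example) shows that M_n = S_{T∧n} equals 1 on most paths for large n, while its average stays at 0; limit and expectation cannot be exchanged here.

import random

def stopped_ssrw(n_steps, rng):
    """M_n = S_{T ^ n} for a SSRW S with T = first hitting time of 1."""
    S = 0
    for _ in range(n_steps):
        if S == 1:
            break
        S += rng.choice([-1, 1])
    return S

rng = random.Random(1)
samples = [stopped_ssrw(10_000, rng) for _ in range(2000)]
frac_at_1 = sum(s == 1 for s in samples) / len(samples)
mean = sum(samples) / len(samples)
print(f"fraction of paths with M_n = 1: {frac_at_1:.3f} (tends to 1)")
print(f"E[M_n] = {mean:.3f} (stays 0 for every n, yet M_inf = 1)")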
Since the offspring numbers Z_{n,i} are nonnegative, M_n ≥ 0, and by iterated expectation it follows that

E[|M_{n+1}|] = E[M_{n+1}] = E[ E[M_{n+1} | X_0, X_1, ..., X_n] ] = E[M_n] = ··· = E[M_0] = X_0 = 1. (589)
Lastly, M n is clearly measurable with respect to the offspring numbers Zk,i for all 1 ≤ k ≤ n. Hence M n is
a martingale. □
As an application, by Proposition 4.1.4, we have

E[X_n] = µ^n E[M_n] = µ^n E[M_0] = µ^n. (590)

Hence, if the mean offspring number µ is > 1, then the expected population E[X_n] grows exponentially as µ^n (supercritical branching process); if µ < 1, then E[X_n] decays to zero exponentially as µ^n (subcritical branching process); if µ = 1, then E[X_n] = 1 for all n ≥ 1 (critical branching process).
Since M_n = µ^{−n} X_n is a nonnegative martingale, it converges to a limit lim_{n→∞} M_n = M_∞ by Corollary 4.4.2. This means that we can describe the generation-n population X_n as

X_n ≈ µ^n M_∞. (591)

Therefore, not only in expectation as in (590), but also trajectory-wise, X_n grows or decays exponentially fast depending on whether µ > 1 or µ < 1.
However, there is a subtle point here: what if M_∞ = 0 with some positive probability? Then µ^n M_∞ does not necessarily diverge almost surely even if µ > 1. In fact, there are two cases, depending on whether the ‘extinction event’ occurs. Namely, define

Extinction := {X_n = 0 for all sufficiently large n} = ‘extinction event’, (592)
π := P(Extinction) = ‘extinction probability’. (593)

Then from (591), we can deduce:

Either the extinction event occurs or X_n grows exponentially in n. (594)
Below we identify the limit M_∞, starting from the subcritical case µ < 1. Here, we will observe that the extinction event occurs. This should not be surprising, since each individual gives birth to less than one child on average, so the population should die out.
Proposition 4.5.2. If µ < 1, then X_n = 0 for all sufficiently large n almost surely. (That is, P(Extinction) = 1.) Hence M_n = µ^{−n} X_n → 0 as n → ∞ almost surely.
P ROOF. Since M_n = µ^{−n} X_n is a martingale, E[M_n] = E[M_0] = 1, so we have E[X_n] = µ^n. Now, recalling that X_n takes values in Z_{≥0}, Markov's inequality gives

P(X_n > 0) = P(X_n ≥ 1) ≤ E[X_n] = µ^n. (595)
F IGURE 1. Generating functions g(s) for the cases µ < 1 (left), µ = 1 (middle), µ > 1 (right) in bold; the lighter solid line is the function y = x. The dotted lines in the left and right plots are tangent lines to the generating functions at s = 1−. The y-intercept is g(0) = β_0, and the x-coordinates of the intersections between the curve and the line y = x are the corresponding extinction probabilities π. Figure excerpted from [SV].
Finally, recall that, on the event that there is no extinction, X_n grows exponentially fast as µ^n M_∞, and on this event we also know M_∞ > 0, since otherwise X_n → 0, contradicting the no-extinction event.
Example 4.5.5. Consider a branching process with offspring distribution Poisson(λ). The generating function of the offspring distribution can be computed as

g(s) = Σ_{k=0}^∞ s^k λ^k e^{−λ}/k! = e^{−λ} Σ_{k=0}^∞ (sλ)^k/k! = e^{−λ} e^{sλ} = e^{λ(s−1)}. (600)
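As discussed around Figure 1, the extinction probability π is the smallest fixed point of g, here of π = e^{λ(π−1)}. A minimal sketch (our code, not from the notes) computes it by iterating s ↦ g(s) from s = 0; the iterates g^n(0) increase to π.

import math

def extinction_prob(lam, iters=200):
    """Smallest fixed point of g(s) = exp(lam*(s-1)), by iterating from 0.
    Note: at lam = 1 convergence is sublinear (true value is 1)."""
    s = 0.0
    for _ in range(iters):
        s = math.exp(lam * (s - 1.0))
    return s

for lam in [0.8, 1.5, 2.0]:
    print(f"lambda = {lam}: pi ~ {extinction_prob(lam):.4f}")
# lambda <= 1 gives pi = 1 (extinction is certain); lambda > 1 gives pi < 1.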
Exercise 4.6.1 (Jensen's inequality). Let X be any RV with E[X] < ∞. Let φ : R → R be any convex function, that is, φ(λx + (1 − λ)y) ≤ λφ(x) + (1 − λ)φ(y) for all x, y ∈ R and λ ∈ [0, 1].
(i) Let c := E[X] < ∞. Show that there exists a line f(x) = ax + b such that f(c) = φ(c) and φ(x) ≥ f(x) for all x ∈ R.
(ii) Verify the following and prove Jensen's inequality: φ(E[X]) = f(E[X]) = E[f(X)] ≤ E[φ(X)].
(iii) Let X be a RV, A an event, and φ the convex function as before. Show Jensen's inequality for the conditional expectation: φ(E[X | A]) ≤ E[φ(X) | A].
P ROOF. To show (i), we use conditioning on the values of M j . By noting that {M j = m} ∈ F j ⊆ Fk for
all m ∈ R, we have
E[(M n − M k )M j | M j = m] = m E[M n − M k | M j = m] = 0, (622)
where we have used Exercise 4.6.5 for the last equality. Hence
E[(M n − M k )M j ] = E[E[(M n − M k )M j | M j ]] = 0. (623)
Moreover, (ii) follows from (i) immediately since
E[(M n − M k )(M j − M i )] = E[(M n − M k )M j ] − E[(M n − M k )M i ]. (624)
This shows the assertion. □
Exercise 4.6.7 (Pythagorean theorem for martingales). Let (M_t)_{t≥0} be a martingale with respect to a filtration (F_t)_{t≥0}. Use Proposition 4.6.6 to show that

E[(M_t − M_0)²] = Σ_{k=1}^t E[(M_k − M_{k−1})²]. (625)
CHAPTER 5
We introduce the basic notions of the no-arbitrage principle and the binomial model, and connect them to martingales.
Definition 5.1.1. A portfolio A at current time t is said to be an arbitrage portfolio if its value V_A satisfies the following:
(i) V A (t ) ≤ 0.
(ii) There exists a future time T ≥ t such that P(V A (T ) ≥ 0) = 1 and P(V A (T ) > 0) > 0.
Example 5.1.2 (A 1-step binomial model). Suppose we have an asset with price (S t )t ≥0 . Consider a Eu-
ropean call option at time t = 0 with strike K = 110 and maturity t = 1 (year). Suppose that S 0 = 100 and
at time 1, S_1 takes one of the two values 120 and 90 according to a certain distribution. One can imagine flipping a coin with unknown probability and, according to whether it lands heads (H) or tails (T), the stock value S_1 takes values S_1(H) = 120 and S_1(T) = 90. Assume an annually compounded interest rate r = 4%. Can we determine its current value c = C_110(0, 1)? We will show c = 14/3.12 ≈ 4.49 by using two arguments: hedging and replication.
[Diagram: one-step binomial tree. S_0 = 100; S_1(H) = 120, S_1(T) = 90; r = 4%; between t = 0 and t = 1.]
Here we give a ‘hedging argument’ for option pricing. Consider the following portfolio at time t = 0:
Portfolio A: [x shares of the stock] + [y European call options with strike 110 and maturity 1].
The cost of entering into this portfolio (at time t = 0) is 100x + c y. Hence the profit of this portfolio takes
the following two values
V_1^A(H) − (100x + cy)(1.04) = [120x + y(120 − 110)^+] − [104x + (1.04)cy] = 16x + (10 − (1.04)c)y,
V_1^A(T) − (100x + cy)(1.04) = [90x + y(90 − 110)^+] − [104x + (1.04)cy] = −14x − (1.04)cy. (626)
For a perfect hedge, consider choosing the values of x and y so that the profit of this portfolio at maturity is the same for the two outcomes of the stock. Hence we must have
16x + (10 − (1.04)c)y = −14x − (1.04)c y. (627)
Solving this, we find
3x + y = 0. (628)
Hence if the above equation is satisfied, the profit of portfolio A is
V1A − (104x + (1.04)c y) = −14x − (1.04)c(−3x) = ((3.12)c − 14) x. (629)
If (3.12)c > 14, then portfolio A is an arbitrage portfolio; If (3.12)c < 14, then the ‘dual’ of portfolio A,
which consists of −x shares of the stock and −y European call options, is an arbitrage portfolio. Hence
assuming no-arbitrage, the only possible value of c is c = 14/(3.12). ▲
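The hedging computation above is short enough to verify directly. Below is a minimal sketch (our own code; the function name is ours) that evaluates the two state-by-state profits for y = −3x and confirms both vanish exactly at c = 14/3.12.

# Hedging check for Example 5.1.2: x shares and y = -3x calls.
def profits(x, y, c, r=0.04):
    cost = (100 * x + c * y) * (1 + r)          # time-1 value of entry cost
    up   = 120 * x + y * max(120 - 110, 0) - cost
    down = 90 * x + y * max(90 - 110, 0) - cost
    return up, down

c = 14 / 3.12                                   # claimed no-arbitrage price
up, down = profits(x=1.0, y=-3.0, c=c)
print(f"c = {c:.4f}, profit up = {up:.6f}, profit down = {down:.6f}")
# Any other c makes the common profit (3.12c - 14)x nonzero, i.e., an arbitrage.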
[Diagram: one-period market with m assets A^(1), ..., A^(m) and n outcomes ω_1, ..., ω_n; α_{i,j} denotes the profit of one unit of asset A^(i) when the market outcome is ω_j, forming the m × n profit matrix A = (α_{i,j}).]
Theorem 5.2.1 (The fundamental theorem of asset pricing). Consider portfolio A and the profit matrix A = (α_{i,j}) as above. Then exactly one of the following holds:
(i) There exists an investment allocation (x_1, ..., x_m) such that portfolio A is an arbitrage portfolio, that is, the n-dimensional row vector

[x_1, x_2, ..., x_m] A = [x_1, x_2, ..., x_m] (α_{i,j})_{1≤i≤m, 1≤j≤n} (632)

has nonnegative coordinates and at least one strictly positive coordinate.
(ii) There exists a strictly positive probability distribution p* = (p_1*, ..., p_n*) under which the expected profit of each asset is zero:

A (p_1*, ..., p_n*)^T = (α_{i,j}) (p_1*, ..., p_n*)^T = 0. (633)
Remark 5.2.2. Theorem 5.2.1 (i) states that the portfolio A is an arbitrage portfolio for some (x 1 , · · · , x m ).
The probability distribution p∗ in the above theorem is called the risk-neutral probability distribution.
Hence Theorem 5.2.1 states that there is no way to make A into an arbitrage portfolio if and only if there
exists a risk-neutral probability distribution under which the expected profit of each asset A (i ) is zero.
Example 5.2.3 (Example 5.1.2 revisited). Consider the situation described in Example 5.1.2. Let A (1) be
the asset of price (S t )t ≥0 and A (2) denote the European call with strike K = 110 on this asset with maturity
T . Then the matrix A of profits is given by
A = [ 16, −14 ; 10 − (1.04)c, −(1.04)c ], (634)
where c = C_110(0, 1) denotes the price of this European option. Assuming no-arbitrage, the fundamental theorem implies that there exists a risk-neutral probability distribution p* = (p_1*, p_2*) such that A(p*)^T = 0^T. Namely,
16p_1* − 14p_2* = 0,
(10 − (1.04)c)p_1* − (1.04)c p_2* = 0. (635)
Since p 1∗ + p 2∗ = 1, the first equation implies p 1∗ = 7/15 and p 2∗ = 8/15. Then from the second equation,
we get
(1.04)c = 10p_1* = 14/3. (636)
This gives c = 14/(3.12). ▲
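The system (635) together with p_1* + p_2* = 1 can be solved numerically. A hedged NumPy sketch (ours, not from the notes) stacks the normalization constraint under the profit matrix and solves by least squares:

import numpy as np

def risk_neutral_check(c, r=0.04):
    # Profit matrix from (634): rows = assets (stock, call), cols = (H, T).
    A = np.array([[16.0, -14.0],
                  [10.0 - (1 + r) * c, -(1 + r) * c]])
    # Solve A p = 0 together with p_1 + p_2 = 1.
    M = np.vstack([A, np.ones(2)])
    b = np.array([0.0, 0.0, 1.0])
    p, *_ = np.linalg.lstsq(M, b, rcond=None)
    return p, A @ p

p, expected_profit = risk_neutral_check(14 / 3.12)
print("p* =", np.round(p, 4))                       # ~ [7/15, 8/15]
print("expected profits:", np.round(expected_profit, 8))  # ~ [0, 0]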
Exercise 5.2.4. Rework Examples 5.1.2 and 5.2.3 with the following parameters:
S 0 = 100, S 1 (H ) = 130, S 1 (T ) = 80, r = 5%, K = 110. (637)
P ROOF OF T HEOREM 5.2.1. Suppose (i) holds with x = (x_1, ..., x_m). We want to show that (ii) cannot hold. Fix a strictly positive probability distribution p = (p_1, ..., p_n)′, where ′ denotes the transpose, so that p is an n-dimensional column vector. By (i), we have

x(Ap) = (xA)p > 0. (638)

It follows that Ap cannot be the zero vector in R^n. Hence (633) cannot hold for p* = p, as desired.
Next, suppose that (ii) holds for some strictly positive probability distribution p* = (p_1*, ..., p_n*)′. We use a linear algebra argument to show that (i) does not hold. To each m-dimensional row vector x = (x_1, ..., x_m) ∈ R^m, one can associate the n-dimensional row vector xA ∈ R^n. The condition (633) says that Ap* = 0. Hence for each x ∈ R^m, by using associativity of matrix multiplication,

(xA)p* = x(Ap*) = x0 = 0. (639)

This shows that the image of the linear map x ↦ xA, which is a linear subspace of R^n, is orthogonal to the strictly positive vector p*. Hence this linear subspace intersects the nonnegative orthant {(y_1, ..., y_n) | y_1, ..., y_n ≥ 0} only at the origin. This shows that (i) does not hold, as desired. □
5.3. The binomial model

5.3.1. 1-step binomial model. Suppose we have an asset with price (S_t)_{t≥0}. Assume that at a future time t = 1, the stock price S_1 can take two values S_1(H) = S_0u and S_1(T) = S_0d according to an invisible coin flip, where u, d > 0 are multiplicative factors for upward and downward moves of the stock price during the period [0, 1]. Assume the aggregated interest rate during [0, 1] is r > 0, so that the value of a ZCB (zero-coupon bond) maturing at 1 is given by Z(0, 1) = 1/(1 + r).
Consider a general European option on this stock, whose value at time t = 0, 1 is denoted V_t. We would like to determine its initial value (price at t = 0) V_0 in terms of its payoff V_1. In the previous section, we have seen that there are three ways to proceed: 1) hedging argument, 2) replication argument, and 3) risk-neutral probability distribution.
[Diagram: one-step binomial trees for the stock (S_0 moves to S_1(H) = S_0u or S_1(T) = S_0d) and the option (V_0 moves to V_1(H) or V_1(T)), both with interest rate r.]
(ii) Portfolio B replicates long one European option if and only if we have (647) and
x = (1/(1 + r)) · (S_1(H)V_1(T) − S_1(T)V_1(H)) / (S_1(H) − S_1(T)). (648)
Furthermore,

V_0 = x + Δ_0 S_0 = (1/(1 + r)) ( V_1(H) · ((1 + r) − d)/(u − d) + V_1(T) · (u − (1 + r))/(u − d) ). (649)
P ROOF. To show (i), we equate the two payoffs of portfolio A at time t = 1 and obtain

Δ_0 S_1(H) − V_1(H) = Δ_0 S_1(T) − V_1(T). (650)

Solving this for Δ_0 shows the assertion.
To show (ii), note that portfolio B replicates long one European option if and only if

Δ_0 S_1(H) + x(1 + r) = V_1(H),
Δ_0 S_1(T) + x(1 + r) = V_1(T), (651)
or in matrix form,

[ S_1(H), 1 + r ; S_1(T), 1 + r ] [ Δ_0 ; x ] = [ V_1(H) ; V_1(T) ]. (652)

This is equivalent to

[ Δ_0 ; x ] = (1/((1 + r)(S_1(H) − S_1(T)))) [ 1 + r, −(1 + r) ; −S_1(T), S_1(H) ] [ V_1(H) ; V_1(T) ], (653)
which is also equivalent to (647) and (648), as desired.
Lastly, suppose portfolio B replicates long one European option. Then by the monotonicity theorem, the initial value V_0 of the European option should equal the value of portfolio B at time t = 0. This shows

V_0 = x + Δ_0 S_0. (654)
Now consider a European call option on this stock with strike K = 65 maturing in a month. Then V_1(H) = (80 − 65)^+ = 15 and V_1(T) = (50 − 65)^+ = 0. By Proposition 5.3.2, the initial value V_0 of this European option is

V_0 = (1/(1 + 1/18)) E_{p*}[V_1] = (18/19) · 15 · (4/9) = 120/19 ≈ 6.3158. (659)
Working in an investment bank, you were able to sell 10,000 calls to a customer at a slightly higher price of $6.50 each, receiving an up-front payment of $65,000. At maturity, the overall profit is given by
(19/18) · $65,000 − 10,000 · (80 − 65)^+ = −$81,389 if the stock goes up,
(19/18) · $65,000 − 10,000 · (50 − 65)^+ = $68,611 if the stock goes down. (660)
Being worried about losing a huge amount if the stock goes up, you decided to hedge and lock in the profit. According to Proposition 5.3.2, the hedge ratio Δ_0 is given by
Δ_0 = ((80 − 65)^+ − (50 − 65)^+)/(80 − 50) = 15/30 = 1/2. (661)
Since you have shorted 10,000 calls, this means you need to buy 5,000 shares of the stock, owing 5,000 · 60 − $65,000 = $235,000 to the bank. This forms a portfolio of 5,000 shares of the stock, a short position in 10,000 calls, and a $235,000 loan.
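A quick arithmetic check of this hedge (our own sketch, using the example's numbers S_0 = 60 and r = 1/18) confirms that the profit is the same in both states, so the position is locked in:

# Hedged book: short 10,000 calls at $6.50, long 5,000 shares, $235,000 loan.
r = 1 / 18
loan = 5_000 * 60 - 65_000
for S1, label in [(80, "up"), (50, "down")]:
    stock = 5_000 * S1
    calls = -10_000 * max(S1 - 65, 0)
    debt  = -loan * (1 + r)
    print(f"{label:>4}: profit = {stock + calls + debt:,.0f}")
# Both branches give ~ $1,944: the premium over the fair price, grown at rate r.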
5.3.2. The N-step binomial model. In this subsection, we consider the general N-step binomial model. Namely, starting at the current time t = 0, we flip N coins at times t = 1, 2, ..., N to determine the market evolution. More precisely, the sample space of outcomes is Ω = {H, T}^N, which consists of strings of H's and T's of length N. We assume a constant interest rate for each period [k, k + 1] for k = 0, 1, ..., N − 1.
F IGURE 4. Illustration of the N -step binomial model. ω is a sample path of the market evolution
in the first k steps. In the next period [k, k + 1], either up or down evolution occurs and the same
path is extended to ωH or ωT accordingly. S t and Vt denote the stock price and option payoff.
The following result for the general N-step binomial model is a direct analog of the 2-step case we have discussed in the previous subsection. To better understand its second part, recall the remark on a probabilistic interpretation of the 2-step value formula as in Proposition ??.
Proposition 5.3.5. Consider the N-step binomial model as above. Consider a European option on this stock with value (V_t)_{0≤t≤N}.
(i) For each integer 0 ≤ k < N and a sample path ω ∈ {H, T}^k for the first k steps, define the risk-neutral probability distribution p*(ω) = (p_1*(ω), p_2*(ω)) by

p_1*(ω) = ((1 + r)S_k(ω) − S_{k+1}(ωT)) / (S_{k+1}(ωH) − S_{k+1}(ωT)),
p_2*(ω) = (S_{k+1}(ωH) − (1 + r)S_k(ω)) / (S_{k+1}(ωH) − S_{k+1}(ωT)). (665)

If 0 < p_1*(ω) < 1, then

V_k(ω) = (1/(1 + r)) E_{p*(ω)}[V_{k+1} | first k coin flips = ω] (666)
= (1/(1 + r)) (V_{k+1}(ωH)p_1*(ω) + V_{k+1}(ωT)p_2*(ω)). (667)
(ii) Consider N consecutive coin flips such that given any sequence x 1 x 2 · · · x k of the first k flips, the (k +
1)st coin lands on heads with probability p 1∗ (x 1 x 2 · · · x k ). Let P∗ denote the induced probability
measure (risk-neutral probability measure) on the sample space Ω = {H , T }N . Then
V_0 = (1/(1 + r)^N) E_{P*}[V_N]. (668)
P ROOF. The argument for (i) is exactly the same as in the proof of Proposition ??. For (ii) we use an
induction on the number of steps. The base case is verified by (i). For the induction step, we first use (i)
to write

V_0 = (1/(1 + r)) (V_1(H) p_1*(∅) + V_1(T) p_2*(∅)). (669)
Let X 1 , X 2 , · · · , X N be a sequence of N (random) coin flips given by the risk-neutral probability measure
P∗ . Denote the expectation under P∗ by E∗ . By the induction hypothesis, we have
V_1(H) = (1/(1 + r)^{N−1}) E*[V_N | X_1 = H], V_1(T) = (1/(1 + r)^{N−1}) E*[V_N | X_1 = T]. (670)
Hence we have

V_0 = (1/(1 + r)^N) ( E*[V_N | X_1 = H] p_1*(∅) + E*[V_N | X_1 = T] p_2*(∅) ) (671)
= (1/(1 + r)^N) E*[E*[V_N | X_1]] = (1/(1 + r)^N) E*[V_N], (672)
where we have used iterated expectation for the last equality. This shows the assertion. □
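The backward induction (666)-(667) is easy to implement. Here is a hedged Python sketch (our own code; the function name is ours), specialized to the assumption of constant up/down factors u, d and interest rate r, so that p* is the same at every node. It prices a European call with strike K over N periods.

def binomial_price(S0, u, d, r, K, N):
    p_star = ((1 + r) - d) / (u - d)          # risk-neutral up-probability
    # terminal payoffs V_N over the N+1 possible numbers of up-moves
    V = [max(S0 * u**j * d**(N - j) - K, 0) for j in range(N + 1)]
    for k in range(N, 0, -1):                 # roll back V_k -> V_{k-1}
        V = [(p_star * V[j + 1] + (1 - p_star) * V[j]) / (1 + r)
             for j in range(k)]
    return V[0]

# One period with the data of Example 5.1.2: u = 1.2, d = 0.9, r = 4%.
print(binomial_price(100, 1.2, 0.9, 0.04, 110, 1))   # ~ 4.487 = 14/3.12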
5.3.3. The N-step binomial model revisited. In this subsection, we revisit the N-step binomial model in the framework of martingales. First recall the model. Starting at the current time t = 0, we flip N coins at times t = 1, 2, ..., N to determine the market evolution. The sample space of outcomes is Ω = {H, T}^N, which consists of strings of H's and T's of length N. We assume a constant interest rate for each period [k, k + 1] for k = 0, 1, ..., N − 1.
Let Ft denote the information that we can obtain by observing the market up to time t . For instance,
Ft contains the information of the first t coin flips, stock prices S 0 , S 1 , · · · , S t , European option values
V0 ,V1 , · · · ,Vt , and so on. Then (Ft )t ≥0 defines a natural filtration for the N -step binomial model. Below
we reformulate Proposition 5.3.5.
Proposition 5.3.6. Consider the N-step binomial model as above. Let P* denote the risk-neutral probability measure defined in Proposition 5.3.5. Consider a European option on this stock with value (V_t)_{0≤t≤N}.
(i) The process ((1 + r)^{−t} V_t)_{0≤t≤N} forms a martingale with respect to the filtration (F_t)_{t≥0} under the risk-neutral probability measure P*. That is,

E_{P*}[(1 + r)^{−(t+1)}V_{t+1} − (1 + r)^{−t}V_t | F_t] = 0. (673)

(ii) In particular, V_0 = E_{P*}[(1 + r)^{−N} V_N].
P ROOF. To show (i), we first note that, conditioning on the information F_t up to time t, we know all the coin flips, stock prices, and European option values up to time t. Hence we have

E_{P*}[(1 + r)^{−(t+1)}V_{t+1} | F_t] = (1 + r)^{−(t+1)} E_{P*}[V_{t+1} | F_t] = (1 + r)^{−(t+1)} (1 + r)V_t = (1 + r)^{−t} V_t, (676)

where the second equality uses the one-step pricing formula (666). This shows that (1 + r)^{−t}V_t is a martingale with respect to the natural filtration (F_t)_{t≥0} under the risk-neutral probability measure P*.
Now (ii) follows from (i) and Exercise 4.6.5. Namely, by Exercise 4.6.5,
EP∗ [(1 + r )−N VN − (1 + r )0V0 | F0 ] = 0. (678)
This yields
V0 = EP∗ [V0 | F0 ] = EP∗ [(1 + r )−N VN | F0 ], (679)
as desired. □
Exercise 5.3.7 (Risk-neutral probabilities make the stock price a martingale). Consider the N-step binomial model with stock price (S_t)_{0≤t≤N}. Let P* denote the risk-neutral probability measure defined in Proposition 5.3.5.
(i) Show that the discounted stock price (1 + r )−t S t forms a martingale with respect to the filtration
(Ft )t ≥0 under the risk-neutral probability measure P∗ . That is,
E_{P*}[(1 + r)^{−(t+1)} S_{t+1} − (1 + r)^{−t} S_t | F_t] = 0. (680)
CHAPTER 6

Poisson Processes
In this chapter, we introduce a particular class of renewal processes called “Poisson processes”, which are arrival processes whose inter-arrival times are i.i.d. exponential RVs. Due to the nice properties of the exponential distribution, Poisson processes enjoy what is called the ‘memoryless property’, which says that a Poisson process can be restarted at any stopping time, not necessarily at its arrival times. Recall that renewal processes have the renewal property, meaning that they can be restarted at arrival times. The memoryless property of Poisson processes is thus a stronger version of the renewal property, and it is enjoyed only by Poisson processes. This allows one to analyze Poisson processes in greater detail than general renewal processes.
Example 6.1.1 (Exponential RV). X is an exponential RV with rate λ (denoted by X ∼ Exp(λ)) if it has PDF f_X(x) = λe^{−λx} 1(x ≥ 0).
Exercise 6.1.2. Let X ∼ Exp(λ). Show that E(X ) = 1/λ and Var(X ) = 1/λ2 .
Exercise 6.1.3. Let X 1 ∼ Exp(λ1 ) and X 2 ∼ Exp(λ2 ) and suppose they are independent. Define Y =
min(X 1 , X 2 ). Show that Y ∼ Exp(λ1 + λ2 ). (Hint: Compute P(Y ≥ y).)
Example 6.1.4. Let X 1 ∼ Exp(λ1 ) and X 2 ∼ Exp(λ2 ) be independent exponential RVs. We will show that
P(X_1 < X_2) = λ_1/(λ_1 + λ_2) (686)
using iterated expectation. Using iterated expectation for probability (see Exercise 1.4.3),

P(X_1 < X_2) = E[P(X_2 > X_1 | X_1)] (687)
= ∫_0^∞ P(X_1 < X_2 | X_1 = x_1) λ_1 e^{−λ_1 x_1} dx_1 (688)
= ∫_0^∞ P(X_2 > x_1) λ_1 e^{−λ_1 x_1} dx_1 (689)
= ∫_0^∞ λ_1 e^{−λ_2 x_1} e^{−λ_1 x_1} dx_1 (690)
= λ_1 ∫_0^∞ e^{−(λ_1+λ_2) x_1} dx_1 = λ_1/(λ_1 + λ_2). (691)
▲
When we add up independent continuous RVs, we can compute the PDF of the resulting RV by using
the convolution formula or moment generating functions. In the following exercise, we compute the
PDF of the sum of independent exponentials.
Exercise 6.1.5 (Sum of i.i.d. Exp is Erlang). Let X 1 , X 2 , · · · , X n ∼ Exp(λ) be independent exponential RVs.
(i) Show that f_{X_1+X_2}(z) = λ² z e^{−λz} 1(z ≥ 0).
(ii) Show that f_{X_1+X_2+X_3}(z) = 2^{−1} λ³ z² e^{−λz} 1(z ≥ 0).
(iii) Let S n = X 1 + X 2 + · · · + X n . Use induction to show that S n ∼ Erlang(n, λ), that is,
f_{S_n}(z) = λ^n z^{n−1} e^{−λz} / (n − 1)!. (692)
Hint: Let X, Y be independent continuous RVs. Then the PDF of X + Y is given by the convolution formula: f_{X+Y}(z) = ∫_{−∞}^∞ f_X(x) f_Y(z − x) dx.
Remark: For parameters (r, λ), a continuous random variable X is said to have the Gamma distribution with parameters (r, λ) (denoted X ∼ Gamma(r, λ)) if

f_X(x) = λ^r x^{r−1} e^{−λx} / Γ(r) · 1(x ≥ 0), (693)

where Γ(z) is the Euler Gamma function defined by

Γ(z) = ∫_0^∞ x^{z−1} e^{−x} dx. (694)
By induction and integration by parts, one can show that Γ(n) = (n − 1)! for n ∈ N. Hence Gamma(r, λ) = Erlang(r, λ) whenever r is a positive integer. So we can also say that S_n ∼ Gamma(n, λ).
Exponential RVs will be the building block of Poisson processes, because of their ‘memoryless property’.
Exercise 6.1.6 (Memoryless property of exponential RV). A continuous positive RV X is said to have the memoryless property if

P(X ≥ t_1 + t_2) = P(X ≥ t_1)P(X ≥ t_2) ∀t_1, t_2 ≥ 0. (695)

(i) Show that (695) is equivalent to

P(X ≥ t_1 + t_2 | X ≥ t_2) = P(X ≥ t_1) ∀t_1, t_2 ≥ 0. (696)
(ii) Show that exponential RVs have memoryless property.
(iii) Suppose X is continuous, positive, and memoryless. Let g (t ) = log P(X ≥ t ). Show that g is contin-
uous at 0 and
g (x + y) = g (x) + g (y) for all x, y ≥ 0. (697)
Using Exercise 6.1.7, conclude that X must be an exponential RV.
Exercise 6.1.7. Let g : R≥0 → R be a function with the property that g (x + y) = g (x) + g (y) for all x, y ≥ 0.
Further assume that g is continuous at 0. In this exercise, we will show that g (x) = c x for some constant
c.
(i) Show that g (0) = g (0 + 0) = g (0) + g (0). Deduce that g (0) = 0.
(ii) Show that for all integers n ≥ 1, g (n) = ng (1).
(iii) Show that for all integers n, m ≥ 1,
ng (1) = g (n · 1) = g (m(n/m)) = mg (n/m). (698)
Deduce that for all nonnegative rational numbers r , we have g (r ) = r g (1).
(iv) Show that g is continuous.
(v) Let x be a nonnegative real number. Let (r_k) be a sequence of rational numbers such that r_k → x as k → ∞. By using (iii) and (iv), show that

g(x) = g(lim_{k→∞} r_k) = lim_{k→∞} g(r_k) = g(1) lim_{k→∞} r_k = x · g(1). (699)
Lastly, we introduce the Poisson RVs and record some of their nice properties.
Exercise 6.1.10 (Sum of ind. Poisson RVs is Poisson). Let X ∼ Poisson(λ1 ) and Y ∼ Poisson(λ2 ) be inde-
pendent Poisson RVs. Show that X + Y ∼ Poisson(λ1 + λ2 ).
Example 6.1.11 (MGF of Poisson RV). Let X ∼ Poisson(λ). Then using the Taylor expansion of the expo-
nential function, we can compute the moment generating function (MGF) of X :
E[e^{tX}] = Σ_{k=0}^∞ e^{kt} λ^k e^{−λ}/k! = e^{−λ} Σ_{k=0}^∞ (e^t λ)^k/k! = e^{−λ} e^{λe^t} = e^{λ(e^t − 1)}. (701)
Proposition 6.1.12 (Splitting a Poisson RV). Suppose there are N ∼ Poisson(λ) balls in a box. Each ball
gets a color i ∈ {1, . . . , m} with probability p i independently, where p 1 + · · · + p m = 1. Let Ni denote the
number of balls of color i . Then Ni ∼ Poisson(λp i ) for 1 ≤ i ≤ m and N1 , . . . , Nm are independent.
P ROOF. We directly compute the joint distribution of N_1, ..., N_m. The key is to recognize that the joint distribution of (N_1, ..., N_m) conditional on N = n is multinomial. Writing n := k_1 + ··· + k_m,

P(N_1 = k_1, ..., N_m = k_m) = P(N = n) P(N_1 = k_1, ..., N_m = k_m | N = n) (702)
= (λ^n e^{−λ}/n!) · (n!/(k_1! ··· k_m!)) p_1^{k_1} ··· p_m^{k_m} (703)
= ((p_1λ)^{k_1} e^{−p_1λ}/k_1!) ··· ((p_mλ)^{k_m} e^{−p_mλ}/k_m!) (704)
= P(Poisson(p_1λ) = k_1) ··· P(Poisson(p_mλ) = k_m), (705)
where we have used the fact that λ = λ(p 1 + · · · + p m ). Hence the joint distribution of (N1 , . . . , Nm ) fac-
torizes into the product of Poisson distributions with parameters p k λ for 1 ≤ k ≤ m. This is enough to
conclude the independence of N1 , . . . , Nm as well as their claimed Poisson distributions. □
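The factorization in the proof is easy to check empirically. A hedged NumPy sketch (our code, not from the notes; m = 2 colors) compares mean and variance of N_1 with λp and checks that the covariance of N_1 and N_2 vanishes:

import numpy as np

rng = np.random.default_rng(0)
lam, p, trials = 5.0, 0.3, 200_000
N = rng.poisson(lam, size=trials)          # total number of balls
N1 = rng.binomial(N, p)                    # color-1 balls given N
N2 = N - N1
print("mean/var of N_1:", N1.mean(), N1.var(), " theory:", lam * p)
print("cov(N_1, N_2):", np.cov(N1, N2)[0, 1], " theory: 0")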
Recall the definition of arrival and renewal processes in Section 2.1. Below we define the Poisson process.
Definition 6.2.1 (Poisson process). An arrival process (Tk )k≥1 is a Poisson process of rate λ if its inter-
arrival times are i.i.d. Exp(λ) RVs. In this case, we denote (Tk )k≥1 ∼ PP(λ) and equivalently (N (t ))t ≥0 ∼
PP(λ) for its associated counting process (N (t ))t ≥0 .
The choice of exponential inter-arrival times is special due to the memoryless property of exponen-
tial RVs (Exercise 6.1.6).
Exercise 6.2.2. Let (Tk )k≥1 be a Poisson process with rate λ. Show that E[Tk ] = k/λ and Var(Tk ) = k/λ2 .
Furthermore, show that Tk ∼ Erlang(k, λ), that is,
f_{T_k}(z) = λ^k z^{k−1} e^{−λz} / (k − 1)!. (706)
The following exercise explains what is ‘Poisson’ about the Poisson process.
Exercise 6.2.3. Let (Tk )k≥1 be a Poisson process with rate λ and let (N (t ))t ≥0 be the associated counting
process. We will show that N (t ) ∼ Poisson(λt ).
(i) Using the relation {T_n ≤ t} = {N(t) ≥ n} and Exercise 6.2.2, show that

P(N(t) ≥ n) = P(T_n ≤ t) = ∫_0^t λ^n z^{n−1} e^{−λz}/(n − 1)! dz. (707)
(ii) Let G(t) = Σ_{m=n}^∞ (λt)^m e^{−λt}/m! = P(Poisson(λt) ≥ n). Show that

(d/dt) G(t) = λ^n t^{n−1} e^{−λt}/(n − 1)! = (d/dt) P(T_n ≤ t). (708)
Conclude that G(t ) = P(Tn ≤ t ).
(iii) From (i) and (ii), conclude that N (t ) ∼ Poisson(λt ).
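The conclusion of Exercise 6.2.3 can be seen directly in simulation. Below is a minimal sketch (ours; parameter values are arbitrary): we build a PP(λ) by summing Exp(λ) inter-arrival times and compare the empirical law of N(t) with the Poisson(λt) pmf.

import math, random

rng = random.Random(42)
lam, t, trials = 2.0, 3.0, 100_000

def count_arrivals(lam, t, rng):
    """Number of arrivals of a PP(lam) in [0, t]."""
    n, s = 0, rng.expovariate(lam)
    while s <= t:
        n += 1
        s += rng.expovariate(lam)
    return n

counts = [count_arrivals(lam, t, rng) for _ in range(trials)]
for n in range(4, 9):
    emp = counts.count(n) / trials
    thy = (lam * t) ** n * math.exp(-lam * t) / math.factorial(n)
    print(f"P(N(t) = {n}): empirical {emp:.4f}  Poisson {thy:.4f}")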
To finish the proof, let E′ be the entire sample space. Then the above yields

P((X(t) − X(T))_{t>T} ∈ E) = P((X(t) − X(0))_{t>0} ∈ E). (721)

This shows that (X(t) − X(T))_{t>T} has the same distribution as (X(t) − X(0))_{t>0}. Furthermore, we conclude

P((X(t) − X(T))_{t>T} ∈ E | (X(t))_{0≤t≤T} ∈ E′) = P((X(t) − X(0))_{t>0} ∈ E) (722)
= P((X(t) − X(T))_{t>T} ∈ E). (723)

Since E, E′ were arbitrary, this also shows that (X(t) − X(T))_{t>T} and (X(t))_{0≤t≤T} are independent. □
F IGURE 1. Assuming N(t) = 3 and T_3 = s ≤ t, we have Z = τ_4 − (t − s). By the memoryless property of the exponential RV, Z follows Exp(λ) under this conditioning.
As can be seen from (6.3), Z(t) depends on three random variables: τ_{N(t)+1}, N(t), and T_{N(t)}. To show the claim, we argue by conditioning on the last two RVs and using iterated expectation. Using (6.3), note that

P(Z(t) ≥ x | (N(s))_{0≤s≤t} ∈ E, N(t) = n, T_{N(t)} = s) (727)
= P(τ_{n+1} − (t − s) ≥ x | (N(s))_{0≤s≤t} ∈ E, N(t) = n, T_n = s) (728)
= P(τ_{n+1} − (t − s) ≥ x | (N(s))_{0≤s≤t} ∈ E, T_{n+1} > t, T_n = s) (729)
= P(τ_{n+1} − (t − s) ≥ x | (N(s))_{0≤s≤t} ∈ E, τ_{n+1} > t − s, T_n = s). (730)
Conditioned on N(t) = n, the event {(N(s))_{0≤s≤t} ∈ E} is determined by the arrival times T_1, ..., T_n and the fact that T_{n+1} > t. Hence we can rewrite

{(N(s))_{0≤s≤t} ∈ E, τ_{n+1} > t − s, T_n = s} = {(τ_1, ..., τ_n) ∈ E′, τ_{n+1} > t − s} (731)
for some event E′ to be satisfied by the first n inter-arrival times. Since the inter-arrival times are independent, this gives

P(Z(t) ≥ x | (N(s))_{0≤s≤t} ∈ E, N(t) = n, T_{N(t)} = s) (732)
= P(τ_{n+1} − (t − s) ≥ x | τ_{n+1} > t − s) = e^{−λx}, (733)

where the last equality is the memoryless property of the exponential distribution.
6.4. Merging and splitting of Poisson processes

If customers arrive at a bank according to a Poisson(λ) process and each customer is male or female independently with probability q and 1 − q, then the ‘thinned out’ process of male customers is a Poisson(qλ) process; the process of female customers is a Poisson((1 − q)λ) process.

F IGURE 2. Merging two independent Poisson processes of rates λ_1 and λ_2 gives a new Poisson process of rate λ_1 + λ_2.
The reverse operation of splitting a given PP into two complementary PPs is called ‘merging’. Namely, imagine customers arrive at a register through two doors A and B independently according to PPs of rates λ_A and λ_B, respectively. Then the combined arrival process of all customers is again a PP of the added rate.
[Figure: splitting a Poisson process of rate λ with independent Bernoulli(p) coin flips yields two Poisson processes of rates pλ and (1 − p)λ.]
In the next two exercises, we rigorously justify merging and splitting of Poisson processes.
Exercise 6.4.3 (Merging of independent PPs). Let (N_1(t))_{t≥0} and (N_2(t))_{t≥0} be the counting processes of two independent PPs of rates λ_1 and λ_2, respectively. Define a new counting process (N(t))_{t≥0} by

N(t) = N_1(t) + N_2(t). (744)

In this exercise, we show that (N(t))_{t≥0} ∼ PP(λ_1 + λ_2).
(i) Let τ_k^{(1)}, τ_k^{(2)}, and τ_k be the kth inter-arrival times of the counting processes (N_1(t))_{t≥0}, (N_2(t))_{t≥0}, and (N(t))_{t≥0}. Show that τ_1 = min(τ_1^{(1)}, τ_1^{(2)}). Conclude that τ_1 ∼ Exp(λ_1 + λ_2).
(ii) Let T_k be the kth arrival time for the joint process (N(t))_{t≥0}. Use the memoryless property of PPs and Exercise 6.3.6 to deduce that N_1 and N_2 restarted at time T_k are independent PPs of rates λ_1 and λ_2, which are also independent from the past (before time T_k).
(iii) From (ii), show that
τk+1 = min(τ̃1 , τ̃2 ), (745)
where τ̃1 is the waiting time for the first arrival after time Tk for N1 , and similarly for τ̃2 . Deduce
that τk+1 ∼ Exp(λ1 + λ2 ) and it is independent of τ1 , · · · , τk . Conclude that (N (t ))t ≥0 ∼ PP(λ1 +
λ2 ).
Exercise 6.4.4 (Splitting of PP). Let (N(t))_{t≥0} be the counting process of a PP(λ), and let (X_k)_{k≥1} be a sequence of i.i.d. Bernoulli(p) RVs. We define two counting processes (N_1(t))_{t≥0} and (N_2(t))_{t≥0} by

N_1(t) = Σ_{k=1}^∞ 1(T_k ≤ t)1(X_k = 1) = #(arrivals with coin landing on heads up to time t), (746)
N_2(t) = Σ_{k=1}^∞ 1(T_k ≤ t)1(X_k = 0) = #(arrivals with coin landing on tails up to time t). (747)

In this exercise, we show that (N_1(t))_{t≥0} ∼ PP(pλ) and (N_2(t))_{t≥0} ∼ PP((1 − p)λ).
(i) Let τ_k and τ_k^{(1)} be the kth inter-arrival times of the counting processes (N(t))_{t≥0} and (N_1(t))_{t≥0}. Let Y_k be the location of the kth 1 in (X_k)_{k≥1}. Show that

τ_1^{(1)} = Σ_{i=1}^{Y_1} τ_i. (748)
(iv) Recall that the Y_k − Y_{k−1}'s are i.i.d. Geom(p) RVs. Use Example 6.4.5 and (iii) to deduce that the τ_k^{(1)}'s are i.i.d. Exp(pλ) RVs. Conclude that (N_1(t))_{t≥0} ∼ PP(pλ). (The same argument shows (N_2(t))_{t≥0} ∼ PP((1 − p)λ).)
Example 6.4.5 (Sum of geometric number of Exp. is Exp.). Let X_i ∼ Exp(λ) for i ≥ 1 and let N ∼ Geom(p). Let Y = Σ_{k=1}^N X_k. Suppose all RVs are independent. Then Y ∼ Exp(pλ).
To see this, recall that their moment generating functions are

M_{X_1}(t) = λ/(λ − t), M_N(t) = pe^t/(1 − (1 − p)e^t). (751)

Hence (see Remark 6.4.6)

M_Y(t) = M_N(log M_{X_1}(t)) = p(λ/(λ − t)) / (1 − (1 − p)(λ/(λ − t))) = pλ/((λ − t) − λ(1 − p)) = pλ/(pλ − t). (752)
Notice that this is the MGF of an Exp(pλ) variable. Thus by uniqueness, we conclude that Y ∼ Exp(pλ).
▲
Remark 6.4.6. Let Y = X 1 + X 2 + · · · + X N , where X i ’s are i.i.d. RVs and N is an independent RV taking
values from positive integers. By iterated expectation, we have
M_Y(t) = E[e^{tY}] = E[e^{tX_1} e^{tX_2} ··· e^{tX_N}] (753)
= E_N[ E[e^{tX_1} e^{tX_2} ··· e^{tX_N} | N] ] (754)
= E_N[ E[e^{tX_1}]^N | N ] (755)
= E_N[M_{X_1}(t)^N] (756)
= E_N[e^{N log M_{X_1}(t)}] (757)
= M_N(log M_{X_1}(t)). (758)
With the above definition, we can think of the associated counting process (N(t))_{t≥0} as N(t) = N([0, t]). Then, taking A = (a, b] for 0 ≤ a < b, we can write N(A) = N((a, b]) as

N((a, b]) = N(b) − N(a) = (# of arrivals during [0, b]) − (# of arrivals during [0, a]). (760)
Definition 6.5.2 (Poisson point process). A counting measure N on R is called a Poisson point process of
intensity λ (denoted N ∼ PPP(λ)) if it has the following properties:
(i) For each 0 ≤ a < b, N ((a, b]) ∼ Poisson(λ(b − a)).
(ii) For disjoint intervals (a_1, b_1], ..., (a_n, b_n], the random variables N((a_1, b_1]), ..., N((a_n, b_n]) are independent.
The next theorem describes the properties of the counting measure of the Poisson process.
Proposition 6.5.3 (PP implies PPP). Let (N(t))_{t≥0} ∼ PP(λ) be the counting process of a Poisson process of rate λ. As a counting measure with N((a, b]) := N(b) − N(a), N is a Poisson point process with intensity λ.
P ROOF. This follows from the memoryless property of the Poisson process. Namely, restarting the PP at time t = a, we get N((a, b]) = N(b) − N(a) ∼ Poisson(λ(b − a)), since (N(a + t) − N(a))_{t≥0} is a PP(λ) by Proposition 6.3.4. Also, fix disjoint intervals (a_1, b_1], ..., (a_n, b_n]. Without loss of generality, we may assume that a_1 ≤ b_1 ≤ a_2 ≤ b_2 ≤ ··· ≤ a_n ≤ b_n. Then restarting the process at t = b_1 and using the memoryless property of the PP, we see that N((a_1, b_1]) is independent of N((a_k, b_k]) for 2 ≤ k ≤ n. Similarly, restarting the process at time t = b_2, we see that N((a_2, b_2]) is independent of N((a_k, b_k]) for 3 ≤ k ≤ n. Proceeding in the same way, we see that the RVs N((a_1, b_1]), ..., N((a_n, b_n]) are independent. □
Exercise 6.5.4 (PPP implies PP). Let N ∼ PPP(λ). Consider the induced counting process N (t ) := N ([0, t ])
for t ≥ 0. Let Tk = inf{u ≥ 0 | N (u) = k} be the kth ‘arrival time’ and let τk = Tk − Tk−1 be the kth ‘inter-
arrival time’.
(i) Use the fact that Tk is a stopping time for (N (t ))t ≥0 and Lemma 6.3.3 to deduce that (N (Tk + t ) −
N (Tk ))t ≥0 is the counting process of a PP(λ) that is independent of (N (t ))t ≤Tk .
(ii) Let Z (t ) = inf{u ≥ 0 | N (t + u) > N (t )} be the waiting time for the first arrival after time t . Show that
Z (t ) ∼ Exp(λ) for all t ≥ 0.
(iii) Use (ii) and conditioning on Tk−1 to show that τk ∼ Exp(λ) for all k ≥ 1.
Recall that in Exercise 6.4.4, we established the splitting of a PP by directly analyzing the super-inter-arrival times obtained by merging a geometric number of Exp(λ) inter-arrival times. In the following exercise, we give steps to obtain a similar statement by using the PPP instead.
Exercise 6.5.5 (Splitting of PP via PPP). Let N ∼ PPP(λ) and let T_1 < T_2 < ... denote the associated arrival times (or particle locations), where for any interval I, the counting measure N(I) of I equals the number of T_k's such that T_k ∈ I. Now suppose each particle at location T_k for k ≥ 1 is either red with probability p or blue with probability 1 − p, independently of everything else. Define two induced counting measures N_1 and N_2 by letting N_1(I) and N_2(I) be the numbers of red and blue particles in I, respectively.
(i) Show that N_1 ∼ PPP(pλ) and N_2 ∼ PPP((1 − p)λ). (Use Proposition 6.1.12.)
(ii) For each interval I, show that N_1(I) and N_2(I) are independent. (Use Proposition 6.1.12.)
(iii) Show that N_1 and N_2 are independent. Namely, for intervals I_1 and I_2 (not necessarily disjoint), show that N_1(I_1) and N_2(I_2) are independent. (Hint: Decompose N_1(I_1) = N_1(I_1 \ I_2) + N_1(I_1 ∩ I_2) and similarly for N_2(I_2). N_1(I_1 \ I_2) and N_2(I_2 \ I_1) are independent by the independence property of N and since (I_1 \ I_2) and (I_2 \ I_1) are disjoint. Use (ii) to conclude.)
(iv) By using Exercise 6.5.4, conclude that splitting a PP(λ) with independent Bernoulli(p) coin flips gives two independent PP(pλ) and PP((1 − p)λ).
Proposition 6.6.1. Let (Tn )n≥1 denote the arrival times of (N (t ))t ≥0 ∼ PP(λ).
(i) The joint distribution of (T1 , . . . , Tn ) conditional on N (t ) = n is the same as the joint distribution of the
order statistics of n i.i.d. Uniform([0, t ]) RVs.
(ii) The joint PDF of (T_1, ..., T_n) conditional on N(t) = n is given as

f_{(T_1,...,T_n)|N(t)=n}(t_1, ..., t_n) = (n!/t^n) 1(0 ≤ t_1 < ··· < t_n ≤ t). (763)
P ROOF. We first show (ii). For this we give a ‘physics’ style proof here. First note that 0 < T_1 < ··· < T_n < t with probability 1: the ordering follows from the definition of arrival times, and T_n < t since we are conditioning on N(t) = n. Hence we only need to verify the formula (763) for 0 < t_1 < ··· < t_n < t. Fix such t_i's. Consider an ε-box at the point (t_1, ..., t_n): (t_1, t_1 + ε] × ··· × (t_n, t_n + ε]. We will assume that 0 < ε < min(t_1, t_2 − t_1, ..., t_n − t_{n−1}, t − t_n), so that the intervals (t_1, t_1 + ε], (t_1 + ε, t_2], ..., (t_n, t_n + ε], (t_n + ε, t] are disjoint. By the mean value theorem for integrals,

lim_{ε→0} (1/ε^n) P(T_1 ∈ (t_1, t_1 + ε], ..., T_n ∈ (t_n, t_n + ε] | N(t) = n) = f_{(T_1,...,T_n)|N(t)=n}(t_1, ..., t_n). (764)
Observe that

P(T_1 ∈ (t_1, t_1 + ε], ..., T_n ∈ (t_n, t_n + ε] | N(t) = n) (765)
= P(N(t_1, t_1 + ε] = 1, N(t_1 + ε, t_2] = 0, ..., N(t_n, t_n + ε] = 1, N(t_n + ε, t] = 0 | N(t) = n) (766)
= P(N(t_1, t_1 + ε] = 1, N(t_1 + ε, t_2] = 0, ..., N(t_n, t_n + ε] = 1, N(t_n + ε, t] = 0) / P(N(t) = n) (767)
= (n! / ((λt)^n e^{−λt})) · e^{−λt_1} ((λε)e^{−λε}) (e^{−λ(t_2 − t_1 − ε)}) ((λε)e^{−λε}) (e^{−λ(t_3 − t_2 − ε)}) ··· ((λε)e^{−λε}) (e^{−λ(t − t_n − ε)}) (768)

[the factor (λε)e^{−λε} appears n times, once per arrival, and the exponents of the remaining factors sum to −λt plus n(−λε)]

= (n! / ((λt)^n e^{−λt})) (λε)^n e^{−λt} (769)
= (n!/t^n) ε^n. (770)
Hence dividing both sides by ε^n and taking the limit ε → 0, we obtain the desired result.
Next, we deduce (i). Let U_1, ..., U_n be i.i.d. Uniform([0, t]) RVs. Then the joint PDF of (U_1, ..., U_n) is constant over the cube [0, t]^n with value 1/t^n. But for each choice of values 0 < t_1 < ··· < t_n < t for the order statistics U_{n;1}, ..., U_{n;n}, there are n! ways that these values can be realized by U_1, ..., U_n, and each case has the same probability by symmetry. Hence the joint PDF of (U_{n;1}, ..., U_{n;n}) is constant on the region {(t_1, ..., t_n) : 0 < t_1 < ··· < t_n < t} with constant value n!/t^n. This is exactly the same joint PDF as in (ii). □
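Proposition 6.6.1 can be illustrated numerically. Here is a hedged sketch (our code; the parameters are arbitrary): conditioned on N(t) = n, the first arrival T_1 should behave like the minimum of n i.i.d. Uniform([0, t]) RVs, whose mean is t/(n + 1).

import random

rng = random.Random(7)
lam, t, n_target = 1.5, 4.0, 6

def arrivals(lam, t, rng):
    """Arrival times of a PP(lam) in [0, t]."""
    out, s = [], rng.expovariate(lam)
    while s <= t:
        out.append(s)
        s += rng.expovariate(lam)
    return out

# Rejection sampling: keep runs with exactly n_target arrivals.
first = []
while len(first) < 5000:
    a = arrivals(lam, t, rng)
    if len(a) == n_target:
        first.append(a[0])
print(sum(first) / len(first), "vs theory", t / (n_target + 1))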
Example 6.6.2. Let N ∼ PPP(λ). Let 0 < r < t; we compute P(T_2 < r | N(t) = 2), the probability that the second arrival occurs by time r given that there are two arrivals in total up to time t. By using Proposition 6.6.1, we have

f_{(T_1,T_2)|N(t)=2}(t_1, t_2) = (2/t²) 1(0 < t_1 < t_2 < t). (771)
Hence we get

P(T_2 < r | N(t) = 2) = ∫_0^r ∫_0^{t_2} (2/t²) dt_1 dt_2 = (2/t²)(r²/2) = r²/t². (772)
To go one step further, by differentiating the above in r, we can obtain the conditional PDF of T_2 given N(t) = 2:

f_{T_2|N(t)=2}(x) = (2x/t²) 1(0 < x < t). (773)

▲
Exercise 6.7.4. Let (Tk )k≥1 ∼ PP(λ(t )). Let (τk )k≥1 be the inter-arrival times.
(i) Let Z(t) be the waiting time for the first arrival after time t. Show that

P(Z(t) ≥ x) = exp(−∫_t^{t+x} λ(s) ds). (777)

(ii) From (i), deduce that τ_1 has PDF

f_{τ_1}(t) = λ(t) e^{−∫_0^t λ(r) dr}. (778)
(iii) Denote µ(t) = ∫_0^t λ(s) ds. Use (i) and conditioning to show

P(τ_2 > x) = E[P(τ_2 > x | τ_1)] (779)
= ∫_0^∞ P(τ_2 > x | τ_1 = t) f_{τ_1}(t) dt (780)
= ∫_0^∞ e^{−(µ(t+x) − µ(t))} λ(t) e^{−µ(t)} dt (781)
= ∫_0^∞ λ(t) e^{−µ(t+x)} dt. (782)
Conclude that τ1 and τ2 do not necessarily have the same distribution.
The following exercise shows that the nonhomogeneous Poisson process with rate λ(t ) can be ob-
tained by a time change of the usual Poisson process of rate 1.
Exercise 6.7.5. Let (N_0(t))_{t≥0} be the counting process of a Poisson process of rate 1. Let λ(t) denote a non-negative function of t, and let

m(t) = ∫_0^t λ(s) ds. (783)

Define N(t) by

N(t) = N_0(m(t)) = #(arrivals of N_0 during [0, m(t)]). (784)
Show that (N (t ))t ≥0 is the counting process of a Poisson process of rate λ(t ).
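The time-change construction of Exercise 6.7.5 is straightforward to simulate. A minimal sketch (ours; the rate λ(t) = 2t is an arbitrary choice, so m(t) = t²) checks that E[N(T)] matches m(T):

import random

rng = random.Random(3)
T, trials = 2.0, 100_000

def n_arrivals_rate1(upto, rng):
    """Number of arrivals of a rate-1 PP in [0, upto]."""
    n, s = 0, rng.expovariate(1.0)
    while s <= upto:
        n += 1
        s += rng.expovariate(1.0)
    return n

m_T = T ** 2                         # m(T) = int_0^T 2s ds = T^2
counts = [n_arrivals_rate1(m_T, rng) for _ in range(trials)]
mean = sum(counts) / trials
print(f"E[N(T)] ~ {mean:.3f}  (theory: m(T) = {m_T})")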
Next, we consider the converse implication. We will break this into several exercises.
Exercise 6.8.3. Let (N(t))_{t≥0} be the Poisson process with rate λ > 0 in the sense of Definition 6.8.1. In this exercise, we will show that P(N(t) = 0) = e^{−λt}.
(i) Use independent/stationary increment properties to show that
P(N (t + h) = 0) = P(N (t ) = 0, N (t + h) − N (t ) = 0) (789)
= P(N (t ) = 0)P(N (t + h) − N (t ) = 0) (790)
= P(N (t ) = 0)(1 − λh + o(h)). (791)
(ii) Denote f_0(t) = P(N(t) = 0). Use (i) to show that

(f_0(t + h) − f_0(t))/h = (−λ + o(h)/h) f_0(t). (792)

By taking the limit as h → 0, show that f_0(t) satisfies the differential equation

df_0(t)/dt = −λ f_0(t). (793)
(iii) Conclude that P(N (t ) = 0) = e −λt .
Next, we generalize the ideas used in the previous exercise to compute the distribution of N (t ).
Exercise 6.8.4. Let (N(t))_{t≥0} be the Poisson process with rate λ > 0 in the sense of Definition 6.8.1. Denote f_n(t) = P(N(t) = n) for each n ≥ 0.
(i) Show that
P(N (t ) ≤ n − 2, N (t + h) = n) ≤ P(N (t + h) − N (t ) ≥ 2). (794)
Conclude that
P(N (t ) ≤ n − 2, N (t + h) = n) = o(h). (795)
(ii) Use (i) and independent/stationary increment properties to show that
f n (t + h) = P(N (t + h) = n) = P(N (t ) = n, N (t + h) − N (t ) = 0) (796)
+ P(N (t ) = n − 1, N (t + h) − N (t ) = 1) (797)
+ P(N (t ) ≤ n − 2, N (t + h) = n) (798)
= f n (t )(1 − λh + o(h)) + f n−1 (t )(λh + o(h)) + o(h). (799)
(iii) Use (ii) to show that the following differential equation holds:

df_n(t)/dt = −λ f_n(t) + λ f_{n−1}(t). (800)
7.1. Age and residual life
The right hand side is e^{−λ(x+y)} for a PP(λ), but it is not easy to evaluate for general renewal processes. Hence, we ask a somewhat easier question instead: in the long run, what proportion of time do we have A_s > x and Z_s > y? Namely, this is the proportion of time at which we missed the previous bus by more than x and should wait at least y until the next bus.
Proposition 7.1.2. Let (Tn )n≥1 denote the arrival times of a renewal process and suppose the inter-arrival
times (τn )n≥1 have finite expectation. Let A s and Z s denote the age and the residual life as in (804). Then
lim_{t→∞} (1/t) ∫_0^t 1(A_s > x, Z_s > y) ds = E[(τ_1 − (x + y))^+]/E[τ_1] = (1/E[τ_1]) ∫_{x+y}^∞ P(τ_1 ≥ z) dz. (809)

In particular, by taking x = 0 or y = 0, we have

lim_{t→∞} (1/t) ∫_0^t 1(A_s > x) ds = (1/E[τ_1]) ∫_x^∞ P(τ_1 ≥ z) dz, (810)
lim_{t→∞} (1/t) ∫_0^t 1(Z_s > y) ds = (1/E[τ_1]) ∫_y^∞ P(τ_1 ≥ z) dz. (811)
P ROOF. We will frame this problem as a renewal reward process. Namely, the renewal process is given by the same arrival times (T_n)_{n≥1} of the buses, each ‘cycle’ is the inter-arrival interval (T_{n−1}, T_n], and the ‘reward’ from that interval is the amount of time that the event {A_s > x, Z_s > y} holds during that interval, which is given by

R_n := ∫_{T_{n−1}}^{T_n} 1(A_s > x, Z_s > y) ds. (812)
By the renewal property, note that the R_n's are i.i.d. with common expectation E[R_1], which is the numerator on the right hand side above. Hence by the renewal reward SLLN (Theorem 2.2.1), we can deduce

lim_{t→∞} (1/t) ∫_0^t 1(A_s > x, Z_s > y) ds = E[R_1]/E[τ_1]. (813)
Furthermore, observe that

1(A_s > x, Z_s > y, s ∈ [0, T_1]) = 1(no arrival during [s − x, s + y], s ∈ [0, T_1]) (814)
= 1([s − x, s + y] ⊆ [0, T_1)) (815)
= 1(x ≤ s ≤ T_1 − y) 1(T_1 > x + y). (816)

Therefore, we get

R_1 = (T_1 − (x + y)) 1(T_1 > x + y) = (τ_1 − (x + y))^+. (817)
Using the tail-sum formula for the expectation of nonnegative random variables,

E[R_1] = E[(τ_1 − (x + y))^+] = ∫_0^∞ P((τ_1 − (x + y))^+ ≥ z) dz = ∫_{x+y}^∞ P(τ_1 ≥ z) dz. (818)

□
Theorem 7.1.3. Let (T_n)_{n≥1} denote the arrival times of a renewal process and suppose the inter-arrival times (τ_n)_{n≥1} have finite expectation. Let A_s and Z_s denote the age and the residual life as in (804). Further assume that, for every a > 0,

P(τ_1 ∈ {a, 2a, 3a, ...}) < 1. (819)

Then, for all x, y ≥ 0,

lim_{t→∞} P(A_t > x, Z_t > y) = (1/E[τ_1]) ∫_{x+y}^∞ P(τ_1 ≥ z) dz.
Example 7.2.3. Let (X t )t ≥0 be an irreducible Markov chain on a finite state space Ω. Let X 0 = x ∈ Ω and
let Tk be the kth time that the chain returns to x. By the strong Markov property, the inter-arrival times
τk = Tk − Tk−1 are i.i.d. We also know that E[τk ] < ∞ since hitting times of finite-state irreducible chains
have finite mean (see Exercise 4.6 in Lecture note 1). Hence (Tk )k≥0 is a renewal process. ▲
Example 7.2.4. Let (ξ_k)_{k≥1} be a sequence of i.i.d. RVs with

P(ξ_k = 1) = P(ξ_k = −1) = 1/2. (828)
Define a random walk (S n )n≥0 by S 0 = 0 and S n = ξ1 + · · · + ξn . Let Tk denote the kth time that the walk
visits the origin. Note that one can view S n as a Markov chain on state space Ω = Z. Hence again by
the strong Markov property, the inter-arrival times τk = Tk − Tk−1 are i.i.d. However, later we will see
that E[τk ] = ∞. Hence (Tk )k≥0 is not a renewal process as we defined above. However, we can still call
it a ‘renewal process with infinite expected inter-arrival times’. This will make more sense since, albeit
having infinite mean, the inter-arrival times are finite almost surely. In other words, S n always returns to
the origin with probability 1.
How do we know that the walk S n will eventually return back to the origin? That is, do we know that
the inter-arrival times are finite almost surely? It is easy to see that S n is an irreducible Markov chain.
However, since the state space Ω for S n is infinite, this does not guarantee that S n returns to zero with
probability 1.
In order to see this, recall the Gambler’s ruin problem (Exercise 7.1 in Lecture note 1). Namely, let
(X t )t ≥0 be a Markov chain on state space ΩN = {0, 1, · · · , N } with transition probabilities
P(X t +1 = k + 1 | X t = k) = p, P(X t +1 = k − 1 | X t = k) = 1 − p ∀0 ≤ k < N . (829)
Let ρ = (1 − p)/p. We have seen that, for any 0 < i < N,

P(X_t hits N before 0 | X_0 = i) = (1 + ρ + ··· + ρ^{i−1}) / (1 + ρ + ··· + ρ^{N−1}). (830)
For our case, let p = 1/2 so that ρ = 1. We can view X_t as the random walk S_n while it lies in the interval [0, N]. Hence

P(S_n hits N before 0 | S_0 = i) = i/N. (831)
Now, taking the limit N → ∞, which makes the right barrier at N fade away, we deduce

P(S_n never hits 0 | S_0 = i) = lim_{N→∞} P(S_n hits N before 0 | S_0 = i) = 0. (832)

This shows that S_n eventually returns to 0 with probability 1 starting from any initial state S_0 = i > 0. Since S_n is symmetric, the same conclusion holds for negative starting locations. Hence the inter-arrival times are finite with probability 1. ▲
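The combination "finite a.s. but infinite mean" can be seen numerically. The following sketch (our code; the cap is an artificial truncation for the simulation) estimates the tail P(T > n) of the return time, which decays on the order of 1/√n, too slowly for E[T] to be finite.

import random

rng = random.Random(0)

def return_time(cap, rng):
    """Steps for a SSRW started at 0 to return to 0, truncated at `cap`."""
    S, n = 0, 0
    while True:
        S += rng.choice([-1, 1])
        n += 1
        if S == 0 or n >= cap:
            return n

cap, trials = 10_000, 20_000
times = [return_time(cap, rng) for _ in range(trials)]
for n in [10, 100, 1000]:
    print(f"P(T > {n}) ~ {sum(tt > n for tt in times) / trials:.4f}")
print("fraction still out at cap:", sum(tt >= cap for tt in times) / trials)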
Exercise 7.2.5 (Markov chain with continuous-time jumps). Let (X k )k≥0 be an irreducible and aperi-
odic Markov chain on state space Ω = {1, 2, · · · , m} with transition matrix P = (p i j ). Let π be the unique
stationary distribution of the chain.
Suppose the chain spends an independent amount of time at each state x ∈ Ω, whose distribution F x
may depend only on x. For each real t ≥ 0, let Y (t ) ∈ Ω denote the state of the chain at time t .
(i) Fix x ∈ Ω, and let T_k^{(x)} denote the kth time that the Markov process (Y(t))_{t≥0} returns to x. Let (τ_k^{(x)})_{k≥1} and (N^{(x)}(t))_{t≥0} be the associated inter-arrival times and the counting process, respectively. Then

N^{(x)}(t) = number of visits to x that (Y(t))_{t≥0} makes up to time t. (833)

Show that (T_k^{(x)})_{k≥2} is a renewal process. Then, using the renewal SLLN, show that

P( lim_{t→∞} N^{(x)}(t)/t = 1/E[τ_1^{(x)}] ) = 1. (834)
(ii) Let T_k denote the kth time that the Markov process (Y(t))_{t≥0} jumps. Let (τ_k)_{k≥1} and (N(t))_{t≥0} be the associated inter-arrival times and the counting process, respectively. Note that

N(t) = N^{(1)}(t) + N^{(2)}(t) + ··· + N^{(m)}(t). (835)

Use (i) to derive that

P( lim_{t→∞} N(t)/t = 1/E[τ̃_1^{(1)}] + ··· + 1/E[τ̃_1^{(m)}] ) = 1. (836)
(iii) Using the Markov chain SLLN, show that

P( lim_{n→∞} (1/n) Σ_{k=1}^n 1(X_k = x) = π(x) ) = 1. (837)
In particular, suppose the chain spends unit time at each state x ∈ Ω. Then each τ_1^{(x)} equals the discrete return time of the Markov chain X_k to x, so we know π(x) = 1/E[τ_1^{(x)}]. Check that the above formula is consistent in this case.
Exercise 7.2.6 (Alternating renewal process). Let (τ_k)_{k≥1} be a sequence of independent RVs where

E[τ_{2k−1}] = µ_1, E[τ_{2k}] = µ_2 ∀k ≥ 1. (840)

Define an arrival process (T_k)_{k≥1} by T_k = τ_1 + ··· + τ_k for all k ≥ 1.
(i) Is (Tk )k≥1 a renewal process?
(ii) Let (X_k)_{k≥0} be a Markov chain on state space Ω = {1, 2} with transition matrix

P = [ 0, 1 ; 1, 0 ]. (841)
Show that the chain has π = [1/2, 1/2] as the unique stationary distribution.
(iii) Suppose the chain spends time τ_k at state X_k ∈ Ω between the (k − 1)st and kth jumps. For each real t ≥ 0, let Y(t) ∈ Ω denote the state of the chain at time t. Let N(t) be the number of jumps that Y(t) makes up to time t. Define

R^{(1)}(t) = Σ_{k=1}^{N(t)} τ_k 1(X_k = 1), (842)

which is the total amount of time that (Y(t))_{t≥0} spends at state 1. Use Exercise 7.2.5 to deduce

P( lim_{t→∞} R^{(1)}(t)/t = µ_1/(µ_1 + µ_2) ) = 1. (843)
Exercise 7.2.7 (Poisson janitor). (Excerpted from [Dur99]) A light bulb has a random lifespan with distribution F and mean µ_F. A janitor comes at times according to a PP(λ), checks the bulb, and replaces it if it is burnt out. Suppose all bulbs have independent lifespans with the same distribution F.
(i) Let T_k be the kth time that the janitor arrives and replaces the bulb. Show that (T_k)_{k≥0} with T_0 = 0 is a renewal process.
(ii) Let (τ_k)_{k≥1} be the inter-arrival times of the renewal process defined in (i). Use the memoryless property of Poisson processes to show that

7.3. Little's law

Little's law states that

(average size of the system) = (arrival rate) × (average time spent in the system), (849)

or ℓ = λw for short. The power of Little's law lies in its applicability to very general situations; one can even choose a portion of a gigantic queuing network and apply Little's law to analyze the local behavior of the system. We also remark that Little's law applies to deterministic queuing systems; for stochastic ones satisfying certain conditions, it holds with probability 1.
Consider a queuing system where the kth customer arrives at time t_k, spends w_k units of time, and then exits the system. We assume no two customers arrive at the same time, that is, t_1 < t_2 < ···. Let N(t) and N_d(t) denote the number of arrivals and departures up to time t, respectively. Finally, let L(t) denote the number of customers in the system at time t. To summarize:
F IGURE 1. Arrival times, wait times, number of arrivals, and number of departures.
Now we introduce three key quantities which describe the average behavior of the system. Define the following quantities, whenever the limits exist:

ℓ = lim_{t→∞} (1/t) ∫_0^t L(s) ds = average size of the queue, (855)
λ = lim_{t→∞} N(t)/t = arrival rate, (856)
w = lim_{t→∞} (1/N(t)) Σ_{k=1}^{N(t)} w_k = average wait time. (857)
Little’s law gives a very simple relation between the above three average quantities.
Theorem 7.3.1 (Little’s law). If both λ and w exist and are finite, then so does ℓ and
ℓ = λw. (858)
Example 7.3.3 (Housing Market). Suppose local real estate agents in Westwood estimate that it takes 120 days on average to sell a house; this number fluctuates with the economy and season, but it has been fairly stable over the past decade. We also found that on any given day last year, the number of houses for sale ranged from 20 to 30, with an average of 25. What can we say about the average number of transactions last year?
In order to apply Little's law, we view the housing market as a queuing system. Namely, we regard a house being put up for sale as an arrival to the system. The queue consists of unsold houses, and when a house is sold, we regard it as exiting the queue. Now from the description above, we set the average wait time w = 120 and the average queue length ℓ = 25. Then by Little's law, we infer that the arrival rate is λ = ℓ/w = 25/120 houses per day, or roughly 75 houses per year. ▲
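Little's law is also easy to verify numerically. Below is a minimal simulation sketch for a hypothetical infinite-server system (so that sojourn times are independent of the queue); all parameter values are arbitrary.

```python
import random

# Little's law check: arrivals ~ PP(lam); customer k stays an independent
# Uniform(0, 2*w_mean) time, so the average sojourn time is w_mean.
lam, w_mean, T = 2.0, 5.0, 10**5
t, arrivals = 0.0, []
while True:
    t += random.expovariate(lam)       # next inter-arrival time
    if t > T:
        break
    arrivals.append((t, t + random.uniform(0, 2 * w_mean)))
n = len(arrivals)
lam_hat = n / T                              # empirical arrival rate
w_hat = sum(b - a for a, b in arrivals) / n  # empirical average sojourn time
ell_hat = sum(min(b, T) - a for a, b in arrivals) / T  # time-average queue size
print(ell_hat, lam_hat * w_hat)  # both close to lam * w_mean = 10
```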
Exercise 7.3.4 (SLLN for weighted sum). Let (X_k)_{k≥1} be a sequence of i.i.d. RVs with finite variance. Let (w_k)_{k≥1} be a sequence of real numbers such that

w̄ := lim_{n→∞} (1/n) Σ_{k=1}^n w_k < ∞, lim_{n→∞} (1/n) Σ_{k=1}^n w_k² < ∞. (876)

Define S_n = Σ_{k=1}^n X_k w_k. In this exercise, we will show that, almost surely,

lim_{n→∞} S_n/n = E[X_1] w̄. (877)
(i) Write

S_n/n = (1/n) Σ_{k=1}^n E[X_k] w_k + (1/n) Σ_{k=1}^n (X_k − E[X_k]) w_k. (878)

Show that it suffices to prove the assertion assuming E[X_k] = 0 for all k ≥ 1. We may assume this for the following steps.
(ii) Use Chebyshev's inequality to show that

P(|S_n| ≥ t) ≤ t^{−2} E[X_1²] Σ_{k=1}^n w_k². (879)
(iv) Use (iii) to show that, for any sequence (n_k) of integers such that n_k → ∞ as k → ∞, we can choose a further subsequence (n_{k(r)}) such that

P( lim_{r→∞} S_{n_{k(r)}}² / n_{k(r)}² = 0 ) = 1. (883)
Exercise 7.3.5 (Expected load in the server). Consider a single-server queuing system, which is determined by the arrival times (t_k)_{k≥1} and the total times spent in the system (w_k)_{k≥1} (a.k.a. sojourn times). Let N(t) and N_d(t) denote the number of arrivals and departures up to time t, respectively. Each customer may wait for w̃_k time in the queue, and then spends s̃_k time in the server to get serviced, so that

w_k = w̃_k + s̃_k. (885)

Assume that the following limits exist and are finite:

λ := lim_{t→∞} N(t)/t, w := lim_{n→∞} (1/n) Σ_{k=1}^n w_k, s̃² := lim_{n→∞} (1/n) Σ_{k=1}^n s̃_k² < ∞. (886)
(i) Show that the kth customer is in the server if and only if
t k + w̃ k ≤ t < t k + w̃ k + s̃ k . (887)
(ii) Let R(t ) denote the remaining service time of the current customer in the server. Use (i) to show that
R(t) = Σ_{k=1}^∞ (t_k + w̃_k + s̃_k − t) 1(t_k + w̃_k ≤ t < t_k + w̃_k + s̃_k). (888)
Finally, derive the following formula for the average load in the server:
r := lim_{T→∞} (1/T) ∫_0^T R(t) dt = λs̃²/2. (892)
Exercise 7.3.6 (Pollaczek-Khinchine formula). Consider a G/G/1 queue, where arrivals are given by a renewal process of rate λ and service times are i.i.d. copies of a RV S with finite mean and variance. We use the following notation:
Tk = kth arrival time (893)
W̃k = Time that the kth customer spends in the queue (894)
S̃ k = Time that the kth customer spends in the server (895)
W̃ (t ) = Remaining time until exit of the last customer in the queue at time t . (896)
Note that lim_{h↓0} W̃(T_k − h) equals the waiting time W̃_k of the kth customer in the queue. We will assume that the following limit exists almost surely:

lim_{n→∞} (1/n) Σ_{k=1}^n W̃_k < ∞. (897)
The goal of this exercise is to show the following Pollaczek-Khinchine formula for the mean waiting time: Almost surely,

w̃ := lim_{t→∞} (1/t) ∫_0^t W̃(s) ds = λE[S²] / (2(1 − λE[S])). (898)
(i) Let S(t) denote the sum of the service times of all customers in the queue at time t. Show that

S(t) = Σ_{k=1}^∞ S̃_k 1(T_k ≤ t < T_k + W̃_k). (899)
(ii) Let N (t ) and N d (t ) denote the number of arrivals and departures up to time t . Use Fubini’s theorem
to show that
∫_0^T S(t) dt = Σ_{k=1}^∞ ∫_0^T S̃_k 1(T_k ≤ t < T_k + W̃_k) dt (900)
= Σ_{k=1}^{N(T)} S̃_k [min(T, T_k + W̃_k) − T_k]. (901)
Also, deduce that
Σ_{k=1}^{N_d(T)} S̃_k W̃_k ≤ ∫_0^T S(t) dt ≤ Σ_{k=1}^{N(T)} S̃_k W̃_k. (902)
(iii) From (ii) and Exercise 7.3.4, show that
lim_{T→∞} (1/T) ∫_0^T S(t) dt = λE[S] w̃. (903)
(iv) Let R(t) denote the remaining service time of the current customer in the server. Then W̃(t) = R(t) + S(t). Using (iii) and Exercise 7.3.5, conclude the PK formula (898).
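The formula (898) can be checked by simulation in the special case of an M/M/1 queue, where the waiting times satisfy the Lindley recursion W̃_{k+1} = max(0, W̃_k + S̃_k − τ_{k+1}); for Poisson arrivals, the customer-average waiting time agrees with the time average w̃ (PASTA). A minimal sketch, with arbitrary rates:

```python
import random

# Pollaczek-Khinchine check for M/M/1: lam = arrival rate, mu = service rate.
lam, mu, n = 1.0, 2.0, 10**6
W, total = 0.0, 0.0
for _ in range(n):
    total += W
    # Lindley recursion: W_{k+1} = max(0, W_k + S_k - tau_{k+1})
    W = max(0.0, W + random.expovariate(mu) - random.expovariate(lam))
ES, ES2 = 1 / mu, 2 / mu**2               # E[S], E[S^2] for S ~ Exp(mu)
pk = lam * ES2 / (2 * (1 - lam * ES))     # right-hand side of (898)
print(total / n, pk)                      # both close to 0.5
```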
CHAPTER 8

In this chapter, we introduce continuous-time Markov chains (CTMCs) and study various properties and examples. The continuous-time extension of MCs has a wide range of applications in simulating chemical or biological systems and also serves as a building block for Brownian motion.
= Σ_{y∈Ω} P(X_{t+s} = z | X_s = y) P_s(x, y) (909)
= Σ_{y∈Ω} P(X_t = z | X_0 = y) P_s(x, y) (910)
= Σ_{y∈Ω} P_t(y, z) P_s(x, y), (911)
where the last two equalities follow from the continuous Markov property and time-homogeneity. □
Remark 8.1.2. Proposition 8.1.1 (iii) is called the 'semigroup' property or the Chapman-Kolmogorov equation.
Example 8.1.3 (Homogeneous Poisson Process). Let (N (t ))t ≥0 ∼ PP(λ). We will show that it is a time-
homogeneous CTMC on state space Z≥0 . Note that
P(N(t) = y | N(s_n) = x_n, . . . , N(s_0) = x_0) (912)
= P(N((s_n, t]) = y − x_n | N((s_0, s_1]) = x_1 − x_0, . . . , N((s_{n−1}, s_n]) = x_n − x_{n−1}) (913)
= P(N((s_n, t]) = y − x_n) (914)
= P(Poisson(λ(t − s_n)) = y − x_n) (915)
= 1(y − x_n ≥ 0) (λ(t − s_n))^{y−x_n} e^{−λ(t−s_n)} / (y − x_n)!, (916)
where for the second equality, we used the independent increments property of the Poisson counting process N over disjoint intervals. An identical computation shows
P(N(t) = y | N(s_n) = x_n) = 1(y − x_n ≥ 0) (λ(t − s_n))^{y−x_n} e^{−λ(t−s_n)} / (y − x_n)!. (917)
Hence we verify the continuous Markov property as well as time-homogeneity, where the last expression
gives the transition probability function:
P_t(x, y) = 1(y ≥ x) (λt)^{y−x} e^{−λt} / (y − x)! = P(Poisson(λt) = y − x). (918)
The indicator 1(y ≥ x) above indicates that the counting process N (t ) can only increase. ▲
Example 8.1.4 (Discrete-time Markov chain run by a Poisson process). Let (Yn )n≥1 denote a discrete-
time Markov chain on a countable state space Ω with transition matrix P . Suppose we have a Poisson
arrival process (N (t ))t ≥0 ∼ PP(λ) independent of the chain (Yn )n≥1 . We define a continuous-time process
(X t )t ≥0 on the same state space Ω by
X t = Y N (t ) . (919)
We call (X_t)_{t≥0} the Markov chain (Y_n)_{n≥1} run by the Poisson process N ∼ PP(λ). For instance, if there have been ten arrivals up to time t (N(t) = 10), then X_t = Y_{10}. In other words, X_t is a version of Y_n where
we make the jumps given by Yn at arrival times Tk of the PP. Alternatively, the continuous-time chain X t
waits an independent Exp(λ) time between each jump (inter-arrival time of the PP) and makes the jumps
modulated by Yn .
We will show that (X t )t ≥0 is a time-homogeneous CTMC with transition probability function
P_t(x, y) = Σ_{k=0}^∞ ((λt)^k e^{−λt} / k!) P^k(x, y). (920)
Namely, the chain X_t jumps a random number of times during a period of length t, which has distribution Poisson(λt). Once this number is given, say k, one has to make discrete-time jumps from x to y in k steps according to the transition matrix P, which occurs with probability P^k(x, y). Since k is random, the right-hand side of (920) takes the form of the expectation of P^k(x, y) over k ∼ Poisson(λt).¹
Fix times 0 ≤ s_0 < s_1 < · · · < s_{n−1} < s_n (= s) < t and states x_0, . . . , x_{n−1}, x_n (= x), y ∈ Ω. Note that the numbers of arrivals during the disjoint intervals (s_0, s_1], . . . , (s_{n−1}, s_n], (s_n, t] are independent and follow Poisson distributions with rate λ times the lengths of the corresponding intervals. We will partition the probability of interest according to the values of these arrival counts. Namely, fix integers k_0, . . . , k_{n+1} ≥ 0, and denote t_i := k_0 + · · · + k_i for 0 ≤ i ≤ n. Then note that

P(Y_{N(t)} = y, N((s, t]) = k_{n+1} | N(s_0) = k_0, Y_{N(s_0)} = x_0, N((s_0, s_1]) = k_1, . . . , N((s_{n−1}, s_n]) = k_n, Y_{N(s)} = x)
= P(Y_{t_n + k_{n+1}} = y, N((s, t]) = k_{n+1} | N(s_0) = k_0, Y_{t_0} = x_0, N((s_0, s_1]) = k_1, . . . , N((s_{n−1}, s_n]) = k_n, Y_{t_n} = x)
= P(Y_{t_n + k_{n+1}} = y, N((s, t]) = k_{n+1} | N(s_0) = k_0, N((s_0, s_1]) = k_1, . . . , N((s_{n−1}, s_n]) = k_n, Y_{t_n} = x)
= P(Y_{t_n + k_{n+1}} = y, N((s, t]) = k_{n+1} | Y_{t_n} = x)
= P(Y_{t_n + k_{n+1}} = y | Y_{t_n} = x) P(N((s, t]) = k_{n+1}).
Thus, by summing the above expressions over k_{n+1} ≥ 0 and using the definition X_t = Y_{N(t)}, we obtain

P(X_t = y | X_{s_0} = x_0, . . . , X_{s_n} = x) = Σ_{k=0}^∞ P^k(x, y) P(N((s_n, t]) = k). (921)
Since the right-hand side above does not depend on x_0, . . . , x_{n−1}, averaging over X_{s_0}, . . . , X_{s_{n−1}} gives

P(X_t = y | X_{s_n} = x) = Σ_{k=0}^∞ P^k(x, y) P(N((s_n, t]) = k). (922)
This verifies both the continuous-time Markov property and time-homogeneity. Noting that N((s_n, t]) ∼ Poisson(λ(t − s_n)), we have also obtained the desired transition probability function in (920). ▲
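The formula (920) is straightforward to evaluate numerically by truncating the sum. Below is a minimal sketch; the 2-state transition matrix P and the truncation level kmax are arbitrary illustrative choices.

```python
import math
import numpy as np

# Truncated Poissonization (920): P_t = sum_k P(Poisson(lam*t) = k) * P^k.
P = np.array([[0.9, 0.1],
              [0.4, 0.6]])
lam, t = 1.0, 2.0

def poissonization(P, rate, t, kmax=200):
    out, Pk = np.zeros_like(P), np.eye(len(P))
    for k in range(kmax):
        out += math.exp(-rate * t) * (rate * t)**k / math.factorial(k) * Pk
        Pk = Pk @ P  # P^{k+1} for the next term
    return out

Pt = poissonization(P, lam, t)
print(Pt)
print(Pt.sum(axis=1))  # rows sum to 1: P_t is a stochastic matrix
```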
Example 8.1.5. Let (Y_n)_{n≥0} be a discrete-time Markov chain on state space Z≥0 with Y_0 = 0 and deterministic jumps P(Y_{n+1} = Y_n + 1 | Y_n) = 1 for n ≥ 0, so that Y_n = n. Suppose we have an independent Poisson arrival process (N(t))_{t≥0} ∼ PP(λ), and let X_t = Y_{N(t)}. Then in Example 8.1.4, we have seen that (X_t)_{t≥0} is a time-homogeneous CTMC on Z≥0. In fact, in this case,

X_t = Y_{N(t)} = N(t). (923)

Hence the Poisson counting process N(t), viewed as a CTMC, is a special case of a MC run by a PP. ▲
Example 8.1.6 (Continuous-time SSRW). Let (Y_n)_{n≥0} be a SSRW on Z with Y_0 = 0. That is, P(Y_{n+1} − Y_n = 1) = P(Y_{n+1} − Y_n = −1) = 1/2 for n ≥ 0. Suppose (N(t))_{t≥0} ∼ PP(1) is an independent Poisson counting process and let X_t = Y_{N(t)}. In Example 8.1.4, we have seen that (X_t)_{t≥0} is a time-homogeneous CTMC on Z. Probabilistically, X_t operates as follows: we wait an independent Exp(1) amount of time and then flip a fair coin, independently of everything else, to decide the direction of the next jump.
¹The formula (920) is also called the 'Poissonization' (Poisson average) of the quantity P^k(x, y).
Let P denote the transition matrix of the SSRW Y_n on Z. From (920), we have

P_t(x, x) = Σ_{k=0}^∞ (t^k e^{−t} / k!) P^k(x, x) (924)
= Σ_{n=0}^∞ (t^{2n} e^{−t} / (2n)!) P^{2n}(x, x) (925)
= Σ_{n=0}^∞ (t^{2n} e^{−t} / (2n)!) 2^{−2n} \binom{2n}{n} (926)
= e^{−t} Σ_{n=0}^∞ (t/2)^{2n} / (n!)². (927)
Note that for small t, we have P_t(x, x) = 1 − t + O(t²), so P_t(x, x) ≈ 1 for small t. (A similar observation holds for general CTMCs. Check!) ▲
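Incidentally, the series in (927) is the Taylor series of the modified Bessel function I_0(t), so P_t(x, x) = e^{−t} I_0(t). A quick numerical check of this identity (the truncation level 60 is an arbitrary choice):

```python
import math
from scipy.special import i0  # modified Bessel function I_0

# Compare the truncated series (927) with exp(-t) * I_0(t).
for t in [0.1, 1.0, 5.0]:
    series = sum((t / 2)**(2 * n) / math.factorial(n)**2 for n in range(60))
    print(t, math.exp(-t) * series, math.exp(-t) * i0(t))
```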
Once we know that we are jumping to some y ≠ x, the probability that the jump lands at y is q(x, y) / Σ_{y'≠x} q(x, y').
Now, we will show that the equivalent continuous-time processes (X_t)_{t≥0} in Definitions 8.2.2-8.2.6 are time-homogeneous CTMCs if we can ensure that there is no explosion of jumps, that is, no infinite number of jumps during a finite time interval.

Theorem 8.2.9. The continuous-time process (X_t)_{t≥0} in Definition 8.2.2 is a time-homogeneous CTMC if there is no explosion of jumps. A sufficient condition is the following:

There exists α > 0 such that Σ_{y≠x} q(x, y) ≤ α for all x ∈ Ω. (931)

PROOF. Omitted. One can imitate the proof of the renewal SLLN (Theorem 2.1.6). □
Example 8.2.10 (M/M/1 queue). Suppose customers arrive at a single server according to a Poisson process with rate λ > 0. Customers get serviced one by one according to the first-in-first-out ordering, and suppose there is no cap on the queue length. Finally, assume that each customer at the top of the queue takes an independent Exp(µ) time to get serviced and exit the queue.

This is called the M/M/1 queue in queuing theory. Here the name of this model follows Kendall's notation, where the first 'M' stands for memorylessness of the arrival process, the second 'M' stands for memorylessness of the service times, and '1' means there is a single server. One can think of the M/M/c queue for c servers, in general.
The M/M/1 queue can be described nicely as a CTMC with Poisson clocks. The state space is Ω = {0, 1, 2, · · · }, and we have the following jump rate diagram determining the M/M/1 queue: each state x jumps to x + 1 with rate λ, and each state x ≥ 1 jumps to x − 1 with rate µ.

FIGURE 1. Jump rate diagram of the M/M/1 queue
As in the construction of CTMC using jump chains and holding times (Definition 8.2.1), we can also analyze the jump chain of the M/M/1 queue. According to Exercise 8.2.5, we can compute the exponential holding parameters λ(x) and the routing matrix R as

λ(0) = λ, λ(x) = λ + µ for x ≥ 1, (932)
R(0, 1) = 1, R(x, x + 1) = λ/(λ + µ) for x ≥ 1, R(x, x − 1) = µ/(λ + µ) for x ≥ 1. (933)

FIGURE 2. State space diagram of the jump chain of the M/M/1 queue
Let T_1 < T_2 < · · · denote the times when the queue length changes, and let X_k := Y(T_k). Then (X_k)_{k≥0} forms a Markov chain. In fact, it is the Birth-Death chain we have seen before.
(i) Let (T_k^a)_{k≥0} ∼ PP(λ) and (T_k^d)_{k≥0} ∼ PP(µ). These are the sequences of arrival and departure times, respectively. Let T̃_i be the ith smallest time among all such arrival and departure times. In the
next section, we will learn (T̃i )i ≥0 ∼ PP(λ + µ). This is called ‘merging’ of two independent Pois-
son processes (see Exercise 6.4.3). Note that T̃i is the i th time that ‘something happens’ to the
queue.
(ii) Define a Markov chain (X k )k≥0 on state space Ω = {0, 1, 2, · · · } by
X k = Y (T̃k ). (934)
Namely, X k is the number of customers in the queue at kth time that something happens to the
queue.
(iii) What is the probability that X_2 = 2 given X_1 = 1? As soon as a new customer arrives at time T_1, she gets serviced and it takes an independent σ_1 ∼ Exp(µ) time. Let τ_1 be the inter-arrival time between the first and second customers. Then by Example 6.1.4 (competing exponentials),

P(X_2 = 2 | X_1 = 1) = P(new customer arrives before the previous customer exits) (935)
= P(τ_1 < σ_1) = λ/(µ + λ). (936)
Similarly,

P(X_2 = 0 | X_1 = 1) = P(new customer arrives after the previous customer exits) (937)
= P(τ_1 > σ_1) = µ/(µ + λ). (938)
(iv) In general, consider what has to happen for X_{k+1} = n + 1 given X_k = n ≥ 1:

P(X_{k+1} = n + 1 | X_k = n) = P(remaining service time after time T_k > remaining time until first arrival after time T_k). (939)

Note that the remaining service time after time T_k is still an Exp(µ) variable due to the memoryless property of exponential RVs. Moreover, by the memoryless property of Poisson processes, the arrival process restarted at time T_k is a Poisson(λ) process that is independent of the past. Hence (939) is the probability that an independent Exp(λ) RV is less than an Exp(µ) RV, and we are back to the same computation as in (iii). A similar argument holds for the other possibility P(X_{k+1} = n − 1 | X_k = n).
(v) (Stationary distribution) From (i)-(iv), we conclude that (X_k)_{k≥0} is a Birth-Death chain on the state space Ω = {0, 1, 2, · · · }. By computing the transition matrix P (which is of infinite size!) and solving πP = π, one obtains that the stationary distribution for the M/M/1 queue is unique and is a geometric distribution. Namely, if we write ρ = λ/µ, then

π(n) = ρ^n (1 − ρ) ∀n ≥ 0. (940)

Namely, π is the (shifted) geometric distribution with parameter ρ, which is well-defined if and only if ρ < 1, that is, µ > λ. In words, the rate of the service times should be larger than that of the arrival process in order for the queue not to blow up.
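As a sanity check, one can simulate the M/M/1 queue directly from its Poisson clocks and compare the empirical occupation fractions with (940). A minimal sketch with arbitrary rates λ < µ:

```python
import random

# Simulate the M/M/1 queue: at state n the total jump rate is lam + mu*(n>0),
# and the next jump is +1 with probability lam / (total rate).
lam, mu, T = 1.0, 2.0, 10**5
t, n, occ = 0.0, 0, {}
while t < T:
    rate = lam + (mu if n > 0 else 0.0)
    hold = random.expovariate(rate)      # Exp(rate) holding time at state n
    occ[n] = occ.get(n, 0.0) + hold
    t += hold
    n += 1 if random.random() < lam / rate else -1
rho = lam / mu
for k in range(4):
    print(k, occ.get(k, 0.0) / t, (1 - rho) * rho**k)  # empirical vs (940)
```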
Exercise 8.2.12 (RW perspective of M/M/1 queue). Consider an M/M/1 queue with service rate µ and arrival rate λ. Denote p = λ/(µ + λ). Let (X_k)_{k≥0} be the Markov chain where X_k is the number of customers in the queue at the kth time that either a departure or an arrival occurs. (c.f. [Dur10, Example 6.2.4])
(i) Let (ξk )k≥1 be a sequence of i.i.d. RVs such that
P(ξk = 1) = p, P(ξk = −1) = 1 − p (941)
8.2.2. Completeness of the CTMC construction. We start with some general discussion on what kind of information we need to specify a continuous-time process, and make the following assumption on the type of processes we consider.
Assumption 8.2.13. Let (X_t)_{t≥0} denote a continuous-time process on a state space Ω. We may assume that there is a sequence of jump times 0 < T_1 < T_2 < · · · such that X_t ∈ Ω is constant on each of the intervals (0, T_1], (T_1, T_2], . . . .²
Since the process X t can change its state only at jump times, we can record its states at jump times
as a discrete-time stochastic process on Ω, as in the following definition.
Definition 8.2.14. Let (X_t)_{t≥0} denote a continuous-time process on a state space Ω with jump times (T_n)_{n≥1}. Define the discrete-time process (Z_n)_{n≥1}, which is called the jump chain of (X_t)_{t≥0}, by Z_n = X_{T_n}.
Conversely, a sequence of jump times and a jump chain determines a continuous-time process.
Definition 8.2.15. Suppose we have a jump chain (Z_n)_{n≥1} and jump times (T_n)_{n≥1}. Define the associated continuous-time process (X_t)_{t≥0} by X_t = Z_{N(t)}, where N(t) = Σ_{k=1}^∞ 1(T_k ≤ t) is the associated counting process.
Proposition 8.2.16. Suppose we have a jump chain (Z_n)_{n≥1} and jump times (T_n)_{n≥1}. Let (X_t)_{t≥0} denote the associated continuous-time process as in Definition 8.2.15. Then (X_t)_{t≥0} has jump times (T_n)_{n≥1} and jump chain (Z_n)_{n≥1} in the sense of Definition 8.2.14.
P ROOF. Follows from definition. □
Proposition 8.2.17. Suppose (X t )t ≥0 is a continuous-time process on Ω with jump times (Tn )n≥1 . Further
assume that (X t )t ≥0 satisfies the continuous-time Markov property.
(i) The associated jump chain (Zn )n≥1 is a discrete-time Markov chain on Ω.
²Not all CTMCs satisfy this assumption; for instance, Brownian motion changes its state at every single time instant almost surely. However, such 'smooth' continuous-time Markov processes can be obtained by a suitable limiting procedure from the 'discrete' CTMCs that we consider here.
(ii) Conditional on Zn , the holding time τn := Tn+1 − Tn at state Zn ∈ Ω is independent from the past and
has an exponential distribution.
PROOF. By the continuous-time Markov property, the future states should depend only on the present state, so the jump chain (Z_k)_{k≥1} should be a discrete-time Markov chain. Also, a continuous-time version of the strong Markov property holds (see the proof of Proposition 3.4.4), and the jump times are stopping times, since

{T_n ≤ s} = {(X_u)_{0≤u≤s} makes at least n jumps during [0, s]}, (944)

which is an event determined by the trajectory of the process up to time s. Next, the amount of time τ_n = T_{n+1} − T_n spent at state x := Z_n = X_{T_n} should be memoryless due to the continuous-time Markov property, so it has to be an exponential RV (see Exercise 6.1.6) with some parameter λ(x). Finally, note that

P(τ_n ≥ s | Z_n, Z_{n−1}, . . . , Z_1) = P(X_t ≡ Z_n for t ∈ [T_n, T_n + s] | X_{T_n}, X_{T_{n−1}}, . . . , X_{T_1}) (945)
= P(X_t ≡ Z_n for t ∈ [T_n, T_n + s] | X_{T_n}) (946)
= P(τ_n ≥ s | Z_n), (947)

where we have used a continuous-time version of the strong Markov property and the fact that the jump times are stopping times. □
PROOF. We prove the assertion only for the bounded jump rate case: q(x, y) ≤ α for all x ≠ y ∈ Ω. In this case, we can construct the CTMC using Definition 8.2.6 (MC run by PP), so we can use the formula (930). Suppose x ≠ y, so that P(x, y) = q(x, y)/α. Note that P^0(x, y) = I(x, y) = 1(x = y) = 0. Hence

P_t(x, y) = Σ_{k=0}^∞ ((αt)^k e^{−αt} / k!) P^k(x, y) (948)
= (q(x, y)/α) (αt) e^{−αt} + Σ_{k=2}^∞ ((αt)^k e^{−αt} / k!) P^k(x, y). (949)

Hence

P_t(x, y)/t = e^{−αt} q(x, y) + α²t Σ_{k=2}^∞ ((αt)^{k−2} e^{−αt} / k!) P^k(x, y). (950)

Note that 0 ≤ Σ_{k=2}^∞ ((αt)^{k−2} e^{−αt} / k!) P^k(x, y) ≤ Σ_{k=2}^∞ (αt)^{k−2} = 1/(1 − αt) for t ∈ (0, 1/α). Thus the last term above vanishes as t → 0+. This shows (i).
We show (ii) similarly. Since P^0(x, x) = I(x, x) = 1(x = x) = 1, we have

P_t(x, x) = e^{−αt} + αt Σ_{k=1}^∞ ((αt)^{k−1} e^{−αt} / k!) P^k(x, x) (951)
= e^{−αt} + αt e^{−αt} P(x, x) + α²t² Σ_{k=2}^∞ ((αt)^{k−2} e^{−αt} / k!) P^k(x, x). (952)

Divide both sides by t and take the limit t → 0+. Then the last term in the last expression vanishes as before, and recall that P(x, x) = 1 − (1/α) Σ_{y≠x} q(x, y) = 1 − λ(x)/α. So we get

lim_{t→0+} (P_t(x, x) − 1)/t = lim_{t→0+} [ (e^{−αt} − 1)/t + α e^{−αt} (1 − λ(x)/α) ] = −α + α − λ(x) = −λ(x). (953)

This shows the second part of the assertion. □
Definition 8.3.2 (Generator matrix). Consider a CTMC with jump rates {q(x, y) | x ≠ y ∈ Ω}. Define the generator matrix Q : Ω² → R by

Q(x, y) = q(x, y) if y ≠ x, and Q(x, x) = −λ(x) = −Σ_{y≠x} q(x, y). (954)
Example 8.3.3. Consider a CTMC X_t = Y_{N(t)}, which is a discrete-time Markov chain (Y_n)_{n≥1} with transition matrix P run by a Poisson process (N(t))_{t≥0} ∼ PP(α). Then its generator matrix Q is given by

Q(x, y) = q(x, y) = αP(x, y) if y ≠ x, and Q(x, x) = −λ(x) = α(P(x, x) − 1), (955)

where the second equality follows from (929). In matrix form, we have

Q = α(P − I). (956)

▲
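A minimal numerical illustration of (956); the 2-state transition matrix P is an arbitrary example:

```python
import numpy as np

# Generator of a discrete chain P run by PP(alpha): Q = alpha * (P - I).
alpha = 1.0
P = np.array([[0.9, 0.1],
              [0.4, 0.6]])
Q = alpha * (P - np.eye(2))
print(Q)
print(Q.sum(axis=1))  # rows sum to 0: the diagonal -lambda(x) balances the rates
```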
Theorem 8.3.4 (Transition probability function and the generator matrix). Consider a CTMC with bounded jump rates q(x, y) ≤ α for all x ≠ y ∈ Ω. Let (t, x, y) ↦ P_t(x, y) be its transition probability function and let Q denote its generator matrix. Then the following hold:
(i) (Forward Kolmogorov eq.) Ṗ_t = P_t Q for all t ≥ 0 with initial condition P_0 = I;
(ii) (Backward Kolmogorov eq.) Ṗ_t = Q P_t for all t ≥ 0 with initial condition P_0 = I;
where Ṗ_t := (d/dt) P_t denotes the entry-wise time derivative of P_t.
PROOF. We only give a proof that is rigorous for finite state space Ω. Recall the Chapman-Kolmogorov equation in Proposition 8.1.1 (iii). When Ω is finite, P_t is a matrix-valued function in t, so we can view it as the matrix equation

P_t P_s = P_{t+s}. (957)

This is a continuous-time analogue of the discrete-time semigroup relation

P^{t+1} = P P^t = P^t P. (960)
8.3.2. Kolmogorov's equation for finite state space. Consider the following differential equation:

ẋ(t) = αx(t), x(0) = 1. (961)

In order to solve it, we multiply by the integrating factor e^{−αt} > 0 and use the product rule for derivatives:

ẋ(t) = αx(t) ⇔ ẋ(t) − αx(t) = 0 ⇔ e^{−αt} ẋ(t) − αe^{−αt} x(t) = 0 (962)
⇔ (d/dt)(e^{−αt} x(t)) = 0 (963)
⇔ e^{−αt} x(t) ≡ C, (964)

where C is a constant. By using the initial condition x(0) = 1, we find C = 1. Hence x(t) = e^{αt}. Conversely, this solution satisfies the original initial value problem (961). Hence it follows that

ẋ(t) = αx(t), x(0) = 1 ⇐⇒ x(t) = e^{αt}. (965)

We notice that Theorem 8.3.4 gives a matrix version of the analogous differential equation (961). So, is it true that the transition probability function P_t is also given by the 'exponential function' e^{tQ}, where Q is the generator matrix? It turns out to be true, as we will see shortly, but for that we first need to define exponential functions of matrices and discuss some basic properties. The idea is to adapt the Taylor expansion of the exponential function: e^x = Σ_{k=0}^∞ x^k / k!.
Remark 8.3.7. The matrix exponential e^A converges absolutely entrywise. To see this, we use a matrix norm ‖·‖ with the property ‖AB‖ ≤ ‖A‖ ‖B‖. Then we have

‖e^A‖ ≤ Σ_{k=0}^∞ ‖A‖^k / k! = e^{‖A‖} < ∞. (967)

Examples of such matrix norms include the Frobenius norm ‖A‖_F = (Σ_{i,j} A_{ij}²)^{1/2} and the matrix 2-norm ‖A‖_2 = sup_{‖x‖_2 = 1} ‖Ax‖_2, where ‖·‖_2 denotes the usual Euclidean norm for vectors. The 2-norm measures the maximum amount that A stretches a unit vector, and we have ‖A‖_2 = σ_1(A), the largest singular value of A.
Theorem 8.3.8. Let Q be the generator matrix of a continuous-time Markov chain with finite state space Ω. Then both the Kolmogorov forward and backward equations in Theorem 8.3.4 have the unique solution

P_t = e^{tQ} = Σ_{k=0}^∞ (t^k / k!) Q^k. (968)
PROOF. Convergent power series can be differentiated term by term, and this applies also to matrix-valued series. So we can verify that e^{tQ} indeed solves both Kolmogorov equations:

(d/dt) Σ_{k=0}^∞ (t^k / k!) Q^k = Σ_{k=1}^∞ (t^{k−1} / (k−1)!) Q^k = Q Σ_{k=1}^∞ (t^{k−1} / (k−1)!) Q^{k−1} = ( Σ_{k=1}^∞ (t^{k−1} / (k−1)!) Q^{k−1} ) Q, (969)

that is, Ṗ_t = Q e^{tQ} = e^{tQ} Q. For the initial condition, note that e^{0Q} = I by definition. Hence P_t = e^{tQ} satisfies both Kolmogorov equations.
For the converse direction, we can mimic the integrating factor argument as follows.
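Theorem 8.3.8 can also be checked numerically against the Poissonization formula (920): for the chain-run-by-PP example, Q = α(P − I) and e^{tQ} should match the Poisson average of the powers P^k. A minimal sketch with an arbitrary 2-state P:

```python
import math
import numpy as np
from scipy.linalg import expm

alpha, t = 1.0, 2.0
P = np.array([[0.9, 0.1],
              [0.4, 0.6]])
Q = alpha * (P - np.eye(2))
Pt_expm = expm(t * Q)  # e^{tQ}, as in (968)
Pt_pois = sum(math.exp(-alpha * t) * (alpha * t)**k / math.factorial(k)
              * np.linalg.matrix_power(P, k) for k in range(100))
print(np.max(np.abs(Pt_expm - Pt_pois)))  # ~ 0 up to truncation error
```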
Definition 8.4.1. A probability measure π on the state space Ω is an invariant distribution (also called a stationary distribution) for the continuous-time Markov chain with transition probability function P_t if

πP_t = π for all t ≥ 0. (977)
Theorem 8.4.3. The probability distribution π is invariant for a (non-explosive) continuous-time Markov chain if and only if

πQ = 0. (979)

By using the definition of Q in (954), this is equivalent to the following system of equations:

Σ_{x: x≠y} π(x) q(x, y) = π(y) λ(y) = π(y) Σ_{x: x≠y} q(y, x) ∀y ∈ Ω. (980)
8.4.2. Limit theorems for CTMC. In this subsection, we study some basic limit theorems for CTMC
and how to use stationary distributions for describing various asymptotic behavior of CTMC.
Definition 8.4.6. A CTMC (X t )t ≥0 on Ω with transition probability function Pt is irreducible if for all
x, y ∈ Ω, Pt (x, y) > 0 for some t ≥ 0.
Proposition 8.4.7. A CTMC (X t )t ≥0 on Ω is irreducible if and only if its jump chain is irreducible.
PROOF. Exercise. □
Theorem 8.4.8 (SLLN for CTMC). An irreducible continuous-time Markov chain with an invariant dis-
tribution satisfies the following limit theorems.
(i) (Limit distribution). For any initial distribution µ and any state y,
lim Pµ (X t = y) = π(y). (994)
t →∞
(ii) (SLLN) Let f : Ω → R be a function such that E_{X∼π}[|f(X)|] = Σ_x π(x) |f(x)| < ∞. Then for any initial distribution µ, the limit below holds with probability one:

lim_{t→∞} (1/t) ∫_0^t f(X_s) ds = Σ_x π(x) f(x) = E_{X∼π}[f(X)]. (995)
PROOF. Omitted. □
Think about why ‘aperiodicity’ is not needed for CTMC in order for the convergence to stationary
distribution in Theorem 8.4.8 (i) to hold.
Theorem 8.4.9. Suppose π is an invariant distribution for an irreducible continuous-time Markov chain.
Then for all x ∈ Ω,
π(x) = 1 / (λ(x) E[T_x | X_0 = x]), (997)
where T x = inf{t > 0 | X t = x} denotes the first return time to x and λ(x) denotes the exponential holding
parameter.
PROOF. Let T_x^{(k)} denote the kth return time to x, assuming X_0 = x. These are stopping times with respect to the CTMC (X_t)_{t≥0}. By the strong Markov property, the cycle durations T_x^{(k+1)} − T_x^{(k)} are i.i.d. and hence define a renewal process. By Theorem 8.4.8 (ii) and the renewal reward SLLN (Theorem 2.2.1),

π(x) = lim_{T→∞} (1/T) ∫_0^T 1(X_s = x) ds = E[time spent at x per cycle] / E[duration of a cycle] = (1/λ(x)) / E[T_x | X_0 = x]. (998)

This shows the assertion. □
FIGURE 3. Jump rate diagram of a birth-death chain on {0, 1, 2, 3} with upward rates 1, 1, 1 and downward rates 1, 2, 3.
By Theorem 8.4.8 (i) there is a unique stationary distribution π. By Theorem 8.4.3, π is given by
πQ = 0. Solving this equation for π = [π0 , π1 , π2 , π3 ], we get
−π0 + π1 = 0 (1000)
π0 − 2π1 + 2π2 = 0 (1001)
π1 − 3π2 + 3π3 = 0 (1002)
π2 − 3π3 = 0. (1003)
From the first and last equations, we get π0 = π1 and π2 = 3π3 . Plugging these into the middle two
equations, we get
−π1 + 2π2 = 0 (1004)
π1 − 2π2 = 0. (1005)
Thus π_1 = 2π_2. So π_2 = π_0/2 and π_3 = π_2/3 = π_0/6. Since

1 = π_0 + π_1 + π_2 + π_3 = π_0 (1 + 1 + (1/2) + (1/6)) = π_0 (6 + 6 + 3 + 1)/6 = (8/3) π_0, (1006)

we get π_0 = 3/8. Hence we conclude

π = [3/8, 3/8, 3/16, 1/16]. (1007)
Now by Theorem 8.4.8 (ii), the asymptotic fraction of time that X_t spends at each state x is π(x). For example,

(Asymptotic fraction of time spent away from 0) = π(1) + π(2) + π(3) = 5/8. (1008)
Moreover, by Theorem 8.4.9, we can also find the expected return time to each state x:
π(x) = 1 / (λ(x) E[T_x | X_0 = x]). (1009)
For example, E[T0 | X 0 = 0] = 1/(π(0)λ(0)) = (8/3). So starting from 0, the expected time to return to 0 is
8/3. ▲
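The linear system πQ = 0 together with the normalization Σ_x π(x) = 1 can also be solved numerically. A minimal sketch for the birth-death chain above:

```python
import numpy as np

# Generator of the birth-death chain on {0, 1, 2, 3}
# with up rates 1, 1, 1 and down rates 1, 2, 3.
Q = np.array([[-1.0,  1.0,  0.0,  0.0],
              [ 1.0, -2.0,  1.0,  0.0],
              [ 0.0,  2.0, -3.0,  1.0],
              [ 0.0,  0.0,  3.0, -3.0]])
A = np.vstack([Q.T, np.ones(4)])          # pi Q = 0 plus sum(pi) = 1
b = np.array([0.0, 0.0, 0.0, 0.0, 1.0])
pi, *_ = np.linalg.lstsq(A, b, rcond=None)
print(pi)                 # [3/8, 3/8, 3/16, 1/16]
print(1 / (pi[0] * 1.0))  # E[T_0 | X_0 = 0] = 1/(pi(0)*lambda(0)) = 8/3
```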
Bibliography
[BT02] Dimitri P. Bertsekas and John N. Tsitsiklis, Introduction to probability, vol. 1, Athena Scientific, Belmont, MA, 2002.
[Dur99] Rick Durrett, Essentials of stochastic processes, vol. 1, Springer, 1999.
[Dur10] Rick Durrett, Probability: theory and examples, Cambridge University Press, 2010.
[LP17] David A. Levin and Yuval Peres, Markov chains and mixing times, vol. 107, American Mathematical Society, 2017.
[LS19] Hanbaek Lyu and David Sivakoff, Persistence of sums of correlated increments and clustering in cellular automata, Stochastic Processes and their Applications 129 (2019), no. 4, 1132-1152.
[SV] Timo Seppäläinen and Benedek Valkó, Introduction to stochastic processes.