
Lecture notes

Stochastic processes and simulation in the natural sciences


30561

Giacomo Zanella

March 21, 2025


Contents

1 Discrete-time Markov chains 5

1.1 Introduction to Stochastic Processes . . . . . . . . . . . . . . . . . . . . . . . . . . 5


1.2 Markov Chains . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.2.1 Chapman-Kolmogorov equations . . . . . . . . . . . . . . . . . . . . . . . . 9
1.3 Classification of states and recurrence . . . . . . . . . . . . . . . . . . . . . . . . . 11
1.3.1 Communication classes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
1.3.2 Transience and recurrence . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
1.3.3 Random walks in d dimensions . . . . . . . . . . . . . . . . . . . . . . . . . . 16
1.4 Convergence to stationarity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
1.4.1 Stationary distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
1.4.2 Convergence in distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
1.4.3 Convergence of asymptotic frequencies . . . . . . . . . . . . . . . . . . . . . 22
1.4.4 Ergodic averages and law of large numbers for MCs . . . . . . . . . . . . . 23
1.5 Counter-examples where the convergence theorem fails . . . . . . . . . . . . . . . . 24
1.5.1 Infinite state spaces and non-existence of stationary distributions . . . . . . 24
1.5.2 Lack of irreducibility and hitting times . . . . . . . . . . . . . . . . . . . . . 25
1.5.3 Time-inhomogeneity and self-reinforcing processes . . . . . . . . . . . . . . 27
1.6 Reversibility and detailed balance . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28

2 Illustrative examples and applications of discrete-time Markov chains 31

2.1 Random walks on graphs and PageRank . . . . . . . . . . . . . . . . . . . . . . . . 31


2.2 Sequential testing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
2.3 Hidden Markov models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35

3 Continuous-time Markov chains 41

3.1 Background and tools . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41


3.1.1 Exponential distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
3.1.2 Hazard rates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
3.1.3 First events properties . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
3.2 Markov property and transition probabilities . . . . . . . . . . . . . . . . . . . . . 45
3.2.1 Generator/Jump rates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
3.3 Examples of CTMCs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
3.4 Kolmogorov differential equations . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
3.4.1 Kolmogorov backward and forward equations . . . . . . . . . . . . . . . . . 48
3.5 Jump chain and holding time construction . . . . . . . . . . . . . . . . . . . . . . . 52
3.5.1 Connection between (ν, K) and Q . . . . . . . . . . . . . . . . . . . . . . . 54
3.6 Limiting behavior of CTMCs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
3.7 Birth and death processes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
3.8 The Poisson Process . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
3.8.1 One-dimensional, homogeneous case . . . . . . . . . . . . . . . . . . . . . . 61
3.8.2 One-dimensional, non-homogeneous case . . . . . . . . . . . . . . . . . . . . 64


3.8.3 General case . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65

4 Processes with continuous time and space 67

4.1 Brownian motion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67


4.1.1 BM as a limit of a random walk . . . . . . . . . . . . . . . . . . . . . . . . 67
4.1.2 Definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
4.1.3 Properties of BM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
4.2 Gaussian Processes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
4.3 Diffusion Processes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
4.3.1 Infinitesimal definition of BM . . . . . . . . . . . . . . . . . . . . . . . . . . 78
4.3.2 Infinitesimal definition of diffusions . . . . . . . . . . . . . . . . . . . . . . . 79
4.3.3 Transition kernels and Chapman-Kolmogorov (CK) equations . . . . . . . 81
4.3.4 Generators . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
4.3.5 Fokker-Planck equations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
4.3.6 Density evolution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
4.3.7 Stationary Distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85

A Appendix 89

A.1 Viterbi algorithm for finite HMMs . . . . . . . . . . . . . . . . . . . . . . . . . . . 89


A.2 Kalman Filter for Gaussian HMMs . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
A.3 Smoothness of GPs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92

Bibliography 93
Chapter 1

Discrete-time Markov chains

1.1 Introduction to Stochastic Processes


Stochastic processes are models for random quantities that evolve over time. Formally, a stochastic
process is a collection of random variables.
Definition 1 (Stochastic process). A stochastic process is a collection of random variables
{Xt , t ∈ T }, or in short (Xt )t∈T ,
with Xt : Ω → X for any t ∈ T . The sets X and T are called the state space and the index set of the
stochastic process.1
The r.v. Xt represents the state of the process at time t. Depending on T and X , we get different
kinds of stochastic processes. For example, if T = [0, +∞) we say that the stochastic process has
continuous time, while if T = {0, 1, 2, ...} we say that the stochastic process has discrete time.
In the latter case, (Xt )t∈T is a sequence of random variables (X0 , X1 , X2 , . . . ). Similarly, if X
is uncountable (e.g., X = R) we say that (Xt )t∈T has continuous space, while if X is finite or
countable (e.g., X = {0, 1, 2}) then we say that (Xt )t∈T has discrete space.
If X = R and T = [0, +∞), a realization of (Xt )t∈T could look like Figure 1.1. We call such a
realization a trajectory. It is just one possible realization of (Xt )t∈T , in the same way as a number
x ∈ R is a realization of a (real-valued) random variable X.

Figure 1.1: Example of a trajectory, or path, of a stochastic process with continuous time and
space.

Example 1.1.1: Gambler’s ruin

A gambler plays a game where at each turn he wins 1€ w.p. p and loses 1€ w.p. 1 − p. The
gambler starts the game with x€ and quits the game when he reaches either 0€ or N€ for

1 We will usually assume that all random variables Xt have the same set of possible values, i.e. the same state
space X . One could also consider more general settings where X depends on t.

5
CHAPTER 1. DISCRETE-TIME MARKOV CHAINS 6

some pre-specified N ∈ N.

Denoting by Xt the gambler’s amount of money at time t, (Xt )t∈T is a stochastic process
with T = {0, 1, 2...} and X = {0, 1, ..., N }.

1.2 Markov Chains


We will start by considering stochastic processes (Xt )t≥0 with discrete time and discrete space. In
principle, in order to characterize the behaviour (or law ) of (Xt )t≥0 we need to specify the joint
distribution of (X0 , . . . , Xt ) for all t ≥ 1, or equivalently to specify the joint probabilities2
    Pr(X0 = x0 , X1 = x1 , ..., Xt = xt ) ,
for all t ≥ 1 and (x0 , . . . , xt ) ∈ X^{t+1}. Such joint probabilities can be hard to describe, and it is thus
common to decompose a joint distribution in terms of marginals and conditionals as follows.
Lemma 1 (Sequential decomposition). Let (X0 , . . . , Xn ) be a random vector with values in X^{n+1}.
For every (x0 , . . . , xn ) ∈ X^{n+1} we have
    Pr(X0 = x0 , ..., Xn = xn ) = Pr(X0 = x0 ) Π_{t=0}^{n−1} Pr(Xt+1 = xt+1 | Xt = xt , ..., X0 = x0 ) .

Proof. Follows by the definition of conditional probability


    Pr(Xt+1 = xt+1 | Xt = xt , . . . , X0 = x0 ) = Pr(Xt+1 = xt+1 , Xt = xt , . . . , X0 = x0 ) / Pr(Xt = xt , . . . , X0 = x0 )
and by cancelling terms in the resulting telescopic product.

Remark
One can think of Lemma 1 as a recipe to generate a sample from the distribution of (X0 , . . . , Xt ),
namely first sample X0 from its marginal distribution, then X1 from the conditional given the
value of X0 , then X2 from the conditional given the value of X1 and X0 , etc.

At this level of generality, the distribution of Xt+1 can depend on the values of X0 , . . . , Xt .
This is a potentially very complicated structure, both to model effectively as well as to analyze
the behavior of the resulting stochastic process mathematically. Markov chains are a much more
tractable and yet very powerful sub-class of stochastic processes, where this structure is simplified
as follows.
Definition 2. A discrete time stochastic process (Xt )t≥0 is a Markov chain if
    Pr(Xt+1 = xt+1 | Xt = xt , ..., X0 = x0 ) = Pr(Xt+1 = xt+1 | Xt = xt )        (1.1)
for every t ≥ 1 and every (x0 , . . . , xt+1 ) ∈ X^{t+2}.
Condition (1.1) is referred to as the Markov property. The intuition behind the Markov property is the
following: the future depends on the past only through the present. In other words, only the
current state of the system matters, and not how you got there.
2 Here Pr denotes a probability and the commas inside Pr denote intersections, e.g. Pr(X = x, Y = y) = Pr({X =
x} ∩ {Y = y}).

Example 1.2.1: Gambler’s ruin

The “Gambler’s ruin” process (Xt )t∈T is a Markov chain. In fact it holds

Pr(Xt+1 = j | Xt = i, Xt−1 = xt−1 , . . . , X0 = x0 ) = Pr(Xt+1 = j | Xt = i)
    = p        if j = i + 1,
      1 − p    if j = i − 1,
      0        otherwise.

The Markov property implies that the evolution of the Markov chain is fully determined by the
transition probabilities defined as

Pij := Pr(Xt+1 = j | Xt = i) i, j ∈ X .

Note that here we are implicitly assuming time-homogeneity (i.e. that Pij does not depend on t).
In these notes we will always assume time-homogeneity, unless stated otherwise.
When |X | = n, e.g. X = {1, ..., n}, we can arrange the transition probabilities (Pij )i,j∈X in an
n × n matrix
        ( P11  P12  ...  P1n )
    P = ( P21  P22  ...  P2n )        (1.2)
        ( ...  ...  ...  ... )
        ( Pn1  Pn2  ...  Pnn )
which is called the transition matrix, and it contains all the information governing the evolution
of the Markov chain. By construction, the matrix P satisfies
    Pij ≥ 0              ∀ i, j ∈ X
    Σ_{j∈X} Pij = 1      ∀ i ∈ X

since the i-th row (Pij )j∈X is the probability mass function (pmf) of Xt+1 | Xt = i. Indeed, any
matrix P satisfying the above condition is a valid transition matrix (i.e. it can be interpreted as
the transition matrix of a Markov chain).

Example 1.2.2: Weather chain

Let Xt be the weather condition on day t in Milan. Here Xt = 0 means rainy and Xt = 1
means not rainy. Suppose that if it rains today it will rain tomorrow with probability α,
regardless of the previous days, while if it does not rain today it will rain tomorrow with
probability β.

Then (Xt )t≥0 is a Markov chain with X = {0, 1} and


 
    P = ( α   1 − α )
        ( β   1 − β ) ,
CHAPTER 1. DISCRETE-TIME MARKOV CHAINS 8

because
    α = Pr(Xt+1 = 0 | Xt = 0) ,
    β = Pr(Xt+1 = 0 | Xt = 1) .

Example 1.2.3: Social mobility

Let Xt be a family's social class at generation t. Here Xt = 1 means lower class, Xt = 2
medium class and Xt = 3 upper class.

We may assume that Xt follows a Markov chain with changes of status happening according
to the following transition probability:
 
    P = ( 0.7  0.2  0.1 )
        ( 0.3  0.5  0.2 )        (1.3)
        ( 0.2  0.4  0.4 )
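The evolution described by a transition matrix is easy to simulate. The following is a minimal R sketch (separate from the Blackboard scripts; the function name simulate_mc, the seed and the trajectory length are arbitrary choices) drawing one trajectory from the social mobility chain (1.3).

# Minimal sketch: simulate a trajectory of a discrete-time Markov chain
# from a given transition matrix P, starting from state x0.
simulate_mc <- function(P, x0, n_steps) {
  states <- seq_len(nrow(P))
  x <- numeric(n_steps + 1)
  x[1] <- x0
  for (t in 1:n_steps) {
    # draw the next state from the row of P indexed by the current state
    x[t + 1] <- sample(states, size = 1, prob = P[x[t], ])
  }
  x
}

# Social mobility chain of Example 1.2.3 (matrix (1.3))
P <- matrix(c(0.7, 0.2, 0.1,
              0.3, 0.5, 0.2,
              0.2, 0.4, 0.4), nrow = 3, byrow = TRUE)
set.seed(1)                                  # arbitrary seed
traj <- simulate_mc(P, x0 = 1, n_steps = 50)
table(traj) / length(traj)                   # empirical fraction of time in each class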

Example 1.2.4: Gene spread

A gene has two variants (alleles), A and B, one of which is harmful. All individuals in a
population have either allele A or allele B. The following is a simplified version of the Wright-Fisher
model, which provides a useful description of how the gene spreads in the population:
• at each generation there are N individuals in the population, with N fixed;
• the N individuals produce a new generation of N offspring and then die out;
• each offspring has allele A w.p. i/N , where i is the number of individuals with allele A
in the previous generation, and allele B w.p. (N − i)/N , independently of the other offspring.

Denoting by Xt the number of individuals with allele A in the t-th generation, we have that
(Xt )t∈T is a Markov chain. In particular, given Xt = i we have that
    Xt+1 | Xt = i ∼ Binom(N, i/N) .
This is because Xt+1 = Σ_{n=1}^{N} Zt+1^(n) , where
    Zt+1^(n) = 1 if the n-th individual in generation t + 1 has allele A, and 0 otherwise,
and Zt+1^(n) | Xt = i ∼ Bern(i/N). Equivalently, in terms of transition probabilities of (Xt )t∈T ,
we have
    Pij = Pr(Xt+1 = j | Xt = i) = (N choose j) (i/N)^j ((N − i)/N)^{N−j} .

Note that P00 = 1 and PNN = 1 here. In such cases, we say that 0 and N are absorbing
states, since the chain, once in one of those states, never leaves it.
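Since the transition law Xt+1 | Xt = i ∼ Binom(N, i/N) is explicit, the chain is straightforward to simulate. The sketch below is a minimal illustration in R (the values N = 20, X0 = 10 and the number of generations are arbitrary choices, not taken from the notes).

# Minimal sketch: simulate the simplified Wright-Fisher chain of Example 1.2.4,
# using X_{t+1} | X_t = i ~ Binom(N, i/N); the chain is absorbed at 0 or N.
simulate_wf <- function(N, x0, n_gen) {
  x <- numeric(n_gen + 1)
  x[1] <- x0
  for (t in 1:n_gen) {
    x[t + 1] <- rbinom(1, size = N, prob = x[t] / N)
  }
  x
}

set.seed(2)                                        # arbitrary seed
path <- simulate_wf(N = 20, x0 = 10, n_gen = 100)  # illustrative parameter values
path[1:25]                                         # the allele-A count eventually hits 0 or N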

In all the previous examples, we may be interested in different aspects of the problem, e.g.:

• (Weather chain) How many rainy days do we get each year on average?

• (Social mobility) Do the fractions of families in each social class stabilize as t increases? If
yes, to which values?

• (Gambler’s ruin) If the gambler starts with x€, what is the probability that he will end up
with N€ rather than 0€?

• (Gene spread) What is the probability that the harmful gene A will eventually die out or
take over? How many generations will be needed for this to happen?

Suggested exercises: 1-6.

1.2.1 Chapman-Kolmogorov equations


In the previous section we introduced stochastic processes, and in particular Markov chains. We
saw that the evolution of a Markov chain is governed by its transition probabilities

Pij = Pr(Xt+1 = j | Xt = i) i, j ∈ X ,

which build up the transition matrix P . Let us now see more explicitly how P governs the evolution
of the Markov chain (Xt )t∈T . In particular, we will be interested in knowing where Xt might be
after n steps, for n ≥ 0. More precisely, we define the n-step transition probabilities as

    Pij^n = Pr(Xt+n = j | Xt = i)        i, j ∈ X , n ≥ 0.        (1.4)

It is common for n-step transition probabilities to be quantities of interest.

Example 1.2.5: Social mobility

Recall the social mobility example where X = {1, 2, 3} with 1 = lower class, 2 = medium
class, 3 = upper class. What is the probability that your grandchild will be in the upper class
knowing that you are in the medium class?

This corresponds to computing P23² = Pr(Xt+2 = 3 | Xt = 2).

How can we compute Pij^n ?

Consider the case n = 2: what is Pr(X2 = j | X0 = i)? We can compute such probabilities by
conditioning on X1 :
    Pr(X2 = j | X0 = i) = Σ_{k∈X} Pr(X2 = j | X1 = k, X0 = i) Pr(X1 = k | X0 = i),

and by the Markov property we get
    Pr(X2 = j | X0 = i) = Σ_{k∈X} Pr(X2 = j | X1 = k) Pr(X1 = k | X0 = i).
In terms of transition probabilities, we just showed that
    Pij² = Σ_{k∈X} Pik Pkj .

Example 1.2.6: Social mobility

In the social mobility example of before,
    Pr(grandchild in upper class | you in medium class) = P23² = Σ_{k=1}^{3} P2k Pk3
                                                        = P21 P13 + P22 P23 + P23 P33 .
The three addends are the three ways of going from 2 to 3 in 2 steps.

More generally we can obtain (m + n)-step transition probabilities from n-step and m-step ones
as follows.
Proposition 1 (Chapman-Kolmogorov equations). The n-step transition probabilities defined in
(1.4) satisfy
    Pij^{n+m} = Σ_{k∈X} Pik^n Pkj^m        i, j ∈ X ; n, m ≥ 0 .        (1.5)

Proof. Conditioning on Xn and applying the Markov property3 we obtain
    Pij^{n+m} = Pr(Xn+m = j | X0 = i) = Σ_{k∈X} Pr(Xn+m = j | Xn = k, X0 = i) Pr(Xn = k | X0 = i)
              = Σ_{k∈X} Pr(Xn+m = j | Xn = k) Pr(Xn = k | X0 = i)
              = Σ_{k∈X} Pik^n Pkj^m .

The equations (1.5) are known as the Chapman-Kolmogorov equations and, iterating them, they
provide a way of computing n-step transition probabilities for any n ≥ 1 starting from the 1-step ones,
(Pij )i,j∈X .
The Chapman-Kolmogorov equations can be conveniently expressed in terms of matrix operations.
For example, the matrix of 2-step transition probabilities (Pij²)i,j∈X coincides
with the n × n matrix obtained by multiplying the matrix P with itself, which is why we denote
it as P² = P · P . Note that Pij² ≠ (Pij )², but rather Pij² = (P²)ij . More generally, the n-step
transition matrix P^n defined by (1.4) coincides with the n-th power of P .
3 More precisely the implication of the Markov property shown in Exercise 4

Evolution of marginal probabilities


So far we only considered conditional probabilities, such as Pr(Xt = j | X0 = i). To talk about
the unconditional distribution of Xt , we need to specify a starting distribution for X0 , i.e. the values
αi^(0) = Pr(X0 = i) for all i ∈ X . We denote the (unconditional) distribution of Xt by α^(t) = (αi^(t))i∈X ,
meaning that
    αi^(t) := Pr(Xt = i)        i ∈ X , t ≥ 0.        (1.6)
Given the starting distribution α^(0) , we can compute the marginal (or unconditional) distribution
α^(t) of Xt at time t as follows:
    αj^(t) = Pr(Xt = j) = Σ_{i∈X} Pr(Xt = j | X0 = i) Pr(X0 = i) = Σ_{i∈X} Pij^t αi^(0) .

Again, we can think of the latter equalities in terms of vector-matrix multiplications.
Proposition 2. The marginal distributions defined in (1.6) satisfy
    α^(t) = α^(0) P^t        t ≥ 0        (1.7)
interpreting α^(t) and α^(0) as 1 × |X | vectors and P^t as a |X | × |X | matrix.


Code 1. The R codes on Blackboard 1 Simulate gamblers ruin.R, 2 ChapKolm for Gamblers ruin.R,
3 ChapKolm for SocialMobility.R provide some basic visual illustrations of the Chapman-Kolmogorov
equations.4
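For concreteness, here is a minimal R sketch in the same spirit as (but distinct from) the Blackboard scripts: it computes n-step transition probabilities as matrix powers and the marginal evolution α^(t) = α^(0) P^t for the social mobility chain. The helper name mat_pow and the chosen starting distribution are arbitrary.

# Minimal sketch: n-step transition probabilities as matrix powers, and the
# marginal evolution alpha^(t) = alpha^(0) P^t, for the social mobility chain.
P <- matrix(c(0.7, 0.2, 0.1,
              0.3, 0.5, 0.2,
              0.2, 0.4, 0.4), nrow = 3, byrow = TRUE)

mat_pow <- function(P, n) {           # P^n by repeated multiplication
  out <- diag(nrow(P))
  for (k in seq_len(n)) out <- out %*% P
  out
}

P2 <- mat_pow(P, 2)
P2[2, 3]                              # Pr(grandchild in upper class | you in medium class)

alpha0 <- c(1, 0, 0)                  # start in the lower class (arbitrary choice)
alpha0 %*% mat_pow(P, 10)             # marginal distribution after 10 generations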

1.3 Classification of states and recurrence


In this section we start to explore the “long-run” behavior of Markov chains, which we will continue
to do in the next section. A first notion we explore is the classification of states, i.e. of the elements
of X , based on the behavior of the Markov chain (Xt )t≥0 .

1.3.1 Communication classes


First we divide the state space into classes based on the notion of communication. In particular:
• A state j is accessible from i if

Pr(Xt = j for some t ≥ 0|X0 = i) > 0 . (1.8)

Intuition: j is accessible from i if, starting from i, the Markov chain can reach j.
Equivalent definitions: for i ≠ j accessibility can also equivalently be defined as:
(ii) there exists t > 0 such that Pij^t > 0;
(iii) there exists t > 0 and a path (i0 , . . . , it ) ∈ X^{t+1} with i0 = i and it = j such that
    Pi0 i1 Pi1 i2 . . . Pit−1 it > 0 .


4 NB: the codes for this first half of the course are ancillary and not required for the exam nor for the projects. You
will focus on simulation and numerical aspects in the second half of the course, where code will be essential. In this
first half, codes are just used in class to visualize the concepts discussed. Nonetheless, you are invited to look at
them and understand the concept they visualize.

Exercise 1. Prove the equivalence between (1.8) and (ii)-(iii) for i ≠ j (see Theorem 1.2.1
of Norris [1998] for a solution).

• Two states i and j communicate if i is accessible from j, and j is accessible from i. In such
case we write i ↔ j.
The following proposition is easy to verify using the above definitions.
Proposition 3. Communication between states is an equivalence relation.5

Being an equivalence relation, the notion of communication allows us to partition the state
space X into classes, or communication classes, where a class is a maximal collection of
states that communicate with each other.

Example 1.3.1

X = {0, 1, 2, 3} and transitions given by the following diagram:
[transition diagram omitted: arrows among states 0, 1, 2, 3 with probabilities 1/2, 1/4 and 1]
Here 0 ↔ 1, because P01 > 0 and P10 > 0, but no other couple of states communicates. Thus
the Markov chain has three classes: {0, 1}, {2}, and {3}.

Example 1.3.2

X = {0, 1, 2} and transitions given by

5 i.e. it is reflexive, symmetric and transitive



[transition diagram omitted: arrows among states 0, 1, 2 with probabilities 1/2 and 1/4]

Which couples of states communicate?
• 0 ↔ 1, because P01 > 0 and P10 > 0;
• 1 ↔ 2, because P12 > 0 and P21 > 0;
• 0 ↔ 2, because P02² > 0 and P20² > 0.
All states communicate, meaning that there is only one class, X itself.

If all states communicate with each other, we say that the Markov chain is irreducible. For
example, the Markov chain in Example 1.3.1 is not irreducible, while the one in Example 1.3.2 is.
We say that a class C is closed if the chain cannot escape from C, i.e. if i ∈ C implies that all
j’s accessible from i also belong to C. In Example 1.3.1 the closed classes are {0, 1} and {3}. In
Example 1.3.2 the class X is closed.

1.3.2 Transience and recurrence


The second notion we consider is the one of recurrence-vs-transience. Intuitively, a state i ∈ X is
recurrent if, starting from i, you will eventually return to i with probability 1. More precisely, let
fi := Pr(Xt = i for some t ≥ 1 | X0 = i)
= Pr(∪t≥1 {Xt = i} | X0 = i) .
be the probability of eventually returning to i, given X0 = i. We will sometimes informally write
fi as Pr(return to i | X0 = i) and 1 − fi as Pr(not return to i | X0 = i).
Definition 3. A state i ∈ X is recurrent if fi = 1, and transient if fi < 1.

Example 1.3.3: Gambler’s ruin

Which states are transient and which recurrent?


Since P00 = 1, we have Pr(Xt+1 = 0|Xt = 0) = 1 and thus

f0 = Pr(return to 0 | X0 = 0) = 1 .

Similarly, fN = 1 because PNN = 1. Thus 0 and N are recurrent states. Consider i ∈
{1, . . . , N − 1}. Then fi < 1 since
    1 − fi = Pr(never return to i | X0 = i)
           = Pr(Xt ≠ i for all t ≥ 1 | X0 = i)
           ≥ Pr(X1 = i − 1, X2 = i − 2, . . . , Xi = 0, Xi+1 = 0, · · · | X0 = i)
           = (1 − p)^i > 0 .

Thus all states i ∈ {1, . . . , N − 1} are transient. In this example it is easy to check that,
provided N ≥ 3, we also have fi > 0 for all i ∈ {1, . . . , N − 1}.

The notion of transience and recurrence is related to return times, which we now define. Assume
X0 = i and define
    Sk := time of the k-th return to i of (Xt )t≥0        k ≥ 0,        (1.9)
where we set Sk = ∞ if (Xt )t≥0 returns to i less than k times (i.e. there is no k-th return to i).
More formally, we define S0 = 0 and Sk = min{t ≥ Sk−1 + 1 : Xt = i} inductively for k = 1, 2, . . . ,
with the convention min ∅ = ∞. We then define the time between the (k − 1)-th and the k-th return (i.e. the
length of the k-th tour or k-th excursion) as

Tk := Sk − Sk−1 k ≥ 1, (1.10)

where we define Tk = ∞ for any k such that Sk = ∞.


By definition T1 , T2 , . . . are random variables with values in N ∪ {∞}. In particular, they are
deterministic functions of the sequence of random variables (Xt )t≥0 and thus they are random
variables themselves. What can we say about their distributions?
Lemma 2. Let k ≥ 1. Conditional on Sk−1 < ∞, the random variable Tk is independent of
(X0 , . . . , XSk−1 ) and it is equal in distribution to T1 .
Modulo the requirement of Sk−1 < ∞, Lemma 2 can be interpreted as saying that T1 , T2 , . . .
are independent and identically distributed (iid) random variables. We do not provide a formal
proof of Lemma 2 in the interest of brevity. An intuitive argument for Lemma 2 goes as follows.
Every time the chain reaches i it starts a tour or excursion until the next time it comes back to
i. By time-homogeneity and Markovianity, all those tours are iid. In other words, every time the
chain is back to i an iid copy of the same random experiment (i.e. starting the chain from i and
letting it evolve until it returns) takes place. The requirement of Sk−1 < ∞ comes from the fact
that if Tk , or equivalently Sk , is infinite then Tk+1 , Tk+2 , . . . are also set to infinity, meaning that
the Ti ’s are not iid in general. 6
Lemma 2 has a direct implication on the number of visits to i and the role of recurrence and
transience. Define the number of visits to i as

    Vi = Σ_{t=0}^{∞} I(Xt = i)        i ∈ X ,

so that Vi is a random variable with values in N ∪ {∞}.


6 Turning these arguments into a rigorous proof of Lemma 2 requires a bit of care. A proof can be found in
Lemma 1.5.1 and Theorem 1.4.2 of Norris [1998].

Proposition 4. For each i ∈ X , we have


    Pr(Vi > r | X0 = i) = fi^r        r ≥ 0.        (1.11)
Proof. Follows by induction from Lemma 2.
Proposition 4 means that Vi = ∞ almost surely if i is recurrent and Vi ∼ Geom(1 − fi ) if i is
transient.7 Note that by definition of Vi we have
{Vi = ∞} = {(Xt )t≥0 visits i infinitely often} .
Thus, recurrence and transience for a state i discriminate between two qualitatively different
behaviors: almost surely infinitely many visits to state i versus almost surely finitely many visits
to state i, respectively.
Computing fi explicitly, to understand if i is recurrent or transient, can be hard. Transience
or recurrence can often be verified more easily using the following criteria.
Proposition 5. A state i is recurrent if and only if Σ_{t=0}^{∞} Pii^t = ∞, and transient if and only if
Σ_{t=0}^{∞} Pii^t < ∞.

Proof. The result follows from noting that
    E[Vi | X0 = i] = Σ_{t=0}^{∞} E[I(Xt = i) | X0 = i] = Σ_{t=0}^{∞} Pii^t
and that E[Vi | X0 = i] = ∞ if and only if i is recurrent by (1.11).


Recurrence and transience are “class properties”, meaning that either all elements within a
class are transient, or they are all recurrent.
Proposition 6. If i ↔ j, then either both i and j are transient or both recurrent.
Proof. Assume i is recurrent and i ↔ j. By Proposition 5 we have Σ_{t=0}^{∞} Pii^t = ∞. Since i ↔ j,
there exist ℓ, r > 0 such that Pji^ℓ > 0 and Pij^r > 0. Thus
    Pjj^{ℓ+t+r} ≥ Pji^ℓ Pii^t Pij^r        t ≥ 0
and
    Σ_{n=0}^{∞} Pjj^n ≥ Σ_{t=0}^{∞} Pjj^{ℓ+t+r} ≥ Pji^ℓ Pij^r Σ_{t=0}^{∞} Pii^t = ∞ ,

meaning that also j is recurrent by Proposition 5.

Example 1.3.4: Gambler’s ruin

There are three classes: {0}, {1, . . . , N − 1}, {N}, because i ↔ j if and only if i, j ∈ {1, . . . , N − 1}.
The classes {0} and {N } are recurrent classes, while the class {1, . . . , N − 1} is transient. It
follows that, when started in some state i ∈ {1, . . . , N − 1} the chain will with probability 1

7 Recall that there are two different definitions of the Geometric distribution, see e.g.
https://en.wikipedia.org/wiki/Geometric_distribution. Here we use the one with support {1, 2, . . . } as opposed to {0, 1, 2, . . . }.

visit that state only a finite number of times before ending up in one of the recurrent classes.

Example 1.3.5: Social mobility

Recall: X = {1, 2, 3} and graph given by

[transition graph on states 1, 2, 3 omitted]

Here i ↔ j for all i, j ∈ X , so there is only one class, {1, 2, 3}. Thus either all states
are transient, or all recurrent. Since |X | < ∞, some state must be visited infinitely often.
Therefore, all states are recurrent and will be visited infinitely often with probability 1.

Remarks
• By Proposition 6, if a Markov chain is irreducible, either all states are transient or
recurrent. In such cases we say the Markov chain itself (rather than a single state) is
transient or recurrent.
• We just argued in Example 1.3.5 that if X is finite, some state must be recurrent. Thus
all states in a finite irreducible chain are recurrent.
• Proposition 4 implies that it is not possible to build a Markov chain such that
Pr(visit i infinitely often | X0 = i) ∈ (0, 1), since that probability can only be 1 or 0.
• Note that i being transient does not mean that there are no valid trajectories that
visit i infinitely often (e.g. in the Gambler's ruin one could imagine a valid trajectory
oscillating indefinitely between i and i + 1 with {i, i + 1} ⊆ {1, . . . , N − 1}), but rather
that the collection of all such trajectories has zero measure under the law on paths
induced by the stochastic process (i.e. the process will eventually 'diverge' from
such trajectories with probability 1).

1.3.3 Random walks in d dimensions


When |X | = +∞ it is not always obvious whether states are transient or recurrent. We now
consider the example of random walks in d dimensions.

Example 1.3.6: Random walk on Z

Consider a drunken man walking along a straight line. For simplicity, assume that all steps
are of size 1 and that the man goes right w.p. p and left w.p. 1 − p, independently of previous steps.
The location of the man Xt after t steps follows a Markov chain with state space X = Z =
{..., −2, −1, 0, 1, 2, ...} and transition probabilities
    Pij = p        if j = i + 1,
          1 − p    if j = i − 1,
          0        otherwise.

In this case i ↔ j for all i, j ∈ X , so there is only one class, i.e. X itself. Thus either all states
are transient or they are all recurrent. The answer turns out to depend on the value of p.

Theorem 1. The random walk on Z is recurrent if p = 1/2, and transient if p ≠ 1/2.

Proof. By Proposition 5, we can look at whether Σ_{n=0}^{∞} Pii^n is finite or infinite. Without loss of
generality (by translation invariance) we can consider i = 0. Note that P00^n = 0 if n is odd, by
definition of the RW. It follows that Σ_{n=0}^{∞} P00^n = Σ_{n=0}^{∞} P00^{2n}. For every n ∈ N, we have
    P00^{2n} = Pr(X2n = 0 | X0 = 0)
             = Pr(Z2n = n)        with Z2n ∼ Bin(2n, p)
             = (2n choose n) p^n q^n
             = (2n)! / (n! n!) p^n q^n ,
where Z2n is the random variable counting the number of steps to the right by time 2n and q := 1 − p.
Now we use the following two lemmas. Below ∼ denotes 'is asymptotic to', i.e. an ∼ bn as n → ∞ means
that lim_{n→∞} an/bn = 1.

Lemma 3. Let (an)n≥0 and (bn)n≥0 be two sequences with values in [0, ∞). If an ∼ bn as n → ∞,
then Σ_{n=0}^{∞} an = +∞ if and only if Σ_{n=0}^{∞} bn = +∞ .

Lemma 4 (Stirling). n! ∼ √(2πn) (n/e)^n as n → ∞.

By Lemma 4 we have
    P00^{2n} ∼ [√(2π·2n) (2n/e)^{2n} / (√(2πn) (n/e)^n)²] p^n q^n = (1/√π) n^{−1/2} 2^{2n} p^n q^n = (1/√π) n^{−1/2} ρ^n        as n → ∞ ,
with ρ = 4pq. Thus, by Lemma 3, Σ_{n=0}^{∞} P00^{2n} = +∞ if and only if Σ_{n=1}^{∞} (1/√π) n^{−1/2} ρ^n = +∞, which
implies that Σ_{n=0}^{∞} P00^n = +∞ if and only if Σ_{n=1}^{∞} n^{−1/2} ρ^n = +∞.
If p = q = 1/2 then ρ = 1 and
    Σ_{n=1}^{∞} n^{−1/2} ρ^n = Σ_{n=1}^{∞} n^{−1/2} = ∞ ,

which implies that the symmetric RW on Z is recurrent.


If p ≠ 1/2 then ρ = 4p(1 − p) < 1 and
    Σ_{n=1}^{∞} n^{−1/2} ρ^n < Σ_{n=0}^{∞} ρ^n = 1/(1 − ρ) < +∞ ,
which implies that the biased random walk is transient.
Interestingly, whether the symmetric random walk is transient or recurrent depends on the
dimensionality of the ambient space. Formally, we define the RW on Z^d as follows.

Example 1.3.7: Random walk on Zd

The RW on Z^d is a MC with state space X = Z^d and transition probability matrix defined as
    Pij = 1/(2d)   if ‖i − j‖ = 1,
          0        otherwise,        i, j ∈ Z^d .        (1.12)

We have the following result.


Theorem 2. The symmetric random walk on Z^d is recurrent if d ≤ 2, and transient if d ≥ 3.
Code 2. The R code on Blackboard 4 RandomWalk Transience Recurrence.R illustrates the transience
and recurrence behaviour of the one-dimensional random walk of Example 1.3.6, as stated in
Theorem 1.
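A quick way to get a feel for Theorem 1 is to count returns to the origin in a long simulated path. The following minimal R sketch is again distinct from the Blackboard script; the sample sizes and seed are arbitrary choices.

# Minimal sketch: count returns to the origin of the random walk on Z,
# contrasting the symmetric case p = 1/2 with a biased case.
count_returns <- function(p, n_steps) {
  steps <- sample(c(1, -1), n_steps, replace = TRUE, prob = c(p, 1 - p))
  x <- cumsum(steps)                    # positions X_1, ..., X_n (with X_0 = 0)
  sum(x == 0)                           # number of returns to 0 up to time n
}

set.seed(3)                             # arbitrary seed and sample sizes
count_returns(p = 0.5, n_steps = 1e5)   # keeps growing with n: recurrent case
count_returns(p = 0.6, n_steps = 1e5)   # typically stays small: transient case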

1.4 Convergence to stationarity


1.4.1 Stationary distributions
A remarkable property of Markov chains is that they often “converge to stationarity”. In particular,
it is often the case that the distribution of Xt converges to a fixed limiting distribution π regardless
of the starting state, meaning that
    Pr(Xt = j | X0 = k) = Pkj^t → πj        as t → +∞, for all k, j ∈ X ,        (1.13)
for some probability distribution π = (πi )i∈X on X . When (1.13) holds, we say that π is the
limiting distribution of the Markov chain (Xt )t≥0 , or equivalently that the πi ’s are the limiting
probabilities.
Given a Markov chain (Xt ) we are interested in understanding when such a limiting distribution
exists and how to identify it. First we can see that, if (1.13) holds, by the Chapman-Kolmogorov
equations we have
    πj = lim_{t→∞} Pkj^t = lim_{t→∞} Σ_{i∈X} Pki^{t−1} Pij = Σ_{i∈X} (lim_{t→∞} Pki^{t−1}) Pij = Σ_{i∈X} πi Pij        j ∈ X .        (1.14)
The simple calculations8 in (1.14) imply that a necessary condition for a distribution π to be
limiting is for it to be stationary, as defined and stated below.
8 We assumed that we can switch the limit and the summation. This is, for example, justified when X is finite, thanks
to the finiteness of the summation Σ_{i∈X} and the existence of the limits lim_{t→∞} Pki^{t−1}. In the case |X | = ∞ more
care would be needed.

Definition 4. A probability distribution π on X is a stationary distribution for a Markov chain


with transition matrix P if
    πj = Σ_{i∈X} πi Pij        for all j ∈ X .        (1.15)

Condition (1.15) can be written in concise matrix form as

πP = π ,

where π is interpreted as a 1 × |X | row vector. Thus, when |X | = n < ∞, a probability distribution


π = (π1 , . . . , πn ) is stationary if and only if it is a left eigenvector of P with eigenvalue 1.
Interpretation of (1.15): by the Chapman-Kolmogorov equations, Σ_i πi Pij is the probability
of Xt+1 = j given Xt ∼ π. Thus π is a stationary distribution for (Xt )t≥0 if, given a starting
state X0 ∼ π, it follows that X1 ∼ π, and by induction Xt ∼ π for all t ≥ 0. Thus, requiring π to be a
stationary distribution for a Markov chain (Xt )t≥0 is equivalent to requiring that

X0 ∼ π ⇒ Xt ∼ π for all t ≥ 0 .

Example 1.4.1

X = {0, 1}, transition matrix
    P = ( p      1 − p )
        ( 1 − p  p     ) .
Here the stationary distribution is π = (π0 , π1 ) = (1/2, 1/2), because
    Σ_{i∈X} πi Pi0 = π0 P00 + π1 P10 = 1/2 = π0 ,
    Σ_{i∈X} πi Pi1 = π0 P01 + π1 P11 = 1/2 = π1 .

Intuitively, we can think of checking whether π is a stationary distribution for (Xt )t≥0 as follows:

(i) start by having πi amount of “sand” at location i, for all i ∈ X ;

(ii) move Pij fraction of the πi “sand” in i to j, for all i, j ∈ X ;

(iii) check if the resulting distribution of sand is still equal to π. If yes, π is stationary.

In this sense, π-stationarity corresponds to time evolution leaving the distribution π invariant.
We have the following general facts about existence and uniqueness of stationary distributions.

Theorem 3 (Existence and uniqueness of π).

(a) If X is finite, a stationary distribution always exists.

(b) If (Xt )t≥0 is irreducible there is at most one stationary distribution.



Part (b) of Theorem 3 follows from Theorem 5 (b) below, since the mi 's defined there are
unique. While we do not provide a full proof of part (a) of Theorem 3 in the interest of brevity, we
note the following: since Σ_{j∈X} Pij = 1 for every i ∈ X , 1 = (1, . . . , 1)^T is a right eigenvector with
eigenvalue 1, i.e. P1 = 1. Since right and left eigenvalues are the same (though their corresponding
eigenvectors differ in general), 1 is also a left eigenvalue, i.e. there exists v = (v1 , . . . , vn ) ∈ R^n
such that vP = v. One can further show9 that there exists a non-negative such eigenvector, i.e.
a vector v = (v1 , . . . , vn ) ∈ [0, ∞)^n such that vP = v. Then π defined as πi = vi / Σ_{j=1}^{n} vj is a
stationary distribution.
Remark
• Finding the stationary distribution of a Markov chain is often the first thing to do in
order to understand the behavior of a stochastic process and start answering questions
of interest (see more in convergence part below).
• Exercise 19 in the exercise list provides an example where |X | = ∞ and π is identified
through notable distributions, as opposed to directly solving the set of linear equations.
You can think of the model there as a discrete analogue of continuous autoregressive
processes (e.g., processes with transitions Xt+1 | Xt ∼ N(ρXt , σ²) for some ρ < 1).

1.4.2 Convergence in distribution


The importance of stationary distributions comes from the following theorem, which is a classical
and important result in Markov chain theory. To state it we need the notion of aperiodicity.
Definition 5. A Markov chain with transition matrix P is aperiodic if, for every i ∈ X , the
greatest common divisor of the set {t ≥ 1 : Pii^t > 0} is 1.
See below for some discussion of the meaning and intuition of aperiodicity.
Theorem 4 (Convergence theorem). Let (Xt )t≥0 be an irreducible and aperiodic Markov chain
with stationary distribution π. Then
    Pij^t → πj        as t → +∞,   ∀ i, j ∈ X .        (1.16)
Since Pr(Xt = j) = Σ_{i∈X} Pij^t Pr(X0 = i), the convergence in (1.16) implies that, regardless of
the starting distribution of X0 , we have
Pr(Xt = j) → πj as t → +∞.
Roughly speaking this means that, provided the chain is aperiodic and irreducible, stationarity is
not only necessary but also sufficient for the law of Xt to converge to π, i.e. for π to be the limiting
distribution. When the convergence theorem holds, we say that the Markov chain “converges to
stationarity”, or “converges to equilibrium”. NB: this is convergence in distribution, not a
convergence of Xt to a fixed value.
The following example exhibits a Markov chain that is not aperiodic (i.e. it is
periodic) and illustrates why without aperiodicity the convergence theorem may fail.
9 The fact that there always exists a non-negative eigenvector v with eigenvalue 1 can be deduced, for example,
from the fact that P is non-negative and 1 is the leading eigenvalue. However, proving it requires linear algebra
results not discussed in these notes (e.g. the Perron-Frobenius theorem). Existence and characterisations of stationary
distributions can also be obtained with purely probabilistic tools, see e.g. Section 1.7 of [Norris, 1998] or Section
1.5.3 of Levin and Peres [2017], but this also requires additional notions and space.

Example 1.4.2

X = {0, 1} and transition matrix P with P01 = P10 = 1.


Here P00^t = 0 if t is odd, and P00^t > 0 if t is even. Thus, the greatest common divisor of
{t ≥ 1 : Pii^t > 0} = {2, 4, 6, 8, . . . } is 2, and the chain is not aperiodic, but rather periodic
with period 2.
In this case π = (1/2, 1/2) is the unique stationary distribution of P , but the convergence
theorem does not hold, and π is not a limiting distribution (i.e., Pij^t ↛ πj as t → ∞).

Example 1.4.3: satisfying the convergence theorem

• X = {0, 1} and transition matrix P with P01 = P10 = 1 − p and P00 = P11 = p (the chain of
Example 1.4.1). Provided p ∈ (0, 1), the Markov chain is irreducible (0 ↔ 1) and aperiodic
(Pii > 0 for all i ∈ X ). Thus, Pij^t → πj = 1/2 for all i, j ∈ X ;

• (Weather chain) X = {0, 1} and transition matrix P as in Example 1.2.2, i.e. P00 = α,
P01 = 1 − α, P10 = β, P11 = 1 − β.

The chain is irreducible and aperiodic. What is the stationary distribution π?


To find π = (π0 , π1 ), we need to solve the following system of equations:
( (
π0 = π0 P00 + π1 P10 π0 = απ0 + βπ1
=⇒
π1 = π0 P01 + π1 P11 π1 = (1 − α)π0 + (1 − β)π1 .

Solving these equations, subject to the constraint π0 + π1 = 1, we obtain


β 1−α
π0 = , π1 =
1+β−α 1+β−α
t β t 1−α
Thus, Pi0 → 1+β−α and Pi1 → 1+β−α as t → ∞ for all i ∈ X .
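Numerically, the stationary distribution of a finite chain can be obtained as the left eigenvector of P with eigenvalue 1, as discussed around (1.15). A minimal R sketch for the weather chain follows; the values α = 0.6 and β = 0.3 are illustrative choices, not taken from the notes.

# Minimal sketch: the stationary distribution as the (normalised) left
# eigenvector of P with eigenvalue 1, i.e. the solution of pi P = pi, sum(pi) = 1.
stationary_dist <- function(P) {
  e <- eigen(t(P))                             # left eigenvectors of P
  v <- Re(e$vectors[, which.max(Re(e$values))])
  v / sum(v)                                   # normalise to a probability vector
}

alpha <- 0.6; beta <- 0.3                      # illustrative weather-chain parameters
P <- matrix(c(alpha, 1 - alpha,
              beta,  1 - beta), nrow = 2, byrow = TRUE)
stationary_dist(P)
c(beta, 1 - alpha) / (1 + beta - alpha)        # closed-form solution for comparison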

1.4.3 Convergence of asymptotic frequencies


The Convergence Theorem allows us to interpret πi as the limiting probability of the event {Xt = i}.
We now show that, under mild assumptions, πi also coincides with the average proportion of time
(Xt )t≥0 spends in i. Define
    Vi(n) = Σ_{t=0}^{n−1} 1(Xt = i)        i ∈ X , n ≥ 1        (1.17)
          = # of times Xt visits i before time n ,

and the average excursion time from i as

mi = E[T1 |X0 = i] i∈X,

with T1 = min{t ≥ 1 : Xt = i} as in (1.10).


Theorem 5 (Asymptotic frequencies). Let (Xt )t≥0 be an irreducible Markov chain. Then
(a) Regardless of the starting distribution of X0 ,
    Vi(n)/n → 1/mi        a.s. as n → ∞,        i ∈ X .        (1.18)
(b) If π is a stationary distribution for (Xt )t≥0 , then
    1/mi = πi        i ∈ X .        (1.19)

Remark
• Combining Theorem 5 (a) and (b) we have that, if (Xt )t≥0 is irreducible and π-stationary,
    Vi(n)/n → πi        a.s. as n → ∞,        i ∈ X .        (1.20)
• Theorem 5 also implies Theorem 3(b), meaning that if (Xt )t≥0 is irreducible and there
exists a stationary distribution π, then it must be unique, since πi must equal 1/mi for
all i ∈ X .

Proof. Part (a). Fix i ∈ X . If (Xt )t≥0 is transient, then mi = ∞ and limn→∞ Vi (n) = Vi is almost
surely finite by Proposition 4. Thus in this case limn→∞ Vi (n)/n = 0 = 1/mi and (1.18) is satisfied.
Consider then the case of (Xt )t≥0 recurrent. For brevity, we will prove the result assuming X0 = i,
even if the result holds for general starting distributions 10 . In the recurrent case, the excursion
10 To extend the proof to the case X0 = j ≠ i or X0 ∼ α(0) for some distribution α(0) , one could roughly
proceed as follows. First note that, when X0 ≠ i the random variable T1 = min{t ≥ 1 : Xt = i} has a different
distribution than T2 , T3 , . . . defined in (1.10). Nonetheless T1 is still almost surely finite (which can be proved using
irreducibility) and T2 , T3 , . . . are iid (which can be proved by a variation of Lemma 2). Combining this fact one can
deduce Sk /k → mi almost surely as k → ∞ and proceed as in the proof of part (a) above. We avoid formalizing
the previous arguments in the interest of brevity.

times T1 , T2 , . . . defined in (1.10) are iid random variables by Lemma 2. Thus, by the SLLN,
Sk/k → mi almost surely as k → ∞, with {Sk }k≥0 defined in (1.9). Also, by definition of {Sk }k≥0
and Vi(n) we have S_{Vi(n)} ≤ n < S_{Vi(n)+1} , and thus
    S_{Vi(n)} / Vi(n) ≤ n / Vi(n) < S_{Vi(n)+1} / Vi(n) .
By recurrence, Vi(n) → ∞ almost surely as n → ∞ and thus
    mi = lim_{k→∞} Sk/k = lim_{n→∞} S_{Vi(n)} / Vi(n) ≤ lim_{n→∞} n / Vi(n) ≤ lim_{n→∞} S_{Vi(n)+1} / Vi(n) = lim_{k→∞} Sk/k = mi ,
which implies that lim_{n→∞} n / Vi(n) = mi , and thus lim_{n→∞} Vi(n)/n = 1/mi , almost surely as desired.
Part (b). By part (a) we know lim_{n→∞} Vi(n)/n = 1/mi almost surely. Since Vi(n)/n ∈ [0, 1], by the
bounded convergence theorem, we also have lim_{n→∞} E[Vi(n)/n] = 1/mi . The above holds for any
starting distribution for X0 . Taking X0 ∼ π, by π-stationarity we have Xt ∼ π for all t ≥ 0 and
thus
    E[Vi(n)/n] = Σ_{t=0}^{n−1} E[1(Xt = i)] / n = Σ_{t=0}^{n−1} πi / n = πi .
Thus lim_{n→∞} E[Vi(n)/n] equals both 1/mi and πi , which implies that πi = 1/mi for all i ∈ X as
desired.

1.4.4 Ergodic averages and law of large numbers for MCs


We now state (a version of) the ergodic theorem for Markov chains. Like Theorem 5 above, it
deals with averages over time. In our exposition, it can be seen as a generalization or a corollary
of Theorem 5.
Theorem 6 (Ergodic Theorem). Let (Xt )t≥0 be an irreducible MC with stationary distribution π.
Then, for any bounded function g : X → R, we have
    (1/n) Σ_{t=0}^{n−1} g(Xt ) → Eπ [g] := Σ_{i∈X} πi g(i)        a.s. as n → +∞ .        (1.21)

Proof for |X | < ∞. By definition of Vi (n) in (1.17) we have


    (1/n) Σ_{t=0}^{n−1} g(Xt ) = Σ_{i∈X} g(i) Vi(n)/n        i ∈ X .
Since Vi(n)/n → πi almost surely as n → ∞ for all i ∈ X by Theorem 5 and (1.20), it follows that
    (1/n) Σ_{t=0}^{n−1} g(Xt ) = Σ_{i∈X} g(i) Vi(n)/n → Σ_{i∈X} πi g(i)        a.s. as n → +∞

as desired.

Remarks
• The ergodic theorem can be seen as a Law of large numbers for Markov chains and it
implies that, in terms of approximating expectations Eπ [g], we can treat the Markov
chain states (X0 , X1 , X2 , ...) as if they were i.i.d. samples from π. This is the key theo-
retical underpinning of commonly used Markov chain Monte Carlo algorithms (MCMC),
discussed in the second part of the course.

• The ergodic theorem does not require aperiodicity. Indeed, periodicity does not prevent
time averages from converging. For example, for the chain in Example 1.4.2 one has (1.20)
and (1.21) even if (1.16) does not hold.

Code 3. The R codes on Blackboard 5 Illustration of convergence theorem PoissonAR.R and


6 Illustration Ergodic Theorem.R provide some basic visual illustrations of the difference between
the statements of the convergence and the ergodic theorems using Example 19.
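As a further complement to the Blackboard scripts, the following minimal R sketch illustrates the ergodic theorem for the social mobility chain, comparing the time average of g(i) = 1(i = 3) along one long trajectory with the stationary expectation Eπ[g] = π3; the trajectory length and seed are arbitrary choices.

# Minimal sketch: time average of g(X_t) along one long trajectory versus the
# stationary expectation E_pi[g], with g(i) = 1(i = 3) (upper class).
P <- matrix(c(0.7, 0.2, 0.1,
              0.3, 0.5, 0.2,
              0.2, 0.4, 0.4), nrow = 3, byrow = TRUE)

set.seed(4)                                     # arbitrary seed and length
n <- 1e5
x <- numeric(n); x[1] <- 1
for (t in 1:(n - 1)) x[t + 1] <- sample(1:3, 1, prob = P[x[t], ])
mean(x == 3)                                    # ergodic average (1/n) sum_t g(X_t)

e <- eigen(t(P))                                # stationary distribution for comparison
pi_hat <- Re(e$vectors[, 1]); pi_hat <- pi_hat / sum(pi_hat)
pi_hat[3]                                       # E_pi[g] = pi_3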

1.5 Counter-examples where the convergence theorem fails


In the Convergence Theorem, in order to obtain convergence to stationarity as in (1.16), we required
the chain to be irreducible and a stationary distribution π to exist. What happens when these
assumptions are not satisfied? We now provide two ”counter-examples” to illustrate long-run
behaviours of Markov chains when the assumptions of the convergence theorem are not satisfied.
These two examples complement Example 1.4.2, where we illustrated how periodicity can make
Pij^t fail to converge.

1.5.1 Infinite state spaces and non-existence of stationary distributions


We now consider an example with |X | = ∞ where no stationary distribution exists.

Example 1.5.1: non-existence of the stationary distribution (Random walk on Z)

X = Z and transitions as in the random walk of Example 1.3.6, i.e. from each i ∈ Z the chain
jumps to i + 1 w.p. p and to i − 1 w.p. 1 − p:
[transition diagram on Z omitted]

Here the Markov chain is irreducible (only one class) but |X | = ∞. Thus, Theorem 3(a)
does not apply and a stationary distribution π may not exist. Indeed in this case there exist
stationary measures (i.e. µ = (µi )i∈X with µj = Σ_{i∈X} µi Pij and µj ≥ 0, but which cannot be
normalised into a probability distribution) but no stationary distributions. We prove the above
fact for the symmetric RW (i.e. p = 1/2), although similar arguments could be made for all p ∈ (0, 1).
Proposition 7. The only stationary measures for the symmetric RW on Z are constant

measures, i.e. µ = (µi )i∈X with µi = c for all i ∈ X for some c ≥ 0.


Proof. In this case, stationarity requires µi = (µi−1 + µi+1 )/2 for all i ∈ X . This implies
µi+1 − µi = µi − µi−1 for all i ∈ X , or equivalently µi+1 − µi = a for some a ∈ R independent
of i. It follows µi = µ0 + a · i for all i ∈ X . Since µi ≥ 0 for all i ∈ X we must have a = 0.
In terms of limiting probabilities, the non-existence of stationary (probability) distributions
is reflected in the fact that Pij^t converges to some value as t → +∞, but the limits do not form
a probability distribution. In particular, in this case it happens that
    lim_{t→+∞} Pij^t = 0        for all i, j ∈ X .
Intuition: when p = 1/2, as t increases the distribution of Xt becomes wider and wider and
the probability of being in any given j ∈ X goes to 0, despite the RW being recurrent. When
p ≠ 1/2, the process "drifts off to infinity", which again implies lim_{t→+∞} Pij^t = 0 for any given
j ∈ X .
In both cases (p = 1/2 and p ≠ 1/2) there is no limiting distribution and
    1 = lim_{t→+∞} Σ_{j∈X} Pij^t ≠ Σ_{j∈X} lim_{t→+∞} Pij^t = 0
due to a "dispersion of probability in the tails" as t → ∞.

1.5.2 Lack of irreducibility and hitting times


Example 1.5.2: Lack of irreducibility (Gambler’s ruin)

In the Gambler's ruin example the Markov chain is reducible (i.e. not irreducible), because
there are 3 classes: {0}, {1, ..., N − 1}, {N }. Thus, Theorem 3(b) does not apply and the
stationary distribution may not be unique. For example, it is easy to check that both δ0 =
(1, 0, ..., 0) and δN = (0, 0, ..., 1) are stationary distributions.
The consequence of non-uniqueness of stationary distributions is that the limiting probabilities
lim_{t→+∞} Pij^t will depend on the starting state i, unlike in (1.16). For example it is
easy to see that
    lim_{t→+∞} PNN^t = 1        and        lim_{t→+∞} P00^t = 1 ,
meaning that when X0 = N the limiting distribution is δN while when X0 = 0 the limiting
distribution is δ0 . What happens for X0 = i with i ∈ {1, . . . , N − 1}? There the limiting
probability depends on the following hitting probability
    h(i) := lim_{t→+∞} PiN^t = Pr(Xt hits N before 0 | X0 = i),        i ∈ X ,
which we write for brevity as h(i) = Pr(win | X0 = i) since that is the probability of the
Gambler eventually winning. How can we compute h(i)? We use a "condition on the first event"

argumenta to find a set of equations for h(i). Conditioning on X1 we obtain
    h(i) = Pr(win | X0 = i)        i ∈ {1, . . . , N − 1}
         = Pr(win | X1 = i + 1) Pr(X1 = i + 1 | X0 = i) + Pr(win | X1 = i − 1) Pr(X1 = i − 1 | X0 = i)
         = p h(i + 1) + (1 − p) h(i − 1),
which can be re-written as p(h(i + 1) − h(i)) = (1 − p)(h(i) − h(i − 1)). Combining these with
h(0) = Pr(win | X0 = 0) = 0 and h(N ) = Pr(win | X0 = N ) = 1 we obtain the following set of
(finite difference) equations and boundary conditions for h
    p(h(i + 1) − h(i)) = (1 − p)(h(i) − h(i − 1))        i ∈ {1, . . . , N − 1}
    h(0) = 0                                                                    (1.22)
    h(N ) = 1 .

When p = 1/2 the above implies that h(i + 1) − h(i) = h(i) − h(i − 1), meaning that h is
linear. Thus, interpolating h(0) = 0 and h(N ) = 1 we get
    h(i) = Pr(win | X0 = i) = i/N        i ∈ {0, . . . , N }, when p = 1/2 ,        (1.23)
meaning that
    lim_{t→+∞} PiN^t = i/N        and        lim_{t→+∞} Pi0^t = (N − i)/N        when p = 1/2 .
In accordance with intuition, the closer to N the Gambler starts, the higher his probability
of winning the game (i.e. ending up in N ). Note that when p = 1/2 the game is ”fair”,
which is reflected in the fact that the linear solution in (1.23) is symmetric in the sense
that h(i) = 1 − h(N − i). Intuitively this means that the gambler's opponent, who
starts with budget N − X0 , has the same chance of winning as the gambler given the same
initial budget, i.e. Pr(Gambler wins | X0 = i) = Pr(Opponent wins | N − X0 = i).
When p ≠ 1/2, the solution to the system of equationsb in (1.22) is
    lim_{t→+∞} PiN^t = (1 − θ^i) / (1 − θ^N)        with θ := (1 − p)/p .

Here, if p < 1/2, the limit implies that the probability of reaching N goes to 0 exponentially
fast as you get further from N . In this case the winning probabilities are not symmetric
since the game is not ”fair”. See Example 1.44 and 1.45 of Durrett [2021] for step-by-step
calculations to solve (1.22) and some quantitative examples.
a Note that computing h(i) by summing over all trajectories that hit N would be much harder in general.
b See the tablet notes for the full solution.

The strategy used in Example 1.5.2 to compute hitting probabilities extends to the case of
general Markov chains, as shown below.

General equations for hitting probabilities

Given a MC (Xt )t≥0 on a finite or countable state space X and a subset C ⊆ X , define the hitting
time of C as
τC := min{t ≥ 0 : Xt ∈ C} (1.24)

and the hitting probability of C starting from i as

h(i) := Pr(τC < ∞ | X0 = i) i ∈ X. (1.25)

Then, arguing as in Example 1.5.2, we can show that the function i 7→ h(i) satisfies the following
set of equations and boundary conditions
    h(i) = Σ_{j∈X} Pij h(j)        i ∈ X \ C
    h(i) = 1                       i ∈ C .        (1.26)

In general, the solution to such a set of equations is not unique, and it can be shown that hitting
probabilities coincide with the minimal non-negative solution of such equations (see [Norris, 1998,
Thm.1.3.2] for details). In Example 1.5.2 we identified the correct solution to the equations by
imposing the additional condition h(0) = 0. A similar set of equations can also be derived for the
expectation of the hitting times, i.e. for the function i 7→ E[τC |X0 = i], see Exercise 28.
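These equations are linear, so for finite chains they can be solved numerically. The R sketch below does this for the gambler's ruin system (1.22) and checks the result against the closed-form solution of Example 1.5.2; the function name ruin_win_prob and the values N = 10, p = 0.4 are illustrative choices.

# Minimal sketch: solve the gambler's-ruin equations (1.22) as a linear system
# and compare with the closed-form solution (1 - theta^i) / (1 - theta^N).
ruin_win_prob <- function(N, p) {
  # unknowns h(1), ..., h(N-1); boundary conditions h(0) = 0, h(N) = 1;
  # each interior equation reads h(i) - p h(i+1) - (1-p) h(i-1) = 0
  A <- diag(N - 1)
  b <- numeric(N - 1)
  for (i in 1:(N - 1)) {
    if (i < N - 1) A[i, i + 1] <- -p else b[i] <- p     # h(N) = 1 moves to the rhs
    if (i > 1) A[i, i - 1] <- -(1 - p)                  # h(0) = 0 term vanishes
  }
  c(0, solve(A, b), 1)
}

N <- 10; p <- 0.4; theta <- (1 - p) / p                 # illustrative values
h_num <- ruin_win_prob(N, p)
h_exact <- (1 - theta^(0:N)) / (1 - theta^N)
max(abs(h_num - h_exact))                               # numerically zero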

Remarks
The following remarks are beyond the scope of the course and just reported for the sake of
completeness and references:
• The qualitative behavior observed in Example 1.5.2 is common to general reducible
chains, in the sense that one can always partition the state space into a collection of
closed recurrent classes plus the set of all transient states (see e.g. Theorem 1.8 of
Durrett [2021] and related discussion) and express the limiting probabilities starting
from transient states as a convex combination of the stationary distributions restricted to
the recurrent classes, weighted by the probabilities of hitting such classes.

• In Example 1.5.1, despite the non-existence of π, we nonetheless have Vi(n)/n → 1/mi as well
as Pij^t → 1/mi . However, mi = ∞ for every i ∈ X . This is an example where recurrence
occurs (i.e. excursions have finite length almost surely) but the average excursion lengths
(mi ) are infinite. Such chains are called null recurrent, as opposed to positive recurrent
ones where mi < ∞. Positive recurrence is in one-to-one correspondence with the existence
of stationary distributions (see e.g. Theorem 1.7.7 of Norris [1998]).

1.5.3 Time-inhomogeneity and self-reinforcing processes


Example 1.5.3: Polya’s Urn

To be added

1.6 Reversibility and detailed balance


Reversibility is a stronger condition than stationarity. Among other aspects, it provides a conve-
nient way to design π-stationary Markov chains. In fact, most computational algorithms based on
Markov chains (e.g. Markov Chain Monte Carlo, which will be discussed in the second part of the
course) rely on the notion of reversibility. Here we define the notion and discuss two ways to think
about it, one related to detailed balance and one to time reversibility.
Definition 6. A Markov chain (Xt )t≥0 with transition probabilities (Pij )i,j∈X is reversible with
respect to a distribution π if

    πi Pij = πj Pji        i, j ∈ X .        (1.27)

Proposition 8. Let (Xt )t≥0 be a π-reversible Markov chain. Then (Xt )t≥0 is also π-stationary.
Proof. Let (Xt )t≥0 be π-reversible. Then its transition probabilities (Pij )i,j∈X satisfy
    Σ_{i∈X} πi Pij = Σ_{i∈X} πj Pji = πj Σ_{i∈X} Pji = πj        j ∈ X ,
where the first equality follows by (1.27). Thus (Xt )t≥0 is π-stationary.


There are two alternative ways to think about π-reversibility.

Detailed-vs-Global balance interpretation


By definition, π-reversibility requires

πj Pji = πi Pij i, j ∈ X . (1.28)

The left-hand side can be thought of as the amount of probability going from state j to state i
at each Markov chain step in stationarity, while the right-hand side can be thought of as the amount
going from state i to state j. More precisely, assuming Xt ∼ π and Pr(Xt+1 = j | Xt = i) = Pij , then
πj Pji is the probability of observing a transition from j to i, i.e. of observing (Xt , Xt+1 ) = (j, i),
while πi Pij is the probability of observing a transition from i to j, i.e. (Xt , Xt+1 ) = (i, j). In this
sense, π-reversibility requires the exchange of probability across each couple of states in X to be
balanced. In accordance with this intuition, a π-reversible chain is sometimes referred to as being
in detailed balance with respect to π.
We can contrast that with π-stationarity. Since Σ_{i∈X} Pji = 1, we can re-write the π-stationarity
requirement in (1.15) as
    Σ_{i∈X} πj Pji = Σ_{i∈X} πi Pij        j ∈ X .        (1.29)

The left-hand side can be thought of as the amount of probability going "out of state j" at each
Markov chain step in stationarity, while the right-hand side Σ_{i∈X} πi Pij can be thought of as the
amount going "into state j". More precisely, assuming Xt ∼ π and Pr(Xt+1 = j | Xt = i) = Pij , then
Σ_{i∈X} πj Pji is the probability of observing a transition going out of state j, i.e. such that Xt = j,
while Σ_{i∈X} πi Pij is the probability of observing a transition going into state j, i.e. such that
Xt+1 = j. In this sense, π-stationarity requires the total amount of probability going into and

outside each single state to be balanced. In accordance with this intuition, a π-stationary chain is
sometimes referred to as bein in global balance with respect to π.
Clearly, detailed balance implies global balance since having each ”local” term in (1.28) being
equal implies that also the ”global” sum in (1.29) will be equal. See figures in hand-written notes
on Blackboard.
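The two balance conditions are easy to check numerically for a finite chain: detailed balance compares the "probability flows" πi Pij and πj Pji entrywise, while global balance only checks πP = π. The following minimal R sketch illustrates the difference; the 3-state "cycling" matrix is an illustrative example built for this purpose (it is doubly stochastic, so the uniform distribution is stationary, but the chain is not reversible).

# Minimal sketch: check detailed balance (1.28) versus global balance (1.15)/(1.29).
is_stationary <- function(pi_dist, P, tol = 1e-10) max(abs(pi_dist %*% P - pi_dist)) < tol
is_reversible <- function(pi_dist, P, tol = 1e-10) {
  flow <- diag(pi_dist) %*% P      # flow[i, j] = pi_i * P_ij
  max(abs(flow - t(flow))) < tol   # detailed balance: flow i -> j equals flow j -> i
}

# Binary chain of Example 1.4.1: reversible (hence also stationary) w.r.t. (1/2, 1/2)
p <- 0.3
P <- matrix(c(p, 1 - p, 1 - p, p), 2, 2, byrow = TRUE)
c(is_stationary(c(1/2, 1/2), P), is_reversible(c(1/2, 1/2), P))

# Illustrative 3-state chain cycling 1 -> 2 -> 3 -> 1 with high probability:
# doubly stochastic, so uniform pi is stationary, but not reversible.
Q <- matrix(c(0.1, 0.8, 0.1,
              0.1, 0.1, 0.8,
              0.8, 0.1, 0.1), 3, 3, byrow = TRUE)
c(is_stationary(rep(1/3, 3), Q), is_reversible(rep(1/3, 3), Q))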

Time-reversibility interpretation
In addition to being π-stationary, a π-reversible MC is also "time-reversible", i.e.:
    (Xt )t≥0 is π-reversible   ⇐⇒   (Xt )t≥0 is π-stationary and (Xt )t≥0 is time-reversible.
Intuitively, time-reversible means that if you look at the MC forward or backward in time you
obtain the same stochastic process. More precisely, we have the following.
Proposition 9. Let X0 ∼ π and n ≥ 1. Define
    Y0 = Xn , Y1 = Xn−1 , ..., Yn = X0 .        (1.30)
Then (Y0 , . . . , Yn ) is equal to (X0 , . . . , Xn ) in distribution if and only if (Xt )t≥0 is π-reversible.
Sketch of proof. Let
    Qij := Pr(Yt+1 = j | Yt = i)        i, j ∈ X
be the transition probabilities of (Yt )t=0,...,n .11 Then
    Qij = Pr(Yt+1 = j | Yt = i) = Pr(Xn−t−1 = j | Xn−t = i)        (1.31)
        = Pr(Xn−t−1 = j, Xn−t = i) / Pr(Xn−t = i)        (1.32)
        = Pr(Xn−t−1 = j) Pr(Xn−t = i | Xn−t−1 = j) / Pr(Xn−t = i) = πj Pji / πi        i, j ∈ X .        (1.33)
Thus Qij = Pij for all i, j ∈ X if and only if
    πi Pij = πj Pji        i, j ∈ X
as desired.

Example 1.6.1

The Markov chain in Example 16 is an example of non-reversible process. In fact, using (1.31),
one can check that the transition probabilities of the time-reversed process are

Q = ( 0, 0, 1 ; 1/2, 1/2, 0 ; 0, 1/2, 1/2 ) ,

where rows are separated by semicolons.

11 Here we are taking for granted that (Yt)t=0,...,n is both Markovian and time-homogeneous. These facts are not obvious, especially time-homogeneity, but can be deduced from the Markovianity, time-homogeneity and π-stationarity of (Xt)t=0,...,n. See e.g. Theorem 1.9.1 of Norris [1998] for more formal and complete arguments.

Since Q ≠ P in this case, we have that (Xt)t≥0 is π-stationary but not π-reversible.

A simple example of reversible chains is given by the binary chain in Example 1.4.1. Exercises 33
and 34 in the exercises list provide less obvious examples of reversible chains.
Note that a π-reversible chain is ”time-reversible” (i.e. Proposition 9 holds) only in stationarity
(i.e. if X0 ∼ π) and not if X0 ∼ α(0) for some other, possibly deterministic, starting distribution
α(0) .12 See Exercise 34 and the associated code (as well as the in-class tablet notes) for some
illustration of this phenomenon.

12 To see why forward and backward processes behave differently out of stationarity consider the following: if

started out of stationarity the forward process (Xt )t≥0 would first ’converge to stationarity’ and then ’explore the
space sampling from π’; while the backward process (Xt )t≥0 would first ’explore the space sampling from π’ and
then ’go out of stationarity’, which is implausible under the forward process.
Chapter 2

Illustrative examples and applications of discrete-time Markov chains

2.1 Random walks on graphs and PageRank


A directed graph (or directed network) is a collection of vertices V and edges E ⊆ V × V among the vertices. For example, you could have V = {1, 2, 3, 4, 5} with some directed edges (arrows) among those vertices:

[Figure: a directed graph on the vertices V = {1, 2, 3, 4, 5}; see the original notes for the picture.]

Given a directed graph on V = {1, ..., n} one can always construct a corresponding Markov
chain on the graph by taking X = V and assigning “uniform weights” to all edges (i.e., arrows) in
the graph. More precisely, we define the following.
Definition 7. The natural random walk on a directed graph (V, E) is a Markov chain with state space X = V and transition matrix P(RW) given by

Pij(RW) = 1/di if there is an edge from i to j, and Pij(RW) = 0 otherwise,    i, j ∈ X ,

where di is the out-degree of i, i.e. the number of edges starting from i.
Random walks on graphs are often used to model phenomena that diffuse on networks (e.g., epidemics spreading through contacts, news spreading on social networks, etc). Maybe less intuitively, they can also be used to gain insight into the structure and connectivity properties of a graph. We now consider a famous example from website ranking.

Example 2.1.1: PageRank

PageRank is a ranking algorithm for websites which is at the heart of Google’s search engine.
It was invented by Larry Page and Sergei Brin, founders of Google (see the original paper ”The
anatomy of a large-scale hypertextual Web search engine” at https://snap.stanford.edu/
class/cs224w-readings/Brin98Anatomy.pdf). The idea is to assign each website a score
(independently of search queries), and use the resulting ranking of websites to help perform searches on the web. The basic idea is to rank websites based on the stationary distribution

searches in the web. The basic idea is to rank websites based on the stationary distribution
of the random walk on the graph induced by weblinks.
More precisely, let X = {1, 2, ..., n} represent the list of websites. Define a directed graph across websites with V = X and (i, j) ∈ E if and only if site i contains a weblink to site j. Then define the score of site i as πi, where π is the stationary distribution of P(RW), i.e. of the natural random walk on (V, E). Intuition: consider a user surfing the web at random who, once at site i, chooses one of the di weblinks on page i at random and clicks on it. In this metaphor, the score of site i corresponds to the probability that the random surfer is visiting site i at time t, for large t.
For the above construction to make sense, i.e. to lead to a well defined score, we need π
to exist and be unique. In this context | X |< +∞, so a stationary distribution π exists by
Theorem 3(a). However, the Markov chain may not be irreducible, so Theorem 3(b) does not
apply and π may not be unique. To solve this issue, PageRank considers a modification of
the natural random walk on the graph: with probability α the random surfer follows P (RW ) ,
while with probability 1 − α he gets bored and opens a new website choosing it uniformly
at random from {1, 2, ..., n}. Thus we obtain a new Markov chain on X , which we denote as
(Yt )t≥0 , with the following transition probabilities:
Pij(PR) = Pr(Yt+1 = j | Yt = i)
    = Pr(Yt+1 = j | Yt = i, not bored) Pr(not bored) + Pr(Yt+1 = j | Yt = i, bored) Pr(bored)
    = α Pij(RW) + (1 − α) · (1/n) ,
for all i, j ∈ X . In terms of transition matrices this means,

P (PR) = αP (RW ) + (1 − α)P (unif) ,

where P (unif) is an n × n transition matrix with all entries equal to 1/n. In these cases we
say that P (PR) is a mixture of two transition matrices, i.e. P (RW) and P (unif) , which means
that the Markov chain evolving according to P (PR) chooses at each iteration whether to make
a transition according to P (RW) , with probability α, or according to P (unif) , with probability
1 − α. Note that P (PR) is irreducible by construction since

Pij(PR) ≥ (1 − α)/n > 0    i, j ∈ X ,
meaning that all states communicate. It follows that P(PR) has a unique stationary distribution π. The corresponding value of πi is the PageRank of website i. Note that P(PR) is also aperiodic by construction, and thus πi = limt→∞ ((P(PR))^t)ji = limt→∞ Pr(Yt = i | Y0 = j).

Some remarks on Example 2.1.1:


• the score πi does not simply count the number of links pointing to i, but also gives more importance to links coming from "important" websites j (i.e. those with high score πj), and to links coming from websites j which link to few other websites. To see that, note that π must solve

  πi = ∑_{j=1}^n πj Pji(PR) ;

• In the original paper Google uses α = 0.85;


• To actually compute (π1, ..., πn) one would need to solve the system of n equations given by:

  π1 = π1 P11(PR) + ... + πn Pn1(PR)
  ...
  πn = π1 P1n(PR) + ... + πn Pnn(PR)

together with π1 + · · · + πn = 1. This system can be solved exactly with standard algorithms for linear systems at cubic cost, i.e. Cost = O(n3).1 When n is very large, such as in Google's website ranking applications, this may be too computationally expensive. To reduce the computational cost, one can approximate π using the Convergence Theorem and Chapman-Kolmogorov equations as follows: initialize the distribution of X0, αi(0) = Pr(X0 = i), with some probability distribution α(0) = (α1(0), ..., αn(0)); compute the distributions of X1, . . . , Xt as:

  α(1) = α(0) P(PR)
  α(2) = α(1) P(PR)
  ...
  α(t) = α(t−1) P(PR)

By the Convergence theorem, when t is large enough, α(t) provides an approximation of π. Note that computing α(t) requires only matrix-vector multiplications, which have quadratic cost and are thus cheaper than solving systems of linear equations. The cost of the above procedure2 would thus be O(tn2), which is much smaller than O(n3) when t ≪ n.3 A short code sketch of this power iteration is given below.
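The following Python sketch (with a small hypothetical link graph, not from the notes) illustrates the power-method approximation of the PageRank scores described above.

```python
import numpy as np

# A minimal sketch of PageRank via the power method, on a small hypothetical graph.
# adjacency[i, j] = 1 if page i links to page j.
adjacency = np.array([[0, 1, 1, 0],
                      [0, 0, 1, 0],
                      [1, 0, 0, 1],
                      [0, 0, 1, 0]], dtype=float)
n = adjacency.shape[0]
alpha = 0.85                                              # damping factor used in the original paper

P_rw = adjacency / adjacency.sum(axis=1, keepdims=True)   # natural random walk P^(RW)
P_pr = alpha * P_rw + (1 - alpha) * np.ones((n, n)) / n   # PageRank matrix P^(PR)

alpha_t = np.ones(n) / n                                  # alpha^(0) = Unif(X)
for _ in range(100):                                      # iterate alpha^(t) = alpha^(t-1) P^(PR)
    alpha_t = alpha_t @ P_pr
print(alpha_t)                                            # approximates the PageRank scores pi
```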

2.2 Sequential testing


We now consider an illustrative example of the application of Markov chain theory to sequential hypothesis testing in Statistics.

Example 2.2.1: Sequential testing

Suppose that two new drugs have been developed for treating a certain disease. Drug 1 has
cure rate P1 and drug 2 has cure rate P2 . To decide whether P1 > P2 or P1 < P2 , the following
statistical test is used: couples of patients are treated sequentially giving drug 1 to one patient
and drug 2 to the second, and whether or not each patient is cured is recorded. The test stops
when the number of patients cured with drug 1 exceeds by A, or more units, the number of
patients cured with drug 2, concluding that drug 1 is better, or vice versa when the number

1 Here the cubic cost would come from performing inversion or factorization of an n × n matrix.
2 In computational linear algebra terms, this procedure is an instance of the so-called power method to approximate leading eigenvectors of matrices.
3 In the PageRank example, the value of t required to have α(t) ≈ π is often relatively small thanks to the presence of the "global steps" (1 − α)P(unif) in P(PR), which help the chain (Yt)t≥0 converge fast to stationarity (e.g. a few dozen steps can be enough to approximate π satisfactorily). Also, the value of t required can be reduced by choosing a good initialization, e.g. α(0) = Unif(X) would work better than a point mass at some j ∈ X in this case.

of patients cured with drug 2 exceeds by B, or more units, the number of patients cured with
drug 1, concluding that drug 2 is better.
Given certain values of P1 and P2 , what is the probability of making the wrong decision?

Intuitively, the test in Example 2.2.1 is a race between drugs 1 and 2: the first drug to outscore the other by the required margin (A for drug 1, B for drug 2) in terms of cured patients is the winner. For each i = 1, 2, . . . define
(
1 if the i-th patient treated with drug 1 is cured
Xi =
0 otherwise

and (
1 if the i-th patient treated with drug 2 is cured
Yi = .
0 otherwise
We assume that (Xt )t≥1 and (Yt )t≥1 are two independent sequences of iid Bernoulli(P1 ) and
Bernoulli(P2 ) random variables, respectively.
Then the number of patients cured with drug 1 and 2, after t couples, is X1 + · · · + Xt and
Y1 + · · · + Yt , respectively. We are interested in the first time t such that Dt ≥ A or Dt ≤ −B,
where Dt is defined as
Dt = (X1 + ... + Xt) − (Y1 + ... + Yt).
The stochastic process (Dt)t≥0 is a Markov chain because, given Dt, the value of Dt+1 depends only on the outcome for the (t + 1)-th couple of patients, (Xt+1, Yt+1), which is independent of D0, . . . , Dt. Note that in this context the time parameter t of the Markov chain corresponds to the sample size in a statistical test.
What are the transition probabilities of (Dt )t≥1 ? The chain can only increase by 1, decrease
by 1, or stay where it is. The transitions from Dt to Dt+1 depend on the value of Xt+1 and Yt+1 .
In particular, we have

Pii := Pr(Dt+1 = i | Dt = i) = Pr({Xt+1 = 0, Yt+1 = 0} ∪ {Xt+1 = 1, Yt+1 = 1})


= Pr(Xt+1 = 0, Yt+1 = 0) + Pr(Xt+1 = 1, Yt+1 = 1)
= (1 − P1 )(1 − P2 ) + P1 P2 ,
Pi,i+1 := Pr(Dt+1 = i + 1 | Dt = i) = Pr(Xt+1 = 1, Yt+1 = 0) = P1 (1 − P2 ) ,
Pi,i−1 := Pr(Dt+1 = i − 1 | Dt = i) = Pr(Xt+1 = 0, Yt+1 = 1) = P2 (1 − P1 ) ,

and Pij = 0 if |i − j| > 1.


Given some values of P1 and P2 , we want to find

Pr(test asserts P1 > P2 ) = Pr((Dt )t≥0 hits A before −B | D0 = 0),

and Pr(test asserts P1 < P2 ) = 1 − Pr(test asserts P1 > P2 )4 . To compute these probabilities we
can proceed as in Example 1.5.2. First, we define

h(i) := Pr((Dt)t≥0 hits A before −B | D0 = i)    i ∈ {−B, . . . , A} .


4 Note that the recurrence and transience theory tells us that Dt will eventually hit either A or −B with probability 1, which implies that the test will always terminate in finite time and that Pr(test asserts P1 < P2) = 1 − Pr(test asserts P1 > P2).

Conditioning on D1 we have

h(i) = Pi,i+1 h(i + 1) + Pi,i−1 h(i − 1) + Pii h(i)    i ∈ {−B + 1, . . . , A − 1}

which implies

h(i) = p h(i + 1) + (1 − p) h(i − 1),    i ∈ {−B + 1, . . . , A − 1} ,

where p := Pi,i+1/(1 − Pii) and 1 − p = Pi,i−1/(1 − Pii). We thus obtain the system of equations

h(i) = p h(i + 1) + (1 − p) h(i − 1)    i ∈ {−B + 1, . . . , A − 1}
h(−B) = 0    (2.1)
h(A) = 1

These are the same set of equations we had in the Gambler's ruin example (see (1.22)) modulo the different boundary conditions. Solving them analogously to what was done for (1.22) gives

Pr(test asserts P1 > P2) = Pr((Dt)t≥0 hits A before −B | D0 = 0)
    = Pr(Gambler reaches A + B | Gambler starts from B)
    = (1 − θ^B) / (1 − θ^(A+B))    with θ := (1 − p)/p = (1 − P1)P2 / (P1(1 − P2)) .

For example, if P1 = 0.6 and P2 = 0.4, the probability of wrongly concluding that P2 > P1 is 0.017 if A = B = 5, and decreases to 0.0003 if A = B = 10. Here A and B act as confidence parameters: the larger they are, the smaller the probability of an error but the larger the required sample size.
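As a sanity check (not part of the original notes), the following Python sketch simulates the sequential test and estimates the error probability by Monte Carlo; with P1 = 0.6, P2 = 0.4 and A = B = 5 the estimate should be close to the value 0.017 computed above.

```python
import numpy as np

# A minimal sketch: simulate the sequential test of Example 2.2.1 and estimate
# the probability of wrongly concluding that drug 2 is better.
rng = np.random.default_rng(0)
P1, P2, A, B = 0.6, 0.4, 5, 5
n_runs, wrong = 10_000, 0

for _ in range(n_runs):
    D = 0
    while -B < D < A:                       # keep treating couples of patients
        X = rng.random() < P1               # patient on drug 1 cured?
        Y = rng.random() < P2               # patient on drug 2 cured?
        D += int(X) - int(Y)                # update D_t
    wrong += (D <= -B)                      # test asserted P1 < P2

print(wrong / n_runs)                       # should be close to 0.017 for A = B = 5
```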
Remark
The above is a simple example of sequential testing, where the sample size for a given statis-
tical test is not fixed in advance, but rather samples are collected sequentially until a given
criterion is met. The distinction is important, because if one collects samples sequentially but performs testing as if the sample size had been fixed in advance, then in most cases one ends up with an invalid test (even a test that eventually rejects the null hypothesis with probability one when it is true). The theory and methodology of sequential testing
is based on stochastic processes (where time corresponds to sample size). An example of a
popular and important sequential testing procedure is the sequential probability ratio test
(SPRT). For more details see the discussion in the in-class tablet notes on blackboard, and
more generally the literature on “sequential hypothesis testing” or ”sequential analysis”, or
for example applications to “sequential A/B testing”.

2.3 Hidden Markov models


In Hidden Markov Models (HMMs), or State Space Models (SSMs), we assume there is some underlying quantity (Xt)t≥0 that evolves over time but is not observed. What we observe, at each time
t, is a signal St that depends on Xt . We want to use the information contained in the sequence of
signals (S0 , S1 , ..., St ) to make inferences about the underlying quantity (X0 , X1 , ..., Xt ).
More precisely, we consider a mathematical model with the following ingredients:

• a latent Markov chain (Xt )t≥0 with state space X , transition probabilities (Pij )i,j∈X , and
starting distribution α = (αi )i∈X , i.e. such that αi = Pr(X0 = i);
• a set of possible signals S;
• at each time t = 0, 1, 2, . . . , a signal St is emitted. St is a random variable whose distribution
depends on the current state of the latent process Xt , and it is denoted as f (St | Xt ). More
precisely we assume that for all t ≥ 0

Pr(St = s | Xt = j) = f (s | j) s ∈ S, j ∈ X ,

and the following conditional independence holds

St ⊥ (Xi, Si)i≠t | Xt .

Thus, the structure of an HMM can be represented graphically as follows:

  latent variables:   X0 → X1 → X2 → X3 → ...
                      ↓    ↓    ↓    ↓
  signals:            S0   S1   S2   S3

NB: this is a graph representing the (conditional dependence) relationship between the random
variables involved in a HMM. It says two things: first that (Xt )t≥0 evolves as a Markov chain; and
second that each signal St depends on Xt and conditionally on Xt is independent of the rest.
To sum up, an HMM assumes that

Pr(X0:t = x0:t , S0:t = s0:t ) = αx0 Px0 x1 . . . Pxt−1 xt f (s0 | x0 ) . . . f (st | xt ). (2.2)

for every t ≥ 0, x0:t := (x0 , ..., xt ) ∈ X t+1 and s0:t := (s0 , . . . , st ) ∈ S t+1 .

Remark
• HMMs are a widely used class of models, and one of the most natural class of models
for unobserved quantities that evolve over time. A typical application is time series
analysis or signal processing. Here (Xt )t≥0 is some quantity of interest, and St is a
noisy observation of Xt (such as St = Xt + ε, with ε ∼ N(0, σ2) when X = R). Here we
first consider HMMs with finite state spaces, and then discuss continuous ones in the
Kalman filter section.

Example 2.3.1: Production process

Consider a production process that in each period is either in a good state (state 1), or in a
poor state (state 2). If the process is in state 1 during a period then, independent of the past,
with probability 0.9 it will be in state 1 during the next period, and with probability 0.1 it
will be in state 2. Once in state 2, it remains in that state forever. Suppose that a single
item is produced each period, and that each item produced when the process is in state 1 is

of acceptable quality with probability 0.99, while each item produced when the process is in
state 2 is of acceptable quality with probability 0.96.

The HMM formulation of the previous example is the following: for each t ≥ 0 the latent
variable Xt is the state of the process, i.e.
(
1 if the process is in good state at time t
Xt =
2 if the process is in poor state at time t.

(Xt )t≥0 evolves according to a Markov chain on the state space X = {1, 2} with transition proba-
bilities

P = ( 0.9, 0.1 ; 0, 1 ) ,

where rows are separated by semicolons.
The signal St is the quality of the item produced at time t, with S = {h, l} and
(
h if the item produced at time t has high quality
St =
l if the item produced at time t has low quality.

The distribution of the signal as a function of the state is

p(h | 1) = 0.99, p(l | 1) = 0.01    and    p(h | 2) = 0.96, p(l | 2) = 0.04.

Example 2.3.2: Production process (continued)

In the production process example, one could be interested in the following quantities:
1. What is the probability that the process is in poor state at time 2 given that we observed
(S0 , S1 , S2 ) = (h, l, h)?
need to compute Pr(X2 = 2 | S0:2 = (h, l, h));
2. What is the probability that the process will be in poor state at the next time?
need to compute Pr(X3 = 2 | S0:2 = (h, l, h));

3. What is the probability that the next item produced has high quality?
need to compute Pr(S3 = h | S0:2 = (h, l, h))

The above are examples of some of the typical distributions of interest in HMMs, which are
usually referred to as:
• Filtering distribution:

Pr(Xt = j | S0:t = s0:t ) j∈X,

i.e. when we want to make inferences on the latent process at time t given the observed
signals S0:t .

• Predictive distributions: both for the latent process

Pr(Xt+1 = j | S0:t = s0:t ) j∈X,

and for the signal itself

Pr(St+1 = s | S0:t = s0:t ) s∈S.

NB: one may also want to compute predictions k-steps ahead, e.g. Pr(Xt+k = j | S0:t = s0:t ).
Here we focus on 1-step ahead for simplicity.
These can be efficiently computed using recursive algorithms, such as the forward (filtering) recursion and the Viterbi algorithm for finite HMMs, and the Kalman filter for Gaussian HMMs. We do not cover them in this year's syllabus. The algorithms can be found in the appendix of these notes.
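For finite state spaces, the filtering distributions can be computed with a simple forward recursion. The following Python sketch (a minimal illustration, not the appendix algorithm) applies it to the production process of Examples 2.3.1-2.3.2, assuming the latent process starts in the good state.

```python
import numpy as np

# A minimal sketch of the forward (filtering) recursion for the production process HMM.
P = np.array([[0.9, 0.1],
              [0.0, 1.0]])                      # transition matrix of the latent chain
f = {'h': np.array([0.99, 0.96]),               # f(h | state 1), f(h | state 2)
     'l': np.array([0.01, 0.04])}               # f(l | state 1), f(l | state 2)
alpha0 = np.array([1.0, 0.0])                   # assumption: the process starts in the good state

def filtering(signals):
    """Return Pr(X_t = . | S_{0:t} = signals) after observing all signals."""
    post = alpha0 * f[signals[0]]
    post /= post.sum()
    out = [post]
    for s in signals[1:]:
        pred = post @ P                         # predict: Pr(X_{t+1} = . | S_{0:t})
        post = pred * f[s]                      # update with the new signal
        post /= post.sum()
        out.append(post)
    return out

print(filtering(['h', 'l', 'h'])[-1])           # Pr(X_2 = . | S_{0:2} = (h, l, h))
```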
HMMs are often used in continuous spaces. Consider for example the following 1-dimensional
linear case. We assume X = S = R, transition kernel5 given by Pr(x, ·) = N (x, ση2 ) and signal
distribution f (x, ·) = N (x, σe2 ). Equivalently, for t = 0, 1, 2, . . . assume

Xt+1 = Xt + ηt ηt ∼ N (0, ση2 ) (2.3)


St = Xt + et et ∼ N (0, σe2 ) (2.4)

where (ηt )t≥0 and (et )t≥0 are two independent sequences of iid random variables.
We assume that (St )t≥0 is observed while (Xt )t≥0 is not. The task is to make inferences about
(Xt )t≥0 using (St )t≥0 . More specifically, we aim to compute the following quantities, where s0:t
denotes the observed signal:
• Marginal likelihood

  p(S0:t = s0:t) = ∫_{R^{t+1}} p(X0:t = x0:t, S0:t = s0:t) dx0:t    (2.5)

• Filtering distribution

  p(Xt = x | S0:t = s0:t) = ∫_{R^t} p(X0:t = x0:t, S0:t = s0:t) dx0:(t−1) / p(S0:t = s0:t)    x ∈ R

• Smoothing distribution

  p(Xi = x | S0:t = s0:t)    i ∈ {0, . . . , t}

The Kalman filter provides an efficient way to compute the above quantities for linear HMMs. Each recursion of the filter consists of two steps: a predict step that goes from L(Xt−1 | S0:(t−1)) to L(Xt | S0:(t−1)); and an update step that goes from L(Xt | S0:(t−1)) to L(Xt | S0:t).6 In the Gaussian case, these steps can be computed in closed form (i.e. they boil down to parametric updates), thanks to the closure of Gaussian kernels with respect to convolution (i.e. sums of r.v.s) and multiplication (i.e. conditioning of r.v.s). See the appendix for full details.

5 We will more formally define those later. Basically, the above means that the distribution of Xt given Xt−1 = x is N(x, ση2).
6 Here we use L to denote conditional distributions and p to denote conditional densities.
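The following Python sketch (hypothetical parameter values and made-up data) implements the scalar predict/update recursion for the model (2.3)-(2.4); it is only an illustration of the two steps described above, not the full appendix algorithm.

```python
import numpy as np

# A minimal sketch of the 1-d Kalman filter for the random-walk model (2.3)-(2.4).
sigma_eta2, sigma_e2 = 0.5, 1.0          # assumed variances of eta_t and e_t
s = np.array([0.3, 1.1, 0.8, 2.0])       # observed signals s_{0:3} (made-up data)

m, v = 0.0, 10.0                          # assumed prior mean/variance for X_0
for t, st in enumerate(s):
    # update step: from L(X_t | S_{0:t-1}) = N(m, v) to L(X_t | S_{0:t})
    k = v / (v + sigma_e2)                # Kalman gain
    m = m + k * (st - m)
    v = (1 - k) * v
    print(f"t={t}: filtering mean {m:.3f}, variance {v:.3f}")
    # predict step: from L(X_t | S_{0:t}) to L(X_{t+1} | S_{0:t}) = N(m, v + sigma_eta2)
    v = v + sigma_eta2
```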

Remarks
• (Linear cost) Given some observations s0:T ∈ RT +1 , the forward pass of the Kalman
Filter (i.e. Algorithm 2) allows to compute the filtering distributions {L(Xt | S0:t =
s0:t)}t=0,...,T with O(T) computational cost. This is an example of an "online algorithm", i.e. an algorithm where, once you get a new observation st, updating your knowledge/inference/estimate from s0:(t−1) to s0:t has O(1) cost. In other words, in an online algorithm, the cost of processing a new observation does not increase with time.

• (Backward pass) One is often interested in computing not only the filtering distribution
but also the ”smoothing distributions”, i.e. {L(Xt | S0:T = s0:T )}t=0,...,T , which provide
better predictions of Xt given s0:T , since they condition on all the available information
including the ”future” of Xt . These can be efficiently computed with a ”backward pass”
of the Kalman Filter, which we do not discuss in details for brevity. Conceptually, it
is similar to Algorithm 2 but goes from t = T to t = 1 and allows to compute the
parameters of the distributions {L(Xt | S0:T = s0:T )}t=0,...,T with O(T ) cost.
• (General dynamic Gaussian models) The Kalman filter is a default and fundamen-
tal ”smoothing” tool in signal processing, time series analysis, tracking, etc. In the
above section, we only discussed the 1-dimensional, intercept-only version to avoid over-
complicating the calculations. More generally, the Kalman filter is used with more
general dynamic linear models such as

Xt+1 = At Xt + ηt ηt ∼ N (0, Σt ) (2.6)


St = Bt Xt + et et ∼ N (0, Rt ) (2.7)

where (Xt)t≥0 and (St)t≥0 are Rd-valued and Rm-valued stochastic processes, (Σt)t≥0 and (Rt)t≥0 are d × d and m × m covariance matrices, and (At)t≥0 and (Bt)t≥0 are d × d and m × d regression matrices. The forward and backward passes extend to this more general context, and still require a cost linear in T.
• (Parameter estimation) Above we assumed both the transition kernel Pr(x, ·) and the signal distribution f(x, ·) to be fully known. This is usually not the case in practice, where it is more common for those to be specified only in terms of unknown parameters. For example, in the 1d example, we may still assume Pr(x, ·) = N(x, ση2) and f(x, ·) = N(x, σe2) but treat θ := (ση, σe) as unknown parameters. In such a context, the most common approach is to estimate θ by maximizing the marginal likelihood, i.e. setting θ to θ̂MML = arg maxθ Pr(S0:T = s0:T | θ). In these cases the Kalman filter can also be used to compute efficiently the marginal likelihood Pr(S0:T = s0:T | θ) required to search for θ̂MML with some optimization algorithm. Similarly to smoothing distributions, the
recursive equations to compute marginal likelihoods are conceptually similar to the ones
in Algorithm 2 and we do not discuss them here for brevity.
Chapter 3

Continuous-time Markov chains

In this chapter we consider stochastic processes

{X(t), t ∈ [0, +∞)} = (X(t))t∈[0,∞)

with continuous time, t ∈ [0, +∞), and discrete space (i.e. either finite, e.g. X = {0, 1}, or count-
able, e.g. X = {0, 1, 2, . . . }).

3.1 Background and tools


It turns out, for reasons that we will see later, that a building block of many continuous-time
processes is the notion of hazard rate function and the exponential distribution. Thus we start by
revising those notions. Broadly speaking, the importance of the hazard rates and exponential dis-
tribution comes from them being natural tools to describe and manipulate time-to-event variables,
as we will see.

3.1.1 Exponential distribution


(Exponential distribution)
Recall: a continuous random variable X has exponential distribution with rate λ > 0 if its
pdf is (
λe−λx x > 0
f (x) =
0 x≤0
In such a case we write X ∼ Exp(λ). The corresponding cdf of X is

F(x) = ∫_{−∞}^{x} f(y) dy = { ∫_0^x λ e^{−λy} dy if x ≥ 0; 0 if x < 0 } = { 1 − e^{−λx} if x ≥ 0; 0 if x < 0 }.

The mean and variance of X can be easily computed using the above expression of f(x) and are E[X] = 1/λ and Var(X) = 1/λ². Thus, the larger the rate λ, the smaller the typical value of the random variable X ∼ Exp(λ).
The exponential distribution satisfies the memoryless property.

Definition 8. A (0, ∞)-valued random variable X has the memoryless property (MP) if

Pr(X > s + t | X > t) = Pr(X > s) t, s > 0

To understand the intuition behind the MP it is useful to think of X as the time until an
event occurs. For example:


• if X ∼ Exp(λ) is the lifetime of a component, then the memoryless property means that
if the component did not break by time t, the probability of breaking in the next s units of time is the same as that of breaking in the first s units of time.
• if X ∼ Exp(λ) is the time until somebody enters the room, the memoryless property
means that if nobody has entered yet, the probability of somebody entering in the next
second does not depend on the current time.

3.1.2 Hazard rates


When a random variable X represents the lifetime of some item or the time until an event occurs,
it is useful to introduce the hazard rate function (hrf) r(t), which is defined as

f (t)
r(t) = t ≥ 0, (3.1)
1 − F (t)

where f and F are the pdf and cdf of X, respectively1 . Intuitively, r(t) represents the probability
(or better the probability density) of the event happening at time t given that it has not yet
happened before t. More precisely, we have
Pr(X ∈ (t, t + dt) | X > t) = Pr(X ∈ (t, t + dt)) / Pr(X > t) = ∫_t^{t+dt} f(x) dx / (1 − F(t))
    = f(t)/(1 − F(t)) dt + o(dt) = r(t) dt + o(dt)    as dt ↓ 0 .
1 − F (t)

The hazard rate function r(t) uniquely characterizes the distribution of the corresponding ran-
dom variable, as shown in the following proposition. For notational convenience, we denote the
survival function of a r.v. X as

F̄ (t) := Pr(X > t) = 1 − F (t) t ∈ R. (3.2)

Proposition 10. Let X be a (0, ∞)-valued random variable with hazard rate function r. Then its survival function F̄ satisfies

F̄(t) = e^{−∫_0^t r(s) ds}    t ≥ 0. (3.3)

Proof. By (3.1) and F̄′ = −F′ = −f we have

r(t) = −F̄′(t)/F̄(t) = −(d/dt) log F̄(t)    t > 0. (3.4)

Combining the above with F̄(0) = 1 we have

∫_0^t r(s) ds = −log F̄(t) + log F̄(0) = −log F̄(t)    t > 0, (3.5)

which implies (3.3).


1 Throughout the section we assume that the pdf f exists and is well-defined, meaning that e.g. X is a continuous random variable with probability zero of being equal to any single t > 0.

Since the hazard rate function uniquely characterises the distribution of X, one can also take
(3.3) directly as the definition of the hazard rate function (i.e. the hrf r is the function r : (0, ∞) → [0, ∞)
such that (3.3) is satisfied).

Example 3.1.1: Exponential distribution - constant hazard

The survival function of X ∼ Exp(λ) is

F̄ (t) = e−λt t > 0, (3.6)

and its hazard rate function is

r(t) = f(t)/F̄(t) = λe^{−λt}/e^{−λt} = λ    t > 0.

Thus the exponential distribution has constant hazard rate.


Intuition: at each time t there is probability λdt of the event happening in the next dt interval,
regardless of the value of t. Comparing this intuition with the one of the MP discussed after
Definition 8, we can see that this is indeed the same property as the MP. Indeed, having a constant
hazard rate function is in one-to-one correspondence with satisfying the MPa . Since the hazard
rate function r(t) uniquely characterizes the distribution of the corresponding random variable
(Proposition 10), the exponential is the only distribution with constant hazard rate, as the
following proposition states.
Proposition 11. A (0, ∞)-valued random variable X has the memoryless property if and
only if X ∼ Exp(λ) for some λ > 0.
a In the interest of brevity, we do not provide a formal proof of this connection and just state it as a result.

When X represents a time-to-event variable (e.g. the lifetime of an item), the hazard rate function can be a more intuitive way of characterizing the distribution of X than its pdf. See discussion
and figures in the in-class notes.
In some situations, the memoryless property allows to compute probabilities of interest without
the need of explicit calculations, see for example Exercise 47 in the list.

3.1.3 First events properties


Both hazard rate functions in general, and exponential distributions in particular, have various
properties that make them natural and convenient mathematical objects to describe time-to-event
variables. Here we consider one property related to times of first events.

Proposition 12. Let (Xi )i=1,...,n be independent (0, ∞)-valued random variables with hazard rate
functions ri for i = 1, . . . , n. Then:

1. (First Event) The random variable

X := min_{i=1,...,n} Xi

has hazard rate function given by


n
X
r(t) = ri (t) t > 0. (3.7)
i=1

2. (Hazard race) The random variable

I := arg min_{i=1,...,n} Xi

satisfies

Pr(I = i | X = t) = ri(t) / ∑_{j=1}^n rj(t)    i = 1, . . . , n; t > 0 . (3.8)

Proof. Part 1: by definition of X, we have

F̄(t) := Pr(X > t) = Pr(∩_{i=1}^n {Xi > t}) = ∏_{i=1}^n Pr(Xi > t) =: ∏_{i=1}^n F̄i(t) , (3.9)

where F̄ and F̄i denote the survival functions of X and Xi, respectively. By Proposition 10 we thus have

F̄(t) = ∏_{i=1}^n exp(−∫_0^t ri(s) ds) = exp(−∫_0^t r(s) ds)    with r(s) := ∑_{i=1}^n ri(s) , (3.10)

meaning that r(s) is the hazard rate function of X.


Part 2: consider the joint distribution of X and I, which we denote as Pr(I = i, X = t).2 By definition of X and I, for every i ∈ {1, . . . , n} and t > 0, we have

Pr(I = i, X = t) = Pr(Xi = t, Xj > t for all j ≠ i) . (3.11)

By independence of X1, . . . , Xn,

Pr(I = i, X = t) = fi(t) ∏_{j≠i} F̄j(t) = ri(t) F̄(t) , (3.12)

where fi denotes the pdf of Xi. Thus

Pr(I = i | X = t) = Pr(I = i, X = t) / ∑_{j=1}^n Pr(I = j, X = t) = ri(t) / ∑_{j=1}^n rj(t) (3.13)

as desired.
2 Technically, one should think of Pr(I = i, X = t) as a pmf with respect to (w.r.t.) i and a pdf w.r.t. t, meaning that we need to sum w.r.t. i and integrate w.r.t. t when we want to obtain probabilities of events. In other words Pr(I ∈ A, X ∈ B) = ∑_{i∈A} ∫_B Pr(I = i, X = t) dt for every A ⊆ {1, . . . , n} and B ⊆ R.

If X1, . . . , Xn are the times of n events, X = mini Xi is the time of the first event. For example, if n students are trying to solve an exercise and the i-th student takes Xi time, then X is the time until somebody solves the exercise.
Intuition: the hazard rate of the event X is r(t) = ∑i ri(t). This is just the sum of the hazard rates of the random variables Xi. This is quite intuitive if you think of the interpretation of hazard rates and the fact that as soon as some Xi happens, then X happens as well.
When applied to the case of the Exponential distribution, Proposition 12 leads to the following corollary.

Corollary 1. Let Xi ∼ Exp(λi ) independently for i = 1, . . . , n. Then the r.v.s

X := min_{i=1,...,n} Xi    and    I := arg min_{i=1,...,n} Xi

have distributions

X ∼ Exp(λ)    with λ = ∑_{i=1}^n λi (3.14)
Pr(I = i) = λi / ∑_{j=1}^n λj    i = 1, . . . , n . (3.15)

Moreover X ⊥ I, i.e. they are independent r.v.s.

Proof. Follows directly from Proposition 12 and Example 3.1.1.

The properties in Corollary 1 allow one to perform computations in the context of probabilistic models with time-to-event variables (e.g. queuing systems), see e.g. Exercise 48 in the list.
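As a quick numerical check (not part of the original notes), the following Python sketch verifies (3.14)-(3.15) by simulation for some hypothetical rates.

```python
import numpy as np

# A minimal sketch checking Corollary 1 by simulation (hypothetical rates).
rng = np.random.default_rng(1)
lam = np.array([1.0, 2.0, 3.0])                 # rates lambda_1, lambda_2, lambda_3
samples = rng.exponential(1 / lam, size=(100_000, 3))

X = samples.min(axis=1)                         # first event time
I = samples.argmin(axis=1)                      # index of the first event

print(X.mean(), 1 / lam.sum())                  # both close to 1/6, as in (3.14)
print(np.bincount(I) / len(I), lam / lam.sum()) # empirical vs lambda_i / sum_j lambda_j, as in (3.15)
```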

3.2 Markov property and transition probabilities


We will focus our attention on processes (X(t))t∈[0,∞) satisfying the Markov property.

Definition 9 (Markov property). A continuous-time stochastic process (X(t))t∈[0,∞) with discrete


space X satisfies the Markov property if

Pr(X(t + s) = j | X(s) = i, X(u) = x(u) for 0 ≤ u ≤ s) = Pr(X(t + s) = j | X(s) = i),

for all i, j ∈ X and s, t ≥ 0.

When X is discrete, we refer to continuous-time stochastic processes satisfying the above defi-
nition as continuous-time Markov chains (CTMC).
Intuition: similarly to the discrete-time case, the Markov property above means that the future
depends on the past only through the present.
A CTMC (X(t))t∈[0,∞) is time-homogeneous (or it has stationary transition probabilities) if

Pij (t) := Pr(X(t + s) = j | X(s) = i) i, j ∈ X ; s, t ≥ 0 (3.16)

do not depend on s. In the following, we will always assume that the CTMCs under consideration
are time-homogeneous, unless stated otherwise.

As done for their discrete-time version, we arrange the transition probability functions Pij (t)
in a matrix form leading to

P (t) = (Pij (t))i,j∈X t > 0.

The above transition probabilities satisfy a continuous-time analogue of the Chapman-Kolmogorov


(CK) equations of Proposition 1.

Proposition 13. The transition probabilities in (3.16) satisfy

Pij(s + t) = ∑_{k∈X} Pik(s) Pkj(t)    i, j ∈ X ; s, t ∈ [0, ∞) . (3.17)

Proof. Analogous to the one of Proposition 1.

When |X | < ∞, (3.17) can be written in terms of products of |X | × |X | matrices, leading to the
following compact notation for the CK equations

P (t + s) = P (t)P (s) s, t ∈ [0, ∞) . (3.18)

3.2.1 Generator/Jump rates


In discrete time, the CK equations imply that knowledge of P , or equivalently P 1 , determines the
evolution of the MC, i.e. determines P n for all n = 0, 1, 2, . . . . In continuous-time, there’s no mini-
mal discrete unit of time (such as t = 1 for discrete time) and thus the ”unitary” transition matrix
P (1) does not play such an important role in continuous time. Instead, t can be infinitesimally
small in continuous time and in this case the CK equations imply that knowing the behaviour of (P(t))t∈[0,ε) for any ε > 0 is enough to uniquely determine P(t) for any t ∈ [0, ∞)3. This means
that the evolution of (P (t))t∈[0,∞) is governed by the behaviour of P (t) for t in a neighbourhood
of 0. Indeed, it turns out that the fundamental quantities determining the evolution of CTMCs
are the following derivatives at t = 0.

Definition 10. We define

Qij := lim_{h→0} Pij(h)/h    i, j ∈ X with i ≠ j (3.19)

as the jump rate (or instantaneous transition rate) from i to j of the CTMC (X(t))t∈[0,∞).

By definition, we can interpret Qij as

Pij (h) = Pr(X(t + h) = j | X(t) = i) = Qij h + o(h) as h → 0 , (3.20)

meaning that Qij can be interpreted as the ”hazard rate” at which the Markov chain X(t) currently
at i might jump to j in the next infinitesimal instant.
The jump rates are the natural quantities, or parameters, that are typically used to specify
CTMC models, as shown in the examples of Section 3.3.
3 This can be deduced e.g. by writing P(t) = P(ε)^n P(t − nε), where n = ⌊t/ε⌋ is the integer part of t/ε.

3.3 Examples of CTMCs


We now provide some examples to illustrate expressions for the jump rates that naturally arise in
common modeling scenarios.

Example 3.3.1: Binary chain

Let X = {0, 1} with transition rates Q01 = λ and Q10 = µ. Here λ and µ are two free
parameters in [0, ∞).

Example 3.3.2: Poisson process

The one-dimensional Poisson Process, which we denote as (X(t))t∈[0,∞) ∼ PP(λ), is a CTMC


with X = N and jump rates

Qi,i+1 = λ for all i ∈ X, and Qij = 0 otherwise. (3.21)
This is an important process with many generalizations, we will return to that later.

Example 3.3.3: Birth process with linear birth

Suppose that X(t) represents the size of a population at time t and each member of a popu-
lation gives birth to a new individual at rate λ (and individuals never die). Let

X(t) = “# of individuals at time t” .

Then (X(t))t∈[0,∞) is naturally modelled as a CTMC with jump rates


(
iλ if j = i + 1 ≥ 0
Qij =
0 otherwise .

The above jump rates are saying two things:


(1) The process X(t) only jumps up by 1, i.e. Qij = 0 unless j = i + 1. This follows from the fact that individuals in the population never die (thus X(t) does not go down) and that birth events happen in continuous time (and thus the probability of two or more events happening at exactly the same instant, which would correspond to X(t) jumping by 2 or more, is 0).
(2) The jump rate from i to i + 1 depends linearly on i, i.e. Qi,i+1 = iλ. This follows from
thinking at the time to a new arrival in a population of size i as

T = min (T1 , . . . , Ti ) ,

where each Tj for j = 1, . . . , i is the time taken by the j-th individual in the population to
give birth to a new individual. The underlying assumption is that Tj ∼ Exp(λ) independently
across j, which then implies T ∼ Exp(iλ) by (3.14). See Section 3.5 for more discussion about
the connection between CTMCs and Exponential distributions.

Example 3.3.4: Process with birth, death and immigration

Consider now a more general version of the previous example, where individuals give birth at
rate λ, die at rate µ and also new individuals immigrate into the population at rate θ (i.e.
on average individuals live 1/µ units of time, give birth every 1/λ units of time, and θ individuals immigrate into the population per unit of time). Then,

X(t) = “# of individuals at time t” .

can be naturally modelled as a CTMC with

Qij = iλ + θ if j = i + 1, Qij = iµ if j = i − 1 ≥ 0, and Qij = 0 otherwise,    i, j ∈ X .

The expression above can be obtained with reasoning similar to Example 3.3.3 as above. Note
that the birth and death rates get multiplied by the population size (since these are individual
effects that sum up to the population level) while the immigration term does not depend on
the population size (assuming immigration to be exogenous and not directly dependent on
the inner population dynamics).
Here, a typical question of interest is: what is the distribution of the population size for
large t? How does it depend on λ, µ and θ?

3.4 Kolmogorov differential equations


In many cases, given a CTMC X = (X(t))t∈[0,∞) , we are interested in computing transition
probabilities Pij (t). For example, in Examples 3.3.3 and 3.3.4 context, what is the probability of
having j individuals in the population in t units of time from now given a current population of
size i? To answer, we need to compute Pij (t).
In general, we want to compute P(t) = (Pij(t))i,j∈X starting from (Qij)i,j∈X, i≠j, or equivalently from (νi)i∈X and K = (Kij)i,j∈X (introduced in Section 3.5). The tricky aspect in doing that (which makes it harder than,
e.g. computing n-step transition probabilities for discrete-time chains) is that the transitions of
(X(t))t∈[0,∞) happen at random times and we do not know how many transitions led us from i to
j in t units of time, see e.g. Figure 3.1 below for a visual illustration.

3.4.1 Kolmogorov backward and forward equations


We now derive differential equations that describe the evolution of P(t) as a function of the jump rates. To do that, it is convenient to define Qii = −∑_{j≠i} Qij for every i ∈ X, and arrange the jump rates into a full |X| × |X| matrix

Q = (Qij)i,j∈X .

We will call such a matrix the Q-matrix, or generator matrix, of the CTMC X. The reason for setting the diagonal terms of Q equal to −νi := −∑_{j≠i} Qij comes from the following lemma.

Lemma 5. Q = P′(0), meaning that

Qij = (d/dt) Pij(t) |t=0    i, j ∈ X . (3.22)

Proof. By (3.19) and P(0) = I we have

Qij = lim_{h→0} Pij(h)/h = lim_{h→0} (Pij(h) − Pij(0))/h = (d/dt) Pij(t) |t=0    i ≠ j . (3.23)

By ∑_{j∈X} Pij(t) = 1, (3.23) and the definition of Qii we have

P′ii(0) = −∑_{j≠i} P′ij(0) = −∑_{j≠i} Qij = Qii    i ∈ X . (3.24)

Remark
Note that Q-matrices satisfy ∑_{j∈X} Qij = 0 by construction. Indeed, any matrix Q with Qij ≥ 0 for i ≠ j and Qii = −∑_{j≠i} Qij (or equivalently ∑_{j∈X} Qij = 0 for every i) is a valid Q-matrix with a well-defined corresponding CTMC (X(t))t∈[0,∞).

We have two sets of differential equations that describe the evolution of P (t) as a function of Q,
called the Kolmogorov equations (KE).
Theorem 7 (Kolmogorov backward and forward equations). Let X be a CTMC with |X | < ∞
and well-defined Q-matrix4 . Then P satisfies the Kolmogorov backward equations (KBE)
P′ij(t) = ∑_{k≠i} Qik Pkj(t) − νi Pij(t) = (QP(t))ij    t ≥ 0, i, j ∈ X ,

and the Kolmogorov forward equations (KFE)

P′ij(t) = ∑_{k≠j} Qkj Pik(t) − νj Pij(t) = (P(t)Q)ij    t ≥ 0, i, j ∈ X .

Proof. By the CK equations and the associativity of matrix product, we have

P (t + h) − P (t) = P (h)P (t) − P (0)P (t) = (P (h) − P (0))P (t) t, h ≥ 0 . (3.25)

It follows that

P′(t) = lim_{h→0} (P(t + h) − P(t))/h = lim_{h→0} (P(h) − P(0))P(t)/h (3.26)
     (∗)= ( lim_{h→0} (P(h) − P(0))/h ) P(t) = QP(t)    t ≥ 0, (3.27)
4 i.e. the limit Qij defined in (3.19) exists and belongs to [0, ∞) for every i ≠ j. NB: this is automatically satisfied if the CTMC is defined through the jump chain and holding times construction.

where in the equation denoted by (∗) we exchanged a limit and a matrix product (i.e. a limit and
a sum) which is justified since |X | < ∞ and each limit inside the sum exists.
The proof of the KFE is analogous using the decomposition P (t+h)−P (t) = P (t)P (h)−P (t) =
P (t)(P (h) − P (0)).
The KE can be written in compact matrix form, i.e.

KFE : P′(t) = P(t)Q    t ≥ 0, (3.28)
KBE : P′(t) = QP(t)    t ≥ 0. (3.29)
In principle, they allow to compute Pij (t) starting from Q. In some cases the KE can be solved in
a fully analytic way to obtain explicit expressions for Pij (t).

Example 3.4.1: Binary chain

Let X = {0, 1} with transition rates Q01 = λ and Q10 = µ. Then, the KBE becomea

P′00(t) = λP10(t) − λP00(t)
P′10(t) = µP00(t) − µP10(t) .

Combining those with the starting conditions

P00(0) = 1 = P11(0)
P01(0) = 0 = P10(0) ,

one can solve them analytically to obtain

P10(t) = µ/(µ + λ) (1 − e^{−(µ+λ)t})
P01(t) = λ/(µ + λ) (1 − e^{−(µ+λ)t})    t ≥ 0.

This, for example, implies that limt→∞ P10(t) = µ/(µ + λ) = 1 − limt→∞ P11(t) and limt→∞ P01(t) = λ/(µ + λ) = 1 − limt→∞ P00(t), which means that, at least in this example, P(t) converges to a limiting distribution (µ/(µ + λ), λ/(µ + λ)). Note also that such limit does not change if all the jump rates in Q are multiplied by a fixed constant, i.e. Q is replaced by cQ with c > 0.
a Here we keep 2 out of 4 equations, since the other two are a function of those.

Example 3.4.2: Poisson process (KE)

If (X(t))t∈[0,∞) ∼ PP(λ) then the closed-form expressions for the transition probabilities

Pij(t) = I(j ≥ i) e^{−λt} (λt)^{j−i} / (j − i)!

can be verified to be the (unique) solutions to the KE with jump rates given by

(3.21). See e.g. Example 2.1.4 of Norris [1998].

In general, however, the KE are hard to solve in a fully analytic form. Nonetheless, their
solution always admits the following matrix exponential representation

P(t) = e^{tQ} = ∑_{n=0}^{∞} (tQ)^n / n! = I + tQ + (tQ)^2/2! + ...    t ≥ 0. (3.30)

The above series representation can be shown to converge for all values of t ≥ 0 and to solve the
KE following the steps

P′(t) = (d/dt) e^{tQ} = (d/dt) ∑_{n=0}^{∞} (tQ)^n / n!
      = ∑_{n=0}^{∞} (d/dt) (tQ)^n / n!
      = ∑_{n=1}^{∞} n t^{n−1} Q^n / n!
      = Q ∑_{n=1}^{∞} (tQ)^{n−1} / (n − 1)!
      = Q ∑_{n=0}^{∞} (tQ)^n / n! = Q e^{tQ} = Q P(t)    t ≥ 0.

We do not provide formal justification of all the above steps, and refer to Theorem 2.1.1 of Norris
[1998] for more details on that.

Remarks
• The KBE and KFE are two different sets of differential equations since QM ≠ MQ for a general matrix M. However, they have the same unique solution P(t), which satisfies QP(t) = P(t)Q.

• The representation P(t) = e^{tQ} is fully general and useful in some situations, but it can be too expensive to compute numerically in various situations and/or not explicit enough for many tasks. In this sense, it can be thought of as a 'semi-explicit solution' and should not lead one to think that P(t) is always available in explicit form. It is nicely complemented by using the KE to characterize the long-run behavior of CTMCs in a more explicit way. A small numerical sketch is given below.
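As an illustration (not part of the original notes), the following Python sketch computes P(t) = e^{tQ} numerically for the binary chain of Example 3.4.1, with hypothetical rates, and compares it with the analytic expression for P01(t).

```python
import numpy as np
from scipy.linalg import expm

# A minimal sketch: P(t) = exp(tQ) for the binary chain, hypothetical rates lambda, mu.
lam, mu, t = 2.0, 3.0, 0.7
Q = np.array([[-lam, lam],
              [mu, -mu]])

P_t = expm(t * Q)                                # matrix exponential e^{tQ}
P01_exact = lam / (mu + lam) * (1 - np.exp(-(mu + lam) * t))
print(P_t[0, 1], P01_exact)                      # the two values should coincide
```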

3.5 Jump chain and holding time construction


How do trajectories of CTMCs behave? How can we construct and simulate them?
We will assume that the trajectories t 7→ X(t) of the CTMC (X(t))t∈[0,∞) under consideration
satisfy some basic regularity assumptions, which are needed to rule out ’pathological’ trajectories.
A classical assumption in stochastic process theory (often referred to as the càdlàg assumption) is that trajectories t ↦ X(t) are right-continuous with left limits, i.e.:

lim_{h↓0} X(t + h) = X(t)    t > 0 (3.31)
lim_{h↓0} X(t − h) exists    t > 0. (3.32)

This, in particular, rules out trajectories having an uncountable number of discontinuity points.
Such trajectories would arise, for example, from ’white noise’ models in continuous-time where
(X(t))t∈[0,∞) satisfies X(t) ∼ π for every t > 0 and X(t) independent of X(s) for s ≠ t (think e.g. of X = {0, 1} and π = Unif(X)). Such models are not particularly interesting from the
modeling point of view and would not allow for a fruitful mathematical treatment. Note that, on
the other extreme, requiring t 7→ X(t) to be continuous would be too restrictive (since e.g. the
only continuous trajectories when X is discrete are constant ones). Thus, we can think of càdlàg trajectories, such as those in Figure 3.1, as a compromise between generality and tractability.
In the interest of brevity, we will not discuss regularity assumptions in detail and simply assume
that our CTMC of interest has trajectories that can be represented as (3.33) for some sequence
(Yn , Tn )n∈N . This is arguably a mild assumption from the modelling point of view. We refer to
Section 2.2 of Norris [1998] for more discussion on possible behaviors of CTMCs trajectories.
By the càdlàg assumption, trajectories t ↦ X(t) are piece-wise constant and can be described
by two sequences of random variables:

1. the so-called jump chain, or skeleton chain, (Yn )n=0,1,2,... , where Yn represents the n-th state
visited by the CTMC trajectory (X(t))t∈[0,∞) .

2. the holding times (Tn )n=0,1,2,... , where Tn represents the amount of time spent in the n-th
visited state Yn .

See Figure 3.1 for an illustration. Formally, we have a one-to-one mapping between trajectories (X(t))t∈[0,∞) and sequences (Yn, Tn)n∈N given by: for every t > 0 and i ∈ X

X(t) = i ⇔ T0 + · · · + Tn−1 ≤ t < T0 + · · · + Tn and Yn = i (3.33)

for some n ∈ {0, 1, 2, . . . }. Given the above bijection, the distribution of a CTMC (X(t))t∈[0,∞)
can be conveniently described by specifying the distribution of the sequence of random variables
(Yn , Tn )n∈N . Unlike discrete time MCs, continuous time ones spend random times at each location
before making a transition to a new state.
Assume (X(t))t∈[0,∞) is a CTMC. What can we say about the distribution of the random
holding times that the process spends in each state? Suppose, e.g., that Y0 = i ∈ X and consider
the distribution of T0 . Using first the jump chain and holding times definition, and then the

Figure 3.1: A trajectory of a CTMC (X(t))t∈[0,∞) with the associated jump chain (Yn )n=0,1,... and
holding times (Tn )n=0,1,... .

Markov property and time homogeneity of (X(t))t∈[0,∞) we have

P (T0 > t + s | T0 > s, Y0 = i) = Pr(X(h) = i, h ∈ (s, s + t] | X(h) = i, h ∈ [0, s])


= Pr(X(h) = i, h ∈ (s, s + t] | X(s) = i)
= Pr(X(h) = i, h ∈ (0, t] | X(0) = i)
= P (T0 > t | Y0 = i) t, s > 0 .

Thus, the random variable T0 given Y0 = i satisfies the memoryless property and therefore it is
exponentially distributed, i.e.

(T0 | Y0 = i) ∼ Exp(νi ) for some νi > 0.

Note that the exponential rate of T0 , which we denoted as νi , is allowed to depend on the current
state i of the process. Intuitively, if νi is large, then the process will on average spend a short time
in state i at each visit to it.
When X(t) makes a jump out of Y0 = i it chooses the next state j with some probability, which
we denote as

Kij = P (Y1 = j | Y0 = i) i, j ∈ X . (3.34)


By construction (Kij)i,j∈X is a transition probability matrix satisfying ∑_{j∈X} Kij = 1 and Kii = 0 (since Y1 ≠ Y0 by construction: otherwise X(t) would be making a transition from i to i itself, and T0 would no longer be the time when the process makes a jump to a distinct state).
To sum up, given ν = (νi )i∈X and K = (Kij )i,j∈X the CTMC (X(t))t∈[0,∞) evolves according
to the following dynamics:
1. the process spend a random time T ∼ Exp(νi ) in the current location i;
2. when it jumps, the process chooses the new state j ∈ X with probability Kij .

The above recipe gives a simple and efficient algorithm to simulate trajectories of CTMCs, which is often referred to as the Gillespie algorithm (a code sketch is given below). Equivalently, in mathematical terms, the joint distribution of the jump chain and holding times (Yn, Tn)n=0,1,2,... can be described as follows:

(Yn)n=0,1,... is a DTMC with transition matrix K
Tn | Yn = i ∼ Exp(νi) for every i ∈ X (3.35)
Tn ⊥ (Yi, Ti)i≠n | Yn .
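The following Python sketch (with a small hypothetical Q-matrix) simulates a CTMC trajectory up to a time horizon using exactly the dynamics in (3.35).

```python
import numpy as np

# A minimal sketch of the Gillespie algorithm based on (3.35); Q is a hypothetical
# generator matrix on X = {0, 1, 2}.
rng = np.random.default_rng(2)
Q = np.array([[-3.0, 2.0, 1.0],
              [1.0, -1.0, 0.0],
              [0.5, 0.5, -1.0]])
nu = -np.diag(Q)                                  # holding-time rates nu_i
K = Q / nu[:, None]                               # jump-chain matrix K_ij = Q_ij / nu_i
np.fill_diagonal(K, 0.0)

def simulate_ctmc(x0, t_max):
    t, x, path = 0.0, x0, [(0.0, x0)]
    while True:
        t += rng.exponential(1 / nu[x])           # holding time ~ Exp(nu_x)
        if t > t_max:
            return path
        x = rng.choice(len(nu), p=K[x])           # next state from the jump chain
        path.append((t, x))

print(simulate_ctmc(0, t_max=5.0))                # list of (jump time, new state)
```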

3.5.1 Connection between (ν, K) and Q


The above discussion implies that ν and K fully determine the evolution (or distribution) of a
CTMC. How do these relate to the jump rates Qij ?

Lemma 6. Let (X(t))t∈[0,∞) be a CTMC defined through (3.33), with jump chain and holding
times (Yn , Tn )n=0,1,2,... satisfying (3.35). Then the jump rates Qij in (3.19) are well-defined (i.e.
the limits exist) and satisfy
νi = ∑_{j≠i} Qij    i ∈ X (3.36)
Kij = Qij / ∑_{ℓ≠i} Qiℓ    i, j ∈ X with i ≠ j. (3.37)

Proof of (3.36)-(3.37).5 We show

Qij = νi Kij    i ≠ j ,

from which both (3.36) and (3.37) can be easily deduced thanks to ∑_{j≠i} Kij = 1 for every i ∈ X.
Assume i ≠ j. By definition of the jump chain and holding times we have

Pij(h) = Pr(X(h) = j | X(0) = i)
    ≥ P(T0 < h, Y1 = j, T1 > h | Y0 = i)    [by (3.33)]
    = P(T0 < h | Y0 = i) P(Y1 = j | Y0 = i) P(T1 > h | Y1 = j) = (1 − e^{−νi h}) Kij e^{−νj h}    h > 0.    [by (3.35)]

Thus

Qij = lim_{h→0} Pij(h)/h ≥ lim_{h→0} (1 − e^{−νi h}) Kij e^{−νj h} / h = νi Kij    i ≠ j . (3.38)

Also, we have

Pii(h) = Pr(X(h) = i | X(0) = i) ≥ P(T0 > h | Y0 = i) = e^{−νi h}    h > 0,    [by (3.33) and (3.35)]
5 Here we assume that the limits in (3.19) exist and prove (3.36) and (3.37). The proof could be extended to

prove the existence of limits, following similar calculations but considering limsups and liminfs and showing that
they coincide. We omit those arguments for brevity and leave that as an exercise for the interested reader.

which implies

∑_{j≠i} Qij = lim_{h→0} (1 − Pii(h))/h ≤ lim_{h→0} (1 − e^{−νi h})/h = νi    i ∈ X . (3.39)

Combining ∑_{j≠i} Qij ≤ νi and Qij ≥ νi Kij we have ∑_{j≠i} Qij = νi. Combining the latter with Qij ≥ νi Kij we obtain Qij = νi Kij, as desired.

Lemma 6 implies that the behaviour of a CTMC can be equivalently described in terms of
(ν, K) = ((νi )i∈X , (Kij )i,j∈X ) or in terms of Q = (Qij )i,j∈X , i.e. they are simply two alternative
parametrizations of the jump rates. In fact, one can recover (ν, K) from Q by (3.36)-(3.37), as well
as Q from (ν, K) by inverting those formulas, which gives Qij = νi Kij for all i ≠ j.

Remark
The intuition behind (3.36) and (3.37) can be understood by considering the following equiv-
alent way to simulate (Yn , Tn )n=0,1,... starting from the jump rates Qij : given Yn = i ∈ X
do
1. simulate Ti→j ∼ Exp(Qij ) independently for all j ∈ X \{i};

2. set Tn = minj∈X \{i} Ti→j and jump to Yn+1 = arg minj∈X \{i} Ti→j .
By the ’first event’ and ’hazard race’ properties of exponentials, i.e. (3.14) and (3.15), it follows
that the sequence (Yn , Tn )n=0,1,... generated by the above procedure has the same distribution
as the one defined in (3.35), with ν and K given by (3.36) and (3.37). In other words, the
above procedure is a valid algorithm to simulate a CTMC starting from its jump rates Qij .a
The above construction provides an algorithmic/intuitive interpretation to the statement
that ’Qij coincides with the rate at which the process jumps from i to j’.
a Note that this algorithm is generally less computationally efficient, and thus less used, than directly using (3.35), since it simulates many more random variables.

Remarks
• We implicitly discussed three different characterizations of CTMCs, namely in terms
of: (a) jump chain and holding time construction (i.e. equations (3.33) and (3.35)); (b)
jump rates (i.e. equation (3.20)) and (c) Kolmogorov Equations (i.e. (3.28)). These
are all equivalent under regularity assumptions (e.g. right-continuity of trajectories and
|X | < ∞), see e.g. Theorem 2.8.2 of Norris [1998] for more details and a complete
formal results on this. Above we basically showed (a)⇒(b) in Lemma 6 and (b)⇒(c) in
Theorem 7; while one should also show (c)⇒(a) to close the circle. Below we discuss
more in details these equivalences for the Poisson Process.

3.6 Limiting behavior of CTMCs


Similarly to the discrete case, one is often interested in characterizing the long-run behaviour of
(X(t))t∈[0,∞) . To do so, we would like to identify the limiting probabilities

πj = lim_{t→+∞} Pij(t)    i, j ∈ X . (3.40)

In this section we derive convergence results analogous to the ones we discussed for DTMCs.

Stationary distributions
Definition 11. A probability distribution π on X is stationary for a CTMC (X(t))t∈[0,∞) with
transition probabilities (P (t))t∈[0,∞) if

πP (t) = π t ∈ [0, ∞) . (3.41)

NB: in the above equation πP (t) represents a (1 × |X |) × (|X | × |X |) vector-matrix product.


Intuition: π is stationary if X(0) ∼ π implies X(t) ∼ π for all t ≥ 0.
The condition in (3.41) can in principle be hard to verify since it involves a collection of
conditions for all t ∈ [0, ∞). The following proposition provides an easier characterisation of
stationary distributions.
Proposition 14. A probability distribution π on X is stationary for a CTMC X with generator
matrix Q if and only if

πQ = 0 (3.42)

where 0 represents a vector (or better 1 × |X | matrix) with all entries equal to zero.
Proof for |X| < ∞. First note that, by (3.41), π is stationary if and only if (d/dt) πP(t) = 0 for all t ≥ 0. Assume π is stationary. Then 0 = (d/dt) πP(t) = πP′(t) for all t ≥ 0, where we have switched a derivative and a summation thanks to |X| < ∞. Taking t = 0 we obtain 0 = πP′(0) = πQ. For the converse, assume πQ = 0. Then, switching again derivative and summation and applying the KBE we have

(d/dt) πP(t) = πP′(t) = πQP(t) = 0    t ≥ 0, (3.43)

meaning that t ↦ πP(t) is constant. Since πP(0) = π, we have πP(t) = π for every t ≥ 0 and thus π is stationary.
To interpret (3.42) we can write out πQ = 0 as

0 = ∑_{i∈X} πi Qij = −νj πj + ∑_{i≠j} πi Qij    j ∈ X (3.44)

leading to the so-called balance equations6

νj πj = ∑_{i≠j} πi Qij    j ∈ X . (3.45)

6 or more explicitly global balance equations as opposed to detailed balance equations.



Intuitively, the balance equations require

νj πj , i.e. the rate at which the process leaves state j in stationarity,

to be equal to

∑_{i≠j} Qij πi , i.e. the rate at which the process enters j in stationarity.

Roughly speaking, the balance equations require finding a π such that the "flow of probability" that enters into j and that goes out of j in stationarity is the same. The balance equations allow one to find the stationary distribution π given the Q-matrix Q = (Qij)i,j∈X. They are the analogue of the equations πj = ∑_k πk Pkj that we saw in (1.15) in the context of discrete-time MCs.
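Numerically, finding π amounts to solving the linear system πQ = 0 together with the normalization ∑_i πi = 1. A minimal Python sketch (with a hypothetical Q-matrix) is:

```python
import numpy as np

# A minimal sketch: find the stationary distribution by solving pi Q = 0 together with
# sum_i pi_i = 1, for a hypothetical generator matrix Q.
Q = np.array([[-3.0, 2.0, 1.0],
              [1.0, -1.0, 0.0],
              [0.5, 0.5, -1.0]])
n = Q.shape[0]

# Stack the constraints Q^T pi = 0 and 1^T pi = 1, then solve by least squares.
A = np.vstack([Q.T, np.ones(n)])
b = np.concatenate([np.zeros(n), [1.0]])
pi, *_ = np.linalg.lstsq(A, b, rcond=None)
print(pi, pi @ Q)                                 # pi Q should be (numerically) zero
```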

Convergence to stationarity
To ensure convergence to stationarity (i.e. that a stationary distribution π is also limiting as in
(3.40)) we first require irreducibility. Intuitively, the definition of irreducibility for CTMCs is
analogous to the one for DTMCs, just replacing P with Q. More precisely, we have the following
formulation.
Definition 12. A CTMC X is irreducible if, for every i, j ∈ X, there exist a k ∈ N and (i0, . . . , ik) ∈ X^{k+1} such that i0 = i, ik = j and Q_{iℓ iℓ+1} > 0 for all ℓ = 0, . . . , k − 1.
On the other hand, a difference compared to DTMCs is that for CTMCs we do not need to require aperiodicity, since this is always guaranteed (intuitively, the inherent randomness in the holding times of CTMCs prevents any "deterministic oscillation" typical of periodic DTMCs). For example, the following lemma shows that irreducibility automatically implies that all entries of P(t) are strictly positive for t > 0.
Lemma 7. Let X be an irreducible CTMC. Then
Pij (t) > 0 t > 0, i, j ∈ X . (3.46)
We can now state the convergence theorem for CTMCs.
Theorem 8. Let X be an irreducible CTMC with stationary distribution π. Then
lim Pij (t) = πj i, j ∈ X . (3.47)
t→∞

Example 3.6.1: Binary chain

Let X = {0, 1} with transition rates Q01 = λ and Q10 = µ. Then, the balance equations are

ν0 π0 = Q10 π1            λπ0 = µπ1
ν1 π1 = Q01 π0    =⇒      µπ1 = λπ0 .

Thus, the limiting probabilities (π0 , π1 ) satisfy

π1 /π0 = λ/µ    and    π0 + π1 = 1,

meaning that

π0 = µ/(µ + λ)    and    π1 = λ/(µ + λ) .

It follows by Theorem 8 that lim_{t→∞} P00 (t) = lim_{t→∞} P10 (t) = µ/(µ + λ) and lim_{t→∞} P01 (t) =
lim_{t→∞} P11 (t) = λ/(µ + λ), which is coherent with the explicit solutions for Pij (t) found in Example
3.4.1.

3.7 Birth and death processes


Birth and death (B&D) processes are a sub-class of CTMCs (basically they are one-dimensional
CTMCs), which incorporate various models of interest in different domains, such as demography,
epidemiology, queuing theory and other areas. Here we use them to illustrate some notions
discussed above and provide some illustrative examples.
A birth and death (B&D) process, with arrival (or birth) rates (λn)_{n=0}^{+∞} and departure (or death)
rates (µn)_{n=0}^{+∞}, is a CTMC X = (X(t))t∈[0,∞) with X = N and

        λi   if j = i + 1
Qij =   µi   if j = i − 1        i, j ∈ X .    (3.48)
        0    otherwise,

Interpretation: think of X(t) as the number of individuals at time t in a system/population such
that, given i individuals currently in the system:

1. an arrival (or birth) of a new individual happens with exponential rate λi (i.e. the time to
the next arrival is Exp(λi ) distributed);
2. a departure (or death) of an individual happens with exponential rate µi (i.e. the time to
the next departure is Exp(µi ) distributed).

Equation (3.48) describes the Q-matrix of a B&D process, from which one can easily deduce the
corresponding (νi )i∈X and (Kij )i,j∈X parameters. We have thus two equivalent ways of thinking
about and simulating B&D processes (a code sketch of scheme (a) is given after this list):
(a) Based on the arrival and departure rates (λi)_{i=0}^{+∞} and (µi)_{i=0}^{+∞}: given a current state X(t) = i ∈ X do

   1. simulate Tarr ∼ Exp(λi ) and Tdep ∼ Exp(µi ) independently;
   2. set the time spent in i to T = min(Tarr , Tdep );
   3. move to j = i + 1 if Tarr < Tdep and to j = i − 1 if Tdep < Tarr .

(b) Based on the (νi)_{i=0}^{+∞} and the transition probabilities K = (Kij )_{i,j=0}^{+∞}: given a current state X(t) = i ∈ X do

   1. simulate the time spent in i as T ∼ Exp(µi + λi );
   2. move to j = i + 1 with probability λi /(λi + µi ) and to j = i − 1 otherwise.
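The following is a minimal Python/NumPy sketch of scheme (a) above (competing exponential clocks). The specific rate functions used in the example call (a linear model with immigration, cf. Example 3.3.4) are illustrative choices.

```python
import numpy as np

# Minimal sketch of scheme (a): simulate a B&D trajectory via competing
# exponential clocks for arrivals and departures.
def simulate_bd(lam_fun, mu_fun, x0, t_max, rng=np.random.default_rng(0)):
    t, x = 0.0, x0
    times, states = [t], [x]
    while t < t_max:
        lam, mu = lam_fun(x), mu_fun(x)
        # Rate 0 means that move is impossible (infinite waiting time).
        t_arr = rng.exponential(1 / lam) if lam > 0 else np.inf
        t_dep = rng.exponential(1 / mu) if mu > 0 else np.inf
        if not np.isfinite(min(t_arr, t_dep)):
            break  # absorbing state: no further transitions
        t += min(t_arr, t_dep)
        x += 1 if t_arr < t_dep else -1
        times.append(t)
        states.append(x)
    return np.array(times), np.array(states)

# Illustrative rates: lambda_n = n*lambda + theta, mu_n = n*mu (linear model with immigration).
times, states = simulate_bd(lambda n: 1.0 * n + 0.5, lambda n: 1.2 * n, x0=5, t_max=50.0)
print(states[-5:])
```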

Example 3.7.1: Examples of B&D processes

• Linear model with immigration: (X(t))t∈[0,∞) as in Example 3.3.4 is a B&D process with

  µn = nµ ,        n ≥ 1
  λn = nλ + θ ,    n ≥ 0

Example 3.7.2: Queuing systems

Customers arrive at a service station according to a Poisson process with rate λ and join the
queue upon arrival. The server takes an exponential time to serve each customer, with an
average serving time of 1/µ units of time. Here, the number of customers in the queue as time
evolves can be described as a B&D process with

  µn = µ ,    n ≥ 1
  λn = λ ,    n ≥ 0

This is known as the M/M/1 queuing system (which stands for “Markovian arrivals/Markovian
departures/1 server”).
The M/M/1 system has many generalizations. Suppose, for example, that we have s > 1
servers. Then, we obtain an M/M/s queuing system, which can be modelled as a B&D process
with arrival and departure rates

  µn = min(s, n)µ ,    n ≥ 1
  λn = λ ,             n ≥ 0

Here typical questions of interest are: what is the average waiting time for each customer?
What is the probability of a customer waiting more than 30 minutes? How do these quantities
depend on λ, µ, and s?

Limiting distribution of B&D processes


Consider a B&D process X with arrival and departure rates (λi)_{i=0}^{+∞} and (µi)_{i=0}^{+∞}. To find a stationary
distribution, i.e. a π that satisfies the global balance equations πQ = 0, we first check if there is a
distribution which satisfies the local balance equations

πi Qij = πj Qji ,    i, j ∈ X .    (3.49)

Exercise: show that (3.49) implies πQ = 0, i.e. local balance implies global balance.
For B&D processes (3.49) reads

πj−1 λj−1 = πj µj ,    j ≥ 1,

which implies

πj = (λj−1 /µj ) πj−1 = π0 ∏_{i=1}^{j} (λi−1 /µi ) ,    j ≥ 1.

Combining the above with Σ_{j=0}^{+∞} πj = 1, we obtain

πj = Z^{−1} ∏_{i=1}^{j} (λi−1 /µi ) ,    with Z := Σ_{j=0}^{∞} ∏_{i=1}^{j} (λi−1 /µi ) .    (3.50)

Note that, assuming λi > 0 for all i ≥ 0 and µi > 0 for all i ≥ 1, we have that X is irreducible and
thus a stationary distribution π (if one exists) is unique. Also, note that a probability distribution
π satisfying (3.49) exists if and only if Z < ∞. One can also easily show that, for B&D processes,
any π that satisfies global balance, i.e. πQ = 0, is of the form in (3.50) and satisfies (3.49).
The expression in (3.50) allows us to compute the limiting probabilities given a specific B&D
process of interest.
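In practice, when no closed form is available one can evaluate (3.50) numerically by truncating the state space. The following Python/NumPy sketch does this; the M/M/2-style rates in the example call are an illustrative assumption, and the truncation level n_max is a numerical choice.

```python
import numpy as np

# Minimal sketch: evaluate the product formula (3.50) numerically, truncating
# the state space at n_max (reasonable when the terms in Z decay quickly).
def bd_stationary(lam_fun, mu_fun, n_max=200):
    log_terms = np.zeros(n_max + 1)  # log of prod_{i=1}^j lam_{i-1}/mu_i, for j = 0..n_max
    for j in range(1, n_max + 1):
        log_terms[j] = log_terms[j - 1] + np.log(lam_fun(j - 1)) - np.log(mu_fun(j))
    weights = np.exp(log_terms - log_terms.max())  # stabilize before normalizing
    return weights / weights.sum()

# Illustrative example: M/M/2 queue with lambda = 1 and mu = 0.7 per server.
pi = bd_stationary(lambda n: 1.0, lambda n: min(2, n) * 0.7)
print("P(queue empty) ~", pi[0], "  mean queue length ~", np.dot(np.arange(len(pi)), pi))
```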

Example 3.7.3: M/M/1 queuing system (limiting distribution)

Consider the M/M/1 queue of Example 3.7.2, where customers come at rate λ and there is
one server serving them at rate µ, and denote by X(t) the number of customers in the system.
In this case λi−1 /µi = λ/µ for all i ≥ 1, and

πj = (λ/µ)^j / Σ_{k=0}^{∞} (λ/µ)^k = (λ/µ)^j (1 − λ/µ) = (1 − p)^j p ,    j ≥ 0,

where p = 1 − λ/µ. Thus the limiting distribution is π = Geom(p) with p = 1 − λ/µ, meaning
that by Theorem 8, for large t the distribution of X(t) is approximately Geom(p) and thus

E[X(t)] ≈ (1 − p)/p = λ/(µ − λ) .

Example 3.7.4: M/M/1 queuing system (queue length example)

Consider an M/M/1 queue with customers arriving on average one every 10 minutes, and
service time taking on average 8 minutes. How many customers do you expect to be in the
system on average in the long run? How would this number change if the arrival rate increases
by 20%?
Solution. In this example we have λ = 1/10 and µ = 1/8, assuming time to be measured in
minutes. Thus, the limiting distribution for the number of customers X(t) in the system is a
Geom(1 − λ/µ) and, for large t, we have

E[X(t)] ≈ λ/(µ − λ) = (1/10)/(1/8 − 1/10) = 4,

meaning that in the long run we have on average four customers in the system. If we increase
the arrival rate λ by 20% to λ = 12/100, we obtain an average number of customers in the system
equal to

E[X(t)] ≈ λ/(µ − λ) = (12/100)/(1/8 − 12/100) = 24.

Thus, increasing the arrival rate by 20% results in a six-fold increase of the average queue length.
More generally, when µ − λ is small, even a small change in λ or µ can lead to a large change
in the behaviour of the system.

For general B&D processes where, unlike for M/M/1 queues, λn−1 /µn is not constant, π will not be a
geometric distribution and usually will not be of a simple parametric form (e.g. a known distribution).
Nonetheless, the existence or not of a stationary distribution π will still depend on the parameters in
an intuitive way, see Exercises 50 and 51.

3.8 The Poisson Process


We conclude this section discussing the Poisson Process, first in a special version, which can be
thought of as a special case of CTMC, and then in a more general version, which can be thought
of as the canonical model for point processes with independent increments.

3.8.1 One-dimensional, homogeneous case


In one dimension, the Poisson Process (PP) can naturally be thought of as a counting process. For
this reason, we introduce some definitions/terminology related to counting processes, increments
and inter-arrival times.
Counting processes: the (one-dimensional) Poisson Process is an example of counting process.
Intuitively, a counting process is a stochastic process (N (t))t∈[0,∞) where N (t) counts the number
of events happening in the time interval [0,t]. Mathematically speaking, a counting process is an
N-valued stochastic process (N (t))t∈[0,∞) with N (0) = 0 and trajectories that are non-decreasing
and right-continuous7 . The interpretation of (N (t))t∈[0,∞) as an actual counting process means
that for any s < t, N (t)−N (s) is the number of events occurring in the interval (s, t]. For example,
N (t) could represent the number of children born by time t in a given region, in which case the
event is “a child is born”; or N (t) could represent the number of people who entered a given store
before time t, in which case the event is “a person enters the shop”. Note that if we had defined
N (t) as the number of people in the store at time t, then (N (t))t∈[0,∞) would not have been a
counting process, because its trajectories would not have been non-decreasing.
Figure 3.2 provides a simple example of how a trajectory of a counting process could look like.
Increments: we will use the following terminology:

• We say that (N (t))t∈[0,∞) has stationary increments if, for every s > 0, the distribution of
the random variable

N (t + s) − N (t) ,    t > 0,

does not depend on t. This can be interpreted as a “time-homogeneity” property.


7 These properties should hold almost surely under the law of (N (t))t∈[0,∞)

Figure 3.2: Example of a trajectory of a counting process, with t on the x-axis and N (t) on the
y-axis. In this figure, N (t) reports real data counts of the number of earthquakes in Japan, starting
from January 1st 1885. Each ”tick” on the x-axis represents an event (i.e. earthquake).

• We say that (N (t))t∈[0,∞) has independent increments if, for every n ≥ 1 and t1 < t2 < t3 <
· · · < tn the random variables
N (t1 ), N (t2 ) − N (t1 ), N (t3 ) − N (t2 ), . . . , N (tn ) − N (tn−1 )
are independent. This means that the number of events in disjoint intervals is independent.
Note that the independent increment assumption implies Markovianity of (N (t))t∈[0,∞) since in
that case
Pr(N (t + s) = j | N (s) = i, N (u) = n(u) for 0 ≤ u ≤ s)
= Pr(N (t + s) − N (s) = j − i | N (s) = i, N (u) = n(u) for 0 ≤ u ≤ s)
= Pr(N (t + s) − N (s) = j − i)
= Pr(N (t + s) − N (s) = j − i | N (s) = i)
= Pr(N (t + s) = j | N (s) = i).
Interarrival times: a counting process (N (t))t∈[0,∞) can also be described by looking at its inter-
arrival times T0 , T1 , T2 , . . . , i.e. the random times between successive events. More precisely,
the interarrival times T0 , T1 , T2 , . . . are related to (N (t))t∈[0,∞) through the following equivalence

N (t) > n ⇐⇒ T0 + · · · + Tn ≤ t ,    t ∈ [0, ∞) and n = 0, 1, 2, . . . .    (3.51)

Equivalent definitions
The Poisson Process (PP) is the canonical model for counting processes with stationary and
independent increments.8 We can sum up the above properties of the PP with the following
theorem. We write (N (t))t∈[0,∞) ∼ PP(λ) to indicate that the counting process (N (t))t∈[0,∞) is
distributed as a Poisson process with rate λ.
8 It turns out that, under some basic regularity assumptions, it is the only counting process with stationary and

independent increments!

Theorem 9. A counting process N = (N (t))t∈[0,∞) satisfies N ∼ P P (λ) if and only if one of the
following equivalent conditions holds:
(a) the process has independent increments and

N (t + s) − N (s) ∼ Poiss(λt) ,    s, t ≥ 0 .    (3.52)

(b) the process has independent increments and

P (N (t + h) − N (t) = 1) = λh + o(h)
P (N (t + h) − N (t) ≥ 2) = o(h)        t > 0, as h ↓ 0 .

(c) the interarrival times of N satisfy

T0 , T1 , T2 , . . . iid ∼ Exp(λ) .

Roughly speaking, the above Theorem can be interpreted as a special case of the equivalent
representations of CTMCs discussed above, applied to the special case of a B&D process with
λi = λ for all i ≥ 0 and µi = 0 for all i ≥ 1.

Remarks
• Definition (a) is self-consistent because, for example, for every t1 < t2 < t3 , we have
N (t3 ) − N (t1 ) ∼ Poiss(λ(t3 − t1 )), but also

N (t3 ) − N (t1 ) = (N (t3 ) − N (t2 )) + (N (t2 ) − N (t1 ))
               ∼ Poiss(λ(t3 − t2 ) + λ(t2 − t1 ))
               = Poiss(λ(t3 − t1 )) ,

where the second line uses the independence of the two increments. The two distributions
coincide; if they did not (e.g., if we replaced (3.52) with N (t + s) − N (s) ∼ Poiss(λt²)),
the definition would not be consistent and no such process would exist.
• (3.52) implies that the PP has stationary increments, because the distribution of N (t +
s) − N (s) does not depend on s.

• Definition (a) may seem quite arbitrary (e.g. why Poisson?) and not intuitive from a
modelling point of view. Definition (b) is instead more natural/explicit from a modelling
point of view: it implies that the PP is the only counting process with events happening
independently and at a constant rate.
• Intuitively, the fact that Ti ∼ Exp(λ) for PPs is related to the fact that the PP is the
counting process with no memory and events happening at constant rate λ (i.e. in every
instant of size h, there is a probability λh of an event occurring, independently of the
rest), which are the same assumptions underlying an Exp(λ) distribution for a time-to-
event random variable.

• Definition (c) is the only ‘constructive’ one: it directly implies that such a process
exists and suggests a simple way to simulate it, as sketched below.
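The following Python/NumPy sketch implements the constructive definition (c), simulating a homogeneous PP on a finite horizon by accumulating exponential interarrival times. The rate and horizon values are illustrative.

```python
import numpy as np

# Minimal sketch of definition (c): simulate a Poisson process of rate lam on
# [0, t_max] by accumulating Exp(lam) interarrival times.
def simulate_pp(lam, t_max, rng=np.random.default_rng(1)):
    event_times = []
    t = rng.exponential(1 / lam)
    while t <= t_max:
        event_times.append(t)
        t += rng.exponential(1 / lam)
    return np.array(event_times)

events = simulate_pp(lam=2.0, t_max=10.0)
# N(t_max) should be approximately Poiss(lam * t_max), here with mean 20.
print("number of events:", len(events))
```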

3.8.2 One-dimensional, non-homogeneous case


A strong assumption underlying the PP is that the rate λ at which events happen does not depend
on t. This assumption may be unrealistic in some scenarios (e.g. more arrivals at a store during
rush hours). Removing such assumption leads to the Non-Homogeneous Poisson Process (NHPP),
where the constant rate λ is replaced by a rate function, or intensity function, λ : t 7→ λ(t).

Theorem 10. A counting process N = (N (t))t∈[0,∞) is a NHPP with intensity function λ(t) if
and only if one of the following equivalent conditions holds:

(a) the process has independent increments and

N (t + s) − N (s) ∼ Poiss( ∫_s^{s+t} λ(y)dy ) ,    s, t ≥ 0 .    (3.53)

(b) the process has independent increments and

P (N (t + h) − N (t) = 1) = λ(t)h + o(h)
P (N (t + h) − N (t) ≥ 2) = o(h)        t > 0, as h ↓ 0 .

Intuition: condition (b) means that ”events happen at rate λ(t)”, meaning that at any time t
the probability of an event happening in the next instant of (infinitesimal) size h is

Pr( event in (t, t + h]) ≈ λ(t)h .

Remarks
- Similarly to the homogeneous PP, the NHPP is characterized by having independent
Poisson increments. In words, condition (a) means that the number of events in (s, s + t]
is Poisson distributed with mean ∫_s^{s+t} λ(y)dy.
- Unlike the PP, the NHPP does not have stationary increments (or constant rate λ).
This makes the NHPP a much more flexible model than the PP.
- Both the PP and the NHPP crucially rely on the assumption of independent increments.
Even more, one could say that the assumption of independent increments is basically the
only substantial assumption underlying NHPPs, meaning that every (regular enougha )
counting process with independent increments is a NHPP for some intensity function
λ(t).
a In particular one needs to prevent two or more events from happening at the very same time
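One standard way to simulate a NHPP, not developed in these notes but consistent with the rate interpretation in condition (b), is to "thin" a homogeneous PP: simulate candidate events at a rate λ_max bounding λ(t) and keep each candidate at time t with probability λ(t)/λ_max. A minimal Python/NumPy sketch follows; the sinusoidal intensity and all parameter values are illustrative assumptions.

```python
import numpy as np

# Minimal sketch: simulate a NHPP on [0, t_max] by thinning a homogeneous PP.
# Requires a bound lam_max >= lam(t) on [0, t_max].
def simulate_nhpp(lam, lam_max, t_max, rng=np.random.default_rng(2)):
    t, kept = 0.0, []
    while True:
        t += rng.exponential(1 / lam_max)      # candidate events at rate lam_max
        if t > t_max:
            return np.array(kept)
        if rng.uniform() < lam(t) / lam_max:   # keep with probability lam(t)/lam_max
            kept.append(t)

lam = lambda t: 2.0 + 1.5 * np.sin(t)          # "rush hour" style intensity (illustrative)
events = simulate_nhpp(lam, lam_max=3.5, t_max=20.0)
print("number of events:", len(events))
```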

3.8.3 General case


So far, we described the Poisson Process as a one-dimensional process, i.e. a process that counts
events happening over ”one-dimensional” time, or in other words over the domain D = [0, ∞).
However, a key feature of the Poisson Process is that its construction naturally extends to general
domains D, including multidimensional ones, non-euclidean ones, etc. In this general context, the
process N can be interpreted as a “point process” describing a collection of “events” happening
in D (or simply a collection of “points” in D). Think for example of events happening in space,
or in space and time.
For general domains, which are not one-dimensional and ordered as [0, ∞), we need a different
terminology and notation, as follows. Given a domain D, e.g. D ⊆ R^d, define the set of ‘point
patterns’ in D as⁹

D̄ := {finite subsets of D} = ∪_{n≥1} { {x1 , . . . , xn } : xi ∈ D for all i = 1, . . . , n } .
A point process on D is a random variable with values in D̄, i.e. a measurable function N : Ω 7→ D̄,
where (Ω, F, P ) is a probability space. Equivalently, defining
N (A) := # of points in A A ⊆ D, (3.54)
we can view a point process as a stochastic process N = (N (A))A⊆D with state space X = N
and index set T = {A : A ⊆ D}. Indeed, specifying N (A) for all A ⊆ D uniquely determines
a configuration of points (see e.g. figure on tablet notes). To be coherent with the interpretation
(3.54), we require (N (A))A⊆D to be such that N (A ∪ B) = N (A) + N (B) if A and B are disjoint.
While this second representation, i.e. N = (N (A))A⊆D , may seem more complicated than the
first, i.e. N : Ω 7→ D̄, it can be more convenient to specify point process models. In particular, the
Poisson Process construction extends to the general domain as follows.
Definition 13. A point process N = (N (A))A⊆D is a Poisson process with intensity measure10 λ
if:
(i) N (A) is independent of N (B) for every A, B ⊆ D with A ∩ B = ∅.
(ii) N (A) ∼ Poiss(λ(A)) for every A ⊆ D.
We can interpret (i) as an independent increment condition and (ii) as a Poisson increments
condition. Definition 13 generalizes the definitions in part (a) of Theorems 9 and 10. Similarly to
those, also Definition 13 is self-consistent. In particular, given A, B ⊆ D with A ∩ B = ∅, we have
N (A ∪ B) ∼ Poiss(λ(A ∪ B)) by (ii), which is coherent with
N (A ∪ B) = (N (A) + N (B)) ∼ Poiss(λ(A) + λ(B)) = Poiss(λ(A ∪ B)) ,
by using the properties of Poisson distributions, independence and the additivity of λ.
When Definition 13 holds, we write in short N ∼ P P (λ), where λ indicates a measure on D.
PPs have a high degree of mathematical tractability. For example, they are closed with respect to
natural operations for point processes, namely ”thinning” and ”superposition/merging”, as shown
in the following proposition.
9 Here we consider point patterns with finitely many points for simplicity, even if one could extend the same ideas

to countable point patterns.


10 a measure λ on a space D with σ-algebra F , contained in the power set of D, is a non-negative and σ-additive

function from F to [0, ∞)



Proposition 15 (Superposition of PPs). Let N1 ∼ P P (λ1 ) and N2 ∼ P P (λ2 ), with N1 and N2


independent of each other. Then N = (N (A))A⊆D defined as N (A) = N1 (A) + N2 (A) for every
A ⊆ D satisfies N ∼ P P (λ1 + λ2 ).
Proposition 15 can be verified directly from Definition 13. The following proposition describes
closure with respect to ‘thinning’, which can loosely be thought of as the inverse of superposition.

Proposition 16 (Thinning of PPs). Let N ∼ P P (λ). Suppose that each point in N is either of
type 1 or 2, independently of the others, with probability p and 1−p, respectively. Let N1 and N2 be
the resulting point processes including only points of type 1 and 2, respectively. Then N1 ∼ P P (λ1 )
and N2 ∼ P P (λ2 ), with11 λ1 = pλ and λ2 = (1 − p)λ. Moreover, N1 and N2 are independent of
each other.

Proposition 16 can be verified combining Definition 13 with the following lemma (also appearing
in the exercise list).
Lemma 8. If N ∼ P oisson(λ) and N1 | N ∼ Binomial(N, q), then N1 ∼ P oisson(λq). Also, N1
and (N − N1 ) are independent of each other.

11 Here pλ stands for the measure defined as (pλ)(A) = pλ(A) for every A ⊆ D
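To see the thinning property at work numerically, one can take D = [0, T ] (so that the interarrival construction of the homogeneous PP applies), mark each point as type 1 with probability p independently, and compare the type-1 counts with the Poiss(pλT ) prediction of Proposition 16. The following Python/NumPy sketch does this; all parameter values are illustrative.

```python
import numpy as np

# Minimal sketch illustrating thinning (Proposition 16) on D = [0, T]:
# simulate N ~ PP(lam) via Exp(lam) interarrivals, keep each point as "type 1"
# with probability p, and compare counts with the Poiss(p*lam*T) prediction.
rng = np.random.default_rng(3)
lam, p, T, n_rep = 2.0, 0.3, 10.0, 5000

counts_type1 = np.empty(n_rep)
for r in range(n_rep):
    t, n_points = 0.0, 0
    t += rng.exponential(1 / lam)
    while t <= T:
        n_points += 1
        t += rng.exponential(1 / lam)
    labels = rng.uniform(size=n_points) < p   # independent type-1 marks
    counts_type1[r] = labels.sum()

print("empirical mean / var of N1(D):", counts_type1.mean(), counts_type1.var())
print("PP(p*lam) prediction         :", p * lam * T, p * lam * T)
```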
Chapter 4

Processes with continuous time and space

This chapter discusses some important classes of stochastic processes with continuous-time and
continuous-space, both in terms of underlying modeling assumptions as well as the mathematical
tools used to analyze them.
Throughout the chapter, we will sometimes refer to the notion of finite dimensional distributions
of a stochastic process.

Definition 14. The finite dimensional distributions of a stochastic process X = (Xt )t∈T with
index set T are the distributions of the random vector Xt1:k = (Xt1 , ..., Xtk ), for every k ∈ N and
(t1 , ..., tk ) ∈ T k .

Under regularity assumptions (such as continuous trajectories or even almost surely càdlàg
trajectories), finite dimensional distributions uniquely characterize the distribution of a stochastic
process. In other words, if two stochastic processes have almost surely continuous trajectories (i.e.
Pr(t 7→ Xt is continuous for all t ∈ [0, ∞)) = 1) and have the same finite dimensional distributions,
then they are equal in distribution, meaning that the induced probability distributions on the space
of trajectories are the same. In this sense, finite dimensional distributions provide a convenient
way to describe the distribution of (sufficiently regular) stochastic processes.

4.1 Brownian motion


4.1.1 BM as a limit of a random walk
We introduce the Brownian Motion (BM) as the limit of a random walk which takes discrete steps
of small size every small interval of time. Let (Yn )n∈N be a symmetric random walk on Z, as in
Example 1.3.7. For any fixed δx, δt > 0 define a continuous-time stochastic process X as

Xt = δx Y⌈t/δt⌉ ,    t ≥ 0,    (4.1)

where ⌈t/δt⌉ denotes the smallest integer larger than or equal to t/δt. This is a process which
makes jumps of size δx, either up or down with probability 1/2, at a fixed grid of times given by
t = δt, 2δt, 3δt, . . . . Note that we can express Yn as

Yn = Z1 + · · · + Zn

with Pr(Zi = 1) = Pr(Zi = −1) = 1/2 for all i and Z1 , Z2 , . . . independent of each other. Thus
E[Yn ] = nE[Z1 ] = 0, Var(Yn ) = nVar(Z1 ) = n for all n ∈ N, and for every t ∈ [0, ∞) we have

E[Xt ] = δx E[Y⌈t/δt⌉ ] = 0    (4.2)
Var(Xt ) = (δx)² Var(Y⌈t/δt⌉ ) = (δx)² ⌈t/δt⌉ .    (4.3)


Figure 4.1: A visual illustration of the convergence in distribution of (4.5). An independent


trajectory of the process in (4.6) is simulated for decreasing values of δt. (NB: in the figure we
interpolate points of the discrete RW with a continuous piecewise linear function rather than a
step function as defined in (4.6), for visual convenience, but the result holds unchanged.)


Thus, taking δx = O( δt) we obtain √ a non-trivial limit (i.e. neither 0 nor ∞) for V ar(Xt ) as
δt ↓ 0. Specifically, taking δx = σ δt for some σ > 0, we obtain

lim V ar(Xt ) = lim σ 2 δtdt/δte = σ 2 t . (4.4)


δt↓0 δt↓0

Moreover, since the r.v.s Z1 , Z2 , . . . are independent and identically distributed (i.i.d.), the Central
Limit Theorem implies the following convergence in distribution:
d
Xt → N (0, σ 2 t) as δx ↓ 0 , for every t > 0.

We can think at Brownian Motion as the limiting distribution √ obtained over the space of
trajectories obtained by taking δt ↓ 0. Specifically, fix T > 0, δx = σ δt and define the uncountable
collection of random variables (Xt )t∈[0,T ] through (4.6). Then we have the following convergence
in distribution
d
(Xt )t∈[0,T ] → (Bt )t∈[0,T ] δt ↓ 0 , (4.5)

where (Bt )t∈[0,T ] is a BM with variance σ². Formalizing the above convergence, both in terms
of the mode of convergence as well as the existence of a well-defined limit, is beyond
the scope of these notes. For the sake of our treatment, the reader can think of it as a conver-
gence of finite dimensional distributions. Figure 4.1 illustrates the convergence for T = 10 and
decreasing values of δt. Notice that this is a convergence of a sequence of distributions over tra-
jectories to a limiting distribution over trajectories, not an almost sure convergence of a sequence
of trajectories t 7→ Xt to a limiting trajectory t 7→ Bt .¹
The specific distribution of Zi in the above construction is not critical. In fact, one obtains the
same distribution for (Bt )t∈[0,T ] by taking the limit of (Xt )t∈[0,T ] as δt ↓ 0, for Xt defined as

Xt = σ√δt Σ_{j=1}^{⌈t/δt⌉} Zj    (4.6)

for any i.i.d. sequence of r.v.s (Zj )j∈N with zero mean and unit variance.
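The construction (4.6) translates directly into a simulation recipe. The following Python/NumPy sketch approximates a BM(σ) trajectory with a rescaled ±1 random walk; σ, T and δt are illustrative values.

```python
import numpy as np

# Minimal sketch of construction (4.6): approximate a BM(sigma) trajectory on
# [0, T] by a rescaled random walk with i.i.d. +/-1 steps of size sigma*sqrt(dt).
rng = np.random.default_rng(4)
sigma, T, dt = 1.0, 10.0, 1e-4

n_steps = int(np.ceil(T / dt))
steps = rng.choice([-1.0, 1.0], size=n_steps)    # the Z_j's
X = sigma * np.sqrt(dt) * np.cumsum(steps)       # X_t on the grid dt, 2dt, ...
times = dt * np.arange(1, n_steps + 1)

# Sanity check: the standard deviation of X_T should be close to sigma*sqrt(T).
print("X_T =", X[-1], "   theoretical sd of X_T =", sigma * np.sqrt(T))
```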

4.1.2 Definition
The above construction suggests the following definition.
Definition 15. A continuous-time stochastic process X = (Xt )t∈[0,∞) on X = R is a Brownian
motion (BM) with variance σ 2 , which we write as

X ∼ BM (σ) ,

if
(i) Pr(X0 = 0) = 1
(ii) Pr(t 7→ Xt is continuous for all t ∈ [0, ∞)) = 1
(iii) the process has stationary and independent increments
(iv) Xt ∼ N (0, σ 2 t) for every t ∈ [0, ∞).

Remarks
• When σ = 1 we refer to X as a standard BM
• (iii) and (iv) imply that Xt − Xs =d Xt−s − X0 and thus

Xt − Xs ∼ N (0, (t − s)σ²) ,    0 < s < t < ∞ .    (4.7)

Indeed, an equivalent definition of BM (σ) would be to replace (iv) with (4.7) and remove
the explicit stationary increment requirement. Note that, similarly to the PP definition
in the previous chapter, the above condition is self-consistent with respect to summation
of increments.

¹ Indeed, if one sets δx = σ√δt and takes the limit δt ↓ 0 in (4.6), then lim_{δt↓0} Xt almost surely does not exist,
even if the distribution of Xt converges to a N (0, σ² t). If needed, one could define a limiting construction of BM
which enjoys almost sure convergence of trajectories, but that is slightly more involved and better expressed in
terms of Brownian bridges, see later on.

• BM is Markovian due to the independent increments requirement (see also similar remark
when discussing the PP).

Definition 16. A continuous-time stochastic process X on X = R^d is a d-dimensional BM with
variance σ² if

Xt = (Xt^(1) , . . . , Xt^(d) ) ,    t ≥ 0,

where (Xt^(i) )t∈[0,∞) , i = 1, . . . , d, are d independent copies of the one-dimensional BM.

Figure 4.2 provides an illustration of trajectory for the one- and two-dimensional BM.

Figure 4.2: A trajectory of a one-dimensional Brownian motion (left; displaying the function
t 7→ Xt ) and a two-dimensional one (right; displaying only the states visited by (X1 (t), X2 (t)) for
t ∈ [0, 1]).

It is not a priori obvious that BM exists, i.e. that it is possible to define a probability distribution
over the space of functions t 7→ Xt from [0, ∞) to R such that (i)–(iv) in Definition 15 are
simultaneously satisfied. This was shown by Wiener in the 1920s. The most subtle part is proving
that almost sure continuity of trajectories can be satisfied.²

Theorem 11 (Wiener). BM exists.

Remark
The process is sometimes called ‘Brownian motion’, named after Robert Brown, who observed
and described the BM-like motion of particles suspended in a liquid, and sometimes called
‘Wiener process’, after Wiener, who proved the theorem above one century later.

² We do not prove the above theorem, as we will not provide proofs of various statements in this chapter: doing
so is out of scope for the current course and would require too much time and space.


Figure 4.3: A visual illustration of the fundamental fact that BM travels O(√t) space in O(t)
time. A BM (1) trajectory is drawn in black, 30 i.i.d. BM (1) trajectories are drawn in gray; the
functions t 7→ 2√t and t 7→ −2√t (which represent fixed quantiles of the distribution of Xt for
every t > 0) are drawn in red.

4.1.3 Properties of BM

The BM enjoys many properties.

Square root scaling

First we discuss a simple and fundamental fact about BM: the independent increments assumption
implies that the BM travels O(√t) space in O(t) time. In particular, for any t ∈ [0, ∞), the
α-quantiles of the distribution of Xt are equal to qα √t, where qα are the quantiles of a N (0, σ²)
distribution and are independent of t. See Figure 4.3 for an illustration. This fact has many
implications, as we will see below in the section on diffusions. For example, it means that for small
values of t, Xt will be of larger order than t (i.e. the red quantile curves in Figure 4.3 have
infinite derivative at t = 0).

Invariances
The following class of properties are invariance or symmetry properties of the distribution over
trajectories induced by BM.

Proposition 17 (Invariances of BM). Let X ∼ BM (σ) for some σ > 0. Then

(i) (Rescaling) Let a, b > 0 and Yt = aXt/b for t ∈ [0, ∞). Then (Yt )t∈[0,∞) ∼ BM (aσ/√b).

(ii) (Time-reversal) Let Yt = X1−t − X1 for t ∈ [0, 1]. Then (Yt )t∈[0,1] =d (Xt )t∈[0,1] .

(iii) (Inversion) Let Y0 = 0 and Yt := tX1/t for t ∈ [0, ∞). Then (Yt )t∈[0,∞) ∼ BM (σ).

Exercise: verify Proposition 17 using Definition 15 and properties of the univariate normal
distribution.

Trajectories
The following properties pertain to individual trajectories of BM, meaning that they are satisfied by
a BM trajectory with probability 1.

(i) Continuity: trajectories are a.s. continuous everywhere:

Pr(t 7→ Xt is continuous for all t ∈ [0, ∞)) = 1 (4.8)

(ii) Nowhere differentiability: trajectories are a.s. nowhere differentiable:

Pr(t 7→ Xt is not differentiable at any t ∈ [0, ∞)) = 1 (4.9)

(iii) Recurrence: trajectories a.s. cross 0 infinitely often (i.o.), meaning that the one-dimensional
BM is a recurrent stochastic process:

Pr(t → Xt crosses 0 for infinitely many t ∈ [0, ∞)) =1 (4.10)

(iv) Recurrence near zero: trajectories visit zero i.o. also in a neighborhood of t = 0: for any
ε > 0

Pr(t 7→ Xt crosses 0 for infinitely many t ∈ [0, ε)) = 1    (4.11)

The continuity property is imposed by definition. The other properties are harder to prove
and their proofs are beyond the scope of these notes. Some heuristic arguments for them go as
follows:
(ii) non-differentiability: if t 7→ Xt were differentiable at t, then we should have lim_{h↓0} ∆_h^+ =
lim_{h↓0} ∆_h^− , where ∆_h^+ := h^{−1}(Xt+h − Xt ) and ∆_h^− := h^{−1}(Xt − Xt−h ). However, for any h > 0,
by Definition 15 (iii) and (iv), ∆_h^+ and ∆_h^− are independent random variables with N (0, h^{−1}σ²)
distribution, which implies that (∆_h^+ − ∆_h^−) ∼ N (0, 2h^{−1}σ²) and thus Pr(lim_{h↓0} (∆_h^+ − ∆_h^−) = 0) =
0.
(iii) given the limiting construction in Section 4.1.1, recurrence can be deduced from the recurrence

of Random Walks on Z and the continuity of trajectories. This property is specific to the one-
dimensional BM and does not extend to larger d.
(iv) Given (iii), recurrence near zero can be deduced from the invariance with respect to inversion
in Proposition 17(iii).
We now consider two ways of generalizing the Brownian motion: Gaussian processes and Diffusion
processes. The Brownian motion sits at the intersection of the two.

4.2 Gaussian Processes


Consider now stochastic processes (Xt )t∈T with state space X = R and general index set T . Think
for concreteness of T ⊆ R^d, such as T = [0, ∞) for processes over time or T = R² for processes over
the plane.
Definition 17. A stochastic process X = (Xt )t∈T is a Gaussian Process (GP) if its finite
dimensional distributions are Multivariate Gaussian (MVG), i.e. if for every k ∈ N and (t1 , ..., tk ) ∈
T^k, the random vector Xt1:k = (Xt1 , ..., Xtk ) satisfies Xt1:k ∼ N (µ, Σ) for some µ = µ(t1:k ) ∈ R^k
and some Σ = Σ(t1:k ) being a k × k positive definite matrix.
Proposition 18. X ∼ BM (σ) is a Gaussian Process.
Proof. For all k ∈ N and t1:k = (t1 , ..., tk ) ∈ T^k, the vector Xt1:k = (Xt1 , ..., Xtk ) can be expressed as
a linear combination of independent Gaussian increments, i.e.

Xt1:k = M z    where    z = (Xt1 , Xt2 − Xt1 , . . . , Xtk − Xtk−1 )ᵀ

for an appropriate k × k matrix M . Then conclude using the closure of MVG under linear trans-
formations.
Define the mean and covariance functions of X = (Xt )t∈T as

m(t) := E[Xt ] ,    t ∈ T
C(t, s) := Cov(Xt , Xs ) = E[(Xt − m(t))(Xs − m(s))] ,    t, s ∈ T .

The mean and covariance functions, m and C, uniquely determine the distribution of a GP X =
(Xt )t∈T because (a) since MVG distributions are uniquely determined by their first two moments³,
m and C uniquely determine the finite dimensional distributions of a GP, and (b) under regularity
assumptions, the finite dimensional distributions uniquely characterize the distribution of a stochastic
process. In particular, if X is a GP, for any k ∈ N and t1:k ∈ T^k, we have

Xt1:k ∼ N (µ, Σ) ,    with µ = (m(t1 ), . . . , m(tk ))ᵀ and Σ = (C(ti , tj ))_{i,j=1}^{k} .    (4.12)

³ In particular, multivariate dependence in MVG is uniquely determined by pair-wise interactions

In terms of notation, we write

X ∼ GP (m, C)

to indicate that X = (Xt )t∈T is a Gaussian Process with mean function m and covariance function
C, i.e. that (4.12) holds ∀k ∈ N and t1:k ∈ T k . The mean and covariance functions provide a
concise and convenient way to describe the distribution of GPs.

Example 4.2.1: Brownian Motion as a GP

For X ∼ BM (σ) we have

m(t) = E[Xt ] = 0 ,    t ∈ T
C(t, s) = E[Xt Xs ] = E[Xt ²] + E[Xt (Xs − Xt )] = σ² t ,    t < s ∈ T

where in the last equalities we have used that Xs = Xt + (Xs − Xt ) and Xt ⊥ Xs − Xt ,
E[Xt ] = E[Xs − Xt ] = 0 and finally E[Xt ²] = σ² t.
Thus, we could equivalently define X ∼ BM (σ) as the unique (continuous) GP such that

m(t) = 0 ,    t ∈ T = [0, ∞)
C(t, s) = σ² min(s, t) ,    t, s ∈ T = [0, ∞)

GPs are closed under many operations, such as integration/differentiation/conditioning, as the
next example illustrates.

Example 4.2.2: Brownian Bridge

Let X ∼ BM (σ) for some σ > 0 and

Yt := (Xt | X1 = 0) ,    t ∈ [0, 1] .

In words, Y = (Yt )t∈[0,1] is a Brownian motion conditioned on its endpoint being 0.

Proposition 19. Y = (Yt )t∈[0,1] ∼ GP (m, C) with m(t) = 0 for all t ∈ [0, 1] and

C(t, t′ ) = σ² t(1 − t′ ) ,    0 ≤ t ≤ t′ ≤ 1 .

Proof. All the finite dimensional distributions of Y are MVG by closure of the
MVG family under conditioning: for all k ∈ N and t1:k ∈ T^k we have that L(Xt1 , ..., Xtk , X1 )
is a MVG (with some mean and covariance) and thus L(Yt1 , ..., Ytk ) = L(Xt1 , ..., Xtk | X1 = 0)
is also a MVG (with some mean and covariance). Thus, Y is a GP by Definition 17.
To compute the mean and covariance functions of Y we use properties of Gaussian distri-
butions. First note that

E[Xt | Xt′ = x] = (t/t′) x ,    0 ≤ t ≤ t′ ≤ 1
Var(Xt | X1 = 0) = σ² t(1 − t) ,    t ∈ [0, 1]

Figure 4.4: Trajectories of a Brownian Bridge with variance σ 2 = 1 and endpoint Y1 = 3.

where the equation for the variance can be derived exploiting that if

X ∼ N (µA , 1/τA )
Y | X ∼ N (X, 1/τB )

then it follows that

Prec(X | Y = y) = τA + τB .

Thus, the mean function of Y is

m(t) = E[Yt ] = E[Xt | X1 = 0] = 0 ,    ∀t ∈ [0, 1]

and its covariance is, for 0 ≤ t ≤ t′ ≤ 1,

C(t, t′ ) = E[Yt Yt′ ] = E[ E[Yt Yt′ | Yt′ ] ]
         = E[ Yt′ E[Yt | Yt′ ] ]
         =* E[ (t/t′) Yt′ Yt′ ] = (t/t′) E[Xt′ ² | X1 = 0] = (t/t′) σ² t′ (1 − t′ )
         = σ² t(1 − t′ ) ,

where in equality =* we have used that E[Yt | Yt′ = x] = E[Xt | Xt′ = x] = (t/t′) x.

The process obtained by conditioning a Brownian motion on its endpoint is called Brownian
Bridge, and it can be defined as follows.
Definition 18. A Brownian Bridge (BB) with variance σ² ending at Ys = y is a GP with
T = [0, s] and

m(t) = (t/s) y ,    t ∈ [0, s]
C(t, t′ ) = σ² t(s − t′ )/s ,    0 ≤ t ≤ t′ ≤ s

Figure 4.4 provides visual illustrations of independent trajectories of a Brownian Bridge


with variance σ 2 = 1 and endpoint Y1 = 3.

What is the role of the two parameters, m and C? The mean function m plays the role of an
additive deterministic function. The proof of the next lemma is left as an exercise.
Lemma 9. If X ∼ GP (0, C) then

Yt := Xt + m(t) t∈T

satisfies Y = (Yt )t∈T ∼ GP (m, C).


By Lemma 9, we can focus on zero-mean GPs, i.e. X ∼ GP (0, C), without loss of generality.
Thus, the main parameter of a GP is the covariance function

C :T ×T →R
(t, s) → C(t, s) = Cov(Xt , Xs )

which models the dependence between the Xt ’s across different locations t ∈ T . Which conditions
should C satisfy to be a valid covariance function (i.e. to admit a corresponding GP)? For every
k ∈ N and t1:k ∈ T^k the k × k matrix

Σ = (C(ti , tj ))_{i,j=1}^{k}    (4.13)

needs to be a symmetric positive definite matrix. Functions C : T² → R satisfying (4.13) for every
k ∈ N and t1:k ∈ T^k are called positive definite functions. This is indeed the only condition C should
satisfy, as stated in the next theorem.

Figure 4.5: Top row: three different functions k from R to [0, ∞) with different regularity at 0.
Bottom row: trajectories of the GPs on T = R with zero mean and covariance C(t, s) = k(t − s).

Theorem 12. Every symmetric positive definite function C : T × T → R induces a valid GP (0, C)
on T .

Symmetric positive definite (s.p.d.) functions are often called ‘kernels’ in the GP literature⁴.
Examples of commonly used GP kernels include:

• Exponential kernel:

C(t, s) = a exp(−b|t − s|) ,    t, s ∈ R

with a, b > 0. This is a s.p.d. function, where a is the “amplitude parameter” (y-scale), i.e.
Yt = √a Xt , and b is the “time scale parameter”, i.e. Yt = Xbt . The corresponding GP
exhibits continuous but not differentiable trajectories.

• Squared Exponential kernel:

C(t, s) = a exp(−b|t − s|²) ,    t, s ∈ T

is a s.p.d. kernel whose corresponding GP exhibits differentiable trajectories.


4 Not to be confused with the Markov transition kernels discussed below

The smoothness of C governs the smoothness of the trajectories of X. For example, if C(t, s) =
k(t − s) for some (symmetric) function k from R to [0, ∞), then continuity and differentiability of
k at 0 are related to the continuity and differentiability of the trajectories of X. See Figure 4.5 for
an illustration and Section A.3 for more details if interested.
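Given a mean function and a kernel, (4.12) gives a direct recipe to simulate GP trajectories on a finite grid of time points. The following Python/NumPy sketch samples zero-mean GP trajectories for the two kernels above; the parameter values and the small diagonal "jitter" (added for numerical stability) are illustrative choices.

```python
import numpy as np

# Minimal sketch: sample GP(0, C) trajectories on a time grid by drawing from the
# multivariate Gaussian in (4.12).
def sample_gp(kernel, ts, n_samples=3, jitter=1e-10, rng=np.random.default_rng(5)):
    # Covariance matrix Sigma_{ij} = C(t_i, t_j), plus a tiny jitter for stability.
    Sigma = kernel(ts[:, None], ts[None, :]) + jitter * np.eye(len(ts))
    return rng.multivariate_normal(np.zeros(len(ts)), Sigma, size=n_samples)

ts = np.linspace(0.0, 5.0, 200)
exp_kernel = lambda t, s: 1.0 * np.exp(-2.0 * np.abs(t - s))          # rough trajectories
sq_exp_kernel = lambda t, s: 1.0 * np.exp(-2.0 * np.abs(t - s) ** 2)  # smooth trajectories

rough = sample_gp(exp_kernel, ts)
smooth = sample_gp(sq_exp_kernel, ts)
print(rough.shape, smooth.shape)   # three trajectories of length 200 each
```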

4.3 Diffusion Processes


GPs extend BM in the direction of general state spaces (e.g. multidimensional T ) and different
degrees of smoothness, while staying within the Gaussian family. We now consider another impor-
tant extension: diffusion processes. Diffusion processes are Markovian stochastic processes (Xt )t∈T
with index set T = [0, ∞) representing time, which can be seen as a solution to the natural stochas-
tic analogue of differential equations. To introduce them, we start by considering an infinitesimal
definition of BM, which will naturally extend to the general diffusion case.

4.3.1 Infinitesimal definition of BM


We have the following equivalent definition of BM.
Definition 19. The BM can be defined as the (only) Markov process (Xt )t∈[0,∞) with continuous
trajectories and X0 = 0 such that
(i) E[Xt+h − Xt | Xt = x] = 0 + o(h) as h ↓ 0, ∀x ∈ R
(ii) E[(Xt+h − Xt )² | Xt = x] = σ² h + o(h) as h ↓ 0, ∀x ∈ R
(iii) Pr(|Xt+h − Xt | > ε | Xt = x) = o(h) as h ↓ 0, ∀x ∈ R and ε > 0
The above properties can be intuitively understood as:

(i) No upwards/downwards ‘drift’
(ii) Constant ‘instantaneous variance’ or ‘volatility’ equal to σ²
(iii) No jumps

Similarly to the PP case, an infinitesimal definition such as Definition 19 is arguably more intuitive
and natural from the modeling point of view, as opposed to a definition, like Definition 15, that
imposes a priori a distributional form on the increments (in this case the Gaussian one).

Remarks
It is remarkable that the conditions in Definition 19 are enough to imply the Gaussian incre-
ment property. Similarly to the PP, this can be deduced by showing that (i)–(iii) imply that
P^t, the transition kernel defined in Section 4.3.3 below, satisfies a certain differential
equation, in this case

∂/∂t P^t (x, y) = (σ²/2) ∂²/∂y² P^t (x, y) .    (4.14)

This is an instance of the Fokker-Planck equations discussed below.

4.3.2 Infinitesimal definition of diffusions

Definition 19 naturally extends to define the general class of so-called diffusion processes. This can
be thought of as an extension of BM to the case of general drift and volatility.
Definition 20. A continuous-time Markov process X = (Xt )t∈[0,∞) with continuous trajectories
is a diffusion process with drift function µ(x, t) and instantaneous variance (or volatility) σ²(x, t)
if
(i) E[Xt+h − Xt | Xt = x] = µ(x, t)h + o(h) as h ↓ 0, ∀x ∈ R
(ii) E[(Xt+h − Xt )² | Xt = x] = σ²(x, t)h + o(h) as h ↓ 0, ∀x ∈ R
(iii) Pr(|Xt+h − Xt | > ε | Xt = x) = o(h) as h ↓ 0, ∀x ∈ R and ε > 0
In the above definition we do not specify the starting condition for X, which could either be
deterministic, e.g. X0 = 0, or stochastic. If Definition 20 is satisfied, we say that X satisfies the
stochastic differential equation (SDE):

dXt = µ(Xt , t)dt + σ(Xt , t)dBt    (4.15)

Remarks
While for our purposes it is enough to interpret (4.15) simply as a compact way to say that
(i)–(iii) hold, the notation in (4.15) admits more ‘differential equation’-like interpretations.
We very briefly mention those here:

• Derivative interpretation. The drift and volatility coefficients µ and σ admit the following
derivative interpretation:

lim_{h↓0} E[ (Xt+h − Xt )/h | Xt = x ] = µ(x, t)    (4.16)
lim_{h↓0} E[ (Xt+h − Xt )²/h | Xt = x ] = σ²(x, t)    (4.17)

Equations (4.16) and (4.17) follow directly from (i) and (ii) in Definition 20. They can
be compared with the ordinary differential equation case, where the function f in an
equation of the form

dXt /dt = f (Xt , t) ,    or equivalently    dXt = f (Xt , t)dt ,

can be interpreted as the derivative of the function t 7→ Xt , i.e. it satisfies

lim_{h↓0} (Xt+h − Xt )/h = f (Xt , t) .

• Integral representation. Roughly speaking, the reason for the notation dBt in (4.15) is
that diffusion processes can be obtained by integrating µ(x, t) against dt and σ(x, t) against
dBt , i.e. against Brownian motion increments. In particular, one can think of a diffusion
process X = (Xt )t∈[0,∞) with drift µ and volatility σ as the limit, as h ↓ 0, of a process
defined by the recursion

Xt+h = Xt + µ(Xt , t)h + σ(Xt , t)(Bt+h − Bt ) .    (4.18)

The above defines the values of X on the grid t ∈ {0, h, 2h, 3h, . . . }, which can be extended
through, e.g., linear interpolation to all t ∈ [0, ∞). The above construction justifies the
interpretation of dBt in the SDE as the increment Bt+h − Bt for h ↓ 0. A code sketch of
the recursion (4.18) is given after this remark. Intuitively, equation (4.18) means that Xt
can be thought of as the solution to the following integral equation

Xt = X0 + ∫_0^t µ(Xs , s)ds + ∫_0^t σ(Xs , s)dBs .    (4.19)

However, formally defining the second integral in (4.19) requires some care and is
beyond the scope of this courseᵃ.
ᵃ The main issue is that the so-called absolute variation of BM trajectories is a.s. infinite, i.e.
Pr(lim_{n→∞} Σ_{i=0}^{n−1} |B(i+1)t/n − Bit/n | = ∞) = 1 for any t > 0, and as a consequence Riemann-sum approx-
imations to integrals of the form ∫_0^t σ(Xs , s)dBs do not have a unique limit (i.e. different types of Riemann
sums can lead to different limits and thus different values of the integral). Interested students can find more
details (including the notion of Ito integrals) in chapters about ‘stochastic calculus’ in more advanced stochastic
processes books, see e.g. Stirzaker [2005] for a light introduction or Pavliotis [2014] for a more formal, though
still accessible, treatment.
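The recursion (4.18) is also the simplest simulation scheme for SDEs (an Euler-Maruyama scheme). The following Python/NumPy sketch implements it; the constant drift and volatility in the example call reproduce the setting of Example 4.3.1 below, and the step size h is an illustrative numerical choice.

```python
import numpy as np

# Minimal sketch of the recursion (4.18): simulate dX_t = mu(X_t,t)dt + sigma(X_t,t)dB_t
# on a grid of step h, using B_{t+h} - B_t ~ N(0, h).
def euler_maruyama(mu, sigma, x0, t_max, h, rng=np.random.default_rng(6)):
    n = int(np.ceil(t_max / h))
    xs = np.empty(n + 1)
    xs[0] = x0
    for k in range(n):
        t = k * h
        dB = rng.normal(0.0, np.sqrt(h))           # Brownian increment over [t, t+h]
        xs[k + 1] = xs[k] + mu(xs[k], t) * h + sigma(xs[k], t) * dB
    return xs

# Example: BM with constant drift mu = 0.5 and volatility sigma = 1 (cf. Example 4.3.1).
path = euler_maruyama(mu=lambda x, t: 0.5, sigma=lambda x, t: 1.0,
                      x0=0.0, t_max=10.0, h=0.01)
print("X_T =", path[-1], " (mean 5.0, sd about 3.16 over repeated runs)")
```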

We now provide a simple example of diffusion process, which admits a closed-form solution.

Example 4.3.1: BM with constant drift

Let µ ∈ R, σ ∈ R+ ,

(Bt )t∈[0,∞) ∼ BM (1)

and

Xt = µt + σBt t ≥ 0.

Then (Xt )t∈[0,∞) is a diffusion process with

µ(x, t) = µ ,    σ²(x, t) = σ² ,    ∀x ∈ R, t ≥ 0 .

The conditions in Definition 20 can be verified as follows. First note that

∆Xt := (Xt+h − Xt | Xt = x)
     = ((t + h)µ + σBt+h − tµ − σBt | Xt = x)
     =d µh + σ(Bt+h − Bt ) ,

where in the last equality we used the property of independent increments of BM. Thus

E[∆Xt ] = µh + σ E[Bt+h − Bt ] = µh    (4.20)
E[(∆Xt )²] = µ²h² + 2µh σ E[Bt+h − Bt ] + σ² E[(Bt+h − Bt )²]    (4.21)
           = σ² h + o(h)    (4.22)

as h ↓ 0, as desired.
The above example also implies that X ∼ BM (σ) itself is a diffusion process with zero drift
and constant volatility, i.e. µ(x, t) = 0 and σ 2 (x, t) = σ 2 for all t > 0 and x ∈ R.
Remarks
• If t 7→ Xt were deterministic, then

Xt+h − Xt = µh + o(h)    (4.23)
(Xt+h − Xt )² = σ² h + o(h)    (4.24)

could not be simultaneously satisfied. For SDEs they do hold jointly in expectation.
Intuitively, equation (4.23) relates to a deterministic part while equation (4.24) to a
stochastic one.
• In the Brownian Motion with constant drift example the two parts of the SDE ‘evolve
independently’. In particular, the process solving the equation dXt = µdt + σdBt
coincides with the sum of the one solving dXt = µdt, which is Xt = µt, and the one
solving dXt = σdBt , which is Xt = σBt . In general this is not the case, since the two
components of the SDE (namely µ(Xt , t)dt and σ(Xt , t)dBt ) do interact, because the
values of the drift µ(Xt , t) and volatility σ²(Xt , t) depend on the current state Xt .

4.3.3 Transition kernels and Chapman-Kolmogorov (CK) equations


Let X = (Xt )t∈[0,∞) be a time-homogeneous, continuous-time Markov process on X = R, and let

P^t (x, y) ,    x, y ∈ R; t > 0 ,    (4.25)

be the density of Xt at y given X0 = x. In other words, P^t (x, y) is such that

Pr(Xt ∈ A | X0 = x) = ∫_A P^t (x, y)dy ,    A ⊆ R.    (4.26)

For example, for X ∼ BM (σ), we have

P^t (x, y) = 1/√(2πtσ²) exp( −(x − y)²/(2tσ²) ) ,    t > 0, x, y ∈ R .    (4.27)

The transition kernel P^t = (P^t (x, y))x,y∈R can also be seen as an operator that acts on functions:
for every f : R → R define the function P^t f : R → R as

P^t f (x) = ∫ f (y)P^t (x, y)dy = E[f (Xt ) | X0 = x] .    (4.28)

By Markovianity and time-homogeneity, (P^t )t≥0 satisfies

P^{t+s} (x, y) = ∫_R P^t (x, z)P^s (z, y)dz ,    t, s > 0, x, y ∈ R ,    (4.29)

or, equivalently,

P^t P^s f = P^{t+s} f    (4.30)

for every f . We refer to (4.29) and (4.30) by saying that (P^t )t≥0 is a semigroup. Equalities (4.29)
and (4.30) are the continuous-time and continuous-space version of the CK equations seen in the
discrete context. The equality in (4.30) can be derived as follows. The one in (4.29) can either be
derived using (4.30) or directly using the law of total probability (see Exercise 55).
Derivation of (4.30).

P^{t+s} f (x) = E[f (Xt+s ) | X0 = x]             (definition of P^{t+s})
            = E[ E[f (Xt+s ) | Xt ] | X0 = x]     (tower property)
            = E[P^s f (Xt ) | X0 = x]             (def. of P^s, Markov prop. and time-hom.)
            = P^t P^s f (x)                       (definition of P^t).

4.3.4 Generators
Similarly to CTMCs, also in the case of continuous space we can characterize the semigroup (P^t )t≥0
through its generator. Intuitively, the generator represents the derivative of P^t at t = 0, i.e.

“Q = (d/dt) P^t |_{t=0}” .    (4.31)

More precisely, we define Q as an operator that sends a function f : R → R into a function
Qf : R → R defined as

Qf (x) = (d/dt) P^t f (x) |_{t=0} = lim_{h↓0} (P^h f (x) − f (x))/h = lim_{h↓0} (E[f (Xh ) | X0 = x] − f (x))/h .    (4.32)

Remark
In the CTMC case (i.e. when |X | < ∞), the generator was a matrix Q, which can be viewed as
an operator that sends vectors into vectors, i.e. R^{|X|} into R^{|X|}. In this sense, defining Q as
an operator that sends functions f : R → R into functions Qf : R → R is a generalization of
Q-matrices to the case of general state spaces.

The following lemma characterizes the generators of diffusion processes⁵.
Lemma 10. Let X = (Xt )t∈[0,∞) be a diffusion process solving the SDE

dXt = µ(Xt )dt + σ(Xt )dBt .    (4.33)

Then, for smooth enough f,

Qf (x) = µ(x) ∂f/∂x (x) + (1/2) σ²(x) ∂²f/∂x² (x) .    (4.34)

Sketch of proof.⁶ We have

Qf (x) = lim_{h↓0} (E[f (Xh ) | X0 = x] − f (x))/h    (def. of Q)
       = lim_{h↓0} h^{−1} E[f (Xh ) − f (X0 ) | X0 = x]
       = lim_{h↓0} h^{−1} E[f ′(x)∆h + (1/2) f ″(x)∆h² + o(∆h²)]    with ∆h := (Xh − X0 | X0 = x)
       = f ′(x) lim_{h↓0} h^{−1} E[∆h ] + (1/2) f ″(x) lim_{h↓0} h^{−1} E[∆h²]    (assuming E[o(∆h²)] = o(h))
       = µ(x)f ′(x) + (σ²(x)/2) f ″(x)    (by (i) and (ii) in the definition of diffusions).

The statement in (4.34) is often stated as saying that the generator of the diffusion process in
(4.33) is the differential operator

Q = µ ∂/∂x + (σ²/2) ∂²/∂x² ,

which is just a more compact notation for (4.34). For example, the generator of the Brownian
Motion is

Q = (σ²/2) d²/dx² ,

which generalizes to the so-called Laplacian operator in R^d: Q = (1/2) Σ_{i=1}^{d} ∂²/∂xi². The latter is an
important operator that plays a role in many areas of pure and applied mathematics.
5 For simplicity, in this section we restrict our attention to the case of µ and σ depending only on x and not on

t. This allows to stay within the class of time-homogeneous processes. Note that nonetheless, the key equations we
derive (i.e. (4.34), (4.35) and (4.37)) still hold in the case of time-dependent drift and variance simply by replacing
µ(x) and σ(x) with µ(x, t) and σ(x, t).
6 Here and throughout this section, we do not formally justify all steps in our derivations (e.g. exchanges of limits

and integrals, ignoring smaller order terms in Taylor expansion, etc). A fully rigorous treatment of SDEs is well
beyond the scope of this section.

Remark

The first derivative term in the generator, µ ∂/∂x, relates to the deterministic part of X (i.e. its
drift, which has size O(h)); while the second derivative term, (σ²/2) ∂²/∂x², relates to the stochastic
part of X (i.e. its volatility/variance, which has zero mean and size O(√h)). See also Exercise
56.

4.3.5 Fokker-Planck equations

We can use generators to compute the Kolmogorov Forward Equations (KFE) for diffusions, which
are often referred to as Fokker-Planck (FP) equations.
Theorem 13 (Fokker-Planck equations). Let (Xt )t∈[0,∞) be a diffusion process with (smooth enough)
drift and diffusion functions µ(x) and σ²(x). Then

∂/∂t P^t (x, y) = − ∂/∂y [ P^t (x, y)µ(y) ] + (1/2) ∂²/∂y² [ P^t (x, y)σ²(y) ] .    (4.35)

Sketch of proof. For every test function f we have

∫ f (y) ∂/∂t P^t (x, y)dy = d/dt ∫ f (y)P^t (x, y)dy = d/dt P^t f (x)
    = lim_{h↓0} (P^{t+h} f (x) − P^t f (x))/h
    = lim_{h↓0} P^t (P^h f − f )(x)/h    (by (4.30))
    = P^t ( lim_{h↓0} (P^h f − f )/h )(x) = P^t Qf (x)    (by def. of Q).

Let f ∈ C0²(R), where C0² denotes functions with two continuous derivatives and bounded support.
Then, by (4.34), we have

P^t Qf (x) = ∫ [ f ′(z)µ(z) + (σ²(z)/2) f ″(z) ] P^t (x, z)dz
          = − ∫ f (z) d/dz [ P^t (x, z)µ(z) ] dz + ∫ f (z) d²/dz² [ P^t (x, z)σ²(z)/2 ] dz ,

where in the last equality we used integration by parts and the bounded support of f . We have
thus shown

∫ f (y) ∂/∂t P^t (x, y)dy = ∫ f (y)g(y)dy    for all f ∈ C0²(R) ,    (4.36)

with

g(y) := − d/dy [ P^t (x, y)µ(y) ] + (1/2) d²/dy² [ P^t (x, y)σ²(y) ] .

Since (4.36) holds for every f ∈ C0², we conclude⁷ that ∂/∂t P^t (x, y) = g(y) for almost every y.
⁷ Here we use the fact that ∫ h(x)f (x)dx = 0 for every f ∈ C0² implies h = 0 almost everywhere.

4.3.6 Density evolution

The Fokker-Planck equations give a way to compute the distribution of Xt starting from the one
of X0 . Let h(y, t) be the pdf of Xt , defined by

∫_A h(y, t)dy = Pr(Xt ∈ A) ,    t ≥ 0, A ⊆ R .

By the law of total probability, conditioning on X0 , we have

h(y, t) = ∫_R P^t (x, y)h(x, 0)dx ,    t ≥ 0, y ∈ R .

Combining the latter equation with (4.35) we obtain

d/dt h(y, t) = ∫ ( d/dt P^t (x, y) ) h(x, 0)dx
            = − ∂/∂y [ µ(y) ∫ P^t (x, y)h(x, 0)dx ] + (1/2) ∂²/∂y² [ σ²(y) ∫ P^t (x, y)h(x, 0)dx ]
            = − ∂/∂y [ µ(y)h(y, t) ] + (1/2) ∂²/∂y² [ σ²(y)h(y, t) ] .

Thus, given X0 ∼ f0 for some pdf f0 , (h(y, t))y∈R,t∈[0,∞) solves the following equations:

h(y, 0) = f0 (y) ,    y ∈ R    (4.37)
∂/∂t h(y, t) = − ∂/∂y [ µ(y)h(y, t) ] + (1/2) ∂²/∂y² [ σ²(y)h(y, t) ] ,    y ∈ R, t ∈ [0, ∞) .

4.3.7 Stationary Distributions

The system in (4.37) can be used to identify the stationary distribution(s) of a diffusion process.
Recall that a distribution π is stationary for a stochastic process (Xt )t∈[0,∞) if X0 ∼ π implies
Xt ∼ π for all t ≥ 0. Equivalently, a pdf π on R is stationary for (Xt )t∈[0,∞) if imposing

h(y, 0) = π(y) ,    y ∈ R,

results in ∂/∂t h(y, t) = 0 for all y ∈ R and t ≥ 0. By (4.37), this is equivalent to requiring π to satisfy

0 = − ∂/∂y [ µ(y)π(y) ] + (1/2) ∂²/∂y² [ σ²(y)π(y) ] ,    y ∈ R,

or in compact notation

(µπ)′ = (1/2) (σ²π)″ .

Figure 4.6: Illustration of two solutions to (4.37) for diffusion processes with different coefficients
and starting distributions. Left: analytically computed solution to (4.37) (with unimodal starting
distribution for X0 , and µ and σ as in Example 4.3.2). Right: example with a bimodal starting
pdf for X0 , where the solution is approximated by simulating many independent trajectories of
the diffusion process and plotting the evolution of the empirical distribution of the trajectories at
different times.

Integrating both sides we obtain

µπ = (1/2) (σ²π)′ + const .    (4.38)

The above is a joint constraint between µ, σ and π. Suppose that we are given π and seek to
construct a π-invariant diffusion process. For fixed (smooth enough) π, there are infinitely many com-
binations of σ and µ satisfying (4.38), meaning that there are infinitely many diffusion processes
with the same stationary distribution. An interesting case is the one of constant volatility, i.e.
σ²(x) = σ² for some fixed σ² > 0, and where the constant term in (4.38) is set to zero. Then there
is a one-to-one correspondence between the stationary distribution π and the corresponding drift
function, given by

µ(x) = (σ²/2) (dπ(x)/dx) / π(x) = (σ²/2) (log π)′(x) ,    x ∈ R.

In particular, for any pdf π (such that log π is well defined and differentiable), taking σ²(x) = σ²
and µ = (σ²/2) (log π)′ leads to a π-invariant diffusion process.
More generally, given any multivariate pdf π on X = R^d, the SDE

dXt = (σ²/2) ∇ log π(Xt )dt + σdBt    (4.39)

induces a π-invariant diffusion process.

Remark
The process in (4.39) is called the (overdamped) Langevin diffusion, which is the canonical π-
invariant (actually π-reversible) stochastic process for an arbitrary smooth distribution π on
R^d. The Langevin diffusion is often used in many areas (such as Physics, Statistics, Computer
Science, Machine Learning, etc.) both to model stochastic phenomena as well as to define
computational algorithms to sample from continuous multivariate distributions. Intuitively,
just like the gradient flow dXt = (σ²/2) ∇ log π(Xt )dt (and its discretizations) defines the canonical
flow used to compute the mode of a multivariate function log π, we can think of X defined in
(4.39) as a ‘gradient flow plus noise’ which defines a canonical stochastic process used to draw
samples from π. An important feature of (4.39) is that the amount of noise and gradient push
are balanced (i.e. the drift scales with the noise in a specific way, namely σ²/2) so that the process
preserves π as its invariant distribution (e.g. it spends a fraction of time near the mode vs in
the tails as specified by π, etc.).
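As a concrete illustration, one can combine the Langevin SDE (4.39) with the discretization (4.18) to draw approximate samples from a target density π. The following Python/NumPy sketch does this for an illustrative Gaussian target (so the result can be checked against the known mean and standard deviation); note that the discretization introduces a small bias that vanishes only as the step size h goes to zero.

```python
import numpy as np

# Minimal sketch: discretize the Langevin SDE (4.39) via the recursion (4.18)
# to sample (approximately) from pi. Illustrative target: pi = N(2, 0.5^2),
# for which grad log pi(x) = -(x - 2) / 0.25.
rng = np.random.default_rng(7)
sigma, h, n_steps = 1.0, 1e-3, 200_000

grad_log_pi = lambda x: -(x - 2.0) / 0.25
x = 0.0
samples = np.empty(n_steps)
for k in range(n_steps):
    x += 0.5 * sigma**2 * grad_log_pi(x) * h + sigma * np.sqrt(h) * rng.normal()
    samples[k] = x

# After discarding an initial transient, the samples should resemble pi.
tail = samples[n_steps // 2:]
print("mean (target 2.0):", tail.mean(), "  sd (target 0.5):", tail.std())
```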

Example 4.3.2: Ornstein-Uhlenbeck process

A special case of (4.39) is given by the SDE on X = R defined as

dXt = µ(Xt )dt + σdBt ,    with µ(x) = −a(x − θ)    (4.40)

for some fixed a, θ, σ ∈ R. The resulting diffusion X is called the Ornstein-Uhlenbeck process
and it admits a Gaussian invariant distribution and closed-form transition probabilities. See
Exercise 57.
Appendix A

Appendix

THE MATERIAL IN THE APPENDIX IS NOT PART OF THE SYLLABUS. IT IS INCLUDED


FOR COMPLETENESS BUT WILL NOT BE EXAMINED.

A.1 Viterbi algorithm for finite HMMs


Suppose we want to compute the filtering distribution

Pr(Xt = j | S0:t = s0:t ) = Pr(Xt = j, S0:t = s0:t ) / Pr(S0:t = s0:t ) ,    j ∈ X , s0:t ∈ S^{t+1} .    (A.1)

In order to compute the above, we need to express it as a function of the parameters specifying
our model, i.e. (Pij )i,j∈X and {f (s | j)}s∈S,j∈X . We now show how to do it for generic HMMs
with finite state spaces. Consider first the numerator of (A.1). By (2.2) we have

Pr(Xt = j, S0:t = s0:t ) = Σ_{(x0 ,...,xt−1 )∈X^t} αx0 Px0 x1 · · · Pxt−1 j f (s0 | x0 ) · · · f (st−1 | xt−1 ) f (st | j).    (A.2)

The above equation expresses Pr(Xt = j, S0:t = s0:t ) in terms of things that we know. However,
computing Pr(Xt = j, S0:t = s0:t ) through the above expression requires summing over all possible
trajectories (x0 , ..., xt−1 ) ∈ X t , whose number is |X |t . This number grows exponentially with t,
and computing |X |t quantities becomes computationally infeasible as soon as t is modestly large,
e.g. t ≥ 20. There is a smarter way to compute probabilities of interest in HMM by exploiting the
sequential nature of the model. We first show how to compute the numerator of (A.1) efficiently
using a recursive algorithm, known as Viterbi algorithm.
Define the un-normalized filtering distributions
\[
F_t(j) := \Pr(X_t = j, S_{0:t} = s_{0:t}), \qquad t \geq 0,\ j \in \mathcal{X},
\]
where $(s_0, s_1, s_2, \ldots)$ are the observed signals, which we take as fixed here. The idea is to compute $F_t(j)$ recursively using the analogous quantities for $t-1$, i.e. $\{F_{t-1}(i)\}_{i \in \mathcal{X}}$, via the following lemma (whose proof is left as an exercise, see exercise list and solutions).

Lemma 11. We have
\[
F_t(j) = f(s_t \mid j) \sum_{i \in \mathcal{X}} F_{t-1}(i) P_{ij}, \qquad t \geq 1,\ j \in \mathcal{X}. \tag{A.3}
\]
Thus, the recursive way of computing $F_t(j)$ is the following.

Algorithm 1 (Forward pass of Viterbi algorithm).

1. (Initialization for $t = 0$) Compute
\[
F_0(j) = \Pr(X_0 = j, S_0 = s_0) = \alpha_j f(s_0 \mid j) \qquad \text{for all } j \in \mathcal{X};
\]

2. (Recursion for $t = 1, 2, \ldots$) For $t = 1, 2, \ldots$ compute
\[
F_t(j) = f(s_t \mid j) \sum_{i \in \mathcal{X}} F_{t-1}(i) P_{ij} \qquad \text{for all } j \in \mathcal{X}. \tag{A.4}
\]
This recursive way of computing $F_t(j)$ is referred to as the forward pass, as we start from $t = 0$ and go forward in time. It has cost $O(t|\mathcal{X}|^2)$ to compute $\{F_r(j)\}_{r = 0, \ldots, t;\ j \in \mathcal{X}}$, rather than the $O(|\mathcal{X}|^t)$ cost of the naive approach based on (A.2). This allows one to perform inference on $X_t$ given the observed signals $s_{0:t}$ even when $t$ and/or $|\mathcal{X}|$ are large. Given $\{F_t(j)\}_{t \geq 0, j \in \mathcal{X}}$, we can easily compute the so-called filtering distribution as
\[
\Pr(X_t = j \mid S_{0:t} = s_{0:t}) = \frac{\Pr(X_t = j, S_{0:t} = s_{0:t})}{\Pr(S_{0:t} = s_{0:t})} = \frac{F_t(j)}{\sum_{i \in \mathcal{X}} F_t(i)}, \qquad j \in \mathcal{X}.
\]
Given the filtering distributions $\{\Pr(X_t = j \mid S_{0:t} = s_{0:t})\}_{t \geq 0, j \in \mathcal{X}}$ we can efficiently compute predictive distributions as
\begin{align*}
\Pr(X_{t+1} = j \mid S_{0:t} = s_{0:t}) &= \sum_{i \in \mathcal{X}} \Pr(X_{t+1} = j \mid S_{0:t} = s_{0:t}, X_t = i) \Pr(X_t = i \mid S_{0:t} = s_{0:t}) \\
&= \sum_{i \in \mathcal{X}} P_{ij} \Pr(X_t = i \mid S_{0:t} = s_{0:t}), \qquad j \in \mathcal{X},
\end{align*}
and$^1$
\begin{align*}
\Pr(S_{t+1} = s \mid S_{0:t} = s_{0:t}) &= \sum_{i \in \mathcal{X}} \Pr(S_{t+1} = s \mid S_{0:t} = s_{0:t}, X_{t+1} = i) \Pr(X_{t+1} = i \mid S_{0:t} = s_{0:t}) \\
&= \sum_{i \in \mathcal{X}} f(s \mid i) \Pr(X_{t+1} = i \mid S_{0:t} = s_{0:t}), \qquad s \in \mathcal{S}.
\end{align*}

$^1$ Example 4.43 in Ross [2014] applies the above calculations to the specific quantities in Example 2.3.2, which can serve as a pedagogical example of using the above recursions.
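As an illustration (not part of the original notes), here is a minimal sketch of the forward pass together with the filtering and one-step-ahead predictive distributions for a finite HMM; the array layout and the toy two-state model below are illustrative assumptions:

```python
import numpy as np

def forward_pass(alpha, P, f, signals):
    """Un-normalized filtering weights F_t(j) via the recursion (A.3)/(A.4).

    alpha:   (K,) initial distribution of X_0
    P:       (K, K) transition matrix, P[i, j] = Pr(X_{t+1} = j | X_t = i)
    f:       (K, S) emission probabilities, f[j, s] = f(s | j)
    signals: observed signals s_0, ..., s_T, encoded as integers in {0, ..., S-1}
    """
    T = len(signals)
    F = np.zeros((T, len(alpha)))
    F[0] = alpha * f[:, signals[0]]               # F_0(j) = alpha_j f(s_0 | j)
    for t in range(1, T):
        F[t] = f[:, signals[t]] * (F[t - 1] @ P)  # F_t(j) = f(s_t | j) sum_i F_{t-1}(i) P_ij
    filtering = F / F.sum(axis=1, keepdims=True)  # Pr(X_t = j | S_{0:t} = s_{0:t})
    return F, filtering

# Toy two-state example with made-up numbers.
alpha = np.array([0.5, 0.5])
P = np.array([[0.9, 0.1],
              [0.2, 0.8]])
f = np.array([[0.7, 0.3],
              [0.1, 0.9]])
F, filt = forward_pass(alpha, P, f, signals=[0, 0, 1, 1, 1])
state_pred = filt[-1] @ P      # Pr(X_{t+1} = j | S_{0:t} = s_{0:t})
signal_pred = state_pred @ f   # Pr(S_{t+1} = s | S_{0:t} = s_{0:t})
print(filt[-1], state_pred, signal_pred)
```

In practice one would work with logarithms (or renormalize $F_t$ at every step) to avoid numerical underflow when $t$ is large.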
A.2 Kalman Filter for Gaussian HMMs

Recall the closure under multiplication of Gaussian kernels, i.e. the Gaussian-Gaussian conjugacy.

(Gaussian-Gaussian conjugacy)
Given $m \in \mathbb{R}$ and $\sigma \in \mathbb{R}_+$, denote the pdf of a univariate Gaussian distribution $N(m, \sigma^2)$ evaluated at $x$ by
\[
N(x; m, \sigma^2) = (2\pi\sigma^2)^{-1/2} \exp\left(-\frac{(x - m)^2}{2\sigma^2}\right), \qquad x \in \mathbb{R}. \tag{A.5}
\]
Lemma 12. Then, for every $m_1, m_2 \in \mathbb{R}$ and $\tau_1, \tau_2 \in \mathbb{R}_+$, we have
\[
N\!\left(x; m_1, \frac{1}{\tau_1}\right) N\!\left(x; m_2, \frac{1}{\tau_2}\right) \propto N\!\left(x; \frac{\tau_1 m_1 + \tau_2 m_2}{\tau_1 + \tau_2}, \frac{1}{\tau_1 + \tau_2}\right), \qquad x \in \mathbb{R}, \tag{A.6}
\]
where $\propto$ denotes proportionality with respect to $x$ (i.e. going from the left-hand side to the right-hand side of the equation we ignore multiplicative constants independent of $x$).
Proof. The statement can be easily verified by checking the proportionality starting from the definition of $N(x; m, \sigma^2)$.
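As a quick illustration (not part of the original notes), one can check (A.6) numerically on a grid; the parameter values below are arbitrary:

```python
import numpy as np
from scipy.stats import norm

# Numerical sanity check of the Gaussian-Gaussian conjugacy (A.6).
m1, tau1, m2, tau2 = 0.5, 2.0, -1.0, 0.5
x = np.linspace(-8, 8, 4001)
dx = x[1] - x[0]

prod = norm.pdf(x, m1, 1 / np.sqrt(tau1)) * norm.pdf(x, m2, 1 / np.sqrt(tau2))
prod /= prod.sum() * dx  # normalize, since (A.6) only holds up to a constant in x

target = norm.pdf(x, (tau1 * m1 + tau2 * m2) / (tau1 + tau2), 1 / np.sqrt(tau1 + tau2))
print(np.max(np.abs(prod - target)))  # numerically close to zero
```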
Typical application: update the current knowledge $X \sim N(m_x, \sigma_x^2)$ given a new observation $Y \mid X \sim N(X, \sigma_y^2)$. Then
\begin{align*}
p(X = x \mid Y = y) &\propto N(x; m_x, \sigma_x^2)\, N(y; x, \sigma_y^2) = N(x; m_x, \sigma_x^2)\, N(x; y, \sigma_y^2) \\
&\propto N\!\left(x; \frac{\sigma_x^{-2} m_x + \sigma_y^{-2} y}{\sigma_x^{-2} + \sigma_y^{-2}}, \frac{1}{\sigma_x^{-2} + \sigma_y^{-2}}\right), \qquad x, y \in \mathbb{R},
\end{align*}
meaning that $E[X \mid Y] = \frac{\sigma_x^{-2} m_x + \sigma_y^{-2} Y}{\sigma_x^{-2} + \sigma_y^{-2}}$ and $Var(X \mid Y) = \frac{1}{\sigma_x^{-2} + \sigma_y^{-2}}$.
Intuition: once you update your prior knowledge with a new observation $y$, the precision of your posterior knowledge about $X$ increases from $\sigma_x^{-2}$ to $\sigma_x^{-2} + \sigma_y^{-2}$, and the expectation updates from $m_x$ to the weighted average of $m_x$ and $y$, with weights given by the precisions of the prior and the likelihood. Intuitively, you can think of $X$ as being pulled towards $m_x$ and $y$ by the kernels $N(x; m_x, \sigma_x^2)$ and $N(x; y, \sigma_y^2)$, with strengths $\sigma_x^{-2}$ and $\sigma_y^{-2}$ respectively.
The Gaussian-Gaussian conjugacy allows one to perform closed-form updates of the filtering and predictive distributions as follows. In terms of notation, we define $((m_t, \sigma_t^2))_{t \geq 0}$ and $((\hat m_t, \hat\sigma_t^2))_{t \geq 0}$ as the filtering and predictive parameters$^2$, meaning that
\begin{align*}
\mathcal{L}(X_t \mid S_{0:t} = s_{0:t}) &= N(m_t, \sigma_t^2), \\
\mathcal{L}(X_t \mid S_{0:(t-1)} = s_{0:(t-1)}) &= N(\hat m_t, \hat\sigma_t^2).
\end{align*}
The following algorithm provides an efficient way to compute $((m_t, \sigma_t^2))_{t \geq 0}$ and $((\hat m_t, \hat\sigma_t^2))_{t \geq 0}$ recursively. Below we assume an initial distribution $X_0 \sim N(\hat m_0, \hat\sigma_0^2)$ for some fixed $\hat m_0 \in \mathbb{R}$ and $\hat\sigma_0 \in \mathbb{R}_+$ (i.e. $\alpha = N(\hat m_0, \hat\sigma_0^2)$ in the notation of the previous subsection).
Algorithm 2 (Forward pass of Kalman filter).

1. (Initialization for $t = 0$) Set $\hat m_0$ and $\hat\sigma_0^2$ to the parameters of the starting distribution $X_0 \sim \alpha = N(\hat m_0, \hat\sigma_0^2)$, and compute $m_0$ and $\sigma_0^2$ as in (A.8) with $t = 0$.

2. For $t = 1, 2, \ldots$ do:

• Predict step: set
\[
\hat m_t = m_{t-1}, \qquad \hat\sigma_t^2 = \sigma_{t-1}^2 + \sigma_\eta^2. \tag{A.7}
\]

• Update step: set
\[
m_t = (1 - K_t)\hat m_t + K_t s_t, \qquad \sigma_t^2 = (1 - K_t)\hat\sigma_t^2, \tag{A.8}
\]
where $K_t := \frac{\sigma_\varepsilon^{-2}}{\hat\sigma_t^{-2} + \sigma_\varepsilon^{-2}}$ is the so-called Kalman gain (which can be interpreted as the gain of information when observing $S_t = s_t$), and $\sigma_\varepsilon^2$ denotes the variance of the observation noise.
$^2$ Here we are already implicitly assuming that the filtering and predictive distributions will be of Gaussian form. This is indeed the case, and it follows from the closure of Gaussian kernels under convolution and multiplication.
Derivation of (A.7) and (A.8). The formulas in (A.7) follow from $X_t = X_{t-1} + \eta_t$ combined with $\mathcal{L}(X_{t-1} \mid S_{0:(t-1)} = s_{0:(t-1)}) = N(m_{t-1}, \sigma_{t-1}^2)$ and $\mathcal{L}(\eta_t \mid S_{0:(t-1)} = s_{0:(t-1)}) = N(0, \sigma_\eta^2)$. The one in (A.8) follows from
\begin{align*}
p(X_t = x \mid S_{0:t} = s_{0:t}) &\propto p(X_t = x, S_{0:t} = s_{0:t}) \\
&\propto p(S_t = s_t \mid X_t = x)\, p(X_t = x, S_{0:(t-1)} = s_{0:(t-1)}) \\
&\propto N(s_t; x, \sigma_\varepsilon^2)\, N(x; \hat m_t, \hat\sigma_t^2) \\
&= N(x; s_t, \sigma_\varepsilon^2)\, N(x; \hat m_t, \hat\sigma_t^2)
\end{align*}
and Lemma 12.
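As an illustration (not part of the original notes), here is a minimal sketch of the forward pass of the Kalman filter for the scalar local-level model used above ($X_t = X_{t-1} + \eta_t$, $S_t = X_t + \varepsilon_t$); the function and variable names are illustrative:

```python
import numpy as np

def kalman_forward(signals, m0_hat, s0_hat2, sigma_eta2, sigma_eps2):
    """Forward pass of the Kalman filter (Algorithm 2) for the scalar local-level model."""
    m_hat, s_hat2 = m0_hat, s0_hat2
    filtering = []  # list of filtering parameters (m_t, sigma_t^2)
    for t, obs in enumerate(signals):
        if t > 0:
            # Predict step (A.7); at t = 0 the prior itself plays the role of the prediction.
            m_hat, s_hat2 = m, s2 + sigma_eta2
        K = (1.0 / sigma_eps2) / (1.0 / s_hat2 + 1.0 / sigma_eps2)  # Kalman gain K_t
        m = (1.0 - K) * m_hat + K * obs                              # update step (A.8)
        s2 = (1.0 - K) * s_hat2
        filtering.append((m, s2))
    return filtering

# Simulate data from the model and filter it (illustrative parameter values).
rng = np.random.default_rng(0)
T, sigma_eta, sigma_eps = 100, 0.3, 1.0
x = np.cumsum(sigma_eta * rng.standard_normal(T))    # latent random walk X_0, ..., X_{T-1}
s = x + sigma_eps * rng.standard_normal(T)           # noisy observations S_t = X_t + eps_t
est = kalman_forward(s, m0_hat=0.0, s0_hat2=10.0,
                     sigma_eta2=sigma_eta**2, sigma_eps2=sigma_eps**2)
print(est[-1])  # filtering mean and variance of the last latent state
```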
A.3 Smoothness of GPs

In the discussion around Figure 4.5, we mentioned that the smoothness of $C$ governs the smoothness of the trajectories of $X$. This can be formalized as follows: taking $C(t, s) = k(|t - s|)$, we have that

1. (Continuity) Continuity of $k$ at $0$ implies continuity of $X$ in an $L^2$-sense, i.e.
\[
\lim_{h \downarrow 0} E[|X_{t+h} - X_t|^2] = 0 \qquad \forall t.
\]

2. (Differentiability) Existence of $k''$ at $0$ implies differentiability of $X$ in an $L^2$-sense, i.e. existence of $X' = (X'_t)_{t \in T}$ s.t.
\[
\lim_{h \downarrow 0} E\left[\left|\frac{X_{t+h} - X_t}{h} - X'_t\right|^2\right] = 0 \qquad \forall t \in T.
\]

One also has
\[
X' \sim GP(m', C') \qquad \text{where } C'(t, s) := \frac{\partial^2 C}{\partial t\, \partial s}(t, s).
\]
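As a brief worked example (not in the original notes), consider two standard stationary kernels:
\[
k_{\mathrm{SE}}(r) = \sigma^2 e^{-r^2/(2\ell^2)}, \qquad k_{\mathrm{exp}}(r) = \sigma^2 e^{-|r|/\ell}.
\]
The squared-exponential kernel $k_{\mathrm{SE}}$ is smooth at $0$ with $k_{\mathrm{SE}}''(0) = -\sigma^2/\ell^2$, so the corresponding GP is differentiable in the $L^2$-sense (indeed infinitely differentiable). The exponential kernel $k_{\mathrm{exp}}$ is continuous at $0$ but has a kink there, so $k_{\mathrm{exp}}''(0)$ does not exist: the corresponding GP (the stationary Ornstein-Uhlenbeck process of Example 4.3.2) has trajectories that are continuous but, like those of Brownian motion, nowhere differentiable.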
Bibliography

Richard Durrett. Essentials of Stochastic Processes. Springer, 2021. URL https://services.math.duke.edu/~rtd/EOSP/EOSP2021.pdf.

David A. Levin and Yuval Peres. Markov Chains and Mixing Times, volume 107. American Mathematical Society, 2017.

James R. Norris. Markov Chains. Cambridge University Press, 1998.

Grigorios A. Pavliotis. Stochastic Processes and Applications. Texts in Applied Mathematics, 60, 2014.

Sheldon M. Ross. Introduction to Probability Models, 11th edition. Academic Press, 2014.

David Stirzaker. Stochastic Processes and Models. OUP Oxford, 2005.