Giacomo Zanella
Chapter 1

Discrete-time Markov chains
Figure 1.1: Example of a trajectory, or path, of a stochastic process with continuous time and
space.
A gambler plays a game where at each turn he wins 1€ with probability p and loses 1€ with probability 1 − p. The gambler starts the game with x€ and quits the game when he reaches either 0€ or N€, for some pre-specified N ∈ N.
1 We will usually assume that all random variables Xt have the same set of possible values, i.e. the same state space X . One could also consider more general settings where X depends on t.
Denoting by Xt the gambler’s amount of money at time t, (Xt )t∈T is a stochastic process
with T = {0, 1, 2...} and X = {0, 1, ..., N }.
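As a concrete illustration (not part of the original notes), here is a minimal Python sketch that simulates one trajectory of this process; the function name and the chosen values of x, N and p are illustrative.

```python
import random

def gamblers_ruin_path(x, N, p, rng=random.Random(0)):
    """Simulate one trajectory of the gambler's ruin chain started at x,
    stopping as soon as the fortune hits 0 or N."""
    path = [x]
    while 0 < path[-1] < N:
        step = 1 if rng.random() < p else -1   # win 1 w.p. p, lose 1 w.p. 1 - p
        path.append(path[-1] + step)
    return path

print(gamblers_ruin_path(x=5, N=10, p=0.5))
```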
Remark
One can think of Lemma 1 as a recipe to generate a sample from the distribution of (X0 , . . . , Xt ),
namely first sample X0 from its marginal distribution, then X1 from the conditional given the
value of X0 , then X2 from the conditional given the value of X1 and X0 , etc.
At this level of generality, the distribution of Xt+1 can depend on the values of X0 , . . . , Xt .
This is a potentially very complicated structure, both to model effectively as well as to analyze
the behavior of the resulting stochastic process mathematically. Markov chains are a much more
tractable and yet very powerful sub-class of stochastic processes, where this structure is simplified
as follows.
Definition 2. A discrete time stochastic process (Xt )t≥0 is a Markov chain if
Pr(Xt+1 = xt+1 | Xt = xt , ..., X0 = x0 ) = Pr(Xt+1 = xt+1 | Xt = xt ) (1.1)
for every t ≥ 1 and every (x0 , . . . , xt+1 ) ∈ X t+2 .
Condition (1.1) is referred to as the Markov property. The intuition of the Markov property is the following: the future depends on the past only through the present. In other words, only the
current state of the system matters, and not how you got there.
2 Here Pr denotes a probability and the commas inside Pr denote intersections, e.g. Pr(X = x, Y = y) = Pr({X =
x} ∩ {Y = y}).
The “Gambler’s ruin” process (Xt )t∈T is a Markov chain. In fact it holds that, for xt ∈ {1, . . . , N − 1}, Pr(Xt+1 = xt + 1 | Xt = xt , . . . , X0 = x0 ) = p and Pr(Xt+1 = xt − 1 | Xt = xt , . . . , X0 = x0 ) = 1 − p, regardless of the values of (x0 , . . . , xt−1 ).
The Markov property implies that the evolution of the Markov chain is fully determined by the
transition probabilities defined as
Pij := Pr(Xt+1 = j | Xt = i) i, j ∈ X .
Note that here we are implicitly assuming time-homogeneity (i.e. that Pij does not depend on t).
In these notes we will always assume time-homogeneity, unless stated otherwise.
When |X | = n, e.g. X = {1, ..., n}, we can arrange the transition probabilities (Pij )i,j∈X in an n × n matrix
$$P = \begin{pmatrix} P_{11} & P_{12} & \cdots & P_{1n} \\ P_{21} & P_{22} & \cdots & P_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ P_{n1} & P_{n2} & \cdots & P_{nn} \end{pmatrix}, \tag{1.2}$$
which is called the transition matrix, and it contains all the information governing the evolution
of the Markov chain. By construction, the matrix P satisfies
$$P_{ij} \ge 0 \quad \forall\, i, j \in \mathcal{X}, \qquad\qquad \sum_{j \in \mathcal{X}} P_{ij} = 1 \quad \forall\, i \in \mathcal{X},$$
since the i-th row (Pij )j∈X is the probability mass function (pmf) of Xt+1 | Xt = i. Indeed, any
matrix P satisfying the above condition is a valid transition matrix (i.e. it can be interpreted as
the transition matrix of a Markov chain).
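A quick numerical sanity check of these two conditions can be useful when specifying models in code. Here is a minimal sketch (not part of the notes), using a two-state example with hypothetical entries.

```python
import numpy as np

def is_transition_matrix(P, tol=1e-12):
    """Check that all entries are non-negative and each row sums to 1."""
    P = np.asarray(P, dtype=float)
    return bool(np.all(P >= -tol) and np.allclose(P.sum(axis=1), 1.0, atol=tol))

P = np.array([[0.8, 0.2],
              [0.3, 0.7]])
print(is_transition_matrix(P))  # True
```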
Let Xt be the weather condition on day t in Milan. Here Xt = 0 means rainy and Xt = 1
means not rainy. Suppose that if it rains today it will rain tomorrow with probability α,
regardless of the previous days, while if it does not rain today it will rain tomorrow with
probability β.
Then (Xt )t≥0 is a Markov chain on X = {0, 1}, because α = Pr(Xt+1 = 0 | Xt = 0) and β = Pr(Xt+1 = 0 | Xt = 1) do not depend on the weather of the previous days.
We may assume that Xt follows a Markov chain with changes of status happening according
to the following transition probability:
$$P = \begin{pmatrix} 0.7 & 0.2 & 0.1 \\ 0.3 & 0.5 & 0.2 \\ 0.2 & 0.4 & 0.4 \end{pmatrix} \tag{1.3}$$
A gene has two variants (alleles), A and B, one of which is harmful. All individuals in a
population have either allele A or B. The following is a simplified version of the Wright-Fisher
model, which provides a useful description of how the gene spreads in the population:
• at each generation there are N individuals in the population, with N fixed
• the N individuals produce a new generation of N offspring and then die out
• each offspring has allele A with probability i/N , where i is the number of individuals with allele A in the previous generation, and allele B with probability (N − i)/N , independently of the other offspring.
Denoting by Xt the number of individuals with allele A in the t-th generation, we have that
(Xt )t∈T is a Markov chain. In particular, given Xt = i we have that
$$X_{t+1} \mid X_t = i \;\sim\; \mathrm{Binom}\!\left(N, \frac{i}{N}\right).$$
This is because $X_{t+1} = \sum_{n=1}^{N} Z_{t+1}^{(n)}$, where
$$Z_{t+1}^{(n)} = \begin{cases} 1 & \text{if the } n\text{-th individual in generation } t+1 \text{ has allele A} \\ 0 & \text{otherwise,} \end{cases}$$
and $Z_{t+1}^{(n)} \mid X_t = i \sim \mathrm{Bern}(i/N)$. Equivalently, in terms of transition probabilities of $(X_t)_{t \in T}$, we have
$$P_{ij} = \Pr(X_{t+1} = j \mid X_t = i) = \binom{N}{j} \left(\frac{i}{N}\right)^{j} \left(\frac{N-i}{N}\right)^{N-j}.$$
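As an illustration (not part of the notes), the following sketch builds the (N+1) × (N+1) Wright-Fisher transition matrix row by row using the Binomial pmf; scipy is assumed to be available.

```python
import numpy as np
from scipy.stats import binom

def wright_fisher_P(N):
    """(N+1) x (N+1) transition matrix: row i is the Binomial(N, i/N) pmf."""
    P = np.zeros((N + 1, N + 1))
    for i in range(N + 1):
        P[i, :] = binom.pmf(np.arange(N + 1), N, i / N)
    return P

P = wright_fisher_P(N=5)
print(P[0, 0], P[5, 5])  # both equal 1: states 0 and N are absorbing
```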
Note that P00 = 1 and PN N = 1 here. In such cases, we say that 0 and N are absorbing states, since the chain, once in one of those states, never leaves it.
In all the previous examples, we may be interested in different aspects of the problem, e.g.:
• (Weather chain) How many rainy days do we get each year on average?
• (Social mobility) Do the fractions of families in each social class stabilize as t increases? If
yes, to which values?
• (Gambler’s ruin) If the gambler starts with x€, what is the probability that he will end up with N€ rather than 0€?
• (Gene spread) What is the probability that the harmful gene A will eventually die out or
take over? How many generations will be needed for this to happen?
Pij = Pr(Xt+1 = j | Xt = i) i, j ∈ X ,
which build up the transition matrix P . Let us now see more explicitly how P governs the evolution
of the Markov chain (Xt )t∈T . In particular, we will be interested in knowing where Xt might be
after n steps, for n ≥ 0. More precisely, we define the n-step transition probabilities as
$$P^{n}_{ij} := \Pr(X_{t+n} = j \mid X_t = i), \qquad i, j \in \mathcal{X},\ n \ge 0. \tag{1.4}$$
Recall the social mobility example where X = {1, 2, 3} with 1 = lower class, 2 = medium
class, 3 = upper class. What is the probability that your grandchild will be in the upper class
knowing that you are in the medium class?
This corresponds to computing the 2-step transition probability P²₂₃ = Pr(Xt+2 = 3 | Xt = 2).
More generally we can obtain (m + n)-step transition probabilities from n-step and m-step ones
as follows.
Proposition 1 (Chapman-Kolmogorov equations). The n-step transition probabilities defined in
(1.4) satisfy
$$P^{n+m}_{ij} = \sum_{k \in \mathcal{X}} P^{n}_{ik} P^{m}_{kj}, \qquad i, j \in \mathcal{X};\ n, m \ge 0. \tag{1.5}$$
The equations (1.5) are known as Chapman-Kolmogorov equations and, iterating them, they
provide a way of computing n-step transition probabilities for any n ≥ 1 starting from 1-step ones,
(Pij )i,j∈X .
The Chapman-Kolmogorov equations can be conveniently expressed in terms of matrix opera-
tions. For example, we have that the matrix of 2-step transition probabilities (Pij2 )i,j∈X coincides
with the n × n matrix obtained by multiplying the P matrix with itself, which is why we denote
it as P² = P · P. Note that P²ᵢⱼ ≠ (Pᵢⱼ)², but rather P²ᵢⱼ = (P²)ᵢⱼ. More generally the n-step
transition matrix P n defined by (1.4) coincides with the n-th power of P .
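A minimal sketch (not part of the notes) of this identity in code, using the matrix (1.3) for the social mobility question above; indices are 0-based so class 2 and class 3 correspond to rows/columns 1 and 2.

```python
import numpy as np

P = np.array([[0.7, 0.2, 0.1],
              [0.3, 0.5, 0.2],
              [0.2, 0.4, 0.4]])            # matrix (1.3)

P2 = P @ P                                 # 2-step transition matrix P^2
print(P2[1, 2])                            # Pr(X_{t+2}=3 | X_t=2) = 0.21
print(np.linalg.matrix_power(P, 5))        # 5-step transition matrix P^5
```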
3 More precisely the implication of the Markov property shown in Exercise 4
Given the starting distribution α(0) , we can compute the marginal (or unconditional) distribution of Xt at time t, which we denote by α(t) = (α(t)ᵢ)ᵢ∈X , as follows:
$$\alpha^{(t)}_j = \Pr(X_t = j) = \sum_{i \in \mathcal{X}} \Pr(X_t = j \mid X_0 = i)\, \Pr(X_0 = i) = \sum_{i \in \mathcal{X}} P^{t}_{ij}\, \alpha^{(0)}_i \, .$$
Again, we can think of the latter equalities in terms of vector–matrix multiplications.
Proposition 2. The marginal distributions defined in (1.6) satisfy α(t) = α(t−1) P for all t ≥ 1 and, consequently, α(t) = α(0) P t .
Intuition: j is accessible from i if, starting from i, the Markov chain can reach j.
Equivalent definitions: for i ≠ j accessibility can also equivalently be defined as:
(ii) there exists t > 0 such that Pᵗᵢⱼ > 0
(iii) there exists t > 0 and a path (i0 , . . . , it ) ∈ X t+1 with i0 = i and it = j such that Pᵢ₀ᵢ₁ Pᵢ₁ᵢ₂ · · · Pᵢₜ₋₁ᵢₜ > 0.
Exercise 1. Prove the equivalence between (1.8) and (ii)-(iii) for i ≠ j (see Theorem 1.2.1 of Norris [1998] for a solution).
• Two states i and j communicate if i is accessible from j, and j is accessible from i. In such
case we write i ↔ j.
The following proposition is easy to verify using the above definitions.
Proposition 3. Communication between states is an equivalence relation.5
Being an equivalence relation, the notion of communication allows us to partition the state space X into classes, or communication classes, where a class is a maximal collection of states that communicate with each other.
Example 1.3.1
[Transition diagram of a Markov chain on X = {0, 1, 2, 3}, with the transition probabilities indicated on the arrows.]
Here 0 ↔ 1, because P01 > 0 and P10 > 0, but no other couple of states communicate. Thus
the Markov chain has three classes: {0, 1}, {2}, and {3}.
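Communication classes can be computed mechanically as the strongly connected components of the directed graph with an edge i → j whenever Pij > 0. The sketch below (not part of the notes) does this with scipy; since the exact transition probabilities of the diagram above are not reproduced here, the matrix used is an illustrative stand-in with the same class structure.

```python
import numpy as np
from scipy.sparse.csgraph import connected_components

def communication_classes(P):
    """Communication classes = strongly connected components of the graph
    with an edge i -> j whenever P[i, j] > 0."""
    adjacency = (np.asarray(P) > 0).astype(int)
    n_classes, labels = connected_components(adjacency, directed=True,
                                              connection='strong')
    return [np.flatnonzero(labels == c).tolist() for c in range(n_classes)]

# stand-in for the chain of Example 1.3.1 (same class structure as the diagram)
P = np.array([[0.5, 0.5, 0.0, 0.0],
              [0.5, 0.5, 0.0, 0.0],
              [0.25, 0.25, 0.25, 0.25],
              [0.0, 0.0, 0.0, 1.0]])
print(communication_classes(P))   # three classes: {0, 1}, {2}, {3}
```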
Example 1.3.2
[Transition diagram of a Markov chain on X = {0, 1, 2}, with the transition probabilities indicated on the arrows.]
If all states communicate with each other, we say that the Markov chain is irreducible. For
example, the Markov chain in Example 1.3.1 is not irreducible, while the one in Example 1.3.2 is.
We say that a class C is closed if the chain cannot escape from C, i.e. if i ∈ C implies that all
j’s accessible from i also belong to C. In Example 1.3.1 the closed classes are {0, 1} and {3}. In
Example 1.3.2 the class X is closed.
f0 = Pr(return to 0 | X0 = 0) = 1 .
$$1 - f_i = \Pr(\text{never return to } i \mid X_0 = i) = \Pr(X_t \neq i \text{ for all } t \ge 1 \mid X_0 = i) \ge \Pr(X_1 = i-1,\, X_2 = i-2, \ldots, X_i = 0,\, X_{i+1} = 0, \cdots \mid X_0 = i) = (1-p)^i > 0 \, .$$
Thus all states i ∈ {1, . . . , N − 1} are transient. In this example it is easy to check that,
provided N ≥ 3, we also have fi > 0 for all i ∈ {1, . . . , N − 1}.
The notions of transience and recurrence are related to return times, which we now define. Assume X0 = i and let Sk denote the time of the k-th return of the chain to i, where we set Sk = ∞ if (Xt )t≥0 returns to i fewer than k times (i.e. there is no k-th return to i). More formally, we define S0 = 0 and, inductively for k = 1, 2, . . . ,
$$S_k = \min\{t \ge S_{k-1} + 1 : X_t = i\}, \tag{1.9}$$
where min ∅ := ∞. We then define the time between the (k − 1)-th and the k-th return (i.e. the length of the k-th tour or k-th excursion) as
$$T_k := S_k - S_{k-1}, \qquad k \ge 1, \tag{1.10}$$
There are three classes: {0}, {1, . . . , N − 1}, {N}, because i ↔ j (for i ≠ j) if and only if i, j ∈ {1, . . . , N − 1}.
The classes {0} and {N} are recurrent classes, while the class {1, . . . , N − 1} is transient. It follows that, when started in some state i ∈ {1, . . . , N − 1}, the chain will with probability 1 visit that state only a finite number of times before ending up in one of the recurrent classes.
7 Recall that there are two different definitions of the Geometric distribution, see e.g. https://en.wikipedia.org/wiki/Geometric_distribution. Here we use the one with support {1, 2, . . . } as opposed to {0, 1, 2, . . . }.
Example 1.3.5
[Transition diagram of a Markov chain on X = {1, 2, 3} in which all states communicate.]
Here i ↔ j for all i, j ∈ X , so there is only one class, {1, 2, 3}. Thus either all states
are transient, or all recurrent. Since |X | < ∞, some state must be visited infinitely often.
Therefore, all states are recurrent and will be visited infinitely often with probability 1.
Remarks
• By Proposition 6, if a Markov chain is irreducible, either all states are transient or all are recurrent. In such cases we say that the Markov chain itself (rather than a single state) is transient or recurrent.
• We just argued in Example 1.3.5 that if X is finite, some state must be recurrent. Thus
all states in a finite irreducible chain are recurrent.
• Proposition 4 implies that it is not possible to build a Markov chain such that
Pr(visit i infinitely often | X0 = i) ∈ (0, 1), since that probability can only be 1 or 0.
• Note that i being transient does not mean that there are no valid trajectories that visit i infinitely often (e.g. in the Gambler’s ruin one could imagine a valid trajectory oscillating indefinitely between i and i + 1 with {i, i + 1} ⊆ {1, . . . , N − 1}), but rather that the collection of all such trajectories has zero measure under the law on paths induced by the stochastic process (i.e. the process will eventually ’diverge’ from such trajectories with probability 1).
Consider a drunken man walking along a straight line. For simplicity, assume that all steps are of size 1 and that the man goes right with probability p and left with probability 1 − p, independently of previous steps. The location of the man Xt after t steps follows a Markov chain with state space X = Z = {..., −2, −1, 0, 1, 2, ...} and transition probabilities
$$P_{ij} = \begin{cases} p & \text{if } j = i+1 \\ 1-p & \text{if } j = i-1 \\ 0 & \text{otherwise.} \end{cases}$$
In this case i ↔ j for all i, j ∈ X , so there is only one class, i.e. X itself. Thus either all states
are transient or they are all recurrent. The answer turns out to depend on the value of p.
Lemma 3. Let $(a_n)_{n\ge 0}$ and $(b_n)_{n\ge 0}$ be two sequences with values in $[0, \infty)$. If $a_n \sim b_n$ as $n \to \infty$, then $\sum_{n=0}^{\infty} a_n = +\infty$ if and only if $\sum_{n=0}^{\infty} b_n = +\infty$.
Lemma 4 (Stirling). $n! \sim \sqrt{2\pi n}\,\left(\tfrac{n}{e}\right)^n$ as $n \to \infty$.
By Lemma 4 we have
$$P_{00}^{2n} = \binom{2n}{n} p^n q^n \sim \frac{\sqrt{2\pi\, 2n}\,\left(\tfrac{2n}{e}\right)^{2n}}{\left(\sqrt{2\pi n}\right)^2 \left(\tfrac{n}{e}\right)^{2n}}\, p^n q^n = \frac{1}{\sqrt{\pi}}\, n^{-1/2}\, 2^{2n} p^n q^n = \frac{1}{\sqrt{\pi}}\, n^{-1/2}\, \rho^n \qquad \text{as } n \to \infty,$$
with $\rho = 4pq$. Thus, by Lemma 3, $\sum_{n=0}^{\infty} P_{00}^{2n} = +\infty$ if and only if $\sum_{n=1}^{\infty} \tfrac{1}{\sqrt{\pi}}\, n^{-1/2} \rho^n = +\infty$, which implies that $\sum_{n=0}^{\infty} P_{00}^{n} = +\infty$ if and only if $\sum_{n=1}^{\infty} n^{-1/2} \rho^n = +\infty$.
If $p = q = \tfrac12$ then $\rho = 1$ and
$$\sum_{n=1}^{\infty} n^{-1/2} \rho^n = \sum_{n=1}^{\infty} n^{-1/2} = \infty \, ,$$
A distribution π = (πi )i∈X on X is a stationary distribution for the Markov chain with transition matrix P if
πP = π ,
in which case
X0 ∼ π ⇒ Xt ∼ π for all t ≥ 0 .
Example 1.4.1
Intuitively, we can think of checking whether π is a stationary distribution for (Xt )t≥0 as follows:
(i) distribute a unit amount of sand over the states, placing an amount πi on each state i ∈ X ;
(ii) move the sand around according to the transition probabilities, i.e. move a fraction Pij of the sand at state i to state j, for all i, j ∈ X ;
(iii) check if the resulting distribution of sand is still equal to π. If yes, π is stationary.
In this sense, π-stationarity corresponds to time evolution leaving the distribution π invariant.
We have the following general facts about existence and uniqueness of stationary distributions.
Part (b) of Theorem 3 follows from Theorem 5 (b) below, since the mi ’s defined there are unique. While we do not provide a full proof of part (a) of Theorem 3 in the interest of brevity, we note the following: since Σⱼ∈X Pij = 1 for every i ∈ X , 1 = (1, . . . , 1)ᵀ is a right eigenvector of P with eigenvalue 1, i.e. P 1 = 1. Since right and left eigenvalues are the same (though their corresponding eigenvectors differ in general), 1 is also a left eigenvalue, i.e. there exists v = (v1 , . . . , vn ) ∈ Rⁿ such that vP = v. One can further show⁹ that there exists such a non-negative eigenvector, i.e. a vector v = (v1 , . . . , vn ) ∈ [0, ∞)ⁿ with vP = v. Then π defined as πi = vi / Σⁿⱼ₌₁ vj is a stationary distribution.
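Numerically, the argument above suggests computing π as a normalized, non-negative left eigenvector of P associated with the eigenvalue 1. A minimal sketch (not part of the notes), using the social mobility matrix (1.3) as an example:

```python
import numpy as np

def stationary_distribution(P):
    """Left eigenvector of P for eigenvalue 1, normalized to sum to 1."""
    eigvals, eigvecs = np.linalg.eig(P.T)     # right eigenvectors of P^T = left eigenvectors of P
    k = np.argmin(np.abs(eigvals - 1.0))      # eigenvalue closest to 1
    v = np.abs(np.real(eigvecs[:, k]))        # fix sign of the eigenvector
    return v / v.sum()

P = np.array([[0.7, 0.2, 0.1],
              [0.3, 0.5, 0.2],
              [0.2, 0.4, 0.4]])
pi = stationary_distribution(P)
print(pi, pi @ P)                             # pi and pi P coincide, i.e. pi P = pi
```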
Remark
• Finding the stationary distribution of a Markov chain is often the first thing to do in
order to understand the behavior of a stochastic process and start answering questions
of interest (see more in convergence part below).
• Exercise 19 in the exercise list provides an example where |X | = ∞ and π is identified through notable distributions, as opposed to directly solving the set of linear equations. You can think of the model there as a discrete analogue of continuous autoregressive processes (e.g., processes with transitions Xt+1 | Xt ∼ N(ρXt , σ²) for some ρ < 1).
9 This follows from the fact that P is non-negative and 1 is the leading eigenvalue. However, proving it requires linear algebra results not discussed in these notes (e.g. the Perron-Frobenius theorem). Existence and characterisations of stationary distributions can also be obtained with purely probabilistic tools, see e.g. Section 1.7 of [Norris, 1998] or Section 1.5.3 of Levin and Peres [2017], but this also requires additional notions and space.
Example 1.4.2
[Transition diagram: two-state chain on {0, 1} with P01 = P10 = p and P00 = P11 = 1 − p.]
Provided p ∈ (0, 1), the Markov chain is irreducible (0 ↔ 1) and aperiodic (Pᵗᵢᵢ > 0 for all i ∈ X and t ≥ 1). Thus, Pᵗᵢⱼ → πj = 1/2 for all i, j ∈ X ;
[Transition diagram: weather chain on {0, 1} with P00 = α, P01 = 1 − α, P10 = β, P11 = 1 − β.]
$$\frac{V_i(n)}{n} \to \frac{1}{m_i} \quad \text{a.s. as } n \to \infty, \qquad i \in \mathcal{X}. \tag{1.18}$$
Remark
• Combining Theorem 5 (a) and (b) we have that, if (Xt )t≥0 is irreducible and π-stationary, then
$$\frac{V_i(n)}{n} \to \pi_i \quad \text{a.s. as } n \to \infty, \qquad i \in \mathcal{X}. \tag{1.20}$$
• Theorem 5 also implies Theorem 3(b); meaning that if (Xt )t≥0 is irreducible and there exists a stationary distribution π, then it must be unique, since πi must equal mᵢ⁻¹ for all i ∈ X .
Proof. Part (a). Fix i ∈ X . If (Xt )t≥0 is transient, then mi = ∞ and limn→∞ Vi (n) = Vi is almost
surely finite by Proposition 4. Thus in this case limn→∞ Vi (n)/n = 0 = 1/mi and (1.18) is satisfied.
Consider then the case of (Xt )t≥0 recurrent. For brevity, we will prove the result assuming X0 = i,
even if the result holds for general starting distributions 10 . In the recurrent case, the excursion
10 To extend the proof to the case X0 = j ≠ i or X0 ∼ α(0) for some distribution α(0) , one could roughly proceed as follows. First note that, when X0 ≠ i, the random variable T1 = min{t ≥ 1 : Xt = i} has a different distribution than T2 , T3 , . . . defined in (1.10). Nonetheless T1 is still almost surely finite (which can be proved using irreducibility) and T2 , T3 , . . . are iid (which can be proved by a variation of Lemma 2). Combining these facts one can deduce Sk /k → mi almost surely as k → ∞ and proceed as in the proof of part (a) above. We avoid formalizing the previous arguments in the interest of brevity.
times T1 , T2 , . . . defined in (1.10) are iid random variables by Lemma 2. Thus, by the SLLN,
Sk /k → mi almost surely as k → ∞, with {Sk }k≥0 defined in (1.9). Also, by definition of {Sk }k≥0
and Vi (n) we have $S_{V_i(n)} \le n < S_{V_i(n)+1}$, and thus
$$\frac{S_{V_i(n)}}{V_i(n)} \le \frac{n}{V_i(n)} < \frac{S_{V_i(n)+1}}{V_i(n)} \, .$$
By recurrence, Vi (n) → ∞ almost surely as n → ∞ and thus
$$m_i = \lim_{k\to\infty} \frac{S_k}{k} = \lim_{n\to\infty} \frac{S_{V_i(n)}}{V_i(n)} \le \lim_{n\to\infty} \frac{n}{V_i(n)} \le \lim_{n\to\infty} \frac{S_{V_i(n)+1}}{V_i(n)} = \lim_{k\to\infty} \frac{S_k}{k} = m_i \, ,$$
which implies that $\lim_{n\to\infty} n / V_i(n) = m_i$, and thus $\lim_{n\to\infty} V_i(n)/n = 1/m_i$, almost surely as desired.
Part (b). By part (a) we know that $\lim_{n\to\infty} V_i(n)/n = 1/m_i$ almost surely. Since $V_i(n)/n \in (0, 1)$, by the bounded convergence theorem we also have $\lim_{n\to\infty} \mathbb{E}\left[V_i(n)/n\right] = 1/m_i$. The above holds for any starting distribution for X0 . Taking X0 ∼ π, by π-stationarity we have Xt ∼ π for all t ≥ 0 and thus
$$\mathbb{E}\left[\frac{V_i(n)}{n}\right] = \frac{\sum_{t=0}^{n-1} \mathbb{E}[\mathbb{1}(X_t = i)]}{n} = \frac{\sum_{t=0}^{n-1} \pi_i}{n} = \pi_i \, .$$
Thus $\lim_{n\to\infty} \mathbb{E}\left[V_i(n)/n\right]$ equals both $1/m_i$ and $\pi_i$, which implies that $\pi_i = 1/m_i$ for all $i \in \mathcal{X}$ as desired.
Since $V_i(n)/n \to \pi_i$ almost surely as $n \to \infty$ for all $i \in \mathcal{X}$ by Theorem 5 and (1.20), it follows that
$$\frac{1}{n}\sum_{t=0}^{n-1} g(X_t) = \sum_{i\in\mathcal{X}} \frac{V_i(n)}{n}\, g(i) \;\to\; \sum_{i\in\mathcal{X}} \pi_i\, g(i) \quad \text{a.s. as } n \to +\infty,$$
as desired.
Remarks
• The ergodic theorem can be seen as a Law of large numbers for Markov chains and it
implies that, in terms of approximating expectations Eπ [g], we can treat the Markov
chain states (X0 , X1 , X2 , ...) as if they were i.i.d. samples from π. This is the key theo-
retical underpinning of commonly used Markov chain Monte Carlo algorithms (MCMC),
discussed in the second part of the course.
• The ergodic theorem does not require aperiodicity. Indeed periodicity does not prevent time averages from converging. For example, for the chain in Example 1.4.2 one has (1.20) and (1.21) even if (1.16) does not hold.
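The following minimal sketch (not part of the notes, using the matrix (1.3) again) illustrates (1.20): the empirical visit frequencies Vi(n)/n of a long simulated trajectory approach the stationary distribution.

```python
import numpy as np

rng = np.random.default_rng(0)
P = np.array([[0.7, 0.2, 0.1],
              [0.3, 0.5, 0.2],
              [0.2, 0.4, 0.4]])
n = 100_000
x = 0
visits = np.zeros(3)
for _ in range(n):
    visits[x] += 1
    x = rng.choice(3, p=P[x])   # one Markov chain step

print(visits / n)               # V_i(n)/n, close to the stationary distribution of P
```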
Example 1.5.1
[Transition diagram of the simple random walk on Z = {. . . , −2, −1, 0, 1, 2, . . . }.]
Here the Markov chain is irreducible (only one class) but |X | = ∞. Thus, Theorem 3(a) does not apply and a stationary distribution π may not exist. Indeed in this case there exist stationary measures (i.e. µ = (µi )i∈X with µj = Σᵢ∈X µi Pij and µj ≥ 0, but Σᵢ∈X µi ≠ 1) but no stationary distributions. We prove the above fact for the symmetric RW (i.e. p = 1/2), although similar arguments could be made for all p ∈ (0, 1).
Proposition 7. The only stationary measures for the symmetric RW on Z are the constant ones, i.e. µ = (µi )i∈Z with µi = c for all i ∈ Z, for some c ≥ 0.
Intuition: when p = 1/2, as t increases the distribution of Xt becomes wider and wider and
the probability of being in any given j ∈ X goes to 0, despite the RW being recurrent. When
p 6= 1/2, the process “drifts off to infinity” which again implies limt→+∞ Pijt = 0 for any given
j ∈ X.
In both cases (p = 1/2 and p ≠ 1/2) there is no limiting distribution and
$$1 = \lim_{t\to+\infty} \sum_{j\in\mathcal{X}} P^t_{ij} \;\neq\; \sum_{j\in\mathcal{X}} \lim_{t\to+\infty} P^t_{ij} = 0 \, .$$
In the Gambler’s ruin example the Markov chain is reducible (i.e. not irreducible), because
there are 3 classes: {0},{1, ..., N − 1}, {N }. Thus, Theorem 3(b) does not apply and the
stationary distribution π may not be unique. For example, it is easy to check that both δ0 =
(1, 0, ..., 0) and δN = (0, 0, ..., 1) are stationary distributions.
The consequence of non-uniqueness of stationary distributions is that the limiting prob-
abilities limt→+∞ Pijt will depend on the starting state i, unlike in (1.16). For example it is
easy to see that
$$\lim_{t\to+\infty} P^t_{NN} = 1 \quad \text{and} \quad \lim_{t\to+\infty} P^t_{00} = 1 \, ,$$
meaning that when X0 = N the limiting distribution is δN while when X0 = 0 the limiting
distribution is δ0 . What happens for X0 = i with i ∈ {1, . . . , N − 1}? There the limiting probability depends on the following hitting probability
$$h(i) := \lim_{t\to+\infty} P^t_{iN} = \Pr\big((X_t)_{t\ge 0} \text{ hits } N \text{ before } 0 \mid X_0 = i\big), \qquad i \in \mathcal{X},$$
which we write for brevity as h(i) = Pr(win | X0 = i), since that is the probability of the Gambler eventually winning. How to compute h(i)?ᵃ We use a "conditioning on the first event" argument:
$$h(i) = p\, h(i+1) + (1-p)\, h(i-1), \qquad i \in \{1, \ldots, N-1\},$$
which can be re-written as p(h(i + 1) − h(i)) = (1 − p)(h(i) − h(i − 1)). Combining these with
h(0) = Pr(win | X0 = 0) = 0 and h(N) = Pr(win | X0 = N) = 1, we obtain the following set of (finite difference) equations and boundary conditions for h:
$$\begin{cases} p\,(h(i+1) - h(i)) = (1-p)\,(h(i) - h(i-1)) & i \in \{1, \ldots, N-1\} \\ h(0) = 0 \\ h(N) = 1 \, . \end{cases} \tag{1.22}$$
When p = 1/2 the above implies that h(i + 1) − h(i) = h(i) − h(i − 1), meaning that h is
linear. Thus, interpolating h(0) = 0 and h(N ) = 1 we get
$$h(i) = \Pr(\text{win} \mid X_0 = i) = \frac{i}{N}, \qquad i \in \{0, \ldots, N\}, \ \text{when } p = \tfrac12, \tag{1.23}$$
meaning that
$$\lim_{t\to+\infty} P^t_{iN} = \frac{i}{N} \quad \text{and} \quad \lim_{t\to+\infty} P^t_{i0} = \frac{N-i}{N} \qquad \text{when } p = \tfrac12 \, .$$
In accordance with intuition, the closer to N the Gambler starts, the higher his probability
of winning the game (i.e. ending up in N ). Note that when p = 1/2 the game is ”fair”,
which is reflected in the fact that the solution in (1.23) is symmetric in the sense that h(i) = 1 − h(N − i). Intuitively this means that the gambler’s opponent, who starts with a budget of N − X0 , has the same chance of winning as the gambler when given the same initial budget, i.e. Pr(Gambler wins | X0 = i) = Pr(Opponent wins | N − X0 = i).
When p ≠ 1/2, the solution to the system of equations in (1.22) isᵇ
$$h(i) = \lim_{t\to+\infty} P^t_{iN} = \frac{1 - \theta^i}{1 - \theta^N} \qquad \text{with } \theta := \frac{1-p}{p} \, .$$
Here, if p < 1/2, the limit implies that the probability of reaching N goes to 0 exponentially
fast as you get further from N . In this case the winning probabilities are not symmetric
since the game is not ”fair”. See Examples 1.44 and 1.45 of Durrett [2021] for step-by-step
calculations to solve (1.22) and some quantitative examples.
a Note that computing h(i) by summing over all trajectories that hit N would be much harder in general.
b See the tablet notes for the full solution of (1.22).
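For reference, a small sketch (not part of the notes) implementing the closed-form winning probabilities (1.23) and the p ≠ 1/2 formula above; the chosen numerical values are illustrative.

```python
def win_probability(i, N, p):
    """h(i) = Pr(reach N before 0 | X0 = i) for the gambler's ruin chain."""
    if p == 0.5:
        return i / N                     # the linear solution (1.23)
    theta = (1 - p) / p
    return (1 - theta**i) / (1 - theta**N)

print(win_probability(i=5, N=10, p=0.5))   # 0.5
print(win_probability(i=5, N=10, p=0.4))   # below 0.5: the game is unfavorable
```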
The strategy used in Example 1.5.2 to compute hitting probabilities extends to the case of
general Markov chains, as shown below.
Given a MC (Xt )t≥0 on a finite or countable state space X and a subset C ⊆ X , define the hitting
time of C as
τC := min{t ≥ 0 : Xt ∈ C} (1.24)
Then, arguing as in Example 1.5.2, we can show that the hitting probability h(i) := Pr(τC < ∞ | X0 = i) satisfies the following set of equations and boundary conditions
$$\begin{cases} h(i) = \sum_{j\in\mathcal{X}} P_{ij}\, h(j) & i \in \mathcal{X} \setminus C \\ h(i) = 1 & i \in C \, . \end{cases} \tag{1.26}$$
In general, the solution to such set of equations is not unique, and it can be shown that hitting
probabilities coincide with the minimal non-negative solution of such equations (see [Norris, 1998,
Thm.1.3.2] for details). In Example 1.5.2 we identified the correct solution to the equations by
imposing the additional condition h(0) = 0. A similar set of equations can also be derived for the
expectation of the hitting times, i.e. for the function i ↦ E[τC | X0 = i]; see Exercise 28.
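In the finite case, (1.26) can be solved as a linear system once the appropriate extra conditions are imposed. Below is a minimal sketch (not part of the notes) for the Gambler's ruin setting of Example 1.5.2, with C = {N} and h(0) = 0 imposed as in the text; names and values are illustrative.

```python
import numpy as np

def hit_N_before_0(P, N):
    """Solve (I - P) h = 0 on interior states with h(0) = 0 and h(N) = 1."""
    n = N + 1
    A = np.eye(n) - P
    b = np.zeros(n)
    A[0, :] = 0.0; A[0, 0] = 1.0; b[0] = 0.0   # boundary condition h(0) = 0
    A[N, :] = 0.0; A[N, N] = 1.0; b[N] = 1.0   # boundary condition h(N) = 1
    return np.linalg.solve(A, b)

p, N = 0.4, 10
P = np.zeros((N + 1, N + 1))
P[0, 0] = P[N, N] = 1.0
for i in range(1, N):
    P[i, i + 1], P[i, i - 1] = p, 1 - p
print(hit_N_before_0(P, N))   # matches (1 - theta^i) / (1 - theta^N)
```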
Remarks
The following remarks are beyond the scope of the course and just reported for the sake of
completeness and references:
• The qualitative behavior observed in Example 1.5.2 is common to general reducible
chains, in the sense that one can always partition the state space in a collection of
closed recurrent classes plus the set of all transient states (see e.g. Theorem 1.8 of
Durrett [2021] and related discussion) and express the limiting probabilities starting
form transient states as a convex combination of stationary distributions restricted to
the recurrent classes weighted by the probabilities of hitting such classes.
• In Example 1.5.1, despite the non-existence of π, we nonetheless have Vi (n)/n → 1/mi as well as Pᵗᵢⱼ → 1/mj . However, mi = ∞ for every i ∈ X . This is an example where recurrence occurs (i.e. excursions have finite length almost surely) but the average excursion lengths (mi ) are infinite. Such chains are called null recurrent, as opposed to positive recurrent ones where mi < ∞. Positive recurrence is in one-to-one correspondence with the existence of stationary distributions (see e.g. Theorem 1.7.7 of Norris [1998]).
To be added
Proposition 8. Let (Xt )t≥0 be a π-reversible Markov chain. Then (Xt )t≥0 is also π-stationary.
Proof. Let (Xt )t≥0 be π-reversible. Then its transition probabilities (Pij )i,j∈X satisfy
$$\sum_{i\in\mathcal{X}} \pi_i P_{ij} \;\overset{\text{by }(1.27)}{=}\; \sum_{i\in\mathcal{X}} \pi_j P_{ji} = \pi_j \sum_{i\in\mathcal{X}} P_{ji} = \pi_j \, , \qquad j \in \mathcal{X}.$$
The left-hand side can be thought of as the amount of probability going from state j to state i at each Markov chain step in stationarity, while the right-hand side can be thought of as the amount going from state i to state j. More precisely, assuming Xt ∼ π and Pr(Xt+1 = j | Xt = i) = Pij , then πj Pji is the probability of observing a transition from j to i, i.e. of observing (Xt , Xt+1 ) = (j, i), while πi Pij is the probability of observing a transition from i to j, i.e. (Xt , Xt+1 ) = (i, j). In this sense, π-reversibility requires the exchange of probability across each pair of states in X to be balanced. In accordance with this intuition, a π-reversible chain is sometimes referred to as being in detailed balance with respect to π.
We can contrast that with π-stationarity. Since Σᵢ∈X Pji = 1, we can re-write the π-stationarity requirement in (1.15) as
$$\sum_{i\in\mathcal{X}} \pi_j P_{ji} = \sum_{i\in\mathcal{X}} \pi_i P_{ij} \, , \qquad j \in \mathcal{X}. \tag{1.29}$$
In this sense, π-stationarity only requires the total flow of probability into and out of each single state to be balanced. In accordance with this intuition, a π-stationary chain is sometimes referred to as being in global balance with respect to π.
Clearly, detailed balance implies global balance, since having each ”local” term in (1.28) balanced implies that the ”global” sums in (1.29) are also equal. See figures in the hand-written notes
on Blackboard.
Time-reversibility interpretation
In addition to being π-stationary, a π-reversible MC is also ”time-reversible”, i.e.:
$$(X_t)_{t\ge 0} \text{ is } \pi\text{-reversible} \iff \begin{cases} (X_t)_{t\ge 0} \text{ is } \pi\text{-stationary} \\ (X_t)_{t\ge 0} \text{ is time-reversible} \end{cases}$$
Intuitively, time-reversible means that if you look at the MC forward or backward in time you
obtain the same stochastic process. More precisely, we have the following.
Proposition 9. Let X0 ∼ π and n ≥ 1. Define
Y0 = Xn , Y1 = Xn−1 , ..., Yn = X0 (1.30)
Then (Y0 , . . . , Yn ) is equal to (X0 , . . . , Xn ) in distribution if and only if (Xt )t≥0 is π-reversible.
Sketch of proof. Let
$$Q_{ij} := \Pr(Y_{t+1} = j \mid Y_t = i), \qquad i, j \in \mathcal{X},$$
be the transition probabilities of (Yt )t=0,...,n .¹¹ Then
$$Q_{ij} = \Pr(Y_{t+1} = j \mid Y_t = i) = \Pr(X_{n-t-1} = j \mid X_{n-t} = i) \tag{1.31}$$
$$= \frac{\Pr(X_{n-t-1} = j,\ X_{n-t} = i)}{\Pr(X_{n-t} = i)} \tag{1.32}$$
$$= \frac{\Pr(X_{n-t-1} = j)\, \Pr(X_{n-t} = i \mid X_{n-t-1} = j)}{\Pr(X_{n-t} = i)} = \frac{\pi_j P_{ji}}{\pi_i} \, , \qquad i, j \in \mathcal{X}. \tag{1.33}$$
Thus Qij = Pij for all i, j ∈ X if and only if
$$\pi_i P_{ij} = \pi_j P_{ji} \, , \qquad i, j \in \mathcal{X},$$
as desired.
Example 1.6.1
The Markov chain in Example 16 is an example of a non-reversible process. In fact, using (1.31), one can check that the transition probabilities of the time-reversed process are
$$Q = \begin{pmatrix} 0 & 0 & 1 \\ 1/2 & 1/2 & 0 \\ 0 & 1/2 & 1/2 \end{pmatrix}.$$
Since Q ≠ P in this case, we have that (Xt )t≥0 is π-stationary but not π-reversible.
A simple example of reversible chains is given by the binary chain in Example 1.4.1. Exercises 33
and 34 in the exercises list provide less obvious examples of reversible chains.
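Checking detailed balance numerically amounts to verifying that the flow matrix (πi Pij)i,j is symmetric. The sketch below (not part of the notes) uses a hypothetical three-state chain that tends to cycle in one direction: it is π-stationary for the uniform π but not π-reversible.

```python
import numpy as np

def is_reversible(P, pi, tol=1e-10):
    """Check detailed balance pi_i P_ij = pi_j P_ji for all i, j."""
    F = pi[:, None] * P              # F[i, j] = pi_i P_ij, probability flow i -> j
    return np.allclose(F, F.T, atol=tol)

# hypothetical chain cycling 0 -> 1 -> 2 -> 0; doubly stochastic, so pi is uniform
P = np.array([[0.1, 0.8, 0.1],
              [0.1, 0.1, 0.8],
              [0.8, 0.1, 0.1]])
pi = np.full(3, 1 / 3)
print(pi @ P, is_reversible(P, pi))  # pi P = pi (stationary), but not reversible
```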
Note that a π-reversible chain is ”time-reversible” (i.e. Proposition 9 holds) only in stationarity
(i.e. if X0 ∼ π) and not if X0 ∼ α(0) for some other, possibly deterministic, starting distribution
α(0) .12 See Exercise 34 and the associated code (as well as the in-class tablet notes) for some
illustration of this phenomenon.
12 To see why forward and backward processes behave differently out of stationarity consider the following: if
started out of stationarity the forward process (Xt )t≥0 would first ’converge to stationarity’ and then ’explore the
space sampling from π’; while the backward process (Xt )t≥0 would first ’explore the space sampling from π’ and
then ’go out of stationarity’, which is implausible under the forward process.
Chapter 2

Illustrative examples and applications of discrete-time Markov chains
[Example of a directed graph on a set of vertices V = {1, . . . , n}.]
Given a directed graph on V = {1, ..., n} one can always construct a corresponding Markov
chain on the graph by taking X = V and assigning “uniform weights” to all edges (i.e., arrows) in
the graph. More precisely, we define the following.
Definition 7. The natural random walk on a directed graph (V, E) is a Markov chain with state space X = V and transition matrix P (RW) given by
$$P^{(\mathrm{RW})}_{ij} = \begin{cases} \dfrac{1}{d_i} & \text{if there is an edge from } i \text{ to } j \\ 0 & \text{otherwise,} \end{cases} \qquad i, j \in \mathcal{X},$$
where di is the out-degree of i, i.e. the number of edges starting from i.
Random walks on graphs are often used to model phenomena that diffuse on networks (e.g., epidemics spreading through contacts, news spreading on social networks, etc). Maybe less intuitively, they can also be used to gain insight into the structure and connectivity properties of a graph. We now consider a famous example from website ranking.
PageRank is a ranking algorithm for websites which is at the heart of Google’s search engine.
It was invented by Larry Page and Sergey Brin, founders of Google (see the original paper ”The
anatomy of a large-scale hypertextual Web search engine” at https://snap.stanford.edu/
class/cs224w-readings/Brin98Anatomy.pdf). The idea is to assign each website a score
(independently of search queries), and use the resulting ranking of websites to help perform searches on the web. The basic idea is to rank websites based on the stationary distribution
of the random walk on the graph induced by weblinks.
More precisely, let X = {1, 2, ..., n} represent the list of websites. Define a directed graph across websites with V = X , adding (i, j) to E if and only if site i contains a weblink to site j. Then define the score of site i as πi , where π is the stationary distribution of P (RW) , i.e. the natural random walk on (V, E). Intuition: consider a user surfing the web at random who, once on page i, chooses one of the di weblinks on page i at random and clicks on it. In
that once in state i chooses one of the di weblinks in page i at random, and clicks on it. In
this metaphor, the score of site i corresponds to the probability that the random surfer is
visiting site i at time t for a large t.
For the above construction to make sense, i.e. to lead to a well defined score, we need π
to exist and be unique. In this context | X |< +∞, so a stationary distribution π exists by
Theorem 3(a). However, the Markov chain may not be irreducible, so Theorem 3(b) does not
apply and π may not be unique. To solve this issue, PageRank considers a modification of
the natural random walk on the graph: with probability α the random surfer follows P (RW ) ,
while with probability 1 − α he gets bored and opens a new website choosing it uniformly
at random from {1, 2, ..., n}. Thus we obtain a new Markov chain on X , which we denote as
(Yt )t≥0 , with the following transition probabilities:
$$P^{(\mathrm{PR})}_{ij} = \Pr(Y_{t+1} = j \mid Y_t = i)$$
$$= \Pr(Y_{t+1} = j \mid Y_t = i, \text{bored}) \Pr(\text{bored}) + \Pr(Y_{t+1} = j \mid Y_t = i, \text{not bored}) \Pr(\text{not bored})$$
$$= \alpha\, P^{(\mathrm{RW})}_{ij} + (1-\alpha)\,\frac{1}{n} \, ,$$
for all i, j ∈ X . In terms of transition matrices this means
$$P^{(\mathrm{PR})} = \alpha\, P^{(\mathrm{RW})} + (1-\alpha)\, P^{(\mathrm{unif})} \, ,$$
where P (unif) is an n × n transition matrix with all entries equal to 1/n. In these cases we
say that P (PR) is a mixture of two transition matrices, i.e. P (RW) and P (unif) , which means
that the Markov chain evolving according to P (PR) chooses at each iteration whether to make
a transition according to P (RW) , with probability α, or according to P (unif) , with probability
1 − α. Note that P (PR) is irreducible by construction since
$$P^{(\mathrm{PR})}_{ij} \ge \frac{1-\alpha}{n} > 0 \, , \qquad i, j \in \mathcal{X},$$
meaning that all states communicate. It follows that P (PR) has a unique stationary distribution π. The corresponding value of πi is the PageRank of website i. Note that P (PR) is also aperiodic by construction and thus πi = lim_{t→∞} (P (PR))ᵗⱼᵢ = lim_{t→∞} Pr(Yt = i | Y0 = j).
In practice, π can be computed by solving the linear system πP (PR) = π together with π1 + · · · + πn = 1. This system can be solved exactly with standard algorithms for linear systems at cubic cost, i.e. O(n³).¹ When n is very large, such as in Google’s
website rankings applications, this may be too computationally expensive. To reduce the
computational cost, one can approximate π using the Convergence Theorem and Chapman-
Kolmogorov equations as follows: initialize the distribution of X0 , αᵢ⁽⁰⁾ = Pr(X0 = i), with some probability distribution α(0) = (α₁⁽⁰⁾, ..., αₙ⁽⁰⁾); compute the distribution of X1 , . . . , Xt as:
$$\alpha^{(1)} = \alpha^{(0)} P^{(\mathrm{PR})}, \quad \alpha^{(2)} = \alpha^{(1)} P^{(\mathrm{PR})}, \quad \ldots, \quad \alpha^{(t)} = \alpha^{(t-1)} P^{(\mathrm{PR})} \, .$$
By the Convergence theorem, when t is large enough, α(t) provides an approximation of π.
Note that computing α(t) requires only matrix-vector multiplications, which have quadratic
cost and are thus cheaper than solving systems of linear equations. The cost of the above
procedure² would thus be O(tn²), which is much smaller than O(n³) when t ≪ n.³
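A minimal sketch (not part of the notes) of this power-method approximation on a tiny, hypothetical web graph; the damping value α = 0.85 and the number of iterations are illustrative assumptions.

```python
import numpy as np

def pagerank(P_rw, alpha=0.85, t=50):
    """Approximate the stationary distribution of alpha*P_rw + (1-alpha)*P_unif
    by iterating a^(k) = a^(k-1) P_pr (power method)."""
    n = P_rw.shape[0]
    P_pr = alpha * P_rw + (1 - alpha) * np.full((n, n), 1.0 / n)
    a = np.full(n, 1.0 / n)                 # start from the uniform distribution
    for _ in range(t):
        a = a @ P_pr                        # one Chapman-Kolmogorov update
    return a

# hypothetical web graph: 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0
A = np.array([[0, 1, 1],
              [0, 0, 1],
              [1, 0, 0]], dtype=float)
P_rw = A / A.sum(axis=1, keepdims=True)     # natural random walk P^(RW)
print(pagerank(P_rw))
```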
Suppose that two new drugs have been developed for treating a certain disease. Drug 1 has
cure rate P1 and drug 2 has cure rate P2 . To decide whether P1 > P2 or P1 < P2 , the following
statistical test is used: couples of patients are treated sequentially giving drug 1 to one patient
and drug 2 to the second, and whether or not each patient is cured is recorded. The test stops
when the number of patients cured with drug 1 exceeds by A, or more units, the number of
patients cured with drug 2, concluding that drug 1 is better, or vice versa when the number of patients cured with drug 2 exceeds by B, or more units, the number of patients cured with drug 1, concluding that drug 2 is better.
1 Here the cubic cost would come from performing inversion or factorization of an n × n matrix.
2 In computational linear algebra terms, this procedure is an instance of the so-called power method to approximate leading eigenvectors of matrices.
3 In the PageRank example, the value of t required to have α(t) ≈ π is often relatively small thanks to the presence of the ”global steps” (1 − α)P (unif) in P (PR) , which help the chain (Yt )t≥0 converge fast to stationarity (e.g. a few dozen steps can be enough to approximate π satisfactorily). Also, the value of t required can be reduced by choosing a good initialization, e.g. α(0) = Unif(X ) would work better than a point mass α(0) = δj for some j ∈ X in this case.
Given certain values of P1 and P2 , what is the probability of making the wrong decision?
Intuitively, the test in Example 2.2.1 is a race between drugs 1 and 2: the first drug to outscore the other by the required margin (A or B) in terms of cured patients is the winner. For each i = 1, 2, . . . define
$$X_i = \begin{cases} 1 & \text{if the } i\text{-th patient treated with drug 1 is cured} \\ 0 & \text{otherwise} \end{cases}$$
and
$$Y_i = \begin{cases} 1 & \text{if the } i\text{-th patient treated with drug 2 is cured} \\ 0 & \text{otherwise.} \end{cases}$$
We assume that (Xt )t≥1 and (Yt )t≥1 are two independent sequences of iid Bernoulli(P1 ) and
Bernoulli(P2 ) random variables, respectively.
Then the number of patients cured with drug 1 and 2, after t couples, is X1 + · · · + Xt and
Y1 + · · · + Yt , respectively. We are interested in the first time t such that Dt ≥ A or Dt ≤ −B,
where Dt is defined as
Dt = (X1 + ... + Xt ) − (Y1 + ... + Yt ).
The stochastic process (Dt )t≥0 is a Markov chain because, given Dt , the value of Dt+1 depends
only on the outcome for the (t + 1)-th couple of patients, (Xt+1 , Yt+1 ), which is independent of
D0 , . . . , Dt . Note that in this context the time parameter t of the Markov chain corresponds to
the sample size in a statistical test.
What are the transition probabilities of (Dt )t≥1 ? The chain can only increase by 1, decrease by 1, or stay where it is. The transitions from Dt to Dt+1 depend on the values of Xt+1 and Yt+1 . In particular, we have Pr(Dt+1 = i + 1 | Dt = i) = P1 (1 − P2 ), Pr(Dt+1 = i − 1 | Dt = i) = (1 − P1 )P2 and Pr(Dt+1 = i | Dt = i) = P1 P2 + (1 − P1 )(1 − P2 ).
and Pr(test asserts P1 < P2 ) = 1 − Pr(test asserts P1 > P2 )⁴. To compute these probabilities we can proceed as in Example 1.5.2. First, we define h(i) := Pr((Dt )t≥0 hits A before −B | D0 = i).
Conditioning on D1 we have
$$h(i) = P_{i,i+1}\, h(i+1) + P_{i,i-1}\, h(i-1) + P_{ii}\, h(i) \, , \qquad i \in \{-B+1, \ldots, A-1\},$$
which implies
$$h(i) = p\, h(i+1) + (1-p)\, h(i-1) \, , \qquad i \in \{-B+1, \ldots, A-1\},$$
where $p := \frac{P_{i,i+1}}{1 - P_{ii}}$ and $1 - p = \frac{P_{i,i-1}}{1 - P_{ii}}$. We thus obtain the system of equations
$$\begin{cases} h(i) = p\, h(i+1) + (1-p)\, h(i-1) & i \in \{-B+1, \ldots, A-1\} \\ h(-B) = 0 \\ h(A) = 1 \, . \end{cases} \tag{2.1}$$
These are the same set of equations we had in the Gambler’s ruin example (see (1.22)), modulo the different boundary conditions. Solving them in a way analogous to what was done for (1.22) gives
$$\Pr(\text{test asserts } P_1 > P_2) = \Pr\big((D_t)_{t\ge 0} \text{ hits } A \text{ before } -B \mid D_0 = 0\big)$$
$$= \Pr(\text{Gambler reaches } A+B \mid \text{Gambler starts from } B)$$
$$= \frac{1 - \theta^B}{1 - \theta^{A+B}} \qquad \text{with } \theta := \frac{1-p}{p} = \frac{(1-P_1)\, P_2}{P_1\, (1-P_2)} \, .$$
For example, if P1 = 0.6 and P2 = 0.4, the probability of wrongly concluding that P2 > P1 is 0.017 if A = B = 5, and decreases to 0.0003 if A = B = 10. Here A and B act as confidence parameters: the larger they are, the smaller the probability of an error but the larger the required sample size.
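These numbers can be reproduced with a couple of lines implementing the formula above (a sketch, not part of the notes, with an illustrative function name):

```python
def wrong_decision_prob(P1, P2, A, B):
    """Pr(test asserts P1 < P2) when in fact P1 > P2, i.e. 1 - (1 - theta^B)/(1 - theta^(A+B))."""
    theta = ((1 - P1) * P2) / (P1 * (1 - P2))
    return 1 - (1 - theta**B) / (1 - theta**(A + B))

print(round(wrong_decision_prob(0.6, 0.4, A=5, B=5), 4))    # 0.017
print(round(wrong_decision_prob(0.6, 0.4, A=10, B=10), 4))  # 0.0003
```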
Remark
The above is a simple example of sequential testing, where the sample size for a given statis-
tical test is not fixed in advance, but rather samples are collected sequentially until a given
criterion is met. The distinction is important, because if one collects samples sequentially
but performs testing as if the sample size had been fixed in advance, then in most cases one will end up with an invalid test (even a test that eventually rejects the null hypothesis with probability one when it is true).
is based on stochastic processes (where time corresponds to sample size). An example of a
popular and important sequential testing procedure is the sequential probability ratio test
(SPRT). For more details see the discussion in the in-class tablet notes on blackboard, and
more generally the literature on “sequential hypothesis testing” or ”sequential analysis”, or
for example applications to “sequential A/B testing”.
• a latent Markov chain (Xt )t≥0 with state space X , transition probabilities (Pij )i,j∈X , and
starting distribution α = (αi )i∈X , i.e. such that αi = Pr(X0 = i);
• a set of possible signals S;
• at each time t = 0, 1, 2, . . . , a signal St is emitted. St is a random variable whose distribution
depends on the current state of the latent process Xt , and it is denoted as f (St | Xt ). More
precisely we assume that for all t ≥ 0
$$\Pr(S_t = s \mid X_t = j) = f(s \mid j) \, , \qquad s \in \mathcal{S},\ j \in \mathcal{X},$$
$$S_t \perp (X_i, S_i)_{i \neq t} \mid X_t \, .$$
[Graphical model of a HMM: the latent chain X0 → X1 → X2 → X3 · · · on top, with each Xt emitting the signal St below.]
NB: this is a graph representing the (conditional dependence) relationship between the random
variables involved in a HMM. It says two things: first that (Xt )t≥0 evolves as a Markov chain; and
second that each signal St depends on Xt and conditionally on Xt is independent of the rest.
To sum up, a HMM assumes that
$$\Pr(X_{0:t} = x_{0:t},\, S_{0:t} = s_{0:t}) = \alpha_{x_0}\, P_{x_0 x_1} \cdots P_{x_{t-1} x_t}\, f(s_0 \mid x_0) \cdots f(s_t \mid x_t) \tag{2.2}$$
for every t ≥ 0, x0:t := (x0 , ..., xt ) ∈ X t+1 and s0:t := (s0 , . . . , st ) ∈ S t+1 .
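Equation (2.2) also gives a direct recipe to simulate a HMM: sample the latent chain step by step and emit a signal from f(· | xt) at each time. A minimal sketch (not part of the notes) with hypothetical parameter values:

```python
import numpy as np

def simulate_hmm(alpha, P, f, T, rng=np.random.default_rng(0)):
    """Draw (X_0,...,X_T) and (S_0,...,S_T) from the joint distribution (2.2).
    f[j] is the emission pmf f(. | j) given latent state j."""
    x = rng.choice(len(alpha), p=alpha)            # X_0 ~ alpha
    xs, ss = [x], [rng.choice(f.shape[1], p=f[x])]
    for _ in range(T):
        x = rng.choice(len(alpha), p=P[x])         # latent Markov step
        xs.append(x)
        ss.append(rng.choice(f.shape[1], p=f[x]))  # emit a signal given the current state
    return np.array(xs), np.array(ss)

# hypothetical 2-state, 2-signal example
alpha = np.array([0.5, 0.5])
P = np.array([[0.9, 0.1], [0.2, 0.8]])
f = np.array([[0.8, 0.2], [0.3, 0.7]])
print(simulate_hmm(alpha, P, f, T=10))
```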
Remark
• HMMs are a widely used class of models, and one of the most natural class of models
for unobserved quantities that evolve over time. A typical application is time series
analysis or signal processing. Here (Xt )t≥0 is some quantity of interest, and St is a
noisy observation of Xt (such as St = Xt + εt , with εt ∼ N(0, σ²), when X = R). Here we
first consider HMMs with finite state spaces, and then discuss continuous ones in the
Kalman filter section.
Consider a production process that in each period is either in a good state (state 1), or in a
poor state (state 2). If the process is in state 1 during a period then, independent of the past,
with probability 0.9 it will be in state 1 during the next period, and with probability 0.1 it
will be in state 2. Once in state 2, it remains in that state forever. Suppose that a single
item is produced each period, and that each item produced when the process is in state 1 is
of acceptable quality with probability 0.99, while each item produced when the process is in
state 2 is of acceptable quality with probability 0.96.
The HMM formulation of the previous example is the following: for each t ≥ 0 the latent
variable Xt is the state of the process, i.e.
$$X_t = \begin{cases} 1 & \text{if the process is in a good state at time } t \\ 2 & \text{if the process is in a poor state at time } t. \end{cases}$$
(Xt )t≥0 evolves according to a Markov chain on the state space X = {1, 2} with transition probabilities
$$P = \begin{pmatrix} 0.9 & 0.1 \\ 0 & 1 \end{pmatrix}.$$
The signal St is the quality of the item produced at time t, with S = {h, l} and
$$S_t = \begin{cases} h & \text{if the item produced at time } t \text{ has high quality} \\ l & \text{if the item produced at time } t \text{ has low quality.} \end{cases}$$
In the production process example, one could be interested in the following quantities:
1. What is the probability that the process is in poor state at time 2 given that we observed
(S0 , S1 , S2 ) = (h, l, h)?
need to compute Pr(X2 = 2 | S0:2 = (h, l, h));
2. What is the probability that the process will be in poor state at the next time?
need to compute Pr(X3 = 2 | S0:2 = (h, l, h));
3. What is the probability that the next item produced has high quality?
need to compute Pr(S3 = h | S0:2 = (h, l, h))
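A brute-force way to answer question 1 (and to compute the marginal likelihood of the signals) is to enumerate all latent paths and use the joint distribution (2.2). The sketch below (not part of the notes) does this for the production process example; the starting distribution α, here a point mass on the good state, is an assumption, and states 1/2 and signals h/l are encoded as indices 0/1.

```python
import itertools
import numpy as np

P = np.array([[0.9, 0.1], [0.0, 1.0]])       # latent transition matrix
f = np.array([[0.99, 0.01],                  # f(h|good), f(l|good)
              [0.96, 0.04]])                 # f(h|poor), f(l|poor)
alpha = np.array([1.0, 0.0])                 # assumed starting distribution (good state)
s = [0, 1, 0]                                # observed signals (h, l, h)

def joint(x):
    """Joint probability (2.2) of a latent path x = (x0, x1, x2) and the observed signals."""
    p = alpha[x[0]] * f[x[0], s[0]]
    for t in range(1, len(s)):
        p *= P[x[t - 1], x[t]] * f[x[t], s[t]]
    return p

paths = list(itertools.product([0, 1], repeat=len(s)))
marginal = sum(joint(x) for x in paths)                        # Pr(S0:2 = (h, l, h))
filtering = sum(joint(x) for x in paths if x[-1] == 1) / marginal
print(filtering)                             # Pr(X2 = poor | S0:2 = (h, l, h))
```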
The above are examples of some of the typical distributions of interest in HMMs, which are
usually referred to as:
• Filtering distribution: Pr(Xt = j | S0:t = s0:t ), j ∈ X , i.e. when we want to make inferences on the latent process at time t given the observed signals S0:t .
NB: one may also want to compute predictions k-steps ahead, e.g. Pr(Xt+k = j | S0:t = s0:t ).
Here we focus on 1-step ahead for simplicity.
These can be efficiently computed using recursive algorithms, such as the Viterbi algorithm for
finite HMMs and the Kalman filter for Gaussian HMMs. We do not cover them this year in the
syllabus. The algorithms can be found in the appendix of these notes.
HMMs are often used in continuous spaces. Consider for example the following 1-dimensional
linear case. We assume X = S = R, transition kernel⁵ given by Pr(x, ·) = N(x, σ²η) and signal distribution f(x, ·) = N(x, σ²ₑ). Equivalently, for t = 0, 1, 2, . . . , assume
$$X_{t+1} = X_t + \eta_{t+1} \, , \qquad S_t = X_t + e_t \, ,$$
where (ηt )t≥0 and (et )t≥0 are two independent sequences of iid N(0, σ²η) and N(0, σ²ₑ) random variables, respectively.
We assume that (St )t≥0 is observed while (Xt )t≥0 is not. The task is to make inferences about
(Xt )t≥0 using (St )t≥0 . More specifically, we aim to compute the following quantities, where s0:t
denotes the observed signal:
• Marginal likelihood
$$p(S_{0:t} = s_{0:t}) = \int_{\mathbb{R}^{t+1}} p(X_{0:t} = x_{0:t},\, S_{0:t} = s_{0:t})\, dx_{0:t} \tag{2.5}$$
• Filtering distribution
$$p(X_t = x \mid S_{0:t} = s_{0:t}) = \frac{\int_{\mathbb{R}^{t}} p(X_{0:t} = x_{0:t},\, S_{0:t} = s_{0:t})\, dx_{0:(t-1)}}{p(S_{0:t} = s_{0:t})}$$
• Smoothing distribution
The Kalman filter provides an efficient way to compute the above quantities for linear HMMs. Each recursion of the filter consists of the following two steps: a predict step that goes from L(Xt−1 | S0:(t−1) ) to L(Xt | S0:(t−1) ); and an update step that goes from L(Xt | S0:(t−1) ) to L(Xt | S0:t ).⁶ In the Gaussian case, these steps can be computed in closed form (i.e. they boil down to parametric updates), thanks to the closure of Gaussian kernels with respect to convolution (i.e. sums of r.v.s) and multiplication (i.e. conditioning of r.v.s). See the appendix for full details.
5 We will define these more formally later. Basically, the above means that the distribution of Xt given Xt−1 = x is N(x, σ²η).
6 Here we use L to denote conditional distributions and p to denote conditional densities.
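The following is a minimal sketch of the predict/update recursion for the 1-dimensional model above (not the notes' Algorithm 2); the Gaussian prior N(m0, v0) on X0 and the simulated data are assumptions for illustration.

```python
import numpy as np

def kalman_filter_1d(s, sigma_eta, sigma_e, m0=0.0, v0=1.0):
    """Filtering means/variances of L(X_t | S_{0:t}) for the model
    X_{t+1} = X_t + eta, S_t = X_t + e_t, assuming X_0 ~ N(m0, v0)."""
    means, variances = [], []
    m, v = m0, v0
    for t, obs in enumerate(s):
        if t > 0:
            v = v + sigma_eta**2          # predict: L(X_t | S_{0:t-1})
        K = v / (v + sigma_e**2)          # Kalman gain
        m = m + K * (obs - m)             # update with the new signal
        v = (1 - K) * v
        means.append(m); variances.append(v)
    return np.array(means), np.array(variances)

rng = np.random.default_rng(0)
x = np.cumsum(rng.normal(0, 1.0, size=100))     # latent random walk (sigma_eta = 1)
s = x + rng.normal(0, 2.0, size=100)            # noisy signals (sigma_e = 2)
means, variances = kalman_filter_1d(s, sigma_eta=1.0, sigma_e=2.0)
```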
Remarks
• (Linear cost) Given some observations s0:T ∈ RT +1 , the forward pass of the Kalman
Filter (i.e. Algorithm 2) allows one to compute the filtering distributions {L(Xt | S0:t = s0:t )}t=0,...,T with O(T ) computational cost. This is an example of an ”online algorithm”, i.e. an algorithm where, once you get a new observation st , updating your knowledge/inference/estimate from s0:(t−1) to s0:t has O(1) cost. In other words, in an online algorithm,
the cost of processing a new observation does not increase with time.
• (Backward pass) One is often interested in computing not only the filtering distribution
but also the ”smoothing distributions”, i.e. {L(Xt | S0:T = s0:T )}t=0,...,T , which provide
better predictions of Xt given s0:T , since they condition on all the available information
including the ”future” of Xt . These can be efficiently computed with a ”backward pass”
of the Kalman Filter, which we do not discuss in detail for brevity. Conceptually, it is similar to Algorithm 2 but goes from t = T to t = 1 and allows one to compute the
parameters of the distributions {L(Xt | S0:T = s0:T )}t=0,...,T with O(T ) cost.
• (General dynamic Gaussian models) The Kalman filter is a default and fundamen-
tal ”smoothing” tool in signal processing, time series analysis, tracking, etc. In the
above section, we only discussed the 1-dimensional, intercept-only version to avoid over-
complicating the calculations. More generally, the Kalman filter is used with more
general dynamic linear models such as
$$X_{t+1} = A_t X_t + \eta_{t+1} \, , \qquad S_t = B_t X_t + e_t \, , \qquad \eta_{t+1} \sim N(0, \Sigma_t), \quad e_t \sim N(0, R_t),$$
where (Xt )t≥0 and (St )t≥0 are Rᵈ-valued and Rᵐ-valued stochastic processes, (Σt )t≥0 and (Rt )t≥0 are d × d and m × m covariance matrices, and (At )t≥0 and (Bt )t≥0 are d × d and m × d regression matrices. The forward and backward passes extend to this more general context, and still require a cost that is linear in T .
• (Parameter estimation) Above we assumed both the transition kernel Pr(x, ·) and the signal distribution f (x, ·) to be fully known. This is usually not the case in practice, where it is more common for those to be specified only in terms of unknown parameters. For example, in the 1d example, we may still assume Pr(x, ·) = N(x, σ²η) and f (x, ·) = N(x, σ²ₑ) but treat θ := (ση , σe ) as unknown parameters. In such contexts, the most common approach is to estimate θ by maximizing the marginal likelihood, i.e. setting θ to θ̂MML = arg maxθ Pr(S0:T = s0:T | θ). In these cases the Kalman filter can also be used to efficiently compute the marginal likelihood Pr(S0:T = s0:T | θ) required to search for θ̂MML with some optimization algorithm. Similarly to the smoothing distributions, the recursive equations to compute marginal likelihoods are conceptually similar to the ones in Algorithm 2 and we do not discuss them here for brevity.
Chapter 3

Continuous-time Markov chains

In this chapter we consider stochastic processes with continuous time, t ∈ [0, +∞), and discrete space (i.e. either finite, e.g. X = {0, 1}, or countable, e.g. X = {0, 1, 2, . . . }).
The mean and variance of X can be easily computed using the above expression of f (x) and are E[X] = λ⁻¹ and Var(X) = λ⁻². Thus, the larger the rate λ, the smaller the typical value of the random variable X ∼ Exp(λ).
The exponential distribution satisfies the memoryless property.
Definition 8. A (0, ∞)-valued random variable X has the memoryless property (MP) if Pr(X > t + s | X > t) = Pr(X > s) for all s, t ≥ 0.
To understand the intuition behind the MP it is useful to think of X as the time until an event occurs. For example:
• if X ∼ Exp(λ) is the lifetime of a component, then the memoryless property means that if the component did not break by time t, the probability of it breaking in the next s units of time is the same as that of breaking in the first s units of time.
• if X ∼ Exp(λ) is the time until somebody enters the room, the memoryless property
means that if nobody has entered yet, the probability of somebody entering in the next
second does not depend on the current time.
The hazard rate function of a (0, ∞)-valued random variable X is defined as
$$r(t) = \frac{f(t)}{1 - F(t)} \, , \qquad t \ge 0, \tag{3.1}$$
where f and F are the pdf and cdf of X, respectively1 . Intuitively, r(t) represents the probability
(or better the probability density) of the event happening at time t given that it has not yet
happened before t. More precisely, we have
$$\Pr(X \in (t, t+dt) \mid X > t) = \frac{\Pr(X \in (t, t+dt))}{\Pr(X > t)} = \frac{\int_t^{t+dt} f(x)\, dx}{1 - F(t)} = \frac{f(t)}{1 - F(t)}\, dt + o(dt) = r(t)\, dt + o(dt) \quad \text{as } dt \downarrow 0 \, .$$
The hazard rate function r(t) uniquely characterizes the distribution of the corresponding random variable, as shown in the following proposition. For notational convenience, we denote the survival function of a r.v. X as F̄(t) := Pr(X > t) = 1 − F(t).
Proposition 10. Let X be a (0, ∞)-valued random variable with hazard rate function r. Then its survival function F̄ satisfies
$$\bar F(t) = e^{-\int_0^t r(s)\, ds} \, , \qquad t \ge 0. \tag{3.3}$$
$$r(t) = -\frac{\bar F'(t)}{\bar F(t)} = -\frac{d}{dt} \log \bar F(t) \, , \qquad t > 0. \tag{3.4}$$
1 Here and in the following we restrict attention to continuous random variables with probability zero of being equal to each single t > 0.
Since the hazard rate function uniquely characterises the distribution of X, one can also take (3.3) directly as the definition of the hazard rate function (i.e. the hrf r is the function r : (0, ∞) → [0, ∞) such that (3.3) is satisfied).
When X represents a time-to-event (e.g. the lifetime of an item), the hazard rate function can be a more intuitive way of characterizing the distribution of X than its pdf f . See the discussion and figures in the in-class notes.
In some situations, the memoryless property allows one to compute probabilities of interest without the need for explicit calculations, see for example Exercise 47 in the list.
Proposition 12. Let (Xi )i=1,...,n be independent (0, ∞)-valued random variables with hazard rate functions ri for i = 1, . . . , n. Then
$$X := \min_{i=1,\ldots,n} X_i$$
has hazard rate function r(t) = Σⁿᵢ₌₁ ri (t), and
$$I := \operatorname*{arg\,min}_{i=1,\ldots,n} X_i$$
satisfies
$$\Pr(I = i \mid X = t) = \frac{r_i(t)}{\sum_{j=1}^{n} r_j(t)} \, , \qquad i = 1, \ldots, n;\ t > 0. \tag{3.8}$$
Proof. By independence,
$$\bar F(t) = \Pr\Big(\min_{i=1,\ldots,n} X_i > t\Big) = \prod_{i=1}^{n} \Pr(X_i > t) = \prod_{i=1}^{n} \bar F_i(t) \, ,$$
where F̄ and F̄i denote the survival functions of X and Xi , respectively. By Proposition 10 we thus have
$$\bar F(t) = \prod_{i=1}^{n} \exp\left(-\int_0^t r_i(s)\, ds\right) = \exp\left(-\int_0^t r(s)\, ds\right) \, , \qquad r(s) := \sum_{i=1}^{n} r_i(s) \, , \tag{3.10}$$
By independence of X1 , . . . , Xn ,
$$\Pr(I = i, X = t) = f_i(t) \prod_{j\neq i} \bar F_j(t) = r_i(t)\, \bar F(t) \, , \tag{3.12}$$
and thus
$$\Pr(I = i \mid X = t) = \frac{\Pr(I = i, X = t)}{\sum_{j=1}^{n} \Pr(I = j, X = t)} = \frac{r_i(t)}{\sum_{j=1}^{n} r_j(t)} \tag{3.13}$$
as desired.
2 Technically, one should think of Pr(I = i, X = t) as a pmf with respect to (w.r.t.) i and a pdf w.r.t. t, meaning that we need to sum w.r.t. i and integrate w.r.t. t when we want to obtain probabilities of events. In other words Pr(I ∈ A, X ∈ B) = Σᵢ∈A ∫B Pr(I = i, X = t) dt for every A ⊆ {1, . . . , n} and B ⊆ R.
If X1 , . . . , Xn are the times of n events, X = mini Xi is the time of the first event. For example,
if n students are trying to solve an exercise and the i-th student takes Xi time then X is the time
until somebody solves the exercise.
Intuition: the hazard rate of the event X is r(t) = Σᵢ ri (t). This is just the sum of the hazard rates of the random variables Xi . This is quite intuitive if you think of the interpretation of hazard rates and the fact that as soon as some Xi happens, then X happens as well.
When applied to the case of the Exponential distribution, Proposition 12 leads to the following corollary.
Corollary 1. Let (Xi )i=1,...,n be independent random variables with Xi ∼ Exp(λi ) for i = 1, . . . , n. Then X := min_{i=1,...,n} Xi and I := arg min_{i=1,...,n} Xi have distributions
$$X \sim \mathrm{Exp}(\lambda) \quad \text{with } \lambda = \sum_{i=1}^{n} \lambda_i \tag{3.14}$$
and
$$\Pr(I = i) = \frac{\lambda_i}{\sum_{j=1}^{n} \lambda_j} \, , \qquad i = 1, \ldots, n. \tag{3.15}$$
The properties in Corollary 1 allow one to perform computations in the context of probabilistic models with time-to-event variables (e.g. queueing systems); see e.g. Exercise 48 in the list.
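Both properties are easy to check by simulation; the following sketch (not part of the notes, with illustrative rates) estimates the rate of X = min_i X_i and the distribution of the argmin I.

```python
import numpy as np

rng = np.random.default_rng(0)
lam = np.array([1.0, 2.0, 3.0])                         # rates lambda_i
samples = rng.exponential(1 / lam, size=(100_000, 3))   # X_i ~ Exp(lambda_i)

X = samples.min(axis=1)                                 # X = min_i X_i
I = samples.argmin(axis=1)                              # index achieving the minimum

print(1 / X.mean(), lam.sum())                          # estimated rate of X vs sum of rates, cf. (3.14)
print(np.bincount(I) / len(I), lam / lam.sum())         # empirical Pr(I = i) vs (3.15)
```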
When X is discrete, we refer to continuous-time stochastic processes satisfying the above defi-
nition as continuous-time Markov chains (CTMC).
Intuition: similarly to the discrete-time case, the Markov property above means that the future
depends on the past only through the present.
A CTMC (X(t))t∈[0,∞) is time-homogeneous (or has stationary transition probabilities) if the transition probabilities Pij (t) := Pr(X(t + s) = j | X(s) = i) do not depend on s. In the following, we will always assume that the CTMCs under consideration
are time-homogeneous, unless stated otherwise.
As done for their discrete-time version, we arrange the transition probability functions Pij (t) in matrix form, leading to P (t) := (Pij (t))i,j∈X .
When |X | < ∞, (3.18) can be written as products between |X | × |X | matrices, leading to the following compact notation for the CK equations: P (t + s) = P (t) P (s) for all s, t ≥ 0.
We define
$$Q_{ij} := \lim_{h\to 0} \frac{P_{ij}(h)}{h} \, , \qquad i, j \in \mathcal{X} \text{ with } i \neq j, \tag{3.19}$$
as the jump rate (or instantaneous transition rate) from i to j of the CTMC (X(t))t∈[0,∞) .
Equivalently, Pij (h) = Qij h + o(h) as h ↓ 0, meaning that Qij can be interpreted as the ”hazard rate” at which the Markov chain X(t), currently at i, might jump to j in the next infinitesimal instant.
The jump rates are the natural quantities, or parameters, that are typically used to specify
CTMC models, as shown in the examples of Section 3.3.
3 This can be deduced e.g. by writing P (t) = P (ε)ⁿ P (t − nε) where n = ⌊t/ε⌋ is the integer part of t/ε.
Let X = {0, 1} with transition rates Q01 = λ and Q10 = µ. Here λ and µ are two free
parameters in [0, ∞).
Suppose that X(t) represents the size of a population at time t and each member of a popu-
lation gives birth to a new individual at rate λ (and individuals never die). Let
T = min (T1 , . . . , Ti ) ,
where each Tj for j = 1, . . . , i is the time taken by the j-th individual in the population to
give birth to a new individual. The underlying assumption is that Tj ∼ Exp(λ) independently
across j, which then implies T ∼ Exp(iλ) by (3.14). See Section 3.5 for more discussion about
the connection between CTMCs and Exponential distributions.
Consider now a more general version of the previous example, where individuals give birth at rate λ, die at rate µ and new individuals immigrate into the population at rate θ (i.e. on average individuals live 1/µ units of time, give birth every 1/λ units of time and θ individuals immigrate into the population every unit of time). Then,
$$Q_{i,i+1} = \lambda i + \theta \qquad \text{and} \qquad Q_{i,i-1} = \mu i \, ,$$
with all other jump rates equal to 0. The expressions above can be obtained with reasoning similar to Example 3.3.3 above. Note
that the birth and death rates get multiplied by the population size (since these are individual
effects that sum up to the population level) while the immigration term does not depend on
the population size (assuming immigration to be exogenous and not directly dependent on
the inner population dynamics).
Here, a typical question of interest is: what is the distribution of the population size for
large t? How does it depend on λ, µ and θ?
Given the jump rates (Qij )i≠j , set νi := Σⱼ≠ᵢ Qij and arrange everything into the matrix Q = (Qij )i,j∈X with diagonal entries Qii := −νi . We will call such a matrix the Q-matrix, or generator matrix, of the CTMC X. The reason for setting the diagonal terms of Q equal to −νi comes from the following lemma.
Remark
Note that Q-matrices satisfy Σⱼ∈X Qij = 0 for every i ∈ X by construction. Indeed, any matrix Q with Qij ≥ 0 for i ≠ j and Qii = −Σⱼ≠ᵢ Qij (or equivalently Σⱼ∈X Qij = 0 for every i) is a valid Q-matrix with a well-defined corresponding CTMC (X(t))t∈[0,∞) .
We have two sets of differential equations that describe the evolution of P (t) as a function of Q,
called the Kolmogorov equations (KE).
Theorem 7 (Kolmogorov backward and forward equations). Let X be a CTMC with |X | < ∞ and well-defined Q-matrix⁴. Then P satisfies the Kolmogorov backward equations (KBE)
$$P'_{ij}(t) = \sum_{k\neq i} Q_{ik} P_{kj}(t) - \nu_i P_{ij}(t) = (Q\, P(t))_{ij} \, , \qquad t \ge 0,\ i, j \in \mathcal{X},$$
and the Kolmogorov forward equations (KFE)
$$P'_{ij}(t) = \sum_{k\neq j} P_{ik}(t)\, Q_{kj} - \nu_j P_{ij}(t) = (P(t)\, Q)_{ij} \, , \qquad t \ge 0,\ i, j \in \mathcal{X}.$$
It follows that
$$P'(t) = \lim_{h\to 0} \frac{P(t+h) - P(t)}{h} = \lim_{h\to 0} \frac{(P(h) - P(0))\, P(t)}{h} \tag{3.26}$$
$$\overset{(*)}{=} \left(\lim_{h\to 0} \frac{P(h) - P(0)}{h}\right) P(t) = Q\, P(t) \, , \qquad t \ge 0, \tag{3.27}$$
4 i.e. the limit Qij defined in (3.19) exists and belongs to [0, ∞) for every i ≠ j. NB: this is automatically satisfied if the CTMC is defined through the jump chain and holding times construction.
where in the equation denoted by (∗) we exchanged a limit and a matrix product (i.e. a limit and
a sum) which is justified since |X | < ∞ and each limit inside the sum exists.
The proof of the KFE is analogous using the decomposition P (t+h)−P (t) = P (t)P (h)−P (t) =
P (t)(P (h) − P (0)).
The KE can be written in compact matrix form, i.e.
$$\text{KFE}: \quad P'(t) = P(t)\, Q \, , \qquad t \ge 0, \tag{3.28}$$
$$\text{KBE}: \quad P'(t) = Q\, P(t) \, , \qquad t \ge 0. \tag{3.29}$$
In principle, they allow to compute Pij (t) starting from Q. In some cases the KE can be solved in
a fully analytic way to obtain explicit expressions for Pij (t).
Let X = {0, 1} with transition rates Q01 = λ and Q10 = µ. Then, the KBE becomeᵃ
$$\begin{cases} P'_{00}(t) = \lambda P_{10}(t) - \lambda P_{00}(t) \\ P'_{10}(t) = \mu P_{00}(t) - \mu P_{10}(t) \, . \end{cases}$$
If (X(t))t∈[0,∞) ∼ PP(λ), then the closed-form expressions for the transition probabilities
$$P_{ij}(t) = \frac{e^{-\lambda t}\, (\lambda t)^{j-i}}{(j-i)!}\, \mathbb{I}(j \ge i)$$
can be verified to be the (unique) solutions to the KE with jump rates given by Qi,i+1 = λ for all i ∈ X and Qij = 0 for all other j ≠ i.
In general, however, the KE are hard to solve in a fully analytic form. Nonetheless, their
solution always admits the following matrix exponential representation
P(t) = e^{tQ} = ∑_{n=0}^{∞} (tQ)^n / n! = I + tQ + (tQ)²/2! + . . . ,   t ≥ 0.   (3.30)
The above series representation can be shown to converge for all values of t ≥ 0 and to solve the
KE following the steps
P′(t) = (d/dt) e^{tQ} = (d/dt) ∑_{n=0}^{∞} (tQ)^n/n!
      = ∑_{n=0}^{∞} (d/dt) (tQ)^n/n!
      = ∑_{n=1}^{∞} n t^{n−1} Q^n / n!
      = Q ∑_{n=1}^{∞} (tQ)^{n−1}/(n − 1)!
      = Q ∑_{n=0}^{∞} (tQ)^n/n! = Q e^{tQ} = QP(t),   t ≥ 0.
We do not provide formal justification of all the above steps, and refer to Theorem 2.1.1 of Norris
[1998] for more details on that.
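To make the representation P(t) = e^{tQ} concrete, here is a small hedged check in Python using scipy's matrix exponential; the two-state Q-matrix is just an illustrative example. It verifies numerically the semigroup property P(t + s) = P(t)P(s) and that a finite-difference derivative of P(t) matches QP(t), i.e. the KBE.

import numpy as np
from scipy.linalg import expm

lam, mu = 1.0, 2.0
Q = np.array([[-lam, lam], [mu, -mu]])
P = lambda t: expm(t * Q)                    # P(t) = exp(tQ)

t, s, h = 0.7, 0.4, 1e-6
print(np.allclose(P(t + s), P(t) @ P(s)))                        # semigroup property
print(np.allclose((P(t + h) - P(t)) / h, Q @ P(t), atol=1e-4))   # KBE: P'(t) = Q P(t)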
Remarks
• The KBE and KFE are two different sets of differential equations since
QM ≠ MQ
for a general matrix M. However, they have the same unique solution P(t), which satisfies
QP(t) = P(t)Q.
• The representation P(t) = e^{tQ} is fully general and useful in some situations, but it can be too expensive to compute numerically in various situations and/or not explicit enough for many tasks. In this sense, it can be thought of as a 'semi-explicit solution' and should not lead one to think that P(t) is always available in explicit form. It is nicely complemented by using the KE to characterize the long-run behavior of CTMCs in a more explicit way.
This, in particular, rules out trajectories having an uncountable number of discontinuity points. Such trajectories would arise, for example, from 'white noise' models in continuous time where (X(t))_{t∈[0,∞)} satisfies X(t) ∼ π for every t > 0 and X(t) independent from X(s) for s ≠ t (think e.g. of X = {0, 1} and π = Unif(X)). Such models are not particularly interesting from the modeling point of view and would not allow for a fruitful mathematical treatment. Note that, on the other extreme, requiring t ↦ X(t) to be continuous would be too restrictive (since e.g. the only continuous trajectories when X is discrete are constant ones). Thus, we can think of cadlag trajectories, such as those in Figure 3.1, as a compromise between generality and tractability.
In the interest of brevity, we will not discuss regularity assumptions in detail and simply assume that our CTMC of interest has trajectories that can be represented as (3.33) for some sequence (Yn, Tn)_{n∈N}. This is arguably a mild assumption from the modelling point of view. We refer to Section 2.2 of Norris [1998] for more discussion on possible behaviors of CTMC trajectories.
By the cadlag assumption, trajectories t ↦ X(t) are piece-wise constant and can be described by two sequences of random variables:
1. the so-called jump chain, or skeleton chain, (Yn )n=0,1,2,... , where Yn represents the n-th state
visited by the CTMC trajectory (X(t))t∈[0,∞) .
2. the holding times (Tn )n=0,1,2,... , where Tn represents the amount of time spent in the n-th
visited state Yn .
See Figure 3.1 for an illustration. Formally, we have a one-to-one mapping between trajectories (X(t))_{t∈[0,∞)} and sequences (Yn, Tn)_{n∈N}, given by: for every t > 0 and i ∈ X,
X(t) = i  ⇔  T0 + · · · + T_{n−1} ≤ t < T0 + · · · + Tn  and  Yn = i   (3.33)
for some n ∈ {0, 1, 2, . . . }. Given the above bijection, the distribution of a CTMC (X(t))t∈[0,∞)
can be conveniently described by specifying the distribution of the sequence of random variables
(Yn , Tn )n∈N . Unlike discrete time MCs, continuous time ones spend random times at each location
before making a transition to a new state.
Assume (X(t))_{t∈[0,∞)} is a CTMC. What can we say about the distribution of the random holding times that the process spends in each state? Suppose, e.g., that Y0 = i ∈ X and consider the distribution of T0. Using first the jump chain and holding times definition, and then the Markov property, one can show that Pr(T0 > t + s | T0 > s, Y0 = i) = Pr(T0 > t | Y0 = i) for all s, t ≥ 0.
Figure 3.1: A trajectory of a CTMC (X(t))t∈[0,∞) with the associated jump chain (Yn )n=0,1,... and
holding times (Tn )n=0,1,... .
Thus, the random variable T0 given Y0 = i satisfies the memoryless property and therefore it is exponentially distributed, i.e. T0 | Y0 = i ∼ Exp(νi) for some rate νi ∈ [0, ∞).
Note that the exponential rate of T0 , which we denoted as νi , is allowed to depend on the current
state i of the process. Intuitively, if νi is large, then the process will on average spend a short time
in state i at each visit to it.
When X(t) makes a jump out of Y0 = i it chooses the next state j with some probability, which we denote as Kij := Pr(Y1 = j | Y0 = i) for j ≠ i.
The above recipe gives a simple and efficient algorithm to simulate trajectories of CTMCs, which is often referred to as the Gillespie algorithm (a simulation sketch in code is given after (3.35) below). Equivalently, in mathematical terms, the joint distribution
of jump chain and holding times (Yn , Tn )n=0,1,2,... can be described as follows:
(Yn)_{n=0,1,...} is a DTMC with transition matrix K,
Tn | Yn = i ∼ Exp(νi) for every i ∈ X,   (3.35)
Tn ⊥ (Yi, Ti)_{i≠n} | Yn.
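The following is a minimal Python sketch of this simulation recipe (the Gillespie algorithm) for a finite state space, working directly from the Q-matrix via νi = −Qii and Kij = Qij/νi; the function name and the example Q-matrix are illustrative choices, not part of the notes.

import numpy as np

def simulate_ctmc(Q, x0, t_max, rng):
    """Simulate a CTMC on {0,...,n-1} from its Q-matrix up to time t_max.
    Returns the jump chain (visited states) and the corresponding jump times."""
    n = Q.shape[0]
    states, times = [x0], [0.0]
    x, t = x0, 0.0
    while True:
        nu = -Q[x, x]                       # holding rate nu_i = sum_{j != i} Q_ij
        if nu == 0:                         # absorbing state
            break
        t += rng.exponential(1.0 / nu)      # holding time ~ Exp(nu_i)
        if t > t_max:
            break
        probs = Q[x].copy(); probs[x] = 0.0
        probs /= nu                         # jump probabilities K_ij = Q_ij / nu_i
        x = int(rng.choice(n, p=probs))
        states.append(x); times.append(t)
    return np.array(states), np.array(times)

# Example: two-state chain with Q01 = lambda, Q10 = mu
lam, mu = 1.0, 2.0
Q = np.array([[-lam, lam], [mu, -mu]])
Y, T = simulate_ctmc(Q, x0=0, t_max=10.0, rng=np.random.default_rng(2))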
Lemma 6. Let (X(t))t∈[0,∞) be a CTMC defined through (3.33), with jump chain and holding
times (Yn , Tn )n=0,1,2,... satisfying (3.35). Then the jump rates Qij in (3.19) are well-defined (i.e.
the limits exist) and satisfy
νi = ∑_{j≠i} Qij,   i ∈ X,   (3.36)
Kij = Qij / ∑_{ℓ≠i} Qiℓ,   i, j ∈ X with i ≠ j.   (3.37)
Proof of (3.36)-(3.37).⁵ We show
Qij = νi Kij,   i ≠ j,
from which both (3.36) and (3.37) can be easily deduced thanks to ∑_{j≠i} Kij = 1 for every i ∈ X.
Assume i ≠ j. By definition of jump chain and holding times we have
Thus, by (3.33) and (3.35),
Pii(h) = Pr(X(h) = i | X(0) = i) ≥ Pr(T0 > h | Y0 = i) = e^{−νi h},   h > 0,
5 Here we assume that the limits in (3.19) exist and prove (3.36) and (3.37). The proof could be extended to
prove the existence of limits, following similar calculations but considering limsups and liminfs and showing that
they coincide. We omit those arguments for brevity and leave that as an exercise for the interested reader.
which implies ∑_{j≠i} Qij = lim_{h↓0} (1 − Pii(h))/h ≤ νi.
Combining ∑_{j≠i} Qij ≤ νi and Qij ≥ νi Kij we have ∑_{j≠i} Qij = νi. Combining the latter with Qij ≥ νi Kij we obtain Qij = νi Kij, as desired.
Lemma 6 implies that the behaviour of a CTMC can be equivalently described in terms of
(ν, K) = ((νi )i∈X , (Kij )i,j∈X ) or in terms of Q = (Qij )i,j∈X , i.e. they are simply two alternative
parametrizations of the jump rates. In fact, one can recover (ν, K) from Q by (3.36)-(3.37), as well
as Q from (ν, K) by inverting those formulas, which gives Qij = νi Kij for all i 6= j.
Remark
The intuition behind (3.36) and (3.37) can be understood by considering the following equiv-
alent way to simulate (Yn , Tn )n=0,1,... starting from the jump rates Qij : given Yn = i ∈ X
do
1. simulate Ti→j ∼ Exp(Qij ) independently for all j ∈ X \{i};
2. set Tn = minj∈X \{i} Ti→j and jump to Yn+1 = arg minj∈X \{i} Ti→j .
By the ’first event’ and ’hazard race’ properties of exponentials, i.e. (3.14) and (3.15), it follows
that the sequence (Yn , Tn )n=0,1,... generated by the above procedure has the same distribution
as the one defined in (3.35), with ν and K given by (3.36) and (3.37). In other words, the
above procedure is a valid algorithm to simulate a CTMC starting from its jump rates Qij .a
The above construction provides an algorithmic/intuitive interpretation to the statement
that ’Qij coincides with the rate at which the process jumps from i to j’.
ᵃNote that this algorithm is in general less computationally efficient, and thus less used, than directly simulating the holding times and jump chain as in (3.35).
Remarks
• We implicitly discussed three different characterizations of CTMCs, namely in terms of: (a) the jump chain and holding times construction (i.e. equations (3.33) and (3.35)); (b) the jump rates (i.e. equation (3.20)); and (c) the Kolmogorov equations (i.e. (3.28)). These are all equivalent under regularity assumptions (e.g. right-continuity of trajectories and |X| < ∞), see e.g. Theorem 2.8.2 of Norris [1998] for more details and a complete formal result on this. Above we basically showed (a)⇒(b) in Lemma 6 and (b)⇒(c) in Theorem 7; one should also show (c)⇒(a) to close the circle. Below we discuss these equivalences in more detail for the Poisson Process.
In this section we derive convergence results analogous to the ones we discussed for DTMCs.
Stationary distributions
Definition 11. A probability distribution π on X is stationary for a CTMC (X(t))t∈[0,∞) with
transition probabilities (P (t))t∈[0,∞) if
πQ = 0 (3.42)
where 0 represents a vector (or better 1 × |X | matrix) with all entries equal to zero.
Proof for |X| < ∞. First note that, by (3.41), π is stationary if and only if (d/dt) πP(t) = 0 for all t ≥ 0. Assume π is stationary. Then 0 = (d/dt) πP(t) = πP′(t) for all t ≥ 0, where we have switched a derivative and a summation thanks to |X| < ∞. Taking t = 0 we obtain 0 = πP′(0) = πQ. For the converse, assume πQ = 0. Then, switching again derivative and summation and applying the KBE we have
(d/dt) πP(t) = πP′(t) = πQP(t) = 0,   t ≥ 0,   (3.43)
meaning that t ↦ πP(t) is constant. Since πP(0) = π, we have πP(t) = π for every t ≥ 0 and thus π is stationary.
To interpret (3.42) we can write out πQ = 0 as
0 = ∑_{i∈X} πi Qij = −νj πj + ∑_{i≠j} πi Qij,   j ∈ X.   (3.44)
Roughly speaking, the balance equations require finding a π such that the "flow of probability" that enters into j and that goes out of j in stationarity is the same. The balance equations allow one to find the stationary distribution π given the Q-matrix Q = (Qij)_{i,j∈X}; a numerical sketch is given below. They are the analogue of the equations πj = ∑_k πk Pkj that we saw in (1.15) in the context of discrete time MCs.
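As a hedged illustration of how one can solve the balance equations numerically for a finite state space, the Python snippet below finds π with πQ = 0 and ∑_j πj = 1 by solving a small linear system; the two-state Q-matrix is the example used throughout and is purely illustrative.

import numpy as np

lam, mu = 1.0, 2.0
Q = np.array([[-lam, lam], [mu, -mu]])

# Solve pi Q = 0 together with sum(pi) = 1: replace one (redundant) balance
# equation with the normalization constraint and solve the linear system.
n = Q.shape[0]
A = np.vstack([Q.T[:-1], np.ones(n)])
b = np.zeros(n); b[-1] = 1.0
pi = np.linalg.solve(A, b)
print(pi)   # [mu/(mu+lam), lam/(mu+lam)] = [2/3, 1/3]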
Convergence to stationarity
To ensure convergence to stationarity (i.e. that a stationary distribution π is also limiting as in
(3.40)) we first require irreducibility. Intuitively, the definition of irreducibility for CTMCs is
analogous to the one for DTMCs, just replacing P with Q. More precisely, we have the following
formulation.
Definition 12. A CTMC X is irreducible if, for every i, j ∈ X, there exist k ∈ N and (i0, . . . , ik) ∈ X^{k+1} such that i0 = i, ik = j and Q_{iℓ iℓ+1} > 0 for all ℓ = 0, . . . , k − 1.
On the other hand, a difference compared to DTMCs is that for CTMCs we do not need to require aperiodicity, since this is always guaranteed (intuitively, the inherent randomness in the holding times of CTMCs prevents any "deterministic oscillation" typical of periodic DTMCs). For example, the following lemma shows that irreducibility automatically implies that all entries of P(t) are strictly positive for t > 0.
Lemma 7. Let X be an irreducible CTMC. Then
Pij (t) > 0 t > 0, i, j ∈ X . (3.46)
We can now state the convergence theorem for CTMCs.
Theorem 8. Let X be an irreducible CTMC with stationary distribution π. Then
lim Pij (t) = πj i, j ∈ X . (3.47)
t→∞
Let X = {0, 1} with transition rates Q01 = λ and Q10 = µ. Then, the balance equations are
ν0 π0 = Q10 π1        λπ0 = µπ1
ν1 π1 = Q01 π0   ⟹   µπ1 = λπ0 ,
giving
π1/π0 = λ/µ   and   π0 + π1 = 1,
meaning that
π0 = µ/(µ + λ)   and   π1 = λ/(µ + λ).
It follows by Theorem 8 that lim_{t→∞} P00(t) = lim_{t→∞} P10(t) = µ/(µ + λ) and lim_{t→∞} P01(t) = lim_{t→∞} P11(t) = λ/(µ + λ), which is coherent with the explicit solutions for Pij(t) found in Example 3.4.1.
1. an arrival (or birth) of a new individual happens with exponential rate λi (i.e. the time for
the next arrival is Exp(λi ) distributed);
2. a departure (or death) of an individual happens with exponential rate µi (i.e. the time for
the next departure is Exp(µi ) distributed).
Equation (3.48) describes the Q-matrix of a B&D process, from which one can easily deduce the corresponding (νi)_{i∈X} and (Kij)_{i,j∈X} parameters. We thus have two equivalent ways of thinking of and simulating B&D processes:
(a) Based on the arrival and departure rates (λi)_{i=0}^{+∞} and (µi)_{i=0}^{+∞}: given a current state X(t) = i ∈ X do
Customers arrive at a service station according to a Poisson process with rate λ and join the queue upon arrival. The server takes an exponential time to serve each customer, with an average serving time of 1/µ units of time. Here, the number of customers in the queue as time evolves can be described as a B&D process with
µn = µ, n ≥ 1,   and   λn = λ, n ≥ 0.
This is known as the M/M/1 queuing system (which stands for "Markovian arrivals/Markovian departures/1 server").
The M/M/1 system has many generalizations. Suppose, for example, that we have s > 1 servers. Then, we obtain an M/M/s queuing system, which can be modelled as a B&D process with arrival and departure rates
µn = min(s, n)µ, n ≥ 1,   and   λn = λ, n ≥ 0.
Here a typical question of interest is: what is the average waiting time for each customer? What is the probability of a customer waiting more than 30 minutes? How do these quantities depend on λ, µ, and s?
Combining the above with ∑_{j=0}^{+∞} πj = 1, we obtain
πj = Z^{−1} ∏_{i=1}^{j} λ_{i−1}/µ_i,   with   Z := ∑_{j=0}^{∞} ∏_{i=1}^{j} λ_{i−1}/µ_i.   (3.50)
Note that, assuming λi > 0 for all i ≥ 0 and µi > 0 for all i ≥ 1, we have that X is irreducible and
thus a stationary distribution π (if one exists) is unique. Also, note that a probability distribution
π satisfying (3.49) exists if and only if Z < ∞. One can also easily show that, for B&D processes,
any π that satisfies global balance, i.e. πQ = 0, is of the form in (3.50) and satisfies (3.49).
The expression in (3.50) allows us to compute the limiting probabilities given a specific B&D
process of interest.
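The product formula (3.50) is also easy to evaluate numerically. The following hedged Python sketch computes a truncation of π for an M/M/s queue; the truncation level and parameter values are arbitrary illustration choices, and the truncation is only meaningful when the tail mass beyond it is negligible.

import numpy as np

def bd_stationary(lam_fn, mu_fn, n_max):
    """Approximate the stationary distribution of a B&D process via (3.50),
    truncating the state space at n_max."""
    w = np.ones(n_max + 1)
    for j in range(1, n_max + 1):
        w[j] = w[j - 1] * lam_fn(j - 1) / mu_fn(j)   # prod_{i=1}^j lambda_{i-1}/mu_i
    return w / w.sum()

# M/M/s queue: lambda_n = lam, mu_n = min(s, n) * mu
lam, mu, s = 2.0, 1.0, 3
pi = bd_stationary(lambda n: lam, lambda n: min(s, n) * mu, n_max=200)
print(pi[:5], (np.arange(201) * pi).sum())   # first probabilities and mean queue length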
Consider the M/M/1 queue of Example 3.7.2, where customers come at rate λ and there is one server serving them at rate µ, and denote by X(t) the number of customers in the system. In this case λ_{i−1}/µ_i = λ/µ for all i ≥ 1, and, assuming λ < µ,
πj = (λ/µ)^j / ∑_{k=0}^{∞} (λ/µ)^k = (1 − λ/µ)(λ/µ)^j = (1 − p)^j p,   j ≥ 0,
where p := 1 − λ/µ.
Consider an M/M/1 queue with customers arriving on average one every 10 minutes, and
service time taking on average 8 minutes. How many customers do you expect to be in the
system on average in the long run? How would this number change if the arrival rate increases
by 20%?
Solution. In this example we have λ = 1/10 and µ = 1/8, assuming time to be measured in minutes. Thus, the limiting distribution of the number of customers X(t) in the system is Geom(1 − λ/µ) and, for large t, we have
E[X(t)] ≈ λ/(µ − λ) = (1/10)/(1/8 − 1/10) = 4,
meaning that in the long run we have on average four customers in the system. If we increase the arrival rate λ by 20% to λ = 12/100, we obtain an average number of customers in the system equal to
E[X(t)] ≈ λ/(µ − λ) = (12/100)/(1/8 − 12/100) = 24.
Thus, increasing the arrival rate by 20% results in a six-fold increase in the average queue length. More generally, when µ − λ is small, even a small change in λ or µ can lead to a large change in the behaviour of the system.
For general B&D processes where, unlike for M/M/1 queues, λ_{n−1}/µ_n is not constant, π will not be a geometric distribution and usually will not be of a simple parametric form (e.g. a known distribution). Nonetheless, whether a stationary distribution π exists or not will still depend on the parameters in an intuitive way, see Exercises 50 and 51.
• We say that (N(t))_{t∈[0,∞)} has stationary increments if, for every t, s > 0, the distribution of the random variable N(t + s) − N(t) depends only on s and not on t.
Figure 3.2: Example of a trajectory of a counting process, with t on the x-axis and N (t) on the
y-axis. In this figure, N (t) reports real data counts of the number of earthquakes in Japan, starting
from January 1st 1885. Each ”tick” on the x-axis represents an event (i.e. earthquake).
• We say that (N (t))t∈[0,∞) has independent increments if, for every n ≥ 1 and t1 < t2 < t3 <
· · · < tn the random variables
N (t1 ), N (t2 ) − N (t1 ), N (t3 ) − N (t2 ), . . . , N (tn ) − N (tn−1 )
are independent. This means that the number of events in disjoint intervals is independent.
Note that the independent increment assumption implies Markovianity of (N (t))t∈[0,∞) since in
that case
Pr(N (t + s) = j | N (s) = i, N (u) = n(u) for 0 ≤ u ≤ s)
= Pr(N (t + s) − N (s) = j − i | N (s) = i, N (u) = n(u) for 0 ≤ u ≤ s)
= Pr(N (t + s) − N (s) = j − i)
= Pr(N (t + s) − N (s) = j − i | N (s) = i)
= Pr(N (t + s) = j | N (s) = i).
Interarrival times: a counting process (N(t))_{t∈[0,∞)} can be described by looking at its interarrival times T0, T1, T2, . . . , i.e. the random times between successive events. More precisely, the interarrival times T0, T1, T2, . . . are related to (N(t))_{t∈[0,∞)} through the following equivalence
N(t) > n ⇐⇒ T0 + · · · + Tn ≤ t,   t ∈ [0, ∞) and n = 0, 1, 2, . . . .   (3.51)
Equivalent definitions
The Poisson Process (PP) is the canonical model for counting processes with stationary and
independent increments.8 We can sum up the above properties of the PP with the following
theorem. We write (N (t))t∈[0,∞) ∼ PP(λ) to indicate that the counting process (N (t))t∈[0,∞) is
distributed as a Poisson process with rate λ.
8 It turns out that, under some basic regularity assumptions, it is the only counting process with stationary and
independent increments!
Theorem 9. A counting process N = (N(t))_{t∈[0,∞)} satisfies N ∼ PP(λ) if and only if one of the following equivalent conditions holds:
(a) the process has independent increments and
Pr(N(t + h) − N(t) = 1) = λh + o(h)
Pr(N(t + h) − N(t) ≥ 2) = o(h),   t > 0, as h ↓ 0.
Roughly speaking, the above Theorem can be interpreted as a special case of the equivalent
representations of CTMCs discussed above, applied to the special case of a B&D process with
λi = λ for all i ≥ 0 and µi = 0 for all i ≥ 1.
Remarks
• Definition (a) is self-consistent because, for example, for every t1 < t2 < t3, we have N(t3) − N(t1) ∼ Poiss(λ(t3 − t1)), but also N(t3) − N(t1) = (N(t3) − N(t2)) + (N(t2) − N(t1)), which is the sum of two independent Poiss(λ(t3 − t2)) and Poiss(λ(t2 − t1)) random variables and hence again Poiss(λ(t3 − t1)). The two distributions are the same; if this were not the case (e.g. if one replaced (3.52) with N(t + s) − N(s) ∼ Poiss(λt²)), the definition would not be consistent and no such process would exist.
• (3.52) implies that the PP has stationary increments, because the distribution of N (t +
s) − N (s) does not depend on s.
• Definition (a) may seem quite arbitrary (e.g. why Poisson?) and not intuitive from a
modelling point of view. Definition (b) is instead more natural/explicit from a modelling
point of view: it implies that the PP is the only counting process with events happening
independently and at a constant rate.
• Intuitively, the fact that Ti ∼ Exp(λ) for PPs is related to the fact that the PP is the counting process with no memory and events happening at constant rate λ (i.e. at every instant of size h, there is a probability λh of an event occurring, independently of the rest), which are the same assumptions underlying an Exp(λ) distribution for a time-to-event random variable.
• Definition (c) is the only 'constructive' one: it directly implies that such a process exists and suggests a simple way to simulate it, as sketched below.
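A hedged Python sketch of this constructive definition: simulate i.i.d. Exp(λ) interarrival times and obtain N(t) by counting how many cumulative sums fall below t (the parameter values and the over-allocation of interarrival draws are illustrative choices).

import numpy as np

rng = np.random.default_rng(3)
lam, t_max = 1.5, 100.0
# i.i.d. Exp(lam) interarrival times; event times are their cumulative sums
interarrivals = rng.exponential(1.0 / lam, size=int(3 * lam * t_max))
event_times = np.cumsum(interarrivals)
event_times = event_times[event_times <= t_max]
N = lambda t: np.searchsorted(event_times, t, side="right")   # N(t) = # events in (0, t]
print(N(t_max), lam * t_max)   # comparable values, since E[N(t)] = lam * t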
Theorem 10. A counting process N = (N (t))t∈[0,∞) is a NHPP with intensity function λ(t) if
and only if one of the following equivalent conditions hold:
Intuition: condition (b) means that "events happen at rate λ(t)", i.e. at any time t the probability of an event happening in the next instant of (infinitesimal) size h is λ(t)h + o(h).
Remarks
- Similarly to the homogeneous PP, the NHPP is characterized by having independent Poisson increments. In words, condition (a) means that the number of events in (t, t + s] is Poisson distributed with mean ∫_t^{t+s} λ(y)dy.
- Unlike the PP, the NHPP does not have stationary increments (or constant rate λ).
This makes the NHPP a much more flexible model than the PP.
- Both the PP and the NHPP crucially rely on the assumption of independent increments.
Even more, one could say that the assumption of independent increments is basically the
only substantial assumption underlying NHPPs, meaning that every (regular enougha )
counting process with independent increments is a NHPP for some intensity function
λ(t).
ᵃIn particular, one needs to prevent two or more events from happening at the very same time.
Proposition 16 (Thinning of PPs). Let N ∼ P P (λ). Suppose that each point in N is either of
type 1 or 2, independently of the others, with probability p and 1−p, respectively. Let N1 and N2 be
the resulting point processes including only points of type 1 and 2, respectively. Then N1 ∼ P P (λ1 )
and N2 ∼ P P (λ2 ), with11 λ1 = pλ and λ2 = (1 − p)λ. Moreover, N1 and N2 are independent of
each other.
Proposition 16 can be verified combining Definition 13 with the following lemma (also appearing
in the exercise list).
Lemma 8. If N ∼ P oisson(λ) and N1 | N ∼ Binomial(N, q), then N1 ∼ P oisson(λq). Also, N1
and (N − N1 ) are independent of each other.
11 Here pλ stands for the measure defined as (pλ)(A) = pλ(A) for every A ⊆ D
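Proposition 16 and Lemma 8 can also be checked by simulation. Below is a hedged Python sketch: the number of events of a PP(λ) on [0, t] is drawn, each event is marked as type 1 with probability p, and the resulting counts are compared with the Poisson(pλt) and Poisson((1 − p)λt) means; the values of λ, p and t are arbitrary.

import numpy as np

rng = np.random.default_rng(4)
lam, p, t, n_sim = 2.0, 0.3, 5.0, 100_000

N = rng.poisson(lam * t, size=n_sim)        # N(t) ~ Poisson(lam * t)
N1 = rng.binomial(N, p)                     # type-1 events: Binomial(N, p) given N
N2 = N - N1
print(N1.mean(), p * lam * t)               # mean of a Poisson(p * lam * t)
print(N2.mean(), (1 - p) * lam * t)
print(np.corrcoef(N1, N2)[0, 1])            # approximately 0: N1 and N2 independent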
Chapter 4
This chapter discusses some important classes of stochastic processes with continuous-time and
continuous-space, both in terms of underlying modeling assumptions as well as the mathematical
tools used to analyze them.
Throughout the chapter, we will sometimes refer to the notion of finite dimensional distributions
of a stochastic process.
Definition 14. The finite dimensional distributions of a stochastic process X = (Xt )t∈T with
index set T are the distributions of the random vector Xt1:k = (Xt1 , ..., Xtk ), for every k ∈ N and
(t1 , ..., tk ) ∈ T k .
Under regularity assumptions (such as continuous trajectories or even almost surely cadlag
trajectories), finite dimensional distributions uniquely characterize the distribution of a stochastic
process. In other words, if two stochastic processes have almost surely continuous trajectories (i.e.
Pr(t 7→ Xt is continuous for all t ∈ [0, ∞)) = 1) and have the same finite dimensional distributions,
then they are equal in distribution, meaning that the induced probability distributions on the space
of trajectories are the same. In this sense, finite dimensional distributions provide a convenient
way to describe the distribution of (sufficiently regular) stochastic processes.
where ⌈t/δt⌉ denotes the smallest integer larger than or equal to t/δt. This is a process which makes jumps of size δx, either up or down with probability 1/2, at a fixed grid of times given by t = δt, 2δt, 3δt, . . . . Note that we can express Yn as
Yn = Z1 + · · · + Zn
with Pr(Zi = 1) = Pr(Zi = −1) = 1/2 for all i and Z1 , Z2 , . . . independent of each other. Thus
E[Yn ] = nE[Z1 ] = 0, V ar(Yn ) = nV ar(Z1 ) = n for all n ∈ N and for every t ∈ [0, ∞) we have
Thus, taking δx = O(√δt) we obtain a non-trivial limit (i.e. neither 0 nor ∞) for Var(Xt) as δt ↓ 0. Specifically, taking δx = σ√δt for some σ > 0, we obtain Var(Xt) → σ²t as δt ↓ 0. Moreover, since the r.v.s Z1, Z2, . . . are independent and identically distributed (i.i.d.), the Central Limit Theorem implies the following convergence in distribution:
Xt →d N(0, σ²t)   as δt ↓ 0, for every t > 0.
We can think of Brownian Motion as the limiting distribution, over the space of trajectories, obtained by taking δt ↓ 0. Specifically, fix T > 0, set δx = σ√δt and define the uncountable collection of random variables (Xt)_{t∈[0,T]} through (4.6). Then we have the following convergence in distribution
(Xt)_{t∈[0,T]} →d (Bt)_{t∈[0,T]}   as δt ↓ 0,   (4.5)
where (Bt)_{t∈[0,T]} is a BM with variance σ². Formalizing the above convergence in a rigorous way, both in terms of the mode of convergence and of the existence of a well-defined limit, is beyond the scope of these notes. For the sake of our treatment, the reader can think of it as a convergence of finite dimensional distributions. Figure 4.1 illustrates the convergence for T = 10 and decreasing values of δt. Notice that this is a convergence of a sequence of distributions over trajectories to a limiting distribution over trajectories, not an almost sure convergence of a sequence of trajectories t ↦ Xt to a limiting trajectory t ↦ Bt.¹
The specific distribution of Zi in the above construction is not critical. In fact, one obtains the same distribution for (Bt)_{t∈[0,T]} by taking the limit of (Xt)_{t∈[0,T]} as δt ↓ 0, for Xt defined as
Xt = σ√δt ∑_{j=1}^{⌈t/δt⌉} Zj   (4.6)
for any i.i.d. sequence of r.v.s (Zj)_{j∈N} with zero mean and unit variance.
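A hedged Python sketch of the construction in (4.6): simulate the rescaled random walk on a fine time grid and compare the empirical variance of Xt with σ²t (σ, T, δt and the number of paths are arbitrary illustration values; any zero-mean unit-variance Zj would do, here ±1 coin flips).

import numpy as np

rng = np.random.default_rng(5)
sigma, T, dt, n_paths = 1.0, 10.0, 0.001, 2000
n_steps = int(np.ceil(T / dt))

Z = rng.choice([-1.0, 1.0], size=(n_paths, n_steps))        # Z_j = +-1 with probability 1/2
X = sigma * np.sqrt(dt) * np.cumsum(Z, axis=1)              # X_t on the grid t = dt, 2dt, ...
t_idx = n_steps // 2                                        # look at t = T/2
print(X[:, t_idx].var(), sigma**2 * (t_idx + 1) * dt)       # empirical vs theoretical sigma^2 * t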
4.1.2 Definition
The above construction suggests the following definition.
Definition 15. A continuous-time stochastic process X = (Xt )t∈[0,∞) on X = R is a Brownian
motion (BM) with variance σ 2 , which we write as
X ∼ BM (σ) ,
if
(i) Pr(X0 = 0) = 1
(ii) Pr(t 7→ Xt is continuous for all t ∈ [0, ∞)) = 1
(iii) the process has stationary and independent increments
(iv) Xt ∼ N (0, σ 2 t) for every t ∈ [0, ∞).
Remarks
• When σ = 1 we refer to X as a standard BM
• (iii) and (iv) imply that Xt − Xs =d X_{t−s} − X0 ∼ N(0, σ²(t − s)) for 0 ≤ s ≤ t, which is condition (4.7). Indeed, an equivalent definition of BM(σ) would be to replace (iv) with (4.7) and remove the explicit stationary increment requirement. Note that, similarly to the PP definition in the previous chapter, the above condition is self-consistent with respect to summation of increments.
¹Indeed, if one sets δx = σ√δt and takes the limit δt ↓ 0 in (4.6), then lim_{δt↓0} Xt does not exist almost surely, even if the distribution of Xt converges to a N(0, σ²t). If needed, one could define a limiting construction of BM which enjoys almost sure convergence of trajectories, but that is slightly more involved and better expressed in terms of Brownian bridges, see later on.
• BM is Markovian due to the independent increments requirement (see also similar remark
when discussing the PP).
where (X_t^{(i)})_{t∈[0,∞)} for i = 1, . . . , d are d independent copies of the one-dimensional BM. Figure 4.2 provides an illustration of a trajectory for the one- and two-dimensional BM.
Figure 4.2: A trajectory of a one-dimensional Brownian motion (left; displaying the function
t 7→ Xt ) and a two-dimensional one (right; displaying only the states visited by (X1 (t), X2 (t)) for
t ∈ [0, 1]).
It is not a priori obvious that BM exists, i.e. that it is possible to define a probability distribution over the space of functions t ↦ Xt from [0, ∞) to R such that (i)–(iv) in Definition 15 are simultaneously satisfied. This was shown by Wiener in the 1920s. The most subtle part is proving that almost sure continuity of trajectories can be satisfied.²
Remark
The BM is sometimes called 'Brownian motion', after Robert Brown, who observed and described the BM-like motion of particles suspended in a liquid, and sometimes called 'Wiener process', after Wiener, who proved the theorem above one century later.
²We do not prove the above theorem; as with various other statements in this chapter, doing so is out of scope for the current course and would require too much time and space.
Figure 4.3: A visual illustration of the fundamental fact that BM travels O(√t) space in O(t) time. A BM(1) trajectory is drawn in black, 30 i.i.d. BM(1) trajectories are drawn in gray; the functions t ↦ 2√t and t ↦ −2√t (which represent fixed quantiles of the distribution of Xt for every t > 0) are drawn in red.
4.1.3 Properties of BM
Invariances
The following class of properties are invariance or symmetry properties of the distribution over
trajectories induced by BM.
(ii) (Time-reversal) Let Yt = X_{1−t} − X1 for t ∈ [0, 1]. Then (Yt)_{t∈[0,1]} =d (Xt)_{t∈[0,1]}.
(iii) (Inversion) Let Y0 = 0 and Yt := t X_{1/t} for t > 0. Then (Yt)_{t∈[0,∞)} ∼ BM(σ).
Exercise: verify Proposition 17 using Definition 15 and properties of the univariate normal
distribution.
Trajectories
The following properties pertain to individual trajectories of BM, meaning that they are satisfied by a BM trajectory with probability 1.
(iii) Recurrence: trajectories a.s. cross 0 infinitely often (i.o.), meaning that the one-dimensional BM is a recurrent stochastic process.
(iv) Recurrence near zero: trajectories visit zero i.o. also in a neighborhood of t = 0: for any ε > 0
The continuity property is imposed by definition. The other properties are harder to prove and their proofs are beyond the scope of these notes. Some heuristic arguments for them go as follows:
(ii) non-differentiability: if t ↦ Xt were differentiable at t, then we should have lim_{h↓0} Δh⁺ = lim_{h↓0} Δh⁻, where Δh⁺ := h⁻¹(X_{t+h} − Xt) and Δh⁻ := h⁻¹(Xt − X_{t−h}). However, for any h > 0, by Definition 15 (iii) and (iv), Δh⁺ and Δh⁻ are independent random variables with N(0, h⁻¹σ²) distribution, which implies that (Δh⁺ − Δh⁻) ∼ N(0, 2h⁻¹σ²) and thus Pr(lim_{h↓0}(Δh⁺ − Δh⁻) = 0) = 0.
(iii) given the limiting construction in Section 4.1.1, recurrence can be deduced from the recurrence
of Random Walks on Z and the continuity of trajectories. This property is specific to the one-
dimensional BM and does not extend to larger d.
(iv) Given (iii), recurrence near zero can be deduced from the invariance with respect to inversion
in Proposition 17(iii).
We now consider two ways of generalizing the Brownian motion: Gaussian processes and Dif-
fusion processes. The Brownian motion sits at the intersection of the two.
for an appropriate k × k matrix M . Then conclude using the closure of MVG under linear trans-
formations.
Define the mean and covariance functions of X = (Xt)_{t∈T} as m(t) := E[Xt] and C(t, s) := Cov(Xt, Xs) for t, s ∈ T.
The mean and covariance functions, m and C, uniquely determine the distribution of a GP X = (Xt)_{t∈T} because (a) MVG are uniquely determined by their first two moments³, so m and C uniquely determine the finite dimensional distributions of a GP; in particular, if X is a GP, for any k ∈ N and t1:k ∈ T^k, we have
X_{t1:k} ∼ N(µ, Σ)   for µ = (m(t1), . . . , m(tk))ᵀ and Σ = (C(ti, tj))_{i,j=1}^{k};   (4.12)
and (b) under regularity assumptions, finite dimensional distributions uniquely characterize the distribution of a stochastic process.
³In particular, multivariate dependence in MVG is uniquely determined by pair-wise interactions.
We write X ∼ GP(m, C) to indicate that X = (Xt)_{t∈T} is a Gaussian Process with mean function m and covariance function C, i.e. that (4.12) holds for all k ∈ N and t1:k ∈ T^k. The mean and covariance functions provide a concise and convenient way to describe the distribution of GPs.
Proposition 19. Let Y = (Yt)_{t∈[0,1]} be the process obtained by conditioning X ∼ BM(σ) on X1 = 0, i.e. Yt := (Xt | X1 = 0). Then Y ∼ GP(m, C) with m(t) = 0 for all t ∈ [0, 1] and C(t, t′) = σ² t(1 − t′) for 0 ≤ t ≤ t′ ≤ 1.
Proof. All the finite dimensional distributions of Y are MVG by closure of the MVG family under conditioning: for all k ∈ N and t1:k ∈ T^k we have that L(Xt1, . . . , Xtk, X1) is a MVG (with some mean and covariance) and thus L(Yt1, . . . , Ytk) = L(Xt1, . . . , Xtk | X1 = 0) is also a MVG (with some mean and covariance). Thus, Y is a GP by Definition 17.
To compute the mean and covariance functions of Y we use properties of Gaussian distributions. First note that
E[Xt | Xt′ = x] = (t/t′) x,   0 ≤ t ≤ t′ ≤ 1,
Var(Xt | X1 = 0) = σ² t(1 − t),   t ∈ [0, 1],
where the equation for the variance can be derived exploiting that if X ∼ N(µA, τA⁻¹) and Y | X ∼ N(X, τB⁻¹), then Prec(X | Y = y) = τA + τB.
The process obtained by conditioning a Brownian motion on its endpoint is called Brownian
Bridge, and it can be defined as follows.
Definition 18. A Brownian Bridge (BB) with variance σ² ending at Ys = y is a GP with T = [0, s] and
m(t) = (t/s) y,   t ∈ [0, s],
C(t, t′) = σ² t(s − t′)/s,   0 ≤ t ≤ t′ ≤ s.
What is the role of the two parameters, m and C? The mean function m plays the role of an
additive deterministic function. The proof of the next lemma is left as an exercise.
Lemma 9. If X ∼ GP(0, C) then Yt := Xt + m(t), t ∈ T, defines a process Y = (Yt)_{t∈T} ∼ GP(m, C).
C : T × T → R,   (t, s) ↦ C(t, s) = Cov(Xt, Xs),
which models the dependence between the Xt's across different locations t ∈ T. Which conditions should C satisfy to be a valid covariance function (i.e. to admit a corresponding GP)? For all k ∈ N and t1:k ∈ T^k the k × k matrix
Σ = (C(ti, tj))_{i,j=1}^{k}   (4.13)
needs to be a symmetric positive definite matrix. Functions C : T² → R satisfying (4.13) for every k ∈ N and t1:k ∈ T^k are called positive definite functions. This is indeed the only condition C should satisfy, as stated in the next theorem.
Figure 4.5: Top row: three different functions k from R to [0, ∞) with different regularity at 0.
Bottom row: trajectories of the GPs on T = R with zero mean and covariance C(t, s) = k(t − s).
Theorem 12. Every symmetric positive definite function C : T × T → R induces a valid GP (0, C)
on T .
Symmetric positive definite (s.p.d.) functions are often called 'kernels' in the GP literature⁴. Examples of commonly used GP kernels include:
• Exponential kernel:
C(t, s) = a exp(−b|t − s|),   t, s ∈ R,
with a, b > 0. This is a s.p.d. function, where a is the "amplitude parameter" (y-scale), i.e. Yt = √a Xt, and b is the "time scale parameter", i.e. Yt = X_{bt}. The corresponding GP exhibits continuous but not differentiable trajectories.
The smoothness of C governs the smoothness of the trajectories of X. For example, if C(t, s) = k(t − s) for some (symmetric) function k from R to [0, ∞), then continuity and differentiability of k at 0 are related to the continuity and differentiability of the trajectories of X. See Figure 4.5 for an illustration and Section A.3 for more details if interested.
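A hedged Python sketch of how GP trajectories like those in Figure 4.5 can be drawn on a finite grid: build the covariance matrix from the exponential kernel above and apply a Cholesky factorization. The grid, the values of a and b, and the jitter term added for numerical stability are arbitrary illustration choices.

import numpy as np

rng = np.random.default_rng(6)
a, b = 1.0, 2.0
t = np.linspace(0.0, 1.0, 400)

# Covariance matrix Sigma_ij = C(t_i, t_j) = a * exp(-b |t_i - t_j|)
Sigma = a * np.exp(-b * np.abs(t[:, None] - t[None, :]))
L = np.linalg.cholesky(Sigma + 1e-10 * np.eye(len(t)))     # small jitter on the diagonal
samples = L @ rng.standard_normal((len(t), 3))             # three GP(0, C) trajectories on the grid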
Similarly to the PP case, an infinitesimal definition such as Definition 19 is arguably more intuitive and natural from the modeling point of view, as opposed to a definition, like Definition 15, that imposes a priori a distributional form on the increments (in this case the Gaussian one).
Remarks
It is remarkable that the conditions in Definition 19 are enough to imply the Gaussian increment property. Similarly to the PP, this can be deduced by showing that (i)–(iii) imply that P^t, the transition kernel defined in Section 4.3.3 below, satisfies a certain differential equation, in this case
(∂/∂t) P^t(x, y) = (σ²/2) (∂²/∂y²) P^t(x, y).   (4.14)
Remarks
While for our purposes it is enough to interpret (4.15) simply as a compact way to say that (i)-(iii) hold, the notation in (4.15) admits a more 'differential equation'-like interpretation. We very briefly mention those here:
• Derivative interpretation. The drift and volatility coefficients µ and σ admit the following derivative interpretation:
lim_{h↓0} E[(X_{t+h} − Xt)/h | Xt = x] = µ(x, t),   (4.16)
lim_{h↓0} E[(X_{t+h} − Xt)²/h | Xt = x] = σ²(x, t).   (4.17)
Equations (4.16) and (4.17) follow directly from (i) and (ii) in Definition 20. They can be compared with the ordinary differential equation case, where the function f in an equation of the form
dXt/dt = f(Xt, t),   or equivalently   dXt = f(Xt, t)dt,
can be interpreted as the derivative of the function t ↦ Xt, i.e. it satisfies
lim_{h↓0} (X_{t+h} − Xt)/h = f(Xt, t).
• Integral representation. Roughly speaking, the reason for the notation dBt in (4.15) is that diffusion processes can be obtained by integrating µ(x, t) against dt and σ(x, t) against dBt, i.e. against Brownian motion increments. In particular, one can think of a diffusion process X = (Xt)_{t∈[0,∞)} with drift µ and volatility σ as the limit, as h ↓ 0, of a process defined by the recursion
X_{t+h} = Xt + µ(Xt, t)h + σ(Xt, t)(B_{t+h} − Bt),   t ∈ {0, h, 2h, . . . }.
The above defines the values of X on the grid t ∈ {0, h, 2h, 3h, . . . }, which can be extended through, e.g., linear interpolation to all t ∈ [0, ∞). The above construction justifies the interpretation of dBt in the SDE as the increment B_{t+h} − Bt for h ↓ 0. Intuitively, it also means that Xt can be thought of as the solution to the following integral equation
Xt = X0 + ∫_0^t µ(Xs, s)ds + ∫_0^t σ(Xs, s)dBs.   (4.19)
However, defining formally the second integral in (4.19) requires some care and it is
beyond the scope of this coursea .
ᵃThe main issue there is that the so-called absolute variation of BM trajectories is a.s. infinite, i.e. Pr(lim_{n→∞} ∑_{i=0}^{n−1} |B_{(i+1)t/n} − B_{it/n}| = ∞) = 1 for any t > 0, and as a consequence Riemann sum approximations to integrals of the form ∫_0^t σ(Xs, s)dBs do not have a unique limit (i.e. different types of Riemann sums can lead to different limits and thus different values of the integral). Interested students can find more details (including the notion of Itô integrals) in chapters about 'stochastic calculus' in more advanced stochastic processes books, see e.g. Stirzaker [2005] for a light introduction or Pavliotis [2014] for a more formal, though still accessible, treatment.
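A hedged Python sketch of the recursion above (commonly known as the Euler–Maruyama scheme): it simulates one trajectory of dXt = µ(Xt, t)dt + σ(Xt, t)dBt on a grid of step h. The specific drift and volatility functions, step size and horizon are arbitrary examples, not part of the notes.

import numpy as np

def euler_maruyama(mu, sigma, x0, T, h, rng):
    """Simulate a diffusion on the grid t = 0, h, 2h, ..., T via
    X_{t+h} = X_t + mu(X_t, t) h + sigma(X_t, t) (B_{t+h} - B_t)."""
    n = int(np.ceil(T / h))
    x = np.empty(n + 1); x[0] = x0
    for k in range(n):
        t = k * h
        dB = rng.normal(0.0, np.sqrt(h))    # Brownian increment ~ N(0, h)
        x[k + 1] = x[k] + mu(x[k], t) * h + sigma(x[k], t) * dB
    return x

rng = np.random.default_rng(7)
path = euler_maruyama(mu=lambda x, t: -x, sigma=lambda x, t: 1.0,
                      x0=0.0, T=10.0, h=0.01, rng=rng)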
We now provide a simple example of diffusion process, which admits a closed-form solution.
Let µ ∈ R, σ ∈ R+, and define
Xt = µt + σBt,   t ≥ 0,
where B is a standard BM. Then X is a diffusion process with
µ(x, t) = µ,   σ²(x, t) = σ²,   for all x ∈ R, t ≥ 0.
where in the last equality we used the property of independent increments of BM. Thus
as h → 0, as desired.
The above example also implies that X ∼ BM(σ) itself is a diffusion process with zero drift and constant volatility, i.e. µ(x, t) = 0 and σ²(x, t) = σ² for all t > 0 and x ∈ R.
Remarks
• If t ↦ Xt were deterministic then (4.23) and (4.24) could not be simultaneously satisfied. For SDEs they do hold jointly, in expectation. Intuitively, equation (4.23) relates to a deterministic part while equation (4.24) to a stochastic one.
• In the Brownian Motion with constant drift example the two parts of the SDE ‘evolve
independently’. In particular, the process solving the equation dXt = µdt + σdBt
coincides with the sum of the one solving dXt = µdt, which is Xt = µt, and the one
solving dXt = σdBt, which is Xt = σBt. In general this is not the case, since the two components of the SDE (namely µ(Xt, t)dt and σ(Xt, t)dBt) do interact, because the values of the drift µ(Xt, t) and volatility σ²(Xt, t) depend on the current state of Xt.
P^t(x, y) = (1/√(2πtσ²)) exp(−(x − y)²/(2tσ²)),   t > 0, x, y ∈ R.   (4.27)
The transition kernel P^t = (P^t(x, y))_{x,y∈R} can also be seen as an operator that acts on functions: for every f : R → R define the function P^t f : R → R as
P^t f(x) = ∫ f(y) P^t(x, y) dy = E[f(Xt) | X0 = x].   (4.28)
or, equivalently,
P t P s f = P t+s f . (4.30)
for every f . We refer to (4.29) and (4.30) by saying that (P t )t≥0 is a semigroup. Equalities (4.29)
and (4.30) are the continuous-time and continuous-space version of the CK equations seen in the
discrete context. The equality in (4.30) can be derived as follows. The one in (4.29) can either be
derived using (4.30) or directly using the law of total probability (see Exercise 55).
Derivation of (4.30).
4.3.4 Generators
Similarly to CTMCs, also in the case of continuous space we can characterize the semigroup (P^t)_{t≥0} through its generator. Intuitively, the generator represents the derivative of P^t at t = 0, i.e.
"Q = (d/dt) P^t |_{t=0}".   (4.31)
Remark
In the CTMC case (i.e. when |X| < ∞), the generator was a matrix Q, which can be viewed as an operator that sends vectors into vectors, i.e. R^{|X|} into R^{|X|}. In this sense, defining Q as an operator that sends functions f : R → R into functions Qf : R → R is a generalization of Q-matrices to the case of general state spaces.
The following lemma characterizes the generators of diffusion processes.⁵
Lemma 10. Let X = (Xt )t∈[0,∞) be a diffusion process solving the SDE
dXt = µ(Xt )dt + σ(Xt )dBt . (4.33)
Then, for smooth enough f,
Qf(x) = µ(x) (∂f/∂x)(x) + (1/2) σ²(x) (∂²f/∂x²)(x).   (4.34)
Sketch of proof.⁶ We have
Qf(x) = lim_{h↓0} [E[f(Xh) | X0 = x] − f(x)]/h   (definition of Q)
      = lim_{h↓0} h⁻¹ E[f(Xh) − f(X0) | X0 = x]
      = lim_{h↓0} h⁻¹ E[f′(x)Δh + (1/2) f″(x)Δh² + o(Δh²)]   with Δh := (Xh − X0 | X0 = x)
      = f′(x) lim_{h↓0} h⁻¹ E[Δh] + (1/2) f″(x) lim_{h↓0} h⁻¹ E[Δh²]   (assuming E[o(Δh²)] = o(h))
      = µ(x) f′(x) + (σ²(x)/2) f″(x)   (by (i) and (ii) in the definition of diffusions).
The statement in (4.34) is often expressed by saying that the generator of the diffusion process in (4.33) is the differential operator
Q = µ ∂/∂x + (σ²/2) ∂²/∂x²,
which is just a more compact notation for (4.34). For example, the generator of the Brownian Motion is
Q = (σ²/2) d²/dx²,
which generalizes to the so-called Laplacian operator in R^d: Q = (1/2) ∑_{i=1}^{d} ∂²/∂xᵢ². The latter is an important operator that plays a role in many areas of pure and applied mathematics.
⁵For simplicity, in this section we restrict our attention to the case of µ and σ depending only on x and not on t. This allows us to stay within the class of time-homogeneous processes. Note that, nonetheless, the key equations we derive (i.e. (4.34), (4.35) and (4.37)) still hold in the case of time-dependent drift and variance, simply replacing µ(x) and σ(x) with µ(x, t) and σ(x, t).
⁶Here and throughout this section, we do not formally justify all steps in our derivations (e.g. exchanges of limits and integrals, ignoring smaller order terms in Taylor expansions, etc.). A fully rigorous treatment of SDEs is well beyond the scope of this section.
Remark
The first derivative term in the generator, µ ∂/∂x, relates to the deterministic part of X (i.e. its drift, which has size O(h)); while the second derivative term, (σ²/2) ∂²/∂x², relates to the stochastic part of X (i.e. its volatility/variance, which has zero mean and size O(√h)). See also Exercise 56.
= −(∂/∂y)[µ(y)h(y, t)] + (1/2)(∂²/∂y²)[σ²(y)h(y, t)].
Thus, given X0 ∼ f0 for some pdf f0, (h(y, t))_{y∈R, t∈[0,∞)} solves the following equations:
h(y, 0) = f0(y),   y ∈ R,   (4.37)
(∂/∂t) h(y, t) = −(∂/∂y)[µ(y)h(y, t)] + (1/2)(∂²/∂y²)[σ²(y)h(y, t)],   y ∈ R, t ∈ [0, ∞).
h(y, 0) = π(y),   y ∈ R,
results in (∂/∂t) h(y, t) = 0 for all y ∈ R and t ≥ 0. By (4.37), this is equivalent to requiring π to satisfy
0 = −(∂/∂y)[µ(y)π(y)] + (1/2)(∂²/∂y²)[σ²(y)π(y)],   y ∈ R,
or, in compact notation,
(µπ)′ = (1/2)(σ²π)″.
Figure 4.6: Illustration of two solutions to (4.37) for diffusion processes with different coefficients
and starting distributions. Left: analytically computed solution to (4.37) (with unimodal starting
distribution for X0 , and µ and σ as in Example 4.3.2). Right: example with a bimodal starting
pdf for X0 , where the solution is approximated by simulating many independent trajectories of
the diffusion process and plotting the evolution of the empirical distribution of the trajectories at
different times.
Remark
The process in (4.39) is called the (overdamped) Langevin diffusion, which is the canonical π-invariant (actually π-reversible) stochastic process for an arbitrary smooth distribution π on R^d. The Langevin diffusion is often used in many areas (such as Physics, Statistics, Computer Science, Machine Learning, etc.) both to model stochastic phenomena and to define computational algorithms to sample from continuous multivariate distributions. Intuitively, just like the gradient flow dXt = (σ²/2)∇ log π(Xt)dt (and its discretizations) defines the canonical flow used to compute the mode of a multivariate function log π, we can think of X defined in (4.39) as a 'gradient flow plus noise', which defines a canonical stochastic process used to draw samples from π. An important feature of (4.39) is that the amount of noise and gradient push are balanced (i.e. the drift scales with the noise in a specific way, namely σ²/2), so that the process preserves π as its invariant distribution (e.g. it spends a fraction of time near the mode vs in the tails as specified by π, etc.).
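A hedged Python sketch of the use of the Langevin diffusion for sampling, assuming (as is standard) that (4.39) reads dXt = (σ²/2)∇ log π(Xt)dt + σ dBt: the SDE is discretized with the Euler–Maruyama scheme (the discretization introduces a small bias), and the target π, step size and run length below are arbitrary illustration choices.

import numpy as np

rng = np.random.default_rng(8)
grad_log_pi = lambda x: -x           # target pi = N(0, 1), so grad log pi(x) = -x
sigma2, h, n_steps = 1.0, 0.01, 200_000

x, samples = 0.0, np.empty(n_steps)
for k in range(n_steps):
    # Euler-Maruyama step for dX_t = (sigma^2/2) grad log pi(X_t) dt + sigma dB_t
    x += 0.5 * sigma2 * grad_log_pi(x) * h + np.sqrt(sigma2 * h) * rng.normal()
    samples[k] = x
print(samples.mean(), samples.var())   # approximately 0 and 1 under pi = N(0, 1)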
Appendix
The above equation expresses Pr(Xt = j, S0:t = s0:t) in terms of quantities that we know. However, computing Pr(Xt = j, S0:t = s0:t) through the above expression requires summing over all possible trajectories (x0, . . . , xt−1) ∈ X^t, whose number is |X|^t. This number grows exponentially with t, and computing |X|^t quantities becomes computationally infeasible as soon as t is moderately large, e.g. t ≥ 20. There is a smarter way to compute probabilities of interest in HMMs by exploiting the sequential nature of the model. We first show how to compute the numerator of (A.1) efficiently
using a recursive algorithm, known as the forward algorithm.
Define the un-normalized filtering distributions
Ft(j) := Pr(Xt = j, S0:t = s0:t),   t ≥ 0, j ∈ X,
where (s0, s1, s2, . . .) are the observed signals, which we take as fixed here. The idea is to compute Ft(j) recursively from the analogous quantities for t − 1, i.e. {Ft−1(i)}_{i∈X}, using the following lemma (whose proof is left as an exercise, see the exercise list and solutions).
Lemma 11. We have
Ft(j) = f(st | j) ∑_{i∈X} Ft−1(i) Pij,   t ≥ 1, j ∈ X.   (A.3)
This recursive way of computing Ft(j) is referred to as the forward pass, as we start from t = 0 and go forward in time. It has cost O(t|X|²) to compute {Fs(j)}_{s=1,...,t; j∈X}, rather than the O(|X|^t) cost that the naive approach based on (A.2) would have. This allows one to perform inference on Xt given the observed signals s0:t even when t and/or |X| are large. Given {Ft(j)}_{t≥0, j∈X}, we can easily compute the so-called filtering distribution as
Pr(Xt = j | S0:t = s0:t) = Pr(Xt = j, S0:t = s0:t)/Pr(S0:t = s0:t) = Ft(j)/∑_{i∈X} Ft(i),   j ∈ X.
Given the filtering distributions {Pr(Xt = j | S0:t = s0:t)}_{t≥0, j∈X} we can efficiently compute predictive distributions as
Pr(Xt+1 = j | S0:t = s0:t) = ∑_{i∈X} Pr(Xt+1 = j | S0:t = s0:t, Xt = i) Pr(Xt = i | S0:t = s0:t)
                          = ∑_{i∈X} Pij Pr(Xt = i | S0:t = s0:t),   j ∈ X,
and¹
Pr(St+1 = s | S0:t = s0:t) = ∑_{i∈X} Pr(St+1 = s | S0:t = s0:t, Xt+1 = i) Pr(Xt+1 = i | S0:t = s0:t)
                          = ∑_{i∈X} f(s | i) Pr(Xt+1 = i | S0:t = s0:t),   s ∈ S.
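A hedged Python sketch of the forward recursion (A.3) and the resulting filtering and predictive distributions; the transition matrix, emission probabilities, initial distribution and signal sequence below are arbitrary toy values, and the initialization F0(j) = α(j) f(s0 | j) assumes X0 ∼ α.

import numpy as np

P = np.array([[0.7, 0.3], [0.2, 0.8]])        # transition matrix P_ij
f = np.array([[0.9, 0.1], [0.3, 0.7]])        # emission probabilities f(s | i): rows = states, cols = signals
alpha = np.array([0.5, 0.5])                  # initial distribution of X_0
signals = [0, 0, 1, 1, 0]

F = alpha * f[:, signals[0]]                  # F_0(j) = alpha(j) f(s_0 | j)
for s in signals[1:]:
    F = f[:, s] * (F @ P)                     # forward recursion (A.3)
filtering = F / F.sum()                       # Pr(X_t = j | S_{0:t} = s_{0:t})
predictive = filtering @ P                    # Pr(X_{t+1} = j | S_{0:t} = s_{0:t})
print(filtering, predictive)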
where ∝ denotes proportionality with respect to x (i.e. going from the left-hand-side to the
right-hand-side of the equation we ignore multiplicative constants independent of x).
1 Example 4.43 in Ross [2014] applies the above calculations to the specific quantities in Example 2.3.2, which
Proof. The statement can be easily verified by checking the proportionality starting from the
definition of N (x; m, σ 2 ).
Typical application: update current knowledge X ∼ N (mx , σx2 ) given a new observation
Y |X ∼ N (X, σy2 ). Then
p(X = x | Y = y) ∝ N(x; mx, σx²) N(y; x, σy²) = N(x; mx, σx²) N(x; y, σy²)
                 ∝ N(x; (σx⁻² mx + σy⁻² y)/(σx⁻² + σy⁻²), 1/(σx⁻² + σy⁻²)),   x, y ∈ R,
where the last expression, being a normalized density in x, equals p(X = x | Y = y). This means that
E[X | Y = y] = (σx⁻² mx + σy⁻² y)/(σx⁻² + σy⁻²)   and   Var(X | Y = y) = 1/(σx⁻² + σy⁻²).
Intuition: once you update your prior knowledge with a new observation y, the precision of your posterior knowledge about X increases from σx⁻² to σx⁻² + σy⁻², and the expectation updates from mx to the weighted average of mx and y, with weights given by the precisions of prior and likelihood. Intuitively, you can think of X as being pulled towards mx and y by the kernels N(x; mx, σx²) and N(x; y, σy²) with strengths σx⁻² and σy⁻², respectively.
The Gaussian-Gaussian conjugacy allows one to perform closed-form updates of the filtering and predictive distributions as follows. In terms of notation, we define ((mt, σt²))_{t≥0} and ((m̂t, σ̂t²))_{t≥0} as the filtering and predictive parameters², meaning that L(Xt | S0:t = s0:t) = N(mt, σt²) and L(Xt | S0:(t−1) = s0:(t−1)) = N(m̂t, σ̂t²).
The following algorithm provides an efficient way to compute ((mt, σt²))_{t≥0} and ((m̂t, σ̂t²))_{t≥0} recursively. Below we assume an initial distribution X0 ∼ N(m̂0, σ̂0²) for some fixed m̂0 ∈ R and σ̂0 ∈ R+ (i.e. α = N(m̂0, σ̂0²) in the notation of the previous subsection).
Algorithm 2 (Forward pass of Kalman filter).
1. (Initialization for t = 0) Set m̂0 and σ̂02 to the parameters of the starting distribution X0 ∼
α = N (m̂0 , σ̂02 ), and compute m0 and σ02 as in (A.8) with t = 0.
2. (Update for t ≥ 1) For t = 1, 2, . . . do: compute m̂t and σ̂t² from mt−1 and σ²t−1 using (A.7), and then mt and σt² from m̂t, σ̂t² and the new signal st using (A.8).
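A hedged Python sketch of the forward pass of the Kalman filter, under the model assumed in this section (here taken to be Xt = Xt−1 + ηt with ηt ∼ N(0, σ_η²) and St | Xt ∼ N(Xt, σ_s²); the exact forms labelled (A.7) and (A.8) below follow this assumed model, and the noise variances, signals and starting values are illustrative):

import numpy as np

def kalman_forward(signals, m0_hat, s0_hat2, sig_eta2, sig_s2):
    """Return, for each t, the predictive (m_hat_t, sigma_hat_t^2) and filtering (m_t, sigma_t^2) parameters."""
    m_hat, s_hat2 = m0_hat, s0_hat2
    out = []
    for s in signals:
        # update step (A.8): Gaussian-Gaussian conjugacy with prior N(m_hat, s_hat2) and likelihood N(x, sig_s2)
        s2 = 1.0 / (1.0 / s_hat2 + 1.0 / sig_s2)
        m = s2 * (m_hat / s_hat2 + s / sig_s2)
        out.append((m_hat, s_hat2, m, s2))
        # prediction step (A.7): X_{t+1} = X_t + eta_{t+1}
        m_hat, s_hat2 = m, s2 + sig_eta2
    return out

params = kalman_forward(signals=[1.2, 0.7, 1.5], m0_hat=0.0, s0_hat2=1.0,
                        sig_eta2=0.5, sig_s2=0.25)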
Derivation of (A.7) and (A.8). The formulas in (A.7) follow from Xt = Xt−1 + ηt combined with L(Xt−1 | S0:(t−1) = s0:(t−1)) = N(mt−1, σ²t−1) and L(ηt | S0:(t−1) = s0:(t−1)) = N(0, σ²η). The one
in (A.8) follows from
lim_{h↓0} E[|Xt+h − Xt|²] = 0 for all t.
Grigorios A Pavliotis. Stochastic processes and applications. Texts in Applied Mathematics, 60,
2014.
Sheldon M Ross. Introduction to Probability Models, 11th edition. Academic Press, 2014.
David Stirzaker. Stochastic processes and models. OUP Oxford, 2005.