GSM 199

The document introduces the concept of random variables and probability, starting with elementary examples like coin tosses and extending to continuous distributions. It defines key terms such as probability space, conditional probability, and various types of distributions including Bernoulli, binomial, and Poisson distributions. Additionally, it discusses the mathematical framework for probability, including expectations and variances of random variables.

Chapter 1

Random Variables

1.1. Elementary Examples


We will start with some elementary examples of probability. The most well-known example is that of a fair coin: if flipped, the probabilities of getting a head and of getting a tail both equal 1/2. If we perform n independent tosses, then the probability of obtaining n heads is equal to 1/2^n: among the 2^n equally likely outcomes only one gives the result that we look for. More generally, let S_n = X_1 + X_2 + ... + X_n, where
$$X_j = \begin{cases} 1, & \text{if the result of the } j\text{th trial is a head},\\ 0, & \text{if the result of the } j\text{th trial is a tail}.\end{cases}$$
Then the probability that we get k heads out of n tosses is equal to
$$\mathrm{Prob}(S_n = k) = \frac{1}{2^n}\binom{n}{k}.$$

Applying Stirling’s formula
$$n! \sim \sqrt{2\pi n}\,\Big(\frac{n}{e}\Big)^n, \qquad n \to \infty,$$
we can calculate, for example, the asymptotic probability of obtaining heads exactly half of the time:
$$\mathrm{Prob}(S_{2n} = n) = \frac{1}{2^{2n}}\binom{2n}{n} = \frac{1}{2^{2n}}\,\frac{(2n)!}{(n!)^2} \sim \frac{1}{\sqrt{\pi n}} \to 0,$$
as n → ∞.


On the other hand, since we have a fair coin, we do expect to obtain


heads roughly half of the time; i.e.,
$$\frac{S_{2n}}{2n} \approx \frac{1}{2},$$
for large n. Such a statement is indeed true and is embodied in the law of large numbers that we will discuss in the next chapter. For the moment let us simply observe that while the probability that S_{2n} equals n goes to zero as n → ∞, the probability that S_{2n} is close to n goes to 1 as n → ∞. More precisely, for any ε > 0,
$$\mathrm{Prob}\left(\left|\frac{S_{2n}}{2n} - \frac{1}{2}\right| > \varepsilon\right) \to 0,$$
as n → ∞. This can be seen as follows. Noting that the distribution Prob(S_{2n} = k) is unimodal and symmetric around the state k = n, we have
$$\begin{aligned}
\mathrm{Prob}\left(\left|\frac{S_{2n}}{2n} - \frac{1}{2}\right| > \varepsilon\right)
&\le 2 \sum_{k > n + 2n\varepsilon} \frac{1}{2^{2n}}\,\frac{(2n)!}{k!\,(2n-k)!} \\
&\le 2\big(n - \lceil 2n\varepsilon \rceil\big)\,\frac{1}{2^{2n}}\,\frac{(2n)!}{\lceil n + 2n\varepsilon \rceil!\;\lfloor n - 2n\varepsilon \rfloor!} \\
&\sim 2\sqrt{\frac{(1 - 2\varepsilon)\,n}{\pi (1 + 2\varepsilon)}} \cdot \frac{1}{(1 - 2\varepsilon)^{n(1-2\varepsilon)}\,(1 + 2\varepsilon)^{n(1+2\varepsilon)}} \to 0
\end{aligned}$$
for sufficiently small ε > 0 and n ≫ 1, where ⌈·⌉ and ⌊·⌋ are the ceiling and floor functions, respectively, defined by ⌈x⌉ = m + 1 and ⌊x⌋ = m if x ∈ (m, m + 1) for m ∈ ℤ. This is the weak law of large numbers for this particular example.
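These asymptotics are easy to check numerically. The sketch below (illustrative, not from the text; the values of n and ε are arbitrary choices) evaluates the exact binomial probabilities with Python's `math.comb`, compares Prob(S_{2n} = n) with 1/√(πn), and computes the tail probability in the weak law of large numbers exactly:

```python
import math

# Exact probability of exactly n heads in 2n fair tosses, versus
# the Stirling asymptotic 1/sqrt(pi * n).
for n in (10, 100, 1000):
    exact = math.comb(2 * n, n) / 4**n
    approx = 1.0 / math.sqrt(math.pi * n)
    print(n, exact, approx, exact / approx)

# Probability that S_2n deviates from n by more than 2*n*eps,
# computed exactly from the binomial distribution (no simulation).
eps = 0.05
for n in (10, 100, 1000):
    tail = sum(math.comb(2 * n, k) for k in range(2 * n + 1)
               if abs(k - n) > 2 * n * eps) / 4**n
    print(n, tail)
```

The ratio in the first loop approaches 1, while the tail probability in the second collapses rapidly, consistent with the exponential decay derived above.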
In the example of a fair coin, the number of outcomes in an experiment
is finite. In contrast, the second class of examples involves a continuous set
of possible outcomes. Consider the orientation of a unit vector τ. Denote by S² the unit sphere in ℝ³. Define ρ(n), n ∈ S², as the orientation distribution density; i.e., for A ⊂ S²,
$$\mathrm{Prob}(\tau \in A) = \int_A \rho(n)\,dS,$$
where dS is the surface area element on S². If τ does not have a preferred orientation, i.e., it has equal probability of pointing in any direction, then
$$\rho(n) = \frac{1}{4\pi},$$
since the total surface area of S² is 4π. In this case, we say that τ is isotropic. On the other hand, if τ does have a preferred orientation, say n₀, then we expect ρ(n) to be peaked at n₀.
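As an illustration (not part of the text), isotropic unit vectors can be sampled by normalizing 3D Gaussian draws; isotropy implies E[τ] = 0 and E[τ_z²] = 1/3, which the Monte Carlo sketch below checks (sample size and seed are arbitrary):

```python
import math
import random

# Uniform (isotropic) unit vectors on S^2: normalize a 3D standard Gaussian.
# Isotropy rho = 1/(4*pi) implies E[tau] = (0,0,0) and E[tau_z^2] = 1/3.
rng = random.Random(0)
N = 50_000
mean = [0.0, 0.0, 0.0]
second = 0.0
for _ in range(N):
    v = [rng.gauss(0, 1) for _ in range(3)]
    r = math.sqrt(sum(x * x for x in v))
    tau = [x / r for x in v]
    mean = [m + t / N for m, t in zip(mean, tau)]
    second += tau[2] ** 2 / N
print(mean, second)  # mean near the origin, second moment near 1/3
```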

1.2. Probability Space


It is useful to put these intuitive notions of probability on a firm mathemat-
ical basis, as was done by Kolmogorov. For this purpose, we need the notion
of probability space, often written as a triplet (Ω, F, P), defined as follows.
Definition 1.1 (Sample space). The sample space Ω is the set of all possible
outcomes. Each element ω ∈ Ω is called a sample point.
Definition 1.2 (σ-algebra). A σ-algebra (or σ-field) F is a collection of
subsets of Ω that satisfies the following conditions:
(i) Ω ∈ F ;
(ii) if A ∈ F , then Ac ∈ F , where Ac = Ω\A is the complement of A
in Ω;

(iii) if A1 , A2 , . . . ∈ F , then $\bigcup_{n=1}^{\infty} A_n \in F$.

Each set A in F is called an event. Let B be a collection of subsets of


Ω. We denote by σ(B) the smallest σ-algebra generated by the sets in B,
i.e., the smallest σ-algebra that contains B. The pair (Ω, F ) with the above
properties is called a measurable space.
Definition 1.3 (Probability measure). The probability measure P : F →
[0, 1] is a set function defined on F which satisfies
(a) P(∅) = 0, P(Ω) = 1;
(b) if A1 , A2 , . . . ∈ F are pairwise disjoint, i.e., Ai ∩ Aj = ∅ if i ≠ j, then
$$(1.1)\qquad P\Big(\bigcup_{n=1}^{\infty} A_n\Big) = \sum_{n=1}^{\infty} P(A_n).$$
(1.1) is called countable additivity or σ-additivity.


Example 1.4 (Fair coin). The probability space for the outcome of one
trial can be defined as follows. The sample space Ω = {H, T } where H and
T represent head and tail, respectively. The σ-algebra
F = all subsets of Ω = {∅, {H}, {T }, Ω}
and
$$P(\emptyset) = 0, \qquad P(\{H\}) = P(\{T\}) = \frac{1}{2}, \qquad P(\Omega) = 1.$$
For n independent tosses, we can take Ω = {H, T}^n, F = all subsets of Ω, and
$$P(A) = \frac{|A|}{2^n},$$
where |A| is the cardinality of the set A.
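This finite probability space can be enumerated directly. The sketch below (illustrative; n = 5 is an arbitrary choice) builds Ω = {H, T}^n, defines P(A) = |A|/2^n, and confirms that the induced distribution of the number of heads is the binomial one:

```python
import itertools
import math

# Probability space for n independent tosses: Omega = {H,T}^n with
# P(A) = |A| / 2^n.  Check that P(S_n = k) = C(n,k) / 2^n for n = 5.
n = 5
Omega = list(itertools.product("HT", repeat=n))

def P(A):
    """Probability of an event A, a collection of sample points."""
    return len(A) / 2**n

for k in range(n + 1):
    A = [w for w in Omega if w.count("H") == k]
    assert math.isclose(P(A), math.comb(n, k) / 2**n)
print("verified for n =", n)
```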

Example 1.5 (Uniform orientation distribution on S2 ). In this case, the


sample space Ω = S². Let B be the set of all open sets of S², defined as the intersections of open sets B ⊂ ℝ³ with S². Then we can take the σ-algebra to be F = σ(B) and
$$P(U) = \frac{\text{surface area}(U)}{4\pi} \qquad \text{for any } U \in F.$$
Within this framework, the standard rules of set theory are used to answer probability questions. For instance, if both A, B ∈ F , the probability that both A and B occur is given by P(A ∩ B), the probability that either A or B occurs is given by P(A ∪ B), the probability that A but not B occurs is given by P(A\B), etc. It is easy to check that

(1.2) A ⊆ B ⇒ P(A) ≤ P(B).

This is because B = A ∪ (B\A); therefore, P(B) = P(A) + P(B\A) ≥ P(A). We also have

(1.3) P(A ∪ B) ≤ P(A) + P(B)

since A ∪ B = A ∪ (B ∩ Ac ), and P(A ∪ B) = P(A) + P(B ∩ Ac ) ≤ P(A) + P(B). This last inequality is known as Boole’s inequality.

1.3. Conditional Probability


Let A, B ∈ F and assume that P(B) ≠ 0. Then the conditional probability of A given B is defined as
$$(1.4)\qquad P(A|B) = \frac{P(A \cap B)}{P(B)}.$$
This is the proportion of outcomes for which both A and B occur among those for which B occurs. For instance, the probability of obtaining two tails in two tosses of a fair coin is 1/4, but the conditional probability of obtaining two tails is 1/2 given that the first toss is a tail, and it is zero given that the first toss is a head.
Since P(A ∩ B) = P(A|B)P(B) by definition, we also have

P(A ∩ B ∩ C) = P(A|B ∩ C)P(B|C)P(C),

and so on. It is straightforward to obtain
$$(1.5)\qquad P(A|B) = \frac{P(A)P(B|A)}{P(B)}$$
from the definition of conditional probability. This is called Bayes’s rule.

Proposition 1.6 (Bayes’s theorem). If A₁, A₂, . . . are disjoint sets such that $\bigcup_{j=1}^{\infty} A_j = \Omega$, then we have
$$(1.6)\qquad P(A_j|B) = \frac{P(A_j)P(B|A_j)}{\sum_{n=1}^{\infty} P(A_n)P(B|A_n)} \qquad \text{for any } j \in \mathbb{N}.$$

This is useful in Bayesian statistics where Aj corresponds to the hypothesis and P(Aj ) is the prior probability of the hypothesis Aj . The conditional probability P(Aj |B) is the posterior probability of Aj given that the event B occurs.
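A minimal numerical illustration of (1.6) with two hypotheses; the prior and likelihood values below are invented for illustration:

```python
# Hypothetical two-hypothesis Bayes example: prior P(A1) = 0.01 (condition
# present), P(A2) = 0.99, and an observation B with likelihoods
# P(B|A1) = 0.95, P(B|A2) = 0.05.  The posterior follows from (1.6).
prior = {"A1": 0.01, "A2": 0.99}
likelihood = {"A1": 0.95, "A2": 0.05}   # P(B | A_j)

evidence = sum(prior[a] * likelihood[a] for a in prior)
posterior = {a: prior[a] * likelihood[a] / evidence for a in prior}
print(posterior)
```

Even with a fairly accurate observation, the posterior P(A1|B) is only about 0.16: the small prior dominates.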

1.4. Discrete Distributions


If the elements in Ω are finite or enumerable, say, Ω = {ω1 , ω2 , . . .}, we have
a situation of discrete probability space and discrete distribution. In this
case, let X(ωj ) = xj and
pj = P(X = xj ), j = 0, 1, . . . .
Of course, we have to have
$$0 \le p_j \le 1, \qquad \sum_j p_j = 1.$$
Given a function f of X, its expectation is given by
$$(1.7)\qquad Ef(X) = \sum_j f(x_j)\,p_j$$
if the sum is well-defined. In particular, the pth moment of the distribution is defined as
$$m_p = \sum_j x_j^p\, p_j.$$
When p = 1, it is called the mean of the random variable and is also denoted by mean(X). Another important quantity is its variance, defined as
$$(1.8)\qquad \mathrm{Var}(X) = m_2 - m_1^2 = \sum_j (x_j - m_1)^2 p_j.$$

Example 1.7 (Bernoulli distribution). The Bernoulli distribution has the form
$$P(X = j) = \begin{cases} p, & j = 1,\\ q, & j = 0, \end{cases}$$
with p + q = 1 and p, q ≥ 0. When p = q = 1/2, it corresponds to the toss of a fair coin. The mean and variance can be calculated directly:
$$EX = p, \qquad \mathrm{Var}(X) = pq.$$

Example 1.8 (Binomial distribution B(n, p)). The binomial distribution B(n, p) has the form
$$(1.9)\qquad P(X = k) = \binom{n}{k} p^k q^{n-k}, \qquad k = 0, 1, \ldots, n.$$
It is straightforward to obtain
$$EX = np, \qquad \mathrm{Var}(X) = npq.$$

The binomial distribution B(n, p) can be obtained from the sum of n


independent Bernoulli trials (Exercise 1.1).
Example 1.9 (Poisson distribution P(λ)). The Poisson distribution P(λ)
has the form
$$(1.10)\qquad P(X = k) = \frac{\lambda^k}{k!}\, e^{-\lambda}, \qquad k = 0, 1, \ldots.$$
The mean and variance are, respectively,
EX = Var(X) = λ.
This is often used to model the number of events during a time interval or
the number of points that fall in a given set.

The Poisson distribution P(λ) may be viewed as the limit of the binomial
distribution in the sense that
$$C_n^k\, p^k q^{n-k} \longrightarrow \frac{\lambda^k}{k!}\, e^{-\lambda} \qquad \text{as } n \to \infty,\ p \to 0,\ np = \lambda.$$
The proof is simple and is left as an exercise.
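This limit is easy to observe numerically. The sketch below (illustrative; λ = 2 is an arbitrary choice) compares the binomial and Poisson probability mass functions as n grows with np = λ fixed:

```python
import math

# Poisson approximation to the binomial: with lambda = n*p fixed, the pmf
# C(n,k) p^k q^(n-k) approaches lambda^k e^(-lambda) / k! as n grows.
lam = 2.0

def binom_pmf(n, k):
    p = lam / n
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)

def poisson_pmf(k):
    return lam**k * math.exp(-lam) / math.factorial(k)

errors = []
for n in (10, 100, 1000):
    err = max(abs(binom_pmf(n, k) - poisson_pmf(k)) for k in range(10))
    errors.append(err)
    print(n, err)  # the maximal pointwise error shrinks as n grows
```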

1.5. Continuous Distributions


Consider now the general case when Ω is not necessarily enumerable. Let
us begin with the definition of a random variable. Denote by R the Borel
σ-algebra on R, the smallest σ-algebra containing all open sets.
Definition 1.10. A random variable X is an F-measurable real-valued
function X : Ω → R; i.e., for any B ∈ R, X −1 (B) ∈ F .
Definition 1.11. The distribution of the random variable X is a probability
measure μ on R, defined for any set B ∈ R by
(1.11) μ(B) = P(X ∈ B) = P ◦ X −1 (B).
In particular, we define the distribution function F (x) = P(X ≤ x) when
B = (−∞, x].

If there exists an integrable function ρ(x) such that
$$(1.12)\qquad \mu(B) = \int_B \rho(x)\,dx$$
for any B ∈ R, then ρ is called the probability density function (PDF) of X. Here ρ(x) = dμ/dm is the Radon-Nikodym derivative of μ(dx) with respect to the Lebesgue measure m(dx) if μ(dx) is absolutely continuous with respect to m(dx); i.e., for any set B ∈ R, if m(B) = 0, then μ(B) = 0 (see also Section C of the appendix) [Bil79]. In this case, we write μ ≪ m.
Definition 1.12. The expectation of a random variable X is defined as
$$(1.13)\qquad EX = \int_\Omega X(\omega)\,P(d\omega) = \int_{\mathbb{R}} x\,\mu(dx)$$
if the integrals are well-defined.

The variance of X is defined as
$$(1.14)\qquad \mathrm{Var}(X) = E(X - EX)^2.$$
For two random variables X and Y , we can define their covariance as
$$(1.15)\qquad \mathrm{Cov}(X, Y) = E(X - EX)(Y - EY).$$
X and Y are called uncorrelated if Cov(X, Y ) = 0. All of the above definitions can be extended to the vectorial case in which X = (X₁, X₂, . . . , X_d)ᵀ ∈ ℝᵈ is a random vector and each component X_k is a random variable. In this case, the covariance matrix of X is defined as
$$(1.16)\qquad \mathrm{Cov}(X) = E(X - EX)(X - EX)^T.$$
Definition 1.13. For any p ≥ 1, the space Lp (Ω) (or Lpω ) consists of random
variables whose pth-order moment is finite:
(1.17) Lp (Ω) = {X(ω) : E|X|p < ∞}.

For X ∈ Lᵖ(Ω), let
$$(1.18)\qquad \|X\|_p = (E|X|^p)^{1/p}, \qquad p \ge 1.$$
Theorem 1.14.
(i) Minkowski inequality.
$$\|X + Y\|_p \le \|X\|_p + \|Y\|_p, \qquad p \ge 1,\ X, Y \in L^p(\Omega).$$
(ii) Hölder inequality.
$$E|(X, Y)| \le \|X\|_p\,\|Y\|_q, \qquad p > 1,\ 1/p + 1/q = 1,\ X \in L^p(\Omega),\ Y \in L^q(\Omega),$$
where (X, Y ) denotes the standard scalar product in ℝᵈ.
(iii) Schwartz inequality.
$$E|(X, Y)| \le \|X\|_2\,\|Y\|_2.$$

Obviously the Schwartz inequality is the special case of the Hölder inequality with p = q = 2.
The proof of these inequalities can be found, for example, in Chapter 2
of [Shi96].
It also follows that ‖·‖_p is a norm. One can further prove that Lᵖ(Ω) is a Banach space and L²(Ω) is a Hilbert space with inner product
$$(1.19)\qquad (X, Y)_{L^2_\omega} = E(X, Y).$$

Lemma 1.15 (Chebyshev’s inequality). Let X be a random variable such


that E|X|p < ∞ for some p > 0. Then
$$(1.20)\qquad P\{|X| \ge \lambda\} \le \frac{1}{\lambda^p}\, E|X|^p,$$
for any positive constant λ.

Proof. For any λ > 0,
$$E|X|^p = \int_{\mathbb{R}^d} |x|^p\,\mu(dx) \ge \int_{|x|\ge\lambda} |x|^p\,\mu(dx) \ge \lambda^p \int_{|x|\ge\lambda} \mu(dx) = \lambda^p\, P(|X| \ge \lambda). \qquad \square$$

It is straightforward to generalize the above estimate to any nonnegative increasing function f (x), which gives P(|X| ≥ λ) ≤ Ef (|X|)/f (λ) if f (λ) > 0.
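Since the indicator of {|x| ≥ λ} is bounded pointwise by |x|ᵖ/λᵖ, Chebyshev's bound with p = 2 also holds exactly for empirical averages over any finite sample. A sketch (illustrative; the distribution and parameters are arbitrary choices):

```python
import random

# Empirical check of Chebyshev's inequality (1.20) with p = 2.  The sample
# distribution (a sum of 12 uniforms, variance 4) is an arbitrary choice.
rng = random.Random(1)
samples = [sum(rng.uniform(-1, 1) for _ in range(12)) for _ in range(20_000)]
m2 = sum(x * x for x in samples) / len(samples)   # empirical E|X|^2
for lam in (1.0, 2.0, 4.0):
    tail = sum(abs(x) >= lam for x in samples) / len(samples)
    # The bound holds term by term, hence for the averages as well.
    assert tail <= m2 / lam**2
    print(lam, tail, m2 / lam**2)
```

As usual the bound is quite loose; its value is generality, not sharpness.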

Lemma 1.16 (Jensen’s inequality). Let X be a random variable such that


E|X| < ∞ and φ : R → R is a convex function such that E|φ(X)| < ∞.
Then
(1.21) Eφ(X) ≥ φ(EX).

This follows directly from the definition of convex functions. Readers


can also refer to [Chu01] for the details.
Below we list some typical continuous distributions.

Example 1.17 (Uniform distribution). The uniform distribution on a domain B (in ℝᵈ) is defined by the probability density function
$$\rho(x) = \begin{cases} \dfrac{1}{\mathrm{vol}(B)}, & \text{if } x \in B,\\[1ex] 0, & \text{otherwise}. \end{cases}$$
In one dimension, if B = [0, 1] (denoted as U [0, 1] later), this reduces to
$$\rho(x) = \begin{cases} 1, & \text{if } x \in [0, 1],\\ 0, & \text{otherwise}. \end{cases}$$

For the uniform distribution on [0, 1], we have
$$EX = \frac{1}{2}, \qquad \mathrm{Var}(X) = \frac{1}{12}.$$
Example 1.18 (Exponential distribution). The exponential distribution E(λ) is defined by the probability density function
$$\rho(x) = \begin{cases} 0, & \text{if } x < 0,\\ \lambda e^{-\lambda x}, & \text{if } x \ge 0. \end{cases}$$
The mean and variance of E(λ) are
$$(1.22)\qquad EX = \frac{1}{\lambda}, \qquad \mathrm{Var}(X) = \frac{1}{\lambda^2}.$$
As an example, the waiting time of a Poisson process with rate λ is exponentially distributed with parameter λ.
Example 1.19 (Normal distribution). The one-dimensional normal distribution (also called Gaussian distribution) N (μ, σ²) is defined by the probability density function
$$(1.23)\qquad \rho(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\Big(-\frac{1}{2\sigma^2}(x - \mu)^2\Big)$$
with mean μ and variance σ². If Σ is an n × n symmetric positive definite matrix and μ is a vector in ℝⁿ, we can also define the n-dimensional normal distribution N (μ, Σ) through the density
$$(1.24)\qquad \rho(x) = \frac{1}{(2\pi)^{n/2}(\det\Sigma)^{1/2}} \exp\Big(-\frac{1}{2}(x - \mu)^T \Sigma^{-1} (x - \mu)\Big).$$
In this case, we have
$$EX = \mu, \qquad \mathrm{Cov}(X) = \Sigma.$$

The normal distribution is the most important probability distribution. Random variables with normal distribution are also called Gaussian random variables. In the degenerate case, i.e., when the covariance matrix Σ is not invertible, which corresponds to some components lying in the subspace spanned by the others, we need to define the Gaussian distribution via characteristic functions (see Section 1.9).
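A standard way to sample from N (μ, Σ) is X = μ + LZ, where LLᵀ = Σ is the Cholesky factorization and Z is standard normal, since then Cov(X) = LLᵀ = Σ. The sketch below (illustrative; the 2×2 μ and Σ are arbitrary choices) checks the empirical mean and cross-covariance:

```python
import math
import random

# Sample N(mu, Sigma) in 2D via X = mu + L Z, with L the Cholesky factor
# of Sigma computed by hand for the 2x2 case.
mu = (1.0, -2.0)
Sigma = ((2.0, 0.6), (0.6, 1.0))
l11 = math.sqrt(Sigma[0][0])
l21 = Sigma[1][0] / l11
l22 = math.sqrt(Sigma[1][1] - l21**2)

rng = random.Random(7)
N = 100_000
xs = []
for _ in range(N):
    z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
    xs.append((mu[0] + l11 * z1, mu[1] + l21 * z1 + l22 * z2))

m1 = sum(x for x, _ in xs) / N
m2 = sum(y for _, y in xs) / N
cov12 = sum((x - m1) * (y - m2) for x, y in xs) / N
print(m1, m2, cov12)  # near 1, -2, and Sigma_12 = 0.6
```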

Example 1.20 (Gibbs distribution). In equilibrium statistical mechanics, we are concerned with a probability distribution π over a state space S. In the case of an n-particle system with continuous states, we have x = (x₁, . . . , xₙ, p₁, . . . , pₙ) ∈ S = ℝ⁶ⁿ, where x_k and p_k are the position and momentum of the kth particle, respectively. The PDF π(x), called the Gibbs distribution, has a specific form:
$$(1.25)\qquad \pi(x) = \frac{1}{Z}\, e^{-\beta H(x)}, \qquad x \in \mathbb{R}^{6n}, \quad \beta = (k_B T)^{-1},$$
where H is the energy of the considered system, T is the absolute temperature, k_B is the Boltzmann constant, and
$$(1.26)\qquad Z = \int_{\mathbb{R}^{6n}} e^{-\beta H(x)}\,dx$$
is called the partition function. Let f (x) be a function defined on the configuration space S. Then its thermodynamic average ⟨f⟩, i.e., the expectation of f , is given by
$$(1.27)\qquad Ef = \langle f \rangle = \int_{\mathbb{R}^{6n}} f(x)\,\pi(x)\,dx.$$
A similar definition holds for the discrete space setting by replacing the integral with summation.
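For a discrete state space the formulas reduce to sums. The toy sketch below (a hypothetical two-level system with unit energy gap, invented for illustration) computes Z, π, and a thermodynamic average:

```python
import math

# Discrete Gibbs distribution on a hypothetical two-level system:
# pi(x) = exp(-beta * H(x)) / Z, with sums in place of integrals.
H = {"ground": 0.0, "excited": 1.0}    # energies (arbitrary units)
beta = 2.0                             # inverse temperature (arbitrary)

Z = sum(math.exp(-beta * e) for e in H.values())     # partition function
pi = {s: math.exp(-beta * e) / Z for s, e in H.items()}
mean_energy = sum(pi[s] * H[s] for s in H)           # <H>
print(pi, mean_energy)  # mean energy equals 1 / (1 + e^beta)
```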

Another important notion is the distribution function of a random variable X:
$$(1.28)\qquad F(x) = P(X \le x).$$
One can easily see that if the distribution of X is absolutely continuous with respect to the Lebesgue measure, then the density ρ and the distribution function F of X are related by
$$(1.29)\qquad \rho(x) = \frac{d}{dx} F(x).$$

1.6. Independence
We now come to one of the most distinctive notions in probability theory,
the notion of independence. Let us start by defining the independence of
events. Two events A, B ∈ F are independent if
P(A ∩ B) = P(A)P(B).
Definition 1.21. Two random variables X and Y are said to be independent if for any two Borel sets A and B, X⁻¹(A) and Y⁻¹(B) are independent; i.e.,
$$(1.30)\qquad P\big(X^{-1}(A) \cap Y^{-1}(B)\big) = P\big(X^{-1}(A)\big)\,P\big(Y^{-1}(B)\big).$$

The joint distribution of the two random variables X and Y is defined


to be the distribution of the random vector (X, Y ). Let μ1 and μ2 be the
probability distribution of X and Y , respectively, and let μ be their joint
distribution. If X and Y are independent, then for any two Borel sets A
and B, we have

(1.31) μ(A × B) = μ1 (A)μ2 (B).

Consequently, we have
$$(1.32)\qquad \mu = \mu_1 \otimes \mu_2;$$
i.e., the joint distribution of two independent random variables is the product distribution. If both μ₁ and μ₂ are absolutely continuous, with densities p₁ and p₂, respectively, then μ is also absolutely continuous, with density given by
$$(1.33)\qquad p(x, y) = p_1(x)\,p_2(y).$$

One can also understand independence from the viewpoint of expecta-


tions. Let f1 and f2 be two continuous functions. If X and Y are two
independent random variables, then

(1.34) Ef1 (X)f2 (Y ) = Ef1 (X)Ef2 (Y ).

In fact, this can also be used as the definition of independence.


This discussion can be readily generalized to multiple events and multiple random variables. A sequence of events {A_k}ⁿ_{k=1} are independent if for k = 1, 2, . . . , n and 1 ≤ i₁ < i₂ < · · · < i_k ≤ n
$$P\Big(\bigcap_{j=1}^{k} A_{i_j}\Big) = \prod_{j=1}^{k} P(A_{i_j}).$$
Note that pairwise independence does not imply independence (see Exercise 1.9).
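The classical counterexample behind this remark (cf. Exercise 1.9) can be verified by exact enumeration. The sketch below (illustrative) uses two fair bits X, Y and Z = X ⊕ Y:

```python
import itertools

# Pairwise independence without mutual independence: X, Y fair bits,
# Z = X xor Y.  Enumerate the 4 equally likely outcomes exactly.
outcomes = [(x, y, x ^ y) for x, y in itertools.product((0, 1), repeat=2)]

def P(pred):
    """Exact probability of a predicate on (X, Y, Z)."""
    return sum(pred(o) for o in outcomes) / len(outcomes)

# Each pair of the three variables is independent...
assert P(lambda o: o[0] == 1 and o[2] == 1) == P(lambda o: o[0] == 1) * P(lambda o: o[2] == 1)
assert P(lambda o: o[1] == 1 and o[2] == 1) == P(lambda o: o[1] == 1) * P(lambda o: o[2] == 1)
# ...but the triple is not independent:
assert P(lambda o: o == (1, 1, 0)) != P(lambda o: o[0] == 1) * P(lambda o: o[1] == 1) * P(lambda o: o[2] == 0)
print("pairwise independent, not mutually independent")
```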

Definition 1.22. The random variables X₁, . . . , Xₙ are said to be independent if for any Borel sets B_j ∈ R,
$$(1.35)\qquad P\Big(\bigcap_{j=1}^{n} X_j^{-1}(B_j)\Big) = \prod_{j=1}^{n} P\big(X_j^{-1}(B_j)\big).$$

If the random variables are independent, then their joint distribution is the product measure of their distributions.

1.7. Conditional Expectation


Let X and Y be two discrete random variables, not necessarily independent, with joint distribution
$$p(i, j) = P(X = i, Y = j).$$
Since $\sum_i p(i, j) = P(Y = j)$, the probability that X = i conditioned on Y = j is given by
$$p(i\,|\,j) = \frac{p(i, j)}{\sum_i p(i, j)} = \frac{p(i, j)}{P(Y = j)}$$
if $\sum_i p(i, j) > 0$. The convention is to set p(i|j) = 0 if $\sum_i p(i, j) = 0$. Now let f be a continuous function. It is natural to define the conditional expectation of f (X) given that Y = j by
$$(1.36)\qquad E(f(X)\,|\,Y = j) = \sum_i f(i)\,p(i\,|\,j).$$

A difficulty arises when one tries to generalize this to continuous random


variables. Indeed, given two continuous random variables X and Y , the
probability that Y = y is zero for most values of y. Therefore, we need to
proceed differently.
Let G be a sub-σ-algebra of F . Let X be a random variable such that
E|X| < ∞.
Definition 1.23 (Conditional expectation). The conditional expectation Z
of X given G is defined by the following conditions:
(i) Z is measurable with respect to G;
(ii) for any set A ∈ G,
$$\int_A Z(\omega)\,P(d\omega) = \int_A X(\omega)\,P(d\omega).$$

The existence of Z = E(X|G) follows from the Radon-Nikodym theorem if we consider the measure μ on G defined by $\mu(A) = \int_A X(\omega)\,P(d\omega)$ (see [Bil79]). One can easily see that μ is absolutely continuous with respect to the measure P|_G , the restriction of P to G. Thus Z exists and is unique up to almost sure equivalence in P|_G .
In addition, we have
Theorem 1.24 (Properties of conditional expectation). Assume that X, Y
are random variables with E|X|, E|Y | < ∞. Let a, b ∈ R. Then:
(i) E(aX + bY |G) = aE(X|G) + bE(Y |G).
(ii) E(E(X|G)) = E(X).
(iii) E(X|G) = X if X is G-measurable.

(iv) E(X|G) = EX if X is independent of G.
(v) E(XY |G) = Y E(X|G) if Y is G-measurable.
(vi) If H is a sub-σ-algebra of G, then E(X|H) = E(E(X|G)|H).

Similar to Lemma 1.16 we have

Lemma 1.25 (Conditional Jensen inequality). Let X be a random variable


such that E|X| < ∞ and let φ : R → R be a convex function such that
E|φ(X)| < ∞. Then

(1.37) E(φ(X)|G) ≥ φ(E(X|G)).

These statements are straightforward. The reader may also consult


[Chu01] for the details of the proof.
Given two random variables X and Y , the conditional expectation of X
with respect to Y is defined as the conditional expectation of X with respect
to the σ-algebra G = σ(Y ) generated by Y

G = σ(Y ) := {Y −1 (B), B ∈ R}.

To see that this definition reduces to the previous one for the case of discrete random variables, let Ω_j = {ω : Y (ω) = j} and
$$\Omega = \bigcup_{j=1}^{n} \Omega_j.$$
The σ-algebra G is simply the set of all possible unions of the Ω_j ’s. The measurability condition of E(X|Y ) with respect to G means E(X|Y ) is constant on each Ω_j , which is exactly E(X|Y = j) as we will see. By definition, we have
$$(1.38)\qquad \int_{\Omega_j} E(X|Y)\,P(d\omega) = \int_{\Omega_j} X(\omega)\,P(d\omega),$$
which implies
$$(1.39)\qquad E(X|Y) = \frac{1}{P(\Omega_j)} \int_{\Omega_j} X(\omega)\,P(d\omega) \quad \text{on } \Omega_j.$$
This is exactly E(X|Y = j) in (1.36) when f (X) = X.


The conditional expectation is the optimal approximation in L2 (Ω)
among all G-measurable functions.

Proposition 1.26. Let g be a measurable function. Then

(1.40) E(X − E(X|Y ))2 ≤ E(X − g(Y ))2 .



Proof. We have
$$E(X - g(Y))^2 = E\big(X - E(X|Y)\big)^2 + E\big(E(X|Y) - g(Y)\big)^2 + 2E\big[(X - E(X|Y))(E(X|Y) - g(Y))\big]$$
and
$$\begin{aligned}
E\big[(X - E(X|Y))(E(X|Y) - g(Y))\big]
&= E\Big[E\big[(X - E(X|Y))(E(X|Y) - g(Y))\,\big|\,Y\big]\Big] \\
&= E\big[(E(X|Y) - E(X|Y))(E(X|Y) - g(Y))\big] = 0,
\end{aligned}$$
which implies (1.40). □
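This optimality can be observed on data: among all functions of Y, the empirical conditional mean (the average of X within each slice Y = j) minimizes the empirical mean squared error. A sketch (illustrative; the joint distribution of (X, Y) below is an arbitrary choice):

```python
import random

# Empirical check of Proposition 1.26: Y is a die roll, X = Y + noise.
rng = random.Random(3)
data = []
for _ in range(50_000):
    y = rng.randint(1, 6)
    x = y + rng.gauss(0, 1)
    data.append((x, y))

# Empirical conditional expectation E(X | Y = j): slice averages.
cond = {j: sum(x for x, y in data if y == j) / sum(1 for _, y in data if y == j)
        for j in range(1, 7)}

def mse(g):
    return sum((x - g(y)) ** 2 for x, y in data) / len(data)

best = mse(lambda y: cond[y])       # the conditional mean
other = mse(lambda y: 3.5)          # an arbitrary competitor g(Y) = const
print(best, other)
assert best <= other                # slice averages minimize within-slice error
```

Here `best` is close to the noise variance 1, while the constant predictor also pays for the spread of Y.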

1.8. Notions of Convergence


Let {Xn }∞n=1 be a sequence of random variables defined on a probability
space (Ω, F , P), and let μn be the distribution of Xn . Let X be another
random variable with distribution μ. We will discuss four notions of conver-
gence: almost sure convergence, convergence in probability, convergence in
distribution, and convergence in Lp .
Definition 1.27 (Almost sure convergence). Xn converges to X almost
surely if
(1.41) P(ω : Xn (ω) → X(ω)) = 1.

We write almost sure convergence as Xn → X, a.s.


Definition 1.28 (Convergence in probability). Xₙ converges to X in probability if for any ε > 0,
$$(1.42)\qquad P(\omega : |X_n(\omega) - X(\omega)| > \varepsilon) \to 0,$$
as n → ∞.
Definition 1.29 (Convergence in distribution). Xn converges to X in dis-
tribution if for any f ∈ Cb (R),
(1.43) Ef (Xn ) → Ef (X).
This is denoted as $X_n \xrightarrow{d} X$ or $\mu_n \xrightarrow{d} \mu$ or μₙ ⇒ μ.
Definition 1.30 (Convergence in Lp ). Xn converges to X in Lp (0 < p <
∞) if
(1.44) E|Xn − X|p → 0.
For p = 1, this is also referred to as convergence in mean; for p = 2, this is
referred to as convergence in mean square.

Remark 1.31. Note that the power p does not need to be greater than 1
since it still gives a metric for the random variables. But only when p ≥ 1
do we get a norm. The statements in this section hold when 0 < p < ∞.

We have the following relations between different notions of convergence.


Theorem 1.32.
(i) Almost sure convergence implies convergence in probability.
(ii) Convergence in probability implies almost sure convergence along a
subsequence.
(iii) If p < q, then convergence in Lq implies convergence in Lp .
(iv) Convergence in Lp implies convergence in probability.
(v) Convergence in probability implies convergence in distribution.

Proof. (i) Note that
$$P(|X_n(\omega) - X(\omega)| > \varepsilon) = \int_\Omega \chi_{\{|X_n - X| > \varepsilon\}}(\omega)\,P(d\omega) \to 0$$
by the almost sure convergence and dominated convergence theorems.

(ii) The proof will be deferred to Section 1.11.

(iii) This is a consequence of the Hölder inequality:
$$E|X_n - X|^p \le \big(E|X_n - X|^q\big)^{p/q}, \qquad p < q.$$

(iv) This is a consequence of the Chebyshev inequality:
$$P(\omega : |X_n(\omega) - X(\omega)| > \varepsilon) \le \frac{1}{\varepsilon^p}\, E|X_n - X|^p$$
for any ε > 0.

(v) Argue by contradiction. Suppose there exist a bounded continuous function f (x) and a subsequence X_{n_k} such that
$$(1.45)\qquad Ef(X_{n_k}) \nrightarrow Ef(X), \qquad k \to \infty.$$
From assertion (ii), there exists a further subsequence of {X_{n_k}} (still denoted as {X_{n_k}}) such that X_{n_k} converges to X almost surely. This contradicts (1.45) by the dominated convergence theorem. □

1.9. Characteristic Function


The characteristic function of a random variable X is defined as
$$(1.46)\qquad f(\xi) = E e^{i\xi X} = \int_{\mathbb{R}} e^{i\xi x}\,\mu(dx).$$

Proposition 1.33. The characteristic function has the following properties:
(1) ∀ξ ∈ ℝ, |f (ξ)| ≤ 1, $f(-\xi) = \overline{f(\xi)}$, f (0) = 1;
(2) f is uniformly continuous on ℝ.

Proof. The proof of the first statement is straightforward. For the second statement, we have
$$|f(\xi_1) - f(\xi_2)| = \big|E(e^{i\xi_1 X} - e^{i\xi_2 X})\big| = \big|E\big(e^{i\xi_1 X}(1 - e^{i(\xi_2 - \xi_1)X})\big)\big| \le E\big|1 - e^{i(\xi_2 - \xi_1)X}\big|.$$
Since the integrand $|1 - e^{i(\xi_2 - \xi_1)X}|$ depends only on the difference between ξ₁ and ξ₂ and since it tends to 0 almost surely as the difference goes to 0, uniform continuity follows immediately from the dominated convergence theorem. □
Example 1.34. Here are the characteristic functions of some typical distributions:
(1) Bernoulli distribution: f (ξ) = q + pe^{iξ}.
(2) Binomial distribution B(n, p): f (ξ) = (q + pe^{iξ})ⁿ.
(3) Poisson distribution P(λ): $f(\xi) = e^{\lambda(e^{i\xi} - 1)}$.
(4) Exponential distribution E(λ): f (ξ) = (1 − λ⁻¹iξ)⁻¹.
(5) Normal distribution N (μ, σ²):
$$(1.47)\qquad f(\xi) = \exp\Big(i\mu\xi - \frac{\sigma^2\xi^2}{2}\Big).$$
(6) Multivariate normal distribution N (μ, Σ):
$$(1.48)\qquad f(\xi) = \exp\Big(i\mu^T\xi - \frac{1}{2}\,\xi^T \Sigma\, \xi\Big).$$
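These formulas can be checked against the empirical characteristic function (1/N) Σ e^{iξX_k}. The sketch below (illustrative; μ, σ, and the ξ values are arbitrary choices) does this for the normal case (1.47):

```python
import cmath
import random

# Empirical characteristic function E[exp(i*xi*X)] for X ~ N(mu, sigma^2),
# compared with the closed form exp(i*mu*xi - sigma^2 * xi^2 / 2).
mu, sigma = 0.5, 2.0
rng = random.Random(11)
xs = [rng.gauss(mu, sigma) for _ in range(100_000)]

diffs = []
for xi in (0.0, 0.3, 0.7):
    emp = sum(cmath.exp(1j * xi * x) for x in xs) / len(xs)
    exact = cmath.exp(1j * mu * xi - sigma**2 * xi**2 / 2)
    diffs.append(abs(emp - exact))
    print(xi, emp, exact)  # agreement up to Monte Carlo error ~ 1/sqrt(N)
```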

The property (1.48) is also used to define degenerate multivariate Gaussian distributions.
The following result gives an explicit characterization of the weak con-
vergence of probability measures in terms of their characteristic functions.
This is the key ingredient in the proof of the central limit theorem.

Theorem 1.35 (Lévy’s continuity theorem). Let {μₙ}_{n∈ℕ} be a sequence of probability measures, and let {fₙ}_{n∈ℕ} be their corresponding characteristic functions. Assume that:
(1) fₙ converges everywhere on ℝ to a limiting function f .
(2) f is continuous at ξ = 0.
Then there exists a probability distribution μ such that $\mu_n \xrightarrow{d} \mu$. Moreover, f is the characteristic function of μ.
Conversely, if $\mu_n \xrightarrow{d} \mu$, where μ is some probability distribution, then fₙ converges to f uniformly on every finite interval, where f is the characteristic function of μ.

For a proof, see [Chu01].


As for Fourier transforms, one can also define the inverse transform
$$\rho(x) = \frac{1}{2\pi} \int_{\mathbb{R}} e^{-i\xi x} f(\xi)\,d\xi.$$
An interesting question arises as to when this gives the density of a probability measure. To address this, we introduce the following notion.
Definition 1.36. A function f is called positive semidefinite if for any finite set of values {ξ₁, . . . , ξₙ}, n ∈ ℕ, the matrix $(f(\xi_i - \xi_j))_{i,j=1}^n$ is positive semidefinite; i.e.,
$$(1.49)\qquad \sum_{i,j} f(\xi_i - \xi_j)\,v_i \bar{v}_j \ge 0,$$
for every set of values v₁, . . . , vₙ ∈ ℂ.


Theorem 1.37 (Bochner’s theorem). A function f is the characteristic
function of a probability measure if and only if it is positive semidefinite and
continuous at 0 with f (0) = 1.

Proof. We only prove the necessity part. The other part is less trivial and readers may consult [Chu01]. Assume that f is the characteristic function of a probability measure μ. Then
$$(1.50)\qquad \sum_{i,j=1}^{n} f(\xi_i - \xi_j)\,v_i \bar{v}_j = \int_{\mathbb{R}} \Big|\sum_{i=1}^{n} v_i e^{i\xi_i x}\Big|^2 \mu(dx) \ge 0. \qquad \square$$

1.10. Generating Function and Cumulants


For a discrete random variable, its generating function is defined as
$$(1.51)\qquad G(x) = \sum_{k=0}^{\infty} P(X = x_k)\,x^k.$$
One immediately has
$$P(X = x_k) = \frac{1}{k!}\, G^{(k)}(x)\Big|_{x=0}.$$

Definition 1.38. The convolution of two sequences {a_k} and {b_k} is the sequence {c_k} = {a_k} ∗ {b_k} defined by
$$(1.52)\qquad c_k = \sum_{j=0}^{k} a_j b_{k-j}.$$
It is easy to show that the generating functions defined by
$$A(x) = \sum_{k=0}^{\infty} a_k x^k, \qquad B(x) = \sum_{k=0}^{\infty} b_k x^k, \qquad C(x) = \sum_{k=0}^{\infty} c_k x^k$$
with {c_k} = {a_k} ∗ {b_k} satisfy C(x) = A(x)B(x). The following result is more or less obvious.

Theorem 1.39. Let X and Y be two independent random variables with probability distributions
$$P(X = j) = a_j, \qquad P(Y = k) = b_k,$$
respectively, and let A and B be the corresponding generating functions. Then the generating function of X + Y is C(x) = A(x)B(x).

This can be used to compute some generating functions.
(a) Bernoulli distribution: G(x) = q + px.
(b) Binomial distribution: G(x) = (q + px)ⁿ.
(c) Poisson distribution: G(x) = e^{−λ+λx}.
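Theorem 1.39 can be verified directly by convolving coefficient sequences: convolving n Bernoulli sequences {q, p} must reproduce the B(n, p) probabilities, since G(x) = (q + px)ⁿ. A sketch (illustrative; p and n are arbitrary choices):

```python
import math

# Generating-function view of sums: convolving n Bernoulli(p) coefficient
# sequences {q, p} gives the binomial B(n, p) probabilities.
p, q, n = 0.3, 0.7, 6

def convolve(a, b):
    """Convolution (1.52) of two coefficient sequences."""
    c = [0.0] * (len(a) + len(b) - 1)
    for j, aj in enumerate(a):
        for k, bk in enumerate(b):
            c[j + k] += aj * bk
    return c

coeffs = [1.0]
for _ in range(n):
    coeffs = convolve(coeffs, [q, p])

binom = [math.comb(n, k) * p**k * q ** (n - k) for k in range(n + 1)]
print(coeffs)
assert all(math.isclose(a, b) for a, b in zip(coeffs, binom))
```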
The moment generating function of a random variable X is defined for all values of t by
$$(1.53)\qquad M(t) = E e^{tX} = \begin{cases} \displaystyle\sum_x p(x)\,e^{tx}, & X \text{ is discrete-valued},\\[1.5ex] \displaystyle\int_{\mathbb{R}} p(x)\,e^{tx}\,dx, & X \text{ is continuous}, \end{cases}$$
provided that e^{tX} is integrable. It is obvious that M (0) = 1.


Once M (t) is defined, one can show M (t) ∈ C ∞ in its domain and its
relation to the nth moment

(1.54) M (n) (t) = E(X n etX ) and μn := EX n = M (n) (0), n ∈ N.


1.11. The Borel-Cantelli Lemma 21

This gives

tn
(1.55) M (t) = μn ,
n!
n=0
which explains why M (t) is called the moment generating function.
Theorem 1.40. Denote MX (t), MY (t), and MX+Y (t) the moment gener-
ating functions of the random variables X, Y , and X + Y , respectively. If
X, Y are independent, then
(1.56) MX+Y (t) = MX (t)MY (t).

Proof. The proof is straightforward by noticing


MX+Y (t) = Eet(X+Y ) = EetX EetY = MX (t)MY (t). 

The following can be obtained by direct calculation.


(a) Binomial distribution: M (t) = (peᵗ + 1 − p)ⁿ.
(b) Poisson distribution: M (t) = exp[λ(eᵗ − 1)].
(c) Exponential distribution: M (t) = λ/(λ − t) for t < λ.
(d) Normal distribution N (μ, σ²): $M(t) = \exp\big(\mu t + \frac{1}{2}\sigma^2 t^2\big)$.
The cumulant generating function Λ(t) is defined by
$$(1.57)\qquad \Lambda(t) = \log M(t) = \log E e^{tX} = \sum_{n=1}^{\infty} \kappa_n \frac{t^n}{n!}.$$
With such a definition, we have κ₀ = 0 and
$$(1.58)\qquad \kappa_n = \frac{d^n}{dt^n}\Lambda(0), \qquad n \in \mathbb{N}.$$
It is straightforward to define moment and cumulant generating functions for random vectors:
$$M(t) = E e^{t\cdot X}, \qquad \Lambda(t) = \log M(t), \qquad t \in \mathbb{R}^d.$$
These notions are useful, for example, in statistical physics [TKS95],
[KTH95].

1.11. The Borel-Cantelli Lemma


We now turn to a technical tool that will be useful in the next chapter. Given a sequence of events {Aₙ}∞ₙ₌₁, we are interested in the event that Aₙ occurs infinitely often, i.e.,
$$\{A_n \text{ i.o.}\} = \{\omega : \omega \in A_n \text{ i.o.}\} = \limsup_{n\to\infty} A_n = \bigcap_{n=1}^{\infty} \bigcup_{k=n}^{\infty} A_k.$$

The probability of such an event can be characterized by the Borel-Cantelli lemma.

Lemma 1.41.
(1) If $\sum_{n=1}^{\infty} P(A_n) < \infty$, then P({Aₙ i.o.}) = 0.
(2) If $\sum_{n=1}^{\infty} P(A_n) = \infty$ and the Aₙ’s are mutually independent, then P({Aₙ i.o.}) = 1.

Proof. (1) We have
$$P\Big(\bigcap_{n=1}^{\infty}\bigcup_{k=n}^{\infty} A_k\Big) \le P\Big(\bigcup_{k=n}^{\infty} A_k\Big) \le \sum_{k=n}^{\infty} P(A_k)$$
for any n. The last term goes to 0 as n → ∞ since $\sum_{k=1}^{\infty} P(A_k) < \infty$ by assumption.

(2) Using independence, one has
$$P\Big(\bigcup_{k=n}^{\infty} A_k\Big) = 1 - P\Big(\bigcap_{k=n}^{\infty} A_k^c\Big) = 1 - \prod_{k=n}^{\infty} P(A_k^c) = 1 - \prod_{k=n}^{\infty}\big(1 - P(A_k)\big).$$
Using 1 − x ≤ e⁻ˣ, this gives
$$P\Big(\bigcup_{k=n}^{\infty} A_k\Big) \ge 1 - \prod_{k=n}^{\infty} e^{-P(A_k)} = 1 - e^{-\sum_{k=n}^{\infty} P(A_k)} = 1.$$
The result follows immediately. □
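The dichotomy in Lemma 1.41 shows up clearly in simulation. In the sketch below (illustrative; the probabilities 1/n², 1/n and the seed are arbitrary choices), independent events with summable probabilities occur only a handful of times along a path, while in the divergent case the number of occurrences up to N keeps growing, roughly like log N:

```python
import random

# Borel-Cantelli numerically: independent events A_n with P(A_n) = 1/n^2
# (summable, so finitely many occur) versus P(A_n) = 1/n (divergent sum,
# so occurrences accumulate indefinitely).
rng = random.Random(5)
N = 100_000
count_sq = sum(rng.random() < 1 / n**2 for n in range(1, N + 1))
count_harm = sum(rng.random() < 1 / n for n in range(1, N + 1))
print(count_sq, count_harm)  # count_sq stays small; count_harm ~ log N
```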

As an example of the application of this result, we prove

Lemma 1.42. Let {Xₙ}_{n∈ℕ} be a sequence of identically distributed (not necessarily independent) random variables such that E|Xₙ| < ∞. Then
$$\lim_{n\to\infty} \frac{X_n}{n} = 0 \quad a.s.$$

Proof. For any ε > 0, define
$$A_n^\varepsilon = \{\omega \in \Omega : |X_n(\omega)/n| > \varepsilon\}.$$
Then
$$\begin{aligned}
\sum_{n=1}^{\infty} P(A_n^\varepsilon) &= \sum_{n=1}^{\infty} P(|X_n| > n\varepsilon) = \sum_{n=1}^{\infty} P(|X_1| > n\varepsilon) \\
&= \sum_{n=1}^{\infty}\sum_{k=n}^{\infty} P\big(k\varepsilon < |X_1| \le (k+1)\varepsilon\big) \\
&= \sum_{k=1}^{\infty} k\,P\big(k\varepsilon < |X_1| \le (k+1)\varepsilon\big) \\
&= \sum_{k=1}^{\infty} k \int_{k\varepsilon < |X_1| \le (k+1)\varepsilon} P(d\omega) \\
&\le \sum_{k=1}^{\infty} \frac{1}{\varepsilon} \int_{k\varepsilon < |X_1| \le (k+1)\varepsilon} |X_1|\,P(d\omega) \\
&= \frac{1}{\varepsilon} \int_{\varepsilon < |X_1|} |X_1|\,P(d\omega) \le \frac{1}{\varepsilon}\, E|X_1| < \infty.
\end{aligned}$$
Therefore if we define
$$B_\varepsilon = \{\omega \in \Omega : \omega \in A_n^\varepsilon \text{ i.o.}\},$$
then P(B_ε) = 0. Let $B = \bigcup_{n=1}^{\infty} B_{1/n}$. Then P(B) = 0, and
$$\lim_{n\to\infty} \frac{X_n(\omega)}{n} = 0 \qquad \text{if } \omega \notin B.$$
The proof is complete. □

With the Borel-Cantelli lemma, we can give the proof of (ii) in Theorem 1.32.

Proof. Without loss of generality, we may assume that X = 0. By the condition of convergence in probability we can take nₖ ∈ ℕ, increasing in k, such that
$$P\Big(|X_{n_k}| \ge \frac{1}{k}\Big) \le \frac{1}{2^k}$$
for each k ∈ ℕ. For any fixed ε > 0, there exists an integer k₀ such that k₀⁻¹ ≤ ε, and we have
$$\sum_{k=1}^{\infty} P(|X_{n_k}| \ge \varepsilon) \le \sum_{k=1}^{k_0} P(|X_{n_k}| \ge \varepsilon) + \sum_{k=k_0+1}^{\infty} P\Big(|X_{n_k}| \ge \frac{1}{k}\Big) \le \sum_{k=1}^{k_0} P(|X_{n_k}| \ge \varepsilon) + \sum_{k=k_0+1}^{\infty} \frac{1}{2^k} < \infty.$$
By the Borel-Cantelli lemma we obtain
$$P(|X_{n_k}| \ge \varepsilon \text{ i.o.}) = 0.$$
Thus the subsequence X_{n_k} converges to 0 almost surely by the same argument as in the example above. □

Exercises
In the following, i.i.d. stands for independent, identically distributed.

1.1. Let $X = \sum_{j=1}^{n} X_j$, where the $X_j$ are i.i.d. random variables with Bernoulli distribution with parameter $p$. Prove that $X \sim B(n, p)$.
1.2. Let $S_n$ be the number of heads in the outcome of $n$ independent tosses of a fair coin. Its distribution is given by
$$p_j = P(S_n = j) = \frac{1}{2^n}\binom{n}{j}.$$
Show that the mean and the variance of this random variable are
$$E S_n = \frac{n}{2}, \qquad \mathrm{Var}(S_n) = \frac{n}{4}.$$
This implies that $\mathrm{Var}(S_n)/(E S_n)^2 \to 0$ as $n \to \infty$.
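The two identities in Exercise 1.2 can be verified exactly from the pmf; a minimal sketch (the choice $n = 20$ is arbitrary, and all names are mine):

```python
from math import comb

n = 20
p = [comb(n, j) / 2**n for j in range(n + 1)]  # p_j = C(n, j) / 2^n

# Mean and variance computed directly from the pmf.
mean = sum(j * pj for j, pj in zip(range(n + 1), p))
var = sum((j - mean) ** 2 * pj for j, pj in zip(range(n + 1), p))

print(mean, var)  # ≈ n/2 and n/4 up to floating-point roundoff
```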
1.3. Prove that
$$C_n^k p^k q^{n-k} \longrightarrow \frac{\lambda^k}{k!} e^{-\lambda} \quad \text{for any fixed } k$$
as $n \to \infty$, $p \to 0$, and $np = \lambda$. This is the classical Poisson approximation to the binomial distribution.
1.4. Numerically investigate the limit processes
$$\text{Binomial } B(n, p) \longrightarrow \text{Poisson } \mathcal{P}(\lambda) \longrightarrow \text{Normal } N(\lambda, \lambda)$$
when $\lambda = np$, $n \gg 1$, and $\lambda \gg 1$, by comparing the plots for different distributions. Find the parameter regimes such that the approximation is valid.
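A concrete way to begin Exercise 1.4 is to measure the distance between the binomial and Poisson laws directly, e.g., in total variation. A sketch using only the standard library (the truncation level, $\lambda$, and the values of $n$ are my choices):

```python
from math import comb, exp

lam = 4.0  # fixed lambda = n * p

def poisson_pmf(lam, kmax):
    """P(lam) pmf on 0..kmax, via the recursion pi_k = pi_{k-1} * lam / k."""
    pi = [exp(-lam)]
    for k in range(1, kmax + 1):
        pi.append(pi[-1] * lam / k)
    return pi

def tv_binomial_poisson(n, kmax=60):
    """Total variation distance between B(n, lam/n) and P(lam), truncated at kmax."""
    p = lam / n
    pois = poisson_pmf(lam, kmax)
    binom = [comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(kmax + 1)]
    return 0.5 * sum(abs(b - q) for b, q in zip(binom, pois))

d10, d50, d250 = (tv_binomial_poisson(n) for n in (10, 50, 250))
print(d10, d50, d250)  # distances shrink as n grows with np = lam fixed
```

The truncation at `kmax` is harmless here because both laws put negligible mass beyond $k = 60$ for $\lambda = 4$.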
1.5. Suppose X ∼ P(λ), Y ∼ P(μ) are two independent Poisson random
variables.
(a) Prove that Z = X + Y ∼ P(λ + μ).
(b) Prove that the conditional distribution of X (or Y ), conditioned on X + Y = N, is a binomial distribution with parameters n = N and p = λ/(λ + μ) (or p = μ/(λ + μ)).
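Both parts of Exercise 1.5 can be sanity-checked numerically from the pmfs before attempting a proof (a sketch; the parameters and truncation are illustrative choices of mine):

```python
from math import comb, exp, factorial

lam, mu = 2.0, 3.0
K = 40  # truncation; the mass beyond K is negligible for these rates

def pois(rate, k):
    return rate**k * exp(-rate) / factorial(k)

# (a) The convolution of P(lam) and P(mu) should match P(lam + mu).
conv = [sum(pois(lam, j) * pois(mu, k - j) for j in range(k + 1)) for k in range(K)]
direct = [pois(lam + mu, k) for k in range(K)]
err_a = max(abs(c - d) for c, d in zip(conv, direct))

# (b) The conditional pmf of X given X + Y = N should be B(N, lam/(lam+mu)).
N = 7
p = lam / (lam + mu)
cond = [pois(lam, j) * pois(mu, N - j) / pois(lam + mu, N) for j in range(N + 1)]
binom = [comb(N, j) * p**j * (1 - p) ** (N - j) for j in range(N + 1)]
err_b = max(abs(c - b) for c, b in zip(cond, binom))

print(err_a, err_b)  # both at floating-point roundoff level
```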

1.6. Prove the following statements:
(a) (Memoryless property of the exponential distribution) Suppose $X \sim E(\lambda)$. Prove that
$$P(X > s + t \mid X > s) = P(X > t) \quad \text{for all } s, t > 0.$$
(b) Let $X \in (0, \infty)$ be a random variable such that
$$P(X > s + t) = P(X > s)P(X > t) \quad \text{for all } s, t > 0.$$
Prove that there exists $\lambda > 0$ such that $X \sim E(\lambda)$.
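Part (a) is quick to check by simulation before proving it (a sketch; $\lambda$, $s$, $t$, and the sample size are arbitrary choices of mine):

```python
import random, math

random.seed(1)
lam, s, t = 1.0, 0.7, 1.3
N = 400_000

samples = [random.expovariate(lam) for _ in range(N)]

# Empirical P(X > s + t | X > s) versus empirical P(X > t).
exceed_s = [x for x in samples if x > s]
cond = sum(1 for x in exceed_s if x > s + t) / len(exceed_s)
uncond = sum(1 for x in samples if x > t) / N

print(cond, uncond, math.exp(-lam * t))  # all three should be close
```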
1.7. Let X = (X1 , . . . , Xn ) be an n-dimensional Gaussian random vari-
able and let Y = c1 X1 + c2 X2 + · · · + cn Xn , where c1 , . . . , cn are
constants. Show that Y is also Gaussian.
1.8. Suppose that the multivariate Gaussian variable
$$\begin{pmatrix} X \\ Y \end{pmatrix} \sim N\left( \begin{pmatrix} \mu_x \\ \mu_y \end{pmatrix}, \begin{pmatrix} \Sigma_{xx} & \Sigma_{xy} \\ \Sigma_{yx} & \Sigma_{yy} \end{pmatrix} \right).$$
Prove that the conditional distribution of $X$ given $Y = y$ is Gaussian with mean $\tilde{\mu}$ and covariance $\tilde{\Sigma}$:
$$\tilde{\mu} = \mu_x + \Sigma_{xy}\Sigma_{yy}^{-1}(y - \mu_y), \qquad \tilde{\Sigma} = \Sigma_{xx} - \Sigma_{xy}\Sigma_{yy}^{-1}\Sigma_{yx}.$$

1.9. Pairwise independence does not imply independence. Let $X$, $Y$ be two independent random variables such that
$$P(X = \pm 1) = P(Y = \pm 1) = \frac{1}{2},$$
and let $Z = XY$. Check that $X$, $Y$, $Z$ are pairwise independent but not independent.
1.10. Independence does not imply conditional independence. Construct a concrete example such that
$$p(x, y) = p(x)p(y) \quad \text{for any } x, y,$$
but $p(x, y|z) \neq p(x|z)p(y|z)$ for some $x$, $y$, and $z$, where $x$, $y$, $z$ are the values of random variables $X$, $Y$, and $Z$.
1.11. Let $\mathcal{R}$ be the Borel σ-algebra on $\mathbb{R}$. For any random variable $X$, prove that the collection of sets $\mathcal{B} = X^{-1}(\mathcal{R}) = \{X^{-1}(B) : B \in \mathcal{R}\}$ forms a σ-algebra.
1.12. If $X_j$ ($j = 1, \ldots, n$) are independent random variables and $f_j$ ($j = 1, \ldots, n$) are Borel measurable functions, i.e., $f_j^{-1}(B) \in \mathcal{R}$ for any $B \in \mathcal{R}$, then $f_j(X_j)$ ($j = 1, \ldots, n$) are independent.
1.13. Prove that if the σ-algebra $\mathcal{F}$ has finitely many members, then there exist nonempty sets $B_1, \ldots, B_K \in \mathcal{F}$ such that $\Omega = \bigcup_{j=1}^{K} B_j$, $B_i \cap B_j = \emptyset$ for $i \neq j$, $i, j = 1, \ldots, K$, and any $B \in \mathcal{F}$ can be represented as a finite union of the sets in $\{B_j\}$.

1.14. (Variance identity) Suppose the random variable X has the par-
tition X = (X (1) , X (2) ). Prove the variance identity for any inte-
grable function f :

Var(f (X)) = Var(E(f (X)|X (2) )) + E(Var(f (X)|X (2) )).

1.15. Prove Theorem 1.24.


1.16. Let $X$ be a discrete random variable and let $p_i = P(X = i)$ for $i = 1, 2, \ldots, n$. Define the Shannon entropy of $X$ by
$$H(X) = -\sum_{i=1}^{n} p_i \log p_i.$$
Prove that $H(X) \ge 0$ and that the equality holds only when $X$ is reduced to a deterministic variable, i.e., $p_i = \delta_{ik_0}$ for some fixed integer $k_0$ in $\{1, 2, \ldots, n\}$. Here we take the convention $0 \log 0 = 0$.
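A small helper makes the claims of Exercise 1.16 concrete (a sketch; the distributions below are my examples):

```python
from math import log

def entropy(p):
    """Shannon entropy H = -sum p_i log p_i, with the convention 0 log 0 = 0."""
    assert abs(sum(p) - 1.0) < 1e-12
    return -sum(pi * log(pi) for pi in p if pi > 0.0)

h_uniform = entropy([0.25] * 4)                  # log 4, the maximum for n = 4
h_skewed = entropy([0.7, 0.1, 0.1, 0.1])         # strictly between 0 and log 4
h_deterministic = entropy([1.0, 0.0, 0.0, 0.0])  # the only way to get H = 0

print(h_uniform, h_skewed, h_deterministic)
```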
1.17. Let $X = (X_1, X_2, \ldots, X_n)$ be i.i.d. continuous random variables with common distribution function $F$ and density $\rho(x) = F'(x)$. $(X_{(1)}, X_{(2)}, \ldots, X_{(n)})$ are called the order statistics of $X$ if $X_{(k)}$ is the $k$th smallest of these random variables. Prove that the density of the order statistics is given by
$$p(x_1, x_2, \ldots, x_n) = n! \prod_{k=1}^{n} \rho(x_k), \qquad x_1 < x_2 < \cdots < x_n.$$
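One consequence of this density worth checking numerically: for i.i.d. $U(0,1)$ variables it implies $E X_{(k)} = k/(n+1)$. A seeded sketch (assuming NumPy; the parameters are illustrative choices of mine):

```python
import numpy as np

rng = np.random.default_rng(42)
n, trials = 5, 200_000

# Each row holds the order statistics of n independent uniforms.
u = np.sort(rng.random((trials, n)), axis=1)
means = u.mean(axis=0)
expected = np.arange(1, n + 1) / (n + 1)  # E X_(k) = k/(n+1) for U(0,1)

print(np.max(np.abs(means - expected)))  # small Monte Carlo error
```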

1.18. (Stable laws) A one-dimensional distribution $\mu(dx)$ is stable if, given any two independent random variables $X$ and $Y$ each with distribution $\mu(dx)$, there exist constants $a$ and $b$ such that
$$a(X + Y - b)$$
has distribution $\mu(dx)$. Show that $f(\xi) = e^{-|\xi|^{\alpha}}$ is the characteristic function of a stable distribution for $0 < \alpha \le 2$ and that it is not a characteristic function for other values of $\alpha$.
1.19. Prove that if the moment generating function M (t) can be defined
on an open set U , then M ∈ C ∞ (U ).
1.20. (Wick's theorem) For multivariate Gaussian random variables $X_1$, $X_2$, \ldots, $X_n$ with mean $0$, prove
$$E(X_1 X_2 \cdots X_k) = \begin{cases} \sum \prod E(X_i X_j), & k \text{ is even}, \\ 0, & k \text{ is odd}, \end{cases}$$
where the notation $\sum\prod$ means summing the products over all possible partitions of $X_1, \ldots, X_k$ into pairs; e.g., if $(X, Y, Z)$ is jointly Gaussian, we have
$$(1.59)\quad E(X^2 Y^2 Z^2) = (EX^2)(EY^2)(EZ^2) + 2(EYZ)^2 EX^2 + 2(EXY)^2 EZ^2 + 2(EXZ)^2 EY^2 + 8(EXY)(EYZ)(EXZ).$$
Each term in (1.59) can be schematically mapped to a graph on the vertices $X$, $Y$, $Z$, with an edge for each paired expectation [the diagrams are not reproduced in this extraction]. The coefficient of each term is the combinatorial number for generating the corresponding schematic combinations. These are essentially the so-called Feynman diagrams.
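Identity (1.59) can be spot-checked by Monte Carlo for a concrete covariance matrix (a sketch, assuming NumPy; the covariance and sample size are my choices, and agreement is only up to sampling error):

```python
import numpy as np

rng = np.random.default_rng(7)
# Mean-zero Gaussian (X, Y, Z) with unit variances and all pairwise covariances 0.5.
C = np.array([[1.0, 0.5, 0.5],
              [0.5, 1.0, 0.5],
              [0.5, 0.5, 1.0]])

# Right-hand side of (1.59) assembled from the entries of C.
exx, eyy, ezz = C[0, 0], C[1, 1], C[2, 2]
exy, exz, eyz = C[0, 1], C[0, 2], C[1, 2]
wick = (exx * eyy * ezz + 2 * eyz**2 * exx + 2 * exy**2 * ezz
        + 2 * exz**2 * eyy + 8 * exy * eyz * exz)

# Monte Carlo estimate of E[X^2 Y^2 Z^2].
samples = rng.multivariate_normal(np.zeros(3), C, size=2_000_000)
mc = np.mean(np.prod(samples**2, axis=1))

print(wick, mc)  # should agree to within a few percent
```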
1.21. Suppose that the events $\{A_n\}$ are mutually independent with
$$P\Big(\bigcup_n A_n\Big) = 1 \quad \text{and} \quad P(A_n) < 1$$
for each $n$. Prove that $P(A_n \text{ i.o.}) = 1$.
1.22. (Girko's circular law) If the $n \times n$ matrix $A$ has i.i.d. entries with mean zero and variance $\sigma^2$, then the eigenvalues of $A/(\sqrt{n}\,\sigma)$ are asymptotically uniformly distributed in the unit disk in the complex plane when $n \to \infty$. Investigate this numerically.
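A starting point for the numerical investigation (a sketch, assuming NumPy; the matrix size and seed are arbitrary): sample a Gaussian matrix, rescale, and look at the moduli of the eigenvalues. For a uniform law on the unit disk the mean modulus is $\int_0^1 r \cdot 2r\,dr = 2/3$.

```python
import numpy as np

rng = np.random.default_rng(3)
n, sigma = 500, 1.0

# i.i.d. mean-zero Gaussian entries; eigenvalues of A / (sqrt(n) * sigma).
A = rng.normal(0.0, sigma, size=(n, n))
lam = np.linalg.eigvals(A / (np.sqrt(n) * sigma))

inside = np.mean(np.abs(lam) < 1.1)  # fraction within a slightly enlarged unit disk
mean_radius = np.mean(np.abs(lam))

print(inside, mean_radius)  # nearly all inside; uniform disk gives mean modulus 2/3
```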

Notes
The axiomatic approach to probability theory starts with Kolmogorov’s mas-
terpiece [Kol56]. There are already lots of excellent textbooks on the in-
troduction to elementary probability theory [Chu01, Cin11, Dur10, KS07,
Shi96]. To know more about the measure theory approach to probability
theory, please consult [Bil79]. More on the practical side of probability
theory can be found in the classic book by Feller [Fel68] or the books by
some applied mathematicians and scientists [CH13, CT06, Gar09, VK04].
