
STAT0005: Probability and Inference

Kayvan Sadeghi

Based on Dr Yvo Pokern’s Lecture Notes

Department of Statistical Science


University College London

September 2024
Contents

Outline 7

1 Joint Probability Distributions 11

1.1 Revision of basic probability . . . . . . . . . . . . . . . . . . . . . . . . . . 11

1.2 Revision of random variables (univariate case) . . . . . . . . . . . . . . . . 14

1.2.1 What are Random Variables? . . . . . . . . . . . . . . . . . . . . . 14

1.2.2 Expectation of a Random Variable . . . . . . . . . . . . . . . . . . 15

1.2.3 Functions of a random variable . . . . . . . . . . . . . . . . . . . . 16

1.3 Joint distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18

1.3.1 The joint CDF . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18

1.3.2 Joint distribution: the discrete case . . . . . . . . . . . . . . . . . . 20

1.3.3 Joint Distribution: the continuous case . . . . . . . . . . . . . . . . 25

1.4 Further results on expectations . . . . . . . . . . . . . . . . . . . . . . . . 28

1.4.1 Expectation of a sum . . . . . . . . . . . . . . . . . . . . . . . . . 28

1.4.2 Expectation of a product . . . . . . . . . . . . . . . . . . . . . . . 28

1.4.3 Covariance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29


1.4.4 Correlation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30

1.4.5 Conditional variance . . . . . . . . . . . . . . . . . . . . . . . . . . 33

1.5 Standard multivariate distributions . . . . . . . . . . . . . . . . . . . . . . 35

1.5.1 From bivariate to multivariate . . . . . . . . . . . . . . . . . . . . . 35

1.5.2 The multinomial distribution . . . . . . . . . . . . . . . . . . . . . . 36

1.5.3 The multivariate normal distribution . . . . . . . . . . . . . . . . . . 39

1.5.4 Reminder: Matrix notation . . . . . . . . . . . . . . . . . . . . . . 41

1.5.5 Matrix Notation for Multivariate Normal Random Variables . . . . . 43

2 Transformation of Variables 47

2.1 Univariate case . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47

2.1.1 Discrete case . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47

2.1.2 Continuous case . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48

2.2 Bivariate case . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50

2.2.1 General Transformations . . . . . . . . . . . . . . . . . . . . . . . . 50

2.2.2 Sums of random variables . . . . . . . . . . . . . . . . . . . . . . . 51

2.3 Multivariate case . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52

2.4 Approximation of moments . . . . . . . . . . . . . . . . . . . . . . . . . . 52

2.5 Order Statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54

3 Generating Functions 57

3.1 The probability generating function (pgf) . . . . . . . . . . . . . . . . . . . 57



3.1.1 Definition of the pgf . . . . . . . . . . . . . . . . . . . . . . . . . . 57

3.1.2 Moments and the pgf . . . . . . . . . . . . . . . . . . . . . . . . . 58

3.2 The moment generating function (mgf) . . . . . . . . . . . . . . . . . . . . 59

3.2.1 Definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59

3.2.2 Moments and the mgf . . . . . . . . . . . . . . . . . . . . . . . . . 59

3.2.3 Linear Transformations and the mgf . . . . . . . . . . . . . . . . . . 60

3.3 Joint generating functions . . . . . . . . . . . . . . . . . . . . . . . . . . . 61

3.4 Linear combinations of random variables . . . . . . . . . . . . . . . . . . . 64

3.4.1 The Central Limit Theorem . . . . . . . . . . . . . . . . . . . . . . 67

4 Distributions of Functions of Normally Distributed Variables 73

4.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73

4.2 Reminder: Random Sample from a Normal Population . . . . . . . . . . . . 73

4.3 The chi-squared (χ2 ) distribution . . . . . . . . . . . . . . . . . . . . . . . 74

4.3.1 The distribution of $\sum (X_i - \bar{X})^2$ . . . . . . . . . . . . . . . . . . . 76

4.3.2 Student’s t distribution . . . . . . . . . . . . . . . . . . . . . . . . 78

4.4 The F distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79

5 Statistical Estimation 83

5.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83

5.2 Criteria for good estimators . . . . . . . . . . . . . . . . . . . . . . . . . . 83

5.2.1 Terminology: . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84

5.3 Methods for finding estimators . . . . . . . . . . . . . . . . . . . . . . . . 85

5.3.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85

5.3.2 The method of moments . . . . . . . . . . . . . . . . . . . . . . . 85

5.3.3 Least squares . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85

5.3.4 Maximum likelihood . . . . . . . . . . . . . . . . . . . . . . . . . . 86

6 Introduction to Bayesian Methods 91

6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91

6.1.1 Reminder: Rule of Total Probability and Bayes’ Theorem . . . . . . 91

6.1.2 Bayesian statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . 92

6.1.3 Calculation tools . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93

6.2 Conjugate family of distributions . . . . . . . . . . . . . . . . . . . . . . . 94

6.3 Bayes’ estimators . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95


Outline

Lecturer:

Kayvan Sadeghi

Aims of course

To continue the study of probability and statistics beyond the basic concepts introduced in
previous courses (see prerequisites below). To provide further study of probability theory, in
particular as it relates to multivariate distributions, and to introduce some formal concepts
and methods in statistical estimation.

Objectives of course

On successful completion of the course, a student should have an understanding of the


properties of joint distributions of random variables and be able to derive these properties
and manipulate them in straightforward situations; recognise the χ2 , t and F distributions of
statistics defined in terms of normal variables; be able to apply the ideas of statistical theory
to determine estimators and their properties in relation to a range of estimation criteria;
become familiar with the basic concepts of Bayesian inference.

Application areas

As with other core modules in probability and statistics, the material in this course has
applications in almost every field of quantitative investigation; the course expands on earlier


modules by introducing general-purpose techniques that are applicable in principle to a wide


range of real-life situations.

Prerequisites

STAT0002 and STAT0003 or MATH0057 or their equivalent.

Lectures

Lectures: 3 hours per week during term 1.


Tutorials: 1 hour per week during term 1 starting in the second week of term.

Exercise Sheets

There will be ten weekly exercise sheets in total, numbered 0 to 9. Small group tutorials
related to the exercise sheets will be held weekly: the tutorial related to Sheet 0 will be held in
the second week of term, and the one related to Sheet 8 in the last week of term. Exercise
Sheet 9 will have no tutorial, and consequently full solutions to it will be provided on Moodle.

Four of these exercise sheets are assessed: Exercise Sheets 2, 4, 6, and 8. These
four sheets make up a 25% in-course assessed component. The best three of the four
assessed sheet marks count. Your answers should be handed in on-line
to the Turnitin facility made available on the STAT0005 Moodle page by the deadline stated
on the exercise sheet (you can submit an entirely word-processed document, which will be a
lot of work, or a scan or photograph of hand-written work as long as the scan/photographs
are clearly legible, submitted in one single file, e.g. as pdf, and no contrary stipulations are
part of the exercise sheet).

If you are unable to meet the in-course assessment submission deadline for reasons outside
your control, for example illness or bereavement, you must submit a claim for extenuating
circumstances, normally within a week of the deadline. Your home department will advise
you of the appropriate procedures. For Statistical Science students, the relevant information
is on the DOSSSH Moodle page.

Your exercises will be handed back to you and solutions as well as common mistakes will be

discussed at tutorials. Succinct solutions to Sheets 1-8 will be available on Moodle.

Discussion Forum

There is a general discussion forum on Moodle, in which you are encouraged to take part.
You can post any questions regarding the course content including on the exercise questions
and you are allowed to do so peer-anonymously. You must not give away any of the answers
to the exercise sheet questions before the deadline has passed, however. (Staff can break
the anonymity if required, so please do not post any inappropriate content.) You are also
encouraged to answer other students’ questions on the forum, again as long as you do not
give away any of the answers to the exercise sheets before the deadline has passed. I do not
generally discuss mathematics by email. Please ask questions in the Moodle discussion forum
instead.

Summer Exam

A written closed-book examination paper in term 3 will make up 75% of the total mark for
the module. All questions need to be answered; past papers are available on Moodle. The
final mark is a weighted average of the written examination (75%) and the four assessed
exercise sheets (25%).

Attending Small Group Tutorials

If you do not attend small group tutorials (which are compulsory), then you will be asked to
discuss your progress with the Departmental Tutor. In an extreme case of non-participation
in tutorials, you may be banned from taking the summer exam for the course, which means
that you will be classified as ‘not complete’ for the course (in practice this means that you
will fail the course).

Feedback

Feedback in this course will be given mainly through two channels: written feedback on your
weekly exercise sheet and discussion of the exercise sheet, in particular of common mistakes,
in your small group tutorial. You can also come to the office hours to discuss any questions
you may have on the course material in greater detail.

Texts

The following texts are a small selection of the many good books available on this material.
They are recommended as being especially useful and relevant to this course. The first book
listed is particularly recommended. It includes large numbers of sensible worked examples
and exercises (with answers to selected exercises) and also covers material on data analysis
that will be useful for other statistics courses. Books marked ‘*’ are slightly more theoretical
and cover more details than given in the lectures. Overall, from past experience, the lecture
script contains all the relevant material and there are plenty of examples in the lecture notes,
homework sheets and past exams so that you should not need to use a book if you do not
want to.

• J. A. Rice: Mathematical Statistics and Data Analysis. (Third edition; 2006) Duxbury.

• D. D. Wackerly, W. Mendenhall & R. L. Scheaffer: Mathematical Statistics with Applications. (Sixth edition; 2002) Duxbury.

• R. V. Hogg & E. A. Tanis: Probability and Statistical Inference. (Sixth edition; 2001) Prentice Hall.

* G. Casella & R. L. Berger: Statistical Inference. (Second edition; 2001) Duxbury.

* V. K. Rohatgi & E. Saleh: An Introduction to Probability and Statistics. (Second edition; 2001) Wiley.
Chapter 1

Joint Probability Distributions

Joint probability distributions (or multivariate distributions) describe the joint behaviour
of two or more random variables. Before introducing this new concept we will revise
the basic notions related to the distribution of only one random variable.

1.1 Revision of basic probability

The fundamental idea of probability is that chance can be measured on a scale which runs
from zero, which represents impossibility, to one, which represents certainty.

Sample space, Ω: the set of all outcomes of an experiment (real or hypothetical).

Event, A: a subset of Ω, written A ⊆ Ω. The elements ω ∈ Ω are called elementary


events or outcomes.

Event Space, A: The family of all events A whose probability we may be interested in. A
is a family of sets, so e.g. the events A1 ⊆ Ω and A2 ⊆ Ω may be contained in it: A1 ∈ A,
A2 ∈ A. The event space always contains Ω, i.e. Ω ∈ A.

Probability measure, P : a mapping from the event space to [0, 1]. To qualify as a
probability measure P must satisfy the following axioms of probability:

1. P (A) ≥ 0 for any event A ∈ A;

2. P (Ω) = 1;


3. Countable additivity: If A1 , A2 , . . . is a sequence of pairwise disjoint sets (i.e. Ai ∩ Aj = ∅ for all i ̸= j) then

$$P\left(\bigcup_{i=1}^{\infty} A_i\right) = P(A_1 \cup A_2 \cup \ldots) = \sum_{i=1}^{\infty} P(A_i).$$

If Ω is countable ( i.e. Ω = {ω1 , ω2 , ω3 , . . . }) then the event space A can be chosen to


include all subsets A ⊆ Ω. We will always make this choice in this course.

If Ω is uncountable, like the real numbers, we have to define a ‘suitable’ family of subsets,
i.e. the event space A does not contain all subsets of Ω. However, in practice the event
space can always be constructed to include all events of interest.

From the axioms of probability one can mathematically prove the addition rule:

For any two events A and B we have:

P (A ∪ B) = P (A) + P (B) − P (A ∩ B)

Events A and B are said to be independent if P (A ∩ B) = P (A)P (B).

Events A1 , A2 , . . . , An are independent if

P (Ai1 ∩ . . . ∩ Aik ) = P (Ai1 ) . . . P (Aik )

for all possible choices of k and 1 ≤ i1 < i2 < · · · < ik ≤ n. That is, the product rule must
hold for every subclass of the events A1 , . . . , An .

Note: In some contexts this would be called mutual independence. Whenever we speak of
independence of more than two events or random variables, in this course, we mean mutual
independence.

Example 1.1 Consider two independent tosses of a fair coin and the events A = ‘first toss
is head’, B = ‘second toss is head’, C = ‘different results on two tosses’.
Find the sample space, the probability of an elementary event and the individual probabilities
of A, B, and C.
Show that A, B, and C are not independent.

Suppose that P (B) > 0. Then the conditional probability of A given B, P (A|B) is defined
as
P (A ∩ B)
P (A|B) =
P (B)
i.e. the relative weight attached to event A within the restricted sample space B. The
conditional probability is undefined if P (B) = 0. Note that P (·|B) is a probability measure
on B. Further note that if A and B are independent events then P (A|B) = P (A) and
P (B|A) = P (B).

The above conditional probability formula yields the multiplication rule

P (A ∩ B) = P (A|B)P (B)
= P (B|A)P (A)

Note that if P (B|A) = P (B) then we recover the multiplication rule for independent events.

Two events A and B are conditionally independent given a third event C if

P (A ∩ B|C) = P (A|C)P (B|C).

Conditional independence means that once we know that C is true A carries no information
on B. Note that conditional independence does not imply independence, nor vice versa.

Example 1.2 (1.1 ctd.) Show that A and B are not conditionally independent given C.

The law of total probability, or partition law follows from the additivity axiom and the
definition of conditional probability: suppose that B1 , . . . , Bk are mutually exclusive and
exhaustive events ( i.e. Bi ∩ Bj = ∅ for all i ̸= j and ∪i Bi = Ω) and let A be any event.
Then
$$P(A) = \sum_{j=1}^{k} P(A \cap B_j) = \sum_{j=1}^{k} P(A \mid B_j)\,P(B_j).$$

Example 1.3 A child gets to throw a fair die. If the die comes up 5 or 6, she gets to sample
a sweet from box A which contains 10 chocolate sweets and 20 caramel sweets. If the die
comes up 1,2,3 or 4 then she gets to sample a sweet from box B which contains 5 chocolate
sweets and 15 caramel sweets. What is the conditional probability she will get a chocolate
sweet if the die comes up 5 or 6? What is the conditional probability she will get a chocolate
sweet if the die comes up 1,2,3 or 4? What is her probability of getting a chocolate sweet?

Bayes' theorem follows from the law of total probability and the multiplication rule. Again,
let B1 , . . . , Bk be mutually exclusive and exhaustive events and let A be any event with
P (A) > 0. Then Bayes' theorem states that
$$P(B_i \mid A) = \frac{P(A \mid B_i)\,P(B_i)}{\sum_{j=1}^{k} P(A \mid B_j)\,P(B_j)}.$$

Bayes’ theorem can be used to update the probability P (Bi ) attached to some belief Bi
held before an experiment is conducted in the light of the new information obtained in the
experiment. P (Bi ) is then called the a priori probability and P (Bi |A) is known as the a
posteriori probability. A lot more on this later!
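For readers who like to check such calculations numerically, here is a minimal simulation sketch in Python (numpy only) for the sweets experiment of Example 1.3; the box contents are taken from that example, and the closed-form values printed alongside are just the partition law and Bayes' theorem applied to those numbers.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Simulate Example 1.3: the die chooses the box, then a sweet is drawn.
die = rng.integers(1, 7, size=n)              # fair die, values 1..6
box_a = die >= 5                              # 5 or 6 -> box A, else box B
p_choc = np.where(box_a, 10 / 30, 5 / 20)     # P(chocolate | box)
choc = rng.random(n) < p_choc

# Law of total probability: P(choc) = P(choc|A)P(A) + P(choc|B)P(B)
print("simulated P(chocolate):", choc.mean())
print("partition law:         ", (10 / 30) * (2 / 6) + (5 / 20) * (4 / 6))

# Bayes' theorem: P(box A | chocolate sweet)
print("simulated P(A | choc): ", box_a[choc].mean())
print("Bayes' theorem:        ",
      (10 / 30) * (2 / 6) / ((10 / 30) * (2 / 6) + (5 / 20) * (4 / 6)))
```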

1.2 Revision of random variables (univariate case)

1.2.1 What are Random Variables?

A random variable, X, assigns a real number x ∈ R to each element ω ∈ Ω of the sample


space Ω. The probability measure P on Ω then gives rise to a probability distribution for X.

More formally, any (measurable) function X : Ω → R is called a random variable. The


random variable X may be discrete or continuous. The probability measure P on Ω induces
a probability distribution for X. In particular, X has (cumulative) distribution function
(cdf) FX (x) = P ({ω : X(ω) ≤ x}), which is usually abbreviated to P (X ≤ x). It follows
that FX (−∞) = 0, FX (∞) = 1. Also, FX is non-decreasing and right-continuous (though
not necessarily continuous), and P (a < X ≤ b) = FX (b) − FX (a).

Example 1.4 Give an example of a random variable whose cdf is right-continuous (it has to
be) but not continuous.

Discrete random variables

X takes only a finite or countably infinite set of values {x1 , x2 , . . .}. FX is a step-function,
with steps at the xi of sizes pX (xi ) = P (X = xi ), and pX (·) is the probability mass
function (pmf) of X. ( E.g. X = place of horse in race, grade of egg.) CDFs of discrete
random variables are only right-continuous but not continuous.

Example 1.5 (1.1 ctd. II) Consider the random variable X = number of heads obtained
on the two tosses. Obtain the pmf and cdf of X. Sketch the cdf – is it continuous?

Example 1.6 Consider the random variable X ∼ Geo(p) with P (X = k) = (1 − p)k−1 p


where k ∈ N. Compute the cdf and sketch it. Is X a discrete or a continuous random
variable?

Continuous random variables

When FX can be expressed as

$$F_X(x) = \int_{-\infty}^{x} f_X(u)\,du$$

for a non-negative function $f_X \ge 0$ which integrates to one, i.e. $\int_{-\infty}^{\infty} f_X(x)\,dx = 1$, then FX
is the cdf of a continuous random variable. fX is called the probability density function
(pdf) of X. Continuous random variables X take values in a non-countable set and

P (x < X ≤ x + dx) ≈ fX (x) dx.

Thus fX (x) dx is the probability that X lies in the infinitesimal interval (x, x + dx). Note
that the probability that X is exactly equal to x is zero for all x ( i.e. P (X = x) = 0).

If FX is a valid cdf which is continuous with piecewise derivative g, then FX is the cdf of a
continuous random variable and the pdf is given by g.

Example 1.7 Suppose fX (x) = k(2 − x2 ) on (−1, 1). Calculate k and sketch the pdf.
Calculate and sketch the cdf. Is the cdf differentiable? Calculate P (|X| > 1/2).

1.2.2 Expectation of a Random Variable

A distribution has several characteristics that could be of interest, such as its shape or
skewness. Another one is its expectation, which can be regarded as a summary of the
‘average’ value of a random variable.

Discrete case:

$$E[X] = \sum_{i} x_i\, p_X(x_i) = \sum_{\omega} X(\omega)\, P(\{\omega\})\,.$$

That is, the averaging can be taken over the (distinct) values of X with weights given by the
probability distribution pX , or over the sample space Ω with weights P({ω}).

Continuous case:

$$E[X] = \int_{-\infty}^{\infty} x\, f_X(x)\,dx.$$

Note: Integration is applied for continuous random variables, summation is applied for
discrete random variables. Make sure not to confuse the two.
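As a minimal numerical sketch of the two recipes (the particular distributions below are chosen only for illustration), the discrete sum and the continuous integral can be evaluated in Python with numpy and scipy:

```python
import numpy as np
from scipy import integrate, stats

# Discrete case: E[X] = sum_i x_i p_X(x_i), here for X ~ Bin(10, 0.3).
x = np.arange(0, 11)
pmf = stats.binom.pmf(x, n=10, p=0.3)
print("discrete E[X]:", np.sum(x * pmf))             # = 10 * 0.3 = 3

# Continuous case: E[X] = integral of x f_X(x) dx, here for an Exponential(rate = 2) pdf.
f = lambda t: 2.0 * np.exp(-2.0 * t)                 # pdf on t >= 0
ex, _ = integrate.quad(lambda t: t * f(t), 0, np.inf)
print("continuous E[X]:", ex)                        # = 1/2
```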

Example 1.8 The discrete random variable X has pmf $p_X(k) = \frac{1}{e^{\mu}-1}\,\frac{\mu^k}{k!}$ for $k \in \mathbb{N}$. Compute
its expectation.

1.2.3 Functions of a random variable

Let ϕ be a real-valued function on R; that is, ϕ : R → R. Then the random variable


Y = ϕ(X) is defined by
Y (ω) ≡ ϕ(X)(ω) = ϕ(X(ω))

Since X : Ω → R, it follows that ϕ(X) : Ω → R. Thus Y = ϕ(X) is also a random variable


and the above definitions apply. In particular, we have
$$E[\phi(X)] = \sum_{i} \phi(x_i)\,p_X(x_i) = \sum_{\omega} \phi(X(\omega))\,P(\{\omega\})\,.$$

The first expression on the right-hand side averages the values of ϕ(x) over the distribution
of X, whereas the second expression averages the values of ϕ(X(ω)) over the probabilities
of ω ∈ Ω. A third method would be to compute the distribution of Y and average the values
of y over the distribution of Y .

Example 1.9 (1.1 ctd. III) Let X be the random variable indicating the number of heads
on two tosses. Consider the transformation ϕ with ϕ(0) = ϕ(2) = 0 and ϕ(1) = 1.
Find E[X] and E[ϕ(X)].

The variance of X is
$$\sigma^2 = \mathrm{Var}(X) = E\left[(X - E[X])^2\right].$$

Equivalently $\sigma^2 = E[X^2] - \{E[X]\}^2$ (exercise: prove). The square root, $\sigma$, of $\sigma^2$ is called


the standard deviation.

Example 1.10 (1.1 ctd. IV) Find Var(X) and Var{ϕ(X)}.

Linear functions of X

The following properties of expectation and variance are easily proved ( exercise/previous
notes):
E[a + bX] = a + bE[X], Var(a + bX) = b2 Var(X)
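A quick empirical sketch of these two rules (the Gamma distribution below is an arbitrary illustrative choice; its mean is 6 and its variance is 18):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.gamma(shape=2.0, scale=3.0, size=500_000)   # E[X] = 6, Var(X) = 18 for this choice
a, b = 5.0, -2.0

print(np.mean(a + b * x), a + b * 6.0)              # E[a + bX] = a + b E[X] = -7
print(np.var(a + b * x),  b**2 * 18.0)              # Var(a + bX) = b^2 Var(X) = 72
```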

Example 1.11 (1.1 ctd. V) Let Y be the excess of heads over tails obtained on the two
tosses of the coin. Write down E[Y ] and Var(Y ).

Standard distributions. For ease of reference, Appendices 1 and 2 provide definitions of


standard discrete and continuous distributions given in earlier courses.

Learning Outcomes: Most of the material in STAT0002 and STAT0003 (or MATH0057)
is relevant and important for STAT0005. Students are strongly advised to revise this
material if they don’t feel confident about basic probability.
In particular, regarding subsections 1.1 and 1.2, you should be able to

1. Explain the concept of (mutual) independence of events and apply it to new


situations and examples;
2. Define conditional independence and verify it in a concrete situation;
3. Name and check properties of pdfs, cdfs and pmfs in concrete examples and decide
whether a given random variable is discrete or continuous;
4. Compute the expectation (of a transformation) of a discrete or continuous random
variable.
5. Be familiar with standard discrete and continuous distributions.

1.3 Joint distributions

1.3.1 The joint CDF

Let us first consider the bivariate case. Suppose that the two random variables X and Y
share the same sample space Ω ( e.g. the height and the weight of an individual). Then we
can consider the event
{ω : X(ω) ≤ x, Y (ω) ≤ y}
and define its probability, regarded as a function of the two variables x and y, to be the
joint (cumulative) distribution function of X and Y , denoted by

FX,Y (x, y) = P ({ω : X(ω) ≤ x, Y (ω) ≤ y})


= P (X ≤ x, Y ≤ y).

It is often helpful to think geometrically about X and Y : In fact, (X, Y ) is a random point on
the two-dimensional Euclidean plane, R2 , i.e. each outcome of the pair of random variables
X and Y , or equivalently each outcome of the bivariate random variable (X, Y ) corresponds
to the point in R2 whose horizontal coordinate is X and whose vertical coordinate is Y . For
this reason, (X, Y ) is also called a random vector. FX,Y (x, y) is then simply the probability
that the point lands in the semi-infinite rectangle (−∞, x] × (−∞, y] = {(a, b) ∈ R2 : a ≤
x and b ≤ y}.

The joint cumulative distribution function (cdf) has similar properties to the univariate cdf.
If the function FX,Y (x, y) is the joint distribution function of random variables X and Y then

1. FX,Y (−∞, y) = FX,Y (x, −∞) = 0 and FX,Y (∞, ∞) = 1 and



2. FX,Y is a non-decreasing function of each of its arguments

3. FX,Y must also be right-continuous. That is, FX,Y (x + h, y + k) → FX,Y (x, y) as


h, k ↓ 0 for all x, y.

The marginal cdfs of X and Y can be found from

FX (x) = P (X ≤ x, Y < ∞) = FX,Y (x, ∞)

and
FY (y) = P (X < ∞, Y ≤ y) = FX,Y (∞, y)
respectively.

We already know in the univariate case that P (x1 < X ≤ x2 ) = FX (x2 ) − FX (x1 ). Similarly,
we find in the bivariate case that

P (x1 < X ≤ x2 , y1 < Y ≤ y2 ) =


FX,Y (x2 , y2 ) − FX,Y (x1 , y2 ) − FX,Y (x2 , y1 ) + FX,Y (x1 , y1 )

Understanding this expression is straightforward using the geometric interpretation: To calcu-


late the probability of (X, Y ) lying in the rectangle (x1 , x2 ]×(y1 , y2 ] one takes the probability
of lying in the rectangle (−∞, x2 ]×(−∞, y2 ] and subtracts two probabilities: firstly, the prob-
ability of landing in the rectangle (−∞, x1 ]×(−∞, y2 ] and secondly the probability of landing
in the rectangle (−∞, x2 ] × (−∞, y1 ]. Unfortunately, we have now subtracted the probability
that we land in the rectangle (−∞, x1 ] × (−∞, y1 ] twice, so we need to add it again to
compensate for this mistake. It may help for you to draw a sketch of all those rectangles
here:

Example 1.12 Consider the function

$$F_{X,Y}(x, y) = x^2 y + y^2 x - x^2 y^2, \qquad 0 \le x \le 1,\ 0 \le y \le 1\,.$$

extended by suitable constants outside (0, 1)2 such as to make it a cdf. Show that FX,Y
has the properties of a cdf mentioned above. Find the marginal cdfs of X and Y . Also find
P (0 ≤ X ≤ 12 , 0 ≤ Y ≤ 12 ).

Are the properties of bivariate cdfs given so far enough to decide whether a given function on
R2 is a cdf? Unfortunately, the answer is negative. A positive result is given by the following
lemma:

Lemma 1 A function F : R2 → [0, 1] is a cdf if and only if

• F is right-continuous,

• limx,y→∞ F (x, y) = 1,

• limx→−∞ F (x, y) = 0 for any y ∈ R,

• limy→−∞ F (x, y) = 0 for any x ∈ R and

• F (x2 , y2 ) − F (x1 , y2 ) − F (x2 , y1 ) + F (x1 , y1 ) ≥ 0 for any real numbers x2 ≥ x1 and


y2 ≥ y1 .
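The rectangle formula is easy to check numerically. The sketch below (a hypothetical illustration) uses the joint cdf of two independent U(0, 1) variables, for which the probability of a rectangle inside the unit square should simply equal its area:

```python
import numpy as np

# Joint cdf of two independent U(0, 1) variables: F(x, y) = clip(x, 0, 1) * clip(y, 0, 1).
F = lambda x, y: np.clip(x, 0.0, 1.0) * np.clip(y, 0.0, 1.0)

def rect_prob(F, x1, x2, y1, y2):
    """P(x1 < X <= x2, y1 < Y <= y2) computed from the joint cdf."""
    return F(x2, y2) - F(x1, y2) - F(x2, y1) + F(x1, y1)

# For independent uniforms the answer should be (x2 - x1) * (y2 - y1).
print(rect_prob(F, 0.2, 0.7, 0.1, 0.4))   # 0.5 * 0.3 = 0.15
```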

1.3.2 Joint distribution: the discrete case

Cumulative distribution functions fully specify the distribution of a random variable - they
encode everything there is to know about that distribution. However, as Example 1.12
showed, they can be somewhat difficult to handle. Making additional assumptions about the
random variables makes their distribution easier to handle, so let’s assume in this part that
X and Y take only values in a countable set, i.e. that (X, Y ) is a discrete bivariate random
variable. Then FX,Y is a step function in each variable separately and we consider the joint
probability mass function

pX,Y (xi , yj ) = P (X = xi , Y = yj ).

It is often convenient to represent a discrete bivariate distribution — a joint distribution


of two variables — by a two-way table. In general, the entries in the table are the joint
probabilities pX,Y (x, y), while the row and column totals give the marginal probabilities
pX (x) and pY (y). As always, the total probability is 1.

Example 1.13 Consider three independent tosses of a fair coin. Let X = ‘number of heads
in first and second toss’ and Y = ‘number of heads in second and third toss’. Give the
probabilities for any combination of possible outcomes of X and Y in a two-way table and
obtain the marginal pmfs of X and Y .

In general, from the joint distribution we can use the law of total probability to obtain the
marginal pmf of Y as
X
pY (yj ) = P (Y = yj ) = P (X = xi , Y = yj )
xi
X
= pX,Y (xi , yj ) .
xi

Similarly, the marginal pmf of X is given by


X
pX (xi ) = pX,Y (xi , yj ) .
yj

The marginal distribution is thus the distribution of just one of the variables.

The joint cdf can be written as


XX
FX,Y (x, y) = pX,Y (xi , yj ) .
xi ≤x yj ≤y

Note that there will be jumps in FX,Y at each of the xi and yj values.

Independence

The random variables X and Y , defined on the sample space Ω with probability measure P,
are independent if the events

{X = xi } and {Y = yj }

are independent events, for all possible values xi and yj . Thus X and Y are independent
if

pX,Y (xi , yj ) = P (X = xi , Y = yj ) = pX (xi )pY (yj ) (1.1)

for all xi , yj . This implies that P (X ∈ A, Y ∈ B) = P (X ∈ A)P (Y ∈ B) for all sets A and
B, so that the two events {ω : X(ω) ∈ A}, {ω : Y (ω) ∈ B} are independent. (Exercise:
prove this.)

NB: If x is such that pX (x) = 0, then pX,Y (x, yj ) = 0 for all yj and (1.1) holds automatically.
Thus it does not matter whether we require (1.1) for all possible xi , yj i.e. those with
positive probability, or all real x, y. (That is, pX,Y (x, y) = pX (x)pY (y) for all x, y would be
an equivalent definition of independence.)

If X, Y are independent then the entries in the two-way table are the products of the marginal
probabilities. In Example 1.13 we see that X and Y are not independent.
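In practice a two-way table can be held as a small array; the marginals are then row and column sums, and independence corresponds to the table equalling the outer product of its marginals. The sketch below uses a hypothetical table (not one from the examples) that happens to be constructed as independent:

```python
import numpy as np

# Hypothetical joint pmf p_{X,Y}(x_i, y_j): rows index the x values, columns the y values.
p = np.array([[0.10, 0.20, 0.10],
              [0.15, 0.30, 0.15]])
assert np.isclose(p.sum(), 1.0)            # total probability is 1

p_x = p.sum(axis=1)                        # marginal pmf of X (row totals)
p_y = p.sum(axis=0)                        # marginal pmf of Y (column totals)

# X and Y are independent iff every entry equals the product of its marginals.
print(np.allclose(p, np.outer(p_x, p_y)))  # True for this particular table
```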

Conditional probability distributions

These are defined for random variables by analogy with conditional probabilities of events.
Consider the conditional probability
$$P(X = x_i \mid Y = y_j) = \frac{P(X = x_i,\ Y = y_j)}{P(Y = y_j)} = \frac{p_{X,Y}(x_i, y_j)}{p_Y(y_j)}$$

as a function of xi , for fixed yj . Then this is a probability mass function — it is non-negative


and
$$\sum_{x_i} \frac{p_{X,Y}(x_i, y_j)}{p_Y(y_j)} = \frac{1}{p_Y(y_j)} \underbrace{\sum_{x_i} p_{X,Y}(x_i, y_j)}_{p_Y(y_j)} = 1,$$

and it gives the probabilities for observing X = xi given that we already know Y = yj . We
therefore define the conditional probability distribution of X given Y = yj as

$$p_{X|Y}(x_i \mid y_j) = \frac{p_{X,Y}(x_i, y_j)}{p_Y(y_j)}$$

Conditioning on Y = yj can be compared to selecting a subset of the population, i.e. only


those individuals where Y = yj . The conditional distribution pX|Y of X given Y = yj then
describes the distribution of X within this subgroup.

From the above definition we immediately obtain the multiplication rule for pmfs:

pX,Y (xi , yj ) = pX|Y (xi |yj )pY (yj )

which can be used to find a bivariate pmf when we know one marginal distribution and one
conditional distribution.

Note that if X and Y are independent then pX,Y (xi , yj ) = pX (xi )pY (yj ) so that pX|Y (xi |yj ) =
pX (xi ), i.e. the conditional distribution is the same as the marginal distribution.

In general, X and Y are independent if and only if the conditional distribution of X given
Y = yj is the same as the marginal distribution of X for all yj . (This condition is equivalent
to pX,Y (xi , yj ) = pX (xi )pY (yj ) for all xi , yj , above).

The conditional distribution of Y given X = xi is defined similarly.

Example 1.14 (1.13 ctd.) Obtain the conditional pmf of X given Y = y. Use this condi-
tional distribution to verify that X and Y are not independent.

Example 1.15 Suppose that R and N have a joint distribution in which R|N is Bin(N, π)
and N is Poi(λ). Show that R is Poi(πλ).

Conditional expectation

Since pX|Y (xi |yj ) is a probability distribution, it has a mean or expected value:
$$E[X \mid Y = y_j] = \sum_{x_i} x_i\, p_{X|Y}(x_i \mid y_j)$$

which represents the average value of X among outcomes ω for which Y (ω) = yj . This
may also be written EX|Y [X|Y = yj ]. We can also regard the conditional expectation
E[X|Y = yj ] as the mean value of X in the subgroup characterised by Y = yj .

Example 1.16 (1.13 ctd. II) Find the conditional expectations E[X|Y = y] for y =
0, 1, 2. Plot the graph of the function ϕ(y) = E[X|Y = y]. What do these values tell
us about the relationship between X and Y ?

In general, what is the relationship between the unconditional expectation E[X] and the
conditional expectation E[X|Y = yj ]?

Example 1.17 Collect the joint distribution of X: gender (x1 = M, x2 = F ) and Y :


number of cups of tea drunk today (y1 = 0, y2 = 1, y3 = 2, y4 = 3 or more). Are X and Y
independent? What is the expectation of Y ? What is the conditional expectation Y |X = M
and Y |X = F ?

We see from the above example that the overall mean is just the average of the conditional
means. We now prove this fact in general. Consider the conditional expectation ϕ(y) =
EX|Y [X|Y = y] as a function of y. This function ϕ may be used to transform the random
variable Y , i.e. we can consider the new random variable ϕ(Y ). This random variable is
usually written simply EX|Y [X|Y ] because the possibly more correct notation EX|Y [X|Y =
Y ] would be even more confusing! We may then compute the expectation of our new random
variable ϕ(Y ), i.e. E[ϕ(Y )] = EY [EX|Y [X|Y ]]. But, from the definition of the expectation
of a function of Y , we have $E[\phi(Y)] = \sum_{y_j} \phi(y_j)\,p_Y(y_j)$, so that

$$E_Y\big[E_{X|Y}[X \mid Y]\big] = \sum_{y_j} \underbrace{E[X \mid Y = y_j]}_{\text{function of } y_j}\, p_Y(y_j).$$

This gives the marginal expectation E[X], as will be shown in the lectures. That is,

$$E[X] = E_Y\big[E_{X|Y}[X \mid Y]\big]$$

which is known as the iterated conditional expectation formula. It is most useful when
the conditional distribution of X given Y = y is known and easier to handle than the joint
distribution (requiring integration/summation to find the marginal of X if it is not known).
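A minimal numerical sketch of the formula, using a hypothetical pair of discrete distributions specified through $p_Y$ and $p_{X|Y}$ (the numbers are illustrative only):

```python
import numpy as np

# Hypothetical setup: Y takes values 0, 1; X | Y = y takes values 0, 1, 2.
p_y = np.array([0.6, 0.4])
p_x_given_y = np.array([[0.5, 0.3, 0.2],    # pmf of X | Y = 0
                        [0.1, 0.4, 0.5]])   # pmf of X | Y = 1
x_vals = np.array([0.0, 1.0, 2.0])

# Conditional expectations E[X | Y = y], then average them over the distribution of Y.
e_x_given_y = p_x_given_y @ x_vals
print("E_Y[ E[X|Y] ]:", np.dot(p_y, e_x_given_y))

# Direct computation from the joint pmf p_{X,Y}(x, y) = p_{X|Y}(x | y) p_Y(y).
joint = p_x_given_y * p_y[:, None]
print("E[X] directly:", np.sum(joint * x_vals[None, :]))
```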

Example 1.18 (1.13 ctd. III) Verify that E[X] = EY [EX|Y [X|Y ]] in this example.

Example 1.19 (1.15 ctd.) Find the mean of R using the iterated conditional expectation
formula.

Note that the definition of expectation generalises immediately to functions of two variables,
i.e.

$$\begin{aligned}
E[\phi(X, Y)] &= \sum_{\omega} \phi(X(\omega), Y(\omega))\,P(\{\omega\}) \\
&= \sum_{x_i}\sum_{y_j} \phi(x_i, y_j)\,P(\{\omega : X(\omega) = x_i,\ Y(\omega) = y_j\}) \\
&= \sum_{x_i}\sum_{y_j} \phi(x_i, y_j)\,p_{X,Y}(x_i, y_j)
\end{aligned}$$

and that the above result on conditional expectations generalises too, since
$$\begin{aligned}
E[\phi(X, Y)] &= \sum_{x_i}\sum_{y_j} \phi(x_i, y_j)\,p_{X|Y}(x_i \mid y_j)\,p_Y(y_j) \\
&= \sum_{y_j} p_Y(y_j) \underbrace{\sum_{x_i} \phi(x_i, y_j)\,p_{X|Y}(x_i \mid y_j)}_{E_{X|Y}[\phi(X, y_j) \mid y_j]} \\
&= E_Y\big[E_{X|Y}[\phi(X, Y) \mid Y]\big]\,.
\end{aligned}$$

Taking out what is known (TOK)

EX|Y [ϕ(Y )ψ(X, Y )|Y ] = ϕ(Y )EX|Y [ψ(X, Y )|Y ]

This will be shown in lectures for discrete random variables only. It also holds for continuous
random variables, however.

Example 1.20 Consider two discrete random variables X and Y , where the marginal prob-
abilities of Y are P (Y = 0) = 3/4, P (Y = 1) = 1/4 and the conditional probabilities of
X are P (X = 1|Y = 0) = P (X = 2|Y = 0) = 1/2 and P (X = 0|Y = 1) = P (X =
1|Y = 1) = P (X = 2|Y = 1) = 1/3. Use the iterated conditional expectation formula to
find E(XY ).

1.3.3 Joint Distribution: the continuous case

We consider now the case where both X and Y take values in a continuous range (i.e. their
set of possible values is uncountable) and their joint distribution function FX,Y (x, y) can be
expressed as
$$F_{X,Y}(x, y) = \int_{-\infty}^{x}\int_{-\infty}^{y} f_{X,Y}(u, v)\,dv\,du$$

where fX,Y (x, y) is the joint probability density function of X and Y . In short, we
consider a bivariate continuous random variable (X, Y ).

Letting y → ∞ we get
$$F_X(x) = F_{X,Y}(x, \infty) = \int_{-\infty}^{x}\left(\int_{-\infty}^{\infty} f_{X,Y}(u, v)\,dv\right)du\,.$$
But from §1.2 we also know that $F_X(x) = \int_{-\infty}^{x} f_X(u)\,du$. It follows that the marginal
density function of X is
$$f_X(x) = \int_{-\infty}^{\infty} f_{X,Y}(x, v)\,dv$$

Similarly, Y has marginal density


$$f_Y(y) = \int_{-\infty}^{\infty} f_{X,Y}(u, y)\,du$$

As for the univariate case, we have


$$P(x < X \le x + dx,\ y < Y \le y + dy) = \int_{x}^{x+dx}\int_{y}^{y+dy} f_{X,Y}(u, v)\,dv\,du \approx f_{X,Y}(x, y)\,dx\,dy\,.$$

That is, fX,Y (x, y)dxdy is the probability that (X, Y ) lies in the infinitesimal rectangle
(x, x + dx) × (y, y + dy). As in the univariate case, P (X = x, Y = y) = 0 for all x, y.

Example 1.21 Consider two continuous random variables X and Y with joint density

$$f_{X,Y}(x, y) = \begin{cases} 8xy & 0 \le x \le y \le 1 \\ 0 & \text{otherwise} \end{cases}$$

Sketch the area where fX,Y is positive. Derive the marginal pdfs of X and Y .

Independence

By analogy with the discrete case, two random variables X and Y are said to be independent
if their joint density factorises, i.e. if

fX,Y (x, y) = fX (x)fY (y) for all x, y.

An equivalent characterisation of independence reads as follows:

Two continuous random variables are independent if and only if there exist func-
tions g(·) and h(·) such for all (x, y) the joint density factorises as fX,Y (x, y) =
g(x)h(y), where g is a function of x only and h is a function of y only.

Proof. If X and Y are independent then simply take g(x) = fX (x) and h(y) = fY (y). For
the converse, suppose that fX,Y (x, y) = g(x)h(y) and define
$$G = \int_{-\infty}^{\infty} g(x)\,dx, \qquad H = \int_{-\infty}^{\infty} h(y)\,dy.$$

Note that both G and H are finite (why?). Then the marginal densities are fX (x) = g(x)H,
fY (y) = Gh(y) and either of these equations implies that GH = 1 (integrate wrt. x in the
first equation or wrt. y in the second equation to see this). It follows that
$$f_{X,Y}(x, y) = g(x)\,h(y) = \frac{f_X(x)}{H}\,\frac{f_Y(y)}{G} = f_X(x)\,f_Y(y)$$
and so X and Y are independent. □

The advantage of knowing that under independence fX,Y (x, y) = g(x)h(y) is that we don’t
need to find the marginal densities fX (x) and fY (y) (which would typically involve some
integration) to verify independence. It suffices to know fX (x) and fY (y) up to some unknown
constant.

Example 1.22 (1.21 ctd.) Are X and Y independent?

Conditional distributions

For the conditional distribution of X given Y , we cannot condition on Y = y in the usual way,
as for any arbitrary set A, P (X ∈ A and Y = y) = P (Y = y) = 0 when Y is continuous,
so that
$$P(X \in A \mid Y = y) = \frac{P(X \in A,\ Y = y)}{P(Y = y)}$$
is not defined (0/0). However, we can consider
$$\frac{P(x < X \le x + dx,\ y < Y \le y + dy)}{P(y < Y \le y + dy)} \approx \frac{f_{X,Y}(x, y)\,dx\,dy}{f_Y(y)\,dy}$$
and interpret $f_{X,Y}(x, y)/f_Y(y)$ as the conditional density of X given Y = y, written as
$f_{X|Y}(x \mid y)$.

Note that this is a probability density function — it is non-negative and


$$\int_{-\infty}^{\infty} \frac{f_{X,Y}(x, y)}{f_Y(y)}\,dx = \frac{1}{f_Y(y)} \underbrace{\int_{-\infty}^{\infty} f_{X,Y}(x, y)\,dx}_{f_Y(y)} = 1.$$

If X and Y are independent then, as before, the conditional density of X given Y = y is just
the marginal density of X.

Example 1.23 (1.21 ctd. II) Give the conditional densities of X given Y = y and of Y
given X = x indicating clearly the area where they are positive. Also, find E[X|Y = y] and
E[X], using the law of iterated conditional expectation for the latter. Compare this with the
direct calculation of E[X].

1.4 Further results on expectations

1.4.1 Expectation of a sum

Consider the sum ϕ(X) + ψ(Y ) when X, Y have joint probability mass function pX,Y (x, y).
(The continuous case follows similarly, replacing probability mass functions by probability
densities and summations by integrals.) Then
$$\begin{aligned}
E_{X,Y}[\phi(X) + \psi(Y)] &= \sum_{x_i}\sum_{y_j} \{\phi(x_i) + \psi(y_j)\}\, p_{X,Y}(x_i, y_j) \\
&= \sum_{x_i} \phi(x_i) \underbrace{\sum_{y_j} p_{X,Y}(x_i, y_j)}_{p_X(x_i)} + \sum_{y_j} \psi(y_j) \underbrace{\sum_{x_i} p_{X,Y}(x_i, y_j)}_{p_Y(y_j)} \\
&= E_X[\phi(X)] + E_Y[\psi(Y)]\,.
\end{aligned}$$

Note that the subscripts on the E’s are unnecessary as there is no possible ambiguity in this
equation, and also that this holds regardless of whether or not X and Y are independent.

In particular we have E[X + Y ] = E[X] + E[Y ]. Note the power of this result: there is no
need to calculate the probability distribution of X + Y (which may be hard!) if all we need
is the mean of X + Y .

1.4.2 Expectation of a product



Now consider ϕ(X)ψ(Y ). Then


$$E_{X,Y}[\phi(X)\psi(Y)] = \sum_{x_i}\sum_{y_j} \phi(x_i)\,\psi(y_j)\, p_{X,Y}(x_i, y_j) = \ ? \tag{1.2}$$

If X and Y are independent, then $p_{X,Y}(x_i, y_j) = p_X(x_i)\,p_Y(y_j)$ and the double sum in
(1.2) factorises, that is
$$E_{X,Y}[\phi(X)\psi(Y)] = \underbrace{\sum_{x_i} \phi(x_i)\,p_X(x_i)}_{E_X[\phi(X)]}\ \underbrace{\sum_{y_j} \psi(y_j)\,p_Y(y_j)}_{E_Y[\psi(Y)]}\,.$$

Thus, except for the case where X and Y are independent, we typically have that

E(product) ̸= product of expectations

even though, from above, it is always true that

E(sum) = sum of expectations

Slogan:
Independence means Multiply

1.4.3 Covariance

A particular function of interest is the covariance between X and Y . As we will see, this
is a measure for the strength of the linear relationship between X and Y . The covariance is
defined as
Cov(X, Y ) = E [(X − E[X])(Y − E[Y ])]
An alternative formula for the covariance follows on expanding the bracket, giving

Cov(X, Y ) = E [XY − XE[Y ] − Y E[X] + E[X] E[Y ]]


= E[XY ] − E[X] E[Y ] − E[X] E[Y ] + E[X] E[Y ]
= E[XY ] − E[X] E[Y ]

Note that Cov(X, X) = Var(X), giving the familiar formula Var(X) = E[X 2 ] − {E[X]}2 .

If X and Y are independent then, from above,

E[XY ] = E[X] E[Y ]

and it follows that


Cov(X, Y ) = 0 .

However in general Cov(X, Y ) = 0 does not imply that X and Y are independent! An
example for this will be given below in Example 1.25.

Also, if Z = aX + b then E[Z] = aE[X] + b and Z − E[Z] = a{X − E[X]} , so that

Cov(Z, Y ) = E[a(X − E[X])(Y − E[Y ])]


= aCov(X, Y ).

Using a similar argument we get

Cov(X + Y, W ) = Cov(X, W ) + Cov(Y, W ).

Exercise: Using the fact that Var(X +Y ) = Cov(X +Y, X +Y ), derive the general formula
Var(X + Y ) = Var(X) + Var(Y ) + 2Cov(X, Y ).
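A short simulation sketch of these covariance rules and of the Var(X + Y) identity; the particular random variables below are an arbitrary illustrative choice (Y is built to be correlated with X, W is independent noise):

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=500_000)
y = 0.6 * x + rng.normal(size=500_000)          # correlated with x by construction
w = rng.normal(size=500_000)

cov = lambda u, v: np.mean((u - u.mean()) * (v - v.mean()))

# Var(X + Y) = Var(X) + Var(Y) + 2 Cov(X, Y)
print(np.var(x + y), np.var(x) + np.var(y) + 2 * cov(x, y))

# Cov(aX + b, Y) = a Cov(X, Y)   and   Cov(X + Y, W) = Cov(X, W) + Cov(Y, W)
a, b = 3.0, 7.0
print(cov(a * x + b, y), a * cov(x, y))
print(cov(x + y, w), cov(x, w) + cov(y, w))
```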

1.4.4 Correlation

From above, we see that the covariance varies with the scale of measurement of the variables
(lbs/kilos etc), making it difficult to interpret its numerical value. The correlation is a
standardised form of the covariance, which is scale-invariant and therefore its values are
easier to interpret.

The correlation between X and Y is defined by

$$\mathrm{Corr}(X, Y) = \frac{\mathrm{Cov}(X, Y)}{\sqrt{\mathrm{Var}(X)\,\mathrm{Var}(Y)}}$$

Suppose that a > 0. Then Cov(aX, Y ) = a Cov(X, Y ) and Var(aX) = a2 Var(X) and it
follows that Corr(aX, Y ) = Corr(X, Y ). Thus the correlation is scale-invariant.

A key result is that

−1 ≤ Corr(X, Y ) ≤ +1

for all random variables X and Y .

Example 1.24 (1.20 ctd.) Find the covariance and correlation of X and Y .

Example 1.25 Compute the correlation of X ∼ U(−1, 1) and Y = X 2 . Sketch a typical


scatter plot of X and Y , e.g. for a sample of size 20. Are X and Y independent?
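A simulation sketch that can be used to explore Example 1.25 empirically (after attempting the calculation by hand); the sample size and the conditioning event $|X| > 0.5$ are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(-1.0, 1.0, size=200_000)
y = x ** 2

# Sample correlation of X and Y = X^2.
print("corr(X, Y) ~", np.corrcoef(x, y)[0, 1])

# Any dependence is still visible through conditional means of Y.
print("E[Y]            ~", y.mean())
print("E[Y | |X| > 0.5] ~", y[np.abs(x) > 0.5].mean())
```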

Not examined:
To prove this, we use the following trick. For any constant z ∈ R,

$$\begin{aligned}
\mathrm{Var}(zX + Y) &= E\big[\{(zX + Y) - (zE[X] + E[Y])\}^2\big] \\
&= E\big[\{z(X - E[X]) + (Y - E[Y])\}^2\big] \\
&= z^2 E\big[(X - E[X])^2\big] + 2z\,E\big[(X - E[X])(Y - E[Y])\big] + E\big[(Y - E[Y])^2\big] \\
&= z^2\,\mathrm{Var}(X) + 2z\,\mathrm{Cov}(X, Y) + \mathrm{Var}(Y)
\end{aligned}$$

as a quadratic function of z. But Var(zX + Y ) ≥ 0, so the quadratic on the right-


hand side must have either no real roots or a single repeated root, i.e. we must have
(“$b^2 \le 4ac$”). Therefore

$$\{2\,\mathrm{Cov}(X, Y)\}^2 \le 4\,\mathrm{Var}(X)\,\mathrm{Var}(Y)$$

which implies that $\mathrm{Corr}^2(X, Y) \le 1$, as claimed. □

We get the extreme values, Corr(X, Y ) = ±1, when the quadratic touches the z-axis;
that is, when Var (zX + Y ) = 0. But if the variance of a random variable is zero then
the random variable must be a constant (we say that its distribution is degenerate).
Therefore, letting z be the particular value for which the quadratic touches the z-axis,
we obtain

zX + Y = constant. (1.3)

Taking expectations of this we find that the constant is given by

constant = zE[X] + E[Y ]. (1.4)

Additionally, we can translate equation (1.3) to say zX = constant − Y so that taking


variances on both sides yields

z 2 Var(X) = Var(Y ). (1.5)

Thus, the quadratic equation z 2 Var(X) + 2zCov(X, Y ) + Var(Y ) = 0 implies that


z 2 Var(X) + 2zCov(X, Y ) + z 2 Var(X) = 0 and thus z = −Cov(X, Y )/Var(X) follows
(the case z = 0 corresponds to Var(Y ) = 0 and thus Var(X) = 0 in which case both
random variables are degenerate).
Now take equation (1.3), subtract equation (1.4) and substitute for z to obtain finally:

$$Y - E[Y] = \frac{\mathrm{Cov}(X, Y)}{\mathrm{Var}(X)}\,(X - E[X]).$$

We therefore see that correlation measures the degree of linearity of the relationship
between X and Y , and takes its maximum and minimum values (±1) when there is an
exact linear relationship between them. As there may be other forms of dependence
between X and Y ( i.e. non–linear dependence), it is now clear that Corr(X, Y ) = 0
does not imply independence.

1.4.5 Conditional variance

Consider random variables X and Y and the conditional probability distribution of X given
Y = y. This conditional distribution has a mean, denoted E(X|Y = y), and a variance,
var(X|Y = y). We have already shown that the marginal (unconditional) mean E(X) is
related to the conditional mean via the formula

E[X] = EY [EX|Y [X|Y ]] .

In the lectures we will obtain a similar result for the relation between the marginal and
conditional variances. The result is that

Var(X) = EY [Var(X|Y )] + VarY {E[X|Y ]}

Example 1.26 (1.20 ctd.) Find the conditional variances of X given Y = 0, 1. Compute
the marginal variance of X by using the above result.

Example 1.27 (1.15 ctd. II) Find the variance of R using the iterated conditional variance
formula.
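The iterated conditional variance result can also be checked numerically; the sketch below reuses the hypothetical $p_Y$ and $p_{X|Y}$ from the iterated-expectation sketch in Section 1.3.2 (purely illustrative numbers):

```python
import numpy as np

# Hypothetical pmfs, as in the earlier iterated-expectation sketch.
p_y = np.array([0.6, 0.4])
p_x_given_y = np.array([[0.5, 0.3, 0.2],
                        [0.1, 0.4, 0.5]])
x = np.array([0.0, 1.0, 2.0])

m = p_x_given_y @ x                          # E[X | Y = y] for each y
v = p_x_given_y @ x**2 - m**2                # Var(X | Y = y) for each y

joint = p_x_given_y * p_y[:, None]           # joint pmf p_{X,Y}
var_x = np.sum(joint * x**2) - np.sum(joint * x)**2

rhs = np.dot(p_y, v) + (np.dot(p_y, m**2) - np.dot(p_y, m)**2)
print(var_x, rhs)                            # E_Y[Var(X|Y)] + Var_Y(E[X|Y]) matches Var(X)
```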

Learning Outcomes: Sections 1.3 and 1.4 represent the base of STAT0005. A thorough
understanding of the material is essential in order to follow the remaining sections as
well as many courses in the second and third year.

Joint Distributions You should be able to

1. Name and verify the properties of joint cdfs;


2. Compute probabilities of rectangles using the cdf;
3. Define the marginal and conditional pmf / pdf in terms of the joint distribution;
4. Represent the joint distribution of discrete variables in a two-way table and identify
the marginal distributions;
5. Compute the marginal and conditional distributions (pmf / pdf) from the joint
distribution and vice versa;
6. Compute probabilities for joint and conditional events using the joint or conditional
pmf / pdf or cdf as appropriate.

Expectation / Variance You should be able to

1. Calculate the expectation of functions of more than two variables; in particular,


find expectations of sums and products;
2. Find / compute the conditional expectations given a pair of discrete or continuous
random variables and their joint or conditional distribution;
3. Use the law of iterated conditional expectation to find marginal expectations given
only the conditional distribution;
4. Apply iterated conditional expectation to find the expectation of a product, and
use iterated conditional expectation sensibly to get expectations of more complex
transformations;
5. Use the “Taking out what is known” rule to simplify conditional expectations
6. Compute and interpret conditional variances in simple cases;
7. Compute marginal variances when only conditional distributions are given using
the result on iterated conditional variance.

Independence You should be able to

1. Infer from joint or conditional distributions whether variables are independent;


2. Apply the main criteria to check independence of two random variables, and
identify the one that is easiest to check in a given situation;
3. Explain the relation between independence and uncorrelatedness.

Covariance / Correlation You should be able to

1. Compute the covariance of two variables using the simplest possible way for doing
so in standard situations;
2. Compute the correlation of two random variables, and interpret the result in terms
of linear dependence;
3. Derive the covariance / correlation for simple linear transformations of the vari-
ables;
4. State the main properties of the correlation coefficient;
5. Sketch the proof of −1 ≤ Corr ≤ 1.

1.5 Standard multivariate distributions

1.5.1 From bivariate to multivariate

The idea of joint probability distributions extends immediately to more than two variables, giving
general multivariate distributions, i.e. the variables X1 , . . . , Xn have a joint cumulative
distribution function

FX1 ,...,Xn (x1 , . . . , xn ) = P (X1 ≤ x1 , . . . , Xn ≤ xn )

and may have a joint probability mass function

pX1 ,...,Xn (x1 , . . . , xn ) = P (Xi = xi ; i = 1, . . . , n)

or joint probability density function

fX1 ,...,Xn (x1 , . . . , xn ),

so that a function ϕ(X1 , . . . , Xn ) has an expectation with respect to this joint distribution
etc.

Conditional distributions of a subset of variables given the rest then follow as before; for
example, for discrete random variables X1 , X2 , X3 ,

$$p_{X_1, X_2 | X_3}(x_1, x_2 \mid x_3) = \frac{p_{X_1, X_2, X_3}(x_1, x_2, x_3)}{p_{X_3}(x_3)}$$

is the conditional pmf of (X1 , X2 ) given X3 = x3 . Similarly, the discrete random variables
X1 , . . . , Xn are (mutually) independent if and only if
$$p_{X_1, \ldots, X_n}(x_1, \ldots, x_n) = \prod_{i=1}^{n} p_{X_i}(x_i)$$

for all x1 , . . . , xn . Independence of X1 , . . . , Xn implies independence of the events {X1 ∈


A1 }, . . . , {Xn ∈ An } (exercise: prove). Finally, we say that X1 and X2 are conditionally
independent given X3 if

pX1 ,X2 |X3 (x1 , x2 | x3 ) = pX1 |X3 (x1 | x3 ) pX2 |X3 (x2 | x3 )

for all x1 , x2 , x3 . These definitions hold for continuous distributions by replacing the pmf by
the pdf.

1.5.2 The multinomial distribution

The multinomial distribution is a generalisation of the binomial distribution. Suppose that a


sample of size n is drawn (with replacement) from a population whose members fall into
one of m + 1 categories. Assume that, for each individual sampled, independently of the rest

$$P(\text{individual is of type } i) = p_i, \qquad i = 1, \ldots, m+1$$
where $\sum_{i=1}^{m+1} p_i = 1$. Let Ni be the number of type i individuals in the sample. Note that,
since $N_{m+1} = n - \sum_{i=1}^{m} N_i$, Nm+1 is determined by N1 , . . . , Nm . We therefore only need to
consider the joint distribution of the m random variables N1 , . . . , Nm .

The joint pmf of N1 , . . . , Nm is given by

$$P(N_i = n_i,\ i = 1, \ldots, m) = \begin{cases} \dfrac{n!}{n_1! \cdots n_{m+1}!}\; p_1^{n_1} \cdots p_{m+1}^{n_{m+1}}, & n_1, \ldots, n_m \in \{0, 1, 2, \ldots, n\},\ n_1 + \ldots + n_m \le n \\ 0 & \text{otherwise,} \end{cases}$$

where $n_{m+1} = n - \sum_{i=1}^{m} n_i$. This is the multinomial distribution with index n and
parameters p1 , . . . , pm , where $p_{m+1} = 1 - \sum_{i=1}^{m} p_i$ (so pm+1 is not a ‘free’ parameter).

To justify the above joint pmf note that we want the probability that the n trials result in
exactly n1 outcomes of the first category, n2 of the second, . . . , nm+1 in the last category.
Any specific ordering of these n outcomes has probability $p_1^{n_1} \cdots p_{m+1}^{n_{m+1}}$ by the assumption of
independent trials, and there are $\frac{n!}{n_1! \cdots n_{m+1}!}$ such orderings.

If m = 1 the multinomial distribution is just the binomial distribution, i.e.


N1 ∼ Bin(n, p1 ), which has mean np1 and variance np1 (1 − p1 ).

Example 1.28 Suppose that a bag contains five red, five black and five yellow balls and
that three balls are drawn at random with replacement. What is the probability that there is
one of each colour?
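A quick numerical companion using scipy's multinomial distribution with the index and probabilities of Example 1.28 (three equally likely colours, three draws with replacement), compared against the pmf formula above:

```python
from math import factorial
from scipy import stats

# Example 1.28: n = 3 draws with replacement, three equally likely colours.
p_scipy = stats.multinomial.pmf([1, 1, 1], n=3, p=[1/3, 1/3, 1/3])
p_formula = factorial(3) / (factorial(1) ** 3) * (1 / 3) ** 3
print(p_scipy, p_formula)          # the two values agree
```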

Marginal distribution of Ni

Clearly Ni can be regarded as the number of successes in n independent Bernoulli trials if


we define success to be individual is of type i. Thus Ni has a binomial distribution, Ni ∼
Bin(n, pi ), with mean npi and variance npi (1 − pi ).

Not examined: It is instructive to derive the marginal distribution of N1 directly from the
joint pmf of N1 , . . . , Nm by using the multinomial expansion as follows. To do this, you will
need the result that
$$\sum_{n_1} \cdots \sum_{n_{m+1}} \frac{n!}{n_1! \cdots n_{m+1}!}\; p_1^{n_1} \cdots p_{m+1}^{n_{m+1}} = (p_1 + \ldots + p_{m+1})^n = 1$$
where the sum is taken over all n1 , . . . , nm+1 for which n1 + . . . + nm+1 = n. Thus, the
probabilities of the multinomial distribution are the terms of the multinomial expansion.

Example 1.29 Let NA , NB and NF be the numbers of A grades, B grades and fails respec-
tively amongst a class of 100 students. Suppose that generally 5% of students achieve grade
A, 30% grade B and that 5% fail. Write down the joint distribution of NA , NB and NF and
find the marginal distribution of NA .

Joint distribution of Ni and Nj

Again we can regard individuals as being one of three types, i, j and k={not i or j}. This
is the trinomial distribution with probabilities
$$P(N_i = n_i,\ N_j = n_j) = \begin{cases} \dfrac{n!}{n_i!\,n_j!\,n_k!}\; p_i^{n_i} p_j^{n_j} p_k^{n_k}, & n_i + n_j \le n \\ 0 & \text{otherwise} \end{cases}$$

where nk = n − ni − nj and pk = 1 − pi − pj . It is intuitively clear that Ni and Nj are


dependent and negatively correlated, since a relatively large value of Ni implies a relatively
small value of Nj and conversely. We show this as follows. First, we have
$$\begin{aligned}
E(N_i N_j) &= \sum_{n_i}\sum_{n_j} n_i n_j\, P(N_i = n_i, N_j = n_j) \\
&= \sum_{\{n_i, n_j \ge 0,\ n_i + n_j \le n\}} n_i n_j\, \frac{n!}{n_i!\,n_j!\,n_k!}\; p_i^{n_i} p_j^{n_j} p_k^{n_k} \\
&= n(n-1)\,p_i p_j \sum_{\{n_i - 1,\, n_j - 1 \ge 0,\ n_i + n_j - 2 \le n - 2\}} \frac{(n-2)!}{(n_i - 1)!\,(n_j - 1)!\,n_k!}\; p_i^{n_i - 1} p_j^{n_j - 1} p_k^{n_k} \\
&= n(n-1)\,p_i p_j\,(p_i + p_j + p_k)^{n-2} \\
&= n(n-1)\,p_i p_j
\end{aligned}$$

The manipulations in the third line are designed to create a multinomial expansion that we
can sum. Note that we may take ni , nj ≥ 1 in the sum, since if either ni or nj is zero then
the corresponding term in the sum is zero.

Finally
Cov(Ni , Nj ) = E[Ni Nj ] − E[Ni ]E[Nj ] = n(n − 1)pi pj − (npi )(npj ) = −npi pj
and so
$$\mathrm{Corr}(N_i, N_j) = \frac{-n p_i p_j}{\sqrt{n p_i (1 - p_i)\, n p_j (1 - p_j)}} = -\sqrt{\frac{p_i\, p_j}{(1 - p_i)(1 - p_j)}}\,.$$
Note that Corr(Ni , Nj ) is negative, as anticipated, and also that it does not depend on n.

Conditional distribution of Ni given Nj = nj

Given Nj = nj , there are n − nj remaining independent Bernoulli trials, each with probability
of being type i given by
$$P(\text{type } i \mid \text{not type } j) = \frac{P(\text{type } i)}{P(\text{not type } j)} = \frac{p_i}{1 - p_j}\,.$$
Thus, given Nj = nj , Ni has a binomial distribution with index $n - n_j$ and probability $\frac{p_i}{1 - p_j}$.

Exercise: Verify this result by using the definition of conditional probability together with
the joint distribution of Ni and Nj and the marginal distribution of Nj .

Example 1.30 (1.29 ctd.) Find the conditional distribution of NA given NF = 10 and
calculate Corr(NA , NF ).
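A simulation sketch using the grade proportions of Example 1.29 (with the remaining 60% treated as "other grades"); it compares the simulated covariance with $-n p_i p_j$ and the conditional mean of $N_A$ given $N_F = 10$ with the binomial claim above. It is meant as an empirical check, not as a substitute for the exact calculation.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 100
p = np.array([0.05, 0.30, 0.60, 0.05])          # A, B, other grades, fail
counts = rng.multinomial(n, p, size=200_000)

n_a, n_f = counts[:, 0], counts[:, 3]
print("simulated Cov(N_A, N_F):", np.cov(n_a, n_f)[0, 1])
print("theory  -n p_A p_F     :", -n * p[0] * p[3])

# Given N_F = 10, N_A should be Bin(n - 10, p_A / (1 - p_F)); check its mean.
sel = n_a[n_f == 10]
print(sel.mean(), (n - 10) * p[0] / (1 - p[3]))
```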

Remark: The multinomial distribution can also be used as a model for contingency tables.
Let X and Y be discrete random variables with I and J possible outcomes,
respectively. Then, in a trial of size n, Nij will count the number of outcomes where we
observe X = i and Y = j. The counts Nij , i = 1, . . . , I, j = 1, . . . , J, are typically
arranged in a contingency table, and from the above considerations we know that their joint
distribution is multinomial with parameters n and pij = P (X = i, Y = j), i = 1, . . . , I,
j = 1, . . . , J. This leads to the analysis of categorical data, for which a question of interest
is often ‘are the categories independent?’, i.e. is pij = pi pj for all i, j? Exact significance
tests of this hypothesis can be constructed from the multinomial distribution of the entries
in the contingency table.

1.5.3 The multivariate normal distribution

The continuous random variables X and Y are said to have a bivariate normal distribution
if they have joint probability density function

$$f_{X,Y}(x, y) = \frac{1}{2\pi\sigma_X\sigma_Y\sqrt{1 - \rho^2}} \exp\left[-\frac{1}{2(1 - \rho^2)}\left\{\left(\frac{x - \mu_X}{\sigma_X}\right)^2 - 2\rho\left(\frac{x - \mu_X}{\sigma_X}\right)\left(\frac{y - \mu_Y}{\sigma_Y}\right) + \left(\frac{y - \mu_Y}{\sigma_Y}\right)^2\right\}\right]$$

for $-\infty < x, y < \infty$, where $-\infty < \mu_X, \mu_Y < \infty$; $\sigma_X, \sigma_Y > 0$; $\rho^2 < 1$. The parameters of
this distribution are $\mu_X$, $\mu_Y$, $\sigma_X^2$, $\sigma_Y^2$, and $\rho$. As we will see below, these turn out to be the
marginal means, variances, and the correlation of X and Y .

The bivariate normal is widely used as a model for many observed phenomena where depen-
dence is expected, e.g. height and weight of an individual, length and width of a petal,
income and investment returns. Sometimes the data need to be transformed ( e.g. by taking
logs) before using the bivariate normal.

Marginal distributions

In order to simplify the integrations required to find the marginal densities of X and Y , we
set
x − µX y − µY
= u, = v.
σX σY
Then, integrating with respect to y, the marginal density of X can be found as
    fX(x) = ∫_{−∞}^{∞} fX,Y(x, y) dy

          = ∫_{−∞}^{∞} [ 1 / (2π σX σY √(1 − ρ²)) ] exp[ −(u² − 2ρuv + v²) / (2(1 − ρ²)) ] σY dv

          = (1/(σX √(2π))) ∫_{−∞}^{∞} [ 1 / √(2π(1 − ρ²)) ] exp[ −((v − ρu)² + u²(1 − ρ²)) / (2(1 − ρ²)) ] dv

where we have completed the square in v in the exponent. Taking the term not involving
v outside the integral we then get

    fX(x) = (1/(σX √(2π))) exp(−u²/2) ∫_{−∞}^{∞} [ 1 / √(2π(1 − ρ²)) ] exp[ −(v − ρu)² / (2(1 − ρ²)) ] dv

          = (1/(σX √(2π))) exp(−u²/2) = (1/(√(2π) σX)) exp[ −½ ((x − µX)/σX)² ]

The final step here follows by noting that the integrand is the density of a N(ρu, 1 − ρ²)
random variable and hence integrates to one. Thus the marginal distribution of X is normal,
with mean µX and variance σX².

By symmetry in X and Y we get that Y ∼ N (µY , σY2 ) is the marginal distribution of Y .

It will be shown later in chapter 3 that the fifth parameter, ρ, also has a simple interpretation,
namely ρ = Corr(X, Y ).

Conditional distributions

The conditional distribution of X given Y = y is found as follows. We have


    fX|Y(x|y) = fX,Y(x, y) / fY(y)

              = [ √(2π) σY / (2π σX σY √(1 − ρ²)) ] exp[ −(1/(2(1 − ρ²))) { ((x − µX)/σX)² − 2ρ ((x − µX)/σX)((y − µY)/σY)
                                                                            + ((y − µY)/σY)² − (1 − ρ²)((y − µY)/σY)² } ]

Now the expression in [·] can be written as


    −(1/(2σX²(1 − ρ²))) [ (x − µX)² − 2ρ (σX/σY)(x − µX)(y − µY) + ρ² (σX²/σY²)(y − µY)² ]

    = −(1/(2σX²(1 − ρ²))) [ (x − µX) − ρ (σX/σY)(y − µY) ]²                                        (1.6)

and so, finally, we get

    fX|Y(x|y) = [ 1 / √(2π σX² (1 − ρ²)) ] exp[ −(1/(2σX²(1 − ρ²))) ( x − µX − ρ (σX/σY)(y − µY) )² ],

which is the density of the N( µX + ρ (σX/σY)(y − µY) ,  σX² (1 − ρ²) ) distribution.

The role of ρ

Note that knowledge of Y = y reduces the variance of X by a factor (1 − ρ²): the stronger
the correlation between X and Y (i.e. the closer |ρ| is to 1), the smaller the conditional
variance becomes. Note also that the conditional mean of X is a linear function of y. If y is
relatively large then the conditional mean of X is also relatively large if ρ = Corr(X, Y ) > 0,
or is relatively small if ρ < 0.

Suppose that ρ = 0. Then

    fX,Y(x, y) = [ 1 / (2π σX σY) ] exp[ −½ { ((x − µX)/σX)² + ((y − µY)/σY)² } ]
               = fX(x) fY(y)                                                              (1.7)

showing that uncorrelated jointly normal variables are independent (remember that, for general
distributions, uncorrelatedness does not imply independence).

Example 1.31 Let X be the one-year yield of portfolio A and Y be the one-year yield of
portfolio B. From past data, the marginal distribution of X is modelled as N (7, 1), whereas
the marginal distribution of Y is N (8, 4) (being a more risky portfolio but having a higher
average yield). Furthermore, the correlation between X and Y is 0.5. Assuming that X, Y
have a bivariate normal distribution, find the conditional distribution of X given that Y = 9
and compare this with the marginal distribution of X. Calculate the probability P (X >
8|Y = 9).
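
As a numerical companion to Example 1.31, the sketch below (a check only; it simply restates the example's assumed parameters and applies the conditional-distribution formulas derived above, using scipy.stats.norm) evaluates the conditional distribution and the required probability.

    from scipy.stats import norm

    mu_X, sig_X = 7.0, 1.0      # X ~ N(7, 1)
    mu_Y, sig_Y = 8.0, 2.0      # Y ~ N(8, 4), so sigma_Y = 2
    rho, y = 0.5, 9.0

    # X | Y = y is N( mu_X + rho*(sig_X/sig_Y)*(y - mu_Y),  sig_X^2*(1 - rho^2) )
    cond_mean = mu_X + rho * (sig_X / sig_Y) * (y - mu_Y)
    cond_var = sig_X**2 * (1 - rho**2)
    print(cond_mean, cond_var)

    # P(X > 8 | Y = 9)
    print(1 - norm.cdf(8, loc=cond_mean, scale=cond_var**0.5))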

1.5.4 Reminder: Matrix notation

Matrix Basics

First of all, let us recall some general matrix notation. An m by n matrix, i.e. a matrix
containing m rows and n columns of real numbers, is denoted by A = (ai,j), i = 1, . . . , m,
j = 1, . . . , n, and is thought of as an element of R^(m×n). Matrices are added entry-wise, and
two matrices A ∈ R^(k×m) and B ∈ R^(m×n) can be multiplied to yield a matrix C ∈ R^(k×n) whose
entries are obtained by taking inner products of rows in A with columns in B, i.e. the entries
are obtained as follows:

    ck,j = Σ_{i=1}^{m} ak,i bi,j

Matrices can be multiplied by real numbers and they can act on column vectors (from the
right) and row vectors (from the left), so if A ∈ R^(m×n) is a matrix, x ∈ R^n is a vector
with entries xi and α ∈ R is a real number we have

    αAx = y ∈ R^m ,    where    yi = α Σ_{j=1}^{n} ai,j xj .

Matrices, vectors and scalars satisfy the usual associative and distributive laws, e.g. A(x +
y) = Ax + Ay and (AB)C = A(BC) etc. However, note that matrix multiplication,
contrary to normal multiplication of real numbers, is not commutative, i.e. in general we
have AB ̸= BA.

Transpose

The transpose of the column vector x with entries x1 , x2 , . . . , xn is the row vector
xT = (x1 , x2 , . . . , xn ), although sometimes we will use column and row vectors interchangeably
when no confusion can arise. The transpose of a matrix A = (ai,j) is just AT = (aj,i) (i.e. you
mirror the matrix entries across its diagonal) and the following rules apply for transposition:

    (αx + βy)T = αxT + βyT
    (αA + βB)T = αAT + βBT
    (Ax)T = xT AT
    (AT)T = A ,    (xT)T = x

Here, matrices are denoted by A, B, and x, y are vectors and α, β are real numbers.

Inner Product

The inner product, also known as the scalar product, of two vectors x, y ∈ R^n is denoted by
xT y ∈ R. It is symmetric, i.e. xT y = yT x, and interacts with the transpose and inverse of
an invertible matrix A ∈ R^(n×n) as follows:

    xT Ay = (AT x)T y
    (A−1)T = (AT)−1 ,

where the latter equality is the reason for the abbreviated notation A−T = (A−1)T – it doesn't
matter whether the transpose or the inverse is carried out first.

Determinants

The determinant of a square matrix A ∈ Rn×n , denoted either det(A) or simply |A|, satisfies
the following rules:

    det(αA) = α^n det(A)
    det(AT) = det(A)
    det(A−1) = 1 / det(A) ,
where A is assumed to be invertible for the last line to hold. A determinant can be computed
by proceeding in a column first or a row first fashion and proceeding recursively to the
determinants of the sub-matrices created, e.g. in dimension n = 3 we have
 
    det(A) = det( a1,1 a1,2 a1,3 ; a2,1 a2,2 a2,3 ; a3,1 a3,2 a3,3 )

           = a1,1 det( a2,2 a2,3 ; a3,2 a3,3 ) − a2,1 det( a1,2 a1,3 ; a3,2 a3,3 ) + a3,1 det( a1,2 a1,3 ; a2,2 a2,3 )

           = a1,1 (a2,2 a3,3 − a3,2 a2,3 ) − a2,1 (a1,2 a3,3 − a3,2 a1,3 ) + a3,1 (a1,2 a2,3 − a2,2 a1,3 ),

where matrix rows are separated by semicolons and the simpler 2D rule

    det( a1,1 a1,2 ; a2,1 a2,2 ) = a1,1 a2,2 − a1,2 a2,1
has been used.
The matrix is invertible, i.e. A−1 exists, if and only if its determinant is non-zero, i.e.
det(A) ̸= 0.

1.5.5 Matrix Notation for Multivariate Normal Random Variables

Define

    X = (X, Y )T ,    µ = (µX , µY )T ,    Σ = ( σX²  σXY ; σXY  σY² ) = ( σX²  ρσXσY ; ρσXσY  σY² )

Here we call X a random vector, µ = E(X) is its mean vector and Σ = Cov(X) is the
covariance matrix, or dispersion matrix, of X. Then

    det(Σ) = σX² σY² (1 − ρ²) ,    Σ−1 = (1/det(Σ)) ( σY²  −ρσXσY ; −ρσXσY  σX² )
and, writing x = (x, y)T ,

    (x − µ)T Σ−1 (x − µ) = (x − µX , y − µY ) (1/det(Σ)) ( σY²  −ρσXσY ; −ρσXσY  σX² ) (x − µX , y − µY )T

        = [ (x − µX)² σY² − 2(x − µX)(y − µY) ρσXσY + (y − µY)² σX² ] / ( σX² σY² (1 − ρ²) )

        = (1/(1 − ρ²)) { ((x − µX)/σX)² − 2ρ ((x − µX)/σX)((y − µY)/σY) + ((y − µY)/σY)² } .

It follows that the joint density fX (x) of X, Y can be written as


 
    fX(x) = (1/√(det(2πΣ))) exp( −½ (x − µ)T Σ−1 (x − µ) )                                 (1.8)

on noting that det(2πΣ)^(1/2) = 2π det(Σ)^(1/2). The quantity in the exponent is a quadratic form in
x − µ. Note that the above way of writing the joint density resembles the univariate normal
density much more closely than the explicit formula given at the beginning of the section.

The usefulness of this matrix representation is that the bivariate normal distribution now
extends immediately to a general multivariate form, with joint density given by (1.8), with

X = (X1 , . . . , Xk )T , x = (x1 , . . . , xk )T , µ = (µ1 , . . . , µk )T

and
(Σ)ij = Cov(Xi , Xj ) = ρij σi σj
Further note that, since Σ is k × k, we can write det(2πΣ)1/2 = (2π)k/2 det(Σ)1/2 .

For this k–dimensional joint distribution, denoted by M N (µ, Σ) or Nk (µ, Σ),


ρij = Corr(Xi , Xj ) and var(Xi ) = σi2 . It can then be shown that Xi has marginal distribution
N (µi , σi2 ), that any two of these variables have a bivariate normal distribution as above, and
therefore that the conditional distribution of one variable given the other is also normal.

Example 1.32 Let X1 , X2 , X3 have a trivariate normal distribution with mean vector (µ1 , µ2 , µ3 )
and covariance matrix

    Σ = ( a 0 0 ; 0 b 0 ; 0 0 c ) ,   i.e. Σ is diagonal with entries a, b, c.
Show that fX1 ,X2 ,X3 = fX1 fX2 fX3 and give the marginal distributions of X1 , X2 , and X3 .
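
The matrix form (1.8) is easy to evaluate directly. The sketch below (an illustration with an arbitrary, assumed mean vector and covariance matrix, not taken from the example) computes the density from (1.8) with NumPy and compares it with scipy.stats.multivariate_normal.

    import numpy as np
    from scipy.stats import multivariate_normal

    mu = np.array([1.0, -2.0, 0.5])                      # illustrative parameters
    Sigma = np.array([[2.0, 0.3, 0.0],
                      [0.3, 1.0, 0.4],
                      [0.0, 0.4, 1.5]])
    x = np.array([0.8, -1.5, 1.0])

    # Density via (1.8): det(2*pi*Sigma)^(-1/2) * exp(-0.5 * (x-mu)^T Sigma^{-1} (x-mu))
    d = x - mu
    quad = d @ np.linalg.solve(Sigma, d)
    pdf_manual = np.exp(-0.5 * quad) / np.sqrt(np.linalg.det(2 * np.pi * Sigma))

    pdf_scipy = multivariate_normal(mean=mu, cov=Sigma).pdf(x)
    print(pdf_manual, pdf_scipy)                         # the two values should agree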

Learning Outcomes:

Multinomial Distribution You should be able to

1. Identify situations where the multinomial distribution is appropriate;


2. Deduce the pmf of the multinomial distribution and make use of the link to the
multinomial expansion;
3. Derive the marginal and conditional distributions;
4. Explain why Ni and Nj have a negative correlation.

Multivariate Normal Distribution You should be able to

1. Recognise the bivariate normal density, describe its shape and interpret the pa-
rameters;
2. Name the marginal distributions and know how to derive them;
3. Give the conditional distributions, know how to derive them and name the char-
acteristic properties of the conditional distributions;
4. Relate the correlation parameter to independence between jointly normal random
variables;
5. With the help of the foregoing points, characterise situations where the multivari-
ate normal distribution is appropriate;
6. Compute probabilities of joint and conditional events;
7. Explain how the multivariate normal distribution is constructed using matrix no-
tation.
Chapter 2

Transformation of Variables

In this section we will see how to derive the distribution of transformed random variables.
This is useful because many statistics applied to data analysis (e.g. test statistics) are
transformations of the sample variables.

2.1 Univariate case

Suppose that we have a sample space Ω, a probability measure P on Ω, a random variable


X : Ω → R, and a function ϕ : R → R.

Recall from section 1.2: Y = ϕ(X) : Ω → R is defined by Y (ω) = ϕ(X)(ω) = ϕ(X(ω)).


Since Y = ϕ(X) is a random variable it also has a probability distribution, which can be
determined either directly from P or via the distribution of X.

2.1.1 Discrete case


    P(Y = y) = P({ω : ϕ(X(ω)) = y}) = Σ_{ω : ϕ(X(ω)) = y} P({ω})
             = Σ_{x : ϕ(x) = y} P({ω : X(ω) = x})
             = Σ_{x : ϕ(x) = y} pX(x).

So, for example


    E(Y ) = Σ_ω ϕ(X(ω)) P({ω})      (with respect to P on Ω)
          = Σ_x ϕ(x) pX(x)          (with respect to the distribution of X)
          = Σ_y y pY(y)             (with respect to the distribution of Y = ϕ(X))

Example 2.1 Consider two independent throws of a fair die. Let X be the sum of the
numbers that show up. Give the distribution of X. Now consider the transformation Y =
(X − 7)2 . Derive the distribution of Y .
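
A short enumeration (a sketch, offered only as a check on Example 2.1, not as its intended solution) makes the discrete transformation rule concrete: the pmf of Y is obtained by adding pX(x) over all x with (x − 7)² = y.

    from fractions import Fraction
    from collections import defaultdict

    # pmf of X = sum of two fair dice
    pX = defaultdict(Fraction)
    for d1 in range(1, 7):
        for d2 in range(1, 7):
            pX[d1 + d2] += Fraction(1, 36)

    # pmf of Y = (X - 7)^2: add up pX(x) over all x mapping to the same y
    pY = defaultdict(Fraction)
    for x, p in pX.items():
        pY[(x - 7) ** 2] += p

    print(dict(sorted(pY.items())))   # e.g. P(Y = 0) = 6/36, P(Y = 25) = 2/36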

2.1.2 Continuous case

Suppose that Y = ϕ(X) where ϕ is a strictly increasing and differentiable function. Then,

FY (y) = P (ϕ(X) ≤ y) = P (X ≤ ϕ−1 (y)) = FX (ϕ−1 (y)).

The first and third equalities arise simply from the definition of the cdf. The middle equality
says that the two probabilities on either side are equal because the events are the same, i.e.
{ω ∈ Ω : ϕ(X(ω)) ≤ y} = {ω ∈ Ω : X(ω) ≤ ϕ−1 (y)}. In words this simply means that
the event that ϕ(X) ≤ y happens if and only if the event that X ≤ ϕ−1 (y) happens. To see
that this is true ...

Therefore, differentiating with respect to y, Y has density


    fY(y) = fX(ϕ−1(y)) (d/dy) ϕ−1(y) = fX(x) (dx/dy) |_{x=ϕ−1(y)}

where the index x = ϕ−1 (y) means that any x in the formula has to be replaced by the
inverse ϕ−1 (y) because fY (y) is a function of y.

Similarly, if ϕ is decreasing then

FY (y) = P (ϕ(X) ≤ y) = P (X ≥ ϕ−1 (y)) = 1 − FX (ϕ−1 (y))

so that
    fY(y) = −fX(x) (dx/dy) |_{x=ϕ−1(y)}

In the first case dy/dx = dϕ(x)/dx is positive (since ϕ is increasing), in the second it is
negative (since ϕ is decreasing) so either way the transformation formula is

    fY(y) = fX(x) |dx/dy| |_{x=ϕ−1(y)}

We can check that the right-hand side of the above formula is a valid pdf as follows. Recall
that ∫_{−∞}^{∞} fX(x) dx = 1. Changing variable to y = ϕ(x) we have, for ϕ increasing,

    1 = ∫ fX(x) (dx/dy) |_{x=ϕ−1(y)} dy

so that fX(x) dx/dy, evaluated at x = ϕ−1(y), is a valid pdf. Similarly for ϕ decreasing.

Example 2.2 Consider X ∼ Uniform[−π/2, π/2], i.e.

    fX(x) = 1/π  for −π/2 ≤ x ≤ π/2,  and  fX(x) = 0  otherwise.

Derive the density of Y = tan(X).

When ϕ is a many-to-one function we use the generalised formula fY(y) = Σ fX(x) |dx/dy| ,
where the summation is over the set {x : ϕ(x) = y}. That is, we add up the contributions
to the density at y from all x values which map to y.

Example 2.3 Suppose that fX(x) = 2x on (0, 1) and let Y = (X − 1/2)². Obtain the pdf of
Y.
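
The transformation formula can always be sanity-checked by simulation. The sketch below does this for Example 2.2: the formula gives fY(y) = 1/(π(1 + y²)), the standard Cauchy density (treat this as the result you are asked to derive, stated here only for the check), and a histogram of simulated values of tan(X) should match it.

    import numpy as np

    rng = np.random.default_rng(1)
    x = rng.uniform(-np.pi / 2, np.pi / 2, size=500_000)
    y = np.tan(x)

    # Empirical density on a few bins vs the density 1/(pi*(1 + y^2)) from the formula
    grid = np.linspace(-5, 5, 11)
    counts, edges = np.histogram(y, bins=grid)
    emp_density = counts / (len(y) * np.diff(edges))     # normalise by ALL samples
    centres = 0.5 * (edges[:-1] + edges[1:])
    predicted = 1.0 / (np.pi * (1.0 + centres**2))
    print(np.round(emp_density, 3))
    print(np.round(predicted, 3))                        # the two rows should be close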

2.2 Bivariate case

2.2.1 General Transformations

For the bivariate case we consider two random variables X, Y with joint density fX,Y (x, y).
What is the joint density of transformations U = u(X, Y ), V = v(X, Y ), where u(·, ·) and
v(·, ·) are functions from R² to R, such as the ratio X/Y or the sum X + Y ?

In order to use the following generalisation of the method of section 2.1, we need to assume
that u, v are such that each pair (x, y) defines a unique (u, v) and conversely, so that u =
u(x, y) and v = v(x, y) are differentiable and invertible. The formula that gives the joint
density of U, V is similar to the univariate case but the derivative, as we used it above, now
has to be replaced by the Jacobian J(x, y) of this transformation.

The result is that U = u(X, Y ), V = v(X, Y ) have joint density

    fU,V(u, v) = fX,Y( x(u, v), y(u, v) ) |J(x, y)| |_{x=x(u,v), y=y(u,v)}

Again, the index x = x(u, v), y = y(u, v) means that the x, y have to be replaced by the
suitable transformations involving u, v only.

But how do we get the Jacobian J(x, y)? It is actually the determinant of the matrix of
partial derivatives:

    J(x, y) = det( ∂(x, y)/∂(u, v) ) = det( ∂x/∂u  ∂x/∂v ; ∂y/∂u  ∂y/∂v )

We finally take its absolute value, |J(x, y)|. There are two ways of computing this:

(1) Obtain the inverse transformation x = x(u, v), y = y(u, v), compute the matrix of partial
derivatives ∂(x, y)/∂(u, v) and then its determinant and absolute value.

(2) Alternatively find the determinant J(u, v) from the matrix of partial derivatives of (u, v)
with respect to (x, y) and then its absolute value and invert this.

The two methods are equivalent since


    ∂(x, y)/∂(u, v) = ( ∂(u, v)/∂(x, y) )^(−1)
Which way to choose in a specific case will depend on which functions are easier to derive.
But note that the inverse transformations x = x(u, v) and y = y(u, v) are required anyway
so that the first approach is often preferable.
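
Both routes can be carried out symbolically, which is a useful way to check a hand computation. The sketch below (SymPy, with the illustrative transformation u = x + y, v = x − y rather than one of the examples) computes the Jacobian of (u, v) with respect to (x, y) and the Jacobian of the inverse map, and confirms they are reciprocals.

    import sympy as sp

    x, y, u, v = sp.symbols('x y u v')

    # Forward transformation and det d(u,v)/d(x,y)
    U, V = x + y, x - y
    J_forward = sp.Matrix([[sp.diff(U, x), sp.diff(U, y)],
                           [sp.diff(V, x), sp.diff(V, y)]]).det()          # = -2

    # Inverse transformation x(u,v), y(u,v) and det d(x,y)/d(u,v)
    X_inv, Y_inv = (u + v) / 2, (u - v) / 2
    J_inverse = sp.Matrix([[sp.diff(X_inv, u), sp.diff(X_inv, v)],
                           [sp.diff(Y_inv, u), sp.diff(Y_inv, v)]]).det()  # = -1/2

    print(J_forward, J_inverse, sp.simplify(J_inverse - 1 / J_forward))    # -2, -1/2, 0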

Example 2.4 Let X and Y be two independent exponential variables with X ∼ Exp(λ) and
Y ∼ Exp(µ). Find the distribution of U = X/Y

Example 2.5 Consider two independent and identically distributed random variables X and
Y having a uniform distribution on [0, 2]. Derive the joint density of Z = X/Y and W = Y ,
stating the area where this density is positive. Are Z and W independent?
Obtain the marginal density of Z = X/Y .

2.2.2 Sums of random variables

The distribution of a sum Z = X + Y of two (not necessarily independent) random variables


X and Y can be derived directly as follows.

In the discrete case note that the marginal distribution of Z is


    P(Z = z) = Σ_x P(X = x, Z = z) = Σ_x P(X = x, Y = z − x)

That is,

    pZ(z) = Σ_x pX,Y(x, z − x)

Analogously, in the continuous case we get


    fZ(z) = ∫_{−∞}^{∞} fX,Y(x, z − x) dx.

Example 2.6 Let X and Y be two positive random variables with joint pdf
fX,Y (x, y) = xye−(x+y) , x, y > 0 .
Derive and name the distribution of their sum Z = X + Y .
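
The discrete convolution formula is easy to verify numerically. The sketch below (an illustration with Poisson pmfs, not part of Example 2.6) convolves the pmfs of two independent Poisson variables and compares the result with the Poisson pmf of the sum, a fact that is also derived later with generating functions.

    import numpy as np
    from scipy.stats import poisson

    mu1, mu2 = 2.0, 3.5
    k = np.arange(0, 60)                          # support truncated; tail mass is negligible here

    p1, p2 = poisson.pmf(k, mu1), poisson.pmf(k, mu2)

    # pZ(z) = sum_x pX(x) pY(z - x): a discrete convolution
    pZ = np.convolve(p1, p2)[:len(k)]

    print(np.max(np.abs(pZ - poisson.pmf(k, mu1 + mu2))))   # close to 0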

2.3 Multivariate case

The ideas of section 2.2 extend in a straightforward way to the case of more than two
continuous random variables. The general problem is to find the distribution of Y = ϕ(X),
where Y is s × 1 and X is r × 1, from the known distribution of X. Here X is the random
vector X = (X1 , X2 , . . . , Xr )T .

Case (i): ϕ is a one-to-one transformation (so that s = r). Then the rule is

    fY(y) = fX(x(y)) |J(x)| |_{x=x(y)}

where J(x) = det(dx/dy) is the Jacobian of the transformation. Here dx/dy is the matrix of
partial derivatives with entries (dx/dy)_{ij} = ∂xi/∂yj .

Case (ii): s < r. First transform the s-vector Y to the r-vector Y ′ , where Yi′ = Yi , i =
1, . . . , s , and the other r − s random variables Yi′ , i = s + 1, . . . , r , are chosen for con-
venience. Now find the density of Y ′ as in case (i) and then integrate out Ys+1 ′
, . . . , Yr′ to
obtain the marginal density of Y , as required. ( c.f. Examples 2.6 & 2.7 in the bivariate
case.)

Case (iii): s = r but ϕ(·) is not monotonic. Then there will generally be more than one value
of x corresponding to a given y and we need to add the probability contributions from all
relevant xs.

Example 2.7 (linear transformation) Suppose that Y = AX, where A is an r × r


invertible matrix. Show that fY (y) = fX (A−1 y)|det(A)|−1 , where |det(A)| denotes the
absolute value of the determinant of A.

2.4 Approximation of moments



Sometimes we may not need the complete probability distribution of ϕ(X), but just the first
two moments. Recall that E[aX + b] = aE[X] + b, so the relation E[ϕ(X)] = ϕ(E[X]) is
true whenever ϕ is a linear function. However, in general if Y = ϕ(X) it will not be true
that E[Y ] = E[ϕ(X)] = ∫ ϕ(x) fX(x) dx is the same as ϕ(E[X]) = ϕ( ∫ x fX(x) dx ) (or
equivalent summations if X is discrete).

To find moments of Y we can use the distribution of X, as above. However, the sums or
integrals involved may be analytically intractable. In practice an approximate answer may be
sufficient. Intuitively, if X has mean µX and X is not very variable, then we would expect
E[Y ] to be quite close to ϕ(µX ).

Suppose that ϕ(x) is a continuous function of x for which the following Taylor expansion
about µX exists (which requires the existence of the derivatives of ϕ):
    ϕ(x) = ϕ(µX) + (x − µX) ϕ′(µX) + ½ (x − µX)² ϕ′′(µX) + . . .
Replacing x by X and taking expectations (or, equivalently, multiplying both sides of the
above equation by fX (x) and integrating over x) term by term, we get
    E[ϕ(X)] = ϕ(µX) + ϕ′(µX) E[X − µX] + ½ ϕ′′(µX) E[(X − µX)²] + . . . ,

where E[X − µX] = 0 and E[(X − µX)²] = σX²,

so that
    E[Y ] ≈ ϕ(µX) + ½ ϕ′′(µX) σX²

This approximation will be good if the function ϕ is well-approximated by the quadratic (i.e.
second order) Taylor expansion in the region to which fX assigns significant probability. If
this region is large, e.g. because σX² is large, then ϕ will need to be well-approximated by the
quadratic Taylor expansion throughout a large region. If σX² is small, ϕ may only need to be
nearly a quadratic function on a much smaller region.

The rougher approximation E(Y ) ≈ ϕ(µX) will usually only be good if ϕ′′(µX) σX² is small,
which will be the case if σX² is small and/or ϕ′′(µX) is small; that is, if X is not very variable
and/or ϕ is approximately linear at µX. Including the next term in the expansion will usually
provide a better approximation.

A usually sufficiently good approximate formula for the variance is based on a first order
approximation yielding
    Var(ϕ(X)) = E[ϕ(X) − E(ϕ(X))]²
              ≈ E[ϕ(X) − ϕ(µX)]² ≈ E[(X − µX)² (ϕ′(µX))²] = (ϕ′(µX))² E[(X − µX)²] ,

where we have used the approximation E[ϕ(X)] ≈ ϕ(µX ) . Therefore

    Var(Y ) ≈ (ϕ′(µX))² σX²

Example 2.8 Consider a Poisson variable X ∼ Poi(µ). Find approximations to the expectation
and variance of Y = √X.
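
Taking the example at face value (with ϕ(x) = √x, as reconstructed above), the approximation formulas give E[√X] ≈ √µ − 1/(8√µ) and Var(√X) ≈ 1/4. The sketch below is only a Monte Carlo check of these values, not a derivation.

    import numpy as np

    rng = np.random.default_rng(2)
    mu = 10.0
    x = rng.poisson(mu, size=1_000_000)
    y = np.sqrt(x)

    approx_mean = np.sqrt(mu) - 1 / (8 * np.sqrt(mu))   # phi(mu) + 0.5*phi''(mu)*sigma^2
    approx_var = 0.25                                   # (phi'(mu))^2 * sigma^2 = (1/(2*sqrt(mu)))^2 * mu

    print(y.mean(), approx_mean)   # the two values in each line should be close
    print(y.var(), approx_var)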

2.5 Order Statistics

Order statistics are a special kind of transformation of the sample variables. Their joint and
marginal distributions can be derived by combinatorial considerations.
Suppose that X1 , . . . , Xn are independent with common density fX . Denote the ordered
values by X(1) ≤ X(2) ≤ . . . ≤ X(n) . What is the distribution Fr of X(r) ?

In particular, X(n) = max (X1 , . . . , Xn ) is the sample maximum and X(1) = min(X1 , . . . , Xn )
is the sample minimum. To find the distribution of X(n) , note that {X(n) ≤ x} and
{all Xi ≤ x} are the same event – and so have the same probability! Therefore the distribu-
tion function of X(n) is

Fn (x) = P (X(n) ≤ x) = P (all Xi ≤ x) = P (X1 ≤ x, X2 ≤ x, . . . , Xn ≤ x) = {FX (x)}n

since the Xi are independent with the same distribution function FX . Thus

Fn (x) = (FX (x))n

Furthermore, differentiating this expression we see that the density fn of X(n) is

fn (x) = n (FX (x))n−1 fX (x)

Using a similar argument for X(1) = min(X1 , . . . , Xn ) we see that

F1 (x) = P (X(1) ≤ x) = P (at least one Xi ≤ x) = 1 − P (all Xi > x) = 1 − {1 − FX (x)}n

so the distribution function of X(1) is

F1 (x) = 1 − (1 − FX (x))n

and, differentiating, the pdf f1 of X(1) is

f1 (x) = n (1 − FX (x))n−1 fX (x)

Consider next the situation for general 1 ≤ r ≤ n. For dx sufficiently small we have
 
    P(x < X(r) ≤ x + dx) = P( r − 1 values Xi with Xi ≤ x,  one value in (x, x + dx],  and
                              n − r values with Xi > x + dx )

                         ≈ [ n! / ((r − 1)!(n − r)!) ] {FX(x)}^(r−1) fX(x) dx (1 − FX(x + dx))^(n−r) ,

where n!/((r − 1)!(n − r)!) is the number of ways of ordering the r − 1, 1 and n − r values.
Recalling that fr(x) = lim_{dx→0} P(x < X(r) ≤ x + dx)/dx, dividing both sides of the above
expression by dx and letting dx → 0 we obtain the density function of the rth order statistic
X(r) as

    fr(x) = [ n! / ((r − 1)!(n − r)!) ] (FX(x))^(r−1) (1 − FX(x))^(n−r) fX(x)

Exercise: Show that this formula gives the previous densities when r = n and r = 1.

Activity 2.1 Collect a random sample of five students from the audience and have their
heights X1 , . . . , X5 measured. Record the heights and compute the mean height. Reorder
the students by height to obtain the first to fifth order statistic, X(1) , . . . , X(5) . What do the
first, third and fifth order statistic, X(1) , X(3) and X(5) correspond to?

Example 2.9 A village is protected from a river by a dike of height h. The maximum water
levels Xi reached by the river in subsequent years i = 1, 2, 3, . . . are modelled as independent
following an exponential distribution with mean λ−1 = 10. What is the probability that the
village will be flooded (in statistical language this would be called a “threshold exceedance”)
at least once in the next 100 years? How high does the dike need to be to make this probability
smaller than 0.5?
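
A direct computation along the lines of Example 2.9 is shown below (a sketch using the distribution of the sample maximum derived above and the example's λ⁻¹ = 10; treat the printed numbers as a check on the answer you derive, not as the derivation itself).

    import numpy as np

    lam = 1 / 10                      # rate, so the mean water level is 10

    def p_flood(h, years=100):
        """P(max of `years` iid Exp(lam) levels exceeds h) = 1 - F_X(h)^years."""
        F = 1 - np.exp(-lam * h)      # exponential cdf
        return 1 - F**years

    print(p_flood(30.0))              # flooding probability for a dike of height 30

    # Smallest h (on a fine grid) with flooding probability below 0.5
    grid = np.linspace(0, 100, 100_001)
    h_needed = grid[np.argmax(p_flood(grid) < 0.5)]
    print(h_needed)                   # roughly 49.7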

The distribution of the sample maximum is an important quantity in the field of extreme value
theory as the preceding example showed. Extreme value theory is important for its application
in insurance pricing and risk assessment. It turns out that the probability distribution of
threshold exceedances follows a universal class of distributions in the limit of high thresholds
h, independently of the individual distribution of the Xi . A most remarkable result!

Learning Outcomes: In Chapter 2, the most important aspects are the following.

Distributions of Transformations You should be able to

1. Derive the distribution (pmf) of a transformation of a discrete variable for simple


cases;
2. Apply the general theorem to find the distribution (density) of a transformed
continuous variable, specifying the range where the density is positive;
3. Apply the general theorem to find the joint distribution of transformations of
bivariate random variables, specifying the range where the density is positive;
4. Use the Jacobian in the appropriate way for point 3, choosing the simplest of the
two possible computations for the specific situation;
5. Derive the distribution of the sum of random variables (discrete and continuous
case);
6. When different methods could be applied, identify the easiest to find the distri-
bution of a transformation.

Approximation of moments You should be able to

1. Derive an approximation formula for the mean of a general transformation based


on the second-order Taylor expansion and compute this for standard cases;
2. Derive an approximation formula for the variance of a general transformation
based on the first-order Taylor expansion and compute this for standard cases;
3. Explain in which situations such approximations are good / bad.

Order Statistics You should be able to

1. Compute the distribution of the rth order statistic (in particular sample maximum
and sample minimum) and explain the required probabilistic and combinatorial
reasoning;
Chapter 3

Generating Functions

Overview

The transformation method presented in the previous chapter may become tedious when a
large number of variables is involved, in particular for transformations of the sample variables
when the sample size tends to infinity. Generating functions provide an alternative way of
determining a distribution ( e.g. of sums of random variables).

We consider different generating functions for the discrete and continuous case. For the
former, we can use probability generating functions (section 3.1) as well as moment generating
functions (section 3.2), whereas for the latter, only moment generating functions can be
used. We will point out the simple connection between pgfs and mgfs and further consider
joint generating functions in section 3.3 and apply these to linear combinations of random
variables in section 3.4. Finally, in section 3.4.1, we will state, prove, and use the Central
Limit Theorem.

3.1 The probability generating function (pgf)

3.1.1 Definition of the pgf


Suppose that X is a discrete random variable taking values 0, 1, 2 . . .. Then the probability
generating function (pgf) G(z) of X is defined as

G(z) ≡ E(z X )

The pgf is a function of particular interest, because it sometimes provides an easy way of
determining the distribution of a discrete random variable.

Write pi = P(X = i), i = 0, 1, . . .. Then, by the usual expectation formula,



X
    G(z) = E(z^X) = Σ_{i=0}^{∞} z^i pi = p0 + z p1 + z² p2 + · · · .
Thus G(z) is a power series in z, and pr is the coefficient of z r . Note that


    |G(z)| ≤ Σ_i |z|^i pi ≤ Σ_i pi = 1

for all |z| ≤ 1 and that G(1) = Σ_i pi = 1. The sum is therefore convergent for (at least)
|z| ≤ 1.

We know from the theory of Taylor expansions that the rth derivative G(r) (0) = pr r!, yielding
an expression for the probability pr in terms of the rth derivative of G evaluated at z = 0:

    pr = G^(r)(0) / r! ,    r = 0, 1, 2, . . . .
In practice it is usually easier to find the power series expansion of G and extract pr as the
coefficient of z r .

3.1.2 Moments and the pgf

Whereas the probabilities are related to the derivatives of G at z = 0, it turns out that
the moments of X are related to the derivatives of G at z = 1. To see this, note that
G′(z) = p1 + 2z p2 + 3z² p3 + · · · = Σ_{i=1}^{∞} i z^(i−1) pi so that G′(1) = Σ_{i=1}^{∞} i pi = E(X). Thus, we
have

    E(X) = G′(1)

Further, G′′(z) = Σ_{i=2}^{∞} i(i − 1) z^(i−2) pi so that G′′(1) = Σ_{i=2}^{∞} i(i − 1) pi = E{X(X − 1)}. But we
can write Var(X) = E(X²) − {E(X)}² = E{X(X − 1)} + E(X) − {E(X)}² from which

we obtain the formula

Var(X) = G′′ (1) + G′ (1) − {G′ (1)}2

Example 3.1 Let X ∼ Poi(µ). Find the pgf G(z) = E(z X ). Also, verify the above formulae
for the expectation and variance of X.

Example 3.2 Consider the pgf

G(z) = (1 − p + pz)n ,

where 0 < p < 1 and n ≥ 1 is an integer. Find the power expansion of G(z) and hence
derive the distribution of the random variable X that has this pgf.
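
Power-series manipulations of pgfs are easy to check with a computer algebra system. The sketch below (SymPy, with n fixed at a small illustrative value) expands the pgf of Example 3.2, reads off the coefficients of z^r, and evaluates G′(1) and G′′(1) + G′(1) − G′(1)²; it is intended only as a check on the pen-and-paper derivation asked for in the example.

    import sympy as sp

    z, p = sp.symbols('z p')
    n = 4                                        # small fixed n so the expansion is easy to read
    G = (1 - p + p * z) ** n

    # Coefficients of z^r are the probabilities P(X = r)
    coeffs = [sp.expand(G).coeff(z, r) for r in range(n + 1)]
    print(coeffs)                                # binomial(n, r) p^r (1 - p)^(n - r)

    mean = sp.simplify(sp.diff(G, z).subs(z, 1))                      # G'(1)
    var = sp.simplify(sp.diff(G, z, 2).subs(z, 1) + mean - mean**2)   # G''(1) + G'(1) - G'(1)^2
    print(mean, var)                             # n*p and n*p*(1 - p) with n = 4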

3.2 The moment generating function (mgf)

3.2.1 Definition

Another function of special interest, particularly for continuous variables, is the moment
generating function (mgf) M (s) of X, defined as
    M(s) ≡ E(e^{sX}) = ∫_{−∞}^{∞} e^{sx} fX(x) dx

The moment generating function does not necessarily exist for all s ∈ R, i.e. the integral
might be infinite. However, we assume for the following that M (s) is finite for s in some
open interval containing zero.

3.2.2 Moments and the mgf

Using the expansion e^{sx} = 1 + sx + s²x²/2! + . . . we get

    M(s) = ∫_{−∞}^{∞} Σ_{n=0}^{∞} (s^n x^n / n!) fX(x) dx = Σ_{n=0}^{∞} (s^n / n!) ∫_{−∞}^{∞} x^n fX(x) dx

integrating term by term (there are no convergence problems here due to assuming finiteness).
It follows that

    M(s) = Σ_{n=0}^{∞} (s^n / n!) E(X^n) .
Thus M (s) is a power series in s and the coefficient of sn is E(X n )/n! – hence the name
‘moment generating function’.

Again, from the theory of Taylor expansions the rth derivative, M (r) (0), of M (s) at s = 0
must therefore equal the rth (raw) moment E(X r ) of X. In particular we have M ′ (0) = E(X)
and M ′′ (0) = E(X 2 ), so that
E(X) = M ′ (0)
and
Var(X) = M ′′ (0) − {M ′ (0)}2
(Alternatively, and more directly, note that M ′ (s) = E(XesX ) and M ′′ (s) = E(X 2 esX ) and
set s = 0.) Note also that M (0) = E(e0 ) = 1.

It can be shown that if the moment generating function exists on an open interval including
zero then it uniquely determines the distribution.

The pgf tends to be used more for discrete distributions and the mgf for continuous ones,
although note that when X takes nonnegative integer values then the two are related by
M (s) = E(esX ) = E{(es )X } = G(es ).

Example 3.3 Suppose that X has a gamma distribution with parameters (α, λ). Find the
mgf of X. Use this to derive the expectation and variance of X.

Example 3.4 Let X ∼ N (µ, σ 2 ) be a normal variable. Find the mgf of X. Use this mgf to
obtain the expectation and variance of X.

3.2.3 Linear Transformations and the mgf

Suppose that Y = a + bX and that we know the mgf MX (s) of X. What is the mgf of Y ?
We have
MY (s) = E(esY ) = E{es(a+bX) } = esa E(esbX ) = eas MX (bs) .

We can therefore easily obtain the mgf of any linear function of X from the mgf of X.

Example 3.5 (3.4 ctd.) Use the mgf of X to find the distribution of Y = a + bX.

A more general concept is that of the characteristic function. This is defined in a similar
way to the mgf and has similar properties but involves complex variables. The main advantage
over the moment generating function is that the characteristic function of a random variable
always exists. However, we will not consider it here.

3.3 Joint generating functions

So far we have considered the pgf or mgf of a single real variable. The joint distribution of
a collection of random variables X1 , . . . , Xn can be characterised in a similar way by the
joint generating functions:

The joint pgf G(z1 , . . . , zn ) of variables X1 , . . . , Xn is a function of n variables, z1 , . . . , zn ,


and defined to be
G(z1 , . . . , zn ) = E(z1X1 z2X2 · · · znXn )

The joint mgf M (s1 , . . . , sn ) is a function of n variables, s1 , . . . , sn , and is defined to be

M (s1 , . . . , sn ) = E(es1 X1 +···+sn Xn )

These generating functions uniquely determine the joint distribution of X1 , . . . , Xn . Note


that the mgf may also be written in vector notation as
    M(s) = E( exp(sT X) ) ,

where s = (s1 , . . . , sn )T and X = (X1 , . . . , Xn )T .

Generating functions and independence

In both cases (pgf and mgf) we find that if X1 , . . . , Xn are independent random variables
then the pgf / mgf are given as the product of the individual pgfs / mgfs. (Recall: E(XY ) =
E(X)E(Y ) when X, Y are independent.)

G(z1 , . . . , zn ) = E(z1X1 · · · znXn ) = E(z1X1 ) · · · E(znXn )


i.e. joint pgf = product of marginal pgfs

M (s1 , . . . , sn ) = E(es1 X1 · · · esn Xn ) = E(es1 X1 ) · · · E(esn Xn )


i.e. joint mgf = product of marginal mgfs.

The above property can be used to characterise independence because it can be shown that
the factorisation of the joint mgf holds if and only if the variables are independent.

Marginal mgfs

It is straightforward to see that if MX,Y (s1 , s2 ) is the joint mgf of X, Y then the marginal
mgf of X is given by MX (s1 ) = MX,Y (s1 , 0).
(Proof: E(es1 X ) = E(es1 X+0.Y ) = MX,Y (s1 , 0).)

Higher Moments

The joint moment generating function can further be useful to find higher moments of a
distribution. More precisely, we can compute E(Xir Xjk ) in the following way.

1. Differentiate M (s1 , . . . , sn ) r times w.r.t. si ;

2. further differentiate k times w.r.t. sj ;

3. then evaluate the resulting derivative for s1 = · · · = sn = 0.

Following the above steps we get

    ∂^r M / ∂si^r = E( Xi^r exp(sT X) )

    ∂^(r+k) M / (∂si^r ∂sj^k) = E( Xi^r Xj^k exp(sT X) ) ,

which gives E(Xir Xjk ) on setting s = 0.

Linear transformation property



This is the multivariate generalisation of the univariate transformation property. Suppose


that Y = a + bX. Then the mgf of Y is ( c.f. §3.2)
    MY(s) = E( exp(sT Y) ) = E{ exp(sT (a + bX)) } = exp(sT a) E( exp(b sT X) ) = exp(aT s) MX(bs)

Example 3.6 Suppose X1 , . . . , Xn are jointly multivariate normally distributed. Then, from
equation (1.8) in §1.5.3, the density of X = (X1 , . . . , Xn ) is given in matrix notation by
 
    fX(x) = (1/| 2πΣ |^(1/2)) exp( −½ (x − µ)T Σ−1 (x − µ) ) ,

where
x = (x1 , . . . , xn )T , µ = (µ1 , . . . , µn )T , Σ = covariance matrix

i.e. the µi = E(Xi ) are the individual expectations and the σij = (Σ)ij = Cov(Xi , Xj ) are
the pairwise covariances (variances if i = j).

We obtain the joint mgf as follows. We have


    E( e^{s1X1 + ··· + snXn} ) = ∫ (1/| 2πΣ |^(1/2)) exp{ sT x − ½ (x − µ)T Σ−1 (x − µ) } dx

(Remember that the integral here represents an n-dimensional integral.) To evaluate this
integral we need to complete the square in {·}. The result (derived in lectures) is that
 
    M(s1 , . . . , sn) = exp( sT µ + ½ sT Σs )

Exercise: from this derive the mgf of the univariate N (µ, σ 2 ) distribution.

For illustration, we now derive the joint moment E(Xi Xj ). Differentiate first with respect to
si . Since
    sT µ = Σ_{k=1}^{n} sk µk ,    sT Σs = Σ_{k=1}^{n} Σ_{l=1}^{n} sk sl σkl

we see that ∂(sT µ)/∂si = µi . Also the terms involving si in sT Σs are

    si² σii + 2 si Σ_{l≠i} sl σil

giving

    ∂(sT Σs)/∂si = 2 si σii + 2 Σ_{l≠i} sl σil = 2 Σ_l sl σil

Therefore

    E( Xi exp(sT X) ) = ∂M(s)/∂si = ( µi + Σ_{l=1}^{n} σil sl ) exp{ sT µ + ½ sT Σs }

Now differentiate again with respect to sj , j ≠ i, to give

    E( Xi Xj exp(sT X) ) = { σij + ( µi + Σ_{l=1}^{n} σil sl )( µj + Σ_{l=1}^{n} σjl sl ) } exp{ sT µ + ½ sT Σs } .

Setting s = 0 now gives E(Xi Xj) = σij + µi µj and therefore Cov(Xi , Xj) = E(Xi Xj) − µi µj = σij .

Example 3.7 Suppose that

    (X1 , X2)T ∼ N2( (2, 1)T ,  ( 1  0.7 ; 0.7  1 ) ) .

Find E(X1 X2 ) (i) directly and (ii) from the joint mgf.

Remark: The multivariate normal density of X1 , . . . , Xn is only valid when Σ is a non-


singular matrix, which can be shown to hold if and only if no exact linear relationships exist
between X1 , . . . , Xn . However, the multivariate normal distribution can still be defined when
Σ is singular. Note in particular that the joint mgf is valid even when Σ is singular.

3.4 Linear combinations of random variables

We will now use the above methods to derive (properties of) the distribution of linear com-
binations of random variables.

Let X1 , . . . , Xn be the original variables. A linear combination is defined by

Y = a1 X 1 + · · · + an X n

for any real-valued constants a1 , . . . , an . A popular linear combination is for example the
sample mean Y = X, for which ai = 1/n, i = 1, . . . , n.

Let us first find the expectation and variance of the linear combination Y in terms of the
moments of the Xi . The methods from §1.4 can be used for this purpose. First

E(Y ) = a1 E(X1 ) + · · · + an E(Xn )

regardless of whether or not the Xi are independent. For the variance we have
    Var(Y ) = Cov( Σ_i ai Xi , Σ_j aj Xj )
            = Σ_i Σ_j ai aj Cov(Xi , Xj)
            = Σ_i ai² Var(Xi) + Σ_{i≠j} ai aj Cov(Xi , Xj).

In particular, we see that if the Xi are independent then

Var(Σai Xi ) = Σi Var(ai Xi ) = Σa2i Var(Xi )

but not otherwise. In vector notation, if we write Y = aT X, where X = (X1 , . . . , Xn )T and
a = (a1 , . . . , an )T , then these relations become

E(Y ) = aT µ, Var(Y ) = aT Σa ,

where µ = E(X), Σ = Cov(X).

Now we want to find out about the actual distribution of Y . If we have the joint distribution
of X1 , . . . , Xn we could proceed by transformation similar to §2.3, but this could be very
tedious if n is large. Instead, let us explore an approach based on the joint pgf / mgf of
the Xi . (The result below for the mgf is actually just a special case of the earlier linear
transformation property of a joint mgf.)

If Y is discrete we find for its pgf

GY (z) = E(z Y ) = E(z a1 X1 +···+an Xn ) = E(z a1 X1 z a2 X2 · · · z an Xn ) ,

which is the joint pgf of X1 , . . . , Xn evaluated at zi = z ai , i.e. G(z a1 , . . . , z an ).

Similarly, if Y is continuous its mgf is given as

MY (s) = E(esY ) = E(es(a1 X1 +···+an Xn ) ) = E(esa1 X1 +···+san Xn ) ,



which is the joint mgf of X1 , . . . , Xn evaluated at si = sai , i.e. M (sa1 , . . . , san ).

So, an alternative to the transformation method is to obtain the joint pgf or mgf and use this
to derive the pgf or mgf of Y . Of course, we still have to get from there to the probability
mass function or density if needed — but often the generating function of the univariate Y
is known to belong to a specific distribution.

A simplification is available if X1 , . . . , Xn are independent. In this case we have


    GY(z) = E( z^{a1X1} z^{a2X2} · · · z^{anXn} ) = Π_{i=1}^{n} E( z^{ai Xi} ) = Π_{i=1}^{n} GXi( z^{ai} ) ,

which is the product of the individual pgfs GXi (z ai ) of the Xi evaluated at zi = z ai .

Analogously, for the mgf we find


    MY(s) = E( e^{s a1X1 + ··· + s anXn} ) = Π_{i=1}^{n} E( e^{s ai Xi} ) = Π_{i=1}^{n} MXi( s ai ) ,

which is the product of the individual mgfs MXi (sai ) of the Xi evaluated at si = sai .

Example 3.8 (3.2 ctd.) Consider independent random variables X1 , . . . , Xn with


Xi ∼ Bin(mi , p), i.e. they have different numbers of trials mi but the same success probability
p. Find the pgf and the distribution of Y = Σi Xi .

Example 3.9 (3.3 ctd.) Let X1 , . . . , Xn be independent with Xi ∼ Gam(αi , λ). Find the
mgf and the distribution of Y = Σi Xi .

Example 3.10 (3.6 ctd.) Suppose X1 , . . . , Xn are jointly multivariate normally distributed.
Consider again the linear transformation Y = a1 X1 + · · · + an Xn , or Y = aT X in vector
notation, where a = (a1 , . . . , an )T , X = (X1 , . . . , Xn )T . It follows from the general result
that the mgf of Y is
 
    MY(s) = MX(sa) = E{ exp(s aT X) } = exp( s aT µ + ½ s² aT Σa )
from §3.3. By comparison with the univariate mgf (see Example 3.4) we see that

Y = aT X ∼ N (aT µ, aT Σa)

We have seen earlier that for any random vector X we have E(aT X) = aT µ, Var(aT X) =
aT Σa, so the importance of the foregoing result is that any linear combination of jointly
normal variables is itself normally distributed even if the variables are correlated (and thus
not independent).

Example 3.11 (3.7 ctd.) Use the joint mgf of X1 and X2 to find the distribution of Y =
X1 − X 2 .

3.4.1 The Central Limit Theorem

Possibly the most important theorem in statistics is the Central Limit Theorem which enables
us to approximate the distribution of sums of independent random variables. One of the most
popular uses is to make statements about the sample mean X̄ which, after all, is nothing but
a scaled sum of the samples.

Central Limit Theorem:

Suppose that X1 , X2 , . . . are i.i.d. random variables each with mean µ


and variance σ 2 < ∞. Then

    √n (X̄n − µ) / σ   →d   Z

as n → ∞ where Z ∼ N (0, 1).

Here, →d denotes convergence in distribution, which means that as n → ∞ the cdf of the
random variable on the left converges to the cdf of the random variable on the right at each
point. Writing this down more carefully, let Yn = √n (X̄n − µ)/σ. Then the statement simply
means limn→∞ FYn (x) = Φ(x) holds for any x ∈ R, where Φ is the standard normal cdf.

We can use the central limit theorem to compute probabilities about X̄n from

    P(a < X̄n ≤ b) ≈ Φ( √n (b − µ)/σ ) − Φ( √n (a − µ)/σ ) .

NB. There are many generalisations of the theorem where the assumptions of independence,
common distribution or finite variance of the Xi are relaxed.

The Central Limit Theorem is Amazing!

From next to no assumptions (independence and identical distribution with finite variance)
we arrive at a phenomenally strong conclusion: The sample mean asymptotically follows a
Gaussian distribution. Regardless of which particular distribution the Xn follow (exponential,
Bernoulli, binomial, uniform, triangle, ...), the resulting distribution is always Gaussian! It is
as though the whole of statistics and probability theory reduces to one distribution only!!

Example 3.12 Let X1 , . . . , Xn be an i.i.d. sample of exponential variables, i.e. Xi ∼ Exp(λ).


Find formulae to approximate the probabilities P(X̄n ≤ x) and P( Σi Xi ≤ x ).

Example 3.13 Consider a sequence of independent Bernoulli trials with constant probability
of success π, so that P (Xi = 1) = π, i = 1, 2, . . .. Use the Central Limit Theorem with
µ = π and σ 2 = π(1 − π) to derive the normal approximation to the binomial distribution.
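
The quality of the normal approximation is easy to explore by simulation. The sketch below (NumPy/SciPy, with an illustrative exponential population in the spirit of Example 3.12) compares the simulated distribution of √n(X̄n − µ)/σ with the standard normal cdf at a few points.

    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(3)
    lam, n, reps = 0.5, 30, 200_000
    mu, sigma = 1 / lam, 1 / lam                 # Exp(lambda) has mean and sd both 1/lambda

    samples = rng.exponential(scale=1 / lam, size=(reps, n))
    z = np.sqrt(n) * (samples.mean(axis=1) - mu) / sigma

    for point in (-1.0, 0.0, 1.0, 2.0):
        print(point, (z <= point).mean(), norm.cdf(point))   # empirical vs limiting cdf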

How large does n need to be for the CLT to work?

One of the disadvantages of the Central Limit Theorem is that in practical situations, one
can never be quite sure just how large n needs to be for the approximation to work well.
As an illustration, consider the sum of independent uniformly distributed random variables.
Approximate probability density functions (obtained through simulation and histograms - if
you want to know more about how this works, take STAT0023) are shown in Figures 3.1
and 3.2 below. In the case of U (0, 1) variables, it is clear that for n ≥ 5, the difference in
the pdf is nearly invisible. Any rule of thumb as to how large n needs to be suffers from
mathematicians’ curse: since the actual distribution of the Xi is assumed unknown, one can
always find a distribution for which convergence has not yet occurred for the particular value
of n considered!

Figure 3.1: Estimated probability density functions standardised to mean zero and variance
one. Top left: X1 , top right: X1 + X2 , bottom left: X1 + X2 + X3 , bottom right: Σ_{i=1}^{5} Xi

Figure 3.2: Estimated cumulative density functions standardised to mean zero. Note that for
N ≥ 5, the cdfs are essentially indistinguishable from standard normal.

Example 3.14 (Sums of i.i.d. uniformly distributed random variables) Let Xi ∼ U (0, 1)
be independent random variables. Compute the approximate distribution of Y = Σ_{n=1}^{100} Xn
and the probability that Y < 60.

Finally, as increasingly faster computers become available at steadily decreasing cost, the
Central Limit Theorem loses some of its practical importance. It is still regularly being used,
however, as an easy first guess.

We prove the theorem only in the case when the Xi have a moment generating function
defined on an open interval containing zero. Let us first suppose that E(X) = µ = 0. We
then know that E(X̄n) = 0 and Var(X̄n) = σ²/n.
Consider the standardised variable Zn = √n X̄n / σ. Then E[Zn] = 0 and Var(Zn) = 1
for all n. Note that Zn is a linear combination of the Xi , since

    Zn = √n Σ_{i=1}^{n} Xi / (σ n) = Σ_{i=1}^{n} Xi / (σ √n) .

Thus, by the arguments from the previous section, taking ai = 1/(σ√n), the mgf of Zn is
given by

    MZn(s) = Π_{i=1}^{n} MXi( s/(σ√n) ) = [ MX( s/(σ√n) ) ]^n
since the Xi all have the same distribution and thus the same mgf MX .
As the distribution of the Xi is not given we don’t know the mgf MX . However, MX has
the following Taylor series expansion about 0:

    MX(t) = MX(0) + t MX′(0) + (t²/2) MX′′(0) + εt ,                                       (3.1)
where εt is a term for which we know that εt /t2 → 0 for t → 0. We write this as εt = o(t2 )
(‘small order’ of t2 , meaning that εt tends to zero at a rate faster than t2 ).
To make use of the above Taylor series expansion, note that (recalling that µ = 0)

MX (0) = E[e0X ] = 1
MX′ (0) = E[Xi ] = 0
MX′′ (0) = E[Xi2 ] = σ 2 .

Inserting these values in (3.1) and replacing t by s/(σ√n) we find that

    MX( s/(σ√n) ) = 1 + 0 + ½ σ² ( s/(σ√n) )² + o(1/n) = 1 + s²/(2n) + o(1/n) .

Now back to Zn : from earlier, the mgf MZ of Zn is now given as the nth power of the above
and we find that
    MZ(s) = [ MX( s/(σ√n) ) ]^n = [ 1 + s²/(2n) + o(1/n) ]^n = [ 1 + (½ s² + δn)/n ]^n  →  e^{s²/2}

as n → ∞, since δn → 0. (Recall: (1 + x/n)^n → e^x as n → ∞.) The limiting mgf is the one


that we know as belonging to the standard normal distribution.

It can be shown that convergence of the moment generating functions implies convergence
of the corresponding distribution functions at all points of continuity. This proves the claim
in the case µ = 0.
In the case that E(Xi) = µ ≠ 0 define Yi = Xi − µ. Then E(Yi) = 0 so the result already
proved gives Zn = √n (X̄ − µ)/σ = √n Ȳ /σ →d N(0, 1) as required. □

Learning Outcomes:

Generating Functions (pgf / mgf) You should be able to

1. Reproduce the definitions of a pgf / mgf and derive their main properties;
2. Compute and recognise the pgf / mgf for standard situations;
3. Use the pgf to find the pmf of a discrete random variable as well as its expectation
and variance for standard cases;
4. Use the mgf to find the moments of random variables, in particular the mean and
variance, for standard cases.

Joint Generating Functions You should be able to

1. Reproduce the definition of joint probability / moment generating functions;


2. Derive the joint pgf / mgf for a collection of independent random variables;
3. Derive the pgf / mgf of linear combinations of a collection of (not necessarily
independent) random variables and recognise the resulting distribution (especially
for the multivariate normal case);
4. Characterise independence between variables with the help of their (joint) gener-
ating functions;
5. Apply the appropriate formula in order to find higher-order expectations from the
joint mgf, especially for the multivariate normal case;
6. State the central limit theorem;
7. Use the central limit theorem creatively to derive approximate probability state-
ments.
Chapter 4

Distributions of Functions of Normally


Distributed Variables

4.1 Motivation

The Central Limit Theorem (CLT) implies that the normal distribution arises in, or is a good
approximation to, many practical situations. Also, the case of a random sample X1 , . . . , Xn
from a normal population N (µ, σ 2 ) is favourable because many expressions are available in
explicit form which is probably at least as important a reason for the widespread use of the
normal approximation as the CLT. The random sample from a normal population was treated
in STAT0003 and the estimators X and S 2 for the mean and variance were exhibited as well
as some of their properties shown. In this chapter, we will develop a more careful discussion
of the properties and go beyond STAT0003 by establishing results on the joint distribution
of these two estimators. Finally, the t- and F-distributions are derived and discussed in order
to have them ready for applications in statistics (such as estimation and hypothesis testing).

4.2 Reminder: Random Sample from a Normal Population

From the mathematical point of view, a random sample of size n ∈ N is a collection of


independent, identically distributed (iid) random variables X1 , X2 , . . . , Xn . In the case of a


normal sample, Xi ∼ N (µ, σ 2 ), the sample mean

    X = (1/n) Σ_{i=1}^{n} Xi

is known (from STAT0003 or equivalent) to be unbiased, i.e. E[X] = µ and to have variance
Var(X) = σ 2 /n. To estimate the variance, σ 2 , in STAT0003 you used the sample variance

    S² = (1/(n − 1)) Σ_{i=1}^{n} (Xi − X)²

which you showed to be unbiased and consistent, i.e. you showed that the variance of S 2
goes to zero as the sample gets larger (i.e. as n → ∞) which implies consistency. To obtain
the sampling distribution of this estimator, i.e. the distribution of S 2 thought of as a random
variable itself, it is thus necessary to think about the distributions of sums of iid random
variables as well as the distribution of (Xi − X)2 and its sums, i.e. the sums of squares of
normal random variables. These distributions will occupy us for the remainder of this chapter
whereas the next chapter will consider estimators (and present more detail on unbiasedness,
variance, sampling distribution etc.) as well as address the question whether there could be
any better estimators than sample mean and sample variance.

4.3 The chi-squared (χ2) distribution

Preliminaries

Recall the pdf of the Gam(α, λ) distribution (see Appendix 2), its mean and variance E(X) =
α/λ, Var(X) = α/λ2 , its mgf {λ/(λ − s)}α (see example 3.3) and the additive property: if
X1 , . . . , Xn are independent Gam(α, λ) random variables then X1 + · · · + Xn is Gam(nα, λ).

Distribution of Sums of Squares of Normals

We start by showing that if X has the standard normal distribution then X 2 has the gamma
distribution with index 1/2 and rate parameter 1/2.

The moment generating function of X 2 is given by


    E( e^{sX²} ) = ∫_{−∞}^{∞} (1/√(2π)) exp( −½ x² + s x² ) dx

                 = ∫_{−∞}^{∞} (1/√(2π)) exp( −½ x² (1 − 2s) ) dx                            (4.1)

                 = (1 − 2s)^(−1/2)                                                          (4.2)

where (4.2) follows by comparison of (4.1) with the integral of the density of a normal variable
with mean 0 and variance (1 − 2s)^(−1). (Alternatively, substitute z = x √(1 − 2s) in the integral
(4.1).)

By comparison with the gamma mgf, we see that X 2 has a gamma distribution with index
and rate parameter each having the value 1/2. This distribution is also known as the χ2
distribution with one degree of freedom. (Thus χ21 ≡ Gam(1/2, 1/2).)

Example 4.1 Verify the result that X 2 ∼ Gam(1/2, 1/2) by using the transformation
ϕ(x) = x2 on a mean zero normal random variable X ∼ N (0, 1).

Now let X1 , . . . , Xν be independent standard normal variables, where ν is a positive integer.


Then it follows from the additive property of the gamma distribution that their sum of
squares X1² + . . . + Xν² has the gamma distribution Gam(ν/2, 1/2) with index ν/2 and rate
parameter 1/2. This distribution is also known as the χ2 distribution with ν degrees of
freedom and is written as χ2ν . Its mgf is therefore { (1/2) / ((1/2) − s) }^(ν/2) = (1 − 2s)^(−ν/2). (Thus
χ2ν ≡ Gam(ν/2, 1/2).)

It further follows from the mean and variance of the gamma distribution that the mean and
variance of U ∼ χ2ν are (verify)

E(U ) = ν , Var(U ) = 2ν

The pdf of the χ2 distribution with ν degrees of freedom (ν > 0) is


    f(u) = u^(ν/2 − 1) e^(−u/2) / ( 2^(ν/2) Γ(ν/2) )

for u > 0. This can be verified by comparison with the Gam(ν/2, 1/2) density (see Appendix 2).

Question: Could this also be derived using transformation of variables and the formula
fX+Y (z) = ∫_{−∞}^{∞} f(x) f(z − x) dx for the density of a sum of random variables?

Application to sampling distributions

We know that if X has a normal distribution with mean µ and variance σ 2 then the standard-
ised variable (X − µ)/σ has the standard normal distribution. It follows that if X1 , . . . , Xn
are independent normal variables, all with mean µ and variance σ 2 , then
    Σ_{i=1}^{n} ( (Xi − µ)/σ )² ∼ χ2n

because the left-hand side is the sum of squares of n independent standard normal random
variables. Since σ² = E(X − µ)², when µ is known the sample average Sµ² = (1/n) Σ_{i=1}^{n} (Xi − µ)²
is an intuitively natural estimator of σ². The above result gives the sampling distribution of
Sµ² as nSµ² ∼ σ² χ2n . We can now deduce the mean and variance of Sµ² from those of the χ2n
distribution:
    E(Sµ²) = (σ²/n) E( nSµ²/σ² ) = (σ²/n) × n = σ²

    Var(Sµ²) = (σ⁴/n²) Var( nSµ²/σ² ) = (σ⁴/n²) × 2n = 2σ⁴/n

Note that the expectation formula is generally true, even if the Xi are not normal. However,
the variance formula is only true for normal distributions.

When µ is unknown, it is natural to estimate it by X. We already know that E(X) = µ and


var(X) = σ²/n. Since a linear combination of normal variables is again normally distributed,
we deduce that

    X ∼ N( µ , σ²/n )

However, in order to be able to use this result for statistical inference when σ² is also unknown,
we need to estimate σ². An intuitively natural estimator of σ² when µ is unknown is the
sample variance S² = (1/(n − 1)) Σ (Xi − X)². The reason for the factor 1/(n − 1) rather than 1/n is
that S² is unbiased for σ², as we will see in the next chapter. In order to deduce the sampling
distribution of S² we need to find the distribution of the sum of squares Σ (Xi − X)².

4.3.1 The distribution of Σ (Xi − X)²

Let X1 , . . . , Xn be independent normally distributed variables with common mean µ and


variance σ². We will show that Σ_{i=1}^{n} (Xi − X)² / σ² has the χ² distribution with n − 1

degrees of freedom. In the following it will become clear why, exactly, we have to reduce the
number of degrees of freedom by one.

Result 4.1:
X and S² = Σ (Xi − X)² / (n − 1) are independent.

Result 4.2:
The sampling distribution of S² is (n − 1)S²/σ² ∼ χ2n−1 .

We do not prove Result 4.1 in this module: it uses ”joint” moment generating functions.

Derivation of Result 4.2

In order to find the distribution of S 2 note that


    Σ (Xi − µ)² = Σ (Xi − X)² + n (X − µ)²

so that

    Σ (Xi − µ)² / σ² = Σ (Xi − X)² / σ² + (X − µ)² / (σ²/n) .

But, since the (Xi − µ)/σ are independent N(0, 1) and (X − µ)/(σ/√n) is N(0, 1), we have

    Σ (Xi − µ)² / σ² ∼ χ2n    and    (X − µ)² / (σ²/n) ∼ χ21

Using mgfs, it follows that

    Σ (Xi − X)² / σ² ∼ χ2n−1

since (X − µ)²/(σ²/n) and Σ (Xi − X)²/σ² are independent.

This result gives the sampling distribution of S 2 to be (n − 1)S 2 ∼ σ 2 χ2n−1 .
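
Result 4.2 can be illustrated numerically: simulate many normal samples, compute (n − 1)S²/σ² for each, and compare with the χ2n−1 distribution. The sketch below (a check only, with illustrative parameter values) does this by comparing a few empirical quantiles with scipy.stats.chi2.

    import numpy as np
    from scipy.stats import chi2

    rng = np.random.default_rng(4)
    mu, sigma, n, reps = 5.0, 2.0, 10, 100_000

    x = rng.normal(mu, sigma, size=(reps, n))
    stat = (n - 1) * x.var(axis=1, ddof=1) / sigma**2    # (n - 1) S^2 / sigma^2

    for q in (0.1, 0.5, 0.9):
        print(q, np.quantile(stat, q), chi2.ppf(q, df=n - 1))   # empirical vs chi^2_{n-1} quantile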

From Results 4.1 and 4.2 we can deduce the sampling mean and variance of S 2 to be
    E(S²) = (σ²/(n − 1)) E( (n − 1)S²/σ² ) = (σ²/(n − 1)) × (n − 1) = σ²

    Var(S²) = (σ⁴/(n − 1)²) Var( (n − 1)S²/σ² ) = (σ⁴/(n − 1)²) × 2(n − 1) = 2σ⁴/(n − 1)
recalling that the mean and variance of χ2n−1 are n − 1, 2(n − 1) respectively. As for the case
where µ is known, the unbiasedness property of S 2 is generally true, even if the Xi are not
normal, but the variance formula is only true for normal distributions.

4.3.2 Student’s t distribution

If X1 , . . . , Xn are independent normally distributed variables with common mean µ and vari-
ance σ 2 then X is normally distributed with mean µ and variance σ 2 /n. Recall (STAT0002
& STAT0003, or equivalent) that if σ 2 is known, then we may test the hypothesis µ = µ0 by
examining the statistic
    Z = (X − µ0) / (σ/√n) .
Z is a linear transformation of a normal variable and hence is also normally distributed.
When µ = µ0 we see that Z ∼ N (0, 1) so we conduct the test by computing Z and referring
to the N (0, 1) distribution.

Now if σ 2 is unknown, it is intuitively reasonable to estimate it by S 2 (recalling that E(S 2 ) =


σ 2 ) and use the statistic
    T = (X − µ0) / (S/√n) .
However, in order to conduct the test we need to know the distribution of this statistic when
µ = µ0 .

Note that we can write T as

    T = [ (X − µ0) / (σ/√n) ] · (σ/S) = Z / √( U/(n − 1) ) ,

where U = Σi (Xi − X)²/σ². From the above results we have Z ∼ N(0, 1), U ∼ χ2n−1 and

Z, U are independent random variables. (Note that the distribution of T does not depend
on µ0 and σ 2 , but only on the known number n of observations and is therefore suitable as
a test statistic.)

We can now find the probability distribution of T by the transformation method that was
described in chapter 2.2. Alternatively, it can be derived from the F distribution. The
distribution of T , denoted by tn−1 , is known as Student’s t distribution with n-1 degrees
of freedom.

The general description of the t distribution is as follows. Suppose that Z has a standard nor-
mal distribution, U has a χ2 distribution with ν degrees of freedom, and Z, U are independent
random variables. Then
    T = Z / √(U/ν)

has the Student’s t distribution with ν degrees of freedom, denoted by tν . T has probability
density function

    fT(t) = [ Γ((ν + 1)/2) / ( Γ(ν/2) √(νπ) ) ] ( 1 + t²/ν )^(−(ν+1)/2)
for −∞ < t < ∞. The distribution of T is symmetrical about 0 , so that E(T ) = 0 (ν > 1).
It can further be shown that the variance of T is ν/(ν − 2) for ν > 2.

Example 4.2 Write down the pdf of a t1 -distribution and identify the distribution by name
(look back at Example 2.2). Why is ν > 1 needed for the expected value to be zero?

Not examined: What happens as the sample size ν becomes large? Derive the limit of the
pdf in this case.

4.4 The F distribution

Now suppose that we have two independent samples of observations: X1 , . . . , Xm are


independent normally distributed variables with common mean µX and variance σX², while
Y1 , . . . , Yn are independent and normally distributed with common mean µY and variance
σY². Suppose that we wish to test the hypothesis that the variances σX² and σY² are equal.
If σX² = σY² then σX²/σY² = 1 and it is natural to examine the ratio of the two sample
variances, SX²/SY², and compare its value with 1. Here SX² = Σ_{i=1}^{m} (Xi − X)²/(m − 1) ,
SY² = Σ_{i=1}^{n} (Yi − Y)²/(n − 1).

Now, since (m − 1)SX²/σX² ∼ χ2m−1 and (n − 1)SY²/σY² ∼ χ2n−1 , we can write

    (SX²/σX²) / (SY²/σY²) = [ U/(m − 1) ] / [ V/(n − 1) ] ,

where U = Σi (Xi − X)²/σX² ∼ χ2m−1 , V = Σi (Yi − Y)²/σY² ∼ χ2n−1 and U and V are
independent variables (since they are based on two independent samples). When σX² = σY² ,
the left hand side is just the ratio SX²/SY² .

The above considerations motivate the following general description of the F distribution.
Suppose that U ∼ χ2α , V ∼ χ2β and U, V are independent. Then

    W = (U/α) / (V/β)

has the F distribution with (α, β) degrees of freedom, denoted Fα,β .

Using the Beta function B(a, b) ≡ Γ(a)Γ(b)/Γ(a + b), the probability density of the F distribution
with (α, β) degrees of freedom can be written down as

    fW(w) = (α/β) (αw/β)^(α/2 − 1) / [ B(α/2, β/2) (1 + αw/β)^((α+β)/2) ]

for w ≥ 0.

Note that, from the definition of the F distribution, we have


    1/W = (V/β) / (U/α) ∼ Fβ,α
i.e. if W ∼ Fα,β then 1/W ∼ Fβ,α .

It can be shown that, for all α and for β > 2, E(W ) = β/(β − 2).

We can now apply this result to our test statistic discussed earlier. Under the hypothesis that
σX² = σY² we have (taking α = m − 1, β = n − 1)

    SX²/SY² ∼ Fm−1,n−1

This is the sampling distribution of the variance ratio. Note that this distribution is free from
σX and σY .
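The sampling distribution of the variance ratio can be checked by simulation: when the two normal samples share a common variance, the ratio S_X²/S_Y² should behave like an Fm−1,n−1 variable, with mean close to (n − 1)/(n − 3). The sketch below is illustrative only; the sample sizes, means, common variance and seed are arbitrary choices.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
m, n, n_sim = 8, 12, 50_000

x = rng.normal(loc=0.0, scale=2.0, size=(n_sim, m))    # common variance 4
y = rng.normal(loc=5.0, scale=2.0, size=(n_sim, n))    # same variance, different mean

ratio = x.var(axis=1, ddof=1) / y.var(axis=1, ddof=1)  # S_X^2 / S_Y^2 per replication

print(ratio.mean(), (n - 1) / (n - 3))   # E(W) = beta/(beta - 2) with beta = n - 1
print(stats.kstest(ratio, stats.f(dfn=m - 1, dfd=n - 1).cdf))  # compare with F_{m-1,n-1}
```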

Example 4.3 Suppose that X, Y, U are independent random variables such that X ∼ N(2, 9), Y ∼ t4 and U ∼ χ²_3 . Give four functions of the above variables that have the following distributions:
(i) χ²_1 , (ii) χ²_4 , (iii) t3 , (iv) F1,4 . □

Learning Outcomes: Chapter 4 derives some fundamental results for statistical theory
using the methods presented in Chapters 1 – 3. It should therefore consolidate your
usage of and familiarity with those methods. In particular, you should be able to

1. Define the chi–squared distribution, t distribution and F distribution in terms of transformations of normal variables and know how they relate to each other,
2. State the main properties of the above distributions (mean, variance, shape, meaning of parameters),
3. State the relationship between the chi–squared and gamma distributions,
4. Remember the relationship between X̄ and S² and sketch how it is derived,
5. Show how the chi–squared distribution can be derived using the properties of mgfs.
Chapter 5

Statistical Estimation

5.1 Overview

We will now address the problem of estimating an unknown population parameter based on
a sample X1 , . . . , Xn . In particular, we focus on how to decide whether a potential estimator
is a ‘good’ one or if we can find a ‘better’ one. To this end, we need to define criteria
for good estimation. Here we present two obvious criteria that state that a good estimator
should be ‘close’ to the true unknown parameter value (accuracy) and should have as little
variation as possible (precision). We then go on to describe some methods that typically
yield good estimators in the above sense. Results from the previous sections may be used to
derive properties of estimators, which can be viewed as transformations of the original sample
variables X1 , . . . , Xn .

5.2 Criteria for good estimators

Suppose X1 , . . . , Xn represent observations on some sample space. Then a single or vector-valued function Tn (X1 , . . . , Xn ) that does not depend on any unknown parameters is called a statistic. If its value is to be used as an estimate of an unknown parameter θ in a model for the observations, it is called an estimator of θ.

Since Tn is a random variable it has a probability distribution, called the sampling distribu-
tion of the estimator. The properties of this distribution determine whether or not Tn is a
‘good’ estimator of θ.


The difference E(Tn) − θ = b_{Tn}(θ) is called the bias of the estimator Tn . If b_{Tn}(θ) ≡ 0 (i.e. b_{Tn}(θ) = 0 for all values of θ) then Tn is an unbiased estimator of θ. Although we tend to regard unbiasedness as desirable, there may be a biased estimator Tn′ giving values that tend to be closer to θ than those of the unbiased estimator Tn .

If we want an estimator that gives estimates close to θ we might look for one with small mean square error (mse), defined as

mse(Tn ; θ) = E[(Tn − θ)²].

Note that

mse(Tn ; θ) = E[{Tn − E(Tn) + E(Tn) − θ}²]
            = E[{Tn − E(Tn)}²] + {E(Tn) − θ}² + 2{E(Tn) − θ}E{Tn − E(Tn)}
            = Var(Tn) + {b_{Tn}(θ)}²

since E{Tn − E(Tn)} = 0. We therefore have the relation

mse(Tn ; θ) = Var(Tn) + {b_{Tn}(θ)}².

We see that small mean square error provides a trade-off between small variance and small bias.
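The decomposition mse = variance + bias² is easy to verify by Monte Carlo. The sketch below uses a deliberately biased toy estimator Tn = nX̄/(n + 1) of a normal mean; the estimator and all numerical values are invented purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
mu, sigma, n, n_sim = 3.0, 2.0, 10, 200_000   # arbitrary values

samples = rng.normal(mu, sigma, size=(n_sim, n))
t_n = n * samples.mean(axis=1) / (n + 1)      # toy biased estimator of mu

mse = np.mean((t_n - mu) ** 2)
var = t_n.var()
bias = t_n.mean() - mu
print(mse, var + bias ** 2)                   # the two numbers should agree closely
```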

Example 5.1 Let X1 , . . . , Xn be an iid sample from a normal distribution with mean µ and variance σ². Then S² is unbiased for σ², whereas S̃² = (1/n) Σ_i (Xi − X̄)² is biased for σ².

Compare the mean square errors of S² and S̃².

If we restrict ourselves to unbiased estimators then the mse criterion is equivalent to a variance criterion and we search for minimum variance unbiased estimators (mvue).

5.2.1 Terminology:

The standard deviation of an estimator is also called its standard error. When the standard
deviation is estimated by replacing the unknown θ by its estimate, it is the estimated standard
error – but often just referred to as the ‘standard error’.

Example 5.2 (Example 5.1 ctd.) S² has standard deviation √(2σ⁴/(n − 1)). So S² is an estimator of σ² with standard error s²√(2/(n − 1)).

We now give the main result of this section.



5.3 Methods for finding estimators

5.3.1 Overview

We have discussed some desirable properties that estimators should have, but so far have only
‘inspired guesswork’ to find them. We need an objective method that will tend to produce
good estimators.

5.3.2 The method of moments

In general, for a sample X1 , . . . , Xn from a density with k unknown parameters, the method
of moments (Karl Pearson 1894) is to calculate the first k sample moments and equate
them to their theoretical counterparts (in terms of the unknown parameters), giving a set of k
simultaneous equations in k unknowns for solution. This procedure requires no distributional
assumptions. However, it might sometimes be helpful to assume a specific distribution in
order to express the theoretical moments in terms of the parameters to be estimated. The
moments may be derived using the mgf, for example.
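As a worked illustration (separate from the examples below), consider an i.i.d. Gam(α, λ) sample. Using E(X) = α/λ and Var(X) = α/λ² (see Appendix B), equating the sample mean and sample variance to these expressions gives α̂ = x̄²/σ̂² and λ̂ = x̄/σ̂². A minimal Python sketch, with arbitrary "true" parameter values chosen only for the demonstration:

```python
import numpy as np

rng = np.random.default_rng(3)
alpha_true, lam_true = 2.5, 1.5                  # arbitrary values for the demo
x = rng.gamma(shape=alpha_true, scale=1 / lam_true, size=10_000)

m1 = x.mean()        # first sample moment
v = x.var()          # sample variance (second central moment)

lam_hat = m1 / v     # solves alpha/lambda = m1, alpha/lambda^2 = v
alpha_hat = m1 ** 2 / v
print(alpha_hat, lam_hat)   # should be close to 2.5 and 1.5
```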

Example 5.3 Let X1 , . . . , Xn be an i.i.d. sample where Xi has density

f(x) = θx^{θ−1} for 0 < x < 1, and f(x) = 0 otherwise.

Find the moment estimator of θ.

Example 5.4 (Example 5.1 ctd.) What are the moment estimators of µ and σ 2 given an
i.i.d. normal sample?

5.3.3 Least squares

If X1 , . . . , Xn are observations whose means are functions of an unknown parameter (or vector of parameters), θ, then the least squares estimator of θ is that value θ̂LS of θ which minimises the sum of squares

R = Σ_i {Xi − E(Xi)}².

Example 5.5 Let X1 , . . . , Xn be a random sample with common mean E(Xi ) = µ. Find
the least squares estimator of µ.

Note that no other assumptions are required — the sample does not need to be independent
or identically distributed and no distributional assumptions are needed. The properties of
the estimator, however, will depend on the probability distribution of the sample.

Least squares is commonly used for estimation in (multiple) regression models. For example,
the straight-line regression model is

E(Xi ) = α + βzi

where the zi are given values. The least squares estimators of α and β are obtained by solving the equations ∂R/∂α = ∂R/∂β = 0, where R = Σ_i (Xi − α − βzi)².

BLUE property. In linear models least squares estimators have minimum variance amongst
all estimators that are linear in X1 , . . . , Xn and unbiased – best linear unbiased estimator.
(The Gauss-Markov Theorem.)
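For the straight-line model, solving ∂R/∂α = ∂R/∂β = 0 leads to the familiar closed-form solutions β̂ = Σ_i(zi − z̄)(Xi − X̄)/Σ_i(zi − z̄)² and α̂ = X̄ − β̂z̄ (a standard result, quoted here without derivation). A minimal sketch on simulated data with arbitrary parameter values, compared against numpy's polynomial fit:

```python
import numpy as np

rng = np.random.default_rng(4)
alpha_true, beta_true = 1.0, 0.5                 # arbitrary values for the demo
z = np.linspace(0, 10, 50)                       # given design points z_i
x = alpha_true + beta_true * z + rng.normal(scale=0.3, size=z.size)

beta_hat = np.sum((z - z.mean()) * (x - x.mean())) / np.sum((z - z.mean()) ** 2)
alpha_hat = x.mean() - beta_hat * z.mean()
print(alpha_hat, beta_hat)
print(np.polyfit(z, x, deg=1))                   # [slope, intercept], for comparison
```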

5.3.4 Maximum likelihood

This estimation method requires the distribution of the data to be known.

When X1 , . . . , Xn are i.i.d. with density (or mass function) fX (xi ; θ) — where now we make explicit the dependence of the function on the unknown parameter(s) θ (which can be a vector) — suppose that we observe the sample values x1 , . . . , xn . Then the joint density or joint probability of the sample is

Π_i fX (xi ; θ).

If we regard this as a function of θ for fixed x1 , . . . , xn then it is called the likelihood function of θ, written L(θ). The space of possible values of θ is the parameter space, Θ.

The method of maximum likelihood estimates θ by that value, θ̂ML , which maximises L(θ) over the parameter space Θ, i.e.

L(θ̂ML) = sup_{θ∈Θ} L(θ).

Since, for a sample of independent observations, the likelihood is a product of the likelihoods for the individual observations, it is often more convenient to maximise the log-likelihood ℓ(θ) = log L(θ):

ℓ(θ) = log L(θ) = log Π_i fX (xi ; θ) = Σ_i log fX (xi ; θ).
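In the examples below the maximisation can be done analytically, but the same log-likelihood can always be maximised numerically. The sketch below does this for an exponential sample, the setting of Example 5.7, by minimising the negative log-likelihood ℓ(λ) = n log λ − λ Σ_i xi with a scalar optimiser; the data, bounds and seed are arbitrary choices.

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(5)
x = rng.exponential(scale=1 / 2.0, size=500)   # sample with rate lambda = 2 (arbitrary)

def neg_log_lik(lam):
    # l(lambda) = sum_i log(lambda e^{-lambda x_i}) = n log(lambda) - lambda sum_i x_i
    return -(x.size * np.log(lam) - lam * x.sum())

res = minimize_scalar(neg_log_lik, bounds=(1e-6, 50.0), method="bounded")
print(res.x)                                   # numerical ML estimate of lambda
```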

Example 5.6 Let X1 , . . . , Xn be an i.i.d. sample of Bernoulli variables with parameter p. Find the maximum likelihood estimator p̂ML of p.

Example 5.7 Let X1 , . . . , Xn be an i.i.d. sample of Exponential variables with parameter λ. Find the ML estimator λ̂ML of λ.

Differentiation gives only local maxima and minima, so that even if the second derivative d²ℓ(θ)/dθ² is negative (corresponding to a local maximum, rather than minimum) it is still
possible that the global maximum is elsewhere. The global maximum will be either a local
maximum or achieved on the boundary of the parameter space. (In Example 5.7 there is a
single local maximum and the likelihood vanishes on the boundary of Θ.)

Example 5.8 Let X1 , . . . , Xn be an i.i.d. sample from a Uniform distribution on [0, θ]. Find
the ML estimator θ̂M L of θ.
NOTE: The ML estimator cannot be found by differentiating in this case. Why?

Example 5.9 Let X1 , . . . , Xn be an i.i.d. sample from a Uniform distribution on [θ, θ + 1].
Then the ML estimator of θ is not unique.

ML estimators of transformed parameters

Example 5.10 [Example 5.7 ctd.] Let X1 , . . . , Xn be an i.i.d. sample of Exponential variables with parameter λ. Find the ML estimator of µ = E(Xi ).

Example 5.10 illustrates an important result, that if we reparameterise the distribution using
particular functions of the original parameters, then the maximum likelihood estimators of
the new parameters are the corresponding functions of the maximum likelihood estimators of
the original parameters. This is easy to see, as follows.

Suppose that ϕ = g(θ) where θ is a (single) real parameter and g is an invertible function. Then, letting L̃(ϕ) be the likelihood function in terms of ϕ, we have

L̃(ϕ) = f(x; θ = g⁻¹(ϕ)) = L(θ)

and so

dL̃(ϕ)/dϕ = (dL(θ)/dθ) · (dθ/dϕ).

Assuming the existence of a local maximum, the ML estimator ϕ̂ML of ϕ occurs where dL̃(ϕ)/dϕ = 0. Since g is assumed invertible we have dϕ/dθ ≠ 0, so it follows that dL(θ)/dθ = 0 there and hence L̃(ϕ) is maximised when θ = g⁻¹(ϕ) = θ̂ML . It follows that ϕ̂ML = g(θ̂ML). (This result also generalises to the case when θ is a vector.)

Asymptotic behaviour of ML estimators

Maximum likelihood estimators can be shown to have good properties, and are often used.
Under fairly general regularity conditions it can be shown that the maximum likelihood esti-
mator (mle) exists and is unique for sufficiently large sample size and that it is consistent;
that is, θ̂M L → θ, the true value of θ, ‘in probability’. This means that when the sample size
is large the ML estimator of θ will (with high probability) be close to θ.

Under some additional conditions we can make an even stronger statement.

Define the Fisher information in the sample X:

I(θ) = E[ −∂²/∂θ² log fX(X; θ) ].

Note that if X1 , . . . , Xn are i.i.d. with density f(x; θ) then fX(X; θ) = Π_i f(Xi ; θ). Therefore

∂²/∂θ² log fX(X; θ) = Σ_i ∂²/∂θ² log f(Xi ; θ),

which is a sum of n i.i.d. random variables. It follows that

I(θ) = n E[ −∂²/∂θ² log f(X1 ; θ) ] = n i(θ),

where i(θ) is Fisher's information in a single observation. This illustrates the additive property of Fisher information; information increases with increasing sample size.

The result is that θ̂ML is asymptotically normally distributed with mean θ and variance 1/I(θ) where, for an i.i.d. sample, I(θ) = n E{−∂²/∂θ² log f(X1 ; θ)} as before. That is, θ̂ML ∼ N(θ, 1/I(θ)) approximately for large n. More formally,

√(I(θ)) (θ̂ML − θ) → N(0, 1)  in distribution

as n → ∞.

In addition, since the asymptotic probability distribution is known and normal, we can use it to construct tests or confidence intervals for θ. However, notice that the variance involves Fisher's information I(θ), which is unknown since we do not know θ. But it can be shown that the above asymptotic result remains true if we estimate I(θ) by simply plugging in the ML estimator of θ; that is, use I(θ̂ML).
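The asymptotic normality is easy to see in a simulation. For an i.i.d. Poisson(λ) sample the ML estimator is the sample mean and I(λ) = n/λ (standard facts quoted here only for the illustration, not derived in these notes), so √(I(λ))(λ̂ML − λ) should look approximately standard normal for moderate n. A minimal sketch with arbitrary numerical values:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
lam, n, n_rep = 3.0, 200, 20_000          # arbitrary values

samples = rng.poisson(lam, size=(n_rep, n))
lam_hat = samples.mean(axis=1)            # ML estimate in each replication

z = np.sqrt(n / lam) * (lam_hat - lam)    # sqrt(I(lambda)) * (lam_hat - lambda)
print(z.mean(), z.std())                  # should be close to 0 and 1
for q in [0.05, 0.5, 0.95]:
    print(q, np.quantile(z, q), stats.norm.ppf(q))  # compare with N(0,1) quantiles
```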

NB: Although we have discussed the method of maximum likelihood in the context of an
i.i.d. sample, this is not required. All that is needed is that we know the form of the joint
density or probability mass function for the data.

Example 5.11 (Example 5.7 ctd.) Obtain the asymptotic distribution of λ̂M L .

Learning Outcomes: Chapter 5 presents the essentials of the theory of statistical estima-
tion. It is important that you are able to

1. Define, explain and compare desirable properties of estimators,
2. Derive these properties for standard estimators,
3. State and derive the relationship between the mean square error, bias and variance,
4. Explain the idea of maximum likelihood in words and name its pros and cons,
5. Find the maximum likelihood estimator in standard and non-standard situations,
6. State in general, and compute in specific cases, the asymptotic distribution of the ML estimator,
7. Compare the finite sample properties of ML estimators with their asymptotic distribution,
8. In a given situation, decide which of different possible estimators is most suitable, taking their properties and the benefits / limitations of the above methods into account.
Chapter 6

Introduction to Bayesian Methods

6.1 Introduction

6.1.1 Reminder: Rule of Total Probability and Bayes’ Theorem

Suppose B1 , B2 , B3 give a partition of a sample space Ω, so that B1 , B2 , B3 are mutually exclusive and their union is all of Ω. Given any event A, clearly it is given by the disjoint union

A = (A ∩ B1 ) ∪ (A ∩ B2 ) ∪ (A ∩ B3 ),

thus
P (A) = P (A ∩ B1 ) + P (A ∩ B2 ) + P (A ∩ B3 ).

We also know from the definition of conditional probability that if each of the Bi has non-zero probability, then

P (A ∩ Bi ) = P (A|Bi )P (Bi ),

for each i = 1, 2, 3. Thus we obtain that,

P (A) = P (A|B1 )P (B1 ) + P (A|B2 )P (B2 ) + P (A|B3 )P (B3 ).

Recall that this is referred to as the rule of total probability.

Example 6.1 (Two Face) The DC comic book villain Two-Face often uses a coin to decide the fate of his victims. If the result of the flip is tails, then the victim is spared; otherwise the victim is killed. It turns out he actually randomly selects from three coins: a fair one, one that comes up tails 1/3 of the time, and another that comes up tails 1/10 of the time. What is the probability that a victim is spared?

Sometimes we also want to compute P(Bi |A), and a bit of algebra gives the following formula, in the case i = 3:

P(B3 |A) = P(B3 ∩ A) / P(A)
         = P(A|B3 )P(B3 ) / [P(A|B1 )P(B1 ) + P(A|B2 )P(B2 ) + P(A|B3 )P(B3 )].

Recall that this is referred to as Bayes’ theorem.
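Both formulas are immediate to evaluate numerically once the P(Bi) and P(A|Bi) are specified. A minimal sketch; the numbers below are made up purely for illustration and are not taken from the examples:

```python
import numpy as np

p_B = np.array([0.5, 0.3, 0.2])           # hypothetical P(B_1), P(B_2), P(B_3)
p_A_given_B = np.array([0.9, 0.4, 0.1])   # hypothetical P(A | B_i)

p_A = np.sum(p_A_given_B * p_B)           # rule of total probability
p_B_given_A = p_A_given_B * p_B / p_A     # Bayes' theorem, all i at once
print(p_A, p_B_given_A)
```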

Example 6.2 (Example 6.1 ctd.) Suppose that the victim was spared. Then what is the
probability that the fair coin was used?

6.1.2 Bayesian statistics

In classical statistics, θ ∈ ∆ is unknown so we take a random sample X = (X1 , . . . , Xn ) from fθ and then we make an inference about θ.

In Bayesian statistics, rather than thinking of the parameter as an unknown constant, we think of it as a random variable having some distribution. Let (fθ )θ∈∆ be a family of pdfs. Let Θ be a random variable with pdf r taking values in ∆. Here r is called the prior pdf for Θ; we
do not really know the true pdf for Θ, and this is a subjective assignment or guess based on
our present knowledge or ignorance. We think of f (x1 ; θ) = f (x1 |θ) as the conditional pdf
of a random variable X1 that can be generated in the following two step procedure: First,
we generate Θ = θ, then we generate X1 with pdf fθ . In other words, we let the joint pdf of
X1 and Θ be given by
f (x1 |θ)r(θ).

In shorthand, we will denote this model by writing

X1 |θ ∼ f (x1 |θ)

Θ ∼ r(θ)

Similarly, we say that X = (X1 , . . . , Xn ) is an i.i.d. random sample from the conditional distribution of X1 given Θ = θ if X1 |θ ∼ f (x1 |θ) and

L(x; θ) = L(x|θ) = Π_{i=1}^n f (xi |θ);

in which case the joint pdf of X and Θ is given by

j(x, θ) = L(x|θ)r(θ).

Thus the random sample X can be sampled on a computer, by sampling Θ, and then upon
knowing that Θ = θ, we draw a random sample from the pdf fθ .

What we are interested in is updating our knowledge or belief about the distribution of Θ, after observing X = x; more precisely, we consider

s(θ|x) = j(x, θ)/fX(x) = L(x|θ)r(θ)/fX(x),

where fX is the pdf of X (alone), which can be obtained by integrating or summing the joint density j(x, θ) with respect to θ. We call s the posterior pdf. Thus ‘prior’ refers to our knowledge of the distribution of Θ prior to our observation X and ‘posterior’ refers to our knowledge after our observation of X.

Let us remark on our notation. Earlier in the course, we used Θ to denote the set of possible parameter values; here, we use ∆ to denote this, since we are reserving Θ for a random variable taking values θ ∈ ∆. We use r to denote the prior pdf, the next letter s to denote
the posterior pdf, j to denote the joint pdf of X and Θ, fX to denote the pdf of X alone,
and fθ = f (·|θ) to denote the pdf corresponding to the conditional distribution of Xi given
θ.

Example 6.3 Let X = (X1 , . . . , Xn ) be an i.i.d. random sample from the conditional dis-
tribution of X1 given Θ = θ, where X1 |θ ∼ Ber(θ) and Θ ∼ U (0, 1). Find the posterior
distribution.

6.1.3 Calculation tools

In computing a posterior distribution it is not necessary to directly compute the pdf of X. We have that

s(θ|x) = L(x|θ) r(θ) (fX(x))⁻¹ .

Since s itself is a pdf, fX (x) (which does not depend on θ) can be thought of as a normalizing
constant. Often one writes
s(θ|x) ∝ G(x; θ),

to mean that there exists a constant c(x) such that

s(θ|x) = c(x)G(x; θ).

Thus
s(θ|x) ∝ L(x|θ)r(θ).

It is often possible to identify the pdf from an expression involving L(x|θ)r(θ) or other
simplified expressions.
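When no familiar pdf can be recognised, the proportionality s(θ|x) ∝ L(x|θ)r(θ) can still be used numerically: evaluate the right-hand side on a grid of θ values and normalise so that it integrates to one. A minimal sketch for a Bernoulli likelihood with a U(0, 1) prior (the setting of Example 6.3); the data and grid size are arbitrary:

```python
import numpy as np

x = np.array([1, 0, 1, 1, 0, 1, 1, 0])          # hypothetical Bernoulli observations
theta = np.linspace(0.001, 0.999, 999)          # grid over the parameter space
dtheta = theta[1] - theta[0]

prior = np.ones_like(theta)                     # U(0,1) prior density r(theta)
lik = theta ** x.sum() * (1 - theta) ** (x.size - x.sum())   # L(x|theta)

unnorm = lik * prior
posterior = unnorm / (unnorm.sum() * dtheta)    # normalise to integrate to 1
print(np.sum(theta * posterior) * dtheta)       # e.g. the posterior mean
```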

Example 6.4 Let X = (X1 , . . . , Xn ) be an i.i.d. random sample from the conditional dis-
tribution of X1 given Θ = θ, where X1 |θ ∼ N (θ, 1) and Θ ∼ N (0, 1). Find the posterior
distribution.

6.2 Conjugate family of distributions

Consider a family C of prior pdfs for Θ, and the family of conditional distributions of Xi given Θ = θ, F = {fθ : θ ∈ ∆}.

Let X = (X1 , . . . , Xn ) be an i.i.d. random sample from the conditional distribution of Xi given Θ = θ, where Xi |θ ∼ f (xi |θ) and Θ ∼ r(θ) for some r ∈ C. We say that the class C gives a conjugate family of distributions for F if the posterior pdf of Θ given X is such that s(θ|x) ∈ C for all x and all r ∈ C.

Example 6.5 Let C = (rα,β )α>0,β>0 , where rα,β is the pdf of the gamma distribution with
parameters α and β. Let F = (fθ )θ>0 , where fθ is the pdf of a Poisson random variable with
mean θ. Show that C is a conjugate family for F.

Example 6.6 Fix σ > 0. Let C = (rµ0 ,σ0 )µ0 >0,σ0 >0 , where rµ0 ,σ0 is the pdf of a normal random variable with mean µ0 and variance σ0². Let F = (fθ )θ>0 , where fθ is the pdf of a normal random variable with mean θ and variance σ². Show that C is a conjugate family for F and the posterior hyperparameters are given by

µ′ = [σ0²/(σ0² + σ²/n)] x̄ + [(σ²/n)/(σ0² + σ²/n)] µ0

and

σ′² = [(σ²/n)/(σ0² + σ²/n)] σ0² .

Example 6.7 Let C = (rα,β )α>0,β>0 , where rα,β is the pdf of a beta random variable with parameters α and β. Let F = (fθ )θ∈(0,1) , where fθ is the pdf of a Bernoulli random variable with parameter θ ∈ (0, 1). Show that C is a conjugate family for F and the posterior hyperparameters are given by

α′ = α + t

and

β′ = β + n − t,

where t is the sample sum.

Let us remark that in the notation of these examples, we will sometimes call α′ and β′ the posterior hyperparameters and α and β the prior hyperparameters.
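The beta–Bernoulli update of Example 6.7 is straightforward to implement: add the number of successes to α and the number of failures to β. A minimal sketch with arbitrary prior hyperparameters and made-up data:

```python
import numpy as np
from scipy import stats

x = np.array([1, 1, 0, 1, 0, 1, 1, 1, 0, 1])        # hypothetical Bernoulli data
alpha0, beta0 = 2.0, 2.0                             # arbitrary prior hyperparameters

t, n = x.sum(), x.size                               # sample sum and sample size
posterior = stats.beta(alpha0 + t, beta0 + n - t)    # Beta(alpha', beta') posterior

print(posterior.mean())                              # posterior mean of Theta
print(posterior.interval(0.95))                      # a central 95% posterior interval
```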

6.3 Bayes’ estimators

Let X = (X1 , . . . , Xn ) be an i.i.d. random sample from the conditional distribution of X1 given Θ = θ ∈ ∆, which has pdf f (x1 |θ), and let Θ have prior pdf r(θ). Suppose that we observe
X = x and have calculated the posterior distribution s(θ|x). We can now compute

δ(x) := E(Θ|X = x).

In the case that s(θ|x) is continuous, we have

δ(x) = ∫ θ s(θ|x) dθ.

One natural point estimate for θ is δ(x), and the associated point estimator is given by δ(X) = E(Θ|X). This is just one important example of a Bayes’ estimator.

It is not difficult to verify that δ(x) is the value a for which

E[(Θ − a)² | X = x]

is minimized. Thus in the case where L(θ, θ′) = |θ − θ′|², we have that δ(x) minimizes

E[L(Θ, δ(x)) | X = x].

In general, we can consider other choices of L. The function L is called a loss function, and we aim to find a δ(x) which minimizes the conditional expected loss. The function δ is called a decision function and is a Bayes’ estimate of θ if it is a minimizer. More generally, if we are interested in estimating a function of θ, given by g(θ), a Bayes’ estimator of g(θ) is a decision function δ(x) which minimizes

E[L(g(Θ), δ(x)) | X = x].

In this course we will mostly be concerned with the squared loss function L(θ, θ′) = |θ − θ′|². There are many other reasonable choices of loss function to consider, for example, the absolute loss given by L(θ, θ′) = |θ − θ′|.

Example 6.8 Let Y be a random variable. Set g(a) = E[(Y − a)²]. Minimize g.

Example 6.9 Let Y be a continuous random variable. Set g(a) = E[|Y − a|]. Minimize g.

Example 6.10 Let X = (X1 , . . . , Xn ) be an i.i.d. random sample from the conditional distribution of X1 given Θ = θ, where X1 |θ ∼ U(0, θ) and Θ has the Pareto distribution with scale parameter b > 0 and shape parameter α > 0 with pdf

r(θ) = αb^α / θ^{α+1}

for θ > b, and 0 otherwise. Find the Bayes’ estimator (with respect to the squared loss function) for θ.

So, as we see from Example 6.10, computing a Bayes’ estimator boils down to computing the posterior distribution.
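Under squared loss, then, the Bayes’ estimate is simply the posterior mean. For the beta–Bernoulli setting of Example 6.7 it can be written down directly, since a Beta(α, β) distribution has mean α/(α + β) (Appendix B), giving (α + t)/(α + β + n). A minimal sketch, reusing the made-up data and prior from the earlier snippet:

```python
import numpy as np

x = np.array([1, 1, 0, 1, 0, 1, 1, 1, 0, 1])   # same hypothetical data as before
alpha0, beta0 = 2.0, 2.0                        # arbitrary prior hyperparameters
t, n = x.sum(), x.size

# Posterior is Beta(alpha0 + t, beta0 + n - t); its mean is the Bayes' estimate
# under squared loss.
delta = (alpha0 + t) / (alpha0 + beta0 + n)
print(delta)
print(t / n)        # sample proportion, for comparison
```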

Learning Outcomes: Chapter 6 presents the essentials of the theory of Bayesian statistics.
It is important that you are able to

1. Understand the logic behind Bayesian inference,
2. Define and calculate posterior distributions given prior distributions and data,
3. Derive normalizing constants for given distributions,
4. Define and derive conjugate families of distributions,
5. Define and understand the concepts of loss function and risk function,
6. Minimize the expected risk under different loss functions, in particular the quadratic loss function, and obtain Bayes’ estimators.

Appendix A. Standard discrete distributions

Bernoulli distribution

X = 0 with probability 1 − p, X = 1 with probability p, or

pX(x) = p^x (1 − p)^{1−x} , x = 0, 1 ; 0 < p < 1

E(X) = p , Var(X) = p(1 − p)

Denoted by Ber(p)

Binomial distribution

pX(x) = (n choose x) p^x (1 − p)^{n−x} , x = 0, 1, . . . , n ; 0 < p < 1

E(X) = np , Var(X) = np(1 − p)

Denoted by Bin(n, p)
X is the number of successes in n independent Bernoulli trials with constant probability of
success p.
X can be written X = Y1 + · · · + Yn , where Yi are independent Ber(p).

Geometric distribution

pX(x) = (1 − p)^{x−1} p , x = 1, 2, . . . ; 0 < p < 1

E(X) = 1/p , Var(X) = (1 − p)/p²

Denoted by Geo(p)
X is the number of trials until the first success in a sequence of independent Bernoulli trials
with constant probability of success p.

Negative binomial distribution


 
pX(x) = (x−1 choose r−1) p^r (1 − p)^{x−r} , x = r, r + 1, . . . ; 0 < p < 1, r ≥ 1

E(X) = r/p , Var(X) = r(1 − p)/p²

Denoted by NB(r, p)
X is the number of trials until the rth success in a sequence of independent Bernoulli trials
with constant probability of success p.
The case r = 1 is the Geo(p) distribution.
X can be written X = Y1 + · · · + Yr , where Yi are independent Geo(p)

Hypergeometric distribution
  
pX(x) = (M choose x) (N−M choose n−x) / (N choose n) , x = 0, 1, . . . , min(n, M) ; 0 < n, M ≤ N

E(X) = nM/N , Var(X) = nM(N − M)(N − n) / (N²(N − 1))

Denoted by H(n, M, N )
X is the number of items of Type I when sampling n items without replacement from a
population of size N , where there are M items of Type I in the population.
May be approximated by Bin(n, M/N) as N → ∞

Poisson distribution
pX(x) = e^{−λ} λ^x / x! , x = 0, 1, . . . ; λ > 0
E(X) = λ , Var(X) = λ

Denoted by Poi(λ)
X is the number of random events in time or space, where the rate of occurrence is λ.
Approximation to Bin(n, p) as n → ∞, p → 0 such that np → λ.

Appendix B. Standard continuous distributions

Uniform distribution
fX(x) = 1/(b − a) , a < x < b ; a < b

E(X) = (a + b)/2 , Var(X) = (b − a)²/12

Denoted by U(a, b)
Every point in the interval (a, b) is ‘equally likely’.
Important use for simulation of random numbers.

Exponential distribution

fX(x) = λ e^{−λx} , x > 0 ; λ > 0

E(X) = 1/λ , Var(X) = 1/λ²

Denoted by Exp(λ)
X is the waiting time until the first event in a Poisson process, rate λ.

Gamma distribution

fX(x) = λ^α x^{α−1} e^{−λx} / Γ(α) , x > 0 ; λ > 0, α > 0

where

Γ(α) = ∫_0^∞ x^{α−1} e^{−x} dx

is the gamma function.

Γ(r) = (r − 1)! for positive integers r, Γ(α) = (α − 1)Γ(α − 1), Γ(1/2) = √π

E(X) = α/λ , Var(X) = α/λ²

Denoted by Gam(α, λ)
When α = r, an integer, X is the waiting time until the rth event in a Poisson process, rate
λ. The case α = 1 is the Exp(λ) distribution.
When α = r, an integer, X can be written X = Y1 + · · · + Yr , where Yi are independent
Exp(λ).

More generally we have the following additive property: if Yi ∼ Gam(αi , λ), i = 1, . . . , r, and are independent then X = Y1 + · · · + Yr ∼ Gam(Σ_{i=1}^r αi , λ).

Beta distribution
fX(x) = x^{α−1} (1 − x)^{β−1} / B(α, β) , 0 < x < 1 ; α > 0, β > 0

where

B(α, β) = ∫_0^1 x^{α−1} (1 − x)^{β−1} dx = Γ(α)Γ(β)/Γ(α + β)

is the beta function.

E(X) = α/(α + β) , Var(X) = αβ / ((α + β)²(α + β + 1))

Denoted by Beta(α, β)
The case α = β = 1 is the U(0, 1) distribution.
X sometimes represents an unknown proportion lying in the interval (0, 1).

Normal distribution
fX(x) = (1/(σ√(2π))) exp( −(x − µ)²/(2σ²) ) , −∞ < x < ∞ ; σ > 0

E(X) = µ , Var(X) = σ²

Denoted by N(µ, σ²)
Widely used as a distribution for continuous variables representing many real-world phenom-
ena, and as an approximation to many other distributions.
