Math 382 Lecture Notes

Probability and Statistics

Anwar Hossain and Oleg Makhnin

January 8, 2013

Contents

1 Probability in the World Around Us

2 Probability
    2.1 What is Probability
    2.2 Review of set notation
    2.3 Types of Probability
    2.4 Laws of Probability
    2.5 Counting Rules useful in Probability
    2.6 Conditional probability and independence
    2.7 Bayes Rule

3 Discrete probability distributions
    3.1 Discrete distributions
    3.2 Expected values of Random Variables
    3.3 Bernoulli distribution
    3.4 Binomial distribution
    3.5 Geometric distribution
    3.6 Negative Binomial distribution
    3.7 Poisson distribution
    3.8 Hypergeometric distribution
    3.9 Moment generating function

4 Continuous probability distributions
    4.1 Continuous RV and their prob dist
    4.2 Expected values of continuous RV
    4.3 Uniform distribution
    4.4 Exponential distribution
    4.5 The Gamma distribution
        4.5.1 Poisson process
    4.6 Normal distribution
        4.6.1 Using Normal tables in reverse
        4.6.2 Normal approximation to Binomial
    4.7 Weibull distribution
    4.8 MGF’s for continuous case

5 Joint probability distributions
    5.1 Bivariate and marginal probab dist
    5.2 Conditional probability distributions
    5.3 Independent random variables
    5.4 Expected values of functions
        5.4.1 Variance of sums
    5.5 Conditional Expectations*

6 Functions of Random Variables
    6.1 Introduction
        6.1.1 Simulation
    6.2 Method of distribution functions (CDF)
    6.3 Method of transformations
    6.4 Central Limit Theorem
        6.4.1 CLT examples: Binomial

7 Descriptive statistics
    7.1 Sample and population
    7.2 Graphical summaries
    7.3 Numerical summaries
        7.3.1 Sample mean and variance
        7.3.2 Percentiles

8 Statistical inference
    8.1 Introduction
        8.1.1 Unbiased Estimation
    8.2 Confidence intervals
    8.3 Statistical hypotheses
        8.3.1 Hypothesis tests of a population mean
    8.4 The case of unknown σ
        8.4.1 Confidence intervals
        8.4.2 Hypothesis test
        8.4.3 Connection between Hypothesis tests and C.I.’s
        8.4.4 Statistical significance vs Practical significance
    8.5 C.I. and tests for two means
        8.5.1 Matched pairs
    8.6 Inference for Proportions
        8.6.1 Confidence interval for population proportion
        8.6.2 Test for a single proportion
        8.6.3 Comparing two proportions*

9 Linear Regression
    9.1 Correlation coefficient
    9.2 Least squares regression line
    9.3 Inference for regression
        9.3.1 Correlation test for linear relationship
        9.3.2 Confidence and prediction intervals
        9.3.3 Checking the assumptions

10 Categorical Data Analysis
    10.1 Chi-square goodness-of-fit test
    10.2 Chi-square test for independence

Chapter 1

Probability in the World Around Us

Probability theory is a tool for describing uncertainty. In science and
engineering, the world around us is described by mathematical models. Most
mathematical models are deterministic, that is, the model output is
supposed to be known uniquely once all the inputs are specified. As an
example of such a model, consider Newton's second law F = ma, which
connects the force F acting on an object of mass m with the resulting
acceleration a. Once F and m are specified, we can determine the object's
acceleration exactly.[1]

[1] Now stop and think: what are the factors that will make this model
more uncertain?

What is wrong with this model from a practical point of view? Most
obviously, the inputs in the model (F and m) are not precisely known. They
may be measured, but there is usually a measurement error involved. Also,
the model itself might be approximate or might not take into account all
the factors influencing the model output. Finally, roundoff errors are sure
to crop up during the calculations. Thus, our predictions of planetary
motions, say, will be imperfect in the long run and will require further
corrections as more recent observations become available.

At the other end of the spectrum, there are some phenomena that seem to
completely escape any attempt at rational description. These are random
phenomena, ranging from lotteries to the heat-induced motion of atoms.
Upon closer consideration, there are still some laws governing these
phenomena. However, they do not apply on a case-by-case basis, but rather
to the results of many repetitions. For example, we cannot predict the
result of one particular lottery drawing, but we can calculate the
probabilities of certain outcomes. We cannot describe the velocity of a
single atom, but we can say something about the behavior of the velocities
in the ensemble of all atoms.

This is the stuff that probabilistic models are made of. Another example
of a field where probabilistic models are routinely used is actuarial science.
It deals with lifetimes of humans and tries to predict how long any given
person is expected to live, based on other variables describing the particulars
of his/her life. Of course, this expected life span is a poor prediction when
applied to any given person, but it works rather well when applied to many
persons. It can help to decide the rates the insurance company should charge
for covering any given person.
Today's science deals with enormously complex models, for example, the
models of Earth's climate (there are many of them available, at different
levels of complexity and resolution). The models should also take into
account the uncertainties from many sources, including our imperfect
knowledge of the current state of Earth, our imperfect understanding of all
physical processes involved, and the uncertainty about future scenarios of
human development.[2]
Understanding and communicating this uncertainty is greatly aided by
the knowledge of the rules of probability.

[2] Not least, our ability to calculate the output of such models is also
limited by the current state of computational science.

The authors thank Lynda Ballou for contributing some examples and
exercises.

Chapter 2

Probability

2.1 What is Probability

Probability theory is the branch of mathematics that studies the possible
outcomes of given events together with the outcomes’ relative likelihoods
and distributions. In common usage, the word “probability” is used to mean
the chance that a particular event (or set of events) will occur expressed
on a linear scale from 0 (impossibility) to 1 (certainty), also expressed as a
percentage between 0 and 100%. The analysis of data (possibly generated
by probability models) is called statistics.
Probability is a way of summarizing the uncertainty of statements or
events. It gives a numerical measure for the degree of certainty (or degree of
uncertainty) of the occurrence of an event.
Another way to define probability is the ratio of the number of favorable
outcomes to the total number of all possible outcomes. This is true if the
outcomes are assumed to be equally likely. The collection of all possible
outcomes is called the sample space.
If there are n total possible outcomes in a sample space S, and m of those
are favorable for an event A, then the probability of event A is given as
\[ P(A) = \frac{\text{number of favorable outcomes}}{\text{total number of possible outcomes}} = \frac{n(A)}{n(S)} = \frac{m}{n} \]
Example 2.1. Find the probability of getting a 3 or 5 while throwing a die.
Solution. Sample space S = {1, 2, 3, 4, 5, 6} and event A = {3, 5}.
We have n(A) = 2 and n(S) = 6.
So, P(A) = n(A)/n(S) = 2/6 = 1/3 ≈ 0.3333
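
The counting in Example 2.1 can also be carried out mechanically. The following is a minimal sketch in Python, assuming equally likely outcomes; the helper name `classical_probability` is an illustrative choice and does not come from these notes.

```python
from fractions import Fraction

def classical_probability(event, sample_space):
    """P(A) = n(A) / n(S); valid only when all outcomes are equally likely."""
    event = set(event) & set(sample_space)   # ignore outcomes outside S
    return Fraction(len(event), len(sample_space))

S = {1, 2, 3, 4, 5, 6}   # one throw of a fair die
A = {3, 5}               # "a 3 or a 5"
p_A = classical_probability(A, S)
print(p_A, float(p_A))   # 1/3 0.3333333333333333
```

Using exact fractions avoids the rounding that appears when 1/3 is written as 0.3333.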


Axioms of Probability
All probability values are nonnegative numbers not greater than 1, i.e. 0 ≤ p ≤ 1.
An impossible event has probability zero, while an event that is certain
to occur has probability one.

Examples: P (A pregnant human being a female) = 1

P (A human male being pregnant) = 0.
Definition 2.1.
Random Experiment: A random experiment is the process of observing
the outcome of a chance event.

Outcome: The elementary outcomes are all possible results of the random
experiment.

Sample Space(SS): The sample space is the set or collection of all the
outcomes of an experiment and is denoted by S.

Example 2.2.
a) Flip a coin once, then the sample space is: S = {H, T }
b) Flip a coin twice, then the sample space is: S = {HH, HT, T H, T T }

We want to assign a numerical weight or probability to each outcome.
We write the probability of an event Ai as P(Ai). For example, in our coin
toss experiment, we may assign P(H) = P(T) = 0.5, meaning that in the long
run each outcome comes up half the time.

2.2 Review of set notation

Definition 2.2. Complement
The complement of event A is the set of all outcomes in the sample space
that are not included in the event A. The complement of event A is denoted
by A′. If the probability that an event occurs is p, then the probability
that the event does not occur is q = 1 − p, i.e. the probability of the
complement of an event = 1 − the probability of the event:
P(A′) = 1 − P(A)

Example 2.3. Find the probability of not getting a 3 or 5 while throwing a die.
Solution. Sample space S = {1, 2, 3, 4, 5, 6} and event B = {1, 2, 4, 6}.
n(B) = 4 and n(S) = 6.
So, P(B) = n(B)/n(S) = 4/6 = 0.6667

On the other hand, A (described in Example 2.1) and B are complementary
events, i.e. B = A′.
So, P(B) = P(A′) = 1 − P(A) = 1 − 0.3333 = 0.6667

Definition 2.3. Intersections of Events
The event A ∩ B is the intersection of the events A and B and consists of
the outcomes that are contained within both events A and B. The probability
of this event, P(A ∩ B), is the probability that both events A and B occur
[but not necessarily at the same time]. In the future, we will abbreviate
the intersection as AB.

Definition 2.4. Mutually Exclusive Events
Two events are said to be mutually exclusive if AB = ∅ (i.e. they have
empty intersection), so that they have no outcomes in common.

[Figure: Venn diagram of mutually exclusive events A and B]

Definition 2.5. Unions of Events
The event A ∪ B is the union of events A and B and consists of the
outcomes that are contained within at least one of the events A and B. The
probability of this event, P(A ∪ B), is the probability that at least one of
the events A and B occurs.

[Figure: Venn diagram of events A and B, with the common part labeled AB]

Venn diagram
A Venn diagram is often used to illustrate the relations between sets
(events). The sets A and B are represented as circles; operations between
them (intersections, unions and complements) can also be represented as
parts of the diagram. The entire sample space S is the bounding box. See
Figure 2.1.

[Figure: regions labeled AB′, AB and A′B (left to right), with A′B′ outside both circles]

Figure 2.1: Venn diagram of events A (in bold) and B, represented as insides
of circles, and various intersections

Example 2.4. Set notation
Suppose a set S consists of points labeled 1, 2, 3 and 4. We denote this
by S = {1, 2, 3, 4}.
If A = {1, 2} and B = {2, 3, 4}, then A and B are subsets of S, denoted by
A ⊂ S and B ⊂ S (B is contained in S). We denote the fact that 2 is an
element of A by 2 ∈ A.
The union of A and B is A ∪ B = {1, 2, 3, 4}. If C = {4}, then
A ∪ C = {1, 2, 4}. The intersection is A ∩ B = AB = {2}. The complement is
A′ = {3, 4}.
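
The set operations of Example 2.4 map directly onto Python's built-in `set` type. This is only an illustrative sketch; the complement is written as a set difference from S, since Python has no notion of a sample space.

```python
S = {1, 2, 3, 4}
A = {1, 2}
B = {2, 3, 4}
C = {4}

print(A <= S, B <= S)    # subset checks: True True
print(2 in A)            # membership, 2 ∈ A: True
print(sorted(A | B))     # union A ∪ B: [1, 2, 3, 4]
print(sorted(A | C))     # union A ∪ C: [1, 2, 4]
print(sorted(A & B))     # intersection AB: [2]
print(sorted(S - A))     # complement of A relative to S: [3, 4]
```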
Distributive laws
A(B ∪ C) = AB ∪ AC
and
A ∪ (BC) = (A ∪ B)(A ∪ C)

De Morgan’s Law
(A ∪ B)′ = A′B′
(AB)′ = A′ ∪ B′
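
These identities can be spot-checked by brute force on a small sample space. The sketch below is a sanity check rather than a proof; it assumes the four-point space S = {1, 2, 3, 4}, and the helper names `subsets` and `comp` are hypothetical.

```python
from itertools import combinations

S = frozenset({1, 2, 3, 4})

def subsets(s):
    """All subsets of s, as frozensets."""
    items = list(s)
    return [frozenset(c) for r in range(len(items) + 1)
            for c in combinations(items, r)]

def comp(X):
    """Complement of X relative to the sample space S."""
    return S - X

events = subsets(S)   # all 16 events on S

de_morgan = all(comp(A | B) == comp(A) & comp(B) and
                comp(A & B) == comp(A) | comp(B)
                for A in events for B in events)
distributive = all(A & (B | C) == (A & B) | (A & C) and
                   A | (B & C) == (A | B) & (A | C)
                   for A in events for B in events for C in events)
print(de_morgan, distributive)   # True True
```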

Exercises
2.1.
Use the Venn diagrams to illustrate Distributive laws and De Morgan’s law.

2.2.
Simplify the following (Draw the Venn diagrams to visualize)

a) (A′)′
b) (AB)′ ∪ A
c) (AB) ∪ (AB′)
d) (A ∪ B ∪ C)B

2.3.
Represent by set notation the following events

a) both A and B occur

b) exactly one of A, B occurs
c) at least one of A, B, C occurs
d) at most one of A, B, C occurs

2.4.
The sample space consists of eight capital letters (outcomes), A, B, C ,...,
H. Let V = event that the letter represents a vowel, and L = event that the
letter is made of straight lines. Describe the outcomes that comprise

a) VL
b) V ∪ L′
c) V′L′

2.5.
Out of all items sent for refurbishing, 40% had mechanical defects, 50% had
electrical defects, and 25% had both.
Denoting A = {an item has a mechanical defect} and
B = {an item has an electrical defect}, fill the probabilities into the Venn
diagram and determine the quantities listed below.
[Figure: blank Venn diagram of events A and B, to be filled in]

a) P(A)
b) P(AB)
c) P(A′B)
d) P(A′B′)
e) P(A ∪ B)
f) P(A′ ∪ B′)
g) P([A ∪ B]′)
2.6.
A sample of mutual funds was classified according to whether a fund was up
or down last year (A and A′) and whether it was investing in international
stocks (B and B′). The probabilities of these events and their intersections
are represented in the two-way table below.

          B       B′
A        0.33     ?       ?
A′        ?       ?      0.52
         0.64     ?       1

a) Fill out all the ? marks.

b) Find the probability of A ∪ B

Ways to represent probabilities:

• Venn diagram
We may write the probabilities inside the elementary pieces within a Venn
diagram. For example, P(AB′) = 0.32 and
P(A) = P(AB) + P(AB′) [why?] = 0.58. The relative sizes of the pieces do
not have to match the numbers.

[Figure: Venn diagram of A and B with the pieces labeled 0.32, 0.26 and 0.11]

• Two-way table
This is a popular way to represent statistical data. The cells of the table
correspond to the intersections of row and column events. Note that the
contents of the table add up across rows and columns of the table. The
bottom-right corner of the table contains P(S) = 1.

          B       B′
A        0.26    0.32    0.58
A′       0.11     ?      0.42
         0.37    0.63     1

• Tree diagram
A tree diagram may be used to show the sequence of choices that lead to the
complete description of outcomes. For example, when tossing two coins, we
may represent this as follows:

First toss   Second toss   Outcome
    H ---------- H            HH
      \--------- T            HT
    T ---------- H            TH
      \--------- T            TT

A tree diagram is also often useful for representing conditional probabilities
(see below).
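
The bookkeeping behind the two-way table in the list above can be sketched in a few lines of Python. This is an illustrative sketch: the dictionary layout and the helper name `marginal` are assumptions, and the cell marked "?" in the table is derived here from the requirement that all four cells add up to 1.

```python
# Joint probabilities taken from the two-way table (rows A, A'; columns B, B').
joint = {("A", "B"): 0.26, ("A", "B'"): 0.32, ("A'", "B"): 0.11}
# The remaining cell follows because the whole table sums to P(S) = 1.
joint[("A'", "B'")] = 1 - sum(joint.values())

def marginal(label, position):
    """Sum the cells whose row (position 0) or column (position 1) equals label."""
    return sum(p for key, p in joint.items() if key[position] == label)

p_A = marginal("A", 0)    # row total
p_B = marginal("B", 1)    # column total
print(round(p_A, 2), round(p_B, 2), round(joint[("A'", "B'")], 2))   # 0.58 0.37 0.31
```

The row and column totals reproduce the margins of the table, which is the "add up across rows and columns" property stated above.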

2.3 Types of Probability

There are three ways to define probability, namely classical, empirical and
subjective probability.

Definition 2.6. Classical probability
Classical or theoretical probability is used when each outcome in a sample
space is equally likely to occur. The classical probability for an event A is
given by
\[ P(A) = \frac{\text{Number of outcomes in } A}{\text{Total number of outcomes in } S} \]

Example 2.5.

Roll a die and observe that P (A) = P (rolling a 3) = 1/6.

Definition 2.7. Empirical probability
Empirical (or statistical) probability is based on observed data. The
empirical probability of an event A is the relative frequency of event A,
that is
\[ P(A) = \frac{\text{Frequency of event } A}{\text{Total number of observations}} \]

Example 2.6.
The following are the counts of fish of each type that you have caught before.

Fish type                 Blue gill   Red gill   Crappie   Total
Number of times caught       13          17         10       40

Estimate the probability that the next fish you catch will be a Blue gill.

P(Blue gill) = 13/40 = 0.325
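
The relative-frequency calculation in Example 2.6 can be sketched as follows; the dictionary name `counts` is an illustrative choice.

```python
from fractions import Fraction

# Observed catches from Example 2.6.
counts = {"Blue gill": 13, "Red gill": 17, "Crappie": 10}
total = sum(counts.values())

# Empirical probability = frequency of the event / total number of observations.
empirical = {fish: Fraction(n, total) for fish, n in counts.items()}
print(total, float(empirical["Blue gill"]))   # 40 0.325
```

The empirical probabilities of all fish types add up to 1, as they must.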


Example 2.7.
Based on genetics, the proportion of male children among all children
conceived should be around 0.5. However, based on the statistics from a
large number of live births, the probability that a newborn child is male
is about 0.512.
The empirical probability definition has a weakness that it depends on
the results of a particular experiment. The next time this experiment is
repeated, you are likely to get a somewhat different result.
However, as an experiment is repeated many times, the empirical
probability of an event, based on the combined results, approaches the
theoretical probability of the event.[1]
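
This convergence can be illustrated by simulation. The sketch below assumes a fair die simulated with Python's `random` module and tracks the event "a 3 or a 5" from Example 2.1, whose theoretical probability is 1/3; the function name `relative_frequency` and the fixed seed are illustrative choices.

```python
import random

def relative_frequency(n_rolls, seed=0):
    """Fraction of n_rolls simulated fair-die rolls that show a 3 or a 5."""
    rng = random.Random(seed)   # fixed seed so that runs are repeatable
    hits = sum(rng.randint(1, 6) in (3, 5) for _ in range(n_rolls))
    return hits / n_rolls

# The relative frequency should settle near 1/3 as the number of rolls grows.
for n in (10, 100, 10_000, 100_000):
    print(n, relative_frequency(n))
```

For small n the printed values can be far from 1/3; the agreement improves only as n becomes large.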
Subjective Probability: Subjective probabilities result from intuition, ed-
ucated guesses, and estimates. For example: given a patient’s health and
extent of injuries a doctor may feel that the patient has a 90% chance of a
full recovery.
Regardless of the way probabilities are defined, they always follow the
same laws, which we will explore starting with the following Section.

2.4 Laws of Probability

As we have seen in the previous section, probabilities are not always
based on the assumption of equally likely outcomes.
Definition 2.8. Axioms of Probability
For an experiment with a sample space $S = \{e_1, e_2, \dots, e_n\}$ we can
assign probabilities $P(e_1), P(e_2), \dots, P(e_n)$ provided that

a) $0 \le P(e_i) \le 1$

b) $P(S) = \sum_{i=1}^{n} P(e_i) = 1$.

If a set (event) A consists of outcomes $\{e_1, e_2, \dots, e_k\}$, then
\[ P(A) = \sum_{i=1}^{k} P(e_i) \]
This definition just tells us which probability assignments are legal, but
not necessarily which ones would work in practice. However, once we have
assigned the probability to each outcome, they are subject to further rules
which we will describe below.

[1] This is called the Law of Large Numbers.
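
Definition 2.8 can be read as a recipe: check conditions a) and b), then add up outcome probabilities to get the probability of an event. A minimal sketch follows; the function names and the weights of the loaded die are hypothetical and serve only as an illustration of outcomes that are not equally likely.

```python
import math

def is_valid_assignment(probs):
    """Check a) every P(e_i) lies in [0, 1] and b) the P(e_i) sum to 1."""
    return (all(0 <= p <= 1 for p in probs.values())
            and math.isclose(sum(probs.values()), 1.0))

def prob(event, probs):
    """P(A) as the sum of P(e_i) over the outcomes e_i in A."""
    return sum(probs[e] for e in event)

# Hypothetical loaded die: a 6 comes up half the time.
loaded_die = {1: 0.1, 2: 0.1, 3: 0.1, 4: 0.1, 5: 0.1, 6: 0.5}
print(is_valid_assignment(loaded_die))        # True
print(round(prob({3, 5}, loaded_die), 10))    # 0.2
```

Note that for this die P(3 or 5) is 0.2 rather than 1/3, because the equally-likely assumption of Example 2.1 no longer holds.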

Theorem 2.1. Complement Rule
For any event A,

P(A′) = 1 − P(A)    (2.1)

Theorem 2.2. Addition Law
If A and B are any two events, then

P(A ∪ B) = P(A) + P(B) − P(A ∩ B)    (2.2)

Proof. Consider the Venn diagram. P(A ∪ B) is the sum of the probabilities
of all sample points in A ∪ B. Now P(A) + P(B) is the sum of probabilities of
sample points in A and in B. Since we added up the sample points in A ∩ B
twice, we need to subtract them once to obtain the sum of probabilities in
A ∪ B, which is P(A ∪ B).

Example 2.8. Probability that John passes a Math exam is 4/5 and that he
passes a Chemistry exam is 5/6. If the probability that he passes both exams
is 3/4, find the probability that he will pass at least one exam.

Solution. Let M = John passes Math exam, and C = John passes Chemistry
exam.
P(John passes at least one exam) = P(M ∪ C)
= P(M) + P(C) − P(M ∩ C) = 4/5 + 5/6 − 3/4 = 53/60
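
The arithmetic of Example 2.8 can be checked with exact fractions. In this sketch the function name `union_prob` is an illustrative choice.

```python
from fractions import Fraction

def union_prob(p_a, p_b, p_ab):
    """Addition law (2.2): P(A ∪ B) = P(A) + P(B) - P(A ∩ B)."""
    return p_a + p_b - p_ab

p_M, p_C, p_both = Fraction(4, 5), Fraction(5, 6), Fraction(3, 4)
print(union_prob(p_M, p_C, p_both))   # 53/60
```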

Corollary. If two events A and B are mutually exclusive, then

P (A ∪ B) = P (A) + P (B).

This follows immediately from (2.2). Since A and B are mutually exclusive,
P (A ∩ B) = 0.

Example 2.9. What is the probability of getting a total of 7 or 11, when two
dice are rolled?

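        1      2      3      4      5      6
  1   (1,1)  (1,2)  (1,3)  (1,4)  (1,5)  (1,6)
  2   (2,1)  (2,2)  (2,3)  (2,4)  (2,5)  (2,6)
  3   (3,1)  (3,2)  (3,3)  (3,4)  (3,5)  (3,6)
  4   (4,1)  (4,2)  (4,3)  (4,4)  (4,5)  (4,6)
  5   (5,1)  (5,2)  (5,3)  (5,4)  (5,5)  (5,6)
  6   (6,1)  (6,2)  (6,3)  (6,4)  (6,5)  (6,6)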

Solution. Let A be the event that the total is 7 and B be the event that it is
11. The sample space for this experiment is

S = {(1, 1), (1, 2), ......, (2, 1), (2, 2), ........., (6, 6)}, n(S) = 36

A = {(1, 6), (2, 5), (3, 4), (4, 3), (5, 2), (6, 1)} and n(A) = 6.
So, P (A) = 6/36 = 1/6.

B = {(5, 6), (6, 5)} and n(B) = 2

So, P (B) = 2/36 = 1/18.
Since we cannot have a total equal to both 7 and 11, A and B are mutually
exclusive, i.e. P (A ∩ B) = 0.
So, we have P (A ∪ B) = P (A) + P (B) = 1/6 + 1/18 = 2/9.
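
Example 2.9 can be reproduced by listing all 36 outcomes and counting. A minimal sketch, assuming two fair dice; the helper name `P` is an illustrative choice.

```python
from fractions import Fraction
from itertools import product

S = list(product(range(1, 7), repeat=2))       # all 36 ordered pairs
A = {roll for roll in S if sum(roll) == 7}     # total is 7
B = {roll for roll in S if sum(roll) == 11}    # total is 11

def P(event):
    """Classical probability on the equally likely sample space S."""
    return Fraction(len(event), len(S))

print(len(S), P(A), P(B))   # 36 1/6 1/18
print(A & B == set())       # mutually exclusive: True
print(P(A | B))             # 2/9
```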

Exercises
2.7.
Two cards are drawn from a pack, without replacement. What is the prob-
ability that both are greater than 2 and less than 8?
2.8.
A permutation of the word "white" is chosen at random. Find the probability
that it begins with a vowel. Also find the probability that it ends with a
consonant.
2.9.
Find the probability that a leap year will have 53 Sundays.

2.10.
As a foreign language, 40% of the students took Spanish and 30% took
French, while 60% took at least one of these languages. What percent of
students took both Spanish and French?

2.11.
In a class of 100 students, 30 are in mathematics. Moreover, of the 40 females
in the class, 10 are in Mathematics. If a student is selected at random from
the class, what is the probability that the student will be a male or be in
mathematics?

2.12.
Suppose that P (A) = 0.4, P (B) = 0.5 and P (AB) = 0.2. Find the following:

a) P (A ∪ B)
b) P(A′B)
c) P[A′(A ∪ B)]
d) P[A ∪ (A′B)]

2.13.
Two tetrahedral (4-sided) symmetrical dice are rolled, one after the other.

a) Find the probability that both dice will land on the same number.
b) Find the probability that each die will land on a number less than 3.
c) Find the probability that the two numbers will differ by at most 1.
d) Will the answers change if we rolled the dice simultaneously?

2.5 Counting Rules useful in Probability

In some experiments it is helpful to list the elements of the sample space
systematically by means of a tree diagram (see Section 2.2).
In many cases, we shall be able to solve a probability problem by count-
ing the number of points in the sample space without actually listing each
element.
Theorem 2.3. Multiplication principle
If one operation can be performed in $n_1$ ways, and if for each of these a
second operation can be performed in $n_2$ ways, then the two operations can
be performed together in $n_1 n_2$ ways.

Example 2.10. How large is the sample space when a pair of dice is thrown?
Solution. The first die can be thrown in $n_1 = 6$ ways and the second in
$n_2 = 6$ ways. Therefore, the pair of dice can land in $n_1 n_2 = 36$
possible ways.
Theorem 2.3 can naturally be extended to more than two operations: if
k consecutive choices can be made in $n_1, n_2, \dots, n_k$ ways, then the
total number of ways is $n_1 n_2 \cdots n_k$.
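
The multiplication principle can be compared with explicit enumeration. The sketch below uses `itertools.product` to list the outcomes and `math.prod` to count them without listing; the three-stage example at the end is hypothetical.

```python
from itertools import product
from math import prod

choices = [range(1, 7), range(1, 7)]      # two dice, as in Example 2.10
listed = list(product(*choices))          # explicit enumeration of outcomes
counted = prod(len(c) for c in choices)   # n1 * n2 without listing anything
print(len(listed), counted)               # 36 36

# Three consecutive choices (hypothetical): a die, a coin, another die.
print(prod([6, 2, 6]))                    # 72
```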
The term permutation refers to an arrangement of objects when the order
matters (for example, letters in a word).
Theorem 2.4. Permutations
The number of permutations of n distinct objects taken r at a time is
\[ {}_nP_r = \frac{n!}{(n-r)!} \]

Example 2.11.
From among ten employees, three are to be selected to travel to three out-
of-town plants A, B, and C, one to each plant. Since the plants are located
in different cities, the order in which the employees are assigned to the plants
is an important consideration. In how many ways can the assignments be
made?
Solution. Because order is important, the number of possible distinct
assignments is
\[ {}_{10}P_3 = \frac{10!}{7!} = 10(9)(8) = 720. \]
In other words, there are ten choices for plant A, but then only nine for plant
B, and eight for plant C. This gives a total of 10(9)(8) ways of assigning
employees to the plants.
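
The count in Example 2.11 can be confirmed in three ways: from the formula of Theorem 2.4, with the standard-library function `math.perm`, and by listing every ordered assignment. This is only a sketch; the variable names are illustrative.

```python
from itertools import permutations
from math import factorial, perm

n, r = 10, 3
by_formula = factorial(n) // factorial(n - r)            # n! / (n - r)!
by_listing = sum(1 for _ in permutations(range(n), r))   # enumerate ordered triples
print(by_formula, perm(n, r), by_listing)                # 720 720 720
```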
The term combination refers to a selection of objects when order does not
matter. For example, choosing 4 books to buy at the store in any order will
leave you with the same set of books.
Theorem 2.5. Combinations
The number of distinct subsets or combinations of size r that can be selected
from n distinct objects (r ≤ n) is given by
\[ \binom{n}{r} = \frac{n!}{r!\,(n-r)!} \tag{2.3} \]

Proof. Start with picking ordered sets of size r. This can be done in
${}_nP_r = \frac{n!}{(n-r)!}$ ways. However, many of these are re-orderings
of the same basic set of objects. Each distinct set of r objects can be
re-ordered in ${}_rP_r = r!$ ways. Therefore, we need to divide the number
of permutations ${}_nP_r$ by r!, thus arriving at equation (2.3).

Example 2.12.
In the previous example, suppose that three employees are to be selected
from among the ten available to go to the same plant. In how many ways
can this selection be made?
Solution. Here, order is not important; we want to know how many subsets
of size r = 3 can be selected from n = 10 people. The result is

\[ \binom{10}{3} = \frac{10!}{3!\,7!} = \frac{10(9)(8)}{3(2)(1)} = 120 \]
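
The result of Example 2.12, and the division by r! used in the proof of Theorem 2.5, can be checked numerically. A short sketch using the standard-library functions `math.comb` and `math.perm`:

```python
from itertools import combinations
from math import comb, factorial, perm

n, r = 10, 3
print(comb(n, r))                                  # 120
print(perm(n, r) // factorial(r))                  # permutations divided by r!: 120
print(sum(1 for _ in combinations(range(n), r)))   # explicit listing of subsets: 120
```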

Example 2.13.
A package of six light bulbs contains 2 defective bulbs. If three bulbs are
selected for use, find the probability that none of the three is defective.
Solution.
\[ P(\text{none are defective}) = \frac{\text{number of ways 3 nondefectives can be chosen}}{\text{total number of ways a sample of 3 can be chosen}} = \frac{\binom{4}{3}}{\binom{6}{3}} = \frac{4}{20} = \frac{1}{5} \]

Example 2.14.
In a poker hand consisting of 5 cards, find the probability of holding 2 aces
and 3 jacks.
Solution. The number of ways of being dealt 2 aces from 4 is
$\binom{4}{2} = 6$ and the number of ways of being dealt 3 jacks from 4 is
$\binom{4}{3} = 4$.
The total number of 5-card poker hands, all of which are equally likely, is
\[ \binom{52}{5} = 2{,}598{,}960 \]
Hence, if C denotes the event of getting 2 aces and 3 jacks in a 5-card
poker hand, then P(C) = (6 · 4)/2,598,960 = 24/2,598,960 ≈ 0.0000092.
Example 2.15.
A university warehouse has received a shipment of 25 printers, of which
10 are laser printers and 15 are inkjet models. If 6 of these 25 are selected
at random to be checked by a particular technician, what is the probability
that exactly 3 of these selected are laser printers? At least 3 inkjet printers?
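Solution. (a) First choose 3 of the 15 inkjet and then 3 of the 10 laser
printers. There are $\binom{15}{3}$ and $\binom{10}{3}$ ways to do it, and
therefore
\[ P(\text{exactly 3 of the 6}) = \frac{\binom{15}{3}\binom{10}{3}}{\binom{25}{6}} = 0.3083 \]

(b) P(at least 3 inkjet printers)
\[ = \frac{\binom{15}{3}\binom{10}{3}}{\binom{25}{6}} + \frac{\binom{15}{4}\binom{10}{2}}{\binom{25}{6}} + \frac{\binom{15}{5}\binom{10}{1}}{\binom{25}{6}} + \frac{\binom{15}{6}\binom{10}{0}}{\binom{25}{6}} = 0.8530 \]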
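
Examples 2.13 and 2.15 share one counting pattern: choose k items of one kind and the rest of the other kind, then divide by the number of all possible samples. The sketch below wraps that pattern in a helper; the name `prob_exact` and its argument order are assumptions made for illustration.

```python
from math import comb

def prob_exact(k, n_special, n_other, n_draw):
    """P(exactly k 'special' items among n_draw chosen at random without replacement)."""
    total = n_special + n_other
    return comb(n_special, k) * comb(n_other, n_draw - k) / comb(total, n_draw)

# Example 2.13: 4 good and 2 defective bulbs, draw 3, all 3 good.
print(prob_exact(3, 4, 2, 3))                                         # 0.2
# Example 2.15 (a): 10 laser and 15 inkjet printers, draw 6, exactly 3 laser.
print(round(prob_exact(3, 10, 15, 6), 4))                             # 0.3083
# Example 2.15 (b): at least 3 inkjet printers among the 6.
print(round(sum(prob_exact(k, 15, 10, 6) for k in range(3, 7)), 4))   # 0.853
```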

Theorem 2.6. Partitions
The number of ways of partitioning n distinct objects into k groups
containing $n_1, n_2, \dots, n_k$ objects respectively is
\[ \frac{n!}{n_1!\, n_2! \cdots n_k!} \]
where $\sum_{i=1}^{k} n_i = n$.
Note that when there are k = 2 groups, we will obtain combinations.

Example 2.16.
Consider 10 people to be split into 3 groups to be assigned to 3 plants. If we
are to send 5 people to Plant A, 3 people to Plant B, and 2 people to Plant
C, then the total number of assignments is
\[ \frac{10!}{5!\,3!\,2!} = 2520 \]
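
The partition count of Theorem 2.6 is easy to compute directly. In this sketch the function name `partitions` is an illustrative choice; the second line checks the remark that k = 2 groups gives back combinations.

```python
from math import comb, factorial

def partitions(*sizes):
    """Number of ways to split sum(sizes) distinct objects into groups of the given sizes."""
    result = factorial(sum(sizes))
    for s in sizes:
        result //= factorial(s)   # each division is exact
    return result

print(partitions(5, 3, 2))               # Example 2.16: 2520
print(partitions(3, 7) == comb(10, 3))   # k = 2 reduces to combinations: True
```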

Exercises
2.14.
An incoming lot of silicon wafers is to be inspected for defectives by an engi-
neer in a microchip manufacturing plant. Suppose that, in a tray containing
20 wafers, 4 are defective. Two wafers are to be selected randomly for in-
spection. Find the probability that neither is defective.

2.15.
A person draws 5 cards from a shuffled pack of cards. Find the probability
that the person has at least 3 aces. Find the probability that the person has
at least 4 cards of the same suit.

2.16.
Three people enter the elevator on the basement level. The building has 7
floors. Find the probability that all three get off at different floors.

2.17.
In a group of 7 people, each person shakes hands with every other person.
How many handshakes did occur?

2.18.
A marketing director considers that there’s “overwhelming agreement” in a
5-member focus group when either 4 or 5 people like or dislike the product.
If, in fact, the product’s popularity is 50% (so that all outcomes are equally
likely), what is the probability that the focus group will be in “overwhelming
agreement” about it? Is the marketing director making a judgement error in
declaring such agreement “overwhelming”?

2.19.
A die is tossed 5 times. Find the probability that we will have 4 of a kind.

2.20.
In a lottery, 6 numbers are drawn out of 45. You hit a jackpot if you guess
all 6 numbers correctly, and get $400 if you guess 5 numbers out of 6. What
are the probabilities of each of those events?

2.21.
There are 21 Bachelor of Science programs at New Mexico Tech. Given 21
areas from which to choose, in how many ways can a student select:

a) A major area and a minor area?

b) A major area and first and second minor?

2.22.
From a box containing 5 chocolates and 4 hard candies, a child takes a
handful of 4 (at random). What is the probability that exactly 3 of the 4 are
chocolates?

2.23.
If a group consists of 8 men and 6 women, in how many ways can a committee
of 5 be selected if:

a) The committee is to consist of 3 men and 3 women.

b) There are no restrictions on the number of men and women on the
committee.
c) There must be at least one man.
d) There must be at least one of each sex.

2.24.
Suppose we have a lot of 40 transistors of which 8 are defective. If we sample
without replacement, what is the probability that we get 4 good transistors
in the first 5 draws?

2.25.
A housewife is asked to rank four brands A, B, C, and D of household cleaner
according to her preference, number one being the one she prefers most, etc.
Suppose she really has no preference among the four brands. Hence, any
ordering is equally likely to occur.

a) Find the probability that brand A is ranked number one.

b) Find the probability that brand C is number one and D is number two in
the rankings.
c) Find the probability that brand A is ranked number one or number 2.

2.26.
How many ways can one arrange the letters of the word ADVANTAGE so
that the three As are adjacent to each other?

2.27.
How many distinct ways are there to permute the letters in the word PROB-
ABILITY?

2.28.
Eight tires of different brands are ranked 1 to 8 (best to worst) according
to mileage performance. If four of these tires are chosen at random by a
customer, find the probability that the best tire among the four selected by
the customer is actually ranked third among the original eight.

Pascal’s triangle and binomial coefficients

Long before Pascal, this triangle had been described by several Oriental scholars.
It was used in the budding discipline of probability theory by the French
mathematician Blaise Pascal (1623-1662). The construction begins by writing 1’s
along the sides of a triangle and then filling it up row by row so that each number
is the sum of the two numbers immediately above it.
          1
        1   1
      1   2   1
    1   3   3   1
  1   4   6   4   1
1   5  10  10   5   1

A step in construction

The number in each cell represents the number of downward routes from the
vertex to that point (can you explain why?). It is also the number of ways to choose
r objects out of n (can you explain why?), that is, $\binom{n}{r}$.

1
1 1
1 2 1
1 3 3 1
1 4 6 4 1
1 5 10 10 5 1
1 6 15 20 15 6 1
1 7 21 35 35 21 7 1
1 8 28 56 70 56 28 8 1
1 9 36 84 126 126 84 36 9 1
1 10 45 120 210 252 210 120 45 10 1

The first 10 rows

The combination numbers are also called binomial coefficients and are seen in
Calculus. Namely, they are the coefficients in the expansion

$$(a + b)^n = \sum_{r=0}^{n} \binom{n}{r} a^r b^{n-r}$$

Note that, if you let a = b = 1/2, then on the right-hand side of the sum you will
get the probabilities

$$P(\text{a is chosen } r \text{ times and b is chosen } n - r \text{ times}) = \frac{\binom{n}{r}}{2^n}$$

and on the left-hand side you will have 1 (the total of all probabilities).
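The construction rule and the probability identity can be checked numerically. The following is a minimal Python sketch, not part of the original notes; the function name `pascal_rows` is hypothetical, and `math.comb` (Python 3.8+) is used only as an independent check of the entries.

```python
from fractions import Fraction
from math import comb


def pascal_rows(n_max):
    """Build rows 0..n_max using the 'sum of the two numbers above' rule."""
    rows = [[1]]
    for _ in range(n_max):
        prev = rows[-1]
        # Interior entries are sums of adjacent entries in the previous row.
        inner = [prev[i] + prev[i + 1] for i in range(len(prev) - 1)]
        rows.append([1] + inner + [1])
    return rows


rows = pascal_rows(10)

# Every entry of row n should agree with the binomial coefficient "n choose r".
matches_comb = all(rows[n][r] == comb(n, r)
                   for n in range(11) for r in range(n + 1))

# With a = b = 1/2, the terms C(n, r) / 2^n are probabilities that add up to 1.
n = 10
probs = [Fraction(comb(n, r), 2 ** n) for r in range(n + 1)]
total = sum(probs)

print(rows[5])
print(matches_comb, total)
```

Exact fractions are used so that the total is exactly 1 rather than a rounded floating-point value.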

2.6 Conditional probability and independence

Humans often have to act based on incomplete information. If your boss
has looked at you gloomily, you might conclude that something’s wrong with
your job performance. However, if you know that she just suffered some losses
in the stock market, this extra information may change your assessment of
the situation. Conditional probability is a tool for dealing with additional
information like this.
Conditional probability is the probability of an event occurring given the
knowledge that another event has occurred. The conditional probability of
event A occurring, given that event B has occurred is denoted by P (A|B)
and is read “probability of A given B”.
Definition 2.9. Conditional probability

The conditional probability of event A given B is

$$P(A \mid B) = \frac{P(A \cap B)}{P(B)} \quad \text{for } P(B) > 0 \qquad (2.4)$$

Reduced sample space approach

When all the outcomes are equally likely, it is sometimes easier to find
conditional probabilities directly, without having to apply equation (2.4). If
we already know that B has happened, we need only consider the outcomes
in B, thus reducing our sample space to B. Then,

$$P(A \mid B) = \frac{\text{Number of outcomes in } AB}{\text{Number of outcomes in } B}$$

For example, P(a die is 3 | a die is odd) = 1/3 and
P(a die is 4 | a die is odd) = 0.
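The reduced sample space approach amounts to counting. Here is a small Python sketch under the assumption of equally likely outcomes; the helper name `conditional_probability` is hypothetical and chosen only for illustration.

```python
from fractions import Fraction


def conditional_probability(event_a, event_b, sample_space):
    """P(A | B) for equally likely outcomes, by counting inside B."""
    in_b = [w for w in sample_space if event_b(w)]
    if not in_b:
        raise ValueError("P(B) must be positive")
    in_a_and_b = [w for w in in_b if event_a(w)]
    return Fraction(len(in_a_and_b), len(in_b))


die = range(1, 7)


def is_odd(w):
    return w % 2 == 1


p_three_given_odd = conditional_probability(lambda w: w == 3, is_odd, die)
p_four_given_odd = conditional_probability(lambda w: w == 4, is_odd, die)
print(p_three_given_odd, p_four_given_odd)
```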

Example 2.17.
Let A = {a family with two children has two boys} and
B = {a family with two children has at least one boy}.
Find P(A | B).

Solution. The event B contains the following equally likely outcomes: (B, B), (B, G) and
(G, B). Only one of these is in A. Thus, P(A | B) = 1/3.
However, if I know that the family has two children, and I see one of
the children and it’s a boy, then the probability suddenly changes to 1/2.
There is a subtle difference in the language and this changes the conditional
probability! [2]
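Both answers can be reproduced by enumeration. The sketch below is illustrative only. For the second scenario it assumes that the child we happen to see is equally likely to be either of the two children, which is one reasonable way to model the wording.

```python
from fractions import Fraction
from itertools import product

# Equally likely families, written as (older child, younger child).
families = list(product("BG", repeat=2))

# Scenario 1: we are told only that there is at least one boy.
at_least_one_boy = [f for f in families if "B" in f]
two_boys = [f for f in at_least_one_boy if f == ("B", "B")]
p_given_at_least_one = Fraction(len(two_boys), len(at_least_one_boy))

# Scenario 2 (assumption): we see one child, chosen at random, and it is a boy.
# The sample space is now (family, index of the child we saw).
sightings = [(f, i) for f in families for i in (0, 1)]
saw_boy = [(f, i) for f, i in sightings if f[i] == "B"]
saw_boy_two_boys = [(f, i) for f, i in saw_boy if f == ("B", "B")]
p_given_saw_boy = Fraction(len(saw_boy_two_boys), len(saw_boy))

print(p_given_at_least_one, p_given_saw_boy)
```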

Statistical reasoning
Suppose I pick a card at random from a pack of playing cards, without
showing you. I ask you to guess which card it is, and you guess the five of
diamonds. What is the probability that you are right? Since there are 52
cards in a pack, and only one five of diamonds, the probability of the card
being the five of diamonds is 1/52.
Next, I tell you that the card is red, not black. Now what is the probability
that you are right? Clearly you now have a better chance of being right than
you had before. In fact, your chance of being right is twice as big as it was
before, since only half of the 52 cards are red. So the probability of the card
being the five of diamonds is now 1/26. What we have just calculated is a
conditional probability–the probability that the card is the five of diamonds,
given that it is red.
If we let A stand for the card being the five of diamonds, and B stand for
the card being red, then the conditional probability that the card is the five
of diamonds given that it is red is written P (A|B).
In our case, P (A ∩ B) is the probability that the card is the five of
diamonds and red, which is 1/52 (exactly the same as P(A), since there are
no black fives of diamonds!). P(B), the probability that the card is red, is
1/2. So the definition of conditional probability tells us that P (A|B) = 1/26,
exactly as it should. In this simple case we didn’t really need to use a formula
to tell us this, but the formula is very useful in more complex cases.
If we rearrange the definition of conditional probability, we obtain the
multiplication rule for probabilities:

P (A ∩ B) = P (A|B)P (B) (2.5)
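The card example and the multiplication rule (2.5) can be verified by listing the deck. This is a minimal sketch; the rank and suit labels are arbitrary names chosen for the illustration.

```python
from fractions import Fraction
from itertools import product

ranks = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
suits = ["hearts", "diamonds", "clubs", "spades"]
deck = set(product(ranks, suits))  # 52 equally likely cards

A = {("5", "diamonds")}                                      # five of diamonds
B = {card for card in deck if card[1] in ("hearts", "diamonds")}  # red cards

p_a = Fraction(len(A), len(deck))
p_b = Fraction(len(B), len(deck))
p_a_and_b = Fraction(len(A & B), len(deck))

# Definition (2.4), then the multiplication rule (2.5) as a consistency check.
p_a_given_b = p_a_and_b / p_b
rule_holds = (p_a_and_b == p_a_given_b * p_b)

print(p_a, p_b, p_a_given_b, rule_holds)
```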

The next concept, statistical independence of events, is very important.

[2] Always read the fine print!

Definition 2.10. Independence

The events A and B are called (statistically) independent if

P (A ∩ B) = P (A)P (B) (2.6)

Another way to express independence is to say that the knowledge of B occurring
does not change our assessment of P(A). This means that P(A|B) =
P(A). (The probability that a person is female given that he or she was born
in March is just the same as the probability that the person is female.)
Equation (2.6) is often called the simplified multiplication rule because it can
be obtained from (2.5) by substituting P(A|B) = P(A).
Example 2.18.
For a coin tossed twice, denote by $H_1$ the event that we got Heads on the first
toss, and by $H_2$ the event that we got Heads on the second. Clearly, $P(H_1) = P(H_2) = 1/2$.
Then, counting the outcomes, $P(H_1 H_2) = 1/4 = P(H_1)P(H_2)$, therefore $H_1$
and $H_2$ are independent events. This agrees with our intuition that the result
of the first toss should not affect the chances for $H_2$ to occur.
The situation of the above example is very common for repeated experiments,
like rolling dice, or looking at random numbers etc.
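Definition 2.10 can be tested directly on the four outcomes of two tosses. The sketch below is only an illustration; the helper `prob` is a hypothetical name.

```python
from fractions import Fraction
from itertools import product

tosses = list(product("HT", repeat=2))  # HH, HT, TH, TT, equally likely


def prob(event):
    """Probability of an event as (favorable outcomes) / (all outcomes)."""
    return Fraction(sum(1 for w in tosses if event(w)), len(tosses))


def h1(w):
    return w[0] == "H"  # Heads on the first toss


def h2(w):
    return w[1] == "H"  # Heads on the second toss


p_h1, p_h2 = prob(h1), prob(h2)
p_both = prob(lambda w: h1(w) and h2(w))

# Independence check from equation (2.6).
independent = (p_both == p_h1 * p_h2)
print(p_h1, p_h2, p_both, independent)
```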

Definition 2.10 can be extended to more than two events, but it’s fairly
difficult to describe. [3] However, it is often used in this context:

If events $A_1, A_2, \ldots, A_k$ are independent, then

$$P(A_1 A_2 \cdots A_k) = P(A_1) \times P(A_2) \times \cdots \times P(A_k) \qquad (2.7)$$

For example, if we tossed a coin 5 times, the probability that all are Heads
is $P(H_1) \times P(H_2) \times \cdots \times P(H_5) = (1/2)^5 = 1/32$. The same calculation
also extends to outcomes with unequal probabilities.

[3] For example, the relation P(ABC) = P(A)P(B)P(C) does not guarantee that the
events A, B, C are independent.

Example 2.19.
Three bits (0 or 1 digits) are transmitted over a noisy channel, so they will
be flipped independently with probability 0.1 each. What is the probability
that
a) At least one bit is flipped
b) Exactly one bit is flipped?

Solution. a) Using the complement rule, P(at least one) = 1 − P(none). If
we denote by $F_k$ the event that the kth bit is flipped, then P(no bits are flipped) =
$P(F_1' F_2' F_3') = (1 - 0.1)^3$ due to independence. Then,

$$P(\text{at least one}) = 1 - 0.9^3 = 0.271$$

b) Flipping exactly one bit can be accomplished in 3 ways:

$$P(\text{exactly one}) = P(F_1 F_2' F_3') + P(F_1' F_2 F_3') + P(F_1' F_2' F_3) = 3(0.1)(1 - 0.1)^2 = 0.243$$

This is slightly smaller than the probability in part (a).
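Both answers can be confirmed by listing all 8 flip patterns and multiplying the per-bit probabilities, which is exactly what rule (2.7) allows under independence. A short illustrative sketch (`math.prod` requires Python 3.8+):

```python
from itertools import product
from math import isclose, prod

p_flip = 0.1
patterns = list(product([0, 1], repeat=3))  # 1 means "this bit was flipped"


def pattern_probability(pattern):
    """Probability of one flip pattern, using independence of the bits."""
    return prod(p_flip if bit else 1 - p_flip for bit in pattern)


p_at_least_one = sum(pattern_probability(p) for p in patterns if sum(p) >= 1)
p_exactly_one = sum(pattern_probability(p) for p in patterns if sum(p) == 1)

print(round(p_at_least_one, 3), round(p_exactly_one, 3))
```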

Self-test questions
Suppose you throw two dice, one after the other.

a) What is the probability that the first die shows a 2?

b) What is the probability that the second die shows a 2?
c) What is the probability that both dice show a 2?
d) What is the probability that the dice add up to 4?
e) What is the probability that the dice add up to 4 given that the first
die shows a 2?
f) What is the probability that the dice add up to 4 and the first die
shows a 2?

Answers:

a) The probability that the first die shows a 2 is 1/6.

b) The probability that the second die shows a 2 is 1/6.
c) The probability that both dice show a 2 is (1/6)(1/6) = 1/36 (using
the special multiplication rule, since the rolls are independent).

d) For the dice to add up to 4, there are three possibilities: either both dice
show a 2, or the first shows a 3 and the second shows a 1, or the first
shows a 1 and the second shows a 3. Each of these has a probability
of (1/6)(1/6) = 1/36 (using the special multiplication rule, since the
rolls are independent). Hence the probability that the dice add up to 4
is 1/36 + 1/36 + 1/36 = 3/36 = 1/12 (using the special addition rule,
since the outcomes are mutually exclusive).
e) If the first die shows a 2, then for the dice to add up to 4 the second
die must also show a 2. So the probability that the dice add up to 4
given that the first shows a 2 is 1/6.
f) Note that we cannot use the simplified multiplication rule here, because
the dice adding up to 4 is not independent of the first die showing a
2. So we need to use the full multiplication rule. This tells us that the
probability that the first die shows a 2 and the dice add up to 4 is
given by the probability that the first die shows a 2, multiplied by the
probability that the dice add up to 4 given that the first die shows a 2.
This is (1/6)(1/6) = 1/36.
Alternatively, see part (c).
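The self-test answers can be checked by enumerating the 36 equally likely outcomes of two dice. This is a sketch for verification only; the variable names follow the lettering of the questions.

```python
from fractions import Fraction
from itertools import product

rolls = list(product(range(1, 7), repeat=2))  # (first die, second die)


def prob(event):
    return Fraction(sum(1 for r in rolls if event(r)), len(rolls))


def first_is_2(r):
    return r[0] == 2


def sum_is_4(r):
    return r[0] + r[1] == 4


p_a = prob(first_is_2)                              # first die shows a 2
p_b = prob(lambda r: r[1] == 2)                     # second die shows a 2
p_c = prob(lambda r: r == (2, 2))                   # both show a 2
p_d = prob(sum_is_4)                                # dice add up to 4
p_f = prob(lambda r: sum_is_4(r) and first_is_2(r))  # sum is 4 AND first is 2
p_e = p_f / p_a                                     # sum is 4 GIVEN first is 2

# The events in (f) are dependent: the product of their probabilities differs.
dependent = (p_f != p_d * p_a)
print(p_a, p_b, p_c, p_d, p_e, p_f, dependent)
```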

Example 2.20. Trees in conditional probability

Suppose we are drawing marbles from a bag that initially contains 7 red and
3 green marbles. The drawing is without replacement, that is, after we draw
the first marble, we do not put it back. Let’s denote the events
$R_1$ = {the first marble is red}, $R_2$ = {the second marble is red},
$G_1$ = {the first marble is green}, and so on.
Let’s fill out the tree representing the consecutive choices. See Figure 2.2.
The conditional probability $P(R_2 \mid R_1)$ can be obtained directly from reasoning
that after we took the first red marble, there remain 6 red and 3 green
marbles, so that $P(R_2 \mid R_1) = 6/9 = 2/3$. On the other hand, we could use the formula (2.4) and get

$$P(R_2 \mid R_1) = \frac{P(R_2 R_1)}{P(R_1)} = \frac{42/90}{7/10} = \frac{2}{3}$$

where the probability $P(R_2 R_1)$, the same as $P(R_1 R_2)$, can be obtained from
counting the outcomes

$$P(R_1 R_2) = \frac{\binom{7}{2}}{\binom{10}{2}} = \frac{\frac{7 \cdot 6}{2 \cdot 1}}{\frac{10 \cdot 9}{2 \cdot 1}} = \frac{42}{90} = \frac{7}{15}$$
First marble        Second marble

P(R1) = 7/10        P(R2|R1) = 6/9        P(R1R2) = 7/10 * 6/9 = 42/90
                    P(G2|R1) = 3/9        P(R1G2) = 21/90

P(G1) = 3/10        P(R2|G1) = 7/9        P(G1R2) = ?
                    P(G2|G1) = 2/9        P(G1G2) = ?

Figure 2.2: Tree diagram for marble choices
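The filled-in branches of the tree can be checked by listing all ordered pairs of distinct marbles. The sketch below is illustrative and deliberately leaves the two branches marked “?” for the reader; the marble labeling is an arbitrary choice.

```python
from fractions import Fraction
from itertools import permutations

marbles = ["R"] * 7 + ["G"] * 3              # marbles 0..6 red, 7..9 green
draws = list(permutations(range(10), 2))     # ordered draws without replacement


def prob(event):
    return Fraction(sum(1 for d in draws if event(d)), len(draws))


def color(i):
    return marbles[i]


p_r1 = prob(lambda d: color(d[0]) == "R")
p_r1r2 = prob(lambda d: color(d[0]) == "R" and color(d[1]) == "R")
p_r1g2 = prob(lambda d: color(d[0]) == "R" and color(d[1]) == "G")

# Formula (2.4) applied to the enumerated probabilities.
p_r2_given_r1 = p_r1r2 / p_r1

print(len(draws), p_r1, p_r1r2, p_r1g2, p_r2_given_r1)
```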

Now, can you tell me what P (R2 ) and P (R1 | R2 ) are? Maybe you know the
answer already. However, we will get back to this question in Section 2.7.

Example 2.21.
Suppose that of all individuals buying a certain digital camera, 60% include
an optional memory card in their purchase, 40% include a set of batteries,
and 30% include both a card and batteries. Consider randomly selecting a
buyer and let A={memory card purchased} and B= {battery purchased}.
Then find P (A|B) and P (B|A).

Solution. From the given information, we have P(A) = 0.60, P(B) = 0.40, and
P(both purchased) = P(A ∩ B) = 0.30. Given that the selected individual
purchased an extra battery, the probability that an optional card was also
purchased is

$$P(A \mid B) = \frac{P(A \cap B)}{P(B)} = \frac{0.30}{0.40} = 0.75$$

That is, of all those purchasing an extra battery, 75% purchased an optional
memory card. Similarly,

$$P(\text{battery} \mid \text{memory card}) = P(B \mid A) = \frac{P(B \cap A)}{P(A)} = \frac{0.30}{0.60} = 0.50$$

Notice that P(A|B) ≠ P(A) and P(B|A) ≠ P(B), that is, the events A and
B are dependent.

Exercises
2.29.
A year has 53 Sundays. What is the conditional probability that it is a leap
year?

2.30.
The probability that a majority of the stockholders of a company will attend
a special meeting is 0.5. If the majority attends, then the probability that
an important merger will be approved is 0.9. What is the probability that a
majority will attend and the merger will be approved?

2.31.
Let events A, B have positive probabilities. Show that, if P (A | B) = P (A)
then also P (B | A) = P (B).

2.32.
The cards numbered 1 through 10 are placed in a hat, mixed up, then one
of the cards is drawn. If we are told that the number on the drawn card is
at least five, then what is the probability that it is ten?

2.33.
In the roll of a fair die, consider the events A = {2, 4, 6} = “even numbers”
and B = {4, 5, 6} = “high scores”. Find the probability that the die shows an
even number given that it is a high score.

2.34.
There are two urns. In the first urn there are 3 white and 2 black balls and
in the second urn there are 1 white and 4 black balls. From a randomly chosen
urn, one ball is drawn. What is the probability that the ball is white?

2.35.
The level of college attainment of the US population by racial and ethnic group
in 1998 is given in the following table. [b]

Racial or Ethnic    Number of Adults   Percentage with   Percentage with   Percentage with
Group               (Millions)         Associate’s       Bachelor’s        Graduate or
                                       Degree            Degree            Professional Degree
Native Americans      1.1               6.4               6.1               3.3
Blacks               16.8               5.3               7.5               3.8
Asians                4.3               7.7              22.7              13.9
Hispanics            11.2               4.8               5.9               3.3
Whites              132.0               6.3              13.9               7.7

The percentages given in the right three columns are conditional percentages.

a) How many Asians have had a graduate or professional degree in 1998?

b) What percent of all adult Americans has had a Bachelor’s degree?

c) Given that the person had an Associate’s degree, what is the probability
that the person was Hispanic?

2.36.
The dealer’s lot contains 40 cars arranged in 5 rows and 8 columns. We pick
one car at random. Are the events A = {the car comes from an odd-numbered row}
and B = {the car comes from one of the last 4 columns} independent? Prove
your point of view.

2.37.
You have sent applications to two colleges. If you estimate your chance
of being accepted at each college as 60%, and believe the results are
statistically independent, what is the probability that you’ll be accepted to
at least one?
How will your answer change if you applied to 5 colleges?

2.38.
Show that, if the events A and B are independent, then so are A′ and B′.

2.39.
In a high school class, 50% of the students took Spanish, 25% took French
and 30% of the students took neither.
Let A = event that a randomly chosen student took Spanish, and B =
event that a student took French. Fill in either the Venn diagram or a 2-way
table and answer the questions:
[Venn diagram: two overlapping circles labeled A and B. Two-way table: columns B and B′, rows A and A′.]

a) Describe in words the meaning of the event AB′. Find the probability
of this event.
b) Are the events A, B independent? Explain with numbers why or why
not.
c) If it is known that the student took Spanish, what are the chances that
she also took French?

2.40.
One half of all female physicists are married. Among those married, 50% are
married to other physicists, 29% to scientists other than physicists and 21%
to nonscientists. Among male physicists, 74% are married. Among them, 7%
are married to other physicists, 11% to scientists other than physicists and
82% to nonscientists. [c] What percent of all physicists are female? [Hint: This
problem can be solved as is, but if you want to, assume that physicists comprise
1% of the whole population.]

2.41.
Error-correcting codes are designed to withstand errors in data being sent
over communication lines. Suppose we are sending a binary signal (consisting
of a sequence of 0’s and 1’s), and during transmission, any bit may get flipped
with probability p, independently of any other bit. However, we might choose
to repeat each bit 3 times. For example, if we want to send a sequence 010,
we will code it as 000111000. If one of the three bits flips, say, the receiver
gets the sequence 001111000, he will still be able to decode it as 010 by
majority voting. That is, reading the first three bits, 001, he will interpret
it as an attempt to send 000. However, if two of the three bits are flipped,
for example 011, this will be interpreted as an attempt to send 111, and thus
decoded incorrectly.
What is the probability of a bit being decoded incorrectly under this
scheme? [d]
2.42. ⋆
Give an example of events A, B, C such that they are pairwise independent
(i.e. P(AB) = P(A)P(B) etc.) but P(ABC) ≠ P(A)P(B)P(C). [Hint:
You may build them on a sample space with 4 elementary outcomes.]

2.7 Bayes Rule

Events $B_1, B_2, \ldots, B_k$ are said to be a partition of the sample space S if the
following two conditions are satisfied.
a) $B_i B_j = \emptyset$ for each pair $i \neq j$

b) $B_1 \cup B_2 \cup \cdots \cup B_k = S$
This situation often arises when the statistics are available in subgroups of
a population. For example, an insurance company might know accident rates
for each age group Bi . This will give the company conditional probabilities
P (A | Bi ) (if we denote A = event of accident).
Question: if we know all the conditional probabilities P (A | Bi ), how do
we find the unconditional P (A)?
Consider the case k = 2:
The event A can be written as the union of mutually exclusive events $AB_1$
and $AB_2$, that is,

$$A = AB_1 \cup AB_2, \quad \text{so it follows that} \quad P(A) = P(AB_1) + P(AB_2)$$

If the conditional probabilities $P(A \mid B_1)$ and $P(A \mid B_2)$ are known, that is,

$$P(A \mid B_1) = \frac{P(AB_1)}{P(B_1)} \quad \text{and} \quad P(A \mid B_2) = \frac{P(AB_2)}{P(B_2)},$$

then $P(A) = P(A \mid B_1)P(B_1) + P(A \mid B_2)P(B_2)$.
[Figure: the sample space divided into regions B1, B2, ..., Bk, with an oval A overlapping them; the overlap of A with B2 is labeled B2A.]

Figure 2.3: Partition B1, B2, ..., Bk and event A (inside of the oval).

If $B_1, B_2, \ldots, B_k$ form a partition of the sample space S such that $P(B_i) \neq 0$
for $i = 1, 2, \ldots, k$, then for any event A of S,

$$P(A) = \sum_{i=1}^{k} P(B_i \cap A) = \sum_{i=1}^{k} P(B_i) P(A \mid B_i) \qquad (2.8)$$

Consequently,

$$P(B_j \mid A) = \frac{P(B_j) P(A \mid B_j)}{P(A)} \qquad (2.9)$$

Equation (2.8) is often called the Law of Total Probability, and equation (2.9) is known as Bayes’ rule.
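Equations (2.8) and (2.9) translate almost line by line into code. The following Python sketch is one possible rendering; the function names and the two-group numbers are hypothetical and chosen only so the arithmetic is easy to follow. Exact fractions are used so that the check that the priors sum to 1 is not affected by rounding.

```python
from fractions import Fraction


def total_probability(priors, likelihoods):
    """Equation (2.8): P(A) = sum of P(B_i) * P(A | B_i) over a partition."""
    if sum(priors) != 1:
        raise ValueError("priors P(B_i) of a partition must add up to 1")
    return sum(p * l for p, l in zip(priors, likelihoods))


def bayes_posteriors(priors, likelihoods):
    """Equation (2.9): P(B_j | A) for every j."""
    p_a = total_probability(priors, likelihoods)
    if p_a == 0:
        raise ValueError("P(A) must be positive")
    return [p * l / p_a for p, l in zip(priors, likelihoods)]


# Hypothetical two-group illustration (numbers are made up):
# P(B1) = 1/4, P(B2) = 3/4, P(A | B1) = 1/2, P(A | B2) = 1/10.
priors = [Fraction(1, 4), Fraction(3, 4)]
likelihoods = [Fraction(1, 2), Fraction(1, 10)]

p_a = total_probability(priors, likelihoods)
posteriors = bayes_posteriors(priors, likelihoods)
print(p_a, posteriors)
```

Note that the posteriors always add up to 1, since the events B_j still form a partition after conditioning on A.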

Example 2.22.
A rare genetic disease (occurring in 1 out of 1000 people) is diagnosed using
a DNA screening test. The test has a false positive rate of 0.5%, meaning that
P(test positive | no disease) = 0.005. Given that a person has tested positive,
what is the probability that this person actually has the disease?
First, guess the answer, then read on.

Solution. Let’s reason in terms of actual numbers of people, for a change.
Imagine 1000 people, 1 of them having the disease. How many out of 1000
will test positive? The one who actually has the disease, and about 5 disease-free
people who would test false positive. [4] Thus, P(disease | test positive) ≈ 1/6.
It is left as an exercise for the reader to write down the formal probability
calculation.

[4] a) Of course, of any actual 1000 people, the number of people having the disease and
the number of people who test positive will vary randomly, so our calculation only makes
sense when considering averages in a much larger population. b) There’s also a possibility
of a false negative, i.e. a person having the disease and the test coming out negative. We
will neglect this rather rare event.

Example 2.23.
At a certain assembly plant, three machines make 30%, 45%, and 25%, respectively,
of the products. It is known from past experience that 2%, 3%,
and 2% of the products made by each machine, respectively, are defective.
Now, suppose that a finished product is randomly selected.

a) What is the probability that it is defective?

b) If a product were chosen randomly and found to be defective, what is
the probability that it was made by machine 3?

Solution. Consider the following events:

A: the product is defective

B1: the product is made by machine 1,
B2: the product is made by machine 2,
B3: the product is made by machine 3.

Applying the additive and multiplicative rules, we can write

(a) $P(A) = P(B_1)P(A \mid B_1) + P(B_2)P(A \mid B_2) + P(B_3)P(A \mid B_3)$
$= (0.3)(0.02) + (0.45)(0.03) + (0.25)(0.02) = 0.006 + 0.0135 + 0.005 = 0.0245$

(b) Using Bayes’ rule,

$$P(B_3 \mid A) = \frac{P(B_3)P(A \mid B_3)}{P(A)} = \frac{0.005}{0.0245} = 0.2041$$

This calculation can also be represented using a tree

0.3  (B1) ---- 0.02 (A | B1) ---- 0.3 × 0.02 = 0.006

0.45 (B2) ---- 0.03 (A | B2) ---- 0.0135
          \--- 0.97 (A′ | B2)

0.25 (B3) ---- 0.02 (A | B3) ---- 0.005

Here, the first branching represents probabilities of the events Bi, and the
second branching represents conditional probabilities P(A | Bi). The probabilities
of intersections, given by the products, are on the right. P(A) is their
sum.
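The tree for Example 2.23 can be reproduced in a few lines. This is a sketch using exact fractions; the dictionary keys are arbitrary labels for the events B1, B2, B3.

```python
from fractions import Fraction

# First branching: P(B_i), the share of products made by each machine.
priors = {
    "machine 1": Fraction(30, 100),
    "machine 2": Fraction(45, 100),
    "machine 3": Fraction(25, 100),
}
# Second branching: P(A | B_i), the defect rate of each machine.
defect_rates = {
    "machine 1": Fraction(2, 100),
    "machine 2": Fraction(3, 100),
    "machine 3": Fraction(2, 100),
}

# Leaves of the tree: P(A and B_i) = P(B_i) * P(A | B_i).
joint = {m: priors[m] * defect_rates[m] for m in priors}

p_defective = sum(joint.values())                         # part (a)
posterior = {m: joint[m] / p_defective for m in joint}    # part (b), all machines

print(float(p_defective), round(float(posterior["machine 3"]), 4))
```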

Exercises
2.43.
Lucy is undecided as to whether to take a Math course or a Chemistry course.
She estimates that her probability of receiving an A grade would be 1/2 in a
math course, and 2/3 in a chemistry course. If Lucy decides to base her decision
on the flip of a fair coin, what is the probability that she gets an A?

2.44.
Of the customers at a gas station, 70% use regular gas, and 30% use diesel.
Of the customers who use regular gas, 60% will fill the tank completely, and
of those who use diesel, 80% will fill the tank completely.
a) What percent of all customers will fill the tank completely?

b) If a customer has filled up completely, what is the probability it was a

customer buying diesel?
2.45.
In 2004, 57% of White households directly and/or indirectly owned stocks,
compared to 26% of Black households and 19% of Hispanic households. [e] The
data for Asian households are not given, but let’s assume the same rate as for
Whites. Additionally, 77% of households are classified as either White or
Asian, 12% as African American, and 11% as Hispanic.
a) What proportion of all families owned stocks?
b) If a family owned stock, what is the probability it was White/Asian?
2.46.
Drawer one has five pairs of white and three pairs of red socks, while drawer
two has three pairs of white and seven pairs of red socks. One drawer is
selected at random, and a pair of socks is selected at random from that drawer.
a) What is the probability that it is a white pair of socks?
b) Suppose a white pair of socks is obtained. What is the probability that
it came from drawer two?
2.47.
For an on-line electronics retailer, 5% of customers who buy Zony digital
cameras will return them, 3% of customers who buy Lucky Star digital cameras
will return them, and 8% of customers who buy any other brand will
return them. Also, among all digital cameras bought, there are 20% Zony’s
and 30% Lucky Stars.
Fill in the tree diagram and answer the questions.

(a) What percent of all cameras are returned?

(b) If the camera was just returned, what is the probability it is a Lucky
Star?
(c) What percent of all cameras sold were Zony and were not returned?

[Blank tree diagram to fill in: first-level branches P(B1), P(B2), P(B3); each splits into second-level branches labeled P(A|Bi) = ___, ending in leaves P(A Bi) = ___.]

2.48.
Three newspapers, A, B, and C are published in a certain city. It is estimated
from a survey that of the adult population: 20% read A, 16% read B,
14% read C, 8% read both A and B, 5% read both A and C, 4% read both B
and C, 2% read all three. What percentage reads at least one of the papers?
Of those that read at least one, what percentage reads both A and B?

2.49.
Suppose P (A|B) = 0.3, P (B) = 0.4, P (B|A) = 0.6. Find:

a) P (A)
b) P (A ∪ B)
2.50. ⋆
This is the famous Monty Hall problem. [f] A contestant on a game show is
asked to choose among 3 doors. There is a prize behind one door and nothing
behind the other two. You (the contestant) have chosen one door. Then, the
host flings one of the other doors open, and there’s nothing behind it. What
is the best strategy? Should you switch to the remaining door, or just stay
with the door you have chosen? What is your probability of success (getting
the prize) for either strategy?

2.51. ⋆
There are two children in a family. We overheard one of them being referred
to as a boy.

a) Find the probability that there are 2 boys in the family.

b) Suppose that the oldest child is a boy. Again, find the probability that
there are 2 boys in the family. [g] [Why is it different from part (a)?]

Chapter exercises
2.52.
At a university, two students were doing well for the entire semester but
failed to show up for a final exam. Their excuse was that they traveled out
of state and had a flat tire. The professor gave them the exam in separate
rooms, with one question worth 95 points: “Which tire was it?” Find the
probability that both students mentioned the same tire. [h]

2.53.
In firing the company’s CEO, the argument was that during the six years
of her tenure, for the last three years the company’s market share was lower
than for the first three years. The CEO claims bad luck. Find the probability
that, given six random numbers, the last three are the lowest among six.

Notes
[a] Taken from Leonard Mlodinow, The Drunkard’s Walk.
[b] Source: US Department of Education, National Center for Education Statistics, as
reported in Chronicle of Higher Education Almanac, 1998-1999, 2000.
[c] Laurie McNeil and Marc Sher. The dual-career-couple problem. Physics Today, July
1999.
[d] See David MacKay, Information Theory, Inference, and Learning Algorithms, 640
pages, published September 2003.
Downloadable from http://www.inference.phy.cam.ac.uk/itprnn/book.html
[e] According to http://www.highbeam.com/doc/1G1-167842487.html, Consumer
Interests Annual, January 1, 2007, by Hanna, Sherman D.; Lindamood, Suzanne.
[f] There are some interesting factoids about this in Mlodinow’s book, including Marilyn
vos Savant’s column in Parade magazine and scathing replies from academics, who believed
that the probability was 50%. Vos Savant did it again in 2011 with another probability
question that seems, however, intentionally ambiguously worded.
[g] Puzzle cited by Martin Gardner, mentioned in Math Horizons, Sept. 2010. See also the
discussion at http://www.stat.columbia.edu/~cook/movabletype/archives/2010/05/hype_about_cond.html
[h] This example is also from Mlodinow’s book.
Chapter 3

Discrete probability distributions

3.1 Discrete distributions

In this chapter, we will consider random quantities that are usually called
random variables.

Definition 3.1. Random variable

A random variable (RV) is a number associated with each outcome of some
random experiment.

One can think of the shoe size of a randomly chosen person as a random
variable. We have already seen the example when a die was rolled and a
number was recorded. This number is also a random variable.

Example 3.1.
Toss two coins and record the number of heads: 0, 1 or 2. Then the following
outcomes can be observed.
Outcome           TT   HT   TH   HH
Number of heads   0    1    1    2

The random variables will be denoted with capital letters X, Y, Z, ... and the
lowercase x will represent a particular value of X. For the above example,
x = 2 if heads comes up twice. Now we want to look at the probabilities of
the outcomes. For the probability that the random variable X has the value
x, we write P(X = x), or just p(x).
For the coin flipping random variable X, we can make the table:
x 0 1 2
p(x) 1/4 1/2 1/4
This table represents the probability distribution of the random variable X.

Definition 3.2. Probability mass function

A random variable X is said to be discrete if it can take on only a finite or
countable number of possible values x. In this case,

a) $P(X = x) = p_X(x) \ge 0$

b) $\sum_x P(X = x) = 1$, where the sum is over all possible x

The function $p_X(x)$, or simply p(x), is called the probability mass function (PMF)
of X.
What does this actually mean? A discrete probability function is a function
defined on a discrete set of values (not necessarily finite). This set
is most often the non-negative integers or some subset of the non-negative
integers. There is no mathematical restriction that discrete probability functions
only be defined at integers, but we will use integers in many practical
situations. For example, if you toss a coin 6 times, you can get 2 heads or 3
heads but not 2.5 heads.
Each of the discrete values has a certain probability of occurrence that is
between zero and one. That is, a discrete function that takes negative values
or values greater than one is not a PMF. The condition that the probabilities
add up to one means that one of the values has to occur.
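The two conditions of Definition 3.2 are easy to test mechanically. Below is a minimal sketch for PMFs with finitely many values; the function name `is_valid_pmf` and the second, deliberately invalid table are hypothetical.

```python
from fractions import Fraction


def is_valid_pmf(pmf):
    """Check conditions (a) and (b) for a finite table {value: probability}."""
    probs = list(pmf.values())
    non_negative = all(p >= 0 for p in probs)   # condition (a)
    sums_to_one = (sum(probs) == 1)             # condition (b)
    return non_negative and sums_to_one


# Number of heads in two coin tosses (Example 3.1).
two_coins = {0: Fraction(1, 4), 1: Fraction(1, 2), 2: Fraction(1, 4)}

# Adds up to 1 but has a negative entry, so it is not a PMF.
not_a_pmf = {0: Fraction(1, 2), 1: Fraction(3, 4), 2: Fraction(-1, 4)}

print(is_valid_pmf(two_coins), is_valid_pmf(not_a_pmf))
```

With floating-point probabilities the exact comparison `== 1` may fail because of rounding; in that case a tolerance check such as `math.isclose` would be a safer choice.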

Example 3.2.
A shipment of 8 similar microcomputers to a retail outlet contains 3 that are
defective. If a school makes a random purchase of 2 of these computers, find
the probability mass function for the number of defectives.

Solution. Let X be a random variable whose values x are the possible numbers
of defective computers purchased by the school. Then x must be 0, 1 or 2.
Then,

$$P(X = 0) = \frac{\binom{3}{0}\binom{5}{2}}{\binom{8}{2}} = \frac{10}{28}$$

$$P(X = 1) = \frac{\binom{3}{1}\binom{5}{1}}{\binom{8}{2}} = \frac{15}{28}$$

$$P(X = 2) = \frac{\binom{3}{2}\binom{5}{0}}{\binom{8}{2}} = \frac{3}{28}$$

Thus, the probability mass function of X is

x       0       1       2
p(x)    10/28   15/28   3/28
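The same counting argument works for any lot size. The sketch below generalizes the calculation; the function name `defectives_pmf` and its parameter names are hypothetical, and `math.comb` requires Python 3.8+.

```python
from fractions import Fraction
from math import comb


def defectives_pmf(total, defective, sample):
    """PMF of the number of defectives in a random sample without replacement."""
    good = total - defective
    pmf = {}
    for x in range(min(defective, sample) + 1):
        if sample - x > good:
            continue  # not enough good items to complete the sample
        ways = comb(defective, x) * comb(good, sample - x)
        pmf[x] = Fraction(ways, comb(total, sample))
    return pmf


pmf = defectives_pmf(total=8, defective=3, sample=2)
for x, p in pmf.items():
    print(x, p)
```

Python prints the fractions in lowest terms, so 10/28 appears as 5/14.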

Definition 3.3. Cumulative distribution function

The cumulative distribution function (CDF) F(x) for a random variable X is
defined as
$$F(x) = P(X \le x)$$
If X is discrete,
$$F(x) = \sum_{y \le x} p(y)$$
where p(x) is the probability mass function.

Properties of discrete CDF

a) $\lim_{x \to -\infty} F(x) = 0$

b) $\lim_{x \to \infty} F(x) = 1$

c) F(x) is non-decreasing

d) $p(x) = F(x) - F(x-) = F(x) - \lim_{y \uparrow x} F(y)$

In words, the CDF of a discrete RV is a step function, whose jumps occur at the
values x for which p(x) > 0 and are equal in size to p(x). It ranges from 0
on the left to 1 on the right.

Example 3.3.
Find the CDF of the random variable from Example 3.2. Using F (x), verify
that P (X = 1) = 15/28.

Solution. The CDF of the random variable X is:

F(0) = p(0) = 10/28
F(1) = p(0) + p(1) = 25/28
F(2) = p(0) + p(1) + p(2) = 28/28 = 1.
Hence,

$$F(x) = \begin{cases} 0 & \text{for } x < 0 \\ 10/28 & \text{for } 0 \le x < 1 \\ 25/28 & \text{for } 1 \le x < 2 \\ 1 & \text{for } x \ge 2 \end{cases} \qquad (3.1)$$

Now, P(X = 1) = p(1) = F(1) − F(0) = 25/28 − 10/28 = 15/28.
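The step-function behavior of the CDF can be seen by evaluating it at a few points, including points between the possible values. A minimal sketch based on the PMF of Example 3.2:

```python
from fractions import Fraction

pmf = {0: Fraction(10, 28), 1: Fraction(15, 28), 2: Fraction(3, 28)}


def cdf(x):
    """F(x) = P(X <= x): add up p(y) over all possible values y <= x."""
    return sum((p for y, p in pmf.items() if y <= x), Fraction(0))


# F is constant between the possible values and jumps at 0, 1 and 2.
values = {x: cdf(x) for x in (-1, 0, 0.5, 1, 1.5, 2, 5)}

# The jump at x = 1 recovers p(1); here F(1-) equals F(0).
jump_at_1 = cdf(1) - cdf(0)

print(values)
print(jump_at_1)
```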

Graphically, p(x) can be represented as a probability histogram where the
heights of the bars are equal to p(x).

[Two panels: p(x) versus x (left) and F(x) versus x (right).]

Figure 3.1: PMF and CDF for Example 3.3

Exercises
3.1.
Suppose that two dice are rolled independently, with outcomes X1 and X2.
Find the distribution of the random variable Y = X1 + X2. [Hint: It’s easier
to visualize all the outcomes if you make a two-way table.]

3.2.
What constant c makes p(x) a valid PMF?

a) p(x) = c for x = 1, 2, ..., 5.

b) $p(x) = c(x^2 + 1)$ for x = 0, 1, 2, 3.

c) $p(x) = c\,x\binom{3}{x}$ for x = 1, 2, 3.
3.3.
Are the following valid PMF’s? If yes, find the constant k that makes it so.

a) p(x) = (x − 2)/k for x = 1, 2, ..., 5

b) $p(x) = (x^2 - x + 1)/k$ for x = 1, 2, ..., 5

c) $p(x) = k/2^x$ for x = −1, 0, 1, 2
3.4.
With reference to the previous problem find an expression for the values of
F (x), that is CDF of X.

3.5.
For an on-line electronics retailer, X = the number of Zony digital cameras
returned per day follows the distribution given by
x 0 1 2 3 4 5
p(x) 0.05 0.1 ? 0.2 0.25 0.1
(a) Fill in the “?”
(b) Find P (X > 3)
(c) Find the CDF of X (make a table).

3.6.
Out of 5 components, 3 are domestic and 2 are imported. 3 components
are selected at random (without replacement). Calculate the PMF for X =
number of domestic components picked (make a table).

3.7.
The CDF of a discrete random variable X is shown in the plot below.

[Figure: plot titled “CDF” showing the step function F(x); vertical axis from 0.0 to 1.0, horizontal axis x from −2 to 4.]

Find the probability mass function pX (x) (make a table)


3.2 Expected values of Random Variables

One of the most important things we’d like to know about a random variable
is: what value does it take on average? What is the average price of a
computer? What is the average value of a number that rolls on a die?
The value is found as the average of all possible values, weighted by how
often they occur (i.e. by their probability).
Definition 3.4. Expected value (mean)
The mean or expected value of a discrete random variable X with probability
mass function p(x) is given by
$$E(X) = \sum_x x\, p(x)$$

We will sometimes use the notation E (X) = µ.

Theorem 3.1. Expected value of a function

If X is a discrete random variable with probability mass function p(x) and if
g(x) is a real valued function of x, then
$$E[g(X)] = \sum_x g(x)\, p(x).$$

Definition 3.5. Variance

The variance of a random variable X with expected value µ is given by

$$V(X) = \sigma^2 = E[(X - \mu)^2] = E(X^2) - \mu^2,$$

where
$$E(X^2) = \sum_x x^2\,p(x).$$

The variance is the average (or expected) value of the squared difference
from the mean.
If we use $V(X) = E[(X - \mu)^2]$ as the definition, we can see that
$$V(X) = E[(X - \mu)^2] = E(X^2 - 2\mu X + \mu^2) = E(X^2) - 2\mu E(X) + \mu^2 = E(X^2) - \mu^2$$
due to the linearity of expectation (see Theorem 3.2 below).

Definition 3.6. Standard deviation

The standard deviation of a random variable X is the square root of the
variance, and is given by
$$\sigma = \sqrt{\sigma^2} = \sqrt{E[(X - \mu)^2]}$$

The mean describes the center of the probability distribution, while the standard
deviation describes the spread. Larger values of σ signify a distribution
with larger variation. This will be undesirable in some situations, e.g. industrial
process control, where we would like the manufactured items to have
identical characteristics. On the other hand, a degenerate random variable
X that has P (X = a) = 1 for some value of a is not random at all, and it
has a standard deviation of 0.

Example 3.4.
The number of fire emergencies in a rural county in a week has the following
distribution:
x           0     1     2     3     4
P (X = x)   0.52  0.28  0.14  0.04  0.02

Find E (X), V (X) and σ.

Solution. From Definition 3.4, we see that

$$E(X) = 0(0.52) + 1(0.28) + 2(0.14) + 3(0.04) + 4(0.02) = 0.76 = \mu$$

and from the definition of $E(X^2)$, we get

$$E(X^2) = 0^2(0.52) + 1^2(0.28) + 2^2(0.14) + 3^2(0.04) + 4^2(0.02) = 1.52$$

Hence, from Definition 3.5, we get

$$V(X) = E(X^2) - \mu^2 = 1.52 - (0.76)^2 = 0.9424$$

Now, from Definition 3.6, the standard deviation is $\sigma = \sqrt{0.9424} = 0.9708$.
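
The calculation in Example 3.4 can be reproduced numerically. The following is a minimal Python sketch, not part of the original notes; the dictionary `pmf` and the helper `expected_value` are names chosen here for illustration only.

```python
import math

# PMF from Example 3.4: number of fire emergencies per week.
pmf = {0: 0.52, 1: 0.28, 2: 0.14, 3: 0.04, 4: 0.02}

def expected_value(pmf, g=lambda x: x):
    """Return E[g(X)], the sum of g(x) * p(x) over the support (Theorem 3.1)."""
    return sum(g(x) * p for x, p in pmf.items())

mu = expected_value(pmf)                              # E(X)
second_moment = expected_value(pmf, lambda x: x**2)   # E(X^2)
variance = second_moment - mu**2                      # V(X) = E(X^2) - mu^2
sigma = math.sqrt(variance)                           # standard deviation

print(round(mu, 4), round(second_moment, 4), round(variance, 4), round(sigma, 4))
# prints: 0.76 1.52 0.9424 0.9708
```

The same helper works for any finite PMF table, so it may be handy for checking answers to the exercises at the end of this section.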

Theorem 3.2. Linear functions

For any random variable X and constants a and b,

a) $E(aX + b) = aE(X) + b$

b) $V(aX + b) = a^2 V(X) = a^2\sigma^2$

c) $\sigma_{aX+b} = |a|\,\sigma$

d) For several RV’s, $X_1, X_2, \ldots, X_k$,

$$E(X_1 + X_2 + \cdots + X_k) = E(X_1) + E(X_2) + \cdots + E(X_k)$$

Example 3.5.
Let X be a random variable having probability mass function given in Ex-
ample 3.4. Calculate the mean and variance of g(X) = 4X + 3.

Solution. In Example 3.4, we found $E(X) = \mu = 0.76$ and $V(X) = 0.9424$.

Now, using Theorem 3.2,

$$E(g(X)) = 4E(X) + 3 = 4(0.76) + 3 = 3.04 + 3 = 6.04$$

and $V(g(X)) = 4^2 V(X) = 16(0.9424) = 15.0784$.
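
As a cross-check on Theorem 3.2, the mean and variance of g(X) = 4X + 3 can also be computed directly from the PMF of Example 3.4, without using the linearity rules. This is only an illustrative sketch; the helper name `mean_and_variance` is an assumption made here.

```python
# PMF from Example 3.4.
pmf = {0: 0.52, 1: 0.28, 2: 0.14, 3: 0.04, 4: 0.02}

def mean_and_variance(pmf, g=lambda x: x):
    """Return (E[g(X)], V[g(X)]) computed directly from the PMF."""
    mean = sum(g(x) * p for x, p in pmf.items())
    var = sum((g(x) - mean) ** 2 * p for x, p in pmf.items())
    return mean, var

mean_x, var_x = mean_and_variance(pmf)
mean_g, var_g = mean_and_variance(pmf, lambda x: 4 * x + 3)

# Direct computation versus Theorem 3.2 (a) and (b) with a = 4, b = 3.
print(round(mean_g, 4), round(4 * mean_x + 3, 4))
print(round(var_g, 4), round(16 * var_x, 4))
```

Each printed pair should agree, which is exactly what parts (a) and (b) of the theorem assert.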

Theorem 3.3. Chebyshev Inequality

Let X be a random variable with mean $\mu$ and variance $\sigma^2$. Then for any
positive k,
$$P(|X - \mu| \ge k\sigma) \le \frac{1}{k^2}$$

The inequality in the statement of the theorem is equivalent to

$$P(\mu - k\sigma < X < \mu + k\sigma) \ge 1 - \frac{1}{k^2}$$
To interpret this result, let k = 2, for example. Then the interval from $\mu - 2\sigma$
to $\mu + 2\sigma$ must contain at least $1 - \frac{1}{k^2} = 1 - \frac{1}{4} = \frac{3}{4}$ of the probability mass
for the random variable.

Chebyshev inequality is useful when the mean and variance of a RV are known
and we would like to calculate estimates of some probabilities. However, these
estimates are usually quite crude.
Example 3.6.
The performance period of a certain car battery is known to have a mean of
30 months and standard deviation of 5 months.
a) Estimate the probability that a car battery will last at least 18 months.
b) Give a range of values to which at least 90% of all batteries’ lifetimes
will belong.
Solution. (a) Let X be the battery performance period. Calculate k such
that the value of 18 is k standard deviations below the mean: 18 = 30 − 5k,
therefore k = (30 − 18)/5 = 2.4. From Chebyshev’s theorem we have
$$P(30 - 5k < X < 30 + 5k) \ge 1 - 1/k^2 = 1 - 1/2.4^2 = 0.826$$
Thus, at least 82.6% of batteries will make it to 18 months. (However, in
reality this percentage could be much higher, depending on the distribution.)
(b) From Chebyshev’s theorem we have
$$P(\mu - k\sigma < X < \mu + k\sigma) \ge 1 - \frac{1}{k^2}$$
Following the problem statement, we set $1 - \frac{1}{k^2} = 0.90$ and solve for k to get $k = \sqrt{10} = 3.16$.
Hence, the desired interval is from 30 − 3.16(5) to 30 + 3.16(5), that is,
14.2 to 45.8 months.
Example 3.7.
The number of customers per day at a certain sales counter, X, has a mean
of 20 customers and standard deviation of 2 customers. The probability
distribution of X is not known. What can be said about the probability that
X will be between 16 and 24 tomorrow?
Solution. Since X takes integer values, we want $P(16 \le X \le 24) = P(15 < X < 25)$. From Chebyshev’s
theorem
$$P(\mu - k\sigma < X < \mu + k\sigma) \ge 1 - \frac{1}{k^2}$$
Given $\mu = 20$, $\sigma = 2$, we set $\mu - k\sigma = 15$ and hence k = 2.5. Thus,
$P(16 \le X \le 24) \ge 1 - \frac{1}{6.25} = 0.84$.
So, tomorrow’s customer total will be between 16 and 24 with probability at
least 0.84.
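
The Chebyshev bounds used in Examples 3.6 and 3.7 follow one pattern: express the half-width of the interval in units of σ, then apply 1 − 1/k². A small Python sketch of that pattern is given below; the function name `chebyshev_lower_bound` is hypothetical and introduced only for this illustration.

```python
def chebyshev_lower_bound(mu, sigma, low, high):
    """Chebyshev lower bound on P(low < X < high) for an interval symmetric about mu."""
    if abs((mu - low) - (high - mu)) > 1e-12:
        raise ValueError("interval must be symmetric about the mean")
    if high <= mu or sigma <= 0:
        raise ValueError("need high > mu and sigma > 0")
    k = (high - mu) / sigma          # half-width measured in standard deviations
    return max(0.0, 1.0 - 1.0 / k**2)

print(round(chebyshev_lower_bound(30, 5, 18, 42), 3))  # Example 3.6(a), k = 2.4: prints 0.826
print(round(chebyshev_lower_bound(20, 2, 15, 25), 3))  # Example 3.7, k = 2.5: prints 0.84
```

For k ≤ 1 the bound is not informative, which is why the function clips the result at 0.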

Exercises
3.8.
Timmy is selling chocolates door to door. The probability distribution of X,
the number of chocolates he sells in each house, is given by
x 0 1 2 3 4
P (X = x) 0.45 0.25 0.15 0.1 0.05
Find the expected value and standard deviation of X.

3.9.
In the previous exercise, suppose that Timmy earns 50 cents for school from
each purchase. Find the expected value and standard deviation of his earn-
ings per house.

3.10.
A dollar coin, a quarter, a nickel and a dime are tossed. I get to pocket all
the coins that came up heads. What are my expected winnings?

3.11.
Consider X with the distribution of a random digit, p(x) = 1/10, x =
0, 1, 2, ..., 9

a) Find the mean and standard deviation of X.

b) According to Chebyshev’s inequality, estimate the probability that a
random digit will be between 1 and 8, inclusive. Compare to the actual
probability.

3.12.
In the Numbers game, two players choose a random number between 1 and
6, and compute the absolute difference.
That is, if Player 1 gets the number $Y_1$, and Player 2 gets $Y_2$, then they find
$X = |Y_1 - Y_2|$

a) Find the distribution of the random variable X (make a table). [Hint:
consider all outcomes $(y_1, y_2)$.]
b) Find the expected value and variance of X, and $E(X^3)$
c) If Player 1 wins whenever the difference is 3 or more, and Player 2 wins
whenever the difference is 2 or less, who is more likely to win?
d) If Player 1 bets $1, what is the value that Player 2 should bet to make
the game fair?

3.13.
According to ScanUS.com, the number of cars per household in an Albu-
querque neighborhood was distributed as follows
x 0 1 2 3+
P (X = x) 0.047 0.344 0.402 0.207
3+ really means 3 or more, but let’s assume that there are no more than 3
cars in any household.
Find the expected value and standard deviation of X.

3.14.
For the above problem, the web site actually reported an average of 1.9 cars per
household. This is higher than the answer to Problem 3.13. Probably,
this is because we limited the number of cars to 3.
Suppose we limit the number of cars to 4. This means the distribution
will look like
x      0      1      2      3      4
p(x)   0.047  0.344  0.402  $p_3$  $p_4$
where $p_3 + p_4 = 0.207$. Assuming that E (X) = 1.9, reverse-engineer this
information to find $p_3$ and $p_4$.

3.15.
The frequencies of electromagnetic waves in the upper ionosphere observed
in the vicinity of earthquakes have the mean 1.7 kHz, and standard deviation
of 0.2 kHz. According to Chebyshev inequality,

a) What percent of all observed waves is guaranteed to be contained in

the interval 1.4 to 2.0 kHz?
b) Give an interval that would contain at least 95% of all such observed
waves.

3.16.
Find the mean and variance of the given PMF p(x) = 1/k, where x =
1, 2, 3, ..., k.

3.17.
Show that the function defined by $p(x) = 2^{-x}$ for $x = 1, 2, 3, \ldots$ can represent
a probability mass function of a random variable X. Find the mean and the
variance of X.

3.18.
For t > 0 show that $p(x) = e^{-t}(1 - e^{-t})^{x-1}$, $x = 1, 2, 3, \ldots$ can represent a
probability mass function. Also, find E (X) and V (X).

3.19. “Baker’s problem” ?

A shopkeeper is selling the quantity X (between 0 and 3) of a certain item
per week, with a given probability distribution:
x 0 1 2 3
p(x) 0.05 0.2 0.5 0.25
For each item bought, the profit is $50. On the other hand, if the item
is stocked, but was not bought, then the cost of upkeep, insurance etc. is
$20. At the beginning of the week, the shopkeeper stocks a items.
For example, if 3 items were stocked, then the expected profit can be calculated
from the following table, where Y = Profit:
y      −$60   $10   $80   $150
p(y)   0.05   0.2   0.5   0.25

a) What is the expected profit if the shopkeeper stocked a = 3 items?

b) What is the expected profit if the shopkeeper stocked a = 1 and a = 2
items? [You’ll need to produce new tables for Y first.]
c) Which value of a maximizes the expected profit?

3.3 Bernoulli distribution

Let X be the random variable denoting the condition of the inspected item.
Agree to write X = 1 when the item is defective and X = 0 when it is
not. (This is a convenient notation because, once we inspect n such items,
X1 , X2 , ..., Xn denoting their condition, the total number of defectives will
be given by X1 + X2 + ... + Xn .)

Let p denote the probability of observing a defective item. The probability
distribution of X, then, is given by
x      0            1
p(x)   q = 1 − p    p

Such a random variable is said to have a Bernoulli distribution. Note that

$$E(X) = \sum_x x\,p(x) = 0 \times p(0) + 1 \times p(1) = 0(q) + 1(p) = p$$
and
$$E(X^2) = \sum_x x^2\,p(x) = 0(q) + 1(p) = p.$$
Hence, $V(X) = E(X^2) - (E X)^2 = p - p^2 = pq$.

3.4 Binomial distribution

Now, let us inspect n items and count the total number of defectives. This
process of repeating an experiment n times is called Bernoulli trials. The
Bernoulli trials are formally defined by the following properties:

a) The result of each trial is either a success or a failure

b) The probability of success p is constant from trial to trial.
c) The trials are independent
d) The random variable X is defined to be the number of successes in n
repeated trials

This situation applies to many random processes with just two possible outcomes:
a heads-or-tails coin toss, a made or missed free throw in basketball,
etc.¹ We arbitrarily call one of these outcomes a success and the other a
failure.

¹ However, we have to make sure that the probability of success remains constant. Thus,
for example, wins or losses in a series of football games may not be a Bernoulli experiment!

Definition 3.7. Binomial RV

Assume that each Bernoulli trial can result in a success with probability p
and a failure with probability q = 1 − p. Then the probability distribution of
the binomial random variable X, the number of successes in n independent
trials, is
$$P(X = k) = \binom{n}{k} p^k q^{n-k}, \quad k = 0, 1, 2, \ldots, n.$$
The mean and variance of the binomial distribution are

$$E(X) = \mu = np \quad \text{and} \quad V(X) = \sigma^2 = npq.$$

We can notice that the mean and variance of the Binomial are n times larger
than those of the Bernoulli random variable.
[Figure 3.2: Binomial PMF: left, with n = 60, p = 0.6; right, with n = 15,
p = 0.5]

Note that the Binomial distribution is symmetric when p = 0.5. Also, two
Binomials with the same n and $p_2 = 1 - p_1$ are mirror images of each other.

Example 3.8.
The probability that a certain kind of component will survive a shock test is
0.75. Find the probability that

a) exactly 2 of the next 8 components tested survive,

b) at least 2 will survive,

c) at most 6 will survive.

[Figure 3.3: Binomial PMF: left, with n = 15, p = 0.1; right, with n = 15,
p = 0.8]

Solution. (a) Assuming that the tests are independent and p = 0.75 for each
of the 8 tests, we get

$$P(X = 2) = \binom{8}{2}(0.75)^2(0.25)^{8-2} = \frac{8!}{2!\,(8-2)!}\,0.75^2\,0.25^6 = \frac{40320}{2 \times 720}(0.5625)(0.000244) = 0.003843$$
(b)
$$P(X \ge 2) = 1 - P(X \le 1) = 1 - [P(X = 1) + P(X = 0)]$$
$$= 1 - [8(0.75)(0.000061) + 0.000015] = 1 - 0.000381 \approx 0.9996$$
(c)
$$P(X \le 6) = 1 - P(X \ge 7) = 1 - [P(X = 7) + P(X = 8)]$$
$$= 1 - [0.2670 + 0.1001] = 1 - 0.3671 = 0.6329$$
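
The three probabilities in Example 3.8 can be checked with a few lines of Python. The sketch below uses only the standard-library function `math.comb`; the helper names `binom_pmf` and `binom_cdf` are assumptions made for this illustration.

```python
from math import comb

def binom_pmf(k, n, p):
    """P(X = k) for a Binomial(n, p) random variable (Definition 3.7)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def binom_cdf(k, n, p):
    """P(X <= k), obtained by summing the PMF from 0 to k."""
    return sum(binom_pmf(i, n, p) for i in range(k + 1))

n, p = 8, 0.75
p_exactly_2 = binom_pmf(2, n, p)        # part (a)
p_at_least_2 = 1 - binom_cdf(1, n, p)   # part (b), via the complement
p_at_most_6 = binom_cdf(6, n, p)        # part (c)

print(round(p_exactly_2, 4), round(p_at_least_2, 4), round(p_at_most_6, 4))
# prints: 0.0038 0.9996 0.6329
```

Part (c) is computed here by direct summation rather than by the complement used in the text; both routes give the same value.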

Example 3.9.
It has been claimed that in 60% of all solar heating installations the utility
bill is reduced by at least one-third. Accordingly, what are the probabilities
that the utility bill will be reduced by at least one-third in
(a) four of five installations;
(b) at least four of five installations?

Solution.

(a) $P(X = 4) = \binom{5}{4}(0.60)^4(0.40)^{5-4} = 5(0.1296)(0.4) = 0.2592$

(b) $P(X = 5) = \binom{5}{5}(0.60)^5(0.40)^{5-5} = 0.60^5 = 0.0778$
Hence, $P(\text{reduction for at least four}) = P(X \ge 4) = 0.2592 + 0.0778 = 0.3370$

Exercises
3.20.
There’s 50% chance that a mutual fund return on any given year will beat the
industry’s average. What proportion of funds will beat the industry average
for at least 4 out of 5 last years?
3.21.
Biologists would like to catch Costa Rican glass frogs for breeding. There is
75% probability that a glass frog they catch is male. If 10 glass frogs of a
certain species are caught, what are the chances that they will have at least
2 male and 2 female frogs? What is the expected value of the number of
female frogs caught?
3.22.
A 5-member focus group are testing a new game console. Suppose that there’s
50% chance that any given group member approves of the new console, and
their opinions are independent of each other.
a) Calculate and fill out the probability distribution for X = number of
group members who approve of the new console.
b) Calculate P (X ≥ 3).
c) How does your answer in part (b) change when there’s 70% chance that
any group member approves of the new console?
3.23.
Suppose that the four engines of a commercial airplane were arranged to
operate independently and that the probability of in-flight failure of a single
engine is 0.01. Find:

a) Probability of no failures on a given flight.

b) Probability of at most one failure on a given flight.
c) The mean and variance for the number of failures on a given flight.
3.24.
Suppose a television contains 60 transistors, 2 of which are defectives. Five
transistors are selected at random, removed and inspected. Approximate
a) probability of selecting no defectives,
b) Probability of selecting at least one defective.
c) The mean and variance for the number of defectives selected.
3.25.
Suppose that the four engines of a commercial aircraft were arranged to
operate independently and that the probability of in-flight failure of a single
engine is 0.02. Find:
a) probability of no failures on a given flight,
b) probability of at most one failure on a given flight.
3.26.
Show that mean and variance of the binomial random variable X are np and
npq respectively.
3.27.
If a thumb-tack is flipped, then the probability that it will land point-up is
1/3. If this thumb-tack is flipped 6 times, then find:
a) the probability that it lands point-up on exactly 2 flips,
b) at least 2 flips,
c) at most 4 flips.
3.28.
The proportion of people with type A blood in a certain city is reported to
be 0.20. Suppose a random sample of 20 people is taken and their blood
types are to be checked. What is the probability that there are at least 4
people who have type A blood in the sample? What is the probability that
there are at most 5 people who have type A blood in the sample?
3.29.
A die and a coin are tossed together. Let us define success as the event that
the die shows an odd number and the coin shows a head. We repeat the
experiment 5 times. What is the probability of exactly 3 successes?

3.5 Geometric distribution

In the case of Binomial distribution, the number of trials was a fixed number
n, and the variable of interest was the number of successes. It is sometimes of
interest to count instead how many trials are required to achieve a specified
number of successes.
The number of trials Y required to obtain the first success is called a
Geometric random variable with parameter p.

Theorem 3.4. Geometric RV

The probability mass function for a Geometric random variable is

$$g(y; p) := P(Y = y) = (1 - p)^{y-1} p, \quad y = 1, 2, 3, \ldots$$

Its CDF is
$$F(y) = 1 - q^y, \quad y = 1, 2, 3, \ldots, \quad q = 1 - p$$
Its mean and variance are
$$\mu = \frac{1}{p} \quad \text{and} \quad \sigma^2 = \frac{1 - p}{p^2}$$

Proof. To achieve the first success on the yth trial means that the first y − 1
trials result in failures and the yth one is a success. Then, by
independence of trials,
$$P(FF \ldots FS) = q^{y-1} p$$
Now the CDF is
$$F(y) = P(Y \le y) = 1 - P(Y > y)$$
The latter event means that all the trials up to and including the yth one resulted
in failures, which has probability $P(y \text{ failures in a row}) = q^y$, and we get the CDF by
subtracting this from 1.
The mean E (Y ) can be found by differentiating a geometric series:
$$E(Y) = \sum_{y=1}^{\infty} y\,p(y) = \sum_{y=1}^{\infty} y\,p(1-p)^{y-1} = p\sum_{y=1}^{\infty} y(1-p)^{y-1} = p\sum_{y=1}^{\infty}\frac{d}{dq}q^y = p\,\frac{d}{dq}\sum_{y=1}^{\infty} q^y$$
$$= p\,\frac{d}{dq}(1 + q + q^2 + q^3 + \cdots - 1) = p\left[\frac{d}{dq}(1-q)^{-1} - \frac{d}{dq}(1)\right] = \frac{p}{(1-q)^2} = \frac{1}{p}.$$
The variance can be calculated by differentiating a geometric series twice:
$$E\{Y(Y-1)\} = \sum_{y=1}^{\infty} y(y-1)\,p\,q^{y-1} = pq\sum_{y=1}^{\infty}\frac{d^2}{dq^2}(q^y)$$
$$= pq\,\frac{d^2}{dq^2}(1-q)^{-1} = pq\,\frac{2}{(1-q)^3} = \frac{2q}{p^2}$$
Hence $E(Y^2) = \frac{2q}{p^2} + \frac{1}{p}$ and $V(Y) = \frac{2q}{p^2} + \frac{1}{p} - \frac{1}{p^2} = \frac{q}{p^2}$.
[Figure 3.4: Geometric PMF: left, with p = 0.2; right, with p = 0.5. Axes: p(x) versus x.]

Example 3.10.
For a certain manufacturing process it is known that, on the average, 1 in
every 100 items is defective. What is the probability that the first defective
item found is the fifth item inspected? What is the average number of items
that should be sampled before the first defective is found?
Solution. Using the geometric distribution with y = 5 and p = 0.01, we have
$g(5; 0.01) = (0.01)(0.99)^4 = 0.0096$.
The mean number of items needed is $\mu = 1/p = 100$.

Example 3.11.
If the probability is 0.20 that a burglar will get caught on any given job,
what is the probability that he will get caught no later than on his fourth
job?
Solution. Substituting y = 4 and p = 0.20 into the geometric CDF, we get
$P(Y \le 4) = 1 - 0.8^4 = 0.5904$
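
The geometric PMF and CDF from Theorem 3.4 are simple enough to evaluate directly. The Python sketch below reproduces the numbers of Examples 3.10 and 3.11 and checks the formulas µ = 1/p and σ² = q/p² against a truncated series; the function names and the truncation point of 2000 terms are choices made here for illustration.

```python
def geom_pmf(y, p):
    """P(Y = y): first success occurs on trial y."""
    return (1 - p) ** (y - 1) * p

def geom_cdf(y, p):
    """P(Y <= y) = 1 - q^y with q = 1 - p."""
    return 1 - (1 - p) ** y

print(round(geom_pmf(5, 0.01), 4))   # Example 3.10: prints 0.0096
print(round(geom_cdf(4, 0.20), 4))   # Example 3.11: prints 0.5904

# Numerical check of the mean and variance for p = 0.2, truncating the
# infinite series at 2000 terms (the neglected tail is negligible here).
p = 0.2
approx_mean = sum(y * geom_pmf(y, p) for y in range(1, 2000))
approx_var = sum(y**2 * geom_pmf(y, p) for y in range(1, 2000)) - approx_mean**2
print(round(approx_mean, 6), round(approx_var, 6))   # close to 1/p = 5 and q/p^2 = 20
```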

Exercises
3.30.
The probability to be caught while running a red light is estimated as 0.1.
What is the probability that a person is first caught on his 10th attempt to
run a red light? What is the probability that a person runs a red light at
least 10 times without being caught?
3.31.
A computing center is interviewing people until they find a qualified person
to fill a vacant position. The probability that any single applicant is qualified
is 0.15.
a) Find the expected number of people to interview.
b) Find the probability the center will need to interview between 4 and 8
people (inclusive).
3.32.
If probability of success is 0.01, how many trials are necessary so that prob-
ability of at least one success is greater than 0.5?
3.33.
From past experience it is known that 3% of accounts in a large accounting
population are in error. What is the probability that the first account in
error is found on the 5th try? What is the probability that the first account
in error occurs in the first five accounts audited?
3.34.
A rat must choose between five doors, one of which contains chocolate. If the
rat chooses the wrong door, it is returned to the starting point and chooses
again (randomly), and continues until it gets the chocolate. What is the
probability of the rat getting chocolate on the second attempt?

3.6 Negative Binomial distribution

Let Y denote the number of the trial on which the rth success occurs in a
sequence of independent Bernoulli trials, with p the probability of success.
Such Y is said to have Negative Binomial distribution. When r = 1, we will
of course obtain the Geometric distribution.

Theorem 3.5. Negative Binomial RV

The PMF of the Negative Binomial random variable Y is

$$nb(y; r, p) := P(Y = y) = \binom{y-1}{r-1} p^r q^{y-r}, \quad y = r, r+1, \ldots$$

The mean and variance of Y are:

$$E(Y) = \frac{r}{p} \quad \text{and} \quad V(Y) = \frac{rq}{p^2}.$$

Proof. We have

$$P(Y = y) = P[\text{first } y-1 \text{ trials contain } r-1 \text{ successes and the } y\text{th trial is a success}]$$
$$= \binom{y-1}{r-1} p^{r-1} q^{y-r} \times p = \binom{y-1}{r-1} p^r q^{y-r}, \quad y = r, r+1, r+2, \ldots$$
The proof for the mean and variance uses the properties of independent
sums, to be discussed in Section 5.4. However, note at this point that
both $\mu$ and $\sigma^2$ are r times larger than those of the Geometric distribution.

Example 3.12.
In an NBA championship series, the team which wins four games out of seven
will be the winner. Suppose that team A has probability 0.55 of winning over
the team B, and the teams A and B face each other in the championship
games.
(a) What is the probability that team A will win the series in six games?
(b) What is the probability that team A will win the series?

Solution.
(a) $nb(6; 4, 0.55) = \binom{5}{3}(0.55)^4(1 - 0.55)^{6-4} = 0.1853$.

(b) P(team A wins the championship series)

$$= nb(4; 4, 0.55) + nb(5; 4, 0.55) + nb(6; 4, 0.55) + nb(7; 4, 0.55)$$
$$= 0.0915 + 0.1647 + 0.1853 + 0.1668 = 0.6083$$

Example 3.13.
A pediatrician wishes to recruit 5 couples, each of whom is expecting their
first child, to participate in a new childbirth regimen. She anticipates that
20% of all couples she asks will agree. What is the probability that 15 couples
must be asked before 5 are found who agree to participate?
Solution. Substituting y = 15, p = 0.2, r = 5, we get

$$nb(15; 5, 0.2) = \binom{14}{4}(0.2)^5(0.8)^{15-5} = 0.034$$
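
The Negative Binomial probabilities in Examples 3.12 and 3.13 can be verified with a short Python sketch. The function name `nb_pmf` mirrors the notation nb(y; r, p) of Theorem 3.5 but is otherwise an assumption of this illustration.

```python
from math import comb

def nb_pmf(y, r, p):
    """P(Y = y): the r-th success occurs on trial y (zero if y < r)."""
    if y < r:
        return 0.0
    return comb(y - 1, r - 1) * p**r * (1 - p) ** (y - r)

# Example 3.12: team A (p = 0.55) wins a best-of-seven series in 4, 5, 6 or 7 games.
series = [nb_pmf(y, 4, 0.55) for y in range(4, 8)]
p_series = sum(series)
print([round(v, 4) for v in series], round(p_series, 4))
# prints: [0.0915, 0.1647, 0.1853, 0.1668] 0.6083

# Example 3.13: the 5th agreeing couple is the 15th couple asked.
print(round(nb_pmf(15, 5, 0.2), 3))   # prints 0.034
```

Setting r = 1 recovers the Geometric PMF, in line with the remark at the start of this section.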

Exercises
3.35.
Biologists catch Costa Rican glass frogs for breeding. There is 75% proba-
bility that a glass frog they catch is male. Biologists would like to have at
least 2 female frogs. What is the expected value of the total number of frogs
caught, until they reach their goal? What is the probability that they will
need exactly 6 frogs to reach their goal?
3.36.
Jim is a high school baseball player. He has 0.25 batting average, meaning
that he makes a hit in 25% of his tries (“at-bats”). What is the probability
that Jim makes his second hit of the season on his sixth at-bat?
3.37.
In the best-of-5 series, Team A has 60% chance to win any single game, and
the outcomes of the games are independent. Find the probability that Team
A will win the series (i.e. will win the majority of the games).
3.38.
For Problem 3.37, find the expected duration of the series (regardless of
which team wins). [Hint: First, fill out the table containing d, p(d) – the
distribution of the duration D. For example, P (D = 3) = P (team A wins in 3) +
P (team B wins in 3)]

3.7 Poisson distribution

It is often useful to define a random variable that counts the number of events
that occur within certain specified boundaries. For example, the
number of telephone calls received by customer service within a certain time
limit. The Poisson distribution is often appropriate to model such situations.
Definition 3.8. Poisson RV
A random variable X with a Poisson distribution takes the values
x = 0, 1, 2, . . . with a probability mass function

$$pois(x; \mu) := P(X = x) = \frac{e^{-\mu}\mu^x}{x!}$$
where $\mu$ is the parameter of the distribution.ᵃ

ᵃ Some textbooks use $\lambda$ for the parameter. We will use $\lambda$ for the intensity of the Poisson
process, to be discussed later.

Theorem 3.6. Mean and variance of Poisson RV

For Poisson RV with parameter µ,

E (X) = V (X) = µ.

Proof. Recall the Taylor series expansion of $e^x$:

$$e^x = 1 + x + \frac{x^2}{2!} + \frac{x^3}{3!} + \ldots$$
Now,
$$E(X) = \sum_{x=0}^{\infty} x \cdot pois(x; \mu) = \sum_{x=0}^{\infty} x\,\frac{e^{-\mu}\mu^x}{x!} = \sum_{x=1}^{\infty} x\,\frac{e^{-\mu}\mu\,\mu^{x-1}}{x(x-1)!}$$
$$= \mu e^{-\mu}\sum_{x=1}^{\infty}\frac{\mu^{x-1}}{(x-1)!} = \mu e^{-\mu}\left(1 + \frac{\mu}{1!} + \frac{\mu^2}{2!} + \frac{\mu^3}{3!} + \ldots\right) = \mu e^{-\mu}e^{\mu} = \mu$$
To find $E(X^2)$, let us consider the factorial expression E [X(X − 1)].
$$E[X(X-1)] = \sum_{x=0}^{\infty} x(x-1)\frac{e^{-\mu}\mu^x}{x!} = \sum_{x=2}^{\infty} x(x-1)\frac{\mu^2 e^{-\mu}\mu^{x-2}}{x(x-1)(x-2)!}$$
$$= \mu^2 e^{-\mu}\sum_{x=2}^{\infty}\frac{\mu^{x-2}}{(x-2)!} = \mu^2 e^{-\mu}e^{\mu} = \mu^2$$

Therefore, $E[X(X-1)] = E(X^2) - E(X) = \mu^2$. Now we can solve for $E(X^2)$,
which is $E(X^2) = E[X(X-1)] + E(X) = \mu^2 + \mu$.
Thus,
$$V(X) = E(X^2) - [E(X)]^2 = \mu^2 + \mu - \mu^2 = \mu.$$
[Figure 3.5: Poisson PMF: left, with µ = 1.75; right, with µ = 8. Axes: p(x) versus x.]

Example 3.14.
During World War II, the Nazis bombed London using V-2 missiles. To
study the locations where missiles fell, the British divided the central area
of London into 576 half-kilometer squares.i The following is the distribution
of counts per square
Number of missiles   Number of   Expected (Poisson)
in a square          squares     number of squares
0                    229         227.5
1                    211         211.3
2                     93          98.1
3                     35          30.4
4                      7           7.1
5 and over             1           1.6
Total                576         576.0
Are the counts suggestive of Poisson distribution?

Solution. The total number of missiles is 1(211) + 2(93) + 3(35) + 4(7) +
5(1) = 535 and the average number per square is $\mu = 535/576 = 0.9288$. If the Poisson
distribution holds, then the expected number of squares with 0 missiles (out of 576) will
be
$$576 \times P(X = 0) = 576 \times \frac{e^{-0.9288}\,0.9288^0}{0!} = 227.5$$
In the same way, fill out the rest of the expected counts column. As you can
see, the data match the Poisson model very closely!
The Poisson distribution is often mentioned as a distribution of spatial randomness.
As a result, British command were able to conclude that the missiles
were unguided.
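
The expected counts in Example 3.14 can be regenerated from the observed counts alone. The Python sketch below is one way to do it, under the same simplification the solution uses (the "5 and over" square is counted as exactly 5 missiles when estimating µ); the variable and function names are illustrative assumptions.

```python
from math import exp, factorial

# Observed number of squares by missile count; key 5 stands for "5 and over"
# and is treated as exactly 5 when computing the mean, as in the solution.
observed = {0: 229, 1: 211, 2: 93, 3: 35, 4: 7, 5: 1}

n_squares = sum(observed.values())                               # 576
mu = sum(k * count for k, count in observed.items()) / n_squares  # 535 / 576

def pois_pmf(x, mu):
    """Poisson probability P(X = x) with parameter mu."""
    return exp(-mu) * mu**x / factorial(x)

# Expected counts for 0..4 missiles, then the remainder for "5 and over".
expected = [n_squares * pois_pmf(x, mu) for x in range(5)]
expected.append(n_squares - sum(expected))

for k, e in zip(observed, expected):
    print(k, observed[k], round(e, 1))
```

The printed values should be close to the "Expected" column of the table; the last row may differ slightly in the final digit because the table's entries are rounded so that the column sums to 576.0.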

Using the CDF

Knowledge of the CDF (cumulative distribution function) is useful for calculating
probabilities of the type $P(a \le X \le b)$. In fact,

$$P(a < X \le b) = F_X(b) - F_X(a) \qquad (3.2)$$

(you have to carefully watch strict and non-strict inequalities). We might
use CDF tables to calculate such probabilities. Nowadays, CDF’s of popular
distributions are built into various software packages.

Example 3.15.
During a laboratory experiment, the average number of radioactive particles
passing through a counter in one millisecond is 4. What is the probabil-
ity that 6 particles enter the counter in a given millisecond? What is the
probability of at least 6 particles?

Solution. Using the Poisson distribution with x = 6 and µ = 4, we get

$$pois(6; 4) = \frac{e^{-4}4^6}{6!} = 0.1042$$

Alternatively, using the CDF, $P(X = 6) = P(5 < X \le 6) = F(6) - F(5)$.
Using the Poisson table, P (X = 6) = 0.8893 − 0.7851 = 0.1042.
To find $P(X \ge 6)$, use $P(5 < X \le \infty) = F(\infty) - F(5) = 1 - 0.7851 = 0.2149$
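
In place of a printed Poisson table, the CDF values used in Example 3.15 can be computed by summing the PMF. A minimal Python sketch follows; `pois_pmf` and `pois_cdf` are helper names assumed for this illustration.

```python
from math import exp, factorial

def pois_pmf(x, mu):
    """Poisson probability P(X = x) with parameter mu."""
    return exp(-mu) * mu**x / factorial(x)

def pois_cdf(x, mu):
    """F(x) = P(X <= x), summing the PMF over 0, 1, ..., x."""
    return sum(pois_pmf(i, mu) for i in range(x + 1))

mu = 4
F5, F6 = pois_cdf(5, mu), pois_cdf(6, mu)

print(round(F5, 4), round(F6, 4))    # prints: 0.7851 0.8893
print(round(F6 - F5, 4))             # P(X = 6) via equation (3.2): prints 0.1042
print(round(1 - F5, 4))              # P(X >= 6): prints 0.2149
```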

Poisson approximation for Binomial

Poisson distribution was originally derived as a limit of Binomial when n →
∞ while p = µ/n, with fixed µ. We can use this fact to estimate Binomial
probabilities for large n and small p.

Example 3.16.
At a certain industrial facility, accidents occur infrequently. It is known that
the probability of an accident on any given day is 0.005 and the accidents
are independent of each other. For a given period of 400 days, what is the
probability that
(a) there will be an accident on only one day?
(b) there are at most two days with an accident?
Solution. Let X be a binomial random variable with n = 400 and p = 0.005.
Thus µ = np = (400)(0.005) = 2. Using the Poisson approximation,
a) $P(X = 1) = \frac{e^{-2}2^1}{1!} = 0.271$
b) $P(X \le 2) = P(X = 0) + P(X = 1) + P(X = 2) = \frac{e^{-2}2^0}{0!} + \frac{e^{-2}2^1}{1!} + \frac{e^{-2}2^2}{2!}$
$= 0.1353 + 0.2707 + 0.2707 = 0.6767$
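
To see how good the Poisson approximation is in Example 3.16, the exact Binomial probabilities can be computed alongside it. The sketch below is illustrative only; the helper names are assumptions, and the exact values come from the Binomial formula of Definition 3.7.

```python
from math import comb, exp, factorial

def binom_pmf(k, n, p):
    """Exact Binomial(n, p) probability P(X = k)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def pois_pmf(x, mu):
    """Poisson probability P(X = x) with parameter mu."""
    return exp(-mu) * mu**x / factorial(x)

n, p = 400, 0.005
mu = n * p   # matching Poisson parameter, mu = np = 2

# Compare the two PMFs term by term for k = 0, 1, 2.
for k in range(3):
    print(k, round(binom_pmf(k, n, p), 4), round(pois_pmf(k, mu), 4))

exact_le2 = sum(binom_pmf(k, n, p) for k in range(3))
approx_le2 = sum(pois_pmf(k, mu) for k in range(3))
print(round(exact_le2, 4), round(approx_le2, 4))
```

With n this large and p this small, the exact and approximate values should agree to about three decimal places.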

Exercises
3.39.
Number of cable breakages in a year is known to have Poisson distribution
with µ = 0.32.

a) Find the mean and standard deviation of the number of cable breakages
in a year.
b) According to Chebyshev’s inequality, what is the upper bound for
P (X ≥ 2)?
c) What is the exact probability P (X ≥ 2), based on Poisson model?

3.40.
At a barber shop, the expected number of customers per day is 8. What is the
probability that, on a given day, between 5 and 10 customers (inclusive) show
up? At least 8 customers?

3.41.
Poisson distribution can be derived by considering Binomial with n large and
p small. Compare computationally

a) Binomial with n = 20, p = 0.05: find P (X = 0), P (X = 1) and
P (X = 2).
b) Repeat for Binomial with n = 200, p = 0.005
c) Poisson with µ = np = 1 [Note that µ matches the expected value for
both (a) and (b).]
d) Compare the standard deviations for distributions in (a)-(c)

3.42.
Bolted assemblies on a hull of spacecraft may become loose with probability
0.005. There are 96 such assemblies on board. Assuming that assemblies
behave statistically independently, find the probability that there is at most
one loose assembly on board.

3.43.
An airline finds that 5% of the people making reservations on a certain flight
will not show up for the flight. If the airline sells 160 tickets for a flight with
155 seats, what is the probability that the flight ends up overbooked, i.e.
more than 155 people will show up? [Hint: Use the Poisson approximation
for the number of people who will not show up.]

3.44.
A region experiences, on average, 7.5 earthquakes (magnitude 5 or higher),
per year. Assuming Poisson distribution, find the probability that

a) between 5 and 9 earthquakes will happen in a year;

b) at least one earthquake will happen in a given month.
c) Find the mean and standard deviation of the number of earthquakes
per year.

3.45.
A plumbing company estimates to get the average of 60 service calls per
week. Assuming Poisson distribution, find the probability that, in a given
week

a) it gets exactly 60 service calls;

b) it gets between 55 and 59 service calls.
3.46.
A credit card company estimates that, on average, 0.18% of all its internet
transactions are fraudulent. Out of 1000 transactions,
a) find the mean and standard deviation of the number of fraudulent transactions,
b) approximate the probability that at least one transaction will be fraudulent,
c) approximate the probability that 3 or fewer transactions will be fraudulent.

3.8 Hypergeometric distribution

Consider the Hypergeometric experiment, that is, one that possesses the
following two properties:
a) A random sample of size n is selected without replacement from N
items.
b) Of the N items overall, k may be classified as successes and N − k are
classified as failures.
We will be interested, as before, in the number of successes X, but now
the probability of success is not constant (why?).
Theorem 3.7.
The PMF of the hypergeometric random variable X, the number of successes
in a random sample of size n selected from N items of which k are labeled
success and N − k labeled failure, is

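$$hg(x; N, n, k) = \frac{\binom{k}{x}\binom{N-k}{n-x}}{\binom{N}{n}}, \quad x = 0, 1, \ldots, \min(n, k)$$

The mean and variance of the hypergeometric distribution are $\mu = n\frac{k}{N}$ and
$\sigma^2 = n\,\frac{k}{N}\left(1 - \frac{k}{N}\right)\frac{N-n}{N-1}$.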

We have already seen such a random variable: see Example 3.2. Here are
some more examples.

Example 3.17.
Lots of 40 components each are called unacceptable if they contain as many
as 3 defectives or more. The procedure for sampling the lot is to select 5
components at random and to reject the lot if a defective is found. What is
the probability that exactly 1 defective is found in the sample if there are 3
defectives in the entire lot?
Solution. Using the above distribution with n = 5, N = 40, k = 3 and x = 1,
we can find the probability of obtaining one defective to be
$$hg(1; 40, 5, 3) = \frac{\binom{3}{1}\binom{37}{4}}{\binom{40}{5}} = 0.3011$$

Example 3.18.
A shipment of 20 tape recorders contains 5 that are defective. If 10 of them
are randomly chosen for inspection, what is the probability that 2 of the 10
will be defective?
Solution. Substituting x = 2, n = 10, k = 5, and N = 20 into the formula, we
get
$$P(X = 2) = \frac{\binom{5}{2}\binom{15}{8}}{\binom{20}{10}} = \frac{10(6435)}{184756} = 0.348$$
Note that, if we were sampling with replacement, we would have Binomial
distribution (why?) with p = k/N . In fact, if N is much larger than n, then
the difference between Binomial and Hypergeometric distribution becomes
small.
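
The Hypergeometric probabilities of Examples 3.17 and 3.18, and the comparison with the Binomial mentioned above, can be explored with the following Python sketch. The function names are assumptions made for this illustration; `hg_pmf` follows the argument order hg(x; N, n, k) of Theorem 3.7.

```python
from math import comb

def hg_pmf(x, N, n, k):
    """P(X = x): x successes in a sample of n drawn without replacement
    from N items, of which k are successes."""
    return comb(k, x) * comb(N - k, n - x) / comb(N, n)

def binom_pmf(x, n, p):
    """Binomial probability, used here for sampling with replacement."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)

print(round(hg_pmf(1, 40, 5, 3), 4))    # Example 3.17: prints 0.3011
print(round(hg_pmf(2, 20, 10, 5), 4))   # Example 3.18: prints 0.3483

# Mean of the Example 3.18 distribution, compared with n*k/N = 2.5.
hg_mean = sum(x * hg_pmf(x, 20, 10, 5) for x in range(0, 6))
print(round(hg_mean, 4))

# Binomial with p = k/N: a poor match when n is half of N,
# but close when N is much larger than n.
print(round(binom_pmf(2, 10, 5 / 20), 4))
print(round(hg_pmf(2, 2000, 10, 500), 4))
```

The last two lines illustrate the closing remark of Example 3.18: keeping k/N = 0.25 fixed and increasing N brings the Hypergeometric probability toward the Binomial one.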

Exercises
3.47.
Out of 10 construction facilities, 4 are in-state and 6 are out of state. Three
facilities are earmarked as test sites for a new technology. What is the prob-
ability that 2 out of 3 are out of state?

3.48.
A box contains 8 diodes, among them 3 are of new design. If 4 diodes are
picked randomly for a circuit, what is the probability that at least one is of
new design?

3.49.
There are 25 schools in a district, 10 of which are performing below standard.
Five schools are selected at random for an in-depth study. Find:

a) Probability that in your sample, no schools perform below standard.

b) Probability of selecting at least one that performs below standard.
c) The mean and variance for the number of the schools that perform
below standard.

3.50.
A small division, consisting of 6 women and 4 men, picks “employee of the
month” for 3 months in a row. Suppose that, in fact, a random person is
picked each month. Let X be the number of times a woman was picked.
Calculate the distribution of X (make a table with all possible values), for
the cases

a) No repetitions are allowed.

b) Repetitions are allowed (the same person can be picked again and
again).
c) Compare the results.

3.51.
A jar contains 50 red marbles and 30 blue marbles. Four marbles were
selected at random. Find the probability to obtain at least 3 red marbles, if
the sampling was

a) without replacement;
b) with replacement.
c) Compare the results.

3.9 Moment generating function

We saw in an earlier section that, if g(Y ) is a function of a random variable
Y with PMF p(y), then

\[ E[g(Y)] = \sum_{y} g(y)\,p(y) \]

The expected value of the exponential function $e^{tY}$ is especially important.

Definition 3.9. Moment generating function
The moment generating function (MGF) of a random variable Y is

\[ M(t) = E(e^{tY}) \]

The expected values of powers of random variables are often called moments.
For example, $E(Y)$ is the first moment of $Y$, and $E(Y^2)$ is the second
moment of $Y$. When $M(t)$ exists, it is differentiable in a neighborhood of
$t = 0$, and the derivatives may be taken inside the expectation. Thus,

\[ M'(t) = \frac{dM(t)}{dt} = \frac{d}{dt} E[e^{tY}] = E\left[\frac{d}{dt} e^{tY}\right] = E[Y e^{tY}] \]

Now if we set $t = 0$, we have $M'(0) = E(Y)$. Going on to the second derivative,

\[ M''(t) = E[Y^2 e^{tY}] \]

and hence $M''(0) = E(Y^2)$. In general, $M^{(k)}(0) = E(Y^k)$.
Theorem 3.8. Properties of MGF’s
a) Uniqueness: Let X and Y be two random variables with moment
generating functions $M_X(t)$ and $M_Y(t)$, respectively. If $M_X(t) = M_Y(t)$
for all values of t in some neighborhood of 0, then X and Y have the
same probability distribution.

b) $M_{X+b}(t) = e^{bt} M_X(t)$.

c) $M_{aX}(t) = M_X(at)$.

d) If $X_1, X_2, \ldots, X_n$ are independent random variables with moment
generating functions $M_1(t), M_2(t), \ldots, M_n(t)$, respectively, and
$Y = X_1 + X_2 + \cdots + X_n$, then

\[ M_Y(t) = M_1(t) \times M_2(t) \times \cdots \times M_n(t). \]


Example 3.19.
Evaluate the moment generating function for the geometric distribution.

Solution. From the definition,

\[ M(t) = \sum_{x=1}^{\infty} e^{tx}\, p\, q^{x-1} = \frac{p}{q} \sum_{x=1}^{\infty} (q e^{t})^{x} \]

On the right, we have an infinite geometric series with first term $qe^t$ and
ratio $qe^t$. For $qe^t < 1$ the series converges, and its sum is

\[ \sum_{x=1}^{\infty} (q e^{t})^{x} = \frac{q e^{t}}{1 - q e^{t}}. \]

We obtain

\[ M(t) = \frac{p\, e^{t}}{1 - q e^{t}} \]
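
As a sanity check on Example 3.19, the closed form can be compared with a
partial sum of the defining series, and the first derivative at t = 0 can be
approximated by a central difference. This is only a numerical sketch: the
values p = 0.3 and t = 0.1, the number of terms, the step size h and the
function names are all assumptions made for illustration.

```python
from math import exp

def geometric_mgf(t, p):
    """Closed form M(t) = p e^t / (1 - q e^t), valid while q e^t < 1."""
    q = 1 - p
    return p * exp(t) / (1 - q * exp(t))

def geometric_mgf_series(t, p, terms=500):
    """Partial sum of E(e^{tX}) = sum over x >= 1 of e^{tx} p q^(x-1)."""
    q = 1 - p
    return sum(exp(t * x) * p * q ** (x - 1) for x in range(1, terms + 1))

p, t, h = 0.3, 0.1, 1e-5          # assumed illustrative values

closed = geometric_mgf(t, p)
series = geometric_mgf_series(t, p)

# Central difference approximating M'(0), the first moment of X
mean_estimate = (geometric_mgf(h, p) - geometric_mgf(-h, p)) / (2 * h)

print(closed, series, mean_estimate)
```

The derivative estimate should be close to 1/p, the mean of the geometric
distribution; the exact calculation is the subject of Exercise 3.55.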

Exercises
3.52.

Find MX (t) for random variables X given by

a) p(x) = 1/3, x = −1, 0, 1

b) $p(x) = \left(\frac{1}{2}\right)^{x+1}$, x = 0, 1, 2, . . .

c) $p(x) = \frac{1}{8}\binom{3}{x}$, x = 0, 1, 2, 3
3.53.

a) Find the MGF of the Bernoulli distribution.

b) Apply the property (d) of Theorem 3.8 to calculate the MGF of the
Binomial distribution. [Hint: Binomial random variable Y with pa-
rameters n, p can be represented as Y = X1 + X2 + ... + Xn , where X’s
are independent and each has Bernoulli distribution with parameter p.]

3.54.
Apply the property (d) of Theorem 3.8 and Example 3.19 to calculate the
MGF of Negative Binomial distribution.

3.55.
Use the derivatives of MGF to calculate the mean and variance of geometric
distribution.

3.56.
Suppose that the MGF of a random variable X was found to be

\[ M(t) = \frac{1}{1 - t^2} \]

Using the properties of the MGF, find $E(X)$ and $E(X^2)$.

3.57. ?

a) Compute the MGF of Poisson distribution.

b) Using the property (d) of Theorem 3.8, describe the distribution of a

sum of two independent Poissons, one with mean λ1 and another with
mean λ2 .
Chapter 4

Continuous probability distributions

4.1 Continuous random variables and their probability distributions
All of the random variables discussed previously were discrete, meaning they
can take only a finite (or, at most, countable) number of values. However,
many of the random variables seen in practice have more than a countable
collection of possible values. For example, the proportions of impurities in
ore samples may run from 0.10 to 0.80. Such random variables can take any
value in an interval of real numbers. Since the random variables of this type
have a continuum of possible values, they are called continuous random
variables.
Even though the tools we will use to describe continuous RV’s are different
from the tools we use for discrete ones, practically there is not an enormous gulf
between them. For example, a physical measurement of, say, wavelength may be
continuous. However, when the measurements are recorded (either on paper or in
computer memory), they will take a finite number of values. The number of values
will increase if we keep more decimals in the recorded quantity. With rounding
we can discretize the problem, that is, reduce a continuous problem to a discrete
one, whose solution will hopefully be “close enough” to the continuous one. In
order to see whether we have discretized a problem in the right way, we still
need to know something about the nature of continuous random variables.


Definition 4.1. Density (PDF)

The function f (x) is a probability density function (PDF) for the
continuous random variable X, defined over the set of real numbers R, if

a) $f(x) \ge 0$ for all x

b) $\int_{-\infty}^{\infty} f(x)\,dx = 1$

c) $P(a \le X \le b) = \int_{a}^{b} f(x)\,dx$

What does this actually mean? Since continuous probability functions are defined
for an infinite number of points over a continuous interval, the probability at a
single point is always zero. Probabilities are measured over intervals, not single
points. That is, the area under the curve between two distinct points defines the
probability for that interval. This means that the height of the probability func-
tion can in fact be greater than one. The property that the integral must equal
one is equivalent to the property for discrete distributions that the sum of all the
probabilities must equal one.

Probability mass function (PMF) vs. Probability Density Function (PDF)
Discrete probability functions are referred to as probability mass functions and
continuous probability functions are referred to as probability density functions.
The term probability functions covers both discrete and continuous distributions.
When we are referring to probability functions in generic terms, we may use the
term probability density functions to mean both discrete and continuous proba-
bility functions.

Example 4.1.
Suppose that the error in the reaction temperature, in °C, for a controlled
laboratory experiment is a continuous random variable X having the density

\[ f(x) = \begin{cases} \dfrac{x^2}{3} & \text{for } -1 \le x \le 2 \\ 0 & \text{elsewhere} \end{cases} \]
(a) Verify condition (b) of Definition 4.1.
(b) Find P (0 < X < 1).
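Solution. (a)

\[ \int_{-\infty}^{\infty} f(x)\,dx = \int_{-1}^{2} \frac{x^2}{3}\,dx = \left.\frac{x^3}{9}\right|_{-1}^{2} = \frac{8}{9} + \frac{1}{9} = 1 \]

(b)

\[ P(0 < X < 1) = \int_{0}^{1} \frac{x^2}{3}\,dx = \left.\frac{x^3}{9}\right|_{0}^{1} = \frac{1}{9}. \]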

Definition 4.2. CDF

The cumulative distribution function (CDF) F(x) of a continuous
random variable X, with density function f(x), is

\[ F(x) = P(X \le x) = \int_{-\infty}^{x} f(t)\,dt \tag{4.1} \]

As an immediate consequence of equation (4.1) one can write these two
results:¹

(a) $P(a < X \le b) = F(b) - F(a)$

(b) $f(x) = F'(x)$, if the derivative exists.

Example 4.2.
For the density function of Example 4.1, find F (x) and use it to evaluate
P (0 < X < 1).
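Solution. For −1 < x < 2, we have

\[ F(x) = \int_{-\infty}^{x} f(t)\,dt = \int_{-1}^{x} \frac{t^2}{3}\,dt = \left.\frac{t^3}{9}\right|_{-1}^{x} = \frac{x^3 + 1}{9}. \]

Therefore,

\[ F(x) = \begin{cases} 0 & x \le -1 \\ \dfrac{x^3 + 1}{9} & \text{for } -1 < x < 2 \\ 1 & x \ge 2. \end{cases} \]

Now, $P(0 < X < 1) = F(1) - F(0) = \frac{2}{9} - \frac{1}{9} = \frac{1}{9}$, which agrees with the
result obtained using the density function in Example 4.1.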
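
The agreement between the density and the CDF can also be checked
numerically. The sketch below is an illustration only: it uses a simple
midpoint rule (the helper integrate and the number of steps are assumptions,
not part of the notes) to approximate the integrals of Examples 4.1 and 4.2.

```python
def f(x):
    """Density from Example 4.1."""
    return x**2 / 3 if -1 <= x <= 2 else 0.0

def F(x):
    """CDF from Example 4.2."""
    if x <= -1:
        return 0.0
    if x >= 2:
        return 1.0
    return (x**3 + 1) / 9

def integrate(g, a, b, steps=100_000):
    """Midpoint-rule approximation of the integral of g over [a, b]."""
    width = (b - a) / steps
    return width * sum(g(a + (i + 0.5) * width) for i in range(steps))

total_area = integrate(f, -1, 2)      # condition (b) of Definition 4.1
prob_numeric = integrate(f, 0, 1)     # P(0 < X < 1) from the density
prob_from_cdf = F(1) - F(0)           # the same probability from the CDF

print(total_area, prob_numeric, prob_from_cdf)
```

Both routes to P(0 < X < 1) should agree with 1/9 up to the small error of
the numerical integration.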

¹ Note that the same relation holds for discrete RV’s, but in the continuous
case P(a ≤ X ≤ b), P(a < X ≤ b) and P(a < X < b) are all the same. Why?

Example 4.3.
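The time X in months until failure of a certain product has the PDF

\[ f(x) = \begin{cases} \dfrac{3x^2}{64} \exp\left(-\dfrac{x^3}{64}\right) & \text{for } x > 0 \\ 0 & \text{elsewhere} \end{cases} \]

Find F(x) and evaluate P(2.84 < X < 5.28).

Solution. $F(x) = 1 - \exp\left(-\frac{x^3}{64}\right)$ for x > 0, and
P(2.84 ≤ X ≤ 5.28) = F(5.28) − F(2.84) = 0.5988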

Example 4.4.
For each of the following functions,
(i) find the constant c so that f (x) is a PDF of a random variable X, and
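(ii) find the distribution function F(x).

a) \[ f(x) = \begin{cases} \dfrac{x^3}{4} & \text{for } 0 < x < c \\ 0 & \text{elsewhere} \end{cases} \]

b) \[ f(x) = \begin{cases} \dfrac{3}{16}x^2 & \text{for } -c < x < c \\ 0 & \text{elsewhere} \end{cases} \]

c) \[ f(x) = \begin{cases} 4x^{c} & \text{for } 0 < x < 1 \\ 0 & \text{elsewhere} \end{cases} \]

d) \[ f(x) = \begin{cases} \dfrac{c}{x^{3/4}} & \text{for } 0 < x < 1 \\ 0 & \text{elsewhere} \end{cases} \]

Answers. a) c = 2 and $F(x) = \frac{x^4}{16}$, 0 < x < 2.

b) c = 2 and $F(x) = \frac{x^3}{16} + \frac{1}{2}$, −2 < x < 2.

c) c = 3 and $F(x) = x^4$, 0 < x < 1.

d) $c = \frac{1}{4}$ and $F(x) = x^{1/4}$, 0 < x < 1.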

Example 4.5.
The life length of batteries X (in hundreds of hours) has the density

\[ f(x) = \begin{cases} \dfrac{1}{2} e^{-x/2} & \text{for } x > 0 \\ 0 & \text{elsewhere} \end{cases} \]

Find the probability that the life of a battery of this type is less than 200 or
greater than 400 hours.

Solution. Let A denote the event that X is less than 2, and let B denote the
event that X is greater than 4. Then

\[ P(A \cup B) = P(A) + P(B) \ \text{(why?)} = \int_{0}^{2} \frac{1}{2} e^{-x/2}\,dx + \int_{4}^{\infty} \frac{1}{2} e^{-x/2}\,dx \]

\[ = (1 - e^{-1}) + e^{-2} = 1 - 0.368 + 0.135 = 0.767 \]

Example 4.6.
Refer to Example 4.5. Find the probability that a battery of this type lasts
more than 300 hours, given that it already has been in use for more than
200 hours.

Solution. We are interested in P(X > 3 | X > 2); and by the definition of
conditional probability,

\[ P(X > 3 \mid X > 2) = \frac{P(X > 3,\ X > 2)}{P(X > 2)} = \frac{P(X > 3)}{P(X > 2)} \]

because the intersection of the events (X > 3) and (X > 2) is the event
(X > 3). Now

\[ \frac{P(X > 3)}{P(X > 2)} = \frac{\int_{3}^{\infty} \frac{1}{2} e^{-x/2}\,dx}{\int_{2}^{\infty} \frac{1}{2} e^{-x/2}\,dx} = \frac{e^{-3/2}}{e^{-1}} = e^{-1/2} = 0.606 \]

Exercises
4.1.
The lifetime of a vacuum cleaner, in years, is described by

\[ f(x) = \begin{cases} x/4 & \text{for } 0 < x < 2 \\ (4 - x)/4 & \text{for } 2 \le x < 4 \\ 0 & \text{elsewhere} \end{cases} \]


Find the probability that the lifetime of a vacuum cleaner is

(a) less than 2.5 years
(b) between 1 and 3 years.

4.2.
The proportion of warehouse items claimed within 1 month is given by a
random variable X with density

\[ f(x) = \begin{cases} c(x + 1) & \text{for } 0 < x < 1 \\ 0 & \text{elsewhere} \end{cases} \]

(a) Find c to make this a legitimate density function.

(b) Find the probability that the proportion of items claimed will be between
0.5 and 0.7.

4.3.
The demand for an antibiotic from a local pharmacy is given by a random
variable X with CDF

\[ F(x) = \begin{cases} 1 - \dfrac{2500}{(x + 50)^2} & \text{for } x > 0 \\ 0 & \text{elsewhere} \end{cases} \]

a) Find the probability that the demand is at least 50 doses

b) Find the probability that the demand is between 40 and 80 doses
c) Find the density function of X.

4.4.
The waiting time, in minutes, between customers coming into a store is a
continuous random variable with CDF

\[ F(x) = \begin{cases} 0 & \text{for } x < 0 \\ 1 - \exp(-x/2) & \text{for } x \ge 0 \end{cases} \]

Find the probability of waiting less than 1.5 minutes between successive
customers
a) using the cumulative distribution of X;
b) using the probability density function of X (first, you have to find it).
4.5.
A continuous random variable X has a density function given by

\[ f(x) = \begin{cases} \dfrac{1}{5} & \text{for } -1 < x < 4 \\ 0 & \text{elsewhere} \end{cases} \]

a) Show that the area under the curve is equal to 1.

b) Find P (0 < X < 2).
c) Find c such that P (X < c) = 1/2. [This is called a median of the
distribution.]

4.2 Expected values of continuous random variables

The expected values of continuous RV’s are obtained using formulas simi-
lar to those of discrete ones. However, the summation is now replaced by
integration.
Definition 4.3. Expected value
The Expected Value of the continuous random variable X that has a
probability density function f (x) is given by

\[ \mu = E(X) = \int_{-\infty}^{\infty} x\, f(x)\,dx \]

Theorem 4.1. Expected value of a function

If X is a continuous random variable with probability density function f (x),
and if g(x) is any real-valued function of X, then

\[ E(g(X)) = \int_{-\infty}^{\infty} g(x)\, f(x)\,dx \]

Definition 4.4. Variance

Let X be a random variable with probability density function f (x) and mean
E X = µ. The variance of X is

\[ \sigma^2 = E[(X - \mu)^2] = \int_{-\infty}^{\infty} (x - \mu)^2 f(x)\,dx = E(X^2) - \mu^2 \]

Example 4.7.
Suppose that X has density function given by

\[ f(x) = \begin{cases} 3x^2 & \text{for } 0 \le x \le 1 \\ 0 & \text{elsewhere} \end{cases} \]

(a) Find the mean and variance of X
(b) Find mean and variance of u(X) = 4X + 3.
Solution. (a) From the above definitions,

\[ E(X) = \int_{-\infty}^{\infty} x f(x)\,dx = \int_{0}^{1} x(3x^2)\,dx = \int_{0}^{1} 3x^3\,dx = 3\left.\frac{x^4}{4}\right|_{0}^{1} = \frac{3}{4} = 0.75 \]

Now,

\[ E(X^2) = \int_{0}^{1} x^2(3x^2)\,dx = \int_{0}^{1} 3x^4\,dx = 3\left.\frac{x^5}{5}\right|_{0}^{1} = \frac{3}{5} = 0.6 \]

Hence, $\sigma^2 = E(X^2) - \mu^2 = 0.6 - (0.75)^2 = 0.6 - 0.5625 = 0.0375$

(b) From Theorem 3.2, we get

E (u(X)) = E (4X + 3) = 4E (X) + 3 = 4(0.75) + 3 = 6
and
V (u(X)) = V (4X + 3) = 16[V (X)] + 0 (why?) = 16(0.0375) = 0.6
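
The arithmetic of Example 4.7 can be reproduced exactly with rational
numbers. The sketch below relies on the observation that, for the density
3x^2 on [0, 1], the r-th moment is the integral of 3x^(r+2), which equals
3/(r + 3); the function name moment is an illustrative choice, not notation
from the notes.

```python
from fractions import Fraction

def moment(r):
    """E(X^r) for the density f(x) = 3x^2 on [0, 1].

    The integral of x^r * 3x^2 over [0, 1] equals 3 / (r + 3).
    """
    return Fraction(3, r + 3)

mean = moment(1)                    # E(X)
variance = moment(2) - mean**2      # E(X^2) - mu^2

# Linear transformation u(X) = 4X + 3
mean_u = 4 * mean + 3
var_u = 16 * variance               # the added constant does not affect variance

print(mean, variance, mean_u, var_u)
```

Working with Fraction avoids rounding, so the results can be compared
directly with the decimal values in the solution.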

Discrete and Continuous random variables

Quantity                                  Discrete                     Continuous
Probability function / density            $p(x) = P(X = x)$            $f(x) = \frac{d}{dx} P(X \le x) = F'(x)$;
                                                                       $P(X = x)$ is 0 for any x
CDF, $F(x) = P(X \le x)$                  Is a ladder function         Is continuous
$P(a < X \le b)$                          $= F(b) - F(a)$              $= F(b) - F(a)$
Mean, $E(X) = \mu_X$                      $\sum x\, p(x)$              $\int x f(x)\,dx$
Mean of a function, $E\, g(X)$            $\sum g(x)\, p(x)$           $\int g(x) f(x)\,dx$
Variance, $\sigma_X^2 = E(X^2) - \mu^2$   $\sum (x - \mu)^2 p(x)$      $\int (x - \mu)^2 f(x)\,dx$

Exercises
4.6.
For the density described in Exercise 4.2, find the mean and standard devi-
ation of X.
4.7.
For a random variable X with the density

\[ f(x) = \begin{cases} \dfrac{1}{2\sqrt{x}} & \text{for } 0 < x < 1 \\ 0 & \text{elsewhere} \end{cases} \]

a) Find the mean of X

b) Find V (X)
c) Find E (X 4 )
4.8.
For a random variable X with the density

\[ f(x) = \begin{cases} 2 - x & \text{for } 0 < x < c \\ 0 & \text{elsewhere} \end{cases} \]

a) Find c that makes f a legitimate density function

b) Find the mean of X
4.9.
For the density described in Exercise 4.1,
a) find the mean and standard deviation of X;
b) Use Chebyshev inequality to estimate the probability that X is between
1 and 3 years. Compare with the answer to Exercise 4.1.
4.10.
For a random variable X with the CDF

\[ F(x) = \begin{cases} x^3/8 & \text{for } 0 < x < 2 \\ 0, & x \le 0 \\ 1, & x \ge 2 \end{cases} \]


a) Find the mean of X

b) Find V (X)
4.11.
The waiting time X, in minutes, between successive customers coming into
a store is given by

\[ f(x) = \begin{cases} 0 & \text{for } x < 0 \\ 2\exp(-2x) & \text{for } x \ge 0 \end{cases} \]

a) Find the average time between customers

b) Find $E(e^{X})$

4.3 Uniform distribution

One of the simplest continuous distributions is the continuous uniform dis-
tribution. This distribution is characterized by a density function that is flat
and thus the probability is uniform in a finite interval, say [a, b]. The density
function of the continuous uniform random variable X on the interval [a, b]
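is

\[ f(x) = \begin{cases} \dfrac{1}{b - a} & \text{for } a < x < b \\ 0 & \text{elsewhere} \end{cases} \]

The CDF of a uniformly distributed X is given by

\[ F(x) = \int_{a}^{x} \frac{1}{b - a}\,dt = \frac{x - a}{b - a}, \qquad a \le x \le b \]

The mean and variance of the uniform distribution are

\[ \mu = \frac{b + a}{2} \quad\text{and}\quad \sigma^2 = \frac{(b - a)^2}{12}. \]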
Example 4.8.
Suppose that a large conference room for a certain company can be reserved
for no more than 4 hours. However, the use of the conference room is such
that both long and short conferences occur quite often. In fact, it can be
assumed that length X of a conference has a uniform distribution on the
interval [0, 4].

Figure 4.1: Left: uniform density, right: uniform CDF, a = 2, b = 5

a) What is the probability density function of X?

b) What is the probability that any given conference lasts at least 3 hours?
Solution. (a) The appropriate density function for the uniformly distributed
random variable X in this situation is

\[ f(x) = \begin{cases} \dfrac{1}{4} & \text{for } 0 < x < 4 \\ 0 & \text{elsewhere} \end{cases} \]

(b)

\[ P(X \ge 3) = \int_{3}^{4} \frac{1}{4}\,dx = \frac{1}{4}. \]

Example 4.9.
The failure of a circuit board interrupts work by a computing system until
a new board is delivered. Delivery time X is uniformly distributed over the
interval from one to five days. The cost C of this failure
and interruption consists of a fixed cost $C_0$ for the new part and a cost that
increases proportionally to $X^2$, so that

\[ C = C_0 + C_1 X^2 \]

(a) Find the probability that the delivery time is two or more days.
(b) Find the expected cost of a single failure, in terms of C0 and C1 .
Solution. a)

\[ f(x) = \begin{cases} \dfrac{1}{4} & \text{for } 1 \le x \le 5 \\ 0 & \text{elsewhere} \end{cases} \]

Thus,

\[ P(X \ge 2) = \int_{2}^{5} \frac{1}{4}\,dx = \frac{1}{4}(5 - 2) = \frac{3}{4} \]

b) We know that

\[ E(C) = C_0 + C_1 E(X^2) \]

so it remains for us to find $E(X^2)$. This value could be found directly from
the definition or by using the variance and the fact that $E(X^2) = V(X) + \mu^2$.
Using the latter approach, we find

\[ E(X^2) = \frac{(b - a)^2}{12} + \left(\frac{a + b}{2}\right)^2 = \frac{(5 - 1)^2}{12} + \left(\frac{1 + 5}{2}\right)^2 = \frac{31}{3} \]

Thus, $E(C) = C_0 + \frac{31}{3} C_1$.
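
The two routes to E(X^2) mentioned in the solution can be compared with
exact rational arithmetic. This is a hedged sketch: the function names are
illustrative, the interval [1, 5] is the one used in the solution, and the
cost coefficients passed to expected_cost are arbitrary sample values.

```python
from fractions import Fraction

def uniform_mean(a, b):
    return Fraction(a + b, 2)

def uniform_variance(a, b):
    return Fraction((b - a) ** 2, 12)

def second_moment_via_variance(a, b):
    """E(X^2) = V(X) + mu^2."""
    return uniform_variance(a, b) + uniform_mean(a, b) ** 2

def second_moment_direct(a, b):
    """E(X^2) as the integral of x^2 / (b - a) over [a, b]."""
    return Fraction(b**3 - a**3, 3 * (b - a))

def expected_cost(c0, c1, a=1, b=5):
    """E(C) = C0 + C1 * E(X^2) for delivery time uniform on [a, b]."""
    return c0 + c1 * second_moment_via_variance(a, b)

print(second_moment_via_variance(1, 5), second_moment_direct(1, 5))
print(expected_cost(100, 3))   # sample coefficients, chosen arbitrarily
```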

Exercises
4.12.
For a digital measuring device, rounding errors have Uniform distribution,
between −0.05 and 0.05 mm.

a) Find the probability that the rounding error is between −0.01 and
0.03mm
b) Find the expected value and the standard deviation of the rounding
error.
c) Calculate and plot the CDF of the rounding errors.

4.13.
The capacitances of “1 µF” (microfarad) capacitors are, in fact, Uniform[0.95, 1.05]
µF.

a) What proportion of capacitors are 0.98 µF or above?

b) What proportion of capacitors are within 0.03 of the nominal value?

4.14.
For X having a Uniform[−1, 4] distribution, find the mean and variance.
Then, use the formula for variance and a little algebra to find $E(X^2)$.
4.15.
Suppose the radii of spheres R have a uniform distribution on [2, 3]. Find
the mean volume ($V = \frac{4}{3}\pi R^3$). Find the mean surface area ($A = 4\pi R^2$).

4.4 Exponential distribution

Definition 4.5. Exponential distribution
The continuous random variable X has an exponential distribution, with
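parameter β, if its density function is given by

\[ f(x) = \begin{cases} \dfrac{1}{\beta} e^{-x/\beta} & \text{for } x > 0 \\ 0 & \text{elsewhere} \end{cases} \]

The mean and variance of the exponential distribution are

\[ \mu = \beta \quad\text{and}\quad \sigma^2 = \beta^2. \]

The distribution function for the exponential distribution has the simple
form:

\[ F(t) = P(X \le t) = \int_{0}^{t} \frac{1}{\beta} e^{-x/\beta}\,dx = 1 - e^{-t/\beta} \quad\text{for } t \ge 0 \]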

The failure rate function r(t) is defined as

\[ r(t) = \frac{f(t)}{1 - F(t)}, \qquad t > 0 \tag{4.2} \]

Suppose that X, with density f, is a lifetime of an item. Consider the proportion
of items currently alive (at the time t) that will fail in the next time interval
(t, t + ∆t], where ∆t is small. Thus, by the conditional probability formula,

\[ P\{\text{die in the next } (t, t + \Delta t] \mid \text{currently alive}\} = \frac{P\{X \in (t, t + \Delta t]\}}{P(X > t)} \approx \frac{f(t)\,\Delta t}{1 - F(t)} = r(t)\,\Delta t \]

so the rate at which the items fail is r(t).

For the exponential case,

\[ r(t) = \frac{f(t)}{1 - F(t)} = \frac{(1/\beta)\, e^{-t/\beta}}{e^{-t/\beta}} = \frac{1}{\beta} \]

Note that the failure rate $\lambda = \frac{1}{\beta}$ of an item with exponential lifetime does not
depend on the item’s age. This is known as the memoryless property of exponential
distribution. The exponential distribution is the only continuous distribution to
have a constant failure rate.
In reliability studies, the mean of a positive-valued distribution is also called
Mean Time To Fail or MTTF. So, we have exponential MTTF = β.
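
The constant failure rate can be seen numerically by evaluating definition
(4.2) at several ages. The sketch below is an illustration under assumed
values: β = 6 and the list of ages are arbitrary, and the function names
are not part of the notes.

```python
from math import exp

def exp_pdf(t, beta):
    """Exponential density with mean beta."""
    return exp(-t / beta) / beta if t > 0 else 0.0

def exp_cdf(t, beta):
    """Exponential CDF, F(t) = 1 - exp(-t / beta) for t > 0."""
    return 1 - exp(-t / beta) if t > 0 else 0.0

def failure_rate(t, beta):
    """r(t) = f(t) / (1 - F(t)), as in equation (4.2)."""
    return exp_pdf(t, beta) / (1 - exp_cdf(t, beta))

beta = 6.0                                  # assumed mean lifetime
ages = (0.5, 1, 5, 20)                      # assumed ages to inspect
rates = [failure_rate(t, beta) for t in ages]

print(rates)   # every entry should be close to 1 / beta
```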

Relationship between Poisson and exponential distributions

Suppose that certain events happen at the rate λ, so that the average (ex-
pected) number of events on the interval [0, t] is µ = λt. If we assume that
the number of events on [0, t] has Poisson distribution, then the probability
of no events up to time t is given by

\[ \text{pois}(0, \lambda t) = \frac{e^{-\lambda t} (\lambda t)^{0}}{0!} = e^{-\lambda t}. \]

Thus, if the time of first failure is denoted X, then

\[ P(X \le t) = 1 - P(X > t) = 1 - e^{-\lambda t} \]

We see that P(X ≤ t) = F(t), the CDF for X, has the form of an exponential
CDF. Here, $\lambda = \frac{1}{\beta}$ is again the failure rate. Upon differentiating, we see that
the density of X is given by

\[ f(t) = \frac{dF(t)}{dt} = \frac{d(1 - e^{-\lambda t})}{dt} = \lambda e^{-\lambda t} = \frac{1}{\beta} e^{-t/\beta} \]

and thus X has an exponential distribution.
Some natural phenomena have a constant failure rate (or occurrence rate)
property; for example, the arrival rate of cosmic ray alpha particles or Geiger
counter ticks. The exponential model works well for interarrival times (while
the Poisson distribution describes the total number of events in a given pe-
riod).

Example 4.10.
A downtime due to equipment failure is estimated to have Exponential dis-
tribution with the mean β = 6 hours. What is the probability that the next
downtime will last between 5 and 10 hours?
Solution. P (5 < X < 10) =
= F (10) − F (5) = 1 − exp(−10/6) − [1 − exp(−5/6)] = 0.2457
Example 4.11.
The number of calls to the call center has Poisson distribution with parameter
λ = 4 calls per minute. What is the probability that we have to wait more
than 20 seconds for the next call?
Solution. The waiting time between calls, X, has exponential distribution
with parameter $\beta = \frac{1}{\lambda} = \frac{1}{4}$ minutes. Since 20 seconds is $\frac{1}{3}$ of a minute,
$P(X > \frac{1}{3}) = 1 - F(\frac{1}{3}) = e^{-4/3} = 0.2636$
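
Examples 4.10 and 4.11 both reduce to evaluating the exponential CDF. A
minimal sketch follows; the helper name exp_cdf and the variable names are
illustrative assumptions, and time is measured in the units used in each
example (hours, then minutes).

```python
from math import exp

def exp_cdf(t, beta):
    """F(t) = 1 - exp(-t / beta) for t >= 0, and 0 otherwise."""
    return 1 - exp(-t / beta) if t >= 0 else 0.0

# Example 4.10: mean downtime beta = 6 hours, P(5 < X < 10)
p_downtime = exp_cdf(10, 6) - exp_cdf(5, 6)

# Example 4.11: rate of 4 calls per minute, so beta = 1/4 minute;
# 20 seconds equals 20/60 of a minute
rate = 4
p_wait = 1 - exp_cdf(20 / 60, 1 / rate)

print(round(p_downtime, 4), round(p_wait, 4))
```

Keeping the time unit consistent with the rate (minutes in Example 4.11) is
the step that is easiest to get wrong by hand.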

Exercises
4.16.
Prove another version of the memoryless property of the exponential distri-
bution,
P (X > t + s | X > t) = P (X > s).
Thus, an item that is t years old has the same probabilistic properties as a
brand-new item. [Hint: Use the definition of conditional probability and the
expression for exponential CDF.]
4.17.
The 1-hour carbon monoxide concentrations in a big city are found to have
an exponential distribution with a mean of 3.6 parts per million (ppm).
(a) Find the probability that a concentration will exceed 9 ppm.
(b) A traffic control policy is trying to reduce the average concentration.
Find the new target mean β so that the probability in part (a) will
equal 0.01
(c) The median of probability distribution is defined as solution m to the
equation (F is the CDF)
F (m) = 0.5
Find the median of the concentrations from part (a).

4.18.
Customers come to a barber shop as a Poisson process with the frequency of
3 per hour. Suppose Y1 is the time when first customer comes.
a) Find the expected value and the standard deviation of Y1
b) Find the probability that the store is idle for at least first 30 minutes
after opening.

4.5 The Gamma distribution

The Gamma distribution derives its name from the well-known gamma func-
tion, studied in many areas of mathematics. This distribution plays an im-
portant role in both queuing theory and reliability problems. Time between
arrivals at service facilities, and time to failure of component parts and elec-
trical systems, often are nicely modeled by the Gamma distribution.
Definition 4.6. Gamma function
The gamma function, for α > 0, is defined by

\[ \Gamma(\alpha) = \int_{0}^{\infty} x^{\alpha - 1} e^{-x}\,dx \]

$\Gamma(k) = (k - 1)!$ for integer k.

Definition 4.7. Gamma distribution

The continuous random variable X has a gamma distribution, with
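parameters α and β, if its density function is given by

\[ f(x) = \begin{cases} \dfrac{1}{\beta^{\alpha}\, \Gamma(\alpha)}\, x^{\alpha - 1} e^{-x/\beta} & \text{for } x > 0 \\ 0 & \text{elsewhere} \end{cases} \]

The mean and variance of the Gamma distribution are

\[ \mu = \alpha\beta \quad\text{and}\quad \sigma^2 = \alpha\beta^2. \]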

Note: When α = 1, the Gamma reduces to the exponential distribution.

Another well-known statistical distribution, chi-square, is also a special case
of the gamma.

Figure 4.2: Gamma densities f(x) for α = 0.5, 1, 2 and 5, all with β = 1

Uses of the Gamma Distribution Model

a) The gamma is a flexible life distribution model that may offer a good fit to
some sets of failure data, or other data where positivity is enforced.
b) The gamma does arise naturally as the time-to-failure distribution for a
system with standby exponentially distributed backups. If there are
n−1 standby backup units and the system and all backups have exponential
lifetimes with mean β, then the total lifetime has a Gamma distribution with
α = n. Note: when α is a positive integer, the Gamma is sometimes called
Erlang distribution. The Erlang distribution is used frequently in queuing
theory applications.
c) A simple and often used property of sums of identically distributed,
independent gamma random variables will be stated, but not proved, at this
point. Suppose that $X_1, X_2, \ldots, X_n$ represent independent gamma random
variables with parameters α and β, as just used. If $Y = \sum_{i=1}^{n} X_i$ then Y
also has a gamma distribution with parameters nα and β. Thus, we see that
$E(Y) = n\alpha\beta$, and $V(Y) = n\alpha\beta^2$.
Example 4.12.
The total monthly rainfall (in inches) for a particular region can be modeled
using Gamma distribution with α = 2 and β = 1.6. Find the mean and
variance of the monthly rainfall.

Solution. $E(X) = \alpha\beta = 2(1.6) = 3.2$, and variance $V(X) = \alpha\beta^2 = 2(1.6^2) = 5.12$

4.5.1 Poisson process

Following our discussion about Exponential distribution, the latter is a good
model for the waiting times between randomly occurring events. Adding
independent Exponential RV’s will result in the Poisson process.

Figure 4.3: Events of a Poisson process (points marked on a time axis)

The Poisson process was first studied² in the 1900s when modeling the
observation times of radioactive particles recorded by a Geiger counter. It consists
of the consecutive event times $Y_1, Y_2, \ldots$ such that the interarrival times
$X_1 = Y_1$, $X_2 = Y_2 - Y_1, \ldots$ have independent Exponential distributions. (The
observations start at the time t = 0.)
From the property (c) above, the kth event time has Gamma distribution
with α = k. As in Section 4.4, the number of particles appearing
during [0, t) has Poisson distribution with the mean µ = λt, where the rate
λ = 1/β.
The same way, the number of events on any given interval of time, say,
(t1 , t2 ] follows the Poisson distribution with the mean µ = λ(t2 − t1 ). Thus,
the expected number of events to be observed equals the intensity times the
length of the observation period.
Note the units: if the rate λ is measured in events per hour (say), that
is, the unit is hours$^{-1}$, then the mean time between events is measured in
hours.
The Gamma CDF (for integer α) can be derived using this relationship.
Suppose Yk is the time to wait for kth event. Then it is Gamma (α = k,
β) random variable. On one hand, the probability that this event happens
before time t is the CDF F (t). On the other hand, this will happen if and
only if there is a total of at least k events on the interval [0, t]:
² Not by Poisson!

\[ F(t) = P(Y_k \le t) = P(N(t) \ge k) \tag{4.3} \]

Figure 4.4: Illustration of the principle “$Y_k \le t$ if and only if $N(t) \ge k$”, here
k = 2 (time axis showing $0 < Y_1 < Y_2 < t < Y_3$).

Here, N(t) is the number of events on the [0, t] interval. According to the
Poisson process model, N(t) has Poisson distribution with the mean µ = λt = t/β. Thus,

\[ P(Y_k \le t) = P(N \ge k) = 1 - P(N < k) = 1 - \sum_{i=0}^{k-1} e^{-t/\beta}\, \frac{(t/\beta)^{i}}{i!} \tag{4.4} \]

This is an interesting link between continuous and discrete distributions! In

particular, when k = 1, we get back the familiar exponential CDF, F (t) =
1 − exp(−t/β).

Example 4.13.
For the situation in Example 4.12, find the probability that the total monthly
rainfall exceeds 5 inches.
Solution. P(Y > 5) = 1 − F(5) = 1 − (1 − P(N < k)) = P(N < k) where
k = α = 2. Equation 4.4 yields $P(Y > 5) = e^{-5/1.6}(1 + 5/1.6) = 0.181$
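
Equation 4.4 translates directly into a short computation for integer α.
The following is a sketch under that restriction; the function names are
illustrative, and the second check (k = 1 with t = 3, β = 2) uses assumed
values only to confirm that the formula reduces to the exponential CDF.

```python
from math import exp, factorial

def poisson_cdf(k, mu):
    """P(N <= k) for N having Poisson distribution with mean mu."""
    return sum(exp(-mu) * mu**i / factorial(i) for i in range(k + 1))

def gamma_cdf_integer_shape(t, k, beta):
    """P(Y_k <= t) = 1 - P(N < k), Equation 4.4, for integer k >= 1."""
    if t <= 0:
        return 0.0
    return 1 - poisson_cdf(k - 1, t / beta)

# Example 4.13: alpha = 2, beta = 1.6, probability that rainfall exceeds 5
p_rain = 1 - gamma_cdf_integer_shape(5, 2, 1.6)

# With k = 1 the formula should match the exponential CDF
exp_case = gamma_cdf_integer_shape(3, 1, 2)

print(round(p_rain, 3), exp_case, 1 - exp(-3 / 2))
```

For non-integer α this Poisson sum does not apply, and the Gamma CDF has to
be evaluated by other means.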

Exercises
4.19.
Customers come to a barber shop with the frequency of 3 per hour. Suppose
Y4 is the time when 4th customer has come.

a) Find the expected value and the standard deviation of Y4

b) Find the probability that the 4th customer comes within the 1st hour.

4.20.
Differentiate Equation 4.4 for k = 2 to show that you indeed will get the
Gamma density function with α = 2.

4.21.
A truck has 2 spare tires. Under intense driving conditions, tire blowouts are
determined to approximately follow a Poisson process with the intensity of
1.2 per 100 miles. Let X be the total distance the truck can go with 2 spare
tires.

a) Find the expected value and the standard deviation of X

b) Find the probability that the truck can go at least 200 miles

4.6 Normal distribution

The most widely used of all the continuous probability distributions is the
normal distribution (also known as Gaussian). It serves as a popular model
for measurement errors, particle displacements under Brownian motion, stock
market fluctuations, human intelligence and many other things. It is also
used as an approximation for Binomial (for large n) and Gamma (for large
α) distributions.
The normal density follows the well-known symmetric bell-shaped curve.
The curve is centered at the mean value µ and its spread is, of course,
measured by the standard deviation σ. These two parameters, µ and $\sigma^2$,
completely determine the shape and center of the normal density function.

Definition 4.8.

The normal random variable X has the PDF

\[ f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left[\frac{-(x - \mu)^2}{2\sigma^2}\right], \qquad \text{for } -\infty < x < \infty \]

It will be denoted as $X \sim N(\mu, \sigma^2)$

The normal random variable Z with µ = 0 and σ = 1 is said to have the
standard normal distribution. Direct integration would show that E (Z) = 0
and V (Z) = 1.

Figure 4.5: Normal densities f(x) for (µ = −1, σ = 1), (µ = 0, σ = 1),
(µ = 2, σ = 3) and (µ = 5, σ = 0.5)

Usefulness of Z
We are able to transform the observations of any normal random variable X
to a new set of observations of a standard normal random variable Z. This
can be done by means of the transformation

\[ Z = \frac{X - \mu}{\sigma}. \]

The values of the CDF of Z can be obtained from Table A. Namely,

\[ F(z) = \begin{cases} 0.5 + \text{TA}(z), & z \ge 0 \\ 0.5 - \text{TA}(|z|), & z < 0 \end{cases} \]

where TA(z) = P (0 < Z < z) denotes table area of z. The second equation
follows from the symmetry of the Z distribution.
Table A allows us to calculate probabilities and percentiles associated
with normal random variables, since the normal density has no elementary
antiderivative and so cannot be integrated in closed form.

Example 4.14.
If Z denotes a standard normal variable, find
(a) P (Z ≤ 1) (b) P (Z > 1) (c) P (Z < −1.5) (d) P (−1.5 ≤ Z ≤ 0.5).
(e) Find a number, say z0 , such that P (0 ≤ Z ≤ z0 ) = 0.49
Solution. This example provides practice in using Normal probability Table.
We see that
a) P (Z ≤ 1) = P (Z ≤ 0) + P (0 ≤ Z ≤ 1) = 0.5 + 0.3413 = 0.8413.
b) P (Z > 1) = 0.5 − P (0 ≤ Z ≤ 1) = 0.5 − 0.3413 = 0.1587
c) P (Z < −1.5) = P (Z > 1.5) = 0.5 − P (0 ≤ Z ≤ 1.5) = 0.5 − 0.4332 =
0.0668.
d) P (−1.5 ≤ Z ≤ 0.5) = P (−1.5 ≤ Z ≤ 0) + P (0 ≤ Z ≤ 0.5)
= P (0 ≤ Z ≤ 1.5) + P (0 ≤ Z ≤ 0.5) = 0.4332 + 0.1915 = 0.6247.
e) To find the value of z0 we must look for the given probability of 0.49
on the area side of Normal probability Table. The closest we can come
is at 0.4901, which corresponds to a Z value of 2.33. Hence z0 = 2.33.
Figure 4.6: Splitting a normal area into two Table Areas

Example 4.15.
For $X \sim N(50, 10^2)$, find the probability that X is between 45 and 62.
Solution. The Z-values corresponding to X = 45 and X = 62 are

\[ Z_1 = \frac{45 - 50}{10} = -0.5 \quad\text{and}\quad Z_2 = \frac{62 - 50}{10} = 1.2. \]

Therefore, P(45 ≤ X ≤ 62) = P(−0.5 ≤ Z ≤ 1.2) = TA(1.2) + TA(0.5) =
0.3849 + 0.1915 = 0.5764
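
Table lookups such as those in Examples 4.14 and 4.15 can be cross-checked
in software. The sketch below is not the method used in the notes: it
assumes the standard identity that the standard normal CDF equals
0.5(1 + erf(z/√2)), and the function names are illustrative.

```python
from math import erf, sqrt

def std_normal_cdf(z):
    """P(Z <= z) for a standard normal Z, via the error function."""
    return 0.5 * (1 + erf(z / sqrt(2)))

def table_area(z):
    """TA(z) = P(0 < Z < z) for z >= 0, the quantity tabulated in Table A."""
    return std_normal_cdf(z) - 0.5

def normal_prob(a, b, mu, sigma):
    """P(a <= X <= b) for X ~ N(mu, sigma^2), by standardizing both ends."""
    return std_normal_cdf((b - mu) / sigma) - std_normal_cdf((a - mu) / sigma)

print(round(std_normal_cdf(1), 4))             # Example 4.14 (a)
print(round(table_area(1.5), 4))               # a table area used in 4.14 (c)
print(round(normal_prob(45, 62, 50, 10), 4))   # Example 4.15
```

Small differences in the last decimal place relative to the table are
possible, because table entries are rounded to four digits before they are
added.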

Table A: standard normal probabilities

Entries are TA(z) = P(0 < Z < z), the area under the standard normal density between 0 and z.

z .00 .01 .02 .03 .04 .05 .06 .07 .08 .09
.0 .0000 .0040 .0080 .0120 .0160 .0199 .0239 .0279 .0319 .0359
.1 .0398 .0438 .0478 .0517 .0557 .0596 .0636 .0675 .0714 .0753
.2 .0793 .0832 .0871 .0910 .0948 .0987 .1026 .1064 .1103 .1141
.3 .1179 .1217 .1255 .1293 .1331 .1368 .1406 .1443 .1480 .1517
.4 .1554 .1591 .1628 .1664 .1700 .1736 .1772 .1808 .1844 .1879
.5 .1915 .1950 .1985 .2019 .2054 .2088 .2123 .2157 .2190 .2224
.6 .2257 .2291 .2324 .2357 .2389 .2422 .2454 .2486 .2517 .2549
.7 .2580 .2611 .2642 .2673 .2704 .2734 .2764 .2794 .2823 .2852
.8 .2881 .2910 .2939 .2967 .2995 .3023 .3051 .3078 .3106 .3133
.9 .3159 .3186 .3212 .3238 .3264 .3289 .3315 .3340 .3365 .3389
1.0 .3413 .3438 .3461 .3485 .3508 .3531 .3554 .3577 .3599 .3621
1.1 .3643 .3665 .3686 .3708 .3729 .3749 .3770 .3790 .3810 .3830
1.2 .3849 .3869 .3888 .3907 .3925 .3944 .3962 .3980 .3997 .4015
1.3 .4032 .4049 .4066 .4082 .4099 .4115 .4131 .4147 .4162 .4177
1.4 .4192 .4207 .4222 .4236 .4251 .4265 .4279 .4292 .4306 .4319
1.5 .4332 .4345 .4357 .4370 .4382 .4394 .4406 .4418 .4429 .4441
1.6 .4452 .4463 .4474 .4484 .4495 .4505 .4515 .4525 .4535 .4545
1.7 .4554 .4564 .4573 .4582 .4591 .4599 .4608 .4616 .4625 .4633
1.8 .4641 .4649 .4656 .4664 .4671 .4678 .4686 .4693 .4699 .4706
1.9 .4713 .4719 .4726 .4732 .4738 .4744 .4750 .4756 .4761 .4767
2.0 .4772 .4778 .4783 .4788 .4793 .4798 .4803 .4808 .4812 .4817
2.1 .4821 .4826 .4830 .4834 .4838 .4842 .4846 .4850 .4854 .4857
2.2 .4861 .4864 .4868 .4871 .4875 .4878 .4881 .4884 .4887 .4890
2.3 .4893 .4896 .4898 .4901 .4904 .4906 .4909 .4911 .4913 .4916
2.4 .4918 .4920 .4922 .4925 .4927 .4929 .4931 .4932 .4934 .4936
2.5 .4938 .4940 .4941 .4943 .4945 .4946 .4948 .4949 .4951 .4952
2.6 .4953 .4955 .4956 .4957 .4959 .4960 .4961 .4962 .4963 .4964
2.7 .4965 .4966 .4967 .4968 .4969 .4970 .4971 .4972 .4973 .4974
2.8 .4974 .4975 .4976 .4977 .4977 .4978 .4979 .4979 .4980 .4981
2.9 .4981 .4982 .4982 .4983 .4984 .4984 .4985 .4985 .4986 .4986
3.0 .4987 .4987 .4987 .4988 .4988 .4989 .4989 .4989 .4990 .4990

Example 4.16.
Given a random variable X having a normal distribution with µ = 300 and
σ = 50, find the probability that X is greater than 362.
Solution. To find P (X > 362), we need to evaluate the area under the normal
curve to the right of x = 362. This can be done by transforming x = 362 to
the corresponding Z-value. We get
\[ z = \frac{x - \mu}{\sigma} = \frac{362 - 300}{50} = 1.24 \]
Hence P (X > 362) = P (Z > 1.24) = P (Z < −1.24) = 0.5 − TA(1.24) =
0.1075.

Example 4.17.
The diameter X of a manufactured shaft has a normal distribution with parameters
µ = 1.005 cm and σ = 0.01 cm. The shaft will meet specifications if its diameter is
between 0.98 and 1.02 cm. What percentage of shafts will not meet specifications?
Solution.
\[ 1 - P(0.98 < X < 1.02) = 1 - P\left(\frac{0.98 - 1.005}{0.01} < Z < \frac{1.02 - 1.005}{0.01}\right) = 1 - P(-2.5 < Z < 1.5) \]
\[ = 1 - (0.4938 + 0.4332) = 0.0730, \]
that is, about 7.3% of shafts will not meet specifications.

4.6.1 Using Normal tables in reverse

Definition 4.9. Percentile

The pth percentile of a random variable X is the point q that leaves the area
p/100 (that is, p%) to its left. That is, q is the solution of the equation

P(X ≤ q) = p/100

For example, the median (introduced in Exercise 4.17) is the 50th percentile
of a probability distribution.
We will discuss how to find percentiles of the normal distribution. The previous
two examples were solved by going first from a value of x to a z-value
and then computing the desired area. In the next example we reverse the
process and begin with a known area, find the z-value, and then determine
x by rearranging the equation z = (x − µ)/σ to give

x = µ + σz

Using Normal table calculations, it is straightforward to show the following.
The famous 68% - 95% rule
For a Normal population, approximately 68% of all values lie in the interval
[µ − σ, µ + σ], and approximately 95% lie in [µ − 2σ, µ + 2σ].
In addition, approximately 99.7% of the population lies in [µ − 3σ, µ + 3σ].
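
The three percentages in the rule can be reproduced without a table. This is a small sketch, not taken from the notes: it assumes the identity P(|Z| ≤ k) = erf(k/√2), and the function name `within_k_sigma` is hypothetical.

```python
from math import erf, sqrt

def within_k_sigma(k: float) -> float:
    """P(mu - k*sigma <= X <= mu + k*sigma) for a normal X.

    After standardizing, this is P(-k <= Z <= k) = erf(k / sqrt(2)),
    so it does not depend on mu or sigma.
    """
    return erf(k / sqrt(2.0))

for k in (1, 2, 3):
    print(k, round(within_k_sigma(k), 4))
# 1 0.6827
# 2 0.9545
# 3 0.9973
```

Note that two standard deviations cover slightly more than 95%; the exact 95% interval uses z = 1.96.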

Example 4.18.
Using the situation in Example 4.17, a diameter X of a shaft had µ =
1.005, σ = 0.01. Give an interval that would contain 95% of all diameters.
Solution. The interval is µ ± 2σ = 1.005 ± 2(0.01), that is, from 0.985 to
1.025.

Example 4.19.
The SAT Math exam is scaled to have the average of 500 points, and the
standard deviation of 100 points. What is the cutoff score for top 10% of the
SAT takers?
Solution. In this example we begin with a known area, find the z-value, and
then find x from the formula x = µ + σz. The 90th percentile corresponds to
the 90% area under the normal curve to the left of x. Thus, we also require
a z-value that leaves 0.9 area to the left and hence, the Table Area of 0.4.
From Table A, P (0 < Z < 1.28) = 0.3997. Hence

x = 500 + 100(1.28) = 628

Therefore, the cutoff for the top 10% is 628 points.
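
Using the table in reverse amounts to evaluating the inverse CDF. As a hedged illustration (assuming Python 3.8 or later, where the standard library provides `statistics.NormalDist`), the cutoff in Example 4.19 can be checked as follows; the variable names are chosen here for illustration.

```python
from statistics import NormalDist

sat = NormalDist(mu=500, sigma=100)    # SAT Math model from Example 4.19
z90 = NormalDist().inv_cdf(0.90)       # z-value with 90% of the area to its left
cutoff = sat.inv_cdf(0.90)             # equivalent to 500 + 100 * z90

print(round(z90, 2), round(cutoff))
```

The unrounded z-value is slightly above 1.28, which is why the software answer can differ from the table answer in the decimals.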

Example 4.20.
Let X = monthly sick leave time have normal distribution with parameters
µ = 200 hours and σ = 20 hours.

a) What percentage of months will have sick leave below 150 hours?
b) What amount of time x0 should be budgeted for sick leave so that the
budget will not be exceeded with 80% probability?

Solution. (a) z = (150 − 200)/20 = −2.5, so P(X < 150) = P(Z < −2.5) = 0.5 − 0.4938 = 0.0062, or about 0.62% of months.

(b) P (X < x0 ) = P (Z < z0 ) = 0.8, which leaves a table area for z0 of 0.3.
Thus, z0 = 0.84 and hence x0 = 200 + 20(0.84) = 216.8 hours

Quantile-Quantile (Q-Q) plots

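If X has a Normal(µ, σ²) distribution, then

X = µ + σZ

and there is a perfect linear relationship between X and Z. A Q-Q plot uses this
fact as a graphical method for checking normality: the ordered sample values are
plotted against the corresponding standard Normal quantiles, and for Normal data
the points fall close to a straight line.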
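
The construction of the points on a Q-Q plot can be sketched as follows. This is an illustrative assumption rather than the notes' own procedure: the plotting positions (i − 0.5)/n are one common convention among several, the data are simulated (hypothetical), and `qq_points` is a name introduced here.

```python
from statistics import NormalDist

def qq_points(sample):
    """Pair each ordered observation with the standard normal quantile
    at plotting position (i - 0.5)/n, for i = 1, ..., n."""
    n = len(sample)
    std_normal = NormalDist()
    ordered = sorted(sample)
    return [(std_normal.inv_cdf((i - 0.5) / n), x)
            for i, x in enumerate(ordered, start=1)]

# Hypothetical data: 200 simulated values from a Normal(200, 20) population.
data = NormalDist(mu=200, sigma=20).samples(200, seed=1)
points = qq_points(data)

# Least-squares line through the Q-Q points. For Normal data the intercept
# is close to mu and the slope is close to sigma, since X = mu + sigma * Z.
zs = [z for z, _ in points]
xs = [x for _, x in points]
z_bar = sum(zs) / len(zs)
x_bar = sum(xs) / len(xs)
slope = (sum((z - z_bar) * (x - x_bar) for z, x in points)
         / sum((z - z_bar) ** 2 for z in zs))
intercept = x_bar - slope * z_bar
print(round(intercept, 1), round(slope, 1))
```

Plotting `points` with any graphics package would give the Q-Q plot itself; strong curvature away from the fitted line suggests non-normal data.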

4.6.2 Normal approximation to Binomial

As another example of using the Normal distribution, consider the Normal
approximation to the Binomial distribution. It will also be used when
discussing sample proportions.

Theorem 4.2. Normal approximation to Binomial

If X is a Binomial random variable with mean µ = np and variance
σ² = npq, where q = 1 − p, then the random variables
\[ Z_n = \frac{X - np}{\sqrt{npq}} \]
approach the standard Normal as n gets large.

We already know one approximation to the Binomial (by Poisson). It mostly
applies when the Binomial distribution in question has a skewed shape, that
is, when p is close to 0 or 1. When the shape of the Binomial distribution is
close to symmetric, the Normal approximation will work better. In practice,
we will require that both np ≥ 5 and n(1 − p) ≥ 5.

Example 4.21.
Suppose X is Binomial with parameters n = 15 and p = 0.4. Then µ = np =
(15)(0.4) = 6 and σ² = npq = 15(0.4)(0.6) = 3.6. Suppose we are interested
in the probability that X assumes a value from 7 to 9 inclusive, that is,
P(7 ≤ X ≤ 9). The exact probability is given by
\[ P(7 \le X \le 9) = \sum_{x=7}^{9} \mathrm{bin}(x; 15, 0.4) = 0.1771 + 0.1181 + 0.0612 = 0.3564 \]

For the Normal approximation we find the area between x1 = 6.5 and x2 = 9.5
using the z-values
\[ z_1 = \frac{x_1 - np}{\sqrt{npq}} = \frac{x_1 - \mu}{\sigma} = \frac{6.5 - 6}{1.897} = 0.26 \]
and
\[ z_2 = \frac{9.5 - 6}{1.897} = 1.85 \]
Adding or removing 0.5 is called continuity correction. It arises when we try
to approximate a distribution with integer values (here, Binomial) through
the use of a continuous distribution (here, Normal). Shown in Fig.4.7, the
sum over the discrete set {7 ≤ X ≤ 9} is approximated by the integral of
the continuous density from 6.5 to 9.5.

Figure 4.7: continuity correction


Now,

P(7 ≤ X ≤ 9) ≈ P(0.26 < Z < 1.85) = 0.4678 − 0.1026 = 0.3652

Therefore, the normal approximation provides a value that agrees closely
with the exact value of 0.3564. The degree of accuracy depends on both n
and p. The approximation is very good when n is large and p is not too
near 0 or 1.
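
The comparison in Example 4.21 can be reproduced in a few lines. This is a sketch under stated assumptions, not the notes' own code: `binom_pmf` and `phi` are helper names introduced here, and the z-values are not rounded to two decimals, so the approximate answer differs slightly from the table-based 0.3652.

```python
from math import comb, erf, sqrt

def binom_pmf(x: int, n: int, p: float) -> float:
    """Binomial probability P(X = x) for n trials with success probability p."""
    return comb(n, x) * p**x * (1 - p)**(n - x)

def phi(z: float) -> float:
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

n, p = 15, 0.4
mu = n * p
sigma = sqrt(n * p * (1 - p))

# Exact probability P(7 <= X <= 9).
exact = sum(binom_pmf(x, n, p) for x in range(7, 10))

# Normal approximation with continuity correction: area from 6.5 to 9.5.
approx = phi((9.5 - mu) / sigma) - phi((6.5 - mu) / sigma)

print(round(exact, 4), round(approx, 4))
```

Dropping the ±0.5 correction (integrating from 7 to 9) gives a noticeably worse approximation for an n this small, which is the point of Figure 4.7.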
Example 4.22.
The probability that a patient recovers from a rare blood disease is 0.4. If
100 people are known to have contracted this disease, what is the probability
that at most 30 survive?
Solution. Let the binomial variable X represent the number of patients that
survive. Since n = 100 and p = 0.4, we have

µ = np = (100)(0.4) = 40

and
σ² = npq = (100)(0.4)(0.6) = 24,
so σ = √24 = 4.899. To obtain the desired probability, we compute the z-value
for x = 30.5 (using the continuity correction). Thus,
\[ z = \frac{x - \mu}{\sigma} = \frac{30.5 - 40}{4.899} = -1.94, \]
and the probability that at most 30 of the 100 patients survive is P(X ≤
30) ≈ P(Z < −1.94) = 0.5 − 0.4738 = 0.0262.
Example 4.23.
A fair coin (p = 0.5) is tossed 10,000 times, and the number of Heads X is
recorded. What are the values that contain X with 95% certainty?
Solution.
We have µ = np = 10,000(0.5) = 5,000 and σ = √(10,000(0.5)(1 − 0.5)) =
50. We need to find x1 and x2 so that P(x1 ≤ X ≤ x2) = 0.95. Since the mean of
X is large, we will neglect the continuity correction.
Since we will be working with Normal approximation, let’s find z1 and z2
such that
P (z1 ≤ Z ≤ z2 ) = 0.95

The solution is not unique, but we can choose the values of z1 and z2 that are
symmetric about 0. This will mean finding z such that P(0 < Z < z) = 0.475.
Using Normal tables “in reverse” we will get z = 1.96. Thus, P(−1.96 <
Z < 1.96) = 0.95.
Next, transforming back into X, use the formula x = µ + σz, so

x1 = 5000 + 50(−1.96) = 4902 and x2 = 5000 + 50(1.96) = 5098

Thus, with a large likelihood, our Heads count will be within 100 of the
expected value of 5,000.
This is an example of the famous “2 sigma” rule.

Exercises
4.22.
Given a standard normal distribution Z, find

a) P (0 < Z < 1.28)

b) P (−2.14 < Z < 0)
c) P (Z > −1.28)
d) P (−2.3 < Z < −0.75)
e) the value z0 such that P (Z > z0 ) = 0.25

4.23.
Given a normal distribution with µ = 30 and σ = 6, find

a) the normal curve area to the right of x = 17

b) the normal curve area to the left of x = 22
c) the normal curve area between x = 32 and x = 41
d) the value of x that has 80% of the normal curve area to the left
e) the two values of x that contain the middle 75% of the normal curve
area.

4.24.
Given the normally distributed variable X with mean 18 and standard devi-
ation 2.5, find

a) P (X < 15)
b) the value of k such that P (X < k) = 0.2236
c) the value of k such that P (X > k) = 0.1814
d) P (17 < X < 21).

4.25.
A soft drink machine is regulated so that it discharges an average of 200
milliliters (ml) per cup. If the amount of drink is normally distributed with
a standard deviation equal to 15 ml,

a) what fraction of the cups will contain more than 224 ml?
b) what is the probability that a cup contains between 191 and 209 milliliters?
c) how many cups will probably overflow if 230 ml cups are used for the
next 1000 drinks?
d) below what value do we get the smallest 25% of the drinks?

4.26.
A company pays its employees an average wage of $15.90 an hour with a
standard deviation of $1.50. If the wages were approximately normally dis-
tributed and paid to the nearest cent,

a) What percentage of workers receive wages between $13.75 and $16.22
an hour?
b) What is the cutoff value for highest paid 5% of the employees?

4.27.
A solar panel produces, on average, 34.5 kWh (kilowatt-hours) per month,
with standard deviation of 2.5 kWh.

a) Find the probability that the panel output will be between 35 and 38
kWh in a month.
b) Find an interval, symmetric about the mean (that is, [µ − a, µ + a] for
some a), that contains 72% of monthly kWh values.

4.28.
The likelihood that a job application will result in an interview is estimated
as 0.1. A grad student has mailed 40 applications. Find the probability that
she will get at least 3 interviews,
a) Using the Normal approximation.
b) Using the Poisson approximation.
c) Find the exact probability. Which approximation has worked better?
Why?
4.29.
It is estimated that 33% of individuals in a population of Atlantic puffins have
a certain recessive gene. If 90 individuals are caught, estimate the probability
that there will be between 30 and 40 (inclusive) with the recessive gene.

4.7 Weibull distribution

Earlier we learned that Gamma is a generalization of Exponential distribution
(in fact, when α = 1 we get Exponential). The Weibull distribution is another
such generalization. Like Gamma, it has positive values and is, therefore,
suitable as a model of reliability and lifetimes, among other things.
The easiest way to look at the Weibull distribution is through its CDF
F(x) = 1 − exp[−(x/β)^γ], x > 0 (4.5)
Note: if γ = 1 then we get the Exponential distribution. The parameter β
has the dimension of time and γ is dimensionless.
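By differentiating the CDF, we get the Weibull density.
Definition 4.10. Weibull distribution
The Weibull RV has the density function
\[ f(x) = \frac{\gamma x^{\gamma - 1}}{\beta^{\gamma}} \exp\left[-\left(\frac{x}{\beta}\right)^{\gamma}\right], \quad x > 0 \]
and the CDF
\[ F(x) = 1 - \exp\left[-(x/\beta)^{\gamma}\right], \quad x > 0. \]
Its mean is
\[ \mu = \beta\, \Gamma\left(1 + \frac{1}{\gamma}\right). \]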

The Weibull distribution with γ > 1 typically has an asymmetric shape with
a peak in the middle and the long right “tail”. Shapes of Weibull density are
shown in Fig. 4.8 for various values of γ.
Figure 4.8: Weibull densities f(x) for γ = 1, γ = 2 and γ = 5, all with β = 1

Regarding the computation of the mean: the Gamma function of a non-integer
argument is, generally, not easy to find. Note only that Γ(0.5) = √π,
and we can use the recursive relation Γ(α + 1) = αΓ(α) to compute the
Gamma function for α = 1.5, 2.5 etc.
Example 4.24.
The duration of subscription to the Internet services is modeled by the
Weibull distribution with parameters γ = 2 and β = 15 months.
a) Find the average duration.
b) Find the probability that a subscription will last longer than 10 months.
Solution.
(a) µ = 15 Γ(1.5) = 15(0.5)Γ(0.5) = 7.5√π = 13.29 months
(b) P(X > 10) = 1 − F(10) = exp[−(10/15)²] = 0.6412
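
Both parts of Example 4.24 can be checked numerically, which is convenient when 1 + 1/γ is not a half-integer. The following is a minimal sketch using only the Python standard library; the function names `weibull_cdf` and `weibull_mean` are introduced here for illustration, and `shape` stands for the parameter γ.

```python
from math import exp, gamma

def weibull_cdf(x: float, beta: float, shape: float) -> float:
    """Weibull CDF F(x) = 1 - exp[-(x/beta)^shape] for x > 0, else 0."""
    return 1.0 - exp(-((x / beta) ** shape)) if x > 0 else 0.0

def weibull_mean(beta: float, shape: float) -> float:
    """Weibull mean beta * Gamma(1 + 1/shape)."""
    return beta * gamma(1.0 + 1.0 / shape)

beta, shape = 15.0, 2.0   # parameters from Example 4.24
mean_duration = weibull_mean(beta, shape)
p_longer_than_10 = 1.0 - weibull_cdf(10.0, beta, shape)

print(round(mean_duration, 2), round(p_longer_than_10, 4))
# 13.29 0.6412
```

Setting `shape = 1.0` reduces both functions to the Exponential case with mean β.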

Exercises
4.30.
The time it takes for a server to respond to a request is modeled by the
Weibull distribution with γ = 2/3 and β = 15 milliseconds.

a) Find the average time to respond.

b) Find the probability that it takes less than 12 milliseconds to respond.
c) Find the 70th percentile of the response times.

4.31.
The lifetime of refrigerators is assumed to follow Weibull distribution with
parameters β = 7 years and γ = 4.
Find:

a) The proportion of refrigerators with lifetime between 2 and 5 years.

b) If a refrigerator has already worked for 5 years, what is the probability
that it will work for at least 3 more years?

4.8 Moment generating functions for continuous case
The moment generating function of a continuous random variable X with
pdf f(x) is given by
\[ M(t) = E\left(e^{tX}\right) = \int_{-\infty}^{\infty} e^{tx} f(x)\,dx \]
when the integral exists. For the exponential distribution, this becomes
\[ M(t) = \int_{0}^{\infty} e^{tx}\, \frac{1}{\beta} e^{-x/\beta}\,dx = (1 - \beta t)^{-1}, \quad t < 1/\beta. \]
For properties of MGF’s, see Section 3.9.
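
As a worked check of the exponential case, the integral can be evaluated directly. The steps below assume t < 1/β, so that the exponent is negative and the integral converges; differentiating at t = 0 then recovers the mean β.

```latex
\begin{align*}
M(t) &= \int_0^\infty e^{tx}\,\frac{1}{\beta}\,e^{-x/\beta}\,dx
      = \frac{1}{\beta}\int_0^\infty e^{-x(1/\beta - t)}\,dx \\
     &= \frac{1}{\beta}\cdot\frac{1}{1/\beta - t}
      = \frac{1}{1-\beta t}, \qquad t < 1/\beta, \\
M'(t) &= \frac{\beta}{(1-\beta t)^2}, \qquad M'(0) = \beta = E(X).
\end{align*}
```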
Chapter 5

Joint probability distributions

5.1 Bivariate and marginal probability distributions
All of the random variables discussed previously were one-dimensional, that
is, we considered random quantities one at a time. In some situations, however,
we may want to record the simultaneous outcomes of several random
variables.
Examples:

a) We might measure the amount of precipitate A and volume V of gas
released from a controlled chemical experiment, giving rise to a
two-dimensional sample space.
b) A physician studies the relationship between exercise amount and pulse
rate of his patients.
c) An educator studies the relationship between students’ grades and time
devoted to study.

If X and Y are two discrete random variables, the probability that X equals
x while Y equals y is described by p(x, y) = P (X = x, Y = y). That is, the
function p(x, y) describes the probability behavior of the pair X, Y .

Definition 5.1. Joint PMF


The function p(x, y) is a joint probability mass function of the discrete
random variables X and Y if

a) p(x, y) ≥ 0 for all pairs (x, y),

b) \( \sum_x \sum_y p(x, y) = 1 \),

c) P(X = x, Y = y) = p(x, y).

For any region A in the xy-plane,
\[ P[(X, Y) \in A] = \sum_{(x,y) \in A} p(x, y). \]

Definition 5.2. Marginal PMF

The marginal probability functions of X and Y respectively are given by
\[ p_X(x) = \sum_y p(x, y) \quad\text{and}\quad p_Y(y) = \sum_x p(x, y) \]

Example 5.1.
If two dice are rolled independently, then the numbers X and Y on the first
and second die, respectively, will each have marginal PMF p(x) = 1/6 for
x = 1, 2, ..., 6.
The joint PMF is p(x, y) = 1/36, so that \( p_X(x) = \sum_{y=1}^{6} p(x, y) = 6/36 = 1/6 \).

Example 5.2.
Consider X = person’s age and Y = income. The data are abridged from the
US Current Population Survey. For the purposes of this example, we replace
the age and income groups by their midpoints. For example, the first row
represents ages 25-34 and the first column represents incomes $0-$10,000.

                 Y, income (thousands of dollars)
X, age     5       20      40      60      85      Total
30         0.049   0.116   0.084   0.039   0.032   0.320
40         0.042   0.093   0.081   0.045   0.061   0.322
50         0.047   0.102   0.084   0.053   0.072   0.358
Total      0.139   0.310   0.249   0.137   0.165   1.000
Here, the joint PMF is given inside the table and the marginal PMF’s of X
and Y are row and column totals, respectively.

For example, p(30, 60) = 0.039 and pY (40) = 0.084 + 0.081 + 0.084 =
0.249.
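
Computing marginals from a joint table is just row and column summation. The sketch below stores the table of Example 5.2 in a dictionary (a representation chosen here for illustration) and recomputes both marginal PMFs. Summing the printed cell values may differ from the printed totals in the last digit for some columns, presumably because the published entries are rounded.

```python
ages = [30, 40, 50]
incomes = [5, 20, 40, 60, 85]   # midpoints, in thousands of dollars

# joint[age][j] is p(age, incomes[j]), copied from the table in Example 5.2.
joint = {
    30: [0.049, 0.116, 0.084, 0.039, 0.032],
    40: [0.042, 0.093, 0.081, 0.045, 0.061],
    50: [0.047, 0.102, 0.084, 0.053, 0.072],
}

# Marginal PMF of X: sum each row over all income values.
p_x = {age: sum(joint[age]) for age in ages}

# Marginal PMF of Y: sum each column over all age values.
p_y = {inc: sum(joint[age][j] for age in ages) for j, inc in enumerate(incomes)}

print({k: round(v, 3) for k, v in p_x.items()})
print({k: round(v, 3) for k, v in p_y.items()})
```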
For continuous random variables, the PMF’s turn to densities, and summa-
tion to integration.
Definition 5.3. Joint density, marginal densities

The function f(x, y) is a joint probability density function for the
continuous random variables X and Y if

a) f(x, y) ≥ 0 for all (x, y),

b) \( \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} f(x, y)\,dx\,dy = 1 \),

c) \( P[(X, Y) \in A] = \iint_A f(x, y)\,dx\,dy \) for any region A in the xy-plane.

The marginal probability density functions of X and Y are given by
\[ f_X(x) = \int_{-\infty}^{\infty} f(x, y)\,dy \quad\text{and}\quad f_Y(y) = \int_{-\infty}^{\infty} f(x, y)\,dx \]

When X and Y are continuous random variables, the joint density func-
tion f (x, y) describes the likelihood that the pair (X, Y ) belongs to the neigh-
borhood of the point (x, y). It is visualized as a surface lying above the xy
plane.
Example 5.3.
A certain process for producing an industrial chemical yields a product that
contains two main types of impurities. Suppose that the joint probability
distribution of the impurity concentrations (in mg/l) X and Y is given by
\[ f(x, y) = \begin{cases} 2(1 - x) & \text{for } 0 < x < 1,\ 0 < y < 1 \\ 0 & \text{elsewhere} \end{cases} \]

(a) Verify the condition (b) of Definition 5.3

(b) Find P (0 < X < 0.5, 0.4 < Y < 0.7)
(c) Find the marginal probability density functions for X and Y .

Figure 5.1: An example of a joint density function. Left: surface plot. Right:
contour plot.

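Solution. (a)
\[ \int_{0}^{1} \int_{0}^{1} 2(1 - x)\,dx\,dy = \int_{0}^{1} \left[2x - x^2\right]_{0}^{1} dy = \int_{0}^{1} 1\,dy = 1 \]
(b)
\[ P(0 < X < 0.5,\ 0.4 < Y < 0.7) = \int_{0.4}^{0.7} \int_{0}^{0.5} 2(1 - x)\,dx\,dy = 0.225 \]
(c)
\[ f_X(x) = \int_{-\infty}^{\infty} f(x, y)\,dy = \int_{0}^{1} 2(1 - x)\,dy = 2(1 - x), \quad 0 < x < 1 \]
and
\[ f_Y(y) = \int_{-\infty}^{\infty} f(x, y)\,dx = \int_{0}^{1} 2(1 - x)\,dx = 1, \quad 0 < y < 1 \]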
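
The double integrals in Example 5.3 can be sanity-checked numerically. The sketch below uses a plain midpoint rule on a rectangular grid, which is an assumption made here for simplicity (any numerical integrator would do); `double_integral` is a hypothetical helper name.

```python
def f(x: float, y: float) -> float:
    """Joint density from Example 5.3."""
    return 2.0 * (1.0 - x) if 0 < x < 1 and 0 < y < 1 else 0.0

def double_integral(g, x_lo, x_hi, y_lo, y_hi, n=200):
    """Midpoint-rule approximation of the integral of g over a rectangle."""
    hx = (x_hi - x_lo) / n
    hy = (y_hi - y_lo) / n
    total = 0.0
    for i in range(n):
        x = x_lo + (i + 0.5) * hx
        for j in range(n):
            y = y_lo + (j + 0.5) * hy
            total += g(x, y)
    return total * hx * hy

total_mass = double_integral(f, 0.0, 1.0, 0.0, 1.0)   # condition (b) of Definition 5.3
prob_b = double_integral(f, 0.0, 0.5, 0.4, 0.7)        # P(0 < X < 0.5, 0.4 < Y < 0.7)

print(round(total_mass, 4), round(prob_b, 4))
```

Because this density is linear in x and constant in y, the midpoint rule is exact here up to floating-point rounding; for curved densities a finer grid would be needed.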

5.2 Conditional probability distributions

Definition 5.4. Conditional PMF or density

For a pair of discrete RV’s, the conditional PMF of X given Y = y is
\[ p(x \mid y) = \frac{p(x, y)}{p_Y(y)} \quad \text{for } y \text{ such that } p_Y(y) > 0 \]
For a pair of continuous RV’s with joint density f(x, y), the conditional
density function of X given Y = y is defined as
\[ f(x \mid y) = \frac{f(x, y)}{f_Y(y)} \quad \text{for } y \text{ such that } f_Y(y) > 0 \]
and the conditional density of Y given X = x is defined by
\[ f(y \mid x) = \frac{f(x, y)}{f_X(x)} \quad \text{for } x \text{ such that } f_X(x) > 0 \]
For discrete RV’s, the conditional probability distribution of X given Y fixes
a value of Y. For example, conditioning on Y = 0 produces
\[ P(X = 0 \mid Y = 0) = \frac{P(X = 0, Y = 0)}{P(Y = 0)} \]

Example 5.4.
Using the data from Example 5.2,

                 Y, income (thousands of dollars)
X, age     5       20      40      60      85      Total
30         0.049   0.116   0.084   0.039   0.032   0.320
40         0.042   0.093   0.081   0.045   0.061   0.322
50         0.047   0.102   0.084   0.053   0.072   0.358
Total      0.139   0.310   0.249   0.137   0.165   1.000

calculate the conditional PMF of Y given X = 30.
Solution. The conditional PMF of Y given X = 30 gives the distribution of
incomes in that age group. Divide each entry of the row X = 30 by its marginal
total to obtain

Y, income   p(y | X = 30)
5           0.049/0.320 = 0.153
20          0.116/0.320 = 0.362
40          0.084/0.320 = 0.263
60          0.039/0.320 = 0.122
85          0.032/0.320 = 0.100
Total       1

The conditional probabilities add up to 1.
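
The same division can be written as a short computation. This is an illustrative sketch using the X = 30 row of the table; the variable names are chosen here and are not from the notes.

```python
incomes = [5, 20, 40, 60, 85]                      # midpoints, thousands of dollars
row_age_30 = [0.049, 0.116, 0.084, 0.039, 0.032]   # joint probabilities p(30, y)

marginal_age_30 = sum(row_age_30)                  # p_X(30)

# Conditional PMF p(y | X = 30) = p(30, y) / p_X(30).
conditional = {inc: p / marginal_age_30 for inc, p in zip(incomes, row_age_30)}

for inc, p in conditional.items():
    print(inc, round(p, 3))
print("total", round(sum(conditional.values()), 3))
```

Dividing by the row total is what forces the conditional probabilities to sum to 1, whatever the joint values are.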


Example 5.5.
The joint density for the random variables (X, Y), where X is the unit
temperature change and Y is the proportion of spectrum shift that a certain
atomic particle produces, is
\[ f(x, y) = \begin{cases} 10xy^2 & \text{for } 0 < x < y < 1 \\ 0 & \text{elsewhere} \end{cases} \]

(a) Find the marginal densities.

(b) Find the conditional densities f (x|y) and f (y|x).

Figure 5.2: Left: Joint density from Example 5.5, right: a typical sample
from this distribution

Solution. (a) By definition,
\[ f_X(x) = \int_{x}^{1} 10xy^2\,dy = \frac{10}{3}\, x(1 - x^3), \quad 0 < x < 1 \]
\[ f_Y(y) = \int_{0}^{y} 10xy^2\,dx = 5y^4, \quad 0 < y < 1 \]
(b) Now
\[ f(y \mid x) = \frac{f(x, y)}{f_X(x)} = \frac{10xy^2}{(10/3)\,x(1 - x^3)} = \frac{3y^2}{1 - x^3}, \quad 0 < x < y < 1 \]
and
\[ f(x \mid y) = \frac{f(x, y)}{f_Y(y)} = \frac{10xy^2}{5y^4} = \frac{2x}{y^2}, \quad 0 < x < y < 1 \]
In the last formula, for instance, y is treated as fixed (given) and x is the variable.

5.3 Independent random variables

Definition 5.5. Independence
The random variables X and Y are said to be statistically independent iff

p(x, y) = pX (x)pY (y) for discrete case

and
f (x, y) = fX (x)fY (y) for continuous case

This definition of independence agrees with our definition for the events,
P (AB) = P (A)P (B). For example, if two dice are rolled independently,
then the numbers X and Y on the first and second die, respectively, will
each have PMF p(x) = 1/6 for x = 1, 2, ..., 6. The joint PMF will then be
p(x, y) = pX(x) pY(y) = (1/6)² = 1/36.
Example 5.6.
Show that the random variables in Example 5.3 are independent.
Solution. Here,
\[ f(x, y) = \begin{cases} 2(1 - x) & \text{for } 0 < x < 1 \text{ and } 0 < y < 1 \\ 0 & \text{elsewhere} \end{cases} \]

We have fX (x) = 2(1 − x) and fY (y) = 1 from Example 5.3, thus

fX (x)fY (y) = 2(1 − x)(1) = 2(1 − x) = f (x, y)

for 0 < x, y < 1 and 0 elsewhere. Hence, X and Y are independent random
variables.

Exercises
5.1.
The joint distribution of the number of total sales X1 and the number of
electronic equipment sales X2 per hour for a wholesale retailer is given below.

X2 0 1 2
X1 = 0 0.1 0 0
X1 = 1 0.1 0.2 0
X1 = 2 0.1 ? 0.15

a) Fill in the “?”

b) Compute the marginal probability function for X2 .

(That is, find P (X2 = i) for every i.)

c) Find the probability that both X1 ≤ 1 and X2 ≤ 1.

d) Find the conditional probability distribution for X2 given that X1 = 2.

(That is, find P (X2 = i | X1 = 2) for every i.)

e) Are X1 , X2 independent? Explain.

5.2.
X and Y have the following joint density:
\[ f(x, y) = \begin{cases} k & \text{for } 0 \le x \le y \le 1 \\ 0 & \text{elsewhere} \end{cases} \]

a) Calculate the constant k that makes f a legitimate density.

b) Calculate the marginal densities of X and Y .

5.3.
A point lands in the square [0, 1] × [0, 1] with random coordinates X, Y that are
independent, each having the Uniform[0, 1] distribution.

a) What is the probability that the distance from the point to the origin
is less than 1, that is, P(X² + Y² < 1)?

b) Find the conditional density of X given that Y = 0.5

5.4.
The random variables X, Y have joint density f(x, y) = e^{−(x+y)}, x, y > 0

a) Are X, Y independent? Explain.

b) Find P (X < 3, Y > 2)

5.4 Expected values of functions

Definition 5.6. Expected values

Suppose that the discrete RV’s (X, Y) have a joint PMF p(x, y). If g(x, y) is
any real-valued function, then
\[ E[g(X, Y)] = \sum_x \sum_y g(x, y)\, p(x, y). \]
The sum is over all values of (x, y) for which p(x, y) > 0.
If (X, Y) are continuous random variables, with joint PDF f(x, y), then
\[ E[g(X, Y)] = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} g(x, y)\, f(x, y)\,dx\,dy. \]

Definition 5.7. Covariance

The covariance between two random variables X and Y is given by

Cov(X, Y ) = E [(X − µX )(Y − µY )],

where µX = E (X) and µY = E (Y ).

The covariance helps us assess the relationship between two variables.
Positive covariance means positive association between X and Y meaning
that, as X increases, Y also tends to increase. Negative covariance means
negative association.

Figure 5.3: Explanation of positive covariance (contour plot of a joint density, with each quadrant around the mean marked by the sign, + or −, of the product of deviations)

In Figure 5.3, positive covariance is achieved because pairs (x, y) with a positive
product (x − µX)(y − µY) have higher density than those with a negative product.
This definition also extends our notion of variance as Cov(X, X) = V (X).
While covariance measures the direction of the association between two ran-
dom variables, its value is not directly interpretable. Correlation coefficient,
introduced below, measures the strength of the association and has some nice
properties.
Definition 5.8. Correlation
The correlation coefficient between two random variables X and Y is given by

\[ \rho = \frac{\mathrm{Cov}(X, Y)}{\sqrt{V(X)\, V(Y)}} \]
Properties of correlation:
• The correlation coefficient lies between −1 and +1.
• The correlation coefficient is dimensionless (while covariance has di-
mension of XY).

• If ρ = +1 or ρ = −1, then Y must be a linear function of X.

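• The correlation coefficient does not change when X or Y are linearly
transformed with a positive slope (e.g. when you change the units from miles
to ångströms); multiplying one of the variables by a negative constant only
flips the sign of ρ.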
• However, the correlation coefficient is not a good indicator of a nonlin-
ear relationship.

The following Theorem simplifies the computation of covariance. Compare
it to the variance identity V(X) = E(X²) − (E X)².

Theorem 5.1. Covariance

Cov(X, Y ) = E (XY ) − E (X)E (Y )
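
Theorem 5.1 follows from expanding the product in Definition 5.7 and using linearity of expectation. A short derivation, treating µ_X = E(X) and µ_Y = E(Y) as constants:

```latex
\begin{align*}
\mathrm{Cov}(X,Y) &= E\left[(X-\mu_X)(Y-\mu_Y)\right] \\
 &= E(XY) - \mu_Y E(X) - \mu_X E(Y) + \mu_X\mu_Y \\
 &= E(XY) - \mu_X\mu_Y - \mu_X\mu_Y + \mu_X\mu_Y \\
 &= E(XY) - E(X)\,E(Y).
\end{align*}
```

Setting Y = X in the last line gives the variance identity V(X) = E(X²) − (E X)².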

Example 5.7.
The fraction X of male runners and the fraction Y of female runners who
compete in marathon races are described by the joint density function
\[ f(x, y) = \begin{cases} 8xy & \text{for } 0 \le x \le 1,\ 0 \le y \le x \\ 0 & \text{elsewhere} \end{cases} \]
Find the covariance and correlation of X and Y .

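Solution. We first compute the marginal density functions. They are
\[ f_X(x) = \begin{cases} 4x^3 & \text{for } 0 \le x \le 1 \\ 0 & \text{elsewhere} \end{cases} \]
and
\[ f_Y(y) = \begin{cases} 4y(1 - y^2) & \text{for } 0 \le y \le 1 \\ 0 & \text{elsewhere} \end{cases} \]
From the marginal density functions, we get
\[ E(X) = \int_{0}^{1} 4x^4\,dx = \frac{4}{5} \quad\text{and}\quad E(Y) = \int_{0}^{1} 4y^2(1 - y^2)\,dy = \frac{8}{15} \]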

From the joint density function given, we have
\[ E(XY) = \int_{0}^{1} \int_{y}^{1} 8x^2 y^2\,dx\,dy = \frac{4}{9}. \]
Then
\[ \mathrm{Cov}(X, Y) = E(XY) - E(X)E(Y) = \frac{4}{9} - \frac{4}{5}\cdot\frac{8}{15} = \frac{4}{225} \]
To find correlation ρ, we first need to find variances of X and Y.
\[ E(X^2) = \int_{0}^{1} 4x^5\,dx = \frac{2}{3} \quad\text{and}\quad E(Y^2) = \int_{0}^{1} 4y^3(1 - y^2)\,dy = \frac{1}{3} \]
Thus V(X) = 2/3 − (4/5)² = 2/75 and V(Y) = 1/3 − (8/15)² = 11/225.
Finally,
\[ \rho = \frac{4/225}{\sqrt{2/75}\,\sqrt{11/225}} = \frac{4}{\sqrt{66}} \approx 0.492 \]
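
The moments in Example 5.7 can be cross-checked by numerical integration over the triangle 0 ≤ y ≤ x ≤ 1. The sketch below is an assumption-laden check, not the notes' method: it uses a midpoint rule with the inner grid rescaled to [0, x], and the helper name `expect` is hypothetical.

```python
from math import sqrt

def expect(g, n=200):
    """Midpoint-rule approximation of E[g(X, Y)] for the density
    f(x, y) = 8xy on the triangle 0 <= y <= x <= 1."""
    hx = 1.0 / n
    total = 0.0
    for i in range(n):
        x = (i + 0.5) * hx
        hy = x / n                      # inner grid covers 0 <= y <= x
        for j in range(n):
            y = (j + 0.5) * hy
            total += g(x, y) * 8.0 * x * y * hx * hy
    return total

ex = expect(lambda x, y: x)
ey = expect(lambda x, y: y)
exy = expect(lambda x, y: x * y)
var_x = expect(lambda x, y: x * x) - ex**2
var_y = expect(lambda x, y: y * y) - ey**2

cov = exy - ex * ey                     # Theorem 5.1
rho = cov / sqrt(var_x * var_y)         # Definition 5.8

print(round(cov, 4), round(rho, 3))
```

The results should be close to the exact values 4/225 ≈ 0.0178 and 4/√66 ≈ 0.492, with a small discretization error that shrinks as `n` grows.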

Theorem 5.2. Covariance and independence

If random variables X and Y are independent, then Cov(X, Y ) = 0.

Proof. We will show the proof for the continuous case; the discrete case
follows similarly.
For independent X, Y ,
\[ E(XY) = \iint xy\, f(x, y)\,dx\,dy = \iint x f_X(x)\, y f_Y(y)\,dx\,dy = \int x f_X(x)\,dx \int y f_Y(y)\,dy = E(X)\, E(Y) \]

Therefore, Cov(X, Y ) = E (XY ) − E (X)E (Y ) = 0.

Of course, if covariance is 0, then so is the correlation coefficient. Such
random variables are called uncorrelated. The converse of this Theorem is
not true, meaning that zero covariance does not necessarily imply
independence.

5.4.1 Variance of sums

The following Theorem simplifies calculation of variance in certain cases.

Theorem 5.3. Variance of sums

If X and Y are random variables and U = aX + bY + c, then

V(U) = V(aX + bY + c) = a² V(X) + b² V(Y) + 2ab Cov(X, Y)

If X and Y are independent then V(U) = V(aX + bY + c) = a² V(X) + b² V(Y)

Example 5.8.
If X and Y are random variables with variances V (X) = 2, V (Y ) = 4,
and covariance Cov(X, Y ) = −2, find the variance of the random variable
Z = 3X − 4Y + 8.
Solution. By Theorem 5.3,
V(Z) = σ_Z² = V(3X − 4Y + 8) = 9V(X) + 16V(Y) − 24 Cov(X, Y)
so V(Z) = (9)(2) + (16)(4) − 24(−2) = 130.
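
Theorem 5.3 translates directly into a one-line function. The sketch below (with the hypothetical name `var_linear_combo`) reproduces Example 5.8; note that the additive constant c does not appear, since shifting a random variable does not change its variance.

```python
def var_linear_combo(a, b, var_x, var_y, cov_xy):
    """V(aX + bY + c) = a^2 V(X) + b^2 V(Y) + 2ab Cov(X, Y); c has no effect."""
    return a * a * var_x + b * b * var_y + 2 * a * b * cov_xy

# Example 5.8: Z = 3X - 4Y + 8 with V(X) = 2, V(Y) = 4, Cov(X, Y) = -2.
var_z = var_linear_combo(3, -4, 2, 4, -2)
print(var_z)   # 130
```

With `cov_xy = 0` (for instance, independent X and Y) the same call would give 18 + 64 = 82, showing how much the negative covariance, combined with coefficients of opposite sign, adds to the variance here.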

Corollary. If the random variables X and Y are independent, then

V (X + Y ) = V (X) + V (Y )
Note. Theorem 5.3 and the above Corollary naturally extend to more than
2 random variables. If X1 , X2 , ..., Xn are all independent RV’s, then
V (X1 + X2 + ... + Xn ) = V (X1 ) + V (X2 ) + ... + V (Xn )
Example 5.9.
We have discussed in Chapter 3 that the Binomial random variable Y with
parameters n, p can be represented as Y = X1 + X2 + ... + Xn . Here Xi are
independent Bernoulli (0/1) random variables with P (Xi = 1) = p.
It was found that V (Xi ) = p(1 − p). Then, using the above Note, V (Y ) =
V (X1 ) + V (X2 ) + ... + V (Xn ) = np(1 − p), which agrees with the formula for
Binomial variance in Section 3.4.
The same reasoning applies to Gamma RV’s. If Y = X1 + X2 + ... + Xn,
where Xi are independent Exponentials, each with mean β, then we know
that V(Xi) = β² and Y has Gamma distribution with α = n. Then, V(Y) =
V(X1) + V(X2) + ... + V(Xn) = nβ².

Example 5.10.
A very important application of Theorem 5.3 is the calculation of variance
of the sample mean
\[ \bar{X} = \frac{X_1 + X_2 + \dots + X_n}{n} = \frac{Y}{n} \]
where Xi are independent and identically distributed RV’s (representing a
sample of measurements), and Y denotes the total of all measurements.
Suppose that V(Xi) = σ² for each i. Then
\[ V(\bar{X}) = \frac{V(Y)}{n^2} = \frac{V(X_1) + V(X_2) + \dots + V(X_n)}{n^2} = \frac{n\sigma^2}{n^2} = \frac{\sigma^2}{n} \]
This means that \( \sigma_{\bar{X}} = \sigma/\sqrt{n} \), that is, the mean of n independent
measurements is √n times more precise than a single measurement.
Example 5.11.
The error in a single permeability measurement has the standard deviation of
0.01 millidarcies (md). If we made 8 independent measurements, how large
is the error we should expect from their mean?
Solution. \( \sigma_{\bar{X}} = \sigma/\sqrt{n} = 0.01/\sqrt{8} \approx 0.0035 \) md
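
The σ/√n rule can also be illustrated by simulation. The sketch below is hypothetical in its details: it assumes Normally distributed measurement errors with mean 0 (the notes specify only the standard deviation), and the seed and number of repetitions are arbitrary choices.

```python
import random
from math import sqrt
from statistics import stdev

random.seed(0)                     # fixed seed so the run is repeatable
sigma, n, reps = 0.01, 8, 20000    # single-measurement sd (md), sample size, repetitions

# Each repetition averages n independent simulated measurement errors.
sample_means = [
    sum(random.gauss(0.0, sigma) for _ in range(n)) / n
    for _ in range(reps)
]

empirical_sd = stdev(sample_means)   # spread of the simulated means
theoretical_sd = sigma / sqrt(n)     # sigma / sqrt(n) from Example 5.10

print(round(empirical_sd, 4), round(theoretical_sd, 4))
```

The two printed values should agree to within simulation error; the same experiment with a non-Normal error distribution of the same standard deviation would give a similar spread, since the variance formula does not require normality.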

Exercises
5.5.
Y 0 1 2
X =0 0.1 0 0
X =1 0.1 0.2 0
X =2 0.1 0.35 0.15

Find the covariance and correlation between X and Y .

5.6.
X and Y have the following joint density:
\[ f(x, y) = \begin{cases} 2 & \text{for } 0 \le x \le y \le 1 \\ 0 & \text{elsewhere} \end{cases} \]

a) Calculate E(X²Y).
b) Calculate E (X/Y ).
5.7.
Using the density in Problem 5.6, find the covariance and correlation between
X and Y .
5.8.
Ten people get into an elevator. Assume that their weights are independent,
with the mean 150 lbs and standard deviation 30 lbs.
a) Find the expected value and the standard deviation of their total
weight.
b) Assuming Normal distribution, find the probability that their combined
weight is less than 1700 pounds.
5.9.
While estimating speed of light in a transparent medium, an individual mea-
surement X is determined to be unbiased (that is, the mean of X equals
the unknown speed of light), but the measurement error, assessed as the
standard deviation of X, equals 35 kilometers per second (km/s).
a) In an experiment, 20 independent measurements of the speed of light
were made. What is the standard deviation of the mean of these mea-
surements?
b) How many measurements should be made so that the error in estimating
the speed of light (measured as \( \sigma_{\bar{X}} \), the standard deviation of the sample mean) will decrease to 5 km/s?
5.10.
A part is composed of two segments. One segment is produced with the
mean length 4.2cm and standard deviation of 0.1cm, and the second segment
is produced with the mean length 2.5cm and standard deviation of 0.05cm.
Assuming that the production errors are independent, calculate the mean
and standard deviation of the total part length.
5.11.
Random variables X and Y have means 3 and 5, and variances 0.5 and 2,
respectively. Further, the correlation coefficient between X and Y equals
−0.5. Find the mean and variance of W = X − Y .
5.12. ?
Find an example of uncorrelated, but not independent random variables.
[Hint: Two discrete RV’s with 3 values each are enough.]

5.5 Conditional Expectations*

Definition 5.9. Conditional Expectation
If X and Y are any two random variables, the conditional expectation of X
given that Y = y is defined to be
$$E(X \mid Y = y) = \int_{-\infty}^{\infty} x\, f(x \mid y)\,dx$$

if X and Y are jointly continuous, and

$$E(X \mid Y = y) = \sum_{x} x\, p(x \mid y)$$
if X and Y are jointly discrete.
Note that E (X|Y = y) is a number depending on y. If now we allow y to
vary randomly, we get a random variable denoted by E (X|Y ). The concept
of conditional expectation is useful when we have only partial information
about X, as in the following example.
Example 5.12.
Suppose that random variable X is the number rolled on a die, and Y = 0
when X ≤ 3 and Y = 1 otherwise. Thus, Y carries partial information about
X, namely, whether X ≤ 3 or not.
a) Compute the conditional expectation E (X | Y = 0).
b) Describe the random variable E (X | Y ).
Solution. (a) The conditional distributions of X are given by

$$P(X = x \mid Y = 0) = \frac{P(X = x, Y = 0)}{P(Y = 0)} = \frac{1/6}{1/2} = 1/3$$
for x = 1, 2, 3, and
P (X = x | Y = 1) = 1/3 for x = 4, 5, 6.
Thus, E (X | Y = 0) = (1/3)(1 + 2 + 3) = 2 and
E (X | Y = 1) = (1/3)(4 + 5 + 6) = 5
(b) E (X | Y ) is 2 or 5, depending on Y . Each value may happen with prob-
ability 1/2. Thus, P [E (X | Y ) = 2] = 0.5 and P [E (X | Y ) = 5] = 0.5
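The conditional expectations in Example 5.12 can be reproduced directly from the joint distribution. The sketch below stores the joint PMF as a dictionary and uses exact fractions; the helper name `cond_expectation` is an illustrative assumption.

```python
from fractions import Fraction

# Joint distribution of (X, Y): X is a fair die roll, Y = 0 if X <= 3, else 1.
joint = {(x, 0 if x <= 3 else 1): Fraction(1, 6) for x in range(1, 7)}

def cond_expectation(joint, y):
    """E(X | Y = y) for a discrete joint PMF given as {(x, y): probability}."""
    p_y = sum(p for (_, yy), p in joint.items() if yy == y)   # P(Y = y)
    return sum(x * p for (x, yy), p in joint.items() if yy == y) / p_y

print(cond_expectation(joint, 0))  # 2
print(cond_expectation(joint, 1))  # 5
```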

Theorem 5.4. Expectation of expectation

Let X and Y denote random variables. Then

(a) E (X) = E [E (X|Y )]

(b) V (X) = E [V (X|Y )] + V [E (X|Y )]

Proof. (Part (a) only.)

Let X and Y have joint density f (x, y) and the marginal densities fX (x) and
fY (y), respectively. Then
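$$E(X) = \int_{-\infty}^{\infty} x f_X(x)\,dx = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} x f(x, y)\,dx\,dy$$

$$= \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} x f(x \mid y) f_Y(y)\,dx\,dy = \int_{-\infty}^{\infty}\left[\int_{-\infty}^{\infty} x f(x \mid y)\,dx\right] f_Y(y)\,dy$$

$$= \int_{-\infty}^{\infty} E(X \mid Y = y) f_Y(y)\,dy = E[E(X \mid Y)]$$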

Example 5.13.
Suppose that the total weight X of occupants in a car depends on the number
of occupants Y, and that each occupant weighs 150 lbs on average. Then
E(X | Y = y) = 150y. Suppose Y has the following distribution:

y       1      2      3      4
p(y)    0.62   0.28   0.07   0.03
150y    150    300    450    600

Then E(X | Y) has the distribution with values given in the last row of the
table, and probabilities identical to p(y). We can verify by straightforward
calculation that E[E(X | Y)] = E(150Y) = 150(1.51) = 226.5. Then Theorem 5.4
says that E(X) = 226.5 as well, so we don't even have to know the distribution
of occupant weights, only its mean (150).
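The value 226.5 in Example 5.13 comes from weighting 150y by p(y). A short check, using variable names of our own choosing:

```python
# Distribution of the number of occupants Y (table in Example 5.13)
p_y = {1: 0.62, 2: 0.28, 3: 0.07, 4: 0.03}
mean_weight = 150  # average weight of one occupant, lbs

# E(X | Y = y) = 150 * y, so E(X) = E[E(X | Y)] = sum over y of 150 * y * p(y)
expected_total = sum(mean_weight * y * p for y, p in p_y.items())
print(round(expected_total, 1))  # 226.5
```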

Exercises
5.13.
For the random variables X and Y from Example 5.12, verify the identity in
part (a) of the Theorem 5.4.

5.14.
Suppose that the number of lobsters caught in a trap follows the distribution
y 0 1 2 3
p(y) 0.5 0.3 0.15 0.05
and the average weight of a lobster is 1.7 lbs, with variance 0.25 lbs². Find the
expected value and the variance of the total catch in one trap.
Chapter 6

Functions of Random Variables

6.1 Introduction
At times we are faced with a situation where we must deal not with the
random variable whose distribution is known but rather with some function
of that random variable. For example, we might know the distribution of
particle sizes, and would like to infer the distribution of particle weights.
In the case of a simple linear function, we have already asserted what
the effect is on the mean and variance. What has been omitted was what
actually happens to the distribution.
We will discuss several methods of obtaining the distribution of Y = g(X)
from known distribution of X. The CDF method and the transformation
method are most frequently used. The CDF method is all-purpose and flex-
ible. The transformation method is typically faster (when it works).

6.1.1 Simulation
One use of these methods is to generate random variables with a given distribution.
This is important in simulation studies. Suppose that we have a complex operation
that involves several components. Suppose that each component is described by a
random variable and that the outcome of the operation depends on the components
in a complicated way. One approach to analyzing such a system is to simulate each
component and calculate the outcome for the simulated values. If we repeat the
simulation many times, then we can get an idea of the probability distribution of
the outcomes.


6.2 Method of distribution functions (CDF)

The CDF method is straightforward and very versatile. The procedure is to
derive the CDF for Y = g(X) in terms of both the CDF of X, F (x), and the
function g, while also noting how the range of possible values changes. This
is done by starting with the computation of P (Y < y) and inverting this into
a statement that can often be expressed in terms of the CDF of X.
If we also need to find the density of Y , we can do this by differentiating
its CDF.
Example 6.1.
Suppose X has CDF given by $F(x) = 1 - e^{-\lambda x}$, so that X is Exponential with
the mean 1/λ. Let Y = bX where b > 0. Note that the range of Y is the
same as the range of X, namely (0, ∞). Since b > 0, the inequality sign does
not change when we divide by b:

$$P(Y < y) = P(bX < y) = P(X < y/b) = 1 - e^{-\lambda y/b} = 1 - e^{-(\lambda/b) y}$$

The student should recognize this as the CDF of the exponential distribution
with the mean b/λ. We already knew that the mean would be b/λ, but we
did not know that Y also has an exponential distribution.
Example 6.2.
Suppose X has a uniform distribution on [a, b] and Y = cX + d, with c > 0.
Find the CDF of Y .
Solution. Recall that F (t) = (t − a)/(b − a). Note that the range of Y is
[ca + d, cb + d]. We have

P (Y < t) = P (cX + d < t) = P (X < (t − d)/c) = F ((t − d)/c)

= ((t − d)/c − a)/(b − a) = (t − d − ac)/(c(b − a))

With a little algebra, this can be shown to be the uniform CDF on [ca +
d, cb + d].
This example shows that certain simple transformations do not change
the distribution type, only the parameters. Sometimes, however, the change
is dramatic.

Example 6.3.
Show that if X has a uniform distribution on the interval [0, 1] then
Y = − ln(1 − X) has an exponential distribution with mean 1.
Solution. Recall that for the uniform distribution on (0, 1), P (X < x) = x.
Also, note that the range of Y is (0, ∞).

$$P(Y < t) = P(-\ln(1 - X) < t) = P(\ln(1 - X) > -t)$$

$$= P(1 - X > e^{-t}) = P(X < 1 - e^{-t}) = 1 - e^{-t}$$

Incidentally, note that if X has a uniform distribution on (0, 1), then so
does W = 1 − X. (See exercises.)
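Example 6.3 is the basis of a common simulation technique (inverse-transform sampling): feed a Uniform[0, 1] draw through Y = −ln(1 − X) to obtain an Exponential draw. The sketch below is only illustrative; the seed, the number of draws and the function name `sample_exponential` are arbitrary choices of ours.

```python
import math
import random

def sample_exponential(rng):
    """Turn one Uniform[0, 1) draw into an Exponential (mean 1) draw
    using Y = -ln(1 - X), as in Example 6.3."""
    x = rng.random()            # X ~ Uniform[0, 1)
    return -math.log(1.0 - x)   # 1 - x lies in (0, 1], so the log is defined

rng = random.Random(382)        # fixed seed (arbitrary) for reproducibility
draws = [sample_exponential(rng) for _ in range(100_000)]

sample_mean = sum(draws) / len(draws)
frac_below_1 = sum(d < 1.0 for d in draws) / len(draws)

print(sample_mean)    # should be close to the theoretical mean 1
print(frac_below_1)   # should be close to 1 - exp(-1), about 0.632
```

With this many draws, both printed values should agree with the theoretical ones to about two decimal places.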
Example 6.4.
The pdf of X is given by
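$$f(x) = \begin{cases} 3x^2 & 0 \le x \le 1 \\ 0 & \text{elsewhere} \end{cases}$$

Find the pdf of U = 40(1 − X).

Solution.
$$F_U(u) = P(U \le u) = P[40(1 - X) \le u] = P\left(X > 1 - \frac{u}{40}\right) = 1 - P\left(X \le 1 - \frac{u}{40}\right)$$

$$= 1 - F_X\left(1 - \frac{u}{40}\right) = 1 - \int_0^{1 - u/40} f(x)\,dx = 1 - \left(1 - \frac{u}{40}\right)^3.$$

Therefore,
$$f_U(u) = F_U'(u) = \frac{3}{40}\left(1 - \frac{u}{40}\right)^2, \quad \text{for } 0 \le u \le 40$$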

Exercises
6.1.
Show that if X has a uniform distribution on [0, 1], then so does 1 − X.
6.2.
Let X have a uniform distribution on [0, 1]. Let $Y = \sqrt{X}$.
a) Find the distribution of Y.

b) Find the mean of Y using the result in (a).

c) Find the mean of Y using the formula $E\,g(X) = \int g(x) f(x)\,dx$.

6.3.
Using the CDF method, show that the Weibull random variable Y (with
some parameter γ > 0, and β = 1) can be obtained from Exponential X
(with the mean 1) as $Y = X^{1/\gamma}$.

6.4.
Suppose the radii of spheres have a normal distribution with mean 2.5 and
variance 1/12. Find the median volume and median surface area.

6.5.
Let X have a uniform distribution on [0, 1]. Show how you could define H(x)
so that Y = H(X) would have a Poisson distribution with mean 1.3.

6.6.
A point lands into [0, 1] × [0, 1] square with random coordinates X, Y inde-
pendent, having Uniform[0, 1] distribution each. Use the CDF method to
find the distribution of U = max(X, Y ).

6.7.
Let X, Y be independent, standard Normal RV's. Find the distribution of
$Z = \sqrt{X^2 + Y^2}$. You can interpret this as the distance of the random point
(X, Y) to the origin. [Hint: Use the polar coordinates.]

6.3 Method of transformations

Theorem 6.1. Transformations: discrete

Suppose that X is a discrete random variable with probability mass function

p(x). Let U = h(X) define a one-to-one transformation between the values of
X and U so that the equation u = h(x) can be uniquely solved for x, say
x = w(u). Then the PMF of U is g(u) = p[w(u)].
For a discrete RV, the probabilities will stay the same and only the values
of X will change to the values of U. You should also take care to aggregate
the values that might appear several times.

Example 6.5.
Let X be a geometric random variable with PMF
$$p(x) = \frac{3}{4}\left(\frac{1}{4}\right)^{x-1}, \quad x = 1, 2, 3, \dots$$
Find the distribution of the random variable $U = X^2$.
Solution. Since the values of X are all positive, the transformation defines a
one-to-one correspondence between the x and u values, $u = x^2$ and $x = \sqrt{u}$.
Hence,
$$g(u) = p(\sqrt{u}) = \frac{3}{4}\left(\frac{1}{4}\right)^{\sqrt{u} - 1}, \quad u = 1, 4, 9, \dots$$
For continuous RV’s, the transformation formula originates from the
change of variable formula for integrals.
Theorem 6.2. Transformations: continuous
Suppose that X is a continuous random variable with density f (x). Let
y = h(x) define a one-to-one transformation that can be uniquely solved for
x, say x = w(y). Then the density of Y = h(X) is

$$f_Y(y) = f(x)\left|\frac{dx}{dy}\right| = f[w(y)] \times |J|$$

where $J = w'(y)$ is called the Jacobian of the transformation.

Example 6.6.
Let X be a continuous random variable with probability distribution
$$f(x) = \begin{cases} x/12 & \text{for } 1 \le x \le 5 \\ 0 & \text{elsewhere} \end{cases}$$
Find the probability distribution of the random variable Y = 2X − 3.
Solution. The inverse solution of y = 2x − 3 yields x = (y + 3)/2, from which
we obtain $J = w'(y) = \frac{dx}{dy} = \frac{1}{2}$. Therefore, using the above Theorem 6.2, we
find the density function of Y to be

$$f_Y(y) = \frac{1}{12}\cdot\frac{y + 3}{2}\cdot\frac{1}{2} = \frac{y + 3}{48}, \quad -1 < y < 7$$

Example 6.7.
Let X be a Uniform[0, 1] random variable. Find the distribution of $Y = X^5$.
Solution. Inverting, $x = y^{1/5}$, and $dx/dy = (1/5)y^{-4/5}$. Thus, we obtain
$$f_Y(y) = 1 \times (1/5)y^{-4/5} = (1/5)y^{-4/5}, \quad 0 < y < 1$$

Example 6.8.
Let X be a continuous random variable with density
$$f(x) = \begin{cases} \dfrac{x + 1}{2} & \text{for } -1 \le x \le 1 \\ 0 & \text{elsewhere} \end{cases}$$

Find the density of the random variable $Y = X^2$.

Solution. The inversion of $y = x^2$ yields $x_{1,2} = \pm\sqrt{y}$, from which we obtain
$J_1 = w_1'(y) = \frac{dx_1}{dy} = \frac{1}{2\sqrt{y}}$ and $J_2 = w_2'(y) = \frac{dx_2}{dy} = -\frac{1}{2\sqrt{y}}$. We cannot directly
use Theorem 6.2 because the function $y = x^2$ is not one-to-one. However, we
can split the range of X into two parts (−1, 0) and (0, 1) where the function
is one-to-one. Then, we will just add the results.
Thus, we find the density function of Y to be
$$f_Y(y) = |J_1| f(\sqrt{y}) + |J_2| f(-\sqrt{y}) = \frac{1}{2\sqrt{y}}\left(\frac{\sqrt{y} + 1}{2} + \frac{-\sqrt{y} + 1}{2}\right) = \frac{1}{2\sqrt{y}} \quad \text{for } 0 < y \le 1$$

Example 6.9. Location and Scale parameters

Suppose that X has some standard distribution (for example, Standard Normal,
or maybe Exponential with β = 1) and Y = a + bX, or, solving for X,
$$X = \frac{Y - a}{b}$$
Then a is called the location (or shift) parameter and b is the scale parameter.
Let X have the density f(x). Then the density of Y can be obtained
from Theorem 6.2 as

$$f_Y(y) = f(x)\left|\frac{dx}{dy}\right| = \frac{1}{|b|}\, f\left(\frac{y - a}{b}\right) \qquad (6.1)$$

For example, let X be Exponential with the mean 1, and Y = bX with b > 0. Then
$f(x) = e^{-x}$, x > 0, and (6.1) gives

$$f_Y(y) = (1/b)e^{-y/b}, \quad y > 0$$

That is, Y is Exponentially distributed with the mean b. This agrees with
the result of Example 6.1.
Another example of location and scale parameters is provided by the Normal
distribution: if Z is standard Normal, then Y = µ + σZ produces Y, a
Normal(µ, σ²) random variable. Thus, µ is the location and σ is the scale parameter.
Formula (6.1) also provides a faster way to solve some of the above Examples.

Exercises
6.8.
Suppose that Y = cos(πX) where the RV X is given by the table
x −2 −1 0 2 3
p(x) 0.1 0.2 0.3 0.3 0.1
Find the distribution of Y (make a table).

6.9.
The random variable X has a distribution given by the table

x −1 0 1 2
p(x) 0.1 0.2 0.3 0.4

Find the distribution of a random variable Y = X 2 − 1.

6.10.
Let X be a continuous random variable with density
$$f(x) = \begin{cases} \dfrac{2}{3}(x + 1) & \text{for } 0 \le x \le 1 \\ 0 & \text{elsewhere} \end{cases}$$

Find the density of the random variable $Y = X^2$.


6.11.
Use the methods of this section to show that linear functions of normal
random variables again have a normal distribution. Let Y = a + bX, where
X is normal with the mean µ and variance σ². How do the mean and variance
of Y relate to those of X? Again, use the methods of this section.

6.12.
The so-called Pareto random variable X with parameters 10 and 2 has the
density function
$$f(x) = \frac{10}{x^2}, \quad x > 10$$
Write down the density function of Y = 4X − 20 (do not forget the limits!)

6.13.
Re-do Example 6.4 (p. 133) using the transform (Jacobian) method.

6.14.
For the following distributions identify the parameters as location or scale
parameters, or neither:

a) Weibull, parameter β.
b) Weibull, parameter γ.
c) Uniform on [−θ, θ], parameter θ.
d) Uniform on [b, b + 1], parameter b.

6.4 Central Limit Theorem

Sample mean (average of all observations) plays a central role in statistics.
We have discussed the variance of the sample mean in Section 5.4. Here are
more facts about the behavior of the sample mean.
From the linear properties of the expectation, it's clear that

$$E(\bar{X}) = E\left(\frac{X_1 + X_2 + \dots + X_n}{n}\right) = \frac{n\mu}{n} = \mu.$$

Summarizing the above, we obtain

Definition 6.1. Sample mean

A group of independent random variables from some distribution is called a
sample, usually denoted as
$X_1, X_2, \dots, X_n$.
Sample mean, denoted $\bar{X}$, is
$$\bar{X} = \frac{X_1 + X_2 + \dots + X_n}{n}$$
If $E(X_i) = \mu$ and $V(X_i) = \sigma^2$ for all i, then the mean and variance of the sample
mean are
$$E(\bar{X}) = \mu \quad \text{and} \quad V(\bar{X}) = \sigma^2/n$$

Example 6.10.
The average voltage of the batteries is 9.2V and standard deviation is 0.25V.
Assuming normal distribution and independence, what is the distribution of
total voltage Y = X1 + ... + X4 ? Find the probability that the total voltage
is above 37.

Solution. The mean is 4 × 9.2 = 36.8. The variance is 4 × 0.25² = 0.25, so the
standard deviation is √0.25 = 0.5. Furthermore, Y itself will have a normal distribution.
Using z-scores, P(Y > 37) = P(Z > (37 − 36.8)/0.5) = P(Z > 0.4) =
0.5 − 0.1554 = 0.345 from Normal table, p. 102.
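Instead of the Normal table, the probability in Example 6.10 can be computed with Python's standard library (`statistics.NormalDist`, available from Python 3.8). The variable names below are ours.

```python
from statistics import NormalDist

n, mu, sigma = 4, 9.2, 0.25   # four batteries, mean and sd of one battery's voltage

# Total voltage Y = X1 + ... + X4 is Normal with mean n*mu and variance n*sigma^2
total = NormalDist(mu=n * mu, sigma=(n * sigma**2) ** 0.5)

p_above_37 = 1 - total.cdf(37)
print(round(p_above_37, 3))  # 0.345
```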

Here we mention (without proof, which can be obtained using the mo-
ment generating functions) some properties of the sums of independent ran-
dom variables.

Distribution of X_i         Distribution of Y = X_1 + X_2 + ... + X_n (indep.)
Exponential           ↦     Gamma
Normal                ↦     Normal
Poisson               ↦     Poisson

[Figure: plots of each distribution in the table above]

What do these have in common?

The sum of independent Normal RV's is always Normal. The shape of the
sum distribution for other independent RV's starts resembling Normal as n
increases.
The Central Limit Theorem (CLT) ensures a similar property for most
general distributions. However, it holds only in the limit, that is, as n gets large
(practically, n > 30 is usually enough). According to it, the sums of independent
RV's approach a normal distribution. The same holds for averages,
since they are sums divided by n.

Theorem 6.3. CLT

Let $\bar{X}$ be the mean of a sample of size n coming from some distribution with mean µ
and variance σ². Then, for large n, $\bar{X}$ is approximately Normal with mean
µ and variance σ²/n.
If n < 30, the approximation is good only if the population distribution
is not too different from a normal. If the population is normal, the sampling
distribution of $\bar{X}$ will follow a normal distribution exactly, no matter how
small the sample size.¹
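A small simulation can make Theorem 6.3 concrete. The sketch below repeatedly draws samples of size n = 30 from a skewed population (Exponential with mean 1, so µ = 1 and σ = 1) and looks at the resulting sample means. The seed, the sample size and the number of repetitions are arbitrary choices of ours.

```python
import random
import statistics

rng = random.Random(2013)   # arbitrary fixed seed
n = 30                      # sample size
reps = 20_000               # number of simulated samples

# Each X_i is Exponential with mean 1: a clearly non-Normal, right-skewed population.
means = [sum(rng.expovariate(1.0) for _ in range(n)) / n for _ in range(reps)]

print(statistics.fmean(means))   # close to mu = 1
print(statistics.stdev(means))   # close to sigma / sqrt(n), about 0.183

# For a Normal distribution about 68% of values lie within one sd of the mean.
sd = 1 / n ** 0.5
within = sum(abs(m - 1) <= sd for m in means) / reps
print(within)
```

If the Normal approximation is reasonable, the last printed fraction should be near 0.68.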
Example 6.11.
An electrical firm manufactures light bulbs with average lifetime equal to
800 hours and standard deviation of lifetimes equal 400 hours. Find the
probability that a random sample of 16 bulbs will have an average life of less
than 725 hours.
Solution. The sampling distribution of $\bar{X}$ will be approximately normal, with
mean $\mu_{\bar{X}} = 800$ and $\sigma_{\bar{X}} = 400/\sqrt{16} = 100$. Therefore,

$$P(\bar{X} < 725) \approx P\left(Z < \frac{725 - 800}{100}\right) = P(Z < -0.75) = 0.5 - 0.2734 = 0.2266$$

Dependence on n
As n increases, two things happen to the distribution of $\bar{X}$: it is becoming
sharper (due to the variance decreasing) and also the shape is becoming more
and more Normal. For example, if X_i are Uniform[0,1], then the density of
$\bar{X}$ behaves as follows:

[Figure: density of $\bar{X}$ from Uniform, n = 1, 2, 4, and 16]

¹ There are some cases of the so-called “heavy-tailed” distributions for which the CLT
does not hold, but they will not be discussed here.

Example 6.12.
The fracture strengths of a certain type of glass average 14 (thousands of
pounds per square inch) and have a standard deviation of 2. What is the
probability that the average fracture strength for 100 pieces of this glass
exceeds 14.5?

Solution. By the central limit theorem the average strength $\bar{X}$ has approximately
a normal distribution with mean 14 and standard deviation
$\sigma_{\bar{X}} = 2/\sqrt{100} = 0.2$. Thus,

$$P(\bar{X} > 14.5) \approx P\left(Z > \frac{14.5 - 14}{0.2}\right) = P(Z > 2.5) = 0.5 - 0.4938 = 0.0062$$

from the Normal probability table.

6.4.1 CLT examples: Binomial

Historically, the CLT was first discovered in the case of the Binomial distribution.
Since Binomial Y is a sum of n independent Bernoulli RV's, the CLT applies and says
that $\bar{X} = Y/n$ is approximately Normal, with mean p and variance p(1 − p)/n.
In this case, p̂ := Y/n is called the sample proportion. The Binomial Y itself
is also approximately Normal with mean np and variance np(1 − p), as was
discussed earlier in Section 4.6.2.

Example 6.13.
A fair (p = 0.5) coin is tossed 500 times.

a) What is the expected proportion of Heads?

b) What is the typical deviation from the expected proportion?
c) What is the probability that the sample proportion is between 0.46 and
0.54?
Solution. (a) We have $E(\hat{p}) = p = 0.5$ and $\sigma_{\hat{p}} = \sqrt{p(1 - p)/n} = \sqrt{0.25/500} = 0.0224$.
(b) For example, the empirical rule states that about 68% of a normal distribution
is contained within one standard deviation of its mean. Here, the
68% interval is about 0.5 ± 0.0224, or 0.4776 to 0.5224.
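The solution above stops before part (c). One possible way to approach it numerically is sketched below: it computes the Normal (CLT) approximation for P(0.46 < p̂ < 0.54) and, for comparison, the exact Binomial probability that the number of Heads Y is between 230 and 270 inclusive. Treating the endpoints as included in the exact sum is our own assumption.

```python
from math import comb, sqrt
from statistics import NormalDist

n, p = 500, 0.5
sd_phat = sqrt(p * (1 - p) / n)              # about 0.0224

# Normal (CLT) approximation for P(0.46 < p_hat < 0.54)
phat = NormalDist(mu=p, sigma=sd_phat)
approx = phat.cdf(0.54) - phat.cdf(0.46)

# Exact Binomial probability of 230 <= Y <= 270 heads (p_hat = Y / n);
# for a fair coin every outcome count has probability comb(n, k) / 2**n
exact = sum(comb(n, k) for k in range(230, 271)) / 2**n

print(round(approx, 4), round(exact, 4))
```

The two numbers should be close; the small gap comes from approximating a discrete distribution by a continuous one.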

[Figure: Binomial probabilities and Normal approximation for n = 500 and p = 0.5]

The Normal approximation is not very good when np is small. Here's an example
with n = 50 and p = 0.05:

[Figure: Binomial probabilities for n = 50 and p = 0.05; Normal approximation is no good]

Exercises
6.15.
The average concentration of potassium in county soils was determined as 85
ppm, with standard deviation 30 ppm. If n = 20 samples of soils are taken,
find the probability that their average potassium concentration will be in the
“medium” range (80 to 120 ppm).

6.16.
The heights of students have a mean of 174.5 centimeters (cm) and a standard
deviation of 6.9 cm. If a random sample of 25 students is obtained, determine

a) the mean and standard deviation of $\bar{X}$;

b) the probability that the sample mean will fall between 172.5 and 175.8
cm;
c) the 70th percentile of the $\bar{X}$ distribution.

6.17.
The measurements of an irregular signal’s frequency have mean of 20 Hz and
standard deviation of 5 Hz. 10 independent measurements are done.

a) Find the probability that the average of these 10 measurements will be

within 1 unit of the theoretical mean 20.
b) How many measurements should be done to ensure that the probability
in part (a) equals 0.1?

6.18.
A process yields 10% defective items. If 200 items are randomly selected from
the process, what is the probability that the sample proportion of defectives

a) exceeds 13%?
b) is less than 8%?
Chapter 7

Descriptive statistics

The goal of statistics is somewhat complementary to that of probability.

Probability answers the question of what data are likely to be obtained from
known probability distributions.
Statistics answers the opposite question: what kind of probability distribu-
tions are likely to have generated the data at hand?
Descriptive statistics are the ways to summarize the data set, to represent
its tendencies in a concise form and/or describe them graphically.

7.1 Sample and population

We will usually refer to the given data set as a sample and denote its entries
as X1 , X2 , ..., Xn . The objects whose measurements are represented by Xi
are often called experimental units and are usually assumed to be sampled
randomly from a larger population of interest. The probability distribution
of Xi is then referred to as population distribution.

Definition 7.1. Population and sample

Population is the collection of all objects of interest. Sample is the collection

of objects from the population picked for the study.
A simple random sample (SRS) is a sample for which each object in the
population has the same probability to be picked as any other object, and is
picked independently of any other object.


Example 7.1.

a) We would like to learn the public opinion regarding a tax reform. We

set up phone interviews with n = 1000 people. Here, the population
(which we really would like to learn about) is all U.S. adults, and the
sample (which are the objects, or individuals we actually get), is the
1000 people contacted.

For some really important matters, the U.S. Census Bureau tries to
reach every single American, but this is practically impossible.
b) The gas mileage of a car is investigated. Suppose that we drive n = 20
times using a full tank of gas, until it’s empty, and calculate the average
gas mileage for each trip. Here, the population is all potential trips
between fillups on this car to be made (under usual driving conditions)
and the sample is the 20 trips actually made.

Usually, we require that our sample be a simple random sample (SRS)

so that we can extend our findings to the entire population of interest. This
means that no part of the population is preferentially selected for, or excluded
from the study.
Bias often occurs when the sample is not an SRS. For example, self-
selection bias occurs when subjects volunteer for the study. Medical studies
that pay for participation may attract lower-income volunteers. A question-
naire issued by a website will represent only the people that visit that website
etc.
The ideal way to implement an SRS is to create a list of all objects in a
population, and then use a random number generator to pick the objects to
be sampled. In practice, this is very difficult to accomplish.
In the future, we will always assume that we are dealing with an SRS,
unless otherwise noted. Thus, we will obtain a sequence of independent
and identically distributed (IID) random variables X1 , X2 , ..., Xn from the
population distribution we are studying.

7.2 Graphical summaries

The most popular graphical summary for a numeric data set is a histogram.

Definition 7.2.

The histogram of the data set X1 , X2 , . . . , Xn is a bar chart representing the

classes (or bins) on the x-axis and frequencies (or proportions) on the y-axis.
Bins should be of equal width so that all bars would visually be on the
same level.¹ The construction of a histogram is easier to show by example.

Example 7.2.
Old Faithful is a famous geyser in Yellowstone National Park. The data
recorded represent waiting times between eruptions (in minutes). There are
n = 272 observations. The first ten observations are 79, 54, 74, 62, 85, 55,
88, 85, 51, 85. Using the bins 41-45, 46-50 etc we get
Bin 41-45 46-50 51-55 56-60 61-65 66-70 71-75
Count 4 22 33 24 14 10 27
Bin 76-80 81-85 86-90 91-95 96-100
Count 54 55 23 5 1
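To illustrate how the counts in such a table are obtained, the sketch below bins only the ten observations listed above (not the full data set of 272 values, which is not reproduced here). The helper `bin_label` is a hypothetical name of ours.

```python
from collections import Counter

first_ten = [79, 54, 74, 62, 85, 55, 88, 85, 51, 85]  # waiting times, minutes

def bin_label(x, start=41, width=5):
    """Label of the bin (for example '51-55') containing the integer value x."""
    lower = start + width * ((x - start) // width)
    return f"{lower}-{lower + width - 1}"

counts = Counter(bin_label(x) for x in first_ten)
for label in sorted(counts, key=lambda s: int(s.split("-")[0])):
    print(label, counts[label])
```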

Figure 7.1: histogram of Old Faithful data

The choice of bins of course affects the appearance of a histogram. With
too many bins, the graph becomes hard to read, and with too few bins, a
lot of information is lost. We would generally recommend using more bins
for larger sample sizes; but not too many bins, so that the histogram keeps a
smooth appearance. Some authors recommend the number of bins no higher
than √n where n is the sample size.

¹ Bins can be of unequal width but then some adjustment to their heights must be
made.

Figure 7.2: histograms of Old Faithful data: bins too wide, bins too narrow

Describing the shape of a histogram, we may note its features as being
symmetric, or maybe skewed (left or right); having one “bulge” (mode), that
is, a unimodal distribution, or two modes, that is, a bimodal distribution, etc.
The Old Faithful data have a bimodal shape. Some skewed histogram shapes
are shown in Fig. 7.3.

[Panels: Left skewed, Symmetric, Right skewed]

Figure 7.3: Symmetric and skewed shapes

7.3 Numerical summaries

7.3.1 Sample mean and variance
The easiest and most popular summary for a data set is its mean $\bar{X}$. The
mean is a measure of location for the data set. We often also need a measure
of spread. One such measure is the sample standard deviation.

Definition 7.3. Sample variance and standard deviation

The sample variance is denoted as $S^2$ and equals

$$S^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1} = \frac{\sum_{i=1}^{n} X_i^2 - n\bar{X}^2}{n - 1} \qquad (7.1)$$

Sample standard deviation S is the square root of $S^2$.
A little algebra shows that both expressions in formula (7.1) are
equivalent. The denominator in the formula is n − 1, which is called degrees of
freedom. A simple explanation is that the calculation starts with n numbers
and is then constrained by finding $\bar{X}$, thus n − 1 degrees of freedom are left.
Note that if n = 1 then the calculation of sample variance is not possible.
The sample mean and standard deviation are counterparts of the mean
and standard deviation of a probability distribution. Further we will use
them as the estimates of the unknown mean and standard deviation of a
probability distribution (or a population).

Example 7.3.
The heights of the last 8 US presidents are (in cm): 185, 182, 188, 188, 185,
177, 182, 193. Find the mean and standard deviation of these heights.
Solution. The average height is $\bar{X} = 185$. To make the calculations more
compact, let's subtract 180 from each number, as it will not affect the standard
deviation: 5, 2, 8, 8, 5, −3, 2, 13, and $\bar{X} = 5$. Then, $\sum X_i^2 = 364$ and we get
$$S^2 = \frac{364 - 8(5^2)}{8 - 1} = 23.43 \quad \text{and} \quad S = \sqrt{23.43} = 4.84.$$
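The same numbers can be obtained with Python's `statistics` module, whose `variance` and `stdev` functions use the n − 1 denominator of formula (7.1). The second half of the sketch repeats the shortcut calculation on the shifted data; variable names are ours.

```python
import statistics

heights = [185, 182, 188, 188, 185, 177, 182, 193]  # cm

mean = statistics.mean(heights)
s2 = statistics.variance(heights)   # sample variance, divides by n - 1
s = statistics.stdev(heights)       # sample standard deviation

print(mean, round(s2, 2), round(s, 2))  # 185 23.43 4.84

# Shortcut form of (7.1) on the data shifted by 180
shifted = [h - 180 for h in heights]
n = len(shifted)
shortcut = (sum(x * x for x in shifted) - n * statistics.mean(shifted) ** 2) / (n - 1)
print(round(shortcut, 2))  # 23.43
```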

7.3.2 Percentiles
Definition 7.4.

The pth percentile (or quantile) of a data set is a number q such that p% of
the entire sample are below this number. It can be calculated as the
r = ((n + 1)p/100)th smallest number in the sample.
The algorithm for calculating the pth percentile is then as follows.²
a) Order the sample, from smallest to largest, denote these as
$X_{(1)}, X_{(2)}, \dots, X_{(n)}$.
b) Calculate r = (n + 1)p/100, let k = ⌊r⌋ be the integer part of r.
c) If interpolation is desired, take $X_{(k)} + (r - k)[X_{(k+1)} - X_{(k)}]$.
If interpolation is not needed, take $X_{(r^*)}$ where r* is the rounded value
of r.
Generally, if the sample size n is large, the interpolation is not needed.³
The 50-th percentile is known as median. It is, along with the mean, a
measure of center of the data set.

Example 7.4.
Back to the example of US presidents: find the median and 22nd percentile
of the presidents’ heights.

Solution. The ordered data are 177, 182, 182, 185, 185, 188, 188, 193. For
n = 8 we have two “middle observations”: ranked 4th and 5th, these are
both 185. Thus, the median is 185 (coincidentally, we have seen that $\bar{X} = 185$
also).
To find the 22nd percentile, take r = (n + 1)p/100 = 9(0.22) = 1.98, round it to
2. Then, take the 2nd ranked observation, which is 182.
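The percentile algorithm of Definition 7.4 translates into a short function. This is a sketch under stated assumptions: the notes do not say what to do when r falls below 1 or above n, nor how to round a value such as 4.5, so the clamping and the round-half-up rule below are our own choices, as is the function name `percentile`.

```python
import math

def percentile(data, p, interpolate=True):
    """p-th percentile using r = (n + 1) * p / 100, following steps (a)-(c).
    Clamping r to [1, n] is our own assumption for very small or large p."""
    xs = sorted(data)                       # step (a): order the sample
    n = len(xs)
    r = (n + 1) * p / 100                   # step (b)
    r = min(max(r, 1), n)                   # keep the rank inside 1..n
    if not interpolate:
        r_star = math.floor(r + 0.5)        # round half up (assumption)
        return xs[r_star - 1]               # ranks are 1-based
    k = math.floor(r)
    if k == n:                              # no X_(k+1) to interpolate towards
        return xs[-1]
    return xs[k - 1] + (r - k) * (xs[k] - xs[k - 1])   # step (c)

heights = [185, 182, 188, 188, 185, 177, 182, 193]
print(percentile(heights, 50))                      # 185.0
print(percentile(heights, 22, interpolate=False))   # 182
```

As the software note in the text warns, other packages may use a different convention for the fractional rank, so their results can differ slightly from this function.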

Mean and median

The mean and median are popular measures of center. For a symmetric
data set, both give roughly the same result. However, for a skewed data set,
they might produce fairly different results. For the right-skewed distribution,
mean > median, and for the left-skewed, mean < median.
The median is resistant to outliers. This means that the unusually high
or low observations do not greatly affect the median. The mean $\bar{X}$ is not
resistant to outliers.

² see e.g. http://www.itl.nist.gov/div898/handbook/prc/section2/prc252.htm
³ Software note: different books and software packages may have different ways to
interpret the fractional value of (n + 1)p/100, so the percentile results might vary.

Mean of a function
We can define the mean of any function g of our data as
$$\overline{g(X)} = \frac{g(X_1) + g(X_2) + \dots + g(X_n)}{n}$$
Similar to the properties of the expected values (see Theorem 3.2), we have
the following properties:
a) $\overline{aX + b} = a\bar{X} + b$
b) but, generally, $\overline{g(X)} \ne g(\bar{X})$
c) For sample standard deviation, $S_{aX+b} = |a|\,S_X$

Exercises
7.1.
The temperature data one morning from different weather stations in the
vicinity of Socorro were
71.9, 73.7, 72.3, 74.6, 72.8, 67.5, 72.0 (in °F)
a) Find the mean and standard deviation of temperatures
b) Find the median and 86th percentile.
c) Suppose that the last measurement came from Magdalena Ridge and
became equal to 41.7 instead of 72.0. How will this affect the mean
and the median, respectively?
d) Re-calculate the above answers if the temperature is expressed in Celsius.
[Hint: you do not have to do it from scratch!]
7.2.
The heights of the last 20 US presidents are, in cm: 185, 182, 188, 188, 185,
177, 182, 193, 183, 179, 175, 188, 182, 178, 183, 180, 182, 178, 170, 180.

a) Make a histogram of the heights, choosing bins wisely.

b) Calculate mean and the median, compare. How do these relate to the
shape of the histogram?

7.3.
The permeabilities of 12 oil pumping locations, in millidarcies, are: 0.07,
0.17, 0.06, 0.09, 0.17, 0.18, 0.04, 0.07, 0.02, 0.57, 0.71, 0.05.

a) Make a histogram of the permeabilities, choosing bins wisely.

b) Calculate mean and the median, compare. How do these relate to the
shape of the histogram?
c) Find standard deviation of permeabilities.

7.4.
Several runners have completed a 1 mile race, with these results: 4.35, 4.51,
4.18, 4.56, 4.10, 3.75 (in minutes).

a) Find the average time of these runners.

b) Find the average speed (note: you will have to find each runner’s indi-
vidual speed, first).
c) Compare the answers to (a) and (b): why is mean speed not equal to
the inverse of mean running time?
Chapter 8

Statistical inference

8.1 Introduction
In previous sections we emphasized properties of the sample mean. In this
section we will discuss the problem of estimation of population parameters, in
general. A point estimate of some population parameter θ is a single value
θ̂ of a statistic. For example, the value $\bar{X}$ is the point estimate of the population
parameter µ. Similarly, $\hat{p} = X/n$ is a point estimate of the true proportion p
in a binomial experiment.
Statistical inference deals with the question: can we infer something about
the unknown population parameters (e.g., µ, σ or p)? Two major tools
for statistical inference are confidence intervals (they complement a point
estimate with a margin of error) and hypothesis tests that try to prove some
statement about the parameters.

8.1.1 Unbiased Estimation

What are the properties of desirable estimators? We would like the sampling
distribution of θ̂ to have a mean equal to the parameter estimated. An
estimator possessing this property is said to be unbiased.
Definition 8.1.
A statistic θ̂ is said to be an unbiased estimator of the parameter θ if

E (θ̂) = θ.


The unbiased estimators are correct “on average”, while actual samples
yield results higher or lower than the true value of the parameter.
On the other hand, biased estimators would consistently overestimate or
underestimate the target parameter.
Example 8.1.
One reason that the sample variance $S^2 = \sum (X_i - \bar{X})^2/(n-1)$ is divided by
n − 1 (instead of n) is the unbiasedness property. Indeed, it can be shown
that $E(S^2) = \sigma^2$. However, $E(S) \neq \sigma$.
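
The unbiasedness of the sample variance can be checked exactly on a tiny example. The sketch below is only an illustration: the population of four values and the sample size n = 3 are hypothetical choices, not taken from the notes. It lists every equally likely sample drawn with replacement and averages the two candidate estimators over all of them.

```python
from itertools import product
from statistics import mean, pvariance, variance

# Hypothetical toy population, chosen only for illustration.
population = [1, 2, 3, 4]
n = 3  # sample size

# Every ordered sample of size n drawn with replacement is equally likely,
# so averaging over all of them gives exact expected values.
samples = list(product(population, repeat=n))

sigma2 = float(pvariance(population))                         # true variance
mean_s2 = mean(float(variance(s)) for s in samples)           # divisor n - 1
mean_biased = mean(float(pvariance(s)) for s in samples)      # divisor n

print(f"sigma^2                 = {sigma2:.4f}")
print(f"average with divisor n-1 = {mean_s2:.4f}")
print(f"average with divisor n   = {mean_biased:.4f}")
```

The average with divisor n − 1 agrees with the population variance, while the average with divisor n falls short by the factor (n − 1)/n.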

8.2 Confidence intervals

The confidence interval (CI) or interval estimate is an interval within
which we would expect to find the “true” value of the parameter.
Interval estimates, say, for population mean, are often desirable because
the point estimate $\bar{X}$ varies from sample to sample. Instead of a single
estimate for the mean, a confidence interval generates a lower and an upper
bound for the mean. The interval estimate provides a measure of uncertainty
in our estimate of the true mean µ. The narrower the interval, the more
precise is our estimate.
Confidence limits are evaluated in terms of a confidence level.¹ Although
the choice of confidence level is somewhat arbitrary, in practice 90%, 95%,
and 99% intervals are often used, with 95% being the most commonly used.
Theorem 8.1. CI for the mean
If $\bar{X}$ is the mean of a random sample of size n from a normal population
with known variance $\sigma^2$, an approximate (1 − α)100% confidence interval for
µ is given by
\[ \bar{X} - z_{\alpha/2}\frac{\sigma}{\sqrt{n}} < \mu < \bar{X} + z_{\alpha/2}\frac{\sigma}{\sqrt{n}}, \tag{8.1} \]
where $z_{\alpha/2}$ is the Z-value leaving an area of α/2 to the right.
¹ On a technical note, a 95% confidence interval does not mean that there is a 95%
probability that the interval contains the true mean. The interval computed from a given
sample either contains the true mean or it does not. Instead, the level of confidence is
associated with the method of calculating the interval. For example, for a 95% confidence
interval, if many samples are collected and a confidence interval is computed for each, in
the long run about 95% of these intervals would contain the true mean.

Proof. The Central Limit Theorem (CLT) claims that, regardless of the initial
distribution, the sample mean $\bar{X} = (X_1 + \dots + X_n)/n$ will be approximately
Normal:
\[ \bar{X} \approx \text{Normal}(\mu, \sigma^2/n) \]
for n reasonably large (usually n ≥ 30 is considered enough).
Suppose that a confidence level C = 100%(1 − α) is given. Then, find $z_{\alpha/2}$
such that
\[ P(-z_{\alpha/2} < Z < z_{\alpha/2}) = 1 - \alpha, \qquad Z \text{ is a standard Normal RV.} \]
Due to the symmetry of the Z-distribution, we need to find the z-value with the
upper tail probability α/2. That is, table area $TA(z_{\alpha/2}) = 0.5 - \alpha/2$.
Then, using the CLT, $Z \approx \frac{\bar{X} - \mu}{\sigma/\sqrt{n}}$, therefore
\[ P\left(-z_{\alpha/2} < \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} < z_{\alpha/2}\right) \approx 1 - \alpha. \]
Solving for µ, we obtain the result.

Notes:
(a) If σ is unknown, it can be replaced by S, the sample standard deviation,
with no serious loss in accuracy for the large sample case. Later, we will
discuss what happens for small samples.
(b) This CI (and many to follow) has the following structure

\[ \bar{X} \pm m \]

where m is called margin of error.
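
The long-run meaning of the confidence level can be illustrated with a small simulation. This is a sketch under stated assumptions: the "true" values mu = 10, sigma = 2, the sample size n = 30, the number of repetitions and the random seed are all hypothetical choices made for the illustration. Each repetition draws a fresh sample, forms the interval with the margin of error from (8.1), and records whether it contains mu.

```python
import random
from statistics import NormalDist, mean

random.seed(1)  # fixed seed so that the run is repeatable

mu, sigma, n = 10.0, 2.0, 30        # hypothetical "true" population values
z = NormalDist().inv_cdf(0.975)     # z_{alpha/2} for 95% confidence
m = z * sigma / n ** 0.5            # margin of error from (8.1)

reps = 2000
hits = 0
for _ in range(reps):
    # Draw a new sample and compute its mean.
    xbar = mean(random.gauss(mu, sigma) for _ in range(n))
    # Count the interval as a "hit" when it contains the true mean.
    if xbar - m < mu < xbar + m:
        hits += 1

coverage = hits / reps
print(f"fraction of intervals containing mu: {coverage:.3f}")
```

The printed fraction should be close to 0.95; any single interval either contains mu or does not.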

Example 8.2.
The drying times, in hours, of a certain brand of latex paint are
3.4 2.5 4.8 2.9 3.6 2.8 3.3 5.6
3.7 2.8 4.4 4.0 5.2 3.0 4.8
Compute the 95% confidence interval for the mean drying time. Assume that
σ = 1.

Solution. We compute $\bar{X} = 3.79$ and $z_{\alpha/2} = 1.96$
(α = 0.05, upper-tail probability = 0.025, table area = 0.5 − 0.025 = 0.475).
Then, using (8.1), the 95% C.I. for the mean is
\[ 3.79 \pm 1(1.96)/\sqrt{15} = 3.79 \pm 0.51 \]
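
The same calculation can be scripted. The helper name `z_interval` below is an illustrative choice, not something defined in these notes; it applies formula (8.1) and takes the critical value from Python's standard `statistics.NormalDist` instead of the Normal table.

```python
from statistics import NormalDist, mean

def z_interval(data, sigma, confidence=0.95):
    """Return (lower, upper) limits of the known-sigma interval (8.1)."""
    alpha = 1 - confidence
    z = NormalDist().inv_cdf(1 - alpha / 2)   # z_{alpha/2}
    xbar = mean(data)
    m = z * sigma / len(data) ** 0.5          # margin of error
    return xbar - m, xbar + m

# Drying times from Example 8.2, with sigma = 1 assumed as in the example.
drying_times = [3.4, 2.5, 4.8, 2.9, 3.6, 2.8, 3.3, 5.6,
                3.7, 2.8, 4.4, 4.0, 5.2, 3.0, 4.8]
low, high = z_interval(drying_times, sigma=1.0)
print(f"95% CI: ({low:.2f}, {high:.2f})")
```

The midpoint and half-width of the printed interval agree with 3.79 ± 0.51 up to rounding.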

Example 8.3.
The average zinc concentration recovered from a sample of zinc measurements
in 36 different locations in the river is found to be 2.6 milligrams per liter.
Find the 95% and 99% confidence intervals for the mean zinc concentration
µ. Assume that the population standard deviation is 0.3.
Solution. The point estimate of µ is $\bar{X} = 2.6$. For 95% confidence, $z_{\alpha/2} = 1.96$.
Hence, the 95% confidence interval is
\[ 2.6 - 1.96\frac{0.3}{\sqrt{36}} < \mu < 2.6 + 1.96\frac{0.3}{\sqrt{36}}, \quad\text{that is,}\quad (2.50, 2.70) \]
For a 99% confidence, $z_{\alpha/2} = 2.575$ and hence the 99% confidence interval is
\[ 2.6 - 2.575\frac{0.3}{\sqrt{36}} < \mu < 2.6 + 2.575\frac{0.3}{\sqrt{36}}, \quad\text{that is,}\quad (2.47, 2.73) \]
We see that a wider interval is required to estimate µ with a higher degree
of confidence.
Example 8.4.
An important property of plastic clays is the amount of shrinkage on drying.
For a certain type of plastic clay 45 test specimens showed an average shrink-
age percentage of 18.4 and a standard deviation of 1.2. Estimate the “true”
average shrinkage µ for clays of this type with a 95% confidence interval.
Solution. For these data, a point estimate of µ is $\bar{X} = 18.4$. The sample
standard deviation is S = 1.2. Since n is fairly large, we can replace σ by S.
Hence, 95% confidence interval for µ is
\[ 18.4 - 1.96\frac{1.2}{\sqrt{45}} < \mu < 18.4 + 1.96\frac{1.2}{\sqrt{45}}, \quad\text{that is,}\quad (18.05, 18.75) \]
Thus we are 95% confident that the true mean lies between 18.05 and 18.75.

Sample size calculations

In practice, another problem often arises: how much data should be collected
to determine an unknown parameter with a given accuracy? That is, let m be
the desired size of the margin of error, for a given confidence level 100%(1 − α):
\[ m = z_{\alpha/2}\frac{\sigma}{\sqrt{n}} \tag{8.2} \]

What is the sample size n to achieve this goal?

To do this, assume that some estimate of σ is available. Then, solving
for n,
\[ n = \left(\frac{z_{\alpha/2}\,\sigma}{m}\right)^2, \]
rounded up to a whole number.
Example 8.5.
We would like to estimate the pH of a certain type of soil to within 0.1,
with 99% confidence. From past experience, we know that the soils of this
type usually have pH in the 5 to 7 range. Find the sample size necessary to
achieve our goal.
Solution. Let us take the reported 5 to 7 range as the ±2σ range. This
way, the crude estimate of σ is (7 − 5)/4 = 0.5. For 99% confidence, we
find the upper tail area α/2 = (1 − 0.99)/2 = 0.005, thus $z_{\alpha/2} = 2.576$, and
$n = (2.576 \times 0.5/0.1)^2 \approx 166$
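
The sample size formula is easy to wrap in a small function. The name `sample_size` is a hypothetical helper introduced for this sketch; it solves (8.2) for n and rounds up, since a fractional observation cannot be collected.

```python
import math
from statistics import NormalDist

def sample_size(sigma, margin, confidence):
    """Smallest n for which z_{alpha/2} * sigma / sqrt(n) <= margin."""
    alpha = 1 - confidence
    z = NormalDist().inv_cdf(1 - alpha / 2)   # z_{alpha/2}
    return math.ceil((z * sigma / margin) ** 2)

# Soil pH setting of Example 8.5: crude sigma = 0.5, margin 0.1, 99% level.
n_needed = sample_size(sigma=0.5, margin=0.1, confidence=0.99)
print(n_needed)
```

Halving the margin of error roughly quadruples the required sample size, because n depends on the square of 1/m.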

Exercises
8.1.
In a school district, they would like to estimate the average reading rate of
first-graders. After selecting a random sample of n = 65 readers, they
obtained a sample mean of 53.4 words per minute (wpm), and a standard deviation
of 33.9 wpm. Calculate a 98% confidence interval for the average reading
rate of all first-graders in the district.

8.2.
A random sample of 200 calls initiated while driving had a mean duration
of 3.5 minutes with standard deviation 2.2 minutes. Find a 99% confidence
interval for the mean duration of telephone calls initiated while driving.

8.3.

a) Bursting strength of a certain brand of paper is supposed to have Normal
distribution with µ = 150 kPa and σ = 15 kPa. Give an interval
that contains about 95% of all bursting strength values.
b) Assuming now that the true µ and σ are unknown, the researchers
collected a sample of n = 100 paper bags and measured their bursting
strength. They obtained $\bar{X} = 148.4$ kPa and S = 18.9 kPa. Calculate
the 95% C.I. for the mean bursting strength.
c) Sketch a Normal density curve with µ = 150, σ = 15, with both of your
intervals shown on the x-axis. Compare the intervals’ widths.
8.4.
In determining the mean viscosity of a new type of motor oil, the lab needs
to collect enough observations to approximate the mean within ±0.2 SAE
grade, with 96% confidence. The standard deviation typical for this type of
measurement is 0.4. How many samples of motor oil should the lab test?
8.5.
The reaction times for a sample of 25 experienced swimmers to react to
the pistol start were measured, yielding a mean of 0.214 sec and standard
deviation 0.036 sec. Find a 95% confidence interval for the average reaction
time for all experienced swimmers.

8.3 Statistical hypotheses

Definition 8.2.
A Statistical hypothesis is an assertion or conjecture concerning one or more
populations.
The goal of a statistical hypothesis test is to make a decision about an
unknown parameter (or parameters). This decision is usually expressed in
terms of rejecting or accepting a certain value of parameter or parameters.
Some common situations to consider:
• Is the coin fair? That is, we would like to test if p = 1/2 where
p = P (Heads).

• Is the new drug more effective than the old one? In this case, we would
like to compare two parameters, e.g. the average effectiveness of the
old drug versus the new one.
In making the decision, we will compare the statement (say, p = 1/2) with
the available data and will reject the claim p = 1/2 if it contradicts the data.
In the subsequent sections we will learn how to set up and test the hypotheses
in various situations.

Null and alternative hypotheses

A statement like p = 1/2 is called the Null hypothesis (denoted by H0 ). It
expresses the idea that the parameter (or a function of parameters) is equal
to some fixed value. For the coin example, it’s
H0 : p = 1/2
and for the drug example it’s
H0 : µ1 = µ2
where µ1 is the mean effectiveness of the old drug compared to µ2 for the
new one. Alternative hypothesis (denoted by HA ) seeks to disprove the
null. For example, we may consider two-sided alternatives
HA : p ≠ 1/2 or, in the drug case, HA : µ1 ≠ µ2

8.3.1 Hypothesis tests of a population mean

A null hypothesis H0 for the population mean µ is a statement that desig-
nates the value µ0 for the population mean to be tested. It is associated with
an alternative hypothesis HA , which is a statement incompatible with the
null. A two-sided (or two-tailed) hypothesis setup is
H0 : µ = µ0 versus HA : µ ≠ µ0
for a specified value of µ0 , and a one-sided (or one-tailed) hypothesis
setup is either
H0 : µ = µ0 versus HA : µ > µ0 (right-tailed test)
or
H0 : µ = µ0 versus HA : µ < µ0 (left-tailed test)

Steps of a Hypothesis Test

a) Null Hypothesis H0 : µ = µ0
b) Alternative Hypothesis HA : µ ≠ µ0 , or HA : µ > µ0 , or HA : µ < µ0 .
c) Critical value: $z_{\alpha/2}$ for two-tailed or $z_{\alpha}$ for one-tailed test, for some
chosen significance level α. (Here, α is the false positive rate, i.e. how
often you will reject H0 that is, in fact, true.)
d) Test Statistic $z = \frac{\sqrt{n}(\bar{X} - \mu_0)}{\sigma}$
e) Decision Rule: Reject H0 if

$|z| > z_{\alpha/2}$ for two-tailed
$z > z_{\alpha}$ for right-tailed
$z < -z_{\alpha}$ for left-tailed

or, using p-value (see below), Reject H0 when p-value < α

f) Conclusion in the words of the problem.

Definition 8.3. P-values

A data set can be used to measure the plausibility of a null hypothesis H0
through the calculation of a p-value: the probability, computed under the
assumption that H0 is true, of observing a test statistic at least as extreme
as the one actually obtained. The smaller the p-value, the less
plausible is the null hypothesis. (Do not confuse the p-value with the
notation p for a proportion.)
Rejection Rule: Given the significance level α,

Reject H0 when p-value < α

otherwise Accept H0.

Calculation of P-values
For the two-sided hypothesis, P-value = 2 × P (Z > |z|).
For the right-tailed hypothesis, HA : µ > µ0 , P-value = P (Z > z)
For the left-tailed hypothesis, HA : µ < µ0 , P-value = P (Z < z)
[Figure: three panels, labeled µ ≠ µ0 , µ > µ0 and µ < µ0 ; horizontal axis from −3 to 3, vertical axis from 0.0 to 0.4]

Figure 8.1: P-value calculation for different HA

Example 8.6.
A manufacturer of sports equipment has developed a new synthetic fishing
line that he claims has a mean breaking strength of 8 kg with a standard
deviation of 0.5 kg. A random sample of 50 lines is tested and found to have
a mean breaking strength of 7.8 kg. Test the hypothesis that µ = 8 against
the alternative that µ ≠ 8. Use α = 0.01 level of significance.
Solution.
a) H0 : µ = 8
b) HA : µ ≠ 8
c) α = 0.01 and hence critical value $z_{\alpha/2} = 2.57$
d) Test statistic:
\[ z = \frac{\sqrt{n}(\bar{X} - \mu_0)}{\sigma} = \frac{\sqrt{50}(7.8 - 8)}{0.5} = -2.83 \]
e) Decision: reject H0 since |−2.83| > 2.57.

f) Conclusion: there is evidence that the mean breaking strength is not
8 kg (in fact, it’s lower).

Decision based on P-value:

Since the test in this example is two-sided, the p-value is double the tail area.

P-value = P (|Z| > 2.83) = 2 [0.5 − TA(2.83)] = 2(0.5 − 0.4977) = 0.0046

which allows us to reject the null hypothesis that µ = 8 kg at a level of
significance smaller than 0.01.
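
The steps of the z-test, including the three p-value formulas, can be sketched in a few lines. The function name `z_test` and the labels "two-sided", "greater" and "less" are illustrative choices for this sketch; the Normal tail areas come from Python's standard `statistics.NormalDist` rather than from the table.

```python
from statistics import NormalDist

def z_test(xbar, mu0, sigma, n, alternative="two-sided"):
    """Return (z, p_value) for H0: mu = mu0 with known sigma."""
    z = n ** 0.5 * (xbar - mu0) / sigma
    std_normal = NormalDist()
    if alternative == "two-sided":      # HA: mu != mu0
        p_value = 2 * (1 - std_normal.cdf(abs(z)))
    elif alternative == "greater":      # HA: mu > mu0
        p_value = 1 - std_normal.cdf(z)
    elif alternative == "less":         # HA: mu < mu0
        p_value = std_normal.cdf(z)
    else:
        raise ValueError("alternative must be 'two-sided', 'greater' or 'less'")
    return z, p_value

# Fishing line data of Example 8.6.
z, p_value = z_test(xbar=7.8, mu0=8.0, sigma=0.5, n=50)
print(f"z = {z:.2f}, p-value = {p_value:.4f}")
print("reject H0 at alpha = 0.01:", p_value < 0.01)
```

The p-value differs from the table-based 0.0046 only in the fourth decimal place, because the table rounds z to two decimals.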

Example 8.7.
A random sample of 100 recorded deaths in the United States during the
past year showed an average life span of 71.8 years. Assuming a population
standard deviation of 8.9 years, does this seem to indicate that the mean life
span today is greater than 70 years? Use a 0.05 level of significance.
Solution.
a) H0 : µ = 70 years.
b) HA : µ > 70 years.
c) α = 0.05 and zα = 1.645
d) Test statistic:
\[ z = \frac{\sqrt{n}(\bar{X} - \mu_0)}{\sigma} = \frac{\sqrt{100}(71.8 - 70)}{8.9} = 2.02 \]
e) Decision: Reject H0 if z > 1.645. Since 2.02 > 1.645, we reject H0 .
f) Conclusion: We conclude that the mean life span today is greater than
70 years.

Decision based on P-value:

Since the test in this example is one-sided, the desired p-value is the area to
the right of z = 2.02. Using Normal Table, we have

P-value = P (Z > 2.02) = 0.5 − 0.4783 = 0.0217.

Since the p-value is smaller than α = 0.05, we reject H0 .

Example 8.8.
The nominal output voltage for a certain electrical circuit is 130V. A random
sample of 40 independent readings on the voltage for this circuit gave a
sample mean of 128.6V and a standard deviation of 2.1V. Test the hypothesis
that the average output voltage is 130 against the alternative that it is less
than 130. Use a 5% significance level.
Solution.
a) H0 : µ = 130
b) HA : µ < 130

c) α = 0.05 and $z_{\alpha} = 1.645$, so the rejection region is z < −1.645
d) Test statistic:
\[ z = \frac{\sqrt{n}(\bar{X} - \mu_0)}{\sigma} = \frac{\sqrt{40}(128.6 - 130)}{2.1} = -4.22 \]
e) Decision: Reject H0 since −4.22 < −1.645.
f) Conclusion: We conclude that the average output voltage is less than
130.
Decision based on p-value:

Since TA(4.22) is larger than 0.4990, we can bound the tail area:
P-value = P (Z < −4.22) < 0.5 − 0.4990 = 0.001.

As a result, the evidence in favor of HA is even stronger than that suggested
by the 0.05 level of significance. (P-value is very small!)

Exercises
8.6.
It is known that the average height of US adult males is about 173 cm, with
standard deviation of about 6 cm.
Referring to Exercise 7.2, the average height of the last 20 US presidents
was 181.9 cm. Are the presidents taller than the average? Test at the level
α = 0.05 and also compute the p-value.
8.7.
Is it more difficult to reject H0 when the significance level is smaller? Suppose
that the p-value for a test was 0.023. Would you reject H0 at the level
α = 0.05? At α = 0.01?

8.4 The case of unknown σ

8.4.1 Confidence intervals
Frequently, we are attempting to estimate the mean of a population when
the variance is unknown. Suppose that we have a random sample from a
normal distribution. Then the random variable
\[ T = \frac{\bar{X} - \mu}{S/\sqrt{n}} \]
is said to have a (Student) T-distribution with n − 1 degrees of freedom.
Here, S is the sample standard deviation.

[Figure: density curves f(x) for df = 1, df = 4, df = 10 and the Z distribution; horizontal axis from −3 to 3, vertical axis from 0.0 to 0.4]

Figure 8.2: T distribution for different values of df = degrees of freedom

With σ unknown, T should be used instead of Z to construct a confidence
interval for µ. The procedure is the same as for known σ except that σ is replaced
by S and the standard normal distribution is replaced by the T-distribution.
The T-distribution is also symmetric, but has somewhat “heavier tails” than
Z. This is because of the extra uncertainty of not knowing σ.
Definition 8.4. CI for mean, unknown σ
If $\bar{X}$ and S are the mean and standard deviation of a random sample from a
normal population with unknown variance $\sigma^2$, a (1 − α)100% confidence
interval for µ is
\[ \bar{X} - t_{\alpha/2}\frac{S}{\sqrt{n}} < \mu < \bar{X} + t_{\alpha/2}\frac{S}{\sqrt{n}}, \]
where $t_{\alpha/2}$ is the t-value with n − 1 degrees of freedom leaving an area of α/2
to the right.
Normality assumption becomes more important as n gets smaller. As
a practical rule, we will not trust the confidence intervals based on small
samples (generally, n < 30) that are strongly skewed or have outliers.

On the other hand, we already noted that for large n we could simply use
Z-distribution for the C.I. calculation. This is justified by the fact that tα/2
values approach zα/2 values as n gets larger.

Example 8.9.
The contents of 7 similar containers of sulfuric acid are 9.8, 10.2, 10.4, 9.8,
10.0, 10.2 and 9.6 liters. Find a 95% confidence interval for the mean volume
of all such containers, assuming an approximate normal distribution.
Solution. The sample mean and standard deviation for the given data are
$\bar{X} = 10.0$ and S = 0.283. Using the T-Table, we find $t_{0.025} = 2.447$ for 6
degrees of freedom. Hence the 95% confidence interval for µ is
\[ 10.0 - 2.447\frac{0.283}{\sqrt{7}} < \mu < 10.0 + 2.447\frac{0.283}{\sqrt{7}}, \]
which reduces to 9.74 < µ < 10.26
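
A short script can reproduce this interval from the raw data. Python's standard library has no t-distribution, so this sketch assumes the critical value is looked up by hand in Table B and passed in; the helper name `t_interval` is an illustrative choice.

```python
from statistics import mean, stdev

def t_interval(data, t_crit):
    """Interval xbar +/- t_crit * S / sqrt(n); t_crit is read from Table B."""
    n = len(data)
    xbar = mean(data)
    s = stdev(data)                 # sample standard deviation (divisor n - 1)
    m = t_crit * s / n ** 0.5       # margin of error
    return xbar - m, xbar + m

# Container volumes from Example 8.9; t_{0.025} with df = 6 is 2.447.
volumes = [9.8, 10.2, 10.4, 9.8, 10.0, 10.2, 9.6]
low, high = t_interval(volumes, t_crit=2.447)
print(f"95% CI: ({low:.2f}, {high:.2f})")
```

Passing the table value explicitly keeps the example free of third-party packages; a statistics package with a t quantile function could supply it instead.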

Example 8.10.
A random sample of 12 graduates of a certain secretarial school typed an
average of 79.3 words per minute (wpm) with a standard deviation of 7.8
wpm. Assuming a normal distribution for the number of words typed per
minute, find a 99% confidence interval for the average typing speed for all
graduates of this school.
Solution. The sample mean and standard deviation for the given data are
$\bar{X} = 79.3$ and S = 7.8. Using the T-Table, we find $t_{0.005} = 3.106$ with 11
degrees of freedom. Hence the 99% confidence interval for µ is
\[ 79.3 - 3.106\frac{7.8}{\sqrt{12}} < \mu < 79.3 + 3.106\frac{7.8}{\sqrt{12}}, \]
which reduces to 72.31 < µ < 86.29.
We are 99% confident that the interval 72.31 to 86.29 includes the true
average typing speed for all graduates.

Table B: Critical points of the t-distribution

Columns: upper tail probability. Rows: degrees of freedom (df).

df 0.10 0.05 0.025 0.01 0.005 0.001 0.0005
1 3.078 6.314 12.706 31.821 63.657 318.309 636.619
2 1.886 2.920 4.303 6.965 9.925 22.327 31.599
3 1.638 2.353 3.182 4.541 5.841 10.215 12.924
4 1.533 2.132 2.776 3.747 4.604 7.173 8.610
5 1.476 2.015 2.571 3.365 4.032 5.893 6.869
6 1.440 1.943 2.447 3.143 3.707 5.208 5.959
7 1.415 1.895 2.365 2.998 3.499 4.785 5.408
8 1.397 1.860 2.306 2.896 3.355 4.501 5.041
9 1.383 1.833 2.262 2.821 3.250 4.297 4.781
10 1.372 1.812 2.228 2.764 3.169 4.144 4.587
11 1.363 1.796 2.201 2.718 3.106 4.025 4.437
12 1.356 1.782 2.179 2.681 3.055 3.930 4.318
13 1.350 1.771 2.160 2.650 3.012 3.852 4.221
14 1.345 1.761 2.145 2.624 2.977 3.787 4.140
15 1.341 1.753 2.131 2.602 2.947 3.733 4.073
16 1.337 1.746 2.120 2.583 2.921 3.686 4.015
17 1.333 1.740 2.110 2.567 2.898 3.646 3.965
18 1.330 1.734 2.101 2.552 2.878 3.610 3.922
19 1.328 1.729 2.093 2.539 2.861 3.579 3.883
20 1.325 1.725 2.086 2.528 2.845 3.552 3.850
21 1.323 1.721 2.080 2.518 2.831 3.527 3.819
22 1.321 1.717 2.074 2.508 2.819 3.505 3.792
23 1.319 1.714 2.069 2.500 2.807 3.485 3.768
24 1.318 1.711 2.064 2.492 2.797 3.467 3.745
25 1.316 1.708 2.060 2.485 2.787 3.450 3.725
30 1.310 1.697 2.042 2.457 2.750 3.385 3.646
40 1.303 1.684 2.021 2.423 2.704 3.307 3.551
60 1.296 1.671 2.000 2.390 2.660 3.232 3.460
120 1.289 1.658 1.980 2.358 2.617 3.160 3.373
∞ 1.282 1.645 1.960 2.326 2.576 3.090 3.291

8.4.2 Hypothesis test

When sample sizes are small and population variance is unknown, use the
test statistic
\[ t = \frac{\sqrt{n}(\bar{X} - \mu_0)}{S}, \]
with n − 1 degrees of freedom.

Steps of a Hypothesis Test

a) Null Hypothesis H0 : µ = µ0
b) Alternative Hypothesis HA : µ ≠ µ0 , or HA : µ > µ0 , or HA : µ < µ0 .
c) Critical value: $t_{\alpha/2}$ for two-tailed or $t_{\alpha}$ for one-tailed test.
d) Test Statistic $t = \frac{\sqrt{n}(\bar{X} - \mu_0)}{S}$ with n − 1 degrees of freedom
e) Decision Rule: Reject H0 if

$|t| > t_{\alpha/2}$ for two-tailed
$t > t_{\alpha}$ for right-tailed
$t < -t_{\alpha}$ for left-tailed

or, using p-value, Reject H0 when p-value < α

f) Conclusion.

Example 8.11.
Engine oil was stated to have the mean viscosity of µ0 = 85.0. A sample of
n = 25 viscosity measurements resulted in a sample mean of $\bar{X} = 88.3$ and a
sample standard deviation of S = 7.49. What is the evidence that the mean
viscosity is not as stated? Use α = 0.1.

Solution.
a) H0 : µ = 85.0
b) HA : µ ≠ 85.0
c) α = 0.1 and $t_{\alpha/2} = 1.711$ with 24 degrees of freedom.
d) Test statistic:
\[ t = \frac{\sqrt{n}(\bar{X} - \mu_0)}{S} = \frac{\sqrt{25}(88.3 - 85.0)}{7.49} = 2.203 \]
e) Decision: Reject H0 since 2.203 > 1.711.
f) Conclusion: We conclude that the average viscosity is not equal to 85.0.
Decision based on P-value:
Since the test in this example is two-sided, the desired p-value is twice the
tail area. Therefore, using t-table with df = 24, we have
P-value = 2 × P (T > 2.203) = 2(0.0187) = 0.0374,
which allows us to reject the null hypothesis that µ = 85 at a level of
significance smaller than 0.1.
Conclusion: In summary, we conclude that there is fairly strong evidence
that the mean viscosity is not equal to 85.0.
Example 8.12.
A sample of n = 20 cars driven under varying highway conditions achieved
fuel efficiencies with a sample mean of $\bar{X} = 34.271$ miles per gallon (mpg)
and a sample standard deviation of S = 2.915 mpg. Test the hypothesis that
the average highway mpg is less than 35 with α = 0.05.
Solution.
a) H0 : µ = 35.0
b) HA : µ < 35.0
c) α = 0.05 and tα = 1.729 with 19 degrees of freedom.
d) Test statistic:
\[ t = \frac{\sqrt{n}(\bar{X} - \mu_0)}{S} = \frac{\sqrt{20}(34.271 - 35.0)}{2.915} = -1.119 \]
e) Decision: since −1.119 > −1.729, we do not reject H0 .
f) Conclusion: There is no evidence that the average highway mpg is any
less than 35.0
Decision based on P-value:
P-value = P (T < −1.119) = P (T > 1.119) > 0.10,
(using df = 19), thus p-value > α = 0.05, do not reject H0 .
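
The decision rule for the one-sample t-test can be sketched as follows. The function name `t_test` and the alternative labels are hypothetical choices for this illustration, and the critical value is assumed to be read from Table B, since the Python standard library does not provide t quantiles.

```python
def t_test(xbar, s, n, mu0, t_crit, alternative):
    """Return (t, reject) for H0: mu = mu0, given a Table B critical value."""
    t = n ** 0.5 * (xbar - mu0) / s
    if alternative == "two-sided":      # t_crit should be t_{alpha/2}
        reject = abs(t) > t_crit
    elif alternative == "greater":      # t_crit should be t_{alpha}
        reject = t > t_crit
    elif alternative == "less":         # t_crit should be t_{alpha}
        reject = t < -t_crit
    else:
        raise ValueError("alternative must be 'two-sided', 'greater' or 'less'")
    return t, reject

# Example 8.11: two-sided, alpha = 0.1, df = 24, t_{0.05} = 1.711.
t_visc, reject_visc = t_test(88.3, 7.49, 25, 85.0, 1.711, "two-sided")
# Example 8.12: left-tailed, alpha = 0.05, df = 19, t_{0.05} = 1.729.
t_mpg, reject_mpg = t_test(34.271, 2.915, 20, 35.0, 1.729, "less")

print(f"viscosity: t = {t_visc:.3f}, reject H0: {reject_visc}")
print(f"mileage:   t = {t_mpg:.3f}, reject H0: {reject_mpg}")
```

For the left-tailed case the comparison is made against the negative of the table value, matching the rule t < −tα above.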

8.4.3 Connection between Hypothesis tests and C.I.’s

We can test a two-sided hypothesis

H0 : µ = µ0 vs. HA : µ ≠ µ0

at the level α, using a confidence interval with the confidence level 100%(1 −
α). If we found the 100%(1 − α) C.I. for the mean µ, and µ0 belongs to it,
we accept H0 , otherwise we reject H0 .
This way, the C.I. is interpreted as the range of “plausible” values for µ.
The false positive rate in this case will be equal to α = 1 − C/100%

Example 8.13.
Reconsider Example 8.11. There, we had to test H0 : µ = 85.0 with the data
n = 25, $\bar{X} = 88.3$ and S = 7.49, at the level α = 0.1. Is there evidence that
the mean viscosity is not 85.0?

Solution. If we calculate a 90% C.I. (90% = 100%(1 − α)), we get
\[ 88.3 \pm 1.711\frac{7.49}{\sqrt{25}} = 88.3 \pm 2.6 \quad\text{or}\quad (85.7, 90.9) \]

Since 85.0 does not belong to this interval, there is evidence that the “true”
mean viscosity is not 85.0 (in fact, it’s higher).
We arrived at the same conclusion as in Example 8.11.
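
The agreement between the two approaches is not a coincidence: µ0 falls outside the interval exactly when |t| exceeds the critical value. The sketch below checks this on the viscosity data; the helper name `interval_and_test` is a hypothetical choice, and the critical value 1.711 is taken from Table B as in the example.

```python
def interval_and_test(xbar, s, n, mu0, t_crit):
    """Return the C.I. and both two-sided decisions about H0: mu = mu0."""
    se = s / n ** 0.5
    low, high = xbar - t_crit * se, xbar + t_crit * se
    reject_by_interval = not (low <= mu0 <= high)   # mu0 outside the C.I.
    reject_by_t = abs((xbar - mu0) / se) > t_crit   # |t| > t_{alpha/2}
    return (low, high), reject_by_interval, reject_by_t

# Viscosity data of Example 8.13 with t_{0.05} = 1.711 (df = 24).
(ci_low, ci_high), by_ci, by_t = interval_and_test(88.3, 7.49, 25, 85.0, 1.711)
print(f"90% CI: ({ci_low:.1f}, {ci_high:.1f})")
print("reject by C.I.:", by_ci, "| reject by t-statistic:", by_t)
```

Trying other values of µ0, such as 86.0, should leave the two decisions in agreement.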

8.4.4 Statistical significance vs Practical significance

Statistical significance sometimes has little to do with practical significance.
Statistical significance (i.e. a small p-value) is only concerned with the
amount of evidence to reject H0 . It does not directly reflect the size of
the effect itself. Confidence intervals are more suitable for that.
For example, in testing the effect of a new medication for lowering choles-
terol, we might find that the confidence interval for the average decrease µ
equals (1.2, 2.8) units (mg/dL). Since the C.I. contains only positive values, we
have evidence for HA : µ > 0. However, the decrease of 1.2 to 2.8 units might be
too small in practical terms to justify developing this new drug.

Exercises
8.8.
In determining the gas mileage of a new model of hybrid car, the independent
research company collected information from 14 randomly selected drivers.
They obtained the sample mean of 38.4 mpg, with the standard deviation of
5.2 mpg. Obtain a 99% C.I. for µ.
What is the meaning of µ in this problem? What assumptions are necessary
for your C.I. to be correct?
8.9.
This problem is based on the well-known Newcomb data set for the speed
of light. It contains the times (in nanoseconds) it took the light to
bounce inside a network of mirrors. The numbers given are the time recorded
minus 24,800 ns. We will only use the first ten values.
28 26 33 24 34 -44 27 16 40 -2
Some mishaps in the experimental procedure led to the two unusually low
values (−44 and −2). Calculate the 95% C.I.’s for the mean in case when
a) all the values are used
b) the two outliers are removed
Which of the intervals will you trust more and why?
8.10.
For the situation in Example 8.6 (fishing line strength), test the hypotheses
using the C.I. approach.

8.5 C.I. and hypothesis tests for comparing two population means
Two-sample problems:
• The goal of inference is to compare the response in two groups.
• Each group is considered to be a sample from a distinct population.
• The responses in each group are independent of those in the other
group.

We have two independent samples, from two distinct populations. Here
is the notation that we will use to describe the two populations:

population   Variable   Mean   Standard deviation
1            X1         µ1     σ1
2            X2         µ2     σ2

We want to compare the two population means, either by giving a confidence
interval for µ1 − µ2 or by testing the hypothesis of no difference, H0 : µ1 = µ2 .
Inference is based on two independent random samples. Here is the notation
that describes the samples:

sample   sample size   sample mean   sample st.dev.
1        n1            $\bar{X}_1$   S1
2        n2            $\bar{X}_2$   S2
If independent samples of size n1 and n2 are drawn at random from two
populations, with means µ1 and µ2 and variances $\sigma_1^2$ and $\sigma_2^2$, respectively, the
sampling distribution of the difference of the sample means, $\bar{X}_1 - \bar{X}_2$, is
approximately normal (exactly so when both populations are normal), with mean
$\mu_{\bar{X}_1 - \bar{X}_2} = \mu_1 - \mu_2$ and variance $\sigma_D^2 = \sigma_1^2/n_1 + \sigma_2^2/n_2$.
Then, the two-sample Z statistic
\[ Z = \frac{(\bar{X}_1 - \bar{X}_2) - (\mu_1 - \mu_2)}{\sigma_D} \]
has the standard normal N (0, 1) sampling distribution.

Usually, population standard deviations σ1 and σ2 are not known. We
estimate them by using sample standard deviations S1 and S2 . But then the
Z-statistic will turn into (approximately) T-statistic, with degrees of freedom
equal to the smaller of n1 − 1 or n2 − 1.
Further, if we are testing H0 : µ1 = µ2 , then µ1 − µ2 = 0. Thus, we obtain
the confidence intervals and hypothesis tests for µ1 − µ2 .

The 100%(1 − α) confidence interval for µ1 − µ2 is given by
\[ (\bar{X}_1 - \bar{X}_2) \pm t_{\alpha/2}\sqrt{\frac{S_1^2}{n_1} + \frac{S_2^2}{n_2}}, \qquad T \text{ has } df = \min(n_1, n_2) - 1 \tag{8.3} \]

Steps of a Hypothesis Test

a) Null Hypothesis H0 : µ1 = µ2
b) Alternative Hypothesis HA : µ1 ≠ µ2 , or HA : µ1 > µ2 , or HA : µ1 < µ2 .
c) Critical value: $t_{\alpha/2}$ for two-tailed or $t_{\alpha}$ for one-tailed test, for some
chosen significance level α.
d) Test Statistic $t = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{S_1^2/n_1 + S_2^2/n_2}}$
e) Decision Rule: Reject H0 if

$|t| > t_{\alpha/2}$ for two-tailed
$t > t_{\alpha}$ for right-tailed
$t < -t_{\alpha}$ for left-tailed

or, using p-value, Reject H0 when p-value < α.

P-value is calculated similarly to 1-sample T-test, but now with
df = min(n1 , n2 ) − 1.
f) Conclusion in the words of the problem.

Example 8.14.
A study of iron deficiency among infants compared samples of infants fol-
lowing different feeding regimens. One group contained breast-fed infants,
while the other group were fed a standard baby formula without any iron
supplements. Here are the data on blood hemoglobin levels at 12 months of
age:
Group n $\bar{X}$ s
Breast-fed 23 13.3 1.7
Formula 19 12.4 1.8
(a) Is there significant evidence that the mean hemoglobin level is higher
among breast-fed babies?
(b) Give a 95% confidence interval for the mean difference in hemoglobin
level between the two populations of infants.

Solution. (a) H0 : µ1 − µ2 = 0 vs HA : µ1 − µ2 > 0, where µ1 is the mean
of the Breast-fed population and µ2 is the mean of the Formula population.
The test statistic is
\[ t = \frac{13.3 - 12.4}{\sqrt{\frac{1.7^2}{23} + \frac{1.8^2}{19}}} = \frac{0.9}{0.544} = 1.654 \]
with 18 degrees of freedom. The p-value is P (T > 1.654) = 0.057. This is
not quite significant at 5% level.
(b) The 95% confidence interval is
\[ 0.9 \pm 2.101(0.544) = 0.9 \pm 1.1429 = (-0.2429, 2.0429) \]
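
The two-sample calculation can be checked from the summary statistics alone. In this sketch the function name `two_sample_t` is an illustrative choice, and the critical value 2.101 (df = 18, upper tail 0.025) is assumed to be read from Table B.

```python
def two_sample_t(x1, s1, n1, x2, s2, n2):
    """Return (t, standard error, df), with df = min(n1, n2) - 1 as in (8.3)."""
    se = (s1 ** 2 / n1 + s2 ** 2 / n2) ** 0.5
    t = (x1 - x2) / se
    df = min(n1, n2) - 1
    return t, se, df

# Hemoglobin summary statistics from Example 8.14.
t_stat, se, df = two_sample_t(13.3, 1.7, 23, 12.4, 1.8, 19)

t_crit = 2.101                      # t_{0.025} with df = 18, from Table B
diff = 13.3 - 12.4
ci_low, ci_high = diff - t_crit * se, diff + t_crit * se

print(f"t = {t_stat:.3f} with df = {df}, SE = {se:.3f}")
print(f"95% CI for mu1 - mu2: ({ci_low:.3f}, {ci_high:.3f})")
```

The interval contains 0, which is consistent with the test in part (a) not reaching significance at the 5% level.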

Standard Error
All previous formulas involving t-distribution have a common structure. For
example, (8.3) can be re-written as
\[ (\bar{X}_1 - \bar{X}_2) \pm t_{\alpha/2}\, SE_{\bar{X}_1 - \bar{X}_2}, \]
where the quantity $SE_{\bar{X}_1 - \bar{X}_2} = \sqrt{S_1^2/n_1 + S_2^2/n_2}$ is called the Standard Error.
Likewise, the one-sample confidence interval for the mean is
\[ \bar{X} \pm t_{\alpha/2}\, SE_{\bar{X}}, \]
where $SE_{\bar{X}} = S/\sqrt{n}$.
Likewise, the formulas for the t-statistic are
\[ t = \frac{\bar{X}_1 - \bar{X}_2}{SE_{\bar{X}_1 - \bar{X}_2}} \text{ for 2-sample, and } t = \frac{\bar{X} - \mu_0}{SE_{\bar{X}}} \text{ for 1-sample situation.} \]
We will see a lot of similar structure in the CI and hypothesis testing
formulas in the future. The value of standard error is often reported by the
software when you request CI’s or hypothesis tests.

8.5.1 Matched pairs

Sometimes, we are comparing data that come in pairs of matched observa-
tions. A good example of this are “before” and “after” studies. They present
the measurement of some quantity for the same set of subjects before and
after a certain treatment has been administered. Another example of this
situation is twin studies for which pairs of identical twins are selected and
one twin (at random) is given a treatment, while the other is serving as a
control (that is, does not receive any treatment, or maybe receives a fake
treatment, placebo, to eliminate psychological effects).
When the same subjects are used, we should not consider the measure-
ments independent. In this case, we would compute Difference = Before − After
or Treatment − Control and just do a one-sample test for the mean differ-
ence.

Example 8.15.

The following are the left hippocampus volumes (in cm³) for a group of
twin pairs, in which one twin is affected by schizophrenia and the other is not:

Pair number 1 2 3 4 5 6 7 8 9 10 11 12
Unaffected 1.94 1.44 1.56 1.58 2.06 1.66 1.75 1.77 1.78 1.92 1.25 1.93
Affected 1.27 1.63 1.47 1.39 1.93 1.26 1.71 1.67 1.28 1.85 1.02 1.34
Difference 0.67 -0.19 0.09 0.19 0.13 0.40 0.04 0.10 0.50 0.07 0.23 0.59

Is there evidence that the LH volumes for schizophrenia-affected people

are different from the unaffected ones?

Solution. Since the twins’ LH volumes are clearly not independent (if one is
large the other is likely to be large, too – positive correlation!), we cannot
use the 2-sample procedure.
However, we can just compute the differences (Unaffected – Affected) and
test for the mean difference to be equal to 0. That is,

H0 : µ = 0 versus HA : µ ≠ 0

where µ is the “true” average difference, and $\bar{X}$, S are computed for the
sample of differences.
Given that $\bar{X} = 0.235$ and S = 0.254, let’s test these hypotheses at
α = 0.10. We obtain $t = (0.235 - 0)/(0.254/\sqrt{12}) = 3.20$. From the T-table
with df = 11 we get p-value between 2(0.001) = 0.002 and 2(0.005) = 0.01.
Since the p-value is below α = 0.10 (and below 0.05 as well), we reject H0 ,
thus stating that there is a significant difference
between LH volumes of normal and schizophrenic people.
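
The matched-pairs procedure reduces to a one-sample calculation on the differences, which is easy to verify from the table. This sketch uses only the Python standard library; the variable names are illustrative choices.

```python
from statistics import mean, stdev

# Left hippocampus volumes (cm^3) for the 12 twin pairs of Example 8.15.
unaffected = [1.94, 1.44, 1.56, 1.58, 2.06, 1.66,
              1.75, 1.77, 1.78, 1.92, 1.25, 1.93]
affected = [1.27, 1.63, 1.47, 1.39, 1.93, 1.26,
            1.71, 1.67, 1.28, 1.85, 1.02, 1.34]

# Work with within-pair differences, not with the two columns separately.
diffs = [u - a for u, a in zip(unaffected, affected)]

n_pairs = len(diffs)
d_bar = mean(diffs)                     # sample mean of the differences
s_d = stdev(diffs)                      # sample st.dev. of the differences
t_paired = d_bar / (s_d / n_pairs ** 0.5)   # one-sample t with mu0 = 0

print(f"mean difference = {d_bar:.3f}, S = {s_d:.3f}")
print(f"t = {t_paired:.2f} with df = {n_pairs - 1}")
```

Treating the two columns as independent samples would ignore the positive correlation within pairs and would generally give a larger standard error.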

Exercises
8.11.
In studying how humans pick random objects, the subjects were presented
with a population of rectangles and used two different sampling methods.
They then calculated the average areas of the sampled rectangles for each
method. Their results were

mean st.dev. n
Method 1 10.8 4.0 16
Method 2 6.1 2.3 16

Calculate the 99% C.I. for the difference of “true” means by the two methods.
Is there evidence that the two methods produce different results?

8.12.
The sports research lab studies the effects of swimming on maximal volume
of oxygen uptake.
For 8 volunteers, the maximal oxygen uptake was measured before and after
the 6-week swimming program. The results are as follows:

Before 2.1 3.3 2.0 1.9 3.5 2.2 3.1 2.4

After 2.7 3.5 2.8 2.3 3.2 2.1 3.6 2.9

Is there evidence that the swimming program has increased the maximal
oxygen uptake?

8.13.
Visitors to an electronics website rated their satisfaction with two models
of printers/scanners, on the scale of 1 to 5. The following statistics were
obtained:

n mean st.dev.
Model A 31 3.6 1.5
Model B 65 4.2 0.9

At the level of 5%, test the hypothesis that both printers would have the
same average rating in the general population, that is, H0 : µA = µB . Also,
calculate the 95% confidence interval for the mean difference µA − µB .

8.6 Inference for Proportions

8.6.1 Confidence interval for population proportion
In this section, we will consider estimating the proportion p of items of a
certain type, or maybe some probability p. The unknown population proportion
p is estimated by the sample proportion
\[ \hat{p} = \frac{X}{n}, \]
where X is the number of items of that type (successes) in a sample of size n.
We know (from CLT, Section 6.4) that if the sample size is sufficiently large,
p̂ has an approximately normal distribution, with mean E (p̂) = p and standard
deviation
\[ \sigma_{\hat{p}} = \sqrt{\frac{p(1-p)}{n}}. \]
Based on this, we obtain

Theorem 8.2. CI for proportion

For a random sample of size n from a large population with unknown
proportion p of successes, the (1 − α)100% confidence interval for p is
\[ \hat{p} \pm z_{\alpha/2} \sqrt{\hat{p}(1-\hat{p})/n} \]

8.6.2 Test for a single proportion

To test the hypothesis H0 : p = p0 , use the z-statistic
\[ z = \frac{\hat{p} - p_0}{\sqrt{p_0(1-p_0)/n}} \]
In terms of a standard normal Z, the approximate p-value for a test of H0 is
P (Z > z) against HA : p > p0 ,
P (Z < z) against HA : p < p0 ,
2P (Z > |z|) against HA : p ≠ p0 .
In practice, Normal approximation works well when both X and n − X
are at least 10.
Example 8.16.
The French naturalist Count Buffon once tossed a coin 4040 times and ob-
tained 2048 heads. Test the hypothesis that the coin was balanced.

Solution. To assess whether the data provide evidence that the coin was not
balanced, we test H0 : p = 0.5 versus HA : p ≠ 0.5.
The test statistic is
\[ z = \frac{\hat{p} - p_0}{\sqrt{p_0(1-p_0)/n}} = \frac{0.5069 - 0.50}{\sqrt{0.50(1-0.50)/4040}} = 0.88 \]
From Z chart we find P (Z < 0.88) = 0.8106. Therefore, the p-value is 2(1 −
0.8106) = 0.38. The data are compatible with balanced coin hypothesis.
Now we will calculate a 99% confidence interval for p. From the normal table,
zα/2 = 2.576. Hence, the 99% CI for p is
\[ \hat{p} \pm z_{\alpha/2}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}} = 0.5069 \pm 2.576\sqrt{\frac{(0.5069)(1-0.5069)}{4040}} = 0.5069 \pm (2.576)(0.00786) \]
\[ = 0.5069 \pm 0.0202 = (0.4867,\ 0.5271) \]
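As a check on Example 8.16, here is a minimal Python sketch of the one-proportion z-test and the 99% confidence interval. It uses `statistics.NormalDist` from the standard library in place of the Z chart; the variable names are our own choice, not notation from the notes.

```python
import math
from statistics import NormalDist

n, heads = 4040, 2048      # Buffon's coin tosses
p0 = 0.5                   # value of p under H0
p_hat = heads / n

# z-statistic uses p0 (not p_hat) in the standard deviation
z = (p_hat - p0) / math.sqrt(p0 * (1 - p0) / n)
p_value = 2 * (1 - NormalDist().cdf(abs(z)))      # two-sided alternative

# 99% confidence interval uses p_hat in the standard error
z_crit = NormalDist().inv_cdf(1 - 0.01 / 2)       # about 2.576
margin = z_crit * math.sqrt(p_hat * (1 - p_hat) / n)
ci = (p_hat - margin, p_hat + margin)

print(f"z = {z:.2f}, p-value = {p_value:.2f}")
print(f"99% CI: ({ci[0]:.4f}, {ci[1]:.4f})")
```

Note the two different standard deviations: the test plugs in p0 because it is computed under H0, while the interval has no hypothesized value and must use p̂.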

Sample size computation

To set up a study (e.g. opinion poll) with a guarantee not to exceed a certain
maximum amount of error, we can solve for n in the formula for error margin m:
\[ m = z_{\alpha/2}\sqrt{\hat{p}(1-\hat{p})/n}, \quad\text{therefore}\quad n = \hat{p}(1-\hat{p})\left(\frac{z_{\alpha/2}}{m}\right)^2 \]
Since p̂ is not known prior to the study (a “Catch-22” situation), we might
try to find n that will guarantee the desired maximum error margin m, no
matter what p is. It turns out that using p = 1/2 is the worst possible case,
i.e. produces the maximum margin of error, because p(1 − p) is largest at
p = 1/2. Thus, we should use
\[ n = \frac{1}{4}\left(\frac{z_{\alpha/2}}{m}\right)^2 \]
if p is completely unknown, or
\[ n = p^*(1-p^*)\left(\frac{z_{\alpha/2}}{m}\right)^2 \]
if some estimate p∗ of p is available.
Example 8.17.
How many people should be polled in order to provide a 98% margin of error
equal to ±1%?
Solution. Since we do not have a prior knowledge of p, use
\[ n = \frac{1}{4}\left(\frac{z_{\alpha/2}}{m}\right)^2 = \frac{1}{4}\left(\frac{2.33}{0.01}\right)^2 = 13572.25, \]
which we round up to 13573 people!
Note that we converted m from 1% to 0.01.
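The sample size rule can be wrapped in a small helper. This is a sketch only: the function name `sample_size` and its arguments are hypothetical, and the value p_guess = 0.2 in the second call is an invented illustration of the case when a prior estimate of p is available.

```python
import math

def sample_size(margin, z, p_guess=0.5):
    """Smallest n for which z * sqrt(p(1-p)/n) does not exceed the margin.

    p_guess = 0.5 is the worst case, to be used when p is completely unknown.
    """
    return math.ceil(p_guess * (1 - p_guess) * (z / margin) ** 2)

# Example 8.17: 98% margin of error of 1%, with z = 2.33 taken from the table
n_poll = sample_size(margin=0.01, z=2.33)

# Hypothetical variation: the same poll when p is believed to be near 0.2
n_poll_prior = sample_size(margin=0.01, z=2.33, p_guess=0.2)

print(n_poll, n_poll_prior)
```

Rounding up (rather than to the nearest integer) keeps the guarantee that the margin of error does not exceed m.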

8.6.3 Comparing two proportions*

We will call the two groups being compared Population 1 and Population 2,
with population proportions of successes p1 and p2 . Here is the notation we
will use in this section:

Population pop. prop. sample # successes sample prop.

1 p1 n1 X1 p̂1 = X1 /n1
2 p2 n2 X2 p̂2 = X2 /n2
To compare the two proportions, we use the difference between the two sample
proportions: p̂1 − p̂2 . Therefore, when n1 and n2 are large, p̂1 − p̂2 is
approximately normal with mean µ = p1 − p2 and standard deviation
\[ \sqrt{\frac{p_1(1-p_1)}{n_1} + \frac{p_2(1-p_2)}{n_2}}. \]
Note that for unknown p1 and p2 we replace them by p̂1 and p̂2 respectively.
Definition 8.5. Inference for two proportions
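The (1 − α)100% confidence interval for p1 − p2 is
\[ \hat{p}_1 - \hat{p}_2 \pm z_{\alpha/2}\sqrt{\frac{\hat{p}_1(1-\hat{p}_1)}{n_1} + \frac{\hat{p}_2(1-\hat{p}_2)}{n_2}} \]

To test the hypothesis H0 : p1 − p2 = 0, we use the test statistic
\[ z = \frac{\hat{p}_1 - \hat{p}_2}{SE_{\hat{p}}}, \]
where
\[ SE_{\hat{p}} = \sqrt{\hat{p}(1-\hat{p})\left(\frac{1}{n_1} + \frac{1}{n_2}\right)} \quad\text{and}\quad \hat{p} = \frac{X_1 + X_2}{n_1 + n_2}. \]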

Example 8.18.
To test the effectiveness of a new pain relieving drug, 80 patients at a clinic
were given a pill containing the drug and 80 others were given a placebo.
At the 0.01 level of significance, what can we conclude about the effectiveness
of the drug if 56 of the patients in the first group felt a beneficial effect while
38 of those who received the placebo felt a beneficial effect?
Solution. H0 : p1 − p2 = 0 and HA : p1 − p2 > 0
z = 2.89, where p̂1 = 56/80 = 0.7, p̂2 = 38/80 = 0.475 and p̂ = (56 + 38)/(80 + 80) = 0.5875

P-value = P (Z > 2.89) = 1 − P (Z < 2.89) = 0.0019

Since the p-value is less than 0.01, the null hypothesis must be rejected, so
the drug is effective.
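The arithmetic behind z = 2.89 in Example 8.18 is not shown in the solution; a short Python sketch of the pooled two-proportion test fills it in. The variable names are ours, and `statistics.NormalDist` stands in for the Z chart.

```python
import math
from statistics import NormalDist

x1, n1 = 56, 80    # drug group: number with beneficial effect, group size
x2, n2 = 38, 80    # placebo group

p1_hat, p2_hat = x1 / n1, x2 / n2
p_pooled = (x1 + x2) / (n1 + n2)     # pooled estimate, valid under H0: p1 = p2

se = math.sqrt(p_pooled * (1 - p_pooled) * (1 / n1 + 1 / n2))
z = (p1_hat - p2_hat) / se
p_value = 1 - NormalDist().cdf(z)    # one-sided alternative HA: p1 - p2 > 0

print(f"pooled p = {p_pooled:.4f}, SE = {se:.4f}, z = {z:.2f}, p-value = {p_value:.4f}")
```

The pooled estimate is used only for the test; the confidence interval in Definition 8.5 keeps the two sample proportions separate.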

Exercises
8.14.
Suppose that a nutritionist claims that at least 75% of the preschool children
in a certain country have protein deficient diets, and that a sample survey
reveals that 206 preschool children in a sample of 300 have protein deficient
diets. Test the claim at the 0.02 level of significance. Also, compute a 98%
confidence interval.

8.15.
In a survey of 200 office workers, 165 said they were interrupted three or
more times an hour by phone messages, faxes etc. Find and interpret a
90% confidence interval for the population proportion of workers who are
interrupted three or more times an hour.

8.16.
You would like to design a poll to determine what percent of your peers
volunteer for charities. You have no clear idea of what the value of p is going
to be like, and you’ll be satisfied with the 90% margin of error equal to ±10%.
Find the sample size needed for your study.

8.17.
In random samples of 200 tractors from one assembly line and 400 tractors
from another, there were, respectively, 16 tractors and 20 tractors which
required extensive adjustments before they could be shipped. At the 5%
level of significance, can we conclude that there is a difference in the quality
of the work of the two assembly lines?

Chapter Exercises
For each of the questions involving hypothesis tests, state the null and alter-
native hypotheses, compute the test statistic, determine the p-value, make
the decision and summarize the results in plain English. Use α = 0.05 unless
otherwise specified.

8.18.
Two brands of batteries are tested and their voltages are compared. The
summary statistics are below. Find and interpret a 95% confidence interval
for the true difference in means.

mean st.dev. n
Brand 1 9.2 0.3 25
Brand 2 8.9 0.6 27

8.19.
You are studying yield of a new variety of tomato. In the past, yields of
similar types of tomato have shown a standard deviation of 8.5 lbs per plant.
You would like to design a study that will determine the average yield within
a 90% error margin of ±2 lbs. How many plants should you sample?

8.20.
A biologist knows that the average length of a leaf of a certain full-grown
plant is 4 inches. A sample of 45 leaves from the plants that were given a
new type of plant food had an average length of 4.2 inches, with the standard
deviation of 0.6 inches. Is there reason to believe that the new plant food
is responsible for a change in the average growth of leaves? Use α = 0.02.
Would your conclusion have changed if you used α = 0.05?

8.21.
A job placement director claims that mean starting salary for nurses is
$38,000. A random sample of 10 nurses’ salaries has a mean $35,450 and
a standard deviation of $4,700. Is there enough evidence to reject the direc-
tor’s claim at α = 0.01?

8.22.
College Board claims that in 2010, public four-year colleges charged, on
average, $7,605 per year in tuition and fees for in-state students. A sample
of 20 public four-year colleges collected in 2011 indicated a sample mean
of $8,039 and the sample standard deviation was $1,950. Is there sufficient
evidence to conclude that the average in-state tuition has increased?

8.23.
The weights of grapefruit follow a normal distribution. A random sample of
12 new hybrid grapefruit had a mean weight of 1.7 pounds with standard
deviation 0.24 pounds. Find a 95% confidence interval for the mean weight
of the population of the new hybrid grapefruit.

8.24.
The Mountain View Credit Union claims that the average amount of money
owed on their car loans is $7,500. Suppose a random sample of 45 loans shows
the average amount owed equals $8,125, with standard deviation $4,930.
Does this indicate that the average amount owed on their car loans is not
$7,500? Use a 1% level of significance.

8.25.
An overnight package delivery service has a promotional discount rate in
effect this week only. For several years the mean weight of a package delivered
by this company has been 10.7 ounces. However, a random sample of 12
packages mailed this week gave the following weights in ounces:
12.1 15.3 9.5 10.5 14.2 8.8 10.6 11.4 13.7 15.0 9.5 11.1

Use a 1% level of significance to test the claim that the packages are
averaging more than 10.7 ounces during the discount week.

8.26.
Some people claim that during US elections, the taller of the two major party
candidates tends to prevail. Here are some data on the last 15 elections
(heights are in cm).
Year 2008 2004 2000 1996 1992 1988 1984 1980
Winning candidate 185 182 182 188 188 188 185 185
Losing candidate 175 193 185 187 188 173 180 177

Year 1976 1972 1968 1964 1960 1956 1952

Winning candidate 177 182 182 193 183 179 179
Losing candidate 183 185 180 180 182 178 178

Test the hypothesis that the winning candidates tend to be taller, on

average.

8.27.
An item in USA Today reported that 63% of Americans owned a mobile
browsing device. A survey of 143 employees at a large school showed that
85 owned a mobile browsing device. At α = 0.02, test the claim that the
percentage is the same as stated in USA Today.

8.28.
A poll by CNN revealed that 47% of Americans approve of the job perfor-
mance of the President. The poll was based on a random sample of 537
adults.

a) Find the 95% margin of error for this poll.

b) Based on your result in part (a), test the hypothesis H0 : p = 0.5

where p is the proportion of all American adults that approve of the
job performance of the President. Do not compute the test statistic and
p-value.

c) Would you have also reached the same conclusion for H0 : p = 0.45?

8.29.
Find a poll cited in a newspaper, web site or other news source, with a
mention of the sample size and the margin of error. (For example,
rasmussenreports.com frequently discusses its polling methods.) Confirm the
margin of error presented by the pollsters, using your own calculations.
Chapter 9

Linear Regression

In science and engineering, there is often a need to investigate the relationship
between two continuous random variables.
Suppose that, for every case observed, we record two variables, X and Y .
A linear relationship between X and Y means that E (Y ) = b0 + b1 X.
The X variable is usually called the predictor or independent variable, and
the Y variable is the response or dependent variable.

Example 9.1.
Imagine that we are opening an ice cream stand and would like to be able to
predict how many customers we will have. We might use the temperature as
a predictor. We decided to collect data over a 30-week period from March
to July.
Week 1 2 3 4 5 6 7 8 9 10
Mean temp 41 56 63 68 69 65 61 47 32 24
Consumption 0.386 0.374 0.393 0.425 0.406 0.344 0.327 0.288 0.269 0.256
Week 11 12 13 14 15 16 17 18 19 20
Mean temp 28 26 32 40 55 63 72 72 67 60
Consumption 0.286 0.298 0.329 0.318 0.381 0.381 0.47 0.443 0.386 0.342
Week 21 22 23 24 25 26 27 28 29 30
Mean temp 44 40 32 27 28 33 41 52 64 71
Consumption 0.319 0.307 0.284 0.326 0.309 0.359 0.376 0.416 0.437 0.548

The following scatterplot is made to graphically investigate the relationship.

Figure 9.1: Scatterplot of ice cream data (ice cream consumption in pints per person vs mean temperature in °F)

There indeed appears to be a straight-line trend. We will discuss fitting the
equation a little later.

9.1 Correlation coefficient

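We already know the correlation coefficient between two random variables,
\[ \rho = \frac{\mathrm{Cov}(X, Y)}{\sigma_X \sigma_Y} \]
Now, let’s consider its sample analog, the sample correlation coefficient
\[ r = \frac{\sum_{i=1}^{n}(Y_i - \overline{Y})(X_i - \overline{X})}{\sqrt{\sum_{i=1}^{n}(X_i - \overline{X})^2 \sum_{i=1}^{n}(Y_i - \overline{Y})^2}} \equiv \frac{SS_{XY}}{\sqrt{SS_X\, SS_Y}} \]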

You can recognize the summation on top as a discrete version of Cov(X, Y )
and the sums on the bottom as part of the computation for the sample
variances of X, Y . These are
\[ SS_{XY} = \sum XY - \frac{\left(\sum X\right)\left(\sum Y\right)}{n}, \qquad SS_X = \sum X^2 - \frac{\left(\sum X\right)^2}{n}, \qquad SS_Y = \sum Y^2 - \frac{\left(\sum Y\right)^2}{n} \]
All sums are taken from 1 to n.
For example, the sample variance of X is $S_X^2 = SS_X/(n-1)$.
Let’s review the properties of the correlation coefficient ρ and its sample
estimate, r:

• the sign of r points to positive (when X increases, Y increases too) or

negative (when one increases, the other decreases) relationship
• −1 ≤ r ≤ 1, with +1 being a perfect positive and −1 a perfect negative
relationship
• r ≈ 0 means no linear relationship between X and Y (caution: there
can still be a non-linear relationship!)
• r is dimensionless, and it does not change when X or Y are linearly
transformed.

9.2 Least squares regression line

The complete regression equation is
\[ Y_i = b_0 + b_1 X_i + \varepsilon_i, \quad i = 1, ..., n \]
where the errors εi are assumed to be independent, N (0, σ²).
To find the “best fit” line, we choose b̂0 and b̂1 that minimize the sum of
squared residuals
\[ SSE = \sum_{i=1}^{n} (Y_i - b_0 - b_1 X_i)^2 \]

(SSE is for the Sum of Squared Errors, however the quantities Yi − b̂0 − b̂1 Xi
are usually referred to as residuals.)
To find the minimum, we would calculate partial derivatives of SSE with
respect to b0 , b1 . Solving the resulting system of equations, we get the
following

Theorem 9.1. Least squares estimates

The estimates for the regression equation Yi = b0 + b1 Xi + εi , i = 1, ..., n
are:
\[ \text{Slope}\quad \hat{b}_1 = \frac{SS_{XY}}{SS_X} = r\,\frac{S_Y}{S_X} \qquad\text{and}\qquad \text{Intercept}\quad \hat{b}_0 = \overline{Y} - \hat{b}_1 \overline{X} \]
Example 9.2.
To illustrate the computations, let’s consider another data set. Here, X =
amount of tannin in the larva food, and Y = growth of insect larvae.
X 0 1 2 3 4 5 6 7 8
Y 12 10 8 11 6 7 2 3 3

Estimate the regression equation and correlation coefficient.

Solution.
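\[ \sum X = 36, \quad \sum Y = 62, \quad \sum X^2 = 204, \quad \sum Y^2 = 536, \quad \sum XY = 175. \]
Therefore,
\[ \overline{X} = 36/9 = 4, \quad \overline{Y} = 62/9 = 6.89, \quad SS_X = 204 - 36^2/9 = 60, \]
\[ SS_Y = 536 - 62^2/9 = 108.9, \quad SS_{XY} = 175 - 36(62)/9 = -73 \]
and finally,
\[ \hat{b}_1 = -73/60 = -1.22, \quad \hat{b}_0 = 6.89 - (-1.22)4 = 11.76, \quad r = -73/\sqrt{60(108.9)} = -0.903 \]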
Thus, we get the equation

Ŷ = 11.76 − 1.22X

that is interpretable as a prediction for any given value X. In practice, the

accuracy of prediction depends on X, see the next Section.
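The hand computation in Example 9.2 follows Theorem 9.1 step by step, and can be mirrored in code. The sketch below is in Python and uses only the shortcut formulas for SSX, SSY and SSXY; the variable names are our own.

```python
import math

# Tannin data from Example 9.2
x = [0, 1, 2, 3, 4, 5, 6, 7, 8]
y = [12, 10, 8, 11, 6, 7, 2, 3, 3]
n = len(x)

# Shortcut formulas for the sums of squares
ss_x = sum(v * v for v in x) - sum(x) ** 2 / n
ss_y = sum(v * v for v in y) - sum(y) ** 2 / n
ss_xy = sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y) / n

b1 = ss_xy / ss_x                      # slope
b0 = sum(y) / n - b1 * sum(x) / n      # intercept = Ybar - b1 * Xbar
r = ss_xy / math.sqrt(ss_x * ss_y)     # sample correlation

print(f"SSX = {ss_x:.1f}, SSY = {ss_y:.1f}, SSXY = {ss_xy:.1f}")
print(f"b1 = {b1:.2f}, b0 = {b0:.2f}, r = {r:.3f}")
```

The slope and the correlation share the numerator SSXY, which is why they always have the same sign.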

Example 9.3.
For the data in Example 9.1,

a) Calculate and plot the least squares regression line

b) Predict the consumption when X = 50°F.

Figure 9.2: Least squares regression line for the ice cream example (pints per person vs mean temperature in °F)

Solution.
(a) We may obtain the following estimates (usually done by a computer)

b̂0 = 0.2069, b̂1 = 0.003107 and r = 0.776

These can be used to plot the regression line (Fig. 9.2) and make predictions.
Can you interpret the slope and the intercept for this problem in plain En-
glish?

(b) Ŷ = b̂0 + b̂1 X = 0.2069 + 0.003107(50) = 0.362 pints per person.
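The estimates in part (a) are "usually done by a computer"; one way to do so is sketched below in Python, applying the formulas of Theorem 9.1 to the 30 weekly observations from Example 9.1. The notes used R for their output; this stand-in and its variable names are ours.

```python
import math

# Ice cream data from Example 9.1 (weeks 1-30)
temp = [41, 56, 63, 68, 69, 65, 61, 47, 32, 24,
        28, 26, 32, 40, 55, 63, 72, 72, 67, 60,
        44, 40, 32, 27, 28, 33, 41, 52, 64, 71]
cons = [0.386, 0.374, 0.393, 0.425, 0.406, 0.344, 0.327, 0.288, 0.269, 0.256,
        0.286, 0.298, 0.329, 0.318, 0.381, 0.381, 0.470, 0.443, 0.386, 0.342,
        0.319, 0.307, 0.284, 0.326, 0.309, 0.359, 0.376, 0.416, 0.437, 0.548]
n = len(temp)

x_bar = sum(temp) / n
y_bar = sum(cons) / n
ss_x = sum((x - x_bar) ** 2 for x in temp)
ss_y = sum((y - y_bar) ** 2 for y in cons)
ss_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(temp, cons))

b1 = ss_xy / ss_x                     # slope: pints per person per degree F
b0 = y_bar - b1 * x_bar               # intercept
r = ss_xy / math.sqrt(ss_x * ss_y)    # sample correlation

prediction_50 = b0 + b1 * 50          # part (b): predicted consumption at 50 F

print(f"b0 = {b0:.4f}, b1 = {b1:.6f}, r = {r:.3f}")
print(f"predicted consumption at 50 F: {prediction_50:.3f} pints per person")
```

Here the centered sums are used instead of the shortcut formulas; the two give the same values, but the centered form loses less precision when the data are far from zero.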

9.3 Inference for regression

The error variance σ 2 determines the amount of scatter of the Y -values about
the line. That is, it reflects the uncertainty of prediction of Y using X.

Its sample estimate is
\[ S^2 = \frac{SSE}{n-2} = \frac{\sum_{i=1}^{n}\left[Y_i - (\hat{b}_0 + \hat{b}_1 X_i)\right]^2}{n-2}; \]
we divide by n − 2 because two degrees of freedom have been used up when
estimating b̂0 , b̂1 . The estimate S can be obtained by hand or using the
computer output.
The values Ŷi = b̂0 + b̂1 Xi are called predicted or fitted values of Y.
The differences

Actual − Predicted ≡ Yi − Ŷi = ei , i = 1, ..., n

are called residuals.

The least squares estimates for slope and intercept are also the sample
estimates for the “true” slope and intercept. We can apply the same methods
we used for, say, estimating the unknown mean µ. However, it is harder
to compute the margins of error. We would typically use computer output
to produce standard errors (that is, the estimates of standard deviations) of
these estimates.
100%(1 − α) CI’s for regression parameters are then found as

Estimate ± tα/2 (Std.Error), t has df = n − 2

Example 9.4.
Continuing the analysis of data from Example 9.1, let’s examine a portion
of computer output (done by R statistical package).
Estimate Std.Error t-value Pr(>|t|)
(Intercept) 0.2069 0.0247 8.375 4.13e-09
X 0.003107 0.000478 6.502 4.79e-07
We can calculate confidence intervals and hypothesis tests for the parameters
b0 and b1 .
The 95% C.I. for the slope b1 is

0.003107 ± 2.048(0.000478) = [0.002128, 0.004086]

To test the hypothesis H0 : b1 = 0 we could use the test statistic
\[ t = \frac{\text{Estimate}}{\text{Std.Error}} \]
For the above data, we have t = 0.003107/0.000478 ≈ 6.50, which agrees with
the value 6.502 reported in the table up to rounding of the inputs. The
p-values for this test can be found using a T-table; they are
also reported by the computer. Above, the reported p-value of 4.79e-07 is
very small, meaning that the hypothesis H0 : b1 = 0 is strongly rejected.
Another part of the output will be useful later. This is a so-called ANOVA
(ANalysis Of VAriance) table:
Df Sum Sq Mean Sq F value Pr(>F)
temperature 1 0.075514 0.075514 42.28 4.789e-07
Residuals 28 0.050009 0.001786
Here, we are interested in the Mean Square of Residuals = 0.001786. Also
note that the p-value (here given as Pr(>F)) coincides with the T-test p-
value.
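The numbers in Example 9.4 can be tied together with a few lines of arithmetic. The Python sketch below starts from the rounded values printed in the R output (so small rounding differences are expected), takes the critical value 2.048 from the T-table for df = 28, and checks that the ANOVA F value is the square of the slope's t-value.

```python
# Values copied from the computer output in Example 9.4
estimate, std_error = 0.003107, 0.000478       # slope and its standard error
ms_regression, ms_residual = 0.075514, 0.001786

t_crit = 2.048     # T-table value for upper-tail area 0.025, df = 28 (assumed from table)

# 95% confidence interval for the slope
lower = estimate - t_crit * std_error
upper = estimate + t_crit * std_error

# Test statistic for H0: b1 = 0
t_stat = estimate / std_error

# ANOVA F statistic; for simple regression F = t^2 (up to rounding)
f_stat = ms_regression / ms_residual

print(f"95% CI for slope: [{lower:.6f}, {upper:.6f}]")
print(f"t = {t_stat:.3f}, t^2 = {t_stat ** 2:.2f}, F = {f_stat:.2f}")
```

Because F = t² in simple linear regression, the two tests are equivalent, which is why the ANOVA p-value coincides with the T-test p-value.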

9.3.1 Correlation test for linear relationship

In terms of correlation r, the above test can be calculated more easily using
the test statistic
\[ t = r\sqrt{\frac{n-2}{1-r^2}}, \quad df = n - 2 \]
Strictly speaking, this is for testing
H0 : ρ = 0 versus HA : ρ ≠ 0
but ρ = 0 and b1 = 0 are equivalent statements.
Example 9.5.
For a relationship between Population size and Divorce rate in n = 20 American
cities the correlation of 0.28 was found. Is there a significant linear
relationship between Population size and Divorce rate?
Solution.
\[ t = 0.28\sqrt{\frac{20-2}{1-0.28^2}} = 1.24 \quad\text{with } df = 18 \]
From T-table (comparing with table value t = 1.33), p-value > 2(0.1) =
0.2. Since the p-value is larger than our default level α = 0.05, do not reject
H0 . Thus, we can claim no significant evidence of the linear relationship
between Population size and Divorce rate.
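Since the correlation test needs only r and n, it is easy to script. A minimal Python sketch follows; the function name `correlation_t` is hypothetical, and the comparison value 1.33 is the T-table entry used in the example (df = 18, upper-tail area 0.10).

```python
import math

def correlation_t(r, n):
    """t statistic for H0: rho = 0, with df = n - 2."""
    return r * math.sqrt((n - 2) / (1 - r ** 2))

t_stat = correlation_t(r=0.28, n=20)

t_table = 1.33     # T-table value, df = 18, upper-tail area 0.10
p_value_above_02 = abs(t_stat) < t_table   # True means two-sided p-value > 0.2

print(f"t = {t_stat:.3f}, two-sided p-value > 0.2: {p_value_above_02}")
```

With the same r = 0.28, a much larger sample would give a larger t; the statistic grows roughly like the square root of n.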

9.3.2 Confidence and prediction intervals

In addition to the C.I.’s for b0 and b1 , we might be interested in the
uncertainty of estimating Y-values given the particular value of X.

100%(1−α) confidence interval for mean response E (Y ) given X = x∗
\[ (\hat{b}_0 + \hat{b}_1 x^*) \pm t_{\alpha/2}\, S \sqrt{\frac{1}{n} + \frac{(x^* - \overline{X})^2}{(n-1)S_X^2}} \]

100%(1 − α) prediction interval for a future observation Y given X = x∗
\[ (\hat{b}_0 + \hat{b}_1 x^*) \pm t_{\alpha/2}\, S \sqrt{1 + \frac{1}{n} + \frac{(x^* - \overline{X})^2}{(n-1)S_X^2}} \]
What is the main difference between confidence and prediction intervals?
Confidence interval is only concerned with the mean response E (Y ). That
is, it’s trying to catch the regression line. Prediction interval is concerned
with any future observation. Thus, it is trying to catch all the points in the
scatterplot. As a consequence, prediction interval is typically much wider.
Note also that $(n-1)S_X^2 = SS_X$, and both intervals are narrowest when
x∗ is closest to X̄, the center of all data. The least squares fit becomes less
reliable as you move to values of X away from the center, especially the areas
where there is no X-data.
Example 9.6.
Continuing the analysis of data from Example 9.1, calculate both 95%
confidence and prediction intervals for the ice cream consumption when
temperature is 70°F.
Solution. Ŷ = b̂0 + b̂1 x∗ = 0.2069 + (0.003107)70 = 0.4244, and using the
computer output in Example 9.4, we will get tα/2 = 2.048,
\[ S = \sqrt{\text{Mean Sq Residuals}} = \sqrt{0.001786} = 0.0423 \quad\text{and}\quad \frac{1}{n} + \frac{(x^*-\overline{X})^2}{(n-1)S_X^2} = 0.0892. \]
Then
CI 0.4244 ± 2.048(0.0423)√0.0892 = 0.4244 ± 0.0259,
PI 0.4244 ± 2.048(0.0423)√1.0892 = 0.4244 ± 0.0904
For comparison, both intervals are plotted in Fig. 9.3 for various values of
x∗ . Note that the 95% prediction interval (broken lines) contains all but one
observation.
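The two margins in Example 9.6 differ only by the extra "1 +" under the square root. The following Python sketch recomputes them from the temperature data and the rounded output values of Example 9.4; the function name `interval_margins` is ours, and small differences in the last digit can come from rounding of S.

```python
import math

# Mean temperatures from Example 9.1
temp = [41, 56, 63, 68, 69, 65, 61, 47, 32, 24,
        28, 26, 32, 40, 55, 63, 72, 72, 67, 60,
        44, 40, 32, 27, 28, 33, 41, 52, 64, 71]
n = len(temp)
x_bar = sum(temp) / n
ss_x = sum((x - x_bar) ** 2 for x in temp)    # equals (n - 1) * S_X^2

# Values taken from the computer output in Example 9.4
b0, b1 = 0.2069, 0.003107
s = math.sqrt(0.001786)    # S = sqrt(Mean Sq Residuals)
t_crit = 2.048             # T-table, df = 28, upper-tail area 0.025

def interval_margins(x_star):
    """Return (CI margin, PI margin) at X = x_star."""
    core = 1 / n + (x_star - x_bar) ** 2 / ss_x
    return t_crit * s * math.sqrt(core), t_crit * s * math.sqrt(1 + core)

y_hat = b0 + b1 * 70
ci_margin, pi_margin = interval_margins(70)
print(f"fit = {y_hat:.4f}, CI margin = {ci_margin:.4f}, PI margin = {pi_margin:.4f}")

# Both intervals are narrowest at the center of the X data
ci_center, pi_center = interval_margins(x_bar)
print(f"at X-bar = {x_bar:.1f}: CI margin = {ci_center:.4f}, PI margin = {pi_center:.4f}")
```

Evaluating `interval_margins` over a grid of x∗ values would trace out bands like those in the figure.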

Figure 9.3: Confidence (solid lines) and prediction bands (broken lines) for
the ice cream example

9.3.3 Checking the assumptions

To check the assumption of linear relationship and the constant variance (σ 2 )
of the residuals, we might make a plot of Residuals ei = Yi − Ŷi versus Fitted
values. If there is any trend or pattern in the residuals, then the assumptions
for linear regression are not met. It might tell us, for example, if the size of
residuals remains the same when the predicted value changes. Also, it can
help spot non-linear behavior, outliers etc.
Such a plot for the ice cream example is given in Fig. 9.4. We do not see
any particular trend except possibly one unusually high value (an outlier) in
the top right corner.

Exercises
9.1.
In the file http://www.nmt.edu/~olegm/382book/cars2010.csv, there are
some data on several 2010 compact car models. The variables are: engine dis-
placement (liters), city MPG, highway MPG, and manufacturer’s suggested
price.

Figure 9.4: Residuals for the ice cream example (residuals vs fitted values)

a) Is the car price related to its highway MPG?

b) Is there a relationship between city and highway MPG?

Use scatterplots, calculate and interpret the correlation coefficient, test to

determine if there is a linear relationship.
For part (b), also compute and interpret the regression equation. Plot the
regression line on the scatterplot. Plot the residuals versus predicted values.
Does the model fit well?
9.2.
The following is an illustration of famous Moore’s Law for computer chips.
X = Year (minus 1900, for ease of computation), Y = number of transistors
(in 1000)
X 71 79 83 85 90 93 95
Y 2.3 31 110 280 1200 3100 5500

a) Make a scatterplot of the data. Is the growth linear?

b) Let’s try and fit the exponential growth model using a transformation:

If $Y_i = a_0 e^{a_1 X_i}$ then $\ln Y_i = \ln a_0 + a_1 X_i$

That is, doing the linear regression analysis of ln Y on X will help

recover the exponential growth. Make the regression analysis of ln Y
on X. Does this model do a good job fitting the data?
c) Predict the number of transistors in the year 2005. Did this prediction
come true?

9.3.
A head of a large Hollywood company has seen the following values of its
market share in the last six years:

11.4, 10.6, 11.3, 7.4, 7.1, 6.7

Is there statistical evidence of a downward trend in the company’s market

share?

9.4.
For the Old Faithful geyser, the durations of eruption (X) were recorded,
with the interval to the next eruption (Y), both in minutes.

X 3.6 1.8 3.3 2.3 4.5 2.9 4.7 3.6 1.9

Y 79 54 74 62 85 55 88 85 51

Perform the regression analysis of Y on X. Interpret the slope and give a

95% confidence interval for the slope.

9.5.
Does the price of the first-class postal stamp follow linear regression, or some
other pattern?
Year (since 1900) 32 58 63 68 71 74 75 78 81 85 88 91 95 99
Price (cents) 3 4 5 6 8 10 13 15 20 22 25 29 32 33
Year (since 1900) 101 102 106 108 111 112
Price (cents) 34 37 39 42 44 45

Predict the price in 2020.

Chapter 10

Categorical Data Analysis

In Section 8.6, we learned to compare two population proportions. We can extend

this approach to more than two populations (groups) by the means of a chi-square
test.
Consider the experiment of randomly selecting n items, each of which belongs
to one of k categories (for example, we collect a sample of 100 people and look at
their blood types, and there are k = 4 types). We will count the number of items
in our sample of the type i and denote that Xi . We will refer to Xi as observed
count for category i. Note that X1 + X2 + ... + Xk = n.
We will be concerned with estimating or testing the probabilities (or
proportions) of ith category, pi , i = 1, ..., k. Also, keep in mind the restriction $\sum_i p_i = 1$.
There are two types of tests considered in this Chapter:

• A test for goodness-of-fit, that is, how well do the observed counts Xi fit
a given distribution.

• A test for independence, for which there are two classification categories
(variables), and we are testing the independence of these variables.

10.1 Chi-square goodness-of-fit test

This is a test for the fit of the sample proportions to given numbers. Suppose
that we have observations that can be classified into each of k groups (categorical
data). We would like to test

H0 : p1 = p01 , p2 = p02 , ... , pk = p0k

HA : some of the pi ’s are unequal to p0i ’s


where pi is the probability that a subject will belong to group i and p0i , i = 1, ..., k
are given numbers. (Note that $\sum p_i = \sum p_{0i} = 1$, so that pk can actually be
obtained from the rest of pi ’s.)
Our data (Observed counts) are the counts of each category in the sample,
X1 , X2 , ...., Xk such that $\sum_{i=1}^{k} X_i = n$. The total sample size is n. For k = 2 we would get X1 =
number of successes, and X2 = n − X1 = number of failures, that is, Binomial
distribution. For k > 2 we deal with Multinomial distribution.
For testing H0 , we compare the observed counts Xi to the ones we would expect
under null hypothesis, that is,

Expected counts E1 = np01 , .... , Ek = np0k

To adjust for the size of each group, we would take the squared difference divided
by Ei , that is (Ei − Xi )²/Ei . Adding up, we obtain the

\[ \text{Chi-square statistic}\quad \chi^2 = \sum_{i=1}^{k}\frac{(E_i - X_i)^2}{E_i} \tag{10.1} \]

with k − 1 degrees of freedom

We would reject H0 when χ2 statistic is large (that is, the Observed counts are far
from Expected counts). Thus, our test is always one-sided. To find the p-value,
use χ2 upper-tail probability table very much like the t-table. See Table C.

Figure 10.1: Chi-square densities f(x) for df = 2, 5 and 10

Table C: Critical points of the chi-square distribution

Upper tail probability (α)

Degrees of
freedom 0.100 0.050 0.025 0.010 0.005 0.001 0.0005
1 2.706 3.841 5.024 6.635 7.879 10.828 12.116
2 4.605 5.991 7.378 9.210 10.597 13.816 15.202
3 6.251 7.815 9.348 11.345 12.838 16.266 17.730
4 7.779 9.488 11.143 13.277 14.860 18.467 19.997
5 9.236 11.070 12.833 15.086 16.750 20.515 22.105
6 10.645 12.592 14.449 16.812 18.548 22.458 24.103
7 12.017 14.067 16.013 18.475 20.278 24.322 26.018
8 13.362 15.507 17.535 20.090 21.955 26.124 27.868
9 14.684 16.919 19.023 21.666 23.589 27.877 29.666
10 15.987 18.307 20.483 23.209 25.188 29.588 31.420
11 17.275 19.675 21.920 24.725 26.757 31.264 33.137
12 18.549 21.026 23.337 26.217 28.300 32.909 34.821
13 19.812 22.362 24.736 27.688 29.819 34.528 36.478
14 21.064 23.685 26.119 29.141 31.319 36.123 38.109
15 22.307 24.996 27.488 30.578 32.801 37.697 39.719
16 23.542 26.296 28.845 32.000 34.267 39.252 41.308
17 24.769 27.587 30.191 33.409 35.718 40.790 42.879
18 25.989 28.869 31.526 34.805 37.156 42.312 44.434
19 27.204 30.144 32.852 36.191 38.582 43.820 45.973
20 28.412 31.410 34.170 37.566 39.997 45.315 47.498
21 29.615 32.671 35.479 38.932 41.401 46.797 49.011
22 30.813 33.924 36.781 40.289 42.796 48.268 50.511
23 32.007 35.172 38.076 41.638 44.181 49.728 52.000
24 33.196 36.415 39.364 42.980 45.559 51.179 53.479
25 34.382 37.652 40.646 44.314 46.928 52.620 54.947
30 40.256 43.773 46.979 50.892 53.672 59.703 62.162
40 51.805 55.758 59.342 63.691 66.766 73.402 76.095
60 74.397 79.082 83.298 88.379 91.952 99.607 102.695
80 96.578 101.879 106.629 112.329 116.321 124.839 128.261
100 118.498 124.342 129.561 135.807 140.169 149.449 153.167

Assumption for chi-square test: all Expected counts should be ≥ 5 (this is
necessary so that the normal approximation for counts Xi holds).
Some details: see footnote 1 below.

Example 10.1.
When studying earthquakes, we recorded the following numbers of earthquakes (1
and above on Richter scale) for 7 consecutive days in January 2008.

Day 1 2 3 4 5 6 7 Total
Count 85 98 79 118 112 135 137 764
Expected 109.1 109.1 109.1 109.1 109.1 109.1 109.1 764

Here, n = 764. Is there evidence that the rate of earthquake activity changes
during this week?

Solution. If the null hypothesis H0 : p1 = p2 = ... = p7 were true, then each

pi = 1/7, i = 1, ..., 7. Thus, we can find the expected counts Ei = 764/7 = 109.1.
Results: χ2 = 28.8, df = 6, p-value < 0.0005 from Table C. Since the p-value is
small, we reject H0 and claim that the earthquake frequency does change during
the week.
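The value χ² = 28.8 in Example 10.1 comes from formula (10.1) with equal expected counts. A minimal Python sketch of that computation follows; the variable names are ours, and the critical value is copied from Table C rather than computed.

```python
# Earthquake counts for 7 consecutive days (Example 10.1)
observed = [85, 98, 79, 118, 112, 135, 137]

n = sum(observed)
k = len(observed)
expected = [n / k] * k          # under H0 every day has probability 1/7

chi_sq = sum((e - x) ** 2 / e for x, e in zip(observed, expected))
df = k - 1

critical_00005 = 24.103         # Table C: df = 6, upper-tail probability 0.0005
p_below_00005 = chi_sq > critical_00005

print(f"n = {n}, E = {expected[0]:.1f}, chi-square = {chi_sq:.1f}, df = {df}")
print("p-value < 0.0005:", p_below_00005)
```

All expected counts here are about 109, far above 5, so the assumption for the chi-square test is comfortably met.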

Example 10.2.
In this example, we will test whether a particular distribution matches our
experimental results. These are the data from the probability board (quincunx);
we test if the distribution is really Binomial (as is often claimed). The slots
are labeled 0-19. Some slots were merged together (why?)

Slots 0-6 7 8 9 10 11 12 13-19 Total

Observed 16 2 11 18 14 14 7 18 100
Expected 8.4 9.6 14.4 17.6 17.6 14.4 9.6 8.4 100
Footnote 1. Chi-square distribution with degrees of freedom = k is related to
Normal distribution as follows:
\[ \chi^2 = Z_1^2 + Z_2^2 + ... + Z_k^2, \]
where Z1 , ..., Zk are independent, standard Normal r.v.’s.
Also, it can be shown that chi-square (df = k) distribution is simply
Gamma(α = k/2, β = 2); sorry, this α and the upper-tail area are not the same!
For example, Chi-square(df = 2) is the same as Exponential (β = 2). (Why?)
Note that this distribution has positive values and is not symmetric!

Solution. The expected counts are computed using Binomial(n = 19, p = 0.5)
distribution, and then multiplying by the Total = 100. For example,
\[ E_9 = \binom{19}{9}\, 0.5^9 (1-0.5)^{19-9} \times 100 = 17.6 \]

Next, χ2 = 26.45, df = 7, and p-value < 0.0005.

Conclusion: Reject H0 , the distribution is not exactly Binomial.
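The expected counts for the merged slots in Example 10.2 are sums of Binomial probabilities, which is tedious by hand. Here is a Python sketch of the whole computation; the helper `binom_pmf` and the grouping list are our own constructions, and the critical value is read from Table C.

```python
import math

def binom_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p)."""
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)

# Slot groups as in the table: 0-6 merged, 7, ..., 12 single, 13-19 merged
groups = [range(0, 7), [7], [8], [9], [10], [11], [12], range(13, 20)]
observed = [16, 2, 11, 18, 14, 14, 7, 18]
total = sum(observed)

# Expected count of a group = Total * (sum of Binomial(19, 0.5) probabilities)
expected = [total * sum(binom_pmf(k, 19, 0.5) for k in g) for g in groups]

chi_sq = sum((e - x) ** 2 / e for x, e in zip(observed, expected))
df = len(groups) - 1

critical_00005 = 26.018     # Table C: df = 7, upper-tail probability 0.0005

print("expected:", [round(e, 1) for e in expected])
print(f"chi-square = {chi_sq:.2f}, df = {df}, exceeds 26.018: {chi_sq > critical_00005}")
```

Running the same loop on the unmerged slots 0 to 19 shows that the outermost slots have expected counts well below 5, which hints at the answer to the "why?" above.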

10.2 Chi-square test for independence

This test is applied to the category probabilities for two variables. Each case is
classified according to variable 1 (for example, Gender) and variable 2 (for example,
College Major). The data are usually given in a cross-classification table (a 2-way
table). Let Xij be the observed table counts for row i and column j.
We are interested in testing whether Variable 1 (in r rows) is independent of
Variable 2 (in c columns) (see footnote 2).
In this situation, we set up a chi-square statistic following equation (10.1).
However, now the table is bigger. The Expected counts will be found using
independence assumption, as
\[ \text{Expected counts}\quad E_{ij} = \frac{R_i C_j}{n}, \quad i = 1, ..., r, \quad j = 1, ..., c \]
where Ri and Cj are the row and column totals.

Theorem 10.1. Chi-square test for independence

To test
H0 : Variable 1 is independent of Variable 2 vs
HA : Variable 1 is not independent of Variable 2
we can use the χ² random variable with df = (r − 1)(c − 1), where
\[ \text{test statistic}\quad \chi^2 = \sum_{i=1}^{r}\sum_{j=1}^{c}\frac{(E_{ij} - X_{ij})^2}{E_{ij}} \tag{10.2} \]

Footnote 2. These are not random variables in the sense of Chapter 3, because
they are categorical, not numerical.

Example 10.3.
Suppose that we ordered 50 components from each of the vendors A, B and C,
and the results are as follows:

            Succeeded    Failed    Total

Vendor A       48           2        50
Vendor B       45           5        50
Vendor C       42           8        50

We would like to investigate whether all the vendors are equally reliable. That is,

H0 : Failure rate is independent of Vendor

HA : Not all Vendors have the same failure rate

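Solution. We'll put all the expected counts into the table. Each entry is
(row total) × (column total) / n; for example, every Succeeded entry is
50 × 135/150 = 45 and every Failed entry is 50 × 15/150 = 5.

Expected counts:
            Succeeded    Failed    Total
_____________________________________________________
Vendor A       45           5        50
Vendor B       45           5        50
Vendor C       45           5        50
_____________________________________________________
Total         135          15       150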

The $\chi^2$ statistic will have df = (3 − 1)(2 − 1) = 2.

Here, $\chi^2 = (45 - 48)^2/45 + (5 - 2)^2/5 + ... = 4.0$, p-value > 0.1. Since the p-value
is large, we do not reject $H_0$. Thus, there is no significant evidence that the vendors have
different failure rates.[3]
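
The numbers in this example can be reproduced with a short script. This is a sketch only: it uses the observed and expected tables given above, and for the p-value it relies on the fact that for df = 2 the chi-square distribution coincides with Exponential(β = 2). That shortcut does not apply for other df.

```python
from math import exp

# Rows: Vendors A, B, C; columns: Succeeded, Failed
observed = [[48, 2], [45, 5], [42, 8]]
expected = [[45, 5], [45, 5], [45, 5]]   # from the expected-counts table

# Test statistic: sum of (E - X)^2 / E over all six cells
chi2 = sum((e - x) ** 2 / e
           for e_row, x_row in zip(expected, observed)
           for e, x in zip(e_row, x_row))

# Valid for df = 2 only: P(Y > y) = exp(-y / 2)
p_value = exp(-chi2 / 2)

print(round(chi2, 2), round(p_value, 4))  # prints 4.0 0.1353
```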

[3] For this particular example, since df = 2, there is a more exact p-value calculation
based on the Exponential distribution: P(Y > 4) = exp(−4/2) = 0.1353. In general, we can
use the Excel function chidist or other software to compute the exact p-values.

Exercises
10.1.
In testing how well people can generate random patterns, the researchers asked
everyone in a group of 20 people to write a list of 5 random digits. The results are
tabulated below.

Digits      0   1   2   3   4   5   6   7   8   9   Total
Observed    6  11  10  13   8  13   7  17   8   7    100

Are the digits completely random or do humans have preference for some particular
digits over the others?

10.2.
Forensic statistics. To uncover rigged elections, a variety of statistical tests might
be applied. For example, made-up precinct totals are sometimes likely to have an
excess of 0 or 5 as their last digits. For a city election, the observers counted that
21 precinct totals had the last digit 0, 18 had the last digit 5, while 102 had some
other last digit. Is there evidence that the elections were rigged?

10.3.
In an earlier example of the Poisson distribution, we discussed the number of Nazi
bombs hitting 0.5 × 0.5 km squares in London. The following were the counts of squares
that had 0, 1, 2, ... hits:

number of hits     0     1     2     3    4 and up

count            229   211    93    35       8

Test whether the data fit the Poisson distribution (for p01, ..., p0k use the Poisson
probabilities, with the parameter µ estimated as the average number of hits per square,
µ = 0.9288).

10.4.
To test the attitudes to a tax reform, the state officials collected data on the
opinions of likely voters, along with their income level:

             Income Level:
             Low    Medium    High
For          182     213      203
Against      154     138      110

Do the people with different incomes have significantly different opinions on tax
reform? (That is, test whether the Opinion variable is independent of Income
variable.)

10.5.
Using the exponential distribution, confirm the calculation of the chi-square (df = 2)
critical points from Table C for upper-tail areas α = 0.1 and α = 0.005. Find the point
for the $\chi^2$ (df = 2) distribution with α = 0.2.

Notes
[i] See e.g. http://forgetomori.com/2009/skepticism/seeing-patterns/
[j] See http://www.census.gov/hhes/www/cpstables/032010/perinc/new01_001.htm
[k] http://en.wikipedia.org/wiki/Heights_of_Presidents_of_the_United_States_and_presidential_candidates
[l] See http://www.readingonline.org/articles/bergman/wait.html
[m] “Student” [William Sealy Gosset] (March 1908). “The probable error of a mean”. Biometrika 6 (1): 1-25.
[n] For example, see http://www.stat.columbia.edu/~gelman/book/data/light.asc
[o] Example from “Statistical Sleuth”
[p] http://www.collegeboard.com/student/pay/add-it-up/4494.html
[q] Kotswara Rao Kadilyala (1970). “Testing for the independence of regression disturbances”. Econometrica, 38, 97-117. Appears in: A Handbook of Small Data Sets, D. J. Hand, et al., editors (1994). Chapman and Hall, London.
[r] From The R Book by Michael Crawley
[s] Mlodinow again. The director, Sherry Lansing, was subsequently fired only to see several films developed during her tenure, including Men In Black, hit it big.
[t] See http://www.akdart.com/postrate.html
