Part 1
OUP CORRECTED PROOF – FINAL, 6/5/2019, SPi
Bayesian Statistics
for Beginners
A Step-by-Step Approach
THERESE M. DONOVAN
RUTH M. MICKEY
Great Clarendon Street, Oxford, OX2 6DP,
United Kingdom
Oxford University Press is a department of the University of Oxford.
It furthers the University’s objective of excellence in research, scholarship,
and education by publishing worldwide. Oxford is a registered trade mark of
Oxford University Press in the UK and in certain other countries
© Ruth M. Mickey 2019
The moral rights of the author have been asserted
First Edition published in 2019
Impression: 1
All rights reserved. No part of this publication may be reproduced, stored in
a retrieval system, or transmitted, in any form or by any means, without the
prior permission in writing of Oxford University Press, or as expressly permitted
by law, by licence or under terms agreed with the appropriate reprographics
rights organization. Enquiries concerning reproduction outside the scope of the
above should be sent to the Rights Department, Oxford University Press, at the
address above
You must not circulate this work in any other form
and you must impose this same condition on any acquirer
Published in the United States of America by Oxford University Press
198 Madison Avenue, New York, NY 10016, United States of America
British Library Cataloguing in Publication Data
Data available
Library of Congress Control Number: 2019934655
ISBN 978–0–19–884129–6 (hbk.)
ISBN 978–0–19–884130–2 (pbk.)
DOI: 10.1093/oso/9780198841296.001.0001
Printed and bound by
CPI Group (UK) Ltd, Croydon, CR0 4YY
Links to third party websites are provided by Oxford in good faith and
for information only. Oxford disclaims any responsibility for the materials
contained in any third party website referenced in this work.
To our parents, Thomas and Earline Donovan and Ray and Jean Mickey,
for inspiring a love of learning.
To our mentors, some of whom we’ve met only by their written words,
for teaching us ways of knowing.
Preface
Greetings. This book is our attempt at gaining membership to the Bayesian Conspiracy.
You may ask, “What is the Bayesian Conspiracy?” The answer is provided by Eliezer
Yudkowsky (http://yudkowsky.net/rational/bayes): “The Bayesian Conspiracy is a multinational, interdisciplinary, and shadowy group of scientists that controls publication,
grants, tenure, and the illicit traffic in grad students. The best way to be accepted into the
Bayesian Conspiracy is to join the Campus Crusade for Bayes in high school or college, and
gradually work your way up to the inner circles. It is rumored that at the upper levels of the
Bayesian Conspiracy exist nine silent figures known only as the Bayes Council.”
Ha ha! Bayes’ Theorem, also called Bayes’ Rule, was published posthumously in 1763 in
the Philosophical Transactions of the Royal Society. In The Theory That Would Not Die,
author Sharon Bertsch McGrayne aptly describes how “Bayes’ rule cracked the enigma code,
hunted down Russian submarines, and emerged triumphant from two centuries of controversy.” In short, Bayes’ Rule has vast application, and the number of papers and books that
employ it is growing exponentially.
Inspired by the knowledge that a Bayes Council actually exists, we began our journey by
enrolling in a 5-day ‘introductory’ workshop on Bayesian statistics a few years ago. On Day
1, we were introduced to a variety of Bayesian models, and, on Day 2, we sheepishly had to
inquire what Bayes’ Theorem was and what it had to do with MCMC. In other words, the
material was way over our heads. With tails between our legs, we slunk back home and
began trying to sort out the many different uses of Bayes’ Theorem.
As we read more and more about Bayes’ Theorem, we started noting our own questions as
they arose and began narrating the answers as they became more clear. The result is this
strange book, cast as a series of questions and answers between reader and author. In this
prose, we make heavy use of online resources such as the Oxford Dictionary of Statistics
(Upton and Cook, 2014), Wolfram Mathematics, and the Online Statistics Education: An
Interactive Multimedia Course for Study (Rice University, University of Houston Clear
Lake, and Tufts University). We also provide friendly links to online encyclopedias such
as Wikipedia and Encyclopedia Britannica. Although these should not be considered
definitive, original works, we have included the links to provide readers with a readily
accessible source of information and are grateful to the many authors who have contrib-
uted entries.
We are not experts in Bayesian statistics and make no claim as such. Therese Donovan is a
biologist for the U. S. Geological Survey Vermont Cooperative Fish and Wildlife Research
Unit, and Ruth Mickey is a statistician in the Department of Mathematics and Statistics at
the University of Vermont. We were raised on a healthy dose of “frequentist” and maximum likelihood methods but have begun only recently to explore Bayesian methods. We
have intentionally avoided controversial topics and comparisons between Bayesian and
frequentist approaches and encourage the reader to dig deeper—much deeper—than we
have here. Fortunately, a great number of experts have paved the way, and we relied heavily
on the following books while writing our own:
• N. T. Hobbs and M. B. Hooten. Bayesian Models: A Statistical Primer for Ecologists.
Princeton University Press, 2015.
• J. Kruschke. Doing Bayesian Data Analysis: A Tutorial with R, JAGS, and Stan.
Elsevier, 2015.
• A. Gelman, J. B. Carlin, H. S. Stern, and D. B. Rubin. Bayesian Data Analysis. Chapman &
Hall, 2004.
• J. V. Stone. Bayes’ Rule: a Tutorial Introduction to Bayesian Analysis. Sebtel Press, 2014.
• H. Raiffa and R. Schlaifer. Applied Statistical Decision Theory. Division of Research,
Graduate School of Business Administration, Harvard University, 1961.
• P. Goodwin and G. Wright. Decision Analysis for Management Judgment. John Wiley &
Sons, 2014.
Although we relied on these sources, any mistakes of interpretation are our own.
Our hope is that Bayesian Statistics for Beginners is a “quick read” for the uninitiated and
that, in one week or less, we could find a reader happily ensconced in a book written by one
of the experts. Our goal in writing the book was to keep Bayes’ Theorem front and center in
each chapter for a beginning audience. As a result, Bayes’ Theorem makes an appearance in
every chapter. We frequently bring back past examples and explain what we did “back
then,” allowing the reader to slowly broaden their understanding and sort out what has
been learned in order to relate it to new material. For the most part, our reviewers liked this
approach. However, if this is annoying to you, you can skim over the repeated portions.
If this book is useful to you, it is due in no small part to a team of stellar reviewers. We owe
a great deal of gratitude to George Allez, Cathleen Balantic, Barry Hall, Mevin Hooten, Peter
Jones, Clint Moore, Ben Staton, Sheila Weaver, and Robin White. Their enthusiasm,
questions, and comments have improved the narrative immensely. We offer a heartfelt
thank you to Gary Bishop, Renee Westland, Melissa Murphy, Kevin Roark, John Bell, and
Stuart Geman for providing pictures for this book.
Therese Donovan
Ruth Mickey
October 2018
Burlington, VT
Contents
1 Introduction to Probability 3
2 Joint, Marginal, and Conditional Probability 11
3 Bayes’ Theorem 29
4 Bayesian Inference 37
5 The Author Problem: Bayesian Inference with Two Hypotheses 48
6 The Birthday Problem: Bayesian Inference with Multiple Discrete Hypotheses 61
13 The Shark Attack Problem Revisited: MCMC with the Metropolis Algorithm 193
14 MCMC Diagnostic Approaches 212
15 The White House Problem Revisited: MCMC with the Metropolis–Hastings
Algorithm 224
16 The Maple Syrup Problem Revisited: MCMC with Gibbs Sampling 247
SECTION 6 Applications
Appendices
Bibliography 399
Hyperlinks 403
Name Index 413
Subject Index 414
SECTION 1
Basics of Probability
Overview
And so we begin. This first section deals with basic concepts in probability theory, and
consists of two chapters.
CHAPTER 1
Introduction to Probability
In this chapter, we’ll introduce some basic terms used in the study of probability. By the end
of this chapter, you will be able to define the following:
• Sample space
• Outcome
• Discrete outcome
• Event
• Probability
• Probability distribution
• Uniform distribution
• Trial
• Empirical distribution
• Law of Large Numbers
To begin, let’s answer a few questions . . .
What is probability?
Answer: The best way to introduce probability is to discuss an example. Imagine you’re a
gambler, and you can win $1,000,000 if a single roll of a die turns up four. You get only one
roll, and the entry fee to play this game is $10,000. If you win, you’re a millionaire. If you
lose, you’re out ten grand.
Bayesian Statistics for Beginners: A Step-by-Step Approach. Therese M. Donovan and Ruth M. Mickey,
Oxford University Press (2019). © Ruth M. Mickey 2019.
DOI: 10.1093/oso/9780198841296.001.0001
Let’s call the number of possible outcomes N, so N = 6. These outcomes are discrete,
because each result can take on only one of these values. Formally, the word “discrete” is
defined as “individually separate and distinct.” In addition, in this example, there is a
finite number of possible outcomes, which means “there are limits or bounds.” In other
words, the number of possible outcomes is not infinite.
If we believe that each and every outcome is just as likely to result as every other outcome (i.e., the die is fair), then the probability of rolling a four is 1/N, or 1/6. We can then say, “In 6 rolls of the die, we would expect 1 roll to result in a four,” and write that as Pr(four) = 1/6.
Here, the notation Pr means “Probability,” and we will use this notation throughout
this book.
How can we get a good estimate of Pr(four) for this particular die?
Table 1.1
In this table, the column called Frequency gives the number of times each outcome was
observed. The column called Probability is the frequency divided by the sum of the
frequencies over all possible outcomes, a proportion. The probability of an event of four is
the frequency of the observed number of occurrences of four (which is 0) divided by the
total throws (which is 1). We can write this as:
Pr(four) = |number of fours| / |total trials| = 0/1 = 0.    (1.1)
This can be read as, “The probability of a four is the number of “four” events divided by the
number of total trials.” Here, the vertical bars indicate a number rather than an absolute
value. This is a frequentist notion of probability, because we estimate Pr(four) by asking
“How frequently did we observe the outcome that interests us out of the total?”
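The frequentist calculation above is simple enough to sketch in a few lines of Python (a hypothetical illustration of ours; the book itself presents no code):

```python
from collections import Counter

def estimate_probability(rolls, outcome):
    """Frequentist estimate: frequency of `outcome` divided by the total number of trials."""
    return Counter(rolls)[outcome] / len(rolls)

# One roll that did not turn up four estimates Pr(four) as 0/1 = 0:
print(estimate_probability([3], 4))  # 0.0
```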
[Figure 1.1: Estimated probability of each outcome (One through Six) after a single roll; probability is on the y-axis.]
Answer: No.
By now, you should realize that one roll will never give us a good estimate of Pr(four).
(Sometimes, though, that’s all we have . . . we have only one Planet Earth, for example).
Next, you roll the die 9 more times, and summarize the results of the 10 total rolls (i.e., 10
trials or 10 experiments) in Table 1.2. The number of fours is 2, which allows us to estimate
Pr(four) as 2/10, or 0.20 (see Figure 1.2).
Table 1.2
[Figure 1.2: Estimated probability of each outcome after 10 rolls; the bar for Four has height 0.2.]
What can you conclude from these results? The estimate of Pr(four) = 0.2 seems to indicate that the die may be in your favor! (Remember, you expect Pr(four) = 0.1667 if the die were fair.) But $10,000 is a lot of money, and you decide that you should keep test
rolling until the gamemaster shouts “Enough!” Amazingly, you are able to squeeze in 500
rolls, and you obtain the results shown in Table 1.3.
Table 1.3
The plot of the frequency results in Figure 1.3 is called a frequency histogram. Notice
that frequency, not probability, is on the y-axis. We see that a four was rolled 41 times.
Notice also that the sum of the frequencies is 500. The frequency distribution is an example
of an empirical distribution: It is constructed from raw data.
[Figure 1.3: Frequency histogram of the 500 rolls; the y-axis shows frequency (0–100) for each outcome.]
We can now estimate Pr(four) as 41/500 = 0.082. We can calculate the probability estimates
for the other outcomes as well and then plot them as the probability distribution in Figure 1.4.
[Figure 1.4: Estimated probability distribution after 500 rolls.]
The Law of Large Numbers tells us that collecting more and more data yields an estimate that is closer and closer to the true probability as the number of trials increases. In other words, your estimate of Pr(four) gets closer and closer to, or approaches, the true probability when you use more trials (rolls) in your calculations.
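The Law of Large Numbers is easy to watch in action with a short simulation (a sketch of ours in Python; the seed and the roll counts are arbitrary choices):

```python
import random

random.seed(1)  # arbitrary seed so the run is repeatable

def estimate_pr_four(n_rolls):
    """Simulate n_rolls of a fair die and return the frequentist estimate of Pr(four)."""
    rolls = [random.randint(1, 6) for _ in range(n_rolls)]
    return rolls.count(4) / n_rolls

# The estimate wanders with few trials and settles near 1/6 (about 0.1667) with many:
for n in (10, 100, 10_000, 1_000_000):
    print(n, estimate_pr_four(n))
```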
Table 1.4 lists the six possible outcomes, and the probability of each event (1/6 = 0.167).
Notice that the sum of the probabilities across the events is 1.0.
Table 1.4
Outcome Probability
One 0.167
Two 0.167
Three 0.167
Four 0.167
Five 0.167
Six 0.167
Sum 1
Figure 1.5 shows exactly the same information as Table 1.4; both are examples of a
probability distribution. On the horizontal axis, we list each of the possible outcomes. On
the vertical axis is the probability. The height of each bar provides the probability of
observing each outcome. Since each outcome has an equal chance of being rolled, the
heights of the bars are all the same and show as 0.167, which is 1/N. Note that this is not an
empirical distribution, because we did not generate it from an experiment. Rather, it was
based on the assumption that all outcomes are equally likely.
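Because this distribution comes from an assumption rather than an experiment, it can be built directly in code with no rolling at all. A sketch of ours, using exact fractions so the probabilities sum to exactly 1:

```python
from fractions import Fraction

outcomes = ["One", "Two", "Three", "Four", "Five", "Six"]
N = len(outcomes)

# Uniform distribution: every outcome gets probability 1/N
fair_die = {outcome: Fraction(1, N) for outcome in outcomes}

print(fair_die["Four"])        # 1/6
print(sum(fair_die.values()))  # 1 -- the probabilities must sum to 1
```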
[Figure 1.5: Probability distribution for a fair die; each of the six bars has height 0.167.]
How would you change the table and probability distribution if the
die were loaded in favor of a four?
Table 1.5
Outcome Probability
One 0.12
Two 0.12
Three 0.12
Four 0.4
Five 0.12
Six 0.12
Sum 1
The probabilities listed in the table sum to 1.0 just as before, as do the heights of the
corresponding blue bars in Figure 1.6.
[Figure 1.6: Probability distribution for the loaded die; the bar for Four has height 0.4.]
Answer: The bet is that if you roll a four, you win $1,000,000, and if you don’t roll a
four, you lose $10,000. It would be useful to group our 6 possible outcomes into one of
two events. As the Oxford Dictionary of Statistics explains, “An event is a particular
collection of outcomes, and is a subset of the sample space.” Probabilities are assigned
to events.
Our two events are E1 = {four} and E2 = {one, two, three, five, six}. The brackets { } indicate the set of outcomes that belong in each event. The first event contains one
outcome, while the second event consists of five possible outcomes. Thus, in probability
theory, outcomes can be grouped into new events at will. We started out by considering six
outcomes, and now we have collapsed those into two events.
Now we assign a probability to each event. We know that Pr(four) = 0.4. What is the probability of NOT rolling a four? That is the probability of rolling one OR two OR three OR five OR six. Note that these events cannot occur simultaneously. Thus, we can write Pr(~four) as the SUM of the probabilities of the events one, two, three, five, and six (which is 0.12 + 0.12 + 0.12 + 0.12 + 0.12 = 0.6). Incidentally, the ~ sign means “complement of.” If A is an event, ~A is its complement (i.e., everything but A). This is sometimes written as Ac.
The word OR is a tip that you ADD the individual probabilities together to get your answer
as long as the events are mutually exclusive (i.e., cannot occur at the same time).
This is an example of a fundamental rule in probability theory: if two or more events are
mutually exclusive, then the probability of any occurring is the sum of the probabilities of
each occurring. Because the different outcomes of each roll (i.e., rolling a one, two, three,
five, or six) are mutually exclusive, the probability of getting any outcome other than four is
the sum of the probability of each one occurring (see Table 1.6).
Table 1.6
Event Probability
Four 0.4
Not Four 0.6
Sum 1
Note that the probabilities of these two possible events sum to 1.0. Because of that, if we know that Pr(four) is 0.4, we can quickly compute Pr(~four) as 1 − 0.4 = 0.6 and save a few mental calculations.
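Both routes to Pr(~four), summing the mutually exclusive outcomes and taking the complement, can be checked numerically. A sketch of ours using the loaded-die probabilities from Table 1.5:

```python
# Loaded-die probabilities from Table 1.5
pr = {"One": 0.12, "Two": 0.12, "Three": 0.12, "Four": 0.4, "Five": 0.12, "Six": 0.12}

# Route 1: the events are mutually exclusive, so ADD their individual probabilities
pr_not_four_sum = sum(p for outcome, p in pr.items() if outcome != "Four")

# Route 2: the complement rule, 1 - Pr(four)
pr_not_four_complement = 1 - pr["Four"]

print(round(pr_not_four_sum, 10))         # 0.6
print(round(pr_not_four_complement, 10))  # 0.6
```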
The probability distribution looks like the one shown in Figure 1.7:
[Figure 1.7: Probability distribution for the two events; the bar for Four has height 0.4 and the bar for Not Four has height 0.6.]
Remember: The SUM of the probabilities across all the different outcomes MUST EQUAL 1! In the discrete probability distributions above, this means that the heights of the bars, summed across all discrete outcomes (i.e., all bars), total 1.
Do you still want to play? At the end of the book, we will introduce decision trees, an
analytical framework that employs Bayes’ Theorem to aid in decision-making. But that
chapter is a long way away.
Answer: We’re a few chapters away from hitting on this very important topic. You’ll see
that Bayesians think of probability in a way that allows the testing of theories and hypoth-
eses. But you have to walk before you run. What you need now is to continue learning the
basic vocabulary associated with probability theory.
What’s next?
Answer: In Chapter 2, we’ll expand our discussion of probability. See you there.
CHAPTER 2
Joint, Marginal, and Conditional Probability
Now that you’ve had a short introduction to probability, it’s time to build on our
probability vocabulary. By the end of this chapter, you will understand the
following terms:
• Venn diagram
• Marginal probability
• Joint probability
• Independent events
• Dependent events
• Conditional probability
Let’s start with a few questions.
A gala that celebrates the sense of vision? Nope. The eyeball event in this chapter
refers to whether a person is right-eyed dominant or left-eyed dominant. You already
know if you are left- or right-handed, but did you know that you are also left- or right-
eyed? Here’s how to tell (http://www.wikihow.com/Determine-Your-Dominant-Eye; see
Figure 2.1):
1. Stretch your arms out in front of you and create a hole with your hands by joining your
finger tips to make a triangular opening, as shown.
2. Find a small object nearby and align your hands with it so that you can see it in the
triangular hole. Make sure you are looking straight at the object through your hands—
cocking your head to either side, even slightly, can affect your results. Be sure to keep
both eyes open!
3. Slowly move your hands toward your face to draw your viewing window toward you. As
you do so, keep your head perfectly still, but keep the object lined up in the hole between
your hands. Don’t lose sight of it.
4. Draw your hands in until they touch your face—your hands should end up in front of
your dominant eye. For example, if you find that your hands end up so that you are
looking through the opening with your right eye, that eye is dominant.
The eyeball characteristic has two discrete outcomes: lefty (for left-eyed dominant people)
or righty (for right-eyed dominant people). Because there are only two outcomes, we can
call them events if we want.
Let us suppose that you ask 100 people if they are “lefties” or “righties.” In this case,
the 100 people represent our “universe” of interest, which we designate with the letter
U. The total number of elements (individuals) in U is written |U |. (Once again, the
vertical bars here simply indicate that U is a number; it doesn’t mean the absolute value
of U.)
Here, there are only two possible events: “lefty” and “righty.” Together, they make up a
set of possible outcomes. Let A be the event “left-eye dominant,” and ~A be the event “right-eye dominant.” Here, the tilde (~) means “complement of,” and here it can be interpreted as “everything but A.” Notice that these two events are mutually exclusive: you cannot be both a “lefty” and a “righty.” The events are also “exhaustive” because you must be either a lefty or a righty.
Suppose that 70 of 100 people are lefties. These people are a subset of the larger population. The number of people in event A can be written |A|, and in this example |A| = 70. Note that |A| must be less than or equal to |U|, which is 100.
Remember that we use the vertical bars here to highlight that we are talking about a
number.
Since there are only two possibilities for eye dominance type, this means that 100 − 70 = 30 people are righties. The number of people in event ~A can be written |~A|, and in this example |~A| = 30. Note that |~A| must be less than or equal to |U|.
Our universe can be summarized as shown in Table 2.1.
Table 2.1
Event Frequency
Lefty (A) |A| = 70
Righty (~A) |~A| = 30
Universe (U) |U| = 100
We can illustrate this example in a diagrammatic form, as shown in Figure 2.2. Here, our
universe of 100 people is captured inside a box.
This is a Venn diagram, which shows A and ~A. The universe U is 100 people and is represented by the entire box. We then allocate those 100 individuals into A and ~A. The blue circle represents A; lefties stand inside this circle; righties stand outside the circle, but inside the box. You can see that A consists of 70 elements, and ~A consists of 30 elements.
[Figure 2.2: Venn diagram; the circle Lefty (A) contains 70 people, and the region Righty (~A) outside it contains 30.]
Answer: Venn diagrams are named for John Venn (see Figure 2.3), who wrote his seminal
article in 1880.
According to the MacTutor History of Mathematics Archive, Venn’s son described him as
“of spare build, he was throughout his life a fine walker and mountain climber, a keen
botanist, and an excellent talker and linguist.”
Answer: We write this as Pr(A). Remember that Pr stands for probability, so Pr(A) means
the probability that a person is in group A and therefore is a lefty. We can determine the
probability that a person is in group A as:
Pr(A) = |A| / |U| = 70/100 = 0.7.    (2.1)
Probability is determined as the number of persons in group A out of the total. The
probability that the randomly selected person is in group A is 0.7.
Answer: There are 30 of them (100 − 70 = 30), and they are righties.
Pr(~A) = |~A| / |U| = 30/100 = 0.3.    (2.2)
With only two outcomes for the eyeball event, our notation focuses on the probability of being in a given group (A = lefties) and the probability of not being in the given group (~A = righties).
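In code, the two probabilities come straight from the counts (a minimal sketch of ours; the variable names are hypothetical):

```python
size_U = 100  # |U|: everyone in our universe
size_A = 70   # |A|: the lefties

pr_A = size_A / size_U                 # Pr(A) = |A| / |U|
pr_not_A = (size_U - size_A) / size_U  # Pr(~A) = |~A| / |U|

print(pr_A, pr_not_A)  # 0.7 0.3
assert abs(pr_A + pr_not_A - 1.0) < 1e-12  # complementary events cover the universe
```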
Answer: Yes, of course. Let’s probe these same 100 people and find out other details about
their anatomy. Suppose we are curious about the presence or absence of Morton’s toe.
People with “Morton’s toe” have a large second metatarsal, longer in fact than the first metatarsal (which is also known as the big toe or hallux). Wikipedia articles suggest that
this is a normal variation of foot shape in humans and that less than 20% of the human
population have this condition. Now we are considering a second characteristic for our
population, namely toe type.
Let’s let B designate the event “Morton’s toe.” Let the number of people with Morton’s toe be written as |B|. Let ~B designate the event “common toe.” Suppose 15 of the 100 people have Morton’s toe. This means |B| = 15 and |~B| = 85. The data are shown in
Table 2.2, and the Venn diagram is shown in Figure 2.4.
Table 2.2
Event Frequency
Morton’s toe (B) |B| = 15
Common toe (~B) |~B| = 85
Universe (U) |U| = 100
These events can be represented in a Venn diagram, where a box holds our universe of
100 people.
Note the size of this red circle is smaller than the previous example because the number of
individuals with Morton’s toe is much smaller.
Figure 2.4
Answer: You bet. With two characteristics, each with two outcomes, we have four possible
combinations:
1. Lefty AND Morton’s toe, which we write as A ∩ B.
2. Lefty AND common toe, which we write as A ∩ ~B.
3. Righty AND Morton’s toe, which we write as ~A ∩ B.
4. Righty AND common toe, which we write as ~A ∩ ~B.
The symbol ∩ (an upside-down U) is the mathematical symbol for intersection. Here, you can read it as “BOTH” or “AND.”
The number of individuals in A ∩ B can be written |A ∩ B|, where the bars indicate a number (not absolute value). Let’s suppose we record the frequency of individuals in each
of the four combinations (see Table 2.3).
Table 2.3
blue) sums the number of people that are lefties and righties. The lower right (white)
quadrant gives the grand total, or |U|.
Now let’s plot the results in the same Venn diagram (see Figure 2.5).
[Figure 2.5: Venn diagram in which the Lefty (A) circle and the Morton’s toe (B) circle do not overlap.]
The updated Venn diagram shows the blue circle with 70 lefties, the red circle with 15 people with Morton’s toe, and no overlap between the two. This means |A ∩ B| = 0, |A| = |A ∩ ~B| = 70, and |B| = |~A ∩ B| = 15. By subtraction, we know that the number of individuals in neither A nor B, |~A ∩ ~B|, is 15, because we need to account for all 100 individuals somewhere in the diagram.
Answer: For our universe, no. If you are a person with Morton’s toe, you are standing in
the red circle and cannot also be standing in the blue circle. So these two events
(Morton’s toe and lefties) are mutually exclusive because they do not occur at the
same time.
Answer: You bet. All 70 lefties do not have Morton’s toe. These two events are non-
mutually exclusive.
Answer: In this case, we need to adjust the Venn diagram to show that five of the people
that are lefties also have Morton’s toe. These individuals are represented as the intersection
between the two events (see Figure 2.6). Note that the total number of individuals is still
100; we need to account for everyone!
[Figure 2.6: Venn diagram in which the Lefty (A) and Morton’s toe (B) circles overlap; 5 people are in both.]
We’ll run with this example for the rest of the chapter.
This Venn diagram is pretty accurate (except for the size of the box overall). There are
70 people in A, 15 people in B, and 5 people in A \ B. The blue circle contains 70 elements
in total, and 5 of those elements also occur in B. The red circle contains 15 elements (so is a
lot smaller than the blue circle), and 5 of these are also in A. So 5/15 (33%) of the red circle
overlaps with the blue circle, and 5/70 (7%) of the blue circle overlaps with the red circle.
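Python’s set operations mirror this Venn diagram directly. Here we invent a hypothetical roster of 100 numbered people, arranged so the counts match the example:

```python
universe = set(range(100))    # 100 people, numbered 0-99
lefties = set(range(70))      # A: people 0-69 are lefties
mortons = set(range(65, 80))  # B: people 65-79 have Morton's toe

print(len(lefties & mortons))             # |A ∩ B|   -> 5
print(len(lefties - mortons))             # |A ∩ ~B|  -> 65
print(len(mortons - lefties))             # |~A ∩ B|  -> 10
print(len(universe - lefties - mortons))  # |~A ∩ ~B| -> 20
```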
Of the four events (A, ~A, B, and ~B), which are not mutually exclusive?
Answer:
• A and B are not mutually exclusive (a lefty can have Morton’s toe).
• A and ~B are not mutually exclusive (a lefty can have a common toe).
• ~A and B are not mutually exclusive (a righty can have Morton’s toe).
• ~A and ~B are not mutually exclusive (a righty can have a common toe).
In Venn diagrams, if two events overlap, they are not mutually exclusive.
Answer: If we focus on each circle, A and ~A are mutually exclusive (a person cannot be a lefty and a righty). B and ~B are mutually exclusive (a person cannot have Morton’s toe and a common toe). Apologies for the trick question!
If you were one of the lucky 100 people included in the universe,
where would you fall in this diagram?
Table 2.4
This is an important table to study. Once again, notice that there are four quadrants or
sections in this table. In the upper left quadrant, the first two columns represent the two
possible events for eyeball dominance: lefty and righty. The first two rows represent the two
possible events for toe type: Morton’s toe and common toe.
The upper left entry indicates that 5 people are members of both A and B.
• Look for the entry that indicates that 20 people are ~A ∩ ~B, that is, both not A and not B.
• Look for the entry that indicates that 65 people are A ∩ ~B.
• Look for the entry that indicates that 10 people are ~A ∩ B.
The lower left and upper right quadrants of our table are called the margins of the table. They are shaded light blue. Note that the total number of individuals in A (regardless of B) is 70, and the total number of individuals in ~A is 30. The total number of individuals in B (regardless of A) is 15, and the total number of individuals in ~B is 85. Any way you slice it, the grand total must equal 100 (the lower right quadrant).
Answer: Well, if you are interested in determining the probability that an individual
belongs to any of these four groups, you could use your universe of 100 individuals to do
the calculation. Do you remember the frequentist way to calculate probability? We
learned about that in Chapter 1.
Our total universe in this case is the 100 individuals. To get the probability that a person
selected at random would belong to a particular event, we simply divide the entire table
above by our total, which is 100, and we get the results shown in Table 2.5.
Table 2.5
We’ve just converted the raw numbers to probabilities by dividing the frequency table by
the grand total.
Note that this differs from our die rolling exercise in Chapter 1, where you were unsure
what the probability was and had to repeatedly roll a die to estimate it. By the Law of Large
Numbers, the more trials you have, the more you zero in on the actual probability. In this
case, however, we are given the number of people in each category, so the calculation is
straightforward. These 100 people are the only people of interest. They represent our
universe of interest; we are not using them to sample a larger group. If you didn’t know the
make-up of the universe, you could randomly select one person out of the universe over
and over again to get the probabilities, where all persons are equally likely to be selected.
Let’s walk through one calculation. Suppose we want to know the probability that an individual is a lefty AND has Morton’s toe. The number of individuals in A and B is written |A ∩ B|, and the probability that an individual is a lefty with Morton’s toe is written:
Pr(A ∩ B) = |A ∩ B| / |U| = 5/100 = 0.05.    (2.4)
This is officially called the joint probability and is the upper left entry in our table. The
Oxford Dictionary of Statistics states that the “joint probability of a set of events is the
probability that all occur simultaneously.” Joint probabilities are also called conjoint
probabilities. Incidentally, a table that lists joint probabilities such as the one above is
sometimes referred to as a conjoint table.
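Dividing each cell of the conjoint table by the grand total can be sketched in one dictionary comprehension (the tuple keys are our own hypothetical labeling):

```python
# Cell counts for the four combinations (eye dominance, toe type)
counts = {
    ("lefty", "mortons"): 5,
    ("lefty", "common"): 65,
    ("righty", "mortons"): 10,
    ("righty", "common"): 20,
}
grand_total = sum(counts.values())  # |U| = 100

# Joint probability of each combination: frequency / grand total
joint = {combo: n / grand_total for combo, n in counts.items()}

print(joint[("lefty", "mortons")])  # Pr(A ∩ B) = 0.05
```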
When you hear the word joint, you should think of the word AND and realize that you are considering (and quantifying) more than one characteristic of the population. In this case, it indicates that someone is in A AND B. This is written as A ∩ B.
Answer: This is equivalent to asking, what is the joint probability that a person is
right-eye dominant AND has Morton’s toe? See if you can find this entry in Table 2.5. The
answer is 0.1.
In addition to the joint probabilities, the table also provides the marginal probabilities, which look at the probability of A or ~A (regardless of B) and the probability of B or ~B (regardless of A).
Answer: The word marginal in the dictionary is defined as “pertaining to the margins; or situated on the border or edge.” In our table, the marginal probabilities are just the probabilities for one characteristic of interest (e.g., A and ~A) regardless of other characteristics that might be listed in the table.
Let’s now label each cell in our conjoint table by its probability type (see Table 2.6).
Table 2.6
Suppose you know only the following facts: the marginal probability of being a lefty is
0.7, the marginal probability of having Morton’s toe is 0.15, and the joint probability of
being a lefty with Morton’s toe is 0.05. Also suppose that you haven’t looked at
Table 2.6!
Take out some scratch paper and a pencil. You can do it! Here are some hints:
• the lower right hand quadrant must equal 1.00;
• for any given characteristic, the sum of the two marginal probabilities must equal 1.00.
Table 2.7
Answer: Because the marginal probabilities for eyeballs must sum to 1.00, and the mar-
ginal probabilities for toes must sum to 1.00 (because they deal with mutually exclusive
events), we can fill in the missing marginal probabilities.
The marginal probability of a lefty, Pr(A), is 0.7, so the marginal of a righty, Pr(∼A), must
be 1.00 − 0.7 = 0.3.
The marginal probability of having Morton's toe, Pr(B), is 0.15, so the marginal Pr(∼B)
must be 1.00 − 0.15 = 0.85.
So far, so good. Once we know the marginals, we can calculate the joint probabilities in
the upper left quadrant (see Table 2.8). For example:
• if the marginal Pr(A) = 0.7, then we know that Pr(A ∩ ∼B) = 0.7 − 0.05 = 0.65;
• if the marginal Pr(B) = 0.15, then we know that Pr(∼A ∩ B) = 0.15 − 0.05 = 0.1;
• if the marginal Pr(∼A) = 0.3, then we know that Pr(∼A ∩ ∼B) = 0.3 − 0.1 = 0.2.
Table 2.8
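The fill-in-the-table logic above is nothing more than repeated subtraction. This Python sketch (our own; the variable names are not from the book) starts from the same three given facts and recovers every remaining cell:

```python
# Given facts from the text.
pr_A, pr_B, pr_A_and_B = 0.7, 0.15, 0.05   # Pr(lefty), Pr(Morton's toe), Pr(both)

# Marginals for each characteristic must sum to 1.00.
pr_not_A = 1.0 - pr_A          # Pr(righty) = 0.3
pr_not_B = 1.0 - pr_B          # Pr(common toe) = 0.85

# Each joint is a marginal minus the joint already known in its row or column.
pr_not_A_and_B = pr_B - pr_A_and_B              # 0.15 - 0.05 = 0.10
pr_A_and_not_B = pr_A - pr_A_and_B              # 0.70 - 0.05 = 0.65
pr_not_A_and_not_B = pr_not_B - pr_A_and_not_B  # 0.85 - 0.65 = 0.20

# The four joint probabilities partition the universe, so they sum to 1.
total = pr_A_and_B + pr_not_A_and_B + pr_A_and_not_B + pr_not_A_and_not_B
print(round(total, 10))  # 1.0
```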
Answer: Don’t cheat now . . . try to express Pr(B) as the sum of joint probabilities before
reading on! This step is essential for understanding Bayesian inference in future chapters!
How did you do?
Hint 1: We can decompose the total, 0.15, into its two pieces: 0.05 þ 0.1.
The probability that a lefty has Morton’s toe can be written:
If we put these two terms together, we can express the marginal probability of having
Morton’s toe as:
Can we look at this problem from the Venn diagram perspective again?
Figure 2.7 Venn diagram: the Lefty and Morton's toe circles overlap, with 65 people in Lefty only, 5 in the overlap, and 10 in Morton's toe only.
Answer: To answer this question, we must introduce the very important concept of
conditional probability.
Answer: Conditional probability is the probability of an event given that another event
has occurred.
Conditional probability is written as:
• Pr(A | B), which is read "the probability of A, given that B occurs"; in our context, Pr(A | B)
is Pr(lefty | Morton's toe);
• Pr(A | ∼B), which is read "the probability of A, given that ∼B occurs"; in our context,
Pr(A | ∼B) is Pr(lefty | common toe);
• Pr(B | ∼A), which is read "the probability of B, given that ∼A occurs"; in our context,
Pr(B | ∼A) is Pr(Morton's toe | righty);
• etc.
The vertical bar means “given.”
Answer: You use the following equation, which is a standard equation in probability
theory:
Pr(A | B) = Pr(A ∩ B) / Pr(B).   (2.11)
It’s essential that you understand conditional probability, so let’s look at this equation from
a few different angles and, in the words of Kalid Azad, “let’s build some intuition” about
what it means.
Angle 1: The Venn diagram zoom
We already know that the numerator

Pr(A ∩ B)   (2.12)

is the intersection in the Venn diagram where A and B overlap (the probability of a lefty and
Morton's toe). This can be written as

Pr(B ∩ A)   (2.13)

as well. The intersection of A and B is the intersection, no matter how you write it:
And we know that the denominator Pr(B) is the probability of Morton’s toe.
In the Venn diagram, we can focus on the area of B and then look to see what fraction of
the total B is occupied by A. In this example, we restrict our attention to the 15 people with
Morton’s toe, and note that 5 of them are lefties. Therefore, about 5/15 or 1/3 of the red
circle is overlapped by the blue circle.
Figure 2.8 Zooming to the Morton's toe circle: A ∩ B contains 5 people and ∼A ∩ B contains 10.
For the numbers given, we can see that Pr(A | B) = 5/15 = 1/3 ≈ 0.333 = 33.3%. A general
rule can help with the visualization: zoom to the denominator space B, then determine
what fraction of this space is occupied by A. Similarly, Pr(∼A | B) = 10/15 = 2/3 ≈ 0.667 =
66.7%. Note that these probabilities sum to 1.
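The "zoom" rule translates directly into code. Here is a small Python illustration of our own (the variable names are made up for this sketch) that conditions on B using the raw counts:

```python
# Raw counts inside the B (Morton's toe) circle.
n_A_and_B = 5        # lefties with Morton's toe
n_not_A_and_B = 10   # righties with Morton's toe
n_B = n_A_and_B + n_not_A_and_B   # zoom: the denominator space B has 15 people

pr_A_given_B = n_A_and_B / n_B          # 5/15, about 0.333
pr_not_A_given_B = n_not_A_and_B / n_B  # 10/15, about 0.667

# Conditioned on B, the two possibilities are exhaustive, so they sum to 1.
print(round(pr_A_given_B + pr_not_A_given_B, 10))  # 1.0
```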
Angle 2: The table approach
We can also tackle this problem using the raw data (see Table 2.9).
Table 2.9
Pr(A | B) = Pr(A ∩ B) / Pr(B).   (2.15)

Pr(A ∩ B) = |A ∩ B| / |U| = 5/100.   (2.16)

Pr(B) = |B| / |U| = 15/100.   (2.17)
Answer: If you have Morton's toe, the probability of being a lefty is 0.33. If you don't have
Morton's toe, the probability of being a lefty is 65/85 ≈ 0.76 (you can confirm this too).
If Morton’s toe does not matter, these conditional probabilities should be equal to the
marginal probability, which is 0.7. This is clearly not the case here.
Answer: In other words, is the probability of a lefty, given Morton’s toe, the same thing as
the probability of Morton’s toe, given a lefty? Let’s try it!
Pr(A | B) = (|A ∩ B| / |U|) / (|B| / |U|) = |A ∩ B| / |B| = 5/15 = 0.333   (2.19)

Pr(B | A) = (|A ∩ B| / |U|) / (|A| / |U|) = |A ∩ B| / |A| = 5/70 = 0.071.   (2.20)
So the answer is No! These two probabilities are very different things. The first asks what is
the probability of A given that event B happens (with a result of 0.333), while the second
asks what is the probability of B given that A happens (with a result of 0.071).
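To drive the asymmetry home, the two conditionals can be computed side by side from the same counts. This is our own sketch, not the book's:

```python
n_A_and_B = 5   # lefties with Morton's toe
n_B = 15        # everyone with Morton's toe
n_A = 70        # all lefties

pr_A_given_B = n_A_and_B / n_B  # zoom to B: 5/15, about 0.333
pr_B_given_A = n_A_and_B / n_A  # zoom to A: 5/70, about 0.071

# Same numerator, different denominators: the two questions are not interchangeable.
print(round(pr_A_given_B, 3), round(pr_B_given_A, 3))  # 0.333 0.071
```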
Answer: Yes . . . see if you can find it before looking at Table 2.10!
Table 2.10
Remember, when dealing with conditional probabilities, the key word is “zoom.” Let’s
start with Pr(A | B):
Pr(A | B) = Pr(A ∩ B) / Pr(B).   (2.21)
If B happens, we zoom to row 1 (Morton’s toe), and then ask what fraction of the people
with Morton’s toe are lefties:
Pr(A | B) = 0.05/0.15 = 0.333.   (2.22)
Answer: If A happens, we zoom to the first column (lefties) and then ask what fraction of
the lefties have Morton’s toe:
Pr(B | A) = 0.05/0.7 = 0.071.   (2.23)
Pr(A | B) = Pr(A ∩ B) / Pr(B).   (2.24)
You can rearrange this to your heart's content. For this book, the most important
rearrangement is:

Pr(A ∩ B) = Pr(A | B) ∗ Pr(B).

This formula can be used to calculate the joint probability, Pr(A ∩ B). Take some time to make
sure this equation sinks in and makes full sense to you.
As an aside, if the occurrence of one event does not change the probability of the
other occurring, the two events are said to be independent. This means that
Pr(A | B) = Pr(A | ∼B) = Pr(A).
So, when A and B are independent, Pr(A ∩ B) = Pr(A) ∗ Pr(B).
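Independence is easy to test numerically. In this sketch (ours, not the book's) we check whether the eye/toe data satisfy Pr(A ∩ B) = Pr(A) ∗ Pr(B):

```python
pr_A, pr_B, pr_A_and_B = 0.70, 0.15, 0.05

# Under independence we would expect Pr(A) * Pr(B) = 0.105 in the joint cell.
expected_if_independent = pr_A * pr_B
independent = abs(pr_A_and_B - expected_if_independent) < 1e-9
print(independent)  # False: eye dominance and toe type are dependent here
```

The observed joint, 0.05, falls well short of 0.105, which is another way of saying that knowing someone has Morton's toe changes our probability that they are a lefty.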
Answer: That, dear reader, is the subject of our next chapter, where we will derive Bayes’
Theorem. See you there!
SECTION 2
Overview
• In Chapter 3, Bayes’ Theorem is introduced. The chapter shows its derivation and
describes two ways to think about it. First, Bayes’ Theorem describes the relationship
between two inverse conditional probabilities, P(A|B) and P(B|A). Second, Bayes’ The-
orem can be used to express how a degree of belief for a given hypothesis can be updated
in light of new evidence. This chapter focuses on the first interpretation.
• Chapter 4 introduces the concept of Bayesian inference. The chapter discusses the
scientific method, and illustrates how Bayes’ Theorem can be used for scientific inference.
Bayesian Inference is the use of Bayes’ Theorem to draw conclusions about a set of
mutually exclusive and exhaustive alternative hypotheses by linking prior knowledge
about each hypothesis with new data. The result is updated probabilities for each hy-
pothesis of interest. The ideas of prior probabilities, likelihood, and posterior probabilities
are introduced.
• Chapter 5, the “Author Problem,” provides a concrete example of Bayesian inference.
This chapter draws on work by Frederick Mosteller and David Wallace, who used Bayesian
inference to assign authorship for unsigned Federalist Papers. The Federalist Papers were a
collection of essays written in 1787–1788 in support of ratifying the United States Constitution.
However, some papers were unsigned by the author, resulting in disputed authorship. The chapter
provides a very basic Bayesian analysis of the unsigned “Paper 54,” which was written by
Alexander Hamilton or James Madison. The example illustrates the principles of Bayesian
inference for two competing hypotheses.
• Chapter 6, the “Birthday Problem,” is intended to highlight the decisions the analyst
(you!) must make in setting the prior distribution. The “Birthday Problem” expands
consideration from two hypotheses to multiple, discrete hypotheses. In this chapter,
interest is in determining the posterior probability that a woman named Mary was born
By the end of this section, you will have a good understanding of how Bayes’ Theorem is
related to the scientific method.
CHAPTER 3
Bayes’ Theorem
In this chapter, we’re going to build on the content in Section 1 and derive Bayes’ Theorem.
This is what you’ve been waiting for!
By the end of this chapter, you will be able to derive Bayes' Theorem and explain the
relationship between Pr(A | B) and Pr(B | A).
Let’s begin with a few questions.
Answer: It could be, but nobody is really sure! We’ll revisit this question in a future chapter.
Bayesian Statistics for Beginners: A Step-by-Step Approach. Therese M. Donovan and Ruth M. Mickey,
Oxford University Press (2019). © Ruth M. Mickey 2019.
DOI: 10.1093/oso/9780198841296.001.0001
The fact that there are two interpretations can be a source of confusion, but we hope you will
fully appreciate the difference soon! In this chapter, we’ll derive Bayes’ Theorem and discuss
the first interpretation of the theorem: the relationship between Pr(A | B) and Pr(B | A).
To help guide us, let’s return to our familiar Venn diagram and conjoint table of eye
dominance and toes. As before, our example will consider 100 total elements and four
events. Remember that our data consisted of 70 lefties and 15 Morties (people with
Morton's toe). This time, however, we will simply refer to the eye dominance events (left-
eyed vs. right-eyed dominant) as A and ∼A, and the toe events (Morton's toe vs. common
toe) as B and ∼B (see Figure 3.2). In this way, we can generalize the problem so that it
pertains to any two characteristics of interest.
Figure 3.2 Venn diagram: 65 in A only, 5 in A ∩ B, 10 in B only, and 20 in neither (∼A and ∼B).
These same numbers can be expressed in tabular form (Table 3.1), or as a conjoint table
(Table 3.2).
        A      ∼A     Sum
B       5      10      15
∼B      65     20      85
Sum     70     30     100

        A      ∼A     Sum
B       0.05   0.10   0.15
∼B      0.65   0.20   0.85
Sum     0.70   0.30   1.00
Pr(A | B) = Pr(A ∩ B) / Pr(B)   (3.1)

Pr(B | A) = Pr(B ∩ A) / Pr(A)   (3.3)
Pr(A ∩ B) = Pr(A | B) ∗ Pr(B) = (0.05/0.15) ∗ 0.15 = 0.05   (3.5)

Pr(B ∩ A) = Pr(B | A) ∗ Pr(A) = (0.05/0.7) ∗ 0.7 = 0.05.   (3.6)
Equations 3.5 and 3.6 produce the same result. We should expect this, right? We are trying
to estimate the probability of A and B occurring together, which is where the blue and red
circles intersect. Although Pr(A | B) is 0.05/0.15 = 0.33, and Pr(B | A) is 0.05/0.7 = 0.071,
when we multiply each by their respective marginals, we end up with the same joint
probability estimate.
We are just a couple of steps from Bayes’ Theorem. Are you ready?
We'll start with a simple reminder that the joint probability of A and B can be viewed in
two ways:

Pr(A ∩ B) = Pr(A | B) ∗ Pr(B) = Pr(B | A) ∗ Pr(A).

Therefore, dividing through by Pr(B):

Pr(A | B) = [Pr(B | A) ∗ Pr(A)] / Pr(B).   (3.10)
As we’ll soon see, Bayes’ Theorem can be expressed in other ways as well, but in this
version it should be clear that Bayes’ Theorem is a way of calculating conditional
probability. Figure 3.3 provides a handy-dandy summary of the key concepts.
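A quick numerical check of Equation 3.10 is reassuring. This Python sketch (our own illustration) computes Pr(A | B) two ways, directly from the table and via Bayes' Theorem, using the conjoint-table values:

```python
pr_A, pr_B, pr_A_and_B = 0.70, 0.15, 0.05

# Direct route: zoom to B.
pr_A_given_B_direct = pr_A_and_B / pr_B           # 0.05/0.15, about 0.333

# Bayes' route: start from the inverse conditional Pr(B | A).
pr_B_given_A = pr_A_and_B / pr_A                  # 0.05/0.70, about 0.071
pr_A_given_B_bayes = pr_B_given_A * pr_A / pr_B   # Pr(B|A) * Pr(A) / Pr(B)

print(round(pr_A_given_B_direct, 6) == round(pr_A_given_B_bayes, 6))  # True
```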
Figure 3.3 Summary of key concepts: the conditional probability of A given B, Pr(A | B) = Pr(A ∩ B)/Pr(B); the conditional probability of B given A, Pr(B | A) = Pr(A ∩ B)/Pr(A); and Bayes' Theorem, Pr(A | B) = [Pr(B | A) ∗ Pr(A)] / Pr(B).
Table 3.3
        A      ∼A     Sum
B       0.05   0.10   0.15
∼B      0.65   0.20   0.85
Sum     0.70   0.30   1.00
Answer: This is a problem where Bayes’ Theorem could be used because you are asked to
find a conditional probability: what is the probability of A given B. But we solved this easily
in Chapter 2 by just using the joint probability table: zoom to row B, and ask what
proportion of the total consists of A:
Pr(A | B) = 0.05/0.15 = 0.333.   (3.11)
This example demonstrates that even if you are given a Bayes’ type of problem, there are
still ways to solve the problem without really using the theorem. But, just to be complete,
let’s use Bayes’ Theorem to solve it:
Pr(A | B) = [Pr(B | A) ∗ Pr(A)] / Pr(B)   (3.12)

Pr(A | B) = [(0.05/0.7) ∗ 0.7] / 0.15 = 0.333.   (3.13)
It works!
Answer: You can use Bayes' Theorem when you are provided with one kind of conditional
probability, like Pr(A | B), but are asked to find its inverse, Pr(B | A).
A notable example is posted at http://yudkowsky.net/rational/bayes. In that article,
Yudkowsky poses the following challenge:
One percent of women at age forty who participate in routine screening have breast
cancer; 80% of women with breast cancer will have a positive mammogram (test), while
9.6% of women without breast cancer will also get a positive result. A woman in this age
group had a positive mammogram in a routine screening. What is the probability that
she actually has breast cancer?
We are given the probability that a woman with breast cancer will get a positive mammo-
gram. But we are asked for the reverse of this: what is the probability that a woman with a
positive mammogram has breast cancer?
Shall we give this a go? Let's let A represent women with breast cancer, and ∼A represent
women without it. And let's let B represent a positive test, and ∼B represent a negative test.
Bayes’ Theorem would allow us to estimate the probability of cancer, given a positive test
result, as:
Pr(A | B) = [Pr(B | A) ∗ Pr(A)] / Pr(B).   (3.14)
So we know Pr(A) = 0.01, and we're given Pr(B | A) = 0.8. To get our answer via Bayes'
Theorem, all we need is to determine the denominator, Pr(B).
Let’s set up a conjoint table to visualize the problem (see Table 3.4).
Our universe consists of women who participate in routine screenings for breast cancer.
From the problem, we know that Pr(A) = 0.01. This means that 1% of women have breast
cancer. It also means that Pr(∼A) = 0.99, or 99% of women do not have breast cancer.
These are the marginal probabilities for the cancer characteristic (see Table 3.5).
We can make some headway with this knowledge of Pr(A). We're also given the
probability of a positive test result given cancer:

Pr(B | A) = 0.8.

We can use that information to compute the joint probability of A and B (cancer and
positive test). If the marginal probability of having cancer, Pr(A), is 0.01, then the
probability of a positive test result and cancer is 0.008:

Pr(A ∩ B) = Pr(B | A) ∗ Pr(A) = 0.8 ∗ 0.01 = 0.008.

Now, we can calculate the probability of a negative test result and cancer through simple
subtraction:

Pr(A ∩ ∼B) = Pr(A) − Pr(A ∩ B) = 0.01 − 0.008 = 0.002.
Let’s add these entries to our joint and marginal probability table as follows, filling in any
cells we can (as shown in Table 3.6).
Notice that Pr(B | A), which is 0.8, is nowhere in this table. But 0.008/0.01 = 0.8.
So far, so good. Now we just need either the joint probability of ∼A ∩ B or the joint
probability of ∼A ∩ ∼B, and we can find the last missing piece to solve the problem—
the marginal Pr(B).
In the problem statement, we’re given the probability of a positive test result given no
cancer:
Pr(B | ∼A) = 0.096.   (3.18)
This allows us to compute the joint probability that a woman has a positive test AND does
not have breast cancer:

Pr(∼A ∩ B) = Pr(B | ∼A) ∗ Pr(∼A) = 0.096 ∗ 0.99 ≈ 0.095.

We can also compute the joint probability that a woman has a negative test AND does
not have breast cancer as:

Pr(∼A ∩ ∼B) = Pr(∼A) − Pr(∼A ∩ B) = 0.99 − 0.095 = 0.895.
This allows us to fill in the rest of our table (as shown in Table 3.7).
It may be helpful to think in terms of counts instead of probabilities. Suppose there are
1000 women. We would expect that these women would be partitioned into the four
events, as shown in Table 3.8.
Our Venn diagram would look roughly like the one in Figure 3.4, with the following results
(noting that the size of the box would be much, much larger than shown):
Figure 3.4 Venn diagram of the Cancer and Positive circles (counts per 1000 women): 8 women in the overlap and 95 in Positive only.
With the joint and marginal table filled in, we can now answer Yudkowsky's challenge: a
woman in this age group had a positive mammogram in a routine screening. What is the
probability that she actually has breast cancer?

Pr(A | B) = Pr(A ∩ B) / Pr(B) = 0.008/0.103 ≈ 0.0776.

She has a 7.76% chance of having cancer, given her positive test result. This probability
may be surprisingly small to you. You might have noticed that you didn’t need to fill out
the full conjoint table to get the answer. But this little exercise was a quick review of
Chapter 2.
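The whole mammogram calculation fits in a few lines of code. This Python sketch is our own (the numbers are exactly those stated in Yudkowsky's challenge): it builds the denominator Pr(B) from its two joint pieces and then applies Bayes' Theorem.

```python
pr_cancer = 0.01              # Pr(A): 1% prevalence among women screened
pr_pos_given_cancer = 0.80    # Pr(B | A)
pr_pos_given_healthy = 0.096  # Pr(B | ∼A)

# Marginal Pr(B): the sum of the two joint probabilities that make it up.
pr_pos = (pr_pos_given_cancer * pr_cancer
          + pr_pos_given_healthy * (1 - pr_cancer))   # 0.008 + 0.09504

# Bayes' Theorem: Pr(A | B) = Pr(B | A) * Pr(A) / Pr(B)
pr_cancer_given_pos = pr_pos_given_cancer * pr_cancer / pr_pos
print(round(pr_cancer_given_pos, 4))  # 0.0776
```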
Do you see how the Theorem allows you to switch the information around? We were
provided with information about the probability of a test result, given a cancer condition.
We used Bayes’ Theorem to determine the probability of cancer given the test result.
This use of Bayes' Theorem relates inverse representations of the probabilities concerning
two events: Pr(A | B) and Pr(B | A). Bayes noted that there is an intricate relationship be-
tween Pr(A | B) and Pr(B | A). His theorem is useful because sometimes it is far easier to
estimate one of these conditional probabilities than the other.
Answer: In terms of what Bayes’ Theorem is and how you can use it to calculate condi-
tional probability, yes.
You’ll often see Bayes’ Theorem written in the following form:
Pr(A | B) = [Pr(B | A) ∗ Pr(A)] / Pr(B).   (3.23)
There are other, equally valid ways to express Bayes’ Theorem. For example, the two
equations below are equivalent to the one given above:
Pr(A | B) = [Pr(B | A) ∗ Pr(A)] / [Pr(A ∩ B) + Pr(∼A ∩ B)]   (3.24)

Pr(A | B) = [Pr(B | A) ∗ Pr(A)] / [Pr(B | A) ∗ Pr(A) + Pr(B | ∼A) ∗ Pr(∼A)].   (3.25)
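The three forms differ only in how the denominator Pr(B) is written, so numerically they must agree. Here is a short check of our own, reusing the mammogram numbers:

```python
pr_A = 0.01              # prior: cancer
pr_B_given_A = 0.80      # likelihood of a positive test given cancer
pr_B_given_not_A = 0.096 # likelihood of a positive test given no cancer

pr_A_and_B = pr_B_given_A * pr_A
pr_not_A_and_B = pr_B_given_not_A * (1 - pr_A)
pr_B = pr_A_and_B + pr_not_A_and_B

numerator = pr_B_given_A * pr_A
form_marginal = numerator / pr_B                               # Eq. 3.23 style
form_joints = numerator / (pr_A_and_B + pr_not_A_and_B)        # Eq. 3.24 style
form_expanded = numerator / (pr_B_given_A * pr_A
                             + pr_B_given_not_A * (1 - pr_A))  # Eq. 3.25 style

print(form_marginal == form_joints == form_expanded)  # True: same denominator
```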
The second definition of Bayes’ Theorem focuses on inference—a topic we’ll explore in
Chapter 4, which relies on this expanded version.
CHAPTER 4
Bayesian Inference
Now that you’ve been introduced to Bayes’ Theorem, we’ll focus our attention on Bayesian
inference. By the end of this chapter, you will understand the following concepts:
• Bayesian inference
• Induction
• Deduction
• Hypothesis
• Alternative hypotheses
• Prior probability of a hypothesis
• Likelihood of the observed data
• Posterior probability of a hypothesis, given the data
Before we get started, let’s quickly review Bayes’ Theorem, using a Venn diagram as a visual
aid (see Figure 4.1).
Figure 4.1 Venn diagram: 65 in A only, 5 in A ∩ B, 10 in B only, and 20 in neither (∼A and ∼B).
Here, A and B represent two events. Therefore, this diagram consists of four possible joint
events:
• A ∩ B
• A ∩ ∼B
• ∼A ∩ B
• ∼A ∩ ∼B
In Section 1, we learned that the joint probability of A and B occurring can be expressed as
Pr(A ∩ B) = Pr(A | B) ∗ Pr(B), or equivalently as Pr(B ∩ A) = Pr(B | A) ∗ Pr(A).
Therefore:
Pr(A | B) = [Pr(B | A) ∗ Pr(A)] / Pr(B).   (4.4)
In Chapter 3, we provided the two definitions of Bayes’ Theorem and discussed definition 1:
• The Theorem relates inverse representations of the probabilities concerning two events,
that is, Pr(A | B) and Pr(B | A).
We walked through a few different examples to show how to use Bayes’ Theorem to
determine, for example, the probability that a woman has breast cancer if her mammogram
came back positive.
In this chapter, we turn our attention to the second definition of Bayes’ Theorem:
• The Theorem can express how a subjective degree of belief should rationally change to
account for evidence.
Here’s the theorem we’ll be using for definition 2:
Pr(A | B) = [Pr(B | A) ∗ Pr(A)] / Pr(B).   (4.5)
Yes, the equation is the same, but in this second interpretation, Bayes' Theorem opens up
entirely new possibilities for thinking about probabilities and the conduct of science.
Answer: There are many definitions, formal and informal. Generally speaking, science
refers to a system of acquiring knowledge.
Answer: (From NASA): Science is curiosity in thoughtful action about the world and how
it behaves.
Answer: (From Wikipedia): Science (from Latin scientia, meaning “knowledge”) is a systematic
enterprise that builds and organizes knowledge in the form of testable explanations and
predictions about the universe.
Answer: We normally use what is called the scientific method. The Oxford English Diction-
ary (Stevenson, 2010) says that the scientific method is “a method or procedure that has
characterized natural science since the 17th century, consisting in systematic observation,
measurement, and experiment, and the formulation, testing, and modification of hypotheses.”
A key concept in scientific endeavors is formulating testable, alternative explanations
about how the universe works. The scientific method actually consists of two types of
inquiry: induction and deduction, which, when used in concert, produce knowledge.
The scientific process is nicely captured in the diagram in Figure 4.2 (adapted from Rao,
1997). Let’s walk through this diagram, noting that it is a (diamond-shaped) circle at heart
and has neither beginning nor end. It can be thought of as a race track, or a wheel.
Figure 4.2 The scientific method as a cycle: Hypotheses or Theory → [Deductive Reasoning] → Consequences (Prediction) → [Design of Experiments (Ensuring Validity of Data)] → Data → [Inductive Reasoning] → Inference (Verification of Theory) → [Creation of New Ideas (Enlightened Guess Work)] → back to Hypotheses or Theory.
We need to start somewhere in this diagram, which contains four boxes and four arrows.
Let’s start with the upper box:
5. Data Box. After the experiment is designed, we then collect some data. We can also use
existing datasets if they are appropriate.
6. Inductive Reasoning Arrow. The Oxford Reference tells us that inductive reasoning
involves inferring general principles from specific examples.
7. Inference Box. Inductive reasoning makes broad generalizations from specific obser-
vations. Dr. Sylvia Wassertheil-Smoller explains, “In inductive inference, we go from the
specific to the general. We make many observations, discern a pattern, make a general-
ization, and infer an explanation or a theory” (source: http://www.livescience.com/
21569-deduction-vs-induction.html; accessed August 17, 2017). At this point, we may
verify the pattern, or falsify our hypothesis or theory.
8. Creativity Arrow. The final process involves creativity, in which we bring to bear
creative ideas that may explain a pattern or a phenomenon, which brings us full circle.
C. R. Rao (1997) notes that the inference-through-consequences portion of the diagram
comes under the subject of research and the creative role played by scientists, while the
design of experiments through inductive reasoning comes under the realm of statistics.
Note that you can go through a portion of the scientific race track or wheel (e.g., collect
data in the lower box, analyze the data to infer a generalized pattern, and then stop).
However, going around the race track once or multiple times builds knowledge about a
system. As Dr. Wassertheil-Smoller states, “In science there is a constant interplay between
inductive inference (based on observations) and deductive inference (based on theory),
until we get closer and closer to the ‘truth,’ which we can only approach but not ascertain
with complete certainty.”
“Our search is only for working hypotheses which are supported by observational facts
and which, in course of time, may be replaced by better working hypotheses with more
supporting evidence from a wider set of data and provide wider applicability.” –
C.R. Rao.
There are reams of writings about the scientific process—much of it is (thankfully) beyond the
scope of this book. The nice thing about Bayes’ Theorem is that it can be used to address many
kinds of scientific questions, but the underlying concept of Bayesian inference is the notion
of alternative hypotheses, which is why we started in the upper box.
Answer: You’ll have to read more about the story of Bayes’ Theorem to learn about how he
came to his great insight. It involves a thought experiment with balls rolling on a table.
Due credit should also be given to Pierre-Simon Laplace. A Wikipedia author notes
“Pierre-Simon Laplace was an influential French scholar whose work was important to
the development of mathematics, statistics, physics and astronomy. He summarized and
extended the work of his predecessors in his five-volume Mécanique Céleste (Celestial
Mechanics) (1799–1825). This work translated the geometric study of classical mechanics
to one based on calculus, opening up a broader range of problems. In statistics, the
Bayesian interpretation of probability was developed mainly by Laplace” (article accessed
August 15, 2017).
If you’d like to learn more, a terrific read is The Theory That Would Not Die: How Bayes’
Rule Cracked the Enigma Code, Hunted Down Russian Submarines, and Emerged Triumph-
ant from Two Centuries of Controversy, by Sharon Bertsch McGrayne (2011). If you can’t
buy the book, at the very least, watch a lecture.
Answer: Bayesian inference is the process of confronting alternative hypotheses with new
data and using Bayes’ Theorem to update your beliefs in each hypothesis.
The Oxford Dictionary of Statistics (Upton and Cook 2014) describes Bayesian inference
as “an approach concerned with the consequences of modifying our previous beliefs as a
result of receiving new data.”
Wikipedia defines it this way: “Bayesian inference is a method of statistical inference in
which Bayes’ Theorem is used to update the probability for a hypothesis as more evidence
or information becomes available” (article accessed August 15, 2017).
Pr(A | B) = [Pr(B | A) ∗ Pr(A)] / Pr(B).   (4.6)
To use Bayes' Theorem for scientific inference, it's useful (actually, essential) to replace the
marginal denominator Pr(B) with the sum of the joint probabilities that make it up:

Pr(A | B) = [Pr(B | A) ∗ Pr(A)] / [Pr(A ∩ B) + Pr(∼A ∩ B)].   (4.7)
It’s critical that you understand this: the marginal probability of B is the sum of the
joint probabilities that make it up. Perhaps the conjoint table will help jog your
memory (see Table 4.1).
            A              ∼A              Marginal
B           Pr(A ∩ B)      Pr(∼A ∩ B)      Pr(B)
∼B          Pr(A ∩ ∼B)     Pr(∼A ∩ ∼B)     Pr(∼B)
Marginal    Pr(A)          Pr(∼A)          Total = 1.00
Remember that the marginal probabilities are stored, well, in the margins, and that you
obtain them by adding the joint probabilities. So far, so good.
Now, let’s tackle a problem. Returning to the Yudkowsky example in Chapter 3, suppose
we were asked to find the probability that a woman has breast cancer (A), given that her
mammogram test came back positive (B). Here, the results of the mammogram are the “data.”
OK, we are asked to find Pr(A | B). Thus, Pr(B) is in the denominator in Bayes’ Theorem,
which is:
Pr(A | B) = [Pr(B | A) ∗ Pr(A)] / [Pr(A ∩ B) + Pr(∼A ∩ B)].   (4.8)
Let’s now identify the parts of this problem in terms of the scientific method:
• We have two competing hypotheses regarding cancer: the woman has cancer (A) vs. she
does not (∼A).
• We have data for this problem: the test came back positive. So, B represents our observed
data. Notice that ∼B does not appear in the problem because we did not observe a
negative test.
Answer: We can make great headway if we now replace the joint probabilities with
their conditional probability equivalents:
Pr(A | B) = [Pr(B | A) ∗ Pr(A)] / [Pr(B | A) ∗ Pr(A) + Pr(B | ∼A) ∗ Pr(∼A)].   (4.11)
The denominator is still just the sum of the two joint probabilities that make up the
marginal probability of B. Never, ever, lose sight of this! Let’s look at this form of
Bayes’ Theorem more deeply.
Answer: You might have noticed that one term in the denominator is exactly the same
thing as the numerator!
Pr(A | B) = [Pr(B | A) ∗ Pr(A)] / [Pr(B | A) ∗ Pr(A) + Pr(B | ∼A) ∗ Pr(∼A)].   (4.12)
Now, it should be clear that Bayes’ Theorem returns a proportion, or probability. After all,
we need to find Pr(A | B), which is a probability that ranges between 0 and 1.
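Because the numerator is one of the two terms in the denominator, the posteriors for the complementary hypotheses necessarily sum to 1: the denominator acts as a normalizing constant. A tiny check (our own sketch, using the mammogram numbers):

```python
pr_A, pr_B_given_A, pr_B_given_not_A = 0.01, 0.80, 0.096

denom = pr_B_given_A * pr_A + pr_B_given_not_A * (1 - pr_A)
post_A = pr_B_given_A * pr_A / denom                # Pr(A | B)
post_not_A = pr_B_given_not_A * (1 - pr_A) / denom  # Pr(∼A | B)

print(round(post_A + post_not_A, 10))  # 1.0
```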
Answer: Well, Bayes’ great insight was that when we use the Theorem in this way, the
equation itself can be used to draw inferences regarding competing hypotheses and thus is
directly tied to the scientific method. Cool!
Let’s revisit our scientific method diagram (see Figure 4.3).
Figure 4.3 The scientific method cycle again: Hypotheses or Theory → [Deductive Reasoning] → Consequences (Prediction) → [Design of Experiments (Ensuring Validity of Data)] → Data → [Inductive Reasoning] → Inference (Verification of Theory) → [Creation of New Ideas (Enlightened Guess Work)] → back to Hypotheses or Theory.
Again, the scientific “race track” has no beginning or end. But Bayesian inference begins
with the notion of multiple hypotheses, so we’ll start with the upper box. Our problem is to
determine the probability that a woman has breast cancer, given the test result was positive.
Let’s focus on the four main boxes only:
Finally, we use Bayes’ Theorem to determine a posterior probability for each hypoth-
esis. The posterior probability represents our updated belief in each hypothesis after new
data are collected:
• Probability of cancer, given the observed data
• Probability of no cancer, given the observed data.
Pr(Cancer | Positive)
= [Pr(Positive | Cancer) ∗ Pr(Cancer)] / [Pr(Positive | Cancer) ∗ Pr(Cancer) + Pr(Positive | ∼Cancer) ∗ Pr(∼Cancer)].   (4.13)
Now, let’s replace each term with its Bayesian inference definition (see Figure 4.4).
Figure 4.4 Bayes' Theorem annotated for inference: the left side, Pr(A | B), is the posterior probability of hypothesis A, given data B; the numerator is the likelihood of the data, B, given hypothesis A, Pr(B | A), times the prior probability of hypothesis A, Pr(A); the denominator adds the corresponding likelihood and prior terms for hypothesis ∼A: Pr(A | B) = [Pr(B | A) ∗ Pr(A)] / [Pr(B | A) ∗ Pr(A) + Pr(B | ∼A) ∗ Pr(∼A)].
Paul Samuelson, the Nobel laureate from the Massachusetts Institute of Technology,
recalled that John Maynard Keynes once was challenged for altering his position on
some economic issue. “When my information changes, I change my mind. What do
you do?”
One percent of women at age forty who participate in routine screening have breast
cancer. Eighty percent of women with breast cancer will get positive mammograms. In
addition, 9.6% of women without breast cancer will get positive mammograms.
A woman in this age group had a positive mammogram in a routine screening. What
is the probability that she actually has breast cancer?
In this chapter, we are solving the problem with the second interpretation of Bayes’ Theorem:
Pr(A | B) = [Pr(B | A) ∗ Pr(A)] / [Pr(B | A) ∗ Pr(A) + Pr(B | ∼A) ∗ Pr(∼A)].   (4.16)
Pr(A | B) = [(0.008/0.01) ∗ 0.01] / [(0.008/0.01) ∗ 0.01 + (0.095/0.99) ∗ 0.99] = 0.0776.   (4.17)
Both return the same answer, and both are correct. The second interpretation, however,
places the problem within a scientific context, where you posit hypotheses and then update
your beliefs in each hypothesis after data are collected. In this particular example, our
hypotheses are that the woman has cancer and that she does not. Without a test, our initial
belief that a woman has breast cancer is 1% (because 1% of the population has breast
cancer), and our belief that she does not have breast cancer is 99%. With the test, however,
we can use Bayes’ Theorem to update our beliefs. In light of the new data, the probability
that a woman has breast cancer is 7.8%.
Now let's generalize this problem even more. Suppose there are n hypotheses. Let's number
our hypotheses from i = 1 to n. Here, i is an index of hypotheses. In our example, n = 2
since there are two hypotheses. We could let hypothesis i = 1 be the "cancer hypothesis",
and i = 2 be the "no cancer hypothesis." But we can generalize this as H1, H2, …, Hn
hypotheses. In other words, there can be any discrete number of hypotheses.
Let’s let data represent the data we collected, that is, our observed data. So now Bayes’
Theorem can be written:
Pr(Hi | data) = [Pr(data | Hi) ∗ Pr(Hi)] / Σ (j = 1 to n) Pr(data | Hj) ∗ Pr(Hj).   (4.18)
• The left side of the equation, Pr(Hi | data), can be read: “The posterior probability of
hypothesis i, given the data.”
• The numerator requires the likelihood of observing the data under hypothesis i and is
written Pr(data | Hi). This is then multiplied by the prior probability for hypothesis i,
which is written Pr(Hi).
• In the denominator, the symbol Σ means “sum.” The “j” associated with the sum symbol
is an index of summation. Thus, in the denominator, we are summing over all
hypotheses. So the index j goes from 1 to n. In our example, we had two hypotheses,
so there must be two terms in the denominator.
We use Bayes’ Theorem to compute the posterior probabilities for each and every
hypothesis under consideration.
Broadly speaking, you start with a scientific question and set forth two or more alternative
hypotheses. You then assign a prior probability that each alternative hypothesis is
true. Next, you collect data. Finally, you use Bayes’ Theorem to update the probability
for each hypothesis considered.
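The generalized update in Equation 4.18 can be sketched as a small Python function. The numbers reuse the breast cancer example, with the same assumed test characteristics as before:

```python
def bayes_update(priors, likelihoods):
    """Posterior probabilities for n discrete hypotheses.

    priors[i] = Pr(Hi); likelihoods[i] = Pr(data | Hi).
    Each numerator Pr(data | Hi) * Pr(Hi) is divided by the sum of
    those products over all hypotheses (the denominator of Eq. 4.18).
    """
    products = [lik * pri for lik, pri in zip(likelihoods, priors)]
    total = sum(products)  # sum over j = 1..n
    return [p / total for p in products]

# The breast cancer example as a two-hypothesis special case:
posteriors = bayes_update(priors=[0.01, 0.99], likelihoods=[0.80, 0.096])
print([round(p, 4) for p in posteriors])  # → [0.0776, 0.9224]
```

Note that the posteriors always sum to 1.0, because each product is divided by the sum of all the products.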
Each time you collect more data and apply Bayes’ Theorem, you’re updating your belief in
each alternative hypothesis. The new posterior probabilities then become priors for the
next analysis. By tracking our beliefs through time, we track our learning. The phrase
“Today’s posterior is tomorrow’s prior” captures the process nicely, as long as the same data
are not used to update the prior again the next time around.
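“Today’s posterior is tomorrow’s prior” can be sketched by feeding the output of one update into the next. The second round of data here is hypothetical (a second, independent positive test with the same assumed characteristics), just to show the mechanics:

```python
def bayes_update(priors, likelihoods):
    # Discrete Bayes update: normalize likelihood * prior over all hypotheses.
    products = [l * p for l, p in zip(likelihoods, priors)]
    total = sum(products)
    return [x / total for x in products]

priors = [0.01, 0.99]                        # initial beliefs: cancer, no cancer
first = bayes_update(priors, [0.80, 0.096])  # after one positive test
# Today's posterior becomes tomorrow's prior for a second, independent test:
second = bayes_update(first, [0.80, 0.096])
print(round(second[0], 3))  # → 0.412
```

A second positive result raises the probability of cancer much further, because the first update already shifted the prior.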
Answer: Loads of them! And the use of Bayes’ Theorem is skyrocketing in many fields. In
Chapter 5, we’ll work on another problem to give you more practice.
CHAPTER 5
In this chapter, we provide a concrete example of Bayesian inference. By the end of this
chapter, you should have a renewed understanding of:
• Bayesian inference
• Hypothesis
• Alternative Hypothesis
• Prior probability of a hypothesis
• Prior probability distribution
• Likelihood of the observed data
• Posterior probability of a hypothesis
• Posterior probability distribution.
We’ll walk through a classic example of Bayesian Inference. In 1964, Frederick Mosteller
and David Wallace published an article in which they studied the disputed authorship of
some of the Federalist Papers. Mosteller and Wallace introduce their paper as follows:
In the 12 disputed papers, either Alexander Hamilton (see Figure 5.1, left) or James Madison
(see Figure 5.1, right) signed his name as “Publius” instead of using his given name. Mosteller
and Wallace used a Bayesian analysis to rightfully attribute the authorship of each paper.
Let’s assume we are working with a specific paper of unknown authorship (No. 54), and
walk through a Bayesian inference analysis. This paper discusses the way in which the seats
in the United States House of Representatives are apportioned among the states. It is titled
“The Apportionment of Members Among the States.”
First, let’s recall Bayes’ Theorem:
Pr(A | B) = [Pr(B | A) * Pr(A)] / [Pr(B | A) * Pr(A) + Pr(B | ∼A) * Pr(∼A)]    (5.1)
Bayesian Statistics for Beginners: A Step-by-Step Approach. Therese M. Donovan and Ruth M. Mickey,
Oxford University Press (2019). © Ruth M. Mickey 2019.
DOI: 10.1093/oso/9780198841296.001.0001
We’ll use a series of steps to conduct our analysis, which follow the scientific race track
described in Chapter 4:
1. Identify your hypotheses.
2. Express your belief that each hypothesis is true in terms of prior probabilities.
3. Gather the data.
4. Determine the likelihood of the observed data under each hypothesis.
5. Use Bayes’ Theorem to compute the posterior probabilities for each hypothesis.
What is step 1?
Answer: In step 1, we identify our hypotheses. Paper No. 54 is 2008 words in length, and
our goal is to determine its most likely author. For this paper, we have two hypotheses for
authors: Hamilton or Madison. Let’s nail down our nomenclature:
• Hamilton = Hamilton hypothesis
• Madison = Madison hypothesis
Note that the hypotheses are exhaustive and mutually exclusive. Since there are only two
hypotheses, it makes sense that:

Madison = ∼Hamilton

and

Hamilton = ∼Madison.
In terms of our notation for Bayes’ Theorem in Equation 5.1 above, A could correspond to
the hypothesis that the author is Alexander Hamilton, and ∼A could correspond to the
hypothesis that the author is James Madison. And B in Bayes’ Theorem above represents the
observed data.
The posterior probability that the author was Hamilton is then:

Pr(Hamilton | data) = [Pr(data | Hamilton) * Pr(Hamilton)] / [Pr(data | Hamilton) * Pr(Hamilton) + Pr(data | Madison) * Pr(Madison)].
What is step 2?
Answer: We express our belief that each hypothesis is true in terms of prior probabilities.
We do this because the paper is unsigned and we have some uncertainty about which
hypothesis is correct. If the author authentically signed the paper, the probability of their
authorship would be 1.0, and we wouldn’t have a problem to tackle!
• Pr(Hamilton) = prior probability that the true author is Hamilton.
• Pr(Madison) = prior probability that the true author is Madison.
Remember that the sum of the prior probabilities must add to 1.0. We have several options;
here are a few:
There are many possible combinations of prior probabilities to choose from. Knowing that
Hamilton penned 43 papers and that Madison penned 14, it may be reasonable to assign
Pr(Hamilton) a prior probability of 0.75 (the last option above) because Hamilton wrote
43 of the 57 papers that were signed by either man (43/(43 + 14) = 0.75). That would mean
that Pr(Madison) = 0.25.
However, Mosteller and Wallace set the odds at 50:50, which is Pr(Hamilton) = 0.5 and
Pr(Madison) = 0.5. This gives each hypothesis the same “weight” of belief as the other.
Let’s now represent our hypotheses and prior probabilities as a graph (see Figure 5.2).
Figure 5.2 The prior probability distribution: the Hamilton and Madison hypotheses on the x-axis, each with a prior probability of 0.5 on the y-axis.
A graph of the prior probabilities is called the prior probability distribution. This
distribution represents hypotheses on the x-axis and their probabilities on the y-axis.
Note that the sum of the probabilities across hypotheses must be 1.0; this means that the
list of alternative hypotheses is exhaustive. In addition, the hypotheses are mutually
exclusive: the paper can’t be written by both men. Figure 5.2 is an example of a discrete
prior probability distribution.
What is step 3?
Answer: Gather the data. The data for this problem are found in Paper No. 54, and
it is 2008 words long.
Ultimately, we are asking, which hypothesis (Hamilton or Madison) is most consistent
with the data we observe (the paper)?
• Pr(Hamilton | data) = what is the probability the author is Hamilton, given the paper?
• Pr(Madison | data) = what is the probability the author is Madison, given the paper?
This is the challenge that Mosteller and Wallace set out to tackle, and it is a very tough nut
to crack!
Mosteller and Wallace didn’t use ALL of the data at their disposal (i.e., every word in the
paper), but instead focused on some signature words and phrases as a key to identify each
author.
Madison tended to use the word by more frequently than Hamilton, whereas Hamilton
tended to use the word to more frequently than Madison. The best single discriminator,
however, was the use of the word upon.
Hamilton used upon with overwhelmingly greater frequency than Madison. To be sure,
other metrics could have been used, including the length of each man’s sentences. However,
the two men are nearly identical in this respect, with Hamilton averaging 34.55 words
per sentence, and Madison averaging 34.59 per sentence. Sentence length, therefore, would
not be very helpful in separating the two hypotheses.
Our sincere apologies to Mosteller and Wallace, but to keep things simple, we will
simplify their analysis tremendously and focus on the frequency of the use of upon by
each author.
The word “upon” appeared twice in the paper in question, so the rate of upon is calculated as:

#upons / total words = 2 / 2008 = 0.000996.    (5.5)
If we standardize this rate to 1000 words, then our manuscript has a standardized rate of
upon of 0.000996 * 1000 = 0.996 upons per 1000 words.
We need some standardization because papers vary in length: a paper with 1 upon in 20
words has a very different rate of upon than a paper with 1 upon in 2008 words.
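The standardization step (Equation 5.5 and the rescaling to 1000 words) is just two divisions and a multiplication; here it is in Python, using the counts given in the text:

```python
# Standardized rate of "upon" per 1000 words for the disputed paper.
upons, total_words = 2, 2008
rate = upons / total_words    # raw rate: 0.000996...
rate_per_1000 = rate * 1000   # standardized to a 1000-word basis
print(round(rate_per_1000, 3))  # → 0.996
```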
Now that we have the data in hand, we need to determine the likelihood of observing a
rate of 0.996 upons under each hypothesis.
What is step 4?
Answer: Determine the likelihood of the observed data, assuming each hypothesis
is true.
Pr(data | Hamilton)

and

Pr(data | Madison).
Here we pit the observed data against each hypothesis. This step is often the most
challenging part of a Bayesian analysis. How exactly do you compute the likelihood of
the data?
Dictionary.com provides three definitions of likelihood:
So, likelihood is another word for probability. There is a subtle difference, however, in
the way the term is used in statistical analysis, where likelihood describes the probabil-
ity of observing data that have already been collected. We will cover this in more detail
in future chapters, but for now it’s enough just to have a good feel for the concept:
likelihood involves the collection of data, and we look retrospectively at
the probability of collecting those data. If we live in the Sahara Desert, we know
the probability of rain is low. Suppose it rains for seven consecutive days; this represents
our data. Can we all agree that it is very unlikely to have observed seven consecutive
days of rain, given that we are in the Sahara Desert? We hope so!
Wolfram Math World differentiates likelihood and probability this way: “Likelihood is
the hypothetical probability that an event that has already occurred would yield a specific
outcome. The concept differs from that of a probability in that a probability refers to the
occurrence of future events, while a likelihood refers to past events with known
outcomes.”
Let’s put this in terms of authorship of the disputed paper. The paper is a past event with a
known outcome: the rate of upon is 0.996 per 1000 words. Our next step, then, is to
determine how likely it is to observe 0.996 upons per thousand words under each
hypothesis:

Pr(0.996 | Hamilton)    (5.9)

Pr(0.996 | Madison)    (5.10)
• Equation 5.9 asks: “What is the likelihood of observing a rate of 0.996 upons per 1000
words, given that the author is Hamilton?”
• Equation 5.10 asks: “What is the likelihood of observing a rate of 0.996 upons per 1000
words, given that the author is Madison?”
Notice that the likelihoods are conditional for each hypothesis! In this Bayesian ana-
lysis, the likelihood is interpreted as the probability of observing the data, given the
hypothesis.
Answer: Computing the likelihood of the observed data is a critical part of Bayesian
analysis. Throughout this book, you’ll be learning about probability distributions that
will enable you to estimate the likelihood of the observed data under each hypothesis.
But, for now, what you need is a good, intuitive feel for what it is, and a firm understanding
of how the observed data (a rate of 0.996) has a different likelihood depending on whether
the author was Hamilton or Madison.
Imagine Mosteller and Wallace had a whole team of students who searched through 98
articles known to be penned by Hamilton or Madison, and tediously counted the
number of times each author used the word upon and also tallied the total number of
words in each document. The data might look like that shown in Table 5.1.
Table 5.1
Table 5.2

Rate of upon per 1000 words   Hamilton (48 papers)   Madison (50 papers)
0                             0                      41
(0,1]                         1                      7
(1,2]                         10                     2
…                             …                      …
This table assigns each article into a “rate bin,” where the bins are provided in column 1.
The first bin includes all manuscripts where the word upon is never used (i.e., the rate of
upon = 0). Let’s talk about the notation used for the remaining bins. A closed interval such
as [0,1] would include all manuscripts with a rate of 0 through 1, including the endpoints.
An open interval such as (0,1) does not include the endpoints 0 or 1. A half-closed interval
has the notation (0,1]. Thus, the interval (0,1] is a bin that includes all manuscripts where
the rate of upon was > 0 but ≤ 1. For the 48 papers known to be penned by Hamilton, 0 of
them had a rate of 0, 1 had a rate in the (0,1] bin, 10 had a rate in the (1,2] bin, and so on.
For the 50 papers known to be penned by Madison, 41 of them had a rate of 0 (in which the
word upon was never used), 7 had a rate > 0 and ≤ 1, and 2 had a rate > 1 and ≤ 2.
The same data can be shown as a frequency histogram (see Figure 5.3). The x-axis of
this histogram shows the bins, or the rate of upons per 1000 words. The y-axis is the
frequency, or number of articles, corresponding to each rate of upon. Frequency histograms
therefore depict the raw data in graphic format.
Figure 5.3 Frequency histogram of the rate of upon per 1000 words (bins 0, (0,1], (1,2], …, (7,8]) for papers known to be by Hamilton and by Madison.
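The half-closed bin notation can be made concrete with a small Python helper. The helper name and the sample rates below are hypothetical, chosen only to exercise the binning rule:

```python
import math

def rate_bin(rate):
    """Assign a rate of 'upon' to its bin: '0' for unused, else (k, k+1]."""
    if rate == 0:
        return "0"
    lo = math.ceil(rate) - 1  # rate falls in the half-closed interval (lo, lo+1]
    return f"({lo},{lo + 1}]"

# Hypothetical rates per 1000 words, just to exercise the binning:
rates = [0.0, 0.996, 1.0, 1.5, 2.0, 3.2]
print([rate_bin(r) for r in rates])
# → ['0', '(0,1]', '(0,1]', '(1,2]', '(1,2]', '(3,4]']
```

Note that a rate of exactly 1.0 lands in (0,1], because half-closed bins include their right endpoint.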
Remember, for this step we are trying to get estimates of the likelihood of the observed
data under each hypothesis. The unsigned manuscript has a rate of upon = 0.996,
which falls into the second bin:

Pr(0.996 | Hamilton)    (5.11)

Pr(0.996 | Madison)    (5.12)
• Equation 5.11 asks: “What is the probability of observing a rate of 0.996 upons per 1000
words, given that the author is Hamilton?”
• Equation 5.12 asks: “What is the probability of observing a rate of 0.996 upons per 1000
words, given that the author is Madison?”
Answer: Intuitively, the data are more consistent with the Madison hypothesis. Can you
see that too?
Answer: Great question! We’ll try this a few different ways. First, though, think about how
you would approach this problem before reading on.
Looking back at Table 5.2, we can see the following:
• One of Hamilton’s 48 manuscripts had a rate of upon greater than 0 but no greater than 1.0.
Therefore, we can use 1/48 = 0.021 as an estimate of the likelihood of the data under the
Hamilton hypothesis.
• Seven of Madison’s 50 manuscripts had a rate of upon greater than 0 but no greater than 1.0.
Therefore, we can use 7/50 = 0.140 as an estimate of the likelihood of the data under the
Madison hypothesis.
These are quick and dirty estimates of each author’s use of upon, and we’ll run with this.
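These quick and dirty estimates are just bin counts divided by paper totals; the counts come from Table 5.2 as described in the text:

```python
# Likelihood estimates: the fraction of each author's known papers whose
# rate of "upon" falls in the same bin as the disputed paper, (0,1].
hamilton_in_bin, hamilton_total = 1, 48
madison_in_bin, madison_total = 7, 50

lik_hamilton = hamilton_in_bin / hamilton_total  # Pr(data | Hamilton)
lik_madison = madison_in_bin / madison_total     # Pr(data | Madison)
print(round(lik_hamilton, 3), round(lik_madison, 3))  # → 0.021 0.14
```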
What is step 5?
Answer: Use Bayes’ Theorem to compute the updated Pr(Hamilton) and Pr(Madison),
given the data.
First, let’s recall Bayes’ Theorem and the familiar inferential terms (see Figure 5.4).
Pr(A | B) = [Pr(B | A) * Pr(A)] / [Pr(B | A) * Pr(A) + Pr(B | ∼A) * Pr(∼A)]

Figure 5.4 Bayes’ Theorem annotated: the left side is the posterior probability of hypothesis A, given data B. In the numerator, Pr(B | A) is the likelihood of the data B given hypothesis A, and Pr(A) is the prior probability of hypothesis A. The denominator adds the corresponding likelihood and prior for hypothesis ∼A.
Bayes’ Theorem can be used to calculate the posterior probability that the author is
Hamilton:
Pr(Hamilton | 0.996) = [Pr(0.996 | Hamilton) * Pr(Hamilton)] / [Pr(0.996 | Hamilton) * Pr(Hamilton) + Pr(0.996 | Madison) * Pr(Madison)]    (5.13)
In words . . .
• The left side of the equation can be read as “the posterior probability that Hamilton
penned the paper, given the paper had a rate of 0.996 upon per thousand words.”
• On the right, the numerator multiplies two terms: the likelihood of observing a rate of
upons per 1000 words of 0.996 under the Hamilton hypothesis, multiplied by the prior
probability of the Hamilton hypothesis.
• The denominator repeats the numerator, but adds in the Madison hypothesis as well.
Of course, since there are only two hypotheses and they are mutually exclusive, you can get
the posterior probability of Madison as 1 minus the posterior probability of Hamilton. Or
you can write it out fully for the sake of completeness:
Pr(Madison | 0.996) = [Pr(0.996 | Madison) * Pr(Madison)] / [Pr(0.996 | Hamilton) * Pr(Hamilton) + Pr(0.996 | Madison) * Pr(Madison)]    (5.14)
In short, we can use Bayes’ Theorem to estimate the posterior probability for each
hypothesis. Let’s keep our focus on the Hamilton hypothesis for now:

Pr(Hamilton | 0.996) = [Pr(0.996 | Hamilton) * Pr(Hamilton)] / [Pr(0.996 | Hamilton) * Pr(Hamilton) + Pr(0.996 | Madison) * Pr(Madison)]    (5.15)
Where are the likelihoods of the observed data under each hypothesis
in this equation?
Pr(Hamilton | 0.996) = [Pr(0.996 | Hamilton) * Pr(Hamilton)] / [Pr(0.996 | Hamilton) * Pr(Hamilton) + Pr(0.996 | Madison) * Pr(Madison)]    (5.18)
Remember that we set the priors to 0.5 for Hamilton and 0.5 for Madison. Also remember
that the likelihood of observing a rate of upons of 0.996 was 0.021 for Hamilton and 0.140
for Madison. Now let’s fill in our priors and likelihoods:

Pr(Hamilton | 0.996) = (0.021 * 0.5) / (0.021 * 0.5 + 0.140 * 0.5) = 0.0105 / 0.0805 = 0.1304.

Since we have only two, mutually exclusive, hypotheses, this means that our posterior
probability that the author was James Madison is 1 − 0.1304 = 0.8696:

Pr(Madison | 0.996) = 0.8696.
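The full authorship update can be sketched in Python, using the likelihoods (0.021 and 0.140) and the 50:50 priors stated above:

```python
# Posterior probabilities for the authorship of Paper No. 54, equal priors.
priors = {"Hamilton": 0.5, "Madison": 0.5}
likelihoods = {"Hamilton": 0.021, "Madison": 0.140}  # Pr(0.996 | author)

products = {a: likelihoods[a] * priors[a] for a in priors}
denom = sum(products.values())  # 0.021*0.5 + 0.140*0.5 = 0.0805
posteriors = {a: products[a] / denom for a in priors}
print({a: round(p, 4) for a, p in posteriors.items()})
# → {'Hamilton': 0.1304, 'Madison': 0.8696}
```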
These new posterior estimates can be graphed as the posterior probability distribution.
[Figure: the prior distribution (0.5 for each hypothesis) next to the posterior distribution (0.13 for Hamilton, 0.87 for Madison).]
Thus, given the manuscript (and ancillary data about word use by the two authors from
KNOWN data), we now increase our belief that the author of Federalist Paper No. 54
was James Madison.
Answer: Great question! We’ll cover the topic of priors in more detail in Chapter 6. But for
now, let’s see what happens when we use a different set of priors instead:
We considered these potential priors at the beginning of the chapter. Let’s use Bayes’
Theorem to calculate the posterior probability that the author was Alexander Hamilton
with these new priors:

Pr(Hamilton | 0.996) = (0.021 * 0.75) / (0.021 * 0.75 + 0.140 * 0.25) = 0.01575 / 0.05075 = 0.3103.
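This is a simple prior sensitivity check: hold the likelihoods fixed and vary the prior. The two prior choices below are the ones discussed earlier in the chapter (50:50, and 0.75:0.25 based on paper counts):

```python
def posterior_hamilton(prior_h, lik_h=0.021, lik_m=0.140):
    """Posterior Pr(Hamilton | data) for a given prior Pr(Hamilton)."""
    prior_m = 1.0 - prior_h  # two exhaustive, mutually exclusive hypotheses
    return (lik_h * prior_h) / (lik_h * prior_h + lik_m * prior_m)

for prior_h in (0.5, 0.75):
    print(prior_h, round(posterior_hamilton(prior_h), 4))
```

Even with a 0.75 prior favoring Hamilton, the data pull the posterior for Hamilton down to about 0.31, so Madison remains the better-supported hypothesis.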
[Figure: the prior distribution (0.75 for Hamilton, 0.25 for Madison) next to the resulting posterior distribution.]
Answer: Fantastic! The more information you have to calculate the likelihood, the better.
In this case, you would use this new information to get better estimates of the probability of
each author’s use of the word upon. Additionally, the discovery of more papers may
influence your choice of priors.
Answer: No, don’t confuse the prior probabilities for a set of hypotheses (which must sum
to 1.0) with the probability of the data. This can be confusing!
In Bayesian analysis, the sum of the prior probabilities across hypotheses, as well as the
sum of the posterior probabilities across hypotheses, must be 1.0. This is not true for the
likelihoods of observing the data under each hypothesis.
This is a very important concept, so let’s review our calculations to nail this point
(see Table 5.3).
Table 5.3
Table 5.3 shows our calculations for the analysis where the prior probabilities for the two
hypotheses were equal. You’ve seen these calculations already, but notice that the sum of
the priors is 1.000 (as required), while the sum of the likelihoods for the two hypotheses is
0.1610.
Let’s have a look at our second analysis, in which the priors were unequal (see Table 5.4).
Table 5.4
Once again, notice the sum of the likelihoods is 0.161, whereas the sum of the priors is 1.000.
Not exactly. They used a Bayesian inference approach, but they calculated their likelihoods
a bit differently than we did here. Typically, the analyst makes an assumption about the
distribution that gives rise to the data. Mosteller and Wallace were premier statisticians and
instead fit the histogram data to what is called a Poisson distribution. The Poisson
distribution is used to estimate the probability of an event occurring (such as the number
of upons per 1000 words, the number of births per 1000 females, or the number of car
crashes in 100 days at a corner), given the average rate of that event. We’ll cover the Poisson
probability distribution in depth later in the book. Here, we are simply trying to give you a
general understanding of how to quantify likelihood.
OK. To summarize, Mosteller and Wallace set out to determine the author of Paper No. 54.
This paper had a standardized rate of upons = 0.996 per 1000 words. They sought out the
posterior probability associated with two hypotheses:
• Pr(Hamilton | data) = what is the probability the author is Hamilton, given the rate of
upon is 0.996?
• Pr(Madison | data) = what is the probability the author is Madison, given the rate of
upon is 0.996?
To answer this, they assigned each hypothesis a prior probability of being true. Then,
they gathered data from manuscripts known to be penned by each author and used
this information to calculate the likelihood of observing the data under each
hypothesis:
• Pr(data | Hamilton) = what is the likelihood of the observed rate of 0.996, given the
author is Hamilton?
• Pr(data | Madison) = what is the likelihood of the observed rate of 0.996, given the author
is Madison?
Mosteller and Wallace used Bayes’ Theorem, which says there is a relationship between the
original conditional probabilities and their inversions, in order to generate updated beliefs
that each hypothesis is true:
Pr(Hamilton | 0.996) = [Pr(0.996 | Hamilton) * Pr(Hamilton)] / [Pr(0.996 | Hamilton) * Pr(Hamilton) + Pr(0.996 | Madison) * Pr(Madison)]    (5.25)
The use of Bayes’ Theorem in this way is known as Bayesian inference. The Merriam–
Webster dictionary defines inference as “the act or process of inferring.” It then defines
“inferred” or “inferring” as “deriving conclusions from facts or premises,” and “infer” as “to
form an opinion or reach a conclusion through reasoning and information.”
Bayesian inference is then the use of Bayes’ Theorem to draw conclusions about a set of
mutually exclusive, exhaustive, alternative hypotheses by linking prior knowledge about
each hypothesis with new data. The result is an updated probability for each hypothesis of
interest.
How does this problem differ from the Breast Cancer Problem in the
last chapter?
CHAPTER 6
Now that you’ve had a taste of what Bayesian inference is all about, we’ll start on a new
example and explore two key features of Bayesian analysis: assigning priors and estimating
likelihoods. One of the more controversial aspects of Bayesian inference is the use of priors.
They are controversial, but they can’t be ignored.
In the frequentist notion of Bayes’ Theorem, the priors are just marginal probabilities. But
in Bayesian inference, the priors represent a priori probabilities that each alternative
hypothesis is correct, where a priori means “prior to data collection.” You cannot conduct a
Bayesian analysis without using a prior distribution. You must also collect some
information to estimate the likelihood of the data given a hypothesis.
By the end of this chapter, you should be able to define:
• Informative prior distribution
• Non-informative prior distribution
• Objective priors
• Subjective priors
• Prior sensitivity analysis
Let’s begin with a new example. This problem is taken from a short story called Absent
Treatment by P. G. Wodehouse (author of the famous Jeeves stories). In this story, the
main character and narrator, Reggie Pepper, is helping his friend Bobbie Cardew. Bobbie is in
a pinch because he forgot the date of his wife’s birthday. The wife, Mary, has had enough
and has ditched poor Bobbie. Let’s pick up the story with Mary’s send-off letter to Bobbie
and then listen in on an exchange between Reggie and Bobbie:
MY DEAR BOBBIE, I am going away. When you care enough about me to remember to
wish me many happy returns on my birthday, I will come back. My address will be Box
341, London Morning News. – Mary
What a great example! Here, our task is to use a Bayesian inference approach
to determine the month in which Mary was born. For the purposes of this
chapter, let’s suppose that Bobbie at least knows that Mary was born in the
year 1900.
To begin, we have 12 discrete hypotheses (n = 12). Remember that Bayes’ Theorem can be
expressed as:
Pr(Hi | data) = [Pr(data | Hi) * Pr(Hi)] / [Σ(j = 1 to n) Pr(data | Hj) * Pr(Hj)]    (6.1)
In words, the posterior probability for hypothesis i given the observed data (data) is written
Pr(Hi | data). It is equal to a proportion. The numerator consists of the likelihood of
observing the data under hypothesis i, which is written Pr(data | Hi), multiplied by the
prior probability of hypothesis i, which is written Pr(Hi). The denominator repeats this
process for all 12 hypotheses, and sums them. The symbol Σ means “sum”; in this case, it
could have been written Σ(j = 1 to 12). The indices run from j = 1 to 12 to indicate that there are
12 pieces of information to sum together. In short, i indicates a specific hypothesis, whereas
j is the index of summation and indicates a term in the denominator.
OK, here we go.
Step 1. Identify your hypotheses. There are 12 months in a year, and these represent the
alternative hypotheses. Note that the hypotheses are mutually exclusive and exhaustive.
They are mutually exclusive because Mary’s birth month cannot occur in both October and
November. They are exhaustive because Reggie and Bobbie considered every possible
outcome with respect to Mary’s birth month. Let’s get our nomenclature down:
Step 2. Express our belief that each hypothesis is true in terms of probabilities. We do this
because Bobbie and Reggie are uncertain about which hypothesis is correct.
• Pr(January) = prior probability that Mary’s true birth month is January
• Pr(February) = prior probability that Mary’s true birth month is February
• Pr(March) = prior probability that Mary’s true birth month is March
• Etc.
Now we assign prior probabilities that each hypothesis is true. You may be wondering how
to set priors for a problem in general. Let’s peek in on a famous article written by Eliezer
Yudkowsky for some insight:
Q. How can I find the priors for a problem?
A. Many commonly used priors are listed in the Handbook of Chemistry and Physics.
Q. Where do priors originally come from?
A. Never ask that question.
Q. Uh huh. Then where do scientists get their priors?
A. Priors for scientific problems are established by annual vote of the AAAS. In recent years the vote has
become fractious and controversial, with widespread acrimony, factional polarization, and several out-
right assassinations. This may be a front for infighting within the Bayes Council, or it may be that the
disputants have too much spare time. No one is really sure.
Q. I see. And where does everyone else get their priors?
A. They download their priors from Kazaa.
Q. What if the priors I want aren’t available on Kazaa?
A. There’s a small, cluttered antique shop in a back alley of San Francisco’s Chinatown. Don’t ask about
the bronze rat.
Ha ha! Seriously, now. When Bobbie and Reggie first started out, they had absolutely no
idea which month was the birth month. Consequently, they gave equal weight to each
hypothesis, which is 0.083. Remember that the sum of the probabilities that make up the
prior distribution must be 1.0, so each month has a probability of 1/12 = 0.083. The chart in
Figure 6.1 depicts the prior probability distribution; the prior probability of each and every
alternative hypothesis is provided.
Figure 6.1 The non-informative prior probability distribution: each of the 12 monthly hypotheses has a prior probability of 0.083.
These are called non-informative priors and result in equal probabilities for all
hypotheses; that is, each month has the same prior probability of being Mary’s birth
month. Other terms for the same concept include vague or diffuse priors. A SAS webpage
on Bayesian analysis states that “a prior distribution is non-informative if the prior is ‘flat’
relative to the likelihood function.” Wikipedia defines a non-informative prior as a prior
that “expresses vague or general information about a variable…The simplest and oldest rule
for determining a non-informative prior is the principle of indifference, which assigns
equal probabilities to all possibilities.”
However, Reggie and Bobbie might have used the information within the “When Were You
Born” books to set the priors differently. In that case, they would employ an informative
prior distribution. A SAS webpage states that “an informative prior is a prior that is not
dominated by the likelihood and that has an impact on the posterior distribution. If a prior
distribution dominates the likelihood, it is clearly an informative prior.” Wikipedia defines
an informative prior as a prior that “expresses specific, definite information about a vari-
able.” These could arise from expert opinion or from previous study.
We will return to this topic shortly.
Suppose Reggie and Bobbie believe that the February and May hypotheses are more likely
than the rest based on the information in “When Were You Born.” Their informative prior
distribution may have looked like the one shown in Figure 6.2, with the sum of the
probabilities across hypotheses equal to 1.00.
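Both prior choices can be written down explicitly. The informative values below are hypothetical: 0.20 for February and May and 0.06 for each remaining month is one guess at the shape of Figure 6.2, chosen so the probabilities sum to 1.00 and to match the January prior of 0.06 used in Step 5:

```python
months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]

# Non-informative prior: the principle of indifference gives 1/12 each.
flat_prior = {m: 1 / 12 for m in months}

# Hypothetical informative prior bumping February and May (as in Figure 6.2):
informative_prior = {m: 0.20 if m in ("Feb", "May") else 0.06 for m in months}

# Either way, a prior distribution must sum to 1.0 across all hypotheses.
assert abs(sum(flat_prior.values()) - 1.0) < 1e-9
assert abs(sum(informative_prior.values()) - 1.0) < 1e-9
```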
Figure 6.2 An informative prior probability distribution in which the February and May hypotheses receive more weight than the other months; the probabilities across hypotheses still sum to 1.00.
Answer: Great question! The analyst must select a prior distribution. A colleague notes
that “the most important principle of prior selection is that your prior should represent the
best knowledge that you have before you look at the data.” He notes that it is unjustified to
use default, ignorance, or other automatic priors if you have substantial information that
can affect the answer (the posterior). Of course, this assumes that your best knowledge is
well founded, which perhaps is not the case here!
Step 3. Gather the data. In this case, we have one datapoint: one woman named Mary born
in the year 1900.
Now that we have our data, Bayes’ Theorem can provide the posterior probability of her
birth month. Here’s Bayes’ Theorem again:
Pr(Hi | data) = [Pr(data | Hi) * Pr(Hi)] / [Σ(j = 1 to n) Pr(data | Hj) * Pr(Hj)]    (6.2)
For example, the posterior probability that Mary was born in January can be written:

Pr(January | data) = [Pr(data | January) * Pr(January)] / [Σ(j = 1 to 12) Pr(data | Hj) * Pr(Hj)].
Step 4. Now Reggie and Bobbie need to estimate the likelihood of observing the data
(1 Mary) for each monthly hypothesis. As we’ve said, figuring out how to estimate the
likelihood of observing the data under each hypothesis is often the trickiest part of a
Bayesian inference problem.
Answer: For this problem, we need the rate at which girls are named Mary per month. Let’s
suppose that we had data on the names of girls born each month in 1900 in an English
“shire” where Mary was born (see Table 6.1). These figures represent the data from which
we will calculate our likelihoods. Note that Mary—the subject of our story—is included
somewhere in this table, but we’re not sure where! (This doesn’t need to be the case,
however, for Bayesian analysis.)
Table 6.1
We can represent these raw data as a frequency histogram, with the months on the
x-axis and the total number of births on the y-axis (see Figure 6.3). Remember that a
frequency histogram is a distribution of the raw data. You can see that, in 1900,
the name Mary is a popular winter name, but not so much at other times of the year.
Figure 6.3 Frequency histogram of the number of girls named Mary born in each month of 1900.
Now we need to determine the likelihood of the data, assuming each hypothesis is true.
In this case, we need to determine how likely it is to observe a newborn named Mary in each
month. We need:
Notice that the likelihoods are conditional for each hypothesis! In Bayesian analyses,
the likelihood is interpreted here as the probability of observing the data, given the
hypothesis.
The likelihood of being named Mary in each month is just the total Marys divided by the
total births. The graph in Figure 6.4 provides the likelihood of being named Mary for each
month in the year 1900. Notice the y-axis is probability (or likelihood) and that the shape of
this graph is similar to our frequency distribution but not exactly like it, since we had to
adjust for the fact that each month had a different number of births. For instance, 20 Marys
were born in both April and May, but in April there were 1190 births while in May there
were only 862 births.
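A quick sketch of this arithmetic (Python; only the three months quoted above are included, and the remaining months from Table 6.1 would be handled identically):

```python
# Counts quoted in the text (Table 6.1 holds all 12 months)
marys = {"January": 57, "April": 20, "May": 20}        # girls named Mary
births = {"January": 1180, "April": 1190, "May": 862}  # all female births

# Likelihood of a newborn being named Mary, given each month
likelihood = {m: marys[m] / births[m] for m in marys}
for month, lik in likelihood.items():
    print(f"{month}: {lik:.4f}")
```

April and May have the same count of Marys (20), but April's larger number of total births (1190 versus 862) gives it the smaller likelihood.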
Now we have the likelihood of observing a baby named Mary for each month of the year.
It looks like October, November, and December are more popular when it comes to “Mary.”
Step 5. Use Bayes’ Theorem to compute the posterior probabilities Pr(January), Pr(February), Pr(March), and so on, given the data (a baby named Mary born in the year 1900, who grew up to marry Bobbie).
At this point, we combine the priors and the likelihoods to get a posterior probability that
each month is Mary’s birth month with Bayes’ Theorem. We’ll use the informative priors
based on the “When You Were Born” books (Figure 6.2).
Let’s start by calculating the posterior probability of the January hypothesis:

Pr(January | 1 Mary) = Pr(1 Mary | January) ∗ Pr(January) / Σ(j = 1 to 12) [Pr(1 Mary | month j) ∗ Pr(month j)]
[Figure 6.4. Likelihood of being named Mary for each month, 1900. Y-axis: Likelihood (0.00 to 0.10); x-axis: Month.]
The numerator focuses on the January hypothesis: the likelihood of observing 1 Mary given the January hypothesis, Pr(1 Mary | January), multiplied by the prior probability of the January hypothesis, Pr(January).
From the likelihood distribution in Figure 6.4, the likelihood of being named Mary
in January is 0.048 (57 Marys/1180 female births in January). We multiply this by
the prior probability of the January hypothesis, which is 0.06. The product of these two
terms is 0.0029.
The denominator is a different beast. Here, we need to compute the likelihood of observ-
ing 1 Mary under each hypothesis, multiply the result by its corresponding prior, and,
finally, sum the results. Here, a table specifying all 12 hypotheses can help (see Table 6.2).
Table 6.2
• The sums are provided in the bottom row of the table. The sum of the priors must be 1.0,
but this is not true for the likelihoods.
• The sum of the products (likelihood ∗ prior) is 0.0342. This is the denominator of Bayes’
Theorem for this problem, and it is fixed across the 12 hypotheses. What changes is the
numerator, and this depends on which hypothesis you are analyzing.
• The posterior probability for each month is then computed as the product divided
by the denominator. Note that the sum of the posterior probabilities across the hypoth-
eses is 1.0.
• Notice the prior probability was dominated by the February and May hypotheses. The
data show that most babies named “Mary” were born in October, November, or Decem-
ber; their likelihoods were higher. After observing the data, the posterior probabilities for
February and May were reduced, while the posterior probabilities for October, November,
and December were increased relative to the prior.
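The machinery of Table 6.2 boils down to one normalization step. Here is a minimal sketch, using three made-up hypotheses rather than the book's twelve months:

```python
def posterior(priors, likelihoods):
    # Bayes' Theorem across competing hypotheses:
    # posterior_i = prior_i * likelihood_i / sum_j(prior_j * likelihood_j)
    products = [p * l for p, l in zip(priors, likelihoods)]
    denominator = sum(products)  # fixed across all hypotheses
    return [prod / denominator for prod in products]

# Illustrative three-hypothesis example (not the book's 12-month table):
post = posterior([0.2, 0.3, 0.5], [0.10, 0.05, 0.02])
print([round(p, 3) for p in post])  # → [0.444, 0.333, 0.222]
```

Note that only the products change from hypothesis to hypothesis; the denominator is computed once and shared, and the resulting posteriors always sum to 1.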
Answer: Yes. The divisor is also called a normalizing constant. Here, we could calculate
it easily. But summing up the denominator is often a great challenge in Bayesian statistics.
As we’ll see later, in many problems the divisor is an intractable calculation…a topic we will
visit in future sections of this book.
Now, let’s look at Bobbie and Reggie’s prior and posterior distribution side by side (see
Figure 6.5).
[Figure 6.5. Prior and posterior distributions by month. Y-axis: Probability (0.00 to 0.30); x-axis: Month; legend: Prior, Posterior.]
Bobbie and Reggie started with an “informative” prior based on the “When You Were Born” books. They then collected data to calculate the likelihood of being named Mary in each month of the year 1900. They then used Bayes’ Theorem to confront each hypothesis with the data. That is, they calculated the posterior probability for each hypothesis in light of the data.
Adding in “Mary data” along with the “When You Were Born” priors won’t get Bobbie out of his pickle, but you can see that the posterior distribution is quite different from the prior distribution. At this point, Bobbie could use the posterior distribution as a prior distribution if other sources of information became available.
Answer: Let’s take a look. Here are the calculations again, this time with equal priors in
column 2 (see Table 6.3).
Table 6.3
Once again, notice that the sum of the prior probabilities across hypotheses is 1.00, as is the
sum of the posterior probabilities across hypotheses. For this example with a non-informative
prior, we noted earlier that the likelihood drives the results. You can see this in action here for
the January hypothesis (as an example). If you were only making use of the data, you would say that the probability that Mary was born in January is 0.0483/0.4821 = 0.1000. Here, the numerator is 0.0483, which is the likelihood of observing a Mary under the January hypothesis. The denominator is the sum of all 12 likelihoods. Thus, even if we dropped the prior altogether, we would get the posterior probability = 0.1000 as shown in the final column. This illustrates what we mean when we said “with a non-informative prior, the likelihood of the data drives the results.” However, with a Bayesian analysis, a prior distribution is required!
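We can verify the cancellation numerically. A small sketch with made-up likelihoods (not the Table 6.3 values):

```python
import math

liks = [0.10, 0.05, 0.02, 0.03]  # hypothetical likelihoods, 4 hypotheses
flat_prior = 1 / len(liks)       # non-informative: equal weight

# Full Bayes' Theorem with the flat prior...
products = [flat_prior * l for l in liks]
full = [x / sum(products) for x in products]

# ...matches the likelihoods simply renormalized, prior dropped:
renorm = [l / sum(liks) for l in liks]
print(all(math.isclose(a, b) for a, b in zip(full, renorm)))  # → True
```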
A graph of the prior distribution and posterior distribution looks like the one shown in
Figure 6.6.
[Figure 6.6. Prior (flat) and posterior distributions by month. Y-axis: Probability (0.00 to 0.30); x-axis: Month; legend: Prior, Posterior.]
Now, let’s compare the two posterior distributions, where one used an informative prior
and one used a non-informative prior. The chart in Figure 6.7 shows the posterior
distributions for the “non-informative” (flat) versus “informative” (When You Were
Born) priors that we’ve looked at thus far, with the non-informative results shown in blue:
[Figure 6.7. Posterior distributions under the informative versus the non-informative prior. Y-axis: Probability (0.00 to 0.30); x-axis: Month; legend: Informative Prior, Non-Informative Prior.]
Answer: Yes, this is why the selection of priors is important. If you have some information
at your disposal (assuming it is credible), you should use it. However, you should be able to
justify it.
The chart in Figure 6.7 is an example of a prior sensitivity analysis (Berger et al., 2000).
We are comparing two posterior distributions and assessing the effect of each prior distri-
bution’s influence on the posterior, which in turn affects our conclusions. A prior sensitivity
analysis is just one tool in the Robust Bayesian Analysis toolkit.
The word robust is defined as “strong and healthy; vigorous; sturdy in construction;
strong and effective in all or most situations and conditions.” According to Wikipedia,
“Robust Bayesian Analysis, also called Bayesian sensitivity analysis, investigates the robust-
ness of answers from a Bayesian analysis to uncertainty about the precise details of the
analysis” (article accessed August 15, 2017).
Are there other times when the prior drives the results?
Answer: Well, think about it. What if 1000 girls were born each month, and 15 of them were always named Mary? We have observed a person who is named Mary. How likely is it that she was born in January? February? The answer is that the likelihoods are identical for each hypothesis. In this case, the posterior distribution will be identical to the prior distribution!
In Bayesian analysis and scientific deduction, a primary goal of the analyst is to collect data that will discriminate among the hypotheses.
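A quick numerical check of this claim, using hypothetical priors for three hypotheses (any prior summing to 1 behaves the same way):

```python
import math

priors = [0.2, 0.3, 0.5]  # hypothetical priors summing to 1
lik = 15 / 1000           # 15 Marys per 1000 births -- same every month

# The identical likelihood cancels out of Bayes' Theorem entirely,
# so the posterior simply reproduces the prior
products = [p * lik for p in priors]
posteriors = [x / sum(products) for x in products]
print(all(math.isclose(p, q) for p, q in zip(priors, posteriors)))  # → True
```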
Answer: The tricky part comes into play when you really don’t have any information to
set the prior and are trying to be as objective as possible. That is, you want to level the
playing field so that all hypotheses have equal weight. The idea behind these approaches is
to select a prior that has absolutely no bearing on the posterior; that is, you would draw
similar conclusions from a statistical analysis that is not Bayesian. We’ve demonstrated
such an approach when you have discrete hypotheses such as the birthday problem.
However, setting a purely uninformative prior is no small task. In the words of Zhu and
Lu (2004), “flat priors are not necessarily non-informative, and non-informative priors are
not necessarily flat.” We are getting a bit ahead of ourselves though. For now, Hobbs and
Hooten (2015) present friendly and informative thoughts on setting priors.
Answer: Subjectivity generally is associated with a belief. In a sense, all priors are
subjective because the analyst must select one and, in doing so, exercises subjectivity.
The term objective prior normally refers to a prior that can be justified by rationality
and consistency. The general goal in using an objective prior is that different analysts with
the same information should arrive at the same conclusions.
This is well beyond our scope here, but we encourage you to dig deeper into this topic, as
it is fodder for a rich discussion.
Answer: Eventually, Bobbie remembered that he took Mary to a play for her birthday.
After a bit of detective work, he determined Mary’s birthday was May 8th.
And they lived happily ever after.
Answer: Yes, it really is! We have one more problem in Bayesian inference in this section,
and we think you’ll enjoy it. See you soon.
CHAPTER 7
Although you see this portrait frequently associated with Thomas Bayes, there is quite a
bit of dispute about whether it is really him (the Thomas Bayes). In this chapter, we’ll use a
Bayesian inference approach to address the problem. In the spirit of fun, the data used for
our example will be completely fabricated.
A key concept in this chapter is that you can combine multiple sources of data in a
Bayesian inference framework. By the end of this chapter, you will be familiar with the
following terms:
• Joint likelihood
• Frock coat
• Parson’s wig.
Let’s return to the portrait. The editors of the IMS Bulletin, a news source from the Institute
of Mathematical Statistics, posted the picture of “Thomas Bayes” shown above and offered
this challenge to its readers:
Bayesian Statistics for Beginners: A Step-by-Step Approach. Therese M. Donovan and Ruth M. Mickey,
Oxford University Press (2019). © Ruth M. Mickey 2019.
DOI: 10.1093/oso/9780198841296.001.0001
The person submitting the most plausible answer was to receive a prize!
The image is taken from a 1936 book called History of Life Insurance by Terence
O’Donnell (American Conservation Co., Chicago). What an exciting book this must be!
The image appears with the caption “Rev. T. Bayes: Improver of the Columnar Method
developed by Barrett.” It’s not clear who added the caption or where the image came from.
Answer: According to the website, “The most plausible answer received in the Bulletin
Editorial office is from Professor David R. Bellhouse, University of Western Ontario, Lon-
don, Ontario, Canada. Professor Bellhouse wrote:
“The picture in The IMS Bulletin is supposedly of Thomas Bayes, who died in 1761 aged 59 and so was born
in 1701 or 1702. I have added “supposedly” since I believe that the picture is of doubtful authenticity. There
are three clues in the picture which lead me to this conclusion . . . For the purpose of comparison, consider the
pictures [see Figure 7.2] of three other Nonconformist ministers: on the left Joshua Bayes, Thomas’es (sic)
father (d. 1746); in the middle Richard Price (this portrait is dated 1776), who read Bayes’ paper before the
Royal Society; and on the right Philip Doddridge (1702–1751), who was a friend of Bayes’ brother-in-law,
Thomas Cotton.”
Figure 7.2 (Left) Joshua Bayes; (middle) Richard Price; (right) Phillip Doddridge.
Bellhouse continues: “The first thing to note in this picture is the apparent absence of a wig, or if a wig is
present, it is definitely the wrong style for the period. It is likely that Bayes would have worn a wig similar to
Doddridge’s, which was going out of fashion in the 1740s, or a wig similar to Price’s, which was coming into
style at the same time. The second thing to note is that Bayes appears to be wearing a clerical gown like his
father or a larger frock coat with a high collar. On viewing the other two pictures, we can see that the gown is
not in style for Bayes’ generation and the frock coat with a large collar is definitely anachronistic. Finally,
Price is wearing a stock or wide collar on his shirt which appears around his neck in the picture; this was
fashionable from about 1730 to 1770. Since Doddridge, one generation younger, appears without any stock
or shirt collar, it is questionable whether Bayes would have worn a stock. However, the nineteenth century-
looking clerical collar in this picture is again anachronistic. For reference, I have used C. Willett Cunnington
and P. Cunnington, Handbook of English Costume in the Eighteenth Century, pub. Faber & Faber, London,
1964.”
So, Professor Bellhouse made his arguments based on fashion, which included the use
of wigs. In case you were wondering, a frock coat is a man’s double-breasted, long-skirted
coat, now worn primarily on formal occasions. And, in case you were wondering, a
Nonconformist is a Protestant Christian who did not “conform” to England’s Act of
Uniformity of 1662. Incidentally, Dr. Bellhouse wrote a biography of Thomas Bayes to
celebrate the tercentenary of his birth, which you can download here.
So, how can we determine the probability that the man in the
photo is Thomas Bayes?
Answer: We are asked to determine the probability that the man pictured is actually
Thomas Bayes. Why not use a Bayesian inference approach to tackle this problem?
Remember that Bayes’ Theorem can be expressed as:
Pr(Hi | data) = Pr(data | Hi) ∗ Pr(Hi) / Σ(j = 1 to n) [Pr(data | Hj) ∗ Pr(Hj)]   (7.1)
In words, the posterior probability for hypothesis i is the likelihood of observing the data under hypothesis i multiplied by the prior probability of hypothesis i, divided by the sum of the likelihood times the prior across all hypotheses from j = 1 to n.
Where do we begin? We asked our colleague Anne Clark, who studies medieval saints at
the University of Vermont, “How do you determine the authenticity of things that
happened so long ago?” She replied, “You look for clues within the objects you have
in hand.”
If we had the portrait in hand, we could do things like analyze the paint.
Paint?
Answer: In our portrait, “Thomas” might be roughly 35–45 years old. If Thomas Bayes
was born in 1701 or 1702, the portrait would have been painted in the mid-1700’s, or mid-
18th century (just prior to the American Revolution). The white paint that was used in this
period is called “lead white.”
And how can lead white help us with dating the Thomas
Bayes’ portrait?
Answer: Lead white is a paint that includes, well, lead! Wikipedia notes that white lead
dating is a technique that is often used to pinpoint the age of an object up to 1,600
years old.
Answer: Umm, no. Presumably the actual portrait is long gone, and we’re
stuck with the printed image in O’Donnell’s book. But, we can get started
anyway . . . there are many other clues in the printed image that we can use.
Now, let’s get started.
We’ll use the same steps for Bayesian analysis that we have in previous chapters:
Answer: For this problem, we’ll consider just two hypotheses: the portrait is of Thomas
Bayes, and the portrait is not of Thomas Bayes.
Step 2. What are the prior probabilities that each hypothesis is true?
Answer: Do you recall the author problem in Chapter 5, where we used a Bayesian
inference approach to determine the probability that an unsigned Federalist Paper was
penned by Alexander Hamilton or James Madison? The data in that problem were the
words in the paper itself, and we focused on the frequency of the word upon in the
unsigned document. For this problem, we’ll consider the following sources of data that
are contained with this portrait of “Thomas Bayes”:
• Wigs! (or the lack thereof). We know that most ministers in the 1700’s wore wigs. But not all
of them! If all ministers wore wigs, there would be no chance that this portrait was of Thomas
Bayes. But maybe Thomas was a non-conforming Nonconformist who refused to wear a wig.
• Similarity of “Thomas Bayes” to Joshua Bayes. How much do you look like either of your
birth parents? In some cases, there are striking similarities, but not so much in other cases.
We don’t have the luxury of DNA testing, but suppose we can measure characteristics such as
eyebrow shape, nose length, forehead length, and so on of Joshua Bayes (from a known
portrait of Thomas Bayes’ father) and “Thomas Bayes,” and come up with a similarity index.
Suppose this index is an integer that ranges between 0 and 100, where 0 indicates absolutely
no similarity in any of the measured features, and 100 is an identical clone. Thus, the more
similar that Thomas is to Joshua, the more evidence we have that the man is Thomas Bayes.
So, we will be investigating two lines of evidence that we will assume are independent of
each other (the results of one line of reasoning will not affect the results of a second line of
reasoning).
OK, then. What are the observed data with respect to wigs?
Answer: The data here are either 0 or 1, where 0 indicates a wig is absent, and a 1 indicates
a wig is present. The man is clearly not wearing a wig, so the single observation
for this line of reasoning is 0.
The score is an overall similarity between the two men, based on things like eyebrow shape,
nose width, mouth shape, and any other facial feature that we can measure from the images.
Perhaps if “Thomas” were wearing a wig, it might be easier to focus on the facial features only.
Let’s give him one (see Figure 7.4)!
At first glance, the eyebrows and nose have some resemblance, but the eyes and mouth
are off. Perhaps “Thomas” inherited these traits from his mother!
For this problem, let’s assume the similarity index = 55.
Answer: For this problem, we’re sticking with the following data that we can observe from
the portrait itself:
• Wigs = 0
• Similarity = 55.
Answer: Good idea. We can’t directly calculate the likelihood that the man in the portrait
is Thomas Bayes, but we can calculate the likelihood that a middle-aged man who sits for a
portrait in the 1750’s is a minister who does not wear a wig.
Suppose we were able to collect this information on the population of interest (middle-
aged men who sit for a portrait in the 1750’s; see Table 7.1).
Table 7.1
Here, we show just the first 10 records of a 100-record dataset. We will refer to this as dataset 1. The first column identifies a person, the second identifies whether the person is a minister (where 1 = minister), and the third column identifies whether the person was wearing a wig (where 1 = wig).
Table 7.2
We can then summarize the results of all 100 of our records as a two-way table (see
Table 7.2).
Notice that this table has four “quadrants,” so to speak. Our actual data are stored in the
upper left quadrant, shaded the darkest blue. The upper right quadrant (shaded lighter) sums
the number of ministers (10) and non-ministers (90) in our dataset. The lower left quadrant
(shaded lighter) sums the number of men that wore a wig while sitting for a portrait (23) and
those that did not (77). The lower right quadrant (no shading) gives the grand total, 100.
Now let’s plot the results in the same Venn diagram (see Figure 7.5).
[Figure 7.5. Venn diagram of the 100 men: 8 are ministers who wear wigs, 2 are ministers without wigs, 15 wear wigs but are not ministers, and 75 are neither ministers nor wig-wearers.]
The box (which is not to scale) holds all 100 individuals in our dataset. Here, we can see
that 23% of the men wear wigs while sitting for a portrait. We can also see that 10% of all of
the men are ministers. And only 2% of the men sampled are ministers who do not
wear wigs.
Now, let’s ask how likely it is to observe our data, no wig, under the Thomas Bayes hypothesis. The answer is 0.02 (2/100 = 0.02), which is the joint probability that a man who sits for a portrait in the mid-1700’s is a minister and does not wear a wig.
OK, now let’s ask how likely it is to observe our data, no wig, under the Not
Thomas Bayes hypothesis. This is trickier because the man in the portrait could be just
about anyone. The answer here is 0.77, which is the marginal probability that a man who
sits for a portrait in the mid 1700’s will not wear a wig. That is the best we can do.
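Both likelihoods fall straight out of the cell counts in Table 7.2; a short sketch:

```python
# Cell counts from Table 7.2 / Figure 7.5 (100 portrait sitters)
minister_wig, minister_no_wig = 8, 2
other_wig, other_no_wig = 15, 75
total = minister_wig + minister_no_wig + other_wig + other_no_wig

# Thomas Bayes hypothesis: joint probability of minister AND no wig
lik_tb = minister_no_wig / total

# Not Thomas Bayes hypothesis: marginal probability of no wig
lik_not_tb = (minister_no_wig + other_no_wig) / total

print(lik_tb, lik_not_tb)  # → 0.02 0.77
```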
Answer: Remember that we calculated a similarity score between “Thomas Bayes” and
Joshua Bayes, with a result of 55. Now we move to the question, what is the
likelihood of observing a similarity score of 55 under each hypothesis? Suppose
we are able to collect similarity scores between 1000 known fathers and sons and also
between 1000 pairs of unrelated men in England. The first 10 records of our dataset might
look something like that shown in Table 7.3.
Table 7.3
The first column identifies a pair, the second identifies whether the pair of males have a father–son relationship (where 1 = Yes), and the third column identifies their similarity score. We will refer to this as dataset 2; note that it is separate from our wig dataset, or dataset 1.
It might be useful to graph the data, where you can see that unrelated men tend to have
smaller relatedness scores, but that there are some scores (such as our observed score of 55 in
black) that can belong to either group (see Figure 7.6).
[Figure 7.6. Frequency histograms of similarity scores for related (father–son) and unrelated pairs, with the observed score of 55 marked in black. Y-axis: Frequency (0 to 60); x-axis: Similarity Score (1 to 97); legend: Related, Unrelated, Observed.]
Answer: Well, we don’t want to know the exact probability of observing 55. What we
really want to find is where our observed result falls in relation to the rest of a given
distribution. For instance, a score of 55 falls in the right-hand tail of the unrelated
distribution (left), but it falls in the left-hand portion of the related distribution
(right). To make these comparable, we need to quantify the area of each distribution
that is to the right of our observed similarity score of 55. Thus, 55 is our minimum score
of interest.
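These tail areas can be computed directly from raw scores. A sketch with short, made-up score lists (the book's datasets contain 1000 pairs each):

```python
def tail_fraction(scores, observed):
    # Fraction of scores at or above the observed similarity score
    return sum(s >= observed for s in scores) / len(scores)

related = [48, 52, 55, 60, 63, 70, 41, 58, 66, 72]    # hypothetical
unrelated = [12, 20, 25, 31, 18, 27, 35, 22, 55, 16]  # hypothetical
print(tail_fraction(related, 55), tail_fraction(unrelated, 55))  # → 0.7 0.1
```

With the book's full datasets, the same function would return the 0.69 and 0.01 tail areas used below.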
Answer: If our two data sources are independent, we can simply multiply the two likeli-
hood components together:
For the Thomas Bayes hypothesis, that would be
• evidence from wigs = 0.02 (the joint probability that a man who sits for a portrait in the mid-1700’s is a minister who does not wear a wig)
• evidence from similarity score = 0.69 (69% of known father–son similarity scores are at least 55).
Thus, the likelihood of observing the data under the Thomas Bayes hypothesis is the product of these terms: 0.02 ∗ 0.69 = 0.0138.
For the Not Thomas Bayes hypothesis, that would be
• evidence from wigs = 0.77 (the marginal probability that a man who sits for a portrait in the mid-1700’s will not wear a wig)
• evidence from similarity score = 0.01 (1% of unrelated men have similarity scores of at least 55).
Thus, the likelihood of observing the data under the Not Thomas Bayes hypothesis is the product of these terms: 0.77 ∗ 0.01 = 0.0077.
Answer: Just use Bayes’ Theorem! Here, we have two hypotheses, so Bayes’ Theorem can
be written as:
Pr(Hi | data) = Pr(data | Hi) ∗ Pr(Hi) / Σ(j = 1 to 2) [Pr(data | Hj) ∗ Pr(Hj)]   (7.4)
Let’s assume the prior probabilities are 0.5 and 0.5. Then, the posterior probability that
the man in the portrait is Thomas Bayes is shown in Table 7.4.
Table 7.4
• Row 1 of the table is the Thomas Bayes hypothesis, and row 2 is the Not Thomas Bayes
hypothesis.
• Column 1 provides the priors, which we set at 0.5 and 0.5. Remember that these must
sum to 1.0.
• Column 2 is the joint likelihood of observing the two pieces of information we observed
under each hypothesis:
• wigs = 0
• similarity score of at least 55.
• Column 3 multiplies the prior times the likelihood, which gives us the numerator of
Bayes’ Theorem for each hypothesis.
• The sum of column 3 is the denominator of Bayes’ Theorem (the sum of the products of
the priors times the likelihoods).
• Column 4 calculates the posterior for each hypothesis as the prior ∗ likelihood column
divided by the denominator. Note that the two posteriors sum to 1.0.
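The arithmetic of Table 7.4 can be checked in a few lines, using the joint likelihoods assembled earlier:

```python
# Joint likelihoods: the two independent lines of evidence multiply
lik_tb = 0.02 * 0.69   # minister-without-wig * father-son tail area
lik_not = 0.77 * 0.01  # no-wig marginal * unrelated tail area

priors = [0.5, 0.5]
products = [priors[0] * lik_tb, priors[1] * lik_not]
denominator = sum(products)  # denominator of Bayes' Theorem
posteriors = [p / denominator for p in products]
print([round(p, 2) for p in posteriors])  # → [0.64, 0.36]
```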
So, for the inputs for this fictitious problem, the posterior probability that the
man in the portrait is Thomas Bayes is 0.64, and the posterior probability that
it is someone else is 0.36. Two lines of evidence were used in this Bayesian
analysis.
Don’t forget that the data are fictitious! A graph of our prior and posterior distributions
can now be shown in Figure 7.7.
[Figure 7.7. Prior and posterior probabilities for the two hypotheses. Y-axis: Probability of Hypothesis (0.0 to 1.0); x-axis: Prior Distribution, Posterior Distribution; legend: Thomas Bayes, Not Thomas Bayes.]
Answer: It depends. As the analyst, you would need to make an argument to justify the
calculation of the joint likelihood. Here, we are claiming that our two datasets are inde-
pendent and that both contribute to the likelihood calculations.
Answer: Yes, you can. The more lines of reasoning you have, the more information you
can bring to bear on the likelihood calculations and, ultimately, on the comparison of your
hypotheses.
Answer: This is a lot trickier. For instance, we could ask about the use of frock coats in
portraits for the same time period. If we scored the wearing of frock coats along with the
use of wigs in dataset 1, we may confound the analysis because frock coats and wigs appear
to go hand in hand. In addition, if you had a completely different dataset that provided
similar information with respect to father–son similarity, adding it into the likelihood
mix could spell trouble if they are not independent. These situations are beyond our
scope here, but you might want to keep them in mind.
Answer: Loads of them! This problem is meant to help you understand that Bayes’
Theorem can easily incorporate multiple sources of data, but each piece of information
that you consider must be carefully thought out. We have made multiple assumptions here,
including
Answer: The main take-home point is that Bayesian analysis can be very, very flexible. As
long as you can compute the likelihood of observing the data under each hypothesis, you
are golden. However, this is not always an easy task! But, assuming it can be done, you can
set up almost any kind of problem as a Bayesian inference problem.
Answer: We were really interested in the question of whether the portrait really is of Thomas Bayes, so we went ahead and purchased a copy of The History of Life Insurance by Terence O’Donnell to explore this further. There it was—the picture—on p. 335. The
picture does indeed have the caption “Rev. T. Bayes, Improver of the Columnar Method
developed by Barrett.” We searched high and low for any writing on Thomas Bayes, but
could not find anything in the book.
George Barrett, who proposed the columnar method in the calculation of annuities (insurance), lived from 1752 to 1821. A Wikipedia author notes that “Barrett was the son of
a farmer of Wheeler Street, a small hamlet in Surrey. At an early age, although engaged in
daily labour, he made, unaided, considerable progress in mathematics, taking special
interest in the class of problems connected with the duration of human life. He afterwards,
during a period of twenty-five years (1786–1811), laboured assiduously at his great series of
life assurance and annuity tables . . . to whose support he devoted a great part of his
earnings.” O’Donnell himself weighs in by stating, “In 1810, George Barrett presented to
the Royal Society a new mode of calculating life Annuities. The old and honorable society
refused to publish the contribution, no doubt considering his deductions too reactionary.
However in spite of the frown of the Society, [Francis] Baily gave Barrett’s findings to the
world in the appendix of his very valuable work on Annuities and it immediately began to
influence sound actuarial thought of the time.”
Given that Thomas Bayes lived from around 1701 to 1761, it is hard to see how Bayes
could have improved Barrett’s method. When Bayes died, Barrett was 9 years old!
“Rev. T. Bayes” may have improved Barrett’s method, but it probably was not the Reverend
Thomas Bayes we’ve come to know and love.
What’s next?
Answer: This ends our section on Bayesian inference. In Section 3, we begin our journey
into the world of probability distributions, which opens up many new possibilities for using
Bayes’ Theorem.
SECTION 3
Probability Functions
Overview
Welcome to Section 3! Now that you’ve had a taste of what Bayesian inference is about, it’s
time to take Bayes’ Theorem to the next level. A major use of Bayes’ Theorem involves
parameter estimation, and, as such, it is critical that you have a firm understanding of
parameters and variables, and how they are related in a probabilistic way.
This section consists of only two chapters.
• Chapter 8 introduces probability mass functions (pmf). This chapter introduces the idea
of a random variable and presents general concepts associated with probability distribu-
tions for discrete random variables. The binomial and Bernoulli distributions are used as
examples of these probability mass functions (pmf ’s). The pmf ’s can be used to specify
prior distributions, likelihoods, and/or posterior distributions in Bayesian inference.
• Chapter 9 introduces probability density functions (pdf). The focus is on general con-
cepts associated with probability density functions (pdf ’s), which are distributions asso-
ciated with continuous random variables. The continuous uniform and normal
distributions are highlighted as examples of pdf ’s. These and other pdf ’s can be used
to specify prior distributions, likelihoods, and/or posterior distributions in Bayesian
inference.
A solid understanding of both pmf ’s and pdf ’s is absolutely essential if you are to become a
Bayesian analyst. In Bayesian analyses, we hypothesize about the values of unknown
parameters. In doing so, we regard the unknown parameter as a random variable generated
from some probabilistic distribution. In particular, we use probability functions to specify:
• prior distributions
• likelihood functions
• posterior distributions.
These chapters pave the way to Sections 4 and 5. Take your time with this material as it is
essential for your Bayesian journey.
CHAPTER 8
Hopefully you have a sense now of Bayes’ Theorem and how Bayesian inference is used to
update our belief in alternative hypotheses. One of the primary uses of Bayesian inference is
to estimate parameters. To do so, we need to first build a good understanding of prob-
ability distributions.
This chapter will present general concepts associated with probability distributions. We’ll
use the binomial and Bernoulli distributions as examples of probability mass functions
(pmf ’s). In Chapter 9, we’ll use the continuous uniform and normal distributions as
examples of probability density functions (pdf ’s). Although the two chapters will
contain only a few examples, the general concepts apply to other probability distributions.
These are fairly long chapters, and it may take a few readings for the material to sink in if
this is new material for you . . . so make sure to get up and stretch every now and then!
By the end of this chapter, you will be able to define and use the following concepts:
• Function
• Random variable
• Probability distribution
• Parameter
• Probability mass function (pmf)
• Binomial pmf
• Bernoulli pmf
• Likelihood
• Likelihood profile
Since this chapter is about functions, it makes sense to start by asking the following
question:
What is a function?
Answer: In math, a function relates an input to an output, and the classic way of writing
a function is
f(x) = . . .   (8.1)

[Figure 8.1. The function f(x) = x².]
The function name is f. You can name your function anything you want, but f is a
commonly used name. The inputs go within the parentheses. Here, we have a single
input called x. To the right of the equals sign is a set of instructions that tell us what to do
with the input to produce the output. In this example, we input a value for x; the
instructions tell us that we should square x to give us the output.
In other words, f(x) maps the input x to the output. We can visualize this as shown in
Figure 8.2.
x → f → x²
Figure 8.2
For example, if you would like to convert Fahrenheit to Celsius, you can use the function:
C = (F − 32) × 5/9.   (8.2)
If you input 80 degrees Fahrenheit to this function, the output would be:
C = (80 − 32) × 5/9 ≈ 26.7 degrees Celsius.   (8.3)
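Incidentally, if you'd like to try this on a computer, the conversion can be written as a short Python sketch (the book itself works everything by hand; Python is just one convenient choice, and the function name is ours):

```python
def fahrenheit_to_celsius(f):
    """Convert degrees Fahrenheit to degrees Celsius."""
    return (f - 32) * 5 / 9

# Input 80 degrees Fahrenheit; the output is about 26.7 degrees Celsius
print(round(fahrenheit_to_celsius(80), 1))  # 26.7
```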
We are now positioned to introduce the term “variable.” Dictionary.com defines variable as
“capable of being varied or changed.” This is certainly true of temperatures on Planet Earth:
temperatures vary over time and space. Conventionally, variables are given a name, such as
T or C, where T represents temperature in degrees Fahrenheit, and C represents temperature
in Celsius. While the capital letters provide the name of the variable, specific values are
often indicated with lower-case letters (e.g., t = 80 and c = 26.7).
The Oxford Dictionary of Statistics (Upton and Cook, 2014) states that a variable is “the
characteristic measured or observed when an experiment is carried out or an observation is
made. Since a non-numerical observation can always be coded numerically, a variable is
usually taken to be numerical.” In other words, capital T or C is the characteristic measured,
and its value is usually numeric.
What is a random variable?
Answer: The Oxford Dictionary of Statistics (Upton and Cook, 2014) tells us, “When the
value of a variable is subject to random variation, or when it is the value of a randomly
chosen member of a population, it is described as a random variable—though the adjective
‘random’ may be omitted.”
Answer: Yes. Suppose you flip a fair coin three times, and record each head as “H” and
each tail as “T.” Let’s call three coin flips an experiment and start by listing all possible
outcomes:
• HHH
• THH
• HTH
• HHT
• TTH
• THT
• HTT
• TTT
With two possible results (H or T), and three flips, there are 2³ = 8 possible outcomes. Count
them! These outcomes make up our sample space. If this doesn’t ring a bell, see Chapter 1
for a quick review.
Clearly, what you observe after 3 flips is determined at random. In this particular case, the
outcome of an experiment is not a number. It’s something like “HHT.” So, let’s define a
random variable called Y, where Y is the number of heads. Y can assume the values of 0,
1, 2, or 3. We will let lower-case y represent a particular value from a coin toss
experiment (see Table 8.1).
Table 8.1
Experiment Results
Outcome   HHH  THH  HTH  HHT  TTH  THT  HTT  TTT
y           3    2    2    2    1    1    1    0
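If you'd rather let a computer do the counting, here is a short Python sketch (ours, not from the book) that enumerates the sample space and tallies the values of Y:

```python
from itertools import product
from collections import Counter

# All 2^3 = 8 outcomes of three coin flips (H = heads, T = tails)
outcomes = ["".join(flips) for flips in product("HT", repeat=3)]

# The random variable Y maps each outcome to its number of heads
freq = Counter(outcome.count("H") for outcome in outcomes)

# Probability of each value of Y, assuming the 8 outcomes are equally likely
probs = {y: freq[y] / len(outcomes) for y in sorted(freq)}
print(probs)  # {0: 0.125, 1: 0.375, 2: 0.375, 3: 0.125}
```

These are exactly the frequencies and probabilities tabulated below in Table 8.2.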
If we were to flip a coin three times, we don’t know with certainty whether its value, y,
will be 0, 1, 2, or 3. But we know that it has to be one of these values. That is, our random
variable is discrete—and it can take on a finite number of values. We can get an idea of
what the value could be if we assigned probabilities to each. Incidentally, discrete random
variables can also assume a countably infinite number of values, such as 0, 1, 2, . . . , to
infinity and beyond.
Answer: Yes! A random variable is a function with inputs that are outcomes of an
experiment (like “HHT”) and outputs that are numerical (like “2”). In many cases, the
inputs are already numerical, so the inputs are the same as the outputs.
Answer: We’re often interested in knowing the probability of observing particular out-
comes. This brings us to the all-important topic of probability theory.
Answer: If each of the 8 possible outcomes listed in Table 8.1 is equally likely (the coin is
fair), then we can count how many times the event y = 3 appears out of 8 total. It appears
once, so:
Pr(Y = 3) = 1/8 = 0.125.   (8.4)
The notation Pr indicates probability. This is the probability that a random variable named
Y will take on the value y = 3. The answer is 0.125.
The next step moves us from the probability of a specific value of Y, such as y = 3, to all
possible values of Y, which can be seen in the header of Table 8.2. The possible values of
Y are 3, 2, 1, and 0. The number of times each of these values appears out of the 8 events is
given in row 1, and the probability is given in row 2. Notice that the sum of the
probabilities of observing all possible values of Y is 1.0.
Table 8.2
y            0      1      2      3
Frequency    1      3      3      1
Pr(Y = y)    0.125  0.375  0.375  0.125
Is this a probability distribution?
Answer: Yes, it is. From the Oxford Dictionary of Statistics (Upton and Cook, 2014):
“Probability distribution: A description of the possible values of a random variable, and of
the probabilities of occurrence of these values.”
We divided our sample space into subsets by defining the variable Y, which is the number
of heads that can possibly appear with three flips of a coin, and assigned a probability to each.
Answer: OK then! Each experiment now contains 10 flips. We’d do precisely the same thing.
We’d write out all possible outcomes, determine the associated values of Y, and then assign
to each a probability. Y is still the number of heads, and Y can assume values 0, 1, 2, . . . , 10.
For instance, we could end up with 10 heads, which would be HHHHHHHHHH. Or we
could end up with 0 heads (10 tails), which would be TTTTTTTTTT. Or we could end up
with 5 heads and 5 tails, and every combination in between. With two outcomes per flip
(H or T), our sample space would consist of 2¹⁰ = 1024 possible outcomes!
Remember, our variable Y can take on values of y = 0, 1, 2, . . . , 10. Here are a few
examples:
• HHHHHHHHHH → 10
• TTTTTTTTTT → 0
• HTHTHTHTHT → 5
• HHHHHTTTTT → 5
Go ahead and list all of these outcomes and assign a probability to each. There
are only 1024 calculations, so it shouldn’t take too long.
Really?
Answer: Ha ha! Gotcha! Can you see what a chore that would be?
We need a function that will let us easily compute, for example, the probability of
observing y ¼ 5, or observing 5 heads in 10 coin flips. The binomial probability mass
function can aid us with this task.
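As a preview, here is a Python sketch (ours, not from the book) that uses the binomial pmf, introduced as Equation 8.5 below, to get this answer without listing a single outcome; the helper name binomial_pmf is our own:

```python
from math import comb  # comb(n, y) is the binomial coefficient "n choose y"

def binomial_pmf(y, n, p):
    """Probability of y successes in n trials, each with success probability p."""
    return comb(n, y) * p**y * (1 - p)**(n - y)

# Probability of exactly 5 heads in 10 flips of a fair coin
print(binomial_pmf(5, 10, 0.5))  # 0.24609375, i.e., 252/1024
```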
• bi means “two”
• nomial means “name”
So, the word “binomial” means “two names.” In probability theory, it generally means
“two outcomes.” All binomial problems are composed of trials that have only two
possible outcomes (e.g., success or failure, live or die, heads or tails, present or absent).
Traditionally, we label one outcome a “success,” and the other a “failure.” For this
problem, let’s call heads a “success,” which means that tails is a “failure.” It doesn’t
matter which outcome you label a “success” and which you label a “failure,” as long as
you are clear about your choice!
Answer: The binomial probability function is widely used for problems where there are a
fixed number of independent tests or trials (designated n) and where each trial can have
only one of two outcomes. Statistics textbooks worldwide use coin flipping as a way to
demonstrate the binomial distribution.
Let’s return to our 3 coin flip example, where the coin is fair. Remember that we created a
variable called Y, and it can take on values y = 0, 1, 2, 3. Often, you’ll see this written y = 0, 1,
2, . . . , n, where n is the highest possible value that the random variable can take. The actual
number of heads observed is subject to chance. Earlier, we calculated the probability of
observing 0, 1, 2, 3 heads in 3 coin flips by hand (see Table 8.3).
Table 8.3
y            0      1      2      3
Frequency    1      3      3      1
Pr(Y = y)    0.125  0.375  0.375  0.125
The binomial function is written below (Equation 8.5). Notice the f notation, which denotes
that it is a function named f. The inputs to the function are provided within parentheses:
f(y; n, p) = (n choose y) p^y (1 − p)^(n − y),   y = 0, 1, . . . , n.   (8.5)
Instead of f(x), we have f(y; n, p). The semicolon is read “given.” This is read as the
probability of y successes, given n trials and probability p of success. So this function
requires three inputs: y, n, and p.
You may also see this written as f(y | n, p), where the vertical bar means “given.” In this
book, we use the semicolon notation to avoid any confusion with the conditional
probability terms in Bayes’ Theorem.
Two of the inputs are called parameters: n and p. We will discuss what a parameter is
shortly.
• n = the total number of trials (in our case, n = 3 coin flips or 3 trials)
• p = the probability of success (in our case, the probability of flipping a heads, which is 0.5
each time if the coin is fair)
The third input, y, is the observed number of successes in the experiment. The lower-case y
represents the value of a random variable, and y can take on the discrete values 0, 1, 2, . . . , n.
Given these inputs, the instructions for creating the output are provided on the right side
of the equals sign. For example, we can use the binomial function to calculate the prob-
ability of observing 2 heads in 3 flips when the coin is fair as:
f(y; n, p) = (n choose y) p^y (1 − p)^(n − y),   y = 0, 1, . . . , n.   (8.6)
f(2; 3, 0.5) = (3 choose 2) · 0.5² · (1 − 0.5)^(3 − 2).   (8.7)
The semicolon here means “given.” The entire left side of the equation can be read “the
probability of observing 2 heads, given 3 coin flips and a probability of a heads equal to 0.5,”
or in general terms, “the probability of 2 successes in 3 trials given the probability of success
on each trial is equal to 0.5.” This is equivalent to writing:
Pr(Y = 2) = f(y = 2; n = 3, p = 0.5) = (3 choose 2) · 0.5² · (1 − 0.5)^(3 − 2).   (8.8)
This might look weird to you, but mathematicians use this notation because, to them, it is
super-clear and concise. The left term Pr(Y = 2) asks what is the probability that our random
variable named Y has a value of 2. This is equivalent to the binomial pmf where y = 2,
n = 3, and p = 0.5. We use the pmf to calculate the probability.
Let’s now break the right-hand side of the binomial probability function (the output)
into pieces:
• The term p^y gives p (the probability of success, or heads) raised to the number of times
the success (heads) occurred (y). This term calculates the probability of observing y
independent successes together in one experiment, just as we did at the beginning of
the chapter.
• The term (1 − p)^(n − y) gives the probability of a failure (or tails) raised to the number of
times the failures (tails) occurred, which is (n − y). This term calculates the probability of
observing n − y independent failures together in one experiment, just as we did at the
beginning of the chapter.
To calculate the probability of flipping 2 heads out of 3 trials, we need to observe a success y
times (2 heads), and we need to observe a failure n − y times (1 tail). So far, our result would be:
0.5² · (1 − 0.5)¹ = 0.125.   (8.9)
But if you flip a fair coin 3 times, as we’ve seen, there is more than one way you could end
up with 2 heads and 1 tail. For instance, the sequence could be:
• HHT
• THH
• HTH.
In our problem, we are asked to compute the probability of getting 2 heads in 3 flips, given
that the coin is fair. We don’t care about the actual sequence of the outcomes, so we need to
account for all of the various possible ways we can end up with 2 heads in 3 flips. The
(n choose y) portion of the binomial probability function is called the binomial
coefficient and accounts for ALL the possible ways (combinations) in which two heads
and one tail could be obtained.
You can compute the binomial coefficient by hand:
(n choose y) = n! / (y! (n − y)!)   (8.10)
(3 choose 2) = 3! / (2! (3 − 2)!) = (3 · 2 · 1) / ((2 · 1) · 1) = 6/2 = 3.   (8.11)
Thus, there are 3 ways of getting 2 heads in 3 flips, which we confirmed earlier. We multiply
3 by 0.125 to get our final answer: 0.375. The binomial probability of getting exactly two
heads out of 3 coin flips, given that the coin is fair, is:
f(y; n, p) = (n choose y) p^y (1 − p)^(n − y),   y = 0, 1, . . . , n   (8.12)
f(2; 3, 0.5) = 3 · 0.5² · (1 − 0.5)¹ = 3 · 0.125 = 0.375.   (8.13)
This is the same answer we got when we calculated the probability by hand (Table 8.3).
Similarly, we could use the function to compute the probability of observing, say, 0 heads
and 3 tails. That would be:
f(y; n, p) = (n choose y) p^y (1 − p)^(n − y),   y = 0, 1, . . . , n   (8.14)
f(0; 3, 0.5) = (3 choose 0) · 0.5⁰ · (1 − 0.5)³   (8.15)
f(0; 3, 0.5) = 3! / (0! (3 − 0)!) · 0.5⁰ · (1 − 0.5)³   (8.16)
f(0; 3, 0.5) = 1 · 1 · 0.125 = 0.125.   (8.17)
Don’t forget that zero factorial (0!) is equal to 1. Notice that in the first example we added
the names of the inputs (i.e., y, n, and p), while in the second example we did not. You’ll see
this both ways in the literature. And now let’s double-check these results with the table of
results we showed previously (see Table 8.4).
Table 8.4
y            0      1      2      3
Frequency    1      3      3      1
Pr(Y = y)    0.125  0.375  0.375  0.125
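The whole of Table 8.4 can be reproduced with the same pmf. A Python sketch (ours, not from the book):

```python
from math import comb

def binomial_pmf(y, n, p):
    """Binomial pmf: probability of y successes in n trials."""
    return comb(n, y) * p**y * (1 - p)**(n - y)

# Reproduce Table 8.4: Pr(Y = y) for y = 0, 1, 2, 3 with n = 3, p = 0.5
for y in range(4):
    print(y, binomial_pmf(y, 3, 0.5))
# 0 0.125
# 1 0.375
# 2 0.375
# 3 0.125
```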
Great! We hope you can see that a pmf such as the binomial pmf is a super-handy tool
(function) that returns probability very quickly. The proof of this function is credited to
Jakob Bernoulli (Bernoulli, 1744), a Swiss mathematician (see Figure 8.3). His tomb bears
the inscription “Jakob Bernoulli, the incomparable mathematician.”
If we have an experiment that has independent trials and a constant probability of success,
we can indicate that a random variable Y arises from a binomial process as:
Y ∼ binomial(n, p).   (8.18)
If we flip a coin three times and the coin is fair, we can say:
Y ∼ binomial(n = 3, p = 0.5).   (8.19)
Answer: The binomial distribution is a display of all possible outcomes of y, given
n and p. In other words, we can use the binomial function to compute the probability of
observing 0, 1, 2, 3 heads out of 3 flips, given that the coin is fair (probability of heads =
probability of tails = 0.5; see Table 8.5).
Table 8.5
Successes Probability
0 0.125
1 0.375
2 0.375
3 0.125
The sum of the probabilities that make up the binomial distribution must be 1.00.
We can then graph the results as in Figure 8.4.
Figure 8.4 [Bar chart: probability (y-axis, 0 to 1.0) of each number of successes (x-axis, 0 to 3).]
This is a binomial distribution whose parameters are n = 3 and p = 0.5. The full range
of possible outcomes is given on the x-axis, and the binomial probability is given on the y-axis.
Can you find the probability of observing 2 heads out of 3 coin flips, given that the coin
is fair?
How about the probability of observing 2.5 heads out of 3 coin flips,
given that the coin is fair?
What is a parameter?
Answer: The Oxford Dictionary of Statistics (Upton and Cook, 2014) defines it as “a
constant appearing as part of the description of a probability function . . . The shape of
the distribution depends on the value(s) given to the parameter(s).”
The parameters of a probability distribution define how the distribution looks: change
the parameters, and you change the shape and location of the distribution. Some authors
describe parameters as the “knobs” or “controls” of a function.
Let’s return to our 10 coin flip example. Figure 8.5 shows the binomial distribution when
n = 10 and p = 0.5 (note that p, the probability of success, is still 0.5, but n, the number of
trials, has changed from 3 to 10).
Figure 8.5 [Bar chart: probability (y-axis, 0 to 1.0) of each number of successes (x-axis, 0 to 10).]
In this example, the full range of possible values for our variable is 0 to 10. Did you notice
that the binomial distribution is symmetric when p is 0.5?
Symmetry is not always the case. To illustrate, Figure 8.6 shows yet another binomial
distribution, this time when n = 100 and p = 0.05 (note that p, the probability of success,
was changed from 0.5 to 0.05, and n was changed from 10 to 100).
Figure 8.6 [Bar chart: probability (y-axis, truncated at 0.20) of each number of successes (x-axis, 0 to 100); the distribution is skewed, with its peak near 5 successes.]
See how different they look? All three graphs are binomial probability distributions, but
they have different parameters (n and p). The x-axis should extend to the maximum
number of possible successes. The y-axis often ranges between 0 and 1 but is sometimes
truncated, as in Figure 8.6. Changing the y-axis does not change the distribution at all: it
just magnifies or shrinks how the distribution is displayed.
No matter what range you use for the y-axis, the sum of the bars will still be equal to 1.0.
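You can check this "sums to 1" property numerically for any choice of parameters. A Python sketch (ours, not from the book), using the three parameter pairs shown in this chapter:

```python
from math import comb

def binomial_pmf(y, n, p):
    return comb(n, y) * p**y * (1 - p)**(n - y)

# Whatever the parameters, the probabilities across all possible y sum to 1
for n, p in [(3, 0.5), (10, 0.5), (100, 0.05)]:
    total = sum(binomial_pmf(y, n, p) for y in range(n + 1))
    print(n, p, round(total, 10))  # total is 1.0 for every (n, p) pair
```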
When it comes to probability mass functions, the values of the parameters identify
which binomial distribution you are talking about; they determine the shape of the distri-
bution (the distribution of probabilities along the y-axis) and its location along the x-axis.
Does the “knob” or “control” image work for you?
The idea that p is constant is pretty straightforward: if we flip a coin 10 times, and p = 0.5,
this means that p is held at 0.5 for each and every flip. The concept of independence relates
to the outcomes of two or more flips. If we flip a coin and it turns up heads, this outcome
has absolutely no bearing on what the result of the next flip will be.
• Negative binomial
• Bernoulli distribution
• Poisson distribution
• Discrete uniform distribution
• Geometric distribution
• Hypergeometric distribution
We will touch on the Bernoulli function in this chapter and discuss a few others in future
chapters.
Answer: In each, there is a discrete random variable, say Y. For each possible value the
random variable can assume, f(y) is a probability, written Pr(Y = y), and thus f(y) must assume values
between 0 and 1. The values of f(y) must sum to 1 for all values that Y can assume. This is the
definition of a probability mass function.
What is a Bernoulli distribution?
Answer:
From The Oxford Dictionary of Statistics (Upton and Cook, 2014): The distribution of a
discrete random variable taking two values, usually 0 and 1.
From Wikipedia: In probability theory and statistics, the Bernoulli distribution,
named after Swiss scientist Jakob Bernoulli, is the probability distribution of a random
variable which takes the value 1 with success probability of p and the value 0 with failure
probability of q = 1 − p.
In short, a Bernoulli distribution is a special case of a binomial distribution
in which the number of trials is n = 1. If we have just one coin flip (n = 1) and the coin
is fair (p = 0.5), then we can use the binomial pmf to calculate the probability that the
result is heads (y = 1) as:
f(y; n, p) = (n choose y) p^y (1 − p)^(n − y),   y = 0, 1, . . . , n   (8.20)
f(1; 1, 0.5) = 1 · 0.5¹ · (1 − 0.5)⁰ = 0.5.   (8.21)
If the coin was not fair, such that the probability of getting a heads is 0.4, then we
would have:
f(y; n, p) = (n choose y) p^y (1 − p)^(n − y)   (8.22)
f(1; 1, 0.4) = 1 · 0.4¹ · (1 − 0.4)⁰ = 0.4.   (8.23)
Some things are worth pointing out here. First of all, with 1 trial, the binomial
coefficient is 1.
Let’s run through the probability of observing a success:
• With 1 trial, if we have a success, the term p^y is the same as p¹, which is just p.
• With 1 trial, if we have a success, the term (1 − p)⁰ is 1. Remember that anything raised to
the power of 0 is 1.0.
• Multiply 1 (the binomial coefficient) by p and then by 1 and you end up with p.
• With 1 trial, if we have a failure, the term p^y is the same as p⁰, which is 1.
• With 1 trial, if we have a failure, the term (1 − p)¹ is (1 − p).
• Multiply 1 (the binomial coefficient) by 1 and then by (1 − p), and you end up with (1 − p).
Thus, with a single trial, the probability of getting a success is the same thing as p, and the
probability of getting a failure is the same thing as (1 − p).
Because of these properties, the Bernoulli distribution is often written as:
f(y; p) = p^y (1 − p)^(1 − y),   y = 0, 1.   (8.24)
The Bernoulli distribution provides the probabilities for all possible outcomes. With only 1
trial, there are only two possible results, and a graph of the distribution looks like the one
shown in Figure 8.7.
Figure 8.7 [Bar chart: probability (y-axis, 0 to 1.0) of 0 and 1 successes.]
We mention the Bernoulli distribution because we will be using it later in this chapter
and in future chapters. Incidentally, you can think of a binomial distribution as a series of
independent Bernoulli trials.
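Since the Bernoulli pmf is just the binomial pmf with n = 1, the equivalence is easy to check in code. A Python sketch (ours, not from the book; both helper names are our own):

```python
from math import comb

def binomial_pmf(y, n, p):
    return comb(n, y) * p**y * (1 - p)**(n - y)

def bernoulli_pmf(y, p):
    """Bernoulli pmf: p for a success (y = 1), 1 - p for a failure (y = 0)."""
    return p**y * (1 - p)**(1 - y)

# With n = 1, the binomial pmf and the Bernoulli pmf agree exactly
for p in (0.5, 0.4):
    assert bernoulli_pmf(1, p) == binomial_pmf(1, 1, p)  # success
    assert bernoulli_pmf(0, p) == binomial_pmf(0, 1, p)  # failure

print(bernoulli_pmf(1, 0.4), bernoulli_pmf(0, 0.4))  # 0.4 0.6
```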
Likelihood
Now, let’s gracefully roll into the concept of likelihood when dealing with probability mass
functions. Likelihood is a key concept in Bayesian inference . . . remember that Bayes’
Theorem can be expressed as shown in Figure 8.8.
Pr(A | B) = [Pr(B | A) · Pr(A)] / [Pr(B | A) · Pr(A) + Pr(B | ∼A) · Pr(∼A)]
Here Pr(B | A) is the likelihood of the data, B, given hypothesis A; Pr(A) is the prior
probability of hypothesis A; Pr(B | ∼A) is the likelihood of the data, B, given hypothesis
∼A; Pr(∼A) is the prior probability of hypothesis ∼A; and the left-hand side, Pr(A | B), is
the posterior probability of hypothesis A, given data B.
Figure 8.8
If we let a given hypothesis stand in for A, and our data stand in for B, then this equation
can be written as:
Pr(Hi | data) = [Pr(data | Hi) · Pr(Hi)] / [Σ (j = 1 to n) Pr(data | Hj) · Pr(Hj)].   (8.25)
In words, the posterior probability for hypothesis Hi is the likelihood of observing the
data under hypothesis i multiplied by the prior probability of hypothesis i, divided
by the sum of the likelihood × the prior across all j = 1 to n hypotheses.
Notice that the likelihoods are conditional on each hypothesis! In Bayesian analyses, the
likelihood is interpreted as the probability of observing the data, given the hypothesis.
In short, you must understand the concept of likelihood! In previous chapters, we tried to
give you a general feel for likelihood . . . now it is time to dig deeper.
Answer: As we’ve seen, a component of Bayes’ Theorem is the Pr(data | H), which is “the
probability of the data, given a hypothesis.” If the data are in hand, then the term
“likelihood” is often used in its stead. Thus, when you see Pr(data | H) in Bayes’ Theorem,
a likelihood computation is required.
Throughout this book, the likelihood term in Bayes’ Theorem is expressed as Pr(data | H)
when there are discrete hypotheses under consideration. When the likelihood is actually
computed, we will express this as ℒ(data; H), where ℒ symbolizes “likelihood” and the
semicolon means “given.” We do this to distinctly identify the likelihood term in Bayes’ Theorem
from the actual calculation.
As a side note to our readers who previously have studied maximum likelihood methods,
read the box below:
You may have seen likelihood expressed as the likelihood of the parameters given
the data. For example, the likelihood of the parameter, θ, given the data, y, can
be written ℒ(θ; y) or ℒ(θ | y). This is fairly conventional, but can be confusing when related
to Bayes’ Theorem. Regardless of the notation used, you must enter values for θ and y to
calculate a result.
Are there any other key points to bear in mind regarding likelihood?
Answer: Yes. We’ve mentioned in previous chapters that the likelihood computations do
not need to sum to 1.00. This is in sharp contrast to the prior and posterior distributions in a
Bayesian analysis, which must sum to 1.00. Don’t ever forget this!
Answer: Of course. Suppose you are given 2 heads out of 3 coin flips, but you don’t know
p. We can plug in many alternative values (hypotheses) for p and use the binomial pmf to
compute the likelihood for each and every combination (see Table 8.6).
Table 8.6
p Likelihood
0 0
0.1 0.027
0.2 0.096
0.3 0.189
0.4 0.288
0.5 0.375
0.6 0.432
0.7 0.441
0.8 0.384
0.9 0.243
1 0
Here, the left column represents different hypothesized values for p (the probability of
success). The right column represents the likelihood of observing 2 heads out of 3 flips
given a particular value for p.
A few things worth pointing out:
• p can assume a value in the range of 0 to 1, so there are an infinite number of possibilities
we could examine; we just recorded 11 cases.
• When the probability of success is 0, the likelihood of observing 2 heads out of 3 flips is 0.
It’s impossible to observe ANY heads if the probability of flipping a head is 0!
• When the probability of success is 1.0, the likelihood of observing 2 heads out of 3 flips
is 0. If the coin MUST land heads, then it is impossible that you observe any tails!
• The sum of the second column does not equal 1.00.
Let’s graph our results across a full spectrum of p alternatives (see Figure 8.9).
Figure 8.9 [Line graph: the likelihood (y-axis, 0 to 0.5) of observing 2 heads in 3 flips as a function of p (x-axis, 0 to 1); the curve rises from 0 at p = 0, peaks near p = 0.67, and falls back to 0 at p = 1.]
• Find the likelihood of observing y = 2 if p = 0.4 in Table 8.6 and match it to Figure 8.9.
• Find the likelihood of observing y = 2 if p = 0.5 in Table 8.6 and match it to Figure 8.9.
• The gray dotted lines will help you, and we’ll revisit these soon.
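Table 8.6 and Figure 8.9 can both be generated directly from the binomial pmf by holding the data fixed (y = 2, n = 3) and varying p. A Python sketch (ours, not from the book):

```python
from math import comb

def binomial_pmf(y, n, p):
    """Binomial pmf: probability of y successes in n trials."""
    return comb(n, y) * p**y * (1 - p)**(n - y)

# Likelihood profile: hold the data fixed (2 heads in 3 flips), vary p
ps = [i / 10 for i in range(11)]
likelihoods = [binomial_pmf(2, 3, p) for p in ps]

for p, lik in zip(ps, likelihoods):
    print(p, round(lik, 3))

# A likelihood profile is not a probability distribution: the values
# need not sum to 1 (here they sum to about 2.475)
print(round(sum(likelihoods), 3))  # 2.475
```

Note that the largest likelihood on this grid occurs at p = 0.7, close to the observed proportion of heads, 2/3.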
Answer: OK! Let’s try to put together everything we’ve learned so far!
Let’s go through a quick example where we are considering just 2 hypotheses. Remember
that Bayes’ Theorem can be expressed as shown below:
Pr(Hi | data) = [Pr(data | Hi) · Pr(Hi)] / [Σ (j = 1 to n) Pr(data | Hj) · Pr(Hj)].   (8.26)
In fact, although this is not conventional, we could write Bayes’ Theorem as:
Pr(Hi | data) = [ℒ(data | Hi) · Pr(Hi)] / [Σ (j = 1 to n) ℒ(data | Hj) · Pr(Hj)].   (8.27)
Suppose we have two hypotheses for our coin in terms of fairness. A friend gives you the
coin and tells you that the coin is either fair (p = 0.5) or that it is weighted such that the
probability of observing heads is only 0.4. These are the only two options.
We can now write out the posterior probability for hypothesis 1 with the expanded
denominator of Bayes’ Theorem as:
Pr(H1 | data) = [Pr(data | H1) · Pr(H1)] / [Pr(data | H1) · Pr(H1) + Pr(data | H2) · Pr(H2)].   (8.28)
Step 1. What are the two hypotheses?
Answer:
• H1 is the fair hypothesis: the coin is fair so that the probability of heads is 0.5 (p = 0.5).
• H2 is the unfair hypothesis: the coin is weighted so that the probability of heads is 0.4
(p = 0.4).
There are two discrete hypotheses (and we’ll assume these are the only two possibilities).
Step 2. What were the prior probabilities for each hypothesis?
• Let’s set the prior probability for each hypothesis = 0.5. In other words, we give equal
weights to the hypothesis that the coin is fair and to the hypothesis that the coin is
weighted. In this case, we have no a priori reason to think that one hypothesis is more
likely to be true than the other.
Figure 8.10 [Bar chart: prior probability (y-axis, 0 to 1.0) of 0.5 for each hypothesis, H1: p = 0.5 and H2: p = 0.4.]
Thus, we are using a Bernoulli distribution to set our priors for the two alternative
hypotheses (p = 0.5 and p = 0.4), and the probability associated with each hypothesis is 0.5.
In contrast to the Hamilton and Madison hypotheses in Chapter 5, here we have alternative
hypotheses for a parameter.
Step 3. Now we collect data: suppose we flipped the coin 3 times and observed 2 heads.
The likelihood of these data under each hypothesis comes from the binomial pmf:
ℒ(y = 2; n = 3, p = 0.5) = 3 · 0.5² · (1 − 0.5)¹ = 0.375
ℒ(y = 2; n = 3, p = 0.4) = 3 · 0.4² · (1 − 0.4)¹ = 0.288
You can cross-check these results in the likelihood profile in Figure 8.9.
Pr(H1 | data) = [Pr(data | H1) · Pr(H1)] / [Pr(data | H1) · Pr(H1) + Pr(data | H2) · Pr(H2)]   (8.33)
Pr(H1 | data) = (0.375 · 0.5) / (0.375 · 0.5 + 0.288 · 0.5) = 0.566.   (8.34)
Pr(H2 | data) = [Pr(data | H2) · Pr(H2)] / [Pr(data | H1) · Pr(H1) + Pr(data | H2) · Pr(H2)]   (8.35)
Pr(H2 | data) = (0.288 · 0.5) / (0.375 · 0.5 + 0.288 · 0.5) = 0.434.   (8.36)
Thus, after 3 coin flips, we have updated our belief that the coin is fair from 0.5 to
0.566, and we have updated our belief that the coin is biased from 0.5 to 0.434
(see Figure 8.11).
Figure 8.11 [Bar chart: posterior probability (y-axis, 0 to 1.0) of 0.566 for H1: p = 0.5 and 0.434 for H2: p = 0.4.]
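The entire updating exercise can be condensed into a short Python sketch (ours, not from the book), with the same priors, data, and binomial likelihoods:

```python
from math import comb

def binomial_pmf(y, n, p):
    """Binomial pmf: probability of y successes in n trials."""
    return comb(n, y) * p**y * (1 - p)**(n - y)

# Two hypotheses about the coin, with equal prior weight
hypotheses = {"H1": 0.5, "H2": 0.4}  # hypothesized values of p
priors = {"H1": 0.5, "H2": 0.5}

# Likelihood of the data (2 heads in 3 flips) under each hypothesis
likelihoods = {h: binomial_pmf(2, 3, p) for h, p in hypotheses.items()}

# Bayes' Theorem: posterior = likelihood * prior / sum over all hypotheses
denom = sum(likelihoods[h] * priors[h] for h in hypotheses)
posteriors = {h: likelihoods[h] * priors[h] / denom for h in hypotheses}

print({h: round(post, 3) for h, post in posteriors.items()})
# {'H1': 0.566, 'H2': 0.434}
```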
We hope this makes perfect sense! We are given a coin, and told that it is either fair
or weighted. These are our two hypotheses, and we gave them equal weight for the
prior distribution. We then collected data: 2 heads in 3 coin flips. We calculated the
likelihood of observing the data under each hypothesis. We then used Bayes’ Theorem to
calculate the posterior probability of each hypothesis. After the analysis, we still have two
hypotheses, but the probability that H1 is correct is now higher than the probability that H2
is correct.
Thus, our posterior distribution for the two alternative hypotheses for p is another
Bernoulli distribution.
Answer: Yes. Let’s take a step back to get a “big picture” view of this process (see
Figure 8.12).
Figure 8.12 [Diagram: a Bernoulli prior distribution over the two hypotheses, H1 (p = 0.5) and H2 (p = 0.4), sits above a binomial data-generating distribution, bin(n = 3, p); the prior supplies the parameter p to the binomial, which in turn generates the observed data, yi (0, 1, 2, or 3 heads).]
This sort of diagram was popularized by John Kruschke in his book, Doing Bayesian Data
Analysis (Kruschke, 2015) and is intended to communicate the structure of the prior and
likelihood. At the bottom of this diagram, we have our observed data, yi. The data were
generated from a binomial distribution with n = 3 flips and the parameter, p. The
parameter, p, in turn, is the unknown parameter that we are trying to estimate. Here,
we use a Bernoulli distribution to set the “weights” on the two alternative
hypotheses for p. Thus, you can interpret the Bernoulli distribution in Figure 8.12 as a
prior distribution which provides our weights of belief for the two hypotheses.
tion can be generalized.
But there is another interpretation for the prior distribution displayed: the unknown
parameter, p, is a random variable that is generated from a Bernoulli distribution. We are
trying to characterize which Bernoulli distribution produced it. Through Bayes’ Theorem,
the prior Bernoulli distribution is updated to a posterior Bernoulli distribution in light of
new data.
With either interpretation, the goal is to make probabilistic statements about our belief in
p and the distribution that characterizes it, and to update those beliefs or knowledge in light
of new data.
Answer: For both problems, we had two hypotheses. For the author problem, our hypoth-
eses were Madison vs. Hamilton, and for this chapter’s problem the hypotheses were about
the value of a parameter: p = 0.5 vs. p = 0.4.
In the author problem, we informally calculated the likelihood of the data (use of the
word upon) under each hypothesis based on previous analysis of each man’s writing. In
contrast, in this chapter we used a probability mass function to calculate the likelihood
of the data under each hypothesis. The use of probability mass functions to compute the
likelihood of the data is common in Bayesian analysis (though not required, as we’ve seen).
• In this function, n and p are parameters, whereas y is a randomly observed discrete value.
The binomial distribution maps the probability associated with all possible values of the
random variable, Y.
Answer: Chapter 9 will tackle probability density functions, another critical tool in
your Bayesian toolbox.
CHAPTER 9
We’ll start with the same question that we started with in Chapter 8:
What is a function?
Answer: In math, a function relates an input to an output, and the classic way of writing
a function is
f(x) = . . .   (9.1)
Here, the function name is f, the inputs are denoted by the letter x, and the dots represent
instructions that will generate the output. For instance, consider the function shown in
Figure 9.1.
f(x) = x²
Figure 9.1
The function name is f. The inputs go between the parentheses. Here, we have a single
input called x. To the right of the equals sign is a set of instructions that tell us what to do
with the input to produce the output. Here, we input a value for x; the instructions tell us
that we should square x to give us the output.
In other words, f(x) maps the input x to the output. We can visualize this as shown in
Figure 9.2.
x → f → x²
Figure 9.2
The function’s name is f, and the inputs are the observed random variable, y (the number of
successes, which in this case is the 2 heads), n (the total trials, which is three flips), and p
(the probability of success, which is 0.5). Don’t forget that a variable is considered random
when its value is subject to random variation.
In this example, the discrete random variable named Y could be 0, 1, 2, or 3. We just used
the binomial pmf to calculate the probability of observing a random variable y = 2, given
three tosses and a fair coin.
Probability mass functions deal with random variables, such as the number
of heads observed, that are discrete (or distinct) in nature. In this chapter,
however, we focus on random variables that are continuous in nature.
Answer: Of course. Let’s start with the following example from Wikipedia (accessed
August 17, 2017). Suppose a species of bacterium typically lives 4 to 6 hours. In this
example, we can say that capital X represents a random variable that stands for the lifespan
of a bacterium. This is an example of a continuous random variable, because the
lifespan can assume values over some interval (4 to 6 hours, such as 4.1, 4.2, 5.9999, etc.).
We can define the possible outcomes of X as:
4 ≤ X ≤ 6. (9.4)
Answer: The answer is actually 0. A lot of bacteria live for approximately 5 hours, but there
is a negligible chance that any given bacterium dies at exactly 5.0000000000 . . . hours. We
can always get more precise!
Instead we might ask: What is the probability that the bacterium dies between 5 hours
and 5.01 hours? Let’s say the answer is 0.02 (i.e., 2%). Next, what is the probability that
the bacterium dies between 5 hours and 5.001 hours? The answer is probably around
0.002, because this is 1/10th of the previous interval. The probability that the bacterium
dies between 5 hours and 5.0001 hours is probably about 0.0002, and so on. Notice that
as the time increment decreases, the probability also decreases. When the time
increment is 0 (an exact time), the probability is 0. This isn't very helpful if our goal is
probability!
Answer: For cases where a random variable is continuous, you need to consider
probability in the context of a probability density function.
Here, the function is f(x) = 0.5: the function name is f, and it has one argument called x.
We can name the function and argument anything we'd like. Regardless of what it is named,
the function returns the probability density, and it is always 0.5 (with some constraints
that we'll discuss below).
Answer: A uniform distribution where values range between 4 and 6. Technically, the
probability density function is written:

f(x) = 0.5,   4 ≤ x ≤ 6.

Notice the use of the comma to separate the function, f(x) = 0.5, from its constraints,
4 ≤ x ≤ 6. In other words, x must fall within this interval.
The random variable named X can be drawn from this distribution. We can write:

X ∼ U(4, 6).

The symbol ∼ means "is distributed as." So we read this equation "the random variable
named X is distributed as a uniform distribution with a minimum value of 4 and a
maximum value of 6."
The uniform distribution is usually indicated by a capital U, and this distribution has two
parameters, the minimum (sometimes called a) and maximum (sometimes called b). Thus,
the uniform distribution is uniquely described as Uða; bÞ. If you change a or b, you have a
different uniform distribution.
Now let's take a look at this uniform distribution. The lower-case x is the actual value of
X (see Figure 9.3). The blue line in this figure represents the probability density function
f(x) = 0.5. Notice that the y-axis is labeled "Density." Notice also that we don't use the word
"probability" here.

Figure 9.3 The uniform distribution U(4, 6): the density is a constant, 0.5, for x between 4.0 and 6.0.

If all values of x have the exact same density, it makes sense that the pdf
of a uniform distribution is a constant, f(x) = 0.5, right? It's just a matter of figuring out
what this constant is.
Answer: Because just as the bars in the binomial distribution or any other pmf must sum
to 1.00, the area under the density f(x) must equal 1.00. Let’s explore this idea a
little more . . .
The uniform distribution is sometimes called a rectangular distribution because it
looks like a rectangle. Wild! And rectangles have area, which is calculated by multiplying
the length of the rectangle along the x-axis by the height of the rectangle along the
y-axis. The length of our rectangle is calculated by subtracting the minimum, 4, from the
maximum, 6, which is 2.0. If the area of our uniform distribution represents
probability, then we know that the area of this rectangle must be 1.0. So, we
now know:
• area = 1.0
• length = 2.0
And now solve for the height, which is called density:

height = 1.0 / 2.0 = 0.50. (9.10)
The height of the distribution at any given point is called its density (Figure 9.3, blue),
whereas the total area under the distribution must be 1.00 (Figure 9.3, gray). The function
that yields the density is called a probability density function.
Now, to repeat the most crucial point: the area under the pdf is equal to 1.0.
Thus, the probability that X is between 4 and 6 is 1.0, or:
Answer: You read our minds! As we mentioned, the uniform pdf has two parameters,
called a and b, where a represents the lower bound of the distribution, and b represents the
upper bound of the distribution:
f(x; a, b) = 1 / (b − a). (9.12)
This is the uniform probability density function (pdf). For our example, where a ¼ 4 and
b ¼ 6, the density is calculated as:
f(x; a, b) = 1 / (6 − 4) = 1/2 = 0.5. (9.13)
Let’s not forget our constraints! The formal definition of the uniform pdf is written:
f(x; a, b) = 1 / (b − a) for a ≤ x ≤ b, and 0 for x < a or x > b. (9.14)
In our example:
f(x; a, b) = 0.5 for 4 ≤ x ≤ 6, and 0 for x < 4 or x > 6. (9.15)
In words, the probability density is 0.5 for all x between 4 and 6 hours; it is 0 for any values
of x that are less than 4 hours or greater than 6 hours.
What is the probability that x is between 4.5 and 5.5 hours for our
uniform distribution?
Answer: OK, now we need to move from probability density to probability. Our
uniform distribution has a ¼ 4 and b ¼ 6. The probability that a random variable named X is
between 4.5 and 5.5 hours could be depicted as shown in Figure 9.4.
Figure 9.4 The uniform pdf U(4, 6), with the region between x = 4.5 and x = 5.5 highlighted in red.
Intuitively, if the full gray box above represents an area of 1.00, then the red box
represents 1/2 of the total area, so the probability of drawing x between 4.5 and 5.5 hours
is 0.5. Here, we are talking about probability (what proportion of the full gray rectangle
consists of the red rectangle) and not density. However, we will use the density to calculate
the result. The area of the red rectangle is:
• length = 5.5 − 4.5 = 1
• height (density) = 0.5
• area = length ∗ height = 0.5.
The probability that a bacterium has a lifespan between 4.5 and 5.5 hours is 0.5, that is:

Pr(4.5 ≤ X ≤ 5.5) = 0.5.
Answer: Once again, the answer is zero! Assuming that 5 means exactly 5 (and not a smitch
more or less), then we are dealing with a line, not a rectangle. Lines don’t have area, so the
answer technically is 0. When you ask what is the probability that x ¼ 5, you might really
mean what is the probability of 5, give or take a tiny amount.
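Before moving on, the uniform pdf and its rectangle areas can be captured in a short Python sketch (ours, not the book's; the function names are our own):

```python
def uniform_pdf(x, a=4.0, b=6.0):
    """Density of the uniform distribution U(a, b): 1/(b - a) inside [a, b], 0 outside."""
    return 1.0 / (b - a) if a <= x <= b else 0.0

def uniform_prob(lo, hi, a=4.0, b=6.0):
    """Probability that X falls in [lo, hi]: the area of a rectangle under the pdf."""
    lo, hi = max(lo, a), min(hi, b)
    return max(hi - lo, 0.0) * (1.0 / (b - a))

print(uniform_pdf(5.0))        # 0.5  (a density, not a probability)
print(uniform_prob(4.5, 5.5))  # 0.5  (length 1.0 times height 0.5)
print(uniform_prob(5.0, 5.0))  # 0.0  (a single point has zero width, hence zero area)
```

The last line mirrors the point just made: for a continuous random variable, the probability of any exact value is 0.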
Answer: The Online Statistics Education book (Lane, 2011) tells us: “The normal distribu-
tion is the most important and most widely used distribution in statistics. It is sometimes
called the ‘bell curve,’ although the tonal qualities of such a bell would be less than
pleasing.” The normal (or Gaussian) distribution is a very common continuous prob-
ability distribution.
Answer: The normal distribution is also called the Gaussian distribution, named for the
German mathematician Carl Friedrich Gauss, who “had an exceptional influence in many
fields of mathematics and science and is ranked as one of history’s most influential
mathematicians” (see Figure 9.5). Gauss’ works were collected and published posthumously
in 1863.
Figure 9.6 A normal distribution for x = Lifespan of Bacterium (hours), with μ = 5 and σ = 0.5.
Here we are interested in the lifespan of a bacterium, which is given on the x-axis. The
y-axis is labeled Density, as it was in our uniform pdf example. The normal distribution is a
bell-shaped curve. The middle of the curve is centered on the mean, or average, which is
represented by the Greek letter, mu (μ). Here, μ ¼ 5. The spread of the curve is controlled by
the standard deviation, which is represented by the Greek letter, sigma (σ). Here, σ ¼ 0.5.
Answer: Yes. Change the mean (μ) or standard deviation (σ), and you change the location
and/or spread of the curve. Remember the parameters are like the “control knobs.” The
parameter, μ is called the location parameter: it controls where the distribution is
centered over the x-axis. In contrast, the parameter σ is called the scale parameter: it
controls the shape or spread of the distribution.
Sometimes people will denote the parameters of a normal pdf as μ and σ² instead, where
σ² is just σ ∗ σ. This is called the variance of the distribution. Figure 9.7 shows some
examples.
Figure 9.7 Normal distributions with different parameters: μ = 5, σ = 0.5 (green); μ = 7, σ = 0.5 (red); μ = 7, σ = 1 (purple); μ = 7, σ = 2 (blue); and μ = 5, σ = 0.25 (black). The x-axis is Lifespan of Bacteria (hours); the y-axis is Density.
• The red, purple, and blue distributions all have the same mean, μ ¼ 7. What differs among
them is the standard deviation, σ. The red distribution has the lowest standard deviation
(0.5), while the blue distribution has the highest standard deviation (2.0).
• The green and black distributions have the same mean, 5.0. What differs between them is
the standard deviation. The green distribution has a standard deviation of 0.5, while the
black distribution has a standard deviation of 0.25.
• The green and the red distributions have the same standard deviation (0.5). What differs
between them is their mean: the green distribution has a mean of 5, while the red
distribution has a mean of 7.
Answer: The normal probability density function is a pdf, just like the uniform pdf we
examined previously.
The function is as follows:
f(x; μ, σ) = (1 / (√(2π) σ)) e^(−(x−μ)²/(2σ²)),   −∞ ≤ x ≤ ∞. (9.17)
This function is named f and it has just three inputs. Two of the inputs are parameters: the
mean (which is μ) and the standard deviation (which is σ). The third input, x, represents an
outcome, such as bacterial lifespan in hours.
Look for this function on the Deutsche Mark (see Figure 9.8)!
Answer: Suppose the average lifespan of a bacterium is 5 hours. We can use the normal
pdf to get the probability density of observing, say, x ¼ 4.5 hours, when the mean
lifespan is μ ¼ 5 hours and the standard deviation σ ¼ 0.5 hours:
f(x; μ, σ) = (1 / (√(2π) σ)) e^(−(x−μ)²/(2σ²)) (9.18)

f(4.5; 5, 0.5) = (1 / (√(2π) ∗ 0.5)) e^(−(4.5−5)²/(2 ∗ 0.5²)) = 0.4839414. (9.19)
Remember that this is NOT probability . . . it is probability density!
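The calculation in Equation 9.19 can be checked with a small Python sketch (ours, not the book's):

```python
from math import sqrt, pi, exp

def normal_pdf(x, mu, sigma):
    """Density of the normal distribution N(mu, sigma) at x (Equation 9.17)."""
    return (1.0 / (sqrt(2 * pi) * sigma)) * exp(-(x - mu)**2 / (2 * sigma**2))

# Density of observing x = 4.5 hours when mu = 5 and sigma = 0.5 (Equation 9.19)
print(round(normal_pdf(4.5, 5.0, 0.5), 7))  # 0.4839414
```

Again, the value returned is a density, not a probability.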
Answer: Yes! Figure 9.9 shows the Gaussian distribution with μ ¼ 5 and σ ¼ 0.5 (blue).
Look for the density when x ¼ 4.5 as calculated above. Also plotted is the Gaussian
distribution with μ ¼ 5 and σ ¼ 0.25 (black).
Figure 9.9 Two normal distributions for Lifespan of Bacteria (hours): μ = 5, σ = 0.5 (blue) and μ = 5, σ = 0.25 (black).
A critical thing to notice here is that both curves are generated by the normal pdf, but
they have different parameters. As with the continuous uniform distribution, the area
under each curve is 1.0. That is, the area is the same under the two curves. Since the
black curve has a small standard deviation compared to the blue curve, the density is
squished upward in order to accommodate the narrow spread. Notice that the density
can be greater than 1.0.
Answer: Remember how we calculated the density for the continuous uniform
distribution? For the bacterium lifespan problem, we said that lifespan is distributed as a uniform
pdf between 4 and 6 hours, X ∼ U(4, 6). We knew the area of the rectangle was 1.0, we
knew the length of the rectangle (2.0), and we simply solved to get the height of the
rectangle, which is the density. We then calculated the probability that a bacterium
would live between 4.5 and 5.5 hours by determining the proportion of U(4, 6) where
4.5 ≤ x ≤ 5.5, which is a smaller rectangle.
We can use the same sort of reasoning with the normal distribution, only we’ll need more
rectangles. Let’s have a look, focusing on the blue distribution in Figure 9.9, where μ ¼ 5,
σ ¼ 0.5. Here it is again in Figure 9.10, with some rectangles overlaid on top.
Figure 9.10 The normal pdf with μ = 5, σ = 0.5, approximated by 5 rectangles of width 1 hour. The x-axis is x = Lifespan (hours).
Here, the x-axis has been divided into 5 rectangles, where each rectangle has a
length of 1 hour and a height that is the normal pdf density (two of these rectangles
are in the tails and are barely visible). This is a very rough approximation of a normal
distribution!
See if you can find the rectangle where the lifespan is between 3.5 hours and 4.5 hours.
Now, this particular rectangle (blue) has a height, or density, of 0.1079819. The area of this
rectangle is length (1.00) times height (0.1079819) = 0.1079819.
Answer: Now look for the rectangle where lifespan is between 4.5 and 5.5 hours. This
particular rectangle (red) has a density of 0.7978846. The area of this rectangle is length
(1.0) ∗ height (0.7978846) = 0.7978846.
This is our first attempt to answer the question, “What is the probability that a bacterium
has a lifespan between 4.5 and 5.5 hours?” Hang on to this answer, as we will revisit
this same question in a few minutes.
Answer: If Gauss did his homework correctly, we can calculate the area of each rectangle
in the graph and then add the areas together. We should get an answer close to 1.0. Let’s try
it. We have 5 rectangles, each with a length of 1 and height that is given by its density. And
the answer is . . .
1.0143837. (9.20)
Not bad at all! It's not precisely 1.0 because our 5-rectangle distribution only
approximates the normal pdf shown in blue, where the area under the curve is exactly 1.0.
Perhaps we can improve our result by creating 41 rectangles instead of five. Let’s have a
look (see Figure 9.11).
Figure 9.11 The normal pdf with μ = 5, σ = 0.5, approximated by 41 rectangles of width 0.1 hours. The x-axis is x = Lifespan (hours).
• Now, the x-axis has been divided into 41 slivers, where each sliver has a length of
0.1 hours.
• See if you can find the rectangle where lifespan is between 3.95 hours and 4.05 hours.
Now, this particular rectangle (blue) has a density of 0.1079819. The area of this
rectangle is 0.1 ∗ 0.1079819 = 0.0107982.
• Now look for the rectangle where lifespan is between 5.55 and 5.65 hours.
This particular rectangle (red) has a density of 0.3883721. The area of this rectangle is
0.1 ∗ 0.3883721 = 0.0388372.
To get the area under the curve, we can repeat this little exercise for each of the 41 slivers
and then add up the results. And the answer is . . .
0.9999599. (9.21)
Notice that this answer is closer to 1.0 than our previous example, where we had only 5
rectangles.
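The 41-rectangle sum can be reproduced directly. This sketch (ours, not the book's) places midpoint rectangles of width 0.1 at 3.0, 3.1, …, 7.0 hours:

```python
from math import sqrt, pi, exp

def normal_pdf(x, mu, sigma):
    """Density of the normal distribution N(mu, sigma) at x."""
    return (1.0 / (sqrt(2 * pi) * sigma)) * exp(-(x - mu)**2 / (2 * sigma**2))

width = 0.1
# 41 rectangle midpoints at 3.0, 3.1, ..., 7.0
midpoints = [3.0 + width * i for i in range(41)]

# Area of each rectangle is density times width; sum the 41 areas
area = sum(normal_pdf(x, 5.0, 0.5) * width for x in midpoints)
print(round(area, 7))  # 0.9999599
```

This reproduces Equation 9.21: the sum falls just short of 1.0 because the rectangles stop at the tails.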
Answer: If we have 41 discrete rectangles, then the sum of the areas of the rectangles can be
expressed as:

Σ (i = 1 to 41) f(xᵢ) ∗ 0.1. (9.22)

The general idea is that we can more closely approximate the normal pdf as we create
thinner and thinner rectangles.
Answer: This is the same question we answered previously with our 5-rectangle
approximation. Here are the steps: Find the rectangles centered on 4.5 and 5.5. Take half of the area
of each (because half the area is outside the boundary of interest). Then, add in the area of
all rectangles that fall between these two anchors. The sum is the probability, and the
answer is:

0.6810742. (9.23)

Recall that our cruder 5-rectangle approximation gave:

0.7978846. (9.24)

Hang on to this result, and we'll revisit it one more time in a minute or two!
What is the total area if our rectangles became really, really skinny?
Answer: When the rectangles approach a width of 0, the sum symbol, Σ, is replaced by the
integral symbol, ∫, and the area under the curve is exactly 1.000. A general way to write this is:

∫ from −∞ to ∞ of f(x) dx = 1.0. (9.25)
• f(x) is a pdf, and you can plug in any pdf you’d like. We’ve introduced the continuous
uniform pdf and normal pdf so far.
• We want to get the area under the curve for a specified range of values of x. The range
of values is tucked next to the integral symbol, with the lower end of x provided on
the bottom of the integral symbol (here, negative infinity) and the upper end of x
provided at the top of the integral symbol (here, positive infinity). That covers all
possibilities!
• The integral symbol means “get the total area under the curve” for the specified range of
x. Since we cover all possibilities for x, the total area under the curve is 1.0.
• To the right of the pdf is the symbol dx, which is the width of the rectangles (approaching 0),
and not d times x. If dx is the width of the rectangles, and f(x) is the height of each
rectangle, then f(x)dx is the area of a very small rectangle.
• To the right of the equal sign is 1.0; the area under the curve for all possible values of x
between negative and positive infinity is 1.0.
Wolfram defines an integral as “a mathematical object that can be interpreted as an area or
a generalization of area. Integrals, together with derivatives, are the fundamental objects of
calculus. Other words for integral include antiderivative and primitive.”
Answer: Yes. That is a characteristic that defines all probability density functions. Click
here to see the derivation for the normal pdf!
What would the integral look like for the normal pdf?
Answer: Just replace f(x) above with the normal pdf, which is:

f(x; μ, σ) = (1 / (√(2π) σ)) e^(−(x−μ)²/(2σ²)). (9.26)

Then, we have:

∫ from −∞ to ∞ of (1 / (√(2π) σ)) e^(−(x−μ)²/(2σ²)) dx = 1.0. (9.27)
OK, one more time! What is the probability that a bacterium has a
lifespan between 4.5 and 5.5 hours?
Answer: Before, we divided the distribution into thin rectangles, found the area of the
rectangles in question, and added them up. Now we are considering a subset of x’s, but we
make use of the density function to find the exact area.
Let’s step through this carefully:
Pr(4.5 ≤ X ≤ 5.5) = ∫ from 4.5 to 5.5 of (1 / (√(2π) σ)) e^(−(x−μ)²/(2σ²)) dx. (9.29)
Now, we need to evaluate this to get our answer. The process of finding this answer is called
integration. And the result is a probability.
Answer: Well, we have the pdf, and we know the shape of the curve. Now we need a tool
to help us find the area under a specific part of the curve. This area is probability. Comput-
ing the Gaussian integral is a bit beyond the scope of this book, and you really don’t need to
solve it by hand for our purposes.
Fortunately, the normal distribution is such a commonly used distribution that we can
easily find the areas that we want from calculators or computers, which often use the
“super-skinny rectangle” approach to approximate the answer. Technically, this is called
Riemann’s (ree-mahn) approximation. We can use virtually any calculator or math software
to generate the probability that a bacterium’s lifespan is between 4.5 and 5.5 hours, given a
normal pdf with μ = 5 and σ = 0.5. The calculation gives 0.6826895. Our 41-rectangle
approach suggested that the probability was 0.6810742, which is not a bad approximation.
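In Python, this area can be computed without any rectangles at all, using the standard library's error function (a sketch, ours, not the book's):

```python
from math import erf, sqrt

def normal_cdf(x, mu, sigma):
    """P(X <= x) for X ~ N(mu, sigma), expressed via the error function erf."""
    return 0.5 * (1.0 + erf((x - mu) / (sigma * sqrt(2.0))))

# Probability of a lifespan between 4.5 and 5.5 hours for N(mu = 5, sigma = 0.5)
p = normal_cdf(5.5, 5.0, 0.5) - normal_cdf(4.5, 5.0, 0.5)
print(round(p, 7))  # 0.6826895
```

Subtracting the cumulative probability at 4.5 from that at 5.5 is exactly the integral in Equation 9.29.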
Answer: Yes, there are several excellent sources out there. Forget about memorizing
anything . . . the only thing you need for this book is an intuition for how things work.
Here are some of our favorites:
• Khan Academy
• Better Explained
• How to Enjoy Calculus by Eli Pine
Answer: Loads of them! Here’s a look at a few that you may have come across:
• normal
• log-normal
• beta
• gamma
• exponential
• Weibull
• Cauchy
We will explore a few of these in future chapters, but encourage you to check these out on
Wikipedia. Who knows? Some may be very relevant to your work!
Likelihood
Now let’s gracefully switch to the topic of likelihood. Likelihood is a key concept in
Bayesian inference, so we need to spend time with it.
Answer: You may recall that Dictionary.com provides three definitions of likelihood:
Pr(Hi | data) = [Pr(data | Hi) ∗ Pr(Hi)] / [Σ (j = 1 to n) Pr(data | Hj) ∗ Pr(Hj)]. (9.30)
Now, let’s consider Bayes’ Theorem when the hypotheses for a parameter are infinite. Here,
Bayes’ Theorem takes a new form. Suppose we are trying to estimate a single parameter
called θ. Bayes’ Theorem in this case is specified as:
P(θ | data) = [P(data | θ) ∗ P(θ)] / [∫ P(data | θ) ∗ P(θ) dθ]. (9.31)
This is the generic version of Bayes’ Theorem when the posterior distribution for a single
parameter, θ, given the observed data, is represented by a pdf.
Notice the notation P(θ | data) for the posterior distribution of θ. Here, P indicates
probability density. Both the prior (P(θ)) and the posterior P(θ | data) distributions are
pdfs. In contrast, with the discrete version of Bayes’ Theorem, the prior and posterior
distributions are pmfs, denoted with Pr.
The likelihood in Bayes’ Theorem may be written P(data | θ) or Pr(data | Hi)—which version
you use depends on the problem you are solving.
Throughout this book, we will express computations for the likelihood term of Bayes'
Theorem as L(data; θ), where L symbolizes "likelihood" and the semicolon means "given." We
do this to differentiate the likelihood term in Bayes' Theorem from the actual computation.
As a side note to our readers who previously have studied maximum likelihood methods,
read the box below:
You may have seen likelihood expressed as the likelihood of the parameters given the
data, or L(θ; data), which is conventional. Regardless of the notation used, you must enter
values for the parameters and observed data to calculate a result.
Earlier in this chapter, we used the normal pdf to get the probability density of
observing x, given the mean lifespan is μ and the standard deviation is σ:
f(x; μ, σ) = (1 / (√(2π) σ)) e^(−(x−μ)²/(2σ²)). (9.32)
This is used before we observe the data. When we speak of likelihood, in contrast, we have
the variables (data) in hand, and one or more of the parameters are unknown. To
compute the likelihood of the data, we first need to make an assumption about how
the data were generated. Here, we assume that the data are generated from a normal
distribution. Let's let X be our random variable, and X is the lifetime of a bacterium in
hours. We can write:

X ∼ N(μ, σ). (9.33)
Suppose we hypothesize that μ ¼ 5.0, and assume that σ is known to be 0.5. And further
suppose that we draw a random bacterium that lives x ¼ 4.5 hours. We can ask, “What is the
likelihood that x ¼ 4.5 given that μ is 5.0 and σ ¼ 0.5?” We will use the normal pdf to
answer this question, which is
L(x = 4.5; μ = 5.0, σ = 0.5) = (1 / (√(2π) σ)) e^(−(x−μ)²/(2σ²)). (9.34)
Now, let’s plug in the values that we know to get our answer:
L(x = 4.5; μ = 5.0, σ = 0.5) = (1 / (√(2π) ∗ 0.5)) e^(−(4.5−5.0)²/(2 ∗ 0.5²)) = 0.4839414. (9.35)
This is a probability density. Look for this result in blue in Figure 9.12.
Suppose now that we hypothesize that μ ¼ 4.1 (with σ known to be 0.5). Now let’s calculate
the likelihood:
L(x = 4.5; μ = 4.1, σ = 0.5) = (1 / (√(2π) ∗ 0.5)) e^(−(4.5−4.1)²/(2 ∗ 0.5²)) = 0.5793831. (9.36)
Look for this result in green in Figure 9.12.
As you can see, we have an infinite number of possibilities for μ. Nevertheless, we can
generate the likelihood profile by repeatedly plugging in values and mapping the shape
of the surface. Here it is in Figure 9.12.
Figure 9.12 The likelihood profile for μ, generated by computing L(x = 4.5; μ, σ = 0.5) across candidate values of μ. The x-axis is Mean Survival Rate (hours); the y-axis is Likelihood.
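The likelihood profile can be traced numerically; here is a sketch (ours, not the book's, with an illustrative grid of μ values):

```python
from math import sqrt, pi, exp

def likelihood(x, mu, sigma):
    """L(x; mu, sigma): the normal pdf evaluated at the observed datum x."""
    return (1.0 / (sqrt(2 * pi) * sigma)) * exp(-(x - mu)**2 / (2 * sigma**2))

# Sweep candidate values of mu for the single observation x = 4.5, with sigma = 0.5
mus = [2.0 + 0.01 * i for i in range(601)]   # 2.00, 2.01, ..., 8.00
profile = [likelihood(4.5, mu, 0.5) for mu in mus]

# The profile peaks at the hypothesized mu closest to the observed value
best = mus[profile.index(max(profile))]
print(round(best, 2))  # 4.5
```

With a single observation, the most likely value of μ is the observation itself.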
Answer: We generate the likelihood surface for two parameters. In other words, we play
the ‘what if ’ game for both μ and σ.
To illustrate, suppose we selected three bacteria at random that lived 3.6,
4.7, and 5.8 hours, respectively. These are our observed data. We are trying to estimate μ
and σ from the distribution that generated these data. We choose a suite of combinations of
μ and σ. Then, for each datapoint, we calculate the likelihood of observing the data.
We use the normal pdf as our baseline:
f(x; μ, σ) = (1 / (√(2π) σ)) e^(−(x−μ)²/(2σ²)). (9.37)
For instance, if μ = 5 and σ = 1.0, then the likelihood of observing x = 3.6 is calculated as:

L(x = 3.6; μ = 5.0, σ = 1.0) = (1 / (√(2π) ∗ 1.0)) e^(−(3.6−5)²/(2 ∗ 1.0²)) = 0.1497275. (9.38)

Similarly, for the other two datapoints:

L(x = 4.7; μ = 5.0, σ = 1.0) = (1 / (√(2π) ∗ 1.0)) e^(−(4.7−5)²/(2 ∗ 1.0²)) = 0.3813878. (9.40)

L(x = 5.8; μ = 5.0, σ = 1.0) = (1 / (√(2π) ∗ 1.0)) e^(−(5.8−5)²/(2 ∗ 1.0²)) = 0.2896916. (9.41)

If we assume that all three datapoints were independent, we can compute the
likelihood of observing the full dataset (which is 3 observations) given μ = 5
and σ = 1 as the product of the independent likelihoods:

L = 0.1497275 ∗ 0.3813878 ∗ 0.2896916 = 0.01654262. (9.42)
If we played this game across different combinations of μ and σ, we could create a likeli-
hood surface that looks like the one in Figure 9.13.
Figure 9.13 The likelihood surface for the unknown parameters μ and σ, given the observed data. The axes are μ, σ, and Likelihood.
This is a likelihood surface for the unknown parameters, μ and σ. See if you can find
the likelihood value when μ ¼ 5 and σ ¼ 1. It should be 0.01654262 if our calculations were
correct.
Our observed dataset consists of 3 values of lifespan: 3.6, 4.7, and 5.8. The mean of this
dataset is 4.7 and the standard deviation is 1.1. You should see that the most likely
parameter estimates in the graph above correspond to these values.
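The joint likelihood of the three observations can be checked with a quick Python sketch (ours, not the book's):

```python
from math import sqrt, pi, exp

def likelihood(x, mu, sigma):
    """The normal pdf evaluated at the observed datum x."""
    return (1.0 / (sqrt(2 * pi) * sigma)) * exp(-(x - mu)**2 / (2 * sigma**2))

data = [3.6, 4.7, 5.8]  # observed lifespans in hours

def joint_likelihood(data, mu, sigma):
    """Joint likelihood of independent observations: the product of the individual likelihoods."""
    total = 1.0
    for x in data:
        total *= likelihood(x, mu, sigma)
    return total

print(round(joint_likelihood(data, 5.0, 1.0), 7))  # 0.0165426
```

This matches the value on the likelihood surface at μ = 5, σ = 1. Sweeping this function over a grid of (μ, σ) pairs would trace out the full surface in Figure 9.13.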
Answer: Ah, yes! Let’s not lose sight of the big picture!
Suppose we are trying to estimate a single parameter called θ. Bayes’ Theorem in this case
is specified as:
P(θ | data) = [P(data | θ) ∗ P(θ)] / [∫ P(data | θ) ∗ P(θ) dθ]. (9.43)
This is the generic version of Bayes’ Theorem when the posterior distribution for a single
parameter, given the observed data, is represented by a pdf.
Note that this version deals with a single parameter called θ. But you can use Bayes’ Theorem
to estimate multiple parameters. We’ll touch on this topic more in future chapters.
Answer: Nope! Pðθ j dataÞ is a pdf, so the probability that θ assumes a specific value is 0.
Remember that the probability that a bacterium lives for exactly 5.0000000 hours is 0!
Answer: There are infinite hypotheses, and since you can’t estimate the probability of a
specific hypothesis, you are left with evaluating all of them! In other words, you
must estimate the entire posterior distribution.
Answer: OK! Let’s try to put together everything we’ve learned so far!
Let’s go through a quick example where we are considering hypotheses for an unknown
mean, which is a parameter from the Gaussian distribution. How about the average lifespan
of a bacterium? For this example, we’ll assume σ is known to be 0.5.
P(μ | data) = [P(data | μ) ∗ P(μ)] / [∫ P(data | μ) ∗ P(μ) dμ]. (9.44)
We have an infinite number of hypotheses for μ, and we want to update our beliefs after
collecting some data. Let’s review the steps of Bayesian analysis.
Figure 9.14 The prior distribution for μ: a uniform pdf, U(4, 6), with density 0.5.
This represents our prior distribution. This is called a proper prior because the density
function integrates to 1.
Step 3. Collect data.
Suppose we draw a random bacterium that lives 4.7 hours.
Step 4. Compute the likelihood of the data under each hypothesis.
Now we play the what if game, and use the normal pdf to compute the likelihood of
observing x ¼ 4.7 under different values of μ. Suppose, for the sake of example, that we
know that σ ¼ 0.5. We start by writing the normal pdf:
f(x; μ, σ) = (1 / (√(2π) σ)) e^(−(x−μ)²/(2σ²)). (9.45)
But instead of calling the function f, let’s call it L. And, let’s fully specify our model. Here is
the likelihood value for just one value of μ:
L(x = 4.7; σ = 0.5, μ = 4.0) = (1 / (√(2π) ∗ 0.5)) e^(−(4.7−4.0)²/(2 ∗ 0.5²)) = 0.2994549 (9.46)
Figure 9.15 The likelihood of observing x = 4.7 under hypothesized values of μ between 4 and 6, with σ = 0.5. The x-axis is Mean Survival Rate (hours); the y-axis is Likelihood.
Here, we have the likelihood of observing the data, x ¼ 4.7, under each hypothesized value
for μ. Given just a single datapoint, the most likely survival rate is 4.7. You can see from the
shape of this curve that several other hypotheses have decent height, too. The hypothesis
with the lowest likelihood is μ ¼ 6.0.
Remember, our prior distribution is the uniform pdf. The placeholder for that function is
shown in blue:
P(μ | data) = [P(data | μ) ∗ P(μ)] / [∫ P(data | μ) ∗ P(μ) dμ]. (9.48)
And our likelihood function is generated using the Gaussian pdf. The placeholder for that
function is shown in red:
P(μ | data) = [P(data | μ) ∗ P(μ)] / [∫ P(data | μ) ∗ P(μ) dμ]. (9.49)
The posterior distribution is another pdf of some sort. The placeholder for that function is
shown in purple:
P(μ | data) = [P(data | μ) ∗ P(μ)] / [∫ P(data | μ) ∗ P(μ) dμ]. (9.50)
Answer: Yes. Let’s focus on the big picture and create a diagram for this problem (Kruschke
plot) as we did in Chapter 8 (see Figure 9.16). Remember, this diagram is intended to
communicate the structure of the prior and likelihood.
μ ∼ U(a = 4, b = 6)
↓
xi ∼ N(μ, σ = 0.5)
Figure 9.16
At the bottom of this diagram, we have our observed data, xi. The data were generated from
a normal distribution with an unknown parameter, μ, and σ ¼ 0.5. The parameter, μ, in
turn, is the unknown parameter that we are trying to estimate. Here, we use a uniform
distribution with parameters a ¼ 4 and b ¼ 6 to set the “weights” on the alternative
hypotheses for μ. Thus, you can interpret the blue distribution above as a prior distribution
that provides our weights of belief, or current knowledge, regarding μ. The drawing of this
distribution can be generalized.
This interpretation falls in line with that given by Sander Greenland (2006), who writes,
“It is often said (incorrectly) that parameters are treated as fixed by the frequentist but as
random by the Bayesians. For frequentists and Bayesians alike, the value of a parameter may
have been fixed from the start or may have been generated from a physically random
mechanism. In either case, both suppose it has taken on some fixed value that we would
like to know. The Bayesian uses formal probability models to express personal uncertainty
about that value. The ‘randomness’ in these models represents personal uncertainty about
the parameter’s value; it is not a property of the parameter (although we should hope it
accurately reflects properties of the mechanisms that produced the parameter).”
But there is another interpretation for the blue prior distribution displayed: the unknown
parameter, μ, is a random variable that arises from a uniform distribution. In this
example, μ assumes the values between μ ¼ 4 and μ ¼ 6. The Bayesian machinery will
update the prior distribution to a posterior distribution in light of new data.
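The Bayesian update in Equation 9.44 has no simple closed form here, but it can be approximated on a fine grid of μ values. This is a rough sketch (ours, not the book's) using the uniform prior U(4, 6), σ = 0.5, and the single observation x = 4.7:

```python
from math import sqrt, pi, exp

def normal_pdf(x, mu, sigma):
    """Density of the normal distribution N(mu, sigma) at x."""
    return (1.0 / (sqrt(2 * pi) * sigma)) * exp(-(x - mu)**2 / (2 * sigma**2))

sigma, x_obs = 0.5, 4.7
step = 0.01
mus = [4.0 + step * i for i in range(201)]           # grid over the prior's support [4, 6]

prior = [0.5 for mu in mus]                          # uniform prior density, f(mu) = 0.5
like = [normal_pdf(x_obs, mu, sigma) for mu in mus]  # likelihood of x = 4.7 under each mu

# Posterior is proportional to prior * likelihood; normalize so the area sums to 1
unnorm = [p * l for p, l in zip(prior, like)]
total = sum(u * step for u in unnorm)                # approximates the integral in the denominator
posterior = [u / total for u in unnorm]              # posterior density at each grid point

best = mus[posterior.index(max(posterior))]
print(round(best, 2))  # 4.7
```

With a flat prior and one observation, the posterior peaks at the observed value, 4.7 hours. This grid trick foreshadows the MCMC machinery discussed next: both replace an intractable integral with a computational approximation.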
Answer: Sometimes it is intractable! For those cases, we need a special “tool” that can help
us estimate the posterior distribution without a mathematical, closed-form solution. That
tool is called Markov Chain Monte Carlo (MCMC), and we will introduce this topic later in
the book.
There are a few cases, however, where you can use a particular pdf as a prior distribution,
collect data of a specific flavor, and then derive the posterior pdf. In these special cases, the
pdf of the prior and posterior are the same probability density function, but their
parameters may differ. The prior distribution is called a conjugate prior (Raiffa and Schlaifer,
1961), and the effect of the data can then be interpreted in terms of changes in parameter
values (Upton and Cook, 2014).
Here are some examples:
SECTION 4
Bayesian Conjugates
Overview
There are many other conjugate solutions, but we’ve highlighted three commonly used
solutions. In each chapter, we set the scene with a fun problem and then use a conjugate
solution to update the prior distribution for an unknown parameter to the posterior
distribution. We will revisit each of these chapters in Section 5, where we will show you
how to estimate the posterior distribution with a different method, MCMC.
CHAPTER 10
We hope you now have a very solid understanding of probability distributions. In this
chapter (and the next two that follow), we show you how to use Bayes’ Theorem to estimate
the parameters of a probability distribution. Indeed, a very common use of Bayesian
inference involves parameter estimation, where the analysis uses probability density
functions (pdf’s).
By the end of this chapter, you will have a thorough knowledge of the following:
• Beta distribution
• Binomial data
• Hyperparameters
• Conjugate prior
• Credible intervals
The simplest way to show you how Bayes’ Theorem is used to estimate parameters is to dive
right into a new problem. Let’s run through an example of parameter estimation with
Bayesian inference.
This one was taken from The Mike Wise Radio Show (accessed August 18, 2017).
NBA star Shaquille O’Neal and a friend debated whether or not he could get into the
White House without an appointment. The wager: 1000 push-ups.
Here’s a picture of Shaq in case you haven’t met him (see Figure 10.1); he’s 7′1″, weighs 325 pounds, and has earned tons of accolades in the NBA. On top of that, he’s a rapper and
has a Ph.D. At the time of the bet, Barack Obama was President of the United States. Shaq
knows that President Obama is a huge basketball fan and coach and is sure he can get in to
meet with the President.
For this chapter, we’re going to broaden this problem a bit and ask, “What is
the probability that any famous person (like Shaq) can drop by the White
House without an appointment?”
Bayesian Statistics for Beginners: A Step-by-Step Approach. Therese M. Donovan and Ruth M. Mickey,
Oxford University Press (2019). © Ruth M. Mickey 2019.
DOI: 10.1093/oso/9780198841296.001.0001
Answer: Well, let’s see. Shaq is going to attempt to get into the White House, so that
represents a trial. He will either succeed or fail. Shaq thinks his probability of success is
quite high. The outcome of this trial determines who wins the bet. Ring any bells?
This is a binomial problem. You might recall that the binomial probability mass
function is:
f(y; n, p) = C(n, y) · p^y · (1 − p)^(n − y),   y = 0, 1, …, n   (10.1)

Here, C(n, y) is the binomial coefficient (“n choose y”).
Figure 10.2: the binomial distribution for n = 1 and p = 0.7 (x-axis: Successes, 0 or 1; y-axis: Probability).
With one trial, the probability of getting in is 0.7, and the probability of not getting in is
0.3. The sum of the probabilities equals 1.0. Incidentally, a binomial distribution with just 1
trial is also called a Bernoulli distribution. Remember?
If the bet let Shaq try three times, he could compute the probability of observing 0, 1, 2, or 3 successes out of three trials, given p = 0.7:
f(y; 3, 0.7) = C(3, y) · 0.7^y · (1 − 0.7)^(3 − y).   (10.4)
The binomial distribution for n = 3 and p = 0.7 looks like the one shown in Figure 10.3.
Figure 10.3: the binomial distribution for n = 3 and p = 0.7 (x-axis: Successes, 0–3; y-axis: Probability).
With 3 attempts, and p (the probability of success) being 0.7, Shaq would have a small
chance of never getting in (0 successes out of 3 trials; a probability of 0.027).
Answer: You might recall that a key assumption of the binomial distribution is that the
trials are independent. If Shaq makes all three attempts, it could be considered an example
of non-independence. You might argue, however, that if the White House guards change so
that the same guards don’t confront Shaq, the trials would be independent. However,
different guards might have different standards for uninvited guests! Let’s “play on,”
assuming that the trials are independent.
Shaq’s friend, however, believes Shaq’s chances are much lower than 0.7. If p is, say, 0.1, then Shaq’s chances would look much different. Figure 10.4 shows the binomial distribution with n = 3 trials and p = 0.1.
Figure 10.4: the binomial distribution for n = 3 and p = 0.1 (x-axis: Successes, 0–3; y-axis: Probability Mass).
In this case, Shaq would have a large chance of failure (in fact, a probability of 0.729).
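These binomial calculations are easy to check numerically. Below is a small sketch of our own (not from the book), using only Python’s standard library; the helper name `binom_pmf` is ours:

```python
from math import comb

def binom_pmf(y, n, p):
    """Binomial probability of y successes in n trials (Equation 10.1)."""
    return comb(n, y) * p**y * (1 - p)**(n - y)

# Shaq's view: p = 0.7 with n = 3 attempts
print(binom_pmf(0, 3, 0.7))   # ~0.027: a small chance of never getting in
# His friend's view: p = 0.1
print(binom_pmf(0, 3, 0.1))   # ~0.729: failing all three attempts is likely
```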
Here’s the kicker: We don’t know what p (the probability of success) is!
Obviously, there is some dispute over what value it is, or there would not be
a bet! So, here we are confronted with a parameter estimation problem.
You are trying to estimate p, the probability that Shaquille O’Neal or some other famous
person can get into the White House without an invitation.
Our goal here is to use a Bayesian inference approach to estimate the probability that a celebrity can get into the White House without an invitation.
We’ll use the same steps that we have in previous chapters:
1. Identify your hypotheses—these would be the alternative hypotheses for p, ranging
from 0 to 1.00.
2. Express your belief that each hypothesis is true in terms of prior densities.
3. Gather the data—Shaq makes his attempt, and will either fail or succeed.
4. Determine the likelihood of the observed data, assuming each hypothesis is true.
5. Use Bayes’ Theorem to compute the posterior densities for each value of p (i.e., the
posterior distribution).
Now, let’s go through the steps one at a time.
Answer: We know that p, the probability of success in the binomial distribution, can
take on any value between 0 and 1.0. Shaq could say that the prior probability of p is close
to 1.0; his friend could disagree and suggest that the prior probability is closer to 0.01. These
are just two hypotheses for p. However, there’s nothing to stop us from considering
the full range of hypotheses between 0 and 1, which is infinite (p = 0.01, p = 0.011, p = 0.0111, etc.).
Answer: We need to assign a prior for each hypothesized value of p. Here, we will use the
beta distribution to set prior probabilities for each and every hypothesis for p. An
example of a beta probability distribution is shown in Figure 10.5.
Figure 10.5: an example beta probability distribution (x-axis: Hypotheses for p, from 0 to 1; y-axis: Density).
The x-axis for the beta distribution ranges between 0 and 1. That is to say, random variables that are beta distributed must be between 0 and 1. The y-axis gives the probability density, which we learned about in the last chapter. Remember that density is not probability, per se; probability corresponds to area under the curve, and the total area must be 1. But you can think of the densities associated with each value of p as the “weight” for each hypothesis. The distribution above would put greater weight on p’s between, say, 0.2 and 0.6 than on other values.
Answer: The Oxford Dictionary of Statistics tells us that the beta distribution is “a
distribution often used as a prior distribution for a proportion.” Wikipedia has this to say:
“In probability theory and statistics, the beta distribution is a family of continuous probability distributions defined on the interval (0, 1) parameterized by two positive shape
parameters, typically denoted by alpha (α) and beta (β). The beta distribution can be suited
to the statistical modelling of proportions in applications where values of proportions equal
to 0 or 1 do not occur.”
So, the beta distribution is a statistical distribution whose x-axis spans the interval from 0
to 1. The beta distribution has two parameters that control its shape and position, and these
parameters are named alpha (α) and beta (β). The values for α and β must be positive (0 and negative entries won’t work). The beta probability distribution shown in Figure 10.5 has parameters α = 2 and β = 3.
Observations drawn from this distribution can take on values between 0 and 1. This can
be denoted generally as:
X ~ beta(α, β)   (10.5)

f(x; α, β) = (1 / B(α, β)) · x^(α − 1) · (1 − x)^(β − 1),   0 < x < 1.   (10.6)
Can you find the two parameters, α and β here? The beta function, B, is a normalization
constant to ensure that the area under the curve is 1. Don’t confuse the beta function with
the beta pdf!
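To make Equation 10.6 concrete, here is a small sketch of our own (standard library only) that builds the beta pdf from gamma functions, using the standard identity B(α, β) = Γ(α)Γ(β)/Γ(α + β); the helper name `beta_pdf` is ours:

```python
from math import gamma

def beta_pdf(x, a, b):
    """Beta density of Equation 10.6; B(a, b) is the normalizing constant."""
    B = gamma(a) * gamma(b) / gamma(a + b)
    return x**(a - 1) * (1 - x)**(b - 1) / B

# The distribution shown in Figure 10.5 has alpha = 2 and beta = 3
print(beta_pdf(0.5, 2, 3))   # ~1.5
```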
Figure 10.6: beta distributions with different shape parameters (x-axis: Hypotheses for p; y-axis: Density).
Since Shaq and his friend have totally different ideas of what p should be, they might settle on using a beta distribution with α = 0.5 and β = 0.5 as the prior distribution for p, which gives the U-shaped result in blue.
Answer: Generally speaking, the bigger α is relative to β, the more we shift the weight of
the curve to the right, and the bigger β is relative to α, the more we shift the weight of the
curve to the left.
But there is another way to get a handle on α and β. The mean of a beta distribution (also called its first moment) can be calculated as:

μ = α / (α + β).   (10.7)

Its variance is:

σ² = (α · β) / ((α + β)² · (α + β + 1)).   (10.8)
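Equations 10.7 and 10.8 are easy to verify numerically. The sketch below is ours: it integrates the mean and variance of the beta(2, 3) distribution directly from the density (midpoint rule) and compares them to the formulas:

```python
from math import gamma

def beta_pdf(x, a, b):
    B = gamma(a) * gamma(b) / gamma(a + b)
    return x**(a - 1) * (1 - x)**(b - 1) / B

def beta_moments(a, b, n=50_000):
    """Numerically integrate the mean and variance of beta(a, b)."""
    xs = [(i + 0.5) / n for i in range(n)]
    mean = sum(x * beta_pdf(x, a, b) for x in xs) / n
    var = sum((x - mean)**2 * beta_pdf(x, a, b) for x in xs) / n
    return mean, var

a, b = 2, 3
mean, var = beta_moments(a, b)
print(mean, a / (a + b))                        # both ~0.4  (Eq. 10.7)
print(var, a * b / ((a + b)**2 * (a + b + 1)))  # both ~0.04 (Eq. 10.8)
```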
Answer: Sure. Suppose YOU think that the average probability of Shaq getting into the White House is p = 0.10 but think that neighboring values also are likely and set the standard deviation, σ, to 0.05. Remember that the standard deviation is the square root of the variance, σ². Rearranging Equations 10.7 and 10.8 and plugging in μ = 0.10 and σ² = 0.0025, we solve:

α = μ · (μ(1 − μ)/σ² − 1) = 0.10 · (0.10 · 0.90/0.0025 − 1) = 3.5

β = (1 − μ) · (μ(1 − μ)/σ² − 1) = 0.90 · 35 = 31.5
You can derive this on your own at http://www.wolframalpha.com/. The resulting beta
distribution looks like the one shown in Figure 10.7.
Figure 10.7: the beta distribution with α = 3.5 and β = 31.5 (x-axis: Hypotheses for p; y-axis: Density).
This little shortcut—called moment matching—can help you parameterize the beta
distribution using means and standard deviations, which are probably more familiar
to you.
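If you’d rather not use Wolfram Alpha, the moment-matching algebra is easy to code. A sketch of our own (solve for s = α + β from the variance formula, then split s by the mean; the function name is ours):

```python
def beta_from_moments(mu, sigma):
    """Moment matching: recover (alpha, beta) from a desired mean and sd."""
    s = mu * (1 - mu) / sigma**2 - 1   # s = alpha + beta
    return mu * s, (1 - mu) * s

alpha, beta = beta_from_moments(0.10, 0.05)
print(alpha, beta)   # ~3.5 and ~31.5, matching Figure 10.7
```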
What prior distribution did Shaq and his friend settle on?
Answer: Let’s assume they go with a beta distribution with α and β set to 0.5. They think p is either really high or really low. We can designate these as α0 = 0.5 and β0 = 0.5, where the “naught” subscripts alert us that these are the parameters of a prior distribution. Technically, these are known as hyperparameters.
Hyperparameters?
Shaq and his friend had to select a prior distribution, and we can imagine them arguing
over which prior distribution to use. The prior distribution in Figure 10.8 may have resulted
after hours of heated discussion. Of course, Shaq and his friend may have opted to do a bit
of research and make inquiries to the FBI regarding other uninvited but attempted visits by
famous people!
Figure 10.8: the prior distribution, a beta distribution with α0 = 0.5 and β0 = 0.5 (x-axis: Hypotheses for p; y-axis: Density).
Answer: Collect data! That is step 3. Let’s assume that Shaq makes 1 attempt and fails to get in. In binomial function terms, the number of trials is n = 1, and the number of successes is y = 0.
Let’s create a diagram that illustrates the process by which the data were generated
(see Figure 10.9).
p ~ beta(α0, β0)
yi ~ binomial(n = 1, p)
Figure 10.9
At the bottom of this diagram, we have our observed data, yi. If Shaq makes one attempt, the data arise from a binomial distribution with parameters n = 1 and p. You may recall that a binomial distribution with n = 1 is also called a Bernoulli distribution. The parameter, p, in turn, is the unknown parameter that we are trying to estimate. Here, we use a beta distribution to set the “weights” on each and every hypothesis for the unknown parameter, p.
Answer: Step 4 is to determine the likelihood of the observed data, assuming each
hypothesis is true.
Here is where your previous study of the binomial probability mass function comes into play. Because the data, 1 failure out of 1 attempt, are in hand, we will call the result the “likelihood.”
ℒ(y; n, p) = C(n, y) · p^y · (1 − p)^(n − y).   (10.11)
The number of trials is denoted n. The number of observed successes is denoted as y. The
probability of success is denoted as p.
Now, for each hypothesized value of p, let’s compute the binomial likelihood
of observing 0 successes out of 1 trial.
Just to give you an example, under the p = 0.3 hypothesis, the likelihood of observing 0 successes out of 1 trial, given p = 0.3, is computed as:

ℒ(y = 0; n = 1, p = 0.3) = C(1, 0) · 0.3^0 · (1 − 0.3)^(1 − 0) = 0.7.   (10.12)
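As a sketch of our own, you can sweep this likelihood across many hypotheses for p; with y = 0 and n = 1 it collapses to ℒ = 1 − p (the helper name `binom_lik` is ours):

```python
from math import comb

def binom_lik(y, n, p):
    """Binomial likelihood of Equation 10.11."""
    return comb(n, y) * p**y * (1 - p)**(n - y)

# likelihood of Shaq's data (0 successes in 1 trial) under several hypotheses
for p in (0.1, 0.3, 0.7, 0.9):
    print(p, binom_lik(0, 1, p))   # ~0.9, ~0.7, ~0.3, ~0.1 respectively
```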
Now what?
Answer: Well, we need to set this Bayesian inference problem up for a continuous
random variable. Suppose we are trying to estimate a single parameter called θ. If θ is
continuous, you have an infinite number of hypotheses for it. Bayes’ Theorem in this case
is specified as:
P(θ | data) = P(data | θ) · P(θ) / ∫ P(data | θ) · P(θ) dθ.   (10.13)
This is the generic version of Bayes’ Theorem when the posterior distribution for a single
parameter, given the observed data, is represented by a pdf. This is designated Pðθ j dataÞ,
where P is probability density (not probability). Pay attention to this notation! This
is the left side of the equation. On the right side of the equation, the numerator multiplies
the prior probability density of θ, which is written PðθÞ, by the likelihood of observing the
data under a given hypothesis for θ, which is written Pðdata j θÞ. Technically, the likelihood
could be a pmf or a pdf, depending on the problem. For example, in this chapter’s
illustration, the likelihood is a pmf. In the denominator, we see the same terms, but this time we also see a few more symbols. The symbol ∫ means “integrate,” which roughly means “sum up all the pieces” for each tiny change in θ, which is written dθ. In other words, the denominator takes the prior density ∗ likelihood for all possible hypotheses for θ and then sums them.
For the Shaq problem, we replace θ with p, so Bayes’ Theorem looks like this (where the
priors are colored blue, the likelihoods red, and the posterior purple):
P(p | data) = P(data | p) · P(p) / ∫₀¹ P(data | p) · P(p) dp.   (10.14)
But here’s the kicker: The integration of the denominator is often tedious,
and sometimes impossible!
Really?
Answer: Just kidding! For this particular problem, thankfully, no. There is an analytical
shortcut that makes updating possible in a snap. Here it is:
• αposterior = α0 + y
• βposterior = β0 + n − y
For the White House Problem, our prior distribution is a beta distribution with α0 = 0.5 and β0 = 0.5. Shaq made an attempt, so n = 1. He failed to get into the White House, so y = 0. We can now use this shortcut to calculate the parameters of the posterior distribution:
• posterior α = α0 + y = 0.5 + 0 = 0.5.
• posterior β = β0 + n − y = 0.5 + 1 − 0 = 1.5.
Incidentally, the parameters of the posterior distribution are also called hyperparameters; posterior hyperparameters, to be exact.
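The shortcut itself is a one-liner. A sketch of our own (the function name `update_beta` is ours):

```python
def update_beta(alpha0, beta0, n, y):
    """Beta-binomial conjugate update: prior beta(alpha0, beta0) plus
    y successes in n trials gives posterior beta(alpha0 + y, beta0 + n - y)."""
    return alpha0 + y, beta0 + n - y

# Shaq's first attempt: n = 1 trial, y = 0 successes
print(update_beta(0.5, 0.5, n=1, y=0))   # (0.5, 1.5)
```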
Now we can look at the prior and posterior distributions for p (see Figure 10.10).
Figure 10.10: the prior distribution, beta(α0 = 0.5, β0 = 0.5), and the posterior distribution, beta(α = 0.5, β = 1.5) (x-axis: Hypotheses for p; y-axis: Density).
We started off with a prior distribution shown in blue, where Shaq and his friend believed
they had a very high or very low probability of gaining entry to the White House without
an invitation. Then we collected binomial data: one failure out of one trial. We then used
the shortcut to generate the posterior distribution for all hypotheses of p.
As a result of Shaq’s failed attempt, we now have new knowledge about the support for
each and every hypothesis of p. Notice how the posterior really shifted towards the lower
end of the p spectrum.
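The conjugate shortcut can also be cross-checked against Equation 10.14 by brute force: evaluate prior × likelihood on a fine grid of hypotheses for p and normalize numerically. This sketch is ours (standard library only; the midpoint grid avoids the endpoints, where the beta(0.5, 0.5) density is infinite):

```python
from math import gamma

def beta_pdf(x, a, b):
    return x**(a - 1) * (1 - x)**(b - 1) * gamma(a + b) / (gamma(a) * gamma(b))

N = 100_000
grid = [(i + 0.5) / N for i in range(N)]                  # midpoints in (0, 1)
numer = [beta_pdf(p, 0.5, 0.5) * (1 - p) for p in grid]   # prior * likelihood, y = 0, n = 1
total = sum(numer) / N                                    # numerical integral (denominator)
post = [v / total for v in numer]                         # grid posterior density

# compare to the conjugate answer, beta(0.5, 1.5), at p = 0.5
print(post[N // 2], beta_pdf(grid[N // 2], 0.5, 1.5))     # nearly identical
```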
Suppose instead that a “flat” prior had been used, with α0 = 1 and β0 = 1. The same shortcut gives:
• αposterior = α0 + y = 1 + 0 = 1.
• βposterior = β0 + n − y = 1 + 1 − 0 = 2.
This prior distribution is shown in Figure 10.11 as solid blue, and the posterior distribution
is shown as solid purple. The Figure 10.10 result is shown as a dashed line for comparison.
These differences highlight that the choice of prior affects the posterior.
Figure 10.11: the prior beta(α0 = 1, β0 = 1) (solid blue) and the posterior beta(α = 1, β = 2) (solid purple), with the Figure 10.10 curves dashed for comparison (x-axis: Hypotheses for p; y-axis: Density).
Answer: No. Here is a strange twist: the U-shaped prior that was actually used (α and β = 0.5) is less informative (more vague) than the “flat prior” above, where α and β = 1.
The conjugate solutions show that as we make these parameters tiny, they have less
influence on the resultant posterior distributions compared to the data. In other words,
the data play a large role in determining the parameters of the beta posterior.
Thus, a non-informative prior for a beta distribution will be one in which α and β are tiny.
In the words of Zhu and Lu, “flat priors are not necessarily non-informative, and non-
informative priors are not necessarily flat.” Once again, we refer you to Hobbs and Hooten
(2015) for a thoughtful discussion on setting priors. These authors prefer the term “vague
prior” instead of non-informative prior because it’s difficult or impossible to have a prior
that is truly non-informative when dealing with a parameter that is continuous.
Answer: Let’s assume the first prior distribution mentioned was used (α0 and β0 = 0.5) and then updated after 1 failed attempt (giving α0 = 0.5 and β0 = 1.5 for the next round). We now set a prior distribution based on our updated knowledge, and then collect more data. Suppose Shaq fails again. Now, the parameters for our next posterior distribution will be as follows:
• αposterior = α0 + y = 0.5 + 0 = 0.5.
• βposterior = β0 + n − y = 1.5 + 1 − 0 = 2.5.
Now we can look at the prior and posterior distributions for p (see Figure 10.12).
Figure 10.12: the updated prior and posterior distributions for p (x-axis: Hypotheses for p; y-axis: Density).
Here you can see the major benefit of Bayesian inference: as we collect more data, we
update our beliefs. We start with a prior, collect data and then use Bayes’ Theorem to
generate posteriors. These posteriors then become the priors for the next round of data
collection. If we track our beliefs, we track our learning.
Answer: The answer is that the beta distribution is a conjugate distribution that can be
updated with binomial data. This is why we named this chapter “The beta-binomial
conjugate.”
These shortcuts were introduced by Howard Raiffa (Figure 10.13, left) and Robert Schlaifer (Figure 10.13, right) in their work on Bayesian decision theory. Their classic 1961 book is titled Applied Statistical Decision Theory.
The conjugate shortcuts are conveniently provided on a Wikipedia page on Bayesian
conjugates. A visual overview of that page’s beta-binomial conjugate can be depicted as
shown in Figure 10.14.
Unknown parameter: p
Prior distribution: Beta
Prior hyperparameters: α0, β0
Data: y
Likelihood: Binomial; n is known
Posterior distribution: Beta
Posterior hyperparameters: α = α0 + y; β = β0 + n − y
Figure 10.14
Here, each attempt was considered an experiment, where Shaq made a single attempt and
either succeeded or failed. Before making any attempt, we set a prior distribution on p. He
then failed to get into the White House. We then used this new information to get an
updated posterior distribution. This in turn became our prior distribution for Shaq’s second
attempt. He failed again, and this information was used to update to yet another posterior
distribution.
Answer: Yes. We would end up in the same place if we started with a prior distribution and
considered 2 attempts (both failures) before updating. Let’s confirm this:
• The prior:
• α0 = 0.5; β0 = 0.5
• The data:
• n=2
• y=0
• The posterior:
• α = 0.5 + 0 = 0.5
• β = 0.5 + 2 − 0 = 2.5
This is the same answer we got before.
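This equivalence is easy to demonstrate in code. A sketch of our own, reusing the conjugate shortcut:

```python
def update_beta(alpha0, beta0, n, y):
    """Beta-binomial conjugate update."""
    return alpha0 + y, beta0 + n - y

# one attempt at a time: two failed attempts, updating after each
a, b = 0.5, 0.5
a, b = update_beta(a, b, n=1, y=0)
a, b = update_beta(a, b, n=1, y=0)

# both attempts at once
a2, b2 = update_beta(0.5, 0.5, n=2, y=0)

print((a, b), (a2, b2))   # both (0.5, 2.5)
```

The order in which the data arrive does not matter; only the totals n and y do.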
Answer: First, Dictionary.com defines the word “conjugate” in several ways. In grammar it
means “to recite or display all or some subsets of the inflected forms of a verb.” For example,
the forms of “to be” are “I am, you are, he is, we are, you are, they are.” As a noun or
adjective, the word conjugate means “joined together,” especially in pairs. Thus, conjugates
have a common theme that ties them together.
We mentioned the statistical definition in Chapter 9, where we indicated that there are
cases where you can use a particular pdf as a prior distribution, collect data of a specific
flavor, and then derive the posterior pdf. In these special cases, the pdf of the prior and
posterior are the same probability density function, but their parameters may differ. The
prior distribution is called a conjugate prior (Raiffa and Schlaifer, 1961), and the effect of
the data can then be interpreted in terms of changes in parameter values (Upton and
Cook, 2014).
In the White House Problem, we used a beta distribution to set the priors for all
hypotheses of p, the probability that a famous person could get into the White House
without an invitation. We then collected binomial data, and, rather than computing the
posterior by the use of Bayes’ Theorem, we used the conjugate shortcut to generate the
posterior distribution for p.
Can the beta distribution be used as a conjugate prior for data other
than binomial data?
Answer: Yes. Wikipedia once more: “In Bayesian inference, the beta distribution is the
conjugate prior probability distribution for the Bernoulli, binomial, negative binomial and
geometric distributions. The beta distribution is a suitable model for the random behavior
of percentages and proportions.” These distributions all depend on a parameter, p, that can assume values between 0 and 1.
Answer: A picture may be worth a thousand words. If you can display the posterior
distribution, do so.
In addition, often people summarize the posterior distribution with some simple statistics that are familiar to most people: the mean, median, mode (all measures of middleness),
plus the quantiles.
For well-known probability distributions such as the beta distribution, you can simply
look up the expected values that characterize the distribution on Wolfram Mathworld:
• Mean = α / (α + β).
• Mode = (α − 1) / (α + β − 2), for α and β > 1.
• Variance = (α · β) / ((α + β)² · (α + β + 1)).
Let’s do this for the Shaq problem. Let’s calculate these for our new posterior, which has α = 0.5 and β = 2.5:
• Mean = 0.5 / (0.5 + 2.5) = 0.1667.
• Mode cannot be calculated because α < 1. Snap!
• Variance = (0.5 · 2.5) / ((0.5 + 2.5)² · (0.5 + 2.5 + 1)) = 0.0347.
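A sketch of our own confirming these summaries for the beta(0.5, 2.5) posterior:

```python
a, b = 0.5, 2.5   # posterior hyperparameters after two failed attempts

mean = a / (a + b)
variance = a * b / ((a + b)**2 * (a + b + 1))

print(mean)       # ~0.1667
print(variance)   # ~0.0347
# the mode is undefined here because alpha < 1 (the density spikes at p = 0)
```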
Answer: Report the Bayesian “credible intervals.” Bayesian credible intervals are commonly reported in Bayesian inference problems that involve parameter estimation. According to
the Oxford Dictionary of Social Research Methods (Elliot et al., 2016), a credible interval
represents “an interval in the domain of the posterior or predictive distribution. For a 95
percent credible interval, the value of interest lies with a 95 percent probability in the
interval. In other words, given the data and the model, there is a 95 percent chance the true
value lies in that interval.”
A Wikipedia article describes three methods for choosing a credible interval (see also
Kruschke, 2015):
• Choosing the narrowest interval, which for a unimodal distribution will involve choosing those values of highest probability density including the mode. This is sometimes called the highest posterior density interval.
• Choosing the interval where the probability of being below the interval is as likely as
being above it. This interval will include the median. This is sometimes called the equal-
tailed interval.
• Assuming that the mean exists, choosing the interval for which the mean is the
central point.
Let’s use our most up-to-date estimate of p (given Shaq’s two failed attempts) and go with option 2 above; that is, look for the 90% credible interval. Our posterior distribution is a beta distribution with α = 0.5 and β = 2.5. We need to find the points that leave 5% of the distribution in the lower tail and 5% in the upper tail. We can use any computer software program to find these values, which are p = 0.00087 for the lower end and p = 0.57 for the upper end (as shown by the dashed green lines in Figure 10.15).
Figure 10.15: the posterior beta(α = 0.5, β = 2.5), with the 90% credible interval bounds marked by dashed green lines (x-axis: Hypotheses for p; y-axis: Density).
Interpretation: The probability that p is between 0.00087 and 0.57 is 0.90. The word “probability” is correct here because we’ve calculated an area under the pdf.
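If you don’t have a stats package handy, these quantiles can be computed with nothing but the standard library. The sketch below is ours and is specific to this posterior: because α = 0.5, the substitution p = t² removes the integrable spike at 0, leaving a smooth integrand for the midpoint rule, and the quantiles are then found by bisection:

```python
from math import gamma, sqrt

A, B = 0.5, 2.5                              # posterior hyperparameters
NORM = gamma(A) * gamma(B) / gamma(A + B)    # the beta function B(0.5, 2.5)

def cdf(x, steps=4000):
    """P(p <= x) for beta(0.5, 2.5). With p = t**2 the integrand
    p**-0.5 * (1 - p)**1.5 dp becomes 2 * (1 - t*t)**1.5 dt, which is smooth."""
    u = sqrt(x)
    h = u / steps
    total = sum((1 - ((i + 0.5) * h)**2)**1.5 for i in range(steps))
    return 2 * total * h / NORM

def quantile(q):
    """Invert the cdf by bisection."""
    lo, hi = 0.0, 1.0
    for _ in range(50):
        mid = (lo + hi) / 2
        lo, hi = (mid, hi) if cdf(mid) < q else (lo, mid)
    return (lo + hi) / 2

print(quantile(0.05), quantile(0.95))   # ~0.00087 and ~0.57
```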
Answer: Indeed.
• We began by mentioning that Shaquille O’Neal made a bet with his friend that he could
get into the White House without an invitation.
• We broadened this problem to ask, “What is the probability that any famous person can
get into the White House without an invitation?”
• We introduced the beta distribution as a prior pdf that can be used to set hypotheses
for p, the probability of success, noting that the beta distribution returns a probability
density, not a probability. We selected hyperparameters for the beta distribution that
reflected both Shaq and his friend’s opinions about his probability of entry.
• We then collected data that is binomial in flavor. Shaq made one attempt and did not
get in.
• We then calculated the likelihood of observing the data under some hypothesized values
for p.
• But then we noted that we have an infinite number of hypotheses for p, so that the
denominator of Bayes’ Theorem now has an integral. This stopped us in our tracks.
• We relied on a shortcut—a conjugate solution—which allowed us to update our prior
beta distribution to a posterior beta distribution using the data (one trial, one failed
attempt).
• We then let Shaq make a second attempt. This time, we used our newly updated posterior
as the prior distribution and then updated it again, given the outcome of the second
attempt. We discussed this as a way to formalize “learning.”
• Finally, we discussed ways of presenting your results, including calculation of credible
intervals.
Answer: Sadly, he was denied entrance. Shaq recounted the story to a Washington Post
sportswriter this way (accessed August 17, 2017):
“I went to the gate,” [Shaq] said. “They were nice. They said, ‘Shaq, we can’t do it.’ I said,
‘I understand. Understood.’ ”
The sportswriter said that “Shaq left 1600 Pennsylvania Avenue in defeat Sunday, 1,000
push-ups in arrears. He said he was popping them off in bursts of 20 or 30 all day, and that
he had already logged about 100 by the time we saw him in the production room Monday
evening.”
We will revisit this chapter again in Section 5, where we will show you a different method
for estimating the posterior distribution. For now, though, Chapter 11 introduces another
conjugate called the gamma-Poisson conjugate. The chapter features blood and gore.
Beware!
CHAPTER 11
We interrupt this book to bring you the following news flash from the National Post:

“Shark attacks attributed to random Poisson bursts.”

Great headline, eh? This article appeared in the National Post on September 7, 2001. An
unusually large number of shark attacks off the coast of Florida in 2001 spurred a flurry of
excitement. What is the cause? Will I be next? In the story, Dr. David Kelton of Penn State
University explained the attacks this way:
“Just because you see events happening in a rash like this does not imply that there’s some
physical driver causing them to happen. It is characteristic of random processes that they exhibit
this bursty behavior.”
Comforted now? What random process was he referring to? The headline refers to a
Poisson process. You might have guessed this probability distribution is used to estimate
the number of times an event will occur. You’d be right! In this chapter, we’ll take a look at
the problem of shark attacks and put it into a Bayesian framework.
By the end of this chapter, you should be able to define and use the following:
• The Poisson probability mass function
• The gamma probability density function
• The gamma-Poisson conjugate
As usual, we’ll start with a question.
Answer: Because shark attacks are a public health problem, let’s see what the Oxford Dictionary of Public Health (Last, 2007) has to say: “A distribution function to describe rare events, or the sampling distribution of isolated counts in a continuum of space or time, such as cases of uncommon diseases.”
In the Shark Attack Problem, Dr. Kelton explains: “There is something in probability
theory called a Poisson process, giving amazingly good descriptions of such random
independent events occurring through time, like customer arrivals at a fast-food store,
cosmic rays striking a planet, accidents in a factory, airplane mishaps, and maybe shark
attacks.” We need to consider pmf’s because we have discrete numbers; that is, the number
of attacks is not continuous.
The Poisson distribution is a mathematical rule that assigns probabilities to the number
of occurrences observed.
The Poisson distribution is named for the French mathematician Siméon Poisson (see Figure 11.1), who derived this distribution in 1837. This is the same year that Queen Victoria of England took the throne. Coincidentally, “poisson” means “fish” in French, which is fitting for the Shark Attack Problem!
Let’s assume that the average rate is 2.1 attacks per year, based on the number of attacks
reported annually at Florida beaches in the past 10 years (see Table 11.1).
Knowing this, you can use the Poisson probability mass function to determine the
Poisson probability that there will be, say, 10 attacks next year.
Pr(X = x; λ) = λ^x · e^(−λ) / x!,   x = 0, 1, 2, …   (11.1)
where λ is the mean number of occurrences in a given period of time, x is the number of
occurrences we are interested in, and e is the natural logarithm constant (approximately
2.718), also called Euler’s number. The left side of the equation can be read, “What is the
probability that a random variable called X has a value of x, given λ.” Notice the three dots
to the right of the equation. Unlike the binomial distribution, where the number of possible
Table 11.1
Year Attacks
1 1
2 0
3 2
4 1
5 1
6 3
7 2
8 3
9 4
10 4
values is fixed (x = 0, 1, …, n), the possible values that x can assume for the Poisson distribution (x = 0, 1, …) is not finite; rather, it is “countably infinite.”
Answer: The Poisson pmf has just one parameter that controls both its shape and
location. This parameter is called “lambda” and is written λ. It represents the mean number
of occurrences in a fixed space or time period, such as the mean number of births in a year,
the mean number of accidents in a factory in a year, or the mean number of shark attacks in
a year. So, λ must be > 0 or the event will never occur. By the way, λ is also the variance of the
distribution.
Let’s work through an example to see the Poisson pmf in action. We have information in
Table 11.1 that suggests the average rate of shark attacks in Florida is 2.1 attacks per year.
What is the probability that there will be 10 attacks next year?
The probability would be calculated as:
Pr(x; λ) = λ^x · e^(−λ) / x!   (11.2)

Plugging in x = 10 and λ = 2.1:

Pr(x = 10; λ = 2.1) = 2.1^10 · e^(−2.1) / 10!   (11.4)
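As a quick sketch of our own (standard library only; the helper name `poisson_pmf` is ours), Equation 11.4 evaluates to a very small number:

```python
from math import exp, factorial

def poisson_pmf(x, lam):
    """Poisson probability of x events when the mean rate is lam (Eq. 11.1)."""
    return lam**x * exp(-lam) / factorial(x)

print(poisson_pmf(10, 2.1))   # ~0.000056: ten attacks in one year would be very surprising
```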
Figure 11.2: the Poisson distribution with λ = 2.1 (x-axis: Attacks per Year (x), 0–12; y-axis: Probability).
Look for the probabilities associated with 3 and 10 attacks. Also notice that the y-axis is
labeled “Probability” and that the sum of the bars must equal 1.0. Now take a look at the
x-axis; it consists of non-negative integers only. The graph is depicted in bars to indicate
there are only discrete outcomes; that is, that there is no way to observe, say, 2.5 shark
attacks in a year. In a nutshell, the outcome x is a non-negative integer, but the average rate,
λ, can be any positive number such as 2.1. Thus, the Poisson function is a probability mass
function (pmf).
There is only one parameter or “knob” to twiddle for this probability function, and that is λ. Let’s have a look at the Poisson distribution when λ = 1, 3, and 7 (see Figure 11.3).
Figure 11.3: Poisson distributions with λ = 1, 3, and 7 (x-axis: Number of Attacks per Year (x); y-axis: Probability).
These graphs highlight the fact that the Poisson probability distribution is governed by
just one parameter, λ. In looking at these three distributions, you might have noticed a few
things:
• First, the mean of the distribution is λ.
• Second, as λ gets larger, the Poisson distribution starts to look like a normal distribution.
• Third, the spread, or variance, of the distribution is the same as λ. So as λ increases, so does
the spread of the Poisson distribution. See if you can confirm this visually.
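A sketch of our own verifying the first and third bullets numerically, by summing moments over the pmf (the tail beyond x = 60 is negligible for these rates):

```python
from math import exp, factorial

def poisson_pmf(x, lam):
    return lam**x * exp(-lam) / factorial(x)

for lam in (1, 3, 7):
    xs = range(60)   # far enough into the tail to capture essentially all the mass
    mean = sum(x * poisson_pmf(x, lam) for x in xs)
    var = sum((x - mean)**2 * poisson_pmf(x, lam) for x in xs)
    print(lam, round(mean, 6), round(var, 6))   # mean and variance both equal lambda
```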
Answer: The news headline read: “Shark attacks attributed to random Poisson bursts.”
Let’s have a second look at the Poisson distribution for λ = 2.1, shown in Figure 11.4. The
number of shark attacks x = 1, 2, and 3 are all fairly likely outcomes, but 0, 4, and 5 are also
in the running. With an average attack rate of 2.1 per year, it would be rare, but not
impossible, to observe more than 7 attacks.
[Figure 11.4: the Poisson distribution for λ = 2.1; x-axis: Attacks per Year (x), 0–12; y-axis: Probability, 0.0–1.0.]
Let’s try an experiment where we randomly draw 50 observations from the Poisson
distribution in Figure 11.4 (λ = 2.1) and record our results in a table with 10 rows and 5
columns (see Table 11.2).
Table 11.2
2 4 2 2 3
1 3 1 1 0
1 2 1 1 1
3 1 3 1 3
3 4 2 1 3
2 1 3 3 3
5 2 3 1 0
5 2 4 5 1
4 5 0 4 1
7 2 3 1 3
What you are seeing are random draws from a Poisson distribution where λ = 2.1.
Heading down the columns, we drew a 2, then 1, then a 1, then 3, and so on. This is an
example of a random Poisson process like the one described in the shark attack article. We
can say that each observation, x, arises from a Poisson distribution. That is, the random
variable X is the number of shark attacks in a year, and X is distributed as a Poisson
distribution:
X ∼ Poisson(λ).    (11.6)
Now, take a look at the last four entries in column 1 and the first entry in column 2. Do
you see a series of attacks? If you did not know that these values came from a Poisson
distribution with λ = 2.1, you might conclude that shark populations are on the rise and
that the rate of attacks is actually increasing. With random variables, occasionally you can
string together a series of outcomes that strike you as “non-random,” even if they are.
Make sense?
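A table like Table 11.2 is easy to generate yourself. The sketch below is plain Python, using Knuth's classic sampling algorithm rather than any statistics library, to draw 50 values from a Poisson distribution with λ = 2.1:

```python
import math
import random

def draw_poisson(lam, rng):
    """One random draw from a Poisson(lam) distribution, via Knuth's algorithm."""
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1

rng = random.Random(42)  # seeded so the "experiment" is repeatable
draws = [draw_poisson(2.1, rng) for _ in range(50)]
print(draws[:10])               # the first ten draws
print(sum(draws) / len(draws))  # the sample mean should land near 2.1
```

Run it a few times with different seeds and you will occasionally see “bursts” like the run of high values in the table, even though every draw is independent.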
Suppose you are on a Bay(es) Watch patrol team, and 5 attacks occurred
this year. Your supervisor doubts that the current estimate of λ = 2.1 is valid.
She has charged you with getting an up-to-date estimate of the annual rate
of shark attacks. Thus, this is a parameter estimation problem, and you’ve
decided to use a Bayesian approach.
[Figure 11.5: Poisson distributions for λ = 2, λ = 5, and λ = 7; x-axis: Number of Attacks per Year (x), 0–15; y-axis: Probability, 0.00–0.25.]
Because multiple alternative Poisson distributions can yield 5 shark attacks with a
non-zero probability, it makes sense to consider all possible values for λ and then confront
each hypothesis with our data of 5 shark attacks. For each hypothesized λ, we ask: What is
the likelihood of observing 5 shark attacks? That way, you are estimating a probability
distribution for λ and giving credence to an infinite number of hypotheses instead of
estimating just a single point.
This is a natural Bayesian problem. For a Bayesian inference problem, we need a prior
distribution that represents our ‘belief’ in each alternative value for λ, and the likelihood of
the data, to generate the posterior distribution. We’ll get there before you know it.
Answer: Before we leap in, let’s take time to acquaint ourselves with a new probability
density function. Remember that λ can go from 0 to infinity. If our hypotheses for λ are
graphed along the x-axis, we need an x-axis that starts at 0 and increases to infinity. Because
λ can be any real number greater than 0, the x-axis must be continuous. Then on the
y-axis, we compute the probability density of each possible λ. An example of a prior
distribution for the various hypotheses for λ might look something like the one shown
in Figure 11.6.
[Figure 11.6: an example prior distribution for λ; x-axis: Hypotheses for λ, 0–15; y-axis: Probability Density, 0.00–0.25.]
Answer: Yes! In Chapter 10, we were intent on estimating the posterior distribution
for p, the probability that a famous person like Shaq could get into the White House
without an invitation. We used a beta distribution to set our priors for each and every
p (p = probability of success, which ranges between 0 and 1). The data we collected
constituted a binomial problem (number of successes out of number of trials). Then, we
obtained a new posterior distribution for p in light of the new data.
In this problem, the beta distribution won’t work for us. Why? Because λ can be greater
than 1. Fortunately there is another distribution that is suitable, and it is called the gamma
distribution.
Gamma distribution?
Answer: Wolfram and the Oxford Dictionary of Statistics (Upton and Cook, 2014) provide
descriptions of the gamma distribution.
If you checked out these links, you may have noticed that there are different ways of
expressing the gamma distribution, which is explained in a Wikipedia entry (accessed
August 17, 2017): “In probability theory and statistics, the gamma distribution is a two-
parameter family of continuous probability distributions. There are three parameterizations
in common use”: one based on a shape and a scale parameter, one based on a shape and a
rate parameter (the rate being 1 divided by the scale), and one based on a shape and a mean
parameter (the mean being the shape divided by the rate).
Note: If you dig into other references, take care to note that different software packages
may use different names for the shape, rate, and scale parameters. In this book, we will
stick with the names α for shape, β for rate, and θ for scale.
[Figure 11.7: example gamma distributions for (α, β) = (1.0, 0.5), (2.0, 0.5), (4.0, 0.5), (2.1, 1), (9.0, 2), (7.5, 1), and (0.5, 1); x-axis: Hypotheses for λ, 0–20; y-axis: Density, 0.0–0.5.]
Answer: Nope, not a thing. The beta distribution is controlled by two parameters that are
also named α and β, but these could have been named something else. It just so happens
that α and β are common names for parameters in probability functions, much like Spot
and Daisy are common names for dogs!
Answer: Can you see that all three parameterizations are related to one another? In the
first form, we have a shape and scale parameter. In the second version, the rate parameter is
just 1 divided by the scale parameter. In the third version, the mean is the shape parameter
divided by the rate parameter!
Which of the three alternative parameterizations you use depends on the
problem you are tackling. Here, we’ll stick with the second version which has α as the
shape parameter and β as the rate parameter, as that is the version commonly used for
Poisson hypotheses for λ. After all, λ from the Poisson distribution is a rate!
Answer: Well, because it’s a probability density function, you can’t calculate probability
for a specific value. Rather, you are calculating the probability density.
If you are given a particular gamma distribution, say, α = 2 and β = 1, you can compute
the density associated with, say, x = 3 with this function:
g(x; α, β) = β^α x^(α−1) e^(−βx) / Γ(α),  0 ≤ x < ∞.    (11.7)
This is the probability density function for the gamma distribution. Here, the function
name is g, and the inputs are x, α, and β. The random variable named X must assume values
greater than or equal to 0, and α and β must be positive. The output will provide the
probability density of x given α and β.
The funny symbol, Γ, in the denominator is the capital Greek letter for G (called gamma),
and represents the gamma function, which is not to be confused with the gamma
distribution! The gamma function is needed to ensure that the function integrates to 1.0 (that is,
the total area under the curve ¼ 1.0). Let’s try an example.
The probability density of observing x = 3, given α = 2 and β = 1, is:
g(x; α, β) = β^α x^(α−1) e^(−βx) / Γ(α)    (11.8)

g(3; 2, 1) = 1² · 3^(2−1) · e^(−1·3) / Γ(2).    (11.9)
Answer: When α is a positive integer, Γ(α) = (α − 1)!, which in our case ends up being
Γ(2) = (2 − 1)! = 1! = 1.0. So we are left with:
g(3; 2, 1) = 1² · 3^(2−1) · e^(−1·3) / Γ(2)    (11.10)

g(3; 2, 1) = 1² · 3^(2−1) · e^(−1·3) / (2 − 1)!    (11.11)
If α is not a positive integer, you can compute the value of the gamma function with an
online calculator.
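Python's standard library happens to include the gamma function (math.gamma), so the calculation above can be reproduced without an online calculator. A minimal sketch (the function name is our own):

```python
import math

def gamma_pdf(x, alpha, beta):
    """g(x; alpha, beta) -- the gamma pdf of Eq. 11.7, shape/rate form."""
    return beta ** alpha * x ** (alpha - 1) * math.exp(-beta * x) / math.gamma(alpha)

# Density at x = 3 for alpha = 2, beta = 1; the formula reduces to 3 * e^(-3)
print(round(gamma_pdf(3, 2, 1), 4))            # 0.1494
print(round(3 * math.exp(-3), 4))              # 0.1494, the hand calculation
print(math.gamma(2) == math.factorial(2 - 1))  # True: gamma(alpha) = (alpha - 1)!
```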
Here is a graph of the gamma distribution for α = 2 and β = 1. See if you can find the
probability density for x = 3 in Figure 11.8.
It’s worth pointing out again that this is a probability density function. The
y-axis gives density, and the x-axis features a continuous variable.
[Figure 11.8: the gamma distribution with α = 2 and β = 1; x-axis: x, 0–10; y-axis: Density, 0.0–0.8.]
Answer: Our goal here is to use a Bayesian inference approach to estimate the
rate of shark attacks per year, or λ.
We’ll use the same steps that we have in previous chapters:
1. Identify your hypotheses—these would be the alternative hypotheses for λ.
2. Express your belief that each hypothesis is true in terms of prior probabilities.
3. Gather the data.
4. Determine the likelihood of the observed data under each hypothesis.
5. Use Bayes’ Theorem to compute the posterior probabilities for each hypothesis.
[Figure 11.9: the model diagram. The observed data xi arise from a Poisson(λ) distribution, and the unknown parameter λ is in turn assigned a gamma prior: λ ∼ gamma(α0, β0), xi ∼ Poisson(λ).]
Figure 11.9
At the very bottom of this diagram, we have our observed data, xi. The data are random
observations that arise from a Poisson distribution, defined by the parameter λ. The
parameter, λ, in turn, is the unknown parameter that we are trying to estimate. We use a
gamma distribution to “weight” each and every hypothesis for λ, treating the unknown
parameter as a random variable from a gamma distribution.
Now that we have specified the problem, let’s step through these one at a time.
Answer: We need to assign a prior for each hypothesized value of λ, and we already
introduced the gamma distribution as a suitable prior probability distribution. We’ve
also decided that the gamma distribution that uses the shape and rate parameterization
suits our needs.
But the mean of that dataset is 2.1. How do I convert this to the α and β
parameters for a gamma distribution?
Answer: Here’s where the Wolfram MathWorld, the Oxford Dictionary of Statistics, or
Wikipedia pages on probability distributions can be very helpful. For instance, Figure 11.10
shows a typical Wikipedia panel featuring the gamma distribution (with our
parameterization highlighted in red).
These panels provide summary information such as the mean, median, and variance, for
a given gamma distribution when the shape α and rate β parameterization is used. Scan
down and look at the mean:
μ = E[X] = α/β.    (11.13)

Scan a bit further and you will also find the mode (the peak of the distribution):

mode = (α − 1)/β.    (11.14)
Support: x ∈ (0, ∞)
Probability density function (pdf): β^α x^(α−1) e^(−βx) / Γ(α)
Mean: E[X] = α/β; E[ln X] = ψ(α) − ln(β) (see digamma function)
Mode: (α − 1)/β, for α ≥ 1
Variance: Var[X] = α/β²; Var[ln X] = ψ₁(α) (see trigamma function)
Figure 11.10
Just as in the beta distribution in the last chapter, the magnitudes of α and β are not nearly as
important as the relation between them. For instance, if the mean is 2.1, you might set α
to 2.1 and β to 1.0. Or you might set α to 4.2 and β to 2.0.
If the mode (the peak of the curve) is at 2.1, you might set α to 3.1 and β to 1.0. Or you
might set α to 31 and β to 14.3 (see Figure 11.11).
[Figure 11.11: gamma distributions parameterized so that the mean is 2.1 (blue: α = 2.1, β = 1.0; α = 4.2, β = 2.0; α = 6.3, β = 3.0) or the mode is 2.1 (black: α = 3.1, β = 1.0; α = 31, β = 14.3); x-axis: x, 0–15; y-axis: Density, 0.0–1.2.]
• α and β for all of the blue distributions were calculated assuming the mean of the
distribution = 2.1. Notice that the mean is not the top of the curve, though, because there
is “weight” in the right tail of the distribution.
• When α and β are parameterized in terms of the mode, the peak of the curve is the mode
(black distributions).
With these graphs, it’s clear that even though you have prior information that the mean
number of shark attacks is 2.1 per year, you still need to exercise some subjectivity with
respect to which prior distribution to use. You could also consider the variance of the data
in Table 11.1 in making your selection (we’ll cover how to do that in Chapter 13).
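Converting a prior mean or prior mode into (α, β) is just algebra on the formulas above (mean = α/β, mode = (α − 1)/β). A small sketch in Python, with function names of our own choosing:

```python
def gamma_from_mean(mean, beta):
    """Solve mean = alpha/beta for alpha, given a chosen rate beta."""
    return mean * beta, beta

def gamma_from_mode(mode, beta):
    """Solve mode = (alpha - 1)/beta for alpha, given a chosen rate beta."""
    return mode * beta + 1, beta

print(gamma_from_mean(2.1, 1.0))   # (2.1, 1.0)
print(gamma_from_mean(2.1, 2.0))   # (4.2, 2.0)
print(gamma_from_mode(2.1, 1.0))   # (3.1, 1.0)
print(gamma_from_mode(2.1, 14.3))  # alpha comes out very close to the 31 used above
```

Notice that you must still choose β yourself; the prior information pins down only the ratio.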
Answer: Let’s go with the prior distribution of α0 = 2.1 and β0 = 1 (solid blue line in
Figure 11.12), which has a mean rate of 2.1. Notice that we use a “naught” subscript to
designate these as the hyperparameters of the prior distribution.
Hyperparameters?
[Figure 11.12: the prior distribution for λ, a gamma distribution with α0 = 2.1 and β0 = 1; x-axis: Hypotheses for λ, 0–15; y-axis: Density, 0.0–0.4.]
Answer: Step 4 is to determine the likelihood of the observed data, assuming each
hypothesis is true.
Now, for each hypothesized value of λ, let’s compute the Poisson likelihood
of observing five attacks in one year.
We touched on the Poisson pmf at the beginning of the chapter, where we showed that
the Poisson pmf can be used to calculate the probability of observing 0, 1, 2, 3, . . . attacks,
given lambda. This time, we have observed the value x, and now we are using the function to
calculate the likelihood of observing the data under each hypothesized λ.
To give you an example, under the λ ¼ 2.1 hypothesis, the likelihood of observing five
shark attacks is computed as:
ℒ(X = x; λ) = f(x; λ) = λ^x e^(−λ) / x!    (11.16)

ℒ(X = 5; λ = 2.1) = 2.1^5 e^(−2.1) / 5! = 0.0417    (11.17)
[Figure 11.13: the Poisson likelihood of observing 5 attacks under each hypothesized λ; x-axis: Hypotheses for λ, 0–15; y-axis: Likelihood of observing 5 attacks, 0.00–0.25.]
Figure 11.13
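The curve in Figure 11.13 can be traced with the Poisson pmf itself, now viewed as a function of λ with the datum held fixed at x = 5. A short sketch in Python:

```python
import math

def poisson_likelihood(x, lam):
    """Likelihood of the observed count x under a hypothesized rate lam (Eq. 11.16)."""
    return lam ** x * math.exp(-lam) / math.factorial(x)

# The same datum (5 attacks) evaluated under several hypotheses for lambda
for lam in (1.0, 2.1, 5.0, 7.0):
    print(lam, round(poisson_likelihood(5, lam), 4))
# The likelihood peaks at lambda = 5.0, the observed count itself
```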
And Step 5?
Answer: The final step is to use Bayes’ Theorem to compute the posterior probabilities for
each value of λ. As in Chapter 10, we need to set this Bayesian inference problem up for a
continuous random variable. Suppose we are trying to estimate a single parameter
called θ. If θ is continuous, you have an infinite number of hypotheses for it. Bayes’
Theorem in this case is specified as:
P(θ | data) = P(data | θ) · P(θ) / ∫ P(data | θ) · P(θ) dθ.    (11.18)
This is the generic version of Bayes’ Theorem when the posterior distribution for a single
parameter, given the observed data, is represented by a pdf. This is designated Pðθ j dataÞ,
where P is probability density. This is the left side of the equation. On the right side of
the equation, the numerator multiplies the prior probability density of θ, which is written
PðθÞ, by the likelihood of observing the data under a given hypothesis for θ, which is written
Pðdata j θÞ. Technically, the likelihood can be a pdf or a pmf. In this chapter’s illustration,
for instance, the likelihood is a pmf because it is a Poisson distribution. In the denominator,
we see the same terms, but this time we also see a few more symbols. The symbol ∫ means
“integrate,” which roughly means “sum up all the pieces” for each tiny change in θ, which
is written dθ. In other words, the denominator accounts for the prior density * likelihood for
all possible hypotheses for θ, and sums them.
For the Shark Attack Problem, we replace θ with λ, so Bayes’ Theorem looks like this, with
the posterior shown in purple, the prior in blue, and the likelihood in red:
P(λ | data) = P(data | λ) · P(λ) / ∫₀^∞ P(data | λ) · P(λ) dλ.    (11.19)
Remember that this is a density, not a probability, because the probability of a given value
of λ will always be 0 for a continuous pdf (refer to Chapter 9 for a refresher). Thus, in this
problem, we are interested in all possible values of λ, or the entire posterior
distribution.
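Eq. (11.19) can be made concrete by brute force: evaluate prior × likelihood on a fine grid of hypotheses for λ, and approximate the integral in the denominator with a sum. A sketch in Python, using the prior α0 = 2.1, β0 = 1 and the datum of 5 attacks (the grid range and resolution are arbitrary choices of ours):

```python
import math

def gamma_pdf(lam, a, b):
    """Gamma prior density for a hypothesized rate lam (shape a, rate b)."""
    return b ** a * lam ** (a - 1) * math.exp(-b * lam) / math.gamma(a)

def poisson_pmf(x, lam):
    """Poisson likelihood of the datum x under rate lam."""
    return lam ** x * math.exp(-lam) / math.factorial(x)

h = 0.001                                # grid spacing
grid = [i * h for i in range(1, 30001)]  # hypotheses for lambda: 0.001 .. 30
numer = [poisson_pmf(5, lam) * gamma_pdf(lam, 2.1, 1) for lam in grid]
denom = sum(numer) * h                   # crude numerical integration

# Posterior density at lambda = 3, by brute force...
posterior_at_3 = poisson_pmf(5, 3) * gamma_pdf(3, 2.1, 1) / denom
print(round(posterior_at_3, 3))
# ...which agrees with a gamma(7.1, 2) density evaluated at the same point
print(round(gamma_pdf(3, 7.1, 2), 3))
```

The agreement is no accident, as the rest of the chapter explains.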
But here’s the kicker: The integration of the denominator can be very
difficult to calculate!
Answer: Well, for this particular problem, there is an analytical shortcut that makes
updating the prior to the posterior distribution a snap. Integration not required! Remember
that our prior distribution set α0 = 2.1 and β0 = 1.
Furthermore, we have observed 5 shark attacks in one year. So here, x = 5. Let’s let n = the
number of years of observation; here n = 1. Here is the shortcut:
The updated α parameter is α0 plus the sum of the Poisson observations. We have just
one observation, which was of 5 attacks:
α_posterior = α0 + Σ_{i=1}^{n} x_i.    (11.20)

So, the summation term results in the number 5, because x1 = 5. Then we have:

α_posterior = 2.1 + 5 = 7.1.    (11.21)

The updated β parameter is β0 plus n, the number of years of observation:

β_posterior = β0 + n    (11.22)

β_posterior = 1 + 1 = 2.    (11.23)
These parameters are referred to as the posterior hyperparameters. Now we can look at the
prior and posterior distributions for λ (see Figure 11.14).
We started off with a prior distribution shown in blue. Then we collected data in the form
of a single Poisson random variable: five attacks. We then used the shortcut to generate the
posterior distribution for all hypotheses of λ.
As a result of our data, we now have new beliefs in each and every hypothesis of λ. The
posterior distribution is shown in purple. Notice how the posterior shifted toward our
observed data for this problem.
[Figure 11.14: prior (blue: α0 = 2.1, β0 = 1) and posterior (purple: α = 7.1, β = 2) distributions for λ; x-axis: Hypotheses for λ, 0–20; y-axis: Density, 0.0–0.4.]
Answer: The answer is that the gamma distribution is a conjugate distribution that can be
updated with Poisson data (Raiffa and Schlaifer, 1961). This is why we named this chapter
“The gamma-Poisson conjugate.”
We mentioned the statistical definition of a conjugate distribution in Chapter 10, where
we indicated that there are cases where you can use a particular pdf as a prior distribution,
collect data of a specific flavor, and then derive the posterior pdf with a closed-form
solution. In these special cases, the pdfs of the prior and posterior distributions are the
same probability density function, but their parameters may differ. The prior distribution is
called a conjugate prior, and the effect of the data can then be interpreted in terms of
changes in parameter values (Upton and Cook, 2014).
In the Shark Attack Problem, we used a gamma distribution to set the priors for all
hypotheses of λ, the rate of shark attacks per year. We used the information provided in
Table 11.1 to help set the prior distribution. We then collected Poisson data (5 shark
attacks), and then used a shortcut to generate the posterior distribution for λ.
Raiffa and Schlaifer’s (1961) conjugate solution is conveniently given in a Wikipedia
page on Bayesian conjugates. A quick overview of that page’s gamma-Poisson conjugate
can be visualized with the graph in Figure 11.15.
Let’s work our way through this figure. The top item indicates that the model parameter
of interest, λ, is a rate. The prior distribution is a gamma distribution, defined by
hyperparameters α0 and β0. The “naughts” signal that these are hyperparameters for the prior. The
data collected are Poisson in flavor, so the likelihood to be used is the Poisson pmf. With
this setup, the posterior hyperparameters (the conjugate shortcuts) are provided.
[Figure 11.15: the gamma-Poisson conjugate at a glance. Unknown parameter: λ. Prior distribution: gamma, with prior hyperparameters α0 and β0. Posterior hyperparameters: α = α0 + Σ_{i=1}^{n} x_i and β = β0 + n.]
Figure 11.15
Answer: You could have used a different prior distribution, and your inferences provided
by the posterior distribution would probably be different. You must defend your choice for
the prior! Here, we had previous knowledge of the annual rate of shark attacks and thus used
an informative prior. To ignore this previous information is contrary to the scientific method.
Why do I need to use a Bayesian approach? Couldn’t I just use the data
from Table 11.1 and add in the extra observation of 5 shark attacks?
Answer: You certainly could do that. But then you wouldn’t be capitalizing on the notion of
learning, and on the notion of multiple, competing hypotheses for λ. Table 11.3 shows the
dataset again, this time with one extra observation (our new observation of 5 shark attacks).
Table 11.3
Year Attacks
1 1
2 0
3 2
4 1
5 1
6 3
7 2
8 3
9 4
10 4
11 5
The mean of this dataset is 2.36. Yes, this does give you an estimate of λ, and you can ask
your friendly neighborhood statistician to help you draw conclusions about the mean
number of shark attacks in Florida based on your sample. However, this result fails to
acknowledge support for alternative values of λ. Bayesians are interested in the probability
distribution for λ, where the density associated with each alternative hypothesis for λ gives a
measure of “plausibility.” This approach is squarely in line with the scientific method. As
we collect new information, we can use Bayesian approaches to refine how plausible each
and every hypothesis for λ is.
Answer: Try it! The prior distribution will now be the posterior distribution after observing
5 shark attacks: α = 7.1 and β = 2.
Let’s suppose you collect 5 more years of shark attack data. Let’s define the
variable X1 as the number of shark attacks in year 1, . . . , and X5 as the number of shark
attacks in year 5. You observe the results shown in Table 11.4.
Table 11.4
Year Attacks
12 1
13 2
14 0
15 3
16 4
Note that we have 5 new random observations from the Poisson distribution, so n = 5.
Also notice that the sum of the attacks, Σ_{i=1}^{n} x_i, is 10.
Remember that the prior distribution is now defined by α0 = 7.1 and β0 = 2. Now let’s
incorporate the n = 5 years of new data (see Figure 11.16).
[Figure 11.16: the prior (α0 = 2.1, β0 = 1), the first posterior (α = 7.1, β = 2), and the second posterior (α = 17.1, β = 7); x-axis: Hypotheses for λ, 0–20; y-axis: Density, 0.0–0.8.]
The updated α parameter is the prior α0 plus the sum of the Poisson observations:

α_posterior = α0 + Σ_{i=1}^{n} x_i    (11.24)

α_posterior = 7.1 + 10 = 17.1.    (11.25)
The updated β parameter is the prior β0 plus n, the number of Poisson observations:
β_posterior = β0 + n    (11.26)

β_posterior = 2 + 5 = 7.    (11.27)
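Sequential updating is the same arithmetic applied again, with last round's posterior serving as this round's prior. A Python sketch using the Table 11.4 data:

```python
def gamma_poisson_update(alpha0, beta0, observations):
    """Conjugate shortcut: gamma(alpha0, beta0) prior + Poisson data -> gamma posterior."""
    return alpha0 + sum(observations), beta0 + len(observations)

new_data = [1, 2, 0, 3, 4]  # attacks in years 12-16 (Table 11.4)
alpha, beta = gamma_poisson_update(7.1, 2, new_data)  # prior = previous posterior
print(round(alpha, 4), beta)   # 17.1 7
print(round(alpha / beta, 2))  # 2.44, the new posterior mean for lambda
```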
What parameters do you use for α0 and β0 if you want a vague prior?
Answer: If you didn’t have the information in Table 11.1 and wanted to use a vague prior,
you could set α0 and β0 to something really small, like 0.01. Let’s try our very first analysis
again, using a prior with α0 = β0 = 0.01, and then update it after observing 5 shark attacks.
The updated alpha parameter is the prior alpha plus the sum of the Poisson observations:

α_posterior = α0 + Σ_{i=1}^{n} x_i    (11.28)

α_posterior = 0.01 + 5 = 5.01.    (11.29)
The updated beta parameter is the prior beta plus n, the number of Poisson observations:

β_posterior = β0 + n    (11.30)

β_posterior = 0.01 + 1 = 1.01.    (11.31)
If you were estimating the mean number of attacks based on the data alone, you would
estimate that λ is 5.0. Here, the mean of the posterior distribution is very close to that, and
can be calculated as:
α/β = 5.01/1.01 = 4.96.    (11.32)
It’s evident that the prior is not contributing much to the posterior distribution (see
Figure 11.17).
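The vague-prior update uses the same conjugate shortcut. A quick Python sketch:

```python
def gamma_poisson_update(alpha0, beta0, observations):
    """Conjugate shortcut: gamma(alpha0, beta0) prior + Poisson data -> gamma posterior."""
    return alpha0 + sum(observations), beta0 + len(observations)

alpha, beta = gamma_poisson_update(0.01, 0.01, [5])  # vague prior, one datum
print(round(alpha, 2), round(beta, 2))  # 5.01 1.01
print(round(alpha / beta, 2))           # 4.96 -- nearly the data-only estimate of 5.0
```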
[Figure 11.17: the vague prior (α0 = 0.01, β0 = 0.01) and the posterior (α = 5.01, β = 1.01); x-axis: Hypotheses for λ, 0–20; y-axis: Density, 0.0–0.8.]
The prior certainly does not look “flat.” But notice in this case, since α0 and β0 were so
small for the prior distribution, they had almost no bearing on the shape of the posterior
distribution. But, why would you use this prior distribution if the data in Table 11.1 were at
your disposal?
Answer: If you can show the prior and posterior distributions, that would be fabulous! But
you can also summarize properties of the posterior distribution such as the mean and
variance by using some of the calculations presented on Wikipedia’s gamma distribution
page. And don’t forget the credible intervals!
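A credible interval can be pulled from the posterior gamma(7.1, 2) numerically, without any statistics library: integrate the pdf to get the CDF, then invert the CDF by bisection. A rough sketch (grid sizes and the search bracket are arbitrary choices of ours):

```python
import math

def gamma_pdf(x, a, b):
    """Gamma density with shape a and rate b."""
    return b ** a * x ** (a - 1) * math.exp(-b * x) / math.gamma(a)

def gamma_cdf(x, a, b, steps=2000):
    """Pr(X <= x), by the trapezoid rule (fine for a quick summary)."""
    if x <= 0:
        return 0.0
    h = x / steps
    total = 0.5 * gamma_pdf(x, a, b)  # pdf(0) = 0 when a > 1
    for i in range(1, steps):
        total += gamma_pdf(i * h, a, b)
    return total * h

def gamma_quantile(q, a, b, lo=0.0, hi=30.0):
    """Invert the CDF by bisection."""
    for _ in range(50):
        mid = (lo + hi) / 2
        if gamma_cdf(mid, a, b) < q:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

low, high = gamma_quantile(0.025, 7.1, 2), gamma_quantile(0.975, 7.1, 2)
print(round(low, 2), round(high, 2))  # a 95% credible interval for lambda
```

In practice a statistics library's gamma quantile function would replace all of this, but the sketch shows there is nothing mysterious inside one.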
Answer: You may! The log-normal distribution can be used as the prior distribution for
setting alternative hypotheses for λ. Figure 11.18 shows a few examples of the log-normal
distribution, which has two parameters.
[Figure 11.18: log-normal distributions with ln(μ) = 0 and ln(σ) = 1, 0.25, and 0.5; x-axis: x, 0.0–3.0; y-axis: Probability Density, 0.0–3.0.]
Figure 11.18
Here you can see that the distribution ranges from 0 to ∞ along the x-axis. This distribu-
tion is very useful for many analyses. However, since it is not a conjugate distribution, there
is no easy way—no shortcut—to get the posterior. You could estimate the posterior
distribution with an MCMC analysis, but that is a matter for another day.
We will revisit this chapter again in the MCMC section of this book, where we will show
you a different method for estimating the posterior distribution. For now, though, we have
one more conjugate chapter in this section, and it is a “sticky” one. We hope to see you
there!
CHAPTER 12
In this chapter, we return to the normal distribution and set up a Bayesian inference
problem that deals with estimating parameters from the normal distribution. You might
recall that the normal distribution has two parameters that control the location and spread
of the distribution: the mean and the standard deviation. These are called mu (μ) and sigma
(σ), respectively. We’ll now be dealing with the joint distribution of the two parameters in
the normal distribution, where you need to assign a prior associated with a combination
of μ and σ.
Here’s a roadmap for this chapter. Pay attention! We’ll begin with a motivating example
that highlights the need to estimate the two parameters of the normal distribution. We’ll
first tackle the problem by considering just 10 discrete hypotheses for μ and σ so that you
can see Bayes’ Theorem at work with a joint prior. Then we’ll move to the case where the
number of joint hypotheses is infinite and in which the prior distribution is a continuous,
joint distribution. This complicates the analysis considerably. But, as with the previous two
chapters, this chapter introduces a conjugate solution, which can be applied only if you
assume that one of the parameters of the normal distribution is known. Specifically, we
assume that σ is known. Hence the focus is on estimating the unknown parameter of the
normal distribution, μ. We will revisit this chapter again in Section 5, where we show you
how to solve the problem when both μ and σ are unknown, using an MCMC analysis with
Gibbs sampling.
By the end of this chapter, you should be able to define and use the following:
• The normal distribution pdf in terms of the mean and standard deviation (μ and σ) and in
terms of the mean and precision (μ and τ)
• The normal-normal conjugate
We’ll start with a story written by Brendan Borrell in 2013 for Bloomberg Business:
On the morning of July 30, 2012, an accountant named Michel Gauvreau arrived at the Global Strategic
Maple Syrup Reserve, housed in a huge red brick warehouse on the side of the Trans-Canadian Highway…
about two hours northeast of Montreal. Inside, baby-blue barrels of maple syrup were stacked six high in
rows hundreds deep. Full, each barrel weighs about 620 pounds. With grade A syrup trading at about $32
per gallon, that adds up to $1,800 a barrel, approximately 13 times the price of crude oil.
The fiscal year was coming to a close, and the Federation of Quebec Maple Syrup Producers had hired
Gauvreau’s company to audit its inventory…There were around 16,000 barrels here, about one-tenth of
Quebec’s annual production. The gap between the rows was barely wide enough to walk through, and the
rubber soles of Gauvreau’s steel-tip boots stuck to the sugar-coated concrete floor.
He scaled a row of barrels and was nearing the top of the stack when one of them rocked with his
weight. He nearly fell. Regaining his balance, he rattled the barrel: It was light because it was empty.
He soon found others that were empty. After notifying the Federation’s leaders and returning with them
to examine the stockpile, they unscrewed the cap on a full barrel. The liquid inside was not goopy,
brown, or redolent with the wintry scent of vanilla, caramel, and childhood; it was thin, clear, and
odorless. It was water… Sixty percent, or six million pounds of syrup, had vanished, worth about $18
million wholesale.
That’s some story! It has even caught Hollywood’s attention! This begs a few questions,
though.
Answer: Yum!
Here’s what Encyclopedia Britannica has to say: “Maple syrup, sweet-water sap of
certain North American maple trees, chiefly the sugar maple, Acer saccharum, but also
the black maple, Acer nigrum. It was utilized by the Indians of the Great Lakes and
St. Lawrence River regions prior to the arrival of European settlers and is still produced
solely in North America.
The sweet-water sap from which maple syrup is made is different from the circulatory sap
of the growing tree. When the tree is dormant, the sap will flow from any wound in the
sapwood, such as a taphole, each time a period of freezing is followed by a period of
thawing. The sap contains 1 1/2 to 3 percent solids, mostly sucrose, but does not contain
the colour or flavour of maple syrup, which are imparted to the sap as it is concentrated by
evaporation in open pans. About 30 to 50 gallons (115 to 190 litres) of sap yield one gallon
of syrup.”
Answer: The answer is explained in an article by Josh Sanburn from Time magazine:
It may seem bizarre that Canada has a maple-syrup cartel at all. But think of it this way: Quebec, which
produces about 77% of the world’s maple syrup, is the Saudi Arabia of the sweet, sticky stuff, and the FPAQ
[Fédération des producteurs acéricoles du Québec] is its OPEC. The stated goal of the cartel, in this case, is
keeping prices relatively stable.
The problem with maple syrup is that the natural supply of it varies dramatically from year to year. “It’s
highly dependent on the weather,” explains Pascal Theriault, an agricultural economist at McGill University
in Montreal. “The maple trees need optimal climate conditions—cold nights, temperate days—to produce
the right sap,” explained a recent article in the Washington Post. “That doesn’t always happen, and
production varies sharply each spring.”
Answer: Since the cartel has lost its syrup, for this chapter we are going to assume
that they want to incorporate the great state of Vermont (USA) into their
game. After all, Vermont is Quebec’s neighbor with healthy sugar maple forests (see
Figure 12.1). And by the way, the name Vermont stems from the Latin Viridis Montis,
which means Green Mountain.
[Figure 12.1: a map of Canadian provinces and U.S. states; Quebec (QC) sits just north of its neighbor Vermont (VT).]
Figure 12.1
Answer: We can use the normal probability density function. We studied the normal pdf
in detail in Chapter 9, so we’ll just briefly review it now, starting with an example
(see Figure 12.2).
The formula that generates the normal distribution for the random variable X (millions
of gallons of maple syrup produced) is called the normal probability density
function. Here it is, and remember that the mean is represented by the symbol μ, and the
standard deviation is represented by the symbol σ:
f(x; μ, σ) = (1 / (√(2π) σ)) e^(−(x − μ)² / (2σ²)),  −∞ < x < ∞    (12.1)
As an example, suppose the average number of gallons (in millions) of maple syrup
produced per year is μ = 10.0, with a standard deviation σ of 2.3. The probability density
of observing 6.2 million gallons, say, would be:
f(x; μ, σ) = (1 / (√(2π) σ)) e^(−(x − μ)² / (2σ²))    (12.2)
f(6.2; 10, 2.3) = (1 / (√(2π) · 2.3)) e^(−(6.2 − 10)² / (2 · 2.3²))    (12.3)

f(6.2; 10, 2.3) = (1 / 5.765) e^(−14.44 / 10.58) = 0.044.    (12.4)
[Figure 12.2: a normal distribution with μ = 10 and σ = 2.3; x-axis: Millions of Gallons Produced (x), 5–20; y-axis: Density, 0.00–0.25.]
f(x; μ, σ) = (1 / (√(2π) σ)) e^(−(x − μ)² / (2σ²))    (12.5)

f(4.8; 10, 2.3) = (1 / (√(2π) · 2.3)) e^(−(4.8 − 10)² / (2 · 2.3²)) = 0.0135    (12.6)
Look for the probability density associated with 4.8 million gallons in the distribution
shown in Figure 12.3.
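Both density calculations are one-liners once the normal pdf is written down. A sketch in Python (the function name is our own):

```python
import math

def normal_pdf(x, mu, sigma):
    """f(x; mu, sigma) -- the normal density of Eq. 12.1."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

# Densities for 6.2 and 4.8 million gallons when mu = 10 and sigma = 2.3
print(round(normal_pdf(6.2, 10, 2.3), 3))  # 0.044
print(round(normal_pdf(4.8, 10, 2.3), 4))  # 0.0135
```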
[Figure 12.3: the same normal distribution (μ = 10, σ = 2.3); x-axis: Millions of Gallons Produced (x), 5–20; y-axis: Density, 0.00–0.25.]
Our goal here is to use a Bayesian inference approach to estimate the two
parameters of the normal distribution: the mean, μ, and the standard deviation, σ.
This differs from Chapters 10 and 11, where our goal was to estimate only
one parameter of a pdf (p for the White House Problem, and λ for the Shark
Attack Problem). Here, for illustrative purposes, we will consider just 10 discrete hypotheses
that represent alternative combinations of μ and σ.
We’ll use the same steps that we did in previous chapters:
1. Identify your hypotheses—these would be the alternative hypotheses for μ, and alter-
native hypotheses for σ.
2. Express your belief that each hypothesis is true in terms of prior probabilities.
3. Gather the data.
4. Determine the likelihood of the observed data under each hypothesis.
5. Use Bayes’ Theorem to compute the posterior probabilities for each hypothesis.
Well, let’s see. Let’s assume that the following statements are reasonable:
[Figure 12.4: a scatterplot of 10 dots, one per cartel member. x-axis: Hypotheses for μ; y-axis: Hypotheses for σ (1.8 to 2.2).]
Each dot represents the cartel’s belief in a particular combination for the mean and the
standard deviation. For instance, the arrow in the chart indicates that one cartel member
believes that the mean is ~9.45 and the standard deviation is ~1.98.
These 10 dots represent 10 discrete hypotheses for μ and σ.
Answer: A key consideration here is that we are assigning prior probabilities for each
hypothesized combination of μ and σ. They are jointly estimated. If every single dot
represents a single hypothesis, and if each dot has the same weight, then the prior
probability associated with each dot (hypothesis) is 1/10, or 0.1. This is a non-informative
prior, expressed as a probability mass function.
Answer: Collect data! Suppose the cartel visits Vermont and learns that 10.2
million gallons were produced last year.
Answer: Step 4 is to determine how likely the observed data are, assuming each joint
hypothesis is true.
Now, for each hypothesized combination of μ and σ, let’s compute the like-
lihood of observing 10.2 million gallons.
At the beginning of this chapter, we showed that the normal pdf can be used to calculate
the probability density of observing a value from a normal distribution whose parameters
are given. This time, we have observed the value x, and now we are using the function to
calculate the likelihood of observing the data under a combination of entries for μ and σ.
Just to give you an example, suppose one of the 10 hypotheses was μ = 9.45 and σ = 1.98
(the hypothesis highlighted by the arrow in Figure 12.4). Then we could ask: “What is
the likelihood of observing 10.2 from a normal distribution whose mean is μ = 9.45 and
σ = 1.98?” The answer is:
ℒ(x; μ, σ) = (1/(σ√(2π))) e^(−(x−μ)²/(2σ²))   (12.7)

ℒ(x = 10.2; μ = 9.45, σ = 1.98) = (1/(1.98√(2π))) e^(−(10.2−9.45)²/(2 × 1.98²)) = 0.1875   (12.8)
We can do similar calculations for other combinations as well. Table 12.1 shows 10 specific
hypotheses; our highlighted example is hypothesis 6.
Table 12.1
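The likelihood step can be sketched for all 10 joint hypotheses at once. Since Table 12.1 is not reproduced here, the (μ, σ) pairs below are hypothetical, except for hypothesis 6 (9.45, 1.98), which the text highlights:

```python
import math

def normal_pdf(x, mu, sigma):
    """Normal probability density, used here as a likelihood function."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

# Hypothetical (mu, sigma) pairs standing in for Table 12.1's 10 hypotheses;
# index 5 is the highlighted hypothesis 6: mu = 9.45, sigma = 1.98.
hypotheses = [(8.9, 2.1), (9.1, 1.9), (9.3, 2.2), (9.4, 1.8), (9.4, 2.0),
              (9.45, 1.98), (9.6, 2.1), (9.7, 1.9), (9.8, 2.0), (10.0, 2.2)]
prior = [0.1] * 10            # non-informative prior: 1/10 per hypothesis
x = 10.2                      # observed data: 10.2 million gallons

likelihood = [normal_pdf(x, m, s) for m, s in hypotheses]
evidence = sum(p * l for p, l in zip(prior, likelihood))          # denominator
posterior = [p * l / evidence for p, l in zip(prior, likelihood)]  # Bayes' Theorem

print(round(likelihood[5], 3))   # likelihood for hypothesis 6, cf. Eq. 12.8
print(round(sum(posterior), 6))  # posterior probabilities sum to 1
```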
And Step 5?
Answer: The final step is to use Bayes’ Theorem to compute the posterior probabilities for
each combination of μ and σ. We won’t present these calculations here.
Answer: The main challenge is that we have an infinite number of hypotheses that
represent each joint combination for the unknown parameters of the normal distribution.
Accordingly, our prior and posterior distributions should reflect the density associated
with each μ and σ combination.
Answer: Yes. In other words, a handful of dots is not sufficient. If we gave the cartel a zillion
grains of sand, where each grain represents a joint hypothesis for μ and σ, their prior
distribution for the parameters may look something like that shown in Figure 12.5.
[Figure 12.5: a three-dimensional prior distribution; axes: Hypotheses for μ, Hypotheses for σ, and Density (height).]
Here, the prior distribution has volume. For any given parameter combination, the
corresponding height gives us the density, stressing the fact that we are dealing with a
bivariate distribution in which both parameters are continuous. Our 10 original hypotheses
are buried somewhere in this sandpile. This distribution really could have any shape—
lumps indicate those hypotheses that have more weight than others, which implies that
Figure 12.5 is an informative prior distribution.
It’s plain to see that the shape of our prior distribution for μ depends on what level of σ
you consider. For instance, Figure 12.5 illustrates that if we fix σ at the “red-level,” we see
plausible hypotheses for μ highlighted in red. And if we fix σ at the “green-level,” we see
plausible hypotheses for μ highlighted in green. These ribbons are called “slices” because
they slice through the mound at a particular location. Given a particular slice for σ, you
have a unique prior distribution for μ. Keep this point in the back of your mind—we will
return to it shortly.
Answer: Well, now that the cartel has a prior distribution, they would collect new data and
use Bayes’ Theorem to generate the posterior distribution. In other words, the pile of sand
will change shape after new data are collected.
You might recall from previous chapters the generic version of Bayes’ Theorem for one
continuous variable:
P(θ | data) = P(data | θ) × P(θ) / ∫ P(data | θ) × P(θ) dθ   (12.9)
This is the generic version of Bayes’ Theorem when the posterior distribution for a single
parameter, given the observed data, is represented by a pdf. This posterior density is
designated P(θ | data), the prior density is designated P(θ), and the likelihood of observing
the data under a given hypothesis for θ is designated P(data | θ). The integral in the
denominator multiplies the prior density by the likelihood for every possible hypothesis
for θ and integrates (sums) over them.
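For a single parameter with σ known, Equation 12.9 can be approximated on a grid, replacing the integral in the denominator with a Riemann sum. This is an illustrative sketch; the prior Normal(12, 4) matches the hyperparameters used later in the chapter:

```python
import math

def normal_pdf(x, mu, sigma):
    """Normal probability density function."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

sigma_known = 2.0
x = 10.2                                      # one observed datum
d_theta = 0.01
grid = [i * d_theta for i in range(0, 3000)]  # hypotheses for mu in [0, 30)

prior = [normal_pdf(t, 12.0, 4.0) for t in grid]      # prior density for each mu
like = [normal_pdf(x, t, sigma_known) for t in grid]  # likelihood under each mu

# Riemann-sum approximation of the integral in the denominator of Eq. 12.9
denom = sum(p * l for p, l in zip(prior, like)) * d_theta
posterior = [p * l / denom for p, l in zip(prior, like)]

# The posterior density integrates to 1, and its mean matches the
# conjugate shortcut result (10.56) derived later in the chapter.
post_mean = sum(t * p for t, p in zip(grid, posterior)) * d_theta
print(round(sum(posterior) * d_theta, 4))   # 1.0
print(round(post_mean, 2))                  # 10.56
```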
Now, let’s make this task even harder. For the maple syrup problem, there are two
parameters to be estimated, μ and σ. With two parameters, Bayes’ Theorem looks like this,
where each hypothesis for μ and σ is considered (the posterior is shown in purple, the prior
in blue, and the likelihood in red):

P(μ, σ | data) = P(data | μ, σ) × P(μ, σ) / ∬ P(data | μ, σ) × P(μ, σ) dμ dσ   (12.10)
For any given prior hypothesis μ and σ, Bayes’ Theorem will return the posterior density of
the joint μ and σ by computing the likelihood of observing the data under hypothesis μ and
σ, multiplied by the prior probability of μ and σ. We need to do this for an infinite number
of hypotheses for each parameter.
Answer: As you might have guessed by the title of this chapter, there is a conjugate solution.
The main approach is to simplify the problem by focusing on only one of the parameters
and assuming the second parameter is known.
For a normal-normal conjugate, we use a normal distribution as our prior distribution for
the mean, μ, and then collect data and update to the posterior distribution for the mean, μ,
which will also be normally distributed. However, in this process, we must assume that σ is
known; that is, we must choose a level for σ (such as the green value or red value for sigma as
shown in Figure 12.5).
Let’s assume we know that σ = 2.0. Then we have alternative hypotheses for μ, which
controls where the distribution is centered along the x-axis. For example, five alternative
hypotheses for μ are shown in Figure 12.6.
[Figure 12.6: five normal distributions with σ = 2 and μ = 8, 10, 12, 14, 16. x-axis: Hypotheses for μ; y-axis: Density.]
Now, are you paying attention? The normal-normal conjugate method uses
a normal pdf to weight the alternative hypotheses for μ, like the one shown in
Figure 12.7.
[Figure 12.7: a normal pdf weighting the alternative hypotheses for μ. x-axis: Hypotheses for μ; y-axis: Density.]
Thus, the prior distribution for the unknown parameter, μ, is a normal distribution.
I’m confused!
Answer: That’s completely normal. Figure 12.7 suggests that we have greater belief that
the average annual syrup production is between 10 and 15 than, say, at 5 or 25. Remember
that we have simplified the problem so we are only focusing on μ. We’ve already agreed that
sigma, σ, is known and is 2 million gallons.
In fact, the blue distribution in Figure 12.7 is a normal distribution with μ0 = 12 and
σ0 = 4. Alternatively, this distribution can be specified in terms of the parameters μ0 and τ0,
where μ0 = 12 and τ0 = 0.0625. These are formally called prior hyperparameters.
When you use a normal distribution as a prior distribution for the unknown mean, μ, of a
normal distribution, assume that σ (or τ) is known, and collect data from a normal
distribution. You can then use a conjugate shortcut to generate the posterior distribution
for the unknown parameter μ, which will also be a normal distribution.
A diagram such as that shown in Figure 12.8 may help.
[Figure 12.8: a hierarchical diagram. Top: μ ~ N(μ0, τ0), the prior on μ. Bottom: the observed data xi ~ N(μ, σ = 2.0), or equivalently N(μ, τ = 0.25).]
At the bottom of this diagram, we have our observed data, xi , which we assume arises
from a normal distribution with an unknown mean, μ, and known standard deviation, σ =
2.0. This is the same thing as a normal distribution with an unknown mean, μ, and known
precision, τ = 0.25. The parameter, μ, is the unknown parameter that we are trying to
estimate. Here, we use a normal distribution as a prior distribution to set the “weights” on
each and every hypothesis for the unknown parameter, μ. This normal distribution is
defined by the hyperparameters μ0 and τ0.
Answer: Ok then. Time for a concrete example. First, let’s recall the steps for Bayesian
analysis:
1. Identify your hypotheses—these would be the alternative hypotheses for the mean. (The
conjugate shortcut requires that you assume either μ or σ is known; here we assume σ is
known.)
2. Express your belief that each hypothesis is true in terms of a prior distribution (i.e., a
pdf). For a normal distribution, μ can range from −∞ to ∞. For our example, however, μ
must be positive, so we will select a prior distribution such that the area under the curve
for μ < 0 is very, very small.
3. Gather the data.
4. Determine the likelihood of the observed data under each hypothesis.
5. Use Bayes’ Theorem to compute the posterior distribution, which is a pdf.
Step 1. These would be the alternative hypotheses for the mean. Here, we will assume that
τ is known and has a value of 0.25. Remember, this is known.
That leaves us with just one unknown parameter, μ, which has an infinite number of
hypotheses, all of which are positive.
Step 2. Establish a prior distribution for the unknown parameter, μ. Remember that
our prior distribution is a normal distribution. Although Vermont secretly collects data on
syrup production, earlier in the chapter we used a dot exercise to “elicit” information from
cartel members. Elicitation is a method for “boiling down” what experts think they know
when other sources of data are not available (pun intended!).
Tip: “Use expert opinions in such a way that we stand to gain if they are correct and do
not lose if they are wrong.”—C.R. Rao
There are many different ways to elicit a prior distribution, and these are beyond the
scope of this book. For the sake of this chapter, let’s assume the cartel uses the normal
distribution shown in blue in Figure 12.7; it has the following hyperparameters:
• μ0 = 12
• τ0 = 0.0625
Notice the subscript of 0 (naught) is used to indicate that these are prior
hyperparameters.
Step 3. Collect data. We have observed 10.2 million gallons of syrup produced.
We now write our observed data as:
∑ᵢ₌₁ⁿ xᵢ   (12.11)
Here, n is the number of samples (in this case, 1). Typically, we index each year with the
letter i, where the index i goes from 1 to n. So, xᵢ = x₁ = 10.2.
If we collected seven years of data, the index i would be 1, 2,…,7. We would report our
observed data as the sum of our seven observations, and n would be 7:
∑ᵢ₌₁⁷ xᵢ   (12.12)
Steps 4 and 5. Steps 4 and 5 are to determine the likelihood of the observed data under
each hypothesis and then use Bayes’ Theorem to compute the posterior probabilities for
each hypothesis.
Here, though, we can update the prior normal distribution to the posterior normal
distribution with the shortcut analytic solution and avoid the use of Bayes’ Theorem
directly. Here are the shortcut calculations—pay attention to subscripting!
The mean of the posterior distribution for the unknown parameter, μ, is calculated as:
μ_posterior = (τ0μ0 + τ∑ᵢ₌₁ⁿ xᵢ) / (τ0 + n × τ) = (0.0625 × 12 + 0.25 × 10.2) / (0.0625 + 1 × 0.25) = 10.56   (12.13)
The precision of the posterior distribution (τ) for the unknown parameter, μ, is
calculated as:
τ_posterior = τ0 + n × τ = 0.0625 + 1 × 0.25 = 0.3125 ≈ 0.31   (12.14)
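The two shortcut formulas are easy to check numerically. Here is a minimal sketch in Python; the function name normal_normal_update is ours, not from the text:

```python
def normal_normal_update(mu0, tau0, tau, data):
    """Normal-normal conjugate shortcut for the unknown mean.

    Prior on mu is Normal with hyperparameters (mu0, tau0); the data
    precision tau is assumed known. Returns (mu_post, tau_post).
    """
    n = len(data)
    tau_post = tau0 + n * tau                              # Equation 12.14
    mu_post = (tau0 * mu0 + tau * sum(data)) / tau_post    # Equation 12.13
    return mu_post, tau_post

# Prior mu0 = 12, tau0 = 0.0625; known tau = 0.25; one observation of 10.2
mu_post, tau_post = normal_normal_update(12.0, 0.0625, 0.25, [10.2])
print(round(mu_post, 2), tau_post)   # 10.56 0.3125
```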
[Figure 12.9: prior (μ0 = 12, τ0 = 0.0625) and posterior (μ = 10.56, τ = 0.31) distributions for μ; the observation 10.2 is shown by a dashed line. x-axis: Hypotheses for μ; y-axis: Density.]
Our prior distribution for the unknown parameter, μ, is a normal distribution with μ0 =
12 and τ0 = 0.0625 (blue). After collecting one year of syrup data in Vermont (10.2 million
gallons, shown by the dashed line), our posterior distribution for the unknown parameter μ
is a normal distribution with posterior hyperparameters μ = 10.56 and τ = 0.31 (purple).
Notice how the posterior has tightened up a bit and shifted to the left.
Answer: The answer is that the normal distribution is a conjugate distribution that can be
updated with normal data (Raiffa and Schlaifer, 1961). This is why we named this chapter
“The normal-normal conjugate.”
We mentioned the statistical definition of a conjugate distribution in Chapter 11, where
we indicated that there are cases where you can use a particular pdf as a prior distribution,
collect data of a specific flavor, and then derive the posterior pdf with a closed-form
solution. In these special cases, the prior and posterior distributions have the same form of
probability density function, but their parameters may differ. The prior distribution is
called a conjugate prior, and the effect of the data can then be interpreted in terms of
changes in parameter values (Upton and Cook, 2014).
In the maple syrup problem, we used a normal distribution to set the priors for all
hypotheses of μ, the mean number of gallons of syrup (millions). We used expert opinion to
set the prior distribution. We then collected normal data, and then used a shortcut to
generate the posterior distribution for μ.
Raiffa and Schlaifer’s (1961) conjugate solution is conveniently given in a Wikipedia
page on Bayesian conjugates. A quick overview of that page’s normal-normal conjugate can
be depicted with the diagram shown in Figure 12.10.
Let’s work our way through this figure. The top panel indicates that the model param-
eter of interest, μ, is a mean. The prior distribution is a normal distribution, defined by
hyperparameters μ0 and τ0. The “naughts” signal that these are hyperparameters for the
prior. The data collected are normal in flavor, with n being the total sample size. The
likelihood to be used is the normal pdf, where τ is known. With this setup, the posterior
distribution for the unknown parameter, μ, is a normal distribution with hyperparameters
(the conjugate shortcuts) displayed.
Prior hyperparameters: μ0, τ0

Posterior hyperparameters:
μ = (τ0μ0 + τ∑ᵢ₌₁ⁿ xᵢ) / (τ0 + nτ)
τ = τ0 + nτ
Figure 12.10
Answer: All right. Have a look at Appendix 3. There, you can confirm that Bayes’ Theorem
is front and center, even though the results are astonishingly simple.
Answer: Let’s try it! Let’s suppose you collect five more years of syrup produc-
tion data (shown in Table 12.2).
Table 12.2
Year    Gallons
1       7
2       10
3       10
4       8
5       4
Total   39
Note that we have five new observations from the Gaussian (normal) distribution, so
n ¼ 5. Also notice that the sum of the observed data is 39 million gallons produced. Our
prior distribution for the unknown parameter, μ, is now defined by the hyperparameters
μ0 = 10.56 and τ0 = 0.31; we assume τ is still known to be 0.25. If we repeat our
conjugate calculations, the mean of the posterior distribution for the unknown parameter,
μ, is calculated as:
μ_posterior = (τ0μ0 + τ∑ᵢ₌₁ⁿ xᵢ) / (τ0 + n × τ) = (0.31 × 10.56 + 0.25 × 39) / (0.31 + 5 × 0.25) = 8.35   (12.15)

The precision of the posterior distribution for the unknown parameter, μ, is calculated as:

τ_posterior = τ0 + n × τ = 0.31 + 5 × 0.25 = 1.56   (12.16)

And σ for the posterior distribution of the unknown parameter, μ, is calculated as:

σ²_posterior = 1/τ = 1/1.56 = 0.641   (12.17)

σ_posterior = √(σ²) = √0.641 = 0.801   (12.18)
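The two rounds of updating can be chained in a short script, applying the shortcut twice with the first posterior serving as the second prior. Note that carrying full precision gives σ = 0.8; the text's 0.801 reflects the intermediate rounding of τ to 0.31:

```python
def normal_normal_update(mu0, tau0, tau, data):
    """Normal-normal conjugate shortcut; data precision tau is assumed known."""
    n = len(data)
    tau_post = tau0 + n * tau
    mu_post = (tau0 * mu0 + tau * sum(data)) / tau_post
    return mu_post, tau_post

tau = 0.25                         # known precision (sigma = 2)
mu0, tau0 = 12.0, 0.0625           # original prior hyperparameters

# Update 1: one year of data (Equations 12.13-12.14)
mu1, tau1 = normal_normal_update(mu0, tau0, tau, [10.2])

# Update 2: the first posterior becomes the new prior (Equations 12.15-12.18)
mu2, tau2 = normal_normal_update(mu1, tau1, tau, [7, 10, 10, 8, 4])
sigma2 = (1 / tau2) ** 0.5

print(round(mu1, 2), tau1)              # 10.56 0.3125
print(round(mu2, 2), round(tau2, 2))    # 8.35 1.56
print(round(sigma2, 3))                 # 0.8 (text: 0.801, via rounded tau)
```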
Let’s add this new posterior pdf to our graph in dashed purple (see Figure 12.11).
[Figure 12.11: prior (μ0 = 12, σ0 = 4, τ0 = 0.0625), posterior 1 (μ = 10.56, σ = 1.79, τ = 0.31), and posterior 2 (μ = 8.35, σ = 0.801, τ = 1.56) distributions for μ. x-axis: Hypotheses for μ; y-axis: Density.]
Answer: We can work with a known variance (σ²) OR a known precision (τ, which is 1/σ²).
Raiffa and Schlaifer’s (1961) conjugate solution is conveniently given in a Wikipedia page
on Bayesian conjugates. Let’s have another look at our figure, but expand it a bit (see
Figure 12.12).
Figure 12.12
• In the left panel, we assume that τ is known, and we define our prior distribution for the
unknown parameter, μ, with the hyperparameters μ0 and τ0. The conjugate calculations
of the posterior hyperparameters are shown. We used this version here and we’ll make
use of it again in future chapters.
• In the right panel, we assume that σ² is known, and we define our prior distribution for
the unknown parameter, μ, with the hyperparameters μ0 and σ0². The conjugate calcula-
tions of the posterior hyperparameters are shown.
• Both approaches will return the same result.
Answer: There is a conjugate shortcut for that too, but it wouldn’t be called the normal-
normal conjugate shortcut anymore! For example, Figure 12.13 shows the conjugate
solutions for estimating σ² or τ given a known mean, μ.
An important difference is that we no longer use the normal pdf for our prior distribu-
tion for the unknown parameter. Instead, the prior distribution of σ² is either the inverse
gamma distribution or the scaled inverse chi-square distribution, while the prior distribu-
tion for τ is the gamma distribution. We haven’t introduced the first two distributions in
Prior hyperparameters: α0, β0

Posterior hyperparameters:
α = α0 + n/2
β = β0 + ∑ᵢ₌₁ⁿ (xᵢ − μ)²/2
Figure 12.13
this book, but we learned about the gamma distribution in the Shark Attack Problem, and
that is what is shown above. And we will make use of the gamma distribution to estimate τ
with Gibbs sampling when we revisit the Maple Syrup Problem in Chapter 16. A key take-
home point is that you have quite a bit of flexibility with respect to conjugate solutions for
the parameters of the normal distribution.
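The gamma update for τ with a known mean (Figure 12.13) can be sketched as follows. The prior hyperparameters α0 and β0, and the assumed known mean, are hypothetical values chosen for illustration:

```python
# Gamma conjugate update for the precision tau when the mean mu is known.
# alpha0, beta0, and mu_known are hypothetical, for illustration only.
mu_known = 10.0
data = [7, 10, 10, 8, 4]
alpha0, beta0 = 2.0, 4.0

n = len(data)
alpha_post = alpha0 + n / 2                                    # alpha = alpha0 + n/2
beta_post = beta0 + sum((x - mu_known) ** 2 for x in data) / 2 # beta = beta0 + sum sq/2

print(alpha_post, beta_post)   # 4.5 28.5
```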
Answer: Indeed.
• We started by mentioning that 60% of Quebec’s maple syrup stockpile was stolen and
that the cartel wishes to annex Vermont’s maple syrup production.
• We introduced the normal probability density function as a way to quantify the average
and variation in annual syrup production for the state of Vermont.
• We emphasized that the normal distribution has two parameters, the mean μ and the
standard deviation σ, which need to be estimated jointly.
• We remembered that we have an infinite number of hypotheses for μ and σ; these can be
described by continuous distributions. To complicate the matter, the estimate of one
parameter depends on the value of the second parameter.
• To make headway for this problem, we assumed that one of the parameters of the normal
distribution was known. We said that σ was known to be 2.0.
• We talked about different measures of spread for the normal distribution: the standard
deviation (σ), the variance (σ²), and the precision (τ = 1/σ²).
• We then set up a prior distribution to represent our hypotheses for the mean, μ, of the
normal distribution. This prior distribution was also a normal distribution with specified
hyperparameters μ0 and τ0.
• We then collected data. Vermont produced 10.2 million gallons of syrup last year.
• We used a shortcut, analytical solution that allowed us to update our prior normal
distribution to a posterior normal distribution by using the data. The posterior distribu-
tion reflects our updated beliefs for each and every hypothesis for μ.
• We then collected five more years of data. This time, we used our newly updated posterior
as the prior distribution, and then updated it again given the newly observed data. Thus,
we are using Bayesian inference as a way to formalize “learning.”
What’s next?
Answer: This ends our section on Bayesian conjugates. You can see that these conjugate
“shortcuts” are extremely handy for certain types of problems. In Section 5, we’ll show you
how to solve these same problems using a different approach called Markov Chain Monte
Carlo, or MCMC for short.