Part 1
OUP CORRECTED PROOF – FINAL, 6/5/2019, SPi
Bayesian Statistics
for Beginners
A Step-by-Step Approach
THERESE M. DONOVAN
RUTH M. MICKEY
Great Clarendon Street, Oxford, OX2 6DP,
United Kingdom
Oxford University Press is a department of the University of Oxford.
It furthers the University’s objective of excellence in research, scholarship,
and education by publishing worldwide. Oxford is a registered trade mark of
Oxford University Press in the UK and in certain other countries
© Ruth M. Mickey 2019
The moral rights of the author have been asserted
First Edition published in 2019
Impression: 1
All rights reserved. No part of this publication may be reproduced, stored in
a retrieval system, or transmitted, in any form or by any means, without the
prior permission in writing of Oxford University Press, or as expressly permitted
by law, by licence or under terms agreed with the appropriate reprographics
rights organization. Enquiries concerning reproduction outside the scope of the
above should be sent to the Rights Department, Oxford University Press, at the
address above
You must not circulate this work in any other form
and you must impose this same condition on any acquirer
Published in the United States of America by Oxford University Press
198 Madison Avenue, New York, NY 10016, United States of America
British Library Cataloguing in Publication Data
Data available
Library of Congress Control Number: 2019934655
ISBN 978–0–19–884129–6 (hbk.)
ISBN 978–0–19–884130–2 (pbk.)
DOI: 10.1093/oso/9780198841296.001.0001
Printed and bound by
CPI Group (UK) Ltd, Croydon, CR0 4YY
Links to third party websites are provided by Oxford in good faith and
for information only. Oxford disclaims any responsibility for the materials
contained in any third party website referenced in this work.
To our parents, Thomas and Earline Donovan and Ray and Jean Mickey,
for inspiring a love of learning.
To our mentors, some of whom we’ve met only by their written words,
for teaching us ways of knowing.
Preface
Greetings. This book is our attempt at gaining membership to the Bayesian Conspiracy.
You may ask, “What is the Bayesian Conspiracy?” The answer is provided by Eliezer
Yudkowsky (http://yudkowsky.net/rational/bayes): “The Bayesian Conspiracy is a multinational, interdisciplinary, and shadowy group of scientists that controls publication,
grants, tenure, and the illicit traffic in grad students. The best way to be accepted into the
Bayesian Conspiracy is to join the Campus Crusade for Bayes in high school or college, and
gradually work your way up to the inner circles. It is rumored that at the upper levels of the
Bayesian Conspiracy exist nine silent figures known only as the Bayes Council.”
Ha ha! Bayes’ Theorem, also called Bayes’ Rule, was published posthumously in 1763 in
the Philosophical Transactions of the Royal Society. In The Theory That Would Not Die,
author Sharon Bertsch McGrayne aptly describes how “Bayes’ rule cracked the enigma code,
hunted down Russian submarines, and emerged triumphant from two centuries of controversy.” In short, Bayes’ Rule has vast application, and the number of papers and books that
employ it is growing exponentially.
Inspired by the knowledge that a Bayes Council actually exists, we began our journey by
enrolling in a 5-day ‘introductory’ workshop on Bayesian statistics a few years ago. On Day
1, we were introduced to a variety of Bayesian models, and, on Day 2, we sheepishly had to
inquire what Bayes’ Theorem was and what it had to do with MCMC. In other words, the
material was way over our heads. With tails between our legs, we slunk back home and
began trying to sort out the many different uses of Bayes’ Theorem.
As we read more and more about Bayes’ Theorem, we started noting our own questions as
they arose and began narrating the answers as they became more clear. The result is this
strange book, cast as a series of questions and answers between reader and author. In this
prose, we make heavy use of online resources such as the Oxford Dictionary of Statistics
(Upton and Cook, 2014), Wolfram Mathematics, and the Online Statistics Education: An
Interactive Multimedia Course for Study (Rice University, University of Houston Clear
Lake, and Tufts University). We also provide friendly links to online encyclopedias such
as Wikipedia and Encyclopedia Britannica. Although these should not be considered
definitive, original works, we have included the links to provide readers with a readily
accessible source of information and are grateful to the many authors who have contrib-
uted entries.
We are not experts in Bayesian statistics and make no claim as such. Therese Donovan is a
biologist for the U. S. Geological Survey Vermont Cooperative Fish and Wildlife Research
Unit, and Ruth Mickey is a statistician in the Department of Mathematics and Statistics at
the University of Vermont. We were raised on a healthy dose of “frequentist” and maximum likelihood methods but have begun only recently to explore Bayesian methods. We
have intentionally avoided controversial topics and comparisons between Bayesian and
frequentist approaches and encourage the reader to dig deeper—much deeper—than we
have here. Fortunately, a great number of experts have paved the way, and we relied heavily
on the following books while writing our own:
• N. T. Hobbs and M. B. Hooten. Bayesian Models: A Statistical Primer for Ecologists.
Princeton University Press, 2015.
• J. Kruschke. Doing Bayesian Data Analysis: A Tutorial with R, JAGS, and Stan.
Elsevier, 2015.
• A. Gelman, J. B. Carlin, H. S. Stern, and D. B. Rubin. Bayesian Data Analysis. Chapman &
Hall, 2004.
• J. V. Stone. Bayes’ Rule: a Tutorial Introduction to Bayesian Analysis. Sebtel Press, 2014.
• H. Raiffa and R. Schlaifer. Applied Statistical Decision Theory. Division of Research,
Graduate School of Business Administration, Harvard University, 1961.
• P. Goodwin and G. Wright. Decision Analysis for Management Judgment. John Wiley &
Sons, 2014.
Although we relied on these sources, any mistakes of interpretation are our own.
Our hope is that Bayesian Statistics for Beginners is a “quick read” for the uninitiated and
that, in one week or less, we could find a reader happily ensconced in a book written by one
of the experts. Our goal in writing the book was to keep Bayes’ Theorem front and center in
each chapter for a beginning audience. As a result, Bayes’ Theorem makes an appearance in
every chapter. We frequently bring back past examples and explain what we did “back
then,” allowing the reader to slowly broaden their understanding and sort out what has
been learned in order to relate it to new material. For the most part, our reviewers liked this
approach. However, if this is annoying to you, you can skim over the repeated portions.
If this book is useful to you, it is due in no small part to a team of stellar reviewers. We owe
a great deal of gratitude to George Allez, Cathleen Balantic, Barry Hall, Mevin Hooten, Peter
Jones, Clint Moore, Ben Staton, Sheila Weaver, and Robin White. Their enthusiasm,
questions, and comments have improved the narrative immensely. We offer a heartfelt
thank you to Gary Bishop, Renee Westland, Melissa Murphy, Kevin Roark, John Bell, and
Stuart Geman for providing pictures for this book.
Therese Donovan
Ruth Mickey
October 2018
Burlington, VT
Contents
1 Introduction to Probability 3
2 Joint, Marginal, and Conditional Probability 11
3 Bayes’ Theorem 29
4 Bayesian Inference 37
5 The Author Problem: Bayesian Inference with Two Hypotheses 48
6 The Birthday Problem: Bayesian Inference with Multiple Discrete Hypotheses 61
13 The Shark Attack Problem Revisited: MCMC with the Metropolis Algorithm 193
14 MCMC Diagnostic Approaches 212
15 The White House Problem Revisited: MCMC with the Metropolis–Hastings
Algorithm 224
16 The Maple Syrup Problem Revisited: MCMC with Gibbs Sampling 247
SECTION 6 Applications
Appendices
Bibliography 399
Hyperlinks 403
Name Index 413
Subject Index 414
SECTION 1
Basics of Probability
Overview
And so we begin. This first section deals with basic concepts in probability theory, and
consists of two chapters.
CHAPTER 1
Introduction to Probability
In this chapter, we’ll introduce some basic terms used in the study of probability. By the end
of this chapter, you will be able to define the following:
• Sample space
• Outcome
• Discrete outcome
• Event
• Probability
• Probability distribution
• Uniform distribution
• Trial
• Empirical distribution
• Law of Large Numbers
To begin, let’s answer a few questions . . .
What is probability?
Answer: The best way to introduce probability is to discuss an example. Imagine you’re a
gambler, and you can win $1,000,000 if a single roll of a die turns up four. You get only one
roll, and the entry fee to play this game is $10,000. If you win, you’re a millionaire. If you
lose, you’re out ten grand.
Bayesian Statistics for Beginners: A Step-by-Step Approach. Therese M. Donovan and Ruth M. Mickey,
Oxford University Press (2019). © Ruth M. Mickey 2019.
DOI: 10.1093/oso/9780198841296.001.0001
Let’s call the number of possible outcomes N, so N = 6. These outcomes are discrete,
because each result can take on only one of these values. Formally, the word “discrete” is
defined as “individually separate and distinct.” In addition, in this example, there is a
finite number of possible outcomes, which means “there are limits or bounds.” In other
words, the number of possible outcomes is not infinite.
If we believe that each and every outcome is just as likely to result as every other outcome (i.e., the die is fair), then the probability of rolling a four is 1/N, or 1/6. We can then say, “In 6 rolls of the die, we would expect 1 roll to result in a four,” and write that as Pr(four) = 1/6.
Here, the notation Pr means “Probability,” and we will use this notation throughout
this book.
How can we get a good estimate of Pr(four) for this particular die?
Table 1.1
In this table, the column called Frequency gives the number of times each outcome was
observed. The column called Probability is the frequency divided by the sum of the
frequencies over all possible outcomes, a proportion. The probability of an event of four is
the frequency of the observed number of occurrences of four (which is 0) divided by the
total throws (which is 1). We can write this as:
Pr(four) = |number of fours| / |total trials| = 0/1 = 0.    (1.1)
This can be read as, “The probability of a four is the number of “four” events divided by the
number of total trials.” Here, the vertical bars indicate a number rather than an absolute
value. This is a frequentist notion of probability, because we estimate Pr(four) by asking
“How frequently did we observe the outcome that interests us out of the total?”
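The frequentist calculation above is simple enough to sketch in a few lines of Python (a hypothetical illustration of ours; the book itself presents no code):

```python
from collections import Counter

def estimate_probability(rolls, outcome):
    """Frequentist estimate: frequency of `outcome` divided by the total number of trials."""
    return Counter(rolls)[outcome] / len(rolls)

# One roll that did not turn up four estimates Pr(four) as 0/1 = 0:
print(estimate_probability([3], 4))  # 0.0
```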
[Figure 1.1: Estimated probability of each outcome (One through Six) after a single roll; probability is on the y-axis.]
Answer: No.
By now, you should realize that one roll will never give us a good estimate of Pr(four).
(Sometimes, though, that’s all we have . . . we have only one Planet Earth, for example).
Next, you roll the die 9 more times, and summarize the results of the 10 total rolls (i.e., 10
trials or 10 experiments) in Table 1.2. The number of fours is 2, which allows us to estimate
Pr(four) as 2/10, or 0.20 (see Figure 1.2).
Table 1.2
[Figure 1.2: Estimated probability of each outcome after 10 rolls; the bar for Four has height 0.2.]
What can you conclude from these results? The estimate of Pr(four) = 0.2 seems to indicate that the die may be in your favor! (Remember, you expect Pr(four) = 0.1667 if the die were fair.) But $10,000 is a lot of money, and you decide that you should keep test
rolling until the gamemaster shouts “Enough!” Amazingly, you are able to squeeze in 500
rolls, and you obtain the results shown in Table 1.3.
Table 1.3
The plot of the frequency results in Figure 1.3 is called a frequency histogram. Notice
that frequency, not probability, is on the y-axis. We see that a four was rolled 41 times.
Notice also that the sum of the frequencies is 500. The frequency distribution is an example
of an empirical distribution: It is constructed from raw data.
[Figure 1.3: Frequency histogram of the 500 rolls; the y-axis shows frequency (0–100) for each outcome.]
We can now estimate Pr(four) as 41/500 = 0.082. We can calculate the probability estimates
for the other outcomes as well and then plot them as the probability distribution in Figure 1.4.
[Figure 1.4: Estimated probability distribution after 500 rolls.]
The Law of Large Numbers tells us that collecting more and more data yields an estimate that is closer and closer to the true probability as the number of trials increases. In other words, your estimate of Pr(four) gets closer and closer to, or approaches, the true probability when you use more trials (rolls) in your calculations.
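The Law of Large Numbers is easy to watch in action with a short simulation (a sketch of ours in Python; the seed and the roll counts are arbitrary choices):

```python
import random

random.seed(1)  # arbitrary seed so the run is repeatable

def estimate_pr_four(n_rolls):
    """Simulate n_rolls of a fair die and return the frequentist estimate of Pr(four)."""
    rolls = [random.randint(1, 6) for _ in range(n_rolls)]
    return rolls.count(4) / n_rolls

# The estimate wanders with few trials and settles near 1/6 (about 0.1667) with many:
for n in (10, 100, 10_000, 1_000_000):
    print(n, estimate_pr_four(n))
```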
Table 1.4 lists the six possible outcomes, and the probability of each event (1/6 = 0.167).
Notice that the sum of the probabilities across the events is 1.0.
Table 1.4
Outcome Probability
One 0.167
Two 0.167
Three 0.167
Four 0.167
Five 0.167
Six 0.167
Sum 1
Figure 1.5 shows exactly the same information as Table 1.4; both are examples of a
probability distribution. On the horizontal axis, we list each of the possible outcomes. On
the vertical axis is the probability. The height of each bar provides the probability of
observing each outcome. Since each outcome has an equal chance of being rolled, the
heights of the bars are all the same and show as 0.167, which is 1/N. Note that this is not an
empirical distribution, because we did not generate it from an experiment. Rather, it was
based on the assumption that all outcomes are equally likely.
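Because this distribution comes from an assumption rather than an experiment, it can be built directly in code with no rolling at all. A sketch of ours, using exact fractions so the probabilities sum to exactly 1:

```python
from fractions import Fraction

outcomes = ["One", "Two", "Three", "Four", "Five", "Six"]
N = len(outcomes)

# Uniform distribution: every outcome gets probability 1/N
fair_die = {outcome: Fraction(1, N) for outcome in outcomes}

print(fair_die["Four"])        # 1/6
print(sum(fair_die.values()))  # 1 -- the probabilities must sum to 1
```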
[Figure 1.5: Probability distribution for a fair die; each of the six bars has height 0.167.]
How would you change the table and probability distribution if the
die were loaded in favor of a four?
Table 1.5
Outcome Probability
One 0.12
Two 0.12
Three 0.12
Four 0.4
Five 0.12
Six 0.12
Sum 1
The probabilities listed in the table sum to 1.0 just as before, as do the heights of the
corresponding blue bars in Figure 1.6.
[Figure 1.6: Probability distribution for the loaded die; the bar for Four has height 0.4.]
Answer: The bet is that if you roll a four, you win $1,000,000, and if you don’t roll a
four, you lose $10,000. It would be useful to group our 6 possible outcomes into one of
two events. As the Oxford Dictionary of Statistics explains, “An event is a particular
collection of outcomes, and is a subset of the sample space.” Probabilities are assigned
to events.
Our two events are E1 = {four} and E2 = {one, two, three, five, six}. The brackets { } indicate the set of outcomes that belong in each event. The first event contains one
outcome, while the second event consists of five possible outcomes. Thus, in probability
theory, outcomes can be grouped into new events at will. We started out by considering six
outcomes, and now we have collapsed those into two events.
Now we assign a probability to each event. We know that Pr(four) = 0.4. What is the probability of NOT rolling a four? That is the probability of rolling one OR two OR three OR five OR six. Note that these events cannot occur simultaneously. Thus, we can write Pr(~four) as the SUM of the probabilities of the events one, two, three, five, and six (which is 0.12 + 0.12 + 0.12 + 0.12 + 0.12 = 0.6). Incidentally, the ~ sign means “complement of.” If A is an event, ~A is its complement (i.e., everything but A). This is sometimes written as Ac.
The word OR is a tip that you ADD the individual probabilities together to get your answer
as long as the events are mutually exclusive (i.e., cannot occur at the same time).
This is an example of a fundamental rule in probability theory: if two or more events are
mutually exclusive, then the probability of any occurring is the sum of the probabilities of
each occurring. Because the different outcomes of each roll (i.e., rolling a one, two, three,
five, or six) are mutually exclusive, the probability of getting any outcome other than four is
the sum of the probability of each one occurring (see Table 1.6).
Table 1.6
Event Probability
Four 0.4
Not Four 0.6
Sum 1
Note that the probabilities of these two possible events sum to 1.0. Because of that, if we know that Pr(four) is 0.4, we can quickly compute Pr(~four) as 1 − 0.4 = 0.6 and save a few mental calculations.
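Both routes to Pr(~four), summing the mutually exclusive outcomes and taking the complement, can be checked numerically. A sketch of ours using the loaded-die probabilities from Table 1.5:

```python
# Loaded-die probabilities from Table 1.5
pr = {"One": 0.12, "Two": 0.12, "Three": 0.12, "Four": 0.4, "Five": 0.12, "Six": 0.12}

# Route 1: the events are mutually exclusive, so ADD their individual probabilities
pr_not_four_sum = sum(p for outcome, p in pr.items() if outcome != "Four")

# Route 2: the complement rule, 1 - Pr(four)
pr_not_four_complement = 1 - pr["Four"]

print(round(pr_not_four_sum, 10))         # 0.6
print(round(pr_not_four_complement, 10))  # 0.6
```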
The probability distribution looks like the one shown in Figure 1.7:
[Figure 1.7: Probability distribution for the two events; the bar for Four has height 0.4 and the bar for Not Four has height 0.6.]
Remember: The SUM of the probabilities across all the different outcomes MUST EQUAL 1! In the discrete probability distributions above, this means that the heights of the bars, summed across all discrete outcomes (i.e., all bars), total 1.
Do you still want to play? At the end of the book, we will introduce decision trees, an
analytical framework that employs Bayes’ Theorem to aid in decision-making. But that
chapter is a long way away.
Answer: We’re a few chapters away from hitting on this very important topic. You’ll see
that Bayesians think of probability in a way that allows the testing of theories and hypoth-
eses. But you have to walk before you run. What you need now is to continue learning the
basic vocabulary associated with probability theory.
What’s next?
Answer: In Chapter 2, we’ll expand our discussion of probability. See you there.
CHAPTER 2
Joint, Marginal, and Conditional Probability
Now that you’ve had a short introduction to probability, it’s time to build on our
probability vocabulary. By the end of this chapter, you will understand the
following terms:
• Venn diagram
• Marginal probability
• Joint probability
• Independent events
• Dependent events
• Conditional probability
Let’s start with a few questions.
A gala that celebrates the sense of vision? Nope. The eyeball event in this chapter
refers to whether a person is right-eyed dominant or left-eyed dominant. You already
know if you are left- or right-handed, but did you know that you are also left- or right-
eyed? Here’s how to tell (http://www.wikihow.com/Determine-Your-Dominant-Eye; see
Figure 2.1):
1. Stretch your arms out in front of you and create a hole with your hands by joining your
finger tips to make a triangular opening, as shown.
2. Find a small object nearby and align your hands with it so that you can see it in the
triangular hole. Make sure you are looking straight at the object through your hands—
cocking your head to either side, even slightly, can affect your results. Be sure to keep
both eyes open!
3. Slowly move your hands toward your face to draw your viewing window toward you. As
you do so, keep your head perfectly still, but keep the object lined up in the hole between
your hands. Don’t lose sight of it.
4. Draw your hands in until they touch your face—your hands should end up in front of
your dominant eye. For example, if you find that your hands end up so that you are
looking through the opening with your right eye, that eye is dominant.
The eyeball characteristic has two discrete outcomes: lefty (for left-eyed dominant people)
or righty (for right-eyed dominant people). Because there are only two outcomes, we can
call them events if we want.
Let us suppose that you ask 100 people if they are “lefties” or “righties.” In this case,
the 100 people represent our “universe” of interest, which we designate with the letter
U. The total number of elements (individuals) in U is written |U |. (Once again, the
vertical bars here simply indicate that U is a number; it doesn’t mean the absolute value
of U.)
Here, there are only two possible events: “lefty” and “righty.” Together, they make up a
set of possible outcomes. Let A be the event “left-eye dominant,” and ~A be the event “right-eye dominant.” Here, the tilde (~) means “complement of,” and here it can be interpreted as “everything but A.” Notice that these two events are mutually exclusive: you cannot be both a “lefty” and a “righty.” The events are also “exhaustive” because you must be either a lefty or a righty.
Suppose that 70 of 100 people are lefties. These people are a subset of the larger population. The number of people in event A can be written |A|, and in this example |A| = 70. Note that |A| must be less than or equal to |U|, which is 100.
Remember that we use the vertical bars here to highlight that we are talking about a
number.
Since there are only two possibilities for eye dominance type, this means that 100 − 70 = 30 people are righties. The number of people in event ~A can be written |~A|, and in this example |~A| = 30. Note that |~A| must be less than or equal to |U|.
Our universe can be summarized as shown in Table 2.1.
Table 2.1
Event Frequency
Lefty (A) |A| = 70
Righty (~A) |~A| = 30
Universe (U) |U| = 100
We can illustrate this example in a diagrammatic form, as shown in Figure 2.2. Here, our
universe of 100 people is captured inside a box.
This is a Venn diagram, which shows A and ~A. The universe U is 100 people and is represented by the entire box. We then allocate those 100 individuals into A and ~A. The blue circle represents A; lefties stand inside this circle; righties stand outside the circle, but inside the box. You can see that A consists of 70 elements, and ~A consists of 30 elements.
[Figure 2.2: Venn diagram; the circle Lefty (A) contains 70 people, and the region Righty (~A) outside it contains 30.]
Answer: Venn diagrams are named for John Venn (see Figure 2.3), who wrote his seminal
article in 1880.
According to the MacTutor History of Mathematics Archive, Venn’s son described him as
“of spare build, he was throughout his life a fine walker and mountain climber, a keen
botanist, and an excellent talker and linguist.”
Answer: We write this as Pr(A). Remember that Pr stands for probability, so Pr(A) means
the probability that a person is in group A and therefore is a lefty. We can determine the
probability that a person is in group A as:
Pr(A) = |A| / |U| = 70/100 = 0.7.    (2.1)
Probability is determined as the number of persons in group A out of the total. The
probability that the randomly selected person is in group A is 0.7.
Answer: There are 30 of them (100 − 70 = 30), and they are righties.
Pr(~A) = |~A| / |U| = 30/100 = 0.3.    (2.2)
With only two outcomes for the eyeball event, our notation focuses on the probability of being in a given group (A = lefties) and the probability of not being in the given group (~A = righties).
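In code, the two probabilities come straight from the counts (a minimal sketch of ours; the variable names are hypothetical):

```python
size_U = 100  # |U|: everyone in our universe
size_A = 70   # |A|: the lefties

pr_A = size_A / size_U                 # Pr(A) = |A| / |U|
pr_not_A = (size_U - size_A) / size_U  # Pr(~A) = |~A| / |U|

print(pr_A, pr_not_A)  # 0.7 0.3
assert abs(pr_A + pr_not_A - 1.0) < 1e-12  # complementary events cover the universe
```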
Answer: Yes, of course. Let’s probe these same 100 people and find out other details about
their anatomy. Suppose we are curious about the presence or absence of Morton’s toe.
People with “Morton’s toe” have a large second metatarsal, longer in fact than the first metatarsal (which is also known as the big toe or hallux). Wikipedia articles suggest that
this is a normal variation of foot shape in humans and that less than 20% of the human
population have this condition. Now we are considering a second characteristic for our
population, namely toe type.
Let’s let B designate the event “Morton’s toe.” Let the number of people with Morton’s toe be written as |B|. Let ~B designate the event “common toe.” Suppose 15 of the 100 people have Morton’s toe. This means |B| = 15 and |~B| = 85. The data are shown in
Table 2.2, and the Venn diagram is shown in Figure 2.4.
Table 2.2
Event Frequency
Morton’s toe (B) |B| = 15
Common toe (~B) |~B| = 85
Universe (U) |U| = 100
These events can be represented in a Venn diagram, where a box holds our universe of
100 people.
Note the size of this red circle is smaller than the previous example because the number of
individuals with Morton’s toe is much smaller.
Figure 2.4
Answer: You bet. With two characteristics, each with two outcomes, we have four possible
combinations:
1. Lefty AND Morton’s toe, which we write as A ∩ B.
2. Lefty AND common toe, which we write as A ∩ ~B.
3. Righty AND Morton’s toe, which we write as ~A ∩ B.
4. Righty AND common toe, which we write as ~A ∩ ~B.
The symbol ∩ (an upside-down U) is the mathematical symbol for intersection. Here, you can read it as “BOTH” or “AND.”
The number of individuals in A ∩ B can be written |A ∩ B|, where the bars indicate a number (not absolute value). Let’s suppose we record the frequency of individuals in each
of the four combinations (see Table 2.3).
Table 2.3
blue) sums the number of people that are lefties and righties. The lower right (white)
quadrant gives the grand total, or |U|.
Now let’s plot the results in the same Venn diagram (see Figure 2.5).
[Figure 2.5: Venn diagram in which the Lefty (A) circle and the Morton’s toe (B) circle do not overlap.]
The updated Venn diagram shows the blue circle with 70 lefties, the red circle with 15 people with Morton’s toe, and no overlap between the two. This means |A ∩ B| = 0, |A| = |A ∩ ~B| = 70, and |B| = |~A ∩ B| = 15. By subtraction, we know that the number of individuals in neither A nor B, |~A ∩ ~B|, is 15, because we need to account for all 100 individuals somewhere in the diagram.
Answer: For our universe, no. If you are a person with Morton’s toe, you are standing in
the red circle and cannot also be standing in the blue circle. So these two events
(Morton’s toe and lefties) are mutually exclusive because they do not occur at the
same time.
Answer: You bet. All 70 lefties do not have Morton’s toe. These two events are non-
mutually exclusive.
Answer: In this case, we need to adjust the Venn diagram to show that five of the people
that are lefties also have Morton’s toe. These individuals are represented as the intersection
between the two events (see Figure 2.6). Note that the total number of individuals is still
100; we need to account for everyone!
[Figure 2.6: Venn diagram in which the Lefty (A) and Morton’s toe (B) circles overlap; 5 people are in both.]
We’ll run with this example for the rest of the chapter.
This Venn diagram is pretty accurate (except for the size of the box overall). There are
70 people in A, 15 people in B, and 5 people in A \ B. The blue circle contains 70 elements
in total, and 5 of those elements also occur in B. The red circle contains 15 elements (so is a
lot smaller than the blue circle), and 5 of these are also in A. So 5/15 (33%) of the red circle
overlaps with the blue circle, and 5/70 (7%) of the blue circle overlaps with the red circle.
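Python’s set operations mirror this Venn diagram directly. Here we invent a hypothetical roster of 100 numbered people, arranged so the counts match the example:

```python
universe = set(range(100))    # 100 people, numbered 0-99
lefties = set(range(70))      # A: people 0-69 are lefties
mortons = set(range(65, 80))  # B: people 65-79 have Morton's toe

print(len(lefties & mortons))             # |A ∩ B|   -> 5
print(len(lefties - mortons))             # |A ∩ ~B|  -> 65
print(len(mortons - lefties))             # |~A ∩ B|  -> 10
print(len(universe - lefties - mortons))  # |~A ∩ ~B| -> 20
```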
Of the four events (A, ~A, B, and ~B), which are not mutually exclusive?
Answer:
• A and B are not mutually exclusive (a lefty can have Morton’s toe).
• A and ~B are not mutually exclusive (a lefty can have a common toe).
• ~A and B are not mutually exclusive (a righty can have Morton’s toe).
• ~A and ~B are not mutually exclusive (a righty can have a common toe).
In Venn diagrams, if two events overlap, they are not mutually exclusive.
Answer: If we focus on each circle, A and ~A are mutually exclusive (a person cannot be a lefty and a righty). B and ~B are mutually exclusive (a person cannot have Morton’s toe and a common toe). Apologies for the trick question!
If you were one of the lucky 100 people included in the universe,
where would you fall in this diagram?
Table 2.4
This is an important table to study. Once again, notice that there are four quadrants or
sections in this table. In the upper left quadrant, the first two columns represent the two
possible events for eyeball dominance: lefty and righty. The first two rows represent the two
possible events for toe type: Morton’s toe and common toe.
The upper left entry indicates that 5 people are members of both A and B.
• Look for the entry that indicates that 20 people are ~A ∩ ~B, that is, both not A and not B.
• Look for the entry that indicates that 65 people are A ∩ ~B.
• Look for the entry that indicates that 10 people are ~A ∩ B.
The lower left and upper right quadrants of our table are called the margins of the table. They are shaded light blue. Note that the total number of individuals in A (regardless of B) is 70, and the total number of individuals in ~A is 30. The total number of individuals in B (regardless of A) is 15, and the total number of individuals in ~B is 85. Any way you slice it, the grand total must equal 100 (the lower right quadrant).
Answer: Well, if you are interested in determining the probability that an individual
belongs to any of these four groups, you could use your universe of 100 individuals to do
the calculation. Do you remember the frequentist way to calculate probability? We
learned about that in Chapter 1.
Our total universe in this case is the 100 individuals. To get the probability that a person
selected at random would belong to a particular event, we simply divide the entire table
above by our total, which is 100, and we get the results shown in Table 2.5.
Table 2.5
We’ve just converted the raw numbers to probabilities by dividing the frequency table by
the grand total.
Note that this differs from our die rolling exercise in Chapter 1, where you were unsure
what the probability was and had to repeatedly roll a die to estimate it. By the Law of Large
Numbers, the more trials you have, the more you zero in on the actual probability. In this
case, however, we are given the number of people in each category, so the calculation is
straightforward. These 100 people are the only people of interest. They represent our
universe of interest; we are not using them to sample a larger group. If you didn’t know the
make-up of the universe, you could randomly select one person out of the universe over
and over again to get the probabilities, where all persons are equally likely to be selected.
Let’s walk through one calculation. Suppose we want to know the probability that an individual is a lefty AND has Morton’s toe. The number of individuals in A and B is written |A ∩ B|, and the probability that an individual is a lefty with Morton’s toe is written:
Pr(A ∩ B) = |A ∩ B| / |U| = 5/100 = 0.05.    (2.4)
This is officially called the joint probability and is the upper left entry in our table. The
Oxford Dictionary of Statistics states that the “joint probability of a set of events is the
probability that all occur simultaneously.” Joint probabilities are also called conjoint
probabilities. Incidentally, a table that lists joint probabilities such as the one above is
sometimes referred to as a conjoint table.
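Dividing each cell of the conjoint table by the grand total can be sketched in one dictionary comprehension (the tuple keys are our own hypothetical labeling):

```python
# Cell counts for the four combinations (eye dominance, toe type)
counts = {
    ("lefty", "mortons"): 5,
    ("lefty", "common"): 65,
    ("righty", "mortons"): 10,
    ("righty", "common"): 20,
}
grand_total = sum(counts.values())  # |U| = 100

# Joint probability of each combination: frequency / grand total
joint = {combo: n / grand_total for combo, n in counts.items()}

print(joint[("lefty", "mortons")])  # Pr(A ∩ B) = 0.05
```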
When you hear the word joint, you should think of the word AND and realize that you are considering (and quantifying) more than one characteristic of the population. In this case, it indicates that someone is in A AND B. This is written as A ∩ B.
Answer: This is equivalent to asking, what is the joint probability that a person is
right-eye dominant AND has Morton’s toe? See if you can find this entry in Table 2.5. The
answer is 0.1.
In addition to the joint probabilities, the table also provides the marginal probabilities, which look at the probability of A or ~A (regardless of B) and the probability of B or ~B (regardless of A).
Answer: The word marginal in the dictionary is defined as “pertaining to the margins; or situated on the border or edge.” In our table, the marginal probabilities are just the probabilities for one characteristic of interest (e.g., A and ~A) regardless of other characteristics that might be listed in the table.
Let’s now label each cell in our conjoint table by its probability type (see Table 2.6).
Table 2.6
Suppose you know only the following facts: the marginal probability of being a lefty is
0.7, the marginal probability of having Morton’s toe is 0.15, and the joint probability of
being a lefty with Morton’s toe is 0.05. Also suppose that you haven’t looked at
Table 2.6!
Take out some scratch paper and a pencil. You can do it! Here are some hints:
• the lower right hand quadrant must equal 1.00;
• for any given characteristic, the sum of the two marginal probabilities must equal 1.00.
Table 2.7
Answer: Because the marginal probabilities for eyeballs must sum to 1.00, and the mar-
ginal probabilities for toes must sum to 1.00 (because they deal with mutually exclusive
events), we can fill in the missing marginal probabilities.
The marginal probability of a lefty, Pr(A), is 0.7, so the marginal of a righty, Pr(∼A), must
be 1.00 − 0.7 = 0.3.
The marginal probability of having Morton's toe, Pr(B), is 0.15, so the marginal Pr(∼B)
must be 1.00 − 0.15 = 0.85.
So far, so good. Once we know the marginals, we can calculate the joint probabilities in
the upper left quadrant (see Table 2.8). For example:
• if the marginal Pr(A) = 0.7, then we know that Pr(A ∩ ∼B) = 0.7 − 0.05 = 0.65;
• if the marginal Pr(B) = 0.15, then we know that Pr(∼A ∩ B) = 0.15 − 0.05 = 0.1;
• if the marginal Pr(∼A) = 0.3, then we know that Pr(∼A ∩ ∼B) = 0.3 − 0.1 = 0.2.
Table 2.8
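The fill-in-the-table logic above is nothing more than repeated subtraction. This Python sketch (our own; the variable names are not from the book) starts from the same three given facts and recovers every remaining cell:

```python
# Given facts from the text.
pr_A, pr_B, pr_A_and_B = 0.7, 0.15, 0.05   # Pr(lefty), Pr(Morton's toe), Pr(both)

# Marginals for each characteristic must sum to 1.00.
pr_not_A = 1.0 - pr_A          # Pr(righty) = 0.3
pr_not_B = 1.0 - pr_B          # Pr(common toe) = 0.85

# Each joint is a marginal minus the joint already known in its row or column.
pr_not_A_and_B = pr_B - pr_A_and_B              # 0.15 - 0.05 = 0.10
pr_A_and_not_B = pr_A - pr_A_and_B              # 0.70 - 0.05 = 0.65
pr_not_A_and_not_B = pr_not_B - pr_A_and_not_B  # 0.85 - 0.65 = 0.20

# The four joint probabilities partition the universe, so they sum to 1.
total = pr_A_and_B + pr_not_A_and_B + pr_A_and_not_B + pr_not_A_and_not_B
print(round(total, 10))  # 1.0
```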
Answer: Don’t cheat now . . . try to express Pr(B) as the sum of joint probabilities before
reading on! This step is essential for understanding Bayesian inference in future chapters!
How did you do?
Hint 1: We can decompose the total, 0.15, into its two pieces: 0.05 þ 0.1.
The probability that a lefty has Morton’s toe can be written:
If we put these two terms together, we can express the marginal probability of having
Morton’s toe as:
Can we look at this problem from the Venn diagram perspective again?
Figure 2.7 Venn diagram: the Lefty and Morton's toe circles overlap, with 65 people in Lefty only, 5 in the overlap, and 10 in Morton's toe only.
Answer: To answer this question, we must introduce the very important concept of
conditional probability.
Answer: Conditional probability is the probability of an event given that another event
has occurred.
Conditional probability is written as:
• Pr(A | B), which is read "the probability of A, given that B occurs"; in our context, Pr(A | B)
is Pr(lefty | Morton's toe);
• Pr(A | ∼B), which is read "the probability of A, given that ∼B occurs"; in our context,
Pr(A | ∼B) is Pr(lefty | common toe);
• Pr(B | ∼A), which is read "the probability of B, given that ∼A occurs"; in our context,
Pr(B | ∼A) is Pr(Morton's toe | righty);
• etc.
The vertical bar means “given.”
Answer: You use the following equation, which is a standard equation in probability
theory:
Pr(A | B) = Pr(A ∩ B) / Pr(B).   (2.11)
It’s essential that you understand conditional probability, so let’s look at this equation from
a few different angles and, in the words of Kalid Azad, “let’s build some intuition” about
what it means.
Angle 1: The Venn diagram zoom
We already know that the numerator

Pr(A ∩ B)   (2.12)

is the intersection in the Venn diagram where A and B overlap (the probability of a lefty and
Morton's toe). This can be written as

Pr(B ∩ A)   (2.13)

as well. The intersection of A and B is the intersection, no matter how you write it:
And we know that the denominator Pr(B) is the probability of Morton’s toe.
In the Venn diagram, we can focus on the area of B and then look to see what fraction of
the total B is occupied by A. In this example, we restrict our attention to the 15 people with
Morton’s toe, and note that 5 of them are lefties. Therefore, about 5/15 or 1/3 of the red
circle is overlapped by the blue circle.
Figure 2.8 Zooming to the Morton's toe circle: A ∩ B contains 5 people and ∼A ∩ B contains 10.
For the numbers given, we can see that Pr(A | B) = 5/15 = 1/3 ≈ 0.333 = 33.3%. A general
rule can help with the visualization: zoom to the denominator space B, then determine
what fraction of this space is occupied by A. Similarly, Pr(∼A | B) = 10/15 = 2/3 ≈ 0.667 =
66.7%. Note that these probabilities sum to 1.
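The "zoom" rule translates directly into code. Here is a small Python illustration of our own (the variable names are made up for this sketch) that conditions on B using the raw counts:

```python
# Raw counts inside the B (Morton's toe) circle.
n_A_and_B = 5        # lefties with Morton's toe
n_not_A_and_B = 10   # righties with Morton's toe
n_B = n_A_and_B + n_not_A_and_B   # zoom: the denominator space B has 15 people

pr_A_given_B = n_A_and_B / n_B          # 5/15, about 0.333
pr_not_A_given_B = n_not_A_and_B / n_B  # 10/15, about 0.667

# Conditioned on B, the two possibilities are exhaustive, so they sum to 1.
print(round(pr_A_given_B + pr_not_A_given_B, 10))  # 1.0
```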
Angle 2: The table approach
We can also tackle this problem using the raw data (see Table 2.9).
Table 2.9
Pr(A | B) = Pr(A ∩ B) / Pr(B).   (2.15)

Pr(A ∩ B) = |A ∩ B| / |U| = 5/100.   (2.16)

Pr(B) = |B| / |U| = 15/100.   (2.17)
Answer: If you have Morton's toe, the probability of being a lefty is 0.33. If you don't have
Morton's toe, the probability of being a lefty is 65/85 ≈ 0.76 (you can confirm this too).
If Morton’s toe does not matter, these conditional probabilities should be equal to the
marginal probability, which is 0.7. This is clearly not the case here.
Answer: In other words, is the probability of a lefty, given Morton’s toe, the same thing as
the probability of Morton’s toe, given a lefty? Let’s try it!
Pr(A | B) = (|A ∩ B| / |U|) / (|B| / |U|) = |A ∩ B| / |B| = 5/15 = 0.333   (2.19)

Pr(B | A) = (|A ∩ B| / |U|) / (|A| / |U|) = |A ∩ B| / |A| = 5/70 = 0.071.   (2.20)
So the answer is No! These two probabilities are very different things. The first asks what is
the probability of A given that event B happens (with a result of 0.333), while the second
asks what is the probability of B given that A happens (with a result of 0.071).
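To drive the asymmetry home, the two conditionals can be computed side by side from the same counts. This is our own sketch, not the book's:

```python
n_A_and_B = 5   # lefties with Morton's toe
n_B = 15        # everyone with Morton's toe
n_A = 70        # all lefties

pr_A_given_B = n_A_and_B / n_B  # zoom to B: 5/15, about 0.333
pr_B_given_A = n_A_and_B / n_A  # zoom to A: 5/70, about 0.071

# Same numerator, different denominators: the two questions are not interchangeable.
print(round(pr_A_given_B, 3), round(pr_B_given_A, 3))  # 0.333 0.071
```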
Answer: Yes . . . see if you can find it before looking at Table 2.10!
Table 2.10
Remember, when dealing with conditional probabilities, the key word is “zoom.” Let’s
start with Pr(A | B):
Pr(A | B) = Pr(A ∩ B) / Pr(B).   (2.21)
If B happens, we zoom to row 1 (Morton’s toe), and then ask what fraction of the people
with Morton’s toe are lefties:
Pr(A | B) = 0.05/0.15 = 0.333.   (2.22)
Answer: If A happens, we zoom to the first column (lefties) and then ask what fraction of
the lefties have Morton’s toe:
Pr(B | A) = 0.05/0.7 = 0.071.   (2.23)
Pr(A | B) = Pr(A ∩ B) / Pr(B).   (2.24)
You can rearrange this to your heart's content. For this book, the most important
rearrangement is:

Pr(A ∩ B) = Pr(A | B) ∗ Pr(B).

This formula can be used to calculate the joint probability, Pr(A ∩ B). Take some time to make
sure this equation sinks in and makes full sense to you.
As an aside, if the occurrence of one event does not change the probability of the
other occurring, the two events are said to be independent. This means that
Pr(A | B) = Pr(A | ∼B) = Pr(A).
So, when A and B are independent, Pr(A ∩ B) = Pr(A) ∗ Pr(B).
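Independence is easy to test numerically. In this sketch (ours, not the book's) we check whether the eye/toe data satisfy Pr(A ∩ B) = Pr(A) ∗ Pr(B):

```python
pr_A, pr_B, pr_A_and_B = 0.70, 0.15, 0.05

# Under independence we would expect Pr(A) * Pr(B) = 0.105 in the joint cell.
expected_if_independent = pr_A * pr_B
independent = abs(pr_A_and_B - expected_if_independent) < 1e-9
print(independent)  # False: eye dominance and toe type are dependent here
```

The observed joint, 0.05, falls well short of 0.105, which is another way of saying that knowing someone has Morton's toe changes our probability that they are a lefty.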
Answer: That, dear reader, is the subject of our next chapter, where we will derive Bayes’
Theorem. See you there!
SECTION 2
Overview
• In Chapter 3, Bayes’ Theorem is introduced. The chapter shows its derivation and
describes two ways to think about it. First, Bayes’ Theorem describes the relationship
between two inverse conditional probabilities, P(A|B) and P(B|A). Second, Bayes’ The-
orem can be used to express how a degree of belief for a given hypothesis can be updated
in light of new evidence. This chapter focuses on the first interpretation.
• Chapter 4 introduces the concept of Bayesian inference. The chapter discusses the
scientific method, and illustrates how Bayes’ Theorem can be used for scientific inference.
Bayesian Inference is the use of Bayes’ Theorem to draw conclusions about a set of
mutually exclusive and exhaustive alternative hypotheses by linking prior knowledge
about each hypothesis with new data. The result is updated probabilities for each hy-
pothesis of interest. The ideas of prior probabilities, likelihood, and posterior probabilities
are introduced.
• Chapter 5, the “Author Problem,” provides a concrete example of Bayesian inference.
This chapter draws on work by Frederick Mosteller and David Wallace, who used Bayesian
inference to assign authorship for unsigned Federalist Papers. The Federalist Papers were a
collection of essays written in 1787–1788 in support of ratifying the United States Constitution.
However, some papers were unsigned by the author, resulting in disputed authorship. The chapter
provides a very basic Bayesian analysis of the unsigned “Paper 54,” which was written by
Alexander Hamilton or James Madison. The example illustrates the principles of Bayesian
inference for two competing hypotheses.
• Chapter 6, the “Birthday Problem,” is intended to highlight the decisions the analyst
(you!) must make in setting the prior distribution. The “Birthday Problem” expands
consideration from two hypotheses to multiple, discrete hypotheses. In this chapter,
interest is in determining the posterior probability that a woman named Mary was born
By the end of this section, you will have a good understanding of how Bayes’ Theorem is
related to the scientific method.
CHAPTER 3
Bayes’ Theorem
In this chapter, we’re going to build on the content in Section 1 and derive Bayes’ Theorem.
This is what you’ve been waiting for!
By the end of this chapter, you will be able to derive Bayes' Theorem and explain the
relationship between Pr(A | B) and Pr(B | A).
Let’s begin with a few questions.
Answer: It could be, but nobody is really sure! We’ll revisit this question in a future chapter.
Bayesian Statistics for Beginners: A Step-by-Step Approach. Therese M. Donovan and Ruth M. Mickey,
Oxford University Press (2019). © Ruth M. Mickey 2019.
DOI: 10.1093/oso/9780198841296.001.0001
The fact that there are two interpretations can be a source of confusion, but we hope you will
fully appreciate the difference soon! In this chapter, we’ll derive Bayes’ Theorem and discuss
the first interpretation of the theorem: the relationship between Pr(A | B) and Pr(B | A).
To help guide us, let’s return to our familiar Venn diagram and conjoint table of eye
dominance and toes. As before, our example will consider 100 total elements and four
events. Remember that our data consisted of 70 lefties and 15 Morties (people with
Morton's toe). This time, however, we will simply refer to the eye dominance events (left-
eyed vs. right-eyed dominant) as A and ∼A, and the toe events (Morton's toe vs. common
toe) as B and ∼B (see Figure 3.2). In this way, we can generalize the problem so that it
pertains to any two characteristics of interest.
Figure 3.2 Venn diagram: 65 in A only, 5 in A ∩ B, 10 in B only, and 20 in neither (∼A and ∼B).
These same numbers can be expressed in tabular form (Table 3.1), or as a conjoint table
(Table 3.2).
        A      ∼A     Sum
B       5      10      15
∼B      65     20      85
Sum     70     30     100

        A      ∼A     Sum
B       0.05   0.10   0.15
∼B      0.65   0.20   0.85
Sum     0.70   0.30   1.00
Pr(A | B) = Pr(A ∩ B) / Pr(B)   (3.1)

Pr(B | A) = Pr(B ∩ A) / Pr(A)   (3.3)
Pr(A ∩ B) = Pr(A | B) ∗ Pr(B) = (0.05/0.15) ∗ 0.15 = 0.05   (3.5)

Pr(B ∩ A) = Pr(B | A) ∗ Pr(A) = (0.05/0.7) ∗ 0.7 = 0.05.   (3.6)
Equations 3.5 and 3.6 produce the same result. We should expect this, right? We are trying
to estimate the probability of A and B occurring together, which is where the blue and red
circles intersect. Although Pr(A | B) is 0.05/0.15 = 0.33, and Pr(B | A) is 0.05/0.7 = 0.071,
when we multiply each by their respective marginals, we end up with the same joint
probability estimate.
We are just a couple of steps from Bayes’ Theorem. Are you ready?
We'll start with a simple reminder that the joint probability of A and B can be viewed in
two ways:

Pr(A ∩ B) = Pr(A | B) ∗ Pr(B) = Pr(B | A) ∗ Pr(A).

Therefore, dividing through by Pr(B):

Pr(A | B) = [Pr(B | A) ∗ Pr(A)] / Pr(B).   (3.10)
As we’ll soon see, Bayes’ Theorem can be expressed in other ways as well, but in this
version it should be clear that Bayes’ Theorem is a way of calculating conditional
probability. Figure 3.3 provides a handy-dandy summary of the key concepts.
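A quick numerical check of Equation 3.10 is reassuring. This Python sketch (our own illustration) computes Pr(A | B) two ways, directly from the table and via Bayes' Theorem, using the conjoint-table values:

```python
pr_A, pr_B, pr_A_and_B = 0.70, 0.15, 0.05

# Direct route: zoom to B.
pr_A_given_B_direct = pr_A_and_B / pr_B           # 0.05/0.15, about 0.333

# Bayes' route: start from the inverse conditional Pr(B | A).
pr_B_given_A = pr_A_and_B / pr_A                  # 0.05/0.70, about 0.071
pr_A_given_B_bayes = pr_B_given_A * pr_A / pr_B   # Pr(B|A) * Pr(A) / Pr(B)

print(round(pr_A_given_B_direct, 6) == round(pr_A_given_B_bayes, 6))  # True
```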
Figure 3.3 Summary of key concepts: the conditional probability of A given B, Pr(A | B) = Pr(A ∩ B)/Pr(B); the conditional probability of B given A, Pr(B | A) = Pr(A ∩ B)/Pr(A); and Bayes' Theorem, Pr(A | B) = [Pr(B | A) ∗ Pr(A)] / Pr(B).
Table 3.3
        A      ∼A     Sum
B       0.05   0.10   0.15
∼B      0.65   0.20   0.85
Sum     0.70   0.30   1.00
Answer: This is a problem where Bayes’ Theorem could be used because you are asked to
find a conditional probability: what is the probability of A given B. But we solved this easily
in Chapter 2 by just using the joint probability table: zoom to row B, and ask what
proportion of the total consists of A:
Pr(A | B) = 0.05/0.15 = 0.333.   (3.11)
This example demonstrates that even if you are given a Bayes’ type of problem, there are
still ways to solve the problem without really using the theorem. But, just to be complete,
let’s use Bayes’ Theorem to solve it:
Pr(A | B) = [Pr(B | A) ∗ Pr(A)] / Pr(B)   (3.12)

Pr(A | B) = [(0.05/0.7) ∗ 0.7] / 0.15 = 0.333.   (3.13)
It works!
Answer: You can use Bayes' Theorem when you are provided with one kind of conditional
probability, like Pr(A | B), but are asked to find its inverse, Pr(B | A).
A notable example is posted at http://yudkowsky.net/rational/bayes. In that article,
Yudkowsky poses the following challenge:
One percent of women at age forty who participate in routine screening have breast
cancer; 80% of women with breast cancer will have a positive mammogram (test), while
9.6% of women without breast cancer will also get a positive result. A woman in this age
group had a positive mammogram in a routine screening. What is the probability that
she actually has breast cancer?
We are given the probability that a woman with breast cancer will get a positive mammo-
gram. But we are asked for the reverse of this: what is the probability that a woman with a
positive mammogram has breast cancer?
Shall we give this a go? Let's let A represent women with breast cancer, and ∼A represent
women without it. And let's let B represent a positive test, and ∼B represent a negative test.
Bayes’ Theorem would allow us to estimate the probability of cancer, given a positive test
result, as:
Pr(A | B) = [Pr(B | A) ∗ Pr(A)] / Pr(B).   (3.14)
So we know Pr(A) = 0.01, and we're given Pr(B | A) = 0.8. To get our answer via Bayes'
Theorem, all we need is to determine the denominator, Pr(B).
Let’s set up a conjoint table to visualize the problem (see Table 3.4).
Our universe consists of women who participate in routine screenings for breast cancer.
From the problem, we know that Pr(A) = 0.01. This means that 1% of women have breast
cancer. It also means that Pr(∼A) = 0.99, or 99% of women do not have breast cancer.
These are the marginal probabilities for the cancer characteristic (see Table 3.5).
We can make some headway with this knowledge of Pr(A). We're also given the
probability of a positive test result given cancer:

Pr(B | A) = 0.8.

We can use that information to compute the joint probability of A and B (cancer and
positive test). If the marginal probability of having cancer, Pr(A), is 0.01, then the
probability of a positive test result and cancer is 0.008:

Pr(A ∩ B) = Pr(B | A) ∗ Pr(A) = 0.8 ∗ 0.01 = 0.008.

Now, we can calculate the probability of a negative test result and cancer through simple
subtraction:

Pr(A ∩ ∼B) = Pr(A) − Pr(A ∩ B) = 0.01 − 0.008 = 0.002.
Let’s add these entries to our joint and marginal probability table as follows, filling in any
cells we can (as shown in Table 3.6).
Notice that Pr(B | A), which is 0.8, is nowhere in this table. But 0.008/0.01 = 0.8.
So far, so good. Now we just need either the joint probability of ∼A ∩ B or the joint
probability of ∼A ∩ ∼B, and we can find the last missing piece to solve the problem—
the marginal Pr(B).
In the problem statement, we’re given the probability of a positive test result given no
cancer:
Pr(B | ∼A) = 0.096.   (3.18)
This allows us to compute the joint probability that a woman has a positive test AND does
not have breast cancer:

Pr(∼A ∩ B) = Pr(B | ∼A) ∗ Pr(∼A) = 0.096 ∗ 0.99 ≈ 0.095.

We can also compute the joint probability that a woman has a negative test AND does
not have breast cancer as:

Pr(∼A ∩ ∼B) = Pr(∼A) − Pr(∼A ∩ B) = 0.99 − 0.095 = 0.895.
This allows us to fill in the rest of our table (as shown in Table 3.7).
It may be helpful to think in terms of counts instead of probabilities. Suppose there are
1000 women. We would expect that these women would be partitioned into the four
events, as shown in Table 3.8.
Our Venn diagram would look roughly like the one in Figure 3.4, with the following results
(noting that the size of the box would be much, much larger than shown):
Figure 3.4 Venn diagram of the Cancer and Positive circles (counts per 1000 women): 8 women in the overlap and 95 in Positive only.
With the joint and marginal table filled in, we can now answer Yudkowsky's challenge: a
woman in this age group had a positive mammogram in a routine screening. What is the
probability that she actually has breast cancer?

Pr(A | B) = Pr(A ∩ B) / Pr(B) = 0.008/0.103 ≈ 0.0776.

She has a 7.76% chance of having cancer, given her positive test result. This probability
may be surprisingly small to you. You might have noticed that you didn’t need to fill out
the full conjoint table to get the answer. But this little exercise was a quick review of
Chapter 2.
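The whole mammogram calculation fits in a few lines of code. This Python sketch is our own (the numbers are exactly those stated in Yudkowsky's challenge): it builds the denominator Pr(B) from its two joint pieces and then applies Bayes' Theorem.

```python
pr_cancer = 0.01              # Pr(A): 1% prevalence among women screened
pr_pos_given_cancer = 0.80    # Pr(B | A)
pr_pos_given_healthy = 0.096  # Pr(B | ∼A)

# Marginal Pr(B): the sum of the two joint probabilities that make it up.
pr_pos = (pr_pos_given_cancer * pr_cancer
          + pr_pos_given_healthy * (1 - pr_cancer))   # 0.008 + 0.09504

# Bayes' Theorem: Pr(A | B) = Pr(B | A) * Pr(A) / Pr(B)
pr_cancer_given_pos = pr_pos_given_cancer * pr_cancer / pr_pos
print(round(pr_cancer_given_pos, 4))  # 0.0776
```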
Do you see how the Theorem allows you to switch the information around? We were
provided with information about the probability of a test result, given a cancer condition.
We used Bayes’ Theorem to determine the probability of cancer given the test result.
This use of Bayes' Theorem relates inverse representations of the probabilities concerning
two events: Pr(A | B) and Pr(B | A). Bayes noted that there is an intricate relationship be-
tween Pr(A | B) and Pr(B | A). His theorem is useful because sometimes it is far easier to
estimate one of these conditional probabilities than the other.
Answer: In terms of what Bayes’ Theorem is and how you can use it to calculate condi-
tional probability, yes.
You’ll often see Bayes’ Theorem written in the following form:
Pr(A | B) = [Pr(B | A) ∗ Pr(A)] / Pr(B).   (3.23)
There are other, equally valid ways to express Bayes’ Theorem. For example, the two
equations below are equivalent to the one given above:
Pr(A | B) = [Pr(B | A) ∗ Pr(A)] / [Pr(A ∩ B) + Pr(∼A ∩ B)]   (3.24)

Pr(A | B) = [Pr(B | A) ∗ Pr(A)] / [Pr(B | A) ∗ Pr(A) + Pr(B | ∼A) ∗ Pr(∼A)].   (3.25)
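The three forms differ only in how the denominator Pr(B) is written, so numerically they must agree. Here is a short check of our own, reusing the mammogram numbers:

```python
pr_A = 0.01              # prior: cancer
pr_B_given_A = 0.80      # likelihood of a positive test given cancer
pr_B_given_not_A = 0.096 # likelihood of a positive test given no cancer

pr_A_and_B = pr_B_given_A * pr_A
pr_not_A_and_B = pr_B_given_not_A * (1 - pr_A)
pr_B = pr_A_and_B + pr_not_A_and_B

numerator = pr_B_given_A * pr_A
form_marginal = numerator / pr_B                               # Eq. 3.23 style
form_joints = numerator / (pr_A_and_B + pr_not_A_and_B)        # Eq. 3.24 style
form_expanded = numerator / (pr_B_given_A * pr_A
                             + pr_B_given_not_A * (1 - pr_A))  # Eq. 3.25 style

print(form_marginal == form_joints == form_expanded)  # True: same denominator
```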
The second definition of Bayes’ Theorem focuses on inference—a topic we’ll explore in
Chapter 4, which relies on this expanded version.
CHAPTER 4
Bayesian Inference
Now that you’ve been introduced to Bayes’ Theorem, we’ll focus our attention on Bayesian
inference. By the end of this chapter, you will understand the following concepts:
• Bayesian inference
• Induction
• Deduction
• Hypothesis
• Alternative hypotheses
• Prior probability of a hypothesis
• Likelihood of the observed data
• Posterior probability of a hypothesis, given the data
Before we get started, let’s quickly review Bayes’ Theorem, using a Venn diagram as a visual
aid (see Figure 4.1).
Figure 4.1 Venn diagram: 65 in A only, 5 in A ∩ B, 10 in B only, and 20 in neither (∼A and ∼B).
Here, A and B represent two events. Therefore, this diagram consists of four possible joint
events:
• A ∩ B
• A ∩ ∼B
• ∼A ∩ B
• ∼A ∩ ∼B
In Section 1, we learned that the joint probability of A and B occurring can be expressed as
Pr(A ∩ B) = Pr(A | B) ∗ Pr(B), or equivalently as Pr(B ∩ A) = Pr(B | A) ∗ Pr(A).
Therefore:
Pr(A | B) = [Pr(B | A) ∗ Pr(A)] / Pr(B).   (4.4)
In Chapter 3, we provided the two definitions of Bayes’ Theorem and discussed definition 1:
• The Theorem relates inverse representations of the probabilities concerning two events,
that is, Pr(A | B) and Pr(B | A).
We walked through a few different examples to show how to use Bayes’ Theorem to
determine, for example, the probability that a woman has breast cancer if her mammogram
came back positive.
In this chapter, we turn our attention to the second definition of Bayes’ Theorem:
• The Theorem can express how a subjective degree of belief should rationally change to
account for evidence.
Here’s the theorem we’ll be using for definition 2:
Pr(A | B) = [Pr(B | A) ∗ Pr(A)] / Pr(B).   (4.5)
Yes, the equation is the same, but in this second interpretation, Bayes' Theorem opens up
entirely new possibilities for thinking about probabilities and the conduct of science.
Answer: There are many definitions, formal and informal. Generally speaking, science
refers to a system of acquiring knowledge.
Answer: (From NASA): Science is curiosity in thoughtful action about the world and how
it behaves.
Answer: (From Wikipedia): Science (from Latin scientia, meaning “knowledge”) is a systematic
enterprise that builds and organizes knowledge in the form of testable explanations and
predictions about the universe.
Answer: We normally use what is called the scientific method. The Oxford English Diction-
ary (Stevenson, 2010) says that the scientific method is “a method or procedure that has
characterized natural science since the 17th century, consisting in systematic observation,
measurement, and experiment, and the formulation, testing, and modification of hypotheses.”
A key concept in scientific endeavors is formulating testable, alternative explanations
about how the universe works. The scientific method actually consists of two types of
inquiry: induction and deduction, which, when used in concert, produce knowledge.
The scientific process is nicely captured in the diagram in Figure 4.2 (adapted from Rao,
1997). Let’s walk through this diagram, noting that it is a (diamond-shaped) circle at heart
and has neither beginning nor end. It can be thought of as a race track, or a wheel.
Figure 4.2 The scientific method as a cycle: Hypotheses or Theory → [Deductive Reasoning] → Consequences (Prediction) → [Design of Experiments (Ensuring Validity of Data)] → Data → [Inductive Reasoning] → Inference (Verification of Theory) → [Creation of New Ideas (Enlightened Guess Work)] → back to Hypotheses or Theory.
We need to start somewhere in this diagram, which contains four boxes and four arrows.
Let’s start with the upper box:
5. Data Box. After the experiment is designed, we then collect some data. We can also use
existing datasets if they are appropriate.
6. Inductive Reasoning Arrow. The Oxford Reference tells us that inductive reasoning
involves inferring general principles from specific examples.
7. Inference Box. Inductive reasoning makes broad generalizations from specific obser-
vations. Dr. Sylvia Wassertheil-Smoller explains, “In inductive inference, we go from the
specific to the general. We make many observations, discern a pattern, make a general-
ization, and infer an explanation or a theory” (source: http://www.livescience.com/
21569-deduction-vs-induction.html; accessed August 17, 2017). At this point, we may
verify the pattern, or falsify our hypothesis or theory.
8. Creativity Arrow. The final process involves creativity, in which we bring to bear
creative ideas that may explain a pattern or a phenomenon, which brings us full circle.
C. R. Rao (1997) notes that the inference-through-consequences portion of the diagram
comes under the subject of research and the creative role played by scientists, while the
design of experiments through inductive reasoning comes under the realm of statistics.
Note that you can go through a portion of the scientific race track or wheel (e.g., collect
data in the lower box, analyze the data to infer a generalized pattern, and then stop).
However, going around the race track once or multiple times builds knowledge about a
system. As Dr. Wassertheil-Smoller states, “In science there is a constant interplay between
inductive inference (based on observations) and deductive inference (based on theory),
until we get closer and closer to the ‘truth,’ which we can only approach but not ascertain
with complete certainty.”
“Our search is only for working hypotheses which are supported by observational facts
and which, in course of time, may be replaced by better working hypotheses with more
supporting evidence from a wider set of data and provide wider applicability.” –
C.R. Rao.
There are reams of writings about the scientific process—much of it is (thankfully) beyond the
scope of this book. The nice thing about Bayes’ Theorem is that it can be used to address many
kinds of scientific questions, but the underlying concept of Bayesian inference is the notion
of alternative hypotheses, which is why we started in the upper box.
Answer: You’ll have to read more about the story of Bayes’ Theorem to learn about how he
came to his great insight. It involves a thought experiment with balls rolling on a table.
Due credit should also be given to Pierre-Simon Laplace. A Wikipedia author notes
“Pierre-Simon Laplace was an influential French scholar whose work was important to
the development of mathematics, statistics, physics and astronomy. He summarized and
extended the work of his predecessors in his five-volume Mécanique Céleste (Celestial
Mechanics) (1799–1825). This work translated the geometric study of classical mechanics
to one based on calculus, opening up a broader range of problems. In statistics, the
Bayesian interpretation of probability was developed mainly by Laplace” (article accessed
August 15, 2017).
If you’d like to learn more, a terrific read is The Theory That Would Not Die: How Bayes’
Rule Cracked the Enigma Code, Hunted Down Russian Submarines, and Emerged Triumph-
ant from Two Centuries of Controversy, by Sharon Bertsch McGrayne (2011). If you can’t
buy the book, at the very least, watch a lecture.
Answer: Bayesian inference is the process of confronting alternative hypotheses with new
data and using Bayes’ Theorem to update your beliefs in each hypothesis.
The Oxford Dictionary of Statistics (Upton and Cook 2014) describes Bayesian inference
as “an approach concerned with the consequences of modifying our previous beliefs as a
result of receiving new data.”
Wikipedia defines it this way: “Bayesian inference is a method of statistical inference in
which Bayes’ Theorem is used to update the probability for a hypothesis as more evidence
or information becomes available” (article accessed August 15, 2017).
Pr(A | B) = [Pr(B | A) ∗ Pr(A)] / Pr(B).   (4.6)
To use Bayes' Theorem for scientific inference, it's useful (actually, essential) to replace the
marginal denominator Pr(B) with the sum of the joint probabilities that make it up:

Pr(A | B) = [Pr(B | A) ∗ Pr(A)] / [Pr(A ∩ B) + Pr(∼A ∩ B)].   (4.7)
It’s critical that you understand this: the marginal probability of B is the sum of the
joint probabilities that make it up. Perhaps the conjoint table will help jog your
memory (see Table 4.1).
            A              ∼A              Marginal
B           Pr(A ∩ B)      Pr(∼A ∩ B)      Pr(B)
∼B          Pr(A ∩ ∼B)     Pr(∼A ∩ ∼B)     Pr(∼B)
Marginal    Pr(A)          Pr(∼A)          Total = 1.00
Remember that the marginal probabilities are stored, well, in the margins, and that you
obtain them by adding the joint probabilities. So far, so good.
Now, let’s tackle a problem. Returning to the Yudkowsky example in Chapter 3, suppose
we were asked to find the probability that a woman has breast cancer (A), given that her
mammogram test came back positive (B). Here, the results of the mammogram are the “data.”
OK, we are asked to find Pr(A | B). Thus, Pr(B) is in the denominator in Bayes’ Theorem,
which is:
Pr(A | B) = [Pr(B | A) ∗ Pr(A)] / [Pr(A ∩ B) + Pr(∼A ∩ B)].   (4.8)
Let’s now identify the parts of this problem in terms of the scientific method:
• We have two competing hypotheses regarding cancer: the woman has cancer (A) vs. she
does not (∼A).
• We have data for this problem: the test came back positive. So, B represents our observed
data. Notice that ∼B does not appear in the problem because we did not observe a
negative test.
Answer: We can make great headway if we now replace the joint probabilities with
their conditional probability equivalents:
Pr(A | B) = [Pr(B | A) ∗ Pr(A)] / [Pr(B | A) ∗ Pr(A) + Pr(B | ∼A) ∗ Pr(∼A)].   (4.11)
The denominator is still just the sum of the two joint probabilities that make up the
marginal probability of B. Never, ever, lose sight of this! Let’s look at this form of
Bayes’ Theorem more deeply.
Answer: You might have noticed that one term in the denominator is exactly the same
thing as the numerator!
Pr(A | B) = [Pr(B | A) ∗ Pr(A)] / [Pr(B | A) ∗ Pr(A) + Pr(B | ∼A) ∗ Pr(∼A)].   (4.12)
Now, it should be clear that Bayes’ Theorem returns a proportion, or probability. After all,
we need to find Pr(A | B), which is a probability that ranges between 0 and 1.
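Because the numerator is one of the two terms in the denominator, the posteriors for the complementary hypotheses necessarily sum to 1: the denominator acts as a normalizing constant. A tiny check (our own sketch, using the mammogram numbers):

```python
pr_A, pr_B_given_A, pr_B_given_not_A = 0.01, 0.80, 0.096

denom = pr_B_given_A * pr_A + pr_B_given_not_A * (1 - pr_A)
post_A = pr_B_given_A * pr_A / denom                # Pr(A | B)
post_not_A = pr_B_given_not_A * (1 - pr_A) / denom  # Pr(∼A | B)

print(round(post_A + post_not_A, 10))  # 1.0
```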
Answer: Well, Bayes’ great insight was that when we use the Theorem in this way, the
equation itself can be used to draw inferences regarding competing hypotheses and thus is
directly tied to the scientific method. Cool!
Let’s revisit our scientific method diagram (see Figure 4.3).
Figure 4.3 The scientific method cycle again: Hypotheses or Theory → [Deductive Reasoning] → Consequences (Prediction) → [Design of Experiments (Ensuring Validity of Data)] → Data → [Inductive Reasoning] → Inference (Verification of Theory) → [Creation of New Ideas (Enlightened Guess Work)] → back to Hypotheses or Theory.
Again, the scientific “race track” has no beginning or end. But Bayesian inference begins
with the notion of multiple hypotheses, so we’ll start with the upper box. Our problem is to
determine the probability that a woman has breast cancer, given the test result was positive.
Let’s focus on the four main boxes only:
Finally, we use Bayes’ Theorem to determine a posterior probability for each hypoth-
esis. The posterior probability represents our updated belief in each hypothesis after new
data are collected:
• Probability of cancer, given the observed data
• Probability of no cancer, given the observed data.
Pr(Cancer | Positive)
= [Pr(Positive | Cancer) ∗ Pr(Cancer)] / [Pr(Positive | Cancer) ∗ Pr(Cancer) + Pr(Positive | ∼Cancer) ∗ Pr(∼Cancer)].   (4.13)
Now, let’s replace each term with its Bayesian inference definition (see Figure 4.4).
Figure 4.4 Bayes' Theorem annotated for inference: the left side, Pr(A | B), is the posterior probability of hypothesis A, given data B; the numerator is the likelihood of the data, B, given hypothesis A, Pr(B | A), times the prior probability of hypothesis A, Pr(A); the denominator adds the corresponding likelihood and prior terms for hypothesis ∼A: Pr(A | B) = [Pr(B | A) ∗ Pr(A)] / [Pr(B | A) ∗ Pr(A) + Pr(B | ∼A) ∗ Pr(∼A)].
Paul Samuelson, the Nobel laureate from the Massachusetts Institute of Technology,
recalled that John Maynard Keynes once was challenged for altering his position on
some economic issue. “When my information changes, I change my mind. What do
you do?”
One percent of women at age forty who participate in routine screening have breast
cancer. Eighty percent of women with breast cancer will get positive mammograms. In
addition, 9.6% of women without breast cancer will get positive mammograms.
A woman in this age group had a positive mammogram in a routine screening. What
is the probability that she actually has breast cancer?
In this chapter, we are solving the problem with the second interpretation of Bayes’ Theorem:
Pr(A | B) = [Pr(B | A) ∗ Pr(A)] / [Pr(B | A) ∗ Pr(A) + Pr(B | ∼A) ∗ Pr(∼A)].   (4.16)
Pr(A | B) = [(0.008/0.01) ∗ 0.01] / [(0.008/0.01) ∗ 0.01 + (0.095/0.99) ∗ 0.99] = 0.0776.   (4.17)
Both return the same answer, and both are correct. The second interpretation, however,
places the problem within a scientific context, where you posit hypotheses and then update
your beliefs in each hypothesis after data are collected. In this particular example, our
hypotheses are that the woman has cancer and that she does not. Without a test, our initial
belief that a woman has breast cancer is 1% (because 1% of the population has breast
cancer), and our belief that she does not have breast cancer is 99%. With the test, however,
we can use Bayes’ Theorem to update our beliefs. In light of the new data, the probability
that a woman has breast cancer is 7.8%.
Now let's generalize this problem even more. Suppose there are n hypotheses. Let's number
our hypotheses from i = 1 to n. Here, i is an index of hypotheses. In our example, n = 2
since there are two hypotheses. We could let hypothesis i = 1 be the "cancer hypothesis",
and i = 2 be the "no cancer hypothesis." But we can generalize this as H1, H2, …, Hn
hypotheses. In other words, there can be any discrete number of hypotheses.
Let’s let data represent the data we collected, that is, our observed data. So now Bayes’
Theorem can be written:
Pr(Hi | data) = [Pr(data | Hi) ∗ Pr(Hi)] / Σ (j = 1 to n) Pr(data | Hj) ∗ Pr(Hj).   (4.18)
• The left side of the equation, Pr(Hi | data), can be read: “The posterior probability of
hypothesis i, given the data.”
• The numerator requires the likelihood of observing the data under hypothesis i and is
written Pr(data | Hi). This is then multiplied by the prior probability for hypothesis i,
which is written Pr(Hi).
• In the denominator, the symbol Σ means “sum.” The “j” associated with the sum symbol
is an index of summation. Thus, in the denominator, we are summing over all
hypotheses. So the index j goes from 1 to n. In our example, we had two hypotheses,
so there must be two terms in the denominator.
We use Bayes’ Theorem to compute the posterior probabilities for each and every
hypothesis under consideration.
Broadly speaking, you start with a scientific question and set forth two or more alternative
hypotheses. You then assign a prior probability that each alternative hypothesis is
true. Next, you collect data. Finally, you use Bayes’ Theorem to update the probability
for each hypothesis considered.
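The generalized update in Equation 4.18 can be sketched as a small Python function. The numbers reuse the breast cancer example, with the same assumed test characteristics as before:

```python
def bayes_update(priors, likelihoods):
    """Posterior probabilities for n discrete hypotheses.

    priors[i] = Pr(Hi); likelihoods[i] = Pr(data | Hi).
    Each numerator Pr(data | Hi) * Pr(Hi) is divided by the sum of
    those products over all hypotheses (the denominator of Eq. 4.18).
    """
    products = [lik * pri for lik, pri in zip(likelihoods, priors)]
    total = sum(products)  # sum over j = 1..n
    return [p / total for p in products]

# The breast cancer example as a two-hypothesis special case:
posteriors = bayes_update(priors=[0.01, 0.99], likelihoods=[0.80, 0.096])
print([round(p, 4) for p in posteriors])  # → [0.0776, 0.9224]
```

Note that the posteriors always sum to 1.0, because each product is divided by the sum of all the products.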
Each time you collect more data and apply Bayes’ Theorem, you’re updating your belief in
each alternative hypothesis. The new posterior probabilities then become priors for the
next analysis. By tracking our beliefs through time, we track our learning. The phrase
“Today’s posterior is tomorrow’s prior” captures the process nicely, as long as the same data
are not used to update the prior again the next time around.
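“Today’s posterior is tomorrow’s prior” can be sketched by feeding the output of one update into the next. The second round of data here is hypothetical (a second, independent positive test with the same assumed characteristics), just to show the mechanics:

```python
def bayes_update(priors, likelihoods):
    # Discrete Bayes update: normalize likelihood * prior over all hypotheses.
    products = [l * p for l, p in zip(likelihoods, priors)]
    total = sum(products)
    return [x / total for x in products]

priors = [0.01, 0.99]                        # initial beliefs: cancer, no cancer
first = bayes_update(priors, [0.80, 0.096])  # after one positive test
# Today's posterior becomes tomorrow's prior for a second, independent test:
second = bayes_update(first, [0.80, 0.096])
print(round(second[0], 3))  # → 0.412
```

A second positive result raises the probability of cancer much further, because the first update already shifted the prior.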
Answer: Loads of them! And the use of Bayes’ Theorem is skyrocketing in many fields. In
Chapter 5, we’ll work on another problem to give you more practice.
CHAPTER 5
In this chapter, we provide a concrete example of Bayesian inference. By the end of this
chapter, you should have a renewed understanding of:
• Bayesian inference
• Hypothesis
• Alternative Hypothesis
• Prior probability of a hypothesis
• Prior probability distribution
• Likelihood of the observed data
• Posterior probability of a hypothesis
• Posterior probability distribution.
We’ll walk through a classic example of Bayesian Inference. In 1964, Frederick Mosteller
and David Wallace published an article in which they studied the disputed authorship of
some of the Federalist Papers. Mosteller and Wallace introduce their paper as follows:
In the 12 disputed papers, either Alexander Hamilton (see Figure 5.1, left) or James Madison
(see Figure 5.1, right) signed his name as “Publius” instead of using his given name. Mosteller
and Wallace used a Bayesian analysis to rightfully attribute the authorship of each paper.
Let’s assume we are working with a specific paper of unknown authorship (No. 54), and
walk through a Bayesian inference analysis. This paper discusses the way in which the seats
in the United States House of Representatives are apportioned among the states. It is titled
“The Apportionment of Members Among the States.”
First, let’s recall Bayes’ Theorem:
Pr(A | B) = [Pr(B | A) * Pr(A)] / [Pr(B | A) * Pr(A) + Pr(B | ∼A) * Pr(∼A)]    (5.1)
Bayesian Statistics for Beginners: A Step-by-Step Approach. Therese M. Donovan and Ruth M. Mickey,
Oxford University Press (2019). © Ruth M. Mickey 2019.
DOI: 10.1093/oso/9780198841296.001.0001
We’ll use a series of steps to conduct our analysis, which follow the scientific race track
described in Chapter 4:
1. Identify your hypotheses.
2. Express your belief that each hypothesis is true in terms of prior probabilities.
3. Gather the data.
4. Determine the likelihood of the observed data under each hypothesis.
5. Use Bayes’ Theorem to compute the posterior probabilities for each hypothesis.
What is step 1?
Answer: In step 1, we identify our hypotheses. Paper No. 54 is 2008 words in length, and
our goal is to determine its most likely author. For this paper, we have two hypotheses for
authors: Hamilton or Madison. Let’s nail down our nomenclature:
• Hamilton = Hamilton hypothesis
• Madison = Madison hypothesis
Note that the hypotheses are exhaustive and mutually exclusive. Since there are only two
hypotheses, it makes sense that:

Madison = ∼Hamilton

and

Hamilton = ∼Madison.
In terms of our notation for Bayes’ Theorem in Equation 5.1 above, A could correspond to
the hypothesis that the author is Alexander Hamilton, and ∼A could correspond to the
hypothesis that the author is James Madison. And B in Bayes’ Theorem above represents the
observed data.
The posterior probability that the author was Hamilton is then:

Pr(Hamilton | data) = [Pr(data | Hamilton) * Pr(Hamilton)] / [Pr(data | Hamilton) * Pr(Hamilton) + Pr(data | Madison) * Pr(Madison)].
What is step 2?
Answer: We express our belief that each hypothesis is true in terms of prior probabilities.
We do this because the paper is unsigned and we have some uncertainty about which
hypothesis is correct. If the author authentically signed the paper, the probability of their
authorship would be 1.0, and we wouldn’t have a problem to tackle!
• Pr(Hamilton) = prior probability that the true author is Hamilton.
• Pr(Madison) = prior probability that the true author is Madison.
Remember that the sum of the prior probabilities must add to 1.0. We have several options;
here are a few:
There are many possible combinations of prior probabilities to choose from. Knowing that
Hamilton penned 43 papers and that Madison penned 14, it may be reasonable to assign
Pr(Hamilton) a prior probability of 0.75 (the last option above) because Hamilton wrote
43 of the 57 papers that were signed by either man (43/(43 + 14) = 0.75). That would mean
that Pr(Madison) = 0.25.
However, Mosteller and Wallace set the odds at 50:50, which is Pr(Hamilton) = 0.5 and
Pr(Madison) = 0.5. This gives each hypothesis the same “weight” of belief as the other.
Let’s now represent our hypotheses and prior probabilities as a graph (see Figure 5.2).
Figure 5.2 The prior probability distribution: the Hamilton and Madison hypotheses on the x-axis, each with a prior probability of 0.5 on the y-axis.
A graph of the prior probabilities is called the prior probability distribution. This
distribution represents hypotheses on the x-axis and their probabilities on the y-axis.
Note that the sum of the probabilities across hypotheses must be 1.0; this means that the
list of alternative hypotheses is exhaustive. In addition, the hypotheses are mutually
exclusive: the paper can’t be written by both men. Figure 5.2 is an example of a discrete
prior probability distribution.
What is step 3?
Answer: Gather the data. The data for this problem are found in Paper No. 54, and
it is 2008 words long.
Ultimately, we are asking, which hypothesis (Hamilton or Madison) is most consistent
with the data we observe (the paper)?
• Pr(Hamilton | data) = what is the probability the author is Hamilton, given the paper?
• Pr(Madison | data) = what is the probability the author is Madison, given the paper?
This is the challenge that Mosteller and Wallace set out to tackle, and it is a very tough nut
to crack!
Mosteller and Wallace didn’t use ALL of the data at their disposal (i.e., every word in the
paper), but instead focused on some signature words and phrases as a key to identify each
author.
Madison tended to use the word by more frequently than Hamilton, whereas Hamilton
tended to use the word to more frequently than Madison. The best single discriminator,
however, was the use of the word upon.
Hamilton used upon with overwhelmingly greater frequency than Madison. To be sure,
other metrics could have been used, including the length of each man’s sentences. However,
the two men are nearly identical in this respect, with Hamilton averaging 34.55 words
per sentence, and Madison averaging 34.59 per sentence. Sentence length, therefore, would
not be very helpful in separating the two hypotheses.
Our sincere apologies to Mosteller and Wallace, but to keep things simple, we will
simplify their analysis tremendously and focus on the frequency of the use of upon by
each author.
The word “upon” appeared twice in the paper in question, so the rate of upon is calculated as:

#upons / total words = 2 / 2008 = 0.000996.    (5.5)
If we standardize this rate to 1000 words, then our manuscript has a standardized rate of
upon of 0.000996 * 1000 = 0.996 upons per 1000 words.
We need some standardization because papers vary in length: a paper with 1 upon in 20
words has a very different rate of upon than a paper with 1 upon in 2008 words.
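The standardization step (Equation 5.5 and the rescaling to 1000 words) is just two divisions and a multiplication; here it is in Python, using the counts given in the text:

```python
# Standardized rate of "upon" per 1000 words for the disputed paper.
upons, total_words = 2, 2008
rate = upons / total_words    # raw rate: 0.000996...
rate_per_1000 = rate * 1000   # standardized to a 1000-word basis
print(round(rate_per_1000, 3))  # → 0.996
```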
Now that we have the data in hand, we need to determine the likelihood of observing a
rate of 0.996 upons under each hypothesis.
What is step 4?
Answer: Determine the likelihood of the observed data, assuming each hypothesis
is true.
Pr(data | Hamilton)

and

Pr(data | Madison).
Here we pit the observed data against each hypothesis. This step is often the most
challenging part of a Bayesian analysis. How exactly do you compute the likelihood of
the data?
Dictionary.com provides three definitions of likelihood:
So, likelihood is another word for probability. There is a subtle difference, however, in
the way the term is used in statistical analysis, where likelihood describes the probabil-
ity of observing data that have already been collected. We will cover this in more detail
in future chapters, but for now it’s enough just to have a good feel for the concept:
likelihood involves the collection of data, and we look retrospectively at
the probability of collecting those data. If we live in the Sahara Desert, we know
the probability of rain is low. Suppose it rains for seven consecutive days; this represents
our data. Can we all agree that it is very unlikely to have observed seven consecutive
days of rain, given that we are in the Sahara Desert? We hope so!
Wolfram Math World differentiates likelihood and probability this way: “Likelihood is
the hypothetical probability that an event that has already occurred would yield a specific
outcome. The concept differs from that of a probability in that a probability refers to the
occurrence of future events, while a likelihood refers to past events with known
outcomes.”
Let’s put this in terms of authorship of the disputed paper. The paper is a past event with a
known outcome: the rate of upon is 0.996 per 1000 words. Our next step, then, is to
determine how likely it is to observe 0.996 upons per thousand words under each
hypothesis:

Pr(0.996 | Hamilton)    (5.9)

Pr(0.996 | Madison)    (5.10)
• Equation 5.9 asks: “What is the likelihood of observing a rate of 0.996 upons per 1000
words, given that the author is Hamilton?”
• Equation 5.10 asks: “What is the likelihood of observing a rate of 0.996 upons per 1000
words, given that the author is Madison?”
Notice that the likelihoods are conditional for each hypothesis! In this Bayesian ana-
lysis, the likelihood is interpreted as the probability of observing the data, given the
hypothesis.
Answer: Computing the likelihood of the observed data is a critical part of Bayesian
analysis. Throughout this book, you’ll be learning about probability distributions that
will enable you to estimate the likelihood of the observed data under each hypothesis.
But, for now, what you need is a good, intuitive feel for what it is, and a firm understanding
of how the observed data (a rate of 0.996) has a different likelihood depending on whether
the author was Hamilton or Madison.
Imagine Mosteller and Wallace had a whole team of students who searched through 98
articles known to be penned by Hamilton or Madison, and tediously counted the
number of times each author used the word upon and also tallied the total number of
words in each document. The data might look like that shown in Table 5.1.
Table 5.1
Table 5.2

Rate of upon per 1000 words   Hamilton (48 papers)   Madison (50 papers)
0                             0                      41
(0,1]                         1                      7
(1,2]                         10                     2
…                             …                      …
This table assigns each article into a “rate bin,” where the bins are provided in column 1.
The first bin includes all manuscripts where the word upon is never used (i.e., the rate of
upon = 0). Let’s talk about the notation used for the remaining bins. A closed interval such
as [0,1] would include all manuscripts with a rate of 0 through 1, including the endpoints.
An open interval such as (0,1) does not include the endpoints 0 or 1. A half-closed interval
has the notation (0,1]. Thus, the interval (0,1] is a bin that includes all manuscripts where
the rate of upon was > 0 but ≤ 1. For the 48 papers known to be penned by Hamilton, 0 of
them had a rate of 0, 1 had a rate in the (0,1] bin, 10 had a rate in the (1,2] bin, and so on.
For the 50 papers known to be penned by Madison, 41 of them had a rate of 0 (in which the
word upon was never used), 7 had a rate > 0 and ≤ 1, and 2 had a rate > 1 and ≤ 2.
The same data can be shown as a frequency histogram (see Figure 5.3). The x-axis of
this histogram shows the bins, or the rate of upons per 1000 words. The y-axis is the
frequency, or number of articles, corresponding to each rate of upon. Frequency histograms
therefore depict the raw data in graphic format.
Figure 5.3 Frequency histogram of the rate of upon per 1000 words (bins 0, (0,1], (1,2], …, (7,8]) for papers known to be by Hamilton and by Madison.
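The half-closed bin notation can be made concrete with a small Python helper. The helper name and the sample rates below are hypothetical, chosen only to exercise the binning rule:

```python
import math

def rate_bin(rate):
    """Assign a rate of 'upon' to its bin: '0' for unused, else (k, k+1]."""
    if rate == 0:
        return "0"
    lo = math.ceil(rate) - 1  # rate falls in the half-closed interval (lo, lo+1]
    return f"({lo},{lo + 1}]"

# Hypothetical rates per 1000 words, just to exercise the binning:
rates = [0.0, 0.996, 1.0, 1.5, 2.0, 3.2]
print([rate_bin(r) for r in rates])
# → ['0', '(0,1]', '(0,1]', '(1,2]', '(1,2]', '(3,4]']
```

Note that a rate of exactly 1.0 lands in (0,1], because half-closed bins include their right endpoint.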
Remember, for this step we are trying to get estimates of the likelihood of the observed
data under each hypothesis. The unsigned manuscript has a rate of upon = 0.996,
which falls into the second bin:

Pr(0.996 | Hamilton)    (5.11)

Pr(0.996 | Madison)    (5.12)
• Equation 5.11 asks: “What is the probability of observing a rate of 0.996 upons per 1000
words, given that the author is Hamilton?”
• Equation 5.12 asks: “What is the probability of observing a rate of 0.996 upons per 1000
words, given that the author is Madison?”
Answer: Intuitively, the data are more consistent with the Madison hypothesis. Can you
see that too?
Answer: Great question! We’ll try this a few different ways. First, though, think about how
you would approach this problem before reading on.
Looking back at Table 5.2, we can see the following:
• One of Hamilton’s 48 manuscripts had a rate of upon greater than 0 but no greater than 1.0.
Therefore, we can use 1/48 = 0.021 as an estimate of the likelihood of the data under the
Hamilton hypothesis.
• Seven of Madison’s 50 manuscripts had a rate of upon greater than 0 but no greater than 1.0.
Therefore, we can use 7/50 = 0.140 as an estimate of the likelihood of the data under the
Madison hypothesis.
These are quick and dirty estimates of each author’s use of upon, and we’ll run with this.
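These quick and dirty estimates are just bin counts divided by paper totals; the counts come from Table 5.2 as described in the text:

```python
# Likelihood estimates: the fraction of each author's known papers whose
# rate of "upon" falls in the same bin as the disputed paper, (0,1].
hamilton_in_bin, hamilton_total = 1, 48
madison_in_bin, madison_total = 7, 50

lik_hamilton = hamilton_in_bin / hamilton_total  # Pr(data | Hamilton)
lik_madison = madison_in_bin / madison_total     # Pr(data | Madison)
print(round(lik_hamilton, 3), round(lik_madison, 3))  # → 0.021 0.14
```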
What is step 5?
Answer: Use Bayes’ Theorem to compute the updated Pr(Hamilton) and Pr(Madison),
given the data.
First, let’s recall Bayes’ Theorem and the familiar inferential terms (see Figure 5.4).
Pr(A | B) = [Pr(B | A) * Pr(A)] / [Pr(B | A) * Pr(A) + Pr(B | ∼A) * Pr(∼A)]

Figure 5.4 Bayes’ Theorem annotated: the left side is the posterior probability of hypothesis A, given data B. In the numerator, Pr(B | A) is the likelihood of the data B given hypothesis A, and Pr(A) is the prior probability of hypothesis A. The denominator adds the corresponding likelihood and prior for hypothesis ∼A.
Bayes’ Theorem can be used to calculate the posterior probability that the author is
Hamilton:
Pr(Hamilton | 0.996) = [Pr(0.996 | Hamilton) * Pr(Hamilton)] / [Pr(0.996 | Hamilton) * Pr(Hamilton) + Pr(0.996 | Madison) * Pr(Madison)]    (5.13)
In words . . .
• The left side of the equation can be read as “the posterior probability that Hamilton
penned the paper, given the paper had a rate of 0.996 upon per thousand words.”
• On the right, the numerator multiplies two terms: the likelihood of observing a rate of
upons per 1000 words of 0.996 under the Hamilton hypothesis, multiplied by the prior
probability of the Hamilton hypothesis.
• The denominator repeats the numerator, but adds in the Madison hypothesis as well.
Of course, since there are only two hypotheses and they are mutually exclusive, you can get
the posterior probability of Madison as 1 minus the posterior probability of Hamilton. Or
you can write it out fully for the sake of completeness:
Pr(Madison | 0.996) = [Pr(0.996 | Madison) * Pr(Madison)] / [Pr(0.996 | Hamilton) * Pr(Hamilton) + Pr(0.996 | Madison) * Pr(Madison)]    (5.14)
In short, we can use Bayes’ Theorem to estimate the posterior probability for each
hypothesis. Let’s keep our focus on the Hamilton hypothesis for now:

Pr(Hamilton | 0.996) = [Pr(0.996 | Hamilton) * Pr(Hamilton)] / [Pr(0.996 | Hamilton) * Pr(Hamilton) + Pr(0.996 | Madison) * Pr(Madison)]    (5.15)
Where are the likelihoods of the observed data under each hypothesis
in this equation?
Pr(Hamilton | 0.996) = [Pr(0.996 | Hamilton) * Pr(Hamilton)] / [Pr(0.996 | Hamilton) * Pr(Hamilton) + Pr(0.996 | Madison) * Pr(Madison)]    (5.18)
Remember that we set the priors to 0.5 for Hamilton and 0.5 for Madison. Also remember
that the likelihood of observing a rate of upons of 0.996 was 0.021 for Hamilton and 0.140
for Madison. Now let’s fill in our priors and likelihoods:

Pr(Hamilton | 0.996) = (0.021 * 0.5) / (0.021 * 0.5 + 0.140 * 0.5) = 0.0105 / 0.0805 = 0.1304.

Since we have only two, mutually exclusive, hypotheses, this means that our posterior
probability that the author was James Madison is 1 − 0.1304 = 0.8696:

Pr(Madison | 0.996) = 0.8696.
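The full authorship update can be sketched in Python, using the likelihoods (0.021 and 0.140) and the 50:50 priors stated above:

```python
# Posterior probabilities for the authorship of Paper No. 54, equal priors.
priors = {"Hamilton": 0.5, "Madison": 0.5}
likelihoods = {"Hamilton": 0.021, "Madison": 0.140}  # Pr(0.996 | author)

products = {a: likelihoods[a] * priors[a] for a in priors}
denom = sum(products.values())  # 0.021*0.5 + 0.140*0.5 = 0.0805
posteriors = {a: products[a] / denom for a in priors}
print({a: round(p, 4) for a, p in posteriors.items()})
# → {'Hamilton': 0.1304, 'Madison': 0.8696}
```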
These new posterior estimates can be graphed as the posterior probability distribution.
[Figure: the prior distribution (0.5 for each hypothesis) next to the posterior distribution (0.13 for Hamilton, 0.87 for Madison).]
Thus, given the manuscript (and ancillary data about word use by the two authors from
KNOWN data), we now increase our belief that the author of Federalist Paper No. 54
was James Madison.
Answer: Great question! We’ll cover the topic of priors in more detail in Chapter 6. But for
now, let’s see what happens when we use a different set of priors instead:
We considered these potential priors at the beginning of the chapter. Let’s use Bayes’
Theorem to calculate the posterior probability that the author was Alexander Hamilton
with these new priors:

Pr(Hamilton | 0.996) = (0.021 * 0.75) / (0.021 * 0.75 + 0.140 * 0.25) = 0.01575 / 0.05075 = 0.3103.
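This is a simple prior sensitivity check: hold the likelihoods fixed and vary the prior. The two prior choices below are the ones discussed earlier in the chapter (50:50, and 0.75:0.25 based on paper counts):

```python
def posterior_hamilton(prior_h, lik_h=0.021, lik_m=0.140):
    """Posterior Pr(Hamilton | data) for a given prior Pr(Hamilton)."""
    prior_m = 1.0 - prior_h  # two exhaustive, mutually exclusive hypotheses
    return (lik_h * prior_h) / (lik_h * prior_h + lik_m * prior_m)

for prior_h in (0.5, 0.75):
    print(prior_h, round(posterior_hamilton(prior_h), 4))
```

Even with a 0.75 prior favoring Hamilton, the data pull the posterior for Hamilton down to about 0.31, so Madison remains the better-supported hypothesis.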
[Figure: the prior distribution (0.75 for Hamilton, 0.25 for Madison) next to the resulting posterior distribution.]
Answer: Fantastic! The more information you have to calculate the likelihood, the better.
In this case, you would use this new information to get better estimates of the probability of
each author’s use of the word upon. Additionally, the discovery of more papers may
influence your choice of priors.
Answer: No, don’t confuse the prior probabilities for a set of hypotheses (which must sum
to 1.0) with the probability of the data. This can be confusing!
In Bayesian analysis, the sum of the prior probabilities across hypotheses, as well as the
sum of the posterior probabilities across hypotheses, must be 1.0. This is not true for the
likelihoods of observing the data under each hypothesis.
This is a very important concept, so let’s review our calculations to nail this point
(see Table 5.3).
Table 5.3
Table 5.3 shows our calculations for the analysis where the prior probabilities for the two
hypotheses were equal. You’ve seen these calculations already, but notice that the sum of
the priors is 1.000 (as required), while the sum of the likelihoods for the two hypotheses is
0.1610.
Let’s have a look at our second analysis, in which the priors were unequal (see Table 5.4).
Table 5.4
Once again, notice the sum of the likelihoods is 0.161, whereas the sum of the priors is 1.000.
Not exactly. They used a Bayesian inference approach, but they calculated their likelihoods
a bit differently than we did here. Typically, the analyst makes an assumption about the
distribution that gives rise to the data. Mosteller and Wallace were premier statisticians and
instead fit the histogram data to what is called a Poisson distribution. The Poisson
distribution is used to estimate the probability of an event occurring (such as the number
of upons per 1000 words, the number of births per 1000 females, or the number of car
crashes in 100 days at a corner), given the average rate of that event. We’ll cover the Poisson
probability distribution in depth later in the book. Here, we are simply trying to give you a
general understanding of how to quantify likelihood.
OK. To summarize, Mosteller and Wallace set out to determine the author of Paper No. 54.
This paper had a standardized rate of upons = 0.996 per 1000 words. They sought out the
posterior probability associated with two hypotheses:
• Pr(Hamilton | data) = what is the probability the author is Hamilton, given the rate of
upon is 0.996?
• Pr(Madison | data) = what is the probability the author is Madison, given the rate of
upon is 0.996?
To answer this, they assigned each hypothesis a prior probability of being true. Then,
they gathered data from manuscripts known to be penned by each author and used
this information to calculate the likelihood of observing the data under each
hypothesis:
• Pr(data | Hamilton) = what is the likelihood of the observed rate of 0.996, given the
author is Hamilton?
• Pr(data | Madison) = what is the likelihood of the observed rate of 0.996, given the author
is Madison?
Mosteller and Wallace used Bayes’ Theorem, which says there is a relationship between the
original conditional probabilities and their inversions, in order to generate updated beliefs
that each hypothesis is true:
Pr(Hamilton | 0.996) = [Pr(0.996 | Hamilton) * Pr(Hamilton)] / [Pr(0.996 | Hamilton) * Pr(Hamilton) + Pr(0.996 | Madison) * Pr(Madison)]    (5.25)
The use of Bayes’ Theorem in this way is known as Bayesian inference. The Merriam–
Webster dictionary defines inference as “the act or process of inferring.” It then defines
“inferred” or “inferring” as “deriving conclusions from facts or premises,” and “infer” as “to
form an opinion or reach a conclusion through reasoning and information.”
Bayesian inference is then the use of Bayes’ Theorem to draw conclusions about a set of
mutually exclusive, exhaustive, alternative hypotheses by linking prior knowledge about
each hypothesis with new data. The result is an updated probability for each hypothesis of
interest.
How does this problem differ from the Breast Cancer Problem in the
last chapter?
CHAPTER 6
Now that you’ve had a taste of what Bayesian inference is all about, we’ll start on a new
example and explore two key features of Bayesian analysis: assigning priors and estimating
likelihoods. One of the more controversial aspects of Bayesian inference is the use of priors.
They are controversial, but they can’t be ignored.
In the frequentist notion of Bayes’ Theorem, the priors are just marginal probabilities. But
in Bayesian inference, the priors represent a priori probabilities that each alternative
hypothesis is correct, where a priori means “prior to data collection.” You cannot conduct a
Bayesian analysis without using a prior distribution. You must also collect some
information to estimate the likelihood of the data given a hypothesis.
By the end of this chapter, you should be able to define:
• Informative prior distribution
• Non-informative prior distribution
• Objective priors
• Subjective priors
• Prior sensitivity analysis
Let’s begin with a new example. This problem is taken from a short story called Absent
Treatment by P. G. Wodehouse (author of the famous Jeeves stories). In this story, the
main character and narrator, Reggie Pepper, is helping his friend Bobbie Cardew. Bobbie is in
a pinch because he forgot the date of his wife’s birthday. The wife, Mary, has had enough
and has ditched poor Bobbie. Let’s pick up the story with Mary’s send-off letter to Bobbie
and then listen in on an exchange between Reggie and Bobbie:
MY DEAR BOBBIE, I am going away. When you care enough about me to remember to
wish me many happy returns on my birthday, I will come back. My address will be Box
341, London Morning News. – Mary
What a great example! Here, our task is to use a Bayesian inference approach
to determine the month in which Mary was born. For the purposes of this
chapter, let’s suppose that Bobbie at least knows that Mary was born in the
year 1900.
To begin, we have 12 discrete hypotheses (n = 12). Remember that Bayes’ Theorem can be
expressed as:
Pr(Hi | data) = [Pr(data | Hi) * Pr(Hi)] / [Σ(j = 1 to n) Pr(data | Hj) * Pr(Hj)]    (6.1)
In words, the posterior probability for hypothesis i given the observed data (data) is written
Pr(Hi | data). It is equal to a proportion. The numerator consists of the likelihood of
observing the data under hypothesis i, which is written Pr(data | Hi), multiplied by the
prior probability of hypothesis i, which is written Pr(Hi). The denominator repeats this
process for all 12 hypotheses, and sums them. The symbol Σ means “sum”; in this case, it
could have been written Σ(j = 1 to 12). The indices run from j = 1 to 12 to indicate that there are
12 pieces of information to sum together. In short, i indicates a specific hypothesis, whereas
j is the index of summation and indicates a term in the denominator.
OK, here we go.
Step 1. Identify your hypotheses. There are 12 months in a year, and these represent the
alternative hypotheses. Note that the hypotheses are mutually exclusive and exhaustive.
They are mutually exclusive because Mary’s birth month cannot occur in both October and
November. They are exhaustive because Reggie and Bobbie considered every possible
outcome with respect to Mary’s birth month. Let’s get our nomenclature down:
Step 2. Express our belief that each hypothesis is true in terms of probabilities. We do this
because Bobbie and Reggie are uncertain about which hypothesis is correct.
• Pr(January) = prior probability that Mary’s true birth month is January
• Pr(February) = prior probability that Mary’s true birth month is February
• Pr(March) = prior probability that Mary’s true birth month is March
• Etc.
Now we assign prior probabilities that each hypothesis is true. You may be wondering how
to set priors for a problem in general. Let’s peek in on a famous article written by Eliezer
Yudkowsky for some insight:
Q. How can I find the priors for a problem?
A. Many commonly used priors are listed in the Handbook of Chemistry and Physics.
Q. Where do priors originally come from?
A. Never ask that question.
Q. Uh huh. Then where do scientists get their priors?
A. Priors for scientific problems are established by annual vote of the AAAS. In recent years the vote has
become fractious and controversial, with widespread acrimony, factional polarization, and several out-
right assassinations. This may be a front for infighting within the Bayes Council, or it may be that the
disputants have too much spare time. No one is really sure.
Q. I see. And where does everyone else get their priors?
A. They download their priors from Kazaa.
Q. What if the priors I want aren’t available on Kazaa?
A. There’s a small, cluttered antique shop in a back alley of San Francisco’s Chinatown. Don’t ask about
the bronze rat.
Ha ha! Seriously, now. When Bobbie and Reggie first started out, they had absolutely no
idea which month was the birth month. Consequently, they gave equal weight to each
hypothesis, which is 0.083. Remember that the sum of the probabilities that make up the
prior distribution must be 1.0, so each month has a probability of 1/12 = 0.083. The chart in
Figure 6.1 depicts the prior probability distribution; the prior probability of each and every
alternative hypothesis is provided.
Figure 6.1 The non-informative prior probability distribution: each of the 12 monthly hypotheses has a prior probability of 0.083.
These are called non-informative priors and result in equal probabilities for all
hypotheses; that is, each month has the same prior probability of being Mary’s birth
month. Other terms for the same concept include vague or diffuse priors. A SAS webpage
on Bayesian analysis states that “a prior distribution is non-informative if the prior is ‘flat’
relative to the likelihood function.” Wikipedia defines a non-informative prior as a prior
that “expresses vague or general information about a variable…The simplest and oldest rule
for determining a non-informative prior is the principle of indifference, which assigns
equal probabilities to all possibilities.”
However, Reggie and Bobbie might have used the information within the “When Were You
Born” books to set the priors differently. In that case, they would employ an informative
prior distribution. A SAS webpage states that “an informative prior is a prior that is not
dominated by the likelihood and that has an impact on the posterior distribution. If a prior
distribution dominates the likelihood, it is clearly an informative prior.” Wikipedia defines
an informative prior as a prior that “expresses specific, definite information about a vari-
able.” These could arise from expert opinion or from previous study.
We will return to this topic shortly.
Suppose Reggie and Bobbie believe that the February and May hypotheses are more likely
than the rest based on the information in “When Were You Born.” Their informative prior
distribution may have looked like the one shown in Figure 6.2, with the sum of the
probabilities across hypotheses equal to 1.00.
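Both prior choices can be written down explicitly. The informative values below are hypothetical: 0.20 for February and May and 0.06 for each remaining month is one guess at the shape of Figure 6.2, chosen so the probabilities sum to 1.00 and to match the January prior of 0.06 used in Step 5:

```python
months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]

# Non-informative prior: the principle of indifference gives 1/12 each.
flat_prior = {m: 1 / 12 for m in months}

# Hypothetical informative prior bumping February and May (as in Figure 6.2):
informative_prior = {m: 0.20 if m in ("Feb", "May") else 0.06 for m in months}

# Either way, a prior distribution must sum to 1.0 across all hypotheses.
assert abs(sum(flat_prior.values()) - 1.0) < 1e-9
assert abs(sum(informative_prior.values()) - 1.0) < 1e-9
```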
Figure 6.2 An informative prior probability distribution in which the February and May hypotheses receive more weight than the other months; the probabilities across hypotheses still sum to 1.00.
Answer: Great question! The analyst must select a prior distribution. A colleague notes
that “the most important principle of prior selection is that your prior should represent the
best knowledge that you have before you look at the data.” He notes that it is unjustified to
use default, ignorance, or other automatic priors if you have substantial information that
can affect the answer (the posterior). Of course, this assumes that your best knowledge is
well founded, which perhaps is not the case here!
Step 3. Gather the data. In this case, we have one datapoint: one woman named Mary born
in the year 1900.
Now that we have our data, Bayes’ Theorem can provide the posterior probability of her
birth month. Here’s Bayes’ Theorem again:
Pr(Hi | data) = [Pr(data | Hi) * Pr(Hi)] / [Σ(j = 1 to n) Pr(data | Hj) * Pr(Hj)]    (6.2)
For example, the posterior probability that Mary was born in January can be written:

Pr(January | data) = [Pr(data | January) * Pr(January)] / [Σ(j = 1 to 12) Pr(data | Hj) * Pr(Hj)].
Step 4. Now Reggie and Bobbie need to estimate the likelihood of observing the data
(1 Mary) for each monthly hypothesis. As we’ve said, figuring out how to estimate the
likelihood of observing the data under each hypothesis is often the trickiest part of a
Bayesian inference problem.
Answer: For this problem, we need the rate at which girls are named Mary per month. Let’s
suppose that we had data on the names of girls born each month in 1900 in an English
“shire” where Mary was born (see Table 6.1). These figures represent the data from which
we will calculate our likelihoods. Note that Mary—the subject of our story—is included
somewhere in this table, but we’re not sure where! (This doesn’t need to be the case,
however, for Bayesian analysis.)
Table 6.1
We can represent these raw data as a frequency histogram, with the months on the
x-axis and the total number of births on the y-axis (see Figure 6.3). Remember that a
frequency histogram is a distribution of the raw data. You can see that, in 1900,
the name Mary is a popular winter name, but not so much at other times of the year.
Figure 6.3 Frequency histogram of the number of girls named Mary born in each month of 1900.
Now we need to determine the likelihood of the data, assuming each hypothesis is true.
In this case, we need to determine how likely it is to observe a newborn named Mary in each
month. We need:
Notice that the likelihoods are conditional for each hypothesis! In Bayesian analyses,
the likelihood is interpreted here as the probability of observing the data, given the
hypothesis.
The likelihood of being named Mary in each month is just the total Marys divided by the
total births. The graph in Figure 6.4 provides the likelihood of being named Mary for each
month in the year 1900. Notice the y-axis is probability (or likelihood) and that the shape of
this graph is similar to our frequency distribution but not exactly like it, since we had to
adjust for the fact that each month had a different number of births. For instance, 20 Marys
were born in both April and May, but in April there were 1190 births while in May there
were only 862 births.
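A quick sketch of this arithmetic (Python; only the three months quoted above are included, and the remaining months from Table 6.1 would be handled identically):

```python
# Counts quoted in the text (Table 6.1 holds all 12 months)
marys = {"January": 57, "April": 20, "May": 20}        # girls named Mary
births = {"January": 1180, "April": 1190, "May": 862}  # all female births

# Likelihood of a newborn being named Mary, given each month
likelihood = {m: marys[m] / births[m] for m in marys}
for month, lik in likelihood.items():
    print(f"{month}: {lik:.4f}")
```

April and May have the same count of Marys (20), but April's larger number of total births (1190 versus 862) gives it the smaller likelihood.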
Now we have the likelihood of observing a baby named Mary for each month of the year.
It looks like October, November, and December are more popular when it comes to “Mary.”
Step 5. Use Bayes’ Theorem to compute the posterior probabilities Pr(January), Pr(February), Pr(March), and so on, given the data (a baby named Mary born in the year 1900, who grew up to marry Bobbie).
At this point, we combine the priors and the likelihoods to get a posterior probability that
each month is Mary’s birth month with Bayes’ Theorem. We’ll use the informative priors
based on the “When You Were Born” books (Figure 6.2).
Let’s start by calculating the posterior probability of the January hypothesis:

Pr(January | 1 Mary) = Pr(1 Mary | January) ∗ Pr(January) / Σ(j = 1 to 12) [Pr(1 Mary | month j) ∗ Pr(month j)]
[Figure 6.4. Likelihood of being named Mary for each month, 1900. Y-axis: Likelihood (0.00 to 0.10); x-axis: Month.]
The numerator focuses on the January hypothesis: the likelihood of observing 1 Mary given the January hypothesis, Pr(1 Mary | January), multiplied by the prior probability of the January hypothesis, Pr(January).
From the likelihood distribution in Figure 6.4, the likelihood of being named Mary
in January is 0.048 (57 Marys/1180 female births in January). We multiply this by
the prior probability of the January hypothesis, which is 0.06. The product of these two
terms is 0.0029.
The denominator is a different beast. Here, we need to compute the likelihood of observ-
ing 1 Mary under each hypothesis, multiply the result by its corresponding prior, and,
finally, sum the results. Here, a table specifying all 12 hypotheses can help (see Table 6.2).
Table 6.2
• The sums are provided in the bottom row of the table. The sum of the priors must be 1.0,
but this is not true for the likelihoods.
• The sum of the products (likelihood ∗ prior) is 0.0342. This is the denominator of Bayes’
Theorem for this problem, and it is fixed across the 12 hypotheses. What changes is the
numerator, and this depends on which hypothesis you are analyzing.
• The posterior probability for each month is then computed as the product divided
by the denominator. Note that the sum of the posterior probabilities across the hypoth-
eses is 1.0.
• Notice the prior probability was dominated by the February and May hypotheses. The
data show that most babies named “Mary” were born in October, November, or Decem-
ber; their likelihoods were higher. After observing the data, the posterior probabilities for
February and May were reduced, while the posterior probabilities for October, November,
and December were increased relative to the prior.
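The machinery of Table 6.2 boils down to one normalization step. Here is a minimal sketch, using three made-up hypotheses rather than the book's twelve months:

```python
def posterior(priors, likelihoods):
    # Bayes' Theorem across competing hypotheses:
    # posterior_i = prior_i * likelihood_i / sum_j(prior_j * likelihood_j)
    products = [p * l for p, l in zip(priors, likelihoods)]
    denominator = sum(products)  # fixed across all hypotheses
    return [prod / denominator for prod in products]

# Illustrative three-hypothesis example (not the book's 12-month table):
post = posterior([0.2, 0.3, 0.5], [0.10, 0.05, 0.02])
print([round(p, 3) for p in post])  # → [0.444, 0.333, 0.222]
```

Note that only the products change from hypothesis to hypothesis; the denominator is computed once and shared, and the resulting posteriors always sum to 1.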
Answer: Yes. The divisor is also called a normalizing constant. Here, we could calculate
it easily. But summing up the denominator is often a great challenge in Bayesian statistics.
As we’ll see later, in many problems the divisor is an intractable calculation…a topic we will
visit in future sections of this book.
Now, let’s look at Bobbie and Reggie’s prior and posterior distribution side by side (see
Figure 6.5).
[Figure 6.5. Prior and posterior distributions by month. Y-axis: Probability (0.00 to 0.30); x-axis: Month; legend: Prior, Posterior.]
Bobbie and Reggie started with an “informative” prior based on the “When You Were Born” books. They then collected data to calculate the likelihood of being named Mary in each month of the year 1900. They then used Bayes’ Theorem to confront each hypothesis with the data. That is, they calculated the posterior probability for each hypothesis in light of the data.
Adding in “Mary data” along with the “When You Were Born” priors won’t get Bobbie out of his pickle, but you can see that the posterior distribution is quite different from the prior distribution. At this point, Bobbie could use the posterior distribution as a prior distribution if other sources of information became available.
Answer: Let’s take a look. Here are the calculations again, this time with equal priors in
column 2 (see Table 6.3).
Table 6.3
Once again, notice that the sum of the prior probabilities across hypotheses is 1.00, as is the
sum of the posterior probabilities across hypotheses. For this example with a non-informative
prior, we noted earlier that the likelihood drives the results. You can see this in action here for
the January hypothesis (as an example). If you were only making use of the data, you would say that the probability that Mary was born in January is 0.0483/0.4821 = 0.1000. Here, the numerator is 0.0483, which is the likelihood of observing a Mary under the January hypothesis. The denominator is the sum of all 12 likelihoods. Thus, even if we dropped the prior altogether, we would get the posterior probability = 0.1000 as shown in the final column. This illustrates what we mean when we said “with a non-informative prior, the likelihood of the data drives the results.” However, with a Bayesian analysis, a prior distribution is required!
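We can verify the cancellation numerically. A small sketch with made-up likelihoods (not the Table 6.3 values):

```python
import math

liks = [0.10, 0.05, 0.02, 0.03]  # hypothetical likelihoods, 4 hypotheses
flat_prior = 1 / len(liks)       # non-informative: equal weight

# Full Bayes' Theorem with the flat prior...
products = [flat_prior * l for l in liks]
full = [x / sum(products) for x in products]

# ...matches the likelihoods simply renormalized, prior dropped:
renorm = [l / sum(liks) for l in liks]
print(all(math.isclose(a, b) for a, b in zip(full, renorm)))  # → True
```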
A graph of the prior distribution and posterior distribution looks like the one shown in
Figure 6.6.
[Figure 6.6. Prior (flat) and posterior distributions by month. Y-axis: Probability (0.00 to 0.30); x-axis: Month; legend: Prior, Posterior.]
Now, let’s compare the two posterior distributions, where one used an informative prior
and one used a non-informative prior. The chart in Figure 6.7 shows the posterior
distributions for the “non-informative” (flat) versus “informative” (When You Were
Born) priors that we’ve looked at thus far, with the non-informative results shown in blue:
[Figure 6.7. Posterior distributions under the informative versus the non-informative prior. Y-axis: Probability (0.00 to 0.30); x-axis: Month; legend: Informative Prior, Non-Informative Prior.]
Answer: Yes, this is why the selection of priors is important. If you have some information
at your disposal (assuming it is credible), you should use it. However, you should be able to
justify it.
The chart in Figure 6.7 is an example of a prior sensitivity analysis (Berger et al., 2000).
We are comparing two posterior distributions and assessing the effect of each prior distri-
bution’s influence on the posterior, which in turn affects our conclusions. A prior sensitivity
analysis is just one tool in the Robust Bayesian Analysis toolkit.
The word robust is defined as “strong and healthy; vigorous; sturdy in construction;
strong and effective in all or most situations and conditions.” According to Wikipedia,
“Robust Bayesian Analysis, also called Bayesian sensitivity analysis, investigates the robust-
ness of answers from a Bayesian analysis to uncertainty about the precise details of the
analysis” (article accessed August 15, 2017).
Are there other times when the prior drives the results?
Answer: Well, think about it. What if 1000 girls were born each month, and 15 of them were always named Mary? We have observed a person who is named Mary. How likely is it that she was born in January? February? The answer is that the likelihoods are identical for each hypothesis. In this case, the posterior distribution will be identical to the prior distribution!
In Bayesian analysis and scientific deduction, a primary goal of the analyst is to collect data that will discriminate among the hypotheses.
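A quick numerical check of this claim, using hypothetical priors for three hypotheses (any prior summing to 1 behaves the same way):

```python
import math

priors = [0.2, 0.3, 0.5]  # hypothetical priors summing to 1
lik = 15 / 1000           # 15 Marys per 1000 births -- same every month

# The identical likelihood cancels out of Bayes' Theorem entirely,
# so the posterior simply reproduces the prior
products = [p * lik for p in priors]
posteriors = [x / sum(products) for x in products]
print(all(math.isclose(p, q) for p, q in zip(priors, posteriors)))  # → True
```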
Answer: The tricky part comes into play when you really don’t have any information to
set the prior and are trying to be as objective as possible. That is, you want to level the
playing field so that all hypotheses have equal weight. The idea behind these approaches is
to select a prior that has absolutely no bearing on the posterior; that is, you would draw
similar conclusions from a statistical analysis that is not Bayesian. We’ve demonstrated
such an approach when you have discrete hypotheses such as the birthday problem.
However, setting a purely uninformative prior is no small task. In the words of Zhu and
Lu (2004), “flat priors are not necessarily non-informative, and non-informative priors are
not necessarily flat.” We are getting a bit ahead of ourselves though. For now, Hobbs and
Hooten (2015) present friendly and informative thoughts on setting priors.
Answer: Subjectivity generally is associated with a belief. In a sense, all priors are
subjective because the analyst must select one and, in doing so, exercises subjectivity.
The term objective prior normally refers to a prior that can be justified by rationality
and consistency. The general goal in using an objective prior is that different analysts with
the same information should arrive at the same conclusions.
This is well beyond our scope here, but we encourage you to dig deeper into this topic, as
it is fodder for a rich discussion.
Answer: Eventually, Bobbie remembered that he took Mary to a play for her birthday.
After a bit of detective work, he determined Mary’s birthday was May 8th.
And they lived happily ever after.
Answer: Yes, it really is! We have one more problem in Bayesian inference in this section,
and we think you’ll enjoy it. See you soon.
CHAPTER 7
Although you see this portrait frequently associated with Thomas Bayes, there is quite a
bit of dispute about whether it is really him (the Thomas Bayes). In this chapter, we’ll use a
Bayesian inference approach to address the problem. In the spirit of fun, the data used for
our example will be completely fabricated.
A key concept in this chapter is that you can combine multiple sources of data in a
Bayesian inference framework. By the end of this chapter, you will be familiar with the
following terms:
• Joint likelihood
• Frock coat
• Parson’s wig.
Let’s return to the portrait. The editors of the IMS Bulletin, a news source from the Institute
of Mathematical Statistics, posted the picture of “Thomas Bayes” shown above and offered
this challenge to its readers:
Bayesian Statistics for Beginners: A Step-by-Step Approach. Therese M. Donovan and Ruth M. Mickey,
Oxford University Press (2019). © Ruth M. Mickey 2019.
DOI: 10.1093/oso/9780198841296.001.0001
The person submitting the most plausible answer was to receive a prize!
The image is taken from a 1936 book called History of Life Insurance by Terence
O’Donnell (American Conservation Co., Chicago). What an exciting book this must be!
The image appears with the caption “Rev. T. Bayes: Improver of the Columnar Method
developed by Barrett.” It’s not clear who added the caption or where the image came from.
Answer: According to the website, “The most plausible answer received in the Bulletin
Editorial office is from Professor David R. Bellhouse, University of Western Ontario, Lon-
don, Ontario, Canada. Professor Bellhouse wrote:
“The picture in The IMS Bulletin is supposedly of Thomas Bayes, who died in 1761 aged 59 and so was born
in 1701 or 1702. I have added “supposedly” since I believe that the picture is of doubtful authenticity. There
are three clues in the picture which lead me to this conclusion . . . For the purpose of comparison, consider the
pictures [see Figure 7.2] of three other Nonconformist ministers: on the left Joshua Bayes, Thomas’es (sic)
father (d. 1746); in the middle Richard Price (this portrait is dated 1776), who read Bayes’ paper before the
Royal Society; and on the right Philip Doddridge (1702–1751), who was a friend of Bayes’ brother-in-law,
Thomas Cotton.”
Figure 7.2 (Left) Joshua Bayes; (middle) Richard Price; (right) Phillip Doddridge.
Bellhouse continues: “The first thing to note in this picture is the apparent absence of a wig, or if a wig is
present, it is definitely the wrong style for the period. It is likely that Bayes would have worn a wig similar to
Doddridge’s, which was going out of fashion in the 1740s, or a wig similar to Price’s, which was coming into
style at the same time. The second thing to note is that Bayes appears to be wearing a clerical gown like his
father or a larger frock coat with a high collar. On viewing the other two pictures, we can see that the gown is
not in style for Bayes’ generation and the frock coat with a large collar is definitely anachronistic. Finally,
Price is wearing a stock or wide collar on his shirt which appears around his neck in the picture; this was
fashionable from about 1730 to 1770. Since Doddridge, one generation younger, appears without any stock
or shirt collar, it is questionable whether Bayes would have worn a stock. However, the nineteenth century-
looking clerical collar in this picture is again anachronistic. For reference, I have used C. Willett Cunnington
and P. Cunnington, Handbook of English Costume in the Eighteenth Century, pub. Faber & Faber, London,
1964.”
So, Professor Bellhouse made his arguments based on fashion, which included the use
of wigs. In case you were wondering, a frock coat is a man’s double-breasted, long-skirted
coat, now worn primarily on formal occasions. And, in case you were wondering, a
Nonconformist is a Protestant Christian who did not “conform” to England’s Act of
Uniformity of 1662. Incidentally, Dr. Bellhouse wrote a biography of Thomas Bayes to
celebrate the tercentenary of his birth, which you can download here.
So, how can we determine the probability that the man in the
photo is Thomas Bayes?
Answer: We are asked to determine the probability that the man pictured is actually
Thomas Bayes. Why not use a Bayesian inference approach to tackle this problem?
Remember that Bayes’ Theorem can be expressed as:
Pr(Hi | data) = Pr(data | Hi) ∗ Pr(Hi) / Σ(j = 1 to n) [Pr(data | Hj) ∗ Pr(Hj)]   (7.1)
In words, the posterior probability for hypothesis i is the likelihood of observing the data under hypothesis i multiplied by the prior probability of hypothesis i, divided by the sum of the likelihood times the prior across all hypotheses from j = 1 to n.
Where do we begin? We asked our colleague Anne Clark, who studies medieval saints at
the University of Vermont, “How do you determine the authenticity of things that
happened so long ago?” She replied, “You look for clues within the objects you have
in hand.”
If we had the portrait in hand, we could do things like analyze the paint.
Paint?
Answer: In our portrait, “Thomas” might be roughly 35–45 years old. If Thomas Bayes
was born in 1701 or 1702, the portrait would have been painted in the mid-1700’s, or mid-
18th century (just prior to the American Revolution). The white paint that was used in this
period is called “lead white.”
And how can lead white help us with dating the Thomas
Bayes’ portrait?
Answer: Lead white is a paint that includes, well, lead! Wikipedia notes that white lead
dating is a technique that is often used to pinpoint the age of an object up to 1,600
years old.
Answer: Umm, no. Presumably the actual portrait is long gone, and we’re
stuck with the printed image in O’Donnell’s book. But, we can get started
anyway . . . there are many other clues in the printed image that we can use.
Now, let’s get started.
We’ll use the same steps for Bayesian analysis that we have in previous chapters:
Answer: For this problem, we’ll consider just two hypotheses: the portrait is of Thomas
Bayes, and the portrait is not of Thomas Bayes.
Step 2. What are the prior probabilities that each hypothesis is true?
Answer: Do you recall the author problem in Chapter 5, where we used a Bayesian
inference approach to determine the probability that an unsigned Federalist Paper was
penned by Alexander Hamilton or James Madison? The data in that problem were the
words in the paper itself, and we focused on the frequency of the word upon in the
unsigned document. For this problem, we’ll consider the following sources of data that
are contained with this portrait of “Thomas Bayes”:
• Wigs! (or the lack thereof). We know that most ministers in the 1700’s wore wigs. But not all
of them! If all ministers wore wigs, there would be no chance that this portrait was of Thomas
Bayes. But maybe Thomas was a non-conforming Nonconformist who refused to wear a wig.
• Similarity of “Thomas Bayes” to Joshua Bayes. How much do you look like either of your
birth parents? In some cases, there are striking similarities, but not so much in other cases.
We don’t have the luxury of DNA testing, but suppose we can measure characteristics such as
eyebrow shape, nose length, forehead length, and so on of Joshua Bayes (from a known
portrait of Thomas Bayes’ father) and “Thomas Bayes,” and come up with a similarity index.
Suppose this index is an integer that ranges between 0 and 100, where 0 indicates absolutely
no similarity in any of the measured features, and 100 is an identical clone. Thus, the more
similar that Thomas is to Joshua, the more evidence we have that the man is Thomas Bayes.
So, we will be investigating two lines of evidence that we will assume are independent of
each other (the results of one line of reasoning will not affect the results of a second line of
reasoning).
OK, then. What are the observed data with respect to wigs?
Answer: The data here are either 0 or 1, where 0 indicates a wig is absent, and a 1 indicates
a wig is present. The man is clearly not wearing a wig, so the single observation
for this line of reasoning is 0.
The score is an overall similarity between the two men, based on things like eyebrow shape,
nose width, mouth shape, and any other facial feature that we can measure from the images.
Perhaps if “Thomas” were wearing a wig, it might be easier to focus on the facial features only.
Let’s give him one (see Figure 7.4)!
At first glance, the eyebrows and nose have some resemblance, but the eyes and mouth
are off. Perhaps “Thomas” inherited these traits from his mother!
For this problem, let’s assume the similarity index = 55.
Answer: For this problem, we’re sticking with the following data that we can observe from
the portrait itself:
• Wigs = 0
• Similarity = 55.
Answer: Good idea. We can’t directly calculate the likelihood that the man in the portrait
is Thomas Bayes, but we can calculate the likelihood that a middle-aged man who sits for a
portrait in the 1750’s is a minister who does not wear a wig.
Suppose we were able to collect this information on the population of interest (middle-
aged men who sit for a portrait in the 1750’s; see Table 7.1).
Table 7.1
Here, we show just the first 10 records of a 100-record dataset. We will refer to this as dataset 1. The first column identifies a person, the second identifies whether the person is a minister (where 1 = minister), and the third column identifies whether the person was wearing a wig (where 1 = wig).
Table 7.2
We can then summarize the results of all 100 of our records as a two-way table (see
Table 7.2).
Notice that this table has four “quadrants,” so to speak. Our actual data are stored in the
upper left quadrant, shaded the darkest blue. The upper right quadrant (shaded lighter) sums
the number of ministers (10) and non-ministers (90) in our dataset. The lower left quadrant
(shaded lighter) sums the number of men that wore a wig while sitting for a portrait (23) and
those that did not (77). The lower right quadrant (no shading) gives the grand total, 100.
Now let’s plot the results in the same Venn diagram (see Figure 7.5).
[Figure 7.5. Venn diagram of the 100 men: 8 are ministers who wear wigs, 2 are ministers without wigs, 15 wear wigs but are not ministers, and 75 are neither ministers nor wig-wearers.]
The box (which is not to scale) holds all 100 individuals in our dataset. Here, we can see
that 23% of the men wear wigs while sitting for a portrait. We can also see that 10% of all of
the men are ministers. And only 2% of the men sampled are ministers who do not
wear wigs.
Now, let’s ask how likely it is to observe our data, no wig, under the Thomas Bayes hypothesis. The answer is 0.02 (2/100 = 0.02), which is the joint probability that a man who sits for a portrait in the mid-1700’s is a minister and does not wear a wig.
OK, now let’s ask how likely it is to observe our data, no wig, under the Not
Thomas Bayes hypothesis. This is trickier because the man in the portrait could be just
about anyone. The answer here is 0.77, which is the marginal probability that a man who
sits for a portrait in the mid 1700’s will not wear a wig. That is the best we can do.
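Both likelihoods fall straight out of the cell counts in Table 7.2; a short sketch:

```python
# Cell counts from Table 7.2 / Figure 7.5 (100 portrait sitters)
minister_wig, minister_no_wig = 8, 2
other_wig, other_no_wig = 15, 75
total = minister_wig + minister_no_wig + other_wig + other_no_wig

# Thomas Bayes hypothesis: joint probability of minister AND no wig
lik_tb = minister_no_wig / total

# Not Thomas Bayes hypothesis: marginal probability of no wig
lik_not_tb = (minister_no_wig + other_no_wig) / total

print(lik_tb, lik_not_tb)  # → 0.02 0.77
```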
Answer: Remember that we calculated a similarity score between “Thomas Bayes” and
Joshua Bayes, with a result of 55. Now we move to the question, what is the
likelihood of observing a similarity score of 55 under each hypothesis? Suppose
we are able to collect similarity scores between 1000 known fathers and sons and also
between 1000 pairs of unrelated men in England. The first 10 records of our dataset might
look something like that shown in Table 7.3.
Table 7.3
The first column identifies a pair, the second identifies whether the pair of males have a father–son relationship (where 1 = Yes), and the third column identifies their similarity score. We will refer to this as dataset 2; note that it is separate from our wig dataset, or dataset 1.
It might be useful to graph the data, where you can see that unrelated men tend to have
smaller relatedness scores, but that there are some scores (such as our observed score of 55 in
black) that can belong to either group (see Figure 7.6).
[Figure 7.6. Frequency histograms of similarity scores for related (father–son) and unrelated pairs, with the observed score of 55 marked in black. Y-axis: Frequency (0 to 60); x-axis: Similarity Score (1 to 97); legend: Related, Unrelated, Observed.]
Answer: Well, we don’t want to know the exact probability of observing 55. What we
really want to find is where our observed result falls in relation to the rest of a given
distribution. For instance, a score of 55 falls in the right-hand tail of the unrelated
distribution (left), but it falls in the left-hand portion of the related distribution
(right). To make these comparable, we need to quantify the area of each distribution
that is to the right of our observed similarity score of 55. Thus, 55 is our minimum score
of interest.
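These tail areas can be computed directly from raw scores. A sketch with short, made-up score lists (the book's datasets contain 1000 pairs each):

```python
def tail_fraction(scores, observed):
    # Fraction of scores at or above the observed similarity score
    return sum(s >= observed for s in scores) / len(scores)

related = [48, 52, 55, 60, 63, 70, 41, 58, 66, 72]    # hypothetical
unrelated = [12, 20, 25, 31, 18, 27, 35, 22, 55, 16]  # hypothetical
print(tail_fraction(related, 55), tail_fraction(unrelated, 55))  # → 0.7 0.1
```

With the book's full datasets, the same function would return the 0.69 and 0.01 tail areas used below.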
Answer: If our two data sources are independent, we can simply multiply the two likeli-
hood components together:
For the Thomas Bayes hypothesis, that would be
• evidence from wigs = 0.02 (the joint probability that a man who sits for a portrait in the mid-1700’s is a minister who does not wear a wig)
• evidence from similarity score = 0.69 (69% of known father–son similarity scores are at least 55).
Thus, the likelihood of observing the data under the Thomas Bayes hypothesis is the product of these terms: 0.02 ∗ 0.69 = 0.0138.
For the Not Thomas Bayes hypothesis, that would be
• evidence from wigs = 0.77 (the marginal probability that a man who sits for a portrait in the mid-1700’s will not wear a wig)
• evidence from similarity score = 0.01 (1% of unrelated men have similarity scores of at least 55).
Thus, the likelihood of observing the data under the Not Thomas Bayes hypothesis is the product of these terms: 0.77 ∗ 0.01 = 0.0077.
Answer: Just use Bayes’ Theorem! Here, we have two hypotheses, so Bayes’ Theorem can
be written as:
Pr(Hi | data) = Pr(data | Hi) ∗ Pr(Hi) / Σ(j = 1 to 2) [Pr(data | Hj) ∗ Pr(Hj)]   (7.4)
Let’s assume the prior probabilities are 0.5 and 0.5. Then, the posterior probability that
the man in the portrait is Thomas Bayes is shown in Table 7.4.
Table 7.4
• Row 1 of the table is the Thomas Bayes hypothesis, and row 2 is the Not Thomas Bayes
hypothesis.
• Column 1 provides the priors, which we set at 0.5 and 0.5. Remember that these must
sum to 1.0.
• Column 2 is the joint likelihood of observing the two pieces of information we observed
under each hypothesis:
• wigs = 0
• similarity score of at least 55.
• Column 3 multiplies the prior times the likelihood, which gives us the numerator of
Bayes’ Theorem for each hypothesis.
• The sum of column 3 is the denominator of Bayes’ Theorem (the sum of the products of
the priors times the likelihoods).
• Column 4 calculates the posterior for each hypothesis as the prior ∗ likelihood column
divided by the denominator. Note that the two posteriors sum to 1.0.
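The arithmetic of Table 7.4 can be checked in a few lines, using the joint likelihoods assembled earlier:

```python
# Joint likelihoods: the two independent lines of evidence multiply
lik_tb = 0.02 * 0.69   # minister-without-wig * father-son tail area
lik_not = 0.77 * 0.01  # no-wig marginal * unrelated tail area

priors = [0.5, 0.5]
products = [priors[0] * lik_tb, priors[1] * lik_not]
denominator = sum(products)  # denominator of Bayes' Theorem
posteriors = [p / denominator for p in products]
print([round(p, 2) for p in posteriors])  # → [0.64, 0.36]
```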
So, for the inputs for this fictitious problem, the posterior probability that the
man in the portrait is Thomas Bayes is 0.64, and the posterior probability that
it is someone else is 0.36. Two lines of evidence were used in this Bayesian
analysis.
Don’t forget that the data are fictitious! A graph of our prior and posterior distributions
can now be shown in Figure 7.7.
[Figure 7.7. Prior and posterior probabilities for the two hypotheses. Y-axis: Probability of Hypothesis (0.0 to 1.0); x-axis: Prior Distribution, Posterior Distribution; legend: Thomas Bayes, Not Thomas Bayes.]
Answer: It depends. As the analyst, you would need to make an argument to justify the
calculation of the joint likelihood. Here, we are claiming that our two datasets are inde-
pendent and that both contribute to the likelihood calculations.
Answer: Yes, you can. The more lines of reasoning you have, the more information you
can bring to bear on the likelihood calculations and, ultimately, on the comparison of your
hypotheses.
Answer: This is a lot trickier. For instance, we could ask about the use of frock coats in
portraits for the same time period. If we scored the wearing of frock coats along with the
use of wigs in dataset 1, we may confound the analysis because frock coats and wigs appear
to go hand in hand. In addition, if you had a completely different dataset that provided
similar information with respect to father–son similarity, adding it into the likelihood
mix could spell trouble if they are not independent. These situations are beyond our
scope here, but you might want to keep them in mind.
Answer: Loads of them! This problem is meant to help you understand that Bayes’
Theorem can easily incorporate multiple sources of data, but each piece of information
that you consider must be carefully thought out. We have made multiple assumptions here,
including
Answer: The main take-home point is that Bayesian analysis can be very, very flexible. As
long as you can compute the likelihood of observing the data under each hypothesis, you
are golden. However, this is not always an easy task! But, assuming it can be done, you can
set up almost any kind of problem as a Bayesian inference problem.
Answer: We were really interested in the question of whether the portrait really is of Thomas Bayes, so we went ahead and purchased a copy of The History of Life Insurance by Terence O’Donnell to explore this further. There it was—the picture—on p. 335. The
picture does indeed have the caption “Rev. T. Bayes, Improver of the Columnar Method
developed by Barrett.” We searched high and low for any writing on Thomas Bayes, but
could not find anything in the book.
George Barrett, who proposed the columnar method in the calculation of annuities (insurance), lived from 1752 to 1821. A Wikipedia author notes that “Barrett was the son of
a farmer of Wheeler Street, a small hamlet in Surrey. At an early age, although engaged in
daily labour, he made, unaided, considerable progress in mathematics, taking special
interest in the class of problems connected with the duration of human life. He afterwards,
during a period of twenty-five years (1786–1811), laboured assiduously at his great series of
life assurance and annuity tables . . . to whose support he devoted a great part of his
earnings.” O’Donnell himself weighs in by stating, “In 1810, George Barrett presented to
the Royal Society a new mode of calculating life Annuities. The old and honorable society
refused to publish the contribution, no doubt considering his deductions too reactionary.
However in spite of the frown of the Society, [Francis] Baily gave Barrett’s findings to the
world in the appendix of his very valuable work on Annuities and it immediately began to
influence sound actuarial thought of the time.”
Given that Thomas Bayes lived from around 1701 to 1761, it is hard to see how Bayes
could have improved Barrett’s method. When Bayes died, Barrett was 9 years old!
“Rev. T. Bayes” may have improved Barrett’s method, but it probably was not the Reverend
Thomas Bayes we’ve come to know and love.
What’s next?
Answer: This ends our section on Bayesian inference. In Section 3, we begin our journey
into the world of probability distributions, which opens up many new possibilities for using
Bayes’ Theorem.
SECTION 3
Probability Functions
Overview
Welcome to Section 3! Now that you’ve had a taste of what Bayesian inference is about, it’s
time to take Bayes’ Theorem to the next level. A major use of Bayes’ Theorem involves
parameter estimation, and, as such, it is critical that you have a firm understanding of
parameters and variables, and how they are related in a probabilistic way.
This section consists of only two chapters.
• Chapter 8 introduces probability mass functions (pmf). This chapter introduces the idea
of a random variable and presents general concepts associated with probability distribu-
tions for discrete random variables. The binomial and Bernoulli distributions are used as
examples of these probability mass functions (pmf ’s). The pmf ’s can be used to specify
prior distributions, likelihoods, and/or posterior distributions in Bayesian inference.
• Chapter 9 introduces probability density functions (pdf). The focus is on general con-
cepts associated with probability density functions (pdf ’s), which are distributions asso-
ciated with continuous random variables. The continuous uniform and normal
distributions are highlighted as examples of pdf ’s. These and other pdf ’s can be used
to specify prior distributions, likelihoods, and/or posterior distributions in Bayesian
inference.
A solid understanding of both pmf ’s and pdf ’s is absolutely essential if you are to become a
Bayesian analyst. In Bayesian analyses, we hypothesize about the values of unknown
parameters. In doing so, we regard the unknown parameter as a random variable generated
from some probabilistic distribution. In particular, we use probability functions to specify:
• prior distributions
• likelihood functions
• posterior distributions.
These chapters pave the way to Sections 4 and 5. Take your time with this material as it is
essential for your Bayesian journey.
CHAPTER 8
Hopefully you have a sense now of Bayes’ Theorem and how Bayesian inference is used to
update our belief in alternative hypotheses. One of the primary uses of Bayesian inference is
to estimate parameters. To do so, we need to first build a good understanding of prob-
ability distributions.
This chapter will present general concepts associated with probability distributions. We’ll
use the binomial and Bernoulli distributions as examples of probability mass functions
(pmf ’s). In Chapter 9, we’ll use the continuous uniform and normal distributions as
examples of probability density functions (pdf ’s). Although the two chapters will
contain only a few examples, the general concepts apply to other probability distributions.
These are fairly long chapters, and it may take a few readings for the material to sink in if
this is new material for you . . . so make sure to get up and stretch every now and then!
By the end of this chapter, you will be able to define and use the following concepts:
• Function
• Random variable
• Probability distribution
• Parameter
• Probability mass function (pmf)
• Binomial pmf
• Bernoulli pmf
• Likelihood
• Likelihood profile
Since this chapter is about functions, it makes sense to start by asking the following
question:
What is a function?
Answer: In math, a function relates an input to an output, and the classic way of writing
a function is
f(x) = . . .   (8.1)

[Figure 8.1. The function f(x) = x².]
The function name is f. You can name your function anything you want, but f is a
commonly used name. The inputs go within the parentheses. Here, we have a single
input called x. To the right of the equals sign is a set of instructions that tell us what to do
with the input to produce the output. In this example, we input a value for x; the
instructions tell us that we should square x to give us the output.
In other words, f(x) maps the input x to the output. We can visualize this as shown in
Figure 8.2.
x → f → x²
Figure 8.2
For example, if you would like to convert Fahrenheit to Celsius, you can use the function:
C = (F − 32) × 5/9.   (8.2)
If you input 80 degrees Fahrenheit to this function, the output would be:
C = (80 − 32) × 5/9 ≈ 26.7 degrees Celsius.   (8.3)
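Incidentally, if you'd like to try this on a computer, the conversion can be written as a short Python sketch (the book itself works everything by hand; Python is just one convenient choice, and the function name is ours):

```python
def fahrenheit_to_celsius(f):
    """Convert degrees Fahrenheit to degrees Celsius."""
    return (f - 32) * 5 / 9

# Input 80 degrees Fahrenheit; the output is about 26.7 degrees Celsius
print(round(fahrenheit_to_celsius(80), 1))  # 26.7
```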
We are now positioned to introduce the term “variable.” Dictionary.com defines variable as
“capable of being varied or changed.” This is certainly true of temperatures on Planet Earth:
temperatures vary over time and space. Conventionally, variables are given a name, such as
T or C, where T represents temperature in degrees Fahrenheit, and C represents temperature
in Celsius. While the capital letters provide the name of the variable, specific values are
often indicated with lower-case letters (e.g., t = 80 and c = 26.7).
The Oxford Dictionary of Statistics (Upton and Cook, 2014) states that a variable is “the
characteristic measured or observed when an experiment is carried out or an observation is
made. Since a non-numerical observation can always be coded numerically, a variable is
usually taken to be numerical.” In other words, capital T or C is the characteristic measured,
and its value is usually numeric.
What is a random variable?
Answer: The Oxford Dictionary of Statistics (Upton and Cook, 2014) tells us, “When the
value of a variable is subject to random variation, or when it is the value of a randomly
chosen member of a population, it is described as a random variable—though the adjective
‘random’ may be omitted.”
Answer: Yes. Suppose you flip a fair coin three times, and record each head as “H” and
each tail as “T.” Let’s call three coin flips an experiment and start by listing all possible
outcomes:
• HHH
• THH
• HTH
• HHT
• TTH
• THT
• HTT
• TTT
With two possible results (H or T), and three flips, there are 2³ = 8 possible outcomes. Count
them! These outcomes make up our sample space. If this doesn’t ring a bell, see Chapter 1
for a quick review.
Clearly, what you observe after 3 flips is determined at random. In this particular case, the
outcome of an experiment is not a number. It’s something like “HHT.” So, let’s define a
random variable called Y, where Y is the number of heads. Y can assume the values of 0,
1, 2, or 3. We will let lower-case y represent a particular value from a coin toss
experiment (see Table 8.1).
Table 8.1
Experiment Results
Outcome   HHH  THH  HTH  HHT  TTH  THT  HTT  TTT
y           3    2    2    2    1    1    1    0
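If you'd rather let a computer do the counting, here is a short Python sketch (ours, not from the book) that enumerates the sample space and tallies the values of Y:

```python
from itertools import product
from collections import Counter

# All 2^3 = 8 outcomes of three coin flips (H = heads, T = tails)
outcomes = ["".join(flips) for flips in product("HT", repeat=3)]

# The random variable Y maps each outcome to its number of heads
freq = Counter(outcome.count("H") for outcome in outcomes)

# Probability of each value of Y, assuming the 8 outcomes are equally likely
probs = {y: freq[y] / len(outcomes) for y in sorted(freq)}
print(probs)  # {0: 0.125, 1: 0.375, 2: 0.375, 3: 0.125}
```

These are exactly the frequencies and probabilities tabulated below in Table 8.2.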
If we were to flip a coin three times, we don’t know with certainty whether its value, y,
will be 0, 1, 2, or 3. But we know that it has to be one of these values. That is, our random
variable is discrete—and it can take on a finite number of values. We can get an idea of
what the value could be if we assigned probabilities to each. Incidentally, discrete random
variables can also assume a countably infinite number of values, such as 0, 1, 2, . . . , to
infinity and beyond.
Answer: Yes! A random variable is a function with inputs that are outcomes of an
experiment (like “HHT”) and outputs that are numerical (like “2”). In many cases, the
inputs are already numerical, so the inputs are the same as the outputs.
Answer: We’re often interested in knowing the probability of observing particular out-
comes. This brings us to the all-important topic of probability theory.
Answer: If each of the 8 possible outcomes listed in Table 8.1 is equally likely (the coin is
fair), then we can count how many times the event y = 3 appears out of 8 total. It appears
once, so:
Pr(Y = 3) = 1/8 = 0.125.   (8.4)
The notation Pr indicates probability. This is the probability that a random variable named
Y will take on the value y = 3. The answer is 0.125.
The next step moves us from the probability of a specific value of Y, such as y = 3, to all
possible values of Y, which can be seen in the header of Table 8.2. The possible values of
Y are 3, 2, 1, and 0. The number of times each of these values appears out of the 8 events is
given in row 1, and the probability is given in row 2. Notice that the sum of the
probabilities of observing all possible values of Y is 1.0.
Table 8.2
y            0      1      2      3
Frequency    1      3      3      1
Pr(Y = y)    0.125  0.375  0.375  0.125
Is this a probability distribution?
Answer: Yes, it is. From the Oxford Dictionary of Statistics (Upton and Cook, 2014):
“Probability distribution: A description of the possible values of a random variable, and of
the probabilities of occurrence of these values.”
We divided our sample space into subsets by defining the variable Y, which is the number
of heads that can possibly appear with three flips of a coin, and assigned a probability to each.
Answer: OK then! Each experiment now contains 10 flips. We’d do precisely the same thing.
We’d write out all possible outcomes, determine the associated values of Y, and then assign
to each a probability. Y is still the number of heads, and Y can assume values 0, 1, 2, . . . , 10.
For instance, we could end up with 10 heads, which would be HHHHHHHHHH. Or we
could end up with 0 heads (10 tails), which would be TTTTTTTTTT. Or we could end up
with 5 heads and 5 tails, and every combination in between. With two outcomes per flip
(H or T), our sample space would consist of 2¹⁰ = 1024 possible outcomes!
Remember, our variable Y can take on values of y = 0, 1, 2, . . . , 10. Here are a few
examples:
• HHHHHHHHHH → 10
• TTTTTTTTTT → 0
• HTHTHTHTHT → 5
• HHHHHTTTTT → 5
Go ahead and list all of these outcomes and assign a probability to each. There
are only 1024 calculations, so it shouldn’t take too long.
Really?
Answer: Ha ha! Gotcha! Can you see what a chore that would be?
We need a function that will let us easily compute, for example, the probability of
observing y ¼ 5, or observing 5 heads in 10 coin flips. The binomial probability mass
function can aid us with this task.
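As a preview, here is a Python sketch (ours, not from the book) that uses the binomial pmf, introduced as Equation 8.5 below, to get this answer without listing a single outcome; the helper name binomial_pmf is our own:

```python
from math import comb  # comb(n, y) is the binomial coefficient "n choose y"

def binomial_pmf(y, n, p):
    """Probability of y successes in n trials, each with success probability p."""
    return comb(n, y) * p**y * (1 - p)**(n - y)

# Probability of exactly 5 heads in 10 flips of a fair coin
print(binomial_pmf(5, 10, 0.5))  # 0.24609375, i.e., 252/1024
```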
• bi means “two”
• nomial means “name”
So, the word “binomial” means “two names.” In probability theory, it generally means
“two outcomes.” All binomial problems are composed of trials that have only two
possible outcomes (e.g., success or failure, live or die, heads or tails, present or absent).
Traditionally, we label one outcome a “success,” and the other a “failure.” For this
problem, let’s call heads a “success,” which means that tails is a “failure.” It doesn’t
matter which outcome you label a “success” and which you label a “failure,” as long as
you are clear about your choice!
Answer: The binomial probability function is widely used for problems where there are a
fixed number of independent tests or trials (designated n) and where each trial can have
only one of two outcomes. Statistics textbooks worldwide use coin flipping as a way to
demonstrate the binomial distribution.
Let’s return to our 3 coin flip example, where the coin is fair. Remember that we created a
variable called Y, and it can take on values y = 0, 1, 2, 3. Often, you’ll see this written y = 0, 1,
2, . . . , n, where n is the highest possible value that the random variable can take. The actual
number of heads observed is subject to chance. Earlier, we calculated the probability of
observing 0, 1, 2, 3 heads in 3 coin flips by hand (see Table 8.3).
Table 8.3
y            0      1      2      3
Frequency    1      3      3      1
Pr(Y = y)    0.125  0.375  0.375  0.125
The binomial function is written below (Equation 8.5). Notice the f notation, which denotes
that it is a function named f. The inputs to the function are provided within parentheses:
f(y; n, p) = (n choose y) p^y (1 − p)^(n − y),   y = 0, 1, . . . , n.   (8.5)
Instead of f(x), we have f(y; n, p). The semicolon is read “given.” This is read as the
probability of y successes, given n trials and probability p of success. So this function
requires three inputs: y, n, and p.
You may also see this written as f(y | n, p), where the vertical bar means “given.” In this
book, we use the semicolon notation to avoid any confusion with the conditional
probability terms in Bayes’ Theorem.
Two of the inputs are called parameters: n and p. We will discuss what a parameter is
shortly.
• n = the total number of trials (in our case, n = 3 coin flips or 3 trials)
• p = the probability of success (in our case, the probability of flipping a heads, which is 0.5
each time if the coin is fair)
The third input, y, is the observed number of successes in the experiment. The lower-case y
represents the value of a random variable, and y can take on the discrete values 0, 1, 2, . . . , n.
Given these inputs, the instructions for creating the output are provided on the right side
of the equals sign. For example, we can use the binomial function to calculate the prob-
ability of observing 2 heads in 3 flips when the coin is fair as:
f(y; n, p) = (n choose y) p^y (1 − p)^(n − y),   y = 0, 1, . . . , n.   (8.6)
f(2; 3, 0.5) = (3 choose 2) · 0.5² · (1 − 0.5)^(3 − 2).   (8.7)
The semicolon here means “given.” The entire left side of the equation can be read “the
probability of observing 2 heads, given 3 coin flips and a probability of a heads equal to 0.5,”
or in general terms, “the probability of 2 successes in 3 trials given the probability of success
on each trial is equal to 0.5.” This is equivalent to writing:
Pr(Y = 2) = f(y = 2; n = 3, p = 0.5) = (3 choose 2) · 0.5² · (1 − 0.5)^(3 − 2).   (8.8)
This might look weird to you, but mathematicians use this notation because, to them, it is
super-clear and concise. The left term Pr(Y = 2) asks what is the probability that our random
variable named Y has a value of 2. This is equivalent to the binomial pmf where y = 2,
n = 3, and p = 0.5. We use the pmf to calculate the probability.
Let’s now break the right-hand side of the binomial probability function (the output)
into pieces:
• The term p^y gives p (the probability of success, or heads) raised to the number of times
the success (heads) occurred (y). This term calculates the probability of observing y
independent successes together in one experiment, just as we did at the beginning of
the chapter.
• The term (1 − p)^(n − y) gives the probability of a failure (or tails) raised to the number of
times the failures (tails) occurred, which is (n − y). This term calculates the probability of
observing n − y independent failures together in one experiment, just as we did at the
beginning of the chapter.
To calculate the probability of flipping 2 heads out of 3 trials, we need to observe a success y
times (2 heads), and we need to observe a failure n − y times (1 tail). So far, our result would be:
0.5² · (1 − 0.5)¹ = 0.125.   (8.9)
But if you flip a fair coin 3 times, as we’ve seen, there is more than one way you could end
up with 2 heads and 1 tail. For instance, the sequence could be:
• HHT
• THH
• HTH.
In our problem, we are asked to compute the probability of getting 2 heads in 3 flips, given
that the coin is fair. We don’t care about the actual sequence of the outcomes, so we need to
account for all of the various possible ways we can end up with 2 heads in 3 flips. The
(n choose y) portion of the binomial probability function is called the binomial
coefficient and accounts for ALL the possible ways (combinations) in which two heads
and one tail could be obtained.
You can compute the binomial coefficient by hand:
(n choose y) = n! / (y! (n − y)!)   (8.10)
(3 choose 2) = 3! / (2! (3 − 2)!) = (3 · 2 · 1) / ((2 · 1) · 1) = 6/2 = 3.   (8.11)
Thus, there are 3 ways of getting 2 heads in 3 flips, which we confirmed earlier. We multiply
3 by 0.125 to get our final answer: 0.375. The binomial probability of getting exactly two
heads out of 3 coin flips, given that the coin is fair, is:
f(y; n, p) = (n choose y) p^y (1 − p)^(n − y),   y = 0, 1, . . . , n   (8.12)
f(2; 3, 0.5) = 3 · 0.5² · (1 − 0.5)¹ = 3 · 0.125 = 0.375.   (8.13)
This is the same answer we got when we calculated the probability by hand (Table 8.3).
Similarly, we could use the function to compute the probability of observing, say, 0 heads
and 3 tails. That would be:
f(y; n, p) = (n choose y) p^y (1 − p)^(n − y),   y = 0, 1, . . . , n   (8.14)
f(0; 3, 0.5) = (3 choose 0) · 0.5⁰ · (1 − 0.5)³   (8.15)
f(0; 3, 0.5) = 3! / (0! (3 − 0)!) · 0.5⁰ · (1 − 0.5)³   (8.16)
f(0; 3, 0.5) = 1 · 1 · 0.125 = 0.125.   (8.17)
Don’t forget that zero factorial (0!) is equal to 1. Notice that in the first example we added
the names of the inputs (i.e., y, n, and p), while in the second example we did not. You’ll see
this both ways in the literature. And now let’s double-check these results with the table of
results we showed previously (see Table 8.4).
Table 8.4
y            0      1      2      3
Frequency    1      3      3      1
Pr(Y = y)    0.125  0.375  0.375  0.125
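The whole of Table 8.4 can be reproduced with the same pmf. A Python sketch (ours, not from the book):

```python
from math import comb

def binomial_pmf(y, n, p):
    """Binomial pmf: probability of y successes in n trials."""
    return comb(n, y) * p**y * (1 - p)**(n - y)

# Reproduce Table 8.4: Pr(Y = y) for y = 0, 1, 2, 3 with n = 3, p = 0.5
for y in range(4):
    print(y, binomial_pmf(y, 3, 0.5))
# 0 0.125
# 1 0.375
# 2 0.375
# 3 0.125
```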
Great! We hope you can see that a pmf such as the binomial pmf is a super-handy tool
(function) that returns probability very quickly. The proof of this function is credited to
Jakob Bernoulli (Bernoulli, 1744), a Swiss mathematician (see Figure 8.3). His tomb bears
the inscription “Jakob Bernoulli, the incomparable mathematician.”
If we have an experiment that has independent trials and a constant probability of success,
we can indicate that a random variable Y arises from a binomial process as:
Y ∼ binomial(n, p).   (8.18)
If we flip a coin three times and the coin is fair, we can say:
Y ∼ binomial(n = 3, p = 0.5).   (8.19)
Answer: The binomial distribution is a display of all possible outcomes of y, given
n and p. In other words, we can use the binomial function to compute the probability of
observing 0, 1, 2, 3 heads out of 3 flips, given that the coin is fair (probability of heads =
probability of tails = 0.5; see Table 8.5).
Table 8.5
Successes Probability
0 0.125
1 0.375
2 0.375
3 0.125
The sum of the probabilities that make up the binomial distribution must be 1.00.
We can then graph the results as in Figure 8.4.
Figure 8.4 [Bar chart: probability (y-axis, 0 to 1.0) of each number of successes (x-axis, 0 to 3).]
This is a binomial distribution whose parameters are n = 3 and p = 0.5. The full range
of possible outcomes is given on the x-axis, and the binomial probability is given on the y-axis.
Can you find the probability of observing 2 heads out of 3 coin flips, given that the coin
is fair?
How about the probability of observing 2.5 heads out of 3 coin flips,
given that the coin is fair?
What is a parameter?
Answer: The Oxford Dictionary of Statistics (Upton and Cook, 2014) defines it as “a
constant appearing as part of the description of a probability function . . . The shape of
the distribution depends on the value(s) given to the parameter(s).”
The parameters of a probability distribution define how the distribution looks: change
the parameters, and you change the shape and location of the distribution. Some authors
describe parameters as the “knobs” or “controls” of a function.
Let’s return to our 10 coin flip example. Figure 8.5 shows the binomial distribution when
n = 10 and p = 0.5 (note that p, the probability of success, is still 0.5, but n, the number of
trials, has changed from 3 to 10).
Figure 8.5 [Bar chart: probability (y-axis, 0 to 1.0) of each number of successes (x-axis, 0 to 10).]
In this example, the full range of possible values for our variable is 0 to 10. Did you notice
that the binomial distribution is symmetric when p is 0.5?
Symmetry is not always the case. To illustrate, Figure 8.6 shows yet another binomial
distribution, this time when n = 100 and p = 0.05 (note that p, the probability of success,
was changed from 0.5 to 0.05, and n was changed from 10 to 100).
Figure 8.6 [Bar chart: probability (y-axis, truncated at 0.20) of each number of successes (x-axis, 0 to 100); the distribution is skewed, with its peak near 5 successes.]
See how different they look? All three graphs are binomial probability distributions, but
they have different parameters (n and p). The x-axis should extend to the maximum
number of possible successes. The y-axis often ranges between 0 and 1 but is sometimes
truncated, as in Figure 8.6. Changing the y-axis does not change the distribution at all: it
just magnifies or shrinks how the distribution is displayed.
No matter what range you use for the y-axis, the sum of the bars will still be equal to 1.0.
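You can check this "sums to 1" property numerically for any choice of parameters. A Python sketch (ours, not from the book), using the three parameter pairs shown in this chapter:

```python
from math import comb

def binomial_pmf(y, n, p):
    return comb(n, y) * p**y * (1 - p)**(n - y)

# Whatever the parameters, the probabilities across all possible y sum to 1
for n, p in [(3, 0.5), (10, 0.5), (100, 0.05)]:
    total = sum(binomial_pmf(y, n, p) for y in range(n + 1))
    print(n, p, round(total, 10))  # total is 1.0 for every (n, p) pair
```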
When it comes to probability mass functions, the values of the parameters identify
which binomial distribution you are talking about; they determine the shape of the distri-
bution (the distribution of probabilities along the y-axis) and its location along the x-axis.
Does the “knob” or “control” image work for you?
The idea that p is constant is pretty straightforward: if we flip a coin 10 times, and p = 0.5,
this means that p is held at 0.5 for each and every flip. The concept of independence relates
to the outcomes of two or more flips. If we flip a coin and it turns up heads, this outcome
has absolutely no bearing on what the result of the next flip will be.
• Negative binomial
• Bernoulli distribution
• Poisson distribution
• Discrete uniform distribution
• Geometric distribution
• Hypergeometric distribution
We will touch on the Bernoulli function in this chapter and discuss a few others in future
chapters.
Answer: In each, there is a discrete random variable, say Y. For each possible value the
random variable can assume, f(y) is a probability, written Pr(Y = y), and thus f(y) must assume values
between 0 and 1. The values of f(y) must sum to 1 for all values that Y can assume. This is the
definition of a probability mass function.
What is a Bernoulli distribution?
Answer:
From The Oxford Dictionary of Statistics (Upton and Cook, 2014): The distribution of a
discrete random variable taking two values, usually 0 and 1.
From Wikipedia: In probability theory and statistics, the Bernoulli distribution,
named after Swiss scientist Jakob Bernoulli, is the probability distribution of a random
variable which takes the value 1 with success probability of p and the value 0 with failure
probability of q = 1 − p.
In short, a Bernoulli distribution is a special case of a binomial distribution
in which the number of trials is n = 1. If we have just one coin flip (n = 1) and the coin
is fair (p = 0.5), then we can use the binomial pmf to calculate the probability that the
result is heads (y = 1) as:
f(y; n, p) = (n choose y) p^y (1 − p)^(n − y),   y = 0, 1, . . . , n   (8.20)
f(1; 1, 0.5) = 1 · 0.5¹ · (1 − 0.5)⁰ = 0.5.   (8.21)
If the coin was not fair, such that the probability of getting a heads is 0.4, then we
would have:
f(y; n, p) = (n choose y) p^y (1 − p)^(n − y)   (8.22)
f(1; 1, 0.4) = 1 · 0.4¹ · (1 − 0.4)⁰ = 0.4.   (8.23)
Some things are worth pointing out here. First of all, with 1 trial, the binomial
coefficient is 1.
Let’s run through the probability of observing a success:
• With 1 trial, if we have a success, the term p^y is the same as p¹, which is just p.
• With 1 trial, if we have a success, the term (1 − p)⁰ is 1. Remember that anything raised to
the power of 0 is 1.0.
• Multiply 1 (the binomial coefficient) by p and then by 1 and you end up with p.
• With 1 trial, if we have a failure, the term p^y is the same as p⁰, which is 1.
• With 1 trial, if we have a failure, the term (1 − p)¹ is (1 − p).
• Multiply 1 (the binomial coefficient) by 1 and then by (1 − p), and you end up with (1 − p).
Thus, with a single trial, the probability of getting a success is the same thing as p, and the
probability of getting a failure is the same thing as (1 − p).
Because of these properties, the Bernoulli distribution is often written as:
f(y; p) = p^y (1 − p)^(1 − y),   y = 0, 1.   (8.24)
The Bernoulli distribution provides the probabilities for all possible outcomes. With only 1
trial, there are only two possible results, and a graph of the distribution looks like the one
shown in Figure 8.7.
Figure 8.7 [Bar chart: probability (y-axis, 0 to 1.0) of 0 and 1 successes.]
We mention the Bernoulli distribution because we will be using it later in this chapter
and in future chapters. Incidentally, you can think of a binomial distribution as a series of
independent Bernoulli trials.
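Since the Bernoulli pmf is just the binomial pmf with n = 1, the equivalence is easy to check in code. A Python sketch (ours, not from the book; both helper names are our own):

```python
from math import comb

def binomial_pmf(y, n, p):
    return comb(n, y) * p**y * (1 - p)**(n - y)

def bernoulli_pmf(y, p):
    """Bernoulli pmf: p for a success (y = 1), 1 - p for a failure (y = 0)."""
    return p**y * (1 - p)**(1 - y)

# With n = 1, the binomial pmf and the Bernoulli pmf agree exactly
for p in (0.5, 0.4):
    assert bernoulli_pmf(1, p) == binomial_pmf(1, 1, p)  # success
    assert bernoulli_pmf(0, p) == binomial_pmf(0, 1, p)  # failure

print(bernoulli_pmf(1, 0.4), bernoulli_pmf(0, 0.4))  # 0.4 0.6
```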
Likelihood
Now, let’s gracefully roll into the concept of likelihood when dealing with probability mass
functions. Likelihood is a key concept in Bayesian inference . . . remember that Bayes’
Theorem can be expressed as shown in Figure 8.8.
Pr(A | B) = [Pr(B | A) · Pr(A)] / [Pr(B | A) · Pr(A) + Pr(B | ∼A) · Pr(∼A)]
Here Pr(B | A) is the likelihood of the data, B, given hypothesis A; Pr(A) is the prior
probability of hypothesis A; Pr(B | ∼A) is the likelihood of the data, B, given hypothesis
∼A; Pr(∼A) is the prior probability of hypothesis ∼A; and the left-hand side, Pr(A | B), is
the posterior probability of hypothesis A, given data B.
Figure 8.8
If we let a given hypothesis stand in for A, and our data stand in for B, then this equation
can be written as:
Pr(Hi | data) = [Pr(data | Hi) · Pr(Hi)] / [Σ (j = 1 to n) Pr(data | Hj) · Pr(Hj)].   (8.25)
In words, the posterior probability for hypothesis Hi is the likelihood of observing the
data under hypothesis i multiplied by the prior probability of hypothesis i, divided
by the sum of the likelihood × the prior across all j = 1 to n hypotheses.
Notice that the likelihoods are conditional on each hypothesis! In Bayesian analyses, the
likelihood is interpreted as the probability of observing the data, given the hypothesis.
In short, you must understand the concept of likelihood! In previous chapters, we tried to
give you a general feel for likelihood . . . now it is time to dig deeper.
Answer: As we’ve seen, a component of Bayes’ Theorem is the Pr(data | H), which is “the
probability of the data, given a hypothesis.” If the data are in hand, then the term
“likelihood” is often used in its stead. Thus, when you see Pr(data | H) in Bayes’ Theorem,
a likelihood computation is required.
Throughout this book, the likelihood term in Bayes’ Theorem is expressed as Pr(data | H)
when there are discrete hypotheses under consideration. When the likelihood is actually
computed, we will express this as ℒ(data; H), where ℒ symbolizes “likelihood” and the
semicolon means “given.” We do this to distinctly identify the likelihood term in Bayes’ Theorem
from the actual calculation.
As a side note to our readers who previously have studied maximum likelihood methods,
read the box below:
You may have seen likelihood expressed as the likelihood of the parameters given
the data. For example, the likelihood of the parameter, θ, given the data, y, can
be written ℒ(θ; y) or ℒ(θ | y). This is fairly conventional, but can be confusing when related
to Bayes’ Theorem. Regardless of the notation used, you must enter values for θ and y to
calculate a result.
Are there any other key points to bear in mind regarding likelihood?
Answer: Yes. We’ve mentioned in previous chapters that the likelihood computations do
not need to sum to 1.00. This is in sharp contrast to the prior and posterior distributions in a
Bayesian analysis, which must sum to 1.00. Don’t ever forget this!
Answer: Of course. Suppose you are given 2 heads out of 3 coin flips, but you don’t know
p. We can plug in many alternative values (hypotheses) for p and use the binomial pmf to
compute the likelihood for each and every combination (see Table 8.6).
Table 8.6
p Likelihood
0 0
0.1 0.027
0.2 0.096
0.3 0.189
0.4 0.288
0.5 0.375
0.6 0.432
0.7 0.441
0.8 0.384
0.9 0.243
1 0
Here, the left column represents different hypothesized values for p (the probability of
success). The right column represents the likelihood of observing 2 heads out of 3 flips
given a particular value for p.
A few things worth pointing out:
• p can assume a value in the range of 0 to 1, so there are an infinite number of possibilities
we could examine; we just recorded 11 cases.
• When the probability of success is 0, the likelihood of observing 2 heads out of 3 flips is 0.
It’s impossible to observe ANY heads if the probability of flipping a head is 0!
• When the probability of success is 1.0, the likelihood of observing 2 heads out of 3 flips
is 0. If the coin MUST land heads, then it is impossible that you observe any tails!
• The sum of the second column does not equal 1.00.
Let’s graph our results across a full spectrum of p alternatives (see Figure 8.9).
Figure 8.9 [Line graph: the likelihood (y-axis, 0 to 0.5) of observing 2 heads in 3 flips as a function of p (x-axis, 0 to 1); the curve rises from 0 at p = 0, peaks near p = 0.67, and falls back to 0 at p = 1.]
• Find the likelihood of observing y = 2 if p = 0.4 in Table 8.6 and match it to Figure 8.9.
• Find the likelihood of observing y = 2 if p = 0.5 in Table 8.6 and match it to Figure 8.9.
• The gray dotted lines will help you, and we’ll revisit these soon.
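Table 8.6 and Figure 8.9 can both be generated directly from the binomial pmf by holding the data fixed (y = 2, n = 3) and varying p. A Python sketch (ours, not from the book):

```python
from math import comb

def binomial_pmf(y, n, p):
    """Binomial pmf: probability of y successes in n trials."""
    return comb(n, y) * p**y * (1 - p)**(n - y)

# Likelihood profile: hold the data fixed (2 heads in 3 flips), vary p
ps = [i / 10 for i in range(11)]
likelihoods = [binomial_pmf(2, 3, p) for p in ps]

for p, lik in zip(ps, likelihoods):
    print(p, round(lik, 3))

# A likelihood profile is not a probability distribution: the values
# need not sum to 1 (here they sum to about 2.475)
print(round(sum(likelihoods), 3))  # 2.475
```

Note that the largest likelihood on this grid occurs at p = 0.7, close to the observed proportion of heads, 2/3.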
Answer: OK! Let’s try to put together everything we’ve learned so far!
Let’s go through a quick example where we are considering just 2 hypotheses. Remember
that Bayes’ Theorem can be expressed as shown below:
Pr(Hi | data) = [Pr(data | Hi) · Pr(Hi)] / [Σ (j = 1 to n) Pr(data | Hj) · Pr(Hj)].   (8.26)
In fact, although this is not conventional, we could write Bayes’ Theorem as:
Pr(Hi | data) = [ℒ(data | Hi) · Pr(Hi)] / [Σ (j = 1 to n) ℒ(data | Hj) · Pr(Hj)].   (8.27)
Suppose we have two hypotheses for our coin in terms of fairness. A friend gives you the
coin and tells you that the coin is either fair (p = 0.5) or that it is weighted such that the
probability of observing heads is only 0.4. These are the only two options.
We can now write out the posterior probability for hypothesis 1 with the expanded
denominator of Bayes’ Theorem as:
Pr(H1 | data) = [Pr(data | H1) · Pr(H1)] / [Pr(data | H1) · Pr(H1) + Pr(data | H2) · Pr(H2)].   (8.28)
Step 1. What are the two hypotheses?
Answer:
• H1 is the fair hypothesis: the coin is fair so that the probability of heads is 0.5 (p = 0.5).
• H2 is the unfair hypothesis: the coin is weighted so that the probability of heads is 0.4
(p = 0.4).
There are two discrete hypotheses (and we’ll assume these are the only two possibilities).
Step 2. What were the prior probabilities for each hypothesis?
• Let’s set the prior probability for each hypothesis = 0.5. In other words, we give equal
weights to the hypothesis that the coin is fair and to the hypothesis that the coin is
weighted. In this case, we have no a priori reason to think that one hypothesis is more
likely to be true than the other.
Figure 8.10 [Bar chart: prior probability (y-axis, 0 to 1.0) of 0.5 for each hypothesis, H1: p = 0.5 and H2: p = 0.4.]
Thus, we are using a Bernoulli distribution to set our priors for the two alternative
hypotheses (p = 0.5 and p = 0.4), and the probability associated with each hypothesis is 0.5.
In contrast to the Hamilton and Madison hypotheses in Chapter 5, here we have alternative
hypotheses for a parameter.
Step 3. Now we collect data: suppose we flipped the coin 3 times and observed 2 heads.
The likelihood of these data under each hypothesis comes from the binomial pmf:
ℒ(y = 2; n = 3, p = 0.5) = 3 · 0.5² · (1 − 0.5)¹ = 0.375
ℒ(y = 2; n = 3, p = 0.4) = 3 · 0.4² · (1 − 0.4)¹ = 0.288
You can cross-check these results in the likelihood profile in Figure 8.9.
Pr(H1 | data) = [Pr(data | H1) · Pr(H1)] / [Pr(data | H1) · Pr(H1) + Pr(data | H2) · Pr(H2)]   (8.33)
Pr(H1 | data) = (0.375 · 0.5) / (0.375 · 0.5 + 0.288 · 0.5) = 0.566.   (8.34)
Pr(H2 | data) = [Pr(data | H2) · Pr(H2)] / [Pr(data | H1) · Pr(H1) + Pr(data | H2) · Pr(H2)]   (8.35)
Pr(H2 | data) = (0.288 · 0.5) / (0.375 · 0.5 + 0.288 · 0.5) = 0.434.   (8.36)
Thus, after 3 coin flips, we have updated our belief that the coin is fair from 0.5 to
0.566, and we have updated our belief that the coin is biased from 0.5 to 0.434
(see Figure 8.11).
Figure 8.11 [Bar chart: posterior probability (y-axis, 0 to 1.0) of 0.566 for H1: p = 0.5 and 0.434 for H2: p = 0.4.]
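The entire updating exercise can be condensed into a short Python sketch (ours, not from the book), with the same priors, data, and binomial likelihoods:

```python
from math import comb

def binomial_pmf(y, n, p):
    """Binomial pmf: probability of y successes in n trials."""
    return comb(n, y) * p**y * (1 - p)**(n - y)

# Two hypotheses about the coin, with equal prior weight
hypotheses = {"H1": 0.5, "H2": 0.4}  # hypothesized values of p
priors = {"H1": 0.5, "H2": 0.5}

# Likelihood of the data (2 heads in 3 flips) under each hypothesis
likelihoods = {h: binomial_pmf(2, 3, p) for h, p in hypotheses.items()}

# Bayes' Theorem: posterior = likelihood * prior / sum over all hypotheses
denom = sum(likelihoods[h] * priors[h] for h in hypotheses)
posteriors = {h: likelihoods[h] * priors[h] / denom for h in hypotheses}

print({h: round(post, 3) for h, post in posteriors.items()})
# {'H1': 0.566, 'H2': 0.434}
```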
We hope this makes perfect sense! We are given a coin, and told that it is either fair
or weighted. These are our two hypotheses, and we gave them equal weight for the
prior distribution. We then collected data: 2 heads in 3 coin flips. We calculated the
likelihood of observing the data under each hypothesis. We then used Bayes’ Theorem to
calculate the posterior probability of each hypothesis. After the analysis, we still have two
hypotheses, but the probability that H1 is correct is now higher than the probability that H2
is correct.
Thus, our posterior distribution for the two alternative hypotheses for p is another
Bernoulli distribution.
Answer: Yes. Let’s take a step back to get a “big picture” view of this process (see
Figure 8.12).
Figure 8.12 [Diagram: a Bernoulli prior distribution over the two hypotheses, H1 (p = 0.5) and H2 (p = 0.4), sits above a binomial data-generating distribution, bin(n = 3, p); the prior supplies the parameter p to the binomial, which in turn generates the observed data, yi (0, 1, 2, or 3 heads).]
This sort of diagram was popularized by John Kruschke in his book, Doing Bayesian Data
Analysis (Kruschke, 2015) and is intended to communicate the structure of the prior and
likelihood. At the bottom of this diagram, we have our observed data, yi. The data were
generated from a binomial distribution with n = 3 flips and the parameter, p. The
parameter, p, in turn, is the unknown parameter that we are trying to estimate. Here,
we use a Bernoulli distribution to set the “weights” on the two alternative
hypotheses for p. Thus, you can interpret the Bernoulli distribution in Figure 8.12 as a
prior distribution which provides our weights of belief for the two hypotheses.
tion can be generalized.
But there is another interpretation for the prior distribution displayed: the unknown
parameter, p, is a random variable that is generated from a Bernoulli distribution. We are
trying to characterize which Bernoulli distribution produced it. Through Bayes’ Theorem,
the prior Bernoulli distribution is updated to a posterior Bernoulli distribution in light of
new data.
With either interpretation, the goal is to make probabilistic statements about our belief in
p and the distribution that characterizes it, and to update those beliefs or knowledge in light
of new data.
Answer: For both problems, we had two hypotheses. For the author problem, our hypoth-
eses were Madison vs. Hamilton, and for this chapter’s problem the hypotheses were about
the value of a parameter: p = 0.5 vs. p = 0.4.
In the author problem, we informally calculated the likelihood of the data (use of the
word upon) under each hypothesis based on previous analysis of each man’s writing. In
contrast, in this chapter we used a probability mass function to calculate the likelihood
of the data under each hypothesis. The use of probability mass functions to compute the
likelihood of the data is common in Bayesian analysis (though not required, as we’ve seen).
• In this function, n and p are parameters, whereas y is a randomly observed discrete value.
The binomial distribution maps the probability associated with all possible values of the
random variable, Y.
Answer: Chapter 9 will tackle probability density functions, another critical tool in
your Bayesian toolbox.
CHAPTER 9
We’ll start with the same question that we started with in Chapter 8:
What is a function?
Answer: In math, a function relates an input to an output, and the classic way of writing
a function is
f(x) = . . .   (9.1)
Here, the function name is f, the inputs are denoted by the letter x, and the dots represent
instructions that will generate the output. For instance, consider the function shown in
Figure 9.1.
f(x) = x²
Figure 9.1
The function name is f. The inputs go between the parentheses. Here, we have a single
input called x. To the right of the equals sign is a set of instructions that tell us what to do
with the input to produce the output. Here, we input a value for x; the instructions tell us
that we should square x to give us the output.
In other words, f(x) maps the input x to the output. We can visualize this as shown in
Figure 9.2.
x → f → x²
Figure 9.2
The function’s name is f, and the inputs are the observed random variable, y (the number of
successes, which in this case is the 2 heads), n (the total trials, which is three flips), and p
(the probability of success, which is 0.5). Don’t forget that a variable is considered random
when its value is subject to random variation.
In this example, the discrete random variable named Y could be 0, 1, 2, or 3. We just used
the binomial pmf to calculate the probability of observing a random variable y = 2, given
three tosses and a fair coin.
Probability mass functions deal with random variables, such as the number
of heads observed, that are discrete (or distinct) in nature. In this chapter,
however, we focus on random variables that are continuous in nature.
Answer: Of course. Let’s start with the following example from Wikipedia (accessed
August 17, 2017). Suppose a species of bacterium typically lives 4 to 6 hours. In this
example, we can say that capital X represents a random variable that stands for the lifespan
of a bacterium. This is an example of a continuous random variable, because the
lifespan can assume values over some interval (4 to 6 hours, such as 4.1, 4.2, 5.9999, etc.).
We can define the possible outcomes of X as:
4 ≤ X ≤ 6. (9.4)
Answer: The answer is actually 0. A lot of bacteria live for approximately 5 hours, but there
is a negligible chance that any given bacterium dies at exactly 5.0000000000 . . . hours. We
can always get more precise!
Instead we might ask: What is the probability that the bacterium dies between 5 hours
and 5.01 hours? Let’s say the answer is 0.02 (i.e., 2%). Next, what is the probability that
the bacterium dies between 5 hours and 5.001 hours? The answer is probably around
0.002, because this is 1/10th of the previous interval. The probability that the bacterium
dies between 5 hours and 5.0001 hours is probably about 0.0002, and so on. Notice that
as the time increment decreases, the probability also decreases. When the time
increment is 0 (an exact time), the probability is 0. This isn't very helpful if our goal is
probability!
Answer: For cases where a random variable is continuous, you need to consider
probability in the context of a probability density function.
Here, the function is f(x) = 0.5: the function name is f, and it has one argument called x.
We can name the function and argument anything we'd like. Regardless of what it is named,
the function returns the probability density, and it is always 0.5 (with some constraints
that we'll discuss below).
Answer: A uniform distribution where values range between 4 and 6. Technically, the
probability density function is written:

f(x) = 0.5,   4 ≤ x ≤ 6.

Notice the use of the comma to separate the function, f(x) = 0.5, from its constraints,
4 ≤ x ≤ 6. In other words, x must fall within this interval.
The random variable named X can be drawn from this distribution. We can write:

X ∼ U(4, 6).

The symbol ∼ means "is distributed as." So we read this equation "the random variable
named X is distributed as a uniform distribution with a minimum value of 4 and a
maximum value of 6."
The uniform distribution is usually indicated by a capital U, and this distribution has two
parameters, the minimum (sometimes called a) and maximum (sometimes called b). Thus,
the uniform distribution is uniquely described as Uða; bÞ. If you change a or b, you have a
different uniform distribution.
Now let's take a look at this uniform distribution. The lower-case x is the actual value of
X (see Figure 9.3). The blue line in this figure represents the probability density function
f(x) = 0.5. Notice that the y-axis is labeled "Density." Notice also that we don't use the word
"probability" here.

Figure 9.3 The uniform distribution U(4, 6): the density is a constant, 0.5, for x between 4.0 and 6.0.

If all values of x have the exact same density, it makes sense that the pdf
of a uniform distribution is a constant, f(x) = 0.5, right? It's just a matter of figuring out
what this constant is.
Answer: Because just as the bars in the binomial distribution or any other pmf must sum
to 1.00, the area under the density f(x) must equal 1.00. Let’s explore this idea a
little more . . .
The uniform distribution is sometimes called a rectangular distribution because it
looks like a rectangle. Wild! And rectangles have area, which is calculated by multiplying
the length of the rectangle along the x-axis by the height of the rectangle along the
y-axis. The length of our rectangle is calculated by subtracting the minimum, 4, from the
maximum, 6, which is 2.0. If the area of our uniform distribution represents
probability, then we know that the area of this rectangle must be 1.0. So, we
now know:
• area = 1.0
• length = 2.0
And now solve for the height, which is called density:

height = 1.0 / 2.0 = 0.50. (9.10)
The height of the distribution at any given point is called its density (Figure 9.3, blue),
whereas the total area under the distribution must be 1.00 (Figure 9.3, gray). The function
that yields the density is called a probability density function.
Now, to repeat the most crucial point: the area under the pdf is equal to 1.0.
Thus, the probability that X is between 4 and 6 is 1.0, or:
Answer: You read our minds! As we mentioned, the uniform pdf has two parameters,
called a and b, where a represents the lower bound of the distribution, and b represents the
upper bound of the distribution:
f(x; a, b) = 1 / (b − a). (9.12)
This is the uniform probability density function (pdf). For our example, where a ¼ 4 and
b ¼ 6, the density is calculated as:
f(x; a, b) = 1 / (6 − 4) = 1/2 = 0.5. (9.13)
Let’s not forget our constraints! The formal definition of the uniform pdf is written:
f(x; a, b) = 1 / (b − a) for a ≤ x ≤ b, and 0 for x < a or x > b. (9.14)
In our example:
f(x; a, b) = 0.5 for 4 ≤ x ≤ 6, and 0 for x < 4 or x > 6. (9.15)
In words, the probability density is 0.5 for all x between 4 and 6 hours; it is 0 for any values
of x that are less than 4 hours or greater than 6 hours.
What is the probability that x is between 4.5 and 5.5 hours for our
uniform distribution?
Answer: OK, now we need to move from probability density to probability. Our
uniform distribution has a ¼ 4 and b ¼ 6. The probability that a random variable named X is
between 4.5 and 5.5 hours could be depicted as shown in Figure 9.4.
Figure 9.4 The uniform pdf U(4, 6), with the region between x = 4.5 and x = 5.5 highlighted in red.
Intuitively, if the full gray box above represents an area of 1.00, then the red box
represents 1/2 of the total area, so the probability of drawing x between 4.5 and 5.5 hours
is 0.5. Here, we are talking about probability (what proportion of the full gray rectangle
consists of the red rectangle) and not density. However, we will use the density to calculate
the result. The area of the red rectangle is:
• length = 5.5 − 4.5 = 1
• height (density) = 0.5
• area = length ∗ height = 0.5.
The probability that a bacterium has a lifespan between 4.5 and 5.5 hours is 0.5, that is:

Pr(4.5 ≤ X ≤ 5.5) = 0.5.
Answer: Once again, the answer is zero! Assuming that 5 means exactly 5 (and not a smitch
more or less), then we are dealing with a line, not a rectangle. Lines don’t have area, so the
answer technically is 0. When you ask what is the probability that x ¼ 5, you might really
mean what is the probability of 5, give or take a tiny amount.
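Before moving on, the uniform pdf and its rectangle areas can be captured in a short Python sketch (ours, not the book's; the function names are our own):

```python
def uniform_pdf(x, a=4.0, b=6.0):
    """Density of the uniform distribution U(a, b): 1/(b - a) inside [a, b], 0 outside."""
    return 1.0 / (b - a) if a <= x <= b else 0.0

def uniform_prob(lo, hi, a=4.0, b=6.0):
    """Probability that X falls in [lo, hi]: the area of a rectangle under the pdf."""
    lo, hi = max(lo, a), min(hi, b)
    return max(hi - lo, 0.0) * (1.0 / (b - a))

print(uniform_pdf(5.0))        # 0.5  (a density, not a probability)
print(uniform_prob(4.5, 5.5))  # 0.5  (length 1.0 times height 0.5)
print(uniform_prob(5.0, 5.0))  # 0.0  (a single point has zero width, hence zero area)
```

The last line mirrors the point just made: for a continuous random variable, the probability of any exact value is 0.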
Answer: The Online Statistics Education book (Lane, 2011) tells us: “The normal distribu-
tion is the most important and most widely used distribution in statistics. It is sometimes
called the ‘bell curve,’ although the tonal qualities of such a bell would be less than
pleasing.” The normal (or Gaussian) distribution is a very common continuous prob-
ability distribution.
Answer: The normal distribution is also called the Gaussian distribution, named for the
German mathematician Carl Friedrich Gauss, who “had an exceptional influence in many
fields of mathematics and science and is ranked as one of history’s most influential
mathematicians” (see Figure 9.5). Gauss’ works were collected and published posthumously
in 1863.
Figure 9.6 A normal distribution for x = Lifespan of Bacterium (hours), with μ = 5 and σ = 0.5.
Here we are interested in the lifespan of a bacterium, which is given on the x-axis. The
y-axis is labeled Density, as it was in our uniform pdf example. The normal distribution is a
bell-shaped curve. The middle of the curve is centered on the mean, or average, which is
represented by the Greek letter, mu (μ). Here, μ ¼ 5. The spread of the curve is controlled by
the standard deviation, which is represented by the Greek letter, sigma (σ). Here, σ ¼ 0.5.
Answer: Yes. Change the mean (μ) or standard deviation (σ), and you change the location
and/or spread of the curve. Remember the parameters are like the “control knobs.” The
parameter, μ is called the location parameter: it controls where the distribution is
centered over the x-axis. In contrast, the parameter σ is called the scale parameter: it
controls the shape or spread of the distribution.
Sometimes people will denote the parameters of a normal pdf as μ and σ² instead, where
σ² is just σ ∗ σ. This is called the variance of the distribution. Figure 9.7 shows some
examples.
Figure 9.7 Normal distributions with different parameters: μ = 5, σ = 0.5 (green); μ = 7, σ = 0.5 (red); μ = 7, σ = 1 (purple); μ = 7, σ = 2 (blue); and μ = 5, σ = 0.25 (black). The x-axis is Lifespan of Bacteria (hours); the y-axis is Density.
• The red, purple, and blue distributions all have the same mean, μ ¼ 7. What differs among
them is the standard deviation, σ. The red distribution has the lowest standard deviation
(0.5), while the blue distribution has the highest standard deviation (2.0).
• The green and black distributions have the same mean, 5.0. What differs between them is
the standard deviation. The green distribution has a standard deviation of 0.5, while the
black distribution has a standard deviation of 0.25.
• The green and the red distributions have the same standard deviation (0.5). What differs
between them is their mean: the green distribution has a mean of 5, while the red
distribution has a mean of 7.
Answer: The normal probability density function is a pdf, just like the uniform pdf we
examined previously.
The function is as follows:
f(x; μ, σ) = (1 / (√(2π) σ)) e^(−(x−μ)²/(2σ²)),   −∞ ≤ x ≤ ∞. (9.17)
This function is named f and it has just three inputs. Two of the inputs are parameters: the
mean (which is μ) and the standard deviation (which is σ). The third input, x, represents an
outcome, such as bacterial lifespan in hours.
Look for this function on the Deutsche Mark (see Figure 9.8)!
Answer: Suppose the average lifespan of a bacterium is 5 hours. We can use the normal
pdf to get the probability density of observing, say, x ¼ 4.5 hours, when the mean
lifespan is μ ¼ 5 hours and the standard deviation σ ¼ 0.5 hours:
f(x; μ, σ) = (1 / (√(2π) σ)) e^(−(x−μ)²/(2σ²)) (9.18)

f(4.5; 5, 0.5) = (1 / (√(2π) ∗ 0.5)) e^(−(4.5−5)²/(2 ∗ 0.5²)) = 0.4839414. (9.19)
Remember that this is NOT probability . . . it is probability density!
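The calculation in Equation 9.19 can be checked with a small Python sketch (ours, not the book's):

```python
from math import sqrt, pi, exp

def normal_pdf(x, mu, sigma):
    """Density of the normal distribution N(mu, sigma) at x (Equation 9.17)."""
    return (1.0 / (sqrt(2 * pi) * sigma)) * exp(-(x - mu)**2 / (2 * sigma**2))

# Density of observing x = 4.5 hours when mu = 5 and sigma = 0.5 (Equation 9.19)
print(round(normal_pdf(4.5, 5.0, 0.5), 7))  # 0.4839414
```

Again, the value returned is a density, not a probability.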
Answer: Yes! Figure 9.9 shows the Gaussian distribution with μ ¼ 5 and σ ¼ 0.5 (blue).
Look for the density when x ¼ 4.5 as calculated above. Also plotted is the Gaussian
distribution with μ ¼ 5 and σ ¼ 0.25 (black).
Figure 9.9 Two normal distributions for Lifespan of Bacteria (hours): μ = 5, σ = 0.5 (blue) and μ = 5, σ = 0.25 (black).
A critical thing to notice here is that both curves are generated by the normal pdf, but
they have different parameters. As with the continuous uniform distribution, the area
under each curve is 1.0. That is, the area is the same under the two curves. Since the
black curve has a small standard deviation compared to the blue curve, the density is
squished upward in order to accommodate the narrow spread. Notice that the density
can be greater than 1.0.
Answer: Remember how we calculated the density for the continuous uniform
distribution? For the bacterium lifespan problem, we said that lifespan is distributed as a uniform
pdf between 4 and 6 hours, X ∼ U(4, 6). We knew the area of the rectangle was 1.0, we
knew the length of the rectangle (2.0), and we simply solved to get the height of the
rectangle, which is the density. We then calculated the probability that a bacterium
would live between 4.5 and 5.5 hours by determining the proportion of U(4, 6) where
4.5 ≤ x ≤ 5.5, which is a smaller rectangle.
We can use the same sort of reasoning with the normal distribution, only we’ll need more
rectangles. Let’s have a look, focusing on the blue distribution in Figure 9.9, where μ ¼ 5,
σ ¼ 0.5. Here it is again in Figure 9.10, with some rectangles overlaid on top.
Figure 9.10 The normal pdf with μ = 5, σ = 0.5, approximated by 5 rectangles of width 1 hour. The x-axis is x = Lifespan (hours).
Here, the x-axis has been divided into 5 rectangles, where each rectangle has a
length of 1 hour and a height that is the normal pdf density (two of these rectangles
are in the tails and are barely visible). This is a very rough approximation of a normal
distribution!
See if you can find the rectangle where the lifespan is between 3.5 hours and 4.5 hours.
Now, this particular rectangle (blue) has a height, or density, of 0.1079819. The area of this
rectangle is length (1.00) times height (0.1079819) = 0.1079819.
Answer: Now look for the rectangle where lifespan is between 4.5 and 5.5 hours. This
particular rectangle (red) has a density of 0.7978846. The area of this rectangle is length
(1.0) ∗ height (0.7978846) = 0.7978846.
This is our first attempt to answer the question, “What is the probability that a bacterium
has a lifespan between 4.5 and 5.5 hours?” Hang on to this answer, as we will revisit
this same question in a few minutes.
Answer: If Gauss did his homework correctly, we can calculate the area of each rectangle
in the graph and then add the areas together. We should get an answer close to 1.0. Let’s try
it. We have 5 rectangles, each with a length of 1 and height that is given by its density. And
the answer is . . .
1.0143837. (9.20)
Not bad at all! It's not precisely 1.0 because our 5-rectangle distribution only
approximates the normal pdf shown in blue, where the area under the curve is exactly 1.0.
Perhaps we can improve our result by creating 41 rectangles instead of five. Let’s have a
look (see Figure 9.11).
Figure 9.11 The normal pdf with μ = 5, σ = 0.5, approximated by 41 rectangles of width 0.1 hours. The x-axis is x = Lifespan (hours).
• Now, the x-axis has been divided into 41 slivers, where each sliver has a length of
0.1 hours.
• See if you can find the rectangle where lifespan is between 3.95 hours and 4.05 hours.
Now, this particular rectangle (blue) has a density of 0.1079819. The area of this
rectangle is 0.1 ∗ 0.1079819 = 0.0107982.
• Now look for the rectangle where lifespan is between 5.55 and 5.65 hours.
This particular rectangle (red) has a density of 0.3883721. The area of this rectangle is
0.1 ∗ 0.3883721 = 0.0388372.
To get the area under the curve, we can repeat this little exercise for each of the 41 slivers
and then add up the results. And the answer is . . .
0.9999599. (9.21)
Notice that this answer is closer to 1.0 than our previous example, where we had only 5
rectangles.
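The 41-rectangle sum can be reproduced directly. This sketch (ours, not the book's) places midpoint rectangles of width 0.1 at 3.0, 3.1, …, 7.0 hours:

```python
from math import sqrt, pi, exp

def normal_pdf(x, mu, sigma):
    """Density of the normal distribution N(mu, sigma) at x."""
    return (1.0 / (sqrt(2 * pi) * sigma)) * exp(-(x - mu)**2 / (2 * sigma**2))

width = 0.1
# 41 rectangle midpoints at 3.0, 3.1, ..., 7.0
midpoints = [3.0 + width * i for i in range(41)]

# Area of each rectangle is density times width; sum the 41 areas
area = sum(normal_pdf(x, 5.0, 0.5) * width for x in midpoints)
print(round(area, 7))  # 0.9999599
```

This reproduces Equation 9.21: the sum falls just short of 1.0 because the rectangles stop at the tails.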
Answer: If we have 41 discrete rectangles, then the sum of the areas of the rectangles can be
expressed as:

Σ (i = 1 to 41) f(xᵢ) ∗ 0.1. (9.22)

The general idea is that we can more closely approximate the normal pdf as we create
thinner and thinner rectangles.
Answer: This is the same question we answered previously with our 5-rectangle
approximation. Here are the steps: Find the rectangles centered on 4.5 and 5.5. Take half of the area
of each (because half the area is outside the boundary of interest). Then, add in the area of
all rectangles that fall between these two anchors. The sum is the probability, and the
answer is:

0.6810742. (9.23)

Recall that our cruder 5-rectangle approximation gave:

0.7978846. (9.24)

Hang on to this result, and we'll revisit it one more time in a minute or two!
What is the total area if our rectangles became really, really skinny?
Answer: When the rectangles approach a width of 0, the sum symbol, Σ, is replaced by the
integral symbol, ∫, and the area under the curve is exactly 1.000. A general way to write this is:

∫ from −∞ to ∞ of f(x) dx = 1.0. (9.25)
• f(x) is a pdf, and you can plug in any pdf you’d like. We’ve introduced the continuous
uniform pdf and normal pdf so far.
• We want to get the area under the curve for a specified range of values of x. The range
of values is tucked next to the integral symbol, with the lower end of x provided on
the bottom of the integral symbol (here, negative infinity) and the upper end of x
provided at the top of the integral symbol (here, positive infinity). That covers all
possibilities!
• The integral symbol means “get the total area under the curve” for the specified range of
x. Since we cover all possibilities for x, the total area under the curve is 1.0.
• To the right of the pdf is the symbol dx, which is the width of the rectangles (approaching 0),
and not d times x. If dx is the width of the rectangles, and f(x) is the height of each
rectangle, then f(x)dx is the area of a very small rectangle.
• To the right of the equal sign is 1.0; the area under the curve for all possible values of x
between negative and positive infinity is 1.0.
Wolfram defines an integral as “a mathematical object that can be interpreted as an area or
a generalization of area. Integrals, together with derivatives, are the fundamental objects of
calculus. Other words for integral include antiderivative and primitive.”
Answer: Yes. That is a characteristic that defines all probability density functions. Click
here to see the derivation for the normal pdf!
What would the integral look like for the normal pdf?
Answer: Just replace f(x) above with the normal pdf, which is:

f(x; μ, σ) = (1 / (√(2π) σ)) e^(−(x−μ)²/(2σ²)). (9.26)

Then, we have:

∫ from −∞ to ∞ of (1 / (√(2π) σ)) e^(−(x−μ)²/(2σ²)) dx = 1.0. (9.27)
OK, one more time! What is the probability that a bacterium has a
lifespan between 4.5 and 5.5 hours?
Answer: Before, we divided the distribution into thin rectangles, found the area of the
rectangles in question, and added them up. Now we are considering a subset of x’s, but we
make use of the density function to find the exact area.
Let’s step through this carefully:
Pr(4.5 ≤ X ≤ 5.5) = ∫ from 4.5 to 5.5 of (1 / (√(2π) σ)) e^(−(x−μ)²/(2σ²)) dx. (9.29)
Now, we need to evaluate this to get our answer. The process of finding this answer is called
integration. And the result is a probability.
Answer: Well, we have the pdf, and we know the shape of the curve. Now we need a tool
to help us find the area under a specific part of the curve. This area is probability. Comput-
ing the Gaussian integral is a bit beyond the scope of this book, and you really don’t need to
solve it by hand for our purposes.
Fortunately, the normal distribution is such a commonly used distribution that we can
easily find the areas that we want from calculators or computers, which often use the
“super-skinny rectangle” approach to approximate the answer. Technically, this is called
Riemann’s (ree-mahn) approximation. We can use virtually any calculator or math software
to generate the probability that a bacterium’s lifespan is between 4.5 and 5.5 hours, given a
normal pdf with μ = 5 and σ = 0.5. The calculation gives 0.6826895. Our 41-rectangle
approach suggested that the probability was 0.6810742, which is not a bad approximation.
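In Python, this area can be computed without any rectangles at all, using the standard library's error function (a sketch, ours, not the book's):

```python
from math import erf, sqrt

def normal_cdf(x, mu, sigma):
    """P(X <= x) for X ~ N(mu, sigma), expressed via the error function erf."""
    return 0.5 * (1.0 + erf((x - mu) / (sigma * sqrt(2.0))))

# Probability of a lifespan between 4.5 and 5.5 hours for N(mu = 5, sigma = 0.5)
p = normal_cdf(5.5, 5.0, 0.5) - normal_cdf(4.5, 5.0, 0.5)
print(round(p, 7))  # 0.6826895
```

Subtracting the cumulative probability at 4.5 from that at 5.5 is exactly the integral in Equation 9.29.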
Answer: Yes, there are several excellent sources out there. Forget about memorizing
anything . . . the only thing you need for this book is an intuition for how things work.
Here are some of our favorites:
• Khan Academy
• Better Explained
• How to Enjoy Calculus by Eli Pine
Answer: Loads of them! Here’s a look at a few that you may have come across:
• normal
• log-normal
• beta
• gamma
• exponential
• Weibull
• Cauchy
We will explore a few of these in future chapters, but encourage you to check these out on
Wikipedia. Who knows? Some may be very relevant to your work!
Likelihood
Now let’s gracefully switch to the topic of likelihood. Likelihood is a key concept in
Bayesian inference, so we need to spend time with it.
Answer: You may recall that Dictionary.com provides three definitions of likelihood:
Pr(Hi | data) = [Pr(data | Hi) ∗ Pr(Hi)] / [Σ (j = 1 to n) Pr(data | Hj) ∗ Pr(Hj)]. (9.30)
Now, let’s consider Bayes’ Theorem when the hypotheses for a parameter are infinite. Here,
Bayes’ Theorem takes a new form. Suppose we are trying to estimate a single parameter
called θ. Bayes’ Theorem in this case is specified as:
P(θ | data) = [P(data | θ) ∗ P(θ)] / [∫ P(data | θ) ∗ P(θ) dθ]. (9.31)
This is the generic version of Bayes’ Theorem when the posterior distribution for a single
parameter, θ, given the observed data, is represented by a pdf.
Notice the notation P(θ | data) for the posterior distribution of θ. Here, P indicates
probability density. Both the prior (P(θ)) and the posterior P(θ | data) distributions are
pdfs. In contrast, with the discrete version of Bayes’ Theorem, the prior and posterior
distributions are pmfs, denoted with Pr.
The likelihood in Bayes’ Theorem may be written P(data | θ) or Pr(data | Hi)—which version
you use depends on the problem you are solving.
Throughout this book, we will express computations for the likelihood term of Bayes'
Theorem as L(data; θ), where L symbolizes "likelihood" and the semicolon means "given." We
do this to differentiate the likelihood term in Bayes' Theorem from the actual computation.
As a side note to our readers who previously have studied maximum likelihood methods,
read the box below:
You may have seen likelihood expressed as the likelihood of the parameters given the
data, or L(θ; data), which is conventional. Regardless of the notation used, you must enter
values for the parameters and observed data to calculate a result.
Earlier in this chapter, we used the normal pdf to get the probability density of
observing x, given the mean lifespan is μ and the standard deviation is σ:
f(x; μ, σ) = (1 / (√(2π) σ)) e^(−(x−μ)²/(2σ²)). (9.32)
This is used before we observe the data. When we speak of likelihood, in contrast, we have
the variables (data) in hand, and one or more of the parameters are unknown. To
compute the likelihood of the data, we first need to make an assumption about how
the data were generated. Here, we assume that the data are generated from a normal
distribution. Let's let X be our random variable, and X is the lifetime of a bacterium in
hours. We can write:

X ∼ N(μ, σ). (9.33)
Suppose we hypothesize that μ ¼ 5.0, and assume that σ is known to be 0.5. And further
suppose that we draw a random bacterium that lives x ¼ 4.5 hours. We can ask, “What is the
likelihood that x ¼ 4.5 given that μ is 5.0 and σ ¼ 0.5?” We will use the normal pdf to
answer this question, which is
L(x = 4.5; μ = 5.0, σ = 0.5) = (1 / (√(2π) σ)) e^(−(x−μ)²/(2σ²)). (9.34)
Now, let’s plug in the values that we know to get our answer:
L(x = 4.5; μ = 5.0, σ = 0.5) = (1 / (√(2π) ∗ 0.5)) e^(−(4.5−5.0)²/(2 ∗ 0.5²)) = 0.4839414. (9.35)
This is a probability density. Look for this result in blue in Figure 9.12.
Suppose now that we hypothesize that μ ¼ 4.1 (with σ known to be 0.5). Now let’s calculate
the likelihood:
L(x = 4.5; μ = 4.1, σ = 0.5) = (1 / (√(2π) ∗ 0.5)) e^(−(4.5−4.1)²/(2 ∗ 0.5²)) = 0.5793831. (9.36)
Look for this result in green in Figure 9.12.
As you can see, we have an infinite number of possibilities for μ. Nevertheless, we can
generate the likelihood profile by repeatedly plugging in values and mapping the shape
of the surface. Here it is in Figure 9.12.
Figure 9.12 The likelihood profile for μ, generated by computing L(x = 4.5; μ, σ = 0.5) across candidate values of μ. The x-axis is Mean Survival Rate (hours); the y-axis is Likelihood.
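The likelihood profile can be traced numerically; here is a sketch (ours, not the book's, with an illustrative grid of μ values):

```python
from math import sqrt, pi, exp

def likelihood(x, mu, sigma):
    """L(x; mu, sigma): the normal pdf evaluated at the observed datum x."""
    return (1.0 / (sqrt(2 * pi) * sigma)) * exp(-(x - mu)**2 / (2 * sigma**2))

# Sweep candidate values of mu for the single observation x = 4.5, with sigma = 0.5
mus = [2.0 + 0.01 * i for i in range(601)]   # 2.00, 2.01, ..., 8.00
profile = [likelihood(4.5, mu, 0.5) for mu in mus]

# The profile peaks at the hypothesized mu closest to the observed value
best = mus[profile.index(max(profile))]
print(round(best, 2))  # 4.5
```

With a single observation, the most likely value of μ is the observation itself.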
Answer: We generate the likelihood surface for two parameters. In other words, we play
the ‘what if ’ game for both μ and σ.
To illustrate, suppose we selected three bacteria at random that lived 3.6,
4.7, and 5.8 hours, respectively. These are our observed data. We are trying to estimate μ
and σ from the distribution that generated these data. We choose a suite of combinations of
μ and σ. Then, for each datapoint, we calculate the likelihood of observing the data.
We use the normal pdf as our baseline:
f(x; μ, σ) = (1 / (√(2π) σ)) e^(−(x−μ)²/(2σ²)). (9.37)
For instance, if μ = 5 and σ = 1.0, then the likelihood of observing x = 3.6 is calculated as:

L(x = 3.6; μ = 5.0, σ = 1.0) = (1 / (√(2π) ∗ 1.0)) e^(−(3.6−5)²/(2 ∗ 1.0²)) = 0.1497275. (9.38)

Similarly, for the other two datapoints:

L(x = 4.7; μ = 5.0, σ = 1.0) = (1 / (√(2π) ∗ 1.0)) e^(−(4.7−5)²/(2 ∗ 1.0²)) = 0.3813878. (9.40)

L(x = 5.8; μ = 5.0, σ = 1.0) = (1 / (√(2π) ∗ 1.0)) e^(−(5.8−5)²/(2 ∗ 1.0²)) = 0.2896916. (9.41)

If we assume that all three datapoints were independent, we can compute the
likelihood of observing the full dataset (which is 3 observations) given μ = 5
and σ = 1 as the product of the independent likelihoods:

L = 0.1497275 ∗ 0.3813878 ∗ 0.2896916 = 0.01654262. (9.42)
If we played this game across different combinations of μ and σ, we could create a likeli-
hood surface that looks like the one in Figure 9.13.
Figure 9.13 The likelihood surface for the unknown parameters μ and σ, given the observed data. The axes are μ, σ, and Likelihood.
This is a likelihood surface for the unknown parameters, μ and σ. See if you can find
the likelihood value when μ ¼ 5 and σ ¼ 1. It should be 0.01654262 if our calculations were
correct.
Our observed dataset consists of 3 values of lifespan: 3.6, 4.7, and 5.8. The mean of this
dataset is 4.7 and the standard deviation is 1.1. You should see that the most likely
parameter estimates in the graph above correspond to these values.
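The joint likelihood of the three observations can be checked with a quick Python sketch (ours, not the book's):

```python
from math import sqrt, pi, exp

def likelihood(x, mu, sigma):
    """The normal pdf evaluated at the observed datum x."""
    return (1.0 / (sqrt(2 * pi) * sigma)) * exp(-(x - mu)**2 / (2 * sigma**2))

data = [3.6, 4.7, 5.8]  # observed lifespans in hours

def joint_likelihood(data, mu, sigma):
    """Joint likelihood of independent observations: the product of the individual likelihoods."""
    total = 1.0
    for x in data:
        total *= likelihood(x, mu, sigma)
    return total

print(round(joint_likelihood(data, 5.0, 1.0), 7))  # 0.0165426
```

This matches the value on the likelihood surface at μ = 5, σ = 1. Sweeping this function over a grid of (μ, σ) pairs would trace out the full surface in Figure 9.13.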
Answer: Ah, yes! Let’s not lose sight of the big picture!
Suppose we are trying to estimate a single parameter called θ. Bayes’ Theorem in this case
is specified as:
P(θ | data) = [P(data | θ) ∗ P(θ)] / [∫ P(data | θ) ∗ P(θ) dθ]. (9.43)
This is the generic version of Bayes’ Theorem when the posterior distribution for a single
parameter, given the observed data, is represented by a pdf.
Note that this version deals with a single parameter called θ. But you can use Bayes’ Theorem
to estimate multiple parameters. We’ll touch on this topic more in future chapters.
Answer: Nope! Pðθ j dataÞ is a pdf, so the probability that θ assumes a specific value is 0.
Remember that the probability that a bacterium lives for exactly 5.0000000 hours is 0!
Answer: There are infinite hypotheses, and since you can’t estimate the probability of a
specific hypothesis, you are left with evaluating all of them! In other words, you
must estimate the entire posterior distribution.
Answer: OK! Let’s try to put together everything we’ve learned so far!
Let’s go through a quick example where we are considering hypotheses for an unknown
mean, which is a parameter from the Gaussian distribution. How about the average lifespan
of a bacterium? For this example, we’ll assume σ is known to be 0.5.
P(μ | data) = [P(data | μ) ∗ P(μ)] / [∫ P(data | μ) ∗ P(μ) dμ]. (9.44)
We have an infinite number of hypotheses for μ, and we want to update our beliefs after
collecting some data. Let’s review the steps of Bayesian analysis.
Figure 9.14 The prior distribution for μ: a uniform pdf, U(4, 6), with density 0.5.
This represents our prior distribution. This is called a proper prior because the density
function integrates to 1.
Step 3. Collect data.
Suppose we draw a random bacterium that lives 4.7 hours.
Step 4. Compute the likelihood of the data under each hypothesis.
Now we play the what if game, and use the normal pdf to compute the likelihood of
observing x ¼ 4.7 under different values of μ. Suppose, for the sake of example, that we
know that σ ¼ 0.5. We start by writing the normal pdf:
f(x; μ, σ) = (1 / (√(2π) σ)) e^(−(x−μ)²/(2σ²)). (9.45)
But instead of calling the function f, let’s call it L. And, let’s fully specify our model. Here is
the likelihood value for just one value of μ:
L(x = 4.7; σ = 0.5, μ = 4.0) = (1 / (√(2π) ∗ 0.5)) e^(−(4.7−4.0)²/(2 ∗ 0.5²)) = 0.2994549 (9.46)
Figure 9.15 The likelihood of observing x = 4.7 under hypothesized values of μ between 4 and 6, with σ = 0.5. The x-axis is Mean Survival Rate (hours); the y-axis is Likelihood.
Here, we have the likelihood of observing the data, x ¼ 4.7, under each hypothesized value
for μ. Given just a single datapoint, the most likely survival rate is 4.7. You can see from the
shape of this curve that several other hypotheses have decent height, too. The hypothesis
with the lowest likelihood is μ ¼ 6.0.
Remember, our prior distribution is the uniform pdf. The placeholder for that function is
shown in blue:
P(μ | data) = [P(data | μ) ∗ P(μ)] / [∫ P(data | μ) ∗ P(μ) dμ]. (9.48)
And our likelihood function is generated using the Gaussian pdf. The placeholder for that
function is shown in red:
P(μ | data) = [P(data | μ) ∗ P(μ)] / [∫ P(data | μ) ∗ P(μ) dμ]. (9.49)
The posterior distribution is another pdf of some sort. The placeholder for that function is
shown in purple:
P(μ | data) = [P(data | μ) ∗ P(μ)] / [∫ P(data | μ) ∗ P(μ) dμ]. (9.50)
Answer: Yes. Let’s focus on the big picture and create a diagram for this problem (Kruschke
plot) as we did in Chapter 8 (see Figure 9.16). Remember, this diagram is intended to
communicate the structure of the prior and likelihood.
μ ∼ U(a = 4, b = 6)
↓
xi ∼ N(μ, σ = 0.5)
Figure 9.16
At the bottom of this diagram, we have our observed data, xi. The data were generated from
a normal distribution with an unknown parameter, μ, and σ ¼ 0.5. The parameter, μ, in
turn, is the unknown parameter that we are trying to estimate. Here, we use a uniform
distribution with parameters a ¼ 4 and b ¼ 6 to set the “weights” on the alternative
hypotheses for μ. Thus, you can interpret the blue distribution above as a prior distribution
that provides our weights of belief, or current knowledge, regarding μ. The drawing of this
distribution can be generalized.
This interpretation falls in line with that given by Sander Greenland (2006), who writes,
“It is often said (incorrectly) that parameters are treated as fixed by the frequentist but as
random by the Bayesians. For frequentists and Bayesians alike, the value of a parameter may
have been fixed from the start or may have been generated from a physically random
mechanism. In either case, both suppose it has taken on some fixed value that we would
like to know. The Bayesian uses formal probability models to express personal uncertainty
about that value. The ‘randomness’ in these models represents personal uncertainty about
the parameter’s value; it is not a property of the parameter (although we should hope it
accurately reflects properties of the mechanisms that produced the parameter).”
But there is another interpretation for the blue prior distribution displayed: the unknown
parameter, μ, is a random variable that arises from a uniform distribution. In this
example, μ assumes the values between μ ¼ 4 and μ ¼ 6. The Bayesian machinery will
update the prior distribution to a posterior distribution in light of new data.
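The Bayesian update in Equation 9.44 has no simple closed form here, but it can be approximated on a fine grid of μ values. This is a rough sketch (ours, not the book's) using the uniform prior U(4, 6), σ = 0.5, and the single observation x = 4.7:

```python
from math import sqrt, pi, exp

def normal_pdf(x, mu, sigma):
    """Density of the normal distribution N(mu, sigma) at x."""
    return (1.0 / (sqrt(2 * pi) * sigma)) * exp(-(x - mu)**2 / (2 * sigma**2))

sigma, x_obs = 0.5, 4.7
step = 0.01
mus = [4.0 + step * i for i in range(201)]           # grid over the prior's support [4, 6]

prior = [0.5 for mu in mus]                          # uniform prior density, f(mu) = 0.5
like = [normal_pdf(x_obs, mu, sigma) for mu in mus]  # likelihood of x = 4.7 under each mu

# Posterior is proportional to prior * likelihood; normalize so the area sums to 1
unnorm = [p * l for p, l in zip(prior, like)]
total = sum(u * step for u in unnorm)                # approximates the integral in the denominator
posterior = [u / total for u in unnorm]              # posterior density at each grid point

best = mus[posterior.index(max(posterior))]
print(round(best, 2))  # 4.7
```

With a flat prior and one observation, the posterior peaks at the observed value, 4.7 hours. This grid trick foreshadows the MCMC machinery discussed next: both replace an intractable integral with a computational approximation.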
Answer: Sometimes it is intractable! For those cases, we need a special “tool” that can help
us estimate the posterior distribution without a mathematical, closed-form solution. That
tool is called Markov Chain Monte Carlo (MCMC), and we will introduce this topic later in
the book.
There are a few cases, however, where you can use a particular pdf as a prior distribution,
collect data of a specific flavor, and then derive the posterior pdf. In these special cases, the
pdf of the prior and posterior are the same probability density function, but their
parameters may differ. The prior distribution is called a conjugate prior (Raiffa and Schlaifer,
1961), and the effect of the data can then be interpreted in terms of changes in parameter
values (Upton and Cook, 2014).
Here are some examples:
SECTION 4
Bayesian Conjugates
Overview
There are many other conjugate solutions, but we’ve highlighted three commonly used
solutions. In each chapter, we set the scene with a fun problem and then use a conjugate
solution to update the prior distribution for an unknown parameter to the posterior
distribution. We will revisit each of these chapters in Section 5, where we will show you
how to estimate the posterior distribution with a different method, MCMC.
CHAPTER 10
We hope you now have a very solid understanding of probability distributions. In this
chapter (and the next two that follow), we show you how to use Bayes’ Theorem to estimate
the parameters of a probability distribution. Indeed, a very common use of Bayesian
inference involves parameter estimation, where the analysis uses probability density
functions (pdf’s).
By the end of this chapter, you will have a thorough knowledge of the following:
• Beta distribution
• Binomial data
• Hyperparameters
• Conjugate prior
• Credible intervals
The simplest way to show you how Bayes’ Theorem is used to estimate parameters is to dive
right into a new problem. Let’s run through an example of parameter estimation with
Bayesian inference.
This one was taken from The Mike Wise Radio Show (accessed August 18, 2017).
NBA star Shaquille O’Neal and a friend debated whether or not he could get into the
White House without an appointment. The wager: 1000 push-ups.
Here’s a picture of Shaq in case you haven’t met him (see Figure 10.1); he’s 7′1″, weighs 325 pounds, and has earned tons of accolades in the NBA. On top of that, he’s a rapper and
has a Ph.D. At the time of the bet, Barack Obama was President of the United States. Shaq
knows that President Obama is a huge basketball fan and coach and is sure he can get in to
meet with the President.
For this chapter, we’re going to broaden this problem a bit and ask, “What is
the probability that any famous person (like Shaq) can drop by the White
House without an appointment?”
Bayesian Statistics for Beginners: A Step-by-Step Approach. Therese M. Donovan and Ruth M. Mickey,
Oxford University Press (2019). © Ruth M. Mickey 2019.
DOI: 10.1093/oso/9780198841296.001.0001
Answer: Well, let’s see. Shaq is going to attempt to get into the White House, so that
represents a trial. He will either succeed or fail. Shaq thinks his probability of success is
quite high. The outcome of this trial determines who wins the bet. Ring any bells?
This is a binomial problem. You might recall that the binomial probability mass
function is:
f(y; n, p) = C(n, y) · p^y · (1 − p)^(n − y),   y = 0, 1, …, n   (10.1)

Here, C(n, y) is the binomial coefficient (“n choose y”).
Figure 10.2: the binomial distribution for n = 1 and p = 0.7 (x-axis: Successes, 0 or 1; y-axis: Probability).
With one trial, the probability of getting in is 0.7, and the probability of not getting in is
0.3. The sum of the probabilities equals 1.0. Incidentally, a binomial distribution with just 1
trial is also called a Bernoulli distribution. Remember?
If the bet let Shaq try three times, he could compute the probability of observing 0, 1, 2, or 3 successes out of three trials, given p = 0.7:
f(y; 3, 0.7) = C(3, y) · 0.7^y · (1 − 0.7)^(3 − y).   (10.4)
The binomial distribution for n = 3 and p = 0.7 looks like the one shown in Figure 10.3.
Figure 10.3: the binomial distribution for n = 3 and p = 0.7 (x-axis: Successes, 0–3; y-axis: Probability).
With 3 attempts, and p (the probability of success) being 0.7, Shaq would have a small
chance of never getting in (0 successes out of 3 trials; a probability of 0.027).
Answer: You might recall that a key assumption of the binomial distribution is that the
trials are independent. If Shaq makes all three attempts, it could be considered an example
of non-independence. You might argue, however, that if the White House guards change so
that the same guards don’t confront Shaq, the trials would be independent. However,
different guards might have different standards for uninvited guests! Let’s “play on,”
assuming that the trials are independent.
Shaq’s friend, however, believes Shaq’s chances are much lower than 0.7. If p is, say, 0.1, then Shaq’s chances would look much different. Figure 10.4 shows the binomial distribution with n = 3 trials and p = 0.1.
Figure 10.4: the binomial distribution for n = 3 and p = 0.1 (x-axis: Successes, 0–3; y-axis: Probability Mass).
In this case, Shaq would have a large chance of failure (in fact, a probability of 0.729).
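These binomial calculations are easy to check numerically. Below is a small sketch of our own (not from the book), using only Python’s standard library; the helper name `binom_pmf` is ours:

```python
from math import comb

def binom_pmf(y, n, p):
    """Binomial probability of y successes in n trials (Equation 10.1)."""
    return comb(n, y) * p**y * (1 - p)**(n - y)

# Shaq's view: p = 0.7 with n = 3 attempts
print(binom_pmf(0, 3, 0.7))   # ~0.027: a small chance of never getting in
# His friend's view: p = 0.1
print(binom_pmf(0, 3, 0.1))   # ~0.729: failing all three attempts is likely
```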
Here’s the kicker: We don’t know what p (the probability of success) is!
Obviously, there is some dispute over what value it is, or there would not be
a bet! So, here we are confronted with a parameter estimation problem.
You are trying to estimate p, the probability that Shaquille O’Neal or some other famous
person can get into the White House without an invitation.
Our goal here is to use a Bayesian inference approach to estimate the probability that a celebrity can get into the White House without an invitation.
We’ll use the same steps that we have in previous chapters:
1. Identify your hypotheses—these would be the alternative hypotheses for p, ranging
from 0 to 1.00.
2. Express your belief that each hypothesis is true in terms of prior densities.
3. Gather the data—Shaq makes his attempt, and will either fail or succeed.
4. Determine the likelihood of the observed data, assuming each hypothesis is true.
5. Use Bayes’ Theorem to compute the posterior densities for each value of p (i.e., the
posterior distribution).
Now, let’s go through the steps one at a time.
Answer: We know that p, the probability of success in the binomial distribution, can
take on any value between 0 and 1.0. Shaq could say that the prior probability of p is close
to 1.0; his friend could disagree and suggest that the prior probability is closer to 0.01. These
are just two hypotheses for p. However, there’s nothing to stop us from considering
the full range of hypotheses between 0 and 1, which is infinite (p = 0.01, p = 0.011, p = 0.0111, etc.).
Answer: We need to assign a prior for each hypothesized value of p. Here, we will use the
beta distribution to set prior probabilities for each and every hypothesis for p. An
example of a beta probability distribution is shown in Figure 10.5.
Figure 10.5: an example beta probability distribution (x-axis: Hypotheses for p, from 0 to 1; y-axis: Density).
The x-axis for the beta distribution ranges between 0 and 1. That is to say, random variables that are beta distributed must be between 0 and 1. The y-axis gives the probability density, which we learned about in the last chapter. Remember that density is not probability, per se; probability corresponds to area under the curve, and the total area must be 1. But you can think of the densities associated with each value of p as the “weight” for each hypothesis. The distribution above would put greater weight on p’s between, say, 0.2 and 0.6 than on other values.
Answer: The Oxford Dictionary of Statistics tells us that the beta distribution is “a
distribution often used as a prior distribution for a proportion.” Wikipedia has this to say:
“In probability theory and statistics, the beta distribution is a family of continuous probability distributions defined on the interval (0, 1) parameterized by two positive shape
parameters, typically denoted by alpha (α) and beta (β). The beta distribution can be suited
to the statistical modelling of proportions in applications where values of proportions equal
to 0 or 1 do not occur.”
So, the beta distribution is a statistical distribution whose x-axis spans the interval from 0
to 1. The beta distribution has two parameters that control its shape and position, and these
parameters are named alpha (α) and beta (β). The values for α and β must be positive (0 and negative entries won’t work). The beta probability distribution shown in Figure 10.5 has parameters α = 2 and β = 3.
Observations drawn from this distribution can take on values between 0 and 1. This can
be denoted generally as:
X ~ beta(α, β)   (10.5)

f(x; α, β) = (1 / B(α, β)) · x^(α − 1) · (1 − x)^(β − 1),   0 < x < 1.   (10.6)
Can you find the two parameters, α and β here? The beta function, B, is a normalization
constant to ensure that the area under the curve is 1. Don’t confuse the beta function with
the beta pdf!
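To make Equation 10.6 concrete, here is a small sketch of our own (standard library only) that builds the beta pdf from gamma functions, using the standard identity B(α, β) = Γ(α)Γ(β)/Γ(α + β); the helper name `beta_pdf` is ours:

```python
from math import gamma

def beta_pdf(x, a, b):
    """Beta density of Equation 10.6; B(a, b) is the normalizing constant."""
    B = gamma(a) * gamma(b) / gamma(a + b)
    return x**(a - 1) * (1 - x)**(b - 1) / B

# The distribution shown in Figure 10.5 has alpha = 2 and beta = 3
print(beta_pdf(0.5, 2, 3))   # ~1.5
```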
Figure 10.6: beta distributions with different shape parameters (x-axis: Hypotheses for p; y-axis: Density).
Since Shaq and his friend have totally different ideas of what p should be, they might settle on using a beta distribution with α = 0.5 and β = 0.5 as the prior distribution for p, which gives the U-shaped result in blue.
Answer: Generally speaking, the bigger α is relative to β, the more we shift the weight of
the curve to the right, and the bigger β is relative to α, the more we shift the weight of the
curve to the left.
But there is another way to get a handle on α and β. The mean of a beta distribution (also called its first moment) can be calculated as:

μ = α / (α + β).   (10.7)

Its variance is:

σ² = (α · β) / ((α + β)² · (α + β + 1)).   (10.8)
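Equations 10.7 and 10.8 are easy to verify numerically. The sketch below is ours: it integrates the mean and variance of the beta(2, 3) distribution directly from the density (midpoint rule) and compares them to the formulas:

```python
from math import gamma

def beta_pdf(x, a, b):
    B = gamma(a) * gamma(b) / gamma(a + b)
    return x**(a - 1) * (1 - x)**(b - 1) / B

def beta_moments(a, b, n=50_000):
    """Numerically integrate the mean and variance of beta(a, b)."""
    xs = [(i + 0.5) / n for i in range(n)]
    mean = sum(x * beta_pdf(x, a, b) for x in xs) / n
    var = sum((x - mean)**2 * beta_pdf(x, a, b) for x in xs) / n
    return mean, var

a, b = 2, 3
mean, var = beta_moments(a, b)
print(mean, a / (a + b))                        # both ~0.4  (Eq. 10.7)
print(var, a * b / ((a + b)**2 * (a + b + 1)))  # both ~0.04 (Eq. 10.8)
```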
Answer: Sure. Suppose YOU think that the average probability of Shaq getting into the White House is p = 0.10 but think that neighboring values also are likely and set the standard deviation, σ, to 0.05. Remember that the standard deviation is the square root of the variance, σ². Rearranging Equations 10.7 and 10.8 and plugging in μ = 0.10 and σ² = 0.0025, we solve:

α = μ · (μ(1 − μ)/σ² − 1) = 0.10 · (0.10 · 0.90/0.0025 − 1) = 3.5

β = (1 − μ) · (μ(1 − μ)/σ² − 1) = 0.90 · 35 = 31.5
You can derive this on your own at http://www.wolframalpha.com/. The resulting beta
distribution looks like the one shown in Figure 10.7.
Figure 10.7: the beta distribution with α = 3.5 and β = 31.5 (x-axis: Hypotheses for p; y-axis: Density).
This little shortcut—called moment matching—can help you parameterize the beta
distribution using means and standard deviations, which are probably more familiar
to you.
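If you’d rather not use Wolfram Alpha, the moment-matching algebra is easy to code. A sketch of our own (solve for s = α + β from the variance formula, then split s by the mean; the function name is ours):

```python
def beta_from_moments(mu, sigma):
    """Moment matching: recover (alpha, beta) from a desired mean and sd."""
    s = mu * (1 - mu) / sigma**2 - 1   # s = alpha + beta
    return mu * s, (1 - mu) * s

alpha, beta = beta_from_moments(0.10, 0.05)
print(alpha, beta)   # ~3.5 and ~31.5, matching Figure 10.7
```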
What prior distribution did Shaq and his friend settle on?
Answer: Let’s assume they go with a beta distribution with α and β set to 0.5. They think p is either really high or really low. We can designate these as α0 = 0.5 and β0 = 0.5, where the “naught” subscripts alert us that these are the parameters of a prior distribution. Technically, these are known as hyperparameters.
Hyperparameters?
Shaq and his friend had to select a prior distribution, and we can imagine them arguing
over which prior distribution to use. The prior distribution in Figure 10.8 may have resulted
after hours of heated discussion. Of course, Shaq and his friend may have opted to do a bit
of research and make inquiries to the FBI regarding other uninvited but attempted visits by
famous people!
Figure 10.8: the prior distribution, a beta distribution with α0 = 0.5 and β0 = 0.5 (x-axis: Hypotheses for p; y-axis: Density).
Answer: Collect data! That is step 3. Let’s assume that Shaq makes 1 attempt and fails to get in. In binomial function terms, the number of trials is n = 1, and the number of successes is y = 0.
Let’s create a diagram that illustrates the process by which the data were generated
(see Figure 10.9).
p ~ beta(α0, β0)
yi ~ binomial(n = 1, p)
Figure 10.9
At the bottom of this diagram, we have our observed data, yi. If Shaq makes one attempt, the data arise from a binomial distribution with parameters n = 1 and p. You may recall that a binomial distribution with n = 1 is also called a Bernoulli distribution. The parameter, p, in turn, is the unknown parameter that we are trying to estimate. Here, we use a beta distribution to set the “weights” on each and every hypothesis for the unknown parameter, p.
Answer: Step 4 is to determine the likelihood of the observed data, assuming each
hypothesis is true.
Here is where your previous study of the binomial probability mass function comes into play. Because the data, 1 failure out of 1 attempt, are in hand, we will call the result the “likelihood.”
ℒ(y; n, p) = C(n, y) · p^y · (1 − p)^(n − y).   (10.11)
The number of trials is denoted n. The number of observed successes is denoted as y. The
probability of success is denoted as p.
Now, for each hypothesized value of p, let’s compute the binomial likelihood
of observing 0 successes out of 1 trial.
Just to give you an example, under the p = 0.3 hypothesis, the likelihood of observing 0 successes out of 1 trial, given p = 0.3, is computed as:

ℒ(y = 0; n = 1, p = 0.3) = C(1, 0) · 0.3^0 · (1 − 0.3)^(1 − 0) = 0.7.   (10.12)
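As a sketch of our own, you can sweep this likelihood across many hypotheses for p; with y = 0 and n = 1 it collapses to ℒ = 1 − p (the helper name `binom_lik` is ours):

```python
from math import comb

def binom_lik(y, n, p):
    """Binomial likelihood of Equation 10.11."""
    return comb(n, y) * p**y * (1 - p)**(n - y)

# likelihood of Shaq's data (0 successes in 1 trial) under several hypotheses
for p in (0.1, 0.3, 0.7, 0.9):
    print(p, binom_lik(0, 1, p))   # ~0.9, ~0.7, ~0.3, ~0.1 respectively
```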
Now what?
Answer: Well, we need to set this Bayesian inference problem up for a continuous
random variable. Suppose we are trying to estimate a single parameter called θ. If θ is
continuous, you have an infinite number of hypotheses for it. Bayes’ Theorem in this case
is specified as:
P(θ | data) = P(data | θ) · P(θ) / ∫ P(data | θ) · P(θ) dθ.   (10.13)
This is the generic version of Bayes’ Theorem when the posterior distribution for a single
parameter, given the observed data, is represented by a pdf. This is designated Pðθ j dataÞ,
where P is probability density (not probability). Pay attention to this notation! This
is the left side of the equation. On the right side of the equation, the numerator multiplies
the prior probability density of θ, which is written PðθÞ, by the likelihood of observing the
data under a given hypothesis for θ, which is written Pðdata j θÞ. Technically, the likelihood
could be a pmf or a pdf, depending on the problem. For example, in this chapter’s
illustration, the likelihood is a pmf. In the denominator, we see the same terms, but this time we also see a few more symbols. The symbol ∫ means “integrate,” which roughly means “sum up all the pieces” for each tiny change in θ, which is written dθ. In other words, the denominator takes the prior density ∗ likelihood for all possible hypotheses for θ and then sums them.
For the Shaq problem, we replace θ with p, so Bayes’ Theorem looks like this (where the
priors are colored blue, the likelihoods red, and the posterior purple):
P(p | data) = P(data | p) · P(p) / ∫₀¹ P(data | p) · P(p) dp.   (10.14)
But here’s the kicker: The integration of the denominator is often tedious,
and sometimes impossible!
Really?
Answer: Just kidding! For this particular problem, thankfully, no. There is an analytical
shortcut that makes updating possible in a snap. Here it is:
• αposterior = α0 + y
• βposterior = β0 + n − y
For the White House Problem, our prior distribution is a beta distribution with α0 = 0.5 and β0 = 0.5. Shaq made an attempt, so n = 1. He failed to get into the White House, so y = 0. We can now use this shortcut to calculate the parameters of the posterior distribution:
• posterior α = α0 + y = 0.5 + 0 = 0.5.
• posterior β = β0 + n − y = 0.5 + 1 − 0 = 1.5.
Incidentally, the parameters of the posterior distribution are also called hyperparameters; posterior hyperparameters, to be exact.
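The shortcut itself is a one-liner. A sketch of our own (the function name `update_beta` is ours):

```python
def update_beta(alpha0, beta0, n, y):
    """Beta-binomial conjugate update: prior beta(alpha0, beta0) plus
    y successes in n trials gives posterior beta(alpha0 + y, beta0 + n - y)."""
    return alpha0 + y, beta0 + n - y

# Shaq's first attempt: n = 1 trial, y = 0 successes
print(update_beta(0.5, 0.5, n=1, y=0))   # (0.5, 1.5)
```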
Now we can look at the prior and posterior distributions for p (see Figure 10.10).
Figure 10.10: the prior distribution, beta(α0 = 0.5, β0 = 0.5), and the posterior distribution, beta(α = 0.5, β = 1.5) (x-axis: Hypotheses for p; y-axis: Density).
We started off with a prior distribution shown in blue, where Shaq and his friend believed
they had a very high or very low probability of gaining entry to the White House without
an invitation. Then we collected binomial data: one failure out of one trial. We then used
the shortcut to generate the posterior distribution for all hypotheses of p.
As a result of Shaq’s failed attempt, we now have new knowledge about the support for
each and every hypothesis of p. Notice how the posterior really shifted towards the lower
end of the p spectrum.
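The conjugate shortcut can also be cross-checked against Equation 10.14 by brute force: evaluate prior × likelihood on a fine grid of hypotheses for p and normalize numerically. This sketch is ours (standard library only; the midpoint grid avoids the endpoints, where the beta(0.5, 0.5) density is infinite):

```python
from math import gamma

def beta_pdf(x, a, b):
    return x**(a - 1) * (1 - x)**(b - 1) * gamma(a + b) / (gamma(a) * gamma(b))

N = 100_000
grid = [(i + 0.5) / N for i in range(N)]                  # midpoints in (0, 1)
numer = [beta_pdf(p, 0.5, 0.5) * (1 - p) for p in grid]   # prior * likelihood, y = 0, n = 1
total = sum(numer) / N                                    # numerical integral (denominator)
post = [v / total for v in numer]                         # grid posterior density

# compare to the conjugate answer, beta(0.5, 1.5), at p = 0.5
print(post[N // 2], beta_pdf(grid[N // 2], 0.5, 1.5))     # nearly identical
```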
Suppose instead that a “flat” prior had been used, with α0 = 1 and β0 = 1. The same shortcut gives:
• αposterior = α0 + y = 1 + 0 = 1.
• βposterior = β0 + n − y = 1 + 1 − 0 = 2.
This prior distribution is shown in Figure 10.11 as solid blue, and the posterior distribution
is shown as solid purple. The Figure 10.10 result is shown as a dashed line for comparison.
These differences highlight that the choice of prior affects the posterior.
Figure 10.11: the prior beta(α0 = 1, β0 = 1) (solid blue) and the posterior beta(α = 1, β = 2) (solid purple), with the Figure 10.10 curves dashed for comparison (x-axis: Hypotheses for p; y-axis: Density).
Answer: No. Here is a strange twist: the U-shaped prior that was actually used (α and β = 0.5) is less informative (more vague) than the “flat prior” above, where α and β = 1.
The conjugate solutions show that as we make these parameters tiny, they have less
influence on the resultant posterior distributions compared to the data. In other words,
the data play a large role in determining the parameters of the beta posterior.
Thus, a non-informative prior for a beta distribution will be one in which α and β are tiny.
In the words of Zhu and Lu, “flat priors are not necessarily non-informative, and non-
informative priors are not necessarily flat.” Once again, we refer you to Hobbs and Hooten
(2015) for a thoughtful discussion on setting priors. These authors prefer the term “vague
prior” instead of non-informative prior because it’s difficult or impossible to have a prior
that is truly non-informative when dealing with a parameter that is continuous.
Answer: Let’s assume the first prior distribution mentioned was used (α0 and β0 = 0.5) and then updated after 1 failed attempt (giving α0 = 0.5 and β0 = 1.5 for the next round). We now set a prior distribution based on our updated knowledge, and then collect more data. Suppose Shaq fails again. Now, the parameters for our next posterior distribution will be as follows:
• αposterior = α0 + y = 0.5 + 0 = 0.5.
• βposterior = β0 + n − y = 1.5 + 1 − 0 = 2.5.
Now we can look at the prior and posterior distributions for p (see Figure 10.12).
Figure 10.12: the updated prior and posterior distributions for p (x-axis: Hypotheses for p; y-axis: Density).
Here you can see the major benefit of Bayesian inference: as we collect more data, we
update our beliefs. We start with a prior, collect data and then use Bayes’ Theorem to
generate posteriors. These posteriors then become the priors for the next round of data
collection. If we track our beliefs, we track our learning.
Answer: The answer is that the beta distribution is a conjugate distribution that can be
updated with binomial data. This is why we named this chapter “The beta-binomial
conjugate.”
These shortcuts were introduced by Howard Raiffa (Figure 10.13, left) and Robert Schlaifer (Figure 10.13, right) in their work on Bayesian decision theory. Their classic 1961 book is titled Applied Statistical Decision Theory.
The conjugate shortcuts are conveniently provided on a Wikipedia page on Bayesian
conjugates. A visual overview of that page’s beta-binomial conjugate can be depicted as
shown in Figure 10.14.
Unknown parameter: p
Prior distribution: Beta
Prior hyperparameters: α0, β0
Data: y
Likelihood: Binomial; n is known
Posterior distribution: Beta
Posterior hyperparameters: α = α0 + y; β = β0 + n − y
Figure 10.14
Here, each attempt was considered an experiment, where Shaq made a single attempt and
either succeeded or failed. Before making any attempt, we set a prior distribution on p. He
then failed to get into the White House. We then used this new information to get an
updated posterior distribution. This in turn became our prior distribution for Shaq’s second
attempt. He failed again, and this information was used to update to yet another posterior
distribution.
Answer: Yes. We would end up in the same place if we started with a prior distribution and
considered 2 attempts (both failures) before updating. Let’s confirm this:
• The prior:
• α0 = 0.5; β0 = 0.5
• The data:
• n=2
• y=0
• The posterior:
• α = 0.5 + 0 = 0.5
• β = 0.5 + 2 − 0 = 2.5
This is the same answer we got before.
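This equivalence is easy to demonstrate in code. A sketch of our own, reusing the conjugate shortcut:

```python
def update_beta(alpha0, beta0, n, y):
    """Beta-binomial conjugate update."""
    return alpha0 + y, beta0 + n - y

# one attempt at a time: two failed attempts, updating after each
a, b = 0.5, 0.5
a, b = update_beta(a, b, n=1, y=0)
a, b = update_beta(a, b, n=1, y=0)

# both attempts at once
a2, b2 = update_beta(0.5, 0.5, n=2, y=0)

print((a, b), (a2, b2))   # both (0.5, 2.5)
```

The order in which the data arrive does not matter; only the totals n and y do.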
Answer: First, Dictionary.com defines the word “conjugate” in several ways. In grammar it
means “to recite or display all or some subsets of the inflected forms of a verb.” For example,
the forms of “to be” are “I am, you are, he is, we are, you are, they are.” As a noun or
adjective, the word conjugate means “joined together,” especially in pairs. Thus, conjugates
have a common theme that ties them together.
We mentioned the statistical definition in Chapter 9, where we indicated that there are
cases where you can use a particular pdf as a prior distribution, collect data of a specific
flavor, and then derive the posterior pdf. In these special cases, the pdf of the prior and
posterior are the same probability density function, but their parameters may differ. The
prior distribution is called a conjugate prior (Raiffa and Schlaifer, 1961), and the effect of
the data can then be interpreted in terms of changes in parameter values (Upton and
Cook, 2014).
In the White House Problem, we used a beta distribution to set the priors for all
hypotheses of p, the probability that a famous person could get into the White House
without an invitation. We then collected binomial data, and, rather than computing the
posterior by the use of Bayes’ Theorem, we used the conjugate shortcut to generate the
posterior distribution for p.
Can the beta distribution be used as a conjugate prior for data other
than binomial data?
Answer: Yes. Wikipedia once more: “In Bayesian inference, the beta distribution is the
conjugate prior probability distribution for the Bernoulli, binomial, negative binomial and
geometric distributions. The beta distribution is a suitable model for the random behavior
of percentages and proportions.” These distributions all depend on a parameter, p, that can assume values between 0 and 1.
Answer: A picture may be worth a thousand words. If you can display the posterior
distribution, do so.
In addition, often people summarize the posterior distribution with some simple statistics that are familiar to most people: the mean, median, mode (all measures of middleness),
plus the quantiles.
For well-known probability distributions such as the beta distribution, you can simply
look up the expected values that characterize the distribution on Wolfram Mathworld:
• Mean = α / (α + β).
• Mode = (α − 1) / (α + β − 2), for α and β > 1.
• Variance = (α · β) / ((α + β)² · (α + β + 1)).
Let’s do this for the Shaq problem. Let’s calculate these for our new posterior, which has α = 0.5 and β = 2.5:
• Mean = 0.5 / (0.5 + 2.5) = 0.1667.
• Mode cannot be calculated because α < 1. Snap!
• Variance = (0.5 · 2.5) / ((0.5 + 2.5)² · (0.5 + 2.5 + 1)) = 0.0347.
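A sketch of our own confirming these summaries for the beta(0.5, 2.5) posterior:

```python
a, b = 0.5, 2.5   # posterior hyperparameters after two failed attempts

mean = a / (a + b)
variance = a * b / ((a + b)**2 * (a + b + 1))

print(mean)       # ~0.1667
print(variance)   # ~0.0347
# the mode is undefined here because alpha < 1 (the density spikes at p = 0)
```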
Answer: Report the Bayesian “credible intervals.” Bayesian credible intervals are commonly reported in Bayesian inference problems that involve parameter estimation. According to
the Oxford Dictionary of Social Research Methods (Elliot et al., 2016), a credible interval
represents “an interval in the domain of the posterior or predictive distribution. For a 95
percent credible interval, the value of interest lies with a 95 percent probability in the
interval. In other words, given the data and the model, there is a 95 percent chance the true
value lies in that interval.”
A Wikipedia article describes three methods for choosing a credible interval (see also
Kruschke, 2015):
• Choosing the narrowest interval, which for a unimodal distribution will involve choosing those values of highest probability density including the mode. This is sometimes called the highest posterior density interval.
• Choosing the interval where the probability of being below the interval is as likely as
being above it. This interval will include the median. This is sometimes called the equal-
tailed interval.
• Assuming that the mean exists, choosing the interval for which the mean is the
central point.
Let’s use our most up-to-date estimate of p (given Shaq’s two failed attempts) and go with option 2 above; that is, look for the 90% credible interval. Our posterior distribution is a beta distribution with α = 0.5 and β = 2.5. We need to find the points that leave 5% of the distribution in the lower tail and 5% in the upper tail. We can use any computer software program to find these values, which are p = 0.00087 for the lower end and p = 0.57 for the upper end (as shown by the dashed green lines in Figure 10.15).
Figure 10.15: the posterior beta(α = 0.5, β = 2.5), with the 90% credible interval bounds marked by dashed green lines (x-axis: Hypotheses for p; y-axis: Density).
Interpretation: The probability that p is between 0.00087 and 0.57 is 0.90. The word “probability” is correct here because we’ve calculated an area under the pdf.
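If you don’t have a stats package handy, these quantiles can be computed with nothing but the standard library. The sketch below is ours and is specific to this posterior: because α = 0.5, the substitution p = t² removes the integrable spike at 0, leaving a smooth integrand for the midpoint rule, and the quantiles are then found by bisection:

```python
from math import gamma, sqrt

A, B = 0.5, 2.5                              # posterior hyperparameters
NORM = gamma(A) * gamma(B) / gamma(A + B)    # the beta function B(0.5, 2.5)

def cdf(x, steps=4000):
    """P(p <= x) for beta(0.5, 2.5). With p = t**2 the integrand
    p**-0.5 * (1 - p)**1.5 dp becomes 2 * (1 - t*t)**1.5 dt, which is smooth."""
    u = sqrt(x)
    h = u / steps
    total = sum((1 - ((i + 0.5) * h)**2)**1.5 for i in range(steps))
    return 2 * total * h / NORM

def quantile(q):
    """Invert the cdf by bisection."""
    lo, hi = 0.0, 1.0
    for _ in range(50):
        mid = (lo + hi) / 2
        lo, hi = (mid, hi) if cdf(mid) < q else (lo, mid)
    return (lo + hi) / 2

print(quantile(0.05), quantile(0.95))   # ~0.00087 and ~0.57
```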
Answer: Indeed.
• We began by mentioning that Shaquille O’Neal made a bet with his friend that he could
get into the White House without an invitation.
• We broadened this problem to ask, “What is the probability that any famous person can
get into the White House without an invitation?”
• We introduced the beta distribution as a prior pdf that can be used to set hypotheses
for p, the probability of success, noting that the beta distribution returns a probability
density, not a probability. We selected hyperparameters for the beta distribution that
reflected both Shaq and his friend’s opinions about his probability of entry.
• We then collected data that is binomial in flavor. Shaq made one attempt and did not
get in.
• We then calculated the likelihood of observing the data under some hypothesized values
for p.
• But then we noted that we have an infinite number of hypotheses for p, so that the
denominator of Bayes’ Theorem now has an integral. This stopped us in our tracks.
• We relied on a shortcut—a conjugate solution—which allowed us to update our prior
beta distribution to a posterior beta distribution using the data (one trial, one failed
attempt).
• We then let Shaq make a second attempt. This time, we used our newly updated posterior
as the prior distribution and then updated it again, given the outcome of the second
attempt. We discussed this as a way to formalize “learning.”
• Finally, we discussed ways of presenting your results, including calculation of credible
intervals.
Answer: Sadly, he was denied entrance. Shaq recounted the story to a Washington Post
sportswriter this way (accessed August 17, 2017):
“I went to the gate,” [Shaq] said. “They were nice. They said, ‘Shaq, we can’t do it.’ I said,
‘I understand. Understood.’ ”
The sportswriter said that “Shaq left 1600 Pennsylvania Avenue in defeat Sunday, 1,000
push-ups in arrears. He said he was popping them off in bursts of 20 or 30 all day, and that
he had already logged about 100 by the time we saw him in the production room Monday
evening.”
We will revisit this chapter again in Section 5, where we will show you a different method
for estimating the posterior distribution. For now, though, Chapter 11 introduces another
conjugate called the gamma-Poisson conjugate. The chapter features blood and gore.
Beware!
CHAPTER 11
We interrupt this book to bring you the following news flash from the National Post:

“Shark attacks attributed to random Poisson bursts.”

Great headline, eh? This article appeared in the National Post on September 7, 2001. An
unusually large number of shark attacks off the coast of Florida in 2001 spurred a flurry of
excitement. What is the cause? Will I be next? In the story, Dr. David Kelton of Penn State
University explained the attacks this way:
“Just because you see events happening in a rash like this does not imply that there’s some
physical driver causing them to happen. It is characteristic of random processes that they exhibit
this bursty behavior.”
Comforted now? What random process was he referring to? The headline refers to a
Poisson process. You might have guessed this probability distribution is used to estimate
the number of times an event will occur. You’d be right! In this chapter, we’ll take a look at
the problem of shark attacks and put it into a Bayesian framework.
By the end of this chapter, you should be able to define and use the following:
• The Poisson probability mass function
• The gamma probability density function
• The gamma-Poisson conjugate
As usual, we’ll start with a question.
Answer: Because shark attacks are a public health problem, let’s see what the Oxford Dictionary of Public Health (Last, 2007) has to say: “A distribution function to describe rare events, or the sampling distribution of isolated counts in a continuum of space or time, such as cases of uncommon diseases.”
In the Shark Attack Problem, Dr. Kelton explains: “There is something in probability
theory called a Poisson process, giving amazingly good descriptions of such random
independent events occurring through time, like customer arrivals at a fast-food store,
cosmic rays striking a planet, accidents in a factory, airplane mishaps, and maybe shark
attacks.” We need to consider pmf’s because we have discrete numbers; that is, the number
of attacks is not continuous.
The Poisson distribution is a mathematical rule that assigns probabilities to the number
of occurrences observed.
The Poisson distribution is named for the French mathematician Siméon Poisson (see Figure 11.1), who derived this distribution in 1837. This is the same year that Queen Victoria of England took the throne. Coincidentally, “poisson” means “fish” in French, which is fitting for the Shark Attack Problem!
Let’s assume that the average rate is 2.1 attacks per year, based on the number of attacks
reported annually at Florida beaches in the past 10 years (see Table 11.1).
Knowing this, you can use the Poisson probability mass function to determine the
Poisson probability that there will be, say, 10 attacks next year.
Pr(X = x; λ) = λ^x · e^(−λ) / x!,   x = 0, 1, 2, …   (11.1)
where λ is the mean number of occurrences in a given period of time, x is the number of
occurrences we are interested in, and e is the natural logarithm constant (approximately
2.718), also called Euler’s number. The left side of the equation can be read, “What is the
probability that a random variable called X has a value of x, given λ.” Notice the three dots
to the right of the equation. Unlike the binomial distribution, where the number of possible
Table 11.1
Year Attacks
1 1
2 0
3 2
4 1
5 1
6 3
7 2
8 3
9 4
10 4
values is fixed (x = 0, 1, …, n), the possible values that x can assume for the Poisson distribution (x = 0, 1, …) is not finite; rather, it is “countably infinite.”
Answer: The Poisson pmf has just one parameter that controls both its shape and
location. This parameter is called “lambda” and is written λ. It represents the mean number
of occurrences in a fixed space or time period, such as the mean number of births in a year,
the mean number of accidents in a factory in a year, or the mean number of shark attacks in
a year. So, λ must be > 0 or the event will never occur. By the way, λ is also the variance of the
distribution.
Let’s work through an example to see the Poisson pmf in action. We have information in
Table 11.1 that suggests the average rate of shark attacks in Florida is 2.1 attacks per year.
What is the probability that there will be 10 attacks next year?
The probability would be calculated as:
Pr(x; λ) = λ^x · e^(−λ) / x!   (11.2)

Plugging in x = 10 and λ = 2.1:

Pr(x = 10; λ = 2.1) = 2.1^10 · e^(−2.1) / 10!   (11.4)
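As a quick sketch of our own (standard library only; the helper name `poisson_pmf` is ours), Equation 11.4 evaluates to a very small number:

```python
from math import exp, factorial

def poisson_pmf(x, lam):
    """Poisson probability of x events when the mean rate is lam (Eq. 11.1)."""
    return lam**x * exp(-lam) / factorial(x)

print(poisson_pmf(10, 2.1))   # ~0.000056: ten attacks in one year would be very surprising
```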
Figure 11.2: the Poisson distribution with λ = 2.1 (x-axis: Attacks per Year (x), 0–12; y-axis: Probability).
Look for the probabilities associated with 3 and 10 attacks. Also notice that the y-axis is
labeled “Probability” and that the sum of the bars must equal 1.0. Now take a look at the
x-axis; it consists of non-negative integers only. The graph is depicted in bars to indicate
there are only discrete outcomes; that is, that there is no way to observe, say, 2.5 shark
attacks in a year. In a nutshell, the outcome x is a non-negative integer, but the average rate,
λ, can be any positive number such as 2.1. Thus, the Poisson function is a probability mass
function (pmf).
There is only one parameter or “knob” to twiddle for this probability function, and that is λ. Let’s have a look at the Poisson distribution when λ = 1, 3, and 7 (see Figure 11.3).
Figure 11.3: Poisson distributions with λ = 1, 3, and 7 (x-axis: Number of Attacks per Year (x); y-axis: Probability).
These graphs highlight the fact that the Poisson probability distribution is governed by
just one parameter, λ. In looking at these three distributions, you might have noticed a few
things:
• First, the mean of the distribution is λ.
• Second, as λ gets larger, the Poisson distribution starts to look like a normal distribution.
• Third, the spread, or variance, of the distribution is the same as λ. So as λ increases, so does
the spread of the Poisson distribution. See if you can confirm this visually.
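A sketch of our own verifying the first and third bullets numerically, by summing moments over the pmf (the tail beyond x = 60 is negligible for these rates):

```python
from math import exp, factorial

def poisson_pmf(x, lam):
    return lam**x * exp(-lam) / factorial(x)

for lam in (1, 3, 7):
    xs = range(60)   # far enough into the tail to capture essentially all the mass
    mean = sum(x * poisson_pmf(x, lam) for x in xs)
    var = sum((x - mean)**2 * poisson_pmf(x, lam) for x in xs)
    print(lam, round(mean, 6), round(var, 6))   # mean and variance both equal lambda
```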
Answer: The news headline read: “Shark attacks attributed to random Poisson bursts.”
Let’s have a second look at the Poisson distribution for λ = 2.1, shown in Figure 11.4. The
number of shark attacks x = 1, 2, and 3 are all fairly likely outcomes, but 0, 4, and 5 are also
in the running. With an average attack rate of 2.1 per year, it would be rare, but not
impossible, to observe more than 7 attacks.
[Figure 11.4: the Poisson distribution for λ = 2.1; x-axis: Attacks per Year (x), 0–12; y-axis: Probability, 0.0–1.0.]
Let’s try an experiment where we randomly draw 50 observations from the Poisson
distribution in Figure 11.4 (λ = 2.1) and record our results in a table with 10 rows and 5
columns (see Table 11.2).
Table 11.2
2 4 2 2 3
1 3 1 1 0
1 2 1 1 1
3 1 3 1 3
3 4 2 1 3
2 1 3 3 3
5 2 3 1 0
5 2 4 5 1
4 5 0 4 1
7 2 3 1 3
What you are seeing are random draws from a Poisson distribution where λ = 2.1.
Heading down the columns, we drew a 2, then 1, then a 1, then 3, and so on. This is an
example of a random Poisson process like the one described in the shark attack article. We
can say that each observation, x, arises from a Poisson distribution. That is, the random
variable X is the number of shark attacks in a year, and X is distributed as a Poisson
distribution:
X ∼ Poisson(λ).    (11.6)
Now, take a look at the last four entries in column 1 and the first entry in column 2. Do
you see a series of attacks? If you did not know that these values came from a Poisson
distribution with λ = 2.1, you might conclude that shark populations are on the rise and
that the rate of attacks is actually increasing. With random variables, occasionally you can
string together a series of outcomes that strike you as “non-random,” even if they are.
Make sense?
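A table like Table 11.2 is easy to generate yourself. The sketch below is plain Python, using Knuth's classic sampling algorithm rather than any statistics library, to draw 50 values from a Poisson distribution with λ = 2.1:

```python
import math
import random

def draw_poisson(lam, rng):
    """One random draw from a Poisson(lam) distribution, via Knuth's algorithm."""
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1

rng = random.Random(42)  # seeded so the "experiment" is repeatable
draws = [draw_poisson(2.1, rng) for _ in range(50)]
print(draws[:10])               # the first ten draws
print(sum(draws) / len(draws))  # the sample mean should land near 2.1
```

Run it a few times with different seeds and you will occasionally see “bursts” like the run of high values in the table, even though every draw is independent.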
Suppose you are on a Bay(es) Watch patrol team, and 5 attacks occurred
this year. Your supervisor doubts that the current estimate of λ = 2.1 is valid.
She has charged you with getting an up-to-date estimate of the annual rate
of shark attacks. Thus, this is a parameter estimation problem, and you’ve
decided to use a Bayesian approach.
[Figure 11.5: Poisson distributions for λ = 2, λ = 5, and λ = 7; x-axis: Number of Attacks per Year (x), 0–15; y-axis: Probability, 0.00–0.25.]
Because multiple alternative Poisson distributions can yield 5 shark attacks with a
non-zero probability, it makes sense to consider all possible values for λ and then confront
each hypothesis with our data of 5 shark attacks. For each hypothesized λ, we ask: What is
the likelihood of observing 5 shark attacks? That way, you are estimating a probability
distribution for λ and giving credence to an infinite number of hypotheses instead of
estimating just a single point.
This is a natural Bayesian problem. For a Bayesian inference problem, we need a prior
distribution that represents our ‘belief’ in each alternative value for λ, and the likelihood of
the data, to generate the posterior distribution. We’ll get there before you know it.
Answer: Before we leap in, let’s take time to acquaint ourselves with a new probability
density function. Remember that λ can go from 0 to infinity. If our hypotheses for λ are
graphed along the x-axis, we need an x-axis that starts at 0 and increases to infinity. Because
λ can be any real number greater than 0, the x-axis must be continuous. Then on the
y-axis, we compute the probability density of each possible λ. An example of a prior
distribution for the various hypotheses for λ might look something like the one shown
in Figure 11.6.
[Figure 11.6: an example prior distribution for λ; x-axis: Hypotheses for λ, 0–15; y-axis: Probability Density, 0.00–0.25.]
Answer: Yes! In Chapter 10, we were intent on estimating the posterior distribution
for p, the probability that a famous person like Shaq could get into the White House
without an invitation. We used a beta distribution to set our priors for each and every
p (p = probability of success, which ranges between 0 and 1). The data we collected
constituted a binomial problem (number of successes out of number of trials). Then, we
obtained a new posterior distribution for p in light of the new data.
In this problem, the beta distribution won’t work for us. Why? Because λ can be greater
than 1. Fortunately there is another distribution that is suitable, and it is called the gamma
distribution.
Gamma distribution?
Answer: Wolfram and the Oxford Dictionary of Statistics (Upton and Cook, 2014) provide
descriptions of the gamma distribution.
If you checked out these links, you may have noticed that there are different ways of
expressing the gamma distribution, which is explained in a Wikipedia entry (accessed
August 17, 2017): “In probability theory and statistics, the gamma distribution is a two-
parameter family of continuous probability distributions. There are three parameterizations
in common use”: one based on a shape and a scale parameter, one based on a shape and a
rate parameter (the rate being 1 divided by the scale), and one based on a shape and a mean
parameter (the mean being the shape divided by the rate).
Note: If you dig into other references, take care to note that different software packages
may use different names for the shape, rate, and scale parameters. In this book, we will
stick with the names α for shape, β for rate, and θ for scale.
[Figure 11.7: example gamma distributions for (α, β) = (1.0, 0.5), (2.0, 0.5), (4.0, 0.5), (2.1, 1), (9.0, 2), (7.5, 1), and (0.5, 1); x-axis: Hypotheses for λ, 0–20; y-axis: Density, 0.0–0.5.]
Answer: Nope, not a thing. The beta distribution is controlled by two parameters that are
also named α and β, but these could have been named something else. It just so happens
that α and β are common names for parameters in probability functions, much like Spot
and Daisy are common names for dogs!
Answer: Can you see that all three parameterizations are related to one another? In the
first form, we have a shape and scale parameter. In the second version, the rate parameter is
just 1 divided by the scale parameter. In the third version, the mean is the shape parameter
divided by the rate parameter!
Which of the three alternative parameterizations you use depends on the
problem you are tackling. Here, we’ll stick with the second version which has α as the
shape parameter and β as the rate parameter, as that is the version commonly used for
Poisson hypotheses for λ. After all, λ from the Poisson distribution is a rate!
Answer: Well, because it’s a probability density function, you can’t calculate probability
for a specific value. Rather, you are calculating the probability density.
If you are given a particular gamma distribution, say, α = 2 and β = 1, you can compute
the density associated with, say, x = 3 with this function:
g(x; α, β) = β^α x^(α−1) e^(−βx) / Γ(α),  0 ≤ x < ∞.    (11.7)
This is the probability density function for the gamma distribution. Here, the function
name is g, and the inputs are x, α, and β. The random variable named X must assume values
greater than or equal to 0, and α and β must be positive. The output will provide the
probability density of x given α and β.
The funny symbol, Γ, in the denominator is the capital Greek letter for G (called gamma),
and represents the gamma function, which is not to be confused with the gamma
distribution! The gamma function is needed to ensure that the function integrates to 1.0 (that is,
the total area under the curve ¼ 1.0). Let’s try an example.
The probability density of observing x = 3, given α = 2 and β = 1, is:
g(x; α, β) = β^α x^(α−1) e^(−βx) / Γ(α)    (11.8)

g(3; 2, 1) = 1² · 3^(2−1) · e^(−1·3) / Γ(2).    (11.9)
Answer: When α is a positive integer, Γ(α) = (α − 1)!, which in our case ends up being
Γ(2) = (2 − 1)! = 1! = 1.0. So we are left with:
g(3; 2, 1) = 1² · 3^(2−1) · e^(−1·3) / Γ(2)    (11.10)

g(3; 2, 1) = 1² · 3^(2−1) · e^(−1·3) / (2 − 1)!    (11.11)
If α is not a positive integer, you can compute the value of the gamma function with an
online calculator.
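Python's standard library happens to include the gamma function (math.gamma), so the calculation above can be reproduced without an online calculator. A minimal sketch (the function name is our own):

```python
import math

def gamma_pdf(x, alpha, beta):
    """g(x; alpha, beta) -- the gamma pdf of Eq. 11.7, shape/rate form."""
    return beta ** alpha * x ** (alpha - 1) * math.exp(-beta * x) / math.gamma(alpha)

# Density at x = 3 for alpha = 2, beta = 1; the formula reduces to 3 * e^(-3)
print(round(gamma_pdf(3, 2, 1), 4))            # 0.1494
print(round(3 * math.exp(-3), 4))              # 0.1494, the hand calculation
print(math.gamma(2) == math.factorial(2 - 1))  # True: gamma(alpha) = (alpha - 1)!
```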
Here is a graph of the gamma distribution for α = 2 and β = 1. See if you can find the
probability density for x = 3 in Figure 11.8.
It’s worth pointing out again that this is a probability density function. The
y-axis gives density, and the x-axis features a continuous variable.
[Figure 11.8: the gamma distribution with α = 2 and β = 1; x-axis: x, 0–10; y-axis: Density, 0.0–0.8.]
Answer: Our goal here is to use a Bayesian inference approach to estimate the
rate of shark attacks per year, or λ.
We’ll use the same steps that we have in previous chapters:
1. Identify your hypotheses—these would be the alternative hypotheses for λ.
2. Express your belief that each hypothesis is true in terms of prior probabilities.
3. Gather the data.
4. Determine the likelihood of the observed data under each hypothesis.
5. Use Bayes’ Theorem to compute the posterior probabilities for each hypothesis.
[Figure 11.9: the model diagram. The observed data xi arise from a Poisson(λ) distribution, and the unknown parameter λ is in turn assigned a gamma prior: λ ∼ gamma(α0, β0), xi ∼ Poisson(λ).]
Figure 11.9
At the very bottom of this diagram, we have our observed data, xi. The data are random
observations that arise from a Poisson distribution, defined by the parameter λ. The
parameter, λ, in turn, is the unknown parameter that we are trying to estimate. We use a
gamma distribution to “weight” each and every hypothesis for λ, treating the unknown
parameter as a random variable from a gamma distribution.
Now that we have specified the problem, let’s step through these one at a time.
Answer: We need to assign a prior for each hypothesized value of λ, and we already
introduced the gamma distribution as a suitable prior probability distribution. We’ve
also decided that the gamma distribution that uses the shape and rate parameterization
suits our needs.
But the mean of that dataset is 2.1. How do I convert this to the α and β
parameters for a gamma distribution?
Answer: Here’s where the Wolfram MathWorld, the Oxford Dictionary of Statistics, or
Wikipedia pages on probability distributions can be very helpful. For instance, Figure 11.10
shows a typical Wikipedia panel featuring the gamma distribution (with our
parameterization highlighted in red).
These panels provide summary information such as the mean, median, and variance, for
a given gamma distribution when the shape α and rate β parameterization is used. Scan
down and look at the mean:
μ = E[X] = α/β.    (11.13)

Scan a bit further and you will also find the mode (the peak of the distribution):

mode = (α − 1)/β.    (11.14)
Support: x ∈ (0, ∞)
Probability density function (pdf): β^α x^(α−1) e^(−βx) / Γ(α)
Mean: E[X] = α/β; E[ln X] = ψ(α) − ln(β) (see digamma function)
Mode: (α − 1)/β, for α ≥ 1
Variance: Var[X] = α/β²; Var[ln X] = ψ₁(α) (see trigamma function)
Figure 11.10
Just as in the beta distribution in the last chapter, the magnitudes of α and β are not nearly as
important as the relation between them. For instance, if the mean is 2.1, you might set α
to 2.1 and β to 1.0. Or you might set α to 4.2 and β to 2.0.
If the mode (the peak of the curve) is at 2.1, you might set α to 3.1 and β to 1.0. Or you
might set α to 31 and β to 14.3 (see Figure 11.11).
[Figure 11.11: gamma distributions parameterized so that the mean is 2.1 (blue: α = 2.1, β = 1.0; α = 4.2, β = 2.0; α = 6.3, β = 3.0) or the mode is 2.1 (black: α = 3.1, β = 1.0; α = 31, β = 14.3); x-axis: x, 0–15; y-axis: Density, 0.0–1.2.]
• α and β for all of the blue distributions were calculated assuming the mean of the
distribution = 2.1. Notice that the mean is not the top of the curve, though, because there
is “weight” in the right tail of the distribution.
• When α and β are parameterized in terms of the mode, the peak of the curve is the mode
(black distributions).
With these graphs, it’s clear that even though you have prior information that the mean
number of shark attacks is 2.1 per year, you still need to exercise some subjectivity with
respect to which prior distribution to use. You could also consider the variance of the data
in Table 11.1 in making your selection (we’ll cover how to do that in Chapter 13).
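Converting a prior mean or prior mode into (α, β) is just algebra on the formulas above (mean = α/β, mode = (α − 1)/β). A small sketch in Python, with function names of our own choosing:

```python
def gamma_from_mean(mean, beta):
    """Solve mean = alpha/beta for alpha, given a chosen rate beta."""
    return mean * beta, beta

def gamma_from_mode(mode, beta):
    """Solve mode = (alpha - 1)/beta for alpha, given a chosen rate beta."""
    return mode * beta + 1, beta

print(gamma_from_mean(2.1, 1.0))   # (2.1, 1.0)
print(gamma_from_mean(2.1, 2.0))   # (4.2, 2.0)
print(gamma_from_mode(2.1, 1.0))   # (3.1, 1.0)
print(gamma_from_mode(2.1, 14.3))  # alpha comes out very close to the 31 used above
```

Notice that you must still choose β yourself; the prior information pins down only the ratio.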
Answer: Let’s go with the prior distribution of α0 = 2.1 and β0 = 1 (solid blue line in
Figure 11.12), which has a mean rate of 2.1. Notice that we use a “naught” subscript to
designate these as the hyperparameters of the prior distribution.
Hyperparameters?
[Figure 11.12: the prior distribution for λ, a gamma distribution with α0 = 2.1 and β0 = 1; x-axis: Hypotheses for λ, 0–15; y-axis: Density, 0.0–0.4.]
Answer: Step 4 is to determine the likelihood of the observed data, assuming each
hypothesis is true.
Now, for each hypothesized value of λ, let’s compute the Poisson likelihood
of observing five attacks in one year.
We touched on the Poisson pmf at the beginning of the chapter, where we showed that
the Poisson pmf can be used to calculate the probability of observing 0, 1, 2, 3, . . . attacks,
given lambda. This time, we have observed the value x, and now we are using the function to
calculate the likelihood of observing the data under each hypothesized λ.
To give you an example, under the λ ¼ 2.1 hypothesis, the likelihood of observing five
shark attacks is computed as:
ℒ(X = x; λ) = f(x; λ) = λ^x e^(−λ) / x!    (11.16)

ℒ(X = 5; λ = 2.1) = 2.1^5 e^(−2.1) / 5! = 0.0417    (11.17)
[Figure 11.13: the Poisson likelihood of observing 5 attacks under each hypothesized λ; x-axis: Hypotheses for λ, 0–15; y-axis: Likelihood of observing 5 attacks, 0.00–0.25.]
Figure 11.13
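The curve in Figure 11.13 can be traced with the Poisson pmf itself, now viewed as a function of λ with the datum held fixed at x = 5. A short sketch in Python:

```python
import math

def poisson_likelihood(x, lam):
    """Likelihood of the observed count x under a hypothesized rate lam (Eq. 11.16)."""
    return lam ** x * math.exp(-lam) / math.factorial(x)

# The same datum (5 attacks) evaluated under several hypotheses for lambda
for lam in (1.0, 2.1, 5.0, 7.0):
    print(lam, round(poisson_likelihood(5, lam), 4))
# The likelihood peaks at lambda = 5.0, the observed count itself
```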
And Step 5?
Answer: The final step is to use Bayes’ Theorem to compute the posterior probabilities for
each value of λ. As in Chapter 10, we need to set this Bayesian inference problem up for a
continuous random variable. Suppose we are trying to estimate a single parameter
called θ. If θ is continuous, you have an infinite number of hypotheses for it. Bayes’
Theorem in this case is specified as:
P(θ | data) = P(data | θ) · P(θ) / ∫ P(data | θ) · P(θ) dθ.    (11.18)
This is the generic version of Bayes’ Theorem when the posterior distribution for a single
parameter, given the observed data, is represented by a pdf. This is designated Pðθ j dataÞ,
where P is probability density. This is the left side of the equation. On the right side of
the equation, the numerator multiplies the prior probability density of θ, which is written
PðθÞ, by the likelihood of observing the data under a given hypothesis for θ, which is written
Pðdata j θÞ. Technically, the likelihood can be a pdf or a pmf. In this chapter’s illustration,
for instance, the likelihood is a pmf because it is a Poisson distribution. In the denominator,
we see the same terms, but this time we also see a few more symbols. The symbol ∫ means
“integrate,” which roughly means “sum up all the pieces” for each tiny change in θ, which
is written dθ. In other words, the denominator accounts for the prior density * likelihood for
all possible hypotheses for θ, and sums them.
For the Shark Attack Problem, we replace θ with λ, so Bayes’ Theorem looks like this, with
the posterior shown in purple, the prior in blue, and the likelihood in red:
P(λ | data) = P(data | λ) · P(λ) / ∫₀^∞ P(data | λ) · P(λ) dλ.    (11.19)
Remember that this is a density, not a probability, because the probability of a given value
of λ will always be 0 for a continuous pdf (refer to Chapter 9 for a refresher). Thus, in this
problem, we are interested in all possible values of λ, or the entire posterior
distribution.
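Eq. (11.19) can be made concrete by brute force: evaluate prior × likelihood on a fine grid of hypotheses for λ, and approximate the integral in the denominator with a sum. A sketch in Python, using the prior α0 = 2.1, β0 = 1 and the datum of 5 attacks (the grid range and resolution are arbitrary choices of ours):

```python
import math

def gamma_pdf(lam, a, b):
    """Gamma prior density for a hypothesized rate lam (shape a, rate b)."""
    return b ** a * lam ** (a - 1) * math.exp(-b * lam) / math.gamma(a)

def poisson_pmf(x, lam):
    """Poisson likelihood of the datum x under rate lam."""
    return lam ** x * math.exp(-lam) / math.factorial(x)

h = 0.001                                # grid spacing
grid = [i * h for i in range(1, 30001)]  # hypotheses for lambda: 0.001 .. 30
numer = [poisson_pmf(5, lam) * gamma_pdf(lam, 2.1, 1) for lam in grid]
denom = sum(numer) * h                   # crude numerical integration

# Posterior density at lambda = 3, by brute force...
posterior_at_3 = poisson_pmf(5, 3) * gamma_pdf(3, 2.1, 1) / denom
print(round(posterior_at_3, 3))
# ...which agrees with a gamma(7.1, 2) density evaluated at the same point
print(round(gamma_pdf(3, 7.1, 2), 3))
```

The agreement is no accident, as the rest of the chapter explains.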
But here’s the kicker: The integration of the denominator can be very
difficult to calculate!
Answer: Well, for this particular problem, there is an analytical shortcut that makes
updating the prior to the posterior distribution a snap. Integration not required! Remember
that our prior distribution set α0 = 2.1 and β0 = 1.
Furthermore, we have observed 5 shark attacks in one year. So here, x = 5. Let’s let n = the
number of years of observation; here n = 1. Here is the shortcut:
The updated α parameter is α0 plus the sum of the Poisson observations. We have just
one observation, which was of 5 attacks:
α_posterior = α0 + Σ_{i=1}^{n} x_i.    (11.20)

So, the summation term results in the number 5, because x1 = 5. Then we have:

α_posterior = 2.1 + 5 = 7.1.    (11.21)

The updated β parameter is β0 plus n, the number of years of observation:

β_posterior = β0 + n    (11.22)

β_posterior = 1 + 1 = 2.    (11.23)
These parameters are referred to as the posterior hyperparameters. Now we can look at the
prior and posterior distributions for λ (see Figure 11.14).
We started off with a prior distribution shown in blue. Then we collected data in the form
of a single Poisson random variable: five attacks. We then used the shortcut to generate the
posterior distribution for all hypotheses of λ.
As a result of our data, we now have new beliefs in each and every hypothesis of λ. The
posterior distribution is shown in purple. Notice how the posterior shifted toward our
observed data for this problem.
[Figure 11.14: prior (blue: α0 = 2.1, β0 = 1) and posterior (purple: α = 7.1, β = 2) distributions for λ; x-axis: Hypotheses for λ, 0–20; y-axis: Density, 0.0–0.4.]
Answer: The answer is that the gamma distribution is a conjugate distribution that can be
updated with Poisson data (Raiffa and Schlaifer, 1961). This is why we named this chapter
“The gamma-Poisson conjugate.”
We mentioned the statistical definition of a conjugate distribution in Chapter 10, where
we indicated that there are cases where you can use a particular pdf as a prior distribution,
collect data of a specific flavor, and then derive the posterior pdf with a closed-form
solution. In these special cases, the pdfs of the prior and posterior distributions are the
same probability density function, but their parameters may differ. The prior distribution is
called a conjugate prior, and the effect of the data can then be interpreted in terms of
changes in parameter values (Upton and Cook, 2014).
In the Shark Attack Problem, we used a gamma distribution to set the priors for all
hypotheses of λ, the rate of shark attacks per year. We used the information provided in
Table 11.1 to help set the prior distribution. We then collected Poisson data (5 shark
attacks), and then used a shortcut to generate the posterior distribution for λ.
Raiffa and Schlaifer’s (1961) conjugate solution is conveniently given in a Wikipedia
page on Bayesian conjugates. A quick overview of that page’s gamma-Poisson conjugate
can be visualized with the graph in Figure 11.15.
Let’s work our way through this figure. The top item indicates that the model parameter
of interest, λ, is a rate. The prior distribution is a gamma distribution, defined by
hyperparameters α0 and β0. The “naughts” signal that these are hyperparameters for the prior. The
data collected are Poisson in flavor, so the likelihood to be used is the Poisson pmf. With
this setup, the posterior hyperparameters (the conjugate shortcuts) are provided.
[Figure 11.15: the gamma-Poisson conjugate at a glance. Unknown parameter: λ. Prior distribution: gamma, with prior hyperparameters α0 and β0. Posterior hyperparameters: α = α0 + Σ_{i=1}^{n} x_i and β = β0 + n.]
Figure 11.15
Answer: You could have used a different prior distribution, and your inferences provided
by the posterior distribution would probably be different. You must defend your choice for
the prior! Here, we had previous knowledge of the annual rate of shark attacks and thus used
an informative prior. To ignore this previous information is contrary to the scientific method.
Why do I need to use a Bayesian approach? Couldn’t I just use the data
from Table 11.1 and add in the extra observation of 5 shark attacks?
Answer: You certainly could do that. But then you wouldn’t be capitalizing on the notion of
learning, and on the notion of multiple, competing hypotheses for λ. Table 11.3 shows the
dataset again, this time with one extra observation (our new observation of 5 shark attacks).
Table 11.3
Year Attacks
1 1
2 0
3 2
4 1
5 1
6 3
7 2
8 3
9 4
10 4
11 5
The mean of this dataset is 2.36. Yes, this does give you an estimate of λ, and you can ask
your friendly neighborhood statistician to help you draw conclusions about the mean
number of shark attacks in Florida based on your sample. However, this result fails to
acknowledge support for alternative values of λ. Bayesians are interested in the probability
distribution for λ, where the density associated with each alternative hypothesis for λ gives a
measure of “plausibility.” This approach is squarely in line with the scientific method. As
we collect new information, we can use Bayesian approaches to refine how plausible each
and every hypothesis for λ is.
Answer: Try it! The prior distribution will now be the posterior distribution after observing
5 shark attacks: α = 7.1 and β = 2.
Let’s suppose you collect 5 more years of shark attack data. Let’s define the
variable X1 as the number of shark attacks in year 1, . . . , and X5 as the number of shark
attacks in year 5. You observe the results shown in Table 11.4.
Table 11.4
Year Attacks
12 1
13 2
14 0
15 3
16 4
Note that we have 5 new random observations from the Poisson distribution, so n = 5.
Also notice that the sum of the attacks, Σ_{i=1}^{n} x_i, is 10.
Remember that the prior distribution is now defined by α0 = 7.1 and β0 = 2. Now let’s
incorporate the n = 5 years of new data (see Figure 11.16).
[Figure 11.16: the prior (α0 = 2.1, β0 = 1), the first posterior (α = 7.1, β = 2), and the second posterior (α = 17.1, β = 7); x-axis: Hypotheses for λ, 0–20; y-axis: Density, 0.0–0.8.]
The updated α parameter is the prior α0 plus the sum of the Poisson observations:

α_posterior = α0 + Σ_{i=1}^{n} x_i    (11.24)

α_posterior = 7.1 + 10 = 17.1.    (11.25)
The updated β parameter is the prior β0 plus n, the number of Poisson observations:
β_posterior = β0 + n    (11.26)

β_posterior = 2 + 5 = 7.    (11.27)
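Sequential updating is the same arithmetic applied again, with last round's posterior serving as this round's prior. A Python sketch using the Table 11.4 data:

```python
def gamma_poisson_update(alpha0, beta0, observations):
    """Conjugate shortcut: gamma(alpha0, beta0) prior + Poisson data -> gamma posterior."""
    return alpha0 + sum(observations), beta0 + len(observations)

new_data = [1, 2, 0, 3, 4]  # attacks in years 12-16 (Table 11.4)
alpha, beta = gamma_poisson_update(7.1, 2, new_data)  # prior = previous posterior
print(round(alpha, 4), beta)   # 17.1 7
print(round(alpha / beta, 2))  # 2.44, the new posterior mean for lambda
```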
What parameters do you use for α0 and β0 if you want a vague prior?
Answer: If you didn’t have the information in Table 11.1 and wanted to use a vague prior,
you could set α0 and β0 to something really small, like 0.01. Let’s try our very first analysis
again, using a prior with α0 = β0 = 0.01, and then update it after observing 5 shark attacks.
The updated alpha parameter is the prior alpha plus the sum of the Poisson observations:

α_posterior = α0 + Σ_{i=1}^{n} x_i    (11.28)

α_posterior = 0.01 + 5 = 5.01.    (11.29)
The updated beta parameter is the prior beta plus n, the number of Poisson observations:

β_posterior = β0 + n    (11.30)

β_posterior = 0.01 + 1 = 1.01.    (11.31)
If you were estimating the mean number of attacks based on the data alone, you would
estimate that λ is 5.0. Here, the mean of the posterior distribution is very close to that, and
can be calculated as:
α/β = 5.01/1.01 = 4.96.    (11.32)
It’s evident that the prior is not contributing much to the posterior distribution (see
Figure 11.17).
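The vague-prior update uses the same conjugate shortcut. A quick Python sketch:

```python
def gamma_poisson_update(alpha0, beta0, observations):
    """Conjugate shortcut: gamma(alpha0, beta0) prior + Poisson data -> gamma posterior."""
    return alpha0 + sum(observations), beta0 + len(observations)

alpha, beta = gamma_poisson_update(0.01, 0.01, [5])  # vague prior, one datum
print(round(alpha, 2), round(beta, 2))  # 5.01 1.01
print(round(alpha / beta, 2))           # 4.96 -- nearly the data-only estimate of 5.0
```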
[Figure 11.17: the vague prior (α0 = 0.01, β0 = 0.01) and the posterior (α = 5.01, β = 1.01); x-axis: Hypotheses for λ, 0–20; y-axis: Density, 0.0–0.8.]
The prior certainly does not look “flat.” But notice in this case, since α0 and β0 were so
small for the prior distribution, they had almost no bearing on the shape of the posterior
distribution. But, why would you use this prior distribution if the data in Table 11.1 were at
your disposal?
Answer: If you can show the prior and posterior distributions, that would be fabulous! But
you can also summarize properties of the posterior distribution such as the mean and
variance by using some of the calculations presented on Wikipedia’s gamma distribution
page. And don’t forget the credible intervals!
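A credible interval can be pulled from the posterior gamma(7.1, 2) numerically, without any statistics library: integrate the pdf to get the CDF, then invert the CDF by bisection. A rough sketch (grid sizes and the search bracket are arbitrary choices of ours):

```python
import math

def gamma_pdf(x, a, b):
    """Gamma density with shape a and rate b."""
    return b ** a * x ** (a - 1) * math.exp(-b * x) / math.gamma(a)

def gamma_cdf(x, a, b, steps=2000):
    """Pr(X <= x), by the trapezoid rule (fine for a quick summary)."""
    if x <= 0:
        return 0.0
    h = x / steps
    total = 0.5 * gamma_pdf(x, a, b)  # pdf(0) = 0 when a > 1
    for i in range(1, steps):
        total += gamma_pdf(i * h, a, b)
    return total * h

def gamma_quantile(q, a, b, lo=0.0, hi=30.0):
    """Invert the CDF by bisection."""
    for _ in range(50):
        mid = (lo + hi) / 2
        if gamma_cdf(mid, a, b) < q:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

low, high = gamma_quantile(0.025, 7.1, 2), gamma_quantile(0.975, 7.1, 2)
print(round(low, 2), round(high, 2))  # a 95% credible interval for lambda
```

In practice a statistics library's gamma quantile function would replace all of this, but the sketch shows there is nothing mysterious inside one.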
Answer: You may! The log-normal distribution can be used as the prior distribution for
setting alternative hypotheses for λ. Figure 11.18 shows a few examples of the log-normal
distribution, which has two parameters.
[Figure 11.18: log-normal distributions with ln(μ) = 0 and ln(σ) = 1, 0.25, and 0.5; x-axis: x, 0.0–3.0; y-axis: Probability Density, 0.0–3.0.]
Figure 11.18
Here you can see that the distribution ranges from 0 to ∞ along the x-axis. This distribu-
tion is very useful for many analyses. However, since it is not a conjugate distribution, there
is no easy way—no shortcut—to get the posterior. You could estimate the posterior
distribution with an MCMC analysis, but that is a matter for another day.
We will revisit this chapter again in the MCMC section of this book, where we will show
you a different method for estimating the posterior distribution. For now, though, we have
one more conjugate chapter in this section, and it is a “sticky” one. We hope to see you
there!
CHAPTER 12
In this chapter, we return to the normal distribution and set up a Bayesian inference
problem that deals with estimating parameters from the normal distribution. You might
recall that the normal distribution has two parameters that control the location and spread
of the distribution: the mean and the standard deviation. These are called mu (μ) and sigma
(σ), respectively. We’ll now be dealing with the joint distribution of the two parameters in
the normal distribution, where you need to assign a prior associated with a combination
of μ and σ.
Here’s a roadmap for this chapter. Pay attention! We’ll begin with a motivating example
that highlights the need to estimate the two parameters of the normal distribution. We’ll
first tackle the problem by considering just 10 discrete hypotheses for μ and σ so that you
can see Bayes’ Theorem at work with a joint prior. Then we’ll move to the case where the
number of joint hypotheses is infinite and in which the prior distribution is a continuous,
joint distribution. This complicates the analysis considerably. But, as with the previous two
chapters, this chapter introduces a conjugate solution, which can be applied only if you
assume that one of the parameters of the normal distribution is known. Specifically, we
assume that σ is known. Hence the focus is on estimating the unknown parameter of the
normal distribution, μ. We will revisit this chapter again in Section 5, where we show you
how to solve the problem when both μ and σ are unknown, using an MCMC analysis with
Gibbs sampling.
By the end of this chapter, you should be able to define and use the following:
• The normal distribution pdf in terms of the mean and standard deviation (μ and σ) and in
terms of the mean and precision (μ and τ)
• The normal-normal conjugate
We’ll start with a story written by Brendan Borrell in 2013 for Bloomberg Business:
On the morning of July 30, 2012, an accountant named Michel Gauvreau arrived at the Global Strategic
Maple Syrup Reserve, housed in a huge red brick warehouse on the side of the Trans-Canadian Highway…
about two hours northeast of Montreal. Inside, baby-blue barrels of maple syrup were stacked six high in
rows hundreds deep. Full, each barrel weighs about 620 pounds. With grade A syrup trading at about $32
per gallon, that adds up to $1,800 a barrel, approximately 13 times the price of crude oil.
The fiscal year was coming to a close, and the Federation of Quebec Maple Syrup Producers had hired
Gauvreau’s company to audit its inventory…There were around 16,000 barrels here, about one-tenth of
Quebec’s annual production. The gap between the rows was barely wide enough to walk through, and the
rubber soles of Gauvreau’s steel-tip boots stuck to the sugar-coated concrete floor.
He scaled a row of barrels and was nearing the top of the stack when one of them rocked with his
weight. He nearly fell. Regaining his balance, he rattled the barrel: It was light because it was empty.
He soon found others that were empty. After notifying the Federation’s leaders and returning with them
to examine the stockpile, they unscrewed the cap on a full barrel. The liquid inside was not goopy,
brown, or redolent with the wintry scent of vanilla, caramel, and childhood; it was thin, clear, and
odorless. It was water… Sixty percent, or six million pounds of syrup, had vanished, worth about $18
million wholesale.
That’s some story! It has even caught Hollywood’s attention! This begs a few questions,
though.
Answer: Yum!
Here’s what Encyclopedia Britannica has to say: “Maple syrup, sweet-water sap of
certain North American maple trees, chiefly the sugar maple, Acer saccharum, but also
the black maple, Acer nigrum. It was utilized by the Indians of the Great Lakes and
St. Lawrence River regions prior to the arrival of European settlers and is still produced
solely in North America.
The sweet-water sap from which maple syrup is made is different from the circulatory sap
of the growing tree. When the tree is dormant, the sap will flow from any wound in the
sapwood, such as a taphole, each time a period of freezing is followed by a period of
thawing. The sap contains 1 1/2 to 3 percent solids, mostly sucrose, but does not contain
the colour or flavour of maple syrup, which are imparted to the sap as it is concentrated by
evaporation in open pans. About 30 to 50 gallons (115 to 190 litres) of sap yield one gallon
of syrup.”
Answer: The answer is explained in an article by Josh Sanburn from Time magazine:
It may seem bizarre that Canada has a maple-syrup cartel at all. But think of it this way: Quebec, which
produces about 77% of the world’s maple syrup, is the Saudi Arabia of the sweet, sticky stuff, and the FPAQ
[Fédération des producteurs acéricoles du Québec] is its OPEC. The stated goal of the cartel, in this case, is
keeping prices relatively stable.
The problem with maple syrup is that the natural supply of it varies dramatically from year to year. “It’s
highly dependent on the weather,” explains Pascal Theriault, an agricultural economist at McGill University
in Montreal. “The maple trees need optimal climate conditions—cold nights, temperate days—to produce
the right sap,” explained a recent article in the Washington Post. “That doesn’t always happen, and
production varies sharply each spring.”
Answer: Since the cartel has lost its syrup, for this chapter we are going to assume
that they want to incorporate the great state of Vermont (USA) into their
game. After all, Vermont is Quebec’s neighbor with healthy sugar maple forests (see
Figure 12.1). And by the way, the name Vermont stems from the Latin Viridis Montis,
which means Green Mountain.
[Figure 12.1: a map of Canadian provinces and U.S. states; Quebec (QC) sits just north of its neighbor Vermont (VT).]
Figure 12.1
Answer: We can use the normal probability density function. We studied the normal pdf
in detail in Chapter 9, so we’ll just briefly review it now, starting with an example
(see Figure 12.2).
The formula that generates the normal distribution for the random variable X (millions
of gallons of maple syrup produced) is called the normal probability density
function. Here it is, and remember that the mean is represented by the symbol μ, and the
standard deviation is represented by the symbol σ:
f(x; μ, σ) = (1 / (√(2π) σ)) e^(−(x − μ)² / (2σ²)),  −∞ < x < ∞    (12.1)
As an example, suppose the average number of gallons (in millions) of maple syrup
produced per year is μ = 10.0, with a standard deviation σ of 2.3. The probability density
of observing 6.2 million gallons, say, would be:
f(x; μ, σ) = (1 / (√(2π) σ)) e^(−(x − μ)² / (2σ²))    (12.2)
f(6.2; 10, 2.3) = (1 / (√(2π) · 2.3)) e^(−(6.2 − 10)² / (2 · 2.3²))    (12.3)

f(6.2; 10, 2.3) = (1 / 5.765) e^(−14.44 / 10.58) = 0.044.    (12.4)
[Figure 12.2: a normal distribution with μ = 10 and σ = 2.3; x-axis: Millions of Gallons Produced (x), 5–20; y-axis: Density, 0.00–0.25.]
f(x; μ, σ) = (1 / (√(2π) σ)) e^(−(x − μ)² / (2σ²))    (12.5)

f(4.8; 10, 2.3) = (1 / (√(2π) · 2.3)) e^(−(4.8 − 10)² / (2 · 2.3²)) = 0.0135    (12.6)
Look for the probability density associated with 4.8 million gallons in the distribution
shown in Figure 12.3.
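Both density calculations are one-liners once the normal pdf is written down. A sketch in Python (the function name is our own):

```python
import math

def normal_pdf(x, mu, sigma):
    """f(x; mu, sigma) -- the normal density of Eq. 12.1."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

# Densities for 6.2 and 4.8 million gallons when mu = 10 and sigma = 2.3
print(round(normal_pdf(6.2, 10, 2.3), 3))  # 0.044
print(round(normal_pdf(4.8, 10, 2.3), 4))  # 0.0135
```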
[Figure 12.3: the same normal distribution (μ = 10, σ = 2.3); x-axis: Millions of Gallons Produced (x), 5–20; y-axis: Density, 0.00–0.25.]
Our goal here is to use a Bayesian inference approach to estimate the two
parameters of the normal distribution: the mean, μ, and the standard deviation, σ.
This differs from Chapters 10 and 11, where our goal was to estimate only
one parameter of a pdf (p for the White House Problem, and λ for the Shark
Attack Problem). Here, for illustrative purposes, we will consider just 10 discrete hypotheses
that represent alternative combinations of μ and σ.
We’ll use the same steps that we did in previous chapters:
1. Identify your hypotheses—these would be the alternative hypotheses for μ, and alter-
native hypotheses for σ.
2. Express your belief that each hypothesis is true in terms of prior probabilities.
3. Gather the data.
4. Determine the likelihood of the observed data under each hypothesis.
5. Use Bayes’ Theorem to compute the posterior probabilities for each hypothesis.
Well, let’s see. Let’s assume that the following statements are reasonable:
[Figure 12.4: a scatterplot of 10 dots, one per cartel member. x-axis: Hypotheses for μ; y-axis: Hypotheses for σ (1.8 to 2.2).]
Each dot represents the cartel’s belief in a particular combination for the mean and the
standard deviation. For instance, the arrow in the chart indicates that one cartel member
believes that the mean is ~9.45 and the standard deviation is ~1.98.
These 10 dots represent 10 discrete hypotheses for μ and σ.
Answer: A key consideration here is that we are assigning prior probabilities for each
hypothesized combination of μ and σ. They are jointly estimated. If every single dot
represents a single hypothesis, and if each dot has the same weight, then the prior
probability associated with each dot (hypothesis) is 1/10, or 0.1. This is a non-informative
prior, expressed as a probability mass function.
Answer: Collect data! Suppose the cartel visits Vermont and learns that 10.2
million gallons were produced last year.
Answer: Step 4 is to determine how likely the observed data are, assuming each joint
hypothesis is true.
Now, for each hypothesized combination of μ and σ, let’s compute the like-
lihood of observing 10.2 million gallons.
At the beginning of this chapter, we showed that the normal pdf can be used to calculate
the probability density of observing a value from a normal distribution whose parameters
are given. This time, we have observed the value x, and now we are using the function to
calculate the likelihood of observing the data under a combination of entries for μ and σ.
Just to give you an example, suppose one of the 10 hypotheses was μ = 9.45 and σ = 1.98
(the hypothesis highlighted by the arrow in Figure 12.4). Then we could ask: “What is
the likelihood of observing 10.2 from a normal distribution whose mean is μ = 9.45 and
σ = 1.98?” The answer is:
ℒ(x; μ, σ) = (1/(σ√(2π))) e^(−(x−μ)²/(2σ²))   (12.7)

ℒ(x = 10.2; μ = 9.45, σ = 1.98) = (1/(1.98√(2π))) e^(−(10.2−9.45)²/(2 × 1.98²)) = 0.1875   (12.8)
We can do similar calculations for other combinations as well. Table 12.1 shows 10 specific
hypotheses; our highlighted example is hypothesis 6.
Table 12.1
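The likelihood step can be sketched for all 10 joint hypotheses at once. Since Table 12.1 is not reproduced here, the (μ, σ) pairs below are hypothetical, except for hypothesis 6 (9.45, 1.98), which the text highlights:

```python
import math

def normal_pdf(x, mu, sigma):
    """Normal probability density, used here as a likelihood function."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

# Hypothetical (mu, sigma) pairs standing in for Table 12.1's 10 hypotheses;
# index 5 is the highlighted hypothesis 6: mu = 9.45, sigma = 1.98.
hypotheses = [(8.9, 2.1), (9.1, 1.9), (9.3, 2.2), (9.4, 1.8), (9.4, 2.0),
              (9.45, 1.98), (9.6, 2.1), (9.7, 1.9), (9.8, 2.0), (10.0, 2.2)]
prior = [0.1] * 10            # non-informative prior: 1/10 per hypothesis
x = 10.2                      # observed data: 10.2 million gallons

likelihood = [normal_pdf(x, m, s) for m, s in hypotheses]
evidence = sum(p * l for p, l in zip(prior, likelihood))          # denominator
posterior = [p * l / evidence for p, l in zip(prior, likelihood)]  # Bayes' Theorem

print(round(likelihood[5], 3))   # likelihood for hypothesis 6, cf. Eq. 12.8
print(round(sum(posterior), 6))  # posterior probabilities sum to 1
```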
And Step 5?
Answer: The final step is to use Bayes’ Theorem to compute the posterior probabilities for
each combination of μ and σ. We won’t present these calculations here.
Answer: The main challenge is that we have an infinite number of hypotheses that
represent each joint combination for the unknown parameters of the normal distribution.
Accordingly, our prior and posterior distributions should reflect the density associated
with each μ and σ combination.
Answer: Yes. In other words, a handful of dots is not sufficient. If we gave the cartel a zillion
grains of sand, where each grain represents a joint hypothesis for μ and σ, their prior
distribution for the parameters may look something like that shown in Figure 12.5.
[Figure 12.5: a three-dimensional prior distribution; axes: Hypotheses for μ, Hypotheses for σ, and Density (height).]
Here, the prior distribution has volume. For any given parameter combination, the
corresponding height gives us the density, stressing the fact that we are dealing with a
bivariate distribution in which both parameters are continuous. Our 10 original hypotheses
are buried somewhere in this sandpile. This distribution really could have any shape—
lumps indicate those hypotheses that have more weight than others, which implies that
Figure 12.5 is an informative prior distribution.
It’s plain to see that the shape of our prior distribution for μ depends on what level of σ
you consider. For instance, Figure 12.5 illustrates that if we fix σ at the “red-level,” we see
plausible hypotheses for μ highlighted in red. And if we fix σ at the “green-level,” we see
plausible hypotheses for μ highlighted in green. These ribbons are called “slices” because
they slice through the mound at a particular location. Given a particular slice for σ, you
have a unique prior distribution for μ. Keep this point in the back of your mind—we will
return to it shortly.
Answer: Well, now that the cartel has a prior distribution, they would collect new data and
use Bayes’ Theorem to generate the posterior distribution. In other words, the pile of sand
will change shape after new data are collected.
You might recall from previous chapters the generic version of Bayes’ Theorem for one
continuous variable:
P(θ | data) = P(data | θ) × P(θ) / ∫ P(data | θ) × P(θ) dθ   (12.9)
This is the generic version of Bayes’ Theorem when the posterior distribution for a single
parameter, given the observed data, is represented by a pdf. This posterior density is
designated P(θ | data), the prior density is designated P(θ), and the likelihood of observing
the data under a given hypothesis for θ is designated P(data | θ). The integral in the
denominator multiplies the prior density by the likelihood for every possible hypothesis
for θ and integrates (sums) over them.
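For a single parameter with σ known, Equation 12.9 can be approximated on a grid, replacing the integral in the denominator with a Riemann sum. This is an illustrative sketch; the prior Normal(12, 4) matches the hyperparameters used later in the chapter:

```python
import math

def normal_pdf(x, mu, sigma):
    """Normal probability density function."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

sigma_known = 2.0
x = 10.2                                      # one observed datum
d_theta = 0.01
grid = [i * d_theta for i in range(0, 3000)]  # hypotheses for mu in [0, 30)

prior = [normal_pdf(t, 12.0, 4.0) for t in grid]      # prior density for each mu
like = [normal_pdf(x, t, sigma_known) for t in grid]  # likelihood under each mu

# Riemann-sum approximation of the integral in the denominator of Eq. 12.9
denom = sum(p * l for p, l in zip(prior, like)) * d_theta
posterior = [p * l / denom for p, l in zip(prior, like)]

# The posterior density integrates to 1, and its mean matches the
# conjugate shortcut result (10.56) derived later in the chapter.
post_mean = sum(t * p for t, p in zip(grid, posterior)) * d_theta
print(round(sum(posterior) * d_theta, 4))   # 1.0
print(round(post_mean, 2))                  # 10.56
```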
Now, let’s make this task even harder. For the maple syrup problem, there are two
parameters to be estimated, μ and σ. With two parameters, Bayes’ Theorem looks like this,
where each hypothesis for μ and σ is considered (the posterior is shown in purple, the prior
in blue, and the likelihood in red):

P(μ, σ | data) = P(data | μ, σ) × P(μ, σ) / ∬ P(data | μ, σ) × P(μ, σ) dμ dσ   (12.10)
For any given prior hypothesis μ and σ, Bayes’ Theorem will return the posterior density of
the joint μ and σ by computing the likelihood of observing the data under hypothesis μ and
σ, multiplied by the prior probability of μ and σ. We need to do this for an infinite number
of hypotheses for each parameter.
Answer: As you might have guessed by the title of this chapter, there is a conjugate solution.
The main approach is to simplify the problem by focusing on only one of the parameters
and assuming the second parameter is known.
For a normal-normal conjugate, we use a normal distribution as our prior distribution for
the mean, μ, and then collect data and update to the posterior distribution for the mean, μ,
which will also be normally distributed. However, in this process, we must assume that σ is
known; that is, we must choose a level for σ (such as the green value or red value for sigma as
shown in Figure 12.5).
Let’s assume we know that σ = 2.0. Then we have alternative hypotheses for μ, which
controls where the distribution is centered along the x-axis. For example, five alternative
hypotheses for μ are shown in Figure 12.6.
[Figure 12.6: five normal distributions with σ = 2 and μ = 8, 10, 12, 14, 16. x-axis: Hypotheses for μ; y-axis: Density.]
Now, are you paying attention? The normal-normal conjugate method uses
a normal pdf to weight the alternative hypotheses for μ, like the one shown in
Figure 12.7.
[Figure 12.7: a normal pdf weighting the alternative hypotheses for μ. x-axis: Hypotheses for μ; y-axis: Density.]
Thus, the prior distribution for the unknown parameter, μ, is a normal distribution.
I’m confused!
Answer: That’s completely normal. Figure 12.7 suggests that we have greater belief that
the average annual syrup production is between 10 and 15 than, say, at 5 or 25. Remember
that we have simplified the problem so we are only focusing on μ. We’ve already agreed that
sigma, σ, is known and is 2 million gallons.
In fact, the blue distribution in Figure 12.7 is a normal distribution with μ0 = 12 and
σ0 = 4. Alternatively, this distribution can be specified in terms of the parameters μ0 and τ0,
where μ0 = 12 and τ0 = 0.0625. These are formally called prior hyperparameters.
When you use a normal distribution as a prior distribution for the unknown mean, μ, of a
normal distribution, assume that σ (or τ) is known, and collect data from a normal
distribution. You can then use a conjugate shortcut to generate the posterior distribution
for the unknown parameter μ, which will also be a normal distribution.
A diagram such as that shown in Figure 12.8 may help.
[Figure 12.8: a hierarchical diagram. Top: μ ~ N(μ0, τ0), the prior on μ. Bottom: the observed data xi ~ N(μ, σ = 2.0), or equivalently N(μ, τ = 0.25).]
At the bottom of this diagram, we have our observed data, xi , which we assume arises
from a normal distribution with an unknown mean, μ, and known standard deviation, σ =
2.0. This is the same thing as a normal distribution with an unknown mean, μ, and known
precision, τ = 0.25. The parameter, μ, is the unknown parameter that we are trying to
estimate. Here, we use a normal distribution as a prior distribution to set the “weights” on
each and every hypothesis for the unknown parameter, μ. This normal distribution is
defined by the hyperparameters μ0 and τ0.
Answer: Ok then. Time for a concrete example. First, let’s recall the steps for Bayesian
analysis:
1. Identify your hypotheses—these would be the alternative hypotheses for the mean. (The
conjugate shortcut requires that you assume either μ or σ is known; here we assume σ is
known.)
2. Express your belief that each hypothesis is true in terms of a prior distribution (i.e., a
pdf). For a normal distribution, μ can range from −∞ to ∞. For our example, however, μ
must be positive, so we will select a prior distribution such that the area under the curve
for μ < 0 is very, very small.
3. Gather the data.
4. Determine the likelihood of the observed data under each hypothesis.
5. Use Bayes’ Theorem to compute the posterior distribution, which is a pdf.
Step 1. These would be the alternative hypotheses for the mean. Here, we will assume that
τ is known and has a value of 0.25. Remember, this is known.
That leaves us with just one unknown parameter, μ, which has an infinite number of
hypotheses, all of which are positive.
Step 2. Establish a prior distribution for the unknown parameter, μ. Remember that
our prior distribution is a normal distribution. Although Vermont secretly collects data on
syrup production, earlier in the chapter we used a dot exercise to “elicit” information from
cartel members. Elicitation is a method for “boiling down” what experts think they know
when other sources of data are not available (pun intended!).
Tip: “Use expert opinions in such a way that we stand to gain if they are correct and do
not lose if they are wrong.”—C.R. Rao
There are many different ways to elicit a prior distribution, and these are beyond the
scope of this book. For the sake of this chapter, let’s assume the cartel uses the normal
distribution shown in blue in Figure 12.7; it has the following hyperparameters:
• μ0 = 12
• τ0 = 0.0625
Notice the subscript of 0 (naught) is used to indicate that these are prior
hyperparameters.
Step 3. Collect data. We have observed 10.2 million gallons of syrup produced.
We now write our observed data as:
∑ᵢ₌₁ⁿ xᵢ   (12.11)
Here, n is the number of samples (in this case, 1). Typically, we index each year with the
letter i, where the index i goes from 1 to n. So, xᵢ = x₁ = 10.2.
If we collected seven years of data, the index i would be 1, 2,…,7. We would report our
observed data as the sum of our seven observations, and n would be 7:
∑ᵢ₌₁⁷ xᵢ   (12.12)
Steps 4 and 5. Steps 4 and 5 are to determine the likelihood of the observed data under
each hypothesis and then use Bayes’ Theorem to compute the posterior probabilities for
each hypothesis.
Here, though, we can update the prior normal distribution to the posterior normal
distribution with the shortcut analytic solution and avoid the use of Bayes’ Theorem
directly. Here are the shortcut calculations—pay attention to subscripting!
The mean of the posterior distribution for the unknown parameter, μ, is calculated as:
μ_posterior = (τ0μ0 + τ∑ᵢ₌₁ⁿ xᵢ) / (τ0 + n × τ) = (0.0625 × 12 + 0.25 × 10.2) / (0.0625 + 1 × 0.25) = 10.56   (12.13)
The precision of the posterior distribution (τ) for the unknown parameter, μ, is
calculated as:
τ_posterior = τ0 + n × τ = 0.0625 + 1 × 0.25 = 0.3125 ≈ 0.31   (12.14)
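The two shortcut formulas are easy to check numerically. Here is a minimal sketch in Python; the function name normal_normal_update is ours, not from the text:

```python
def normal_normal_update(mu0, tau0, tau, data):
    """Normal-normal conjugate shortcut for the unknown mean.

    Prior on mu is Normal with hyperparameters (mu0, tau0); the data
    precision tau is assumed known. Returns (mu_post, tau_post).
    """
    n = len(data)
    tau_post = tau0 + n * tau                              # Equation 12.14
    mu_post = (tau0 * mu0 + tau * sum(data)) / tau_post    # Equation 12.13
    return mu_post, tau_post

# Prior mu0 = 12, tau0 = 0.0625; known tau = 0.25; one observation of 10.2
mu_post, tau_post = normal_normal_update(12.0, 0.0625, 0.25, [10.2])
print(round(mu_post, 2), tau_post)   # 10.56 0.3125
```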
[Figure 12.9: prior (μ0 = 12, τ0 = 0.0625) and posterior (μ = 10.56, τ = 0.31) distributions for μ; the observation 10.2 is shown by a dashed line. x-axis: Hypotheses for μ; y-axis: Density.]
Our prior distribution for the unknown parameter, μ, is a normal distribution with μ0 =
12 and τ0 = 0.0625 (blue). After collecting one year of syrup data in Vermont (10.2 million
gallons, shown by the dashed line), our posterior distribution for the unknown parameter μ
is a normal distribution with posterior hyperparameters μ = 10.56 and τ = 0.31 (purple).
Notice how the posterior has tightened up a bit and shifted to the left.
Answer: The answer is that the normal distribution is a conjugate distribution that can be
updated with normal data (Raiffa and Schlaifer, 1961). This is why we named this chapter
“The normal-normal conjugate.”
We mentioned the statistical definition of a conjugate distribution in Chapter 11, where
we indicated that there are cases where you can use a particular pdf as a prior distribution,
collect data of a specific flavor, and then derive the posterior pdf with a closed-form
solution. In these special cases, the prior and posterior distributions have the same form of
probability density function, but their parameters may differ. The prior distribution is
called a conjugate prior, and the effect of the data can then be interpreted in terms of
changes in parameter values (Upton and Cook, 2014).
In the maple syrup problem, we used a normal distribution to set the priors for all
hypotheses of μ, the mean number of gallons of syrup (millions). We used expert opinion to
set the prior distribution. We then collected normal data, and then used a shortcut to
generate the posterior distribution for μ.
Raiffa and Schlaifer’s (1961) conjugate solution is conveniently given in a Wikipedia
page on Bayesian conjugates. A quick overview of that page’s normal-normal conjugate can
be depicted with the diagram shown in Figure 12.10.
Let’s work our way through this figure. The top panel indicates that the model param-
eter of interest, μ, is a mean. The prior distribution is a normal distribution, defined by
hyperparameters μ0 and τ0. The “naughts” signal that these are hyperparameters for the
prior. The data collected are normal in flavor, with n being the total sample size. The
likelihood to be used is the normal pdf, where τ is known. With this setup, the posterior
distribution for the unknown parameter, μ, is a normal distribution with hyperparameters
(the conjugate shortcuts) displayed.
Prior hyperparameters: μ0, τ0

Posterior hyperparameters:
μ = (τ0μ0 + τ∑ᵢ₌₁ⁿ xᵢ) / (τ0 + nτ)
τ = τ0 + nτ
Figure 12.10
Answer: All right. Have a look at Appendix 3. There, you can confirm that Bayes’ Theorem
is front and center, even though the results are astonishingly simple.
Answer: Let’s try it! Let’s suppose you collect five more years of syrup produc-
tion data (shown in Table 12.2).
Table 12.2
Year    Gallons
1       7
2       10
3       10
4       8
5       4
Total   39
Note that we have five new observations from the Gaussian (normal) distribution, so
n ¼ 5. Also notice that the sum of the observed data is 39 million gallons produced. Our
prior distribution for the unknown parameter, μ, is now defined by the hyperparameters
μ0 = 10.56 and τ0 = 0.31; we assume τ is still known to be 0.25. If we repeat our
conjugate calculations, the mean of the posterior distribution for the unknown parameter,
μ, is calculated as:
μ_posterior = (τ0μ0 + τ∑ᵢ₌₁ⁿ xᵢ) / (τ0 + n × τ) = (0.31 × 10.56 + 0.25 × 39) / (0.31 + 5 × 0.25) = 8.35   (12.15)

The precision of the posterior distribution for the unknown parameter, μ, is calculated as:

τ_posterior = τ0 + n × τ = 0.31 + 5 × 0.25 = 1.56   (12.16)

And σ for the posterior distribution of the unknown parameter, μ, is calculated as:

σ²_posterior = 1/τ = 1/1.56 = 0.641   (12.17)

σ_posterior = √(σ²) = √0.641 = 0.801   (12.18)
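The two rounds of updating can be chained in a short script, applying the shortcut twice with the first posterior serving as the second prior. Note that carrying full precision gives σ = 0.8; the text's 0.801 reflects the intermediate rounding of τ to 0.31:

```python
def normal_normal_update(mu0, tau0, tau, data):
    """Normal-normal conjugate shortcut; data precision tau is assumed known."""
    n = len(data)
    tau_post = tau0 + n * tau
    mu_post = (tau0 * mu0 + tau * sum(data)) / tau_post
    return mu_post, tau_post

tau = 0.25                         # known precision (sigma = 2)
mu0, tau0 = 12.0, 0.0625           # original prior hyperparameters

# Update 1: one year of data (Equations 12.13-12.14)
mu1, tau1 = normal_normal_update(mu0, tau0, tau, [10.2])

# Update 2: the first posterior becomes the new prior (Equations 12.15-12.18)
mu2, tau2 = normal_normal_update(mu1, tau1, tau, [7, 10, 10, 8, 4])
sigma2 = (1 / tau2) ** 0.5

print(round(mu1, 2), tau1)              # 10.56 0.3125
print(round(mu2, 2), round(tau2, 2))    # 8.35 1.56
print(round(sigma2, 3))                 # 0.8 (text: 0.801, via rounded tau)
```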
Let’s add this new posterior pdf to our graph in dashed purple (see Figure 12.11).
[Figure 12.11: prior (μ0 = 12, σ0 = 4, τ0 = 0.0625), posterior 1 (μ = 10.56, σ = 1.79, τ = 0.31), and posterior 2 (μ = 8.35, σ = 0.801, τ = 1.56) distributions for μ. x-axis: Hypotheses for μ; y-axis: Density.]
Answer: We can work with a known variance (σ²) OR a known precision (τ, which is 1/σ²).
Raiffa and Schlaifer’s (1961) conjugate solution is conveniently given in a Wikipedia page
on Bayesian conjugates. Let’s have another look at our figure, but expand it a bit (see
Figure 12.12).
Figure 12.12
• In the left panel, we assume that τ is known, and we define our prior distribution for the
unknown parameter, μ, with the hyperparameters μ0 and τ0. The conjugate calculations
of the posterior hyperparameters are shown. We used this version here and we’ll make
use of it again in future chapters.
• In the right panel, we assume that σ² is known, and we define our prior distribution for
the unknown parameter, μ, with the hyperparameters μ0 and σ0². The conjugate calcula-
tions of the posterior hyperparameters are shown.
• Both approaches will return the same result.
Answer: There is a conjugate shortcut for that too, but it wouldn’t be called the normal-
normal conjugate shortcut anymore! For example, Figure 12.13 shows the conjugate
solutions for estimating σ² or τ given a known mean, μ.
An important difference is that we no longer use the normal pdf for our prior distribu-
tion for the unknown parameter. Instead, the prior distribution of σ² is either the inverse
gamma distribution or the scaled inverse chi-square distribution, while the prior distribu-
tion for τ is the gamma distribution. We haven’t introduced the first two distributions in
Prior hyperparameters: α0, β0

Posterior hyperparameters:
α = α0 + n/2
β = β0 + ∑ᵢ₌₁ⁿ (xᵢ − μ)²/2
Figure 12.13
this book, but we learned about the gamma distribution in the Shark Attack Problem, and
that is what is shown above. And we will make use of the gamma distribution to estimate τ
with Gibbs sampling when we revisit the Maple Syrup Problem in Chapter 16. A key take-
home point is that you have quite a bit of flexibility with respect to conjugate solutions for
the parameters of the normal distribution.
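The gamma update for τ with a known mean (Figure 12.13) can be sketched as follows. The prior hyperparameters α0 and β0, and the assumed known mean, are hypothetical values chosen for illustration:

```python
# Gamma conjugate update for the precision tau when the mean mu is known.
# alpha0, beta0, and mu_known are hypothetical, for illustration only.
mu_known = 10.0
data = [7, 10, 10, 8, 4]
alpha0, beta0 = 2.0, 4.0

n = len(data)
alpha_post = alpha0 + n / 2                                    # alpha = alpha0 + n/2
beta_post = beta0 + sum((x - mu_known) ** 2 for x in data) / 2 # beta = beta0 + sum sq/2

print(alpha_post, beta_post)   # 4.5 28.5
```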
Answer: Indeed.
• We started by mentioning that 60% of Quebec’s maple syrup stockpile was stolen and
that the cartel wishes to annex Vermont’s maple syrup production.
• We introduced the normal probability density function as a way to quantify the average
and variation in annual syrup production for the state of Vermont.
• We emphasized that the normal distribution has two parameters, the mean μ and the
standard deviation σ, which need to be estimated jointly.
• We remembered that we have an infinite number of hypotheses for μ and σ; these can be
described by continuous distributions. To complicate the matter, the estimate of one
parameter depends on the value of the second parameter.
• To make headway for this problem, we assumed that one of the parameters of the normal
distribution was known. We said that σ was known to be 2.0.
• We talked about different measures of spread for the normal distribution: the standard
deviation (σ), the variance (σ²), and the precision (τ = 1/σ²).
• We then set up a prior distribution to represent our hypotheses for the mean, μ, of the
normal distribution. This prior distribution was also a normal distribution with specified
hyperparameters μ0 and τ0.
• We then collected data. Vermont produced 10.2 million gallons of syrup last year.
• We used a shortcut, analytical solution that allowed us to update our prior normal
distribution to a posterior normal distribution by using the data. The posterior distribu-
tion reflects our updated beliefs for each and every hypothesis for μ.
• We then collected five more years of data. This time, we used our newly updated posterior
as the prior distribution, and then updated it again given the newly observed data. Thus,
we are using Bayesian inference as a way to formalize “learning.”
What’s next?
Answer: This ends our section on Bayesian conjugates. You can see that these conjugate
“shortcuts” are extremely handy for certain types of problems. In Section 5, we’ll show you
how to solve these same problems using a different approach called Markov Chain Monte
Carlo, or MCMC for short.