Statistical Inference
A Work in Progress
Springer
    Seymour Geisser was a mentor to Wes Johnson and me. He was Wes’s Ph.D.
advisor. Near the end of his life, Seymour was finishing his 2005 book Modes of
Parametric Statistical Inference and needed some help. Seymour asked Wes and
Wes asked me. I had quite a few ideas for the book but then I discovered that Sey-
mour hated anyone changing his prose. That was the end of my direct involvement.
The first idea for this book was to revisit Seymour’s. (So far, that seems only to
occur in Chapter 1.)
    Thinking about what Seymour was doing was the inspiration for me to think
about what I had to say about statistical inference. And much of what I have to say
is inspired by Seymour’s work as well as the work of my other professors at Min-
nesota, notably Christopher Bingham, R. Dennis Cook, Somesh Das Gupta, Morris L. Eaton, Stephen E. Fienberg, Naresh Jain, F. Kinley Larntz, Frank B. Martin, Stephen Orey, William Sudderth, and Sanford Weisberg. No one had a greater influ-
ence on my career than my advisor, Donald A. Berry. I simply would not be where
I am today if Don had not taken me under his wing.
    The material in this book is what I (try to) cover in the first semester of a one-year
course on Advanced Statistical Inference. The other semester I use Ferguson (1996).
The course is for students who have had at least Advanced Calculus and hopefully
some Introduction to Analysis. The course is at least as much about introducing
some mathematical rigor into their studies as it is about teaching statistical inference.
(Rigor also occurs in our Linear Models class but there it is in linear algebra and
here it is in analysis.) I try to introduce just enough measure theory for students to
get an idea of its value (and to facilitate my presentations). But I get tired of doing
analysis, so occasionally I like to teach some inference in the advanced inference
course.
    Many years ago I went to a JSM (Joint Statistical Meetings) and heard Brian
Joiner make the point that everybody learns by going from examples to generalities/theory.
Ever since, I have tried, with varying amounts of success, to incorporate this dictum
into my books. (Plane Answers' first edition preceded that experience.) This book
has four chapters of example-based discussion before it begins the theory in Chapter
5. There are only three chapters of theory but they are tied to extensive appendices on
technical material. The first appendix merely reviews basic (non-measure theoretic)
ideas of multivariate distributions. The second briefly introduces ideas of measure
theory, measure theoretic probability, and convergence. Appendix C introduces the
measure theoretic approaches to conditional probability and conditional expecta-
tion. Appendix D adds a little depth (very little) to the discussion of measure theory
and probability. Appendix E introduces the concept of identifiability. Appendix F
merely reviews concepts of multivariate differentiation. Chapters 8 through 13 are
me being self-indulgent and tossing in things of personal interest to me. (They don’t
actually get covered in the class.)
   References to PA and ALM are to my books Plane Answers and Advanced Linear
Modeling.
Statistical Inference
Cox, D.R. (2006). Principles of Statistical Inference. Cambridge University Press, Cambridge.
Fisher, R.A. (1956). Statistical Methods and Scientific Inference, Third Edition, 1973. Hafner
        Press, New York.
Geisser, S. (2005). Modes of Parametric Statistical Inference. Wiley, Hoboken, NJ.
Bayesian Books
de Finetti, B. (1974, 1975). Theory of Probability, Vols. 1 and 2. John Wiley and Sons, New York.
Jeffreys, H. (1961). Theory of Probability, Third Edition. Oxford University Press, London.
Savage, L. J. (1954). The Foundations of Statistics. John Wiley and Sons, New York.
DeGroot, M. H. (1970). Optimal Statistical Decisions. McGraw-Hill, New York.
    The first three are foundational. There are now TONS of other books; see mine for other references.
Preface .......................................................... vii
Table of Contents ................................................. ix

1 Overview ........................................................ 1
  1.1 Early Examples .............................................. 1
  1.2 Testing ..................................................... 1
  1.3 Decision Theory ............................................. 2
  1.4 Some Ideas about Inference .................................. 2
  1.5 The End ..................................................... 2
  1.6 The Bitter End .............................................. 3

2 Significance Tests .............................................. 5
  2.1 Generalities ................................................ 5
  2.2 Continuous Distributions .................................... 13
      2.2.1 One Sample Normals .................................... 14
  2.3 Testing Two Sample Variances ................................ 19
  2.4 Fisher's z distribution ..................................... 22
  2.5 Final Notes ................................................. 28

3 Hypothesis Tests ................................................ 31
  3.1 Testing Two Simple Hypotheses ............................... 31
      3.1.1 Neyman-Pearson tests .................................. 32
      3.1.2 Bayesian Tests ........................................ 35
  3.2 Simple Versus Composite Hypotheses .......................... 36
      3.2.1 Neyman-Pearson Testing ................................ 37
      3.2.2 Bayesian Testing ...................................... 38
  3.3 Composite versus Composite .................................. 39
      3.3.1 Neyman-Pearson Testing ................................ 39
      3.3.2 Bayesian Testing ...................................... 40
  3.4 More on Neyman-Pearson Tests ................................ 40

5 Decision Theory ................................................. 49
  5.1 Optimal Prior Actions ....................................... 50
  5.2 Optimal Posterior Actions ................................... 55
  5.3 Traditional Decision Theory ................................. 56
  5.4 Minimax Rules ............................................... 60
  5.5 Prediction Theory ........................................... 62
      5.5.1 Prediction Reading List ............................... 64
      5.5.2 Linear Models ......................................... 65

6 Estimation Theory ............................................... 67
  6.1 Basic Estimation Definitions and Results .................... 67
      6.1.1 Maximum Likelihood Estimation ......................... 68
  6.2 Sufficiency and Completeness ................................ 68
      6.2.1 Ancillary Statistics .................................. 71
      6.2.2 Proof of the Factorization Criterion .................. 72
  6.3 Rao-Blackwell Theorem and Minimum Variance Unbiased Estimation 74
      6.3.1 Minimal Sufficient Statistics ......................... 75
      6.3.2 Unbiased Estimation: Additional Results from Rao (1973, Chapter 5) 76
  6.4 Scores, Information, and Cramér-Rao ......................... 79
      6.4.1 Information and Maximum Likelihood .................... 81
      6.4.2 Score Statistics ...................................... 82
  6.5 Gauss-Markov Theorem ........................................ 82
  6.6 Exponential Families ........................................ 82
  6.7 Asymptotic Properties ....................................... 84

E Identifiability ................................................. 181

References ........................................................ 193

Index ............................................................. 203
Chapter 1
Overview
1.1 Early Examples

The 12th century theologian, physician, and philosopher Maimonides used probability to address a temple tax problem associated with women giving birth to boys when the birth order is unknown; see Geisser (2005) and Rabinovitch (1970).
    One of the earliest uses of statistical testing was made by Arbuthnot (1710). He
had available the births from London for 82 years. Every year there were more male births than female births. Assuming that yearly births are independent and the probability of more males is 1/2, he calculated the chance of getting all 82 years with more males as $(0.5)^{82}$. This being a ridiculously small probability, he concluded that boys are born more often.
    In the last half of the eighteenth century, Bayes, Price, and Laplace used what
we now call Bayesian estimation and Daniel Bernoulli used the idea of maximum
likelihood estimation, cf. Stigler (2007).
1.2 Testing
One of the famous controversies in statistics is the dispute between Fisher and
Neyman-Pearson about the proper way to conduct a test. Hubbard and Bayarri
(2003) give an excellent account of the issues involved in the controversy. Another
famous controversy is between Fisher and almost all Bayesians. In fact, Fienberg
(2006) argues that Fisher was responsible for giving Bayesians their name. Fisher
(1956) discusses one side of these controversies. Berger’s Fisher lecture attempted
to create a consensus about testing, see Berger (2003).
   The Fisherian approach is referred to as significance testing. The Neyman-
Pearson approach is called hypothesis testing. The Bayesian approach to testing
is an alternative to Neyman and Pearson’s hypothesis testing. A quick review and
comparison of these approaches is given in Christensen (2005). Here we cover much
of the same material but go into more depth with Chapter 2 examining significance
testing, Chapter 3 discussing hypothesis testing, and Chapter 4 drawing compar-
isons between the methods. These three chapters try to introduce the material with
a maximum of intuition and a minimum of theory.
1.5 The End

Chapters 8 through 11 contain various ideas I have had about statistical inference.
The last two chapters are easy going. The first of these contains edited reprints of
my JASA reviews of two books on statistical inference by great statisticians: D. R.
Cox and Erich Lehmann. The last chapter is a reprint. It is a short biography of
Seymour Geisser.
1.6 The Bitter End
The absolute end of the book is a series of appendices that cover multivariate dis-
tributions, an introduction to measure theory and convergence, a discussion of how
the Radon-Nikodym theorem provides the basis for measure theoretic conditional
probability, some additional detail on measure theory, and finally a summary of
multivariate differentiation.
Chapter 2
Significance Tests
In his seminal book The Design of Experiments, R.A. Fisher (1935) illustrated sig-
nificance testing with the example of “The Lady Tasting Tea,” cf. also Salsburg
(2001). Briefly, a woman claimed that she could tell the difference between whether
milk was added to her tea or if tea was added to her milk. Fisher set up an experiment
that allowed him to test that she could not.
    Fisher (1935, p.14) says, “In order to assert that a natural phenomenon is exper-
imentally demonstrable we need, not an isolated record, but a reliable method of
procedure. In relation to the test of significance, we may say that a phenomenon is
experimentally demonstrable when we know how to conduct an experiment which
will rarely fail to give us a statistically significant result.”
    The fundamental idea of significance testing is to extend the idea of a proof by
contradiction into probabilistic settings.
    Fisher (1925, p. 80) says, “The term Goodness of Fit has caused some to fall into
the fallacy that the higher the value of P the more satisfactorily is the hypothesis
verified. Values over .999 have sometimes been reported which, if the hypothesis
were true, would only occur once in a thousand trials. ... In these cases the
hypothesis considered is as definitely disproved as if P had been .001.”
2.1 Generalities
The idea of a proof by contradiction is that you start with a collection of (antecedent)
statements, you then work from those statements to a conclusion that cannot possi-
bly be true, i.e., a contradiction, so that you can conclude that something must be
wrong with the original statements. For example, if I state that “all women have blue
eyes” and that “Sharon has brown eyes” but then I observe the data that “Sharon
is a woman,” it follows that either the statement that “all women have blue eyes”
and/or the statement that “Sharon has brown eyes” must be false. Ideally, I would
know that all but one of the antecedent statements are correct, so a contradiction
would tell me that the final statement must be wrong. Since I happen to know that
the antecedent statement “Sharon has brown eyes” is true, it must be the statement
“all women have blue eyes” that is false. I have proven by contradiction that “not all
women have blue eyes,” but to do that I needed to know that both of the statements
“Sharon has brown eyes” and “Sharon is a woman” are true.
    In significance testing we collect data that we take to be true (“Sharon is a
woman”) but we rarely have the luxury of knowing that all but one of our antecedent
statements are true. In practice, we do our best to validate all but one of the state-
ments (we look at Sharon’s eyes and see that they are brown) so that we can have
some idea which antecedent statement is untrue.
    In a significance test, we start with a probability model for some data, we then
observe data that are supposed to be generated by the model, and if the data are
impossible under the model, we have a contradiction, so something about the model
must be wrong. The extension of proof by contradiction that is fundamental to sig-
nificance testing is that, if the data are merely weird (unexpected) under the model,
that gives us a philosophical basis for questioning the validity of the model. In the
Lady Tasting Tea experiment, Fisher found weird data suggesting that something
might be wrong with his model, but he designed his experiment so that the only
thing that could possibly be wrong was the assumption that the lady was incapable
of telling the difference.
   The crux of significance testing is that you need somehow to determine how
weird the data are. Arbuthnot (1710) found it suspicious that for 82 years in a row,
more boys were born in London than girls. Suspicious of what? The idea that male
and female births are equally likely. Many of us would find it suspicious if males
were more common for merely 10 years in a row. If birth sex has a probability
model similar to coin flipping, each year the probability of more males should be
0.5 and outcomes should be independent. Under this model the probability of more
males 10 years in a row is $(0.5)^{10} \doteq 0.001$. Pretty weird data, right? But no weirder
than seeing ten years with more boys in the first year and then alternating with girls,
i.e., (B, G, B, G, B, G, B, G, B, G), and no weirder than any other specific sequence,
say (B, B, G, B, B, G, B, G, G, G). What seems relevant here is the total number of
years with more boys born, not the particular pattern of which years have more boys
and which have more girls.
    Therein lies the rub of significance testing. To test a probability model you need
to summarize the observed data into a test statistic and then you need to determine
the relative weirdness of the possible values of the test statistic as determined by the
model. Typically, we would choose a test statistic that will be sensitive to the kinds
of things that we think most likely to go wrong with the model (e.g, one sex might
be born more often than the other). If the distribution of the test statistic is discrete,
it seems pretty clear that a good measure of weirdness is having a small probability.
If the distribution of the test statistic is continuous, it seems that a good measure of
weirdness is having a small probability density, but we will see later that there are
complications associated with using densities.
    For our birth sex problem, the coin flipping model implies that the probabil-
ity distribution for the number of times boys exceed girls in 10 years is binomial,
specifically, Bin(10, 0.5). The natural measure of weirdness for the outcomes in this
model is the probability of the outcome. The smaller the probability, the weirder the
outcome.
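   These discrete weirdness calculations are easy to do in R; a minimal sketch for the birth sex model:

    # Outcome probabilities under Bin(10, 0.5); smaller probability = weirder outcome
    round(dbinom(0:10, 10, 0.5), 4)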
    Traditionally, a P value is employed to quantify the idea of weirdness. The P
value is defined as the probability of seeing something as weird or weirder than you
actually saw. It measures how consistent the data are with the hypothesized model.
If there is little consistency with the model, that provides evidence that something is
wrong with the model. We won’t know what in particular is wrong with the model
unless we can validate all of the assumptions in the model except one.
   Anything with a P value of 0 is something that cannot happen and gives an abso-
lute contradiction to the assumed probability model. In practice, one rarely obtains
P values of 0 but often encounters P values being rounded off to 0.
Note that seeing all girls is just as weird as seeing all boys, so the P value for seeing
10 out of 10 boys is twice the probability of that outcome. The datum that is most
consistent with the model is seeing 5 boys. If we do see 10 out of 10 boys, that
suggests that the Bin(10, 0.5) model is wrong but does not itself suggest why the
model is wrong or what about the model is wrong. □
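   A short R check of that arithmetic, using outcome probability as the measure of weirdness:

    # P value for 10 boys in 10 years: total probability of all outcomes
    # no more probable than the one observed (here r = 0 and r = 10)
    p <- dbinom(0:10, 10, 0.5)
    sum(p[p <= p[11]])   # = 2*dbinom(10, 10, 0.5), about 0.002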
    In a significance test there is only one probability model being tested so there
is no real need to give that model a label. But historically, significance tests have
been confounded with the hypothesis tests discussed in the next chapter, and it has
become common to refer to the probability model being tested as the “null model”
or as the “null hypothesis.” In many situations it makes sense to think of there being
an overall model for the data and a specific claim about that model (called the null
hypothesis) which, together, define the null model. If the data are inconsistent with
the null model, we do not know if (the data are just weird or if) the overall model is
wrong or if the null hypothesis is wrong. In practice, we do our best to validate the
overall model, so that it makes some sense to claim that the null hypothesis may be
wrong (cf. the Duhem-Quine thesis).
    In significance testing, the P value is the fundamental concept but it can be useful
to discuss α level tests. Technically, an α level is simply a decision rule as to which
P values will cause one to reject the null model. In other words, it is merely a
decision point as to how weird the data must be before rejecting the null model. In
an α level significance test, if the P value is less than or equal to α, the null model
is rejected. Implicitly, an α level determines what data would cause one to reject the
null model and what data will not cause rejection. The α level rejection region is
defined as the set of all data points that have a P value less than or equal to α.
    Note that in Example 2.1.1, an α = 0.01 test is identical to an α = 0.0125 test.
Both reject when observing either r = 2 or 3. Moreover, the probability of rejecting
with an α = 0.0125 test when the null model is true is not 0.0125, it is 0.01. However,
significance testing is not interested in what the probability of rejecting the null
hypothesis will be; it is interested in what the probability was of seeing weird data.
    If an α level is chosen, for any semblance of a contradiction to occur, the α level
must be small. On the other hand, making α too small will defeat the purpose of
the test, making it extremely difficult to reject the null model for any reason. A reasonable
view would be that an α level should never be chosen; that a scientist should simply
evaluate the evidence embodied in the P value. (But that would not allow us to define
confidence regions associated with significance tests.)
Since there are 5 Direct injuries, the possible values for the number of E outcomes,
say r, are 0 to 5. However, we are conditioning on having seen a total of 10 E out-
comes, and it is impossible to get a total of 10 Es from samples of 5 Directs and 8
Boths without having at least 2 of the Direct injuries being Es. Given the numbers
of Direct and Both injuries (both numbers treated as fixed) and the total number of
excellent outcomes, the only outcome that would constitute any reasonable evidence
that something is wrong with the null model (has a small P value) is seeing y1 = 2,
which we did not see. (I did all these computations by hand except for finding the
decimal P values.) □
   We now derive the conditional distribution. The assumed model for the data is
that independently yi ∼ Bin(Ni , pi ), i = 1, 2. The null hypothesis is p1 = p2 ≡ p, so
under the null model y1 and y2 are independent yi ∼ Bin(Ni , p). Write the 2 × 2 table
as
                                 Success Failure     Total
                       Group 1     y1    N1 − y1      N1
                       Group 2     y2    N2 − y2      N2
                       Total        t   N1 + N2 − t N1 + N2
It follows that
$$\Pr(y_1 = r \mid t = s) = \frac{\Pr(y_1 = r \text{ and } t = s)}{\Pr(t = s)} = \frac{\binom{N_1}{r}\binom{N_2}{s-r}}{\binom{N_1+N_2}{s}}.$$
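   This conditional distribution is hypergeometric, so its probabilities can be checked against R's dhyper; the values of N1, N2, s, and r below are arbitrary choices for illustration:

    # Conditional probability from the formula above vs. R's hypergeometric
    N1 <- 5; N2 <- 8; s <- 4; r <- 2
    choose(N1, r)*choose(N2, s - r)/choose(N1 + N2, s)
    dhyper(r, m = N1, n = N2, k = s)   # same value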
Exercise 2.1.     Show that the same distribution results if the 4 values in the table
are multinomial with Success/Failure independent of Groups when conditioning on
one row total and one column total.
In the twin conviction data analyzed below, we have suppressed in the notation that the probability is conditional on seeing 12 total convictions, 13 identical twins, and 17 fraternal twins.
   The reason for addressing this example is that Fisher argued that the only
more extreme tables are

                          Result                               Result
              Twins     Convicted Not Total       Twins     Convicted Not Total
              Identical    11      2    13        Identical    12      1    13
              Fraternal     1     16    17        Fraternal     0     17    17
              Total        12     18    30        Total        12     18    30
so Fisher computed a P value as the sum of the probabilities of these three tables,
giving
$$\frac{619}{1330665} = 0.000465.$$
This looks like some kind of one-sided test rather than a significance test.
   Indeed, it is not a significance test. The probability of seeing what we saw is
$$\Pr(y_1 = 10) = \frac{\binom{13}{10}\binom{17}{2}}{\binom{30}{12}} = \frac{1}{\binom{30}{12}}\,\frac{13\cdot 12\cdot 11}{3\cdot 2}\,\frac{17\cdot 16}{2} = \frac{1}{\binom{30}{12}}\,17\cdot 13\cdot 16\cdot 11.$$
Obviously, 16 · 11 > 2 · 14, so Pr(y1 = 10) > Pr(y1 = 0) and Pr(y1 = 0) should be
added in when computing the P value. It turns out that the second most extreme
case in the other direction has Pr(y1 = 1) > Pr(y1 = 10), so it does not need to be
added to P. I used R to compute the distribution as shown below and obviously
Pr(y1 = 10) < Pr(y1 = r), r = 1, . . . , 9.
                      r Pr(y1 = r)     r Pr(y1 = r)
                      0 0.00007154318 7 0.1227681
                      1 0.001860123    8 0.03541387
                      2 0.01753830     9 0.005621250
                      3 0.08038387    10 0.0004497000
                      4 0.2009597     11 0.00001533068
                      5 0.2893819     12 0.0000001503008
                      6 0.2455362
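   The table comes straight from dhyper with 13 identical twins, 17 fraternal twins, and 12 convictions:

    # Conditional distribution of the number of convicted identical twins
    round(dhyper(0:12, m = 13, n = 17, k = 12), 10)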
The significance test P value, adding the probabilities for r = 0, 10, 11, 12 from the
table, is 0.000537, not a whole lot different from Fisher's 0.000465. □
model is wrong. If the population was male University of New Mexico (UNM) stu-
dents, a sample of size 10 with sample mean of ȳ· = 69.3 is consistent with the
model, having a P value of 0.92. But ȳ· = 69.3 is even more consistent with the
model N(69.0001, 9) and is most consistent with the model N(69.3, 9). Of course
ȳ· = 69.3 is also perfectly consistent with the model that every male at UNM has the
height 69.3 inches, but (presumably) aspects of the data other than their mean could
prove that model incorrect.
    Seeing data that are consistent with the model does not make the model correct,
anymore than making a bunch of assumptions and not being able to find a contra-
diction makes the assumptions correct. The logic is one directional. A contradiction
means the assumptions are wrong, weird data suggest the model may be wrong.
Admittedly, I do feel that collecting data that are consistent with a null model is a
more worthwhile activity than merely thinking about assumptions and failing to find
a contradiction.
    A significance test can really be thought of as a model validation procedure. It
makes no reference to any alternative model(s). We have the distribution of the null
model and we examine whether the data look weird or not.
Fig. 2.1 Three distributions: solid, N(0, 1); long dashes, t(1); short dashes, t(3).
We want to collect data and then use them to determine whether or not the data give
a test statistic that is consistent with the t(99) distribution.
    Taking weird observations to be those that have small probability densities, because of the shape of t(df) distributions, weird observations are values of $(\bar{y} - 3)/\sqrt{s^2/100}$ that are far from 0. Because of symmetry about 0, the level of weirdness is determined by $|\bar{y} - 3|/\sqrt{s^2/100}$, with larger values more weird than smaller values.
    If we happen to observe $\bar{y}_{obs} = 1$ and $s^2_{obs} = 4$, we get
$$t_{obs} \equiv \frac{\bar{y}_{obs} - 3}{\sqrt{s^2_{obs}/100}} = \frac{1 - 3}{\sqrt{4/100}} = -10,$$
which is a very strange thing to observe from a t(99) distribution. (A t(99) will be quite similar to a N(0, 1) distribution.) By symmetry about 0, seeing $(\bar{y} - 3)/\sqrt{s^2/100} = -10$ is exactly as weird as seeing $(\bar{y} - 3)/\sqrt{s^2/100} = 10$ and less weird than seeing anything with $|\bar{y} - 3|/\sqrt{s^2/100} < 10$. So the P value, being the probability of seeing anything as weird or weirder than the $-10$ that we actually saw, is the probability that a t(99) distribution is (as far or) farther away from 0 than 10, i.e., $P = \Pr[|t(99)| \ge 10]$,
which is approximately 0. This constitutes a great deal of evidence against the null
model but it does not necessarily constitute evidence against the null hypothesis.
Perhaps µ ̸= 3 but perhaps the data are not normal or perhaps the data are not inde-
pendent or perhaps the observations have different variances or different means.
□
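   A quick R check of this example:

    # Two-sided P value for t_obs = -10 against the t(99) reference distribution
    tobs <- (1 - 3)/sqrt(4/100)   # -10
    2*pt(-abs(tobs), df = 99)     # essentially 0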
We want to collect data and then use them to determine whether or not the data give
a test statistic that is consistent with the F(1, 99) distribution.
    We again take weird observations to be those that have small probability densities, but now the shape of the F(1, df) distribution is as illustrated in Figure 2.2. Because of the shape of F(1, df) distributions, weird observations are values of $(\bar{y} - 3)^2/(s^2/100)$ that are above 0, with larger values more weird than smaller values.
[Figure 2.2: densities of the F(1, df) and F(2, df) distributions.]
    If we happen to observe $\bar{y}_{obs} = 1$ and $s^2_{obs} = 4$, we get
$$F_{obs} \equiv \frac{(\bar{y}_{obs} - 3)^2}{s^2_{obs}/100} = 100,$$
which is a very strange thing to observe from an F(1, 99) distribution. The P value, being the probability of seeing anything as weird or weirder than the 100 that we actually saw, is the probability that an F(1, 99) distribution is (as far or) farther away from 0 than 100, i.e., $P = \Pr[F(1, 99) \ge 100]$.
    Exactly the same arguments apply to all the t(df) tests and their corresponding F(1, df) tests that arise in regression analysis, in analysis of variance (ANOVA), and in general linear models, cf. Christensen (1996, 2015, 2020a). Note that the equivalence was based entirely on the fact that $[t(df)]^2 \sim F(1, df)$ and on the shapes of the distributions.
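   The equivalence is easy to verify numerically with the example above:

    # [t(99)]^2 ~ F(1, 99): the two-sided t P value equals the upper-tail F P value
    2*pt(-10, df = 99)              # two-sided t(99) P value
    1 - pf(100, df1 = 1, df2 = 99)  # same number, since (-10)^2 = 100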
    Generally, to determine weirdness we have to know and evaluate the density of the test statistic under the null model. For an F(d1, d2) distribution, that density is
$$f(x|d_1, d_2) \equiv \frac{1}{B\!\left(\frac{d_1}{2}, \frac{d_2}{2}\right)} \left(\frac{d_1}{d_2}\right)^{d_1/2} x^{d_1/2 - 1} \left(1 + \frac{d_1}{d_2}\,x\right)^{-(d_1+d_2)/2},$$
where $B(\cdot, \cdot)$ is the Beta function, which is defined in terms of the Gamma function as
$$B(x, y) = \frac{\Gamma(x)\Gamma(y)}{\Gamma(x+y)}.$$
For d1 = 1, 2 the shape of the density was illustrated in Figure 2.2. For d1 ≥ 3, the
shape is illustrated in Figure 2.3.
[Figure 2.3: density of an F(df1, df2) distribution with df1 ≥ 3; the point F(1 − α, df1, df2) cuts off probability 1 − α below it.]
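   As a sanity check on the density formula, it can be coded directly and compared with R's built-in df; the evaluation point 2.5 and the degrees of freedom 5 and 40 are arbitrary choices:

    # F density from the formula above, checked against R's df()
    fF <- function(x, d1, d2)
      (1/beta(d1/2, d2/2)) * (d1/d2)^(d1/2) * x^(d1/2 - 1) *
      (1 + (d1/d2)*x)^(-(d1 + d2)/2)
    fF(2.5, 5, 40)
    df(2.5, df1 = 5, df2 = 40)   # agrees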
The idea of the test is that under the null model both
$$\frac{SSE(Red.) - SSE(Full)}{dfE(Red.) - dfE(Full)} \qquad \text{and} \qquad MSE(Full)$$
should be unbiased estimates of the variance parameter $\sigma^2$, so under the null model F should take on a value close to 1, even though 1 is not typically the mean, median, or mode of the F[dfE(Red.) − dfE(Full), dfE(Full)] distribution.
    All software that I have seen computes $P \equiv \Pr[F > F_{obs}]$, where $F_{obs}$ is the value of F computed from the observed values of SSE(Red.), dfE(Red.), SSE(Full), and dfE(Full). Indeed, if the full model is valid, only large values of F are inconsistent with the null model. But we do not know that the full model is valid. With weirdness defined by the density function, Figures 2.2 and 2.3 show that this $P \equiv \Pr[F > F_{obs}]$ computation only gives the significance test P value when dfE(Red.) − dfE(Full) = 1, 2. For dfE(Red.) − dfE(Full) ≥ 3, finding the significance test P value requires finding the density value $f(F_{obs}|dfE(Red.) - dfE(Full), dfE(Full))$ and a second value $F_*$ such that $f(F_*) = f(F_{obs}|dfE(Red.) - dfE(Full), dfE(Full))$, and then finding the probability: if $F_* \le F_{obs}$,
$$P = \Pr[F \le F_*] + \Pr[F \ge F_{obs}],$$
or, if $F_* \ge F_{obs}$,
$$P = \Pr[F \ge F_*] + \Pr[F \le F_{obs}].$$
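   The two-sided computation is easy to automate. A minimal R sketch (the name f_pval and the root-bracketing choices are mine, just for illustration): it assumes the numerator degrees of freedom d1 ≥ 3, so the F density has an interior mode, and it uses uniroot to locate F∗.

    # Density-matched significance test P value for an F(d1, d2) statistic, d1 >= 3
    f_pval <- function(fobs, d1, d2) {
      mode <- ((d1 - 2)/d1)*(d2/(d2 + 2))   # mode of the F(d1, d2) density
      g <- function(x) df(x, d1, d2) - df(fobs, d1, d2)
      if (fobs < mode) {                    # F_* sits above the mode
        fstar <- uniroot(g, c(mode, 1e6))$root
        pf(fobs, d1, d2) + (1 - pf(fstar, d1, d2))
      } else {                              # F_* sits below the mode
        fstar <- uniroot(g, c(1e-12, mode))$root
        pf(fstar, d1, d2) + (1 - pf(fobs, d1, d2))
      }
    }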
   We will revisit these numerical examples once we have developed another ap-
proach to these problems.
2.3 Testing Two Sample Variances

Using the F density becomes more problematic when we seek to test the equality of the variances from two normal samples.
EXAMPLE 2.3.1. We examine Jolicoeur and Mosimann's log turtle height data,
cf. Christensen (2015, Example 4.4.1), consisting of 24 female heights and 24 male
heights. The sample variance of log female heights is s21 = 0.02493979 and the sam-
ple variance of log male heights is s22 = 0.00677276. The overall model is that the
observations are independent with one normal distribution for females and another
one for males. The null hypothesis is that the variances for males and females are
the same, i.e., $\sigma_2^2 = \sigma_1^2$. The standard α = 0.01 level hypothesis (not significance)
test is rejected, i.e., we conclude that the null model is wrong, if
$$F_{obs} = \frac{s_2^2}{s_1^2} = \frac{0.00677276}{0.02493979} = 0.2716 > F(0.995, 23, 23) = 3.04$$
or if
$$F_{obs} = 0.2716 < F(0.005, 23, 23) = \frac{1}{F(0.995, 23, 23)} = \frac{1}{3.04} = 0.33.$$
The second of these inequalities is true, so the null model with equal variances is
rejected at the 0.01 level. This is a simple hypothesis test (by no means an optimal
one).
    The significance test is a bit more work to determine. Denote the density for the F(23, 23) distribution $f(z|23, 23)$. Evaluating the density at $F_{obs}$ gives $f(0.2716|23, 23) = 0.03597$. It turns out that $f(2.50835|23, 23) = 0.03597$, so $F_* = 2.50835$ and $F_{obs} = 0.2716$ are equally rare events. The P value, given the shape of $f(z|23, 23)$, is
$$P = \Pr[F(23, 23) \le 0.2716] + \Pr[F(23, 23) \ge 2.50835].$$
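   Assuming the f_pval sketch given at the end of the previous section, the whole calculation is one line:

    f_pval(0.2716, 23, 23)   # turtle data, F_obs = s2^2/s1^2 on (23, 23) df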
    In the next section we will fix this particular problem. Incidentally, the standard (nonoptimal) α level hypothesis test illustrated at the beginning of the section does not change when you reverse the order of the sample variances, because of the mathematical fact that $F(\alpha, r, s) = 1/F(1-\alpha, s, r)$, and that is true even when you have different numbers of degrees of freedom in the numerator and denominator.
   Before proceeding we also need to look at a similar F significance test where the
numerator and denominator degrees of freedom are not the same.
EXAMPLE 2.3.2. Consider the final point total data of Christensen (2015, Example 4.4.2). For a sample of 15 females the sample variance was $s_1^2 = 487.28$ and for
22 males the sample variance was $s_2^2 = 979.29$. The test statistic can be $F = s_1^2/s_2^2$ with
$$F_{obs} = \frac{s_1^2}{s_2^2} = \frac{487.28}{979.29} = 0.498.$$
To find the significance test P value, observe that $f(0.498|14, 21) = 0.6207$, so $F_* = 1.2051$ because $f(1.2051|14, 21) = 0.6207$. Using the F(14, 21) distribution,
$$P = \Pr[F(14, 21) \le 0.498] + \Pr[F(14, 21) \ge 1.2051].$$
Alternatively, the test statistic can be $F = s_2^2/s_1^2$ with
$$F_{obs} = \frac{s_2^2}{s_1^2} = \frac{979.29}{487.28} = 2.010.$$
To find the significance test P value, observe that $f(2.010|21, 14) = 0.15339$, so $F_* = 0.31933$ because $f(0.31933|21, 14) = 0.15339$. Using the F(21, 14) distribution,
$$P = \Pr[F(21, 14) \le 0.31933] + \Pr[F(21, 14) \ge 2.010].$$
These two P values need not agree, which is the problem the next section fixes.
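   Again assuming the earlier f_pval sketch, the two orderings can be compared directly:

    # Point total data, both orderings of the sample variances
    f_pval(487.28/979.29, 14, 21)   # F = s1^2/s2^2
    f_pval(979.29/487.28, 21, 14)   # F = s2^2/s1^2; need not equal the first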
2.4 Fisher's z distribution

The F statistic was invented by George Snedecor in the 1930s at Iowa State University and labeled F in honor of R.A. Fisher. Many of Fisher's applications to which we now apply F tests were invented before the F distribution, so obviously Fisher did not originally use F tests. He used Fisher's z distribution, which is
$$z \equiv \frac{1}{2}\log(F),$$
see Fisher (1924) and Aroian (1941). In particular, this has the density
$$\tilde{f}(x|d_1, d_2) = \frac{2\,d_1^{d_1/2} d_2^{d_2/2}}{B(d_1/2, d_2/2)}\,\frac{e^{d_1 x}}{(d_1 e^{2x} + d_2)^{(d_1+d_2)/2}},$$
which always has a mode of 0 and is symmetric for d1 = d2 . Recall that F should be
near 1 but typically 1 is not the mode of the F distribution. Here, 0 = (1/2) log(1),
so the density of Fisher’s z distribution has its mode at the point that should be most
consistent with the null model.
   Frankly, given the current state of statistics, I cannot think of a single reason to use Fisher's z distribution except for using its density to define weirdness for F statistics. We will continue to compute P values using software for F distributions, but the sets we compute the probabilities for will be determined by the Fisher's z distribution.
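   The density-matching recipe carries over directly; a minimal R sketch of my own (the names dz and z_pval are just for illustration) codes the density above, finds z∗ with uniroot (assuming z_obs ≠ 0; the mode is always at 0), and converts back to the F scale via F = exp(2z) so that pf supplies the probabilities:

    # Fisher's z density, from the formula above
    dz <- function(x, d1, d2)
      (2*d1^(d1/2)*d2^(d2/2)/beta(d1/2, d2/2))*
      exp(d1*x)/(d1*exp(2*x) + d2)^((d1 + d2)/2)

    # Density-matched significance test P value using Fisher's z
    z_pval <- function(fobs, d1, d2) {
      zobs <- log(fobs)/2
      g <- function(z) dz(z, d1, d2) - dz(zobs, d1, d2)
      zstar <- if (zobs < 0) uniroot(g, c(0, 20))$root
               else uniroot(g, c(-20, 0))$root
      lo <- min(zobs, zstar); hi <- max(zobs, zstar)
      pf(exp(2*lo), d1, d2) + (1 - pf(exp(2*hi), d1, d2))
    }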
EXAMPLE 2.4.1. We again examine the log turtle height data consisting of 24
female heights and 24 male heights. The sample variance of log female heights is
s21 = 0.02493979 and the sample variance of log male heights is s22 = 0.00677276.
$$F_{obs} = \frac{s_2^2}{s_1^2} = \frac{0.00677276}{0.02493979} = 0.2716,$$
so
$$z_{obs} = \log(0.2716)/2 = -0.6517.$$
Denote the density for the z(23, 23) distribution $\tilde{f}(z|23, 23)$. From the symmetry of the distribution, $\tilde{f}(-0.6517|23, 23) = \tilde{f}(0.6517|23, 23)$. The P value becomes
$$P = \Pr[z(23, 23) \le -0.6517] + \Pr[z(23, 23) \ge 0.6517] = \Pr[F(23, 23) \le 0.2716] + \Pr[F(23, 23) \ge 1/0.2716].$$
It turns out that if you perform the test with $s_1^2/s_2^2$ you will get the same P value. □
   Even with unequal degrees of freedom, z gives the same P value either way you
do the test.
EXAMPLE 2.4.2. Consider again the final point total data. For a sample of 15
females the sample variance was s21 = 487.28 and for 22 males the sample variance
was $s_2^2 = 979.29$. The test statistic can be $F = s_1^2/s_2^2$ with
$$F_{obs} = \frac{s_1^2}{s_2^2} = \frac{487.28}{979.29} = 0.497584985.$$
This leads to
$$z_{obs} = \log(0.497584985)/2 = -0.3489945.$$
To find the significance test P value we first need to calculate $\tilde{f}(z_{obs}|14, 21) = \tilde{f}(-0.3489945|14, 21) = 0.6168776$, then find a value $z_*$ on the other side of the mode from $z_{obs}$ with $\tilde{f}(z_*|14, 21) = \tilde{f}(z_{obs}|14, 21)$, that is, $\tilde{f}(0.3335880|14, 21) = \tilde{f}(-0.3489945|14, 21)$, and then we can find the probability. Alternatively, the test statistic can be $F = s_2^2/s_1^2$ with
$$F_{obs} = \frac{s_2^2}{s_1^2} = \frac{979.29}{487.28} = 2.00970694.$$
This leads to
$$z_{obs} = \log(2.00970694)/2 = 0.3489945.$$
Although the test statistics are symmetric, with unequal degrees of freedom Fisher's z distribution is not, so the P values will be slightly different. To find the significance test P value we first need to calculate $\tilde{f}(z_{obs}|21, 14) = \tilde{f}(0.3489945|21, 14) = 0.6168776$, then find a value $z_*$ on the other side of the mode from $z_{obs}$ with $\tilde{f}(z_*|21, 14) = \tilde{f}(z_{obs}|21, 14)$, that is, $\tilde{f}(-0.3335880|21, 14) = \tilde{f}(0.3489945|21, 14)$, and then we can find the probability.
□
   The following R code illustrates the symmetry of the test.
    fobs=0.497584985
    xx=.3489945
    x=c(log(fobs)/2,xx)
    d1=14
    d2=21
    # Fisher's z density evaluated at z_obs = log(fobs)/2 and at xx
    ftilde = (2*d1^(d1/2)*d2^(d2/2)/beta(d1/2,d2/2))*
    exp(d1*x)/(d1*exp(2*x)+d2)^((d1+d2)/2)
    matrix(c(x,ftilde),,2)
    fobs=2.00970695
    xx=-.3335880
    x=c(log(fobs)/2,xx)
    d1=21
    d2=14
    ftilde = (2*d1^(d1/2)*d2^(d2/2)/beta(d1/2,d2/2))*
    exp(d1*x)/(d1*exp(2*x)+d2)^((d1+d2)/2)
    matrix(c(x,ftilde),,2)
   In the following R code, by playing around with the values a, b, and c that determine the grid xx, you can figure out what $z_*$ has to be. The first entry in x is $z_{obs}$, so the first entry in ftilde is the density value you are trying to reproduce. You can start with a = −5, b = 5, c = 1 (so the grid spacing is 1/10^c = 0.1). Once you see what $z_{obs}$ is, $z_*$ should be somewhere near its negative. Pick a and b appropriately and then decrease the last term in seq by a factor of 10 as needed. Minor changes allow finding $F_*$; in particular, you would want to change to a = 0.
    # Routine for finding z_*
    # Adaptable for finding F_*
    a=-5
    b=5
    c=1
    fobs=0.497584985
    xx=seq(a,b,1/10^c)
    x=c(log(fobs)/2,xx)
    # For F_* use
    # x=c(fobs,xx)
    d1=14
    d2=21
    ftilde = (2*d1^(d1/2)*d2^(d2/2)/beta(d1/2,d2/2))*
    exp(d1*x)/(d1*exp(2*x)+d2)^((d1+d2)/2)
    # For F_* use
    # ftilde=df(x,d1,d2)
    matrix(c(x,ftilde),,2)
   The following code plots three of these densities: z(23, 23), z(14, 21), and z(21, 14).
    x=seq(-1,1,.01)
    d1=23
    d2=23
    ftilde = (2*d1^(d1/2)*d2^(d2/2)/beta(d1/2,d2/2))*
    exp(d1*x)/(d1*exp(2*x)+d2)^((d1+d2)/2)
    plot(x,ftilde,type="l",ylim=c(0,2),ylab="",xlab="",lty=3)
    d1=14
    d2=21
    ftilde1 = (2*d1^(d1/2)*d2^(d2/2)/beta(d1/2,d2/2))*
    exp(d1*x)/(d1*exp(2*x)+d2)^((d1+d2)/2)
    lines(x,ftilde1,type="l",lty=2)
    d1=21
    d2=14
    ftilde2 = (2*d1^(d1/2)*d2^(d2/2)/beta(d1/2,d2/2))*
    exp(d1*x)/(d1*exp(2*x)+d2)^((d1+d2)/2)
    lines(x,ftilde2,type="l",lty=1)
    legend("topright",c("z(23,23)","z(14,21)","z(21,14)"),lty=c(3,2,1))
[Figure: Fisher's z densities z(23, 23), z(14, 21), and z(21, 14).]
   The R package VGAM has a command dlogF that gives the density for 2z =
log(F).
   For example, suppose $F_{obs} = 2.75$ on 5 and 40 degrees of freedom, so $z_{obs} = \log(2.75)/2 = 0.505800456$. To find the significance test P value we first need to calculate $\tilde{f}(z_{obs}|5, 40) = \tilde{f}(0.505800456|5, 40) = 0.2647452$, then find a value $z_*$ on the other side of the mode from $z_{obs}$ with $\tilde{f}(z_*|5, 40) = \tilde{f}(z_{obs}|5, 40)$. In particular, $\tilde{f}(0.505800456|5, 40) = \tilde{f}(-0.6823366|5, 40)$, so we can find the probability.
   What about small values? With the same degrees of freedom suppose $F_{obs} = 0.15$. The commonly computed one-sided P value is 0.978878, which is almost too good to be true and large enough to make many of us suspicious that something is wrong. The significance test P value from an F test is found in the same way. The R code below carries out the computations for the $F_{obs} = 2.75$ case.
    fobs=2.75
    d1=5
    d2=40
    1-pf(fobs,d1,d2)                   # one-sided P value
    df(fobs,d1,d2)                     # density at F_obs
    df(.0349,d1,d2)                    # matching density at F_* = 0.0349
    pf(.0349,d1,d2)
    pf(.0349,d1,d2)+1-pf(fobs,d1,d2)   # significance test P value
    xx=-.6823366
    x=c(log(fobs)/2,xx)
    ftilde = (2*d1^(d1/2)*d2^(d2/2)/beta(d1/2,d2/2))*
    exp(d1*x)/(d1*exp(2*x)+d2)^((d1+d2)/2)
    matrix(c(x,ftilde),,2)
    exp(2*x)                           # back on the F scale
    pf(exp(2*x),d1,d2)
□
   Reconsider the one sample normal problem, where now we observe
$$t_{obs} \equiv \frac{\bar{y}_{obs} - 3}{\sqrt{s^2_{obs}/100}} = -2$$
and $F_{obs} = 4$, so that
$$z_{obs} = \log(4)/2 = 0.6931472.$$
To find the significance test P value we first need to calculate $\tilde{f}(z_{obs}|1, 99) = \tilde{f}(0.6931472|1, 99) = 0.2196706$, then find a value $z_*$ on the other side of the mode from $z_{obs}$ with $\tilde{f}(z_*|1, 99) = \tilde{f}(z_{obs}|1, 99)$, that is, $\tilde{f}(0.6931472|1, 99) = \tilde{f}(-1.245495|1, 99)$, and then we can find the probability.
This oddity occurs because this z distribution is highly skewed to the left.
   Personally, this is the only situation in which I would choose to use P values from
the F distribution rather than Fisher’s z. But I have not really investigated behavior
with 2 degrees of freedom in the numerator. As a practical matter, although I do not
choose to do it, I live with the one-sided P values imposed by standard software.
And they are the same for both distributions. □
2.5 Final Notes

Rejecting a Significance test suggests that something is wrong with the (null) model.
It does not specify what is wrong.
    The example of a t test raises yet another question. Why should we summarize these data by looking at the t statistic,
$$\frac{\bar{y} - 0}{s/\sqrt{n}}\,?$$
One reason is purely practical. In order to perform a test, one must have a known
distribution to compare to the data. Without a known distribution there is no way
to identify which values of the data are weird. With the normal data, even when
assuming µ is known, we do not know σ 2 so we do not know the distribution of
the data. By summarizing the data into the t statistic, we get a function of the data
that has a known distribution, which allows us to perform a test. Another reason is
essentially: why not look at the t statistic? If you have another statistic you want
to base a test on, the Significance tester is happy to oblige. Fisher (1956, p. 49)
indicates that the hypothesis should be rejected “if any relevant feature of the obser-
vational record can be shown to [be] sufficiently rare”. After all, if the null model is
correct, it should be able to withstand any challenge. Moreover, there is no hint in
this passage of worrying about the effects of performing multiple tests. Inflating the
probability of Type I error (rejecting the null when it is true) by performing multiple
tests is not a concern in Significance testing because the probability of Type I error
is not a concern in Significance testing.
    The one place that possible alternative hypotheses arise in Significance testing is
in the choice of test statistics. Again quoting Fisher (1956, p. 50), “In choosing the
grounds upon which a general hypothesis should be rejected, personal judgement
may and should properly be exercised. The experimenter will rightly consider all
points on which, in the light of current knowledge, the hypothesis may be imper-
fectly accurate, and will select tests, so far as possible, sensitive to these possible
faults, rather than to others.” Nevertheless, the logic of Significance testing in no
way depends on the source of the test statistic.
    Although Fisher preferred his idea of fiducial inference, one can use Significance
testing to arrive at “confidence regions” that do not involve either fiducial inference
or repeated sampling. If you have a null model determined by an overall model and
a null hypothesis about a parameter's value, a (1 − α) confidence region can be de-
fined simply as a collection of parameter values that would not be rejected by an α
level significance test, that is, a collection of parameter values that are consistent
with the data as judged by an α level test. This definition involves no long run fre-
quency interpretation of “confidence.” It makes no reference to what proportion of
hypothetical confidence regions would include the true parameter. It does, however,
require one to be willing to perform an infinite number of tests without worrying
about their frequency interpretation. This approach also raises some curious ideas.
For example, with the normal data discussed earlier, this leads to standard t confi-
dence intervals for µ and $\chi^2$ confidence intervals for $\sigma^2$, but one could also form a joint 95% confidence region for µ and $\sigma^2$ by taking all the pairs of values that satisfy
$$\frac{|\bar{y} - \mu|}{\sigma/\sqrt{n}} < 1.96.$$
Certainly all such $(\mu, \sigma^2)$ pairs are consistent with the data as summarized by $\bar{y}$.
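   A small R sketch of this inversion idea; the summary statistics ybar and n below are made up for illustration:

    # Joint 95% consistency region for (mu, sigma^2) by inverting the criterion above
    ybar <- 69.3; n <- 10                  # hypothetical summary statistics
    grid <- expand.grid(mu = seq(65, 75, 0.05), sig2 = seq(0.5, 25, 0.05))
    keep <- with(grid, abs(ybar - mu)/sqrt(sig2/n) < 1.96)
    range(grid$mu[keep])                   # mu values appearing in the region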
Chapter 3
Hypothesis Tests
   The first of these is the same distribution that we used to illustrate significance
testing in Section 2.1. But now we no longer consider the question of whether
the data seem consistent with the simple null hypothesis H0 : θ = 0. Now we ask
whether the observed data are more consistent with H0 : θ = 0 or with the alter-
native hypothesis HA : θ = 2.
    These hypotheses are simple in the sense that the distributions involved are com-
pletely specified. An hypothesis that data come from a family of two or more distri-
butions is called a composite hypothesis.
    This is a decision problem. We have two possible distributions and we are decid-
ing between them. The reformulation of significance testing into a decision prob-
lem is a primary reason that Fisher objected to Neyman-Pearson testing, see Fisher
(1956, Chp. 4).
    Before examining formal testing procedures, look at the distributions. Intuitively,
if we see r = 4 we are inclined to believe θ = 2, if we see r = 1 we are quite inclined
to believe that θ = 0, and if we see either a 2 or a 3, it is still 5 times more likely
that the data came from θ = 0.
    While significance testing does not use an explicit alternative, there is nothing to
stop us from doing two significance tests: a test of H0 : θ = 0 and then another test
of H0 : θ = 2. The significance tests both give perfectly reasonable results. The test
for H0 : θ = 0 has small P values for any of r = 2, 3, 4. These are all strange values
when θ = 0. The test for H0 : θ = 2 has small P values when r = 2, 3.
    When r = 4, we do not reject θ = 2; when r = 1, we do not reject θ = 0; when
r = 2, 3, we reject both θ = 0 and θ = 2. The significance tests are not being forced
to choose between the two distributions. Seeing either a 2 or a 3 is weird under both
distributions. Hypothesis testing decides between the available choices, it does not
allow one to reject both choices.
Neyman-Pearson testing involves the concepts of Type I and Type II error. Type I
error is rejecting the null hypothesis when it is true and Type II error is not rejecting
the null hypothesis when it is false. This very statement is a legacy of significance
testing in that it focuses on the null hypothesis. In this problem it should be equiva-
lent to describe Type I error as accepting the alternative hypothesis when it is false
and Type II error as not accepting the alternative hypothesis when it is true. (In sig-
nificance testing you may reject a null hypothesis (model) but you never accept it.)
(On the other hand, the α = 0.02 N-P test coincides with the significance test. Both
reject when observing any of r = 2, 3, 4.) The power of the α = 0.01 N-P test is 0.9
whereas the power of the significance α = 0.01 test is only 0.001 + 0.001 = 0.002.
Clearly the significance test is not a good way to decide between these alternatives.
But then the significance test was not designed to decide between two alternatives.
It was designed to see whether the null model seemed reasonable and, on its own
terms, it works well. Although the meaning of α differs between significance and
N-P tests, we have chosen two examples, α = 0.01 and α = 0.02, in which the
significance test rejection region also happens to define an N-P test with the same
numerical value of α. Such a comparison would not be appropriate if we had exam-
ined, say, α = 0.0125 significance and N-P tests because significance tests do not
admit randomized decision rules.
   In particular, the motivation for insisting on small α levels seems to be based
entirely on the philosophical idea of proof by contradiction. In a significance test,
using a large α level eliminates the suggestion that the data are unusual and thus
tend to contradict H0 . However, N-P testing cannot appeal to the idea of proof by
contradiction. Later we will examine situations in which most powerful N-P tests
reject for those data values that are most consistent with the null hypothesis. In
particular, such examples make it clear that significance test P values can have no
role in N-P testing! See also Hubbard and Bayarri (2003) and discussion.
   It seems that once you base the test on wanting a large probability of rejecting
when the alternative hypothesis is true (high power), you have put yourself in the
business of deciding between the two hypotheses. Even on this basis, the N-P test
does not always perform very well. The rejection region for the α = 0.02 optimal N-
P test of H0 : θ = 0 versus HA : θ = 2 includes r = 2, 3, even though 2 and 3 are five
times more likely under the null hypothesis than under the alternative. Admittedly,
2 and 3 are weird things to see under either hypothesis, but when deciding between
these specific alternatives, rejecting θ = 0 (accepting θ = 2) for r = 2 or 3 does not
seem reasonable. The Bayesian approach to testing, discussed in the next subsection,
seems to handle this decision problem well.
   Instead of arbitrarily deciding on a small value for α, good N-P testing needs to
play off the relative probabilities of Type I and Type II error. If a small α causes too
large a β (i.e., probability of Type II error), the N-P tester should pick a bigger α
(which will make β smaller), even to the point where α may no longer be “small.”
Somewhat ironically, in our little example, picking a smaller α, 0.01 instead of 0.02,
increases β to 0.1 from 0.098, but the change in β is much smaller than the change
in α, so the smaller α may be preferred. The point is that good N-P testing requires
consideration of both α and β (or α and the power), yet traditional N-P testing tends
just to pick a small α and try to do the best with it.
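For readers who like to verify such computations, here is a minimal Python sketch of the α and β calculations above; the probabilities are those of the example, and everything else is just bookkeeping.

    # Sketch: alpha, power, and beta for the two rejection regions above.
    f0 = {1: 0.980, 2: 0.005, 3: 0.005, 4: 0.010}   # H0: theta = 0
    f2 = {1: 0.098, 2: 0.001, 3: 0.001, 4: 0.900}   # HA: theta = 2

    for region in ({4}, {2, 3, 4}):
        alpha = sum(f0[r] for r in region)
        power = sum(f2[r] for r in region)
        beta = 1 - power                  # Pr[do not reject | theta = 2]
        print(sorted(region), alpha, power, round(beta, 3))
    # {4}: alpha = 0.01, beta = 0.100; {2, 3, 4}: alpha = 0.02, beta = 0.098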
The Bayesian approach computes the posterior probabilities

    p(θ|r) = f(r|θ)p(θ) / [f(r|0)p(0) + f(r|2)p(2)],        θ = 0, 2.
Decisions are based on these posterior probabilities. Other things being equal,
whichever value of θ has the larger posterior probability is the value of θ that we
will accept. If both posterior probabilities are near 0.5, we might admit that we do
not know which is right.
   In practice, posterior probabilities are computed only for the value of r that was
actually observed, but Table 2 gives posterior probabilities for all values of r and two
sets of prior probabilities: (a) one in which each value of θ has the same probability,
1/2, and (b) one set in which θ = 2 is five times more probable than θ = 0.
            Prior                 r        1       2       3       4
                               f(r|0)    0.980   0.005   0.005   0.010
                               f(r|2)    0.098   0.001   0.001   0.900
        pa(0) = 1/2           pa(0|r)    0.91    0.83    0.83    0.01
        pa(2) = 1/2           pa(2|r)    0.09    0.17    0.17    0.99
        pb(0) = 1/6           pb(0|r)    0.67    0.50    0.50    0.002
        pb(2) = 5/6           pb(2|r)    0.33    0.50    0.50    0.998

Table 3.1 Posterior probabilities of θ = 0, 2 for two prior distributions a and b.
In particular, pb(0|2) = pb(2|2) = 0.50 and pb(0|3) = pb(2|3) = 0.50. Given the prior, the Bayesian procedure is always
reasonable.
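The posterior computation is nothing more than Bayes' Theorem on a two-point parameter space. A minimal Python sketch, using the probabilities of Table 3.1:

    # Sketch: reproduce the posterior probabilities of Table 3.1.
    f = {0: {1: 0.980, 2: 0.005, 3: 0.005, 4: 0.010},
         2: {1: 0.098, 2: 0.001, 3: 0.001, 4: 0.900}}

    def posterior(prior, r):
        """Pr(theta | r) when theta takes only the values 0 and 2."""
        joint = {t: f[t][r] * prior[t] for t in prior}
        total = sum(joint.values())
        return {t: joint[t] / total for t in joint}

    for prior in ({0: 1/2, 2: 1/2}, {0: 1/6, 2: 5/6}):
        for r in (1, 2, 3, 4):
            post = posterior(prior, r)
            print(r, round(post[0], 3), round(post[2], 3))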
   The Bayesian analysis gives no special role to the null hypothesis. It treats the two
hypotheses on an equal footing. That N-P theory treats the hypotheses in different
ways is something that many Bayesians find disturbing.
   As discussed in Chapter 5, if actions have losses or utilities associated with them,
the Bayesian can base a decision on maximizing expected posterior utility or mini-
mizing expected posterior loss. Berry (2004) discussed the practical importance of
developing approximate utilities for designing clinical trials.
   The absence of a clear source for the prior probabilities seems to be the primary
objection to the Bayesian procedure. Typically, if we have enough data, the prior
probabilities are not going to matter because the posterior probabilities will be sub-
stantially the same for different priors. If we do not have enough data, the posteriors
will not agree but why should we expect them to? The best we can ever hope to
achieve is that reasonable people (with reasonable priors) will arrive at a consensus
when enough data are collected. In the example, seeing one observation of r = 1 or
4 is already enough data to cause substantial consensus. One observation that turns
out to be a 2 or a 3 leaves us wanting more data.
The best thing that can happen in N-P testing of a composite alternative is to have
a uniformly most powerful test. With HA : θ > 0, let θ ∗ be a particular value that is
greater than 0. Test the simple null H0 : θ = 0 against the simple alternative HA : θ =
θ ∗ . If, for a given α, the most powerful test has the same rejection region regardless
of the value of θ ∗ , then that test is the uniformly most powerful (UMP) test. It is a
simple matter to see that the α = 0.01 N-P most powerful test of H0 : θ = 0 versus
HA : θ = 1 rejects when r = 4. We have already seen that that is also true when the
alternative is HA : θ = 2. Since the most powerful tests of the alternatives HA : θ = 1
and HA : θ = 2 are identical, and these are the only permissible values of θ > 0, this
is the uniformly most powerful α = 0.01 test. The test makes a “bad” decision when
r = 2, 3 because with θ = 1 as a consideration, you would intuitively like to reject
the null hypothesis.
     The α = 0.02 uniformly most powerful test rejects for r = 2, 3, 4, which is in line
with our intuitive evaluation, but recall from the previous section that this is the test
that (intuitively) should not have rejected for r = 2, 3 when testing only HA : θ = 2.
     Theoretically, the key thing to note is that as r varies, the relative order of
 f (r|1)/ f (r|0) is identical to the relative order of f (r|2)/ f (r|0). The largest value of
 f (r|1)/ f (r|0) and f (r|2)/ f (r|0) occurs at r = 4. The smallest value of f (r|1)/ f (r|0)
and f (r|2)/ f (r|0) occurs at r = 1. For r = 2, 3 the values of f (r|1)/ f (r|0) are
the same and between the other two values. The same holds for f (r|2)/ f (r|0).
This common ordering means that, regardless of size, any most powerful test for
H0 : θ = 0 versus HA : θ = 1 will also be a most powerful test for H0 : θ = 0 versus
HA : θ = 2. Recall that the size of the test depends on the rejection region but not
on the alternative hypothesis. The most powerful test of size α = 0.01 rejects when
r = 4 for both alternatives. The most powerful test of size α = 0.02 rejects when
r = 2, 3, 4 for both alternatives. One most powerful test of size α = 0.0125 for both
alternatives rejects when r = 4 and rejects for r = 2 if a coin flip comes up heads.
     This same idea works very generally. Suppose we are testing H0 : θ = θ0 versus
HA : θ ∈ ΘA . If, as r varies, the relative order of f (r|θ )/ f (r|θ0 ) remains the same for
any θ ∈ ΘA , the most powerful test will remain the same regardless of which θ ∈ ΘA
we are considering, so we can find a uniformly most powerful test. In particular,
it is fairly obvious that if, for any θ ∈ ΘA, the function f(r|θ)/f(r|θ0) is either
always increasing or always decreasing in r, the relative order of the likelihood
ratios will remain the same regardless of the choice of θ ∈ ΘA . This property is
called having monotone likelihood ratio and it is sufficient (but not necessary) for
the existence of uniformly most powerful tests. In fact, our little example displays
monotone likelihood ratio.
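The common ordering is easy to check numerically. A small Python sketch, using the example's distributions (given earlier in the chapter):

    # Sketch: the likelihood ratios order the sample points identically.
    f0 = {1: 0.980, 2: 0.005, 3: 0.005, 4: 0.010}
    f1 = {1: 0.100, 2: 0.200, 3: 0.200, 4: 0.500}
    f2 = {1: 0.098, 2: 0.001, 3: 0.001, 4: 0.900}

    def ratio_order(num, den):
        """Sample points sorted by their likelihood ratio num/den."""
        return sorted(num, key=lambda r: num[r] / den[r])

    print(ratio_order(f1, f0))  # [1, 2, 3, 4], with a tie at r = 2, 3
    print(ratio_order(f2, f0))  # the same ordering, so the MP tests coincide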
     If the likelihood ratios are always increasing, UMP tests can be chosen that reject
for large values of r (possibly with random rejection for the smallest value of r in the
rejection region), and if they are decreasing, UMP tests can reject for small values
of r. For example, in our simple versus composite example the likelihood ratios are
not strictly increasing, so both (a) reject whenever r = 4 and, if a coin flip comes up
heads, reject when r = 2, and (b) reject whenever r = 4 and, if a coin flip comes up
heads, reject when r = 3, are uniformly most powerful tests of size
α = 0.0125. But the second test is a UMP test that rejects for large values of r.
For a Bayesian, the simple versus composite problem again reduces to posterior
probabilities. Say p(0) = 1/2 and p(1) = p(2) = 1/4, so that f(r|θ > 0) ≡ [f(r|1) + f(r|2)]/2
and

    p(θ|r) = f(r|θ)p(θ) / [f(r|0)p(0) + f(r|1)p(1) + f(r|2)p(2)].

            Prior                       r        1        2        3        4
                                     f(r|0)    0.980    0.005    0.005    0.010
                                f(r|θ > 0)     0.099   0.1005   0.1005    0.700
        Pr(θ = 0) = 1/2        Pr(θ = 0|r)     0.908    0.047    0.047    0.014
        Pr(θ > 0) = 1/2        Pr(θ > 0|r)     0.091    0.953    0.953    0.986

Table 3.2 Recasting a simple versus composite Bayes test as simple versus simple.
Now we add a fourth distribution to our consideration, f (r| − 1), and test the com-
posite null H0 : θ ≤ 0 versus the composite alternative HA : θ > 0 or, more specif-
ically, H0 : θ ∈ {−1, 0} ≡ Θ0 versus HA : θ ∈ {1, 2} ≡ ΘA . The distributions and
likelihood ratios are given below.
                        r        1        2        3        4
                  f(r|−1)     0.9803   0.0049   0.0049   0.0099
                   f(r|0)     0.980    0.005    0.005    0.010
                   f(r|1)     0.100    0.200    0.200    0.500
                   f(r|2)     0.098    0.001    0.001    0.900
           f(r|1)/f(r|−1)     0.102    40.82    40.82    50.51
           f(r|2)/f(r|−1)     0.109    0.204    0.204    90.9
            f(r|1)/f(r|0)     10/98      40       40       50
            f(r|2)/f(r|0)      0.1      0.2      0.2       90
It is hard to make comparisons with significance testing because the composite hy-
potheses do not provide us with a model that determines a unique distribution for the
data. One could make four different significance tests. For this example, f (r| − 1)
was chosen to be a minor modification of f (r|0).
As always, N-P theory (quite properly) focuses on likelihood ratios. (Problems arise
with what N-P theory does with these ratios.) The example was chosen so that, for
every θ0 ∈ Θ0 and every θ1 ∈ ΘA, as r varies we have an identical ordering of the
values of f(r|θ1)/f(r|θ0). As we have seen earlier, this means we can find UMP
tests. In particular, regardless of hypotheses, if θ∗ < 0.5 < θ#, our likelihood ratios
f(r|θ#)/f(r|θ∗) are all monotone increasing, so UMP tests can be formed by rejecting
for large values of r (perhaps with some randomization on the smallest r value).
The trick is to find/define the size of these tests.
For a composite null H0 : θ ∈ Θ0, the size of a test is defined to be the largest
probability of the rejection region among all θ ∈ Θ0. Thus, if we reject when r = 4,
the size for θ = 0 is 0.01 and the size for θ = −1 is 0.0099, so the overall size is the
maximum value, 0.01. For the rejection region r = 2, 3, 4, the size for θ = 0 is 0.02
and the size for θ = −1 is 0.0049 + 0.0049 + 0.0099 = 0.0197, so the overall size is the
maximum value, 0.02.
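In code, the size computation is just a maximum over the null distributions. A Python sketch with the example's numbers:

    # Sketch: size of a rejection region for the composite null {-1, 0}.
    fm1 = {1: 0.9803, 2: 0.0049, 3: 0.0049, 4: 0.0099}  # theta = -1
    f0  = {1: 0.980,  2: 0.005,  3: 0.005,  4: 0.010}   # theta = 0

    def size(region):
        """sup over the null of the probability of the rejection region."""
        return max(sum(f[r] for r in region) for f in (fm1, f0))

    print(size({4}))         # 0.01
    print(size({2, 3, 4}))   # 0.02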
For the Bayesian, the posterior probabilities become

    p(θ|r) = f(r|θ)p(θ) / [f(r|−1)p(−1) + f(r|0)p(0) + f(r|1)p(1) + f(r|2)p(2)].

With f(r|−1) so similar to f(r|0) and equal probabilities p(−1) = p(0) = 1/4, the
values of Pr(Θ0|r) from the last example have essentially been retained in this example,
but now p(−1|r) ≐ p(0|r) ≐ Pr(Θ0|r)/2. (I didn't actually do the arithmetic
for the table but this “has to be” true.)
   As with the simple versus composite case, the composite versus composite
Bayesian test can be recast as a simple versus simple test. As before, one finds the
average data distribution under the alternative but now, rather than having a single
data distribution under the null, the average alternative is tested against the average
data distribution under the null. With the prior we chose, each of these is just a simple
average of the probabilities. Table 3.3 illustrates the computations. The posterior
probabilities in the table round off to the same values as in Table 3.2, even though
they are slightly different, because I chose an f(r|−1) extremely similar to f(r|0).

            Prior                       r         1         2         3         4
                                f(r|θ ≤ 0)     0.98015   0.00495   0.00495   0.00995
                                f(r|θ > 0)      0.099    0.1005    0.1005     0.700
        Pr(θ ≤ 0) = 1/2       Pr(θ ≤ 0|r)       0.908     0.047     0.047     0.014
        Pr(θ > 0) = 1/2       Pr(θ > 0|r)       0.092     0.953     0.953     0.986

Table 3.3 Recasting a composite versus composite Bayes test as simple versus simple.

   To handle more general testing situations, N-P theory has developed a variety of
concepts such as unbiased tests, invariant tests, and α similar tests, see Chapter 7 or
Lehmann (1997). For example, the one and two sample t tests are not uniformly
most powerful tests but are uniformly most powerful unbiased tests. Similarly, the
standard F test in regression and analysis of variance is a uniformly most powerful
invariant test.
   Similar to significance testing, the N-P approach to finding confidence regions is
also to find parameter values that would not be rejected by an α level test. However,
just as N-P theory interprets the size α of a test as the long run frequency of rejecting
a correct null hypothesis, N-P theory interprets the confidence 1 − α as the long run
probability of these regions including the true parameter. The rub is that you only
have one of the regions, not a long run of them, and you are trying to say something
about this parameter based on these data. In practice, the long run frequency of
α somehow gets turned into something called “confidence” that this parameter is
within this particular region.
   While I admit that the term “confidence,” as commonly used, feels good, I have
no idea what “confidence” really means as applied to the region at hand. Hubbard
and Bayarri (2003) make a case, implicitly, that an N-P concept of confidence would
have no meaning as applied to the region at hand, that it only applies to a long run
of similar intervals. Students, almost invariably, interpret confidence as posterior
probability. For example, if we were to flip a coin many times, about half of the
time we would get heads. If I flip a coin and look at it but do not tell you the result,
you may feel comfortable saying that the chance of heads is still 0.5 even though
I know whether it is heads or tails. Somehow the probability of what is going to
happen in the future is turning into confidence about what has already happened
but is unobserved. Since I do not understand how this transition from probability to
confidence is made (unless one is a Bayesian in which case confidence actually is
probability), I do not understand “confidence.”
Bayesian tests can go seriously wrong only if you pick inappropriate prior distri-
butions. This is the case in Lindley’s famous paradox in which, for a seemingly
simple and reasonable testing situation involving normal data, the null hypothesis is
accepted no matter how weird the observed data are relative to the null hypothesis.
The datum is X|µ ∼ N(µ, 1). The test is H0 : µ = 0 versus HA : µ > 0. The priors on
the hypotheses do not really matter, but take Pr[µ = 0] = 0.5 and Pr[µ > 0] = 0.5. In
an attempt to use a noninformative prior, take the density of µ given µ > 0 to be flat
on the half line. (This is an improper prior but similar proper priors lead to similar
results.) The Bayesian test compares the density of the data X under H0 : µ = 0 to
the average density of the data under HA : µ > 0. (The latter involves integrating the
density of X|µ times the density of µ given µ > 0.) The average density under the
alternative makes any X you could possibly see infinitely more probable to have
come from the null distribution than from the alternative. Thus, anything you could
possibly see will cause you to accept µ = 0. Attempting to have a noninformative
prior on the half line leads one to a nonsensical prior that effectively puts all the
probability on unreasonably large values of µ so that, by comparison, µ = 0 always
looks more reasonable.
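To see the phenomenon numerically, replace the improper flat prior by a proper U(0, M) prior and let M grow. The marginal density of X under the alternative is then [Φ(x) − Φ(x − M)]/M, which goes to 0 for every fixed x. A Python sketch (the choice x = 3, a very weird observation under the null, is just for illustration):

    # Sketch: Lindley's phenomenon with a U(0, M) prior under H_A.
    from math import erf, exp, pi, sqrt

    def phi(x):   # N(0, 1) density
        return exp(-x * x / 2) / sqrt(2 * pi)

    def Phi(x):   # N(0, 1) cdf
        return 0.5 * (1 + erf(x / sqrt(2)))

    x = 3.0       # very weird under H0: mu = 0
    for M in (1, 10, 100, 1000, 100000):
        m_alt = (Phi(x) - Phi(x - M)) / M      # marginal density under H_A
        post_null = phi(x) / (phi(x) + m_alt)  # Pr[mu = 0 | x], prior odds 1
        print(M, round(post_null, 4))
    # As M grows, the posterior probability of mu = 0 goes to 1.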
The definition of a P value for a significance test is straightforward: it is the probability
of seeing something as weird or weirder than you actually saw. The probability
is computed under the only distribution you have, and the density of that
distribution is used to define how weird an observation is. For an hypothesis test,
you have at least two distributions and one again needs to define weird.
    For a simple versus simple hypothesis test the standard definition of a P value is
to compute the probability under the null distribution and to define weird in a relative
sense as having a large value of the likelihood ratio (alternative density divided by
null). This idea will also work for simple versus composite hypotheses for which
UMP tests exist. However, a key feature is that weird observations are not weird
in any absolute sense; they are only weird relative to the alternative hypothesis.
As such, and as we have seen, a small P value in this sense does not necessarily
either contradict the null hypothesis, cf. Example 4.0.1, or suggest that the
alternative is more likely to be true, cf. Section 3.1 with α = .02.
    For more complicated problems, the idea of an hypothesis test P value is to find
the α level at which the test would just barely reject. But that requires one to have a
collection of tests indexed by α for which the rejection regions get smaller
as α decreases in a continuous way, so that the data always fall into some rejection
region and a smallest rejection region exists.
    Lehmann and Romano's (2005) most general definition of a P value is that if you
have a collection of tests φα indexed by their size and if those tests have the property
that φα1(x) ≤ φα2(x) for any x and α1 ≤ α2, then the P value for this collection of
tests is defined to be P ≡ inf{α | φα(X) = 1}. This makes P the smallest level of
significance for which the null is rejected with probability 1. (A collection of tests
that ignores the data and, for each α, independently flips an α coin has P = 0.)
    For nonrandomized tests with nested rejection regions, Rα1 ⊂ Rα2 whenever
α1 ≤ α2, they also define P ≡ inf{α | X ∈ Rα}.
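In the discrete example of Section 3.1 this last definition is easy to apply. A Python sketch with a (hypothetical) nested family of nonrandomized regions; the α = 1 region containing everything is a limiting convention:

    # Sketch: P = inf{alpha : X in R_alpha} for nested rejection regions.
    regions = [(0.01, {4}), (0.02, {2, 3, 4}), (1.00, {1, 2, 3, 4})]

    def p_value(x):
        return min(alpha for alpha, R in regions if x in R)

    print([p_value(r) for r in (1, 2, 3, 4)])  # [1.0, 0.02, 0.02, 0.01]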
3.7 Permutation Tests
Are permutation tests hypothesis tests rather than significance tests because they
require an alternative? Probably an alternative based on stochastic inequality. The
issue is that all outcomes would have equal probabilities, so you need some outside
definition of what constitutes “weird.”
Chapter 4
Comparing Testing Procedures
Significance tests and hypothesis tests are very different procedures. The clearest
example of this, that I know of, is the following.
E XAMPLE 4.0.1. Consider a significance test of the null model y ∼ N(0, 1) based
on one observation. The density decreases as one gets further from the mean 0, so
large |y| values constitute evidence against the null model. In particular, an α = 0.05
significance test rejects for |y| ≥ 1.96.
   Now consider testing H0 : y ∼ N(0, 1) versus HA : y ∼ N(0, σ²), σ² < 1. Figure
4.1 illustrates the densities f(y|σ²), plotting N(0, 1) against N(0, 0.6). With σ² < 1
the density of a N(0, σ²) is
higher near 0 than the density of the N(0, 1) which means that the N-P hypothesis
test for an α = 0.05 test will reject for values close to 0, in particular it will reject for
|y| ≤ 0.063. Could two tests for the same null model be any more different? Seeing
a small value of |y| provides no evidence against the model N(0, 1) in any absolute
sense; such values are merely more consistent with the alternative than they are with
the null.
    More technically, the density (likelihood) ratio is

    f(y|σ²)/f(y|1) = [(1/√(2π) σ) e^(−y²/2σ²)] / [(1/√(2π)) e^(−y²/2)]
                   = (1/σ) exp[−(y²/2)(1/σ² − 1)],
which, for any value of σ 2 < 1, is maximized at 0 and decreases as |y| gets further
from 0. So according to the N-P lemma, the most powerful test for any particular
σ² < 1 rejects for the smallest values of |y|. This holds regardless of the particular
value of σ 2 , hence the test is uniformly most powerful. In Chapter 7 it is an exercise
to show that this is a UMP test.
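The cutoff 0.063 comes from solving Pr[|y| ≤ c] = 0.05 under the null. A dependency-free Python sketch using bisection:

    # Sketch: solve 2*Phi(c) - 1 = 0.05 for c; the test rejects |y| <= c.
    from math import erf, sqrt

    def Phi(x):
        return 0.5 * (1 + erf(x / sqrt(2)))

    lo, hi = 0.0, 1.0
    for _ in range(60):
        c = (lo + hi) / 2
        if 2 * Phi(c) - 1 < 0.05:
            lo = c
        else:
            hi = c
    print(round(c, 3))  # about 0.063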
   The same moral can be learned from discrete distributions. Our simple versus
composite hypothesis example of the previous chapter included the distributions
                        r        1       2       3      4
                   f(r|1)     0.100   0.200   0.200   0.500
                   f(r|2)     0.098   0.001   0.001   0.900
            f(r|2)/f(r|1)     0.98    0.005   0.005    1.8
Now consider an N-P test of H0 : θ = 1 versus HA : θ = 2. The N-P Lemma indicates
that we should most readily reject for the data point with the highest likelihood
ratio, r = 4, which is precisely the data point that is most consistent with the null
hypothesis. (For any N-P test with α < 0.5, a randomized test is needed.)
4.1 Discussion
The basic elements of a significance test are: (1) There is a probability model for
the data. (2) Multidimensional data are summarized into a test statistic that has a
known distribution. (3) This known distribution provides a ranking of the “weird-
ness” of various observations. (4) The P value, which is the probability of observing
something as weird or weirder than was actually observed, is used to quantify the
evidence against the null hypothesis. (5) α level tests are defined by reference to the
P value.
   The basic elements of an N-P test are: (1) There are two (sets of) hypothesized
models for the data: H0 and HA . (2) An α level is chosen which is to be the (maxi-
mum) probability of rejecting H0 when H0 is true. (3) A rejection region is chosen
so that the probability of data falling into the rejection region is (at most) α when
H0 is true. With discrete data, this often requires the specification of a randomized
rejection region in which certain data values are randomly assigned to be in or out of
the rejection region. (4) Various tests are evaluated based on their power properties.
Ideally, one wants the most powerful test. (5) In complicated problems, properties
such as unbiasedness or invariance are used to restrict the class of tests prior to
choosing a test with good power properties.
    Significance testing seems to be a reasonable approach to model validation. In
fact, Box (1980) suggested significance tests, based on the marginal distribution of
the data, as a method for validating Bayesian models. Significance testing is philo-
sophically based on the idea of proof by contradiction in which the contradiction is
not absolute.
    Bayesian testing seems to be a reasonable approach to making a decision between
alternative hypotheses. The results are influenced by the prior distributions, but one
can examine a variety of prior distributions.
    Neyman-Pearson testing seems to be neither fish nor fowl. It seems to mimic sig-
nificance testing with its emphasis on the null hypothesis and small α levels, but it
also employs an alternative hypothesis, so it is not based on proof by contradiction
as is significance testing. Because N-P testing focuses on small α levels, it often
leads to bad decisions between the two alternative hypotheses. Certainly, for simple
versus simple hypotheses, any problems with N-P testing vanish if one is not philo-
sophically tied down to small α values. For example, any reasonable test (as judged
by frequentist criteria) must be within both the collection of all most powerful tests
and the collection of all Bayesian tests, see Ferguson (1967, p. 204).
    Although most problems with testing seem to stem from choosing too small an
α at the expense of creating very large probabilities of type II error (β ), we have
seen an example where a decrease in α was appropriate because it barely increased
β.
    There is also the issue of whether α is merely a measure of how weird the data
are, or whether it should be interpreted as the probability of making the wrong
decision about the null. If α is the probability of making an incorrect decision about
the null, then performing multiple tests to evaluate a composite null causes problems
because it changes the overall probability of making the wrong decision. If α is
merely a measure of how weird the data are, it is less clear that multiple testing
inherently causes any problem. In particular, Fisher (1935, Chp. 24) did not worry
about the experimentwise error rate when making multiple comparisons using his
“least significant difference” method in analysis of variance. He did, however, worry
about drawing inappropriate conclusions by using an invalid null distribution for
tests determined by examining the data.
    In significance testing the P value is well defined and an α level test is defined in
terms of the P value. In hypothesis testing, an α level test is well defined and some
people want to define P values for hypothesis tests. We have seen that significance
tests and hypothesis tests are fundamentally different creatures, so any hypothesis
testing P value needs to have a different definition than a significance testing P
value. Indeed, for all the restrictions one may need to place on N-P tests (composite
α value, unbiasedness, invariance), ultimately N-P tests are trying to reject for val-
ues with large likelihood ratios. So a reasonable definition of an hypothesis testing P
value will have to measure the weirdness of data by how much the likelihood ratio
favors the alternative over the null. But this will always lead to a choice between
hypotheses and not a contradiction to the null model.
   In the late 20-teens, it became fashionable to criticize NHST (Null Hypothe-
sis Significance Testing). Unfortunately, NHST is something of a straw man. Over
the years many statisticians have conflated significance and hypothesis testing into
NHST, which blurs the very important distinctions between the two methodologies.
(Even the name NHST conflates the methodologies because a significance test is for
a given model not a null hypothesis.) [Our use of the term “null model” is a capit-
ulation to the prevalence of NHST.] The problem is exacerbated by the fact that the
most commonly taught tests happen to be instances wherein the differences between
significance and hypothesis testing are easily glossed over.
A test of significance examines not only the event that occurred, but discrepant
events that did not occur. Jeffreys (1961) criticized the procedure, saying,
     “What the use of the P value implies, therefore, is that a hypothesis that may be
     true may be rejected because it has not predicted observable results that have not
     occurred.”
While that sounds good, its meaning is not clear. In the context of testing a null
probability model, it seems hardly to apply. If a probability model were to predict
different observable results, it would be a different probability model, and thus it
cannot be true. From the point of view of testing a parameter, the quotation makes
more sense. Suppose we observe X = 2 with E(X) = θ and we are testing H0 :
θ = 0. Whether we reject the hypothesis depends on assumptions we make about
unobserved quantities. If we assume X ∼ N(θ , 1), values of X greater than 2 units
from 0 are not predicted to occur often, so we could reject the hypothesis with
P = .045. If X − θ ∼ t(2), P = .18 which is unlikely to make us reject the null.
While this example fits the context of Jeffreys’s statement, it does not seem a very
damning criticism of significance tests.
   Jeffreys’s criticism is far more meaningful when applied to tests that involve the
specification of an alternative hypothesis. In those cases it is appropriate to base
conclusions on which alternative is more likely to have generated the observed data.
In such a case, it seems ludicrous to incorporate into any conclusion the relative
likelihoods of data that were not observed.
   Discuss the stopping rule principle. Binomial versus negative binomial P values.
Savage's quote (mentioned in Barnard conversation(?) and probably Rereading
Fisher). Jessica's ESP experiment.
Chapter 5
Decision Theory
Decision theory is a very general theory that allows one to examine Bayesian es-
timation and hypothesis testing as well as Neyman-Pearson hypothesis testing and
many aspects of frequentist estimation. I am not aware that it has anything to say
about Fisherian significance testing.
   In decision theory we start with states of nature θ ∈ Θ , potential actions a ∈
A , and a loss function L(θ, a) that takes real values. We are interested in taking
actions that will reduce our losses. Some formulations of decision theory incorporate
a utility function U(θ , a) and seek actions that increase utility. The formulations are
interchangeable by simply taking U(θ , a) = −L(θ , a).
   Eventually, we will want to incorporate data in the form of a random vector X
taking values in X and having density f (x|θ ). The distribution of X|θ is called the
sampling distribution.
   We will focus on three special cases.
   Estimation of a scalar state of nature involves scalar actions with Θ = A = R.
Three commonly used loss functions are
• Squared error, L(θ , a) = (θ − a)2 ;
• Weighted squared error, L(θ , a) = w(θ )(θ − a)2 , wherein w(θ ) is a known
  weighting function taking positive values;
• Absolute error, L(θ , a) = |θ − a|.
   Estimation of a vector involves Θ = A = Rd . Three commonly used loss func-
tions are
• L(θ , a) = (θ − a)′ (θ − a) ≡ ∥θ − a∥2 ;
• L(θ , a) = w(θ )∥θ − a∥2 , with known w(θ ) > 0;
• L(θ , a) = ∑dj=1 |θ j − a j |.
   Hypothesis testing involves two hypotheses, say Θ = {θ0 , θ1 }, and two corre-
sponding actions A = {a0 , a1 }. What is key in this problem is that there are only
two states of nature in Θ that we can think of as the null and alternative hypotheses,
respectively, and two corresponding actions in A that we can think of as accepting
the null (rejecting the alternative) and accepting the alternative (rejecting the null).
The standard loss function is
                                    L(θ , a) a0 a1
                                      θ0     0 1
                                      θ1     1 0
A more general loss function is
                                    L(θ , a) a0 a1
                                      θ0 c00 c01
                                      θ1 c10 c11
wherein, presumably, c00 ≤ c01 and c10 ≥ c11 .
  More generally in hypothesis testing we can partition a more general Θ into
Θ0 (the null hypothesis) and Θ1 (the alternative hypothesis) with only two actions
A = {a0 , a1 } and the standard loss function becomes
                                    L(θ , a) a0 a1
                                    θ ∈ Θ0 0 1
                                    θ ∈ Θ1 1 0.
Again, a1 is taken to mean rejecting the null hypothesis and a0 is taken to mean
accepting the null hypothesis. To reject is to ‘not accept’ and to ‘not accept’ is to
reject. (Recall that in Significance Testing, not rejecting the null is different from
accepting it and there is no formal alternative to accept.) The use of a1 when θ ∈ Θ0 ,
i.e., rejecting the null hypothesis when it is true, is called a Type I error. Using a0
when θ ∈ Θ1 , i.e., not rejecting/accepting the null hypothesis when it is false, is
called a Type II error.
If θ is random, i.e., if θ has a prior distribution, then the optimal action is defined
to be the action that minimizes the expected loss, E[L(θ, a)] ≡ Eθ [L(θ, a)].

Proposition 5.1.1.     For Θ = A = R and L(θ, a) = (θ − a)², if θ is random, the
optimal action is â = E(θ).

P ROOF : For any action a,

    E[L(θ, a)] = E(θ − a)²
               = E[(θ − â) + (â − a)]²
               = E(θ − â)² + (â − a)² + 2(â − a)E(θ − â)
               = E(θ − â)² + (â − a)²,

which is minimized by taking a = â = E(θ). The third equality holds because
(â − a)² is a constant and the fourth holds because E[θ − E(θ)] = 0.              2
Proposition 5.1.2.     For Θ = A = R and L(θ , a) = w(θ )(θ − a)2 with w(θ ) > 0,
if θ is random, the optimal action is â = E[w(θ )θ ]/E[w(θ )].
where the first inequality holds because over the range of integration m + a − 2θ ≥
m + a − 2a and the second inequality holds because, by definition, pm ≥ 0.5 and, by
assumption, a ≥ m.
   The proof for a < m is similar.                                                2
Proposition 5.1.4.      For Θ = {θ0, θ1}, A = {a0, a1}, and the standard loss function,
the optimal action is

    â = a0 if Pr(θ = θ0) > 0.5,        â = a1 if Pr(θ = θ0) < 0.5.

P ROOF : The expected losses are E[L(θ, a0)] = Pr(θ = θ1) and E[L(θ, a1)] =
Pr(θ = θ0). If Pr(θ = θ1) < Pr(θ = θ0) the optimal action is a0 and if Pr(θ = θ1) >
Pr(θ = θ0) the optimal action is a1. However, Pr(θ = θ0) + Pr(θ = θ1) = 1, so
Pr(θ = θ1) < Pr(θ = θ0) if and only if Pr(θ = θ0) > 0.5.                          2
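The proposition amounts to a single comparison. A one-line Python sketch (the labels a0, a1 are just names):

    # Sketch: optimal prior action under the standard 0-1 loss.
    def optimal_action(p0):
        """p0 = Pr(theta = theta0); expected losses are 1 - p0 and p0."""
        return 'a0' if p0 > 0.5 else 'a1'

    print(optimal_action(0.7), optimal_action(0.3))  # a0 a1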
    The final result is a generalization of Proposition 5.1.3 that establishes that quantiles/percentiles
other than the median can be optimal actions for an appropriate loss
function.

Proposition 5.1.5.     For Θ = A = R, 0 < α < 1, and the loss function
L(θ, a) = α(θ − a)I[θ > a] + (1 − α)(a − θ)I[θ ≤ a], if θ is random, an optimal
action is any α quantile of θ.

P ROOF : The proof is similar to that for Proposition 5.1.3 but more involved. I
have broken it into more manageable pieces. Of course α = 0.5 is the special case
addressed earlier. For notational simplicity and similarity with the proof for absolute
error, we denote the α quantile as m rather than the more commonly used notation
qα .
   First assume a is greater than the α quantile m of θ, so that F(m) ≤ F(a), where
F is the cdf of θ. By the definition of the α quantile, F(m) ≡ P[θ ≤ m] ≥ α
and P[θ ≥ m] ≥ 1 − α, with the inequalities used to cope with discrete distributions.
For discrete distributions we take ∫_c^d to mean ∫_(c,d], with d = ∞ irrelevant
because P[θ = ∞] = 0.
We are going to look at these three terms separately and then put them back together.
   The first term reduces to

    α ∫_a^∞ (θ − a) dP(θ)
       = α ∫_a^∞ (θ − a) dP(θ) + α ∫_a^∞ (a − m) dP(θ) + α ∫_a^∞ (m − a) dP(θ)
       = α ∫_a^∞ (θ − m) dP(θ) + α ∫_a^∞ (m − a) dP(θ)
       = α ∫_a^∞ (θ − m) dP(θ) + α[1 − F(a)](m − a)
where the last inequality holds because in the previous relation the sum of the last
two terms is nonnegative. To see this,

    (1 − 2α) ∫_m^a (θ − m) dP(θ) + (1 − α) ∫_m^a (m + a − 2θ) dP(θ)
       = ∫_m^a (θ − m + m + a − 2θ) dP(θ) − ∫_m^a (2αθ − 2αm + αm + αa − 2αθ) dP(θ)
       = ∫_m^a (a − θ) dP(θ) − α ∫_m^a (a − m) dP(θ)
       ≥ ∫_m^a (a − m) dP(θ) − α ∫_m^a (a − m) dP(θ)
       = (1 − α) ∫_m^a (a − m) dP(θ) = (1 − α)[F(a) − F(m)](a − m) ≥ 0,
which leads to redefining F as F(a) = Pr[θ < a], so that now the α quantile has
1 − F(m) ≥ 1 − α.
Showing that

    α ∫_a^m (θ − a) dP(θ) ≥ (1 − α) ∫_a^m (m − θ) dP(θ)

is left as an exercise. For the first and third terms, you also need to prove an
inequality that is equivalent to

    α[1 − F(m)] ≥ (1 − α)F(a),

which follows because α ≥ F(a) and [1 − F(m)] ≥ (1 − α).                          2
Suppose we have a data vector X with density f(u|θ). If θ is random, i.e., if θ has
a prior density p(θ), a Bayesian updates the distribution of θ using the data and
Bayes' Theorem to get the posterior density

    p(θ|X = u) = f(u|θ)p(θ) / ∫ f(u|θ)p(θ) dµ(θ).
The Bayes action is defined to be the action that minimizes the posterior expected
loss, E[L(θ, a)|X = u].
    The Bayes action is just the optimal action when the distribution on θ is the pos-
terior distribution given X. Recognizing this fact, the previous section immediately
provides four results.
Proposition 5.2.2.     For Θ = A = R, data X = u, and L(θ , a) = w(θ )(θ −a)2 with
w(θ ) > 0, if θ is random, the Bayes action is â = E[w(θ )θ |X = u]/E[w(θ )|X = u].
   A similar result also holds for quantile estimation. In Section 5.5 we will see that
prediction problems have the same structure and the same results.
δ :X →A.
   To frequentists, the risk function is the soul of decision theory. They would like
to pick a δ that minimizes R(θ , δ ) uniformly in θ . That is very hard to do.
   Uniformly minimum variance unbiased (UMVU) estimators of h(θ ) use squared
error loss, minimize R(θ , δ ) uniformly in θ , but restrict δ to rules with EX|θ [δ (X)] =
h(θ ).
   In testing problems with the standard loss function, we would love to minimize
R(θ, δ) uniformly in θ but we cannot. For θ ∈ Θ0, R(θ, δ) is the probability of Type
I error. In particular, since δ(X) only takes two values, for θ ∈ Θ0,

    R(θ, δ) = EX|θ {L[θ, δ(X)]} = PX|θ [δ(X) = a1].
For θ ∈ Θ1 , R(θ , δ ) becomes the probability of Type II error. If Θ0 and Θ1 are not
single points, both probabilities are functions of θ . For θ ∈ Θ0 , R(θ , δ ) is sometimes
called the size function of the test. (The actual size of a test is usually taken as
supθ ∈Θ0 R(θ , δ ).)
    The power of the test δ is the probability of rejecting the null hypothesis when
it is false (picking a1 when θ ∈ Θ1 ), and equals 1 − R(θ , δ ) when θ ∈ Θ1 . As a
function of θ , PX|θ [δ (X) = a1 ] gives the size function when θ ∈ Θ0 and the power
function when θ ∈ Θ1 , so it provides the size-power function. Uniformly most pow-
erful (UMP) tests minimize R(θ , δ ) uniformly for θ ∈ Θ1 , but restrict δ to rules
with R(θ , δ ) ≤ α for all θ ∈ Θ0 . Uniformly most powerful unbiased (UMPU) and
uniformly most powerful invariant (UMPI) tests place additional restrictions on the
δ rules that are considered.
    The Bayes risk is a frequentist idea of what a Bayesian should worry about. With
a prior distribution, call it p, on θ, the Bayes risk is defined as r(p, δ) ≡ Eθ[R(θ, δ)].
Frequentists think that Bayesians should be concerned about finding the Bayes decision
rule that minimizes the Bayes risk. Formally, for a prior p, the Bayes rule is
a decision function δp with r(p, δp) = infδ r(p, δ).
As discussed in the previous section, Bayesians think that they should be concerned
with finding the Bayes action given the data. Fortunately, these amount to the same
thing. To minimize the Bayes risk, note that

    r(p, δ) = E[R(θ, δ)]
            = Eθ (EX|θ {L[θ, δ(X)]})
            = EX (Eθ|X {L[θ, δ(X)]}).

This can be minimized by picking each δ(x) to be the Bayes action that minimizes
the posterior expected loss Eθ|X=x {L[θ, δ(x)]}.
    One exception to Bayesians being concerned about the Bayes action rather than
the Bayes decision rule is when a Bayesian is trying to design an experiment, hence
is concerned with possible data rather than already observed data.
    We now introduce other basic concepts from decision theory.
Definition 5.3.1        The rule δ is inadmissible if there exists δ∗ such that, for any θ ,
R(θ , δ∗ ) ≤ R(θ , δ ) and there exists θ∗ such that R(θ∗ , δ∗ ) < R(θ∗ , δ ). In such a case
we say that δ∗ is better than δ . The rule δ is admissible if it is not inadmissible, i.e.,
if no rule is better than it. Two rules δ1 and δ2 are equivalent if R(θ , δ1 ) = R(θ , δ2 )
for all θ . A rule δ1 is as good as δ2 if it is either better than or equivalent to δ2 .
   For a discrete Θ and a prior that puts positive probability on each θ , the Bayes
rule is admissible. Typically Bayes rules are admissible in decision problems unless
something funky is going on. Suppose δ p is Bayes and inadmissible with, say, δ
being better so that R(θ , δ ) ≤ R(θ , δ p ) with there existing θ0 such that R(θ0 , δ ) <
R(θ0 , δ p ). Obviously the prior p cannot put positive probability on θ0 or else δ
would have a strictly smaller Bayes risk. When Θ is not discrete, if the risk functions
are continuous in θ in a neighborhood of θ0 , then there is a neighborhood of θ0 on
which R(θ , δ ) < R(θ , δ p ) and the difference is bounded above 0. The prior p cannot
have positive probability on this neighborhood of θ0 or else δ would have a strictly
smaller Bayes risk.
A class of decision rules is called complete if, for any rule outside the class, there
is a rule inside the class that is better. Every complete class contains all of the
admissible rules. The Complete Class Theorem
is that (under suitable conditions) the Bayes rules constitute a complete class.
One rationale for being a Bayesian is that if all reasonable decision rules correspond
to some prior distribution, before choosing something among the reasonable deci-
sion rules, you should investigate whether its corresponding prior seems reasonable.
   Two generalizations of decision rules exist. You can randomly pick a decision
rule or you can have a decision rule that yields randomized actions, i.e., if you see
X = x you randomly pick an action with the randomization allowed to depend on x.
This second idea is called a randomized decision rule. For a randomized action A,
For example, suppose ∆ is that you flip a coin and use δ1 if the coin is heads and δ2
if the coin is tails, then
                                     1            1
                          R(θ , ∆ ) = R(θ , δ1 ) + R(θ , δ2 ).
                                     2            2
   Neither of these ideas is very attractive to statisticians. You have certain evidence
X = x that global warming is true. Why would you flip a coin to decide
whether that evidence is sufficient for you to act as if global warming is true?
Nonetheless, we will see in Chapter 7 that randomized decision rules are a key
feature of the theory of hypothesis testing. (Fortunately, they are not a key feature
of its practice.)
E XAMPLE 5.3.4. In testing with the standard loss function, supθ L(θ , a0 ) = 1 and
supθ L(θ , a1 ) = 1 so the value is 1. Either action will minimize your maximum loss
and achieve the value of the problem.
   With a randomized action,

    A = a0 with probability p,        a1 with probability 1 − p,

we get L(θ0, A) = 1 − p and L(θ1, A) = p, so supθ L(θ, A) = max(p, 1 − p). With
randomized actions, the value of the problem is infA supθ L(θ, A) =
infp max(p, 1 − p) = 0.5, so it corresponds to the action of flipping a fair coin
to decide which hypothesis to accept.
   If you think of this as a game, you get to take actions, your opponent determines
the states of nature, and whatever you lose your opponent wins. Your opponent acts
to maximize your losses, so you can do better on average if you take randomized
actions. With randomized actions, the worst thing that can happen to you in this
example is half as bad as with fixed actions.
   For the loss function
                                      L(θ , a) a0 a1
                                        θ0     2 4
                                        θ1     3 2
action a0 will minimize your maximum loss. This is the minimax pure action.        2
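A Python sketch of the two minimax computations in the example; a grid search over p stands in for the exact minimization:

    # Sketch: minimax values in Example 5.3.4.
    # Standard loss: L(theta0, A) = 1 - p, L(theta1, A) = p.
    print(min(max(1 - p / 100, p / 100) for p in range(101)))  # 0.5

    # Second loss function: pure actions a0, a1 have maximum losses 3 and 4.
    L = {('t0', 'a0'): 2, ('t0', 'a1'): 4, ('t1', 'a0'): 3, ('t1', 'a1'): 2}
    print(min(max(L[(t, a)] for t in ('t0', 't1')) for a in ('a0', 'a1')))  # 3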
Exercise 5.1      For the loss function immediately above, show that the randomized
action that takes a0 with probability 2/3 minimizes the maximum expected loss.
A prior g∗ is least favorable if infδ r(g∗, δ) = supg infδ r(g, δ).
In other words, g∗ is the prior that is going to give a Bayesian the worst possible
outcome (risk).
Exercise 5.2         Show that g∗ is least favorable if and only if r(g∗, δ0) ≥
infδ r(g, δ) for any δ0 and all g.
   Note that on the right side, if supg infδ r(g, δ ) = infδ r(g∗ , δ ) then g∗ is a least
favorable distribution.
Conversely, by considering the subset of priors that take on the value θ with probability
one, say gθ, note that r(gθ, δ) = R(θ, δ), which leads to r(g∗, δ0) ≤ r(g∗, δ∗).
This must be an equality since we know by definition of the Bayes rule that
r(g∗, δ0) ≥ r(g∗, δ∗). Since δ0 and δ∗ have the same Bayes risk, they must both be
Bayes rules.                                                                      2
   The point is that a Bayes rule for a least favorable distribution isn’t necessarily a
minimax rule, but a minimax rule, if it exists, is necessarily a Bayes rule for a least
favorable distribution.
   We now introduce a method for finding minimax rules.
P ROOF :

    infδ supθ R(θ, δ) ≤ supθ R(θ, δ0) = K = r(g0, δ0) = infδ r(g0, δ) ≤ supg infδ r(g, δ).

By the Minimax Theorem and Corollary 4, all of these are equal, so in particular
δ0 is a minimax rule.                                                             2
Exercise 5.3a.     Let X1 , . . . , Xn |θ ∼ N(θ , σ 2 ). For squared error loss, show that
the sample mean is an equalizer rule.
Exercise 5.3b.      Let X|θ ∼ Bin(n, θ ) and θ ∼ Beta(α, β ). Assume that the Min-
imax Theorem holds! For squared error loss, find the Bayes rule, say, δαβ . Find
R(θ, δαβ). Pick α and β so that δαβ is an equalizer rule. Establish that δαβ is a
minimax rule.
We want to minimize the expected prediction error E{L[y, ỹ(x)]}, where the expectation
is over both y and x. Identifying prediction with decision and conditioning
on x, we see that Proposition 5.1.1 implies
Proposition 5.5.1.        For data (x′ , y), y ∈ R, and L(y, ỹ(x)) = [y − ỹ(x)]2 , the best
predictor is ŷ = E(y|x).
Regression, both linear and nonparametric, is about estimating the optimal predictor
E(y|x). Note that this result holds even when y is Bernoulli, in which case the best
predictor under squared error loss is E(y|x) = Pr[y = 1|x]. Using squared error loss
with a Bernoulli variable y is essentially using Brier Scores.
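A quick numerical check that, with Bernoulli y, squared error (the Brier score) is minimized by the conditional probability; the value p = 0.3 is an arbitrary illustration:

    # Sketch: E[(y - q)^2] = p(1 - q)^2 + (1 - p)q^2 is minimized at q = p.
    p = 0.3

    def brier(q):
        return p * (1 - q) ** 2 + (1 - p) * q ** 2

    print(min((i / 100 for i in range(101)), key=brier))  # 0.3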
   Similarly we can get other best predictors.
Proposition 5.5.2.      For data (x′, y), y ∈ R, and L(y, ỹ(x)) = w(y)[y − ỹ(x)]² with
w(y) > 0, the best predictor is ŷ = E[w(y)y|x]/E[w(y)|x].
Proposition 5.5.3.       For data (x′ , y), y ∈ R, and L(y, ỹ(x)) = |y − ỹ(x)|, a best
predictor is any ŷ = m ≡ Median(y|x).
  When y takes values in {0, 1}, an alternative loss function is the so called Ham-
ming loss,
                           L[y, ỹ(x)] = I [y ̸= ỹ(x)],
wherein I (logical) is 0 if the logical statement is false and 1 if it is true and a
predictor ỹ(x) also needs to take values in {0, 1}. We want to minimize the expected
prediction error
                           E{L[y, ỹ(x)]} = E{I [y ̸= ỹ(x)]}
where the expectation is over both y and x. We see that Proposition 5.1.4 implies
Proposition 5.5.4.        For data (x′ , y), y ∈ {0, 1} and L(y, ỹ(x)) = I (y ̸= ỹ(x)), a
best predictor has

    ŷ(x) ≡ 0 if Pr(y = 0|x) > 0.5,        ŷ(x) ≡ 1 if Pr(y = 0|x) < 0.5.
   Do a prediction chapter. The prediction result from PA Exercise 2.1 can be used
more generally when predicting ỹ and specifically some function of it, ρ̃′ỹ, like the
difference in sample means between two predictive samples. When X̃β is estimable,

    (ρ̃′ỹ − ρ̃′X̃β̂) / √(MSE [ρ̃′ρ̃ + ρ̃′X̃(X′X)⁻X̃′ρ̃])
Chapter 6
Estimation Theory
Throughout, the data are

    y|θ ∼ f(v|θ); θ ∈ Θ.
Definition 6.1.1.    Any function of y, say T (y), is a statistic. Statistics can be real
valued or vector valued.
   Note that a statistic is not allowed to depend on θ but for fixed θ we can still
treat functions G(y, θ ) as random variables.
Definition 6.1.2.      A statistic (estimator) g(y) is unbiased for h(θ ), if Ey|θ [g(y)] =
h(θ ) for all θ . The bias of g(y) for estimating h(θ ) is defined by bgh (θ ) ≡
Ey|θ [g(y) − h(θ )]. If h(θ ) ≡ θ , we suppress the bias subscript h. These functions
can be either real valued or vector valued.
For g(y) and h(θ) real valued with the loss function L[θ, g(y)] = [g(y) − h(θ)]²,
the risk is

    R[θ, g(y)] = Ey|θ [g(y) − h(θ)]² = Vary|θ [g(y)] + [bgh(θ)]².
This is often referred to as the mean squared error. (In linear model theory the
“mean squared error” is used to indicate the unbiased estimate of the variance, so to
distinguish the concepts, this risk may be called the “expected squared error.”)
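The decomposition is easy to verify by simulation. A Python sketch with a deliberately biased estimator of a normal mean; the estimator ȳ/2 is an arbitrary illustration:

    # Sketch: expected squared error = variance + squared bias.
    import random

    theta, n, reps = 3.0, 10, 100_000
    ests = []
    for _ in range(reps):
        y = [random.gauss(theta, 1) for _ in range(n)]
        ests.append(sum(y) / n / 2)          # biased estimator g(y) = ybar/2

    mean_g = sum(ests) / reps
    var_g = sum((g - mean_g) ** 2 for g in ests) / reps
    mse = sum((g - theta) ** 2 for g in ests) / reps
    print(round(mse, 3), round(var_g + (mean_g - theta) ** 2, 3))  # agree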
Calling a statistic T (y) “sufficient” is intended to convey that all the information
about θ is contained in T (y).
The factorization criterion states that T(y) is sufficient if and only if the density
factors as f(v|θ) = g[T(v); θ] h(v).
First shown by Fisher (1922), this result was proven in great generality by Halmos
and Savage (1949). When establishing properties of sufficient statistics, knowing that
the factorization holds is very useful. However, when finding sufficient statistics for a
particular model, finding a factorization is how we typically establish sufficiency. So
both directions in the if and only if statement are important. That notwithstanding,
only a proof that the factorization implies sufficiency is given at the end of the
section.
   As a function of θ , the likelihood is now L(θ |y) ∝ g[T (y); θ ], so, for example,
the maximum of the likelihood function must occur at the maximum of g[T (y); θ ],
which only depends on y through the sufficient statistic T (y). Similarly, if we have a
prior density on θ , say pθ (u), from Bayes Theorem the posterior density of θ given
y = v is
                                 f (v|u)pθ (u)      g[T (v); u]pθ (u)
                pθ |y (u|v) = R                  =R                      ,
                                f (v|u)pθ (u) du    g[T (v); u]pθ (u) du
where the last term shows that the posterior distribution depends on y = v only
through the fact that T (y) = T (v). Thus the posterior depends on y only through the
value of the sufficient statistic, any sufficient statistic.
E XAMPLE 6.2.3. Consider y1, . . . , yn iid U(0, θ), θ > 0. For these data the largest
order statistic is sufficient.

    f(v|θ) = ∏_{i=1}^n (1/θ) I(0,θ)(vi) = ∏_{i=1}^n (1/θ) I(0,θ)(v(i))
           = (1/θⁿ) I(0,∞)(v(1)) I(0,θ)(v(n)).

The second equality holds because we are merely reordering the terms in the product.
The third equality follows from the definition of the order statistics. In the factorization
criterion, h(v) = I(0,∞)(v(1)) and g[T(v); θ] = (1/θⁿ) I(0,θ)(v(n)).
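The factorization says the likelihood depends on the data only through the largest order statistic. A small Python sketch:

    # Sketch: for U(0, theta) data the likelihood depends only on max(y).
    def likelihood(theta, y):
        return theta ** (-len(y)) if 0 < min(y) and max(y) < theta else 0.0

    y1 = [0.2, 0.9, 0.5]
    y2 = [0.9, 0.9, 0.9]   # same maximum as y1
    for theta in (1.0, 2.0):
        print(likelihood(theta, y1), likelihood(theta, y2))  # equal pairs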
E XAMPLE 6.2.4. Consider y1 , . . . , yn iid U(θ1 , θ2 ), θ2 > θ1 . For these data the
smallest and largest order statistics are sufficient.
    f(v|θ1, θ2) = ∏_{i=1}^n [1/(θ2 − θ1)] I(θ1,θ2)(vi) = ∏_{i=1}^n [1/(θ2 − θ1)] I(θ1,θ2)(v(i))
                = [1/(θ2 − θ1)ⁿ] I(θ1,θ2)(v(1)) I(θ1,θ2)(v(n)).
If the smallest and largest order statistics are between θ1 and θ2 , all of the order
statistics must be between them.
Suppose you have two functions of a complete statistic, say g1 [H(y)] and g2 [H(y)],
and both are unbiased for h(θ ). Then Ey|θ {g1 [H(y)] − g2 [H(y)]} = 0 for all θ which
means that Py|θ [g1 [H(y)] − g2 [H(y)] = 0] = 1. Basically, for any parameter h(θ ),
there can be only one unbiased estimate that is a function of H(y). We will see in
the next section that if a statistic T (y) is both complete and sufficient, any function
of it, say g[T (y)] has to be a minimum variance unbiased estimate of its expectation,
i.e., of h(θ ) ≡ Ey|θ {g[T (y)]}.
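For normal data the practical content is that unbiased estimators which are functions of the complete sufficient statistic beat other unbiased estimators. A simulation sketch comparing the sample mean and the sample median, both unbiased for θ by symmetry:

    # Sketch: for N(theta, 1) data, the mean (a function of the complete
    # sufficient statistic) has smaller variance than the median.
    import random, statistics

    theta, n, reps = 1.0, 15, 50_000
    means, medians = [], []
    for _ in range(reps):
        y = [random.gauss(theta, 1) for _ in range(n)]
        means.append(statistics.fmean(y))
        medians.append(statistics.median(y))

    print(round(statistics.pvariance(means), 4))    # about 1/15 = 0.067
    print(round(statistics.pvariance(medians), 4))  # larger, near pi/(2n)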
    If Θ0 ⊂ Θ , then any T (y) that is complete for θ ∈ Θ0 is automatically complete
for θ ∈ Θ . Relative to subsets of Θ , sufficiency and completeness work in opposite
ways. Think of Θ as indexing all distributions that are absolutely continuous wrt
(with respect to) Lebesgue measure and for which the expected value exists. Think
of Θ0 as indexing N(µ, 1) distributions. Consider y1 , . . . , yn iid f∗ (·|θ ). For Θ , the
order statistics y(1) ≤ . . . ≤ y(n) are complete and sufficient, see Fraser (1957). For
Θ0 , we will see later that the sample mean ȳ· is complete and sufficient. For Θ0 ,
the order statistics are sufficient but not complete. For Θ , ȳ· is complete but not
sufficient.
    Incidentally, ȳ· is the minimum variance unbiased estimate of E(yi ) in both fam-
ilies. For θ ∈ Θ it is relatively hard to find statistics that estimate the expected
6.2 Sufficiency and Completeness                                                      71
value unbiasedly. Thus ȳ· , which is the mean of the order statistics, is the best of a
relatively small group of unbiased estimators but is best for a wide array of distri-
butions. For θ ∈ Θ0 it is relatively easy to find statistics that estimate the expected
value unbiasedly (partly because of symmetry). Thus ȳ· is the best of a large group
of estimators but is best for a relatively small collection of distributions.
    Some people like to analyze experimental designs in which treatments are ran-
domly assigned to experimental units based entirely on the random assignment. (For
some simple applications see PA, Appendix G.) Obviously, the random assignment
has nothing to do with any parameters related to the results of the experiment, so the
result of the randomization is an ancillary statistic. If the randomization is the only
thing random, conditioning on the ancillary statistic leaves nothing on which to base
an analysis. Ironically, Fisher was big on both the idea of conditioning on ancillary
statistics and on the idea of using only the random assignment of treatments as the
basis for analyzing experiments.
    When predicting y on the basis of x, Fisher argued that the only parameters of
interest are associated with the conditional distribution given x. Pretty obviously, the
distribution of x does not depend on any of those parameters, so is ancillary. (This
is actually a somewhat more nuanced argument involving parameters of interest and
nuisance parameters.)
    In dealing with count data, Fisher’s exact conditional test for 2 × 2 contingency
tables conditions on ancillary statistics (row and column totals). In fact, all exact
conditional tests for contingency tables involve conditioning on statistics that are
ancillary for the parameters of interest. Whether these tests are more appropriate
than unconditional tests is a source of controversy, cf. Agresti (1992).
    The most famous result relating to ancillary statistics is due to Basu: a complete
sufficient statistic T(y) is independent of every ancillary statistic A(y).
P ROOF : [Lehmann, 1983, p.46] Let ηB (t) ≡ Py|θ [A(y) ∈ B|T (y) = t]. By suffi-
ciency this conditional probability does not depend on θ . Dropping the unnecessary
subscript on the conditional probability,
Lemma 6.2.11. Let r ≥ 0 and s be two functions with E[r(y)] ≤ Kr < ∞. Because
it is nonnegative, r can act like a (not typically probability) density with respect to
the dominating measure P, call this new measure µr , and so we can construct
something akin to a conditional expectation relative to this new measure, call it
Er [s(y)|T (y)]. Then
   ∫_{T⁻¹(B)} g[T (v)]r(v)s(v) dP(v) = ∫_{T⁻¹(B)} g[T (v)]E[r(v)s(v)|T (v)] dP(v)
                                     = ∫_{T⁻¹(B)} g[T (v)]Er [s(v)|T (v)]r(v) dP(v)
                                     = ∫_{T⁻¹(B)} g[T (v)]Er [s(v)|T (v)]E[r(v)|T (v)] dP(v),
so
   (1/Kr) ∫_{T⁻¹(B)} r(v)s(v) dP(v) = (1/Kr) ∫_{T⁻¹(B)} Er [s(y)|T (v)]r(v) dP(v)
                                    = (1/Kr) ∫_{T⁻¹(B)} E{Er [s(y)|T (v)]r(y) | T (v)} dP(v)
                                    = (1/Kr) ∫_{T⁻¹(B)} Er [s(y)|T (v)]E[r(y) | T (v)] dP(v).
   dν∗ (v) = ∑_{i=1}^∞ [I_{Ai}(v)/(2^i ν(Ai ))] dν(v).                              2
PROOF: Obvious.
   = g[T (v); θ ] { h(v) / ∑_{i=1}^∞ [I_{Ai}(v)/(2^i ν(Ai ))] } dν∗ (v)
In Lemma 6.2.11 identify g[T (v)] → g[T (v); θ ], r(v) → h1 (v), s(v) → IA (v), and
dP(v) → dν∗ (v), so
   ∫_{T⁻¹(B)} g[T (v); θ ]h1 (v)IA (v) dν∗ (v) = ∫_{T⁻¹(B)} g[T (v); θ ]E[h1 (y)IA (y)|T (v)] dν∗ (v)
                                               = ∫_{T⁻¹(B)} g[T (v); θ ]Eh1 [IA (y)|T (v)]h1 (v) dν∗ (v)
                                               = ∫_{T⁻¹(B)} g[T (v); θ ]Eh1 [IA (y)|T (v)]E[h1 (y)|T (v)] dν∗ (v).
   ∫_{T⁻¹(B)} E[IA (y)|T (v)] dPθ (v) = ∫_{T⁻¹(B)} IA (v)g[T (v); θ ]h1 (v) dν∗ (v)
                                      = ∫_{T⁻¹(B)} g[T (v); θ ]Eh1 [IA (y)|T (v)]E[h1 (y)|T (v)] dν∗ (v)
                                      = ∫_{T⁻¹(B)} g[T (v); θ ]Eh1 [IA (y)|T (v)]h1 (v) dν∗ (v)
                                      = ∫_{T⁻¹(B)} Eh1 [IA (y)|T (v)] dPθ (v).
By definition Eh1 [IA (y)|T (v)] = Ey|θ [IA (y)|T (v)], but Eh1 [IA (y)|T (v)] is defined
with respect to the measure with density h1 relative to ν∗ , which does not depend on θ .
Suppose T (y) is any statistic and g(y) is unbiased for h(θ ), both real valued. To
simplify notation, for fixed θ write the conditional expectation and variance of g(y)
given T (y) as both
                           Ey|θ [g(y)|T (y)] ≡ Ey|θ ,T (y) [g(y)]
and write
                            Vary|θ [g(y)|T (y)] ≡ Vary|θ ,T (y) [g(y)].
The key point in this section is that these numbers typically depend on θ but, when
T (Y ) is sufficient, they do not.
   Standard results, cf. Exercise A.1, on conditional probabilities provide that for
any statistic T (y),
               Ey|θ [g(y)] = Ey|θ {Ey|θ [g(y)|T (y)]}                                    (1)
and
               Vary|θ [g(y)] = Vary|θ {Ey|θ [g(y)|T (y)]} + Ey|θ {Vary|θ [g(y)|T (y)]}
                            ≥ Vary|θ {Ey|θ [g(y)|T (y)]}.                                (2)
When T (y) is sufficient, the conditional distribution of y given T (y) does not depend
on θ , so Ey|θ [g(y)|T (y)] ≡ Ey|θ ,T (y) [g(y)] = Ey|T (y) [g(y)] ≡ E[g(y)|T (y)] is a
statistic: it is a function of y that does not depend on θ . It then follows from (1) that
E[g(y)|T (y)] is an unbiased estimate of h(θ ) and from (2) that Vary|θ {E[g(y)|T (y)]} ≤
Vary|θ [g(y)], so the conditional expectation is at least as good an unbiased estimate as
the original unbiased estimate. We have proven the following (Rao-Blackwell) result:
if T (y) is sufficient and g(y) is unbiased for h(θ ), then E[g(y)|T (y)] is an unbiased
estimate of h(θ ) whose variance is no larger than that of g(y).
   If T (Y ) is both complete and sufficient, any function of it, say g̃[T (y)], is unbi-
ased for its expected value, say, h(θ ) ≡ Ey|θ {g̃[T (y)]}. If g(y) is any other unbiased
estimate of h(θ ), then by sufficiency E[g(y)|T (y)] is also an unbiased statistic and a
function of T (Y ), so E{g̃[T (y)] − E[g(y)|T (y)]} = 0 and by completeness of T (y),
1 = Pry|θ {g̃[T (y)] − E[g(y)|T (y)] = 0} = Pry|θ {g̃[T (y)] = E[g(y)|T (y)]}. It follows
that
              Vary|θ {g̃[T (y)]} = Vary|θ {Ey|θ [g(y)|T (y)]} ≤ Vary|θ [g(y)],
so the variance of g̃[T (y)] is at least as small as the variance of any other unbiased
estimate. This result is sometimes called the Lehmann-Scheffé Theorem.
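   To see Rao-Blackwellization and the Lehmann-Scheffé theorem numerically, here is a minimal simulation sketch (mine, not from the text): for y1 , . . . , yn iid Poisson(λ ), T (y) = ∑i yi is complete sufficient, I(y1 = 0) is unbiased for e^{−λ}, and E[I(y1 = 0)|T ] = [(n − 1)/n]^T is therefore the minimum variance unbiased estimate.

import numpy as np

rng = np.random.default_rng(0)
n, lam, reps = 10, 2.0, 100_000
y = rng.poisson(lam, size=(reps, n))
crude = (y[:, 0] == 0).astype(float)      # unbiased for exp(-lam), but noisy
rb = ((n - 1) / n) ** y.sum(axis=1)       # E[crude | sum]: function of a complete sufficient statistic
print(crude.mean(), rb.mean(), np.exp(-lam))  # all approximately 0.135
print(crude.var(), rb.var())              # conditioning slashes the variance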
   The factorization criterion allows us to find sufficient statistics; the remaining
question is how to find complete sufficient statistics. That is addressed in the
section on exponential families.
   A similar result holds for any loss function L(θ , a) that is convex in a. Jensen's
inequality applied to the conditional distribution of y given T (y) implies
               E{L[θ , g(y)] | T (y)} ≥ L{θ , E[g(y)|T (y)]},
so
R[θ , g(y)] = E (E{L[θ , g(y)]|T (y)}) ≥ E (L{θ , E[g(y)|T (y)]}) = R{θ , E[g(y)|T (y)]}.
E XAMPLE 6.3.7. Suppose y1 , . . . , yn are iid N(µ, σ 2 ), then the data and the order
statistics are sufficient but (ȳ· , s2 ) and (∑i yi , ∑i y2i ) are minimal sufficient.
    Suppose T (y) is sufficient and T0 (y) = q[T (y)] is minimal sufficient and suppose that
g[T (y)] and g0 [T0 (y)] are unbiased for h(θ ). By Rao-Blackwell, E{g[T (y)] | T0 (y)}
is at least as good as g[T (y)] and may be better. However, by minimal sufficiency,
E{g0 [T0 (y)] | T (y)} = E (g0 {q[T (y)]} | T (y)) = g0 {q[T (y)]} = g0 [T0 (y)],
so conditioning an estimate based on the minimal sufficient statistic on a larger
sufficient statistic yields nothing new.
E XAMPLE 6.3.8. (This also appears in Cox and Hinkley.) Let y1 , . . . , yn be iid
N(µ, σ 2 ). Then E(s2 ) = σ 2 , Var(s2 ) = 2σ 4 /(n − 1), but s2 (n − 1)/n maximizes the
likelihood function. Consider the risk under squared error loss from estimates of the
form cs² for a constant c. Then
               E[cs² − σ ²]² = [2c²/(n − 1) + (1 − c)²] σ ⁴.
This is minimized for c = (n − 1)/(n + 1). Squared error loss is kind of weird when
there is a bound on the parameter space. You tend to get risk improvements by
shrinking towards the bound.                                                     2
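   A quick simulation check of the risk formula and its minimizer (a sketch of mine; the values of n and σ ² are arbitrary):

import numpy as np

rng = np.random.default_rng(1)
n, sigma2, reps = 10, 1.0, 200_000
y = rng.normal(0.0, np.sqrt(sigma2), size=(reps, n))
s2 = y.var(axis=1, ddof=1)
for c in [(n - 1) / (n + 1), (n - 1) / n, 1.0]:
    emp = np.mean((c * s2 - sigma2) ** 2)                   # simulated risk
    thy = (2 * c**2 / (n - 1) + (1 - c) ** 2) * sigma2**2   # formula above
    print(round(c, 3), emp, thy)    # c = (n-1)/(n+1) gives the smallest risk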
   In normal theory statistical inference, what matters is not the point estimate of
σ 2 but the fact that ∑i (yi − ȳ· )2 /σ 2 ∼ χ 2 (n − 1). Using s2 leads to the t(n − 1)
and F(1, n − 1) distributions. If you use a different point estimate, you need to use
different distributions, but they adjust is such a way that you get the same tests and
confidence intervals.
Suppose θ̂1 , . . . , θ̂k are independent estimates of θ that share a common bias,
E(θ̂i ) = θ + b, and have variances σi² ≤ σ ². Now define θ̄ ≡ ∑_{i=1}^k θ̂i /k. We
still get E(θ̄ ) = θ + b but now Var(θ̄ ) = ∑i σi²/k² ≤ σ ²/k. So combining biased
estimators reduces variance but does not help to reduce bias.                      2
Theorem 6.3.10. Suppose E[T (y)] = θ . Then T (y) is minimum variance unbiased
(MVU) for θ if and only if Cov[T (y),U(y)] = 0 for every statistic U(y) with
E[U(y)] = 0.
PROOF: ⇐ Suppose the covariance condition holds and E[T (y)] = θ = E[h(y)];
then U(y) = T (y) − h(y) has E[U(y)] = 0, so Cov[T,U] = 0 and
       Var[h] = Var[T −U] = Var[T ] − 2Cov[T,U] + Var[U] = Var[T ] + Var[U] ≥ Var[T ].
   ⇒ Suppose T (y) is minimum variance unbiased for θ and E[U(y)] = 0. For any
scalar λ , T + λU is unbiased for θ and
              Var[T + λU] = Var[T ] + 2λ Cov[T,U] + λ ² Var[U],
which is less than Var[T ] if 2λ Cov[T,U] + λ ² Var[U] < 0.
   If λ > 0, this happens whenever
                                          −2Cov[T,U]
                                  0<λ <
                                             Var[U]
   If λ < 0, it happens whenever
                                           −2Cov[T,U]
                                  0>λ >
                                              Var[U]
   Thus if Cov[T,U] ̸= 0 we can find a λ so that T is not MVU.                     2
Corollary 6.3.11.      If E[T (y)] = θ = E[h(y)] and T (y) is MVU, then Cov[T (y), h(y)] =
Var[T (y)].
PROOF: By the theorem, Cov[T (y), T (y) − h(y)] = 0, so Var[T (y)] = Cov[T (y), h(y)].   2
Corollary 6.3.12.      If T (y) and h(y) are both minimum variance unbiased for θ , then
Corr[T (y), h(y)] = 1.
Corollary 6.3.13.      Suppose T (y) ∈ G is unbiased for θ . Then T (y) is MVU within G
if and only if Cov[T (y), h(y)] = 0 for every h ∈ G with E[h] = 0.
PROOF: Same.
PROOF: Similar.
Corollary 6.3.15.        If T1 (y) and T2 (y) are minimum variance unbiased for θ1 and
θ2 , then b1 T1 (y) + b2 T2 (y) is minimum variance unbiased for b1 θ1 + b2 θ2 .
Corollary 6.3.16.      The results stated for y ∼ f (v) also hold, for each fixed η,
when y ∼ f (v|η).
PROOF:
PROOF: iid observations imply that the order statistics O are sufficient. Conditioning
on the order statistics gives a symmetric function of the data:
                              ∑_{permutations p} h[p(y)]
                E[h(y)|O] =
                                          n!
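   A tiny numerical version of the symmetrization (my sketch): averaging an order-dependent function such as h(y) = y1² over all permutations of the data produces the symmetric function ∑i yi²/n.

from itertools import permutations
import numpy as np

y = np.array([1.0, 3.0, 2.0])
h = lambda v: v[0] ** 2                  # depends on the order of the data
sym = np.mean([h(np.array(p)) for p in permutations(y)])
print(sym, np.mean(y ** 2))              # both 14/3: a symmetric function of y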
If E[U] = g(θ ), then
E[U − h(θ )]² = E[U − g(θ ) + g(θ ) − h(θ )]² = Var[U] + [g(θ ) − h(θ )]².
We begin by introducing the score and information functions. We then use these
concepts to derive the Cramér-Rao inequality.
   For θ ∈ Θ ⊂ Rd,
                                  1 = ∫ f (v|θ ) dv,
so
                          0 = dθ 1
                            = dθ [∫ f (v|θ ) dv]
                            = ∫ dθ f (v|θ ) dv
                            = ∫ {[1/ f (v|θ )] dθ f (v|θ )} f (v|θ ) dv
                            = Ey|θ {[1/ f (y|θ )] dθ f (y|θ )}.
Note that if d > 1, all of these quantities are d-dimensional row vectors.
   Define the score function d-vector as
                         S(y; θ ) ≡ [1/ f (y|θ )] [dθ f (y|θ )]′.
Since it depends on θ , S(y; θ ) is not a statistic. The score function can also be
thought of as {dθ log[ f (y|θ )]}′ .
   We have just shown that E[S(y; θ )] = 0, so it follows that the information matrix is
the covariance matrix of the score,
               I(θ ) ≡ Covy|θ [S(y; θ )] = Ey|θ {S(y; θ )[S(y; θ )]′}.
If y1 , . . . , yn are iid, the score function for y is the sum of the score functions for the
yi s. Similarly, the information matrix for y is the sum of the information matrices for
the yi s, but those are all identical, so I(θ ) = nI∗ (θ ) where we use I∗ (θ ) to denote
the information in a single observation. All of this remains true if the yi s are iid r
vectors so that y is actually an rn × 1 vector.
Exercise 6.1.
(a) Show I(θ ) = E{[∑_{i=1}^n S∗ (yi ; θ )][∑_{j=1}^n S∗ (y j ; θ )]′} = nI∗ (θ ).
Hint: Independence allows you to get rid of cross-product terms.
(b) Show that I(θ ) = E(−{d²θθ log[ f (y|θ )]}).
Hint: Take the second derivative on each side of 1 = ∫ f (v|θ ) dv.
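   Both identities are easy to check by simulation. A sketch (mine) for iid Poisson(θ ) data, where S∗ (yi ; θ ) = yi /θ − 1 and I∗ (θ ) = 1/θ :

import numpy as np

rng = np.random.default_rng(2)
n, theta, reps = 5, 3.0, 500_000
y = rng.poisson(theta, size=(reps, n))
score = (y / theta - 1.0).sum(axis=1)    # score for y: sum of individual scores
print(score.mean())                      # approximately 0
print((score ** 2).mean(), n / theta)    # I(theta) = n * I_*(theta) = n/theta
print((y / theta**2).sum(axis=1).mean()) # minus the second derivative: also n/theta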
   The Cramér-Rao Inequality gives a lower bound for the variance of any estimate
of a scalar parameter θ . Obviously, if you have an unbiased estimate that actually
achieves the lower bound, it must be a minimum variance unbiased estimate.
   The Cramér-Rao inequality involves applying the Cauchy-Schwarz inequality to
a real valued estimator g(y) of a real valued parameter θ and the real valued score
function S(y; θ ). If g(y) is a possibly biased estimate of θ with bias bg (θ ), then
                          θ + bg (θ ) = ∫ g(v) f (v|θ ) dv.                        (1)
The Cauchy-Schwarz inequality states that for any random functions x and y,
                          Cov²[x, y] ≤ Var[x]Var[y].
Apply this with g(y) and S(y; θ ). Look at the covariance term. Using (1) and the
fact that 0 = Ey|θ [S(y; θ )],
         Cov[g(y), S(y; θ )] = ∫ g(v) dθ f (v|θ ) dv = dθ [θ + bg (θ )] = 1 + ḃg (θ ).
Since Var[S(y; θ )] = I(θ ), the Cauchy-Schwarz inequality gives
                                                   [1 + ḃg (θ )]²
                                 Vary|θ [g(y)] ≥                   .
                                                       I(θ )
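   For example (a standard check, not tied to the text's development): if y1 , . . . , yn are iid N(θ , 1), then S(y; θ ) = ∑i (yi − θ ) and I(θ ) = n. The unbiased estimate ȳ· has ḃg (θ ) = 0 and Vary|θ (ȳ· ) = 1/n = [1 + 0]²/I(θ ), so ȳ· attains the Cramér-Rao bound and must be minimum variance unbiased.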
The exact asymptotic distribution of the MLE depends on the information, cf. Fer-
guson (1996). Under suitable conditions with Θ ⊂ Rd,
                                  [I(θ )]^{1/2} [θ̂ − θ ] →L N(0, Id )
and
                                [θ̂ − θ ]′ [I(θ )][θ̂ − θ ] →L χ ²(d).
Here we are using a Singular Value Decomposition to define [I(θ )]1/2 . For iid data
these reduce to
                            √n [θ̂ − θ ] →L N(0, I∗ (θ )−1 )
and
                              n[θ̂ − θ ]′ [I∗ (θ )][θ̂ − θ ] →L χ ²(d).
Typically, θ̂ →P θ and these relationships also hold when the information at θ is
replaced with the information evaluated at θ̂ .
While the score function S(y; θ ) ≡ [dθ f (y|θ )]′ [1/ f (y|θ )] is not a statistic, if we
replace θ with its MLE θ̂ we get the score statistic, S(y; θ̂ ). In the next chapter we
will consider tests based on score statistics. Also, if we are only considering one
value of θ , say θ = θ0 , we also refer to S(y; θ0 ) as a (test) statistic.
For a linear model
                         Y = Xβ + e; E(e) = 0; Cov(e) = σ ²I,
the Gauss-Markov theorem states that for any estimable function, the least squares es-
timate is the best (minimum variance) linear unbiased estimate. See Plane Answers
for details.
A density function is in the natural exponential family if for functions c and h and a
statistic T (y) it can be written as
f (v|θ ) = c(θ )h(v) exp[θ ′ T (v)] = h(v) exp[θ ′ T (v) − r(θ )],
where c(θ ) ≡ exp[−r(θ )] (provided that c(θ ) > 0). By the factorization criterion,
T (y) is a sufficient statistic.
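   For example (standard, not from the text), the Poisson(λ ) distributions form a natural exponential family after reparametrizing by θ = log λ :
                         f (v|θ ) = (1/v!) exp[θ v − e^θ ],
so h(v) = 1/v!, T (v) = v, and r(θ ) = e^θ . For a sample of size n the sufficient statistic is T (y) = ∑i yi .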
   The support of a distribution is the set of v values for which f (v|θ ) > 0. (The
support is only defined up to sets of dominating measure 0.) The density of y is
then equal to f (v|θ ) times the indicator function of the support. If the support of
the distribution depends on θ , the distribution is not in the exponential family: write
the support as a set A(θ ) ̸= Rⁿ. The indicator for the support is I_{A(θ )}(v). This
cannot be part of c(θ ) or h(v) because it involves both θ and v. It also cannot be
part of exp[θ ′ T (v)] because exp[θ ′ T (v)] > 0 yet the indicator function of the
support can be 0. For example, uniform distributions determined by a parametric
endpoint are not members of an exponential family.
   The mean of the statistic T (y) can often be written in terms of r(θ ). Note that
               1 = ∫ f (v|θ ) dv = ∫ h(v) exp[θ ′ T (v) − r(θ )] dv.
This leads to
                                       E [T (y)] = [dθ r(θ )]′ .
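   To see this, differentiate under the integral sign:
        0 = dθ 1 = ∫ h(v) exp[θ ′ T (v) − r(θ )] {T (v)′ − dθ r(θ )} dv = E[T (y)′] − dθ r(θ ).
In the Poisson example above, this gives E[T (y)] = dθ e^θ = λ for a single observation.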
   Lehmann has given a couple of conditions under which T (y) is also complete.
The following theorem gives what seems to be the simplest.
Theorem 6.6.1.        [Lehmann, 1986, p.142] Suppose y|θ has a density in the nat-
ural exponential family. Then if neither θ nor T (y) is subject to a linear constraint,
T (y) is sufficient and complete.
PROOF: We need to show that if E_{T |θ }[Q(T )] = 0 for all θ , then P_{T |θ }[Q(T ) = 0] = 1 for all
θ . As in Appendix D.2, write Q = Q+ − Q− . Note that since 0 = E_{T |θ }[Q(T )] =
E_{T |θ }[Q+ (T )] − E_{T |θ }[Q− (T )],
     ∫ Q+ (t)c(θ ) exp[θ ′t] dν∗ (t) = ∫ Q− (t)c(θ ) exp[θ ′t] dν∗ (t),            (1)
hence for θ = 0,
               ∫ Q+ (t) dν∗ (t) = ∫ Q− (t) dν∗ (t) ≡ K.
It follows that Q+ (t)/K and Q− (t)/K can be viewed as densities wrt dν∗ (t) and
equation (1) specifies that the densities have the same moment generating function,
hence the densities determine the same distribution. If the distributions are the same,
the densities must be the same a.e. (ν∗ ), i.e., Q+ (t) = Q− (t) a.e., hence Q(t) =
Q+ (t) − Q− (t) = 0 a.e. (ν∗ ), which is enough to ensure that the function is 0 a.e. (ν).   2
   More generally, a density is in the exponential family if for some functions η, c, h
and statistic T (y),
         f (v|θ ) = c(θ )h(v) exp[η(θ )′ T (v)] = h(v) exp[η(θ )′ T (v) − r(θ )].
Consistency, efficiency, etc.: we leave the details of these procedures to other sources
on asymptotic theory like Ferguson (1996) or Lehmann (1999).
Chapter 7
Hypothesis Test Theory
A (randomized) test is a function 0 ≤ φ (y) ≤ 1 giving the probability of rejecting
the null hypothesis when y is observed. Under 0-1 loss, the risk for θ ∈ Θ0 is
                             R(θ , φ ) ≡ ∫ φ (v) f (v|θ ) dv.
This is the probability of rejection under the null hypothesis, i.e., the probability of
Type I error. The supremum sup_{θ ∈Θ0} R(θ , φ ) is often called the α level of the test
and the (maximum) probability of Type I error.
   The probability of Type II error is R(θ , φ ) for θ ∈ Θ1 , written
                             βφ (θ ) ≡ ∫ [1 − φ (v)] f (v|θ ) dv.
   The power of a test is the probability of rejecting the null hypothesis when it is
false. For any θ ∈ Θ , the power function (more awkwardly but correctly called the
size-power function) is
                                     Z
                         πφ (θ ) ≡       φ (v) f (v|θ ) dv = E[φ (y)].
Despite the name, the power function only gives the power of the test when θ ∈
Θ1 , whence πφ (θ ) = 1 − R(θ , φ ) = 1 − βφ (θ ). Somewhat ironically, for θ ∈ Θ0 ,
the power function is actually the size function because πφ (θ ) = R(θ , φ ) is the
probability of Type I error for θ ∈ Θ0 . The size of the test is supθ ∈Θ0 πφ (θ ).
   A test φ̃ in a class of tests C is uniformly most powerful (UMP) of size α in C if
α = sup_{θ ∈Θ0} R(θ , φ̃ ) and if for any other test φ ∈ C with α ≥ sup_{θ ∈Θ0} R(θ , φ ),
                         πφ̃ (θ ) ≥ πφ (θ ) for all θ ∈ Θ1 .
If C is the collection of all tests, we merely say that φ̃ is uniformly most powerful.
   Two restricted classes of tests are often used.
   A size α test φ is said to be unbiased if
                               sup πφ (θ ) ≤ inf πφ (θ ).
                               θ ∈Θ0                 θ ∈Θ1
To carry out a test we need a test statistic that may or may not depend on particular
values of θ . Not infrequently, test statistics depend on a value θ0 ∈ Θ0 .
For testing H0 : θ = θ0 versus H1 : θ = θ1 , the Neyman-Pearson approach provides
a most powerful test φ̃ that rejects when f (y|θ1 ) > K f (y|θ0 ), rejects with probability
γ(y) when f (y|θ1 ) = K f (y|θ0 ), and accepts when f (y|θ1 ) < K f (y|θ0 ). One of the
nice things about this test is that it is not a very randomized rule. Only when
f (y|θ1 ) = K f (y|θ0 ) do you have to resort to randomization for picking an action.
Nonetheless, when y has a discrete distribution, randomized actions are a vital part
of the theory.
   Notice that the test φ̃ can be rewritten as
                     φ̃ (y) = 1       if f (y|θ1 ) − K f (y|θ0 ) > 0,
                             γ(y)    if f (y|θ1 ) − K f (y|θ0 ) = 0,
                             0       if f (y|θ1 ) − K f (y|θ0 ) < 0.
This form of the test is actually more convenient for our next proof and three ex-
amples. The test can also be written in terms of the likelihood ratio f (y|θ1 )/ f (y|θ0 )
being greater than, equal to, or less than K but that form requires us to worry about
what happens when f (y|θ0 ) = 0.
Consider
               ∫ [φ̃ (v) − φ (v)][ f (v|θ1 ) − K f (v|θ0 )] dv ≥ 0.                  (1)
If φ̃ has size α and φ has size no larger than α, then ∫ [φ̃ (v) − φ (v)] f (v|θ0 ) dv ≥ 0,
so (1) implies
       ∫ [φ̃ (v) − φ (v)] f (v|θ1 ) dv ≥ K ∫ [φ̃ (v) − φ (v)] f (v|θ0 ) dv ≥ 0.
However, this is just the difference in the powers of the two tests, so φ̃ must have at
least as much power as φ .
     To establish (1) it suffices to show that [φ̃ (v) − φ (v)][ f (v|θ1 ) − K f (v|θ0 )] ≥
0. We consider three cases: when the second term is positive, negative, and 0.
When [ f (v|θ1 ) − K f (v|θ0 )] > 0, we have φ̃ (v) = 1 and since 0 ≤ φ (v) ≤ 1, we
have [φ̃ (v) − φ (v)] ≥ 0. Thus [φ̃ (v) − φ (v)][ f (v|θ1 ) − K f (v|θ0 )] ≥ 0. Similarly,
when [ f (v|θ1 ) − K f (v|θ0 )] < 0, we have φ̃ (v) = 0, so [φ̃ (v) − φ (v)] ≤ 0, and
[φ̃ (v) − φ (v)][ f (v|θ1 ) − K f (v|θ0 )] ≥ 0. Finally, when [ f (v|θ1 ) − K f (v|θ0 )] = 0, we
have [φ̃ (v) − φ (v)][ f (v|θ1 ) − K f (v|θ0 )] = 0.                                           2
    For values v with [ f (v|θ1 ) − K f (v|θ0 )] = 0, there are many functions γ(v) that
can give a size α most powerful test, however there always exists a constant function
γ(v) ≡ γ0 that will give a most powerful test. In particular, to get an α level test, take
K̃ to be the smallest K value with 1 − α ≤ Py|θ0 [ f (y|θ1 ) ≤ K f (y|θ0 )]. Then define
α0 ≡ Py|θ0 [ f (y|θ1 ) > K̃ f (y|θ0 )] ≤ α. As a function of K, the function Py|θ0 [ f (y|θ1 ) ≤
K f (y|θ0 )] is either continuous at K̃ or it is not. If it is continuous, α0 = α, and
taking γ0 = 0 we are done. If it is discontinuous then α0 < α and we must have
0 < η0 ≡ Py|θ0 [ f (y|θ1 ) = K̃ f (y|θ0 )]. In that case take γ0 = (α − α0 )/η0 and we are
done.
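   The following sketch (mine; the specific numbers are arbitrary) carries out this recipe for K̃ and γ0 numerically, constructing the most powerful size .05 test of H0 : p = .5 versus H1 : p = .7 from a Bin(10, p) observation:

import numpy as np
from scipy.stats import binom

n, p0, p1, alpha = 10, 0.5, 0.7, 0.05
x = np.arange(n + 1)
f0, f1 = binom.pmf(x, n, p0), binom.pmf(x, n, p1)
order = np.argsort(-(f1 / f0))            # outcomes sorted by likelihood ratio
cum = np.cumsum(f0[order])
reject = order[cum <= alpha]              # reject outright: here x = 10, 9
alpha0 = f0[reject].sum()                 # null probability of outright rejection
boundary = order[len(reject)]             # here x = 8: randomize at this outcome
gamma0 = (alpha - alpha0) / f0[boundary]  # (alpha - alpha0)/eta0, as in the text
print(gamma0)                                      # about 0.89
print(alpha0 + gamma0 * f0[boundary])              # size = 0.05 exactly
print(f1[reject].sum() + gamma0 * f1[boundary])    # power of the MP test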
EXAMPLE 7.1.2. Test H0 : y ∼ U[−1, 1] versus H1 : y ∼ N(0, 1). Write the stan-
dard normal density as ϕ(y) ≡ exp(−y²/2)/√(2π). Then
   f (y|θ1 ) − K f (y|θ0 ) = ϕ(y) − KI[−1,1] (y)/2 = ϕ(y)         if y < −1,
                                                     ϕ(y) − K/2 if −1 ≤ y ≤ 1,
                                                     ϕ(y)         if 1 < y.
We always reject when |y| > 1 because then ϕ(y) > 0 and f (y|θ1 ) − K f (y|θ0 ) > 0.
For any K ≥ 0, P[ϕ(y) − K/2 = 0] = 0 under both hypotheses, so there is no need to
worry about randomized tests. We accept H0 if ϕ(y) − K/2 < 0 and that depends
specifically on the value of K. To get a most powerful size α test pick y0 = α so that
under the null uniform distribution P[|y| < y0 ] = α. Take K so that K/2 = ϕ(y0 ),
thus ϕ(y) − K/2 > 0 iff (if and only if) ϕ(y) − ϕ(y0 ) > 0 iff |y| < y0 . So the most
powerful size α test rejects when |y| < α or |y| > 1.                               2
EXAMPLE 7.1.3. Now we reverse the roles of the distributions in the previous
example and test H0 : y ∼ N(0, 1) versus H1 : y ∼ U[−1, 1]. Again write the standard
normal density as ϕ(y) = exp(−y²/2)/√(2π). We get
   f (y|θ1 ) − K f (y|θ0 ) = I[−1,1] (y)/2 − Kϕ(y) = −Kϕ(y)        if y < −1,
                                                     0.5 − Kϕ(y) if −1 ≤ y ≤ 1,
                                                     −Kϕ(y)        if 1 < y.
For K > 0, you never reject when |y| > 1 because then ϕ(y) > 0 and f (y|θ1 ) −
K f (y|θ0 ) < 0. To get a most powerful size α test pick y0 so that under the null
standard normal distribution P[1 ≥ |y| > y0 ] = α. Take K > 0 so that 1/(2K) = ϕ(y0 ),
thus for −1 ≤ y ≤ 1, 0.5 − Kϕ(y) > 0 iff ϕ(y0 ) − ϕ(y) > 0 iff 1 ≥ |y| > y0 . So the
most powerful size α test rejects when 1 ≥ |y| > y0 .                          2
By the factorization criterion, a sufficient statistic T (y) gives f (y|θ ) = g[T (y); θ ]h(y),
so the likelihood ratio depends on y only through T (y) and a most powerful test exists
that depends only on the sufficient statistic. More generally, we can consider T ≡ T (y)
with density fT (t|θ ) = hT (t)g(t; θ ) and write a most powerful test φ̃ (randomized
decision rule) that rejects the null hypothesis with probabilities specified by
                     φ̃ (T ) = 1      if fT (T |θ1 ) > K fT (T |θ0 ),
                              γ(T ) if fT (T |θ1 ) = K fT (T |θ0 ),
                              0      if fT (T |θ1 ) < K fT (T |θ0 ).
Clearly, any size α test of this form will also have, when considered as a function of y,
the form of a most powerful α level test.
Definition 7.2.1.     The densities are said to have monotone likelihood ratio if for
any θ1 > θ0 , the ratio f (v|θ1 )/ f (v|θ0 ) is a monotone function in v whenever both
densities are nonzero.
   For the general exponential family, it suffices to have η(θ ) increasing and T (v)
monotone. We tend to think in terms of increasing likelihood ratios but the theory
works as well for decreasing likelihood ratios.
   More importantly, the placeholder variable v in the definition is implicitly real
valued, which means that the definition applies to random variables y rather than
random vectors. In practice, we will apply the definition to situations in which a real
valued sufficient statistic T (y) exists, and require monotone likelihood ratio in the
densities of the sufficient statistic.
Theorem 7.2.1.         If T has nondecreasing likelihood ratio, then any test of the
form
                               φ̃ (T ) = 1 if T > t0 ,
                                        γ̃ if T = t0 ,
                                        0 if T < t0 ,
has nondecreasing size-power function E_{T |θ }[φ̃ (T )] and is uniformly most powerful
of its size for testing H0 : θ ≤ θ0 versus H1 : θ > θ0 for any θ0 . Moreover, for any
0 ≤ α ≤ 1, there exist a t0 and γ̃ that give a size α test.
    The values t0 and γ̃ are chosen to give a size α test at θ = θ0 but the test does not
depend on θ1 > θ0 , so if it is most powerful for any θ1 it is most powerful for all of
them.
    For θ < θ0 , if we think about θ0 as the alternative, the power is α so the size
must be less than that. More specifically, if the size at θ is ξ , then we can make
a test that always rejects with probability ξ , and the most powerful test at alternative
θ0 has to have power at least as great as ξ .
PROOF:    For any θ0 < θ1 , the most powerful test has the form
                    φ̃ (T ) = 1      if fT (T |θ1 ) > K fT (T |θ0 ),
                             γ(T ) if fT (T |θ1 ) = K fT (T |θ0 ),
                             0      if fT (T |θ1 ) < K fT (T |θ0 ).
We need to show that the test in the theorem can be written in this form.
    Example: t ∼ N(θ , 1). For θ1 > θ0 , algebra shows that f (t|θ1 ) > K f (t|θ0 ) iff
t > (θ1 + θ0 )/2 + log(K)/(θ1 − θ0 ), so the likelihood ratio is increasing in t.
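To spell out the algebra:
    f (t|θ1 )/ f (t|θ0 ) = exp[−(t − θ1 )²/2 + (t − θ0 )²/2] = exp[t(θ1 − θ0 ) − (θ1² − θ0²)/2],
so the ratio exceeds K iff t(θ1 − θ0 ) > log K + (θ1² − θ0²)/2, i.e., iff t > (θ1 + θ0 )/2 + log(K)/(θ1 − θ0 ).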
Exercise 7.1.      Assume y ∼ N(0, σ ²). Show that the problem displays monotone
likelihood ratio for 0 < σ ² ≤ 1. Find the UMP test for H0 : σ ² = 1 versus H1 : σ ² < 1.
   The composite versus composite example in Section 3.3 had a uniformly most
powerful test without having monotone likelihood ratio because it was monotone
where it counted. The likelihood ratios were increasing when θ0 ∈ Θ0 and θ1 ∈ ΘA
but that broke down when looking at two distributions from the same hypothesis.
Monotone likelihood ratio (for θ s both in Θ0 ) also assures that the size of a test is
the size associated with the largest θ0 ∈ Θ0 .
I think it was Ed Bedrick who pointed out to me that the normal theory two-sided
t is a clearly reasonable thing to do. So the fact that it is UMPU is less evidence
that it is a reasonable test and more a matter of giving credence to UMPU being a
reasonable criterion on which to base a test.
    Locally best tests.
As I recall from Lehmann's wonderful (2011) book, some people, I believe Gosset
and Egon Pearson, were dissatisfied with the fact that significance tests did not tell
you what was wrong with the null model. So E. Pearson came up with the idea of
specifying an alternative hypothesis and the generalized likelihood ratio test statistic
– before he and Neyman began collaborating on the theory of hypothesis testing.
   This involves partitioning Θ into Θ0 (the null hypothesis) and Θ1 (the alternative
hypothesis). The generalized likelihood ratio test statistic is
               λ (y) ≡ sup_{θ ∈Θ0} f (y|θ ) / sup_{θ ∈Θ} f (y|θ ),
and the null model is rejected for small values of λ (y).
Geisser (2005) contains a favorite example from Hacking (1965) illustrating foun-
dational issues related to testing.
   Consider the null model θ = 0 in which X takes on the values 0, 1, . . . , 100 with
               Pr[X = 0|θ = 0] = 0.9, Pr[X = x|θ = 0] = 0.001, x = 1, . . . , 100,
and alternatives θ = i, i = 1, . . . , 100, with
               Pr[X = 0|θ = i] = 0.91, Pr[X = i|θ = i] = 0.09.
The significance test rejects when X ̸= 0, the least probable outcomes under the null
model.
   The significance test also defines a Neyman-Pearson test, so we can explore its
N-P properties. In this example, the probability of Type I error is 0.1. When used in
N-P testing, significance tests can have very poor power for some alternatives since
they are constructed without reference to any alternative. Here the significance test
has power 0.09 regardless of the alternative, so its power is less than its size. This is
not surprising. Given any test, you can always construct an alternative that will have
power less than the size.
   The most powerful test for an alternative θ > 0 depends on θ , so a uniformly
most powerful test does not exist. The significance test is also the likelihood ratio
test. The likelihood ratio examines the transformation
                                      Pr[X = x|θ = 0]
                     T (x) =
                               max_{i=0,...,100} Pr[X = x|θ = i]
                           = 0.9/0.91 = 0.989        if x = 0,
                             0.001/0.09 = 1/90       if x ̸= 0,
and rejects for small values of the test statistic T (X). That the likelihood ratio test
has power less than its size IS surprising.
   The uniformly most powerful invariant (UMPI) test of size .1 is a randomized
test. It rejects when X = 0 with probability 1/9. The size is .9(1/9) = 0.1 and the
power is 0.91(1/9) > 0.1. Note, however, that observing X = 0 does not contradict
the null hypothesis because X = 0 is the most probable outcome under the null
hypothesis. Moreover, the test does not reject for any value X ̸= 0, even though such
data are 90 times more likely to come from the alternative θ = X than from the null.
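   The size and power claims are simple arithmetic; a minimal check in code (mine), using the null and alternative probabilities given above:

# null: Pr[X=0] = .9, Pr[X=x] = .001 for x = 1,...,100
# alternative theta = i: Pr[X=0] = .91, Pr[X=i] = .09
print(100 * 0.001)   # 0.1  : size of the significance test (reject when X != 0)
print(0.09)          # 0.09 : its power against every alternative
print(0.9 / 9)       # 0.1  : size of the randomized test (reject X = 0 w.p. 1/9)
print(0.91 / 9)      # 0.101: its power, barely above its size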
Chapter 8
UMPI Tests for Linear Models

We examine the transformations necessary for establishing that the linear model
F test is a uniformly most powerful invariant (UMPI) test. We also note that the
Studentized range test for equality of group means in a balanced one-way ANOVA
is not invariant under all of these transformations, so the UMPI result says nothing
about the relative powers of the ANOVA F test and the Studentized range test. The
discussion uses notation from Christensen (2020).
8.1 Introduction
It has been well-known for a long time that the linear model F test is a uniformly
most powerful invariant (UMPI) test. Lehmann (1959) discussed the result in the
first edition of his classic text and in all subsequent editions, e.g. Lehmann and
Romano (2005). But the exact nature of this result is a bit convoluted and may be
worth looking at with some simpler and more modern terminology.
    Consider a (full) linear model
                         Y = Xβ + e, e ∼ N(0, σ ²I),
and a reduced (null) model Y = X0 γ + e with C(X0 ) ⊂ C(X). Let M and M0 denote
the perpendicular projection operators onto C(X) and C(X0 ). The test of the reduced
model rejects for large values of
                                  Y ′ (M − M0 )Y /[r(X) − r(X0 )]
                    F(Y ) ≡ F ≡                                   .
                                       Y ′ (I − M)Y /[n − r(X)]
For a balanced one-way ANOVA model
         yi j = µi + εi j , εi j iid N(0, σ ²), i = 1, . . . , a, j = 1, . . . , N,
consider the hypothesis H0 : µ1 = · · · = µa . The best known competitor to an F test
for H0 is the test that rejects for large values of the studentized range,
                                        maxi ȳi· − mini ȳi·
                         Q(Y ) ≡ Q ≡                          .
                                            √(MSE/N)
We already know that F is location and scale invariant and it is easy to see that Q is
too. In this case, location invariance means that the test statistic remains the same if
we add a constant to every observation. Moreover, it is reasonably well known that
neither of these tests is uniformly superior to the other, which means that Q must
not be invariant under the full range of transformations that are required to make F
a UMPI test.
   We can decompose Y into three orthogonal pieces,
                        Y = M0Y + (M − M0 )Y + (I − M)Y.                          (1)
The first term of the decomposition contains the fitted values for the reduced model.
The second term is the difference between the fitted values of the full model and
those of the reduced model. The last term is the residual vector from the full model.
Intuitively we can think of the transformations that define the invariance as relating
to the three parts of this decomposition. The residuals are used to estimate σ , the
scale parameter of the linear model, so we can think of scale invariance as relating
to (I − M)Y . The translation invariance of adding vectors X0 δ modifies M0Y . To get
the full set of invariance transformations we also need to modify (M − M0 )Y : for any
n vector v with ∥(M − M0 )v∥ ̸= 0, replace (M − M0 )Y with the vector of the same
length
                              ∥(M − M0 )Y ∥
                                             (M − M0 )v.
                              ∥(M − M0 )v∥
Finally, the complete set of transformations to obtain the UMPI result is, for any
positive number a, any appropriate size vector δ , and any n vector v with ∥(M −
M0 )v∥ ̸= 0,
                                              ∥(M − M0 )Y ∥
        G(Y ) = a ( M0Y + X0 δ + (I − M)Y +                 (M − M0 )v ).
                                              ∥(M − M0 )v∥
For the one-way ANOVA model,
   (M − M0 )Y = X β̂ − X0 γ̂ = (ȳ1· J′N , ȳ2· J′N , . . . , ȳa· J′N )′ − ȳ·· JaN
              = [(ȳ1· − ȳ·· )J′N , . . . , (ȳa· − ȳ·· )J′N ]′,                    (2)
and
              ∥(M − M0 )Y ∥² = N ∑_{i=1}^a (ȳi· − ȳ·· )² ≡ SSGrps.                  (3)
   Thinking about the decomposition in (1), if Q(Y ) were invariant we should get
the same test statistic if we replace M0Y with M0Y +X0 δ (which we do) and if we re-
place (M − M0 )Y with [∥(M − M0 )Y ∥/∥(M − M0 )v∥](M − M0 )v (which we do not).
The numerator of Q is a function of (M − M0 )Y , namely, it takes the difference be-
tween the largest and smallest components of (M − M0 )Y . For Q(Y ) to be invariant,
the max minus the min of (M − M0 )Y would have to be the same as the max minus
the min of [∥(M − M0 )Y ∥/∥(M − M0 )v∥](M − M0 )v for any Y and v. Alternatively,
the max minus the min of [1/∥(M − M0 )Y ∥](M − M0 )Y would have to be the same
as the max minus the min of [1/∥(M − M0 )v∥](M − M0 )v for any Y and v. In other
words, given (2) and (3),
                               maxi ȳi· − mini ȳi·
                                   √SSGrps
would have to be a constant for any data vector Y , which it is not.
Chapter 9
Significance Testing for Composite Hypotheses
In their most natural form, significance tests typically lead to two-sided t and χ 2
tests. We review significance tests as probabilistic proofs by contradiction. We em-
phasize an appropriate definition of p values for significance testing and compare
it to alternate definitions. We introduce the concept of composite significance tests,
and illustrate that they can generate one-sided tests. We review interval estimation
for both parameters and predictions based on significance testing and illustrate that
one-sided interval estimates can be constructed from composite significance tests.
Finally, we address the issue of multiple comparisons in the context of significance
testing.
9.1 Introduction
Schervish, M.J. (1996), “P-Values: What They Are and What They Are Not,” The
American Statistician, 50, 203-206.
    Fisher (1956, p. 94) seems to be saying that the significance of a composite hy-
pothesis is the significance of each individual test. This can differ radically from the
N-P probability of Type I error. Suppose H0 : (A ∩ B)c is true and we reject at level α
if both Ac and Bc are rejected at level α. The example really lends itself to an
alternative H1 : A ∩ B is true. The probability of Type I error is then much smaller
than the significance level.
    Fisher was certainly interested in one-sided rejection regions!
    Fisher (1925, pp. 80-81): In preparing this table we have borne in mind that in practice
we do not always want to know the exact value of P for any observed χ 2 , but,
in the first place, whether or not the observed value is open to suspicion. If P is
between .1 and .9 there is certainly no reason to suspect the hypothesis tested. If it
is below .02 it is strongly indicated that the hypothesis fails to account for the whole
of the facts. Belief in the hypothesis as an accurate representation of the population
sampled is confronted by the logical disjunction: Either the hypothesis is untrue or
the value of χ 2 has attained by chance an exceptionally high value. The actual value
of P obtainable from the table by interpolation indicates the strength of the evidence
against the hypothesis. A value of χ 2 exceeding the 5 per cent point is seldom to be
disregarded.
   a paragraph later
   The term Goodness of Fit has caused some to fall into the fallacy of believing
that the higher the value of P the more satisfactorily is the hypothesis verified. Val-
ues over .999 have sometimes been reported which, if the hypothesis were true,
would only occur once in a thousand trials. Generally such cases are demonstrably
due to the use of inaccurate formulae, but occasionally small values of χ 2 beyond
the expected range do occur, as in Ex. 4 with the colony numbers obtained in the
plating method of bacterial counting. In these cases the hypothesis considered is as
definitely disproved as if P had been .001.
   Significance (Fisherian) tests are probabilistic versions of proof by contradic-
tion. A probabilistic model is assumed and observed data are either deemed to be
inconsistent with the model, which suggests that the model is wrong, or the data are
consistent with the model, which suggests very little. Data consistent with the as-
sumed model are almost always equally consistent with other models. The extent to
which the data are consistent with the model is measured using a p value with small
p values indicating data that are inconsistent with the model. Significance testing
is distinct from the Neyman-Pearson theory of hypothesis testing, see Christensen
(2005).
   Significance tests for unimodal distributions typically yield two-sided tests. We
begin with a discussion of simple significance tests and alternative definitions of p
values. Section 3 discusses extensions of simple significance tests that include the
possibility of one-sided tests. Section 4 discusses interval estimation with emphasis
on the significance testing interpretation of prediction intervals and one-sided inter-
vals. Finally, Section 5 briefly addresses the role of multiple comparison corrections
in significance testing.
9.2 Simple Significance Tests

The standard form for significance testing assumes some model for a data vector
y = (y1 , . . . , yn )′ . A statistic W ≡ W (y) with a known distribution is chosen to serve
as a test statistic. Denote W ’s (possibly discrete) density f (w). The model is called in
question if the observed value of the test statistic looks weird relative to the density
f (w). Denote the observed data yobs and let Wobs = W (yobs ).
   To illustrate, assume y1 , . . . , yn iid N(0, σ ²), then a standard test statistic is
                                            ȳ· − 0
                              T ≡ T (y) ≡           ∼ t(n − 1),
                                            s/√n
where ȳ· and s are the sample mean and standard deviation, respectively. The weird-
est data are those that correspond to small densities under the t(n − 1) distribution.
The t(n − 1) density decreases away from 0, so the weirdest observations are far
from 0.
   The p value is the probability of observing a test statistic as weird or weirder than
we actually saw. In the illustration, because the t(n − 1) density is symmetric about
0, with Tobs ≡ T (yobs ) the p value is
                         p = Pr[|T | ≥ |Tobs |] = 2 Pr[T ≥ |Tobs |].
A small p value suggests that something is wrong with the model. Perhaps the mean
is not 0 but perhaps the data are not normal, are not independent, or are heteroscedas-
tic. Interestingly, this two-sided t(n − 1) test, when using the alternative test statistic
T 2 , corresponds to a one-sided F(1, n − 1) test, because the mode of an F(1, n − 1)
distribution is at 0.
    In general, with Wobs ≡ W (yobs ), the p value is
                            p = Pr[ f (W ) ≤ f (Wobs )],
the probability of seeing data with a density value as small or smaller than that of
the observed statistic.
For a less standard illustration, assume y1 , . . . , yn iid N(µ, 4). With a test statistic
                                W = (n − 1)s²/4 ∼ χ ²(n − 1),
denote the density χ ²(w|n − 1). For n − 1 > 2, unless Wobs happens to be the mode,
there are two values w1 < w2 that have equal densities, χ ²(w1 |n − 1) = χ ²(w2 |n − 1),
one of them being Wobs . Data as weird or weirder than Wobs are those with smaller
density values, so
                         p = Pr[W ≤ w1 ] + Pr[W ≥ w2 ].
In the numerical illustration, w1 = 4.21 and w2 = 16.50, so
                 p = Pr[W ≤ 4.21] + Pr[W ≥ 16.50] = .04 + .12 = .16.
While the machinations needed to find the p value may seem a bit complicated,
they are simpler than those necessary to find a Neyman-Pearson theory two-sided
uniformly most powerful unbiased test, see Lehmann (1997, pp.139, 194).
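   A sketch for computing such a two-sided significance p value numerically (mine): it finds the second point with matching density and adds the two tail probabilities. With n − 1 = 11 degrees of freedom it reproduces the .16 above, which is evidently the setting of the example.

import numpy as np
from scipy.stats import chi2
from scipy.optimize import brentq

def chi2_sig_pvalue(w_obs, df):
    # total chi2(df) probability of values with density <= density at w_obs
    mode = df - 2.0                      # the density peaks at df - 2 for df > 2
    d = chi2.pdf(w_obs, df)
    if w_obs < mode:                     # match w_obs with a point above the mode
        lo, hi = w_obs, brentq(lambda w: chi2.pdf(w, df) - d, mode, 100 * df)
    else:                                # or with a point below the mode
        lo, hi = brentq(lambda w: chi2.pdf(w, df) - d, 1e-10, mode), w_obs
    return chi2.cdf(lo, df) + chi2.sf(hi, df)

print(chi2_sig_pvalue(4.21, 11))         # about 0.16, as in the example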
    Although we are actually performing a test of the entire model that has been
assumed for the data, under the influence of Neyman-Pearson theory, these illus-
trations are often called tests of the null hypotheses H0 : µ = 0 and H0 : σ 2 = 4,
respectively. In Neyman-Pearson theory, the hypotheses H0 : µ = 0 and H0 : σ 2 = 4
are also referred to as composite hypotheses because the first test does not specify
σ 2 and the second test does not specify µ. However, in our two examples the model
and test statistic pairs provide simple significance tests.
The likelihood ratio test p value, p̃, is the probability under the null model of seeing
data with a likelihood ratio as small or smaller than that actually observed, hence
                   p̃(yobs ) = 1      if yobs = 0,
                              .999   if yobs = 1, . . . , 95,
                              .049   if yobs = 96,
                              0      if yobs = 97.
The point is that observing a 96 in no way contradicts the null model; it is the
observation most likely to be observed, yet p̃ is small. On the other hand, observing
0 tends to contradict the null model, but p̃ is large.
   Cox’s (2005) approach seems to rely on the existence of an informal alternative
suggesting that, say, larger values of the test statistic provide more evidence against
the null model. In this case the p value, here called p̆, as a function of yobs becomes
              p̆(yobs ) = 1                           if yobs = 0,
                         .049 + .01(96 − yobs )      if yobs = 1, . . . , 95,
                         .049                        if yobs = 96,
                         0                           if yobs = 97.
This is similar to p̃ in that p̆ again rejects the null model for the data most consistent
with it (yobs = 96) and fails to reject for data that are most inconsistent with the null
model (yobs = 0).
    The significance test is designed as a probabilistic proof by contradiction of the
null model. In the parlance of philosophy of science, it is a probabilistic method
for falsifying the null model. The evidence against the null is appropriately mea-
sured by p with small values required to conclude that the data are inconsistent with
the null model. Neither p̃ nor p̆ provides the basis for such a proof by contradiction.
The other “p” values provide appropriate, if not good, measures for evaluating the
evidence between the null and some alternative. Since they do not provide a proof
by contradiction of the null model, they seem to display an unnatural focus on the
null hypothesis. With a formal alternative available, there seems to be little reason to
focus on the null hypothesis as opposed to the alternative, hence little reason to re-
strict one’s self to tests in which the p value or probability of Type I error is small. In
a decision procedure, one should play off the probabilities of both Type I and Type
II error so that both are reasonable. See Christensen (2005) for more discussion of
testing null versus alternative hypotheses and in particular the virtues of Bayesian
testing for evaluating the weight of evidence between the hypotheses.
    The difficulty with significance testing is picking a test statistic with a known dis-
tribution. Fisher (1956, p.50) suggests that the choice of W should be made subject
to the analyst’s prior information. However, the requirement that W have a known
distribution under the model can be quite restrictive. Composite significance testing
discussed in the next section both relaxes this assumption and can generate familiar
one-sided tests for common problems.
9.3 Composite Significance Tests

We now present a significance testing approach to expanded models that can gener-
ate familiar one-sided tests. We begin with two simple examples.
   Suppose our model is that the data come from one of the distributions
                         y ∼ N(µ, 1); µ ≤ 0.                                    (3.1)
With one observation, take y as the test statistic. There are no distributions in this
model that would make observing a y of 2 or anything larger than a 2 plausible.
On the other hand, any observation less than 0 is completely plausible. The trick is
deciding what values between 0 and 2 are plausible and how to quantify that idea.
Clearly, for observations that are positive, the probability of seeing something as
weird or weirder than we actually saw is appropriately measured by the probability
that a standard normal is larger than yobs . We take the position that all values for yobs
of 0 or less are perfectly consistent with the model, hence the p value as a function
of yobs is discontinuous at 0, jumping from 1 to .5. (A case could be made that p
should increase continuously to a value approaching 1 for huge negative yobs .)
   In significance testing one typically only specifies a null model. There is no al-
ternative model. Although one could specify an alternative to model (3.1) that is
simply “not model (3.1),” that alternative cannot be specified as µ ≥ 0 or even
y ∼ N(µ, 1); µ ≥ 0. We might observe y ≥ 2 because the true distribution is a
Cauchy centered at 0. There is about a 15% chance of seeing such data from a
Cauchy. The point is that seeing y ≥ 2 makes model (3.1) implausible. In the ab-
sence of other assumptions, it does not suggest what the true model might be. If you
are willing to make such assumptions, you should do Bayesian or Neyman-Pearson
testing.
   Now suppose our model is
                         y ∼ N(µ, 1); −1 < µ < 0.
Not only are y values of 2 and larger implausible for all allowable µ but values of
−3 and lower are correspondingly implausible. The p value for observing 2 should
be the same as the p value for observing −3. An appropriate quantification seems
to be
                          p(µ) = Pr[y ≤ −3] + Pr[y ≥ 2]
computed under either µ = −1 or 0. The two values are both .024. Another reason-
able choice for computing the p value could be µ = −.5 but that minimizes p(µ).
To ensure that the data contradict the model, the appropriate p value is the largest of
the values p(µ), i.e.,
                                  p = sup p(µ).
                                        −1<µ<0
If there is any value for µ that makes p(µ) large, the data do not contradict the
model.
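   A sketch of the computation for this illustration (mine): evaluate p(µ) on a grid over −1 < µ < 0 and take the supremum.

import numpy as np
from scipy.stats import norm

# model: y ~ N(mu, 1), -1 < mu < 0; y_obs = 2 is matched in weirdness by y = -3
def p_mu(mu):
    return norm.cdf(-3.0 - mu) + norm.sf(2.0 - mu)

grid = np.linspace(-1.0, 0.0, 1001)
print(p_mu(-1.0), p_mu(0.0))    # both about .024, attained at the endpoints
print(p_mu(-0.5))               # about .012: the interior minimizes p(mu)
print(p_mu(grid).max())         # p = sup over mu: about .024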
General Definitions
We now present general terminology for composite significance tests and then return
to the simple normal illustration.
    A simple significance test, like those illustrated in Section 2, is based on a test
statistic W (y) that has a known distribution under the null model. In our illustra-
tions of composite significance testing, we had a test statistic y but a variety of
distributions for y that depend on a parameter. For the general discussion, we use
an equivalent formulation in which the test is based on a function of the data and
the parameter, a function that has a known distribution. Suppose we have a model
for y with densities q(y|θ ) for θ ∈ Θ0 . A composite significance test is based on a
test function W (y; θ ) such that when θ is the true parameter, the test function has a
known density f (w). Note that if, say, 1 ∈ Θ0 , then W (y; 1) is a random variable but
W (y; 1) ∼ f (w) only if θ = 1.
    The largest density for W (y; θ ) is denoted
                         f∗ (y) ≡ sup_{θ ∈Θ0} f [W (y; θ )].
The function f∗ orders observations y by how weird they are relative to the null
model; f∗ is not a function of θ .
   To compute a probability of obtaining data as weird or weirder than we actually
saw, we pick a θ in Θ0 and compute
                         p(θ ) ≡ Pr_{y|θ}[ f∗ (y) ≤ f∗ (yobs )].
For the data to contradict the null model, this number must be small for every θ ∈
Θ0 , so the p value is defined as
                              p ≡ sup_{θ ∈Θ0} p(θ ).
9.4 Interval Estimation

Fisher was never comfortable with Neyman-Pearson confidence intervals, hence his
development of fiducial intervals, see Fisher (1956). I think that interval estimates
can be developed in a reasonable manner based on significance tests.
     Significance tests are fundamentally based on p values. The standard procedure
with a significance test is to report a p value, the evidence that the data are consistent
with the model. To extend significance tests to interval estimates we first need the
concept of an α level significance test. For α ∈ [0, 1], define an α level significance
test as a test that rejects the null model whenever p ≤ α. If the test is not rejected,
we say that the data are consistent with the null model as determined by an α level
test.
     To get a two-sided t interval estimate, we consider a t test of the null model
y1 , . . . , yn iid N(µ0 , σ 2 ). The null model can be decomposed into the (base) model
y1 , . . . , yn iid N(µ, σ 2 ) and the parametric null hypothesis H0 : µ = µ0 . Under this
construction, the usual two-sided (1 − α)100% interval is precisely the set of all µ0
values that are consistent with both the data and the model as determined by an α
level significance test.
     This idea applies whenever we can separate the null model into two parts: a (base)
model and a parametric null hypothesis indexed by some parameter λ , for which we
have available a simple significance test for every λ = λ0 . A 1 − α interval (actually,
a “regional”) estimate consists of all parameter values λ0 that are consistent with the
data and the model as determined by an α level test.
     Technically, we are specifying a collection of models that are consistent with the
data. In the normal example, there is a collection of models y1 , . . . , yn iid N(µ0 , σ 2 )
that are consistent with the data. If we could find the distribution of the T statistic for
Cauchy data with median µ, we could also discuss the collection of Cauchy models
that are consistent with the data. The significance testing procedure is not telling us
that we should believe the normal model, it is just telling us what the reasonable
µ0 values are, if you believe the normal model. Nonetheless, it is convenient to
refer to the collection of models by specifying the parameter λ , hence the “interval
estimate.” (There is little reason to call this a 1 − α interval estimate rather than
an α level estimate except that they often correspond to Neyman-Pearson 1 − α
confidence intervals.)
     Similarly, to construct a prediction interval for the normal model, we test whether
an independent new observation y f ∼ N(µ, σ 2 ) is consistent with the data and the
model. The standard α level significance test takes the form of rejecting if
                                 |y f − ȳ· |
                                              > t1−α/2 (n − 1),
                              s √(1 + 1/n)
where tα (n − 1) denotes the 100α percentile of the t(n − 1) distribution. The test
would be executed upon observing all of y1 , . . . , yn , y f . Treating y f as the indexing
parameter for the tests, the 1 − α prediction interval consists of all y f values that are
consistent with both the model and the observed data y1 , . . . , yn as determined by an
α level test. The result is the standard prediction interval ȳ· ± t1−α/2 (n − 1)s√(1 + 1/n).
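   A sketch of the computation (mine, with made-up data):

import numpy as np
from scipy.stats import t

y = np.array([9.8, 10.4, 10.1, 9.6, 10.2, 10.0])  # hypothetical observations
n, alpha = len(y), 0.05
ybar, s = y.mean(), y.std(ddof=1)
half = t.ppf(1 - alpha / 2, n - 1) * s * np.sqrt(1 + 1 / n)
print(ybar - half, ybar + half)  # the y_f values not rejected at level alpha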
    The interpretation of the significance testing interval as a collection of param-
eters that are consistent with both the data and the model does not actually presume
the model to be true. However, it is a small step to making that assumption, which
in turn would allow a Bayesian or Neyman-Pearson analysis.
    Finally, consider the collection of models
                         y1 , . . . , yn iid N(µ, σ ²); µ ≤ µ0 ,
indexed by µ0 with the associated t test. The model would not be rejected by an α
level composite significance test for any µ0 above the value that has
                                            ȳobs − µ0
                         T (yobs ; µ0 ) =              = t1−α (n − 1)
                                            sobs /√n
or µ0 = ȳobs − t1−α (n − 1)sobs /√n. This serves as the composite significance testing
lower 1 − α bound for µ0 . It provides an infinite interval estimate for µ0 , not for µ.
The one-sided interval tells us that µ0 , the
                                           √ upper bound on plausible µ values, must
be at least µ0 = ȳobs − t1−α (n − 1)sobs / n. This is a reasonable interpretation, but a
far cry from the usual intuition of a one-sided interval.
    A good Neyman-Pearson-ite would correctly (if perhaps uselessly) interpret a
one-sided confidence interval in terms of its long-run frequency of covering the true
parameter µ. Nonetheless, a one-sided confidence interval might be thought to con-
tain a collection of parameter values µ that are reasonable. That is not the case! The
data are never going to be consistent with any infinite interval of µ values. Suppose
ȳobs = 16, sobs = 4, and n = 16, so t.95 (15) = 1.753 and the .95 one-sided interval
is (14.25, ∞). This is not an interval of µ values that are consistent with the data
because with these data T (y; 116) = −100. The data are clearly inconsistent with
the normal model having µ = 116 even though 116 is well within the one-sided
interval. The large positive values contained in the one-sided interval can only be
deemed consistent with the data as plausible values of µ0 , that is, as plausible upper
bounds for µ.
9.5 Multiple Comparisons

Significance tests are designed to measure how strange a set of data are relative to a
null model. What does
that have to do with the probability of errors in multiple tests? The principles of sig-
nificance testing can be applied to multiple comparisons if we view the multiple tests
as defining one overall test. If you want to be able to make statements about which
individual hypotheses are correct or incorrect, you need to make stronger assump-
tions and use Neyman-Pearson or Bayesian procedures. But significance testing can
perhaps help in identifying individual hypotheses that contribute to the evidence that
the overall null model is wrong.
      The very notion of evaluating the results of a collection of individual tests is
contrary to the nature of significance testing. In significance testing, the collection
of tests needs to be combined into an overall measure of the evidence against some
null model. This usually amounts to combining the individual tests into one test
of a collective null model. (The overall null model may be nothing more than the
collection of null models associated with the individual tests.)
      Consider the common problem of significance testing for outliers in a normal
linear model. For each of n data points, we get an associated t statistic, say ti,obs ,
which is one observation on a random variable ti that has a t(d f E − 1) distribution
where d f E is the degrees of freedom for Error when fitting the model. The random
variables ti typically are correlated. If we came into the problem with a suspicion that case i′ might be an outlier, we could do a standard t test for that one case, comparing $t_{i',obs}$ to a t(dfE − 1) distribution to obtain a p value.
      More commonly, we scan through the n different ti statistics to see if any of
them have large absolute values. In essence, we base our conclusion on the value of
maxi |ti,obs |. Recall Fisher’s dictum that one chooses the test statistic based on one’s
ideas about what may go wrong with the model. However, the test is still just a test
of whether the data (as summarized by the test statistic) are consistent with the null
model.
      To compute a p value, we compare the number $\max_i |t_{i,obs}|$ to the distribution of the maximum of the random variables $|t_i|$. Finding the distribution of $\max_i |t_i|$ is difficult, but large values of $\max_i |t_i|$ are clearly the values inconsistent with the null model. (Each $t_i$ has a density that is symmetric about 0 and decreases monotonically away from 0.) Therefore,
$$p = \Pr\left[\max_i |t_i| \ge \max_i |t_{i,obs}|\right].$$
The maximum of the $|t_i|$s is at least as large as $\max_i |t_{i,obs}|$ if (and only if) at least one of the $|t_i|$s is as large as our observed value, so
$$p = \Pr\left[\bigcup_{i}\left\{|t_i| \ge \max_j |t_{j,obs}|\right\}\right].$$
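The null distribution of $\max_i |t_i|$ is awkward analytically but easy to simulate: the studentized deleted residuals do not depend on β or σ², so one can simulate from any member of the null model. A Monte Carlo sketch (the function name and approach are mine, not a computation from the text):

```python
import numpy as np

def max_abs_t_pvalue(X, y, nsim=10000, seed=0):
    """Monte Carlo p value for the outlier test based on max_i |t_i|, the
    largest absolute studentized deleted residual.  Under the null model the
    joint distribution of the t_i is free of beta and sigma^2, so the null
    distribution of max_i |t_i| can be simulated with standard normal errors."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    M = X @ np.linalg.pinv(X)               # perpendicular projection onto C(X)
    h = np.diag(M)                          # leverages m_ii
    def t_stats(yy):
        e = yy - M @ yy                     # residuals (I - M)yy
        s2 = e @ e / (n - p)                # MSE; dfE = n - p
        r = e / np.sqrt(s2 * (1 - h))       # standardized residuals
        return r * np.sqrt((n - p - 1) / (n - p - r ** 2))  # t_i ~ t(dfE - 1)
    t_obs = np.max(np.abs(t_stats(y)))
    sims = [np.max(np.abs(t_stats(rng.standard_normal(n)))) for _ in range(nsim)]
    return np.mean(np.array(sims) >= t_obs)
```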
While most of the ideas for defining a composite significance test are apparent from
the previous simple normal illustration of Section 3, the notation was developed to
handle more complicated problems. Actually, the test function is W (y; η) for η ∈ Θ0
with W (y; θ ) ∼ f (w). While this assumption is enough to define the test procedure,
to actually compute the p value we need to think of W (y; η) as a random variable for
each fixed η. We know the distribution of W (y; θ ) but we also need the distribution
of W(y; η) when the parameter is θ. The necessity of these requirements becomes clearer when dealing with t tests, but we first introduce the ideas in the context of the
simple normal example.
   With $Z(y; \eta) \equiv y - \eta$, the general definition of $f_*$ is $f_*(y) \equiv \sup_{-1<\mu<0}\phi(Z(y;\mu))$. The earlier analysis allows us to rewrite $f_*$ as
$$f_*(y) = \begin{cases}\phi(Z(y;-1)) & \text{for } y \le -1,\\ \phi(0) & \text{for } -1 \le y \le 0,\\ \phi(Z(y;0)) & \text{for } y \ge 0.\end{cases}$$
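In this simple case $f_*$ is trivial to compute; a two-line sketch, where np.clip finds the maximizing µ on the closure of (−1, 0):

```python
import numpy as np
from scipy.stats import norm

def f_star(y):
    """sup over mu in (-1, 0) of the N(mu, 1) density at y.  The supremum is at
    the endpoint of the interval nearest y, or at mu = y when -1 <= y <= 0."""
    mu_max = np.clip(y, -1.0, 0.0)    # maximizing mu
    return norm.pdf(y - mu_max)       # phi(Z(y; mu_max))
```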
  Because normal distributions with known variance are tractable, we were able to
compute the p value earlier. Nonetheless, the p value can be rewritten in terms of
Z(y; η) as
To compute this last expression we need to know the distributions of the random
variables Z(y; −1) and Z(y; 0) for all µ between −1 and 0. Again, because of the
simple nature of this problem, the distributions of Z(y; −1) and Z(y; 0) are read-
ily available for all µ. In the next example, the equivalent random variables have
noncentral t distributions.
   For the model $y_1, \ldots, y_n$ iid $N(\mu, \sigma^2)$ with $-1 < \mu < 0$ we choose the t test function
$$T(y; \mu) \equiv \frac{\bar{y}_\cdot - \mu}{s/\sqrt{n}} \sim t(n-1).$$
To contradict the model, the data must be inconsistent with each parameter within
the model, that is, T (yobs ; µ) must be inconsistent with the t(n − 1) distribution for
every value of µ allowed in the model. Denote the t(n − 1) density t(·|n − 1). The
supremum of the densities is
$$f_*(y) = \begin{cases} t\left(\frac{\bar{y}_\cdot + 1}{s/\sqrt{n}}\,\Big|\, n-1\right) \equiv t(T(y;-1)|n-1) & \bar{y}_\cdot \le -1,\\ t(0|n-1) & -1 \le \bar{y}_\cdot \le 0,\\ t\left(\frac{\bar{y}_\cdot}{s/\sqrt{n}}\,\Big|\, n-1\right) \equiv t(T(y;0)|n-1) & \bar{y}_\cdot \ge 0.\end{cases}$$
     Suppose $\bar{y}_{obs}$ and $s^2_{obs}$ are such that $T(y_{obs}; 0) = 2$. We must then have $\bar{y}_{obs} > 0$, so
$$f_*(y_{obs}) = t(2|n-1).$$
For any data with $T(y; 0) > 2$, we again have $\bar{y}_\cdot > 0$, so $f_*(y) = t(T(y;0)|n-1) < t(2|n-1)$. Also, for any data with $T(y; -1) \le -2$, we must have $\bar{y}_\cdot < -1$ and by symmetry $f_*(y) = t(T(y;-1)|n-1) \le t(2|n-1)$.
   To find the p value defined by (3.2) we need to maximize the probability that $T(y;-1) \le -2$ or $T(y;0) \ge 2$, that is, maximize the probability of
$$\left\{\frac{\bar{y}_\cdot + 1}{s/\sqrt{n}} \le -2\right\} \bigcup \left\{\frac{\bar{y}_\cdot - 0}{s/\sqrt{n}} \ge 2\right\},$$
over all parameter values in the model. These are disjoint sets, so we can compute the probabilities separately. More formally, define
$$p(\mu) \equiv \Pr_\mu[T(y;-1) \le -2] + \Pr_\mu[T(y;0) \ge 2].$$
Again,
$$p = \sup_{-1<\mu<0} p(\mu).$$
   To compute p we need the distributions of $T(y;-1)$ and $T(y;0)$ for µs between −1 and 0. For µ = 0, p(µ) is the probability that a central t(n − 1) exceeds 2 plus the probability that a noncentral t with parameter 1 is below −2. When µ = −1, p(µ) is the probability that a central t(n − 1) is below −2 plus the probability that a noncentral t with parameter −1 is above 2. Obviously, they are the same number. Other parameters µ give smaller values but involve using two noncentral t(n − 1) distributions. With n − 1 = 111,
Again, this may seem complicated, but the Neyman-Pearson theory for this test is considerably more complicated; see Hodges and Lehmann (1954, Sec. 3).
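A sketch of the computation using scipy's noncentral t. One caveat: under mean µ the noncentrality of T(y; c) is √n(µ − c)/σ, which involves the unknown σ; the "parameter 1" above corresponds to taking √n/σ = 1, so σ is an explicit input here (my presentation, not the text's):

```python
import numpy as np
from scipy import stats

def p_mu(mu, n, sigma):
    """p(mu) = Pr_mu[T(y;-1) <= -2] + Pr_mu[T(y;0) >= 2].  Under mean mu,
    T(y;c) ~ noncentral t(n-1) with noncentrality sqrt(n)(mu - c)/sigma."""
    df = n - 1
    nc_neg1 = np.sqrt(n) * (mu + 1) / sigma   # noncentrality of T(y; -1)
    nc_zero = np.sqrt(n) * mu / sigma         # noncentrality of T(y; 0)
    return stats.nct.cdf(-2, df, nc_neg1) + stats.nct.sf(2, df, nc_zero)

n = 12
p = max(p_mu(m, n, sigma=np.sqrt(n)) for m in np.linspace(-1, 0, 201))
```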
   If null distributions get stochastically larger as θ increases and Θ0 is an interval, do maximum densities occur at the ends of the interval? Relate to the mode.
   Replace the Cox example with X ∼ N(0, σ²), H0: σ² = 1; the Fisherian test is totally different from the N-P test against HA: σ² < 1.
   Look at Fisher's exact test as a one-sided test, especially for extreme outcomes using the negative binomial distribution. For extreme data, the sup is at the boundary. See Yung-Pin Chen (2011), "Do the Chi-Square Test and Fisher's Exact Test Agree in Determining Extreme for 2 × 2 Tables?" The American Statistician, 65:4, 239-245.
   Look at the chi-square test assuming a chi-squared distribution for the test statistic:
$$f_*(y) = \sup_{p\,\text{indep}} \chi^2_3(X^2(y, p)).$$
I think this sup should be achieved, for any y, at the minimum chi-square value, $f_*(y) = \chi^2_3(X^2(y, \hat{p}(y)))$. Then show that $X^2(y, \hat{p}(y))$ is a monotone function of $f_*(y) = \chi^2_3(X^2(y, \hat{p}(y)))$, or maybe that the chi-squared 1 distribution is monotone for chi-squared 3. Would work like a charm if we were only doing one-sided. Or that there is virtually no probability of a chi-squared 3 density getting small when the distribution is actually chi-squared 1; that is, if $X^2(y, \hat{p}(y)) \sim \chi^2_1$ then
$$\Pr\Big[\chi^2_3\big(X^2(y, \hat{p}(y))\big) < \chi^2_3\big(X^2(y_{obs}, \hat{p}(y_{obs}))\big)\Big] \doteq \Pr\Big[X^2(y, \hat{p}(y)) > X^2(y_{obs}, \hat{p}(y_{obs}))\Big]$$
because y values that make the chi-squared 3 density small have virtually no probability under a chi-squared 1. Or does it work backwards?
   $\sup_{p\,\text{indep}} \chi^2_3(X^2(y, p))$ occurs at the same place as $\sup_{p\,\text{indep}} \log[\chi^2_3(X^2(y, p))]$; the function is decreasing in $X^2(y, p)$ because the derivative is
   If you find the sup by minimizing the test statistic, then small values will become a problem:
$$\sup_{p\,\text{indep}} \chi^2_3(X^2(y, p)) = \chi^2_3\Big(\inf_{p\,\text{indep}} X^2(y, p)\Big).$$
   It seems like this needs to compare noncentral chi-squareds with lower df and central chi-squareds with higher df, or vice versa. Simplest is
$$Y = X\beta + e, \qquad e \sim N(0, I),$$
so
$$\|Y - X\beta_0\|^2 \sim \chi^2_n, \qquad \|Y - X\hat{\beta}\|^2 \sim \chi^2_{n-r(X)}.$$
The first leads to a one-sided χ²(n − r) test; the second trivially leads to a two-sided χ²(n − r) test because the test statistic does not depend on β; not sure what the third leads to. We would have to find the density of W3, maximize it relative to β (hopefully this is a function of W2), and find which values of the maximized density are the smallest. Note that W3 has a χ²(n) distribution, but that is only relevant for finding the value of β that maximizes the density and then determining the values of the maximized density that are smaller than the observed maximized density.
Chapter 10
Thoughts on prediction and cross-validation
Suppose we have a random vector (y, x′) where y is a scalar random variable and we want to use x to predict y. We do this by defining some predictor function f(x). We also have a prediction loss function L[y, f(x)] that allows us to evaluate how well a predictor does. We want the f that minimizes
$$\mathrm{E}_{y,x}\{L[y, f(x)]\},$$
which is called the expected prediction loss or the expected prediction error. Also, for whatever predictor we end up using, we want to be able to estimate the expected prediction error.
   TWO EXAMPLES:
   The loss function determines the best predictor. These problems are equivalent
to Bayesian decision problems if we just think of y as θ , the marginal distribution
of y as the prior of θ, and the prediction loss as the decision loss. In this context,
$$\mathrm{E}_{y,x}\{L[y, f(x)]\}$$
is the Bayes risk of the decision problem and the optimal predictor will be the Bayes decision rule.
    The most common loss function for prediction is squared error, see PA Section 6.3,
$$L[y, f(x)] = [y - f(x)]^2,$$
from which it follows that the optimal estimator is the "posterior" mean $m(x) \equiv \mathrm{E}(y|x)$.
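The optimality is a one-line check (a standard argument, nothing special to PA): adding and subtracting m(x) inside the square, the cross term vanishes because E[y − m(x)|x] = 0, so
$$\mathrm{E}\{[y - f(x)]^2 \,|\, x\} = \mathrm{Var}(y|x) + [m(x) - f(x)]^2,$$
which is minimized by taking f(x) = m(x).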
   For the special case when y ∼ Bern(p), write p(x) ≡ E(y|x). The use of squared
error loss leads to estimates of the expected prediction error called Brier scores.
Another option, called Hamming loss, is
$$L[y, f(x)] = \begin{cases} 0 & \text{if } f(x) = y\\ 1 & \text{if } f(x) \ne y.\end{cases}$$
In other words, for Hamming loss if you predict y correctly there is no loss and if you predict it incorrectly the loss is 1. The expected prediction error is just the probability of mispredicting y, i.e., $\Pr_{y,x}[y \ne f(x)]$.
Note that with Hamming loss, it makes no sense to predict a value other than 0 or
1, so these will be referred to as valid predictions. Hamming loss is equivalent to
the Bayes test procedure with y = 1 the alternative hypothesis and y = 0 the null.
The optimal prediction is equivalent to rejecting when the posterior probability of the alternative is greater than 0.5, i.e., the optimal rule δ has
$$\delta(x) = \begin{cases} 1 & \text{if } p(x) > 0.5\\ 0 & \text{if } p(x) < 0.5.\end{cases}$$
We don't care which valid prediction we make (action we take) when p(x) = 0.5. This rule clearly minimizes the conditional expected loss for each x, but using Bayes Theorem one can also show that it has the form of the N-P Lemma, so it is a most powerful test. □
   Note that
$$\mathrm{E}_{y,x}\{L[y, \delta(x)]\} = \Pr_{y,x}[y \ne \delta(x)] = \int_{\{x|p(x)\ge 0.5\}}[1-p(x)]f(x)\,dx + \int_{\{x|p(x)<0.5\}} p(x)f(x)\,dx.$$
    These rules depend on knowing the joint distribution of (y, x′ ), which is generally
unknown in prediction problems. We want to use data to estimate both E(y|x) and
$\mathrm{E}_{y,x}\{L[y, m(x)]\}$. Suppose (y, x′), (y1, x1′), ..., (yn, xn′) are iid. Let Y be the vector of yi s and let X be the matrix with xi′ as its ith row.
    Estimate E(y|x).
    A nonparametric approach to estimating E(y|x) is to identify xi values that are
close to x and estimate E(y|x) by taking a weighted mean of the yi s that correspond
to close xi s. Obviously, the weights on the yi might well depend on how far the xi s
are from x. This is called a nearest-neighbor approach.
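A minimal sketch of such an estimate (inverse-distance weights are one arbitrary choice among many):

```python
import numpy as np

def knn_estimate(x0, X, Y, k=5):
    """Nearest-neighbor estimate of E(y|x0): a weighted mean of the y_i whose
    x_i are closest to x0, with weights decreasing in distance."""
    d = np.linalg.norm(X - x0, axis=1)   # distances from x0 to each x_i
    idx = np.argsort(d)[:k]              # the k closest cases
    w = 1.0 / (d[idx] + 1e-8)            # inverse-distance weights
    return np.sum(w * Y[idx]) / np.sum(w)
```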
    Quite generally, one can assume that E(y|x) is a member of a parametric family,
say m(x; θ ) and use a maximum likelihood estimate of θ , say θ̂ . In this set-up, the xi
are treated as fixed and the distributions of yi given xi are assumed independent and
to be in a parametric family of distributions (largely) determined by its mean. This
is already in the form of nonlinear regression but standard generalized linear models
also fit this paradigm. Nonparametric regression techniques based on basis functions
such as polynomials, wavelets, or sines and cosines also fit into the generalized
linear model paradigm.
    In general, we end up with an estimate $\hat{m}(x) \equiv m(x; \hat{\theta})$.
    Estimate Ey,x {L[y, m(x)]}.
    If we know m, an unbiased estimate is
$$\frac{1}{n}\sum_{i=1}^{n} L[y_i, m(x_i)]. \qquad (1)$$
Since we do not know m, the obvious data-based estimate plugs in $\hat{m}$,
$$\frac{1}{n}\sum_{i=1}^{n} L[y_i, \hat{m}(x_i)]. \qquad (2)$$
Since $\hat{m}$ is a complex function of the data, the expected value of this function is hard to find. Conventional wisdom is that (2) underestimates the true expected prediction error, e.g.,
$$\mathrm{E}\left\{\frac{1}{n}\sum_{i=1}^{n} L[y_i, \hat{m}(x_i)]\right\} \le \mathrm{E}\left\{\frac{1}{n}\sum_{i=1}^{n} L[y_i, m(x_i)]\right\}.$$
I wonder if Eaton’s methods might be able to show this? (We will show not only
that this is true for linear models but that cross-validation can be even more biased
upwards.)
   To “fix” this problem, people try Cross-Validation. Life is much easier if we have
one set of (training) data from which to estimate E(y|x) and a different set of (test)
data from which to estimate Ey,x {L[y, m(x)]}. In such a case, m̂ based on the training
data is a fixed predictor with regard to the test data so equation (1) gives an unbiased
estimate of expected prediction error for m̂ given the training data. One might call
this procedure validation.
   Cross-validation is based on using the validation idea repeatedly with the same
data. For example, k-fold cross-validation randomly divides the data into k subsets
of roughly equal size. First identify one subset as the test data and combine the other
k − 1 subsets into the training data. Estimate the best predictor from the training data
and then use that estimate with the test data to estimate the expected prediction error.
So far, this is just validation and the estimate of the expected prediction error should
be conditionally unbiased.
   However, in k-fold cross-validation there are k possible choices for the test data,
so one goes through all k validation processes and averages the k estimates of the
expected prediction error to give an overall estimate of the expected prediction error.
With n data points, the largest choice for k is n, which is known as leave one out
cross-validation.
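A sketch of the k-fold procedure just described; `fit` and `loss` stand in for whatever estimation method and prediction loss one is using:

```python
import numpy as np

def kfold_cv_error(X, Y, fit, loss, k=10, seed=0):
    """k-fold cross-validation estimate of expected prediction error.
    fit(Xtr, Ytr) returns a predictor m_hat(x); loss(y, yhat) is L[y, f(x)]."""
    rng = np.random.default_rng(seed)
    n = len(Y)
    folds = np.array_split(rng.permutation(n), k)   # k random subsets
    errs = []
    for test in folds:
        train = np.setdiff1d(np.arange(n), test)
        m_hat = fit(X[train], Y[train])             # estimate from training data
        errs.append(np.mean([loss(Y[i], m_hat(X[i])) for i in test]))
    return np.mean(errs)                            # average the k estimates
```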
   Let's look at how all of this works in the most tractable case, linear models with squared prediction error loss. In linear models, and more generally in nonparametric regression, the model is typically taken as
$$y_i = m(x_i) + \varepsilon_i, \qquad \mathrm{E}(\varepsilon_i) = 0, \quad \mathrm{Var}(\varepsilon_i) = \sigma^2.$$
This model will not work for y ∼ Bern(p) because the constant variance condition cannot hold except in degenerate cases. Under squared error loss the expected prediction error of the true m is σ².
    In a linear model,
$$m(x) = x'\beta.$$
    It is not hard to see that (1) leads to
$$\frac{1}{n}\sum_{i=1}^{n} L[y_i, m(x_i)] = \frac{1}{n}\sum_{i=1}^{n} [y_i - x_i'\beta]^2.$$
   Finally, for leave one out cross-validation, the estimate uses the well known Press statistic, see PA Chapter 13. In the following, let p ≡ r(X). With M the perpendicular projection operator onto the column space of the model matrix, I believe
$$\mathrm{E}(\mathrm{Press}/n) = \frac{1}{n}\mathrm{E}\left[Y'(I-M)\,\mathrm{D}\!\left(\frac{1}{1-m_{ii}}\right)^{2}(I-M)Y\right]$$
$$= \frac{1}{n}\mathrm{tr}\left[\mathrm{D}\!\left(\frac{1}{1-m_{ii}}\right)^{2}(I-M)\,\sigma^2 I\,(I-M)\right]$$
$$= \frac{\sigma^2}{n}\mathrm{tr}\left[\mathrm{D}\!\left(\frac{1}{1-m_{ii}}\right)(I-M)\,\mathrm{D}\!\left(\frac{1}{1-m_{ii}}\right)\right]$$
$$= \frac{\sigma^2}{n}\sum_{i=1}^{n}\frac{1}{1-m_{ii}}.$$
   In fact, since
$$\frac{1}{n}\sum_{i=1}^{n}(1-m_{ii}) = \frac{n-p}{n},$$
Jensen's Inequality gives
$$\frac{1}{n}\sum_{i=1}^{n}\frac{1}{1-m_{ii}} \ge \frac{n}{n-p},$$
so it looks like Leave One Out CV is multiplicatively more biased upward than the
naive estimator is biased downward.
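For the record, Press itself requires no refitting; it comes straight from the residuals and leverages (a sketch):

```python
import numpy as np

def press(X, Y):
    """Press statistic: the sum of squared leave-one-out prediction errors
    for a linear model, computed from ordinary residuals and leverages."""
    M = X @ np.linalg.pinv(X)     # perpendicular projection onto C(X)
    e = Y - M @ Y                 # ordinary residuals
    m = np.diag(M)                # leverages m_ii
    return np.sum((e / (1 - m)) ** 2)
```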
   I remember from talking to Rick Picard about his thesis years ago that he claimed
Press really sucked. I wonder if this is why he said that.
If
$$L[y, f(x)] = w(y)[y - f(x)]^2,$$
the BP (best predictor) is
$$\frac{\mathrm{E}[y\,w(y)|x]}{\mathrm{E}[w(y)|x]}.$$
   If
                                  L[y, f (x)] = |y − f (x)|,
the BP is
                                         med(y|x).
Moreover, if we use the absolute loss function with y Bernoulli, we get the same
result as using Hamming loss, i.e., the BP is
$$\delta(x) = \begin{cases} 1 & \text{if } p(x) > .5\\ 0 & \text{if } p(x) < .5.\end{cases}$$
Chapter 11
Notes on weak conditionality principle
We have two potential experiments to collect data y and learn about a parameter θ .
Roughly, the weak conditionality principle says that if you flip a coin to decide
to perform experiment E = 1 or E = 2 then the analysis should be conditional on
the experiment you actually performed. What is unquestionably stupid would be to
ignore which experiment was actually performed when you know it. But it is less
clear that conditioning inferences on the observed experiment is actually necessary
to get good results as opposed to using the joint distribution of the data and the
experiment. (Inferences based on the marginal distribution of the data that ignore
knowing the experiment are dumb.) Of course all these distributions are conditional
on the parameter.
   Note that the weak conditionality principle is a consequence of the ancillarity
principle, since the outcome of the experiment randomization is an ancillary statistic
and should be conditioned on.
   Fletch has an example
                              E = 1                    E = 2
  f(y|θ, E = i)      y = 0   y = 1   y = 2    y = 0   y = 1   y = 2
  θ = 0               0.90    0.05    0.05     0.90    0.05    0.05
  θ = 1               0.10    0.43    0.47     0.01    0.49    0.50
  f(y|θ=1,E=i)/f(y|θ=0,E=i)
                       1/9     8.6     9.4     1/90     9.8     10
alternatively
                              E = 1                    E = 2
  f(y, E = i|θ)      y = 0   y = 1   y = 2    y = 0   y = 1   y = 2
  θ = 0               0.45    0.025   0.025    0.45    0.025   0.025
  θ = 1               0.05    0.215   0.235    0.005   0.245   0.25
  f(y,E=i|θ=1)/f(y,E=i|θ=0)
                       1/9     8.6     9.4     1/90     9.8     10
From the second table, the unconditional MP α = .05 test of H0 : θ = 0 versus
H1 : θ = 1 rejects for E = 2, y = 2 or E = 2, y = 1. From the first table, the two MP
α = .05 tests of H0 : θ = 0 versus H1 : θ = 1, conditional on E, reject for E = 1, y = 2
and E = 2, y = 2, respectively. They are different results, but I’m not at all sure that
the conditional tests are better. As with so many N-P tests, the real problem is picking a stupid α level. In this case, by paying the small price of going from α = 0.05 to α = 0.10 you almost double the power.
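The tests are easy to reconstruct by sorting outcomes on their likelihood ratios, per the N-P Lemma; a sketch using the joint-distribution table:

```python
# Build the unconditional MP test from the joint table by adding outcomes in
# decreasing likelihood-ratio order until the alpha = .05 budget is spent.
f0 = {('E1', 0): .45,  ('E1', 1): .025, ('E1', 2): .025,
      ('E2', 0): .45,  ('E2', 1): .025, ('E2', 2): .025}
f1 = {('E1', 0): .05,  ('E1', 1): .215, ('E1', 2): .235,
      ('E2', 0): .005, ('E2', 1): .245, ('E2', 2): .25}
size = power = 0.0
for o in sorted(f0, key=lambda o: f1[o] / f0[o], reverse=True):
    if size + f0[o] > 0.05 + 1e-12:
        break
    size += f0[o]
    power += f1[o]
print(size, power)   # rejects (E2, y=2) and (E2, y=1): size .05, power .495
```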
     I am not saying what procedure is better, only that it is not clear that one domi-
nates the other. And I am all in favor of Bayes over NP. These are all simple versus
simple tests, so the class of Bayes rules agrees with the class of NP rules. But, to
me, Bayes is clearly a better way of choosing a test than arbitrarily picking a small
level of α.
     More generally, consider two experiments E = 1 and E = 2 with Pr(E = 1) = p, where we get to observe E, and with outcomes y from the experiments determined by f(y|θ, E = i). The weak conditionality principle says that the analysis should be based on f(y|θ, E = i). The alternative to conditioning would be to base the analysis on the joint distribution of y and E,
$$f(y, E = i|\theta) = \Pr(E = i)\, f(y|\theta, E = i),$$
which seems to lead to pretty reasonable results. However, the likelihood function is
$$L(\theta|y, E) = f(y, E|\theta) \propto f(y|\theta, E),$$
so anything based on the likelihood is using the conditional distributions.
   Another thing I looked at that seems to give good Fisherian inference from the joint distribution is a 50/50 mixture of y ∼ N(0, 1) for E = 1 and y ∼ N(3, 1) for E = 2. I come up with P values that agree with the conditional P values for E = 1, y = 3 and E = 2, y = 3, the first being small and the second being 1. Note that the data points of equal weirdness to E = 1, y = 3 are E = 1, y = −3, E = 2, y = 0, and E = 2, y = 6. Similarly, the only point as weird as E = 2, y = 3 is E = 1, y = 0. I think, but did not check, that things work reasonably for testing the alternative H1: y ∼ N(4, 1).
Chapter 12
Reviews of Two Inference Books
This chapter contains reviews of two excellent books on statistical inference. Both
reviews were originally published in JASA: Christensen (2008, 2014). The first is a
brief text on statistical inference by David Cox. The second is by Erich Lehmann on
the historical development of statistical inference.
12.1 “Principles of Statistical Inference” by D.R. Cox

I must admit that I write this review wondering why anyone would care what I
have to say about a new book on statistical inference by D.R. Cox. Cox is, after all,
arguably our greatest living statistician. He is the author of numerous books, one of
which, Planning of Experiments, I consider to be one of the two best statistics books
that I have ever read (the other being the small book by Shewhart (1939) edited by
Deming). Interestingly, at about the same time Cox published this book on statistical
inference, he also published a review article on applied statistics, Cox (2007), the
first article ever published in a new IMS journal on the subject. But I am nothing if
not opinionated, so I will persevere in my task. I should perhaps also add that, as
with any review, this is not about what Cox said, but about what I thought he said.
    This is a book on foundational issues in statistical inference. The mathemati-
cal level is aimed at university undergraduates in quantitative fields. In chapter one
he states, “The object of the study of a theory of statistical inference is to provide
a set of ideas that deal systematically with the above relatively simple situations
and, more importantly still, enable us to deal with new models that arise in new
applications.” The book has nine chapters and two appendices. The nine chapters
are: Preliminaries, Some concepts and simple applications, Significance tests, More
complicated situations, Interpretations of uncertainty, Asymptotic theory, Further
aspects of maximum likelihood, Additional objectives, Randomization-based anal-
ysis. At the ends of the chapters are Notes, which I think contain some of the most
interesting material. In some ways, the chapter Notes are crucial. For example, the
book contains interesting material on such things as linear rank statistics, Fieller's
method, and ratio estimation, but the uninitiated would have no chance of relating
those names to the material without the chapter Notes.
    I’ll say right up front that I think everyone should read the appendices. The first
is “A brief history” of statistics and the second contains Cox’s personal view on
matters of inference. Although I do not agree with all his views, they are certainly
worth encountering. Cox considers (p. 195) the main differences between Fisher
and Neyman to be the nature of probability and the role of conditioning. Personally,
I think the most important differences between them are the role of repeated sam-
pling and the nature of testing. Neyman-Pearson theory seems to provide decision
rules between a null and an alternative. I will discuss Fisherian testing later, but it is
certainly a different approach. Neyman embraced the concept of repeated sampling,
that is, the long run frequency (LRF) justification for statistical procedures, but my
reading of Fisher is that he rejected LRF as a basis for testing. I think that in testing
problems Fisher simply used the (null) probability model as a criterion to evaluate
how weird the observed data were. Of course, that is intimately related to the dif-
ferent views that Fisher and Neyman had on the nature of probability. But as Cox
says in one of my favorite sentences from the book, “Fisher had little sympathy for
what he regarded as the pedanticism of precise mathematical formulation and, only
partly for that reason, his papers are not always easy to understand.” Amen!
    Cox is clearly not a Bayesian. I am. He raises the issue of whether probability has
the same meaning regardless of whether it is prior or posterior probability. While I
am sure I am missing the philosophical subtleties, as a practical matter it seems like
the posterior does one of three things. In the best of cases, we obtain more spe-
cific knowledge from the posterior. (Reduced entropy?). If that is not happening,
it suggests that we didn’t know what we were talking about in the first place. A
third scenario is where the data are inadequate to inform us about the parameteriza-
tion, i.e., we have nonidentifiability. A simple example of this is diagnostic testing
where there are three parameters: sensitivity, specificity, and prevalence, but the data
are simply the number of individuals who test positive. The data only provide in-
formation on the apparent prevalence which is a function of the three underlying
parameters. In this case, the conditional distribution of the three parameters given
the apparent prevalence will be identical in the prior and the posterior.
    Cox (p. 199) states that “Issues of model criticism, especially a search for ill-
specified departures from the initial model, are somewhat less easily addressed
within the Bayesian formulation.” I think he is absolutely correct, but I see little rea-
son to attempt such a search “within the Bayesian formulation.” I see little reason
not to use Fisherian (as distinct from Neyman-Pearson) tests to critique Bayesian
models. In fact, I think it is essential to do so! I also think Cox is right to reject
the axioms of personalistic probability as “being so compelling that methods not
explicitly or implicitly using that approach are to be rejected.” Bayesian statistics
may be the only logically consistent form of inference, but it is not the only useful
form of inference. Moreover, I think that Bayesian statistics is a wonderful medium
for arriving at a consensus of thought. And I suggest that to believe we are doing
more than arriving at a consensus is deluding ourselves.
Cox also discusses testing a null hypothesis. He considers two approaches. First, there are significance tests in which only
a null hypothesis and a test statistic are specified (but this also requires a known
distribution for the test statistic under the null). In the second, significance tests
are considered when alternatives are specified. Initially, I thought that most of this
chapter was about the second approach, but gradually I came to think that he is
addressing the first approach in a way that is difficult for me to digest.
    As I understand the first approach, the one I have been calling Fisherian, sig-
nificance testing is a variation on proof by contradiction. The null hypothesis is
assumed to be correct. If the p value, a measure of consistency with the null hypoth-
esis, is small then the data are inconsistent with the model that incorporates the null
hypothesis. This suggests that something is wrong with the null model. But what is
wrong need not be the hypothesized parameter value! The p value is the probability
of seeing a value of the test statistic that is as weird or weirder than we actually saw.
Since there is no alternative, the null distribution has to determine how weird a par-
ticular observed value is. Weird values are those with a small probability (density)
of occurring, thus the null density provides the ordering of how weird the observable
values are. For example, in testing that the mean µ of a Poisson distributed random
variable Y is 2, Table 1, column 3 gives the p values for y = 0, . . . , 6. Column 2 gives
the probabilities under the null, thus providing our measure of weirdness. For values
y > 6, the values get increasingly weirder so the pattern is obvious without listing
them.
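A sketch of the computation for the Poisson example, with the null density ordering weirdness (the tolerance guards against floating-point ties such as Pr(Y = 1) = Pr(Y = 2)):

```python
from scipy import stats

def fisherian_p(y_obs, mu=2.0, ymax=200):
    """Fisherian p value for a Poisson(mu) null: sum Pr(Y = y) over all y whose
    null probability is no larger than that of the observed value."""
    probs = stats.poisson.pmf(range(ymax), mu)
    return probs[probs <= probs[y_obs] + 1e-12].sum()

print([round(fisherian_p(y), 3) for y in range(7)])
```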
   Although this seems like a logically sound way to proceed, there are several dif-
ficulties with it. The biggest problem is in choosing a test statistic with a known
distribution. Fisher (1956, p.50) suggests that reasonable alternatives should inform
the choice of test statistic. However, there is a potential lack of coherence in that, for
example, a t test and a t 2 = F test are fundamentally equivalent but provide different
orderings of how weird observed values are. [I cannot think why I said this unless
I wrote it before I realized the shape of F(1, d f ) and F(2, d f ) distributions.]
(Although I suspect that the difference occurs because they provide different con-
tinuous approximations to our inherently discrete world.) This testing procedure,
while having a sound logical basis, does not address all the issues one might like.
Finally, I do not know of anybody who has consistently held to this procedure. Al-
most nobody applies this theory to χ 2 and F tests, although I suspect that is merely
a matter of computational convenience. To “correctly” evaluate the p value in those
cases, you need to find a second value of the statistic that gives the same density as
your observed value and then compute the probability of being in either tail. These
days, that would not be hard to program, but even today, it is not a computation
that is commonly performed. Fisher (1925, Sec. 20) insisted that extremely large p
values are as significant as extremely small ones. I view this as simply a convenient
alternative to taking the trouble to make a correct p value computation. Box (1980)
used this definition of p value for Bayesian model checking. This definition is also
widely accepted in performing exact conditional tests on discrete data, e.g. Mehta
and Patel (1983). Nonetheless, in Fisher's discussions of his exact test for 2 × 2 tables, he seems to have leaned towards p values computed directionally. But again, that might have been for computational convenience.
   I am not at all sure that Cox would agree with my description of the first ap-
proach because, what I consider a computational convenience, Cox seems to incor-
porate into the basic procedure. Even without explicitly defining alternatives, Cox
p.32 indicates that the p value, here called a p̃ value to distinguish it from the other
definition, is the probability of seeing data as indicative or more indicative of “a
departure from the null hypothesis of subject matter interest [my italics].” For ex-
ample, in testing that the mean µ of a Poisson distributed random variable Y is 2,
Cox uses the idea that if µ > 2, then only large values of Y are useful for detecting
departures from the null, and conversely when µ < 2. Cox provides a table similar
to Table 1 of one sided p̃ values. Column 4 provides p̃ values when larger values of
the test statistic are of interest or when the alternative mean is greater than 2 while
column 5 provides p̃ values when smaller values are of interest or the mean is less
than 2. These are distinct from the p values computed by letting the null distribution
determine which observations are most unusual.
   I am confused about how these ideas make the transition into Cox’s second ap-
proach, testing the null value of a parameter against some alternative value of the
parameter, because I am not quite sure if these are merely examples of p̃ values
with different departures of subject matter interest or if they are p̃ values based
on the specification of alternative hypotheses. In this case, they seem to amount to
the same thing. When alternatives are available, I suspect p̃ needs to be viewed as
a measure of consistency with the null model relative to the alternative, in which
case it could perhaps be formalized as the probability of seeing a likelihood ratio
as extreme or more extreme than the one you actually saw. In fact, Cox presents
this version of a p̃ value when discussing classification problems in section 5.17.
Without a specific alternative, I am not sure how one could formalize p̃ values be-
yond what Cox has done. However, I am not convinced that this definition of p̃ can
be reconciled with the idea of p̃ being a measure of consistency with the null hy-
pothesis. Cox rightly points out that extremely small values of F statistics in linear
models have large p̃ values under this paradigm but suggest inconsistency with the
null model.
   Defining a p̃ value relative to some formal or informal alternative is to abandon
completely the idea of a proof by contradiction. A small p̃ does not assure us that the
data contradict the null hypothesis nor does a large p̃ value assure us that the data
are consistent with the null hypothesis. The following simple example using two dis-
    Example 5.6 illustrates that "a prior that gives results that are reasonable from various viewpoints for a single parameter will have unappealing features if applied independently to many parameters." What he sees as a problem with the prior distribution, I see as a problem with the data and to some extent with biased estimation. Let me present an example similar to his: heteroscedastic one-way ANOVA. For $i = 1, \ldots, n$, $j = 1, \ldots, m$ let $y_{ij} = \mu_i + \varepsilon_{ij}$ with the $\varepsilon_{ij}$s independent $N(0, \sigma_i^2)$. Let $s_i^2$ be the sample variance from the ith group. To introduce some bias, we look at $[(m-1)/(m+1)]s_i^2$, which has better expected squared error properties than $s_i^2$. It seems like Cox's dissatisfaction in Example 5.6 should extend to the fact that as n gets large $\prod_{i=1}^{n}[(m-1)/(m+1)]s_i^2 / \prod_{i=1}^{n}\sigma_i^2$ does not become a good estimator of the number 1. In fact, it is an unbiased estimate of $[(m-1)/(m+1)]^n$, which approaches 0. Even without introducing the bias, $\prod_{i=1}^{n}s_i^2 / \prod_{i=1}^{n}\sigma_i^2$ is still not a really good estimate of the number 1. Introducing the bias turns a mediocre estimate into
a bad one. Ultimately, although we are in an asymptotic framework, there are not
enough data on any one parameter to expect good asymptotic behavior. Perhaps the
quote from the beginning of this paragraph should be changed to: a procedure that
gives results that are reasonable from various viewpoints for a single parameter may
have unappealing features if applied to many parameters. In Section 8.3 Cox says
something quite similar about biased point estimation while agreeing that a little bit
of bias is not normally a bad thing.
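The unbiasedness claim is a one-line computation: writing c = (m − 1)/(m + 1) < 1 and using the independence of the $s_i^2$,
$$\mathrm{E}\left[\prod_{i=1}^{n}\frac{c\,s_i^2}{\sigma_i^2}\right] = \prod_{i=1}^{n}\frac{c\,\mathrm{E}(s_i^2)}{\sigma_i^2} = c^n \to 0 \quad \text{as } n \to \infty.$$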
    Section 5.9 deals with reference priors but more specifically with the virtues
and difficulties of developing reference priors through maximizing entropy. While
I think this is an interesting and useful theory, I think the most important role for
reference priors is simply as priors we agree to use so that everyone has a common
basis for comparison.
    Section 5.11 discusses an approach to eliciting prior probabilities that should
interest those of us who are not enamored with having betting as a key aspect to the
foundations of Bayesianism. Cox’s note on this section is quite amusing. Personally,
rather than buying into coherent betting, the first thoughtful reason I had for being
a Bayesian was based on a result in section 8.2 on Decision analysis, that is, that
all reasonable procedures are Bayes procedures. If I have to act like a Bayesian
anyway, why not use a prior that I think is reasonable. But to be honest, I had been
indoctrinated before I ever heard of the complete class theorem.
    Frankly, I was a little offended that, in Section 5.12 on four ways to implement Bayesian procedures, the first two were not Bayesian. They are empirical Bayesian, which is a frequentist approach. In discussing the use of informative
priors, Cox seems to want a consensus on what the prior distribution should be. Per-
sonally, I am far more interested in getting a consensus on the posterior distribution.
It seems clear to me that if we cannot arrive at a practical consensus on the posterior
distribution, we collectively do not yet know what is going on. I think the fact that
reasonable people can obtain substantially different posteriors provides valuable in-
formation to the scientific community on our state of knowledge.
    Chapter 6 treats asymptotic theory. I must say that Cox displays a facility with
O(·), o(·), O p (·), o p (·) notation that boggles my mind. In fact, he displays amazing
facility with the entire subject, although I am personally more comfortable with,
say, Ferguson (1996). Figure 6.2 provides a fascinating illustration of the relation-
ships between likelihood ratio, Wald, and score tests. It also quickly leads to Cox’s
conclusion that if these test statistics are substantially different, the asymptotic for-
mulation is called in question.
   Chapter 7 discusses some of the difficulties with asymptotic maximum likeli-
hood theory as well as means for avoiding some of those difficulties. In particular,
section 7.6 gives brief discussions of partial-, pseudo-, and quasi- likelihoods.
   Chapter 8 is devoted to additional objectives. In section 8.1 Cox manages to treat
the entire object of science, that is, prediction, in one page. But I’m being opinion-
ated again. Technically, he presents (good) frequentist prediction in terms of testing
equality of the parameters from the new observation and the observed data. That
seems unnecessarily complicated to me. If the new observation is y∗ and the old data
are a vector y and if one can find a known distribution for some function of (y∗ , y),
then a prediction region consists of all values y∗ such that (y∗ , yobs ) is consistent
with the known distribution at a level α. In other words, take y∗ so that (y∗ , yobs )
provides a p value greater than α. Again, values of (y∗ , y) with the lowest density
under the known distribution are the values that are most inconsistent. This differ-
ence in thinking about prediction regions is unlikely to engender much difference in
practice. Bad frequentist prediction includes plugging estimates of the parameters
into the sampling distribution of the new observation and then proceeding as if the
sampling distribution was known. This systematically underestimates variability, a
problem Bayesian prediction avoids.
   Subsection 8.3.3 seems to have some key typographic errors or some logic I do
not follow. It took me a while to figure out that subsection 8.4.2 is a generalization
of the Grizzle, Starmer, Koch (1969) approach to categorical data.
   Chapter 9 on randomization-based analysis, considers both sampling theory and
designed experiments. The main idea is to contrast randomization-based analyses
with the model based approach taken in the rest of the book. I think section 9.3
on design would be tough sledding for anyone who had not seen similar material
before. One thing I particularly liked was his making a notational distinction be-
tween the variance appropriate for a completely randomized design versus that for
a randomized complete block (paired) design. Too often we (I) just call them both
σ².
   Two final points about the book that I really like. First, Cox does not blithely as-
sume independence. He repeatedly points out how crucial independence assumptions
may be. Second, he stresses that all real data are discrete and that continuous models
are just approximations that can go astray.
   I would like to thank Prof. Cox for never having sat down with one of my books
and picked it apart as I have done his. I hope he takes this as a sign of my high regard
for his professional accomplishments. This is a great book by a great statistician.
Buy it and read it.
References
   Blachman, Nelson M., Christensen, Ronald and Utts, Jessica M. (1996). Com-
   ment on Christensen, R. and Utts, J. (1992), “Bayesian Resolution of the ‘Ex-
   change Paradox.” The American Statistician, 50, 98-99.
   Box, George E. P. (1980). Sampling and Bayes’ Inference in Scientific Mod-
   elling and Robustness. Journal of the Royal Statistical Society. Series A (Gen-
   eral), 143(4), 383-430.
   Christensen, Ronald (2005). Testing Fisher, Neyman, Pearson, and Bayes. The
   American Statistician, 59, 121-126.
   Cox, D.R. (1958). Planning of Experiments. John Wiley and Sons, New York.
   Cox, D.R. (2007). Applied Statistics: A Review. The Annals of Applied Statistics
   1, 1-17.
   Ferguson, Thomas S. (1996). A Course in Large Sample Theory. Chapman and
   Hall, New York.
   Fisher, Ronald A. (1925). Statistical Methods for Research Workers, Fourteenth
   Edition, 1970. Hafner Press, New York.
   Fisher, R. A. (1956). Statistical Methods and Scientific Inference, Third Edition,
   1973. Hafner Press, New York.
   Grizzle, James E., Starmer, C. Frank, and Koch, Gary G. (1969). Analysis of
   categorical data by linear models. Biometrics, 25, 489-504.
   Mehta, C.R. and Patel, N.R. (1983). A network algorithm for performing Fisher’s
   exact test in r × c contingency tables. Journal of the American Statistical Associ-
   ation, 78, 427-434.
   Shewhart, W. A. (1939). Statistical Method from the Viewpoint of Quality Con-
   trol. Graduate School of the Department of Agriculture, Washington. Reprint
   (1986), Dover, New York.
12.2 “Fisher, Neyman, and the Creation of Classical Statistics” by Erich L. Lehmann

Erich Lehmann was a class act and this short book Fisher, Neyman and the Creation
of Classical Statistics is worthy of him. I found it great fun to read.
   Lehmann was Neyman's Ph.D. student and spent most of his career in “Ney-
man’s” department at Berkeley. He was a major contributor to Neyman-Pearson
theory and is almost certainly the foremost expositor of that theory with classic
books Testing Statistical Hypotheses, Theory of Point Estimation, and, my personal
favorite, Nonparametrics: Statistical Methods Based on Ranks. He could be for-
given for being biased (I was on the lookout) but for the most part the treatment
is even-handed. When the issue of bias arises, I suspect it is less a matter of bias
and more a matter of having an imperfect understanding of Fisher. (And who can
be blamed for having an imperfect understanding of Fisher?) Having addressed the
possible bias of the author, I should perhaps address the biases of the reviewer.
While I have the utmost respect for both Fisher and Neyman, nobody would call me
a fan of Neyman-Pearson theory and I am at core a Bayesian. In reviewing the book,
which is itself a review of others’ work, references that are not listed below come
from Lehmann’s book.
    The first chapter starts out with some brief biographical information about the
two protagonists as well as bios of supporting characters Karl and Egon Pearson
and W.S. Gosset. To me, one highlight of this chapter was that Neyman learned
from Karl Pearson that “scientific theories are no more than models of natural phe-
nomena” which, not only do I agree with but, brought to mind Box’s quote about
all models being wrong. The chapter then spends a few pages on Fisher’s classic
1922 paper, “On the mathematical foundations of theoretical statistics”. Although
Lehmann seems to credit Fisher for the invention of maximum likelihood estimates,
he was aware of Stigler’s (2007) work on their history. I myself remembered, from a
previous life, that the “most probable number” used in serial dilution bioassays pre-
dated Fisher. “The estimate is ‘most probable’ only in the roundabout sense that it
gives the highest probability to the observed results.” This seems to me an admirable
description of maximum likelihood estimation for discrete distributions. The previ-
ous quotation was taken from Cochran (1950) who later says, “Consequently the
m.p.n. (most probable number) method is now generally used in a great variety of
problems of statistical estimation, though it more frequently goes by the name of the
‘method of maximum likelihood’.” Cochran credits McCrady (1915) for originating
the m.p.n. in this application and the m.p.n. terminology lives on to this day. Stigler
cites examples by Lagrange and Daniel Bernoulli of finding most probable values
in the 18th century.
    Like Fisher’s 1922 paper, Chapter 2 of Lehmann’s book transitions into testing,
but now the focus shifts to Fisher’s ground breaking (1925) book Statistical Meth-
ods for Research Workers. It seems that Fisher took quite a bit of grief for writing a
practical manual directed at research workers and not including the underlying the-
ory. Lehmann points out that after Fisher carefully derived Gossett’s t distribution
in 1912 (Fisher, 1915), Gossett urged on Fisher to derive more small sample distri-
butions. Lehmann cites Gossett as pleading to Fisher, “But seriously, I want to know
what is the frequency distribution of rσx /σy [sic] for small samples, in my work I
want that more than the r distribution now happily solved.” In a section close to my
own heart, Lehmann discusses the issue of testing two independent samples. This
is the first time that I wondered if Lehmann was too firmly in Neyman’s camp to
fully understand Fisher. While Lehmann justifiably calls out Fisher for some tech-
nical sloppiness (citing Scheffé’s admirable 1959 book), I think the bigger point
goes wanting. The issue, of course, is whether to assume equal variances. Fisher’s
point, and I think it is well taken, is that the appropriate test is typically one of
whether the two samples come from the same normal population. This is scientif-
ically distinct from, although probabilistically equivalent to, testing whether two
normal populations have the same mean given that they have the same variance.
Quoting Lehmann, “Fisher concludes this later discussion by pointing out that one
could of course ask the question: ‘Might these samples be drawn from different nor-
mal populations having the same mean?’ ... but that ‘the question seems somewhat
academic’.” As Christensen et al. (2011, p. 123) point out, it is by no means clear
that testing the equality of means when the variances are different is a worthwhile
thing to do.
    Chapter 3 moves on to Neyman-Pearson theory. Two things particularly struck
me here. Apparently, the generalized likelihood ratio test statistic predates the the-
ory of optimal testing and Neyman originally wanted to do the theory as a Bayesian.
“This long two-part [1928] paper is a great achievement. It introduces the consider-
ation of alternatives, the two kinds of error, and the distinction between simple and
composite hypotheses. In addition, of course, it proposes the likelihood ratio test.
This test is intuitively appealing, and Neyman and Pearson show that in a number
of important cases it leads to very satisfactory solutions. It has become the stan-
dard approach to new testing problems.” Optimal testing arrives in 1933 as Neyman
and Pearson seek to solve a decision problem, “Without hoping to know whether
each separate hypothesis is true or false, we may search for rules to govern our
behavior with regard to them ...” Lehmann’s summary of Neyman and Pearson’s
innovations follows: “The 1928 and 1933 papers of Neyman and Pearson discussed
in the present chapter, exerted enormous influence. The latter initiated the behav-
ioral [decision theoretic] point of view and the associated optimization approach.
It brought the Fundamental Lemma and exhibited its central role, and it provided
a justification for nearly all the tests that Fisher had proposed on intuitive grounds.
On the other hand, the applicability of the Neyman-Pearson optimality theory was
severely limited. It turned out that optimum tests in their sense existed only if the
underlying family of distributions was an exponential family (or, in later extensions,
a transformation family). For more complex problems the earlier Neyman-Pearson
proposal of the likelihood ratio test offered a convenient and plausible solution. It
continues even today to be the most commonly used approach.” The rub is whether
this theory really does provide an appropriate justification for Fisher’s tests. Fisher’s
dissent is the subject of Chapter 4.
    It seems to me that Fisher’s key objection to Neyman-Pearson testing is the intro-
duction of alternative hypotheses. Ironically, it was correspondence between Gosset
and Egon Pearson (discussed at the beginning of Chapter 3) that generated this idea.
In a 1934 paper, Fisher very carefully stated, “The test of significance is termed
uniformly most powerful with regard to a class of alternative hypotheses if this
property [i.e. maximum power] holds with respect to all of them.” [My italics. I
am quoting Lehmann quoting Fisher. Presumably the “i.e.” is Lehmann’s.] Even at
this early point, when Fisher’s reaction to Neyman-Pearson theory was tepid and as
yet involved no personal animosity, we have the hint that Fisher is not willing to
accept this “class of alternative hypotheses” as the only possible alternatives. As I
have argued elsewhere (Christensen, 2005), Fisherian testing is essentially subject-
ing the null hypothesis to a proof by contradiction in which the null is contradicted
(rejected) or not contradicted. Unfortunately, the contradictions are almost never
absolute and the strength of contradictory evidence is measured by a small P value.
No alternatives are needed for a proof by contradiction. As indicated earlier, and as
Fisher violently objected to, Neyman-Pearson theory is essentially a decision proce-
dure (cf. p. 54), for which (unlike Neyman and Pearson’s original thinking, cf. p. 35)
there is no reason why having a small probability of type I error should be important
if it leads to large probabilities of type II error, cf. p. 55.
    It is in Chapter 4 that Lehmann seems, to me, to misunderstand Fisher most often.
On page 48 he (technically correctly) describes an argument by Fisher as increas-
ing the power of a test. Fisher’s interest is in decreasing the P value. On page 57
Lehmann writes, “Fisher relied on his intuition, while Neyman strove for logical
clarity.” The word “while” seems to make this sentence inappropriately convey far
more than the sum of its parts (neither of which I could disagree with). In another
matter, I admit that Fisher is responsible for the silly dominance of the 0.05 level in
testing, but I do not believe that he is to blame for an idea that others took to absurd
lengths. On page 53 Lehmann presents 8 examples and states, “These examples,
to which many others could be added, show that Fisher considered the purpose of
testing to be to established [sic] whether or not the results were significant at the
5% level, and that he was not particularly interested in the p-values per se.” I took
the examples exactly opposite. Given that Fisher was reporting P values from ta-
bles of the t distribution, he seems to report them as accurately as the tables allow.
Moreover, Fisher (1936) once pointed out that some of Gregor Mendel’s data give
P values that are suspiciously too high. On page 55 Lehmann suggests, I think un-
fairly, that the Neyman-Pearson attitude towards test sizes is more appropriate. In
fact, I think it is equally fair to say that Neyman-Pearson are responsible, but not to
blame, for their tests being used almost exclusively with small α levels.
    Chapter 5 is entitled, “The Design of Experiments and Sample Surveys.” While
it is hard not to notice similarities in the ideas used in experimental design and sam-
pling, I somehow felt that Dr. Lehmann was a little too cavalier (a word too often
applied to me) about their differences. On page 64 Lehmann quotes a passage from
Fisher’s book The Design of Experiments that comes close to demonstrating that
Fisher’s concept of testing is essentially a proof by contradiction. On the same page
I think Lehmann is quite right for chiding Fisher for not seeing the usefulness of
power as a tool in determining sample size. Even in Fisher’s concept of testing, it
is worthwhile to consider the power of various alternatives for a fixed sized test.
However, in Fisher’s concept, one should take a wider view of what the various al-
ternatives are. (I think Fisher occasionally used fixed sized tests but I am less sure
that he would admit to it.) I have addressed the issue that Fisher found alternative hy-
potheses inappropriate (except to help in choosing a test statistic, cf. Fisher (1956,
p. 50)), but on page 65 I found myself thinking about how Fisher and Neyman-
Pearson would disagree even on what a null hypothesis was. In Neyman-Pearson
theory a null hypothesis is an hypothesis about a parameter value within an under-
lying statistical model. In Fisherian testing the null “hypothesis” is better thought of
as the null model. The correspondence is that the Neyman-Pearson model, together
with their null hypothesis, is the null model in Fisherian testing.
    In discussing randomized block designs, on page 67 Lehmann quotes Fisher as
saying, “the discrepancies between the relative performances of different varieties in
different blocks ... provide a basis for the estimation of error.” I take this as a pretty
clear statement that in a randomized complete block the treatment-block interaction
is what you want to use as a measure of error. As I recently said in another context:
If evidence for main effects is not so blatant that it overwhelms any block-treatment
interaction we should not declare main effects.
    Subsection 5.7.1 on randomization does not mention what I consider to be the
most important reason for randomizing treatments. Randomization should (on av-
erage) alleviate the effects of any confounder variables, therefore randomization
provides a philosophical basis for inferring that the effects we see are caused by the
treatments.
    Alas, page 73 closes with more comments reflecting the author’s background.
“These papers by Jack Kiefer [on optimal design] complemented and to some ex-
tent completed Fisher’s work on experimental design as the Neyman-Pearson theory
had done for Fisher’s testing methodology.” “In testing, the Neyman-Pearson theory
provided justification for the normal-theory tests that Fisher had proposed on intu-
itive grounds.” It has been pointed out that the t test is not reasonable because it is
uniformly most powerful unbiased; rather, the criterion of being uniformly most powerful
unbiased may be reasonable because it gives the t test. (I got this from Ed Bedrick,
who got it from Robinson (1991), who got it from Dawid (1976).)
    Chapter 6 discusses estimation. Fiducial inference is one of the great mysteries of
the statistical world. I have never personally met anyone who claimed to understand
it. But Lehmann points out a passage from Fisher that I find crucial. In discussing
Fisher’s 1935 paper on “The foundations of inductive inference” Lehmann says a
“new feature is the identification of fiducial limits with the set of parameters θ0 for
which the hypothesis θ = θ0 is accepted at the given level. This interpretation had
already been suggested by Neyman in the appendix to his 1934 paper.” Personally,
I find this a much more reasonable basis for Fisherian interval estimation than fidu-
cial inversion of probability distributions. This lets one state unambiguously that
a Fisherian interval contains all the parameter values that are consistent with the
data and the statistical model as determined by an α level test. (Remember that a
Fisherian test requires a null model that is often a statistical model together with
a null hypothesis for a parameter value and that if the test is not rejected at the α
level we merely fail to contradict the null model so the data are merely consistent
with the null model.) In the next section Lehmann repeats Neyman’s important point
about the long-run frequency interpretation of confidence intervals that the long run
need not be on the same problem, but rather on all the confidence intervals that a
statistician performs. (Not that that solves the problem of a confidence interval re-
ally saying nothing about the data at hand.) Section 6.4 seems to me to suggest that
Fisher, like me, finds “confidence” to be nothing more than a backhanded way to get
people to think of posterior probability, no matter how much one talks about long-
run frequencies. Indeed, it seems that Fisher is identifying confidence with his own
concept of fiducial probability. I find it rather comforting that these two concepts
that I have never understood could be the same concept. It is fascinating to think of
these renowned anti-Bayesians trying desperately to make Bayesian omelettes without
breaking eggs a priori. McGrayne (2011, p. 144) reveals that Abraham Wald, who
appears five times in Lehmann’s book as a key contributor to classical statistics,
was a closet Bayesian. His reticence at coming out must ultimately be due to our
protagonists.
    The final chapter provides an epilogue that briefly summarizes the contributions
of these giants to a variety of topics. In particular, the section “Hypothesis Testing”,
for good or ill, recapitulates many of the points highlighted in this review. An
appendix lists Fisher’s works.
    While I have gone to some lengths to point out what I think are biases in the book,
let me reemphasize that given Lehmann’s background, I think these are remarkably
minor. And for all my disagreements, I found the book both fun and informative.
Indeed, I found myself almost ashamed for having let Lehmann do all this hard
work for me and definitely feel appreciative. If you find the title of Lehmann’s book
interesting, by all means buy it and read it. My hope is that this review will have
whetted your appetite.
References
   Christensen, R. (2005). Testing Fisher, Neyman, Pearson, and Bayes. The Amer-
   ican Statistician, 59, 121–126.
   Christensen, R., Johnson, W., Branscum, A. and Hanson, T.E. (2011). Bayesian
   Ideas and Data Analysis: An Introduction for Scientists and Statisticians, CRC
   Press, Boca Raton.
   Cochran, W.G. (1950). Estimation of bacterial densities by means of the “most
probable number”. Biometrics, 6, 105-116.
Dawid, A.P. (1976). Discussion of the paper by O. Barndorff-Nielsen, “Plausibility
inference.” Journal of the Royal Statistical Society, Series B, 38, 123-125.
   Fisher, R.A. (1936). Has Mendel’s work been rediscovered? Annals of Science,
   1, 115-137.
   Fisher, R.A. (1956), Statistical Methods and Scientific Inference (3rd. ed., 1973),
   Hafner Press, New York.
   McCrady, M.H. (1915). The numerical interpretation of fermentation-tube re-
   sults. J. Infec. Dis., 17, 183-212.
McGrayne, S.B. (2011). The Theory That Would Not Die: How Bayes’ Rule
   Cracked the Enigma Code, Hunted Down Russian Submarines & Emerged Tri-
   umphant from Two Centuries of Controversy, Yale, New Haven.
   Robinson, G. K. (1991). That BLUP is a good thing: The estimation of random
   effects. Statistical Science, 6, 15-51.
   Scheffé, H. (1959). The Analysis of Variance, John Wiley and Sons, New York.
   Stigler, S.M. (2007). The epic story of maximum likelihood. Statistical Science,
   22, 598-620.
Chapter 13
The Life and Times of Seymour Geisser.
    Seymour Geisser was a major figure in modern statistics, particularly in the de-
velopment of Bayesian statistics. He was an active researcher, an able administra-
tor, and an interesting man. I discuss his life, his impact on statistics, and recount
some personal interactions.
13.1 Introduction
I am not quite sure how I came to be doing this. There are any number of people
more qualified to discuss Seymour Geisser’s personal and professional lives than
me.
   I was a student at the University of Minnesota before Seymour got there. And I
am sure there were times when he thought that I would still be a student there when
he left. Fortunately for us both, that turned out not to be true.
1.   Seymour was not my advisor.
2.   I never wrote a paper with him.
3.   I never officially took a class from him.
4.   I sat in on his prediction class one quarter where my unofficial status did not keep
     Seymour from making me do a class presentation.
Seymour was fond of observing that there are a lot of smart people in the world but
that what matters is what you do with it. In graduate school, I may have been his
poster boy for what to avoid.
    After I finally left Minnesota for a position at Montana State University, I was
told that my stock rose immeasurably when I published a little American Statistician
article on “Bayesian point estimation using the predictive distribution,” (Christensen
and Huffman, 1985). Seymour was pretty sure that I would go to Montana and never
be heard from again.
Seymour Geisser was born on October 5, 1929 in the Bronx, New York. He was the
elder of two sons born to Polish immigrants who worked in the garment industry.
His father left Poland after being discharged from the army having served in the
Russo-Polish war of 1920.
   At the age of two he moved to Brooklyn. By the age of 12, he was already widely
recognized for being very clever (Zelen, 1996). He graduated from Lafayette High
School. Seymour enjoyed high school and played some point guard with the basket-
ball team. Seymour graduated from the City College of New York in 1950 having
spent much of his undergraduate years sleeping on the subway between City Col-
lege and his home in Bensonhurst. His major was mathematics, in part because the
math program was housed near the cafeteria where he played chess.
After graduation from City College, Seymour faced the same question many of us
do and arrived at the same answer. The question is, “How am I going to support
myself?” and the answer is, “There seem to be jobs in statistics.” In Seymour’s case,
this came about through the intervention (or intercession) of Seymour’s cousin Leon
Gilford and his wife Dorothy Gilford. Both were statisticians. Dorothy had been
a student of Harold Hotelling at Columbia. By then, Hotelling had moved to the
University of North Carolina (UNC), so Seymour headed south.
   At UNC Seymour hobnobbed with the likes of Sudish Ghurye, Ingram Olkin,
Ram Gnanadesikan, Shanti Gupta, Marvin Kastenbaum, Marvin Zelen, and others.
They drank beer, played cards, gambled, and did good statistics. To see how times
have progressed, in graduate school we drank beer and played volleyball, softball,
and basketball. But we had people like Bill Sudderth to teach us the evils, or at least
the futility, of gambling.
   The faculty at UNC included Hotelling, Gertrude Cox, Wassily Hoeffding, S.N.
Roy, George Nicholson, R.C. Bose, and Herb Robbins. The students at UNC were
apparently as intimidated by their professors as we were twenty five years later. I
often wish that my own students were as intimidated by me.
Not surprisingly, Seymour worked on his Masters and Ph.D. with Harold
Hotelling. His Masters thesis was on computing eigenvalues and eigenvectors. His
doctoral thesis was on mean square successive differences. This came about from
spending summers working at the naval proving ground in Aberdeen, Maryland.
The statistics group leader, Monroe Norden, had him following up work that John
von Neumann had done during World War II. Contrary to the rumor (that I started),
his thesis was not based on data obtained when the denizens of the proving ground
went deer hunting with their cannon.
    Seymour later described his interaction with Hotelling (in Christensen and John-
son, 2005). “He was very hard to get [to see] and every time I would find him and
show him my work, he would always suggest something more to do. I got to be a
little annoyed at this. I thought I had done enough. So the next time he asked me
to do something, I went back and I did it and I thought, what would he ask next. I
thought about it and said, probably this kind of thing, and I did it. Next time I came
in, sure enough, he asked me to do exactly that and I said, ‘Here, I’ve done it.’ He
said, ‘Well then, I guess you’re finished.”’
13.4 Washington, DC
After graduating from UNC in 1955, Seymour took off to the National Bureau of
Standards. It paid better than the University of Illinois and he liked living in Wash-
ington. He initially worked under Churchill Eisenhart in the Statistical Engineering
Laboratory. He also worked with Marvin Zelen, Jack Youden, I. R. Savage, Bill
Conner, and Bill Clatworthy.
   Before long he joined the U.S. Public Health Service as a lieutenant j.g. The
commission was necessary for joining the National Institutes of Health. Seymour
spoke fondly of his lunchtime discussions with Sam Greenhouse, Max Halperin,
Nate Mantel, Marvin Schneiderman, and Jerry Cornfield. His interactions with Jerry
Cornfield changed his professional life.
   Cornfield was interested in Bayesian ideas and the corresponding frequentist
concepts. Seymour soon caught the Bayesian bug and, given his association with
Hotelling, he not surprisingly began developing Bayesian approaches to multivariate
problems such as discriminant analysis and profile analysis. The work on Bayesian
discrimination led naturally to looking at predictive probabilities of correct classi-
fication. Ultimately, that led to Seymour’s seminal work on prediction as the basis
for statistical inference, first summarized in his 1971 paper “The inferential use of
predictive distributions” and later compiled in his (1993) book Predictive Inference:
An Introduction.
   In 1959, Seymour published his infamous citation classic on the Greenhouse-
Geisser correction to the F test. The adjective “infamous” is Seymour’s. He was not
overly taken with the work and opined, “There is no accounting for taste.” Except
Seymour said it in Latin.
   In the early ’60s, Seymour began his academic career teaching nights at George
Washington University.
13.5 Buffalo
13.6 Minnesota
Two years after I began at the University of Minnesota, Seymour became the first
Director of the School of Statistics. That was 1971 and they forgot to consult me,
perhaps because I was a sophomore in math education at the time. Seymour re-
mained director for 30 years, helping develop a distinguished faculty. When I began
graduate school the faculty included Don Berry, Kit Bingham, Bob Buehler, Dennis
Cook, Joe Eaton (I even know why Morris L. Eaton is called Joe), Steve Fienberg,
Cliff Hildreth, David Hinkley, Kinley Larntz, Bernie Lindgren, Frank Martin, Mil-
ton Sobel, Bill Sudderth, and Sandy Weisberg. Later additions included Katherine
Chaloner, Jim Dickey, Charlie Geyer, Doug Hawkins, David Lane, Gary Oehlert,
Glen Meeden, Luke Tierney and undoubtedly others that I am less familiar with.
   Unintentionally, Seymour introduced a shibboleth by which Minnesota gradu-
ates recognize each other. Every year Seymour would tell the graduate students that
seminar attendance was obligatory but not mandatory. To this day, if we hear anyone
say that something is obligatory but not mandatory, we assume that person is from
Minnesota. I doubt that Seymour was aware of this, but I am sure that he would have
enjoyed me calling it a shibboleth.
   I mentioned that we were intimidated by our faculty, and Seymour was certainly
not an exception. A fellow graduate student, Dennis Jennings, got married and in-
vited Seymour to the reception. I do not remember any other faculty being there.
But I do remember Seymour and Anne sitting alone and the graduate students not
having the nerve to go over and socialize. Perhaps one of the reasons I’m writing
this is because I eventually worked up enough nerve.
13.7 Seymour’s Professional Contributions
Seymour was always a very active researcher. He had over 175 publications. He had
visiting professorships at 13 universities. He was a fellow of the Institute of Math-
ematical Statistics and the American Statistical Association. He was on numerous
national committees.
   In a two year period, late in life but before getting sick, Seymour published papers
on:
f (x_1, …, x_n|θ) = θ^{∑ x_i} (1 − θ)^{n−∑ x_i},

so

p(θ) = 1.

Stigler (1982) has Bayes assuming a marginal distribution for the data in which

Pr[∑ X_i = r] = 1/(N + 1),

which again leads to

p(θ) = 1.
    Seymour’s version is entirely predictivist. He notes that the data are actually
Y_0, Y_1, …, Y_n iid U(0, 1) but that all one observes is

X_i = 1 if Y_i ≤ Y_0, and X_i = 0 if Y_i > Y_0.

In other words,

θ ≡ Pr[Y_i ≤ Y_0|Y_0].

In Seymour’s formulation,

f (x_1, …, x_n|y_0) = y_0^{∑ x_i} (1 − y_0)^{n−∑ x_i}.
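To make the predictivist formulation concrete, here is a minimal simulation sketch (Python with numpy; the sample size n = 5 and the seed are arbitrary choices of mine) checking that ∑ X_i is uniform on {0, 1, …, n}, which matches the marginal distribution Stigler attributes to Bayes:

import numpy as np

# Sketch of the predictivist setup: Y0, Y1, ..., Yn iid U(0,1) and
# Xi = 1 when Yi <= Y0.  The marginal distribution of sum(Xi) should be
# uniform on {0, 1, ..., n}, i.e., Pr[sum Xi = r] = 1/(n+1) for every r.
rng = np.random.default_rng(0)
n, reps = 5, 200_000
y = rng.uniform(size=(reps, n + 1))
y0 = y[:, 0]                                   # the unobserved Y0
s = (y[:, 1:] <= y0[:, None]).sum(axis=1)      # sum of the Xi per replicate
print(np.bincount(s, minlength=n + 1) / reps)  # each entry close to 1/6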
Pr[X = 0|θ = 0] = .9
and rejects for small values of the test statistic T (X). That the likelihood ratio test
has power less than its size IS surprising.
   The uniformly most powerful invariant (UMPI) test of size .1 is a randomized
test. It rejects when X = 0 with probability 1/9. The size is .9(1/9) = .1 and the
power is .91(1/9) > .1. Note, however, that observing X = 0 does not contradict
the null hypothesis because X = 0 is the most probable outcome under the null
hypothesis. Moreover, the test does not reject for any value X ̸= 0, even though such
data are 90 times more likely to come from the alternative θ = X than from the null.
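The size and power computations for the randomized test are simple enough to check directly; a tiny sketch (the values Pr[X = 0|null] = .9 and Pr[X = 0|alternative] = .91 are the ones quoted above):

# Size and power of the randomized test described above: reject with
# probability 1/9 when X = 0.
p_null, p_alt, reject_prob = 0.90, 0.91, 1 / 9
print(p_null * reject_prob)   # size  = 0.1000
print(p_alt * reject_prob)    # power = 0.1011... > 0.1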
   In my humble opinion, Seymour’s primary research contributions were to
   Non-Bayesian Multivariate Analysis
   Bayesian Multivariate Analysis
   Predictive Sample Reuse
   and Predictive Inference.
He was also proud of his role in building the program at Minnesota.
Seymour had four children: Adam, Dan, Georgia, and Mindy. Mindy is a biostatisti-
cian in Minnesota. He had five grandchildren, including a set of triplets. When Sey-
mour visited Harvard once, Marvin Zelen asked his administrator Anne to take care
of him. She took the job so seriously that they married. Seymour’s brother Martin
taught high school and served as a counselor. Seymour enjoyed history, archeology,
religion and novels. More than once I was surprised at how well read he was. He
seemed to know something about even my most obscure interests.
13.9 Conclusion
Acknowledgements
Aelise Houx, Anne Geisser, Jessica Utts, Marvin Zelen, Martin Geisser, and Wes
Johnson helped accumulate this information. The late, great Larry Brown pointed
out an error in an earlier draft. Actually, after my talk he privately pointed out a
blunder and I have been forever grateful that he chose not to humiliate me. That was
the only time I ever met him.
References
Fisher, R. A. (1935). The Design of Experiments, Ninth Edition, 1971. Hafner Press,
     New York.
Geisser, Seymour (1971). The inferential use of predictive distributions. In Foun-
     dations of Statistical Inference, V.P. Godambe and D.A. Sprott (Eds.). Holt,
     Rinehart, and Winston, Toronto, 456-469.
Geisser, Seymour (1975). The predictive sample reuse method with applications.
     Journal of the American Statistical Association, 70, 320-328.
Geisser, Seymour (1985). On the predicting of observables: A selective update. In
     Bayesian Statistics 2, J.M. Bernardo et al. (Eds.). North Holland, 203-230.
Geisser, Seymour (1993). Predictive Inference: An Introduction, Chapman and Hall,
     New York.
Geisser, Seymour (2000). Statistics, litigation, and conduct unbecoming. In Statisti-
     cal Science in the Courtroom, Joseph L. Gastwirth (Ed.). Springer-Verlag, New
     York, 71-85.
Geisser, Seymour (2005). Modes of Parametric Statistical Inference, John Wiley
     and Sons, New York.
Hacking, I. (1965). Logic of Statistical Inference. Cambridge University Press.
Lane, David (1996). “Story about Cosimo di Medici.” In Modelling and Predic-
     tion: honoring Seymour Geisser, eds. Jack C. Lee, Wesley O. Johnson, Arnold
     Zellner. Springer-Verlag, New York.
Stigler, S.M. (1982). Thomas Bayes and Bayesian inference. Journal of the Royal
     Statistical Society, A, 145(2), 250-258.
Stone, M. (1974). Cross-validatory choice and assessment of statistical predictions.
     Journal of the Royal Statistical Society, B, 36, 111-147.
Zelen, Marvin (1996). “After dinner remarks: On the occasion of Seymour Geisser’s
     65th Birthday, Hsinchu, Taiwan, December 13, 1994.” In Modelling and Pre-
     diction: honoring Seymour Geisser, eds. Jack C. Lee, Wesley O. Johnson,
     Arnold Zellner. Springer-Verlag, New York.
Appendix A
Multivariate Distributions
For random variables y_1, …, y_n, the joint cumulative distribution function (cdf) is

F(v_1, …, v_n) ≡ Pr[y_1 ≤ v_1, …, y_n ≤ v_n].

If the y_i are discrete, the (joint) probability mass function is

f (v_1, …, v_n) ≡ Pr[y_1 = v_1, …, y_n = v_n].

If F(v_1, …, v_n) admits the nth order mixed partial derivative, then we can define a
(joint) density function

f (v_1, …, v_n) ≡ ∂^n F(v_1, …, v_n) / (∂v_1 ⋯ ∂v_n).

The cdf can be recovered from the density as

F(v_1, …, v_n) = ∫_{−∞}^{v_1} ⋯ ∫_{−∞}^{v_n} f (w_1, …, w_n) dw_1 ⋯ dw_n.
We (with D.R. Cox) will often adopt the “deplorable” habit of referring to probabil-
ity mass functions as discrete densities, or if the context is clear, just densities.
    For a function g(·) of (y_1, …, y_n)′ into R, the expected value is defined as

E[g(y_1, …, y_n)] = ∫_{−∞}^{∞} ⋯ ∫_{−∞}^{∞} g(v_1, …, v_n) f (v_1, …, v_n) dv_1 ⋯ dv_n.
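As a quick numerical illustration of the definition, the following sketch computes an expected value by numerical integration, assuming (purely for illustration) the textbook joint density f (v_1, v_2) = v_1 + v_2 on the unit square:

from scipy.integrate import dblquad

# Expected value of g(v1, v2) = v1*v2 under the assumed density
# f(v1, v2) = v1 + v2 on the unit square.
f = lambda v1, v2: v1 + v2
g = lambda v1, v2: v1 * v2

# dblquad's integrand takes its arguments in the order (inner, outer).
val, err = dblquad(lambda v2, v1: g(v1, v2) * f(v1, v2), 0, 1, 0, 1)
print(val)   # the exact answer is 1/3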
If E(y) = µ, the covariance matrix is

Cov(y) ≡ E[(y − µ)(y − µ)′].

It is easily seen that for a conformable fixed matrix A and vector b,

E(Ay + b) = AE(y) + b and Cov(Ay + b) = ACov(y)A′.
    The distribution of one random vector, say x, ignoring the other vector, y, is called
the marginal distribution of x. The marginal cdf of x can be obtained by substituting
the value +∞ into the joint cdf for all of the y variables:

F_x(u) = F_{x,y}(u, +∞, …, +∞).
The conditional density of a vector, say x, given the value of the other vector, say
y = v, is obtained by dividing the density of (x′, y′)′ by the density of y evaluated at
v, i.e.,

f_{x|y}(u|v) ≡ f_{x,y}(u, v) / f_y(v).
The conditional density is a well-defined density, so expectations with respect to it
are well defined. Let g be a function from R^m into R; then

E[g(x)|y = v] = ∫_{−∞}^{∞} ⋯ ∫_{−∞}^{∞} g(u) f_{x|y}(u|v) du.
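Continuing the illustrative density f (u, v) = u + v on the unit square (an assumption of mine, not anything special), a short check that the conditional density integrates to 1 and gives the conditional expectation:

from scipy.integrate import quad

# Conditional density f_{x|y}(u|v) = f(u,v)/f_y(v) for the assumed joint
# density f(u, v) = u + v on the unit square, where f_y(v) = v + 1/2.
f_cond = lambda u, v: (u + v) / (v + 0.5)

v = 0.3
print(quad(lambda u: f_cond(u, v), 0, 1)[0])      # 1.0: a genuine density
print(quad(lambda u: u * f_cond(u, v), 0, 1)[0])  # (1/3 + 0.15)/0.8 = 0.604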
The standard properties of expectations hold for conditional expectations. For ex-
ample, with a and b real,

E[a g_1(x) + b g_2(x)|y = v] = aE[g_1(x)|y = v] + bE[g_2(x)|y = v].
    In fact, both the notion of conditional expectation and this result can be gen-
eralized. Consider a function g(x, y) from R^{m+n} into R. If y = v, we can de-
fine E[g(x, y)|y = v] in a natural manner. If we consider y as random, we write
E[g(x, y)|y]. It can be easily shown that

E[h(y)|y] = h(y)

and

E{E[g(x, y)|y]} = E[g(x, y)].
A.2 Independence
If their densities exist, two random vectors are independent if and only if their joint
density is equal to the product of their marginal densities, i.e., x and y are indepen-
dent if and only if
                                 fx,y (u, v) = fx (u) fy (v).
Note that if x and y are independent, then f_{x|y}(u|v) = f_x(u) whenever f_y(v) > 0.
   If the random vectors x and y are independent, then any (reasonable) vector-
valued functions of them, say g(x) and h(y), are also independent. This follows
easily from a more general definition of the independence of two random vectors:
The random vectors x and y are independent if for any two (reasonable) sets A and
B,
                      Pr[x ∈ A, y ∈ B] = Pr[x ∈ A]Pr[y ∈ B].
To prove that functions of random variables are independent, recall that the set in-
verse of a function g(u) on a set A_0 is g^{−1}(A_0) ≡ {u|g(u) ∈ A_0}. That g(x) and h(y)
are independent follows from the fact that for any (reasonable) sets A_0 and B_0,

Pr[g(x) ∈ A_0, h(y) ∈ B_0] = Pr[x ∈ g^{−1}(A_0), y ∈ h^{−1}(B_0)]
   = Pr[x ∈ g^{−1}(A_0)] Pr[y ∈ h^{−1}(B_0)] = Pr[g(x) ∈ A_0] Pr[h(y) ∈ B_0].

If two random vectors x and y have characteristic functions φ_x(t_1, …, t_n) ≡ E[e^{it′x}]
and φ_y with φ_x(t_1, …, t_n) = φ_y(t_1, …, t_n) for all (t_1, …, t_n), then x and y have the
same distribution.
    The great advantage of characteristic functions is that e^{it′y} ≡ cos(t′y) + i sin(t′y),
so the random variable is bounded and its expectation always exists. The moment
generating function gets rid of i ≡ √−1 and is

ψ_y(t_1, …, t_n) ≡ ∫_{−∞}^{∞} ⋯ ∫_{−∞}^{∞} exp[∑_{j=1}^{n} t_j v_j] f_y(v_1, …, v_n) dv_1 ⋯ dv_n = E[e^{t′y}].
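A numerical sanity check of the definition (for a standard normal, whose mgf is known to be e^{t²/2}; the value t = 0.7 is arbitrary):

import numpy as np
from scipy.integrate import quad

# The mgf of a N(0,1) computed from the definition, compared with exp(t^2/2).
phi = lambda v: np.exp(-v**2 / 2) / np.sqrt(2 * np.pi)
t = 0.7
mgf, _ = quad(lambda v: np.exp(t * v) * phi(v), -np.inf, np.inf)
print(mgf, np.exp(t**2 / 2))   # both approximately 1.2776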
A.4 Inequalities
Chebyshev’s inequality states that for any ε > 0,

P(|y − µ_y| ≥ ε) ≤ Var(y)/ε².

Jensen’s inequality is that for any convex function g(u) and random variable y,

E[g(y)] ≥ g[E(y)].
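Neither inequality needs anything fancy to observe empirically; a Monte Carlo sketch with an assumed exponential(1) variable (my choice, so that E(y) = Var(y) = 1):

import numpy as np

# Chebyshev and Jensen for y ~ exponential(1).
rng = np.random.default_rng(1)
y = rng.exponential(size=1_000_000)

eps = 2.0
print((np.abs(y - 1) >= eps).mean(), 1 / eps**2)  # lhs is below Var(y)/eps^2

# Jensen with the convex function g(u) = u^2: E[y^2] = 2 >= [E(y)]^2 = 1.
print((y**2).mean(), y.mean()**2)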
Differentiating the identity T[T^{−1}(v)] = v and using the chain rule on the right-hand
side gives I = [d_u T(u)|_{u=T^{−1}(v)}][d_v T^{−1}(v)].
Moreover, since det(I) = 1 and, for conformable square matrices A and B, det(AB) =
det(A)det(B), we get det[d_v T^{−1}(v)] = 1/det[d_u T(u)|_{u=T^{−1}(v)}].
    The score function and the information are important concepts in statistical infer-
ence and are discussed in Chapter 6. The next equation implicitly defines the score
function and then simplifies it for location families, where f (y|θ) = h(y − θ). Here
ḟ(y|θ) ≡ d_θ f (y|θ):

S(y; θ) ≡ ḟ(y|θ)/f (y|θ) = −ḣ(y − θ)/h(y − θ).

A similar simplification holds for scale families with f (y|θ) = h(y/θ):

S(y; θ) ≡ ḟ(y|θ)/f (y|θ) = −(y/θ²) ḣ(y/θ)/h(y/θ).
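A symbolic check of the location-family formula, assuming (my choice) that h is the standard normal density, so f (y|θ) is N(θ, 1) and the score should be y − θ:

import sympy as sp

# Score for the location family f(y|theta) = h(y - theta) with h the
# standard normal density.
y, theta = sp.symbols('y theta', real=True)
f = sp.exp(-(y - theta)**2 / 2) / sp.sqrt(2 * sp.pi)
print(sp.simplify(sp.diff(f, theta) / f))   # prints: y - theta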
A.6 Exercises
Exercise A.6.2.      Using the methods of Subsection A.4.1, prove Markov’s in-
equality: for ε > 0,

P(|y| ≥ ε) ≤ E(|y|)/ε.

Exercise A.6.3.      Let U ∼ U[0, 1] with density f_U(u) = I_{(0,1)}(u). Let Y ≡ 2U. Use
the change of variable formula to find the density of Y. You should be able to guess
the correct answer. The problem is to show that it is correct.
Exercise A.6.4.       Let the n-vector Z have independent standard normal compo-
nents and for a fixed nonsingular matrix A and a fixed vector µ define Y ≡ AZ + µ.
Find the mean and covariance matrix of Y and use the change of variable formula to
find the density of Y in terms of the mean and covariance matrix. Hints: Recall that
determinants have the properties that det(A) = det(A′) and det(AB) = det(A)det(B).
Appendix B
Measure Theory and Convergence
This book does not require the reader to know measure theory or measure theoretic
probability. But some of the ideas in measure theoretic probability are extremely
useful and we seek to provide some intuition for them.
Lebesgue measure generalizes the concepts of length, area, and volume to arbitrary
sets in an arbitrary number of dimensions. Life being what it is, some people are
smart enough to find sets of points for which even Lebesgue’s theory is incapable
of finding their length etc., but for most of us, any set we can dream up, Lebesgue’s
theory will measure.
    Any reasonable measure of length has to satisfy certain properties. If A is any set,
its length, say µ(A), has to be greater than or equal to 0. If you have any two sets A1
and A2 , like (0.2,0.6] and (0.5,0.7], the total length of the set has to satisfy µ(A1 ∪
A2 ) ≤ µ(A1 ) + µ(A2 ). If the sets are disjoint the inequality becomes an equality.
If this works for two sets, it works for any finite number of sets. We will assume
that it also works for a countably infinite number of sets, although one can have
philosophical debates about that. Incidentally, the finite version is enough to ensure
that if A1 ⊂ A2 , then µ(A1 ) ≤ µ(A2 ).
    Let’s get our hands dirty by showing that the length of the set of rational numbers
in the unit interval is 0. Let h = 1, 2, 3, . . . and i = 1, . . . , h. The numbers i/h are
all of the rational numbers in (0, 1] (with many numbers – obviously 1 – repeated
many times). We want to list all of these numbers with a single index so take n =
(h − 1)h/2 + i. If you know n, you can figure out what h and i have to be. If you
only know i/h, rather than i and h there are lots of values n that correspond to it, but
that is no problem.
    Let Q denote the rationals in (0, 1]. For any ε > 0, and any n we put a ball
(interval) around i/h of length ε/2n , call the ball An . Obviously { i/h } ⊂ An and
µ(Q) = µ(∪_{h=1}^{∞} ∪_{i=1}^{h} {i/h}) ≤ µ(∪_{n=1}^{∞} A_n) ≤ ∑_{n=1}^{∞} µ(A_n) = ∑_{n=1}^{∞} ε/2^n = ε.

So, proof by contradiction. If you claim that µ(Q) = δ > 0, I can find ε < δ that
contradicts your claim. The only length that can work for the rationals is µ(Q) = 0.
   The same argument will establish that any countable set of points has Lebesgue
measure 0, so in particular any finite set of points has Lebesgue measure 0.
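The covering argument is easy to see numerically: however many balls we use, their total length never exceeds ε. A tiny sketch (ε and the number of balls are arbitrary):

# Total length of the first N covering balls, the nth having length eps/2^n.
eps = 0.01
N = 10_000
print(sum(eps / 2**n for n in range(1, N + 1)))   # < eps, approaching eps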
   By definition, the numbers in (0,1] that are not rational are irrational, say Ir.
Since (0, 1] = Q ∪ Ir and the sets are disjoint, µ(Ir) = µ((0, 1]) − µ(Q) = 1 − 0 = 1.
Anything that occurs except on a set of measure 0 is said to occur almost everywhere.
So irrational numbers occur in (0,1] almost everywhere.
     An integral measures the area (volume, hypervolume) under a curve. Suppose
we have a function from (x, y) into z, say f (x, y), and we want to measure the volume
under the curve defined by f (x, y).
     The Riemann idea of an integral is to divide the (x, y) points into small regions
and approximate the volume under the curve for that region as the area of the region
times the height, where the height is just the value of f (x, y) for any point (x, y) in
the region. (There is a fair amount of slop here, but don’t worry.) The approximate
volume under the entire curve is just the sum of the approximate volumes for all
the small regions. Does this always work? Of course not! If x and y are allowed to
be any real numbers, you need an infinite number of small regions to sum over. If
 f (x, y) is not well behaved, it can matter a great deal which (x, y) you pick in each
region to use as the height for the approximate volume. We need a guarantee that
the values of f (x, y) cannot vary too much within our small (x, y) regions. But for
lots of well-behaved functions, this idea works well.
     The Lebesgue idea of an integral is different and works (by and large) for less
well-behaved functions than you need for Riemann integration. There are Lebesgue
integrable functions that are not Riemann integrable and Riemann integrable func-
tions that are not Lebesgue integrable, but typically when both integrals exist they
give the same result. One key point is that no function f is Lebesgue integrable
unless its absolute value is also integrable. That is not a requirement for Riemann
integrable functions.
     In our 3-dimensional example, Riemann integration divides up the (x, y) plane
into small regions whereas Lebesgue integration divides the z axis into small re-
gions. For each small region of the z axis, there exists a set of points (x, y) that will
give you a value of f (x, y) in that small region of z. This set of (x, y) points can
be very complicated, but as mentioned earlier, Lebesgue measure can be used to
determine the area associated with this complicated set of points. Again, we will
approximate the volume by the area of the set in the x, y plane times the height of
the function but now the height is restricted to be in a very narrow region of the
z axis and we are using Lebesgue measure theory to give us the area of the corre-
sponding set of (x, y) values. We have a good way of measuring areas, so this is
subject to much less variability (in the z direction) than is Riemann integration.
    Of course the theory is far more complicated than this. Lebesgue integrals are
defined first for something called simple functions (a generalization of step func-
tions) for which the integral is easy to compute and then simple functions are used
to approximate more complicated functions, with the simple function integrals ap-
proximating the more complicated function’s integral.
    Although it is rather redundant, the Lebesgue integral of a function that is 1 when
(x, y) ∈ A and 0 when (x, y) ∉ A is the area of the set A. In other words, define the
indicator function

I_A(x, y) ≡ 1 if (x, y) ∈ A, and 0 if (x, y) ∉ A;

then

µ(A) = ∫ I_A(x, y) dµ(x, y),

where µ is being used to denote Lebesgue measure and dµ(x, y) denotes that we
are integrating with respect to Lebesgue measure. If f (x, y) = g(x, y) almost every-
where, their integrals must be the same, i.e., ∫ f (x, y) dµ(x, y) = ∫ g(x, y) dµ(x, y).
   The whole idea of Lebesgue integration is based on Lebesgue measure which
corresponds to our usual (Euclidean) conception of length, area, and volume. As
systematized by Kolmogorov, probability is just an alternative measure that replaces
length, area, or volume. Probabilities act much like areas. The area of any set has to
be at least 0. Same for probabilities. The total area of your living quarters is the sum
of the areas for each room. The probability of you being in your living quarters is
the sum of the probabilities of you being in each room. The probability of the union
of disjoint sets has to be the sum of the probabilities for each set. The probability
of the union of any sets has to be less than or equal to the sum of the probabilities
for each set. Again, that is pretty obvious if you have a finite number of (disjoint)
sets but life is much, much easier if we also assume that it is true for an infinite
number of disjoint sets. (Again, this is a matter of some controversy, especially
among Bayesian statisticians.) The main difference between a probability measure
and Lebesgue measure is that the biggest a probability can ever be is 1.
   Suppose we have a function from (x, y) into z, say f (x, y) and a probability mea-
sure on (x, y), say P. Using the same ideas as for Lebesgue measure we can define
integrals with respect to the probability measure P, say
                                    Z
                                        f (x, y) dP(x, y).
Now, anything that holds with probability one is said to hold almost surely (a.s.).
   Technically, a probability space is a triple (Ω, F, P) where Ω is the set of possi-
ble outcomes, F is the collection of sets (the sigma field) of outcomes that we will
work with (remember some sets are too weird for Lebesgue measure), and P is the
probability measure which is defined for every set in F. Our three main rules are
1. P(∅) = 0; the probability that nothing happens (the empty set ∅ occurs) is 0.
2. P(Ω) = 1; the probability that something happens is 1.
3. For disjoint sets A_1, A_2, …, P(∪_{i=1}^{∞} A_i) = ∑_{i=1}^{∞} P(A_i).
A random variable y is a function from Ω to R; with B denoting the Borel σ-field on R,
the random variable y is measurable if for any B ∈ B, we always have y^{−1}(B) ∈ F.
For comparison, it is well known that if (Ω, F) = (R, B), so we can talk about
continuous functions, y is continuous if and only if y^{−1}(B) is an open set whenever
B is an open set.
   The expected value of y is defined by its integral with respect to the probability
measure,                              Z
                              E(y) ≡ y(ω) dP(ω).
   When I first learned probability, one thing that stumped me is that at some point
Ω went away. We stop really needing it because we focus exclusively on random
variables. If we have (Ω , F , P) and a random variable y, we can just as well work
with a new probability space (R, B, P_y) in which we define P_y by

P_y(B) ≡ P[y ∈ B] = P[y^{−1}(B)],

so we can act like the probabilities were all defined on the real line to begin with. In
this case, the random variable y defined on (R, B, P_y) takes the value y(u) = u for
any u ∈ R.
   A random vector is a mapping y : Ω → R^n and we take the expected values
elementwise, i.e., write y(ω) = [y_1(ω), …, y_n(ω)]′ and E(y) ≡ [E(y_1), …, E(y_n)]′.
Everything discussed to this point extends in pretty obvious ways.
   For a space (Rn , B n ), a probability distribution P is said to be absolutely contin-
uous with respect to Lebesgue measure µ if for any set A ∈ B n with µ(A) = 0, we
also have P(A) = 0. We will see in Appendix C that this is the property that allows
us to find density functions as in Appendix A. Standard continuous distributions
like the normal are specified by their densities wrt Lebesgue measure, so they are
automatically absolutely continuous; standard discrete distributions are specified by
their densities wrt counting measure (in which the measure of a set is the number
of integers in the set). Although
our primary interest is in relating a probability measure to Lebesgue measure on
(Rn , B n ), the concept of absolute continuity works for any two measures defined
on the same space.
    For example, we might take (y_1, y_2) to be the heights of a randomly selected
man and woman. It is actually hard to uniquely define a man or woman’s height and
all measurements are fundamentally discrete, but let’s pretend that heights are well-
defined and continuous. What is the length of the set { 65 }? It is only one point;
it has no length. Similarly, under any probability distribution that is absolutely
continuous with respect to Lebesgue measure, the probability that a person is 65
inches tall is 0. There is some positive probability that a person is between 64.5
and 65.5 inches tall, but no chance that someone is exactly 65 inches. In reality, all
of the measurements that we make in life (length, mass, time, etc.), although we
think of them as measuring continuous variables, are really statements that the
measurement is within some interval (centered at a rational number) determined by
the accuracy of the measuring device. Approximating this as a continuous
measurement rarely causes problems.
EXAMPLE B.2.1. Our probability space is the unit interval Ω = [0, 1] with the
uniform distribution (and Borel sets). Thus, the probability of any set is just its
length. We will define sequences of random variables yn that converge to a random
variable y(ω) = 0 for all ω. (Remember ω ∈ [0, 1].)
    Consider the random variable yn defined as the indicator function of the set
[0, 1/n], i.e.,
                               yn (ω) = I[0,1/n] (ω).
This random variable converges in all four ways.
   It converges almost surely to 0 because for every ω ∈ (0, 1], the sequence of
numbers y_n(ω) converges to the number 0. (As soon as 1/n < ω, we get y_n(ω) = 0.)
And by assumption the probability that ω ∈ (0, 1] is 1, i.e., P{(0, 1]} = 1. The fact
that y_n(0) = 1 for all n, so that y_n(0) = 1 ̸→ y(0) = 0, does not matter because it
occurs with zero probability.
   To get convergence in mean square we need ∫[y_n − y]² dP → 0. Since y = 0 a.s.,

∫[y_n − y]² dP = ∫[I_{[0,1/n]}(ω)]² dω = ∫ I_{[0,1/n]}(ω) dω = ∫_0^{1/n} dω = 1/n → 0.
   To get convergence in probability we need P[|y_n − y| > ε] → 0 for any ε > 0.
Since y = 0 a.s., for 0 < ε ≤ 1 (bigger εs are easy)

P[|y_n − y| > ε] = P[y_n = 1] = P(0 ≤ ω ≤ 1/n) = 1/n → 0.
   For convergence in distribution, we need the cdf of y_n, say F_n(v) ≡ P[y_n ≤ v], to
converge to the cdf F(v) ≡ P[y ≤ v] at every point v at which F(v) is continuous.
For 0 ≤ v < 1,

F_n(v) = 1 − 1/n → 1 = F(v).
In this case, we even get convergence at v = 0 which we do not need.                               2
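A quick empirical look at Example B.2.1 (sampling ω uniformly, as the probability space dictates; the sample size is arbitrary):

import numpy as np

# y_n = indicator of [0, 1/n] on Omega = [0, 1] with the uniform distribution.
rng = np.random.default_rng(2)
omega = rng.uniform(size=500_000)
for n in (10, 100, 1000):
    yn = (omega <= 1 / n).astype(float)
    # P[y_n = 1] and E[y_n^2] both equal 1/n, so both go to 0.
    print(n, yn.mean(), (yn**2).mean())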
EXAMPLE B.2.2. On the same probability space, now take y_n(ω) ≡ √n I_{[0,1/n]}(ω).
The arguments for almost sure convergence, convergence in probability, and con-
vergence in distribution continue to hold with very little change, but this random
variable does not converge to 0 in mean square. To get convergence in mean square
we need ∫[y_n − y]² dP → 0. Since y = 0 a.s.,

∫[y_n − y]² dP = ∫[√n I_{[0,1/n]}(ω)]² dω = n ∫ I_{[0,1/n]}(ω) dω = n ∫_0^{1/n} dω = n(1/n) = 1 ̸→ 0. 2
Exercise B.1.       Show convergence a.s., in probability, and in distribution for Ex-
ample B.2.2.
   It is a little trickier to get something that converges in mean square but does not
converge almost surely.
we get a yn that does not converge either almost surely or in quadratic mean but does
converge in probability and in distribution.                                       2
Exercise B.2.        Establish convergence (or its lack) in quadratic mean, in proba-
bility, and in distribution for the sequences in Example B.2.3.
Exercise B.3.        For the probability space in the Examples define yn (ω) to be
(n − 1)/n if ω is irrational and 0 if it is rational. To what and how does yn converge?
which is the probability that a N(0, 4) is farther from 0 than ε. That number is
certainly not getting close to 0, and the smaller ε gets, the closer it is to 1. Clearly,
y_n does not converge in probability to y.
   We now give the general definitions of convergence for random vectors, but first
recall that the length of a vector is ∥y∥ = √(y′y).
1. y_n converges to y in distribution, written y_n →^L y, if
                        F_n(v) → F(v)
   at every point v where F is continuous.
2. y_n converges to y in probability, written y_n →^P y, if for every ε > 0,
                        P[ ∥y_n − y∥ > ε ] → 0.
3. y_n converges to y almost surely, written y_n →^{a.s.} y, if
                        P[{ω| lim_{n→∞} ∥y_n(ω) − y(ω)∥ = 0}] = 1.
4. y_n converges to y in quadratic mean, written y_n →^{q.m.} y, if
                        E[ ∥y_n − y∥² ] → 0.
Note that all of these definitions involve checking whether certain sequences of
numbers converge and all but number 3 reduce merely to checking convergence of
numbers.
How can we tell when two probability measures P_1 and P_2, both defined on (Ω, F),
are the same? Obviously they are the same if they give the same probability for
every set in F. In Appendix A we claimed that if two random variables had the
same characteristic function, they had the same distribution. A separating class is a
class of functions S such that if

∫ f (ω) dP_1(ω) = ∫ f (ω) dP_2(ω)

for every f ∈ S, then P_1 = P_2. It turns out that if E[ f (y_n)] → E[ f (y)] for every f in
a separating class, then

y_n →^L y.

In particular, if the characteristic functions of the y_n s converge for all t in an interval
around 0 to the characteristic function of y, then y_n →^L y.
   What if we do not know y, so that all we know is that E[ f (y_n)] converges to
something for every f ? Consider a sequence of probability distributions P_1, P_2, …
defined by a sequence of random variables y_1, y_2, …. The sequence is said to be
tight if for any ε > 0 there exists a closed bounded set B_ε such that P_n(B_ε) ≥ 1 − ε
for every n. The sequence of distributions is tight in the sense that it does not have
a lot of probability going off towards infinity. It turns out that if the sequence is
tight and if E[ f (y_n)] converges to something for every f in a separating class, then
there exists y such that y_n →^L y. In particular, if the characteristic functions of the
y_n s converge for all t in an interval around 0 to some function ϕ(t) that is continuous
on a ball around 0, it turns out that the sequence has to be tight, so there exists a y
with y_n →^L y.
    One of my oldest (and therefore, not necessarily accurate) memories in statistics
is hearing Joe (Morris L.) Eaton say that, if you need to use characteristic functions
to prove something, you obviously don’t know what you are doing. (Notwithstand-
ing, every edition of PA has used characteristic functions to prove that multivariate
normal distributions are uniquely defined by their mean vector and covariance ma-
trix.)
A useful related fact: if y_n converges to y in quadratic mean, then E(y_n) → E(y).
The Central Limit Theorem states that if x_1, …, x_n are iid with E(x_i) = µ and
Var(x_i) = σ², then the sample mean x̄· has the property that

(x̄· − µ)/√(σ²/n) →^L N(0, 1).

Alternatively, √n(x̄· − µ) →^L N(0, σ²). A common way to prove this is to take a
second-order Taylor expansion of the characteristic function of (x̄· − µ)/√(σ²/n)
and show that it converges to the characteristic function of a standard normal. We
will not be doing that. We present without proof a more general result.
Lindeberg’s theorem: for each n let y_{n1}, …, y_{nn} be independent with E(y_{ni}) = 0,
and set z_n ≡ ∑_{i=1}^{n} y_{ni} and B_n² ≡ Var(z_n). If, for every ε > 0,

0 = lim_{n→∞} (1/B_n²) ∑_{i=1}^{n} E[ |y_{ni}|² I_{[εB_n,∞)}(|y_{ni}|) ],      (1)

then

z_n/B_n →^L N(0, 1).
   Lindeberg’s result implies the usual Central Limit Theorem for iid random vari-
ables.
EXAMPLE B.2.6. Take x_1, …, x_n iid with E(x_i) = µ and Var(x_i) = σ². Set y_{ni} ≡
x_i − µ, so z_n = ∑_{i=1}^{n}(x_i − µ) and B_n² = nσ². If the Lindeberg condition holds,
z_n/B_n = √n(x̄· − µ)/σ →^L N(0, 1), which is the usual Central Limit Theorem. In
this case the Lindeberg condition reduces to showing that

E[ |x_i − µ|² I_{[ε√nσ,∞)}(|x_i − µ|) ] → 0.

However, lim_{a_n→∞} I_{[a_n,∞)}(u) = 0 and E[|x_i − µ|²] = σ², so by probability dom-
inated convergence, with |x_i − µ|² as the dominating function and the sequence
|x_i − µ|² I_{[ε√nσ,∞)}(|x_i − µ|) converging to 0 as n → ∞ a.s.,

E[ |x_i − µ|² I_{[ε√nσ,∞)}(|x_i − µ|) ] → E[0] = 0.
EXAMPLE B.2.7. The x_n s are independent with Pr[x_n = ±√(n − 1)] = 0.25 and
Pr[x_n = ±1] = 0.25. Note that x_n is an example of a sequence of random variables
that is not tight. Define y_{ni} ≡ x_i and observe that Var(x_i) = i/2, so B_n² = ∑_{i=1}^{n} i/2 =
n(n + 1)/4. If the Lindeberg condition holds,

∑_{i=1}^{n} x_i / √(n(n + 1)/4) = [x̄·/√(1/4)] [n/√(n(n + 1))] →^L N(0, 1).

Since n/√(n(n + 1)) → 1, according to something called Slutsky’s theorem,

x̄·/√(1/4) →^L N(0, 1),

or x̄· →^L N(0, 0.25).
    It remains to show that the Lindeberg condition (1) holds. In this example the
Lindeberg condition reduces to

lim_{n→∞} [1/(n(n + 1)/4)] ∑_{i=1}^{n} E[ |x_i|² I_{[ε√(n(n+1)/4),∞)}(|x_i|) ] = 0.
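Even without verifying the Lindeberg condition analytically, the conclusion can be checked by simulation (a sketch; n and the number of replications are arbitrary):

import numpy as np

# x_i = +/- sqrt(i-1) or +/- 1, each with probability 1/4, independently.
rng = np.random.default_rng(4)
n, reps = 2000, 5_000
i = np.arange(1, n + 1)
mag = np.where(rng.random((reps, n)) < 0.5, np.sqrt(i - 1.0), 1.0)
sign = rng.choice([-1.0, 1.0], size=(reps, n))
xbar = (mag * sign).mean(axis=1)
print(xbar.mean(), xbar.var())   # approximately 0 and (n+1)/(4n) = 0.25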
    There is a concept called entropy that measures the randomness in a random vari-
able. It turns out that among distributions with a common expected value and vari-
ance, the normal distribution has the largest entropy. In the Central Limit Theorem
we have fixed expected values and variances for the individual random variables, so
the sample mean also has a fixed expected value and variance. The Central Limit
Theorem is basically saying that the sample mean converges in distribution to the
most random thing possible that retains the correct expected value and variance,
cf. Barron (1986).
Appendix C
Conditional Probability and Radon-Nikodym
The Radon-Nikodym Theorem basically tells us that for any absolutely continuous
distribution, a density exists, and for any discrete distribution, a probability mass
function (discrete density) exists. Typically, we define distributions in terms of their
densities, so this is somewhat circular. A more important application of Radon-
Nikodym is to defining conditional probabilities and expectations.
Before stating the theorem we need a few additional ideas. A σ-finite measure
µ may have µ(Ω) = ∞ but has sets A_1, A_2, … with µ(A_i) < ∞ and Ω = ∪_{i=1}^{∞} A_i.
Lebesgue measure is σ-finite. A signed measure λ simply allows negative mea-
sures. It must display countable additivity for sets of finite measure and λ(∅) = 0.
The set R̄ includes ±∞.
To get the meters to cancel out, the density has to have units that are 1/meters.
Because of this feature, sometimes using densities in statistical inference can get
dicey. Cox (2006) rightly calls the use of the name “density” for a probability mass
function “deplorable” but, like us, still uses it.
    Another issue is that there is no compelling reason why continuous densities should
be defined relative to Lebesgue measure. They could just as well be defined relative
to the N(0, 1) probability distribution.
Exercise C.1.       Show that the density of a U[0, 1] relative to the N(0, 1) is just
I[0,1] (v) divided by the usual standard normal density. Why can you not find the
density of a N(0, 1) relative to a U[0, 1]?
                                            P(x ∈ A, y ∈ B)
                         P(x ∈ A|y ∈ B) ≡                   ,
                                               P(y ∈ B)
when P(y ∈ B) > 0. The problem is how to define conditional probability when
P(y ∈ B) = 0. Specifically, we want to develop the ideas of P(x ∈ A|y = v) when
P(y = v) = 0, and E(x|y = v), or more generally just P(x ∈ A|y) and E(x|y) as
functions of y.
   The key idea is to define P(x ∈ A|y) in such a way that it is a function of y alone
and that for any allowable B,

P(x ∈ A, y ∈ B) = ∫_{y∈B} P(x ∈ A|y) dP,
where in P[x ∈ A|y(ω)] the symbols x ∈ A are only part of a name and do not actually
depend on ω.
   The vector (x′ , y′ )′ is mapping (Ω , F ) into (Rm+n , B m+n ). (B m+n can be thought
of as the smallest σ -field generated by products of sets in B m and sets in B n hence
also denoted B m × B n .) For fixed A, P(x ∈ A, y ∈ B) defines a new measure on
Rn for B ∈ B n , typically not a probability measure, yet Radon-Nikodym assures us
that some function of y(ω) exists that characterizes the new measure. We call this
function P(x ∈ A|y).
   But we also want to avoid having to think about the ωs and just think about the
random vectors. We can also write the definition of conditional probability using
                                                       Z
          P(x ∈ A, y ∈ B) ≡ Pxy (A × B) =                       P(x ∈ A|y = v) dPxy (u, v)
                                                        Rm ×B
         P(x ∈ A, y ∈ B) ≡ Pxy (A × B)
                                  Z
                              =     P(x ∈ A|y = v) fxy (u, v) d[µx × µy ](u, v)
                                   Rm ×B
                                  Z Z                                 
                              =       P(x ∈ A|y = v) fxy (u, v) dµx (u) dµy (v)
                                 B Rm
                                Z              Z                      
                              = P(x ∈ A|y = v)       fxy (u, v) dµx (u) dµy (v)
                                   B                          Rm
                                  Z
                              =        P(x ∈ A|y = v) fy (v) dµy (v)
                                  ZB
                              =        P(x ∈ A|y = v) dPy (v)
                                   B
so                                   Z
                P(x ∈ A|y = v) ≡         fxy (u, v)/ fy (v) dµx (u),         a.s. µy ,
                                     A
is a function of y (it tells us what the function is for every y = v) such that when
integrated over y ∈ B gives P(x ∈ A, y ∈ B) for any B ∈ B n . Radon-Nikodym tells
us that any other such function must equal this one a.s. Note that fxy (u, v)/ fy (v) is
undefined for fy (v) = 0, but as a function of y, fy (v) = 0 on a set of y probability 0,
so we can define the ratio any way we desire on this set.
   There is a slight catch. If we want to think about P(x ∈ A|y = v) defining a condi-
tional distribution on x given y = v, we need to think about v being fixed and varying
the sets A ∈ B m . Although P(x ∈ A|y = v) is unique up to sets of y probability 0, by
changing A an uncountably infinite number of times, the uncountable accumulation
of sets of y probability 0 might cause a problem. Fortunately, it can be shown that
there is a version of P(x ∈ A|y = v) that works fine. In fact, when we can do the iter-
ated integrals, P(x ∈ A|y = v) ≡ ∫_A f_{xy}(u, v)/f_y(v) dµ_x(u) is such a version because
we can define the conditional probabilities as the result of the integral. In particular,
P(x ∈ A|y = v) admits a density wrt µ_x(u), which is f_{xy}(u, v)/f_y(v).
   Now we extend these ideas to conditional probabilities for sets that are not product
sets A × B. For D ∈ B^m × B^n = B^{m+n}, to define P[(x, y) ∈ D|y = v] we require, for
every B in B^n,

P[(x, y) ∈ D, y ∈ B] = ∫_B P[(x, y) ∈ D|y = v] dP_y(v).

Similarly, for a measurable T, E[g(y)|T(y)] is defined by requiring

E[ g(y) I_B(T(y)) ] = E{ E[g(y)|T(y)] I_B(T(y)) }

for all Borel sets B. As a notational matter, the collection of all sets T^{−1}(B) for
B ∈ B^d defines a sub-σ-field, say, F_0 contained in B^n and sometimes E[g(y)|T(y)]
is written as E[g(y)|F_0].
   In particular, we can apply this result replacing y with (x′ , y′ )′ , g(y) with g(x, y),
and T (y) with y to see that E[g(x, y)|y] has the requirement that, for any B ∈ B n ,
\[
E[g(x,y) I_B(y)] \equiv \int_{\mathbb{R}^m \times B} g(u,v)\, dP_{xy}(u,v)
= \int_{\mathbb{R}^m \times B} E[g(x,y)|v]\, dP_{xy}(u,v)
= \int_B E[g(x,y)|v]\, dP_y(v).
\]
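A quick Monte Carlo sketch (mine, not the book's) of this defining property: with (x, y) standard bivariate normal and g(x, y) = xy, the closed form E[g(x,y)|y] = ρy² lets us check that both sides of the requirement agree for B = [0, ∞).

   set.seed(1)
   rho <- 0.6; n <- 1e6
   y <- rnorm(n)
   x <- rho*y + sqrt(1 - rho^2)*rnorm(n)        # x|y ~ N(rho*y, 1 - rho^2)
   B <- (y > 0)                                 # indicator that y is in B
   c(mean(x*y*B), mean(rho*y^2*B))              # two estimates of the same number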
D Some Additional Measure Theory

To deal with continuous distributions, probabilities are defined on sets rather than on outcomes. As discussed earlier, the probability that someone is 65 inches tall is zero but the probability that someone is in any neighborhood of 65 inches is generally positive. The sets on which we define probabilities (or any other measures) must constitute a σ-field.
   Consider a set of outcomes Ω. For a set F ⊂ Ω, define its complement to be F^C ≡ {ω ∈ Ω : ω ∉ F}.
Take numbers −∞ ≡ a_0 < a_1 < a_2 < · · · < a_{n−1} < ∞ and define the sets A_i = (a_{i−1}, a_i], i = 1, . . . , n − 1, and A_n = (a_{n−1}, ∞). Also take numbers f_1, . . . , f_n. The function
\[
f(u) \equiv \sum_{i=1}^{n} f_i I_{A_i}(u)
\]
is a step function. Note that it is extremely easy to compute the Riemann integral of a step function over any bounded interval. In particular,
\[
\int_{a_1}^{a_{n-1}} f(u)\, du = \sum_{i=2}^{n-1} f_i\, (a_i - a_{i-1}).
\]
Now take any function f, pick points x_i ∈ A_i, and define the step function f̃(u) ≡ ∑_{i=1}^{n} f(x_i) I_{A_i}(u). If ∫_{a_1}^{a_{n−1}} f̃(u) du converges to some number as you let n → ∞ with a_i − a_{i−1} → 0 for all i, regardless of how you pick the x_i but keeping a_1 and a_{n−1} fixed, that number is the Riemann integral ∫_{a_1}^{a_{n−1}} f(u) du. If you then let a_1 and a_{n−1} go to ∓∞ and the integrals converge, you get the improper Riemann integral over the whole line.
   The Lebesgue integral ∫ f(u) dµ(u) is built the same way except that the approximating step functions are determined by partitioning the range of f with numbers b_0 < b_1 < · · · < b_n rather than partitioning the domain. If ∫ f̃(u) dµ(u) converges to some number as you let n → ∞ with b_i − b_{i−1} → 0 for all i, that number is the Lebesgue integral of f. For functions taking both signs, write f = f⁺ − f⁻ with f⁺, f⁻ ≥ 0 and define ∫ f dµ ≡ ∫ f⁺ dµ − ∫ f⁻ dµ, provided the two integrals on the right both exist and are finite. If they both exist, ∫ |f| dµ ≡ ∫ f⁺ dµ + ∫ f⁻ dµ, so the integral of f only exists if the integral of |f| is finite.
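As a small illustration (not from the text), the following R code carries out the Riemann construction for f(u) = u² on [0, 1]; the step-function sums approach 1/3 as the partition is refined.

   f <- function(u) u^2
   riemann <- function(n) {
     a <- seq(0, 1, length.out = n + 1)         # partition points a_0 < ... < a_n
     x <- (a[-1] + a[-(n + 1)])/2               # a point x_i inside each A_i
     sum(f(x) * diff(a))                        # sum of f(x_i)(a_i - a_{i-1})
   }
   sapply(c(10, 100, 1000), riemann)            # approaches 1/3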
   Consider the measurable space (R, B). You shouldn't be reading this if you don't know what R² is, but you may not know the appropriate σ-field for R². In one dimension, the Borel σ-field is generated by intervals. In two and three dimensions, it is generated by rectangles and boxes, respectively. For n dimensions let A_1, . . . , A_n be sets in R and v = (v_1, . . . , v_n)′. Define product sets
\[
A_1 \times \cdots \times A_n \equiv \{v \,|\, v_1 \in A_1, \ldots, v_n \in A_n\}.
\]
The σ-field B^n is the smallest σ-field containing all the product sets with the A_i s defined by finite intervals. In two dimensions, products of intervals are rectangles. In three dimensions, they are boxes.
   It turns out that knowing a measure on a collection of sets that generates the σ-field is enough to determine the entire measure. For Lebesgue measure in n dimensions define µ_n(A_1 × · · · × A_n) ≡ ∏_{i=1}^n µ(A_i). In two or three dimensions we might write µ_2 = µ × µ or µ_3 = µ × µ × µ. When getting lazy, we write µ_n ≡ µ and let you figure out the dimension, as in Appendix B.
   More generally we can have different measures on different parts of the space. For example we can have Lebesgue measure on some parts and counting measure on other parts. That would be useful if we have Bin(N, θ) data (having a density wrt counting measure) and a Beta(α, β) distribution on θ (having a density wrt Lebesgue measure). Together they have a joint density with respect to the product measure obtained by crossing counting measure with Lebesgue measure.
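Here is a small R sketch (my numbers) of that Bin(N, θ) and Beta(α, β) example: the joint density with respect to (counting) × (Lebesgue) measure should have total mass 1 when we sum over the data values and integrate over θ.

   N <- 10; a <- 2; b <- 3
   fjoint <- function(y, theta) dbinom(y, N, theta) * dbeta(theta, a, b)
   # total mass: sum over y (counting measure), integrate over theta (Lebesgue)
   total <- sum(sapply(0:N, function(y)
     integrate(function(th) fjoint(y, th), 0, 1)$value))
   total                                        # numerically 1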
   Consider a probability space (Ω, F, P) and a random vector y : (Ω, F) → (R^n, B^n). Specifically, write the random vector y in terms of random variables, y = (y_1, . . . , y_n)′. We define P_y on (R^n, B^n) by defining
\[
P_y(A) \equiv P(y \in A) \equiv P(\{\omega \,|\, y(\omega) \in A\}), \qquad A \in \mathcal{B}^n.
\]
This is the density wrt n dimensional Lebesgue product measure (which here we will call m_n) so that for any A ∈ B^n
\[
P(y \in A) = P_y(A) = \int I_A(v) f(v|\mu,\Sigma)\, dm_n(v) \equiv \int_A f(v|\mu,\Sigma)\, dm_n(v),
\]
where the last equivalence defines a shorthand notation. Similarly, the n dimensional
multinomial distribution y ∼ Mult(N, p) has a density with respect to n dimensional
product counting measure of
\[
f(v|N,p) = \frac{N!}{\prod_{i=1}^n v_i!} \prod_{i=1}^n p_i^{v_i}
\]
for nonnegative integers v_i with ∑_{i=1}^n v_i = N.
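As a quick check (mine, not the book's), the formula agrees with R's built-in dmultinom for an arbitrary v and p.

   N <- 6; p <- c(.2, .3, .5); v <- c(1, 2, 3)
   fv <- factorial(N)/prod(factorial(v)) * prod(p^v)   # the formula above
   c(fv, dmultinom(v, size = N, prob = p))             # the two agree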
   Consider a family of probability measures P_θ, θ ∈ Θ, on (Ω, F). If all of these are absolutely continuous with respect to a single (dominating) measure ν, then by Radon-Nikodym densities exist. Write the densities as
\[
f(\omega|\theta), \qquad \theta \in \Theta,
\]
so that
\[
P_\theta(A) = \int_A f(\omega|\theta)\, d\nu(\omega) = \int I_A(\omega) f(\omega|\theta)\, d\nu(\omega).
\]
Usually we think of ν as Lebesgue measure, but if we let it be counting measure,
\[
P_\theta(A) = \sum_{\omega \in A} f(\omega|\theta),
\]
and f(ω|θ) is just a probability mass function.
   Now consider a random variable x : (Ω, F) → (R, B) and define the induced probability measure
\[
P_\theta(x \in A) \equiv P_{x|\theta}(A).
\]
The first of these is defined on (Ω , F ) and the second is defined on (R, B). We
have a dominating measure ν on (Ω , F ) that corresponds to a dominating measure
µ on (R, B) via
                                 µ(A) = ν(x ∈ A).
The density f (ω|θ ) for θ ∈ Θ relative to (Ω , F , ν) corresponds to the density
fx|θ (u) ≡ f (u|θ ), θ ∈ Θ relative to (R, B, µ). [Pretty much the only way to tell
f (ω|θ ) apart from f (u|θ ) is context, but we almost always use the latter.]
    Typically we write
                                     x|θ ∼ f (u|θ )
then
\[
\begin{aligned}
E_{x|\theta}[g(x)] = \int g(u)\, dP_{x|\theta}(u) &= \int g(u) f(u|\theta)\, d\mu(u) \\
= \int g[x(\omega)]\, dP_\theta(\omega) &= \int g[x(\omega)] f(\omega|\theta)\, d\nu(\omega).
\end{aligned}
\]
Typically we would be doing this with Lebesgue measure as the dominating measure µ, but in the counting measure case it reduces to
\[
\begin{aligned}
E_{x|\theta}[g(x)] = \int g(u)\, dP_{x|\theta}(u) &= \sum_{\text{all } u} g(u) f(u|\theta) \\
= \int g[x(\omega)]\, dP_\theta(\omega) &= \sum_{\text{all } \omega} g[x(\omega)] f(\omega|\theta).
\end{aligned}
\]
Again, for rolling dice, think of ω as the top side of a die and u = x(ω) as the number of dots on the top side; P_θ could have θ indexing various ways to weight the die that affect which face comes up. P_{x|θ} is then the probability of the number that comes up from the weighted die (rather than the probability of what face comes up). f(ω|θ) gives the probabilities for all the faces that may come up whereas f_{x|θ}(u) ≡ f(u|θ) gives the probabilities for the numbers that come up. In particular,
\[
f_{x|\theta}(u) = \sum_{\{\omega \,:\, x(\omega) = u\}} f(\omega|\theta).
\]
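A toy R version of the die example (my construction, not the book's): to make the induced distribution differ from f(ω|θ), let x record only whether the number of dots is even. The induced mass function sums f over the faces mapping to each value of x.

   f_omega <- c(.1, .1, .2, .2, .15, .25)       # weighted-face probabilities, sum to 1
   x <- c(0, 1, 0, 1, 0, 1)                     # x(omega) = 1 if the dots are even
   tapply(f_omega, x, sum)                      # induced pmf: f_x(0) and f_x(1)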
E Identifiability

   For better or worse (usually worse) much of statistical practice focuses on estimating and testing parameters. Identifiability is a property that ensures that this process is a sensible one.
   Consider a collection of probability distributions Y ∼ Pθ , θ ∈ Θ . The parameter
θ merely provides the name (index) for each distribution in the collection. Identifi-
ability ensures that each distribution has a unique name/index.
   The problem with not being identifiable is that some distributions have more
than one name. Observed data give you information about the correct distribution
and thus about the correct name. Typically, the more data you have, the more in-
formation you have about the correct name. Estimation is about getting close to the
correct name and testing hypotheses is about deciding which of two lists contains
the correct name. If a distribution has more than one name, it could be in both lists.
(Significance testing is about whether it seems plausible that a name is on a list,
so identifiability seems less of an issue.) If a distribution has more than one name,
does getting close to one of those names really help? In applications to linear mod-
els, typically distributions have only one name or they have an infinite number of
names.
   The ideas are roughly this. If the distributions are well-defined and I know that
Wesley O. Johnson (θ1 ) and O. Wesley Johnson (θ2 ) are the same person (θ1 = θ2 ),
then, say, any collection of blood pressure readings on Wesley O. should look pretty
much the same as comparable readings on O. Wesley. They would be two samples
from the same distribution. Identifiability is the following: if all the samples I have
taken or ever could take on Wesley O. look pretty much the same as samples on
O. Wesley, then Wesley O. would have to be the same person as O. Wesley. (The
reader might consider whether personhood is actually an identifiable parameter for
blood pressure.)
    For multivariate normal distributions, being well-defined is the requirement that
if Y1 ∼ N (µ1 ,V1 ), Y2 ∼ N (µ2 ,V2 ), and µ1 = µ2 and V1 = V2 , then Y1 ∼ Y2 . Being
identifiable is that if Y1 ∼ N (µ1 ,V1 ), Y2 ∼ N (µ2 ,V2 ), and Y1 ∼ Y2 , then µ1 = µ2 and
V1 = V2 . Obviously, two random vectors with the same distribution have to have the
same mean vector and covariance matrix. But life gets more complicated.
   The more interesting problem for multivariate normality is a model
\[
Y \sim N[F(\beta), V(\phi)],
\]
where F and V are known functions of parameter vectors β and φ. To show that β and φ are identifiable we need to show that if N[F(β_1), V(φ_1)] and N[F(β_2), V(φ_2)] are the same distribution, i.e., if F(β_1) = F(β_2) and V(φ_1) = V(φ_2), then β_1 = β_2 and φ_1 = φ_2.
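A small R illustration (not from the text) of nonidentifiability in the linear model case F(β) = Xβ: with a rank-deficient X, distinct β vectors give exactly the same N(Xβ, σ²I) distribution because their mean vectors agree.

   X <- cbind(1, c(0, 0, 1, 1), c(1, 1, 0, 0))  # third column = first minus second
   beta1 <- c(2, 1, 3)
   beta2 <- beta1 + c(1, -1, -1)                # (1,-1,-1) is in the null space of X
   cbind(X %*% beta1, X %*% beta2)              # identical mean vectors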
F Multivariate Differentiation

F.1 Differentiation

If F is a function from R^s into R^t with F(x) = [f_1(x), . . . , f_t(x)]′, then the derivative of F at c is the t × s matrix of partial derivatives,
\[
dF(c) \equiv \left[ \frac{\partial f_i(c)}{\partial x_j} \right]_{i=1,\ldots,t;\; j=1,\ldots,s}.
\]
We also write Ḟ(c) ≡ dF(c).
The first-order Taylor's expansion of F about c is
\[
F(x) = F(c) + dF(c)(x - c) + o(\|x - c\|),
\]
where ∥x − c∥² = (x − c)′(x − c) and, for scalars a_n → 0, o(a_n) has the property that o(a_n)/a_n → 0. (This is a vector divided by a scalar converging to the 0 vector.)
   In fact, the first order Taylor's expansion is essentially the mathematical definition of a derivative. The technical definition of a derivative, if it exists, is that it is some t × s matrix dF(c) such that for any ε > 0, there exists a δ > 0 for which any x with ∥x − c∥ < δ has
\[
\|F(x) - F(c) - dF(c)(x - c)\| \le \varepsilon \|x - c\|.
\]
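As a numerical sketch (my example, not the book's), the matrix of partial derivatives can be checked against finite differences for a simple map from R² to R².

   Fn <- function(x) c(x[1]*x[2], x[1]^2 + x[2])     # F: R^2 -> R^2
   dFn <- function(x) rbind(c(x[2], x[1]),           # analytic partials, row i for f_i
                            c(2*x[1], 1))
   cc <- c(1, 2); h <- 1e-6
   num <- sapply(1:2, function(j) {                  # column j: partials wrt x_j
     e <- c(0, 0); e[j] <- h
     (Fn(cc + e) - Fn(cc))/h
   })
   list(analytic = dFn(cc), numeric = num)           # the two matrices agree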
First and second order Taylor's expansions are fundamental to the models used in response surface methodology, cf. http://www.stat.unm.edu/~fletcher/TopicsInDesign or ALM-II.¹
   The chain rule can be written as a matrix product. If f : R^s → R^t and g : R^t → R^n, then the composite function is defined by
\[
(g \circ f)(x) \equiv g[f(x)]
\]
and
\[
d(g \circ f)(c) = \left[ d_v g(v)\big|_{v=f(c)} \right] \left[ d_x f(x)\big|_{x=c} \right] \equiv dg[f(c)]\, df(c).
\]
order Taylor’s expansion with the Lagrange characterization is known as the Mean Value Theorem.
This Mean Value Theorem cannot be extended to the case of mapping vectors into other (nonde-
generate) vectors but similar extensions can be obtained by generalizing the Taylor’s Theorem
integral and Peano characterizations of the remainder. Ferguson (1996) refers to the vector valued
zero-order Taylor’s
               hR theorem generalization  i    with integral remainder as the Mean Value Theorem,
                  1
F(x) = F(c)+ 0 dF(c + u(x − c))du] (x−c). The comparable first-order Peano characterization
extension is essentially just the definition of the derivative.
   A useful special case applies a scalar function elementwise. For G : R → R and v = (v_1, . . . , v_n)′, define the vector function G(v) ≡ [G(v_1), · · · , G(v_n)]′. Also define the scalar function g(u) ≡ d_u G(u) with its vector equivalent. It follows that d_v G(v) = D[g(v)], a diagonal matrix of derivatives, and from the chain rule that d_β G(Xβ) = D[g(Xβ)]X.
  I pretty much ripped this out of ALM-III. Not yet clear whether the following
material is needed for this work.
P ROOF. (a) is proven by writing each element of Ax as a sum and taking partial derivatives. (b) is proven by writing x′Ax as a double sum and taking partial derivatives. □
We now present some useful rules for matrix derivatives. While these are specified
for a scalar u, if A is a function of a vector θ , by thinking of u = θr , we can obtain
partial derivatives with respect to θr . (To find critical points we set all of the par-
tial derivatives equal to 0.) The last three results in Proposition F.2 are particularly
useful when dealing with likelihood functions associated with multivariate normal
distributions.
The notations det[V ] and |V | are used interchangeably to indicate the determinant.
Exercise F.1.   Prove Proposition F.2. Hints: For (a), consider A(u)B(u) elementwise. For (b), use (a) twice. For (c), use (a) and the fact that 0 = d_u I = d_u [A(u)A^{-1}(u)]. For (d), use the fact that the trace is a linear function of the diagonal elements. For (e), write V = P Diag(φ_i) P′ and show that both sides equal
\[
\sum_{i=1}^{q} \frac{d_u \phi_i(u)}{\phi_i(u)}.
\]
For the right-hand side, use (a) and the fact that 0 = d_u I = d_u PP′.
Consider the linear model
\[
Y = X\beta + e, \qquad E(e) = 0, \qquad \mathrm{Cov}(e) = \sigma^2 I.
\]
The least squares criterion is to choose an estimate of β that minimizes the squared Euclidean distance between Y and Xβ, namely
\[
\|Y - X\beta\|^2 \equiv (Y - X\beta)'(Y - X\beta).
\]
1. Using the chain rule, show that the first derivative of the squared error loss function is
\[
d_\beta (Y - X\beta)'(Y - X\beta) = -2(Y - X\beta)'X.
\]
   Note that setting the derivative equal to 0 leads to the well known normal equations X′Xβ = X′Y for finding least squares estimates. (A numerical check appears after item 2.)
2. Show that the second derivative of ∥Y − Xβ∥² is
\[
d^2_{\beta\beta} (Y - X\beta)'(Y - X\beta) = 2X'X.
\]
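Here is a small R check (mine, with toy data) of item 1: the analytic gradient matches a finite-difference gradient, and the normal equations reproduce the least squares estimates.

   set.seed(2)
   X <- cbind(1, rnorm(20)); Y <- X %*% c(1, 2) + rnorm(20)
   L <- function(b) sum((Y - X %*% b)^2)             # squared error loss
   beta <- c(0.5, 0.5); h <- 1e-6
   numgrad <- sapply(1:2, function(j) {
     e <- c(0, 0); e[j] <- h
     (L(beta + e) - L(beta))/h
   })
   rbind(numeric = numgrad,
         analytic = as.vector(-2 * t(Y - X %*% beta) %*% X))
   solve(t(X) %*% X, t(X) %*% Y)                     # normal equations X'X beta = X'Y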
so we seek to maximize this as a function of β. Often there are many β vectors that give the same maximum.
2. To get the first derivative with respect to β, use the chain rule to show that
\[
d_\beta m(\beta) =
\begin{bmatrix}
x_{11} e^{x_1'\beta} & \cdots & x_{1p} e^{x_1'\beta} \\
\vdots & & \vdots \\
x_{q1} e^{x_q'\beta} & \cdots & x_{qp} e^{x_q'\beta}
\end{bmatrix}
= D[m(\beta)]X.
\]
   It is implicit in our models that every element of m(β) is positive, so the negative of the second derivative is nonnegative definite and critical points β̂ will maximize the likelihood function. When the negative of the second derivative is positive definite, the maximum will be unique. A quick numerical check of the derivative formula follows.
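This is a finite-difference check (my toy X and β, not the book's) that d_β m(β) = D[m(β)]X when m_i(β) = exp(x_i′β).

   X <- cbind(1, c(.5, -1, 2)); beta <- c(.3, .7); h <- 1e-7
   m <- function(b) as.vector(exp(X %*% b))          # m_i(beta) = exp(x_i' beta)
   num <- sapply(1:2, function(j) {
     e <- c(0, 0); e[j] <- h
     (m(beta + e) - m(beta))/h
   })
   list(analytic = diag(m(beta)) %*% X, numeric = num)  # D[m(beta)] X vs differences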
F.2 Iterative Methods for Finding Extreme Values

F.2.1 Gradient Descent

Gradient descent is the simplest method to find where d_β L(β) ≡ L̇(β) equals 0. For some η > 0 set
\[
\beta_{t+1} = \beta_t - \eta\, [\dot{L}(\beta_t)]'.
\]
To maximize L(β), use steepest ascent, which replaces the minus sign with a plus sign. At convergence, β_{t+1} = β_t, so [L̇(β_t)]′ = 0 and β_t is a critical point.
    Gradient descent is easy to program and easy to compute but can be inefficient.
It is often used when the dimension of β is very large.
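A minimal R sketch of gradient descent for the least squares criterion (my toy data; η and the iteration count are arbitrary choices); it lands on the least squares estimates.

   set.seed(3)
   X <- cbind(1, rnorm(50)); Y <- X %*% c(1, 2) + rnorm(50)
   Ldot <- function(b) as.vector(-2 * t(Y - X %*% b) %*% X)  # [Ldot(beta)]'
   beta <- c(0, 0); eta <- 1e-3
   for (it in 1:5000) beta <- beta - eta * Ldot(beta)        # the update rule above
   cbind(gd = beta, ls = as.vector(solve(t(X) %*% X, t(X) %*% Y)))  # near-identical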
The following R code draws elliptical contours of a bivariate normal density centered at (b1, b2); it needs the ellipse package for the ellipse() function.

   library(ellipse)  # provides ellipse() for drawing covariance-matrix contours
   b1 <- 1
   b2 <- 2
   A <- matrix(c(1, .9, .9, 2), 2, 2, dimnames = list(NULL, c("b1", "b2")))
   A
   # three contours of different sizes around the center (b1, b2)
   E  <- ellipse(A, centre = c(b1, b2), t = .95, npoints = 100)
   E1 <- ellipse(A, centre = c(b1, b2), t = .5,  npoints = 100)
   E2 <- ellipse(A, centre = c(b1, b2), t = .75, npoints = 100)
   plot(E, type = 'l', ylim = c(.5, 3.5), xlim = c(0, 2),
        xlab = expression(y[1]),
        ylab = expression(y[2]), main = "Normal Density")
   text(b1 + .01, b2 - .1, expression(mu), lwd = 1, cex = 1)  # label the center
   lines(E1, type = "l", lty = 1)
   lines(E2, type = "l", lty = 1)
   lines(b1, b2, type = "p", lty = 3)  # mark the center point
F.2.2 Newton-Raphson
Newton-Raphson finds critical points of d_β L(β) ≡ L̇(β) by using the second derivative d²_{ββ} L(β) ≡ L̈(β) in a Taylor's approximation, centered at β_t, of the first derivative function,
\[
[\dot{L}(\beta)]' \doteq [\dot{L}(\beta_t)]' + [\ddot{L}(\beta_t)](\beta - \beta_t).
\]
Setting 0 = [L̇(β)]′, we find β_{t+1} to be the solution for β in
\[
0 = [\dot{L}(\beta_t)]' + [\ddot{L}(\beta_t)](\beta - \beta_t),
\]
which gives
\[
\beta_{t+1} = \beta_t - [\ddot{L}(\beta_t)]^{-1} [\dot{L}(\beta_t)]'.
\]
   At convergence we have
\[
0 = \beta_{t+1} - \beta_t = -[\ddot{L}(\beta_t)]^{-1} [\dot{L}(\beta_t)]',
\]
but by the (implicit) assumption that [L̈(β_t)] is nonsingular, the only vector v with −[L̈(β_t)]^{-1} v = 0 is v = 0, so we have L̇(β_t) = 0.
   When β is a very high dimensional vector, finding the inverse matrix of L̈(βt )
can be very difficult in which case the method is impractical. When this method is
applicable, it tends to be very efficient. For example, it finds least squares estimates
for a linear model in just one iteration regardless of the starting value β0 . Many
computer programs for fitting generalized linear models employ Newton-Raphson
under the name iteratively reweighted least squares.
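A small R sketch (mine, with toy data) of the one-iteration claim for least squares: since [L̇(β)]′ = −2X′(Y − Xβ) and L̈(β) = 2X′X, a single Newton-Raphson update from any starting value gives the least squares estimates.

   set.seed(4)
   X <- cbind(1, rnorm(30)); Y <- X %*% c(-1, 3) + rnorm(30)
   beta0 <- c(10, -10)                               # arbitrary start
   Ldot  <- -2 * t(X) %*% (Y - X %*% beta0)          # [Ldot(beta0)]'
   Lddot <- 2 * t(X) %*% X
   beta1 <- beta0 - solve(Lddot, Ldot)               # one Newton-Raphson update
   cbind(beta1, solve(t(X) %*% X, t(X) %*% Y))       # identical columns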
Exercise F.5.     Use the results of Exercise F.2 to show that Newton-Raphson gives
least squares estimates in only one iteration.
F.2.3 Gauss-Newton
Gauss-Newton applies only to minimizing L(β) = [Y − F(β)]′[Y − F(β)]. To minimize this, note that L̇(β) = −2[Y − F(β)]′[Ḟ(β)]. This method is particularly useful in nonlinear regression, of which neural networks are a special case. Like Newton-Raphson it involves the inverse of a square matrix that has the same dimensions as β so, while efficient when applicable, it is not readily applied to high dimensional problems.
   Use the approximation
\[
F(\beta) \doteq F(\beta_t) + [\dot{F}(\beta_t)](\beta - \beta_t),
\]
so that, writing γ = β − β_t, minimizing L(β) is approximately the linear least squares problem of minimizing ∥[Y − F(β_t)] − [Ḟ(β_t)]γ∥², whose solution is
\[
\hat{\gamma} = \left( [\dot{F}(\beta_t)]'[\dot{F}(\beta_t)] \right)^{-1} [\dot{F}(\beta_t)]' [Y - F(\beta_t)].
\]
Set
\[
\beta_{t+1} = \beta_t + \hat{\gamma}.
\]
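Here is a toy Gauss-Newton iteration in R (my example: F_i(β) = β₁ e^{β₂ t_i}, not from the book); for this mild problem it converges to values near the truth (2, −0.8).

   set.seed(5)
   tm <- seq(0, 2, length.out = 25)
   Y  <- 2*exp(-0.8*tm) + rnorm(25, sd = .05)
   Fb   <- function(b) b[1]*exp(b[2]*tm)             # F(beta)
   Fdot <- function(b) cbind(exp(b[2]*tm),           # q x 2 Jacobian of F
                             b[1]*tm*exp(b[2]*tm))
   beta <- c(1, 0)                                   # starting value
   for (i in 1:20) {
     r <- Y - Fb(beta)                               # current residuals
     gam <- solve(t(Fdot(beta)) %*% Fdot(beta), t(Fdot(beta)) %*% r)
     beta <- beta + as.vector(gam)                   # beta_{t+1} = beta_t + gamma-hat
   }
   beta                                              # close to (2, -0.8)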
At convergence we have
\[
0 = \beta_{t+1} - \beta_t = \hat{\gamma} = \left( [\dot{F}(\beta_t)]'[\dot{F}(\beta_t)] \right)^{-1} [\dot{F}(\beta_t)]' [Y - F(\beta_t)],
\]
so [Ḟ(β_t)]′[Y − F(β_t)] = 0 and [L̇(β_t)]′ = −2[Ḟ(β_t)]′[Y − F(β_t)] = 0, making β_t a critical point.
We will see if I ever get around to writing this. I have had little interest in it but I
gather it is extremely useful.
References
Agresti, Alan (1992). A Survey of Exact Inference for Contingency Tables. Statistical Science,
     Vol. 7, 131-153.
Aitchison, J. and Dunsmore, I. R. (1975). Statistical Prediction Analysis. Cambridge University
     Press, Cambridge.
Akaike, Hirotugu (1973). Information theory and an extension of the maximum likelihood princi-
     ple. In Proceedings of the 2nd International Symposium on Information, edited by B.N. Petrov
     and F. Czaki. Akademiai Kiado, Budapest.
Anderson, T. W. (2003). An Introduction to Multivariate Statistical Analysis, Third Edition. John
     Wiley and Sons, New York.
Andrews, D. F. (1974). A robust method for multiple regression. Technometrics, 16, 523-531.
Arbuthnot, J. (1710). An argument for Divine Providence, taken from the constant regularity observed in the births of both sexes. Philosophical Transactions of the Royal Society of London, 27, 186-190.
Arnold, S. F. (1981). The Theory of Linear Models and Multivariate Analysis. John Wiley and
     Sons, New York.
Aroian, Leo A. (1941). A Study of R. A. Fisher’s z Distribution and the Related F Distribution.
     The Annals of Mathematical Statistics, 12, 429-448.
Ash, Robert B. and Doleans-Dade, Catherine A. (2000). Probability and Measure Theory, Second
     Edition. Academic Press, San Diego.
Atkinson, A. C. (1981). Two graphical displays for outlying and influential observations in regres-
     sion. Biometrika, 68, 13-20.
Atkinson, A. C. (1982). Regression diagnostics, transformations and constructed variables (with
     discussion). Journal of the Royal Statistical Society, Series B, 44, 1-36.
Atkinson, A. C. (1985). Plots, Transformations, and Regression: An Introduction to Graphical
     Methods of Diagnostic Regression Analysis. Oxford University Press, Oxford.
Atwood, C. L. and Ryan, T. A., Jr. (1977). A class of tests for lack of fit to a regression model.
     Unpublished manuscript.
Bailey, D. W. (1953). The Inheritance of Maternal Influences on the Growth of the Rat. Ph.D.
     Thesis, University of California.
Barnard, G.A. (1949). Statistical Inference. Journal of the Royal Statistical Society, Series B, 11,
     115-149.
Barron, A. R. (1986). Entropy and the Central Limit Theorem. The Annals of Probability, 14,
     336-342.
Bayes, T. (1763). An essay towards solving a problem in the doctrine of chances. Philosophical
     Transactions of the Royal Society of London, 53, 370-418.
Bedrick, E. J., Christensen, R., and Johnson, W. (1996). A new perspective on priors for generalized
     linear models. Journal of the American Statistical Association, 91, 1450-1460.
Bedrick, E. J. and Tsai, C.-L. (1994). Model selection for multivariate regression in small samples.
     Biometrics, 50, 226-231.
Christensen, R. (1984). A note on ordinary least squares methods for two-stage sampling. Journal
     of the American Statistical Association, 79, 720-721.
Christensen, R. (1987). The analysis of two-stage sampling data by ordinary least squares. Journal
     of the American Statistical Association, 82, 492-498.
Christensen, R. (1989). Lack of fit tests based on near or exact replicates. The Annals of Statistics,
     17, 673-683.
Christensen, R. (1991). Small sample characterizations of near replicate lack of fit tests. Journal of
     the American Statistical Association, 86, 752-756.
Christensen, R. (1993). Quadratic covariance estimation and equivalence of predictions. Mathe-
     matical Geology, 25, 541-558.
Christensen, R. (1995). Comment on Inman (1994). The American Statistician, 49, 400.
Christensen, R. (1996). Analysis of Variance, Design, and Regression: Applied Statistical Methods.
     Chapman and Hall, London.
Christensen, R. (1997). Log-Linear Models and Logistic Regression, Second Edition. Springer-
     Verlag, New York.
Christensen, R. (2001). Advanced Linear Modeling: Multivariate, Time Series, and Spatial Data;
     Nonparametric Regression, and Response Surface Maximization, Second Edition. Springer-
     Verlag, New York.
Christensen, R. (2003). Significantly insignificant F tests. The American Statistician, 57, 27-32.
Christensen, R. (2005). Testing Fisher, Neyman, Pearson, and Bayes. The American Statistician,
     59, 121-126.
Christensen, R. (2008). Review of Principles of Statistical Inference by D. R. Cox. Journal of the American Statistical Association, 103, 1719-1723.
Christensen, Ronald (2014). “Review of Fisher, Neyman, and the Creation of Classical Statistics
     by Erich L. Lehmann.” Journal of the American Statistical Association, 109, 866-868.
Christensen, R. (2015). Analysis of Variance, Design, and Regression: Linear Modeling for Un-
     balanced Data, Second Edition. Chapman and Hall/CRC Pres, Boca Raton, FL.
Christensen, R. (2018). Comment on “A note on collinearity diagnostics and centering” by Velilla
     (2018). The American Statistician, 72, 114-117.
Christensen, Ronald (2019). Advanced Linear Modeling: Statistical Learning and Dependent
     Data, Third Edition. Springer-Verlag, New York.
Christensen, Ronald (2020a). Plane Answers to Complex Questions: The Theory of Linear Models,
     Fifth Edition. Springer-Verlag, New York.
Christensen, Ronald (2020b). Comment on "Test for Trend With a Multinomial Outcome" by Szabo (2019). The American Statistician, accepted.
Christensen, R. (2020c). Log-Linear Models and Logistic Regression, Third Edition. Not yet pub-
     lished. Contact author. Hopefully, Springer-Verlag, New York.
Christensen, R. (2019d). Another Look at Linear Hypothesis Testing in Dense High-Dimensional
     Linear Models. http://www.stat.unm.edu/˜fletcher/AnotherLook.pdf
Christensen, R. and Bedrick, E. J. (1997). Testing the independence assumption in linear models.
     Journal of the American Statistical Association, 92, 1006-1016.
Christensen, Ronald and Huffman, Michael D. (1985). “Bayesian point estimation using the pre-
     dictive distribution.” The American Statistician, 39, 319-321.
Christensen, Ronald and Johnson, Wesley (2005). A Conversation with Seymour Geisser. Statisti-
     cal Science, 22, 621-636.
Christensen, R., Johnson, W., Branscum, A., and Hanson, T. E. (2010). Bayesian Ideas and Data
     Analysis: An Introduction for Scientists and Statisticians. Chapman and Hall/CRC Press, Boca
     Raton, FL.
Christensen, R., Johnson, W., and Pearson, L. M. (1992). Prediction diagnostics for spatial linear
     models. Biometrika, 79, 583-591.
Christensen, R., Johnson, W., and Pearson, L. M. (1993). Covariance function diagnostics for spa-
     tial linear models. Mathematical Geology, 25, 145-160.
Christensen, R. and Lin, Y. (2010). Linear models that allow perfect estimation. Statistical Papers,
     54, 695-708.
Christensen, R. and Lin, Y. (2015). Lack-of-fit tests based on partial sums of residuals. Communi-
     cations in Statistics, Theory and Methods, 44, 2862-2880.
Christensen, R., Pearson, L. M., and Johnson, W. (1992). Case deletion diagnostics for mixed
     models. Technometrics, 34, 38-45.
Christensen, R. and Utts, J. (1992). Testing for nonadditivity in log-linear and logit models. Journal
     of Statistical Planning and Inference, 33, 333-343.
Cochran, W. G. and Cox, G. M. (1957). Experimental Designs, Second Edition. John Wiley and
     Sons, New York.
Cook, R. D. (1977). Detection of influential observations in linear regression. Technometrics, 19,
     15-18.
Cook, R. D. (1998). Regression Graphics: Ideas for Studying Regressions Through Graphics. John
     Wiley and Sons, New York.
Cook, R. D., Forzani, L., and Rothman, A. J. (2013). Prediction in abundant high-dimensional
     linear regression. Electronic Journal of Statistics, 7, 3059-3088.
Cook, R. D., Forzani, L., and Rothman, A. J. (2015). Letter to the editor. The American Statistician,
     69, 253-254.
Cook, R. D. and Weisberg, S. (1982). Residuals and Influence in Regression. Chapman and Hall,
     New York.
Cook, R. D. and Weisberg, S. (1994). An Introduction to Regression Graphics. John Wiley and
     Sons, New York.
Cook, R. D. and Weisberg, S. (1999). Applied Regression Including Computing and Graphics. John
     Wiley and Sons, New York.
Cornell, J. A. (1988). Analyzing mixture experiments containing process variables. A split plot
     approach. Journal of Quality Technology, 20, 2-23.
Cox, D. R. (1958). Planning of Experiments. John Wiley and Sons, New York.
Cox, D. R. (2006). Principles of Statistical Inference. Cambridge University Press, Cambridge.
Cox, D. R. (2007). Applied Statistics: A Review. The Annals of Applied Statistics 1, 1-17.
Cox, D. R. and Hinkley, D. V. (1974). Theoretical Statistics. Chapman and Hall, London.
Cox, D. R. and Reid, N. (2000). The Theory of the Design of Experiments. Chapman and Hall/CRC,
     Boca Raton, FL.
Cramér, H. (1946). Mathematical Methods of Statistics. Princeton University Press, Princeton.
Cressie, N. (1993). Statistics for Spatial Data, Revised Edition. John Wiley and Sons, New York.
Cressie, N. A. C. and Wikle, C. K. (2011). Statistics for Spatio-Temporal Data. John Wiley and
     Sons, New York.
Daniel, C. (1959). Use of half-normal plots in interpreting factorial two-level experiments. Tech-
     nometrics, 1, 311-341.
Daniel, C. (1976). Applications of Statistics to Industrial Experimentation. John Wiley and Sons,
     New York.
Daniel, C. and Wood, F. S. (1980). Fitting Equations to Data, Second Edition. John Wiley and
     Sons, New York.
Davies, R. B. (1980). The distribution of linear combinations of χ² random variables. Applied Statistics, 29, 323-333.
de Finetti, B. (1974, 1975). Theory of Probability, Vols. 1 and 2. John Wiley and Sons, New York.
DeGroot, M. H. (1970). Optimal Statistical Decisions. McGraw-Hill, New York.
deLaubenfels, R. (2006). The victory of least squares and orthogonality in statistics. The American
     Statistician, 60, 315-321.
Doob, J. L. (1953). Stochastic Processes. John Wiley and Sons, New York.
Draper, N. and Smith, H. (1998). Applied Regression Analysis, Third Edition. John Wiley and Sons,
     New York.
Draper, N. R. and van Nostrand, R. C. (1979). Ridge regression and James-Stein estimation: Re-
     view and comments Technometrics, 21, 451-466.
Duan, N. (1981). Consistency of residual distribution functions. Working Draft No. 801-1-HHS
     (106B-80010), Rand Corporation, Santa Monica, CA.
Durbin, J. and Watson, G. S. (1951). Testing for serial correlation in least squares regression II.
     Biometrika, 38, 159-179.
Eaton, M. L. (1983). Multivariate Statistics: A Vector Space Approach. John Wiley and Sons, New
     York. Reprinted in 2007 by IMS Lecture Notes – Monograph Series.
Efron, B. and Hastie, T. (2016). Computer Age Statistical Inference: Algorithms, Evidence, and
     Data Science. Cambridge University Press, Cambridge.
Ferguson, T. S. (1967). Mathematical Statistics: A Decision Theoretic Approach. Academic Press,
     New York.
Ferguson, Thomas S. (1996). A Course in Large Sample Theory. Chapman and Hall, New York.
Fienberg, S. E. (1980). The Analysis of Cross-Classified Categorical Data, Second Edition. MIT
     Press, Cambridge, MA.
Fienberg, S. E. (2006). When did Bayesian inference become "Bayesian"? Bayesian Analysis, 1, 1-40.
Fisher, R. A. (1922a). The goodness of fit of regression formulae, and the distribution of regression
     coefficients. Journal of the Royal Statistical Society, 85, 597-612.
Fisher, Ronald A. (1922b). On the mathematical foundations of theoretical statistics. Philos. Trans.
     Roy. Soc. London Ser. A , 222, 309-368.
Fisher, R. A. (1924). On a distribution yielding the error functions of several well known statistics. Proc. International Math. Cong., Toronto, 2, 805-813.
Fisher, Ronald A. (1925). Statistical Methods for Research Workers, Fourteenth Edition, 1970.
     Hafner Press, New York.
Fisher, R. A. (1935). The Design of Experiments, Ninth Edition, 1971. Hafner Press, New York.
Fisher, R. A. (1956). Statistical Methods and Scientific Inference, Third Edition, 1973. Hafner
     Press, New York.
Fraser, D. A. S. (1957). Nonparametric methods in statistics. John Wiley and Sons, New York.
Freedman, D. A. (2006). On the so-called “Huber sandwich estimator” and “robust standard er-
     rors”. The American Statistician, 60, 299-302.
Furnival, G. M. and Wilson, R. W. (1974). Regression by leaps and bounds. Technometrics, 16,
     499-511.
Galili, Tal and Meilijson, Isaac (2016). An example of an improvable Rao-Blackwell improve-
     ment, inefficient maximum likelihood estimator, and unbiased generalized Bayes estimator.
     The American Statistician, 70, 108-113.
Geisser, Seymour (1971). The inferential use of predictive distributions. In Foundations of Statis-
     tical Inference, V.P. Godambe and D.A. Sprott (Eds.). Holt, Rinehart, and Winston, Toronto,
     456-469.
Geisser, Seymour (1975). The predictive sample reuse method with applications. Journal of the American Statistical Association, 70, 320-328.
Geisser, Seymour (1985). On the predicting of observables: A selective update. In Bayesian Statis-
     tics 2, J.M. Bernardo et al. (Eds.). North Holland, 203-230.
Geisser, Seymour (1993). Predictive Inference: An Introduction, Chapman and Hall, New York.
Geisser, Seymour (2000). Statistics, litigation, and conduct unbecoming. In Statistical Science in
     the Courtroom, Joseph L. Gastwirth (Ed.). Springer-Verlag, New York, 71-85.
Geisser, Seymour (2005). Modes of Parametric Statistical Inference, John Wiley and Sons, New
     York.
Gelman, A., Carlin, J. B., Stern, H. S., Dunson, D. B., Vehtari, A., and Rubin, D. B. (2013).
     Bayesian Data Analysis, Third Edition. Chapman and Hall/CRC, Boca Raton, FL.
Gnanadesikan, R. (1977). Methods for Statistical Analysis of Multivariate Observations. John Wi-
     ley and Sons, New York.
Goldstein, M. and Smith, A. F. M. (1974). Ridge-type estimators for regression analysis. Journal
     of the Royal Statistical Society, Series B, 26, 284-291.
Graybill, F. A. (1976). Theory and Application of the Linear Model. Duxbury Press, North Scituate,
     MA.
Grizzle, J. E., Starmer, C. F., and Koch, G. G. (1969). Analysis of categorical data by linear models.
     Biometrics, 25, 489-504.
Groß, J. (2004). The general Gauss–Markov model with possibly singular dispersion matrix. Sta-
      tistical Papers, 25, 311-336.
Guttman, I. (1970). Statistical Tolerance Regions. Hafner Press, New York.
Haberman, S. J. (1974). The Analysis of Frequency Data. University of Chicago Press, Chicago.
Hacking, I. (1965). Logic of Statistical Inference. Cambridge University Press.
Halmos, P. R. and Savage, L. J. (1949). Application of the Radon-Nikodym theorem to the theory
      of sufficient statistics. Annals of Mathematical Statistics, 20, 225-241.
Hartigan, J. (1969). Linear Bayesian methods. Journal of the Royal Statistical Society, Series B,
      31, 446-454.
Harville, D. A. (2018). Linear Models and the Relevant Distributions and Matrix Algebra. CRC
      Press, Boca Raton, FL.
Haslett, J. (1999). A simple derivation of deletion diagnostic results for the general linear model
      with correlated errors. Journal of the Royal Statistical Society, Series B, 61, 603-609.
Haslett, J. and Hayes, K. (1998). Residuals for the linear model with general covariance structure.
      Journal of the Royal Statistical Society, Series B, 60, 201-215.
Hastie, T., Tibshirani, R. and Friedman, J. (2016). The Elements of Statistical Learning: Data
      Mining, Inference, and Prediction, Second Edition. Springer, New York.
Hill, Bruce M. (1987). The validity of the likelihood principle. The American Statistician, 43,
      95-100.
Hill, Joe R. (1990). A general framework for model-based statistics. Biometrika, 77, 115-126.
Hinkelmann, K. and Kempthorne, O. (2005). Design and Analysis of Experiments: Volume 2, Ad-
      vanced Experimental Design. John Wiley and Sons, Hoboken, NJ.
Hinkelmann, K. and Kempthorne, O. (2008). Design and Analysis of Experiments: Volume 1, In-
      troduction to Experimental Design, Second Edition. John Wiley and Sons, Hoboken, NJ.
Hinkley, D. V. (1969). Inference about the intersection in two-phase regression. Biometrika, 56,
      495-504.
Hochberg, Y. and Tamhane, A. (1987). Multiple Comparison Procedures. John Wiley and Sons,
      New York.
Hodges, J. S. (2013). Richly Parameterized Linear Models: Additive, Time Series, and Spatial
      Models Using Random Effects. Chapman and Hall/CRC, Boca Raton, FL.
Hoerl, A. E. and Kennard, R. (1970). Ridge regression: Biased estimation for non-orthogonal prob-
      lems. Technometrics, 12, 55-67.
Högfeldt, P. (1979). On low F-test values in linear models. Scandinavian Journal of Statistics, 6,
      175-178.
Hsu, J. C. (1996). Multiple Comparisons: Theory and Methods. Chapman and Hall, London.
Hubbard, Raymond and Bayarri, M. J. (2003). Confusion over measures of evidence (ps) versus
      errors (αs) in classical statistical testing. The American Statistician, 57, 171-177.
Huber, P. J. and Ronchetti, E. M. (2009). Robust Statistics, Second Edition. John Wiley and Sons,
      New York.
Hurvich, C. M. and Tsai, C.-L. (1989). Regression and time series model selection in small sam-
      ples. Biometrika, 76, 297-307.
Huynh, H. and Feldt, L. S. (1970). Conditions under which mean square ratios in repeated mea-
      surements designs have exact F-distributions. Journal of the American Statistical Association,
      65, 1582-1589.
James, G., Witten, D., Hastie, T., and Tibshirani, R. (2013). An Introduction to Statistical Learning.
      Springer, New York.
Jeffreys, H. (1961). Theory of Probability, Third Edition. Oxford University Press, London.
John, P. W. M. (1971). Statistical Design and Analysis of Experiments. Macmillan, New York.
Johnson, R. A. and Wichern, D. W. (2007). Applied Multivariate Statistical Analysis, Sixth Edition.
      Prentice–Hall, Englewood Cliffs, NJ.
Kempthorne, O. (1952). Design and Analysis of Experiments. Krieger, Huntington, NY.
Kutner, M. H., Nachtsheim, C. J., Neter, J., and Li, W. (2005). Applied Linear Statistical Models,
      Fifth Edition. McGraw-Hill Irwin, New York.
LaMotte, Lynn Roy (2014). The Gram-Schmidt Construction as a Basis for Linear Models, The
     American Statistician, 68, 52-55.
Lane, David (1996). Story about Cosimo di Medici. In Modelling and Prediction: honoring Sey-
     mour Geisser, eds. Jack C. Lee, Wesley O. Johnson, Arnold Zellner. Springer- Verlag, New
     York.
Lehmann, E.L. (1959). Testing Statistical Hypotheses. John Wiley and Sons, New York.
Lehmann, E. L. (1983) Theory of Point Estimation. John Wiley and Sons, New York.
Lehmann, E. L. (1986) Testing Statistical Hypotheses, Second Edition. John Wiley and Sons, New
     York.
Lehmann, E.L. (1997) Testing Statistical Hypotheses, Second Edition. Springer, New York.
Lehmann, E. L. (1999) Elements of Large-Sample Theory. Springer, New York.
Lehmann, E. L. (2011). Fisher, Neyman, and the Creation of Classical Statistics. Springer, New
     York.
Lehmann, E.L. and Casella, George (1998). Theory of Point Estimation, 2nd Edition. Springer,
     New York
Lehmann, E.L. and Romano, J.P. (2005). Testing Statistical Hypotheses, Third Edition. Springer,
     New York.
Lehmann, E. L. and Scheffé, H. (1950). Completeness, similar regions and unbiased estimation,
     part I. Sankhya, 10, 305-340.
Lenth, R. V. (2015). The case against normal plots of effects (with discussion). Journal of Quality
     Technology, 47, 91-97.
Lindgren, Bernard W. (1968). Statistical Theory, Second Edition. Macmillan, London.
Lindley, D. V. (1971). Bayesian Statistics: A Review. SIAM, Philadelphia.
McCullagh, P. (2000). Invariance and factorial models, with discussion. Journal of the Royal Sta-
     tistical Society, Series B, 62, 209-238.
McCullagh, P. and Nelder, J. A. (1989). Generalized Linear Models, Second Edition. Chapman
     and Hall, London.
McCulloch, C. E., Searle, S. R., and Neuhaus, J. M. (2008). Generalized, Linear, and Mixed Models, 2nd Edition. John Wiley and Sons, New York.
Madansky, A. (1988). Prescriptions for Working Statisticians. Springer-Verlag, New York.
Mandel, J. (1961). Nonadditivity in two-way analysis of variance. Journal of the American Statis-
     tical Association, 56, 878-888.
Mandel, J. (1971). A new analysis of variance model for nonadditive data. Technometrics, 13, 1-18.
Manoukian, E. B. (1986), Modern Concepts and Theorems of Mathematical Statistics. Springer-
     Verlag, New York.
Marquardt, D. W. (1970). Generalized inverses, ridge regression, biased linear estimation, and
     nonlinear estimation. Technometrics, 12, 591-612.
Martin, R. J. (1992). Leverage, influence and residuals in regression models when observations are
     correlated. Communications in Statistics – Theory and Methods, 21, 1183-1212.
Mathew, T. and Sinha, B. K. (1992). Exact and optimum tests in unbalanced split-plot designs
     under mixed and random models. Journal of the American Statistical Association, 87, 192-
     200.
Mehta, C.R. and Patel, N.R. (1983). A network algorithm for performing Fisher’s exact test in r × c
     contingency tables. Journal of the American Statistical Association, 78, 427-434.
Miller, F. R., Neill, J. W., and Sherfey, B. W. (1998). Maximin clusters for near replicate regression
     lack of fit tests. The Annals of Statistics, 26, 1411-1433.
Miller, F. R., Neill, J. W., and Sherfey, B. W. (1999). Implementation of maximin power cluster-
     ing criterion to select near replicates for regression lack-of-fit tests. Journal of the American
     Statistical Association, 94, 610-620.
Miller, R. G., Jr. (1981). Simultaneous Statistical Inference, Second Edition. Springer-Verlag, New
     York.
Milliken, G. A. and Graybill, F. A. (1970). Extensions of the general linear hypothesis model.
     Journal of the American Statistical Association, 65, 797-807.
Moguerza, J. M. and Muñoz, A. (2006). Support vector machines with applications. Statistical
     Science, 21, 322-336.
Monlezun, C. J. and Blouin, D. C. (1988). A general nested split-plot analysis of covariance. Jour-
     nal of the American Statistical Association, 83, 818-823.
Morrison, D. F. (2004). Multivariate Statistical Methods, Fourth Edition. Duxbury Press, Pacific
     Grove, CA.
Mosteller, F. and Tukey, J. W. (1977). Data Analysis and Regression. Addison-Wesley, Reading,
     MA.
Nayak, T. K. (2002). Rao-Cramer type inequalities for mean squared error of prediction. The Amer-
     ican Statistician, 56, 102-106.
Neill, J. W. and Johnson, D. E. (1984). Testing for lack of fit in regression – a review. Communi-
     cations in Statistics, Part A – Theory and Methods, 13, 485-511.
Oehlert, G. W. (2010). A First Course in Design and Analysis of Experiments. http://users.
     stat.umn.edu/˜gary/book/fcdae.pdf
Parmigiani, Giovanni and Inoue, Lurdes (2009). Decision Theory : Principles and Approaches.
     John Wiley and Sons, New York.
Peixoto, J. L. (1993). Four equivalent definitions of reparameterizations and restrictions in linear
     models. Communications in Statistics, A, 22, 283-299.
Picard, R. R. and Berk, K. N. (1990). Data splitting. The American Statistician, 44, 140-147.
Picard, R. R. and Cook, R. D. (1984). Cross-validation of regression models. Journal of the Amer-
     ican Statistical Association, 79, 575-583.
Puri, M. L. and Sen, P. K. (1971). Nonparametric Methods in Multivariate Analysis. John Wiley
     and Sons, New York.
Raiffa, H. and Schlaifer, R. (1961). Applied Statistical Decision Theory. Division of Research,
     Graduate School of Business Administration, Harvard University, Boston.
Rao, C. R. (1973). Linear Statistical Inference and Its Applications, Second Edition. John Wiley
     and Sons, New York.
Rao, C. R. and Mitra, S. K. (1971). Generalized Inverse of Matrices and Its Applications. John
     Wiley and Sons, New York.
Ravishanker, N. and Dey, D. (2002). A First Course in Linear Model Theory. Chapman and
     Hall/CRC Press, Boca Raton, FL.
Rencher, A. C. and Schaalje, G. B. (2008). Linear Models in Statistics, Second Edition. John Wiley
     and Sons, New York.
Ripley, B. D. (1981). Spatial Statistics. John Wiley and Sons, New York.
Robert, C. P. (2007). The Bayesian Choice: From Decision-Theoretic Foundations to Computa-
     tional Implementation, Second Edition. Springer, New York.
Royall, Richard (1997). Statistical Evidence: A Likelihood Paradigm. Chapman & Hall, London.
St. Laurent, R. T. (1990). The equivalence of the Milliken-Graybill procedure and the score test.
     The American Statistician, 44, 36-37.
Salsburg, David (2001). The Lady Tasting Tea: How Statistics Revolutionized Science in the Twen-
     tieth Century. Holt and Company, New York.
Savage, L. J. (1954). The Foundations of Statistics. John Wiley and Sons, New York.
Schafer, D. W. (1987). Measurement error diagnostics and the sex discrimination problem. Journal
     of Business and Economic Statistics, 5, 529-537.
Schatzoff, M., Tsao, R., and Fienberg, S. (1968). Efficient calculations of all possible regressions.
     Technometrics, 10, 768-779.
Scheffé, H. (1959). The Analysis of Variance. John Wiley and Sons, New York.
Schwarz, G. (1978). Estimating the dimension of a model. Annals of Statistics, 6, 461–464.
Searle, S. R. (1971). Linear Models. John Wiley and Sons, New York.
Searle, S. R. (1988). Parallel lines in residual plots. The American Statistician, 42, 211.
Searle, S. R. and Pukelsheim, F. (1987). Estimation of the mean vector in linear models, Technical
     Report BU-912-M, Biometrics Unit, Cornell University, Ithaca, NY.
Seber, G. A. F. (1966). The Linear Hypothesis: A General Theory. Griffin, London.
Seber, G. A. F. (1977). Linear Regression Analysis. John Wiley and Sons, New York.
Seber, G. A. F. (2015). The Linear Model and Hypothesis: A General Theory. Springer, New York.
Serfling, R. J. (1980). Approximation Theorems of Mathematical Statistics. John Wiley and Sons, New York. (Paperback edition, 2001.)
Shapiro, S. S. and Francia, R. S. (1972). An approximate analysis of variance test for normality.
     Journal of the American Statistical Association, 67, 215-216.
Shapiro, S. S. and Wilk, M. B. (1965). An analysis of variance test for normality (complete sam-
     ples). Biometrika, 52, 591-611.
Shewhart, W. A. (1931). Economic Control of Quality. Van Nostrand, New York.
Shewhart, W. A. (1939). Statistical Method from the Viewpoint of Quality Control. Graduate School
     of the Department of Agriculture, Washington. Reprint (1986), Dover, New York.
Shi, L. and Chen, G. (2009). Influence measures for general linear models with correlated errors.
     The American Statistician, 63, 40-42.
Shillington, E. R. (1979). Testing lack of fit in regression without replication. Canadian Journal of
     Statistics, 7, 137-146.
Shumway, R. H. and Stoffer, D. S. (2011). Time Series Analysis and Its Applications: With R
     Examples, Third Edition. Springer, New York.
Skinner, C. J., Holt, D., and Smith, T. M. F. (1989). Analysis of Complex Surveys. John Wiley and
     Sons, New York.
Smith, A. F. M. (1986). Comment on an article by B. Efron. The American Statistician, 40, 10.
Snedecor, G. W. and Cochran, W. G. (1980). Statistical Methods, Seventh Edition. Iowa State
     University Press, Ames.
Stefanski, L. A. (2007). Residual (sur)realism. The American Statistician, 61, 163-177.
Stigler, S.M. (1982). Thomas Bayes and Bayesian inference. Journal of the Royal Statistical Soci-
     ety, A, 145(2), 250-258.
Stigler, S.M. (2007). The epic story of maximum likelihood. Statistical Science, 22, 598-620.
Stone, M. (1974). Cross-validatory choice and assessment of statistical predictions. Journal of the Royal Statistical Society, Series B, 36, 111-147.
Sugiura, N. (1978). Further analysis of the data by Akaike’s information criterion and the finite
     corrections. Communications in Statistics, Part A, Theory and Methods, 7, 13-26.
Sulzberger, P. H. (1953). The effects of temperature on the strength of wood, plywood and glued
     joints. Department of Supply, Report ACA-46, Aeronautical Research Consultative Commit-
     tee, Australia.
Tarpey, T., Ogden, R., Petkova, E., and Christensen, R. (2015). Reply. The American Statistician,
     69, 254-255.
Tibshirani, R. J. (1996). Regression shrinkage and selection via the LASSO. Journal of the Royal
     Statistical Society, Series B, 58, 267-288.
Tukey, J. W. (1949). One degree of freedom for nonadditivity. Biometrics, 5, 232-242.
Utts, J. (1982). The rainbow test for lack of fit in regression. Communications in Statistics—Theory
     and Methods, 11, 2801-2815.
Velilla, S. (2018). A note on collinearity diagnostics and centering. The American Statistician, 72,
     140-146.
von Neumann, John and Morgenstern, Oskar (1944). Theory of Games and Economic Behavior.
     (Third Edition, 1945; Reprinted, 2007.) Princeton University Press, Princeton.
Wald, Abraham (1950). Statistical Decision Functions. John Wiley and Sons, New York.
Wasserman, Larry (2004). All of Statistics. Springer, New York.
Weisberg, S. (2014). Applied Linear Regression, Fourth Edition. John Wiley and Sons, New York.
Wermuth, N. (1976). Model search among multiplicative models. Biometrics, 32, 253-264.
Wichura, M. J. (2006). The Coordinate-Free Approach to Linear Models. Cambridge University
     Press, New York.
Wilks, S. S. (1962). Mathematical Statistics. John Wiley and Sons, New York.
Williams, E. J. (1959). Regression Analysis. John Wiley and Sons, New York.
Wu, C. F. J. and Hamada, M. S. (2009). Experiments: Planning, Analysis, and Optimization, 2nd
     Edition. John Wiley and Sons, New York.
Zacks, S. (1971). The Theory of Statistical Inference. John Wiley and Sons, New York.
Zelen, Marvin (1996). After dinner remarks: On the occasion of Seymour Geisser’s 65th Birth-
     day, Hsinchu, Taiwan, December 13, 1994. In Modelling and Prediction: honoring Seymour
     Geisser, eds. Jack C. Lee, Wesley O. Johnson, Arnold Zellner. Springer-Verlag, New York.
Zellner, A. (1971). An Introduction to Bayesian Inference in Econometrics. John Wiley and Sons,
     New York.
Zhu, M. (2008). Kernels and ensembles: Perspectives on statistical learning. The American Statis-
     tician, 62, 97-109.
Albert, James (1997). Teaching Bayes' rule: A data-oriented approach. The American Statistician, 51, 247-253.
Moore, David (1997). Bayes for beginners? Some reasons to hesitate. The American Statistician, 51, 254-261.
Index