Chap 4
In a medical study we observe that 10% of the women and 12.5% of the men
suffer from heart disease. If there are 20 people in the study, we would probably
be hesitant to declare that women are less prone to suffer from heart disease than
men; it is very possible that the results occurred by chance. However, if there are
20,000 people in the study, then it seems more likely that we are observing a real
phenomenon. Hypothesis testing makes this intuition precise; it is a framework
that allows us to decide whether patterns that we observe in our data are likely
to be the result of random fluctuations (chance) or not.
Why do we need a hypothesis? Why not just look at the outcome of the
experiment and go with whichever treatment does better?
The answer lies in the tendency of the human mind to underestimate the scope
of natural random behavior. One manifestation of this is the failure to anticipate
extreme events, or so-called "black swans". Another manifestation is the tendency
to misinterpret random events as having patterns of some significance. Statistical
hypothesis testing was invented as a way to protect researchers from being fooled
by random chance.
◃ The alternative hypothesis, denoted by H1 , is the assertion that contradicts
the null hypothesis. This is what you hope to prove.
◃ Test statistic: A sample statistic used to decide whether to reject the null
hypothesis.
◃ Rejection region: The numerical values of the test statistic for which the
null hypothesis will be rejected.
The null hypothesis will then be rejected if and only if the observed or
computed test statistic value falls in the rejection region.
rejected. On the other hand, even when the alternative Hα : p < 0.1 is true, an
unusual sample might yield x = 20, in which case H0 would not be rejected, again
an incorrect conclusion (a Type II error).
In the best of all possible worlds, we would like a test procedure for which
neither type of error is possible, but this ideal can be achieved only by examining
the entire population.
DEFINITION 3 (Significance level and size). The size of a test is the probability of
making a Type I error, denoted by α. The significance level of a test is an upper
bound on the size.
For a fixed significance level, it is desirable to select a test that minimizes the
probability of making a Type II error. Equivalently, we would like to maximize the
probability of rejecting the null hypothesis when it does not hold. This probability
is known as the power of the test.
DEFINITION 4 (Power). The power of a test is the probability of rejecting the
null hypothesis if it does not hold.
(a) If the numerical value of the test statistic falls into the rejection region,
we reject the null hypothesis and conclude that the alternative hypothesis is
true. We know that the hypothesis-testing process will lead to this conclusion
incorrectly (a Type I error) only 100α% of the time when H0 is true.
(b) If the test statistic does not fall into the rejection region, we do not reject
H0 . Thus, we reserve judgement about which hypothesis is true. We do not
conclude that the null hypothesis is true because we do not (in general) know
the probability that our test procedure will lead to an incorrect acceptance
of H0 (a Type II error).
THEOREM 1 Suppose that an experiment and a sample size are fixed and a
test statistic is chosen. Then decreasing the size of the rejection region to
obtain a smaller value of α results in a larger value of β.
This proposition says that once the test statistic and n are fixed, there is no
rejection region that will simultaneously make both α and β small. A region must
be chosen to effect a compromise between α and β.
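The trade-off described above can be seen numerically. For the upper-tailed z-test of H0 : µ = µ0 against Hα : µ > µ0 with known σ, the following R sketch computes β at a fixed alternative; the values µ0 = 0, µ1 = 0.5, σ = 1, and n = 25 are illustrative choices, not taken from any example in this chapter:

```r
## beta = P(Type II error) for the upper-tailed z-test that rejects
## when z > qnorm(1 - alpha), evaluated at a true mean mu1 > mu0.
## (mu0 = 0, mu1 = 0.5, sigma = 1, n = 25 are illustrative values.)
beta <- function(alpha, mu0 = 0, mu1 = 0.5, sigma = 1, n = 25) {
  pnorm(qnorm(1 - alpha) - (mu1 - mu0) / (sigma / sqrt(n)))
}
beta(0.10)  # smallest beta of the three
beta(0.05)
beta(0.01)  # largest beta: shrinking alpha inflates beta
```

Running the three calls shows β growing as α shrinks, exactly as Theorem 1 asserts.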
Because of the suggested guidelines for specifying H0 and H1 , a Type I error
is usually more serious than a Type II error. The approach adhered to by most
statistical practitioners is then to specify the largest value of α that can be tolerated
and find a rejection region having that value of α.
Intuitively, H0 should be rejected if a substantial number of the crashes show no
damage. The test statistic is
X = the number of crashes among the 20 that show no damage,
and
Rejection region : R = {8, 9, · · · , 19, 20}
This rejection region is called upper tailed because it consists of only large values
of the test statistic.
When H0 is true, X has a binomial distribution with n = 20 and
p = 0.25. It follows that
α = P(Type I error) = P(X ≥ 8 when p = 0.25) ≈ 0.102.
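In R, the size of this test (the probability that X lands in the rejection region when H0 is true) can be computed from the binomial distribution function:

```r
## P(X >= 8) when X ~ Binomial(20, 0.25)
1 - pbinom(7, size = 20, prob = 0.25)  # approximately 0.102
```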
DEFINITION 6 The p-value is the probability, calculated assuming H0 is true, of
obtaining a test statistic value at least as contradictory to H0 as the value actually
observed. The smaller the p-value, the more contradictory the data are to H0 .
Null hypothesis: H0 : µ = µ0 .
Test statistic:
z = (X̄ − µ0 )/(σ/√n).
The statistic z is a natural measure of the distance between X̄, the estimator of
µ, and its expected value (expressed in "standard deviation units") when H0 is true.
Here σ/√n stands for the standard deviation of X̄. If the distance is too great in
a direction consistent with Hα , the null hypothesis should be rejected.
EXAMPLE 3 A monthly income investment scheme exists that promises variable
monthly returns. An investor will invest in it only if assured of an average
$180 monthly income. The investor has a sample of 300 months’ returns, with a
mean of $190 and a standard deviation of $75. Should the investor invest in this
scheme?
Solution
The investor will invest in the scheme only if assured of the desired $180
average return, so we test H0 : µ = 180 against Hα : µ > 180. It follows that
z = (x̄ − µ0 )/(s/√n) = (190 − 180)/(75/√300) ≈ 2.309.
Since z = 2.309 is greater than 1.645, the critical value for α = 0.05, the null
hypothesis can be rejected, so the investor can consider investing in this scheme.
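The computation in Example 3 can be reproduced in R:

```r
## One-sided large-sample z-test of H0: mu = 180 against Ha: mu > 180
xbar <- 190; mu0 <- 180; s <- 75; n <- 300
z <- (xbar - mu0) / (s / sqrt(n))
z                # approximately 2.309
z > qnorm(0.95)  # TRUE: reject H0 at the 5% level
1 - pnorm(z)     # one-sided p-value, approximately 0.010
```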
EXAMPLE 4 A new stockbroker (XYZ) claims that his brokerage fees are lower
than those of your current stockbroker (ABC). Data available from an indepen-
dent research firm indicate that the mean and standard deviation of the bills of
all ABC broker clients are $18 and $6, respectively.
A sample of 100 clients of ABC is taken and brokerage charges are calculated
with the new rates of XYZ broker. If the mean of the sample is $18.75 and the
standard deviation is the same ($6), can any inference be made about the
difference in the average brokerage bill between ABC and XYZ?
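Treating this as a two-sided large-sample z-test of H0 : µ = 18 against Hα : µ ≠ 18 at the 5% level (the example does not fix α, so the 5% level is an assumption), a quick R check:

```r
xbar <- 18.75; mu0 <- 18; s <- 6; n <- 100
z <- (xbar - mu0) / (s / sqrt(n))
z                      # 1.25
abs(z) > qnorm(0.975)  # FALSE: |z| < 1.96, so H0 cannot be rejected
```

At the 5% level the sample does not provide sufficient evidence that the average XYZ bill differs from the average ABC bill.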
EXAMPLE 5 The effect of drugs and alcohol on the nervous system has been the
subject of considerable research. Suppose a research neurologist is testing the
effect of a drug on response time by injecting 100 rats with a unit dose of the
drug, subjecting each rat to a neurological stimulus, and recording its response
time. The mean and standard deviations for the 100 records are X̄ = 1.05 and
s = 0.5 respectively. The neurologist knows that the mean response time for rats
not injected with the drug (the "control" mean) is 1.2 seconds. She wishes to test
whether the mean response time for drug-injected rats differs from 1.2 seconds.
Set up the test of hypothesis for this experiment, using α = 0.01.
Solution
◃ H0 : µ = 1.2
◃ Hα : µ ≠ 1.2
◃ Test statistic:
z = (x̄ − 1.2)/(s/√n) = (1.05 − 1.2)/(0.5/√100) = −3.0.
◃ Rejection region:
|z| > 2.58
which corresponds to α = 0.01. This sampling experiment provides suffi-
cient evidence to reject H0 and conclude, at the α = 0.01 level of significance,
that the mean response time for drug-injected rats differs from the control
mean of 1.2 seconds. It appears that the rats receiving an injection of the
drug have a mean response time that is less than 1.2 seconds.
Note: Since the sample size of the experiment is large enough (n > 30),
the central limit theorem will apply, and no assumptions need be made about
the population of response time measurements. The sampling distribution of the
sample mean response of 100 rats will be approximately normal, regardless of
the distribution of the individual rats’ response times.
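The critical value and p-value in Example 5 can be checked in R:

```r
qnorm(1 - 0.01/2)  # two-sided critical value for alpha = 0.01, approx. 2.58
2 * pnorm(-3.0)    # two-sided p-value, approximately 0.0027 (< 0.01)
```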
THEOREM 3 Let X1 , · · · , Xn represent a random sample of size n from a
normal population with mean µ and variance σ 2 . Define
X̄ = (1/n) ∑ᵢ₌₁ⁿ Xᵢ
and let S denote the sample standard deviation.
Null hypothesis: H0 : µ = µ0 .
Test statistic:
T = (X̄ − µ0 )/(S/√n).
Under H0 , T has a t distribution with n − 1 degrees of freedom.
The function t.test is available in R for performing t-tests. Let’s test it out on
a simple example, using data simulated from a normal distribution.
> x = rnorm(10)
> y = rnorm(10)
> t.test(x,y)
        Welch Two Sample t-test

data: x and y
t = 1.4896, df = 15.481, p-value = 0.1564
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-0.3221869 1.8310421
sample estimates:
mean of x mean of y
0.1944866 -0.5599410
Let p denote the proportion of individuals in a population that have a specified
property (for example, individuals who prefer a filter cigarette). Testing concerning
p will be based on a random sample of size n from the population. Denote by X
the number of observations in the sample that have the specified property. Provided
that n is small relative to the population size, X has (approximately) a binomial
distribution.
Define the estimator
p̂ = X/n.
Then
E(p̂) = p,   Var(p̂) = p(1 − p)/n.
Indeed, suppose that X1 , · · · , Xn are a random sample from the population, and
each of them has the specified property with probability p. Write
X = ∑ᵢ₌₁ⁿ I(Xi = 1),
where Xi = 1 means that the ith sample has the specified property. It follows that
EX = ∑ᵢ₌₁ⁿ E I(Xi = 1) = ∑ᵢ₌₁ⁿ P(Xi = 1) = np,
and hence E(p̂) = EX/n = p.
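These two moment identities can also be checked by simulation; the values p = 0.3 and n = 50 below are arbitrary illustrative choices:

```r
set.seed(1)
p <- 0.3; n <- 50
phat <- rbinom(10000, size = n, prob = p) / n  # 10,000 replicates of p-hat
mean(phat)  # close to p = 0.3
var(phat)   # close to p * (1 - p) / n = 0.0042
```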
EXAMPLE 6 The reputations (and hence sales) of many businesses can be severely
damaged by shipments of manufactured items that contain a large percentage of
defectives. For example, a manufacturer of alkaline batteries may want to be
reasonably certain that less than 5% of its batteries are defective. Suppose 300
batteries are randomly selected from a very large shipment, each is tested, and 10
defective batteries are found. Does this outcome provide sufficient evidence for
the manufacturer to conclude that the fraction defective in the entire shipment
is less than 0.05? Use α = 0.01.
Solution
H0 : p = 0.05 ↔ Hα : p < 0.05.
The test statistic is
z = (p̂ − p0 )/√(p0 (1 − p0 )/n) = (10/300 − 0.05)/√(0.05 × 0.95/300) ≈ −1.32.
Since −1.32 does not fall below −2.33, the critical value for α = 0.01, H0 cannot
be rejected: this outcome does not provide sufficient evidence to conclude that
the fraction defective is less than 0.05.
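The same computation in R:

```r
## Large-sample test of H0: p = 0.05 against Ha: p < 0.05
phat <- 10 / 300; p0 <- 0.05; n <- 300
z <- (phat - p0) / sqrt(p0 * (1 - p0) / n)
z                # approximately -1.32
z < qnorm(0.01)  # FALSE: z does not fall below -2.33, so H0 is not rejected
```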
Suppose we observe two groups of data, xA and xB , and compute the difference
mdiff (x) = m(xA ) − m(xB ), where x are all the data merged together. Here the
statistic m could be either the Z statistic or the t statistic. Our goal is to test
whether m(xA ) is larger than m(xB ) at a certain significance level. The
corresponding rejection region is of the form
R := {x | mdiff (x) > η}. The problem is how to fix the threshold η so that the test
has the desired significance level.
The first step in a permutation test of a hypothesis is to combine the results
from groups A and B together. This is the logical embodiment of the null hy-
pothesis that the treatments to which the groups were exposed do not differ. We
then test that hypothesis by randomly drawing groups from this combined set
and seeing how much they differ from one another.
Imagine that we randomly permute the groups A and B in the merged data
set x. As a result, some of the data that belong to A will be labeled as B and
vice versa. If we recompute mdiff (x) we will obviously obtain a different value.
However, the distribution of the random variable mdiff (x) under the hypothesis
that the data are sampled from the same distribution has not changed.
Consider the value of mdiff for all the possible permutations of the labels:
mdiff,1 , mdiff,2 , · · · , mdiff,n! . We can therefore compute the p value of the ob-
served statistic mdiff (x) as
p = P(mdiff (X) > mdiff (x)) = (1/n!) ∑ᵢ I(mdiff,i ≥ mdiff (x)),
where the sum runs over all n! permutations.
In words, the p value is the fraction of permutations that yield a more extreme
test statistic than the one we observe. Unfortunately, it is often challenging to
compute the above exactly. Even for moderately sized data sets the number of
possible permutations is usually too large (for example, 40! > 8 × 10^47 ) for it to be
computationally tractable. In such cases the p value can be approximated by sam-
pling a large number of permutations and taking the corresponding Monte Carlo
average.
Before looking at an example, let us review the steps to be followed when
applying a permutation test.
(1) Merge the data from the two groups into a single data set x.
(2) Compute the observed value of the test statistic mdiff (x).
(3) Fix the number of random permutations k to be used.
(4) Permute the labels k times and compute the corresponding values of mdiff :
mdiff,1 , mdiff,2 , · · · , mdiff,k .
(5) Compute the approximate p value (1/k) ∑ᵢ₌₁ᵏ I(mdiff,i ≥ mdiff (x)) and reject the
null hypothesis if it is below a predefined limit (typically 1% or 5%).
A company selling a relatively high-value service wants to test which of two
web presentations does a better selling job. Due to the high value of the service
being sold, sales are infrequent and the sales cycle is lengthy; it would take too
long to accumulate enough sales to know which presentation is superior. So the
company decides to measure the results with a proxy variable, using the detailed
interior page that describes the service.
One potential proxy variable for our company is the number of clicks on the
detailed landing page. A better one is how long people spend on the page. It
is reasonable to think that a web presentation (page) that holds people’s attention
longer will lead to more sales. Hence, our metric is average session time,
comparing page A to page B.
### put these two sets of data together and draw the boxplot
comp=list(y1,y2)
boxplot(comp,names = c("Page A","Page B"),ylab="Time")
The boxplot, shown below, indicates that page B leads to longer sessions than
page A. The mean session time for each group can be computed by applying
mean to y1 and y2.
The question is whether this difference is within the range of what random
chance might produce, or, alternatively, is statistically significant. One way to
answer this is to apply a permutation test—combine all the session times together,
then repeatedly shuffle and divide them into groups of 21 (recall that n = 21 for
page A) and 15 (n = 15 for B). To apply a permutation test, we need a function to
randomly assign the 36 session times to a group of 21 (page A) and a group of
15 (page B):
Figure 1: Session times for web pages A and B.
perm_fun <- function(x, n1, n2)
{
  n <- n1 + n2
  idx_b <- sample(1:n, n1)     #### randomly choose n1 samples from 1:n
  idx_a <- setdiff(1:n, idx_b) ### the remaining samples, excluding idx_b
  mean_diff <- mean(x[idx_b]) - mean(x[idx_a]) ## calculate their difference
  return(mean_diff)
}
It turns out that the p value, approximated by random permutations in R, is
about 0.128. This suggests that the observed difference in session time between
page A and page B is well within the range of chance variation and thus is not
statistically significant.
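A sketch of that simulation, assuming y1 and y2 are the session-time vectors for pages A and B defined earlier:

```r
set.seed(1)
x <- c(y1, y2)                      # merge the two groups
n1 <- length(y1); n2 <- length(y2)  # 21 and 15 sessions
perm_diffs <- replicate(1000, {
  idx_b <- sample(1:(n1 + n2), n2)  # random group of size 15 for "page B"
  mean(x[idx_b]) - mean(x[-idx_b])  # permuted difference in means
})
obs_diff <- mean(y2) - mean(y1)     # observed difference (B minus A)
mean(perm_diffs > obs_diff)         # approximate p value
```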