
MH8131 Probability and Statistics

Chapter 4. Hypothesis Testing

In a medical study we observe that 10% of the women and 12.5% of the men
suffer from heart disease. If there are 20 people in the study, we would probably
be hesitant to declare that women are less prone to suffer from heart disease than
men; it is very possible that the results occurred by chance. However, if there are
20,000 people in the study, then it seems more likely that we are observing a real
phenomenon. Hypothesis testing makes this intuition precise; it is a framework
that allows us to decide whether patterns that we observe in our data are likely
to be the result of random fluctuations (chance) or not.
Why do we need a hypothesis? Why not just look at the outcome of the
experiment and go with whichever treatment does better?
The answer lies in the tendency of the human mind to underestimate the scope
of natural random behavior. One manifestation of this is the failure to anticipate
extreme events, or so-called "black swans". Another manifestation is the tendency
to misinterpret random events as having patterns of some significance. Statistical
hypothesis testing was invented as a way to protect researchers from being fooled
by random chance.

1 The Elements of a Hypothesis Test


A statistical hypothesis, or simply a hypothesis, is a claim or assertion about the
value of a single parameter, about the values of several parameters, or about
the form of a probability distribution. One example is the statement p < 0.1
where p is the proportion of defective TVs among all TVs produced by a certain
manufacturer. In any hypothesis testing problem, there are two contradictory
hypotheses under consideration. The objective is to decide which of the two
hypotheses is correct based on sample information.

◃ The Null hypothesis , denoted by H0 , is the claim that is initially assumed
to be true (the "prior belief" claim). It generally represents the status quo,
which we adopt until it is proven false: the hypothesis that chance is to
blame.

◃ The Alternative hypothesis , denoted by H1 , is the assertion that contradicts
the null hypothesis. This is what you hope to prove.

In testing statistical hypotheses, the problem will be formulated so that one of
the claims is initially favored. This initially favored claim will not be rejected in
favor of the alternative claim unless sample evidence contradicts it. For example,
in the U.S. judicial system, the null hypothesis is the assertion that the accused
individual is innocent. Only in the face of strong evidence to the contrary should
the jury reject this claim in favor of the alternative assertion that the accused is
guilty. In this sense, the claim of innocence is the favored or protected hypothesis.
Hypothesis tests use the following logic: "Given the human tendency to react
to unusual but random behavior and interpret it as something meaningful and
real, in our experiments we will require proof that the difference between groups
is more extreme than what chance might reasonably produce."
A test procedure is a rule, based on sample data, for deciding whether to reject
H0 .

EXAMPLE 1 A test of H0 : p = 0.1 versus Hα : p < 0.1 in the TV problem might be
based on examining a random sample of n = 200 TVs. Let X denote the number
of defective TVs in the sample, a binomial random variable; x represents the
observed value of X. If H0 is true, EX = np = 200 ∗ 0.1 = 20, whereas we can
expect fewer than 20 defective TVs if Hα is true. A value x just a bit below 20 does
not strongly contradict H0 , so it is reasonable to reject H0 only if x is substantially
less than 20. 

DEFINITION 1 A test procedure is specified by the following:

◃ Test statistic: A sample statistic used to decide whether to reject the null
hypothesis.

◃ Rejection region: The numerical values of the test statistic for which the
null hypothesis will be rejected.

The null hypothesis will then be rejected if and only if the observed or
computed test statistic value falls in the rejection region.

The basis for choosing a particular rejection region lies in consideration of
the errors that one might be faced with in drawing a conclusion. Consider the
rejection region x ≤ 15 in the TV problem. Even when H0 : p = 0.1 is true,
it might happen that an unusual sample results in x = 13, so that H0 is erroneously
rejected. On the other hand, even when Hα : p < 0.1 is true, an unusual sample
might yield x = 20, in which case H0 would not be rejected, again an incorrect
conclusion.
A test procedure for which neither type of error is possible could be achieved
only by examining the entire population.

2 Statistical Significance and Type I and II errors


Statistical significance is how statisticians measure whether an experiment (or
even a study of existing data) yields a result more extreme than what chance
might produce. If the result is beyond the realm of chance variation, it is said to
be statistically significant.

DEFINITION 2 A type I error consists of rejecting the null hypothesis H0 when


it is true. A type II error involves not rejecting H0 when H0 is false.

Alternatively, Type 1 error: mistakenly concluding an effect is real (when it is


due to chance). Type 2 error: mistakenly concluding an effect is due to chance
(when it is real).

DEFINITION 3 (Significance level and size). The size of a test is the probability of
making a Type I error, denoted by α. The significance level of a test is an upper
bound on the size.

When you read in a study that a result is statistically significant at a level of
0.05, this means that the probability of committing a Type I error is bounded by
5%. The significance level of the test is usually set to 0.1, 0.05 or 0.01. The
probability of making a Type II error is denoted by β. In other words,

α = P(making a Type I error), β = P(making a Type II error).

For a fixed significance level, it is desirable to select a test that minimizes the
probability of making a Type II error. Equivalently, we would like to maximize the
probability of rejecting the null hypothesis when it does not hold. This probability
is known as the power of the test.

DEFINITION 4 (Power). The power of a test is the probability of rejecting the
null hypothesis if it does not hold.

How should we draw conclusions?

(a) If the numerical value of the test statistic falls into the rejection region,
we reject the null hypothesis and conclude that the alternative hypothesis is
true. We know that the hypothesis-testing process will lead to this conclusion
incorrectly (a Type I error) only 100α% of the time when H0 is true.

(b) If the test statistic does not fall into the rejection region, we do not reject
H0 . Thus, we reserve judgement about which hypothesis is true. We do not
conclude that the null hypothesis is true because we do not (in general) know
the probability that our test procedure will lead to an incorrect acceptance
of H0 (a Type II error).

THEOREM 1 Suppose that an experiment and a sample size are fixed and a
test statistic is chosen. Then decreasing the size of the rejection region to
obtain a smaller value of α results in a larger value of β.

This proposition says that once the test statistic and n are fixed, there is no
rejection region that will simultaneously make both α and β small. A region must
be chosen to effect a compromise between α and β.
Because of the suggested guidelines for specifying H0 and H1 , a Type I error
is usually more serious than a Type II error. The approach adhered to by most
statistical practitioners is then to specify the largest value of α that can be tolerated
and find a rejection region having that value of α.
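Theorem 1 can be made concrete with a short R sketch for the TV example (H0 : p = 0.1, n = 200) and rejection regions of the form x ≤ c. The alternative value p = 0.05 used to compute β below is an assumption chosen purely for illustration.

```r
# TV example: X ~ Bin(200, p), rejection region {x <= c}.
# Shrinking the region (smaller c) lowers alpha but raises beta.
n <- 200
for (c in c(15, 13, 11)) {
  alpha <- pbinom(c, n, 0.10)      # P(reject H0 | p = 0.10), the size
  beta  <- 1 - pbinom(c, n, 0.05)  # P(fail to reject | p = 0.05), illustrative
  cat(sprintf("c = %d: alpha = %.4f, beta = %.4f\n", c, alpha, beta))
}
```

As c decreases from 15 to 11, α decreases while β increases, exactly the tradeoff the theorem describes.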

EXAMPLE 2 A certain type of automobile is known to sustain no visible damage


25% of the time in 10-mph crash tests. A modified bumper design has been
proposed in an effort to increase this percentage. Let p denote the proportion of
all 10-mph crashes with this new bumper that result in no visible damage. We
draw n = 20 independent crashes with prototypes of the new design. Write down
the null hypothesis, the alternative hypothesis, rejection region and type I error.
Solution

H0 : p = 0.25 (no improvement), Hα : p > 0.25.

Intuitively, H0 should be rejected if a substantial number of the crashes show no
damage. The test statistic is

X = the number of crashes with no visible damage

and
Rejection region : R = {8, 9, · · · , 19, 20}
This rejection region is called upper tailed because it consists of only large values
of the test statistic.
When H0 is true, X has a binomial distribution with n = 20 and
p = 0.25. It follows that

α = P(Type I error) = P(H0 is rejected when it is true)
  = P(X ≥ 8 when X ∼ Bin(20, 0.25))
  = 1 − ∑_{k=0}^{7} C(20, k) (0.25)^k (0.75)^{20−k} = 0.102.
That is, when H0 is true, roughly 10% of all experiments consisting of 20 crashes
would result in H0 being incorrectly rejected (a type I error).
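This size can be checked numerically with the binomial CDF in R:

```r
# alpha = P(X >= 8) = 1 - P(X <= 7) for X ~ Bin(20, 0.25)
alpha <- 1 - pbinom(7, 20, 0.25)
round(alpha, 3)  # 0.102
```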

3 Observed Significance Levels: p-Values


One way to report the result of a hypothesis-testing analysis is to simply say
whether the null hypothesis was rejected at a specified level of significance. Thus
an investigator might state that H0 was rejected at level of significance 0.05 or that
use of a level 0.01 test resulted in not rejecting H0 . In many decision situations,
individuals may have different views concerning the consequence of a type I or
type II error.
A second method of presenting the results of a statistical test reports the extent
to which the test statistic disagrees with the null hypothesis and leaves to the
reader the task of deciding whether to reject the null hypothesis. This measure
of disagreement is called the observed significance level (or p-value) for the test.

DEFINITION 5 The p-value is defined as the probability of obtaining a result equal


to or "more extreme" than what was actually observed, when the null hypothesis
is true.

Alternatively

DEFINITION 6 The p-value is the probability, calculated assuming H0 is true, of
obtaining a test statistic value at least as contradictory to H0 as the value
actually observed. The smaller the p-value, the more contradictory the data
are to H0 .

◃ p-value ≤ α ⇒ reject H0 at level α.

◃ p-value > α ⇒ do not reject H0 at level α.

4 Tests about a population mean


4.1 A normal population with known σ
Although the assumption that the value of σ is known is rarely met in practice,
this case provides a good starting point because of the ease with which general
procedures and their properties can be developed. The test procedure for this
case is summarized as follows.

THEOREM 2 Let X1 , · · · , Xn represent a random sample of size n from a
normal population with mean µ and variance σ 2 . Define

X̄ = (1/n) ∑_{i=1}^{n} Xi .

Null hypothesis: H0 : µ = µ0 .
Test statistic:
X̄ − µ0
z= √ .
σ/ n

Alternative Hypothesis Rejection region for level α test


Hα : µ > µ0 z ≥ zα (upper-tailed test)
Hα : µ < µ0 z ≤ −zα (lower-tailed test)
Hα : µ ̸= µ0 |z| ≥ zα/2 (two-tailed test)

where zα and zα/2 are chosen so that

P(Z > zα ) = α, P(|Z| > zα/2 ) = α, for Z ∼ N(0, 1).

The statistic z is a natural measure of the distance between X̄, the estimator of
µ, and its expected value (expressed in standard deviation units) when H0 is true.
Here σ/√n is the standard deviation of X̄. If the distance is too great in
a direction consistent with Hα , the null hypothesis should be rejected.
EXAMPLE 3 A monthly income investment scheme exists that promises variable
monthly returns. An investor will invest in it only if he is assured of an average
$180 monthly income. He has a sample of 300 months’ returns which has a mean
of $190 and a standard deviation of $75. Should he or she invest in this scheme?
Solution
The investor will invest in the scheme if he or she is assured of his desired
180 average return. It follows that

H0 : µ = 180 ↔ H1 : µ > 180.


The test statistic is

Z = (190 − 180) / (75/√300) = 2.309.
Our rejection region at 5% significance level is

Z > Z0.05 = 1.645.

Since Z= 2.309 is greater than 1.645, the null hypothesis can be rejected, so
the investor can consider investing in this scheme.
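The calculation can be reproduced in R (a sketch; with n = 300 observations the sample standard deviation of 75 is treated as the known σ):

```r
# Example 3: one-sided z test of H0: mu = 180 vs H1: mu > 180
z <- (190 - 180) / (75 / sqrt(300))  # test statistic, about 2.31
z_crit <- qnorm(0.95)                # upper 5% critical value, about 1.645
z > z_crit                           # TRUE: reject H0 at the 5% level
```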

EXAMPLE 4 A new stockbroker (XYZ) claims that his brokerage fees are lower
than that of your current stock broker’s (ABC). Data available from an indepen-
dent research firm indicates that the mean and std-dev of all ABC broker clients
are $18 and $6, respectively.
A sample of 100 clients of ABC is taken and brokerage charges are calculated
with the new rates of XYZ broker. If the mean of the sample is $18.75 and std-dev
is the same ($6), can any inference be made about the difference in the average
brokerage bill between ABC and XYZ broker? 
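One plausible way to set this up (an assumption, since the example leaves the formulation open) is a two-tailed large-sample z test of H0 : µ = 18 versus Hα : µ ≠ 18, treating σ = 6 as known:

```r
# Example 4 sketch: two-tailed z test with assumed known sigma = 6, n = 100
z <- (18.75 - 18) / (6 / sqrt(100))  # z = 1.25
p_value <- 2 * (1 - pnorm(z))        # two-tailed p-value, about 0.21
```

Since the p-value exceeds the usual significance levels (0.1, 0.05, 0.01), under this formulation there is insufficient evidence of a difference between the average brokerage bills of ABC and XYZ.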

4.2 Large-Sample tests


When the sample size is large, the z test for the case of known σ is easily modified
to yield a valid test procedure without requiring either a normal population
distribution or a known σ. A large n implies that the standardized variable

Z = (X̄ − µ)/(S/√n)

has approximately a standard normal distribution (central limit theorem). The
rule of thumb n > 30 will be used to characterize a large sample size.

EXAMPLE 5 The effect of drugs and alcohol on the nervous system has been the
subject of considerable research. Suppose a research neurologist is testing the
effect of a drug on response time by injecting 100 rats with a unit dose of the
drug, subjecting each rat to a neurological stimulus, and recording its response
time. The mean and standard deviations for the 100 records are X̄ = 1.05 and
s = 0.5 respectively. The neurologist knows that the mean response time for rats
not injected with the drug (the "control" mean) is 1.2 seconds. She wishes to test
whether the mean response time for drug-injected rats differs from 1.2 seconds.
Set up the test of hypothesis for this experiment, using α = 0.01.
Solution
◃ H0 : µ = 1.2

◃ Hα : µ ̸= 1.2

◃ Test statistic:

z = (x̄ − 1.2)/(s/√n) = (1.05 − 1.2)/(0.5/√100) = −3.0.

◃ Rejection region:

|z| > 2.58
which corresponds to α = 0.01. This sampling experiment provides suffi-
cient evidence to reject H0 and conclude, at the α = 0.01 level of significance,
that the mean response time for drug-injected rats differs from the control
mean of 1.2 seconds. It appears that the rats receiving an injection of the
drug have a mean response time that is less than 1.2 seconds. 
Note: Since the sample size of the experiment is large enough (n > 30),
the central limit theorem will apply, and no assumptions need be made about
the population of response time measurement. The sampling distribution of the
sample mean response of 100 rats will be approximately normal, regardless of
the distribution of the individual rats’ response times.
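The conclusion can be double-checked via the two-tailed p-value in R:

```r
# Example 5: two-tailed large-sample z test of H0: mu = 1.2
z <- (1.05 - 1.2) / (0.5 / sqrt(100))  # z = -3
p_value <- 2 * pnorm(z)                # two-tailed p-value, about 0.0027
p_value < 0.01                         # TRUE: reject H0 at alpha = 0.01
```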

4.3 A normal population


When n is small, the central limit theorem (CLT) can no longer be invoked to
justify the use of a large sample test. We faced the same difficulty in obtaining a
small sample confidence interval for µ in the last chapter. We will assume that
the population distribution is at least approximately normal so that
X̄ − µ
T= √
S/ n
has a t distribution with n − 1 degrees of freedom.

THEOREM 3 Let X1 , · · · , Xn represent a random sample of size n from a
normal population with mean µ and variance σ 2 . Define

X̄ = (1/n) ∑_{i=1}^{n} Xi .

Null hypothesis: H0 : µ = µ0 .
Test statistic:

T = (X̄ − µ0)/(S/√n).

Alternative Hypothesis Rejection region for level α test


Hα : µ > µ0 t ≥ tα,n−1 (upper-tailed test)
Hα : µ < µ0 t ≤ −tα,n−1 (lower-tailed test)
Hα : µ ̸= µ0 |t| ≥ tα/2,n−1 (two-tailed test)

The function t.test is available in R for performing t-tests. Let’s test it out on
a simple example, using data simulated from a normal distribution.

> x = rnorm(10)
> y = rnorm(10)
> t.test(x,y)

The output is as follows.

Welch Two Sample t-test

data: x and y
t = 1.4896, df = 15.481, p-value = 0.1564
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-0.3221869 1.8310421
sample estimates:
mean of x mean of y
0.1944866 -0.5599410

5 Tests concerning a population proportion


Let p denote the proportion of individuals or objects in a population who possess a
specified property (e.g. cars with manual transmissions, or smokers who smoke
filter cigarettes). Tests concerning p will be based on a random sample of
size n from the population. Denote by X the number of observations in the
sample which have the specified property. Provided that n is small relative to the
population size, X has (approximately) a binomial distribution.
Define the estimator p̂ = X/n. Then

E(p̂) = p,    Var(p̂) = p(1 − p)/n.
Indeed, suppose that X1 , · · · , Xn are a random sample from the population and
each of them has the specified property with probability p. Write

X = ∑_{i=1}^{n} I(Xi = 1),

where Xi = 1 means that the ith sample has the specified property. It follows that

EX = ∑_{i=1}^{n} E[I(Xi = 1)] = ∑_{i=1}^{n} P(Xi = 1) = np,

which implies that E(p̂) = p. Likewise,

Var(X) = Var(∑_{i=1}^{n} I(Xi = 1)) = ∑_{i=1}^{n} Var(I(Xi = 1))
       = ∑_{i=1}^{n} [E(I(Xi = 1)^2) − (E I(Xi = 1))^2]
       = ∑_{i=1}^{n} (p − p^2) = np(1 − p).

Therefore

Var(p̂) = p(1 − p)/n.
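These two identities can be sanity-checked by simulation (a sketch; the values n = 50 and p = 0.3 are arbitrary choices for illustration):

```r
# Monte Carlo check of E(p_hat) = p and Var(p_hat) = p(1 - p)/n
set.seed(1)
n <- 50; p <- 0.3
p_hat <- rbinom(1e5, n, p) / n   # 100,000 simulated sample proportions
mean(p_hat)                      # close to p = 0.3
var(p_hat)                       # close to p(1 - p)/n = 0.0042
```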

THEOREM 4 Null hypothesis


H 0 : p = p0
Test statistic
p̂ − p0
z= √ .
p0 (1 − p0 )/n
Alternative Hypothesis Rejection region for level α test
Hα : p > p0 z ≥ zα (upper-tailed test)
Hα : p < p0 z ≤ −zα (lower-tailed test)
Hα : p ̸= p0 |z| ≥ zα/2 (two-tailed test)
where zα and zα/2 are chosen so that
P(z > zα ) = α, P(|z| > zα/2 ) = α.

EXAMPLE 6 The reputations (and hence sales) of many businesses can be severely
damaged by shipments of manufactured items that contain a large percentage of
defectives. For example, a manufacturer of alkaline batteries may want to be
reasonably certain that less than 5% of its batteries are defective. Suppose 300
batteries are randomly selected from a very large shipment; each is tested and 10
defective batteries are found. Does this outcome provide sufficient evidence for
the manufacturer to conclude that the fraction defective in the entire shipment
is less than 0.05? Use α = 0.01.
Solution
H0 : p = 0.05 ↔ Hα : p < 0.05.
The test statistic is

z = (p̂ − p0)/√(p0(1 − p0)/n) = (10/300 − 0.05)/√(0.05 ∗ 0.95/300) = −1.32.

The rejection region is


z < −z0.01 = −2.33,
which corresponds to α = 0.01. The calculated z-value does not fall into the
rejection region. Therefore, there is insufficient evidence at the 0.01 level of
significance to indicate that the shipment contains less than 5% defective batteries.
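The computation in R (a sketch mirroring the hand calculation above):

```r
# Example 6: lower-tailed test of H0: p = 0.05 vs Ha: p < 0.05
p_hat <- 10 / 300
z <- (p_hat - 0.05) / sqrt(0.05 * 0.95 / 300)  # about -1.32
z_crit <- -qnorm(0.99)                         # about -2.33
z < z_crit                                     # FALSE: do not reject H0
```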

6 Nonparametric testing: The permutation test


In practical situations we may not be able to design a parametric model that is
adequate for our data. Nonparametric tests are hypothesis tests that do not assume
that the data follow any distribution with a predefined form. In this section we
describe the permutation test, a nonparametric test that can be used to compare
two data sets xA and xB in order to evaluate conjectures of the form "xA is
sampled from a distribution that has a higher mean than xB" or "xB is sampled
from a distribution that has a higher variance than xA". The null hypothesis is
that the two data sets are actually sampled from the same distribution.
The test statistic in a permutation test is the difference between the values of
a test statistic of interest, m, evaluated on the two data sets

mdiff (x) = m(xA ) − m(xB )

where x are all the data merged together. Here the statistic m could be either
the Z statistic or t statistic. Our goal is to test whether m(xA ) is larger than m(xB )
at a certain significance level. The corresponding rejection region is of the form

R := {x : mdiff (x) > η}. The problem is how to fix the threshold η so that the test
has the desired significance level.
The first step in a permutation test of a hypothesis is to combine the results
from groups A and B together. This is the logical embodiment of the null hy-
pothesis that the treatments to which the groups were exposed do not differ. We
then test that hypothesis by randomly drawing groups from this combined set
and seeing how much they differ from one another.
Imagine that we randomly permute the groups A and B in the merged data
set x. As a result, some of the data that belong to A will be labeled as B and
vice versa. If we recompute mdiff (x) we will obviously obtain a different value.
However, the distribution of the random variable mdiff (x) under the hypothesis
that the data are sampled from the same distribution has not changed.
Consider the value of mdiff for all the possible permutations of the labels:
mdiff,1 , mdiff,2 , · · · , mdiff,n! . We can therefore compute the p value of the
observed statistic mdiff (x) as

p = P(mdiff (X) ≥ mdiff (x)) = (1/n!) ∑_{i=1}^{n!} I(mdiff,i ≥ mdiff (x)).

In words, the p value is the fraction of permutations that yield a test statistic at
least as extreme as the one we observe. Unfortunately, it is often challenging to
compute the above exactly. Even for moderately sized data sets the number of
possible permutations is usually too large (for example, 40! ≈ 8 × 10^47) for it to
be computationally tractable. In such cases the p value can be approximated by
sampling a large number of permutations and making a Monte Carlo approximation
of the above with their average.
Before looking at an example, let us review the steps to be followed when
applying a permutation test.

(1) Choose a conjecture/statement as to how xA and xB are different.

(2) Choose a test statistic mdiff .

(3) Compute mdiff (x).

(4) Permute the labels k times and compute the corresponding values of mdiff :
mdiff,1 , mdiff,2 , · · · , mdiff,k .
(5) Compute the approximate p value (1/k) ∑_{i=1}^{k} I(mdiff,i ≥ mdiff (x)), and reject the
null hypothesis if it is below a predefined limit (typically 1% or 5%).

EXAMPLE 7 (Web Stickiness)

A company selling a relatively high-value service wants to test which of two
web presentations does a better selling job. Due to the high value of the service
being sold, sales are infrequent and the sales cycle is lengthy; it would take too
long to accumulate enough sales to know which presentation is superior. So the
company decides to measure the results with a proxy variable, using the detailed
interior page that describes the service.
One potential proxy variable for our company is the number of clicks on the
detailed landing page. A better one is how long people spend on the page. It
is reasonable to think that a web presentation (page) that holds people's atten-
tion longer will lead to more sales. Hence, our metric is average session time,
comparing page A to page B.

###read the data


x=read.csv("/Users/lym/Desktop/untitled folder/web_page_data.csv")

###divide the data by their row names (Page A and B)


###x[’Page’]==’Page A’ is to select the rows with names being ’Page A’.
y1=x[x[’Page’]==’Page A’, ’Time’]###21 observations
y2=x[x[’Page’]==’Page B’, ’Time’]###15 observations

###put these two sets of data together and draw the boxplot.
comp=list(y1,y2)
boxplot(comp,names = c("Page A","Page B"),ylab="Time")

##calculate their means


mean_a <- mean(y1)
mean_b <- mean(y2)
mean_b - mean_a

The boxplot, shown in Figure 1, indicates that page B leads to longer sessions
than page A. The means for each group are computed in the code above.
The question is whether this difference is within the range of what random
chance might produce, or, alternatively, is statistically significant. One way to
answer this is to apply a permutation test—combine all the session times together,
then repeatedly shuffle and divide them into groups of 21 (recall that n = 21 for
page A) and 15 (n = 15 for B). To apply a permutation test, we need a function to
randomly assign the 36 session times to a group of 21 (page A) and a group of
15 (page B):

####To apply permutation test


Figure 1: Session times for web pages A and B.

perm_fun <- function(x, n1, n2)
{
  n <- n1 + n2
  idx_b <- sample(1:n, n2)     #### randomly choose n2 samples from 1:n for group B
  idx_a <- setdiff(1:n, idx_b) ### the remaining n1 samples form group A
  mean_diff <- mean(x[idx_b]) - mean(x[idx_a]) ## calculate their difference
  return(mean_diff)
}

This function works by sampling without replacement n2 indices and assigning
them to the B group; the remaining n1 indices are assigned to group A. The
difference between the two means is returned. Call this function R = 1,000 times,
specifying n1 = 21 and n2 = 15:

perm_diffs <- rep(0, 1000)


for(i in 1:1000)
perm_diffs[i] = perm_fun(x[,’Time’], 21, 15)

It turns out that the p value is 0.128 by the following R code. This suggests
that the observed difference in session time between page A and page B is well
within the range of chance variation, and thus is not statistically significant.

###calculate the p-value of permutation test


v = mean_b - mean_a
p_per_value=mean(perm_diffs>v)
p_per_value
