Contingency Analysis:
association between categorical variables
Contingency Analysis
Mosaic Plots
Odds
Odds Ratio
SE & CI for Odds Ratio
χ² Contingency Test
R example
Assumptions of χ²
Correction for Continuity
Fisher's Exact Test
G-tests
Contingency Analysis
There are many examples in biology where we wish to
relate two variables that are categorical.
For example:
1. Do bright and drab butterflies differ in their probability of
being eaten?
2. Is fur color (tan, brown, black) related to gender?
3. Is tree death related to slope aspect?
These questions are best approached using contingency
analysis, which allows us to determine whether two or
more categorical variables are independent.
Mosaic Plots
The Titanic disaster provides a simple example of the use
of mosaic plots for examining the structure of frequency
data.
Plots are composed of a series of graphical blocks or
boxes. The area of each box is proportional to the number
of elements in that group. Groups can be compared side by
side (row-wise or column-wise).
The plot clearly shows that women experienced a greater
survival rate than men.
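As a rough sketch of how such a plot can be drawn in R, the block below assumes that the Titanic table shipped with R's datasets package is a reasonable stand-in for the data behind the slide. The object names sex_by_survival and survival_rate are illustrative only.

```r
# Titanic is a 4-way table (Class x Sex x Age x Survived) in R's datasets package.
# Assumption: it stands in for the data shown on the slide.
# Collapse it to a Sex x Survived table of counts.
sex_by_survival <- margin.table(Titanic, c(2, 4))

# Proportion surviving within each sex (each row sums to 1).
survival_rate <- prop.table(sex_by_survival, 1)
print(round(survival_rate, 3))

# Mosaic plot: the area of each box is proportional to the cell count.
mosaicplot(sex_by_survival, color = TRUE, main = "Titanic: survival by sex")
```

Collapsing to two variables first keeps the plot readable; passing the full four-way table would draw a box for every combination of class, sex, age, and survival.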
Odds
Let's consider a variable (e.g., our previous coin toss
example) for which a random trial yields one of two
outcomes: success or failure (heads or tails).
The probability of success is p and the probability of failure
is 1 - p. The odds of success (O) are the probability of
success divided by the probability of failure:
$$O = \frac{p}{1 - p}$$
The estimate of the odds is calculated from a random
sample of trials using the observed proportion of
successes (p-hat):
$$\hat{O} = \frac{\hat{p}}{1 - \hat{p}}$$
Odds
- Example -
It is well established that there is a link between the use
of aspirin and decreased risk of heart attack. A
suggestion was made that there may also be a link with
reduction in cancer risk. A total of 39,876 women were
split into two groups: half took aspirin, half a placebo.
After 10 years the prevalence of cancer was assessed
in the two groups:
              Aspirin   Placebo
Cancer           1438      1427
No cancer       18496     18515
Total           19934     19942
Odds
- Example -
The estimated proportion of the aspirin group that did not get cancer
(and the complement, those that did get cancer) is:
$$\hat{p}_1 = \frac{18496}{19934} = 0.9279$$
$$1 - \hat{p}_1 = 1 - 0.9279 = 0.0721$$
The odds of not getting cancer while taking aspirin are:
$$\hat{O}_1 = \frac{\hat{p}_1}{1 - \hat{p}_1} = \frac{0.9279}{0.0721} = 12.87$$
So, the odds are ca. 12.87:1 of not getting cancer if taking aspirin.
Odds
- Example -
But what are the odds of not getting cancer if taking the placebo?
$$\hat{p}_2 = \frac{18515}{19942} = 0.9284$$
$$1 - \hat{p}_2 = 1 - 0.9284 = 0.0716$$
$$\hat{O}_2 = \frac{\hat{p}_2}{1 - \hat{p}_2} = \frac{0.9284}{0.0716} = 12.97$$
The difference between 12.87 and 12.97 is negligible,
so aspirin is not likely to influence cancer rate.
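The same odds can be computed in R. This is a minimal sketch using the counts quoted in the example; the function name odds and the vector names are illustrative assumptions, not part of any package.

```r
# Odds of "success" given a proportion p of successes (hypothetical helper).
odds <- function(p) p / (1 - p)

# Women who did not develop cancer, and group sizes, from the example.
no_cancer <- c(aspirin = 18496, placebo = 18515)
group_n   <- c(aspirin = 19934, placebo = 19942)

p_hat <- no_cancer / group_n   # estimated proportion remaining cancer-free
o_hat <- odds(p_hat)           # estimated odds of remaining cancer-free

print(round(p_hat, 4))
print(round(o_hat, 2))
```

Because R carries the unrounded proportions through the calculation, the odds can differ from the hand-calculated values in the second decimal place. The difference is due to rounding only.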
Odds Ratio
But, as statisticians, we are seldom convinced by just a
small difference (with such a large sample size, even a small
difference could still be significant). We can use the odds ratio (OR)
to compare the odds of success in one group with the odds of
success in the other:
$$OR = \frac{O_1}{O_2}$$
If the odds ratio is equal to one, then the odds of success
in the response variable are the same for both groups.
Odds Ratio
- Example -
$$\widehat{OR} = \frac{\hat{O}_1}{\hat{O}_2} = \frac{12.87}{12.97} = 0.992$$
The OR suggests that the odds of remaining cancer-free
while taking aspirin were about the same as while taking
the placebo. Because "success" here means not getting cancer,
a value slightly less than one means the aspirin group fared
marginally worse in this sample, not better.
We're still left with the question of whether aspirin has
a significant effect on cancer risk (even if
small). We can evaluate this using the SE and CI
around OR.
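A short R sketch of the odds ratio calculation follows. It assumes the four counts quoted in this example arranged as a 2 x 2 matrix; the object names are illustrative.

```r
# 2 x 2 table of counts from the aspirin example (filled column by column).
aspirin <- matrix(c(1438, 18496, 1427, 18515), nrow = 2,
                  dimnames = list(Outcome = c("Cancer", "No cancer"),
                                  Treatment = c("Aspirin", "Placebo")))

# Odds of remaining cancer-free within each treatment group.
odds_no_cancer <- aspirin["No cancer", ] / aspirin["Cancer", ]

# Odds ratio: aspirin group relative to placebo group.
odds_ratio <- unname(odds_no_cancer["Aspirin"] / odds_no_cancer["Placebo"])
print(round(odds_ratio, 3))
```

Working from the raw counts avoids intermediate rounding, so the result may differ from the hand calculation in the third decimal place.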
Odds Ratio
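Because the sampling distribution of the OR is highly skewed, we convert
the OR to its natural log form and then calculate the SE, from which we
can derive the CI. With a, b, c, and d denoting the four cell counts of
the 2 × 2 table:
$$SE[\ln \widehat{OR}] = \sqrt{\frac{1}{a} + \frac{1}{b} + \frac{1}{c} + \frac{1}{d}}$$
$$SE[\ln \widehat{OR}] = \sqrt{\frac{1}{1438} + \frac{1}{1427} + \frac{1}{18496} + \frac{1}{18515}}$$
$$SE[\ln \widehat{OR}] = 0.03878$$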
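The standard error is a one-line calculation in R. This sketch assumes the four cell counts of the aspirin table are stored in a named vector called counts (an illustrative name).

```r
# Four cell counts of the 2 x 2 aspirin table (a, b = cancer; c, d = no cancer).
counts <- c(a = 1438, b = 1427, c = 18496, d = 18515)

# SE of ln(OR): square root of the sum of the reciprocal cell counts.
se_ln_or <- sqrt(sum(1 / counts))
print(round(se_ln_or, 5))
```

The two small cells (1438 and 1427) contribute almost all of the variance, which is why the SE is driven by the number of cancer cases rather than by the total sample size.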
Odds Ratio
Now that we have an SE, we can calculate the 95% CI
around ln(OR) = ln(0.992) = -0.00803:
$$-0.00803 - 1.96(0.03878) < \ln(OR) < -0.00803 + 1.96(0.03878)$$
$$-0.084 < \ln(OR) < 0.068$$
$$e^{-0.084} < OR < e^{0.068}$$
$$0.92 < OR < 1.07$$
The CI is tightly bounded around 1.0, so the data provide good
evidence that aspirin has little or no effect on the probability of
developing cancer.
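The interval can be reproduced in R as sketched below. The block assumes the same four aspirin counts and uses qnorm(0.975) in place of the rounded multiplier 1.96; all object names are illustrative.

```r
# Four cell counts of the 2 x 2 aspirin table (a, b = cancer; c, d = no cancer).
counts <- c(a = 1438, b = 1427, c = 18496, d = 18515)

# ln(OR) for remaining cancer-free, aspirin relative to placebo.
ln_or <- unname(log((counts["c"] / counts["a"]) / (counts["d"] / counts["b"])))

# SE of ln(OR) and the normal multiplier for a 95% interval.
se_ln_or <- sqrt(sum(1 / counts))
z <- qnorm(0.975)

# Build the interval on the log scale, then back-transform with exp().
ci_ln_or <- ln_or + c(-1, 1) * z * se_ln_or
ci_or <- exp(ci_ln_or)
print(round(ci_or, 2))
```

Building the interval on the log scale and back-transforming is what makes the resulting CI asymmetric around the OR itself.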
χ² Contingency Test
The most commonly used frequency data analysis
method is the chi-square contingency test for
association.
You may also see this test referred to in the literature as
an R x C (row-by-column) association test. R can have
two or more categories and C can have two or more
categories.
This test is adaptable to a wide variety of comparisons
of categorical data (and can be expanded to 3+ dimensions,
i.e., log-linear analysis).
χ² Contingency Test
- Example -
Example 9.3 provides a biological example involving the
infection of fish with a parasite and their risk of predation by
birds as a function of their position in the water column.
The two variables of interest are infection status (uninfected,
lightly infected, and highly infected) and predation (eaten, not
eaten).
The corresponding hypotheses:
H0: Parasite infection and being eaten are independent.
HA: Parasite infection and being eaten are not independent.
χ² Contingency Test
- Example -
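If the two variables are independent (H0), the probability that a fish is
both uninfected and eaten is the product of the two marginal probabilities:
$$\widehat{\Pr}[\text{uninfected}] = 50/141 = 0.3546$$
$$\widehat{\Pr}[\text{eaten}] = 48/141 = 0.3404$$
$$\widehat{\Pr}[\text{uninfected and eaten}] = 0.3546 \times 0.3404 = 0.1207$$
$$\text{Expected}[\text{uninfected and eaten}] = 0.1207 \times 141 = 17.0$$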
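The expected frequency for every cell can be obtained at once from the row and column totals. The following is a minimal sketch using the fish counts from this example; the object names are illustrative.

```r
# Observed counts: predation (rows) by infection status (columns).
fish <- matrix(c(1, 49, 10, 35, 37, 9), nrow = 2,
               dimnames = list(Predation = c("Eaten", "Not Eaten"),
                               Infection = c("Uninfected", "Light", "Heavy")))

n <- sum(fish)

# Under independence: expected = (row total x column total) / grand total.
expected <- outer(rowSums(fish), colSums(fish)) / n
print(round(expected, 2))
```

The outer product of the margins is the same calculation as multiplying the two marginal probabilities and then by n, applied to all six cells in one step.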
χ² Statistic
Now that we have observed frequencies and expected
frequencies, we can generate a chi-square test using our
general formula:
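$$\chi^2 = \sum_{\text{column}=1}^{c} \sum_{\text{row}=1}^{r} \frac{(\text{Observed}_{\text{column},\text{row}} - \text{Expected}_{\text{column},\text{row}})^2}{\text{Expected}_{\text{column},\text{row}}}$$
$$\chi^2 = \frac{(1 - 17.0)^2}{17.0} + \frac{(49 - 33.0)^2}{33.0} + \cdots + \frac{(9 - 30.3)^2}{30.3} = 69.5$$
$$\chi^2_{2,\,0.05} = 5.99, \text{ therefore reject } H_0$$
NB: df = (r - 1)(c - 1) = (2 - 1)(3 - 1) = 2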
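The statistic, critical value, and P-value can be computed step by step in R. This is a sketch of the hand calculation using the fish counts, not a replacement for a built-in test; the object names are illustrative.

```r
# Observed counts: predation (rows) by infection status (columns).
fish <- matrix(c(1, 49, 10, 35, 37, 9), nrow = 2)

# Expected counts under independence.
expected <- outer(rowSums(fish), colSums(fish)) / sum(fish)

# Chi-square statistic: sum over all cells of (O - E)^2 / E.
chi_sq <- sum((fish - expected)^2 / expected)

# Degrees of freedom, 5% critical value, and P-value.
df <- (nrow(fish) - 1) * (ncol(fish) - 1)
critical <- qchisq(0.95, df)
p_value <- pchisq(chi_sq, df, lower.tail = FALSE)

print(c(chi_sq = chi_sq, df = df, critical = critical, p_value = p_value))
```

With unrounded expected frequencies the statistic comes out slightly larger than the 69.5 obtained by hand from rounded values; the conclusion is unchanged.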
Example
How would we solve this problem in R? Basically, a row x
column table is a matrix; so in keeping with the approach
of using vectors for data, we create the table using the
matrix function. By default, matrix fills the table column by
column (note how R cycles through the vector to fill each
column in turn):
> fish <- matrix(c(1,49,10,35,37,9), nrow=2)
> fish
     [,1] [,2] [,3]
[1,]    1   10   37
[2,]   49   35    9
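To make the fill order concrete, the sketch below builds the same table twice: once from a vector ordered by column (the default) and once from a vector ordered by row, using the byrow argument of matrix. The object names by_column and by_row are illustrative.

```r
# Default: the vector is read down each column in turn.
by_column <- matrix(c(1, 49, 10, 35, 37, 9), nrow = 2)

# With byrow = TRUE the vector is read across each row in turn,
# so the data must be supplied in row order instead.
by_row <- matrix(c(1, 10, 37, 49, 35, 9), nrow = 2, byrow = TRUE)

# Both calls produce the same 2 x 3 table.
print(identical(by_column, by_row))
```

Entering data by row is often easier when copying from a printed table, as long as byrow = TRUE is set to match.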
Example
While we have a perfectly workable matrix, let's prettify it and
add the appropriate variable names and levels:
> fish<-matrix(c(1,49,10,35,37,9), nrow=2,
dimnames=list("Predation"=c("Eaten", "Not Eaten"),
"Infection" = c("Uninfected", "Light", "Heavy")))
> fish
           Infection
Predation   Uninfected Light Heavy
  Eaten              1    10    37
  Not Eaten         49    35     9
Example
And the chi-square test...
> chisq.test(fish)

        Pearson's Chi-squared test

data:  fish
X-squared = 69.7557, df = 2, p-value = 7.124e-16
The object returned by chisq.test has a number of components that
we can take further advantage of:
> chisq.test(fish)$observed
           Infection
Predation   Uninfected Light Heavy
  Eaten              1    10    37
  Not Eaten         49    35     9
> chisq.test(fish)$expected
           Infection
Predation   Uninfected    Light    Heavy
  Eaten       17.02128 15.31915 15.65957
  Not Eaten   32.97872 29.68085 30.34043
> mosaicplot(t(fish),cex=1.25,color=TRUE)
The chi-square contingency test makes the same assumptions
as the goodness of fit test:
1. No more than 20% of the cells can have an expected frequency
less than 5, and
2. No cell can have an expected frequency less than one.
If either is violated, the options are the same: (a) combine
rows or columns [if the table is bigger than 2 × 2], (b) if the table
is 2 × 2, use Fisher's Exact Test, or (c) use a randomization procedure
(discussed at end of course).
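Both assumptions can be checked directly from the expected frequencies. The sketch below applies the two rules to the fish table; the variable names are illustrative, and the thresholds are simply the ones listed above.

```r
# Observed counts: predation (rows) by infection status (columns).
fish <- matrix(c(1, 49, 10, 35, 37, 9), nrow = 2)

# Expected frequencies as computed by chisq.test
# (suppressWarnings is only a precaution for tables with small expected counts).
expected <- suppressWarnings(chisq.test(fish))$expected

# Rule 1: no more than 20% of cells with expected frequency below 5.
prop_below_5 <- mean(expected < 5)

# Rule 2: no cell with expected frequency below 1.
any_below_1 <- any(expected < 1)

assumptions_met <- prop_below_5 <= 0.20 && !any_below_1
print(c(prop_below_5 = prop_below_5, any_below_1 = any_below_1,
        assumptions_met = assumptions_met))
```

Note that the rules apply to the expected frequencies, so the observed count of 1 in the fish table does not by itself violate them.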
Correction for Continuity
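When the contingency table is 2 × 2, most statisticians
recommend the use of a continuity correction factor. This
modification is known as the Yates Correction for
Continuity:
$$\chi^2 = \sum_{\text{column}=1}^{c} \sum_{\text{row}=1}^{r} \frac{\left(\left|\text{Observed}_{\text{column},\text{row}} - \text{Expected}_{\text{column},\text{row}}\right| - \frac{1}{2}\right)^2}{\text{Expected}_{\text{column},\text{row}}}$$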
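As an illustration, the corrected statistic can be computed by hand and compared with chisq.test, which applies the correction to 2 x 2 tables when correct = TRUE (its default). The sketch reuses the aspirin counts from the odds ratio example as an assumed 2 x 2 table; the object names are illustrative.

```r
# 2 x 2 table of counts from the aspirin example.
aspirin <- matrix(c(1438, 18496, 1427, 18515), nrow = 2,
                  dimnames = list(Outcome = c("Cancer", "No cancer"),
                                  Treatment = c("Aspirin", "Placebo")))

# Expected counts under independence.
expected <- outer(rowSums(aspirin), colSums(aspirin)) / sum(aspirin)

# Yates-corrected statistic: subtract 1/2 from each |O - E| before squaring.
chi_sq_yates <- sum((abs(aspirin - expected) - 0.5)^2 / expected)

# Built-in version for comparison.
r_result <- chisq.test(aspirin, correct = TRUE)

print(c(by_hand = chi_sq_yates, chisq.test = unname(r_result$statistic)))
```

The two values agree here because every |O - E| exceeds 1/2; when a difference is smaller than 1/2, R shrinks the correction to that difference, so the hand formula and chisq.test can then diverge.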
Fisher's Exact Test
Fisher's Exact Test is used specifically for 2 × 2 contingency
tests. The test is an improvement over the normal chi-square
test in cases where the expected cell frequencies are too
low to meet the regular assumptions. Thus, this test is used
for small data sets comparing two categorical variables.
Let's look at Example 9.4, which examines the feeding
habits of vampire bats. The main question is whether
cows in estrus have a greater chance of being
attacked by bats than cows not in estrus.
> bats <- matrix(c(15,7,6,322), nrow=2)
> bats
     [,1] [,2]
[1,]   15    6
[2,]    7  322
> fisher.test(bats)

        Fisher's Exact Test for Count Data

data:  bats
p-value < 2.2e-16
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
  29.94742 457.26860
sample estimates:
odds ratio
  108.3894
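To show where the P-value comes from, the sketch below rebuilds the two-sided test from the hypergeometric distribution: with the margins fixed, it sums the probabilities of every table that is no more likely than the one observed. The small tolerance factor and the object names are assumptions made for this illustration.

```r
# Vampire bat table from the example.
bats <- matrix(c(15, 7, 6, 322), nrow = 2)

# Fixed margins: the two row totals and the first column total.
row1 <- sum(bats[1, ])
row2 <- sum(bats[2, ])
col1 <- sum(bats[, 1])

# All values the top-left cell can take given those margins.
support <- max(0, col1 - row2):min(row1, col1)

# Hypergeometric probability of each possible table, and of the observed one.
probs <- dhyper(support, m = row1, n = row2, k = col1)
p_obs <- dhyper(bats[1, 1], m = row1, n = row2, k = col1)

# Two-sided P-value: total probability of tables no more likely than observed
# (the factor 1 + 1e-7 is an assumed tolerance for floating-point ties).
p_two_sided <- sum(probs[probs <= p_obs * (1 + 1e-7)])
print(p_two_sided)
```

Because every possible table is enumerated, no large-sample approximation is involved, which is why the test remains valid when expected frequencies are small.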
G-tests
The G-test is another contingency test seen frequently in the
literature. The G-test is very similar to the chi-square test but
can be applied across a wider range of circumstances. It uses the
natural logarithm (ln) in its calculation.
The G-test may not be as powerful as the chi-square test for
small sample sizes.
R code for the G-test is available, but it is not part of the
normal stats base package or related packages.
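Since the statistic is simple, it can also be computed by hand. The following sketch assumes the usual definition G = 2 Σ O ln(O/E), compared against a chi-square distribution with (r - 1)(c - 1) degrees of freedom, and applies it to the fish table; the object names are illustrative.

```r
# Observed counts: predation (rows) by infection status (columns).
fish <- matrix(c(1, 49, 10, 35, 37, 9), nrow = 2)

# Expected counts under independence.
expected <- outer(rowSums(fish), colSums(fish)) / sum(fish)

# G statistic: twice the sum of O * ln(O / E) over all cells.
# Assumption: no observed count is zero (a zero would need special handling).
g_stat <- 2 * sum(fish * log(fish / expected))

# Compare against the chi-square distribution with (r - 1)(c - 1) df.
df <- (nrow(fish) - 1) * (ncol(fish) - 1)
p_value <- pchisq(g_stat, df, lower.tail = FALSE)

print(c(G = g_stat, df = df, p_value = p_value))
```

For this table the G-test leads to the same conclusion as the chi-square contingency test: the null hypothesis of independence is rejected.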