Measures of Association

The document discusses measures of association in statistical analysis, particularly focusing on the Chi-Square test for categorical data. It explains how to use a 2x2 contingency table to assess the association between two categorical variables and introduces concepts such as Relative Risk (RR) and Odds Ratio (OR). The document also covers the assumptions of the Chi-Square test and provides examples to illustrate the calculations involved.

Uploaded by

Bewket Chalachew
Measures of Association

• While a test of hypothesis can be used to
determine whether an association exists
between two random variables, it cannot
provide a measure of the strength of the
association
• Several methods are available for
estimating the magnitude of the effect
given the categorical data in a 2× 2
contingency table
• For the most part, we have been applying
the techniques of hypothesis testing to
either continuous or ordinal data
• What about nominal data?
• Instead of using the normal approximation
to the binomial distribution, we could reach
the same conclusion using different
techniques
1. Categorical Data
1. Chi-Square Test
• The chi-square (χ²) distribution is a probability
distribution used to make statistical
inferences about categorical data
(proportions) in which the number of
categories is two or more.
• Widely used in the analysis of
contingency tables.
• Chi-Square test allows us to test for
association between two categorical
variables.
H0: There is no association between the variables.
HA: There is an association between the variables.
• Consequently, a significant p-value is evidence of
an association.
• The chi-square test compares observed to
expected counts (frequencies) under the
assumption of no association (that is, H0 is
true)
• With this method, data are arranged in the
form of a contingency table.
χ² Distribution
• Indexed by the degrees of freedom (n)
• Unlike z and t distributions, which are always
symmetric about 0, the χ² distribution only
takes on positive values and is always
skewed to the right.
• The skewness diminishes as n increases
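[Figure: χ² density with 10 degrees of freedom. The acceptance region (area 0.95) lies to the left of the critical value 18.307; the rejection region (area 0.05) lies to the right.]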
Contingency Table
• A table composed of rows cross-classified
by columns
• A 2x2 contingency table is a table
composed of two rows cross-classified by
two columns
• Appropriate to display data that can be
classified by two different variables, each
of which has only two possible outcomes
Example
• A study was conducted to look at the effects of
oral contraceptives (OC) on heart disease in
women 40 to 44 years of age. Among 5000
current OC users at baseline, 13 women
developed a myocardial infarction (MI) over a
3-year period, whereas among 10,000 non-OC
users, 7 developed an MI over a 3-year period.
– p1 = 13/5000 = 0.0026, p2 = 7/10,000 = 0.0007
– Z = 2.77 (with continuity correction), p-value = 0.006
– There is a highly significant association between MI
and OC use
Display the above data in the form of a 2x2
contingency table

                 MI status over 3 years
OC-use group     Yes     No        Total
OC users         13      4987      5000
Non-OC users     7       9993      10,000
Total            20      14,980    15,000

Is the proportion of MI the same in OC users and non-OC users?


What can be said about the relationship between MI status and OC use?
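The Z statistic quoted for the OC-MI data can be checked with a short sketch. It assumes the usual pooled two-proportion z-test with a continuity correction (the slide does not say which variant was used, but this one lands close to the quoted 2.77 and 0.006). The function names are illustrative, not taken from the slides.

```python
import math

def two_proportion_z(x1, n1, x2, n2, continuity=True):
    """z statistic for H0: p1 = p2, using the pooled proportion."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    diff = abs(p1 - p2)
    if continuity:
        # Continuity correction: shrink |p1 - p2| by 1/(2*n1) + 1/(2*n2)
        diff = max(diff - (1 / (2 * n1) + 1 / (2 * n2)), 0.0)
    return diff / se

def two_sided_p(z):
    """Two-sided p-value from the standard normal distribution."""
    return math.erfc(abs(z) / math.sqrt(2))

# OC users: 13 MIs out of 5000; non-OC users: 7 MIs out of 10,000
z_oc = two_proportion_z(13, 5000, 7, 10000)
p_oc = two_sided_p(z_oc)
print(f"z = {z_oc:.2f}, p-value = {p_oc:.3f}")
```

Setting `continuity=False` gives a somewhat larger statistic (about 3.0), so the correction matters for data this sparse.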
Definition
• The chi-square statistic measures the
discrepancy between k observed
frequencies O1, O2, …, Ok and the
corresponding expected frequencies E1,
E2, …, Ek.
• When the Ho of no association is true, the observed and expected
counts will be similar, their difference will be close to zero, resulting
in a SMALL chi square statistic value.
• When the HA of an association is true the observed counts will be
unlike the expected counts, their difference will be non zero and
their squared difference will be positive, resulting in a LARGE
POSITIVE chi square statistic value.
• The chi-square test uses the table of χ²
critical values for different degrees of freedom (df).
• Requires the data to be arranged in a
contingency table (a 2x2 table in the simplest case)
• If the value of χ² is zero, there is no discrepancy
between the observed and the expected
frequencies.
• The greater the discrepancy, the larger will
be the value of χ2.
• The calculated value of χ2 is compared
with the tabulated value for the given df.
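Degrees of Freedom
• For a table with R rows and C columns,
df = (R − 1)(C − 1), so a 2x2 table has 1 df
• Counts in the Chi-Square Test of a 2x2
table are represented as “a”, “b”, “c” and
“d”.
• The general calculation:
$\chi^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i}$
• is the same calculation as the following
shortcut formula:
$\chi^2 = \frac{N(ad - bc)^2}{(a+b)(c+d)(a+c)(b+d)}$, where N = a + b + c + d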
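The equivalence of the two calculations can be checked numerically. The sketch below assumes the general calculation is the usual sum of (O − E)²/E over the four cells and the shortcut is N(ad − bc)²/[(a+b)(c+d)(a+c)(b+d)], both without a continuity correction; the function names are illustrative.

```python
def chi2_general(a, b, c, d):
    """Sum of (O - E)^2 / E over the four cells of a 2x2 table."""
    n = a + b + c + d
    rows, cols = (a + b, c + d), (a + c, b + d)
    observed = ((a, b), (c, d))
    total = 0.0
    for i in range(2):
        for j in range(2):
            expected = rows[i] * cols[j] / n  # row total * column total / N
            total += (observed[i][j] - expected) ** 2 / expected
    return total

def chi2_shortcut(a, b, c, d):
    """Shortcut form: N(ad - bc)^2 / product of the four marginal totals."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

# OC-MI counts from the earlier example: a=13, b=4987, c=7, d=9993
x2_general = chi2_general(13, 4987, 7, 9993)
x2_shortcut = chi2_shortcut(13, 4987, 7, 9993)
print(round(x2_general, 3), round(x2_shortcut, 3))
```

Both functions return the same value up to floating-point rounding, which is the point of the shortcut.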
Expected Value
• Is the product of the row total multiplied by
the column total, divided by the grand total:
$E = \frac{\text{row total} \times \text{column total}}{\text{grand total}}$
• The expected numbers must be computed for
each cell.
Example
• Compute the expected table for the OC-MI
data in the previous example

Observed counts:
                 MI status over 3 years
OC-use group     Yes     No        Total
OC users         13      4987      5000
Non-OC users     7       9993      10,000
Total            20      14,980    15,000

Expected counts:
                 MI status over 3 years
OC-use group     Yes     No        Total
OC users         6.7     4993.3    5000
Non-OC users     13.3    9986.7    10,000
Total            20      14,980    15,000

• X2 ≈ 8, 0.001 <p-value < 0.005
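A minimal sketch of the expected-count calculation for the OC-MI table follows. It also computes the statistic with and without Yates' correction, since the slide does not say which version lies behind the quoted value of about 8. The 1-df p-value uses the identity P(χ² > x) = erfc(√(x/2)); all names are illustrative.

```python
import math

def expected_counts(table):
    """Expected cell counts: row total * column total / grand total."""
    rows = [sum(r) for r in table]
    cols = [sum(c) for c in zip(*table)]
    n = sum(rows)
    return [[r * c / n for c in cols] for r in rows]

def chi2_stat(table, yates=False):
    """Chi-square statistic, optionally with Yates' continuity correction."""
    total = 0.0
    for obs_row, exp_row in zip(table, expected_counts(table)):
        for o, e in zip(obs_row, exp_row):
            dev = abs(o - e)
            if yates:
                dev = max(dev - 0.5, 0.0)  # subtract 0.5 from each |O - E|
            total += dev ** 2 / e
    return total

def p_value_1df(x2):
    """Upper-tail area of the chi-square distribution with 1 df."""
    return math.erfc(math.sqrt(x2 / 2))

oc_table = [[13, 4987], [7, 9993]]
exp_oc = expected_counts(oc_table)
x2_plain = chi2_stat(oc_table)
x2_yates = chi2_stat(oc_table, yates=True)
print([[round(e, 1) for e in row] for row in exp_oc])
print(round(x2_plain, 2), round(x2_yates, 2))
print(round(p_value_1df(x2_plain), 4), round(p_value_1df(x2_yates), 4))
```

The uncorrected and corrected statistics bracket the slide's rounded value, and both give a p-value well below 0.01.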


Example

X2 = 8.30, P-value = 0.004


[Slides: Example: Observed Numbers, Response by Treatment; Expected Numbers; Shortcut Formula for 2x2 Tables]
Example
• A study was conducted to investigate the
possible cause of a gastroenteritis outbreak
following a lunch served in a high school
cafeteria. Among the 225 students who
ate the sandwiches, 109 became ill, whereas
among the 38 students who did not eat the
sandwiches, 4 became ill.
• Present the data in a 2x2 contingency table
• With this method, data are arranged in the
form of a contingency table

                 Ill     Not ill   Total
Ate sandwich     109     116       225
Did not eat      4       34        38
Total            113     150       263

• This is a 2 × 2 table for two dichotomous
random variables
• We again wish to know whether the
proportions of students who became ill in
each of the groups are identical
• To carry out the test, we first calculate the
expected counts for the table assuming
that:
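H0: p1 = p2
HA: p1 ≠ p2
p1 = 109/225 = 48.44%, p2 = 4/38 = 10.53%
Z test = 4.36
• Expected counts are represented as follows:

                 Ill      Not ill   Total
Ate sandwich     96.7     128.3     225
Did not eat      16.3     21.7      38
Total            113      150       263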
• The chi-square test compares the
observed frequencies in each category
with the expected frequencies given that
H0 is true
• Are the deviations between Observed and
Expected too large to be attributed to
chance?
• To determine this, deviations from all 4
cells must be combined
• Calculate the sum:
$\chi^2 = \sum_{i=1}^{4} \frac{(O_i - E_i)^2}{E_i}$
• H0 is rejected at level α if χ² is too
large, in particular, if $\chi^2 > \chi^2_{1,\alpha}$
• If α = 0.05, we would reject H0 for χ²
greater than $\chi^2_{1,\alpha}$ = 3.84
• For these data χ² ≈ 19 > 3.84; therefore, we reject H0
• The p-value is given by the area under the
χ² distribution to the right of the observed χ²
• P-value < 0.001
Relationship between χ² and Z test
χ² = Z²
19 ≈ (4.36)² = 19.01
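The cafeteria example can be reproduced end to end with a few lines. This is a sketch under the assumption that neither the z-test nor the chi-square statistic uses a continuity correction; the "not ill" counts are derived from the group sizes given in the example.

```python
import math

# Rows: ate sandwich / did not eat; columns: ill / not ill
a, b = 109, 225 - 109
c, d = 4, 38 - 4
n = a + b + c + d

# Expected counts under H0: row total * column total / grand total
row_totals = (a + b, c + d)
col_totals = (a + c, b + d)
expected = [[r * k / n for k in col_totals] for r in row_totals]
observed = [[a, b], [c, d]]

# Chi-square statistic: combine the deviations from all four cells
chi2 = sum((o - e) ** 2 / e
           for o_row, e_row in zip(observed, expected)
           for o, e in zip(o_row, e_row))

# Pooled two-proportion z statistic for the same table
p1, p2 = a / (a + b), c / (c + d)
pooled = (a + c) / n
z = (p1 - p2) / math.sqrt(pooled * (1 - pooled) * (1 / (a + b) + 1 / (c + d)))

# Upper-tail p-value for chi-square with 1 df
p_value = math.erfc(math.sqrt(chi2 / 2))

print(f"chi2 = {chi2:.2f}, z = {z:.2f}, z^2 = {z * z:.2f}, p = {p_value:.1e}")
```

Without intermediate rounding the two statistics agree exactly (χ² = Z²), and the statistic is far above the 3.84 cutoff.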
Assumptions of the χ² test
• No expected value in the table is < 1, and
no more than 20% of the expected
frequencies are < 5.
• If this does not hold:
– row or column variable categories can
sometimes be combined to make the
expected frequencies larger, or
– use Yates' correction
• For a 2x2 table, when the total number of
observations is less than 20, or when it is
greater than 20 and the smallest of the
four expected frequencies is < 5,
use Fisher’s exact test.
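The 2x2 rule of thumb above can be written as a small helper. This is only a sketch of the slide's rule, not a standard library routine; the function name and return strings are made up for illustration, and a total of exactly 20 is handled by the expected-count check.

```python
def suggest_test_2x2(a, b, c, d):
    """Apply the rule of thumb: small N or a small expected count -> exact test."""
    n = a + b + c + d
    # Expected count for each cell: row total * column total / grand total
    expected = [r * k / n for r in (a + b, c + d) for k in (a + c, b + d)]
    if n < 20 or min(expected) < 5:
        return "Fisher's exact test"
    return "chi-square test"

print(suggest_test_2x2(13, 4987, 7, 9993))  # OC-MI data, smallest expected count about 6.7
print(suggest_test_2x2(3, 1, 1, 3))         # hypothetical table with only 8 observations
```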
Fisher’s Exact Test

• Given the fixed margins, the probability of
obtaining the specific table which was
observed is
$P = \frac{(a+b)!\,(c+d)!\,(a+c)!\,(b+d)!}{N!\,a!\,b!\,c!\,d!}$
• Both the Chi-square test and the exact test
can be generalized to allow the
comparison of three or more proportions
• The data are arranged in the form of an
R × C contingency table
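For a 2x2 table, the probability of a specific table with fixed margins is a hypergeometric probability, which can be sketched with binomial coefficients. The counts below are hypothetical (they do not come from the slides), and the two-sided p-value uses one common convention: sum the probabilities of all tables that are no more likely than the observed one.

```python
from math import comb

def table_probability(a, b, c, d):
    """Probability of this 2x2 table given that all four margins are fixed."""
    n = a + b + c + d
    return comb(a + b, a) * comb(c + d, c) / comb(n, a + c)

def fisher_two_sided(a, b, c, d):
    """Two-sided exact p-value: sum over tables as or less likely than observed."""
    r1, r2, c1 = a + b, c + d, a + c
    p_obs = table_probability(a, b, c, d)
    total = 0.0
    # x is the top-left cell; the margins determine the other three cells
    for x in range(max(0, c1 - r2), min(r1, c1) + 1):
        p = table_probability(x, r1 - x, c1 - x, r2 - c1 + x)
        if p <= p_obs * (1 + 1e-9):  # small tolerance for floating-point ties
            total += p
    return total

# Hypothetical table: a=3, b=1, c=1, d=3
print(table_probability(3, 1, 1, 3))
print(fisher_two_sided(3, 1, 1, 3))
```

The binomial-coefficient form is algebraically the same as the factorial expression for the table probability, just easier to evaluate without overflow.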
2. Relative Risk (RR)
• Or Risk Ratio
• Defined as the ratio of the incidence of
disease in the exposed group divided by
the corresponding incidence of disease in
the non-exposed group
• A point estimate of the risk ratio
(RR=p1/p2) is given by:
            Disease
Exposure    Yes     No      Total
Yes         a       b       a+b
No          c       d       c+d
Total       a+c     b+d     N

RR = [a/(a+b)] / [c/(c+d)]
                      Breast cancer
Age at first birth    Yes     No      Total
≥25 years             31      1597    1628
<25 years             65      4475    4540
Total                 96      6072    6168

RR = [a/(a+b)] / [c/(c+d)]
a/(a+b) = 31/1628 = 0.019
c/(c+d) = 65/4540 = 0.014
RR = 0.019/0.014 ≈ 1.36 (about 1.33 if the proportions are not rounded first)

• Women who first give birth at an older age
are about 36% more likely to develop breast
cancer
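• To obtain a CI for the RR, first find a CI for ln(RR):
$\ln(\widehat{RR}) \pm Z \sqrt{\frac{b}{a\,n_1} + \frac{d}{c\,n_2}}$
• Where n1 = a+b, n2 = c+d,
ln = natural logarithm, and Z = 1.96 for a 95% CI
• Exponentiate each side to get a CI for RR
• For the breast cancer data, a 95% CI for
ln(RR) is
ln(1.36) ± 1.96(0.216)
or
(−0.116, 0.731)
• Consequently, a 95% CI for RR itself is
(e^−0.116, e^0.731)
or
(0.89, 2.08)
• This interval contains the value 1, so the
association is not statistically significant at the 5% level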
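The relative risk and its interval for the breast cancer table can be sketched as follows. The standard error used here, √(b/(a·n1) + d/(c·n2)), is the usual large-sample formula for ln(RR) and is an assumption about what the slide intended; the function name is illustrative.

```python
import math

def relative_risk_ci(a, b, c, d, z=1.96):
    """Relative risk with a CI built on the log scale."""
    n1, n2 = a + b, c + d
    rr = (a / n1) / (c / n2)
    se_log = math.sqrt(b / (a * n1) + d / (c * n2))  # SE of ln(RR)
    lower = math.exp(math.log(rr) - z * se_log)
    upper = math.exp(math.log(rr) + z * se_log)
    return rr, se_log, (lower, upper)

# First birth at >=25 years: 31 cases, 1597 non-cases; <25 years: 65 and 4475
rr, se_log_rr, (rr_lo, rr_hi) = relative_risk_ci(31, 1597, 65, 4475)
print(f"RR = {rr:.2f}, SE(ln RR) = {se_log_rr:.3f}, 95% CI = ({rr_lo:.2f}, {rr_hi:.2f})")
```

Because the sketch does not round the two proportions before dividing, it gives RR ≈ 1.33 and an interval of roughly (0.87, 2.03), slightly lower than the figures obtained from the rounded ratio 1.36. Either way the interval contains 1.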
3. The Odds Ratio
• The odds ratio (OR) is the odds in favor of
disease for the exposed group divided by
the odds in favor of disease for the
unexposed group
• The odds in favor of disease = p/(1-p),
where p = probability of a disease
• Odds = Pr (event occurs) / Pr (event does
not occur) = p/(1-p)
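• The odds ratio is defined as:
$OR = \frac{p_1/(1-p_1)}{p_2/(1-p_2)} = \frac{p_1(1-p_2)}{p_2(1-p_1)}$
• It is estimated by
$\widehat{OR} = \frac{a/b}{c/d} = \frac{ad}{bc}$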
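The definitions of odds and odds ratio translate directly into code. The sketch below assumes the usual cross-product estimate ad/(bc) for a table laid out with cells a, b, c, d, and reuses the breast cancer counts from the relative risk section purely as an illustration (the slides do not compute an OR for those data).

```python
def odds(p):
    """Odds in favor of an event with probability p."""
    return p / (1 - p)

def odds_ratio_from_probs(p1, p2):
    """Odds ratio: odds in the exposed group divided by odds in the unexposed group."""
    return odds(p1) / odds(p2)

def odds_ratio_from_counts(a, b, c, d):
    """Cross-product estimate (a/b) / (c/d) = ad / (bc)."""
    return (a * d) / (b * c)

# Illustration with the breast cancer table: a=31, b=1597, c=65, d=4475
or_counts = odds_ratio_from_counts(31, 1597, 65, 4475)
or_probs = odds_ratio_from_probs(31 / 1628, 65 / 4540)
print(round(or_counts, 3), round(or_probs, 3))
```

The two routes give the same number. For a rare disease such as this one, the odds ratio (about 1.34) is close to the relative risk.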
Example:
• In a study of the risk factors for invasive
cervical cancer, the following data were
collected (Case-Control):
• The odds ratio is estimated by:

• Women with cancer have odds of
smoking that are 1.52 times the odds of
those without cancer
• A CI can be constructed for OR
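• To find a CI for the underlying OR, we first
find a CI for ln(OR) = (c1, c2), where
$c_1, c_2 = \ln(\widehat{OR}) \mp Z \sqrt{\frac{1}{a} + \frac{1}{b} + \frac{1}{c} + \frac{1}{d}}$
• Exponentiate the upper and lower confidence
limits for the natural log of the OR:
$\left( e^{\ln(\widehat{OR}) - Z\sqrt{\frac{1}{a}+\frac{1}{b}+\frac{1}{c}+\frac{1}{d}}},\ e^{\ln(\widehat{OR}) + Z\sqrt{\frac{1}{a}+\frac{1}{b}+\frac{1}{c}+\frac{1}{d}}} \right)$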
• For the cervical cancer data, the estimated
standard error of ln(OR) is 0.166
• Therefore, a 95% CI for ln(OR) is
ln(1.52) ± 1.96(0.166)
or
(0.093, 0.744)
• A 95% CI for the OR itself is
(e^0.093, e^0.744)
or
(1.10, 2.10)
• This interval does not contain the value 1
• We conclude that the odds of developing
cervical cancer are significantly higher for
smokers than for nonsmokers
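The cervical cancer interval can be rebuilt from the two summary numbers quoted in the example, OR = 1.52 and a standard error of 0.166 on the log scale. The helper below is a sketch with an illustrative name; it works on the log scale and then exponentiates both limits.

```python
import math

def ci_from_log_scale(estimate, se_log, z=1.96):
    """CI for a ratio measure: build it for ln(estimate), then exponentiate."""
    center = math.log(estimate)
    lower, upper = center - z * se_log, center + z * se_log
    return (lower, upper), (math.exp(lower), math.exp(upper))

(log_lo, log_hi), (or_lo, or_hi) = ci_from_log_scale(1.52, 0.166)
print(f"ln(OR): ({log_lo:.3f}, {log_hi:.3f})  OR: ({or_lo:.2f}, {or_hi:.2f})")
```

This gives roughly (0.093, 0.744) on the log scale and (1.10, 2.10) for the OR itself. The lower limit stays above 1, which is what supports the conclusion of a significant association.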
Example: Odds of Death
Related to Vit A use (Case-Control Study)
• What is the estimated OR?

• Estimated OR = (46/61)/(74/59) = 0.60
• 95% CI = (0.36, 1.00)
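The same calculation can be run from the four cell counts that appear in the expression (46/61)/(74/59). Which count belongs to which row and column is an assumption read off that expression, since the table itself is not shown; the standard error √(1/a + 1/b + 1/c + 1/d) does not depend on the arrangement.

```python
import math

def odds_ratio_ci(a, b, c, d, z=1.96):
    """Odds ratio ad/(bc) with a CI built on the log scale."""
    or_hat = (a * d) / (b * c)
    se_log = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)  # SE of ln(OR)
    lower = math.exp(math.log(or_hat) - z * se_log)
    upper = math.exp(math.log(or_hat) + z * se_log)
    return or_hat, (lower, upper)

# Counts taken from the slide's expression (46/61)/(74/59)
vit_or, (vit_lo, vit_hi) = odds_ratio_ci(46, 61, 74, 59)
print(f"OR = {vit_or:.2f}, 95% CI = ({vit_lo:.2f}, {vit_hi:.2f})")
```

The interval it produces is roughly (0.36, 1.00), so the upper limit sits almost exactly at 1 and the evidence for a protective effect is borderline.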
