
Stat 7

The document outlines the analysis of associations between categorical variables in a sociology course, focusing on hypothesis testing and the use of contingency tables. It explains the concept of association, statistical independence, and the Chi-squared test for assessing relationships between categorical data. Limitations of the Chi-squared test are also discussed, emphasizing the need for further analysis to understand the strength and direction of associations.

Uploaded by

beliz tuzel

WEEK 8.

ANALYZING THE ASSOCIATION BETWEEN CATEGORICAL VARIABLES

STATISTICAL METHODS IN SOCIOLOGY II

SOC 242
Spring 2024-2025

Tuesday, 14:40-16:30 G204
Thursday, 12:40-14:30 G204

FACULTY OF ARTS AND SCIENCES
DEPARTMENT OF SOCIOLOGY

SESSION PLAN
o Independence and Dependence (Association)
o Testing Categorical Variables for Independence

Statistical Methods in Sociology II, Week 10 2


WHAT IS “ASSOCIATION”?

“two variables have an association if a particular
value for one variable is more likely to occur with
certain values of the other variable—for example, if
being very happy is more likely to happen if a
person has an above average income”.



THE ASSOCIATION BETWEEN
CATEGORICAL VARIABLES

▪ Suppose both response and explanatory variables are
categorical, with any number of categories for
each.

▪ There is an association between the variables if the
population conditional distribution for the
response variable differs among the categories of the
explanatory variable.



LOGIC OF HYPOTHESIS TESTING AND ‘ASSOCIATION’

▪ We discussed how to test a hypothesis about differences in
means (suitable for continuous variables) and differences in
proportions (suitable for categorical variables).

▪ One way of rephrasing our hypothesis test is to ask: ‘is
there an association between categorical variable X and
continuous variable Y?’
▪ E.g. association between gender and height



LOGIC OF HYPOTHESIS TESTING AND ‘ASSOCIATION’
▪ Today: how to test hypotheses about the relationship
between two categorical variables (nominal/ordinal).

▪ The logic of the hypothesis test is exactly the same.

▪ The main difference is that instead of using the sampling distribution
of the mean to create z-scores (or t-scores), we use the sampling
distribution of another statistic called Chi-squared,
appropriate for categorical data.



CONTINGENCY TABLES
▪ ‘Contingent on’ means ‘depends on’.

▪ We display categorical data for analysis in contingency
tables.

▪ Definition: A contingency table displays the number of
observations for each combination of outcomes over the
categories of each variable.

▪ We’ll only look at two variables at a time.

Let’s look at an example…



EXAMPLE
To see how subjective health status depends on gender, convert
to % within columns (within independent variable).

                         Gender
Subjective health     Male    Female     Total
Very good              170       205       375
Good                   730       643     1,373
Fair                   264       306       570
Poor                    34        42        76
Very poor                5         7        12
Total                1,203     1,203     2,406
Data source: WVS Turkey, 2018

Row marginals: the totals in the Total column.
Column marginals: the totals in the Total row.

Response: subjective health status
Explanatory: gender



TO SEE HOW SUBJECTIVE HEALTH DEPENDS ON GENDER,
CONVERT TO % WITHIN COLUMNS!

▪ e.g. men who report very good health = (170/1,203) x 100 = 14.1%

▪ The two columns form the conditional distributions of subjective
health status on gender:
14.1% of men report very good health
17.0% of women report very good health

                               Gender
Subjective health status    Male    Female    Total
Very good                  14.1%     17.0%    15.6%
Good                       60.7%     53.4%    57.1%
Fair                       21.9%     25.4%    23.7%
Poor                        2.8%      3.5%     3.2%
Very poor                   0.4%      0.6%     0.5%
Total                       100%      100%     100%
Data source: WVS Turkey, 2018
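
The conversion to column percentages can be sketched in a few lines of Python. This is only an illustration using the counts from the example table; the dictionary layout and the function name `column_percentages` are assumptions made for this sketch, not part of the course material.

```python
# Observed counts from the WVS Turkey 2018 example
# (rows: subjective health status, columns: gender).
observed = {
    "Very good": {"Male": 170, "Female": 205},
    "Good":      {"Male": 730, "Female": 643},
    "Fair":      {"Male": 264, "Female": 306},
    "Poor":      {"Male": 34,  "Female": 42},
    "Very poor": {"Male": 5,   "Female": 7},
}


def column_percentages(table):
    """Return each cell as a percentage of its column (explanatory) total."""
    columns = list(next(iter(table.values())))
    col_totals = {c: sum(row[c] for row in table.values()) for c in columns}
    return {
        label: {c: 100 * row[c] / col_totals[c] for c in columns}
        for label, row in table.items()
    }


percentages = column_percentages(observed)
for label, row in percentages.items():
    print(f"{label:<10} {row['Male']:5.1f}% {row['Female']:5.1f}%")
```

Each printed column is one sample conditional distribution of health status given gender, so each column adds up to 100%.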



GUIDELINES FOR CONTINGENCY TABLES
▪ Show sample conditional distributions: percentages for the
response variable within the categories of the explanatory
variable.
(Find these by dividing the cell counts by the explanatory category total and
multiplying by 100. Percentages across the response categories will add to 100.)
▪ Clearly define variables and categories.
▪ If you display percentages but not the cell counts, include the
explanatory total sample sizes, so the reader can (if desired)
recover all the cell count data.
▪ Use rows for the response variable and columns for the explanatory
variable.



STATISTICAL INDEPENDENCE
▪ Whether these variables are associated depends on whether the
conditional distribution of subjective health status differs
between men and women.

▪ We use the concept of ‘statistical independence’.

▪ Two categorical variables are statistically independent if the
population conditional distributions on one of them are
identical at each category of the other.
▪ Remember, the distribution of a random variable will differ across
samples, even where the population is invariant.
▪ The task is to decide whether two variables are independent or not in the
population, not the sample. In other words, could our result have
occurred due to chance alone?



PERFECT DEPENDENCE
                       Gender
Subjective health    Male    Female
Good                  100         0
Poor                    0       100
Total                 100       100

▪ Gender perfectly predicts the subjective health status.
▪ Conditional distributions are different.
▪ There is perfect dependence.



PERFECT INDEPENDENCE
                       Gender
Subjective health    Male    Female
Good                   50        50
Poor                   50        50
Total                 100       100

▪ Gender is no help at all in predicting subjective
health status.
▪ The conditional distributions are the same.
▪ There is perfect independence.
▪ In reality, there is never perfect independence...
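
The two toy tables can be compared in code by checking whether the conditional distributions are identical across the columns. The following is a minimal sketch; the function names are hypothetical, and it only describes what a given table shows, it is not a significance test.

```python
import math


def conditional_distributions(table):
    """table: list of rows of counts. Returns one list of proportions per column."""
    n_cols = len(table[0])
    col_totals = [sum(row[j] for row in table) for j in range(n_cols)]
    return [[row[j] / col_totals[j] for row in table] for j in range(n_cols)]


def has_identical_conditionals(table):
    """True if every column has the same conditional distribution as the first."""
    dists = conditional_distributions(table)
    return all(
        math.isclose(p, q, abs_tol=1e-12)
        for dist in dists[1:]
        for p, q in zip(dists[0], dist)
    )


perfect_dependence = [[100, 0], [0, 100]]    # rows: Good, Poor; columns: Male, Female
perfect_independence = [[50, 50], [50, 50]]

print(has_identical_conditionals(perfect_dependence))    # False
print(has_identical_conditionals(perfect_independence))  # True
```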
TESTING FOR STATISTICAL INDEPENDENCE
We want to know if the population conditional distributions are
identical.

We don’t expect the sample conditional distributions to be identical.
Why?

▪ Answer: Sampling variation

▪ Question: is it plausible that the observed difference in sample
conditional distributions would be this great if the population
conditional distributions are identical?
▪ We use a statistical test, with similar logic as for comparing means:
▪ H0: the variables are statistically independent
▪ H1: the variables are statistically dependent



EXPECTED CELL FREQUENCIES

▪ We test for a statistically significant
association using the same logic as for our t-test: we
compare what we get with what we would get if the
null hypothesis were true. This means comparing
observed with expected cell frequencies.

▪ Crucially, we will get a p-value for our significance
test, which is called a Chi-squared test.



EXPECTED CELL FREQUENCIES
                               Gender
Subjective health status    Male    Female    Total
Very good                    188       188      375
Good                         687       687    1,373
Fair                         285       285      570
Poor                          38        38       76
Very poor                      6         6       12
Total                      1,203     1,203    2,406
Data source: WVS Turkey, 2018
(Expected counts are shown rounded to whole numbers.)

fe = (row total x column total) / total sample size

e.g.
proportion of ‘good’ = 1,373/2,406 = 0.5707 (about 0.57)
number of men = 1,203
1,203 x 0.5707 = 686.5, shown rounded as 687
expected frequency of men reporting ‘good’ health,
assuming H0 is true = 687
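
The expected-frequency formula can be applied to every cell at once. The sketch below assumes the observed counts from the example table, stored as a list of rows; the function name `expected_frequencies` is chosen here for illustration. It keeps the unrounded values (for instance 686.5 rather than 687).

```python
# Observed counts: rows are Very good, Good, Fair, Poor, Very poor;
# columns are Male, Female (WVS Turkey 2018 example).
observed = [
    [170, 205],
    [730, 643],
    [264, 306],
    [34, 42],
    [5, 7],
]


def expected_frequencies(table):
    """Expected count per cell under H0: row total x column total / sample size."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[rt * ct / n for ct in col_totals] for rt in row_totals]


expected = expected_frequencies(observed)
for row in expected:
    print([round(value, 1) for value in row])
```

Because men and women each make up exactly half of this sample, the expected counts are the same in both columns.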



CHI-SQUARED (Χ2) (KARL PEARSON, 1900)

Expected cell frequencies (rounded):
                               Gender
Subjective health status    Male    Female    Total
Very good                    188       188      375
Good                         687       687    1,373
Fair                         285       285      570
Poor                          38        38       76
Very poor                      6         6       12
Total                      1,203     1,203    2,406
Data source: WVS Turkey, 2018

$$\chi^2 = \sum \frac{(f_o - f_e)^2}{f_e}$$

Test statistic = chi-squared (χ2)

Definition: For each cell, square the deviation of the observed from the
expected frequency and divide by that cell's expected frequency; then sum
over all cells.
Quantifies how much difference there is between what we would
expect if H0 is true and what we actually see.
H0 is true: fo and fe are close, giving a small value of χ2

H0 is false: some fo and fe are far apart, giving a large value of χ2



CHI-SQUARED (Χ2)

$$\chi^2 = \sum \frac{(f_o - f_e)^2}{f_e}$$

O = observed frequency (fo), E = expected frequency (fe, unrounded).
First five rows: men; last five rows: women.

  O        E      O-E     (O-E)2    (O-E)2/E
170    187.5    -17.5     306.25      1.6333
730    686.5     43.5    1892.25      2.7564
264    285      -21       441         1.5474
 34     38       -4        16         0.4211
  5      6       -1         1         0.1667
205    187.5     17.5     306.25      1.6333
643    686.5    -43.5    1892.25      2.7564
306    285       21       441         1.5474
 42     38        4        16         0.4211
  7      6        1         1         0.1667
Total:                               13.0496
χ2 = 13.0496
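
The cell-by-cell calculation can be checked with a short script. This is a sketch that assumes the observed counts from the example and recomputes the unrounded expected counts itself; the function name `chi_squared` is illustrative.

```python
# Observed counts: rows are Very good, Good, Fair, Poor, Very poor;
# columns are Male, Female (WVS Turkey 2018 example).
observed = [
    [170, 205],
    [730, 643],
    [264, 306],
    [34, 42],
    [5, 7],
]


def chi_squared(table):
    """Pearson chi-squared: sum over cells of (observed - expected)^2 / expected."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    total = 0.0
    for i, row in enumerate(table):
        for j, f_o in enumerate(row):
            f_e = row_totals[i] * col_totals[j] / n  # expected count under H0
            total += (f_o - f_e) ** 2 / f_e
    return total


chi2 = chi_squared(observed)
print(round(chi2, 4))
```

With unrounded expected counts, the result agrees with the hand calculation of about 13.05.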



DISTRIBUTION OF Χ2 AND DEGREES OF FREEDOM
If we took repeated samples, the sampling distribution of χ2 is not
normal but follows its own χ2 distribution.

What happens to χ2 when there are more cells in a table?

▪ It tends to get bigger, so we need to take the number of cells into
account to get the critical value for the test statistic.

The χ2 distribution is based on the degrees of freedom in the contingency
table.
▪ Table with r rows and c columns: df = (r-1)(c-1)
▪ e.g. the 5 x 2 health-by-gender table has df = (5-1)(2-1) = 4
▪ df refers to the number of cells in a contingency table that can
vary, given the marginals
▪ Statistical software (R, Excel, IBM SPSS, Stata, etc.) works this out
for you, but you should know how to work it out for yourself too.
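
One way to see why df counts the cells that "can vary" is to fill in (r-1)(c-1) cells and recover the rest from the marginals. The sketch below does this for the health-by-gender example; both function names are hypothetical helpers written for this illustration.

```python
def degrees_of_freedom(n_rows, n_cols):
    """df for an r x c contingency table."""
    return (n_rows - 1) * (n_cols - 1)


def complete_table(free_cells, row_totals, col_totals):
    """Rebuild a full table from (r-1) x (c-1) free cells plus the marginals."""
    # Last column of each of the first r-1 rows follows from the row total.
    rows = [cells + [rt - sum(cells)] for cells, rt in zip(free_cells, row_totals)]
    # The last row follows from the column totals.
    last_row = [ct - sum(row[j] for row in rows) for j, ct in enumerate(col_totals)]
    return rows + [last_row]


df = degrees_of_freedom(5, 2)

# Only the first four male counts are "free"; everything else is determined.
free_cells = [[170], [730], [264], [34]]
row_totals = [375, 1373, 570, 76, 12]
col_totals = [1203, 1203]
table = complete_table(free_cells, row_totals, col_totals)
print(df, table)
```

Four free cells are enough to reconstruct all ten, which matches df = 4 for this table.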



PROPERTIES OF CHI-SQUARE DISTRIBUTION

• No negative values
• Mean = df
• The standard deviation increases as the df increase, so the
chi-square curve spreads out more as the df increase
• As the df becomes very large, the shape becomes more like the
normal distribution
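
These properties can be explored with a small simulation. The sketch relies on a standard fact not stated on the slide: a chi-square variable with df degrees of freedom has the same distribution as the sum of df squared independent standard normal variables, and its standard deviation is the square root of 2 x df. The function name, seed, and number of draws are arbitrary choices for illustration.

```python
import random
import statistics


def simulate_chi_square(df, n_draws, seed=0):
    """Draw chi-square values as sums of df squared standard normal draws."""
    rng = random.Random(seed)
    return [sum(rng.gauss(0, 1) ** 2 for _ in range(df)) for _ in range(n_draws)]


for df in (1, 4, 10):
    draws = simulate_chi_square(df, 20000)
    # Compare the simulated mean and spread with df and sqrt(2 * df).
    print(
        df,
        round(statistics.mean(draws), 2),
        round(statistics.pstdev(draws), 2),
        round((2 * df) ** 0.5, 2),
    )
```

The simulated means should sit close to df, and the spread should grow with df, in line with the listed properties.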



PROPERTIES OF CHI-SQUARE DISTRIBUTION

The probability that we would get a value of the χ2 statistic this
big or bigger (χ2 = 13.0496, df = 4) if gender and subjective health
status are independent in the population is about .011 (that is, less than
0.05).
There is evidence to reject H0 at the 5% level.
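
The upper-tail probability can be computed without a statistics package when df is even, because the chi-square tail then has a closed form: exp(-x/2) times the sum of (x/2)^k / k! for k from 0 to df/2 - 1. The sketch below uses that formula. The function name is hypothetical, and the restriction to even df is an assumption made to keep the example dependency-free; statistical software handles any df.

```python
import math


def chi_square_p_value_even_df(chi2, df):
    """Upper-tail probability P(X >= chi2) for a chi-square with even df."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form only applies to positive even df")
    half = chi2 / 2
    series = sum(half ** k / math.factorial(k) for k in range(df // 2))
    return math.exp(-half) * series


# Health-by-gender example: chi-squared = 13.0496 with df = (5-1)(2-1) = 4.
p_value = chi_square_p_value_even_df(13.0496, 4)
print(round(p_value, 4))
```

For these inputs the formula gives roughly 0.011, which is below the 0.05 threshold.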



LIMITATIONS OF CHI-SQUARE TEST

▪ Doesn’t tell us anything about the strength or direction of the
association.

▪ Doesn’t tell us which cells deviate from expected distributions

▪ Next week: Residual analysis and Odds Ratios



Questions? Ideas?

Thank you for your attention!
