WEEK 8.
ANALYZING THE ASSOCIATION
 BETWEEN CATEGORICAL VARIABLES
         STATISTICAL METHODS IN SOCIOLOGY II
                        SOC 242
                    Spring 2024-2025
                     Tuesday, 14:40-16:30 G204
                     Thursday, 12:40-14:30 G204
     FACULTY OF ARTS AND SCIENCES
        DEPARTMENT OF SOCIOLOGY
                    SESSION PLAN
o Independence and Dependence (Association)
o Testing Categorical Variables For Independence
                         Statistical Methods in Sociology II, Week 10   2
         WHAT IS “ASSOCIATION”?
“two variables have an association if a particular
value for one variable is more likely to occur with
certain values of the other variable—for example, if
being very happy is more likely to happen if a
person has an above average income”.
           THE ASSOCIATION BETWEEN
            CATEGORICAL VARIABLES
▪ Suppose both response and explanatory variables are
  categorical, with any number of categories for
  each.
▪ There is an association between the variables if the
  population conditional distribution for the
  response variable differs among the categories of the
  explanatory variable.
           LOGIC OF HYPOTHESIS TESTING
                AND ‘ASSOCIATION’
▪ We discussed how to test a hypothesis about differences in
  means (suitable for continuous variables) and differences in
  proportions (suitable for categorical variables).
▪ One way of re-phrasing our hypothesis test is to say ‘is
  there an association between categorical variable X and
  continuous variable Y?’
   ▪ E.g. association between gender and height
       LOGIC OF HYPOTHESIS TESTING AND
                ‘ASSOCIATION’
▪ Today – how to test hypotheses about the relationship
  between two categorical variables (nominal/ordinal).
▪ The logic of the hypothesis test is exactly the same.
▪ The main difference is that instead of using the sampling
  distribution of the mean to create z-scores (or t-scores), we use
  the sampling distribution of another statistic called chi-squared,
  appropriate for categorical data.
          CONTINGENCY TABLES
▪ ‘Contingent on’ – means ‘depends on’.
▪ We display categorical data for analysis in contingency
  tables.
▪ Definition: A contingency table displays the number of
  observations for each combination of outcomes over the
  categories of each variable.
▪ We’ll only look at two variables at a time.
  Let’s look at an example…
                            EXAMPLE
To see how subjective health status depends on gender, convert
to % within columns (within independent variable).
                                     Gender
   Subjective health          Male      Female       Total
   Very good                   170         205         375
   Good                        730         643       1,373
   Fair                        264         306         570
   Poor                         34          42          76
   Very poor                     5           7          12
   Total                     1,203       1,203       2,406
   Data source: WVS Turkey, 2018
Response: subjective health status
Explanatory: gender
Row marginals: the Total column (375, 1,373, 570, 76, 12)
Column marginals: the Total row (1,203, 1,203)
     TO SEE HOW SUBJECTIVE HEALTH DEPENDS ON GENDER,
              CONVERT TO % WITHIN COLUMNS!
▪   e.g. men who report very good health = (170/1,203) x 100 = 14.1%
▪   The two columns form the conditional distributions of subjective
    health status on gender
▪   14.1% of men and 17.0% of women report very good health
                                               Gender
                                        Male             Female              Total
     Subjective health
     status
    Very good                          14.1%              17.0%              15.6%
    Good                               60.7%              53.4%              57.1%
    Fair                               21.9%              25.4%              23.7%
    Poor                                2.8%               3.5%               3.2%
    Very poor                          0.4%               0.6%               0.5%
     Total                              100%               100%              100%
     Data source: WVS Turkey, 2018
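The column percentages above can be reproduced with a few lines of Python. This is only a minimal sketch: the counts come from the example table, while the names `observed` and `column_percentages` are made up for illustration.

```python
# Observed counts from the example table (WVS Turkey, 2018).
categories = ["Very good", "Good", "Fair", "Poor", "Very poor"]
observed = {
    "Male": [170, 730, 264, 34, 5],
    "Female": [205, 643, 306, 42, 7],
}


def column_percentages(counts):
    """Turn one column of counts into percentages (one conditional distribution)."""
    total = sum(counts)
    return [100 * n / total for n in counts]


# One conditional distribution of health status per gender category.
conditional = {gender: column_percentages(counts) for gender, counts in observed.items()}

for i, category in enumerate(categories):
    male = conditional["Male"][i]
    female = conditional["Female"][i]
    print(f"{category:<10} {male:6.1f}% {female:6.1f}%")
```

Each column is divided by its own total, so each column of percentages adds to 100.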
    GUIDELINES FOR CONTINGENCY TABLES
▪ Show sample conditional distributions: percentages for the
  response variable within the categories of the explanatory
  variable.
  (Find by dividing the cell counts by the explanatory category total and
multiplying by 100. Percents on response categories will add to 100.)
▪ Clearly define variables and categories.
▪ If you display percentages but not the cell counts, include the
  explanatory total sample sizes, so the reader can (if desired)
  recover all the cell count data.
▪ Use rows for response variables and columns for explanatory
  variables.
             STATISTICAL INDEPENDENCE
▪ Association between these variables depends on whether the
  conditional distribution of subjective health status differs
  between men and women.
▪ We use the concept of ‘statistical independence’
  ▪   Two categorical variables are statistically independent if the
      population conditional distributions on one of them are
      identical at each category of the other;
  ▪   Remember, the distribution of a random variable will differ across
      samples, even when the population does not change;
  ▪   Task is to decide whether two variables are independent or not in the
      population, not the sample. In other words, could our result have
      occurred due to chance alone?
          PERFECT DEPENDENCE
                                                         Gender
                                                Male                   Female
    Subjective health
   Good                                          100                     0
   Poor                                           0                     100
    Total                                        100                    100
▪ Gender perfectly predicts the subjective health status.
▪ Conditional distributions are different.
▪ There is perfect dependence.
       PERFECT INDEPENDENCE
                                                 Gender
                                         Male               Female
       Subjective health
      Good                                  50                   50
      Poor                                  50                   50
       Total                               100                  100
▪ Gender is no help at all in predicting subjective
  health status.
▪ The conditional distributions are the same.
▪ There is perfect independence.
▪ In reality, there is never perfect independence...
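The two toy tables can be compared in code by checking whether the conditional distributions are identical across the categories of gender. This is a sketch for illustration only: the function names are hypothetical, and an exact comparison like this describes a table of counts, it is not a significance test.

```python
def conditional_distributions(table):
    """Map each explanatory category to the proportions of its response counts."""
    return {
        group: [n / sum(counts) for n in counts]
        for group, counts in table.items()
    }


def identical_conditionals(table, tol=1e-12):
    """True if every category has the same conditional distribution."""
    distributions = list(conditional_distributions(table).values())
    first = distributions[0]
    return all(
        abs(p - q) <= tol
        for other in distributions[1:]
        for p, q in zip(first, other)
    )


# Counts are listed as [Good, Poor] for each gender, as in the two toy tables.
perfect_dependence = {"Male": [100, 0], "Female": [0, 100]}
perfect_independence = {"Male": [50, 50], "Female": [50, 50]}

print(identical_conditionals(perfect_dependence))    # False
print(identical_conditionals(perfect_independence))  # True
```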
    TESTING FOR STATISTICAL INDEPENDENCE
We want to know if population conditional distributions are
identical
We don’t expect sample conditional distributions to be identical -
why?
▪ Answer: Sampling variation
▪ Question: is it plausible that the observed difference in sample
  conditional distributions would be this great if the population
  conditional distributions are identical?
▪ We use a statistical test – similar logic as for comparing means:
  ▪ H0: the variables are statistically independent
  ▪ H1: the variables are statistically dependent
      EXPECTED CELL FREQUENCIES
▪ We test for a statistically significant association using
  the same logic as for our t-test: we compare what we
  observe with what we would expect if the null hypothesis
  were true. This means comparing observed with expected
  cell frequencies.
▪ Crucially, we will get a p-value for our significance
  test, which is called a Chi-squared test.
            EXPECTED CELL FREQUENCIES
                                                   Gender
                                           Male             Female           Total
      Subjective health
      status
     Very good                              188                188            375
     Good                                   687                687           1,373
     Fair                                   285                285            570
     Poor                                    38                 38             76
     Very poor                               6                  6              12
      Total                                1,203              1,203          2,406
      Data source: WVS Turkey, 2018
fe = (row total x column total) / total sample size
e.g.
proportion of ‘good’ = 1,373/2,406 = 0.5707
number of men = 1,203
1,203 x 0.5707 = 686.5 (shown rounded to 687 in the table)
expected frequency of male respondents reporting ‘good’ health,
assuming H0 is true = 686.5 (about 687)
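The same rule, expected frequency = (row total x column total) / total sample size, can be applied to every cell at once. The sketch below assumes the observed counts from the gender by subjective health example. The variable names are illustrative.

```python
# Observed counts: one row per health category, columns are [Male, Female].
categories = ["Very good", "Good", "Fair", "Poor", "Very poor"]
observed = [
    [170, 205],
    [730, 643],
    [264, 306],
    [34, 42],
    [5, 7],
]

row_totals = [sum(row) for row in observed]          # row marginals
col_totals = [sum(col) for col in zip(*observed)]    # column marginals
n = sum(row_totals)                                  # total sample size

# Expected count under H0 for each cell: row total * column total / n.
expected = [[rt * ct / n for ct in col_totals] for rt in row_totals]

for category, row in zip(categories, expected):
    print(f"{category:<10} {row[0]:8.1f} {row[1]:8.1f}")
```

The unrounded values (for example 187.5 and 686.5) are the ones to carry into the chi-squared calculation. Rounding to whole numbers is only for display.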
      CHI-SQUARED (Χ2 ) (KARL PEARSON, 1900)
                                     Gender
    Subjective health status   Male      Female       Total
    Very good                   188         188         375
    Good                        687         687       1,373
    Fair                        285         285         570
    Poor                         38          38          76
    Very poor                     6           6          12
    Total                     1,203       1,203       2,406
    Data source: WVS Turkey, 2018
$$\chi^2 = \sum \frac{(f_o - f_e)^2}{f_e}$$
Test statistic = chi-squared (χ2 )
Definition: Sum, over all cells, of the squared deviation of the
observed from the expected cell frequency, each divided by that
cell’s expected frequency.
Quantifies how much difference there is between what we would
expect if H0 is true and what we actually see.
H0 is true: fo and fe are close – small value of Χ2
H0 is false: some fo and fe are far – large value of Χ2
                              CHI-SQUARED (Χ2 )
$$\chi^2 = \sum \frac{(f_o - f_e)^2}{f_e}$$
Expected frequencies are used unrounded here (187.5 and 686.5).
          O        E       O-E      (O-E)2     (O-E)2/E
        170    187.5     -17.5      306.25       1.6333
        730    686.5      43.5     1892.25       2.7564
        264    285       -21        441          1.5474
         34     38        -4         16          0.4211
          5      6        -1          1          0.1667
        205    187.5      17.5      306.25       1.6333
        643    686.5     -43.5     1892.25       2.7564
        306    285        21        441          1.5474
         42     38         4         16          0.4211
          7      6         1          1          0.1667
                                          Total: 13.0496
                                          χ2 = 13.0496
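As a check on the hand calculation, the statistic can be computed directly from the observed counts. This is a minimal sketch in plain Python, assuming the counts from the gender by subjective health table. The function name `chi_squared_statistic` is made up for illustration.

```python
def chi_squared_statistic(observed):
    """Pearson chi-squared: sum over cells of (observed - expected)^2 / expected."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    statistic = 0.0
    for row, row_total in zip(observed, row_totals):
        for f_o, col_total in zip(row, col_totals):
            f_e = row_total * col_total / n  # expected count under H0
            statistic += (f_o - f_e) ** 2 / f_e
    return statistic


# Rows are health categories (very good to very poor), columns are [Male, Female].
observed = [
    [170, 205],
    [730, 643],
    [264, 306],
    [34, 42],
    [5, 7],
]

chi2 = chi_squared_statistic(observed)
print(f"chi-squared = {chi2:.4f}")
```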
 DISTRIBUTION OF Χ2 AND DEGREES OF FREEDOM
If we took repeated samples, the sampling distribution of χ2 would not
be normal; it follows its own χ2 distribution.
What happens to χ2 when there are more cells in a table?
▪ It tends to get bigger, so we need to take the number of cells into
  account to get the critical value for the test statistic
χ2 distribution is based on the degrees of freedom in the contingency
table
▪ Table with r rows and c columns, df=(r-1)(c-1)
▪ df – refers to number of cells in a contingency table that can
   vary, given the marginals
▪ Statistical software (R, Excel, IBM SPSS, Stata, etc.) works this out
   for you, but you should know how to work it out for yourself too.
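The idea that df counts the cells that can vary given the marginals can be illustrated with the 5 x 2 health by gender table. In the sketch below (names are illustrative), four cells are chosen freely and every other cell is then forced by the row and column totals.

```python
def degrees_of_freedom(n_rows, n_cols):
    """df for an r x c contingency table: (r - 1)(c - 1)."""
    return (n_rows - 1) * (n_cols - 1)


# Marginals of the gender by subjective health table.
row_totals = [375, 1373, 570, 76, 12]   # very good ... very poor
col_totals = [1203, 1203]               # male, female

df = degrees_of_freedom(len(row_totals), len(col_totals))
print(df)  # 4

# Fix df = 4 cells: the first four rows of the male column.
free_cells = [170, 730, 264, 34]

# The last male cell is forced by the male column total...
male = free_cells + [col_totals[0] - sum(free_cells)]
# ...and every female cell is forced by its row total.
female = [row_total - m for row_total, m in zip(row_totals, male)]

print(male)    # [170, 730, 264, 34, 5]
print(female)  # [205, 643, 306, 42, 7]
```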
PROPERTIES OF CHI-SQUARE DISTRIBUTION
• No negative values
• Mean = df
• The standard deviation increases as the df increase, so the
  chi-square curve spreads out more as the df increase
• As the df becomes very large, the shape becomes more like the
  normal distribution
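These properties can be checked with a small simulation. The sketch below relies on a standard fact not stated on the slide: a chi-squared variable with df degrees of freedom is the sum of df squared standard normal draws, and its standard deviation is the square root of 2 x df. The seed and the number of draws are arbitrary choices.

```python
import math
import random
import statistics


def simulate_chi_squared(df, n_draws, rng):
    """Draw chi-squared values as sums of df squared standard normal draws."""
    return [
        sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(df))
        for _ in range(n_draws)
    ]


rng = random.Random(242)  # fixed seed so the run is repeatable
summary = {}
for df in (2, 4, 10):
    draws = simulate_chi_squared(df, 20_000, rng)
    summary[df] = (min(draws), statistics.mean(draws), statistics.stdev(draws))
    lowest, mean, sd = summary[df]
    print(
        f"df={df:>2}  min={lowest:.3f}  mean={mean:.2f}  "
        f"sd={sd:.2f}  sqrt(2*df)={math.sqrt(2 * df):.2f}"
    )
```

The simulated minimum is never negative, the mean sits close to df, and the standard deviation grows with df.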
PROPERTIES OF CHI-SQUARE DISTRIBUTION
The probability that we would get a value of the Χ2 statistic this
big or bigger if gender and subjective health status are
independent in the population is about 0.011 (χ2 = 13.0496, df = 4),
which is less than 0.05.
This is evidence against H0, so we reject H0 at the 0.05 level.
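The p-value is the upper-tail area of the χ2 distribution beyond the observed statistic. Statistical software normally reports it. The sketch below instead uses a closed-form expression that holds only when df is even, which covers df = 4 for the 5 x 2 table. The function name is made up for illustration.

```python
import math


def chi2_upper_tail_even_df(x, df):
    """P(chi-squared >= x) for an even, positive df, using the closed form
    exp(-x/2) * sum_{k=0}^{df/2 - 1} (x/2)^k / k!."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form only applies to even, positive df")
    half = x / 2
    term = 1.0   # k = 0 term of the sum
    total = 1.0
    for k in range(1, df // 2):
        term *= half / k  # builds (x/2)^k / k! step by step
        total += term
    return math.exp(-half) * total


chi2 = 13.0496   # statistic from the gender by subjective health table
df = 4           # (5 - 1) * (2 - 1)
p_value = chi2_upper_tail_even_df(chi2, df)

print(f"p-value = {p_value:.4f}")
print("reject H0 at the 0.05 level" if p_value < 0.05 else "do not reject H0")
```

For odd df there is no such simple sum, so a statistics package or a χ2 table is the practical route.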
           LIMITATIONS OF CHI-SQUARE TEST
▪ Doesn’t tell us anything about the strength or direction of the
  association.
▪ Doesn’t tell us which cells deviate from expected distributions
▪ Next week: Residual analysis and Odds Ratios
 Questions? Ideas?
Thank you for your attention!