48th ICRO-SUN PG TEACHING PROGRAMME
26 & 27 OCTOBER, 2024
MAX SUPERSPECIALITY HOSPITAL, BATHINDA
CLINICAL TRIAL & CANCER STATISTICS
Categorical data analysis
Dr. Madhur Verma
Associate Professor of Community and Family Medicine
AIIMS Bathinda
drmadhurverma@gmail.com
9466445513
Types of Variables
VARIABLES
• QUALITATIVE (Categorical): NOMINAL, ORDINAL
• QUANTITATIVE: DISCRETE, CONTINUOUS
Categorical data
Nominal
• Locality: Rural / Urban
• Gender: M, F
• Diagnosis: Normal, Abnormal
Ordinal
• Age (years): <15, 15–30, 30–45, 45+
• SES: Low, Medium, High
• Improvement: Mild, Moderate, Fair
• Tumor grade: Grade 1, 2, 3
Initiating the categorical data analysis
To initiate a categorical data analysis, it is recommended to
follow a systematic approach:
1. Understand Your Data
2. Summarise the Data
3. Examine Relationships Between Variables.
4. Advanced Analysis.
5. Test for Homogeneity or Independence
6. Interpretation
Categorical data analysis
1. Understand Your Data
• Identify variables
Review your dataset and identify the categorical variables
you want to analyse.
• Define levels: Ensure that each categorical variable has clearly
defined levels or categories.
• Check for missing data: Handle missing data appropriately
(e.g., using imputation techniques or omitting cases)
Categorical data analysis
2. Summarize the Data
• Frequency tables: Start by creating frequency tables for
each categorical variable to understand the distribution of
categories.
• Bar plots or pie charts: Visualize the frequency
distribution using bar plots or pie charts to give a clear
picture of how the categories are distributed.
• Collapse tables if necessary: combine sparse categories to
reduce the number of levels.
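A minimal sketch of this step in Python, assuming pandas and matplotlib are available; the "smoking" column and its values are made up for illustration:

# Frequency table and bar plot for one categorical variable (illustrative data).
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({"smoking": ["Yes", "No", "Yes", "No", "No", "Yes"]})

freq = df["smoking"].value_counts()                      # absolute frequencies
pct = df["smoking"].value_counts(normalize=True) * 100   # percentages

print(pd.concat([freq.rename("n"), pct.round(1).rename("%")], axis=1))

freq.plot(kind="bar", title="Smoking status")            # simple bar plot
plt.ylabel("Count")
plt.tight_layout()
plt.show()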
Categorical data analysis
3. Examine Relationships Between Variables
• Contingency tables: For two or more categorical
variables, create contingency tables (cross-tabulation) to
explore relationships.
• Chi-square test: Use the chi-square test to assess whether
there’s a significant association between two categorical
variables.
• Cramér’s V: If the chi-square test is significant, use
Cramér’s V to measure the strength of the association.
Its value ranges from 0 to 1, where 0 indicates no association, while 1
indicates a perfect association (complete dependence).
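A minimal sketch of this step with scipy; the 2x2 counts below are illustrative only, and Cramér's V is computed from the χ² statistic using the formula given later in this deck:

# Chi-square test of association and Cramér's V for a 2x2 table (illustrative counts).
import numpy as np
from scipy.stats import chi2_contingency

table = np.array([[30, 20],   # e.g. rows = gender, columns = smoking status
                  [15, 35]])

# correction=False gives the uncorrected (Pearson) chi-square
chi2, p, dof, expected = chi2_contingency(table, correction=False)

n = table.sum()
q = min(table.shape)                       # smaller of the number of rows/columns
cramers_v = np.sqrt(chi2 / (n * (q - 1)))

print(f"chi2 = {chi2:.3f}, p = {p:.4f}, df = {dof}")
print(f"Cramér's V = {cramers_v:.3f}")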
Contingency Tables
Contingency Tables
▪ Cross-classifications of categorical variables in which
▪ Rows (typically): categories of EXPLANATORY variables
▪ Columns: categories of OUTCOME variables.
▪ Counts in the “cells” of the table give the numbers of individuals
at the corresponding combination of levels of the two variables.
▪ Contingency tables enable us to compare one characteristic of
the sample, e.g. Oral cancer, defined by another categorical
variable, e.g. Smoking.
Contingency table (Bivariate)
Example 1: Gender and smoking status
S. No   Gender   Smoking
1       M        Y
2       F        Y
3       M        N
4       F        Y
5       M        N
6       F        N
7       M        Y
8       F        N
9       F        N
10      M        Y

                     Smoking Status
Gender     Yes n (row %)   No n (row %)   Total n (col %)
Male       3 (60)          2 (40)         5 (50)
Female     2 (40)          3 (60)         5 (50)
Total      5 (50)          5 (50)         10 (100)
Contingency table (Bivariate)
Example 2: Education status and Cervical cancer screening
readiness
                    Cervical cancer screening readiness
Education status    Very Eager   Pretty   Not too   Row Total
Above Average       164          233      26        423
Average             293          473      117       883
Below Average       132          383      172       687
Col Total           589          1089     315       1993
Row and column totals are called Marginal counts
What can a contingency table do ?
Can summarize by percentages on the response variable (screening readiness)
                    Cervical cancer screening readiness
Education status    Very Eager   Pretty   Not too   Row Total
Above Average       164          233      26        423
Average             293          473      117       883
Below Average       132          383      172       687
Col Total           589          1089     315       1993
Example: the percentage "Very Eager" is
39% for above-average education (164/423 = 0.39), 33% for
average education (293/883 = 0.33), and 19% for below-average
education (132/687 = 0.19)
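As a sketch, the same row percentages can be reproduced in pandas by re-entering the table above by hand:

# Row (conditional) percentages for the education x readiness table shown above.
import pandas as pd

table = pd.DataFrame(
    {"Very Eager": [164, 293, 132],
     "Pretty": [233, 473, 383],
     "Not too": [26, 117, 172]},
    index=["Above Average", "Average", "Below Average"],
)

row_pct = table.div(table.sum(axis=1), axis=0) * 100   # percentage within each row
print(row_pct.round(1))   # "Very Eager": ~39% above average, ~33% average, ~19% below average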
What can a contingency table do ?
2. Association between two categorical variables.
For example, you want to know
• if there is any association between gender and smoking.
• Is there any association between hepatitis C infection
and the population's HCC risk?
• To test whether lung cancer is associated with smoking
or not.
• Whether obesity is associated with colon cancer.
Chi-Square Tests
Chi-Square Tests
Simplest & most widely used non-parametric test in statistical
work.
Chi-Square Tests: These tests check whether the differences or
patterns between two groups are real or just random.
• Chi-square is basically a measure of significance.
• It is not a good measure of the strength of the association.
• It can help you decide if an association exists but not tell
how strong it is.
Chi-Square Tests
Assumptions
1. The sample must be randomly drawn from the population.
2. Data must be reported in raw frequencies (not
percentages).
3. Categories of the variables must be mutually exclusive &
exhaustive.
4. Expected frequencies cannot be too small; expected
frequency should be more than 5 in at least 80% of the
cells, and all individual expected counts should be >1.
Logic of the chi-square
The total number of observations in each column and the total number of observations
in each row are considered to be given or fixed.
If we assume that columns and rows are independent, we can calculate the expected
frequencies.
Logic of Chi square
Compares the observed frequency in each cell with the expected frequency.

If no relationship exists between the column and row variables:
• the observed frequencies will be very close to the expected frequencies
• they will differ only by small amounts
• the value of the chi-square statistic will be small

If a relationship (or dependency) does occur:
• the observed frequencies will vary from the expected frequencies
• the value of the chi-square statistic will be large
Steps for Chi-square test
Define the null and alternative hypotheses
State alpha
Calculate the degrees of freedom
State the decision rule
Calculate the test statistic
State and interpret the results
Hypothesis Testing
• Tests a claim about a population parameter using evidence
(data from a sample)
Steps
1. Formulate a Hypothesis about the population
2. Random sample
3. Summarizing the information (descriptive statistic)
4. Does the information given by the sample support the hypothesis? Are we making any
errors? (inferential stat.)
Decision rule: Convert the research question into null and alternative hypotheses
Null Hypothesis
H0: No difference between observed and expected observations
H1: A difference is present between observed and expected observations
What is statistical significance?
• A statistical concept indicating that the result is very unlikely
to be due to chance and, therefore, likely represents a true
relationship between the variables.
• Statistical significance is usually indicated by the p-value
(probability value), which should be smaller than the chosen
significance level (alpha).
State alpha value
• Alpha error (type I) is rejecting a true null hypothesis (which
says that there is no difference between observed and
expected).
For the majority of the studies, alpha is 0.05
Meaning: that the investigator has set 5% as the maximum
chance of incorrectly rejecting the null hypothesis.
Degree of freedom
It is a positive whole number that indicates the lack of
restrictions in calculations: the degrees of freedom in a
calculation is the number of values that can vary.
• For goodness of fit: df = number of levels (outcome categories) – 1
• For independence / homogeneity of proportions: df = (no. of columns – 1) × (no. of rows – 1)
The Chi-Square Distribution
• No negative values
• Mean is equal to the degrees
of freedom
• The standard deviation
increases as degrees of
freedom increase, so the chi-
square curve spreads out
more as the degrees of
freedom increase.
• As the degrees of freedom become
very large, the shape becomes more
like the normal distribution.
The Chi-Square Distribution
The chi-square distribution is different for each value of the
degrees of freedom, so different critical values correspond to
different degrees of freedom.
We find the critical value that separates the area defined by α
from that defined by 1 – α.
Finding Critical Value
Q. What is the critical χ² value if df = 2 and α = 0.05?
If ni = E(ni), then χ² = 0.
[Chi-square curve, df = 2: do not reject H0 for χ² below 5.991; reject H0 in the upper tail (α = 0.05) beyond 5.991]

χ² Table (Portion)          Significance level
DF     0.995    …    0.95     …    0.05
1      ...      …    0.004    …    3.841
2      0.010    …    0.103    …    5.991
State decision rule
If the value obtained is greater than the
critical value of chi-square, the null
hypothesis will be rejected.
Calculate the test statistic
Calculated using the formula:

χ² = Σ (O – E)² / E

O = observed frequencies
E = expected frequencies

Question >>> How to find the expected value?
• Chi-square for goodness of fit: the expected value comes from a theory, a previous study, or a comparison group / standard.
• Chi-square for independent variables / homogeneity of proportions: Expected value = (Row total × Column total) / Table total.
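A minimal sketch of this calculation with NumPy, using illustrative counts: expected counts come from the row and column margins, and χ² from the formula above:

# Expected counts E = (row total x column total) / table total, then chi2 = sum((O - E)^2 / E).
import numpy as np

observed = np.array([[20, 30],
                     [40, 10]], dtype=float)   # illustrative counts only

row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
grand_total = observed.sum()

expected = row_totals @ col_totals / grand_total   # outer product of margins / N
chi2 = ((observed - expected) ** 2 / expected).sum()

print("Expected counts:\n", expected)
print("chi-square =", round(chi2, 3))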
State and interpret results
See whether the value of chi square is more than or
less than the critical value
• If the value of chi-square is less than the critical value, we accept the null hypothesis.
• If the value of chi-square is more than the critical value, the null hypothesis can be rejected.
Chi-Square Tests
Chi-Square Tests: These tests check whether the differences or
patterns between two groups are real or just random.
Types:
• Chi-Square Goodness of Fit Test: Tests whether the
observed frequencies in a categorical dataset match the
expected frequencies based on a specific hypothesis.
• Chi-Square Test of Independence: Assesses whether two
categorical variables (in row and columns) are independent
of each other.
• Chi-Square Test for Homogeneity of Proportions:
Compares the distributions of a categorical variable across
different populations.
Take Home Message
1. The chi-square test is applied to qualitative data, which may be
nominal or ordinal.
2. Before applying the chi-square test, check that all
assumptions are met.
3. If the value of chi-square is large, there is a high
probability of rejecting the null hypothesis.
4. If the value of chi-square is small, there is less
probability of rejecting the null hypothesis.
Fisher's Exact Test
• Used when sample sizes are small, and the Chi-square test
may not be appropriate.
• It tests for independence between two categorical variables in
a 2x2 contingency table.
Cochran (1954) suggests
The decision regarding the use of Chi-square should be
guided by the following considerations:
1. When N > 40, use Chi-square corrected for continuity.
2. When N is between 20 and 40, the Chi-square test may
be used if all the expected frequencies are > 5.
If any expected frequency is less than 5, use Fisher's
exact probability test.
3. When N < 20, use Fisher's exact test in all cases.
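A minimal sketch of Fisher's exact test with scipy, using illustrative counts for a small 2x2 table:

# Fisher's exact test for a small 2x2 table (illustrative counts only).
from scipy.stats import fisher_exact

table = [[3, 7],
         [8, 2]]

odds_ratio, p_value = fisher_exact(table, alternative="two-sided")
print(f"odds ratio = {odds_ratio:.2f}, p = {p_value:.4f}")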
Yate’s Correction
Chi-square distribution is a continuous distribution, and it fails to maintain
its continuity even if any one of the expected frequencies is less than 5.
In such cases, Yates' correction for continuity is applied to
maintain the character of continuity of the distribution.
The formula for the Chi-square test with Yates' correction is:

χ² = Σ ( |observed – expected| – 0.5 )² / expected
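As a sketch, scipy's chi2_contingency applies Yates' continuity correction to 2x2 tables when correction=True (its default); the counts below are illustrative only:

# Chi-square with and without Yates' continuity correction on a small 2x2 table.
import numpy as np
from scipy.stats import chi2_contingency

table = np.array([[8, 4],
                  [5, 9]])   # illustrative counts

chi2_yates, p_yates, *_ = chi2_contingency(table, correction=True)
chi2_plain, p_plain, *_ = chi2_contingency(table, correction=False)

print(f"with Yates:    chi2 = {chi2_yates:.3f}, p = {p_yates:.4f}")
print(f"without Yates: chi2 = {chi2_plain:.3f}, p = {p_plain:.4f}")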
Phi-Coefficient
•It is only used on 2X2 contingency tables.
•Interpreted as a measure of the relative (strength) of an
association between two variables ranging from 0 to 1
Phi (φ) = √( χ² / n ),   where n = total number of observations

        = (ad – bc) / √( (a+b)(a+c)(c+d)(b+d) )

                Smoking
Sex          Yes     No      Total
Male         a       b       a + b
Female       c       d       c + d
Total        a + c   b + d   n
Pearson’s Contingency Coefficient (C)
• It is interpreted as a measure of relative (strength) of an
association between two variables
• The coefficient will always be less than 1 and varies according to the
number of rows and columns.
• This can be used for general rXc tables.
• It ranges between 0 to 1
C = √( χ² / (n + χ²) ) = √( φ² / (1 + φ²) )
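A minimal sketch computing φ and C from the χ² statistic, using the gender x smoking counts of Example 1 (3/2 and 2/3):

# Phi coefficient and Pearson's contingency coefficient C for a 2x2 table.
import numpy as np
from scipy.stats import chi2_contingency

table = np.array([[3, 2],    # Male: 3 smokers, 2 non-smokers (Example 1)
                  [2, 3]])   # Female: 2 smokers, 3 non-smokers

chi2, p, dof, expected = chi2_contingency(table, correction=False)
n = table.sum()

phi = np.sqrt(chi2 / n)          # phi = sqrt(chi2 / n)
c = np.sqrt(chi2 / (n + chi2))   # C = sqrt(chi2 / (n + chi2))

print(f"phi = {phi:.3f}, C = {c:.3f}")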
Cramer’s V Coefficient (V)
• It is useful for comparing multiple χ² test statistics and is generalizable
across tables of varying sizes.
• It is not affected by sample size and is therefore very useful in situations
where you suspect that a statistically significant chi-square was the result of
a large sample size rather than any substantive relationship between the variables.
• It is interpreted as a measure of the relative (strength) of an association
between variables.
• The coefficient ranges from 0 (no association) to 1 (perfect association).
• In practice, a Cramér's V of 0.10 provides a good minimum
threshold for suggesting a substantive relationship between two variables.
V = √( χ² / ( n (q – 1) ) ),   where q = the smaller of the number of rows or columns
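A sketch using scipy's built-in association helper to obtain Cramér's V directly, here for the education x readiness table shown earlier:

# Cramér's V for an r x c table via scipy.stats.contingency.association.
import numpy as np
from scipy.stats.contingency import association

table = np.array([[164, 233, 26],
                  [293, 473, 117],
                  [132, 383, 172]])

v = association(table, method="cramer")
print(f"Cramér's V = {v:.3f}")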
McNemar’s Test
• Used in the case of two paired/related samples or when there are repeated
measurements.
• It can be used to test the significance of changes in "before-after" designs in
which each person serves as his or her own control.
• Thus, the test can be used
❖ to test the effectiveness of a treatment /training/ program/ therapy/intervention
or
❖ to compare the ratings of two judges on the same set of individuals.
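A minimal sketch of McNemar's test with statsmodels, using an illustrative 2x2 table of paired before/after outcomes:

# McNemar's test for paired (before/after) binary data.
from statsmodels.stats.contingency_tables import mcnemar

#                  after: positive   after: negative
table = [[30, 12],        # before: positive  (illustrative counts)
         [5, 53]]         # before: negative

result = mcnemar(table, exact=True)   # exact binomial version, suited to small discordant counts
print(f"statistic = {result.statistic}, p = {result.pvalue:.4f}")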
Kappa Statistic
▪ The Kappa Statistic measures the agreement between the
evaluations of two examiners when both are rating the same
objects.
▪ It describes agreement achieved beyond chance as a proportion
of that agreement that is possible beyond chance.
▪ The value of the Kappa Statistic ranges from -1 to 1, with larger
values indicating better reliability.
A value of 1 indicates perfect agreement.
A value of 0 indicates that agreement is no better than
chance.
▪ Generally, a Kappa > 0.60 is considered satisfactory.
Kappa Statistic
Kappa = (P0 – PE) / (1 – PE)

Where:
P0 = proportion of observed agreement
PE = proportion of expected agreement by chance
0.00 Agreement is no better than chance
0.01-0.20 Slight agreement
0.21-0.40 Fair agreement
0.41-0.60 Moderate agreement
0.61-0.80 Substantial agreement
0.81-0.99 Almost perfect agreement
1.00 Perfect agreement
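A minimal sketch of the Kappa statistic with scikit-learn, using made-up ratings from two hypothetical examiners:

# Cohen's kappa for two raters scoring the same set of cases (illustrative ratings).
from sklearn.metrics import cohen_kappa_score

rater_1 = ["normal", "abnormal", "normal", "normal", "abnormal", "normal"]
rater_2 = ["normal", "abnormal", "abnormal", "normal", "abnormal", "normal"]

kappa = cohen_kappa_score(rater_1, rater_2)
print(f"kappa = {kappa:.2f}")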
Initiating the categorical data analysis
4. Advanced Analysis
• Logistic regression: appropriate when the outcome is categorical
(e.g., a binary outcome) and the predictors may be categorical or
continuous.
• Chi-square automatic interaction detector (CHAID): This
method builds decision trees using categorical data, often
used in market research and healthcare to identify significant
predictors.
Logistic Regression
• Binary Logistic Regression: Used when the dependent
variable has two categories (e.g., success/failure). It
estimates the probability of an outcome based on one
or more predictor variables.
• Multinomial Logistic Regression: Used when the
dependent variable has more than two categories, but
the categories do not have an inherent order.
• Ordinal Logistic Regression: Used when the dependent
variable has ordered categories (e.g., low, medium,
high).
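A minimal sketch of binary logistic regression with statsmodels on synthetic data; the variable names (education_years, screened) are hypothetical, not from these slides:

# Binary logistic regression: probability of a binary outcome from one predictor.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
education_years = rng.integers(5, 18, size=200).astype(float)
# synthetic binary outcome whose probability rises with education
p = 1 / (1 + np.exp(-(0.3 * education_years - 3)))
screened = rng.binomial(1, p)

X = sm.add_constant(education_years)            # add intercept term
model = sm.Logit(screened, X).fit(disp=False)   # fit by maximum likelihood

print(model.summary())
print("Odds ratios:", np.exp(model.params))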
Initiating the categorical data analysis
5. Test for Homogeneity of proportions or Independence
• Chi-square test for association tests (2x2): whether two categorical
variables are associated
• Chi-square test of independence (R x C): used to test a variety of
sizes of contingency tables
• Chi-square goodness-of-fit test: whether the distribution of cases
in a single categorical variable follows a known/hypothesised
distribution
• Chi-square test of homogeneity: whether the proportions in each
group are equal in the population
• Fisher’s exact test: If sample sizes are small, this test can be more
accurate than the chi-square test for testing independence.
Chi-square goodness-of-fit test
It is also called Pearson's chi-square goodness-of-fit test.
The chi-square goodness-of-fit test is a single-sample
nonparametric test.
Q: How "close" are the observed values to those which would be
expected in the study?
OR
Q: An administrator at a hospital may want to determine whether an equal
number of people are hospitalised each day of the week, to better plan staffing
levels.
Expected frequency can be based
on
• theory
• previous experience
• comparison groups
Example: Are cancer-related deaths affected by seasonal variation?
Null Hypothesis: The proportions of deaths due to cancer in winter, summer, autumn and spring
are equal = ¼ = 25%
Alternative: Not all probabilities stated in the null hypothesis are correct
Cancer deaths Observed Expected = 322*1/4
Summer 78 80.5
Spring 71 80.5
Autumn 87 80.5
Winter 86 80.5
Total 322
Degrees of freedom = k – 1 = 4 – 1 = 3
For α = 0.05 and df = 3, the critical value is χ² = 7.81
χ² = (78 – 80.5)²/80.5 + (71 – 80.5)²/80.5 + (87 – 80.5)²/80.5 + (86 – 80.5)²/80.5 = 2.10
Conclusion: As the calculated χ² value is less than the critical value, we
accept the null hypothesis and state that deaths due to cancer across
seasons are not statistically different from what is expected by chance
(i.e. all seasons being equal).
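The same goodness-of-fit test can be reproduced with scipy as a quick check of the hand calculation above:

# Chi-square goodness-of-fit test for the seasonal cancer-deaths example.
from scipy.stats import chisquare

observed = [78, 71, 87, 86]        # summer, spring, autumn, winter
expected = [322 / 4] * 4           # equal proportions under H0

stat, p = chisquare(observed, f_exp=expected)
print(f"chi2 = {stat:.2f}, p = {p:.3f}")   # chi2 ≈ 2.10, well below the critical value 7.81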
Chi-square for independence
It focuses on contingency tables that are greater than 2 x 2, which
are often referred to as r x c contingency tables.
It tests whether two variables measured at the nominal level are
independent (i.e., whether there is an association between the
two variables).
Other analytical approaches
• Probit Regression: Similar to logistic regression, but it assumes a normal
cumulative distribution function instead of a logistic one, often used in
binary outcome models.
• Cochran-Mantel-Haenszel Test: Tests for an association between two
categorical variables while controlling for a third variable (stratification).
• Log-Linear Models: Used to model the relationships between three or
more categorical variables by modeling the logarithm of the expected cell
frequencies in a contingency table.
• Cluster Analysis (for Categorical Data): Methods like k-modes or latent
class analysis (LCA) group observations into clusters based on categorical
attributes.
Initiating the categorical data analysis
6. Interpretation
• Analyse p-values (typically < 0.05 for significance).
• Assess effect sizes (e.g., Cramér's V for associations).
• Interpret visualisations (e.g., bar plots, mosaic plots).