
Categorical Data Analysis: 48Th Icro-Sun PG Teaching Programme 26 & 27 OCTOBER, 2024


48th ICRO-SUN PG TEACHING PROGRAMME

26 & 27 OCTOBER, 2024


MAX SUPERSPECIALITY HOSPITAL, BATHINDA

CLINICAL TRIAL & CANCER STATISTICS

Categorical data analysis

Dr. Madhur Verma


Associate Professor of Community and Family Medicine
AIIMS Bathinda
drmadhurverma@gmail.com
9466445513
Types of Variables

VARIABLES

QUALITATIVE (Categorical): NOMINAL, ORDINAL
QUANTITATIVE: DISCRETE, CONTINUOUS


Categorical data

Nominal
Locality Rural/Urban
Gender M, F
Diagnosis Normal, Abnormal
Ordinal
Age (years) < 15, 15 -30, 30-45, 45 +
SES Low, Medium, High
Improvement Mild, Moderate, Fair
Tumor grade Grade 1, 2, 3
Initiating the categorical data analysis

To initiate a categorical data analysis, it is recommended to follow a systematic approach:

1. Understand Your Data


2. Summarise the Data
3. Examine Relationships Between Variables.
4. Advanced Analysis.
5. Test for Homogeneity or Independence
6. Interpretation
Categorical data analysis

1. Understand Your Data

• Identify variables: Review your dataset and identify the categorical variables you want to analyse.

• Define levels: Ensure that each categorical variable has clearly defined levels or categories.

• Check for missing data: Handle missing data appropriately (e.g., using imputation techniques or omitting cases).
Categorical data analysis

2. Summarize the Data


• Frequency tables: Start by creating frequency tables for each categorical variable to understand the distribution of categories.

• Bar plots or pie charts: Visualize the frequency distribution using bar plots or pie charts to give a clear picture of how the categories are distributed.

• Collapse tables if necessary: Merge sparse or closely related categories when a table has too many levels.
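A frequency table needs nothing beyond the standard library. The sketch below uses made-up tumour grades (hypothetical values, not data from this programme) and shows one possible way to collapse two categories into one.

```python
from collections import Counter

# Hypothetical tumour grades for 12 patients (illustrative values only)
grades = ["Grade 1", "Grade 2", "Grade 2", "Grade 3", "Grade 1", "Grade 2",
          "Grade 3", "Grade 2", "Grade 1", "Grade 2", "Grade 3", "Grade 2"]

# Frequency table: count and percentage per category
freq = Counter(grades)
n = len(grades)
for level in sorted(freq):
    print(f"{level}: {freq[level]} ({100 * freq[level] / n:.0f}%)")

# Collapse Grade 2 and Grade 3 into a single category
collapsed = Counter("Grade 1" if g == "Grade 1" else "Grade 2-3" for g in grades)
print(dict(collapsed))
```

Which categories to merge is a clinical judgment; the grouping used here is only an assumption for the example.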
Categorical data analysis

3. Examine Relationships Between Variables

• Contingency tables: For two or more categorical variables, create contingency tables (cross-tabulation) to explore relationships.

• Chi-square test: Use the chi-square test to assess whether there’s a significant association between two categorical variables.

• Cramér’s V: If the chi-square test is significant, use Cramér’s V to measure the strength of the association. Its value ranges from 0 to 1, where 0 indicates no association and 1 indicates a perfect association (complete dependence).
Contingency Tables
▪ Cross-classifications of categorical variables in which
▪ Rows (typically): categories of EXPLANATORY variables
▪ Columns: categories of OUTCOME variables.

▪ Counts in the “cells” of the table give the numbers of individuals at the corresponding combination of levels of the two variables.

▪ Contingency tables enable us to compare one characteristic of the sample, e.g. oral cancer, across groups defined by another categorical variable, e.g. smoking.
Contingency table (Bivariate)
Example 1: Gender and smoking status
Raw data:

S. No   Gender   Smoking
1       M        Y
2       F        Y
3       M        N
4       F        Y
5       M        N
6       F        N
7       M        Y
8       F        N
9       F        N
10      M        Y

Contingency table:

          Smoking Status
Gender    Yes n(row%)   No n(row%)   Total n(col%)
Male      3 (60)        2 (40)       5 (50)
Female    2 (40)        3 (60)       5 (50)
Total     5 (50)        5 (50)       10 (100)
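The cross-tabulation in Example 1 can be reproduced by counting the ten records directly. This is a minimal sketch using only the standard library; the variable names are illustrative.

```python
from collections import Counter

# The ten (gender, smoking) records listed in Example 1
records = [("M", "Y"), ("F", "Y"), ("M", "N"), ("F", "Y"), ("M", "N"),
           ("F", "N"), ("M", "Y"), ("F", "N"), ("F", "N"), ("M", "Y")]

cells = Counter(records)                      # count per (gender, smoking) cell
row_totals = Counter(g for g, _ in records)   # marginal counts by gender
n = len(records)

print("Gender  Yes n(row%)  No n(row%)  Total n(col%)")
for g in ("M", "F"):
    yes, no, tot = cells[(g, "Y")], cells[(g, "N")], row_totals[g]
    print(f"{g:<7} {yes} ({100 * yes / tot:.0f})      {no} ({100 * no / tot:.0f})"
          f"      {tot} ({100 * tot / n:.0f})")
```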
Contingency table (Bivariate)
Example 2: Education status and Cervical cancer screening
readiness

                   Cervical cancer screening readiness
Education status   Very Eager   Pretty   Not too   Row Total
Above Average      164          233      26        423
Average            293          473      117       883
Below Average      132          383      172       687
Col Total          589          1089     315       1993

Row and column totals are called marginal counts.
What can a contingency table do ?

1. Can summarize by percentages on the response variable (screening readiness)

                   Cervical cancer screening readiness
Education status   Very Eager   Pretty   Not too   Row Total
Above Average      164          233      26        423
Average            293          473      117       883
Below Average      132          383      172       687
Col Total          589          1089     315       1993

Example: Percentage “very eager” is

39% for above average education (164/423 = 0.39)
33% for average education (293/883 = 0.33)
19% for below average education (132/687 = 0.19)
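The row percentages can be checked with a few lines of Python. The sketch below takes the counts from the education by screening-readiness table and computes the percentage in each readiness category within each education level.

```python
# Counts from the education x screening-readiness table
table = {
    "Above Average": [164, 233, 26],
    "Average":       [293, 473, 117],
    "Below Average": [132, 383, 172],
}
columns = ["Very Eager", "Pretty", "Not too"]

row_pct = {}
for level, counts in table.items():
    total = sum(counts)  # row (marginal) total
    row_pct[level] = [round(100 * c / total) for c in counts]
    print(level, dict(zip(columns, row_pct[level])))
```

The first column gives 39%, 33% and 19% "very eager" for above average, average and below average education respectively.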
What can a contingency table do ?

2. Association between two categorical variables.


For example, you may want to know:

• Is there any association between gender and smoking?
• Is there any association between hepatitis C infection and the population's HCC risk?
• Is lung cancer associated with smoking or not?
• Is obesity associated with colon cancer?
Chi-Square Tests

Simplest & most widely used non-parametric test in statistical work.

Chi-Square Tests: These tests check whether the differences or patterns between two groups are real or just random.

• Chi-square is basically a measure of significance.

• It is not a good measure of the strength of the association.

• It can help you decide if an association exists but not tell how strong it is.
Chi-Square Tests

Assumptions
1. The sample must be randomly drawn from the population.
2. Data must be reported in raw frequencies (not
percentages).
3. Categories of the variables must be mutually exclusive &
exhaustive.
4. Expected frequencies cannot be too small; expected
frequency should be more than 5 in at least 80% of the
cells, and all individual expected counts should be >1.

Logic of the chi-square

The total number of observations in each column and the total number of observations
in each row are considered to be given or fixed.

If we assume that columns and rows are independent, we can calculate the expected frequencies.
Logic of Chi square

Compares the observed frequency in each cell with the expected frequency.

If no relationship exists between the column and row variable:
• the observed frequencies will be very close to the expected frequencies
• they will differ only by small amounts
• the value of the chi-square statistic will be small

If a relationship (or dependency) does occur:
• the observed frequencies will vary from the expected frequencies
• the value of the chi-square statistic will be large
Steps for Chi-square test

Define Null and alternative hypothesis

State alpha

Calculate degree of freedom

State decision rule

Calculate test statistics

State and Interpret results


Hypothesis Testing

• Tests a claim about a parameter using evidence (data in a sample); a significant result indicates an association, not necessarily a causal relationship

Steps
1. Formulate a Hypothesis about the population
2. Random sample
3. Summarizing the information (descriptive statistic)
4. Does the information given by the sample support the hypothesis? Are we making any
errors? (inferential stat.)

Decision rule: Convert the research question to null and alternative hypothesis
Null Hypothesis

H0 = No difference between observed and expected observations

H1 = difference is present between observed and expected observations


What is statistical significance?

• A statistical concept indicating that the result is very unlikely to be due to chance and, therefore, likely represents a true relationship between the variables.

• Statistical significance is usually indicated by the p-value (probability value), which should be smaller than the chosen significance level (alpha).
State alpha value

• Alpha error (type I) is rejecting a true null hypothesis (which says that there is no difference between observed and expected).

For the majority of studies, alpha is 0.05.

Meaning: the investigator has set 5% as the maximum chance of incorrectly rejecting the null hypothesis.
Degree of freedom

It is a positive whole number that indicates the lack of restrictions in calculations.

The degree of freedom is the number of values in a calculation that can vary.

Calculation
• For Goodness of Fit = Number of levels (outcome) – 1
• For independent variables / Homogeneity of proportion: (No. of columns – 1) (No. of rows – 1)
The Chi-Square Distribution

• No negative values
• Mean is equal to the degrees of freedom
• The standard deviation increases as degrees of freedom increase, so the chi-square curve spreads out more as the degrees of freedom increase.
• As the degrees of freedom become very large, the shape becomes more like the normal distribution.
The Chi-Square Distribution
The chi-square distribution is different for each value of the degrees of freedom, so different critical values correspond to different degrees of freedom.

We find the critical value that separates the area defined by α from that defined by 1 – α.
Finding Critical Value

Q. What is the critical χ² value if df = 2 and α = 0.05?

[Figure: chi-square curve for df = 2. The area to the right of χ² = 5.991 equals α = 0.05 (Reject H0); the area from 0 to 5.991 is the "Do not reject H0" region. If ni = E(ni), χ² = 0.]

χ² Table (Portion)

        Significance level
DF      0.995   …   0.95    …   0.05
1       ...     …   0.004   …   3.841
2       0.010   …   0.103   …   5.991
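The two 0.05 entries in the table portion can be checked without a statistics package, because the chi-square distribution has simple closed forms for df = 1 and df = 2. The sketch below relies on those two special cases only; other degrees of freedom would need a table or a statistics library.

```python
import math
from statistics import NormalDist

alpha = 0.05

# df = 2: the upper-tail probability is exp(-x / 2), so the critical
# value solves exp(-x / 2) = alpha
crit_df2 = -2 * math.log(alpha)

# df = 1: chi-square is the square of a standard normal variable, so the
# critical value is the squared two-sided normal quantile
crit_df1 = NormalDist().inv_cdf(1 - alpha / 2) ** 2

print(round(crit_df1, 3), round(crit_df2, 3))  # 3.841 5.991
```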
State decision rule

If the value obtained is greater than the critical value of chi square, the null hypothesis will be rejected.
Calculate test statistics
Calculated using the formula:

$$\chi^2 = \sum \frac{(O - E)^2}{E}$$

O = observed frequencies
E = expected frequencies

Question >>> How to find the Expected value

Chi square for goodness of fit: expected values are based on
• a theory
• a previous study
• a standard
• comparison groups

Chi square for independent variables / Homogeneity of the proportion:
• Expected Value = Row total * Column total / Table total

See whether the value of chi square is more than or less than the critical value.

• If the value of chi square is less than the critical value, we do not reject the null hypothesis.
• If the value of chi square is more than the critical value, the null hypothesis can be rejected.
Chi-Square Tests
Chi-Square Tests: These tests check whether the differences or
patterns between two groups are real or just random.

Types:
• Chi-Square Goodness of Fit Test: Tests whether the
observed frequencies in a categorical dataset match the
expected frequencies based on a specific hypothesis.

• Chi-Square Test of Independence: Assesses whether two


categorical variables (in row and columns) are independent
of each other.

• Chi-Square Test for Homogeneity of Proportions:


Compares the distributions of a categorical variable across
different populations.
Take Home Message

1. The chi-square test is applied to qualitative data, which may be nominal or ordinal.

2. Before applying the chi-square test, check that all assumptions are met.

3. If the value of chi-square is large >>> there is a high probability of rejecting the null hypothesis.

4. If the value of chi-square is small >>> there is less probability of rejecting the null hypothesis.
Fisher's Exact Test
• Used when sample sizes are small, and the Chi-square test may not be appropriate.

• It tests for independence between two categorical variables in a 2x2 contingency table.
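For a 2x2 table the exact p-value can be computed from the hypergeometric distribution with the margins held fixed. The sketch below assumes the common two-sided convention of summing the probabilities of all tables that are no more likely than the observed one; the function name is illustrative.

```python
from fractions import Fraction
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    r1, r2, c1, n = a + b, c + d, a + c, a + b + c + d

    def prob(k):
        # Hypergeometric probability of k in the top-left cell, margins fixed
        return Fraction(comb(r1, k) * comb(r2, c1 - k), comb(n, c1))

    lo, hi = max(0, c1 - r2), min(r1, c1)
    p_obs = prob(a)
    # Sum the probabilities of all tables no more likely than the observed one
    return sum(prob(k) for k in range(lo, hi + 1) if prob(k) <= p_obs)

# Gender x smoking counts (male 3/2, female 2/3), N = 10
p = fisher_exact_two_sided(3, 2, 2, 3)
print(float(p))  # 1.0
```

Exact fractions are used so that the comparison with the observed probability is not affected by floating-point rounding.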
Cochran (1954) suggests

The decision regarding the use of Chi-square should be guided by the following considerations:
1. When N > 40, use Chi-square corrected for continuity.
2. When N is between 20 and 40, the Chi-square test may be used if all the expected frequencies are > 5. If any expected frequency is less than 5, use Fisher’s Exact probability test.
3. When N < 20, use Fisher’s test in all cases.
Yates’ Correction
The chi-square distribution is continuous, but a statistic computed from counts is not, and the approximation breaks down if even one of the expected frequencies is less than 5.

In such cases, Yates’ correction for continuity is applied to maintain the character of continuity of the distribution.

The formula for the Chi-square test with Yates’ correction is:

$$\chi^2 = \sum \frac{(|O - E| - 0.5)^2}{E}$$

where O = observed and E = expected frequencies.
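The effect of the continuity correction is easiest to see on a small table. The sketch below uses hypothetical 2x2 counts (not data from this programme) and computes the chi-square statistic with and without subtracting 0.5 from each absolute deviation.

```python
def chi_square_2x2(table, yates=False):
    """Chi-square statistic for a 2x2 table, optionally Yates-corrected."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            diff = abs(observed - expected)
            if yates:
                # Shrink each deviation by 0.5 (floored at 0, a common
                # safeguard that is an assumption of this sketch)
                diff = max(diff - 0.5, 0.0)
            stat += diff ** 2 / expected
    return stat

# Hypothetical 2x2 counts (N = 40), chosen only for illustration
table = [[12, 8], [5, 15]]
uncorrected = chi_square_2x2(table)
corrected = chi_square_2x2(table, yates=True)
print(round(uncorrected, 2), round(corrected, 2))  # 5.01 3.68
```

With df = 1 and α = 0.05 the critical value is 3.841, so in this made-up table the uncorrected statistic would reject the null hypothesis while the corrected one would not.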
Phi-Coefficient
• It is only used on 2x2 contingency tables.

• Interpreted as a measure of the relative strength of an association between two variables, ranging from 0 to 1.

$$\phi = \sqrt{\frac{\chi^2}{n}} = \frac{|ad - bc|}{\sqrt{(a+b)(a+c)(c+d)(b+d)}}$$

n = total number of observations

Sex       Smoking        Total
          Yes     No
Male      a       b      a+b
Female    c       d      c+d
Total     a+c     b+d    n
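As a quick check of the cell-count formula, the sketch below computes phi for a 2x2 table with a = 3, b = 2, c = 2, d = 3 (the gender by smoking counts). The function name is illustrative, and the absolute value is taken so that the result is a magnitude between 0 and 1.

```python
import math

def phi_coefficient(a, b, c, d):
    """Phi for the 2x2 table [[a, b], [c, d]], reported as a magnitude."""
    denom = math.sqrt((a + b) * (a + c) * (c + d) * (b + d))
    return abs(a * d - b * c) / denom

phi = phi_coefficient(3, 2, 2, 3)
print(round(phi, 2))  # 0.2
```

The same value follows from the chi-square form: for this table χ² = 0.4 and n = 10, and the square root of 0.4/10 is 0.2.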
Pearson’s Contingency Coefficient (C)

• It is interpreted as a measure of the relative strength of an association between two variables.

• The coefficient will always be less than 1 and varies according to the number of rows and columns.

• This can be used for general r x c tables.

• It ranges between 0 and 1.

$$C = \sqrt{\frac{\chi^2}{n + \chi^2}} = \sqrt{\frac{\phi^2}{1 + \phi^2}}$$
Cramer’s V Coefficient (V)
• It is useful for comparing multiple χ² test statistics and is generalizable across tables of varying sizes.

• It is not affected by sample size and is therefore very useful in situations where you suspect that a statistically significant chi-square was the result of a large sample size instead of any substantive relationship between the variables.

• It is interpreted as a measure of the relative strength of an association between variables.

• The coefficient ranges from 0 (no association) to 1 (perfect association).

• In practice, you may find that a Cramer’s V of 0.10 provides a good minimum threshold for suggesting there is a substantive relationship between two variables.

$$V = \sqrt{\frac{\chi^2}{n(q - 1)}}$$

Where q = the smaller of the number of rows and the number of columns
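Both Cramér's V and Pearson's contingency coefficient are simple transformations of the chi-square statistic. The sketch below applies them to the 3 x 3 education by screening-readiness counts; the helper name `chi_square_stat` is illustrative.

```python
import math

def chi_square_stat(observed):
    """Pearson chi-square statistic for an r x c table of counts."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    return sum((o - r * c / n) ** 2 / (r * c / n)
               for row, r in zip(observed, row_totals)
               for o, c in zip(row, col_totals))

# Rows: above average, average, below average education
# Columns: very eager, pretty, not too
observed = [[164, 233, 26], [293, 473, 117], [132, 383, 172]]
n = sum(map(sum, observed))
chi2 = chi_square_stat(observed)
q = min(len(observed), len(observed[0]))   # smaller of rows and columns

cramers_v = math.sqrt(chi2 / (n * (q - 1)))
contingency_c = math.sqrt(chi2 / (n + chi2))
print(f"chi-square = {chi2:.1f}, V = {cramers_v:.2f}, C = {contingency_c:.2f}")
```

For this table the chi-square statistic is far above the df = 4 critical value of 9.488, yet V comes out at roughly 0.16, which illustrates how a large sample can give a highly significant result for a modest association.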
McNemar’s Test

• Used in the case of two paired/related samples or when there are repeated measurements.
• It can be used to test for the significance of changes in “before-after” designs in
which each person is used as his own control.

• Thus, the test can be used


❖ to test the effectiveness of a treatment /training/ program/ therapy/intervention
or
❖ to compare the ratings of two judges on the same set of individuals.
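The slides do not give the test statistic, so the sketch below assumes the standard uncorrected form, in which only the discordant pairs (subjects whose status changed) contribute and the result is compared with the chi-square distribution on 1 degree of freedom. The counts are hypothetical.

```python
# Hypothetical before/after results for the same 50 patients
#                   after: positive   after: negative
# before positive          20                15   (b: changed + to -)
# before negative           5                10   (c: changed - to +)
b, c = 15, 5  # only the discordant pairs enter the statistic

mcnemar_stat = (b - c) ** 2 / (b + c)   # compared with chi-square, df = 1
critical = 3.841                        # table value for alpha = 0.05
print(mcnemar_stat, "reject H0" if mcnemar_stat > critical else "do not reject H0")
```

A continuity-corrected variant, (|b – c| – 1)²/(b + c), is also widely used when the number of discordant pairs is small.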

Kappa Statistic

▪ The Kappa Statistic measures the agreement between the evaluations of two examiners when both are rating the same objects.

▪ It describes agreement achieved beyond chance as a proportion of the agreement that is possible beyond chance.

▪ The value of the Kappa Statistic ranges from -1 to 1, with larger values indicating better reliability.
A value of 1 indicates perfect agreement.
A value of 0 indicates that agreement is no better than chance.

▪ Generally, a Kappa > 0.60 is considered satisfactory.


Kappa Statistic

$$\kappa = \frac{P_0 - P_E}{1 - P_E}$$

Where:
P_0 = proportion of observed agreement
P_E = proportion of expected agreement by chance

0.00 Agreement is no better than chance


0.01-0.20 Slight agreement
0.21-0.40 Fair agreement
0.41-0.60 Moderate agreement
0.61-0.80 Substantial agreement
0.81-0.99 Almost perfect agreement
1.00 Perfect agreement
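Kappa can be computed directly from an agreement table. The sketch below uses hypothetical ratings from two examiners; observed agreement is the share of cases on the diagonal, and chance agreement comes from the products of the marginal proportions.

```python
def cohen_kappa(table):
    """Kappa for a square agreement table (rows: rater A, columns: rater B)."""
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    # Observed agreement: proportion of cases on the diagonal
    p_o = sum(table[i][i] for i in range(len(table))) / n
    # Chance agreement: sum of products of the marginal proportions
    p_e = sum(r * c for r, c in zip(row_totals, col_totals)) / n ** 2
    return (p_o - p_e) / (1 - p_e)

# Hypothetical ratings of 50 cases by two examiners (normal / abnormal)
kappa = cohen_kappa([[22, 3], [8, 17]])
print(round(kappa, 2))  # 0.56
```

Here the examiners agree on 39 of 50 cases (0.78) against 0.50 expected by chance, giving a kappa of 0.56, which falls in the moderate agreement band.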

Initiating the categorical data analysis

4. Advanced Analysis

• Logistic regression: Appropriate when the outcome is categorical (most often binary) and the predictors are categorical, continuous, or both.

• Chi-square automatic interaction detector (CHAID): This method builds decision trees using categorical data, often used in market research and healthcare to identify significant predictors.
3. Logistic Regression

• Binary Logistic Regression: Used when the dependent variable has two categories (e.g., success/failure). It estimates the probability of an outcome based on one or more predictor variables.

• Multinomial Logistic Regression: Used when the dependent variable has more than two categories, but the categories do not have an inherent order.

• Ordinal Logistic Regression: Used when the dependent variable has ordered categories (e.g., low, medium, high).
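For the simplest binary case, one binary outcome and one binary predictor, the fitted logistic model can be written down directly from a 2x2 table, which shows how its coefficients relate to odds. The sketch below assumes the gender by smoking counts (male 3 smokers and 2 non-smokers, female 2 and 3) and codes male = 1, female = 0; models with more predictors need iterative fitting in a statistics package.

```python
import math

# Smokers / non-smokers by gender
male_yes, male_no = 3, 2
female_yes, female_no = 2, 3

# With a single binary predictor the maximum likelihood estimates have a
# closed form based on the sample odds
intercept = math.log(female_yes / female_no)   # log-odds of smoking, females
slope = math.log((male_yes / male_no) / (female_yes / female_no))  # log odds ratio

def predicted_probability(male):
    """P(smoker) from the fitted logistic model (male = 1, female = 0)."""
    return 1 / (1 + math.exp(-(intercept + slope * male)))

odds_ratio = math.exp(slope)
print(round(odds_ratio, 2))
print(round(predicted_probability(1), 2), round(predicted_probability(0), 2))
```

The exponentiated slope is the odds ratio (2.25 here), and the predicted probabilities reproduce the observed proportions of smokers, 0.6 for males and 0.4 for females.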
Initiating the categorical data analysis
5. Test for Homogeneity of proportions or Independence

• Chi-square test for association (2x2): tests whether two categorical variables are associated
• Chi-square test of independence (R x C): used to test a variety of
sizes of contingency tables
• Chi-square goodness-of-fit test: whether the distribution of cases
in a single categorical variable follows a known/hypothesised
distribution
• Chi-square test of homogeneity: whether the proportions in each
group are equal in the population
• Fisher’s exact test: If sample sizes are small, this test can be more
accurate than the chi-square test for testing independence.
Chi-square goodness-of-fit test
It is also called Pearson's chi-square goodness-of-fit test.

The chi-square goodness-of-fit test is a single-sample nonparametric test.

Q: How "close" are the observed values to those which would be expected in a study?
OR
Q: An administrator at a hospital may want to determine whether an equal number of people are hospitalised each day of the week to better plan staffing levels.

Expected frequency can be based on
• theory
• previous experience
• comparison groups
Example: Are cancer-related deaths affected by seasonal variations?
Null Hypothesis: The proportion of deaths due to cancer in winter, summer, autumn and spring is equal = ¼ = 25%
Alternative: Not all of the proportions stated in the null hypothesis are correct

Cancer deaths   Observed   Expected = 322*1/4
Summer          78         80.5
Spring          71         80.5
Autumn          87         80.5
Winter          86         80.5
Total           322

Degree of freedom = k-1 = 4-1 = 3
For α = 0.05 and df = 3, critical value χ² = 7.81

$$\chi^2 = \frac{(78-80.5)^2}{80.5} + \frac{(71-80.5)^2}{80.5} + \frac{(87-80.5)^2}{80.5} + \frac{(86-80.5)^2}{80.5} = 2.10$$

Conclusion: As the calculated χ² value is less than the critical value, we do not reject the null hypothesis and state that deaths due to cancer across seasons are not statistically different from what's expected by chance (i.e. all seasons being equal).
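The seasonal example can be verified in a few lines. The sketch below uses the observed counts from the table, with equal expected counts under the null hypothesis and the critical value 7.81 for df = 3.

```python
observed = {"Summer": 78, "Spring": 71, "Autumn": 87, "Winter": 86}

n = sum(observed.values())        # 322 deaths in total
expected = n / len(observed)      # equal share per season under H0
chi2 = sum((o - expected) ** 2 / expected for o in observed.values())
df = len(observed) - 1
critical = 7.81                   # chi-square table value for df = 3, alpha = 0.05

print(f"chi-square = {chi2:.2f}, df = {df}")
print("reject H0" if chi2 > critical else "do not reject H0")
```

The squared deviations sum to 169, and 169/80.5 is approximately 2.10, well below the critical value.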
Chi-square for independence
It focuses on contingency tables that are greater than 2 x 2, which
are often referred to as r x c contingency tables.

It tests whether two variables measured at the nominal level are independent (i.e., whether there is an association between the two variables).
Other analytical approaches

• Probit Regression: Similar to logistic regression, but it assumes a normal cumulative distribution function instead of a logistic one, often used in binary outcome models.

• Cochran-Mantel-Haenszel Test: Tests for an association between two categorical variables while controlling for a third variable (stratification).

• Log-Linear Models: Used to model the relationships between three or more categorical variables by modeling the logarithm of the expected cell frequencies in a contingency table.

• Cluster Analysis (for Categorical Data): Methods like k-modes or latent class analysis (LCA) group observations into clusters based on categorical attributes.
Initiating the categorical data analysis
6. Interpretation

• Analyse p-values (typically < 0.05 for significance).

• Assess effect sizes (e.g., Cramér's V for associations).

• Interpret visualisations (e.g., bar plots, mosaic plots).

