Chi SQ Tutorial

The Chi-square test for independence assesses whether an observed distribution of categorical data is due to chance, comparing it to an expected distribution under the null hypothesis of independence. It requires careful construction of categories and sufficient data, with degrees of freedom calculated based on the grid's dimensions. The test results in a Chi-square value and a p-value: the probability, if the variables really are independent, of seeing a discrepancy between observed and expected counts at least as large as the one in the data.

Uploaded by

mrkeem4real

Chi-square Test for Independence

What is the Chi-square test for?

The Chi-square test is intended to test how likely it is that an observed distribution
is due to chance. It is also called a "goodness of fit" statistic, because it
measures how well the observed distribution of data fits with the distribution that
is expected if the variables are independent.

A Chi-square test is designed to analyze categorical data. That means that the
data has been counted and divided into categories. It will not work with
continuous measurement data (such as height in inches). For example, if you want
to test whether attending class influences how students perform on an exam,
using test scores (from 0-100) as data would not be appropriate for a Chi-square
test. However, arranging students into the categories "Pass" and "Fail" would.
Additionally, the data in a Chi-square grid should not be in the form of
percentages, or anything other than frequency (count) data. Thus, by dividing a
class of 54 into groups according to whether they attended class and whether
they passed the exam, you might construct a data set like this:

            Pass    Fail
Attended    25      6
Skipped     8       15

IMPORTANT: Be very careful when constructing your categories! A Chi-square
test can tell you information based on how you divide up the data. However, it
cannot tell you whether the categories you constructed are meaningful. For
example, if you are working with data on groups of people, you can divide them
into age groups (18-25, 26-40, 41-60...) or income level, but the Chi-square test
will treat the divisions between those categories exactly the same as the
divisions between male and female, or alive and dead! It's up to you to assess
whether your categories make sense, and whether the difference (for example)
between age 25 and age 26 is enough to make the categories 18-25 and 26-40
meaningful. This does not mean that categories based on age are a bad idea, but
only that you need to be aware of the control you have over organizing data of
that sort.

Another way to describe the Chi-square test is that it tests the null hypothesis that
the variables are independent. The test compares the observed data to a model
that distributes the data according to the expectation that the variables are
independent. The further the observed data departs from the model, the stronger
the evidence that the variables are dependent, and the better the case for
rejecting the null hypothesis!

The following table would represent a possible input to the Chi-square test, using
2 variables to divide the data: gender and party affiliation. 2x2 grids like this one
are often the basic example for the Chi-square test, but in actuality any size grid
would work as well: 3x3, 4x2, etc.

          Democrat    Republican
Male      20          30
Female    30          20

This shows the basic 2x2 grid. However, this is actually incomplete, in a sense;
generally, the data table should include "marginal" information giving the total
counts for each column and row, as well as for the whole data set:

          Democrat    Republican    Total
Male      20          30            50
Female    30          20            50
Total     50          50            100

We now have a complete data set on the distribution of 100 individuals into
categories of gender (Male/Female) and party affiliation (Democrat/Republican). A
Chi-square test would allow you to test how likely it is that gender and party
affiliation are completely independent; or in other words, how likely it is that the
distribution of males and females in each party is due to chance.

So, as implied, the null hypothesis in this case would be that gender and party
affiliation are independent of one another. To test this hypothesis, we need to
construct a model which estimates how the data should be distributed if our
hypothesis of independence is correct. This is where the totals we put in the
margins will become handy: later on, I'll show how you can calculate your
estimated data using the marginals. Meanwhile, however, I've constructed an
example which will allow very easy calculations. Assuming that there's a 50/50
chance of males or females being in either party, we get the very simple
distribution shown below.

          Democrat    Republican    Total
Male      25          25            50
Female    25          25            50
Total     50          50            100

This is the information we would need to calculate the likelihood that gender and
party affiliation are independent. I will discuss the next steps in calculating a Chi
square value later, but for now I'll focus on the background information.

Note: you can assume a different null hypothesis for a Chi-square test. Using the
scenario suggested above, you could test the hypothesis that women are twice as
likely as men to register as Democrats, and a Chi-square test would tell you how
well the observed data fits that relationship between your variables.
In this case, you would simply run the test using a model of expected data built
under the assumption that this hypothesis is true, and the formula will (as before)
test how well that distribution fits the observed data. I will not discuss this in more
detail, but it is important to know that the null hypothesis is not some abstract
"fact" about the test, but rather a choice you make when calculating your model.

What is the Chi-square test NOT for?

This is also an important question to tackle, of course. Using a statistical test
without having a good idea of what it can and cannot do means that you may
misuse the test, but also that you won't have a clear grasp of what your results
really mean. Even if you don't understand the detailed mathematics underlying
the test, it is not difficult to have a good comprehension of where it is or isn't
appropriate to use. I mentioned some of this above, when contrasting types of
data and so on. This section will consider other things that the Chi-square test is
not meant to do.

First of all, the Chi-square test is only meant to test whether the variables in a
distribution of data are independent. It will NOT tell you any details about the
relationship between them. If you want to calculate how much more likely it is
that a woman will be a Democrat than a man, the Chi-square test is not going to
be very helpful. However, once the Chi-square test has given you evidence that
the two variables are related, you can use other methods to
explore their interaction in more detail. For a fairly simple way of discussing the
relationship between variables, I recommend the odds ratio.
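
As a rough illustration of the odds ratio mentioned above, here is a minimal Python sketch applied to the gender/party grid from earlier. The function name odds_ratio and the row/column ordering are my own choices for the example, not part of the original tutorial.

```python
def odds_ratio(table):
    """Odds ratio for a 2x2 grid [[a, b], [c, d]]: (a/b) divided by (c/d)."""
    (a, b), (c, d) = table
    return (a * d) / (b * c)

# Rows: Male, Female; columns: Democrat, Republican (the observed grid above).
observed = [[20, 30], [30, 20]]

ratio = odds_ratio(observed)
print(round(ratio, 3))      # odds of being a Democrat for males relative to females
print(round(1 / ratio, 3))  # the same comparison seen from the female side
```

In this data set the odds of a male being a Democrat are 20/30 and the odds for a female are 30/20, so the second number printed says that the odds for females are 2.25 times the odds for males.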

Some further considerations are necessary when selecting or organizing your data
to run a Chi-square test. The categories you use must be mutually exclusive;
membership in one category should not entail or allow membership in another. In
other words, the data from all of your cells should add up to the total count, and
no item should be counted twice.

You should also never exclude some part of your data set. If your study
examined males and females registered as Republican, Democrat, and
Independent, then excluding one category from the grid might conceal critical
data about the distribution of your data.

It is also important that you have enough data to perform a viable Chi-square
test. As a rule of thumb, if the expected count in any given cell is below 5, then
there is not enough data for the Chi-square test to be reliable. In a case like this,
you should research some other techniques for smaller data sets: for example,
there is a correction for the Chi-square test to use with small 2x2 data sets, called
the Yates correction. There are also tests written specifically for smaller data sets,
like the Fisher Exact Test.

Degrees of Freedom

A broader description of this topic can be found here.

The degrees of freedom (often abbreviated as df or d) tell you how many
numbers in your grid are actually independent. For a Chi-square grid, the degrees
of freedom can be said to be the number of cells you need to fill in before, given
the totals in the margins, you can fill in the rest of the grid using a formula. You
can see the idea intended; if you have a given set of totals for each column and
row, then you don't have unlimited freedom when filling in the cells. You can only
fill in a certain number of cells with "random" numbers before the rest just
becomes dependent on making sure the cells add up to the totals. Thus, the
number of cells that can be filled in independently tells us something about the
actual amount of variation permitted by the data set.

The degrees of freedom for a Chi-square grid are equal to the number of rows
minus one times the number of columns minus one: that is, (R-1)*(C-1). In our
simple 2x2 grid, the degrees of freedom are therefore (2-1)*(2-1), or 1! Note
that once you have put a number into one cell of a 2x2 grid, the totals determine
the rest for you.

Degrees of freedom are important in a Chi-square test because they factor into
your calculation of the p-value. Once you calculate a Chi-square value, you use
this number and the degrees of freedom to find the p-value: the probability of
getting a Chi-square value at least that large if the variables are independent.
This is the crucial result of a Chi-square test, which means that knowing the
degrees of freedom is crucial!
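
To make the (R-1)*(C-1) rule concrete, here is a small Python sketch. The names degrees_of_freedom and fill_2x2 are hypothetical helpers written for this example. The second function shows why a 2x2 grid has only one free cell: given the marginal totals, one number fixes the other three.

```python
def degrees_of_freedom(n_rows, n_cols):
    """Degrees of freedom for an R x C grid: (R - 1) * (C - 1)."""
    if n_rows < 2 or n_cols < 2:
        raise ValueError("a grid needs at least 2 rows and 2 columns")
    return (n_rows - 1) * (n_cols - 1)

def fill_2x2(top_left, row_totals, col_totals):
    """Complete a 2x2 grid from a single cell plus the marginal totals."""
    top_right = row_totals[0] - top_left
    bottom_left = col_totals[0] - top_left
    bottom_right = row_totals[1] - bottom_left
    return [[top_left, top_right], [bottom_left, bottom_right]]

print(degrees_of_freedom(2, 2))  # 1
print(degrees_of_freedom(3, 3))  # 4
print(degrees_of_freedom(4, 2))  # 3

# With row totals 50/50 and column totals 50/50, choosing 20 for the
# first cell reproduces the whole gender/party grid.
print(fill_2x2(20, row_totals=(50, 50), col_totals=(50, 50)))  # [[20, 30], [30, 20]]
```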

Building a Model of Expected Data

Earlier, I showed a simple example of observed vs. expected data, using an
artificial data set on the party affiliations of males and females. I show them again
below.

Observed

          Democrat    Republican    Total
Male      20          30            50
Female    30          20            50
Total     50          50            100

Expected (assuming independence)

          Democrat    Republican    Total
Male      25          25            50
Female    25          25            50
Total     50          50            100

We will focus on models based on the null hypothesis that the distribution of data
is due to chance -- that is, our models will reflect the expected distribution of data
when that hypothesis is assumed to be true. But as I mentioned before, the ease
of dividing up this data is due to the simplicity of the distribution I chose. How do
we calculate the expected distribution of a more complicated data set?

            Pass    Fail    Total
Attended    25      6       31
Skipped     8       15      23
Total       33      21      54

Here is the grid for an earlier example I discussed, showing how students who
attended or skipped class performed on an exam. The numbers for this example
are not so clean! Fortunately, we have a formula to guide us.

The expected count for each cell is the total for its row multiplied by the total for
its column, then divided by the total for the table: that
is, (RowTotal*ColTotal)/GridTotal. Thus, in our table above, the expected count
in cell (1,1) is (31*33)/54, or 18.94. Don't be afraid of decimals for your expected
counts; they're meant to be estimates!
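
The (RowTotal*ColTotal)/GridTotal rule can be sketched in a few lines of Python. The function name expected_counts and the list-of-rows layout are assumptions made for this example. It works for a grid of any size, not just 2x2.

```python
def expected_counts(observed):
    """Expected count per cell under independence: row total * column total / grid total."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grid_total = sum(row_totals)
    return [[r * c / grid_total for c in col_totals] for r in row_totals]

# Rows: Attended, Skipped; columns: Pass, Fail.
observed = [[25, 6], [8, 15]]
expected = expected_counts(observed)

for row in expected:
    print([round(e, 2) for e in row])
# [18.94, 12.06]
# [14.06, 8.94]
```

Note that the expected counts in each row and column add up to the same marginal totals as the observed counts, which is a quick way to check your arithmetic.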

I'll show a different method for notating observed versus expected counts below:
the expected frequency appears in parentheses below the observed frequency.
This allows you to show all your data in one clean table.

            Pass       Fail       Total
Attended    25         6          31
            (18.94)    (12.06)
Skipped     8          15         23
            (14.06)    (8.94)
Total       33         21         54

We have now calculated the distribution of our totals based on the assumption
that attending class will have absolutely no effect on your test performance. Let's
all hope we can prove this null hypothesis wrong.

The Chi-square Formula

It's finally time to put our data to the test. You can find many programs that will
calculate a Chi-square value for you, and later I will show you how to do it in
Excel. For now, however, let's start by trying to understand the formula itself.

$$\chi^2 = \sum_i \frac{(O_i - E_i)^2}{E_i}$$

What does this mean?? Actually, it's a fairly simple relationship. The variables
in this formula are not simply symbols, but actual concepts that we've been
discussing all along. O stands for the Observed frequency. E stands for
the Expected frequency. You subtract the expected count from the observed
count to find the difference between the two (also called the "residual"). You
calculate the square of that number to get rid of the sign
(because the squares of 5 and -5 are, of course, both 25). Then, you divide the
result by the expected frequency to put each difference on the scale of its own
cell (because a gap of 5 is a big surprise when you expected 10, and a trivial one
when you expected 1,000). The huge sigma sitting in front of all that
is asking for the sum over every i for which you calculate this relationship - in other
words, you calculate this for each cell in the table, then add it all together. And
that's it!

Using this formula, we find that the Chi-square value for our gender/party
example is ((20-25)^2/25) + ((30-25)^2/25) + ((30-25)^2/25) + ((20-25)^2/25),
or (25/25) + (25/25) + (25/25) + (25/25), or 1 + 1 + 1 + 1, which comes out to 4.
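
The same arithmetic can be written as a short Python sketch. The function name chi_square_statistic is my own choice. It simply sums (O - E)^2 / E over every cell, as the formula describes.

```python
def chi_square_statistic(observed, expected):
    """Sum of (O - E)^2 / E over every cell of the grid."""
    return sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(observed, expected)
        for o, e in zip(obs_row, exp_row)
    )

# Gender/party example: observed counts and the expected counts under independence.
observed = [[20, 30], [30, 20]]
expected = [[25, 25], [25, 25]]

print(chi_square_statistic(observed, expected))  # 4.0
```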

Okay, but what does THAT mean?? In a sense, not much yet. The Chi-square
value serves as input for the more interesting piece of information: the p-value.
Calculating a p-value is less intuitive than a Chi-square value, so I will not discuss
the actual formula here, but simply tools to use in calculating this data. We will
need the following to get a p-value for our data:

(1) The Chi-square value.

(2) The degrees of freedom.

Once you have this information, there are a couple of methods you can use to get
your p-value. For example, charts like this one or even Javascript programs like
the one on this site will take the Chi-square value and degrees of freedom as
input, and simply return a p-value. In the chart, you choose your degrees of
freedom (df) value on the left, follow along its row to the closest number to your
Chi-square value, and then check the corresponding number in the top row to see
the approximate probability ("Significance Level") for that value. The Javascript
program is more direct, as you simply input your numbers and click "calculate."
Later, I will also show you how to make Excel do the work for you.

So, for our example, we take a Chi-square value of 4 and a df of 1, which gives us
a p-value of 0.0455. This is often loosely read as a 4.6% likelihood that the null
hypothesis is correct, but that is not quite what it says. To put it best, if the
distribution of this data is due
entirely to chance, then you have a 4.6% chance of finding a discrepancy
between the observed and expected distributions that is at least this
extreme.
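
If you would rather not rely on a chart, the p-value for the special case of 1 degree of freedom can be computed with the complementary error function in Python's standard library. This is a minimal sketch under that assumption only: the function name p_value_df1 is hypothetical, and the shortcut does not apply to grids with more than 1 degree of freedom.

```python
import math

def p_value_df1(chi_square):
    """Upper-tail probability of a Chi-square distribution with 1 degree of freedom.

    With df = 1 the Chi-square value is the square of a standard normal
    variable, so the tail probability reduces to erfc(sqrt(x / 2)).
    """
    if chi_square < 0:
        raise ValueError("a Chi-square value cannot be negative")
    return math.erfc(math.sqrt(chi_square / 2))

print(round(p_value_df1(4), 4))  # 0.0455
```

For larger grids you would need the general Chi-square distribution, which is what the charts and calculators mentioned above provide.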

By convention, the "cutoff" point for a p-value is 0.05; anything below that can be
considered a very low probability, while anything above it is considered a
reasonable probability. However, that does not mean that we should take our
0.046 value and say, "Eureka! They're dependent!" Actually, 0.046 is so close to
0.05 that there's really not much we can say from this example; it is teetering
right on the brink of chance. This is a very good thing to realize, because from this
we discover that although the distribution seems to have fairly clear tendencies in
certain directions when you just look at it, the data shows that it's not so unlikely
that this would show up just by chance.

So, let's try our other data set, and see if attending class really does affect your
exam performance.

            Pass       Fail       Total
Attended    25         6          31
            (18.94)    (12.06)
Skipped     8          15         23
            (14.06)    (8.94)
Total       33         21         54

I'm going to skip the specific formula this time, and use the javascript program
on this site to do the calculation for me. It returns a value of 11.686. We still only
have 1 degree of freedom, so our p-value is calculated as 0.0006. In other words,
if this distribution was due to chance, we would see a discrepancy at least this
extreme only 0.06% of the time! A value of 0.0006 is a much lower probability than
a value of 0.05. We can thus confidently reject the null hypothesis; attending class
and passing the exam are very unlikely to be independent of one another. (Of
course, if you are testing a null hypothesis that you are expecting to be correct,
then you would hope for a high p-value, although a high p-value only fails to
contradict the hypothesis; it does not prove it. The reason we want a low one in
this case is because we are trying to reject the hypothesis that the variables are
independent.)
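
For readers who want to check these numbers themselves, the whole procedure for a 2x2 grid can be sketched in Python using only the standard library. This is not the javascript program referred to above. The function name chi_square_2x2 is hypothetical, and the sketch is limited to 2x2 grids because the p-value shortcut it uses is only valid for 1 degree of freedom.

```python
import math

def chi_square_2x2(observed):
    """Return (chi_square, p_value) for a 2x2 grid of observed counts."""
    if len(observed) != 2 or any(len(row) != 2 for row in observed):
        raise ValueError("this sketch only handles 2x2 grids")

    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grid_total = sum(row_totals)

    chi_square = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / grid_total  # expected count
            chi_square += (o - e) ** 2 / e

    # With 1 degree of freedom, the upper-tail probability is erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi_square / 2))
    return chi_square, p_value

# Rows: Attended, Skipped; columns: Pass, Fail.
chi_square, p_value = chi_square_2x2([[25, 6], [8, 15]])
print(round(chi_square, 3))  # 11.686
print(round(p_value, 4))     # 0.0006
```

Running the same function on the gender/party grid, [[20, 30], [30, 20]], gives a Chi-square value of 4 and a p-value of about 0.0455, matching the earlier example.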

This is all you need to know to calculate and understand Pearson's Chi-square test
for independence. It's a widely popular test because once you know the formula, it
can all be done on a pocket calculator, and then compared to simple charts to
give you a probability value. You can also use this spreadsheet to play around
with all the steps of the test (spreadsheet created by Bill Labov, with some small
additions by Joel Wallenberg). The Chi-square test will prove to be a handy tool for
analyzing all kinds of relationships; once you know the basics for a 2x2 grid,
expanding to a larger set of values is easy. Good luck!
