Different Types of Correlation
Key terms
Categorical variable: A categorical variable represents qualitative
data that falls into distinct categories or groups. These categories
are typically non-numerical and can be nominal or ordinal in
nature.
• Example: Let's consider a variable called "Hair Color" with
categories such as "Blonde," "Brunette," "Redhead," and
"Black." Hair color is a categorical variable because it
represents different categories or groups rather than numerical
values.
Continuous variable: A continuous variable represents
quantitative data that can take on any numerical value within a
certain range. It can have an infinite number of possible values,
and measurements can be made with great precision.
• Example: "Height" is a continuous variable because it can take
on any numerical value within a range (e.g., 150 cm, 157.2 cm,
162.8 cm). Height can be measured with precision, and there is
no limit to the number of possible values within the range.
Dichotomous variable: A dichotomous variable represents data
that has only two distinct categories or options. It is a special
case of a categorical variable where there are precisely two
mutually exclusive and exhaustive options.
• Example: "Gender" is treated as a dichotomous variable when it is
recorded with only two categories: "Male" and "Female." Each
individual is then assigned to exactly one of these two
categories.
Quantitative variable: A quantitative variable represents numerical
data that can be measured or counted on a numeric scale. It is a
type of variable that takes on numerical values and can be
subjected to mathematical operations such as addition,
subtraction, multiplication, and division. Quantitative variables
can be further classified into interval and ratio variables.
• Example: "Age" is a quantitative variable as it represents a
numeric value that can be measured. For example, someone's
age can be 25 years, 40 years, or 60 years. Age can be
quantitatively compared, and arithmetic operations can be
performed on it.
Binary variable: A binary variable is a special type of categorical
variable that has only two categories or options. It represents
data that can be classified into two mutually exclusive and
exhaustive groups.
• Example: "Smoking status" is a binary variable as it has two
categories: "Smoker" and "Non-smoker." Individuals can be
classified as either smokers or non-smokers, and there are no
other possibilities. Binary variables are often represented using
numerical codes like 0 and 1, where 0 may represent one
category (e.g., non-smoker) and 1 represents the other category
(e.g., smoker).
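• The 0/1 coding described above can be sketched in a few lines of Python. The responses below are made-up illustration data, and the names `responses`, `codes`, and `smoker` are hypothetical.

```python
# Hypothetical survey responses (illustrative data, not from the slides).
responses = ["Smoker", "Non-smoker", "Non-smoker", "Smoker", "Non-smoker"]

# Map each category to a numeric code: 0 = non-smoker, 1 = smoker.
codes = {"Non-smoker": 0, "Smoker": 1}
smoker = [codes[r] for r in responses]

print(smoker)                     # [1, 0, 0, 1, 0]
print(sum(smoker) / len(smoker))  # proportion of smokers: 0.4
```

• With this coding, the mean of the variable is simply the proportion of cases in the category coded 1.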
Categorical vs Binary
• Categorical variables can have more than two categories, whereas
binary variables specifically have only two categories.
• Categorical variable: A categorical variable represents qualitative data that
falls into distinct categories or groups. These categories can have more
than two options and can be either nominal or ordinal in nature.
• Example: "Favorite color" is a categorical variable. The options can include
"Red," "Blue," "Green," "Yellow," and so on. Here, the variable has multiple
categories beyond just two.
• Binary variable: A binary variable is a special case of a categorical
variable that has only two distinct categories or options. It represents data
that can be classified into two mutually exclusive and exhaustive groups.
• Example: "Marital status" is a binary variable because it has two categories:
"Married" and "Unmarried." In this case, there are only two possible
options, and an individual can be classified into one of the two categories.
ASSUMPTIONS
LINEARITY
• Linearity: Linearity refers to the assumption that there is a linear
relationship between two variables. It assumes that each one-unit
change in one variable is accompanied by a constant change in the
other. In other words, the relationship between the variables can be
represented by a straight line.
• For example, if one variable increases by a certain amount, the other
variable changes by a fixed multiple of that amount.
• Imagine a scenario where we examine the relationship between the
number of hours studied and the test scores of a group of students. If
there is linearity between these variables, it means that as the
number of hours studied increases, the test scores also increase
consistently. For example, a student who studies for 2 hours may
score 70, while a student who studies for 4 hours may score 80.
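• One rough way to look at linearity is to fit a straight line and inspect what is left over. The sketch below uses made-up study data chosen to match the example above (5 points per extra hour); `fit_line` is a hypothetical helper, not a library function.

```python
def fit_line(x, y):
    """Least-squares slope and intercept for y = intercept + slope * x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Hypothetical data: hours studied and test scores.
hours = [1, 2, 3, 4, 5]
scores = [65, 70, 75, 80, 85]

slope, intercept = fit_line(hours, scores)
# Residuals = observed score minus the score predicted by the line.
residuals = [s - (intercept + slope * h) for h, s in zip(hours, scores)]

print(slope, intercept)  # 5.0 60.0
print(residuals)         # all zeros: the points lie exactly on the line
```

• With real data the residuals will not be zero; a curved pattern in them (for example, all positive in the middle and negative at the ends) would suggest the relationship is not linear.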
HOMOSCEDASTICITY
• Homoscedasticity: Homoscedasticity refers to the assumption that the
variability of the residuals (the differences between observed and predicted
values) is constant across all levels of the independent variable(s). It
means that the spread or dispersion of the residuals is similar for different
values of the independent variable(s).
• Homoscedasticity means that the spread or variability of the data is roughly
the same across different levels or values of a variable. In simpler terms, it
suggests that the points on a scatter plot have a similar amount of scatter
or dispersion around the line of best fit.
• Let's consider a situation where we investigate the relationship between the
number of years of work experience and salary. If there is
homoscedasticity, it means that the variation in salaries is similar
regardless of the number of years of work experience. In simpler terms, it
suggests that individuals with both low and high levels of work experience
have a similar spread or variability in their salaries.
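• A simple, informal check of homoscedasticity is to compare the spread of the residuals for low and high values of the predictor. The sketch below is only an illustration with made-up salary data (in thousands); `fit_line` and `spread_ratio` are hypothetical helpers, and a ratio near 1 is a rule of thumb, not a formal test.

```python
from statistics import variance

def fit_line(x, y):
    """Least-squares slope and intercept for y = intercept + slope * x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def spread_ratio(x, y):
    """Residual variance in the upper half of x divided by the lower half."""
    slope, intercept = fit_line(x, y)
    pairs = sorted(zip(x, y))  # order the cases by x
    residuals = [b - (intercept + slope * a) for a, b in pairs]
    half = len(residuals) // 2
    return variance(residuals[half:]) / variance(residuals[:half])

# Hypothetical data: years of experience and salary (thousands).
years = [1, 2, 3, 4, 5, 6, 7, 8]
even_salary = [37, 38, 43, 52, 57, 58, 63, 72]     # similar scatter throughout
fanning_salary = [36, 39, 44, 51, 59, 56, 61, 74]  # scatter grows with experience

print(round(spread_ratio(years, even_salary), 2))     # 1.0
print(round(spread_ratio(years, fanning_salary), 2))  # 16.0
```

• A ratio close to 1 is consistent with homoscedasticity; a ratio far from 1 means the salaries fan out (or narrow) as experience increases.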
INDEPENDENCE
• Independence: Independence assumes that the observations or data
points being analyzed are independent of each other. Independence
means that the values of one observation do not depend on or
influence the values of other observations. Violation of independence
can lead to biased or unreliable statistical results.
• Independence means that each data point or observation is not
influenced by or related to other data points. Each observation is
considered separate and unrelated. This assumption allows for
unbiased and reliable statistical analyses.
• Suppose we conduct a survey to examine the relationship between
people's favorite colors and their preferred modes of transportation.
Each person's choice of favorite color and mode of transportation
should be independent of others. In other words, one person's
favorite color being blue does not affect another person's preference
for traveling by car.
NORMALITY
• Normality assumes that the data or variables being analyzed follow a
normal distribution. A normal distribution is a symmetrical bell-shaped
distribution characterized by a specific mean and standard deviation. It is
important for certain statistical methods, such as parametric tests, that
assume normality of the data.
• Normality means that the data or variables follow a symmetrical
bell-shaped distribution, like a typical "normal" curve. In simpler terms, it
means that most of the data is concentrated around the mean (average)
and tapers off towards the tails. Many statistical methods assume that the
data follows a normal distribution.
• Let's consider the heights of adult males in a population. If the heights
follow a normal distribution, it means that most men will have heights
around the average height, with fewer men being extremely tall or
extremely short. This distribution would resemble a bell-shaped curve, with
the majority of heights falling close to the mean height.
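• As a rough illustration of "most of the data is concentrated around the mean": in a normal distribution about 68% of values fall within one standard deviation of the mean. The sketch below compares that benchmark with a small made-up sample of heights; `share_within` is a hypothetical helper, and this is an informal check rather than a formal normality test.

```python
from statistics import NormalDist, mean, stdev

def share_within(data, k=1.0):
    """Proportion of values within k standard deviations of the mean."""
    m, s = mean(data), stdev(data)
    return sum(1 for v in data if abs(v - m) <= k * s) / len(data)

# Benchmark from the standard normal curve: area between -1 and +1 SD.
expected = NormalDist().cdf(1) - NormalDist().cdf(-1)

# Hypothetical heights in cm (illustrative data).
heights = [160, 166, 170, 172, 174, 175, 175, 176, 178, 180, 184, 190]

print(round(expected, 3))               # 0.683
print(round(share_within(heights), 3))  # 0.667
```

• Here 8 of the 12 heights lie within one standard deviation of the mean, which is close to the 68% expected under normality.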
MONOTONICITY
• Monotonicity refers to the assumption that there is a consistent direction of
relationship between two variables. It means that as values of one variable
increase or decrease, the values of the other variable consistently increase
or decrease, but not necessarily at a constant rate. Monotonicity does not
require a linear relationship but rather a consistent trend in the relationship.
• Monotonicity suggests that there is a consistent trend or pattern in the
relationship between two variables, regardless of whether it is a straight line
or not. It means that as one variable increases (or decreases), the other
variable consistently increases (or decreases) as well, although not
necessarily at a constant rate.
• Imagine studying the relationship between the amount of time spent
exercising per week and body weight. If there is monotonicity, it means that
as the amount of exercise increases, the body weight consistently
decreases (or vice versa). This pattern would hold regardless of the specific
values; it could be a gradual or steep decline, but the relationship maintains
a consistent direction.
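• Monotonicity can be checked directly: order the cases by one variable and see whether the other only goes up or only goes down. A minimal sketch with made-up exercise and body-weight data follows; `is_monotonic` is a hypothetical helper and assumes the x values are distinct.

```python
def is_monotonic(x, y):
    """True if y never changes direction when the cases are ordered by x."""
    ordered = [b for _, b in sorted(zip(x, y))]
    steps = [b - a for a, b in zip(ordered, ordered[1:])]
    return all(s >= 0 for s in steps) or all(s <= 0 for s in steps)

# Hypothetical data: weekly exercise hours and body weight in kg.
exercise = [0, 1, 2, 3, 4, 5]
steady_decline = [90, 84, 80, 78, 77, 76.5]  # always falls, but not in a straight line
mixed_pattern = [90, 84, 86, 78, 77, 80]     # falls, rises, falls again

print(is_monotonic(exercise, steady_decline))  # True
print(is_monotonic(exercise, mixed_pattern))   # False
```

• The first pattern is monotonic without being linear, which is the situation where rank-based coefficients are preferred over Pearson r.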
A. Pearson Correlation Coefficient (Pearson r)
• Most common type of correlation used to measure STRENGTH AND
DIRECTION of a linear relationship.
• Ranges from +1 to -1.
• Positive value indicates positive correlation
• Negative value indicates negative correlation
• A value of 0 indicates no linear relationship (a curved relationship can still be present)
Example: Pearson
• Let's say we want to understand the relationship between study time
and test scores. We collect data on how many hours students study
(continuous variable) and their corresponding test scores (continuous
variable). By calculating the Pearson correlation coefficient, we can
determine if there is a positive or negative relationship between study
time and test scores.
• Suppose we want to understand the relationship between hours of
sleep and cognitive performance. We measure the number of hours
participants sleep (continuous variable) and their scores on a
cognitive test (continuous variable). By calculating the Pearson
correlation coefficient, we can determine if there is a correlation
between sleep duration and cognitive performance.
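• The calculation behind Pearson r can be sketched as follows. The data are made up for illustration, and `pearson_r` is a hypothetical helper written out by hand to show the steps.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson r: co-variation of x and y divided by the product of their spreads."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Hypothetical data: hours studied and test scores.
hours = [1, 2, 3, 4, 5]
scores = [62, 70, 71, 80, 85]
r_study = pearson_r(hours, scores)
print(round(r_study, 2))  # 0.98: strong positive linear relationship

# A perfect U-shaped relationship still gives r = 0,
# because Pearson r only picks up the straight-line part.
x = [-2, -1, 0, 1, 2]
y = [4, 1, 0, 1, 4]
r_curve = pearson_r(x, y)
print(round(r_curve, 2))  # 0.0
```

• The second example shows why a value of 0 should be read as "no linear relationship" rather than "no relationship at all".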
B. Spearman’s Rank Correlation Coefficient (Spearman rho)
• This correlation measures the strength and direction of a monotonic
relationship between variables.
• It is used when the variables are not normally distributed or when
the relationship is nonlinear but still monotonic.
• It is based on the ranks of the data rather than the actual values.
• It's like when you and your friends line up based on your
height, and then we see if the order you're standing in
matches another line of the same friends based on their
shoe size.
Example: Spearman
• Suppose we want to examine the relationship between
self-esteem and social anxiety. We ask participants to rate their
self-esteem and social anxiety on a scale from 1 to 10. By
calculating the Spearman's rank correlation coefficient, we can
determine if there is a consistent pattern in how self-esteem and
social anxiety are ranked among the participants.
• Let's say we want to examine the relationship between
self-reported happiness levels and stress levels. Participants
rate their happiness and stress levels on a scale from 1 to 10.
By calculating the Spearman's rank correlation coefficient, we
can determine if there is a consistent pattern in how happiness
and stress levels are ranked among the participants.
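• Spearman's rho can be sketched as "rank both variables, then compute Pearson r on the ranks". The ratings below are made up, and `average_ranks`, `pearson_r`, and `spearman_rho` are hypothetical helpers; tied values share the average of their ranks.

```python
from math import sqrt

def average_ranks(values):
    """Rank from 1 (smallest) upward; tied values get the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson_r(x, y):
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def spearman_rho(x, y):
    """Spearman's rho = Pearson r computed on the ranks."""
    return pearson_r(average_ranks(x), average_ranks(y))

print(average_ranks([10, 20, 20, 30]))  # [1.0, 2.5, 2.5, 4.0]

# Hypothetical 1-10 ratings: higher self-esteem goes with lower anxiety.
self_esteem = [2, 4, 5, 7, 9]
anxiety = [9, 8, 6, 3, 1]
print(spearman_rho(self_esteem, anxiety))  # -1.0

# A curved but always-increasing relationship still gives rho = 1.
print(spearman_rho([1, 2, 3, 4, 5], [1, 8, 27, 64, 125]))  # 1.0
```

• Because only the ordering matters, the curved example reaches 1 even though the points do not lie on a straight line.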
C. Kendall’s Rank Correlation Coefficient (Kendall tau)
• It measures the strength and direction of a monotonic relationship
between variables.
• It also uses the ranks of the data and is useful when dealing with
ranked or ordinal data.
• It's like when you and your friends line up based on your
age, and then we see if the order you're standing in
matches another line of the same friends based on how
much they like ice cream.
Example: Kendall
• Let's consider the relationship between age and preference for video
game genres. We rank participants by age (e.g., 1 for
youngest, 2 for second youngest) and by how strongly they prefer a
given video game genre (e.g., action, adventure, or puzzle). Using
Kendall's rank correlation coefficient, we can explore if there is a
consistent pattern in how age and video game genre preferences are
ranked.
• Suppose we want to explore the relationship between exercise
frequency and levels of anxiety. Participants rank their exercise
frequency (e.g., never, occasionally, regularly) and their anxiety
levels (e.g., low, medium, high). Using Kendall's rank correlation
coefficient, we can determine if there is a consistent pattern in how
exercise frequency and anxiety levels are ranked.
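• Kendall's tau compares every pair of participants and counts whether the two variables order them the same way (concordant) or opposite ways (discordant). The sketch below uses the tau-b variant, which adjusts for ties and so suits ordinal categories; that choice, the data, and the helper `kendall_tau_b` are assumptions for illustration.

```python
from itertools import combinations
from math import sqrt

def kendall_tau_b(x, y):
    """Kendall's tau-b from concordant, discordant, and tied pairs."""
    concordant = discordant = ties_x = ties_y = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0 and dy == 0:
            continue              # tied on both variables: ignored
        elif dx == 0:
            ties_x += 1           # tied on x only
        elif dy == 0:
            ties_y += 1           # tied on y only
        elif dx * dy > 0:
            concordant += 1       # same ordering on both variables
        else:
            discordant += 1       # opposite ordering
    untied = concordant + discordant
    return (concordant - discordant) / sqrt((untied + ties_x) * (untied + ties_y))

# Hypothetical ordinal codes: exercise 1 = never, 2 = occasionally, 3 = regularly;
# anxiety 1 = low, 2 = medium, 3 = high.
exercise = [1, 1, 2, 2, 3, 3]
anxiety = [3, 2, 2, 2, 1, 1]
tau = kendall_tau_b(exercise, anxiety)
print(round(tau, 2))  # -0.87: more exercise goes with lower anxiety

# With no ties, tau-b reduces to (concordant - discordant) / number of pairs.
print(round(kendall_tau_b([1, 2, 3, 4], [1, 3, 2, 4]), 2))  # 0.67
```

• In the second example, 5 of the 6 pairs are concordant and 1 is discordant, giving (5 - 1) / 6.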
D. Point Biserial Correlation Coefficient
• This correlation measures the relationship between a continuous
variable and a dichotomous variable
• It is used when one variable is continuous and other is binary.
• This tells us how a number and a "yes" or "no" question
are related. It's like when we ask if you have a pet (yes or
no), and we also measure how tall you are. Then we see if
taller kids are more likely to have a pet.
Example: Point-biserial
• Suppose we want to investigate the relationship between
extraversion (a personality trait) and participation in team sports (a
yes/no question). We assign scores to participants based on their
level of extraversion and check whether they participate in team
sports or not. By calculating the Point-Biserial correlation coefficient,
we can determine if there is a relationship between extraversion
scores and team sports participation.
• Let's consider the relationship between self-esteem (continuous
variable) and having a history of bullying victimization (yes/no).
Participants rate their self-esteem, and we collect data on whether
they have experienced bullying in the past. By calculating the
Point-Biserial correlation coefficient, we can determine if there is a
relationship between self-esteem and the experience of bullying
victimization.
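• The point-biserial coefficient can be sketched as the difference between the two group means, scaled by the overall spread and by how evenly the groups are split. The scores below are made up, `point_biserial` is a hypothetical helper, and the formula uses the population standard deviation so that it matches Pearson r on the 0/1-coded variable.

```python
from math import sqrt
from statistics import mean, pstdev

def point_biserial(values, groups):
    """r_pb = (mean of group 1 - mean of group 0) / SD * sqrt(p * q)."""
    ones = [v for v, g in zip(values, groups) if g == 1]
    zeros = [v for v, g in zip(values, groups) if g == 0]
    p = len(ones) / len(values)  # proportion coded 1
    q = 1 - p                    # proportion coded 0
    return (mean(ones) - mean(zeros)) / pstdev(values) * sqrt(p * q)

# Hypothetical data: extraversion scores and team-sport participation (1 = yes).
extraversion = [3, 4, 5, 6, 7, 8]
team_sport = [0, 0, 0, 1, 1, 1]

r_pb = point_biserial(extraversion, team_sport)
print(round(r_pb, 2))  # 0.88
```

• A positive value means the group coded 1 (here, team-sport participants) has the higher average score.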
E. Phi Coefficient
• It is used when there are two dichotomous variables
• It determines the association between two categorical variables with
two categories each
• This tells us how two "yes" or "no" questions are related.
It's like when we ask if you like dogs (yes or no) and if you
like ice cream (yes or no). Then we see if there is any
connection between these two things.
Example
• Let's consider the relationship between gender (male/female)
and preference for playing musical instruments (yes/no). We
collect data on participants' gender and whether they play a
musical instrument. By calculating the Phi coefficient, we can
determine if there is any association between gender and
musical instrument preference.
• Suppose we want to examine the relationship between gender
(male/female) and attitudes towards body image
(positive/negative). Participants indicate their gender and rate
their attitudes towards body image. By calculating the Phi
coefficient, we can determine if there is any association
between gender and attitudes towards body image.
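• The Phi coefficient is computed from the four cell counts of a 2 x 2 table. A minimal sketch with made-up counts follows; `phi_coefficient` and the cell labels a, b, c, d are hypothetical names.

```python
from math import sqrt

def phi_coefficient(a, b, c, d):
    """Phi for the 2 x 2 table [[a, b], [c, d]]."""
    return (a * d - b * c) / sqrt((a + b) * (c + d) * (a + c) * (b + d))

# Hypothetical counts (rows: group 1 / group 2, columns: yes / no):
#            yes   no
# group 1     20   10
# group 2     10   20
phi = phi_coefficient(20, 10, 10, 20)
print(round(phi, 2))  # 0.33
```

• The sign depends on how the rows and columns are ordered, so the size of Phi is usually what gets interpreted.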
F. Cramer’s V
• It is an extension of phi coefficient
• It measures the association between two categorical variables when at
least one of them has more than two categories
• This is like the Phi coefficient, but it works when we have
more than two categories for each question. It's like when
we ask if you like dogs (yes, no, or maybe) and if you like
ice cream (chocolate, vanilla, or strawberry). Then we
check if there is any relationship between these choices.
• Suppose we want to examine the relationship between
education level (high school, bachelor's degree, master's
degree, Ph.D.) and career choice (teacher, engineer, doctor,
artist). By collecting data on participants' education level and
career choice, we can calculate Cramer's V to determine if there
is a relationship between the two categorical variables.
• Let's say we want to investigate the relationship between
personality types (introvert, extrovert, ambivert) and preferred
social activities (parties, small gatherings, alone time).
Participants indicate their personality type and their preferred
social activities. By calculating Cramer's V, we can determine if
there is a relationship between personality types and preferred
social activities.
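• Cramer's V is built on the chi-square statistic of the contingency table. The sketch below uses made-up counts for the personality example; `cramers_v` is a hypothetical helper written out by hand.

```python
from math import sqrt

def cramers_v(table):
    """Cramer's V = sqrt(chi-square / (n * (min(rows, columns) - 1)))."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Count expected in this cell if the two variables were unrelated.
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    k = min(len(table), len(table[0]))
    return sqrt(chi2 / (n * (k - 1)))

# Hypothetical counts. Rows: introvert, extrovert, ambivert.
# Columns: parties, small gatherings, alone time.
table = [
    [5, 10, 25],
    [25, 10, 5],
    [10, 20, 10],
]
v = cramers_v(table)
print(round(v, 2))  # 0.4

# For a 2 x 2 table, Cramer's V equals the size of the Phi coefficient.
print(round(cramers_v([[20, 10], [10, 20]]), 2))  # 0.33
```

• Cramer's V runs from 0 to 1 and has no sign, because unordered categories have no direction.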
G. Biserial Correlation Coefficient
• Measures the relationship between a continuous variable and a
dichotomous variable
• Similar to the point-biserial correlation, but assumes that the
dichotomous variable is an artificial split of an underlying,
normally distributed continuous variable
• This tells us how a number and a "yes" or "no" question
are related, assuming the "yes" or "no" answer is really a
cut-off on a hidden number that is normally distributed
(like "pass" or "fail" taken from a test score). It's similar
to the Point-Biserial correlation but adjusts for that
hidden scale.
Example
• Let's consider the relationship between hours spent on social
media (continuous variable) and feelings of loneliness
(dichotomous variable: yes for feeling lonely, no for not feeling
lonely). We measure the number of hours participants spend on
social media and ask if they feel lonely or not. By calculating the
biserial correlation coefficient, we can determine if there is a
relationship between social media usage and feelings of
loneliness.
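• One common textbook form of the biserial coefficient rescales the group-mean difference using the height of the normal curve at the cut-off point. The sketch below assumes that form and the population standard deviation; the data and the helper `biserial` are hypothetical.

```python
from statistics import NormalDist, mean, pstdev

def biserial(values, groups):
    """r_b = (mean of group 1 - mean of group 0) / SD * (p * q / y)."""
    ones = [v for v, g in zip(values, groups) if g == 1]
    zeros = [v for v, g in zip(values, groups) if g == 0]
    p = len(ones) / len(values)  # proportion coded 1 (must be between 0 and 1)
    q = 1 - p
    # y = height of the standard normal curve at the point that splits
    # the distribution into proportions p and q.
    y = NormalDist().pdf(NormalDist().inv_cdf(p))
    return (mean(ones) - mean(zeros)) / pstdev(values) * (p * q / y)

# Hypothetical data: daily hours on social media and feeling lonely (1 = yes).
hours = [1, 2, 3, 4, 5, 6]
lonely = [0, 0, 1, 0, 1, 1]

r_b = biserial(hours, lonely)
print(round(r_b, 2))  # 0.86
```

• For the same data the point-biserial value is about 0.68; the biserial coefficient is always larger in size, and with very strong relationships it can even exceed 1.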
H. Tetrachoric Correlation Coefficient
• Measures the relationship between two dichotomous variables that are
believed to have an underlying continuous distribution
• Based on the assumption that the two dichotomous variables are
manifestations of unobserved continuous variables
• This is used when we have two "yes" or "no" questions, but we
think they are related to an underlying number that we can't see.
It's like when we can't measure your height directly, but we ask
if you can reach the top shelf (yes or no), and we ask if you can
see over a fence (yes or no). Then we check if there is any
connection between these questions, assuming there is an
invisible measurement like your height that affects both
answers.
Example
• Suppose we want to explore the relationship between scores on
a depression questionnaire (dichotomous variable: yes for
depressed, no for not depressed) and the presence of a specific
genetic marker (dichotomous variable: presence or absence of
the marker). We collect data on participants' scores on the
depression questionnaire and whether they have the genetic
marker or not. By calculating the tetrachoric correlation
coefficient, we can determine if there is a relationship between
depression and the presence of the genetic marker,
assuming each yes/no variable reflects an underlying
continuous variable.
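• The exact tetrachoric coefficient needs an iterative estimate, so statistical software is normally used. As a rough sketch only, the well-known cosine approximation can be computed by hand from the four cell counts; the counts below are made up and `tetrachoric_approx` is a hypothetical helper.

```python
from math import cos, pi, sqrt

def tetrachoric_approx(a, b, c, d):
    """Cosine approximation for the 2 x 2 table [[a, b], [c, d]].

    a and d are the agreeing cells (yes/yes and no/no); b and c are the
    disagreeing cells. Requires b * c > 0. This is an approximation,
    not the full tetrachoric estimate.
    """
    return cos(pi / (1 + sqrt((a * d) / (b * c))))

# Hypothetical counts:
#              marker   no marker
# depressed       20        10
# not depressed   10        20
r_tet = tetrachoric_approx(20, 10, 10, 20)
print(round(r_tet, 2))  # 0.5
```

• For the same table the Phi coefficient is about 0.33. The tetrachoric value is larger because it estimates the correlation between the hidden continuous variables rather than between the yes/no answers themselves.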
Q1.
• What type of correlation coefficient is most appropriate for
examining the relationship between age groups (e.g., young,
middle-aged, elderly) and preferences for music genres (e.g.,
rock, classical, hip-hop)?
• a) Pearson correlation coefficient b) Spearman's rank
correlation coefficient c) Kendall's rank correlation coefficient d)
Point-Biserial correlation coefficient
A1
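• Kendall's rank correlation coefficient
• Reason: Of the options listed, Kendall's rank correlation
coefficient is the one intended for ranked or ordinal data, such
as age groups. It assesses the strength and direction of a
monotonic relationship, which means it examines if there is a
consistent pattern in the ranks of the variables being
compared. This requires the music preferences to be recorded
as ranks or ratings; if genre is treated as an unordered
category, Cramer's V (not among the options) is the measure
for two categorical variables with more than two categories.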
Q2.
• Which correlation coefficient should be used to measure the
relationship between a continuous variable (e.g., income level)
and a dichotomous variable (e.g., employment status: employed
or unemployed)?
• a) Pearson correlation coefficient b) Spearman's rank
correlation coefficient c) Point-Biserial correlation coefficient d)
Phi coefficient
A2
• Point-Biserial correlation coefficient
• Reason: The Point-Biserial correlation coefficient is used to
measure the relationship between a continuous variable and a
dichotomous variable. In this case, income level (continuous
variable) and employment status (dichotomous variable) are
being compared to determine if there is a relationship between
them.
Q3
• A researcher wants to investigate the relationship between two
dichotomous variables: whether someone is a coffee drinker
(yes or no) and whether they experience insomnia (yes or no).
Which correlation coefficient is appropriate for this analysis?
• a) Pearson correlation coefficient b) Spearman's rank
correlation coefficient c) Tetrachoric correlation coefficient d) Phi
coefficient
A3
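• Tetrachoric correlation coefficient
• Reason: The Tetrachoric correlation coefficient is specifically
used when both variables are dichotomous, and it assumes that
there is an underlying continuous variable that each dichotomous
variable represents (here, amount of coffee consumed and
severity of sleep problems). In this case, the researcher wants to
explore the relationship between being a coffee drinker and
experiencing insomnia, both of which are dichotomous
variables. If the two variables are instead treated as true yes/no
categories with nothing continuous underneath, the Phi
coefficient (option d) is the appropriate choice.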
Q4
• Which correlation coefficient is most suitable for examining the
association between students' class rankings (e.g., 1st, 2nd,
3rd) and their scores on a math test?
• a) Pearson correlation coefficient b) Spearman's rank
correlation coefficient c) Kendall's rank correlation coefficient d)
Biserial correlation coefficient
A4
• Spearman's rank correlation coefficient
• Reason: Spearman's rank correlation coefficient is used when
dealing with ranked or ordinal data. In this scenario, the class
rankings are already ranks, and the math test scores are
converted to ranks before the coefficient is computed.
Therefore, Spearman's rank correlation coefficient is
appropriate for examining the association between them.
Q5
• A researcher wants to investigate the relationship between the
number of hours of exercise per week (continuous variable) and
the occurrence of heart disease (yes or no). Which correlation
coefficient should be used?
• a) Pearson correlation coefficient b) Spearman's rank correlation
coefficient c) Point-Biserial correlation coefficient d) Cramer's V
A5
• Point-Biserial correlation coefficient
• Reason: The Point-Biserial correlation coefficient is used to measure the
relationship between a continuous variable and a dichotomous
variable. In this case, the number of hours of exercise per week
(continuous variable) is being compared to the occurrence of
heart disease (dichotomous variable), to determine if there is a
relationship between them.