
UNIT-II

1. Pearson Product-Moment Correlation


A Pearson Product-Moment correlation assesses the linear relationship between two variables. It is computed by dividing an average cross-product in the numerator (called the covariance) by the product of the standard deviations in the denominator.

In symbols, the correlation is given by

$$ r = \frac{\dfrac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{n-1}}{S_x \, S_y} $$

The correlation coefficient is essentially the standardized covariance. But what does
this actually mean? To truly understand any statistical function, it's important to examine its
components. This means analysing the mathematics behind the formula to understand what it
actually "does." Let's focus on the numerator of the quotient:
$$ \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{n-1} $$
What we have defined above is called covariance.

• Covariance is the average cross-product of the variables xi and yi. We refer to it as an "average" cross-product because we divide the sum by n−1.
• Normally, an arithmetic average is computed by dividing by n, but here we use n−1
because we lose a degree of freedom. Despite this adjustment, the quotient remains an
"average-like" statistic. Losing a degree of freedom does not make the function any less
of an average.
• Covariances are widely used in statistics and statistical modeling. However, they are
scale-dependent, meaning they are influenced by the scales of 𝑥𝑖 and yi.
• If xi and yi have high variability, the sum of products can be large, as we are multiplying
deviations from their means. Dividing by n−1 does not address this issue, as it merely
distributes the sum across the n−1 pieces of information.
• Consequently, a large covariance can occur even if the relationship between xi and yi is
not very strong. This is a key point to understand.

• The solution to the problem of scale-dependence in covariance, as given by Pearson's r, is to divide by the product of the standard deviations, Sx · Sy. This standardizes the covariance, placing the resulting number between -1.0 and +1.0.
• Dividing by Sx · Sy effectively standardizes the measure. We can think of Sx · Sy as representing the total possible product of deviations in the data, or the total "cross-product" possible for the two variables, i.e. the "play space." It sets the baseline for the co-variability in the data.
• The numerator of the correlation coefficient, which is the covariance, represents how
much of this "play space" is due to the actual relationship between xi and yi. Since the
numerator cannot exceed the denominator in absolute value, the correlation coefficient
will always lie between -1.0 and +1.0.
• In essence, Pearson's r is a ratio that reflects the observed cross-variation relative to the
total possible cross-variation. This standardization removes the influence of the scale
of the variables, resulting in a dimensionless measure. The sign of the coefficient is
determined by the sign of the covariance, while the magnitude reflects the strength of
the relationship.
• It is crucial to understand that a Pearson correlation of zero in a sample does not imply
there is no relationship between the variables. For example, knowing the mean IQ of a
population is 100 might lead one to incorrectly assume that half the population has an
IQ below 100 and half above, which is only true if the distribution is normal or
symmetrical. This assumption does not hold for skewed distributions. The median, not
the mean, is the value that splits the population in half.
• Similarly, a correlation coefficient should never be interpreted without visualizing a
plot of the two variables. Memorize this: Always plot your data when interpreting
correlations. Pearson's correlation measures only linear relationships, and its magnitude
does not account for non-linear trends. Thus, a strong non-linear relationship could be
overlooked if one relies solely on Pearson’s r.

• For instance, consider a curvilinear relationship between two variables. The Pearson
correlation could be near zero despite a strong relationship. If you only rely on software
output without plotting the data, this relationship could be missed. Therefore, always
visualize your data before interpreting correlation coefficients.
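• As a quick illustration of this point, here is a minimal sketch (using simulated values, not data from the text) in which y is a deterministic quadratic function of x, yet Pearson's r comes out near zero because the relationship is not linear; a scatterplot would immediately reveal the parabola:
import numpy as np
x_curve = np.linspace(-10, 10, 1000)        # values symmetric about zero
y_curve = x_curve**2                        # a perfect, but curvilinear, relationship
print(np.corrcoef(x_curve, y_curve)[0, 1])  # approximately 0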

Computing Correlation in Python

We can compute a correlation coefficient on random data via the following:

import numpy as np
np.random.seed(1)
x = np.random.randint(0, 50, 1000)
y = x + np.random.normal(0, 10, 1000)
np.corrcoef(x, y)
Output:
array([[1.        , 0.81543901],
       [0.81543901, 1.        ]])
For this data, we see that the correlation is equal to 0.815, a relatively strong correlation in
absolute value. We can generate a scatterplot quite easily for these data using matplotlib:

import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline
matplotlib.style.use('ggplot')
plt.scatter(x, y)

The plot shows no visible bivariate outliers, and a Pearson correlation coefficient is
appropriate due to evidence of linearity. Even without clear linearity, if a linear relationship is
theorized, calculating the Pearson correlation can still be valuable to test the theory.
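As a quick numerical check of the earlier point that Pearson's r is simply the standardized covariance, we can reproduce the 0.815 value by hand for the simulated x and y above. This is a minimal sketch; np.cov divides by n − 1 by default, and ddof=1 gives the matching sample standard deviations:

cov_xy = np.cov(x, y)[0, 1]                   # covariance of x and y
sx, sy = np.std(x, ddof=1), np.std(y, ddof=1)
print(cov_xy / (sx * sy))                     # matches np.corrcoef(x, y)[0, 1]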
Now, let's consider covariance and correlation using the Galton data set, which includes
the heights of parents and their grown children. We aim to investigate if there is a linear
relationship between these variables.
import pandas as pd
galton = pd.read_csv('Galton.csv')
galton.head()
Output:
   Unnamed: 0  parent  child
0           1    70.5   61.7
1           2    68.5   61.7
2           3    65.5   61.7
3           4    64.5   61.7
4           5    64.0   61.7
from scipy import stats
pearson_coef, p_value = stats.pearsonr(galton["child"], galton["parent"])
print("Pearson Correlation:", pearson_coef, "and a P-value of:", p_value)
Output:
Pearson Correlation: 0.45876236829282174 and a P-value of: 1.7325092920165045e-49
The correlation between parent and child height is 0.45, with a very small p-value,
making it statistically significant. This allows us to reject the null hypothesis that the true
population correlation is zero. This is a correlation between what Karl Pearson termed "organic variables" and was one of the first uses of the correlation coefficient, around 1888. Galton and
others aimed to demonstrate genetic heritability, showing that tall parents tend to have tall
children, indicating a genetic component.
Suppose another study investigates the correlation between eating carrots and the risk of
developing cancer and finds a correlation coefficient of 0.2. While numerically smaller than
the height correlation, this finding could have significant health implications. If eating carrots
is linked to a reduced or increased risk of cancer, even a modest correlation could prompt
further research and potentially influence dietary recommendations.
In both cases, the numerical value of the correlation is important, but its significance is
heavily influenced by the context:
• In the first study, the correlation of 0.45 is significant because it aligns with existing
theories about genetic inheritance.
• In the second study, even a smaller correlation of 0.2 is significant due to its potential
impact on public health and individual behaviour.

We can obtain scatterplots of the correlation, as well as univariate histograms for each variable, through sns.pairplot():

import seaborn as sns
sns.pairplot(galton)
In the above figure, the correlation for parent and child is represented by the scatterplots
in row 2, column 3, as well as row 3, column 2. The remaining scatterplots are of no use, as
they simply represent the correlation of each variable with an index variable (“unnamed”) that
Python produced automatically. Likewise, the plots in row 1, column 3, and row 3, column 1
are of no use. The histograms in row 2, column 2, and row 3, column 3 are useful, as they
represent the distributions for parent and child, respectively. We can see that both distributions
are relatively normal in shape, even if slightly skewed.
Instead of producing a matrix plot as earlier, we can instead define each variable from
the Galton data (printing only a few cases on each variable), and generate a scatterplot through
the following:
parent = galton['parent']
child = galton['child']
columns = ['child', 'parent']
ax1 = galton.plot.scatter(x = 'child', y = 'parent')
In a scatterplot, the plotted data points form a subset of the Cartesian product of the two variables, which consists of all possible pairings of their values. If every possible pairing were displayed (the entire Cartesian product), the correlation would be zero. The scatterplot therefore highlights a specific subset of these potential pairings, emphasizing the particular relationships present in the data.
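To illustrate, the following minimal sketch uses simulated values (not data from the text): the observed pairs are strongly correlated, yet the correlation computed over the full Cartesian product of the same values is essentially zero.

import numpy as np
from itertools import product

rng = np.random.default_rng(1)
a = rng.normal(0, 1, 50)
b = a + rng.normal(0, 0.5, 50)                 # the observed, paired data
print(np.corrcoef(a, b)[0, 1])                 # strong positive correlation

all_pairs = np.array(list(product(a, b)))      # every possible (a_i, b_j) pairing
print(np.corrcoef(all_pairs[:, 0], all_pairs[:, 1])[0, 1])   # essentially 0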

2. T-Tests for Comparing Means


• T-tests are commonly used in statistics to test for mean differences, but this is just one
of their applications. They also assess the statistical significance of regression
coefficients and other sample statistics in various contexts.
• T-statistics relate to the t-distribution, which has widespread applications beyond
evaluating mean differences.
One-Sample Test:
• In a one-sample t-test, a researcher evaluates the probability that a sample mean could
come from a population with a specific mean.
• If the population variance is known, a z-test is more appropriate. However, when the
variance is unknown and must be estimated from the sample data, the t-test is the correct
choice.
• For large sample sizes (e.g., over 100), the difference between the t-test and z-test is minimal with respect to decisions about the null hypothesis, as the t-distribution approaches the z-distribution as the degrees of freedom increase.
• Nevertheless, using a t-test is always advisable when variances are unknown and need
to be estimated, regardless of sample size.

• The one-sample t-statistic is given by
$$ t = \frac{\bar{y} - E(\bar{y})}{s / \sqrt{n}} $$
where ȳ is the sample mean obtained from our research and E(ȳ) is the expectation under the null hypothesis, which is the population mean, μ, in this case. Hence, we could rewrite the numerator as ȳ − μ.
• The denominator is the estimated standard error of the mean, where s is the sample
standard deviation and n is the sample size.
• The t-test compares a mean difference in the numerator to the expected variation under the
null hypothesis in the denominator.
• It's important to note that a large t-value doesn't always indicate a significant scientific
result, as it can be achieved by reducing the sample standard deviation (s), increasing the
sample size (n), or both.
• The degrees of freedom for the one-sample t-test are equal to n−1, which is one less than
the number of participants in the sample.
• Let's assume that the following IQ sample data could have been drawn from a population of IQ scores with mean equal to 100. Our null hypothesis is H0: μ = 100 against the alternative hypothesis H1: μ ≠ 100. We first enter our data:

iq = [105, 98, 110, 105, 95]


• Conduct the one-sample t-test using scipy, listing first the dataframe (df), followed by the
population mean under the null hypothesis, which for our data is equal to 100:
df = pd.DataFrame(iq)
df
Out:
     0
0  105
1   98
2  110
3  105
4   95
from scipy import stats
stats.ttest_1samp(df, 100.0)
Output:
Ttest_1sampResult(statistic=array([0.9649505]), pvalue=array([0.38921348]))

• The t-statistic for the test is 0.9649505, with an associated p-value of 0.38921348. Since the
p-value is quite large (not less than a typical threshold like 0.05), we do not reject the null
hypothesis.
• This means we lack sufficient evidence to suggest that our sample of IQ values is from a
population with a mean other than 100.
• Importantly, we have not confirmed that the population mean is 100; we have only failed to
reject the null hypothesis. Failing to reject the null does not confirm its truth; it simply
indicates we have no reason to believe it is false.
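• As a check on the formula described above, the t-statistic can be reproduced by hand from the same iq data; the following is a minimal sketch using the sample standard deviation (ddof=1) in the denominator:
import numpy as np
iq = np.array([105, 98, 110, 105, 95])
n = len(iq)
t_manual = (iq.mean() - 100.0) / (iq.std(ddof=1) / np.sqrt(n))
print(t_manual)   # approximately 0.965, matching stats.ttest_1samp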
Two-Sample Test:
• The two-sample t-test can be used when collecting data from what we assume, under the null, are two independent populations:
$$ t = \frac{\bar{y}_1 - \bar{y}_2}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}} $$
• In the two-sample case, the mean difference in the numerator is between two population
means, μ1 and μ2. The standard error, now referred to as the estimated standard error of
the difference in means, includes both sample variances from each sample and the
corresponding sample sizes.
• Now, conduct a two-sample t-test using the previously loaded Galton data to evaluate
the null hypothesis that the mean difference between parent and child is equal to 0.
from scipy import stats
stats.ttest_ind(parent, child)
Output:
Ttest_indResult(statistic=2.167665371332533, pvalue=0.030311049956448916)

• The obtained t-statistic is equal to 2.167, with a p-value of 0.03. Since the p-value is quite small (i.e. less than 0.05), we reject the null and conclude that there is a mean population difference between parent and child heights.
• That is, we are rejecting the null hypothesis of H0: μ1 = μ2 in favour of the alternative
hypothesis H1: μ1 ≠ μ2.
• We can easily evaluate distributions for both variables by obtaining plots:
plt.figure(figsize=(10, 7))
sns.distplot(parent)
plt.figure(figsize=(9, 5))
sns.distplot(child)

• From these plots, we observe that each distribution is approximately normal in shape, suggesting that the assumption of normality for the t-test is likely met. The t-test also requires that the variances in each population are equal.
• If you perform a Levene test on this data using stats.levene(child, parent, center='mean'), you will reject the null hypothesis of equal variances (a sketch of this check appears after the boxplots below). Levene's test assesses the equality of variances in the population, but its p-value is highly sensitive to sample size. Therefore, such tests should be used as a guide rather than a definitive conclusion.
• We can also obtain boxplots for both the parent and child:
sns.boxplot(parent)
sns.boxplot(child)

• These boxplots indicate that the parent data might have a few extreme scores worth
investigating. In each plot, the center line represents the median, and the lines enclosing
the rectangle denote the first quartile (Q1) and the third quartile (Q3). The distribution
for the child data shows a slight leftward pull, indicating a mild negative skew.
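• As referenced above, the following minimal sketch runs Levene's test and, because equal variances are in doubt, repeats the two-sample comparison as a Welch test by setting equal_var=False:
from scipy import stats
# Levene's test for equality of variances (null hypothesis: the variances are equal)
print(stats.levene(child, parent, center='mean'))
# Welch's t-test does not assume equal variances
print(stats.ttest_ind(parent, child, equal_var=False))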

Paired-Samples t-Test in Python


• Sometimes, instead of one-sample or independent-samples data, we have data measured
on pairs of observations.
• A classic example is asking husbands and wives for their happiness ratings in their
marriage. In this case, the data cannot be assumed to be independent, as knowing the
husband's rating likely provides some insight into the wife's rating (even if the ratings
might differ).
• This paired situation is a special case of the more general matched pairs design, where
subjects are grouped into "blocks" (as shown in Table 4.1).
• In Table 4.1, each block represents a pair of related subjects or conditions, which can
be naturally occurring or determined by the researcher. For instance, in a study
involving married couples, "Block 1" would include the first couple, with Treatment 1
representing the husband's score and Treatment 2 representing the wife's score.
Alternatively, in another experimental setup, Treatment 1 might be one drug dose and
Treatment 2 another.
• The matched pairs and block layout are versatile and can be adapted to various
experiments. The key point is that within each block, the conditions are not
independent. This contrasts with a purely between-subjects design, where observations
under different conditions are assumed to be independent.
• If we extend this concept to include more than two conditions, we get the layout shown in Table 4.2, where there are multiple conditions instead of just two.

• When dealing with only two related conditions, a paired-samples t-test is appropriate.
We illustrate this test using the data in Table 4.3, where rats were measured on a
learning task across three trials.
• In this dataset, each rat is measured three times (trials 1 through 3). Since we are
focusing on the paired situation, we will only consider the first two trials. As each rat
participates in both trials, the data are naturally paired and repeated. This means that
knowing a rat's behaviour in one trial provides information about its behavior in the
other trial, indicating the trials are not independent.
• To proceed, we first build the data frame for trials 1 and 2.
trial_1 = [10, 12.1, 9.2, 11.6, 8.3, 10.5]
trial_2 = [8.2, 11.2, 8.1, 10.5, 7.6, 9.5]
paired_data = pd.DataFrame({'trial_1': trial_1, 'trial_2': trial_2})
paired_data
Out:
   trial_1  trial_2
0     10.0      8.2
1     12.1     11.2
2      9.2      8.1
3     11.6     10.5
4      8.3      7.6
5     10.5      9.5
• We now conduct the t-test:
stats.ttest_rel(trial_1, trial_2)
Out:
Ttest_relResult(statistic=7.201190377787752, pvalue=0.0008044382024663002)

• The t-statistic value is 7.20, with an associated p-value of 0.0008, which is statistically
significant. This provides evidence that the mean difference between the two trials is
not zero. Consequently, we reject the null hypothesis and infer that, in the population
from which these data were drawn, the true mean difference is not zero.
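• Because a paired-samples t-test is equivalent to a one-sample t-test on the difference scores, we can verify the result with a short sketch:
import numpy as np
from scipy import stats
diffs = np.array(trial_1) - np.array(trial_2)   # per-rat difference, trial 1 minus trial 2
print(stats.ttest_1samp(diffs, 0.0))            # same t of about 7.20 and the same p-value as ttest_rel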

Binomial Test
• The binomial distribution is a useful tool for modeling the probability of success in a
random experiment with two possible outcomes, where each event is mutually
exclusive.
• A classic example is coin flipping, where each trial results in either heads or tails. This
is a binary event since there are only two possible outcomes for each experiment.
• Other examples include being dead or alive, day or night, young or old. The binomial distribution is described by
$$ p(r) = \binom{n}{r} p^{r} (1-p)^{n-r} $$
where
p(r), the "probability of r," is the probability of observing r successes,
n is the total number of trials, and
p is the probability of a success on any single trial.
• Hence, p(r) is the probability of observing a given number of successes out of a total
number of trials (e.g. number of coin flips). Since the events in question are mutually
exclusive, the total probability is equal to p+(1−p), where 1−p is often denoted by q.
Hence, p+(1−p) = p+q=1.
• To understand how the binomial distribution can be applied to answer a research
question, let's delve into a specific example.
• Suppose we want to determine the probability of getting a certain number of heads
when flipping a fair coin. We assume p=0.5 to represent the probability of getting heads
under the null hypothesis. Now, let's say we flip the coin 5 times and get 2 heads. We
want to calculate the probability of obtaining 2 or more heads, assuming the coin is fair.
• This is a binomial setting since:
➢ The coin variable is binary in nature, meaning that the outcome can only
be one of two possibilities, “head” or “tail.” It is a discrete variable.
➢ The two events are mutually exclusive, meaning that you either get a head or a tail, and cannot get both on the same flip.
➢ The probability of “success” on each trial remains the same from trial to
trial. That is, the probability is stationary. In other words, if the
probability of heads on the first trial is 0.5, then this implies the
probability of heads on the second and ensuing trials is also 0.5.
➢ Each trial is independent of any other trial. For instance, the probability
of a “head” on the first flip does not have an influence on the probability
of heads on the second flip, and so on.
• In Python, we can evaluate the probability of 2 or more heads out of 5 flips as follows,
where stats.binom_test() is the function for the test, “2” is the number of successes out
of n=5 trials, and p=0.5 is the probability of a success on any given trial:
stats.binom_test(2, n=5, p=0.5, alternative='greater')
Out: 0.8125
• The probability of getting 2 or more heads in 5 flips of a fair coin is 0.8125. What does this result imply? If we assume the coin is fair, with a probability of heads of 0.5, then finding that the probability of getting 2 or more heads out of 5 flips is 0.8125 provides no evidence to reject this assumption.
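• We can confirm the 0.8125 figure directly from the binomial formula; the following is a minimal sketch (binom.sf(1, 5, 0.5) returns P(X > 1), which is the same as P(X ≥ 2)):
from scipy.stats import binom
# P(X >= 2) = 1 - P(X = 0) - P(X = 1) for n = 5 and p = 0.5
print(1 - binom.pmf(0, 5, 0.5) - binom.pmf(1, 5, 0.5))   # 0.8125
print(binom.sf(1, 5, 0.5))                               # same value via the survival function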
The Chi-Squared Distribution and Goodness-of-Fit Test
• The binomial test we just discussed is actually a special case of a more widely used
statistical test: the chi-squared goodness-of-fit test.
• It's important to distinguish between the chi-squared distribution and the chi-squared
test; they are not the same thing and should not be confused. In undergraduate applied
statistics, chi-square is often associated with the chi-squared test.
• However, chi-square, like the t statistic, is a statistical measure with its own density
function that can be used in various contexts to evaluate null hypotheses.
• In theoretical statistics, the chi-square density appears frequently and is related to other
distributions, such as the F distribution.
• The chi-square distribution is given by
$$ f(x) = \frac{1}{2^{v/2}\,\Gamma(v/2)}\, x^{v/2 - 1} e^{-x/2}, \qquad x > 0, $$
where v is the degrees of freedom and Γ is the gamma function.
• We can also state the chi-square distribution in terms of the sum of squares of n independent and normally distributed z-scores:
$$ \chi^2_n = z_1^2 + z_2^2 + \cdots + z_n^2 = \sum_{i=1}^{n} z_i^2 $$
• The goodness-of-fit test is a statistical method that uses the chi-square distribution to evaluate the statistical significance of the test. The goodness-of-fit test operates on count data rather than on means, as in the t-test or ANOVA. We define the chi-squared goodness-of-fit statistic as follows:
$$ \chi^2 = \sum_{r}\sum_{c} \frac{(O_i - E_i)^2}{E_i} $$
• The observed frequencies (Oi) and expected frequencies (Ei) are compared within each
cell. The summation notation represents the process of summing the squared
differences between these values, (Oi−Ei)2, first across the columns (c) and then across
the rows (r).
• The numerator in this calculation is squared to avoid the sum equalling zero, which
would result in a chi-squared (χ²) value of zero, regardless of any actual discrepancy
between Oi and Ei.
• Squaring the differences ensures that χ² reflects any deviation between observed and expected frequencies. If the sum of (Oi−Ei)² is large relative to Ei, then, given the degrees of freedom, the χ² statistic may become large enough to reach statistical significance.
• The null hypothesis in this test states that the observed frequencies are consistent with
the expected frequencies in the population from which the data were sampled. The
alternative hypothesis suggests that they are not consistent, indicating an association
between the row and column variables.
• Performing a chi-squared goodness-of-fit test in Python is very easy. As an example,
let us first consider the case where all observed frequencies are equal in each cell. In
the following, we set them all equal to 16:
from scipy.stats import chisquare
chisquare([16, 16, 16, 16, 16])
Out: Power_divergenceResult(statistic=0.0, pvalue=1.0)
• When the p-value is 1.0, it indicates that there is no discrepancy between the observed
and expected frequencies in a chi-squared analysis. This occurs when all cell
frequencies are equal, meaning the observed result perfectly matches what is expected
under the null hypothesis.
• In this one-way chi-squared analysis, with a total of 80 frequencies distributed evenly
across the cells (16 per cell), the lack of any difference between cells provides no
evidence against the null hypothesis, hence the p-value of 1.0.
• Now consider a situation in which the frequencies are slightly unequal, obtained by adjusting a couple of the cell frequencies downward to 15. Notice that elements 2 and 4 in the data below are now 15 rather than 16:
chisquare([16, 15, 16, 15, 16])
Out: Power_divergenceResult(statistic=0.07692307692307693, pvalue=0.9992790495310748)
• For these data, we see that the resulting p-value is now equal to 0.999, slightly lower than 1.0. This is a result of the small discrepancy in frequencies from cell to cell. That is, the probability of observing a discrepancy this large under the null hypothesis of equal frequencies is still very high, so there is no reason to reject the null hypothesis.
• The following is an even more disparate distribution of frequencies:
chisquare([16, 15, 10, 8, 25])
Out: Power_divergenceResult(statistic=11.81081081081081, pvalue=0.01881499899397827)
• In this case, the observed p-value is equal to 0.018, and hence we deem the result to be statistically significant at the 0.05 level. That is, the probability of observing a discrepancy as large as the one we have observed is very small under the null hypothesis of equal frequencies per cell. Hence, we reject the null hypothesis of equal frequencies per cell.
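• As a final check on the definition of the goodness-of-fit statistic, the following minimal sketch reproduces this last χ² value and p-value by hand, taking the expected count in each of the 5 cells to be the total frequency divided by 5:
import numpy as np
from scipy.stats import chi2
observed = np.array([16, 15, 10, 8, 25])
expected = np.full(5, observed.sum() / 5)          # 74 / 5 = 14.8 per cell
chi_sq = np.sum((observed - expected) ** 2 / expected)
p_value = chi2.sf(chi_sq, df=len(observed) - 1)    # 4 degrees of freedom
print(chi_sq, p_value)                             # approximately 11.81 and 0.0188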
