
Ch6 7now

Chapters 6 and 7 cover scatterplots, correlation, and simple linear regression, emphasizing how to analyze relationships between two quantitative variables. Key concepts include distinguishing correlation from causation, understanding the properties of correlation, and the importance of checking conditions for linear models. The chapters also address dealing with nonlinear relationships and the significance of residuals in evaluating regression models.

Uploaded by

ngoclampham3008

Ch. 6-7: Scatterplot, Association, Correlation and Simple Linear Regression
Learning Objectives
1) Draw a scatterplot and use it to analyze the relationship
between two variables
2) Calculate the correlation as a measure of linear
relationship between two variables
3) Distinguish between correlation and causation
4) Choose a linear model of the relationship between two
variables
5) Use the correlation to analyze the usefulness of the
model
6) Deal with nonlinear relationships

Copyright © 2021 Pearson Canada Inc.


Ch. 6: Scatterplot, Association, and Correlation (2 of 2)
A scatterplot, which plots one quantitative variable against
another, can be an effective display for data
Scatterplots are the ideal way to picture associations
between two quantitative variables

Figure 6.1 Monthly Canadian/U.S. exchange rate and oil prices.
Sources: Based on OPEC basket price of oil; Bank of Canada exchange rates (January–November 2014).



6.1 Looking at Scatterplots (1 of 4)
The direction of the association is important
A pattern that runs from the upper left to the lower right is
said to be negative. A pattern running from the lower left to
the upper right is called positive.



6.1 Looking at Scatterplots (2 of 4)
The second thing to look for in a scatterplot is its form



6.1 Looking at Scatterplots (3 of 4)
The third feature to look for in a scatterplot is the strength of
the relationship



6.1 Looking at Scatterplots (4 of 4)
Finally, always look for the unexpected
An outlier is an unusual observation, standing away from the
overall pattern of the scatterplot

Look for unusual features: Are there unusual observations or subgroups?



6.3 Understanding Correlation (1 of 6)
The ratio of the sum of the products $z_x z_y$ for every point in the scatterplot to n − 1 is called the correlation coefficient.

$$ r = \frac{\sum z_x z_y}{n - 1} $$

Two of the more common alternative formulas for correlation are:

$$ r = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sqrt{\sum (x - \bar{x})^2 \sum (y - \bar{y})^2}} = \frac{\sum (x - \bar{x})(y - \bar{y})}{(n - 1)\, s_x s_y} $$
6.3 Understanding Correlation (2 of 6)
Finding the Correlation Coefficient
Suppose the data pairs are:

x    6   10   14   19   21
y    5    3    7    8   12

Then $\bar{x} = 14$, $\bar{y} = 7$, $s_x = 6.20$, and $s_y = 3.39$.

Deviations in x    Deviations in y    Product
6 − 14 = −8        5 − 7 = −2         −8 × −2 = 16
10 − 14 = −4       3 − 7 = −4         16
14 − 14 = 0        7 − 7 = 0          0
19 − 14 = 5        8 − 7 = 1          5
21 − 14 = 7        12 − 7 = 5         35

Add up the products: 16 + 16 + 0 + 5 + 35 = 72

Finally, we divide by $(n - 1) \times s_x \times s_y$ = (5 − 1) × 6.20 × 3.39 = 84.07.
The ratio is the correlation coefficient: r = 72/84.07 = 0.856.
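
The worked example above can be checked numerically. The sketch below is only an illustration, assuming NumPy is available; the variable names are ours, not the textbook's. It computes r from the z-score form and from the deviation form, then cross-checks against `np.corrcoef`. Because it keeps full precision for the standard deviations instead of the rounded 6.20 and 3.39, it gives roughly 0.855 rather than 0.856.

```python
import numpy as np

# Data pairs from the worked example.
x = np.array([6, 10, 14, 19, 21], dtype=float)
y = np.array([5, 3, 7, 8, 12], dtype=float)
n = len(x)

# Sample standard deviations (ddof=1 divides by n - 1).
sx = x.std(ddof=1)
sy = y.std(ddof=1)

# Form 1: sum of z-score products divided by n - 1.
zx = (x - x.mean()) / sx
zy = (y - y.mean()) / sy
r_z = np.sum(zx * zy) / (n - 1)

# Form 2: sum of deviation products divided by (n - 1) * sx * sy.
products = (x - x.mean()) * (y - y.mean())
r_dev = products.sum() / ((n - 1) * sx * sy)

# Cross-check with NumPy's built-in correlation matrix.
r_np = np.corrcoef(x, y)[0, 1]

print(products.sum())              # sum of the deviation products
print(round(r_z, 4), round(r_dev, 4), round(r_np, 4))
```

All three values agree, which is what the equality of the alternative formulas predicts.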
6.3 Understanding Correlation (3 of 6)
Correlation Conditions
Correlation measures the strength of the linear association
between two quantitative variables
Before you use correlation, you must check three conditions:
• Quantitative Variables Condition: Correlation applies only
to quantitative variables
• Linearity Condition: Correlation measures the strength only
of the linear association
• Outlier Condition: Unusual observations can distort the
correlation



6.3 Understanding Correlation (4 of 6)
Correlation Properties
• The sign of a correlation coefficient gives the direction of the
association
• Correlation is always between −1 and +1
• Correlation treats x and y symmetrically
• Correlation has no units
• Correlation is not affected by changes in the center or scale of either
variable.
• Correlation measures the strength of the linear association between
the two variables.
• Correlation is sensitive to unusual observations.
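
Several of these properties can be seen directly on the five data pairs from the worked example. The following is a minimal sketch, assuming NumPy; the helper `corr`, the rescaling constants, and the extra point (40, 0) are hypothetical choices made only for illustration.

```python
import numpy as np

x = np.array([6, 10, 14, 19, 21], dtype=float)
y = np.array([5, 3, 7, 8, 12], dtype=float)

def corr(a, b):
    """Correlation coefficient of two equal-length arrays."""
    return np.corrcoef(a, b)[0, 1]

r = corr(x, y)

# Symmetry: swapping the roles of x and y leaves r unchanged.
r_swapped = corr(y, x)

# No units: shifting the center or changing the scale (with a positive
# factor) of either variable leaves r unchanged.
r_rescaled = corr(2.54 * x + 100, 0.001 * y - 7)

# The sign follows the direction: reversing x flips the sign of r.
r_flipped = corr(-x, y)

# Sensitivity: one unusual observation (a made-up point) changes r a lot.
r_outlier = corr(np.append(x, 40.0), np.append(y, 0.0))

print(round(r, 3), round(r_swapped, 3), round(r_rescaled, 3))
print(round(r_flipped, 3), round(r_outlier, 3))
```

With the made-up outlier included, the correlation even changes sign, which is why the Outlier Condition has to be checked before r is trusted.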
6.3 Understanding Correlation (6 of 6)
Correlation Tables
Correlation tables are compact and give a lot of summary
information at a glance. There, you’ll see the correlations
between pairs of variables in a data set arranged in a table.

Table 6.1 A correlation table for some variables collected on a sample of Amazon books.

            #Pages   Width   Thick   Pub year
#Pages      1.000
Width       0.003    1.000
Thick       0.813    0.074   1.000
Pub year    0.253    0.012   0.309   1.000



6.4 Straightening Scatterplots (1 of 2)
Figure 6.2 Price of solar installations in Germany, 2009–2013, in Euros/Watt.
Source: “Analysis of 13 years of successful PV development in Germany under the EEG with a focus on 2013,” Renewable International, March 2014, Bernard Chabot.

Simple transformations such as the logarithm, square root, and reciprocal can sometimes straighten a scatterplot’s form.



6.5 Lurking Variables and Causation
There is no way to conclude from a high correlation alone that one variable causes the other.
There’s always the possibility that some third variable, a lurking variable, is simultaneously affecting both of the variables you have observed.
Figure 6.4 Life Expectancy and numbers of Doctors per Person in 40 countries shows a fairly strong, positive linear relationship with a correlation of 0.705.



6.5 Lurking Variables and Causation

When analyzing relationships between variables, it’s essential to understand the distinction between correlation and causation and the potential impact of lurking variables (also called confounding variables). These concepts are critical for avoiding misleading conclusions in data analysis.

Causation implies that one variable directly influences another. For instance, increasing the temperature of water causes it to boil.
To establish causation, three key criteria must be satisfied:
• Temporal precedence: The cause must precede the effect.
• Correlation: The variables must be statistically correlated.
• No confounding factors: The observed relationship must not be due to a third variable.



6.5 Lurking Variables and Causation
Lurking Variables
A lurking variable is a hidden factor that influences both the independent and dependent variables, creating a spurious relationship that may appear causal.
Examples of Lurking Variables:
1. Ice Cream Sales and Drowning:
   • Correlation: Ice cream sales and drowning rates are positively correlated.
   • Lurking Variable: Warm weather increases both swimming (hence drowning) and ice cream sales.
2. Coffee Consumption and Heart Disease:
   • Correlation: Studies may show coffee drinkers have a higher risk of heart disease.
   • Lurking Variable: Coffee drinkers might also smoke more, which increases the risk of heart disease.
3. Education and Income:
   • Correlation: Higher education levels are correlated with higher income.
   • Lurking Variables: Family wealth, access to quality schools, or social networks could influence both education and income.
Lurking variables can lead to false conclusions about cause-and-effect relationships.
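
A small simulation can make the ice cream example concrete. The sketch below uses entirely synthetic data: the coefficients, noise levels, seed, and variable names are assumptions chosen for illustration, not measurements. Temperature drives both ice cream sales and swimming, and neither of those affects the other. Removing the linear effect of temperature from each variable and correlating what is left is one simple way (not the textbook's) to see that the association comes from the lurking variable.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Synthetic lurking variable and two variables that depend only on it.
temperature = rng.normal(25, 5, n)
ice_cream = 10 * temperature + rng.normal(0, 20, n)
swimming = 3 * temperature + rng.normal(0, 6, n)

# The raw correlation is strong even though neither variable causes the other.
r_raw = np.corrcoef(ice_cream, swimming)[0, 1]

def remove_linear_effect(v, z):
    """Residuals of v after a least squares line on z has been subtracted."""
    slope, intercept = np.polyfit(z, v, 1)
    return v - (intercept + slope * z)

# Correlate what remains once temperature is accounted for.
r_adjusted = np.corrcoef(
    remove_linear_effect(ice_cream, temperature),
    remove_linear_effect(swimming, temperature),
)[0, 1]

print(round(r_raw, 2), round(r_adjusted, 2))
```

In real data the lurking variable is often unmeasured, so this adjustment is usually not available; the point is only that a high correlation can arise without any direct link.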
7.1 The Linear Model (2 of 5)

Figure 7.2 A linear model for monthly advertising expenses and sales over the past four years.



7.1 The Linear Model (3 of 5)
Residuals
A linear model can be written in the form $\hat{y} = b_0 + b_1 x$, where $b_0$ and $b_1$ are numbers estimated from the data and $\hat{y}$ is the predicted value
The difference between the observed value, y, and the predicted value, $\hat{y}$, is called the residual and is denoted e

$$ e = y - \hat{y} $$

The Line of “Best Fit”
The line of best fit is the line for which the sum of the squared residuals is smallest; it is often called the least squares line
7.2 Correlation and the Line (3 of 8)
We can find the slope of the least squares line using the correlation and the standard deviations

$$ b_1 = r \frac{s_y}{s_x} $$

The slope gets its sign from the correlation. If the correlation
is positive, the scatterplot runs from lower left to upper right
and the slope of the line is positive
The slope gets its units from the ratio of the two standard
deviations, so the units of the slope are a ratio of the units of
the variables



7.2 Correlation and the Line (4 of 8)
To find the intercept of our line, we use the means. If our line estimates the data, then it should predict $\bar{y}$ for the x-value $\bar{x}$
Thus we get the following relationship from our line

$$ \bar{y} = b_0 + b_1 \bar{x} $$

We can now solve this equation for the intercept to obtain the formula for the intercept

$$ b_0 = \bar{y} - b_1 \bar{x} $$
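
As an illustration, the slope and intercept can be computed for the five data pairs used earlier in the correlation example. This sketch assumes NumPy and the standard least squares slope formula $b_1 = r\, s_y / s_x$; the variable names are ours. It cross-checks the result against `np.polyfit`, which fits a least squares line directly.

```python
import numpy as np

x = np.array([6, 10, 14, 19, 21], dtype=float)
y = np.array([5, 3, 7, 8, 12], dtype=float)

r = np.corrcoef(x, y)[0, 1]
sx = x.std(ddof=1)   # sample standard deviation of x
sy = y.std(ddof=1)   # sample standard deviation of y

# Slope from the correlation and the standard deviations.
b1 = r * sy / sx

# Intercept from the means: the line passes through (x-bar, y-bar).
b0 = y.mean() - b1 * x.mean()

# Cross-check: a degree-1 polynomial fit returns (slope, intercept).
slope_check, intercept_check = np.polyfit(x, y, 1)

print(round(b1, 4), round(b0, 4))
print(round(slope_check, 4), round(intercept_check, 4))
```

Both routes give the same line, with a slope of about 0.47 units of y per unit of x.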



7.2 Correlation and the Line (5 of 8)
Least squares lines are commonly called regression lines.
We’ll need to check the same conditions for regression as we
did for correlation.
1) Quantitative Variables Condition
2) Linearity Condition
3) Outlier Condition



7.2 Correlation and the Line (6 of 8)
Working in Standard Deviations
If we consider finding the least squares line for standardized variables $z_x$ and $z_y$, the formula for slope can be simplified

$$ b_1 = r $$

The intercept formula can be rewritten as well

$$ b_0 = 0 $$

$$ \hat{z}_y = r z_x $$



7.2 Correlation and the Line (8 of 8)
For our data on advertising costs and sales, the correlation
is 0.693. We can now express the relationship for the
standardized variables

$$ \hat{z}_y = 0.693\, z_x $$
For every standard deviation above (or below) the mean we
are in advertising expenses, we’ll predict that the sales are
0.693 standard deviations above (or below) their mean.



7.3 Regression to the Mean
The equation below shows that if x is 2 SDs above its mean, the predicted y is never more than 2 SDs from its mean, since r can’t be bigger than 1 in magnitude

$$ \hat{z}_y = r z_x $$

So, each predicted y tends to be closer to its mean than its corresponding x
This property of the linear model is called regression to the mean
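
A few lines of code show the shrinkage. This sketch uses the correlation of 0.693 from the advertising example; the function name `predict_z_y` and the list of z-values are hypothetical, chosen only to illustrate the standardized prediction rule.

```python
def predict_z_y(z_x, r):
    """Predicted standardized y for a standardized x on the least squares line."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("a correlation must lie between -1 and +1")
    return r * z_x

r = 0.693  # correlation from the advertising and sales example
z_values = [-2.0, -1.0, 0.0, 1.0, 2.0]
predictions = [predict_z_y(z, r) for z in z_values]

for z, p in zip(z_values, predictions):
    print(f"z_x = {z:+.1f} -> predicted z_y = {p:+.3f}")
```

Every prediction is at least as close to zero (the mean, in standardized units) as the x-value it came from, which is the regression to the mean described above.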



7.3 Regression to the Mean

Figure 7.3 Price and weight for 307 books on Amazon, showing two regression lines. The red line
estimates price when we know weight. The blue line estimates weight when we know price.



7.4 Checking the Model
Models are useful only when specific assumptions are
reasonable. We check conditions that provide information
about the assumptions.
1) Quantitative Data Condition – linear models only make sense
for quantitative data, so don’t be fooled by categorical data
recorded as numbers
2) Linearity Condition – two variables must have a linear
association, or a linear model won’t mean a thing
3) Outlier Condition – outliers can dramatically change a
regression model



7.5 Learning More from the Residuals

Residuals help us see whether the model makes sense

A scatterplot of residuals against predicted values should show nothing interesting: no patterns, no direction, no shape

If nonlinearities, outliers, or clusters in the residuals are seen, then we must try to determine what the regression model missed
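
To see what an "interesting" residual pattern looks like, the sketch below fits a straight line to deliberately curved, made-up data (y = x² for x = 1 to 10; this data set is an assumption for illustration only) and lists the residuals next to the predicted values. It assumes NumPy.

```python
import numpy as np

# Made-up curved data: a straight line is the wrong model here.
x = np.arange(1, 11, dtype=float)
y = x ** 2

slope, intercept = np.polyfit(x, y, 1)
predicted = intercept + slope * x
residuals = y - predicted   # e = y - y-hat

for p, e in zip(predicted, residuals):
    print(f"predicted = {p:7.2f}   residual = {e:+6.2f}")
```

The residuals are positive at both ends and negative in the middle. That bend is the kind of pattern that signals the Linearity Condition has failed, even though the least squares residuals still sum to zero.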

7.5 Learning More from the Residuals
The standard deviation of the residuals, se, gives us a
measure of how much the points spread around the
regression line
We estimate the standard deviation of the residuals as
shown below

$$ s_e = \sqrt{\frac{\sum e^2}{n - 2}} $$

The standard deviation around the line should be the same
wherever we apply the model – this is called the Equal
Spread Condition
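
The residual standard deviation can be computed for the five data pairs from the correlation example. This is a sketch only, assuming NumPy; the variable names are ours.

```python
import numpy as np

x = np.array([6, 10, 14, 19, 21], dtype=float)
y = np.array([5, 3, 7, 8, 12], dtype=float)
n = len(x)

# Least squares line and its residuals.
slope, intercept = np.polyfit(x, y, 1)
residuals = y - (intercept + slope * x)

# Divide by n - 2 because two quantities (slope and intercept) were estimated.
s_e = np.sqrt(np.sum(residuals ** 2) / (n - 2))

print(round(s_e, 3))
```

For these data s_e comes out at roughly 2.03, in the same units as y.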



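7.6 Variation in the Model and R² (3 of 4)

Consider the square of the correlation coefficient r to get r², which is a value between 0 and 1
By tradition, r² is written R² and called “R squared”
It gives the fraction of the variation in y accounted for by the model and is often reported as a percentage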
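
The link between R² and the residuals can be checked on the five data pairs from the correlation example. The sketch assumes NumPy and the usual definition of R² as one minus the ratio of the residual sum of squares to the total sum of squares about the mean of y, which the slides do not spell out.

```python
import numpy as np

x = np.array([6, 10, 14, 19, 21], dtype=float)
y = np.array([5, 3, 7, 8, 12], dtype=float)

slope, intercept = np.polyfit(x, y, 1)
residuals = y - (intercept + slope * x)

sse = np.sum(residuals ** 2)          # variation left over in the residuals
sst = np.sum((y - y.mean()) ** 2)     # total variation in y about its mean
r_squared = 1 - sse / sst

# For a simple linear regression this matches the squared correlation.
r = np.corrcoef(x, y)[0, 1]

print(round(r_squared, 4), round(r ** 2, 4))
```

Both values are about 0.73, so the line accounts for roughly 73% of the variation in y for this small data set.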



7.6 Variation in the Model and R² (4 of 4)
How Big Should R² Be?
There is no value of R² that automatically determines that a
regression is “good”
Data from scientific experiments often have R² in the 80% to
90% range
Data from observational studies may have an acceptable R²
in the 30% to 50% range



7.8 Nonlinear Relationships
A regression model works well if the relationship between
the two variables is linear
What should be done if the relationship is nonlinear?

Figure 7.5 The scatterplot of number of Cell Phones (000s) vs. HDI for countries shows a bent
relationship not suitable for correlation or regression.
7.8 Nonlinear Relationships
To use regression models:
Transform or re-express one or both variables by a function
such as:
• Logarithm
• Square root
• Reciprocal

Figure 7.6 Taking the logarithm of cell phones results in a more nearly linear relationship.
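
The effect of re-expression can be sketched with synthetic data. The example below does not use the cell phone and HDI data from the figures; it generates a made-up exponential relationship (the constants, noise level, and seed are assumptions) and compares the correlation before and after taking the logarithm of y. It assumes NumPy.

```python
import numpy as np

rng = np.random.default_rng(1)

# Made-up bent relationship: y grows exponentially with x, with
# multiplicative noise so that log(y) is linear in x plus noise.
x = np.linspace(0.0, 1.0, 50)
y = 5.0 * np.exp(6.0 * x + rng.normal(0.0, 0.2, x.size))

r_raw = np.corrcoef(x, y)[0, 1]
r_log = np.corrcoef(x, np.log(y))[0, 1]

print(round(r_raw, 3), round(r_log, 3))
```

The correlation with log(y) is much closer to 1, reflecting the straighter form. After any re-expression the scatterplot and the residuals should still be checked, since a higher correlation alone does not show that the Linearity Condition holds.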

