INTRODUCTION TO PROBABILITY
AND STATISTICS
    FOURTEENTH EDITION
      Chapter 12
      Linear Regression and
      Correlation
INTRODUCTION
• In Chapter 11, we used ANOVA to investigate the effect of various
  factor-level combinations (treatments) on a response x.
• Our objective was to see whether the treatment means were different.
• In Chapters 12 and 13, we investigate a response y which is affected
  by various independent variables, xi.
• Our objective is to use the information provided by the xi to predict
  the value of y.
EXAMPLE
• Let y be a student's college achievement, measured by his/her GPA.
  This might be a function of several variables:
    x1 = rank in high school class
    x2 = high school's overall rating
    x3 = high school GPA
    x4 = SAT scores
• We want to predict y using knowledge of x1, x2, x3 and x4.
EXAMPLE
• Let y be the monthly sales revenue for a company. This might be a
  function of several variables:
    x1 = advertising expenditure
    x2 = time of year
    x3 = state of economy
    x4 = size of inventory
• We want to predict y using knowledge of x1, x2, x3 and x4.
SOME QUESTIONS
• Which of the independent variables are useful and which are not?
• How could we create a prediction equation to allow us to predict y
  using knowledge of x1, x2, x3, etc.?
• How good is this prediction?
We start with the simplest case, in which the response y is a function
of a single independent variable, x.
A SIMPLE LINEAR MODEL
• In Chapter 3, we used the equation of a line to describe the
  relationship between y and x for a sample of n pairs, (x, y).
• If we want to describe the relationship between y and x for the whole
  population, there are two models we can choose:
  • Deterministic model: y = α + βx
  • Probabilistic model:
    – y = deterministic model + random error
    – y = α + βx + ε
  A SIMPLE LINEAR MODEL
• Since the bivariate measurements that we observe do not generally
  fall exactly on a straight line, we choose to use the probabilistic
  model:
      y = α + βx + ε
      E(y) = α + βx
Points deviate from the line of means, E(y) = α + βx, by an amount ε,
where ε has a normal distribution with mean 0 and variance σ².
THE RANDOM ERROR
• The line of means, E(y) = α + βx, describes the average value of y
  for any fixed value of x.
• The population of measurements is generated as y deviates from the
  population line by ε. We estimate α and β using sample information.
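A short simulation can make the probabilistic model concrete. The sketch below (Python; the parameter values α = 40, β = 0.75, σ = 8 are illustrative assumptions, not values from the text) generates data from y = α + βx + ε and checks that the deviations from the line of means average near zero:

```python
import numpy as np

# Illustrative simulation; alpha, beta, sigma are assumed values,
# not taken from the text.
rng = np.random.default_rng(seed=1)
alpha, beta, sigma = 40.0, 0.75, 8.0
x = rng.uniform(20, 80, size=200)
eps = rng.normal(0.0, sigma, size=200)  # random error: N(0, sigma^2)
y = alpha + beta * x + eps              # probabilistic model
deviations = y - (alpha + beta * x)     # distance from the line of means
print(abs(deviations.mean()) < 3.0)     # True: deviations average near 0
```

The deviations are exactly the simulated errors ε, so their sample mean sits close to the population mean of 0.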
             THE METHOD OF
             LEAST SQUARES
• The equation of the best-fitting line is calculated using a set of
  n pairs (xi, yi).
• We choose our estimates a and b to estimate α and β so that the
  vertical distances of the points from the line are minimized.

Best-fitting line: ŷ = a + bx

Choose a and b to minimize
  SSE = Σ(y − ŷ)² = Σ(y − a − bx)²
LEAST SQUARES
ESTIMATORS
• Calculate the sums of squares:

  Sxx = Σx² − (Σx)²/n        Syy = Σy² − (Σy)²/n

  Sxy = Σxy − (Σx)(Σy)/n

Best-fitting line: ŷ = a + bx, where

  b = Sxy/Sxx   and   a = ȳ − b x̄
EXAMPLE
The table shows the math achievement test
scores for a random sample of n = 10 college
freshmen, along with their final calculus
grades.
  Student            1    2    3    4    5    6    7    8    9   10
  Math test, x      39   43   21   64   57   47   28   75   34   52
  Calculus grade, y 65   78   52   82   92   89   73   98   56   75

Use your calculator to find the sums and sums of squares:

  Σx = 460       Σy = 760
  Σx² = 23634    Σy² = 59816
  Σxy = 36854
  x̄ = 46         ȳ = 76
EXAMPLE
Sxx = 23634 − (460)²/10 = 2474
Syy = 59816 − (760)²/10 = 2056
Sxy = 36854 − (460)(760)/10 = 1894

b = 1894/2474 = .76556   and   a = 76 − .76556(46) = 40.78

Best-fitting line: ŷ = 40.78 + .77x
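These hand calculations can be checked with a few lines of code. A minimal Python sketch, using the ten (x, y) pairs from the table, reproduces Sxx, Syy, Sxy, and the least squares estimates:

```python
# The ten (x, y) pairs from the table.
x = [39, 43, 21, 64, 57, 47, 28, 75, 34, 52]
y = [65, 78, 52, 82, 92, 89, 73, 98, 56, 75]
n = len(x)
Sxx = sum(v*v for v in x) - sum(x)**2 / n                # 2474.0
Syy = sum(v*v for v in y) - sum(y)**2 / n                # 2056.0
Sxy = sum(a*b for a, b in zip(x, y)) - sum(x)*sum(y)/n   # 1894.0
b = Sxy / Sxx                  # slope estimate
a = sum(y)/n - b * sum(x)/n    # intercept estimate
print(round(b, 5), round(a, 2))  # 0.76556 40.78
```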
    THE ANALYSIS OF VARIANCE
• The total variation in the experiment is measured by the total sum
  of squares:
      Total SS = Syy = Σ(y − ȳ)²
• The Total SS is divided into two parts:
  ✓ SSR (sum of squares for regression): measures the variation
    explained by using x in the model.
  ✓ SSE (sum of squares for error): measures the leftover variation
    not explained by x.
THE ANALYSIS OF
VARIANCE
We calculate:

  SSR = (Sxy)²/Sxx = (1894)²/2474 = 1449.9741

  SSE = Total SS − SSR = Syy − (Sxy)²/Sxx
      = 2056 − 1449.9741 = 606.0259
THE ANOVA TABLE
Total df = n − 1                  Mean Squares
Regression df = 1                 MSR = SSR/(1)
Error df = n − 1 − 1 = n − 2      MSE = SSE/(n − 2)

  Source       df       SS         MS            F
  Regression   1        SSR        SSR/(1)       MSR/MSE
  Error        n − 2    SSE        SSE/(n − 2)
  Total        n − 1    Total SS
THE CALCULUS PROBLEM
  SSR = (Sxy)²/Sxx = (1894)²/2474 = 1449.9741

  SSE = Total SS − SSR = Syy − (Sxy)²/Sxx
      = 2056 − 1449.9741 = 606.0259

  Source       df    SS          MS          F
  Regression   1     1449.9741   1449.9741   19.14
  Error        8     606.0259    75.7532
  Total        9     2056.0000
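The ANOVA quantities follow directly from the sums of squares. A minimal Python sketch, using the values computed above:

```python
# ANOVA quantities from the sums of squares computed earlier.
Sxx, Syy, Sxy, n = 2474.0, 2056.0, 1894.0, 10
SSR = Sxy**2 / Sxx          # variation explained by x
SSE = Syy - SSR             # leftover (unexplained) variation
MSE = SSE / (n - 2)         # best estimate of sigma^2
F = (SSR / 1) / MSE         # F statistic with 1 and n-2 df
print(round(SSR, 4), round(SSE, 4), round(F, 2))  # 1449.9741 606.0259 19.14
```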
TESTING THE USEFULNESS
OF THE MODEL
• The first question to ask is whether the
  independent variable x is of any use in
  predicting y.
• If it is not, then the value of y does not change, regardless of the
  value of x. This implies that the slope of the line, β, is zero.

            H0: β = 0 versus Ha: β ≠ 0
TESTING THE
USEFULNESS OF THE
MODEL
• The test statistic is a function of b, our best estimate of β. Using
  MSE as the best estimate of the random variation σ², we obtain a
  t statistic:

  Test statistic: t = (b − 0) / √(MSE/Sxx), which has a t distribution
  with df = n − 2,

  or a confidence interval: b ± tα/2 √(MSE/Sxx)
THE CALCULUS PROBLEM
• Is there a significant relationship between the calculus grades and
  the test scores at the 5% level of significance?

      H0: β = 0 versus Ha: β ≠ 0

      t = (b − 0)/√(MSE/Sxx) = (.7656 − 0)/√(75.7532/2474) = 4.38

  Reject H0 when |t| > 2.306. Since t = 4.38 falls into the rejection
  region, H0 is rejected.
There is a significant linear relationship between the calculus grades and the
test scores for the population of college freshmen.
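The t statistic can be verified numerically. A minimal Python sketch, using the estimates from the calculus example:

```python
from math import sqrt

# Values from the calculus example above.
b, MSE, Sxx, n = 0.76556, 75.7532, 2474.0, 10
t = (b - 0) / sqrt(MSE / Sxx)    # test statistic for H0: beta = 0
print(round(t, 2))               # 4.38 (compare with t_.025 = 2.306, df = 8)
```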
THE F TEST
• You can test the overall usefulness of the model using an F test. If
  the model is useful, MSR will be large compared to the unexplained
  variation, MSE.

  To test H0: the model is useful in predicting y,

  Test statistic: F = MSR/MSE

  Reject H0 if F > Fα with 1 and n − 2 df.

  This test is exactly equivalent to the t-test, with t² = F.
MINITAB OUTPUT
Regression Analysis: y versus x

The regression equation is y = 40.8 + 0.766 x

Predictor         Coef     SE Coef      T       P
Constant        40.784       8.507    4.79   0.001
x               0.7656      0.1750    4.38   0.002

S = 8.70363    R-Sq = 70.5%    R-Sq(adj) = 66.8%

Analysis of Variance
Source           DF      SS        MS       F       P
Regression        1   1450.0    1450.0   19.14   0.002
Residual Error    8    606.0      75.8
Total             9   2056.0

✓ The regression equation is the least squares line; the Coef column
  gives the regression coefficients a and b.
✓ The T and P values in the row for x test H0: β = 0; the MS for
  Residual Error is MSE, and t² = F.
MEASURING THE STRENGTH
OF THE RELATIONSHIP
• If the independent variable x is useful in predicting y, you will
  want to know how well the model fits.
• The strength of the relationship between x and y can be measured
  using:

  Correlation coefficient: r = Sxy / √(Sxx Syy)

  Coefficient of determination: r² = (Sxy)² / (Sxx Syy) = SSR / Total SS
MEASURING THE STRENGTH
OF THE RELATIONSHIP
• Since Total SS = SSR + SSE, r² measures
✓ the proportion of the total variation in the responses that can be
  explained by using the independent variable x in the model.
✓ the percent reduction in the total variation achieved by using the
  regression equation rather than just using the sample mean y-bar to
  estimate y.

For the calculus problem, r² = SSR/Total SS = .705, or 70.5%. The
model is working well!
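A quick check of r and r² for the calculus problem, using the sums of squares computed earlier (a minimal Python sketch):

```python
from math import sqrt

# Sums of squares from the calculus example.
Sxx, Syy, Sxy = 2474.0, 2056.0, 1894.0
r = Sxy / sqrt(Sxx * Syy)          # correlation coefficient
r2 = Sxy**2 / (Sxx * Syy)          # coefficient of determination = SSR/Total SS
print(round(r, 4), round(r2, 3))   # 0.8398 0.705
```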
INTERPRETING A
SIGNIFICANT REGRESSION
 •   Even if you do not reject the null hypothesis
     that the slope of the line equals 0, it does
     not necessarily mean that y and x are
     unrelated.
 •   Type II error—falsely declaring that the
     slope is 0 and that x and y are unrelated.
 •   It may happen that y and x are perfectly
     related in a nonlinear way.
SOME CAUTIONS
•   You may have fit the wrong model.
•   Extrapolation—predicting values of y
    outside the range of the fitted data.
•   Causality—Do not conclude that x causes
    y. There may be an unknown variable at
    work!
CHECKING THE
REGRESSION ASSUMPTIONS
 • Remember that the results of a regression
   analysis are only valid when the necessary
   assumptions have been satisfied.
1. The relationship between x and y is linear, given by
   y = α + βx + ε.
2. The random error terms ε are independent and, for any value of x,
   have a normal distribution with mean 0 and variance σ².
DIAGNOSTIC TOOLS
• We use the same diagnostic tools
  used in Chapter 11 to check the
  normality assumption and the
  assumption of equal variances.
 1. Normal probability plot of
    residuals
 2. Plot of residuals versus fit or
    residuals versus variables
RESIDUALS
• The residual error is the "leftover" variation in each data point
  after the variation explained by the regression model has been
  removed.

      Residual = yi − ŷi = yi − a − bxi

• If all assumptions have been met, these residuals should be normal,
  with mean 0 and variance σ².
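As a sketch of the residual calculation, the Python lines below compute the ten residuals for the calculus example; a standard property of least squares is that the residuals sum to zero (up to rounding):

```python
# Residuals for the calculus example, using the fitted line y-hat = a + b*x.
x = [39, 43, 21, 64, 57, 47, 28, 75, 34, 52]
y = [65, 78, 52, 82, 92, 89, 73, 98, 56, 75]
a, b = 40.78424, 0.76556
residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]
print(abs(sum(residuals)) < 1e-6)  # True: least squares residuals sum to zero
```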
NORMAL PROBABILITY PLOT
✓ If the normality assumption is valid,
  the plot should resemble a straight
  line, sloping upward to the right.
✓ If not, you will often see the pattern
  fail in the tails of the graph.
[Figure: Normal probability plot of the residuals (response is y),
with residuals on the horizontal axis and normal percentiles on the
vertical axis.]
RESIDUALS VERSUS FITS
✓ If the equal variance assumption is
  valid, the plot should appear as a
  random scatter around the zero
  center line.
✓ If not, you will see a pattern in the
  residuals.

[Figure: Residuals versus the fitted values (response is y).]
ESTIMATION AND
PREDICTION
• Once you have
  ✓ determined that the regression line is useful
  ✓ used the diagnostic plots to check for violation of the
    regression assumptions,
• you are ready to use the regression line to
  ✓ Estimate the average value of y for a given value of x
  ✓ Predict a particular value of y for a given value of x.
ESTIMATION AND
PREDICTION
Estimating a
particular value of y
when x = x0
                        Estimating the
                        average value of
                        y when x = x0
ESTIMATION AND
PREDICTION
• The best estimate of either E(y) or y for a given value x = x0 is

      ŷ = a + bx0

• Particular values of y are more difficult to predict, requiring a
  wider range of values in the prediction interval.
ESTIMATION AND
PREDICTION
To estimate the average value of y when x = x0:

  ŷ ± tα/2 √[ MSE (1/n + (x0 − x̄)²/Sxx) ]

To predict a particular value of y when x = x0:

  ŷ ± tα/2 √[ MSE (1 + 1/n + (x0 − x̄)²/Sxx) ]
THE CALCULUS
PROBLEM
Estimate the average calculus grade for students whose achievement
score is 50, with a 95% confidence interval.

Calculate ŷ = 40.78424 + .76556(50) = 79.06

  ŷ ± 2.306 √[ 75.7532 (1/10 + (50 − 46)²/2474) ]

  79.06 ± 6.55, or 72.51 to 85.61.
 THE CALCULUS
 PROBLEM
Estimate the calculus grade for a particular student whose achievement
score is 50, with a 95% prediction interval.

Calculate ŷ = 40.78424 + .76556(50) = 79.06

  ŷ ± 2.306 √[ 75.7532 (1 + 1/10 + (50 − 46)²/2474) ]

  79.06 ± 21.11, or 57.95 to 100.17.

Notice how much wider this interval is!
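Both intervals can be reproduced with a short calculation. A Python sketch using the values from the calculus example (t.025 = 2.306 for df = 8):

```python
from math import sqrt

# Values from the calculus example; t_crit = t_.025 with df = 8.
n, MSE, Sxx, xbar = 10, 75.7532, 2474.0, 46.0
a, b, x0, t_crit = 40.78424, 0.76556, 50.0, 2.306
y_hat = a + b * x0
half_ci = t_crit * sqrt(MSE * (1/n + (x0 - xbar)**2 / Sxx))      # mean of y
half_pi = t_crit * sqrt(MSE * (1 + 1/n + (x0 - xbar)**2 / Sxx))  # single y
print(round(y_hat, 2), round(half_ci, 2), round(half_pi, 2))  # 79.06 6.55 21.11
```

The extra "1 +" inside the prediction-interval radical accounts for the variability of a single observation around its mean, which is why that interval is so much wider.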
MINITAB OUTPUT
            Confidence and prediction
            intervals when x = 50
Predicted Values for New Observations
New Obs   Fit     SE Fit    95.0% CI                                95.0% PI
1       79.06       2.84   (72.51, 85.61)                         (57.95,100.17)
Values of Predictors for New Observations
New Obs         x
1            50.0

[Figure: Fitted line plot of y = 40.78 + 0.7656x with 95% CI and
95% PI bands; S = 8.70363, R-Sq = 70.5%, R-Sq(adj) = 66.8%.]

✓ Green prediction bands are always wider than red confidence bands.
✓ Both intervals are narrowest when x = x̄.
CORRELATION ANALYSIS
• The strength of the relationship between x and y is measured using
  the coefficient of correlation:

  Correlation coefficient: r = Sxy / √(Sxx Syy)

• Recall from Chapter 3 that
  (1) −1 ≤ r ≤ 1
  (2) r and b have the same sign
  (3) r ≈ 0 means no linear relationship
  (4) r near 1 or −1 means a strong (+) or (−) relationship
EXAMPLE
The table shows the heights and weights of
n = 10 randomly selected college football
players.
  Player       1    2    3    4    5    6    7    8    9   10
  Height, x   73   71   75   72   72   75   67   69   71   69
  Weight, y  185  175  200  210  190  195  150  170  180  175

Use your calculator to find the sums and sums of squares:

  Sxy = 328    Sxx = 60.4    Syy = 2610

  r = 328 / √[(60.4)(2610)] = .8261
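The sums of squares and r for the football data can be checked in code. A minimal Python sketch:

```python
from math import sqrt

# Heights and weights of the n = 10 football players from the table.
height = [73, 71, 75, 72, 72, 75, 67, 69, 71, 69]
weight = [185, 175, 200, 210, 190, 195, 150, 170, 180, 175]
n = len(height)
Sxx = sum(h*h for h in height) - sum(height)**2 / n
Syy = sum(w*w for w in weight) - sum(weight)**2 / n
Sxy = sum(h*w for h, w in zip(height, weight)) - sum(height)*sum(weight)/n
r = Sxy / sqrt(Sxx * Syy)
print(round(Sxy, 1), round(Sxx, 1), round(Syy, 1), round(r, 4))
```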
FOOTBALL PLAYERS
[Figure: Scatterplot of Weight vs Height for the n = 10 players.]

r = .8261: strong positive correlation. As the player's height
increases, so does his weight.
SOME CORRELATION PATTERNS
• Use the Exploring Correlation applet to explore some correlation
  patterns:
  – r = 0: no correlation
  – r = .931: strong positive correlation
  – r = 1: linear relationship
  – r = −.67: weaker negative correlation
INFERENCE USING R
• The population coefficient of correlation is called ρ ("rho"). We
  can test for a significant correlation between x and y using a
  t test:

  To test H0: ρ = 0 versus Ha: ρ ≠ 0,

  Test statistic: t = r √[(n − 2)/(1 − r²)]

  Reject H0 if t > tα/2 or t < −tα/2 with n − 2 df.

  This test is exactly equivalent to the t-test for the slope, β = 0.
EXAMPLE
Is there a significant positive correlation between weight and height
(r = .8261) in the population of all college football players?

  H0: ρ = 0 versus Ha: ρ > 0

  Test statistic: t = r √[(n − 2)/(1 − r²)]
                    = .8261 √[8/(1 − .8261²)] = 4.15

Use the t-table with n − 2 = 8 df to bound the p-value as
p-value < .005. There is a significant positive correlation.
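A quick numeric check of this test statistic (a minimal Python sketch):

```python
from math import sqrt

# Test statistic for H0: rho = 0, using r from the football example.
r, n = 0.8261, 10
t = r * sqrt((n - 2) / (1 - r**2))
print(round(t, 2))  # 4.15
```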
KEY CONCEPTS
I. A Linear Probabilistic Model
1. When the data exhibit a linear relationship, the appropriate model
   is y = α + βx + ε.
2. The random error ε has a normal distribution with mean 0 and
   variance σ².
II. Method of Least Squares
1. Estimates a and b, for α and β, are chosen to minimize SSE, the sum
   of the squared deviations about the regression line, ŷ = a + bx.
2. The least squares estimates are b = Sxy/Sxx and a = ȳ − b x̄.
KEY CONCEPTS
III. Analysis of Variance
1. Total SS = SSR + SSE, where Total SS = Syy and SSR = (Sxy)² / Sxx.
2. The best estimate of σ² is MSE = SSE / (n − 2).
IV. Testing, Estimation, and Prediction
1.   A test for the significance of the linear regression, H0: β = 0,
     can be implemented using one of two test statistics:

         t = b / √(MSE/Sxx)    or    F = MSR/MSE
KEY CONCEPTS
2.   The strength of the relationship between x and y can be measured
     using

         R² = SSR / Total SS

     which gets closer to 1 as the relationship gets stronger.
3.   Use residual plots to check for nonnormality,
     inequality of variances, and an incorrectly fit model.
4.   Confidence intervals can be constructed to estimate the intercept
     α and slope β of the regression line and to estimate the average
     value of y, E(y), for a given value of x.
5.   Prediction intervals can be constructed to predict a
     particular observation, y, for a given value of x. For a
     given x, prediction intervals are always wider than
     confidence intervals.
   KEY CONCEPTS
V. Correlation Analysis
1. Use the correlation coefficient to measure the relationship between
   x and y when both variables are random:

       r = Sxy / √(Sxx Syy)

2. The sign of r indicates the direction of the relationship; r near 0
   indicates no linear relationship, and r near 1 or −1 indicates a
   strong linear relationship.
3. A test of the significance of the correlation coefficient is
   identical to the test of the slope β.