Assumptions of Regression: L.I.N.E.
• Linearity
The relationship between X and Y is linear
• Independence of Errors
Error values are statistically independent
• Normality of Error
Error values are normally distributed for any given value of X
• Equal Variance (also called homoscedasticity)
The probability distribution of the errors has constant variance
Residual Analysis
$e_i = Y_i - \hat{Y}_i$
• The residual for observation i, ei, is the difference between its observed and predicted value
• Check the assumptions of regression by examining the residuals
– Examine for Linearity assumption
– Evaluate Independence assumption
– Evaluate Normal distribution assumption
– Examine for constant variance for all levels of X (homoscedasticity)
• Graphical Analysis of Residuals
– Can plot residuals vs. X (see the sketch below)
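A minimal sketch of this residual-vs-X plot with numpy and matplotlib, using the house-price data that appears later in this section; the variable names are illustrative:

```python
# Compute residuals e_i = Y_i - Yhat_i and plot them against X.
import numpy as np
import matplotlib.pyplot as plt

x = np.array([1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700])
y = np.array([245, 312, 279, 308, 199, 219, 405, 324, 319, 255])

b1, b0 = np.polyfit(x, y, deg=1)   # least-squares slope and intercept
residuals = y - (b0 + b1 * x)      # e_i = Y_i - Yhat_i

plt.scatter(x, residuals)
plt.axhline(0, linestyle="--", color="gray")
plt.xlabel("X (square feet)")
plt.ylabel("Residuals")
plt.show()
```

A random scatter around zero supports the linearity assumption; a curved or funnel-shaped pattern suggests a violation.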
Residual Analysis for Linearity
Linearity: The relationship between X and Y is linear
[Figure: two pairs of panels, each showing Y vs. X above the residuals vs. X. Left panel ("Not Linear"): a curved Y-X relationship leaves a systematic curve in the residuals. Right panel ("Linear" ✓): the residuals scatter randomly around zero.]
Residual Analysis for Independence
Independence of Errors: Error values are statistically independent
[Figure: residuals plotted vs. X. The "Not Independent" panels show residuals that trend or cycle with X; the "Independent" panel shows no pattern across X.]
Checking for Normality
• Examine the Stem-and-Leaf Display of the Residuals
• Examine the Boxplot of the Residuals
• Examine the Histogram of the Residuals
• Construct a Normal Probability Plot of the Residuals
Residual Analysis for Normality
Normality of Error: Error values are normally distributed for any given value of X
When using a normal probability plot, normally distributed errors will fall approximately along a straight line
[Figure: normal probability plot, Percent (0 to 100) vs. Residual (about -3 to 3); the points fall approximately on a straight line.]
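A minimal sketch of such a normal probability plot with scipy; the residual values are the ones from the Excel residual output shown later in this section:

```python
# Normal probability (Q-Q) plot of the residuals.
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

residuals = np.array([-6.92, 38.12, -5.85, 3.94, -19.99,
                      -49.39, 48.80, -43.18, 64.33, -29.85])

# Points lying near the reference line suggest roughly normal errors.
stats.probplot(residuals, dist="norm", plot=plt)
plt.show()
```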
Residual Analysis for Equal Variance
Equal Variance (also called homoscedasticity): The probability distribution of the errors has
constant variance
[Figure: two pairs of panels, each showing Y vs. X above the residuals vs. X. Left panel ("Non-constant variance"): the residual spread fans out as X grows. Right panel ("Constant variance"): the residual spread is similar at all levels of X.]
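As a rough, informal numeric companion to the plot (not a formal test), one can compare the residual spread in the lower and upper halves of X; a sketch using the residuals from the house-price example below:

```python
# Crude equal-variance check: compare residual spread across halves of X.
import numpy as np

x = np.array([1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700])
e = np.array([-6.92, 38.12, -5.85, 3.94, -19.99, -49.39, 48.80, -43.18, 64.33, -29.85])

low = x <= np.median(x)
# Similar standard deviations in the two groups suggest constant variance.
print(e[low].std(ddof=1), e[~low].std(ddof=1))
```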
Simple Linear Regression Example: Computing Residuals

Estimated regression equation: $\hat{Y} = a + bX = 98.25 + 0.1098X$
For the first house: $\hat{Y} = 98.25 + 0.1098(1400) = 251.97$

House Price in $1000s (Y)   Square Feet (X)   Residual = Y - Ŷ
245                         1400              245 - 251.97 = -6.97
312                         1600              38.12
279                         1700              -5.86
308                         1875              3.925
199                         1100              -19.98
219                         1550              -49.39
405                         2350              48.77
324                         2450              -43.21
319                         1425              64.335
255                         1700              -29.86
Simple Linear Regression Example: Excel Residual Output
RESIDUAL OUTPUT

Observation   Square Feet (X)   Residuals
1             1400              -6.923162
2             1600              38.12329
3             1700              -5.853484
4             1875              3.937162
5             1100              -19.99284
6             1550              -49.38832
7             2350              48.79749
8             2450              -43.17929
9             1425              64.33264
10            1700              -29.85348

[Figure: "House Price Model Residual Plot" — Residuals (about -60 to 80) vs. Square Feet (0 to 3000), scattered randomly around zero.]
The residual plot does not appear to violate any regression assumptions
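A minimal sketch that reproduces this residual output with numpy least squares rather than Excel, using the ten houses from the table above:

```python
# Refit the model and recompute the residuals from the raw data.
import numpy as np

sqft  = np.array([1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700])
price = np.array([245, 312, 279, 308, 199, 219, 405, 324, 319, 255])

X = np.column_stack([np.ones_like(sqft, dtype=float), sqft])
(b0, b1), *_ = np.linalg.lstsq(X, price, rcond=None)   # ~98.248, ~0.10977

residuals = price - (b0 + b1 * sqft)
for i, (x_i, e_i) in enumerate(zip(sqft, residuals), start=1):
    print(f"{i:2d}  {x_i:5d}  {e_i:10.6f}")   # matches the RESIDUAL OUTPUT above
```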
Inferences About the Slope
• The standard error of the regression slope coefficient (b1) is estimated by
$S_{b_1} = \dfrac{S_{YX}}{\sqrt{SSX}} = \dfrac{S_{YX}}{\sqrt{\sum (X_i - \bar{X})^2}}$

where:
Sb1 = estimate of the standard error of the slope
$S_{YX} = \sqrt{\dfrac{SSE}{n-2}}$ = standard error of the estimate
Inferences About the Slope: t Test
• t test for a population slope
– Is there a linear relationship between X and Y?
• Null and alternative hypotheses
– H0: β1 = 0 (no linear relationship)
– H1: β1 ≠ 0 (linear relationship does exist)
• Test statistic (see the sketch after this list)

$t_{STAT} = \dfrac{b_1 - \beta_1}{S_{b_1}}$,  d.f. $= n - 2$

where:
b1 = regression slope coefficient
β1 = hypothesized slope
Sb1 = standard error of the slope
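A minimal sketch computing $S_{YX}$, $S_{b_1}$, and $t_{STAT}$ from first principles with numpy; the function and variable names are illustrative:

```python
# t test for the slope: t = (b1 - beta1) / S_b1, with d.f. = n - 2.
import numpy as np

def slope_t_stat(x, y, beta1_hyp=0.0):
    n = len(x)
    b1, b0 = np.polyfit(x, y, deg=1)
    sse = np.sum((y - (b0 + b1 * x)) ** 2)
    s_yx = np.sqrt(sse / (n - 2))                        # standard error of the estimate
    s_b1 = s_yx / np.sqrt(np.sum((x - x.mean()) ** 2))   # standard error of the slope
    return (b1 - beta1_hyp) / s_b1
```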
Inferences About the Slope: t Test Example
H0: β1 = 0
H1: β1 ≠ 0

Test statistic: tSTAT = 3.329, d.f. = 10 - 2 = 8
At α = 0.05 (α/2 = 0.025 in each tail), the critical values are ±tα/2 = ±2.3060

Decision: Since tSTAT = 3.329 > 2.3060, reject H0. There is sufficient evidence that square footage affects house price.

[Figure: t distribution with d.f. = 8; rejection regions beyond -2.3060 and 2.3060, each with area 0.025; tSTAT = 3.329 falls in the upper rejection region.]
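The critical value above can be checked with a quick scipy sketch:

```python
# Two-tailed critical value for alpha = 0.05 with d.f. = 8.
from scipy import stats

t_crit = stats.t.ppf(1 - 0.025, df=8)   # ~2.3060
print(3.329 > t_crit)                   # True -> reject H0
```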
Inferences About the Slope: t Test Example
Estimated regression equation: $\hat{Y} = a + bX = 98.25 + 0.1098 \times$ (square feet)

The slope of this model is 0.1098. Is there a relationship between the square footage of the house and its sales price?

House Price in $1000s (Y)   Square Feet (X)
245                         1400
312                         1600
279                         1700
308                         1875
199                         1100
219                         1550
405                         2350
324                         2450
319                         1425
255                         1700
Inferences About the Slope: t Test Example
H0: β1 = 0
H1: β1 ≠ 0

From Excel output:
              Coefficients   Standard Error   t Stat    P-value
Intercept     98.24833       58.03348         1.69296   0.12892
Square Feet   0.10977        0.03297          3.32938   0.01039

From Minitab output:
Predictor     Coef      SE Coef   T      P
Constant      98.25     58.03     1.69   0.129
Square Feet   0.10977   0.03297   3.33   0.010

Both outputs give b1 = 0.10977 and Sb1 = 0.03297, so

$t_{STAT} = \dfrac{b_1 - \beta_1}{S_{b_1}} = \dfrac{0.10977 - 0}{0.03297} = 3.32938$
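A quick sketch verifying the printed t statistic and its p-value:

```python
# Verify t_STAT and the two-tailed p-value (d.f. = 10 - 2 = 8).
from scipy import stats

b1, s_b1 = 0.10977, 0.03297
t_stat = (b1 - 0) / s_b1                      # ~3.32938
p_value = 2 * stats.t.sf(abs(t_stat), df=8)   # ~0.0104, matching the output
```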
Multiple Regression
The Multiple Regression Model
Idea: Examine the linear relationship between 1 dependent (Y) & 2 or more independent variables (Xi)
Multiple Regression Model with k Independent Variables:
$Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \cdots + \beta_k X_{ki} + \varepsilon_i$

where β0 is the Y-intercept, β1, …, βk are the population slopes, and εi is the random error.
Multiple Regression Equation
The coefficients of the multiple regression model are estimated using sample data
Multiple regression equation with k independent variables:
$\hat{Y}_i = b_0 + b_1 X_{1i} + b_2 X_{2i} + \cdots + b_k X_{ki}$

where $\hat{Y}_i$ is the estimated (predicted) value of Y, b0 is the estimated intercept, and b1, …, bk are the estimated slope coefficients.
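In code, this estimated equation is just a dot product; a tiny sketch with illustrative names:

```python
# Yhat_i = b0 + b1*X1i + ... + bk*Xki as a dot product.
import numpy as np

def predict(b, x_row):
    """b = [b0, b1, ..., bk]; x_row = [X1i, ..., Xki] for one observation."""
    return b[0] + float(np.dot(b[1:], x_row))
```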
Multiple Regression Equation
(continued)
Two-variable model: $\hat{Y} = b_0 + b_1 X_1 + b_2 X_2$

[Figure: the fitted regression plane over the (X1, X2) plane, with Y on the vertical axis.]
Example: 2 Independent Variables
• A distributor of frozen dessert pies wants to evaluate factors thought
to influence demand
– Dependent variable: Pie sales (units per week)
– Independent variables: Price (in $) and Advertising (in $100s)
• Data are collected for 15 weeks
Pie Sales Example
Multiple regression equation: Sales = b0 + b1 (Price) + b2 (Advertising)

Week   Pie Sales   Price ($)   Advertising ($100s)
1      350         5.50        3.3
2      460         7.50        3.3
3      350         8.00        3.0
4      430         8.00        4.5
5      350         6.80        3.0
6      380         7.50        4.0
7      430         4.50        3.0
8      470         6.40        3.7
9      450         7.00        3.5
10     490         5.00        4.0
11     340         7.20        3.5
12     300         7.90        3.2
13     440         5.90        4.0
14     450         5.00        3.5
15     300         7.00        2.7
Excel Multiple Regression Output
Regression Statistics
Multiple R          0.72213
R Square            0.52148
Adjusted R Square   0.44172
Standard Error      47.46341
Observations        15

Estimated equation: Sales = 306.526 - 24.975(Price) + 74.131(Advertising)
ANOVA         df   SS          MS          F         Significance F
Regression    2    29460.027   14730.013   6.53861   0.01201
Residual      12   27033.306   2252.776
Total         14   56493.333

              Coefficients   Standard Error   t Stat     P-value   Lower 95%   Upper 95%
Intercept     306.52619      114.25389        2.68285    0.01993   57.58835    555.46404
Price         -24.97509      10.83213         -2.30565   0.03979   -48.57626   -1.37392
Advertising   74.13096       25.96732         2.85478    0.01449   17.55303    130.70888
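A minimal sketch reproducing this fit with numpy least squares, using the 15 weeks of data from the pie sales table above:

```python
# Fit Sales = b0 + b1*Price + b2*Advertising by least squares.
import numpy as np

sales = np.array([350, 460, 350, 430, 350, 380, 430, 470,
                  450, 490, 340, 300, 440, 450, 300], dtype=float)
price = np.array([5.50, 7.50, 8.00, 8.00, 6.80, 7.50, 4.50, 6.40,
                  7.00, 5.00, 7.20, 7.90, 5.90, 5.00, 7.00])
advertising = np.array([3.3, 3.3, 3.0, 4.5, 3.0, 4.0, 3.0, 3.7,
                        3.5, 4.0, 3.5, 3.2, 4.0, 3.5, 2.7])

X = np.column_stack([np.ones_like(sales), price, advertising])
b, *_ = np.linalg.lstsq(X, sales, rcond=None)
print(b)   # ~[306.526, -24.975, 74.131], matching the Excel output
```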