Statistics 203: Introduction to Regression
and Analysis of Variance
Multiple Linear Regression: Diagnostics
Jonathan Taylor
Today
Splines + other bases.
Diagnostics
Outline:
Today
Spline models
What are the assumptions?
Problems in the regression function
Partial residual plot
Added-variable plot
Problems with the errors
Outliers & Influence
Dropping an observation
Different residuals
Crude outlier detection test
Bonferroni correction for multiple comparisons
DFFITS
Cook's distance
DFBETAS
Spline models
Splines are piecewise polynomial functions, i.e. on an
interval between "knots" $(t_i, t_{i+1})$ the spline $f(x)$ is
polynomial, but the coefficients change from one interval to the next.
Example: cubic spline with knots at $t_1 < t_2 < \dots < t_h$
\[ f(x) = \sum_{j=0}^{3} \beta_{0j} x^j + \sum_{i=1}^{h} \beta_i (x - t_i)_+^3 \]
where
\[ (x - t_i)_+ = \begin{cases} x - t_i & \text{if } x - t_i \ge 0 \\ 0 & \text{otherwise.} \end{cases} \]
Here is an example.
Conditioning problem again: B-splines are used to keep the
model subspace the same while making the design matrix less
ill-conditioned.
Other bases one might use: Fourier: sin and cos waves;
Wavelet: space/time localized basis for functions.
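The truncated power basis above can be sketched in code. This is a minimal illustration, assuming NumPy; the function name `truncated_power_basis`, the simulated data, and the knot locations are all hypothetical choices, not taken from the lecture.

```python
import numpy as np

def truncated_power_basis(x, knots):
    """Design matrix for a cubic spline: columns 1, x, x^2, x^3, (x - t_i)_+^3."""
    x = np.asarray(x, dtype=float)
    poly = np.vander(x, 4, increasing=True)                  # 1, x, x^2, x^3
    shifted = x[:, None] - np.asarray(knots, dtype=float)    # x - t_i, one column per knot
    trunc = np.clip(shifted, 0.0, None) ** 3                 # (x - t_i)_+^3
    return np.hstack([poly, trunc])

# Hypothetical data: a smooth curve plus noise (not from the lecture).
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 200)
y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(x.size)

knots = [0.25, 0.5, 0.75]
B = truncated_power_basis(x, knots)
coef, *_ = np.linalg.lstsq(B, y, rcond=None)   # ordinary least squares on the basis
fitted = B @ coef

print(B.shape)              # 4 polynomial columns plus one column per knot
print(np.linalg.cond(B))    # condition number of the design matrix
```

The printed condition number is typically large for this basis, which is the conditioning problem that motivates switching to B-splines for the same model subspace.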
What are the assumptions?
What is the full model for a given design matrix $X$?
\[ Y_i = \beta_0 + \beta_1 X_{i1} + \dots + \beta_{p-1} X_{i,p-1} + \varepsilon_i \]
Errors $\varepsilon \sim N(0, \sigma^2 I)$.
What can go wrong?
Regression function can be wrong: missing predictors,
nonlinearity.
Assumptions about the errors can be wrong.
Outliers & influential observations: both in predictors and
observations.
Problems in the regression function
True regression function may have higher-order non-linear
terms, e.g. $X_1^2$, or even interactions $X_1 X_2$.
How to fix? Difficult in general. We will look at two plots:
added-variable plots and partial residual plots.
Partial residual plot
For $1 \le j \le p-1$ let
\[ e_{ij} = e_i + b_j X_{ij}. \]
Plotting $e_{ij}$ against $X_{ij}$ can help to determine whether the
variance depends on $X_j$ and to spot outliers.
If there is a non-linear trend, it is evidence that linear is not
sufficient.
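A minimal sketch of how partial residuals might be computed, assuming NumPy and a design matrix whose first column is the intercept. The function name `partial_residuals` and the simulated data (with a deliberately quadratic effect in the first predictor) are hypothetical.

```python
import numpy as np

def partial_residuals(X, y, j):
    """Return e_i + b_j * X_ij for a design matrix X whose column 0 is the intercept."""
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    e = y - X @ b                      # ordinary residuals from the full fit
    return e + b[j] * X[:, j]

# Hypothetical data with a quadratic effect in the first predictor.
rng = np.random.default_rng(1)
n = 100
x1 = rng.uniform(-2, 2, n)
x2 = rng.standard_normal(n)
y = 1.0 + x1**2 + 0.5 * x2 + 0.1 * rng.standard_normal(n)
X = np.column_stack([np.ones(n), x1, x2])

pr1 = partial_residuals(X, y, 1)   # plotted against x1, this should show curvature
pr2 = partial_residuals(X, y, 2)   # plotted against x2, this should look roughly linear
```

Because the ordinary residuals are orthogonal to every column of $X$, a straight-line fit of `pr2` on `x2` recovers the coefficient $b_2$; a visible departure from that line, as with `pr1`, is the non-linear trend the slide refers to.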
Added-variable plot
For $1 \le j \le p-1$ let $H_{(j)}$ be the hat matrix with this predictor
deleted. Plot
\[ (I - H_{(j)})Y \quad \text{vs.} \quad (I - H_{(j)})X_j. \]
Plot should be linear and the slope should be $\beta_j$. Why?
\[ Y = X_{(j)}\beta_{(j)} + \beta_j X_j + \varepsilon \]
\[ (I - H_{(j)})Y = (I - H_{(j)})X_{(j)}\beta_{(j)} + \beta_j (I - H_{(j)})X_j + (I - H_{(j)})\varepsilon \]
Since $(I - H_{(j)})X_{(j)} = 0$, the first term vanishes:
\[ (I - H_{(j)})Y = \beta_j (I - H_{(j)})X_j + (I - H_{(j)})\varepsilon \]
Also can be helpful for detecting outliers.
If there is a non-linear trend, it is evidence that linear is not
sufficient.
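The derivation above can be checked numerically. The following is a sketch assuming NumPy; `added_variable` and the simulated data are hypothetical. In the sample, the least-squares slope of the added-variable plot equals the fitted coefficient $b_j$ from the full regression.

```python
import numpy as np

def added_variable(X, y, j):
    """Return ((I - H_(j)) X_j, (I - H_(j)) y) for design matrix X."""
    X_rest = np.delete(X, j, axis=1)           # design with predictor j removed

    def resid(v):
        coef, *_ = np.linalg.lstsq(X_rest, v, rcond=None)
        return v - X_rest @ coef               # residual after projecting onto X_rest

    return resid(X[:, j]), resid(y)

# Hypothetical data with correlated predictors.
rng = np.random.default_rng(2)
n = 80
X = np.column_stack([np.ones(n), rng.standard_normal((n, 2))])
X[:, 2] += 0.6 * X[:, 1]
y = X @ np.array([1.0, 2.0, -1.5]) + 0.3 * rng.standard_normal(n)

xj_res, y_res = added_variable(X, y, 2)
slope = (xj_res @ y_res) / (xj_res @ xj_res)   # least-squares slope through the origin
b, *_ = np.linalg.lstsq(X, y, rcond=None)
print(slope, b[2])                             # the two should agree up to rounding
```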
Problems with the errors
Errors may not be normally distributed. We will look at a
QQ-plot for a graphical check. May not affect inference in
large samples.
Variance may not be constant. Transformations can
sometimes help correct this. Non-constant variance affects
our estimates of $SE(\widehat{\beta})$, which can change $t$ and $F$ statistics
substantially!
Graphical checks of non-constant variance: added-variable
plots, partial residual plots, fitted vs. residual plots.
Errors may not be independent. This can seriously affect our
estimates of $SE(\widehat{\beta})$.
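As a rough illustration of the QQ-plot check, the coordinates of the plot can be computed with the Python standard library alone. The function `qq_points`, the plotting positions $(k - 0.5)/n$ (one common convention, assumed here), and the residual values are all hypothetical.

```python
from statistics import NormalDist

def qq_points(residuals):
    """Pair sorted residuals with standard normal quantiles at (k - 0.5) / n."""
    n = len(residuals)
    theo = [NormalDist().inv_cdf((k - 0.5) / n) for k in range(1, n + 1)]
    return list(zip(theo, sorted(residuals)))

# Hypothetical residuals; in practice these come from a fitted model.
res = [0.3, -1.2, 0.8, -0.1, 2.5, -0.7, 0.1]
for q, r in qq_points(res):
    print(f"{q:6.2f} {r:6.2f}")   # theoretical quantile, observed residual
```

If the errors are roughly normal, the printed pairs fall close to a straight line; systematic bending at the ends suggests heavy or light tails.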
Outliers & Influence
Some residuals may be much larger than others which can
affect the overall fit of the model. This may be evidence of an
outlier: a point where the model has very poor fit. This can
be caused by many factors and such points should not be
automatically deleted from the dataset.
Even if an observation does not have a large residual, it can
exert a strong influence on the regression function.
General strategy to measure influence: for each
observation, drop it from the model and measure how much
the model changes.
Dropping an observation
A subscript $(i)$ indicates that the $i$-th observation was not used in fitting the
model.
For example: $\widehat{Y}_{j(i)}$ is the regression function evaluated at the
$j$-th observation's predictors BUT the coefficients
$(b_{0,(i)}, \dots, b_{p-1,(i)})$ were fit after deleting the $i$-th row of data.
Basic idea: if $\widehat{Y}_{j(i)}$ is very different from $\widehat{Y}_j$ (using all the data)
then $i$ is an influential point for determining $\widehat{Y}_j$.
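The drop-one idea can be sketched directly by refitting. This is a minimal example assuming NumPy; the helper names `fit` and `fitted_without` and the six data points are hypothetical, chosen so that one high-leverage point lies far from the line through the others.

```python
import numpy as np

def fit(X, y):
    """Least-squares coefficients for y ~ X."""
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    return b

def fitted_without(X, y, i):
    """Fitted values at all n rows, using coefficients fit with row i deleted."""
    b_i = fit(np.delete(X, i, axis=0), np.delete(y, i))
    return X @ b_i

# Hypothetical data: points on a line, plus one high-leverage point off the line.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 10.0])
y = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 0.0])
X = np.column_stack([np.ones_like(x), x])

full = X @ fit(X, y)
changes = [np.max(np.abs(full - fitted_without(X, y, i))) for i in range(len(y))]
for i, c in enumerate(changes):
    print(i, round(float(c), 3))   # largest change in any fitted value when row i is dropped
```

In this toy data set the last row dominates: removing it restores the exact line through the remaining five points.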
Different residuals
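Ordinary residuals: $e_i = Y_i - \widehat{Y}_i$
Standardized residuals: $r_i = e_i / s(e_i) = e_i / (\widehat{\sigma}\sqrt{1 - H_{ii}})$, where $H$ is
the hat matrix. (rstandard)
Studentized residuals: $t_i = e_i / (\widehat{\sigma}_{(i)}\sqrt{1 - H_{ii}}) \sim t_{n-p-1}$.
(rstudent)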
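The three residuals can be computed side by side. The sketch below assumes NumPy and is not a reimplementation of the R functions named above; `residual_types` and the simulated data are hypothetical. It uses the standard identity $(n-p-1)\,\widehat{\sigma}^2_{(i)} = (n-p)\,\widehat{\sigma}^2 - e_i^2/(1 - H_{ii})$ to avoid refitting the model $n$ times.

```python
import numpy as np

def residual_types(X, y):
    """Ordinary, standardized, and studentized residuals for y ~ X (X includes the intercept)."""
    n, p = X.shape
    H = X @ np.linalg.solve(X.T @ X, X.T)     # hat matrix
    h = np.diag(H)
    e = y - H @ y                             # ordinary residuals
    sigma2 = e @ e / (n - p)                  # variance estimate from all the data
    r = e / np.sqrt(sigma2 * (1 - h))         # standardized residuals
    # Variance estimate with observation i deleted, via the identity in the lead-in.
    sigma2_i = ((n - p) * sigma2 - e**2 / (1 - h)) / (n - p - 1)
    t = e / np.sqrt(sigma2_i * (1 - h))       # studentized residuals
    return e, r, t

# Hypothetical simulated data.
rng = np.random.default_rng(3)
n = 30
X = np.column_stack([np.ones(n), rng.standard_normal((n, 2))])
y = X @ np.array([1.0, 0.5, -2.0]) + rng.standard_normal(n)
e, r, t = residual_types(X, y)
```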
Crude outlier detection test
If the studentized residuals are large: observation may be an
outlier.
Problem: if $n$ is large and we threshold at $t_{1-\alpha/2,\,n-p-1}$, we
will get many outliers by chance even if the model is correct.
Solution: Bonferroni correction, threshold at $t_{1-\alpha/(2n),\,n-p-1}$.
Bonferroni correction for multiple comparisons
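If we are doing many $t$ (or other) tests, say $m > 1$, we can
control the overall false positive rate at $\alpha$ by testing each one at
level $\alpha/m$.
Proof:
\[
\begin{aligned}
P(\text{at least one false positive})
&= P\left( \bigcup_{i=1}^{m} \left\{ |T_i| \ge t_{1-\alpha/(2m),\,n-p-1} \right\} \right) \\
&\le \sum_{i=1}^{m} P\left( |T_i| \ge t_{1-\alpha/(2m),\,n-p-1} \right) \\
&= \sum_{i=1}^{m} \frac{\alpha}{m} = \alpha.
\end{aligned}
\]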
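To get a feel for how much the Bonferroni cutoff moves, here is a small sketch using only the Python standard library. It replaces the $t$ quantile with a standard normal quantile, which is an assumption that is only reasonable when the degrees of freedom are large; the function name `two_sided_threshold` and the values of `alpha` and `n` are hypothetical.

```python
from statistics import NormalDist

def two_sided_threshold(alpha, m=1):
    """Normal-approximation cutoff for m two-sided tests at overall level alpha."""
    return NormalDist().inv_cdf(1 - alpha / (2 * m))

alpha, n = 0.05, 1000
plain = two_sided_threshold(alpha)          # cutoff for a single test
bonf = two_sided_threshold(alpha, m=n)      # Bonferroni cutoff for n tests

# Expected number of false positives among n tests when the model is correct:
expected_plain = n * alpha                  # each test run at level alpha
expected_bonf = n * (alpha / n)             # each test run at level alpha / n
print(plain, bonf)
print(expected_plain, expected_bonf)
```

With these numbers the unadjusted rule is expected to flag about 50 of the 1000 observations by chance alone, while the corrected rule keeps the expected count at about $\alpha$.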
DFFITS
\[ DFFITS_i = \frac{\widehat{Y}_i - \widehat{Y}_{i(i)}}{\widehat{\sigma}_{(i)} \sqrt{H_{ii}}} \]
This quantity measures how much the regression function
changes at the $i$-th observation when the $i$-th observation is
deleted.
For small/medium datasets: value of 1 or greater is
considered "suspicious". For large datasets: value of $2\sqrt{p/n}$.
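A literal, refit-based sketch of the DFFITS formula, assuming NumPy. The function `dffits_brute`, the simulated data, and the shifted response in row 0 are hypothetical; a production implementation would normally use a closed form rather than $n$ refits.

```python
import numpy as np

def dffits_brute(X, y):
    """DFFITS_i computed literally: refit with row i deleted for each i."""
    n, p = X.shape
    b = np.linalg.lstsq(X, y, rcond=None)[0]
    h = np.diag(X @ np.linalg.solve(X.T @ X, X.T))     # leverages H_ii
    out = np.empty(n)
    for i in range(n):
        Xd, yd = np.delete(X, i, axis=0), np.delete(y, i)
        bd = np.linalg.lstsq(Xd, yd, rcond=None)[0]
        sigma_i = np.sqrt(np.sum((yd - Xd @ bd) ** 2) / (n - 1 - p))
        out[i] = (X[i] @ b - X[i] @ bd) / (sigma_i * np.sqrt(h[i]))
    return out

# Hypothetical simulated data with one contaminated response.
rng = np.random.default_rng(4)
n = 25
X = np.column_stack([np.ones(n), rng.standard_normal(n)])
y = X @ np.array([0.0, 1.0]) + rng.standard_normal(n)
y[0] += 6.0
d = dffits_brute(X, y)
flagged = np.where(np.abs(d) > 2 * np.sqrt(X.shape[1] / n))[0]   # large-sample rule
print(flagged)
```

The same values can be obtained without refitting as $t_i \sqrt{H_{ii}/(1 - H_{ii})}$, where $t_i$ is the studentized residual.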
Cook's distance
\[ D_i = \frac{\sum_{j=1}^{n} (\widehat{Y}_j - \widehat{Y}_{j(i)})^2}{p\,\widehat{\sigma}^2} \]
This quantity measures how much the entire regression
function changes when the $i$-th observation is deleted.
Should be comparable to $F_{p,n-p}$: if $D_i$ falls at or above the 50th
percentile of $F_{p,n-p}$, then the $i$-th point is likely influential:
investigate this point further.
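Cook's distance can be computed straight from the definition. The sketch below assumes NumPy; `cooks_distance` and the seven data points are hypothetical, with the last point given high leverage and a response that does not follow the trend of the others.

```python
import numpy as np

def cooks_distance(X, y):
    """Cook's distance from the definition, refitting without each row in turn."""
    n, p = X.shape
    b = np.linalg.lstsq(X, y, rcond=None)[0]
    fitted = X @ b
    sigma2 = np.sum((y - fitted) ** 2) / (n - p)       # full-data variance estimate
    D = np.empty(n)
    for i in range(n):
        bd = np.linalg.lstsq(np.delete(X, i, axis=0), np.delete(y, i), rcond=None)[0]
        D[i] = np.sum((fitted - X @ bd) ** 2) / (p * sigma2)
    return D

# Hypothetical data: a line plus one high-leverage point that does not follow it.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 15.0])
y = np.array([0.1, 0.9, 2.2, 2.8, 4.1, 5.0, 5.0])
X = np.column_stack([np.ones_like(x), x])
D = cooks_distance(X, y)
print(np.argmax(D))      # index of the row with the largest Cook's distance
```

An equivalent closed form, $D_i = e_i^2 H_{ii} / \left(p\,\widehat{\sigma}^2 (1 - H_{ii})^2\right)$, avoids the loop and shows that both a large residual and a large leverage are needed for a large $D_i$.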
DFBETAS
\[ DFBETAS_{j(i)} = \frac{b_j - b_{j(i)}}{\sqrt{\widehat{\sigma}^2_{(i)} (X^T X)^{-1}_{jj}}} \]
This quantity measures how much the coefficients change
when the $i$-th observation is deleted.
For small/medium datasets: value of 1 or greater is
"suspicious". For large datasets: value of $2/\sqrt{n}$.
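The DFBETAS formula can likewise be sketched by explicit refitting, assuming NumPy. The function `dfbetas` and the simulated data below are hypothetical and are not the example referred to in the slides.

```python
import numpy as np

def dfbetas(X, y):
    """n-by-p array whose (i, j) entry is DFBETAS_{j(i)}, by explicit refitting."""
    n, p = X.shape
    b = np.linalg.lstsq(X, y, rcond=None)[0]
    xtx_inv_diag = np.diag(np.linalg.inv(X.T @ X))     # (X^T X)^{-1}_{jj}
    out = np.empty((n, p))
    for i in range(n):
        Xd, yd = np.delete(X, i, axis=0), np.delete(y, i)
        bd = np.linalg.lstsq(Xd, yd, rcond=None)[0]
        sigma2_i = np.sum((yd - Xd @ bd) ** 2) / (n - 1 - p)
        out[i] = (b - bd) / np.sqrt(sigma2_i * xtx_inv_diag)
    return out

# Hypothetical simulated data.
rng = np.random.default_rng(5)
n = 40
X = np.column_stack([np.ones(n), rng.standard_normal(n)])
y = X @ np.array([1.0, 2.0]) + rng.standard_normal(n)
db = dfbetas(X, y)
suspicious = np.argwhere(np.abs(db) > 2 / np.sqrt(n))   # (row, coefficient) pairs over the cutoff
print(suspicious)
```

Each row of `db` corresponds to a deleted observation and each column to a coefficient, so a single observation can be influential for one coefficient and harmless for another.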
Here is an example.