
Regression Lecture Notes

for
Stat 5313

J. D. Tubbs
Department of Mathematical Sciences

Spring Semester 2001


Contents

1 Least Squares 1
1.1 Inference in Linear Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.1.1 Regression Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.1.2 Analysis of Variance for Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.1.3 ANOVA Table . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.1.4 Continuation of the Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5

2 Matrices, Random Variables, and Distributions 7


2.1 Some Matrix Algebra . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.1.1 Special Matrices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.1.2 Addition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.1.3 Multiplication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.1.4 Kronecker or Direct Product . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.1.5 Inverse . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.1.6 Transpose . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.1.7 Trace . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.1.8 Rank . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.1.9 Quadratic Forms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.1.10 Positive Semidefinite Matrices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.1.11 Positive Definite Matrices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.1.12 Idempotent Matrices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.1.13 Orthogonal Matrices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.1.14 The Generalized Inverse . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.1.15 Solution of Linear Equations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.2 Random Variables and Vectors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.2.1 Expectations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.2.2 Covariance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.2.3 Linear Combinations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.2.4 Random Vectors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.3 Distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.3.1 Multivariate Normal . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.3.2 Chi-Square, T and F Distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.3.3 Quadratic Forms of Normal Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . 14

3 Matrix Approach to Linear Regression 15


3.1 Least Squares . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
3.1.1 Estimation of σ 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
3.1.2 ANOVA Table . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17

3.1.3 Expected Values of the Sum of Squares . . . . . . . . . . . . . . . . . . . . . . . . . . 18
3.1.4 Distribution of the Mean Squares . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18

4 The General Linear Regression Model 19


4.0.5 Inference . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
4.0.6 Estimation of σ 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
4.0.7 ANOVA Table . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
4.0.8 Expected Values of the Sum of Squares . . . . . . . . . . . . . . . . . . . . . . . . . . 22
4.0.9 Distribution of the Mean Squares . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
4.1 The Reduction Notation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
4.2 Testing Linear Hypothesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
4.3 Constrained Least Squares . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26

5 Testing for Model Fit 29


5.0.1 Checks for Normality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
5.0.2 QQ and PP Plots . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
5.0.3 Durbin Watson Test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
5.0.4 Residual Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
5.0.5 Leverages . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
5.0.6 Detection of Influential Observations . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
5.1 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
5.1.1 Heat Consumption Example – Residual Plots . . . . . . . . . . . . . . . . . . . . . . . 33
5.1.2 Exercise 1 Chapter 3 problem k page 101 . . . . . . . . . . . . . . . . . . . . . . . . . 34
5.2 Transformations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
5.2.1 Box-Cox Transformations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
5.2.2 Examples of Box Cox Transformations . . . . . . . . . . . . . . . . . . . . . . . . . . . 52

6 The Generalized Least Squares Model 61


6.0.3 Inference . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
6.1 Weighted Least Squares . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
6.1.1 Example of Weighted Least Squares . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
6.1.2 SAS Scatterplot . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
6.1.3 SAS Student Plot . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
6.1.4 SAS Student Plot for Weighted LS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71

7 Selecting the Best Subset Regression 72


7.0.5 Mallows' Cp Statistic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
7.1 Subset Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
7.1.1 Stepwise Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
7.1.2 Backward Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
7.2 SAS Model-Selection Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
7.2.1 Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
7.2.2 Selection using JMP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86

8 Multicollinearity 87
8.1 Detecting Multicollinearity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
8.1.1 Tolerances and Variance Inflation Factors . . . . . . . . . . . . . . . . . . . . . . . . . 90
8.1.2 Eigenvalues and Condition Numbers . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
8.1.3 SAS – Collinearity Diagnostics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
8.1.4 SAS Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92

9 Ridge Regression 96
9.0.5 Ridge Plot Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97

10 Use of Dummy Variables in Regression Models 107


10.0.6 Turkey Weight Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
10.0.7 Harris County Discrimination Example . . . . . . . . . . . . . . . . . . . . . . . . . . 114

11 General Methods of finding Transformations 123


11.1 Solving Standard Least-Squares Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126
11.1.1 Nonlinear Regression Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126
11.1.2 Example 65.4: Transformation Regression of Exhaust Emissions Data . . . . . . . . . 127

12 Principal Component Regression 136


12.1 Eigenvalues and Eigenvectors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136
12.1.1 Properties of Eigenvalues–Eigenvectors . . . . . . . . . . . . . . . . . . . . . . . . . . . 137
12.2 Principal Components . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137
12.2.1 Example – FOC Sales . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138
12.3 Principal Component Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142

13 Robust Regression 145


13.1 Choice of ρ(u) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146
13.1.1 Steel Employment Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146
13.1.2 Phone Data Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151
13.1.3 Phone Example with Splus . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 162

14 Regression with Violations of the Error Structure Assumptions 165


14.1 Serially correlated Errors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 165
14.1.1 Example Using Blaisdell Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 166
14.2 Detecting Heteroscedastic Error Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 172
14.2.1 Meyers Table 7.1 Data Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 172
14.2.2 Draper and Smith Example Using Weighted Least Squares . . . . . . . . . . . . . . . 179

15 Regression Trees 185


15.1 An Overview of CART Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 185
15.2 Example from Splus . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 188

16 LOESS Regression 193


16.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 193
16.1.1 Local Regression and the Loess Method . . . . . . . . . . . . . . . . . . . . . . . . . . 193
16.2 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 194
16.2.1 Phone Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 194
16.2.2 Motor Cycle Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 207

Chapter 1

Least Squares

Suppose that one observes n pairs of data, given by (xi , yi ), i = 1, 2, . . . , n. The Least Squares problem
consists of finding the linear equation given by ŷ = β̂0 + β̂1 x which minimizes the following equation
Q(β0, β1) = Σ_{i=1}^n [yi − ŷi]² = Σ_{i=1}^n [yi − (β0 + β1 xi)]².

The solution can be found by taking the partial derivatives of Q(β0, β1) with respect to both β0 and β1. That is,
∂Q(β0, β1)/∂β0 = −2 Σ_{i=1}^n [yi − β0 − β1 xi]
and
∂Q(β0, β1)/∂β1 = −2 Σ_{i=1}^n [yi − β0 − β1 xi] xi.
Setting both of these equations equal to zero one obtains the following:
n β0 + (Σ xi) β1 = Σ yi
(Σ xi) β0 + (Σ xi²) β1 = Σ xi yi

The above equations are called the normal equations for the linear least squares problem. They have a unique solution given by
β̂1 = SSxy / SSxx   and   β̂0 = ȳ − β̂1 x̄,
where the basic notation gives
ȳ = Σ_{i=1}^n yi / n,    x̄ = Σ_{i=1}^n xi / n,
SSxy = Σ_{i=1}^n (xi − x̄)(yi − ȳ) = Σ xi yi − n x̄ ȳ,
SSxx = Σ_{i=1}^n (xi − x̄)² = Σ xi² − n x̄² = (n − 1) sx²,
SSyy = Σ_{i=1}^n (yi − ȳ)² = Σ yi² − n ȳ² = (n − 1) sy².

The residuals are defined by êi = yi − β̂0 − β̂1 xi for i = 1, 2, . . . , n. The residual sum of squares or sum of
squares due to the error is
SSE = Q(β̂0, β̂1) = Σ êi² = SSyy − SSxy²/SSxx.
The predicted value of the line at x = x∗ is given by
µ̂y|x∗ = ŷ∗ = ȳ + β̂1 (x∗ − x̄).
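These closed-form expressions are easy to verify numerically. Below is a minimal SAS/IML sketch using the heating-degree-days data from the example in Section 1.1.1; it is only an illustration (variable names are arbitrary), and the printed values should agree with the hand calculations given there.

proc iml;
   x = {15.6, 26.8, 37.8, 36.4, 35.5, 18.6, 15.3, 7.9, 0};   /* heating degree days   */
   y = { 5.2,  6.1,  8.7,  8.5,  8.8,  4.9,  4.5, 2.5, 1.1}; /* gas consumption       */
   n = nrow(x);
   xbar = sum(x)/n;   ybar = sum(y)/n;
   SSxx = sum((x - xbar)##2);
   SSyy = sum((y - ybar)##2);
   SSxy = sum((x - xbar)#(y - ybar));
   b1  = SSxy/SSxx;               /* slope estimate          */
   b0  = ybar - b1*xbar;          /* y-intercept estimate    */
   SSE = SSyy - SSxy##2/SSxx;     /* residual sum of squares */
   print b0 b1 SSE;
quit;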

1.1 Inference in Linear Regression


The procedure for finding the least squares estimates of β0 and β1 has been described. It should be noted that the method of least squares is purely a mathematical problem, in that it provides the solution to an optimization problem. In order to make the problem into a statistical one, for which one can do statistical inference, it is necessary to make distributional assumptions concerning the dependent variable y. For linear regression it is assumed that the following hold:
• The dependent data y1, y2, . . . , yn ∼ N(µy, σy²); that is, the data are normally distributed. This is called the normality assumption.
• µy = β0 + β1 x. The mean or expected value for y changes as a linear function of x. This is called the
linear assumption.
• σy2 does not depend upon x. This is called the homogeneity of variance assumption.
• The data y1 , y2 , . . . , yn are random. That is, the value for yi does not depend upon the value for yi−1
or yi+1 . This is called the independence assumption. If this assumption is violated then the data
are said to be correlated or serially correlated.
The above assumptions allow one to specify the standard errors for the estimates of the population parameters of interest: β0, β1, and µy at a particular value x = x∗, denoted by µy|x∗ = β0 + β1 x∗. The estimates and standard errors are:
• The slope (β1):
β̂1 = SSxy/SSxx   with   σ̂β̂1 = σ̂/√SSxx,
where
σ̂ = √(SSE/(n − 2)) = √( Σ_{i=1}^n (yi − ŷi)² / (n − 2) ).

• The line (µy|x∗):
µ̂y|x∗ = β̂0 + β̂1 x∗ = ȳ + β̂1 (x∗ − x̄)   with   σ̂µ̂y|x∗ = σ̂ √(1/n + (x∗ − x̄)²/SSxx).
• The y-intercept (β0):
β̂0 = ȳ − β̂1 x̄   with   σ̂β̂0 = σ̂ √(1/n + x̄²/SSxx).

From these values one can find (1 − γ)100% confidence intervals for:
• The slope:
β̂1 ± tγ/2(df = n − 2) σ̂β̂1.
• The line at x = x∗:
µ̂y|x∗ ± tγ/2(df = n − 2) σ̂µ̂y|x∗,
where tγ/2(df = n − 2) is the critical point from a t-distribution with n − 2 degrees of freedom.

1.1.1 Regression Example


Given are 9 pairs of observations, where the independent variable X = heating degree days and the dependent variable Y = gas consumption for a house.

x y (x − x̄) (y − ȳ) (x − x̄)2 (y − ȳ)2 (x − x̄)(y − ȳ) (y − ŷ) ŷ


15.6 5.2 -5.94 -0.39 35.34 0.151 2.312 0.813 4.387
26.8 6.1 5.256 0.511 27.62 0.261 2.686 -0.55 6.652
37.8 8.7 16.26 3.111 264.2 9.679 50.57 -0.18 8.876
36.4 8.5 14.86 2.911 220.7 8.475 43.25 -0.09 8.593
35.5 8.8 13.96 3.211 194.8 10.31 44.81 0.389 8.411
18.6 4.9 -2.94 -0.69 8.67 0.475 2.028 -0.09 4.993
15.3 4.5 -6.24 -1.09 38.99 1.186 6.8 0.174 4.326
7.9 2.5 -13.6 -3.09 186.2 9.541 42.15 -0.33 2.83
0 1.1 -21.5 -4.49 464.2 20.15 96.71 -0.13 1.232
193.9 50.3 0.002 0 1441 60.23 291.3 0 50.3
21.54 5.589 0 0 160.1 6.692 32.37 0 5.589

The last two lines of the above table denote the column totals and averages. Hence x̄ = 21.54 and ȳ = 5.589.
The sample variances are computed using the totals under (x − x̄)2 = 1441 and (y − ȳ)2 = 60.23. That is
SSxx = 1441 and SSyy = 60.23 and the sample variance of x is given by SSxx /(n-1) = 1441/8 = 180.08 and
the sample variance for y is given by SSyy /8 = 60.23/8 = 7.529. In order to find the least square estimates
for the y intercept and for the slope of the line one computes the estimate for the slope

β̂1 = SSxy /SSxx = 291.3/1441 = .20221.


where SSxy is given as the total under the (x − x̄)(y − ȳ) column. The estimate for the y intercept

β̂0 = ȳ − β̂1 x̄ = 5.589 − .20221 ∗ 21.54 = 1.233.


The residuals are calculated by computing y − ŷ where ŷ = β̂0 + β̂1 x. The predicted values are in the ŷ
column and the residuals are in the (y − ŷ) column. The residual sum of squares is found by summing the
values under (y − ŷ)². One should get SSE = 1.323. From here one estimates the variance of y about the line by computing SSE/(n − 2) = 1.323/7 = .189. The corresponding standard deviation is √.189 = .435 (this number is often called the root mean square error, denoted by s = σ̂).
Suppose that one wishes to find a 95% CI for the slope of the line. It is given by β̂1 ± t.025(n − 2) · σ̂/√SSxx = .20221 ± (2.365)(.01146) = (.17511, .22931),

where .01146 = .435/√1440 and 2.365 is obtained from the t-distribution table under the .025 column with 7 degrees of freedom.
Suppose that you want to test the hypothesis that the slope of the line is equal to zero at the .05 level. One can reject this hypothesis since 0 is not contained in the above interval. Note: one can only use confidence intervals to formulate tests of hypotheses when the confidence level of the CI is comparable to the specified type I error rate.
Suppose that one wanted a 95% CI for the line at x = 20. The interval is given by

(β̂0 + β̂1 · 20) ± t.025(n − 2) · σ̂ √(1/n + (20 − x̄)²/SSxx)
   = 1.233 + (.20221)(20) ± 2.365 (.435) √(1/9 + (20 − 21.54)²/1440)
   = 5.277 ± (2.365)(.146)
   = (4.932, 5.622).
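The same intervals can be requested from PROC REG. A sketch, assuming the heat data set created by the SAS code in Section 1.1.4 below; CLB asks for confidence limits on the coefficients and CLM for confidence limits on the mean line, and appending a row with x = 20 and y missing makes the limits at x = 20 appear in the output without affecting the fit.

data new;                     /* x value at which a CI for the line is wanted  */
   x = 20;
run;
data both;                    /* y is missing for this row, so PROC REG skips  */
   set heat new;              /* it when fitting but still reports its limits  */
run;
proc reg data=both;
   model y = x / clb clm;     /* CLB: CIs for the coefficients; CLM: CI for the mean line */
run;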

1.1.2 Analysis of Variance for Regression


The regression results are often presented as an analysis of variance table or ANOVA table. The basic idea is to describe how much of the variability found in the dependent variable y can be explained by the presence of the linear equation (β1 ≠ 0) versus having the horizontal line y = ȳ (β1 = 0).
The total adjusted sum of squares in the dependent variable y is given by SST = Σ_{i=1}^n (yi − ȳ)². This adjusted or corrected sum of squares can be written as SST = SSM + SSE, where
SSM = Σ_{i=1}^n (ŷi − ȳ)²
and
SSE = Σ_{i=1}^n (yi − ŷi)².

The value SSM is called the sum of squares due to the model and SSE is called the sum of squares due to
the error or the lack of fit.
The above identity follows from
Σ (yi − ŷi)² = Σ (yi − ȳ + ȳ − ŷi)²
            = Σ (yi − ȳ)² + Σ (ŷi − ȳ)² − 2 Σ (yi − ȳ)(ŷi − ȳ)
            = Σ (yi − ȳ)² − Σ (ŷi − ȳ)².
The last line follows from
−2 Σ (yi − ȳ)(ŷi − ȳ) = −2 Σ (yi − ȳ) β̂1 (xi − x̄)
                      = −2 β̂1 Σ (yi − ȳ)(xi − x̄)
                      = −2 β̂1² Σ (xi − x̄)²
                      = −2 Σ (ŷi − ȳ)².
If the slope of the line is nonzero then one would expect that a sizeable amount of the variability in y, as given by SST, would be attributable to SSM. One way of measuring this is by computing R² = SSM/SST = SSM/(SSM + SSE). R² is a number between 0 and 1 which is usually expressed as a percentage. The closer the value is to 1 (100%), the more of the variability found in SST is explained by the model (in this case, by having β̂1 nonzero). On the other hand, if R² is close to zero then very little of the variability in the data is explained by the model (in this case the linear equation), which means that one does not need x in order to explain the variability in y.

1.1.3 ANOVA Table


The analysis of variance table is given by

Source              Sum of Squares     Degrees of Freedom   Mean Square
due to β1 | β0      Σ (ŷi − ȳ)²        1                    MSM = SSM/1
Residual            Σ (yi − ŷi)²       n − 2                MSE = SSE/(n − 2)
Corrected Total     Σ (yi − ȳ)²        n − 1
due to β0           n ȳ²               1
Total               Σ yi²              n

1.1.4 Continuation of the Example


The SAS code is given by
options center nodate pagesize=100 ls=80;
symbol1 i=none color=red v=star ;
symbol2 i=join v=none color=blue;
symbol3 i=join color=green v=none;
title1 'Heat Consumption';
data heat;
input x y;
cards;
15.6 5.2
26.8 6.1
37.8 8.7
36.4 8.5
35.5 8.8
18.6 4.9
15.3 4.5
7.9 2.5
0 1.1
;
proc sort; by x;
title2 'Scatterplot of Data';
proc gplot data=heat; plot x*y=1; run;
proc reg graphics;
model y = x / r clm influence dw corrb xpx ; run;

title2 'Least Squares Fit to the Data'; plot y*x / pred95;


run;
The ANOVA table is given below as

Dependent variable is: consumption

R squared = 97.8% R squared (adjusted) = 97.5%


s = 0.4345 with 9 - 2 = 7 degrees of freedom

Source Sum of Squares df Mean Square


due to β1 | β0 58.9071 1 58.9071
Residual 1.32175 7 0.188822

Variable Coefficient s.e. of Coeff t-ratio prob


Constant 1.23235 0.286 4.31 0.0035
heat 0.202212 0.01145 17.7 0.0001

SAS Plots

Chapter 2

Matrices, Random Variables, and Distributions

2.1 Some Matrix Algebra


A matrix A = (aij), i = 1, 2, . . . , r, j = 1, 2, . . . , c, is said to be an r × c matrix; it has r rows and c columns, with aij the element in row i and column j. A vector x = (x1, x2, . . . , xn) written as a single row is a 1 × n row vector; its transpose x′ is the n × 1 column vector with the same elements.

2.1.1 Special Matrices


1. D = diag(A) is the diagonal of the r × r matrix A: the r × r diagonal matrix with dii = aii and all off-diagonal elements equal to zero.
2. In is called the n × n identity matrix: the diagonal matrix with every diagonal element equal to one.
3. J is an n × n matrix with each element equal to one.
4. j is an n × 1 vector with each element equal to one, so that J = jj′.

2.1.2 Addition
C = A ± B is defined as cij = aij ± bij provided both A and B have the same number of rows and columns.
It can easily be shown that (A ± B) ± C = A ± (B ± C) and A + B = B + A.

2.1.3 Multiplication
C = AB is defined as cij = Σ_{k=1}^p aik bkj provided A and B are conformable matrices (A is r × p and B is p × c). Note: even if both AB and BA are defined, they are not necessarily equal. It follows that A(B ± C) = AB ± AC. Two vectors a and b are said to be orthogonal, denoted by a ⊥ b, if a′b = Σ_{i=1}^n ai bi = 0.

2.1.4 Kronecker or Direct Product


If A is m × n and B is s × t, the direct or Kronecker product of A and B, denoted by A ⊗ B, is the ms × nt block matrix whose (i, j)th block is aij B, i = 1, . . . , m, j = 1, . . . , n.
Properties are given as
1. (A ⊗ B)(C ⊗ D) = (AC ⊗ BD).
2. ((A + B) ⊗ (C + D)) = (A ⊗ C) + (A ⊗ D) + (B ⊗ C) + (B ⊗ D).
3. A ⊗ (B ⊗ C) = (A ⊗ B) ⊗ C.
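A small SAS/IML sketch of property 1; IML writes the direct product with the @ operator. The matrices are arbitrary illustrations.

proc iml;
   A = {1 2, 3 4};   B = {0 1, 1 0};
   C = {2 0, 1 1};   D = {1 1, 0 2};
   lhs = (A@B) * (C@D);        /* (A (x) B)(C (x) D)         */
   rhs = (A*C) @ (B*D);        /* should equal (AC) (x) (BD) */
   print (max(abs(lhs - rhs)))[label="max difference"];
quit;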

2.1.5 Inverse
An n × n matrix A is said to be nonsingular if there exists a matrix B satisfying AB = BA = In. B is called the inverse of A and is denoted by A⁻¹.

2.1.6 Transpose
If A is r × c then the transpose of A, denoted by A0 , is a c × r matrix. It follows that
1. (A0 )0 = A
2. (A ± B)0 = A0 ± B 0

3. (AB)0 = B 0 A0
4. If A = A0 then A is said to be symmetric.
5. A0 A and AA0 are symmetric.
6. (A ⊗ B)0 = (A0 ⊗ B 0 ).

2.1.7 Trace
Definition 2.1 Suppose that A = (aij), i = 1, . . . , n, j = 1, . . . , n. Then the trace of A is given by tr[A] = Σ_{i=1}^n aii.

Provided the matrices are conformable


1. tr[A] = tr[A0 ].
2. tr[A ± B] = tr[A] ± tr[B].
3. tr[AB] = tr[BA].
4. tr[ABC] = tr[CAB] = tr[BCA].
5. tr[A ⊗ B] = tr[A]tr[B].
For a square matrix A, if one can write Ax = λx for some non-null vector x, then λ is called a characteristic root, eigenvalue, or latent root of A, and x is called the corresponding characteristic vector (eigenvector or latent vector).
If A is a symmetric n × n matrix with eigenvalues λi for i = 1, 2, . . . , n, then
6. tr[A] = Σ_{i=1}^n λi
7. tr[A^s] = Σ_{i=1}^n λi^s
8. tr[A^{-1}] = Σ_{i=1}^n 1/λi, A nonsingular.
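A quick SAS/IML check of properties 6 and 8 with an arbitrary symmetric, nonsingular matrix:

proc iml;
   A = {4 1 0, 1 3 1, 0 1 2};   /* arbitrary symmetric positive definite matrix */
   lam = eigval(A);             /* its eigenvalues                              */
   print (trace(A))[label="tr(A)"]          (sum(lam))[label="sum of eigenvalues"];
   print (trace(inv(A)))[label="tr(A**-1)"] (sum(1/lam))[label="sum of 1/lambda"];
quit;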

2.1.8 Rank
Suppose that A is an r × c matrix with rows a1, a2, . . . , ar. The rows are said to be linearly independent if no ai can be expressed as a linear combination of the remaining rows; that is, there does not exist a non-null vector c = (c1, c2, . . . , cr) such that Σ_{i=1}^r ci ai = 0. It can be shown that the number of linearly independent rows is equal to the number of linearly independent columns of any matrix A, and that number is the rank of the matrix. If the rank of A is r then A is said to be of full row rank; if the rank of A is c then A is said to be of full column rank.
1. rank[A] = 0 if and only if A = 0.
2. rank[A] = rank[A0 ].
3. rank[A] = rank[A0 A] = rank[AA0 ].

4. rank[AB] ≤ min{rank[A], rank[B]}


5. If A is any matrix, and P and Q are any conformable nonsingular matrices then rank[P AQ] = rank[A].
6. If A is r × c with rank r then AA′ is nonsingular ((AA′)⁻¹ exists and rank[AA′] = r). If the rank of A is c then A′A is nonsingular ((A′A)⁻¹ exists and rank[A′A] = c).
7. If A is symmetric, then rank[A] is equal to the number of nonzero eigenvalues.

2.1.9 Quadratic Forms


Let A be a symmetric n × n matrix and x = (x1, x2, . . . , xn) a vector. Then q = x′Ax is called a quadratic form of A. The quadratic form is a second-degree polynomial in the xi's.

2.1.10 Positive Semidefinite Matrices
A symmetric matrix A is said to be positive semidefinite (p.s.d.) if and only if q = x0 Ax ≥ 0 for all x.
1. The eigenvalues of p.s.d. matrices are nonnegative.
2. If A is p.s.d. then tr[A] ≥ 0.
3. A is p.s.d. of rank r if and only if there exists an n × n matrix R of rank r such that A = RR0 .
4. If A is an n × n p.s.d. matrix of rank r, then there exists an n × r matrix S of rank r such that
S 0 AS = Ir .
5. If A is p.s.d., then X 0 AX = 0 ⇒ AX = 0.

2.1.11 Positive Definite Matrices


A symmetric matrix A is said to be positive definite (p.d.) if and only if q = x0 Ax > 0 for all x, x 6= 0.
1. The eigenvalues of p.d. matrices are positive.
2. A is p.d. if and only if there exists an nonsingular matrix R such that A = RR0 .
3. If A is p.d. then so is A−1 .
4. If A is p.d. then rank[CAC 0 ] = rank[C].
5. If A is n × n p.d. matrix and C is a p × n matrix of rank p, then CAC 0 is p.d.
6. If X is n × p of rank p then X 0 X is p.d.
7. A is p.d. if and only if all the leading minor determinants of A are positive.
8. The diagonal elements of a p.d. matrix are all positive.
9. (Cholesky decomposition). If A is p.d. there exists a unique upper triangular matrix U with positive diagonal elements such that A = U′U.

2.1.12 Idempotent Matrices


A matrix P is said to be idempotent if P 2 = P . A symmetric idempotent matrix is called a projection
matrix.
1. If P is symmetric, then P is idempotent and of rank r if and only if it has r eigenvalues equal to unity
and n − r eigenvalues equal to zero.
2. If P is a projection matrix then the tr[P ] = rank[P ].
3. If P is idempotent, so is I − P .
4. Projection matrices are positive semidefinite.

2.1.13 Orthogonal Matrices
An n × n matrix A is said to be orthogonal if and only if A⁻¹ = A′. If A is orthogonal then
1. −1 ≤ aij ≤ 1.
2. AA′ = A′A = In.
3. |A| = ±1.

Vector Differentiation
Let X be an n × m matrix with elements xij , then if f (X) is a function of the elements of X, we define

df/dX = ( df/dxij ),
then
1. d(β′a)/dβ = a.
2. d(β′Aβ)/dβ = 2Aβ (A symmetric).
3. if f(X) = a′Xb, then df/dX = ab′.
4. if f(X) = tr[AXB], then df/dX = A′B′.
5. if X is symmetric and f(X) = a′Xb, then df/dX = ab′ + ba′ − diag(ab′).
6. if X is symmetric and f(X) = tr[AXB], then df/dX = A′B′ + BA − diag(BA).
7. if X and A are symmetric and f(X) = tr[AXAX], then df/dX = 2AXA.

2.1.14 The Generalized Inverse


A matrix B is said to be the generalized inverse of A if it satisfies ABA = A. The generalized inverse of A
is denoted by A− . If A is nonsingular then A−1 = A− . If A is singular then A− exists but is not unique.
1. If A is an r × c matrix of rank c. Then the generalized inverse of A is A− = (A0 A)−1 A0 .

2. If A is an r × c matrix of rank r. Then the generalized inverse of A is A− = A(AA0 )−1 .


3. If A is an r × c matrix of rank c, then A(A′A)⁻A′ is symmetric, idempotent, of rank equal to rank[A], and unique.

2.1.15 Solution of Linear Equations


A system of linear equations Ax = b is said to be consistent if it has a solution; in that case a solution can be expressed as x̃ = A⁻b, where A⁻ is any generalized inverse of A. If A is nonsingular then the solution x̃ = A⁻¹b is unique.
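A small SAS/IML sketch (with arbitrary numbers) of the defining property ABA = A and of solving a consistent system through the GINV function:

proc iml;
   A  = {1 2, 2 4, 3 6};          /* a 3 x 2 matrix of rank 1                          */
   Ag = ginv(A);                  /* Moore-Penrose generalized inverse                 */
   print (max(abs(A*Ag*A - A)))[label="max |A*Ag*A - A|"];
   b = {3, 6, 9};                 /* b lies in the column space, so Ax=b is consistent */
   xtilde = Ag*b;                 /* one solution of the consistent system             */
   print xtilde (A*xtilde)[label="A*xtilde (should equal b)"];
quit;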

2.2 Random Variables and Vectors
2.2.1 Expectations
Let U denote a random variable with expectation E(U) and Var(U) = E[(U − E(U))²]. Let a and b denote
any constants, then we have
1. E(aU ± b) = aE(U ) ± b.
2. V ar(aU ± b) = a2 V ar(U ).
Suppose that t(x) is a statistic that is used to estimate a parameter θ. If E(t(x)) = θ, the statistic is said to
be an unbiased estimate for θ. If E(t(x)) = η 6= θ then t(x) is biased and the bias is given by Bias = (θ − η),
in which case the mean square error is given by

M SE(t) = E(t(x) − θ)2 = V ar(t(x)) + Bias2 .

2.2.2 Covariance
Let U and V denote two random variables with respective means, µu and µv . The covariance between the
two random variables is defined by

Cov(U, V ) = E[(U − µu )(V − µv )] = E(U V ) − µu µv .

If U and V are independent then Cov(U, V) = 0. One has the following:
1. Cov(aU ± b, cV ± d) = ac Cov(U, V).
2. −1 ≤ Corr(U, V) = ρ = Cov(U, V)/[Var(U) Var(V)]^{1/2} ≤ 1.

2.2.3 Linear Combinations


Suppose that one has n random variables u1, u2, . . . , un and one defines
u = Σ_{i=1}^n ai ui,
where E(ui) = µi, Var(ui) = σi², and cov(ui, uj) = σij when i ≠ j. Then
1. E(u) = Σ_{i=1}^n ai µi,
2. Var(u) = Σ_{i=1}^n ai² σi² + Σ Σ_{i≠j} ai aj σij.

2.2.4 Random Vectors


Let u = (u1 , u2 , . . . , un )0 denote a n-dimensional vector of random variables. Then the expected value of u
is given by E(u) = (E(u1 ), E(u2 ), . . . , E(un ))0 . The covariance matrix is an n × n matrix given by

cov(u) = E[(u − E(u))(u − E(u))0 ] = Σ = (σij )

where σij = cov(ui, uj). There are several properties of Σ:
1. Σ is symmetric and at least a p.s.d. n × n matrix.
2. E(uu′) = Σ + E(u)E(u)′.

3. cov(u + d) = cov(u).
4. tr[cov(u)] = tr E[(u − E(u))(u − E(u))′] = E[(u − E(u))′(u − E(u))] = Σ_{i=1}^n σii is the total variance of u.
Suppose that A is an r × n matrix and one defines v = Au ± b; then
5. E(v) = A E(u) ± b.
6. cov(v) = A cov(u) A′ = AΣA′. Note cov(v) is an r × r symmetric and at least p.s.d. matrix.
Suppose that C is an s × n matrix and one defines w = Cu ± d; then
7. cov(v, w) = AΣC′. Note cov(v, w) is an r × s matrix.

2.3 Distributions
2.3.1 Multivariate Normal
Recall that in the univariate case the normal density function for y is given by
fy(y) = k exp[−(y − µ)²/(2σ²)],
where E(y) = µ, var(y) = σ², and k is the normalizing constant given by
k = (2πσ²)^{-1/2}.

Let y = (y1, y2, . . . , yn)′ denote an n-dimensional vector with density function given by
f(y1, y2, . . . , yn) = k exp[−(1/2)(y − E(y))′ Σ⁻¹ (y − E(y))],
where
1. k = (2π)^{-n/2} |Σ|^{-1/2} is the normalizing constant and |Σ| is the determinant of Σ.
2. E(y) = µ = (µ1, µ2, . . . , µn)′ and cov(y) = Σ.
3. Q = (y − E(y))′ Σ⁻¹ (y − E(y)) ∼ χ²n, where χ²n is a chi-square with n degrees of freedom.
4. y is said to have an n-dimensional multivariate normal distribution with mean µ and covariance matrix Σ provided Σ is nonsingular. This is denoted by y ∼ Nn(µ, Σ).
Suppose that y ∼ Nn(µ, Σ) and A is r × n. Define u = Ay ± b; then u ∼ Nr(µu = Aµ ± b, Σu = AΣA′) provided AΣA′ is nonsingular (i.e. rank(A) = r).

2.3.2 Chi-Square, T and F Distributions


Recall from univariate statistics that if zi ∼ N(0, 1) for i = 1, 2, . . . , n then
1. zi² ∼ χ²(1) and Σ_{i=1}^n zi² ∼ χ²(n).
2. (n − 1)sz² = Σ_{i=1}^n (zi − z̄)² ∼ χ²(n − 1).
3. z̄ and sz² are independent.
4. If z ∼ N(0, 1) and u ∼ χ²(n) are independent, then z/√(u/n) ∼ t-dist(n).
5. If u ∼ χ²(n) and v ∼ χ²(m) are independent, then (u/n)/(v/m) ∼ F-dist(n, m).
6. If z = (z1, z2, . . . , zn)′ then z′z = Σ_{i=1}^n zi² ∼ χ²(n).
7. If x ∼ N(µ, 1) then x² ∼ χ²(df = 1, λ = µ²); x² is said to have a noncentral chi-square distribution with noncentrality parameter λ.

2.3.3 Quadratic Forms of Normal Variables


1. Let z = (z1 , z2 , . . . , zn )0 ∼ Nn (0, In ). Define the quadratic form q = z 0 Az then
(a) The expected value of q is E(q) = tr[A].
(b) The variance of q is V ar(q) = 2 tr[A2 ].
(c) q ∼ χ2 (a) if and only if A2 = A (A is idempotent) where a = rank[A] = tr[A].
2. Let x = (x1 , x2 , . . . , xn )0 ∼ Nn (µ, In ). Define the quadratic form q = x0 Ax then
(a) The expected value of q is E(q) = tr[A] + µ0 Aµ.
(b) The variance of q is V ar(q) = 2 tr[A2 ] + 4µ0 A2 µ.
(c) q ∼ χ2 (a, λ) if and only if A2 = A (A is idempotent) where a = rank[A] = tr[A] and λ = 1/2µ0 Aµ.
(d) If x ∼ Nn (µ, σ 2 In ) then (x − µ)0 A(x − µ)/σ 2 ∼ χ2 (a) if and only if A is idempotent and a = tr[A].
3. Let x = (x1, x2, . . . , xn)′ ∼ Nn(µ, V) (this means that the xi's need not be independent of one another). Define the quadratic form q1 = x′Ax; then
(a) The expected value of q1 is E(q1) = tr[AV] + µ′Aµ.
(b) The variance of q1 is Var(q1) = 2 tr[AVAV] + 4µ′AVAµ.
(c) q1 ∼ χ²(a, λ) if and only if (AV)² = AV (AV is idempotent), where a = rank[A] and λ = 1/2 µ′Aµ.
Suppose that q2 = x′Bx and T = Cx where C is a c × n matrix. Then
(d) cov(q1, q2) = 2 tr[AVBV] + 4µ′AVBµ.
(e) cov(x, q1) = 2 VAµ.
(f) cov(T, q1) = 2 CVAµ.
(g) q1 and q2 are independent if and only if AVB = BVA = 0.
(h) q1 and T are independent if and only if CVA = 0.

4. (Cochran's Theorem) Let x ∼ Nn(µ, V), let Ai, i = 1, 2, . . . , m, be symmetric with rank[Ai] = ri, and let A = Σ Ai with rank[A] = r. If AV is idempotent and r = Σ ri, then the qi = x′Ai x are mutually independent with qi ∼ χ²(df = ri, λi = µ′Ai µ/2).
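A brief simulation sketch of result 1 above, i.e. E(z′Az) = tr[A] and Var(z′Az) = 2 tr[A²] for z ∼ Nn(0, In); the matrix and the number of replications are arbitrary.

proc iml;
   call randseed(4321);
   A = {2 1 0, 1 3 1, 0 1 1};          /* arbitrary symmetric matrix */
   n = nrow(A);   nsim = 100000;
   q = j(nsim, 1, .);
   do k = 1 to nsim;
      z = j(n, 1, .);
      call randgen(z, "Normal");        /* z ~ N_n(0, I_n)            */
      q[k] = t(z)*A*z;                  /* the quadratic form z'Az    */
   end;
   meanq = q[:];
   varq  = ssq(q - meanq)/(nsim - 1);
   print meanq (trace(A))[label="tr(A)"] varq (2*trace(A*A))[label="2 tr(A*A)"];
quit;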

Chapter 3

Matrix Approach to Linear Regression

In this chapter the approach given in the previous chapter is modernized by using matrix notation. The
linear regression model can be written as

yi = β0 + β1 xi + ei

for i = 1, 2, . . . , n. The term given by ei represents the unobserved error or residual that the observed data
value yi is from the predicted line given by β0 + β1 xi . This model can be written as
y = Xβ + e
where
y = (y1, y2, . . . , yn)′,   e = (e1, e2, . . . , en)′,   β = (β0, β1)′,
and X is the n × 2 matrix whose ith row is (1, xi).
Note: from here on the symbol y will denote this vector (the arrow notation ~y is dropped).

3.1 Least Squares


The least squares problem becomes finding values β̂0 and β̂1 that minimize

Q(β) = (y − Xβ)′(y − Xβ)
      = y′y − β′X′y − y′Xβ + β′X′Xβ
      = y′y − 2β′(X′y) + β′(X′X)β.
By using the properties of differentiation with matrices one has
∂Q(β)/∂β = −2X′y + 2X′Xβ = 0.

From which one obtains the normal equations given by

X 0 Xβ = X 0 y.

If rank[X] = 2 the normal equation has a unique solution given by

β̂ = (X 0 X)−1 X 0 y

Let L = (X′X)⁻¹X′; then it follows that the least squares estimate for β is a linear function of y, i.e. β̂ = Ly. In the linear regression model we have

β̂ = ( β̂0 ) = ( n      Σ xi  )⁻¹ ( Σ yi    )
     ( β̂1 )   ( Σ xi   Σ xi² )    ( Σ xi yi )

   = 1/(n Σ(xi − x̄)²) (  Σ xi²   −Σ xi ) ( Σ yi    )
                       ( −Σ xi     n   ) ( Σ xi yi )

The predicted value for y is ŷ = X β̂ = XLy = X(X 0 X)−1 X 0 y = Hy, where

H = X(X 0 X)−1 X 0 .

H is an n × n matrix called the hat matrix. The error in prediction, or residual, is given by ê = y − ŷ = y − Hy = (I − H)y. The residual sum of squares is given by
Q(β̂) = ê′ê = y′(I − H)′(I − H)y = y′(I − H)y = y′y − β̂′X′y.
Note that (I − H) and H are idempotent matrices and it follows that H(I − H) = (I − H)H = 0.
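A minimal SAS/IML sketch of these matrix computations for the heat-consumption data of Chapter 1; the printed values should agree with the earlier hand calculations (β̂0 ≈ 1.232, β̂1 ≈ 0.202, SSE ≈ 1.32), and the idempotency check should be essentially zero. Variable names are illustrative.

proc iml;
   x = {15.6, 26.8, 37.8, 36.4, 35.5, 18.6, 15.3, 7.9, 0};
   y = { 5.2,  6.1,  8.7,  8.5,  8.8,  4.9,  4.5, 2.5, 1.1};
   n  = nrow(x);
   Xd = j(n,1,1) || x;                   /* design matrix [1  x]               */
   bhat = inv(t(Xd)*Xd) * t(Xd)*y;       /* least squares estimate             */
   H    = Xd * inv(t(Xd)*Xd) * t(Xd);    /* hat matrix                         */
   ehat = (I(n) - H) * y;                /* residuals                          */
   SSE  = t(y) * (I(n) - H) * y;         /* residual sum of squares            */
   s2   = SSE/(n - 2);                   /* estimate of sigma squared          */
   idem = max(abs(H*H - H));             /* should be ~0 since H is idempotent */
   print bhat SSE s2 idem;
quit;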
The above derivation is similar to that given in the previous chapter. At this stage the problem is one of optimization rather than statistics. In order to create a statistical problem it is necessary to introduce some distributional assumptions. When using the linear models approach one makes assumptions concerning the error structure. That is, assume that the unobserved errors ei, i = 1, 2, . . . , n, are i.i.d. normal with mean 0 and variance σ². This assumption becomes
e = (e1, e2, . . . , en)′ ∼ Nn(0, σ²In).
Using the properties of linear transformations of normal variates one has
y = (y1, y2, . . . , yn)′ ∼ Nn(Xβ, σ²In).

Now using the fact that β̂ = Ly it follows that


1. β̂ ∼ N2 (β, σ 2 (X 0 X)−1 ).

(a) β̂ is an unbiased estimate of β.


(b) var(β̂i ) = σ 2 ((X 0 X)−1 )ii .
(c) cov(β̂i , β̂j ) = σ 2 ((X 0 X)−1 )ij .

(d) corr(β̂i , β̂j ) = ((X 0 X)−1 )ii /[((X 0 X)−1 )ii ((X 0 X)−1 )jj ]1/2 .
2. ŷ ∼ Nn (Xβ, σ 2 H).
(a) var(ŷi ) = σ 2 hii .
(b) cov(ŷi , ŷj ) = σ 2 hij , where H = (hij ). Notice that the ŷi0 s are not independent of one another
unless hij = 0.
(c) corr(ŷi , ŷj ) = hij /[hii hjj ]1/2 .
3. ê ∼ Nn (0, σ 2 (I − H)).
(a) var(êi ) = σ 2 (1 − hii ).
(b) cov(êi , êj ) = −σ 2 hij .
(c) corr(êi , êj ) = −hij /[(1 − hii )(1 − hjj )]1/2 .

3.1.1 Estimation of σ 2
The estimation of σ² follows from observing that the residual sum of squares Q(β̂) is a quadratic form given by ê′ê = y′(I − H)y. Using the expected value of a quadratic form, i.e. E(y′Ay) = σ² tr[A] + µ′Aµ when y ∼ Nn(µ, σ²In), one has
E(Q(β̂)) = E(y′(I − H)y) = σ² tr[I − H] + β′X′(I − H)Xβ = σ²(tr[I] − tr[H]) = σ²(n − 2),
since
X′(I − H) = X′ − X′X(X′X)⁻¹X′ = 0 and (I − H)X = X − X(X′X)⁻¹X′X = 0.
From here one can define an estimate for σ² as
σ̂² = y′(I − H)y/(n − 2) = SSE/(n − 2).

3.1.2 ANOVA Table


As in the previous chapter, the analysis of variance table is given by

Source              Sum of Squares                      Degrees of Freedom   Mean Square
due to β            SS(β) = β̂′X′y = y′Hy                2                    MS(β) = SS(β)/2
Residual            SSE = y′y − β̂′X′y = y′(I − H)y      n − 2                MSE = SSE/(n − 2)
Uncorrected Total   y′y                                 n

Since β1 is the only parameter that is needed if the independent variable x explains the dependent variable y, the sum of squares term is adjusted for β0; that is,
y′Hy = y′(H − jj′/n)y + y′(jj′/n)y,
where y′(jj′/n)y = nȳ² is the correction factor or the sum of squares due to β0. In this case the ANOVA table becomes

Source              Sum of Squares                      Degrees of Freedom   Mean Square
due to β1 | β0      SSβ1|β0 = y′(H − jj′/n)y            1                    MSβ1|β0 = SSβ1|β0/1
Residual            SSE = y′y − β̂′X′y = y′(I − H)y      n − 2                MSE = SSE/(n − 2)
Corrected Total     y′y − y′(jj′/n)y                    n − 1
due to β0           SSβ0 = y′(jj′/n)y = nȳ²             1
Uncorrected Total   y′y                                 n

3.1.3 Expected Values of the Sum of Squares
Observe that the terms under the sum of squares column are actually quadratic forms. Using the properties
of the expected value of quadratic forms, i.e.,
Let x = (x1 , x2 , . . . , xn )0 ∼ Nn (µ, V ) (This means that the x0i s are not independent of one another).
Define the quadratic form q1 = x0 Ax then the expected value of q1 is E(q1 ) = tr[AV ] + µ0 Aµ. It follows
when V = σ 2 In that
1. E(y 0 Hy) = σ 2 tr[H] + β 0 X 0 HXβ = 2σ 2 + β 0 X 0 Xβ.
2. E(y 0 (I − H)y) = σ 2 tr[(I − H)] + β 0 X 0 (I − H)Xβ = (n − 2)σ 2 , since X 0 (I − H) = (I − H)X = 0.

3.1.4 Distribution of the Mean Squares


Again from the properties of the distribution of quadratic forms it can be shown that
1. SS(β)/σ 2 ∼ χ2 (df = 2, λ = 1/2β 0 X 0 Xβ).
2. SSE /σ 2 ∼ χ2 (df = (n − 2)).
3. SS(β) and SSE are independent.
4. F = M S(β)/M SE ∼ F (df1 = 2, df2 = (n − 2), λ = 1/2β 0 X 0 Xβ).
5. When β = 0 it follows that SS(β)/σ 2 ∼ χ2 (df = 2) and F = M S(β)/M SE ∼ F (df1 = 2, df2 = (n−2)).
6. SSβ1 |β0 /σ 2 ∼ χ2 (df = 1, λ = 1/2β 0 X 0 Xβ).
7. SSβ1 |β0 and SSE are independent.
8. F = M Sβ1 |β0 /M SE ∼ F (df1 = 1, df2 = (n − 2), λ = 1/2β 0 X 0 Xβ).
9. When β1 = 0 it follows that SSβ1 |β0 /σ 2 ∼ χ2 (df = 1) and F = M Sβ1 |β0 /M SE ∼ F (df1 = 1, df2 =
(n − 2)).

Chapter 4

The General Linear Regression Model

In this chapter the approach given in the previous chapter is extended to higher order regression models. The
general linear regression model (not to be confused with the generalized linear model) can be written as
yi = β0 + β1 x1i + β2 x2i + · · · + βp−1 x(p−1)i + ei
for i = 1, 2, . . . , n, where the independent variables (x1i, x2i, . . . , x(p−1)i) determine the type of model:

1. Polynomial Regression, when xji = xi^j, i = 1, 2, . . . , n, j = 1, 2, . . . , (p − 1). This equation is called a polynomial of degree (p − 1).
2. Multiple Regression, when each of the variables is a different independent variable.
The term given by ei represents the unobserved error or residual that the observed data value yi is from the
predicted surface given by β0 + β1 x1i + β2 x2i + . . . + βp−1 x(p−1)i . This model can be written as

y = Xβ + e

where
y = (y1, y2, . . . , yn)′ is the n × 1 vector of dependent observations,
X is the n × p matrix of independent observations or known values, whose ith row is (1, x1i, x2i, . . . , x(p−1)i),
β = (β0, β1, . . . , βp−1)′ is the p × 1 vector of parameters, and
e = (e1, e2, . . . , en)′ is the n × 1 vector of unobservable errors or residuals.
By using the properties of differentiation with matrices one has

∂Q(β)/∂β = −2X′y + 2X′Xβ = 0.
From which one obtains the normal equations given by

X 0 Xβ = X 0 y.

If rank[X] = p the normal equation has a unique solution given by

β̂ = (X 0 X)−1 X 0 y.

4.0.5 Inference
The above derivation is similar to that given in the previous chapter. At this stage the problem is one of optimization rather than statistics. In order to create a statistical problem it is necessary to introduce some distributional assumptions. When using the linear models approach one makes assumptions concerning the error structure. That is, assume that the unobserved errors ei, i = 1, 2, . . . , n, are i.i.d. normal with mean 0 and variance σ². This assumption becomes
e = (e1, e2, . . . , en)′ ∼ Nn(0, σ²In).
Using the properties of linear transformations of normal variates one has
y = (y1, y2, . . . , yn)′ ∼ Nn(Xβ, σ²In).

Now using the fact that β̂ = Ly it follows that


1. β̂ ∼ Np (β, σ 2 (X 0 X)−1 ).

(a) β̂ is an unbiased estimate of β.


(b) var(β̂i ) = σ 2 ((X 0 X)−1 )ii .
(c) cov(β̂i , β̂j ) = σ 2 ((X 0 X)−1 )ij .
(d) corr(β̂i , β̂j ) = ((X 0 X)−1 )ii /[((X 0 X)−1 )ii ((X 0 X)−1 )jj ]1/2 .

2. ŷ ∼ Nn (Xβ, σ 2 H).
(a) var(ŷi ) = σ 2 hii .
(b) cov(ŷi , ŷj ) = σ 2 hij , where H = (hij ). Notice that the ŷi0 s are not independent of one another
unless hij = 0.

(c) corr(ŷi , ŷj ) = hij /[hii hjj ]1/2 .
3. ê ∼ Nn (0, σ 2 (I − H)).
(a) var(êi ) = σ 2 (1 − hii ).
(b) cov(êi , êj ) = −σ 2 hij .
(c) corr(êi , êj ) = −hij /[(1 − hii )(1 − hjj )]1/2 .

4.0.6 Estimation of σ 2
The estimation of σ² follows from observing that the residual sum of squares Q(β̂) is a quadratic form given by ê′ê = y′(I − H)y. Using the expected value of a quadratic form, i.e. E(y′Ay) = σ² tr[A] + µ′Aµ when y ∼ Nn(µ, σ²In), one has
E(Q(β̂)) = E(y′(I − H)y) = σ² tr[I − H] + β′X′(I − H)Xβ = σ²(tr[I] − tr[H]) = σ²(n − p),
since
X′(I − H) = X′ − X′X(X′X)⁻¹X′ = 0 and (I − H)X = X − X(X′X)⁻¹X′X = 0.
From here one can define an estimate for σ² as
σ̂² = y′(I − H)y/(n − p) = SSE/(n − p).

4.0.7 ANOVA Table


As in the previous chapter, the analysis of variance table is given by

Source              Sum of Squares                      Degrees of Freedom   Mean Square
due to β            SS(β) = β̂′X′y = y′Hy                p                    MS(β) = SS(β)/p
Residual            SSE = y′y − β̂′X′y = y′(I − H)y      n − p                MSE = SSE/(n − p)
Uncorrected Total   y′y                                 n

Since only β∗ = (β1, . . . , βp−1) is needed to describe whether the independent variables explain the dependent variable y, the sum of squares term is adjusted for β0; that is,
y′Hy = y′(H − jj′/n)y + y′(jj′/n)y,
where y′(jj′/n)y = nȳ² is the correction factor or the sum of squares due to β0. In this case the ANOVA table becomes

Source              Sum of Squares                      Degrees of Freedom   Mean Square
due to β∗ | β0      SSβ∗|β0 = y′(H − jj′/n)y            p − 1                MSβ∗|β0 = SSβ∗|β0/(p − 1)
Residual            SSE = y′y − β̂′X′y = y′(I − H)y      n − p                MSE = SSE/(n − p)
Corrected Total     SSCT = y′y − y′(jj′/n)y             n − 1
due to β0           SSβ0 = y′(jj′/n)y = nȳ²             1
Uncorrected Total   y′y                                 n

where β∗ = (β1, β2, . . . , βp−1)′.

4.0.8 Expected Values of the Sum of Squares
Observe that the terms under the sum of squares column are actually quadratic forms. Using the properties
of the expected value of quadratic forms, i.e.,
Let x = (x1 , x2 , . . . , xn )0 ∼ Nn (µ, V ) (This means that the x0i s are not independent of one another).
Define the quadratic form q1 = x0 Ax then the expected value of q1 is E(q1 ) = tr[AV ] + µ0 Aµ. It follows
when V = σ 2 In that
1. E(y 0 Hy) = σ 2 tr[H] + β 0 X 0 HXβ = pσ 2 + β 0 X 0 Xβ.
2. E(y 0 (I − H)y) = σ 2 tr[(I − H)] + β 0 X 0 (I − H)Xβ = (n − p)σ 2 , since X 0 (I − H) = (I − H)X = 0.
The R² is an indicator of how much of the variation in the data is explained by the model. It is defined as
R² = SSβ∗|β0 / SSCT = (y′Hy − y′(jj′/n)y) / (y′y − y′(jj′/n)y).
The adjusted R2 is the R2 value adjusted for the number of parameters in the model and is given by

adjR2 = 1 − [(n − i)/(n − p)](1 − R2 )

where i is 1 if the model includes the y-intercept (β0), and is 0 otherwise. Tolerance (TOL) and variance inflation factors (VIF) measure the strength of the interrelationships among the regressor variables in the model. For the jth regressor they are given as
TOLj = 1 − Rj²   and   VIFj = TOLj⁻¹,
where Rj² is the R² from regressing the jth regressor on the remaining regressors.
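In PROC REG the tolerances and variance inflation factors are requested with the TOL and VIF options on the MODEL statement. A sketch using the Appendix 1A data set a1 read in Section 4.1 below:

proc reg data=a1;
   model y = x2 x3 x5 x4 / tol vif;   /* prints TOL and VIF for each regressor */
run;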

4.0.9 Distribution of the Mean Squares


Again from the properties of the distribution of quadratic forms it can be shown that
1. SS(β)/σ 2 ∼ χ2 (df = p, λ = 1/2β 0 X 0 Xβ).
2. SSE /σ 2 ∼ χ2 (df = (n − p)).
3. SS(β) and SSE are independent.
4. F = M S(β)/M SE ∼ F (df1 = p, df2 = (n − p), λ = 1/2β 0 X 0 Xβ).
5. When β = 0 it follows that SS(β)/σ 2 ∼ χ2 (df = p) and F = M S(β)/M SE ∼ F (df1 = p, df2 = (n−p)).
6. SSβ∗ |β0 /σ 2 ∼ χ2 (df = p − 1, λ = 1/2β 0 X 0 Xβ).

7. SSβ∗ |β0 and SSE are independent.


8. F = M Sβ∗ |β0 /M SE ∼ F (df1 = p − 1, df2 = (n − p), λ = 1/2β 0 X 0 Xβ).
9. When β∗ = 0 it follows that SSβ∗ |β0 /σ 2 ∼ χ2 (df = p − 1) and F = M Sβ∗ |β0 /M SE ∼ F (df1 =
p − 1, df2 = (n − p)).

4.1 The Reduction Notation
In constructing the ANOVA tables one is interested in describing the sum of squares attributable to the various terms in the model. This can be done in one of two ways, sequential or partial. Suppose that one has the three-term model given by
yi = β0 + β1 x1i + β2 x2i + β3 x3i + ei.
If the model represents a polynomial regression model of degree 3 (i.e., xji = xi^j for j = 1, 2, 3, i = 1, 2, . . . , n) then the model sum of squares SSM can be written as
SSM = S(β0) + S(β1 | β0) + S(β2 | β0, β1) + S(β3 | β0, β1, β2),
where each of these terms is an independent quadratic form with a noncentral chi-square distribution with degrees of freedom = 1.
However, if the model represents a multiple regression model where each of the xj represents a different measurement or random variable, one is interested in determining the amount of reduction in sum of squares that can be attributed to each of the terms in the model. That is, one wants to be able to compute

S(β3 | β0 , β1 , β2 )

S(β2 | β0 , β1 , β3 )
S(β1 | β0 , β2 , β3 ).
These quadratic forms are no longer independent, hence their sum does not equal SSM, but they are independent of the error sum of squares and they can be shown to be chi-square distributed with one degree of freedom.
When using SAS the sequential sum of squares is the standard output and can be specified with the
Type I SS statement. The partial SS can be obtained using the Type II SS statement. SAS is a very general
statistics package and one can analyse other linear models besides the regression models. The Type I SS
should be used when
• Balanced ANOVA models with terms in their proper sequence or order.
• Purely nested model in proper order.
• Polynomial regression models.

The Type II SS statement should be used when


• The model is balanced.
• Any main effects model.
• Any pure regression model (multiple regression).

• An effect not contained in any other effect (no nesting).


SAS also allows for TYPE III and IV sum of squares. Both of these are used when the models are unbalanced
(ANOVA situations). The Type III SS imposes a sum to zero restriction in order to create a unique least
squares estimate for the model coefficients. These are not needed in the purely regression models (i.e. no
dummy variables).
Consider the following example using the data found on page 46, Appendix A1 data. The SAS code is
as follows;

options center nodate pagesize=100 ls=80;
symbol1 i=none color=red v=star ;
symbol2 i=join v=none color=blue;
symbol3 i=join color=green v=none;
title1 'Appendix 1A page 46';
data a1;
infile 'C:\MyDocuments\Regression\Data\DS_data\01a';
input obs x1-x10 @@; y=x1; x=x4;
*proc print;*run;
proc sort; by x;
title2 'Scatterplot of Data';*proc gplot data=a1; *plot y*x=1; *run;
proc reg graphics;
model y=x4; run;
proc reg graphics;
model y = x2 x3 x5 x4 / ss1 ss2; run;
The output for this program is as follows;
Appendix 1A page 46
Scatterplot of Data

Analysis of Variance

Sum of Mean
Source DF Squares Square F Value Pr > F

Model 1 14.35660 14.35660 6.68 0.0166


Error 23 49.45920 2.15040
Corrected Total 24 63.81580

Root MSE 1.46642 R-Square 0.2250


Dependent Mean 9.42400 Adj R-Sq 0.1913
Coeff Var 15.56053

Parameter Estimates

Parameter Standard
Variable DF Estimate Error t Value Pr > |t|

Intercept 1 6.62141 1.12361 5.89 <.0001


x4 1 0.44094 0.17065 2.58 0.0166

Analysis of Variance

Sum of Mean
Source DF Squares Square F Value Pr > F

Model 4 28.88720 7.22180 4.14 0.0133


Error 20 34.92860 1.74643
Corrected Total 24 63.81580

Root MSE 1.32153 R-Square 0.4527


Dependent Mean 9.42400 Adj R-Sq 0.3432
Coeff Var 14.02298

Parameter Estimates

Parameter Standard
Variable DF Estimate Error t Value Pr > |t| Type I SS

Intercept 1 -7.11549 12.75297 -0.56 0.5831 2220.29440


x2 1 1.27506 1.14615 1.11 0.2791 9.36977
x3 1 -3.23231 7.11445 -0.45 0.6545 1.82950
x5 1 0.27995 0.43423 0.64 0.5264 0.38976
x4 1 0.52143 0.16568 3.15 0.0051 17.29818

Parameter Estimates

Variable DF Type II SS

Intercept 1 0.54367
x2 1 2.16136
x3 1 0.36049
x5 1 0.72591
x4 1 17.29818

4.2 Testing Linear Hypothesis


The general form of a linear hypothesis for the parameters is

H0 : Lβ = c

where L is a q × p matrix of rank q. The approach is to estimate Lβ − c with Lβ̂ − c. Using the properties of expectation and covariance we have
1. E(Lβ̂ − c) = Lβ − c.

2. Cov(Lβ̂ − c) = σ 2 L(X 0 X)−1 L0 .

3. The quadratic form given by
Q = (Lβ̂ − c)′ (L(X′X)⁻¹L′)⁻¹ (Lβ̂ − c)
satisfies Q/σ² ∼ χ²(df = q, λ = 1/2 (Lβ − c)′(L(X′X)⁻¹L′)⁻¹(Lβ − c)/σ²).
4. F = (Q/q)/σ̂² ∼ F-dist(q, n − p) whenever Lβ = c.
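A SAS/IML sketch of this F statistic for the heat-consumption data with the single restriction L = (0 1), c = .21, i.e. H0: β1 = .21; the value should agree with the PROC REG TEST statement output shown at the end of Section 4.3 (F ≈ 0.46, p ≈ .52). Variable names are illustrative.

proc iml;
   x = {15.6, 26.8, 37.8, 36.4, 35.5, 18.6, 15.3, 7.9, 0};
   y = { 5.2,  6.1,  8.7,  8.5,  8.8,  4.9,  4.5, 2.5, 1.1};
   n = nrow(x);   p = 2;
   Xd   = j(n,1,1) || x;
   XtXi = inv(t(Xd)*Xd);
   bhat = XtXi * t(Xd)*y;
   s2   = t(y)*(I(n) - Xd*XtXi*t(Xd))*y / (n - p);   /* sigma-hat squared */
   L = {0 1};   c = {0.21};   q = nrow(L);
   Q = t(L*bhat - c) * inv(L*XtXi*t(L)) * (L*bhat - c);
   F = (Q/q)/s2;
   pval = 1 - probf(F, q, n-p);
   print F pval;
quit;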

4.3 Constrained Least Squares


A related topic involves finding β̃ which minimizes (y − Xβ)′(y − Xβ) subject to the restriction that Lβ = c. The approach is to define the Lagrangian function given by
Λ = (y − Xβ)′(y − Xβ) + λ′(c − Lβ).
Taking the derivatives of Λ with respect to both β and λ and setting them equal to zero gives the solution
β̃ = β̂ + (X′X)⁻¹L′[L(X′X)⁻¹L′]⁻¹(c − Lβ̂),
where β̂ is the usual unrestricted estimator of β.
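A SAS/IML sketch of this formula, again with L = (0 1) and c = .21 for the heat-consumption data; the restricted intercept should agree with the RESTRICT output below (about 1.065). Variable names are illustrative.

proc iml;
   x = {15.6, 26.8, 37.8, 36.4, 35.5, 18.6, 15.3, 7.9, 0};
   y = { 5.2,  6.1,  8.7,  8.5,  8.8,  4.9,  4.5, 2.5, 1.1};
   Xd   = j(nrow(x),1,1) || x;
   XtXi = inv(t(Xd)*Xd);
   bhat = XtXi * t(Xd)*y;                                      /* unrestricted estimate */
   L = {0 1};   c = {0.21};
   btilde = bhat + XtXi*t(L)*inv(L*XtXi*t(L))*(c - L*bhat);    /* restricted estimate   */
   print bhat btilde;
quit;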


SAS is capable of handling each of the above situations. One can either specify a linear hypothesis
and test it or one can restrict the usual least squares problem by the linear restriction and estimate the
constrained least squares estimate. The SAS code is given as follows;
*This code is added to the heat-consumption example;

proc reg graphics;


model y = x ;
restrict x=.21;
run;
proc reg graphics;
model y = x;
test x=.21;
run;
The output is as follows;
Heat Consumption
Scatterplot of Data

Regular Regression Model

Analysis of Variance

Sum of Mean
Source DF Squares Square F Value Pr > F

Model 1 58.90713 58.90713 311.97 <.0001


Error 7 1.32175 0.18882
Corrected Total 8 60.22889

Root MSE 0.43454 R-Square 0.9781
Dependent Mean 5.58889 Adj R-Sq 0.9749
Coeff Var 7.77501

Parameter Estimates

Parameter Standard
Variable DF Estimate Error t Value Pr > |t|

Intercept 1 1.23235 0.28604 4.31 0.0035


x 1 0.20221 0.01145 17.66 <.0001

The REG Procedure


Model: MODEL1
Dependent Variable: y

NOTE: Restrictions have been applied to parameter estimates.

Analysis of Variance

Sum of Mean
Source DF Squares Square F Value Pr > F

Model 0 58.81974 . . .
Error 8 1.40914 0.17614
Corrected Total 8 60.22889

Root MSE 0.41969 R-Square 0.9766


Dependent Mean 5.58889 Adj R-Sq 0.9766
Coeff Var 7.50943

Parameter Estimates

Parameter Standard
Variable DF Estimate Error t Value Pr > |t|

Intercept 1 1.06456 0.13990 7.61 <.0001


x 1 0.21000 0 Infty <.0001
RESTRICT -1 -11.22042 15.92982 -0.70 0.5182*

* Probability computed using beta distribution.

Test of Hypothesis

The REG Procedure


Model: MODEL1
Dependent Variable: y

Analysis of Variance

Sum of Mean
Source DF Squares Square F Value Pr > F

Model 1 58.90713 58.90713 311.97 <.0001


Error 7 1.32175 0.18882
Corrected Total 8 60.22889

Root MSE 0.43454 R-Square 0.9781


Dependent Mean 5.58889 Adj R-Sq 0.9749
Coeff Var 7.77501

Parameter Estimates

Parameter Standard
Variable DF Estimate Error t Value Pr > |t|

Intercept 1 1.23235 0.28604 4.31 0.0035


x 1 0.20221 0.01145 17.66 <.0001

The REG Procedure


Model: MODEL1

Test 1 Results for Dependent Variable y

Mean
Source DF Square F Value Pr > F

Numerator 1 0.08739 0.46 0.5182


Denominator 7 0.18882

Chapter 5

Testing for Model Fit

In this chapter the underlying assumptions are tested for the fitted model. Various techniques have been proposed for examining the assumptions of the fitted model. There are several results that have been established in the previous chapters. They are:
1. β̂ = (X 0 X)−1 X 0 y ∼ Np (β, σ 2 (X 0 X)−1 ).

(a) β̂ is an unbiased estimate of β.


(b) var(β̂i ) = σ 2 ((X 0 X)−1 )ii .
(c) cov(β̂i , β̂j ) = σ 2 ((X 0 X)−1 )ij .
(d) corr(β̂i , β̂j ) = ((X 0 X)−1 )ii /[((X 0 X)−1 )ii ((X 0 X)−1 )jj ]1/2 .

2. ŷ = X β̂ ∼ Nn (Xβ, σ 2 H).
(a) var(ŷi ) = σ 2 hii .
(b) cov(ŷi , ŷj ) = σ 2 hij , where H = (hij ). Notice that the ŷi0 s are not independent of one another
unless hij = 0.
(c) corr(ŷi , ŷj ) = hij /[hii hjj ]1/2 .

3. ê = (y − X β̂) ∼ Nn (0, σ 2 (I − H)).


(a) var(êi ) = σ 2 (1 − hii ).
(b) cov(êi , êj ) = −σ 2 hij .
(c) corr(êi , êj ) = −hij /[(1 − hii )(1 − hjj )]1/2 .
Where H = X(X 0 X)−1 X 0 is the hat matrix.

5.0.1 Checks for Normality


Since the residuals êi are normally distributed with mean zero and (approximately) constant variance, simple plots of the residuals versus the predicted values or the independent variable x will often reveal whether or not these assumptions have been met. Draper and Smith discuss these plots. Essentially, you should not “see” any patterns in these scatterplots. Likewise, the amount of variability should be constant across the plots. If they are not, then you will need to take some additional steps to ensure that you have a proper model with the proper assumptions. If you “see” a pattern in your residuals then it could indicate that you have an improper model. For example, the Anscombe data set with the model y2 versus x2 shows what happens when one fits a linear equation to data that should be modeled by a quadratic. SAS allows for a visual examination of the normality of the residuals through the QQ and PP plots. If normality holds then these plots will resemble a 45 degree line passing through the origin of your x-y plot. There are some tests of fit for normality, although these are usually not needed when assessing the normality of the residuals.

5.0.2 QQ and PP Plots


The probability plots are used to assess the normality of the residuals. The basic idea stems from the
following:
Suppose that z(1) ≤ z(2) ≤ . . . ≤ z(n) represents the ordered values of n independent and identically
distributed N (0, 1) random variables. Then the expected value of z(i) is

E(z(i) ) ≈ γi = Φ−1 [(i − 3/8)/(n + 1/4)]

where Φ is the cdf of the standard normal, given by
Φ(x) = ∫_{−∞}^{x} (2π)^{−1/2} e^{−t²/2} dt.
In the QQ plot one usually plots the standardized residuals versus the γi's. If the data are normal then the resulting scatterplot should form a diagonal 45 degree line.
In the PP plot one plots, for the ith ordered observation, the pair (Φ(z(i)), i/n).
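The γi above are the Blom normal scores, which PROC RANK computes directly with the NORMAL=BLOM option. A sketch, assuming the residuals have been saved in a data set resids with variable ehat (both names are illustrative; an OUTPUT statement in PROC REG can create such a data set):

proc rank data=resids normal=blom out=qqdat;
   var ehat;          /* the residuals                                          */
   ranks nscore;      /* nscore = PROBIT((rank - 3/8)/(n + 1/4)), i.e. gamma_i  */
run;
proc gplot data=qqdat;
   plot ehat*nscore;  /* QQ plot: roughly a 45 degree line if normality holds   */
run;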
Sometimes the patterns in the residuals may indicate that one has correlated errors, i.e. corr(ei, ej) ≠ 0. The following test, called the Durbin-Watson test, is commonly used to test for dependency in the error structure.

5.0.3 Durbin Watson Test


The Durbin-Watson statistic tests for a particular type of error structure called a first-order autoregressive model (this assumes that the errors are related through an outside ordering such as time). The autoregressive error of order one, AR(1), is given by
ei = φ ei−1 + εi,
where |φ| < 1, corr(ei, ei−1) = φ, and εi ∼ N(0, σε²). Note, if φ > 0 then your estimate of the variance of y will usually be “too small” and when φ < 0 the estimate for the variance of y will usually be “too large”.
The Durbin-Watson statistic is
d = Σ_{i=2}^n (ei − ei−1)² / Σ_{i=1}^n ei².

Note that d = 0 when ei = ei−1 , d = 4 when ei = −ei−1 , while d = 2 when φ = 0. Since this statistic has
these properties the decision rule for rejecting the null hypothesis is dependent upon whether or not one is
testing for the alternative of positive or negative correlation.
In order to test for first order autoregressive error structure one will need to find critical values from the
Durbin-Watson tables (Table 7.1 page 184 in Draper and Smith). The table provides two critical points dL
and dU for specified values of n and k=p-1. The test procedure is as follows;
1. One sided test against the alternative hypothesis Ha : ρ > 0 where ρ = corr(ei , ei−1 ).
If d < dL , conclude d is significant and reject Ho at the α level.
If d > dU , conclude d is not significant and do not reject Ho at the α level.
If dL ≤ d ≤ dU , the test is said to be inconclusive (in practice this implies that one does not have
a reason to reject the null hypothesis).

2. One sided test against the alternative hypothesis Ha : ρ < 0. Repeat above using (4 − d) instead of d.
3. One sided test against the alternative hypothesis Ha : ρ 6= 0.
If d < dL or (4 − d) < dL , conclude d is significant and reject Ho at the 2α level.
If d > dU and (4 − d) > dU , conclude d is not significant and do not reject Ho at the 2α level.
Otherwise, the test is said to be inconclusive.
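In SAS the statistic d is requested with the DW option on the MODEL statement, as in the sketch below (placeholder data set and variable names); the printed value must still be compared by hand with the tabled values dL and dU.

proc reg data=mydata;
model y = x / dw;    /* prints the Durbin-Watson d and the estimated   */
run;                 /* first order autocorrelation of the residuals   */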

5.0.4 Residual Analysis


Draper and Smith discuss the importance of examining the residuals in order to determine whether or not
the model assumptions have been met (pages 59-72). In chapter 8 they present some additional material
for checking the fitted model. From the above we see that the ith residual given by yi − ŷi has a normal
distribution with mean 0 and variance σ 2 (1 − hii ). Since σ 2 is unknown it can be estimated with either

σ̂ 2 = M SE

or
σ̂(i)² = [(n − p)σ̂² − êi²/(1 − hii)] / (n − p − 1),
where σ̂(i)² is the mean square for the error whenever the ith observation has been omitted from the regression
model. From here one can define
1. Internally Studentized Residual is given by
si = êi / [σ̂(1 − hii)^{1/2}].

SAS refers to this statistic as the STUDENT residual.


2. Externally Studentized Residual is given by
s(i) = êi / [σ̂(i)(1 − hii)^{1/2}].

SAS refers to this statistic as the RSTUDENT residual.


3. si has approximately a t-distribution with df = n − p.
4. s(i) has exactly a t-distribution with df = n − p − 1.
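Both versions can be saved to a data set with the OUTPUT statement of PROC REG; the sketch below uses placeholder data set and variable names.

proc reg data=mydata;
model y = x;
output out=resid_chk
       student  = si        /* internally studentized residuals */
       rstudent = ti        /* externally studentized residuals */
       h        = hii;      /* leverages used in both formulas  */
run;
proc print data=resid_chk; run;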

5.0.5 Leverages
The hat matrix H has the following properties;
1. SS(β̂) = β̂'X'y = y'Hy = y'H²y = ŷ'ŷ.
2. Σ_{i=1}^{n} var(ŷi)/n = tr[σ²H]/n = σ²p/n.
3. H1 = 1 whenever the y intercept is included in the model; thus every row (and, by symmetry, every column)
of H sums to 1.
4. 0 ≤ hii ≤ 1 and Σ_{i=1}^{n} hii = p = rank(X). Since the average hii is p/n, the ith observation is said to
be a leverage point if hii ≥ 2p/n.

5. Since ŷ = Hy we have
ŷi = hii yi + Σ_{j≠i} hij yj.
This indicates that the importance that yi has upon ŷi is given by the magnitude of the leverage hii.
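A simple way to flag leverage points by the 2p/n rule is to save the hat diagonals and compare them with the cutoff. The sketch below assumes a straight line model, so p = 2; the data set and variable names are placeholders.

proc reg data=mydata;
model y = x;
output out=lev_chk h=hii;        /* hat diagonals hii                 */
run;
data lev_chk;
set lev_chk nobs=n;
p = 2;                           /* rank(X): intercept plus one slope */
high_lev = (hii >= 2*p/n);       /* leverage point by the 2p/n rule   */
run;
proc print data=lev_chk; where high_lev = 1; run;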

5.0.6 Detection of Influential Observations


If one suspects that the ith observation has an unusual influence upon the prediction equation ŷ, one can
recompute the regression model with the ith observation omitted from the calculation. Whenever this is done
one has a new regression estimate given by β̂(i) = (X(i)'X(i))⁻¹X(i)'y(i), from which we have a new predicted
value for y given by ŷ(i) = X β̂(i). Cook's distance is a measure of how far the original “line” is from the
“new line” when the ith observation is omitted from the calculation. Cook's distance is given by

Di = (ŷ − ŷ(i))'(ŷ − ŷ(i))/(p σ̂²)
   = (β̂ − β̂(i))'X'X(β̂ − β̂(i))/(p σ̂²)
   = [êi/(σ̂(1 − hii)^{1/2})]² [hii/(p(1 − hii))].

A related measure to Cook's distance is the DFFITS statistic given by

DFFITSi² = (β̂ − β̂(i))'X'X(β̂ − β̂(i))/σ̂(i)².

Another measure of influence is the COVRATIO, which compares the determinant of the estimated covariance
matrix of β̂(i), computed with the ith observation removed, with the determinant of the estimated covariance
matrix of β̂, det[σ̂²(X'X)⁻¹]. That is,

COVRATIOi = det[σ̂(i)²(X(i)'X(i))⁻¹] / det[σ̂²(X'X)⁻¹].

This statistic should be close to one whenever the observation has little influence upon the estimation of β.
If the statistic is much different from one then the observation is said to be influential.
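All of these influence measures are printed by PROC REG; the sketch below (placeholder names) mirrors the output shown in the examples that follow, where the INFLUENCE option gives RStudent, the hat diagonals, CovRatio, DFFITS, and DFBETAS, and the R option gives Cook's D.

proc reg data=mydata;
model y = x / r influence;
run;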

5.1 Examples
5.1.1 Heat Consumption Example – Residual Plots
Splus Residual Plots

5.1.2 Exercise 1 Chapter 3 problem k page 101
Run the following SAS program and compare the results with the answer to the questions for the problem.

options center nodate pagesize=100 ls=80;


symbol1 i=none color=red v=star ;
symbol2 i=join v=none color=blue;
symbol3 i=join color=green v=none;
title1 ’Problem 3k page 101’;
data e03k;
*The following statement is the location of my data set, yours will be different;
infile ’C:\MyDocuments\Regression\Data\DS_data\e03k’;

input obs y x @@;


proc print;run;
proc sort; by x;
title2 ’Scatterplot of Data’;proc gplot data=e03k; plot x*y=1; run;
proc reg graphics;
model y = x / r clm corrb xpx ; run;

title2 ’Least Squares Fit to the Data’; plot y*x / pred95; run;
title2 ’Residual Plot’; plot r.*x/ vref=0; run;
title2 ’Leverage Plot’; plot h.*obs.; run;
title2 ’Check for Outlier’; plot student.*obs.; run;
title2 ’Check for outlier ith obs missing’; plot rstudent.*obs.; run;
title2 ’PP Plot’; plot npp.*r. /nostat; run;
title2 ’QQ Plot’; plot r. * nqq. /nostat; run;
*Test to determine if the lack of fit is significant, not useful in this example;

proc anova data=e03k;


class x;
model y = x;
title2 ’Fit of pure Error and Lack of Fit’; run;

Output from Splus
*** Linear Model ***

Call: lm(formula = y ~ x, data = e03k, na.action = na.exclude)


Residuals:
Min 1Q Median 3Q Max
-0.07652 -0.02001 -0.004806 0.02709 0.07375

Coefficients:
Value Std. Error t value Pr(>|t|)
(Intercept) 1.0021 0.0109 92.0401 0.0000
x -0.0029 0.0002 -12.4346 0.0000

Residual standard error: 0.03933 on 32 degrees of freedom


Multiple R-Squared: 0.8285
F-statistic: 154.6 on 1 and 32 degrees of freedom, the p-value is 8.538e-014
1 observations deleted due to missing values

Correlation of Coefficients:
(Intercept)
x -0.785

Analysis of Variance Table

Response: y

Terms added sequentially (first to last)


Df Sum of Sq Mean Sq F Value Pr(F)
x 1 0.2391502 0.2391502 154.6186 8.537615e-014
Residuals 32 0.0494947 0.0015467

Splus Plots Scatterplot

Splus Plots QQ plot

Output from JMP
JMP Layout 1

JMP Layout 2

SAS Output

Problem 3k page 101


Scatterplot of Data

The REG Procedure


Model: MODEL1

Model Crossproducts X’X X’Y Y’Y

Variable Intercept x y

Intercept 34 1244.5 30.458


x 1244.5 73920.05 1032.4865
y 30.458 1032.4865 27.573638

Analysis of Variance

Sum of Mean
Source DF Squares Square F Value Pr > F

Model 1 0.23915 0.23915 154.62 <.0001


Error 32 0.04949 0.00155
Corrected Total 33 0.28864

Root MSE 0.03933 R-Square 0.8285


Dependent Mean 0.89582 Adj R-Sq 0.8232
Coeff Var 4.39018

Parameter Estimates

Parameter Standard
Variable DF Estimate Error t Value Pr > |t|

Intercept 1 1.00210 0.01089 92.04 <.0001


x 1 -0.00290 0.00023350 -12.43 <.0001

Correlation of Estimates

Variable Intercept x

Intercept 1.0000 -0.7850


x -0.7850 1.0000

Output Statistics

Dep Var Predicted Std Error


Obs y Value Mean Predict 95% CL Mean Residual

1 0.9710 0.9934 0.0103 0.9723 1.0145 -0.0224


2 0.9790 0.9885 0.0100 0.9680 1.0089 -0.009454
3 0.9820 0.9780 0.009443 0.9588 0.9972 0.003999
4 0.9710 0.9751 0.009281 0.9562 0.9940 -0.004098
5 0.9570 0.9734 0.009185 0.9546 0.9921 -0.0164
6 0.9610 0.9702 0.009013 0.9518 0.9885 -0.009162
7 0.9560 0.9664 0.008814 0.9484 0.9843 -0.0104
8 0.9720 0.9658 0.008784 0.9479 0.9837 0.006193
9 0.8890 0.9655 0.008770 0.9477 0.9834 -0.0765
10 0.9610 0.9559 0.008298 0.9390 0.9728 0.005065
11 0.9820 0.9536 0.008191 0.9369 0.9703 0.0284
12 0.9750 0.9475 0.007923 0.9314 0.9637 0.0275
13 0.9420 0.9475 0.007923 0.9314 0.9637 -0.005515
14 0.9320 0.9472 0.007911 0.9311 0.9633 -0.0152
15 0.9080 0.9391 0.007590 0.9236 0.9546 -0.0311
16 0.9700 0.9385 0.007568 0.9231 0.9539 0.0315
17 0.9850 0.9359 0.007475 0.9207 0.9511 0.0491
18 0.9330 0.9318 0.007340 0.9169 0.9468 0.001164
19 0.8580 0.9272 0.007201 0.9125 0.9419 -0.0692
20 0.9870 0.9133 0.006889 0.8992 0.9273 0.0737
21 0.9580 0.8970 0.006745 0.8833 0.9107 0.0610
22 0.9090 0.8865 0.006786 0.8727 0.9004 0.0225
23 0.8590 0.8735 0.006980 0.8593 0.8877 -0.0145
24 0.8630 0.8662 0.007153 0.8516 0.8808 -0.003216
25 0.8110 0.8662 0.007153 0.8516 0.8808 -0.0552
26 0.8770 0.8334 0.008408 0.8163 0.8505 0.0436
27 0.7980 0.8212 0.009027 0.8028 0.8396 -0.0232
28 0.8550 0.7971 0.0104 0.7759 0.8183 0.0579
29 0.7880 0.7957 0.0105 0.7743 0.8171 -0.007661
30 0.8210 0.7951 0.0105 0.7736 0.8166 0.0259
31 0.8300 0.7605 0.0128 0.7345 0.7866 0.0695
32 0.7180 0.7594 0.0129 0.7331 0.7856 -0.0414
33 0.6420 0.7132 0.0162 0.6803 0.7461 -0.0712
34 0.6580 0.6792 0.0187 0.6412 0.7173 -0.0212

Output Statistics

Std Error Student Cook’s


Obs Residual Residual -2-1 0 1 2 D

1 0.0379 -0.590 | *| | 0.013


2 0.0380 -0.249 | | | 0.002
3 0.0382 0.105 | | | 0.000

4 0.0382 -0.107 | | | 0.000
5 0.0382 -0.428 | | | 0.005
6 0.0383 -0.239 | | | 0.002
7 0.0383 -0.271 | | | 0.002
8 0.0383 0.162 | | | 0.001
9 0.0383 -1.996 | ***| | 0.104
10 0.0384 0.132 | | | 0.000
11 0.0385 0.738 | |* | 0.012
12 0.0385 0.714 | |* | 0.011
13 0.0385 -0.143 | | | 0.000
14 0.0385 -0.395 | | | 0.003
15 0.0386 -0.806 | *| | 0.013
16 0.0386 0.816 | |* | 0.013
17 0.0386 1.272 | |** | 0.030
18 0.0386 0.0301 | | | 0.000
19 0.0387 -1.790 | ***| | 0.056
20 0.0387 1.905 | |*** | 0.057
21 0.0387 1.575 | |*** | 0.038
22 0.0387 0.580 | |* | 0.005
23 0.0387 -0.374 | | | 0.002
24 0.0387 -0.0832 | | | 0.000
25 0.0387 -1.428 | **| | 0.035
26 0.0384 1.135 | |** | 0.031
27 0.0383 -0.606 | *| | 0.010
28 0.0379 1.526 | |*** | 0.088
29 0.0379 -0.202 | | | 0.002
30 0.0379 0.684 | |* | 0.018
31 0.0372 1.868 | |*** | 0.207
32 0.0372 -1.113 | **| | 0.074
33 0.0359 -1.986 | ***| | 0.401
34 0.0346 -0.613 | *| | 0.055

Sum of Residuals 0
Sum of Squared Residuals 0.04949
Predicted Residual SS (PRESS) 0.05706

SAS Plots

The REG Procedure
Model: MODEL1
Dependent Variable: y

Output Statistics

Dep Var Predicted Std Error


Obs y Value Mean Predict 95% CL Mean Residual

1 0.6000 0.6644 0.0508 0.5526 0.7762 -0.0644


2 0.5000 0.6134 0.0387 0.5283 0.6986 -0.1134
3 0.7000 0.6134 0.0387 0.5283 0.6986 0.0866
4 0.6000 0.5625 0.0293 0.4979 0.6270 0.0375
5 0.6000 0.5625 0.0293 0.4979 0.6270 0.0375
6 0.6000 0.5625 0.0293 0.4979 0.6270 0.0375
7 0.4000 0.5115 0.0259 0.4545 0.5685 -0.1115
8 0.6000 0.5115 0.0259 0.4545 0.5685 0.0885

9 0.4000 0.4605 0.0305 0.3934 0.5276 -0.0605
10 0.6000 0.4605 0.0305 0.3934 0.5276 0.1395
11 0.3000 0.4095 0.0404 0.3205 0.4985 -0.1095
12 0.5000 0.4095 0.0404 0.3205 0.4985 0.0905
13 0.3000 0.3585 0.0528 0.2423 0.4747 -0.0585

Output Statistics

Std Error Student Cook’s Hat Diag Cov


Obs Residual Residual -2-1 0 1 2 D RStudent H Ratio

1 0.0782 -0.823 | *| | 0.143 -0.8104 0.2964 1.5144


2 0.0849 -1.336 | **| | 0.185 -1.3921 0.1719 1.0253
3 0.0849 1.020 | |** | 0.108 1.0217 0.1719 1.1981
4 0.0886 0.424 | | | 0.010 0.4076 0.0988 1.2991
5 0.0886 0.424 | | | 0.010 0.4076 0.0988 1.2991
6 0.0886 0.424 | | | 0.010 0.4076 0.0988 1.2991
7 0.0896 -1.244 | **| | 0.065 -1.2792 0.0771 0.9683
8 0.0896 0.988 | |* | 0.041 0.9867 0.0771 1.0887
9 0.0882 -0.686 | *| | 0.028 -0.6684 0.1067 1.2412
10 0.0882 1.582 | |*** | 0.150 1.7168 0.1067 0.8080
11 0.0841 -1.302 | **| | 0.196 -1.3500 0.1877 1.0658
12 0.0841 1.077 | |** | 0.134 1.0852 0.1877 1.1923
13 0.0769 -0.761 | *| | 0.136 -0.7450 0.3202 1.5976

Output Statistics

-------DFBETAS-------
Obs DFFITS Intercept x

1 -0.5260 -0.4631 0.4527


2 -0.6343 -0.4883 0.4716
3 0.4656 0.3584 -0.3461
4 0.1350 0.0683 -0.0635
5 0.1350 0.0683 -0.0635
6 0.1350 0.0683 -0.0635
7 -0.3697 0.0015 -0.0164
8 0.2851 -0.0012 0.0127
9 -0.2310 0.1141 -0.1221
10 0.5934 -0.2930 0.3136
11 -0.6491 0.4815 -0.4987
12 0.5217 -0.3871 0.4009
13 -0.5112 0.4351 -0.4456

Sum of Residuals 0
Sum of Squared Residuals 0.09573
Predicted Residual SS (PRESS) 0.13405

5.2 Transformations
In some cases, one needs to transform the dependent variable y to ensure that the model assumptions are
correct. Suppose that one suspects that the homogeneity of variance assumption is violated. One solution
might be to use a more general linear regression model called weighted least squares (this will be discussed
in a later chapter). Another approach is to define what are called variance stabilizing transformations. The
concept is fairly simple and is presented as follows;
Using a form of the Taylor series approximation, any function f(y) of y with continuous first derivative
f'(y) and finite second derivative f''(y) can be expressed as
f(y) − f(η) = (y − η)f'(η) + (1/2)(y − η)²f''(θ),
where θ lies between y and η and E[y] = η. Thus whenever (y − η)² is small, we have
f(y) − f(η) ≈ (y − η)f'(η).
Squaring both sides and taking the expectation, we have
var(f(y)) ≈ (f'(η))²σ²(η),
where σ²(η) is the variance of the random variable y with mean η. Thus in order to find a variance stabilizing
transformation f of y which makes var(f(y)) approximately constant, we need to solve the equation
f'(η) = c/σ(η),
where c is any constant. Such a transformation f is called a variance stabilizing transformation.
For example, suppose that y is a count; then σ²(η) ∝ η and one needs an f such that
f'(η) = c/η^{1/2}.
If one lets c = 1/2 then f(η) = η^{1/2} is a solution, and therefore one should use the square root transformation
√y whenever the response variable is a count. This same method provides a solution for
• y is the proportion of counts, which means that ny ∼ Binomial with var(y) = n⁻¹η(1 − η) where
E(y) = η. Then f(y) = n^{1/2} sin⁻¹(√y) is the appropriate transformation.

• y is such that σ(η) ∝ η, from which f = ln y.
Draper and Smith provide a table (Table 13.11, page 292) containing additional variance stabilizing transformations,
reproduced below.

Nature of Dependence σy = f(η)                            Transformation

σy ∝ η^k (y ≥ 0)                                          y^{1−k}
σy ∝ η^{1/2} (Poisson) (y ≥ 0)                            y^{1/2}
σy ∝ η (y ≥ 0)                                            ln y
σy ∝ η² (y ≥ 0)                                           y⁻¹
σy ∝ η^{1/2}(1 − η)^{1/2} (Binomial) (0 ≤ y ≤ 1)          sin⁻¹(y^{1/2})
σy ∝ (1 − η)^{1/2}/η (0 ≤ y ≤ 1)                          (1 − y)^{1/2} − (1 − y)^{3/2}/3
σy ∝ (1 − η²)⁻² (−1 ≤ y ≤ 1)                              ln[(1 + y)/(1 − y)]
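As a sketch, several of these transformations can be applied in a single DATA step before refitting the regression; the data set and variable names are placeholders, and the choice among them should follow the table above.

data trans;
set mydata;
y_sqrt = sqrt(y);            /* counts: sigma_y proportional to eta^(1/2)   */
y_log  = log(y);             /* sigma_y proportional to eta                 */
y_inv  = 1/y;                /* sigma_y proportional to eta^2               */
y_asin = arsin(sqrt(y));     /* binomial proportions, 0 <= y <= 1           */
run;
proc reg data=trans;
model y_log = x;             /* refit using the transformed response        */
run;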

Another method of finding transformations when one does not have a precise form for the variance of y
is the general Box-Cox family of transformations.

5.2.1 Box-Cox Transformations
Suppose that y > 0. Define the Box-Cox transformation as

yi^(λ) = (yi^λ − 1)/λ   when λ ≠ 0,
yi^(λ) = ln yi          when λ = 0,
where i = 1, 2, . . . , n. One determines λ by maximizing the profile log-likelihood

L(λ) ≡ −(n/2) log[s²(λ)] = (λ − 1) Σ_{i=1}^{n} ln(yi) − (n/2) log[σ̂²(λ)],

where σ̂²(λ) = (1/n) y^(λ)'[I − H]y^(λ), i.e., it is the error sum of squares (divided by n) when yi^(λ) is used instead of
yi, and y^(λ) = (y1^(λ), y2^(λ), . . . , yn^(λ))'.
Since there is not a closed form solution to the above maximization, one usually plots L(λ) = −(n/2) log[s²(λ)] versus
λ over a grid of values. Another approach is to compute an approximate confidence interval using the fact that
2[L(λ̂) − L(λ)] is approximately χ²(df = 1); an approximate 95% interval consists of all λ with L(λ) ≥ L(λ̂) − χ²₀.₉₅(1)/2 = L(λ̂) − 1.92,
which is the criterion used in the macro below. One can then use any convenient λ which is contained in the confidence interval.

5.2.2 Examples of Box Cox Transformations
Splus Plots

SAS Plots

The SAS code (including a listing of the boxcox macro) is as follows;
%macro boxcox(name1=, lower2=, number=, increas=);
/*
This program contains the MACRO in order to find the BOX-COX Power
Transformation (Box and Cox 1964).
The program reads an input SAS filename bc, i.e.,

data bc;
infile ’R:/export/home/jtubbs/ftp/Regression/Data/DS_data/13.2’;
input obs f p y;
%boxcox(name1=y, lower2=-.95, number=20, increas=.1)

The data needs to consist of a vector of observations.The program then


performs the BOX-COX Procedure to find a transformation that will aid in
transforming data to follow more closely the normal distribution.

The user inputs the lower bound where the search is to begin and the
number of searches to be made.The increment for the search is also input
by the user.

USEAGE: %BOXCOX

INPUT:

1. variable from data bc;

2. The lower bound for which the search to maximize the


log-likelihood is to be made (e.g. -1.2).

3. The number of searches to be made (e.g. 10).

4. The incremental unit to be added at each stage of the search


beginning at the lower bound (e.g. 0.2).

OUTPUT: LAMBDA (the power transformation to be used to make the


original data more normally distributed).

MAXL_LIK (the value of the log-likelihood for the LAMDA


transformation;this value is the maximum
log-likelihood over the grid search).
Graph:

REFERENCES:
Box,G.E.P., and D.R.Cox. (1964). An analysis of
transformations (with discussion). Journal of the Royal
Statistical Society.B.26:211-252.
Johnson,R.A., and D.Wichern. (1982) Applied Multivariate

Statistical Analysis.Englewood Cliffs,N.J.:Prentice-Hall.

*/

data a; set bc;


xx = &name1;
run;
proc sort data=a;
by xx;
data a;set a;
title1 ’THE BOX-COX TRANSFORMATION ’;
*title2 ’The Lower and Upper Bounds for the Search Grid’;
ll=&lower2;
incr=&increas;
numb=&number;
uu = ll+(numb-1)*incr;

data aa;set a;
mn =_N_;
if mn = 1;
drop mn;

proc print data=aa split =’*’ noobs ;


var ll uu;
label ll = ’Lower*Bound’
uu = ’Upper*Bound’;

%do i = 1 %to &number;


data a;set a;
lower=&lower2;
increase =&increas;
%let j=%eval(&i-1);
j=&j;

/* Transformation is increased by the increment at each stage


of the search
*/
lambda=lower+increase*j;
if lambda=0 then lambda=0.001;

data a;set a;
y=log(xx);
z=(xx**lambda-1)/lambda;

proc means data=a noprint;


var y z;
id lambda;
output out=b n=n1 n2 var=var1 var2 sum = sum1 sum2;

/* The log-likelihood is calculated at each stage of the search */

data b;set b;
c91=var2*(n2-1)/n2;
c71=sum1;
c51=(-n2*log(c91)/2)+(lambda-1)*c71;
if abs(lambda)<0.000001 then lambda=0;
lambda&i=lambda;
loglik&i=c51;
loglik=c51;

data dat&i;set b;
keep lambda&i loglik&i;
%end;
data cc;set dat1;
lambda=lambda1;
loglik=loglik1;
keep loglik lambda;

%do j=2 %to &number;

data cc&j;set dat&j;


lambda=lambda&j;
loglik=loglik&j;
keep loglik lambda;

proc append base=cc data=cc&j force;


%end;

data cc;set cc;


indicate=’a’;
proc sort data=cc;
by lambda;
proc print data=cc noobs split =’*’;
var lambda loglik ;
label loglik = ’Log*Likelihood’;
*title2 ’The Log-Likelihood is Maximized giving the required Power Transformation’;
run;

/* The LAMDA power Transformation corresponding to the value that


maximizes the log-likelihood is obtained */

proc means data=cc noprint;


var loglik;
output out=dd max=maxlik;
data dd;set dd;

indicate=’a’;
data ee;merge cc dd;
by indicate;
if loglik=maxlik;
drop indicate;
maxl_lik=maxlik;

data ee; set ee; level95 = maxl_lik - .5*3.84;


proc print data =ee split = ’*’ noobs;
var lambda maxl_lik level95;
label maxl_lik = ’Maximum*Log-Likelihood’;
label level95 = ’Level for 95% CI’;
/*title1 ’THE BOX-COX TRANSFORMATION ’;
title2 ’If LAMBDA = 0 or close to 0 Then Use the LOGe Transformation’;
title3 ’If LAMBDA = 1 or close to 1 Then Use NO Transformation’;
title4 ’If LAMBDA = 0.5 or close to 0.5 Then Use the Square Root Transformation’;
title5 ’If LAMBDA = -1 or close to -1 Then Use the Reciprocal Transformation’;
*/
run;
data ff;
if _n_=1 then set ee;
set cc;
run;

PROC GPLOT DATA=ff;


PLOT LOGLIK*LAMBDA level95*lambda
/overlay href=-1 -.5 0 .5 1 lh=2 ch=green;
SYMBOL1 V=NONE I=JOIN L=1;
symbol2 v=none i=join color=red;
symbol3 v=none i=join color=green;

%mend boxcox;

data bc;
infile ’c:/mydocuments/Regression/Data/DS_data/e13g’;
input obs x1 x2 x3 y; ly=log(y); lx1=log(x1); lx2=log(x2); lx3=log(x3); run;
title2 ’Transformation for Response variable Y’;
%boxcox(name1=x2, lower2=-0.85, number=30, increas=.1)
run;
title2 ’Transformation for Independent variable X1’;
%boxcox(name1=x3, lower2=-2.15, number=30, increas=.1)
run;
title2 ’Regular Linear regression’;
proc reg data=bc graphics;
model y = x1-x3/ss1 ss2 dw;
title2 ’Student Residuals’;plot student.*obs./nostat;run;
title2 ’Leverage Plot’; plot h.*obs./nostat;run;
title2 ’QQ plot’;plot student.*nqq./nostat;run;

run;
proc reg data=bc graphics;
model ly = x1-x3/ss1 ss2 dw;
title2 ’Student Residuals’;plot student.*obs./nostat;run;
title2 ’Leverage Plot’; plot h.*obs./nostat;run;
title2 ’QQ plot’;plot student.*nqq./nostat;run;
run;
title2 ’Regression using logs’;
proc reg data=bc graphics;
model ly = lx1 lx2 lx3/ ss1 ss2 dw;
title2 ’Student Residuals’;plot student.*obs./nostat;run;
title2 ’Leverage Plot’; plot h.*obs./nostat;run;
title2 ’QQ plot’;plot student.*nqq./nostat;run;
run;
The SAS output (excluding the boxcox macro) is;

The REG Procedure


Model: MODEL1
Dependent Variable: y

Analysis of Variance

Sum of Mean
Source DF Squares Square F Value Pr > F

Model 3 1153.25135 384.41712 119.92 <.0001


Error 31 99.37151 3.20553
Corrected Total 34 1252.62286

Root MSE 1.79040 R-Square 0.9207


Dependent Mean 26.18571 Adj R-Sq 0.9130
Coeff Var 6.83732

Parameter Estimates

Parameter Standard
Variable DF Estimate Error t Value Pr > |t| Type I SS Type II SS

Intercept 1 36.06842 1.36549 26.41 <.0001 23999 2236.55537


x1 1 112.28017 11.00542 10.20 <.0001 635.53261 333.65120
x2 1 -0.00197 0.00015758 -12.53 <.0001 516.44861 503.35503
x3 1 -0.44233 0.70270 -0.63 0.5337 1.27012 1.27012

Durbin-Watson D 1.447
Number of Observations 35

1st Order Autocorrelation 0.249
--------------------------------------------------------------------------------------------------------

The REG Procedure


Model: MODEL1
Dependent Variable: ly

Analysis of Variance

Sum of Mean
Source DF Squares Square F Value Pr > F

Model 3 1.65898 0.55299 147.45 <.0001


Error 31 0.11626 0.00375
Corrected Total 34 1.77524

Root MSE 0.06124 R-Square 0.9345


Dependent Mean 3.23975 Adj R-Sq 0.9282
Coeff Var 1.89028

Parameter Estimates

Parameter Standard
Variable DF Estimate Error t Value Pr > |t| Type I SS Type II SS

Intercept 1 3.63763 0.04671 77.88 <.0001 367.35829 22.74899


x1 1 4.08145 0.37644 10.84 <.0001 0.87078 0.44087
x2 1 -0.00007701 0.00000539 -14.29 <.0001 0.78606 0.76565
x3 1 -0.01816 0.02404 -0.76 0.4556 0.00214 0.00214

Durbin-Watson D 1.586
Number of Observations 35
1st Order Autocorrelation 0.173
-------------------------------------------------------------------------------------------------------
The REG Procedure
Model: MODEL1
Dependent Variable: ly

Analysis of Variance

Sum of Mean
Source DF Squares Square F Value Pr > F

Model 3 1.69570 0.56523 220.31 <.0001


Error 31 0.07954 0.00257
Corrected Total 34 1.77524

Root MSE 0.05065 R-Square 0.9552
Dependent Mean 3.23975 Adj R-Sq 0.9509
Coeff Var 1.56347

Parameter Estimates

Parameter Standard
Variable DF Estimate Error t Value Pr > |t| Type I SS Type II SS

Intercept 1 8.54953 0.26602 32.14 <.0001 367.35829 2.64999


lx1 1 0.16842 0.01181 14.26 <.0001 0.82195 0.52198
lx2 1 -0.53714 0.03010 -17.85 <.0001 0.86822 0.81724
lx3 1 -0.01441 0.00982 -1.47 0.1521 0.00553 0.00553

Durbin-Watson D 1.901
Number of Observations 35
1st Order Autocorrelation 0.016

Chapter 6

The Generalized Least Squares Model

In this chapter the approach given in the previous chapters is extended to the case where there is a general
error structure.
This model can be written as
y = Xβ + e
where y = (y1, y2, . . . , yn)' and e = (e1, e2, . . . , en)' are n × 1 vectors, X is the n × p matrix whose ith row is
(1, x_{1i}, x_{2i}, . . . , x_{(p−1)i}), and β = (β0, β1, . . . , β_{p−1})' is the p × 1 vector of regression coefficients, with

E(e) = 0,  Var(e) = σ²Σ,  e ∼ Nn(0, σ²Σ).
The least squares problem becomes: find β̂ which minimizes

Q(β) = (y − Xβ)'Σ⁻¹(y − Xβ).

By using the properties of differentiation with matrices one has

∂Q(β)/∂β = −2X'Σ⁻¹y + 2X'Σ⁻¹Xβ = 0.
From which one obtains the normal equations given by

X'Σ⁻¹Xβ = X'Σ⁻¹y.

If rank[X] = p the normal equations have a unique solution given by

β̂ = (X'Σ⁻¹X)⁻¹X'Σ⁻¹y.
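A minimal SAS/IML sketch of this computation is given below. The matrices X, y, and Σ are small made-up quantities used only to show the algebra; in practice Σ must be known or estimated.

proc iml;
X = {1 1, 1 2, 1 3, 1 4};              /* artificial n x p design matrix       */
y = {1.1, 2.3, 2.8, 4.2};
Sigma = diag({1, 2, 4, 8});            /* assumed error covariance structure   */
Sinv = inv(Sigma);
betagls = inv(X`*Sinv*X)*X`*Sinv*y;    /* solves X'Sigma^-1 X b = X'Sigma^-1 y */
covb = inv(X`*Sinv*X);                 /* sigma^2 times this is Var(betagls)   */
print betagls covb;
quit;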

6.0.3 Inference
The above derivation is similar to that given in the previous chapter. At this stage the problem is one of
optimization rather than statistics. In order to create a statistical problem it is necessary to introduce some
distributional assumptions. When using the linear models approach one makes assumptions concerning the
error structure. That is, assume that the unobserved errors ei, i = 1, 2, . . . , n are jointly normal with mean
0 and covariance matrix σ²Σ. This assumption becomes

e = (e1, e2, . . . , en)' ∼ Nn(0, σ²Σ).

Using the properties of linear transformations of normal variates one has

y = (y1, y2, . . . , yn)' ∼ Nn(Xβ, σ²Σ).

Now using the fact that β̂ = Ly, where L = (X'Σ⁻¹X)⁻¹X'Σ⁻¹, it follows that


1. β̂ ∼ Np(β, σ²(X'Σ⁻¹X)⁻¹).

(a) β̂ is an unbiased estimate of β.
(b) var(β̂i) = σ²((X'Σ⁻¹X)⁻¹)ii.
(c) cov(β̂i, β̂j) = σ²((X'Σ⁻¹X)⁻¹)ij.
(d) corr(β̂i, β̂j) = ((X'Σ⁻¹X)⁻¹)ij /[((X'Σ⁻¹X)⁻¹)ii ((X'Σ⁻¹X)⁻¹)jj]^{1/2}.

6.1 Weighted Least Squares


One common example is when Σ = diag(1/w1, 1/w2, . . . , 1/wn), where the weight wi is often taken proportional to the
inverse of the variance of yi, i.e., var(yi) ∝ 1/wi.

6.1.1 Example of Weighted Least Squares


The following example is given in Draper and Smith chapter 9. The SAS code is given as;
options center nodate pagesize=100 ls=80;
symbol1 i=none color=red v=star ;
symbol2 i=join v=none color=blue;
symbol3 i=join color=green v=none;
title1 ’Table 9.1 data X, Y, W page 226’;
data t91;
*infile ’R:/export/home/jtubbs/ftp/Regression/DS_data/15a’;
input x y w ;
cards;
1.15 0.99 1.24
1.90 0.98 2.18

3.00 2.60 7.84
3.00 2.67 7.84
3.00 2.66 7.84
3.00 2.78 7.84
3.00 2.80 7.84
5.34 5.92 7.43
5.38 5.35 6.99
5.40 4.33 6.78
5.40 4.89 6.78
5.45 5.21 6.30
7.70 7.68 0.89
7.80 9.81 0.84
7.81 6.52 0.83
7.85 9.71 0.82
7.87 9.82 0.81
7.91 9.81 0.79
7.94 8.50 0.78
9.03 9.47 0.47
9.07 11.45 0.46
9.11 12.14 0.45
9.14 11.50 0.45
9.16 10.65 0.44
9.37 10.64 0.41
10.17 9.78 0.31
10.18 12.39 0.31
10.22 11.03 0.30
10.22 8.00 0.30
10.22 11.90 0.30
10.18 8.68 0.31
10.50 7.25 0.28
10.23 13.46 0.30
10.03 10.19 0.32
10.23 9.93 0.30
;
proc reg;
model y = x / influence dw; run;
title2 ’Regression Plot’; plot y*x; run;
/*
title2 ’Residual Plot’; plot r.*obs./ vref=0; run;
title2 ’Leverage Plot’; plot h.*obs.; run;
*/
title2 ’Check for Outlier’; plot student.*obs.; run;
/*
title2 ’Check for outlier ith obs missing’; plot rstudent.*obs.; run;
title2 ’PP Plot’; plot npp.*r. /nostat; run;
title2 ’QQ Plot’; plot r. * nqq. /nostat; run;
*/
* Weighted Least Squares;
proc reg;

model y = x / influence dw;
weight w;run;
/*
title2 ’Regression Plot’; plot y*x; run;
title2 ’Residual Plot’; plot r.*obs./ vref=0; run;
title2 ’Leverage Plot’; plot h.*obs.; run;
*/
title2 ’Check for Outlier’; plot student.*obs.; run;
The output is;
Table 9.1 data X, Y, W page 226

The REG Procedure


Model: MODEL1
Dependent Variable: y

Analysis of Variance

Sum of Mean
Source DF Squares Square F Value Pr > F

Model 1 367.94805 367.94805 173.42 <.0001


Error 33 70.01571 2.12169
Corrected Total 34 437.96375

Root MSE 1.45660 R-Square 0.8401


Dependent Mean 7.75686 Adj R-Sq 0.8353
Coeff Var 18.77824

Parameter Estimates

Parameter Standard
Variable DF Estimate Error t Value Pr > |t|

Intercept 1 -0.57895 0.67919 -0.85 0.4001


x 1 1.13540 0.08622 13.17 <.0001

Durbin-Watson D 1.952
Number of Observations 35
1st Order Autocorrelation 0.015

Output Statistics

Hat Diag Cov ------DFBETAS-----


Obs Residual RStudent H Ratio DFFITS Intercept x

1 0.2632 0.1946 0.1629 1.2674 0.0859 0.0857 -0.0780


2 -0.5983 -0.4355 0.1323 1.2113 -0.1701 -0.1690 0.1506

3 -0.2273 -0.1615 0.0946 1.1727 -0.0522 -0.0511 0.0436
4 -0.1573 -0.1118 0.0946 1.1737 -0.0361 -0.0353 0.0302
5 -0.1673 -0.1189 0.0946 1.1736 -0.0384 -0.0376 0.0321
6 -0.0473 -0.0336 0.0946 1.1745 -0.0109 -0.0106 0.0091
7 -0.0273 -0.0194 0.0946 1.1746 -0.0063 -0.0061 0.0052
8 0.4359 0.3016 0.0426 1.1045 0.0636 0.0529 -0.0365
9 -0.1795 -0.1240 0.0421 1.1091 -0.0260 -0.0215 0.0147
10 -1.2222 -0.8537 0.0418 1.0610 -0.1783 -0.1468 0.1002
11 -0.6622 -0.4589 0.0418 1.0954 -0.0958 -0.0789 0.0539
12 -0.3990 -0.2758 0.0411 1.1038 -0.0571 -0.0466 0.0315
13 -0.4837 -0.3324 0.0290 1.0877 -0.0575 -0.0140 -0.0072
14 1.5328 1.0704 0.0293 1.0211 0.1860 0.0391 0.0295
15 -1.7686 -1.2425 0.0293 0.9971 -0.2160 -0.0447 -0.0350
16 1.3760 0.9577 0.0295 1.0356 0.1669 0.0323 0.0292
17 1.4633 1.0204 0.0295 1.0279 0.1781 0.0333 0.0324
18 1.4079 0.9807 0.0297 1.0330 0.1716 0.0298 0.0335
19 0.0638 0.0438 0.0298 1.0960 0.0077 0.0013 0.0016
20 -0.2037 -0.1405 0.0386 1.1048 -0.0281 0.0046 -0.0143
21 1.7308 1.2212 0.0390 1.0103 0.2461 -0.0424 0.1274
22 2.3754 1.7120 0.0395 0.9292 0.3473 -0.0634 0.1828
23 1.7014 1.2000 0.0399 1.0143 0.2446 -0.0464 0.1304
24 0.8287 0.5748 0.0402 1.0854 0.1176 -0.0229 0.0631
25 0.5802 0.4020 0.0430 1.1001 0.0852 -0.0208 0.0493
26 -1.1881 -0.8359 0.0566 1.0796 -0.2047 0.0815 -0.1441
27 1.4105 0.9970 0.0568 1.0606 0.2447 -0.0978 0.1725
28 0.005126 0.003570 0.0576 1.1285 0.0009 -0.0004 0.0006
29 -3.0249 -2.2698 0.0576 0.8372 -0.5611 0.2280 -0.3983
30 0.8751 0.6130 0.0576 1.1024 0.1515 -0.0616 0.1076
31 -2.2995 -1.6689 0.0568 0.9542 -0.4095 0.1638 -0.2887
32 -4.0928 -3.3136 0.0635 0.6295 -0.8630 0.3868 -0.6401
33 2.4238 1.7687 0.0578 0.9366 0.4381 -0.1787 0.3115
34 -0.6191 -0.4316 0.0539 1.1111 -0.1030 0.0386 -0.0706
35 -1.1062 -0.7777 0.0578 1.0872 -0.1926 0.0786 -0.1370

Sum of Residuals 0
Sum of Squared Residuals 70.01571
Predicted Residual SS (PRESS) 77.79464

--------------------------------------------------------------------------------------------------------

The REG Procedure


Model: MODEL1
Dependent Variable: y

Weight: w

Analysis of Variance

Sum of Mean
Source DF Squares Square F Value Pr > F

Model 1 493.21364 493.21364 384.11 <.0001


Error 33 42.37373 1.28405
Corrected Total 34 535.58737

Root MSE 1.13316 R-Square 0.9209


Dependent Mean 4.49629 Adj R-Sq 0.9185
Coeff Var 25.20212

Parameter Estimates

Parameter Standard
Variable DF Estimate Error t Value Pr > |t|

Intercept 1 -0.88770 0.30000 -2.96 0.0057


x 1 1.16442 0.05941 19.60 <.0001

Durbin-Watson D 1.662
Number of Observations 35
1st Order Autocorrelation 0.160

Output Statistics

Weight Hat Diag Cov


Obs Variable Residual RStudent H Ratio DFFITS

1 1.2400 0.5386 0.5386 0.0552 1.1054 0.1302


2 2.1800 -0.3447 -0.4599 0.0691 1.1275 -0.1253
3 7.8400 -0.005556 -0.0146 0.1455 1.2446 -0.0060
4 7.8400 0.0644 0.1697 0.1455 1.2424 0.0700
5 7.8400 0.0544 0.1434 0.1455 1.2430 0.0592
6 7.8400 0.1744 0.4607 0.1455 1.2283 0.1901
7 7.8400 0.1944 0.5139 0.1455 1.2243 0.2121
8 7.4300 0.5897 1.5201 0.0946 1.0217 0.4912
9 6.9900 -0.0269 -0.0647 0.0901 1.1685 -0.0204
10 6.7800 -1.0702 -2.8365 0.0880 0.7446 -0.8809
11 6.7800 -0.5102 -1.2373 0.0880 1.0620 -0.3842
12 6.3000 -0.2484 -0.5687 0.0831 1.1368 -0.1712
13 0.8900 -0.3983 -0.3327 0.0332 1.0925 -0.0617
14 0.8400 1.6152 1.3445 0.0328 0.9851 0.2476
15 0.8300 -1.6864 -1.3983 0.0326 0.9763 -0.2565
16 0.8200 1.4570 1.1914 0.0327 1.0081 0.2192
17 0.8100 1.5437 1.2575 0.0326 0.9983 0.2310
18 0.7900 1.4871 1.1934 0.0324 1.0074 0.2184
19 0.7800 0.1422 0.1110 0.0324 1.0983 0.0203

20 0.4700 -0.1570 -0.0950 0.0304 1.0962 -0.0168
21 0.4600 1.7764 1.0825 0.0302 1.0205 0.1910
22 0.4500 2.4198 1.4805 0.0300 0.9603 0.2603
23 0.4500 1.7449 1.0506 0.0303 1.0248 0.1858
24 0.4400 0.8716 0.5122 0.0299 1.0785 0.0899
25 0.4100 0.6171 0.3493 0.0300 1.0881 0.0615
26 0.3100 -1.1745 -0.5799 0.0297 1.0734 -0.1015
27 0.3100 1.4239 0.7049 0.0298 1.0629 0.1236
28 0.3000 0.0173 0.008371 0.0292 1.0955 0.0015
29 0.3000 -3.0127 -1.5061 0.0292 0.9553 -0.2613
30 0.3000 0.8873 0.4299 0.0292 1.0829 0.0746
31 0.3100 -2.2861 -1.1458 0.0298 1.0115 -0.2009
32 0.2800 -4.0887 -2.0277 0.0297 0.8607 -0.3550
33 0.3000 2.4357 1.2030 0.0293 1.0028 0.2091
34 0.3200 -0.6014 -0.3005 0.0293 1.0895 -0.0522
35 0.3000 -1.0943 -0.5310 0.0293 1.0765 -0.0923

Output Statistics

-------DFBETAS-------
Obs Intercept x

1 0.1293 -0.1124
2 -0.1221 0.1005
3 -0.0053 0.0038
4 0.0621 -0.0438
5 0.0524 -0.0370
6 0.1684 -0.1188
7 0.1879 -0.1325
8 0.0364 0.1635
9 -0.0012 -0.0071
10 -0.0423 -0.3148
11 -0.0185 -0.1373
12 -0.0046 -0.0646
13 0.0335 -0.0515
14 -0.1375 0.2087
15 0.1428 -0.2164
16 -0.1230 0.1856
17 -0.1302 0.1959
18 -0.1241 0.1858
19 -0.0116 0.0173
20 0.0112 -0.0153
21 -0.1273 0.1738
22 -0.1741 0.2372
23 -0.1246 0.1695
24 -0.0604 0.0820
25 -0.0420 0.0565
26 0.0733 -0.0953
27 -0.0893 0.1161

28 -0.0011 0.0014
29 0.1892 -0.2457
30 -0.0540 0.0701
31 0.1451 -0.1887
32 0.2608 -0.3356
33 -0.1514 0.1966
34 0.0374 -0.0489
35 0.0668 -0.0868

Sum of Residuals 0
Sum of Squared Residuals 42.37373
Predicted Residual SS (PRESS) 47.09259

NOTE: The above statistics use observation weights or frequencies.

6.1.2 SAS Scatterplot

6.1.3 SAS Student Plot

6.1.4 SAS Student Plot for Weighted LS

Chapter 7

Selecting the Best Subset Regression

In this chapter the aim is to determine the best subset model when using a multiple regression model. Draper
and Smith introduce this concept in chapter 15. They propose using the statistics, 1) R2 , 2) s2 , the residual
mean square, or 3) Mallow’s Cp statistic. The first two statistics are found from running each of the possible
subset models and computing the corresponding R2 and s2 = σ̂ 2 . The Mallow’s Cp is as follows

7.0.5 Mallow’s Cp Statistic


The Mallow’s statistic is
Cp = RSSp /s2 − (n − 2p),
where RSSp is the residual sum of squares from the model containing p parameters (p is the rank(X) ,
includes the term β0 ), and s2 is the residual mean square from the largest model containing all the available
predictor variables (r) and is presumed to be the most reliable estimate of σ 2 . Note when p = r + 1, Cp = p.
The basic idea is to find the smallest value of p such that Cp ≈ p.
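In PROC REG the Cp values for the candidate subsets can be requested with SELECTION=CP and plotted against p; the sketch below borrows the variable names from the example later in this chapter.

proc reg data=t01a graphics;
model x1 = x2-x10 / selection=cp best=10;   /* subsets listed by Mallows' Cp */
plot cp.*np. / cmallows=blue;               /* reference line Cp = p         */
run;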

7.1 Subset Selection


Statistical packages often provide an number of selection procedures. One of the better is stepwise.

7.1.1 Stepwise Regression


Stepwise is a forward selection method which first selects the best single-predictor model, choosing the variable X1
with the largest R², and fits the equation ŷ = f(X1). If this model is not significant, stop and conclude
that ŷ = ȳ. If the model is significant, select the next predictor variable, say X2, as the one with the
largest partial F-value, giving a second equation ŷ = f(X1, X2). This model is checked for improvement
in R² and in the partial F-values for both variables in the equation. (This differs from the forward procedure
in that the first variable may be excluded from the model at this step, whereas in forward selection once a
variable enters the model it remains.) These partial F-values are used to determine whether a variable
remains in the model or is excluded. The procedure continues until a final model is derived.

7.1.2 Backward Selection


The backward procedure starts with the complete model and eliminates variables. It is similar to the forward
procedure where once a variable is eliminated it remains eliminated. The procedure is as follows;
1. Compute the regression model using all the predictor variables.

2. The partial F-value is calculated for every predictor variable using the type II sum of squares.
3. The lowest partial F-value, FL , is compared with a preselected or default significance level, F0 .
(a) If FL < F0 , remove the variable corresponding to FL , say XL then recompute the model using
the reduced model.
(b) If FL > F0 , adopt the regression model as calculated.
A list of the procedures used by SAS is as follows;

7.2 SAS Model-Selection Methods


The nine methods of model selection implemented in PROC REG are specified with the SELECTION=
option in the MODEL statement. Each method is discussed in this section.

Full Model Fitted (NONE)


This method is the default and provides no model selection capability. The complete model specified in the
MODEL statement is used to fit the model. For many regression analyses, this may be the only method you
need.

Forward Selection (FORWARD)


The forward-selection technique begins with no variables in the model. For each of the independent variables,
the FORWARD method calculates F statistics that reflect the variable’s contribution to the model if it is
included. The p-values for these F statistics are compared to the SLENTRY= value that is specified in the
MODEL statement (or to 0.50 if the SLENTRY= option is omitted). If no F statistic has a significance level
greater than the SLENTRY= value, the FORWARD selection stops. Otherwise, the FORWARD method
adds the variable that has the largest F statistic to the model. The FORWARD method then calculates F
statistics again for the variables still remaining outside the model, and the evaluation process is repeated.
Thus, variables are added one by one to the model until no remaining variable produces a significant F
statistic. Once a variable is in the model, it stays.

Backward Elimination (BACKWARD)


The backward elimination technique begins by calculating F statistics for a model, including all of the
independent variables. Then the variables are deleted from the model one by one until all the variables
remaining in the model produce F statistics significant at the SLSTAY= level specified in the MODEL
statement (or at the 0.10 level if the SLSTAY= option is omitted). At each step, the variable showing the
smallest contribution to the model is deleted.

Stepwise (STEPWISE)
The stepwise method is a modification of the forward-selection technique and differs in that variables already
in the model do not necessarily stay there. As in the forward-selection method, variables are added one by
one to the model, and the F statistic for a variable to be added must be significant at the SLENTRY=
level. After a variable is added, however, the stepwise method looks at all the variables already included in
the model and deletes any variable that does not produce an F statistic significant at the SLSTAY= level.
Only after this check is made and the necessary deletions accomplished can another variable be added to the
model. The stepwise process ends when none of the variables outside the model has an F statistic significant

at the SLENTRY= level and every variable in the model is significant at the SLSTAY= level, or when the
variable to be added to the model is the one just deleted from it.

Maximum R2 Improvement (MAXR)


The maximum R2 improvement technique does not settle on a single model. Instead, it tries to find the “best”
one-variable model, the “best” two-variable model, and so forth, although it is not guaranteed to find the
model with the largest R2 for each size.
The MAXR method begins by finding the one-variable model producing the highest R2 . Then another
variable, the one that yields the greatest increase in R2 , is added. Once the two-variable model is obtained,
each of the variables in the model is compared to each variable not in the model. For each comparison,
the MAXR method determines if removing one variable and replacing it with the other variable increases
R2 . After comparing all possible switches, the MAXR method makes the switch that produces the largest
increase in R2 . Comparisons begin again, and the process continues until the MAXR method finds that no
switch could increase R2 . Thus, the two-variable model achieved is considered the “best” two-variable model
the technique can find. Another variable is then added to the model, and the comparing-and-switching
process is repeated to find the “best” three-variable model, and so forth.
The difference between the STEPWISE method and the MAXR method is that all switches are evaluated
before any switch is made in the MAXR method. In the STEPWISE method, the “worst” variable may
be removed without considering what adding the “best” remaining variable might accomplish. The MAXR
method may require much more computer time than the STEPWISE method.

Minimum R2 (MINR) Improvement


The MINR method closely resembles the MAXR method, but the switch chosen is the one that produces
the smallest increase in R2 . For a given number of variables in the model, the MAXR and MINR methods
usually produce the same “best” model, but the MINR method considers more models of each size.

R2 Selection (RSQUARE)
The RSQUARE method finds subsets of independent variables that best predict a dependent variable by
linear regression in the given sample. You can specify the largest and smallest number of independent
variables to appear in a subset and the number of subsets of each size to be selected. The RSQUARE
method can efficiently perform all possible subset regressions and display the models in decreasing order of
R2 magnitude within each subset size. Other statistics are available for comparing subsets of different sizes.
These statistics, as well as estimated regression coefficients, can be displayed or output to a SAS data set.
The subset models selected by the RSQUARE method are optimal in terms of R2 for the given sample,
but they are not necessarily optimal for the population from which the sample is drawn or for any other
sample for which you may want to make predictions. If a subset model is selected on the basis of a large
R2 value or any other criterion commonly used for model selection, then all regression statistics computed
for that model under the assumption that the model is given a priori, including all statistics computed by
PROC REG, are biased.
While the RSQUARE method is a useful tool for exploratory model building, no statistical method can
be relied on to identify the “true” model. Effective model building requires substantive theory to suggest
relevant predictors and plausible functional forms for the model.
The RSQUARE method differs from the other selection methods in that RSQUARE always identifies
the model with the largest R2 for each number of variables considered. The other selection methods are not
guaranteed to find the model with the largest R2 . The RSQUARE method requires much more computer
time than the other selection methods, so a different selection method such as the STEPWISE method is a
good choice when there are many independent variables to consider.

Adjusted R2 Selection (ADJRSQ)
This method is similar to the RSQUARE method, except that the adjusted R2 statistic is used as the
criterion for selecting models, and the method finds the models with the highest adjusted R2 within the
range of sizes.

Mallows’ Cp Selection (CP)


This method is similar to the ADJRSQ method, except that Mallows’ Cp statistic is used as the criterion
for model selection. Models are listed in ascending order of Cp .

Additional Information on Model-Selection Methods


If the RSQUARE or STEPWISE procedure (as documented in SAS User’s Guide: Statistics, Version 5
Edition) is requested, PROC REG with the appropriate model-selection method is actually used.
Reviews of model-selection methods by Hocking (1976) and Judge et al. (1980) describe these and other
variable-selection methods.

7.2.1 Example
The following SAS program illustrates the output for the stepwise and backward selection methods;
options center nodate pagesize=100 ls=80;
symbol1 i=none color=red v=star ;
symbol2 i=join v=none color=blue;
symbol3 i=join color=green v=none;
title1 ’Data on page 46’;
data t01a;
*infile ’R:/export/home/jtubbs/ftp/Regression/DS_data/t01a’;
input obs x1-x10 ;
cards;
1 10.98 5.20 0.61 7.4 31 20 22 35.3 54.8 4
2 11.13 5.12 0.64 8.0 29 20 25 29.7 64.0 5
3 12.51 6.19 0.78 7.4 31 23 17 30.8 54.8 4
4 8.40 3.89 0.49 7.5 30 20 22 58.8 56.3 4
5 9.27 6.28 0.84 5.5 31 21 0 61.4 30.3 5
6 8.73 5.76 0.74 8.9 30 22 0 71.3 79.2 4
7 6.36 3.45 0.42 4.1 31 11 0 74.4 16.8 2
8 8.50 6.57 0.87 4.1 31 23 0 76.7 16.8 5
9 7.82 5.69 0.75 4.1 30 21 0 70.7 16.8 4
10 9.14 6.14 0.76 4.5 31 20 0 57.5 20.3 5
11 8.24 4.84 0.65 10.3 30 20 11 46.4 106.1 4
12 12.19 4.88 0.62 6.9 31 21 12 28.9 47.6 4
13 11.88 6.03 0.79 6.6 31 21 25 28.1 43.6 5
14 9.57 4.55 0.60 7.3 28 19 18 39.1 53.3 5
15 10.94 5.71 0.70 8.1 31 23 5 46.8 65.6 4
16 9.58 5.67 0.74 8.4 30 20 7 48.5 70.6 4
17 10.09 6.72 0.85 6.1 31 22 0 59.3 37.2 6
18 8.11 4.95 0.67 4.9 30 22 0 70.0 24.0 4
19 6.83 4.62 0.45 4.6 31 11 0 70.0 21.2 3
20 8.88 6.60 0.95 3.7 31 23 0 74.5 13.7 4

21 7.68 5.01 0.64 4.7 30 20 0 72.1 22.1 4
22 8.47 5.68 0.75 5.3 31 21 1 58.1 28.1 6
23 8.86 5.28 0.70 6.2 30 20 14 44.6 38.4 4
24 10.36 5.36 0.67 6.8 31 20 22 33.4 46.2 4
25 11.08 5.87 0.70 7.5 31 22 28 28.6 56.3 5
;
proc reg graphics;
model x1 = x2 x3 x4 x5 x6 x7 x8 x9 x10 / selection=stepwise; run;
model x1 = x2 x3 x4 x5 x6 x7 x8 x9 x10 / selection=backward; run;
model x1 = x2 x3 x4 x5 x6 x7 x8 x9 x10 / selection=rsquare best =5; run;
plot cp.*np./ chocking=red cmallows=blue; run;
The output is;
Data on page 46
The REG Procedure
Model: MODEL1
Dependent Variable: x1

Stepwise Selection: Step 1

Variable x8 Entered: R-Square = 0.7144 and C(p) = 35.1394

Analysis of Variance

Sum of Mean
Source DF Squares Square F Value Pr > F

Model 1 45.59240 45.59240 57.54 <.0001


Error 23 18.22340 0.79232
Corrected Total 24 63.81580

Parameter Standard
Variable Estimate Error Type II SS F Value Pr > F

Intercept 13.62299 0.58146 434.91206 548.91 <.0001


x8 -0.07983 0.01052 45.59240 57.54 <.0001

Bounds on condition number: 1, 1


--------------------------------------------------------------------------------

Stepwise Selection: Step 2

Variable x2 Entered: R-Square = 0.8600 and C(p) = 8.5141

Analysis of Variance

Sum of Mean
Source DF Squares Square F Value Pr > F

Model 2 54.88446 27.44223 67.60 <.0001


Error 22 8.93134 0.40597
Corrected Total 24 63.81580

Parameter Standard
Variable Estimate Error Type II SS F Value Pr > F

Intercept 9.47422 0.96189 39.38462 97.01 <.0001


x2 0.76165 0.15920 9.29206 22.89 <.0001
x8 -0.07976 0.00753 45.51469 112.11 <.0001

Bounds on condition number: 1, 4


--------------------------------------------------------------------------------

Stepwise Selection: Step 3

Variable x6 Entered: R-Square = 0.8796 and C(p) = 6.6740

Analysis of Variance

Sum of Mean
Source DF Squares Square F Value Pr > F

Model 3 56.13099 18.71033 51.13 <.0001


Error 21 7.68481 0.36594
Corrected Total 24 63.81580

Parameter Standard
Variable Estimate Error Type II SS F Value Pr > F

Intercept 8.56626 1.03732 24.95581 68.20 <.0001


x2 0.48799 0.21173 1.94389 5.31 0.0315
x6 0.10820 0.05862 1.24653 3.41 0.0791
x8 -0.07582 0.00746 37.75824 103.18 <.0001

Bounds on condition number: 2.0526, 15.312


--------------------------------------------------------------------------------

Stepwise Selection: Step 4

Variable x5 Entered: R-Square = 0.8934 and C(p) = 5.9546

Analysis of Variance

Sum of Mean
Source DF Squares Square F Value Pr > F

Model 4 57.01371 14.25343 41.91 <.0001


Error 20 6.80209 0.34010
Corrected Total 24 63.81580

Stepwise Selection: Step 4

Parameter Standard
Variable Estimate Error Type II SS F Value Pr > F

Intercept 0.09878 5.35020 0.00011594 0.00 0.9855


x2 0.29767 0.23584 0.54181 1.59 0.2214
x5 0.28873 0.17922 0.88272 2.60 0.1228
x6 0.14230 0.06035 1.89089 5.56 0.0287
x8 -0.07558 0.00720 37.50744 110.28 <.0001

Bounds on condition number: 2.6196, 29.579


--------------------------------------------------------------------------------

Stepwise Selection: Step 5

Variable x2 Removed: R-Square = 0.8849 and C(p) = 5.6238

Analysis of Variance

Sum of Mean
Source DF Squares Square F Value Pr > F

Model 3 56.47190 18.82397 53.83 <.0001


Error 21 7.34390 0.34971
Corrected Total 24 63.81580

Parameter Standard
Variable Estimate Error Type II SS F Value Pr > F

Intercept -2.96806 4.83346 0.13187 0.38 0.5458


x5 0.40205 0.15729 2.28481 6.53 0.0184
x6 0.19892 0.04094 8.25607 23.61 <.0001
x8 -0.07392 0.00718 37.11636 106.13 <.0001

Bounds on condition number: 1.0534, 9.3248
--------------------------------------------------------------------------------

All variables left in the model are significant at the 0.1500 level.

No other variable met the 0.1500 significance level for entry into the model.

Summary of Stepwise Selection

Variable Variable Number Partial Model


Step Entered Removed Vars In R-Square R-Square C(p) F Value Pr > F

1 x8 1 0.7144 0.7144 35.1394 57.54 <.0001


2 x2 2 0.1456 0.8600 8.5141 22.89 <.0001
3 x6 3 0.0195 0.8796 6.6740 3.41 0.0791
4 x5 4 0.0138 0.8934 5.9546 2.60 0.1228
5 x2 3 0.0085 0.8849 5.6238 1.59 0.2214

--------------------------------------------------------------------------------------------------------
The REG Procedure
Model: MODEL2
Dependent Variable: x1

Backward Elimination: Step 0

All Variables Entered: R-Square = 0.9237 and C(p) = 10.0000

Analysis of Variance

Sum of Mean
Source DF Squares Square F Value Pr > F

Model 9 58.94665 6.54963 20.18 <.0001


Error 15 4.86915 0.32461
Corrected Total 24 63.81580

Parameter Standard
Variable Estimate Error Type II SS F Value Pr > F

Intercept 1.89421 6.99636 0.02379 0.07 0.7903


x2 0.70541 0.56490 0.50617 1.56 0.2309
x3 -1.89372 4.14629 0.06771 0.21 0.6544
x4 1.13422 0.74609 0.75020 2.31 0.1493
x5 0.11876 0.20461 0.10936 0.34 0.5702

x6 0.17935 0.08095 1.59339 4.91 0.0426
x7 -0.01818 0.02451 0.17859 0.55 0.4697
x8 -0.07742 0.01659 7.06721 21.77 0.0003
x9 -0.08585 0.05200 0.88475 2.73 0.1195
x10 -0.34501 0.21070 0.87036 2.68 0.1223

Bounds on condition number: 126.63, 2605.5


--------------------------------------------------------------------------------

Backward Elimination: Step 1

Variable x3 Removed: R-Square = 0.9226 and C(p) = 8.2086

Analysis of Variance

Sum of Mean
Source DF Squares Square F Value Pr > F

Model 8 58.87894 7.35987 23.85 <.0001


Error 16 4.93686 0.30855
Corrected Total 24 63.81580

Parameter Standard
Variable Estimate Error Type II SS F Value Pr > F

Intercept 0.77428 6.38847 0.00453 0.01 0.9050


x2 0.48305 0.27935 0.92265 2.99 0.1030
x4 1.28060 0.65687 1.17271 3.80 0.0690
x5 0.14849 0.18913 0.19020 0.62 0.4439
x6 0.15634 0.06179 1.97566 6.40 0.0223
x7 -0.01729 0.02382 0.16263 0.53 0.4783
x8 -0.07661 0.01609 6.99996 22.69 0.0002
x9 -0.09537 0.04644 1.30122 4.22 0.0567
x10 -0.34376 0.20540 0.86425 2.80 0.1136

Bounds on condition number: 103.26, 1720


--------------------------------------------------------------------------------

Backward Elimination: Step 2

Variable x7 Removed: R-Square = 0.9201 and C(p) = 6.7096

Analysis of Variance

Sum of Mean
Source DF Squares Square F Value Pr > F

Model 7 58.71630 8.38804 27.96 <.0001


Error 17 5.09950 0.29997
Corrected Total 24 63.81580

Parameter Standard
Variable Estimate Error Type II SS F Value Pr > F

Intercept -0.36669 6.10541 0.00108 0.00 0.9528


x2 0.54112 0.26390 1.26119 4.20 0.0561
x4 1.26814 0.64745 1.15079 3.84 0.0668
x5 0.16128 0.18567 0.22634 0.75 0.3971
x6 0.14840 0.05996 1.83766 6.13 0.0241
x8 -0.06795 0.01063 12.25966 40.87 <.0001
x9 -0.09409 0.04576 1.26833 4.23 0.0554
x10 -0.34257 0.20252 0.85831 2.86 0.1090

Backward Elimination: Step 2

Bounds on condition number: 103.19, 1444.8


--------------------------------------------------------------------------------

Backward Elimination: Step 3

Variable x5 Removed: R-Square = 0.9165 and C(p) = 5.4069

Analysis of Variance

Sum of Mean
Source DF Squares Square F Value Pr > F

Model 6 58.48997 9.74833 32.95 <.0001


Error 18 5.32583 0.29588
Corrected Total 24 63.81580

Parameter Standard
Variable Estimate Error Type II SS F Value Pr > F

Intercept 4.39661 2.66596 0.80472 2.72 0.1165


x2 0.66223 0.22253 2.62044 8.86 0.0081
x4 1.30266 0.64181 1.21888 4.12 0.0574
x6 0.13814 0.05838 1.65658 5.60 0.0294
x8 -0.06919 0.01046 12.95092 43.77 <.0001

x9 -0.09798 0.04523 1.38878 4.69 0.0439
x10 -0.40827 0.18658 1.41681 4.79 0.0421

Bounds on condition number: 102.8, 1211.9


--------------------------------------------------------------------------------

All variables left in the model are significant at the 0.1000 level.

Summary of Backward Elimination

Variable Number Partial Model


Step Removed Vars In R-Square R-Square C(p) F Value Pr > F

1 x3 8 0.0011 0.9226 8.2086 0.21 0.6544


2 x7 7 0.0025 0.9201 6.7096 0.53 0.4783
3 x5 6 0.0035 0.9165 5.4069 0.75 0.3971

--------------------------------------------------------------------------------------------------------

R-Square Selection Method

Number in
Model R-Square Variables in Model

1 0.7144 x8
1 0.4104 x7
1 0.2874 x6
1 0.2250 x4
1 0.1557 x9
-----------------------------------------------------
2 0.8600 x2 x8
2 0.8491 x6 x8
2 0.8467 x3 x8
2 0.7555 x5 x8
2 0.7495 x8 x10
-----------------------------------------------------
3 0.8849 x5 x6 x8
3 0.8796 x2 x6 x8
3 0.8651 x2 x8 x10
3 0.8638 x2 x5 x8
3 0.8635 x6 x8 x9
-----------------------------------------------------
4 0.8934 x2 x5 x6 x8
4 0.8914 x2 x6 x8 x10
4 0.8899 x5 x6 x7 x8
4 0.8884 x5 x6 x8 x9

4 0.8879 x2 x3 x6 x8
-----------------------------------------------------
5 0.8995 x2 x3 x6 x8 x10
5 0.8988 x2 x5 x6 x8 x10
5 0.8975 x4 x5 x6 x8 x9
5 0.8975 x2 x3 x5 x6 x8
5 0.8974 x2 x6 x8 x9 x10
-----------------------------------------------------
6 0.9165 x2 x4 x6 x8 x9 x10
6 0.9075 x2 x3 x6 x8 x9 x10
6 0.9066 x2 x4 x5 x6 x8 x9
6 0.9052 x2 x3 x4 x6 x8 x10
6 0.9037 x2 x3 x5 x6 x8 x10
-----------------------------------------------------
7 0.9201 x2 x4 x5 x6 x8 x9 x10
7 0.9197 x2 x4 x6 x7 x8 x9 x10
7 0.9186 x2 x3 x4 x6 x8 x9 x10
7 0.9122 x3 x4 x5 x6 x8 x9 x10
7 0.9109 x2 x3 x6 x7 x8 x9 x10
-----------------------------------------------------
8 0.9226 x2 x4 x5 x6 x7 x8 x9 x10
8 0.9220 x2 x3 x4 x6 x7 x8 x9 x10
8 0.9209 x2 x3 x4 x5 x6 x8 x9 x10
8 0.9158 x3 x4 x5 x6 x7 x8 x9 x10
8 0.9119 x2 x3 x5 x6 x7 x8 x9 x10
-----------------------------------------------------
9 0.9237 x2 x3 x4 x5 x6 x7 x8 x9 x10

7.2.2 Selection using JMP

Chapter 8

Multicollinearity

In this chapter the problem of having linear dependence among the “independent” variables is discussed.
Draper and Smith discuss this topic in chapter 16 under the title of “Ill-conditioning in regression data”.
The problem arises whenever there is a linear dependency among the independent variables. That is, let

X = (x0, x1, x2, . . . , x_{p−1}),

where xj is the n × 1 vector of observations on the jth variable and x0 = 1, the vector of ones. The independent variables are said
to have a linear dependence whenever

Σ_{j=0}^{p−1} tj xj = 0

for constants tj that are not all zero. If the above condition holds then (X'X)⁻¹ does not exist. Seldom does an exact linear depen-
dency actually hold; rather, one nearly has a linear dependency, which implies that (X'X)⁻¹ is ill conditioned,
and hence any estimates using (X'X)⁻¹ are “poor”. Think about the problem of dividing a number by a very
small positive value. Montgomery and Peck list four primary reasons for having a multicollinearity problem
within the regression model. They are;
within the regression model. They are;
1. The data collection method employed.
2. Constraints on the model or in the population.
3. Model specification.

4. An over-defined model.
Consider the following artificial example from Sen and Srivastava, which shows what can happen
whenever collinearity is present. Note that the three response variables y1, y2, and y3 are nearly the same, yet the parameter
estimates vary greatly.
options center nodate pagesize=100 ls=80;
symbol1 i=none color=red v=star ;
symbol2 i=join v=none color=blue;
symbol3 i=join color=green v=none;
title1 ’Artifical Multicollinearity Example’;
data t01a;
input x1 x2 y1 y2 y3 ;
cards;

2.705 2.659 4.1 4.1 4.06
2.995 3.005 4.34 4.73 4.39
3.255 3.245 4.95 4.81 5.02
3.595 3.605 5.36 5.3 5.23
3.805 3.795 5.64 5.75 5.57
4.145 4.155 6.18 6.26 6.5
4.405 4.395 6.69 6.61 6.65
4.745 4.755 7.24 7.13 7.26
4.905 4.895 7.46 7.3 7.48
4.845 4.855 7.23 7.32 7.39
;
proc reg graphics;
model y1 y2 y3 = x1 x2 / noint collin vif tol; run;
The SAS output is;
Artifical Multicollinearity Example

The REG Procedure


Model: MODEL1
Dependent Variable: y1

NOTE: No intercept in model. R-Square is redefined.

Analysis of Variance

Sum of Mean
Source DF Squares Square F Value Pr > F

Model 2 363.91782 181.95891 24229.1 <.0001


Error 8 0.06008 0.00751
Uncorrected Total 10 363.97790

Root MSE 0.08666 R-Square 0.9998


Dependent Mean 5.91900 Adj R-Sq 0.9998
Coeff Var 1.46410

Parameter Standard
Variable DF Estimate Error t Value Pr > |t| Tolerance

x1 1 3.13724 1.58900 1.97 0.0838 0.00001848


x2 1 -1.63428 1.58983 -1.03 0.3340 0.00001848

Variance
Variable DF Inflation

x1 1 54102
x2 1 54102

Collinearity Diagnostics
Condition --Proportion of Variation-
Number Eigenvalue Index x1 x2

1 1.99999 1.00000 0.00000462 0.00000462


2 0.00000924 465.19623 1.00000 1.00000

The REG Procedure


Model: MODEL1
Dependent Variable: y2

NOTE: No intercept in model. R-Square is redefined.

Analysis of Variance

Sum of Mean
Source DF Squares Square F Value Pr > F

Model 2 363.98221 181.99110 18596.1 <.0001


Error 8 0.07829 0.00979
Uncorrected Total 10 364.06050

Root MSE 0.09893 R-Square 0.9998


Dependent Mean 5.93100 Adj R-Sq 0.9997
Coeff Var 1.66796

Parameter Standard
Variable DF Estimate Error t Value Pr > |t| Tolerance

x1 1 0.91305 1.81393 0.50 0.6283 0.00001848


x2 1 0.59123 1.81487 0.33 0.7530 0.00001848

Variance
Variable DF Inflation

x1 1 54102
x2 1 54102

Collinearity Diagnostics

Condition --Proportion of Variation-


Number Eigenvalue Index x1 x2

1 1.99999 1.00000 0.00000462 0.00000462


2 0.00000924 465.19623 1.00000 1.00000

The REG Procedure
Model: MODEL1
Dependent Variable: y3

NOTE: No intercept in model. R-Square is redefined.

Analysis of Variance

Sum of Mean
Source DF Squares Square F Value Pr > F

Model 2 368.90354 184.45177 8531.34 <.0001


Error 8 0.17296 0.02162
Uncorrected Total 10 369.07650

Root MSE 0.14704 R-Square 0.9995


Dependent Mean 5.95500 Adj R-Sq 0.9994
Coeff Var 2.46917

Parameter Standard
Variable DF Estimate Error t Value Pr > |t| Tolerance

x1 1 0.75499 2.69612 0.28 0.7866 0.00001848


x2 1 0.75951 2.69752 0.28 0.7854 0.00001848

Variance
Variable DF Inflation

x1 1 54102
x2 1 54102

Collinearity Diagnostics
Condition --Proportion of Variation-
Number Eigenvalue Index x1 x2

1 1.99999 1.00000 0.00000462 0.00000462


2 0.00000924 465.19623 1.00000 1.00000

8.1 Detecting Multicollinearity


8.1.1 Tolerances and Variance Inflation Factors
When there are near linear dependencies among the independent variables, let $R_j^2$ denote the R-square obtained by regressing $\vec{x}_j$ on the remaining independent variables. If this value is close to one, then the $j$th variable is not needed in the model, since it can nearly be written as a linear combination of the remaining independent variables. If the variables are centered and scaled before computing the least squares estimates, then $X'X$ is the correlation matrix of the regressors, $C = (X'X)^{-1} = (c_{ij})$, and
$$TOL_j = 1 - R_j^2, \qquad VIF_j = TOL_j^{-1} = c_{jj}, \qquad j = 1, 2, \ldots, p-1.$$
Some authors suggest that if $VIF_j > 10$ then collinearity may be present.
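As a check on these definitions, the tolerance and VIF for any single regressor can be reproduced with an auxiliary regression of that regressor on the remaining independent variables. The sketch below is only illustrative: it uses the data set t01a and variables x2-x10 from the example of Section 8.1.4, and it assumes the EDF option writes the model R-square to the OUTEST= data set as the variable _RSQ_. The result should agree, up to rounding, with the tolerance 0.00790 and VIF 126.6 reported for x4 in that example.

proc reg data=t01a outest=aux4 edf noprint;
   model x4 = x2 x3 x5 x6 x7 x8 x9 x10;   /* auxiliary regression for x4 */
run;

data vif4;
   set aux4;
   tol4 = 1 - _RSQ_;   /* TOL_4 = 1 - R_4**2 */
   vif4 = 1 / tol4;    /* VIF_4 = 1 / TOL_4  */
   put tol4= vif4=;
run;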

8.1.2 Eigenvalues and Condition Numbers


If the independent variables are centered and scaled as above, it follows that
$$\mathrm{tr}[X'X] = \sum_{j=0}^{p-1} \lambda_j = p,$$
where $\lambda_j$ is the $j$th eigenvalue of the matrix $X'X$. The eigenvalues can be ordered so that $\lambda_0 \geq \lambda_1 \geq \ldots \geq \lambda_{p-1}$; whenever one of the $\lambda_j$'s becomes small, the matrix $X'X$ is said to be ill conditioned. The condition indices are defined as
$$CN_j = \sqrt{\lambda_0 / \lambda_j},$$
and the largest of these is called the condition number. If $CN_j > 30$, collinearity may be present.
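These quantities are easy to compute directly. The PROC IML sketch below is only illustrative; it assumes the data set t01a and the regressors x2-x10 of the example in Section 8.1.4, forms the correlation matrix of the regressors, and computes its eigenvalues and condition indices. The results should agree with the intercept-adjusted (COLLINOINT) diagnostics reported there, but not with the COLLIN diagnostics, which scale X without centering and include the intercept column.

proc iml;
   use t01a;
   read all var {x2 x3 x4 x5 x6 x7 x8 x9 x10} into X;
   close t01a;
   n  = nrow(X);
   Xc = X - repeat(X[:,], n, 1);               /* center each column                */
   Z  = Xc / repeat(sqrt((Xc#Xc)[+,]), n, 1);  /* scale each column to unit length  */
   R  = Z`*Z;                                  /* correlation matrix; trace(R) = p  */
   lambda = eigval(R);                         /* eigenvalues, in descending order  */
   cn = sqrt(lambda[1] / lambda);              /* condition indices CN_j            */
   print lambda cn;
quit;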

8.1.3 SAS – Collinearity Diagnostics


When a regressor is nearly a linear combination of other regressors in the model, the affected estimates are
unstable and have high standard errors. This problem is called collinearity or multicollinearity. It is a good
idea to find out which variables are nearly collinear with which other variables. The approach in PROC
REG follows that of Belsley, Kuh, and Welsch (1980). PROC REG provides several methods for detecting
collinearity with the COLLIN, COLLINOINT, TOL, and VIF options.
The COLLIN option in the MODEL statement requests that a collinearity analysis be performed. First,
X’X is scaled to have 1s on the diagonal. If you specify the COLLINOINT option, the intercept variable
is adjusted out first. Then the eigenvalues and eigenvectors are extracted. The analysis in PROC REG is
reported with eigenvalues of X’X rather than singular values of X. The eigenvalues of X’X are the squares
of the singular values of X.
The condition indices are the square roots of the ratio of the largest eigenvalue to each individual eigen-
value. The largest condition index is the condition number of the scaled X matrix. Belsley, Kuh, and Welsch (1980) suggest that, when this number is around 10, weak dependencies may be starting to affect the regression estimates. When this number is larger than 100, the estimates may have a fair amount of numerical
error (although the statistical standard error almost always is much greater than the numerical error).
For each variable, PROC REG produces the proportion of the variance of the estimate accounted for
by each principal component. A collinearity problem occurs when a component associated with a high
condition index contributes strongly (variance proportion greater than about 0.5) to the variance of two or
more variables.
The VIF option in the MODEL statement provides the Variance Inflation Factors (VIF). These factors
measure the inflation in the variances of the parameter estimates due to collinearities that exist among the
regressor (independent) variables. There are no formal criteria for deciding if a VIF is large enough to affect
the predicted values.
The TOL option requests the tolerance values for the parameter estimates. The tolerance is defined as
1/VIF.
For a complete discussion of the preceding methods, refer to Belsley, Kuh, and Welsch (1980). For a
more detailed explanation of using the methods with PROC REG, refer to Freund and Littell (1986).
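The "Proportion of Variation" entries can also be reproduced directly from the eigen decomposition of the centered-and-scaled regressors: the proportion of the variance of the $i$th estimate attributable to the $j$th component is $(u_{ij}^2/\lambda_j) / \sum_k (u_{ik}^2/\lambda_k)$, where $u_{ij}$ is the $i$th element of the $j$th eigenvector. The PROC IML sketch below is illustrative only (it again assumes the data set t01a and variables of the next subsection); its rows should agree, up to rounding, with the intercept-adjusted proportion-of-variation table in that output.

proc iml;
   use t01a;
   read all var {x2 x3 x4 x5 x6 x7 x8 x9 x10} into X;
   close t01a;
   n  = nrow(X);
   Xc = X - repeat(X[:,], n, 1);
   Z  = Xc / repeat(sqrt((Xc#Xc)[+,]), n, 1);   /* centered, unit-length columns          */
   call eigen(lambda, U, Z`*Z);                 /* lambda: eigenvalues, U: eigenvectors   */
   p    = ncol(Z);
   phi  = (U##2) / repeat(lambda`, p, 1);       /* u_ij**2 / lambda_j                     */
   prop = ( phi / repeat(phi[,+], 1, p) )`;     /* rows = components, columns = variables */
   ci   = sqrt(lambda[1] / lambda);             /* condition indices                      */
   print ci prop;
quit;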

8.1.4 SAS Example
The SAS code for the example is;
options center nodate pagesize=100 ls=80;
symbol1 color=black v=x ;
symbol2 v=circle color=red;
symbol3 color=blue v=square;
symbol4 v=triangle color=green;
symbol5 v=plus color=orange;
title1 ’Data on page 46’;
data t01a;
*infile ’MacIntosh HD:Regression:DS_data:t01a’;
input obs x1-x10 ;
cards;
1 10.98 5.20 0.61 7.4 31 20 22 35.3 54.8 4
2 11.13 5.12 0.64 8.0 29 20 25 29.7 64.0 5
3 12.51 6.19 0.78 7.4 31 23 17 30.8 54.8 4
4 8.40 3.89 0.49 7.5 30 20 22 58.8 56.3 4
5 9.27 6.28 0.84 5.5 31 21 0 61.4 30.3 5
6 8.73 5.76 0.74 8.9 30 22 0 71.3 79.2 4
7 6.36 3.45 0.42 4.1 31 11 0 74.4 16.8 2
8 8.50 6.57 0.87 4.1 31 23 0 76.7 16.8 5
9 7.82 5.69 0.75 4.1 30 21 0 70.7 16.8 4
10 9.14 6.14 0.76 4.5 31 20 0 57.5 20.3 5
11 8.24 4.84 0.65 10.3 30 20 11 46.4 106.1 4
12 12.19 4.88 0.62 6.9 31 21 12 28.9 47.6 4
13 11.88 6.03 0.79 6.6 31 21 25 28.1 43.6 5
14 9.57 4.55 0.60 7.3 28 19 18 39.1 53.3 5
15 10.94 5.71 0.70 8.1 31 23 5 46.8 65.6 4
16 9.58 5.67 0.74 8.4 30 20 7 48.5 70.6 4
17 10.09 6.72 0.85 6.1 31 22 0 59.3 37.2 6
18 8.11 4.95 0.67 4.9 30 22 0 70.0 24.0 4
19 6.83 4.62 0.45 4.6 31 11 0 70.0 21.2 3
20 8.88 6.60 0.95 3.7 31 23 0 74.5 13.7 4
21 7.68 5.01 0.64 4.7 30 20 0 72.1 22.1 4
22 8.47 5.68 0.75 5.3 31 21 1 58.1 28.1 6
23 8.86 5.28 0.70 6.2 30 20 14 44.6 38.4 4
24 10.36 5.36 0.67 6.8 31 20 22 33.4 46.2 4
25 11.08 5.87 0.70 7.5 31 22 28 28.6 56.3 5
;
title2 ’COLLINEARITY OUTPUT’;
proc reg;
model x1 = x2-x10 / vif tol collin collinoint; run;
The SAS output is;
The REG Procedure
Model: MODEL1
Dependent Variable: y

Analysis of Variance

Sum of Mean
Source DF Squares Square F Value Pr > F

Model 9 58.94665 6.54963 20.18 <.0001


Error 15 4.86915 0.32461
Corrected Total 24 63.81580

Root MSE 0.56975 R-Square 0.9237


Dependent Mean 9.42400 Adj R-Sq 0.8779
Coeff Var 6.04569

Parameter Standard
Variable DF Estimate Error t Value Pr > |t| Tolerance

Intercept 1 1.89421 6.99636 0.27 0.7903 .


x2 1 0.70541 0.56490 1.25 0.2309 0.06351
x3 1 -1.89372 4.14629 -0.46 0.6544 0.04966
x4 1 1.13422 0.74609 1.52 0.1493 0.00790
x5 1 0.11876 0.20461 0.58 0.5702 0.54448
x6 1 0.17935 0.08095 2.22 0.0426 0.22666
x7 1 -0.01818 0.02451 -0.74 0.4697 0.21299
x8 1 -0.07742 0.01659 -4.67 0.0003 0.16481
x9 1 -0.08585 0.05200 -1.65 0.1195 0.00929
x10 1 -0.34501 0.21070 -1.64 0.1223 0.41928

Variance
Variable DF Inflation

Intercept 1 0
x2 1 15.74659
x3 1 20.13711
x4 1 126.62562
x5 1 1.83663
x6 1 4.41192
x7 1 4.69501
x8 1 6.06743
x9 1 107.59089
x10 1 2.38505

Collinearity Diagnostics

Condition --------Proportion of Variation--------


Number Eigenvalue Index Intercept x2 x3

1 9.06875 1.00000 0.00000319 0.00001599 0.00001789


2 0.68073 3.64994 0.00000312 0.00003364 0.00004784

3 0.16428 7.42994 0.00001022 0.00016643 0.00019850
4 0.05575 12.75456 0.00027529 0.00184 0.00367
5 0.01457 24.94866 0.00000153 0.00745 0.01573
6 0.00945 30.98408 0.00530 0.00400 0.00521
7 0.00494 42.86180 0.00132 0.04999 0.02661
8 0.00098671 95.86934 0.02321 0.61220 0.59483
9 0.00041422 147.96509 0.00030721 0.01339 0.13346
10 0.00014186 252.83547 0.96957 0.31092 0.22023

----------------------Proportion of Variation----------------------
Number x4 x5 x6 x7 x8

1 0.00000632 0.00000400 0.00005688 0.00073178 0.00016084


2 0.00001505 0.00000460 0.00003513 0.11306 0.00391
3 0.00041011 0.00001778 0.00020250 0.13452 0.00004932
4 0.00002489 0.00029864 0.00404 0.11437 0.11781
5 0.00015234 0.00008369 0.01785 0.00684 0.00984
6 0.00251 0.00961 0.06603 0.34530 0.40378
7 0.00178 0.00088810 0.57396 0.23042 0.12920
8 0.01318 0.00154 0.16005 0.01957 0.10550
9 0.78856 0.13433 0.13778 0.00126 0.07421
10 0.19336 0.85322 0.04000 0.03392 0.15555

Collinearity Diagnostics

--Proportion of Variation-
Number x9 x10

1 0.00002062 0.00017870
2 0.00024277 0.00011191
3 0.00717 0.00301
4 0.00010957 0.04011
5 0.00000490 0.72075
6 0.00512 0.00608
7 0.00965 0.00776
8 0.01404 0.01752

The REG Procedure


Model: MODEL1
Dependent Variable: y

Collinearity Diagnostics

--Proportion of Variation-
Number x9 x10

9 0.80868 0.13873
10 0.15496 0.06575

Collinearity Diagnostics(intercept adjusted)

Condition --------Proportion of Variation--------


Number Eigenvalue Index x2 x3 x4

1 3.29457 1.00000 0.00187 0.00153 0.00049321


2 3.15506 1.02187 0.00379 0.00296 0.00012721
3 1.02001 1.79720 0.00040383 0.00028361 0.00016524
4 0.80893 2.01811 0.00123 0.00047193 0.00151
5 0.36252 3.01462 0.00008740 0.00629 0.00073852
6 0.21490 3.91542 0.06932 0.02924 4.801955E-7
7 0.10635 5.56571 0.01199 0.00457 0.00008685
8 0.03349 9.91832 0.81780 0.73676 0.00169
9 0.00417 28.11848 0.09351 0.21791 0.99519

Collinearity Diagnostics(intercept adjusted)

----------------------Proportion of Variation----------------------
Number x5 x6 x7 x8 x9

1 0.01106 0.00050757 0.01126 0.00725 0.00054193


2 0.00016169 0.01739 0.00207 0.00372 0.00012594
3 0.29806 0.00573 0.02935 0.03024 0.00035966
4 0.13790 0.00101 0.03582 0.01066 0.00251
5 0.07738 0.15828 0.04837 0.00042538 0.00075128
6 0.25640 0.42124 0.00095883 0.05041 0.00006786
7 0.02309 0.08452 0.87192 0.73005 0.00126
8 0.18859 0.17944 0.00004350 0.01356 0.00498
9 0.00736 0.13188 0.00021686 0.15368 0.98940

-Proportion of Variation-
Number x10

1 0.00135
2 0.02465
3 0.00618
4 0.10686
5 0.58110
6 0.13660
7 0.01522
8 0.07364
9 0.05440

Chapter 9

Ridge Regression

Ridge regression is a popular method for dealing with multicollinearity within a regression model. It was first proposed by Hoerl and Kennard (1970) and was one of the first biased estimation procedures. The idea is fairly simple: since the matrix $X'X$ is ill-conditioned or nearly singular, one can add a positive constant to its diagonal and ensure that the resulting matrix is not ill-conditioned. That is, consider the biased normal equations given by
$$(X'X + kI)\beta = X'y,$$
with the resulting biased estimate of $\beta$ given by
$$\tilde{\beta}(k) = (X'X + kI)^{-1} X'y,$$
where $k$ is called the shrinkage parameter. Since $E[\tilde{\beta}(k)] \neq \beta$, some do not want to use such a procedure. However, in spite of the fact that it is biased, it does have the effect of reducing the variance of the estimator. It can be shown that
$$\mathrm{var}(\hat{\beta}_j) = \sigma^2 / \lambda_j,$$
where $\lambda_j$ is the $j$th eigenvalue of $X'X$. So when $X'X$ is ill-conditioned, some of the $\lambda_j$'s are very small, and hence $\mathrm{var}(\hat{\beta}_j)$ is very large. However,
$$\mathrm{var}(\tilde{\beta}_j) = \sigma^2 \lambda_j / (\lambda_j + k)^2.$$
Consider an example where $\sigma^2 = 1$, $\lambda_1 = 2.985$, $\lambda_2 = 0.01$, and $\lambda_3 = 0.005$. The usual least squares estimation gives
$$\sum_{j=1}^{3} \mathrm{var}(\hat{\beta}_j) = \sigma^2 \sum_{j=1}^{3} 1/\lambda_j = 0.3350 + 100 + 200 = 300.3350.$$
However, if $k = 0.10$ we have
$$\sum_{j=1}^{3} \mathrm{var}(\tilde{\beta}_j) = \sigma^2 \sum_{j=1}^{3} \lambda_j/(\lambda_j + k)^2 \approx 2.3,$$
a reduction of more than two orders of magnitude.

This process of reducing the total variance is very desirable and has led people to propose similar estimation procedures called shrinkage estimators. In this class, we are interested in using this procedure as a way of identifying multicollinearity and the variables which may contribute to the problem. The best way of illustrating the procedure is with an example.
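Before turning to the SAS example, the computation behind the ridge trace can be written out directly. The PROC IML sketch below is only illustrative: it assumes the Draper and Smith data set t01a (re-created in the example below, with response x1 and regressors x2-x10), centers and scales the regressors, and evaluates the ridge estimate over a small grid of k. The scaling conventions used here are not guaranteed to match those of PROC REG's RIDGE= option exactly, so the traced values may differ somewhat from the plot that follows.

proc iml;
   use t01a;
   read all var {x1} into y;                            /* response                        */
   read all var {x2 x3 x4 x5 x6 x7 x8 x9 x10} into X;   /* regressors                      */
   close t01a;
   n  = nrow(X);
   Xc = X - repeat(X[:,], n, 1);                        /* center the regressors           */
   Z  = Xc / repeat(sqrt((Xc#Xc)[+,]/(n-1)), n, 1);     /* and scale them                  */
   yc = y - y[:];                                       /* center the response             */
   p  = ncol(Z);
   kvals = do(0, 0.02, 0.002);                          /* grid of shrinkage values        */
   bpath = j(ncol(kvals), p+1, .);                      /* each row: k then beta~(k)`      */
   do i = 1 to ncol(kvals);
      k = kvals[i];
      btilde = inv(Z`*Z + k*I(p)) * Z`*yc;              /* beta~(k) = (Z`Z + kI)^{-1} Z`y  */
      bpath[i,] = k || btilde`;
   end;
   print bpath;
quit;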

9.0.5 Ridge Plot Example
It can be shown that
$$\lim_{k \to 0} \tilde{\beta}(k) = \hat{\beta}.$$
Whenever multicollinearity is present, the estimates often change dramatically as $k$ approaches zero. Notice in the following plot the effect on the least squares estimates for x3 and x4.
The SAS code is,
options center nodate pagesize=100 ls=80;
symbol1 color=black v=x ;
symbol2 v=circle color=red;
symbol3 color=blue v=square;
symbol4 v=triangle color=green;
symbol5 v=plus color=orange;
title1 ’Data on page 46’;
data t01a;
input obs x1-x10 ;
cards;
1 10.98 5.20 0.61 7.4 31 20 22 35.3 54.8 4
2 11.13 5.12 0.64 8.0 29 20 25 29.7 64.0 5
3 12.51 6.19 0.78 7.4 31 23 17 30.8 54.8 4
4 8.40 3.89 0.49 7.5 30 20 22 58.8 56.3 4
5 9.27 6.28 0.84 5.5 31 21 0 61.4 30.3 5
6 8.73 5.76 0.74 8.9 30 22 0 71.3 79.2 4
7 6.36 3.45 0.42 4.1 31 11 0 74.4 16.8 2
8 8.50 6.57 0.87 4.1 31 23 0 76.7 16.8 5
9 7.82 5.69 0.75 4.1 30 21 0 70.7 16.8 4
10 9.14 6.14 0.76 4.5 31 20 0 57.5 20.3 5
11 8.24 4.84 0.65 10.3 30 20 11 46.4 106.1 4
12 12.19 4.88 0.62 6.9 31 21 12 28.9 47.6 4
13 11.88 6.03 0.79 6.6 31 21 25 28.1 43.6 5
14 9.57 4.55 0.60 7.3 28 19 18 39.1 53.3 5
15 10.94 5.71 0.70 8.1 31 23 5 46.8 65.6 4
16 9.58 5.67 0.74 8.4 30 20 7 48.5 70.6 4
17 10.09 6.72 0.85 6.1 31 22 0 59.3 37.2 6
18 8.11 4.95 0.67 4.9 30 22 0 70.0 24.0 4
19 6.83 4.62 0.45 4.6 31 11 0 70.0 21.2 3
20 8.88 6.60 0.95 3.7 31 23 0 74.5 13.7 4
21 7.68 5.01 0.64 4.7 30 20 0 72.1 22.1 4
22 8.47 5.68 0.75 5.3 31 21 1 58.1 28.1 6
23 8.86 5.28 0.70 6.2 30 20 14 44.6 38.4 4
24 10.36 5.36 0.67 6.8 31 20 22 33.4 46.2 4
25 11.08 5.87 0.70 7.5 31 22 28 28.6 56.3 5
;
title2 ’Ridge Plot’;
proc reg outest=b outvif ridge = 0 to 0.02 by .002;
model x1 = x2-x10 / vif tol collin collinoint; run;
plot /ridgeplot nomodel legend=legend2 nostat vref=0 lvref=1 cvref=blue; run;
The SAS output is similar to that when considering the collinearity problem of the previous chapter. The
main difference is the following plot which is called a ridge trace plot.

The second plot shows the ridge trace when the magnitudes of the coefficients are adjusted.

Another example is as follows;


options center nodate pagesize=100 ls=80;
symbol1 color=black v=x ;
symbol2 v=circle color=red;
symbol3 color=blue v=square;
symbol4 v=triangle color=green;
symbol5 v=plus color=orange;
title1 ’Heat Transfer Data Meyers page 200’;
data t01a;
input OBS Y1 Y2 X1 X2 X3 X4;
cards;
1 41.852 38.75 69.69 170.83 45 219.74
2 155.329 51.87 113.46 230.06 25 181.22
3 99.628 53.79 113.54 228.19 65 179.06

4 49.409 53.84 118.75 117.73 65 281.30
5 72.958 49.17 119.72 117.69 25 282.20
6 107.702 47.61 168.38 173.46 45 216.14
7 97.239 64.19 169.85 169.85 45 223.88
8 105.856 52.73 169.85 170.86 45 222.80
9 99.348 51.00 170.89 173.92 80 218.84
10 111.907 47.37 171.31 173.34 25 218.12
11 100.008 43.18 171.43 171.43 45 219.20
12 175.380 71.23 171.59 263.49 45 168.62
13 117.800 49.30 171.63 171.63 45 217.58
14 217.409 50.87 171.93 170.91 10 219.92
15 41.725 54.44 173.92 71.73 45 296.60
16 151.139 47.93 221.44 217.39 65 189.14
17 220.630 42.91 222.74 221.73 25 186.08
18 131.666 66.60 228.90 114.40 25 285.80
19 80.537 64.94 231.19 113.52 65 286.34
20 152.966 43.18 236.84 167.77 45 221.72
;
title2 ’COLLINEARITY OUTPUT’;
proc reg data = t01a;
model y1 y2 = x1-x4 / ss1 ss2 vif tol collin collinoint; run;

title2 ’Ridge Plot’;


proc reg data = t01a outest=b outvif ridge = 0 to 0.02 by .002;
model y1 = x1 - x4 / noprint; run;
plot /ridgeplot nomodel legend=legend2 nostat vref=0 lvref=1 cvref=blue; run;
proc reg data = t01a outest=b outvif ridge = 0 to 0.02 by .002;
model y2 = x1 - x4 / noprint; run;
plot /ridgeplot nomodel legend=legend2 nostat vref=0 lvref=1 cvref=blue; run;
The SAS output is;
Heat Transfer Data Meyers page 200
COLLINEARITY OUTPUT

The REG Procedure


Model: MODEL1
Dependent Variable: Y1

Analysis of Variance

Sum of Mean
Source DF Squares Square F Value Pr > F

Model 4 42264 10566 23.36 <.0001


Error 15 6784.90640 452.32709
Corrected Total 19 49049

Root MSE 21.26798 R-Square 0.8617


Dependent Mean 116.52440 Adj R-Sq 0.8248

Coeff Var 18.25196

Parameter Estimates

Parameter Standard
Variable DF Estimate Error t Value Pr > |t| Type I SS

Intercept 1 -103.27724 202.58848 -0.51 0.6176 271559


X1 1 0.58035 0.10867 5.34 <.0001 10585
X2 1 0.83936 0.46067 1.82 0.0884 21217
X3 1 -1.31676 0.27397 -4.81 0.0002 10425
X4 1 0.16032 0.55290 0.29 0.7758 38.03088

Parameter Estimates

Variance
Variable DF Type II SS Inflation

Intercept 1 117.55264 0
X1 1 12900 1.01124
X2 1 1501.68695 19.75531
X3 1 10449 1.00062
X4 1 38.03088 19.80063

Collinearity Diagnostics

Condition
Number Eigenvalue Index

1 4.74606 1.00000
2 0.11521 6.41820
3 0.09539 7.05362
4 0.04297 10.50897
5 0.00035682 115.33036

Collinearity Diagnostics

----------------------Proportion of Variation----------------------
Number Intercept X1 X2 X3 X4

1 0.00002426 0.00265 0.00014144 0.00543 0.00005967


2 0.00012713 0.06106 0.00223 0.90947 0.00009501
3 0.00000421 0.05178 0.02218 0.00858 0.00433
4 0.00150 0.88451 0.00063098 0.07626 0.00790
5 0.99834 0.00000440 0.97482 0.00024977 0.98761

Collinearity Diagnostics(intercept adjusted)

Condition
Number Eigenvalue Index

1 1.99079 1.00000
2 1.00202 1.40953
3 0.98158 1.42413
4 0.02561 8.81690

Collinearity Diagnostics(intercept adjusted)

-----------------Proportion of Variation----------------
Number X1 X2 X3 X4

1 0.00806 0.01249 0.00006308 0.01249


2 0.10212 0.00002266 0.89322 0.00002468
3 0.88676 0.00046155 0.10617 0.00033748
4 0.00305 0.98702 0.00054841 0.98715

COLLINEARITY OUTPUT

The REG Procedure


Model: MODEL1
Dependent Variable: Y2

Analysis of Variance

Sum of Mean
Source DF Squares Square F Value Pr > F

Model 4 635.73602 158.93401 3.12 0.0469


Error 15 763.77428 50.91829
Corrected Total 19 1399.51030

Root MSE 7.13570 R-Square 0.4543


Dependent Mean 52.24500 Adj R-Sq 0.3087
Coeff Var 13.65816

Parameter Estimates

Parameter Standard
Variable DF Estimate Error t Value Pr > |t| Type I SS

Intercept 1 -175.32443 67.97126 -2.58 0.0209 54591

X1 1 0.03502 0.03646 0.96 0.3520 72.79590
X2 1 0.48264 0.15456 3.12 0.0070 3.82753
X3 1 0.04319 0.09192 0.47 0.6452 15.25565
X4 1 0.60627 0.18551 3.27 0.0052 543.85694

Parameter Estimates

Variance
Variable DF Type II SS Inflation

Intercept 1 338.77239 0
X1 1 46.98637 1.01124
X2 1 496.50004 19.75531
X3 1 11.24086 1.00062
X4 1 543.85694 19.80063

Collinearity Diagnostics

Condition
Number Eigenvalue Index

1 4.74606 1.00000
2 0.11521 6.41820
3 0.09539 7.05362
4 0.04297 10.50897
5 0.00035682 115.33036

Collinearity Diagnostics

----------------------Proportion of Variation----------------------
Number Intercept X1 X2 X3 X4

1 0.00002426 0.00265 0.00014144 0.00543 0.00005967


2 0.00012713 0.06106 0.00223 0.90947 0.00009501
3 0.00000421 0.05178 0.02218 0.00858 0.00433
4 0.00150 0.88451 0.00063098 0.07626 0.00790
5 0.99834 0.00000440 0.97482 0.00024977 0.98761

Collinearity Diagnostics(intercept adjusted)

Condition
Number Eigenvalue Index

1 1.99079 1.00000
2 1.00202 1.40953
3 0.98158 1.42413
4 0.02561 8.81690

Collinearity Diagnostics(intercept adjusted)

-----------------Proportion of Variation----------------
Number X1 X2 X3 X4

1 0.00806 0.01249 0.00006308 0.01249


2 0.10212 0.00002266 0.89322 0.00002468
3 0.88676 0.00046155 0.10617 0.00033748
4 0.00305 0.98702 0.00054841 0.98715
The SAS Ridge plots are;

Chapter 10

Use of Dummy Variables in


Regression Models

The material in this chapter is similar to that given in Draper and Smith chapter 14. I will not include any discussion of why and how we use dummy variables. Rather, I have chosen to consider the example that they give in their chapter 14 and to illustrate how one can use PROC GLM in SAS to reproduce the results without having to define these dummy variables. In example 1 on page 302, the aim is to determine whether or not there are differences among states (y-intercepts) for the regression model of weight (y) as a function of age (x). This model is then expanded to determine whether the states have differing slopes as well.
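To make the coding explicit before looking at the SAS runs: in the data step below, z1 = 1 for state G and 0 otherwise, and z2 = 1 for state V and 0 otherwise, so state W serves as the baseline. The common-slope model fit by the second PROC REG run is then
$$y = \beta_0 + \beta_1 x + \gamma_1 z_1 + \gamma_2 z_2 + \epsilon,$$
so the fitted intercept is $\beta_0 + \gamma_1$ for state G, $\beta_0 + \gamma_2$ for state V, and $\beta_0$ for state W, and equal intercepts correspond to $\gamma_1 = \gamma_2 = 0$. Allowing different slopes adds the products $xz_1$ and $xz_2$ to the model, which is what the x*state term accomplishes in the second PROC GLM run.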

10.0.6 Turkey Weight Example


The SAS code for the example is;
options center nodate pagesize=100 ls=80;
symbol1 i=none color=red v=star ;
symbol2 i=none v=square color=blue;
symbol3 i=none color=green v=diamond;
title1 ’Example 1 page 303 Using Dummy Variables’;
data t14_1;
input obs x z1 z2 y state $;
cards;
1 28 1 0 13.3 G
2 20 1 0 8.9 G
3 32 1 0 15.1 G
4 22 1 0 10.4 G
5 29 0 1 13.1 V
6 27 0 1 12.4 V
7 28 0 1 13.2 V
8 26 0 1 11.8 V
9 21 0 0 11.5 W
10 27 0 0 14.2 W
11 29 0 0 15.4 W
12 23 0 0 13.1 W
13 25 0 0 13.8 W
;

title2 ’Least Squares Fit to the Data’;
proc sort; by x;
proc gplot;plot y*x=state;run;
proc reg graphics;
model y = x;
plot y*x;
run;
proc reg ;
model y = x z1 z2;
run;
title2 ’Proc GLM Using Dummy varaiable for State’;
proc glm;
class state;
model y = x state / E solution;
contrast ’Test for G - V’
state 1 -1 0;
contrast ’Test for V - W’
state 0 1 -1;
contrast ’Test for G - W’
state 1 0 -1;
run;
proc glm;
class state;
model y = x state x*state / e solution;
contrast ’Test for G - V y-int’
state 1 -1 0;
contrast ’Test for V - W y-int’
state 0 1 -1;
contrast ’Test for G - W y-int’
state 1 0 -1;
contrast ’Test for G - V slope’
state*x 1 -1 0;
contrast ’Test for V - W slope’
state*x 0 1 -1;
contrast ’Test for G - W slope’
state*x 1 0 -1;
run;
The SAS output is;
Example 1 page 303 Using Dummy Variables
Least Squares Fit to the Data

The REG Procedure


Model: MODEL1
Dependent Variable: y

Analysis of Variance

Sum of Mean
Source DF Squares Square F Value Pr > F

Model 1 26.20192 26.20192 21.81 0.0007
Error 11 13.21500 1.20136
Corrected Total 12 39.41692

Root MSE 1.09607 R-Square 0.6647


Dependent Mean 12.78462 Adj R-Sq 0.6343
Coeff Var 8.57333

Parameter Estimates

Parameter Standard
Variable DF Estimate Error t Value Pr > |t|

Intercept 1 1.98333 2.33273 0.85 0.4133


x 1 0.41667 0.08922 4.67 0.0007

Using the dummy variables as in Draper and Smith

The REG Procedure


Model: MODEL1
Dependent Variable: y

Analysis of Variance

Sum of Mean
Source DF Squares Square F Value Pr > F

Model 3 38.60575 12.86858 142.78 <.0001


Error 9 0.81118 0.09013
Corrected Total 12 39.41692

Root MSE 0.30022 R-Square 0.9794


Dependent Mean 12.78462 Adj R-Sq 0.9726
Coeff Var 2.34827

Parameter Estimates

Parameter Standard
Variable DF Estimate Error t Value Pr > |t|

Intercept 1 1.43088 0.65744 2.18 0.0575


x 1 0.48676 0.02574 18.91 <.0001
z1 1 -1.91838 0.20180 -9.51 <.0001
z2 1 -2.19191 0.21143 -10.37 <.0001

Using Proc GLM model to determine if states have different y-intercepts

Example 1 page 303 Using Dummy Variables


Proc GLM Using Dummy varaiable for State

General Form of Estimable Functions

Effect Coefficients

Intercept L1

x L2

state G L3
state V L4
state W L1-L3-L4

Dependent Variable: y

Sum of
Source DF Squares Mean Square F Value Pr > F

Model 3 38.60574661 12.86858220 142.78 <.0001

Error 9 0.81117647 0.09013072

Corrected Total 12 39.41692308

R-Square Coeff Var Root MSE y Mean

0.979421 2.348274 0.300218 12.78462

Source DF Type I SS Mean Square F Value Pr > F

x 1 26.20192308 26.20192308 290.71 <.0001


state 2 12.40382353 6.20191176 68.81 <.0001

Source DF Type III SS Mean Square F Value Pr > F

x 1 32.22382353 32.22382353 357.52 <.0001


state 2 12.40382353 6.20191176 68.81 <.0001

Contrast DF Contrast SS Mean Square F Value Pr > F

Test for G - V 1 0.14132353 0.14132353 1.57 0.2421


Test for V - W 1 9.68730759 9.68730759 107.48 <.0001
Test for G - W 1 8.14493012 8.14493012 90.37 <.0001

Standard
Parameter Estimate Error t Value Pr > |t|

Intercept 1.430882353 B 0.65744187 2.18 0.0575


x 0.486764706 0.02574346 18.91 <.0001
state G -1.918382353 B 0.20180313 -9.51 <.0001
state V -2.191911765 B 0.21142578 -10.37 <.0001
state W 0.000000000 B . . .

NOTE: The X’X matrix has been found to be singular, and a generalized inverse
was used to solve the normal equations. Terms whose estimates are
followed by the letter ’B’ are not uniquely estimable.

Using Proc GLM to determine if states have different slopes

General Form of Estimable Functions

Effect Coefficients

Intercept L1

x L2

state G L3
state V L4
state W L1-L3-L4

x*state G L6
x*state V L7
x*state W L2-L6-L7

Dependent Variable: y

Sum of
Source DF Squares Mean Square F Value Pr > F

Model 5 38.71074725 7.74214945 76.74 <.0001

Error 7 0.70617582 0.10088226

Corrected Total 12 39.41692308

R-Square Coeff Var Root MSE y Mean

0.982084 2.484390 0.317620 12.78462

Source DF Type I SS Mean Square F Value Pr > F

x 1 26.20192308 26.20192308 259.73 <.0001


state 2 12.40382353 6.20191176 61.48 <.0001
x*state 2 0.10500065 0.05250032 0.52 0.6156

Source DF Type III SS Mean Square F Value Pr > F

x 1 8.55703372 8.55703372 84.82 <.0001


state 2 0.51494101 0.25747050 2.55 0.1471
x*state 2 0.10500065 0.05250032 0.52 0.6156

Contrast DF Contrast SS Mean Square F Value Pr > F

Test for G - V y-int 1 0.00290257 0.00290257 0.03 0.8701


Test for V - W y-int 1 0.04602196 0.04602196 0.46 0.5211
Test for G - W y-int 1 0.51380881 0.51380881 5.09 0.0586
Test for G - V slope 1 0.00615751 0.00615751 0.06 0.8120
Test for V - W slope 1 0.00277778 0.00277778 0.03 0.8729
Test for G - W slope 1 0.10354173 0.10354173 1.03 0.3447

Standard
Parameter Estimate Error t Value Pr > |t|

Intercept 2.475000000 B 1.26351168 1.96 0.0910


x 0.445000000 B 0.05022008 8.86 <.0001
state G -3.454120879 B 1.53053816 -2.26 0.0586
state V -2.775000000 B 4.10854284 -0.68 0.5211
state W 0.000000000 B . . .
x*state G 0.061043956 B 0.06025490 1.01 0.3447
x*state V 0.025000000 B 0.15066024 0.17 0.8729
x*state W 0.000000000 B . . .

NOTE: The X’X matrix has been found to be singular, and a generalized inverse
was used to solve the normal equations. Terms whose estimates are
followed by the letter ’B’ are not uniquely estimable.
The graphs are;

10.0.7 Harris County Discrimination Example
The following example concerns potential gender discrimination in salaries in Harris County, Texas. The SAS code is;
options center nodate pagesize=100 ls=80;
symbol1 i=none color=red v=star ;
symbol2 i=none v=square color=blue;
symbol3 i=none color=green v=diamond;
title1 ’Harris County Discrimination’;
data harris;
input SALARY EDUCAt EXPER MONTHS Gender @@;
cards;
3900 12 0 1 0 4020 10 44 7 0 4290 12 5 30 0
4380 8 6 7 0 4380 8 8 6 0 4380 12 0 7 0
4380 12 0 10 0 4380 12 5 6 0 4440 15 75 2 0
4500 8 52 3 0 4500 12 8 19 0 4620 12 52 3 0
4800 8 70 20 0 4800 12 6 23 0 4800 12 11 12 0
4800 12 11 17 0 4800 12 63 22 0 4800 12 144 24 0
4800 12 163 12 0 4800 12 228 26 0 4800 12 381 1 0
4800 16 214 15 0 4980 8 318 25 0 5100 8 96 33 0
5100 12 36 15 0 5100 12 59 14 0 5100 15 115 1 0
5100 15 165 4 0 5100 16 123 12 0 5160 12 18 12 0
5220 8 102 29 0 5220 12 127 29 0 5280 8 90 11 0
5280 8 190 1 0 5280 12 107 11 0 5400 8 173 34 0
5400 8 228 33 0 5400 12 26 11 0 5400 12 36 33 0
5400 12 38 22 0 5400 12 82 29 0 5400 12 169 27 0
5400 12 244 1 0 5400 15 24 13 0 5400 15 49 27 0
5400 15 51 21 0 5400 15 122 33 0 5520 12 97 17 0
5520 12 196 32 0 5580 12 133 30 0 5640 12 55 9 0
5700 12 90 23 0 5700 12 117 25 0 5700 15 51 17 0
5700 15 61 11 0 5700 15 241 34 0 6000 12 121 30 0
6000 15 79 13 0 6120 12 209 21 0 6300 12 87 33 0
6300 15 231 15 0 4620 12 12 22 1 5040 15 14 3 1
5100 12 180 15 1 5100 12 315 2 1 5220 12 29 14 1
5400 12 7 21 1 5400 12 38 11 1 5400 12 113 3 1
5400 15 18 8 1 5400 15 359 11 1 5700 15 36 5 1
6000 8 320 21 1 6000 12 24 2 1 6000 12 32 17 1
6000 12 49 8 1 6000 12 56 33 1 6000 12 252 11 1
6000 12 272 19 1 6000 15 25 13 1 6000 15 36 32 1
6000 15 56 12 1 6000 15 64 33 1 6000 15 108 16 1
6000 16 46 3 1 6300 15 72 17 1 6600 15 64 16 1
6600 15 84 33 1 6600 15 216 16 1 6840 15 42 7 1
6900 12 175 10 1 6900 15 132 24 1 8100 16 55 33 1
;
title2 ’Least Squares Fit to the Data’;
proc sort; by exper;
proc gplot;plot salary*exper=gender salary*educat=gender salary*months=gender;run;
proc reg;
model salary = EDUCAT EXPER MONTHS /ss1 ss2;

run;
title2 ’Proc GLM Using Gender’;
proc glm;
class gender;
model salary = educat exper months gender / E solution;
contrast ’Test for gender y-int’
gender 1 -1;
run;
title2 ’Test for different slopes’;
proc glm;
class gender;
model salary = educat exper months gender gender*educat gender*exper gender*months / e solution;
contrast ’Test for gender y-int’
gender 1 -1 ;
contrast ’Test for gender*educat slope’
gender*educat 1 -1;
contrast ’Test for gender*exper slope’
gender*exper 1 -1;
contrast ’Test for gender*months slope’
gender*months 1 -1;
run;
The SAS output is;
Harris County Discrimination
Least Squares Fit to the Data

The REG Procedure


Model: MODEL1
Dependent Variable: SALARY

Analysis of Variance

Sum of Mean
Source DF Squares Square F Value Pr > F

Model 3 13991247 4663749 12.84 <.0001


Error 89 32332043 363281
Corrected Total 92 46323290

Root MSE 602.72829 R-Square 0.3020


Dependent Mean 5420.32258 Adj R-Sq 0.2785
Coeff Var 11.11979

Parameter Estimates

Parameter Standard
Variable DF Estimate Error t Value Pr > |t| Type I SS

Intercept 1 3179.47980 383.42484 8.29 <.0001 2732330410
EDUCAT 1 139.60934 27.71249 5.04 <.0001 7862534
EXPER 1 1.48405 0.69705 2.13 0.0360 2047106
MONTHS 1 20.62908 6.15440 3.35 0.0012 4081607

Parameter Estimates

Variable DF Type II SS

Intercept 1 24980136
EDUCAT 1 9219791
EXPER 1 1646670
MONTHS 1 4081607

Harris County Discrimination


Proc GLM Using Gender

General Form of Estimable Functions

Effect Coefficients

Intercept L1
EDUCAT L2
EXPER L3
MONTHS L4
Gender 0 L5
Gender 1 L1-L5

Dependent Variable: SALARY

Sum of
Source DF Squares Mean Square F Value Pr > F

Model 4 23669749.67 5917437.42 22.99 <.0001

Error 88 22653540.65 257426.60

Corrected Total 92 46323290.32

R-Square Coeff Var Root MSE SALARY Mean

0.510969 9.360554 507.3722 5420.323

Source DF Type I SS Mean Square F Value Pr > F

EDUCAT 1 7862534.292 7862534.292 30.54 <.0001


EXPER 1 2047106.105 2047106.105 7.95 0.0059

MONTHS 1 4081606.609 4081606.609 15.86 0.0001
Gender 1 9678502.666 9678502.666 37.60 <.0001

Source DF Type III SS Mean Square F Value Pr > F

EDUCAT 1 3421726.297 3421726.297 13.29 0.0005


EXPER 1 1204694.106 1204694.106 4.68 0.0332
MONTHS 1 5213102.493 5213102.493 20.25 <.0001
Gender 1 9678502.666 9678502.666 37.60 <.0001

Contrast DF Contrast SS Mean Square F Value Pr > F

Test for gender y-int 1 9678502.666 9678502.666 37.60 <.0001

Standard
Parameter Estimate Error t Value Pr > |t|

Intercept 4248.580543 B 366.8478768 11.58 <.0001


EDUCAT 90.017046 24.6904403 3.65 0.0005
EXPER 1.271567 0.5877971 2.16 0.0332
MONTHS 23.402444 5.2004364 4.50 <.0001
Gender 0 -722.379625 B 117.8116151 -6.13 <.0001
Gender 1 0.000000 B . . .

NOTE: The X’X matrix has been found to be singular, and a generalized inverse
was used to solve the normal equations. Terms whose estimates are
followed by the letter ’B’ are not uniquely estimable.

Harris County Discrimination


Test for different slopes

The GLM Procedure

General Form of Estimable Functions

Effect Coefficients

Intercept L1

EDUCAT L2

EXPER L3

MONTHS L4

Gender 0 L5
Gender 1 L1-L5

EDUCAT*Gender 0 L7
EDUCAT*Gender 1 L2-L7

EXPER*Gender 0 L9
EXPER*Gender 1 L3-L9

MONTHS*Gender 0 L11
MONTHS*Gender 1 L4-L11

Dependent Variable: SALARY

Sum of
Source DF Squares Mean Square F Value Pr > F

Model 7 24323557.46 3474793.92 13.43 <.0001

Error 85 21999732.87 258820.39

Corrected Total 92 46323290.32

R-Square Coeff Var Root MSE SALARY Mean

0.525083 9.385861 508.7439 5420.323

Source DF Type I SS Mean Square F Value Pr > F

EDUCAT 1 7862534.292 7862534.292 30.38 <.0001


EXPER 1 2047106.105 2047106.105 7.91 0.0061
MONTHS 1 4081606.609 4081606.609 15.77 0.0001
Gender 1 9678502.666 9678502.666 37.39 <.0001
EDUCAT*Gender 1 518876.077 518876.077 2.00 0.1605
EXPER*Gender 1 105391.828 105391.828 0.41 0.5251
MONTHS*Gender 1 29539.879 29539.879 0.11 0.7363

Source DF Type III SS Mean Square F Value Pr > F

EDUCAT 1 3319396.518 3319396.518 12.83 0.0006


EXPER 1 1296875.642 1296875.642 5.01 0.0278
MONTHS 1 4277525.807 4277525.807 16.53 0.0001
Gender 1 17350.732 17350.732 0.07 0.7963

EDUCAT*Gender 1 393532.787 393532.787 1.52 0.2209
EXPER*Gender 1 108312.735 108312.735 0.42 0.5194
MONTHS*Gender 1 29539.879 29539.879 0.11 0.7363

Contrast DF Contrast SS Mean Square F Value

Test for gender y-int 1 17350.7315 17350.7315 0.07


Test for gender*educat slope 1 393532.7874 393532.7874 1.52
Test for gender*exper slope 1 108312.7347 108312.7347 0.42
Test for gender*months slope 1 29539.8792 29539.8792 0.11

Contrast Pr > F

Test for gender y-int 0.7963


Test for gender*educat slope 0.2209
Test for gender*exper slope 0.5194
Test for gender*months slope 0.7363

Standard
Parameter Estimate Error t Value Pr > |t|

Intercept 3540.522496 B 753.2653997 4.70 <.0001


EDUCAT 142.711764 B 51.9769544 2.75 0.0074
EXPER 0.978844 B 0.9500077 1.03 0.3058
MONTHS 25.046682 B 9.4439940 2.65 0.0095
Gender 0 217.978347 B 841.8867997 0.26 0.7963
Gender 1 0.000000 B . . .
EDUCAT*Gender 0 -73.105214 B 59.2866799 -1.23 0.2209
EDUCAT*Gender 1 0.000000 B . . .
EXPER*Gender 0 0.795723 B 1.2300455 0.65 0.5194
EXPER*Gender 1 0.000000 B . . .
MONTHS*Gender 0 -3.843432 B 11.3766404 -0.34 0.7363
MONTHS*Gender 1 0.000000 B . . .

NOTE: The X’X matrix has been found to be singular, and a generalized inverse
was used to solve the normal equations. Terms whose estimates are
followed by the letter ’B’ are not uniquely estimable.
The graphs are;

Chapter 11

General Methods of finding


Transformations

SAS PROC TRANSREG is a very powerful procedure for finding transformations for a number of different types of models.
The TRANSREG (transformation regression) procedure fits linear models, optionally with spline and other
nonlinear transformations, and it can be used to code experimental designs prior to their use in other
analyses.
The TRANSREG procedure fits many types of linear models, including
• ordinary regression and ANOVA
• metric and nonmetric conjoint analysis (Green and Wind 1975; de Leeuw, Young, and Takane 1976)
• metric and nonmetric vector and ideal point preference mapping (Carroll 1972)
• simple, multiple, and multivariate regression with variable transformations (Young, de Leeuw, and
Takane 1976; Winsberg and Ramsay 1980; Breiman and Friedman 1985)
• redundancy analysis (Stewart and Love 1968) with variable transformations (Israels 1984)
• canonical correlation analysis with variable transformations (van der Burg and de Leeuw 1983)
• response surface regression (Meyers 1976; Khuri and Cornell 1987) with variable transformations
The data set can contain variables measured on nominal, ordinal, interval, and ratio scales (Siegel 1956).
Any mix of these variable types is allowed for the dependent and independent variables. The TRANSREG
procedure can transform
• nominal variables by scoring the categories to minimize squared error (Fisher 1938), or they can be
expanded into dummy variables
• ordinal variables by monotonically scoring the ordered categories so that order is weakly preserved
(adjacent categories can be merged) and squared error is minimized. Ties can be optimally untied or
left tied (Kruskal 1964). Ordinal variables can also be transformed to ranks.
• interval and ratio scale of measurement variables linearly or nonlinearly with spline (de Boor 1978;
van Rijckevorsel 1982) or monotone spline (Winsberg and Ramsay 1980) transformations. In addition,
smooth, logarithmic, exponential, power, logit, and inverse trigonometric sine transformations are
available.

Transformations produced by the PROC TRANSREG multiple regression algorithm, requesting spline trans-
formations, are often similar to transformations produced by the ACE smooth regression method of Breiman
and Friedman (1985). However, ACE does not explicitly optimize a loss function (de Leeuw 1986), while
PROC TRANSREG always explicitly optimizes a squared-error loss function.
PROC TRANSREG extends the ordinary general linear model by providing optimal variable transfor-
mations that are iteratively derived using the method of alternating least squares (Young 1981). PROC
TRANSREG iterates until convergence, alternating
1. finding least-squares estimates of the parameters of the model given the current scoring of the data
(that is, the current vectors)
2. finding least-squares estimates of the scoring parameters given the current set of model parameters
For more background on alternating least-squares optimal scaling methods and transformation regression
methods, refer to Young, de Leeuw, and Takane (1976), Winsberg and Ramsay (1980), Young (1981), Gifi
(1990), Schiffman, Reynolds, and Young (1981), van der Burg and de Leeuw (1983), Israels (1984), Breiman
and Friedman (1985), and Hastie and Tibshirani (1986). (These are just a few of the many relevant sources.)

Model Statement Usage


MODEL < transform(dependents < / t-options >)
< transform(dependents < / t-options >)...> = >
transform(independents < / t-options >)
< transform(independents < / t-options >)...> < / a-options > ;

Here are some examples of model statements:


• linear regression
– model identity(y) = identity(x);
• a linear model with a nonlinear regression function
– model identity(y) = spline(x / nknots=5);
• multiple regression
– model identity(y) = identity(x1-x5);
• multiple regression with nonlinear transformations

– model spline(y / nknots=3) = spline(x1-x5 / nknots=3);


• multiple regression with nonlinear but monotone transformations
– model mspline(y / nknots=3) = mspline(x1-x5 / nknots=3);
• multivariate multiple regression

– model identity(y1-y4) = identity(x1-x5);


• canonical correlation
– model identity(y1-y4) = identity(x1-x5) / method=canals;
• redundancy analysis

– model identity(y1-y4) = identity(x1-x5) / method=redundancy;
• preference mapping, vector model (Carroll 1972)
– model identity(Attrib1-Attrib3) = identity(Dim1-Dim2);
• preference mapping, ideal point model (Carroll 1972)
– model identity(Attrib1-Attrib3) = point(Dim1-Dim2);
• preference mapping, ideal point model, elliptical (Carroll 1972)
– model identity(Attrib1-Attrib3) = epoint(Dim1-Dim2);
• preference mapping, ideal point model, quadratic (Carroll 1972)
– model identity(Attrib1-Attrib3) = qpoint(Dim1-Dim2);
• metric conjoint analysis
– model identity(Subj1-Subj50) = class(a b c d e f / zero=sum);
• nonmetric conjoint analysis
– model monotone(Subj1-Subj50) = class(a b c d e f / zero=sum);
• main effects, two-way interaction
– model identity(y) = class(a|b);
• less-than-full-rank model – main effects and two-way interaction are constrained to sum to zero
– model identity(y) = class(a|b / zero=sum);
• main effects and all two-way interactions
– model identity(y) = class(a|b|c@2);
• main effects and all two- and three-way interactions
– model identity(y) = class(a|b|c);
• main effects and just B*C two-way interaction
– model identity(y) = class(a b c b*c);
• seven main effects, three two-way interactions
– model identity(y) = class(a b c d e f g a*b a*c a*d);
• deviations-from-means (effects or (1, 0, -1)) coding, with an A reference level of '1' and a B reference level of '2'
– model identity(y) = class(a|b / deviations zero='1' '2');
• cell-means coding (implicit intercept)
– model identity(y) = class(a*b / zero=none);
• reference cell model
– model identity(y) = class(a|b / zero='1' '1');
• reference line with change in line parameters
– model identity(y) = class(a) | identity(x);
• reference curve with change in curve parameters
– model identity(y) = class(a) | spline(x);
• separate curves and intercepts
– model identity(y) = class(a / zero=none) | spline(x);
• quantitative effects with interaction
– model identity(y) = identity(x1 | x2);
• separate quantitative effects with interaction within each cell
– model identity(y) = class(a * b / zero=none) | identity(x1 | x2);

11.1 Solving Standard Least-Squares Problems


This section illustrates how to solve some ordinary least-squares problems and generalizations of those
problems by formulating them as transformation regression problems. One problem involves finding linear
and nonlinear regression functions in a scatter plot. The next problem involves simultaneously fitting two
lines or curves through a scatter plot. The last problem involves finding the overall fit of a multi-way
main-effects and interactions analysis-of-variance model.

11.1.1 Nonlinear Regression Functions


This example uses PROC TRANSREG in simple regression to find the optimal regression line, a nonlinear
but monotone regression function, and a nonlinear nonmonotone regression function. A regression line can
be found by specifying
proc transreg;
model identity(y) = identity(x);
output predicted;
run;
A monotone regression function (in this case, a monotonically decreasing regression function, since the
correlation coefficient is negative) can be found by requesting an MSPLINE transformation of the independent
variable, as follows.
proc transreg;
model identity(y) = mspline(x / nknots=9);
output predicted;
run;
The monotonicity restriction can be relaxed by requesting a SPLINE transformation of the independent
variable, as shown below.
proc transreg;
model identity(y) = spline(x / nknots=9);
output predicted;
run;

11.1.2 Example 65.4: Transformation Regression of Exhaust Emissions Data
In this example, the MORALS algorithm is applied to data from an experiment in which nitrogen oxide
emissions from a single cylinder engine are measured for various combinations of fuel, compression ratio, and
equivalence ratio. The data are provided by Brinkman (1981).
The equivalence ratio and nitrogen oxide variables are continuous and numeric, so spline transformations
of these variables are requested. Each spline is degree three with nine knots (one at each decile) in order
to allow PROC TRANSREG a great deal of freedom in finding transformations. The compression ratio
variable has only five discrete values, so an optimal scoring is requested. The character variable Fuel is
nominal, so it is designated as a classification variable. No monotonicity constraints are placed on any of
the transformations. Observations with missing values are excluded with the NOMISS a-option.
The squared multiple correlation for the initial model is less than 0.25. PROC TRANSREG increases
the R2 to over 0.95 by transforming the variables. The transformation plots show how each variable is
transformed. The transformation of compression ratio (TCpRatio) is nearly linear. The transformation of
equivalence ratio (TEqRatio) is nearly parabolic. It can be seen from this plot that the optimal transfor-
mation of equivalence ratio is nearly uncorrelated with the original scoring. This suggests that the large
increase in R2 is due to this transformation. The transformation of nitrogen oxide (TNOx) is something like
a log transformation.
These results suggest the parametric model

$$\log(NO_x) = \beta_0 + \beta_1\,EqRatio + \beta_2\,EqRatio^2 + \beta_3\,CpRatio + \sum_j \beta_{(j)}\,Fuel_{(j)} + \mathrm{error}.$$

You can perform this analysis with PROC TRANSREG using the following MODEL statement:
model log(NOx)= psp(EqRatio / deg=2) identity(CpRatio)
class(Fuel / zero=first);
The LOG transformation computes the natural log. The PSPLINE expansion expands EqRatio into a linear
term, EqRatio, and a squared term, EqRatio2. A linear transformation of CpRatio and a dummy variable
expansion of Fuel is requested with the first level as the reference level. These should provide a good
parametric operationalization of the optimal transformations. The final model has an R2 of 0.91 (smaller
than before since the model uses fewer degrees of freedom, but still quite good).
The following statements produce Output 65.4.1 through Output 65.4.3:
title ’Gasoline Example’;

data Gas;
input Fuel :$8. CpRatio EqRatio NOx @@;
label Fuel = ’Fuel’
CpRatio = ’Compression Ratio (CR)’
EqRatio = ’Equivalence Ratio (PHI)’
NOx = ’Nitrogen Oxide (NOx)’;
datalines;
Ethanol 12.0 0.907 3.741 Ethanol 12.0 0.761 2.295
Ethanol 12.0 1.108 1.498 Ethanol 12.0 1.016 2.881
Ethanol 12.0 1.189 0.760 Ethanol 9.0 1.001 3.120
Ethanol 9.0 1.231 0.638 Ethanol 9.0 1.123 1.170
Ethanol 12.0 1.042 2.358 Ethanol 12.0 1.215 0.606
Ethanol 12.0 0.930 3.669 Ethanol 12.0 1.152 1.000
Ethanol 15.0 1.138 0.981 Ethanol 18.0 0.601 1.192
Ethanol 7.5 0.696 0.926 Ethanol 12.0 0.686 1.590

Ethanol 12.0 1.072 1.806 Ethanol 15.0 1.074 1.962
Ethanol 15.0 0.934 4.028 Ethanol 9.0 0.808 3.148
Ethanol 9.0 1.071 1.836 Ethanol 7.5 1.009 2.845
Ethanol 7.5 1.142 1.013 Ethanol 18.0 1.229 0.414
Ethanol 18.0 1.175 0.812 Ethanol 15.0 0.568 0.374
Ethanol 15.0 0.977 3.623 Ethanol 7.5 0.767 1.869
Ethanol 7.5 1.006 2.836 Ethanol 9.0 0.893 3.567
Ethanol 15.0 1.152 0.866 Ethanol 15.0 0.693 1.369
Ethanol 15.0 1.232 0.542 Ethanol 15.0 1.036 2.739
Ethanol 15.0 1.125 1.200 Ethanol 9.0 1.081 1.719
Ethanol 9.0 0.868 3.423 Ethanol 7.5 0.762 1.634
Ethanol 7.5 1.144 1.021 Ethanol 7.5 1.045 2.157
Ethanol 18.0 0.797 3.361 Ethanol 18.0 1.115 1.390
Ethanol 18.0 1.070 1.947 Ethanol 18.0 1.219 0.962
Ethanol 9.0 0.637 0.571 Ethanol 9.0 0.733 2.219
Ethanol 9.0 0.715 1.419 Ethanol 9.0 0.872 3.519
Ethanol 7.5 0.765 1.732 Ethanol 7.5 0.878 3.206
Ethanol 7.5 0.811 2.471 Ethanol 15.0 0.676 1.777
Ethanol 18.0 1.045 2.571 Ethanol 18.0 0.968 3.952
Ethanol 15.0 0.846 3.931 Ethanol 15.0 0.684 1.587
Ethanol 7.5 0.729 1.397 Ethanol 7.5 0.911 3.536
Ethanol 7.5 0.808 2.202 Ethanol 7.5 1.168 0.756
Indolene 7.5 0.831 4.818 Indolene 7.5 1.045 2.849
Indolene 7.5 1.021 3.275 Indolene 7.5 0.970 4.691
Indolene 7.5 0.825 4.255 Indolene 7.5 0.891 5.064
Indolene 7.5 0.710 2.118 Indolene 7.5 0.801 4.602
Indolene 7.5 1.074 2.286 Indolene 7.5 1.148 0.970
Indolene 7.5 1.000 3.965 Indolene 7.5 0.928 5.344
Indolene 7.5 0.767 3.834 Ethanol 7.5 0.749 1.620
Ethanol 7.5 0.892 3.656 Ethanol 7.5 1.002 2.964
82rongas 7.5 0.873 6.021 82rongas 7.5 0.987 4.467
82rongas 7.5 1.030 3.046 82rongas 7.5 1.101 1.596
82rongas 7.5 1.173 0.835 82rongas 7.5 0.931 5.498
82rongas 7.5 0.822 5.470 82rongas 7.5 0.749 4.084
82rongas 7.5 0.625 0.716 94%Eth 7.5 0.818 2.382
94%Eth 7.5 1.128 1.004 94%Eth 7.5 1.191 0.623
94%Eth 7.5 1.132 1.030 94%Eth 7.5 0.993 2.593
94%Eth 7.5 0.866 2.699 94%Eth 7.5 0.910 3.177
94%Eth 12.0 1.139 1.151 94%Eth 12.0 1.267 0.474
94%Eth 12.0 1.017 2.814 94%Eth 12.0 0.954 3.308
94%Eth 12.0 0.861 3.031 94%Eth 12.0 1.034 2.537
94%Eth 12.0 0.781 2.403 94%Eth 12.0 1.058 2.412
94%Eth 12.0 0.884 2.452 94%Eth 12.0 0.766 1.857
94%Eth 7.5 1.193 0.657 94%Eth 7.5 0.885 2.969
94%Eth 7.5 0.915 2.670 Ethanol 18.0 0.812 3.760
Ethanol 18.0 1.230 0.672 Ethanol 18.0 0.804 3.677
Ethanol 18.0 0.712 . Ethanol 12.0 0.813 3.517
Ethanol 12.0 1.002 3.290 Ethanol 9.0 0.696 1.139
Ethanol 9.0 1.199 0.727 Ethanol 9.0 1.030 2.581

128
Ethanol 15.0 0.602 0.923 Ethanol 15.0 0.694 1.527
Ethanol 15.0 0.816 3.388 Ethanol 15.0 0.896 .
Ethanol 15.0 1.037 2.085 Ethanol 15.0 1.181 0.966
Ethanol 7.5 0.899 3.488 Ethanol 7.5 1.227 0.754
Indolene 7.5 0.701 1.990 Indolene 7.5 0.807 5.199
Indolene 7.5 0.902 5.283 Indolene 7.5 0.997 3.752
Indolene 7.5 1.224 0.537 Indolene 7.5 1.089 1.640
Ethanol 9.0 1.180 0.797 Ethanol 7.5 0.795 2.064
Ethanol 18.0 0.990 3.732 Ethanol 18.0 1.201 0.586
Methanol 7.5 0.975 2.941 Methanol 7.5 1.089 1.467
Methanol 7.5 1.150 0.934 Methanol 7.5 1.212 0.722
Methanol 7.5 0.859 2.397 Methanol 7.5 0.751 1.461
Methanol 7.5 0.720 1.235 Methanol 7.5 1.090 1.347
Methanol 7.5 0.616 0.344 Gasohol 7.5 0.712 2.209
Gasohol 7.5 0.771 4.497 Gasohol 7.5 0.959 4.958
Gasohol 7.5 1.042 2.723 Gasohol 7.5 1.125 1.244
Gasohol 7.5 1.097 1.562 Gasohol 7.5 0.984 4.468
Gasohol 7.5 0.928 5.307 Gasohol 7.5 0.889 5.425
Gasohol 7.5 0.827 5.330 Gasohol 7.5 0.674 1.448
Gasohol 7.5 1.031 3.164 Methanol 7.5 0.871 3.113
Methanol 7.5 1.026 2.551 Methanol 7.5 0.598 0.204
Indolene 7.5 0.973 5.055 Indolene 7.5 0.980 4.937
Indolene 7.5 0.665 1.561 Ethanol 7.5 0.629 0.561
Ethanol 9.0 0.608 0.563 Ethanol 12.0 0.584 0.678
Ethanol 15.0 0.562 0.370 Ethanol 18.0 0.535 0.530
94%Eth 7.5 0.674 0.900 Gasohol 7.5 0.645 1.207
Ethanol 18.0 0.655 1.900 94%Eth 7.5 1.022 2.787
94%Eth 7.5 0.790 2.645 94%Eth 7.5 0.720 1.475
94%Eth 7.5 1.075 2.147
;

*---Fit the Nonparametric Model---;


proc transreg data=Gas dummy test nomiss;
model spline(NOx / nknots=9)=spline(EqRatio / nknots=9)
opscore(CpRatio) class(Fuel / zero=first);
title2 ’Iteratively Estimate NOx, CPRATIO and EQRATIO’;
output out=Results;
run;

proc gplot data=Results;


title;
axis1 minor=none label=(angle=90 rotate=0);
axis2 minor=none;
symbol1 color=blue v=dot i=none;
plot TCpRatio*CpRatio / &opts name=’tregex1’;
plot TEqRatio*EqRatio / &opts name=’tregex2’;
plot TNOx*NOx / &opts name=’tregex3’;
run;

*-Fit the Parametric Model Suggested by the Nonparametric Analysis-;
proc transreg data=Gas dummy ss2 short nomiss;
title ’Gasoline Example’;
title2 ’Now fit log(NOx) = b0 + b1*EqRatio + b2*EqRatio**2 +’;
title3 ’b3*CpRatio + Sum b(j)*Fuel(j) + Error’;
model log(NOx)= pspline(EqRatio / deg=2) identity(CpRatio)
class(Fuel / zero=first);
output out=Results2;
run;
The SAS output is;
Gasoline Example
Iteratively Estimate NOx, CPRATIO and EQRATIO

The TRANSREG Procedure

TRANSREG MORALS Algorithm Iteration History for Spline(NOx)

Iteration Average Maximum Criterion


Number Change Change R-Square Change Note

0 0.48074 3.86778 0.24597


1 0.00000 0.00000 0.95865 0.71267 Converged

Algorithm converged.

The TRANSREG Procedure Hypothesis Tests for Spline(NOx)


Nitrogen Oxide (NOx)

Univariate ANOVA Table Based on the Usual Degrees of Freedom

Sum of Mean
Source DF Squares Square F Value Liberal p

Model 21 326.0946 15.52831 162.27 >= <.0001


Error 147 14.0674 0.09570
Corrected Total 168 340.1619

The above statistics are not adjusted for the fact that the dependent
variable was transformed and so are generally liberal.

Root MSE 0.30935 R-Square 0.9586


Dependent Mean 2.34593 Adj R-Sq 0.9527
Coeff Var 13.18661

Gasoline Example
Iteratively Estimate NOx, CPRATIO and EQRATIO

The TRANSREG Procedure

Adjusted Multivariate ANOVA Table Based on the Usual Degrees of Freedom

Dependent Variable Scoring Parameters=12 S=12 M=4 N=67

Statistic Value F Value Num DF Den DF p

Wilks’ Lambda 0.041355 2.05 252 1455 <= <.0001


Pillai’s Trace 0.958645 0.61 252 1764 <= 1.0000
Hotelling-Lawley Trace 23.18089 12.35 252 945.01 <= <.0001
Roy’s Greatest Root 23.18089 162.27 21 147 >= <.0001

The Wilks’ Lambda, Pillai’s Trace, and Hotelling-Lawley Trace statistics are a conservative
adjustment of the normal statistics. Roy’s Greatest Root is liberal. These statistics are
normally defined in terms of the squared canonical correlations which are the eigenvalues of the
matrix H*inv(H+E). Here the R-Square is used for the first eigenvalue and all other eigenvalues
are set to zero since only one linear combination is used. Degrees of freedom are computed
assuming all linear combinations contribute to the Lambda and Trace statistics, so the F tests
for those statistics are conservative. The p values for the liberal and conservative statistics
provide approximate lower and upper bounds on p. A liberal test statistic with conservative
degrees of freedom and a conservative test statistic with liberal degrees of freedom yield at
best an approximate p value, which is indicated by a "~" before the p value.

Gasoline Example
Now fit log(NOx) = b0 + b1*EqRatio + b2*EqRatio**2 +
b3*CpRatio + Sum b(j)*Fuel(j) + Error

The TRANSREG Procedure

Log(NOx)
Algorithm converged.

The TRANSREG Procedure Hypothesis Tests for Log(NOx)


Nitrogen Oxide (NOx)

Univariate ANOVA Table Based on the Usual Degrees of Freedom

Sum of Mean
Source DF Squares Square F Value Pr > F

Model 8 79.33838 9.917298 213.09 <.0001


Error 160 7.44659 0.046541
Corrected Total 168 86.78498

Root MSE 0.21573 R-Square 0.9142
Dependent Mean 0.63130 Adj R-Sq 0.9099
Coeff Var 34.17294

Gasoline Example
Now fit log(NOx) = b0 + b1*EqRatio + b2*EqRatio**2 +
b3*CpRatio + Sum b(j)*Fuel(j) + Error

The TRANSREG Procedure

Univariate Regression Table Based on the Usual Degrees of Freedom

Type II
Sum of Mean
Variable DF Coefficient Squares Square F Value Pr > F Label

Intercept 1 -14.586532 49.9469 49.9469 1073.18 <.0001 Intercept


Pspline.EqRatio_1 1 35.102914 62.7478 62.7478 1348.22 <.0001 Equivalence Ratio (PHI) 1
Pspline.EqRatio_2 1 -19.386468 64.6430 64.6430 1388.94 <.0001 Equivalence Ratio (PHI) 2
Identity(CpRatio) 1 0.032058 1.4445 1.4445 31.04 <.0001 Compression Ratio (CR)
Class.Fuel94_Eth 1 -0.449583 1.3158 1.3158 28.27 <.0001 Fuel 94%Eth
Class.FuelEthanol 1 -0.414242 1.2560 1.2560 26.99 <.0001 Fuel Ethanol
Class.FuelGasohol 1 -0.016719 0.0015 0.0015 0.03 0.8584 Fuel Gasohol
Class.FuelIndolene 1 0.001572 0.0000 0.0000 0.00 0.9853 Fuel Indolene
Class.FuelMethanol 1 -0.580133 1.7219 1.7219 37.00 <.0001 Fuel Methanol
The SAS transformation plots are;

Chapter 12

Principal Component Regression

In the previous chapters we have addressed the problem of having an unstable model with either:
1. Model selection using one of the stepwise procedures.
2. Transformations such as the Box-Cox transformation or those produced by PROC TRANSREG.
In this chapter we want to consider another approach, whereby we use a subset model in which the new variables are linear combinations of the original variables. The approach is called principal component regression. Before considering an example using this method it is necessary to review some matrix algebra concerning eigenvalues and eigenvectors.

12.1 Eigenvalues and Eigenvectors


Suppose that $A$ is an $n \times n$ matrix. Is there a transformation of vectors $\vec{x}$ that transforms $A$ into a constant multiple of itself? That is, does there exist a vector $\vec{x}$ satisfying
$$A\vec{x} = \lambda\vec{x}$$
for some constant $\lambda$? If so, then
$$(A - \lambda I_n)\vec{x} = 0$$
for some $\vec{x} \neq 0$, and it follows that
$$|A - \lambda I_n| = 0,$$
or
$$\sum_{i=0}^{n} a_i \lambda^i = 0.$$
This last equation is called the characteristic equation of the $n \times n$ matrix $A$. The matrix $A$ has possibly $n$ roots or solutions for the value $\lambda$ in the characteristic equation. These solutions are called the characteristic values or roots, or the eigenvalues, of the matrix $A$. Suppose that $\lambda_1$ is a solution and
$$A\vec{x}_1 = \lambda_1 \vec{x}_1, \qquad \vec{x}_1 \neq 0;$$
then $\vec{x}_1$ is said to be a characteristic vector, or eigenvector, of $A$ corresponding to the eigenvalue $\lambda_1$. Note: the eigenvalues may or may not be real numbers.

12.1.1 Properties of Eigenvalues–Eigenvectors
1. The $n \times n$ matrix $A$ has at least one eigenvalue equal to zero if and only if $A$ is singular.
2. The matrices $A$, $C^{-1}AC$, and $CAC^{-1}$ have the same set of eigenvalues for any nonsingular matrix $C$.
3. The matrices $A$ and $A'$ have the same set of eigenvalues but need not have the same eigenvectors.
4. Let $A$ be a nonsingular matrix with eigenvalue $\lambda$; then $1/\lambda$ is an eigenvalue of $A^{-1}$.
5. The eigenvectors are not unique, for if $\vec{x}_1$ is an eigenvector corresponding to $\lambda_1$ then $c\vec{x}_1$ is also an eigenvector, since $A(c\vec{x}_1) = \lambda_1 (c\vec{x}_1)$ for any nonzero value $c$.
6. Let $A$ be an $n \times n$ real matrix; then there exists a nonsingular, complex matrix $Q$ such that $Q'AQ = T$, where $T$ is a complex, upper triangular matrix, and the eigenvalues of $A$ are the diagonal elements of the matrix $T$.
7. Suppose that $A$ is a symmetric matrix:
(a) Then the eigenvalues of $A$ are real numbers.
(b) For each eigenvalue there exists a real eigenvector (each element is a real number).
(c) Let $\lambda_1$ and $\lambda_2$ be distinct eigenvalues of $A$ with corresponding eigenvectors $\vec{x}_1$ and $\vec{x}_2$; then $\vec{x}_1$ and $\vec{x}_2$ are orthogonal vectors, that is, $\vec{x}_1'\vec{x}_2 = 0$.
(d) There exists an orthogonal matrix $P$ ($P'P = PP' = I_n$) such that $P'AP = D$, where $D$ is a diagonal matrix whose diagonal elements are the eigenvalues of the matrix $A$.
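Property 7(d), the spectral decomposition, is the fact used in the remainder of the chapter. The small PROC IML sketch below verifies it numerically for an arbitrary symmetric matrix chosen only for illustration:

proc iml;
   A = {4 1 0,
        1 3 1,
        0 1 2};                   /* an arbitrary symmetric matrix                  */
   call eigen(lambda, P, A);      /* lambda: eigenvalues; columns of P: eigenvectors */
   D = diag(lambda);
   check1 = P`*P;                 /* should be the 3 x 3 identity (P is orthogonal)  */
   check2 = P`*A*P;               /* should equal D = diag(lambda)                   */
   print lambda, P, check1, check2;
quit;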

12.2 Principal Components


The basic idea behind principal components is to explain the variance-covariance structure of a large number of variables through a few linear combinations of these original variables. The method allows for dimension reduction and for interpretation. (Note that this topic is separate from the regression problem in which we will use the method.)
Suppose that the matrix
$$X = (\vec{x}_1, \vec{x}_2, \ldots, \vec{x}_p)$$
is $n \times p$ with $\mathrm{cov}(X) = \Sigma = (\sigma_{ij})_{i,j=1,2,\ldots,p}$. The idea is to find a new set of vectors $\vec{y}_1, \vec{y}_2, \ldots, \vec{y}_p$ where
$$\vec{y}_i = \sum_{j=1}^{p} l_{ij} \vec{x}_j,$$
with $\mathrm{var}(\vec{y}_i) = \vec{l}_i\,'\Sigma\vec{l}_i$, $\mathrm{cov}(\vec{y}_i, \vec{y}_k) = \vec{l}_i\,'\Sigma\vec{l}_k = 0$, and $\mathrm{var}(\vec{y}_1) \geq \mathrm{var}(\vec{y}_2) \geq \ldots \geq \mathrm{var}(\vec{y}_p)$, where $\vec{l}_i\,' = (l_{1i}, l_{2i}, \ldots, l_{pi})$.
This problem has the following solution (a small PROC IML sketch follows the list):
1. Suppose that the matrix $\Sigma$ has real eigenvalue-eigenvector pairs $(\lambda_i, \vec{e}_i)$ where $\lambda_1 \geq \lambda_2 \geq \ldots \geq \lambda_p \geq 0$; then the $i$th principal component is given by
$$\vec{y}_i = X\vec{e}_i = e_{i1}\vec{x}_1 + e_{i2}\vec{x}_2 + \ldots + e_{ip}\vec{x}_p,$$
and $\mathrm{var}(\vec{y}_i) = \lambda_i$ for $i = 1, 2, \ldots, p$, with $\mathrm{cov}(\vec{y}_i, \vec{y}_k) = \vec{e}_i\,'\Sigma\vec{e}_k = 0$ for $i \neq k$. Note that the eigenvalues $\lambda_i$ are unique; however, the eigenvectors (and hence the vectors $\vec{y}_i$) are not.
2. The total variance for the $p$ dimensions is $\mathrm{tr}[\Sigma] = \sum_{i=1}^{p} \lambda_i$. Hence, the proportion of variance explained by the $k$th principal component is $\lambda_k / \sum_{i=1}^{p} \lambda_i$.
3. If the matrix $X$ is centered and scaled so that $\Sigma$ is the correlation matrix, then $\sum_{i=1}^{p} \lambda_i = p$.
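The solution above can be computed directly from an eigenanalysis of the correlation matrix. The PROC IML sketch below does this for the FOC sales data of the next example (the data set name foc_sales and the variable list are taken from that example); PROC PRINCOMP, used there, automates the same computation and should report the same eigenvalues and eigenvectors.

proc iml;
   use foc_sales;
   read all var {FOV COMPOSITE INDUSTRIAL TRANS UTILITY FINANCE PROD HOUSE} into X;
   close foc_sales;
   n  = nrow(X);
   Xc = X - repeat(X[:,], n, 1);             /* center each column                */
   s  = sqrt((Xc#Xc)[+,] / (n-1));           /* standard deviations               */
   Z  = Xc / repeat(s, n, 1);                /* standardized data                 */
   R  = Z`*Z / (n-1);                        /* correlation matrix                */
   call eigen(lambda, E, R);                 /* eigenvalues and eigenvectors of R */
   prop   = lambda / sum(lambda);            /* proportion of variance explained  */
   scores = Z*E;                             /* principal component scores        */
   print lambda prop, E;
quit;

The columns of scores are the principal components; in principal component regression the first few of these columns replace the original regressors.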

12.2.1 Example – FOC Sales
This example uses the FOC sales data found in Dielman's text. The SAS code is,

title1 ’FOC Sales - Dielman Chapter 8’;


data foc_sales;
input SALES MONTH FOV COMPOSITE INDUSTRIAL TRANS UTILITY FINANCE PROD HOUSE;
cards;
.
.
.
;
proc princomp data=foc_sales;
var FOV COMPOSITE INDUSTRIAL TRANS UTILITY FINANCE PROD HOUSE;run;

and the corresponding output is;


FOC Sales - Dielman Chapter 8
The PRINCOMP Procedure

Observations 265
Variables 8

Simple Statistics

FOV COMPOSITE INDUSTRIAL TRANS

Mean 1069001.796 384.7471698 481.8603774 347.0226415


StD 288773.614 113.6248822 140.1715647 90.2328609

UTILITY FINANCE PROD HOUSE

Mean 270.7320755 340.8150943 72.38867925 331.3584906


StD 66.2953098 120.9088365 32.78401345 45.1613665

Correlation Matrix

FOV COMPOSITE INDUSTRIAL TRANS UTILITY FINANCE PROD HOUSE

FOV 1.0000 0.4480 0.4564 0.4107 0.3752 0.4395 0.4660 0.1227


COMPOSITE 0.4480 1.0000 0.9992 0.9808 0.9516 0.9942 0.9622 0.4756
INDUSTRIAL 0.4564 0.9992 1.0000 0.9798 0.9444 0.9914 0.9638 0.4704
TRANS 0.4107 0.9808 0.9798 1.0000 0.8977 0.9879 0.9166 0.4520
UTILITY 0.3752 0.9516 0.9444 0.8977 1.0000 0.9308 0.9335 0.5214
FINANCE 0.4395 0.9942 0.9914 0.9879 0.9308 1.0000 0.9446 0.4634
PROD 0.4660 0.9622 0.9638 0.9166 0.9335 0.9446 1.0000 0.6018
HOUSE 0.1227 0.4756 0.4704 0.4520 0.5214 0.4634 0.6018 1.0000

Eigenvalues of the Correlation Matrix

Eigenvalue Difference Proportion Cumulative

1 6.29775087 5.40865169 0.7872 0.7872


2 0.88909918 0.24608105 0.1111 0.8984
3 0.64301813 0.53207160 0.0804 0.9787
4 0.11094653 0.06361914 0.0139 0.9926
5 0.04732739 0.04035421 0.0059 0.9985
6 0.00697318 0.00209528 0.0009 0.9994
7 0.00487790 0.00487108 0.0006 1.0000
8 0.00000682 0.0000 1.0000

The PRINCOMP Procedure

Eigenvectors

Prin1 Prin2 Prin3 Prin4 Prin5 Prin6 Prin7 Prin8

FOV 0.196279 0.767804 0.601657 -.008748 0.097951 0.011645 -.012393 0.000243


COMPOSITE 0.395296 0.025306 -.149454 0.040571 -.010696 -.046339 0.400148 -.810394
INDUSTRIAL 0.394798 0.036677 -.144876 0.069547 -.119048 0.180686 0.682217 0.552018
TRANS 0.385564 0.016319 -.201304 0.531083 0.267185 0.527935 -.422252 0.010084
UTILITY 0.380283 -.079879 -.111648 -.770369 0.448507 0.095961 -.166056 0.071625
FINANCE 0.392185 0.029463 -.169716 0.242321 0.118966 -.818914 -.199303 0.182498
PROD 0.389979 -.054709 0.064182 -.202320 -.818674 0.078091 -.351489 -.000664
HOUSE 0.224016 -.630859 0.713282 0.135168 0.138538 -.020693 0.071907 -.000346
From the output one notices that the effective dimension of the original variables is much smaller than 8 (look at
the size of the smaller eigenvalues) and that the first two principal components explain nearly 90% (.8984) of
the variance found in the original variables. In considering the first two principal components, one observes
that the first component is essentially a weighted average of the 8 original variables, with FOV and HOUSE
weighted the least, whereas principal component 2 is essentially FOV - HOUSE.
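To tie this output to item 2 of the previous section: since the correlation matrix is analyzed, Σ_{i=1}^{8} λi = p = 8,
so the proportion of variance explained by the first component is 6.29775/8 ≈ 0.7872 and by the first two components
(6.29775 + 0.88910)/8 ≈ 0.8984, which are exactly the Proportion and Cumulative values printed above.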

In order to consider this problem in the regression setting, consider the following output from the multiple
regression problem.

Regular Linear regression

The REG Procedure


Model: MODEL1
Dependent Variable: SALES

Analysis of Variance

Sum of Mean
Source DF Squares Square F Value Pr > F

Model 8 9695576855 1211947107 103.54 <.0001


Error 256 2996508858 11705113
Corrected Total 264 12692085712

Root MSE 3421.27355 R-Square 0.7639


Dependent Mean 14341 Adj R-Sq 0.7565
Coeff Var 23.85692

Parameter Estimates

Parameter Standard
Variable DF Estimate Error t Value Pr > |t| Type I SS Type II SS

Intercept 1 -5323.91957 3619.28367 -1.47 0.1425 54499514330 25327558


FOV 1 0.00974 0.00089161 10.92 <.0001 6044261874 1395995637
COMPOSITE 1 98.29045 575.20132 0.17 0.8645 3346548065 341789
INDUSTRIAL 1 -48.15829 317.90960 -0.15 0.8797 42506756 268603
TRANS 1 20.28015 22.81317 0.89 0.3749 84967849 9250108
UTILITY 1 -3.60724 88.07391 -0.04 0.9674 24099702 19635
FINANCE 1 -50.48583 123.01252 -0.41 0.6818 192193 1971587
PROD 1 140.40226 41.04110 3.42 0.0007 121262582 136989029
HOUSE 1 -13.19983 8.01618 -1.65 0.1009 31737833 31737833

Collinearity Diagnostics

Condition ---------------Proportion of Variation---------------


Number Eigenvalue Index Intercept FOV COMPOSITE INDUSTRIAL

1 8.78051 1.00000 0.00004159 0.00055610 1.070456E-8 2.240508E-8


2 0.14425 7.80191 0.00714 0.03942 1.869798E-7 3.428155E-7
3 0.04776 13.55846 0.00475 0.71379 2.191234E-8 2.097776E-8
4 0.01751 22.39513 0.00686 0.00686 0.00000154 0.00000295
5 0.00619 37.65174 0.01017 0.00002005 1.750609E-7 9.08316E-10
6 0.00280 56.03381 0.46368 0.18635 6.327443E-8 0.00001451
7 0.00060748 120.22523 0.36015 0.02061 0.00000686 0.00023005
8 0.00037196 153.64327 0.14003 0.02657 0.00035275 0.00192
9 5.44382E-7 4016.13182 0.00719 0.00583 0.99964 0.99783

Collinearity Diagnostics

------------------------Proportion of Variation------------------------

Number TRANS UTILITY FINANCE PROD HOUSE

1 0.00000852 9.452204E-7 2.848753E-7 0.00005154 0.00007696


2 0.00003190 6.459514E-7 0.00001510 0.00722 0.00647
3 0.00013456 0.00004234 9.05031E-10 0.00076847 0.01706
4 0.00522 4.933308E-7 0.00010958 0.09049 0.10321
5 0.01642 0.00759 0.00019542 0.00005213 0.12915
6 0.00081962 0.00409 0.00053764 0.29093 0.34666
7 0.31321 0.00097797 0.02032 0.00196 0.00054098
8 0.50763 0.00883 0.00038489 0.60694 0.39090
9 0.15653 0.97847 0.97844 0.00160 0.00592

Durbin-Watson D 1.821
Number of Observations 265
1st Order Autocorrelation 0.085

From the condition numbers one observes that only 7 variables are needed when including the intercept.
The ridge trace plots confirm the fact that the model is very unstable.
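As a reading of the collinearity table (the formula is standard, though not stated in these notes): the condition indices
reported by PROC REG are the square roots of the ratio of the largest eigenvalue to each eigenvalue, sqrt(λ1 /λk ).
For example, sqrt(8.78051/5.44382 × 10^{-7}) ≈ 4016.1, the value printed for the last component; indices far above
the usual benchmark of about 30 indicate severe collinearity.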

If one considers the model selection solution one has the following:
Regular Linear regression

The REG Procedure


Model: MODEL1
Dependent Variable: SALES

Summary of Stepwise Selection

Variable Variable Number Partial Model


Step Entered Removed Vars In R-Square R-Square C(p) F Value Pr > F

1 PROD 1 0.6252 0.6252 145.369 438.77 <.0001


2 FOV 2 0.1321 0.7574 4.0837 142.69 <.0001
3 HOUSE 3 0.0035 0.7609 2.2454 3.86 0.0504
From here we have three variables, PROD, FOV, and HOUSE, which together explain 76% of the variance in SALES.

12.3 Principal Component Regression


In this section, the regression problem is considered using the method of principal components. SAS
allows for principal component regression using PROC PLS. The above example can be written as:
proc pls data=foc_sales method=pcr cv=split cvtest(seed=12345) details;
output out=outpls xscore=t predicted=yhat yresidual=r press=press;
model SALES = FOV COMPOSITE INDUSTRIAL TRANS UTILITY FINANCE PROD HOUSE /solution;
run;
The output is as follows:
The PLS Procedure

Split-sample Validation for the Number of Extracted Factors

Number of Root
Extracted Mean Prob >
Factors PRESS T**2 T**2

0 1.080098 53.26124 <.0001


1 0.668428 14.95033 <.0001
2 0.585167 4.510997 0.0290
3 0.556582 1.132275 0.3010
4 0.556331 1.815221 0.1860
5 0.547047 0.150288 0.7070
6 0.545371 0 1.0000
7 0.549929 6.395755 0.0120
8 0.551189 7.834948 0.0040

Minimum root mean PRESS 0.5454


Minimizing number of factors 6
Smallest number of factors with p > 0.1 3

Percent Variation Accounted for by Principal Components

Number of
Extracted Model Effects Dependent Variables
Factors Current Total Current Total

1 78.7219 78.7219 62.4695 62.4695
2 11.1137 89.8356 8.8480 71.3175
3 8.0377 97.8734 3.2763 74.5938

Model Effect Loadings

Number of
Extracted
Factors FOV COMPOSITE INDUSTRIAL TRANS UTILITY FINANCE
1 0.196279 0.395296 0.394798 0.385564 0.380283 0.392185
2 0.767804 0.025306 0.036677 0.016319 -0.079879 0.029463
3 0.601657 -0.149454 -0.144876 -0.201304 -0.111648 -0.169716

PROD HOUSE
0.389979 0.224016
-0.054709 -0.630859
0.064182 0.713282

One can now use the principal components as independent variables in a new regression problem.
proc reg data=outpls;
model sales = t1-t3;run;

The output is;


Linear regression with Principle Compnents

The REG Procedure


Model: MODEL1
Dependent Variable: SALES

Analysis of Variance

Sum of Mean
Source DF Squares Square F Value Pr > F

Model 3 9467504650 3155834883 255.44 <.0001


Error 261 3224581062 12354717
Corrected Total 264 12692085712

Root MSE 3514.92770 R-Square 0.7459


Dependent Mean 14341 Adj R-Sq 0.7430
Coeff Var 24.50998

Parameter Estimates

Parameter Standard
Variable DF Estimate Error t Value Pr > |t|

Intercept 1 14341 215.92029 66.42 <.0001


t1 1 89043 3514.92770 25.33 <.0001
t2 1 33511 3514.92770 9.53 <.0001
t3 1 20392 3514.92770 5.80 <.0001
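As a quick arithmetic check (mine, not in the original notes), the R-square printed above is the ratio of the model to
the corrected total sum of squares, 9467504650/12692085712 ≈ 0.746, slightly below the 0.7639 from the full
eight-variable least squares fit; this small loss is the price of replacing the eight collinear predictors with three
orthogonal components.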
When PROC TRANSREG is used to fit these data, one obtains an R-square value of .8295 when each variable
is transformed using monotone splines and an R-square of .8697 using spline functions on each variable.
When the monotone splines were used, only FOV and PROD were significant, whereas only FOV was needed
when using the spline function option.

Chapter 13

Robust Regression

The purpose of this chapter is to introduce some additional regression procedures which are robust, that is, not
highly affected by the presence of outliers in the data. In statistics there are estimators which are called
robust. For example, the sample median is a robust estimator of the population mean, µ, for a symmetric
p.d.f., since it is not affected by outlying observations; the sample mean is not robust in this sense. In this
chapter we want to discuss how these estimation procedures are used in the regression problem.
The methods that we consider here are called M-estimators since they are "maximum likelihood type"
estimators. That is, suppose the errors are independently distributed and all follow the same distribution,
f (ε). Then the maximum likelihood estimator (MLE) of β is given by β̂, which maximizes the quantity

Π_{i=1}^{n} f (yi − xi′ β),

where xi′ is the ith row of X, i = 1, 2, . . . , n, in the model Y = Xβ + ε. Equivalently, the MLE of β maximizes

Σ_{i=1}^{n} ln f (yi − xi′ β).

When the errors are normally distributed, this leads to minimizing the sum of squares function

Σ_{i=1}^{n} (yi − xi′ β)² ,

that is, ordinary least squares. If the p.d.f. is the double exponential, one instead minimizes

Σ_{i=1}^{n} | yi − xi′ β | ,

the least absolute deviations criterion.

The basic idea is to extend this as follows. Suppose ρ(u) is a specified function of u and suppose that
s is an estimate of scale. Define a robust estimator as one that minimizes

Σ_{i=1}^{n} ρ(εi /s) = Σ_{i=1}^{n} ρ( (yi − xi′ β)/s ).

Minimizing this with respect to the parameters βj gives the estimating equations

Σ_{i=1}^{n} xij ψ( (yi − xi′ β)/s ) = 0,

where ψ(u) is the derivative ∂ρ/∂u and xij is the jth entry of xi′ = (1, xi1 , xi2 , . . . , xik ). This equation
will usually not have a closed form solution, so it is necessary to use a procedure called iteratively
reweighted least squares (IRLS). That is, define the weights by

wiβ = ψ[(yi − xi′ β)/s] / [(yi − xi′ β)/s] , i = 1, 2, . . . , n,

in which case we have

Σ_{i=1}^{n} xij wiβ (yi − xi′ β) = 0, j = 1, 2, . . . , k,

or

Σ_{i=1}^{n} xij wiβ yi = Σ_{i=1}^{n} xij wiβ xi′ β.

This can be written in matrix form as

X ′Wβ Xβ = X ′Wβ Y,

where Wβ is a diagonal matrix with elements wiβ . This is the weighted least squares equation, which can
be used to estimate β as

β̂ = (X ′Wβ X)^{-1} X ′Wβ Y.

The procedure is applied in an iterative scheme whereby

β̂q+1 = (X ′Wq X)^{-1} X ′Wq Y, q = 0, 1, 2, . . . ,

and the iteration terminates whenever the specified convergence criterion is met. SAS PROC NLIN
allows for the IRLS procedure.

13.1 Choice of ρ(u)


Draper and Smith list a number of commonly used M-estimators in Table 25.1 on page 570. Rather than
discuss why each of these functions was proposed, I will consider a number of examples which illustrate
their use. The first is an example given in the Draper and Smith text on pages 573-574.
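As one standard entry of that type (stated here from memory, so treat the exact form as an assumption rather than a
quotation of Table 25.1), Huber's ρ with tuning constant b is

ρ(u) = u²/2 for |u| ≤ b, and ρ(u) = b|u| − b²/2 for |u| > b,

so that ψ(u) = u for |u| ≤ b and ψ(u) = b · sign(u) otherwise. The induced IRLS weight ψ(u)/u is therefore 1 for
|u| ≤ b and b/|u| otherwise, which is exactly the _weight_ assignment (with b = 2) used in the "Hubers method"
PROC NLIN step of the phone example below.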

13.1.1 Steel Employment Example


The SAS code is as follows;
data steel;
input country$ x74 y92;
cards;
germany 232 132
italy 96 50
france 158 43
uk 194 41
spain 89 33
belgium 64 25
netherlands 25 16
lux 23 8
portugal 4 3
denmark 2 1

;
title ’Steel Example’;
symbol1 v=star color=black;
symbol2 v=none i=join color=red;
symbol3 v=none i=join color=blue;

*Least Squares Procedure;


proc reg data=steel graphics;
model y92 = x74;
output out=a p=yhat;
plot y92*x74/conf; run;
*Robust Procedures;
title2 ’Ramseys method using IRLS’;
title3 ’Robust = red LS = blue’;
proc nlin data=a nohalve;
parms b0=-.3 b1=.4;
model y92 = b0 + b1*x74;
resid=y92-model.y92;
p=model.y92; sigma=20;der.b0=1;der.b1=x74;
b=0.3;
r=abs(resid/sigma);
_weight_=exp(-b*abs(r));
output out=c r=rbi p=p; run;
data c; set c; sigma=40; b=0.3; r=abs(rbi/sigma);
if r<=b then _weight_=exp(-b*abs(r));
else _weight_=0;
proc gplot data=c;
plot y92*x74 p*x74 yhat*x74/ overlay;
run;
The SAS output is;
Steel Example

The REG Procedure


Model: MODEL1
Dependent Variable: y92

Analysis of Variance

Sum of Mean
Source DF Squares Square F Value Pr > F

Model 1 9643.03276 9643.03276 22.27 0.0015


Error 8 3464.56724 433.07090
Corrected Total 9 13108

Root MSE 20.81036 R-Square 0.7357


Dependent Mean 35.20000 Adj R-Sq 0.7026
Coeff Var 59.12033

Parameter Estimates

Parameter Standard
Variable DF Estimate Error t Value Pr > |t|

Intercept 1 -0.31386 9.99747 -0.03 0.9757


x74 1 0.40038 0.08485 4.72 0.0015

Ramseys method using IRLS


Robust = red LS = blue

The NLIN Procedure


Iterative Phase
Dependent Variable y92
Method: Gauss-Newton

Weighted
Iter b0 b1 SS

0 -0.3000 0.4000 2084.6


1 0.6550 0.3880 2071.2
2 0.8117 0.3840 2067.3
3 0.8715 0.3823 2065.8
4 0.8974 0.3815 2065.1
5 0.9088 0.3812 2064.9
6 0.9138 0.3811 2064.7
7 0.9160 0.3810 2064.7
8 0.9169 0.3810 2064.7
9 0.9173 0.3810 2064.7
10 0.9175 0.3809 2064.7
11 0.9176 0.3809 2064.6

NOTE: Convergence criterion met.

Estimation Summary

Method Gauss-Newton
Iterations 11
R 5.312E-6
PPC(b0) 0.000039
RPC(b0) 0.000088
Object 9.54E-7
Objective 2064.649
Observations Read 10

Observations Used 10
Observations Missing 0

Sum of Mean Approx


Source DF Squares Square F Value Pr > F

Regression 2 13601.2 6800.6 23.98 0.0012


Residual 8 2064.6 258.1
Uncorrected Total 10 15665.8

Corrected Total 9 8254.6

The NLIN Procedure

Approx
Parameter Estimate Std Error Approximate 95% Confidence Limits

b0 0.9176 7.9994 -17.5293 19.3645


b1 0.3809 0.0778 0.2016 0.5603

Approximate Correlation Matrix


b0 b1

b0 1.0000000 -0.7274205
b1 -0.7274205 1.0000000
and corresponding graphs are;

13.1.2 Phone Data Example
A more illustrative example is as follows. The SAS code is:
option ls=80 center nodate;
symbol1 v=star color=black;
symbol2 v=none i=join color=red;
symbol3 v=none i=join color=blue;
data phones;
input year calls @@;
cards;
50 4.4 51 4.7 52 4.7 53 5.9 54 6.6
55 7.3 56 8.1 57 8.8 58 10.6 59 12.0
60 13.5 61 14.9 62 16.1 63 21.2 64 119.0
65 124.0 66 142.0 67 159.0 68 182.0 69 212.0
70 43.0 71 24.0 72 27.0 73 29.0
;run;
title ’Analysis of Phone Data’;

*Least Squares Procedure;


proc reg data=phones graphics;
model calls = year;
output out=a p=yhat;
plot calls*year/conf;
run;

*Robust procedure using Proc NLIN with iterated reweighted least squares;
title2 ’Tukey Biweigth using IRLS’;
title3 ’Robust = red LS = blue’;
proc nlin data=a nohalve;
parms b0=-260 b1=5;
model calls = b0 + b1*year;
resid=calls-model.calls;
p=model.calls;
sigma=20;der.b0=1;der.b1=year;
b=4.685;
r=abs(resid/sigma);
if r<=b then _weight_=(1-(r/b)**2);
else _weight_=0;
output out=c r=rbi p=p; run;
data c; set c; sigma=40; b=4.685; r=abs(rbi/sigma);
if r<=b then _weight_=(1-(r/b)**2);
else _weight_=0;
proc gplot data=c;
plot calls*year p*year yhat*year/ overlay;
run;
*Robust procedure using Proc NLIN with iterated reweighted least squares;
title2 ’Andrews method using IRLS’;
title3 ’Robust = red LS = blue’;
proc nlin data=a nohalve;

parms b0=-260 b1=5;
model calls = b0 + b1*year;
resid=calls-model.calls;
p=model.calls;
sigma=20;der.b0=1;der.b1=year;
b=4.21;
r=abs(resid/sigma);
if r<=b then _weight_=b*sin(r/b)/r;
else _weight_=0;
output out=c r=rbi p=p; run;
data c; set c; sigma=40; b=4.21; r=abs(rbi/sigma);
if r<=b then _weight_=b*sin(r/b)/r;
else _weight_=0;
proc gplot data=c;
plot calls*year p*year yhat*year/ overlay;
run;
*Robust procedure using Proc NLIN with iterated reweighted least squares;
title2 ’Hubers method using IRLS’;
title3 ’Robust = red LS = blue’;
proc nlin data=a nohalve;
parms b0=-260 b1=5;
model calls = b0 + b1*year;
resid=calls-model.calls;
p=model.calls;
sigma=20;der.b0=1;der.b1=year;
b=2;
r=abs(resid/sigma);
if r>=b then _weight_=b/abs(r);
else _weight_=1;
output out=c r=rbi p=p; run;
data c; set c; sigma=40; b=2; r=abs(rbi/sigma);
if r>=b then _weight_=b/abs(r);
else _weight_=1;
proc gplot data=c;
plot calls*year p*year yhat*year/ overlay;
run;
*Robust procedure using Proc NLIN with iterated reweighted least squares;
title2 ’Ramseys method using IRLS’;
title3 ’Robust = red LS = blue’;
proc nlin data=a nohalve;
parms b0=-260 b1=5;
model calls = b0 + b1*year;
resid=calls-model.calls;
p=model.calls;
sigma=20;der.b0=1;der.b1=year;
b=0.3;
r=abs(resid/sigma);
_weight_=exp(-b*abs(r));
output out=c r=rbi p=p; run;

data c; set c; sigma=40; b=0.3; r=abs(rbi/sigma);
if r<=b then _weight_=exp(-b*abs(r));
else _weight_=0;
proc gplot data=c;
plot calls*year p*year yhat*year/ overlay;
run;
The SAS output is;
Analysis of Phone Data

The REG Procedure


Model: MODEL1
Dependent Variable: calls

Analysis of Variance

Sum of Mean
Source DF Squares Square F Value Pr > F

Model 1 29229 29229 9.25 0.0060


Error 22 69544 3161.06999
Corrected Total 23 98773

Root MSE 56.22339 R-Square 0.2959


Dependent Mean 49.99167 Adj R-Sq 0.2639
Coeff Var 112.46553

Parameter Estimates

Parameter Standard
Variable DF Estimate Error t Value Pr > |t|

Intercept 1 -260.05925 102.60700 -2.53 0.0189


year 1 5.04148 1.65794 3.04 0.0060

______________________________________________________
Analysis of Phone Data
Tukey Biweigth using IRLS
Robust = red LS = blue

The NLIN Procedure


Iterative Phase
Dependent Variable calls
Method: Gauss-Newton

Weighted
Iter b0 b1 SS

0 -260.0 5.0000 20671.7
1 -166.1 3.2452 9784.0
2 -82.1794 1.6768 664.4
3 -63.5502 1.3053 302.8
4 -63.2587 1.3000 302.7
5 -63.2555 1.2999 302.7

NOTE: Convergence criterion met.

Estimation Summary

Method Gauss-Newton
Iterations 5
R 1.392E-6
PPC(b0) 5.676E-7
RPC(b0) 0.000051
Object 2.276E-6
Objective 302.7444
Observations Read 24
Observations Used 18
Observations Missing 6

Sum of Mean Approx


Source DF Squares Square F Value Pr > F

Regression 2 5346.9 2673.4 82.96 <.0001


Residual 16 302.7 18.9215
Uncorrected Total 18 5649.6

Corrected Total 17 1872.5

The NLIN Procedure

Approx Approximate 95% Confidence


Parameter Estimate Std Error Limits

b0 -63.2555 8.5983 -81.4831 -45.0279


b1 1.2999 0.1427 0.9974 1.6025

Approximate Correlation Matrix


b0 b1

b0 1.0000000 -0.9928513
b1 -0.9928513 1.0000000

_______________________________________________________
Analysis of Phone Data
Andrews method using IRLS
Robust = red LS = blue

The NLIN Procedure


Iterative Phase
Dependent Variable calls
Method: Gauss-Newton

Weighted
Iter b0 b1 SS

0 -260.0 5.0000 38692.4


1 -171.2 3.4019 19171.1
2 -103.9 2.1266 2325.8
3 -63.5749 1.3058 307.7
4 -63.4360 1.3032 307.7
5 -63.4357 1.3032 307.7

NOTE: Convergence criterion met.

Estimation Summary

Method Gauss-Newton
Iterations 5
R 2.767E-8
PPC(b0) 1.132E-8
RPC(b0) 4.999E-6
Object 4.418E-8
Objective 307.7286
Observations Read 24
Observations Used 18
Observations Missing 6

Sum of Mean Approx


Source DF Squares Square F Value Pr > F

Regression 2 5383.9 2692.0 82.28 <.0001


Residual 16 307.7 19.2330
Uncorrected Total 18 5691.7

Corrected Total 17 1890.2

The NLIN Procedure

Approx Approximate 95% Confidence
Parameter Estimate Std Error Limits

b0 -63.4357 8.6578 -81.7892 -45.0821


b1 1.3032 0.1437 0.9986 1.6078

Approximate Correlation Matrix


b0 b1

b0 1.0000000 -0.9928441
b1 -0.9928441 1.0000000
__________________________________________________________
Analysis of Phone Data
Hubers method using IRLS
Robust = red LS = blue

The NLIN Procedure


Iterative Phase
Dependent Variable calls
Method: Gauss-Newton

Weighted
Iter b0 b1 SS

0 -260.0 5.0000 36404.5


1 -225.9 4.3488 34568.5
2 -196.9 3.8203 33526.3
3 -178.7 3.4894 33020.1
4 -166.9 3.2751 32751.6
5 -159.0 3.1322 32557.7
6 -154.2 3.0451 32332.3
7 -153.0 3.0241 32266.9
8 -152.9 3.0207 32254.7
9 -152.8 3.0200 32252.4
10 -152.8 3.0199 32252.0

NOTE: Convergence criterion met.

Estimation Summary

Method Gauss-Newton
Iterations 10
R 6.595E-6
PPC(b0) 7.908E-6
RPC(b0) 0.000042
Object 0.000013

Objective 32251.95
Observations Read 24
Observations Used 24
Observations Missing 0

Sum of Mean Approx


Source DF Squares Square F Value Pr > F

Regression 2 27710.0 13855.0 6.35 0.0195


Residual 22 32252.0 1466.0
Uncorrected Total 24 59962.0

Corrected Total 23 41560.3

____________________________________________________________
Analysis of Phone Data
Hubers method using IRLS
Robust = red LS = blue

The NLIN Procedure

Approx Approximate 95% Confidence


Parameter Estimate Std Error Limits

b0 -152.8 73.0706 -304.4 -1.2750


b1 3.0199 1.1984 0.5345 5.5053

Approximate Correlation Matrix


b0 b1

b0 1.0000000 -0.9932326
b1 -0.9932326 1.0000000

The NLIN Procedure


Iterative Phase
Dependent Variable calls
Method: Gauss-Newton

Weighted
Iter b0 b1 SS

0 -260.0 5.0000 21489.6


1 -215.8 4.1598 19530.7
2 -177.6 3.4524 17825.1
3 -152.7 2.9915 16712.2
4 -138.9 2.7359 16118.8
5 -132.0 2.6076 15832.8

6 -128.7 2.5465 15700.3
7 -127.2 2.5182 15639.8
8 -126.5 2.5052 15612.3
9 -126.2 2.4994 15599.9
10 -126.1 2.4967 15594.2
11 -126.0 2.4955 15591.6
12 -126.0 2.4949 15590.5
13 -126.0 2.4947 15590.0
14 -126.0 2.4946 15589.7
15 -125.9 2.4945 15589.6

NOTE: Convergence criterion met.

Estimation Summary

Method Gauss-Newton
Iterations 15
R 6.817E-6
PPC(b0) 9.849E-6
RPC(b0) 0.000022
Object 6.892E-6
Objective 15589.62
Observations Read 24
Observations Used 24
Observations Missing 0

_________________________________________________________
Analysis of Phone Data
Ramseys method using IRLS
Robust = red LS = blue

The NLIN Procedure

Sum of Mean Approx


Source DF Squares Square F Value Pr > F

Regression 2 13580.7 6790.4 7.01 0.0147


Residual 22 15589.6 708.6
Uncorrected Total 24 29170.4

Corrected Total 23 20557.8

Approx Approximate 95% Confidence


Parameter Estimate Std Error Limits

b0 -125.9 56.4983 -243.1 -8.7799

b1 2.4945 0.9421 0.5408 4.4483

Approximate Correlation Matrix


b0 b1

b0 1.0000000 -0.9933620
b1 -0.9933620 1.0000000
with corresponding graphs;

13.1.3 Phone Example with Splus
The Splus script for running this example is;
phones.lm <- lm(calls ~ year, phones)
attach(phones); plot(year, calls); detach()
abline(phones.lm$coef)
abline(rlm(calls ~ year, phones, maxit=50), lty=2, col=2)
abline(lqs(calls ~ year, phones), lty=3, col=3)
legend(locator(1), legend=c("least squares", "M-estimate", "LTS"), lty=1:3, col=1:3)

summary(lm(calls ~ year, data=phones), cor=F)


summary(rlm(calls ~ year, maxit=50, data=phones), cor=F)
summary(rlm(calls ~ year, scale.est="proposal 2", data=phones), cor=F)
summary(rlm(calls ~ year, data=phones, psi=psi.bisquare), cor=F)
The output is;
> summary(lm(calls ~ year, data=phones), cor=F)

Call: lm(formula = calls ~ year, data = phones)


Residuals:
Min 1Q Median 3Q Max
-78.97 -33.52 -12.04 23.38 124.2

Coefficients:
Value Std. Error t value Pr(>|t|)
(Intercept) -260.0592 102.6070 -2.5345 0.0189
year 5.0415 1.6579 3.0408 0.0060

Residual standard error: 56.22 on 22 degrees of freedom


Multiple R-Squared: 0.2959
F-statistic: 9.247 on 1 and 22 degrees of freedom, the p-value is 0.0059
> summary(rlm(calls ~ year, maxit=50, data=phones), cor=F)

Call: rlm.formula(formula = calls ~ year, data = phones, maxit = 50)


Residuals:
Min 1Q Median 3Q Max
-18.31 -5.953 -1.681 26.46 173.8

Coefficients:
Value Std. Error t value
(Intercept) -102.6222 26.6082 -3.8568
year 2.0414 0.4299 4.7480

Residual standard error: 9.032 on 22 degrees of freedom


> summary(rlm(calls ~ year, scale.est="proposal 2", data=phones), cor=F)

Call: rlm.formula(formula = calls ~ year, data = phones, scale.est =


"proposal 2")

Residuals:
Min 1Q Median 3Q Max
-68.15 -29.46 -11.52 22.74 132.7

Coefficients:
Value Std. Error t value
(Intercept) -227.9250 101.8740 -2.2373
year 4.4530 1.6461 2.7052

Residual standard error: 57.25 on 22 degrees of freedom


> summary(rlm(calls ~ year, data=phones, psi=psi.bisquare), cor=F)

Call: rlm.formula(formula = calls ~ year, data = phones, psi = psi.bisqu


are)
Residuals:
Min 1Q Median 3Q Max
-1.658 -0.4143 0.2837 39.09 188.5

Coefficients:
Value Std. Error t value
(Intercept) -52.3025 2.7530 -18.9985
year 1.0980 0.0445 24.6846

Residual standard error: 1.654 on 22 degrees of freedom


With the following graph

Chapter 14

Regression with Violations of the


Error Structure Assumptions

In this chapter I wish to consider a procedure called AUTOREG in SAS which has some useful techniques for
detecting and handling violations of the independence and homogeneous error structure assumptions made
in linear regression.
I will consider two examples which illustrate how one might use this procedure. The first violates the
independence assumption for the residuals. The second is an example in which one has a heteroscedastic
error structure. In both cases, PROC AUTOREG provides a number of procedures that are beyond the
scope of this course.

14.1 Serially correlated Errors


As you recall, the basic linear regression model can be written as

y = Xβ + e,

where

y = (y1 , y2 , . . . , yn )′ is the n × 1 vector of dependent observations,

X is the n × p matrix of independent observations or known values, whose ith row is (1, x1i , x2i , . . . , x(p−1)i ), i = 1, 2, . . . , n,

β = (β0 , β1 , β2 , . . . , βp−1 )′ is the p × 1 vector of parameters, and

e = (e1 , e2 , . . . , en )′ is the n × 1 vector of unobservable errors or residuals.
We add the distributional assumption on the errors that the unobserved errors ei , i = 1, 2, . . . , n, are i.i.d.
normal with mean 0 and variance σ². This assumption can be written as

e = (e1 , e2 , . . . , en )′ ∼ Nn (0, σ² In ).

In this section, one considers the problem when the error covariance matrix is V ≠ σ² In and the errors
behave according to an autoregressive error structure given by

ei = −ρ1 ei−1 − ρ2 ei−2 − . . . − ρp ei−p + ηi , i = p + 1, p + 2, . . . , n.

The above error structure is called an AR(p) model, an autoregressive model of order p. When p = 1, the
model is said to be an AR(1) or Markov error model. When p = 1, it follows that

corr(ei , ei−1 ) = corr(ei , ei+1 ) = ρ1
corr(ei , ei−2 ) = corr(ei , ei+2 ) = ρ1²
...
corr(ei , ei−k ) = corr(ei , ei+k ) = ρ1^k .
When p > 1 the correlation of the errors is said to have a long memory. The Durbin-Watson test, as defined
in a previous chapter, is used in PROC REG as a test statistic for testing H0 : ρ1 = 0 when the errors
are assumed to follow an AR(1) model under the alternative. PROC AUTOREG allows one to test for higher
order autoregressive terms and, in addition, it provides p-values for each of the tests. Furthermore, one
can incorporate an error structure model into the usual linear regression model, which might be useful in
prediction.
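A sketch of the implied covariance structure (a standard result, not derived in these notes): under stationarity, an
AR(1) error process with corr(ei , ei+k ) = ρ1^k has covariance matrix V whose (i, j) element is σ² ρ1^{|i−j|}, so
V ≠ σ² In unless ρ1 = 0. Estimation then amounts to generalized least squares with an estimate of this V; the
Yule-Walker estimates reported by PROC AUTOREG in the example below are obtained in essentially this way.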

14.1.1 Example Using Blaisdell Data


The SAS code for this example is
DATA BLAIS; INPUT COMSALE INDSALE;
cards;
20.96 127.3
21.4 130.0
21.96 132.7
21.52 129.4
22.39 135.0
22.76 137.1
23.48 141.2
23.66 142.8
24.1 145.5
24.01 145.3
24.54 148.3
24.3 146.4
25.0 150.2

25.64 153.1
26.36 157.3
26.98 160.7
27.52 164.2
27.78 165.6
28.24 168.7
28.78 171.7
;
title ’Blaisdell Data’;
options center nodate pagesize=100 ls=80;
symbol1 i=none color=red v=star ;
symbol2 i=join v=none color=blue;
symbol3 i=join color=green v=none;
title2 ’Output from OLS’;
PROC reg data=blais graphics;
MODEL COMSALE=INDSALE / DW;
output out=new p=yhat r=r ucl=u lcl=l;run;
proc gplot data=new;
plot comsale*indsale=1 l*indsale=3 u*indsale=3 yhat*indsale=2/overlay;
plot r*indsale /vref=0; run;
title2 ’Output from Autoreg’;
PROC AUTOREG data=blais;
model comsale=indsale/nlag=1 archtest dwprob;
output out=new2 p=yhat r=r ucl=u lcl=l; RUN;
proc gplot data=new2;
plot comsale*indsale=1 yhat*indsale=2 u*indsale=3 l*indsale=3/overlay;
plot r*indsale /vref=0; run;
The SAS output is;
Blaisdell Data
Output from OLS

The REG Procedure


Model: MODEL1
Dependent Variable: COMSALE

Analysis of Variance

Sum of Mean
Source DF Squares Square F Value Pr > F

Model 1 110.25688 110.25688 14888.1 <.0001


Error 18 0.13330 0.00741
Corrected Total 19 110.39018

Root MSE 0.08606 R-Square 0.9988


Dependent Mean 24.56900 Adj R-Sq 0.9987
Coeff Var 0.35026

Parameter Estimates

Parameter Standard
Variable DF Estimate Error t Value Pr > |t|

Intercept 1 -1.45475 0.21415 -6.79 <.0001


INDSALE 1 0.17628 0.00144 122.02 <.0001

The REG Procedure


Dependent Variable: COMSALE

Durbin-Watson D 0.735
Number of Observations 20
1st Order Autocorrelation 0.626
_________________________________________________________________
Blaisdell Data
Output from Autoreg

The AUTOREG Procedure Dependent Variable COMSALE

Ordinary Least Squares Estimates

SSE 0.1333023 DFE 18


MSE 0.00741 Root MSE 0.08606
SBC -37.468355 AIC -39.45982
Regress R-Square 0.9988 Total R-Square 0.9988
Durbin-Watson 0.7347 Pr < DW 0.0002
Pr > DW 0.9998
NOTE: Pr<DW is the p-value for testing positive autocorrelation, and Pr>DW is
the p-value for testing negative autocorrelation.

Q and LM Tests for ARCH Disturbances

Order Q Pr > Q LM Pr > LM

1 0.2727 0.6016 0.3906 0.5320


2 0.8986 0.6381 0.7508 0.6870
3 3.2983 0.3479 1.4399 0.6962
4 3.4856 0.4801 1.5027 0.8262
5 3.7697 0.5830 1.6747 0.8921
6 5.6838 0.4595 2.5085 0.8675
7 6.1287 0.5248 2.6845 0.9126
8 9.5204 0.3003 3.7760 0.8767
9 14.3130 0.1116 4.2465 0.8945
10 14.3359 0.1582 4.4726 0.9235
11 14.4373 0.2097 4.7521 0.9426
12 15.8697 0.1973 5.0843 0.9551

Standard Approx
Variable DF Estimate Error t Value Pr > |t|

Intercept 1 -1.4548 0.2141 -6.79 <.0001


INDSALE 1 0.1763 0.001445 122.02 <.0001

Estimates of Autocorrelations

Lag Covariance Correlation -1 9 8 7 6 5 4 3 2 1 0 1 2 3 4 5 6 7 8 9 1

0 0.00667 1.000000 | |********************|


1 0.00417 0.626005 | |************* |

Preliminary MSE 0.00405

Estimates of Autoregressive Parameters

Standard
Lag Coefficient Error t Value
1 -0.626005 0.189134 -3.31

Yule-Walker Estimates

SSE 0.07921358 DFE 17


MSE 0.00466 Root MSE 0.06826
SBC -44.384673 AIC -47.37187
Regress R-Square 0.9970 Total R-Square 0.9993
Durbin-Watson 1.6563 Pr < DW 0.1880
Pr > DW 0.8120
NOTE: Pr<DW is the p-value for testing positive autocorrelation, and Pr>DW is
the p-value for testing negative autocorrelation.

Standard Approx
Variable DF Estimate Error t Value Pr > |t|

Intercept 1 -1.2903 0.3494 -3.69 0.0018


INDSALE 1 0.1751 0.002349 74.56 <.0001
I have created a number of graphs to illustrate the effect of including the autoregressive error structure as
part of the overall model.

The first graph shows the confidence bands as given by ordinary least squares (PROC REG).

The next graph gives the residual plot using ordinary least squares.

The next two graphs are similar except that I have used PROC AUTOREG with an autoregressive error
model of order one.

The residual plot using proc autoreg.

14.2 Detecting Heteroscedastic Error Structure
PROC AUTOREG uses the theory developed for modeling conditional heteroscedasticity, the GARCH model,
given by

yi = xi′ β + νi
νi = εi − ρ1 νi−1 − . . . − ρm νi−m
εi = √(hi ) ei
hi = ω + Σ_{j=1}^{q} αj ε²_{i−j} + Σ_{k=1}^{p} γk h_{i−k}

ei i.i.d. N (0, 1).
ei i.i.d. N (0, 1).


The above model consists of an AR(m) structure for the errors together with a GARCH(p,q) model for the
error variance. This is sometimes referred to as an AR(m)-GARCH(p,q) regression model.
PROC AUTOREG has many options which are beyond the needs of this course; however, one can test for
homogeneity of variance and estimate the variance whenever it is heteroscedastic. My plan is to give an
example whereby I test for homogeneity of variance and then use the GARCH model to estimate the variance
of yi for the purpose of weighted least squares.
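For the garch=(p=1,q=1) option used in the code below, the variance equation reduces to

hi = ω + α1 ε²_{i−1} + γ1 h_{i−1} ,

and, assuming the usual SAS labeling, these three parameters appear in the PROC AUTOREG output as ARCH0
(ω), ARCH1 (α1 ), and GARCH1 (γ1 ).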

14.2.1 Meyers Table 7.1 Data Example


The SAS code is;
options center nodate pagesize=100 ls=80;
symbol1 i=none color=red v=star ;
symbol2 i=join v=none color=blue;
symbol3 i=join color=green v=none;
title1 ’Meyers Table 7.1 Data’;
data mt71;
input veloc volt effic @@;n=_n_;
cards;
60 50 87.5 60 50 88.1 60 50 89.5 60 50 86.2 60 50 90.0
60 50 88.2 60 50 87.3 60 50 89.2 60 50 85.9 60 50 87.0
120 50 82.5 120 50 81.6 120 50 77.4 120 50 81.5 120 50 79.7
120 50 81.3 120 50 80.7 120 50 79.3 120 50 82.0 120 50 79.2
120 70 61.2 120 70 67.2 120 70 55.9 120 70 52.0 120 70 63.5
120 70 50.7 120 70 52.3 120 70 68.6 120 70 69.5 120 70 70.1
60 70 77.4 60 70 70.7 60 70 67.0 60 70 71.7 60 70 79.2
60 70 68.1 60 70 65.3 60 70 61.0 60 70 81.7 60 70 60.3
;
title2 ’Output from OLS’;
proc reg data=mt71 graphics;
model effic = veloc volt
/ ss1 ss2 collinoint vif dw; run;
plot r.*obs.;
output out=new p=yhat r=r uclm=u lclm=l;
run;
proc gplot data=new;
plot effic*n=1 yhat*n=2 u*n=3 l*n=3/overlay;

plot r*n /vref=0;
PROC AUTOREG data=mt71;
model effic = veloc volt/dw=1 dwprob archtest garch=(p=1,q=1);
output out=new2 p=yhat r=r ucl=u lcl=l cpev=vhat;
title2 ’Output from Autoreg’;
RUN;
data new3; set new2; shat=sqrt(vhat); w=1/shat;run;
proc gplot data=new3;
plot effic*n=1 yhat*n=2 u*n=3 l*n=3/overlay;
title3 ’Estimated SE and residual vs n’;
plot r*n=1 shat*n=2/vref=0 cvref=black overlay;
run;
proc reg data=new3 graphics;
model effic = veloc volt
/ noprint ss1 ss2 collinoint vif dw;
weight w;
title3 ’Weighted LS’;
plot r.*obs./vref=0;
run;
The SAS output is;
Meyers Table 7.1 Data
Output from OLS
The REG Procedure
Dependent Variable: effic

Analysis of Variance

Sum of Mean
Source DF Squares Square F Value Pr > F

Model 2 4116.91250 2058.45625 70.36 <.0001


Error 37 1082.48125 29.25625
Corrected Total 39 5199.39375

Root MSE 5.40890 R-Square 0.7918


Dependent Mean 74.93750 Adj R-Sq 0.7806
Coeff Var 7.21789

Parameter Estimates
Parameter Standard
Variable DF Estimate Error t Value Pr > |t| Type I SS

Intercept 1 142.92500 5.80040 24.64 <.0001 224625


veloc 1 -0.13758 0.02851 -4.83 <.0001 681.45025
volt 1 -0.92675 0.08552 -10.84 <.0001 3435.46225

Parameter Estimates
Variance
Variable DF Type II SS Inflation

Intercept 1 17763 0
veloc 1 681.45025 1.00000
volt 1 3435.46225 1.00000

Collinearity Diagnostics(intercept adjusted)

Condition --Proportion of Variation-


Number Eigenvalue Index veloc volt

1 1.00000 1.00000 1.00000 0


2 1.00000 1.00000 0 1.00000

The REG Procedure


Model: MODEL1
Dependent Variable: effic

Durbin-Watson D 1.883
Number of Observations 40
1st Order Autocorrelation 0.016
___________________________________________________________
Output from Autoreg

The AUTOREG Procedure

Dependent Variable effic


Ordinary Least Squares Estimates

SSE 1082.48125 DFE 37


MSE 29.25625 Root MSE 5.40890
SBC 256.506988 AIC 251.44035
Regress R-Square 0.7918 Total R-Square 0.7918
Durbin-Watson 1.8832 Pr < DW 0.2420
Pr > DW 0.7580
NOTE: Pr<DW is the p-value for testing positive autocorrelation, and Pr>DW is
the p-value for testing negative autocorrelation.

Q and LM Tests for ARCH Disturbances

Order Q Pr > Q LM Pr > LM

1 10.4636 0.0012 10.8130 0.0010


2 16.0728 0.0003 12.9509 0.0015
3 19.1576 0.0003 13.3856 0.0039
4 24.9255 <.0001 15.3565 0.0040
5 26.3526 <.0001 16.2259 0.0062
6 26.3978 0.0002 19.9419 0.0028
7 26.3985 0.0004 19.9437 0.0057
8 26.7892 0.0008 22.6420 0.0039

9 28.8634 0.0007 30.8401 0.0003
10 28.8708 0.0013 30.8769 0.0006
11 30.5517 0.0013 30.9659 0.0011
12 32.1185 0.0013 30.9694 0.0020

Standard Approx
Variable DF Estimate Error t Value Pr > |t|

Intercept 1 142.9250 5.8004 24.64 <.0001


veloc 1 -0.1376 0.0285 -4.83 <.0001
volt 1 -0.9268 0.0855 -10.84 <.0001

Algorithm converged.
GARCH Estimates

SSE 1137.77249 Observations 40


MSE 28.44431 Uncond Var .
Log Likelihood -109.5695 Total R-Square 0.7812
SBC 241.272276 AIC 231.138999
Normality Test 0.7314 Pr > ChiSq 0.6937

Standard Approx
Variable DF Estimate Error t Value Pr > |t|

Intercept 1 138.0122 2.0968 65.82 <.0001


veloc 1 -0.1281 0.0126 -10.20 <.0001
volt 1 -0.8457 0.0360 -23.49 <.0001
ARCH0 1 0.2594 1.5188 0.17 0.8644
ARCH1 1 0.8561 0.4959 1.73 0.0843
GARCH1 1 0.4156 0.1667 2.49 0.0127
_____________________________________________________________________
Meyers Table 7.1 Data
Output from Autoreg
Weighted LS

The REG Procedure


Model: MODEL1
Dependent Variable: effic

Analysis of Variance

Sum of Mean
Source DF Squares Square F Value Pr > F

Model 2 1083.54935 541.77468 117.51 <.0001


Error 37 170.58000 4.61027
Corrected Total 39 1254.12935

Root MSE 2.14715 R-Square 0.8640
Dependent Mean 78.99993 Adj R-Sq 0.8566
Coeff Var 2.71792

Parameter Estimates

Parameter Standard
Variable DF Estimate Error t Value Pr > |t| Type I SS

Intercept 1 141.36873 4.10924 34.40 <.0001 85955


veloc 1 -0.13143 0.01943 -6.77 <.0001 211.45848
volt 1 -0.90693 0.06594 -13.75 <.0001 872.09087

Parameter Estimates
Variance
Variable DF Type II SS Inflation

Intercept 1 5456.44356 0
veloc 1 211.02182 1.00000
volt 1 872.09087 1.00000

Collinearity Diagnostics(intercept adjusted)

Condition --Proportion of Variation-


Number Eigenvalue Index veloc volt

1 1.00051 1.00000 0.49975 0.49975


2 0.99949 1.00051 0.50025 0.50025

Durbin-Watson D 2.001
Number of Observations 40
1st Order Autocorrelation -0.024

14.2.2 Draper and Smith Example Using Weighted Least Squares
In Chapter 9, Draper and Smith consider the problem of generalized least squares and weighted least
squares. In Section 9.3 they consider an example (data found in Table 9.1) where they estimate their
weights by modeling the sample variance for the independent data xi . The weights given in Table 9.1 are
the inverses of the estimated sample variances for the individual xi 's.
In this example I am going to estimate the weights using PROC AUTOREG and compare the results
using weighted least squares. The SAS code is:
options center nodate pagesize=100 ls=80;
symbol1 i=none color=red v=star ;
symbol2 i=join v=none color=blue;
symbol3 i=join color=green v=none;
title1 ’Table 9.1 data X, Y, W page 226’;
data t91;
*infile ’R:/export/home/jtubbs/ftp/Regression/DS_data/15a’;
input x y w ;n=_n_;
cards;
1.15 0.99 1.24
1.90 0.98 2.18
3.00 2.60 7.84
3.00 2.67 7.84
3.00 2.66 7.84
3.00 2.78 7.84
3.00 2.80 7.84
5.34 5.92 7.43
5.38 5.35 6.99
5.40 4.33 6.78
5.40 4.89 6.78
5.45 5.21 6.30
7.70 7.68 0.89
7.80 9.81 0.84
7.81 6.52 0.83
7.85 9.71 0.82
7.87 9.82 0.81
7.91 9.81 0.79
7.94 8.50 0.78
9.03 9.47 0.47
9.07 11.45 0.46
9.11 12.14 0.45
9.14 11.50 0.45
9.16 10.65 0.44
9.37 10.64 0.41
10.17 9.78 0.31
10.18 12.39 0.31
10.22 11.03 0.30
10.22 8.00 0.30
10.22 11.90 0.30
10.18 8.68 0.31
10.50 7.25 0.28
10.23 13.46 0.30
10.03 10.19 0.32
10.23 9.93 0.30
;

proc reg data=t91 graphics;
model y = x / dw; run;
title2 ’Least Squares’;
title3 ’Regression Plot’; plot y*x p.*x uclm.*x lclm.*x/overlay; run;
title3 ’Residual Plot’; plot r.*obs./ vref=0; run;
* Weighted Least Squares;
proc reg data=t91 graphics;
model y = x / dw;
weight w;run;
title2 ’Weighted LS’;
title3 ’Regression Plot’; plot y*x p.*x uclm.*x lclm.*x/overlay; run;
title3 ’Residual Plot’; plot r.*obs./ vref=0; run;
PROC AUTOREG data=t91;
* model y = x/dw=2 dwprob archtest;
model y = x/garch=(p=1,q=1);
output out=new2 p=yhat r=r ucl=u lcl=l cpev=vhat;
title2 ’Output from Autoreg’;
RUN;
data new3; set new2; shat=sqrt(vhat); wn=1/vhat;run;
proc gplot data=new3;
plot y*x=1 yhat*x=2 u*x=3 l*x=3/overlay;
title3 ’Estimated SE and residual vs n’;
plot r*n=1 shat*n=2/vref=0 cvref=black overlay;
run;
proc reg data=new3 graphics;
model y = x / dw;
weight wn;
title3 ’Weighted LS’;
run;
proc print data=new3;run;
The SAS output is;
Table 9.1 data X, Y, W page 226
The REG Procedure

Analysis of Variance

Sum of Mean
Source DF Squares Square F Value Pr > F

Model 1 367.94805 367.94805 173.42 <.0001


Error 33 70.01571 2.12169
Corrected Total 34 437.96375

Root MSE 1.45660 R-Square 0.8401


Dependent Mean 7.75686 Adj R-Sq 0.8353
Coeff Var 18.77824

Parameter Estimates
Parameter Standard

Variable DF Estimate Error t Value Pr > |t|

Intercept 1 -0.57895 0.67919 -0.85 0.4001


x 1 1.13540 0.08622 13.17 <.0001

Durbin-Watson D 1.952
Number of Observations 35
1st Order Autocorrelation 0.015
________________________________________________________________
Table 9.1 data X, Y, W page 226
Weighted Least Squares

Analysis of Variance

Sum of Mean
Source DF Squares Square F Value Pr > F

Model 1 493.21364 493.21364 384.11 <.0001


Error 33 42.37373 1.28405
Corrected Total 34 535.58737

Root MSE 1.13316 R-Square 0.9209


Dependent Mean 4.49629 Adj R-Sq 0.9185
Coeff Var 25.20212

Parameter Estimates
Parameter Standard
Variable DF Estimate Error t Value Pr > |t|

Intercept 1 -0.88770 0.30000 -2.96 0.0057


x 1 1.16442 0.05941 19.60 <.0001

Durbin-Watson D 1.662
Number of Observations 35
1st Order Autocorrelation 0.160
_________________________________________________________________
Table 9.1 data X, Y, W page 226
Output from Autoreg

Ordinary Least Squares Estimates

SSE 70.0157089 DFE 33


MSE 2.12169 Root MSE 1.45660
SBC 130.704398 AIC 127.593702
Regress R-Square 0.8401 Total R-Square 0.8401
Durbin-Watson 1.9517

Standard Approx
Variable DF Estimate Error t Value Pr > |t|

Intercept 1 -0.5790 0.6792 -0.85 0.4001
x 1 1.1354 0.0862 13.17 <.0001

Algorithm converged.
GARCH Estimates

SSE 70.2144329 Observations 35


MSE 2.00613 Uncond Var 408.957177
Log Likelihood -58.424952 Total R-Square 0.8397
SBC 134.626645 AIC 126.849904
Normality Test 1.4387 Pr > ChiSq 0.4871

Standard Approx
Variable DF Estimate Error t Value Pr > |t|

Intercept 1 -0.7847 0.7616 -1.03 0.3028


x 1 1.1584 0.1010 11.47 <.0001
ARCH0 1 0.1722 0.2684 0.64 0.5212
ARCH1 1 0.4461 0.3595 1.24 0.2146
GARCH1 1 0.5534 0.2969 1.86 0.0623
_______________________________________________________________
Table 9.1 data X, Y, W page 226
Output from Autoreg
Weighted LS

Analysis of Variance

Sum of Mean
Source DF Squares Square F Value Pr > F

Model 1 272.13091 272.13091 256.58 <.0001


Error 33 34.99993 1.06060
Corrected Total 34 307.13084

Root MSE 1.02986 R-Square 0.8860


Dependent Mean 6.21159 Adj R-Sq 0.8826
Coeff Var 16.57959

Parameter Estimates
Parameter Standard
Variable DF Estimate Error t Value Pr > |t|

Intercept 1 -0.78320 0.47670 -1.64 0.1099


x 1 1.15803 0.07229 16.02 <.0001

Durbin-Watson D 1.929
Number of Observations 35

1st Order Autocorrelation 0.031

Chapter 15

Regression Trees

Classification and regression trees are a relatively new method for exploring relationships among data in
either a regression or classification problem. These methods are very easy to use and interpret,
although they are somewhat difficult to understand. I have downloaded some material from Salford Systems,
which produces and markets a commercial product called CART, which was developed by Breiman and
Friedman in the early 1980's.

15.1 An Overview of CART Methodology


The CART methodology is technically known as binary recursive partitioning. The process is binary because
parent nodes are always split into exactly two child nodes and recursive because the process can be repeated
by treating each child node as a parent. The key elements of a CART analysis are a set of rules for:
• splitting each node in a tree;
• deciding when a tree is complete; and
• assigning each terminal node to a class outcome (or predicted value for regression)

Splitting Rules
To split a node into two child nodes, CART always asks questions that have a ‘yes’ or ‘no’ answer. For
example, the questions might be: is age <= 55? Or is credit score <= 600?
How do we come up with candidate splitting rules? CART's method is to look at all possible splits for all
variables included in the analysis. For example, consider a data set with 215 cases and 19 variables. CART
considers up to 215 times 19 splits for a total of 4085 possible splits. Any problem will have a finite number
of candidate splits and CART will conduct a brute force search through them all.

Choosing a Split
CART’s next activity is to rank order each splitting rule on the basis of a quality-of-split criterion. The
default criterion used in CART is the GINI rule, essentially a measure of how well the splitting rule separates
the classes contained in the parent node. (Alternative criteria are also available).
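Although the text does not give the formula, the GINI rule is usually stated as follows: for a node whose cases fall
into classes 1, . . . , K with proportions p1 , . . . , pK , the Gini impurity is G = 1 − Σ_{k} pk², and a candidate split is
scored by how much it reduces impurity, with each child node's impurity weighted by the fraction of cases it
receives.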

Class Assignment
Once a best split is found, CART repeats the search process for each child node, continuing recursively until
further splitting is impossible or stopped. Splitting is impossible if only one case remains in a particular
node or if all the cases in that node are exact copies of each other (on predictor variables). CART also allows
splitting to be stopped for several other reasons, including that a node has too few cases. (The default for
this lower limit is 10 cases, but may be set higher or lower to suit a particular analysis).
Once a terminal node is found we must decide how to classify all cases falling within it. One simple
criterion is the plurality rule: the group with the greatest representation determines the class assignment.
CART goes a step further: because each node has the potential for being a terminal node, a class assignment
is made for every node whether it is terminal or not. The rules of class assignment can be modified from
simple plurality to account for the costs of making a mistake in classification and to adjust for over- or
under-sampling from certain classes.
A common technique among the first generation of tree classifiers was to continue splitting nodes (growing
the tree) until some goodness-of-split criterion failed to be met. When the quality of a particular split fell
below a certain threshold, the tree was not grown further along that branch. When all branches from the
root reached terminal nodes, the tree was considered complete. While this technique is still embodied in
several commercial programs, including CHAID and KnowledgeSEEKER, it often yields erroneous results.
CART uses a completely different technique.

Pruning Trees
Instead of attempting to decide whether a given node is terminal or not, CART proceeds by growing trees
until it is not possible to grow them any further. Once CART has generated what we call a maximal tree,
it examines smaller trees obtained by pruning away branches of the maximal tree. Unlike other methods,
CART does not stop in the middle of the tree-growing process, because there might still be important
information to be discovered by drilling down several more levels.

Testing
Once the maximal tree is grown and a set of sub-trees are derived from it, CART determines the best tree
by testing for error rates or costs. With sufficient data, the simplest method is to divide the sample into
learning and test sub-samples. The learning sample is used to grow an overly-large tree. The test sample is
then used to estimate the rate at which cases are misclassified (possibly adjusted by misclassification costs).
The misclassification error rate is calculated for the largest tree and also for every sub-tree. The best sub-tree
is the one with the lowest or near-lowest cost, which may be a relatively small tree.
Some studies will not have sufficient data to allow a good-sized separate test sample. The tree-growing
methodology is data intensive, requiring many more cases than classical regression. When data are in short
supply, CART employs the computer-intensive technique of cross validation.

Cross Validation
Cross validation is used if data are insufficient for a separate test sample. In such cases, CART grows a
maximal tree on the entire learning sample. This is the tree that will be pruned back. CART then proceeds
by dividing the learning sample into 10 roughly-equal parts, each containing a similar distribution for the
dependent variable. CART takes the first 9 parts of the data, constructs the largest possible tree, and uses
the remaining 1/10 of the data to obtain initial estimates of the error rate of selected sub-trees. The same
process is then repeated (growing the largest possible tree) on another 9/10 of the data while using a different
1/10 part as the test sample. The process continues until each part of the data has been held in reserve one

time as a test sample. The results of the 10 mini-test samples are then combined to form error rates for
trees of each possible size; these error rates are applied to the tree based on the entire learning sample.
The upshot of this complex process is a set of fairly reliable estimates of the independent predictive
accuracy of the tree. This means that we can know how well any tree will perform on completely fresh
data-even if we do not have an independent test sample. Because the conventional methods of assessing tree
accuracy can be wildly optimistic, cross validation is the method CART normally uses to obtain objective
measures for smaller data sets.

Conclusion
CART uses a combination of exhaustive searches and computer-intensive testing techniques to identify useful
tree structures of data. It can be applied to virtually any data set and can proceed with little or no guidance
from the user. Thus, if you have a data set and have no idea how to proceed with its analysis, you can
simply hand it over to CART and let it do the work. If this sounds too good to be true, the natural question
is: does CART really deliver useful results that you can trust?
The surprising answer is a resounding yes. When automatic CART analyses are compared with stepwise
logistic regressions or discriminant analyses, CART typically performs about 10 to 15 percent better on the learning
sample. CART’s performance on test samples is even more important. Because CART does not suffer from
the statistical deficiencies that plague conventional stepwise techniques, CART will typically be far more
accurate on new data. Further, when automatic CART analyses are compared with the best parametric
models of sophisticated teams of statisticians, CART is still competitive. CART can often generate models
in an hour or two that are only slightly worse in predictive accuracy than models that may take specialists
several days to develop.

Technical Note for Statisticians


Some technical aspects of CART analyses are of special interest to statisticians; we list the most important
ones here.
• CART is nonparametric.
CART is a nonparametric procedure and does not require specification of a functional form.
• CART does not require variables to be selected in advance.
CART uses a stepwise method to determine splitting rules. However, unlike parametric stepwise
procedures, CART trees can be shown to be statistically sound. Thus, no advance selection of variables
is necessary, although certain variables such as ID numbers and reformulations of the dependent variable
should be excluded from the analysis. Also, CART's performance can be much enhanced by a judicious
selection and creation of predictor variables.
• Results are invariant with respect to monotone transformations of the independent variables.
There is no need to experiment with transformations of the independent variables, such as logarithms,
square roots or squares. In CART, creating such variables will not affect the trees produced unless
linear combination splits are used.
• CART can handle data sets with a complex structure.
Unlike parametric models, which are intended to uncover a single dominant structure in data, CART
is designed to work with data that might have multiple structures. In fact, provided there are enough
observations, the more complex the data and the more variables available, the better CART will do
compared to alternative methods.

• CART is extremely robust to the effects of outliers.
Outliers among the independent variables generally do not affect CART because splits usually occur
at non-outlier values. Outliers in the dependent variable are often separated into nodes where they
no longer affect the rest of the tree. Also, in regression models, least absolute deviations can be used
instead of least squares, diminishing the effect of outliers in the dependent variable.

• CART can use any combination of categorical and continuous variables.


CART does not require any preprocessing of the data. In particular, and in contrast to CHAID,
continuous variables do not have to be recoded into discrete variable versions prior to analysis.
• CART can use linear combinations of variables to determine splits.
While the CART default is to split nodes on single variables, it will optionally use linear combinations
of non-categorical variables.
• CART can adjust for samples stratified on a categorical dependent variable.
If a sample has substantial over-representation of certain classes, CART can adjust for this by automatic
reweighting.

• CART can discover context dependence and interactions.


CART can use the same variable in different parts of the tree, uncovering the context dependency of
the effects of certain variables.
• CART can process cases with missing values for predictors.
For each split in the tree, CART develops alternative splits (surrogates), which can be used to classify
an object when the primary splitting variable is missing. Thus, CART can be effectively used with
data that has a large fraction of missing values.

15.2 Example from Splus


The Splus session produces the following;
Working data will be in C:\Program Files\sp2000\users\Default\_Data
> x1A
obs y x2 x3 x4 x5 x6 x7 x8 x9 x10
1 10.98 5.20 0.61 7.4 31 20 22 35.3 54.8 4
2 11.13 5.12 0.64 8.0 29 20 25 29.7 64.0 5
3 12.51 6.19 0.78 7.4 31 23 17 30.8 54.8 4
4 8.40 3.89 0.49 7.5 30 20 22 58.8 56.3 4
5 9.27 6.28 0.84 5.5 31 21 0 61.4 30.3 5
6 8.73 5.76 0.74 8.9 30 22 0 71.3 79.2 4
7 6.36 3.45 0.42 4.1 31 11 0 74.4 16.8 2
8 8.50 6.57 0.87 4.1 31 23 0 76.7 16.8 5
9 7.82 5.69 0.75 4.1 30 21 0 70.7 16.8 4
10 9.14 6.14 0.76 4.5 31 20 0 57.5 20.3 5
11 8.24 4.84 0.65 10.3 30 20 11 46.4 106.1 4
12 12.19 4.88 0.62 6.9 31 21 12 28.9 47.6 4
13 11.88 6.03 0.79 6.6 31 21 25 28.1 43.6 5
14 9.57 4.55 0.60 7.3 28 19 18 39.1 53.3 5
15 10.94 5.71 0.70 8.1 31 23 5 46.8 65.6 4

16 9.58 5.67 0.74 8.4 30 20 7 48.5 70.6 4
17 10.09 6.72 0.85 6.1 31 22 0 59.3 37.2 6
18 8.11 4.95 0.67 4.9 30 22 0 70.0 24.0 4
19 6.83 4.62 0.45 4.6 31 11 0 70.0 21.2 3
20 8.88 6.60 0.95 3.7 31 23 0 74.5 13.7 4
21 7.68 5.01 0.64 4.7 30 20 0 72.1 22.1 4
22 8.47 5.68 0.75 5.3 31 21 1 58.1 28.1 6
23 8.86 5.28 0.70 6.2 30 20 14 44.6 38.4 4
24 10.36 5.36 0.67 6.8 31 20 22 33.4 46.2 4
25 11.08 5.87 0.70 7.5 31 22 28 28.6 56.3 5
> attach(x1A)
> x1A.tree<-tree(y ~ x2+x3+x4+x5+x6+x7+x8+x9+x10)
> x1A.tree
node), split, n, deviance, yval
* denotes terminal node

1) root 25 63.820 9.424


2) x8<37.2 7 3.504 11.450 *
3) x8>37.2 18 20.520 8.637
6) x8<65.7 10 6.288 9.256
12) x5<30.5 5 1.594 8.930 *
13) x5>30.5 5 3.631 9.582 *
7) x8>65.7 8 5.614 7.864 *
> plot.tree(x1A.tree, type="u")
> text.tree(x1A.tree)
> summary(x1A.tree)

Regression tree:
tree(formula = y ~ x2 + x3 + x4 + x5 + x6 + x7 + x8 + x9 + x10)
Variables actually used in tree construction:
[1] "x8" "x5"
Number of terminal nodes: 4
Residual mean deviance: 0.683 = 14.34 / 21
Distribution of residuals:
Min. 1st Qu. Median Mean 3rd Qu. Max.
-1.504 -0.4671 -0.07 8.882e-016 0.64 1.358

> title("Classification Tree \nwith Table A.1 Data")

> plot(prune.tree(x1A.tree,k=2))
> text(plot(prune.tree(x1A.tree,k=2)))
> summary(prune.tree(x1A.tree,k=2))

Regression tree:
snip.tree(tree = x1A.tree, nodes = 6)
Variables actually used in tree construction:
[1] "x8"
Number of terminal nodes: 3
Residual mean deviance: 0.7003 = 15.41 / 22

Distribution of residuals:
Min. 1st Qu. Median Mean 3rd Qu. Max.
-1.504 -0.4671 -0.04375 5.329e-016 0.6363 1.684
> prune.tree(x1A.tree,k=2)
node), split, n, deviance, yval
* denotes terminal node

1) root 25 63.820 9.424


2) x8<37.2 7 3.504 11.450 *
3) x8>37.2 18 20.520 8.637
6) x8<65.7 10 6.288 9.256 *
7) x8>65.7 8 5.614 7.864 *
With the following graph;
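To read the printed tree above: each node line gives the splitting rule, the number of cases n, the node deviance (the
sum of squares of the responses about the node mean), and yval, the node mean used as the prediction. For
example, a new case with x8 = 50 and x5 = 29 goes to node 3 at the root (x8 > 37.2), then to node 6 (x8 < 65.7),
and then to terminal node 12 (x5 < 30.5), so its predicted response is 8.930.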

Cereal Example
The breakfast cereal example output becomes;

*** Tree Model ***

Regression tree:
tree(formula = rating ~ cal + protein + fat + sodium + fiber + carbs +
sugar + potass + vitm, data = cereal, na.action = na.exclude,
mincut = 5, minsize = 10, mindev = 0.01)
Variables actually used in tree construction:
[1] "cal" "sugar" "carbs" "fat" "fiber"
Number of terminal nodes: 11
Residual mean deviance: 105.7 = 6341 / 60
Distribution of residuals:
Min. 1st Qu. Median Mean 3rd Qu. Max.
-25.73 -5.23 0.4829 2.802e-015 4.76 24
node), split, n, deviance, yval
* denotes terminal node

 1) root 71 21600.00 42.67
   2) cal<95 13 2245.00 65.15
     4) cal<75 5 701.70 73.80 *
     5) cal>75 8 934.60 59.73 *
   3) cal>95 58 11310.00 37.63
     6) sugar<8.5 31 3811.00 44.64
      12) carbs<20.5 22 1814.00 48.08
        24) fat<0.5 5 508.70 57.39 *
        25) fat>0.5 17 743.90 45.34
          50) fiber<2.75 11 442.60 43.38
           100) fat<1.5 5 72.83 47.11 *
           101) fat>1.5 6 242.20 40.27 *
          51) fiber>2.75 6 181.70 48.93 *
      13) carbs>20.5 9 1101.00 36.23 *
     7) sugar>8.5 27 4233.00 29.59
      14) fat<1.5 21 1826.00 33.03
        28) sugar<11.5 7 1129.00 39.59 *
        29) sugar>11.5 14 246.10 29.76
          58) fiber<0.5 6 81.35 27.36 *
          59) fiber>0.5 8 104.40 31.56 *
      15) fat>1.5 6 1284.00 17.52 *

With the following graph:

Chapter 16

LOESS Regression

The SAS LOESS procedure implements a nonparametric method for estimating the regression relationship between variables.

16.1 Overview
The LOESS procedure implements a nonparametric method for estimating regression surfaces pioneered
by Cleveland, Devlin, and Grosse (1988), Cleveland and Grosse (1991), and Cleveland, Grosse, and Shyu
(1992). The LOESS procedure allows great flexibility because no assumptions about the parametric form of
the regression surface are needed.
The SAS System provides many regression procedures such as the GLM, REG, and NLIN procedures for
situations in which you can specify a reasonable parametric model for the regression surface. You can use
the LOESS procedure for situations in which you do not know a suitable parametric form of the regression
surface. Furthermore, the LOESS procedure is suitable when there are outliers in the data and a robust
fitting method is necessary.
The main features of the LOESS procedure are as follows:
• fits nonparametric models
• supports the use of multidimensional data
• supports multiple dependent variables
• supports both direct and interpolated fitting using kd trees
• performs statistical inference
• performs iterative reweighting to provide robust fitting when there are outliers in the data
• supports multiple SCORE statements

16.1.1 Local Regression and the Loess Method


Assume that for i = 1, 2, . . . , n, the ith measurement yi of the response y and the corresponding measurement
xi of the vector x of p predictors are related by

yi = g(xi) + ei

where g is the regression function and ei is a random error. The idea of local regression is that at a predictor
x, the regression function g(x) can be locally approximated by the value of a function in some specified
parametric class. Such a local approximation is obtained by fitting a regression surface to the data points
within a chosen neighborhood of the point x.
In the loess method, weighted least squares is used to fit linear or quadratic functions of the predictors at
the centers of neighborhoods. The radius of each neighborhood is chosen so that the neighborhood contains a
specified percentage of the data points. The fraction of the data in each local neighborhood, called the
smoothing parameter, controls the smoothness of the estimated surface. Data points in a given local neighborhood
are weighted by a smooth decreasing function of their distance from the center of the neighborhood.
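The local fit at a single point can be sketched directly. The following Splus function is a minimal
illustration only, not the PROC LOESS algorithm itself; it assumes the commonly used tricube weight
function and fits a weighted quadratic at a single point x0, with span playing the role of the smoothing
parameter described above.

local.fit <- function(x0, x, y, span = 0.6)
{
    # minimal sketch: tricube weights are assumed, and a weighted quadratic
    # in one predictor is fit and evaluated at the single point x0
    q <- max(3, floor(span * length(x)))      # points in the local neighborhood
    d <- abs(x - x0)
    h <- sort(d)[q]                           # neighborhood radius
    w <- rep(0, length(x))
    w[d < h] <- (1 - (d[d < h]/h)^3)^3        # tricube weights, zero outside the neighborhood
    dat <- data.frame(x = x, y = y, w = w)
    fit <- lm(y ~ x + I(x^2), data = dat, weights = w)   # local weighted least squares
    predict(fit, data.frame(x = x0))
}
# ghat <- sapply(sort(x), local.fit, x = x, y = y)   # local fits over the observed x values

Evaluating local.fit over a grid of x values and connecting the fitted values traces out a smooth curve of
the kind PROC LOESS produces; larger values of span give smoother curves.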
In a direct implementation, such fitting is done at each point at which the regression surface is to be
estimated. A much faster computational procedure is to perform such local fitting at a selected sample of
points in predictor space and then to blend these local polynomials to obtain a regression surface.
You can use the LOESS procedure to perform statistical inference provided the error distribution satisfies
some basic assumptions. In particular, such analysis is appropriate when the ei are i.i.d. normal random
variables with mean 0. By using iterative reweighting, the LOESS procedure can also provide statistical
inference when the error distribution is symmetric but not necessarily normal. Furthermore, by doing
iterative reweighting, you can use the LOESS procedure to perform robust fitting in the presence of outliers
in the data.
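Both fitting modes are also available interactively in Splus through the built-in loess() function; as a
hedged sketch (assuming numeric vectors x and y on the search path),

fit.ls  <- loess(y ~ x, span = 0.6, degree = 2)                        # local least squares fit
fit.rob <- loess(y ~ x, span = 0.6, degree = 2, family = "symmetric")  # iteratively reweighted (robust) fit

Here family = "symmetric" requests the iterative reweighting described above, so outlying observations are
downweighted in the local fits.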
While all output of the LOESS procedure can be optionally displayed, most often the LOESS procedure
is used to produce output data sets that will be viewed and manipulated by other SAS procedures. PROC
LOESS uses the Output Delivery System (ODS) to place results in output data sets. This is a departure
from older SAS procedures that provide OUTPUT statements to create SAS data sets from analysis results.

16.2 Examples
16.2.1 Phone Example
The SAS code is:

title1 'Phones Data';


data phones;
input year calls;
cards;
50 4.4
51 4.7
52 4.7
53 5.9
54 6.6
55 7.3
56 8.1
57 8.8
58 10.6
59 12.0
60 13.5
61 14.9
62 16.1
63 21.2
64 119.0
65 124.0
66 142.0
67 159.0
68 182.0
69 212.0
70 43.0
71 24.0
72 27.0
73 29.0
;run;
/*
data phones;
infile 'R:/export/home/jtubbs/ftp/Regression/phones.dat';
input year calls; run;
*/
title2 'Linear Regression';
*Least Squares Procedure;
proc reg data=phones;
* ods html body='R:/export/home/jtubbs/public_html/loess.phones.1.html';
model calls = year;
* ods html body='R:/export/home/jtubbs/public_html/loess.phones.2.html';
output out=a p=yhat;
*plot calls*year;
run;
symbol1 v=star color=black;
symbol2 v=none i=join color=red;
symbol3 v=none i=join color=blue;

*Robust procedure using Proc NLIN with iterated reweighted least squares;
title2 'Tukey Biweight using IRLS';
title3 'Robust = red LS = blue';
proc nlin data=a nohalve;
parms b0=-260 b1=5;
* ods html body='R:/export/home/jtubbs/public_html/loess.phones.3.html';
model calls = b0 + b1*year;
resid=calls-model.calls;
p=model.calls;
sigma=20;der.b0=1;der.b1=year;
b=4.685;
r=abs(resid/sigma);
if r<=b then _weight_=(1-(r/b)**2)**2;
else _weight_=0;
output out=c r=rbi p=p; run;
data c; set c; sigma=40; b=4.685; r=abs(rbi/sigma);
if r<=b then _weight_=(1-(r/b)**2)**2;
else _weight_=0;
proc gplot data=c;
* ods html body='R:/export/home/jtubbs/public_html/loess.phones.4.html';
plot calls*year p*year yhat*year/ overlay;
run;
symbol1 color=blue value=dot;
symbol2 color=red interpol=spline value=none;

symbol3 color=green interpol=spline value=none;
symbol4 color=green interpol=spline value=none;
*Loess Procedure for data;
title2 'LOESS Regression';
proc loess data=phones;
ods output OutputStatistics = phonefit FitSummary=Summary;
* ods html body='R:/export/home/jtubbs/public_html/loess.phones.5.html';
model calls = year / degree=2 smooth= 0.3 0.6 1.0
details(outputstatistics) details(predatvertices);
run;
proc sort data=phonefit;
by SmoothingParameter year;
proc gplot data=phonefit;
by SmoothingParameter;
* ods html body='R:/export/home/jtubbs/public_html/loess.phones.6.html';
plot (DepVar Pred LowerCL UpperCL)*year / overlay;
run;
*Loess Procedure for residuals;
title2 'Loess for Phone Data';
proc loess data=phonefit;
ods output OutputStatistics = residout;
* ods html body='R:/export/home/jtubbs/public_html/loess.phones.7.html';
model Residual = year / degree=2 smooth= 0.3 0.6 1.0 details(kdtree)
details(outputstatistics) details(predatvertices);
run;

data residout; set residout; if obs <=24;


proc sort data=residout; by SmoothingParameter year; run;
proc gplot data=residout;
by SmoothingParameter;
* ods html body='R:/export/home/jtubbs/public_html/loess.phones.8.html';
plot DepVar*year=1 Pred*year=2 / overlay vref=0;
run;
quit;
The corresponding output is:
Phones Data
Linear Regression

Analysis of Variance

Sum of Mean
Source DF Squares Square F Value Pr > F

Model 1 29229 29229 9.25 0.0060


Error 22 69544 3161.06999
Corrected Total 23 98773

Root MSE 56.22339 R-Square 0.2959

Dependent Mean 49.99167 Adj R-Sq 0.2639
Coeff Var 112.46553

Parameter Estimates

Parameter Standard
Variable DF Estimate Error t Value Pr > |t|

Intercept 1 -260.05925 102.60700 -2.53 0.0189


year 1 5.04148 1.65794 3.04 0.0060

_______________________________________________________________
Phones Data
Tukey Biweight using IRLS
Robust = red LS = blue

The NLIN Procedure


Iterative Phase
Dependent Variable calls
Method: Gauss-Newton

Weighted
Iter b0 b1 SS

0 -260.0 5.0000 11962.0


1 -153.8 2.9893 4647.8
2 -68.8011 1.4062 322.0
3 -63.1815 1.2986 296.6
4 -63.0315 1.2958 296.5
5 -63.0282 1.2958 296.5

NOTE: Convergence criterion met.

Estimation Summary

Method Gauss-Newton
Iterations 5
R 2.786E-6
PPC(b0) 1.131E-6
RPC(b0) 0.000052
Object 4.621E-6
Objective 296.5022
Observations Read 24
Observations Used 18
Observations Missing 6

Sum of Mean Approx
Source DF Squares Square F Value Pr > F

Regression 2 5301.0 2650.5 83.85 <.0001


Residual 16 296.5 18.5314
Uncorrected Total 18 5597.5

Corrected Total 17 1850.3


Approx
Parameter Estimate Std Error Approximate 95% Confidence Limits

b0 -63.0282 8.5227 -81.0954 -44.9610


b1 1.2958 0.1415 0.9958 1.5957

Approximate Correlation Matrix


b0 b1

b0 1.0000000 -0.9928601
b1 -0.9928601 1.0000000
_________________________________________________________________
Phones Data
LOESS Regression

The LOESS Procedure

Independent Variable Scaling

Scaling applied: None

Statistic year

Minimum Value 50.00000


Maximum Value 73.00000


Smoothing Parameter: 0.3
Dependent Variable: calls

Predicted Values at kd Tree Vertices

Vertex Predicted
Number year Value

1 50.00000 4.41320
2 51.00000 4.58248
3 52.00000 5.02038
4 53.00000 5.72984
5 54.00000 6.63094
6 55.00000 7.32320
7 56.00000 7.99172
8 57.00000 9.00883
9 58.00000 10.44531
10 59.00000 12.05414
11 60.00000 13.49227
12 61.00000 14.57516
13 62.00000 9.54896
14 63.00000 42.41547
15 64.00000 96.46968
16 65.00000 133.26582
17 66.00000 140.37578
18 67.00000 159.46406
19 68.00000 198.01022
20 69.00000 169.07405
21 70.00000 79.89311
22 71.00000 28.91523
23 72.00000 13.93099
24 73.00000 34.06159

Output Statistics

Predicted
Obs year calls calls

1 50.00000 4.40000 4.41320


2 51.00000 4.70000 4.58248
3 52.00000 4.70000 5.02038
4 53.00000 5.90000 5.72984
5 54.00000 6.60000 6.63094
6 55.00000 7.30000 7.32320
7 56.00000 8.10000 7.99172
8 57.00000 8.80000 9.00883
9 58.00000 10.60000 10.44531
10 59.00000 12.00000 12.05414
11 60.00000 13.50000 13.49227
12 61.00000 14.90000 14.57516
13 62.00000 16.10000 9.54896


14 63.00000 21.20000 42.41547
15 64.00000 119.00000 96.46968
16 65.00000 124.00000 133.26582
17 66.00000 142.00000 140.37578
18 67.00000 159.00000 159.46406
19 68.00000 182.00000 198.01022
20 69.00000 212.00000 169.07405
21 70.00000 43.00000 79.89311
22 71.00000 24.00000 28.91523
23 72.00000 27.00000 13.93099
24 73.00000 29.00000 34.06159

Smoothing Parameter: 0.3
Dependent Variable: calls

Fit Summary

Fit Method kd Tree


Blending Linear
Number of Observations 24
Number of Fitting Points 24
kd Tree Bucket Size 1
Degree of Local Polynomials 2
Smoothing Parameter 0.30000
Points in Local Neighborhood 7
Residual Sum of Squares 4770.31488



Smoothing Parameter: 0.6
Dependent Variable: calls

Predicted Values at kd Tree Vertices

Vertex Predicted

Number year Value

1 50.00000 4.34864
2 51.00000 4.66114
3 52.00000 5.10032
4 54.00000 6.36513
5 55.00000 7.19961
6 57.00000 9.23011
7 58.00000 8.87318
8 60.00000 7.67110
9 61.00000 16.01639
10 63.00000 55.63975
11 64.00000 85.66250
12 66.00000 154.05459
13 67.00000 166.58650
14 69.00000 138.61922
15 70.00000 114.06367
16 72.00000 37.10261
17 73.00000 -14.50157

Output Statistics

Predicted
Obs year calls calls

1 50.00000 4.40000 4.34864


2 51.00000 4.70000 4.66114
3 52.00000 4.70000 5.10032
4 53.00000 5.90000 5.73272
5 54.00000 6.60000 6.36513
6 55.00000 7.30000 7.19961
7 56.00000 8.10000 8.21486
8 57.00000 8.80000 9.23011
9 58.00000 10.60000 8.87318
10 59.00000 12.00000 8.27214
11 60.00000 13.50000 7.67110
12 61.00000 14.90000 16.01639
13 62.00000 16.10000 35.82807
14 63.00000 21.20000 55.63975
15 64.00000 119.00000 85.66250
16 65.00000 124.00000 119.85854
17 66.00000 142.00000 154.05459
18 67.00000 159.00000 166.58650
19 68.00000 182.00000 152.60286
20 69.00000 212.00000 138.61922


21 70.00000 43.00000 114.06367
22 71.00000 24.00000 75.58314
23 72.00000 27.00000 37.10261
24 73.00000 29.00000 -14.50157


Smoothing Parameter: 0.6
Dependent Variable: calls

Fit Summary

Fit Method kd Tree


Blending Linear
Number of Observations 24
Number of Fitting Points 17
kd Tree Bucket Size 2
Degree of Local Polynomials 2
Smoothing Parameter 0.60000
Points in Local Neighborhood 14
Residual Sum of Squares 18914



Smoothing Parameter: 1
Dependent Variable: calls

Predicted Values at kd Tree Vertices

Vertex Predicted
Number year Value

1 50.00000 15.12017
2 52.00000 3.09888
3 55.00000 -1.96298
4 58.00000 9.31005
5 61.00000 36.79542

6 64.00000 90.13736
7 67.00000 104.06626
8 70.00000 89.41573
9 73.00000 40.80335

Output Statistics

Predicted
Obs year calls calls

1 50.00000 4.40000 15.12017


2 51.00000 4.70000 9.10953
3 52.00000 4.70000 3.09888
4 53.00000 5.90000 1.41160
5 54.00000 6.60000 -0.27569
6 55.00000 7.30000 -1.96298
7 56.00000 8.10000 1.79470
8 57.00000 8.80000 5.55237
9 58.00000 10.60000 9.31005
10 59.00000 12.00000 18.47184
11 60.00000 13.50000 27.63363
12 61.00000 14.90000 36.79542
13 62.00000 16.10000 54.57607
14 63.00000 21.20000 72.35671
15 64.00000 119.00000 90.13736
16 65.00000 124.00000 94.78033
17 66.00000 142.00000 99.42329
18 67.00000 159.00000 104.06626
19 68.00000 182.00000 99.18275
20 69.00000 212.00000 94.29924
21 70.00000 43.00000 89.41573
22 71.00000 24.00000 73.21160
23 72.00000 27.00000 57.00748
24 73.00000 29.00000 40.80335


Smoothing Parameter: 1
Dependent Variable: calls

Fit Summary

Fit Method kd Tree


Blending Linear
Number of Observations 24
Number of Fitting Points 9
kd Tree Bucket Size 4

Degree of Local Polynomials 2
Smoothing Parameter 1.00000
Points in Local Neighborhood 24
Residual Sum of Squares 38006
The SAS graphs for the phone data follow.

16.2.2 Motor Cycle Example
The SAS graphs for the motor cycle data follow.

