
Theory of Linear Models

Steven Gilmour
King’s College London

January – February 2020


Least Squares Estimation

Any solution β̂ of the normal equations

X′Xβ = X′Y

minimises S = (Y − Xβ)′(Y − Xβ).

For any p × 1 vector c, c′β̂ is defined to be a least squares estimator of c′β.
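As a small illustration of these definitions, the following sketch (NumPy assumed, with made-up data) obtains one solution of the normal equations via a pseudoinverse and evaluates c′β̂ for a chosen c.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative data: n = 20 observations, p = 3 columns (intercept, two covariates)
n, p = 20, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
Y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(scale=0.3, size=n)

# One solution of the normal equations X'X beta = X'Y; the pseudoinverse
# gives a solution even when X'X is singular.
beta_hat = np.linalg.pinv(X.T @ X) @ (X.T @ Y)

# A least squares estimator of c'beta, here c picks out beta_1
c = np.array([0.0, 1.0, 0.0])
print(c @ beta_hat)

# Check that beta_hat solves the normal equations (up to rounding error)
print(np.allclose(X.T @ X @ beta_hat, X.T @ Y))
```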
Properties of Least Squares Estimators

Theorem
The following conditions are equivalent:
• c′β̂ is linear in Y and an unbiased estimator of c′β;
• c′β̂ is unique for all solutions of the normal equations;
• c is in the vector space defined by the columns of X′X;
• c is in the vector space defined by the columns of X′;
• there exists a linear function of Y with expectation c′β.

If any of the conditions of Theorem 1 hold, we say that c′β is estimable.
Example (simple linear regression): This theorem implies that if there are unique least squares estimators of the parameters β0 and β1, these estimators will be unbiased and linear in Y.

In the case where xi = x ∀i = 1, . . . , n, β0 + β1x is estimable but, for example, β1 is not estimable.
Theorem (Gauss-Markov Theorem)
If c′β is estimable, then c′β̂ has minimum variance in the class of unbiased estimators which are linear in Y.

We call c′β̂ the best linear unbiased estimator (BLUE) of c′β.


Example (simple linear regression): The estimator

β̃1 = (Y2 − Y1 )/(x2 − x1 )

is an unbiased estimator of β1 and is linear in Y.

The Gauss-Markov Theorem shows that this estimator has higher variance than the least squares estimator (if n > 2).

Note that it is easy to find biased estimators with smaller variance than least squares estimators, e.g. β̃1 = 0 has variance zero.
There are other (more or less) sensible methods of estimation, e.g.
L1 -norm, ridge regression methods, maximum likelihood, empirical
Bayes, subjective Bayes. They all produce biased estimators
(except in a few special cases in which they are equivalent to least
squares).

Best linear unbiased estimation is a very traditional interpretation of what it means for an estimator to be “good”. Nevertheless, it is difficult to argue against it on general grounds.

The particular emphasis on unbiasedness suggests that it is particularly suited to situations in which multiple studies will be carried out.
Theorem
All linear functions of β are estimable if and only if rank(X) = p.

If rank(X) = p, i.e. X is of full rank, then

β̂ = (X′X)⁻¹X′Y

is the unique solution of the normal equations.

If rank(X) < p, then unbiased estimators do not exist for some functions of the parameters. They are said to be inestimable or confounded.

Example (simple linear regression): rank(X) = 2, unless xi = x ∀i = 1, . . . , n, so this is the only case in which inestimable functions exist.

Otherwise
\[
(X'X)^{-1} = \frac{1}{n\sum x_i^2 - \left(\sum x_i\right)^2}
\begin{pmatrix} \sum x_i^2 & -\sum x_i \\ -\sum x_i & n \end{pmatrix}
\]
and
\[
\hat{\beta} = \frac{1}{n\sum x_i^2 - \left(\sum x_i\right)^2}
\begin{pmatrix} \sum x_i^2 \sum Y_i - \sum x_i \sum x_i Y_i \\ n\sum x_i Y_i - \sum x_i \sum Y_i \end{pmatrix}.
\]
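As a quick numerical check of these closed-form expressions, a sketch (NumPy assumed, simulated data) that evaluates the sums above and compares the result with a generic least squares solver:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, size=30)
Y = 2.0 + 0.7 * x + rng.normal(scale=0.5, size=30)
n = len(x)

# Closed-form solution for simple linear regression
det = n * np.sum(x**2) - np.sum(x)**2
beta_hat = np.array([
    np.sum(x**2) * np.sum(Y) - np.sum(x) * np.sum(x * Y),
    n * np.sum(x * Y) - np.sum(x) * np.sum(Y),
]) / det

# Compare with a generic least squares fit of Y on [1 x]
X = np.column_stack([np.ones(n), x])
print(beta_hat, np.linalg.lstsq(X, Y, rcond=None)[0])
```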
Example (one-way analysis of variance): Consider the model for t
treatments,
µri = β0 + τr ,
where observation i has treatment r .

If we write the model in full with t + 1 parameters, β′ = [β0 τ1 · · · τt], we have
\[
X = \begin{pmatrix}
1 & 1 & 0 & \cdots & 0 & 0 \\
\vdots & \vdots & \vdots & & \vdots & \vdots \\
1 & 1 & 0 & \cdots & 0 & 0 \\
\vdots & \vdots & \vdots & & \vdots & \vdots \\
1 & 0 & 0 & \cdots & 0 & 1 \\
\vdots & \vdots & \vdots & & \vdots & \vdots \\
1 & 0 & 0 & \cdots & 0 & 1
\end{pmatrix}
\]
and it is immediately obvious that rank(X) = t < t + 1.

Hence there are some functions of the parameters which are inestimable.

If we use only t parameters, β′ = [β0 τ2 · · · τt], then rank(X) = t and all linear functions of the parameters are estimable.

In general, how can we determine which functions are estimable and which are not?
Generalized Inverse

Definition
A generalized inverse of an m × n matrix A is any matrix A⁻ such that AA⁻A = A.

There is not, in general, a unique generalized inverse, but if A is a nonsingular square matrix, then A⁻¹ is the unique generalized inverse.

If X′X is singular, we can find a particular solution of the normal equations by finding a generalized inverse of X′X. Different generalized inverses can lead to different solutions.

One way to find a generalized inverse of X′X is to append some rows to X.

If rank(X) = q < p and X* is a (p − q) × p matrix such that
\[
\operatorname{rank}\begin{pmatrix} X \\ X^* \end{pmatrix} = p,
\]
then (X′X + X*′X*)⁻¹ is a generalized inverse of X′X.

This can be thought of as using pseudo-data to allow estimation.

Example (simple linear regression): If xi = x, i = 1, . . . , n, then we can append the row X* = [1 x*], where x* ≠ x.

Then
\[
X'X + X^{*\prime}X^* = \begin{pmatrix} n+1 & nx + x^* \\ nx + x^* & nx^2 + x^{*2} \end{pmatrix}
\]
and
\[
(X'X)^- = \frac{1}{(n+1)(nx^2 + x^{*2}) - (nx + x^*)^2}
\begin{pmatrix} nx^2 + x^{*2} & -(nx + x^*) \\ -(nx + x^*) & n+1 \end{pmatrix}.
\]
Example (one-way analysis of variance): To fit the model

µri = β0 + τr ,

we need one additional row and we can use X∗ = [1 1 · · · 1].

Consider the case of two treatments, each replicated twice. Then
\[
X'X = \begin{pmatrix} 4 & 2 & 2 \\ 2 & 2 & 0 \\ 2 & 0 & 2 \end{pmatrix},
\qquad
X^{*\prime}X^* = \begin{pmatrix} 1 & 1 & 1 \\ 1 & 1 & 1 \\ 1 & 1 & 1 \end{pmatrix},
\]
so that
\[
X'X + X^{*\prime}X^* = \begin{pmatrix} 5 & 3 & 3 \\ 3 & 3 & 1 \\ 3 & 1 & 3 \end{pmatrix}.
\]
Hence, a generalized inverse is
\[
(X'X)^- = \begin{pmatrix} 2 & -3/2 & -3/2 \\ -3/2 & 3/2 & 1 \\ -3/2 & 1 & 3/2 \end{pmatrix}.
\]
Since
\[
X'Y = \begin{pmatrix} Y_1 + Y_2 + Y_3 + Y_4 \\ Y_1 + Y_2 \\ Y_3 + Y_4 \end{pmatrix},
\]
a least squares estimator of β is
\[
\hat{\beta} = \begin{pmatrix} (Y_1 + Y_2 + Y_3 + Y_4)/2 \\ -(Y_3 + Y_4)/2 \\ -(Y_1 + Y_2)/2 \end{pmatrix}.
\]
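This worked example is easy to reproduce numerically. A sketch (NumPy assumed; the responses Y1, …, Y4 are replaced by arbitrary illustrative numbers):

```python
import numpy as np

# Two treatments, each replicated twice; columns are (intercept, tau_1, tau_2)
X = np.array([[1, 1, 0],
              [1, 1, 0],
              [1, 0, 1],
              [1, 0, 1]], dtype=float)
Y = np.array([3.0, 4.0, 6.0, 7.0])      # arbitrary illustrative responses

X_star = np.array([[1.0, 1.0, 1.0]])    # pseudo-data row
G = np.linalg.inv(X.T @ X + X_star.T @ X_star)  # a generalized inverse of X'X

print(G)                      # approx [[2, -1.5, -1.5], [-1.5, 1.5, 1], [-1.5, 1, 1.5]]
beta_hat = G @ X.T @ Y
print(beta_hat)               # [(Y1+Y2+Y3+Y4)/2, -(Y3+Y4)/2, -(Y1+Y2)/2]

# Verify that G really is a generalized inverse: (X'X) G (X'X) = X'X
A = X.T @ X
print(np.allclose(A @ G @ A, A))
```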
Theorem
c′β is estimable if and only if c′{I − (X′X)⁻(X′X)} = 0.

This allows us to find estimable functions of parameters in cases where the normal equations do not have a unique solution.
Example (one-way analysis of variance): Consider first estimating τ1, i.e. c′ = [0 1 0].
\[
(X'X)^-(X'X) = \begin{pmatrix} 2 & 1 & 1 \\ -1 & 0 & -1 \\ -1 & -1 & 0 \end{pmatrix}
\;\Rightarrow\;
I - (X'X)^-(X'X) = \begin{pmatrix} -1 & -1 & -1 \\ 1 & 1 & 1 \\ 1 & 1 & 1 \end{pmatrix}
\]
\[
\Rightarrow\; c'\{I - (X'X)^-(X'X)\} = [1\ 1\ 1]
\]
and so τ1 is not estimable.


Consider instead estimating τ2 − τ1, i.e. c′ = [0 −1 1].

In this case
c′{I − (X′X)⁻(X′X)} = [0 0 0]
and so τ2 − τ1 is estimable.

The estimator is

c′β̂ = (Y3 + Y4)/2 − (Y1 + Y2)/2.

This is why overparameterised models, such as this one, can be used directly. It is possible, but not necessary, to impose some constraints on the parameters, e.g. τ1 = 0. This example shows that, whatever constraints we impose, we will always get the same estimator for τ2 − τ1.
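The estimability condition of the theorem above is straightforward to automate. A sketch (NumPy assumed), continuing the two-treatment example; the helper function is_estimable is purely illustrative:

```python
import numpy as np

X = np.array([[1, 1, 0],
              [1, 1, 0],
              [1, 0, 1],
              [1, 0, 1]], dtype=float)
A = X.T @ X
G = np.linalg.pinv(A)            # any generalized inverse of X'X will do

def is_estimable(c, A, G, tol=1e-10):
    """c'beta is estimable iff c'{I - (X'X)^- (X'X)} = 0 (numerically)."""
    return np.max(np.abs(c @ (np.eye(A.shape[0]) - G @ A))) < tol

print(is_estimable(np.array([0.0, 1.0, 0.0]), A, G))    # tau_1: False
print(is_estimable(np.array([0.0, -1.0, 1.0]), A, G))   # tau_2 - tau_1: True
print(is_estimable(np.array([1.0, 1.0, 0.0]), A, G))    # beta_0 + tau_1: True
```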
Variance and Covariance of Estimators
Theorem 1 told us that if c′β is estimable, then c is in the vector space defined by X′X, so that c = X′Xγ for some vector γ.

Hence,
\begin{align*}
V(c'\hat{\beta}) &= V\{c'(X'X)^- X'Y\} \\
&= c'(X'X)^- X'\,V(Y)\,X\{(X'X)^-\}'c \\
&= \sigma^2 \gamma' X'X(X'X)^- X'X(X'X)^- c \\
&= \sigma^2 \gamma' X'X(X'X)^- c \\
&= \sigma^2 c'(X'X)^- c.
\end{align*}

Similarly, for a p × 1 vector d,
\[
\mathrm{Cov}(c'\hat{\beta}, d'\hat{\beta}) = \sigma^2 c'(X'X)^- d.
\]
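Continuing the same example, a short sketch (NumPy assumed, with an arbitrary illustrative σ²) of the variance formula for the estimable contrast τ2 − τ1, whose variance should come out as σ²(1/2 + 1/2):

```python
import numpy as np

X = np.array([[1, 1, 0],
              [1, 1, 0],
              [1, 0, 1],
              [1, 0, 1]], dtype=float)
G = np.linalg.pinv(X.T @ X)       # a generalized inverse of X'X
sigma2 = 1.0                      # illustrative error variance

c = np.array([0.0, -1.0, 1.0])    # tau_2 - tau_1
print(sigma2 * c @ G @ c)         # should equal sigma^2 * (1/2 + 1/2) = 1.0
```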


Estimating σ 2

In order to do interval estimation or hypothesis testing, we need to be able to estimate σ².

We will use the following results.

Theorem

X′ = X′X(X′X)⁻X′.
Theorem

Y′{I − X(X′X)⁻X′}Y = (Y − Xβ)′{I − X(X′X)⁻X′}(Y − Xβ).

Theorem
Let Z be an n × 1 random vector and A an n × n constant matrix. Then
\[
E\left[\{Z - E(Z)\}' A \{Z - E(Z)\}\right]
= \sum_{i=1}^{n} a_{ii} V(Z_i) + 2\sum_{i=1}^{n-1}\sum_{j=i+1}^{n} a_{ij}\,\mathrm{Cov}(Z_i, Z_j).
\]
Using the results above, we find the following.

Theorem
\[
E\left\{(Y - X\hat{\beta})'(Y - X\hat{\beta})\right\} = (n - r)\sigma^2,
\]
where r = rank(X).

Hence S² = SSR/(n − r) is an unbiased estimator of σ², where SSR = (Y − Xβ̂)′(Y − Xβ̂) is the residual sum of squares.

Note that this result applies whether or not the design matrix is of full rank.

It can also be shown that S² is a consistent estimator of σ².

Results on the optimality of S² as an estimator of σ² are not available in general, but are available under certain assumptions, e.g. normality.
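A sketch of this estimator on simulated data (NumPy assumed): S² = SSR/(n − r) uses the rank of X rather than the number of columns, so it also works for an overparameterised design.

```python
import numpy as np

rng = np.random.default_rng(2)
sigma = 0.5

# Overparameterised one-way ANOVA design: rank(X) = 2 < 3 columns
X = np.kron(np.array([[1, 1, 0], [1, 0, 1]]), np.ones((10, 1)))
n = X.shape[0]
Y = X @ np.array([1.0, 0.5, -0.5]) + rng.normal(scale=sigma, size=n)

beta_hat = np.linalg.pinv(X.T @ X) @ X.T @ Y
resid = Y - X @ beta_hat
r = np.linalg.matrix_rank(X)
S2 = resid @ resid / (n - r)     # unbiased estimator of sigma^2
print(r, S2, sigma**2)
```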
Minimum Variance Unbiased Estimation
If we adopt a fully parametric approach, then we can often find
stronger results.

Theorem
If Y ∼ N(Xβ, σ²I) then, for any estimable function c′β, c′β̂ is the unique minimum variance unbiased estimator of c′β.

Note that we are no longer restricted to linear estimators, i.e. under normality, there can be no nonlinear unbiased estimator better than the least squares estimator.

Theorem
If each εi has the same distribution, which has all moments finite, and c′β̂ is the minimum variance unbiased estimator of c′β, then εi has a normal distribution.
Notes:
• Theorem 11 says that the normal distribution is the only (nice) distribution for which the least squares estimators are minimum variance unbiased estimators.
• Theorems 10 and 11 together suggest that the additional assumption of normality is worth a great deal, in the sense that without this assumption, we know that we are using inferior estimators.
• However, for many other distributions, the minimum variance linear unbiased estimator will be almost as good as the minimum variance unbiased estimator.
• Overall, least squares estimation can be seen to be quite robust to distributional assumptions.

Theorem
If Y ∼ N(Xβ, σ 2 I) then S 2 is the minimum variance unbiased
estimator of σ 2 .
Maximum Likelihood Estimation
If Y ∼ N(Xβ, σ²I) then the likelihood is
\[
L(\beta, \sigma^2 \mid Y) = (2\pi\sigma^2)^{-n/2} \exp\left\{ -\frac{1}{2\sigma^2} (Y - X\beta)'(Y - X\beta) \right\}.
\]
It is immediately obvious that the likelihood is maximised by β̂.

It is also easy to show that the MLE of σ 2 is

σ̂ 2 = SSR /n.

Note that
\[
\hat{\sigma}^2 = \frac{n - r}{n}\, S^2
\]
and is a biased estimator of σ².

For other distributions, β̂ is not, in general, the MLE of β.


Minimum Mean Square Error Estimation
It is easy to show that, among estimators of σ² of the form SSR/m, that which minimises the mean square error is
\[
\tilde{\sigma}^2 = \frac{\mathrm{SS}_R}{n - r + 2}.
\]

If rank(X) = p and p ≥ 3, it can be shown that the James-Stein shrinkage estimator,
\[
\tilde{\beta} = \left\{ 1 - \frac{(p-2)\tilde{\sigma}^2}{\hat{\beta}' X'X \hat{\beta}} \right\} \hat{\beta},
\]
of β has uniformly lower tr{X′X MSE(β̃)} than β̂.

This might seem like a convincing argument against least squares estimators. However, note that an individual component β̂r can have lower MSE than β̃r.
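A sketch of this shrinkage estimator on simulated full-rank data (NumPy assumed); σ̃² is the minimum-MSE estimator of σ² defined above, and the bracketed factor shrinks β̂ towards zero.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 50, 5                      # need p >= 3 for the result quoted above
X = rng.normal(size=(n, p))
beta = np.array([0.2, -0.1, 0.0, 0.1, 0.3])
Y = X @ beta + rng.normal(size=n)

beta_hat = np.linalg.lstsq(X, Y, rcond=None)[0]
SSR = np.sum((Y - X @ beta_hat) ** 2)
r = np.linalg.matrix_rank(X)
sigma2_tilde = SSR / (n - r + 2)          # minimum-MSE estimator of sigma^2

shrink = 1.0 - (p - 2) * sigma2_tilde / (beta_hat @ X.T @ X @ beta_hat)
beta_tilde = shrink * beta_hat            # James-Stein-type shrinkage estimator
print(beta_hat)
print(beta_tilde)
```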
L1 -Norm Regression

If the εi are assumed to be independent and to have double exponential distributions, then the maximum likelihood estimator is equivalent to the L1-norm or least absolute deviations (LAD) regression estimator, which minimises
\[
\sum_{i=1}^{n} |\epsilon_i|.
\]

L1-norm estimation is more robust to outliers than least squares (L2-norm), in the sense that an extreme observation is much more likely from the double exponential than from the normal distribution.

We can also use Ld-norm regression for other values of d, e.g. d = 1.5, or we can find MLEs for other distributions for εi.
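A sketch of LAD estimation (NumPy and SciPy assumed; a linear-programming formulation is the more standard computational route, but a general-purpose optimiser keeps the example short, and the data and outliers are made up):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(4)
n = 60
x = rng.uniform(0, 10, size=n)
Y = 1.0 + 0.5 * x + rng.laplace(scale=1.0, size=n)   # double exponential errors
Y[:3] += 20.0                                        # a few gross outliers
X = np.column_stack([np.ones(n), x])

beta_ls = np.linalg.lstsq(X, Y, rcond=None)[0]       # L2-norm (least squares)

# L1-norm / LAD: minimise the sum of absolute residuals
def lad_objective(b):
    return np.sum(np.abs(Y - X @ b))

beta_lad = minimize(lad_objective, beta_ls, method="Nelder-Mead").x

print(beta_ls)    # pulled towards the outliers
print(beta_lad)   # much closer to (1.0, 0.5)
```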
M-Estimators

An M-estimator of β is obtained by defining a function ρ(u) and minimising
\[
\sum_{i=1}^{n} \rho\!\left( \frac{\epsilon_i}{S} \right),
\]
where S is some estimator of σ.

ρ(u) = u² is least squares and ρ(u) = |u| is L1-norm estimation.

Various other recommendations include Huber's function,
\[
\rho(u) = \begin{cases} \dfrac{u^2}{2} & -a \le u \le a; \\[4pt] a|u| - \dfrac{a^2}{2} & \text{otherwise,} \end{cases}
\]
and Tukey's biweight function,
\[
\rho(u) = \begin{cases} \dfrac{u^2}{2} - \dfrac{u^4}{4a^2} & -a \le u \le a; \\[4pt] \dfrac{a^2}{4} & \text{otherwise.} \end{cases}
\]

These lead to weighted least squares estimators, but with the weights depending on β. Observations with large standardised residuals are downweighted.

It is also possible to choose a weight function directly, without defining ρ(u).
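A sketch of the weighted least squares idea with Huber's ρ, implemented as iteratively reweighted least squares (NumPy assumed; the tuning constant a = 1.345 and the MAD-based choice of S are common conventions, not something fixed by these notes):

```python
import numpy as np

def huber_irls(X, Y, a=1.345, n_iter=50):
    """M-estimation with Huber's rho via iteratively reweighted least squares."""
    beta = np.linalg.lstsq(X, Y, rcond=None)[0]       # start from least squares
    for _ in range(n_iter):
        resid = Y - X @ beta
        S = 1.4826 * np.median(np.abs(resid - np.median(resid)))  # robust scale
        u = resid / S
        # Huber weights w(u) = psi(u)/u: 1 inside [-a, a], a/|u| outside
        w = np.minimum(1.0, a / np.maximum(np.abs(u), 1e-12))
        WX = X * w[:, None]
        beta = np.linalg.solve(WX.T @ X, WX.T @ Y)    # weighted least squares step
    return beta

rng = np.random.default_rng(5)
x = rng.uniform(0, 10, size=80)
Y = 2.0 - 0.3 * x + rng.normal(scale=0.5, size=80)
Y[:5] += 15.0                                          # outliers, downweighted by Huber
X = np.column_stack([np.ones_like(x), x])
print(huber_irls(X, Y))
```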
Ranked Residuals Regression

One example of this, using the Wilcoxon score function, estimates β by minimising
\[
\sum_{i=1}^{n} \left( \frac{i}{n+1} - \frac{1}{2} \right) \epsilon_{[i]},
\]
where ε[i] are the ranked deviations.

This method gives most weight to the large residuals.

Other score functions have been suggested.


Least Median Squares

The least median squares estimator of β minimises Median(εi²).

This involves finding the narrowest strip which covers half of the observations, the estimate being in the middle of this strip.
PCR and PLS
Principal component regression (PCR) involves replacing the
columns of X (usually centred and scaled) with their principal
components and using the latter as the explanatory variables.

Usually, we can use a relatively small number of principal components to explain most of the variation in y.
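A sketch of PCR (NumPy assumed; the number of retained components k is a tuning choice, fixed arbitrarily here): centre and scale X, take the leading principal components as the explanatory variables, and regress y on them.

```python
import numpy as np

def pcr_fit(X, y, k):
    """Principal component regression keeping the first k components."""
    Xc = (X - X.mean(axis=0)) / X.std(axis=0)       # centre and scale columns
    yc = y - y.mean()
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    Z = Xc @ Vt[:k].T                               # scores on the first k components
    gamma = np.linalg.lstsq(Z, yc, rcond=None)[0]   # regress y on the scores
    return Vt[:k].T @ gamma                         # back to the scaled-X coordinates

rng = np.random.default_rng(6)
n = 100
z = rng.normal(size=n)
X = np.column_stack([z + 0.01 * rng.normal(size=n) for _ in range(5)])  # near-collinear
y = X @ np.array([1.0, 0.5, -0.5, 0.2, 0.1]) + rng.normal(scale=0.1, size=n)
print(pcr_fit(X, y, k=1))
```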

Partial least squares or projection to latent structures (PLS) regression represents a compromise between PCR and OLS.

In fact, it is possible to define continuum regression, which represents a smooth transition between PCR and OLS.

Opinion: PCR is entirely meaningless except, perhaps, in calibration situations, i.e. when the “responses” were fixed and the “explanatory variables” were observed and we want to do the reverse regression.
Ridge Regression

If rank(X) < p, or rank(X) = p but (numerically) there are near linear dependencies among the columns of X, the method of ridge regression has been recommended.

The ridge regression estimator of β is
\[
\hat{\beta}_R = (X'X + \lambda I)^{-1} X'Y.
\]

If the response data and columns of X are centred, then the ridge regression estimator can also be written as the least squares estimator with response vector [Y′ 0′]′ and design matrix [X′ √λ I]′.
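A sketch of the ridge estimator and a check of its augmented-data representation (NumPy assumed; centred simulated data, λ fixed arbitrarily):

```python
import numpy as np

rng = np.random.default_rng(7)
n, p = 50, 4
X = rng.normal(size=(n, p))
X = X - X.mean(axis=0)                              # centre columns
Y = X @ np.array([1.0, -1.0, 0.5, 0.0]) + rng.normal(size=n)
Y = Y - Y.mean()                                    # centre response
lam = 2.0

# Direct form of the ridge estimator
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ Y)

# Augmented-data form: least squares with extra rows sqrt(lambda)*I and zero responses
X_aug = np.vstack([X, np.sqrt(lam) * np.eye(p)])
Y_aug = np.concatenate([Y, np.zeros(p)])
beta_aug = np.linalg.lstsq(X_aug, Y_aug, rcond=None)[0]

print(np.allclose(beta_ridge, beta_aug))            # the two forms agree
```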
Ridge regression appears to have some similarity to the use of a generalized inverse, but the philosophy is different:
• A generalized inverse allows us to estimate the estimable functions, when not all functions are estimable.
• Ridge regression allows us to estimate inestimable functions.

Ridge regression estimators have
\[
\mathrm{Bias}(\hat{\beta}_R) = -\lambda(X'X + \lambda I)^{-1}\beta
\]
and
\[
V(\hat{\beta}_R) = \sigma^2 (X'X + \lambda I)^{-1} X'X (X'X + \lambda I)^{-1},
\]
and so we gain lower variances by increasing the bias.

Ridge regression is controversial, but is particularly useful when there is strong prior knowledge that the elements of β should not be “too large”.

It can be shown that there exists a λ > 0 for which the total mean square error of β̂R is smaller than that of β̂, but this value depends on β and so cannot be found.

In practice λ is usually chosen (graphically) by comparing β̂R for various λ and choosing the smallest value for which the estimates are stable.
Bayesian Estimation
The natural conjugate prior is inverse gamma for σ 2 and
conditionally multivariate normal for β|σ 2 . Let the prior
expectation and variance be β P and σ 2 A respectively.

This leads to a conditionally multivariate normal posterior for β|σ², Y with variance matrix
\[
\sigma^2 (A^{-1} + X'X)^{-1}
\]
and expectation
\[
\sigma^2 (A^{-1} + X'X)^{-1}\left( \frac{1}{\sigma^2} A^{-1}\beta_P + \frac{1}{\sigma^2} X'Y \right)
= (A^{-1} + X'X)^{-1}\left( A^{-1}\beta_P + X'Y \right).
\]

In the noninformative prior case, where A = cI with c → ∞, this reduces to least squares.
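A sketch of this conditional posterior (NumPy assumed; σ² is treated as known, and the prior settings are arbitrary), together with a check that a very diffuse prior essentially returns the least squares estimate:

```python
import numpy as np

rng = np.random.default_rng(8)
n, p = 40, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
Y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(scale=0.7, size=n)
sigma2 = 0.7 ** 2                       # treated as known here

beta_prior = np.zeros(p)                # prior mean beta_P
A = np.eye(p)                           # prior variance of beta is sigma^2 * A

# Conditional posterior of beta given sigma^2 and Y
post_prec = np.linalg.inv(A) + X.T @ X
post_mean = np.linalg.solve(post_prec, np.linalg.inv(A) @ beta_prior + X.T @ Y)
post_var = sigma2 * np.linalg.inv(post_prec)

# With a very diffuse prior (A = cI, c large) the posterior mean is essentially
# the least squares estimate
A_diffuse = 1e8 * np.eye(p)
post_mean_diffuse = np.linalg.solve(np.linalg.inv(A_diffuse) + X.T @ X,
                                    np.linalg.inv(A_diffuse) @ beta_prior + X.T @ Y)
print(post_mean)
print(post_mean_diffuse, np.linalg.lstsq(X, Y, rcond=None)[0])
```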
Of course, the Bayesian approach is very flexible and can meet all
the objectives of the methods described above, for example:

• Appropriate distributional assumptions can be made to allow for possible outliers, e.g. Yi can be assumed to have a t distribution, or a double exponential distribution.
• Estimation is always possible, no matter what the rank of X is.
• Informative priors centred at 0 can be used to express the prior belief that the elements of β will not be too large.
• If it is expected that many coefficients will be close to zero, priors can be used which are mixtures of normal distributions, or Bayesian variable selection methods can be used.
Some Personal Opinions of Estimation Methods

Most of the methods described here have their supporters and many of them are widely used in practice, so many statisticians will disagree with the following.

The only estimation methods I would ever recommend are least squares and Bayesian methods using informative subjective priors.

Least squares estimation is fully understood, is quick and easy to implement and has good frequentist properties. However, one must recognise the limitations of frequentist-like conclusions.
If one wants more detail than least squares can provide, e.g. for
formal decision making, then I believe there is no sensible
alternative to Bayesian methods. However, I believe these must be
used properly, with great care taken in specifying the model, priors
chosen very carefully to reflect the subjective views of each person
involved in interpreting the conclusions and a careful investigation
of the posteriors.

I believe that everything else, including “automatic” Bayesian analyses, is worthless. The numbers obtained from such an analysis seem to be entirely meaningless. At best, they are attempts at approximating the fully Bayesian analysis. However, it must be better to describe the fully Bayesian analysis and then choose any necessary approximations on a rational basis.
