
Theory of Linear Models

Steven Gilmour
King’s College London

January – February 2020


Least Squares Estimation

Any solution β̂ of the normal equations

X′Xβ = X′Y

minimises S = (Y − Xβ)′(Y − Xβ).

For any p × 1 vector c, c′β̂ is defined to be a least squares estimator of c′β.
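As a small illustration of these definitions, the following sketch (NumPy assumed, with made-up data) obtains one solution of the normal equations via a pseudoinverse and evaluates c′β̂ for a chosen c.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative data: n = 20 observations, p = 3 columns (intercept, two covariates)
n, p = 20, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
Y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(scale=0.3, size=n)

# One solution of the normal equations X'X beta = X'Y; the pseudoinverse
# gives a solution even when X'X is singular.
beta_hat = np.linalg.pinv(X.T @ X) @ (X.T @ Y)

# A least squares estimator of c'beta, here c picks out beta_1
c = np.array([0.0, 1.0, 0.0])
print(c @ beta_hat)

# Check that beta_hat solves the normal equations (up to rounding error)
print(np.allclose(X.T @ X @ beta_hat, X.T @ Y))
```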
Properties of Least Squares Estimators

Theorem
The following conditions are equivalent:
• c′β̂ is linear in Y and an unbiased estimator of c′β;
• c′β̂ is unique for all solutions of the normal equations;
• c is in the vector space defined by the columns of X′X;
• c is in the vector space defined by the columns of X′;
• there exists a linear function of Y with expectation c′β.

If any of the conditions of Theorem 1 hold, we say that c′β is estimable.
Example (simple linear regression): This theorem implies that if there are unique least squares estimators of the parameters β0 and β1, these estimators will be unbiased and linear in Y.

In the case where xi = x ∀i = 1, . . . , n, β0 + β1x is estimable but, for example, β1 is not estimable.
Theorem (Gauss-Markov Theorem)
If c′β is estimable, then c′β̂ has minimum variance in the class of unbiased estimators which are linear in Y.

We call c′β̂ the best linear unbiased estimator (BLUE) of c′β.


Example (simple linear regression): The estimator

β̃1 = (Y2 − Y1 )/(x2 − x1 )

is an unbiased estimator of β1 and is linear in Y.

The Gauss-Markov Theorem shows that this estimator has higher variance than the least squares estimator (if n > 2).

Note that it is easy to find biased estimators with smaller variance than least squares estimators, e.g. β̃1 = 0 has variance zero.
There are other (more or less) sensible methods of estimation, e.g.
L1 -norm, ridge regression methods, maximum likelihood, empirical
Bayes, subjective Bayes. They all produce biased estimators
(except in a few special cases in which they are equivalent to least
squares).

Best linear unbiased estimation is a very traditional interpretation of what it means for an estimator to be “good”. Nevertheless, it is difficult to argue against it on general grounds.

The particular emphasis on unbiasedness suggests that it is particularly suited to situations in which multiple studies will be carried out.
Theorem
All linear functions of β are estimable if and only if rank(X) = p.

If rank(X) = p, i.e. X is of full rank, then

β̂ = (X′X)⁻¹X′Y

is the unique solution of the normal equations.

If rank(X) < p, then unbiased estimators do not exist for some functions of the parameters. They are said to be inestimable or confounded.

Example (simple linear regression): rank(X) = 2, unless xi = x ∀i = 1, . . . , n, so this is the only case in which inestimable functions exist.

Otherwise
\[
(X'X)^{-1} = \frac{1}{n\sum x_i^2 - \left(\sum x_i\right)^2}
\begin{pmatrix} \sum x_i^2 & -\sum x_i \\ -\sum x_i & n \end{pmatrix}
\]
and
\[
\hat{\beta} = \frac{1}{n\sum x_i^2 - \left(\sum x_i\right)^2}
\begin{pmatrix} \sum x_i^2 \sum Y_i - \sum x_i \sum x_i Y_i \\ n\sum x_i Y_i - \sum x_i \sum Y_i \end{pmatrix}.
\]
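As a quick numerical check of these closed-form expressions, a sketch (NumPy assumed, simulated data) that evaluates the sums above and compares the result with a generic least squares solver:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, size=30)
Y = 2.0 + 0.7 * x + rng.normal(scale=0.5, size=30)
n = len(x)

# Closed-form solution for simple linear regression
det = n * np.sum(x**2) - np.sum(x)**2
beta_hat = np.array([
    np.sum(x**2) * np.sum(Y) - np.sum(x) * np.sum(x * Y),
    n * np.sum(x * Y) - np.sum(x) * np.sum(Y),
]) / det

# Compare with a generic least squares fit of Y on [1 x]
X = np.column_stack([np.ones(n), x])
print(beta_hat, np.linalg.lstsq(X, Y, rcond=None)[0])
```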
Example (one-way analysis of variance): Consider the model for t
treatments,
µri = β0 + τr ,
where observation i has treatment r .

If we write the model in full with t + 1 parameters, β′ = [β0 τ1 · · · τt], we have
\[
X = \begin{pmatrix}
1 & 1 & 0 & \cdots & 0 & 0 \\
\vdots & \vdots & \vdots & & \vdots & \vdots \\
1 & 1 & 0 & \cdots & 0 & 0 \\
\vdots & \vdots & \vdots & & \vdots & \vdots \\
1 & 0 & 0 & \cdots & 0 & 1 \\
\vdots & \vdots & \vdots & & \vdots & \vdots \\
1 & 0 & 0 & \cdots & 0 & 1
\end{pmatrix}
\]
and it is immediately obvious that rank(X) = t < t + 1.

Hence there are some functions of the parameters which are inestimable.

If we use only t parameters, β′ = [β0 τ2 · · · τt], then rank(X) = t and all linear functions of the parameters are estimable.

In general, how can we determine which functions are estimable and which are not?
Generalized Inverse

Definition
A generalized inverse of an m × n matrix A is any matrix A⁻ such that AA⁻A = A.

There is not, in general, a unique generalized inverse, but if A is a nonsingular square matrix, then A⁻¹ is the unique generalized inverse.

If X′X is singular, we can find a particular solution of the normal equations by finding a generalized inverse of X′X. Different generalized inverses can lead to different solutions.

One way to find a generalized inverse of X′X is to append some rows to X.

If rank(X) = q < p and X* is a (p − q) × p matrix such that
\[
\operatorname{rank}\begin{pmatrix} X \\ X^* \end{pmatrix} = p,
\]
then (X′X + X*′X*)⁻¹ is a generalized inverse of X′X.

This can be thought of as using pseudo-data to allow estimation.

Example (simple linear regression): If xi = x, i = 1, . . . , n, then we can append the row X* = [1 x*], where x* ≠ x.

Then
\[
X'X + X^{*\prime}X^* = \begin{pmatrix} n+1 & nx + x^* \\ nx + x^* & nx^2 + x^{*2} \end{pmatrix}
\]
and
\[
(X'X)^- = \frac{1}{(n+1)(nx^2 + x^{*2}) - (nx + x^*)^2}
\begin{pmatrix} nx^2 + x^{*2} & -(nx + x^*) \\ -(nx + x^*) & n+1 \end{pmatrix}.
\]
Example (one-way analysis of variance): To fit the model

µri = β0 + τr ,

we need one additional row and we can use X∗ = [1 1 · · · 1].

Consider the case of two treatments, each replicated twice. Then
\[
X'X = \begin{pmatrix} 4 & 2 & 2 \\ 2 & 2 & 0 \\ 2 & 0 & 2 \end{pmatrix},
\qquad
X^{*\prime}X^* = \begin{pmatrix} 1 & 1 & 1 \\ 1 & 1 & 1 \\ 1 & 1 & 1 \end{pmatrix},
\]
so that
\[
X'X + X^{*\prime}X^* = \begin{pmatrix} 5 & 3 & 3 \\ 3 & 3 & 1 \\ 3 & 1 & 3 \end{pmatrix}.
\]
Hence, a generalized inverse is
\[
(X'X)^- = \begin{pmatrix} 2 & -3/2 & -3/2 \\ -3/2 & 3/2 & 1 \\ -3/2 & 1 & 3/2 \end{pmatrix}.
\]
Since
\[
X'Y = \begin{pmatrix} Y_1 + Y_2 + Y_3 + Y_4 \\ Y_1 + Y_2 \\ Y_3 + Y_4 \end{pmatrix},
\]
a least squares estimator of β is
\[
\hat{\beta} = \begin{pmatrix} (Y_1 + Y_2 + Y_3 + Y_4)/2 \\ -(Y_3 + Y_4)/2 \\ -(Y_1 + Y_2)/2 \end{pmatrix}.
\]
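This worked example is easy to reproduce numerically. A sketch (NumPy assumed; the responses Y1, …, Y4 are replaced by arbitrary illustrative numbers):

```python
import numpy as np

# Two treatments, each replicated twice; columns are (intercept, tau_1, tau_2)
X = np.array([[1, 1, 0],
              [1, 1, 0],
              [1, 0, 1],
              [1, 0, 1]], dtype=float)
Y = np.array([3.0, 4.0, 6.0, 7.0])      # arbitrary illustrative responses

X_star = np.array([[1.0, 1.0, 1.0]])    # pseudo-data row
G = np.linalg.inv(X.T @ X + X_star.T @ X_star)  # a generalized inverse of X'X

print(G)                      # approx [[2, -1.5, -1.5], [-1.5, 1.5, 1], [-1.5, 1, 1.5]]
beta_hat = G @ X.T @ Y
print(beta_hat)               # [(Y1+Y2+Y3+Y4)/2, -(Y3+Y4)/2, -(Y1+Y2)/2]

# Verify that G really is a generalized inverse: (X'X) G (X'X) = X'X
A = X.T @ X
print(np.allclose(A @ G @ A, A))
```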
Theorem
c′β is estimable if and only if c′{I − (X′X)⁻(X′X)} = 0.

This allows us to find estimable functions of parameters in cases where the normal equations do not have a unique solution.
Example (one-way analysis of variance): Consider first estimating τ1, i.e. c′ = [0 1 0].
\[
(X'X)^-(X'X) = \begin{pmatrix} 2 & 1 & 1 \\ -1 & 0 & -1 \\ -1 & -1 & 0 \end{pmatrix}
\;\Rightarrow\;
I - (X'X)^-(X'X) = \begin{pmatrix} -1 & -1 & -1 \\ 1 & 1 & 1 \\ 1 & 1 & 1 \end{pmatrix}
\]
\[
\Rightarrow\; c'\{I - (X'X)^-(X'X)\} = [1\ 1\ 1]
\]
and so τ1 is not estimable.


Consider instead estimating τ2 − τ1, i.e. c′ = [0 −1 1].

In this case
c′{I − (X′X)⁻(X′X)} = [0 0 0]
and so τ2 − τ1 is estimable.

The estimator is

c′β̂ = (Y3 + Y4)/2 − (Y1 + Y2)/2.

This is why overparameterised models, such as this one, can be used directly. It is possible, but not necessary, to impose some constraints on the parameters, e.g. τ1 = 0. This example shows that, whatever constraints we impose, we will always get the same estimator for τ2 − τ1.
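The estimability condition of the theorem above is straightforward to automate. A sketch (NumPy assumed), continuing the two-treatment example; the helper function is_estimable is purely illustrative:

```python
import numpy as np

X = np.array([[1, 1, 0],
              [1, 1, 0],
              [1, 0, 1],
              [1, 0, 1]], dtype=float)
A = X.T @ X
G = np.linalg.pinv(A)            # any generalized inverse of X'X will do

def is_estimable(c, A, G, tol=1e-10):
    """c'beta is estimable iff c'{I - (X'X)^- (X'X)} = 0 (numerically)."""
    return np.max(np.abs(c @ (np.eye(A.shape[0]) - G @ A))) < tol

print(is_estimable(np.array([0.0, 1.0, 0.0]), A, G))    # tau_1: False
print(is_estimable(np.array([0.0, -1.0, 1.0]), A, G))   # tau_2 - tau_1: True
print(is_estimable(np.array([1.0, 1.0, 0.0]), A, G))    # beta_0 + tau_1: True
```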
Variance and Covariance of Estimators
Theorem 1 told us that if c′β is estimable, then c is in the vector space defined by X′X, so that c = X′Xγ for some vector γ.

Hence,
\begin{align*}
V(c'\hat{\beta}) &= V\{c'(X'X)^- X'Y\} \\
&= c'(X'X)^- X'\,V(Y)\,X\{(X'X)^-\}'c \\
&= \sigma^2 \gamma' X'X(X'X)^- X'X(X'X)^- c \\
&= \sigma^2 \gamma' X'X(X'X)^- c \\
&= \sigma^2 c'(X'X)^- c.
\end{align*}

Similarly, for a p × 1 vector d,
\[
\mathrm{Cov}(c'\hat{\beta}, d'\hat{\beta}) = \sigma^2 c'(X'X)^- d.
\]
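Continuing the same example, a short sketch (NumPy assumed, with an arbitrary illustrative σ²) of the variance formula for the estimable contrast τ2 − τ1, whose variance should come out as σ²(1/2 + 1/2):

```python
import numpy as np

X = np.array([[1, 1, 0],
              [1, 1, 0],
              [1, 0, 1],
              [1, 0, 1]], dtype=float)
G = np.linalg.pinv(X.T @ X)       # a generalized inverse of X'X
sigma2 = 1.0                      # illustrative error variance

c = np.array([0.0, -1.0, 1.0])    # tau_2 - tau_1
print(sigma2 * c @ G @ c)         # should equal sigma^2 * (1/2 + 1/2) = 1.0
```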


Estimating σ 2

In order to do interval estimation or hypothesis testing, we need to be able to estimate σ².

We will use the following results.

Theorem

X′ = X′X(X′X)⁻X′.
Theorem

Y′{I − X(X′X)⁻X′}Y = (Y − Xβ)′{I − X(X′X)⁻X′}(Y − Xβ).

Theorem
Let Z be an n × 1 random vector and A an n × n constant matrix. Then
\[
E\left[\{Z - E(Z)\}' A \{Z - E(Z)\}\right]
= \sum_{i=1}^{n} a_{ii} V(Z_i) + 2\sum_{i=1}^{n-1}\sum_{j=i+1}^{n} a_{ij}\,\mathrm{Cov}(Z_i, Z_j).
\]
Using the results above, we find the following.

Theorem
\[
E\left\{(Y - X\hat{\beta})'(Y - X\hat{\beta})\right\} = (n - r)\sigma^2,
\]
where r = rank(X).

Hence S² = SSR/(n − r) is an unbiased estimator of σ², where SSR = (Y − Xβ̂)′(Y − Xβ̂) is the residual sum of squares.

Note that this result applies whether or not the design matrix is of full rank.

It can also be shown that S² is a consistent estimator of σ².

Results on the optimality of S² as an estimator of σ² are not available in general, but are available under certain assumptions, e.g. normality.
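A sketch of this estimator on simulated data (NumPy assumed): S² = SSR/(n − r) uses the rank of X rather than the number of columns, so it also works for an overparameterised design.

```python
import numpy as np

rng = np.random.default_rng(2)
sigma = 0.5

# Overparameterised one-way ANOVA design: rank(X) = 2 < 3 columns
X = np.kron(np.array([[1, 1, 0], [1, 0, 1]]), np.ones((10, 1)))
n = X.shape[0]
Y = X @ np.array([1.0, 0.5, -0.5]) + rng.normal(scale=sigma, size=n)

beta_hat = np.linalg.pinv(X.T @ X) @ X.T @ Y
resid = Y - X @ beta_hat
r = np.linalg.matrix_rank(X)
S2 = resid @ resid / (n - r)     # unbiased estimator of sigma^2
print(r, S2, sigma**2)
```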
Minimum Variance Unbiased Estimation
If we adopt a fully parametric approach, then we can often find
stronger results.

Theorem
If Y ∼ N(Xβ, σ²I) then, for any estimable function c′β, c′β̂ is the unique minimum variance unbiased estimator of c′β.

Note that we are no longer restricted to linear estimators, i.e. under normality, there can be no nonlinear unbiased estimator better than the least squares estimator.

Theorem
If each εi has the same distribution, which has all moments finite, and c′β̂ is the minimum variance unbiased estimator of c′β, then εi has a normal distribution.
Notes:
• Theorem 11 says that the normal distribution is the only (nice) distribution for which the least squares estimators are minimum variance unbiased estimators.
• Theorems 10 and 11 together suggest that the additional assumption of normality is worth a great deal, in the sense that without this assumption, we know that we are using inferior estimators.
• However, for many other distributions, the minimum variance linear unbiased estimator will be almost as good as the minimum variance unbiased estimator.
• Overall, least squares estimation can be seen to be quite robust to distributional assumptions.

Theorem
If Y ∼ N(Xβ, σ 2 I) then S 2 is the minimum variance unbiased
estimator of σ 2 .
Maximum Likelihood Estimation
If Y ∼ N(Xβ, σ²I) then the likelihood is
\[
L(\beta, \sigma^2 \mid Y) = (2\pi\sigma^2)^{-n/2} \exp\left\{ -\frac{1}{2\sigma^2} (Y - X\beta)'(Y - X\beta) \right\}.
\]
It is immediately obvious that the likelihood is maximised by β̂.

It is also easy to show that the MLE of σ 2 is

σ̂ 2 = SSR /n.

Note that
\[
\hat{\sigma}^2 = \frac{n - r}{n}\, S^2
\]
and is a biased estimator of σ².

For other distributions, β̂ is not, in general, the MLE of β.


Minimum Mean Square Error Estimation
It is easy to show that, among estimators of σ² of the form SSR/m, that which minimises the mean square error is
\[
\tilde{\sigma}^2 = \frac{\mathrm{SS}_R}{n - r + 2}.
\]

If rank(X) = p and p ≥ 3, it can be shown that the James-Stein shrinkage estimator,
\[
\tilde{\beta} = \left\{ 1 - \frac{(p-2)\tilde{\sigma}^2}{\hat{\beta}' X'X \hat{\beta}} \right\} \hat{\beta},
\]
of β has uniformly lower tr{X′X MSE(β̃)} than β̂.

This might seem like a convincing argument against least squares estimators. However, note that an individual component β̂r can have lower MSE than β̃r.
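A sketch of this shrinkage estimator on simulated full-rank data (NumPy assumed); σ̃² is the minimum-MSE estimator of σ² defined above, and the bracketed factor shrinks β̂ towards zero.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 50, 5                      # need p >= 3 for the result quoted above
X = rng.normal(size=(n, p))
beta = np.array([0.2, -0.1, 0.0, 0.1, 0.3])
Y = X @ beta + rng.normal(size=n)

beta_hat = np.linalg.lstsq(X, Y, rcond=None)[0]
SSR = np.sum((Y - X @ beta_hat) ** 2)
r = np.linalg.matrix_rank(X)
sigma2_tilde = SSR / (n - r + 2)          # minimum-MSE estimator of sigma^2

shrink = 1.0 - (p - 2) * sigma2_tilde / (beta_hat @ X.T @ X @ beta_hat)
beta_tilde = shrink * beta_hat            # James-Stein-type shrinkage estimator
print(beta_hat)
print(beta_tilde)
```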
L1 -Norm Regression

If the εi are assumed to be independent and to have double exponential distributions, then the maximum likelihood estimator is equivalent to the L1-norm or least absolute deviations (LAD) regression estimator, which minimises
\[
\sum_{i=1}^{n} |\epsilon_i|.
\]

L1-norm estimation is more robust to outliers than least squares (L2-norm), in the sense that an extreme observation is much more likely from the double exponential than from the normal distribution.

We can also use Ld-norm regression for other values of d, e.g. d = 1.5, or we can find MLEs for other distributions for εi.
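A sketch of LAD estimation (NumPy and SciPy assumed; a linear-programming formulation is the more standard computational route, but a general-purpose optimiser keeps the example short, and the data and outliers are made up):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(4)
n = 60
x = rng.uniform(0, 10, size=n)
Y = 1.0 + 0.5 * x + rng.laplace(scale=1.0, size=n)   # double exponential errors
Y[:3] += 20.0                                        # a few gross outliers
X = np.column_stack([np.ones(n), x])

beta_ls = np.linalg.lstsq(X, Y, rcond=None)[0]       # L2-norm (least squares)

# L1-norm / LAD: minimise the sum of absolute residuals
def lad_objective(b):
    return np.sum(np.abs(Y - X @ b))

beta_lad = minimize(lad_objective, beta_ls, method="Nelder-Mead").x

print(beta_ls)    # pulled towards the outliers
print(beta_lad)   # much closer to (1.0, 0.5)
```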
M-Estimators

An M-estimator of β is obtained by defining a function ρ(u) and minimising
\[
\sum_{i=1}^{n} \rho\!\left( \frac{\epsilon_i}{S} \right),
\]
where S is some estimator of σ.

ρ(u) = u² is least squares and ρ(u) = |u| is L1-norm estimation.

Various other recommendations include Huber's function,
\[
\rho(u) = \begin{cases} \dfrac{u^2}{2} & -a \le u \le a; \\[4pt] a|u| - \dfrac{a^2}{2} & \text{otherwise,} \end{cases}
\]
and Tukey's biweight function,
\[
\rho(u) = \begin{cases} \dfrac{u^2}{2} - \dfrac{u^4}{4a^2} & -a \le u \le a; \\[4pt] \dfrac{a^2}{4} & \text{otherwise.} \end{cases}
\]

These lead to weighted least squares estimators, but with the weights depending on β. Observations with large standardised residuals are downweighted.

It is also possible to choose a weight function directly, without defining ρ(u).
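A sketch of the weighted least squares idea with Huber's ρ, implemented as iteratively reweighted least squares (NumPy assumed; the tuning constant a = 1.345 and the MAD-based choice of S are common conventions, not something fixed by these notes):

```python
import numpy as np

def huber_irls(X, Y, a=1.345, n_iter=50):
    """M-estimation with Huber's rho via iteratively reweighted least squares."""
    beta = np.linalg.lstsq(X, Y, rcond=None)[0]       # start from least squares
    for _ in range(n_iter):
        resid = Y - X @ beta
        S = 1.4826 * np.median(np.abs(resid - np.median(resid)))  # robust scale
        u = resid / S
        # Huber weights w(u) = psi(u)/u: 1 inside [-a, a], a/|u| outside
        w = np.minimum(1.0, a / np.maximum(np.abs(u), 1e-12))
        WX = X * w[:, None]
        beta = np.linalg.solve(WX.T @ X, WX.T @ Y)    # weighted least squares step
    return beta

rng = np.random.default_rng(5)
x = rng.uniform(0, 10, size=80)
Y = 2.0 - 0.3 * x + rng.normal(scale=0.5, size=80)
Y[:5] += 15.0                                          # outliers, downweighted by Huber
X = np.column_stack([np.ones_like(x), x])
print(huber_irls(X, Y))
```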
Ranked Residuals Regression

One example of this, using the Wilcoxon score function, estimates β by minimising
\[
\sum_{i=1}^{n} \left( \frac{i}{n+1} - \frac{1}{2} \right) \epsilon_{[i]},
\]
where ε[i] are the ranked deviations.

This method gives most weight to the large residuals.

Other score functions have been suggested.


Least Median Squares

The least median squares estimator of β minimises Median(εi²).

This involves finding the narrowest strip which covers half of the observations, the estimate being in the middle of this strip.
PCR and PLS
Principal component regression (PCR) involves replacing the
columns of X (usually centred and scaled) with their principal
components and using the latter as the explanatory variables.

Usually, we can use a relatively small number of principal components to explain most of the variation in y.
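A sketch of PCR (NumPy assumed; the number of retained components k is a tuning choice, fixed arbitrarily here): centre and scale X, take the leading principal components as the explanatory variables, and regress y on them.

```python
import numpy as np

def pcr_fit(X, y, k):
    """Principal component regression keeping the first k components."""
    Xc = (X - X.mean(axis=0)) / X.std(axis=0)       # centre and scale columns
    yc = y - y.mean()
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    Z = Xc @ Vt[:k].T                               # scores on the first k components
    gamma = np.linalg.lstsq(Z, yc, rcond=None)[0]   # regress y on the scores
    return Vt[:k].T @ gamma                         # back to the scaled-X coordinates

rng = np.random.default_rng(6)
n = 100
z = rng.normal(size=n)
X = np.column_stack([z + 0.01 * rng.normal(size=n) for _ in range(5)])  # near-collinear
y = X @ np.array([1.0, 0.5, -0.5, 0.2, 0.1]) + rng.normal(scale=0.1, size=n)
print(pcr_fit(X, y, k=1))
```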

Partial least squares or projection to latent structures (PLS) regression represents a compromise between PCR and OLS.

In fact, it is possible to define continuum regression, which represents a smooth transition between PCR and OLS.

Opinion: PCR is entirely meaningless except, perhaps, in calibration situations, i.e. when the “responses” were fixed and the “explanatory variables” were observed and we want to do the reverse regression.
Ridge Regression

If rank(X) < p, or rank(X) = p but (numerically) there are near linear dependencies among the columns of X, the method of ridge regression has been recommended.

The ridge regression estimator of β is
\[
\hat{\beta}_R = (X'X + \lambda I)^{-1} X'Y.
\]

If the response data and columns of X are centred, then the ridge regression estimator can also be written as the least squares estimator with response vector [Y′ 0′]′ and design matrix [X′ √λ I]′.
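A sketch of the ridge estimator and a check of its augmented-data representation (NumPy assumed; centred simulated data, λ fixed arbitrarily):

```python
import numpy as np

rng = np.random.default_rng(7)
n, p = 50, 4
X = rng.normal(size=(n, p))
X = X - X.mean(axis=0)                              # centre columns
Y = X @ np.array([1.0, -1.0, 0.5, 0.0]) + rng.normal(size=n)
Y = Y - Y.mean()                                    # centre response
lam = 2.0

# Direct form of the ridge estimator
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ Y)

# Augmented-data form: least squares with extra rows sqrt(lambda)*I and zero responses
X_aug = np.vstack([X, np.sqrt(lam) * np.eye(p)])
Y_aug = np.concatenate([Y, np.zeros(p)])
beta_aug = np.linalg.lstsq(X_aug, Y_aug, rcond=None)[0]

print(np.allclose(beta_ridge, beta_aug))            # the two forms agree
```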
Ridge regression appears to have some similarity to the use of a generalized inverse, but the philosophy is different:
• A generalized inverse allows us to estimate the estimable functions, when not all functions are estimable.
• Ridge regression allows us to estimate inestimable functions.

Ridge regression estimators have
\[
\mathrm{Bias}(\hat{\beta}_R) = -\lambda(X'X + \lambda I)^{-1}\beta
\]
and
\[
V(\hat{\beta}_R) = \sigma^2 (X'X + \lambda I)^{-1} X'X (X'X + \lambda I)^{-1},
\]
and so we gain lower variances by increasing the bias.

Ridge regression is controversial, but is particularly useful when there is strong prior knowledge that the elements of β should not be “too large”.

It can be shown that there exists a λ > 0 for which the total mean square error of β̂R is smaller than that of β̂, but this value depends on β and so cannot be found.

In practice λ is usually chosen (graphically) by comparing β̂R for various λ and choosing the smallest value for which the estimates are stable.
Bayesian Estimation
The natural conjugate prior is inverse gamma for σ 2 and
conditionally multivariate normal for β|σ 2 . Let the prior
expectation and variance be β P and σ 2 A respectively.

This leads to a conditionally multivariate normal posterior for β|σ², Y with variance matrix
\[
\sigma^2 (A^{-1} + X'X)^{-1}
\]
and expectation
\[
\sigma^2 (A^{-1} + X'X)^{-1}\left( \frac{1}{\sigma^2} A^{-1}\beta_P + \frac{1}{\sigma^2} X'Y \right)
= (A^{-1} + X'X)^{-1}\left( A^{-1}\beta_P + X'Y \right).
\]

In the noninformative prior case, where A = cI with c → ∞, this reduces to least squares.
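A sketch of this conditional posterior (NumPy assumed; σ² is treated as known, and the prior settings are arbitrary), together with a check that a very diffuse prior essentially returns the least squares estimate:

```python
import numpy as np

rng = np.random.default_rng(8)
n, p = 40, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
Y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(scale=0.7, size=n)
sigma2 = 0.7 ** 2                       # treated as known here

beta_prior = np.zeros(p)                # prior mean beta_P
A = np.eye(p)                           # prior variance of beta is sigma^2 * A

# Conditional posterior of beta given sigma^2 and Y
post_prec = np.linalg.inv(A) + X.T @ X
post_mean = np.linalg.solve(post_prec, np.linalg.inv(A) @ beta_prior + X.T @ Y)
post_var = sigma2 * np.linalg.inv(post_prec)

# With a very diffuse prior (A = cI, c large) the posterior mean is essentially
# the least squares estimate
A_diffuse = 1e8 * np.eye(p)
post_mean_diffuse = np.linalg.solve(np.linalg.inv(A_diffuse) + X.T @ X,
                                    np.linalg.inv(A_diffuse) @ beta_prior + X.T @ Y)
print(post_mean)
print(post_mean_diffuse, np.linalg.lstsq(X, Y, rcond=None)[0])
```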
Of course, the Bayesian approach is very flexible and can meet all
the objectives of the methods described above, for example:

• Appropriate distributional assumptions can be made to allow for possible outliers, e.g. Yi can be assumed to have a t distribution, or a double exponential distribution.
• Estimation is always possible, no matter what the rank of X is.
• Informative priors centred at 0 can be used to express the prior belief that the elements of β will not be too large.
• If it is expected that many coefficients will be close to zero, priors can be used which are mixtures of normal distributions, or Bayesian variable selection methods can be used.
Some Personal Opinions of Estimation Methods

Most of the methods described here have their supporters and many of them are widely used in practice, so many statisticians will disagree with the following.

The only estimation methods I would ever recommend are least squares and Bayesian methods using informative subjective priors.

Least squares estimation is fully understood, is quick and easy to implement and has good frequentist properties. However, one must recognise the limitations of frequentist-like conclusions.
If one wants more detail than least squares can provide, e.g. for
formal decision making, then I believe there is no sensible
alternative to Bayesian methods. However, I believe these must be
used properly, with great care taken in specifying the model, priors
chosen very carefully to reflect the subjective views of each person
involved in interpreting the conclusions and a careful investigation
of the posteriors.

I believe that everything else, including “automatic” Bayesian analyses, is worthless. The numbers obtained from such an analysis seem to be entirely meaningless. At best, they are attempts at approximating the fully Bayesian analysis. However, it must be better to describe the fully Bayesian analysis and then choose any necessary approximations on a rational basis.
