
Lecture 24: Weighted and Generalized Least Squares

1 Weighted Least Squares


When we use ordinary least squares to estimate linear regression, we minimize the mean squared
error:
\[
\mathrm{MSE}(b) = \frac{1}{n}\sum_{i=1}^{n} (Y_i - X_{i\cdot} b)^2 \tag{1}
\]
where $X_{i\cdot}$ is the $i$th row of $X$. The solution is
\[
\hat{\beta}_{OLS} = (X^T X)^{-1} X^T Y. \tag{2}
\]

Suppose we minimize the weighted MSE
\[
\mathrm{WMSE}(b, w_1, \ldots, w_n) = \frac{1}{n}\sum_{i=1}^{n} w_i (Y_i - X_{i\cdot} b)^2. \tag{3}
\]

This includes ordinary least squares as the special case where all the weights $w_i = 1$. We can solve
it by the same kind of linear algebra we used to solve the ordinary linear least squares problem. If
we write $W$ for the matrix with the $w_i$ on the diagonal and zeroes everywhere else, then
\begin{align}
\mathrm{WMSE} &= n^{-1} (Y - Xb)^T W (Y - Xb) \tag{4}\\
&= \frac{1}{n}\left( Y^T W Y - Y^T W X b - b^T X^T W Y + b^T X^T W X b \right). \tag{5}
\end{align}
Differentiating with respect to $b$, we get as the gradient
\[
\nabla_b\, \mathrm{WMSE} = \frac{2}{n}\left( -X^T W Y + X^T W X b \right).
\]
Setting this to zero at the optimum and solving,
\[
\hat{\beta}_{WLS} = (X^T W X)^{-1} X^T W Y. \tag{6}
\]
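As a quick check on Eq. 6, here is a minimal R sketch (the data, true coefficients, and noise variances are invented for illustration) confirming that the closed-form matrix solution agrees with what lm() returns when given the same weights:

# Minimal sketch: the closed-form WLS solution of Eq. 6 versus lm() with weights.
# The data, true coefficients, and noise variances below are invented for illustration.
set.seed(1)
n <- 100
x <- runif(n, -1, 1)
sigma2 <- 0.1 + x^2                        # known heteroskedastic noise variances
y <- 1 + 2 * x + rnorm(n, sd = sqrt(sigma2))
w <- 1 / sigma2                            # inverse-variance weights

X <- cbind(1, x)                           # design matrix with an intercept column
W <- diag(w)
beta.wls <- solve(t(X) %*% W %*% X, t(X) %*% W %*% y)   # Eq. 6

fit <- lm(y ~ x, weights = w)
cbind(by.formula = as.vector(beta.wls), by.lm = coef(fit))   # columns should agree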

But why would we want to minimize Eq. 3?

1. Focusing accuracy. We may care much more strongly about predicting the response for certain
values of the input than for others: values we expect to see often again, or values where mistakes
are especially costly, embarrassing, or painful. If we give the points near that region big
weights, and points elsewhere smaller weights, the regression will be pulled towards matching
the data in that region.

2. Discounting imprecision. Ordinary least squares gives equal weight to the squared error at every
observation, which is appropriate when the variance of the noise terms $\epsilon_i$ is constant over all
observations, so that we measure the regression function with the same precision everywhere. This
situation, of constant noise variance, is called homoskedasticity. Often, however, the magnitude
of the noise is not constant, and the data are heteroskedastic. Let $\sigma_i^2 = \mathrm{Var}[\epsilon_i \mid X_i]$.
Homoskedasticity means that $\sigma_i^2 = \sigma^2$ for all $i$; heteroskedasticity means that the $\sigma_i^2$ can differ.

When we have heteroskedasticity, ordinary least squares is no longer the optimal estimate:
we'll see soon that other estimators can be unbiased and have smaller variance. If, however,
we know the noise variance $\sigma_i^2$ at each measurement $i$, and set $w_i = 1/\sigma_i^2$, we minimize
the variance of estimation.

3. Doing something else. There are a number of other optimization problems which can be
transformed into, or approximated by, weighted least squares. The most important of these
arises from generalized linear models, where the mean response is some nonlinear function
of a linear predictor; we will look at them in 402.

2 Heteroskedasticity
Suppose that
\[
Y_i = \beta_0 + \beta_1 X_{1i} + \cdots + \beta_p X_{pi} + \epsilon_i
\]
where $E[\epsilon_i] = 0$ and $\mathrm{Var}[\epsilon_i] = \sigma_i^2$. (As usual, we are treating the $X_i$'s as fixed.) This is called the
heteroskedastic linear regression model. For now, assume we know $\sigma_1, \ldots, \sigma_n$.
The model is
\[
Y = X\beta + \epsilon
\]
where now (treating $X$ as fixed), $E[\epsilon] = 0$ and $\mathrm{Var}(\epsilon) = \Sigma$, and $\Sigma$ is no longer of the form $\sigma^2 I$. The
weighted least squares estimator is
\[
\hat{\beta} = K Y
\]
where $K = (X^T W X)^{-1} X^T W$. So,
\[
\hat{\beta} = K Y = K(X\beta + \epsilon) = \beta + K\epsilon.
\]
Hence,
\[
E[\hat{\beta}] = \beta, \qquad \mathrm{Var}(\hat{\beta}) = K \Sigma K^T. \tag{7}
\]
Note that the estimator is unbiased.
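To make Eq. 7 concrete, here is a small Monte Carlo sketch in R (the design, noise variances, and weight matrix are all invented for illustration) comparing the empirical mean and covariance of the estimates across repeated samples with $\beta$ and $K\Sigma K^T$:

# Sketch: check E[beta.hat] = beta and Var(beta.hat) = K Sigma K^T by simulation.
# The design, noise variances, and weight matrix are invented for illustration.
set.seed(2)
n <- 200
x <- runif(n, -1, 1)
X <- cbind(1, x)
beta <- c(1, 2)
sigma2 <- 0.2 + abs(x)                    # known noise variances
Sigma <- diag(sigma2)
W <- diag(1 / sigma2)                     # any fixed weight matrix would do here
K <- solve(t(X) %*% W %*% X) %*% t(X) %*% W

B <- 2000
est <- replicate(B, {
  y <- X %*% beta + rnorm(n, sd = sqrt(sigma2))
  as.vector(K %*% y)
})
rowMeans(est)                             # close to beta: the estimator is unbiased
cov(t(est))                               # close to the next line, i.e. Eq. 7
K %*% Sigma %*% t(K)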

3 The Gauss-Markov Theorem


We’ve seen that when we do weighted least squares, our estimates of β are linear in Y, and unbiased:

βb = (XT WX)−1 XT WY

and E[β]
b = β.
Let us consider a special case: suppose we take W = Σ−1 . Then, from (7)
" # " #
b = KΣK T = (XT Σ−1 X)−1 XT Σ−1 Σ Σ−1 X(XT Σ−1 X)−1 = (XT Σ−1 X)−1 .
Var(β) (8)

We will now show that βb is, in a certain sense, optimal.


As with any optimality result, it is crucial to lay out carefully the range of possible alternatives,
and the criterion by which those alternatives will be compared. The classical optimality result for
estimating linear models is the Gauss-Markov theorem, which takes the range of possibilities
to be linear, unbiased estimators of $\beta$, and the criterion to be the variance of the estimator.
Any linear estimator, say $\tilde{\beta}$, could be written as
\[
\tilde{\beta} = Q Y
\]
where $Q$ would be a $(p + 1) \times n$ matrix. We will show that if $\tilde{\beta}$ is unbiased, then its
variance is at least as large as that of $\hat{\beta}_{WLS}$.
For $\tilde{\beta}$ to be an unbiased estimator, we must have
\[
E[Q Y \mid X] = Q X \beta = \beta.
\]
Since this must hold for all $\beta$ and all $X$, we have to have $Q X = I$. The variance is then
\[
\mathrm{Var}[Q Y \mid X] = Q\, \mathrm{Var}[\epsilon \mid X]\, Q^T = Q \Sigma Q^T \tag{9}
\]

where $\Sigma = \mathrm{Var}[\epsilon \mid X]$. Let
\[
K \equiv (X^T \Sigma^{-1} X)^{-1} X^T \Sigma^{-1}.
\]
Now, whatever $Q$ might be, we can always write
\[
Q = K + R \tag{10}
\]
for some matrix $R$. The unbiasedness constraint on $Q$ implies that
\[
R X = 0
\]
because $K X = I$. Now we substitute Eq. 10 into Eq. 9:

\begin{align}
\mathrm{Var}\big[\tilde{\beta}\big] &= (K + R)\Sigma(K + R)^T \tag{11}\\
&= K\Sigma K^T + R\Sigma K^T + K\Sigma R^T + R\Sigma R^T \tag{12}\\
&= (X^T \Sigma^{-1} X)^{-1} X^T \Sigma^{-1} \Sigma \Sigma^{-1} X (X^T \Sigma^{-1} X)^{-1} \tag{13}\\
&\quad + R \Sigma \Sigma^{-1} X (X^T \Sigma^{-1} X)^{-1} \notag\\
&\quad + (X^T \Sigma^{-1} X)^{-1} X^T \Sigma^{-1} \Sigma R^T \notag\\
&\quad + R\Sigma R^T \notag\\
&= (X^T \Sigma^{-1} X)^{-1} X^T \Sigma^{-1} X (X^T \Sigma^{-1} X)^{-1} \tag{14}\\
&\quad + R X (X^T \Sigma^{-1} X)^{-1} + (X^T \Sigma^{-1} X)^{-1} X^T R^T \notag\\
&\quad + R\Sigma R^T \notag\\
&= (X^T \Sigma^{-1} X)^{-1} + R\Sigma R^T \tag{15}
\end{align}
where the last step uses the fact that $R X = 0$ (and so $X^T R^T = 0^T$).


Hence,
\[
\mathrm{Var}\big[\tilde{\beta}_i\big] = \big[(X^T \Sigma^{-1} X)^{-1}\big]_{ii} + R_i \Sigma R_i^T = \mathrm{Var}\big[\hat{\beta}_i\big] + R_i \Sigma R_i^T
\]
where $R_i$ denotes the $i$th row of $R$. Since $\Sigma$ is a covariance matrix, it's positive definite, meaning
that $a \Sigma a^T \geq 0$ for any row vector $a$. Hence,
\[
\mathrm{Var}\big[\tilde{\beta}_i\big] \geq \mathrm{Var}\big[\hat{\beta}_i\big].
\]

We conclude that WLS, with $W = \Sigma^{-1}$, has the least variance among all possible linear, unbiased
estimators of the regression coefficients.
Notes:

1. If all the noise variances are equal, then we’ve proved the optimality of OLS.

2. The theorem doesn’t rule out linear, biased estimators with smaller variance. As an example,
albeit a trivial one, 0Y is linear and has variance 0, but is (generally) very biased.

3. The theorem also doesn’t rule out non-linear unbiased estimators of smaller variance. Or
indeed non-linear biased estimators of even smaller variance.

4. The proof actually doesn’t require the variance matrix to be diagonal.
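As a numerical illustration of the theorem (not part of the original notes; the data-generating choices are invented), the following R sketch compares the sampling variance of the OLS slope with that of the WLS slope using $W = \Sigma^{-1}$, i.e. inverse-variance weights:

# Sketch: under heteroskedastic noise, WLS with inverse-variance weights should show
# a smaller sampling variance than OLS for the same coefficient. Both are unbiased.
# All data-generating choices here are invented for illustration.
set.seed(3)
n <- 150
x <- runif(n, -1, 1)
sigma2 <- 0.05 + 2 * x^2                  # known, strongly heteroskedastic variances
B <- 2000
slopes <- replicate(B, {
  y <- 1 + 2 * x + rnorm(n, sd = sqrt(sigma2))
  c(ols = coef(lm(y ~ x))[2],
    wls = coef(lm(y ~ x, weights = 1 / sigma2))[2])
})
apply(slopes, 1, var)                     # the WLS variance should be the smaller one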

4 Finding the Variance and Weights


So far we have assumed that we know $\sigma_1, \ldots, \sigma_n$. Here are some cases where this might be true.

Multiple measurements. The easiest case is when our measurements of the response are actually
averages over individual measurements, each with some variance $\sigma^2$. If some $Y_i$ are based on
averaging more individual measurements than others, there will be heteroskedasticity. The variance
of the average of $n_i$ uncorrelated measurements will be $\sigma^2 / n_i$, so in this situation we could take
$w_i \propto n_i$.
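For instance, here is a hypothetical R sketch in which each recorded response is the average of $n_i$ raw measurements, so the weights are just the counts:

# Sketch: each response is the average of n.i raw measurements with unit variance,
# so Var(Ybar.i) = 1 / n.i and weights proportional to n.i are appropriate.
# The data below are simulated purely for illustration.
set.seed(4)
m <- 60
x <- runif(m, -1, 1)
n.i <- sample(1:20, m, replace = TRUE)            # measurements behind each average
ybar <- 1 + 2 * x + rnorm(m, sd = sqrt(1 / n.i))  # averages, variance 1/n.i
fit <- lm(ybar ~ x, weights = n.i)
summary(fit)$coefficients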

Binomial counts. Suppose our response variable is a count, derived from a binomial distribution,
i.e., $Y_i \sim \mathrm{Binom}(n_i, p_i)$. We would usually model $p_i$ as a function of the predictor variables; at
this level of statistical knowledge, a linear function. This would imply that $Y_i$ had expectation $n_i p_i$,
and variance $n_i p_i (1 - p_i)$. We would be well-advised to use this formula for the variance, rather
than pretending that all observations had equal variance.

Proportions based on binomials. If our response variable is a proportion based on a binomial,
we'd see an expectation value of $p_i$ and a variance of $p_i(1 - p_i)/n_i$. Again, this is not equal across
different values of $n_i$, or for that matter different values of $p_i$.
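As a rough sketch of how this plays out in practice (this two-step reweighting is an illustration of the idea, not a recipe from the notes), one can estimate the $p_i$ from an initial unweighted fit, plug them into the variance formula, and refit; this is close in spirit to how WLS approximates generalized linear models:

# Sketch: WLS for binomial proportions, where Var(Y.i / n.i) = p.i (1 - p.i) / n.i.
# Since p.i is unknown, estimate it from an initial unweighted fit, then refit.
# The data are simulated for illustration; clipping keeps the estimated variances positive.
set.seed(5)
m <- 100
x <- runif(m, 0, 1)
n.i <- sample(20:80, m, replace = TRUE)
p.true <- 0.2 + 0.5 * x                           # a linear model for the proportion
prop <- rbinom(m, n.i, p.true) / n.i

fit0 <- lm(prop ~ x)                              # initial, unweighted fit
p.hat <- pmin(pmax(fitted(fit0), 0.01), 0.99)     # clip away from 0 and 1
w <- n.i / (p.hat * (1 - p.hat))                  # inverse of the binomial variance
fit1 <- lm(prop ~ x, weights = w)
summary(fit1)$coefficients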

Poisson counts. Binomial counts have a hard upper limit, $n_i$; if the upper limit is immense or
even (theoretically) infinite, we may be better off using a Poisson distribution. In such situations,
the mean of the Poisson, $\lambda_i$, will be a (possibly linear) function of the predictors, and the variance
will also be equal to $\lambda_i$.

Other counts. The binomial and Poisson distributions rest on independence across "trials" (whatever
those might be). There is a range of discrete probability models that allow for correlation
across trials (leading to more or less variance). These may, in particular situations, be more
appropriate.

5 Conditional Variance Function Estimation
If we don’t know the variances, we might be able to estimate them. There are two common ways
to estimate conditional variances, which differ slightly in how they use non-parametric smoothing.
(We will discuss non-parametric smoothing in more detail later and in 402).
Method 1:

1. Estimate $m(x)$ with your favorite regression method, getting $\hat{m}(x)$.

2. Construct the squared residuals, $u_i = (Y_i - \hat{m}(x_i))^2$.

3. Use your favorite non-parametric method to estimate the conditional mean of the $u_i$, call it
$\hat{q}(x)$.

4. Estimate the variance using $\hat{\sigma}^2_x = \hat{q}(x)$. (A code sketch of this method follows the list.)
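Here is a minimal R sketch of Method 1 (the simulated data and the use of loess as the non-parametric smoother are choices made for this illustration; §5.1 below works through Method 2 in the same style):

# Sketch of Method 1: smooth the squared residuals to estimate the variance function.
# The data and the choice of loess as the smoother are for illustration only.
set.seed(6)
n <- 200
x <- runif(n, -1, 1)
y <- 2 * x + rnorm(n, sd = 0.2 + abs(x))          # heteroskedastic noise

out <- lm(y ~ x)                                  # step 1: estimate the regression
u <- resid(out)^2                                 # step 2: squared residuals
q.hat <- loess(u ~ x)                             # step 3: smooth the squared residuals
s2 <- pmax(fitted(q.hat), 1e-6)                   # step 4: estimated variances, clipped
                                                  # at a small positive value (the issue
                                                  # that Method 2 below avoids)
out.w <- lm(y ~ x, weights = 1 / s2)              # WLS with the estimated weights
summary(out.w)$coefficients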

Here is Method 2:

1. Estimate $m(x)$ with your favorite regression method, getting $\hat{m}(x)$.

2. Construct the log squared residuals, $z_i = \log (Y_i - \hat{m}(x_i))^2$.

3. Use your favorite non-parametric method to estimate the conditional mean of the $z_i$, call it
$\hat{s}(x)$.

4. Predict the variance using $\hat{\sigma}^2_x = \exp \hat{s}(x)$.

The second method ensures that the estimated variances are positive.
We are estimating the variance function in order to do weighted least squares, but these methods can
be used more generally. It's often important to understand variance in its own right, and this is a
general method for estimating it. Our estimate of the variance function depends on first having a
good estimate of the regression function.

5.1 Example
> plot(x,y)
> out = lm(y ~ x)            # initial OLS fit
> abline(out)
> summary(out)
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  0.02934    0.03935   0.745    0.458
x            1.89866    0.07092  26.772   <2e-16 ***

> plot(x,rstudent(out))      # studentized residuals reveal the heteroskedasticity
> abline(h=0)
>
> u = log((resid(out))^2)    # log squared residuals (Method 2, step 2)
> tmp = loess(u ~ x)         # non-parametric smooth of the log squared residuals
> s2 = exp(tmp$fitted)       # estimated conditional variances

> plot(x,s2)
> w = 1/s2                   # inverse-variance weights
> out2 = lm(y ~ x,weights=w) # weighted least squares fit
> summary(out2)
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.002967   0.008252    0.36     0.72
x           2.046929   0.043872   46.66   <2e-16 ***

6 Correlated Noise and Generalized Least Squares


Sometimes, we might believe the right model is (in matrix form)
\begin{align}
Y &= X\beta + \epsilon \tag{16}\\
E[\epsilon \mid X] &= 0 \tag{17}\\
\mathrm{Var}[\epsilon \mid X] &= \Sigma \tag{18}
\end{align}
where the matrix $\Sigma$ is not diagonal. The off-diagonal entries represent covariance in the noise
terms, $\mathrm{Cov}[\epsilon_i, \epsilon_j] = \Sigma_{ij}$. How should we estimate $\beta$? Here are two approaches. We assume that $\Sigma$
is known.
Approach 1. Because $\Sigma$ is a variance matrix, we know it is square, symmetric, and positive-definite.
This implies that there exists a square root matrix $S$ (itself symmetric) such that $SS = \Sigma$. Now we
multiply our model by $S^{-1}$:
\[
S^{-1} Y = S^{-1} X \beta + S^{-1} \epsilon.
\]
This is a linear regression of $S^{-1} Y$ on $S^{-1} X$, with the same coefficients $\beta$ as our original regression.
However, we have improved the properties of the noise. The noise is still zero in expectation,
\[
E\big[S^{-1} \epsilon \mid X\big] = S^{-1} 0 = 0,
\]
but
\begin{align}
\mathrm{Var}\big[S^{-1} \epsilon \mid X\big] &= S^{-1}\, \mathrm{Var}[\epsilon \mid X]\, S^{-T} \tag{19}\\
&= S^{-1} \Sigma S^{-T} \tag{20}\\
&= S^{-1} S S^T S^{-T} \tag{21}\\
&= I. \tag{22}
\end{align}
To sum up, if we know $\Sigma$, we can estimate $\beta$ by doing an ordinary least squares regression of
$S^{-1} Y$ on $S^{-1} X$. The estimate is
\begin{align}
\hat{\beta} &= ((S^{-1} X)^T S^{-1} X)^{-1} (S^{-1} X)^T S^{-1} Y \tag{23}\\
&= (X^T S^{-T} S^{-1} X)^{-1} X^T S^{-T} S^{-1} Y \tag{24}\\
&= (X^T \Sigma^{-1} X)^{-1} X^T \Sigma^{-1} Y. \tag{25}
\end{align}
This looks just like our weighted least squares estimate, only with $\Sigma^{-1}$ in place of $W$.
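Here is a minimal R sketch of Approach 1 (the AR(1)-style $\Sigma$ is invented for illustration; chol() supplies a lower-triangular factor $L$ with $LL^T = \Sigma$ in place of the symmetric square root $S$, which is enough, since the argument above only uses $\Sigma = SS^T$):

# Sketch of Approach 1: de-correlate the noise with a factor of Sigma, then run OLS.
# Sigma here is an invented AR(1)-style covariance; any L with L %*% t(L) = Sigma works.
set.seed(7)
n <- 100
x <- runif(n, -1, 1)
X <- cbind(1, x)
rho <- 0.6
Sigma <- 0.5 * rho^abs(outer(1:n, 1:n, "-"))      # correlated noise covariance
L <- t(chol(Sigma))                               # lower-triangular, L %*% t(L) = Sigma
eps <- as.vector(L %*% rnorm(n))                  # noise with covariance Sigma
y <- as.vector(X %*% c(1, 2) + eps)

Xs <- solve(L, X)                                 # S^{-1} X
ys <- as.vector(solve(L, y))                      # S^{-1} Y
beta.trans <- coef(lm(ys ~ Xs - 1))               # OLS on the transformed variables

Si <- solve(Sigma)
beta.gls <- solve(t(X) %*% Si %*% X, t(X) %*% Si %*% y)   # Eq. 25 directly
cbind(transformed = as.vector(beta.trans), direct = as.vector(beta.gls))  # should agree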

Figure 1: Top left: data with fitted line. Top right: residuals. Bottom left: plot of estimated variances.
Bottom right: data with weighted least squares line.

Approach 2. This resemblance is no mere coincidence. We can write the WLS problem as
that of minimizing $(Y - X\beta)^T W (Y - X\beta)$ for a diagonal matrix $W$. Suppose we try instead to
minimize
\[
(Y - X\beta)^T W (Y - X\beta)
\]
for a non-diagonal, but still symmetric and positive-definite, matrix $W$. This is called a generalized
least squares (GLS) problem. Every single step we went through before is still valid, because
none of it rested on $W$ being diagonal, so
\[
\hat{\beta}_{GLS} = (X^T W X)^{-1} X^T W Y. \tag{26}
\]

We have just seen that if we set $W = \Sigma^{-1}$, we also get this solution when we transform the
variables so as to de-correlate the noise, and then do ordinary least squares. This should at least
make it plausible that this is a good way to estimate $\beta$ in the face of correlated noise.
To go beyond plausibility, refer back to §3. At no point in our reasoning did we actually rely
on $\mathrm{Var}[\epsilon \mid X]$ being diagonal. It follows that if we set $W = \mathrm{Var}[\epsilon \mid X]^{-1}$, we get the linear, unbiased
estimator of minimum variance. If we believe that the noise is Gaussian, then this is also the
maximum likelihood estimator.
Where Do the Covariances Come From? In general we cannot estimate an unrestricted $\Sigma$ from
the data alone, so we can only use this approach when $\Sigma$ comes from external information. For
example, the data may have some time series structure or spatial structure which suggests the form of $\Sigma$.

7 WLS versus Model Errors


When you find that your residuals from an initial model have non-constant variance or are correlated
with each other, there are (at least) two possible explanations. One is that the fluctuations around
the regression line really are heteroskedastic and/or correlated. In that case, you should try to
model that variance and those correlations, and use WLS or GLS. The other explanation is that
something is wrong with your model. If there’s an important predictor variable which is just missing
from your model, for example, then its contribution to the response will be part of your residuals.
If that omitted variable is larger in some parts of the data than in others, or is itself correlated
across observations, then that will make your residuals change in magnitude and be correlated. More
subtly, having the wrong functional form for a variable you do include can produce those effects as
well.
