Week 3 ML Estimation and REML

Methods of Parameter Estimation

In statistical models (e.g., linear regression, logistic regression, Cox's
proportional hazards), a number of methods can be used to estimate
model parameters:
▶ Maximum Likelihood Estimation
▶ Estimating Equation Method
▶ Bayesian Estimation
▶ Custom Objective Functions (e.g., penalized likelihood)
Matrix Representation of Linear Regression Models

Define the response vector y, design matrix X, coefficient vector β,
and error vector ε as follows:

$$
y = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix}, \quad
X = \begin{pmatrix}
1 & x_{11} & x_{12} & \cdots & x_{1p} \\
1 & x_{21} & x_{22} & \cdots & x_{2p} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
1 & x_{n1} & x_{n2} & \cdots & x_{np}
\end{pmatrix}, \quad
\beta = \begin{pmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_p \end{pmatrix}, \quad
\varepsilon = \begin{pmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{pmatrix}
$$

Consider the linear regression model:

$$y = X\beta + \varepsilon, \qquad E(\varepsilon) = 0, \qquad \mathrm{Var}(\varepsilon) = \sigma^2 I$$
Least Squares Estimation Using Matrix Algebra

For least squares estimation, we minimize the quantity:

Q = (y − Xβ)′ (y − Xβ)

Expanding and simplifying, we have:

Q = y′ y − 2y′ Xβ + β ′ X′ Xβ

To find the value of β that minimizes Q, we set the first derivative of Q to 0 and solve:

$$\frac{\partial Q}{\partial \beta} = -2X'y + 2X'X\beta = 0$$
Leading to the estimate:

β̂LSE = (X′ X)−1 X′ y


Estimate of Variance

▶ The mean squared error (MSE) is an unbiased estimate of σ 2 :

$$\hat\sigma^2 = \frac{1}{n-p-1}(y - X\hat\beta)'(y - X\hat\beta) = \frac{1}{n-p-1}\, y'(I - H)y, \qquad \text{where } H = X(X'X)^{-1}X'$$


▶ MSE adjusts for the loss of degrees of freedom due to estimating
the coefficients.
▶ β̂LSE and MSE are unbiased.

▶ The derivations of β̂LSE and MSE do not require a normal distribution assumption.
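
To make the formulas above concrete, here is a small numpy sketch (the simulated data, dimensions, and true values are arbitrary assumptions for illustration) that computes β̂LSE and the MSE estimate of σ²:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 2                      # sample size and number of covariates (illustrative)
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])  # design matrix with intercept
beta_true = np.array([1.0, 2.0, -0.5])
y = X @ beta_true + rng.normal(scale=1.5, size=n)

# Least squares estimate: beta_hat = (X'X)^{-1} X'y
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# MSE: unbiased estimate of sigma^2, dividing by n - p - 1
resid = y - X @ beta_hat
sigma2_mse = resid @ resid / (n - p - 1)

print(beta_hat, sigma2_mse)
```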
Linear Models with Normal Errors

We further assume that the random error ε follows a normal distribution:

$$y = X\beta + \varepsilon, \qquad \varepsilon \sim N(0, \sigma^2 I)$$

▶ Under this assumption, the sampling distributions of the estimates have the following properties:
▶ β̂LSE follows a normal distribution.
▶ (n − p − 1) MSE / σ² follows a chi-squared distribution with n − p − 1 degrees of freedom.
▶ β̂LSE and MSE are independent.

▶ We can apply the maximum likelihood method to estimate the parameters.
Likelihood Function for Regression Models

The likelihood function for a multivariate normal distribution N(µ, Σ) can be written as:

$$L = \frac{1}{(2\pi)^{n/2}|\Sigma|^{1/2}} \exp\left\{ -\frac{1}{2}(y - \mu)'\Sigma^{-1}(y - \mu) \right\}$$

For our regression model, given Σ = σ²I, we have:

$$L = \frac{1}{(2\pi\sigma^2)^{n/2}} \exp\left\{ -\frac{1}{2\sigma^2}(y - X\beta)'(y - X\beta) \right\}$$

The log-likelihood function can be written as:

$$l = -\frac{n}{2}\ln(2\pi) - \frac{n}{2}\ln(\sigma^2) - \frac{1}{2\sigma^2}(y - X\beta)'(y - X\beta)$$
Linear Regression Models using MLE

$$l = -\frac{n}{2}\ln(2\pi) - \frac{n}{2}\ln(\sigma^2) - \frac{1}{2\sigma^2}(y - X\beta)'(y - X\beta)$$

The first derivatives of the parameters (score equations) are:

$$\frac{\partial l}{\partial \beta} = \frac{1}{\sigma^2}X'(y - X\beta), \quad \text{using the fact that } \frac{\partial(\beta'\beta)}{\partial\beta} = 2\beta$$

$$\frac{\partial l}{\partial \sigma^2} = -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4}(y - X\beta)'(y - X\beta)$$

Setting these two equations to zero gives:

$$\hat\beta = (X'X)^{-1}X'y, \qquad \hat\sigma^2(\beta) = \frac{1}{n}(y - X\beta)'(y - X\beta)$$

Substituting β̂ for β, we obtain the MLE for σ²:

$$\hat\sigma^2 = \frac{1}{n}(y - X\hat\beta)'(y - X\hat\beta)$$
Linear Models for Independent Observations

▶ Reviewed two estimation methods: LSE and MLE, and derived estimates using each method.
▶ Estimates for the regression parameter β are the same using LSE
or MLE (β̂LSE = β̂MLE ).
▶ Estimation of the variance parameter σ²:
▶ The MSE estimate of σ² is unbiased, taking into account the degrees of freedom used to estimate β.
▶ The MLE of σ² is biased; when n is small, the MLE tends to underestimate σ² (see the simulation sketch below).

▶ When extending the methods to models for longitudinal data, we will cover: MLE, REML, GLS, and the Generalized Estimating Equations approach (GEE).
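
To illustrate the downward bias of the ML estimate of σ² noted above, a quick simulation sketch (the sample size and true values are arbitrary assumptions) compares the average MLE (dividing by n) with the average MSE (dividing by n − p − 1):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, sigma2 = 10, 2, 4.0          # small n makes the bias visible (illustrative values)
mle_vals, mse_vals = [], []
for _ in range(5000):
    X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
    y = X @ np.array([1.0, 0.5, -1.0]) + rng.normal(scale=np.sqrt(sigma2), size=n)
    beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
    rss = np.sum((y - X @ beta_hat) ** 2)
    mle_vals.append(rss / n)               # MLE: divides by n, biased downward
    mse_vals.append(rss / (n - p - 1))     # MSE: divides by n - p - 1, unbiased

print(np.mean(mle_vals), np.mean(mse_vals))  # the MLE average falls below sigma2 = 4
```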
General Linear Models for Longitudinal Data
Assume a general linear regression model,
yij = β0 + β1 xij1 + · · · + βp xijp + εij , i = 1, . . . , n, j = 1, . . . , k
Let

$$
y_i = \begin{pmatrix} y_{i1} \\ y_{i2} \\ \vdots \\ y_{ik} \end{pmatrix}, \qquad
X_i = \begin{pmatrix}
1 & x_{i11} & x_{i12} & \cdots & x_{i1p} \\
1 & x_{i21} & x_{i22} & \cdots & x_{i2p} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
1 & x_{ik1} & x_{ik2} & \cdots & x_{ikp}
\end{pmatrix}
$$

denote the k × 1 vector of response variables and the k × (p + 1) design matrix of covariates for the ith subject.
Define

$$
Y = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix}, \qquad
X = \begin{pmatrix} X_1 \\ X_2 \\ \vdots \\ X_n \end{pmatrix}.
$$
The model can be written as:
Y = Xβ + ε, E(ε) = 0, Var(ε) = Σ
The TLC Data

There were 100 study subjects randomly assigned to treatment or placebo and measured at baseline, 1, 4, and 6 weeks post baseline.
The data for the first 2 subjects:

Treat Week y
0 0 30.8
0 1 26.9
0 4 25.8
0 6 23.8
1 0 26.5
1 1 14.8
1 4 19.5
1 6 21.0
Construct Xi , yi , X and Y. What are the dimensions of Xi , yi , X
and Y?
Assumptions

▶ The individuals represent a random sample from the population of interest.
▶ The covariates X are fixed, free of random error.

▶ The elements of the vector of repeated measures yi1 , ..., yik , have
a conditional Multivariate Normal (MVN) distribution, with
means

µij = E(yij |xij ) = β0 + β1 xij1 + · · · + βp xijp

▶ Observations from different individuals are independent, while repeated measurements of the same individual are not assumed to be independent.
Maximum Likelihood for General Linear Models
The likelihood function for a multivariate normal distribution:
 
$$L(\beta, \Sigma) = \frac{1}{(2\pi)^{N/2}|\Sigma|^{1/2}} \exp\left\{ -\frac{1}{2}(Y - X\beta)'\Sigma^{-1}(Y - X\beta) \right\}$$

log likelihood:

$$\log L(\beta, \Sigma) \propto -\frac{1}{2}\ln|\Sigma| - \frac{1}{2}(Y - X\beta)'\Sigma^{-1}(Y - X\beta)$$

Using the matrix derivative formula below and the chain rule:

$$\frac{\partial}{\partial x}(x'Ax) = 2Ax$$

$$\frac{\partial \log L}{\partial \beta} = X'\Sigma^{-1}(Y - X\beta)$$
Assuming that Σ is known, solving the score equation for β gives:

X′ Σ−1 Xβ = X′ Σ−1 Y ⇒ β̂ = (X′ Σ−1 X)−1 X′ Σ−1 Y


Generalized Least Squares

▶ Under the multivariate normal assumption, the MLE of β is also a generalized least squares estimate of β.
▶ Note that maximizing the likelihood for a given Σ is equivalent to finding the estimates that minimize

(Y − Xβ)′ Σ−1 (Y − Xβ)

▶ The estimates of β that minimize this function are called generalized least squares estimates (GLS).
▶ Instead of minimizing the squared distance between Y and Xβ, a
generalized squared distance determined by Σ−1 is minimized.
Properties of GLS

For any choice of Σ, the GLS estimate of β is unbiased; that is

E(β̂) = β

Variance:
Cov(β̂) = (X′ Σ−1 X)−1
Sampling Distribution:

$$\hat\beta \sim N\left(\beta,\, (X'\Sigma^{-1}X)^{-1}\right)$$

▶ If Σ = σ²I, the GLS estimate reduces to the ordinary least squares estimate.
▶ The most efficient generalized least squares estimate is the one
that uses the true value of Σ.
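
As a small sketch, the GLS estimate and its covariance can be computed directly when Σ is treated as known; the AR(1)-type Σ and the simulated data below are assumptions chosen only for illustration:

```python
import numpy as np

def gls(X, y, Sigma):
    """GLS estimate of beta and its covariance for a known Sigma."""
    Sinv = np.linalg.inv(Sigma)
    cov_beta = np.linalg.inv(X.T @ Sinv @ X)       # (X' Sigma^{-1} X)^{-1}
    beta_hat = cov_beta @ X.T @ Sinv @ y           # (X' Sigma^{-1} X)^{-1} X' Sigma^{-1} y
    return beta_hat, cov_beta

# Illustration with an assumed AR(1)-type covariance for 5 observations
rng = np.random.default_rng(2)
n = 5
Sigma = 2.0 * 0.6 ** np.abs(np.subtract.outer(np.arange(n), np.arange(n)))
X = np.column_stack([np.ones(n), np.arange(n)])
y = X @ np.array([1.0, 0.3]) + rng.multivariate_normal(np.zeros(n), Sigma)

beta_hat, cov_beta = gls(X, y, Sigma)
print(beta_hat)
print(cov_beta)
```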
Maximum Likelihood Estimation for Σ

Since we usually do not know Σ, we typically estimate it from the data. Now we compute the MLE of Σ.

To get the derivative with respect to Σ⁻¹, we use the fact that (y − Xβ)′Σ⁻¹(y − Xβ) is a scalar and

$$\mathrm{tr}(X'AX) = \mathrm{tr}(XX'A)$$


The log-likelihood can be rewritten as:

$$\log L(\beta, \Sigma^{-1}) \propto -\frac{1}{2}\ln|\Sigma| - \frac{1}{2}(Y - X\beta)'\Sigma^{-1}(Y - X\beta)$$
$$= \frac{1}{2}\ln|\Sigma^{-1}| - \frac{1}{2}\,\mathrm{tr}\left[(Y - X\beta)'\Sigma^{-1}(Y - X\beta)\right]$$
$$= \frac{1}{2}\ln|\Sigma^{-1}| - \frac{1}{2}\,\mathrm{tr}\left[(Y - X\beta)(Y - X\beta)'\Sigma^{-1}\right]$$
Maximum Likelihood Estimation for Σ

Using the properties of matrix derivatives

$$\frac{\partial\,\mathrm{tr}(CB)}{\partial B} = C' \quad \text{when } B \text{ is symmetric}$$

and

$$\frac{\partial \ln|\Sigma^{-1}|}{\partial \Sigma^{-1}} = \Sigma,$$

the derivative of the log likelihood with respect to Σ⁻¹ is given by:

$$\frac{\partial \log L}{\partial \Sigma^{-1}} = \frac{1}{2}\Sigma - \frac{1}{2}(Y - X\beta)(Y - X\beta)'$$

Setting the first derivative to 0 gives us the MLE of Σ:

$$\hat\Sigma = (Y - X\beta)(Y - X\beta)'$$
Numerical algorithms for finding MLE

Solving the two score equations yields:

β̂ = (X′ Σ−1 X)−1 X′ Σ−1 Y

Σ̂ = (Y − Xβ)(Y − Xβ)′
An iterative algorithm needs to be used to obtain ML estimates of β
and Σ:
1. Choose an initial estimate of Σ to obtain an estimate of β.

β̂ = (X′ Σ̂−1 X)−1 X′ Σ̂−1 Y

2. Use β̂ to obtain a new estimate for Σ.

Σ̂ = (Y − Xβ̂)(Y − Xβ̂)′

3. The process iterates until the estimates of β and Σ converge.


Asymptotic Distributions of β̂

▶ Recall: β̂ ∼ N (β, (X′ Σ−1 X)−1 ) when Σ is known.

▶ In large samples, the resulting estimator of β from the iterative algorithm will have the same properties as when Σ is known.
▶ To test hypotheses about β, we can make direct use of the ML estimate and its estimated covariance matrix,

$$\widehat{\mathrm{Cov}}(\hat\beta) = (X'\hat\Sigma^{-1}X)^{-1}$$
Statistical Inference for β

Let L denote a matrix or vector of known weights (often representing contrasts of interest) and suppose that it is of interest to test

$$H_0: L\beta = 0 \quad \text{vs} \quad H_A: L\beta \neq 0$$
For example, if a model has 3 regression parameters β = (β1, β2, β3), the following composite hypotheses can be expressed using L matrices:
▶ To test H0: β1 = β2, we can use L = (1, −1, 0), so that Lβ = β1 − β2; testing H0 is then equivalent to testing H0: Lβ = 0.
▶ To test H0: β1 = β2 = β3, we can use

$$L = \begin{pmatrix} 1 & -1 & 0 \\ 0 & 1 & -1 \end{pmatrix}, \qquad L\beta = \begin{pmatrix} \beta_1 - \beta_2 \\ \beta_2 - \beta_3 \end{pmatrix},$$

and test H0: Lβ = 0.
Wald Test

▶ A natural estimate of Lβ is Lβ̂, and the covariance matrix of Lβ̂ is given by L Cov(β̂) L′.
▶ Suppose that L is a single row vector. An approximate 95% CI is given by

$$L\hat\beta \pm 1.96\sqrt{L\,\mathrm{Cov}(\hat\beta)\,L'}$$

▶ To test H0: Lβ = 0 versus HA: Lβ ≠ 0, we can use the Wald statistic

$$Z = \frac{L\hat\beta}{\sqrt{L\,\mathrm{Cov}(\hat\beta)\,L'}}$$

and compare it with a standard normal distribution.
▶ The Wald test is based on asymptotic properties, meaning it tends to perform well when sample sizes are large.
Generalization of Wald Test

▶ Recall: If Z is a standard normal random variable, then Z² has a χ² distribution with 1 df. Thus, an identical test compares

$$W = (L\hat\beta)'\,(L\,\mathrm{Cov}(\hat\beta)\,L')^{-1}(L\hat\beta)$$

to the critical values of the χ² distribution with 1 df.
▶ This approach readily generalizes to L having more than one row
and this allows simultaneous testing of more than one hypothesis.
▶ Suppose that L has r rows (number of contrasts). Then a
simultaneous test of the r contrasts is given by

W = (Lβ̂)′ (LCov(β̂)L′ )−1 (Lβ̂)

which has a χ² distribution with r df if H0 is true.
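
The two forms of the Wald test above can be computed in a few lines. In this sketch, β̂ and its estimated covariance matrix are made-up numbers standing in for the output of a fitted model:

```python
import numpy as np
from scipy import stats

# Assumed illustrative estimates: beta_hat and its estimated covariance
beta_hat = np.array([2.1, 1.4, 0.9])
cov_beta = np.array([[0.10, 0.02, 0.01],
                     [0.02, 0.08, 0.01],
                     [0.01, 0.01, 0.09]])

# Single contrast: H0: beta1 = beta2, i.e. L = (1, -1, 0)
L1 = np.array([1.0, -1.0, 0.0])
z = (L1 @ beta_hat) / np.sqrt(L1 @ cov_beta @ L1)
p_z = 2 * stats.norm.sf(abs(z))                      # two-sided p-value from N(0, 1)

# Two contrasts at once: H0: beta1 = beta2 = beta3
L2 = np.array([[1.0, -1.0, 0.0],
               [0.0, 1.0, -1.0]])
Lb = L2 @ beta_hat
W = Lb @ np.linalg.solve(L2 @ cov_beta @ L2.T, Lb)   # (Lb)'(L Cov L')^{-1}(Lb)
p_W = stats.chi2.sf(W, df=L2.shape[0])               # chi-squared with r = 2 df

print(z, p_z, W, p_W)
```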


Hypothesis Tests for β in SAS PROC MIXED

By default, PROC MIXED performs these tests for the coefficients:

▶ When L is a row vector, a t-test is used:

$$T = \frac{L\hat\beta}{\sqrt{L\,\mathrm{Cov}(\hat\beta)\,L'}} \sim t(\nu), \qquad \nu = \text{df}$$

▶ When the row rank of L is greater than 1 (contrast matrix), an F-test is used:

$$F = \frac{(L\hat\beta)'\,(L\,\mathrm{Cov}(\hat\beta)\,L')^{-1}(L\hat\beta)}{r} \sim F(\mathrm{rank}(L), \nu),$$

where r = rank(L Cov(β̂) L′).
Likelihood Ratio Test

Suppose that we are interested in comparing two nested models, a "full" model and a "reduced" model.

Aside: Nested Models

When one model (the "reduced" model) is a special case of the other (the "full" model), the reduced model is said to be nested within the full model.

We can compare two nested models by comparing their maximized log-likelihoods, say l̂_full and l̂_red; the former is at least as large as the latter.
The larger the difference between l̂_full and l̂_red, the stronger the evidence that the reduced model is inadequate.
Likelihood Ratio Test

A formal test is obtained by taking

$$2(\hat{l}_{\text{full}} - \hat{l}_{\text{red}})$$

and comparing the statistic to a chi-squared distribution with degrees of freedom equal to the difference between the number of parameters in the full and reduced models.

Formally, this test is called the likelihood ratio test (LRT).

We can use LRTs for hypotheses about models for the mean and the covariance.
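
A minimal sketch of the LRT computation; the maximized log-likelihoods and parameter counts below are placeholders standing in for the values returned by fitting a full and a reduced model:

```python
from scipy import stats

# Placeholder maximized log-likelihoods from a hypothetical full and reduced model
loglik_full, loglik_reduced = -512.3, -518.9
n_params_full, n_params_reduced = 8, 5

lrt = 2 * (loglik_full - loglik_reduced)          # LRT statistic
df = n_params_full - n_params_reduced             # difference in number of parameters
p_value = stats.chi2.sf(lrt, df)

print(lrt, df, p_value)
```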
Numerical algorithms for finding MLE

▶ An iterative algorithm for obtaining maximum likelihood estimates for longitudinal models:

β̂ = (X′ Σ̂−1 X)−1 X′ Σ̂−1 Y

Σ̂ = (Y − Xβ̂)(Y − Xβ̂)′
▶ The MLE of β depends on the MLE of Σ; different specifications of the covariance structure will lead to different estimates of β.
Covariance Matrix for Longitudinal Data

In our MLE derivation, we used the most general form for the
covariance matrix, Σ, without assuming any special pattern. In
practice, a general covariance matrix is too computationally costly because
▶ Σ is a matrix of large order (nk × nk)
▶ Maximum likelihood estimation requires computing Σ⁻¹ iteratively.
Luckily, in most longitudinal data analysis, it is reasonable to assume
that observations collected from different individuals are independent.
Covariance Matrix for Longitudinal Data
Assuming independence between individuals, the covariance matrix
has the following structure:
 
Σ1 0 ··· 0
0 Σ2 ··· 0 
Σ= .
 
.. .. ..
 ..

. . . 
0 0 ··· Σn

If we further assume Σ₁ = Σ₂ = … = Σₙ, this covariance pattern greatly reduces the dimensionality of the ML (or REML) estimation, leading to (verify on your own):

$$\hat\beta = \left(\sum_{i=1}^{n} X_i'\hat\Sigma_i^{-1}X_i\right)^{-1}\left(\sum_{i=1}^{n} X_i'\hat\Sigma_i^{-1}y_i\right)$$

$$\hat\Sigma_i = \frac{1}{n}\sum_{i=1}^{n}(Y_i - X_i\hat\beta)(Y_i - X_i\hat\beta)'$$
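
A numpy sketch of this per-subject iteration, with simulated balanced data (the dimensions, true β, true Σ, and convergence tolerance are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)
n, k = 100, 4                             # subjects and repeated measures per subject
Sigma_true = 1.0 * np.eye(k) + 0.5        # exchangeable-type truth (illustrative)
beta_true = np.array([25.0, -1.0])
X_list = [np.column_stack([np.ones(k), np.arange(k)]) for _ in range(n)]
y_list = [Xi @ beta_true + rng.multivariate_normal(np.zeros(k), Sigma_true)
          for Xi in X_list]

Sigma_hat = np.eye(k)                     # step 1: initial estimate of Sigma_i
for _ in range(50):
    Sinv = np.linalg.inv(Sigma_hat)
    A = sum(Xi.T @ Sinv @ Xi for Xi in X_list)
    b = sum(Xi.T @ Sinv @ yi for Xi, yi in zip(X_list, y_list))
    beta_hat = np.linalg.solve(A, b)      # beta_hat given the current Sigma_hat
    resids = [yi - Xi @ beta_hat for Xi, yi in zip(X_list, y_list)]
    Sigma_new = sum(np.outer(r, r) for r in resids) / n   # step 2: update Sigma_i
    if np.max(np.abs(Sigma_new - Sigma_hat)) < 1e-8:      # step 3: check convergence
        Sigma_hat = Sigma_new
        break
    Sigma_hat = Sigma_new

print(beta_hat)
print(Sigma_hat)
```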
Covariance Matrix for Longitudinal Data

▶ The assumption of independence between subjects allows for a reduction in matrix dimension in the ML algorithm from N = nk to k.
▶ In the TLC data (n = 100, k = 4), the outcome vector Y is a
400 × 1 vector; hence, a general covariance matrix Σ would have
dimensions 400 × 400.
▶ With the assumption of independence between subjects, we only
need to estimate Σi with dimensions of 4 × 4 in each iteration.
▶ 10 parameters are involved in defining Σi .
Covariance Matrix for Longitudinal Data

Since we have shown that the MLE of β depends on the MLE of Σ, and there are usually many plausible patterns for the covariance matrices Σi in a given dataset, fitting a longitudinal model will require two key decisions:
▶ Which independent variables should be included in the model?
▶ What covariance matrix structure should be used to best capture
correlations among repeated measures?
Examples of Covariance Matrix Structures
One commonly used type of covariance matrix is the compound
symmetry (CS) structure:

$$
\Sigma_i = \begin{pmatrix}
\sigma^2 + \sigma_b^2 & \sigma_b^2 & \cdots & \sigma_b^2 \\
\sigma_b^2 & \sigma^2 + \sigma_b^2 & \cdots & \sigma_b^2 \\
\vdots & \vdots & \ddots & \vdots \\
\sigma_b^2 & \sigma_b^2 & \cdots & \sigma^2 + \sigma_b^2
\end{pmatrix} = \sigma^2 I + \sigma_b^2 J
$$

Or alternatively:

$$
\Sigma_i = \begin{pmatrix}
\sigma^2 & \sigma^2\rho & \cdots & \sigma^2\rho \\
\sigma^2\rho & \sigma^2 & \cdots & \sigma^2\rho \\
\vdots & \vdots & \ddots & \vdots \\
\sigma^2\rho & \sigma^2\rho & \cdots & \sigma^2
\end{pmatrix}
$$

Therefore, two parameters are involved in defining the covariance matrix in a compound symmetry structure.
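
As a quick sketch, the CS structure can be built directly from its two parameters; the helper and the numerical values below are illustrative assumptions:

```python
import numpy as np

def compound_symmetry(k, sigma2, sigma2_b):
    """CS covariance: sigma2 * I + sigma2_b * J (two parameters)."""
    return sigma2 * np.eye(k) + sigma2_b * np.ones((k, k))

Sigma_cs = compound_symmetry(4, sigma2=2.0, sigma2_b=1.0)

# Equivalent (variance, correlation) parameterization: total variance and rho
total_var = Sigma_cs[0, 0]                     # sigma^2 + sigma_b^2
rho = Sigma_cs[0, 1] / total_var               # sigma_b^2 / (sigma^2 + sigma_b^2)
print(Sigma_cs)
print(total_var, rho)
```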
Examples of Covariance Matrix Structures

A compound symmetry covariance matrix assumes that the variances of the random error terms, σ² + σ_b², stay constant across time points.

A different covariance matrix structure that relaxes this assumption is the heterogeneous compound symmetry (HCS) structure:

$$
\Sigma_i = \begin{pmatrix}
\sigma_1^2 & \sigma_1\sigma_2\rho & \cdots & \sigma_1\sigma_k\rho \\
\sigma_1\sigma_2\rho & \sigma_2^2 & \cdots & \sigma_2\sigma_k\rho \\
\vdots & \vdots & \ddots & \vdots \\
\sigma_1\sigma_k\rho & \sigma_2\sigma_k\rho & \cdots & \sigma_k^2
\end{pmatrix}
$$

In the HCS type of covariance structure, k + 1 parameters are needed in defining the covariance matrix.
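
A companion sketch for the HCS structure, built from k standard deviations and a single correlation ρ (k + 1 parameters); the values are illustrative:

```python
import numpy as np

def hetero_compound_symmetry(sds, rho):
    """HCS covariance: sd_j * sd_l * rho off the diagonal, sd_j^2 on the diagonal."""
    sds = np.asarray(sds, dtype=float)
    R = np.full((len(sds), len(sds)), rho)     # common correlation off the diagonal
    np.fill_diagonal(R, 1.0)
    return np.outer(sds, sds) * R

Sigma_hcs = hetero_compound_symmetry([2.0, 2.5, 3.0, 3.5], rho=0.4)
print(Sigma_hcs)                               # variances differ across time points
```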
The MLE of Σ is biased
Recall that the MLE of Σ is given by:

$$\hat\Sigma_i = \frac{1}{n}\sum_{i=1}^{n}(Y_i - X_i\hat\beta)(Y_i - X_i\hat\beta)'$$

▶ Although the MLEs of the elements of Σ (the variances and covariances) have the usual large-sample properties (they are asymptotically normal and unbiased), the estimates are biased toward 0 in small samples.
▶ The situation is similar to the bias in the MLE of σ² in linear regression models, where the ML estimate does not adjust for the effect of estimating β.
▶ In ordinary least squares regression, we can eliminate the bias by
dividing by n − p − 1 instead of n.
▶ Though the correction is not always as simple in the longitudinal
setting, the same idea applies. We can correct the estimates of
variance parameters for the bias.
Restricted Maximum Likelihood Estimation (REML)

▶ REML is introduced for the purpose of finding an unbiased estimator for the covariance matrix Σ.
▶ Two significant academic papers in the development of REML:
▶ Patterson H. D.; Thompson R. (1971). “Recovery of inter-block
information when block sizes are unequal”. Biometrika. 58 (3):
545. doi:10.1093/biomet/58.3.545.
▶ Harville D. A. (1977). “Maximum Likelihood Approaches to
Variance Component Estimation and to Related Problems”.
Journal of the American Statistical Association. 72 (358):
320–338. doi:10.2307/2286796.
REML

▶ The main idea behind REML is to use a slightly different likelihood function (restricted ML) that eliminates the parameters β, so that it is defined only in terms of Σ.
▶ One possible way to obtain the restricted likelihood is to transform the data to a set of linear combinations of observations whose distribution does not depend on β.
▶ An example of such linear combinations of observations: residuals
from OLS estimates of β

Y − Xβ̂ = (I − H)Y, where H = X(X′ X)−1 X′ .

The likelihood of these residuals depends only on Σ and not on β.

▶ When the residual likelihood is maximized, we obtain estimates of Σ whose degrees of freedom are effectively corrected for the reduction in degrees of freedom due to estimating β.
REML

Observe that the vector

$$(I - H)Y = (I - X(X'X)^{-1}X')Y \overset{\text{def}}{=} PY$$

is free of β.

Let M be a matrix whose columns are the eigenvectors of P corresponding to the non-zero eigenvalues. The likelihood function for MY is the Residual (or Restricted) Likelihood:

$$\log L_R(\Sigma \mid MY) = -\frac{1}{2}\ln|\Sigma| - \frac{1}{2}(Y - X\hat\beta)'\Sigma^{-1}(Y - X\hat\beta) - \frac{1}{2}\ln|X'\Sigma^{-1}X|,$$

where β̂ is the GLS solution for β.
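
A sketch of evaluating this restricted log-likelihood for a candidate Σ, plugging in the GLS solution for β̂; the simulated data and the candidate covariance matrices are assumptions for illustration:

```python
import numpy as np

def restricted_loglik(Sigma, X, y):
    """Restricted (REML) log-likelihood, up to an additive constant."""
    Sinv = np.linalg.inv(Sigma)
    XtSinvX = X.T @ Sinv @ X
    beta_gls = np.linalg.solve(XtSinvX, X.T @ Sinv @ y)   # GLS solution for beta
    r = y - X @ beta_gls
    return (-0.5 * np.linalg.slogdet(Sigma)[1]
            - 0.5 * r @ Sinv @ r
            - 0.5 * np.linalg.slogdet(XtSinvX)[1])        # extra REML determinant term

# Tiny illustration comparing two assumed working covariance matrices
rng = np.random.default_rng(4)
n = 6
X = np.column_stack([np.ones(n), np.arange(n)])
y = X @ np.array([1.0, 0.5]) + rng.normal(size=n)
print(restricted_loglik(np.eye(n), X, y))
print(restricted_loglik(2.0 * np.eye(n), X, y))
```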


REML

▶ Thus, rather than maximizing

$$-\frac{1}{2}\ln|\Sigma| - \frac{1}{2}(Y - X\hat\beta)'\Sigma^{-1}(Y - X\hat\beta),$$

▶ REML maximizes the following slightly modified log-likelihood

$$-\frac{1}{2}\ln|\Sigma| - \frac{1}{2}(Y - X\hat\beta)'\Sigma^{-1}(Y - X\hat\beta) - \frac{1}{2}\ln|X'\Sigma^{-1}X|$$

▶ The extra determinant term effectively makes a correction, or adjustment, that is analogous to the correction to the denominator in σ̂².
REML

Use an iterative algorithm to find the REML estimates of β and Σ:

$$\hat\beta_{\text{REML}} = (X'\hat\Sigma_{\text{REML}}^{-1}X)^{-1}X'\hat\Sigma_{\text{REML}}^{-1}Y$$

$$\hat\Sigma_{\text{REML}} = (Y - X\hat\beta_{\text{REML}})(Y - X\hat\beta_{\text{REML}})' + X\,\widehat{\mathrm{Cov}}(\hat\beta_{\text{REML}})\,X'$$

Asymptotic properties similar to those of the ML estimates have also been established:

$$\hat\beta_{\text{REML}} \sim N\left(\beta,\, (X'\hat\Sigma_{\text{REML}}^{-1}X)^{-1}\right)$$

$$\widehat{\mathrm{Cov}}(\hat\beta_{\text{REML}}) = (X'\hat\Sigma_{\text{REML}}^{-1}X)^{-1}$$
Hypothesis Testing on β̂ REML

Given

$$\hat\beta_{\text{REML}} \sim N\left(\beta,\, (X'\hat\Sigma_{\text{REML}}^{-1}X)^{-1}\right),$$

we can conduct hypothesis tests similarly to MLE:

$$H_0: L\beta = 0 \quad \text{vs} \quad H_A: L\beta \neq 0$$

▶ When L is a row vector, use a t-test:

$$T = \frac{L\hat\beta_{\text{REML}}}{\sqrt{L\,\widehat{\mathrm{Cov}}(\hat\beta_{\text{REML}})\,L'}} \sim t(\nu), \qquad \nu = \text{df}$$

▶ When the row rank of L is greater than 1, use an F-test:

$$F = \frac{(L\hat\beta_{\text{REML}})'\,(L\,\widehat{\mathrm{Cov}}(\hat\beta_{\text{REML}})\,L')^{-1}(L\hat\beta_{\text{REML}})}{r} \sim F(\mathrm{rank}(L), \nu),$$

where r = rank(L Cov(β̂_REML) L′).
Summary

▶ Linear regression models are special cases of general linear models when Σ = σ²I.
▶ For general linear models, both ML and REML estimates of the
regression parameter β depend on the estimate of the covariance
matrix Σ.
▶ The mathematical derivation assumes general forms of the
covariance matrix.
▶ Different patterns assumed for Σ will lead to different estimates
for β.
▶ Future lectures will cover how to use SAS to specify a general
linear model, including various patterns for Σ.
