Generalized Linear Models
Generalized Estimating Equations
Source: Book and internet
Dr. Md Jamal Uddin
Professor
Statistics, SUST, Sylhet
Correlated data
1. Repeated measures: same subjects, same measure, successive
times – expect successive measurements to be correlated
Treatment groups Measurement times
Subjects, i = B
1,…,n
Randomize
2
Correlated data
2. Clustered/multilevel studies
Level 3
Level 2
Level 1
E.g., Level 3: populations
Level 2: age - sex groups
Level 1: blood pressure measurements in sample of people in each
age - sex group
We expect correlations within populations and within age-sex groups due
to genetic, environmental and measurement effects
3
Example with time-dependent,
continuous predictor…
6 patients with depression are given a drug that increases levels of a “happy
chemical” in the brain. At baseline, all 6 patients have similar levels of this
happy chemical and scores >=14 on a depression scale. Researchers measure
depression score and brain-chemical levels at three subsequent time points: at 2
months, 3 months, and 6 months post-baseline.
Here are the data in broad form:
id time1 time2 time3 time4 chem1 chem2 chem3 chem4
1 20 18 15 20 1000 1100 1200 1300
2 22 24 18 22 1000 1000 1005 950
3 14 10 24 10 1000 1999 800 1700
4 38 34 32 34 1000 1100 1150 1100
5 25 29 25 29 1000 1000 1050 1010
6 30 28 26 14 1000 1100 1109 1500
4
Turn the data to long form…
data long4;
set new4;
time=0; score=time1; chem=chem1; output;
time=2; score=time2; chem=chem2; output;
time=3; score=time3; chem=chem3; output;
time=6; score=time4; chem=chem4; output;
run;
5
id time score chem
1 0 20 1000
Data in long1 2 18 1100
1 3 15 1200
form: 1 6 20 1300
2 0 22 1000
2 2 24 1000
2 3 18 1005
2 6 22 950
3 0 14 1000
3 2 10 1999
3 3 24 800
3 6 10 1700
4 0 38 1000
4 2 34 1100
4 3 32 1150
4 6 34 1100
5 0 25 1000
5 2 29 1000
5 3 25 1050
5 6 29 1010
6 0 30 1000
6 2 28 1100
6 3 26 1109
6 6 14 150
Graphically, let’s see what’s going on:
First, by subject.
All 6 subjects at once:
Mean chemical levels compared with mean
depression scores:
How do you analyze these
data?
Using repeated-measures ANOVA?
The only way to force a rANOVA here is…
data forcedanova;
set broad;
avgchem=(chem1+chem2+chem3+chem4)/4;
if avgchem<1100 then group="low";
Gives no
if avgchem>1100 then group="high";
run; significant
proc glm data=forcedanova; results!
class group;
model time1-time4= group/ nouni;
repeated time /summary;
run; quit; 15
Limitations of Repeated
ANOVA/MANOVA
• They assume categorical predictors.
• They do not handle time-dependent covariates
(predictors measured over time).
• They assume everyone is measured at the same time
(time is categorical) and at equally spaced time
intervals.
• Missing data must be imputed.
• They require restrictive assumptions about the
correlation structure.
16
If such conditions are violated
What are the solutions ?
Solutions!!!
• We need more complicated models!
• Models for correlated data
1. Linear mixed model (already we learned last
classes)
2. Generalized estimating equations
3. GLMM
Generalized Estimating Equations (GEEs)
These are used to model correlated data from
• Longitudinal/ repeated measures studies
• Clustered/ multilevel studies
19
But first…naïve analysis…
• The data in long form could be naively thrown into
an ordinary least squares (OLS) linear regression…
• I.e., look for a linear correlation between chemical
levels and depression scores ignoring the
correlation between subjects. (the cheating way to
get 4-times as much data!)
• Can also look for a linear correlation between
depression scores and time.
• In SAS:
proc reg data=long;
model score=chem time;
run;
20
Graphically…
Naïve linear regression here looks for significant slopes (ignoring
correlation between individuals):
Y= 24.90889 - 0.557778*time. Y=42.44831-0.01685*chem
N=24—as if we have 24 independent observations!
21
The model
The linear regression model:
22
Results…
The fitted model:
Parameter Standard
Variable DF Estimate Error t Value Pr > |t|
Intercept 1 42.46803 6.06410 7.00 <.0001
chem 1 -0.01704 0.00550 -3.10 0.0054
time 1 0.07466 0.64946 0.11 0.9096
1-unit increase in chemical is associated
with a .0174 decrease in depression score
(1.7 points per 100 units chemical)
Each month is associated only with a .07
increase in depression score, after
correcting for chemical changes. 23
Generalized Estimating
Equations (GEE)
• GEE takes into account the dependency of
observations by specifying a “working
correlation structure.”
• Let’s briefly look at the model (we’ll return
to it in detail later)…
24
The model…
Measures linear correlation between chemical levels and depression scores
across all 4 time periods. Vectors!
Measures linear correlation between time and depression scores.
CORR represents the correction for correlation between observations.
A significant beta 1 (chem effect) here would mean either that people who have high levels
of chemical also have low depression scores (between-subjects effect), or that people
whose chemical levels change correspondingly have changes in depression score
(within-subjects effect), or both.
25
SAS code (long form of data!!)
Generalized Linear models (using MLE)…
proc genmod data=long4;
class id;
model score=chem time;
repeated subject = id / type=exch corrw; run; quit;
The type of correlation structure…
Time is continuous (do not place on NOTE, for time-dependent predictors…
class statement)! --Interaction term with time (e.g. chem*time) is
NOT necessary to get a within-subjects effect.
Here we are modeling as a linear
--Would only be included if you thought there was
relationship with score. an acceleration or deceleration of the chem effect
with time. 26
Results…
Analysis Of GEE Parameter Estimates
Empirical Standard Error Estimates
Standard 95% Confidence
Parameter Estimate Error Limits Z Pr > |Z|
Intercept 38.2431 4.9704 28.5013 47.9848 7.69 <.0001
chem -0.0129 0.0026 -0.0180 -0.0079 -5.00 <.0001
time -0.0775 0.2829 -0.6320 0.4770 -0.27 0.7841
27
How does GEE work?
• First, a naive linear regression analysis is carried
out, assuming the observations within subjects
are independent.
• Then, residuals are calculated from the naive
model (observed-predicted) and a working
correlation matrix is estimated from these
residuals.
• Then the regression coefficients are refit,
correcting for the correlation. (Iterative process)
• The within-subject correlation structure is treated
as a nuisance variable (i.e. as a covariate)
28
OLS regression
variance-covariance matrix
t1 t2 t3 Correlation structure (pairwise
correlations between time
t1 points) is Independence.
t2
t3
Variance of scores is homogenous across
time (MSE in ordinary least squares
regression). 29
GEE variance-covariance matrix
t1 t2 t3 Correlation structure must be
specified.
t1
t2
t3
Variance of scores is homogenous across
time (residual variance).
30
Choice of the correlation
structure within GEE
In GEE, the correction for within subject correlations is
carried out by assuming a priori a correlation structure for
the repeated measurements (although GEE is fairly robust
against a wrong choice of correlation matrix—particularly
with large sample size)
Choices:
• Independent (naïve analysis)
• Exchangeable (compound symmetry, as in rANOVA)
• Autoregressive
• M-dependent
• Unstructured (no specification, as in rMANOVA)
We are looking for the simplest structure (uses up the fewest degrees of freedom)
that fits data well!
31
Correlation
For unit i
For repeated measures = correl between times l and m
For clustered data = correl between measures l and m
For all models considered here Vi is assumed to be same for
all units
32
Types of correlation
1. Independent: Vi is diagonal
2. Exchangeable: All measurements on the same
unit are equally correlated
Plausible for clustered data
Other terms: spherical and compound symmetry
33
Types of correlation
3. Correlation depends on time or distance between
measurements l and m
e.g. first order auto-regressive model has terms ρ,
ρ2, ρ3 and so on
Plausible for repeated measures where correlation is
known to decline over time
4. Unstructured correlation: no assumptions about the
correlations
Lots of parameters to estimate – may not converge
34
Types of correlation-Summary
35
Independence
t1 t2 t3
t1
t2
t3
36
Exchangeable
t1 t2 t3
t1
t2
t3
Also known as compound symmetry or
sphericity. Costs 1 df to estimate p.
37
Autoregressive
t1 t2 t3 t4
t1
t2
t3
t4
Only 1 parameter estimated.
Decreasing correlation for farther
time periods. 38
M-dependent
t1 t2 t3 t4
t1
t2
t3
t4
Here, 2-dependent. Estimate 2 parameters (adjacent time
periods have 1 correlation coefficient; time periods 2 units of
time away have a different correlation coefficient; others are
39
uncorrelated)
Unstructured
t1 t2 t3 t4
t1
t2
t3
t4
Estimate all correlations
separately (here 6)
40
How GEE handles missing data
Uses the “all available pairs” method, in
which all non-missing pairs of data are
used in the estimating the working
correlation parameters.
Because the long form of the data are
being used, you only lose the
observations that the subject is
missing, not all measurements.
41
Back to our example…
What does the empirical correlation matrix look like
for our data?
Pearson Correlation Coefficients, N = 6
Prob > |r| under H0: Rho=0
time1 time2 time3 time4
Independent? time1 1.00000 0.92569 0.69728 0.68635
Exchangeable? 0.0081 0.1236 0.1321
Autoregressive? time2 0.92569 1.00000 0.55971 0.77991
0.0081 0.2481 0.0673
M-dependent?
time3 0.69728 0.55971 1.00000 0.37870
Unstructured? 0.1236 0.2481 0.4591
time4 0.68635 0.77991 0.37870 1.00000
0.1321 0.0673 0.4591
42
Back to our example…
I previously chose an exchangeable correlation matrix…
proc genmod data=long4;
class id;
model score=chem time;
repeated subject = id / type=exch corrw; run; quit;
This asks to see the
working correlation
matrix.
43
Working Correlation Matrix
Working Correlation Matrix
Col1 Col2 Col3 Col4
Row1 1.0000 0.7276 0.7276 0.7276
Row2 0.7276 1.0000 0.7276 0.7276
Row3 0.7276 0.7276 1.0000 0.7276
Row4 0.7276 0.7276 0.7276 1.0000
Standard 95% Confidence
Parameter Estimate Error Limits Z Pr > |Z|
Intercept 38.2431 4.9704 28.5013 47.9848 7.69 <.0001
chem -0.0129 0.0026 -0.0180 -0.0079 -5.00 <.0001
time -0.0775 0.2829 -0.6320 0.4770 -0.27 0.7841
44
Compare to autoregressive…
proc genmod data=long4;
class id;
model score=chem time;
repeated subject = id / type=ar corrw;
run; quit;
45
Working Correlation Matrix
Working Correlation Matrix
Col1 Col2 Col3 Col4
Row1 1.0000 0.7831 0.6133 0.4803
Row2 0.7831 1.0000 0.7831 0.6133
Row3 0.6133 0.7831 1.0000 0.7831
Row4 0.4803 0.6133 0.7831 1.0000
Analysis Of GEE Parameter Estimates
Empirical Standard Error Estimates
Standard 95% Confidence
Parameter Estimate Error Limits Z Pr > |Z|
Intercept 36.5981 4.0421 28.6757 44.5206 9.05 <.0001
chem -0.0122 0.0015 -0.0152 -0.0092 -7.98 <.0001
time 0.1371 0.3691 -0.5864 0.8605 0.37 0.7104
46
Example two…recall…
From rANOVA:
Within subjects effects,
but no between subjects
effects.
Time is significant.
Group*time is not
significant.
Group is not significant.
This is an example with a
binary time-independent
predictor.
47
Empirical Correlation
Pearson Correlation Coefficients, N = 6
Prob > |r| under H0: Rho=0
time1 time2 time3 time4
Independent? time1 1.00000 -0.13176 -0.01435 -0.50848
0.8035 0.9785 0.3030
Exchangeable?
time2 -0.13176 1.00000 -0.02819 -0.17480
Autoregressive? 0.8035 0.9577 0.7405
M-dependent? time3 -0.01435 -0.02819 1.00000 0.69419
0.9785 0.9577 0.1260
Unstructured?
time4 -0.50848 -0.17480 0.69419 1.00000
0.3030 0.7405 0.1260
48
GEE analysis
proc genmod data=long;
class group id;
model score= group time group*time;
repeated subject = id / type=un corrw ;
run; quit;
NOTE, for time-independent predictors…
--You must include an interaction term with time to get a
within-subjects effect (development over time).
49
Working Correlation Matrix
Working Correlation Matrix
Col1 Col2 Col3 Col4
Group A is on average 8 points higher;
Row1 1.0000 -0.0701 0.1916 -0.1817 there’s an average 5 point drop per
Row2 -0.0701 1.0000 0.1778 -0.5931 time period for group B, and an
Row3 0.1916 0.1778 1.0000 0.5931 average 4.3 point drop more for group
A.
Row4 -0.1817 -0.5931 0.5931 1.0000
Analysis Of GEE Parameter Estimates
Empirical Standard Error Estimates
Standard 95% Confidence
Parameter Estimate Error Limits Z Pr > |Z|
Intercept 42.1433 6.2281 29.9365 54.3501 6.77 <.0001
group A 7.8957 6.6850 -5.2065 20.9980 1.18 0.2376
group B 0.0000 0.0000 0.0000 0.0000 . .
time -4.9184 2.0931 -9.0209 -0.8160 -2.35 0.0188
time*group A -4.3198 2.1693 -8.5716 -0.0680 -1.99 0.0464
GEE analysis
proc genmod data=long;
class group id;
model score= group time group*time;
repeated subject = id / type=exch corrw ;
run; quit;
51
Working Correlation Matrix
Working Correlation Matrix
Col1 Col2 Col3 Col4
Row1 1.0000 -0.0529 -0.0529 -0.0529
Row2 -0.0529 1.0000 -0.0529 -0.0529
Row3 -0.0529 -0.0529 1.0000 -0.0529
Row4 -0.0529 -0.0529 -0.0529 1.0000
Analysis Of GEE Parameter Estimates
Empirical Standard Error Estimates
Standard 95% Confidence
Parameter Estimate Error Limits Z Pr > |Z|
Intercept 40.8333 5.8516 29.3645 52.3022 6.98 <.0001
group A 7.1667 6.1974 -4.9800 19.3133 1.16 0.2475
group B 0.0000 0.0000 0.0000 0.0000 . .
time -5.1667 1.9461 -8.9810 -1.3523 -2.65 0.0079
time*group A -3.5000 2.2885 -7.9853 0.9853 -1.53 0.1262
Notation
• Repeated measurements: y , i = 1,… N, subjects; j = 1, … ni,
ij
times for subject i
• Clustered data: y , i = 1,… N, clusters; j = 1, … ni,
ij
measurements within cluster i
• Use “unit” for subject or cluster
53
Generalized linear model (GLM)
54
Generalized estimating equations
(GEE)
55
Generalized estimating equations
Di is the matrix of derivatives δμi/δβj
Vi is the ‘working’ covariance matrix of Yi
Ai=diag{var(Yik)},
Ri is the correlation matrix for Yi
φ is an overdispersion parameter
56
Overdispersion parameter
Estimated using the formula:
Where N is the total number of measurements and
p is the number of regression parameters
The square root of the overdispersion parameter
is called the scale parameter
57
Estimation – For GEEs
58
Iterative process for GEE’s
• Start with Ri=identity (ie independence) and φ=1:
estimate β
• Use estimates to calculated fitted values:
• And residuals:
• These are used to estimate Ai, Ri and φ
• Then the GEE’s are solved again to obtain improved
estimates of β
59
Steps for fitting GEE (first starts GEE model)
60
Sandwich variance or empirically corrected
61
Missing Data
For missing data, can estimate the working
correlation using the all available pairs method,
in which all non-missing pairs of data are used in
the estimators of the working correlation
parameters.
62
Choosing the Best Model
Standard Regression (GLM)
AIC = - 2*log likelihood + 2*(#parameters)
→ Values closer to zero indicate better fit and
greater parsimony.
63
Choosing the Best Model
GEE
QIC(V) – function of V, so can use to choose
best correlation structure.
QICu – measure that can be used to
determine the best subsets of covariates
for a particular model.
the best model is the one with the smallest
value!
64