Outline for today
• What is a generalized linear model
• Linear predictors and link functions
• Example: fit a constant (the proportion)
• Analysis of deviance table
• Example: fit dose-response data using logistic regression
• Example: fit count data using a log-linear model
• Advantages and assumptions of glm()
• Modeling overdispersion (excessive variance)
• Example: Modeling contingency tables
Review: what is a linear model
A model of the following form:
Y = β0 + β1X1 + β2X2 + …
• Y is the response variable
• The X's are the explanatory variables
• The β's are the parameters of the linear equation
• The errors are normally distributed with equal variance at all values of the X
variables.
• Uses least squares to fit model to data and to estimate parameters
• Use lm() in R when analyzing fixed effects
Review: fitting a linear model in R
Use lm() in R when analyzing fixed effects
Simplest linear model: fit a constant (the mean)
z <- lm(y ~ 1)
Linear regression
z <- lm(y ~ x) # x is numeric
Single factor ANOVA
z <- lm(y ~ A) # A is categorical
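To make this concrete, here is a minimal runnable sketch using simulated data (variable names are illustrative, not from a real dataset):
set.seed(1)
x <- runif(20, 0, 10)            # numeric explanatory variable
y <- 2 + 0.5 * x + rnorm(20)     # response with normally distributed errors
z <- lm(y ~ x)                   # linear regression
coef(z)                          # estimated intercept and slope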
Review: what is a linear model
E.g., linear regression: Y = β0 + β1X + error
The predicted Y-values, symbolized here by μ, are modeled as
μ = β0 + β1X
The part to the right of “=” is the linear predictor
What is a generalized linear model
A model whose predicted values are of the form
g(μ) = β0 + β1X1 + β2X2 + …
• The model still includes a linear predictor (to right of “=”)
• But the predicted Y-values are now transformed
• g(μ) is called the “link function,” of which there are several types
• Non-normal error distributions are OK (specified by the family argument)
• Unequal error variances are OK (implied by the chosen error distribution)
• Uses maximum likelihood to estimate parameters
• Uses log-likelihood ratio tests to test parameters
• Fit models using glm() in R
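As a sanity check on the connection between the two, note that lm() is the special case of glm() with normal errors and the identity link. A quick sketch, reusing the simulated x and y from the review above:
z1 <- lm(y ~ x)
z2 <- glm(y ~ x, family = gaussian(link = "identity"))
all.equal(coef(z1), coef(z2))    # TRUE: identical estimates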
The two most common link functions
1) Natural log (i.e., base e)
log(μ) = η = β0 + β1X1 + β2X2 + …
Usually used to model count data (e.g., number of mates, etc)
log(μ) is the link function.
The inverse function is μ = eη
The two most common link functions
2) Logistic (a.k.a. “logit”)
log(μ / (1 − μ)) = η = β0 + β1X1 + β2X2 + …
Used to model binary data (e.g., survived vs. died)
The link function log(μ / (1 − μ)) is also known as the log-odds (logit)
The inverse function is μ = e^η / (1 + e^η)
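Both link functions are easy to compute directly in R; base R also provides the logit and its inverse as qlogis() and plogis(). A small sketch using the numbers from the wasp example below:
logit <- function(mu) log(mu / (1 - mu))             # the logit link
invlogit <- function(eta) exp(eta) / (1 + exp(eta))  # its inverse
logit(0.719)      # about 0.94, on the log-odds scale
invlogit(0.9383)  # about 0.719, back on the proportion scale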
Example 1: Fit a constant to 0-1 data (estimate a proportion)
This example was used previously in the Likelihood lecture. My goal here is to connect what glm() does with what we did by brute force previously.
The wasp, Trichogramma brassicae, rides on female cabbage white butterflies,
Pieris brassicae. When a butterfly lays her eggs on a cabbage, the wasp climbs down
and parasitizes the freshly laid eggs.
Fatouros et al. (2005) carried out trials to determine whether the wasps can
distinguish mated female butterflies from unmated females. In each trial a single
wasp was presented with two female cabbage white butterflies, one a virgin
female, the other recently mated.
Y = 23 of 32 wasps tested chose the mated female. What is
the proportion p of wasps in the population choosing the
mated female?
Number of wasps choosing the mated female fits a binomial distribution
Under random sampling, the number of “successes” in n trials has a binomial
distribution, with p being the probability of “success” in any one trial.
To model these data, let “success” be “wasp chose mated butterfly”
Y = 23 successes
n = 32 trials
Goal: estimate p
Use glm() to fit a constant, and so obtain the ML estimate of p
The data are binary. Each wasp has a measurement of 1 or 0 (“success” or
“failure”) for choice: 1 1 1 0 1 1 1 0 1 0 1 0 1 1 1 1 0 1 0 1 1 1 1 1 0 1 1 1 0 0 1 1
z <- glm(choice ~ 1, family = binomial(link = "logit"))
family specifies the error distribution
(binomial) and the link function (logit).
Use glm() to fit a constant, and so obtain the ML estimate of p
Fits a model having only a constant. Use the link function appropriate for binary
data:
log(μ / (1 − μ)) = β
μ here refers to the population proportion (p) but let’s stick with μ symbol here to
use consistent notation for generalized linear models.
Fitting will yield the estimate β̂.
The estimate of the proportion, μ̂, is then obtained using the inverse function:
μ̂ = e^β̂ / (1 + e^β̂)
Use summary() for estimation
summary(z)
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.9383 0.3932 2.386 0.017 *
0.9383 is the estimate of 𝛽 (the constant on
the logit scale). Convert back to ordinary scale
(plug into inverse equation) to get estimate of
population proportion:
μ̂ = e^β̂ / (1 + e^β̂) = e^0.9383 / (1 + e^0.9383) = 0.719
This is the ML estimate of the population
proportion. This is identical to the estimate
obtained last week using likelihood function.
Confidence intervals
summary(z)
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.9383 0.3932 2.386 0.017 *
95% confidence limits:
myCI <- confint(z) # on logit scale
exp(myCI)/(1 + exp(myCI)) # inverse logit scale
2.5 % 97.5 %
0.5501812 0.8535933
0.550 ≤ p ≤ 0.853 is the same result we obtained last week for likelihood-based confidence intervals using the likelihood function (more decimal places this week).
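The back-transformation can also be done with base R's built-in inverse logit, plogis(), which is equivalent to exp(x)/(1 + exp(x)):
plogis(confint(z))   # confidence limits on the proportion scale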
Avoid using summary() for hypothesis testing
summary(z)
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.9383 0.3932 2.386 0.017 *
The z-value (Wald statistic) and P-value test the null hypothesis that β = 0. This is
the same as a test of the null hypothesis that the true (population) proportion
μ = 0.5, because
e^0 / (1 + e^0) = 0.5
Agresti (2002, Categorical data analysis, 2nd ed., Wiley) says that for small to
moderate sample size, the Wald test is less reliable than the log-likelihood ratio
test.
Use anova() to test hypotheses
Last week we calculated the log-likelihood ratio test for these data “by hand”.
Here we’ll use glm() to accomplish the same task.
“Full” model (β estimated from data):
z1 <- glm(y ~ 1, family = binomial(link = "logit"))
“Reduced” model (β set to 0 by removing the intercept from the model):
z0 <- glm(y ~ 0, family = binomial(link = "logit"))
Use anova() to test hypotheses
anova(z0, z1, test = "Chi") # Analysis of deviance
Model 1: y ~ 0 # Reduced model
Model 2: y ~ 1 # Full model
Analysis of deviance table:
Resid. Df Resid. Dev Df Deviance P(>|Chi|)
1 32 44.361
2 31 38.024 1 6.337 0.01182 *
The deviance is the log-likelihood ratio statistic (G-statistic). It has an approximate χ² distribution under the null hypothesis.
Residual deviance measures goodness of fit of the model to the data.
G = 6.337 is identical to the result we obtained “by hand” using the log-likelihood ratio test last week.
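The same G-statistic and P-value can be computed directly from the two model deviances:
G <- deviance(z0) - deviance(z1)        # 44.361 - 38.024 = 6.337
pchisq(G, df = 1, lower.tail = FALSE)   # P = 0.0118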
Example 2: Logistic regression
One of the most common uses of generalized linear models.
Goal is to model the relationship between a proportion and an explanatory variable
Data: 72 rhesus monkeys (Macacus rhesus) exposed for 1 minute to aerosolized
preparations of anthrax (Bacillus anthracis).
Goal is to estimate the relationship
between dose and probability of death.
Logistic regression
Measurements of individuals are 1 (dead) or 0 (alive)
Ordinary linear regression model not
appropriate because
• For each X the Y observations are
binary, not normal
• For every X the variance of Y is not
constant
• A linear relationship is not bounded
between 0 and 1
• 0, 1 data can’t simply be transformed
The generalized linear model
g(μ) = β0 + β1X
μ is the probability of death, which depends on concentration X.
g(μ) is the link function.
Linear predictor (right side of equation) is like an ordinary linear regression, with intercept β0 and slope β1
Logistic regression uses the logit link function
z <- glm(mortality ~ concentration,
family = binomial(link = "logit"))
The generalized linear model
g(μ) = β0 + β1X
glm() uses maximum likelihood: the
method finds those values of β0 and β1
for which the data have maximum
probability of occurring. These are the
maximum likelihood estimates.
No formula for the solution.
glm() uses an iterative procedure to
find the maximum likelihood estimates
on the likelihood surface.
Use summary() for estimation
z <- glm(mortality ~ concentration,
family = binomial(link = "logit"))
summary(z)
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.74452 0.69206 -2.521 0.01171 *
concentration 0.03643 0.01119 3.255 0.00113 **
Number of Fisher Scoring iterations: 5
The numbers in the Estimate column are the estimates of β0 and β1 (intercept and slope), which predict log(μ / (1 − μ)).
Number of Fisher Scoring iterations refers to the number of iterations used before
the algorithm used by glm() converged on the maximum likelihood solution.
The generalized linear model
Use predict(z) to obtain
predicted values on the logit scale
g(μ̂) = η̂ = −1.74 + 0.036X
visreg(z) uses predict() to
plot predicted values and confidence
limits on the logit scale.
See that the function is a line.
The points on this scale are not the
logit-transformed data. Instead,
glm() creates “working” values to fit
model to data on transformed scale.
The generalized linear model
Use fitted(z) to obtain predicted values on the original scale
μ̂ = e^η̂ / (1 + e^η̂)
Use
visreg(z, scale = "response")
to get fitted curve and
confidence bands on the
original scale.
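As a quick check on the equivalence, fitted() for a glm is shorthand for predict() on the response scale:
all.equal(fitted(z), predict(z, type = "response"))   # TRUE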
The generalized linear model
LD50 = −intercept/slope = −β̂0/β̂1 = −(−1.74452)/0.03643 = 47.88
The parameter estimates from the model
fit can be used to estimate LD50, the
estimated concentration at which 50% of
individuals are expected to die.
library(MASS)
dose.p(z)
Dose SE
p = 0.5: 47.8805 8.168823
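The same LD50 can be computed by hand from the fitted coefficients, which is a useful check on dose.p():
b <- coef(z)
unname(-b[1] / b[2])   # -(-1.74452) / 0.03643 = 47.88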
Use anova() to test hypotheses
Analysis of deviance table gives the log-likelihood ratio test of the null hypothesis that mortality does not depend on concentration (i.e., that the slope β1 = 0).
anova(z, test="Chisq")
Df Deviance Resid Df Resid Dev P(>|Chi|)
NULL 71 92.982
concentration 1 19.020 70 73.962 1.293e-05
As with lm(), terms are tested using model comparison (always a “full” vs. “reduced” model). The default is to fit terms sequentially (“Type I sums of squares”), just as with lm().
Advantages of generalized linear models
• More flexible than simply transforming variables. (A given transformation of the
raw data may not accomplish both linearity and homogeneity of variance.)
• Yields more familiar measures of the response variable than data
transformations.
• Avoids the problems associated with transforming 0’s and 1’s. For example, the
logit transformation of 0 or 1 can’t be computed.
• Retains the same analysis framework as linear models.
When glm() is appropriate and when it is not
When individual data points are independent binary measurements, glm() is appropriate. When they are not independent (e.g., multiple measurements per group), analyze a summary statistic per group (the fraction surviving) with lm(), or fit a generalized linear mixed model using glmer() in the lme4 package.
Assumptions of generalized linear models
• Statistical independence of data points.
• Correct specification of the link function for the data.
• The variances of the residuals correspond to those assumed by the chosen error distribution and link function (explained below).
Example 3: Analyzing count data with log-linear regression
Estimate mean number of offspring fledged by female song sparrows on Mandarte
Island, BC.
Linear model assumptions not met:
Data are discrete counts (non-normal).
Variance increases with mean.
Example 3: Analyzing count data with log-linear regression
Two solutions:
1. Transform data: X’ = log(X + 1)
2. Generalized linear model.
Poisson distribution might be
appropriate for error distribution.
So try log link function.
Example 3: Analyzing count data with log-linear regression
Log-linear regression (a.k.a. Poisson regression) uses the log link function
log(𝜇) = 𝜂 = 𝛽! + 𝛽" 𝑋" + 𝛽# 𝑋# + ⋯
η is the predicted response on the log scale (here, the mean of each group on the log scale).
Year is a categorical variable, so the analysis is analogous to a single-factor ANOVA.
Categorical variables are modeled in R using “dummy” indicator variables,
as with lm().
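The call on the next slide assumes vectors noffspring and year are in the workspace (names from the slide). The key point is that year must be a factor, not a number, for the dummy coding to work:
year <- factor(year)   # make year categorical
levels(year)           # "1975" "1976" "1977" "1978" "1979"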
Use summary() for estimation
z <- glm(noffspring ~ year, family=poisson(link="log"))
summary(z)
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.24116 0.26726 0.902 0.366872
year1976 1.03977 0.31497 3.301 0.000963 ***
year1977 0.96665 0.28796 3.357 0.000788 ***
year1978 0.97700 0.28013 3.488 0.000487 ***
year1979 -0.03572 0.29277 -0.122 0.902898
(Dispersion parameter for poisson family taken to be 1)
The numbers in the Estimate column are the parameter estimates on the log scale.
Intercept refers to mean of the first group (1975) and the rest of the coefficients are
differences between each given group (year) and the first group.
A dispersion parameter of 1 reflects the assumption that variance = mean, which is almost never true in ecology.
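The group means on the original count scale can be recovered by back-transforming the coefficients:
exp(coef(z)[1])                # mean for 1975: e^0.24116 = 1.27
exp(coef(z)[1] + coef(z)[2])   # mean for 1976: e^(0.24116 + 1.03977) = 3.60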
Predicted values on the transformed scale
Predicted values on the log scale:
predict(z)
visreg(z) uses predict() to plot
the predicted values, with confidence
limits, on this transformed scale.
See that the “data points” on this scale
are not just the transformed data (we
can’t take log of 0). Instead, glm()
creates “working” values to fit the model
to the data on the transformed scale.
Predicted values on the original scale
Predicted values on original scale:
fitted.values(z)
μ̂ = e^η̂
I have plotted them here using
visreg() and superimposed the
original data points.
Note that the fitted values aren’t the
means of the original data.
Fitted values are back-transformed means estimated on the log scale, i.e., geometric rather than arithmetic means.
Use emmeans() to obtain model predicted values on original scale
library(emmeans)
emmeans(z, "year", type = "response")
Yields model-fitted predicted values and approximate 95% confidence intervals.
year rate SE df asymp.LCL asymp.UCL
1975 1.272727 0.3401496 Inf 0.7537770 2.148957
1976 3.600000 0.6000000 Inf 2.5967825 4.990791
1977 3.346154 0.3587453 Inf 2.7119865 4.128614
1978 3.380952 0.2837232 Inf 2.8681893 3.985385
1979 1.228070 0.1467822 Inf 0.9715952 1.552248
Uses a large-sample approximation (this is why degrees of freedom, df, are shown
as infinite). These confidence limits might not be accurate for small sample sizes.
Use anova() to test hypotheses
Analysis of deviance table gives the log-likelihood ratio test of the null hypothesis that there are no differences among years in mean number of offspring.
anova(z, test="Chisq")
Terms added sequentially (first to last)
Df Deviance Resid. Df Resid. Dev P(>|Chi|)
NULL 145 288.656
year 4 75.575 141 213.081 1.506e-15 ***
As with lm(), terms are tested using model comparison (always a “full” vs. “reduced” model). The default is to fit terms sequentially (“Type I sums of squares”), as with lm().
Evaluating assumptions of the glm() fit
Do the variances of the residuals correspond to those assumed by the chosen link
function?
The Poisson family with log link assumes that the Y values are Poisson-distributed at each X.
A key property of the Poisson distribution is that within each treatment group the
variance and mean are equal (i.e., the glm() dispersion parameter = 1). But real
data rarely show this.
Evaluating assumptions of the glm() fit
A central property of the Poisson distribution is that the variance and mean are
equal (i.e., the glm() dispersion parameter = 1).
Let’s check the sparrow data:
tapply(noffspring, year, mean)
tapply(noffspring, year, var)
1975 1976 1977 1978 1979
1.272727 3.600000 3.346154 3.380952 1.228070 # mean
1.618182 6.044444 3.835385 4.680604 1.322055 # variance
Variances slightly larger than means (typical for real data).
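The comparison can be summarized as a variance-to-mean ratio per year; values near 1 are consistent with the Poisson assumption:
tapply(noffspring, year, var) / tapply(noffspring, year, mean)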
Modeling excessive variance
Finding excessive variance (“overdispersion”) is common when analyzing count
data. Excessive variance occurs because variables not included in the model also
affect the response variable.
In the workshop we will analyze an example where the problem is more severe than
in the case of the song sparrow data here.
Modeling excessive variance
Excessive variance can be accommodated with glm() by using a different family (quasipoisson rather than poisson), one that incorporates a dispersion parameter that is itself estimated from the data.
The glm() procedure for handling over- (or under-) dispersion uses the observed relationship between mean and variance rather than an explicit probability distribution for the data. In the case of count data,
variance = dispersion parameter × mean
Method generates “quasi-likelihood” estimates that behave like maximum
likelihood estimates.
Modeling excessive variance
Let’s try it with the song sparrow data
z <- glm(noffspring ~ year, family = quasipoisson)
summary(z)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.24116 0.29649 0.813 0.41736
year1976 1.03977 0.34942 2.976 0.00344 **
year1977 0.96665 0.31946 3.026 0.00295 **
year1978 0.97700 0.31076 3.144 0.00203 **
year1979 -0.03572 0.32479 -0.110 0.91259
Dispersion parameter for quasipoisson family taken to be
1.230689
The dispersion parameter is reasonably close to 1 for these data. But typically it is much
larger than 1 for count data, so I recommend using family = quasipoisson.
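The dispersion estimate reported by summary() can be reproduced by hand from the Pearson residuals of the fit:
sum(residuals(z, type = "pearson")^2) / df.residual(z)   # about 1.23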
Modeling excessive variance
The point estimates are identical to those obtained using family = poisson instead, but the quasipoisson standard errors (and resulting confidence intervals) are wider.
z <- glm(noffspring ~ year, family=poisson(link="log"))
summary(z)
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.24116 0.26726 0.902 0.366872
year1976 1.03977 0.31497 3.301 0.000963 ***
year1977 0.96665 0.28796 3.357 0.000788 ***
year1978 0.97700 0.28013 3.488 0.000487 ***
year1979 -0.03572 0.29277 -0.122 0.902898
Dispersion parameter for poisson family taken to be 1
Example 4: Modeling contingency tables
Example: Incidence of malaria in female great tits in relation to experimental
treatment. n = 65 birds.
[Grouped bar graph: explanatory variable (treatment) = outer groups; response variable (malaria status) = inner groups]
Example 4: Modeling contingency tables
Response Treatment Frequency
Malaria Control 7
No Malaria Control 28
Malaria Egg removal 15
No Malaria Egg removal 15
z <- glm(Frequency ~ Treatment * Response,
data = mydata, family = poisson(link = "log"))
The counts are modeled in a way analogous to the modeling of means in an ordinary linear model. If the standard assumptions about the data are met (e.g., independent random trials), then family = poisson is appropriate. The interaction term is included because it provides the test of association (see the analysis of deviance table below).
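For completeness, a runnable sketch of the data frame assumed above (column names taken from the table):
mydata <- data.frame(
  Response = rep(c("Malaria", "No Malaria"), times = 2),
  Treatment = rep(c("Control", "Egg removal"), each = 2),
  Frequency = c(7, 28, 15, 15))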
Example 4: Modeling contingency tables
visreg(z, xvar="Response", by="Treatment")
Use anova() to test hypotheses
anova(z, test="Chi")
Df Deviance Resid. Df Resid. Dev P
NULL 3 13.8771
Treatment 1 0.3850 2 13.4921 0.534942
Response 1 6.9079 1 6.5843 0.008582
Treatment:Response 1 6.5843 0 0.0000 0.010288
The interaction between Treatment and Response is the test of association.
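For comparison, the classical Pearson chi-square test of association on the same 2 × 2 table should lead to a similar conclusion (this uses the mydata sketch above):
tab <- xtabs(Frequency ~ Treatment + Response, data = mydata)
chisq.test(tab, correct = FALSE)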
Other uses of generalized linear models
The method is especially useful when analyzing higher-order contingency tables,
where there may be two- and three-way interactions. Each of the interactions can
be assessed separately.
glm() can handle data having other probability distributions than the ones used in
my examples, including exponential and gamma distributions.
Discussion paper for next week:
Whittingham et al (2006) Why do we still use stepwise modelling?
Download from “assignments” tab on course web site.
Presenters: Nicole & ____________
Moderators: _________ & __________