Linear and generalized linear models
Monday 3 June, 2024
Esa Läärä
  Statistical Practice in Epidemiology using R
  3 to 7 June, 2024
  International Agency for Research on Cancer, Lyon, France
Outline
          ▶ Simple linear regression.
          ▶ Fitting a regression model and extracting results.
          ▶ Predictions and diagnostics.
          ▶ Categorical factors and contrast matrices.
          ▶ Main effects and interactions.
          ▶ Modelling curved effects.
          ▶ Generalized linear models.
          ▶ Binary regression and Poisson regression.
Linear and generalized linear models                             1/ 25
Variables in generalized linear models
          ▶ The outcome or response variable must be numeric.
          ▶ Main types of response variables are
             – Metric or continuous (a measurement with units).
                    – Binary (“yes” vs. ”no”, coded 1/0), or proportion.
                    – Failure in person-time, or incidence rate.
          ▶ Explanatory variables or regressors can be
              – Numeric or quantitative variables
                    – Categorical factors, represented by class indicators or contrast matrices.
Linear and generalized linear models                                                               2/ 25
The births data in Epi
              id:             Identity number for mother and baby.
         bweight:             Birth weight of baby.
           lowbw:             Indicator for birth weight less than 2500 g.
         gestwks:             Gestation period in weeks.
         preterm:             Indicator for gestation period less than 37 weeks.
          matage:             Maternal age.
             hyp:             Indicator for maternal hypertension (0 = no, 1 = yes).
             sex:             Sex of baby (1 = male, 2 = female).
       Declaring and transforming some variables as factors:
       > library(Epi) ; data(births)
       > births <- transform(births,
       +    hyp = factor(hyp, labels=c("N", "H")),
       +    sex = factor(sex, labels=c("M", "F")),
       +    gest4 = cut(gestwks,breaks=c(20, 35, 37, 39, 45), right=FALSE)     )
       > births <- subset(births, !is.na(gestwks))
Linear and generalized linear models                                                   3/ 25
Birth weight and gestational age
                          4000
       Birth weight (g)
                          3000
                          2000
                          1000
                                 25   30       35       40   45
                                      Gestational age (wk)
       > with(births, plot(bweight ~ gestwks, xlim = c(24,45), pch = 16, cex.axis=1.5, cex.lab = 1.5,
       + xlab= "Gestational age (wk)", ylab= "Birth weight (g)" ) )
Linear and generalized linear models                                                                    4/ 25
Metric response, numeric explanatory variable
       Roughly linear relationship btw bweight and gestwks
       → Simple linear regression model fitted.
       > m <- lm(bweight ~ gestwks, data=births)
          ▶ lm() is the function that fits linear regression models, assuming
            Gaussian distribution or family for error terms.
          ▶ bweight ~ gestwks is the model formula
          ▶ m is a model object belonging to class “lm”.
       > coef(m) – Printing the estimated regression coefficients
       (Intercept)                     gestwks
           -4489.1                       197.0
       Interpretation of intercept and slope?
Linear and generalized linear models                                            5/ 25
Model object and extractor functions
       Model object = list of different elements, each being separately accessible.
       – See str(m) for the full list.
       Functions that extract results from the fitted model object
          ▶ summary(m) – lots of output
          ▶ coef(m) – beta-hats only (see above)
          ▶ ci.lin(m)[,c(1,5,6)] – βbj s plus confidence limits
                                       Estimate    2.5%   97.5%
                (Intercept)             -4489.1 -5157.3 -3821.0
                gestwks                   197.0   179.7   214.2
                Function ci.lin() is found in Epi package.
          ▶ anova(m) – Analysis of Variance Table
Linear and generalized linear models                                                  6/ 25
Other extractor functions, for example
          ▶ fitted(m), resid(m), vcov(m), . . .
          ▶ predict(m, ...) – or ci.pred(m, ...) in Epi
              – Predicted responses for desired combinations of new values of the
                regressors – argument newdata
               – Argument interval specifies whether confidence intervals for the
                 mean response or prediction intervals for individual responses are
                 returned.
          ▶ plot(m) – produces various diagnostic plots based on residuals
            (raw, standardized or studentized residuals).
       Many of these are special methods for certain generic functions, aimed at
       acting on objects of class “lm”.
Linear and generalized linear models                                                  7/ 25
Fitted values, confidence & prediction intervals
                          4000
       Birth weight (g)
                          3000
                          2000
                          1000
                                 25   30       35       40   45
                                      Gestational age (wk)
       >                  nd <- data.frame( gestwks = seq(24, 45, by = 0.25 ) )
       >                  pr.c1 <- predict( m, newdata=nd, interval="conf" )
       >                  pr.p1 <- predict( m, newdata=nd, interval="pred" )
       >                  with(births, plot(bweight ~ gestwks, xlim = c(24,45), cex.axis=1.5, cex.lab = 1.5, xlab = ’Gestation
       >                  matlines( nd$gestwks, pr.c1, lty=1, lwd=c(3,2,2), col=c(’red’,’blue’,’blue’))
       >                  matlines( nd$gestwks, pr.p1, lty=1, lwd=c(3,2,2), col=c(’red’,’green’,’green’))
Linear and generalized linear models                                                                                   8/ 25
A couple of diagnostic plots
                                  Residuals vs Fitted                                                                   Q−Q Residuals
                   1500
                                                       78                                                                                         78
                                                                        Standardized residuals
                                                                                                 2
                   500
       Residuals
                   0
                                                                                                 0
                   −1000
                                                                                                 −2
                                                 124
                                                                                                           124
                   −2000
                                                            30
                                                                                                      30
                                                                                                 −4
                           1000       2000       3000            4000                                 −3         −2     −1     0      1       2   3
                                     Fitted values                                                                    Theoretical Quantiles
       > par(mfrow=c(1,2))
       > plot(m, 1:2, cex.lab = 1.5, cex.axis=1.5, cex.caption=1.5, lwd=2)
                   ▶ Some deviation from linearity?
                   ▶ Reasonable agreement with Gaussian error assumption?
Linear and generalized linear models                                                                                                                   9/ 25
Factor as an explanatory variable
          ▶ How bweight depends on maternal hypertension?
            > mh <- lm( bweight ~ hyp, data=births)
                                       Estimate   2.5% 97.5%
                (Intercept)              3198.9 3140.2 3257.6
                hypH                     -430.7 -585.4 -275.9
          ▶ Removal of intercept → mean bweights by hyp:
                > mh2 <- lm( bweight ~ -1 + hyp, data = births)
                > coef(mh2)
                  hypN   hypH
                3198.9 2768.2
          ▶ Interpretation: -430.7 = 2768.2 - 3198.9
            = difference between level 2 (“H”) vs. reference level 1 (“N”) of factor hyp.
Linear and generalized linear models                                                        10/ 25
Additive model with both gestwks and hyp
          ▶ Joint effect of hyp and gestwks is modelled e.g. by updating:
            > mhg <- update(mh, . ~ . + gestwks)
                           Estimate    2.5%   97.5%
                (Intercept) -4285.0 -4969.7 -3600.3
                hypH         -143.7 -259.0    -28.4
                gestwks       192.2   174.7   209.8
          ▶ The coefficient for hyp: H vs. N is attenuated (from −430.7 to −143.7).
          ▶ Does −143.7 estimate the causal effect of hyp adjusted for gestwks?
          ▶ No, as gestwks is most likely a mediator. – Much of the effect of hyp on
            bweight is mediated via shorter gestwks in hypertensive mothers.
          ▶ Instead, for total causal effect of hyp, adjustment for at least age is
            needed, but adjusting for gestwks is overadjustment.
          ▶ Yet, for predictive modelling it is OK to keep gestwks.
Linear and generalized linear models                                                   11/ 25
Model with interaction of hyp and gestwks
          ▶ mhgi <- lm(bweight ~ hyp + gestwks + hyp:gestwks, ...)
           ▶ Or with shorter formula: bweight ~ hyp * gestwks
                               Estimate    2.5%   97.5%
             (Intercept)        -3960.8 -4758.0 -3163.6
             hypH               -1332.7 -2841.0   175.7
             gestwks              183.9   163.5   204.4
             hypH:gestwks          31.4    -8.3    71.1
           ▶ Estimated slope: 183.9 g/wk in reference group N of normotensive mothers and
             183.9 + 31.4 = 215.3 g/wk in hypertensive mothers.
          ⇔ For each additional week the difference in mean bweight between H and N group
            increases by 31.4 g.
           ▶ Interpretation of Intercept and “main effect” hypH?
Linear and generalized linear models                                                        12/ 25
Model with interaction (cont’d)
       More interpretable parametrization obtained if gestwks is centered at some
       reference value, using e.g. the insulate operator I() for explicit transformation
       of an original term.
          ▶ mi2 <- lm(bweight ~ hyp*I(gestwks-40), ...)
                                       Estimate   2.5% 97.5%
                (Intercept)              3395.6 3347.5 3443.7
                hypH                      -77.3 -219.8   65.3
                I(gestwks - 40)           183.9 163.5 204.4
                hypH:I(gestwks - 40)       31.4   -8.3   71.1
          ▶ The “main effect” of hyp = −77.3 is the difference between H and N
            at the reference value gestwks = 40.
          ▶ Intercept = 3395.6 is the estimated mean bweight
            at the reference value 40 of gestwks in group N.
Linear and generalized linear models                                                       13/ 25
Factors and contrasts in R
          ▶ A categorical explanatory variable or factor with L levels will be
            represented by L − 1 linearly independent columns in the
            model matrix of a linear model.
          ▶ These columns can be defined in various ways implying alternative
            parametrizations for the effect of the factor.
          ▶ Parametrization is defined by given type of contrasts.
          ▶ Default: treatment contrasts, in which 1st class is the reference, and
            regression coefficient βk for class k is interpreted as βk = µk − µ1
          ▶ Own parametrization may be tailored by ci.lin(mod, ctr.mat=CM) after
            fitting, the pertinent contrast matrix given as the 2nd argument.
Linear and generalized linear models                                                 14/ 25
Two factors: additive effects
          ▶ Factor X has 3 levels, Z has 2 levels – Model:
                                       µ = α + β1 X1 + β2 X2 + β3 X3 + γ1 Z1 + γ2 Z2
          ▶ X1 (reference), X2 , X3 are the indicators for X ,
          ▶ Z1 (reference), Z2 are the indicators for Z .
          ▶ Omitting X1 and Z1 the model for mean is:
                                              µ = α + β2 X2 + β3 X3 + γ2 Z2
                with predicted means µjk (j = 1, 2, 3; k       = 1, 2):
                                        Z =1                    Z =2
                                     1 µ11 = α                  µ11 = α + γ2
                                X 2 µ21 = α + β2                µ22 = α + β2 + γ2
                                     3 µ31 = α + β3             µ32 = α + β3 + γ2
Linear and generalized linear models                                                   15/ 25
Two factors with interaction
          ▶ Effect of Z differs at different levels   of X :
                                    Z =1               Z =2
                                1 µ11 = α              µ12 = α + γ2
                          X 2 µ21 = α + β2             µ22 = α + β2 + γ2 + δ22
                                3 µ31 = α + β3         µ32 = α + β3 + γ2 + δ32
          ▶ How much the effect of Z (level 2 vs. 1)
            changes when the level of X is changed from 1 to 3:
                                       δ32 = (µ32 − µ31 ) − (µ12 − µ11 )
                                           = (µ32 − µ12 ) − (µ31 − µ11 ),
                = how much the effect of X (level 3 vs. 1)
                changes when the level of Z is changed from 1 to 2.
          ▶ See the exercise: interaction of hyp and gest4.
Linear and generalized linear models                                             16/ 25
Contrasts in R
          ▶ All contrasts can be implemented by supplying a suitable
            contrast function giving the contrast matrix e.g:
                >   contr.cum(3)        > contr.sum(3)
                1   0 0                  1    1   0
                2   1 0                  2    0   1
                3   1 1                  3   -1 -1
          ▶ In model formula factor name faktori can be replaced by expression like
            C(faktori, contr.cum).
          ▶ Function ci.lin() can calculate CI’s for linear functions of the parameters
            of a fitted model mall when supplied by a relevant contrast matrix
            > ci.lin(mall, ctr.mat = CM)[ , c(1,5,6)]
                → No need to specify contrasts in model formula!
Linear and generalized linear models                                                      17/ 25
More about numeric regressors
       What if dependence of Y on X is non-linear?
          ▶ Categorize the values of X into a factor.
              – Continuous effects violently discretized by often
                arbitrary cutpoints. This is inefficient.
          ▶ Fit a low-degree (e.g. 2 to 4) polynomial of X .
               – Tail behaviour may be problematic.
          ▶ Use fractional polynomials.
              – Invariance problems. Only useful if X = 0 is well-defined.
          ▶ Use a spline model: smooth function s(X ; β). – See Martyn’s lecture
              – More flexible models that act locally.
              – Effect of X reported by graphing bs (X ; β) & its CI
Linear and generalized linear models                                               18/ 25
Mean bweigth as 3rd order polynomial of gestwks
                 4000
                 3000
       bweight
                 2000
                 1000
                        25   30           35     40   45
                                       gestwks
       > mp3 <- update( m, . ~ . - gestwks + poly(gestwks, 3) )
                 ▶ The model is linear in parameters with 4 terms & 4 df.
                 ▶ Otherwise good, but the tails do not behave well.
Linear and generalized linear models                                        19/ 25
Penalized spline model with cross-validation
                 4000
                 3000
       bweight
                 2000
                 1000
                        25   30           35     40   45
                                       gestwks
       > library(mgcv)
       > mpen <- gam( bweight ~ s(gestwks), data = births)
                 ▶ Looks quite nice.
                 ▶ Model df ≈ 4.2; close to 4, as in the 3rd degree polynomial model.
Linear and generalized linear models                                                    20/ 25
From linear to generalized linear models
          ▶ An alternative way of fitting our 1st Gaussian model:
                > m <- glm(bweight ~ gestwks, family=gaussian, data=births)
          ▶ Function glm() fits generalized linear models (GLM).
          ▶ Requires specification of the
               – family – i.e. the assumed “error” distribution for Yi s,
               – link function – a transformation of the expected Yi .
          ▶ Covers common models for other types of response variables and
            distributions, too, e.g. logistic regression for binary responses and
            Poisson regression for counts.
          ▶ Fitting: method of maximum likelihood.
          ▶ Many extractor functions for a glm object similar to those for an lm object.
Linear and generalized linear models                                                       21/ 25
Generalized linear models
       Modelling how expected values, risks, rates, etc. depend on explanatory variables
       or regressors X = (X1 , . . . , Xp ). – Common elements:
          ▶ Each subject i (i = 1, . . . , N ) has an own regressor profile, i.e. vector
            xiT = (xi1 , . . . , xip ) of values of X .
          ▶ Let vector β T = (β0 , β1 , . . . , βp ) contain regression coefficients.
            The linear predictor is a linear combination of βj s and xij s:
                                       ηi = β0 + β1 xi1 + · · · + βp xip
          ▶ Some Xj s can be product terms for interactions and modifications if
            needed, and splines may be used for continuous covariates.
          ▶ Further model specification depends on the type of outcome variable,
            assumed error distribution or family, desired interpretation of coefficients,
            and importance and choice of time scale(s).
Linear and generalized linear models                                                        22/ 25
Binary regression and interpretations of coefficients
          ▶ Basic model for risks π(xi ) = P {Yi = 1|X = xi } = E (Yi |X = xi ) with
            fixed risk period, complete follow-up (no censoring, nor competing events):
                                       g{π(xi )} = β0 + β1 xi1 + . . . βp xip ,   i = 1, . . . , N .
          ▶ Link g(·) and interpretation of βj s, assuming the validity of model
            (including homogeneity or non-modification of the coefficent in question):
               – id ⇒ βj = adjusted risk difference (RD) for Xj = 1 vs. Xj = 0,
               – log ⇒ βj = adjusted log of risk ratio (RR)       – ”–
               – logit ⇒ βj = adjusted log of odds ratio (OR), – ” –
          ▶ Fitting: glm(..., family=binomial(link=...), ...)
          ▶ Issues with id & log links in keeping predicted π
                                                            b(·) between 0 and 1.
            – A solution for RR: Doubling the cases & logit-link! (Ning et al. 2022).
            – A solution for RD exists, too (Battey et al. 2019).
Linear and generalized linear models                                                                   23/ 25
Poisson regression – model for rates
          ▶ A common outcome variable is a pair (D, Y ) = (no. of cases, person-time),
            from which the incidence rate = D/Y (see Janne’s lecture on Monday).
          ▶ Poisson regression model specifies, how theoretical hazard rates or
            hazards λ(xi ) are assumed to depend on values of X .
          ▶ Some components of X represent the relevant time scales
            (as in the exercise of today; more details in Bendix’s lecture on Wednesday).
          ▶ Linear predictor as above – Link g(·) and interpretation of βj s:
              – id ⇒ βj = adjusted rate difference (RD) for Xj = 1 vs. Xj = 0,
              – log ⇒ βj = adjusted log of rate ratio (RR)        – ”–
          ▶ Fitting – our recommended approach using Epi:
                   glm(cbind(d,y) ~ ..., family=poisreg(link=...),...)
Linear and generalized linear models                                                        24/ 25
What was covered
          ▶ A wide range of models from simple linear regression to splines.
          ▶ Gaussian family for continuous outcomes, binomial for binary outcomes, and
            Poisson family for rates.
          ▶ Various link functions for different parametrizations.
          ▶ R functions fitting linear and generalized models: lm() and glm().
          ▶ Parametrization of categorical explanatory factors; contrast matrices.
          ▶ Extracting results and predictions: ci.lin(), fitted(), predict().
          ▶ Model diagnostics: resid(), plot.lm(), ... .
Linear and generalized linear models                                                     25/ 25