Regn Lect 6

Epidemiology linear regression

Uploaded by

Martha Reuben

Lecture 6: Multiple regression with categorical explanatory factors – the general linear
model
Objectives
At the end of the session participants should
• Be able to carry out a “comparison of regressions” analysis, including a test for
  parallelism
• Be able to fit regression models which have a mixture of quantitative and categorical
  explanatory variables

The link between multiple regression models and analysis of variance models can be formalized
as follows.
Multiple regression models use variates (quantitative variables such as weight or age) as
explanatory variables. Analysis of variance models use factors (categorical variables such as
treatment group, sex or ethnic group) as explanatory variables. We can convert an analysis of
variance model into a regression model by replacing each factor with indicator variables, the
number of indicator variables being one fewer than the number of levels of the factor (e.g.
education had 3 levels so we needed two indicator variables, educ2 and educ3).
The general linear model allows us to include both variates and factors as explanatory
variables and thus allows us to analyze a wide range of data from research studies, provided it
is reasonable to assume that the response variable has an underlying normal distribution.
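
As a minimal sketch of the indicator coding described above, the Python snippet below (not part of the Stata session in these notes) turns a factor with k levels into k - 1 indicator variables, treating the lowest level as the baseline. The function name indicator_columns and the six education codes are hypothetical and are used only for illustration.

```python
def indicator_columns(values, prefix):
    """Code a factor as k - 1 indicator (dummy) variables.

    The lowest level is the baseline and gets no column, so a factor
    with 3 levels yields 2 indicator variables.
    """
    levels = sorted(set(values))
    non_baseline = levels[1:]
    return {
        f"{prefix}{level}": [1 if v == level else 0 for v in values]
        for level in non_baseline
    }


# Hypothetical education codes (1, 2, 3) for six respondents.
educ = [1, 2, 3, 2, 1, 3]
dummies = indicator_columns(educ, "educ")
for name, column in dummies.items():
    print(name, column)
```

A respondent at the baseline level has 0 in every indicator column, which is why the baseline needs no column of its own.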

Ex: Consider the Misoprostol trial and let us examine jointly the effect of treatment arm and
gestational age on hegasize.
The questions of interest are:
1. Does hegasize increase with gestational age?
2. Is there any difference in the mean hegasize between those who were allocated to
misoprostol and those who were allocated to placebo after adjusting for the effect of
gestational age?
3. Does hegasize increase with gestational age at the same rate for the two groups of
subjects?
We can first look at a plot of hegasize versus gestational age, with observations labeled by
treatment arm.

[Figure: Hegar size vs gestational age by treatment arm. Scatter plot of hegasize (vertical
axis, 0 to 15) against gestage (horizontal axis, 35 to 85), with misoprostol = + and placebo = o.]

Thus we can see that in general the Hegar size seems to be larger for women who were
allocated to misoprostol than for those who were allocated to placebo. There is some
suggestion that Hegar size increases with increasing gestational age.
We can examine this further by fitting regression models.

. use misorhrm , clear

.
. ttest hegasize , by(treat)

Two-sample t test with equal variances
------------------------------------------------------------------------------
   Group |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
---------+--------------------------------------------------------------------
       1 |     135    7.614815     .182115    2.115985    7.254623    7.975006
       2 |     141    6.007092    .1508156    1.790836    5.708922    6.305263
---------+--------------------------------------------------------------------
combined |     276    6.793478    .1271574    2.112498    6.543153    7.043804
---------+--------------------------------------------------------------------
    diff |            1.607723    .2356042                1.143898    2.071547
------------------------------------------------------------------------------
    diff = mean(1) - mean(2)                                      t =   6.8238
Ho: diff = 0                                     degrees of freedom =      274

    Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0
 Pr(T < t) = 1.0000         Pr(|T| > |t|) = 0.0000          Pr(T > t) = 0.0000

. lab def trtlab 1 "+" 2 "o"

. lab val treat trtlab

. twoway (scatter hegasize gestage, sort msymbol(none) mlabel(treat)) , ///
> xlab (35 (10) 85) ti("Hegar size vs gestational age by treatment arm") ///
> subtitle (misoprostol = + placebo = o)

.
. reg hegasize gestage

      Source |       SS       df       MS              Number of obs =     276
-------------+------------------------------           F(  1,   274) =   11.54
       Model |  49.5852132     1  49.5852132           Prob > F      =  0.0008
    Residual |  1177.64305   274  4.29796733           R-squared     =  0.0404
-------------+------------------------------           Adj R-squared =  0.0369
       Total |  1227.22826   275  4.46264822           Root MSE      =  2.0732

------------------------------------------------------------------------------
    hegasize |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     gestage |   .0465442   .0137032     3.40   0.001     .0195673     .073521
       _cons |   3.919714   .8552239     4.58   0.000     2.236069    5.603358
------------------------------------------------------------------------------

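Thus we see that there is very strong evidence (p = 0.001) that hegasize increases with
gestational age, although gestational age alone explains only about 4% of the variation in
hegasize (R-squared = 0.0404).
The fitted equation (model 1) is hegasize = 3.92 + 0.047 * gestage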

We can now see whether there is any difference in the mean hegasize of the two treatment arms
after adjusting for gestational age (model 2):
. xi: reg hegasize gestage i.treat
i.treat _Itreat_1-2 (naturally coded; _Itreat_1 omitted)

      Source |       SS       df       MS              Number of obs =     276
-------------+------------------------------           F(  2,   273) =   30.47
       Model |  223.951391     2  111.975695           Prob > F      =  0.0000
    Residual |  1003.27687   273  3.67500685           R-squared     =  0.1825
-------------+------------------------------           Adj R-squared =  0.1765
       Total |  1227.22826   275  4.46264822           Root MSE      =   1.917

------------------------------------------------------------------------------
    hegasize |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     gestage |    .044687   .0126741     3.53   0.000     .0197356    .0696383
   _Itreat_2 |  -1.590404   .2308902    -6.89   0.000    -2.044956   -1.135853
       _cons |   4.846871   .8021927     6.04   0.000     3.267601    6.426141
------------------------------------------------------------------------------

. test _Itreat_2

( 1) _Itreat_2 = 0

F( 1, 273) = 47.45
Prob > F = 0.0000

Looking at the t-test for _Itreat_2 (the indicator variable for being allocated to placebo) and
also at the post-estimation test, we can reject the null hypothesis that there is no difference
between the two groups. That is, there is overwhelming evidence (p < 0.0001) that, after
adjusting for gestational age, the mean hegasize is lower by 1.59 in subjects who were allocated
to placebo, so model 2 provides a better description of the data than model 1.

We can find the equations of the lines for those on misoprostol and those on placebo.
For those on misoprostol _Itreat_2 takes the value 0, so
hegasize = 4.85 + 0.045 * gestage
For those on placebo _Itreat_2 = 1, so the equation becomes
hegasize = 4.85 + 0.045 * gestage - 1.59
i.e. hegasize = 3.26 + 0.045 * gestage
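
The arithmetic behind the two parallel lines can be checked with a short Python sketch (Python is used here only as a calculator; it is not part of the Stata analysis). The coefficients are copied from the model 2 output above, and the helper name fitted_line is hypothetical.

```python
# Coefficients copied from the Stata output for model 2.
CONS = 4.846871        # _cons
B_GESTAGE = 0.044687   # gestage
B_PLACEBO = -1.590404  # _Itreat_2 (1 = placebo, 0 = misoprostol)


def fitted_line(placebo):
    """Return (intercept, slope) of the fitted line for one arm."""
    return CONS + B_PLACEBO * placebo, B_GESTAGE


for arm, placebo in [("misoprostol", 0), ("placebo", 1)]:
    intercept, slope = fitted_line(placebo)
    print(f"{arm}: hegasize = {intercept:.2f} + {slope:.3f} * gestage")
```

Both arms share the slope for gestage; only the intercept shifts by the _Itreat_2 coefficient, which is what the assumption of parallelism means.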

Note that this model is the one that is most useful for interpretation when our main aim is to
compare two or more groups, adjusting for potential confounders (such as gestational age in
this example). It is often referred to as the analysis of covariance model, with the confounders
(e.g. gestational age) being the covariates. Note that this model makes the assumption of
parallelism i.e. that the lines for the groups are parallel.
To check the assumption of parallelism, i.e. to see whether hegasize increases with gestational
age at the same rate in the two groups, we can add a treatment by gestational age interaction to
our regression model (model 3) as shown below:

. xi: reg hegasize i.treat*gestage


i.treat _Itreat_1-2 (naturally coded; _Itreat_1 omitted)
i.treat*gestage _ItreXgesta_# (coded as above)

      Source |       SS       df       MS              Number of obs =     276
-------------+------------------------------           F(  3,   272) =   20.70
       Model |  228.115304     3  76.0384347           Prob > F      =  0.0000
    Residual |  999.112957   272   3.6732094           R-squared     =  0.1859
-------------+------------------------------           Adj R-squared =  0.1769
       Total |  1227.22826   275  4.46264822           Root MSE      =  1.9166

------------------------------------------------------------------------------
    hegasize |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
   _Itreat_2 |  -3.265655   1.590288    -2.05   0.041    -6.396494   -.1348166
     gestage |   .0325812   .0170245     1.91   0.057    -.0009353    .0660977
_ItreXgest~2 |   .0271401   .0254908     1.06   0.288    -.0230442    .0773244
       _cons |   5.596712   1.067333     5.24   0.000     3.495428    7.697995
------------------------------------------------------------------------------

. test _ItreXgesta_2

( 1) _ItreXgesta_2 = 0

F( 1, 272) = 1.13
Prob > F = 0.2880

Looking at our post-estimation test and also the t-test for the coefficient of _ItreXgesta_2,
the interaction term is not significant (p = 0.288), i.e. we cannot reject the null hypothesis
that the relationship between hegasize and gestage is the same in both groups. So we would not
prefer model 3 over model 2 and would base our interpretation on model 2.
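
The two model comparisons made so far (model 1 against model 2, and model 2 against model 3) can also be viewed as extra sum of squares F tests. The Python sketch below, which is an illustrative cross-check and not part of the Stata session, recomputes both F statistics from the residual sums of squares reported in the three regression outputs; the function name extra_ss_f is hypothetical.

```python
def extra_ss_f(rss_small, rss_big, df_extra, df_resid_big):
    """F statistic for adding df_extra terms to the smaller model."""
    ms_resid_big = rss_big / df_resid_big
    return ((rss_small - rss_big) / df_extra) / ms_resid_big


# Residual SS and residual df copied from the three regression outputs.
rss1, rss2, rss3 = 1177.64305, 1003.27687, 999.112957
df2, df3 = 273, 272

f_treat = extra_ss_f(rss1, rss2, 1, df2)        # model 1 -> model 2
f_interaction = extra_ss_f(rss2, rss3, 1, df3)  # model 2 -> model 3
print(f"adding treat:       F(1, {df2}) = {f_treat:.2f}")
print(f"adding interaction: F(1, {df3}) = {f_interaction:.2f}")
```

These agree, to two decimal places, with the values reported by the two test commands (47.45 and 1.13), and each is the square of the corresponding t statistic because only one term is added at each step.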
Note that if we did choose model 3 we could interpret it as follows.

For those on misoprostol _Itreat_2 = 0, so _ItreXgesta_2 = 0 and our equation is
hegasize = 5.60 + 0.033 * gestage

For those on placebo _Itreat_2 = 1, so _ItreXgesta_2 = gestage and our equation is
hegasize = 5.60 + 0.033 * gestage - 3.27 + 0.027 * gestage
i.e. hegasize = 2.33 + 0.060 * gestage
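
To see what the interaction would imply if model 3 were chosen, the following Python sketch (an illustration only, using coefficients copied from the model 3 output) computes the line for each arm and the estimated placebo minus misoprostol difference at two gestational ages. The helper names are hypothetical, and the values 35 and 85 are simply the ends of the gestage axis in the plot.

```python
# Coefficients copied from the Stata output for model 3.
CONS = 5.596712         # _cons
B_PLACEBO = -3.265655   # _Itreat_2
B_GESTAGE = 0.0325812   # gestage
B_INTERACT = 0.0271401  # _ItreXgesta_2


def fitted_line(placebo):
    """Return (intercept, slope) for one arm under model 3."""
    return CONS + B_PLACEBO * placebo, B_GESTAGE + B_INTERACT * placebo


def placebo_minus_misoprostol(gestage):
    """Estimated difference in hegasize between arms at a given gestage."""
    return B_PLACEBO + B_INTERACT * gestage


for arm, placebo in [("misoprostol", 0), ("placebo", 1)]:
    intercept, slope = fitted_line(placebo)
    print(f"{arm}: hegasize = {intercept:.2f} + {slope:.3f} * gestage")

# 35 and 85 are the ends of the gestage axis in the plot.
for g in (35, 85):
    print(f"gestage {g}: difference = {placebo_minus_misoprostol(g):.2f}")
```

Under model 3 the treatment difference is no longer a single number but depends on gestage, which is why the simpler model 2 is easier to interpret when the interaction is not needed.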

We should emphasize that here we would choose the simpler model 2.

In the context of the trial we can see the effect of adjusting for gestational age. To compare
hegasize between the two arms we used a two-sample t-test (see above), which gave an estimated
difference of 1.61 in hegasize (with larger sizes for those on misoprostol) with a standard
error of 0.236. Adjusting for gestational age changed the estimated treatment difference very
slightly (see model 2) to 1.59 and also reduced the standard error of the difference to 0.231.
Thus the confidence interval becomes slightly narrower.
In a clinical trial, due to randomization to treatment arm, confounding is unlikely to be an
issue (as shown by the fact that here the unadjusted and adjusted estimates are very similar).
However, for a continuous outcome, adjusting for other factors can increase precision by
reducing the standard error of the difference; this is often the motivation for adjusting for
covariates in clinical trials. In observational studies the main motivation for adjusting is to
deal with confounding.
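
The gain in precision can be quantified directly from the reported confidence intervals. The Python sketch below is an illustrative calculation only; the numbers are copied from the t-test and model 2 outputs, and the dictionary layout and the name ci_width are hypothetical.

```python
# Estimates copied from the Stata output. The t-test reports
# misoprostol minus placebo; model 2 reports placebo minus misoprostol,
# hence the opposite signs of the two intervals.
results = {
    "unadjusted (t-test)": {"se": 0.2356042, "ci": (1.143898, 2.071547)},
    "adjusted (model 2)": {"se": 0.2308902, "ci": (-2.044956, -1.135853)},
}


def ci_width(ci):
    """Width of a confidence interval given as (lower, upper)."""
    lower, upper = ci
    return upper - lower


widths = {name: ci_width(r["ci"]) for name, r in results.items()}
for name, r in results.items():
    print(f"{name}: SE = {r['se']:.3f}, 95% CI width = {widths[name]:.3f}")

reduction = 1 - widths["adjusted (model 2)"] / widths["unadjusted (t-test)"]
print(f"CI width reduced by {100 * reduction:.1f}%")
```

The reduction is small here because gestational age explains only a small part of the variation in hegasize; a stronger predictor of the outcome would give a larger gain.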

Note that we can now fit more complex models. For example, the following model for the Stepping
Stones pilot study specifically investigates whether rhknow differs between males and females.
It considers as potential explanatory variables age, ever partnered, education, whether the
respondent had read a newspaper in the past week, whether the respondent had a TV at home,
whether the respondent lived with his/her biological mother and whether the respondent lived
with his/her biological father, as well as gender. The model is chosen by stepwise selection,
first forward selection (option pe(0.10)) and then backward elimination (option pr(0.10)).

. xi: sw regress rhknow (i.sex) age i.ephsex (i.educ) (i.x106) (i.x116) (i.x133ma)
(i.x133pa) , pe(0.10) lockterm1
i.sex _Isex_1-2 (naturally coded; _Isex_1 omitted)
i.ephsex _Iephsex_0-1 (naturally coded; _Iephsex_0 omitted)
i.educ _Ieduc_1-3 (naturally coded; _Ieduc_1 omitted)
i.x106 _Ix106_1-2 (naturally coded; _Ix106_1 omitted)
i.x116 _Ix116_1-2 (naturally coded; _Ix116_1 omitted)
i.x133ma _Ix133ma_1-2 (naturally coded; _Ix133ma_1 omitted)
i.x133pa _Ix133pa_1-2 (naturally coded; _Ix133pa_1 omitted)
begin with term 1 model
p = 0.0015 < 0.1000 adding _Iephsex_1
p = 0.0636 < 0.1000 adding _Ieduc_2 _Ieduc_3

      Source |       SS       df       MS              Number of obs =     203
-------------+------------------------------           F(  4,   198) =    4.05
       Model |  89.0097551     4  22.2524388           Prob > F      =  0.0035
    Residual |  1087.64049   198  5.49313379           R-squared     =  0.0756
-------------+------------------------------           Adj R-squared =  0.0570
       Total |  1176.65025   202  5.82500122           Root MSE      =  2.3437

------------------------------------------------------------------------------
      rhknow |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     _Isex_2 |  -.1919783   .3344941    -0.57   0.567    -.8516065    .4676499
  _Iephsex_1 |   .8556354   .3538613     2.42   0.017     .1578148    1.553456
    _Ieduc_2 |   .1671443   .3725591     0.45   0.654    -.5675489    .9018374
    _Ieduc_3 |    1.31891   .5799488     2.27   0.024     .1752413     2.46258
       _cons |   7.806347   .3717176    21.00   0.000     7.073314    8.539381
------------------------------------------------------------------------------

. xi: sw regress rhknow (i.sex) age i.ephsex (i.educ) (i.x106) (i.x116) (i.x133ma)
(i.x133pa) , pr(0.10) lockterm1
i.sex _Isex_1-2 (naturally coded; _Isex_1 omitted)
i.ephsex _Iephsex_0-1 (naturally coded; _Iephsex_0 omitted)
i.educ _Ieduc_1-3 (naturally coded; _Ieduc_1 omitted)
i.x106 _Ix106_1-2 (naturally coded; _Ix106_1 omitted)
i.x116 _Ix116_1-2 (naturally coded; _Ix116_1 omitted)
i.x133ma _Ix133ma_1-2 (naturally coded; _Ix133ma_1 omitted)
i.x133pa _Ix133pa_1-2 (naturally coded; _Ix133pa_1 omitted)
begin with full model
p = 0.6946 >= 0.1000 removing _Ix133ma_2
p = 0.6106 >= 0.1000 removing age
p = 0.4888 >= 0.1000 removing _Ix116_2
p = 0.4450 >= 0.1000 removing _Ix106_2
p = 0.3729 >= 0.1000 removing _Ix133pa_2

      Source |       SS       df       MS              Number of obs =     203
-------------+------------------------------           F(  4,   198) =    4.05
       Model |  89.0097551     4  22.2524388           Prob > F      =  0.0035
    Residual |  1087.64049   198  5.49313379           R-squared     =  0.0756
-------------+------------------------------           Adj R-squared =  0.0570
       Total |  1176.65025   202  5.82500122           Root MSE      =  2.3437

------------------------------------------------------------------------------
      rhknow |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     _Isex_2 |  -.1919783   .3344941    -0.57   0.567    -.8516065    .4676499
    _Ieduc_2 |   .1671443   .3725591     0.45   0.654    -.5675489    .9018374
    _Ieduc_3 |    1.31891   .5799488     2.27   0.024     .1752413     2.46258
  _Iephsex_1 |   .8556354   .3538613     2.42   0.017     .1578148    1.553456
       _cons |   7.806347   .3717176    21.00   0.000     7.073314    8.539381
------------------------------------------------------------------------------

So in this case forward selection and backward elimination lead to the same model. Note that
we force sex into the model using the option lockterm1, which keeps the first term in the model
regardless of its p-value. By including more variables in the first set of parentheses we can
force several variables into the model. Thus rhknow depends on education level and ever
partnered only; adjusting for these, there is no evidence (p = 0.567) that it differs between
males and females.

Note that we can use the adjust command to see predicted means for a given factor adjusting
for other terms in the model as shown below first for educ and then for ephsex.
. adjust _Isex_2 _Iephsex_1 , by(educ)

-------------------------------------------------------------------------------
Dependent variable: rhknow Command: regress
Variables left as is: _Ieduc_2, _Ieduc_3
Covariates set to mean: _Isex_2 = .5320197, _Iephsex_1 = .59605911
-------------------------------------------------------------------------------
education |
(grouped) | xb
----------+-----------
1 | 8.21422
2 | 8.38136
3 | 9.53313
----------------------
Key: xb = Linear Prediction

Thus the adjusted mean rhknow for educ level 3 (9.53) is higher than that for educ levels 1 and
2 (8.21 and 8.38).

. adjust _Isex_2 _Ieduc_2 _Ieduc_3 , by(ephsex)

------------------------------------------------------------------------------
Dependent variable: rhknow Command: regress
Variable left as is: _Iephsex_1
Covariates set to mean: _Isex_2 = .5320197, _Ieduc_2 = .56650246, _Ieduc_3 = .12807882
-------------------------------------------------------------------------------
ever |
partnered |
& had sex| xb
----------+-----------
o | 7.96782
+ | 8.82346
----------------------
Key: xb = Linear Prediction

So rhknow is about 0.86 units higher for those ever partnered (8.82 compared with 7.97).
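
The adjusted means reported by adjust are linear predictions in which the factor of interest is set to each of its levels and the remaining indicator variables are held at their sample means. The Python sketch below reproduces this by hand as an illustration; the coefficients and means are copied from the output above, and the function name linear_prediction is hypothetical.

```python
# Coefficients from the final rhknow model.
coef = {"_cons": 7.806347, "_Isex_2": -0.1919783, "_Iephsex_1": 0.8556354,
        "_Ieduc_2": 0.1671443, "_Ieduc_3": 1.31891}
# Sample means reported by adjust ("Covariates set to mean").
mean = {"_Isex_2": 0.5320197, "_Iephsex_1": 0.59605911,
        "_Ieduc_2": 0.56650246, "_Ieduc_3": 0.12807882}


def linear_prediction(settings):
    """Linear prediction xb: named terms fixed, all others at their means."""
    x = {**mean, **settings}
    return coef["_cons"] + sum(coef[name] * x[name] for name in mean)


by_educ = {
    1: linear_prediction({"_Ieduc_2": 0, "_Ieduc_3": 0}),
    2: linear_prediction({"_Ieduc_2": 1, "_Ieduc_3": 0}),
    3: linear_prediction({"_Ieduc_2": 0, "_Ieduc_3": 1}),
}
by_ephsex = {v: linear_prediction({"_Iephsex_1": v}) for v in (0, 1)}

for level, xb in by_educ.items():
    print(f"educ {level}: xb = {xb:.5f}")
for level, xb in by_ephsex.items():
    print(f"ephsex {level}: xb = {xb:.5f}")
```

The difference between the two ephsex predictions is exactly the _Iephsex_1 coefficient, because every other term is held at the same value in both.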
Note that sometimes it is useful to have an ANOVA table for the fitted model. This can be found
using the anova command

. anova rhknow sex educ ephsex

           Number of obs =     203     R-squared     =  0.0756
           Root MSE      = 2.34374     Adj R-squared =  0.0570

    Source |  Partial SS    df       MS           F     Prob > F
-----------+----------------------------------------------------
     Model |  89.0097551     4  22.2524388       4.05     0.0035
           |
       sex |  1.80945437     1  1.80945437       0.33     0.5667
      educ |  30.6949685     2  15.3474842       2.79     0.0636
    ephsex |  32.1167488     1  32.1167488       5.85     0.0165
           |
  Residual |  1087.64049   198  5.49313379
-----------+----------------------------------------------------
     Total |  1176.65025   202  5.82500122

The command can also deal with continuous explanatory variables using the option cont e.g.
cont(age).
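
The ANOVA table and the regression output describe the same fitted model. As an illustrative cross-check in Python (not part of the Stata session), each F statistic is the term's mean square divided by the residual mean square, and for terms with one degree of freedom it equals the square of the regression t statistic. All numbers are copied from the output above; the variable names are hypothetical.

```python
# Partial SS and df for each term, copied from the anova table.
partial_ss = {
    "sex": (1.80945437, 1),
    "educ": (30.6949685, 2),
    "ephsex": (32.1167488, 1),
}
resid_ms = 1087.64049 / 198  # residual SS / residual df

# F = (partial SS / df) / residual MS
f_stats = {term: (ss / df) / resid_ms for term, (ss, df) in partial_ss.items()}

# t = coefficient / standard error, from the regression table.
t_sex = -0.1919783 / 0.3344941
t_ephsex = 0.8556354 / 0.3538613

for term, f in f_stats.items():
    print(f"{term}: F = {f:.2f}")
print(f"t(sex)^2 = {t_sex ** 2:.2f}, t(ephsex)^2 = {t_ephsex ** 2:.2f}")
```

For educ, which has two degrees of freedom, there is no single t statistic; the F test (p = 0.0636) assesses both education indicators jointly, which is the same p-value shown when forward selection added _Ieduc_2 and _Ieduc_3 together.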
