Lecture 5: Multiple Linear Regression

Objectives

At the end of this session participants should:

• Be able to interpret the equation of a multiple linear regression model

• Understand some of the criteria and principles used in choosing which variables to include in a multiple linear regression model

• Understand the relationship between ANOVA models and linear regression models

In multiple regression we investigate how a single response variable Y is related to two or more
explanatory variables x1, x2, etc.

e.g. rhknow might depend on education level as well as age

The model used for multiple regression is

Yi = β0 + β1 x1i + β2 x2i + εi

where, as before, the deviations εi are assumed to be independent and normally distributed with mean zero and constant variance σ².

As before we have to estimate the parameters β0, β1 and β2 using sample statistics b0, b1 and b2, and this is done, as for simple linear regression, using the method of least squares, i.e. by minimizing

D = ∑ εi² = ∑ {yi − (β0 + β1 x1i + β2 x2i)}²

with respect to β0, β1 and β2.

The calculations involved are very time-consuming and in practice are carried out using a
statistics package.
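
In Stata, for instance, such a model is fitted with the regress command, listing the response variable first followed by the explanatory variables. A generic sketch, with y, x1 and x2 standing in as placeholder variable names, is:

. regress y x1 x2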

Note that often the question of interest is “Which of the x-variables are most important for
predicting y?”

Ex: We can carry out a multiple regression analysis to see whether rhknow depends on age
(x1), education (x2), or both.

. regress rhknow age educ

Source | SS df MS Number of obs = 203

-------------+------------------------------ F( 2, 200) = 6.14

Model | 68.0241289 2 34.0120645 Prob > F = 0.0026

Residual | 1108.62612 200 5.54313059 R-squared = 0.0578

-------------+------------------------------ Adj R-squared = 0.0484

Total | 1176.65025 202 5.82500122 Root MSE = 2.3544

------------------------------------------------------------------------------

rhknow | Coef. Std. Err. t P>|t| [95% Conf. Interval]

-------------+----------------------------------------------------------------

age | .0587707 .0268695 2.19 0.030 .0057868 .1117545

educ | .5905042 .266571 2.22 0.028 .0648539 1.116155

_cons | 6.25016 .6579523 9.50 0.000 4.952746 7.547574

Thus the overall regression is highly significant (F = 6.14; P = 0.0026), so we can reject the null
hypothesis that rhknow is related to neither age nor education and conclude that rhknow
depends on at least one of them. Note that the R² has increased only slightly, to 5.78%, so most
of the variation in rhknow is still unexplained.

The question remains as to whether we need both age and education in our model. We can
check this using the post-estimation test command in Stata, which tests whether a particular
coefficient is significantly different from zero; if we cannot reject the null hypothesis that a
coefficient is zero, we can omit the corresponding variable from the model (i.e. we can omit
variables which are “not significant”).

. test age

( 1) age = 0

F( 1, 200) = 4.78

Prob > F = 0.0299



Thus in this case we reject the null hypothesis that the coefficient for age is zero and we need to
keep age in our model.

. test educ

( 1) educ = 0

F( 1, 200) = 4.91

Prob > F = 0.0279

Thus we also reject the null hypothesis that the coefficient for education is zero and we need to
keep education in our model.
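
We can also test the two coefficients jointly; a sketch of the command, using the variables from our example, is:

. test age educ

Since age and educ are the only explanatory variables in the model, this joint test of both coefficients being zero simply reproduces the overall F-test (F = 6.14) from the regression output.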

So in this case the best model is the one with both age and education as explanatory
variables. Note that education is actually an ordinal variable, which is acceptable as an
explanatory variable. The effect can be interpreted as follows: for an increase of one level in
education, rhknow increases by 0.59 on average, adjusting for the effect of age. Similarly, for a
1-year increase in age, rhknow increases by 0.059 on average, adjusting for the effect of
education. Thus adjusting for education reduces the magnitude of the age effect, even though
age remains statistically significant.
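
To illustrate how the fitted equation is used for prediction, consider a hypothetical respondent aged 30 with education level 2 (values chosen purely for illustration). The predicted rhknow is 6.25 + 0.0588 × 30 + 0.591 × 2 ≈ 9.19, which can be computed in Stata with the display command:

. display 6.25016 + 0.0587707*30 + 0.5905042*2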

Model selection

In the case of two x variables we are in fact choosing between 4 models

(a) no x variable is important


(b) x1 alone is important
(c) x2 alone is important
(d) x1 and x2 are both important (as in our example above)

In the case of three x variables there are 8 models: (a) no x variable, (b) x1 alone, (c) x2, (d) x3,
(e) x1 & x2, (f) x1 & x3, (g) x2 & x3, (h) x1, x2 & x3.

In the case of 4 x variables there are 16 models, in the case of 5 x-variables 32 models, and in
general with k candidate x-variables there are 2^k possible models. Thus the number of potential
models increases rapidly and we need some strategies to search for a suitable model. In general
there are a number of criteria that are used and this is still an area of on-going research. However
one strategy that is easy to implement is stepwise regression. There are two basic versions of this
strategy: in forward selection we choose the variable that gives the biggest reduction in the
residual sum of squares at each stage and add it to the model, stopping when no variable outside
the model significantly improves the fit. In backward elimination we start by fitting the model with
all potential regressors and successively remove the term which leads to the smallest increase in
the residual sum of squares, stopping when the term to be removed would lead to a significantly
worse model.

Note that there are a number of theoretical criticisms of these procedures:

(1) The derived model will give an over-optimistic impression – the P-values for the selected
variables will be too small, confidence intervals will be too narrow and the proportion of
variance explained (R2) will be too high. This is because these quantities do not reflect
the fact that the model was selected using a stepwise procedure.
(2) The regression coefficients will tend to be too large (i.e. too far from their null values), so
the performance of the model in predicting future values of the outcome will be worse
than we expect.
(3) Computer simulations have shown that minor changes in the data may lead to important
changes in the variables selected for the final model.
(4) There are some cases where the two procedures can lead to different models.
(5) Stepwise procedures should never be used as a substitute for thinking about the
problem. We should include variables known from previous work to be associated with
the outcome, and exclude variables for which an association is implausible.

Note that the larger the original number of exposure variables from which the model was
selected, the higher the probability of selecting variables with chance associations, and thus the
worse these problems will be.

When implementing this strategy we use a slightly liberal p-value for inclusion / exclusion (say
P=0.15 or even P=0.20) so as not to lose any variable that could have predictive power. It is
also useful to carry out both procedures – and we can have more confidence in our final model
if it is chosen by both.

In some cases there is a variable that is of intrinsic interest and we can fix it in all models using
the lockterm1 option in Stata.
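
For example, if we wanted to force age to remain in every model considered, a sketch of the command (with age listed first so that it is the locked term) would be:

. sw regress rhknow age educ income , pe(0.10) lockterm1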

Both methods are implemented in Stata using the sw prefix. For forward selection we specify
the option pe to give the nominal significance level for a term to enter the model; for backward
elimination we specify the option pr to give the nominal significance level for a term to be
dropped from the model.
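
Note that in more recent versions of Stata the sw command has been superseded by the stepwise prefix; sketches of the equivalent forward selection and backward elimination commands for our example are:

. stepwise, pe(0.10): regress rhknow age educ income

. stepwise, pr(0.10): regress rhknow age educ income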

Ex: We can find the “best” model for predicting rhknow using age, education and income level
(also an ordinal variable) as candidate regressors.

. sw regress rhknow age educ income , pe(0.10)

begin with empty model

p = 0.0073 < 0.1000 adding educ

p = 0.0299 < 0.1000 adding age

Source | SS df MS Number of obs = 203

-------------+------------------------------ F( 2, 200) = 6.14

Model | 68.0241289 2 34.0120645 Prob > F = 0.0026

Residual | 1108.62612 200 5.54313059 R-squared = 0.0578

-------------+------------------------------ Adj R-squared = 0.0484

Total | 1176.65025 202 5.82500122 Root MSE = 2.3544

------------------------------------------------------------------------------

rhknow | Coef. Std. Err. t P>|t| [95% Conf. Interval]

-------------+----------------------------------------------------------------

educ | .5905042 .266571 2.22 0.028 .0648539 1.116155

age | .0587707 .0268695 2.19 0.030 .0057868 .1117545

_cons | 6.25016 .6579523 9.50 0.000 4.952746 7.547574

------------------------------------------------------------------------------

. sw regress rhknow age educ income , pr(0.10)

begin with full model

p = 0.1481 >= 0.1000 removing income

Source | SS df MS Number of obs = 203

-------------+------------------------------ F( 2, 200) = 6.14

Model | 68.0241289 2 34.0120645 Prob > F = 0.0026

Residual | 1108.62612 200 5.54313059 R-squared = 0.0578

-------------+------------------------------ Adj R-squared = 0.0484

Total | 1176.65025 202 5.82500122 Root MSE = 2.3544

------------------------------------------------------------------------------

rhknow | Coef. Std. Err. t P>|t| [95% Conf. Interval]

-------------+----------------------------------------------------------------

age | .0587707 .0268695 2.19 0.030 .0057868 .1117545

educ | .5905042 .266571 2.22 0.028 .0648539 1.116155

_cons | 6.25016 .6579523 9.50 0.000 4.952746 7.547574

------------------------------------------------------------------------------

Thus we see that in this case both procedures lead to the same model (the one with age
and education), and given these terms income does not significantly improve the prediction of rhknow.

Categorical Explanatory factors

Often we are interested in comparing three or more groups e.g. in the regression model we
assumed that the effect of education was linear i.e. that the effect of moving from level 1 to level
2 is the same as the effect of moving from level 2 to level 3. We could alternatively treat educ as
a 3 level categorical variable and look at whether the mean rhknow varies between the three
education groups.
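
As a sketch of how this can be done directly within the regression framework, Stata's factor-variable notation (available from Stata 11 onwards) fits a separate coefficient for each education level relative to the lowest level, rather than assuming a common effect per one-level increase:

. regress rhknow age i.educ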

Recall that if we want to only investigate the effect of a single categorical explanatory factor on
an outcome variable, then this can be done using a oneway analysis of variance (recall the
example from lecture 2 investigating the effect of two different drugs or a placebo on the
lymphocyte counts of mice).

. use mice , clear

. oneway lcount drug , tab

| Summary of lcount

drug | Mean Std. Dev. Freq.

------------+------------------------------------

1 | 68 9.3005376 5

2 | 60 6.363961 5

3 | 55 3.5355339 5

------------+------------------------------------

Total | 61 8.4006802 15

Analysis of Variance

Source SS df MS F Prob > F

------------------------------------------------------------------------

Between groups 430 2 215 4.62 0.0325

Within groups 558 12 46.5

------------------------------------------------------------------------

Total 988 14 70.5714286

Bartlett's test for equal variances: chi2(2) = 2.9923 Prob>chi2 = 0.224

Analysis of variance and regression

There is a very close connection between multiple regression models and analysis of
variance models. To see this we can measure the effects of drug B and the placebo C relative to
drug A, i.e. we set drug A to be the reference or baseline level.

We can define variables to measure the effect of drug B and drug C relative to the baseline as
follows:

. gen drugb = 0

. replace drugb=1 if drug==2

(5 real changes made)

. gen drugc = 0

. replace drugc=1 if drug==3

(5 real changes made)
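
As a sketch of an equivalent shortcut, the generate() option of tabulate creates one indicator variable per level of drug in a single step (with the stub drugdum, Stata names them drugdum1, drugdum2 and drugdum3):

. tabulate drug , generate(drugdum)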



. list drug drugb drugc , noobs

+----------------------+

| drug drugb drugc |

|----------------------|

| 1 0 0 |

| 1 0 0 |

| 1 0 0 |

| 1 0 0 |

| 1 0 0 |

|----------------------|

| 2 1 0 |

| 2 1 0 |

| 2 1 0 |

| 2 1 0 |

| 2 1 0 |

|----------------------|

| 3 0 1 |

| 3 0 1 |

| 3 0 1 |

| 3 0 1 |

| 3 0 1 |

+----------------------+

Thus the variables drugb and drugc jointly specify the drug received by any mouse:
If drugb=0 and drugc=0 then the mouse received drug A.
If drugb=1 and drugc=0 then the mouse received drug B.
If drugb=0 and drugc=1 then the mouse received placebo C.
We can now fit the multiple regression model of lcount on drugb and drugc.

. reg lcount drugb drugc

Source | SS df MS Number of obs = 15


-------------+------------------------------ F( 2, 12) = 4.62
Model | 430 2 215 Prob > F = 0.0325
Residual | 558 12 46.5 R-squared = 0.4352
-------------+------------------------------ Adj R-squared = 0.3411
Total | 988 14 70.5714286 Root MSE = 6.8191

------------------------------------------------------------------------------
lcount | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
drugb | -8 4.312772 -1.85 0.088 -17.39672 1.396722
drugc | -13 4.312772 -3.01 0.011 -22.39672 -3.603278
_cons | 68 3.04959 22.30 0.000 61.35551 74.64449
------------------------------------------------------------------------------

We can see that the ANOVA table from the regression gives identical results (with an F-value of
4.62) to that obtained by carrying out the one-way analysis of variance.

We defined drugb and drugc in such a way that our baseline level is drug A, i.e. if drugb=0 and
drugc=0 then the mouse received drug A. For drug A the mean value of lcount is 68, which is
the value of the intercept (_cons).
The coefficient of drugb measures the effect of drug B, which is just the difference in mean
lcount between drug B and drug A,
i.e. -8 = 60 − 68.
Similarly the coefficient of drugc measures the effect of the placebo C (relative to the baseline),
which is just the difference in mean lcount between placebo C and drug A,
i.e. -13 = 55 − 68.
From the t-tests in the regression output we can see that the difference between the placebo C
and drug A is statistically significant, but the difference between drug B and drug A is not
statistically significant.

Thus we can use regression models to fit Analysis of Variance models.
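
The same model can also be fitted without constructing the dummy variables by hand. Sketches of two equivalent commands (the factor-variable version requires Stata 11 or later) are:

. anova lcount drug

. regress lcount i.drug

Both reproduce the results above, with drug level 1 (drug A) taken as the baseline in the regression version.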
