Linear and Generalized Linear
Models (4433LGLM6Y)
Model selection
Meeting 8
Vahe Avagyan
Biometris, Wageningen University and Research
Model selection (Fox: chapter 22, Faraway PRA: chapter 10)
Model selection and criteria
Model validation
Collinearity
Model Selection: Caution!
• Suppose we have many predictor variables, not all necessarily related to the response variable.
• We want to select the "best" subset of predictors.
• The following problems may appear (according to Fox):
• Problem of simultaneous inference.
• What does “failing to reject a null hypothesis” mean?
• Impact of large samples on hypothesis tests: trivially small effects become statistically significant if the dataset is large.
• Exaggerated precision.
General strategies
• Addressing these concerns (according to Fox):
• Use alternative model-selection criteria instead of statistical significance.
• Compensate for simultaneous inference, e.g., by Bonferroni adjustments.
• Validate a statistical model selected by another approach.
• Model averaging
• Avoid model selection: specify a maximally complex and flexible model without trying to simplify it. Issues here?
Model Selection
• Reasons for selecting the "best" subset of regressors:
• We can explain the data in the simplest way, removing redundant predictors.
• Principle of Occam's Razor (parsimony): among several plausible explanations for a phenomenon, the simplest is best.
• Unnecessary predictors add noise to the estimation of other quantities that are of interest, and degrees of freedom are wasted.
• Collinearity is caused by having too many variables doing the same job.
• Remove excess predictors.
Types of variable selection
• Two main types of variable selection:
• Stepwise approach, comparing successive models:
• Backward elimination
• Forward selection
• Stepwise (mixed) selection
• The stepwise approach may use hypothesis testing to select the next step, but other criteria may be used too.
• Criterion approach, finding a model that optimizes some measure of goodness of fit.
• Marginality principle: keep lower-order terms in the model if a higher-order term is important.
• Model selection is conceptually simplest if the goal is prediction.
• Example: develop a regression model that will predict new data as accurately as possible.
Stepwise procedures
• Backward elimination vs. forward selection.
• Both procedures can be implemented using the step() command in R (a sketch follows below).
• Stepwise selection (mixed).
• Stepwise procedures may also be used in combination with other criteria, e.g., AIC (see stepAIC()).
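• A minimal sketch of the three directions with step(), using a built-in R dataset purely for illustration (mtcars and the formulas below are not the course example; step() uses AIC by default):

```r
# Stepwise selection with step(); mtcars is used only for illustration
full <- lm(mpg ~ ., data = mtcars)   # full model with all predictors
null <- lm(mpg ~ 1, data = mtcars)   # intercept-only (null) model

# Backward elimination: start from the full model and drop terms
back <- step(full, direction = "backward", trace = FALSE)

# Forward selection: start from the null model, search up to the full model
forw <- step(null, scope = formula(full), direction = "forward", trace = FALSE)

# Stepwise (mixed) selection; MASS::stepAIC() works analogously
both <- step(null, scope = formula(full), direction = "both", trace = FALSE)

summary(back)   # inspect the selected model
```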
Stepwise procedures: Backward Elimination
• Backward Elimination
1. Start with all predictors in the model (full model).
2. Remove the predictor with the highest p-value, provided it is greater than the threshold p-value-to-stay $\alpha_{crit}$ (e.g., 0.05).
3. Refit the model and repeat step 2.
4. Stop when all p-values of terms remaining in the model are smaller than $\alpha_{crit}$.
• What is the main drawback of this approach? Other criteria may be used here as well, e.g., AIC.
Stepwise procedures: Backward Elimination with 5 predictors
• Backward Elimination
1. Start with all 5 predictors in the model (full model).
2. Remove the least significant predictor (e.g., $X_4$).
3. Refit the model and repeat step 2.
4. Keep removing the least significant predictors until all remaining predictors are significant (or we run out of predictors).
Example: Life expectancy dataset
• Examine the relationship between life expectancy and other socio-economic variables for the U.S.
states.
• There is an easier way to do this in R, but let's try it manually first (see the sketch below).
• For the automated approach, see the step() or stepAIC() commands.
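• A sketch of the manual backward elimination, assuming the example is based on R's built-in state.x77 data for the U.S. states; the removal order on the following slides is area, illiteracy, income, population:

```r
# Manual backward elimination on the life expectancy data (assumed: state.x77)
statedata <- data.frame(state.x77)        # columns: Population, Income, Illiteracy,
                                          # Life.Exp, Murder, HS.Grad, Frost, Area
fit <- lm(Life.Exp ~ ., data = statedata)
summary(fit)                              # "Area" has the largest p-value -> drop it

fit <- update(fit, . ~ . - Area)
summary(fit)                              # next drop "Illiteracy", then "Income",
                                          # then "Population", until all p < 0.05
```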
Example: Life expectancy dataset
• "area" shows the highest p-value above the threshold (e.g., 0.05), i.e., it is the least significant predictor and is removed first.
Example: Life expectancy dataset
• Next, "illiteracy" shows the highest p-value above the threshold.
• Then, "income" shows the highest p-value above the threshold.
Example: Life expectancy dataset
• Next, "population" shows the highest p-value above the threshold.
• The procedure stops here.
Stepwise procedures: Forward Selection
• Forward Selection
1. Start with no predictors in the model (null model).
2. Enter the predictor with the smallest p-value, provided it is smaller than the threshold p-value-to-enter $\alpha_{crit}$.
3. Refit the model and repeat step 2.
4. Stop when all p-values of terms not in the model are higher than $\alpha_{crit}$.
Stepwise procedures: Forward Selection with 5 predictors
• Forward Selection
1. Start with no predictors in the model (null model).
2. Enter the most significant predictor (e.g., $X_2$).
3. Refit the model and repeat step 2.
4. Keep adding the most significant predictors until those not in the model are no longer significant.
Stepwise procedures: Stepwise (or mixed) selection
• This is a combination of backward elimination and forward selection. After entering a variable, all variables in the model become candidates for removal.
• The thresholds p-value-to-enter and p-value-to-stay need to be specified.
• Drawbacks related to the caveats mentioned earlier:
• The "optimal" model may be missed because single variables are added or dropped one at a time.
• Stepwise selection tends to pick models smaller than desirable for prediction purposes.
• If using p-values: don't treat the p-values literally! (Recall the multiple-testing problem.)
• The procedures are not linked to the final objective of prediction or explanation.
Criterion-Based Procedures
• Criterion-based procedures typically compare all possible models (i.e., "all possible subsets regression").
• A model with $k$ regressors has $2^k$ possible sub-models! (why?)
• Different criteria may be used, e.g.:
• $R^2_{adj}$ (adjusted $R^2$)
• Mallow's $C_p$
• Predicted residual error sum of squares (PRESS) and cross-validation
• Akaike information criterion (AIC)
• Bayesian (sometimes Schwarz's Bayesian) information criterion (BIC)
Criterion-Based Procedures: $R^2_{adj}$
• Recall
$$R^2 = \frac{RegSS}{TSS} = 1 - \frac{RSS}{TSS}.$$
• Why can't we use $R^2$ as a model selection criterion?
• Instead, we use the adjusted $R^2$:
$$R^2_{adj} = 1 - \frac{RSS/(n-p)}{TSS/(n-1)} = 1 - \frac{\hat{\sigma}^2_{Model}}{\hat{\sigma}^2_{Null}},$$
where $\hat{\sigma}^2_{Null}$ is the estimate of the error variance based on the "empty" model (intercept only).
• $R^2_{adj}$ will only increase when the model is changed if the estimate of the error variance based on the new model, $\hat{\sigma}^2_{Model}$, decreases; that estimate only decreases if the reduction in RSS compensates for the reduction in residual df.
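• A small sketch of extracting $R^2_{adj}$ in R; the state.x77 data and the particular terms are illustrative assumptions:

```r
# Comparing adjusted R^2 of two candidate models
statedata <- data.frame(state.x77)
small <- lm(Life.Exp ~ Murder + HS.Grad + Frost,        data = statedata)
large <- lm(Life.Exp ~ Murder + HS.Grad + Frost + Area, data = statedata)

summary(small)$adj.r.squared
summary(large)$adj.r.squared   # increases only if the extra term lowers sigma^2_Model
```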
Criterion-Based Procedure
• A good model should predict well, so the total prediction MSE (at the population level) should be small.
• The scaled (normalized) MSE is:
$$\frac{1}{\sigma_\epsilon^2}\sum_{i=1}^{n} E\left[\left(\hat{Y}_i - E(Y_i)\right)^2\right] = \frac{1}{\sigma_\epsilon^2}\sum_{i=1}^{n}\left[ V(\hat{Y}_i) + \left(E(\hat{Y}_i) - E(Y_i)\right)^2 \right].$$
• There are two components: the variance $V(\hat{Y}_i)$ and the squared bias $\left(E(\hat{Y}_i) - E(Y_i)\right)^2$.
• Bias-variance trade-off: when removing a variable, the hope is that the decrease in variance offsets any increase in bias.
Criterion-Based Procedure: Mallow's $C_p$
• The prediction MSE is estimated by Mallow's $C_p$:
$$C_p = \frac{RSS_p}{\hat{\sigma}_\epsilon^2} + 2p - n,$$
where $\hat{\sigma}_\epsilon^2$ is from the full model and $RSS_p$ is from the current model (with $p$ parameters).
• A good model should have $C_p$ close to or below $p$. A model with a bad fit has $C_p$ much bigger than $p$.
• For the full model, we have $C_p = p$ (why?).
Mallow's $C_p$: Example
• In practice, we plot $C_p$ against $p$ and look for models with small $p$ and with $C_p$ around or less than $p$.
• We have $k = 7$ predictors.
• How many models are there in total?
• Good options are the models "456" and "1456". The smaller model is more parsimonious, but the larger model fits slightly better (see the sketch below).
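• A sketch of the all-subsets search behind such a plot; the leaps package and the state.x77 data are assumptions here, not necessarily the tools used for the slide:

```r
# All possible subsets regression with Mallow's Cp
library(leaps)
statedata <- data.frame(state.x77)
rs <- summary(regsubsets(Life.Exp ~ ., data = statedata))

rs$cp                                    # Cp of the best model of each size
plot(2:8, rs$cp, xlab = "p (number of parameters)", ylab = "Cp")
abline(0, 1)                             # look for Cp close to or below p
```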
Criterion-Based Procedure: $R^2_{adj}$
• Now, let's check the model selection with $R^2_{adj}$.
• The model with the largest $R^2_{adj}$ is "1456".
• What about the best 3-predictor model?
• Variable selection methods are sensitive to outliers, influential points, and transformations.
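• Continuing the same regsubsets() sketch (still an assumption), the adjusted $R^2$ of the best subset of each size can be read off as well:

```r
library(leaps)
statedata <- data.frame(state.x77)
rs <- summary(regsubsets(Life.Exp ~ ., data = statedata))

rs$adjr2             # adjusted R^2 of the best subset of each size
which.max(rs$adjr2)  # size of the subset maximizing adjusted R^2
```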
Criterion-Based Procedure: Cross-Validation
• PRESS = Predicted REsidual Sum of Squares (leave-one-out cross-validated residuals):
$$PRESS = \sum_{i=1}^{n}\left(\hat{Y}_{-i} - Y_i\right)^2.$$
• $\hat{Y}_{-i}$ is the prediction for the $i$-th observation, using a model fitted without the $i$-th observation.
• The cross-validation criterion estimates the mean squared error of prediction as:
$$CV \equiv \frac{\sum_{i=1}^{n}\left(\hat{Y}_{-i} - Y_i\right)^2}{n} = \frac{PRESS}{n}.$$
• What is the drawback of LOOCV?
• We prefer the model with the smallest value of CV or PRESS (see the sketch below).
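• For linear models, PRESS can be computed without refitting, using the leave-one-out identity $\hat{e}_{-i} = e_i / (1 - h_i)$; the model below (the "456" model, on the assumed state.x77 data) is only an illustration:

```r
# PRESS and the LOOCV criterion from ordinary residuals and hat values
statedata <- data.frame(state.x77)
fit <- lm(Life.Exp ~ Murder + HS.Grad + Frost, data = statedata)

press <- sum((residuals(fit) / (1 - hatvalues(fit)))^2)
cv    <- press / nobs(fit)      # LOOCV estimate of the prediction MSE
c(PRESS = press, CV = cv)
```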
Criterion-Based Procedure: PRESS and Cross-Validation
• An alternative is $k$-fold cross-validation:
1. Divide the data into a small number of subsets or folds (e.g., 5 or 10) of roughly equal size.
2. Fit the model omitting each subset in turn (i.e., use only the training data).
3. Obtain the fitted values for all observations in the omitted subset (i.e., the test data).
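• A minimal base-R sketch of $k$-fold cross-validation; the fold assignment, $k = 5$, and the candidate formula are illustrative choices:

```r
# 5-fold cross-validation for one candidate linear model
set.seed(1)
statedata <- data.frame(state.x77)
k     <- 5
folds <- sample(rep(1:k, length.out = nrow(statedata)))   # random fold labels

sq_err <- numeric(nrow(statedata))
for (f in 1:k) {
  train <- statedata[folds != f, ]    # training data
  test  <- statedata[folds == f, ]    # omitted fold (test data)
  fit   <- lm(Life.Exp ~ Murder + HS.Grad + Frost, data = train)
  sq_err[folds == f] <- (test$Life.Exp - predict(fit, newdata = test))^2
}
mean(sq_err)   # k-fold estimate of the prediction MSE
```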
Criterion-Based Procedure: AIC and BIC
• Penalized model-fit statistics:
$$-2 \log L(\hat{\theta}) + penalty.$$
• $\theta$ is the vector of parameters of the model, including the regression coefficients and the error variance. Here, $\hat{\theta}$ is the m.l.e.
• $L(\hat{\theta})$ is the maximized likelihood under the current model.
• $penalty \equiv c \times p$ ($c$ is a scaling parameter).
• The magnitude of a criterion is not interpretable, but differences are.
• The model with the smallest value of the information criterion is preferred.
Criterion-Based Procedure: AIC and BIC
• The most popular criteria are
$$AIC = -2 \log L(\hat{\theta}) + 2p, \quad \text{thus } c = 2,$$
$$BIC = -2 \log L(\hat{\theta}) + p \log(n), \quad \text{thus } c = \log(n).$$
• When the sample size $n$ is small, there is a high chance that AIC will select models that have too many parameters (i.e., AIC will overfit). In this case, the corrected AIC can be used:
$$AIC_c = AIC + \frac{2p^2 + 2p}{n - p - 1}.$$
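• A sketch of comparing models with AIC and BIC in R, with $AIC_c$ computed by hand; counting $p$ as all estimated parameters (coefficients plus the error variance) is an assumption, and the two candidate models are illustrative:

```r
# AIC, BIC and a hand-rolled AICc for two candidate models (state.x77 assumed)
statedata <- data.frame(state.x77)
m1 <- lm(Life.Exp ~ Murder + HS.Grad + Frost,              data = statedata)
m2 <- lm(Life.Exp ~ Population + Murder + HS.Grad + Frost, data = statedata)

AIC(m1, m2)   # smaller is better
BIC(m1, m2)

aicc <- function(fit) {
  n <- nobs(fit)
  p <- length(coef(fit)) + 1                 # + 1 for the error variance
  AIC(fit) + (2 * p^2 + 2 * p) / (n - p - 1)
}
sapply(list(m1 = m1, m2 = m2), aicc)
```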
Example: Life expectancy dataset
• What would the AIC value be if a given predictor were removed?
• The variables are removed step by step, and AIC is checked at each step (see the sketch below).
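• In R, drop1() reports exactly this "AIC if the term is removed" table, and step() automates the step-by-step removal (the state.x77 data are assumed):

```r
# AIC-based backward elimination for the life expectancy model
statedata <- data.frame(state.x77)
full <- lm(Life.Exp ~ ., data = statedata)

drop1(full)                          # AIC the model would have after dropping each term
step(full, direction = "backward")   # repeat the drops until AIC no longer improves
```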
Summary model selection
• Stepwise procedures:
• Search through space of potential models.
• Testing-based procedures use dubious hypothesis testing.
• Criterion-based procedures:
• search through a wider space of models ("all possible subsets regression"),
• compare the models using a particular criterion.
• Criterion-based procedures are usually preferred.
Summary model selection
• The aim of variable selection is to construct a model that predicts well or explains the relationships in the data well.
• It is part of the process of model building, like the identification of outliers and influential points, and variable transformation.
• Automatic selection is not guaranteed to be consistent. Use these methods as a guide only.
• Accept the possibility that several suggested models fit equally well. Then consider:
• Do the models have similar qualitative consequences?
• Do they make similar predictions?
• What is the cost of measuring the predictors?
• Which has the best diagnostics?
Model validation
• Model validation: split dataset into training subsample and validation subsample:
• Training subsample is used to specify the statistical model.
• Validation subsample is used to evaluate the fitted model.
• Cross-validation is an application of this idea where the roles of training and validation subsamples are
interchanged.
• Statistical modeling: iterative sequence of data exploration, model fitting, model criticism, model re-
specification.
• Variables may be dropped, interactions may be incorporated or deleted, variables may be
transformed, unusual data may be corrected, removed, or otherwise accommodated.
Example: Model validation
• Evaluation is usually done using Pearson's correlation or root mean squared error (RMSE) metrics.
• Usually, several random partitions are used (see the sketch below).
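• A sketch of a single random training/validation split evaluated with Pearson's correlation and RMSE; the 50/50 split, the model, and the state.x77 data are illustrative assumptions:

```r
# One random training/validation partition
set.seed(1)
statedata <- data.frame(state.x77)
idx   <- sample(nrow(statedata), size = nrow(statedata) / 2)
train <- statedata[idx, ]                   # used to fit the model
valid <- statedata[-idx, ]                  # used only for evaluation

fit  <- lm(Life.Exp ~ Murder + HS.Grad + Frost, data = train)
pred <- predict(fit, newdata = valid)

cor(valid$Life.Exp, pred)                   # Pearson correlation
sqrt(mean((valid$Life.Exp - pred)^2))       # RMSE
# In practice, repeat this over several random partitions and average the metrics.
```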
Model validation
• The resulting model should accurately reflect the principal characteristics of the data.
• Danger: overfitting and overstating the strength of the results.
• Ideal solution: collect new data with which to validate the model (often not possible).
• Model validation simulates the collection of new data by randomly dividing the data into two parts:
• the first for exploration and model formulation,
• the second for checking the adequacy of the model, formal estimation, and testing.
Collinearity
• If a perfect linear relationship among the regressors exists, the least-squares coefficients are no longer uniquely defined.
• A strong, but less-than-perfect, linear relationship among the $X$'s causes the least-squares coefficients to be unstable:
• large standard errors of coefficients,
• broad confidence intervals,
• hypothesis tests with low power.
Collinearity and Remedies
• Small changes in the data can greatly change the coefficients.
• Large changes in the coefficients coincide with only very small changes in the residual sum of squares.
• This problem is known as collinearity or multicollinearity.
• Collinearity is a relatively rare problem in social-science applications of linear models.
• Methods employed as remedies for collinearity may be worse than the disease.
• Usually, it is impossible to redesign a study to decrease the correlations between the $X$'s.
Detecting Collinearity
• Suppose a perfect linear relationship exists between the $X$'s:
$$c_1 X_{i1} + c_2 X_{i2} + \dots + c_k X_{ik} = c_0.$$
• Then the matrix $\mathbf{X}'\mathbf{X}$ is singular (why?). Therefore:
• the least-squares normal equations do not have a unique solution,
• the sampling variances of the regression coefficients are infinite.
• Perfect collinearity is often the product of some error in formulating the linear model,
• e.g., too many dummies.
Detecting Collinearity
• The sampling variance of the slope $B_j$ is
$$V(B_j) = \frac{1}{1 - R_j^2} \times \frac{\sigma_\epsilon^2}{(n-1) S_j^2},$$
where $R_j^2$ is the $R^2$ for the regression of $X_j$ on the other $X$'s, and $S_j^2$ is the sample variance of $X_j$.
• The first term is called the Variance Inflation Factor:
$$VIF_j = \frac{1}{1 - R_j^2}.$$
• $VIF_j$ indicates directly the impact of collinearity on the precision of $B_j$.
• $VIF$ is a basic diagnostic for collinearity (see the sketch below).
• A rule of thumb: $VIF$ greater than 10 (very strong) or 5 (strong) (why these values?).
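• A sketch that computes the VIFs directly from the definition $VIF_j = 1/(1 - R_j^2)$; the state.x77 predictors are an assumption, and car::vif() is a packaged alternative giving the same values for this kind of model:

```r
# VIFs from first principles: regress each predictor on all the others
statedata <- data.frame(state.x77)
X <- statedata[, names(statedata) != "Life.Exp"]   # predictors only

vifs <- sapply(names(X), function(j) {
  r2 <- summary(lm(reformulate(setdiff(names(X), j), response = j), data = X))$r.squared
  1 / (1 - r2)
})
round(vifs, 2)   # rule of thumb: values above 5 (or 10) signal (very) strong collinearity
```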
Detecting Collinearity (Optional)
• Ways of detecting collinearity besides looking at the VIFs:
• Examine the correlation matrix of the predictors.
• Regress $X_i$ on all other $X$'s and repeat this for all predictors; $R_i^2$ close to one indicates a problem.
• Examine the eigenvalues $\lambda_i$ of $\mathbf{X}'\mathbf{X}$: small eigenvalues indicate a problem.
• Large condition numbers $\kappa(\mathbf{X}'\mathbf{X}) = \sqrt{\lambda_1/\lambda_p}$ ($\kappa > 30$ is considered large).
• Also check the values of the condition indices $\sqrt{\lambda_1/\lambda_i}$.
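• A sketch of the eigenvalue-based diagnostics; centering and scaling the predictors first is a common convention and an assumption here, as is the state.x77 data:

```r
# Condition number and condition indices of X'X
statedata <- data.frame(state.x77)
X  <- scale(as.matrix(statedata[, names(statedata) != "Life.Exp"]))

ev <- eigen(crossprod(X))$values   # eigenvalues of X'X, in decreasing order
sqrt(ev[1] / ev[length(ev)])       # condition number kappa (> 30 considered large)
sqrt(ev[1] / ev)                   # condition indices
```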
Detecting Collinearity: Example
• What is your opinion?
Collinearity: No Quick Fix
• Collinearity leads to:
• imprecise estimates of $\beta$; even the signs of the coefficients may be misleading,
• t-tests that fail to reveal significant factors.
• Coping with collinearity:
• Model re-specification.
• Variable selection.
• Biased estimation: e.g., ridge regression (see the sketch below).
• Prior information about the regression coefficients: e.g., Bayesian approaches.
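• As a sketch of the biased-estimation route, ridge regression via MASS::lm.ridge(); the lambda grid and the state.x77 data are illustrative choices, not the course's prescribed remedy:

```r
# Ridge regression as one remedy for collinearity
library(MASS)
statedata <- data.frame(state.x77)
rr <- lm.ridge(Life.Exp ~ ., data = statedata, lambda = seq(0, 10, by = 0.1))

select(rr)   # GCV / HKB suggestions for the ridge constant lambda
plot(rr)     # coefficient paths: estimates shrink and stabilize as lambda grows
```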
Geometric interpretation of collinearity (Optional)
• Imagine a table: as two diagonally opposite legs are moved closer together, the table becomes increasingly unstable.
• Figure: no collinearity (a), complete collinearity (b), and strong collinearity (c).