Model Selection for Statisticians

Model selection aims to identify the best subset of predictor variables for a model. Stepwise procedures use hypothesis testing to sequentially add or remove variables, while criterion-based procedures compare all possible models using information criteria such as AIC. Both have drawbacks: stepwise selection may miss the optimal model, and criterion-based searches can favor overly complex models for prediction purposes. Mallows' Cp criterion estimates the prediction error of candidate models; good models have Cp close to the number of parameters p.


Linear and Generalized Linear Models (4433LGLM6Y)

Model selection
Meeting 8

Vahe Avagyan
Biometris, Wageningen University and Research
Model selection (Fox: chapter 22, Faraway PRA: chapter 10)
Model selection and criteria

Model validation

Collinearity
Model Selection: Caution!

• Suppose we have many predictor variables, not all necessarily related to the response variable.

• We want to select the "best" subset of predictors.

• The following problems may appear (according to Fox):

• Problem of simultaneous inference.

• What does “failing to reject a null hypothesis” mean?

• Impact of large samples on hypothesis tests: trivially small effects become statistically significant if the dataset is large.

• Exaggerated precision.
General strategies

• Addressing these concerns (according to Fox):

• Use alternative model-selection criteria instead of statistical significance.

• Compensate for simultaneous inference, e.g., by Bonferroni adjustments.

• Validate a statistical model selected by another approach.

• Model averaging

• Avoid model selection. Specify a maximally complex and flexible model without trying to simplify it. Issues here?
Model Selection

• Reasons for selecting "best" subset of regressors:

• We can explain the data in the simplest way, removing redundant predictors.


• Principle of Occam’s Razor (parsimony): among several plausible explanations for a phenomenon, the simplest is best.

• Unnecessary predictors will add noise to the estimation of other quantities that are of interest,
• degrees of freedom are wasted.

• Collinearity is caused by having too many variables doing the same job.


• Remove excess predictors.
Types of variable selection

• Two main types of variable selection:
• Stepwise approach, comparing successive models (variants: backward, forward, and stepwise/mixed).
• Stepwise approaches may use hypothesis testing to select the next step, but other criteria may be used too.
• Criterion approach, finding a model that optimizes some measure of goodness of fit.

• Marginality principle: keep lower-order terms in the model if a higher-order term is important.

• Model selection is conceptually simplest if the goal is prediction.


• Example: develop a regression model that will predict new data as accurately as possible.
Stepwise procedures

• Backward elimination vs Forward elimination

• Both procedures can be implemented using the step() command in R.

• Stepwise selection (mixed)

• Stepwise procedures may also be used in combination with other criteria, e.g., AIC (see stepAIC()).
Stepwise procedures: Backward Elimination

• Backward Elimination

1. Start with all predictors in the model (full model).

2. Remove the predictor with the highest p-value, provided it exceeds the threshold p-value-to-stay $\alpha_{crit}$ (e.g., 0.05).

3. Refit the model and repeat step 2.

4. Stop when all p-values of terms remaining in the model are smaller than $\alpha_{crit}$.

• What is the main drawback of this approach? Other criteria may be used here as well, e.g., AIC. (A minimal R sketch of the procedure follows.)
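• As a concrete illustration, here is a minimal R sketch of p-value-based backward elimination. The function name backward, the data frame dat, and the response name y are hypothetical, and the names() lookup assumes numeric (single-coefficient) predictors:

## Minimal sketch of p-value-based backward elimination
## (`dat` and response `y` are hypothetical names).
backward <- function(dat, alpha_crit = 0.05) {
  fit <- lm(y ~ ., data = dat)
  repeat {
    pvals <- summary(fit)$coefficients[-1, 4]   # p-values, intercept dropped
    if (length(pvals) == 0 || max(pvals) <= alpha_crit) break
    worst <- names(which.max(pvals))            # least significant term
    fit <- update(fit, as.formula(paste(". ~ . -", worst)))
  }
  fit
}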
Stepwise procedures: Backward Elimination with 5 predictors

• Backward Elimination

1. Start with all 5 predictors in the model (full model).

2. Remove the least significant predictor (e.g., $X_4$).

3. Refit the model and repeat step 2.

4. Keep removing the least significant predictor until all remaining predictors are significant (or you run out of predictors).
Example: Life expectancy dataset

• Examine the relationship between life expectancy and other socio-economic variables for the U.S.
states.

• There is an easier way to do this in R, but let’s try manually first.

• See the step() or stepAIC() commands; a manual sketch follows.
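• A sketch of the manual route, assuming the data frame is built from R's standard state.x77 dataset (this construction follows Faraway's statedata; variable names such as Life.Exp and Area come from that dataset):

## Manual backward elimination on the life-expectancy data.
statedata <- data.frame(state.x77, row.names = state.abb)
g <- lm(Life.Exp ~ ., data = statedata)
summary(g)                      # "Area" has the largest p-value
g <- update(g, . ~ . - Area)    # refit; next drop Illiteracy, Income, ...
summary(g)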


Example: Life expectancy dataset

“area” shows the highest p-value above the threshold (e.g., 0.05),
i.e., the least significant predictor.
Example: Life expectancy dataset

• Next, “illiteracy” shows the highest p-value above the threshold.

• Then “income” shows the highest p-value above the threshold.
Example: Life expectancy dataset

• Next, “population” shows the highest p-value above the threshold.

• The procedure stops here.


Stepwise procedures: Forward Selection

• Forward Selection

1. Start with no predictors in the model (null model).

2. Enter the predictor with the smallest p-value, provided it is below the threshold p-value-to-enter $\alpha_{crit}$.

3. Refit the model and repeat step 2.

4. Stop when all p-values of terms not in the model exceed $\alpha_{crit}$.
Stepwise procedures: Forward Selection with 5 predictors

• Forward Selection

1. Start with no predictors in the model (null model).

2. Enter the most significant predictor (e.g., $X_2$).

3. Refit the model and repeat step 2.

4. Keep adding the most significant predictor until those not in the model are not significant. (An AIC-based sketch with step() follows.)
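• The slides describe p-value-based entry; step() uses AIC instead (the deck mentions both options). A sketch, again assuming the statedata frame built from state.x77:

## Forward selection by AIC, starting from the null model.
statedata <- data.frame(state.x77, row.names = state.abb)
null <- lm(Life.Exp ~ 1, data = statedata)
step(null, scope = ~ Population + Income + Illiteracy + Murder +
       HS.Grad + Frost + Area, direction = "forward")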
Stepwise procedures: Stepwise (or mixed) selection

• This is a combination of backward elimination and forward selection. After entering a variable, all variables in the model are candidates for removal.
• Thresholds p-value-to-enter and p-value-to-stay need to be specified.

• Drawback related to earlier mentioned caveats:


• “Optimal” model may be missed due to adding / dropping of single variables.
• Stepwise selection tends to pick models smaller than desirable for prediction purposes.
• If using p-values: don’t treat p-values literally! (recall multiple testing problem)
• Procedures are not linked to final objective of prediction or explanation.
Criterion-Based Procedures

• Criterion-based procedures typically compare all possible models (i.e., all possible “subsets regression”)

• A model with $k$ regressors has $2^k$ possible sub-models! (why?)

• Different criteria may be used (computed in the sketch after this list), e.g.:


• $R^2_{adj}$ (adjusted $R^2$)
• Mallows' $C_p$
• Predicted residual error sum of squares (PRESS) and cross-validation
• Akaike information criterion (AIC)
• Bayesian (sometimes Schwarz's Bayesian) information criterion (BIC)
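• A sketch of how several of these criteria can be computed over all subsets in R, using the leaps package and the statedata frame from the earlier example:

## All-subsets search with leaps; rs holds the criteria per model size.
library(leaps)
statedata <- data.frame(state.x77, row.names = state.abb)
rs <- summary(regsubsets(Life.Exp ~ ., data = statedata))
rs$adjr2    # adjusted R^2 of the best model of each size
rs$cp       # Mallows' Cp
rs$bic      # BIC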
Criterion-Based Procedures: $R^2_{adj}$

• Recall

$$R^2 = \frac{RegSS}{TSS} = 1 - \frac{RSS}{TSS}.$$

• Why can’t we use $R^2$ as a model selection criterion?

• Instead, we use the adjusted $R^2$:

$$R^2_{adj} = 1 - \frac{RSS/(n-p)}{TSS/(n-1)} = 1 - \frac{\hat\sigma^2_{Model}}{\hat\sigma^2_{Null}},$$

where $\hat\sigma^2_{Null}$ is the estimate of the error variance based on the “empty” model (intercept only).

• $R^2_{adj}$ increases under a change of model only if the new model's error-variance estimate $\hat\sigma^2_{Model}$ decreases, and that happens only if the change in RSS is compensated by the change in residual df.
Criterion-Based Procedure

• A good model should predict well, so the total prediction MSE (at the population level) should be small.

• The scaled (normalized) MSE is:

$$\frac{1}{\sigma^2_\epsilon}\sum_{i=1}^{n} E\left[\hat Y_i - E(Y_i)\right]^2 = \frac{1}{\sigma^2_\epsilon}\sum_{i=1}^{n}\left\{ V(\hat Y_i) + \left[E(\hat Y_i) - E(Y_i)\right]^2 \right\}.$$

• There are two components: the variance $V(\hat Y_i)$ and the squared bias $\left[E(\hat Y_i) - E(Y_i)\right]^2$.

• Bias-variance trade-off: removing a variable pays off when the decrease in variance offsets the increase in squared bias.
Criterion-Based Procedure: Mallows' $C_p$

• The prediction MSE is estimated by Mallows' $C_p$:

$$C_p = \frac{RSS_p}{\hat\sigma^2_\epsilon} + 2p - n,$$

where $\hat\sigma^2_\epsilon$ is from the full model, and $RSS_p$ is from the current model (with $p$ parameters).

• A good model should have $C_p$ close to or below $p$. A model with a bad fit has $C_p$ much bigger than $p$.

• For the full model, we have $C_p = p$ (why?).


Mallows' $C_p$: Example

• In practice, we plot $C_p$ against $p$ and look for models with small $p$ and with $C_p$ around or less than $p$ (see the sketch below).

• We have $k = 7$ predictors.

• How many models are there in total?

• Good options are the models “456” and “1456”. The smaller model is more parsimonious, but the larger model fits slightly better.
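• The plot just described can be drawn from the leaps summary rs in the earlier sketch; p runs from 2 to 8 because the intercept counts as a parameter:

## Cp versus p; good models lie near or below the line Cp = p.
p <- 2:8
plot(p, rs$cp, xlab = "p", ylab = expression(C[p]))
abline(0, 1)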
Criterion-Based Procedure: $R^2_{adj}$

• Now, let's check the model selection with $R^2_{adj}$.

• The model with the largest $R^2_{adj}$ is “1456”.

• What about the best 3-predictor model?

• Variable selection methods are sensitive to outliers, influential points, and transformations.
Criterion-Based Procedure: Cross-Validation

• PRESS = Predicted REsidual Sum of Squares (leave-one-out cross-validated residuals):

$$PRESS = \sum_{i=1}^{n}\left(\hat Y_{-i} - Y_i\right)^2.$$

• $\hat Y_{-i}$ is the prediction for the $i$-th observation, using a model fitted without the $i$-th observation.

• The cross-validation criterion estimates the mean squared error of prediction as:

$$CV \equiv \frac{\sum_{i=1}^{n}\left(\hat Y_{-i} - Y_i\right)^2}{n} = \frac{PRESS}{n}.$$

• We prefer the model with the smallest value of CV or PRESS. What is the drawback of LOOCV? (A no-refit computation is sketched below.)
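• For linear models, PRESS needs no refitting, thanks to the standard leave-one-out identity $\hat e_{-i} = \hat e_i/(1 - h_i)$, with $h_i$ the hat value. A minimal sketch (the three-predictor model is illustrative):

## PRESS via the leave-one-out identity; CV = PRESS / n.
press <- function(fit) sum((resid(fit) / (1 - hatvalues(fit)))^2)
statedata <- data.frame(state.x77, row.names = state.abb)
g <- lm(Life.Exp ~ Murder + HS.Grad + Frost, data = statedata)
press(g)
press(g) / nobs(g)   # the CV criterion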
Criterion-Based Procedure: PRESS and Cross-Validation

• An alternative is $k$-fold cross-validation (sketched below):

1. Divide the data into a small number of subsets or folds (e.g., 5 or 10) of roughly equal size.

2. Fit the model omitting each subset in turn (i.e., use only the training data).

3. Obtain the fitted values for all observations in the omitted subset (i.e., the test data).
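• A hand-rolled 5-fold CV sketch for one candidate model (the formula is an illustrative choice; statedata as before):

## 5-fold CV estimate of the mean squared prediction error.
statedata <- data.frame(state.x77, row.names = state.abb)
set.seed(1)
k <- 5
folds <- sample(rep(1:k, length.out = nrow(statedata)))
sse <- 0
for (j in 1:k) {
  m <- lm(Life.Exp ~ Murder + HS.Grad + Frost,
          data = statedata[folds != j, ])   # fit on the training folds
  test <- statedata[folds == j, ]           # the omitted fold
  sse <- sse + sum((predict(m, test) - test$Life.Exp)^2)
}
sse / nrow(statedata)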
Criterion-Based Procedure: AIC and BIC

• Penalized model-fit statistics:

$$-2\log L(\hat\theta) + \mathit{penalty}$$

• $\theta$ is the vector of parameters of the model, including the regression coefficients and the error variance. Here, $\hat\theta$ is the m.l.e.

• $L(\hat\theta)$ is the maximized likelihood under the current model.

• $\mathit{penalty} \equiv c \times p$ ($c$ is a scaling parameter).

• The magnitude of a criterion is not interpretable, but differences are.

• The model with the smallest value of the information criterion is preferred.


Criterion-Based Procedure: AIC and BIC

• The most popular criteria are

$$AIC = -2\log L(\hat\theta) + 2p, \qquad \text{thus } c = 2,$$

$$BIC = -2\log L(\hat\theta) + p\log(n), \qquad \text{thus } c = \log(n).$$

• When the sample size $n$ is small, there is a high chance that AIC will select models that have too many parameters (i.e., AIC will overfit). In this case, the corrected AIC can be used:

$$AIC_c = AIC + \frac{2p(p+1)}{n-p-1}.$$
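• In R, AIC() and BIC() return these criteria directly; a quick illustrative comparison of two candidate models:

## Smaller AIC / BIC is preferred; only differences are meaningful.
statedata <- data.frame(state.x77, row.names = state.abb)
m1 <- lm(Life.Exp ~ Murder + HS.Grad + Frost, data = statedata)
m2 <- lm(Life.Exp ~ ., data = statedata)
AIC(m1); AIC(m2)   # penalty c = 2
BIC(m1); BIC(m2)   # penalty c = log(n)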
Example: Life expectancy dataset

• For each term, the output shows what the AIC value would be if that predictor were removed.

• The variables are removed step by step, and AIC is checked at each step (see the sketch below).
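• This is what step() prints: at each step, the AIC the model would have after each candidate deletion, dropping the term that lowers AIC the most. A sketch:

## AIC-based backward elimination; the default direction is
## "backward" when no scope is given.
statedata <- data.frame(state.x77, row.names = state.abb)
step(lm(Life.Exp ~ ., data = statedata))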
Summary model selection

• Stepwise procedures:

• Search through space of potential models.

• Testing-based procedures use dubious hypothesis testing.

• Criterion-based procedures:

• search through a wider space of models (“all-possible-subsets regression”)

• compare the models using a particular criterion.

• Criterion-based procedures are usually preferred.


Summary model selection

• The aim of variable selection is to construct a model that predicts well or explains the relationships in the data well.
• It is part of the process of model building, like the identification of outliers and influential points, and variable transformation.
• Automatic selection procedures are not guaranteed to be consistent; use these methods as a guide only.

• Accept the possibility that several suggested models fit equally well. Then consider:
• Do the models have similar qualitative consequences?
• Do they make similar predictions?
• What is the cost of measuring the predictors?
• Which has the best diagnostics?
Model validation

• Model validation: split the dataset into a training subsample and a validation subsample:
• Training subsample is used to specify the statistical model.
• Validation subsample is used to evaluate the fitted model.

• Cross-validation is an application of this idea where the roles of training and validation subsamples are
interchanged.

• Statistical modeling: iterative sequence of data exploration, model fitting, model criticism, model re-
specification.

• Variables may be dropped, interactions may be incorporated or deleted, variables may be transformed, and unusual data may be corrected, removed, or otherwise accommodated.
Example: Model validation

• Evaluation is usually done using Pearson's correlation or root mean squared error (RMSE) metrics (see the sketch below).

• Usually, several random partitions are used.
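• A minimal sketch of one random training/validation split with both metrics (the 70/30 split and the model formula are illustrative choices):

## Single random split; evaluate on the held-out part.
statedata <- data.frame(state.x77, row.names = state.abb)
set.seed(2)
idx  <- sample(nrow(statedata), floor(0.7 * nrow(statedata)))
m    <- lm(Life.Exp ~ Murder + HS.Grad + Frost, data = statedata[idx, ])
pred <- predict(m, statedata[-idx, ])
sqrt(mean((pred - statedata$Life.Exp[-idx])^2))   # RMSE
cor(pred, statedata$Life.Exp[-idx])               # Pearson correlation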


Model validation

• The resulting model should accurately reflect the principal characteristics of your data.

• Danger: overfitting and overstating the strength of results.

• Ideal solution: collect new data with which to validate the model (often not possible).

• Model validation simulates the collection of new data by randomly dividing the data into two parts:

• the first for exploration and model formulation,

• the second for checking the adequacy of the model, formal estimation, and testing.


Collinearity

• If a perfect linear relationship among the regressors exists, the least-squares coefficients are no longer uniquely defined.

• A strong, but less-than-perfect, linear relationship among the $X$'s causes the least-squares coefficients to be unstable:

• large standard errors of coefficients,

• broad confidence intervals,

• hypothesis tests with low power.


Collinearity and Remedies

• Small changes in data can greatly change the coefficients.

• Large changes in coefficients coincide with only very small changes in residual sum of squares.

• This problem is known as collinearity or multicollinearity.

• Collinearity is a relatively rare problem in social-science applications of linear models.

• Methods employed as remedies for collinearity may be worse than the disease.

• Usually, it is impossible to redesign the study to decrease the correlations between the $X$'s.


Detecting Collinearity

• Suppose a perfect linear relationship exists among the $X$'s:

$$c_1 X_{i1} + c_2 X_{i2} + \dots + c_k X_{ik} = c_0.$$

• Then, the matrix $\mathbf{X}'\mathbf{X}$ is singular (why?).

• Therefore:

• the least-squares normal equations do not have a unique solution,

• sampling variances of regression coefficients are infinite.

• Perfect collinearity is often a product of some error in formulating the linear model,

• e.g., too many dummies.


Detecting Collinearity

• The sampling variance of the slope $B_j$ is

$$V(B_j) = \frac{1}{1-R_j^2} \times \frac{\sigma_\epsilon^2}{(n-1)S_j^2},$$

where $R_j^2$ is the $R^2$ for the regression of $X_j$ on the other $X$'s, and $S_j^2$ is the sample variance of $X_j$.

• The first term is called the Variance Inflation Factor:

$$VIF_j = \frac{1}{1-R_j^2}.$$

• $VIF_j$ indicates directly the impact of collinearity on the precision of $B_j$.

• $VIF$ is a basic diagnostic for collinearity.

• A rule of thumb: $VIF$ greater than 10 indicates very strong collinearity, and greater than 5 strong collinearity (why these values?). (A computational sketch follows.)
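• VIFs can be computed with vif() from Fox's car package, or by hand from $R_j^2$; a sketch on the life-expectancy data:

## VIFs for all predictors, plus a by-hand check for one of them.
library(car)
statedata <- data.frame(state.x77, row.names = state.abb)
vif(lm(Life.Exp ~ ., data = statedata))
r2j <- summary(lm(Income ~ . - Life.Exp, data = statedata))$r.squared
1 / (1 - r2j)        # should match vif()["Income"]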
Detecting Collinearity (Optional)

• Ways of detecting collinearity besides looking at the VIFs (see the sketch below):

• Examine the correlation matrix of the predictors.

• Regress $X_i$ on all other $X$'s and repeat this for all predictors: $R_i^2$ close to one indicates a problem.

• Examine the eigenvalues $\lambda_i$ of $\mathbf{X}'\mathbf{X}$: small eigenvalues indicate a problem.

• Large condition numbers $\kappa(\mathbf{X}'\mathbf{X}) = \lambda_1/\lambda_p$ ($\kappa > 30$ is considered large).

• Also check the values of the condition indices $\lambda_1/\lambda_i$.
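• An eigenvalue-based check, as a sketch; conventions for the condition number vary, so this simply follows the slide's $\lambda_1/\lambda_i$ on the standardized predictors:

## Eigenvalues of X'X for the standardized predictors.
statedata <- data.frame(state.x77, row.names = state.abb)
X <- scale(model.matrix(Life.Exp ~ ., data = statedata)[, -1])
lam <- eigen(crossprod(X))$values
lam            # small eigenvalues signal collinearity
lam[1] / lam   # condition indices; the last entry is the largest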


Detecting Collinearity: Example

• What is your opinion?


Collinearity: No Quick Fix

• Collinearity leads to
• imprecise estimates of 𝛽; even the signs of coefficients may be misleading.
• t-tests fail to reveal significant factors.

• Coping With Collinearity

• Model re-specification.

• Variable Selection.

• Biased Estimation: e.g., Ridge Regression.

• Prior Info About Regression Coefficients: e.g., Bayesian approaches.


Geometric interpretation of collinearity (Optional)

• Imagine a table: as two diagonally opposite legs are moved closer together, the table becomes increasingly unstable.

• Panels: no collinearity (a), complete collinearity (b), and strong collinearity (c).