Lecture 7
Model Selection
Semester 1 AY 2024/2025
Model Selection
• We touched on this topic briefly before.
• How should we pick an order for, say, an AR(p) model for GDP?
• There are sets of tools and methods that are widely used, but there is no universally agreed-upon methodology.
• Which method is superior depends on the set of models considered, the application at hand, the sample size, and the true model.
Model Selection Tradeoffs
• The fundamental tradeoff in model selection: estimation error vs. model misspecification.
• More variables = more parameters = more estimation error.
• Fewer variables = less estimation error, but a greater chance of missing important predictors (misspecification).
• Because of this, better in-sample fit does not in general translate into better out-of-sample forecast performance.
Selection Based on Fit?
• Why don’t we just pick the model with the smallest SSR or largest R²?
• Both mechanically improve whenever extra variables are added, so we could argue for the SER and adjusted R² instead.
• The latter penalize model complexity, but not enough to yield useful selection criteria (see the formulas below).
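For reference, the standard formulas for these two measures (a sketch using this lecture’s notation, where k is the number of estimated coefficients and SST is the total sum of squares):

  SER = √( SSR/(T − k) )
  adjusted R² = 1 − [ SSR/(T − k) ] / [ SST/(T − 1) ]

The complexity penalty enters only through the T − k divisor: adding a regressor raises adjusted R² whenever its t-statistic exceeds 1 in absolute value – a much weaker hurdle than the criteria developed below.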
Selection Based on Testing?
• We could test whether the coefficients on some variables are zero.
• If we can reject the null, keep them in the model; if we cannot, remove them.
• Can use either sequential t-tests or sequential F-tests (something we used to select polynomial order in EC3303).
• Popular with some applied researchers, but not designed to select the best forecasting model, and can perform badly.
Example: Real GDP Growth (F-test)
Data: US annualized GDP growth rate, 1947q1 to 2017q2
• Lags 3-4 jointly insignificant.
• Lags 2-4 jointly insignificant.
• Lags 1-4 jointly significant.
• Thus, the F-test picks AR(1).
Example: Real GDP Growth (t-test)
Thus, the t-test also picks AR(1).
Sequential Test Summary
• Intuitive; makes sense to applied researchers.
• F-tests are preferred to t-tests in the presence of high correlation among regressors. See Kozbur (2020, ECMA).
• The search across models is not comprehensive, and the outcome is often path-dependent (e.g., it may differ if you start with a smaller model and expand vs. “shrink” a larger model).
• Frequently ends up with models containing variables not present in the true model (overparameterization).
Bayesian Criterion
• Let M1 be model 1 and M2 be model 2. Denote the data by D.
• Bayes’ theorem:
  P(M1 | D) = P(D | M1) P(M1) / [ P(D | M1) P(M1) + P(D | M2) P(M2) ]
  – P(M1) and P(M2) are priors, i.e., beliefs held by the user.
  – P(D | M1) and P(D | M2) come from probabilistic models.
  – P(M1 | D) is the posterior probability (beliefs updated by the data).
Bayesian Criterion for AR(p)
• Assume an AR(p) with normal errors and uniform priors; then:
  P(M1 | D) ∝ exp(−BIC/2)
  BIC = T ln(SSR/T) + k ln(T)
  where k is the number of estimated coefficients and T is the sample size.
• This is the famous Schwarz or Bayesian information criterion (BIC or SIC for short).
• The model with the smallest BIC has the highest posterior probability. (A by-hand computation is sketched below.)
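As a sanity check, here is a minimal sketch of computing this version of BIC by hand in Stata, using the e() results that reg leaves behind (gdpgr and the sample restriction come from the GDP example later in this lecture):

qui reg gdpgr L.gdpgr if time >= tq(1948q2)
scalar s_ssr = e(rss)    /*sum of squared residuals*/
scalar s_T   = e(N)      /*sample size T*/
scalar s_k   = e(rank)   /*number of estimated coefficients, incl. constant*/
display "simplified BIC = " s_T*ln(s_ssr/s_T) + s_k*ln(s_T)

This differs from the BIC reported by estimates stats only by the constant T(ln(2π) + 1) discussed on the next slide, so the two rank models identically.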
Alternative Versions of BIC
• The more rigorous formula (reported by Stata) is:
  BIC = −2L + k ln(T)
  2L = −T(ln(2π) + 1) − T ln(SSR/T)
  where L is the Gaussian log-likelihood.
• The difference between this formula and the simplification on the previous slide is just a constant – it does not change across models.
• Another frequently used definition (reported by R):
  BIC = ln(SSR/T) + k ln(T)/T
• All definitions select the same model ordering, as the short derivation below shows.
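To see the equivalence, write BIC_s = T ln(SSR/T) + k ln(T) for the simplified version from the previous slide. Substituting 2L into Stata’s formula:

  BIC_Stata = −2L + k ln(T) = T(ln(2π) + 1) + T ln(SSR/T) + k ln(T) = BIC_s + T(ln(2π) + 1)
  BIC_R = ln(SSR/T) + k ln(T)/T = BIC_s / T

Adding a constant that is the same for every model, or dividing by T > 0 (the same T across models once the sample is held fixed), preserves the ordering, so all three versions select the same model.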
Tradeoff in BIC
• The larger AR(p) model will have a larger k and a smaller SSR:
  BIC = T ln(SSR/T) + k ln(T)
• The first term goes down, but the second goes up: a tradeoff between fit and model complexity.
• We typically compute BIC for all the considered models, say AR(1) to AR(4) for quarterly data, and pick the one with the lowest BIC.
Popular Mistake in Using BIC: Different Sample
• The larger AR(p) model will also reduce your sample size, since more observations are needed to construct lags.
• You need to make sure the models are estimated using the same sample; otherwise you are comparing apples with oranges.
• A convenient way to do this in Stata is to locate the time/date t_{p+1} of the (p + 1)-st observation (where p is the highest order considered) and run all regressions with an if time >= t_{p+1} condition, as sketched below.
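A minimal sketch of the wrong and right comparisons (variable names from the GDP example on the following slides, where p = 4 and t_{p+1} = 1948q2):

* Wrong: each model is estimated on a different sample
* (the AR(4) regression loses three more observations than the AR(1)):
reg gdpgr L.gdpgr
reg gdpgr L(1/4).gdpgr

* Right: restrict every model to the sample usable by the largest one:
reg gdpgr L.gdpgr if time >= tq(1948q2)
reg gdpgr L(1/4).gdpgr if time >= tq(1948q2)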
Example: Revisit GDP Growth
• Consider models AR(1) through AR(4).
• Growth data run from 1947:Q2 to 2017:Q2 – 281 observations.
  – AR(1) would use 1947:Q3 onwards;
  – AR(2) – 1947:Q4 onwards;
  – AR(3) – 1948:Q1 onwards;
  – AR(4) – 1948:Q2 onwards.
• Thus, for the purpose of comparing BIC, we use 1948:Q2 as the start for all models. Add if time >= tq(1948q2) when regressing.
Example: GDP Growth AR(1)
. reg gdpgr L.gdpgr if time>=tq(1948q2), r
Linear regression Number of obs = 277
F(1, 275) = 29.48
Prob > F = 0.0000
R-squared = 0.1293
Root MSE = 3.5293
------------------------------------------------------------------------------
| Robust
gdpgr | Coefficient std. err. t P>|t| [95% conf. interval]
-------------+----------------------------------------------------------------
gdpgr |
L1. | .3591947 .0661568 5.43 0.000 .2289567 .4894328
|
_cons | 1.998005 .3231902 6.18 0.000 1.361764 2.634247
------------------------------------------------------------------------------
. estimates store ar1
Note: the estimates store command is useful for displaying the BIC of all 4 models together later using estimates stats.
Example: GDP Growth AR(2)
. reg gdpgr L(1/2).gdpgr if time>=tq(1948q2), r
Linear regression Number of obs = 277
F(2, 274) = 17.30
Prob > F = 0.0000
R-squared = 0.1416
Root MSE = 3.5106
------------------------------------------------------------------------------
| Robust
gdpgr | Coefficient std. err. t P>|t| [95% conf. interval]
-------------+----------------------------------------------------------------
gdpgr |
L1. | .3162249 .0740875 4.27 0.000 .1703719 .4620779
L2. | .1189849 .0725869 1.64 0.102 -.0239141 .2618839
|
_cons | 1.757554 .3555542 4.94 0.000 1.057589 2.457519
------------------------------------------------------------------------------
. estimates store ar2
Example: GDP Growth AR(3)
. reg gdpgr L(1/3).gdpgr if time>=tq(1948q2), r
Linear regression Number of obs = 277
F(3, 273) = 12.21
Prob > F = 0.0000
R-squared = 0.1529
Root MSE = 3.4939
------------------------------------------------------------------------------
| Robust
gdpgr | Coefficient std. err. t P>|t| [95% conf. interval]
-------------+----------------------------------------------------------------
gdpgr |
L1. | .3294918 .0741839 4.44 0.000 .1834466 .4755371
L2. | .1548351 .079526 1.95 0.053 -.0017271 .3113972
L3. | -.1138022 .0689545 -1.65 0.100 -.2495525 .021948
|
_cons | 1.960562 .368287 5.32 0.000 1.235518 2.685605
------------------------------------------------------------------------------
. estimates store ar3
Example: GDP Growth AR(4)
. reg gdpgr L(1/4).gdpgr if time>=tq(1948q2), r
Linear regression Number of obs = 277
F(4, 272) = 9.75
Prob > F = 0.0000
R-squared = 0.1577
Root MSE = 3.4903
------------------------------------------------------------------------------
| Robust
gdpgr | Coefficient std. err. t P>|t| [95% conf. interval]
-------------+----------------------------------------------------------------
gdpgr |
L1. | .3207103 .0742694 4.32 0.000 .1744943 .4669262
L2. | .166058 .082436 2.01 0.045 .0037643 .3283517
L3. | -.0888071 .0692083 -1.28 0.201 -.2250591 .047445
L4. | -.0749689 .0738667 -1.01 0.311 -.220392 .0704542
|
_cons | 2.108767 .4110886 5.13 0.000 1.299447 2.918087
------------------------------------------------------------------------------
. estimates store ar4
Example: BIC for All Models
• First things first: check the sample sizes! All models should have the same (277 here). If not, we need to redo the whole exercise on a common sample.
• BIC also picks AR(1) in this example.
. estimates stats ar1 ar2 ar3 ar4 /*show AIC/BIC for all models at once*/
Akaike’s information criterion and Bayesian information criterion
-----------------------------------------------------------------------------
Model | N ll(null) ll(model) df AIC BIC
-------------+---------------------------------------------------------------
ar1 | 277 -760.5395 -741.3694 2 1486.739 1493.987
ar2 | 277 -760.5395 -739.389 3 1484.778 1495.65
ar3 | 277 -760.5395 -737.5635 4 1483.127 1497.623
ar4 | 277 -760.5395 -736.7721 5 1483.544 1501.664
-----------------------------------------------------------------------------
Note: BIC uses N = number of observations. See [R] BIC note.
Issue With BIC
• This is more or less the theory behind BIC.
• If one of the models is true and the others are false, BIC will pick the model most likely to be true (or the best approximation, if all are false) – the consistency property.
• However, BIC selection is not specifically designed to produce a good forecast! We are not interested in the true model; we want good forecasting performance.
Selection To Minimize MSFE
• Our goal is to minimize forecast risk, or MSFE:
  R(Ŷ) = E(Y − Ŷ)²
• If we had a good estimate of MSFE, we could simply pick the model that minimizes it.
• SSR is a bad estimate:
  1. Biased (in-sample overfitting).
  2. Decreases as more variables are added – selects the largest model.
The Bias of SSR
• It can be shown that, approximately:
  E(SSR) = E(MSFE) − 2σ²k and E(MSFE) = Tσ²
• Shibata (1980) suggested a bias correction:
  Sk = SSR (1 + 2k/T)
• This is known as the Shibata criterion.
From Shibata to Akaike
• If you take Shibata’s formula, divide by T, take the log, and multiply by T, you obtain:
  T ln(Sk/T) = T ln(SSR/T) + T ln(1 + 2k/T)
             ≈ T ln(SSR/T) + 2k
  using ln(1 + x) ≈ x for small x.
• The second line is the Akaike information criterion (AIC).
• It looks similar to BIC, but with ln(T) replaced by 2.
  – BIC puts a harsher penalty on model size: ln(T) > 2 for T > 7.
  – E.g., for T = 277 in our example, ln(T) = 5.62 – almost three times larger.
Motivation Behind AIC
• AIC is an approximately unbiased estimate of:
  1. the MSFE, and
  2. the Kullback-Leibler information criterion (KLIC).
• KLIC is a loss function for a density forecast. Suppose f(Y) is the forecast density and g(Y) is the true density; then:
  KLIC(f, g) = E ln( g(Y)/f(Y) )
• Minimizing AIC = minimizing estimated KLIC.
Example: AIC for All Models
AIC picked AR(3). BIC picked AR(1).
. estimates stats ar1 ar2 ar3 ar4 /*show AIC/BIC for all models at once*/
Akaike’s information criterion and Bayesian information criterion
-----------------------------------------------------------------------------
Model | N ll(null) ll(model) df AIC BIC
-------------+---------------------------------------------------------------
ar1 | 277 -760.5395 -741.3694 2 1486.739 1493.987
ar2 | 277 -760.5395 -739.389 3 1484.778 1495.65
ar3 | 277 -760.5395 -737.5635 4 1483.127 1497.623
ar4 | 277 -760.5395 -736.7721 5 1483.544 1501.664
-----------------------------------------------------------------------------
Note: BIC uses N = number of observations. See [R] BIC note.
Akaike’s Result
• Akaike showed that, in a normal AR(p), AIC is an approximately unbiased estimator of the KLIC.
• So unlike testing or BIC, AIC is designed to find models with low forecast risk.
• AIC will often select a larger model than BIC.
  – Mechanically, because the penalty is lower.
  – Conceptually, because instead of trying to find a true model (as BIC is designed to), AIC treats every model as an approximation and tries to find the one that makes the best forecast. It includes extra lags if they help forecast better.
AIC Asymptotic Properties
• Unlike BIC, AIC is not consistent.
• It is designed under a different premise: all considered models are treated as approximations.
• AIC is asymptotically efficient: if the true model is not contained in the considered set, and Model k has the lowest risk, then
  Risk(AIC selection)/Risk(k) → 1 as T → ∞.
• That is, AIC will asymptotically pick the best forecasting model when the true model is not in the set considered by the forecaster.
Selection Based on Prediction Errors
• Why not compute true out-of-sample forecasts and the associated forecast errors, and pick the model with the smallest value of the loss function applied directly to those errors?
• This approach is called Predictive Least Squares (PLS).
• Diebold calls it recursive cross-validation in Ch. 10.
• Originated with Rissanen (1986), a Finnish information theorist.
The PLS Procedure
• You have T observations. Select a “holdout sample” of M observations. Then make recursive one-step-ahead pseudo out-of-sample forecasts for P = T − M periods.
• E.g., for the AR(1) model you compute:
  Ŷt = α̂t−1 + β̂t−1 Yt−1
• Coefficients are estimated using data from [1, . . . , t − 1].
• t goes from M + 1 to T: a total of P recursive estimates.
The PLS Procedure (ctd.)
• The out-of-sample forecast errors are:
  ẽt = Yt − Ŷt
• Unlike an in-sample residual, this is a true forecast error.
• The PLS criterion is the square root of the estimated out-of-sample MSFE:
  PLS = √( (1/P) Σ_{t=M+1}^{T} ẽt² )
• Select the model with the smallest PLS. A very popular approach in applied forecasting.
PLS Attractive Features
The advantages are:
• Doesn’t depend on approximations or any distribution theory.
• Can be computed for any forecast method (even for published forecast surveys) without knowing how the forecast was obtained.
• Possibly robust to moderate structural breaks.
• A common measure of “empirical performance” in applied
forecasting.
• Provides σ̂ for forecast intervals.
PLS Disadvantages
The disadvantages are:
• Tends to overestimate the true MSFE.
• Tends to be over-parsimonious.
• VERY sensitive to the choice of P; there is no generally accepted theory for choosing P.
PLS in Stata
• Could be a bit tricky. Either use a manual loop or the rolling command.
• Thankfully, we can just recycle the loop we built for 1-step-ahead recursive window forecasting in Lecture 6!
• This time we estimate all 4 models and produce 1-step forecasts at each iteration (here P = 100).
• Then we construct the 1-step forecast errors, square them, and average to get the PLS criterion value (see the sketch after the loop).
PLS in Stata
* Create empty storage variables for the 1-step forecasts
* (needed by the replace commands inside the loop):
gen yhat1 = .
gen yhat2 = .
gen yhat3 = .
gen yhat4 = .
forvalues p=182/281 {
    *AR(1)
    qui reg gdpgr L.gdpgr if t>=6 & t<=`p' /*note quiet execution of regression*/
    predict fit1, xb
    *AR(2)
    qui reg gdpgr L(1/2).gdpgr if t>=6 & t<=`p'
    predict fit2, xb
    *AR(3)
    qui reg gdpgr L(1/3).gdpgr if t>=6 & t<=`p'
    predict fit3, xb
    *AR(4)
    qui reg gdpgr L(1/4).gdpgr if t>=6 & t<=`p'
    predict fit4, xb
    *Store 1-step forecasts for all 4 models:
    replace yhat1=fit1 if t==(`p'+1)
    replace yhat2=fit2 if t==(`p'+1)
    replace yhat3=fit3 if t==(`p'+1)
    replace yhat4=fit4 if t==(`p'+1)
    *Clean up for the next iteration (easy to drop multiple variables with similar names):
    drop fit*
}
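A minimal sketch of the final step, computing the PLS criterion for each model from the yhat1–yhat4 variables filled by the loop above:

forvalues m=1/4 {
    gen esq`m' = (gdpgr - yhat`m')^2   /*squared 1-step forecast errors*/
    qui sum esq`m'
    display "AR(`m') PLS = " sqrt(r(mean))   /*root of the average squared error*/
}

sum ignores the missing values outside the forecast window, so r(mean) averages exactly the P = 100 squared forecast errors.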
PLS in Stata: Results for GDP Growth
• PLS picks AR(2).
• Again, if I change P, this ranking may change.
Model PLS Criterion
AR(1) 2.2424549
AR(2) 2.1929934
AR(3) 2.2205646
AR(4) 2.2404814
Model Selection Summary
• Selection based on measures of fit or on testing is inappropriate.
• Feasible criteria: AIC, BIC, PLS.
• Hold the sample constant for valid comparisons.
• All methods except PLS assume conditional homoskedasticity.
• PLS is sensitive to the choice of P.
Before you leave...
• Diebold, Francis X., “Elements of Forecasting,” 4th edition, Chapter 12.
  – In particular, PLS on p. 212.
Appendix
Useful Stata Commands
• Store estimates after estimation:
  estimates store name
• Display AIC/BIC of stored estimates (can show several at once):
  estimates stats names