Bayesian variable selection
Gibbs VS versus Spike and slab VS
Gibbs VS (GVS)
Developed from Kuo and Mallick (KM) after the paper:
L. Kuo and B. Mallick (1998), Variable Selection for Regression Models, Sankhya B, 60, 65-81.
See also:
P. Dellaportas, J. Forster and I. Ntzoufras (2002), On Bayesian Model and Variable Selection using MCMC, Statistics and Computing, 12, 27-36.
R. B. O'Hara and M. J. Sillanpää (2009), A Review of Bayesian Variable Selection Methods: What, How, and Which, Bayesian Analysis, 4, 85-118.
Kuo and Mallick Selection
Basic model:
$y_i = \sum_{j=1}^{p} \beta_j x_{ij} + e_i$
with priors
$\beta_j \sim N(0, \tau_\beta^{-1}), \quad e_i \sim N(0, \tau_y^{-1})$
A special case of GVS: introduce an indicator $I_j$ and formulate the model as
$y_i = \sum_{j=1}^{p} \beta_j I_j x_{ij} + e_i$
Assume that
$\Pr(I_j, \beta_j) = \Pr(I_j)\Pr(\beta_j)$
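A minimal WinBUGS sketch of this basic Kuo and Mallick setup for a Gaussian outcome, assuming data y[], a predictor matrix x[,], and constants N and p; the fixed prior precisions are illustrative choices, not values from the paper:

model{
  for(i in 1:N){
    for(j in 1:p){ z[i,j] <- b[j]*psi[j]*x[i,j] }   # beta_j * I_j * x_ij
    mu[i] <- b0 + sum(z[i,1:p])
    y[i] ~ dnorm(mu[i], tauy)                       # e_i ~ N(0, tauy^-1)
  }
  for(j in 1:p){
    b[j]   ~ dnorm(0, taub)   # prior on beta_j, independent of the indicator
    psi[j] ~ dbern(0.5)       # inclusion indicator I_j with fixed p_j = 0.5
  }
  taub <- 1                   # fixed prior precision tau_beta (a tuning choice)
  b0   ~ dnorm(0, 0.001)
  tauy ~ dgamma(0.001, 0.001)
}

The posterior mean of psi[j] estimates the inclusion probability of predictor j.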
Notes
Independent priors on the indicator and the regression parameter.
When $I_j = 0$ the term $\beta_j I_j x_{ij}$ vanishes, so the full conditional distribution of $\beta_j$ is just its prior, and that is what gets sampled.
If the prior is too vague then mixing will be poor.
Usually the prior for the indicator is $I_j \sim \text{Bern}(p_j)$.
Should we have a hyperprior for $p_j$? (A sketch follows below.)
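One natural choice, and the one made in the worked example at the end of these notes, is a Beta hyperprior. A sketch of the relevant lines inside the model block (the name pind is illustrative):

  for(j in 1:p){
    psi[j]  ~ dbern(pind[j])    # I_j ~ Bern(p_j)
    pind[j] ~ dbeta(0.5, 0.5)   # Beta hyperprior on the inclusion probability p_j
  }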
Gibbs Variable Selection (GVS)
Avoids the problem of poor mixing with vague priors by assigning a pseudo-prior.
Here it is assumed that
$\Pr(I_j, \beta_j) = \Pr(\beta_j \mid I_j)\Pr(I_j)$
and
$\Pr(\beta_j \mid I_j) = (1 - I_j)\,N(\mu, S) + I_j\,N(0, \tau_\beta^{-1})$
where $\mu$ and $S$ are tuning constants and $\tau_\beta^{-1}$ is a fixed variance.
The tuning constants have to be chosen to give good mixing.
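A WinBUGS sketch of GVS, assuming the pseudo-prior constants mupseudo[j] and Spseudo[j] (a mean and a variance for each coefficient, e.g. estimated from a pilot run of the full model) are passed in as data; all names are illustrative:

model{
  for(i in 1:N){
    for(j in 1:p){ z[i,j] <- b[j]*psi[j]*x[i,j] }
    mu[i] <- b0 + sum(z[i,1:p])
    y[i] ~ dnorm(mu[i], tauy)
  }
  for(j in 1:p){
    psi[j]   ~ dbern(0.5)
    # pseudo-prior N(mupseudo, Spseudo) when psi[j] = 0; real prior N(0, taub^-1) when psi[j] = 1
    bmean[j] <- (1 - psi[j])*mupseudo[j]
    bprec[j] <- (1 - psi[j])/Spseudo[j] + psi[j]*taub
    b[j]     ~ dnorm(bmean[j], bprec[j])
  }
  taub <- 1                   # fixed prior precision tau_beta
  b0   ~ dnorm(0, 0.001)
  tauy ~ dgamma(0.001, 0.001)
}

Because the pseudo-prior is used only when $I_j = 0$, when $\beta_j$ has dropped out of the likelihood, it influences mixing but not the posterior itself, which is why it can be tuned freely.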
Stochastic Search Variable Selection (SSVS)
Assume a spike-and-slab mixture prior:
$\Pr(\beta_j \mid I_j) = (1 - I_j)\,N(0, \tau_\beta^{-1}) + I_j\,N(0, g\tau_\beta^{-1})$
where the spike variance $\tau_\beta^{-1}$ is small.
Both $g$ and the variance have to be tuned, and these choices affect the posterior estimates.
A random effects version could be assumed in which $\tau_\beta^{-1}$ is also estimated (with $g$ fixed).
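A sketch of the SSVS prior in WinBUGS; note that, unlike Kuo and Mallick, the indicator enters only through the prior precision, never through the likelihood (the values of taub and g are illustrative tuning constants):

model{
  for(i in 1:N){
    for(j in 1:p){ z[i,j] <- b[j]*x[i,j] }   # no psi[j] in the likelihood
    mu[i] <- b0 + sum(z[i,1:p])
    y[i] ~ dnorm(mu[i], tauy)
  }
  for(j in 1:p){
    psi[j]   ~ dbern(0.5)
    # variance is taub^-1 (spike) or g*taub^-1 (slab): precision taub/(1 + psi*(g - 1))
    bprec[j] <- taub/(1 + psi[j]*(g - 1))
    b[j]     ~ dnorm(0, bprec[j])
  }
  taub <- 1000    # large precision = small spike variance (tuning constant)
  g    <- 10000   # variance inflation factor for the slab (tuning constant)
  b0   ~ dnorm(0, 0.001)
  tauy ~ dgamma(0.001, 0.001)
}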
Adaptive Shrinkage
Specify a prior directly on $\beta_j$, but with prior control:
$\beta_j \mid \tau_j^2 \sim N(0, \tau_j^2)$
together with a prior $P(\tau_j^2)$.
One could use the Jeffreys prior $P(\tau_j^2) \propto 1/\tau_j^2$,
which leads to an improper posterior: unless?.............
There is no tuning parameter, however.
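WinBUGS will not accept an improper prior, so a sketch of the adaptive shrinkage model has to stand in a proper but very diffuse Gamma prior on the precision $1/\tau_j^2$ (Gamma(eps, eps) approaches the Jeffreys prior as eps tends to 0; the value 0.001 is an illustrative choice):

model{
  for(i in 1:N){
    for(j in 1:p){ z[i,j] <- b[j]*x[i,j] }
    mu[i] <- b0 + sum(z[i,1:p])
    y[i] ~ dnorm(mu[i], tauy)
  }
  for(j in 1:p){
    b[j]    ~ dnorm(0, tauj[j])      # beta_j | tau_j^2 ~ N(0, tau_j^2); tauj[j] = 1/tau_j^2
    tauj[j] ~ dgamma(0.001, 0.001)   # proper stand-in for the improper Jeffreys prior
  }
  b0   ~ dnorm(0, 0.001)
  tauy ~ dgamma(0.001, 0.001)
}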
Laplacian shrinkage and others
One could also use an exponential prior for $\tau_j^2$ with mean $\mu$.
If you integrate over the variance components you get a double exponential (Laplace) prior $P(\beta_j \mid \mu)$.
The random effects variant of the method, in which $\mu$ has a prior, is the Bayesian Lasso.
Reversible jump MCMC (RJMCMC) can also be used, of course.
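A sketch of Laplacian shrinkage in WinBUGS, writing the exponential mean as $\mu = 1/\lambda$ and giving $\lambda$ a hyperprior for the Bayesian Lasso variant (the Gamma(1, 1) choice is illustrative):

model{
  for(i in 1:N){
    for(j in 1:p){ z[i,j] <- b[j]*x[i,j] }
    eta[i] <- b0 + sum(z[i,1:p])
    y[i] ~ dnorm(eta[i], tauy)
  }
  for(j in 1:p){
    bprec[j] <- 1/tau2[j]
    b[j]     ~ dnorm(0, bprec[j])
    tau2[j]  ~ dexp(lambda)   # exponential prior on tau_j^2 with mean mu = 1/lambda
  }
  lambda ~ dgamma(1, 1)       # hyperprior on lambda: the Bayesian Lasso version
  b0   ~ dnorm(0, 0.001)
  tauy ~ dgamma(0.001, 0.001)
}

Keeping the variance components in the model, rather than integrating them out to the double exponential form, leaves tau2[j] available for monitoring.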
Comparison of methods
(from O’Hara and Sillanpää)
Some Code
The R code for ABC-MC would be easy to implement, although it would probably need M-H steps to help, of course.
WinBUGS code here for Kuo and Mallick is simple
(var_select_simple.odc)
Data example in VAR_SELECTexample.odc
Kuo and Mallick WinBUGS code: 5 predictors and y1 outcome, logistic regression.

model{
  for(i in 1:N){
    # centre and scale each predictor
    x1c[i] <- (x1[i]-mean(x1[]))/sd(x1[])
    x2c[i] <- (x2[i]-mean(x2[]))/sd(x2[])
    x3c[i] <- (x3[i]-mean(x3[]))/sd(x3[])
    x4c[i] <- (x4[i]-mean(x4[]))/sd(x4[])
    x5c[i] <- (x5[i]-mean(x5[]))/sd(x5[])
    # binomial likelihood with logit link; psi[j] are the K-M inclusion indicators
    y1[i] ~ dbin(p1[i], n[i])
    logit(p1[i]) <- b0 + b[1]*psi[1]*x1c[i] + b[2]*psi[2]*x2c[i] + b[3]*psi[3]*x3c[i]
                       + b[4]*psi[4]*x4c[i] + b[5]*psi[5]*x5c[i] + v[i]
    v[i] ~ dnorm(0, tauV)   # observation-level random effect
  }
  for(j in 1:5){
    psi[j] ~ dbern(p[j])    # inclusion indicator
    p[j] ~ dbeta(0.5, 0.5)  # hyperprior on the inclusion probability
  }
  b0 ~ dnorm(0, taub0)
  for(j in 1:5){
    b[j] ~ dnorm(0, taub[j])
    ....
  }
}
Example
code in VAR_SELECTexample.odc
Low birth weight in Georgia counties (159 counties) in 2007: a binomial example with total births as the denominator.
Predictors:
Population density (x1)
Black proportion (x2)
Median income/1000 (x3)
% below poverty (x4)
Unemployment rate (x5)
Model run results
Burn-in: 30,000; thin: 10; sample size: 2,508.
Clear evidence that x2 and x4 are really important: they are selected continually in the converged sample.
        mean     sd      MC_error  val2.5pc  median  val97.5pc  start  sample
psi[1]  0.03907  0.1938  0.009205  0         0       1          30001  2508
psi[2]  1        0       2.00E-12  1         1       1          30001  2508
psi[3]  0.08533  0.2794  0.009163  0         0       1          30001  2508
psi[4]  1        0       2.00E-12  1         1       1          30001  2508
psi[5]  0.05104  0.2201  0.00444   0         0       1          30001  2508