Kernel Matching
Ben Jann (University of Bern), London, 07.09.2017
1 Background
   - What is Matching?
   - Multivariate Distance Matching (MDM)
   - Propensity Score Matching (PSM)
   - Matching Algorithms
2 "Why PSM Should Not Be Used for Matching"
3 Conclusions
Basic idea:
1. For each observation in the treatment group, find “statistical twins” in
the control group with the same (or at least very similar) X values.
2. The Y values of these matching observations are then used to
compute the counterfactual outcome without treatment for the
observation at hand.
3. An estimate for the average treatment effect can be obtained as the
mean of the differences between the observed values and the
“imputed” counterfactual values over all observations.
Formally:

$$\widehat{ATT} = \frac{1}{N^{T=1}} \sum_{i|T=1} \left[ Y_i - \hat{Y}^0_i \right] \quad \text{with} \quad \hat{Y}^0_i = \sum_{j|T=0} w_{ij} Y_j$$

$$\widehat{ATC} = \frac{1}{N^{T=0}} \sum_{i|T=0} \left[ \hat{Y}^1_i - Y_i \right] \quad \text{with} \quad \hat{Y}^1_i = \sum_{j|T=1} w_{ij} Y_j$$

$$\widehat{ATE} = \frac{N^{T=1}}{N} \cdot \widehat{ATT} + \frac{N^{T=0}}{N} \cdot \widehat{ATC}$$

$ATE$: average treatment effect; $ATT$: average treatment effect on the treated; $ATC$: average treatment effect on the untreated
$T$: treatment indicator (0/1)
$Y$: observed outcome; $Y^1$: potential outcome with treatment; $Y^0$: potential outcome without treatment
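For illustration (hypothetical numbers): with $N^{T=1} = 500$ treated and $N^{T=0} = 1500$ untreated observations, $\widehat{ATE} = 0.25 \cdot \widehat{ATT} + 0.75 \cdot \widehat{ATC}$.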
Exact matching:

$$w_{ij} = \begin{cases} 1/k_i & \text{if } X_i = X_j \\ 0 & \text{else} \end{cases}$$

with $k_i$ as the number of observations $j$ for which $X_i = X_j$ holds.
The idea then is to use observations that are “close”, but not
necessarily equal, as matches.
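In MDM, closeness is typically measured by the Mahalanobis distance (a standard definition, not spelled out in this extract), where $S$ is a scaling matrix, commonly the sample covariance matrix of $X$:

$$MD(X_i, X_j) = \sqrt{(X_i - X_j)'\, S^{-1}\, (X_i - X_j)}$$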
$(Y^0, Y^1) \perp\!\!\!\perp T \mid X$ implies $(Y^0, Y^1) \perp\!\!\!\perp T \mid \pi(X)$, where $\pi(X)$ is the treatment probability conditional on $X$ (the "propensity score") (Rosenbaum and Rubin 1983).
Procedure
- Step 1: Estimate the propensity score, e.g. using a logit model.
- Step 2: Apply a matching algorithm using differences in the propensity score, $|\hat{\pi}(X_i) - \hat{\pi}(X_j)|$, instead of multivariate distances (see the sketch below).
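A minimal Stata sketch of the two steps; the variable names (treat, y, x1-x3) are hypothetical, and teffects psmatch wraps both steps in a single command:

    * Step 1: propensity-score model (logit) and predicted score
    logit treat x1 x2 x3
    predict pscore, pr

    * Steps 1 and 2 combined: nearest-neighbor matching on the propensity score
    teffects psmatch (y) (treat x1 x2 x3, logit), atet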
Caliper matching
- Like nearest-neighbor matching, but only use controls with a distance smaller than some threshold $c$.

Radius matching
- Use all controls with a distance smaller than some threshold $c$.

Kernel matching
- Like radius matching, but give larger weight to controls with smaller distances (using some kernel function such as, e.g., the Epanechnikov kernel); the weights are sketched below.
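As an illustration (standard definitions, not spelled out in this extract), the kernel-matching weights for the ATT can be written as

$$w_{ij} = \frac{K(d_{ij}/h)}{\sum_{j|T=0} K(d_{ij}/h)} \quad \text{with, e.g.,} \quad K(z) = \tfrac{3}{4}(1 - z^2) \cdot 1\{|z| < 1\} \ \text{(Epanechnikov)},$$

where $d_{ij}$ is the distance (multivariate or in the propensity score) between observations $i$ and $j$, and $h$ is the bandwidth.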
Best Case: Mahalanobis Distance Matching
[Figure omitted: scatter plots of Age against Education (years), with treated observations marked T and controls marked C, illustrating how Mahalanobis distance matching selects close controls.]
Treatment-effects estimation

         wage        Coef.
          ATT     .6059013
         NATE     1.432913

[Balancing tables omitted: means and variances of the covariates for treated and untreated observations, raw and matched (ATT), with standardized differences and variance ratios.]
Treatment-effects estimation

         wage        Coef.
          ATT     .3887224
         NATE     1.432913

[Figure omitted: distributions of the propensity score for untreated and treated observations.]
. lincom ATT-NATE

 ( 1)  ATT - NATE = 0
. teffects nnmatch (wage collgrad ttl_exp tenure i.industry i.race south) (union), atet

Treatment-effects estimation                    Number of obs      =      1,853
Estimator      : nearest-neighbor matching      Matches: requested =          1
Outcome model  : matching                                      min =          1
Distance metric: Mahalanobis                                   max =          1

                     |              AI Robust
                wage |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
---------------------+----------------------------------------------------------------
ATET                 |
               union |
 (union vs nonunion) |   .7246969   .2942952     2.46   0.014      .147889    1.301505
. teffects nnmatch (wage collgrad ttl_exp tenure i.industry i.race south) (union), atet nn(5)

Treatment-effects estimation                    Number of obs      =      1,853
Estimator      : nearest-neighbor matching      Matches: requested =          5
Outcome model  : matching                                      min =          5
Distance metric: Mahalanobis                                   max =          6

                     |              AI Robust
                wage |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
---------------------+----------------------------------------------------------------
ATET                 |
               union |
 (union vs nonunion) |   .5590823   .2381752     2.35   0.019     .0922675    1.025897

(The maximum number of matches exceeds the requested 5, presumably because tied distances are included.)
Treatment-effects estimation

                     |              AI Robust
                wage |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
---------------------+----------------------------------------------------------------
ATET                 |
               union |
 (union vs nonunion) |   .5288023   .2420635     2.18   0.029     .0543666    1.003238
. kmatch md union collgrad ttl_exp tenure (wage), att ematch(industry race south)
(computing bandwidth ... done)

Multivariate-distance kernel matching           Number of obs      =      1,853
                                                Kernel             =       epan
Treatment : union = 1
Metric    : mahalanobis
Covariates: collgrad ttl_exp tenure
Exact     : industry race south

[Matching statistics table omitted]

Treatment-effects estimation

         wage        Coef.
          ATT     .6408443
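For comparison, the propensity-score analogue of this call would be along the lines of the following sketch (same model, output not shown):

    . kmatch ps union collgrad ttl_exp tenure (wage), att ematch(industry race south)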
Three further treatment-effects estimations (wage outcome):

          ATT     .6047374
          ATT     .6059013
          ATT     .6651578
[Figure omitted: bandwidth search; MSE against bandwidth, with numbered evaluation steps.]

Treatment-effects estimation

         wage        Coef.
          ATT     .6928956
[Figure omitted: bandwidth search; MISE against bandwidth, with numbered evaluation steps.]

Treatment-effects estimation

         wage        Coef.
          ATT     .7308166
[Figure omitted: bandwidth search; weighted MISE against bandwidth, with numbered evaluation steps.]

Treatment-effects estimation

         wage        Coef.
          ATT     .3303161
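A sketch of how these bandwidth-selection variants can be requested in kmatch; the exact option syntax below is an assumption quoted from memory, so consult help kmatch:

    . kmatch md union collgrad ttl_exp tenure (wage), att bwidth(cv)                 // CV with respect to X
    . kmatch md union collgrad ttl_exp tenure (wage), att bwidth(cv wage)            // CV with respect to Y
    . kmatch md union collgrad ttl_exp tenure (wage), att bwidth(cv wage, weighted)  // weighted CV with respect to Y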
. kmatch csummarize
(refitting the model using the generate() option)
Treatment-effects estimation

                     Coef.
    wage
     ATT          .6021049
     NATE         1.430823
    hours
     ATT          1.263759
     NATE         1.450303

Treatment-effects estimation

                     Coef.
    wage
     ATT          .5152752
     NATE         1.430823
    hours
     ATT          1.263759
     NATE         1.450303
Matching statistics

                     Matched                  Controls           Bandwidth
                  yes     no  total      used  unused  total
 0: Treated       306     15    321       625     120    745        1.3199
 1: Treated       126     10    136       473     178    651        1.3398

Treatment-effects estimation

               Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
 0: ATT     .4586332    .2808206    1.63   0.102    -.0917652    1.009032
 1: ATT     .9518705     .334356    2.85   0.004     .2965449    1.607196
[Figure omitted: density of the propensity score (0 to 1); McFadden R² = 0.121.]
Simulation
[Figure omitted: simulation design; outcome against propensity score for untreated and treated observations (left), and treatment effect against propensity score (right).]
[Figure omitted: simulation results on efficiency for PSM and MDM; estimators: nearest-neighbor matching with 1 and 5 neighbors, and kernel matching with bandwidth chosen by cross-validation with respect to X, cross-validation with respect to Y, and weighted CV with respect to Y.]
In this slide we can see that for the same algorithm PSM typically is
somewhat less efficient than MDM, but that across algorithms PSM
can also be much more efficient than MDM. For example, kernel
matching PSM has a much smaller variance than 1-nearest-neighbor
MDM. That is, the choice of algorithm matters much more than the
choice between PSM and MDM.
For kernel matching the efficiency differences between PSM and MDM
are only small; additional post-matching regression adjustment further
reduces the differences.
Results: Bias reduction (in percent)
[Figure omitted: bias reduction (in percent) for N = 500 and N = 5000; estimators as above (nearest-neighbor matching with 1 and 5 neighbors; kernel matching with cross-validation with respect to X, with respect to Y, and weighted CV with respect to Y).]
Here we see that PSM has a bias that does not vanish as the sample size increases. The reason is that the same propensity-score model specification is used for both sample sizes. The model is rather simple (a linear effect of age, no interactions), and due to the specific pattern of the data (in particular, the sharp drop in the outcome variable after propensity score 0.3), small imprecisions can have substantial effects on the results. In practice, one would probably use a more refined specification in the large-sample situation, which would reduce the bias.
[Figures omitted: further simulation results for MDM and PSM, including confidence-interval coverage; estimators: nearest-neighbor matching (1 and 5 neighbors; analytic and bootstrap standard errors; with and without bias correction) and kernel matching (fixed bandwidth, pair-matching bandwidth, cross-validation with respect to X, cross-validation with respect to Y, weighted CV with respect to Y).]
Coverage of teffects CIs is a bit too low for PSM (and for MDM with
bias-correction in the small sample).
Bootstrap CIs are too conservative for nearest-neighbor matching.
Overall, I agree with King and Nielsen that MDM has some advantages over PSM, but it also has some disadvantages. In applied research the choice may not be that clear.

Advantages of MDM:
+ It leaves less scope for bias due to post-matching modeling decisions.
+ Theoretical results (see, e.g., Frölich 2007) suggest that MDM will generally tend to outperform PSM in terms of efficiency (but the differences are likely to be small).
+ It imposes fewer restrictions in terms of possible post-matching analyses.

Disadvantages of MDM:
- The choice of the scaling matrix is largely arbitrary.
- Computational complexity.
To do
- Run some more simulations.
- Variance estimation based on influence functions?
- Better (and faster) bandwidth-selection algorithms?
- Explore the potential of adaptive bandwidths?