1. Nonparametric (NP) estimation; by K. M. Abadir
A. NP density estimation
Histograms are the most basic form of data summary. However:
1. they are a mixture of p.d.f. and c.d.f. (they estimate the probability that
the data fall in a sequence of intervals, rather than the probability measure
associated with each point);
2. they do not provide smooth curves that we are accustomed to seeing as
densities of continuous variates (for example, the normal density).
Notation: Denote a probability density function (p.d.f.) by $f$, a cumulative distribution function (c.d.f.) by $F$. If needed to avoid ambiguity, subscripts indicate the variable's name; e.g. the c.d.f. of $x$ is $F_x(u) \equiv \Pr(x \le u)$. (For continuous $x$, $f(u) = \mathrm{d}F(u)/\mathrm{d}u$; for discrete $x$, $f(u) = \Pr(x = u) = \Delta F(u)$. The functions can also be visualized and linked by an integral or sum.) Special case: for $x \sim \mathrm{N}(0,1)$, write the density as $\phi(u)$ and the distribution as $\Phi(u)$.
Suppose that a random (or i.i.d.) sample $x_1, \dots, x_n$ (we use the shorthand $\{x_i\}_{i=1}^{n}$ or $\{x_i\}$ for a sequence) was drawn from a density $f_x(u)$, or simply $f(u)$.
• If we knew that the data came from a normal distribution, N$(\mu, \sigma^2)$, we could estimate its parameters $\mu$ and $\sigma^2$. This would be parametric estimation.
• But, in general, we do not know what type of distribution generated the data
(IBM example). So how do we estimate the density without any parametric assumptions, that is, how can we obtain a nonparametric (NP) density estimator?
Assume that $f$ is a continuous function whose functional form is unknown. A smooth approximation of $f(u)$, say $\hat{f}(u)$, may be obtained from the data by using a weighting function called a kernel $K$ satisfying $\int_{-\infty}^{\infty} K(z)\,\mathrm{d}z = 1$ (the weights add up to 100%). The kernel density estimator is
$$\hat{f}(u) \equiv \frac{1}{nh}\sum_{i=1}^{n} K\!\left(\frac{u - x_i}{h}\right), \qquad (1)$$
and $h > 0$ is the smoothing parameter (or bandwidth or window width).
For example, if we choose $K(z) = \phi(z) \equiv \exp\!\left(-\tfrac{1}{2}z^2\right)/\sqrt{2\pi}$, we can rewrite $\hat{f}$ as the sum of $n$ rescaled kernels, each centered around one of the data points:
$$\hat{f}(u) = \frac{1}{nh\sqrt{2\pi}}\exp\!\left(-\tfrac{1}{2}\left(\frac{u - x_1}{h}\right)^2\right) + \dots + \frac{1}{nh\sqrt{2\pi}}\exp\!\left(-\tfrac{1}{2}\left(\frac{u - x_n}{h}\right)^2\right); \qquad (2)$$
see next page: each small curve is $(nh\sqrt{2\pi})^{-1}\exp\!\left(-(u - x_i)^2/(2h^2)\right)$, the big one being their sum $\hat{f}$, and the horizontal axis is $u$. For example, the sample $\{x_1, x_2, x_3\} = \{1.2, 2.0, 1.1\}$ gives the following function of $u$:
$$\hat{f}(u) = \frac{\exp\!\left(-(u - 1.2)^2/(2h^2)\right)}{3h\sqrt{2\pi}} + \frac{\exp\!\left(-(u - 2.0)^2/(2h^2)\right)}{3h\sqrt{2\pi}} + \frac{\exp\!\left(-(u - 1.1)^2/(2h^2)\right)}{3h\sqrt{2\pi}}.$$
Note that the sum gives an $\hat{f}$ plot that need not be normal, even if the kernels are normal as in the illustration.
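As a minimal numerical sketch of (1)–(2) (assuming Python with NumPy; the function names are illustrative, not from the lecture), the following evaluates $\hat{f}$ for the three-point sample above at a few bandwidths:

```python
import numpy as np

def gaussian_kernel(z):
    """Standard normal kernel phi(z)."""
    return np.exp(-0.5 * z**2) / np.sqrt(2 * np.pi)

def kde(u, data, h):
    """Kernel density estimator (1) with a Gaussian kernel.
    u can be a scalar or an array of evaluation points."""
    u = np.asarray(u, dtype=float)
    n = len(data)
    # sum of n rescaled kernels, each centred at a data point, as in (2)
    return sum(gaussian_kernel((u - x_i) / h) for x_i in data) / (n * h)

data = [1.2, 2.0, 1.1]            # the sample {x_1, x_2, x_3} of the notes
for h in (0.1, 0.3, 0.6):         # compare several bandwidths
    print(h, kde(1.5, data, h))   # estimate of f at u = 1.5
```

Plotting `kde(grid, data, h)` over a grid of $u$ values reproduces the bump-sum pictures referred to above.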
There are two unknowns in the definition of $\hat{f}(u)$ in (1):
1. the kernel function $K$ (which determines the shape of the bumps);
2. the smoothing parameter $h$ (which determines the smoothness/width of the bumps: compare the 3 figures).
We choose them by minimizing the expected squared error of our estimate, integrated over all possible values of $u$:
$$\int_{-\infty}^{\infty} \mathrm{E}\!\left[\left(\hat{f}(u) - f(u)\right)^2\right]\mathrm{d}u,$$
called the integrated mean squared error (IMSE).
As with the MSE, the IMSE can be written as the integral of (bias)$^2$ + variance, whose components can be balanced to optimize the IMSE (e.g. tolerate a bit of bias if the variance is much reduced).
We show this MSE decomposition for the simplest setup. Suppose that we have an estimator $\hat{\theta}$ of a parameter $\theta$, with $\mathrm{MSE}(\hat{\theta}) \equiv \mathrm{E}\!\left((\hat{\theta} - \theta)^2\right)$. Define the deviation from the mean $d \equiv \hat{\theta} - \mathrm{E}(\hat{\theta})$ and the bias $b \equiv \mathrm{E}(\hat{\theta}) - \theta$. Then,
$$\mathrm{MSE}(\hat{\theta}) \equiv \mathrm{E}\!\left((d + b)^2\right) = \mathrm{E}\!\left(d^2 + 2db + b^2\right) = \mathrm{E}(d^2) + \mathrm{E}(2db) + \mathrm{E}(b^2).$$
Since $b$ is nonrandom (think of $\mathrm{E}(\hat{\theta})$ as a constant),
$$\mathrm{MSE}(\hat{\theta}) = \mathrm{E}(d^2) + 2b\,\mathrm{E}(d) + b^2.$$
By the definition of $\mathrm{var}(\hat{\theta}) \equiv \mathrm{E}(d^2)$, and by the deviations averaging to zero,
$$\mathrm{E}(d) \equiv \mathrm{E}\!\left(\hat{\theta} - \mathrm{E}(\hat{\theta})\right) = \mathrm{E}(\hat{\theta}) - \mathrm{E}(\hat{\theta}) = 0,$$
we get $\mathrm{MSE}(\hat{\theta}) = \mathrm{var}(\hat{\theta}) + b^2$. This proof was constructive and "from first principles". There is an alternative one that verifies the desired result by making use of
$$\mathrm{E}(w^2) \equiv \mathrm{var}(w) + \left(\mathrm{E}(w)\right)^2.$$
Letting $w \equiv \hat{\theta} - \theta$, then using $\mathrm{var}(w) = \mathrm{var}(\hat{\theta})$ (location invariance) and $\mathrm{E}(w) \equiv b$, we get the required result.
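A quick simulation check of MSE = variance + (bias)$^2$ (a sketch only; the deliberately biased estimator of a mean below is my own illustrative choice, not something prescribed by the notes):

```python
import numpy as np

rng = np.random.default_rng(0)
theta = 2.0                       # true parameter (a population mean)
n, reps = 20, 100_000

# a deliberately biased estimator: shrink the sample mean towards zero
estimates = np.array([0.9 * rng.normal(theta, 1.0, n).mean() for _ in range(reps)])

mse = np.mean((estimates - theta) ** 2)
var = estimates.var()                       # E(d^2) over the replications
bias_sq = (estimates.mean() - theta) ** 2   # b^2 over the replications
print(mse, var + bias_sq)                   # the two numbers coincide (up to rounding)
```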
Minimizing the IMSE when $n$ is large implies that the optimal standardized kernel is the Epanechnikov (or quadratic) kernel
$$K_e(z) \equiv \begin{cases} \dfrac{3}{4\sqrt{5}}\left(1 - \dfrac{z^2}{5}\right) & (|z| < \sqrt{5}), \\[1ex] 0 & (\text{otherwise}), \end{cases} \qquad (3)$$
and its mean is 0 as a result of the optimization (not by construction), which is why each kernel was centered around $x_i$ earlier. Nevertheless, the increase in this optimal IMSE is small (and vanishes as $n \to \infty$) if using $K(z) = \phi(z)$ instead. More crucial is the choice of $h$, which determines how smooth $\hat{f}$ is.
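A sketch of the standardized Epanechnikov kernel (3), usable as a drop-in replacement for the Gaussian kernel in the earlier KDE sketch (an illustration, not code from the lecture):

```python
import numpy as np

def epanechnikov_kernel(z):
    """Standardized Epanechnikov kernel K_e(z) of equation (3):
    support |z| < sqrt(5), zero outside, unit variance."""
    z = np.asarray(z, dtype=float)
    return np.where(np.abs(z) < np.sqrt(5),
                    3 / (4 * np.sqrt(5)) * (1 - z**2 / 5),
                    0.0)

# sanity checks on a fine grid: integrates to 1, has mean 0 and variance 1
z = np.linspace(-3.0, 3.0, 60_001)
w = epanechnikov_kernel(z)
dz = z[1] - z[0]
print(w.sum() * dz, (z * w).sum() * dz, (z**2 * w).sum() * dz)   # ~ 1, 0, 1
```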
The variance of $\hat{f}$ as an estimate of $f$ is the vertical volatility of this estimate in the previous graphs. A larger $h$ makes the estimate $\hat{f}$ less erratic (less volatile $\hat{f}$) while assigning a higher probability to points far from the data (hence causing a bias in $\hat{f}$); see the previous graphs & Remark 1 in Section C. Exercise 2 will give the balancing tradeoff between bias and variance of $\hat{f}$.
Section C will show that the optimal $h$ is of the order $n^{-1/5}$: doubling the sample size $n$ means that we need to reduce $h$ by a factor of $2^{-1/5} \approx 0.87$ (a reduction of 13%). Therefore, less smoothing is needed as the sample size increases, because there are more data to fill the horizontal axis in the graphs.
Remark 1(b) in Section C will show that the efficiency of the estimate $\hat{f}$ is less than what parametric estimation gives: the variance declines at a rate of $1/(nh)$ instead of $1/n$, a price to pay for the flexibility of the estimate $\hat{f}$.
Apart from the usual direct reasons to estimate the density of variates (to
quantify their randomness and know their overall behaviour), there are other
indirect uses for densities in finance and economics such as “hazard rates”.
They are also known as “age-related failure rates” if time is involved. For
example, the probabilities of:
1. death, given that an individual has survived until some age (used in calculating life-insurance premia in actuarial work);
2. breakdown of a new machine, given that it has functioned up to some point
in time;
3. employment, given that a person has been unemployed for some period (typically a diminishing function of this period as it gets longer);
4. default on some obligation, given that no default has occurred so far.
Having survived until $u_0$, what is the density function of $x$ over the rest of its lifetime? For a continuous variate $x$, with $f$ the density of failures over time,
$$f_{x \mid x > u_0}(u) = \frac{f(u)}{\Pr(x > u_0)} \qquad (u > u_0 \text{ in the numerator}),$$
where the denominator ensures that the new density integrates to 1 (conditional probabilities too add up to 100%) and can be written as $1 - F(u_0)$.
The hazard rate is this function evaluated at $u_0$, and it gives the density of failures at $u_0$ once this point is reached. We have
$$\lambda(u_0) \equiv \frac{f(u_0)}{1 - F(u_0)} = -\frac{\mathrm{d}\log\left(1 - F(u_0)\right)}{\mathrm{d}u_0},$$
or equivalently $1 - F(u_0) = \exp\!\left(-\int_0^{u_0}\lambda(u)\,\mathrm{d}u\right)$ if $x > 0$ as in the examples of the previous slide (otherwise the lower limit of integration is different). Note:
1. the similarity of this exponential to continuous discounting by an instantaneous rate, in finance theory (Note: $1/\exp(r) \approx 1/(1 + r)$ for small $r$);
2. we call $1 - F(u_0)$ the survival function because it equals $\Pr(x > u_0)$;
3. taking the log transforms absolute into relative scales, so $\lambda(u_0)$ is the relative (or percentage) rate of decline of the survival function $1 - F(u_0)$ as $u_0$ changes.
Let $x > 0$ be a random variable having an exponential density, defined by
$$f(u) = \alpha\exp(-\alpha u), \qquad u > 0.$$
This implies a fixed hazard rate because
$$\lambda(u_0) = \frac{\alpha\exp(-\alpha u_0)}{\int_{u_0}^{\infty}\alpha\exp(-\alpha u)\,\mathrm{d}u} = \frac{\alpha\exp(-\alpha u_0)}{-\left[\exp(-\alpha u)\right]_{u_0}^{\infty}} = \frac{\alpha\exp(-\alpha u_0)}{\exp(-\alpha u_0)} = \alpha,$$
regardless of the value of $u_0$. (Recall the meaning of this, from earlier examples like the probability of employment.)
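To see the constant-hazard property numerically, the sketch below evaluates $\lambda(u_0) = f(u_0)/(1 - F(u_0))$ for an exponential with an arbitrary illustrative parameter value (my own choice):

```python
import numpy as np

alpha = 0.7                                   # illustrative exponential parameter
f = lambda u: alpha * np.exp(-alpha * u)      # density f(u)
S = lambda u: np.exp(-alpha * u)              # survival function 1 - F(u)

for u0 in (0.5, 1.0, 3.0, 10.0):
    print(u0, f(u0) / S(u0))                  # hazard rate: always equals alpha
```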
Other densities generate other patterns in the hazard rates, but they are all
restricted by the specific choice of a density.
NP density estimation frees us from the restrictions that these initial choices
would have imposed. By using the NP estimate $\hat{f}$, we can discover whether
the pattern can be simplified to one known density or another, but the result
is usually more general than the standard known densities and their associated
hazards.
The same ideas of NP density estimation can be used for multivariate data.
We have $F(u, v) \equiv \Pr(x \le u,\, y \le v)$ and, for continuous variates,
$$f(u, v) = \frac{\partial^2 F(u, v)}{\partial u\,\partial v}.$$
When using NP density estimation as a visual tool in high dimensions, it is
best to look at the density of one or two variates, conditionally on the rest.
The next five graphs from Silverman (1986) illustrate:
• the first one shows a bivariate histogram;
• the following three show smoothed bivariate NP density estimates;
• the last one shows a contour map or iso-probability contours defined by $f(u, v) = c_j$ for a succession of constants $c_1 > c_2 > \dots > 0$ (i.e. take a sequence of slices parallel to the $(u, v)$ plane in the 3-dimensional graphs).
These contours are similar to the wiggly lines in weather maps, where the values $c_1, c_2, \dots$ would denote atmospheric pressures. Similarly for walking maps, with contours indicating the altitude along hills and mountains (tight contours = steep climb, large gaps between contours = gentle slope).
Peaks of $f$ are at the center of these contours (turn a cone upside down!).
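A small sketch of how such iso-probability contours can be produced from a bivariate product-kernel estimate (assuming Matplotlib is available; the simulated data and bandwidths are made up for illustration):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
x = rng.normal(size=300)
y = 0.5 * x + rng.normal(scale=0.8, size=300)       # artificial correlated data

def phi(z):
    return np.exp(-0.5 * z**2) / np.sqrt(2 * np.pi)

def kde2(u, v, x, y, hx, hy):
    """Bivariate product-Gaussian kernel density estimate at grid points (u, v)."""
    n = len(x)
    est = np.zeros_like(u, dtype=float)
    for xi, yi in zip(x, y):
        est += phi((u - xi) / hx) * phi((v - yi) / hy)
    return est / (n * hx * hy)

U, V = np.meshgrid(np.linspace(-3, 3, 80), np.linspace(-3, 3, 80))
Z = kde2(U, V, x, y, hx=0.4, hy=0.4)
plt.contour(U, V, Z, levels=8)      # a succession of slices parallel to the (u, v) plane
plt.xlabel("u"); plt.ylabel("v")
plt.show()
```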
Another illustration is in Gallant, Rossi, and Tauchen (1992, Review of Financial Studies); see the figure containing 4 displays on the next page.
1. Denote log trading volumes by $v$ and conditional changes in log prices by $\Delta p$. Recall that the log transforms variables into relative scales: defining price returns by $r_t \equiv (P_t - P_{t-1})/P_{t-1}$, where $P_t$ is the price level, we have
$$\Delta p_t = \Delta\log P_t = \log P_t - \log P_{t-1} = \log(P_t/P_{t-1}) = \log(1 + r_t) \approx r_t.$$
There is a positive relation between $v$ and $|\Delta p|$. In the last display (a contour map), as $\Delta p$ becomes nonzero (move away from the center of the horizontal axis), $v$ tends to increase and this is reflected by the widening top of the triangle (this is where the most likely corresponding $v$ is).
2. The relation is also asymmetric, depending on whether the prices are going
up or down. The tilt of the triangle shows that there is more trading on
the upswing of prices, a feature detected from the NP density contour (last
display) but not detectable from the scatter plot (first display).
(The 3-dimensional graph in the third display is the NP estimate of the joint density of $\Delta p$ and $v$, the vertical axis denoting the estimated value of $f$. The contours are obtained as a succession of horizontal slices.)
B. NP regression
Regression analysis can be formulated as the expected + unexpected parts of the left-hand side (LHS) variable:
$$\mathbf{y} = \mathrm{E}(\mathbf{y} \mid \mathbf{X}) + \boldsymbol{\varepsilon} \equiv g(\mathbf{X}) + \boldsymbol{\varepsilon},$$
with $\hat{\mathbf{y}} = \hat{g}(\mathbf{X})$ as the estimated (or fitted) regression curve. For the bivariate normal distribution, the conditional expectation $g(\cdot)$ is a linear function, hence
the so-called linear regression model. Linearity holds for distributions having
elliptical contours (known as elliptical distributions), incl. normal and Student
t, but in general other functional forms emerge (as seen in the previous slide).
Nonparametric regression does not presuppose a functional form. It estimates the conditional expectation from the nonparametric joint density estimate of the variates. This gives rise, in the bivariate case, to the Nadaraya–Watson estimator of $\mathrm{E}(y \mid x = u)$, namely
$$\hat{g}(u) = \sum_{i=1}^{n} w_i\,y_i \qquad \text{with} \qquad w_i \equiv \frac{K\!\left(h^{-1}(u - x_i)\right)}{\sum_{j=1}^{n} K\!\left(h^{-1}(u - x_j)\right)},$$
where $\hat{g}$ can be seen as a weighted average of the $y_i$'s, with weights $w_i$ that are functions of $u$ and add up to 100% here.
Interpreting this formula through a scatter plot:
1. For each $u$, the $x_i$'s in a small neighborhood of it provide the largest weight to the average of the corresponding $y_i$'s, a local averaging of $y$.
This is because a large value of $|u - x_i|$ corresponds to the tail of the kernel; for example, see (2).
2. Tracing the sequence of such averages as $u$ changes, we get the NP regression
curve.
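A minimal sketch of the Nadaraya–Watson formula above (Gaussian kernel, made-up data; this is not the PcGive routine used in the exercises):

```python
import numpy as np

def nadaraya_watson(u, x, y, h):
    """Nadaraya-Watson estimate of E(y | x = u): a kernel-weighted average of the y_i."""
    K = np.exp(-0.5 * ((u - x) / h) ** 2)   # Gaussian kernel values (unnormalized)
    w = K / K.sum()                         # weights add up to 1 (100%)
    return np.sum(w * y)

rng = np.random.default_rng(2)
x = rng.uniform(0, 10, 200)
y = np.sin(x) + rng.normal(scale=0.3, size=200)         # an unknown nonlinear relation

grid = np.linspace(0, 10, 50)
fit = [nadaraya_watson(u, x, y, h=0.5) for u in grid]   # the NP regression curve
```

Re-running the last line with a very small or very large `h` illustrates the two extreme cases discussed below.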
The next graph illustrates, using data on consumption (vertical axis) and income (horizontal axis). As income increases, consumption increases. However, this does not happen in the usual linear way, but rather in a nonlinear form:
1. At low incomes, consumers borrow to spend more than they earn.
As income increases, the gap falls.
2. For intermediate incomes, an increase in earnings is matched by an increase
in consumption.
3. At high incomes, savings increase and consumption levels off.
Now consider what happens as $h \to 0$ or $h \to \infty$ in the Nadaraya-Watson formula:
1. In the first case, the window width is shrinking around each point in the dataset, and the NP regression is simply the sequence of $y_i$-values (assuming there are no ties in the $x_i$-values).
2. At the other extreme, a very large window width means that the regression curve is just a horizontal line given by the average of the $y_i$'s.
The optimal bandwidth is somewhere in between, and it is found by the same
methods as in the case of NP density estimation.
We now turn to these technicalities. Only the results labelled “Exercise” need
to be solved. The rest is a collection of results that formalize some of the
earlier claims. They are included for reference, but you are not required to
derive them.
C. Technical results and Exercises
Exercise 1 (Kernels) Suppose that $x_i$, $i = 1, \dots, n$, are data from a random sample of $x$, and that $x$ has a continuous density $f$. We define
$$\hat{f}(u) \equiv \frac{1}{nh}\sum_{i=1}^{n} K\!\left(\frac{u - x_i}{h}\right),$$
where $h > 0$ and we choose $K$ to be a proper density function. Write $\hat{f}(u)$ as the average of $n$ densities.
Remark 1 (Kernel density estimator, pointwise distributions) Suppose that we have a random sample $x_1, \dots, x_n$ from a density $f(u)$, whose first three derivatives exist for all $u$. We estimate $f$ by using a kernel $K$ that is chosen as any function satisfying
$$\int_{-\infty}^{\infty} K(z)\,\mathrm{d}z = 1, \qquad \int_{-\infty}^{\infty} z\,K(z)\,\mathrm{d}z = 0, \qquad \kappa_1 \equiv \int_{-\infty}^{\infty} z^2 K(z)\,\mathrm{d}z \ne 0,$$
where $\kappa_1$ is finite, and we assume that $\int_{-\infty}^{\infty} |z|^3 K(z)\,\mathrm{d}z$ and $\int_{-\infty}^{\infty} K(z)^2\,\mathrm{d}z$ are finite. Let $n \to \infty$ faster than $h \to 0$, such that $nh \to \infty$. Then:
(a) the asymptotic expansion of the bias is $\mathrm{E}\!\left(\hat{f}(u) - f(u)\right) = \frac{1}{2}h^2\kappa_1 f''(u) + o(h^2)$, where $o(h^2)$ means "of order of magnitude smaller than $h^2$" (while $O(h^2)$ means of order $h^2$), and the bias is increasing with $h$;
(b) $\mathrm{var}\!\left(\hat{f}(u)\right) = \dfrac{\kappa_2 f(u)}{nh} + o\!\left(\dfrac{1}{nh}\right)$, where $\kappa_2 \equiv \int_{-\infty}^{\infty} K(z)^2\,\mathrm{d}z$, and the variance is decreasing with $h$ (see Figure 1 where $\hat{f}$ is less volatile when $h$ increases);
(c) $\sqrt{nh}\left(\hat{f}(u) - f(u)\right) \overset{a}{\sim} \mathrm{N}\!\left(\tfrac{1}{2}\delta\kappa_1 f''(u),\ \kappa_2 f(u)\right)$, where $\delta \equiv \sqrt{nh}\,h^2$;
(d) hence, $\hat{f}(u)$ is a consistent estimator of $f(u)$, at a rate given by $\sqrt{nh}$.
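A rough Monte Carlo illustration of (a) and (b) (a sketch only: standard normal data, Gaussian kernel, bias and variance of $\hat{f}(u)$ at $u = 0$ compared across bandwidths; these specific choices are mine, not part of the Remark):

```python
import numpy as np

rng = np.random.default_rng(3)
u, n, reps = 0.0, 400, 2000
f_true = 1 / np.sqrt(2 * np.pi)               # true N(0,1) density at u = 0

def kde_at(u, data, h):
    z = (u - data) / h
    return np.mean(np.exp(-0.5 * z**2) / np.sqrt(2 * np.pi)) / h

for h in (0.1, 0.2, 0.4):
    est = np.array([kde_at(u, rng.normal(size=n), h) for _ in range(reps)])
    # |bias| grows roughly like h^2, variance shrinks roughly like 1/(nh)
    print(h, est.mean() - f_true, est.var())
```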
Exercise 2 (Kernel density estimator, IMSE) Continuing with the setup of Remark 1, define
$$\iota \equiv \int_{-\infty}^{\infty} f''(u)^2\,\mathrm{d}u < \infty.$$
Show that:
(a) the asymptotic (i.e. large-$n$) IMSE (or AIMSE) of $\hat{f}$ is
$$\frac{h^4\kappa_1^2\,\iota}{4} + \frac{\kappa_2}{nh};$$
(b) the bandwidth that minimizes this AIMSE is
$$\hat{h} = \left(\frac{\kappa_2}{\iota\,\kappa_1^2\,n}\right)^{1/5},$$
where $\kappa_1$ and $\kappa_2$ are defined in Remark 1.
[Note: The value of the minimum AIMSE is not affected by the choice of $\kappa_1$, because $\kappa_2$ varies accordingly to offset this change. Therefore, we can set $\kappa_1 = 1$ without loss of generality in the formula for $\hat{h}$ from now on.]
Remark 2 (CV and the bandwidth) Given that the Epanechnikov kernel is the optimal one, the only remaining obstacle to practical implementation of the estimation of $f$ is the undetermined bandwidth. The Epanechnikov kernel has $\int_{-\infty}^{\infty} K_e(z)^2\,\mathrm{d}z = 3/\!\left(5\sqrt{5}\right)$, so the optimal $\hat{h}$ is
$$\hat{h} = \left(\frac{3}{5^{3/2}\,\iota\,n}\right)^{1/5} \approx \frac{0.769}{(\iota n)^{1/5}}.$$
It contains the unknown $\iota \equiv \int_{-\infty}^{\infty} f''(u)^2\,\mathrm{d}u$.
One possibility is the plug-in method, whereby a preliminary estimate of $f$ (or some hypothesized $f$) is substituted into $\iota$. In this remark, we will analyze another method that does not require such arbitrary choices.
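Before turning to that method, here is a sketch of the plug-in idea just mentioned, under the (strong) working hypothesis that $f$ is normal with standard deviation $\sigma$, in which case $\iota = 3/(8\sqrt{\pi}\,\sigma^5)$; substituting this, with $\sigma$ replaced by the sample standard deviation, into the formula above gives a simple rule of thumb. The normal reference is my illustrative choice, not part of the Remark:

```python
import numpy as np

def plug_in_bandwidth(data):
    """AIMSE-optimal bandwidth for the standardized Epanechnikov kernel,
    with iota evaluated under a normal reference density (one possible
    plug-in choice among many)."""
    n = len(data)
    sigma = np.std(data, ddof=1)              # preliminary estimate of the scale of f
    iota = 3 / (8 * np.sqrt(np.pi) * sigma**5)
    kappa2 = 3 / (5 * np.sqrt(5))             # integral of K_e(z)^2
    return (kappa2 / (iota * n)) ** 0.2       # roughly 1.05 * sigma * n**(-1/5)

rng = np.random.default_rng(4)
print(plug_in_bandwidth(rng.normal(size=500)))
```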
Least-squares cross-validation (LSCV) optimizes a criterion that has, on average, the same value as the IMSE. It is based on the following resampling procedure. The first step of the procedure is to delete one observation at a time, say $x_j$ ($j = 1, \dots, n$), then calculate the usual kernel estimator based on the remaining $n - 1$ data points:
$$\hat{f}_{-j}(u) = \frac{1}{(n-1)h}\sum_{i \ne j} K\!\left(\frac{u - x_i}{h}\right), \qquad j = 1, \dots, n.$$
Defining $\hat{f}_{-1}(\mathbf{x}) \equiv \frac{1}{n}\sum_{j=1}^{n}\hat{f}_{-j}(x_j)$, where we have the data vector $\mathbf{x} \equiv (x_1, \dots, x_n)'$, the procedure minimizes
$$\mathrm{CV}(h) \equiv \int_{-\infty}^{\infty} f(u)^2\,\mathrm{d}u + \int_{-\infty}^{\infty} \hat{f}(u)^2\,\mathrm{d}u - 2\hat{f}_{-1}(\mathbf{x})$$
with respect to $h$. This procedure is justified because $\mathrm{E}\!\left(\mathrm{CV}(h)\right)$ is the IMSE:
$$\int_{-\infty}^{\infty} \mathrm{E}\!\left[\left(\hat{f}(u) - f(u)\right)^2\right]\mathrm{d}u = \mathrm{E}\!\left[\int_{-\infty}^{\infty}\left(\hat{f}(u) - f(u)\right)^2\mathrm{d}u\right] = \mathrm{E}\!\left[\int_{-\infty}^{\infty} f(u)^2\,\mathrm{d}u + \int_{-\infty}^{\infty}\hat{f}(u)^2\,\mathrm{d}u - 2\int_{-\infty}^{\infty}\hat{f}(u)f(u)\,\mathrm{d}u\right],$$
but we won't prove the equivalence of this last term with the last one in $\mathrm{E}\!\left(\mathrm{CV}(h)\right)$.
Note that $\int_{-\infty}^{\infty} f(u)^2\,\mathrm{d}u$ does not depend on $h$, so that minimizing $\mathrm{CV}(h)$ with respect to $h$ is equivalent to minimizing
$$\int_{-\infty}^{\infty}\hat{f}(u)^2\,\mathrm{d}u - 2\hat{f}_{-1}(\mathbf{x}),$$
which contains no unknowns.
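A brute-force sketch of this computable criterion (Gaussian kernel, numerical integration of $\int\hat{f}^2$ on a grid, simulated data; a direct transcription of the two terms rather than an optimized implementation):

```python
import numpy as np

def phi(z):
    return np.exp(-0.5 * z**2) / np.sqrt(2 * np.pi)

def lscv(h, data, grid):
    """Computable part of CV(h): integral of fhat^2 minus twice the average
    of the leave-one-out estimates fhat_{-j}(x_j)."""
    n = len(data)
    # term 1: numerical integral of fhat(u)^2 over a grid of u values
    fhat = phi((grid[:, None] - data[None, :]) / h).sum(axis=1) / (n * h)
    du = grid[1] - grid[0]
    term1 = np.sum(fhat**2) * du
    # term 2: leave-one-out estimates at the deleted points, then averaged
    Kmat = phi((data[:, None] - data[None, :]) / h)
    loo = (Kmat.sum(axis=1) - np.diag(Kmat)) / ((n - 1) * h)
    return term1 - 2 * loo.mean()

rng = np.random.default_rng(5)
data = rng.normal(size=200)
grid = np.linspace(data.min() - 3, data.max() + 3, 1000)
hs = np.linspace(0.05, 1.0, 40)
best_h = hs[np.argmin([lscv(h, data, grid) for h in hs])]
print(best_h)
```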
Exercise 3 (Estimator of Nadaraya and Watson, optional exercise) Suppose that the random sample $(x_1, y_1), \dots, (x_n, y_n)$ is drawn from a bivariate density $f_{x,y}(u, v)$, or simply $f(u, v)$. Define
$$\hat{f}(u, v) \equiv \frac{1}{n}\sum_{i=1}^{n} L(u - x_i,\, v - y_i),$$
where $L$ is a scaled bivariate kernel such that
$$L_1(s) \equiv \int_{-\infty}^{\infty} L(s, t)\,\mathrm{d}t, \qquad L_2(t) \equiv \int_{-\infty}^{\infty} L(s, t)\,\mathrm{d}s, \qquad \int_{-\infty}^{\infty} t\,L(s, t)\,\mathrm{d}t = 0,$$
where $L_1$ and $L_2$ are univariate kernels (such as $L_1(s) = h^{-1}K\!\left(h^{-1}s\right)$, where $K$ is as seen earlier) and $L$ is centered around 0 (at least for $v$).
Show that the estimator of $\mathrm{E}(y \mid x = u)$ implied by $\hat{f}(u, v)$ is
$$\sum_{i=1}^{n} w_i\,y_i$$
as in the earlier lecture notes, with
$$w_i = \frac{L_1(u - x_i)}{\sum_{j=1}^{n} L_1(u - x_j)} = \frac{K\!\left(h^{-1}(u - x_i)\right)}{\sum_{j=1}^{n} K\!\left(h^{-1}(u - x_j)\right)}.$$
Exercise 4 (NP density estimation, numerical) Using PC-GIVE, calculate the
NP kernel density estimate for either inflation or interest rate. Does it look
like a normal (or log-normal)? Comment on the features of the density.
[The series is not i.i.d., but we will address this complication in later lectures.]
Exercise 5 (NP regression, numerical) Using PC-GIVE, calculate the NP regression of consumption on income. Comment on the features of the relation.
Solution 1. Define
$$f_i(u) \equiv \frac{1}{h}K\!\left(\frac{u - x_i}{h}\right), \qquad i = 1, \dots, n,$$
where the scaling by $1/h$ indicates the concentration of the assigned weights around the points $x_i$. To show that $f_i$ is indeed a density function with respect to $u$, we must show that it is nonnegative everywhere and that it integrates to 1. The first property is established by the positivity of $h$ and the definition of $K$ as a density function. The second property is obtained as
$$\int_{-\infty}^{\infty} f_i(u)\,\mathrm{d}u = \int_{-\infty}^{\infty}\frac{1}{h}K\!\left(\frac{u - x_i}{h}\right)\mathrm{d}u = \int_{-\infty}^{\infty} K(z)\,\mathrm{d}z = 1,$$
by the change of variable $z = (u - x_i)/h$ having $\mathrm{d}z = \mathrm{d}u/h$. The quantity $\hat{f}(u)$ is then the average of these $n$ densities $f_i(u)$, with weights $1/n$ each.
Solution 2. (a) We know that the MSE is the sum of the squared bias and the variance. Therefore, the leading term of the MSE of $\hat{f}(u)$ is
$$\frac{h^4\kappa_1^2}{4}f''(u)^2 + \frac{\kappa_2}{nh}f(u),$$
and the corresponding AIMSE is
$$\mathrm{AIMSE} \equiv \frac{h^4\kappa_1^2}{4}\int_{-\infty}^{\infty} f''(u)^2\,\mathrm{d}u + \frac{\kappa_2}{nh}\int_{-\infty}^{\infty} f(u)\,\mathrm{d}u = \frac{h^4\kappa_1^2\,\iota}{4} + \frac{\kappa_2}{nh}.$$
(b) Minimizing with respect to $h$,
$$\frac{\partial\,\mathrm{AIMSE}}{\partial h} \equiv h^3\kappa_1^2\,\iota - \frac{\kappa_2}{nh^2} = 0$$
is solved uniquely by $\hat{h} = \left(\kappa_2/(n\iota\kappa_1^2)\right)^{1/5}$, and we have confirmation of the minimum (not maximum!) from
$$\frac{\partial^2\,\mathrm{AIMSE}}{\partial h^2} \equiv 3h^2\kappa_1^2\,\iota + \frac{2\kappa_2}{nh^3} > 0.$$
The optimal value is therefore
$$\mathrm{AIMSE}(\hat{h}) = \frac{\kappa_1^2\,\iota}{4}\left(\frac{\kappa_2}{n\iota\kappa_1^2}\right)^{4/5} + \frac{\kappa_2}{n}\left(\frac{n\iota\kappa_1^2}{\kappa_2}\right)^{1/5} = \frac{5}{4}\left(\frac{\kappa_1^2\,\iota\,\kappa_2^4}{n^4}\right)^{1/5},$$
where we see that the optimal squared bias is of the same order of magnitude as the optimal variance. There is a balancing tradeoff between bias and variance, for the sake of minimizing the AIMSE. See also (a) and (b) of Remark 1.
Solution 3. We first need an estimator of the conditional density of $y \mid x$, then we can calculate its implied expectation. We are given an estimator of the joint density, from which we obtain the estimator of the marginal density
$$\hat{f}_x(u) = \int_{-\infty}^{\infty}\hat{f}(u, v)\,\mathrm{d}v = \frac{1}{n}\sum_{i=1}^{n}\int_{-\infty}^{\infty} L(u - x_i,\, v - y_i)\,\mathrm{d}v = \frac{1}{n}\sum_{i=1}^{n}\int_{-\infty}^{\infty} L(u - x_i,\, t)\,\mathrm{d}t \quad (\text{by the change of variable } t = v - y_i)$$
$$= \frac{1}{n}\sum_{i=1}^{n} L_1(u - x_i) \qquad \left(\text{by the definition } L_1(s) \equiv \int_{-\infty}^{\infty} L(s, t)\,\mathrm{d}t\right).$$
This is the same univariate estimator $\hat{f}_x(u)$ as before.
The estimator of $\mathrm{E}(y \mid x = u)$ is therefore
$$\int_{-\infty}^{\infty} v\,\hat{f}_{y \mid x = u}(v)\,\mathrm{d}v = \frac{\int_{-\infty}^{\infty} v\,\hat{f}(u, v)\,\mathrm{d}v}{\hat{f}_x(u)} = \frac{\int_{-\infty}^{\infty} v\sum_{i=1}^{n} L(u - x_i,\, v - y_i)\,\mathrm{d}v}{\sum_{j=1}^{n} L_1(u - x_j)} = \frac{\sum_{i=1}^{n}\int_{-\infty}^{\infty}(t + y_i)\,L(u - x_i,\, t)\,\mathrm{d}t}{\sum_{j=1}^{n} L_1(u - x_j)}.$$
The result follows from $\int_{-\infty}^{\infty} t\,L(u - x_i,\, t)\,\mathrm{d}t = 0$ and $\int_{-\infty}^{\infty} L(u - x_i,\, t)\,\mathrm{d}t \equiv L_1(u - x_i)$. Notice that the resulting estimator requires only one marginal kernel $L_1$, with only one smoothing parameter, which is something that has not been imposed at the outset.