
3

Delta Method

The delta method consists of using a Taylor expansion to approximate a
random vector of the form φ(Tn) by the polynomial φ(θ) + φ′(θ)(Tn − θ) + ⋯ in Tn − θ. It is a simple but useful method to deduce the limit law
of φ(Tn) − φ(θ) from that of Tn − θ. Applications include the nonrobustness
of the chi-square test for normal variances and variance-stabilizing
transformations.

3.1 Basic Result


Suppose an estimator Tn for a parameter θ is available, but the quantity of interest is φ(θ) for
some known function φ. A natural estimator is φ(Tn). How do the asymptotic properties
of φ(Tn) follow from those of Tn?
A first result is an immediate consequence of the continuous-mapping theorem. If the
sequence Tn converges in probability to θ and φ is continuous at θ, then φ(Tn) converges
in probability to φ(θ).
Of greater interest is a similar question concerning limit distributions. In particular, if
√n(Tn − θ) converges weakly to a limit distribution, is the same true for √n(φ(Tn) − φ(θ))?
If φ is differentiable, then the answer is affirmative. Informally, we have

√n(φ(Tn) − φ(θ)) ≈ φ′(θ) √n(Tn − θ).

If √n(Tn − θ) ⇝ T for some variable T, then we expect that √n(φ(Tn) − φ(θ)) ⇝ φ′(θ)T.
In particular, if √n(Tn − θ) is asymptotically normal N(0, σ²), then we expect that
√n(φ(Tn) − φ(θ)) is asymptotically normal N(0, φ′(θ)²σ²). This is proved in greater
generality in the following theorem.
In the preceding paragraph it is silently understood that Tn is real-valued, but we are more
interested in considering statistics φ(Tn) that are formed out of several more basic statistics.
Consider the situation that Tn = (Tn,1, ..., Tn,k) is vector-valued, and that φ: ℝᵏ ↦ ℝᵐ is
a given function defined at least on a neighbourhood of θ. Recall that φ is differentiable at
θ if there exists a linear map (matrix) φ′_θ: ℝᵏ ↦ ℝᵐ such that

φ(θ + h) − φ(θ) = φ′_θ(h) + o(‖h‖),   h → 0.

All the expressions in this equation are vectors of length m, and ‖h‖ is the Euclidean
norm. The linear map h ↦ φ′_θ(h) is sometimes called a "total derivative," as opposed to


partial derivatives. A sufficient condition for φ to be (totally) differentiable is that all partial
derivatives ∂φⱼ(x)/∂xᵢ exist for x in a neighbourhood of θ and are continuous at θ. (Just
existence of the partial derivatives is not enough.) In any case, the total derivative is found
from the partial derivatives. If φ is differentiable, then it is partially differentiable, and the
derivative map h ↦ φ′_θ(h) is matrix multiplication by the m × k matrix of partial derivatives
φ′_θ = ( ∂φᵢ(θ)/∂xⱼ ), with i indexing rows and j indexing columns.
If the dependence of the derivative φ′_θ on θ is continuous, then φ is called continuously
differentiable.
It is better to think of a derivative as a linear approximation h ↦ φ′_θ(h) to the function
h ↦ φ(θ + h) − φ(θ) than as a set of partial derivatives. Thus the derivative at a point θ
is a linear map. If the range space of φ is the real line (so that the derivative is a horizontal
vector), then the derivative is also called the gradient of the function.
Note that what is usually called the derivative of a function φ: ℝ ↦ ℝ does not com-
pletely correspond to the present derivative. The derivative at a point, usually written φ′(θ),
is written here as φ′_θ. Although φ′(θ) is a number, the second object is identified with the
map h ↦ φ′_θ(h) = φ′(θ) h. Thus in the present terminology the usual derivative function
θ ↦ φ′(θ) is a map from ℝ into the set of linear maps from ℝ to ℝ, not a map from
ℝ to ℝ. Graphically the "affine" approximation h ↦ φ(θ) + φ′_θ(h) is the tangent to the
function φ at θ.

3.1 Theorem. Let φ: D_φ ⊂ ℝᵏ ↦ ℝᵐ be a map defined on a subset of ℝᵏ and dif-
ferentiable at θ. Let Tn be random vectors taking their values in the domain of φ. If
rn(Tn − θ) ⇝ T for numbers rn → ∞, then rn(φ(Tn) − φ(θ)) ⇝ φ′_θ(T). Moreover, the
difference between rn(φ(Tn) − φ(θ)) and φ′_θ(rn(Tn − θ)) converges to zero in probability.

Proof. Because the sequence rn(Tn − θ) converges in distribution, it is uniformly tight and
Tn − θ converges to zero in probability. By the differentiability of φ the remainder function
R(h) = φ(θ + h) − φ(θ) − φ′_θ(h) satisfies R(h) = o(‖h‖) as h → 0. Lemma 2.12 allows us
to replace the fixed h by a random sequence and gives

φ(Tn) − φ(θ) − φ′_θ(Tn − θ) = R(Tn − θ) = o_P(‖Tn − θ‖).

Multiply this left and right with rn, and note that o_P(rn‖Tn − θ‖) = o_P(1) by tightness of
the sequence rn(Tn − θ). This yields the last statement of the theorem. Because matrix
multiplication is continuous, φ′_θ(rn(Tn − θ)) ⇝ φ′_θ(T) by the continuous-mapping theorem.
Apply Slutsky's lemma to conclude that the sequence rn(φ(Tn) − φ(θ)) has the same weak
limit. ∎

A common situation is that √n(Tn − θ) converges to a multivariate normal distribution
N_k(μ, Σ). Then the conclusion of the theorem is that the sequence √n(φ(Tn) − φ(θ))
converges in law to the N_m(φ′_θ μ, φ′_θ Σ (φ′_θ)ᵀ) distribution.
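As a numerical sanity check on this normal case, the following sketch (my own illustration in Python/NumPy, not part of the text; the map φ, the Exp(1) sampling distribution, and the sample sizes are arbitrary choices) compares the empirical covariance of √n(φ(Tn) − φ(θ)) with the delta-method prediction φ′_θ Σ (φ′_θ)ᵀ. For moderate n the two matrices should agree up to simulation error.

```python
import numpy as np

rng = np.random.default_rng(0)

# Tn = (sample mean, sample mean of squares) of an Exp(1) sample, so
# theta = (E X, E X^2) = (1, 2).  The map phi is an arbitrary smooth
# function chosen only for illustration.
def phi(t):
    x, y = t
    return np.array([y - x**2, np.log(y)])

def dphi(t):
    # Jacobian (total derivative) of phi at t
    x, y = t
    return np.array([[-2.0 * x, 1.0],
                     [0.0, 1.0 / y]])

n, reps = 1000, 5000
theta = np.array([1.0, 2.0])

# Asymptotic covariance of sqrt(n)*(Xbar - a1, mean(X^2) - a2) for Exp(1),
# whose raw moments are a_k = k!.
a1, a2, a3, a4 = 1.0, 2.0, 6.0, 24.0
Sigma = np.array([[a2 - a1**2, a3 - a1 * a2],
                  [a3 - a1 * a2, a4 - a2**2]])

draws = np.empty((reps, 2))
for r in range(reps):
    x = rng.exponential(1.0, size=n)
    Tn = np.array([x.mean(), (x**2).mean()])
    draws[r] = np.sqrt(n) * (phi(Tn) - phi(theta))

D = dphi(theta)
print("empirical covariance:\n", np.cov(draws, rowvar=False))
print("delta-method prediction:\n", D @ Sigma @ D.T)
```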

3.2 Example (Sample variance). The sample variance of n observations X₁, ..., Xn
is defined as S² = n⁻¹ Σᵢ₌₁ⁿ (Xᵢ − X̄)² and can be written as φ(X̄, \overline{X²}), with \overline{X²} = n⁻¹ Σᵢ₌₁ⁿ Xᵢ², for the function
φ(x, y) = y − x². (For simplicity of notation, we divide by n rather than n − 1.) Suppose that
S² is based on a sample from a distribution with finite first to fourth moments α₁, α₂, α₃, α₄.
By the multivariate central limit theorem,

√n( X̄ − α₁, \overline{X²} − α₂ ) ⇝ N₂(0, Σ),   where Σ₁₁ = α₂ − α₁², Σ₁₂ = Σ₂₁ = α₃ − α₁α₂, Σ₂₂ = α₄ − α₂².

The map φ is differentiable at the point θ = (α₁, α₂)ᵀ, with derivative φ′_{(α₁,α₂)} = (−2α₁, 1).
Thus if the vector (T₁, T₂)ᵀ possesses the normal distribution in the last display, then

√n( φ(X̄, \overline{X²}) − φ(α₁, α₂) ) ⇝ −2α₁T₁ + T₂.

The latter variable is normally distributed with zero mean and a variance that can be ex-
pressed in α₁, ..., α₄. In case α₁ = 0, this variance is simply α₄ − α₂². The general case
can be reduced to this case, because S² does not change if the observations Xᵢ are replaced
by the centered variables Yᵢ = Xᵢ − α₁. Write μ_k = E Yᵢᵏ for the central moments of the
Xᵢ. Noting that S² = φ(Ȳ, \overline{Y²}) and that φ(μ₁, μ₂) = μ₂ is the variance of the original
observations, we obtain

√n(S² − μ₂) ⇝ N(0, μ₄ − μ₂²).

In view of Slutsky's lemma, the same result is valid for the unbiased version n/(n − 1) S²
of the sample variance, because √n(n/(n − 1) − 1) → 0. □
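A short simulation (again my own sketch, not part of the text; the Gamma(2, 1) sampling distribution and the sample sizes are arbitrary) can be used to check the limit √n(S² − μ₂) ⇝ N(0, μ₄ − μ₂²) numerically.

```python
import numpy as np

rng = np.random.default_rng(1)

# Sampling distribution chosen arbitrarily: Gamma(shape=2, scale=1),
# for which mu2 = shape = 2 and mu4 = 3*shape^2 + 6*shape = 24.
shape = 2.0
mu2 = shape
mu4 = 3 * shape**2 + 6 * shape

n, reps = 500, 20000
s2 = np.empty(reps)
for r in range(reps):
    x = rng.gamma(shape, 1.0, size=n)
    s2[r] = x.var()          # divisor n, as in the example

print("empirical variance of sqrt(n)*(S^2 - mu2):", n * s2.var())
print("delta-method prediction mu4 - mu2^2:      ", mu4 - mu2**2)
```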

3.3 Example (Level of the chi-square test). As an application of the preceding example,
consider the chi-square test for testing variance. Normal theory prescribes to reject the null
hypothesis H₀: μ₂ ≤ 1 for values of nS² exceeding the upper α point χ²_{n−1,α} of the χ²_{n−1}
distribution. If the observations are sampled from a normal distribution, then the test has
exactly level α. Is this still approximately the case if the underlying distribution is not
normal? Unfortunately, the answer is negative.
For large values of n, this can be seen with the help of the preceding result. The central
limit theorem and the preceding example yield the two statements

( χ²_{n−1} − (n − 1) ) / √(2n − 2) ⇝ N(0, 1),   √n( S²/μ₂ − 1 ) ⇝ N(0, κ + 2),

where κ = μ₄/μ₂² − 3 is the kurtosis of the underlying distribution. The first statement
implies that (χ²_{n−1,α} − (n − 1)) / √(2n − 2) converges to the upper α point z_α of the standard
normal distribution. Thus the level of the chi-square test satisfies

P_{μ₂=1}( nS² > χ²_{n−1,α} ) = P_{μ₂=1}( √n(S²/μ₂ − 1) > (χ²_{n−1,α} − n)/√n ) → 1 − Φ( z_α √2 / √(κ + 2) ).

The asymptotic level reduces to 1 − Φ(z_α) = α if and only if the kurtosis of the underlying
distribution is 0. This is the case for normal distributions. On the other hand, heavy-tailed
distributions have a much larger kurtosis. If the kurtosis of the underlying distribution is
"close to" infinity, then the asymptotic level is close to 1 − Φ(0) = 1/2. We conclude that
the level of the chi-square test is nonrobust against departures from normality that affect the
value of the kurtosis. At least this is true if the critical values of the test are taken from
the chi-square distribution with n − 1 degrees of freedom. If, instead, we would use a
normal approximation to the distribution of √n(S²/μ₂ − 1), the problem would not arise,
provided the asymptotic variance κ + 2 is estimated accurately. Table 3.1 gives the level
for two distributions with slightly heavier tails than the normal distribution. □

Table 3.1. Level of the test that rejects if nS²/μ₂ exceeds the 0.95 quantile
of the χ²₁₉ distribution.

Law                              Level
Laplace                          0.12
0.95 N(0, 1) + 0.05 N(0, 9)      0.12

Note: Approximations based on simulation of 10,000 samples.
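The nonrobustness is easy to reproduce numerically. The sketch below (my own illustration, not part of the text; the sample size n = 20, the Laplace alternative, and the simulation sizes are arbitrary choices) estimates the actual level of the nominal 5% chi-square test under normal and non-normal data.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n, reps, alpha = 20, 20000, 0.05
crit = stats.chi2.ppf(1 - alpha, df=n - 1)     # upper-alpha point of chi^2_{n-1}

def estimated_level(draw):
    """Monte Carlo estimate of P(n S^2 > crit) when the true variance mu2 = 1."""
    rejections = 0
    for _ in range(reps):
        x = draw(n)
        rejections += n * x.var() > crit       # x.var() uses divisor n
    return rejections / reps

# Samplers scaled so that the true variance equals 1.
normal = lambda m: rng.normal(size=m)
laplace = lambda m: rng.laplace(scale=1 / np.sqrt(2), size=m)   # var = 2*scale^2 = 1

print("normal data :", estimated_level(normal))    # close to the nominal 0.05
print("Laplace data:", estimated_level(laplace))   # noticeably larger (kurtosis 3)
```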

In the preceding example the asymptotic distribution of √n(S² − σ²) was obtained by the
delta method. Actually, it can also and more easily be derived by a direct expansion. Write

√n(S² − σ²) = √n( n⁻¹ Σᵢ₌₁ⁿ (Xᵢ − μ)² − σ² ) − √n (X̄ − μ)².

The second term converges to zero in probability; the first term is asymptotically normal
by the central limit theorem. The whole expression is asymptotically normal by Slutsky's
lemma.
Thus it is not always a good idea to apply general theorems. However, in many exam-
ples the delta method is a good way to package the mechanics of Taylor expansions in a
transparent way.

3.4 Example. Consider the joint limit distribution of the sample variance S² and the
t-statistic X̄/S. Again for the limit distribution it does not make a difference whether we
use a factor n or n − 1 to standardize S². For simplicity we use n. Then (S², X̄/S) can be
written as φ(X̄, \overline{X²}) for the map φ: ℝ² ↦ ℝ² given by

φ(x, y) = ( y − x², x / (y − x²)^{1/2} ).

The joint limit distribution of √n(X̄ − α₁, \overline{X²} − α₂) is derived in the preceding example. The
map φ is differentiable at θ = (α₁, α₂)ᵀ provided σ² = α₂ − α₁² is positive, with derivative

φ′_{(α₁,α₂)} = ( −2α₁               1
                1/σ + α₁²/σ³    −α₁/(2σ³) ).

It follows that the sequence √n(S² − σ², X̄/S − α₁/σ) is asymptotically bivariate normally
distributed, with zero mean and covariance matrix φ′_{(α₁,α₂)} Σ (φ′_{(α₁,α₂)})ᵀ, where Σ is the
asymptotic covariance matrix of √n(X̄ − α₁, \overline{X²} − α₂) found in the preceding example.
It is easy but uninteresting to compute this explicitly. □
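For readers who do want the explicit matrix, a small symbolic computation can carry out the product φ′ Σ (φ′)ᵀ. This is my own sketch (using sympy; all names are arbitrary), not part of the text.

```python
import sympy as sp

a1, a2, a3, a4 = sp.symbols('alpha1 alpha2 alpha3 alpha4')
sigma = sp.sqrt(a2 - a1**2)

# Asymptotic covariance of sqrt(n)*(Xbar - a1, mean(X^2) - a2) from Example 3.2
Sigma = sp.Matrix([[a2 - a1**2, a3 - a1 * a2],
                   [a3 - a1 * a2, a4 - a2**2]])

# Derivative of phi(x, y) = (y - x^2, x / sqrt(y - x^2)) at (a1, a2)
D = sp.Matrix([[-2 * a1, 1],
               [1 / sigma + a1**2 / sigma**3, -a1 / (2 * sigma**3)]])

# Covariance matrix of the limit of sqrt(n)*(S^2 - sigma^2, Xbar/S - a1/sigma)
print((D * Sigma * D.T).applyfunc(sp.simplify))
```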

3.5 Example (Skewness). The sample skewness of a sample X₁, ..., Xn is defined as

lₙ = ( n⁻¹ Σᵢ₌₁ⁿ (Xᵢ − X̄)³ ) / ( n⁻¹ Σᵢ₌₁ⁿ (Xᵢ − X̄)² )^{3/2}.

Not surprisingly it converges in probability to the skewness of the underlying distribution,
defined as the quotient λ = μ₃/σ³ of the third central moment and the third power of the
standard deviation of one observation. The skewness of a symmetric distribution, such
as the normal distribution, equals zero, and the sample skewness may be used to test this
aspect of normality of the underlying distribution. For large samples a critical value may
be determined from the normal approximation for the sample skewness.
The sample skewness can be written as φ(X̄, \overline{X²}, \overline{X³}) for the function φ given by

φ(a, b, c) = ( c − 3ab + 2a³ ) / ( b − a² )^{3/2}.

The sequence √n(X̄ − α₁, \overline{X²} − α₂, \overline{X³} − α₃) is asymptotically mean-zero normal by the
central limit theorem, provided E X₁⁶ is finite. The value φ(α₁, α₂, α₃) is exactly the popu-
lation skewness. The function φ is differentiable at the point (α₁, α₂, α₃) and application of
the delta method is straightforward. We can save work by noting that the sample skewness
is location and scale invariant. With Yᵢ = (Xᵢ − α₁)/σ, the skewness can also be written as
φ(Ȳ, \overline{Y²}, \overline{Y³}). With λ = μ₃/σ³ denoting the skewness of the underlying distribution, the
Yᵢ satisfy

√n( Ȳ, \overline{Y²} − 1, \overline{Y³} − λ ) ⇝ N₃(0, Σ),   where

Σ = ( 1          λ            κ + 3
      λ          κ + 2        μ₅/σ⁵ − λ
      κ + 3      μ₅/σ⁵ − λ    μ₆/σ⁶ − λ² ).

The derivative of φ at the point (0, 1, λ) equals (−3, −3λ/2, 1). Hence, if T possesses the
normal distribution in the display, then √n(lₙ − λ) is asymptotically normally distributed with
mean zero and variance equal to var(−3T₁ − 3λT₂/2 + T₃). If the underlying distribution
is normal, then λ = μ₅ = 0, κ = 0, and μ₆/σ⁶ = 15. In that case the sample skewness is
asymptotically N(0, 6)-distributed.
An approximate level α test for normality based on the sample skewness could be to
reject normality if √n |lₙ| > √6 z_{α/2}. Table 3.2 gives the level of this test for different
values of n. □

Table 3.2. Level of the test that rejects if √n |lₙ|/√6 exceeds the
0.975 quantile of the normal distribution, in the case that the
observations are normally distributed.

n      Level
10     0.02
20     0.03
30     0.03
50     0.05

Note: Approximations based on simulation of 10,000 samples.
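The limit variance 6 can be checked directly from the ingredients of the example: with the normal values λ = μ₅ = κ = 0 and μ₆/σ⁶ = 15, the quadratic form gᵀΣg with g = (−3, −3λ/2, 1) equals 6. A minimal numerical check (my own sketch, not part of the text):

```python
import numpy as np

# Moments of a standardized normal observation Y (values stated in the example)
lam, kappa = 0.0, 0.0        # skewness and kurtosis
m5, m6 = 0.0, 15.0           # mu5/sigma^5 and mu6/sigma^6

# Covariance matrix of sqrt(n)*(Ybar, mean(Y^2) - 1, mean(Y^3) - lambda)
Sigma = np.array([
    [1.0,         lam,         kappa + 3.0],
    [lam,         kappa + 2.0, m5 - lam],
    [kappa + 3.0, m5 - lam,    m6 - lam**2],
])

# Derivative of phi(a, b, c) = (c - 3ab + 2a^3) / (b - a^2)^{3/2} at (0, 1, lambda)
g = np.array([-3.0, -1.5 * lam, 1.0])

print("asymptotic variance of sqrt(n) * l_n:", g @ Sigma @ g)   # prints 6.0
```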

3.2 Variance-Stabilizing Transformations


Given a sequence of statistics Tn with √n(Tn − θ) ⇝ N(0, σ²(θ)) for a range of values of
θ, asymptotic confidence intervals for θ are given by

( Tn − z_α σ(θ)/√n , Tn + z_α σ(θ)/√n ).

These are asymptotically of level 1 − 2α in that the probability that θ is covered by
the interval converges to 1 − 2α for every θ. Unfortunately, as stated previously, these
intervals are useless, because of their dependence on the unknown θ. One solution is to
replace the unknown standard deviations σ(θ) by estimators. If the sequence of estimators
is chosen consistent, then the resulting confidence interval still has asymptotic level 1 − 2α.
Another approach is to use a variance-stabilizing transformation, which often leads to a
better approximation.
The idea is that no problem arises if the asymptotic variances σ²(θ) are independent of θ.
Although this fortunate situation is rare, it is often possible to transform the parameter into
a different parameter η = φ(θ), for which this idea can be applied. The natural estimator
for η is φ(Tn). If φ is differentiable, then

√n( φ(Tn) − φ(θ) ) ⇝ N(0, φ′(θ)² σ²(θ)).

For φ chosen such that φ′(θ)σ(θ) ≡ 1, the asymptotic variance is constant and finding an
asymptotic confidence interval for η = φ(θ) is easy. The solution

φ(θ) = ∫ dθ / σ(θ)

is a variance-stabilizing transformation. If it is well defined, then it is automatically


monotone, so that a confidence interval for η can be transformed back into a confidence
interval for θ .

3.6 Example (Correlation). Let (X₁, Y₁), ..., (Xn, Yn) be a sample from a bivariate
normal distribution with correlation coefficient ρ. The sample correlation coefficient is
defined as

rn = Σᵢ₌₁ⁿ (Xᵢ − X̄)(Yᵢ − Ȳ) / ( Σᵢ₌₁ⁿ (Xᵢ − X̄)² Σᵢ₌₁ⁿ (Yᵢ − Ȳ)² )^{1/2}.

With the help of the delta method, it is possible to derive that √n(rn − ρ) is asymptotically
zero-mean normal, with variance depending on the (mixed) third and fourth moments of
(X, Y). This is true for general underlying distributions, provided the fourth moments exist.
Under the normality assumption the asymptotic variance can be expressed in the correlation
of X and Y. Tedious algebra gives

√n(rn − ρ) ⇝ N(0, (1 − ρ²)²).

It does not work very well to base an asymptotic confidence interval directly on this result.

Table 3.3. Coverage probability of the asymptotic 95% confidence interval
for the correlation coefficient, for two values of n and five different values
of the true correlation ρ.

n      ρ = 0   ρ = 0.2   ρ = 0.4   ρ = 0.6   ρ = 0.8
15     0.92    0.92      0.92      0.93      0.92
25     0.93    0.94      0.94      0.94      0.94

Note: Approximations based on simulation of 10,000 samples.

Figure 3.1. Histogram of 1000 sample correlation coefficients, based on 1000 independent
samples of the bivariate normal distribution with correlation 0.6, and histogram of the
arctanh of these values.

The transformation

φ(ρ) = ∫ dρ / (1 − ρ²) = ½ log( (1 + ρ)/(1 − ρ) ) = arctanh ρ

is variance-stabilizing. Thus, the sequence √n(arctanh rn − arctanh ρ) converges to a
standard normal distribution for every ρ. This leads to the asymptotic confidence interval
for the correlation coefficient ρ given by

( tanh(arctanh rn − z_α/√n), tanh(arctanh rn + z_α/√n) ).

Table 3.3 gives an indication of the accuracy of this interval. Besides stabilizing the
variance, the arctanh transformation has the benefit of symmetrizing the distribution of the
sample correlation coefficient (which is perhaps of greater importance), as can be seen in
Figure 3.1. □
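A direct implementation of this interval is short. The sketch below is my own illustration (the function name, the simulated data, and the use of scipy for the normal quantile are all my choices); it follows the document's convention that z_α is the upper α point, so the interval has asymptotic coverage 1 − 2α.

```python
import numpy as np
from scipy import stats

def correlation_ci(x, y, alpha=0.025):
    """arctanh-based confidence interval for the correlation coefficient.

    z_alpha is the upper-alpha standard normal point, so the asymptotic
    coverage is 1 - 2*alpha (95% for alpha = 0.025), as in Example 3.6.
    """
    x, y = np.asarray(x), np.asarray(y)
    n = len(x)
    r = np.corrcoef(x, y)[0, 1]
    half = stats.norm.ppf(1 - alpha) / np.sqrt(n)
    return np.tanh(np.arctanh(r) - half), np.tanh(np.arctanh(r) + half)

# Example use with simulated bivariate normal data, true correlation 0.6
# (the setting of Figure 3.1); all numbers here are illustrative only.
rng = np.random.default_rng(3)
xy = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.6], [0.6, 1.0]], size=25)
print(correlation_ci(xy[:, 0], xy[:, 1]))
```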

*3.3 Higher-Order Expansions


To package a simple idea in a theorem has the danger of obscuring the idea. The delta
method is based on a Taylor expansion of order one. Sometimes a problem cannot be
exactly forced into the framework described by the theorem, but the principle of a Taylor
expansion is still valid.

In the one-dimensional case, a Taylor expansion applied to a statistic Tn has the form

φ(Tn) = φ(θ) + (Tn − θ) φ′(θ) + ½ (Tn − θ)² φ″(θ) + ⋯.

Usually the linear term (Tn − θ)φ′(θ) is of higher order than the remainder, and thus
determines the order at which φ(Tn) − φ(θ) converges to zero: the same order as Tn − θ.
Then the approach of the preceding section gives the limit distribution of φ(Tn) − φ(θ). If
φ′(θ) = 0, this approach is still valid but not of much interest, because the resulting limit
distribution is degenerate at zero. Then it is more informative to multiply the difference
φ(Tn) − φ(θ) by a higher rate and obtain a nondegenerate limit distribution. Looking at
the Taylor expansion, we see that the linear term disappears if φ′(θ) = 0, and we expect
that the quadratic term determines the limit behavior of φ(Tn).

3.7 Example. Suppose that √n X̄ converges weakly to a standard normal distribution.
Because the derivative of x ↦ cos x is zero at x = 0, the standard delta method of the
preceding section yields that √n(cos X̄ − cos 0) converges weakly to 0. It should be
concluded that √n is not the right norming rate for the random sequence cos X̄ − 1. A
more informative statement is that −2n(cos X̄ − 1) converges in distribution to a chi-square
distribution with one degree of freedom. The explanation is that

cos X̄ − cos 0 = (X̄ − 0) · 0 − ½ (X̄ − 0)² + ⋯.

That the remainder term is negligible after multiplication with n can be shown along the
same lines as the proof of Theorem 3.1. The sequence nX̄² converges in law to a χ²₁
distribution by the continuous-mapping theorem; the sequence −2n(cos X̄ − 1) has the
same limit, by Slutsky's lemma. □
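A quick simulation (my own check, not part of the text; the sample size and number of replications are arbitrary) illustrates both statements: √n(cos X̄ − 1) collapses to zero, while −2n(cos X̄ − 1) has approximately the mean 1 and variance 2 of a χ²₁ variable.

```python
import numpy as np

rng = np.random.default_rng(4)
n, reps = 500, 5000

# X_i with mean 0 and variance 1, so sqrt(n) * Xbar is approximately N(0, 1).
xbar = rng.normal(size=(reps, n)).mean(axis=1)

first_order = np.sqrt(n) * (np.cos(xbar) - 1.0)   # degenerates at zero
second_order = -2.0 * n * (np.cos(xbar) - 1.0)    # approximately chi^2_1

print("sd of sqrt(n)(cos Xbar - 1):   ", first_order.std())
print("mean, var of -2n(cos Xbar - 1):", second_order.mean(), second_order.var())
# A chi^2_1 variable has mean 1 and variance 2.
```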

A more complicated situation arises if the statistic Tn is higher-dimensional with coor-
dinates of different orders of magnitude. For instance, for a real-valued function φ,

φ(Tn) − φ(θ) = Σⱼ (Tn,j − θⱼ) ∂φ/∂xⱼ(θ) + ½ Σᵢ Σⱼ (Tn,i − θᵢ)(Tn,j − θⱼ) ∂²φ/∂xᵢ∂xⱼ(θ) + ⋯.

If the sequences Tn,j − θⱼ are of different order, then it may happen, for instance, that the
linear part involving Tn,j − θⱼ is of the same order as the quadratic part involving (Tn,i − θᵢ)².
Thus, it is necessary to determine carefully the rate of all terms in the expansion, and to
rearrange these in decreasing order of magnitude, before neglecting the "remainder."

*3.4 Uniform Delta Method


Sometimes we wish to prove the asymptotic normality of a sequence √n(φ(Tn) − φ(θn))
for centering vectors θn changing with n, rather than a fixed vector. If √n(θn − θ) → h for
certain vectors θ and h, then this can be handled easily by decomposing

√n( φ(Tn) − φ(θn) ) = √n( φ(Tn) − φ(θ) ) − √n( φ(θn) − φ(θ) ).

Several applications of Slutsky's lemma and the delta method yield as limit in law the vector
φ′_θ(T + h) − φ′_θ(h) = φ′_θ(T), if T is the limit in distribution of √n(Tn − θn). For θn → θ
at a slower rate, this argument does not work. However, the same result is true under a
slightly stronger differentiability assumption on φ.

3.8 Theorem. Let φ: ℝᵏ ↦ ℝᵐ be a map defined and continuously differentiable in
a neighborhood of θ. Let Tn be random vectors taking their values in the domain of
φ. If rn(Tn − θn) ⇝ T for vectors θn → θ and numbers rn → ∞, then rn(φ(Tn) −
φ(θn)) ⇝ φ′_θ(T). Moreover, the difference between rn(φ(Tn) − φ(θn)) and φ′_θ(rn(Tn − θn))
converges to zero in probability.

Proof. It suffices to prove the last assertion. Because convergence in probability to zero
of vectors is equivalent to convergence to zero of the components separately, it is no loss
of generality to assume that φ is real-valued. For 0 ≤ t ≤ 1 and fixed h, define gn(t) =
φ(θn + th). For sufficiently large n and sufficiently small h, both θn and θn + h are in a
ball around θ inside the neighborhood on which φ is differentiable. Then gn: [0, 1] ↦ ℝ is
continuously differentiable with derivative gn′(t) = φ′_{θn+th}(h). By the mean-value theorem,
gn(1) − gn(0) = gn′(ξn) for some 0 ≤ ξn ≤ 1. In other words, the remainder
Rn(h) = φ(θn + h) − φ(θn) − φ′_θ(h) satisfies

Rn(h) = φ′_{θn + ξn h}(h) − φ′_θ(h).

By the continuity of the map θ ↦ φ′_θ, there exists for every ε > 0 a δ > 0 such that
‖φ′_ζ(h) − φ′_θ(h)‖ < ε‖h‖ for every ζ with ‖ζ − θ‖ < δ and every h. For sufficiently large n and
‖h‖ < δ/2, the vectors θn + ξn h are within distance δ of θ, so that the norm ‖Rn(h)‖ of the
right side of the preceding display is bounded by ε‖h‖. Thus, for any η > 0,

P( rn‖Rn(Tn − θn)‖ > η ) ≤ P( ‖Tn − θn‖ ≥ δ/2 ) + P( ε rn‖Tn − θn‖ > η ).

The first term converges to zero as n → ∞. The second term can be made arbitrarily small
by choosing ε small. ∎

*3.5 Moments
So far we have discussed the stability of convergence in distribution under transformations.
We can pose the same problem regarding moments: Can an expansion for the moments of
φ(Tn) − φ(θ) be derived from a similar expansion for the moments of Tn − θ? In principle
the answer is affirmative, but unlike in the distributional case, in which a simple derivative
of φ is enough, global regularity conditions on φ are needed to argue that the remainder
terms are negligible.
One possible approach is to apply the distributional delta method first, thus yielding the
qualitative asymptotic behavior. Next, the convergence of the moments of φ(Tn) − φ(θ)
(or a remainder term) is a matter of uniform integrability, in view of Lemma 2.20. If
φ is uniformly Lipschitz, then this uniform integrability follows from the corresponding
uniform integrability of Tn − θ. If φ has an unbounded derivative, then the connection
between moments of φ(Tn) − φ(θ) and Tn − θ is harder to make, in general.

Notes
The Delta method belongs to the folklore of statistics. It is not entirely trivial; proofs are
sometimes based on the mean-value theorem and then require continuous differentiability in
a neighborhood. A generalization to functions on infinite-dimensional spaces is discussed
in Chapter 20.

PROBLEMS
1. Find the joint limit distribution of (√n(X̄ − μ), √n(S² − σ²)) if X̄ and S² are based on a sample
of size n from a distribution with finite fourth moment. Under what condition on the underlying
distribution are √n(X̄ − μ) and √n(S² − σ²) asymptotically independent?
2. Find the asymptotic distribution of √n(r − ρ) if r is the correlation coefficient of a sample of n
bivariate vectors with finite fourth moments. (This is quite a bit of work. It helps to assume that
the mean and the variance are equal to 0 and 1, respectively.)
3. Investigate the asymptotic robustness of the level of the t-test for testing the mean that rejects
H₀: μ ≤ 0 if √n X̄/S is larger than the upper α quantile of the t_{n−1} distribution.
4. Find the limit distribution of the sample kurtosis kn = n⁻¹ Σᵢ₌₁ⁿ (Xᵢ − X̄)⁴/S⁴ − 3, and design an
asymptotic level α test for normality based on kn. (Warning: At least 500 observations are needed
to make the normal approximation work in this case.)
5. Design an asymptotic level α test for normality based on the sample skewness and kurtosis jointly.
6. Let X₁, ..., Xn be i.i.d. with expectation μ and variance 1. Find constants an and bn such that an(X̄n² − bn)
converges in distribution if μ = 0 or μ ≠ 0.
7. Let X₁, ..., Xn be a random sample from the Poisson distribution with mean θ. Find a variance-
stabilizing transformation for the sample mean, and construct a confidence interval for θ based on
this.
8. Let X₁, ..., Xn be i.i.d. with expectation 1 and finite variance. Find the limit distribution of
√n(1/X̄n − 1). If the random variables are sampled from a density f that is bounded and strictly
positive in a neighborhood of zero, show that E|1/X̄n| = ∞ for every n. (The density of X̄n is
bounded away from zero in a neighborhood of zero for every n.)
