Renyi's Entropy

History: Alfred Renyi was looking for the most general definition of information measures that would preserve the additivity for independent events and be compatible with the axioms of probability. He started from the theory of generalized means: for a continuous, strictly monotonic function $g$, the mean of the quantities $x_k$ with weights $p_k$ is

$$ g^{-1}\!\left(\sum_{k=1}^{N} p_k\, g(x_k)\right) $$
Applying this definition to the information $I(P)$, with elementary informations $I_k = -\log p_k$, we get

$$ I(P) = g^{-1}\!\left(\sum_{k=1}^{N} p_k\, g(I_k)\right) $$
When the postulate of additivity for independent events is applied, we get just two possible $g(x)$:

$$ g(x) = cx $$

$$ g(x) = c\,2^{(1-\alpha)x} $$
The first form gives Shannon information and the second gives

$$ I_\alpha(P) = \frac{1}{1-\alpha}\log\!\left(\sum_{k=1}^{N} p_k^{\alpha}\right) $$

for nonnegative $\alpha \neq 1$. This gives a parametric family of information measures that are today called Renyi's entropies.
In the limit $\alpha \to 1$, Renyi's entropy recovers Shannon's entropy; applying L'Hopital's rule to the indeterminate ratio,

$$ \lim_{\alpha\to 1} H_\alpha(X) = \lim_{\alpha\to 1}\frac{1}{1-\alpha}\log\sum_{k=1}^{N} p_k^{\alpha} = \lim_{\alpha\to 1}\frac{\sum_{k=1}^{N} p_k^{\alpha}\log p_k \,\Big/ \sum_{k=1}^{N} p_k^{\alpha}}{-1} = -\sum_{k=1}^{N} p_k\log p_k = H_S(X) $$
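As a quick numerical check (a minimal Python sketch, my own illustration rather than anything from the original notes), we can evaluate $H_\alpha$ for a PMF and watch it approach the Shannon entropy as $\alpha \to 1$:

```python
import numpy as np

def renyi_entropy(p, alpha):
    """Renyi entropy (in nats) of a PMF p, for alpha != 1."""
    p = np.asarray(p, dtype=float)
    return np.log(np.sum(p ** alpha)) / (1.0 - alpha)

def shannon_entropy(p):
    p = np.asarray(p, dtype=float)
    return -np.sum(p * np.log(p))

p = [0.5, 0.3, 0.2]
for alpha in (0.5, 0.9, 0.99, 1.01, 2.0):
    print(f"alpha={alpha:5.2f}  H_alpha={renyi_entropy(p, alpha):.4f}")
print(f"Shannon H_S = {shannon_entropy(p):.4f}")  # H_alpha -> H_S as alpha -> 1
```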
Meaning of α

If we compare the two definitions we see that instead of weighting the $\log p_k$ by the probabilities, here the log is external to the sum and its argument is the $\alpha$ power of the PMF, i.e.

$$ H_\alpha(X) = \frac{1}{1-\alpha}\log V_\alpha(X) = -\log\sqrt[\alpha-1]{V_\alpha(X)} = -\log\sqrt[\alpha-1]{E\!\left[p^{\alpha-1}(X)\right]}, \qquad V_\alpha(X) = \sum_k p_k^{\alpha} $$
A geometric view will help here. All the PMFs of a random variable with $N$ outcomes live on what is called the simplex in an $N$-dimensional space.

[Figure: the probability simplex for $N = 2$ and $N = 3$; a PMF $p = (p_1, p_2)$ is a point on the simplex, and $\sum_{k=1}^{N} p_k^{\alpha} = \|p\|_\alpha^{\alpha}$ is the entropy $\alpha$-norm.]
The norm measures exactly the distance of the PMF to the origin, and $\alpha$ designates which $\ell$-norm is used (the Euclidean norm is $\ell = 2$). Changing $\alpha$ changes the metric distance in the simplex.

[Figure: three copies of the simplex with vertex $[0\;0\;1]$, showing how the norm contours change with $\alpha$.]
Properties of Renyi's entropies
Renyi's quadratic entropy

We will be using heavily Renyi's entropy with $\alpha = 2$, called the quadratic entropy:

$$ H_2(X) = -\log\!\left(\sum_k p_k^2\right) $$

It has been used in physics, in signal processing, and in economics. We also want to stress that the argument of the log, the 2-norm of the PMF $V_2(X)$, has meaning in itself. In fact it is $E[p(X)]$, or, if we make the change of variables $\xi_k = p(x_k)$, it is the mean of the transformed variable when the PMF is the transformation.

[Figure: a PDF $p(x)$ with the mean of $X$ marked on the horizontal axis and the mean of $p(X)$ on the vertical axis.]
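To make the "mean of the transformed variable" reading concrete, here is a small sketch (my own illustration) showing that $V_2(X) = \sum_k p_k^2$ is exactly the expected value of the PMF evaluated at the samples of $X$:

```python
import numpy as np

rng = np.random.default_rng(0)
p = np.array([0.5, 0.3, 0.2])      # PMF over the outcomes {0, 1, 2}

V2 = np.sum(p ** 2)                # V_2(X) = sum_k p_k^2
H2 = -np.log(V2)                   # Renyi quadratic entropy

x = rng.choice(len(p), size=200_000, p=p)   # samples of X
V2_as_mean = np.mean(p[x])                  # empirical E[p(X)]

print(V2, V2_as_mean, H2)          # the two V_2 values agree closely
```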
Extensions to continuous variables

It is possible to show that Renyi's entropy measure extends to continuous r.v. and reads

$$ H_\alpha(X) = \lim_{n\to\infty}\left(I_\alpha(P_n) - \log n\right) = \frac{1}{1-\alpha}\log\int p^{\alpha}(x)\,dx $$

and in particular

$$ H_2(X) = -\log\int p^2(x)\,dx $$
Estimation of Renyi's quadratic entropy

We will use here ideas from density estimation called kernel density estimation. This is an old method (Rosenblatt) that is now called Parzen density estimation because Parzen proved many important statistical properties of the estimator.

The idea is very simple: place a kernel over the samples and sum with proper normalization, i.e.

$$ \hat{p}_X(x) = \frac{1}{N\sigma}\sum_{i=1}^{N}\kappa\!\left(\frac{x - x_i}{\sigma}\right) $$

where the kernel $\kappa$ satisfies

1. $\sup_x \kappa(x) < \infty$
2. $\int_{\mathbb{R}} \kappa(x)\,dx < \infty$
3. $\lim_{x\to\infty} x\,\kappa(x) = 0$
4. $\kappa(x) \ge 0, \quad \int_{\mathbb{R}} \kappa(x)\,dx = 1$

Parzen proved that the estimator is asymptotically unbiased and consistent, with good efficiency.

The free parameter $\sigma$ is called the kernel size. Normally a Gaussian is used, and $\sigma$ becomes its standard deviation.
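A minimal Parzen sketch in Python (a hedged illustration with a Gaussian kernel; the function name is mine):

```python
import numpy as np

def parzen_pdf(x, samples, sigma):
    """Parzen estimate of p_X at the points x, Gaussian kernel of size sigma."""
    x = np.atleast_1d(x)[:, None]             # query points as a column
    u = (x - samples[None, :]) / sigma        # standardized distances
    k = np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)
    return k.mean(axis=1) / sigma             # (1/(N*sigma)) * sum of kernels

rng = np.random.default_rng(1)
samples = rng.normal(0.0, 1.0, size=500)
grid = np.linspace(-4, 4, 9)
print(parzen_pdf(grid, samples, sigma=0.3))   # roughly the N(0,1) density
```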
But notice that here we are not interested in estimating the PDF, which is a function; we are interested in a single number, the 2-norm of the PDF. For a Gaussian kernel, substituting the estimator yields immediately

$$
\begin{aligned}
\hat{H}_2(X) &= -\log\int_{-\infty}^{\infty}\left(\frac{1}{N}\sum_{i=1}^{N} G_\sigma(x - x_i)\right)^{2} dx \\
&= -\log\frac{1}{N^2}\int_{-\infty}^{\infty}\left(\sum_{i=1}^{N}\sum_{j=1}^{N} G_\sigma(x - x_j)\,G_\sigma(x - x_i)\right) dx \\
&= -\log\frac{1}{N^2}\sum_{i=1}^{N}\sum_{j=1}^{N}\int_{-\infty}^{\infty} G_\sigma(x - x_j)\,G_\sigma(x - x_i)\,dx \\
&= -\log\!\left(\frac{1}{N^2}\sum_{i=1}^{N}\sum_{j=1}^{N} G_{\sigma\sqrt{2}}(x_j - x_i)\right)
\end{aligned}
$$

We call this estimator for $V_2(X)$ the Information Potential (IP).

Let us look at the derivation: we never had to compute the integral explicitly, since the integral of the product of two Gaussians is the value of a Gaussian (with a larger kernel size) evaluated at the difference of the arguments.

Let us also look at the expression: the algorithm is $O(N^2)$, and there is a free parameter $\sigma$ that the user has to select from the data. Cross-validation or Silverman's rule suffices for most cases:

$$ \sigma_{opt} = \sigma_X\left(4N^{-1}(2d+1)^{-1}\right)^{\frac{1}{d+4}} $$
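A direct $O(N^2)$ sketch of the IP and the quadratic entropy estimator (an illustration under the Gaussian-kernel assumption; the helper names are mine):

```python
import numpy as np

def gauss(d, s):
    """Gaussian kernel G_s evaluated at the differences d."""
    return np.exp(-d**2 / (2 * s**2)) / (s * np.sqrt(2 * np.pi))

def information_potential(x, sigma):
    """IP: (1/N^2) sum_ij G_{sigma*sqrt(2)}(x_j - x_i)."""
    d = x[:, None] - x[None, :]                  # all pairwise differences
    return gauss(d, sigma * np.sqrt(2)).mean()   # mean over the N^2 pairs

def h2_hat(x, sigma):
    return -np.log(information_potential(x, sigma))

rng = np.random.default_rng(2)
x = rng.normal(0, 1, 1000)
d = 1                                            # dimensionality for Silverman
sigma_opt = x.std() * (4 / ((2 * d + 1) * len(x))) ** (1 / (d + 4))
print(h2_hat(x, sigma_opt))  # true H2 of N(0,1) is log(2*sqrt(pi)) ~ 1.2655
```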
Extended Estimator for Renyi's Entropy

We can come up with still another estimator for Renyi's entropy of any order,

$$ H_\alpha(X) \stackrel{\Delta}{=} \frac{1}{1-\alpha}\log\int_{-\infty}^{\infty} p_X^{\alpha}(x)\,dx = \frac{1}{1-\alpha}\log E_X\!\left[p_X^{\alpha-1}(X)\right] $$

by approximating the expected value by the empirical mean

$$ H_\alpha(X) \approx \frac{1}{1-\alpha}\log\frac{1}{N}\sum_{j=1}^{N} p_X^{\alpha-1}(x_j) $$

to yield

$$ \hat{H}_\alpha(X) = \frac{1}{1-\alpha}\log\frac{1}{N}\sum_{j=1}^{N}\left(\frac{1}{N}\sum_{i=1}^{N}\kappa_\sigma(x_j - x_i)\right)^{\alpha-1} = \frac{1}{1-\alpha}\log\frac{1}{N^{\alpha}}\sum_{j=1}^{N}\left(\sum_{i=1}^{N}\kappa_\sigma(x_j - x_i)\right)^{\alpha-1} $$

We will now present a set of properties of these estimators.
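A sketch of this plug-in estimator for arbitrary $\alpha$ (Gaussian kernel assumed; my own naming). For $\alpha = 2$ it differs from the IP only in the kernel size, as Property 2.1 below states:

```python
import numpy as np

def h_alpha_hat(x, sigma, alpha):
    """Plug-in Renyi entropy estimator of order alpha (alpha != 1)."""
    d = x[:, None] - x[None, :]
    k = np.exp(-d**2 / (2 * sigma**2)) / (sigma * np.sqrt(2 * np.pi))
    p_hat = k.mean(axis=1)                 # Parzen estimate at each sample x_j
    return np.log(np.mean(p_hat ** (alpha - 1))) / (1 - alpha)

rng = np.random.default_rng(3)
x = rng.normal(0, 1, 1000)
for alpha in (1.5, 2.0, 3.0):
    print(alpha, h_alpha_hat(x, 0.4, alpha))   # H_alpha is non-increasing in alpha
```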
Properties of the Estimators

Property 2.1. The estimator of Eq. (2.18) for Renyi's quadratic entropy using Gaussian kernels only differs from the IP of Eq. (2.14) by a factor of $\sqrt{2}$ in the kernel size.

Property 2.2. For any Parzen kernel that obeys the relation

$$ \kappa^{new}(x_j - x_i) = \int_{-\infty}^{\infty}\kappa^{old}(x - x_i)\,\kappa^{old}(x - x_j)\,dx \qquad (2.19) $$

the estimator of Renyi's quadratic entropy of Eq. (2.18) matches the estimator of Eq. (2.14) using the IP.

Property 2.3. The kernel size must be a parameter that satisfies the scaling property $\kappa_{c\sigma}(x) = \kappa_\sigma(x/c)/c$ for any positive factor $c$.

Property 2.4. The entropy estimator in Eq. (2.18) is invariant to the mean of the underlying density of the samples, as is the actual entropy.

Property 2.5. In the limit $\alpha \to 1$, the estimator in Eq. (2.18) becomes Shannon's entropy estimated using Parzen windowing with the expectation approximated by the sample mean.
Properties of the Estimators

Property 2.6. In order to maintain consistency with the scaling property of the actual entropy, if the entropy of a random variable $X$ is estimated from the samples $\{x_1,\dots,x_N\}$ using a kernel size of $\sigma$, the entropy of the random variable $cX$ must be estimated from the samples $\{cx_1,\dots,cx_N\}$ using a kernel size of $|c|\sigma$.

Property 2.7. When estimating the joint entropy of an $n$-dimensional random vector $X$ from its samples $\{x_1,\dots,x_N\}$, use a multi-dimensional kernel that is the product of single-dimensional kernels. This way, the estimate of the joint entropy and the estimate of the [...]

Theorem 2.1. The entropy estimator in Eq. (2.18) is consistent if the Parzen windowing and the sample mean are consistent for the actual PDF of the iid samples.

Theorem 2.2. If the maximum value of the kernel $\kappa_\sigma(\xi)$ is achieved when $\xi = 0$, then the minimum value of the entropy estimator in Eq. (2.18) is achieved when all samples are equal to each other, i.e., $x_1 = \dots = x_N = c$.
Properties of the Estimators

Theorem 2.3. If the kernel function $\kappa_\sigma(\cdot)$ is continuous, differentiable, symmetric, and unimodal, then the global minimum described in Theorem 2.2 of the entropy estimator in Eq. (2.18) is smooth, i.e., it has a zero gradient and a positive semi-definite Hessian matrix.

Property 2.8. If the kernel function satisfies the conditions in Theorem 2.3, then in the limit, as the kernel size tends to infinity, the quadratic entropy estimator approaches [...]

Property 2.9. In the case of joint entropy estimation, if the multi-dimensional kernel is rotation invariant, the entropy estimator in Eq. (2.18) is invariant under rotations, as is the actual entropy of a random vector $X$. Notice that the condition on the joint kernel function requires hyperspherically symmetric kernels.
Properties of the Estimators

Theorem 2.4. $\lim_{N\to\infty}\hat{H}_\alpha(X) = H_\alpha(\hat{X}) \ge H_\alpha(X)$, where $\hat{X}$ is a random variable with the PDF $f_X(\cdot) * \kappa_\sigma(\cdot)$. The equality (in the inequality portion) occurs if and only if (iff) the kernel size is zero. This result is also valid on the average for the finite-sample case.

Bias of the Information Potential

The bias analysis of the IP involves the quantities

$$ a = E\!\left[K_\sigma(x_i - x_j)\,K_\sigma(x_j - x_l)\right] - E\!\left[K_\sigma(x_i - x_j)\right]E\!\left[K_\sigma(x_j - x_l)\right] $$

$$ b = \operatorname{Var}\!\left[K_\sigma(x_i - x_j)\right] $$
There is a useful physical interpretation of the IP. Think of the gravitational or the electrostatic field. Assume the samples are particles (information particles) of unit mass that interact with each other with the rules specified by the Parzen kernel used. This analogy is exact!

The information potential field is given by the kernel density estimate,

$$ \hat{V}_2(x_j) \stackrel{\Delta}{=} \frac{1}{N}\sum_{i=1}^{N} G_{\sigma\sqrt{2}}(x_j - x_i) $$

and the Information Potential is the total potential

$$ \hat{V}_2(X) = \frac{1}{N}\sum_{j=1}^{N}\hat{V}_2(x_j) $$

Now, if there is a potential, there are forces in the space of the samples that can be easily computed as

$$ \frac{\partial}{\partial x_j}\hat{V}_2(x_j) = \frac{1}{N}\sum_{i=1}^{N} G'_{\sigma\sqrt{2}}(x_j - x_i) = \frac{1}{2N\sigma^2}\sum_{i=1}^{N} G_{\sigma\sqrt{2}}(x_j - x_i)\,(x_i - x_j) $$

$$ F_2(x_j; x_i) \stackrel{\Delta}{=} \frac{1}{N}\,G'_{\sigma\sqrt{2}}(x_j - x_i) \qquad F_2(x_j) \stackrel{\Delta}{=} \frac{\partial}{\partial x_j}V_2(x_j) = \sum_{i=1}^{N} F_2(x_j; x_i) $$
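A sketch of the potential field and information forces in 1-D (Gaussian kernel; my own variable names):

```python
import numpy as np

def ip_field_and_forces(x, sigma):
    """Potential V2_hat(x_j) and information force F2(x_j) at every sample."""
    N = len(x)
    s = sigma * np.sqrt(2)                  # kernel size after the convolution
    d = x[:, None] - x[None, :]             # d[j, i] = x_j - x_i
    G = np.exp(-d**2 / (2 * s**2)) / (s * np.sqrt(2 * np.pi))
    V = G.mean(axis=1)                      # field: (1/N) sum_i G(x_j - x_i)
    F = (G * (-d)).sum(axis=1) / (2 * N * sigma**2)  # (1/(2N*s^2)) sum G*(x_i-x_j)
    return V, F

x = np.array([0.0, 0.1, 1.0])
V, F = ip_field_and_forces(x, sigma=0.5)
print(V)   # the two nearby particles sit in a higher potential
print(F)   # gradient of V: entropy maximization moves samples against it
```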
[Figure: interaction between one information particle placed at the center of a 2-D space and the resulting potential and force fields.]

The total potential can also be written as the average of all pairwise interactions:

$$ \hat{V}(X) = \frac{1}{N^2}\sum_{i=1}^{N}\sum_{j=1}^{N}\hat{V}_{ij} = \frac{1}{N}\sum_{i=1}^{N}\hat{V}(i) $$
Example of forces pulling samples apart when the entropy is maximized. The lines show the direction of the force, and their length shows the intensity of the pull.

[Figure: samples in the unit square with line segments indicating the information forces acting on each sample.]
Extension to any α and kernel

The framework of information potentials and forces extends to any Parzen kernel, where the shape of the kernel creates the interaction field. Changing $\alpha$ also changes the field. The $\alpha$ information potential field is

$$ \hat{V}_\alpha(x_j) \stackrel{\Delta}{=} \frac{1}{N^{\alpha-1}}\left(\sum_{i=1}^{N}\kappa_\sigma(x_j - x_i)\right)^{\alpha-1} \qquad \hat{V}_\alpha(X) = \frac{1}{N}\sum_{j=1}^{N}\hat{V}_\alpha(x_j) $$
Renyi's divergence

Renyi proposed a divergence measure derived from his entropy definition, which is different from the Kullback-Leibler divergence discussed in Chapter 1:

$$ D_\alpha(f \,\|\, g) \stackrel{\Delta}{=} \frac{1}{\alpha-1}\log\int_{-\infty}^{\infty} f(x)\left(\frac{f(x)}{g(x)}\right)^{\alpha-1} dx $$

with properties

i. $D_\alpha(f \,\|\, g) \ge 0, \quad \forall f, g, \alpha > 0$

Writing the divergence as an expectation over $f$ and plugging in Parzen estimates built from the samples $\{x_j^f\}$ of $f$ and $\{x_i^g\}$ of $g$ gives the estimator

$$ D_\alpha(f \,\|\, g) = \frac{1}{\alpha-1}\log E_f\!\left[\left(\frac{f(x)}{g(x)}\right)^{\alpha-1}\right] \approx \frac{1}{\alpha-1}\log\frac{1}{N}\sum_{j=1}^{N}\left(\frac{\hat{f}(x_j^f)}{\hat{g}(x_j^f)}\right)^{\alpha-1} $$

$$ \hat{D}_\alpha(f \,\|\, g) = \frac{1}{\alpha-1}\log\frac{1}{N}\sum_{j=1}^{N}\left(\frac{\sum_{i=1}^{N}\kappa(x_j^f - x_i^f)}{\sum_{i=1}^{N}\kappa(x_j^f - x_i^g)}\right)^{\alpha-1} $$
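A sketch of this plug-in divergence estimator (Gaussian kernel; the helper names are mine):

```python
import numpy as np

def parzen_at(points, samples, sigma):
    """Parzen density estimate, built from `samples`, evaluated at `points`."""
    d = points[:, None] - samples[None, :]
    k = np.exp(-d**2 / (2 * sigma**2)) / (sigma * np.sqrt(2 * np.pi))
    return k.mean(axis=1)

def renyi_divergence_hat(xf, xg, sigma, alpha):
    """Plug-in estimate of D_alpha(f || g) from samples of f and of g."""
    f_vals = parzen_at(xf, xf, sigma)       # f_hat at the f samples
    g_vals = parzen_at(xf, xg, sigma)       # g_hat at the f samples
    return np.log(np.mean((f_vals / g_vals) ** (alpha - 1))) / (alpha - 1)

rng = np.random.default_rng(4)
xf = rng.normal(0.0, 1.0, 800)
xg = rng.normal(0.5, 1.0, 800)
xf2 = rng.normal(0.0, 1.0, 800)
print(renyi_divergence_hat(xf, xg, 0.4, 2.0))    # clearly positive: f != g
print(renyi_divergence_hat(xf, xf2, 0.4, 2.0))   # near zero: same density
```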
Renyi's Mutual Information

Renyi's mutual information is also the divergence between the joint and the product of the marginals:

$$ I_\alpha(X) \stackrel{\Delta}{=} \frac{1}{\alpha-1}\log\int_{-\infty}^{\infty}\!\cdots\!\int_{-\infty}^{\infty}\frac{p_X^{\alpha}(x_1,\dots,x_n)}{\prod_{o=1}^{n} p_{X_o}^{\alpha-1}(x_o)}\,dx_1\cdots dx_n $$
Measuring dissimilarity in probability space is a complex issue, and there are many other divergence measures that can be used. If we transform the simplex to a hypersphere preserving the Fisher information, the coordinates of each PMF become $\sqrt{p_k}$. The geodesic distance between PMFs $f$ and $g$ becomes

$$ \cos D_G = \sum_k\sqrt{f_k\,g_k} $$

This is related to the Bhattacharyya distance (which is Renyi's divergence with $\alpha = 1/2$):

$$ D_B(f, g) = -\ln\!\left(\int\sqrt{f(x)g(x)}\,dx\right) $$

Now, if we measure the distance between $f$ and $g$ in a linear projection space (the chordal distance), we get

$$ D_H = \left(\sum_k\left(\sqrt{f_k} - \sqrt{g_k}\right)^2\right)^{0.5} $$

This resembles the Hellinger distance, which is also related to Renyi's divergence:

$$ D_H(f, g) = \left[\int\left(\sqrt{f(x)} - \sqrt{g(x)}\right)^2 dx\right]^{1/2} = \left[2\left(1 - \int\sqrt{f(x)g(x)}\,dx\right)\right]^{1/2} $$

Since we have an estimator for quadratic Renyi's entropy (the IP), we will substitute $\alpha = 2$ for $\alpha = 1/2$ and define two quadratic divergences, called the Euclidean distance and the Cauchy-Schwarz divergence between $f$ and $g$.
Euclidean distance between PDFs

We will drop the square root and define the distance as

$$ D_{ED}(f, g) = \int\left(f(x) - g(x)\right)^2 dx = \int f(x)^2\,dx - 2\int f(x)g(x)\,dx + \int g(x)^2\,dx $$

Notice that each of the three terms can be estimated with the IP. The middle term involves samples from the two PDFs, so it will be called the Cross Information Potential (CIP); it measures the potential created by one PDF at the locations specified by the samples of the other PDF.

The extension to multidimensional random variables is also possible,

$$ I_{ED}(X_1,\dots,X_k) = D_{ED}\!\left(f_X(x_1,\dots,x_k),\ \prod_{i=1}^{k} f_{X_i}(x_i)\right) $$

and notice that $I_{ED}$ is zero if and only if the random variables are independent.
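A sketch of the three IP terms and the resulting Euclidean divergence between two sample sets (Gaussian kernels; the helper names are mine):

```python
import numpy as np

def cross_ip(x, y, sigma):
    """Cross Information Potential between the sample sets x and y."""
    s = sigma * np.sqrt(2)                   # kernel size after convolution
    d = x[:, None] - y[None, :]
    return (np.exp(-d**2 / (2 * s**2)) / (s * np.sqrt(2 * np.pi))).mean()

def d_ed(x, y, sigma):
    """Euclidean distance between PDFs: V_f - 2*CIP + V_g."""
    return cross_ip(x, x, sigma) - 2 * cross_ip(x, y, sigma) + cross_ip(y, y, sigma)

rng = np.random.default_rng(5)
x = rng.normal(0, 1, 500)
print(d_ed(x, rng.normal(2, 1, 500), 0.5))   # clearly positive: different PDFs
print(d_ed(x, rng.normal(0, 1, 500), 0.5))   # near zero: same PDF
```

Note that `cross_ip(x, x, sigma)` is just the IP of the set `x`.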
Cauchy-Schwarz divergence

The other divergence is related to the Bhattacharyya distance, but can be formally derived from the Cauchy-Schwarz inequality:

$$ D_{CS}(f, g) = -\log\frac{\left(\int f(x)g(x)\,dx\right)^2}{\int f^2(x)\,dx\int g^2(x)\,dx} $$

There is a very interesting interpretation of $D_{CS}$ in terms of Renyi's entropy. Lutwak defines Renyi's relative entropy as

$$ D_{R_\alpha}(f, g) = \log\frac{\left(\int_{\mathbb{R}} g^{\alpha-1}(x)\,f(x)\,dx\right)^{\frac{1}{1-\alpha}}\left(\int_{\mathbb{R}} g^{\alpha}(x)\,dx\right)^{\frac{1}{\alpha}}}{\left(\int_{\mathbb{R}} f^{\alpha}(x)\,dx\right)^{\frac{1}{\alpha(1-\alpha)}}} $$
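A matching sketch for the Cauchy-Schwarz divergence, reusing `cross_ip` from the previous sketch:

```python
import numpy as np
# reuses cross_ip(x, y, sigma) from the previous sketch

def d_cs(x, y, sigma):
    """Cauchy-Schwarz divergence: log(V_f * V_g / V_c^2)."""
    vf = cross_ip(x, x, sigma)     # information potential of f
    vg = cross_ip(y, y, sigma)     # information potential of g
    vc = cross_ip(x, y, sigma)     # cross information potential
    return np.log(vf * vg / vc**2)

rng = np.random.default_rng(6)
x = rng.normal(0, 1, 500)
print(d_cs(x, rng.normal(2, 1, 500), 0.5))   # clearly positive
print(d_cs(x, rng.normal(0, 1, 500), 0.5))   # near zero when f ~ g
```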
Geometric Interpretation of Quadratic Mutual Information

Let us define

$$ V_J = \iint f_{X_1X_2}(x_1, x_2)^2\,dx_1dx_2 $$

$$ V_M = \iint\left(f_{X_1}(x_1)\,f_{X_2}(x_2)\right)^2 dx_1dx_2 $$

$$ V_c = \iint f_{X_1X_2}(x_1, x_2)\,f_{X_1}(x_1)\,f_{X_2}(x_2)\,dx_1dx_2 $$

[Figure: the joint PDF $f_{X_1X_2}(x_1, x_2)$ and the factorized PDF $f_{X_1}(x_1)f_{X_2}(x_2)$ drawn as two vectors (of squared lengths $V_J$ and $V_M$) separated by an angle $\theta$; $I_{ED}$ (Euclidean distance) is the distance between their tips, and $I_s$ (K-L divergence) is also indicated.]

$$ I_{CS} = -\log\left(\cos^2\theta\right), \qquad V_c = \cos\theta\,\sqrt{V_J V_M} $$
Example

[Figure: a 2x2 joint PMF for $(X_1, X_2)$ with marginals $P_{X_1}$ and $P_{X_2}$; the probability $P(1,1)$ is varied in $[0, 0.6]$ and $P(2,1)$ in $[0, 0.4]$, and the resulting surfaces of $I_s$, $I_{ED}$, and $I_{CS}$ are plotted as functions of $P_X^{21}$ and $P_X^{11}$.]

For the discrete case the two quadratic mutual informations read

$$ I_{ED}(X_1, X_2) = \sum_{i=1}^{n}\sum_{j=1}^{m}\left(P_X(i, j) - P_{X_1}(i)\,P_{X_2}(j)\right)^2 $$

$$ I_{CS}(X_1, X_2) = \log\frac{\left(\sum_{i=1}^{n}\sum_{j=1}^{m} P_X(i, j)^2\right)\left(\sum_{i=1}^{n}\sum_{j=1}^{m}\left(P_{X_1}(i)\,P_{X_2}(j)\right)^2\right)}{\left(\sum_{i=1}^{n}\sum_{j=1}^{m} P_X(i, j)\,P_{X_1}(i)\,P_{X_2}(j)\right)^2} $$
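A small sketch of the discrete case (a 2x2 joint PMF; the values are illustrative, not the slide's exact figure):

```python
import numpy as np

def qmi_ed_cs(P):
    """Discrete quadratic mutual informations I_ED and I_CS of a joint PMF."""
    P1 = P.sum(axis=1, keepdims=True)     # marginal of X1 (rows)
    P2 = P.sum(axis=0, keepdims=True)     # marginal of X2 (columns)
    M = P1 * P2                           # product of the marginals
    i_ed = np.sum((P - M) ** 2)
    i_cs = np.log(np.sum(P**2) * np.sum(M**2) / np.sum(P * M) ** 2)
    return i_ed, i_cs

P_dep = np.array([[0.4, 0.1],
                  [0.1, 0.4]])            # dependent joint PMF
P_ind = np.outer([0.5, 0.5], [0.5, 0.5])  # independent joint PMF
print(qmi_ed_cs(P_dep))                   # both measures > 0
print(qmi_ed_cs(P_ind))                   # both measures = 0
```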
Information Potential and Forces in the Joint Spaces

The important lesson to be learned from the IP framework and the distance measures defined is that each PDF creates its own potential, which interacts with the others. This is an additive process, so the forces are also additive, and everything becomes relatively simple:

$$ \hat{D}_{ED}(f, g) = \hat{V}_{ED} = \hat{V}_f + \hat{V}_g - 2\hat{V}_c \qquad \frac{\partial\hat{V}_{ED}}{\partial x_i} = \frac{\partial\hat{V}_f}{\partial x_i} + \frac{\partial\hat{V}_g}{\partial x_i} - 2\frac{\partial\hat{V}_c}{\partial x_i} $$

$$ \hat{D}_{CS}(f, g) = \hat{V}_{CS} = \log\frac{\hat{V}_f\hat{V}_g}{\hat{V}_c^2} \qquad \frac{\partial\hat{V}_{CS}}{\partial x_i} = \frac{1}{\hat{V}_f}\frac{\partial\hat{V}_f}{\partial x_i} + \frac{1}{\hat{V}_g}\frac{\partial\hat{V}_g}{\partial x_i} - \frac{2}{\hat{V}_c}\frac{\partial\hat{V}_c}{\partial x_i} $$
Quadratic Mutual Information

The QMIs are a bit more detailed because now we have to deal with the joint, the marginals, and their product. But everything is still additive. The three PDFs are

$$ \hat{f}_{X_1X_2}(x_1, x_2) = \frac{1}{N}\sum_{i=1}^{N} G_\sigma(x - x(i)) $$

$$ \hat{f}_{X_1}(x_1) = \frac{1}{N}\sum_{i=1}^{N} G_\sigma(x_1 - x_1(i)) \qquad \hat{f}_{X_2}(x_2) = \frac{1}{N}\sum_{i=1}^{N} G_\sigma(x_2 - x_2(i)) $$

where $x = (x_1, x_2)$ and the joint kernel is the product of the marginal kernels.
Let us exemplify for $V_c$:

$$
V_c = \iint\left[\frac{1}{N}\sum_{k=1}^{N}G_\sigma(x_1 - x_1(k))\,G_\sigma(x_2 - x_2(k))\right]\left[\frac{1}{N}\sum_{i=1}^{N}G_\sigma(x_1 - x_1(i))\right]\left[\frac{1}{N}\sum_{j=1}^{N}G_\sigma(x_2 - x_2(j))\right]dx_1dx_2
$$

$$
= \frac{1}{N}\sum_{k=1}^{N}\frac{1}{N}\sum_{i=1}^{N}\frac{1}{N}\sum_{j=1}^{N}\int G_\sigma(x_1 - x_1(i))\,G_\sigma(x_1 - x_1(k))\,dx_1\int G_\sigma(x_2 - x_2(k))\,G_\sigma(x_2 - x_2(j))\,dx_2
$$

$$
= \frac{1}{N}\sum_{k=1}^{N}\left[\frac{1}{N}\sum_{i=1}^{N}G_{\sigma\sqrt{2}}\big(x_1(k) - x_1(i)\big)\right]\left[\frac{1}{N}\sum_{j=1}^{N}G_{\sigma\sqrt{2}}\big(x_2(k) - x_2(j)\big)\right] \qquad (2.102)
$$

Collecting the three potentials, the sample estimator of the Cauchy-Schwarz QMI is

$$
\hat{I}_{CS}(X_1, X_2) = \log\frac{\left(\dfrac{1}{N^2}\displaystyle\sum_{i=1}^{N}\sum_{j=1}^{N}G_{\sigma\sqrt{2}}\big(x_1(i) - x_1(j)\big)\,G_{\sigma\sqrt{2}}\big(x_2(i) - x_2(j)\big)\right)\hat{V}_1\hat{V}_2}{\left(\dfrac{1}{N}\displaystyle\sum_{i=1}^{N}\left(\dfrac{1}{N}\sum_{j=1}^{N}G_{\sigma\sqrt{2}}\big(x_1(i) - x_1(j)\big)\right)\left(\dfrac{1}{N}\sum_{j=1}^{N}G_{\sigma\sqrt{2}}\big(x_2(i) - x_2(j)\big)\right)\right)^2}
$$

where $\hat{V}_k = \frac{1}{N^2}\sum_{i=1}^{N}\sum_{j=1}^{N}G_{\sigma\sqrt{2}}\big(x_k(i) - x_k(j)\big)$ is the IP of the $k$-th marginal.
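A direct sketch of these sums for two 1-D variables (illustrative, $O(N^2)$; the names are mine):

```python
import numpy as np

def qmi_cs_hat(x1, x2, sigma):
    """Sample-based Cauchy-Schwarz QMI: log(V_J * V_1 * V_2 / V_c^2)."""
    s = sigma * np.sqrt(2)
    def G(d):
        return np.exp(-d**2 / (2 * s**2)) / (s * np.sqrt(2 * np.pi))
    G1 = G(x1[:, None] - x1[None, :])   # pairwise kernels in each marginal
    G2 = G(x2[:, None] - x2[None, :])
    V_J = (G1 * G2).mean()              # joint potential
    V_1, V_2 = G1.mean(), G2.mean()     # marginal potentials
    V_c = (G1.mean(axis=1) * G2.mean(axis=1)).mean()  # cross potential
    return np.log(V_J * V_1 * V_2 / V_c**2)

rng = np.random.default_rng(7)
x = rng.normal(0, 1, 400)
print(qmi_cs_hat(x, x + 0.1 * rng.normal(size=400), 0.3))  # dependent: > 0
print(qmi_cs_hat(x, rng.normal(0, 1, 400), 0.3))           # independent: ~ 0
```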
The interactions are always based on the marginals, but there are three levels: the level of the individual marginal samples (the joint space), the level of a marginal sample against the marginal field (the CIP), and the product of the two marginal fields. The framework generalizes to any number of variables:

$$ \hat{I}_{ED}(X_1,\dots,X_K) = \frac{1}{N^2}\sum_{i=1}^{N}\sum_{j=1}^{N}\prod_{k=1}^{K}\hat{V}_k(ij) - \frac{2}{N}\sum_{i=1}^{N}\prod_{k=1}^{K}\hat{V}_k(i) + \prod_{k=1}^{K}\hat{V}_k $$

$$ \hat{I}_{CS}(X_1,\dots,X_K) = \log\frac{\left(\dfrac{1}{N^2}\displaystyle\sum_{i=1}^{N}\sum_{j=1}^{N}\prod_{k=1}^{K}\hat{V}_k(ij)\right)\displaystyle\prod_{k=1}^{K}\hat{V}_k}{\left(\dfrac{1}{N}\displaystyle\sum_{i=1}^{N}\prod_{k=1}^{K}\hat{V}_k(i)\right)^2} $$

The information forces for each field are also easily derived:

$$ \hat{F}_{ED}(i) = \frac{\partial\hat{I}_{ED}}{\partial x_k(i)} = \frac{\partial\hat{V}_J}{\partial x_k(i)} - 2\frac{\partial\hat{V}_C}{\partial x_k(i)} + \frac{\partial\hat{V}_k}{\partial x_k(i)} $$

$$ \hat{F}_{CS}(i) = \frac{\partial\hat{I}_{CS}}{\partial x_k(i)} = \frac{1}{\hat{V}_J}\frac{\partial\hat{V}_J}{\partial x_k(i)} - \frac{2}{\hat{V}_C}\frac{\partial\hat{V}_C}{\partial x_k(i)} + \frac{1}{\hat{V}_k}\frac{\partial\hat{V}_k}{\partial x_k(i)} $$
Fast Information and Cross Information Potential Calculations

One of the problems with this IP estimator methodology is the computational complexity, which is $O(N^2)$ or $O(N^3)$. There are two ways to get approximations to the IP and CIP with arbitrary accuracy: the Fast Gauss Transform (FGT) and the incomplete Cholesky decomposition. They both transform the complexity to $O(NM)$ with $M \ll N$.

The FGT takes advantage of the fast decay of the Gaussian function and can efficiently compute weighted sums of scalar Gaussians,

$$ S(y_i) = \sum_{j=1}^{N} w_j\,e^{-(y_j - y_i)^2/4\sigma^2}, \qquad i = 1,\dots,M $$

The savings come from the shifting property of the Gaussian function, which reads

$$ e^{-\left(\frac{y_j - y_i}{\sigma}\right)^2} = e^{-\left(\frac{(y_j - y_c) - (y_i - y_c)}{\sigma}\right)^2} = e^{-\left(\frac{y_j - y_c}{\sigma}\right)^2}\sum_{n=0}^{\infty}\frac{1}{n!}\left(\frac{y_i - y_c}{\sigma}\right)^n H_n\!\left(\frac{y_j - y_c}{\sigma}\right) $$

where $H_n$ are the Hermite polynomials, $H_n(y) = (-1)^n\,e^{y^2}\,\dfrac{d^n}{dy^n}e^{-y^2}$.
In higher dimensions the same factorization applies around a center $c$,

$$ \exp\!\left(-\frac{\left\|y_j - y_i\right\|^2}{4\sigma^2}\right) = \exp\!\left(-\frac{\left\|y_j - c\right\|^2}{4\sigma^2}\right)\exp\!\left(-\frac{\left\|y_i - c\right\|^2}{4\sigma^2}\right)\exp\!\left(\frac{2\,(y_j - c)\cdot(y_i - c)}{4\sigma^2}\right) $$

and the cross term is approximated by a Taylor expansion as

$$ \exp\!\left(\frac{2\,(y_j - c)\cdot(y_i - c)}{4\sigma^2}\right) = \sum_{\alpha\ge 0}\frac{2^{|\alpha|}}{\alpha!}\left(\frac{y_j - c}{2\sigma}\right)^{\alpha}\left(\frac{y_i - c}{2\sigma}\right)^{\alpha} + \varepsilon(\alpha) $$

with the multi-index conventions $\alpha! = \alpha_1!\,\alpha_2!\cdots\alpha_d!$ and $|\alpha| = \alpha_1 + \alpha_2 + \cdots + \alpha_d$.
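A toy single-center version of this expansion in 1-D (a real FGT partitions the sources into clusters and truncates the series adaptively; the structure and names here are my own sketch):

```python
import math
import numpy as np

def gauss_sum_direct(y_src, w, y_tgt, sigma):
    """Exact O(N*M): S(y_i) = sum_j w_j exp(-(y_j - y_i)^2 / (4 sigma^2))."""
    d = y_tgt[:, None] - y_src[None, :]
    return (w * np.exp(-d**2 / (4 * sigma**2))).sum(axis=1)

def gauss_sum_taylor(y_src, w, y_tgt, sigma, P=30):
    """Single-center Taylor approximation, O((N + M) * P) instead of O(N*M)."""
    c = y_src.mean()                     # one expansion center, for simplicity
    u_s = (y_src - c) / (2 * sigma)
    u_t = (y_tgt - c) / (2 * sigma)
    n = np.arange(P)
    # source moments C_n = sum_j w_j exp(-u_j^2) u_j^n, computed once
    C = (w * np.exp(-u_s**2) * u_s[None, :] ** n[:, None]).sum(axis=1)
    coef = np.array([2.0**k / math.factorial(k) for k in range(P)])
    return np.exp(-u_t**2) * (coef * C * u_t[:, None] ** n).sum(axis=1)

rng = np.random.default_rng(8)
y = rng.normal(0, 1, 2000)
w = np.full(len(y), 1 / len(y))
print(gauss_sum_direct(y, w, y[:3], 1.0))
print(gauss_sum_taylor(y, w, y[:3], 1.0))   # agrees to many digits
```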
Incomplete Cholesky Decomposition

It turns out that the Gram matrix created by the pairwise evaluation of the Gaussian kernel is full rank, but its eigenspectrum decays very fast, and we can take advantage of this fact. For an $N\times N$ symmetric positive definite matrix $K = G^T G$, where $G$ is a lower triangular matrix, the incomplete Cholesky decomposition computes an $N\times D$ matrix $\tilde{G}$, with $D \ll N$, such that $K \approx \tilde{G}\tilde{G}^T$. In terms of $\tilde{G}$ the QMI estimators become

$$ \hat{I}_{ED} = \frac{1}{N^2}\mathbf{1}_{D_x}^T\!\left(\tilde{G}_{XX}^T\tilde{G}_{YY}\circ\tilde{G}_{XX}^T\tilde{G}_{YY}\right)\mathbf{1}_{D_y} + \frac{1}{N^4}\left\|\mathbf{1}_N^T\tilde{G}_{XX}\right\|^2\left\|\mathbf{1}_N^T\tilde{G}_{YY}\right\|^2 - \frac{2}{N^3}\left(\mathbf{1}_N^T\tilde{G}_{XX}\right)\!\left(\tilde{G}_{XX}^T\tilde{G}_{YY}\right)\!\left(\tilde{G}_{YY}^T\mathbf{1}_N\right) $$

$$ \hat{I}_{CS} = \log\frac{\left[\mathbf{1}_{D_x}^T\!\left(\tilde{G}_{XX}^T\tilde{G}_{YY}\circ\tilde{G}_{XX}^T\tilde{G}_{YY}\right)\mathbf{1}_{D_y}\right]\left\|\mathbf{1}_N^T\tilde{G}_{XX}\right\|^2\left\|\mathbf{1}_N^T\tilde{G}_{YY}\right\|^2}{\left[\left(\mathbf{1}_N^T\tilde{G}_{XX}\right)\!\left(\tilde{G}_{XX}^T\tilde{G}_{YY}\right)\!\left(\tilde{G}_{YY}^T\mathbf{1}_N\right)\right]^2} $$

where $\circ$ denotes the Hadamard (elementwise) product.
However, the CIP is not associated with a positive definite matrix, so we need to extend $K_{XX}$ to create a positive definite matrix

$$ K_{ZZ} = \begin{bmatrix} K_{XX} & K_{XY} \\ K_{XY}^T & K_{YY} \end{bmatrix} $$

where $K_{XY}$ denotes the Gram matrix for the CIP. The calculation of the CIP becomes

$$ \hat{V}(X; Y) = \frac{1}{N^2}\sum_{i=1}^{N}\sum_{j=1}^{N}\kappa(x_i - y_j) = \frac{1}{N^2}\,e_1^T K_{ZZ}\,e_2 = \frac{1}{N^2}\,e_1^T\tilde{G}_{ZZ}\tilde{G}_{ZZ}^T e_2 $$

with $e_1 = (\underbrace{1,\dots,1}_{N},\underbrace{0,\dots,0}_{N})^T$ and $e_2 = (\underbrace{0,\dots,0}_{N},\underbrace{1,\dots,1}_{N})^T$.
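A sketch of the pivoted incomplete Cholesky factorization for a Gaussian Gram matrix and its use for the IP (the kernel normalization constant is omitted for brevity; the function and tolerance are my own choices):

```python
import numpy as np

def incomplete_cholesky(x, sigma, tol=1e-6):
    """Pivoted incomplete Cholesky of K[i, j] = exp(-(x_i - x_j)^2 / (2 sigma^2)).
    Returns G of shape (N, D) with K ~= G @ G.T; D << N when the spectrum decays."""
    N = len(x)
    diag = np.ones(N)                  # residual diagonal; K[i, i] = 1 here
    G = np.zeros((N, 0))
    while diag.sum() > tol and G.shape[1] < N:
        i = int(np.argmax(diag))       # pivot on the largest residual
        k_i = np.exp(-(x - x[i])**2 / (2 * sigma**2))   # column i of K
        g = (k_i - G @ G[i]) / np.sqrt(diag[i])
        G = np.column_stack([G, g])
        diag = np.maximum(diag - g**2, 0.0)
    return G

rng = np.random.default_rng(9)
x = rng.normal(0, 1, 1000)
G = incomplete_cholesky(x, sigma=0.5)
print(G.shape)                         # typically only a few dozen columns
ip_exact = np.exp(-(x[:, None] - x)**2 / (2 * 0.5**2)).mean()  # (1/N^2) 1'K1
ip_lowrank = np.sum(G.sum(axis=0)**2) / len(x)**2              # (1/N^2)||G'1||^2
print(ip_exact, ip_lowrank)            # agree to within the tolerance
```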