Renyi's Entropy

History: Alfred Renyi was looking for the most general definition of information measures that would preserve the additivity for independent events and be compatible with the axioms of probability. He started from the theory of generalized means: for a continuous, strictly monotonic function $g$, the mean of the quantities $x_k$ with weights $p_k$ is

$$ g^{-1}\!\left(\sum_{k=1}^{N} p_k\, g(x_k)\right) $$
Applying this definition to the information $I(P)$, with elementary informations $I_k = -\log p_k$, we get

$$ I(P) = g^{-1}\!\left(\sum_{k=1}^{N} p_k\, g(I_k)\right) $$
When the postulate of additivity for independent events is applied, we get just two possible $g(x)$:

$$ g(x) = cx $$

$$ g(x) = c\,2^{(1-\alpha)x} $$
The first form gives Shannon information and the second gives

$$ I_\alpha(P) = \frac{1}{1-\alpha}\log\!\left(\sum_{k=1}^{N} p_k^{\alpha}\right) $$

for nonnegative $\alpha \neq 1$. This gives a parametric family of information measures that are today called Renyi's entropies.
In the limit $\alpha \to 1$, Renyi's entropy recovers Shannon's entropy; applying L'Hopital's rule to the indeterminate ratio,

$$ \lim_{\alpha\to 1} H_\alpha(X) = \lim_{\alpha\to 1}\frac{1}{1-\alpha}\log\sum_{k=1}^{N} p_k^{\alpha} = \lim_{\alpha\to 1}\frac{\sum_{k=1}^{N} p_k^{\alpha}\log p_k \,\Big/ \sum_{k=1}^{N} p_k^{\alpha}}{-1} = -\sum_{k=1}^{N} p_k\log p_k = H_S(X) $$
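As a quick numerical check (a minimal Python sketch, my own illustration rather than anything from the original notes), we can evaluate $H_\alpha$ for a PMF and watch it approach the Shannon entropy as $\alpha \to 1$:

```python
import numpy as np

def renyi_entropy(p, alpha):
    """Renyi entropy (in nats) of a PMF p, for alpha != 1."""
    p = np.asarray(p, dtype=float)
    return np.log(np.sum(p ** alpha)) / (1.0 - alpha)

def shannon_entropy(p):
    p = np.asarray(p, dtype=float)
    return -np.sum(p * np.log(p))

p = [0.5, 0.3, 0.2]
for alpha in (0.5, 0.9, 0.99, 1.01, 2.0):
    print(f"alpha={alpha:5.2f}  H_alpha={renyi_entropy(p, alpha):.4f}")
print(f"Shannon H_S = {shannon_entropy(p):.4f}")  # H_alpha -> H_S as alpha -> 1
```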
Meaning of α

If we compare the two definitions we see that instead of weighting the $\log p_k$ by the probabilities, here the log is external to the sum and its argument is the $\alpha$ power of the PMF, i.e.

$$ H_\alpha(X) = \frac{1}{1-\alpha}\log V_\alpha(X) = -\log\sqrt[\alpha-1]{V_\alpha(X)} = -\log\sqrt[\alpha-1]{E\!\left[p^{\alpha-1}(X)\right]}, \qquad V_\alpha(X) = \sum_k p_k^{\alpha} $$
A geometric view will help here. All the PMFs of a random variable with $N$ outcomes live on what is called the simplex in an $N$-dimensional space.

[Figure: the probability simplex for $N = 2$ and $N = 3$; a PMF $p = (p_1, p_2)$ is a point on the simplex, and $\sum_{k=1}^{N} p_k^{\alpha} = \|p\|_\alpha^{\alpha}$ is the entropy $\alpha$-norm.]
The norm measures exactly the distance of the PMF to the origin, and $\alpha$ designates which $\ell$-norm is used (the Euclidean norm is $\ell = 2$). Changing $\alpha$ changes the metric distance in the simplex.

[Figure: three copies of the simplex with vertex $[0\;0\;1]$, showing how the norm contours change with $\alpha$.]
Properties of Renyi's entropies
Renyi's quadratic entropy

We will be using heavily Renyi's entropy with $\alpha = 2$, called the quadratic entropy:

$$ H_2(X) = -\log\!\left(\sum_k p_k^2\right) $$

It has been used in physics, in signal processing, and in economics. We also want to stress that the argument of the log, the 2-norm of the PMF $V_2(X)$, has meaning in itself. In fact it is $E[p(X)]$, or, if we make the change of variables $\xi_k = p(x_k)$, it is the mean of the transformed variable when the PMF is the transformation.

[Figure: a PDF $p(x)$ with the mean of $X$ marked on the horizontal axis and the mean of $p(X)$ on the vertical axis.]
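To make the "mean of the transformed variable" reading concrete, here is a small sketch (my own illustration) showing that $V_2(X) = \sum_k p_k^2$ is exactly the expected value of the PMF evaluated at the samples of $X$:

```python
import numpy as np

rng = np.random.default_rng(0)
p = np.array([0.5, 0.3, 0.2])      # PMF over the outcomes {0, 1, 2}

V2 = np.sum(p ** 2)                # V_2(X) = sum_k p_k^2
H2 = -np.log(V2)                   # Renyi quadratic entropy

x = rng.choice(len(p), size=200_000, p=p)   # samples of X
V2_as_mean = np.mean(p[x])                  # empirical E[p(X)]

print(V2, V2_as_mean, H2)          # the two V_2 values agree closely
```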
Extensions to continuous variables

It is possible to show that Renyi's entropy measure extends to continuous r.v. and reads

$$ H_\alpha(X) = \lim_{n\to\infty}\left(I_\alpha(P_n) - \log n\right) = \frac{1}{1-\alpha}\log\int p^{\alpha}(x)\,dx $$

and in particular

$$ H_2(X) = -\log\int p^2(x)\,dx $$
Estimation of Renyi's quadratic entropy

We will use here ideas from density estimation called kernel density estimation. This is an old method (Rosenblatt) that is now called Parzen density estimation because Parzen proved many important statistical properties of the estimator.

The idea is very simple: place a kernel over the samples and sum with proper normalization, i.e.

$$ \hat{p}_X(x) = \frac{1}{N\sigma}\sum_{i=1}^{N}\kappa\!\left(\frac{x - x_i}{\sigma}\right) $$

where the kernel $\kappa$ satisfies

1. $\sup_x \kappa(x) < \infty$
2. $\int_{\mathbb{R}} \kappa(x)\,dx < \infty$
3. $\lim_{x\to\infty} x\,\kappa(x) = 0$
4. $\kappa(x) \ge 0, \quad \int_{\mathbb{R}} \kappa(x)\,dx = 1$

Parzen proved that the estimator is asymptotically unbiased and consistent, with good efficiency.

The free parameter $\sigma$ is called the kernel size. Normally a Gaussian is used, and $\sigma$ becomes its standard deviation.
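A minimal Parzen sketch in Python (a hedged illustration with a Gaussian kernel; the function name is mine):

```python
import numpy as np

def parzen_pdf(x, samples, sigma):
    """Parzen estimate of p_X at the points x, Gaussian kernel of size sigma."""
    x = np.atleast_1d(x)[:, None]             # query points as a column
    u = (x - samples[None, :]) / sigma        # standardized distances
    k = np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)
    return k.mean(axis=1) / sigma             # (1/(N*sigma)) * sum of kernels

rng = np.random.default_rng(1)
samples = rng.normal(0.0, 1.0, size=500)
grid = np.linspace(-4, 4, 9)
print(parzen_pdf(grid, samples, sigma=0.3))   # roughly the N(0,1) density
```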
But notice that here we are not interested in estimating the PDF, which is a function; we are interested in a single number, the 2-norm of the PDF. For a Gaussian kernel, substituting the estimator yields immediately

$$
\begin{aligned}
\hat{H}_2(X) &= -\log\int_{-\infty}^{\infty}\left(\frac{1}{N}\sum_{i=1}^{N} G_\sigma(x - x_i)\right)^{2} dx \\
&= -\log\frac{1}{N^2}\int_{-\infty}^{\infty}\left(\sum_{i=1}^{N}\sum_{j=1}^{N} G_\sigma(x - x_j)\,G_\sigma(x - x_i)\right) dx \\
&= -\log\frac{1}{N^2}\sum_{i=1}^{N}\sum_{j=1}^{N}\int_{-\infty}^{\infty} G_\sigma(x - x_j)\,G_\sigma(x - x_i)\,dx \\
&= -\log\!\left(\frac{1}{N^2}\sum_{i=1}^{N}\sum_{j=1}^{N} G_{\sigma\sqrt{2}}(x_j - x_i)\right)
\end{aligned}
$$

We call this estimator for $V_2(X)$ the Information Potential (IP).

Let us look at the derivation: we never had to compute the integral explicitly, since the integral of the product of two Gaussians is the value of a Gaussian (with a larger kernel size) evaluated at the difference of the arguments.

Let us also look at the expression: the algorithm is $O(N^2)$, and there is a free parameter $\sigma$ that the user has to select from the data. Cross-validation or Silverman's rule suffices for most cases:

$$ \sigma_{opt} = \sigma_X\left(4N^{-1}(2d+1)^{-1}\right)^{\frac{1}{d+4}} $$
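A direct $O(N^2)$ sketch of the IP and the quadratic entropy estimator (an illustration under the Gaussian-kernel assumption; the helper names are mine):

```python
import numpy as np

def gauss(d, s):
    """Gaussian kernel G_s evaluated at the differences d."""
    return np.exp(-d**2 / (2 * s**2)) / (s * np.sqrt(2 * np.pi))

def information_potential(x, sigma):
    """IP: (1/N^2) sum_ij G_{sigma*sqrt(2)}(x_j - x_i)."""
    d = x[:, None] - x[None, :]                  # all pairwise differences
    return gauss(d, sigma * np.sqrt(2)).mean()   # mean over the N^2 pairs

def h2_hat(x, sigma):
    return -np.log(information_potential(x, sigma))

rng = np.random.default_rng(2)
x = rng.normal(0, 1, 1000)
d = 1                                            # dimensionality for Silverman
sigma_opt = x.std() * (4 / ((2 * d + 1) * len(x))) ** (1 / (d + 4))
print(h2_hat(x, sigma_opt))  # true H2 of N(0,1) is log(2*sqrt(pi)) ~ 1.2655
```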
Extended Estimator for Renyi's Entropy

We can come up with still another estimator for Renyi's entropy of any order,

$$ H_\alpha(X) \stackrel{\Delta}{=} \frac{1}{1-\alpha}\log\int_{-\infty}^{\infty} p_X^{\alpha}(x)\,dx = \frac{1}{1-\alpha}\log E_X\!\left[p_X^{\alpha-1}(X)\right] $$

by approximating the expected value by the empirical mean

$$ H_\alpha(X) \approx \frac{1}{1-\alpha}\log\frac{1}{N}\sum_{j=1}^{N} p_X^{\alpha-1}(x_j) $$

to yield

$$ \hat{H}_\alpha(X) = \frac{1}{1-\alpha}\log\frac{1}{N}\sum_{j=1}^{N}\left(\frac{1}{N}\sum_{i=1}^{N}\kappa_\sigma(x_j - x_i)\right)^{\alpha-1} = \frac{1}{1-\alpha}\log\frac{1}{N^{\alpha}}\sum_{j=1}^{N}\left(\sum_{i=1}^{N}\kappa_\sigma(x_j - x_i)\right)^{\alpha-1} $$

We will now present a set of properties of these estimators.
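A sketch of this plug-in estimator for arbitrary $\alpha$ (Gaussian kernel assumed; my own naming). For $\alpha = 2$ it differs from the IP only in the kernel size, as Property 2.1 below states:

```python
import numpy as np

def h_alpha_hat(x, sigma, alpha):
    """Plug-in Renyi entropy estimator of order alpha (alpha != 1)."""
    d = x[:, None] - x[None, :]
    k = np.exp(-d**2 / (2 * sigma**2)) / (sigma * np.sqrt(2 * np.pi))
    p_hat = k.mean(axis=1)                 # Parzen estimate at each sample x_j
    return np.log(np.mean(p_hat ** (alpha - 1))) / (1 - alpha)

rng = np.random.default_rng(3)
x = rng.normal(0, 1, 1000)
for alpha in (1.5, 2.0, 3.0):
    print(alpha, h_alpha_hat(x, 0.4, alpha))   # H_alpha is non-increasing in alpha
```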
Properties of the Estimators

Property 2.1. The estimator of Eq. (2.18) for Renyi's quadratic entropy using Gaussian kernels only differs from the IP of Eq. (2.14) by a factor of $\sqrt{2}$ in the kernel size.

Property 2.2. For any Parzen kernel that obeys the relation

$$ \kappa^{new}(x_j - x_i) = \int_{-\infty}^{\infty}\kappa^{old}(x - x_i)\,\kappa^{old}(x - x_j)\,dx \qquad (2.19) $$

the estimator of Renyi's quadratic entropy of Eq. (2.18) matches the estimator of Eq. (2.14) using the IP.

Property 2.3. The kernel size must be a parameter that satisfies the scaling property $\kappa_{c\sigma}(x) = \kappa_\sigma(x/c)/c$ for any positive factor $c$.

Property 2.4. The entropy estimator in Eq. (2.18) is invariant to the mean of the underlying density of the samples, as is the actual entropy.

Property 2.5. In the limit $\alpha \to 1$, the estimator in Eq. (2.18) becomes Shannon's entropy estimated using Parzen windowing with the expectation approximated by the sample mean.
Properties of the Estimators

Property 2.6. In order to maintain consistency with the scaling property of the actual entropy, if the entropy of a random variable $X$ is estimated from the samples $\{x_1,\dots,x_N\}$ using a kernel size of $\sigma$, the entropy of the random variable $cX$ must be estimated from the samples $\{cx_1,\dots,cx_N\}$ using a kernel size of $|c|\sigma$.

Property 2.7. When estimating the joint entropy of an $n$-dimensional random vector $X$ from its samples $\{x_1,\dots,x_N\}$, use a multi-dimensional kernel that is the product of single-dimensional kernels. This way, the estimate of the joint entropy and the estimate of the [...]

Theorem 2.1. The entropy estimator in Eq. (2.18) is consistent if the Parzen windowing and the sample mean are consistent for the actual PDF of the iid samples.

Theorem 2.2. If the maximum value of the kernel $\kappa_\sigma(\xi)$ is achieved when $\xi = 0$, then the minimum value of the entropy estimator in Eq. (2.18) is achieved when all samples are equal to each other, i.e., $x_1 = \dots = x_N = c$.
Properties of the Estimators

Theorem 2.3. If the kernel function $\kappa_\sigma(\cdot)$ is continuous, differentiable, symmetric, and unimodal, then the global minimum described in Theorem 2.2 of the entropy estimator in Eq. (2.18) is smooth, i.e., it has a zero gradient and a positive semi-definite Hessian matrix.

Property 2.8. If the kernel function satisfies the conditions in Theorem 2.3, then in the limit, as the kernel size tends to infinity, the quadratic entropy estimator approaches [...]

Property 2.9. In the case of joint entropy estimation, if the multi-dimensional kernel is rotation invariant, the entropy estimator in Eq. (2.18) is invariant under rotations, as is the actual entropy of a random vector $X$. Notice that the condition on the joint kernel function requires hyperspherically symmetric kernels.
Properties of the Estimators

Theorem 2.4. $\lim_{N\to\infty}\hat{H}_\alpha(X) = H_\alpha(\hat{X}) \ge H_\alpha(X)$, where $\hat{X}$ is a random variable with the PDF $f_X(\cdot) * \kappa_\sigma(\cdot)$. The equality (in the inequality portion) occurs if and only if (iff) the kernel size is zero. This result is also valid on the average for the finite-sample case.

Bias of the Information Potential

The bias analysis of the IP involves the quantities

$$ a = E\!\left[K_\sigma(x_i - x_j)\,K_\sigma(x_j - x_l)\right] - E\!\left[K_\sigma(x_i - x_j)\right]E\!\left[K_\sigma(x_j - x_l)\right] $$

$$ b = \operatorname{Var}\!\left[K_\sigma(x_i - x_j)\right] $$
There is a useful physical interpretation of the IP. Think of the gravitational or the electrostatic field. Assume the samples are particles (information particles) of unit mass that interact with each other with the rules specified by the Parzen kernel used. This analogy is exact!

The information potential field is given by the kernel density estimate,

$$ \hat{V}_2(x_j) \stackrel{\Delta}{=} \frac{1}{N}\sum_{i=1}^{N} G_{\sigma\sqrt{2}}(x_j - x_i) $$

and the Information Potential is the total potential

$$ \hat{V}_2(X) = \frac{1}{N}\sum_{j=1}^{N}\hat{V}_2(x_j) $$

Now, if there is a potential, there are forces in the space of the samples that can be easily computed as

$$ \frac{\partial}{\partial x_j}\hat{V}_2(x_j) = \frac{1}{N}\sum_{i=1}^{N} G'_{\sigma\sqrt{2}}(x_j - x_i) = \frac{1}{2N\sigma^2}\sum_{i=1}^{N} G_{\sigma\sqrt{2}}(x_j - x_i)\,(x_i - x_j) $$

$$ F_2(x_j; x_i) \stackrel{\Delta}{=} \frac{1}{N}\,G'_{\sigma\sqrt{2}}(x_j - x_i) \qquad F_2(x_j) \stackrel{\Delta}{=} \frac{\partial}{\partial x_j}V_2(x_j) = \sum_{i=1}^{N} F_2(x_j; x_i) $$
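A sketch of the potential field and information forces in 1-D (Gaussian kernel; my own variable names):

```python
import numpy as np

def ip_field_and_forces(x, sigma):
    """Potential V2_hat(x_j) and information force F2(x_j) at every sample."""
    N = len(x)
    s = sigma * np.sqrt(2)                  # kernel size after the convolution
    d = x[:, None] - x[None, :]             # d[j, i] = x_j - x_i
    G = np.exp(-d**2 / (2 * s**2)) / (s * np.sqrt(2 * np.pi))
    V = G.mean(axis=1)                      # field: (1/N) sum_i G(x_j - x_i)
    F = (G * (-d)).sum(axis=1) / (2 * N * sigma**2)  # (1/(2N*s^2)) sum G*(x_i-x_j)
    return V, F

x = np.array([0.0, 0.1, 1.0])
V, F = ip_field_and_forces(x, sigma=0.5)
print(V)   # the two nearby particles sit in a higher potential
print(F)   # gradient of V: entropy maximization moves samples against it
```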
[Figure: interaction between one information particle placed at the center of a 2-D space and the resulting potential and force fields.]

The total potential can also be written as the average of all pairwise interactions:

$$ \hat{V}(X) = \frac{1}{N^2}\sum_{i=1}^{N}\sum_{j=1}^{N}\hat{V}_{ij} = \frac{1}{N}\sum_{i=1}^{N}\hat{V}(i) $$
Example of forces pulling samples apart when the entropy is maximized. The lines show the direction of the force, and their length shows the intensity of the pull.

[Figure: samples in the unit square with line segments indicating the information forces acting on each sample.]
Extension to any α and kernel

The framework of information potentials and forces extends to any Parzen kernel, where the shape of the kernel creates the interaction field. Changing $\alpha$ also changes the field. The $\alpha$ information potential field is

$$ \hat{V}_\alpha(x_j) \stackrel{\Delta}{=} \frac{1}{N^{\alpha-1}}\left(\sum_{i=1}^{N}\kappa_\sigma(x_j - x_i)\right)^{\alpha-1} \qquad \hat{V}_\alpha(X) = \frac{1}{N}\sum_{j=1}^{N}\hat{V}_\alpha(x_j) $$
Renyi's divergence

Renyi proposed a divergence measure derived from his entropy definition, which is different from the Kullback-Leibler divergence discussed in Chapter 1:

$$ D_\alpha(f \,\|\, g) \stackrel{\Delta}{=} \frac{1}{\alpha-1}\log\int_{-\infty}^{\infty} f(x)\left(\frac{f(x)}{g(x)}\right)^{\alpha-1} dx $$

with properties

i. $D_\alpha(f \,\|\, g) \ge 0, \quad \forall f, g, \alpha > 0$

Writing the divergence as an expectation over $f$ and plugging in Parzen estimates built from the samples $\{x_j^f\}$ of $f$ and $\{x_i^g\}$ of $g$ gives the estimator

$$ D_\alpha(f \,\|\, g) = \frac{1}{\alpha-1}\log E_f\!\left[\left(\frac{f(x)}{g(x)}\right)^{\alpha-1}\right] \approx \frac{1}{\alpha-1}\log\frac{1}{N}\sum_{j=1}^{N}\left(\frac{\hat{f}(x_j^f)}{\hat{g}(x_j^f)}\right)^{\alpha-1} $$

$$ \hat{D}_\alpha(f \,\|\, g) = \frac{1}{\alpha-1}\log\frac{1}{N}\sum_{j=1}^{N}\left(\frac{\sum_{i=1}^{N}\kappa(x_j^f - x_i^f)}{\sum_{i=1}^{N}\kappa(x_j^f - x_i^g)}\right)^{\alpha-1} $$
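A sketch of this plug-in divergence estimator (Gaussian kernel; the helper names are mine):

```python
import numpy as np

def parzen_at(points, samples, sigma):
    """Parzen density estimate, built from `samples`, evaluated at `points`."""
    d = points[:, None] - samples[None, :]
    k = np.exp(-d**2 / (2 * sigma**2)) / (sigma * np.sqrt(2 * np.pi))
    return k.mean(axis=1)

def renyi_divergence_hat(xf, xg, sigma, alpha):
    """Plug-in estimate of D_alpha(f || g) from samples of f and of g."""
    f_vals = parzen_at(xf, xf, sigma)       # f_hat at the f samples
    g_vals = parzen_at(xf, xg, sigma)       # g_hat at the f samples
    return np.log(np.mean((f_vals / g_vals) ** (alpha - 1))) / (alpha - 1)

rng = np.random.default_rng(4)
xf = rng.normal(0.0, 1.0, 800)
xg = rng.normal(0.5, 1.0, 800)
xf2 = rng.normal(0.0, 1.0, 800)
print(renyi_divergence_hat(xf, xg, 0.4, 2.0))    # clearly positive: f != g
print(renyi_divergence_hat(xf, xf2, 0.4, 2.0))   # near zero: same density
```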
Renyi's Mutual Information

Renyi's mutual information is also the divergence between the joint and the product of the marginals:

$$ I_\alpha(X) \stackrel{\Delta}{=} \frac{1}{\alpha-1}\log\int_{-\infty}^{\infty}\!\cdots\!\int_{-\infty}^{\infty}\frac{p_X^{\alpha}(x_1,\dots,x_n)}{\prod_{o=1}^{n} p_{X_o}^{\alpha-1}(x_o)}\,dx_1\cdots dx_n $$
Measuring dissimilarity in probability space is a complex issue, and there are many other divergence measures that can be used. If we transform the simplex to a hypersphere preserving the Fisher information, the coordinates of each PMF become $\sqrt{p_k}$. The geodesic distance between PMFs $f$ and $g$ becomes

$$ \cos D_G = \sum_k\sqrt{f_k\,g_k} $$

This is related to the Bhattacharyya distance (which is Renyi's divergence with $\alpha = 1/2$):

$$ D_B(f, g) = -\ln\!\left(\int\sqrt{f(x)g(x)}\,dx\right) $$

Now, if we measure the distance between $f$ and $g$ in a linear projection space (the chordal distance), we get

$$ D_H = \left(\sum_k\left(\sqrt{f_k} - \sqrt{g_k}\right)^2\right)^{0.5} $$

This resembles the Hellinger distance, which is also related to Renyi's divergence:

$$ D_H(f, g) = \left[\int\left(\sqrt{f(x)} - \sqrt{g(x)}\right)^2 dx\right]^{1/2} = \left[2\left(1 - \int\sqrt{f(x)g(x)}\,dx\right)\right]^{1/2} $$

Since we have an estimator for quadratic Renyi's entropy (the IP), we will substitute $\alpha = 2$ for $\alpha = 1/2$ and define two quadratic divergences, called the Euclidean distance and the Cauchy-Schwarz divergence between $f$ and $g$.
Euclidean distance between PDFs

We will drop the square root and define the distance as

$$ D_{ED}(f, g) = \int\left(f(x) - g(x)\right)^2 dx = \int f(x)^2\,dx - 2\int f(x)g(x)\,dx + \int g(x)^2\,dx $$

Notice that each of the three terms can be estimated with the IP. The middle term involves samples from the two PDFs, so it will be called the Cross Information Potential (CIP); it measures the potential created by one PDF at the locations specified by the samples of the other PDF.

The extension to multidimensional random variables is also possible,

$$ I_{ED}(X_1,\dots,X_k) = D_{ED}\!\left(f_X(x_1,\dots,x_k),\ \prod_{i=1}^{k} f_{X_i}(x_i)\right) $$

and notice that $I_{ED}$ is zero if and only if the random variables are independent.
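A sketch of the three IP terms and the resulting Euclidean divergence between two sample sets (Gaussian kernels; the helper names are mine):

```python
import numpy as np

def cross_ip(x, y, sigma):
    """Cross Information Potential between the sample sets x and y."""
    s = sigma * np.sqrt(2)                   # kernel size after convolution
    d = x[:, None] - y[None, :]
    return (np.exp(-d**2 / (2 * s**2)) / (s * np.sqrt(2 * np.pi))).mean()

def d_ed(x, y, sigma):
    """Euclidean distance between PDFs: V_f - 2*CIP + V_g."""
    return cross_ip(x, x, sigma) - 2 * cross_ip(x, y, sigma) + cross_ip(y, y, sigma)

rng = np.random.default_rng(5)
x = rng.normal(0, 1, 500)
print(d_ed(x, rng.normal(2, 1, 500), 0.5))   # clearly positive: different PDFs
print(d_ed(x, rng.normal(0, 1, 500), 0.5))   # near zero: same PDF
```

Note that `cross_ip(x, x, sigma)` is just the IP of the set `x`.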
Cauchy-Schwarz divergence

The other divergence is related to the Bhattacharyya distance, but can be formally derived from the Cauchy-Schwarz inequality:

$$ D_{CS}(f, g) = -\log\frac{\left(\int f(x)g(x)\,dx\right)^2}{\int f^2(x)\,dx\int g^2(x)\,dx} $$

There is a very interesting interpretation of $D_{CS}$ in terms of Renyi's entropy. Lutwak defines Renyi's relative entropy as

$$ D_{R_\alpha}(f, g) = \log\frac{\left(\int_{\mathbb{R}} g^{\alpha-1}(x)\,f(x)\,dx\right)^{\frac{1}{1-\alpha}}\left(\int_{\mathbb{R}} g^{\alpha}(x)\,dx\right)^{\frac{1}{\alpha}}}{\left(\int_{\mathbb{R}} f^{\alpha}(x)\,dx\right)^{\frac{1}{\alpha(1-\alpha)}}} $$
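A matching sketch for the Cauchy-Schwarz divergence, reusing `cross_ip` from the previous sketch:

```python
import numpy as np
# reuses cross_ip(x, y, sigma) from the previous sketch

def d_cs(x, y, sigma):
    """Cauchy-Schwarz divergence: log(V_f * V_g / V_c^2)."""
    vf = cross_ip(x, x, sigma)     # information potential of f
    vg = cross_ip(y, y, sigma)     # information potential of g
    vc = cross_ip(x, y, sigma)     # cross information potential
    return np.log(vf * vg / vc**2)

rng = np.random.default_rng(6)
x = rng.normal(0, 1, 500)
print(d_cs(x, rng.normal(2, 1, 500), 0.5))   # clearly positive
print(d_cs(x, rng.normal(0, 1, 500), 0.5))   # near zero when f ~ g
```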
Geometric Interpretation of Quadratic Mutual Information

Let us define

$$ V_J = \iint f_{X_1X_2}(x_1, x_2)^2\,dx_1dx_2 $$

$$ V_M = \iint\left(f_{X_1}(x_1)\,f_{X_2}(x_2)\right)^2 dx_1dx_2 $$

$$ V_c = \iint f_{X_1X_2}(x_1, x_2)\,f_{X_1}(x_1)\,f_{X_2}(x_2)\,dx_1dx_2 $$

[Figure: the joint PDF $f_{X_1X_2}(x_1, x_2)$ and the factorized PDF $f_{X_1}(x_1)f_{X_2}(x_2)$ drawn as two vectors (of squared lengths $V_J$ and $V_M$) separated by an angle $\theta$; $I_{ED}$ (Euclidean distance) is the distance between their tips, and $I_s$ (K-L divergence) is also indicated.]

$$ I_{CS} = -\log\left(\cos^2\theta\right), \qquad V_c = \cos\theta\,\sqrt{V_J V_M} $$
Example

[Figure: a 2x2 joint PMF for $(X_1, X_2)$ with marginals $P_{X_1}$ and $P_{X_2}$; the probability $P(1,1)$ is varied in $[0, 0.6]$ and $P(2,1)$ in $[0, 0.4]$, and the resulting surfaces of $I_s$, $I_{ED}$, and $I_{CS}$ are plotted as functions of $P_X^{21}$ and $P_X^{11}$.]

For the discrete case the two quadratic mutual informations read

$$ I_{ED}(X_1, X_2) = \sum_{i=1}^{n}\sum_{j=1}^{m}\left(P_X(i, j) - P_{X_1}(i)\,P_{X_2}(j)\right)^2 $$

$$ I_{CS}(X_1, X_2) = \log\frac{\left(\sum_{i=1}^{n}\sum_{j=1}^{m} P_X(i, j)^2\right)\left(\sum_{i=1}^{n}\sum_{j=1}^{m}\left(P_{X_1}(i)\,P_{X_2}(j)\right)^2\right)}{\left(\sum_{i=1}^{n}\sum_{j=1}^{m} P_X(i, j)\,P_{X_1}(i)\,P_{X_2}(j)\right)^2} $$
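A small sketch of the discrete case (a 2x2 joint PMF; the values are illustrative, not the slide's exact figure):

```python
import numpy as np

def qmi_ed_cs(P):
    """Discrete quadratic mutual informations I_ED and I_CS of a joint PMF."""
    P1 = P.sum(axis=1, keepdims=True)     # marginal of X1 (rows)
    P2 = P.sum(axis=0, keepdims=True)     # marginal of X2 (columns)
    M = P1 * P2                           # product of the marginals
    i_ed = np.sum((P - M) ** 2)
    i_cs = np.log(np.sum(P**2) * np.sum(M**2) / np.sum(P * M) ** 2)
    return i_ed, i_cs

P_dep = np.array([[0.4, 0.1],
                  [0.1, 0.4]])            # dependent joint PMF
P_ind = np.outer([0.5, 0.5], [0.5, 0.5])  # independent joint PMF
print(qmi_ed_cs(P_dep))                   # both measures > 0
print(qmi_ed_cs(P_ind))                   # both measures = 0
```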
Information Potential and Forces in the Joint Spaces

The important lesson to be learned from the IP framework and the distance measures defined is that each PDF creates its own potential, which interacts with the others. This is an additive process, so the forces are also additive, and everything becomes relatively simple:

$$ \hat{D}_{ED}(f, g) = \hat{V}_{ED} = \hat{V}_f + \hat{V}_g - 2\hat{V}_c \qquad \frac{\partial\hat{V}_{ED}}{\partial x_i} = \frac{\partial\hat{V}_f}{\partial x_i} + \frac{\partial\hat{V}_g}{\partial x_i} - 2\frac{\partial\hat{V}_c}{\partial x_i} $$

$$ \hat{D}_{CS}(f, g) = \hat{V}_{CS} = \log\frac{\hat{V}_f\hat{V}_g}{\hat{V}_c^2} \qquad \frac{\partial\hat{V}_{CS}}{\partial x_i} = \frac{1}{\hat{V}_f}\frac{\partial\hat{V}_f}{\partial x_i} + \frac{1}{\hat{V}_g}\frac{\partial\hat{V}_g}{\partial x_i} - \frac{2}{\hat{V}_c}\frac{\partial\hat{V}_c}{\partial x_i} $$
Quadratic Mutual Information

The QMIs are a bit more detailed because now we have to deal with the joint, the marginals, and their product. But everything is still additive. The three PDFs are

$$ \hat{f}_{X_1X_2}(x_1, x_2) = \frac{1}{N}\sum_{i=1}^{N} G_\sigma(x - x(i)) $$

$$ \hat{f}_{X_1}(x_1) = \frac{1}{N}\sum_{i=1}^{N} G_\sigma(x_1 - x_1(i)) \qquad \hat{f}_{X_2}(x_2) = \frac{1}{N}\sum_{i=1}^{N} G_\sigma(x_2 - x_2(i)) $$

where $x = (x_1, x_2)$ and the joint kernel is the product of the marginal kernels.
Let us exemplify for $V_c$:

$$
V_c = \iint\left[\frac{1}{N}\sum_{k=1}^{N}G_\sigma(x_1 - x_1(k))\,G_\sigma(x_2 - x_2(k))\right]\left[\frac{1}{N}\sum_{i=1}^{N}G_\sigma(x_1 - x_1(i))\right]\left[\frac{1}{N}\sum_{j=1}^{N}G_\sigma(x_2 - x_2(j))\right]dx_1dx_2
$$

$$
= \frac{1}{N}\sum_{k=1}^{N}\frac{1}{N}\sum_{i=1}^{N}\frac{1}{N}\sum_{j=1}^{N}\int G_\sigma(x_1 - x_1(i))\,G_\sigma(x_1 - x_1(k))\,dx_1\int G_\sigma(x_2 - x_2(k))\,G_\sigma(x_2 - x_2(j))\,dx_2
$$

$$
= \frac{1}{N}\sum_{k=1}^{N}\left[\frac{1}{N}\sum_{i=1}^{N}G_{\sigma\sqrt{2}}\big(x_1(k) - x_1(i)\big)\right]\left[\frac{1}{N}\sum_{j=1}^{N}G_{\sigma\sqrt{2}}\big(x_2(k) - x_2(j)\big)\right] \qquad (2.102)
$$

Collecting the three potentials, the sample estimator of the Cauchy-Schwarz QMI is

$$
\hat{I}_{CS}(X_1, X_2) = \log\frac{\left(\dfrac{1}{N^2}\displaystyle\sum_{i=1}^{N}\sum_{j=1}^{N}G_{\sigma\sqrt{2}}\big(x_1(i) - x_1(j)\big)\,G_{\sigma\sqrt{2}}\big(x_2(i) - x_2(j)\big)\right)\hat{V}_1\hat{V}_2}{\left(\dfrac{1}{N}\displaystyle\sum_{i=1}^{N}\left(\dfrac{1}{N}\sum_{j=1}^{N}G_{\sigma\sqrt{2}}\big(x_1(i) - x_1(j)\big)\right)\left(\dfrac{1}{N}\sum_{j=1}^{N}G_{\sigma\sqrt{2}}\big(x_2(i) - x_2(j)\big)\right)\right)^2}
$$

where $\hat{V}_k = \frac{1}{N^2}\sum_{i=1}^{N}\sum_{j=1}^{N}G_{\sigma\sqrt{2}}\big(x_k(i) - x_k(j)\big)$ is the IP of the $k$-th marginal.
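A direct sketch of these sums for two 1-D variables (illustrative, $O(N^2)$; the names are mine):

```python
import numpy as np

def qmi_cs_hat(x1, x2, sigma):
    """Sample-based Cauchy-Schwarz QMI: log(V_J * V_1 * V_2 / V_c^2)."""
    s = sigma * np.sqrt(2)
    def G(d):
        return np.exp(-d**2 / (2 * s**2)) / (s * np.sqrt(2 * np.pi))
    G1 = G(x1[:, None] - x1[None, :])   # pairwise kernels in each marginal
    G2 = G(x2[:, None] - x2[None, :])
    V_J = (G1 * G2).mean()              # joint potential
    V_1, V_2 = G1.mean(), G2.mean()     # marginal potentials
    V_c = (G1.mean(axis=1) * G2.mean(axis=1)).mean()  # cross potential
    return np.log(V_J * V_1 * V_2 / V_c**2)

rng = np.random.default_rng(7)
x = rng.normal(0, 1, 400)
print(qmi_cs_hat(x, x + 0.1 * rng.normal(size=400), 0.3))  # dependent: > 0
print(qmi_cs_hat(x, rng.normal(0, 1, 400), 0.3))           # independent: ~ 0
```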
The interactions are always based on the marginals, but there are three levels: the level of the individual marginal samples (the joint space), the level of a marginal sample against the marginal field (the CIP), and the product of the two marginal fields. The framework generalizes to any number of variables:

$$ \hat{I}_{ED}(X_1,\dots,X_K) = \frac{1}{N^2}\sum_{i=1}^{N}\sum_{j=1}^{N}\prod_{k=1}^{K}\hat{V}_k(ij) - \frac{2}{N}\sum_{i=1}^{N}\prod_{k=1}^{K}\hat{V}_k(i) + \prod_{k=1}^{K}\hat{V}_k $$

$$ \hat{I}_{CS}(X_1,\dots,X_K) = \log\frac{\left(\dfrac{1}{N^2}\displaystyle\sum_{i=1}^{N}\sum_{j=1}^{N}\prod_{k=1}^{K}\hat{V}_k(ij)\right)\displaystyle\prod_{k=1}^{K}\hat{V}_k}{\left(\dfrac{1}{N}\displaystyle\sum_{i=1}^{N}\prod_{k=1}^{K}\hat{V}_k(i)\right)^2} $$

The information forces for each field are also easily derived:

$$ \hat{F}_{ED}(i) = \frac{\partial\hat{I}_{ED}}{\partial x_k(i)} = \frac{\partial\hat{V}_J}{\partial x_k(i)} - 2\frac{\partial\hat{V}_C}{\partial x_k(i)} + \frac{\partial\hat{V}_k}{\partial x_k(i)} $$

$$ \hat{F}_{CS}(i) = \frac{\partial\hat{I}_{CS}}{\partial x_k(i)} = \frac{1}{\hat{V}_J}\frac{\partial\hat{V}_J}{\partial x_k(i)} - \frac{2}{\hat{V}_C}\frac{\partial\hat{V}_C}{\partial x_k(i)} + \frac{1}{\hat{V}_k}\frac{\partial\hat{V}_k}{\partial x_k(i)} $$
Fast Information and Cross Information Potential Calculations

One of the problems with this IP estimator methodology is the computational complexity, which is $O(N^2)$ or $O(N^3)$. There are two ways to get approximations to the IP and CIP with arbitrary accuracy: the Fast Gauss Transform (FGT) and the incomplete Cholesky decomposition. They both transform the complexity to $O(NM)$ with $M \ll N$.

The FGT takes advantage of the fast decay of the Gaussian function and can efficiently compute weighted sums of scalar Gaussians,

$$ S(y_i) = \sum_{j=1}^{N} w_j\,e^{-(y_j - y_i)^2/4\sigma^2}, \qquad i = 1,\dots,M $$

The savings come from the shifting property of the Gaussian function, which reads

$$ e^{-\left(\frac{y_j - y_i}{\sigma}\right)^2} = e^{-\left(\frac{(y_j - y_c) - (y_i - y_c)}{\sigma}\right)^2} = e^{-\left(\frac{y_j - y_c}{\sigma}\right)^2}\sum_{n=0}^{\infty}\frac{1}{n!}\left(\frac{y_i - y_c}{\sigma}\right)^n H_n\!\left(\frac{y_j - y_c}{\sigma}\right) $$

where $H_n$ are the Hermite polynomials, $H_n(y) = (-1)^n\,e^{y^2}\,\dfrac{d^n}{dy^n}e^{-y^2}$.
In higher dimensions the same factorization applies around a center $c$,

$$ \exp\!\left(-\frac{\left\|y_j - y_i\right\|^2}{4\sigma^2}\right) = \exp\!\left(-\frac{\left\|y_j - c\right\|^2}{4\sigma^2}\right)\exp\!\left(-\frac{\left\|y_i - c\right\|^2}{4\sigma^2}\right)\exp\!\left(\frac{2\,(y_j - c)\cdot(y_i - c)}{4\sigma^2}\right) $$

and the cross term is approximated by a Taylor expansion as

$$ \exp\!\left(\frac{2\,(y_j - c)\cdot(y_i - c)}{4\sigma^2}\right) = \sum_{\alpha\ge 0}\frac{2^{|\alpha|}}{\alpha!}\left(\frac{y_j - c}{2\sigma}\right)^{\alpha}\left(\frac{y_i - c}{2\sigma}\right)^{\alpha} + \varepsilon(\alpha) $$

with the multi-index conventions $\alpha! = \alpha_1!\,\alpha_2!\cdots\alpha_d!$ and $|\alpha| = \alpha_1 + \alpha_2 + \cdots + \alpha_d$.
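A toy single-center version of this expansion in 1-D (a real FGT partitions the sources into clusters and truncates the series adaptively; the structure and names here are my own sketch):

```python
import math
import numpy as np

def gauss_sum_direct(y_src, w, y_tgt, sigma):
    """Exact O(N*M): S(y_i) = sum_j w_j exp(-(y_j - y_i)^2 / (4 sigma^2))."""
    d = y_tgt[:, None] - y_src[None, :]
    return (w * np.exp(-d**2 / (4 * sigma**2))).sum(axis=1)

def gauss_sum_taylor(y_src, w, y_tgt, sigma, P=30):
    """Single-center Taylor approximation, O((N + M) * P) instead of O(N*M)."""
    c = y_src.mean()                     # one expansion center, for simplicity
    u_s = (y_src - c) / (2 * sigma)
    u_t = (y_tgt - c) / (2 * sigma)
    n = np.arange(P)
    # source moments C_n = sum_j w_j exp(-u_j^2) u_j^n, computed once
    C = (w * np.exp(-u_s**2) * u_s[None, :] ** n[:, None]).sum(axis=1)
    coef = np.array([2.0**k / math.factorial(k) for k in range(P)])
    return np.exp(-u_t**2) * (coef * C * u_t[:, None] ** n).sum(axis=1)

rng = np.random.default_rng(8)
y = rng.normal(0, 1, 2000)
w = np.full(len(y), 1 / len(y))
print(gauss_sum_direct(y, w, y[:3], 1.0))
print(gauss_sum_taylor(y, w, y[:3], 1.0))   # agrees to many digits
```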
Incomplete Cholesky Decomposition

It turns out that the Gram matrix created by the pairwise evaluation of the Gaussian kernel is full rank, but its eigenspectrum decays very fast, and we can take advantage of this fact. For an $N\times N$ symmetric positive definite matrix $K = G^T G$, where $G$ is a lower triangular matrix, the incomplete Cholesky decomposition computes an $N\times D$ matrix $\tilde{G}$, with $D \ll N$, such that $K \approx \tilde{G}\tilde{G}^T$. In terms of $\tilde{G}$ the QMI estimators become

$$ \hat{I}_{ED} = \frac{1}{N^2}\mathbf{1}_{D_x}^T\!\left(\tilde{G}_{XX}^T\tilde{G}_{YY}\circ\tilde{G}_{XX}^T\tilde{G}_{YY}\right)\mathbf{1}_{D_y} + \frac{1}{N^4}\left\|\mathbf{1}_N^T\tilde{G}_{XX}\right\|^2\left\|\mathbf{1}_N^T\tilde{G}_{YY}\right\|^2 - \frac{2}{N^3}\left(\mathbf{1}_N^T\tilde{G}_{XX}\right)\!\left(\tilde{G}_{XX}^T\tilde{G}_{YY}\right)\!\left(\tilde{G}_{YY}^T\mathbf{1}_N\right) $$

$$ \hat{I}_{CS} = \log\frac{\left[\mathbf{1}_{D_x}^T\!\left(\tilde{G}_{XX}^T\tilde{G}_{YY}\circ\tilde{G}_{XX}^T\tilde{G}_{YY}\right)\mathbf{1}_{D_y}\right]\left\|\mathbf{1}_N^T\tilde{G}_{XX}\right\|^2\left\|\mathbf{1}_N^T\tilde{G}_{YY}\right\|^2}{\left[\left(\mathbf{1}_N^T\tilde{G}_{XX}\right)\!\left(\tilde{G}_{XX}^T\tilde{G}_{YY}\right)\!\left(\tilde{G}_{YY}^T\mathbf{1}_N\right)\right]^2} $$

where $\circ$ denotes the Hadamard (elementwise) product.
However, the CIP is not associated with a positive definite matrix, so we need to extend $K_{XX}$ to create a positive definite matrix

$$ K_{ZZ} = \begin{bmatrix} K_{XX} & K_{XY} \\ K_{XY}^T & K_{YY} \end{bmatrix} $$

where $K_{XY}$ denotes the Gram matrix for the CIP. The calculation of the CIP becomes

$$ \hat{V}(X; Y) = \frac{1}{N^2}\sum_{i=1}^{N}\sum_{j=1}^{N}\kappa(x_i - y_j) = \frac{1}{N^2}\,e_1^T K_{ZZ}\,e_2 = \frac{1}{N^2}\,e_1^T\tilde{G}_{ZZ}\tilde{G}_{ZZ}^T e_2 $$

with $e_1 = (\underbrace{1,\dots,1}_{N},\underbrace{0,\dots,0}_{N})^T$ and $e_2 = (\underbrace{0,\dots,0}_{N},\underbrace{1,\dots,1}_{N})^T$.
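A sketch of the pivoted incomplete Cholesky factorization for a Gaussian Gram matrix and its use for the IP (the kernel normalization constant is omitted for brevity; the function and tolerance are my own choices):

```python
import numpy as np

def incomplete_cholesky(x, sigma, tol=1e-6):
    """Pivoted incomplete Cholesky of K[i, j] = exp(-(x_i - x_j)^2 / (2 sigma^2)).
    Returns G of shape (N, D) with K ~= G @ G.T; D << N when the spectrum decays."""
    N = len(x)
    diag = np.ones(N)                  # residual diagonal; K[i, i] = 1 here
    G = np.zeros((N, 0))
    while diag.sum() > tol and G.shape[1] < N:
        i = int(np.argmax(diag))       # pivot on the largest residual
        k_i = np.exp(-(x - x[i])**2 / (2 * sigma**2))   # column i of K
        g = (k_i - G @ G[i]) / np.sqrt(diag[i])
        G = np.column_stack([G, g])
        diag = np.maximum(diag - g**2, 0.0)
    return G

rng = np.random.default_rng(9)
x = rng.normal(0, 1, 1000)
G = incomplete_cholesky(x, sigma=0.5)
print(G.shape)                         # typically only a few dozen columns
ip_exact = np.exp(-(x[:, None] - x)**2 / (2 * 0.5**2)).mean()  # (1/N^2) 1'K1
ip_lowrank = np.sum(G.sum(axis=0)**2) / len(x)**2              # (1/N^2)||G'1||^2
print(ip_exact, ip_lowrank)            # agree to within the tolerance
```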