
Notes on Kullback-Leibler Divergence and Likelihood Theory

Jonathon Shlens
Systems Neurobiology Laboratory, Salk Institute for Biological Studies, La Jolla, CA 92037 (Dated: August 20, 2007, version 1.0)
The Kullback-Leibler (KL) divergence is a fundamental equation of information theory that quantifies the proximity of two probability distributions. Although difficult to understand by examining the equation, an intuition and understanding of the KL divergence arises from its intimate relationship with likelihood theory. We discuss how KL divergence arises from likelihood theory in an attempt to provide some intuition and reserve a rigorous (but rather simple) derivation for the appendix. Finally, we comment on recent applications of KL divergence in the neural coding literature and highlight its natural application.

The Kullback-Leibler (KL) divergence is a measure in statistics (Cover and Thomas, 1991) that quantifies in bits how close a probability distribution p = {p_i} is to a model (or candidate) distribution q = {q_i},

D_{KL}(p \| q) = \sum_i p_i \log_2 \frac{p_i}{q_i}    (1)
D_{KL} is non-negative (\geq 0), not symmetric in p and q, zero if the distributions match exactly, and can potentially equal infinity. A common technical interpretation - although bereft of intuition - is that the KL divergence is the coding penalty associated with selecting a distribution q to approximate the true distribution p (Cover and Thomas, 1991). An intuitive understanding, however, arises from likelihood theory - the probability that one observes a set of data given that a particular model is true (Duda et al., 2001). Pretend we perform an experiment to measure a discrete, random variable - such as rolling a die many times (or, in neuroscience, the simultaneous binned firing patterns of multiple neurons). If we perform a long experiment and make n measurements, we can count the number of times we observe each face of the die (or similarly, each firing pattern of neurons), a histogram c = {c_i}, where n = \sum_i c_i.

This histogram measures the relative frequency of each face of the die (or, each type of firing pattern). If this experiment lasts forever, the normalized histogram counts c_i/n reflect an underlying distribution p_i = c_i/n. Pretend we have a candidate model for the die (or firing patterns), the distribution q. What is the probability of observing the histogram counts c if the model q actually generated the observations? This probability is given by the multinomial likelihood (Duda et al., 2001),

L \propto \prod_i q_i^{c_i}

To gain some intuition, imagine that we performed n = 1 measurements - in this case, the likelihood would be the q_i attributed to the single observed firing pattern. The likelihood L shrinks multiplicatively as we perform more measurements (as n grows). Ideally, we want the probability to be invariant to the number of measurements - this is given by the average likelihood \bar{L} = L^{1/n}, a number between 0 and 1. Matching intuition, as we perform more measurements, if c_i/n \to q_i, then the average likelihood would be perfect, or \bar{L} \to 1. Conversely, as c_i/n diverges from the model q_i, the average likelihood \bar{L} decreases, approaching zero. The link between likelihood and the KL divergence arises from the fact that if we perform an infinite number of measurements (see Appendix; Shlens et al. (2006); Section 12.1 of Cover and Thomas (1991)),

D_{KL}(p \| q) = -\log_2 \bar{L}    (2)
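
A minimal simulation of Equation 2 (our own sketch, with invented distributions): draw n samples from p, histogram them, evaluate the full multinomial likelihood of Equation A1 (including its combinatorial prefactor) under the model q, and compare -\frac{1}{n} \log_2 L with D_{KL}(p \| q).

    import math
    import random

    p = [0.5, 0.3, 0.2]        # "true" distribution (invented for illustration)
    q = [1/3, 1/3, 1/3]        # candidate model

    def log2_multinomial(c, q):
        # log2 of the multinomial likelihood L(c|q) of Equation A1,
        # computed with log-gamma to avoid overflowing factorials.
        n = sum(c)
        log_l = math.lgamma(n + 1) - sum(math.lgamma(ci + 1) for ci in c)
        log_l += sum(ci * math.log(qi) for ci, qi in zip(c, q))
        return log_l / math.log(2)

    d_kl = sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q))

    random.seed(0)
    for n in (100, 10_000, 1_000_000):
        draws = random.choices(range(len(p)), weights=p, k=n)
        counts = [draws.count(i) for i in range(len(p))]
        print(n, -log2_multinomial(counts, q) / n, d_kl)   # the two numbers converge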

Thus, if the distributions p and q are identical, \bar{L} = 1 and D_{KL} = 0 (or if \bar{L} = 0, D_{KL} = \infty). The central intuition is that the KL divergence effectively measures the average likelihood of observing (infinite) data with the distribution p if the particular model q actually generated the data.

The KL divergence has many applications and is a foundation of information theory and statistics (Cover and Thomas, 1991). For example, one can ask how similar a joint distribution p(x, y) is to the product of its marginals p(x)p(y) - this is the mutual information, a general measure of statistical dependence between two random variables (Cover and Thomas, 1991),

I(X; Y) = \sum_{x,y} p(x, y) \log_2 \frac{p(x, y)}{p(x)\,p(y)}    (3)

The mutual information is zero if and only if the two random variables X and Y are statistically independent. In addition to its role in mutual information, the KL divergence has been applied extensively in the neural coding literature, most recently to quantify the effects of conditional dependence between neurons (Amari and Nakahara, 2006; Latham and Nirenberg, 2005; Schneidman et al., 2003) and to measure how well higher order correlations can be approximated by lower order structure (Schneidman et al., 2006; Shlens et al., 2006).
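
Since Equation 3 is just D_{KL} between the joint distribution and the product of its marginals, a short sketch (ours, with an invented 2x2 joint distribution) makes the connection concrete:

    import math

    # Invented joint distribution over two binary variables X and Y.
    p_xy = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}

    # Marginal distributions p(x) and p(y).
    p_x = {x: sum(v for (xi, _), v in p_xy.items() if xi == x) for x in (0, 1)}
    p_y = {y: sum(v for (_, yi), v in p_xy.items() if yi == y) for y in (0, 1)}

    # I(X;Y) = D_KL( p(x,y) || p(x) p(y) ), Equation 3.
    mi = sum(v * math.log2(v / (p_x[x] * p_y[y]))
             for (x, y), v in p_xy.items() if v > 0)
    print(mi)   # positive: X and Y are statistically dependent

Making the joint distribution equal to the product of its marginals would drive this value to zero, matching the independence statement above.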

APPENDIX A: Derivation

In this appendix we prove the relationship asserted in Equation 2. The derivation involves three main ideas: the application of Stirling's approximation, playing around with some algebra, and recognizing an implicit probability distribution. First, we begin with some key definitions.

Multinomial likelihood. The multinomial likelihood expresses the probability of observing a histogram c = {c_i} given that a particular model q = {q_i} is true,

L(c|q) = \frac{n!}{\prod_i c_i!} \prod_i q_i^{c_i}    (A1)

The term in front, \frac{n!}{\prod_i c_i!}, is a normalization constant that counts the number of combinations which could give rise to the particular histogram. Note that n = \sum_i c_i is the total number of measurements.
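
As a small numeric check of Equation A1 (our sketch; the histogram and model are invented), the exact multinomial probability of a particular histogram under a model q can be computed directly:

    from math import factorial, prod

    def multinomial_likelihood(c, q):
        # L(c|q) of Equation A1: the probability of observing histogram c under model q.
        n = sum(c)
        combinations = factorial(n) // prod(factorial(ci) for ci in c)  # n! / prod_i c_i!
        return combinations * prod(qi ** ci for ci, qi in zip(c, q))

    # e.g. ten rolls of a fair die that happened to produce this histogram
    print(multinomial_likelihood([3, 2, 1, 1, 2, 1], [1/6] * 6))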

Stirling's approximation. Stirling's approximation, \log n! \approx n \log n - n, is a numerical approximation useful for large factorials that often appear in combinatorics. This approximation becomes quite good for n > O(100).

We now begin the derivation by remembering that independent observations constituting a histogram are multiplied together to recover the joint probability of all measurements. Thus, an invariant likelihood across histogram counts is the geometric mean of the multinomial likelihood, L(c|q)^{1/n}. We term this quantity the average multinomial likelihood, or average likelihood for short. We start by defining the average log-likelihood as

\log \bar{L} = \frac{1}{n} \log L(c|q)

Plugging in Equation A1 and a little algebra later,

\log \bar{L} = \frac{1}{n} \log \left[ \frac{n!}{\prod_i c_i!} \prod_i q_i^{c_i} \right]
             = \frac{1}{n} \log n! - \frac{1}{n} \sum_i \log c_i! + \sum_i \frac{c_i}{n} \log q_i

We now plug in Stirling's approximation to simplify,

\log \bar{L} \approx \frac{1}{n} (n \log n - n) - \frac{1}{n} \sum_i (c_i \log c_i - c_i) + \sum_i \frac{c_i}{n} \log q_i
             = \log n - \sum_i \frac{c_i}{n} \log c_i + \sum_i \frac{c_i}{n} \log q_i

Finally, rearranging terms highlights an implicit probability distribution,

\log \bar{L} = \sum_i \frac{c_i}{n} \log n - \sum_i \frac{c_i}{n} \log c_i + \sum_i \frac{c_i}{n} \log q_i
             = -\sum_i \frac{c_i}{n} \log \frac{c_i}{n} + \sum_i \frac{c_i}{n} \log q_i

In the limit of n \to \infty, the normalized histogram c_i/n can be viewed as a probability distribution p_i and substituted accordingly,

\log \bar{L} = -\sum_i p_i \log p_i + \sum_i p_i \log q_i
             = -D_{KL}(p \| q)

where we now recognize the KL divergence (Equation 1). The result can be summarized as

D_{KL}(p \| q) = -\lim_{n \to \infty} \frac{1}{n} \log L(c|q)    (A2)

or: the KL divergence is the negative logarithm of the average multinomial likelihood.

A closer look at this derivation reveals that the normalization constant in front of Equation A1 directly results in the term -\sum_i p_i \log p_i, which is the entropy of the distribution. Thus, it is possible to derive the entropy of a distribution from purely combinatorial notions (Jaynes, 2003).
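
A quick numerical check of this combinatorial route to entropy (our own sketch, with an invented distribution): the base-2 logarithm of the multinomial coefficient, divided by n, approaches the entropy of p as n grows.

    import math

    p = [0.5, 0.3, 0.2]                                  # invented distribution
    entropy = -sum(pi * math.log2(pi) for pi in p)       # entropy of p in bits

    for n in (100, 10_000, 1_000_000):
        c = [round(n * pi) for pi in p]                  # idealized counts c_i = n p_i
        log2_coeff = (math.lgamma(n + 1)
                      - sum(math.lgamma(ci + 1) for ci in c)) / math.log(2)
        print(n, log2_coeff / n, entropy)                # converges to the entropy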

References

S Amari and H Nakahara. Correlation and independence in the neural code. Neural Comput, 18(6):1259-1267, 2006.
TM Cover and JA Thomas. Elements of information theory. Wiley, New York, 1991.
RO Duda, PE Hart, and DG Stork. Pattern classification. Wiley & Sons, New York, 2001.
ET Jaynes. Probability Theory: The Logic of Science. Cambridge University Press, London, 2003.
PE Latham and S Nirenberg. Synergy, redundancy, and independence in population codes, revisited. J Neurosci, 25(21):5195-5206, 2005.
E Schneidman, S Still, MJ Berry, and W Bialek. Network information and connected correlations. Phys Rev Lett, 91:238701, 2003.
E Schneidman, MJ Berry, R Segev, and W Bialek. Weak pairwise correlations imply strongly correlated network states in a neural population. Nature, 440(7087):1007-1012, 2006.
CE Shannon and W Weaver. The mathematical theory of communication. University of Illinois Press, Urbana, 1949.
J Shlens, GD Field, JL Gauthier, MI Grivich, D Petrusca, A Sher, AM Litke, and EJ Chichilnisky. The structure of multi-neuron firing patterns in primate retina. J Neurosci, 26(32):8254-8266, 2006.
