
Notes on Kullback-Leibler Divergence and Likelihood Theory

Jonathon Shlens
Systems Neurobiology Laboratory, Salk Institute for Biological Studies, La Jolla, CA 92037 (Dated: August 20, 2007, version 1.0)
The Kullback-Leibler (KL) divergence is a fundamental equation of information theory that quantifies the proximity of two probability distributions. Although difficult to understand by examining the equation, an intuition and understanding of the KL divergence arises from its intimate relationship with likelihood theory. We discuss how KL divergence arises from likelihood theory in an attempt to provide some intuition and reserve a rigorous (but rather simple) derivation for the appendix. Finally, we comment on recent applications of KL divergence in the neural coding literature and highlight its natural application.

The Kullback-Leibler (KL) divergence is a measure in statistics (Cover and Thomas, 1991) that quantifies in bits how close a probability distribution p = {p_i} is to a model (or candidate) distribution q = {q_i},

D_{KL}(p \| q) = \sum_i p_i \log_2 \frac{p_i}{q_i}    (1)
D_{KL} is non-negative (\geq 0), not symmetric in p and q, zero if the distributions match exactly, and can potentially equal infinity. A common technical interpretation - although bereft of intuition - is that the KL divergence is the coding penalty associated with selecting a distribution q to approximate the true distribution p (Cover and Thomas, 1991). An intuitive understanding, however, arises from likelihood theory - the probability that one observes a set of data given that a particular model is true (Duda et al., 2001). Pretend we perform an experiment to measure a discrete, random variable - such as rolling a die many times (or, in neuroscience, the simultaneous binned firing patterns of multiple neurons). If we perform a long experiment and make n measurements, we can count the number of times we observe each face of the die (or similarly, each firing pattern of neurons), a histogram c = {c_i}, where n = \sum_i c_i.

This histogram measures the relative frequency of each face of the die (or, each type of firing pattern). If this experiment lasts forever, the normalized histogram counts c_i/n reflect an underlying distribution p_i = c_i/n. Pretend we have a candidate model for the die (or firing patterns), the distribution q. What is the probability of observing the histogram counts c if the model q actually generated the observations? This probability is given by the multinomial likelihood (Duda et al., 2001),

L \propto \prod_i q_i^{c_i}

To gain some intuition, imagine that we performed n = 1 measurements - in this case, the likelihood would be the q_i attributed to the single observed firing pattern. The likelihood L shrinks multiplicatively as we perform more measurements (as n grows). Ideally, we want the probability to be invariant to the number of measurements - this is given by the average likelihood \bar{L} = L^{1/n}, a number between 0 and 1. Matching intuition, as we perform more measurements, if c_i/n \to q_i, then the average likelihood would be perfect, or \bar{L} \to 1. Conversely, as c_i/n diverges from the model q_i, the average likelihood \bar{L} decreases, approaching zero. The link between likelihood and the KL divergence arises from the fact that if we perform an infinite number of measurements (see Appendix; Shlens et al. (2006); Section 12.1 of Cover and Thomas (1991)),

D_{KL}(p \| q) = -\log_2 \bar{L}    (2)
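
A minimal simulation of Equation 2 (our own sketch, with invented distributions): draw n samples from p, histogram them, evaluate the full multinomial likelihood of Equation A1 (including its combinatorial prefactor) under the model q, and compare -\frac{1}{n} \log_2 L with D_{KL}(p \| q).

    import math
    import random

    p = [0.5, 0.3, 0.2]        # "true" distribution (invented for illustration)
    q = [1/3, 1/3, 1/3]        # candidate model

    def log2_multinomial(c, q):
        # log2 of the multinomial likelihood L(c|q) of Equation A1,
        # computed with log-gamma to avoid overflowing factorials.
        n = sum(c)
        log_l = math.lgamma(n + 1) - sum(math.lgamma(ci + 1) for ci in c)
        log_l += sum(ci * math.log(qi) for ci, qi in zip(c, q))
        return log_l / math.log(2)

    d_kl = sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q))

    random.seed(0)
    for n in (100, 10_000, 1_000_000):
        draws = random.choices(range(len(p)), weights=p, k=n)
        counts = [draws.count(i) for i in range(len(p))]
        print(n, -log2_multinomial(counts, q) / n, d_kl)   # the two numbers converge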

Thus, if the distributions p and q are identical, \bar{L} = 1 and D_{KL} = 0 (or if \bar{L} = 0, D_{KL} = \infty). The central intuition is that the KL divergence effectively measures the average likelihood of observing (infinite) data with the distribution p if the particular model q actually generated the data.

The KL divergence has many applications and is a foundation of information theory and statistics (Cover and Thomas, 1991). For example, one can ask how similar a joint distribution p(x, y) is to the product of its marginals p(x)p(y) - this is the mutual information, a general measure of statistical dependence between two random variables (Cover and Thomas, 1991),

I(X; Y) = \sum_{x,y} p(x, y) \log_2 \frac{p(x, y)}{p(x)\,p(y)}    (3)

The mutual information is zero if and only if the two random variables X and Y are statistically independent. In addition to its role in mutual information, the KL divergence has been applied extensively in the neural coding literature, most recently to quantify the effects of conditional dependence between neurons (Amari and Nakahara, 2006; Latham and Nirenberg, 2005; Schneidman et al., 2003) and to measure how well higher order correlations can be approximated by lower order structure (Schneidman et al., 2006; Shlens et al., 2006).
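
Since Equation 3 is just D_{KL} between the joint distribution and the product of its marginals, a short sketch (ours, with an invented 2x2 joint distribution) makes the connection concrete:

    import math

    # Invented joint distribution over two binary variables X and Y.
    p_xy = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}

    # Marginal distributions p(x) and p(y).
    p_x = {x: sum(v for (xi, _), v in p_xy.items() if xi == x) for x in (0, 1)}
    p_y = {y: sum(v for (_, yi), v in p_xy.items() if yi == y) for y in (0, 1)}

    # I(X;Y) = D_KL( p(x,y) || p(x) p(y) ), Equation 3.
    mi = sum(v * math.log2(v / (p_x[x] * p_y[y]))
             for (x, y), v in p_xy.items() if v > 0)
    print(mi)   # positive: X and Y are statistically dependent

Making the joint distribution equal to the product of its marginals would drive this value to zero, matching the independence statement above.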

APPENDIX A: Derivation

In this appendix we prove the relationship asserted in Equation 2. The derivation involves three main ideas: the application of Stirling's approximation, playing around with some algebra, and recognizing an implicit probability distribution. First, we begin with some key definitions.

Multinomial likelihood. The multinomial likelihood expresses the probability of observing a histogram c = {c_i} given that a particular model q = {q_i} is true,

L(c|q) = \frac{n!}{\prod_i c_i!} \prod_i q_i^{c_i}    (A1)

The term in front, \frac{n!}{\prod_i c_i!}, is a normalization constant that counts the number of combinations which could give rise to the particular histogram. Note that n = \sum_i c_i is the total number of measurements.
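
As a small numeric check of Equation A1 (our sketch; the histogram and model are invented), the exact multinomial probability of a particular histogram under a model q can be computed directly:

    from math import factorial, prod

    def multinomial_likelihood(c, q):
        # L(c|q) of Equation A1: the probability of observing histogram c under model q.
        n = sum(c)
        combinations = factorial(n) // prod(factorial(ci) for ci in c)  # n! / prod_i c_i!
        return combinations * prod(qi ** ci for ci, qi in zip(c, q))

    # e.g. ten rolls of a fair die that happened to produce this histogram
    print(multinomial_likelihood([3, 2, 1, 1, 2, 1], [1/6] * 6))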

Stirling's approximation. Stirling's approximation, \log n! \approx n \log n - n, is a numerical approximation useful for large factorials that often appear in combinatorics. This approximation becomes quite good for n > O(100).

We now begin the derivation by remembering that independent observations constituting a histogram are multiplied together to recover the joint probability of all measurements. Thus, an invariant likelihood across histogram counts is the geometric mean of the multinomial likelihood, L(c|q)^{1/n}. We term this quantity the average multinomial likelihood, or average likelihood for short. We start by defining the average log-likelihood as

\log \bar{L} = \frac{1}{n} \log L(c|q)

Plugging in Equation A1 and a little algebra later,

\log \bar{L} = \frac{1}{n} \log \left[ \frac{n!}{\prod_i c_i!} \prod_i q_i^{c_i} \right]
             = \frac{1}{n} \log n! - \frac{1}{n} \sum_i \log c_i! + \sum_i \frac{c_i}{n} \log q_i

We now plug in Stirling's approximation to simplify,

\log \bar{L} \approx \frac{1}{n} (n \log n - n) - \frac{1}{n} \sum_i (c_i \log c_i - c_i) + \sum_i \frac{c_i}{n} \log q_i
             = \log n - \sum_i \frac{c_i}{n} \log c_i + \sum_i \frac{c_i}{n} \log q_i

Finally, rearranging terms highlights an implicit probability distribution,

\log \bar{L} = \sum_i \frac{c_i}{n} \log n - \sum_i \frac{c_i}{n} \log c_i + \sum_i \frac{c_i}{n} \log q_i
             = -\sum_i \frac{c_i}{n} \log \frac{c_i}{n} + \sum_i \frac{c_i}{n} \log q_i

In the limit of n \to \infty, the normalized histogram c_i/n can be viewed as a probability distribution p_i and substituted accordingly,

\log \bar{L} = -\sum_i p_i \log p_i + \sum_i p_i \log q_i
             = -D_{KL}(p \| q)

where we now recognize the KL divergence (Equation 1). The result can be summarized as

D_{KL}(p \| q) = -\lim_{n \to \infty} \frac{1}{n} \log L(c|q)    (A2)

or: the KL divergence is the negative logarithm of the average multinomial likelihood.

A closer look at this derivation reveals that the normalization constant in front of Equation A1 directly results in the term -\sum_i p_i \log p_i, which is the entropy of the distribution. Thus, it is possible to derive the entropy of a distribution from purely combinatorial notions (Jaynes, 2003).
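
A quick numerical check of this combinatorial route to entropy (our own sketch, with an invented distribution): the base-2 logarithm of the multinomial coefficient, divided by n, approaches the entropy of p as n grows.

    import math

    p = [0.5, 0.3, 0.2]                                  # invented distribution
    entropy = -sum(pi * math.log2(pi) for pi in p)       # entropy of p in bits

    for n in (100, 10_000, 1_000_000):
        c = [round(n * pi) for pi in p]                  # idealized counts c_i = n p_i
        log2_coeff = (math.lgamma(n + 1)
                      - sum(math.lgamma(ci + 1) for ci in c)) / math.log(2)
        print(n, log2_coeff / n, entropy)                # converges to the entropy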

References

S Amari and H Nakahara. Correlation and independence in the neural code. Neural Comput, 18(6):1259-1267, 2006.
TM Cover and JA Thomas. Elements of information theory. Wiley, New York, 1991.
RO Duda, PE Hart, and DG Stork. Pattern classification. Wiley & Sons, New York, 2001.
ET Jaynes. Probability Theory: The Logic of Science. Cambridge University Press, London, 2003.
PE Latham and S Nirenberg. Synergy, redundancy, and independence in population codes, revisited. J Neurosci, 25(21):5195-5206, 2005.
E Schneidman, S Still, MJ Berry, and W Bialek. Network information and connected correlations. Phys Rev Lett, 91:238701, 2003.
E Schneidman, MJ Berry, R Segev, and W Bialek. Weak pairwise correlations imply strongly correlated network states in a neural population. Nature, 440(7087):1007-1012, 2006.
CE Shannon and W Weaver. The mathematical theory of communication. University of Illinois Press, Urbana, 1949.
J Shlens, GD Field, JL Gauthier, MI Grivich, D Petrusca, A Sher, AM Litke, and EJ Chichilnisky. The structure of multi-neuron firing patterns in primate retina. J Neurosci, 26(32):8254-8266, 2006.
