Information Measure
• What is information?
• What is communication?
Even if we think we know what information and communication are, that is not the same
as defining them in a mathematical context. As an example one can consider an electronic
copy of Claude Shannon's paper in pdf format. The one considered here has the file size
2.2MB. It contains a lot of information about the subject, but how can the information in
the file be measured? One way to look at it is to compress the file as much as possible.
The compressed size could then be a measure of the amount of information required to describe the
file. In zip format the same file has the size 713kB. So if we should quantify the amount
of information in the paper, the pdf version contains at least 1.5MB that is not necessary
for the pure information in the text. Is then the number 713kB a measure of the contained
information? From a mathematical point of view, we will see that it is definitely closer to
the truth. However, from a philosophical point of view, it is not certain. We can compare
this text with a text of the same size containing only randomly chosen letters. If they
have the same file size, do they contain the same amount of information? These semantic
doubts are not considered in the mathematical model. Instead, the question answered is
the amount of data needed to describe the text.
4.1 Information
In his paper Shannon set up a mathematical theory for information and communication,
based on probability theory. He gave a quantitative measure of the amount of informa-
tion stored in a variable and gave limits of how much information can be transmitted
from one place to another over a given communication channel.
However, already twenty years earlier, in 1928, Hartley stated that a symbol can contain
information only if it has multiple choices [8]. That is, the symbol must be a random
variable. Hartley argued that if one symbol, X, has L alternatives, a vector of n independent
such symbols, (X1, . . . , Xn), has L^n alternatives. To form a measure of information,
one must notice that if the symbol X has the information I, then the vector should have
the information nI. The conclusion of this was that an appropriate information measure
should be based on the logarithm of the number of alternatives,

    I_H(X) = log L

In that way

    I_H(X1, . . . , Xn) = log L^n = n log L
EXAMPLE 4.1 Consider the outcome of a throw with a fair dice. It has 6 alternatives,
and hence, the information according to Hartley is¹

    I_H(X) = log 6 ≈ 2.585 bits
In this example Hartley's information measure makes sense, since it is the number of bits
needed to point out one of the six alternatives. But there can be other situations where it
runs into problems, like in the next example.
EXAMPLE 4.2 Let the variable X be the outcome of a flip with a counterfeit coin, with the
probabilities P(X = Head) = p and P(X = Tail) = 1 − p. According to Hartley the information
is

    I_H(X) = log 2 = 1 bit
In the case when the outcomes from the coin flip are equally likely, i.e. p = 1/2, the measure
is intuitive, since one bit is the amount of data needed to describe the outcome. If we
instead consider the case when p is very small, and flip the coin several times in a row,
our intuition does not say the same. Since p is small we would expect most
of the outcomes to be Tail and only a small fraction to be Head. The normal outcome
in a series of flips would be to have mostly Tail, meaning there is not much information
in this outcome. On the other hand, in the rare cases when Head occurs there would be
much more information. So, even if Hartley's information measure was groundbreaking
at its time, it did not consider the probability distribution of the experiment.
When Shannon introduced his measure 20 years later, he based the information quantity
on the probability describing how two random variables are related. The information
gained about the outcome of one variable by observing another can be viewed as the
ratio between the conditional and the unconditional probability for that outcome. Still,
as Hartley stated, the logarithm is an important part of the function.
    ¹Hartley did not specify the base of the logarithm. Using the binary base, the information measure has
the unit bits. In this way it specifies the number of bits required to distinguish the alternatives.
DEFINITION 4.1 The mutual information between event A and event B, denoted I(A; B),
is

    I(A; B) = log2 P(A|B) / P(A),

where it is assumed P(A) ≠ 0 and P(B) ≠ 0.   □
If nothing else is stated, we will assume the logarithmic base 2 to achieve the unit bit
(binary digit). This unit was first used in Shannon's paper, but it is also stated there that it was
John W. Tukey who coined the expression. To derive the binary logarithmic function use

    x = a^(log_a x) = b^(log_b x) = a^(log_a b · log_b x)

where the last equality follows from b = a^(log_a b). This leads to

    log_a x = log_a b · log_b x   ⇒   log_b x = log_a x / log_a b

Especially, it is convenient to use

    log2 x = ln x / ln 2 = log10 x / log10 2

It is also worth noting that log2 e = 1/ln 2. In Matlab the command log2(n) computes the
binary logarithm.
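As a small numerical check, the same change of base can be evaluated in Python (a sketch only; Python is used here instead of Matlab, and the variable names are purely illustrative):

import math

x = 6.0

# Binary logarithm computed directly and via the change of base formula
direct = math.log2(x)
via_ln = math.log(x) / math.log(2)          # log2(x) = ln(x) / ln(2)
via_log10 = math.log10(x) / math.log10(2)   # log2(x) = log10(x) / log10(2)

print(direct, via_ln, via_log10)            # all three give 2.585 (= log2 6)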
EXAMPLE 4.3 The outcome of a dice is reflected by the two random variables

    X = Number of pips
    Y = Odd or even number

The information gained about the event X = 3 from the event Y = Odd is

    I(X = 3; Y = Odd) = log P(X = 3|Y = Odd) / P(X = 3) = log (1/3) / (1/6) = log 2 = 1 bit

In other words, by knowing that the number of pips is odd, which splits the set of outcomes
in half, we gain one bit of information about the event that the number is 3.
EXAMPLE 4.4 [cont'd] The information gained from the event X = 3 about the event Y = Odd
is

    I(Y = Odd; X = 3) = log P(Y = Odd|X = 3) / P(Y = Odd) = log 1 / (1/2) = log 2 = 1 bit

The knowledge about X = 3 gives us full knowledge about the outcome of Y, which is a
binary choice with two equally sized parts. To specify one of the two outcomes of Y,
one bit is required.
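Both calculations follow directly from Definition 4.1 and can be reproduced numerically. A minimal Python sketch (the function name is only for illustration):

import math

def event_information(p_a_given_b, p_a):
    # I(A; B) = log2( P(A|B) / P(A) ), as in Definition 4.1
    return math.log2(p_a_given_b / p_a)

# Example 4.3: information about the event X = 3 from the event Y = Odd
print(event_information(1/3, 1/6))   # 1 bit

# Example 4.4: information about the event Y = Odd from the event X = 3
print(event_information(1.0, 1/2))   # 1 bit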
The mutual information between two events can range from −∞ up to min(−log P(A), −log P(B)).
To see this, consider the variations of P(A|B), which is a value between 0 and 1. Since
the logarithm is a strictly increasing function the two end cases give the bounds on the
information,

    P(A|B) = 0   ⇒   I(A; B) = log 0/P(A) = log 0 = −∞

    P(A|B) = 1   ⇒   I(A; B) = log 1/P(A) = − log P(A)

Similarly, by letting P(B|A) = 1 the information is I(A; B) = − log P(B). If P(A) and
P(B) are not equal, there are two upper bounds, where the minimum should be used.
Notice that since 0 ≤ P(A) ≤ 1 the value − log P(A) is non-negative. If I(A; B) = 0 the events
A and B are statistically independent since it implies

    P(A|B)/P(A) = 1   ⇒   P(A|B) = P(A)
To get a measure for the information related to event A, consider the mutual information
between A and A. That is, the amount of information gained about the event by
observing the same event. This quantity is called the self information,

    I(A) = I(A; A) = log P(A|A)/P(A) = log 1/P(A) = − log P(A)

That is, − log P(A) is the amount of information needed to determine that the event
A has occurred. The self information is always a non-negative quantity, and as long as the
outcome is not deterministic, i.e. as long as P(A) < 1, it is strictly positive.
4.2 Entropy
The above quantities deal with information related to specific events. An interesting
measure is then the average information required to determine the outcome of a random
variable. This is directly achieved from the expected value of the self information. We
get the following important definition: the entropy of a random variable X with probability
function p(x) is

    H(X) = E[− log p(X)] = − Σ_x p(x) log p(x)
In the derivations, we use the convention that 0 log 0 = 0, which follows from the corresponding
limit value. It is sometimes convenient to use the notation

    H(p1, p2, . . . , pL) = − Σ_{i=1}^{L} p_i log p_i

when considering the entropy function for a probability distribution given as a vector
p = (p1, p2, . . . , pL).
The entropy is the amount of information needed to determine the outcome of a random
variable. Thus, it can also be interpreted as the uncertainty of the outcome. Since the self
information is non-negative so is its average,
H(X) ≥ 0 (4.3)
In many cases a random variable describes a binary choice, e.g. when considering flip-
ping of coins. The entropy function for this case is so widely used that it often has a
definition of its own.
DEFINITION 4.4 The binary entropy function for the probability p is defined as

    h(p) = −p log p − (1 − p) log(1 − p)   □
The binary entropy function has its maximum h(1/2) = 1. In Figure 4.1 a plot of the
function is shown. Here the maximum value is depicted. It can also be seen that the binary
entropy function is symmetric in p, i.e. h(p) = h(1 − p), which is also clear directly from
the definition. In the case of a coin flip the uncertainty is maximised for a fair coin, i.e.
P(Head) = p = 1/2. If p increases, a natural initial guess is that the outcome will
be Head, and the uncertainty decreases. Similarly, if p decreases the uncertainty should
also decrease. At the end points, where p = 0 or p = 1, the outcome is known and the
uncertainty is zero, corresponding to h(0) = h(1) = 0.
[Figure 4.1: The binary entropy function h(p), with maximum h(1/2) = 1.]
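The function is simple to evaluate numerically. A minimal Python sketch, using the convention 0 log 0 = 0:

import math

def h(p):
    # Binary entropy h(p) = -p log2(p) - (1 - p) log2(1 - p), with 0 log 0 = 0
    total = 0.0
    for q in (p, 1.0 - p):
        if q > 0.0:
            total -= q * math.log2(q)
    return total

print(h(0.5))           # 1.0, the maximum
print(h(0.0), h(1.0))   # 0.0 0.0, deterministic outcomes
print(h(0.1), h(0.9))   # equal values, since h(p) = h(1 - p)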
Let Y be the outcome of a dice with a small weight in it, such that the probabilities are
P(Y = 1) = 0, P(Y = 6) = 1/3 and P(Y = i) = 1/6, i = 2, 3, 4, 5. Then, the corresponding
uncertainty of the outcome is

    H(Y) = − (1/3) log (1/3) − 4 · (1/6) log (1/6) = 2/3 + log 3 ≈ 2.2516 bit

Again, since we expect the outcome to be 6, there is little information in this outcome. In
total the uncertainty of the outcome has decreased compared to the fair dice.
The definition of the entropy is also valid for vectorised random variables, such as (X, Y)
with the joint probability function p(x, y).

DEFINITION 4.5 The joint entropy for a pair of random variables with the joint distribution
p(x, y) is

    H(X, Y) = E_{XY}[− log p(X, Y)] = − Σ_{x,y} p(x, y) log p(x, y)

Similarly, in the general case with an n-dimensional vector X = (X1, . . . , Xn), the joint
entropy function is

    H(X1, . . . , Xn) = E_X[− log p(X)] = − Σ_x p(x) log p(x)
EXAMPLE 4.6 Let X and Y be the outcomes from two independent fair dice. Then the
joint probability is P(X = x, Y = y) = 1/36 and the joint entropy

    H(X, Y) = − Σ_{x,y} (1/36) log (1/36) = log 36 = 2 log 6 ≈ 5.1699

We conclude that the uncertainty of the outcome of two dice is twice the uncertainty of
one dice.
Let Z be the sum of the dice, Z = X + Y. The probabilities are shown in the following table

    Z:      2     3     4     5     6     7     8     9     10    11    12
    P(Z):  1/36  2/36  3/36  4/36  5/36  6/36  5/36  4/36  3/36  2/36  1/36
The entropy of Z is

    H(Z) = H(1/36, 2/36, 3/36, 4/36, 5/36, 6/36, 5/36, 4/36, 3/36, 2/36, 1/36)
         = −2 (1/36) log (1/36) − 2 (2/36) log (2/36) − 2 (3/36) log (3/36)
           − 2 (4/36) log (4/36) − 2 (5/36) log (5/36) − (6/36) log (6/36)
         = 23/18 + (5/3) log 3 − (5/18) log 5 ≈ 3.2744
The uncertainty of the sum of the dice is less than the outcomes of the individual dice.
This makes sense, since several outcomes of the pair X, Y give the same sum Z.
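The two entropies above are easy to verify numerically. A short Python sketch (the helper function is not from the text):

import math

def entropy(probs):
    # H = -sum p*log2(p); terms with p = 0 are skipped (0 log 0 = 0)
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Distribution of Z = X + Y for two fair dice, Z = 2, 3, ..., 12
p_z = [k / 36 for k in (1, 2, 3, 4, 5, 6, 5, 4, 3, 2, 1)]

print(entropy([1 / 36] * 36))   # log2(36) = 5.1699 bits for the pair (X, Y)
print(entropy(p_z))             # approximately 3.2744 bits for the sum Z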
In (4.3) it was seen that the entropy function is a non-negative function. To achieve an
upper bound we will be helped by the following inequality, often called the IT-inequality,

    log r ≤ (r − 1) log e,   r > 0,   with equality if and only if r = 1

To show it, note that the curves ln r and r − 1 coincide at r = 1, and that the derivative of
ln r is 1/r while the derivative of r − 1 is 1.
Hence, for 0 < r < 1 the curve for ln r is steeper than r − 1 and for r > 1 it is flatter. So in
all cases but r = 1, the curve for ln r must be strictly lower than r − 1. That is, we have
shown ln r ≤ r − 1 with equality if and only if r = 1. Rewriting into the binary logarithm
completes the proof.   □
[Figure: the curves y = ln(r) and y = r − 1, which touch at r = 1.]
From the previous examples we would guess that the maximum value of the entropy
would occur when the outcomes have equal probabilities. In that case a random variable
X with L outcomes, {x1, . . . , xL}, has the probabilities P(X = xi) = 1/L. The entropy is

    H(X) = − Σ_i (1/L) log (1/L) = log L
This is in fact the maximum value of the entropy. For a random variable X with L outcomes,

    0 ≤ H(X) ≤ log L

with equality to the left if and only if there exists some i where p(xi) = 1, and with
equality to the right if and only if p(xi) = 1/L for all i = 1, 2, . . . , L.   □
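As a numerical illustration of the bound, the sketch below evaluates the entropy of a few distributions over L = 4 outcomes (the distributions are chosen only as examples) and compares with log 4 = 2 bits:

import math

def entropy(probs):
    # H = -sum p*log2(p), with the convention 0 log 0 = 0
    return -sum(p * math.log2(p) for p in probs if p > 0)

L = 4
for probs in ([0.25, 0.25, 0.25, 0.25],    # uniform: H = log2(4) = 2
              [0.5, 0.25, 0.125, 0.125],   # H = 1.75
              [1.0, 0.0, 0.0, 0.0]):       # deterministic: H = 0
    print(entropy(probs), "<=", math.log2(L))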
To cover also dependencies between random variables the conditional entropy needs to be
considered. This is defined as the average self information for one variable conditioned
on another, that is E[− log p(X|Y)]. Since there are two random variables the average
needs to be taken over the joint probability distribution.

DEFINITION 4.6 The conditional entropy of the random variable X, conditioned on Y, is

    H(X|Y) = E_{XY}[− log p(X|Y)] = − Σ_{x,y} p(x, y) log p(x|y)

where p(x, y) and p(x|y) are the joint and conditional probability functions, respectively.   □
Using the chain rule for probabilities, p(x, y) = p(x|y)p(y), the formula can be rewritten
as

    H(X|Y) = − Σ_x Σ_y p(x, y) log p(x|y) = − Σ_x Σ_y p(x|y)p(y) log p(x|y)
           = Σ_y p(y) ( − Σ_x p(x|y) log p(x|y) ) = Σ_y p(y) H(X|Y = y)
EXAMPLE 4.7 The joint distribution of the random variables X and Y is given by

    p(x, y)    Y = 0    Y = 1
    X = 0        0       1/8
    X = 1       3/4      1/8

The marginal distributions of X and Y can be derived as p(x) = Σ_y p(x, y) and p(y) = Σ_x p(x, y), giving

    p(X = 0) = 1/8,  p(X = 1) = 7/8        p(Y = 0) = 3/4,  p(Y = 1) = 1/4
The conditional probabilities of X given Y become

    P(x|y)     Y = 0    Y = 1
    X = 0        0       1/2
    X = 1        1       1/2
Then

    H(X|Y = 0) = h(0) = 0
    H(X|Y = 1) = h(1/2) = 1

Putting things together, the conditional entropy becomes

    H(X|Y) = H(X|Y = 0)P(Y = 0) + H(X|Y = 1)P(Y = 1) = (3/4) h(0) + (1/4) h(1/2) = 1/4
The chain rule for probabilities can be used also to achieve a corresponding chain rule for
entropies,

    H(X, Y) = − Σ_{x,y} p(x, y) log p(x, y) = − Σ_{x,y} p(x, y) log p(x|y)p(y)
            = − Σ_{x,y} p(x, y) log p(x|y) − Σ_y p(y) log p(y)
            = H(X|Y) + H(Y)                                                      (4.5)

Rewriting the result we get H(X|Y) = H(X, Y) − H(Y). That is, the conditional entropy
is the difference between the uncertainty of the pair (X, Y) and the information gained by
observing Y. A more general version of (4.5) can be stated as the chain rule for entropies
in the following theorem.

THEOREM 4.3 (Chain rule for entropies) For the random variables X1, . . . , Xn,

    H(X1, . . . , Xn) = Σ_{i=1}^{n} H(Xi|X1, . . . , Xi−1)   □
EXAMPLE 4.8 [Cont'd from Example 4.7] The joint entropy can alternatively be derived
as

    H(X, Y) = H(X|Y) + H(Y) = 1/4 + h(1/4) = 9/4 − (3/4) log 3 ≈ 1.0613
4.3 Mutual Information

The entropy was obtained by averaging the self information for a random variable. Similarly,
the average mutual information between the random variables X and Y can be
defined as follows.
DEFINITION 4.7 The mutual information between the random variables X and Y is defined
as

    I(X; Y) = E_{XY}[ log p(X, Y) / (p(X)p(Y)) ] = Σ_{x,y} p(x, y) log p(x, y) / (p(x)p(y))          (4.6)
Utilizing that p(x, y) = p(x)p(y|x) = p(y)p(x|y) the function can also be written as the
ratio between the conditional and the unconditional probabilities,

    I(X; Y) = E_{XY}[ log p(X|Y) / p(X) ] = E_{XY}[ log p(Y|X) / p(Y) ]          (4.7)

The mutual information describes how strongly two variables are connected. From the
definition it is clear that it is a symmetric measure,

    I(X; Y) = I(Y; X)
Breaking up the logarithm in the definition it is possible to derive the mutual information
from the entropies as

    I(X; Y) = E_{XY}[ log p(X, Y) / (p(X)p(Y)) ]
            = E_{XY}[ log p(X, Y) − log p(X) − log p(Y) ]
            = E_{XY}[ log p(X, Y) ] − E_X[ log p(X) ] − E_Y[ log p(Y) ]
            = H(X) + H(Y) − H(X, Y)
            = H(X) − H(X|Y)
            = H(Y) − H(Y|X)

where the last two equalities follow from the chain rule of entropies.
EXAMPLE 4.9 [Cont'd from Example 4.7] The mutual information can be derived as

    I(X; Y) = H(X) − H(X|Y) = h(1/8) − 1/4 = 11/4 − (7/8) log 7 ≈ 0.2936
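The quantities in Examples 4.7–4.9 can also be checked numerically from the joint probability table. A minimal Python sketch (the helper names are not from the text):

import math

def entropy(probs):
    # H = -sum p*log2(p), with 0 log 0 = 0
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Joint distribution p(x, y) from Example 4.7, indexed as p[x][y]
p = [[0.0, 1 / 8],
     [3 / 4, 1 / 8]]

p_x = [sum(row) for row in p]             # marginal of X: (1/8, 7/8)
p_y = [sum(col) for col in zip(*p)]       # marginal of Y: (3/4, 1/4)

H_xy = entropy([p[x][y] for x in (0, 1) for y in (0, 1)])
H_x, H_y = entropy(p_x), entropy(p_y)

print(H_xy)               # joint entropy H(X,Y), approximately 1.0613
print(H_xy - H_y)         # conditional entropy H(X|Y) = 0.25
print(H_x + H_y - H_xy)   # mutual information I(X;Y), approximately 0.2936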
The mutual information between two random variables X and Y can be affected by observing
a third variable Z. This is reflected in the conditional mutual information.

DEFINITION 4.8 The conditional mutual information between the random variables X and
Y, when Z is observed, is defined as

    I(X; Y|Z) = E_{XYZ}[ log p(X, Y|Z) / (p(X|Z)p(Y|Z)) ]
              = Σ_{x,y,z} p(x, y, z) log p(x, y|z) / (p(x|z)p(y|z))          (4.8)

Similar to the unconditional case the conditional mutual information can be derived from
the entropies as

    I(X; Y|Z) = H(X|Z) + H(Y|Z) − H(X, Y|Z) = H(X|Z) − H(X|Y, Z) = H(Y|Z) − H(Y|X, Z)
Both the entropy and the mutual information are important measures of information. The
entropy states how much information is needed to determine the outcome of a random
variable. It will be shown later that this is a limit on how many bits are needed on average
to describe the variable. In other words, this is a limit on how much a symbol can be
compressed without any data being lost. The mutual information, on the other hand,
describes the amount of information gained about the variable X by observing the
variable Y.
To get more knowledge about these quantities we introduce the relative entropy. It was
first considered by Kullback and Leibler in 1951 [18].

DEFINITION 4.9 Given two probability distributions p(x) and q(x) for the same sample
set X. The relative entropy, or Kullback-Leibler divergence, is defined as

    D(p||q) = E_p[ log p(X)/q(X) ] = Σ_x p(x) log p(x)/q(x)
EXAMPLE 4.10 Consider a binary random variable, X ∈ {0, 1}, where we set up two
distributions. First we assume that the values are equally probable, p(0) = p(1) = 1/2, and
second we let q(0) = 1/4 and q(1) = 3/4. The relative entropy from p to q is then

    D(p||q) = (1/2) log (1/2)/(1/4) + (1/2) log (1/2)/(3/4) = 1 − (1/2) log 3 ≈ 0.2075
On the other hand, if we consider the relative entropy from q to p we get

    D(q||p) = (1/4) log (1/4)/(1/2) + (3/4) log (3/4)/(1/2) = (3/4) log 3 − 1 ≈ 0.1887
That is, the relative entropy is not a symmetric measure. Furthermore, in Section 4.3.2
it will be shown that the triangle inequality, in general, does not hold for the relative
entropy.
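A short Python sketch of the two computations, illustrating the asymmetry (the helper function is only for illustration):

import math

def kl_divergence(p, q):
    # D(p||q) = sum p(x) * log2( p(x) / q(x) ); terms with p(x) = 0 contribute 0
    return sum(px * math.log2(px / qx) for px, qx in zip(p, q) if px > 0)

p = [1 / 2, 1 / 2]   # uniform distribution on {0, 1}
q = [1 / 4, 3 / 4]   # skewed distribution on {0, 1}

print(kl_divergence(p, q))   # approximately 0.2075
print(kl_divergence(q, p))   # approximately 0.1887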
Sometimes the relative entropy is referred to as a distance from one distribution to another.
When dealing with optimal source coding, we will see that there is a natural reason
for this. However, in a mathematical meaning a distance should be seen as a metric. That
is, if a function g(x, y) is a metric then it should hold that

    g(x, y) ≥ 0, with equality if and only if x = y
    g(x, y) = g(y, x)
    g(x, z) ≤ g(x, y) + g(y, z)

A divergence does not have to fulfill the symmetry and triangle inequality criteria,
and indeed the relative entropy does not. In Theorem 4.4 it will be stated that the relative
entropy is non-negative.
In the next example the Poisson distribution is considered. Assume that the result from
a random experiment is Poisson distributed with intensity λ. Then if the experimentally
estimated intensity is λ0, the example shows the relative entropy between the true distribution
and the estimate. It will later be shown that this value reflects the penalty in compression
rate due to the estimation mismatch.
EXAMPLE 4.11 Let the true distribution be Poisson with intensity λ, i.e.

    p(k) = λ^k e^{−λ} / k!,    k = 0, 1, 2, . . .

Then compare this distribution with another Poisson distribution with the parameter λ0,
i.e.

    p0(k) = λ0^k e^{−λ0} / k!,    k = 0, 1, 2, . . .
The relative entropy from p(k) to p0(k) is then

    D(p||p0) = Σ_k (λ^k e^{−λ} / k!) log [ (λ^k e^{−λ} / k!) / (λ0^k e^{−λ0} / k!) ]
             = Σ_k (λ^k e^{−λ} / k!) ( k log (λ/λ0) + λ0 log e − λ log e )
             = log (λ/λ0) Σ_k k (λ^k e^{−λ} / k!) + (λ0 − λ) log e Σ_k (λ^k e^{−λ} / k!)
             = λ log (λ/λ0) + (λ0 − λ)/ln 2

where in the last equality it is used that E[k] = λ, Σ_k p(k) = 1 and log e = 1/ln 2.
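The closed form can be checked against a direct, truncated evaluation of the sum. A minimal Python sketch, with the intensities chosen only as an example:

import math

def poisson_pmf(k, lam):
    # p(k) = lam^k * exp(-lam) / k!
    return lam ** k * math.exp(-lam) / math.factorial(k)

def kl_poisson_numeric(lam, lam0, k_max=100):
    # Truncated sum of p(k) * log2( p(k) / p0(k) ); the tail beyond k_max is negligible here
    total = 0.0
    for k in range(k_max + 1):
        total += poisson_pmf(k, lam) * math.log2(poisson_pmf(k, lam) / poisson_pmf(k, lam0))
    return total

def kl_poisson_closed(lam, lam0):
    # lam * log2(lam / lam0) + (lam0 - lam) / ln 2
    return lam * math.log2(lam / lam0) + (lam0 - lam) / math.log(2)

lam, lam0 = 3.0, 4.0
print(kl_poisson_numeric(lam, lam0))   # approximately 0.1976
print(kl_poisson_closed(lam, lam0))    # the same value from the closed form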
The relative entropy was introduced as a generalised information measure. The mutual
information can be expressed as a special case of the relative entropy as

    I(X; Y) = E_{XY}[ log p(X, Y) / (p(X)p(Y)) ] = D( p(x, y) || p(x)p(y) )

The mutual information is the information divergence from the joint distribution to the
independent case, i.e. the information divergence describes the relation between X and
Y.
Another aspect of the relative entropy to consider is the relationship with the entropy
function. Consider a random variable with the possible outcomes in the set X with cardinality
|X| = L and probability distribution p(x), x ∈ X. Let u(x) = 1/L be the uniform
distribution on the same set. Then

    D(p||u) = Σ_x p(x) log p(x)/(1/L) = log L − H(X)                    (4.9)

where we see that the relative entropy from p(x) to u(x) is the difference between the
maximum value of the entropy and the entropy based on the true distribution. Since the
maximum value is achieved by the uniform distribution, the relative entropy is a measure
of how much p(x) diverges from the uniform distribution.
The relative entropy can be shown to never take negative values. Here the IT-inequality is
used to show this fact,

    −D(p||q) = − Σ_x p(x) log p(x)/q(x) = Σ_x p(x) log q(x)/p(x)
             ≤ Σ_x p(x) ( q(x)/p(x) − 1 ) log e
             = ( Σ_x q(x) − Σ_x p(x) ) log e = (1 − 1) log e = 0

with equality if and only if q(x)/p(x) = 1, i.e. when p(x) = q(x), for all x. The result is
expressed in the next theorem.
THEOREM 4.4 Given two probability distributions p(x) and q(x) for the same sample set
X. Then the relative entropy is non-negative,

    D(p||q) ≥ 0

with equality if and only if p(x) = q(x) for all x.   □
Since the mutual information can be expressed as the relative entropy, the following
corollary follows immediately.

COROLLARY 4.5 The mutual information for two random variables, X and Y, is a non-negative
function,

    I(X; Y) ≥ 0

with equality if and only if X and Y are independent.   □
THEOREM 4.6 The conditional entropy is upper bounded by the unconditional entropy,
i.e.

    H(X|Y) ≤ H(X)

with equality if and only if X and Y are independent.   □
Intuitively, this means that the uncertainty will, on average, not increase by observing side
information. If the two variables are independent, the uncertainty will not be affected.
Using this result together with the chain rule for entropy, Theorem 4.3, we get the
next result,

    H(X1, . . . , Xn) ≤ Σ_{i=1}^{n} H(Xi)

That is, the uncertainty is minimised when considering a random vector as a whole,
instead of as individual variables. In other words, we should take the relationships between the
variables into account when minimising the uncertainty.
In Definition 3.4 the terminology of convex functions was introduced. In areas like opti-
misation they have important properties since there are no local minima. In this section
the convexity of the information measures will be investigated. First, the relative entropy
will be shown to be convex. With this as a tool the entropy can be shown to be concave
and the convexity of mutual information investigated.
Our previous definition of a convex function is stated for one-dimensional functions.
Therefore, we start with a generalisation of the definition. A straightforward way is to
say that a multidimensional function is convex if it is convex along all straight lines in its
domain. For the two-dimensional case the function surface resembles a bowl. Comparing with Figure 3.3,
the two-dimensional argument λ(x1, y1) + (1 − λ)(x2, y2), for 0 ≤ λ ≤ 1, describes a
straight line between the points (x1, y1) and (x2, y2) in the argument plane. The equation
for this line can be rewritten as the coordinates (λx1 + (1 − λ)x2, λy1 + (1 − λ)y2). Considering
the two-dimensional function g(x, y), the values corresponding to the endpoints
are z1 = g(x1, y1) and z2 = g(x2, y2). Then λ(x1, y1, z1) + (1 − λ)(x2, y2, z2), 0 ≤ λ ≤ 1,
describes a straight line in the three-dimensional space between the end points (x1, y1, z1)
and (x2, y2, z2). If the function value along the argument line, g(λ(x1, y1) + (1 − λ)(x2, y2)),
never exceeds the corresponding value on the line, λg(x1, y1) + (1 − λ)g(x2, y2), the function
g(x, y) is a convex function. That is, g(x, y) is convex over the region A if

    g( λ(x1, y1) + (1 − λ)(x2, y2) ) ≤ λ g(x1, y1) + (1 − λ) g(x2, y2)

for all λ such that 0 ≤ λ ≤ 1 and all (x1, y1), (x2, y2) ∈ A. Here A denotes a two-dimensional
convex region, i.e. a straight line between two points in the region should
never be outside the region. The corresponding regions considered in this text are easily
verified to satisfy this criterion. The above reasoning for convexity of functions can easily
be generalised to n-dimensional functions.
The relative entropy is a two-dimensional function in the probability pair (p, q), and can
thus be checked for convexity. Then, consider the four probability distributions p1(x),
p2(x), q1(x) and q2(x) over the same sample space X. For λ between 0 and 1 two new
distributions can be formed as

    pλ(x) = λp1(x) + (1 − λ)p2(x)
    qλ(x) = λq1(x) + (1 − λ)q2(x)

For these,

    D(pλ||qλ) ≤ λ D(p1||q1) + (1 − λ) D(p2||q2)

where the inequality is a direct application of the log-sum inequality in Theorem 3.7.
Hence, the relative entropy D(p||q) is a convex function in the pair (p, q).
From (4.9) the entropy can be expressed as H_p(X) = log L − D(p||u) where u is the uniform
distribution. Again using pλ(x) = λp1(x) + (1 − λ)p2(x) we get

    H_{pλ}(X) = log L − D( λp1 + (1 − λ)p2 || u )
              ≥ log L − λ D(p1||u) − (1 − λ) D(p2||u)
              = λ( log L − D(p1||u) ) + (1 − λ)( log L − D(p2||u) )
              = λ H_{p1}(X) + (1 − λ) H_{p2}(X)

where the inequality follows from the convexity of the relative entropy. That is, the entropy
H(X) is a concave function in the probability distribution p(x).
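The concavity can be illustrated numerically. A small Python sketch with two arbitrarily chosen distributions over three outcomes:

import math

def entropy(probs):
    # H = -sum p*log2(p), with 0 log 0 = 0
    return -sum(p * math.log2(p) for p in probs if p > 0)

p1 = [0.7, 0.2, 0.1]
p2 = [0.1, 0.3, 0.6]

for lam in (0.0, 0.25, 0.5, 0.75, 1.0):
    p_lam = [lam * a + (1 - lam) * b for a, b in zip(p1, p2)]
    mixed = entropy(p_lam)                               # H(lam*p1 + (1-lam)*p2)
    combo = lam * entropy(p1) + (1 - lam) * entropy(p2)  # lam*H(p1) + (1-lam)*H(p2)
    print(lam, mixed >= combo - 1e-12)                   # True for every lam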
When it comes to the mutual information it can be written as I(X; Y) = H(Y) − H(Y|X).
Hence, it consists of two parts that need to be treated separately. The first case to consider
is two distributions on X, p1(x) and p2(x), while the conditional probability on Y,
p(y|x), is fixed. Then, again we form pλ(x) = λp1(x) + (1 − λ)p2(x). The corresponding
unconditional probability on Y then becomes

    pλ(y) = Σ_x pλ(x)p(y|x) = λ Σ_x p1(x)p(y|x) + (1 − λ) Σ_x p2(x)p(y|x) = λp1(y) + (1 − λ)p2(y)

and hence

    H_{pλ}(Y) ≥ λ H_{p1}(Y) + (1 − λ) H_{p2}(Y)

since the entropy is concave. On the other hand, the conditional entropy is

    H_{pλ}(Y|X) = − Σ_{x,y} pλ(x)p(y|x) log p(y|x)
                = −λ Σ_{x,y} p1(x)p(y|x) log p(y|x) − (1 − λ) Σ_{x,y} p2(x)p(y|x) log p(y|x)
                = λ H_{p1}(Y|X) + (1 − λ) H_{p2}(Y|X)

so it is linear in p(x). Combining the two parts,

    I_{pλ}(X; Y) = H_{pλ}(Y) − H_{pλ}(Y|X) ≥ λ I_{p1}(X; Y) + (1 − λ) I_{p2}(X; Y)

That is, for fixed p(y|x) the mutual information I(X; Y) is concave in p(x).
Similarly, if p(x) is fixed and we consider two distributions on the conditional probability,
p1(y|x) and p2(y|x), we introduce

    pλ(y|x) = λp1(y|x) + (1 − λ)p2(y|x)

and

    pλ(y) = Σ_x p(x)pλ(y|x) = λp1(y) + (1 − λ)p2(y)

where pi(x, y) = p(x)pi(y|x) and pi(y) = Σ_x p(x)pi(y|x). Then by writing the mutual
information as the relative entropy and using its convexity we get

    I_{pλ}(X; Y) = D( pλ(x, y) || p(x)pλ(y) )
                 ≤ λ D( p1(x, y) || p(x)p1(y) ) + (1 − λ) D( p2(x, y) || p(x)p2(y) )
                 = λ I_{p1}(X; Y) + (1 − λ) I_{p2}(X; Y)
Hence, the mutual information is convex in p(y|x). Summarising, the mutual information
I(X; Y) is a concave function in p(x) for fixed p(y|x), and a convex function in p(y|x) for
fixed p(x).
To be done
4.4 Entropy of Sequences

In the previous section the information measures were defined for random variables. Often
it is desirable to also consider random processes, where the variables in a sequence are statistically
dependent. Then, to generalise the entropy measure, complete sequences must
be considered. In this section first a famous result on data processing will be derived.
Then the entropy rate will be defined, which is the corresponding entropy measure for
random processes.
For the first part, consider a Markov chain with three variables X, Y and Z. Their de-
pendencies are described in Figure 4.3. The process A transforms X into Y and process
B transforms Y into Z. These processes are very general and can for example represent
pre-processing, post-processing or transmission of data.
[Figure 4.3: The Markov chain X → Y → Z; process A transforms X into Y and process B transforms Y into Z.]
The assumed Markov property gives that X and Z are independent when conditioned
on Y, i.e.

    p(x, z|y) = p(x|y)p(z|x, y) = p(x|y)p(z|y)

where the second equality follows from the Markov condition. Then the mutual information
between the end points can be derived and bounded in two ways,

    I(X; Z) = H(X) − H(X|Z) ≤ H(X) − H(X|Y, Z) = H(X) − H(X|Y) = I(X; Y)

and

    I(X; Z) = H(Z) − H(Z|X) ≤ H(Z) − H(Z|X, Y) = H(Z) − H(Z|Y) = I(Y; Z)

where the inequalities follow since conditioning cannot increase the entropy, and the last
equality in each line follows from the Markov condition.
LEMMA 4.11 (DATA PROCESSING LEMMA) Let the random variables X, Y and Z form
a Markov chain, X → Y → Z. Then

    I(X; Z) ≤ I(X; Y)
    I(X; Z) ≤ I(Y; Z)
An interpretation of the lemma can be viewed in the following way. Assume first that X
is transformed into Y by process A. This can for example be a transmission of data over
a channel distorting the signals (e.g. wired or wireless communication, or writing and
reading of a CD, DVD or flash memory). The aim of the receiver is then to get as much
information as possible about X by observing Y. It is common to perform post-processing, which in
this model is represented by process B. The data processing lemma states that the information
about X gained by viewing Z cannot exceed the information about X
gained by viewing Y. In other words, information about X will not increase by post-processing,
it can only decrease. In practice, however, post-processing is often used to transform the
information into another representation where the information is more easily accessible for
interpretation. For example, it is easier to understand an image when viewed on a screen
than it is from the received data.

Similarly, process A can represent pre-processing and then process B the transmission.
Then, the data processing lemma states that the information cannot increase by the processing.
Still, in practice it is common to use pre-processing in communication systems
to transform data into appropriate representations. Summarising, the lemma states that
the information cannot increase by either pre- or post-processing. The information can
only be destroyed by the processing.
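As a numerical illustration of the lemma (the setup below is chosen only as an example), let X be a fair bit, let process A be a binary symmetric channel with crossover probability 0.1 giving Y, and let process B be a second binary symmetric channel with crossover probability 0.2 giving Z. A Python sketch verifying both bounds:

import math

def entropy(probs):
    # H = -sum p*log2(p), with 0 log 0 = 0
    return -sum(p * math.log2(p) for p in probs if p > 0)

def mutual_information(joint):
    # I(X;Y) = H(X) + H(Y) - H(X,Y), computed from a 2x2 joint probability table
    p_x = [sum(row) for row in joint]
    p_y = [sum(col) for col in zip(*joint)]
    return entropy(p_x) + entropy(p_y) - entropy([p for row in joint for p in row])

def bsc_joint(p_in, eps):
    # Joint distribution of (input, output) for a binary symmetric channel BSC(eps)
    return [[p_in[a] * ((1 - eps) if b == a else eps) for b in (0, 1)] for a in (0, 1)]

p_x = [0.5, 0.5]                  # X is a fair bit
joint_xy = bsc_joint(p_x, 0.1)    # process A: X -> Y
p_y = [sum(col) for col in zip(*joint_xy)]
joint_yz = bsc_joint(p_y, 0.2)    # process B: Y -> Z

# Joint of (X, Z): p(x, z) = sum over y of p(x, y) * p(z|y)
joint_xz = [[sum(joint_xy[x][y] * ((1 - 0.2) if z == y else 0.2) for y in (0, 1))
             for z in (0, 1)] for x in (0, 1)]

print(mutual_information(joint_xy))   # I(X;Y), approximately 0.531
print(mutual_information(joint_yz))   # I(Y;Z), approximately 0.278
print(mutual_information(joint_xz))   # I(X;Z), approximately 0.173, below both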
Next, we turn to a more general description of information measures for
sequences. In general there is a dependency between the symbols in a sequence, which
can be modelled by a random process. In this section two natural generalisations
of the entropy function will be introduced. It will be shown that these two definitions are
in fact equivalent. This measure can in many cases be used and interpreted in the same
way for a random process as the entropy for random variables.
A natural way to define the entropy per symbol for a sequence is by treating the sequence
as a multi-dimensional random variable and averaging over the number of symbols. As
the length of the sequence tends to infinity we get the following definition of the entropy rate,

    H∞(X) = lim_{n→∞} (1/n) H(X1 X2 . . . Xn)                    (4.10)

To see that this is a natural generalisation of the entropy function, consider first the case
where the variables in the sequence are independent, as in the next example.
EXAMPLE 4.12 Consider a sequence of i.i.d. random variables with entropy H(X). Then
the entropy rate equals the entropy function since

    H∞(X) = lim_{n→∞} (1/n) H(X1 . . . Xn) = lim_{n→∞} (1/n) Σ_i H(Xi|X1 . . . Xi−1)
          = lim_{n→∞} (1/n) Σ_i H(Xi) = lim_{n→∞} H(X) (1/n) Σ_i 1 = H(X)
An alternative definition of the entropy rate for a random process would be to consider
the entropy of the nth variable in the sequence, conditioned on all the previous ones. As
n → ∞ we get the following definition,

    H(X|X∞) = lim_{n→∞} H(Xn|X1 . . . Xn−1)
To see how the two definitions relate, rewrite the entropy with the chain rule,

    (1/n) H(X1 . . . Xn) = (1/n) Σ_i H(Xi|X1 . . . Xi−1)

The right hand side is the arithmetic mean of the terms H(Xi|X1 . . . Xi−1). Since these
terms approach H(X|X∞), so does their arithmetic mean as n → ∞. Hence, asymptotically as the length
of the sequence grows to infinity the two definitions for the entropy rate are equal. This
important result is stated as a theorem.
THEOREM 4.12 The entropy rate and the alternative entropy rate are equivalent, i.e.

    H∞(X) = H(X|X∞)

In the continuation we will mostly use the notation in the first definition.
For a stationary process, the conditional entropy satisfies

    H(Xn|X1 . . . Xn−1) ≤ H(Xn|X2 . . . Xn−1) = H(Xn−1|X1 . . . Xn−2)

where the inequality holds since conditioning cannot increase the entropy, and the last equality
follows from stationarity. Here it is seen that H(Xn|X1 . . . Xn−1) is a decreasing function
in n. An upper bound for the entropy rate is then found from the first term in the sequence,

    H∞(X) = lim_{n→∞} H(Xn|X1 . . . Xn−1) ≤ H(X1) = H(X)
Finally, since the entropy is a non-negative function we can state the following theorem.

THEOREM 4.13 For a stationary random process the entropy rate is bounded by

    0 ≤ H∞(X) ≤ H(X) ≤ log |X|   □
In Figure 4.4 the relation between log |X|, H(X), H(Xn|X1 . . . Xn−1) and H∞(X) is shown
as a function of n. One natural conclusion is that the uncertainty of the sequence is less if
there is a dependency between the symbols.
So far the entropy rate has been treated for the class of stationary random processes.
If the theory is limited to the often used Markov chains, it is possible to be more specific
in the derivation of the entropy rate. From the unit memory property and stationarity
of a Markov process, the conditional entropy can be written as H(Xn|X1 . . . Xn−1) =
H(Xn|Xn−1). Then, the entropy rate is

    H∞(X) = lim_{n→∞} H(Xn|X1 . . . Xn−1) = lim_{n→∞} H(Xn|Xn−1) = H(X2|X1)
          = − Σ_{i,j} P(X1 = xi, X2 = xj) log P(X2 = xj|X1 = xi)
          = − Σ_i P(X1 = xi) Σ_j P(X2 = xj|X1 = xi) log P(X2 = xj|X1 = xi)
          = Σ_i H(X2|X1 = xi) P(X1 = xi)                                        (4.12)

where

    H(X2|X1 = xi) = − Σ_j P(X2 = xj|X1 = xi) log P(X2 = xj|X1 = xi)
In (4.12) the transition probability is given by the state transition matrix for the Markov
chain,

    P = [ pij ] = [ P(X2 = xj|X1 = xi) ],   xi, xj ∈ X

and the stationary distribution by πi = P(X1 = xi). Hence, we get the following theorem.
THEOREM 4.14 For a stationary Markov chain with stationary distribution π and transition
matrix P = [pij], the entropy rate can be derived as

    H∞(X) = Σ_i πi H(X2|X1 = xi)

where

    H(X2|X1 = xi) = − Σ_j pij log pij
EXAMPLE 4.13 The Markov chain shown in Figure 3.4 has the state transition matrix

    P = [ 1/3  2/3   0
          1/4   0   3/4
          1/2  1/2   0  ]

and the stationary distribution

    π = ( 15/43  16/43  12/43 )
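A sketch of how the entropy rate of this chain can be evaluated numerically with Theorem 4.14; the stationary distribution is recomputed here by simply iterating the chain, only as a check:

import math

def entropy(probs):
    # H = -sum p*log2(p), with 0 log 0 = 0
    return -sum(p * math.log2(p) for p in probs if p > 0)

# State transition matrix of the Markov chain in Example 4.13
P = [[1/3, 2/3, 0.0],
     [1/4, 0.0, 3/4],
     [1/2, 1/2, 0.0]]

# Stationary distribution, found by iterating pi <- pi * P from a uniform start
pi = [1/3, 1/3, 1/3]
for _ in range(200):
    pi = [sum(pi[i] * P[i][j] for i in range(3)) for j in range(3)]

print(pi)   # approaches (15/43, 16/43, 12/43) = (0.3488..., 0.3721..., 0.2791...)

# Entropy rate according to Theorem 4.14: H_inf = sum_i pi_i * H(X2 | X1 = x_i)
H_rate = sum(pi[i] * entropy(P[i]) for i in range(3))
print(H_rate)   # approximately 0.9013 bits per symbol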
To be done