1. Introduction
In this paper we derive an integral representation of the relative entropy R(μ||P), where μ is a measure on C([0, T]; ℝ^d) and P governs the solution to a stochastic differential equation (SDE). The relative entropy is used to quantify the distance between two measures. It has considerable applications in statistics, imaging, information theory and communications. It has been used in the long-time analysis of Fokker–Planck equations [2], the analysis of dynamical systems [3] and the analysis of spectral density functions [4]. It has been used in financial mathematics to quantify the difference between martingale measures [6]. It has also been shown in [7] that the existence problem for the minimal relative entropy martingale measure of birth and death processes can be reduced to the problem of solving the Hamilton–Jacobi–Bellman equation; furthermore, the minimal entropy martingale measures (MEMMs) for geometric Lévy processes are investigated in [8]. The finiteness of R(μ||P) has been shown to be equivalent to the invertibility of certain shifts on Wiener space, when P is the Wiener measure [10]. However, one of the most frequent uses of the relative entropy is in statistical inference (particularly in medical imaging) [12]. For example, in data fitting, it is a standard technique to select the parameters that minimise the relative entropy of two conditional probability distributions [13]. Modelling in medical imaging increasingly involves diffusion processes with state space C([0, T]; ℝ^d), for which the expression E^μ[log(dμ/dP)] or the variational definition in Definition 1 may not always be tractable. Furthermore, it is not always clear that one may simply approximate the relative entropy by successively calculating it for the marginals over increasingly fine time-discretisations, since these expressions may asymptotically approach infinity (see (4) below).
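The blow-up of the marginal relative entropies can be seen in a simple case that is not taken from the paper and is used here only as an illustration. Assume d = 1, let P be the Wiener measure and let μ be the law of c·W for a constant c ≠ 1. Over a uniform partition with m steps the increments are independent Gaussians, so the relative entropy of the marginal is m times the divergence of a single increment, and that divergence does not depend on the step size. The function names below are hypothetical.

```python
import math


def gaussian_kl(var_mu: float, var_p: float) -> float:
    """Relative entropy of N(0, var_mu) with respect to N(0, var_p)."""
    ratio = var_mu / var_p
    return 0.5 * (ratio - 1.0 - math.log(ratio))


def marginal_relative_entropy(c: float, horizon: float, m: int) -> float:
    """Relative entropy between the laws of c*W and W on a uniform partition.

    The marginal at the partition times is a bijective linear image of the
    m independent increments, so the relative entropy is the sum of the
    per-increment Gaussian divergences (assumption: both start at 0).
    """
    dt = horizon / m
    return m * gaussian_kl(c * c * dt, dt)


if __name__ == "__main__":
    # The value grows linearly in the number of partition points.
    for m in (1, 10, 100, 1000):
        print(m, marginal_relative_entropy(c=1.5, horizon=1.0, m=m))
```

In this toy case the two laws are mutually singular on path space, which is consistent with the marginal values growing without bound as the partition is refined.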
Another very important application of the relative entropy is in the field of Large Deviations. Sanov’s theorem dictates that the empirical measure induced by independent samples governed by the same probability law P converges towards its limit exponentially fast, and the constant governing the rate of convergence is the relative entropy [14]. Large Deviations have been applied, for example, to spin glasses [15], neural networks [18] and mean-field models of interacting particles [20]. In the mean-field theory of neuroscience in particular, there has been recent interest in the modelling of “finite size effects” [21], that is, the deviations from the limiting behaviour for a population of a particular size. Large Deviations provides a mathematically rigorous tool to do this. In this setting, the limit is typically the law P of a stochastic process, and therefore the likelihood of the empirical measure of the system being “near” some measure μ is governed by the relative entropy R(μ||P). However, the numerical calculation of R(μ||P) is not straightforward: the results of this paper provide an alternative characterization of R(μ||P), which assists in this calculation.
For example, the rate function for the Large Deviation Principle of the interacting particle model of [20] is expressed directly in terms of the relative entropy between two measures on the space of continuous functions (see in particular Theorem 5.2 of that paper). Similarly, the rate function in [18] (Theorem 10) may be expressed as a function of the relative entropy. In more detail, the rate function in [18] (Theorem 10) is expressed through the marginals μ^{V_n} and Ξ^{V_n}, where Ξ is the law of the process in [18] (Equation (31)), i.e., the law of a ℤ^d-indexed stochastic process, and the marginals are taken over the finite hypercube V_n of side length (2n + 1). The results of this paper give a means of evaluating R(μ^{V_n}||Ξ^{V_n}) and therefore the rate function.
In this paper we derive a specific integral (with respect to time) representation of the relative entropy R(μ||P) when P is the law of a diffusion process. The representation is in terms of the infinitesimal generator of P. This P is the same as in [22] (Section 4). The representation makes use of regular conditional probabilities. We expect that in some circumstances, it ought to be more tractable than the standard definition in Definition 1, and thus it might be of practical use in the applications listed above.
2. Outline of Main Result
Let 𝒯 be the Banach space C([0, T]; ℝ^d) equipped with the norm ‖X‖ = sup_{s∈[0,T]} |X_s|, where |⋅| is the standard Euclidean norm over ℝ^d. We let (F_t) be the canonical filtration over (𝒯, B(𝒯)). For some topological space 𝒳, we let B(𝒳) be the Borelian σ-algebra and ℳ(𝒳) the space of all probability measures on (𝒳, B(𝒳)). Unless otherwise indicated, we endow ℳ(𝒳) with the topology of weak convergence. Let σ = {t_1, t_2, …, t_m} be a finite set of elements such that t_1 ≥ 0, t_m ≤ T and t_j < t_{j+1}. We term σ a partition, and denote the set of all such partitions by J. The set of all partitions of the above form such that t_1 = 0 and t_m = T is denoted J_*. We define |σ| = sup_{1≤j≤m−1} {t_{j+1} − t_j}. For some t ∈ [0, T] and σ ∈ J_*, we define . The following definition of relative entropy is standard.
Definition 1. Let (Ω, ℋ) be a measurable space, and μ, ν probability measures.
R_ℋ(μ||ν) = sup_{f∈ε} {E^μ[f] − log E^ν[exp(f)]},
where ε is the set of all bounded functions. If the σ-algebra is clear from the context, we omit the ℋ and write R(μ||ν). If Ω is Polish and ℋ = B(Ω), then we only need to take the supremum over the set of all continuous bounded functions.
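The variational form of the relative entropy can be checked numerically on a finite set, where every function is bounded. The sketch below is an illustration under that assumption, not part of the paper: it takes the variational objective to be E^μ[f] − log E^ν[exp(f)] (the standard Donsker–Varadhan form), compares it with the sum Σ μ_i log(μ_i/ν_i), and uses hypothetical function names throughout.

```python
import math
import random


def relative_entropy(mu, nu):
    """Sum of mu_i * log(mu_i / nu_i) for pmfs on a finite set (all nu_i > 0)."""
    return sum(m * math.log(m / n) for m, n in zip(mu, nu) if m > 0)


def variational_objective(f, mu, nu):
    """E^mu[f] - log E^nu[exp(f)], the quantity maximised over bounded f."""
    mean_f = sum(m * x for m, x in zip(mu, f))
    log_mgf = math.log(sum(n * math.exp(x) for n, x in zip(nu, f)))
    return mean_f - log_mgf


def optimal_test_function(mu, nu):
    """f* = log(dmu/dnu); it attains the supremum when all mu_i > 0."""
    return [math.log(m / n) for m, n in zip(mu, nu)]


if __name__ == "__main__":
    mu = [0.5, 0.3, 0.2]
    nu = [0.2, 0.3, 0.5]
    rng = random.Random(0)
    # Random bounded test functions never exceed the relative entropy.
    best_random = max(
        variational_objective([rng.uniform(-3.0, 3.0) for _ in mu], mu, nu)
        for _ in range(1000)
    )
    print(relative_entropy(mu, nu))
    print(variational_objective(optimal_test_function(mu, nu), mu, nu))
    print(best_random)
```

On path space no such closed-form maximiser is available in general, which is one reason the variational definition may be hard to evaluate directly.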
Let P ∈ ℳ(𝒯) be the following law governing a Markov–Feller diffusion process on 𝒯. Stipulate P to be a weak solution (with respect to the canonical filtration) of the local martingale problem whose infinitesimal generator acts on functions f(x) in C²(ℝ^d), i.e., the space of twice continuously differentiable functions. The initial condition (governing P_0) is μ_I ∈ ℳ(ℝ^d). The coefficients a^{jk}, b^j: [0, T] × ℝ^d → ℝ of the generator are assumed to be continuous (over [0, T] × ℝ^d), and the matrix a(t, x) is strictly positive definite for all t and x. Here P is assumed to be the unique weak solution. We note that the above infinitesimal generator is the same as in [22] (p. 269) (note particularly its Remark 4.4). We note that P is the law of the solution Y = (Y^j), j ∈ [1, d], to a stochastic differential equation driven by independent Wiener processes (W^k).
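A path of such a diffusion can be approximated with the Euler–Maruyama scheme. The sketch below is only illustrative and rests on assumptions not stated in the text: it takes the SDE in the standard Itô form dY^j = b^j(t, Y) dt + Σ_k s^{jk}(t, Y) dW^k, where s is a hypothetical matrix square root of the diffusion matrix a (so that s sᵀ = a), and the function and parameter names are invented for the example.

```python
import math
import random


def euler_maruyama(b, s, y0, horizon, n_steps, rng):
    """Simulate one approximate path of dY = b(t, Y) dt + s(t, Y) dW.

    b(t, y) returns a list of d drift components; s(t, y) returns a d x d
    matrix (list of rows) assumed to satisfy s s^T = a. Returns the list of
    n_steps + 1 states visited on the uniform time grid.
    """
    d = len(y0)
    dt = horizon / n_steps
    y = list(y0)
    path = [list(y)]
    for i in range(n_steps):
        t = i * dt
        drift = b(t, y)
        sigma = s(t, y)
        # Independent Wiener increments, each with variance dt.
        dw = [rng.gauss(0.0, math.sqrt(dt)) for _ in range(d)]
        y = [
            y[j] + drift[j] * dt + sum(sigma[j][k] * dw[k] for k in range(d))
            for j in range(d)
        ]
        path.append(list(y))
    return path


if __name__ == "__main__":
    # Example: a one-dimensional Ornstein-Uhlenbeck-type process.
    rng = random.Random(0)
    path = euler_maruyama(
        b=lambda t, y: [-y[0]],
        s=lambda t, y: [[0.5]],
        y0=[1.0],
        horizon=1.0,
        n_steps=100,
        rng=rng,
    )
    print(path[-1])
```

Samples produced this way approximate P only up to discretisation error; they are useful for sanity checks rather than for exact evaluation of R(μ||P).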
Our major result is the following. Let μ ∈ ℳ(𝒯) govern a random variable X ∈ 𝒯. For some x ∈ 𝒯, we denote by μ_{|[0,s],x} the regular conditional probability (rcp) given X_r = x_r for all r ∈ [0, s]. The marginal of μ_{|[0,s],x} at some time t ≥ s is denoted by μ_{t|[0,s],x}.
Theorem 1. Let (σ^{(m)})_{m∈ℤ+} be any sequence of partitions such that σ^{(m)} ⊆ σ^{(m+1)} and |σ^{(m)}| → 0 as m → ∞. For μ ∈ ℳ(𝒯),
where
Here D is the Schwartz space of compactly supported functions ℝ^d → ℝ, possessing continuous derivatives of all orders. If does not exist, then we consider it to be ∞.
Our paper has the following format. In Section 3 we make some preliminary definitions, defining the process P against which the relative entropy is taken in this paper. In Section 4 we employ the projective limits approach of [22] to obtain the chief result of this paper: Theorem 1. This gives an explicit integral representation of the relative entropy. In Section 5 we apply the result in Theorem 1 to various corollaries, including the particular case when μ is the solution of a martingale problem. We finish by comparing our results to those of [19] and [
3. Preliminaries
We outline some necessary definitions. For σ ∈ J of the form σ = {t_1, t_2, …, t_m}, let σ_{;j} = {t_1, …, t_j}. We denote the number of elements in a partition σ by m(σ). We let J_s be the set of all partitions lying in [0, s]. For 0 < s < t ≤ T, we let J_{s;t} be the set of all partitions of the form σ ∪ {t}, where σ ∈ J_s.
Let π_σ: 𝒯 → 𝒯_σ := ℝ^{d×m(σ)} be the natural projection, i.e., such that π_σ(x) = (x_{t_1}, …, x_{t_{m(σ)}}). We similarly define the natural projection
, and we define
to be the natural restriction of x ∈ 𝒯 to [s, t]. The expectation of some measurable function f with respect to a measure μ is written as E^{μ(x)}[f(x)], or simply E^μ[f] when the context is clear.
For s < t, we write
. We define F_{s;t} to be the σ-algebra generated by F_s and F_γ (where γ = [t]). For μ ∈ ℳ(𝒯), we denote its image laws by
Let μ ∈ ℳ(𝒯) govern a random variable X = (X_s) ∈ 𝒯. For z ∈ ℝ^d, the rcp given X_s = z is written as μ_{|s,z}. For x ∈ C([0, s]; ℝ^d) or 𝒯, the rcp given that X_u = x_u for all 0 ≤ u ≤ s is written as μ_{|[0,s],x}. The rcp given that X_u = x_u for all u ≤ s, and X_t = z, is written as μ_{|s,x;t,z}. For σ ∈ J_s and z ∈ (ℝ^d)^{m(σ)}, the rcp given that X_u = z_u for all u ∈ σ is written as μ_{|σ,z}. All of these measures are considered to be in ℳ(C([s, T]; ℝ^d)) (unless indicated otherwise in particular circumstances). The probability laws governing X_t (for t ≥ s), for each of these, are respectively μ_{t|s,z}, μ_{t|[0,s],x} and μ_{t|σ,z}. We clearly have μ_{s|s,z} = δ_z, for μ_s a.e. z, and similarly for the others.
See [23] (Definition 5.3.16) for a definition of a rcp. Technically, if we let be the rcp given X_s = z according to this definition, then and.
By [23] (Theorem 3.18), μ_{|s,z} is well-defined for μ_s a.e. z. Similar comments apply to the other rcp’s defined above. In the definition of the relative entropy, we abbreviate R_{F_σ}(μ||P) by R_σ(μ||P). If σ = {t}, we write R_t(μ||P).
4. The Relative Entropy R(⋅||P) Using Projective Limits
In this section we derive an integral representation of the relative entropy R(μ||P), for arbitrary μ ∈ ℳ(𝒯). We start with the standard result in Theorem 2, before adapting the projective limits approach of [22] to obtain the central result (Theorem 1).
We begin with a standard decomposition result for the relative entropy [
Lemma 1. Let X be a Polish space with sub σ-algebras G ⊆ F ⊆ B(X). Let μ and ν be probability measures on (X, F), and let their regular conditional probabilities over G be (respectively) μ_ω and ν_ω. Then
R_F(μ||ν) = R_G(μ||ν) + E^{μ(ω)}[R_F(μ_ω||ν_ω)].
The following theorem is a straightforward consequence of [25] (Theorem 6.6); we provide an alternative proof using the theory of Large Deviations in Section 6.
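The decomposition over a sub σ-algebra can be checked on a finite product space. The sketch below is an illustration under assumptions of our own: X is a finite set of pairs (i, j), G is generated by the first coordinate, and the decomposition is taken in its standard chain-rule form (relative entropy on F equals relative entropy on G plus the μ-average of the conditional relative entropies). The function names are hypothetical.

```python
import math


def kl(p, q):
    """Relative entropy of pmf p with respect to pmf q (q_i > 0 where p_i > 0)."""
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)


def chain_rule_terms(mu, nu):
    """Return (joint, marginal, conditional) relative entropies.

    mu[i][j] and nu[i][j] are joint pmfs; the sub sigma-algebra is generated
    by the first index i, so the conditional laws are the normalised rows.
    """
    joint = kl([p for row in mu for p in row], [q for row in nu for q in row])
    mu_first = [sum(row) for row in mu]
    nu_first = [sum(row) for row in nu]
    marginal = kl(mu_first, nu_first)
    conditional = sum(
        mu_first[i]
        * kl([p / mu_first[i] for p in mu[i]], [q / nu_first[i] for q in nu[i]])
        for i in range(len(mu))
        if mu_first[i] > 0
    )
    return joint, marginal, conditional


if __name__ == "__main__":
    mu = [[0.10, 0.20, 0.10], [0.30, 0.05, 0.25]]
    nu = [[0.20, 0.10, 0.20], [0.10, 0.20, 0.20]]
    joint, marginal, conditional = chain_rule_terms(mu, nu)
    print(joint, marginal + conditional)
```

Because the conditional term is non-negative, the same computation shows that coarsening the σ-algebra can only decrease the relative entropy, which is the monotonicity used for nested partitions.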
Theorem 2. If α, σ ∈ J and α ⊆ σ, then R_α(μ||P) ≤ R_σ(μ||P).
It suffices for the supremums in (4) to take σ ⊂ Q_{s,t}, where Q_{s,t} is any countable dense subset of [s, t]. Thus we may assume that there exists a sequence σ^{(n)} ⊂ Q_{0,T} of partitions such that σ^{(n)} ⊆ σ^{(n+1)}, |σ^{(n)}| → 0 as n → ∞ and
We now provide a technical lemma.
Lemma 2. Let t > s, α, σ ∈ J_s, σ ⊂ α and s ∈ σ. Then for μ_σ a.e. x, R_t(μ_{|σ,x}||P_{|s,x_s}) =
Proof. The first statement is immediate from Definition 1 and the Markovian nature of P. For the second statement, it suffices to prove this in the case that α = σ ∪ {u}, for some u < s. We note that, using a property of regular conditional probabilities, for μ_σ a.e. x,
x, ω) ∈
x, ω)
u =
x, ω)
r =
xr for all
r ∈
We consider A to be the set of all finite disjoint partitions a ⊂ B(ℝ^d) of ℝ^d. The expression for the entropy in [26] (Lemma 1.4.3) yields
Here the summand is considered to be zero if
A) = 0, and infinite if
A) > 0 and
A) = 0. Making use of
(7), we find that
We note that, for
μα a.e.
z, if
in this last expression, then
A) = 0 and we consider the summand to be zero. To complete the proof of the lemma, it is thus sufficient to prove that for
μ_α a.e. z
However, in turn, the above inequality will be true if we can prove that for each partition a such that
for all A ∈ a,
The left hand side is equal to
An application of Jensen’s inequality demonstrates that this is greater than or equal to zero. □
If, contrary to the definition, we briefly consider μ_{|[0,t],x} to be a probability measure on 𝒯, such that μ_{|[0,t],x}(A) = 1 where A is the set of all points y such that y_s = x_s for all s ≤ t, then it may be seen from the definition of R that
We have also made use of the Markov property of P. This is why our convention, to which we now return, is to consider μ_{|[0,t],x} to be a probability measure on (C([t, T]; ℝ^d), F_{t,T}).
This leads us to the following expressions for R(μ||P).
Lemma 3. Each σ in the supremums below is of the form {t_1 < t_2 < … < t_{m(σ)−1} < t_{m(σ)}} for some integer m(σ).
where in this last expression 0 ≤ s < t ≤ T.
Proof. Consider the sub
. We then find, through an application of Lemma 1 and (8), that
We may continue inductively to obtain the first identity.
We use Theorem 2 to prove the second identity. It suffices to take the supremum over J_*, because R_σ(μ||P) ≥ R_γ(μ||P) if γ ⊂ σ. It thus suffices to prove that
However, this also follows from repeated application of Lemma 1. To prove the third identity, we firstly note that
The proof of this is entirely analogous to that of the second identity, except that it makes use of (5) instead of (4). However, after another application of Lemma 1, we also have that
On equating these two different expressions for
, we obtain
Let (σ^{(k)}) ⊂ J, σ^{(k−1)} ⊆ σ^{(k)}, be such that
. Such a sequence exists by (4). Similarly, let (γ^{(k)}) ⊆ J_s be a sequence such that
is non-decreasing and, as k → ∞, asymptotically approaches
. Lemma 2 dictates that
asymptotically approaches the same limit as well. Clearly
because of the identity at the start of Theorem 2. This yields the third identity.
4.1. Proof of Theorem 1
In this section we work towards the proof of Theorem 1, making use of some results in [22]. However, we first require some more definitions.
If K ⊂ ℝ^d is compact, let D_K be the set of all f ∈ D whose support is contained in K. The corresponding space of real distributions is D′, and we denote the action of θ ∈ D′ by ⟨θ, f⟩. If θ ∈ ℳ(ℝ^d), then clearly ⟨θ, f⟩ = E^θ[f]. We let
denote the set of all continuous functions, possessing continuous spatial derivatives of first and second order, a continuous time derivative of first order, and of compact support. For f ∈ D and t ∈ [0, T], we define the random variable ∇_t f: ℝ^d → ℝ^d such that
, we may also understand ∇_t f(x) := ∇_t f(x_t)). Let a_{ij} be the components of the matrix inverse of a^{ij}. For random variables X, Y: 𝒯 → ℝ^d, we define the inner product
with associated norm
We note that
Let M be the space of all continuous maps [0, T] → ℳ(ℝ^d), equipped with the topology of uniform convergence. For
s ∈ [0,
ϑ ∈ M and
ν ∈
d) we define n(
s, ϑ, ν) ≥ 0 and such that
This definition is taken from [22] (Equation (4.7)); we note that n is convex in ϑ. For γ ∈ ℳ(𝒯), we may naturally write n(s, γ, ν) := n(s, ω, ν), where ω is the projection of γ onto M, i.e., ω(s) = γ_s. It is shown in [22] that this projection is continuous. The following two definitions, lemma and two propositions are all taken (with some small modifications) from [
Definition 2. Let I be an interval of the real line. A measure μ ∈ ℳ(𝒯) is called absolutely continuous if for each compact set K ⊂ ℝ^d there exists a neighbourhood U_K of 0 in D_K and an absolutely continuous function H_K : I → ℝ such that
for all u, v ∈ I and f ∈ U_K.
Lemma 4. [22] (Lemma 4.2) If μ is absolutely continuous over an interval I, then its derivative exists (in the distributional sense) for Lebesgue a.e. t ∈ I. That is, for Lebesgue a.e. t ∈ I, there exists such that for all f ∈ D
Definition 3. For ν ∈ ℳ(C([s, t]; ℝ^d)), and 0 ≤ s < t ≤ T, let
be the Hilbert space of all measurable maps h : [s, t] × ℝ^d → ℝ^d with inner product
We denote by
the closure in
of the linear subset generated by maps of the form (x, u) → ∇_u f, where
. We note that functions in
only need to be defined du⊗ν_u(dx) almost everywhere.
Recall that n is defined in (13), and note that
Proposition 1. Assume that μ ∈ ℳ(C([r, s]; ℝ^d)) such that μ_r = δ_y for some y ∈ ℝ^d and 0 ≤ r < s ≤ T. We have that [22] (Equation 4.9 and Lemma 4.8)
It clearly suffices to take the supremum over a countable dense subset. Assume now that.
Then for Lebesgue a.e. t,
where [22] (Lemma 4.8(3))
for some that satisfies [22] (Lemma 4.8(4))
We reach (17) from the proof of Lemma 9 in [22] (Eq 4.10). One should note also that in the equation (4.10) of [22] the relative entropy R as.
To reach (18), we also use the equivalence between (4.7) and (4.8) in [22].
Proposition 2. Assume that μ ∈ ℳ(C([r, s]; ℝ^d)) such that μ_r = δ_y for some y ∈ ℝ^d and 0 ≤ r < s ≤ T. If,
then μ is absolutely continuous on [r, s], and [22] (Lemma 4.9)
Here the derivative is defined in Lemma 4. For all f ∈ D, [22] (Eq. (4.35))
We are now ready to prove Theorem 1 (the central result).
Proof. Fix a partition σ = {t_1, …, t_m}. We may conclude from (9) and (17) that
The integrand on the right hand side is measurable with respect to
due to the equivalent expression (14). We may infer from (18) that
This last step follows by noting that if ν ∈ ℳ(ℝ^d), and f ∈ C²(ℝ^d), and the expectation of f with respect to ν is finite, then there exists a sequence (K_n) ⊂ ℝ^d of compact sets such that
In turn, for each n there exist
such that we may write
This allows us to conclude that the two supremums are the same. The last expression in (20) is merely
(11), this is greater than or equal to
We thus obtain the theorem using
5. Some Corollaries
We state some corollaries of Theorem 1. In the course of this section we make progressively stronger assumptions on the nature of μ, culminating in the elegant expression for R(μ||P) when μ is a solution of a martingale problem. We finish by comparing our work with that of [
Corollary 1. Suppose that μ ∈ ℳ(𝒯) and R(μ||P) < ∞. Then for all s and μ a.e. x, μ_{|[0,s],x} is absolutely continuous over [s, T]. For each s ∈ [0, T] and μ a.e. x ∈ 𝒯, for Lebesgue a.e. t ≥ s
where for some
For any dense countable subset Q_{0,T} of [0, T], there exists a sequence of partitions σ^{(n)} ⊂ σ^{(n+1)} ∈
such that as n → ∞, |σ^{(n)}| → 0, and R
It is not immediately clear that we may simplify (23) further (barring further assumptions). The reason for this is that we only know that is measurable (as a function of w), but it has not been proven that is measurable (as a function of w).
Proof. Let σ = {0 = t_1, …, t_m = T} be an arbitrary partition. For all j < m, we find from Lemma 3 that
x ∈ C([0, t_j]; ℝ^d). We thus find that, for all such
is absolutely continuous on [t_j, t_{j+1}] from Proposition 2. We are then able to obtain (21) and (22) from Propositions 1 and 2. From (16) and (21) we find that
The above integral must be finite (since we are assuming R(μ||P) is finite). Furthermore
is (t, x) measurable as a consequence of the equivalent form (14). This allows us to apply Fubini’s theorem to obtain (23). The last statement on the sequence of maximising partitions follows from Theorem 2.
Corollary 2. Suppose that R(μ||P) < ∞. Suppose that for all s ∈ Q_{0,T} (any countable, dense subset of [0, T]), for μ a.e. x and Lebesgue a.e. t,
for some progressively measurable random variable h^μ : [0, T] × 𝒯 → ℝ^d. Then
Proof. Let G^{s,x;t,y} be the sub σ-algebra consisting of all B ∈ B(𝒯) such that for all w ∈ B, w_r = x_r for all r ≤ s and w_t = y. Thus
. By [27] (Corollary 2.4), since
(restricting to s ∈ Q_{0,T}), for μ a.e.
s ∈ Q_{0,T}. By the properties of the regular conditional probability, we find from (24) that
By assumption, the above limit is finite. Thus by Fatou’s lemma, and using the properties of the regular conditional probability,
Conversely, through an application of Jensen’s inequality to (27)
A property of the regular conditional probability yields
Remark. The condition in the above corollary is satisfied when μ is a solution to a martingale problem—see Lemma 5.
We may further simplify the expression in Theorem 1 when μ is a solution to the following martingale problem. Let {c^{jk}, e^j} be progressively measurable functions [0, T] × 𝒯 → ℝ. We suppose that c^{jk} = c^{kj}. For all 1 ≤ j, k ≤ d, c^{jk}(t, x) and e^j(t, x) are assumed to be bounded for x ∈ L (where L is compact) and all t ∈ [0, T]. For x ∈ 𝒯, let
We assume that for all such f, the following is a continuous martingale (relative to the canonical filtration) under μ
The law governing X_0 is stipulated to be ν ∈ ℳ(ℝ^d).
From now on we switch from our earlier convention and we consider μ_{|[0,s],x} to be a measure on 𝒯 such that, for μ a.e. x ∈ 𝒯, μ_{|[0,s],x}(A_{s,x}) = 1, where A_{s,x} is the set of all X ∈ 𝒯 satisfying X_t = x_t for all 0 ≤ t ≤ s. This is a property of a regular conditional probability (see Theorem 3.18 in [23]). Similarly, μ_{|s,x;t,y} is considered to be a measure on 𝒯 such that for μ a.e. x ∈ 𝒯, μ_{|s,x;t,y}(B_{s,x;t,y}) = 1, where B_{s,x;t,y} is the set of all X ∈ A_{s,x} such that X_t = y. We may apply Fubini’s Theorem (since f is compactly supported and bounded) to (28) to find that
This ensures that μ_{|[0,s],x} is absolutely continuous over [s, T], and that
Lemma 5. If R(μ||P) < ∞ then for Lebesgue a.e. t ∈ [0, T] and μ a.e. x ∈ 𝒯,
Proof. It follows from R(μ||P) < ∞, (21) and (22) that for all s and μ a.e. x, for Lebesgue a.e. t ≥ s
Let us take a countable dense subset Q_{0,T} of [0, T]. There thus exists a null set N ⊆ [0, T] such that for every s ∈ Q_{0,T}, μ a.e. x and every t ∉ N the above equation holds. We may therefore conclude (30) using [27] (Corollary 2.4) and taking s → t−. From (29), we observe that for all s ∈ [0, T] and μ a.e. x, for Lebesgue a.e. t
5.1. Comparison of our Results to Those of Fischer et al. [19,20]
We have already noted in the introduction that one may infer a variational representation of the relative entropy from [20] by assuming that the coefficients of the underlying stochastic process are independent of the empirical measure in these papers. The assumptions in [20] on the underlying process P are in some respects more general and in others more restrictive than ours. Their assumptions are more general insofar as the coefficients of the SDE may depend on the past history of the process and the diffusion coefficient is allowed to be degenerate. However, our assumptions are more general insofar as we only require P to be the unique (in the sense of probability law) weak solution of the SDE, whereas [20] requires P to be the unique strong solution of the SDE. Of course, when both sets of assumptions are satisfied, one may infer that the expressions for the relative entropy are identical.
6. Proof of Theorem 2
The following is an alternative proof to that of [25] (Theorem 6.6) employing the theory of Large Deviations. The fact that, if α ⊆ σ, then R_α(μ||P) ≤ R_σ(μ||P), follows from Lemma 1. We prove the first expression (4) in the case s = 0, t = T (the proof of the second identity (5) is analogous).
Definition 4. A sequence of probability laws Γ^N on some topological space Ω equipped with its Borelian σ-algebra is said to satisfy a strong Large Deviation Principle with rate function I : Ω → ℝ if for all open sets O,
liminf_{N→∞} N^{−1} log Γ^N(O) ≥ −inf_{x∈O} I(x),
and for all closed sets F,
limsup_{N→∞} N^{−1} log Γ^N(F) ≤ −inf_{x∈F} I(x).
If furthermore the set {x : I(x) ≤ α} is compact for all α ≥ 0, we say that I is a good rate function.
We define the following empirical measures.
Definition 5. For x ∈ 𝒯^N, let
. The image law
is denoted by Π^N. Similarly, for σ ∈ J, the image law of
on ℳ(𝒯_σ) is denoted by
. Since 𝒯 and 𝒯_σ are Polish spaces, we have by Sanov’s theorem (see Theorem 6.2.10 in [14]) that Π^N satisfies a strong Large Deviation Principle with good rate function R(⋅||P). Similarly,
satisfies a strong Large Deviation Principle on ℳ(𝒯_σ) with good rate function
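Sanov's theorem can be illustrated in the simplest possible setting, which is not the path-space setting of this section: independent Bernoulli(p) samples, where the empirical measure is determined by the fraction of ones. Under this assumption, the probability that the fraction is at least a > p decays like exp(−N·R), with R the relative entropy of Bernoulli(a) with respect to Bernoulli(p). The sketch below computes the exact binomial tail in log space and compares the decay rate with R; the function names are hypothetical.

```python
import math


def bernoulli_relative_entropy(a: float, p: float) -> float:
    """Relative entropy of Bernoulli(a) with respect to Bernoulli(p), 0 < a, p < 1."""
    return a * math.log(a / p) + (1.0 - a) * math.log((1.0 - a) / (1.0 - p))


def log_binomial_tail(n: int, p: float, a: float) -> float:
    """log P(S_n >= a * n) for S_n ~ Binomial(n, p), evaluated in log space."""
    k_min = math.ceil(a * n)
    logs = [
        math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
        + k * math.log(p) + (n - k) * math.log(1.0 - p)
        for k in range(k_min, n + 1)
    ]
    top = max(logs)
    # Log-sum-exp to avoid underflow of the individual probabilities.
    return top + math.log(sum(math.exp(v - top) for v in logs))


def empirical_rate(n: int, p: float, a: float) -> float:
    """-(1/n) log P(fraction of ones >= a), the finite-n decay rate."""
    return -log_binomial_tail(n, p, a) / n


if __name__ == "__main__":
    p, a = 0.5, 0.7
    print("relative entropy:", bernoulli_relative_entropy(a, p))
    for n in (200, 2000, 20000):
        print(n, empirical_rate(n, p, a))
```

The finite-N rate stays above the relative entropy (a Chernoff bound) and approaches it as N grows, which is the exponential decay that the rate function encodes.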
We now define the projective limit
. If γ ∈ J, α ⊂ γ, then we may define the projection
. An element of
is then a member of the Cartesian product ⊗_{σ∈J} ℳ(𝒯_σ) satisfying the consistency condition
for all α ⊂ γ. The topology on
is the minimal topology necessary for the natural projection
to be continuous for all α ∈ J. That is, it is generated by open sets of the form
for some γ ∈ J and open O (with respect to the weak topology of ℳ(𝒯_γ)).
We may continuously embed ℳ(𝒯) into the projective limit
of its marginals, letting ι denote this embedding. That is, for any σ ∈ J, (ι(μ))(σ) = μ_σ. We note that ι is continuous because ι^{−1}(A_{γ,O}) is open in ℳ(𝒯), for all A_{γ,O} of the form in (33). We equip
with the Borelian σ-algebra generated by this topology. The embedding ι is measurable with respect to this σ-algebra because the topology of ℳ(𝒯) has a countable base. The embedding induces the image laws (Π^N ○ ι^{−1}) on
. For σ ∈ J, it may be seen that
, where
It follows from [22] (Thm 3.3) that Π^N ○ ι^{−1} satisfies a Large Deviation Principle with rate function sup_{σ∈J} R_σ(μ||P). However, we note that ι is 1–1, because any two measures μ, ν ∈ ℳ(𝒯) such that μ_σ = ν_σ for all σ ∈ J must be equal. Furthermore, ι is continuous. Because of Sanov’s theorem, (Π^N) is exponentially tight (see Defn 1.2.17, Exercise 1.2.19 in [14] for a definition of exponential tightness and proof of this statement). These facts mean that we may apply the inverse contraction principle [14] (Thm 4.2.4) to infer that Π^N satisfies a Large Deviation Principle with the rate function sup_{σ∈J} R_σ(μ||P). Since rate functions are unique [14] (Lemma 4.1.4), we obtain the first identity in conjunction with Sanov’s theorem. The second identity (5) follows similarly. We may repeat the argument above, while restricting to σ ⊂ Q_{s,t}. We obtain the same conclusion because the σ-algebra generated by (F_σ)_{σ⊂Q_{s,t}} is the same as F_{s,t}. The last identity follows from the fact that, if α ⊆ σ, then R_α(μ||P) ≤ R_σ(μ||P).