1. Introduction
In this paper, a new class of prior distributions is introduced on the univariate normal model. The new prior distributions, which will be called Gaussian distributions, are based on the Riemannian geometry of the univariate normal model. The paper introduces these new distributions, uncovers some of their fundamental properties and applies them to the problem of the classification of univariate normal populations. It shows that, in the context of a real-life application to texture image classification, the use of these new prior distributions leads to improved performance in comparison with the use of more standard conjugate priors.
To motivate the introduction of these new prior distributions, recall some general facts about the Riemannian geometry of parametric models.
In information geometry [1], it is well known that a parametric model {p_θ; θ ∈ Θ}, where Θ ⊂ ℝ^p, can be equipped with a Riemannian geometry determined by Fisher's information matrix, say I(θ). Indeed, assuming I(θ) is strictly positive definite for each θ ∈ Θ, a Riemannian metric on Θ is defined by:

ds²(θ) = ∑_{i,j=1}^{p} I_{ij}(θ) dθ^i dθ^j    (1)
The fact that the length element Equation (1) is invariant to any change of parametrization was realized by Rao [2], who was the first to propose the application of Riemannian geometry in statistics.
Once the Riemannian metric Equation (1) is introduced, the whole machinery of Riemannian geometry becomes available for application to statistical problems relevant to the parametric model {p_θ; θ ∈ Θ}. This includes the notion of Riemannian distance between two distributions, p_θ and p_θ′, which is known as Rao's distance, say d(θ, θ′), the notion of Riemannian volume, which is exactly the same as Jeffreys prior [3], and the notion of Riemannian gradient, which can be used in numerical optimization and coincides with the so-called natural gradient of Amari [4].
It is quite natural to apply Rao's distance to the problem of classifying populations that belong to the parametric model {p_θ; θ ∈ Θ}. In the case where this parametric model is the univariate normal model, this approach to classification is implemented in [5]. For more general parametric models, beyond the univariate normal model, similar applications of Rao's distance to problems of image segmentation and statistical tests can be found in [6–8].
The idea of [5] is quite elegant. In general, it requires that classes L = 1,...,C (based on a learning sequence) have been identified with "centers" θ̄_L ∈ Θ. Then, in order to assign a test population, given by the parameter θ_t, to a class L*, it is proposed to choose the L* which minimizes the squared Rao's distance d²(θ_t, θ̄_L) over L = 1,...,C. In the specific context of the classification of univariate normal populations [5], this leads to the introduction of hyperbolic Voronoi diagrams.
The present paper is also concerned with the case where the parametric model {p_θ; θ ∈ Θ} is a univariate normal model. It starts from the idea that a class L should be identified not only with a center θ̄_L, as in [5], but also with a kind of "variance", say γ², which will be called a dispersion parameter. Accordingly, assigning a test population given by the parameter θ_t to a class L should be based on a tradeoff between the square of Rao's distance d²(θ_t, θ̄_L) and the dispersion parameter γ².
Of course, this idea has a strong Bayesian flavor. It proposes to give more "confidence" to classes that have a smaller dispersion parameter. Thus, in order to implement it in a concrete way, the paper starts by introducing prior distributions on the univariate normal model, which it calls Gaussian distributions. By definition, a Gaussian distribution G(θ̄, γ²) has a probability density function, with respect to Riemannian volume, given by:

p(θ | θ̄, γ²) ∝ exp( −d²(θ, θ̄) / 2γ² )    (2)

Given this definition of a Gaussian distribution (which is developed in a detailed way in Section 3), classification of univariate normal populations can be carried out by associating to each class L of univariate normal populations a Gaussian distribution G(θ̄_L, γ_L²) and by assigning any test population with parameter θ_t to the class L*, which maximizes the likelihood p(θ_t | θ̄_L, γ_L), over L = 1,...,C.
The present paper develops in a rigorous way the general approach to the classification of univariate normal populations, which has just been described. It proceeds as follows.
Section 2, which is basically self-contained, provides the concepts, regarding the Riemannian geometry of the univariate normal model, which will be used throughout the paper.
Section 3 introduces Gaussian distributions on the univariate normal model and uncovers some of their general properties. In particular, Section 3.2 gives a Riemannian gradient descent algorithm for computing maximum likelihood estimates of the parameters θ̄ and γ of a Gaussian distribution.
Section 4 states the general approach to classification of univariate normal populations proposed in this paper. It deals with two problems: (i) given a class of univariate normal populations {S_i}, how to fit a Gaussian distribution G(z̄, γ) to this class; and (ii) given a test univariate normal population S_t and a set of classes L = 1,...,C, how to assign S_t to a suitable class L*.
In the present paper, the chosen approach for resolving these two problems is marginalized likelihood estimation, in the asymptotic framework where each univariate normal population contains a large number of data points. In this asymptotic framework, the Laplace approximation plays a major role [9]. In particular, it reduces the first problem, of fitting a Gaussian distribution to a class of univariate normal populations, to the problem of maximum likelihood estimation covered in Section 3.2.
The final result of Section 4 is the decision rule Equation (37). This generalizes the one developed in [5] and already explained above, by taking into account the dispersion parameter γ, in addition to the center θ̄, for each class.
In Section 5, the formalism of Section 4 is applied to texture image classification, using the VisTex image database [10]. This database is used to compare the performance obtained using Gaussian distributions, as in Section 4, to that obtained using conjugate prior distributions. It is shown that the Gaussian distributions proposed in the current paper lead to a significant improvement in performance.
Before going on, it should be noted that probability density functions of the form (2), on general Riemannian manifolds, were considered by Pennec in [11]. However, they were not specifically used as prior distributions, but rather as a representation of uncertainty in medical image analysis and directional or shape statistics.
2. Riemannian Geometry of the Univariate Normal Model
The current section presents in a self-contained way the results on the Riemannian geometry of the univariate normal model, which are required for the remainder of the paper. Section 2.1 recalls the fact that the univariate normal model can be reparametrized, so that its Riemannian geometry is essentially the same as that of the Poincaré upper half plane. Section 2.2 uses this fact to give analytic formulas for distance, geodesics and integration on the univariate normal model. Finally, Section 2.3 presents, in general form, the Riemannian gradient descent algorithm.
2.1. Derivation of the Fisher Metric
This paper considers the Riemannian geometry of the univariate normal model, as based on the Fisher metric (1). To be precise, the univariate normal model has a two-dimensional parameter space Θ = {θ = (μ, σ) | μ ∈ ℝ, σ > 0}, and is given by:

p_θ(s) = (2πσ²)^{−1/2} exp( −(s − μ)² / 2σ² ),   s ∈ ℝ    (3)

where each p_θ is a probability density function with respect to the Lebesgue measure on ℝ. The Fisher information matrix, obtained from Equation (3), is the following:

I(θ) = diag( 1/σ², 2/σ² )
As in [12], this expression can be made more symmetric by introducing the parametrization z = (x, y), where x = μ/√2 and y = σ. This yields the Fisher information matrix:

I(z) = (2/y²) diag( 1, 1 )
It is suitable to drop the factor two in this expression and introduce the following Riemannian metric for the univariate normal model,

ds²(z) = ( dx² + dy² ) / y²    (4)

This is essentially the same as the Fisher metric (up to the factor two) and will be considered throughout the following. The resulting Rao's distance and Riemannian geometry are given in the following paragraph.
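As a quick sanity check of the expressions above, the following sketch estimates the Fisher information matrix of the (μ, σ) parametrization by Monte Carlo, using the fact that I(θ) is the covariance of the score. The numerical values of μ and σ are arbitrary; this is only an illustration, not part of the paper's method.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 1.3, 0.7
s = rng.normal(mu, sigma, size=1_000_000)

# score of the univariate normal model (3): derivatives of log p_theta(s)
score_mu = (s - mu) / sigma**2
score_sigma = -1.0/sigma + (s - mu)**2 / sigma**3

scores = np.stack([score_mu, score_sigma])
I_hat = scores @ scores.T / s.size     # empirical Fisher information matrix
print(I_hat)                           # approximately diag(1/sigma^2, 2/sigma^2)
```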
2.2. Distance, Geodesics and Volume
The Riemannian metric (4), obtained in the last paragraph, happens to be a very well-known object in differential geometry. Precisely, the parameter space H = {z = (x, y) | y > 0} equipped with the metric (4) is known as the Poincaré upper half plane and is a basic model of a two-dimensional hyperbolic space [13].
Rao's distance between two points z₁ = (x₁, y₁) and z₂ = (x₂, y₂) in H can be expressed as follows (for results in the present paragraph, see [13], or any suitable reference on hyperbolic geometry),

d(z₁, z₂) = acosh( 1 + [ (x₁ − x₂)² + (y₁ − y₂)² ] / (2 y₁ y₂) )    (5)

where acosh denotes the inverse hyperbolic cosine.
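A minimal numerical sketch of Equation (5) follows, together with the identification x = μ/√2, y = σ from Section 2.1; the function names are illustrative only.

```python
import numpy as np

def rao_distance(z1, z2):
    """Rao's distance (5) on the Poincare upper half plane, for the metric (4)."""
    x1, y1 = z1
    x2, y2 = z2
    return np.arccosh(1.0 + ((x1 - x2)**2 + (y1 - y2)**2) / (2.0*y1*y2))

def rao_distance_normal(mu1, sigma1, mu2, sigma2):
    """Same distance, between the populations N(mu1, sigma1^2) and N(mu2, sigma2^2)."""
    return rao_distance((mu1/np.sqrt(2.0), sigma1), (mu2/np.sqrt(2.0), sigma2))

print(rao_distance_normal(0.0, 1.0, 1.0, 1.0))   # two unit-variance populations, means 0 and 1
```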
Starting from z1, in any given direction, it is possible to draw a unique geodesic ray γ : R+ → H. This is a curve having the property that γ(0) = z1 and, for any t ∈ R+, if γ(t) = z2 then d(z1, z2) = t. In other words, the length of γ between z1 and z2 is equal to the distance between z1 and z2.
The equation of a geodesic ray starting from z ∈ H is conveniently written down in complex notation (that is, by treating points of H as complex numbers). To begin, consider the case of z = i (which stands for x = 0 and y = 1). The geodesic in the direction making an angle ψ with the y-axis is the curve,

γ_i(t, ψ) = [ cos(ψ/2) i e^t + sin(ψ/2) ] / [ −sin(ψ/2) i e^t + cos(ψ/2) ]    (6)

In particular, ψ = 0 gives γ_i(t) = e^t i and ψ = π gives γ_i(t) = e^{−t} i. If ψ is not a multiple of π, γ_i(t) traces out a portion of a circle, which becomes parallel to the y-axis in the limit t → ∞.

For a general starting point z, the geodesic ray in the direction making an angle ψ with the y-axis can be written:

γ_z(t, ψ) = x + y γ_i(t, ψ)    (7)

where z = (x, y) and γ_i(t, ψ) is given by Equation (6). A more detailed treatment of Rao's distance (5) and of geodesics in the Poincaré upper half plane, along with applications in image clustering, can be found in [5].
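The exponential map that appears in the following subsections (and in the gradient descent algorithms of Sections 2.3 and 3.2) can be implemented directly from this description of geodesics. The sketch below is an assumed implementation: it takes a tangent vector v at z, identified with a complex number, and follows the geodesic with initial velocity v for a unit time; the rotation convention for the angle ψ is a choice of this sketch and may differ from Equation (6) by a sign.

```python
import cmath
import math

def exp_map(z, v):
    """Riemannian exponential on H for the metric (4).

    z: point of H, as a complex number with Im(z) > 0.
    v: tangent vector at z, identified with a complex number; its Riemannian
       norm is |v| / Im(z), which is also the Rao distance d(z, exp_map(z, v)).
    """
    x, y = z.real, z.imag
    u = v / y                          # bring the tangent vector back to the point i
    r = abs(u)                         # geodesic distance to travel
    if r < 1e-15:
        return complex(x, y)
    psi = cmath.phase(u) - math.pi/2   # angle between the initial direction and the y-axis
    c, s = math.cos(psi/2), math.sin(psi/2)
    w = (c*1j*math.exp(r) + s) / (-s*1j*math.exp(r) + c)   # geodesic issued from i, cf. Equation (6)
    return x + y*w                     # geodesic issued from z, cf. Equation (7)
```

For instance, exp_map(1j, 0.7j) returns approximately i e^{0.7}, the point on the vertical geodesic of the previous paragraph at distance 0.7 above i.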
The Riemannian volume (or area, since H is of dimension 2) element corresponding to the Riemannian metric (4) is dA(z) = dx dy / y². Accordingly, the integral of a function f : H → ℝ with respect to dA is given by:

∫_H f dA = ∫_0^{+∞} ∫_{−∞}^{+∞} f(x, y) dx dy / y²    (8)
In many cases, the analytic computation of this integral can be greatly simplified by using polar coordinates (r, φ) defined with respect to some "origin" z̄ ∈ H. Polar coordinates (r, φ) map to the point z(r, φ) given by:

z(r, φ) = γ_z̄(r, φ)    (9)

where the right-hand side is defined according to Equation (7). The polar coordinates (r, φ) do indeed define a global coordinate system of H, in the sense that the map that takes a complex number r e^{iφ} to the point z(r, φ) in H is a diffeomorphism. The standard notation from differential geometry is:

z(r, φ) = exp_z̄( r e^{iφ} )    (10)
In these coordinates, the Riemannian metric (4) takes on the form:

ds² = dr² + sinh²(r) dφ²    (11)

The integral Equation (8) can be computed in polar coordinates using the formula [13],

∫_H f dA = ∫_0^{2π} ∫_0^{+∞} ( f ∘ exp_z̄ )( r e^{iφ} ) sinh(r) dr dφ    (12)

where exp_z̄ was defined in Equation (10) and ∘ denotes composition. This is particularly useful when f ∘ exp_z̄ does not depend on φ.
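The following sketch checks Equation (12) numerically on a radially symmetric function centered at z̄ = i, namely f(z) = exp(−d²(z, i)/2): the Cartesian integral (8) and the polar-coordinate integral (12) should agree. The truncation of the Cartesian integral to a finite box is only a numerical convenience.

```python
import numpy as np
from scipy.integrate import dblquad, quad

def f(x, y):
    """Radially symmetric test function around i: exp(-d^2(z, i)/2), with d from Equation (5)."""
    d2 = np.arccosh(1.0 + (x**2 + (y - 1.0)**2) / (2.0*y))**2
    return np.exp(-d2/2.0)

# Equation (8): integral over H with area element dx dy / y^2 (truncated to a large box; tails are negligible)
lhs, _ = dblquad(lambda y, x: f(x, y)/y**2, -30.0, 30.0, 1e-3, 200.0)

# Equation (12): polar coordinates around i; f depends only on r, so the phi integral gives 2*pi.
# sinh(r) * exp(-r^2/2) is written in a numerically stable form.
integrand = lambda r: 0.5*(np.exp(r - r**2/2.0) - np.exp(-r - r**2/2.0))
rhs = 2.0*np.pi*quad(integrand, 0.0, np.inf)[0]

print(lhs, rhs)   # both approximately 8.86
```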
2.3. Riemannian Gradient Descent
In this paper, the problem of minimizing, or maximizing, a differentiable function f : H → ℝ will play a central role. A popular way of handling the minimization of a differentiable function defined on a Riemannian manifold (such as H) is through Riemannian gradient descent [14].
Here, the definition of the Riemannian gradient is reviewed, and a generic description of Riemannian gradient descent is provided. The Riemannian gradient of f is here defined as a mapping ∇f : H → ℂ with the following property:

for any complex number h, where Re denotes the real part, * denotes conjugation and df is the "derivative", df = (∂f/∂x) + (∂f/∂y) i. For example, if f(z) = y, it follows from Equation (13) that ∇f(z) = y².
Riemannian gradient descent consists in following the direction of −∇f at each step, with the length of the step (in other words, the step size) being determined by the user. The generic algorithm is, up to some variations, the following:
INPUT | ẑ ∈ H | % Initial guess |
WHILE | ‖∇f(ẑ)‖ > ε | % ε ≈ 0 machine precision |
ẑ ← expẑ (−λ∇f(ẑ)) | % λ > 0 step size, depends on ẑ |
END WHILE | | |
OUTPUT | ẑ | % near critical point of f |
Here, in the condition for the while loop, ‖∇f(ẑ)‖ is the Riemannian norm of the gradient ∇f(ẑ). In other words, for ẑ = (x̂, ŷ),

‖∇f(ẑ)‖ = |∇f(ẑ)| / ŷ

Just like a classical gradient descent algorithm, the above Riemannian gradient descent consists in following the direction of the negative gradient −∇f(ẑ), in order to define a new estimate. This is repeated as long as the gradient is sensibly nonzero, in the sense of the loop condition.
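A minimal sketch of this generic loop is given below, with the gradient and the exponential map passed in as functions; the constant step-size policy and all names are illustrative assumptions, not the paper's prescription.

```python
def riemannian_gradient_descent(z0, grad_f, exp_map, step=0.1, eps=1e-8, max_iter=1000):
    """Generic Riemannian gradient descent on H.

    grad_f(z): Riemannian gradient of f at z, as a complex number.
    exp_map(z, v): Riemannian exponential, as in the sketch of Section 2.2.
    The Riemannian norm of a tangent vector v at z is |v| / Im(z), by the metric (4).
    """
    z = z0
    for _ in range(max_iter):
        g = grad_f(z)
        if abs(g) / z.imag <= eps:      # ||grad f(z)|| close to machine precision
            break
        z = exp_map(z, -step*g)         # follow the direction of the negative gradient
    return z
```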
The generic algorithm described above has no guarantee of convergence. Convergence and behavior near limit points depend on the function f, on the initialization of the algorithm and on the step sizes λ. For these aspects, the reader may consult [14] (Chapter 4).
3. Riemannian Prior on the Univariate Normal Model
The current section introduces new prior distributions on the univariate normal model. These may be referred to as “Riemannian priors”, since they are entirely based on the Riemannian geometry of this model, and will also be called “Gaussian distributions”, when viewed as probability distributions on the Poincaré half plane.
Here, Section 3.1 defines in a rigorous way Gaussian distributions on H (based on the intuitive Formula (2)). A Gaussian distribution G(z̄, γ) has two parameters, z̄ ∈ H, called the center of mass, and γ > 0, called the dispersion parameter. Section 3.2 uses the Riemannian gradient descent algorithm of Section 2.3 to provide an algorithm for computing maximum likelihood estimates of z̄ and γ. Finally, Section 3.3 proves that z̄ is the Riemannian center of mass or Karcher mean of the distribution G(z̄, γ) (historically, it is more correct to speak of the "Fréchet mean", since this concept was proposed by Fréchet in 1948 [15]), and that γ is uniquely related to the mean square Rao's distance from z̄.

The reader may wish to note that the results of Section 3.3 are not used in the following, so this paragraph may be skipped on a first reading.
3.1. Gaussian Distributions on H
A Gaussian distribution G(z̄, γ) on H is a probability distribution with the following probability density function:

p(z | z̄, γ) = (1/Z(γ)) exp( −d²(z, z̄) / 2γ² )    (14)

Here, z̄ ∈ H is called the center of mass and γ > 0 the dispersion parameter of the distribution G(z̄, γ). The squared distance d²(z, z̄) refers to Rao's distance (5). The probability density function (14) is understood with respect to the Riemannian volume element dA(z). In other words, the normalization constant Z(γ) is given by:

Z(γ) = ∫_H exp( −d²(z, z̄) / 2γ² ) dA(z)    (15)
Using polar coordinates, as in Equation (12), it is possible to calculate this integral explicitly. To do so, let (r, φ) be polar coordinates whose origin is z̄. Then, d²(z, z̄) = r² when z = z(r, φ), as in Equation (9). It follows that:

exp( −d²(z, z̄) / 2γ² ) = exp( −r² / 2γ² )

According to Equation (12), the integral Z(γ) reduces to:

Z(γ) = 2π ∫_0^{+∞} exp( −r² / 2γ² ) sinh(r) dr

which is readily calculated,

Z(γ) = π √(2π) γ e^{γ²/2} erf( γ/√2 )    (16)

where erf denotes the error function.
Formula (16) completes the definition of the Gaussian distribution G(z̄, γ). This definition is the same as the one suggested in [11], with the difference that, in the present work, it has been possible to compute exactly the normalization constant Z(γ).
It is noteworthy that the normalization constant Z(γ) depends only on γ and not on z̄. This shows that the shape of the probability density function (14) does not depend on z̄, which only plays the role of a location parameter. At a deeper mathematical level, this reflects the fact that H is a homogeneous Riemannian space [13].
The probability density function (14) bears a clear resemblance to the usual Gaussian (or normal) probability density function. Indeed, both are proportional to the exponential of minus a "squared distance", but in one case, the distance is the Euclidean distance and, in the other (that of Equation (14)), it is Rao's distance.
3.2. Maximum Likelihood Estimation of z̄ and γ
Consider the problem of computing maximum likelihood estimates of the parameters z̄ and γ of the Gaussian distribution G(z̄, γ), based on independent samples z₁,...,z_N from this distribution. Given the expression (14) of the density p(z|z̄, γ), the log-likelihood function ℓ(z̄, γ) can be written,

ℓ(z̄, γ) = −N ln Z(γ) − (1/2γ²) ∑_{i=1}^{N} d²(z_i, z̄)    (17)
Since z̄ only appears in the second term, the maximum likelihood estimate of z̄, say ẑ, can be computed first. It is given by the minimization problem:

ẑ = argmin_{z ∈ H} ∑_{i=1}^{N} d²(z, z_i)    (18)

In other words, the maximum likelihood estimate ẑ minimizes the sum of squared Rao distances to the samples z_i. This exhibits ẑ as the Riemannian center of mass, also called the Karcher or Fréchet mean [16], of the samples z_i.
The notion of Riemannian center of mass is currently a widely popular one in signal and image processing, with applications ranging from blind source separation and radar signal processing [17,18] to shape and motion analysis [19,20]. The definition of Gaussian distributions, proposed in the present paper, shows how the notion of Riemannian center of mass is related to maximum likelihood estimation, thereby giving it a statistical foundation.
An original result, due to Cartan and cited in [16], states that ẑ, as defined in Equation (18), exists and is unique, since H, with the Riemannian metric (4), has constant negative curvature. Here, ẑ is computed using Riemannian gradient descent, as described in Section 2.3. The cost function f to be minimized is given by (the factor 1/N is conventional),

f(z) = (1/2N) ∑_{i=1}^{N} d²(z, z_i)    (19)
Its Riemannian gradient ∇f(z) is easily found by noting the following fact. Let f_i(z) = (1/2) d²(z, z_i). Then, the Riemannian gradient of this function is (see [21] (page 407)),

∇f_i(z) = −log_z(z_i)    (20)

where log_z : H → ℂ is the inverse of exp_z : ℂ → H. It follows from Equation (20) that,

∇f(z) = −(1/N) ∑_{i=1}^{N} log_z(z_i)    (21)

The analytic expression of log_z, for any z ∈ H, will be given below (see Equation (23)).
Here, the gradient descent algorithm for computing ẑ is described. This algorithm uses a constant step size λ, which is fixed manually.

Once the maximum likelihood estimate ẑ has been computed, using the gradient descent algorithm, the maximum likelihood estimate of γ, say γ̂, is found by solving the equation:

(1/N) ∑_{i=1}^{N} d²(ẑ, z_i) = F(γ̂),   where F(γ) = γ³ (d/dγ) ln Z(γ)    (22)
The gradient descent algorithm for computing ẑ is the following,
INPUT | {z1 ,..., zN} | % N independent samples from G(z̄, γ) |
ẑ ∈ H | % Initial guess |
WHILE | ‖∇f(ẑ) ‖ > ε | % ε ≈ 0 machine precision |
ẑ ← expẑ (−λ∇f(ẑ)) | % ∇f(ẑ) given by Equation (21) |
% step size λ is constant |
END WHILE | | |
OUTPUT | ẑ | % near Riemannian center of mass |
Application of Formula (21) requires computation of log_ẑ(z_i) for i = 1,...,N. Fortunately, this can be done analytically as follows. In general, for ẑ = (x̄, ȳ),

log_ẑ(z) = ȳ log_i( (z − x̄)/ȳ )    (23)

where log_i is found by inverting Equation (6). Precisely, for z = (x, y) with x ≠ 0, log_i(z) is obtained from the distance r = d(i, z) and the angle ψ of the geodesic joining i to z, and, for z = (0, y), log_i(z) = ln(y) i, with ln denoting the natural logarithm.
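The following self-contained sketch assembles the pieces of this section: a log map (computed from the geodesic-circle geometry, an assumed but equivalent route to the inversion formulas above), the gradient descent (21) for ẑ, and a numerical solution of Equation (22) for γ̂. Function names, the step size, tolerances and the bracket used to solve for γ̂ are illustrative choices.

```python
import cmath
import math

import numpy as np
from scipy.integrate import quad
from scipy.optimize import brentq


def rao_distance(z1, z2):
    """Rao's distance (5) between two points of H, given as complex numbers."""
    return math.acosh(1.0 + abs(z1 - z2)**2 / (2.0*z1.imag*z2.imag))


def exp_map(z, v):
    """Riemannian exponential on H (see the sketch in Section 2.2)."""
    x, y = z.real, z.imag
    u = v / y
    r = abs(u)
    if r < 1e-15:
        return complex(x, y)
    psi = cmath.phase(u) - math.pi/2
    c, s = math.cos(psi/2), math.sin(psi/2)
    w = (c*1j*math.exp(r) + s) / (-s*1j*math.exp(r) + c)
    return x + y*w


def log_map(z, w):
    """Inverse of exp_map: the tangent vector at z pointing to w, of Riemannian norm d(z, w)."""
    x, y = z.real, z.imag
    a, b = (w.real - x)/y, w.imag/y            # w seen from i, after translating and scaling z to i
    r = math.acosh(1.0 + (a**2 + (b - 1.0)**2) / (2.0*b))
    if r < 1e-15:
        return 0j
    if abs(a) < 1e-15:                          # vertical geodesic through i
        u = 1j*math.log(b)
    else:                                       # geodesic = half circle centered at (c, 0) on the real axis
        c = (a**2 + b**2 - 1.0) / (2.0*a)
        direction = math.copysign(1.0, a) * (1.0 + 1j*c) / math.sqrt(1.0 + c**2)
        u = r * direction
    return y * u                                # push the tangent vector back to z, as in Equation (23)


def karcher_mean(samples, step=0.5, eps=1e-10, max_iter=200):
    """Gradient descent of Section 3.2 for the maximum likelihood estimate z_hat, Equation (18)."""
    z = samples[0]
    for _ in range(max_iter):
        grad = -sum(log_map(z, zi) for zi in samples) / len(samples)   # Equation (21)
        if abs(grad) / z.imag <= eps:
            break
        z = exp_map(z, -step*grad)
    return z


def F(gamma):
    """F(gamma) of Equation (22): gamma^3 d/dgamma log Z(gamma), computed as a ratio of integrals."""
    g2 = 2.0*gamma**2
    w = lambda r: 0.5*(np.exp(r - r**2/g2) - np.exp(-r - r**2/g2))     # sinh(r) exp(-r^2/(2 gamma^2))
    return quad(lambda r: r**2*w(r), 0.0, np.inf)[0] / quad(w, 0.0, np.inf)[0]


def fit_riemannian_gaussian(samples):
    """Maximum likelihood estimates (z_hat, gamma_hat) of Section 3.2."""
    z_hat = karcher_mean(samples)
    mean_sq = sum(rao_distance(z_hat, zi)**2 for zi in samples) / len(samples)
    gamma_hat = brentq(lambda g: F(g) - mean_sq, 0.05, 10.0)           # solve Equation (22); bracket fits this demo
    return z_hat, gamma_hat


pts = [0.1 + 1.0j, -0.2 + 1.3j, 0.4 + 0.8j, 0.0 + 1.1j]
print(fit_riemannian_gaussian(pts))
```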
3.3. Significance of z̄ and γ
The parameters z̄ and γ of a Gaussian distribution G(z̄, γ) have been called the center of mass and the dispersion parameter. In the present paragraph, it is proven that,

z̄ = argmin_{z ∈ H} ∫_H d²(z, z′) p(z′ | z̄, γ) dA(z′)    (25)

and also that:

∫_H d²(z̄, z′) p(z′ | z̄, γ) dA(z′) = F(γ)    (26)

where F(γ) was defined in Equation (22) and p(z′ | z̄, γ) is the probability density function of G(z̄, γ), given in Equation (14).
Note that Equations (25) and (26) are asymptotic versions of Equations (18) and (22). Indeed, Equations (25) and (26) can be written:

z̄ = argmin_{z ∈ H} E_{z̄,γ}[ d²(z, z′) ]   and   E_{z̄,γ}[ d²(z̄, z′) ] = F(γ)    (27)

where E_{z̄,γ} denotes the expectation with respect to G(z̄, γ), and the expectation is carried out on the variable z′ in the first formula. Now, these two formulae are the same as Equations (18) and (22), but with the expectation instead of the empirical mean.
Note, moreover, that Equations (25) and (26) can be interpreted as follows. If z′ is distributed according to the Gaussian distribution G(z̄, γ), then Equation (25) states that z̄ is the unique point, out of all z ∈ H, which minimizes the expectation of the squared Rao's distance to z′. Moreover, Equation (26) states that the expectation of the squared Rao's distance between z̄ and z′ is equal to F(γ), so F(γ) is the least possible expected squared Rao's distance between a point z ∈ H and z′. This interpretation justifies calling z̄ the center of mass of G(z̄, γ) and shows that γ is uniquely related to the expected dispersion, as measured by squared Rao's distance, away from z̄.
In order to prove Equation (25), consider the log-likelihood function,

ℓ(z̄, γ | z) = −ln Z(γ) − d²(z, z̄) / 2γ²

Let f_z(z̄) = (1/2) d²(z, z̄). The score function with respect to z̄ is, by definition,

∇_z̄ ℓ(z̄, γ | z) = −(1/γ²) ∇_z̄ f_z(z̄)

where ∇_z̄ indicates that the Riemannian gradient (defined in Equation (13) of Section 2.3) is taken with respect to the variable z̄. Under certain regularity conditions, which are here easily verified, the expectation of the score function is identically zero,

E_{z̄,γ}[ ∇_z̄ ℓ(z̄, γ | z) ] = 0    (30)
Let f(z) be defined by:

f(z) = E_{z̄,γ}[ d²(z, z′) ]

with the expectation carried out on the variable z′. Clearly, f(z) is the expression to be minimized in Equation (25) (or in the first formula in Equation (27), which is just the same). By interchanging Riemannian gradient and expectation,

∇f(z̄) = E_{z̄,γ}[ ∇_z̄ d²(z̄, z′) ] = −2γ² E_{z̄,γ}[ ∇_z̄ ℓ(z̄, γ | z′) ] = 0

where the last equality follows from Equation (30).
It has just been proved that z̄ is a stationary point of f (a point where the gradient is zero). Theorem 2.1 in [16] states that the function f has one and only one stationary point, which is moreover a global minimizer. This concludes the proof of Equation (25).
The proof of Equation (26) follows exactly the same method, defining the score function with respect to γ and noting that its expectation is identically zero.
4. Classification of Univariate Normal Populations
The previous section studied Gaussian distributions on H, “as they stand”, focusing on the fundamental issue of maximum likelihood estimation of their parameters. The present Section considers the use of Gaussian distributions as prior distributions on the univariate normal model.
The main motivation behind the introduction of Gaussian distributions is that a Gaussian distribution G(z̄, γ) can be used to give a geometric representation of a cluster or class of univariate normal populations. Recall that each point (x, y) ∈ H is identified with a univariate normal population with mean μ = √2 x and standard deviation σ = y. The idea is that populations belonging to the same cluster, represented by G(z̄, γ), should be viewed as centered on z̄ and lying within a typical distance determined by γ.
In the remainder of this Section, it is shown how the maximum likelihood estimation algorithm of Section 3.2 can be used to fit the hyperparameters z̄ and γ to data consisting of a class {S_i; i = 1,...,K} of univariate normal populations. This is then applied to the problem of the classification of univariate normal populations. The whole development is based on marginalized likelihood estimation, as follows.

Assume each population S_i contains N_i points, S_i = {s_j; j = 1,...,N_i}, and that the points s_j within each population are drawn from a univariate normal distribution with mean μ and standard deviation σ. The focus will be on the asymptotic case where the number N_i of points in each population S_i is large.
In order to fit the hyperparameters z̄ and γ to the data, assume moreover that the distribution of z = (x, y), where x = μ/√2 and y = σ, is a Gaussian distribution G(z̄, γ). Then, the distribution of S_i can be written in integral form:

p(S_i | z̄, γ) = ∫_H p(S_i | z) p(z | z̄, γ) dA(z)    (32)

where p(z | z̄, γ) is the probability density of a Gaussian distribution G(z̄, γ), defined in Equation (14). Moreover, expressing p(S_i | z) as a product of univariate normal distributions p(s_j | z), it follows,

p(S_i | z̄, γ) = ∫_H [ ∏_{j=1}^{N_i} p(s_j | z) ] p(z | z̄, γ) dA(z)
Given the data, this expression is to be maximized over (z̄, γ). Using the Laplace approximation, this task is reduced to the maximum likelihood estimation problem addressed in Section 3.2.
The Laplace approximation will here be applied in its "basic form" [9], that is, up to terms of order 1/N_i. To do so, write each of the integrals in Equation (32), using Equation (8) of Section 2.2. These integrals then take on the form:

∫_0^{+∞} ∫_{−∞}^{+∞} [ ∏_{j=1}^{N_i} (2πy²)^{−1/2} exp( −(s_j − √2 x)² / 2y² ) ] p( (x, y) | z̄, γ ) dx dy / y²    (33)

where the univariate normal distribution p(s_j | z) has been replaced by its full expression. Now, this expression can be written ∏_{j=1}^{N_i} p(s_j | z) = exp[ −N_i h(x, y) ], where:

h(x, y) = ln( y √(2π) ) + ( B² + σ̂_i² ) / 2y²

Here, B² and σ̂_i² are the empirical bias and variance within population S_i,

B² = ( Ŝ_i − √2 x )²,   σ̂_i² = (1/N_i) ∑_{j=1}^{N_i} ( s_j − Ŝ_i )²

where Ŝ_i is the empirical mean of the population S_i.
The expression h(x, y) is minimized (equivalently, exp[−N_i h(x, y)] is maximized) when x = x̂_i and y = ŷ_i, where ẑ_i = (x̂_i, ŷ_i) is the pair of maximum likelihood estimates of the parameters (x, y), based on the population S_i.
According to the Laplace approximation, the integral Equation (33) is equal to:

(2π / N_i) | ∂²h(x̂_i, ŷ_i) |^{−1/2} exp( −N_i h(x̂_i, ŷ_i) ) p( ẑ_i | z̄, γ ) / ŷ_i²

where ∂²h(x̂_i, ŷ_i) is the matrix of second derivatives of h, and |·| denotes the determinant. Now, since N_i h(x, y) is the negative logarithm of ∏_j p(s_j | z), a direct calculation shows that ∂²h(x̂_i, ŷ_i) is the same as the Fisher information matrix derived in Section 2.1 (where it was denoted I(z)). Thus, the factor |∂²h(x̂_i, ŷ_i)|^{−1/2} is proportional to ŷ_i² and cancels out, up to a constant, with the last factor 1/ŷ_i².
Finally, the Laplace approximation of the integral Equation (33) reads:

(α / N_i) exp( −N_i h(x̂_i, ŷ_i) ) p( ẑ_i | z̄, γ )

and the resulting approximation of the distribution of S_i, as given by Equation (32), can be written:

p( S_i | z̄, γ ) ≈ (α / N_i) p( S_i | ẑ_i ) p( ẑ_i | z̄, γ )

where α is a constant, which depends neither on the data nor on the parameters, and p(ẑ_i | z̄, γ) has the expression (14).
Accepting this expression for the distribution of the data, conditionally on the hyperparameters (z̄, γ), the task of estimating these hyperparameters becomes the same as the maximum likelihood estimation problem described in Section 3.2.

In conclusion, if one assumes the populations S_i belong to a single cluster or class and wishes to fit the hyperparameters z̄ and γ of a Gaussian distribution representing this cluster, it is enough to start by computing the maximum likelihood estimates x̂_i and ŷ_i for each population S_i and then to consider these as input to the maximum likelihood estimation algorithm described in Section 3.2.
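In code, this reduction is very short: each population contributes only its maximum likelihood estimates x̂_i = (sample mean)/√2 and ŷ_i = sample standard deviation, which are then passed to the routine of Section 3.2. The sketch below assumes the fit_riemannian_gaussian function from the sketch at the end of Section 3.2; the data are synthetic and for illustration only.

```python
import numpy as np
# fit_riemannian_gaussian: the sketch given at the end of Section 3.2

def fit_class_of_populations(populations):
    """Fit (zbar, gamma) to a class of univariate normal populations (lists or arrays of samples)."""
    z_hats = []
    for S_i in populations:
        S_i = np.asarray(S_i, dtype=float)
        x_hat = S_i.mean() / np.sqrt(2.0)        # maximum likelihood estimate of x = mu / sqrt(2)
        y_hat = S_i.std()                        # maximum likelihood estimate of y = sigma
        z_hats.append(complex(x_hat, y_hat))
    return fit_riemannian_gaussian(z_hats)       # Laplace approximation: Section 3.2 applied to the z_hat_i

# illustration on synthetic data: K populations drawn around a common (mu, sigma)
rng = np.random.default_rng(0)
populations = [rng.normal(1.0 + 0.1*rng.standard_normal(),
                          0.8*np.exp(0.1*rng.standard_normal()),
                          size=2000) for _ in range(20)]
print(fit_class_of_populations(populations))
```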
The same reasoning just carried out, using the Laplace approximation, can be generalized to the problem of classification of univariate normal populations. Indeed, assume that classes L = 1,...,C, each containing some number K_L of univariate normal populations, have been identified based on some training sequence. Using the Laplace approximation and the maximum likelihood estimation approach of Section 3.2 for each one of these classes, it is possible to fit the hyperparameters (z̄_L, γ_L) of a Gaussian distribution G(z̄_L, γ_L) on H.
For a test population S_t, the maximum likelihood rule, for deciding which of the classes L this test population S_t belongs to, requires finding the following maximum:

max_{L = 1,...,C}  p( S_t | z̄_L, γ_L ) = max_{L = 1,...,C}  ∫_H p( S_t | z ) p( z | z̄_L, γ_L ) dA(z)    (35)

and assigning the test population S_t to the class with label L*. If the number of points N_t in the population S_t is large, the Laplace approximation, in the same way as used above, approximates the maximum in Equation (35) by:

max_{L = 1,...,C}  (α / N_t) p( S_t | ẑ_t ) p( ẑ_t | z̄_L, γ_L )    (36)
where ẑ_t = (x̂_t, ŷ_t) is the pair of maximum likelihood estimates computed based on the test population S_t and where p(ẑ_t | z̄_L, γ_L) is given by Equation (14). Now, since the factor (α/N_t) p(S_t | ẑ_t) does not depend on L, writing out Equation (14), the decision rule becomes:

L* = argmax_{L = 1,...,C}  [ −d²( ẑ_t, z̄_L ) / 2γ_L²  −  ln Z(γ_L) ]    (37)
Under the homoscedasticity assumption, that all of the γ_L are equal, this decision rule essentially becomes the same as the one proposed in [5], which requires S_t to be assigned to the "nearest" cluster, in terms of Rao's distance. Indeed, if all the γ_L are equal, then Equation (37) is the same as,

L* = argmin_{L = 1,...,C}  d( ẑ_t, z̄_L )    (38)

This decision rule is expected to be less efficient than the one proposed in Equation (37), which also takes into account the uncertainty associated with each cluster, as measured by its dispersion parameter γ_L.
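A compact sketch of the decision rule (37) follows; it scores a test estimate ẑ_t against each fitted class (z̄_L, γ_L) and picks the best class. The closed form of Z(γ) is the one given in Section 3.1; all names are illustrative.

```python
import math

def log_Z(gamma):
    """Logarithm of the normalization constant Z(gamma) of Section 3.1."""
    return math.log(math.pi * math.sqrt(2.0*math.pi) * gamma * math.erf(gamma/math.sqrt(2.0))) + gamma**2/2.0

def rao_distance(z1, z2):
    """Rao's distance (5), with points of H given as complex numbers."""
    return math.acosh(1.0 + abs(z1 - z2)**2 / (2.0*z1.imag*z2.imag))

def classify(z_t, classes):
    """Decision rule (37): classes is a dict mapping a label to (zbar_L, gamma_L)."""
    def score(label):
        zbar, gamma = classes[label]
        return -rao_distance(z_t, zbar)**2 / (2.0*gamma**2) - log_Z(gamma)
    return max(classes, key=score)

classes = {"grass": (0.2 + 1.1j, 0.3), "sand": (-0.5 + 0.6j, 0.8)}
print(classify(0.1 + 1.0j, classes))
```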
5. Application to Image Classification
In this section, the framework proposed in Section 4, for classification of univariate normal populations, is applied to texture image classification using Gabor filters. Several authors have found that Gabor energy features are well-suited texture descriptors. In the following, consider 24 Gabor energy sub-bands that are the result of three scales and eight orientations. Hence, each texture image can be decomposed as the collection of those 24 sub-bands. For more information concerning the implementation, the interested reader is referred to [22].
Starting from the VisTex database of 40 images [10] (these are displayed in Figure 1), each image was divided into 16 non-overlapping subimages of 128 × 128 pixels each. A training sequence was formed by choosing randomly eight subimages out of each image. To each subimage in the training sequence, a bank of 24 Gabor filters was applied. The result of applying a Gabor filter with scale s and orientation o to a subimage i belonging to an image L is a univariate normal population S_{i,s,o} of 128 × 128 points (one point for each pixel, after the filter is applied).
These populations S_{i,s,o} (called sub-bands) are considered independent, each one of them univariate normal with mean μ_{i,s,o} = √2 x_{i,s,o}, standard deviation σ_{i,s,o} = y_{i,s,o} and with z_{i,s,o} = (x_{i,s,o}, y_{i,s,o}). The pair of maximum likelihood estimates for these parameters is denoted ẑ_{i,s,o} = (x̂_{i,s,o}, ŷ_{i,s,o}). An image L (recall, there are 40 images) contains, in each sub-band, eight populations S_{i,s,o}, with which hyperparameters z̄_{L,s,o} and γ_{L,s,o} are associated, by applying the maximum likelihood estimation algorithm of Section 3.2 to the inputs ẑ_{i,s,o}.
If S_t is a test subimage, then one should begin by applying the 24 Gabor filters to it, obtaining independent univariate normal populations S_{t,s,o}, and then compute for each population the pair of maximum likelihood estimates ẑ_{t,s,o} = (x̂_{t,s,o}, ŷ_{t,s,o}). The decision rule Equation (37) of Section 4 requires that S_t should be assigned to the image L*, which realizes the maximum:

L* = argmax_{L}  ∑_{s,o} [ −d²( ẑ_{t,s,o}, z̄_{L,s,o} ) / 2γ_{L,s,o}²  −  ln Z(γ_{L,s,o}) ]    (39)
Under the homoscedasticity assumption, i.e., γ_{L,s,o} = γ_{s,o} for all L, this decision rule becomes:

L* = argmin_{L}  ∑_{s,o}  d²( ẑ_{t,s,o}, z̄_{L,s,o} ) / 2γ_{s,o}²    (40)
For this concrete application to the VisTex database, it is pertinent to compare the rate of successful classification (or overall accuracy) obtained using the Riemannian prior, based on the framework of Section 4, to that obtained using a more classical conjugate prior, i.e., a normal-inverse gamma distribution of the mean μ = √2 x and the standard deviation σ = y. This conjugate prior is given by a normal prior on the mean, conditionally on σ², together with an inverse gamma prior on σ².
Using this conjugate prior, instead of a Riemannian prior, and following the same procedure of applying the Laplace approximation, a different decision rule is obtained, where L* is taken to be the maximizer of the following expression:

where, as in Equation (39), x̂_{t,s,o} and ŷ_{t,s,o} are the maximum likelihood estimates computed for the population S_{t,s,o}.
Both the Riemannian and conjugate priors have been applied to the VisTex database, with half of the database used for training and half for testing. In the course of 100 Monte Carlo runs, a significant gain of about 3% is observed with the Riemannian prior compared to the conjugate prior. This is summarized in the following table.
Recall that the overall accuracy is the ratio of the number of successfully classified subimages to the total number of subimages. The table shows that the use of a Riemannian prior, even under a homoscedasticity assumption, yields significant improvement upon the use of a conjugate prior.
6. Conclusions
Motivated by the problem of the classification of univariate normal populations, this paper introduced a new class of prior distributions on the univariate normal model. With the univariate normal model viewed as the Poincaré half plane H, these new prior distributions, called Gaussian distributions, were meant to reflect the geometric picture (in terms of Rao's distance) that a cluster or class of univariate normal populations can be represented as having a center z̄ ∈ H and a "variance" or dispersion γ². Precisely, a Gaussian distribution G(z̄, γ) has a probability density function p(z), with respect to the Riemannian volume of the Poincaré half plane, which is proportional to exp( −d²(z, z̄) / 2γ² ). Using Gaussian distributions as prior distributions in the problem of the classification of univariate normal populations was shown to lead to a new, more general and efficient decision rule. This decision rule was implemented in a real-world application to texture image classification, where it led to significant improvement in performance, in comparison to decision rules obtained by using conjugate priors.
The general approach proposed in this paper contains several simplifications and approximations, which could be improved upon in future work. First, it is possible to use different prior distributions, which are more geometrically rich than Gaussian distributions, to represent classes of univariate normal populations. For example, it may be helpful to replace Gaussian distributions that are “isotropic”, in the sense of having a scalar dispersion parameter γ, by non-isotropic distributions, with a dispersion matrix Γ (a 2 × 2 symmetric positive definite matrix). Another possibility would be to represent each class of univariate normal populations by a finite mixture of Gaussian distributions, instead of representing it by a single Gaussian distribution.
These variants, which would allow classes with a more complex geometric structure to be taken into account, can be integrated in the general framework proposed in the paper, based on: (i) fitting a prior distribution (non-isotropic Gaussian, mixture of Gaussians) to each class; and (ii) choosing, for a test population, the most adequate class, based on a decision rule. These two steps can be realized as above, through the Laplace approximation and maximum likelihood estimation, or through alternative techniques, based on Markov chain Monte Carlo stochastic optimization.
In addition to generalizing the approach of this paper and improving its performance, a further important objective for future work will be to extend it to other parametric models, beyond univariate normal models. Indeed, there is an increasing number of parametric models (generalized Gaussian, elliptical models, etc.), whose Riemannian geometry is becoming well understood and where the present approach may be helpful.