1. Introduction
In this paper, a new class of prior distributions is introduced on the univariate normal model. The new prior distributions, which will be called Gaussian distributions, are based on the Riemannian geometry of the univariate normal model. The paper introduces these new distributions, uncovers some of their fundamental properties and applies them to the problem of the classification of univariate normal populations. It shows that, in the context of a real-life application to texture image classification, the use of these new prior distributions leads to improved performance in comparison with the use of more standard conjugate priors.
To motivate the introduction of these new prior distributions, recall some general facts about the Riemannian geometry of parametric models.
In information geometry [1], it is well known that a parametric model {p_θ; θ ∈ Θ}, where Θ ⊂ ℝ^p, can be equipped with a Riemannian geometry determined by Fisher's information matrix, say I(θ). Indeed, assuming I(θ) is strictly positive definite for each θ ∈ Θ, a Riemannian metric on Θ is defined by:

ds²(θ) = ∑_{i,j=1}^{p} I_{ij}(θ) dθ^i dθ^j    (1)
The fact that the length element Equation (1) is invariant to any change of parametrization was realized by Rao [2], who was the first to propose the application of Riemannian geometry in statistics.
Once the Riemannian metric Equation (1) is introduced, the whole machinery of Riemannian geometry becomes available for application to statistical problems relevant to the parametric model {p_θ; θ ∈ Θ}. This includes the notion of Riemannian distance between two distributions, p_θ and p_θ′, which is known as Rao's distance, say d(θ, θ′), the notion of Riemannian volume, which is exactly the same as Jeffreys prior [3], and the notion of Riemannian gradient, which can be used in numerical optimization and coincides with the so-called natural gradient of Amari [4].
It is quite natural to apply Rao's distance to the problem of classifying populations that belong to the parametric model {p_θ; θ ∈ Θ}. In the case where this parametric model is the univariate normal model, this approach to classification is implemented in [5]. For more general parametric models, beyond the univariate normal model, similar applications of Rao's distance to problems of image segmentation and statistical tests can be found in [6–8].
The idea of [5] is quite elegant. In general, it requires that classes L = 1,...,C (based on a learning sequence) have been identified with "centers" θ̄_L ∈ Θ. Then, in order to assign a test population, given by the parameter θ_t, to a class L*, it is proposed to choose the L* which minimizes the squared Rao's distance d²(θ_t, θ̄_L) over L = 1,...,C. In the specific context of the classification of univariate normal populations [5], this leads to the introduction of hyperbolic Voronoi diagrams.
The present paper is also concerned with the case where the parametric model {p_θ; θ ∈ Θ} is a univariate normal model. It starts from the idea that a class L should be identified not only with a center θ̄_L, as in [5], but also with a kind of "variance", say γ², which will be called a dispersion parameter. Accordingly, assigning a test population given by the parameter θ_t to a class L should be based on a tradeoff between the square of Rao's distance d²(θ_t, θ̄_L) and the dispersion parameter γ².
Of course, this idea has a strong Bayesian flavor. It proposes to give more "confidence" to classes that have a smaller dispersion parameter. Thus, in order to implement it in a concrete way, the paper starts by introducing prior distributions on the univariate normal model, which it calls Gaussian distributions. By definition, a Gaussian distribution G(θ̄, γ²) has a probability density function, with respect to Riemannian volume, given by:

p(θ | θ̄, γ²) ∝ exp( −d²(θ, θ̄) / 2γ² )    (2)

Given this definition of a Gaussian distribution (which is developed in a detailed way in Section 3), classification of univariate normal populations can be carried out by associating to each class L of univariate normal populations a Gaussian distribution G(θ̄_L, γ_L²) and by assigning any test population with parameter θ_t to the class L*, which maximizes the likelihood p(θ_t | θ̄_L, γ_L), over L = 1,...,C.
The present paper develops in a rigorous way the general approach to the classification of univariate normal populations, which has just been described. It proceeds as follows.
Section 2, which is basically self-contained, provides the concepts, regarding the Riemannian geometry of the univariate normal model, which will be used throughout the paper.
Section 3 introduces Gaussian distributions on the univariate normal model and uncovers some of their general properties. In particular, Section 3.2 gives a Riemannian gradient descent algorithm for computing maximum likelihood estimates of the parameters θ̄ and γ of a Gaussian distribution.
Section 4 states the general approach to classification of univariate normal populations proposed in this paper. It deals with two problems: (i) given a class of univariate normal populations {S_i}, how to fit a Gaussian distribution G(z̄, γ) to this class; and (ii) given a test univariate normal population S_t and a set of classes L = 1,...,C, how to assign S_t to a suitable class L*.
In the present paper, the chosen approach for resolving these two problems is marginalized likelihood estimation, in the asymptotic framework where each univariate normal population contains a large number of data points. In this asymptotic framework, the Laplace approximation plays a major role [9]. In particular, it reduces the first problem, of fitting a Gaussian distribution to a class of univariate normal populations, to the problem of maximum likelihood estimation covered in Section 3.2.
The final result of Section 4 is the decision rule Equation (37). This generalizes the one developed in [5] and already explained above, by taking into account the dispersion parameter γ, in addition to the center θ̄, for each class.
In Section 5, the formalism of Section 4 is applied to texture image classification, using the VisTex image database [10]. This database is used to compare the performance obtained using Gaussian distributions, as in Section 4, to that obtained using conjugate prior distributions. It is shown that the Gaussian distributions proposed in the current paper lead to a significant improvement in performance.
Before going on, it should be noted that probability density functions of the form (2), on general Riemannian manifolds, were considered by Pennec in [11]. However, they were not specifically used as prior distributions, but rather as a representation of uncertainty in medical image analysis and directional or shape statistics.
2. Riemannian Geometry of the Univariate Normal Model
The current section presents in a self-contained way the results on the Riemannian geometry of the univariate normal model, which are required for the remainder of the paper. Section 2.1 recalls the fact that the univariate normal model can be reparametrized, so that its Riemannian geometry is essentially the same as that of the Poincaré upper half plane. Section 2.2 uses this fact to give analytic formulas for distance, geodesics and integration on the univariate normal model. Finally, Section 2.3 presents, in general form, the Riemannian gradient descent algorithm.
2.1. Derivation of the Fisher Metric
This paper considers the Riemannian geometry of the univariate normal model, as based on the Fisher metric (1). To be precise, the univariate normal model has a two-dimensional parameter space Θ = {θ = (μ, σ) | μ ∈ ℝ, σ > 0}, and is given by:

p_θ(s) = (2πσ²)^{−1/2} exp( −(s − μ)² / 2σ² ),   s ∈ ℝ    (3)

where each p_θ is a probability density function with respect to the Lebesgue measure on ℝ. The Fisher information matrix, obtained from Equation (3), is the following:

I(θ) = diag( 1/σ², 2/σ² )
As in [12], this expression can be made more symmetric by introducing the parametrization z = (x, y), where x = μ/√2 and y = σ. This yields the Fisher information matrix:

I(z) = (2/y²) diag( 1, 1 )
It is suitable to drop the factor two in this expression and introduce the following Riemannian metric for the univariate normal model,

ds²(z) = ( dx² + dy² ) / y²    (4)

This is essentially the same as the Fisher metric (up to the factor two) and will be considered throughout the following. The resulting Rao's distance and Riemannian geometry are given in the following paragraph.
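As a quick sanity check of the expressions above, the following sketch estimates the Fisher information matrix of the (μ, σ) parametrization by Monte Carlo, using the fact that I(θ) is the covariance of the score. The numerical values of μ and σ are arbitrary; this is only an illustration, not part of the paper's method.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 1.3, 0.7
s = rng.normal(mu, sigma, size=1_000_000)

# score of the univariate normal model (3): derivatives of log p_theta(s)
score_mu = (s - mu) / sigma**2
score_sigma = -1.0/sigma + (s - mu)**2 / sigma**3

scores = np.stack([score_mu, score_sigma])
I_hat = scores @ scores.T / s.size     # empirical Fisher information matrix
print(I_hat)                           # approximately diag(1/sigma^2, 2/sigma^2)
```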
2.2. Distance, Geodesics and Volume
The Riemannian metric (4), obtained in the last paragraph, happens to be a very well-known object in differential geometry. Precisely, the parameter space H = {z = (x, y) | y > 0} equipped with the metric (4) is known as the Poincaré upper half plane and is a basic model of a two-dimensional hyperbolic space [13].
Rao's distance between two points z₁ = (x₁, y₁) and z₂ = (x₂, y₂) in H can be expressed as follows (for results in the present paragraph, see [13], or any suitable reference on hyperbolic geometry),

d(z₁, z₂) = acosh( 1 + [ (x₁ − x₂)² + (y₁ − y₂)² ] / (2 y₁ y₂) )    (5)

where acosh denotes the inverse hyperbolic cosine.
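A minimal numerical sketch of Equation (5) follows, together with the identification x = μ/√2, y = σ from Section 2.1; the function names are illustrative only.

```python
import numpy as np

def rao_distance(z1, z2):
    """Rao's distance (5) on the Poincare upper half plane, for the metric (4)."""
    x1, y1 = z1
    x2, y2 = z2
    return np.arccosh(1.0 + ((x1 - x2)**2 + (y1 - y2)**2) / (2.0*y1*y2))

def rao_distance_normal(mu1, sigma1, mu2, sigma2):
    """Same distance, between the populations N(mu1, sigma1^2) and N(mu2, sigma2^2)."""
    return rao_distance((mu1/np.sqrt(2.0), sigma1), (mu2/np.sqrt(2.0), sigma2))

print(rao_distance_normal(0.0, 1.0, 1.0, 1.0))   # two unit-variance populations, means 0 and 1
```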
Starting from z1, in any given direction, it is possible to draw a unique geodesic ray γ : R+ → H. This is a curve having the property that γ(0) = z1 and, for any t ∈ R+, if γ(t) = z2 then d(z1, z2) = t. In other words, the length of γ between z1 and z2 is equal to the distance between z1 and z2.
The equation of a geodesic ray starting from z ∈ H is conveniently written down in complex notation (that is, by treating points of H as complex numbers). To begin, consider the case of z = i (which stands for x = 0 and y = 1). The geodesic in the direction making an angle ψ with the y-axis is the curve,

γ_i(t, ψ) = [ cos(ψ/2) i e^t + sin(ψ/2) ] / [ −sin(ψ/2) i e^t + cos(ψ/2) ]    (6)

In particular, ψ = 0 gives γ_i(t) = e^t i and ψ = π gives γ_i(t) = e^{−t} i. If ψ is not a multiple of π, γ_i(t) traces out a portion of a circle, which becomes parallel to the y-axis in the limit t → ∞.

For a general starting point z, the geodesic ray in the direction making an angle ψ with the y-axis can be written:

γ_z(t, ψ) = x + y γ_i(t, ψ)    (7)

where z = (x, y) and γ_i(t, ψ) is given by Equation (6). A more detailed treatment of Rao's distance (5) and of geodesics in the Poincaré upper half plane, along with applications in image clustering, can be found in [5].
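The exponential map that appears in the following subsections (and in the gradient descent algorithms of Sections 2.3 and 3.2) can be implemented directly from this description of geodesics. The sketch below is an assumed implementation: it takes a tangent vector v at z, identified with a complex number, and follows the geodesic with initial velocity v for a unit time; the rotation convention for the angle ψ is a choice of this sketch and may differ from Equation (6) by a sign.

```python
import cmath
import math

def exp_map(z, v):
    """Riemannian exponential on H for the metric (4).

    z: point of H, as a complex number with Im(z) > 0.
    v: tangent vector at z, identified with a complex number; its Riemannian
       norm is |v| / Im(z), which is also the Rao distance d(z, exp_map(z, v)).
    """
    x, y = z.real, z.imag
    u = v / y                          # bring the tangent vector back to the point i
    r = abs(u)                         # geodesic distance to travel
    if r < 1e-15:
        return complex(x, y)
    psi = cmath.phase(u) - math.pi/2   # angle between the initial direction and the y-axis
    c, s = math.cos(psi/2), math.sin(psi/2)
    w = (c*1j*math.exp(r) + s) / (-s*1j*math.exp(r) + c)   # geodesic issued from i, cf. Equation (6)
    return x + y*w                     # geodesic issued from z, cf. Equation (7)
```

For instance, exp_map(1j, 0.7j) returns approximately i e^{0.7}, the point on the vertical geodesic of the previous paragraph at distance 0.7 above i.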
The Riemannian volume (or area, since H is of dimension 2) element corresponding to the Riemannian metric (4) is dA(z) = dx dy / y². Accordingly, the integral of a function f : H → ℝ with respect to dA is given by:

∫_H f dA = ∫_0^{+∞} ∫_{−∞}^{+∞} f(x, y) dx dy / y²    (8)
In many cases, the analytic computation of this integral can be greatly simplified by using polar coordinates (r, φ) defined with respect to some "origin" z̄ ∈ H. Polar coordinates (r, φ) map to the point z(r, φ) given by:

z(r, φ) = γ_z̄(r, φ)    (9)

where the right-hand side is defined according to Equation (7). The polar coordinates (r, φ) do indeed define a global coordinate system of H, in the sense that the map that takes a complex number r e^{iφ} to the point z(r, φ) in H is a diffeomorphism. The standard notation from differential geometry is:

z(r, φ) = exp_z̄( r e^{iφ} )    (10)
In these coordinates, the Riemannian metric (4) takes on the form:

ds² = dr² + sinh²(r) dφ²    (11)

The integral Equation (8) can be computed in polar coordinates using the formula [13],

∫_H f dA = ∫_0^{2π} ∫_0^{+∞} ( f ∘ exp_z̄ )( r e^{iφ} ) sinh(r) dr dφ    (12)

where exp_z̄ was defined in Equation (10) and ∘ denotes composition. This is particularly useful when f ∘ exp_z̄ does not depend on φ.
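The following sketch checks Equation (12) numerically on a radially symmetric function centered at z̄ = i, namely f(z) = exp(−d²(z, i)/2): the Cartesian integral (8) and the polar-coordinate integral (12) should agree. The truncation of the Cartesian integral to a finite box is only a numerical convenience.

```python
import numpy as np
from scipy.integrate import dblquad, quad

def f(x, y):
    """Radially symmetric test function around i: exp(-d^2(z, i)/2), with d from Equation (5)."""
    d2 = np.arccosh(1.0 + (x**2 + (y - 1.0)**2) / (2.0*y))**2
    return np.exp(-d2/2.0)

# Equation (8): integral over H with area element dx dy / y^2 (truncated to a large box; tails are negligible)
lhs, _ = dblquad(lambda y, x: f(x, y)/y**2, -30.0, 30.0, 1e-3, 200.0)

# Equation (12): polar coordinates around i; f depends only on r, so the phi integral gives 2*pi.
# sinh(r) * exp(-r^2/2) is written in a numerically stable form.
integrand = lambda r: 0.5*(np.exp(r - r**2/2.0) - np.exp(-r - r**2/2.0))
rhs = 2.0*np.pi*quad(integrand, 0.0, np.inf)[0]

print(lhs, rhs)   # both approximately 8.86
```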
2.3. Riemannian Gradient Descent
In this paper, the problem of minimizing, or maximizing, a differentiable function f : H → ℝ will play a central role. A popular way of handling the minimization of a differentiable function defined on a Riemannian manifold (such as H) is through Riemannian gradient descent [14].
Here, the definition of the Riemannian gradient is reviewed, and a generic description of Riemannian gradient descent is provided. The Riemannian gradient of f is here defined as a mapping ∇f : H → ℂ with the following property:

for any complex number h, where Re denotes the real part, * denotes conjugation and df is the "derivative", df = (∂f/∂x) + (∂f/∂y) i. For example, if f(z) = y, it follows from Equation (13) that ∇f(z) = y².
Riemannian gradient descent consists in following the direction of −∇f at each step, with the length of the step (in other words, the step size) being determined by the user. The generic algorithm is, up to some variations, the following:
INPUT | ẑ ∈ H | % Initial guess |
WHILE | ‖∇f(ẑ)‖ > ε | % ε ≈ 0 machine precision |
ẑ ← expẑ (−λ∇f(ẑ)) | % λ > 0 step size, depends on ẑ |
END WHILE | | |
OUTPUT | ẑ | % near critical point of f |
Here, in the condition for the while loop, ‖∇f(ẑ)‖ is the Riemannian norm of the gradient ∇f(ẑ). In other words, for ẑ = (x̂, ŷ),

‖∇f(ẑ)‖ = |∇f(ẑ)| / ŷ

Just like a classical gradient descent algorithm, the above Riemannian gradient descent consists in following the direction of the negative gradient −∇f(ẑ), in order to define a new estimate. This is repeated as long as the gradient is sensibly nonzero, in the sense of the loop condition.
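A minimal sketch of this generic loop is given below, with the gradient and the exponential map passed in as functions; the constant step-size policy and all names are illustrative assumptions, not the paper's prescription.

```python
def riemannian_gradient_descent(z0, grad_f, exp_map, step=0.1, eps=1e-8, max_iter=1000):
    """Generic Riemannian gradient descent on H.

    grad_f(z): Riemannian gradient of f at z, as a complex number.
    exp_map(z, v): Riemannian exponential, as in the sketch of Section 2.2.
    The Riemannian norm of a tangent vector v at z is |v| / Im(z), by the metric (4).
    """
    z = z0
    for _ in range(max_iter):
        g = grad_f(z)
        if abs(g) / z.imag <= eps:      # ||grad f(z)|| close to machine precision
            break
        z = exp_map(z, -step*g)         # follow the direction of the negative gradient
    return z
```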
The generic algorithm described above has no guarantee of convergence. Convergence and behavior near limit points depend on the function f, on the initialization of the algorithm and on the step sizes λ. For these aspects, the reader may consult [14] (Chapter 4).
3. Riemannian Prior on the Univariate Normal Model
The current section introduces new prior distributions on the univariate normal model. These may be referred to as “Riemannian priors”, since they are entirely based on the Riemannian geometry of this model, and will also be called “Gaussian distributions”, when viewed as probability distributions on the Poincaré half plane.
Here, Section 3.1 defines in a rigorous way Gaussian distributions on H (based on the intuitive Formula (2)). A Gaussian distribution G(z̄, γ) has two parameters, z̄ ∈ H, called the center of mass, and γ > 0, called the dispersion parameter. Section 3.2 uses the Riemannian gradient descent algorithm of Section 2.3 to provide an algorithm for computing maximum likelihood estimates of z̄ and γ. Finally, Section 3.3 proves that z̄ is the Riemannian center of mass or Karcher mean of the distribution G(z̄, γ) (historically, it is more correct to speak of the "Fréchet mean", since this concept was proposed by Fréchet in 1948 [15]), and that γ is uniquely related to the mean square Rao's distance from z̄.

The reader may wish to note that the results of Section 3.3 are not used in the following, so this paragraph may be skipped on a first reading.
3.1. Gaussian Distributions on H
A Gaussian distribution G(z̄, γ) on H is a probability distribution with the following probability density function:

p(z | z̄, γ) = (1/Z(γ)) exp( −d²(z, z̄) / 2γ² )    (14)

Here, z̄ ∈ H is called the center of mass and γ > 0 the dispersion parameter of the distribution G(z̄, γ). The squared distance d²(z, z̄) refers to Rao's distance (5). The probability density function (14) is understood with respect to the Riemannian volume element dA(z). In other words, the normalization constant Z(γ) is given by:

Z(γ) = ∫_H exp( −d²(z, z̄) / 2γ² ) dA(z)    (15)
Using polar coordinates, as in Equation (12), it is possible to calculate this integral explicitly. To do so, let (r, φ) be polar coordinates whose origin is z̄. Then, d²(z, z̄) = r² when z = z(r, φ), as in Equation (9). It follows that:

exp( −d²(z, z̄) / 2γ² ) = exp( −r² / 2γ² )

According to Equation (12), the integral Z(γ) reduces to:

Z(γ) = 2π ∫_0^{+∞} exp( −r² / 2γ² ) sinh(r) dr

which is readily calculated,

Z(γ) = π √(2π) γ e^{γ²/2} erf( γ/√2 )    (16)

where erf denotes the error function.
Formula (16) completes the definition of the Gaussian distribution G(z̄, γ). This definition is the same as the one suggested in [11], with the difference that, in the present work, it has been possible to compute exactly the normalization constant Z(γ).
It is noteworthy that the normalization constant Z(γ) depends only on γ and not on z̄. This shows that the shape of the probability density function (14) does not depend on z̄, which only plays the role of a location parameter. At a deeper mathematical level, this reflects the fact that H is a homogeneous Riemannian space [13].
The probability density function (14) bears a clear resemblance to the usual Gaussian (or normal) probability density function. Indeed, both are proportional to the exponential of minus a "squared distance", but in one case, the distance is the Euclidean distance and, in the other (that of Equation (14)), it is Rao's distance.
3.2. Maximum Likelihood Estimation of z̄ and γ
Consider the problem of computing maximum likelihood estimates of the parameters z̄ and γ of the Gaussian distribution G(z̄, γ), based on independent samples z₁,...,z_N from this distribution. Given the expression (14) of the density p(z|z̄, γ), the log-likelihood function ℓ(z̄, γ) can be written,

ℓ(z̄, γ) = −N ln Z(γ) − (1/2γ²) ∑_{i=1}^{N} d²(z_i, z̄)    (17)
Since z̄ only appears in the second term, the maximum likelihood estimate of z̄, say ẑ, can be computed first. It is given by the minimization problem:

ẑ = argmin_{z ∈ H} ∑_{i=1}^{N} d²(z, z_i)    (18)

In other words, the maximum likelihood estimate ẑ minimizes the sum of squared Rao distances to the samples z_i. This exhibits ẑ as the Riemannian center of mass, also called the Karcher or Fréchet mean [16], of the samples z_i.
The notion of Riemannian center of mass is currently a widely popular one in signal and image processing, with applications ranging from blind source separation and radar signal processing [17,18] to shape and motion analysis [19,20]. The definition of Gaussian distributions, proposed in the present paper, shows how the notion of Riemannian center of mass is related to maximum likelihood estimation, thereby giving it a statistical foundation.
An original result, due to Cartan and cited in [16], states that ẑ, as defined in Equation (18), exists and is unique, since H, with the Riemannian metric (4), has constant negative curvature. Here, ẑ is computed using Riemannian gradient descent, as described in Section 2.3. The cost function f to be minimized is given by (the factor 1/N is conventional),

f(z) = (1/2N) ∑_{i=1}^{N} d²(z, z_i)    (19)
Its Riemannian gradient ∇f(z) is easily found by noting the following fact. Let f_i(z) = (1/2) d²(z, z_i). Then, the Riemannian gradient of this function is (see [21] (page 407)),

∇f_i(z) = −log_z(z_i)    (20)

where log_z : H → ℂ is the inverse of exp_z : ℂ → H. It follows from Equation (20) that,

∇f(z) = −(1/N) ∑_{i=1}^{N} log_z(z_i)    (21)

The analytic expression of log_z, for any z ∈ H, will be given below (see Equation (23)).
Here, the gradient descent algorithm for computing ẑ is described. This algorithm uses a constant step size λ, which is fixed manually.

Once the maximum likelihood estimate ẑ has been computed, using the gradient descent algorithm, the maximum likelihood estimate of γ, say γ̂, is found by solving the equation:

(1/N) ∑_{i=1}^{N} d²(ẑ, z_i) = F(γ̂),   where F(γ) = γ³ (d/dγ) ln Z(γ)    (22)
The gradient descent algorithm for computing ẑ is the following,
INPUT | {z1 ,..., zN} | % N independent samples from G(z̄, γ) |
ẑ ∈ H | % Initial guess |
WHILE | ‖∇f(ẑ) ‖ > ε | % ε ≈ 0 machine precision |
ẑ ← expẑ (−λ∇f(ẑ)) | % ∇f(ẑ) given by Equation (21) |
% step size λ is constant |
END WHILE | | |
OUTPUT | ẑ | % near Riemannian center of mass |
Application of Formula (21) requires computation of log_ẑ(z_i) for i = 1,...,N. Fortunately, this can be done analytically as follows. In general, for ẑ = (x̄, ȳ),

log_ẑ(z) = ȳ log_i( (z − x̄)/ȳ )    (23)

where log_i is found by inverting Equation (6). Precisely, for z = (x, y) with x ≠ 0, log_i(z) is obtained from the distance r = d(i, z) and the angle ψ of the geodesic joining i to z, and, for z = (0, y), log_i(z) = ln(y) i, with ln denoting the natural logarithm.
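The following self-contained sketch assembles the pieces of this section: a log map (computed from the geodesic-circle geometry, an assumed but equivalent route to the inversion formulas above), the gradient descent (21) for ẑ, and a numerical solution of Equation (22) for γ̂. Function names, the step size, tolerances and the bracket used to solve for γ̂ are illustrative choices.

```python
import cmath
import math

import numpy as np
from scipy.integrate import quad
from scipy.optimize import brentq


def rao_distance(z1, z2):
    """Rao's distance (5) between two points of H, given as complex numbers."""
    return math.acosh(1.0 + abs(z1 - z2)**2 / (2.0*z1.imag*z2.imag))


def exp_map(z, v):
    """Riemannian exponential on H (see the sketch in Section 2.2)."""
    x, y = z.real, z.imag
    u = v / y
    r = abs(u)
    if r < 1e-15:
        return complex(x, y)
    psi = cmath.phase(u) - math.pi/2
    c, s = math.cos(psi/2), math.sin(psi/2)
    w = (c*1j*math.exp(r) + s) / (-s*1j*math.exp(r) + c)
    return x + y*w


def log_map(z, w):
    """Inverse of exp_map: the tangent vector at z pointing to w, of Riemannian norm d(z, w)."""
    x, y = z.real, z.imag
    a, b = (w.real - x)/y, w.imag/y            # w seen from i, after translating and scaling z to i
    r = math.acosh(1.0 + (a**2 + (b - 1.0)**2) / (2.0*b))
    if r < 1e-15:
        return 0j
    if abs(a) < 1e-15:                          # vertical geodesic through i
        u = 1j*math.log(b)
    else:                                       # geodesic = half circle centered at (c, 0) on the real axis
        c = (a**2 + b**2 - 1.0) / (2.0*a)
        direction = math.copysign(1.0, a) * (1.0 + 1j*c) / math.sqrt(1.0 + c**2)
        u = r * direction
    return y * u                                # push the tangent vector back to z, as in Equation (23)


def karcher_mean(samples, step=0.5, eps=1e-10, max_iter=200):
    """Gradient descent of Section 3.2 for the maximum likelihood estimate z_hat, Equation (18)."""
    z = samples[0]
    for _ in range(max_iter):
        grad = -sum(log_map(z, zi) for zi in samples) / len(samples)   # Equation (21)
        if abs(grad) / z.imag <= eps:
            break
        z = exp_map(z, -step*grad)
    return z


def F(gamma):
    """F(gamma) of Equation (22): gamma^3 d/dgamma log Z(gamma), computed as a ratio of integrals."""
    g2 = 2.0*gamma**2
    w = lambda r: 0.5*(np.exp(r - r**2/g2) - np.exp(-r - r**2/g2))     # sinh(r) exp(-r^2/(2 gamma^2))
    return quad(lambda r: r**2*w(r), 0.0, np.inf)[0] / quad(w, 0.0, np.inf)[0]


def fit_riemannian_gaussian(samples):
    """Maximum likelihood estimates (z_hat, gamma_hat) of Section 3.2."""
    z_hat = karcher_mean(samples)
    mean_sq = sum(rao_distance(z_hat, zi)**2 for zi in samples) / len(samples)
    gamma_hat = brentq(lambda g: F(g) - mean_sq, 0.05, 10.0)           # solve Equation (22); bracket fits this demo
    return z_hat, gamma_hat


pts = [0.1 + 1.0j, -0.2 + 1.3j, 0.4 + 0.8j, 0.0 + 1.1j]
print(fit_riemannian_gaussian(pts))
```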
3.3. Significance of z̄ and γ
The parameters z̄ and γ of a Gaussian distribution G(z̄, γ) have been called the center of mass and the dispersion parameter. In the present paragraph, it is proven that,

z̄ = argmin_{z ∈ H} ∫_H d²(z, z′) p(z′ | z̄, γ) dA(z′)    (25)

and also that:

∫_H d²(z̄, z′) p(z′ | z̄, γ) dA(z′) = F(γ)    (26)

where F(γ) was defined in Equation (22) and p(z′ | z̄, γ) is the probability density function of G(z̄, γ), given in Equation (14).
Note that Equations (25) and (26) are asymptotic versions of Equations (18) and (22). Indeed, Equations (25) and (26) can be written:

z̄ = argmin_{z ∈ H} E_{z̄,γ}[ d²(z, z′) ]   and   E_{z̄,γ}[ d²(z̄, z′) ] = F(γ)    (27)

where E_{z̄,γ} denotes the expectation with respect to G(z̄, γ), and the expectation is carried out on the variable z′ in the first formula. Now, these two formulae are the same as Equations (18) and (22), but with the expectation instead of the empirical mean.
Note, moreover, that Equations (25) and (26) can be interpreted as follows. If z′ is distributed according to the Gaussian distribution G(z̄, γ), then Equation (25) states that z̄ is the unique point, out of all z ∈ H, which minimizes the expectation of the squared Rao's distance to z′. Moreover, Equation (26) states that the expectation of the squared Rao's distance between z̄ and z′ is equal to F(γ), so F(γ) is the least possible expected squared Rao's distance between a point z ∈ H and z′. This interpretation justifies calling z̄ the center of mass of G(z̄, γ) and shows that γ is uniquely related to the expected dispersion, as measured by squared Rao's distance, away from z̄.
In order to prove Equation (25), consider the log-likelihood function,

ℓ(z̄, γ | z) = −ln Z(γ) − d²(z, z̄) / 2γ²

Let f_z(z̄) = (1/2) d²(z, z̄). The score function with respect to z̄ is, by definition,

∇_z̄ ℓ(z̄, γ | z) = −(1/γ²) ∇_z̄ f_z(z̄)

where ∇_z̄ indicates that the Riemannian gradient (defined in Equation (13) of Section 2.3) is taken with respect to the variable z̄. Under certain regularity conditions, which are here easily verified, the expectation of the score function is identically zero,

E_{z̄,γ}[ ∇_z̄ ℓ(z̄, γ | z) ] = 0    (30)
Let f(z) be defined by:

f(z) = E_{z̄,γ}[ d²(z, z′) ]

with the expectation carried out on the variable z′. Clearly, f(z) is the expression to be minimized in Equation (25) (or in the first formula in Equation (27), which is just the same). By interchanging Riemannian gradient and expectation,

∇f(z̄) = E_{z̄,γ}[ ∇_z̄ d²(z̄, z′) ] = −2γ² E_{z̄,γ}[ ∇_z̄ ℓ(z̄, γ | z′) ] = 0

where the last equality follows from Equation (30).
It has just been proved that z̄ is a stationary point of f (a point where the gradient is zero). Theorem 2.1 in [16] states that the function f has one and only one stationary point, which is moreover a global minimizer. This concludes the proof of Equation (25).
The proof of Equation (26) follows exactly the same method, defining the score function with respect to γ and noting that its expectation is identically zero.
4. Classification of Univariate Normal Populations
The previous section studied Gaussian distributions on H, “as they stand”, focusing on the fundamental issue of maximum likelihood estimation of their parameters. The present Section considers the use of Gaussian distributions as prior distributions on the univariate normal model.
The main motivation behind the introduction of Gaussian distributions is that a Gaussian distribution G(z̄, γ) can be used to give a geometric representation of a cluster or class of univariate normal populations. Recall that each point (x, y) ∈ H is identified with a univariate normal population with mean μ = √2 x and standard deviation σ = y. The idea is that populations belonging to the same cluster, represented by G(z̄, γ), should be viewed as centered on z̄ and lying within a typical distance determined by γ.
In the remainder of this Section, it is shown how the maximum likelihood estimation algorithm of Section 3.2 can be used to fit the hyperparameters z̄ and γ to data consisting of a class {S_i; i = 1,...,K} of univariate normal populations. This is then applied to the problem of the classification of univariate normal populations. The whole development is based on marginalized likelihood estimation, as follows.

Assume each population S_i contains N_i points, S_i = {s_j; j = 1,...,N_i}, and that the points s_j within each population are drawn from a univariate normal distribution with mean μ and standard deviation σ. The focus will be on the asymptotic case where the number N_i of points in each population S_i is large.
In order to fit the hyperparameters z̄ and γ to the data, assume moreover that the distribution of z = (x, y), where x = μ/√2 and y = σ, is a Gaussian distribution G(z̄, γ). Then, the distribution of S_i can be written in integral form:

p(S_i | z̄, γ) = ∫_H p(S_i | z) p(z | z̄, γ) dA(z)    (32)

where p(z | z̄, γ) is the probability density of a Gaussian distribution G(z̄, γ), defined in Equation (14). Moreover, expressing p(S_i | z) as a product of univariate normal distributions p(s_j | z), it follows,

p(S_i | z̄, γ) = ∫_H [ ∏_{j=1}^{N_i} p(s_j | z) ] p(z | z̄, γ) dA(z)
Given the data, this expression is to be maximized over (z̄, γ). Using the Laplace approximation, this task is reduced to the maximum likelihood estimation problem addressed in Section 3.2.
The Laplace approximation will here be applied in its "basic form" [9], that is, up to terms of order 1/N_i. To do so, write each of the integrals in Equation (32), using Equation (8) of Section 2.2. These integrals then take on the form:

∫_0^{+∞} ∫_{−∞}^{+∞} [ ∏_{j=1}^{N_i} (2πy²)^{−1/2} exp( −(s_j − √2 x)² / 2y² ) ] p( (x, y) | z̄, γ ) dx dy / y²    (33)

where the univariate normal distribution p(s_j | z) has been replaced by its full expression. Now, this expression can be written ∏_{j=1}^{N_i} p(s_j | z) = exp[ −N_i h(x, y) ], where:

h(x, y) = ln( y √(2π) ) + ( B² + σ̂_i² ) / 2y²

Here, B² and σ̂_i² are the empirical bias and variance within population S_i,

B² = ( Ŝ_i − √2 x )²,   σ̂_i² = (1/N_i) ∑_{j=1}^{N_i} ( s_j − Ŝ_i )²

where Ŝ_i is the empirical mean of the population S_i.
The expression h(x, y) is minimized (equivalently, exp[−N_i h(x, y)] is maximized) when x = x̂_i and y = ŷ_i, where ẑ_i = (x̂_i, ŷ_i) is the pair of maximum likelihood estimates of the parameters (x, y), based on the population S_i.
According to the Laplace approximation, the integral Equation (33) is equal to:

(2π / N_i) | ∂²h(x̂_i, ŷ_i) |^{−1/2} exp( −N_i h(x̂_i, ŷ_i) ) p( ẑ_i | z̄, γ ) / ŷ_i²

where ∂²h(x̂_i, ŷ_i) is the matrix of second derivatives of h, and |·| denotes the determinant. Now, since N_i h(x, y) is the negative logarithm of ∏_j p(s_j | z), a direct calculation shows that ∂²h(x̂_i, ŷ_i) is the same as the Fisher information matrix derived in Section 2.1 (where it was denoted I(z)). Thus, the factor |∂²h(x̂_i, ŷ_i)|^{−1/2} is proportional to ŷ_i² and cancels out, up to a constant, with the last factor 1/ŷ_i².
Finally, the Laplace approximation of the integral Equation (33) reads:

(α / N_i) exp( −N_i h(x̂_i, ŷ_i) ) p( ẑ_i | z̄, γ )

and the resulting approximation of the distribution of S_i, as given by Equation (32), can be written:

p( S_i | z̄, γ ) ≈ (α / N_i) p( S_i | ẑ_i ) p( ẑ_i | z̄, γ )

where α is a constant, which depends neither on the data nor on the parameters, and p(ẑ_i | z̄, γ) has the expression (14).
Accepting this expression for the distribution of the data, conditionally on the hyperparameters (z̄, γ), the task of estimating these hyperparameters becomes the same as the maximum likelihood estimation problem described in Section 3.2.

In conclusion, if one assumes the populations S_i belong to a single cluster or class and wishes to fit the hyperparameters z̄ and γ of a Gaussian distribution representing this cluster, it is enough to start by computing the maximum likelihood estimates x̂_i and ŷ_i for each population S_i and then to consider these as input to the maximum likelihood estimation algorithm described in Section 3.2.
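In code, this reduction is very short: each population contributes only its maximum likelihood estimates x̂_i = (sample mean)/√2 and ŷ_i = sample standard deviation, which are then passed to the routine of Section 3.2. The sketch below assumes the fit_riemannian_gaussian function from the sketch at the end of Section 3.2; the data are synthetic and for illustration only.

```python
import numpy as np
# fit_riemannian_gaussian: the sketch given at the end of Section 3.2

def fit_class_of_populations(populations):
    """Fit (zbar, gamma) to a class of univariate normal populations (lists or arrays of samples)."""
    z_hats = []
    for S_i in populations:
        S_i = np.asarray(S_i, dtype=float)
        x_hat = S_i.mean() / np.sqrt(2.0)        # maximum likelihood estimate of x = mu / sqrt(2)
        y_hat = S_i.std()                        # maximum likelihood estimate of y = sigma
        z_hats.append(complex(x_hat, y_hat))
    return fit_riemannian_gaussian(z_hats)       # Laplace approximation: Section 3.2 applied to the z_hat_i

# illustration on synthetic data: K populations drawn around a common (mu, sigma)
rng = np.random.default_rng(0)
populations = [rng.normal(1.0 + 0.1*rng.standard_normal(),
                          0.8*np.exp(0.1*rng.standard_normal()),
                          size=2000) for _ in range(20)]
print(fit_class_of_populations(populations))
```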
The same reasoning just carried out, using the Laplace approximation, can be generalized to the problem of classification of univariate normal populations. Indeed, assume that classes L = 1,...,C, each containing some number K_L of univariate normal populations, have been identified based on some training sequence. Using the Laplace approximation and the maximum likelihood estimation approach of Section 3.2 for each one of these classes, it is possible to fit the hyperparameters (z̄_L, γ_L) of a Gaussian distribution G(z̄_L, γ_L) on H.
For a test population S_t, the maximum likelihood rule, for deciding which of the classes L this test population S_t belongs to, requires finding the following maximum:

max_{L = 1,...,C}  p( S_t | z̄_L, γ_L ) = max_{L = 1,...,C}  ∫_H p( S_t | z ) p( z | z̄_L, γ_L ) dA(z)    (35)

and assigning the test population S_t to the class with label L*. If the number of points N_t in the population S_t is large, the Laplace approximation, in the same way as used above, approximates the maximum in Equation (35) by:

max_{L = 1,...,C}  (α / N_t) p( S_t | ẑ_t ) p( ẑ_t | z̄_L, γ_L )    (36)
where ẑ_t = (x̂_t, ŷ_t) is the pair of maximum likelihood estimates computed based on the test population S_t and where p(ẑ_t | z̄_L, γ_L) is given by Equation (14). Now, since the factor (α/N_t) p(S_t | ẑ_t) does not depend on L, writing out Equation (14), the decision rule becomes:

L* = argmax_{L = 1,...,C}  [ −d²( ẑ_t, z̄_L ) / 2γ_L²  −  ln Z(γ_L) ]    (37)
Under the homoscedasticity assumption, that all of the γ_L are equal, this decision rule essentially becomes the same as the one proposed in [5], which requires S_t to be assigned to the "nearest" cluster, in terms of Rao's distance. Indeed, if all the γ_L are equal, then Equation (37) is the same as,

L* = argmin_{L = 1,...,C}  d( ẑ_t, z̄_L )    (38)

This decision rule is expected to be less efficient than the one proposed in Equation (37), which also takes into account the uncertainty associated with each cluster, as measured by its dispersion parameter γ_L.
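A compact sketch of the decision rule (37) follows; it scores a test estimate ẑ_t against each fitted class (z̄_L, γ_L) and picks the best class. The closed form of Z(γ) is the one given in Section 3.1; all names are illustrative.

```python
import math

def log_Z(gamma):
    """Logarithm of the normalization constant Z(gamma) of Section 3.1."""
    return math.log(math.pi * math.sqrt(2.0*math.pi) * gamma * math.erf(gamma/math.sqrt(2.0))) + gamma**2/2.0

def rao_distance(z1, z2):
    """Rao's distance (5), with points of H given as complex numbers."""
    return math.acosh(1.0 + abs(z1 - z2)**2 / (2.0*z1.imag*z2.imag))

def classify(z_t, classes):
    """Decision rule (37): classes is a dict mapping a label to (zbar_L, gamma_L)."""
    def score(label):
        zbar, gamma = classes[label]
        return -rao_distance(z_t, zbar)**2 / (2.0*gamma**2) - log_Z(gamma)
    return max(classes, key=score)

classes = {"grass": (0.2 + 1.1j, 0.3), "sand": (-0.5 + 0.6j, 0.8)}
print(classify(0.1 + 1.0j, classes))
```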
5. Application to Image Classification
In this section, the framework proposed in Section 4, for classification of univariate normal populations, is applied to texture image classification using Gabor filters. Several authors have found that Gabor energy features are well-suited texture descriptors. In the following, consider 24 Gabor energy sub-bands that are the result of three scales and eight orientations. Hence, each texture image can be decomposed as the collection of those 24 sub-bands. For more information concerning the implementation, the interested reader is referred to [22].
Starting from the VisTex database of 40 images [10] (these are displayed in Figure 1), each image was divided into 16 non-overlapping subimages of 128 × 128 pixels each. A training sequence was formed by choosing randomly eight subimages out of each image. To each subimage in the training sequence, a bank of 24 Gabor filters was applied. The result of applying a Gabor filter with scale s and orientation o to a subimage i belonging to an image L is a univariate normal population S_{i,s,o} of 128 × 128 points (one point for each pixel, after the filter is applied).
These populations S_{i,s,o} (called sub-bands) are considered independent, each one of them univariate normal with mean μ_{i,s,o} = √2 x_{i,s,o}, standard deviation σ_{i,s,o} = y_{i,s,o} and with z_{i,s,o} = (x_{i,s,o}, y_{i,s,o}). The pair of maximum likelihood estimates for these parameters is denoted ẑ_{i,s,o} = (x̂_{i,s,o}, ŷ_{i,s,o}). An image L (recall, there are 40 images) contains, in each sub-band, eight populations S_{i,s,o}, with which hyperparameters z̄_{L,s,o} and γ_{L,s,o} are associated, by applying the maximum likelihood estimation algorithm of Section 3.2 to the inputs ẑ_{i,s,o}.
If S_t is a test subimage, then one should begin by applying the 24 Gabor filters to it, obtaining independent univariate normal populations S_{t,s,o}, and then compute for each population the pair of maximum likelihood estimates ẑ_{t,s,o} = (x̂_{t,s,o}, ŷ_{t,s,o}). The decision rule Equation (37) of Section 4 requires that S_t should be assigned to the image L*, which realizes the maximum:

L* = argmax_{L}  ∑_{s,o} [ −d²( ẑ_{t,s,o}, z̄_{L,s,o} ) / 2γ_{L,s,o}²  −  ln Z(γ_{L,s,o}) ]    (39)
Under the homoscedasticity assumption, i.e., γ_{L,s,o} = γ_{s,o} for all L, this decision rule becomes:

L* = argmin_{L}  ∑_{s,o}  d²( ẑ_{t,s,o}, z̄_{L,s,o} ) / 2γ_{s,o}²    (40)
For this concrete application to the VisTex database, it is pertinent to compare the rate of successful classification (or overall accuracy) obtained using the Riemannian prior, based on the framework of Section 4, to that obtained using a more classical conjugate prior, i.e., a normal-inverse gamma distribution of the mean μ = √2 x and the standard deviation σ = y. This conjugate prior is given by a normal prior on the mean, conditionally on σ², together with an inverse gamma prior on σ².
Using this conjugate prior, instead of a Riemannian prior, and following the same procedure of applying the Laplace approximation, a different decision rule is obtained, where L* is taken to be the maximizer of the following expression:

where, as in Equation (39), x̂_{t,s,o} and ŷ_{t,s,o} are the maximum likelihood estimates computed for the population S_{t,s,o}.
Both the Riemannian and conjugate priors have been applied to the VisTex database, with half of the database used for training and half for testing. In the course of 100 Monte Carlo runs, a significant gain of about 3% is observed with the Riemannian prior compared to the conjugate prior. This is summarized in the following table.
Recall that the overall accuracy is the ratio of the number of successfully classified subimages to the total number of subimages. The table shows that the use of a Riemannian prior, even under a homoscedasticity assumption, yields significant improvement upon the use of a conjugate prior.
6. Conclusions
Motivated by the problem of the classification of univariate normal populations, this paper introduced a new class of prior distributions on the univariate normal model. With the univariate normal model viewed as the Poincaré half plane H, these new prior distributions, called Gaussian distributions, were meant to reflect the geometric picture (in terms of Rao's distance) that a cluster or class of univariate normal populations can be represented as having a center z̄ ∈ H and a "variance" or dispersion γ². Precisely, a Gaussian distribution G(z̄, γ) has a probability density function p(z), with respect to the Riemannian volume of the Poincaré half plane, which is proportional to exp( −d²(z, z̄) / 2γ² ). Using Gaussian distributions as prior distributions in the problem of the classification of univariate normal populations was shown to lead to a new, more general and efficient decision rule. This decision rule was implemented in a real-world application to texture image classification, where it led to significant improvement in performance, in comparison to decision rules obtained by using conjugate priors.
The general approach proposed in this paper contains several simplifications and approximations, which could be improved upon in future work. First, it is possible to use different prior distributions, which are more geometrically rich than Gaussian distributions, to represent classes of univariate normal populations. For example, it may be helpful to replace Gaussian distributions that are “isotropic”, in the sense of having a scalar dispersion parameter γ, by non-isotropic distributions, with a dispersion matrix Γ (a 2 × 2 symmetric positive definite matrix). Another possibility would be to represent each class of univariate normal populations by a finite mixture of Gaussian distributions, instead of representing it by a single Gaussian distribution.
These variants, which would allow classes with a more complex geometric structure to be taken into account, can be integrated in the general framework proposed in the paper, based on: (i) fitting a prior distribution (non-isotropic Gaussian, mixture of Gaussians) to each class; and (ii) choosing, for a test population, the most adequate class, based on a decision rule. These two steps can be realized as above, through the Laplace approximation and maximum likelihood estimation, or through alternative techniques, based on Markov chain Monte Carlo stochastic optimization.
In addition to generalizing the approach of this paper and improving its performance, a further important objective for future work will be to extend it to other parametric models, beyond univariate normal models. Indeed, there is an increasing number of parametric models (generalized Gaussian, elliptical models, etc.), whose Riemannian geometry is becoming well understood and where the present approach may be helpful.