Dejan V. Vranić
3D Model Retrieval
- Ph. D. Dissertation -
I hereby declare that I have produced this dissertation independently and without
inadmissible outside help. I have used no sources and aids other than those cited,
and I have marked as such all passages taken verbatim or in substance from published
or unpublished writings, as well as all statements based on oral information.
Likewise, all materials provided or services rendered by other persons are marked as such.
.........................................
(Place, Date)
.........................................
(Signature)
Acknowledgments
First of all, I would like to thank my advisor, Prof. Dr. Dietmar Saupe. This
work was supported by an award from the Deutsche Forschungsgemeinschaft (DFG),
grant GRK 446/1-98 for the Graduiertenkolleg Wissensrepräsentation (graduate
study program on knowledge representation) at the University of Leipzig. I would
also like to thank Prof. Dr. Gerhard Brewka, the chair of the program. The
Deutsche Forschungsgemeinschaft also supported my work through the Project SA
449/10-1, within the strategic research initiative “Distributed Processing and Delivery
of Digital Documents” (V3D2), SPP 1041.
My experience at the Technical University of Vienna played a decisive role in
selecting goals in my life. The person who directly changed my career was Dr.
Gordana Popović, and I wish to express my gratitude to her.
During the past four years, I communicated with many people, and I want to
thank all of them.
Last but not least, I thank my wife Djilija, my daughter Aleksandra, and
my mother Živka. My “three funny female persons” mean everything to me, and
they supported me the most.
Contents

Introduction
1 Multimedia Retrieval
  1.1 Content-Based Retrieval of Audiovisual Data
  1.2 MPEG-7 Context
  1.3 Shape-Similarity Retrieval of 3D Objects
    1.3.1 Polygonal Mesh Models
    1.3.2 3D Model Retrieval Algorithm
    1.3.3 Types of 3D-Shape Features
    1.3.4 3D-Shape Descriptors Criteria
  1.4 Similarity Search Metrics
  1.5 Tools for Evaluation of Retrieval Effectiveness
3 Pose Estimation
  3.1 Problem Description
  3.2 Principal Component Analysis
  3.3 Modifications of the PCA
  3.4 “Continuous” PCA
  3.5 Evaluation of the Continuous Approach
4 3D-Shape Feature Vectors
  4.1 Ray-Based Feature Vector
  4.2 Silhouette-Based Feature Vectors
  4.3 Depth Buffer-Based Feature Vector
  4.4 Volume-Based Feature Vector
  4.5 Voxel-Based Approach
  4.6 Describing 3D-Shape with Functions on a Sphere
    4.6.1 Spherical Harmonics
    4.6.2 Ray-Based Approach with Spherical Harmonic Representation
    4.6.3 Moments-Based Feature Vector
    4.6.4 Shading-Based Feature Vector
    4.6.5 Complex Feature Vector
    4.6.6 Feature Vectors Based on Layered Depth Spheres
  4.7 Hybrid Descriptors
6 Conclusion
Bibliography
Biography
Introduction
Since the ubiquitous Internet makes the world an increasingly networked place,
effective access to information is becoming more and more important. An object
from a database can traditionally be accessed using attached structured data. Other
forms of data access include search in collections of textual documents and search in
collections of audiovisual data (images, audio sequences, movies, and 3D-objects).
Solutions for searching collections of textual documents are reasonably effective.
Search in collections of multimedia objects can be performed by using textual
annotations or by analyzing the content of the objects. Content-based search for
multimedia objects is a more challenging form of accessing audiovisual information.
When we started our work in autumn 1999, solutions for content-based search were
mostly aimed at retrieving still images, audio sequences, and movies, while only a
few techniques for 3D model retrieval had been reported. Our goal was to create
a variety of 3D-shape descriptors in order to fill this gap.
Appropriate 3D-shape features are automatically extracted and represented us-
ing suitable data structures (e.g., vectors, octrees, graphs). The resulting repre-
sentation of a feature is regarded as a descriptor. Usually, features are represented
by vectors with real-valued components, whence such descriptors are regarded as
feature vectors. Descriptors should be defined in such a way that similar 3D-models
are attributed feature vectors that are close in the search space. In our 3D model
retrieval system a model, a polygonal mesh, serves as a query and similar objects
are retrieved from a collection of 3D-objects. Algorithms proceed first by a nor-
malization step (pose estimation) in which models are transformed into a canonical
coordinate frame. Second, feature vectors are extracted and compared with those
derived from normalized models in the search space. Using a metric in the feature
vector space, nearest neighbors are computed and ranked. Objects thus retrieved
are displayed for inspection, selection, and processing. Shape-similarity retrieval of
3D-objects is not a “toy” problem: designers of virtual worlds, creators of
mechanical parts, scientists studying molecule docking, and authors of copyrighted
models all need to find 3D-models of particular shapes. Bearing in mind that the
recent development of 3D-scanners, visualization techniques, and CAD tools
continuously increases the number of available 3D-models, there is a need for
content-based 3D model retrieval systems.
The main objective of this thesis is the construction, analysis, and testing of new
techniques for describing the 3D-shape of polygonal mesh models. Since there is no
theory that specifies how to analyze low-level features of polygonal meshes in order
to describe the 3D-shape in the optimal way, we developed a variety of descriptors
capturing different features of 3D-objects and using different representation meth-
ods. The best feature vectors possess high discriminant power. Besides defining
3D-shape descriptors, the retrieval performance of our feature vectors is carefully
studied and compared to the effectiveness of techniques proposed by other authors. A
variety of tools are implemented for efficient experimental analysis and verification
of results. A Web-based 3D model retrieval system serves as a proof-of-concept.
The results unambiguously show that our best descriptors outperform the state-of-
the-art.
This thesis is organized as follows:
• Introduction;
• Chapter 1: Multimedia Retrieval;
• Chapter 2: Related Research Work;
• Chapter 3: Pose Estimation;
• Chapter 4: 3D-Shape Feature Vectors;
• Chapter 5: Experimental Results;
• Chapter 6: Conclusion;
• Bibliography;
• Appendix: CCCC.
The first chapter has an introductory character and consists of five sections.
The motivation and general concept of multimedia retrieval are presented in sec-
tion 1.1. An overview of MPEG-7, a standard that provides tools and methods
to describe the content of audiovisual data, is given in section 1.2. Section 1.3,
which consists of four subsections, focuses on the topic of the thesis, content-based
retrieval of 3D-models. Since polygonal mesh is the most common way of represent-
ing 3D-objects, the polygonal mesh representation is explained in subsection 1.3.1.
A general 3D-model retrieval algorithm is presented in subsection 1.3.2. In con-
trast to image retrieval, where color and texture can be used for searching for similar
objects, the main challenge in the area of 3D-model retrieval is to describe the shape of
a 3D-object. Types of features, which are considered for characterizing shape, are
discussed in subsection 1.3.3. 3D-shape descriptors should fulfill certain criteria,
which are defined and discussed in subsection 1.3.4. Methods for measuring simi-
larity (or dissimilarity) between feature vectors are addressed in section 1.4. Tools
that are used in order to evaluate retrieval performance of 3D-shape descriptors are
described in section 1.5.
The second chapter is dedicated to 3D-shape descriptors proposed by other
authors. Since more and more researchers are attracted by the topic, a variety
of approaches for describing 3D-shape have recently been reported. Besides, there
is significant work in the area of Computer Vision, which can be regarded as a
somewhat similar topic. However, we do not need to infer all the information about
3D-shape from one or more 2D-images, because we deal with objects represented
as polygonal meshes. The criteria for selecting the descriptors presented in this
chapter are historical significance and impact on the area of retrieval of polygonal
mesh models. A selection of nine techniques proposed by five groups of authors is
described in detail. In section 2.1, cords and moments-based descriptors proposed
by Paquet et al. are described. A descriptor based on equivalence classes proposed
by Suzuki et al. is presented in section 2.2. Section 2.3 is dedicated to the MPEG-7
shape spectrum descriptor, which is a histogram of curvature indices. An interesting
technique, called topology matching, which uses graphs as 3D-shape descriptors, is
explained in section 2.4. Section 2.5 is subdivided into 4 parts describing techniques
proposed by the Princeton Shape Analysis and Retrieval Group. A concept of shape
distributions (subsection 2.5.1) is based on randomized computation of certain geo-
metric properties. A descriptor based on binary voxel grids (subsection 2.5.2) uses
an original representation technique that secures invariance of the descriptor with
respect to rotations of a 3D-model. The reflective symmetry descriptor (subsection
2.5.3) relies upon the assumption that a measure of similarity between parts of a
model lying on opposite sides of a cutting plane can be used for capturing 3D-
shape. A descriptor based on exponentially decaying Euclidean distance transform
(subsection 2.5.4) uses an almost identical representation to the descriptor based on
binary voxel grids. However, the considered feature, a voxel grid attributed by ex-
ponentially decaying Euclidean distance transform (EDT), describes the 3D-shape
in a more effective way. Based on our evaluation as well as on results presented
in the literature, we regard the descriptor based on negatively exponentiated EDT
[58] as the state-of-the-art descriptor.
Objects represented as polygonal meshes are given in arbitrary orientation, scale,
and position in the 3D-space R3 . 3D-shape descriptors can be defined in such a way
that invariance with respect to translation, rotation, scaling, and reflection of a
mesh model is provided. Examples of such descriptors are the shape spectrum
descriptor (section 2.3), topology matching (section 2.4), and shape distributions
(section 2.5.1). If the invariance of a descriptor with respect to similarity transforms
is not provided by the representation of a feature, pose estimation (normalization)
is necessary as a step preceding the feature extraction. The pose normalization
procedure is a transformation of a 3D-mesh model into a canonical coordinate frame
by translating, rotating, scaling, and reflecting (flipping) the original set of vertices.
In the third chapter, details about our original pose estimation approach are given.
In section 3.1, we describe the problem of finding the canonical coordinates of a
mesh. The most prominent tool for solving the problem is the Principal Component
Analysis (PCA) [51], also known as the discrete Karhunen-Loeve transform, or the
Hotelling transform, which is described in detail in section 3.2. Since applying the
PCA to the set of vertices of a mesh model can produce undesired normalization
results, two modifications of the PCA are given in section 3.3. Both modifications
approximate the application of the PCA to the point set of a 3D-object (union of
polygons). In section 3.4, we present our original method for analytical computation
of various parameters needed to analyze the set of infinitely many points. This
computation enables application of the PCA to an infinite point set represented as
a union of triangles. We called the approach the Continuous Principal Component
Analysis (CPCA).
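To make the normalization step concrete, the following sketch applies the plain vertex PCA of section 3.2 to a set of vertices (the function name and the use of Python with numpy are illustrative; the continuous PCA of section 3.4 additionally integrates over the triangle surfaces, which this sketch does not do):

import numpy as np

def pca_align(vertices):
    # Translate the vertex set to its centroid and rotate it into the
    # principal axes of the covariance matrix (plain vertex PCA).
    v = np.asarray(vertices, dtype=float)
    centered = v - v.mean(axis=0)            # translation invariance
    cov = centered.T @ centered / len(v)     # 3x3 covariance matrix
    _, eigvecs = np.linalg.eigh(cov)         # eigenvalues in ascending order
    axes = eigvecs[:, ::-1]                  # principal axis first
    return centered @ axes                   # vertices in canonical frame

Scaling and reflection (flipping) normalization, which complete the canonical frame, are omitted here.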
In the fourth chapter, we present our original methods for describing 3D-shape.
Since the optimal way of encoding information about 3D-shape is not prescribed, we
consider a variety of different features to define shape descriptors (feature vectors).
The approaches include: ray-based feature vector in the spatial domain (section
4.1), silhouette-based descriptor (section 4.2), depth buffer-based feature vector
(section 4.3), descriptor based on artificial volumes associated to triangles (section
4.4), and descriptor based on voxel grids attributed by fractions of the total surface
of a polygonal mesh (section 4.5). Certain features aimed at describing 3D-shape
of a polygonal mesh can be considered as samples of a function on a sphere (section
4.6). A suitable tool for representing samples of functions on a sphere is the fast
Fourier transform on a sphere. In the frequency (spectral) domain, the samples are
represented by spherical harmonic coefficients. In subsection 4.6.1, a brief presenta-
tion of spherical harmonics as well as our original approach for forming descriptors
with spherical harmonic representation are given. As far as we know, spherical
harmonics as a tool for 3D model retrieval were introduced by us in [147]. The
set of descriptors with spherical harmonic representation includes: ray-based fea-
ture vector with spherical harmonic representation (section 4.6.2), descriptor based
on rendered perspective projection on an enclosing sphere (section 4.6.4), complex
feature vector (section 4.6.5), descriptor based on layered depth spheres (section
4.6.6), and rotation invariant descriptor based on layered depth spheres (section
4.6.6). We also defined a moments-based feature vector (section 4.6.3), to compare
different representations of samples of functions on a sphere, spherical harmonics
vs. moments. Finally, a concept of hybrid feature vectors is introduced in section
4.7. We follow a typical 3D-model retrieval algorithm (pose estimation → feature extraction
→ similarity search) and most of our feature vectors are extracted in the canon-
ical coordinate frame of a 3D-model. Our techniques are not restricted to closed
or orientable polygonal mesh models. In order to be able to verify the results of the
feature extraction procedure, we usually implement at least two extraction tools for
each approach. The forming of feature vector components as well as the specifications
of feature extraction methods are described in detail, in order to provide sufficient
information to a reader who wants to implement and test our methods.
The experimental results, presented in the fifth chapter, show that our hybrid
descriptors significantly outperform all other descriptors, including the state-of-the-art
descriptor, which is extracted using the original tools provided by the authors [108].
Experiments aimed at testing dimension reduction using the
Principal Component Analysis (PCA) are presented in section 5.3. A summary of
experimental results (section 5.4) concludes the chapter. Some authors object to the
use of the PCA for orienting a model in the pose estimation step. However, our
results show that the best approach relies upon the PCA. Moreover, we modified
certain techniques (including our implementation of the state-of-the-art) that avoid
the use of the PCA by utilizing a property of spherical harmonics. The modification
consists of applying the PCA instead of the property of spherical harmonics. Thus,
we compared the competing techniques for achieving rotation invariance, the PCA
vs. the property of spherical harmonics, on three types of feature vectors. Each
time the result was the same, the descriptors relying upon the PCA outperformed
the descriptors relying upon the competing approach for attaining rotation invariance.
In the sixth chapter, we summarize the contribution, stress the most important
results, and suggest directions for future work.
A list of directly used or cited works (bibliography), which contains 153 refer-
ences, is located at the end of the thesis.
Our Web-based 3D model retrieval system, called CCCC, is presented in the
appendix. Besides serving as a proof-of-concept, the CCCC 3D search engine
is useful for obtaining an impression about the effectiveness of different descriptors,
by inspecting retrieved models as well as by using provided tools for comparing
descriptors at three levels (models, classes, and the whole collection).
Chapter 1
Multimedia Retrieval
In this chapter, we first present the motivation and general concept of multimedia
retrieval. Then, we give a brief overview of MPEG-7, a standard that provides
tools and methods to describe the content of audiovisual data. Next, we focus
on the topic of the thesis, shape-similarity search for 3D-objects. We describe
the polygonal-mesh representation of 3D-models, present the algorithm that we
use to search in 3D-shape collections, discuss types of features that are used for
describing 3D-content, and set criteria for defining 3D-shape descriptors. We also
address methods for measuring similarity (or dissimilarity) between feature vectors.
Finally, we describe tools that are applied in order to evaluate retrieval performance
of 3D-shape descriptors.
[Figure 1.1: General concept of a content-based multimedia retrieval system: feature extraction, description generation, and search engine.]
Content-based audio retrieval addresses, among other tasks, the problem of recognizing
words and notes. A general problem in audio analysis is to simply discriminate speech
from non-vocal music, silence, or other sounds.
Automatic speech recognition deals with keyword spotting, sub-word indexing, and
speaker identification. Tools for describing high-level features, such as timbre and
melody, have been developed as well. For more details about audio features and
methods used for content-based retrieval, we refer to [82, 32, 149, 70, 71, 31, 66].
A video retrieval system can be created by combining audio and image retrieval
techniques. Also, there are features specific to videos such as spatial and tempo-
ral characteristics. Purely spatial features are color space, luminance, shape, size,
texture, and orientation, while object motion and camera operation belong to tem-
poral features. A spatio-temporal feature is, e.g., motion trajectory. Segmentation
of videos as well as detection of shots and key frames are important steps of the
feature extraction procedure. High-level video features include the motion of objects,
motion activity (slow or fast), recognition of an important person, recognition of an event,
etc. Video retrieval literature is also very rich, e.g., [94, 114, 111, 123, 135, 50, 137].
Several techniques for retrieval of 3D-mesh models have recently been proposed.
All reported features capture the 3D-shape of models (objects). The area of shape
similarity search for 3D objects is addressed in section 1.3, while the related work is
reviewed in chapter 2. Our original contribution to the topic is presented in chapter
4.
Description generation. According to [85], we refer to a representation of
a feature as a descriptor. A structure containing an identifier of the object (e.g.,
a name in a local database or a URL) and at least one descriptor is called a de-
scription. As an example, for an image, we can generate a description consisting of
image name, a few keywords (annotations), and representations of color histograms,
texture contrast, and contour shape. Content-based retrieval systems usually store
descriptions in an internal format, whence other systems cannot use the descriptions
without proper transcoding tools and/or specification about the internal description
format. One of the objectives of the MPEG-7 standard is to standardize content-
based description for various types of audiovisual information. More details about
MPEG-7 are given in section 1.2.
Search engine. A search engine for a certain type of audiovisual data is an
application that accepts a specific input as a query (e.g., text or multimedia content),
and retrieves objects ranked by the degree of similarity to the query. Descriptions
of two “similar” audiovisual objects contain descriptors, which should be attributed
values that are “close” in a space of descriptors D. The degree of similarity (or
dissimilarity) between two descriptors is computed using a suitable measure d :
D × D → R. Suppose Q ∈ D is a descriptor of a query object of certain type
(e.g., an image) and {D1 , . . . , DN } ⊂ D is a collection of descriptors associated to
objects of the same type as the query. For simplicity, the objects in the collection
are enumerated (identified) by a set of indices {1, . . . , N }. If d is a measure of
dissimilarity, the object with descriptor D_{m_1} is the best match (nearest neighbor)
to the query, and the retrieved objects are ranked so that

d(Q, D_{m_1}) ≤ d(Q, D_{m_2}) ≤ … ≤ d(Q, D_{m_K}),

where D_{m_i} is the descriptor of the object (match) with identifier m_i, which is regarded
as the i-th match.
as the i-th match. The number of retrieved models, K, depends on the application.
A user can specify how many models should be retrieved or a threshold value t can
be set to retrieve all objects whose descriptors satisfy d(Q, Dmi ) ≤ t.
The metric d depends on the type of descriptor. If we use textual annotations as
descriptors, then d(Q, Di ) can be, e.g., the total number of words that are contained
in both descriptors Q and Di . For multimedia retrieval applications, it is more
common that descriptors are stored as vectors with real-valued components and
fixed dimensions. Various choices of metric d are addressed in section 1.4.
If the size N of the collection of multimedia objects is relatively small (e.g.,
N < 10000) and the computation of the distance d is fast enough (e.g., less than 200 µs),
then the search can be done sequentially, i.e., we calculate all distances d(Q, D_i)
(1 ≤ i ≤ N) and sort them. However, if the order of magnitude of N is higher (e.g.,
10^6), then it is necessary to include techniques for accelerating the search [131, 130].
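A minimal sketch of such a sequential scan, assuming descriptors are stored as rows of a matrix and the l_2 metric is used (names and data layout are illustrative):

import numpy as np

def sequential_search(query, descriptors, k=10):
    # Compute all N distances d(Q, D_i), then sort them to obtain the
    # identifiers m_1, ..., m_k of the k nearest neighbors.
    d = np.linalg.norm(descriptors - np.asarray(query, dtype=float), axis=1)
    order = np.argsort(d)                    # ascending dissimilarity
    return order[:k], d[order[:k]]

For larger collections, the scan would be replaced by a pre-computed index structure.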
A polygon with vertices v_1, …, v_k is regarded as the union of triangles

Ω = ∪_{i=1}^{k−2} △v_1 v_{i+1} v_{i+2}.  (1.1)
Since certain feature extraction algorithms deal with triangles and not with polygons
(e.g., ray-triangle intersection), we iterate through all polygons splitting them into
triangles as necessary.
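A minimal sketch of this splitting step for a single polygon, following the fan construction in (1.1) (the function name is illustrative):

def fan_triangulate(polygon):
    # Split a polygon v_1 ... v_k into the k-2 triangles of (1.1),
    # all sharing the first vertex v_1.
    v = list(polygon)
    return [(v[0], v[i], v[i + 1]) for i in range(1, len(v) - 1)]

For example, fan_triangulate([1, 2, 3, 4, 5]) yields the triangles (1, 2, 3), (1, 3, 4), and (1, 4, 5).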
We regard a given triangle mesh as consisting of a set of triangles
T = {T1 , . . . , Tm }, Ti ⊂ R3 , (1.2)
where triangle T_i = △p_{A_i} p_{B_i} p_{C_i} is specified by a triple of vertex indices,
collected in the list

(A_1, B_1, C_1), …, (A_m, B_m, C_m).  (1.4)

Then,

I = ∪_{i=1}^m T_i = ∪_{i=1}^m △p_{A_i} p_{B_i} p_{C_i},  (1.5)
As an example, the lists of geometry and topology of an icosahedron are:

Geometry (vertex coordinates):
p1 = (1, 0, 0), p2 = (a, b, 0), p3 = (a, c, d), p4 = (a, −e, f), p5 = (a, −e, −f), p6 = (a, c, −d),
p7 = (−1, 0, 0), p8 = (−a, −b, 0), p9 = (−a, −c, −d), p10 = (−a, e, −f), p11 = (−a, e, f), p12 = (−a, −c, d),
where a = 1/√5, b = 2/√5, c = (√5 − 1)/(2√5), d = √((5 + √5)/10), e = √((3 + √5)/10), f = √((5 − √5)/10).

Topology (triangle index triples):
T1 = (1, 2, 3), T2 = (1, 3, 4), T3 = (1, 4, 5), T4 = (1, 5, 6), T5 = (1, 6, 2),
T6 = (2, 6, 10), T7 = (3, 2, 11), T8 = (4, 3, 12), T9 = (5, 4, 8), T10 = (6, 5, 9),
T11 = (7, 9, 8), T12 = (7, 10, 9), T13 = (7, 11, 10), T14 = (7, 12, 11), T15 = (7, 8, 12),
T16 = (8, 4, 12), T17 = (9, 5, 8), T18 = (10, 6, 9), T19 = (11, 2, 10), T20 = (12, 3, 11).

[Figure: A visualization of the icosahedron defined by the lists of geometry and topology.]
For general polygonal meshes, the list of indices (1.4) contains an additional field
specifying the number of vertices of each polygon.
Besides geometry and topology, a mesh model may contain additional informa-
tion about a 3D-object, e.g., list of normals, list of colors, textures, etc. However,
the additional information is mostly used for rendering, while geometry and topol-
ogy are sufficient to represent the 3D-shape.
Let E be the list of edges of all triangles T1 , . . . , Tm (1.2), where the edges of
triangle Ti are pAi pBi , pBi pCi , and pCi pAi ,
E = {pA1 pB1 , pB1 pC1 , pC1 pA1 , . . . , pAm pBm , pBm pCm , pCm pAm } (1.7)
Definition 1.1 A triangle mesh is closed if for any pair of vertices p_i and p_j
(i, j ∈ {1, …, n}), the total number of occurrences of the edges p_i p_j and p_j p_i in E (1.7)
is either even or zero.
If the flipped triangle T'_i is combined with T_j instead, then orientability holds,
because the edges p_c p_b and p_b p_c each occur exactly once.
Definition 1.2 A triangle mesh is orientable if for any pair of vertices p_i and p_j
(i, j ∈ {1, …, n}), the number of occurrences of the edge p_i p_j in E (1.7) is equal to the
number of occurrences of the edge p_j p_i in E.
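Both definitions reduce to counting directed edges, which the following sketch illustrates (the function name and the data layout are our own):

from collections import Counter

def mesh_properties(triangles):
    # Count directed edges (A,B), (B,C), (C,A) of all triangles and test
    # definitions 1.1 (closed) and 1.2 (orientable) on the counts.
    count = Counter()
    for a, b, c in triangles:
        count[(a, b)] += 1
        count[(b, c)] += 1
        count[(c, a)] += 1
    closed = all((count[(i, j)] + count[(j, i)]) % 2 == 0
                 for (i, j) in list(count))
    orientable = all(count[(i, j)] == count[(j, i)]
                     for (i, j) in list(count))
    return closed, orientable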
Since features are stored as vectors, and descriptors are regarded as representations
of features, we use the terms descriptor and feature vector interchangeably throughout
this thesis. A variety of 3D-model feature vectors are presented in chapters 2 and
4.
Extracted descriptors are used in the description generation step. Ideally, de-
scriptions should be generated using some standard, e.g., MPEG-7 (section 1.2), in
order to provide interoperability between applications. For instance, a search engine
might rely upon standardized descriptions only, without engaging feature extraction
tools. We use a simple internal format for generating descriptions. A description
consists of a model identifier (a string of constant length) and an array of N float-
ing point numbers, where N is the dimension of the feature vector. A module for
description management organizes and stores descriptions. In our retrieval system,
for a given feature type and parameter settings, descriptions of the whole 3D-model
collection are stored in a single file. This file starts with a header, containing infor-
mation about the feature type and parameter settings, followed by descriptions for
each model. The descriptions can easily be represented using the MPEG-7 description
definition language [83], and encoded using standardized MPEG-7 tools.
Finally, an interactive application performs similarity search for 3D-objects by
processing the stored descriptions. The features are designed so that similar 3D-
objects are attributed vectors that are close in feature vector space. Using a suitable
metric (see section 1.4), nearest neighbors are computed and ranked. A variable
number of objects are thus retrieved by listing the top ranking items.
All the steps, which are present in a typical 3D model retrieval application, are
summarized in algorithm 1.1 (compare to figure 1.1).
A user inspects the retrieved objects and selects a model which is reasonably similar
to the desired one. Our Web-based retrieval system, called CCCC [140], is presented
in the appendix.
3D-geometry-based features can be defined by discretizing the model into a voxel grid
[53] (sections 2.1, 2.2, 2.5, and 4.5). A voxel attribute can be a binary value, a value representing the “level of
presence” of the mesh (e.g., the surface area) in a certain region, a value representing
the distance to the model’s boundaries, etc. The voxel grid can directly be used as
a feature, or it can be processed further. A subset of triangles of the mesh model
can be fitted by a parametric 3D-surface, whose features are then analyzed. In the
technique presented in section 2.3, a curvature index is the feature of a parametric
3D-surface.
Certain features can be regarded as functions on a sphere. A 3D-mesh model can
be projected on an enclosing sphere (sections 4.1 and 4.6). A point on the sphere is
attributed a value, which can be a distance, surface area, information about shading,
curvature index, etc. Also, several concentric spheres can be used for defining
functions that represent voxel attributes (section 2.5) or encode positions of points
in the 3D space R3 (subsection 4.6.6). All functions on concentric spheres are individually
processed obtaining a signature for each function. The obtained signatures are then
combined to form a compact descriptor of the model.
Image and 3D-geometry-based features as well as features that can be regarded
as functions on a sphere have meaningful spatial interpretations. For instance, a
voxel attribute is related to a specific region in the 3D-space, a 2D-image pixel
attribute is related to the projected part of the model, a curvature index of a
parametric 3D-surface is related to a specified group of triangles, and a value of a
function on a sphere can also be interpreted. Local features of a polygonal mesh,
such as a curvature index associated to a triangle (section 2.3) or the distance of
the center of gravity of a triangle to the origin (section 2.1), are summarized by
histograms in reported techniques [103, 105, 88, 89]. We stress that local features
can be represented in a more suitable way than using histograms, in which a meaningful
spatial interpretation is lost. However, the histogram is the most prominent tool
for representing randomized features (e.g., [97, 98]). Features that are derived from
several randomly selected points on the mesh, e.g., a distance between two randomly
selected points (section 2.5), are purely statistical. We regard these quantities as
statistical features. Typically, descriptors based on statistical features have inferior
retrieval performance compared to descriptors based on images, 3D-geometry, and
functions on a sphere.
Topological features rely upon skeletal structures, such as medial axes [12], me-
dial surfaces, Reeb graph [48], etc. Similarity between skeletal structures of different
models is measured by matching connectivity information as well as certain propor-
tions between skeletal primitives (e.g., nodes, branches, etc.). A technique of this
type, known as “topology matching”, is presented in section 2.4.
Image-based, 3D-geometry-based, statistical features and features that can be
regarded as functions on a sphere are usually represented as real-valued vectors of
fixed dimension. As an exception, a voxel grid can be represented as an octree
(section 4.5). Topological features are represented by graphs.
Most of the feature vectors can be represented in both the spatial and spectral
domains. For instance, a sequence of distances from contour pixels of a silhouette
to the origin can be used as a feature vector in the spatial domain. The same
sequence can be transformed by the discrete Fourier transform (DFT) (4.11), ob-
taining a sequence of Fourier coefficients whose magnitudes are used as components
of a feature vector in the spectral domain. Similar tools for rectangular images,
voxel grids, and functions on a sphere are the 2D-DFT (4.26), 3D-DFT (4.48),
and the Fourier transform on the sphere (spherical harmonic transform) (4.59). A
wavelet transform can be used [105], as well.
As a rule, the spectral domain representation of a feature shows better retrieval
performance than the spatial domain representation of the same feature. More
details about this fact are given in section 1.4.
8. compact representation;
Mesh models available on the Internet are not necessarily closed
(definition 1.1) or orientable (definition 1.2). For instance, our 3D model collection is
mostly collected from www.3dcafe.com, and more than 60% of models are not closed,
while more than 30% are not orientable. Restricting a feature extraction technique
to closed or orientable meshes leads to the elimination of significant number of
available 3D-shapes. Although there are approaches for filling holes in meshes
[8, 23, 67], we prefer defining descriptors without imposing any constraint regarding
closedness or orientability. Also, mesh anomalies such as floating vertices, multiple
vertices and triangles, and degenerated triangles must not affect extracted feature
vectors.
Another desirable property is an embedded multi-resolution representation. If
f_N = (f_1, …, f_N) and f_M = (f'_1, …, f'_M) are feature vectors of the same type with
dimensions N ≥ M, then

f_i = f'_i,  1 ≤ i ≤ M.  (1.8)
This means that the vector fN contains all lower-dimensional feature vectors of the
same type. Thus, if we search for similar models in the space of feature vectors
of the dimension M , there is no need to store fM separately, because the first M
components of fN can be used.
The feature extraction, including filtering and normalization steps, should be
efficient. In a 3D model retrieval system, feature vectors of new models (e.g., col-
lected by a Web-crawler application) can be extracted in idle time. Nevertheless,
it is desirable that the extraction time is as small as possible. Moreover, the rep-
resentation of a feature should be concise. To represent a feature as a real-valued
vector of fixed dimension is usually more compact than to use graph or octree
representations. However, the dimension of the vector should be reasonably small.
We consider that a feature vector should not have more than 1000 components, and
we try to keep the dimension below 500. It is crucial that the indexing of 3D-models
by shape-similarity to a given query is fast. A variety of accelerating techniques
can be engaged to speed-up the search, by pre-computing certain index structures.
The response time of a system to a given query, i.e., the elapsed time between the
moment when the query is specified and the moment when similar models are re-
trieved, should be at most several seconds. If a matching procedure of a pair of
descriptors is too time consuming, then the descriptors of that type are not suitable
for interactive retrieval applications. The l1 or l2 norms are very efficient distance
metrics for vectors.
Finally, the most important requirement is the discriminant power of descrip-
tors. By the discriminant power of a descriptor we refer to its ability to distinguish
objects similar to the query from dissimilar ones. When a descriptor possesses
high discriminant power, we say that it has good retrieval performance. Occasion-
ally, a descriptor of one type is extracted faster than a descriptor of another type,
but the retrieval performance of the latter is significantly better. Also, the dimen-
sion of a feature vector can be lower than the dimension of the other type of feature
vector, but the latter is much more effective. In these trade-offs we consider the
retrieval performance to be more important.
The l_p distance between two feature vectors f', f'' ∈ R^N is defined as

d_p(f', f'') = ||f' − f''||_p = ( Σ_{i=1}^N |f'_i − f''_i|^p )^{1/p},  p = 1, 2, ….  (1.9)
For p = 1, we obtain the l_1 norm of the difference f' − f'', called the city-block
(Manhattan) metric,

d_1(f', f'') = ||f' − f''||_1 = Σ_{i=1}^N |f'_i − f''_i|,  (1.10)

while for p = 2, we have the l_2 norm of the difference f' − f'', which is called the
Euclidean metric,

d_2(f', f'') = ||f' − f''||_2 = ( Σ_{i=1}^N (f'_i − f''_i)² )^{1/2}.  (1.11)
The lp norm is usually ineffective as the distance metric of features in the spatial
domain. From the definition of the lp norm, we observe that the component-wise
differences f'_i − f''_i are equally important. According to (1.9), it is assumed that
the individual components of the feature vector are independent from each other.
However, the values f'_i and f''_i may correspond to a feature related to a specific
region Λ_i of the 3D-space R3. For instance, f'_i may be the extent of a model
in direction u_i ∈ R3 (see section 4.1). Let Λ_i, Λ_j, and Λ_k be three regions in
the 3D-space such that Λi is very close (neighbor) to Λj and distant to Λk . In
this case, instead of computing differences only between feature values in the same
region, the distribution of the values across the neighboring regions should be taken
into account, too. Generally, the retrieval effectiveness of vectors in the spatial
domain can be improved by accounting the distribution of regions Λi (i = 1, . . . , n)
26 Multimedia Retrieval
f0 f 00 f 000
Figure 1.7: Visualizations of three feature vectors (dark fields denote the value of
1, white fields denote the value of 0). According to (1.9), dp (f 0 , f 00 ) > dp (f 0 , f 000 ) =
dp (f 00 , f 000 ), which is in a contradiction to a human perception.
The quadratic form distance between f', f'' ∈ R^N is defined using a symmetric
similarity matrix S ∈ R^{N×N},

d_S(f', f'') = ( (f' − f'')^T S (f' − f'') )^{1/2}.  (1.15)

Obviously, the Euclidean distance is a special case of the quadratic form distance
(when S = I_N, where I_N is the identity matrix of order N). Besides introducing
similarities between components, it is possible to specify the “importance” of each
component, e.g., S = diag(ω_1, …, ω_N). Moreover, the elements of the matrix S can
be adapted to the feature at hand.
As a contrast to the spatial domain, the lp norms are reasonably effective in the
spectral domain. Indeed,
d_1(f̂', f̂'') = 0 < d_1(f̂', f̂''') = 3 + 2√2,  and  d_2(f̂', f̂'') = 0 < d_2(f̂', f̂''') = √3.  (1.19)
Note that the feature vectors from the spatial domain whose nonzero components
have identical relative distributions (f' and f'') are transformed into identical vectors
in the spectral domain (f̂' = f̂'').
The computation of the l_p norm (1.9) is less expensive than the computation of
the quadratic form distance (1.15). Since the correlation between vector components
in the spatial domain is reflected by a spectral domain representation, we consider
that the lp norm is a suitable distance metric for feature vectors in the spectral
domain. Besides being efficient, the lp distance is also effective (see results from
section 5.2).
We also tested certain minimizations of the l_1 and l_2 distances, whose definitions
are given in the sequel. Generally, if f' ∈ R^N is a feature vector of a query model and
f'' ∈ R^N is a feature vector of a matching candidate, then we want to determine a
parameter α ∈ R so that the distance d_p^min(f', f'') is minimal,

d_p^min(f', f'') = min_{α∈R} d_p(f', αf'') = min_{α∈R} ||f' − αf''||_p.

For p = 1, the minimum is attained at one of the M ratios g_i = f'_i/f''_i (f''_i ≠ 0),
whence

d_1^min(f', f'') = min_{1≤i≤M} d_1(f', g_i f'').  (1.21)
f' = (f'_1, …, f'_N), f'' = (f''_1, …, f''_N);
Form an array m = {m_1, …, m_M}, m_i = (g_i, n_i, s_i) ∈ R × N × {−1, 1},
with g_i = f'_i/f''_i (f''_i ≠ 0), n_i = i, s_i = sign(f''_i);
Figure 1.8: Algorithm of complexity O(N log N) for computing the minimal l_1
distance.
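The array of ratios g_i in figure 1.8 suggests the standard weighted-median solution; a sketch under that reading, using |f'_i − α f''_i| = |f''_i| · |g_i − α|, is:

import numpy as np

def min_l1_distance(f1, f2):
    # Minimize ||f1 - alpha*f2||_1 over alpha in O(N log N): the optimal
    # alpha is a weighted median of the ratios g_i = f1_i/f2_i with
    # weights |f2_i|; components with f2_i == 0 contribute a constant.
    f1, f2 = np.asarray(f1, dtype=float), np.asarray(f2, dtype=float)
    nz = f2 != 0
    const = np.abs(f1[~nz]).sum()
    if not nz.any():
        return const
    g, w = f1[nz] / f2[nz], np.abs(f2[nz])
    order = np.argsort(g)                      # the O(N log N) step
    g, w = g[order], w[order]
    cum = np.cumsum(w)
    k = np.searchsorted(cum, 0.5 * cum[-1])    # weighted median index
    return const + np.dot(w, np.abs(g - g[k]))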
Our experimental results (section 5.2) show that retrieval performance of some
feature vectors in the spatial domain (e.g., the ray-based approach from section 4.1)
is better when the minimized l_1 distance (1.21) is used instead of the l_1 distance (1.10). This
also holds for certain feature vectors in the spectral domain.
We consider that the presented distance calculations are reasonably suitable for
our application. Nevertheless, we do not exclude that, e.g., modifications of the
Hausdorff distance, the Earth Mover’s distance [21], the Bhattacharyya distance
[75], or the Kullback-Leibler distance [148] might be more effective for certain fea-
ture vectors. Similarity measures and algorithms suitable for content-based search
are also addressed in [139, 9, 38, 36].
Let C and U be the sets of all classified and unclassified objects, respectively,

C = ∪_{i=1}^K C_i,  U = Σ \ C.  (1.23)
Models belonging to the same class are regarded as relevant to each other. The
categorization of similar (relevant) objects into classes serves as a ground truth.
Figure 1.9: For a given query model, a fraction of retrieved models is relevant.
Precision is the proportion of retrieved models that are relevant and recall is the
proportion of the relevant models actually retrieved,
precision = |M|/|R| = m/r,  recall = |M|/|C_i \ {q}| = m/(n_i − 1).  (1.26)
In figure 1.9, four sets of models are illustrated: the entire collection, the set of
retrieved models (R), the set of relevant objects (Ci \ {q}), and the set of retrieved
relevant objects (M ).
The construction of the precision-recall diagrams, which are widely used in sec-
tion 5.2, is demonstrated by the following example.
Let Ci \ {q} = {c1 , c2 , c3 , c4 , c5 , c6 , c7 }, i.e., there are 7 objects relevant to
the query q. Firstly, we rank all objects from the collection Σ using the selected
descriptor and matching criterion. Suppose that
R = {c4 , r1 , c2 , c7 , r2 , c1 , r3 , r4 , c6 , r5 , c5 , r6 , r7 , r8 , c3 , . . .}
is the ranking of all models from Σ according to the similarity to q, where r_i ∈ Σ \ C_i,
i = 1, 2, . . .. Thus, the last relevant model lies at position 15 (match number 15).
Next, we construct table 1.1.
The number of columns in table 1.1 depends on the size of the class of relevant
models. In order to average values of precision over classes of different sizes, we
estimate precision at standard recall values
recall_k = k/G,  1 ≤ k ≤ G,  G ∈ N,
Match no. 1 3 4 6 9 11 15
Object c4 c2 c7 c1 c6 c5 c3
Recall 1/7 2/7 3/7 4/7 5/7 6/7 7/7
Precision 1/1 2/3 3/4 4/6 5/9 6/11 7/15
Table 1.1: An example of computing precision and recall values using the ranking
of relevant objects.
by applying linear interpolation to the values from table 1.1. Interpolation results
for G = 5 and G = 10 are shown in table 1.2.
Recall 0.10 0.20 0.30 0.40 0.50 0.60 0.70 0.80 0.90 1.00
Precision (G = 5) - 0.87 - 0.73 - 0.64 - 0.55 - 0.47
Precision (G = 10) 1.00 0.87 0.68 0.73 0.71 0.64 0.57 0.55 0.52 0.47
Table 1.2: Precision interpolated at standard recall values, for G = 5 and G = 10.
The precision-recall diagrams for the values from table 1.1 and the interpolations
for G = 5, G = 10, and G = 20 are shown in figure 1.10, where both recall and
precision are expressed in percentages. Obviously, the higher the value of G, the
better the interpolation. In section 5.2, all precision-recall curves use G = 20
standard recall values.
[Figure 1.10: Precision-recall diagram for the values from table 1.1: the original values and linear interpolations for G = 5, G = 10, and G = 20; both recall and precision are expressed in percentages.]
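A sketch of the interpolation procedure behind tables 1.1 and 1.2, assuming a ranked list of object identifiers and linear interpolation between the measured recall points (names are illustrative):

import numpy as np

def precision_recall(ranking, relevant, G=20):
    # Walk down the ranking, record (recall, precision) at each relevant
    # match (as in table 1.1), then interpolate precision at the standard
    # recall values k/G, k = 1, ..., G.
    relevant = set(relevant)
    recalls, precisions = [], []
    found = 0
    for rank, obj in enumerate(ranking, start=1):
        if obj in relevant:
            found += 1
            recalls.append(found / len(relevant))
            precisions.append(found / rank)
    std_recall = np.arange(1, G + 1) / G
    return std_recall, np.interp(std_recall, recalls, precisions)

For the example ranking R and G = 10, this procedure should reproduce the corresponding row of table 1.2.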
Values in table 1.1 describe ranking results for a single query. In order to
evaluate a descriptor and matching criterion for a selected query set Q ∋ q, we take
each object from the set Q as a query, retrieve models, compute and interpolate
precision vs. recall for the single query, and average the results. By examining
averaged precision/recall diagrams for different queries (and classes) we estimate
the retrieval performance of a descriptor for selected settings (e.g., type, parameters,
representation, and matching criterion).
We also compute two cumulative parameters, the average precision p̄_100 over
the whole recall range 1/G ≤ recall ≤ 1 and the average precision p̄_50 over the recall
range 1/G ≤ recall ≤ 0.5. We consider these values to be useful for comparing the
overall performance of different descriptors. For the interpolated values from table
1.2 obtained for G = 10, we have

p̄_100 = (1.00 + 0.87 + 0.68 + 0.73 + 0.71 + 0.64 + 0.57 + 0.55 + 0.52 + 0.47)/10 = 0.674 (or 67.4%)

and p̄_50 = (1.00 + 0.87 + 0.68 + 0.73 + 0.71)/5 = 0.798 (or 79.8%).
For the same example, the bull's eye performance, BEP (the fraction of relevant
objects among the first 2(n_i − 1) matches), and the R-precision, RP (precision after
retrieving n_i − 1 objects), are BEP = 6/7 = 0.857 (or 85.7%) and RP = 4/7 = 0.571 (or 57.1%).
[Figure: Average precision-recall curves for two descriptors, Descriptor 1 (50.3, 36.3, 44.0, 33.3) and Descriptor 2 (43.2, 29.8, 36.3, 28.0); both recall and precision are expressed in percentages.]
A precision-recall diagram can also be used to estimate how many models must be
retrieved in order to obtain a desired number of relevant items, assuming that the
total number of relevant models is known.
For instance, suppose we use a model of car as the query and we want to retrieve
10 relevant models from a collection containing K ≫ 10 models of cars. Then,
we calculate the recall value (10/K) and find the corresponding value of precision
p, using a precision-recall diagram of the engaged descriptor. The precision-recall
diagram should be averaged for queries of cars. We estimate that if we retrieve
around 10/p models, 10 of them should be cars.
An ideal precision-recall curve is constant, having precision=100% for the whole
recall range. In practice, precision is well below 100% even at small recall values,
and there is a decreasing tendency with the increase of recall values. A precision-
recall curve for a single query model (e.g., figure 1.10) appears to be “broken”.
An average precision-recall curve (for the whole query set) is usually monotonically
non-increasing, i.e., for two points (p_1, r_1) and (p_2, r_2) belonging to a curve, we
have

r_1 < r_2 ⇒ p_1 ≥ p_2.
In [7], it is recommended to force this monotonicity by setting p2 ← min{p1 , p2 }.
Nevertheless, in results presented in section 5.2 the monotonicity is not forced.
Chapter 2
Related Research Work
Published specifications of some descriptors are incomplete or ambiguous [56, 55].
In order to reduce the probability of missing the correct feature extraction procedure,
we implemented several variants of feature extractors in cases of incomplete
descriptor specifications.
The cord c_i associated to triangle T_i is the vector from the center of gravity m_I
of the mesh to the center of gravity g_i of the triangle, where m_I is approximated
by the area-weighted average

m_I = (m_x, m_y, m_z) = (1/S) Σ_{i=1}^m S_i g_i,  (2.1)

where S and S_i are defined by (1.6). Thus, the number of cords is equal to the
number of triangles of the mesh model. The cords-based descriptor consists of
three histograms. The first histogram represents the distribution of the angles α_i
between the cords and the first principal axis, while the second histogram provides
the distribution of the angles β_i between the cords and the second principal axis,
where 0 ≤ α_i, β_i ≤ π. The third histogram describes the distribution of the norms
of cords ||c_i||, so that the smallest value is zero and the largest value corresponds to
the norm M of the longest cord. In order to provide independence with respect to
the number of cords, all three histograms are normalized using the total number of
cords m. Therefore, for each triangle we compute
c_i = g_i − m_I,  ||c_i|| = √( (g_{x_i} − m_x)² + (g_{y_i} − m_y)² + (g_{z_i} − m_z)² ),
M = max_{1≤i≤m} ||c_i||,  α_i = arccos( (g_{x_i} − m_x)/||c_i|| ),  β_i = arccos( (g_{y_i} − m_y)/||c_i|| ).  (2.2)
The number of bins in each histogram is equal to N, whence the dimension of the
cords-based feature vector f is equal to 3N,

f = (h_1^(1), …, h_N^(1), h_1^(2), …, h_N^(2), h_1^(3), …, h_N^(3)).  (2.3)
The forming of the feature vector f is described by the following pseudocode:

h_i^(k) = 0, i = 1, …, N, k = 1, 2, 3;
for i = 1, …, m
    k = ⌈N · α_i/π⌉;      h_k^(1) ← h_k^(1) + 1/m;
    k = ⌈N · β_i/π⌉;      h_k^(2) ← h_k^(2) + 1/m;        (2.4)
    k = ⌈N · ||c_i||/M⌉;  h_k^(3) ← h_k^(3) + 1/m;
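A sketch of steps (2.1)-(2.4), assuming the first two principal axes of the model are supplied (the bin clamping for boundary values and the numpy layout are our own choices):

import numpy as np

def cords_descriptor(centroids, areas, axes, N=16):
    # Compute the cords c_i = g_i - m_I per (2.1)-(2.2) and populate the
    # three normalized histograms of (2.4).
    g = np.asarray(centroids, dtype=float)
    S = np.asarray(areas, dtype=float)
    m_I = (S[:, None] * g).sum(axis=0) / S.sum()   # center of gravity (2.1)
    c = g - m_I                                    # cords
    norms = np.maximum(np.linalg.norm(c, axis=1), 1e-12)
    M = norms.max()
    alpha = np.arccos(np.clip(c @ axes[0] / norms, -1.0, 1.0))
    beta = np.arccos(np.clip(c @ axes[1] / norms, -1.0, 1.0))
    h = np.zeros(3 * N)
    for values, vmax, offset in ((alpha, np.pi, 0), (beta, np.pi, N),
                                 (norms, M, 2 * N)):
        k = np.ceil(N * values / vmax).astype(int)  # k = ceil(N*v/vmax)
        k = np.clip(k, 1, N) - 1                    # clamp to bins 1..N
        np.add.at(h, offset + k, 1.0 / len(c))      # h_k += 1/m
    return h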
The statistical moments of the mesh are computed from the triangle centers of
gravity weighted by the surface areas,

M_qrs = Σ_{i=1}^m S_i (g_{x_i} − m_x)^q (g_{y_i} − m_y)^r (g_{z_i} − m_z)^s,  (2.5)

where S_i (1.6) denotes the surface area of triangle T_i (1.2). The authors did not
give any details on how to organize the moments M_qrs in order to compose a feature
vector. Therefore, we tested this approach in the following manner. Firstly, we
observe that M_000 = Σ_{i=1}^m S_i = S (1.6) and M_100 = M_010 = M_001 = 0. Indeed,
from (2.1) and (2.5), we obtain for M_100 (similarly for M_010 and M_001)

M_100 = Σ_{i=1}^m S_i (g_{x_i} − m_x) = Σ_{i=1}^m S_i g_{x_i} − m_x Σ_{i=1}^m S_i = m_x S − m_x S = 0.
We regard the sum q + r + s as the order of the moment, and form the feature
vector using all moments of order 2 to N , i.e., 2 ≤ q + r + s ≤ N . The dimension of
the moments-based feature vector f is equal to (N + 1)(N + 2)(N + 3)/6 − 4. The
vector is formed as described by the algorithm in figure 2.2. Note that an embedded
multi-resolution representation is provided (1.8).
Figure 2.2: Algorithm for forming the feature vector based on statistical moments.
The computation of the dimension dim, when the vector is composed according
to the algorithm in figure 2.2, is given by

dim = Σ_{k=2}^N Σ_{q=0}^k (k − q + 1) = Σ_{k=2}^N (k² + 3k + 2)/2 = (N + 1)(N + 2)(N + 3)/6 − 4.
We set N = 12 and obtain a feature vector of dimension 451, whereby all lower-dimensional
vectors (for N = 2, …, 11) are contained in the largest vector. The
average extraction time of the feature vector with 451 components is 84 ms, for a
model from the MPEG-7 set (section 5.1), on a PC with a 1.4 GHz AMD processor
running Windows 2000.
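A sketch of the vector forming described by figure 2.2, using the moments (2.5); the iteration order over (q, r, s) is one plausible choice that preserves the embedded multi-resolution property (1.8):

import numpy as np

def moments_descriptor(centroids, areas, N=12):
    # All moments M_qrs (2.5) with 2 <= q + r + s <= N, ordered by
    # increasing total order; for N = 12 the dimension is 451.
    g = np.asarray(centroids, dtype=float)
    S = np.asarray(areas, dtype=float)
    d = g - (S[:, None] * g).sum(axis=0) / S.sum()  # centered centroids
    f = []
    for order in range(2, N + 1):
        for q in range(order + 1):
            for r in range(order - q + 1):
                s = order - q - r
                f.append(np.dot(S, d[:, 0]**q * d[:, 1]**r * d[:, 2]**s))
    return np.array(f)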
The authors [105] considered the scale of a model as a feature and did not
provide any means to secure scaling invariance. This contradicts invariance
requirement 3 in section 1.3.4. Consequently, the retrieval performance
is very poor (see section 5.2.12). However, even if we fix the scale (section 3.4), the
effectiveness is not good enough. Therefore, we conclude that the approach based on
statistical moments possesses rather theoretical than practical significance.
We implemented both cords and moments-based descriptors and evaluated their
performance (section 5.2.12).
The same authors proposed a wavelet transform-based descriptor [105], which
is rather underspecified. The main idea is to voxelize a 3D-model and to apply a
wavelet transform in each dimension. However, no information about the voxelization
is given, i.e., neither the region of voxelization nor the way voxel attributes are
computed is specified. The authors proposed to use Daubechies-4 (DAU4)
wavelets. After performing the wavelet transform, the obtained coefficients are pro-
cessed further. Firstly, the logarithm of each coefficient is computed, in order to
enhance the coefficients corresponding to fine details, whose magnitudes are usually
small. Then, the sums of logarithms at each level of resolution are computed and
regarded as components of the wavelet transform-based feature vector.
The web-based 3D model retrieval system [99] uses methods presented in [103,
105].
• CC2: the center of the unit cube coincides with the center of gravity of a mesh
model, while the length of edge is four times the average distance of points on
the model’s surface to the center of gravity (see definition 4.4).
All cubes, CBC, EBB and CC2, have faces parallel to the coordinate hyper-planes.
Note that CBC and EBB encompass the whole object, while there might be some
parts of the model outside the CC2. The CC2 is created in order to improve the
robustness with respect to outliers (requirement 5 in section 1.3.4).
The unit cube, which is determined by its diagonal points bmin = (bx , by , bz ) ∈
R3 and bmax = (b̄x , b̄y , b̄z ) ∈ R3 (bx < b̄x , by < b̄y , and bz < b̄z ), is subdivided into
(2N + 1)3 (N ∈ N) cubes γijk ⊂ R3 (−N ≤ i, j, k ≤ N ). The cube γijk is defined
by its diagonal points pijk and pijk + d, where
p_ijk = b_min + ( (i + N) d_x, (j + N) d_y, (k + N) d_z ),  d = (d_x, d_y, d_z) = (b_max − b_min)/(2N + 1).
The attributed values v_ijk are normalized so that

Σ_{i=−N}^N Σ_{j=−N}^N Σ_{k=−N}^N v_ijk = 1.
The average extraction times, for various dimensions of the feature vector based
on equivalence classes, are given in table 2.1. The descriptors are extracted using
our implementation of the approach. The results are obtained on a PC with a
1.4 GHz AMD processor running Windows 2000, using the MPEG-7 set of models
(section 5.1).
N 1 2 3 4 5 6 7 8 9 10 11
dim 4 10 21 39 66 104 155 221 304 406 529
Time [ms] 17 26 34 45 53 63 74 86 100 117 135
Table 2.1: Average extraction times (in milliseconds) of the descriptor based on
equivalence classes.
According to our evaluation (section 5.2.12), the best choice of the unit cube is
EBB (definition 4.3). The feature vector based on equivalence classes of dimension
406 (N = 10, i.e., 21 × 21 × 21 grid) possesses better overall retrieval performance
than vectors of lower dimensions. However, the vector of dimension 155 is just
slightly less effective, whence if there is a strong requirement for a compact feature
representation, then we recommend dim = 155. The descriptor based on equiv-
alence classes significantly outperforms the cords and moments-based descriptors
from section 2.1.
The curvature index of triangle T_j is computed from the estimated principal
curvatures k'_j ≥ k''_j,

Γ_j = 1/2 − (1/π) arctan( (k'_j + k''_j)/(k'_j − k''_j) ),  k'_j ≥ k''_j.  (2.8)

Before computing Γ_j, a check for singularities is performed.
Two triangles Ti and Tj are considered to be adjacent if they share a common
vertex. Let Λj be the set of indices of all triangles adjacent to Tj and σj be the
sum of surface areas given by
Λ_j = {i | T_i adjacent to T_j},  σ_j = S_j + Σ_{i∈Λ_j} S_i.  (2.9)
If the number of triangles adjacent to Tj (the cardinality of the set Λj ) is less than
5, then the area σj is regarded as an area of singular surface.
The mean normal vector of the local surface around T_j is estimated as the
normalized area-weighted sum

n̄_j = ( S_j n_j + Σ_{i∈Λ_j} S_i n_i ) / || S_j n_j + Σ_{i∈Λ_j} S_i n_i ||,  (2.10)

where S_i and n_i are the surface area and normal vector of triangle T_i, while Λ_j is
given by (2.9). Obviously, inconsistent orientations of triangles lead to wrong values
of the mean normal vectors. Hence, the triangle mesh is supposed to be orientable
(definition 1.2) in order to enable the correct estimation of the mean normal vector.
The authors stress [88] that adjacent triangles of the second order (adjacent
to adjacent) and higher order (further recursion) can be considered. In that case,
the computational complexity increases proportionally to the size of the set Λ j .
However, retrieval performance will not be improved [88].
Local fitting of a parametric 3D-surface is performed through the centers of
gravity of the triangle T_j and the adjacent triangles T_i (i ∈ Λ_j). The parametric
surface is quadratic, expressed as a second order polynomial,

z = f(x, y) = a_0 x² + a_1 y² + a_2 xy + a_3 x + a_4 y + a_5,  a_0, …, a_5 ∈ R.  (2.11)
where t is a given threshold (recommended 0.1 ≤ t ≤ 0.4) and σj is the total area
of local surfaces defined by (2.9).
After all triangles Tj of the mesh are processed, the shape spectrum descriptor
f is composed of a histogram of the curvature indices Γ_j (2.8) and two additional
components describing the proportion of planar and singular surfaces f(x, y) (2.11).
If N is the number of histogram bins, then the dimension of the feature vector f
is dim = N + 2. The exact algorithm for populating feature vector components is
given by the following pseudocode.
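A minimal sketch of that population step, assuming the curvature indices Γ_j ∈ [0, 1] are binned weighted by relative surface area, with the two extra components appended (the exact weighting is our assumption):

import numpy as np

def shape_spectrum(curvature_indices, rel_areas, planar, singular, N=100):
    # Histogram of curvature indices over [0, 1] with N bins, weighted by
    # relative area; components N and N+1 hold the proportions of planar
    # and singular surfaces.
    f = np.zeros(N + 2)
    bins = np.minimum((np.asarray(curvature_indices) * N).astype(int), N - 1)
    np.add.at(f, bins, np.asarray(rel_areas, dtype=float))
    f[N], f[N + 1] = planar, singular
    return f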
The MPEG-7 eXperimentation model [84] includes C++ source code that
we use for generating the shape spectrum descriptor. On average, a descriptor is
extracted in 1.92 s on a PC with a 1.4 GHz AMD processor running Windows
2000, using the models from the MPEG-7 collection (section 5.1). The extraction
time does not depend on the dimension of the feature vector, but on the complexity
of models. We also tested the extraction when adjacent triangles of the second
order are used for forming the set Λj (2.9). In this case, the average extraction time
amounts to 5.37 s. However, our tests confirm the claim from [88] that the retrieval
performance is not better if adjacent triangles of the second order are considered.
The descriptor is particularly effective in the case of articulated modifications
of models (e.g., different poses of a model of human), while its major drawback is
a high sensitivity with respect to the levels of detail or different tessellations of an
object. Also, there is a requirement for meshes to be orientable. An inconsistent
orientation of triangles of a mesh deteriorates the discriminant power of the shape
spectrum descriptor.
In [88], it is stated that a pre-processing step consisting of a mesh subdivision
algorithm (up to 2 subdivision levels) could strongly increase the accuracy of the
curvature estimation. However, our experience suggests that the increase of the
accuracy of the curvature estimation will not lead to the increase of retrieval per-
formance, because the feature vector will still be sensitive to different tessellations
and inconsistent orientation. Our evaluation results (section 5.2.12) show that the
shape spectrum descriptor is generally the most inferior descriptor, which is tested
in this thesis.
The method does not require any pose normalization step, because the invariance
with respect to similarity transforms is provided by the definition of descriptor. No
restrictions regarding closedness or orientability of the mesh models are imposed.
The main idea of the approach is to represent topology information by a
multiresolutional Reeb graph (MRG). The MRG is generated using a suitable function
µ(u), which approximates the integral of geodesic distances from a point u ∈ I on the
surface of the 3D-model to all other points of the point set I (1.5), defined by

µ(u) = ∫_{p∈I} g(u, p) dv,  (2.15)
where g(u, p) denotes the geodesic distance between the point u and the point p
on the surface of the model I. Obviously, the value of µ(u) depends on the scale of the
object. To avoid the scale dependency, the function µ is normalized as

µ_n(u) = ( µ(u) − min_{p∈I} µ(p) ) / ( max_{p∈I} µ(p) − min_{p∈I} µ(p) ).  (2.16)
In order to balance the trade-off between the computational costs and the qual-
ity of the approximation of µ, certain pre-processing steps, aimed at reorganizing
topology and geometry of the mesh, are performed. The feature extraction algo-
rithm can be summarized into the following steps:
• Resampling of a triangle mesh model;
• Generation of short-cut edges;
• Selection of base vertices;
• Computation of geodesic distances;
• Construction of the multiresolutional Reeb graph;
• Representation of the multiresolutional Reeb graph.
Since a value of the function µn is assigned to each vertex of the mesh, the distri-
bution of the vertices should be fine enough and as uniform as possible. Therefore,
it is usually necessary to resample the triangles (figure 2.4) until all edge lengths are
less than a threshold τ . The authors do not specify how to determine the threshold.
Figure 2.4: Resampling until all edge lengths are less than a threshold τ .
associated to an edge is the geodesic distance between the end points of the edge.
Therefore, the problem of finding the geodesic distance between two vertices of the
mesh can be interpreted as the problem of finding the shortest path from a point
in the graph (the source) to a destination. An efficient algorithm, proposed by
Dijkstra [24], can be used for simultaneously computing the shortest paths from a
fixed point (node) bi to all other nodes of the graph (see the algorithm in figure 2.6).
Figure 2.6: Dijkstra’s algorithm for finding the shortest path from a base vertex
(source node) bi to all other nodes (vertices) v1 , . . . , vm .
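To make the procedure concrete, the following is a minimal Python sketch of Dijkstra's
single-source shortest-path computation; the adjacency-list encoding, the function name,
and the use of heapq are our assumptions, not part of figure 2.6 or [48].

    import heapq

    def dijkstra(adj, source):
        # adj: dict mapping each node to a list of (neighbor, edge_weight)
        # pairs; returns the shortest distances g(source, v) for all
        # reachable nodes v.
        dist = {source: 0.0}
        heap = [(0.0, source)]                  # priority queue keyed by distance
        while heap:
            d, u = heapq.heappop(heap)
            if d > dist.get(u, float('inf')):
                continue                        # stale queue entry, skip it
            for v, w in adj[u]:
                nd = d + w
                if nd < dist.get(v, float('inf')):
                    dist[v] = nd                # relax the edge (u, v)
                    heapq.heappush(heap, (nd, v))
        return dist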
The set of base vertices B = {b1 , . . . , bN }, which are scattered almost equally on
the surface of the model, is selected using a modification of Dijkstra's algorithm.
Firstly, an arbitrary vertex is selected as b1 and taken as the base point of
Dijkstra's algorithm. The original algorithm (figure 2.6) is modified as
follows:
• Step (∗) is changed to:
if g(bi , va ) < tr , then insert va to LIST ;
where tr is a threshold.
• If LIST is empty, then an arbitrary unvisited vertex is selected as the new base
vertex bi+1 (bi being the current base vertex) and inserted in LIST, g(bi+1 , bi+1 )
is set to 0, and the while loop is repeated (see the sketch after this list).
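Under the same assumptions, the base-vertex selection can be sketched as follows.
This simplified version restarts the search for each new base vertex, whereas the
modification described above continues a single while loop; the effect – each base
vertex claims the vertices within geodesic distance tr – is the same.

    import heapq

    def select_base_vertices(adj, tr):
        # Greedily cover the vertex set: each base vertex claims all
        # vertices within geodesic distance tr; the next base vertex is
        # an arbitrary still-unvisited vertex.
        unvisited = set(adj.keys())
        bases = []
        while unvisited:
            b = next(iter(unvisited))           # arbitrary unvisited vertex
            bases.append(b)
            dist = {b: 0.0}
            heap = [(0.0, b)]
            while heap:                         # Dijkstra restricted to g(b, v) < tr
                d, u = heapq.heappop(heap)
                unvisited.discard(u)
                for v, w in adj[u]:
                    nd = d + w
                    if nd < tr and nd < dist.get(v, float('inf')):
                        dist[v] = nd
                        heapq.heappush(heap, (nd, v))
        return bases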
where σi is the sum of the areas of faces composed of vertices whose distances from bi
are less than tr. It holds that Σ_{i=1}^{N} σi = S. A specification of how to select
the areas when computing the value of σi is not given. In particular, we consider that the
handling of special cases is important. The special cases include:
• Vertices of triangle Tj are closest to different base vertices;
• A vertex v satisfies g(bi , v) < tr and g(bj , v) < tr .
Typically, for tr = √(0.005 S), around N = 150 base vertices are selected, and the
approximation (2.17) achieves sufficient accuracy.
After computing values of the function µn (2.16) at all vertices v1 , . . . , vm , the
multiresolutional Reeb graph is generated. Construction of the MRG is depicted
using the example in figure 2.7. For simplicity, a height function is used as the
function µn on a 2D-triangle mesh. In figure 2.7(a), an example 2D-triangle mesh
is visualized and four ranges of values of the function µn (briefly, µn -ranges) are
marked. If a triangle is spread across two or more µn -ranges, then a subdivision is
necessary (figure 2.7(b)). For instance, if v1 , v2 , and v3 are vertices of a triangle
and 0.25 < µn (v1 ) < 0.50 < µn (v2 ) < 0.75, then a new vertex v is inserted so that
v = ((µn(v2) − µn(v)) v1 + (µn(v) − µn(v1)) v2) / (µn(v2) − µn(v1)),
where µn(v) = 0.50. The triangle △v1v2v3 is subdivided into the triangles △v1vv3
and △vv2v3. This process is repeated until each edge lies in one µn-range. Then,
the nodes of the MRG are identified (bold polygonal lines in figure 2.7(c)). The
nodes and edges of the MRG at the finest resolution (level 2) are visualized in figure
2.7(d), while the other two resolutions are shown in figure 2.7(e-f). Note that there
is a parent-child relationship between nodes at different levels, e.g., n6 is the parent of
n0, n1, and n2.
Finally, certain attributes of all nodes from the MRG are computed and stored
together with both the connectivity and parental information. The stored repre-
sentation of the MRG is regarded as a 3D-shape descriptor. As a first approach,
the authors suggested using a pair of values (a(ni), l(ni)) to attribute the node ni.
The value of a(ni) is the quotient of the area of triangles related to ni and the total
area S, while l(ni) is the ratio of the length len(ni) of the node ni to the sum of
lengths Σj len(nj), where nj are the nodes at the finest level. The length of a node
is defined in [48]. The recommended number of µn-ranges is 64, i.e., an MRG with
7 resolution levels (2^6 = 64) is tested in [48].
In the results presented in [48], the average feature extraction time, i.e., the
average time for constructing an MRG, is around 15 s for a triangle mesh of 10000
vertices. The results were obtained on a PC with a 400 MHz Pentium II processor
running Linux. For 64 µn-ranges, the average number of nodes in the MRG is
around 300. The average matching time between two MRGs is approximately 0.05 s,
which the authors regard as quick. We consider that the extraction
procedure takes a reasonable amount of time, because in a retrieval system features
can be extracted at idle time. However, we disagree with the authors regarding the
quickness of the matching procedure. For a given query, if the matching procedure
has to be performed 1000 times, then the retrieval system responds after 50 seconds.
We consider this amount of time to be unacceptable for interactive applications.
We also feel that the effectiveness of the topology matching breaks down when
models are complex (geometrically and topologically). We expect more frequent
occurrences of the problem depicted in figure 2.8 as well as unexpected matching
pairs during the similarity calculation procedure. Our assumptions comply with
remarks presented in [11].
where ⌈·⌉ is the ceiling function. The components of the feature vector based on
shape distributions, f = (f1 , . . . , fdim ), are defined by
fi = [ (10^6/Σ) gi ],  Σ = g1 + . . . + gdim, (2.22)
where [·] denotes the rounded value. Thus, the dimension of the feature vector f is
determined by the number of vertices of the piecewise linear function g. The factor
10^6/Σ normalizes the values of fi so that the vector components do not depend on
the number of samples N .
The authors suggest that the best choice of the geometric function is D2, while
N = 10242 samples, and dim = 64 are recommended parameter settings. The
number of bins B is tested for values 1024 and 64 (B = dim). The feature extraction
procedure for the case B = dim, when the D2 function is used, is summarized in
the algorithm shown in figure 2.9. Note that B = dim ⇒ Σ = N .
Figure 2.9: Feature extraction algorithm of the shape distribution descriptor, for
the case B = dim, when the D2 function is used.
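As an illustration, the extraction for the case B = dim can be sketched in Python
as follows; the area-weighted barycentric point sampling stands in for (2.19), and the
histogram range [0, max distance] is our assumption.

    import numpy as np

    def d2_descriptor(V, T, N=1024**2, B=64, rng=np.random.default_rng(0)):
        # V: (n, 3) array of vertices, T: (m, 3) array of triangle indices.
        A, Bv, C = V[T[:, 0]], V[T[:, 1]], V[T[:, 2]]
        area = 0.5 * np.linalg.norm(np.cross(Bv - A, C - A), axis=1)
        prob = area / area.sum()                # area-weighted triangle choice

        def sample(k):
            t = rng.choice(len(T), size=k, p=prob)
            r1, r2 = rng.random(k), rng.random(k)
            s = np.sqrt(r1)                     # uniform barycentric sampling
            return ((1 - s)[:, None] * A[t] + (s * (1 - r2))[:, None] * Bv[t]
                    + (s * r2)[:, None] * C[t])

        d = np.linalg.norm(sample(N) - sample(N), axis=1)   # D2 distance samples
        g, _ = np.histogram(d, bins=B, range=(0.0, d.max()))
        return np.rint(1e6 * g / g.sum()).astype(np.int64)  # (2.22), here Σ = N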
The average extraction time of the shape distribution descriptor is around 1.12s,
when the algorithm from figure 2.9 is used for generating descriptors of the models
from the MPEG-7 collection (5.1), on a PC with a 1.4 GHz AMD processor running
Windows 2000.
This approach satisfies all requirements listed in section 1.3.4. In spite of satis-
fying the desirable requirements, the discriminant power of the shape distribution
descriptor is poor (see section 5.2.12 as well as [36]). In general, purely statistical
shape feature vectors do not have sufficient discriminant power, because compo-
nents of the vector are not related to a specific spatial region. The results and
examples given in [97, 98] are based on a very small collection of only 133 3D-
models. Some examples show a good discrimination between shape distributions
for cubes, spheres, cylinders, and other geometric primitives. However, the situ-
ation is totally different when dealing with complex 3D-meshes, i.e., the retrieval
performance of descriptors based on shape distributions is not good enough.
Note that some points of the point set I′ lie outside the region of voxelization. Those
points are ignored as outliers. The binary voxel grid of a model of an airplane is
shown in figure 2.10.
Then, a function fr(θ, ϕ) (θ ∈ [0, π], ϕ ∈ [0, 2π]) on the sphere Ωr, with the
center at the origin and radius r, is defined using (2.24) and (2.25),
fr(θ, ϕ) = vabc, where ηabc ∋ (r sin θ cos ϕ, r cos θ, r sin θ sin ϕ). (2.26)
In our implementation of this approach, we use B = 64, i.e., we have 16384 sample
values fr,a,b for each function fr (θ, ϕ). The samples obtained using the binary voxel
grid are visualized on the right side of figure 2.10. In this visualization, neighboring
samples whose values are equal to 1 are treated as vertices of triangles. Samples
whose values are 0 are not visualized. Functions on successive concentric spheres
are differently colored.
Figure 2.10: Approach based on a binary voxel grid: a 3D model (left) is repre-
sented by a binary voxel grid (middle), which is used to define binary functions on
concentric spheres (right).
For each function fr (2.26), the fast Fourier transform on the sphere Ωr (see
section 4.6.1) is applied to the sample values fr,a,b ∈ {0, 1}. As a result, we obtain
B² complex coefficients f̂r,l,m ∈ C (0 ≤ |m| ≤ l < B). A signature sr of the function
fr is defined using the property 4.1, in order to attain invariance of the signature
with respect to rotation of the underlying object (see remark 4.1). The signature is
generated using the first L bands of spherical harmonics, i.e.,
sr = (||fr,0||, . . . , ||fr,L−1||), where ||fr,l|| = √( Σ_{m=−l}^{l} |f̂r,l,m|² ). (2.27)
The feature vector f based on a binary voxel grid is formed by concatenating the
signatures of spherical functions,
f = (s1 |s2 | . . . |sR ) = (||f1,0 ||, . . . , ||f1,L−1 ||, . . . , ||fR,0 ||, . . . , ||fR,L−1 ||). (2.28)
than filtering the noise by voxelizing the model. The results presented in [142], where
the descriptor based on a binary voxel grid is compared to a descriptor formed using
a ray-casting technique (see section 4.6.6), comply with our assumptions.
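For completeness, assembling the signature (2.27) and the concatenated vector (2.28)
from precomputed spherical harmonic coefficients might look as follows; the coefficient
layout is our assumption.

    import numpy as np

    def band_energies(coeffs, L):
        # coeffs[l]: the 2l+1 complex coefficients of band l; returns the
        # rotation-invariant band energies ||f_{r,0}||, ..., ||f_{r,L-1}|| (2.27).
        return np.array([np.sqrt(np.sum(np.abs(coeffs[l]) ** 2))
                         for l in range(L)])

    def concentric_sphere_descriptor(per_sphere_coeffs, L):
        # Concatenate the signatures of all R concentric spheres (2.28).
        return np.concatenate([band_energies(c, L) for c in per_sphere_coeffs])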
The discriminant power of the descriptor based on a binary voxel grid is limited
because of the following reasons:
• If the resolution of voxelization R is low, then the representation of the model
by a binary voxel grid is too rough (coarse);
• If we increase R, then the functions on concentric spheres, corresponding to
similar models whose scaling factors are slightly different, can be mismatched.
The problem of mismatched binary functions is illustrated in figure 2.11. For
simplicity, a 2D case is depicted, i.e., instead of voxels we have square fields and
instead of functions on spheres we have functions on circles. At the resolution R,
functions on the circle (analogous to (2.26)) of both objects are identical. However,
at the resolution 2R the functions are almost complementary. The depicted example
represents an extreme situation, which rarely happens. Nevertheless, a similar
mismatching is present, because the method relies upon the center of gravity of a
3D-mesh model (center of concentric spheres) and the average distance of a point
on the surface to the center of gravity (scaling factor). We anticipate that similar
objects may have slightly different scaling factors, so that the function fr^(a) of the
model Ia matches fr±1^(b) of the model Ib.
Figure 2.11: At the resolution R, the functions on the circle are identical for both
objects, while at the resolution 2R, the functions are almost complementary. Darker
parts of the circles denote the value of 0, while brighter parts denote the value of 1.
An alternative, which weights each sample by the radius of the corresponding sphere, is
fr(θ, ϕ) = r · vabc, where ηabc ∋ (r sin θ cos ϕ, r cos θ, r sin θ sin ϕ). (2.29)
The experimental results (section 5.2.12) show that (2.29) is a significantly better
choice for defining fr than (2.26).
The authors do not specify how the points (i.e., normal vectors) ni on the unit
sphere are chosen. Also, the vector dimension dim is not suggested. We assume
that the normal vectors are uniformly distributed across the unit sphere.
According to [56], for a 64 × 64 × 64 voxel grid (R = 32), the voxelization
(2.25) takes between 0.2 and 3.5 seconds, the computation of the exponentially de-
caying distances (2.31) takes 1.2 seconds, and the estimation of reflective symmetry
distances takes 5 seconds. Hence, the feature extraction lasts between 6.4 and 9.7
seconds. The results are obtained on an 800 MHz Athlon processor with 512 MB
of RAM.
The reflective symmetry descriptor is invariant with respect to translation, scal-
ing, and reflection, while the invariance with respect to rotation is not provided.
The evaluation results in [55, 56] are based on a 3D-model collection of manually
oriented objects. In practice, such a 3D-shape descriptor, which is highly sensitive
even to rotation of a mesh model by a very small angle around a coordinate axis,
cannot be useful. Hence, the importance of the approach presented in [55, 56] is
mainly theoretical.
Figure 2.12: Models that are considered similar may have significantly different
components of the reflective symmetry feature vector.
The authors stress that because of the global nature of the reflective symmetry
distance, a large difference in the values for a single component of the feature vector
(2.32) “provides a provable indication that two models are significantly different”.
However, we disagree with this statement for two reasons. First, since
rotation invariance is not provided, models would have to be perfectly aligned (oriented).
Obviously, if a model is rotated by an angle around an arbitrary axis, then the
reflective symmetry descriptor will be significantly different. Second, a counter-
example to the claim is depicted in figure 2.12. Namely, we have models that are
considered relevant (similar) to each other, but which differ by articulated modifications. Since
there is no perfect alignment for these models, a number of components of the
corresponding reflective symmetry feature vectors of similar models will always be
significantly different.
Limitations of the retrieval effectiveness of the reflective symmetry descriptor are
also discussed in [56]. Instead of 3D-model retrieval, registration is considered as
another possible usage of the presented technique.
ψr(x, y, z) = a0 + a1 x² + a2 y² + a3 z² + a4 xy + a5 xz + a6 yz
            = [ x y z ] · [ m1 m4 m5 ; m4 m2 m6 ; m5 m6 m3 ] · [ x y z ]^T, (2.34)
where x² + y² + z² = 1 and
m1 = ψr(1, 0, 0),   m4 = ψr(1/√2, 1/√2, 0) − (m1 + m2)/2,
m2 = ψr(0, 1, 0),   m5 = ψr(1/√2, 0, 1/√2) − (m1 + m3)/2,
m3 = ψr(0, 0, 1),   m6 = ψr(0, 1/√2, 1/√2) − (m2 + m3)/2.
ψr(X, Y, Z) = b1 X² + b2 Y² + b3 Z² = (b1, b2, b3).
s′r = (c1, c2, c3, ||fr,1||, ||fr,3||, ||fr,4||, . . . , ||fr,L−1||). (2.36)
Thus, the feature vector is normalized to the Euclidean unit length. The suggested
parameter settings are R = 32 and L = 16, whence dim = 544.
The feature vector based on the exponentially decaying EDT significantly out-
performs (see section 5.2.12) the descriptor based on a binary voxel grid (subsec-
tion 2.5.2). Note that in both cases, almost the same concept is used to represent
features. The only difference is the post-processing, which actually does not sig-
nificantly improve retrieval effectiveness, but is regarded as a valuable theoretical
result.
The significantly better retrieval performance of the approach based on the
exponentially decaying EDT, compared to the approach based on a binary voxel
grid, can easily be explained. Namely, if a voxel attribute describes how far an
arbitrary point is from the model, then the problem of mismatched functions on a
sphere (see figure 2.11) is almost eliminated. The extracted feature, the voxel grid
with attributes defined by (2.33) represented in the described manner, satisfies
all the requirements on 3D-shape descriptors, which are listed in section 1.3.4.
In general, the descriptor based on the exponentially decaying EDT significantly
outperforms all other descriptors presented in this chapter.
Chapter 3
Pose Estimation
Figure 3.1: Models of cars are initially given in arbitrary units, position, and orien-
tation (a,b, and c). The outcome of the pose estimation procedure is the canonical
positioning of each model (d, e, and f).
where the point set I is given by (1.5). We have set σ(I) := {σ(v)|v ∈ I} and
similarly for τ .
There have been several approaches for estimating the pose of a 3D mesh model
[52, 103, 105, 143, 147, 91], the most prominent one being the Principal Component
Analysis (PCA) that produces an affine transformation of the space R3 . The basics
as well as certain extensions of the PCA are given in the next sections.
Pose estimation of a 3D-mesh model based on the Extended Gaussian Images
(EGIs) is one of the first approaches reported in the literature. An EGI defines a
function on a unit sphere, by using normal vectors of faces of the mesh. The method
where |V | denotes the number of elements of the set V (i.e., the cardinal number).
The associated covariance matrix of the same data set is given by
CV = [cij]n×n = E{(v − mV)(v − mV)^T} = (1/|V|) Σ_{v∈V} (v − mV)(v − mV)^T. (3.4)
If two components of the data vectors are uncorrelated, their covariance is zero (cij = cji = 0). The element cii represents the variance of the
component vi , which indicates the spread of the component values around its mean
value. Eigenvalues and eigenvectors of the covariance matrix are used to form an
orthonormal basis of the space Rn . We recall that the eigenvectors ei (||ei || = 1)
and the corresponding eigenvalues λi are the solutions of equations
CV ei = λ i ei (i = 1, . . . , n). (3.5)
We stress that in the case of a symmetric non-negative definite matrix all eigenvalues are
non-negative real numbers. By ordering the eigenvectors according to the order of
descending eigenvalues, we obtain an orthonormal basis with the first eigenvector
coinciding with the direction of largest variance of the set V (3.2). Directions of
largest variance are usually regarded as directions in which the original data set
possesses the most significant amounts of energy.
Let A be a matrix consisting of ordered eigenvectors of the covariance matrix as
the row vectors. The ordered eigenvectors can be seen as the basis of a new coordinate
frame with the origin placed at the point mV . We regard the new coordinate
system as the PCA coordinate system (frame). A data vector v ∈ V from the
original system is transformed into the vector p in the PCA frame,
p = A(v − mV ). (3.6)
In the PCA frame data are uncorrelated, i.e., the non-diagonal elements of the
covariance matrix are equal to zero.
Before explaining how we engage the PCA for 3D-model retrieval purposes, we
present applications of the PCA in data compression and image processing. These
applications are motivation for a set of experiments aimed at reducing dimension-
ality of 3D-shape feature vectors, without significant loss in retrieval effectiveness
(section 5.3).
Data can be compressed using the PCA in the following manner. The original
vector v, which is projected on the coordinate axes of the PCA frame (3.6), can be
reconstructed by applying an affine map to the projection p given by,
v = AT p + mV , (3.7)
where we used the property of an orthogonal matrix A−1 = AT (AT denotes the
transpose of matrix A). If we do not use all the eigenvectors of the covariance
matrix, the data can be represented in a lower dimensional space, whose dimension is
determined by the number of used eigenvectors, i.e., basis vectors of the orthonormal
basis.
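A compact numpy sketch of this compression scheme, assuming the data set is given
as a matrix with one data vector per row, is:

    import numpy as np

    def pca_compress(X, k):
        # X: (N, n) matrix of data vectors; returns projections, the
        # reduced basis, and the mean vector m_V (3.3).
        m = X.mean(axis=0)
        C = np.cov(X - m, rowvar=False, bias=True)   # covariance matrix (3.4)
        w, E = np.linalg.eigh(C)                     # ascending eigenvalues
        Ak = E[:, ::-1][:, :k].T                     # first k eigenvectors as rows
        return (X - m) @ Ak.T, Ak, m                 # p = A_k (v - m_V)

    def pca_reconstruct(P, Ak, m):
        return P @ Ak + m                            # v ≈ A_k^T p + m_V (3.7)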
Let Ak be the matrix consisting of the first k (ordered) eigenvectors as the row
vectors. By substituting A with Ak in equation (3.6), we obtain the projection pk = Ak(v − mV).
Hence, we project the original data vector from an n-dimensional linear metric space
Rn on a new k-dimensional vector space Rk , whose orthonormal basis consists of
where t ∈ [0, 1] is a threshold. In this case, the total amount of energy (information)
is approximately consistent with a varying dimensionality k. Both alternatives are
applied to concatenated 3D-shape feature vectors and tested in section 5.3.
Thus, dealing with a lossy compression gained by the PCA introduces a trade-off
between the reduction of vector dimension (we want to simplify the representation
as much as possible) and the loss of information (we want to preserve as much as
possible of the original information content). The PCA offers convenient mecha-
nisms (fixed k vs. fixed t) to control this trade-off.
Properties of the PCA can be depicted using an application in image processing.
Suppose that we have a color image of dimensions M × N . Each pixel is represented
by a triplet of red, green, and blue (RGB) component values. We consider that each
image consists of three bands, i.e., three grayscale images each of which represents
pixel values of the corresponding color. If we want to generate a single grayscale
image so that the most details are shown, then we apply the PCA to the set of 3D
points, which are obtained by treating color triplets of pixels as points in R 3 . An
example RGB image and the outcome of the PCA are shown in figure 3.2. Pixels
of the given image are represented as points in a 3D space, where x, y, and z
axes represent values of red, green, and blue components, respectively. The first
eigenvector (P1 ) having the largest eigenvalue points to the direction of largest
variance (spread) whereas the second (P2 ) and the third (P3 ) eigenvectors are
orthogonal to the first one.
Figure 3.2: The analyzed image (left) and the pixels of the image in the color space
(right). Axes x, y, and z represent values of red, green, and blue components,
respectively. The PCA coordinate axes are denoted by P1 , P2 , and P3 .
Figure 3.3: The grayscale images formed from the red, green, and blue band are
shown in the first row, while the grayscale images in the second row are formed by
normalizing coordinates in the PCA frame.
analogously.
Let Si and gi be the area of the triangle Ti (1.2) and the center of gravity,
respectively. For simplicity of notation we may assume that the triangles of a 3D-
model intersect only on subsets of measure zero so that we may write the overall
surface in the model as (compare (1.6))
S := S1 + . . . + Sm = ∫∫_I ds. (3.10)
where S′i is the sum of the areas of all triangles that have pi as a vertex.
Hence, the weights wk are defined by wk = n S′k / (3S). Obviously, Σ_{k=1}^{n} wk = n.
The covariance matrix (3.4) of the set I is approximated as follows
CI ≈ (1/n) Σ_{i=1}^{n} wi (pi − mI)(pi − mI)^T. (3.13)
The rest of the PCA procedure remains the same, i.e., after forming the rotation
matrix A we transform the set of vertices P (1.3)
In this way, translation and rotation invariance, as well as robustness with respect
to levels of detail and different tessellations of a mesh model, are achieved.
Paquet et al. [105] used centers of gravity of triangles to form the input for the
PCA. Each center of gravity is weighted by the area of the corresponding triangle,
and the PCA is applied to the obtained set of vectors. Thus, the matrix CI is
approximated by
CI ≈ (1/m) Σ_{i=1}^{m} Si (gi − mI)(gi − mI)^T. (3.15)
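A possible numpy rendering of the approximation (3.15), under the assumption that
mI is computed as the area-weighted mean of the centroids, is:

    import numpy as np

    def covariance_centroids(V, T):
        # V: (n, 3) vertices, T: (m, 3) triangle indices; approximates C_I
        # by area-weighted triangle centroids as in (3.15).
        A, B, C = V[T[:, 0]], V[T[:, 1]], V[T[:, 2]]
        S = 0.5 * np.linalg.norm(np.cross(B - A, C - A), axis=1)  # areas S_i
        g = (A + B + C) / 3.0                                     # centroids g_i
        m = np.average(g, axis=0, weights=S)       # center of gravity (assumed)
        d = g - m
        outer = np.einsum('ij,ik->ijk', d, d)      # (g_i - m)(g_i - m)^T
        return (S[:, None, None] * outer).sum(axis=0) / len(T)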
where p′i = (x′i, y′i, z′i) (3.14). Then, we form a diagonal matrix
F = diag(sign(fx), sign(fy), sign(fz)).
We calculate a scale factor s by
s = (1/S) Σ_{i=1}^{m} Si ||(p′Ai + p′Bi + p′Ci)/3||, (3.17)
After the exact computation of CI (in section 3.3 the covariance matrix is just
approximated), the remaining part of the PCA follows the standard procedure.
Since the matrix CI is a symmetric real matrix its eigenvalues are real and the
eigenvectors orthogonal. We calculate the eigenvalues of CI , sort them in decreasing
order, compute the corresponding eigenvectors and scale them to the Euclidean unit
length. We form the rotation matrix A, which has the scaled eigenvectors as rows.
We regard the described approach of applying the PCA to the whole point set I as
the Continuous PCA (CPCA).
A new point set I2 is obtained by translating (3.12) and rotating the set I using
the vector mI (3.11) and matrix A,
I2 := A · (I − mI ) = {v | v = A · (u − mI ), u ∈ I}. (3.22)
To ensure the reflection invariance we multiply points in I2 by a diagonal matrix
F = diag(sign(fx ), sign(fy ), sign(fz )), where fx is computed by
fx = (1/S) ∫∫_{v′∈I2} sign(v′x) |v′x|^p ds,  p = 2, 3, . . .  (fy, fz similar), (3.23)
and v′ = (v′x, v′y, v′z) ∈ I2. Thus, we set f(v′) = sign(v′x)|v′x|^p in (3.18).
We performed tests for p = 2 and p = 3 and concluded that better results were
obtained for p = 2. For p = 2, the value of fx is analytically computed by
fx = (1/(6S)) Σ_{i=1}^{m} Fi^x Si,
Fi^x =  Ji^x,            if x′Ai, x′Bi, x′Ci ≥ 0;
        Ji^x − 2Li^x,    if x′Ai < 0 and x′Bi, x′Ci ≥ 0;         (3.24)
       −Ji^x + 2Li^x,    if x′Ai ≥ 0 and x′Bi, x′Ci < 0;
       −Ji^x,            if x′Ai, x′Bi, x′Ci < 0;
Ji^x = (x′Ai)² + (x′Bi)² + (x′Ci)² + x′Ai x′Bi + x′Ai x′Ci + x′Bi x′Ci,
Li^x = (x′Ai)⁴ / ((x′Bi − x′Ai)(x′Ci − x′Ai)),
where A(pi − mI) = (x′i, y′i, z′i) ∈ I2. Note that the condition x′Ai < 0, x′Bi, x′Ci ≥ 0
is interpreted as "two x-coordinates of triangle vertices are non-negative and one
is negative", while x′Ai ≥ 0, x′Bi, x′Ci < 0 covers the opposite case (one non-negative,
two negative). This means that if we originally have, for example, x′Bi < 0 and
x′Ai, x′Ci ≥ 0, then we exchange the values of x′Ai and x′Bi so that the condition
x′Ai < 0, x′Bi, x′Ci ≥ 0 is fulfilled. The exchange of values corresponds to a renaming
(reordering) of vertices.
Scaling invariance is achieved by scaling the set I2 by the inverse of a scaling
factor s. We have four approaches for calculating the scaling factor s. As the first
approach, we take
s = davg , (3.25)
where davg is the average distance of a point on the surface of a model I to the
center of gravity of the model (i.e., the new coordinate origin). By setting
f(v′) = ||v′|| = √(v′ · v′) in (3.18), v′ = (v′x, v′y, v′z) ∈ I2 (3.22), we obtain
davg = (1/S) ∫∫_{v′∈I2} √(v′ · v′) ds
     = (2/S) Σ_{i=1}^{m} Si ∫_0^1 dα ∫_0^{1−α} || α p′Ai + β p′Bi + (1 − α − β) p′Ci || dβ, (3.26)
where p′Ai = (x′Ai, y′Ai, z′Ai) = A(pAi − mI), and similarly for p′Bi and p′Ci.
Since the computation of the integrals in (3.26) is too expensive, we approximate
the value of davg by sampling the surface of the mesh uniformly. The following
pseudocode describes our algorithm for approximating davg :
Alternatively, one could approximate the average distance of a point on the surface
from the center of gravity by randomly selecting N ≫ 0 points vi ∈ I, 1 ≤ i ≤ N,
using (2.19), and computing davg ≈ Σ_{i=1}^{N} ||vi|| / N. The approximation
obtained by using the presented pseudocode provides a better trade-off between
accuracy and efficiency than the approximation by random sampling.
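The random-sampling alternative can be sketched as follows; the barycentric sampling
formula is our stand-in for (2.19), and the mesh is assumed to be already translated
and rotated (the set I2).

    import numpy as np

    def davg_random(V, T, N=100000, rng=np.random.default_rng(0)):
        # Approximates d_avg as the mean of ||v_i|| over N random surface points.
        A, B, C = V[T[:, 0]], V[T[:, 1]], V[T[:, 2]]
        area = 0.5 * np.linalg.norm(np.cross(B - A, C - A), axis=1)
        t = rng.choice(len(T), size=N, p=area / area.sum())
        r1, r2 = rng.random(N), rng.random(N)
        s = np.sqrt(r1)                            # uniform point in each triangle
        pts = ((1 - s)[:, None] * A[t] + (s * (1 - r2))[:, None] * B[t]
               + (s * r2)[:, None] * C[t])
        return np.linalg.norm(pts, axis=1).mean()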
As the second method, we calculate a continuous scaling factor
s = √( (sx² + sy² + sz²) / 3 ), (3.27)
where sx , sy , and sz denote the average distances of points v ∈ I2 (3.22) from the
yz-, xz-, and xy-coordinate hyperplanes, respectively, i.e., by setting f(v′) = |v′x|
in (3.18), we have
sx = (1/S) ∫∫_{v′=(v′x,v′y,v′z)∈I2} |v′x| ds, and likewise for sy, sz. (3.28)
Analytically, sx is computed by
sx = (1/(3S)) Σ_{i=1}^{m} Mi^x Si,
Mi^x =  |x′Ai + x′Bi + x′Ci|,           if x′Ai x′Bi, x′Bi x′Ci, x′Ai x′Ci ≥ 0;    (3.29)
        |x′Ai + x′Bi + x′Ci| − 2Ki^x,   if x′Ai x′Bi, x′Ai x′Ci ≤ 0 and x′Bi x′Ci ≥ 0;
Ki^x = (x′Ai)³ / ((x′Bi − x′Ai)(x′Ci − x′Ai)).
As the third alternative, we take
s = sx, (3.30)
i.e., to make the average distance (3.28) from the yz-coordinate hyperplane constant,
equal to 1.
Finally, as the fourth option, we take
s = √λ1, (3.31)
where λ1 is the largest eigenvalue of the covariance matrix CI (3.21). Thus, the
energy of the first principal component is normalized.
Our experiments (see section 5.2) suggest that the first approach, s = d avg , is
the best choice.
Putting all the above together, the affine map τ (3.1), defined by
In contrast to the usual application of the PCA, we work with sums of integrals
over triangles (3.19) in place of sums over vertices, which makes our approach more
thorough, as it takes into account all points of the model I (1.5) with equal weight. The
calculation of the integrals is only slightly more expensive. The time needed to
calculate parameters and transform an object into the canonical frame (normaliza-
tion time) linearly depends on the complexity of the object. Figure 3.4 comprises
scatter plots of normalization time vs. complexity of 3D-mesh models, using the
CPCA (section 3.4), when (3.27) is used to fix the scale, and the two modifications
of the PCA (section 3.3). We selected the number of triangles as a parameter that
denotes the complexity of a 3D-model (x-axis) and measured normalization times
(in milliseconds) for each of the three cases. The tests were carried on a computer
running Windows 2000 Professional, with 1GB RAM and an 1.4 GHz AMD pro-
cessor. The most complex model consists of 215473 triangles and the continuous
normalization step of this model takes 589ms. Each scatter plot is approximated
by a line y = α x + β representing the optimal approximation in the mean-square
sense. The values of α are the following: for the CPCA, α = 0.00277; for the
modification with the weights associated to centroids (3.15), α = 0.00189; and for the
modification with the weights associated to vertices (3.13), α = 0.00146. This means
that the time needed for the normalization step when we use the CPCA is approximately
twice the time needed for normalization when the modification proposed in [143]
is used. The method based on weights associated to vertices (3.13) is faster than
the approach based on weights assigned to centroids (3.15), because a triangle
mesh model possesses, on average, more triangles than vertices (see section 5.1).
Figure 3.4: Scatter plots of normalization time vs. complexity of 3D-mesh models,
using the CPCA and modifications with weights associated to centroids and vertices.
Average normalization times are given in brackets.
The average normalization time of the continuous approach is 28.5ms, while the
normalization based on weights assigned to vertices is completed in 15.0ms. We
consider the average difference of 13.5ms to be slight.
To demonstrate differences between the three presented normalization methods
we give examples in figure 3.5. As expected, the CPCA is the most stable. However,
in figure 3.5c the canonical frame obtained by using the modification with weights
associated to vertices produces a better positioning of the model, even though the
method does not treat all the points of the set I (1.5) equally. The better
performance of the discrete approach over the continuous method is just a special
case. A different tessellation of the model in figure 3.5c may cause varying principal
directions P′1, P′2, and P′3. To explore the stability of all three approaches, we
inserted k · m (k ∈ {1, 3, 10, 30, 100}) random vertices in triangles Ti (1 ≤ i ≤ m),
and performed a re-triangulation. For instance, if p ∈ Ti is inserted in the list
of vertices V (1.3), then the triangle Ti = △pAi pBi pCi is erased from the list
of triangles T (1.2) and the new triangles △pAi pBi p, △pBi pCi p, and △pCi pAi p
are added to T. As the number (k · m) of inserted vertices increases, P′1 and P″1
converge to P1. The continuous principal axis P1 remains invariant with respect
to the number of inserted vertices.
Figure 3.5: Examples of pose estimation using the three presented approaches. The
CPCA axes are denoted by P1, P2, and P3, the axes obtained by using the modifi-
cation with weights associated to vertices are denoted by P′1, P′2, and P′3, and the
axes obtained by using the modification with weights associated to centroids are
denoted by P″1, P″2, and P″3.
Our normalization method is very efficient and rather effective for many cat-
egories of 3D-objects. However, it is not perfect. Examples of pose estimation
using the continuous approach for categories of cups, cars, and models of humans
are depicted in figure 3.6. Models of cups in the first row have different rotations,
reflections, and assignments of principal axes. In a way, the category of cups is
sub-classified by our normalization step. However, models of cars in the second row
are almost ideally transformed. The outcome of our normalization step for mod-
els of humans is also satisfying, regardless of the presence of outliers (third model in
the last row). Our approach based on the CPCA is suitable for categories of 3D-
objects such as cars, bottles, humans, missiles, swords, ships, glasses, dogs, horses,
etc. Conversely, categories of models of chairs, cups, airplanes, trees, etc., are not
consistently aligned in the canonical coordinates, whence these categories are sub-
classified. Moreover, examples given in figure 3.6 motivate us to improve the scaling
factor (third model in the last row) as well as to fix the reflection invariance in a
more stable way (last model in the second row).
Figure 3.6: Examples of pose estimation using the continuous approach. All models
are visualized from the positive side of the z-axis, with the x-axis pointing to the
right.
Since our pose normalization step does not align all models in an ideal way, a
natural question is:
Should we eliminate 3D-shape descriptors that require the use of the PCA, and
focus only on those in which the invariance with respect to similarity transforms
is provided by the definition of the descriptor?
Shape descriptors that are presented in sections 2.3, 2.4, and 2.5 do not require
the use of the PCA. We recall that the topology matching (section 2.4) relies on a
graph representation, which is invariant with respect to rotation of a mesh model.
The MPEG-7 shape spectrum descriptor (section 2.3) is based on local features
(curvature indices). Shape distributions (subsection 2.5.1) are extracted from rela-
tive features. Both the shape spectrum and shape distributions are represented in
a form of histogram. Finally, in section 2.5.2 a technique that uses certain math-
ematical properties to secure rotation invariance is presented. As a contrast to all
these approaches, our feature vectors (chapter 4) heavily rely upon the pose nor-
malization step, because we mostly consider absolute features. In section 5.2.13, we
compared approaches that use the CPCA vs. approaches that avoid the PCA, and
the results show that descriptors that use the PCA outperform the others.
Explanations for these results are the following:
• For categories of models that are suitable for the PCA, good object alignments
make absolute features effective;
• When a class of models is not uniformly oriented by the CPCA, then it is usually
subdivided into classes of consistently oriented models. Absolute features are
effective on a subclass of consistently oriented models;
Figure 3.7: Shape-similarity search for a cup and a desk using a descriptor that is
inherently invariant with respect to rotations around the z-axis. In this case, the
task of the CPCA is to fix the z-axis correctly. The models are visualized from the
positive side of the z-axis in the canonical coordinate frame, with the x-axis pointing
to the right.
Since no reported technique that avoids the use of the PCA shows better perfor-
mance (see section 5.2) than our best methods relying upon the CPCA, we consider
that the use of the PCA is justified. Moreover, we expect that further improvements
of the normalization step will result in increased retrieval performance of methods
relying upon pose estimation.
Chapter 4
In this chapter, we present our original methods for describing 3D-shape, which are
listed in table 4.1. Since the optimal way of encoding information about 3D-shape is
not prescribed, we consider a variety of different features to define shape descriptors
(feature vectors). The approaches range from considering 2D rectangular images
(SIL,DBD) and images on a sphere (RAY,RSH,SSH,CSH) to exploiting 3D-features
(VOL) and volumetric data (VOX). Besides, we define moments (MOM) and a novel
data structure – layered depth sphere (LDS,RID), which are used for describing
shape. Finally, a concept of hybrid descriptors (HYB), obtained by crossbreeding
complemental feature vectors, is introduced.
We follow the general 3D-model retrieval algorithm 1.1, and most of our feature
vectors are extracted in the canonical coordinate frame of a 3D-model. In other
words, feature extraction follows after the complete normalization step (see section
3.4). The only exception is the rotation invariant feature vector based on layered
depth spheres (RID), where we use certain properties to provide rotation invariance
of the descriptor without orienting the model using the CPCA.
We stress that our techniques are not restricted to closed and orientable polygonal
meshes (see definitions 1.1 and 1.2). Since we consider a 3D-model to be the point
set I (1.5), we do not require the polygonal mesh to be closed or orientable in order
to extract any of the descriptors.
We describe all our feature vectors in detail, in order to provide sufficient infor-
mation to a reader who wants to implement and test our methods. For each feature
vector, we give the average feature extraction time, which does not include times
needed for loading and normalizing a 3D-mesh model. Retrieval performance of the
feature vectors, which are presented in this chapter, is addressed in section 5.2.
r : S² → [0, +∞), r(u) = max{ r ≥ 0 | ru ∈ I ∪ {O} }, (4.1)
where I is given by (1.5). The function r(u) measures the extent of the object from
the origin O in the directions given by u1 , . . . , uN . The ray-based feature vector f
of a model is composed as
In [143] the vertices of a dodecahedron, with the center at the coordinate origin,
are taken as directional vectors (N = 20). The corresponding feature vector pos-
sesses a fixed number of components. In order to provide a multi-resolution rep-
resentation of the feature, which is a desirable property of descriptors (section 1.3.4,
property 6), we engaged [141] the dodecahedron's dual solid – the icosahedron (defined
in figure 1.2). Each triangle of the icosahedron is subdivided into k² (k = 1, 2, . . .)
congruent triangles (see figure 4.1).
A similar subdivision could be done using the dodecahedron, e.g., by subdividing
each pentagon into five congruent triangles, which could recursively be subdivided
further. Since the subdivision of the dodecahedron is less uniform, we decided to use the
subdivision of the icosahedron. Hence, after the subdivision we obtain 20k² triangles
with a total of 10k² + 2 distinct vertices vi (i = 1, . . . , 10k² + 2). Each vertex vi
is projected on the sphere S² by finding its unit vector ui = vi/||vi||. The unit
vectors ui are used as directional vectors. Thus, a variable dimension of the
feature vector is provided, i.e., N = 10k² + 2 (k = 1, 2, . . .).
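A sketch of generating the N = 10k² + 2 directional vectors by subdividing the
icosahedron faces and projecting the lattice points onto the unit sphere; deduplication
of shared vertices by rounding is our simplification.

    import numpy as np

    def ray_directions(ico_verts, ico_faces, k):
        # ico_verts: (12, 3) icosahedron vertices, ico_faces: (20, 3) index
        # triples; returns the 10 k^2 + 2 unit direction vectors.
        pts = set()
        for ia, ib, ic in ico_faces:
            A, B, C = ico_verts[ia], ico_verts[ib], ico_verts[ic]
            for i in range(k + 1):
                for j in range(k + 1 - i):
                    v = (i * A + j * B + (k - i - j) * C) / k
                    u = v / np.linalg.norm(v)        # project onto S^2
                    pts.add(tuple(np.round(u, 12)))  # merge shared vertices
        return np.array(sorted(pts))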
Our feature extraction procedure is very efficient. The ray-triangle intersection
problem has been well-studied [6, 80, 40]. In figure 4.2, a ray u, cast from the
coordinate origin O, intersects the plane of triangle ABC at the point P . Using
the fact that P is collinear with u, P = d · u, and representing P by barycentric
coordinates, P = αA + βB + (1 − α − β)C, we obtain the system of three equations,
with unknown quantities d, α, and β. If d > 0 and if α and β satisfy the conditions
α ≥ 0, β ≥ 0, and α + β ≤ 1, (4.3)
then the point of intersection P belongs to the triangle. Instead of solving the system
of equations using Cramer’s rule directly and testing the conditions (4.3), we created
a faster method aimed at reducing the total number of arithmetic operations. If n
is the normal vector of the plane containing the triangle ABC and point P , then it
holds n · P = n · A. Therefore, we compute the value of d directly, d = (n · A)/(n · u).
If d > 0, then we determine if P = d · u lies inside the triangle ABC by checking
the conditions
((A − P ) × (B − P )) · n ≥ 0,
((B − P ) × (C − P )) · n ≥ 0, and
((C − P ) × (A − P )) · n ≥ 0,
which are similar to (4.3). Since we measure the extent r of a polygonal mesh in the
direction u, the value of r is initialized to zero, r = 0. If a triangle is intersected by
the ray travelling along u, then we check whether the computed extent d is greater than
the current one, r = max{r, d}. Our ray-triangle intersection method is described by
the pseudocode in figure 4.3, as algorithm 1. The algorithm also deals with special
cases, i.e., it examines whether the triangle is degenerate (surface area equal to zero) as
well as whether the triangle plane is parallel to u.
Möller and Trumbore [80] proposed a method that is believed to be the fastest
ray-triangle intersection routine for triangles, which do not have precomputed plane
P = d · u = α(A − C) + β(B − C) + C
d > 0 ∧ α ≥ 0 ∧ β ≥ 0 ∧ α + β ≤ 1 ⇒ P ∈ △ABC

[ −u | A − C | B − C ] · [ d α β ]^T = −C

By Cramer's rule,

[ d α β ]^T = (1 / (−u · ((A − C) × (B − C)))) · [ −C · ((A − C) × (B − C)), −u · (−C × (B − C)), −u · ((A − C) × (−C)) ]^T
            = (1 / ((A − C) · p)) · [ (B − C) · q, −C · p, u · q ]^T,  where p = u × (B − C), q = (A − C) × C.
equations. Their method is depicted in figure 4.2. Instead of solving the system
of equations directly by Cramer’s rule, the number of arithmetic operations is re-
duced by appropriate reordering of computations. Besides, the statistical analysis
presented in [80] suggests that it is better to test first if the point P belongs to
the triangle ABC (4.3), and then to calculate the extent d. In order to reduce the
number of operations further, we slightly modified the algorithm given in [80] by
eliminating two divisions. The modified method is described by the pseudocode in
figure 4.3, as algorithm 2. Obviously, algorithm 1 performs more arithmetic
operations than algorithm 2. Note that if the normal vector of the triangle is
precomputed, the average operation count of algorithm 1 drops.
The ray-based feature vector of a 3D-model with m triangles can be extracted
using a brute force method, by intersecting 10k² + 2 rays with all m triangles. The
brute force approach (”exhaustive search”) was used in [141]. In order to accelerate
the feature extraction procedure, we created an algorithm, which is more suitable
for our application than previously presented methods [39, 13, 3, 40, 1]. There is a
variety of accelerated ray-casting techniques, which use various data structures in-
cluding octrees, bounding volume hierarchies, spatial partitions, and uniform grids.
For complex 3D-scenes with a lot of objects, a binary space partitioning tree (or
BSP Tree) [34] is a frequently used data structure that organizes objects within a
space. A common problem of techniques that use 3D-space subdivision [39] is that
a ray which misses everything must still be checked against contents of each region
or voxel it intersects. The idea of space subdivision was extended in [3] by including
ray direction. The space of all rays is adaptively subdivided into equivalence classes
E1 , . . . , El . Candidate object sets C1 , . . . , Cl are constructed in such a way that
Algorithm 1:
    d = 0;
    // r – the current extent in the direction u
    // ε = 10^−10 (a small positive value)
    h = (B − A) × (C − A);
    // Is the triangle degenerate?
    if ||h|| > ε   // NOT degenerate
        // Calculate the normal unit vector
        n = h / ||h||;
        // Is u orthogonal to n?
        if |n · u| > ε   // NOT orthogonal
            d = (n · A) / (n · u);
            if d > r
                P = d · u;
                // Is P inside △ABC?
                if ((A − P) × (B − P)) · n ≥ 0 ∧
                   ((B − P) × (C − P)) · n ≥ 0 ∧
                   ((C − P) × (A − P)) · n ≥ 0
                    r = d;

Algorithm 2:
    d = 0; CA = A − C; CB = B − C;
    p = u × CB;
    det = CA · p;
    if det > ε
        αs = −C · p;   // α = αs/det
        if αs ≥ 0 ∧ αs ≤ det
            q = CA × C;
            βs = u · q;   // β = βs/det
            if βs ≥ 0 ∧ αs + βs ≤ det
                d = (CB · q)/det;
    else if det < −ε
        αs = −C · p;   // α = αs/det
        if αs ≤ 0 ∧ αs ≥ det
            q = CA × C;
            βs = u · q;   // β = βs/det
            if βs ≤ 0 ∧ αs + βs ≥ det
                d = (CB · q)/det;
    if d > r
        r = d;
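For reference, a direct Python transcription of algorithm 2 could read as follows; the
function name and the numpy dependency are our assumptions.

    import numpy as np

    EPS = 1e-10

    def extend_ray(u, A, B, C, r):
        # Computes the distance d along the unit ray u (cast from the origin)
        # to the triangle ABC, deferring the divisions as in algorithm 2, and
        # returns the updated extent max(r, d).
        CA, CB = A - C, B - C
        p = np.cross(u, CB)
        det = CA.dot(p)
        d = 0.0
        if det > EPS:
            alpha_s = -C.dot(p)                    # alpha = alpha_s / det
            if 0.0 <= alpha_s <= det:
                q = np.cross(CA, C)
                beta_s = u.dot(q)                  # beta = beta_s / det
                if beta_s >= 0.0 and alpha_s + beta_s <= det:
                    d = CB.dot(q) / det
        elif det < -EPS:
            alpha_s = -C.dot(p)
            if det <= alpha_s <= 0.0:
                q = np.cross(CA, C)
                beta_s = u.dot(q)
                if beta_s <= 0.0 and alpha_s + beta_s >= det:
                    d = CB.dot(q) / det
        return max(r, d)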
Ci contains all objects that can be intersected by a ray from Ei . A method for
ray tracing triangular meshes presented in [1] requires preprocessing and a complex
data structure is attached to each triangle. The most significant drawback of the
approach [1] is the exhaustive search. Another problem of the method is a require-
ment that all triangles of mesh models must be correctly oriented, which is usually
not the case if the models are retrieved from the Internet.
Our technique for ray-casting triangular meshes does not need any preprocessing
step for triangles and there are no restrictions regarding orientations of triangles.
The number of ray-triangle intersections is significantly reduced compared to the
brute force method. Therefore, we consider that our technique is very suitable for
extracting the ray-based feature vector. The key idea of the method is finding a set
of rays that are candidates to intersect a triangle under consideration. Firstly, all
directional unit vectors
ui = ui(ϕi, θi) = (cos ϕi sin θi, sin ϕi sin θi, cos θi), i = 1, . . . , N,
−π < ϕi ≤ π, 0 < θi ≤ π, (4.4)
are computed and sorted. We stress that the sorting of directional vectors is not performed for each 3D-model,
but only once. The quicksort algorithm sorts an array of n elements with O(n log n)
comparisons on average. In the worst case, the complexity is O(n²). However, we
use the introsort algorithm, which is a part of the SGI Standard Template Library
(STL) [42]. Introsort is very similar to quicksort, and is at least as fast as quicksort
on average. A characteristic that makes introsort more attractive is that its
worst-case complexity is equal to the average-case complexity, O(n log n).
As a contrast to all previously mentioned techniques, our algorithm efficiently
determines a set of rays which are candidates for intersection with a given triangle.
We simply calculate the range of angles ϕi and θi of directional vectors, which could
intersect the triangle. For a triangle T , we analytically determine ϕmin , ϕmax , θmin ,
and θmax , where
Then, a necessary condition that the ray ui (ϕi , θi ) can intersect the triangle T is:
Figure 4.4: Ratio of the number of ray-triangle intersections using our approach
and using the brute force method vs. the parameter k. The number of rays is equal to 10k² + 2.
The speed-up of our approach over the brute force method for ray-casting trian-
gular meshes is shown in table 4.2, where the average extraction times for various
dimensions of the ray-based feature vector are obtained using the official MPEG-
7 collection (section 5.1), on a PC with a 1.4 GHz AMD processor running
Windows 2000. The brute force method is used with the
ray-triangle intersection algorithm 1, while our method for ray-casting triangular
meshes is tested with both algorithm 1 and algorithm 2. We observe that the most
efficient way of extracting the ray-based descriptor is to combine our ray-casting
approach with the algorithm 2.
Table 4.2: Average extraction times for various dimensions of the ray-based feature
vector using the brute force method, our approach combined with the algorithm 1,
and our approach combined with the algorithm 2 (figure 4.3).
In order to visualize the feature vector and to depict characteristics of the ray-
based descriptor, we give two examples in figure 4.5. Examples of feature extraction
for vectors of dimension N = 42 (k = 2) are shown, where wire-frame representa-
tions of triangular meshes are visualized. The model on the left side possesses 400
triangles and is obtained by simplifying the model with 800 triangles using QSlim
1.0 simplification software [37]. For each model, rays are emanated from the origin
(center of mass) in given directions, and the furthest points of intersections with
the underlying mesh-model are found, i.e., the function r (4.1) is sampled at points
ui (4.4).
The ray-based 3D-shape descriptor in the spatial domain is suitable for retrieving
some categories of 3D models (e.g., missiles, cars, swords, etc.). However, we noticed
certain properties of the feature vector that aggravate its retrieval performance:
• Low-dimensional vectors do not capture sufficient information about the object;
• The lp metric (1.9) is not effective in the spatial domain;
• Very similar 3D-models can have large differences between specific feature vector
components.
Indeed, if we take a small number of samples (e.g., k = 2, N = 42), then not all the
parts of a model are captured. For instance, in figure 4.5 the front legs of models are
missed in both examples. However, if we increase the number of samples (i.e., the
dimension of the feature vector), then, in spite of capturing more information about the
model, the retrieval performance usually decreases or stagnates (see section 5.2.1).
As mentioned in section 1.4, the reason for the decrease of retrieval performance
with the increase of vector dimension (the number of samples) is the ineffectiveness
of the lp (1.9) metric in the spatial domain.
Another problem is depicted in figure 4.5. Since the same model is represented in
two levels of detail, the triangle meshes differ slightly. Therefore, the corresponding
components of their feature vectors should approximately be the same. However, the
back left leg of the bull is missed by all rays in one case (right), while it is intersected
f800 = (2.30, 1.16, 1.14, 0.83, 0.90, 0.64, 1.77, 0.69, 0.62, 1.09, 1.23, 1.65, 2.57, 1.43,
1.20, 2.19, 1.15, 1.22, 1.04, 1.17, 1.25, 1.08, 0.97, 1.15, 0.76, 0.88, 0.72, 2.80,
1.09, 1.73, 1.62, 1.64, 0.61, 1.10, 0.65, 0.53, 0.86, 0.60, 0.54, 1.29, 0.85, 1.82),
( |f400^1 − f800^1|, . . . , |f400^42 − f800^42| ) =
(0.03, 0.01, 0.02, 0.01, 0.07, 0.03, 0.03, 0.02, 0.00, 0.00, 0.00, 0.02, 0.05, 0.04,
1.80, 0.03, 0.05, 0.00, 0.02, 0.02, 0.01, 0.02, 0.01, 0.01, 0.03, 0.01, 0.04, 0.04,
0.09, 0.02, 0.05, 0.01, 0.01, 0.03, 0.01, 0.03, 0.07, 0.01, 0.02, 0.00, 0.08, 0.05).
Figure 4.5: A problem of the ray-based shape descriptor in the spatial domain. Ray-
based feature vectors (N = 42) of models with 400 and 800 triangles are visualized.
The feature vector instances of both models as well as componentwise differences
are given. Regardless of high similarity between the models, vector components
may significantly differ (bold values).
with one of the rays in the other case (left). This implies a large difference between
the corresponding components of the feature vectors as shown in figure 4.5, where
the component corresponding to the ray traveling in the direction of the back left
leg is in bold.
Hence, it is desirable to increase significantly the number of samples of a model,
but still to characterize its shape with feature vectors of reasonable dimensions.
Also, large differences between components of feature vectors of very similar models
should be avoided. To achieve these goals as well as to strengthen the discriminant
power of the descriptor, we use a different representation of the ray-based feature.
For instance, spherical wavelets [118] or spherical harmonics [45, 78] can be engaged.
Spherical wavelets presented in [118] are based on the geodesic sphere construction
starting with an icosahedron using the same subdivision as in figure 4.1. The ray-
based feature with spherical harmonic representation is described in section 4.6.2.
Figure 4.6: Silhouette images of an airplane model obtained by projecting the model
on the coordinate hyper-planes yOz, zOx, and xOy, respectively.
The next step is finding the outer contour. We consider the image as an N × N
matrix C = [cij ], where the elements cij (i, j = 1, . . . , N ) denote pixel attributes of
a monochrome image. Attribute 1 denotes background pixels (white) and attribute
0 denotes silhouette pixels (black). For fixed i and j, pixel cij is a contour point
if its attribute is 0 and at least one of its four closest neighbors (east, south, west,
and north pixels) belongs to the background. A contour Γ is made up of a cyclic
sequence of contour points,
Γ = [ ci0 j0 , . . . , ciL−1 jL−1 ], L ∈ N,
where L denotes the length of the contour, which depends on the silhouette image.
Moreover, the contour points with indices p and q = (p + 1) mod L, p = 0, . . . , L − 1,
are neighboring pixels.
In the case that there is more than one contour in the silhouette image, which may
occur either if the object possesses holes or is composed from disjoint parts, we
process the longest contour (with the largest L).
Since the contour length L is not constant, we select K points to process further,
where K is fixed. The selected points are used for forming a sequence SK defined
by
SK = { s0, . . . , sK−1 }, si ∈ { (amax/N)(c1 − O), . . . , (amax/N)(cL − O) } ∪ {(0, 0)}, (4.8)
where O = (N/2, N/2) is the center of the image and amax is defined by (4.6). The
multiplication of contour points by amax /N is necessary in order to provide invari-
ance with respect to outliers. We tested the following two methods for selecting
points from the set Γ ∪ {O}, in order to generate the sequence SK (4.8):
a) Adjacent selected points have constant distance along the contour.
The points si (4.8) are determined by
si = (amax/N) ( Γ[ ⌊iL/K⌋ ] − O ), (4.9)
where ⌊t⌋ ∈ N denotes the greatest integer that is not greater than t ∈ R.
b) Polar angles of adjacent selected points differ by a constant.
Let ( ρ(cp ), ϕ(cp ) ), ρ(cp ) ≥ 0, −π < ϕ(cp ) ≤ π, be coordinates of the contour
point cp = (ip , jp ) in the polar coordinate system with the origin O, i.e.,
where j is the imaginary unit and fˆp ∈ C are Fourier coefficients. If K is a power
of 2, then a fast Fourier transform (FFT) algorithm [148] can be used. The com-
plexity O(K²) of the DFT can be significantly reduced with the FFT algorithm
(O(K log₂ K)). We tested the following three possibilities for defining the values fi
using the selected contour points si :
1. Consider si as points in the complex plane,
fi = ρi · e^{jϕi}; (4.12)
2. Consider only the polar radii of the points si,
fi = ρi; (4.13)
After the evaluation (section 5.2.2), we decided to use the second approach (4.13).
The absolute values of the obtained coefficients, |fˆp | (4.11), are used to form the
silhouette-based feature vector. Namely, for each of the three contours, we take
the first k absolute values of coefficients as components of the vector, whence
the dimension of the vector is equal to 3k. More precisely, if {f̂0^(1), . . . , f̂K−1^(1)},
{f̂0^(2), . . . , f̂K−1^(2)}, and {f̂0^(3), . . . , f̂K−1^(3)} are the coefficients obtained from
the three silhouettes, then the components of the corresponding feature vector f
(dim(f) = 3k) are defined in the following way:
f = ( |f̂0^(1)|, |f̂0^(2)|, |f̂0^(3)|, |f̂1^(1)|, |f̂1^(2)|, |f̂1^(3)|, . . . , |f̂k−1^(1)|, |f̂k−1^(2)|, |f̂k−1^(3)| ). (4.15)
Since the samples fi are real-valued, the coefficients f̂p and f̂K−p are complex
conjugates, whence their absolute values are equal, |f̂p| = |f̂K−p|. Thus, we propose
to choose the values of k and K so that k ≤ K/2.
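Putting the pieces together, a minimal sketch of computing the silhouette signature
from the K sampled radii (4.13), using the DFT scaling of (4.11) and the interleaving
of (4.15), is:

    import numpy as np

    def contour_signature(radii, k):
        # radii: the K values f_i = rho_i sampled along one contour (4.13);
        # returns |f^_0|, ..., |f^_{k-1}|, which are approximately invariant
        # with respect to rotations of the silhouette; choose k <= K/2.
        K = len(radii)
        fhat = np.fft.fft(radii) / np.sqrt(K)      # DFT convention of (4.11)
        return np.abs(fhat[:k])

    def silhouette_feature(r1, r2, r3, k):
        # Interleave the signatures of the three silhouettes as in (4.15).
        s = [contour_signature(r, k) for r in (r1, r2, r3)]
        return np.column_stack(s).ravel()          # dim(f) = 3k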
We stress an interesting property of the DFT, which makes the magnitudes
of obtained coefficients (approximately) invariant with respect to rotation of the
underlying silhouette image. Suppose that {f0 , . . . , fK−1 } is the original sequence
of samples, and {g0 , . . . , gK−1 } is a sequence of samples after rotating the original
silhouette image for an arbitrary angle φ. Approximately, the original samples are
shifted so that
gi′ ≈ fi, 0 ≤ i, i′ ≤ K − 1, (4.17)
where i′ = (i − t) mod K and t = [Kφ/(2π)] ∈ N (t is a rounded value).
Then, the absolute values of Fourier coefficients fˆp , obtained by transforming the
original sequence, are approximately equal to the absolute values of coefficients ĝ p ,
obtained by transforming the shifted sequence. The higher the value of K, the
better the approximation. Indeed, using (4.11) and (4.17), we have
f̂p = (1/√K) Σ_{i=0}^{K−1} fi e^{−j2πip/K}
   = e^{−j2πtp/K} (1/√K) Σ_{i=0}^{K−1} fi e^{−j2π(i−t)p/K}
   ≈ e^{−j2πtp/K} (1/√K) Σ_{i=0}^{K−1} gi e^{−j2πip/K}
   = e^{−j2πtp/K} · ĝp,
whence |f̂p| ≈ |ĝp|. (4.18)
As an example, the rotated silhouette images, which are shown in figure 4.9, will
be characterized by approximately the same magnitudes of Fourier coefficients. We
stress that it can also be proven that the magnitudes of obtained coefficients are
invariant with respect to reflections of a mesh around the coordinate hyper-planes.
Therefore, it is not necessary to reflect a mesh model by multiplying its vertices
with the matrix F (3.23), during the normalization procedure (section 3.4).
We tested the presented approach for N, K ∈ {128, 256, 512, 1024} (16 different
settings), and we suggest to take N = 256 and K = 256 (see section 5.2.2). In
our implementation, we set k = 100, i.e., the largest dimension of the vector is
300, embedding the vectors of dimensions 297, 294, . . . , 3. With these settings, the
average feature extraction time for models from the MPEG-7 collection (section 5.1),
which have already been positioned in the canonical coordinate frame, amounts to
33.4 ms (on a PC with a 1.4 GHz AMD processor running Windows 2000).
The silhouette-based 3D-shape feature vector fulfills the requirements described
in section 1.3.4. We stated that, if it is not possible to determine a single unique
contour of the silhouette image (for 3D-models with holes or disjoint parts), only
the longest contour is processed. Therefore, certain parts of 3D-objects might not
be described. We stress that method (4.10) can successfully be applied to models
whose silhouettes comprise more than one contour. Instead of a single contour, the
union of all contours is processed, while the presented procedure remains the same.
The four vertices

(x_min, y_min, z_min), (x_min, y_min, z_max), (x_min, y_max, z_max), and (x_min, y_max, z_min) (4.20)

belong to the front clipping plane.
belong to the front clipping plane. The depth-buffer image, which corresponds to the
face (4.20), is formed in the following way. Firstly, the face is subdivided (rasterized)
into N × N congruent rectangles (or squares, if ρ is a cube). Each rectangle (square) corresponds to a pixel of a grayscale image of dimensions N × N.
Let v_ab be the attribute of the pixel at position (a, b), 0 ≤ a, b ≤ N − 1. The value of v_ab is associated to the rectangle (square) ν_ab determined by

ν_ab = {(x_min, y, z) | y_a ≤ y < y_a + d_y, z_b ≤ z < z_b + d_z}, (4.21)

where y_a = y_min + a·d_y, z_b = z_min + b·d_z, d_y = (y_max − y_min)/N, and d_z = (z_max − z_min)/N.
All pixel attributes are set to an initial value (v_ab = 0). A point P = (x_P, y_P, z_P) ∈ I of a 3D-mesh model I (1.5), which lies inside the region ρ (P ∈ ρ), is orthogonally projected on the face (4.20). The projection P′ is determined by the coordinates (x_min, y_P, z_P). If P′ ∈ ν_ab, then the attribute v_ab is updated,

v_ab ← max{ v_ab, (x_max − x_P)/(x_max − x_min) }. (4.22)
Note that vab ∈ [0, 1]. An example of the depth buffer image of a single triangle
is depicted in figure 4.10. Black pixels correspond to the value of 0, white pixels
are attributed the value of 1. Since the pixel corresponding to the point A is
almost white, the point A is the closest to the front clipping plane (x = x min ).
The depth buffer image of a single triangle is formed by computing the attributes
corresponding to the vertices of the triangle, and by filling the interior points. The
depth-buffer image of the whole 3D-mesh model is obtained by taking maximal
values vab (0 ≤ a, b ≤ N − 1) of depth-buffer attributes of all triangles. The
remaining five depth buffer images are analogously generated, by substituting the
front clipping plane (4.20).
Figure 4.10: Generation of a depth-buffer image of a triangle (right), using the face
determined by (4.20).
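As an illustration of (4.21) and (4.22), the following Python sketch computes the depth-buffer image for the front clipping plane x = x_min from a given set of surface points. It is a simplification of the procedure described above (the function name and arguments are ours): a complete implementation rasterizes whole triangles, including their interior points, rather than projecting a point set.

import numpy as np

def depth_buffer_x_min(points, bounds, N):
    # points: (n, 3) array of surface points P = (x, y, z) inside rho
    # bounds: ((xmin, xmax), (ymin, ymax), (zmin, zmax)) of the region rho
    (xmin, xmax), (ymin, ymax), (zmin, zmax) = bounds
    dy = (ymax - ymin) / N
    dz = (zmax - zmin) / N
    v = np.zeros((N, N))                        # initial attributes v_ab = 0
    for x, y, z in points:
        a = min(int((y - ymin) / dy), N - 1)    # pixel indices of the
        b = min(int((z - zmin) / dz), N - 1)    # projection P' = (xmin, y, z)
        depth = (xmax - x) / (xmax - xmin)      # value in [0, 1], cf. (4.22)
        v[a, b] = max(v[a, b], depth)           # keep the closest point
    return v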
The choice of the cuboid region ρ (4.19) is an interesting problem. Note that
only the part of a 3D-mesh model being inside ρ is processed, while a subset of
I (1.5), which is outside ρ, is ignored. As a first choice, we take the canonical
bounding cube (CBC) (see definition 4.1) as the region ρ. We also use faces of two
other cubes, extended bounding box and canonical cube, to form the depth buffer
images. Firstly, we define the bounding box of a 3D-model.
Definition 4.2 The bounding box (BB) of a model I (1.5) is the tightest box, with the faces parallel to the coordinate hyper-planes, which encloses the whole model. The set of vertices of the bounding box is defined by

{ (x_i, y_j, z_k) | i, j, k ∈ {1, 2} }, (4.23)

where x_1 = min_{P∈I} x_P, x_2 = max_{P∈I} x_P, and y_1, y_2, z_1, and z_2 are defined analogously.
It is desirable that the region ρ (4.19) is a cube, in order to keep the correct aspect ratio of depth buffer images. Hence, the bounding box is not suitable for forming depth buffer images, because its faces are, in general, rectangular. Therefore, we extend the bounding box to a cube whose center coincides with the center of the bounding box.
Definition 4.3 The extended bounding box (EBB) of a 3D-model I (1.5) is the cube with edge length e = max{x_2 − x_1, y_2 − y_1, z_2 − z_1}, centered at the point (c_x, c_y, c_z), where

c_x = (x_1 + x_2)/2, c_y = (y_1 + y_2)/2, c_z = (z_1 + z_2)/2, (4.24)

and x_1, x_2, y_1, y_2, z_1, and z_2 are given by (4.23).
Using the CBC (definition 4.1) or EBB (definition 4.3) for forming depth buffer images leads to potential problems with outliers. Obviously, an outlier affects the dimensions of both the CBC and EBB, whence the depth buffer images are affected by the outlier as well. In order to reduce the influence of outliers, we define the canonical cube of a 3D-model.
Definition 4.4 The canonical cube (CC) of a 3D-model I (1.5) is a cube in the canonical coordinate frame, defined by the set of vertices

{ (±w, ±w, ±w) }, w > 0. (4.25)
Thus, the length of edges of the CC is fixed (constant), 2w, the center of the
CC lies at the origin of the canonical coordinate frame, and the faces are parallel
to the coordinate hyper-planes. Note that both the CBC and EBB are specific for
each 3D-model, while the CC is constant, i.e., equal for all models. Using the cube
of constant size in the canonical frame, for forming the depth-buffers, lessens the
impact of outliers on extracted feature vectors. A difficult problem is the selection
of the constant value w. If we choose w to be large enough so that all models lie
inside the cube, then the depth-buffer images are mostly black, i.e., models occupy
only a small area in the middle of each image. We expect that the discriminant power of descriptors relying upon such mostly black depth buffer images is poor. On the other hand, if w is too small, then a significant part of a model may lie outside the canonical cube, whence most of the model is clipped. Consequently, the feature vector cannot capture information about all parts of the model. An experimental
analysis for w = 2, w = 4, w = 8, and w = 16 is presented in section 5.2.3.
The depth-buffer images of dimensions N × N pixels can directly be used as
a feature in the spatial domain. An example is shown in figure 4.11, where the
underlying model is the same as in figure 4.8. The six depth-buffers are visualized
in the first row as the grayscale images. The attributes of all six N × N images can
be taken as components of a feature vector. The dimension of the corresponding
feature vector is 6N². As mentioned in section 1.4, the l_1 and l_2 norms are not suitable for calculating distances between feature vectors in the spatial domain. In order to correlate components of vectors in the spatial domain, a suitable distance metric needs to be defined. Another solution is to correlate spatial features by
transforming them into the spectral domain. Therefore, we use the two-dimensional
discrete Fourier transform (2D-DFT) of depth-buffers to represent the feature in
the spectral domain. Briefly, for a two-dimensional sequence of complex numbers
f_ab ∈ C, a = 0, ..., M − 1, b = 0, ..., N − 1, the Fourier coefficients f̂_pq ∈ C are calculated by

f̂_pq = (1/√(MN)) Σ_{a=0}^{M−1} Σ_{b=0}^{N−1} f_ab e^{−j2π(pa/M + qb/N)}, (4.26)
M      N      speed-up vs. O(MN(M+N))    speed-up vs. O(M²N²)
64     64             10.7                      341.3
128    128            18.3                     1170.3
256    256            32.0                     4096.0
512    512            56.9                    14563.6
1024   1024          102.4                    52428.8

Table 4.3: Approximate speed-up for various image dimensions when the O(MN log(MN)) 2D-FFT is used instead of the other two approaches (the row-column evaluation of complexity O(MN(M+N)) and the direct evaluation of complexity O(M²N²)) for computing the 2D-DFT.
In our case, the attributes vab ∈ [0, 1] (4.22) of a square image of type N × N
(M = N ) serve as the input for the 2D-FFT. We stress that N should be a power
of 2, in order to apply the fast Fourier transform. Before the 2D-FFT (4.26) is
applied, we set
f_ab = v_{a′b′}, (4.27)

where a = (a′ + N/2) mod N, b = (b′ + N/2) mod N, and 0 ≤ a′, b′ ≤ N − 1. Practically, the new array f_ab is obtained by cyclically shifting the original one, v_ab, so that f_00 = v_{N/2,N/2}. After applying the 2D-FFT, the obtained array of complex coefficients f̂_pq is also shifted, so that v̂_{N/2,N/2} = f̂_00. The new array v̂_pq (0 ≤ p, q ≤ N − 1) represents the feature in the spectral (frequency) domain. The arrays v̂_pq are visualized as the grayscale images in the second row of figure 4.11, each of which corresponds to the depth-buffer image above it, by taking min{1, |v̂_pq|} as pixel attributes (grayscale values). Thus, Fourier coefficients with lower frequencies are depicted by pixels that are located in the middle of the image.
Figure 4.11: Extraction of the depth buffer-based shape descriptor. In the first row,
the depth-buffer images are formed using the canonical bounding cube (definition
4.1). In the second row, each depth-buffer is transformed using the 2D-FFT (4.26).
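The cyclic shifts (4.27) correspond to the standard FFT-shift operations, so the spectral representation of one depth-buffer image can be sketched in a few lines of Python/numpy (the function name is ours):

import numpy as np

def spectral_image(v):
    # v: N x N array of depth-buffer attributes v_ab
    N = v.shape[0]
    f = np.fft.ifftshift(v)          # cyclic shift (4.27): f_00 = v_{N/2,N/2}
    f_hat = np.fft.fft2(f) / N       # 2D-FFT with 1/sqrt(MN) = 1/N, cf. (4.26)
    return np.fft.fftshift(f_hat)    # shift back: low frequencies in the middle

The pixel attributes for the visualization in the second row of figure 4.11 are then min{1, |v̂_pq|}.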
Since the input sequence of the 2D-DFT consists of real numbers v_ab, the obtained coefficients v̂_pq possess a symmetry property (4.28), similar to (4.16).

Figure 4.12: Coefficients marked by gray color can be calculated using the symmetry (4.28) of the 2D-DFT. The coefficients that are included in the feature vectors are located inside the denoted boundaries.

Only the magnitudes of the low-frequency coefficients, one representative from each pair related by the symmetry (4.28), are taken as components of the feature vector. The coefficients are located inside the boundaries depicted in figure 4.12. The total number of coefficients inside the marked area is k² + k + 1, whence the dimension of the vector obtained using the six depth-buffers is 6(k² + k + 1). For k = 8, the vector possesses 438 components. Note that the vectors with 342, 258, 186, 126, 78, and 42 components (i.e., for k = 7, 6, 5, 4, 3, 2) can easily be embedded in the vector of dimension 438.
Let the six depth-buffer images be uniquely indexed by i ∈ {1, ..., 6}, and let v̂_pq^(i) (0 ≤ p, q ≤ N − 1) be the Fourier coefficients of the depth-buffer image indexed by i. As an example, the components of the feature vector f of dimension 78 (k = 3) are collected from the selected coefficient magnitudes |v̂_pq^(i)| of all six images.
We set the parameter N to 64, 128, 256, and 512, whereby the average feature extraction times of the depth-buffer descriptor, for models from the MPEG-7 collection (section 5.1), are 41.6 ms, 85.6 ms, 262.8 ms, and 1048.1 ms, respectively (on a PC with a 1.4 GHz AMD processor running Windows 2000). Since N is a power of 2, the 2D-FFT can be used, producing a fast feature extraction. According to the evaluation results (section 5.2.3), we recommend to set N = 256 and to use depth buffer-based vectors of dimension 438.
The choice of the region ρ (4.19), CBC (definition 4.1), EBB (definition 4.3), or
CC (definition 4.4), is an interesting problem. As stated earlier, the negative influ-
ence of outliers is present when the CBC or EBB are used. Thus, the requirement
5 from section 1.3.4 is fulfilled only if the CC is used. To demonstrate the problem,
we give three retrieval examples in figure 4.13, where the query model possesses
an outlier. The query model is obtained by adding a long antenna to the original model of a car. When the CBC and EBB are used for forming depth buffer images,
the top ranked models are not relevant to the query. However, if CC with w = 2
is used, robustness with respect to outliers is achieved, and all top ranked models
are relevant. Note that the first match is the original model, which has been used
to create the query object. We stress that the displayed thumbnail images in figure
4.13 serve just to visualize the models, i.e., they do not correspond to depth buffer
images. Models in figure 4.13 are visualized from the positive side of the z-axis in the canonical coordinate frame, while the x-axis points to the right. The distance of the viewpoint is selected so that the corresponding thumbnail image captures the whole 3D-object, whence the canonical scale is not depicted in figure 4.13. The l_1 distances between the query and the top ranked models are given as well.
Figure 4.13: When the CBC (definition 4.1) and EBB (definition 4.3) are used for
forming depth buffer images, the descriptor is sensitive with respect to outliers. In
the case when the CC (definition 4.4) is used, robustness with respect to outliers is
achieved.
Consequently, corresponding sample values of two very similar models may significantly differ.
A similar problem may occur when calculating surface area inside certain re-
gions. For instance, let the plane δ in figure 4.14b) separate two regions and let
the triangles T and T 0 be almost parallel to δ, but located on the opposite sides of
the plane. If we have a feature vector in the spatial domain, defined by accumulat-
ing the areas of triangles inside the regions and attributing the obtained values to
corresponding vector components, then the areas of T and T 0 are accumulated in
different components of the feature vectors. As a result, we can have a gap between
vector components of very similar models.
Figure 4.14: Large differences of sample values occur mostly when intersecting rays
with triangles (a) and rarely when calculating the surface area inside certain regions
(b). The differences cannot be large when relying upon volumes (c).
Hence, in the spatial domain, the gaps between corresponding sample values of
similar 3D-objects are the most frequent when calculating features based on one-
dimensional properties (e.g., ray casting), and are potentially present when com-
puting features based on two-dimensional characteristics (e.g., surface area). This
fact motivates us to define a feature relying upon a three-dimensional characteristic,
volume. Since a 3D-mesh model is not necessarily a solid object, we define artificial
volumes as shown in figure 4.14c). Namely, each triangle of the mesh is considered
to be the base of a pyramid having the top vertex at the coordinate origin. We cal-
culate the volume of each pyramid taking into account the orientation of the base,
i.e., the volume can be negative. The signed volume V_{T_j}, which is associated to the triangle T_j (1.2) with the vertices p_{A_j}, p_{B_j}, and p_{C_j} (j = 1, ..., m), is calculated by

V_{T_j} = p_{A_j} · ((p_{B_j} − p_{A_j}) × (p_{C_j} − p_{A_j})). (4.31)
The reason for determining the sign of a pyramid's volume is depicted in figure 4.15. The triangles △p_1p_2p_3 and △p_2p_4p_3 belong to a mesh model. For simplicity of illustration, the points O, p_4, and p_1 are taken to be collinear. The volume limited by the triangles, which is "inside" the model, is depicted on the right side of the figure. The value of the volume is computed by summing the appropriate scalar triple products, i.e., the signed volumes (4.31).
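A minimal Python sketch of (4.31) follows (the function name is ours). Note that the expression equals the scalar triple product p_{A_j} · (p_{B_j} × p_{C_j}), i.e., six times the signed volume of the tetrahedron spanned by the triangle and the origin; the constant factor is irrelevant for the descriptor.

import numpy as np

def signed_volume(pA, pB, pC):
    # Signed volume (4.31) of the pyramid formed by a triangle and the
    # origin; the sign encodes the orientation of the base triangle.
    return float(np.dot(pA, np.cross(pB - pA, pC - pA)))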
In order to form a feature vector from the calculated volumes, the whole 3D
space (in the CPCA frame) is subdivided into N disjoint regions ν_1, ..., ν_N. Our volume-based feature vector in the spatial domain,

f = (f_1, ..., f_N), (4.32)

is formed by accumulating the signed volumes V_{T_j} inside the regions ν_i. Let s_1, s_2, s_3 ∈ {1, 2, 3} denote the coordinate axes, so that the axes x, y, and z are indexed by 1, 2, and 3, respectively.
We use two algorithms for finding the distribution of V_{T_j} (j = 1, ..., m) across the regions ν_i (i = 1, ..., 6k²): an approximative and an analytical one.
Our approximative algorithm is based on partitioning each volume V_{T_j} into p_j² (p_j ∈ N) partitions (pyramids) of equal volume, by subdividing the triangle T_j, as depicted in figure 4.17. Having in mind that the areas of triangles of a mesh model significantly differ, we use an adaptive partitioning of V_{T_j} in which the number of partitions depends on the area of T_j. The parameter p_j that fixes the number of partitions of V_{T_j} is defined by

p_j = ⌈ √( p_min · S_j / S ) ⌉, (4.38)

where S_j is the surface area of the triangle T_j, S is the surface area of the whole mesh (3.10), p_min is a parameter used for setting the fineness of approximation, and ⌈t⌉ ∈ Z (t ∈ R) denotes the ceiling function (⌈t⌉ is the smallest integer not less than t). The parameter p_min specifies a lower bound on the total number of partitions p_total of all volumes V_{T_j} (j = 1, ..., m), i.e.,

p_total = Σ_{j=1}^{m} p_j² ≥ Σ_{j=1}^{m} p_min · S_j / S = p_min · (Σ_{j=1}^{m} S_j) / S = p_min. (4.39)
Each partition is represented by the center of gravity of the base of the pyramid. Centers of gravity are marked with bold dots in figure 4.18, and are denoted by the variable G in the algorithm given by the pseudocode in figure 4.18. For each partition, we determine the region containing the center of gravity of the base triangle (function 'vector_index'), and accumulate the whole volume δ of the partition in the corresponding component of the feature vector. Note that we first check if the whole triangle is located in a single region ν_i (4.37). In that case, there is no need for subdivision.
We recall that the parameter k fixes the dimension of the feature vector, which is equal to the number of regions subdividing the 3D-space. The regions ν_i (4.37) are indexed with an integer from the range [1, 6k²], representing the position of the vector component. The function 'vector_index', used for determining the region containing a point A = (a_1, a_2, a_3), is described by the following pseudocode:
// Returned value: the position of the vector component (integer) from the range [1, 6k^2]
// Argument A = (a1, a2, a3) ∈ R^3 represents a 3D-point
// Argument k ∈ N is the subdivision parameter
int vector_index(A, k)
    s1, s2, s3, a, b, c ∈ Z;
    // s1 indexes the dominant axis, i.e., the axis of the largest absolute coordinate
    s1 = 1;
    if |a1| < |a2| ⇒ s1 = 2;
    if |a_s1| < |a3| ⇒ s1 = 3;
    // s2 and s3 index the two remaining axes (cyclic order)
    s2 = s1 mod 3 + 1;
    s3 = s2 mod 3 + 1;
    // quantize the two remaining coordinates, scaled to [−1, 1), into k bins
    a = ⌊k(a_s2/|a_s1| + 1)/2⌋;
    b = ⌊k(a_s3/|a_s1| + 1)/2⌋;
    // c ∈ {0, ..., 5} selects one of the six pyramidal sectors (axis and sign)
    if a_s1 < 0 ⇒ c = 2s1 − 1;
    else c = 2s1 − 2;
    vector_index = c·k^2 + a·k + b + 1;
4. For each region ν ∈ Ω, we intersect the pyramid, formed by Tj and the origin
O, with the region ν, and compute the portion of VTj being inside ν.
(P_1 · P_AB)/(||P_1|| · ||P_AB||) > 1/3 ∧ (P_2 · P_AB)/(||P_2|| · ||P_AB||) < 1/3 ⟹ P_AB ∉ L,

(P_1 · P_BC)/(||P_1|| · ||P_BC||) > 1/3 ∧ (P_2 · P_BC)/(||P_2|| · ||P_BC||) > 1/3 ⟹ P_BC ∈ L.
Figure 4.19: Checking if the intersection points lie in the marked area of the plane.
The approximative algorithm extracts feature vectors of nearly the same retrieval performance as the feature vectors extracted by the analytical algorithm. Hence, in practice we use the approximative algorithm for a fast feature extraction, while the analytical method is used only for verifying results in a testing phase.
Table 4.4: Feature extraction times (in milliseconds) for various dimensions of the
volume-based feature vector, using both the approximative and analytical algo-
rithm.
We also define variants of the descriptor in order to avoid the orientation constraint. Instead of V_{T_j} (4.31), which can be negative, we take the absolute value |V_{T_j}|, in order to eliminate the problem of inconsistent orientation of polygons. All other aspects of the presented technique remain the same. Note that by taking the non-negative values of volumes |V_{T_j}|, we do not calculate the volume as depicted in figure 4.15, but accumulate all non-negative volumes lying inside the regions ν_i (4.37).
We also introduce an additional modification by normalizing the feature vector
f (4.32) so that ||f||_1 = 1 (1.10). This normalization makes the computed distribution of volumes V_{T_j} or |V_{T_j}| independent of the scale of the object. In this case, no scaling during the normalization step (section 3.4) is necessary. Thus, there are four
variants of the volume-based feature vector in the spatial domain, which are given
in table 4.5.
Variant   Signed V_Tj (4.31)   Scaled (||f||_1 = 1)
V1              yes                   no
V2              no                    no
V3              yes                   yes
V4              no                    yes

Table 4.5: Four variants of the volume-based feature vector in the spatial domain.
Note that spectral representations of all variants of the feature vector f (4.32) can easily be obtained by applying the 2D-DFT (4.26) to the 6 2D-arrays F^(0), ..., F^(5), defined using f,

F^(c) = [f_ab^(c)]_{k×k}, f_ab^(c) = f_{ck² + ka + b + 1}, 0 ≤ a, b ≤ k − 1, c = 0, ..., 5.

The arrays F^(c) of volume values are transformed into arrays F̂^(c) = [f̂_pq^(c)]_{k×k} of complex Fourier coefficients. Next, we apply exactly the same approach as in the case of the depth-buffer feature vector in the spectral domain. The magnitudes |f̂_pq^(c)| of the obtained coefficients, whose indices satisfy an inequality similar to (4.29), are taken as components of the feature vector (for more details see section 4.3).
In practice, we extract this feature vector using the approximative algorithm with
k = 128. We recall that the dimension of the feature vector in the frequency domain
is 6(l2 + l + 1), where l < k/2 (l ∈ N) is the highest order of harmonics that are
used to form the vector. Also, an embedded multi-resolution representation (1.8) is
provided, by forming the vector according to (4.30).
We compared all variants of the volume-based method in both the spatial and spectral domains (see section 5.2.4). Because the polygon orientations of 3D-models are not necessarily consistent, the variants V2 and V4 (table 4.5) outperform the variants V1 and V3. Retrieval effectiveness can slightly be improved by representing the volume-based feature in the spectral domain. Nevertheless, the overall retrieval performance of the volume-based descriptor is inferior to that of the silhouette-based and depth buffer-based descriptors presented in this chapter, despite the high stability of the volume-based approach with respect to variations of the polygonal surface (high-frequency noise of the surface), different tessellations, as well as levels-of-detail.
After evaluation (section 5.2.4), we suggest to use the variant V4 of the volume-
based descriptor in the spectral domain of dimension 438.
where

d_x = (x_max − x_min)/M, d_y = (y_max − y_min)/N, d_z = (z_max − z_min)/P, (4.41)

and a = 0, ..., M − 1, b = 0, ..., N − 1, c = 0, ..., P − 1.
After the sampling, which is application specific, the voxel µabc is attributed a
value vabc ∈ K depending on positions of the polygons of a 3D-mesh model. In
other words, vabc depends on the point set I (1.5) of the model. Usually, vabc is a
scalar value, and we deal either with a binary voxel grid (K = {0, 1}) or real voxel
grid (K ≡ R). In the binary case, the voxel value v_abc is defined by

v_abc = 0 (or 1) if I ∩ µ_abc = ∅, and v_abc = 1 (or 0) if I ∩ µ_abc ≠ ∅. (4.42)

Thus, if there is a point p ∈ I lying inside µ_abc, then we set v_abc = 1, otherwise
vabc = 0, or vice versa. There are several choices to attribute a voxel with a real
value. For instance, the voxel attribute vabc might be a color value. Also, the voxel
grid might be used for storing 3D distance fields [33, 56].
Voxel values are usually stored in a 3D-array [vabc ]M ×N ×P . If the 3D-array is
mostly populated with zeros, the space needed for storing a voxel grid can signifi-
cantly be reduced by using an octree structure [116].
A voxelization method is defined by specifying the region ρ (4.19), the dimen-
sions of voxel grid M , N , and P (or voxel dimensions dx , dy , and dz ), and the exact
calculation of voxel attributes. In what follows, we always use cubic voxel grids of
type N × N × N (M = N = P ). We use several definitions of the cuboid region ρ
(4.19), which is voxelized:
In our approach, a 3D-mesh model I (1.5) is voxelized so that the voxel attribute v_abc is equal to the fraction of the total surface area S of the mesh which is inside the region µ_abc (4.40), i.e.,

v_abc = area{ µ_abc ∩ I } / S, 0 ≤ a, b, c ≤ N − 1. (4.43)
We created both an approximative and an analytical algorithm for computing the voxel attributes v_abc (4.43).
The approximative algorithm is given in figure 4.21, and is based on the same idea as the algorithm described in figure 4.18. Each triangle T_j of a mesh model is subdivided (figure 4.17) into p_j² congruent triangles, each of which has the surface area δ = S_j/p_j², where S_j is the area of T_j. If all vertices of the triangle T_j lie in the same cuboid region µ_abc (4.40), then we set p_j = 1; otherwise we use (4.38) to determine the value of p_j. For each newly obtained triangle, the center of gravity G is computed, and the voxel µ_abc ∋ G is determined. Finally, the attribute v_abc is incremented by δ. As mentioned in section 4.4, the quality of approximation is set by the parameter p_min (4.38). For clarity, several lines are separated as the procedure inc_att in the pseudocode in figure 4.21. In our implementation, there are no procedure calls, and the computation of centers of gravity is done without multiplications.
The analytical algorithm for calculating the attributes vabc (4.43), for a given
mesh model I (1.5), also starts with initializing voxel values to 0. Next, for each
triangle Tj (1.2) of the mesh, we determine a set of voxels Vj intersected by Tj ,
using a very efficient algorithm presented in [53]. We recall that the triangle T_j is defined by its vertices A_j, B_j, and C_j, and 1 ≤ j ≤ m (1.2). Then, for each µ_abc ∈ V_j, we compute the area δ_abc of the part of the triangle T_j which is inside µ_abc (4.40).
The value of δabc = area{ µabc ∩ Tj } is computed by performing the following steps:
1. Check if the triangle vertices Aj , Bj , and Cj lay inside µabc . The vertices
being inside µabc are inserted in a list L, which is initially empty.
A_j ∈ µ_abc ⟺ ( a ≤ (a_{x_j} − x_min)/d_x < a + 1 ∧ b ≤ (a_{y_j} − y_min)/d_y < b + 1 ∧ c ≤ (a_{z_j} − z_min)/d_z < c + 1 ), and A_j ∈ µ_abc ⟹ A_j ∈ L,

where (a_{x_j}, a_{y_j}, a_{z_j}) are the coordinates of the vertex A_j.
Next, we check whether the edges of the triangle T_j intersect the faces of the voxel. Let R^{x−}_{abc} denote the face of µ_abc lying in the plane x = x_min + a·d_x,

R^{x−}_{abc} = { (x_min + a·d_x, y, z) | b ≤ (y − y_min)/d_y < b + 1, c ≤ (z − z_min)/d_z < c + 1 },

where d_x, d_y, d_z, x_min, y_min, and z_min are given by (4.41) and (4.19). Firstly, we find the intersection point p = (p_x, p_y, p_z) of the line A_jB_j and the plane π^{x−}_{abc} ⊃ R^{x−}_{abc}. Since the plane π^{x−}_{abc} ∋ p is determined by the equation x = x_min + a·d_x, we have p_x = x_min + a·d_x. The point p also lies on the line A_jB_j, i.e., p = α(A_j − B_j) + B_j. We directly compute the value of α from

p_x = α(a_{x_j} − b_{x_j}) + b_{x_j} ⟹ α = (x_min + a·d_x − b_{x_j}) / (a_{x_j} − b_{x_j}).

The point p lies inside the face R^{x−}_{abc} if and only if

b ≤ (p_y − y_min)/d_y < b + 1 ∧ c ≤ (p_z − z_min)/d_z < c + 1.

If p ∈ A_jB_j and p ∈ R^{x−}_{abc}, we insert p into the list L.
4. Check if voxel edges intersect the triangle Tj , and add the intersection points
to the list L.
Let p_0p_1 be one of the 12 edges of the voxel µ_abc, and let p be the intersection point of the line p_0p_1 and the plane of the triangle T_j.
where t′, t″ ∈ L are chosen so that ||t′ − m|| and the angle ∠(x, t″ − m) are maximal, i.e.,

||t′ − m|| = max_{1≤i≤l} ||t_i − m||, |x · (t″ − m)| / ||t″ − m|| = min_{1≤i≤l} |x · (t_i − m)| / ||t_i − m||.

The points t_i are reordered in non-decreasing order of the polar coordinate φ_i = arctan(y_i/x_i).
6. The surface area δ_abc of the polygon t_1 ... t_l is calculated by summing up the areas of the l − 2 triangles △t_1t_2t_3, ..., △t_1t_{l−1}t_l. The voxel attribute v_abc is incremented by δ_abc/S (1.6).
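A minimal Python sketch of this step (assuming the intersection points are coplanar and already reordered by the polar angle, as described in step 5):

import numpy as np

def polygon_area(pts):
    # Area of the convex polygon t_1 ... t_l, given as an (l, 3) array of
    # ordered coplanar points, as the sum of the areas of the l - 2
    # triangles of the fan t_1 t_2 t_3, ..., t_1 t_{l-1} t_l.
    t1 = pts[0]
    cross = np.cross(pts[1:-1] - t1, pts[2:] - t1)
    return 0.5 * np.linalg.norm(cross, axis=1).sum()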
In the case when the CBC, BB, or EBB is used, a 3D-model I is completely located inside the region ρ. If the CC is used as the cuboid region ρ (4.19), a part of the model may lie outside ρ and is clipped.

Figure 4.22: The impact of outliers on voxel representations. The only difference between the original models a) and b) are the positions of the antennas. The models are voxelized at the same resolution and visualized from the same viewpoint using the CBC (c, d) and the CC (e, f).
Hence, from (4.45) it follows that d_vw < d_vu, while (4.46) gives the opposite conclusion, d_vw > d_vu. This situation may happen when corresponding parts of two models (e.g., the doors of car models) are just slightly displaced in the CPCA frame, and the displacement is registered at the resolution r + 1. From a user's perspective, the voxels v_abc and w_abc are identical at the resolution r and very different from u_abc. This impression should somehow be preserved at the resolution r + 1. Obviously, it cannot be done with the l_p norm.
Figure 4.23: The increase of the resolution of voxelization reduces the effectiveness
of the lp norm in the spatial domain.
Thus, the origin (0, 0, 0) is shifted to (N/2, N/2, N/2). Therefore, we adjust the formula (4.47),

f̂_pqs = (1/√(N³)) Σ_{a=−N/2}^{N/2−1} Σ_{b=−N/2}^{N/2−1} Σ_{c=−N/2}^{N/2−1} v′_abc e^{−2πj(ap+bq+cs)/N}. (4.48)

Since the input values are real, the obtained coefficients possess the symmetry

v′_abc ∈ R ⟹ f̂_pqs = (f̂_p′q′s′)^*, whence |f̂_pqs| = |f̂_p′q′s′|, for (p + p′) mod N = 0, (q + q′) mod N = 0, (s + s′) mod N = 0. (4.49)
The feature vector is formed from all non-symmetrical coefficients f̂_pqs, so that exactly one coefficient from each pair related by the symmetry (4.49) is taken. The coefficient f̂_000 is not included, because it is always equal to 1 when the CBC, BB, or EBB is used. However, if the CC is engaged, then f̂_000 ≤ 1, because a model may be cropped. Since we are interested in the distribution of voxel attributes as a feature, we normalize f̂_pqs by dividing by |f̂_000|. By suitably choosing the component
values, an embedded multi-resolution feature representation (1.8) is provided. The
exact way of forming the feature vector f = (f1 , . . . , fdim ) is described by the
pseudocode in figure 4.24.
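A minimal Python/numpy sketch of the spectral voxel feature (the function name is ours): the shift of the origin to the center of the grid is realized by an FFT-shift, and the magnitudes are divided by |f̂_000|, as in the pseudocode of figure 4.24.

import numpy as np

def voxel_spectrum(v):
    # v: N x N x N array of voxel attributes v_abc
    N = v.shape[0]
    f_hat = np.fft.fftn(np.fft.ifftshift(v)) / np.sqrt(N**3)   # cf. (4.48)
    scale = abs(f_hat[0, 0, 0])        # |f_hat_000|; equals 1 for CBC/BB/EBB
    if scale == 0.0:
        scale = 1.0
    return np.abs(f_hat) / scale       # magnitudes, normalized by |f_hat_000|

The feature vector is then assembled from these magnitudes according to figure 4.24.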
For a fixed value of h (the outer loop in the pseudocode in figure 4.24), the total number of taken coefficients d(h) can be computed by

d(h) = 2h² + 1. (4.51)
i = 1;
scale = 1;
if |f̂_{0,0,0}| > 0 ⇒ scale = |f̂_{0,0,0}|;
for h = 1, ..., k
    f_i = |f̂_{h,0,0}|/scale, i ← i + 1;
    f_i = |f̂_{0,h,0}|/scale, i ← i + 1;
    f_i = |f̂_{0,0,h}|/scale, i ← i + 1;
    for x = 1, ..., h − 1
        f_i = |f̂_{0,x,h−x}|/scale, i ← i + 1;
        f_i = |f̂_{0,x,x−h}|/scale, i ← i + 1;
        f_i = |f̂_{x,0,h−x}|/scale, i ← i + 1;
        f_i = |f̂_{x,0,x−h}|/scale, i ← i + 1;
        f_i = |f̂_{x,h−x,0}|/scale, i ← i + 1;
        f_i = |f̂_{x,x−h,0}|/scale, i ← i + 1;
    for x = 1, ..., h − 2
        for y = 1, ..., h − 1 − x
            f_i = |f̂_{x,y,h−x−y}|/scale, i ← i + 1;
            f_i = |f̂_{x,y,x+y−h}|/scale, i ← i + 1;
            f_i = |f̂_{x,−y,h−x−y}|/scale, i ← i + 1;
            f_i = |f̂_{x,−y,x+y−h}|/scale, i ← i + 1;
Figure 4.24: The forming of the voxel-based feature vector f = (f1 , . . . , fdim ), in
the spectral domain.
The dimension dim of the feature vector, i.e., the total number of all taken coefficients, is determined by

dim = Σ_{h=1}^{k} d(h) = k(2k² + 3k + 4)/3. (4.52)
We recommend to set k = 8 so that the vectors of dimensions 12, 31, 64, 115, 188,
and 287 (k = 2, . . . , 7) are embedded in the feature vector of 416 components.
The efficiency of the analytical and approximative algorithms is compared using the results from table 4.6. The results are obtained on a PC with a 1.4 GHz AMD processor running Windows 2000. The MPEG-7 collection (section
5.1) is used for measuring times and statistics. The voxel-based descriptor in the
spatial domain is extracted for N = 2, . . . , 8, producing the vectors of dimensions
8, 27, 64, 125, 216, 343, and 512. The approximative extraction algorithm (figure
4.21) is tested for pmin ∈ {16000, 32000, 64000, 96000}. For each value of pmin , the
average number of samples p_total (4.39) is also given. Since we set p_j = 1 when the whole triangle contributes to a single voxel, the total number p_inc_att of executions of the procedure inc_att (figure 4.21) is significantly lower than p_total for small N. The computational complexity depends linearly on p_inc_att. Since there is a certain amount of processing time which does not depend on p_inc_att, but on the number of triangles m of the mesh, the computational complexity can be written
Table 4.6: Average feature extraction times of the approximative and analytical
algorithms for generating voxel-based feature vectors in the spatial domain, octree
voxel grids, and spectral representations using the 3D-DFT. In all cases, the CBC
is used. The number of executions of the procedure inc_att (figure 4.21) is denoted by p_inc_att.
The extraction time in the spectral domain is obtained by adding the times needed for generating the octree and performing the 3D-DFT. The extraction times in table 4.6 are obtained when the CBC is used.
We observe that the approximative method is significantly faster than the analyt-
ical algorithm, in particular for larger values of N . Our experiments (section 5.2.5)
show that feature vectors extracted by the approximative method for pmin = 32000
possess almost the same retrieval effectiveness as vectors obtained by the analytical
approach. We tested all presented choices for fixing the region of voxelization. In
the spatial domain, we suggest to use the BB for voxelization and dimension 343.
In the spectral domain, we recommend the vector of dimension 417 obtained by
applying the 3D-DFT to an octree of depth 7, which represents a voxel grid formed
using the CC with w = 2 (definition 4.4). The scale should be fixed by the average
distance (3.25), if the CC is used. Note that, if the voxel attributes are defined by
(4.43) and CBC, BB, or EBB are used, there is no need to fix the scale of an object
during the normalization step (section 3.4).
f : S² → K, u ↦ f(u), (4.53)

and to define our moments-based feature vector. Next, we present several 3D-shape descriptors that are based on various definitions of the function f(u). We also explore the idea of defining functions on concentric spheres and of using a property of spherical harmonics to obtain a rotation invariant descriptor, where the invariance is achieved not by using the PCA, but by the definition of the descriptor.
where f̂_{l,m} denotes the (l, m)-Fourier coefficient. Hence, spherical harmonics provide an orthonormal basis for L²(S²). The (l, m)-spherical harmonic Y_l^m (l ≥ 0, |m| ≤ l) is a harmonic homogeneous polynomial of degree l, which has a factorization

Y_l^m(θ, φ) = k_{l,m} P_l^m(cos θ) e^{jmφ}, (4.57)

where j is the imaginary unit, k_{l,m} is a normalization constant, and P_l^m is the associated Legendre polynomial of degree l and order m, which satisfies the characteristic three-term recurrence relation

(l − m + 1)P_{l+1}^m(t) − (2l + 1)t P_l^m(t) + (l + m)P_{l−1}^m(t) = 0. (4.58)

A visualization of the real combinations S_l^m of the spherical harmonics Y_l^m, for |m| ≤ l ≤ 3, is shown in figure 4.25.
By using (4.57) to separate the longitudinal coordinate φ and the latitudinal coordinate θ, the computation of the spherical harmonic coefficients can be performed as a common Fourier transform followed by a projection onto the corresponding Legendre functions,

f̂_{l,m} = ⟨f, Y_l^m⟩ = k_{l,m} ∫_0^π [ ∫_0^{2π} f(θ, φ) e^{−jmφ} dφ ] P_l^m(cos θ) sin θ dθ, (4.59)

where the inner integral is a 1D Fourier transform.
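For illustration, the projection (4.59) can be approximated by direct quadrature. The following Python sketch (ours; a naive Riemann sum, not the fast SFFT of [45]) computes one coefficient f̂_{l,m} of a function f(θ, φ) given as a callable; note that scipy's sph_harm takes the azimuthal angle first, and its normalization and phase conventions may differ from those of the SpharmonicKit.

import numpy as np
from scipy.special import sph_harm

def sh_coeff(f, l, m, B=64):
    # f_hat_{l,m} = <f, Y_l^m>, approximated on a 2B x 2B equiangular grid
    # with the sin(theta) weight from (4.59).
    theta = (2 * np.arange(2 * B) + 1) * np.pi / (4 * B)   # polar samples
    phi = np.arange(2 * B) * np.pi / B                     # azimuthal samples
    T, P = np.meshgrid(theta, phi, indexing="ij")
    Y = sph_harm(m, l, P, T)                               # Y_l^m(theta, phi)
    w = np.sin(T) * (np.pi / (2 * B)) * (np.pi / B)        # quadrature weights
    return np.sum(f(T, P) * np.conj(Y) * w)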
A very important property of spherical harmonics, which can be used for defining
rotation invariant 3D-shape descriptors, is the following:
The SFFT method proposed in [45] and the corresponding software Spharmon-
icKit [59] deal with band-limited functions.
where the sample points are chosen from the equiangular grid

θ_a = (2a + 1)π / (4B), φ_b = 2bπ / (2B), (4.61)

and the weights c_a^(B) play a role analogous to the sin θ factor in the integrals (4.59).
Property 4.2 Let f ∈ L2 (S 2 ) be a real-valued function, i.e., f : S 2 → R. Let the
SFFT coefficients fˆl,m be obtained by applying the sampling theorem (4.1). Then,
the following symmetry between the coefficients exists
Exploiting this symmetry, a feature vector formed from the first k bands of coefficients possesses dim(f) = k² independent components. (4.65)
Note that a feature vector formed either using (4.65) or (4.66) contains all feature
vectors of the same type of smaller dimension, thereby providing an embedded
multi-resolution representation (1.8) for 3D shape feature vectors.
Remark 4.1 The property 4.1 can be used to achieve rotation invariance of feature vectors without normalizing the orientation (rotation) of a 3D-mesh model I (1.5). Our normalization technique is described in section 3.4. After applying the first 4 steps of the algorithm 4.1, we can form the feature vector f so that

f = (||f_0||, ..., ||f_{dim−1}||), ||f_l|| = √( Σ_{m=−l}^{l} |f̂_{l,m}|² ), dim ≤ B. (4.67)

If I_rot is the point set of a 3D-model obtained by rotating the points I around an arbitrary axis by an arbitrary angle, and f_rot = (||f′_0||, ..., ||f′_{dim−1}||) is the feature vector of I_rot, extracted without the normalization step using (4.67), then it holds that ||f_l|| ≈ ||f′_l||.
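A minimal Python sketch of (4.67) (the coefficient layout, a dictionary keyed by (l, m), is ours):

import numpy as np

def band_energies(f_hat, dim):
    # Rotation invariant feature (4.67): the energy ||f_l|| of each band l,
    # computed from the SFFT coefficients f_hat[(l, m)].
    return np.array([
        np.sqrt(sum(abs(f_hat[(l, m)]) ** 2 for m in range(-l, l + 1)))
        for l in range(dim)
    ])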
Remark 4.2 Suppose that I (1.5) is a given point set of a mesh model and I_z is the point set obtained by rotating the set I around the z-axis by an arbitrary angle. Let f̂_{l,m} and f̂′_{l,m} (|m| ≤ l ≤ B) be the complex Fourier coefficients obtained after step 3 of the algorithm 4.1, associated to I and I_z, respectively. If the number of samples is large enough (B ≫ 0) and the sample points are defined by (4.63), then the magnitudes of the obtained coefficients are approximately the same. Thus, a feature vector formed by the algorithm 4.1 is invariant with respect to rotations of a model around the z-axis. The property follows directly from the equation (4.60).
Remark 4.3 The users of the SpharmonicKit (version 2.5 and earlier) should correct a normalization problem that exists in the package. Namely, after the SFFT, the obtained coefficients f̂_{l,m} should be scaled as follows:

f̂_{l,m} ← 2√π · f̂_{l,m} (1 ≤ |m| ≤ l), f̂_{l,0} ← √(2π) · f̂_{l,0} (0 ≤ l < B).

The inverse scaling should be performed before the inverse transform. Note that without the scaling, the property 4.1 is not valid. We informed the authors of the SpharmonicKit about the problem, and following versions will probably include the correction.
l\m -4 -3 -2 -1 0 1 2 3 4
0 4.448
1 0.139 0.754 0.139
2 0.589 0.111 1.202 0.111 0.589
3 0.051 0.041 0.056 0.057 0.056 0.041 0.051
4 0.060 0.053 0.369 0.004 0.554 0.004 0.369 0.053 0.060
Figure 4.26: Multi-resolution representation of the function r(u) (4.1) used to derive
feature vectors from Fourier coefficients for spherical harmonics.
Table 4.8: Extraction times of the ray-based feature vector with spherical harmonic
representation, for various numbers of sample points. In each case, dim = 435.
f400 = (1.22, 0.13, 0.15, 0.25, 0.03, 0.29, 0.02, 0.01, 0.01, 0.15, 0.06, 0.03, 0.07, 0.03, 0.04,
0.00, 0.03, 0.01, 0.07, 0.01, 0.10, 0.02, 0.03, 0.03, 0.00, 0.02, 0.03, 0.03, 0.00, 0.01,
0.02, 0.03, 0.01, 0.02, 0.01, 0.02, 0.01, 0.02, 0.01, 0.02, 0.02, 0.02, 0.03, 0.03, 0.03),
f800 = (1.21, 0.13, 0.15, 0.25, 0.02, 0.29, 0.02, 0.01, 0.02, 0.15, 0.06, 0.02, 0.07, 0.03, 0.04,
0.00, 0.03, 0.01, 0.07, 0.00, 0.10, 0.02, 0.02, 0.03, 0.00, 0.02, 0.03, 0.04, 0.00, 0.01,
0.02, 0.03, 0.01, 0.02, 0.01, 0.03, 0.01, 0.02, 0.01, 0.02, 0.02, 0.02, 0.03, 0.03, 0.02),
( |f_400^1 − f_800^1|, ..., |f_400^45 − f_800^45| ) =
0.00, 0.00, 0.00, 0.00, 0.01, 0.00, 0.00, 0.00, 0.01, 0.00, 0.00, 0.01, 0.00, 0.00, 0.00,
0.00, 0.00, 0.00, 0.00, 0.01, 0.00, 0.00, 0.01, 0.00, 0.00, 0.00, 0.00, 0.01, 0.00, 0.00,
0.00, 0.00, 0.00, 0.00, 0.00, 0.01, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.01).
Figure 4.27: The ray-based feature vectors (dim = 45) in the spectral domain of the models in figure 4.5. The maximal difference between corresponding components equals 0.01.
sample values are directly used as components of the feature vectors. One of the
problems of such a descriptor is depicted in figure 4.5, where a ray does not hit the
bottom-left part of the model on the right-hand side. As a result, there is a gap
between the corresponding vector components. When we sample the function r at
many points (4.63), we also have many gaps between sample values for models in
figure 4.5. However, when we apply the SFFT to the samples and form the vector
in the spectral domain, then the component-wise differences between the similar
models from figure 4.5 are negligible (see figure 4.27). The SFFT filters out differences caused by local variations of the models' surfaces.
Therefore, the spectral representation of the ray-based feature vector eliminates
drawbacks that are present in the spatial domain (see section 4.1):
i = 0;
for m = 1, ..., k
    for m1 = 0, ..., m
        for m2 = 0, ..., m − m1
            f_i = M_{m1, m2, m−m1−m2};
            i ← i + 1;

The dimension of the vector is

dim = (k + 1)(k + 2)(k + 3)/6 − 1. (4.70)
Usually, we set k = 11.
In order to measure extraction times of various dimensions of the vector, we
computed feature vectors for k = 2, ..., 11 using three different numbers of samples: 64², 128², and 256². The average feature extraction times, which are shown in table 4.9, are obtained on a PC with a 1.4 GHz AMD processor running Windows 2000, using the 3D-mesh models from the MPEG-7 set (section 5.1). We recall that
the vector of dimension 363 (k = 11) contains all lower-dimensional vectors (for
k = 2, . . . , 10).
Hence, the feature extraction consists of two steps: sampling and computing the moments. For a given complexity of a 3D-model (number of triangles), the sampling time t_s is proportional to the number of samples, t_s = O(B²), while the time t_m needed for the computation of moments can be expressed as t_m = O(dim · B²). We can
k 2 3 4 5 6 7 8 9 10 11
dim 9 19 34 55 83 119 164 219 285 363
B = 32 27.1 27.9 28.9 30.6 32.6 35.2 38.5 42.0 46.3 51.3
B = 64 59.6 62.7 67.4 74.1 82.5 92.9 106.4 121.0 138.5 158.4
B = 128 184.2 197.5 214.7 240.1 272.2 311.8 365.4 423.6 490.9 570.0
Table 4.9: Average extraction times (in milliseconds) of the moments-based feature
vectors for various dimensions and B ∈ {32, 64, 128}.
write the total time complexity as O((cf + dim)B 2 ), where cf is a constant. The
dimensionality vs. time complexity diagrams for three different numbers of samples
(4B 2 ) are shown in figure 4.28.
Figure 4.28: Dimensionality vs. extraction times for B ∈ {32, 64, 128}.
The results (see section 5.2.7) show that dim = 363 is the best choice of dimension for the moments-based feature vector. Hence, in practice, we extract only one feature vector of dimension 363, embedding all lower-dimensional vectors.
s : S² → [0, 1],

s(u) = { 0, if r(u) = 0; |u · n(u)|, if r(u) ≠ 0 }, u ∈ S², (4.71)

where n(u) is the unit normal vector of the triangle containing the point r(u)u (r(u) ≠ 0). The computation of the values of the function s(u) is illustrated in figure
4.29. The triangle △ABC contains the furthest point of intersection along the directional unit vector u. Since ||u|| = ||n(u)|| = 1, the value u · n(u) = cos ∠(u, n(u)) is regarded as information about flat shading [30]. Therefore, we call this descriptor shading-based. The normal is computed as

n(u) = (B − A) × (C − A) / ||(B − A) × (C − A)|| = (n_x, n_y, n_z).
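A minimal Python sketch of one sample of the shading function (the function signature is ours; the triangle containing the furthest intersection point is assumed to be provided by the ray-casting step):

import numpy as np

def shading_sample(u, triangle):
    # One sample of s(u) (4.71): |cos| of the angle between the unit ray
    # direction u and the unit normal of the hit triangle; 0 if the ray
    # does not hit the model (r(u) = 0).
    if triangle is None:
        return 0.0
    A, B, C = triangle
    n = np.cross(B - A, C - A)
    n /= np.linalg.norm(n)            # unit normal n(u)
    return abs(float(np.dot(u, n)))   # |u . n(u)| = |cos angle(u, n(u))|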
The feature extraction procedure of the shading-based vector follows the algo-
rithm 4.1. In order to sample the function s (4.71) at the 4B² points u_ab (4.63), we sample the function r (4.1) at these points, populating an additional array of 4B² integers by storing the ordinal numbers of the triangles containing the points r(u_ab)u_ab. Then, we compute the normal vectors of the triangles whose indices are stored in the array, and calculate the sample values according to (4.71). The point r(u_ab)u_ab is the furthest point of the mesh model I (1.5) in the direction u_ab from the origin O. Equivalently, r(u_ab)u_ab is the closest point of the model I to an enclosing sphere along the ray emanating from the origin and traveling in the direction u_ab. Therefore,
we regard this technique as a rendered perspective projection of the object on an
enclosing sphere. Note that the rendered perspective projection is independent of
the scale of the underlying 3D-mesh model. Therefore, it is not necessary to scale
the model during the normalization step (section 3.4). After the sampling, we apply
the SFFT, and form the vector according to (4.66). The feature extraction time
of this vector is slightly greater than the extraction time of the ray-based feature
vector in the spectral domain (see table 4.8). We suggest to set B = 64 and use
dim = 91, i.e., k = 13 in (4.66). Evaluation results for the shading-based descriptor
with spherical harmonic representation are given in section 5.2.8.
The shading-based feature vector possesses an embedded multi-resolution rep-
resentation (1.8). A drawback of the shading-based approach is sensitivity with
respect to levels of detail and different tessellations of a model. Since we rely upon
normal vectors, if we have a model in different levels of detail, then sample values
(4.71) of a very simple model (small number of triangles) significantly differ from
the samples of a very complex model (large number of triangles). This difference is
reflected in vector components. Therefore, the requirement 4 in section 1.3.4 is not
fulfilled.
As an example, we consider the models of a cow in figure 4.30. The original model consists of 5804 triangles, while the other models are obtained by simplification, using the QSlim 1.0 software [37].

Figure 4.30: A model of a cow in 5 different levels of detail. The numbers of triangles are given below the corresponding models.

All 5 models are included in our 3D model collection (see
section 5.1). Suppose the shading-based descriptor is used and the l1 metric is
selected as the distance measure to rank the models. If the model with 88 triangles
is taken as a query, then the models with 100, 250, 400, and 5804 triangles are
retrieved as 1st, 2nd, 4th, and 23rd match, respectively. If the model with 5804
triangles serves as a query, then the models with 400, 250, 100, and 88 triangles are
at positions 1, 5, 123, and 143, respectively. Note that in both examples the retrieved models are ordered according to the level of detail (complexity). All the other feature vectors,
which are presented in this chapter so far, do not suffer from the described problem.
In other words, if we use any other feature vector and any of the five cow-models
as a query, then the first four matches are the remaining cow-models.
The problem of non-robustness with respect to levels of detail and tessellations reduces the discriminant power of the shading-based descriptor. However, the functions r(u) and s(u) can be combined to define a 3D-shape descriptor which possesses better retrieval performance than both the ray-based and shading-based descriptors. The combined descriptor is presented in the sequel.
(4.64). We recall that the vector of dimension k² embeds all lower-dimensional vectors. An example output of the absolute values of the spherical Fourier coefficients (up to l = 4) is given in table 4.10. The feature vector is composed of all coefficients, because the symmetry (see property 4.2 and table 4.7) between the coefficients does not exist. As far as we know, this is the first attempt to describe a 3D-shape by using a complex function. Therefore, we refer to the descriptor as complex.
l\m -4 -3 -2 -1 0 1 2 3 4
0 4.041
1 0.093 0.554 0.095
2 0.451 0.073 0.859 0.076 0.450
3 0.041 0.030 0.052 0.189 0.053 0.031 0.039
4 0.063 0.038 0.272 0.026 0.399 0.027 0.272 0.040 0.062

Table 4.10: Absolute values of the spherical Fourier coefficients f̂_{l,m} (up to l = 4) of the complex function for an example model.
The average extraction time of the complex feature vector is almost identical to
the average extraction time of the shading-based descriptor (section 4.6.4), i.e., it
is slightly greater than the average extraction time of the ray-based feature vector
in the spectral domain (see table 4.8).
We tested (section 5.2.9) the complex feature vector of maximal dimension 484
(k = 22) for B ∈ {32, 64, 128, 256}, while the parameter α (4.72) took values 0.1,
0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0, 2.0, 3.0, 4.0, and 6.0. We recommend the
following settings: B = 64, α = 0.4, dim = 196.
Thus, the descriptor obtained by combining two features (extent and shading) shows better retrieval performance than both the ray-based and the shading-based descriptors. Intuitively, one might expect that such a combination of two features results in a descriptor whose discriminant power is higher than the discriminant power of the shading-based, but lower than the discriminant power of the ray-based descriptor in the spectral domain. However, it appears that the weaker descriptor, the shading-based, can be used to improve the retrieval performance of the better descriptor, the ray-based. We explain this result by a kind of "orthogonality" (complementarity) that exists between the combined features. In other words, some 3D-shape characteristics, which are not captured by one of the descriptors, are recorded by the other, whence the
combination gives better results. This result motivates us to create other definitions
of complex functions as well as to try to merge several descriptors in an appropriate
way (section 4.7).
There are other possibilities to define a complex function (4.72) on the sphere S² in order to combine different features by using the presented concept. For instance, the curvature index (2.8) can be used in the same way as the information about the shading (4.71). Let c(u) be a real-valued function on the sphere S²,

c(u) = { 0, if r(u) = 0; Γ_j, if r(u) ≠ 0 }, u ∈ S², (4.73)
z′_α(u) = r(u) + j·(1/α)·c(u), u ∈ S², z′_α(u) ∈ C, (4.74)

and perform the previously described extraction procedure. However, we find that using (4.72) produces better descriptors than (4.74).
where |t_{a,b,i}| denotes the distance of the point t_{a,b,i} from the origin. Then, we sort the values |t_{a,b,i}| in non-decreasing order, |t_{a,b,i−1}| ≤ |t_{a,b,i}| (1 ≤ i ≤ n_ab − 1).
Hence, along the direction uab there are nab ≥ 0 points of intersection between
the ray and the model I. A structure that we call layered depth ray (figure 4.31)
consists of the longitudinal index a, latitudinal index b, number nab of points ta,b,i
intersecting the ray, and the sorted array of values |ta,b,i |. Finally, we define an
LDS to be a structure containing a 2D-array of layered depth rays as well as the longitudinal and latitudinal dimensions, which are both equal to 2B (4.63) in our implementation. The example in figure 4.31 depicts the concept of the LDS. All points
ta,b,i of the model of human that are used to form the LDS are marked either by
dots (points that are not furthest along rays) or by small squares (furthest points
along rays). We recall that only points marked with small squares are taken into
LayeredDepthRay {
    LatitudinalIndex: integer
    LongitudinalIndex: integer
    NoOfValues: integer
    Values[0..NoOfValues]: sorted array of real numbers
}

LayeredDepthSphere {
    LatitudinalDim: integer
    LongitudinalDim: integer
    Rays[0..LatitudinalDim, 0..LongitudinalDim]: array of LayeredDepthRay objects
}
Figure 4.31: Concept of the Layered Depth Sphere with an example in the 2D-space.
account when forming the ray-based feature vector (section 4.6.2), by sampling the
function r (4.1).
We define a function r_k (k ∈ {1, ..., R}) on the sphere S² by restricting the codomain of the function r given by (4.1),

r_k : S² → [g_k, g_{k+1}) ∪ {0},
r_k(u) = max( {0} ∪ { r | g_k ≤ r < g_{k+1}, r·u ∈ I } ). (4.75)

The lower bound g_k and the upper bound g_{k+1} restrict the function values.
In our feature extraction procedure, we consider R concentric spheres S_1, ..., S_R to define the bounds g_k of the functions r_k (k = 1, ..., R). The sphere S_k is centered at the origin and has radius k·M/R, where M is an empirically determined constant. The bounds g_k of the function values are given by

g_1 = 0; g_k = (k − 1/2)·M/R, 2 ≤ k ≤ R; g_{R+1} → +∞. (4.76)
S_1: r_1(u_1) = 0, r_1(u_2) = |t_{2,1}|, r_1(u_3) = |t_{3,2}|;
S_2: r_2(u_1) = 0, r_2(u_2) = |t_{2,2}|, r_2(u_3) = 0;
S_3: r_3(u_1) = |t_{1,0}|, r_3(u_2) = 0, r_3(u_3) = 0.
For each function rk , we apply the SFFT to the samples rk,a,b (0 ≤ a, b ≤ 2B−1).
As the result, we obtain spherical harmonic coefficients r̂k,l,m (0 ≤ l ≤ B − 1,
−l ≤ m ≤ l). Having in mind that rk,a,b ∈ R, we use (4.66) to form a signature
of the function rk from the first L rows (bands) of spherical harmonic coefficients.
Finally, we collect the signatures of all functions r1 , . . . , rR to form a feature vector
based on LDS,
f = ( |r̂_{1,0,0}|/2, ..., |r̂_{R,0,0}|/2, |r̂_{1,1,0}|/2, ..., |r̂_{R,1,0}|/2, |r̂_{1,1,1}|, ..., |r̂_{R,1,1}|, ..., |r̂_{1,L−1,0}|/2, ..., |r̂_{R,L−1,0}|/2, ..., |r̂_{1,L−1,L−1}|, ..., |r̂_{R,L−1,L−1}| ). (4.78)
The dimension of this feature vector is dim(f ) = R · L(L + 1)/2. For a fixed value
of R, an embedded multi-resolution representation is provided (1.8).
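A minimal Python sketch of sampling the functions r_k for a single layered depth ray, with the bounds g_k as in (4.76) (the function name and arguments are ours):

import numpy as np

def shell_samples(distances, R, M):
    # distances: the sorted values |t_{a,b,i}| of one layered depth ray.
    # For every shell k, the sample of r_k (4.75) is the largest
    # intersection distance between the bounds g_k and g_{k+1}, or 0.
    g = np.array([0.0] + [(k - 0.5) * M / R for k in range(2, R + 2)])
    g[-1] = np.inf                                 # g_{R+1} -> +infinity
    out = np.zeros(R)
    for k in range(R):
        hits = [t for t in distances if g[k] <= t < g[k + 1]]
        if hits:
            out[k] = max(hits)
    return out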
Regarding the implementation of the feature extraction procedure, we use the
same ray-casting method that is used for extracting the ray-based feature vector
(section 4.6.2, algorithm 2 in figure 4.3). Distances to the origin of all found intersection points are stored in 4B² sorted arrays, which are implemented using the STL [42]. Extraction times for various numbers of functions are given in table 4.11.
The number of functions (R) takes values between 1 and 10, while the number of
bands (L) is adjusted so that the vector dimension does not exceed the value of
550. The number of samples is 16384 (B = 64) and the constant M is set to 4.
The results are obtained on a PC with a 1.4 GHz AMD processor running Windows 2000, extracting features of the 3D-mesh models from the MPEG-7 collection
(section 5.1). Note that for R = 1 we obtain the feature vector presented in section 4.6.2. In table 4.8, the corresponding extraction time is 68.6 ms, while the same feature vector is extracted in 105.7 ms (table 4.11) when the LDS is used. The difference of ∼37 ms is the time needed for managing the LDS structure.
R 1 2 3 4 5 6 7 8 9 10
L 25 22 18 15 14 13 12 11 10 9
dim 325 506 513 480 525 546 546 528 495 450
Time 105.7 118.4 131.1 134.1 158.0 164.1 181.5 188.1 204.7 212.0
Table 4.11: Average extraction times (in milliseconds) of feature vectors based on
LDS for various numbers of functions on the sphere S 2 . The number of functions is
denoted by R, the number of bands of spherical harmonics used to form a feature
vector is L. In all cases, M = 4 and B = 64.
Thus, we encode the spatial distribution of the point set I (1.5) by consider-
ing regions of the 3D-space, which are between concentric spheres. The presented
feature vector based on LDS follows the standard algorithm for extracting descrip-
tors with spherical harmonic representation, i.e., after the normalization step, the
signature of the function rk is extracted using algorithm 4.1. However, we can use
property 4.1 to form a rotation invariant feature vector without normalizing the
orientation of a 3D-model. We recall that the canonical orientation of a model
is determined by using the CPCA (see section 3.4). An alternative is to fix only
translation and scaling during the normalization step, and after finding the spher-
ical harmonic coefficients r̂k,l,m (1 ≤ k ≤ R, |m| ≤ l ≤ B − 1), instead of using
(4.78), we form a rotation invariant feature vector based on LDS according to 4.67
(see remark 4.1),
where v
u l
uX
||fk,l || = t |r̂k,l,m |2 .
m=−l
The dimension of this feature vector is dim(f ) = R · L. Note that the ordering of
feature vector components provides an embedded multi-resolution representation,
for a fixed value of R. The average extraction times (for 1 ≤ R ≤ 10) of the rotation
invariant feature vector (4.79) are almost identical to the times in table 4.11.
where ω_i is a weighting factor associated to the elements of the feature vector f^(i). The value of ω_i specifies the "importance" of the vector f^(i). The vector dimension dim = Σ_i dim_i should be kept within a reasonable limit (e.g., dim < 1000).
The question is how to select good candidates for crossbreeding. We can apply
an exhaustive approach, i.e., to try all possible combinations of available feature
vectors (types, variants, representations, and dimensions) with a variety of different
combinations of values for the weighting factors. As the opposite approach to the
exhaustive method, we use our study of techniques for describing 3D-shape to antic-
ipate how retrieval performance can be improved. We consider that feature vectors,
which possess an embedded multi-resolution representation (1.8), are suitable for
crossbreeding. We recall that an embedded multi-resolution representation is usu-
ally provided by transforming a feature representation from the spatial domain into
the spectral domain, using an appropriate transform, e.g., 1D-FFT (4.11) in section
4.2, 2D-FFT (4.26) in sections 4.3 and 4.4, 3D-FFT (4.48) in section 4.5, or the FFT
on a sphere (see section 4.6). First components of feature vectors in the spectral
domain correspond to the low-frequency Fourier coefficients, which carry the most
important information. By increasing the dimension of a feature vector with an
embedded multi-resolution representation, we actually add information about fine
details. Thus, the fact that low-dimensional feature vectors contain useful informa-
tion, as well as the requirement to keep the dimension of a hybrid feature vector
inside a reasonable limit, motivate us to try crossbreeding of feature vectors with
an embedded multi-resolution representation.
We also consider that crossbreeding candidates should describe features that are complementary ("orthogonal"). For instance, the ray-based feature (section 4.6.2) gives information about the extent of an object from the center of gravity along a radial direction, while the depth buffer-based approach (section 4.3) describes how distant the object is from a face of a cuboid, by measuring the distance along a direction that is perpendicular to that face.
Instead of testing different choices of the weights ω_i and dimensions dim_i, we define the weighting factors by

ω_i = dim_i / ( f_1^(i) + f_2^(i) + ... + f_{dim_i}^(i) ) ⟹ ||ω_i f^(i)||_1 = dim_i, 1 ≤ i ≤ k, (4.81)

so that the "importance" of the vector f^(i) is specified by its dimension dim_i. Note that from (4.80) and (4.81), it follows that ||h||_1 = dim (1.10).
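A minimal Python sketch of the crossbreeding (the function name is ours; following (4.80), the weighted parent vectors are assumed to be simply concatenated, and their components are assumed to be non-negative, as is the case for the magnitude-based vectors above):

import numpy as np

def crossbreed(vectors):
    # Scale each parent f^(i) so that ||w_i f^(i)||_1 = dim_i (4.81),
    # then concatenate; the l1 norm of the hybrid equals its dimension.
    parts = [len(f) * np.asarray(f, float) / np.abs(np.asarray(f)).sum()
             for f in vectors]
    return np.concatenate(parts)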
As a first choice, we combine only two descriptors (k = 2), the depth buffer-based
feature vector of dimension 258, where the extended bounding box (EBB) is used
(section 4.3), and the ray-based vector of dimension 190 with spherical harmonic
representation (section 4.6.2). Hence, the obtained hybrid feature vector possesses
448 components. The reason for choosing the larger dimension for the depth buffer-based descriptor is to weight the depth buffer-based method as more important than the ray-based approach. Since depth buffer-based descriptors possess better
retrieval performance than the ray-based feature vectors (see sections 5.2.3 and
5.2.6), the hybrid obtained by crossbreeding should inherit more information from
the superior parent.
An example of improving on the performance of the original descriptors by crossbreeding is shown in figure 4.33. The ray-based, depth buffer-based, and hybrid descriptors are used to rank models by shape-similarity to the query model of a bunny. There are only three more models of bunnies, which are considered relevant to the query, in a collection of 907 models. Only the hybrid approach retrieves a relevant object as the first match; that object is ranked at position 6 (ray-based) and 13 (depth buffer) when the other approaches are engaged. The models of bushes, which are matches 1 and 4 in the second row, drop to ranks 59 and 120 when the hybrid descriptor is used. Therefore, not only are relevant objects ranked higher, but some non-relevant objects that receive a high rank from a parent descriptor are ranked significantly lower by the hybrid vector. Besides, the top matches ranked by the hybrid descriptor are more coherent than the top matches ranked by either parent descriptor.
More details about the given example can be seen by visiting the Web-based search engine CCCC presented in the appendix.
The example given in figure 4.33 does not represent a special case. The evaluation presented in section 5.2.13 confirms that the top-ranked models retrieved by the hybrid descriptor are more relevant, whence the retrieval effectiveness is increased. In our experiments (section 5.2.11), we tested crossbreeding of the following four descriptors (see the sketch after the list):
1. Depth buffer-based descriptor (section 4.3) based on the EBB (definition 4.3);
2. Silhouette-based descriptors (section 4.2) with equiangular sampling (4.10);
3. Ray-based descriptor with spherical harmonic representation (section 4.6.2);
4. Voxel-based descriptor in the spectral domain (section 4.5) based on the CC
(definition 4.4) with w = 2.
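Under the same assumptions as the sketch above, the corresponding four-way hybrid is simply the concatenation of the four re-scaled parents (variable names are illustrative; the dimensions actually used are specified in section 5.2.11):

    # f_dbd, f_sil, f_rsh, f_vox: the four parent descriptors listed above
    h = hybrid_vector([f_dbd, f_sil, f_rsh, f_vox])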
Experimental Results
categories, 1362 models are left unclassified, and 6 models are considered unsuitable and are removed from the collection. The smallest class contains only 2 models, while the largest consists of 56 objects. The second classification is strictly shape-based; e.g., limousines and convertible cars are not in the same category, and commercial airplanes, biplanes, and fighters are separated. A 3D-model from our collection possesses 5653 vertices and 10304 triangles, on average.
We also use the official MPEG-7 test set [87]. The MPEG-7 collection (M1) consists of 227 models in VRML 2.0 format [14], which are classified into 15 categories. The names of the categories are: aerodynamic (35 models), balloon (7), building (10), car (17), elm (9), finger (30), fourlimb (31), letter a (10), letter b (10), letter c (10), letter d (10), letter e (10), missile (10), soma (7), and tree (21). There are no unclassified objects in the original MPEG-7 classification. We consider that the MPEG-7 test set has three major drawbacks: small size, improper selection of models, and inconsistent classification. The total number of models (227) is too small for a reliable evaluation of descriptors. The classes "finger", "letter a", "letter b", "letter c", "letter d", and "letter e" contain 80 models in total (35% of the whole set), which are almost identical to other models in the same category. As a result, many 3D-shape descriptors will have ideal precision-recall curves for these 6 categories (precision = 100% for the whole recall range). On the other hand, the category "fourlimb" contains models of humans, alligators, cows, dogs, horses, reptiles, etc., while the category "aerodynamic" contains airplanes, helicopters, dolphins, and sharks. In order to have a more consistent and shape-based categorization, we re-classified (M2) the original MPEG-7 test set into 20 categories, so that 222 models are classified, while 5 are left unclassified. For a 3D mesh model from the MPEG-7 collection, the average number of vertices is 6638, while the average number of triangles is 9046.
Two 3D model collections are provided by the Princeton Shape Analysis and Retrieval Group [108]. The sets are called the Training Database (TR) and the Test Database (TS). Both sets consist of 907 3D-objects, all of which are classified. The training set is subdivided into 90 classes, while the test set consists of 92 categories. In both sets, the smallest category has 4 models, while the largest category possesses 50 meshes. Both categorizations are reasonably consistent and mostly shape-based. Nevertheless, a number of categories are formed using semantics (e.g., the categories "one story home" and "fantasy animal") rather than shape similarity.
The complexity of the 3D-models inside each of the four collections is depicted by the histograms shown in figure 5.1. We take the numbers of vertices and triangles as parameters denoting the complexity of a 3D-mesh. The histogram values are normalized by the total number of models in the corresponding collection. We observe that for our collection, as well as for the training and test databases, approximately 1/3 of all models possess between 1000 and 5000 triangles.
Basic information about all six classifications of 3D-models used in the experiments is shown in table 5.1. We observe that the model with the most vertices (206067) comes from the MPEG-7 set, while a model from the test database is formed by tessellating 316498 triangles. The fraction of orientable models (see definition 1.2) ranges from 64.6% (classification TR) to 100% (M1 and M2).
Figure 5.1: Complexity of the four 3D-model collections used in the experiments: our collection (1841 models), the MPEG-7 collection (227 models), the training collection (907 models), and the test collection (907 models). Each histogram plots the percentage of models against the number of vertices/triangles; the average numbers of vertices and triangles are given in brackets.
The fraction of closed models (see definition 1.1) ranges from 26.6% (classification TR) to 40.1% (M1 and M2). If a 3D-shape description technique is restricted to closed and orientable models only, then 70% of the models from our collection, and more than 3/4 of the models from the training and test databases, are not suitable. Therefore, a 3D-shape descriptor should not have any restrictions regarding closedness and orientability. We recall that one version of our volume-based descriptor (section 4.4) relies on the orientation of triangles (4.31), whence the retrieval performance of that method is low. We stress that all other descriptors presented in chapter 4 do not depend on the orientation of triangles.
We regard the classifications O2, TR, and TS as more relevant than O1, M1,
and M2.
Usually, for a given classification, we average the results (precision-recall curves, p̄_50, p̄_100, RP, and BEP, defined in section 1.5) over all classified models, in order to compare the general retrieval performance of 3D-shape descriptors. Thus, all classified models are used as queries. However, the descriptor with the best overall effectiveness is not necessarily the best descriptor for a specific class of objects. Also, the RP (1.28) and BEP (1.27) can give contradictory results, e.g., when two descriptors are compared and the one having the higher RP possesses the lower BEP. Another problem is the fact that we use six different classifications. If we compare two descriptors
Classification                 O1      O2      M1      M2      TR      TS
No. of models                  1841    1835    227     227     907     907
No. of classes                 5       55      15      20      90      92
No. of classified models       170     473     227     222     907     907
No. of unclassified models     1671    1362    0       5       0       0
Average class size             34.0    8.6     15.1    11.1    10.1    9.9
Largest class size             63      56      35      30      50      50
Smallest class size            20      2       7       3       4       4
Average no. of vertices        5653    5653    6638    6638    4071    4373
Maximal no. of vertices        107155  107155  206067  206067  92583   160940
Minimal no. of vertices        4       4       33      33      28      10
Average no. of triangles       10304   10304   9046    9046    7326    7960
Maximal no. of triangles       215473  215473  219283  219283  185092  316498
Minimal no. of triangles       4       4       57      57      30      16
Closed models [%]              31.6    31.6    40.1    40.1    26.6    27.4
Orientable models [%]          68.1    68.1    100.0   100.0   64.6    66.4
Closed and orientable [%]      30.0    30.0    40.1    40.1    22.8    24.6

Table 5.1: Basic information about the 3D-model classifications: O1 (the first classification of our collection), O2 (the second classification of our collection), M1 (the original classification of the MPEG-7 test set), M2 (our re-classification of the MPEG-7 test set), TR (the Princeton training set), and TS (the Princeton test set)
4. If the previous criteria are not satisfied, then it is difficult to distinguish which
descriptor is better, and we declare the performance of competing descriptors
as approximately equal (or similar).
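For concreteness, a minimal sketch of the per-query statistics behind these comparisons (Python; hedged: the exact definitions of RP (1.28), BEP (1.27), and the averaged precision values are given in section 1.5, which is not reproduced here, so this follows the common textbook reading):

    def query_statistics(ranked_relevance, num_relevant):
        """Precision/recall points and R-precision for a single query.

        ranked_relevance: 0/1 flags, one per retrieved model in rank order
        (the query itself excluded); num_relevant: number of relevant models.
        Assumes the ranking covers at least num_relevant entries.
        """
        points, found, r_precision = [], 0, None
        for rank, rel in enumerate(ranked_relevance, start=1):
            found += rel
            if rel:  # record (recall, precision) at each relevant hit
                points.append((found / num_relevant, found / rank))
            if rank == num_relevant:  # precision after |relevant| retrievals
                r_precision = found / rank
        return points, r_precision

    # Relevant models at ranks 1, 3, and 7; three relevant models in total.
    pts, rp = query_statistics([1, 0, 1, 0, 0, 0, 1], 3)
    # pts == [(1/3, 1.0), (2/3, 2/3), (1.0, 3/7)], rp == 2/3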
Approach                                          Abbreviation
Cords-based descriptor                            PCD
Moments-based descriptor                          PMD
Descriptor based on equivalence classes           DEC
MPEG-7 shape spectrum descriptor                  SSD
Descriptor based on "shape distributions"         SDD
Descriptor based on binary voxel grids            BVG
Descriptor based on exponentially decaying EDT    EDT
Abbreviations used for denoting the different scaling factors (section 3.4) and the different distance calculations (section 1.4) are given in tables 5.4 and 5.5. Regarding the application of dissimilarity measures, we apply all distances presented in section 1.4 directly. However, we also test the application of the l1 distance to re-scaled feature vectors. More precisely, each feature vector is scaled by a factor ω defined by (4.81). The motivation for the re-scaling is to test the behavior of the feature vectors when the l1 norm of the vector f is fixed, ||f||_1 = const. We do not test a similar approach for the l2 norm (||f||_2 = const), because it can be proven that the ranking of feature vectors will be the same as when the d_2^min (minL2) is used.
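A sketch of this re-scaling step (Python; illustrative): every vector is multiplied by its factor ω from (4.81), which fixes its l1 norm to the common dimension, after which the ordinary l1 distance is applied.

    import numpy as np

    def l1_scaled(fq, fc):
        """l1 distance after the re-scaling (4.81).

        Each vector f is multiplied by w = dim / ||f||_1, so every re-scaled
        vector has the same l1 norm (equal to dim); the ranking then no longer
        depends on the absolute magnitudes of the original vectors.
        """
        fq, fc = np.asarray(fq, float), np.asarray(fc, float)
        gq = fq * (fq.size / np.abs(fq).sum())
        gc = fc * (fc.size / np.abs(fc).sum())
        return float(np.abs(gq - gc).sum())

Note that for a descriptor whose l1 norm is constant by construction (e.g., the volume-based variant whose components sum to 1, discussed later in this chapter), this re-scaling changes nothing, which is why "L1" and "L1, scaled" then produce identical rankings.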
Table 5.4: Abbreviations for the scaling factors from section 3.4.
Table 5.5: Abbreviations for the distance metrics from section 1.4.
continuous approach (section 3.4), where the average distance (3.25) from a point on the surface to the center of gravity is taken as the scale factor.
As an example, the caption "RAY-A, Our DB1, L1" of a precision-recall diagram is sufficient to specify the type (the ray-based feature vector in the spatial domain), the scaling (A), the classification used (our original), and the distance calculation (l1).
Figure 5.2: Average precision/recall diagrams of four model classifications for vari-
ous dimensions of the ray-based feature vector (section 4.1).
where the directional unit vectors u_i and u_j (1 ≤ i, j ≤ 10k² + 2) are given by (4.4). In figure 5.3, we compare the l1, l2, d_1^min, and d_S2 (5.3) distances. We observe that d_S2 gives slightly better results than the l1 distance. The minimized l1 distance d_1^min is more effective than the l1 norm only when our reclassified collection is used. In all cases, the l2 norm is the least effective distance metric.
[Figure 5.3: average precision/recall diagrams for the ray-based feature vector of dimension 162 (RAY-A), comparing the L1, L2, minL1, and QFD distances on Our DB1, Our DB2, Train DB1, and Test DB1.]
On average, if the feature vector of dimension 162 is used, the ranking according to similarity with a query, i.e., the computation and sorting of distances between the query and all other models in the collection, is approximately two times faster when the l1 or l2 metrics are used instead of d_1^min and d_S2. Relative computational costs for calculating distances between vectors of different dimensions, without sorting, are given in table 5.6. We read that, e.g., for dimension 162, the computation of the quadratic form distance d_S2, defined by (1.15) and (5.3), is 2.24 times more expensive than the computation of the l1 norm, which is, on average, 2.04 times faster than the calculation of the minimized l1 distance, d_1^min.
Table 5.6: Average cost factors for calculating distances between ray-based feature
vectors, for different dimensions. Results are normalized by the average computa-
tional time for the l1 distance.
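For orientation, a generic quadratic form distance has the shape d_A(f, g) = sqrt((f − g)^T A (f − g)); in the thesis's d_S2, the similarity matrix is derived from the directional unit vectors u_i via (5.3), whose exact entries are not reproduced here. A sketch that takes an arbitrary symmetric positive semi-definite matrix A as input:

    import numpy as np

    def quadratic_form_distance(f, g, A):
        """Generic quadratic form distance d_A(f, g) = sqrt((f-g)^T A (f-g)).

        A: symmetric positive semi-definite similarity matrix; for d_S2 its
        entries are derived from the directional unit vectors u_i, u_j per
        (5.3).  A direct evaluation costs O(dim^2) operations; cheaper
        evaluations are possible if A is factored in advance.
        """
        d = np.asarray(f, float) - np.asarray(g, float)
        return float(np.sqrt(d @ A @ d))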
In order to demonstrate that the scaling by (3.25) is the best choice, we display the precision-recall diagrams on the left-hand side of figure 5.4. The results are obtained using the l1 norm as the distance metric. On the right-hand side of figure 5.4, the d_1^min is used as the distance metric for feature vectors extracted using different scale factors as well as without scaling. We observe that all five curves are exactly the same. Hence, if the d_1^min is used for measuring dissimilarity between ray-based descriptors, we do not need to scale 3D-mesh models during the normalization step.
[Figure 5.4: average precision/recall diagrams of our reclassified collection for the ray-based feature vector of dimension 162, extracted using different scaling factors (A, C, X, L, N; table 5.4); left: ranked by the l1 distance, right: ranked by d_1^min, where all curves coincide.]
Each component of the ray-based feature vector measures the extent of a model along a given directional unit vector u; hence, if the candidate model is rescaled by α, then the ray-based feature vector f_c is also rescaled by the same factor α. In this case, the l1 distance between the query and candidate vectors is equal to the minimized l1 distance d_1^min. Therefore, the results depicted on the right-hand side of figure 5.4 are expected, because the d_1^min(f_q, f_c) distances obtained with different initial scaling factors are proportional.
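A sketch of one way to evaluate a scale-minimized l1 distance (assuming, consistently with the discussion above, that d_1^min(f_q, f_c) = min over s > 0 of ||f_q − s·f_c||_1; the exact definition is (1.21)): the objective is convex and piecewise linear in s, so its minimum is attained at one of the breakpoints s = f_q,i / f_c,i.

    import numpy as np

    def min_l1(fq, fc):
        """Scale-minimized l1 distance: min over s > 0 of ||fq - s*fc||_1.

        (Assumed reading of d_1^min (1.21).)  The objective is convex and
        piecewise linear in s, so it suffices to test the breakpoints
        s = fq[i] / fc[i].
        """
        fq, fc = np.asarray(fq, float), np.asarray(fc, float)
        s_cand = fq[fc != 0] / fc[fc != 0]
        s_cand = s_cand[s_cand > 0]
        if s_cand.size == 0:
            return float(np.abs(fq).sum())
        return float(min(np.abs(fq - s * fc).sum() for s in s_cand))

    # Rescaling the candidate by any alpha > 0 merely relabels s, so the value
    # is unchanged; rescaling query and candidates alike multiplies all
    # distances by the same factor, preserving the ranking (cf. figure 5.4).
    fq, fc = np.array([3.0, 1.0, 2.0]), np.array([1.5, 0.5, 1.2])
    assert np.isclose(min_l1(fq, fc), min_l1(fq, 7.3 * fc))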
We recommend using the ray-based feature vector in the spatial domain of dimension 162, fixing the scale by (3.25), and using the d_S2, defined by (1.15) and (5.3), as the dissimilarity measure.
Figure 5.6: Average precision/recall diagrams of our reclassified collection, for the
silhouette-based feature vectors. On the left-hand side, different scale normalization
techniques are used (table 5.4). On the right-hand side, different parameter settings,
image dimensions and number of sample points, are tested.
On the right-hand side of figure 5.6, we tested different parameter settings for the silhouette-based feature vector of dimension 300, which is extracted using (3.25) as the scale factor. We examined the following dimensions of silhouette images, 128 × 128, 256 × 256, and 512 × 512, and the following numbers of equiangular sample points: 256, 512, and 1024. Although the precision-recall curves look almost the same, we consider the descriptor extracted from 256 sample points of silhouette images of dimensions 256 × 256 to be slightly better than the others. Note that only the results obtained using our reclassified 3D-model collection are shown in figure 5.6. Nevertheless, all the conclusions inferred using our collection are also supported by the results obtained using the training and test databases. The other precision-recall diagrams are omitted to save space.
In figure 5.7, we explore the influence of the distance metric (table 5.5) on the retrieval performance of the silhouette-based feature vector of dimension 300. Average results for the four classifications O1, O2, TR, and TS (table 5.3) are shown. We observe that the l1 distance applied to re-scaled feature vectors ("L1, scaled") is the most effective for ranking the models. The d_1^min (minL1) dissimilarity measure is more effective than the directly applied l1 norm. Both minimized distances, d_1^min and d_2^min, are more effective than the original distances, l1 and l2. Finally, the lmax norm is not an appropriate distance metric. Having in mind the definition of the descriptor, the ineffectiveness of the lmax norm can be anticipated.
Figure 5.7: Average precision/recall diagrams of four model classifications, when dif-
ferent distance calculations from section 1.4 (table 5.5) are applied to the silhouette-
based feature vector.
The average computational costs for the different distance metrics are compared in table 5.7 (compare to table 5.6). All average costs are normalized by the average time needed to compute the l1 distance. The computational costs for ranking using the l1 distance are approximately two times lower than the costs for ranking by d_1^min.
Table 5.7: Average cost factors (normalized by the average computational time
for the l1 distance) for calculating distances (table 5.5) between silhouette-based
feature vectors of dimension 300.
Figure 5.8: Average precision/recall diagrams of our reclassified collection, when the silhouette-based feature vectors are extracted using different scaling factors (table 5.4). Dissimilarities are calculated by d_1^min and d_2^min (table 5.5).
Finally, we test three different choices for defining the sample values that are used as the input for the discrete fast Fourier transform (4.11). The three definitions of sample values are given by (4.12), (4.13), and (4.14). In figure 5.9, results for the classifications O1, O2, TR, and TS (table 5.3) are shown. We observe that, for each classification, the average precision-recall curves of the three sampling methods are very close to each other. Nevertheless, there are slight differences, typically less than 0.5%, between the precision-recall curves. Unfortunately, these differences are contradictory across classifications. For instance, method 2 (4.13) is the best when the training database is used, while method 1 (4.12) is the best for the test database. Therefore, we consider that any of the three possibilities can be engaged, giving approximately the same retrieval results. We decided to use method 2, in which the values of the samples are defined by (4.13).
[Figure 5.9: average precision/recall diagrams for the silhouette-based feature vector of dimension 300 (SIL-A, L1), comparing sampling methods 1-3 on classifications O1 and O2 (left) and TR and TS (right).]
We explored all variants of the depth buffer-based feature vector (section 4.3). First, we test a version of the depth buffer-based descriptor that is robust with respect to outliers. More precisely, we use the faces of the canonical cube (CC) (definition 4.4) to form the depth buffers, where we have set w = 2. We recall that the robustness of the descriptor relying upon the CC is demonstrated in figure 4.13. The normalization has been done by our continuous approach (section 3.4), where (3.25) is used to fix the scale. In what follows, the 2D-FFT (4.26) is applied to depth buffer images of dimensions 256 × 256, if not explicitly stated otherwise. In figure 5.10, we give precision-recall curves for different dimensions of the feature vector. The tested dimensions are 18, 42, 78, 126, 186, 258, 342, and 438. Note that only the feature vector of dimension 438 (k = 8 in (4.29)) is extracted, embedding all lower-dimensional vectors. Results for the classifications O1, O2, TR, and TS (table 5.3) are depicted. We conclude that the best choice of dimension is 438.
[Average precision/recall diagrams: DBD-A (w = 2), dimension 438, comparing the L1, L2, Lmax, minL1, minL2, and re-scaled L1 distances on Our DB1, Our DB2, Train DB1, and Test DB1.]
distance (3.25). Note that there is no need for scaling during the normalization step when the CBC and EBB are used. The results for our reclassified collection are shown in figure 5.12. The l1 and d_1^min distances are used for ranking. We observe that the best version of the depth buffer approach is obtained when the EBB is used. As expected, the retrieval performance of the descriptor relying upon the CC deteriorates with the increase of w. We recall that, when the cube defined by (4.25) is used, only the part of a mesh inside the cube is processed. The points of the model I (1.5) that lie outside the cube are effectively ignored as outliers. By setting w = 2, we practically crop a certain number of models. If we increase the value of w, the number of ignored points (parts of a mesh) decreases. However, the increase of w deteriorates the retrieval performance of the descriptor, because the object occupies a smaller part of the corresponding depth buffer image. In other words, for a large value of w, the representation of a 3D-model by depth buffer images is too coarse, whence the shape characteristics of the underlying 3D-model cannot be captured in an appropriate way. Thus, the best choice is to set w = 2. The performance of the descriptor relying upon the CBC is weaker than the performance of the descriptors based on the EBB or the CC with w = 2. Regardless of the obtained results, we suggest using the descriptor that is robust with respect to outliers (requirement 5 from section 1.3.4), relying upon the CC with w = 2. The reason for giving the advantage to the CC is depicted by the example in figure 4.13.
[Figure 5.12: average precision/recall diagrams of our reclassified collection for the depth buffer-based descriptor of dimension 438, comparing the CC with w ∈ {2, 4, 8, 16}, the EBB, and the CBC; left: l1 distance, right: d_1^min.]
In figure 5.13, we test different scaling factors (table 5.4) in the normalization procedure. After the normalization, the feature vectors are extracted from 256 × 256 depth buffer images, using the canonical cube defined by (4.25) with w = 2. The l1 and d_1^min distances are used for ranking. Obviously, the average distance (3.25) is the best choice of the scaling factor.
[Figure 5.13: average precision/recall diagrams of our reclassified collection for the depth buffer-based descriptor (w = 2), dimension 438, extracted using the scaling factors A, C, X, and L; left: l1 distance, right: d_1^min.]
In figure 5.14, we test the influence of the dimensions of the depth buffer images on the retrieval performance. We tested images of dimensions 64 × 64, 128 × 128, 256 × 256, and 512 × 512, and no significant difference in retrieval performance is present. However, we consider image dimensions of 256 × 256 to be slightly better than the other choices. We have also performed tests analogous to the ones from figures 5.12, 5.13, and 5.14, using the training (TR) and test (TS) databases. We stress that the obtained results support the conclusions inferred using our reclassified collection.
[Figure 5.14: average precision/recall diagrams of our reclassified collection for the depth buffer-based descriptor (w = 2), dimension 438, extracted from depth buffer images of 64 × 64, 128 × 128, 256 × 256, and 512 × 512 pixels; left: l1 distance, right: d_1^min.]
dimension 486 performs very similarly to the feature vector of dimension 294. In the case of almost identical performance of two descriptors, we give the advantage to the lower-dimensional feature vector. We also observe that the feature vectors whose dimensions are fixed by an odd value of the parameter k perform better than the vectors whose dimensions are fixed by an even value of k. This observation can be explained by the fact that, for odd values of k, after the subdivision of the 3D-space (figure 4.16), the coordinate hyper-planes are not boundaries of the obtained regions ν_i (4.37); rather, they are located in the middle of a number of regions ν_i.
The four variants (table 4.5) of the volume-based approach in the spatial domain are evaluated in figure 5.16. We recall that the variants V1 and V3 rely upon the orientation of triangles, because the signed values of the volume V_Tj (4.31) are used, while the variants V2 and V4 deal with the non-negative volumes |V_Tj|. Also, the variants V1 and V2 are extracted after fixing the scale of a 3D-mesh model by the average distance (3.25), while for the variants V3 and V4 the feature vector f is "post-normalized" so that ||f||_1 = 1. The results obtained using our reclassified collection (O2) and the test database (TS) are shown in figure 5.16.
The results obtained for the other classifications (O1, TR, M1, and M2) comply with the presented results. According to the shown precision-recall diagrams, the variant V4 is the best, while the variant V2 is better than the remaining two. The variant V3 outperforms the variant V1. Thus, if we rely upon the orientation of triangles and upon a canonical scale, the feature vector shows extremely poor retrieval performance (V1). By normalizing the values of the vector components, we obtain a better descriptor (V3). However, by taking non-negative volumes without post-normalization, the retrieval performance is even better (V2). The best solution is to have both non-negative volumes and post-normalized vector components (V4).
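For orientation, a sketch of the signed-volume computation (an assumption on my part: that V_Tj in (4.31) is the signed volume of the tetrahedron spanned by triangle T_j and the coordinate origin, as is usual for such descriptors); flipping a triangle's orientation flips the sign, which is exactly why V1/V3 depend on orientation while V2/V4, which use |V_Tj|, do not:

    import numpy as np

    def signed_tet_volume(a, b, c):
        """Signed volume of the tetrahedron (0, a, b, c): one sixth of the
        scalar triple product (assumed form of V_Tj in (4.31)).  Swapping two
        vertices flips the sign but not the magnitude, so |V_Tj| is
        independent of the orientation of the triangle."""
        return float(np.dot(np.cross(a, b), c)) / 6.0

    a, b, c = np.array([1.0, 0, 0]), np.array([0, 1.0, 0]), np.array([0, 0, 1.0])
    assert signed_tet_volume(a, b, c) == -signed_tet_volume(b, a, c)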
[Figure 5.16: average precision/recall diagrams of our reclassified collection (left) and the test database (right) for the four variants V1-V4 of the volume-based feature vector of dimension 294, ranked by the l1 distance.]
[Figure 5.17: average precision/recall diagrams of our reclassified collection (left) and the test database (right) for the variant V4 of dimension 294, comparing the L1, L2, Lmax, minL1, minL2, and re-scaled L1 distances.]
In order to explore the influence of the distance calculation on the retrieval performance of the variant V4 of the volume-based descriptor, we present the precision-recall diagrams of our reclassified collection and the test database (figure 5.17). The obtained results suggest that the d_1^min (1.21) is slightly more effective than the l1 norm (1.10). Note that the L1 and "L1, scaled" give exactly the same rankings, because the sum of the components of the volume-based feature vector is equal to 1, i.e., the l1 norm is already constant. The d_2^min is more effective than the l2 distance, but both of them are inferior to the l1 norm. As expected, the results show that the use of the lmax distance should be avoided. The results shown in figure 5.17 are obtained using the classifications O2 and TS. The results for the classifications O1 and TR (table 5.3) are similar.
Next, we want to examine whether we gain in retrieval effectiveness by representing the volume-based feature in the spectral domain. In figure 5.18, we display selected precision-recall curves of the classifications O2 and TS (table 5.3), for the V4 vector of dimension 294 in the spatial domain and the same variant of the volume-based feature represented in the spectral domain (see section 4.4). For the feature vector in the spectral domain, we tested the dimensions 186, 258, 342, and 438. When the l1 norm is applied after the re-scaling (4.81), the volume-based feature vector of dimension 438 in the spectral domain is superior to the competing descriptors. If the l1 norm is used directly (i.e., without the re-scaling), then the spatial-domain representation is better. Hence, we conclude that, in the case of the volume-based approach, we gain in retrieval effectiveness by representing the feature in the spectral domain only if the obtained vectors are re-scaled before ranking with the l1 distance. The results for the classifications O1, TR, and TS support this conclusion.
[Figure 5.18: average precision/recall diagrams of our reclassified collection for the variant V4: spatial domain (dimension 294) vs. spectral domain (dimensions 186, 258, 342, and 438); left: l1 distance, right: re-scaled l1.]
We also tested the parameter pmin (4.38), which specifies the fineness of the approximation. We took pmin ∈ {32000, 64000, 128000, 256000}. In figure 5.19, all displayed precision-recall curves, which are obtained for our reclassified collection and the training database, almost coincide. Therefore, even for pmin = 32000, the approximative algorithm generates descriptors whose performance is equal to the performance of descriptors extracted using the analytical approach, which is more time-consuming (see table 4.4). We suggest to set pmin = 64000.
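A sketch of the sampling idea behind the approximative algorithm (hedged: figure 4.21 is not reproduced here, so I assume the algorithm distributes roughly pmin sample points over the mesh surface, proportionally to triangle area, and accumulates them into the voxel grid):

    import numpy as np

    def approx_voxel_grid(vertices, triangles, res=64, pmin=64000):
        """Approximate voxel attributes of a triangle mesh (assumed reading of
        the approximative algorithm, figure 4.21): sample about pmin points on
        the surface, proportionally to triangle area, and histogram them into
        a res^3 grid over [0, 1]^3 (mesh assumed normalized into this cube)."""
        v, t = np.asarray(vertices, float), np.asarray(triangles, int)
        a, b, c = v[t[:, 0]], v[t[:, 1]], v[t[:, 2]]
        areas = 0.5 * np.linalg.norm(np.cross(b - a, c - a), axis=1)
        counts = np.maximum(1, np.round(pmin * areas / areas.sum())).astype(int)
        grid = np.zeros((res, res, res))
        for i, n in enumerate(counts):
            r1, r2 = np.random.rand(n), np.random.rand(n)
            flip = r1 + r2 > 1  # reflect to sample the triangle uniformly
            r1[flip], r2[flip] = 1 - r1[flip], 1 - r2[flip]
            pts = a[i] + r1[:, None] * (b[i] - a[i]) + r2[:, None] * (c[i] - a[i])
            idx = np.clip((pts * res).astype(int), 0, res - 1)
            np.add.at(grid, (idx[:, 0], idx[:, 1], idx[:, 2]), 1)
        return grid / grid.sum()  # normalized occupancy attributes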
[Figure 5.19: average precision/recall diagrams of our reclassified collection (left) and the training database (right) for the variant V4 of dimension 294: analytical extraction vs. the approximative algorithm with pmin ∈ {32000, 64000, 128000, 256000}; all curves nearly coincide.]
[Average precision/recall diagrams: VOX-A (CC, w = 2), L1, spatial-domain dimensions 27, 64, 125, 216, 343, and 512, on Our DB1, Our DB2, Train DB1, and Test DB1.]
[Average precision/recall diagrams: VOX-A (CC, w = 2), FFT, L1, spectral-domain dimensions 31, 64, 115, 188, 287, and 416, on Our DB1, Our DB2, Train DB1, and Test DB1.]
The influence of the choice of the region of voxelization on the retrieval performance is depicted in figure 5.21. We tested the CBC, BB, and EBB, as well as the CC with w = 2. Our reclassified collection and the test database are used for a series of experiments with feature vectors in the spatial and spectral domains. In the spatial domain, the BB is the best choice of the region of voxelization. Thus, the feature vector that utilizes deformed representations of 3D-models (the aspect ratio is not preserved when using the BB) is better than the feature vector that is robust with respect to outliers (using the CC with w = 2). In the spectral domain, the descriptor relying upon the CC with w = 2 is the best. We observe that the performance of the best descriptor in the spatial domain is better than the performance of the best descriptor in the spectral domain.
Figure 5.21: Average precision/recall diagrams of our reclassified collection and the
test database, for the voxel-based feature vectors extracted using various choices of
the region of voxelization, in the spatial and spectral (FFT) domains.
which is clearly better than the d_2^min. Since the l1 norm of the voxel-based feature vector in the spatial domain, extracted relying upon the BB, is equal to 1, the l1 distance always gives the same ranking as the "L1, scaled". The l2 norm shows significantly lower retrieval performance than the d_2^min. The lmax is not a suitable distance metric at all. In the spectral domain, the d_2^min is slightly better than the d_1^min, "L1, scaled", and l2. The l1 norm outperforms only the lmax distance.
[Average precision/recall diagrams of our reclassified collection and the test database, comparing distance metrics for the voxel-based descriptor: spatial domain (BB, dimension 343) and spectral domain (CC, w = 2, FFT, dimension 416).]
The results shown in figure 5.23 demonstrate that the best choice of the parameter w, which fixes the dimensions of the CC (definition 4.4), is w = 2. Also, the scaling factor (3.25) is the best choice. These conclusions are inferred from the precision-recall diagrams on the left-hand side of figure 5.23. We compare descriptors extracted after normalizing a 3D-model by (3.25) and using the CC with w ∈ {2, 4, 8}. We observe that the retrieval performance deteriorates with the increase of w. Also, we compare descriptors extracted after normalizing the scale of a 3D-object by different methods (table 5.4) and using the CC with w = 2. As expected, the average distance (3.25) outperforms the other options. On the right-hand side of figure 5.23, we examine the influence of the resolution of voxelization (the dimensions of the voxel grids) on retrieval effectiveness. We test voxel grids of dimensions 32 × 32 × 32, 64 × 64 × 64, and 128 × 128 × 128, i.e., octrees of depths 5, 6, and 7. The largest voxel grid shows the best performance.
[Figure 5.23: average precision/recall diagrams of our reclassified collection for the voxel-based descriptor in the spectral domain (dimension 416); left: scaling factors and w (A with w ∈ {2, 4, 8}; C, L, X with w = 2), right: voxel-grid resolutions 32³, 64³, and 128³.]
[Figure 5.24: average precision/recall diagrams comparing the analytical algorithm with the approximative algorithm (pmin = 32000) on the classifications O1, O2, TR, and TS, in the spatial domain (BB, dimension 343) and the spectral domain (CC, w = 2, FFT, dimension 416); the curves nearly coincide.]
The algorithms for generating voxel grids, the analytical (section 4.5) and the approximative (figure 4.21), are compared in figure 5.24. We recall that the analytical algorithm exactly computes the distribution of the point set I (1.5) of a 3D-mesh model across the region of voxelization ρ (4.19), while the approximative method uses the parameter pmin to fix the fineness of the approximation. The results show that, for the classifications O1, O2, TR, and TS (table 5.3), the descriptors extracted using the approximative algorithm with pmin = 32000 perform almost identically to those extracted with the analytical method, in both the spatial (BB, dim = 343) and spectral domains (CC with w = 2, dim = 416). We recall (see table 4.6) that the average generation time of the feature vector of dimension 343 in the spatial domain amounts to 12.5 ms when the approximative algorithm with pmin = 32000 is used, and 30 ms when the attributes of the voxel grids are analytically computed. In the spectral domain, we use octrees of depth 7 (128 × 128 × 128 voxel grid) and apply the 3D-DFT. The approximative algorithm with pmin = 32000 generates a descriptor in 538 ms, on average, while the analytical approach takes 3160 ms to produce a feature vector. We stress that the results for the training and test databases (table 5.3) comply with the presented results obtained for our reclassified collection.
The most effective variant of the voxel-based approach is the feature vector of dimension 343 in the spatial domain, which is extracted relying upon the BB. If robustness with respect to outliers is a requirement that should be fulfilled, then we recommend the spectral representation of the voxel-based feature of dimension 416, which is obtained by applying the 3D-DFT to a 128 × 128 × 128 voxel grid and using the CC with w = 2 as the region of voxelization. Before forming the voxel grid, each model should be normalized (section 3.4). We recommend the average distance (3.25) as the scaling factor. In order to gain in efficiency, we recommend using the approximative algorithm (figure 4.21) with pmin = 32000. The recommended distance metrics are the l1 (1.10) in the spatial domain and the d_2^min (1.20) in the spectral domain.
The spectral representation of the ray-based feature, which we regard as the ray-based feature vector with spherical harmonic representation (section 4.6.2), is evaluated in this subsection. We recall that the descriptor is formed using the first k rows of Fourier coefficients (algorithm 4.1), obtaining a feature vector of dimension dim = k(k + 1)/2 (4.66). In order to determine a good choice of dimension, we generated descriptors for k = 29 (dim = 435), embedding all feature vectors whose dimensions are fixed by k < 29. Our numerous tests show that the ray-based feature vectors with the spherical harmonic representation whose dimensions are fixed by 13 ≤ k ≤ 19 have similar retrieval effectiveness, while the vectors obtained for k < 13 or k > 19 have lower performance. To demonstrate our results, we selected precision-recall curves of the four classifications O1, O2, TR, and TS (table 5.3) in figure 5.25. We notice that the vectors of dimensions 91, 136, 171, and 190 perform very similarly, while the vectors of dimensions 253 and 435 show that the performance decreases with the increase of dimension. Since it is difficult to select the best choice of vector dimension, we consider any of the dimensions 91 (k = 13), 105, 120, 136, 153, 171, and 190 (k = 19) acceptable. In what follows, we show results for the ray-based descriptor in the spectral domain of dimension 136.
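As a quick check of the quoted dimensions against (4.66), dim = k(k + 1)/2:

    dim(13) = 13·14/2 =  91    dim(14) = 105    dim(15) = 120    dim(16) = 136
    dim(17) = 153              dim(18) = 171    dim(19) = 190
    dim(22) = 253              dim(29) = 29·30/2 = 435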
Next, we examine the impact of the scaling factor (table 5.4) on retrieval perfor-
mance of the ray-based descriptor with spherical harmonic representation. In figure
5.26, the precision-recall diagrams of classifications O2 and TR are depicted. We
observe that the average distance (3.25) is the best choice of scaling factor.
We recall that the more samples (4.63) of the extent function r (4.1) are taken, the better the approximation of the underlying 3D-object (see figure 4.26). The samples serve as the input for the Fourier transform on the sphere (section 4.6.1). Therefore, we want to test the influence of the number of samples on the performance of the ray-based descriptor in the spectral domain. In figure 5.27, we display the results obtained for our reclassified collection and the test database. The tested numbers of samples are 64² (B = 32), 128², 256², and 512². Having in mind the average feature extraction times (table 4.8) and the given precision-recall diagrams, we suggest to use B = 64, i.e., to take 128² sample values of the function r.
Figure 5.27: Average precision/recall diagrams of our reclassified collection and the test database, when the ray-based feature vectors with spherical harmonic representation are extracted using different numbers of samples, 4B².
Figure 5.28: Average precision/recall diagrams of our reclassified collection and the
test database, obtained by applying different distance metrics (table 5.5) to the
ray-based feature vector with spherical harmonic representation.
The influence of the distance metric used (table 5.5) on the performance of the ray-based descriptors in the spectral domain is illustrated in figure 5.28. We observe that the best performance is obtained by applying the l1 distance after the re-scaling, so that the l1 norm of each feature vector is constant (4.81). The differences in the precision-recall values obtained by applying the l1, d_2^min, d_1^min, and l2 are not significant. The only exception is the lmax distance, which shows the worst performance. Hence, we suggest using the "L1, scaled" technique, whence it is not necessary to fix the scale invariance in the normalization step (section 3.4), because each feature vector is re-scaled. A similar example is shown in figure 5.8.
We recommend to sample the extent function r (4.1) at 128² points defined by (4.63), before applying the Fourier transform on the sphere (section 4.6.1). Acceptable dimensions of the ray-based feature vectors with spherical harmonic representation are 91, 105, 120, 136, 153, 171, and 190. We recommend to normalize the l1 length of each feature vector and to use the l1 distance (1.10) for ranking.
The precision-recall curves in figure 5.29, which are selected from all generated diagrams, demonstrate that dim = 363 (k = 11) is the best choice of dimension. The influence of the four scaling factors (table 5.4), which are presented in section 3.4, on the retrieval performance of the moments-based descriptor is depicted in figure 5.30. As is the case with all approaches tested in this section, the shown precision-recall diagrams, as well as the other evaluation parameters p̄_100, p̄_50, BEP, and RP (section 1.5), unambiguously show that the average distance (3.25) is the best choice of the scaling factor. The results obtained for our original classification, as well as for the test database, lead to the same conclusion.
[Figure 5.30: average precision/recall diagrams for the moments-based descriptor of dimension 363, extracted using the scaling factors A, C, X, and L; left: our reclassified collection, right: the training database; l1 distance.]
[Figure 5.31: average precision/recall diagrams for the moments-based descriptor of dimension 363, extracted using B ∈ {16, 32, 64, 128}; left: classifications O1 and O2, right: TR and TS; l1 distance.]
From the shown diagrams, we observe that B = 64 is the best choice of the number of samples. This conclusion is based on both the shown performance and the average extraction times given in table 4.9. The effectiveness of the distance measures (table 5.5) used for ranking moments-based feature vectors is depicted in figure 5.32. The l1 distance applied to normalized (re-scaled) feature vectors is slightly more effective than the d_1^min. The minimized l1 distance d_1^min outperforms the other dissimilarity measures. The experiments on our original classification and on the training database also confirm that the d_1^min is the best choice of distance calculation. Having in mind the definition of the moments (4.68), if the "L1, scaled" technique is applied, there is no need for fixing the scale of a 3D-model during the normalization procedure (section 3.4). A similar example is shown in figure 5.8.
Figure 5.32: Average precision/recall diagrams of our reclassified collection and the
test database, obtained by applying different distance metrics (table 5.5) to the
moments-based feature vector.
Thus, we recommend the following settings for the moments-based descriptor: the number of samples 128² (B = 64), the dimension dim = 363, and normalization of the l1 length of each feature vector before applying the l1 distance ("L1, scaled").
[Average precision/recall diagrams for the shading-based feature vector (SSH) of dimension 91, extracted using B ∈ {32, 64, 128}; left: classifications O1 and O2, right: TR and TS; l1 distance.]
The l2 norm of the difference between two shading-based feature vectors outper-
forms all tested dissimilarity measures. The conclusion is based on results from a
set of experiments on all available model classifications, applying different metrics
(table 5.5) to the shading-based feature vector. Precision/recall diagrams of our
reclassified collection and the test database, obtained by applying different dissim-
ilarity measures to the shading-based descriptor, are illustrated in figure 5.35.
[Figure 5.35 plot; panels: "SSH, Our DB2, different distances" and "SSH, Test DB1, different distances"; axes: recall (%) vs. precision (%).]
Figure 5.35: Average precision/recall diagrams of our reclassified collection and the
test database, obtained by applying different distance metrics (table 5.5) to the
shading-based feature vector.
We recommend the following settings for the shading-based descriptor: the number of samples 128² (B = 64), the dimension dim = 91, and the l2 distance for ranking the feature vectors.
[Figure: average precision/recall diagrams for the complex descriptor (CSH-A) with the l1 distance and different values of alpha (0.2–6.0); panels: Our DB1, Our DB2, Train DB1, and Test DB1.]
[Figure 5.38 plot; panels: "CSH (alpha=0.4), Our DB2, L1, different scaling" and "CSH (alpha=0.4), Test DB1, L1, different scaling" (scaling factors A, C, X, L from table 5.4).]
Figure 5.38: Average precision/recall diagrams of our reclassified collection and the
test database, when the complex feature vectors are extracted using different scaling
factors (table 5.4).
The results presented in figure 5.38 suggest that the scaling factor defined as the
average distance (3.25) is the best choice.
The influence of dissimilarity measure on retrieval effectiveness of the complex
descriptor is depicted by precision-recall diagrams in figure 5.39. The l1 norm
slightly outperforms the “L1, scaled” approach. Evidently, other dissimilarity mea-
sures are inferior to the top two.
[Figure 5.39 plot; panels: "CSH-A (alpha=0.4), Our DB2, different distances" and "CSH-A (alpha=0.4), Train DB1, different distances".]
Figure 5.39: Average precision/recall diagrams of our reclassified collection and the
training database, obtained by applying different distance metrics (table 5.5) to the
complex feature vector.
[Figure 5.40 plot; panels: "CSH-A, Our DB2, L1, shading vs. curvature" and "CSH-A, Test DB1, L1, shading vs. curvature" (alpha = 0.4, 0.8, 1.0).]
We also tested the complex feature vector defined using the function z_α (4.74), which combines the extent function r (4.1) and the function c (4.73) based on the curvature index (2.8). A summary of the obtained results is presented in figure 5.40.
[Figure: average precision/recall diagrams for the LDS-A descriptor (R = 8, M = 4) with the l1 distance and different vector dimensions (80–528); panels: Our DB1, Our DB2, Train DB1, and Test DB1.]
[Figure 5.42 plot; panels: "LDS-A (R=8), Our DB2, L1, different M" and "LDS-A (R=8), Test DB1, L1, different M" (M = 2, 4, 5, 6, 8).]
Figure 5.42: Average precision/recall diagrams of our reclassified collection and the
test database, when LDS feature vectors are extracted using different values of the
parameter M (radius of the largest sphere), with R = 8.
[Figure 5.43 plot; panels: "LDS-A (M=4), Our DB2, L1, different R" and "LDS-A (M=4), Train DB1, L1, different R" (R = 4, 8, 12, 16).]
Figure 5.43: Average precision/recall diagrams of our reclassified collection and the
training database, when LDS feature vectors are extracted using different values of
the parameter R (number of functions on a sphere), with M = 4.
Thus, we state that acceptable parameter settings for the LDS feature vector are: R = 8, M = 4, and dim = 360 (L = 9). Different scale normalization factors (table 5.4) affect the retrieval effectiveness. In figure 5.44, we show the precision-recall diagrams of our reclassified collection and the test database, when the LDS feature vectors are extracted using different scaling factors. The results confirm that the average distance (3.25) is the best choice for normalizing the scale of a 3D-mesh model.
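Section 3.4 derives the exact, integral-based value of the average distance; purely as an illustration of the quantity (an area-weighted Monte Carlo stand-in, not the thesis algorithm), it can be approximated by sampling the mesh surface:

import numpy as np

def approx_davg(vertices, triangles, samples=10000, rng=np.random.default_rng(0)):
    # Sample points uniformly on the surface (triangles weighted by area)
    # and average their distance to the area-weighted center of gravity.
    v = np.asarray(vertices, dtype=float)
    tri = v[np.asarray(triangles)]                       # shape (T, 3, 3)
    a, b, c = tri[:, 0], tri[:, 1], tri[:, 2]
    areas = 0.5 * np.linalg.norm(np.cross(b - a, c - a), axis=1)
    probs = areas / areas.sum()
    center = (tri.mean(axis=1) * probs[:, None]).sum(axis=0)   # surface centroid
    idx = rng.choice(len(tri), size=samples, p=probs)
    r1, r2 = rng.random(samples), rng.random(samples)
    s = np.sqrt(r1)                                      # uniform barycentric sampling
    pts = ((1 - s)[:, None] * a[idx]
           + (s * (1 - r2))[:, None] * b[idx]
           + (s * r2)[:, None] * c[idx])
    return float(np.linalg.norm(pts - center, axis=1).mean())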
We stress that all feature vectors analyzed in this subsection are extracted using 128² samples of functions on a sphere, because the results shown in figures 5.27, 5.31, and 5.34 suggest that 128² samples are a good choice.
[Figure 5.44 plot; panels: "LDS (R=8, M=4), Our DB2, L1, different scaling" and "LDS (R=8, M=4), Test DB1, L1, different scaling" (scaling factors A, C, X, L).]
Figure 5.44: Average precision/recall diagrams of our reclassified collection and the
test database, when the LDS feature vectors are extracted using different scaling
factors (table 5.4), with R = 8 and M = 4.
[Figure 5.45 plot; panels: "LDS-A (R=8, M=4), Our DB2, different distances" and "LDS-A (R=8, M=4), Test DB1, different distances".]
Figure 5.45: Average precision/recall diagrams of our reclassified collection and the
test database, obtained by applying different distance calculations (table 5.5) to the
LDS feature vector with R = 8 and M = 4.
We recommend the following settings for the LDS descriptor: the number of functions on a sphere R = 8, the number of samples 128², the largest radius M = 4 (or M = 6), the vector dimension dim = 360, and the average distance (3.25) as the scaling factor in the normalization step. We recommend scaling each feature vector f by the factor ω (4.81) so that ||f||1 = dim, and applying the l1 distance for ranking the scaled feature vectors.
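A minimal sketch of this "L1, scaled" ranking, assuming only what is stated above, namely that ω in (4.81) re-scales a vector so that its l1 length equals its dimension:

import numpy as np

def l1_scaled_distance(f, g):
    # Re-scale each vector by omega = dim / ||f||_1, so that ||f||_1 = dim,
    # then take the ordinary l1 distance between the re-scaled vectors.
    f = np.asarray(f, dtype=float)
    g = np.asarray(g, dtype=float)
    fs = f * (f.size / np.abs(f).sum())
    gs = g * (g.size / np.abs(g).sum())
    return float(np.abs(fs - gs).sum())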
[Figure 5.46 plot; panels: "RID-A (R=8, M=6), Our DB2, L1" and "RID-A (R=8, M=6), Train DB1, L1", for vector dimensions 64–512.]
Figure 5.46: Average precision/recall diagrams of our reclassified collection and the
training database, for various dimensions of the rotation invariant feature vector
based on LDSs, with R = 8 and M = 6.
[Figure 5.47 plot; panels: "RID-A, Our DB2, L1, different M and R" and "RID-A, Train DB1, L1, different M and R".]
Figure 5.47: Average precision/recall diagrams of our reclassified collection and the
training database, when the rotation invariant LDS feature vectors are extracted
using different values of the parameters M and R.
[Figure 5.48 plot; panels: "RID-A (R=8, M=6), Our DB2, different distances" and "RID-A (R=8, M=6), Test DB1, different distances".]
Figure 5.48: Average precision/recall diagrams of our reclassified collection and the
test database, obtained by applying different distance calculations (table 5.5) to the
rotation invariant LDS feature vector with R = 8 and M = 6.
We performed tests similar to the ones shown in figure 5.45 in order to study the influence of the distance measure on the retrieval performance of the rotation invariant LDS descriptor. In figure 5.48, the precision-recall diagrams of our reclassified collection and the test database suggest that the l1 distance outperforms all competing dissimilarity techniques, including the application of the l1 norm after re-scaling each feature vector. We recall that the "L1, scaled" technique performs best in the case of the LDS feature vector.
We recommend the following settings for the rotation invariant version of the LDS descriptor: the number of functions on a sphere R = 8, the number of samples 128², the largest radius M = 6, the vector dimension dim = 512, the average distance (3.25) as the scaling factor in the normalization step, and the l1 norm as the distance metric.
[Figure 5.49 plots; panels: "LDS-A (R=8, M=4) vs. RID-A (R=8, M=6)" on Our DB1/Our DB2 and Train DB1/Test DB1, ranked with the plain l1 norm (top) and with the "L1, scaled" technique (bottom).]
Finally, we compare the best variants of two approaches for representing the
same feature, the LDS descriptor (LDS) vs. the rotation invariant version (RID).
The obtained results for four model classifications are shown in figure 5.49. We
compared approaches using the l1 norm directly and after the re-scaling of feature
vectors by ω (4.81). Evidently, for the classifications O2, TR, and TS (table 5.3),
the approach (LDS) that uses our complete normalization step (section 3.4) signifi-
cantly outperforms the approach (RID) based on avoiding the CPCA and using the
property 4.1 of spherical harmonics to secure rotation invariance of the descriptor.
Therefore, we conclude that the CPCA should not be avoided, regardless of certain
weaknesses (section 3.5), because the descriptor relying upon the CPCA possesses
higher discriminant power than the descriptor relying upon the competing technique
for achieving rotation invariance.
For crossbreeding, we selected the following four candidate descriptors:
(C1) Depth buffer-based descriptor (section 4.3) based on the EBB (definition 4.3), extracted from depth buffer images of dimensions 128 × 128;
(C2) Silhouette-based descriptor (section 4.2), extracted from silhouette images of dimensions 256 × 256 using 256 equiangular sample points;
(C3) Ray-based descriptor with spherical harmonic representation (section 4.6.2), obtained by sampling the extent function (4.1) at 128² points;
(C4) Voxel-based descriptor in the spectral domain (section 4.5) based on a 128 × 128 × 128 voxel grid, which is formed using the CC (definition 4.4) with w = 2 as the region of voxelization. Each model is scaled using the average distance (3.25) as the scaling factor.
We performed crossbreeding of two, three, and all four candidates. We recall that
the average extraction times for the given settings of candidate descriptors C1, C2,
C3, and C4 are 85ms, 33ms, 68ms, and 538ms, respectively.
In figure 5.50, we give results for four 3D-model classifications, O1, O2, TR, and
TS (see table 5.3). We recall that the abbreviations for types of our descriptors are
given in table 4.1. The l1 norm is the best choice of distance metric (results for
other distance measures are omitted). On the left-hand side, the precision-recall di-
agrams obtained using hybrids of two descriptors are displayed. We observe that the
descriptor denoted by “DBD258*RSH190”, which is obtained by crossbreeding the
candidate feature vector C1 of dimension 258 and the candidate feature vector C3 of
dimension 190, outperforms other descriptors obtained by crossbreeding two candi-
dates. Thus, the dimension of the obtained hybrid feature vector is equal to 448. On
the right-hand side, we compare the hybrid of all four candidates to the hybrids of
three candidate descriptors. To illustrate the introduced improvement of the overall
retrieval performance, the precision-recall curve of the candidate descriptor C1 of
dimension 438 (DBD438) is shown, as well. By comparing precision-recall curves as
well as the given evaluation parameters for each classification, we observe that we
benefit if we create a hybrid of three or four descriptors. The hybrid of all four de-
scriptors of dimension 475, denoted by “DBD186*SIL120*RSH105*VOX064”, and
the hybrid feature vector of dimension 472, denoted by “DBD186*SIL150*RSH136”,
possess very similar retrieval effectiveness and outperform all competing hybrid de-
scriptors. The hybrid descriptor DBD186*SIL150*RSH136 is obtained by cross-
breeding the candidate feature vector C1 of dimension 186, the candidate vector C2
of dimension 150, and the candidate descriptor C3 of dimension 136.
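The hybrids above are described only by the dimensions of their parental vectors; assuming that crossbreeding concatenates the parental feature vectors into a single longer vector (our reading, used here only for illustration), the stated dimensions add up as follows:

import numpy as np

def crossbreed(*parents):
    # Form a hybrid descriptor by concatenating parental feature vectors.
    return np.concatenate([np.asarray(p, dtype=float) for p in parents])

# Placeholder vectors with the dimensions quoted in the text:
rng = np.random.default_rng(0)
dbd, sil, rsh = rng.random(186), rng.random(150), rng.random(136)
assert crossbreed(dbd, sil, rsh).size == 472   # DBD186*SIL150*RSH136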
[Figure 5.50 plots: hybrids of two descriptors (left) and hybrids of three or four descriptors plus DBD438 (right), ranked with the l1 norm, for Our DB1, Our DB2, Train DB1, and Test DB1.]
The descriptor based on statistical moments (2.5) shows the best performance when the vector of 31 components is first re-scaled by (4.81) and the l1 norm is used for ranking ("L1, scaled"). The authors consider the original scale of an object as a feature, whence the scale normalization is not suggested in [105]. Conversely, we believe that all models should be scaled by the average distance (3.25) before extracting the moments-based descriptor. In the bottom-left diagram in figure 5.51, the precision-recall curve of the moments-based descriptor of dimension 31, extracted without normalizing the scale of a 3D-model, is shown (denoted by "Not scaled"). Obviously, the retrieval performance is better if the scale normalization is performed. Experimental results for the training and test databases support the conclusions for both the cords-based and the moments-based descriptors.
[Figure 5.52 plots; panels: "DEC (EBB), Our DB2, L1" and "DEC, Our DB2, different cubes and distances" (top), "DEC (EBB), Test DB1, L1" and "DEC, Test DB1, different cubes and distances" (bottom), for dimensions 21–529.]
Figure 5.52: Average precision/recall diagrams of our reclassified collection and the
test database, obtained by using the descriptor based on equivalence classes (section
2.2). Feature vectors of different dimensions (left) as well as different selections of
the unit cube and distance measures from table 5.5 (right) are tested.
The MPEG-7 shape spectrum descriptor (section 2.3) is evaluated using the implementation provided inside the eXperimentation Model [84]. We tested different dimensions of the shape spectrum feature vector, extracted using the adjacent triangles of the first (n = 1) and the second order (n = 2). As can be seen in figure 5.53, the overall retrieval performance of the shape spectrum descriptor is poor. The results suggest that the feature vector of dimension 452 slightly outperforms descriptors of other dimensions. Our results confirm the statement from [88] that the retrieval performance is not better if adjacent triangles of the second order are considered. The l1 norm is the best choice of distance measure for this descriptor.
[Figure 5.53 plots; panels: "SSD (n=1), Our DB2, L1" and "SSD (n=1), Our DB2, different distances" (top), "SSD (n=2), Our DB2, L1" and "SSD (n=2), Our DB2, different distances" (bottom), for dimensions 52–452.]
We compare three variants of the descriptor based on binary voxel grids (section
2.5.2). The first variant (B1) is extracted using the original definition of binary
functions (2.26) given in [36]. The second variant (Br) relies upon the redefini-
tion of functions on a sphere given by (2.29). Both variants are extracted without
applying the CPCA (section 3.4) to the original mesh model, i.e., without finding
a canonical orientation. We recall that only translation and scale are normalized,
while the property 4.1 of spherical harmonics is used to form inherently rotation
invariant feature vectors (2.27). We implemented the third variant (Br, our ap-
proach) of the descriptor based on binary voxel grids so that each 3D-model is
normalized first, then the binary voxel grid is generated, the functions defined by
(2.29) are sampled, and the spherical Fourier transform (section 4.6.1) is applied to
each function. Instead of forming the signatures by (2.27), we used the algorithm
4.1, i.e., (4.78). The dimension of the first and the second version of the descriptor
based on binary voxel grids is dim = RL, where R is the number of functions on
concentric spheres and L is the number of bands of spherical harmonics that are
used to compute the signatures. In [57, 36], the recommended settings are R = 32
and L = 16, i.e., dim = 512. The dimension of the third version of the feature vector is
dim = R · L(L + 1)/2. Since the number of functions on concentric spheres is fixed
to R = 32 [57, 36], we set L = 6 so that the dimension of the third variant of the
descriptor is dim = 672. Experimental results obtained for the classifications O1,
O2, TR, and TS (table 5.3), aimed at comparing the three variants of the descriptor
based on binary voxel grids, are displayed in figure 5.55. The results for all four
classifications are consistent. The variant “B1” is evidently inferior to the variant
“Br”, whence the 3D-shape is represented in a more appropriate manner if the def-
inition (2.29) is used instead of (2.26). The variant “Br, our approach” is evidently
superior to the variant “Br”. Therefore, if the binary voxel grid is considered as a
feature that describes the shape of a 3D-model, then the representation relying upon
the property 4.1 of spherical harmonics and avoiding the use of the CPCA is inferior
to our method that uses the CPCA and (4.78) to form the vector.
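To illustrate the contrast, a signature in the spirit of (2.27) can be sketched as follows: by property 4.1, the energy contained in each spherical-harmonic band is unchanged by rotation, so collecting per-band norms yields a rotation invariant vector without any pose normalization. This is a sketch under that reading, not the cited implementation:

import numpy as np

def band_energy_signature(coeffs_per_band):
    # coeffs_per_band[l] holds the complex coefficients c_{l,m}, m = -l..l;
    # the l2 norm of each band is rotation invariant (property 4.1).
    return np.array([np.linalg.norm(c) for c in coeffs_per_band])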
[Figure 5.55: precision/recall diagrams for the classifications O1, O2, TR, and TS, comparing the three variants of the descriptor based on binary voxel grids.]
Thus, the conclusion inferred using the results in figure 5.49 is supported by
the results given in figure 5.55, i.e., the CPCA should not be avoided, regardless
of certain weaknesses (section 3.5), because the descriptor relying upon the CPCA
possesses higher discriminant power than the descriptor relying upon the competing
technique for achieving rotation invariance.
[Figure 5.56 plots; panels: "EDT, Our DB2, L1, variants" and "EDT (original), Our DB2, different distances" (top), "EDT, Test DB1, L1, variants" and "EDT (original), Test DB1, different distances" (bottom).]
Figure 5.56: Average precision/recall diagrams of our reclassified collection and the
test database, obtained by using descriptors based on exponentially decaying EDT
(section 2.5.4). On the left-hand side, the descriptors extracted by the executables
provided by the authors are denoted by “EDT, original”, our implementation of the
same approach is denoted by “EDT, rot. inv”, while “EDT, our approach” denotes
the application of the algorithm 4.1 (i.e., we use the CPCA and (4.78) instead of
(2.37)). On the right-hand side, the descriptors extracted using the executables
provided by the authors are tested for different distance measures from table 5.5.
[Figure 5.57 plots: "Comparison of different descriptors (1)" for Our DB1, Our DB2, Train DB1, and Test DB1 (DBD, SIL, VOX, and VOL variants).]
The first group of competing descriptors is made up of the best versions of the
depth buffer-based descriptors (section 5.2.3) extracted using the EBB (definition
4.3) and the CC (definition 4.4) with w = 2, the silhouette-based descriptor (section
5.2.2), the voxel-based descriptors (section 5.2.5) in the spatial domain extracted
using the BB (definition 4.2), the voxel-based descriptor in the spectral domain
extracted using the CC with w = 2, and the volume-based descriptor in the spec-
tral domain (section 5.2.4). The recommended dimension and dissimilarity measure
(table 5.5) of each descriptor are used for creating the precision-recall diagrams in
figure 5.57. We conclude that the depth buffer-based descriptor extracted using the
EBB outperforms the competing feature vectors. As mentioned in sections 4.3 and
5.2.3 and demonstrated by the example in figure 4.13, an outlier can significantly
affect components of the vector created by relying upon the EBB, while the de-
scriptor extracted using the CC is more robust with respect to outliers. We also
notice that the silhouette-based descriptor outperforms the remaining three feature
vectors.
[Figure 5.58 plots: "Comparison of different descriptors (2)" for Our DB1, Our DB2, Train DB1, and Test DB1 (RAY, RSH, MOM, SSH, CSH, LDS).]
The second group of competing descriptors consists of the ray-based, moments-based, shading-based, and complex descriptors, the ray-based descriptor with spherical harmonic representation, and the feature vector based on LDSs (section 5.2.10) defined by (4.78). For each descriptor, the recommended dimension and distance measure (table 5.5) are used. The results in figure 5.58
suggest that the feature vector based on LDSs outperforms the competing descrip-
tors from the second group. The complex descriptor outperforms the remaining
four feature vectors. The shading-based descriptor outperforms only the moments-
based descriptor. Finally, we observe that we benefit if the ray-based feature is
represented by spherical harmonics, in the spectral domain. Thus, besides filtering
the surface noise, which is captured by measuring the extent in the spatial domain,
providing an embedded multi-resolution feature representation (1.8), and improving
robustness with respect to outliers (see figures 4.5 and 4.27), the representation in
the spectral domain possesses higher effectiveness than the representation of the
ray-based feature in the spatial domain.
The descriptors proposed by other authors (chapter 2) are compared in fig-
ure 5.59. Evidently, the descriptor presented in [58] that is based on exponentially
decaying EDT (section 2.5.4) significantly outperforms all other approaches. There-
fore, we regard the EDT descriptor as the state-of-the-art descriptor. We stress that
we use the executables (binaries) provided by the authors [108] to extract the EDT
descriptor.
[Figure 5.59 plots: "Comparison of state-of-the-art descriptors" for Our DB1, Our DB2, Train DB1, and Test DB1 (EDT, DEC, SDD, PMD, PCD, SSD).]
In our final comparison, we collected the best precision-recall curves from fig-
ures 5.57, 5.58, 5.59, and 5.50 in figure 5.60. The recommended hybrid descriptor
(DBD186*SIL150*RSH136) of dimension 472 significantly outperforms all other
descriptors. We recall that the hybrid is obtained by crossbreeding the depth
buffer-based descriptor of dimension 186 (4.30) extracted using the EBB (definition
4.3), the silhouette-based feature vector of dimension 150 (4.15) extracted using the
equiangular sampling (4.10) and the definition of sample values given by (4.13), and
the ray-based descriptor of dimension 136 with spherical harmonic representation
(4.66).
[Figure 5.60 plots: "Comparison of best descriptors" (DSR472 hybrid, DBD, SIL, EDT, LDS) for Our DB1, Our DB2, Train DB1, Test DB1, MPEG-7 DB1, and MPEG-7 DB2.]
Classification O1 O2 TR TS M1 M2
DSR472 85.3% 70.1% 70.6% 66.4% 91.6% 91.0%
DBD (EBB) 81.8% 60.0% 63.1% 60.0% 90.3% 87.8%
SIL 80.6% 60.4% 59.6% 55.6% 87.2% 87.4%
EDT (original) 71.8% 62.7% 60.3% 57.8% 85.5% 85.6%
LDS 66.5% 60.8% 58.3% 60.1% 84.6% 81.1%
Table 5.8: Percentage of retrieving a relevant model as the best match to a query
(nearest neighbor) for all six collections (table 5.3), using the descriptors from figure
5.60. The DSR472 descriptor (i.e., the hybrid DBD186*SIL150*RSH136) shows the
best performance for all available collections.
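The nearest-neighbor figures in table 5.8 can, in principle, be reproduced by the following sketch (the helper names are illustrative, not from the thesis): for each query, check whether the closest non-identical model belongs to the query's class.

def nearest_neighbor_score(queries, rank_fn, class_of):
    # rank_fn(q) returns model ids sorted by ascending distance to q;
    # class_of maps a model id to its ground-truth class.
    hits = 0
    for q in queries:
        best = next(m for m in rank_fn(q) if m != q)
        hits += class_of[best] == class_of[q]
    return 100.0 * hits / len(queries)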
We consider that our evaluation is complete, because we estimate that the ap-
proaches presented in [2, 72, 153, 151, 95, 27, 28, 19, 77, 11, 65, 129, 93, 91, 92]
as well as the topology matching technique [48] (section 2.4) are inferior to the
state-of-the-art descriptor, which is described in section 2.5.4. Our estimation is
partially based on results that are presented in the literature. For instance, the
results presented in [36] demonstrate that the descriptor based on exponentially
decaying EDT significantly outperforms descriptors proposed in [2] and [28]. The
techniques proposed in [19, 77, 11] are suitable for CAD 3D-models. The results
presented in [72, 153, 151, 95, 27, 65, 129, 93] also suggest the inferiority to the
EDT descriptor [58]. The topology matching technique is presented in [48], but
no thorough evaluation of it has been published yet. Thus, we assume that the sensitivity
of the method, which is depicted in figure 2.8, deteriorates the overall effective-
ness. We consider that methods presented in [91, 92] have rather theoretical than
practical value. Finally, we used six different model classifications to evaluate the
retrieval performance of descriptors. Since the results are consistent for all available
ground-truth classifications, we believe that the best 3D-shape descriptor has been
found.
We stress that the DSR472 is the best descriptor in general. However, for certain categories of 3D-models the DSR472 might not be the most suitable descriptor. An extreme example is shown on the left-hand side in figure 5.61. The shape spectrum feature vector, typically the most inferior descriptor (see figure 5.59), is compared to the best descriptors (figure 5.60), but only for the category of models of humans. The category belongs to our reclassified collection and consists of 56 models of humans in different poses (some of the models are articularly modified). Evidently, the generally poorest descriptor (SSD) outperforms the best descriptors. The results are expected, because the SSD is the only descriptor robust with respect to articular modifications of 3D-models, when the level-of-detail and tessellation are not significantly changed. We recall that the SSD is not robust to different tessellations and levels-of-detail. For the category of models of fighter jets (the training database), the most suitable descriptor is the depth buffer-based one. Nevertheless, we consider that the global results (not category-wise) suggest which descriptor is the best for the majority of categories.
[Figure 5.61 plots; panels: "Best descriptors vs. SSD, Our DB2, humans" (DSR472, DBD, SIL, EDT, LDS vs. SSD) and "Comparison of best descriptors, Train DB1, fighter jets".]
The PCA-based dimension reduction technique is optimal in the mean-square sense [51]. We observe that the retrieval performances of the hybrid DSR472 and the depth buffer-based descriptors even increase when the original vectors are compressed. The explanation for the increased retrieval effectiveness lies in the fact that the presence of noise in the original data is also reduced. Note that the feature vector of 35 components generated by compressing the DSR472 descriptor outperforms both the depth buffer-based and the EDT descriptors.
The compression by applying the PCA to sets of feature vectors has two goals: to preserve the retrieval performance and to reduce the computational costs of subsequent processing steps. The feasibility of the first goal is demonstrated in figure 5.62. Obviously, the ranking of low-dimensional vectors using the dissimilarity measures from table 5.5 is faster than the ranking of high-dimensional feature vectors. The average ranking times (distance computations and sorting) of our reclassified collection (1835 models), when the l1 norm (1.10) is used for computing distances between feature vectors of dimensions 150, 300, and 500, are 50ms, 100ms, and 155ms, respectively. For the set of 1835 vectors, the computation of the 472 × 472 covariance matrix (3.4) took 8.4 seconds, while the computation and sorting of the eigenvalues and eigenvectors (3.5), i.e., the computation of the transformation matrix, lasted 40.0 seconds. The results are obtained on a computer running Windows 2000 Professional, with 1GB RAM and a 1.4 GHz AMD processor.
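As a compact numpy sketch of the compression pipeline described above (mean vector (3.3), covariance matrix (3.4), sorted eigenvectors (3.5), projection onto the k leading eigenvectors); the variable names are ours:

import numpy as np

def fit_pca(X, k):
    # Rows of X are feature vectors; A_k holds the k leading eigenvectors.
    mean = X.mean(axis=0)
    cov = np.cov(X - mean, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)            # ascending order
    A_k = eigvecs[:, np.argsort(eigvals)[::-1][:k]].T
    return mean, A_k

def project(f, mean, A_k):
    # Reduce a feature vector (e.g., a newly uploaded model's descriptor)
    # using the previously computed mean vector and transformation matrix.
    return A_k @ (np.asarray(f, dtype=float) - mean)

Projecting a single new vector in this way is cheap compared to re-computing the transformation, which supports the update strategy discussed next.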
The application of the PCA to feature vectors will be investigated further. For instance, we plan to apply the PCA to weighted concatenations of feature vectors. Also, the addition of a large number of similar objects to a collection of 3D-models may significantly change the principal axes of the corresponding set of feature vectors. Therefore, we expect that certain modifications of the standard PCA will be necessary in order to provide robustness with respect to the variance of 3D-shapes in a given collection. Finally, if a user uploads a new model to our Web-based retrieval system CCCC (see appendix), the feature vectors are automatically extracted. Assuming that a single added feature vector cannot significantly change the computed transformation parameters, we can use the previously calculated mean vector (3.3) and transformation matrix Ak to reduce the vector dimension, so that the user receives a prompt response from the system. Then, the mean vector (3.3) and the transformation matrix can be updated (re-computed) during idle time. However, the robustness of the computed transformation parameters to an enlargement of the underlying set of data vectors (descriptors) by a single data vector should be examined.
Each descriptor was tested for various variants and parameter settings, using different dissimilarity measures (table 5.5). A selection of more than 3000 generated precision-recall diagrams is shown in section 5.2.
We consider that the most important state-of-the-art descriptors are included
in our tests (section 5.2.13). The results show that the DSR472 hybrid descriptor
is unambiguously the best 3D-shape descriptor, in general. However, the version of
the DSR472 feature vector that is tested in sections 5.2.11 and 5.2.13 relies upon a
parental descriptor that is not robust with respect to outliers. Namely, the depth
buffer-based descriptor extracted using the extended bounding box (definition 4.3)
is sensitive to outliers, whence a similar level of sensitivity is inherited by the hybrid.
Conversely, if the DSR472 descriptor is obtained by crossbreeding the depth buffer-
based feature vector extracted relying upon the canonical cube (definition 4.4) with
w = 2, the silhouette-based descriptor, and the ray-based feature vector in the
spectral domain, then the resulting hybrid is robust with respect to outliers, because
none of the parental feature vectors is sensitive to outliers.
[Figure 5.63 plots; panels: "DSR472, EBB vs. CC (w=2), O2, TR, TS, L1" and "DSR472, EBB vs. CC (w=2), O1, M1, M2, L1".]
In figure 5.63, we compare the variant of the DSR472 descriptor that is sensitive to outliers (DSR472, EBB) to the variant that is robust with respect to outliers (DSR472, CC w=2), using all available ground-truth classifications (table 5.3). The results show that the outlier-sensitive variant slightly outperforms the robust variant of the DSR472 descriptor. Nevertheless, having in mind the example in figure 4.13, we consider it important to satisfy requirement 5 from section 1.3.4. Therefore, we recommend using the robust DSR472 hybrid descriptor as the best choice of all tested approaches (descriptor types, variants, and parameter settings). The instances of the DSR472 feature vector should be ranked using the l1 distance (1.10).
Chapter 6
Conclusion
When the PCA is applied to the set of vertices of a 3D-mesh, differing sizes of triangles are not taken into account. Modifications of the PCA are introduced in [143] and [105], in order to account for different sizes of triangles by suitable weighting factors. We regard a triangle mesh model as a union of triangles, whence the point set of the model consists of infinitely many points. In contrast to the usual application of the PCA, we work with sums of integrals over triangles (3.19) in place of sums over (weighted) vertices, which makes our approach more complete, taking into account all points of the model with equal weight. The formulas for computing all necessary parameters for the normalization of translation, rotation, scale, and reflection, using the continuous approach, are given in section 3.4.
In chapter 4, we present our original 3D-shape descriptors, which are defined
using various features and representation techniques. We considered a variety of
features for characterizing 3D-shape such as
• Extents of a model in certain directions;
• Contours of 2D projections of a model;
• Depth buffer images of a model;
• Artificially defined volumes associated with triangles of a mesh;
• Voxel grids attributed by fractions of the total surface area of a mesh;
• Rendered perspective projections of a model on an enclosing sphere;
• Layered depth spheres.
As far as we know, the layered depth spheres represent an original concept, which
is introduced in this thesis.
The representation techniques we used include:
• Fourier transforms: 1D, 2D, 3D;
• Fourier transform on a sphere;
• Moments for representing the extent function (original definition).
We stress that spherical harmonics (the fast Fourier transform on a sphere) were introduced as a tool for 3D model retrieval by ourselves in [147].
We introduced two approaches for merging appropriate feature vectors, by defin-
ing a complex function on a sphere, and by crossbreeding. The concept of complex
descriptors is proposed in [146], while the concept of hybrid feature vectors is in-
troduced in this thesis. We also present a variety of original feature extraction
algorithms and give complete specifications for forming feature vector components
for each of the presented approaches.
In chapter 5, we evaluate 19 types of 3D-shape descriptors (see tables 4.1 and 5.2)
using six different classifications (ground truths) of 3D-models (section 5.1). Most
of the 3D-model classifications are not formed by ourselves, whence we consider
that our evaluation is not subjective. As tools for comparing competing descrip-
tors, we use precision-recall diagrams, the R-precision (first tier), the Bull’s eye
performance (second tier), and two other parameters that are presented in section
1.5. As far as we know, no reported results in the literature include this number of
tested techniques and this number of different classifications. Since the state-of-the-art
descriptors [36, 58] are also tested, we believe that our evaluation is competent.
The global comparison (section 5.2.13) of the best versions of all 19 descriptors
unambiguously suggests that a version of our hybrid descriptor significantly outper-
forms all competing descriptors. The best descriptor, which we called “DSR472”,
is formed by crossbreeding the following three descriptors:
1. Depth buffer-based descriptor of dimension 186 (section 4.3), extracted from depth buffer images of dimensions 128×128, which are formed using the canonical cube (definition 4.4) with w = 2;
2. Silhouette-based descriptor of dimension 150 (section 4.2), extracted from silhou-
ette images of dimensions 256 × 256 using 256 equiangular sample points (4.10),
where sample values are defined by (4.13);
3. Ray-based descriptor of dimension 136, with spherical harmonic representation (section 4.6.2), obtained by sampling the extent function (4.1) at 128² points (4.63).
The DSR472 descriptor possesses all desirable properties specified in section 1.3.4, such as robustness with respect to levels-of-detail, different tessellations, surface noise, outliers, and arbitrary topological degeneracies. The feature extraction and the search procedure are efficient, as well. Most importantly, the discriminant power of the DSR472 descriptor is significantly higher than that of competing descriptors. We recommend using the l1 norm for computing distances between DSR472 feature vectors.
In section 5.3, we verified that the standard PCA can be applied to a set of n-dimensional feature vectors with real-valued components to reduce the dimensionality of the feature vectors without significant loss in retrieval effectiveness. The reduction increases retrieval efficiency.
The invariance with respect to translation is achieved using the center of gravity of a model, while we recommend scaling the model by the average distance d_avg (3.25) of a point on the surface to the center of gravity of the model. In section 3.4, we give an original algorithm for approximating the value of d_avg. We also provide a means for fixing reflections around the coordinate hyper-planes (3.23). To attain rotation invariance, we use the CPCA. Several authors (e.g., [36, 91]) object to the use of the CPCA, because it is not an ideal tool for fixing the orientation of a 3D-model. Statements such as "all descriptors relying upon the PCA show poor retrieval performance" are based on wrong intuitive assumptions and are not supported by experimental results. The descriptor defined in [36] is inherently invariant with respect to rotation, i.e., it is not necessary to apply the PCA in the normalization step. However, we consider that the overall performance is the most important parameter of quality of a descriptor. We compared our DSR472 descriptor to the descriptor proposed in [58], which is extracted using the original tools provided by the authors [108], and the results show that our approach is superior. Moreover, the DSR472 descriptor shows better performance than the descriptor based on exponentially decaying Euclidean distance transform (EDT) even for certain categories of models that are not well aligned in the canonical coordinate frame (e.g., categories "desk with hutch", "TV", "spider", "handgun", "desk lamp", "piano", "palm", "tree", etc. from the training database [108]). The rotation invariance of
[2] M. Ankerst, G. Kastenmüller, H.-P. Kriegel, and T. Seidl, “3D Shape His-
tograms for Similarity Search and Classification in Spatial Databases,” in
Proc. 6th Int. Symposium on Large Spatial Databases (SSD’99), Lecture Notes
in Computer Science, R. H. Güting, D. Papadias, and F. H. Lochovsky, Eds.,
Hong Kong, China, July 1999, vol. 1651, pp. 207–226, Springer Verlag.
[3] J. Arvo and D. Kirk, "Fast ray tracing by ray classification," in Proc. SIGGRAPH 1987, Anaheim, CA, July 1987, pp. 55–64, ACM SIGGRAPH.
[18] J.-M. Chung and N. Ohnishi, "Matching and recognition of planar shapes using medial axis properties," in First International Workshop on Multimedia Intelligent Storage and Retrieval Management (MISRM'99), Orlando, Florida, October 1999.
[21] S. Cohen and L. Guibas, “The Earth Mover’s Distance under Transformation
Sets,” in Proc. of the 7th IEEE International Conference on Computer Vision
(ICCV 1999) - Volume 2, Corfu, Greece, September 1999, pp. 1076–1083,
IEEE Computer Society.
[27] M. Elad, A. Tal, and S. Ar, “Directed Search in a 3D Objects Database Using
SVM,” Tech. Rep. HPL-2000-20R1, HP Laboratories Israel, August 2000.
[28] M. Elad, A. Tal, and S. Ar, “Content Based Retrieval of VRML Objects -
An Iterative and Interactive Approach,” in Proc. of the Sixth Eurographics
Workshop in Multimedia, Manchester, UK, September 2001, pp. 97–108.
[52] S. B. Kang and K. Ikeuchi, “3-D Object Pose Determination Using Com-
plex EGI,” Tech. Rep. CMU-RI-TR-90-18, The Robotics Institute, Carnegie
Mellon University, Pittsburgh, Pennsylvania, October 1990.
[66] S. Z. Li, “Content-Based Audio Classification and Retrieval Using the Near-
est Feature Line Method,” IEEE Trans. on Speech and Audio Processing,
8(5):619–625, 2000.
[70] Z. Liu, J. Huang, Y. Wang, and T. Chen, “Audio Feature Extraction and
Analysis for Scene Classification,” in Proc. IEEE 1997 Workshop Multimedia
Signal Processing (MMSP1997), Princeton, NJ, June 1997, pp. 343–348.
[71] Z. Liu, Y. Wang, and T. Chen, “Audio Feature Extraction and Analysis for
Scene Classification,” Journal of VLSI Signal Processing, pp. 61–79, 1998.
[75] B. Mak and E. Barnard, “Phone Clustering using the Bhattacharyya Dis-
tance,” in Proc. of the Fourth International Conference on Spoken Language
Processing (ICSLP 96) - Volume 4, Philadelphia, PA, October 1996, pp. 2005–
2008.
[81] G. Mori, S. Belongie, and J. Malik, “Shape contexts enable efficient retrieval
of similar shapes,” in Proc. IEEE Conf. on Computer Vision and Pattern
Recognition (CVPR 2001), December 2001, vol. 1, pp. 723–730.
[87] MPEG-7 Video Group, “Description of Core Experiments for Motion and
Shape,” ISO/IEC N3397, MPEG-7, Geneva, June 2000.
[88] MPEG-7 Video Group, “Information Technology – Multimedia Content De-
scription Interface – Part 3: Visual,” ISO/IEC FCD 15938–3 / N4062, MPEG-
7, Singapore, March 2001.
[89] MPEG-7 Video Group, “MPEG-7 Visual Part of eXperimentation Model,”
V.9. ISO/IEC N3914, MPEG-7, Pisa, January 2001.
[90] S. Newsam, B. Sumengen, and B. S. Manjunath, “Category-Based Image
Retrieval,” in Proc. 2001 IEEE International Conference on Image Processing
(ICIP 2001), Thessaloniki, Greece, October 2001, pp. 596–599.
[91] M. Novotni and R. Klein, “A Geometric Approach to 3D Object Comparison,”
in Proc. SMI 2001, Genova, Italy, May 2001, pp. 167–175.
[92] M. Novotni and R. Klein, “3D Zernike Descriptors for Content Based Shape
Retrieval,” in Proc. of the Eighth ACM Symposium on Solid Modeling and
Applications, Seattle, Washington, USA, June 2003, pp. 216–225, ACM Press.
[93] R. Ohbuchi, M. Nakazawa, and T. Takei, “Retrieving 3D Shapes Based On
Their Appearance,” in Proc. 5th ACM SIGMM Workshop on Multimedia
Information Retrieval (MIR 2003), Berkeley, California, November 2003, pp.
39–46.
[94] J.-R. Ohm, F. Bunjamin, W. Liebsch, B. Makai, K. Müller, A. Smolic, and
D. Zier, “A multi-feature Description Scheme for image and video database
retrieval,” in Proc. IEEE Multimedia Signal Processing Workshop, Copen-
hagen, Denmark, September 1999, IEEE Computer Society.
[95] Y. Okada, “3D Model Database System by Hand Sketch Query,” in Proc.
2002 IEEE International Conference on Multimedia and Expo (ICME 2002),
Lausanne, Switzerland, August 2002, pp. 889–892.
[96] P. Oliver, “Wotsit’s Format,” http://www.wotsit.org/.
[97] R. Osada, T. Funkhouser, B. Chazelle, and D. Dobkin, “Matching 3D Models
with Shape Distributions,” in Proc. SMI 2001, Genova, Italy, May 2001, pp.
154–166.
[98] R. Osada, T. Funkhouser, B. Chazelle, and D. Dobkin, “Shape Distributions,”
ACM Transactions on Graphics, 21(4):807–832, October 2002.
[99] E. Paquet, “Nefertiti - Solutions for Multimedia Databases,”
http://www.cleopatra.nrc.ca/.
[100] E. Paquet, S. El-Hakim, A. Beraldin, and S. Peters, “The Virtual Museum:
Virtualisation of Real Historical Environments and Artifacts and Three-
Dimensional Shape-based Searching,” in Proc. International Symposium on
[101] E. Paquet and M. Rioux, “Nefertiti: a Query by Content Software for Three-
Dimensional Databases Management,” in Proc. IEEE 3-D Digital Imaging
and Modelling, Ottawa, Canada, May 1997, pp. 345–352.
[102] E. Paquet and M. Rioux, “The MPEG-7 Standard and the Content-based
Management of Three-dimensional Data: A Case Study,” in Proc. IEEE
International Conference on Multimedia Computing and Systems, Florence,
Italy, June 1999, pp. 375–380.
[103] E. Paquet and M. Rioux, “Nefertiti: a Query by Content System for Three-
Dimensional Model and Image Databases Management,” Image and Vision
Computing, 17:157–166, 1999.
[104] E. Paquet and M. Rioux, “Influence of pose on 3-D shape classification: Part
II,” in Proc. Digital Human Modeling for Design and Engineering Conference,
Arlington, VA, June 2001.
[107] H. Pfister, M. Zwicker, J. van Baar, and M. Gross, “Surfels: Surface Ele-
ments as Rendering Primitives,” in Proc. SIGGRAPH 2000, New Orleans,
Louisiana, July 2000, pp. 335–342, ACM SIGGRAPH.
[108] Princeton Shape Retrieval and Analysis Group, “3D Model Search Engine,”
http://shape.cs.princeton.edu/.
[111] V. Roth, “Content-Based Retrieval From Digital Video,” Image and Vision
Computing, 17(7):531–540, 1999.
[115] T. Saito and J.-I. Toriwaki, “New Algorithms for Euclidean Distance
Transformation of an n-Dimensional Digitized Picture with Applications,”
Pattern Recognition, 27(11):1551–1565, 1994.
[116] H. Samet, The Design and Analysis of Spatial Data Structures, Addison-
Wesley, 1990.
[117] D. Saupe and D. V. Vranić, “3D Model Retrieval with Spherical Harmon-
ics and Moments,” in Proc. DAGM 2001, B. Radig and S. Florczyk, Eds.,
Munich, Germany, September 2001, pp. 392–397, Springer Verlag.
[120] J. W. Shade, S. J. Gortler, L.-W. He, and R. Szeliski, “Layered Depth Im-
ages,” in Proc. SIGGRAPH 1998, Orlando, FL, July 1998, pp. 231–242, ACM
SIGGRAPH.
[122] J. R. Smith, Y.-C. Chang, and C.-S. Li, “Multi-Object Multi-Feature Content-
Based Search using MPEG-7,” in Proc. 2001 IEEE International Conference
on Image Processing (ICIP 2001), Thessaloniki, Greece, October 2001, pp.
584–587.
[143] D. V. Vranić and D. Saupe, “3D Model Retrieval,” in Proc. Spring Conference
on Computer Graphics and its Applications (SCCG2000), B. Falcidieno, Ed.,
Budmerice Manor, Slovakia, May 2000, pp. 89–93, Comenius University.
[153] C. Zhang and T. Chen, “Efficient Feature Extraction for 2D/3D Objects
in Mesh Representation,” in Proc. 2001 IEEE International Conference on
Image Processing (ICIP 2001), Thessaloniki, Greece, October 2001, pp. 935–
938.
Appendix: CCCC
Our Web-based retrieval system, called CCCC [140], serves as a proof-of-concept for
our implemented methods and tools for content-based search for 3D-mesh models.
The CCCC is also useful for obtaining a subjective impression of the effectiveness
of different descriptors. In this appendix, we give an overview of the on-line system,
which is currently located at the following address:
http://merkur01.inf.uni-konstanz.de/CCCC/.
We expect that the URL of the CCCC 3D search engine will change in the future.
However, it should be relatively easy to locate the new address using conventional
search engines (e.g., google.com) or by visiting the original site [140]. For the
implementation, we used C++, Perl, and JavaScript. The starting screen of the
CCCC is shown in figure 6.1.
in the canonical coordinate frame. In figure 6.4, all the thumbnail images visualize
models in the canonical coordinate frame, viewed from the positive side of the z-
axis, with the x-axis pointing to the right. We stress that the canonical scale cannot
be seen on the thumbnail images, which are generated using the extended bounding
box (definition 4.3).
A query is selected simply by clicking on a thumbnail image. We provide an-
other option for specifying the query: uploading a model from a local directory.
Currently, the CCCC accepts the following 3D-file formats: VRML (Virtual Reality
Modeling Language) [44, 14], DXF (Autodesk Drawing eXchange Format) [5], 3DS
(3D Studio file format) [96], OFF (Object File Format) [96], OBJ (Wavefront Object
files) [96], and SMF (Simple Model Format) [37]. In the future, we will provide an
option to draw 2D sketches of a 3D-object and use them as a query.
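The following sketch indicates how an uploaded file might be dispatched to a format-specific loader by its extension; the loader names are hypothetical, and the actual CCCC implementation (in C++ and Perl) may proceed differently:

    from pathlib import Path

    # Hypothetical loader names, one per accepted 3D-file format.
    LOADERS = {
        ".wrl": "load_vrml",  # VRML
        ".dxf": "load_dxf",   # Autodesk Drawing eXchange Format
        ".3ds": "load_3ds",   # 3D Studio
        ".off": "load_off",   # Object File Format
        ".obj": "load_obj",   # Wavefront Object files
        ".smf": "load_smf",   # Simple Model Format
    }

    def loader_for(filename):
        # Map an uploaded file to the name of its (hypothetical) loader.
        suffix = Path(filename).suffix.lower()
        if suffix not in LOADERS:
            raise ValueError("unsupported 3D-file format: " + suffix)
        return LOADERS[suffix]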
Figure 6.4: Visualization of normalized 3D-mesh models from the positive side of
the z-axis, with the x-axis pointing to the right.
Figure 6.5: Setting the retrieval parameters: feature vector type and dimension,
the distance metric, and the number of displayed objects per screen.
In figure 6.6, the retrieval results for the specified query model, feature vec-
tor, dimension, and distance calculation are shown. The descriptor used is the
hybrid feature vector of dimension 448, which is obtained as described in section
4.7. Objects that are relevant to the query model are denoted by green text on
the buttons for displaying the statistics and for rendering the model in a
VRML-viewer. Conversely, red text denotes non-relevant objects. Match number 6
is considered non-relevant in the used classification (O2 in table 5.3). We recall
that the classification was not done by us, but by our colleagues, without our
influence.
Thus, in figure 6.6, the first 29 retrieved models are airplanes, although some of
them are not considered relevant to the query by the selected ground truth. A
voting option will be provided to the user in order to alter the ground truth. Besides
inspecting the retrieved models using thumbnail images of four different views, a
user can obtain more information about a model by displaying a window with
statistics as well as by rendering the model in a VRML-viewer (figure 6.7).
Figure 6.7: The window with basic statistics about the model and rendering of the
model in a VRML-viewer.
Figure 6.8: Precision-recall diagrams for the given query and the average curve for
the class of 3D-models containing the query object.
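For reference, the following minimal sketch shows how the precision and recall values underlying such diagrams can be computed from a ranked result list and a relevance labelling (section 1.5 describes the evaluation tools in detail):

    def precision_recall(ranked_relevance, num_relevant):
        # ranked_relevance: relevance flag of each retrieved model, best first.
        # num_relevant: total number of models relevant to the query.
        points, hits = [], 0
        for i, is_relevant in enumerate(ranked_relevance, start=1):
            if is_relevant:
                hits += 1
                points.append((hits / num_relevant, hits / i))  # (recall, precision)
        return points

    # Example: relevant models at ranks 1, 3, and 4; four relevant overall.
    print(precision_recall([True, False, True, True, False], 4))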