
UNIT 6

THESAURUS CONSTRUCTION
INTRODUCTION

Thesauri are valuable structures for Information Retrieval systems. A thesaurus provides a
precise and controlled vocabulary which serves to coordinate document indexing and document
retrieval. In both indexing and retrieval, a thesaurus may be used to select the most appropriate
terms. Additionally, the thesaurus can assist the searcher in reformulating search strategies if
required.

In IR systems, which are used to retrieve potentially relevant documents from large collections, the
thesaurus serves to coordinate the basic processes of indexing and document retrieval. (The term
document is used here generically and may refer to books, articles, magazines, letters,
memoranda, and also software.) In indexing, a succinct representation of the document is
derived, while retrieval refers to the search process by which relevant items are identified. The
IR thesaurus typically contains a list of terms, where a term is either a single word or a phrase,
along with the relationships between them. It provides a common, precise, and controlled
vocabulary which assists in coordinating indexing and retrieval. Given this objective, it is clear
that thesauri are designed for specific subject areas and are therefore domain dependent.

A carefully designed thesaurus can be of great value. The indexer is typically instructed to select
the most appropriate thesaural entries for representing the document. In searching, the user can
employ the thesaurus to design the most appropriate search strategy. If the search does not
retrieve enough documents, the thesaurus can be used to expand the query by following the
various links between terms. Similarly, if the search retrieves too many items, the thesaurus can
suggest more specific search vocabulary. In this way the thesaurus can be valuable for
reformulating search strategies.

FEATURES OF THESAURI

Some important features of thesauri will be highlighted here.

Coordination Level
Coordination refers to the construction of phrases from individual terms. Two distinct
coordination options are recognized in thesauri: precoordination and postcoordination. A
precoordinated thesaurus is one that can contain phrases; consequently, phrases are available for
indexing and retrieval. A postcoordinated thesaurus does not allow phrases; instead, phrases are
constructed while searching. The choice between the two options is difficult. The advantage of
precoordination is that the vocabulary is very precise, thus reducing ambiguity in indexing and
in searching. Also, commonly accepted phrases become part of the vocabulary. However, the
disadvantage is that the searcher has to be aware of the phrase construction rules employed.
Precoordination is more common in manually constructed thesauri. Automatic phrase
construction is still quite difficult, and therefore automatic thesaurus construction usually implies
postcoordination.

Term Relationships
Term relationships are the most important aspect of thesauri, since the vocabulary connections
they provide are most valuable for retrieval. Many kinds of relationships are expressed in a
manual thesaurus. These are semantic in nature and reflect the underlying conceptual interactions
between terms. Three categories of term relationships are commonly distinguished: (1)
equivalence relationships, (2) hierarchical relationships, and (3) nonhierarchical relationships.
Equivalence relations include both synonymy and quasi-synonymy. Quasi-synonyms are terms
which for the purpose of retrieval can be regarded as synonymous, for example, "genetics" and
"heredity," which have significant overlap in meaning, or "harshness" and "tenderness," which
represent different viewpoints of the same property continuum. A typical example of a
hierarchical relation is genus-species, such as "dog" and "German shepherd." Nonhierarchical
relationships also identify conceptually related terms. There are many examples, including
thing--part, such as "bus" and "seat," and thing--attribute, such as "rose" and "fragrance."

Number of Entries for Each Term

In general, it is preferable to have a single entry for each thesaurus term. However, this is
seldom achieved due to the presence of homographs--words with multiple meanings. Since the
meaning of each instance of a homograph can be deciphered only from context, it is more
realistic to have a unique representation or entry for each meaning of a homograph. This also
allows each homograph entry to be associated with its own set of relations.
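
One way to realize this is to key entries by term and sense rather than by term alone. The
Python sketch below is purely illustrative; the sense labels and relation names are assumptions,
not a standard.

from dataclasses import dataclass, field

@dataclass
class ThesaurusEntry:
    term: str
    sense: str                                     # e.g., "financial" vs. "river"
    equivalent: set = field(default_factory=set)   # synonyms, quasi-synonyms
    broader: set = field(default_factory=set)      # hierarchical: genus
    narrower: set = field(default_factory=set)     # hierarchical: species
    related: set = field(default_factory=set)      # nonhierarchical

# Two entries for the homograph "bank", each with its own relations.
entries = {
    ("bank", "financial"): ThesaurusEntry("bank", "financial",
                                          narrower={"savings bank"}),
    ("bank", "river"): ThesaurusEntry("bank", "river",
                                      related={"shore"}),
}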

Specificity of Vocabulary

Specificity of the thesaurus vocabulary is a function of the precision associated with the
component terms. A highly specific vocabulary is able to express the subject in great depth and
detail. This promotes precision in retrieval. The concomitant disadvantage is that the size of the
vocabulary grows since a large number of terms are required to cover the concepts in the
domain.

Control on Term Frequency of Class Members

This has relevance mainly for statistical thesaurus construction methods, which work by
partitioning the vocabulary into a set of classes, where each class contains a collection of
equivalent terms. To maintain a good match between documents and queries, it is necessary to
ensure that terms included in the same thesaurus class have roughly equal frequencies. Further,
the total frequency in each class should also be roughly similar.

Normalization of Vocabulary

Normalization of vocabulary terms is given considerable emphasis in manual thesauri. There are
extensive rules which guide the form of the thesaural entries. A simple rule is that terms should
be in noun form. A second rule is that noun phrases should avoid prepositions unless they are
commonly known. Also, a limited number of adjectives should be used. There are other rules
governing issues such as the singularity of terms, the ordering of terms within phrases, spelling,
capitalization, transliteration, abbreviations, initials, acronyms, and punctuation.

THESAURUS CONSTRUCTION

Manual Thesaurus Construction

The process of manually constructing a thesaurus is both an art and a science. We present
here only a brief overview of this complex process. First, one has to define the boundaries of the
subject area. (In automatic construction, this step is simple, since the boundaries are taken to be
those defined by the area covered by the document database.) Boundary definition includes
identifying central subject areas and peripheral ones since it is unlikely that all topics included
are of equal importance. Once this is completed, the domain is generally partitioned into
divisions or subareas.

Since manual thesauri are more complex structurally than automatic ones, as the previous
section has shown, there are more decisions to be made. Now, the collection of terms for each
subarea may begin. A variety of sources may be used for this including indexes, encyclopedias,
handbooks, textbooks, journal titles and abstracts, catalogues, as well as any existing and
relevant thesauri or vocabulary systems. Subject experts and potential users of the thesaurus
should also be included in this step. Once the initial organization has been completed, the entire
thesaurus will have to be reviewed (and refined) to check for consistency such as in phrase form
and word form.
Automatic Thesaurus Construction

The automatic thesaurus construction approaches discussed here were selected because
they are interesting, are quite different from each other, and rely purely on statistical techniques.
(The alternative is to use linguistic methods.) Consequently, the two major approaches selected
here have not necessarily received equal attention in the literature. The first approach, designing
thesauri from document collections, is a standard one. The second, merging existing thesauri, is
better known using manual methods.
From a Collection of Document Items

Here the idea is to use a collection of documents as the source for thesaurus construction.
This assumes that a representative body of text is available. The idea is to apply statistical
procedures to identify important terms as well as their significant relationships. The central
thesis in applying statistical methods is that computationally simple procedures can identify the
more important semantic knowledge for thesauri. It is semantic
knowledge that is used by both indexer and searcher. Until more direct methods are known,
statistical methods will continue to be used.

By Merging Existing Thesauri

This second approach is appropriate when two or more thesauri for a given subject exist
that need to be merged into a single unit. If a new database can indeed be served by merging two
or more existing thesauri, then a merger is likely to be more efficient than producing the
thesaurus from scratch.

User Generated Thesaurus

In this third alternative, the idea is that users of IR systems are aware of and use many
term relationships in their search strategies long before these find their way into thesauri. The
objective is to capture this knowledge from users' searches.

THESAURUS CONSTRUCTION FROM TEXTS

The overall process may be divided into three stages: (1) Construction of vocabulary: This
involves normalization and selection of terms. It also includes phrase construction depending on
the coordination level desired. (2) Similarity computations between terms: This step identifies
the significant statistical associations between terms. (3) Organization of vocabulary: Here the
selected vocabulary is organized, generally into a hierarchy.
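
The three stages can be viewed as a simple pipeline. The Python skeleton below makes this
explicit; it is a sketch only, the stage functions are supplied by the caller (possible
implementations are sketched in the sections that follow), and all names are illustrative.

def build_thesaurus(documents, construct_vocabulary,
                    compute_similarities, organize_vocabulary):
    vocabulary = construct_vocabulary(documents)                 # stage 1
    similarities = compute_similarities(vocabulary, documents)   # stage 2
    return organize_vocabulary(vocabulary, similarities)         # stage 3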

Construction of vocabulary

The objective here is to identify the most informative terms (words and phrases) for the
thesaurus vocabulary from document collections. The first step is to identify an appropriate
document collection. The only loosely stated criterion is that the collection should be sizable and
representative of the subject area. The next step is to determine the required specificity for the
thesaurus.
Stem evaluation and selection

There are a number of methods for statistically evaluating the worth of a term. The ones we
discuss here are: (1) selection of terms based on frequency of occurrence, (2) selection of terms
based on Discrimination Value, (3) selection of terms based on the Poisson model.

Selection by Frequency of Occurrence

The basic idea is that each term may be placed in one of three different frequency categories with
respect to a collection of documents: high, medium, and low frequency. Terms in the mid-
frequency range are the best for indexing and searching. Terms in the low-frequency range have
minimal impact on retrieval, while high-frequency terms are too general and negatively impact
search precision. Salton recommends creating term classes for the low-frequency terms.
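
A minimal sketch of this three-way partition follows; the cutoff values are illustrative
assumptions that would, in practice, be tuned to the collection.

from collections import Counter

def partition_by_frequency(documents, low=2, high=100):
    # documents: list of token lists; low/high are illustrative cutoffs.
    freq = Counter(term for doc in documents for term in doc)
    low_terms = {t for t, f in freq.items() if f < low}
    high_terms = {t for t, f in freq.items() if f > high}
    mid_terms = set(freq) - low_terms - high_terms   # best index terms
    return low_terms, mid_terms, high_terms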

Selection by Discrimination Value (DV)

DV measures the degree to which a term is able to discriminate or distinguish between the
documents of the collection. The more discriminating a term, the higher its value as an index
term. The overall procedure is to compute the average interdocument similarity in the collection,
using some appropriate similarity function. Next, the term k being evaluated is removed from the
indexing vocabulary and the same average similarity is recomputed. The discrimination value
(DV) for the term is then computed as:

DV(k) = (Average similarity without k) - (Average similarity with k)
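
Since removing a good discriminator makes the remaining documents more similar to one
another, terms with large positive DV are the best candidates. The sketch below assumes
documents are represented as term-frequency dictionaries and that cosine is the "appropriate
similarity function"; pairwise averaging is quadratic, and centroid-based shortcuts are used in
practice.

import math
from itertools import combinations

def cosine(d1, d2):
    # Cosine similarity between two term-frequency dictionaries.
    num = sum(d1[t] * d2[t] for t in set(d1) & set(d2))
    den = math.sqrt(sum(v * v for v in d1.values())) * \
          math.sqrt(sum(v * v for v in d2.values()))
    return num / den if den else 0.0

def avg_similarity(docs):
    pairs = list(combinations(docs, 2))
    return sum(cosine(a, b) for a, b in pairs) / len(pairs)

def discrimination_value(docs, k):
    # DV(k) = (average similarity without k) - (average similarity with k)
    without_k = [{t: f for t, f in d.items() if t != k} for d in docs]
    return avg_similarity(without_k) - avg_similarity(docs)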

Selection by the Poisson Method

The Poisson distribution is a discrete probability distribution that can be used to model a variety
of random phenomena, including the number of typographical errors on a page of writing and the
number of red cars passing a point on a highway per hour. Across the research on the family of
Poisson models, the one consistent result is that the occurrences of trivial words follow a single
Poisson distribution, while the distribution of nontrivial words deviates significantly from a
Poisson distribution.
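
One simple, informal diagnostic for the single-Poisson property is the variance-to-mean ratio
of a term's per-document frequency: under a Poisson model the variance equals the mean, so a
ratio well above 1 suggests a nontrivial, content-bearing word. The sketch below computes this
ratio; the Poisson-model literature uses more formal goodness-of-fit tests.

def variance_to_mean_ratio(term, documents):
    # documents: list of token lists; counts are per-document frequencies.
    counts = [doc.count(term) for doc in documents]
    mean = sum(counts) / len(counts)
    var = sum((c - mean) ** 2 for c in counts) / len(counts)
    return var / mean if mean else 0.0   # ~1 for Poisson-like (trivial) words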

Phrase construction

This step may be used to build phrases if desired. As mentioned before, this decision is
influenced by the coordination level selected. Also, phrase construction can be performed to
decrease the frequency of high-frequency terms and thereby increase their value for retrieval.

Salton and McGill Procedure: This procedure is a statistical alternative to syntactic and/or
semantic methods for identifying and constructing phrases. Two general criteria
are used. First, the component words of a phrase should occur frequently in a common context,
such as the same sentence.
The second general requirement is that the component words should represent broad concepts,
and their frequency of occurrence should be sufficiently high. These criteria motivate their
algorithm, which is described below:

1. Compute pairwise co-occurrence for high-frequency words. (Any suitable contextual
constraint such as the ones above may be applied in selecting pairs of terms.)

2. If this co-occurrence is lower than a threshold, do not consider the pair any further.

3. For pairs that qualify, compute the cohesion value. Two formulas for computing cohesion are
given below. Both ti and tj represent terms, and size-factor is related to the size of the thesaurus
vocabulary.

   COHESION(ti, tj) = co-occurrence-frequency / sqrt(frequency(ti) * frequency(tj))

   COHESION(ti, tj) = size-factor * co-occurrence-frequency / (total-frequency(ti) * total-frequency(tj))

4. If cohesion is above a second threshold, retain the pair as a valid vocabulary phrase.
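
As a concrete illustration, here is a minimal Python sketch of the procedure using the first
cohesion formula, with co-occurrence counted within a sentence. The thresholds are illustrative
assumptions, and the restriction to high-frequency words in step 1 is omitted for brevity.

import math
from collections import Counter
from itertools import combinations

def candidate_phrases(sentences, min_cooccur=5, min_cohesion=0.1):
    # sentences: list of token lists (one list per sentence).
    freq = Counter(t for s in sentences for t in s)
    cooccur = Counter()
    for s in sentences:
        for ti, tj in combinations(sorted(set(s)), 2):
            cooccur[(ti, tj)] += 1
    phrases = []
    for (ti, tj), c in cooccur.items():
        if c < min_cooccur:                    # step 2: drop infrequent pairs
            continue
        cohesion = c / math.sqrt(freq[ti] * freq[tj])   # first formula above
        if cohesion >= min_cohesion:           # step 4: keep cohesive pairs
            phrases.append((ti, tj, cohesion))
    return phrases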

Choueka Procedure: The second phrase construction method is based on the work by Choueka
(1988). He proposes a rather interesting and novel approach for identifying collocational
expressions, by which he refers to phrases whose meaning cannot be derived in a simple way
from that of the component words, for example, "artificial intelligence." The algorithm proposed
is statistical and combinatorial and requires a large collection (at least a million items) of
documents to be effective. The following are the steps:

1. Select the range of length allowed for each collocational expression. Example: two to six
words.

2. Build a list of all potential expressions of the prescribed length from the collection that have
a minimum frequency (again, a preset value).

3. Delete sequences that begin or end with a trivial word. The trivial words include prepositions,
pronouns, articles, conjunctions, and so on.

4. Delete expressions that contain high-frequency nontrivial words.

5. Given an expression such as a b c d, evaluate any potential subexpressions such as a b c and
b c d for relevance. Discard any that are not sufficiently relevant.

6. Try to merge smaller expressions into larger and more meaningful ones. For example, a b c
and b c d may merge to form a b c d.
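
Here is a partial sketch of steps 1 through 3 (candidate generation and trivial-word filtering);
the trivial-word list and frequency threshold are illustrative assumptions, and steps 4 through 6
are omitted.

from collections import Counter

TRIVIAL = {"a", "an", "the", "of", "in", "on", "and", "or", "to", "it"}

def collocation_candidates(tokens, min_len=2, max_len=6, min_freq=3):
    ngrams = Counter()
    for n in range(min_len, max_len + 1):          # step 1: length range
        for i in range(len(tokens) - n + 1):
            ngrams[tuple(tokens[i:i + n])] += 1
    return {g: f for g, f in ngrams.items()
            if f >= min_freq                       # step 2: frequency floor
            and g[0] not in TRIVIAL
            and g[-1] not in TRIVIAL}              # step 3: trivial-word filter
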
Similarity computation

Once the appropriate thesaurus vocabulary has been identified, and phrases have been designed
if necessary, the next step is to determine the statistical similarity between pairs of terms. Two
similarity routines are commonly used:

Cosine: This computes the number of documents associated with both terms divided by the
square root of the product of the number of documents associated with the first term and the
number of documents associated with the second term.

Dice: This computes the number of documents associated with both terms divided by the sum of
the number of documents associated with one term and the number associated with the other.
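
The sketch below implements both routines, assuming each term is represented by the set of
documents in which it occurs. Note that the conventional Dice coefficient multiplies the
numerator by 2; the version here follows the verbal description above.

import math

def cosine_sim(docs_i, docs_j):
    # docs_i, docs_j: sets of document identifiers for two terms.
    both = len(docs_i & docs_j)
    return both / math.sqrt(len(docs_i) * len(docs_j))

def dice_sim(docs_i, docs_j):
    # Conventional Dice uses 2 * both in the numerator; the plain ratio
    # stated in the text is used here.
    both = len(docs_i & docs_j)
    return both / (len(docs_i) + len(docs_j))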

Vocabulary Organization

Once the statistical term similarities have been computed, the last step is to impose some
structure on the vocabulary which usually means a hierarchical arrangement of term classes. For
this, any appropriate clustering program can be used. A standard clustering algorithm generally
accepts all pairwise similarity values corresponding to a collection of objects and uses these
similarity values to partition the objects into clusters or classes such that objects within a cluster
are more similar than objects in different clusters. Some clustering algorithms can also generate
hierarchies.
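
As an illustration, the sketch below groups terms into classes with SciPy's agglomerative
(average-link) hierarchical clustering; the similarity matrix, threshold, and linkage method are
illustrative assumptions, and any suitable clustering program could be substituted.

import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

def cluster_terms(terms, sim, threshold=0.5):
    # sim: symmetric term-term similarity matrix with values in [0, 1].
    dist = 1.0 - np.asarray(sim, dtype=float)     # similarity -> distance
    np.fill_diagonal(dist, 0.0)
    tree = linkage(squareform(dist), method="average")   # build hierarchy
    labels = fcluster(tree, t=threshold, criterion="distance")
    classes = {}
    for term, label in zip(terms, labels):
        classes.setdefault(label, []).append(term)
    return classes   # thesaurus classes: one list of terms per cluster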

MERGING EXISTING THESAURI

As noted earlier, this approach is appropriate when two or more thesauri for a given subject
exist and need to be merged into a single unit; such a merger is likely to be more efficient than
producing the thesaurus from scratch. This approach has been discussed at some length in Forsyth and Rada
(1986). The challenge is that the merger should not violate the integrity of any component
thesaurus. Rada has experimented with augmenting the MESH thesaurus with selected terms
from SNOMED (Forsyth and Rada 1986, 216). MESH stands for Medical Subject Headings and
is the thesaurus used in MEDLINE, a medical document retrieval system, constructed and
maintained by the National Library of Medicine. It provides a sophisticated controlled
vocabulary for indexing and accessing medical documents. SNOMED, which stands for
Systematized Nomenclature of Medicine, is a detailed thesaurus developed by the College of
American Pathologists for use in hospital records. MESH terms are used to describe documents,
while SNOMED terms are for describing patients. Ideally, a patient can be completely described
by choosing one or more terms from each of several categories in SNOMED. Both MESH and
SNOMED follow a hierarchical structure. Rada's focus in his experiments has been on
developing suitable algorithms for merging related but separate thesauri such as MESH and
SNOMED and also in evaluating the end products.
