
UNIT 6

THESAURUS CONSTRUCTION
INTRODUCTION

Thesauri are valuable structures for Information Retrieval systems. A thesaurus provides a
precise and controlled vocabulary which serves to coordinate document indexing and document
retrieval. In both indexing and retrieval, a thesaurus may be used to select the most appropriate
terms. Additionally, the thesaurus can assist the searcher in reformulating search strategies if
required.

In IR systems, which are used to retrieve potentially relevant documents from large collections, the
thesaurus serves to coordinate the basic processes of indexing and document retrieval. (The term
document is used here generically and may refer to books, articles, magazines, letters,
memoranda, and also software.) In indexing, a succinct representation of the document is
derived, while retrieval refers to the search process by which relevant items are identified. The
IR thesaurus typically contains a list of terms, where a term is either a single word or a phrase,
along with the relationships between them. It provides a common, precise, and controlled
vocabulary which assists in coordinating indexing and retrieval. Given this objective, it is clear
that thesauri are designed for specific subject areas and are therefore domain dependent.

A carefully designed thesaurus can be of great value. The indexer is typically instructed to select
the most appropriate thesaural entries for representing the document. In searching, the user can
employ the thesaurus to design the most appropriate search strategy. If the search does not
retrieve enough documents, the thesaurus can be used to expand the query by following the
various links between terms. Similarly, if the search retrieves too many items, the thesaurus can
suggest more specific search vocabulary. In this way the thesaurus can be valuable for
reformulating search strategies.

FEATURES OF THESAURI

Some important features of thesauri will be highlighted here.

Coordination Level
Coordination refers to the construction of phrases from individual terms. Two distinct
coordination options are recognized in thesauri: precoordination and postcoordination. A
precoordinated thesaurus is one that can contain phrases; consequently, phrases are available for
indexing and retrieval. A postcoordinated thesaurus does not allow phrases; instead, phrases are
constructed while searching. The choice between the two options is difficult. The advantage of
precoordination is that the vocabulary is very precise, thus reducing ambiguity in indexing and
in searching. Also, commonly accepted phrases become part of the vocabulary. However, the
disadvantage is that the searcher has to be aware of the phrase construction rules employed.
Precoordination is more common in manually constructed thesauri. Automatic phrase
construction is still quite difficult, and therefore automatic thesaurus construction usually implies
postcoordination.

Term Relationships
Term relationships are the most important aspect of thesauri, since the vocabulary connections
they provide are most valuable for retrieval. Many kinds of relationships are expressed in a
manual thesaurus. These are semantic in nature and reflect the underlying conceptual interactions
between terms. Three categories of term relationships are commonly distinguished: (1)
equivalence relationships, (2) hierarchical relationships, and (3) nonhierarchical relationships.
Equivalence relations include both synonymy and quasi-synonymy. Quasi-synonyms are terms
which for the purpose of retrieval can be regarded as synonymous, for example, "genetics" and
"heredity," which have significant overlap in meaning, or "harshness" and "tenderness," which
represent different viewpoints of the same property continuum. A typical example of a
hierarchical relation is genus-species, such as "dog" and "German shepherd." Nonhierarchical
relationships also identify conceptually related terms. There are many examples, including
thing--part, such as "bus" and "seat," and thing--attribute, such as "rose" and "fragrance."

Number of Entries for Each Term

In general, it is preferable to have a single entry for each thesaurus term. However, this is
seldom achieved due to the presence of homographs--words with multiple meanings. Since the
meaning of each instance of a homograph can be deciphered only from context, it is more
realistic to have a unique representation or entry for each meaning of a homograph. This also
allows each homograph entry to be associated with its own set of relations.
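
One way to realize this is to key entries by term and sense rather than by term alone. The
Python sketch below is purely illustrative; the sense labels and relation names are assumptions,
not a standard.

from dataclasses import dataclass, field

@dataclass
class ThesaurusEntry:
    term: str
    sense: str                                     # e.g., "financial" vs. "river"
    equivalent: set = field(default_factory=set)   # synonyms, quasi-synonyms
    broader: set = field(default_factory=set)      # hierarchical: genus
    narrower: set = field(default_factory=set)     # hierarchical: species
    related: set = field(default_factory=set)      # nonhierarchical

# Two entries for the homograph "bank", each with its own relations.
entries = {
    ("bank", "financial"): ThesaurusEntry("bank", "financial",
                                          narrower={"savings bank"}),
    ("bank", "river"): ThesaurusEntry("bank", "river",
                                      related={"shore"}),
}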

Specificity of Vocabulary

Specificity of the thesaurus vocabulary is a function of the precision associated with the
component terms. A highly specific vocabulary is able to express the subject in great depth and
detail. This promotes precision in retrieval. The concomitant disadvantage is that the size of the
vocabulary grows since a large number of terms are required to cover the concepts in the
domain.

Control on Term Frequency of Class Members

This has relevance mainly for statistical thesaurus construction methods, which work by
partitioning the vocabulary into a set of classes, where each class contains a collection of
equivalent terms. To maintain a good match between documents and queries, it is necessary to
ensure that terms included in the same thesaurus class have roughly equal frequencies. Further,
the total frequency in each class should also be roughly similar.

Normalization of Vocabulary

Normalization of vocabulary terms is given considerable emphasis in manual thesauri. There are
extensive rules which guide the form of the thesaural entries. A simple rule is that terms should
be in noun form. A second rule is that noun phrases should avoid prepositions unless they are
commonly known. Also, a limited number of adjectives should be used. There are other rules
governing issues such as the singularity of terms, the ordering of terms within phrases, spelling,
capitalization, transliteration, abbreviations, initials, acronyms, and punctuation.

THESAURUS CONSTRUCTION

Manual Thesaurus Construction

The process of manually constructing a thesaurus is both an art and a science. We present
here only a brief overview of this complex process. First, one has to define the boundaries of the
subject area. (In automatic construction, this step is simple, since the boundaries are taken to be
those defined by the area covered by the document database.) Boundary definition includes
identifying central subject areas and peripheral ones since it is unlikely that all topics included
are of equal importance. Once this is completed, the domain is generally partitioned into
divisions or subareas.

Since manual thesauri are more complex structurally than automatic ones, as the previous
section has shown, there are more decisions to be made. Now, the collection of terms for each
subarea may begin. A variety of sources may be used for this including indexes, encyclopedias,
handbooks, textbooks, journal titles and abstracts, catalogues, as well as any existing and
relevant thesauri or vocabulary systems. Subject experts and potential users of the thesaurus
should also be included in this step. Once the initial organization has been completed, the entire
thesaurus will have to be reviewed (and refined) to check for consistency such as in phrase form
and word form.
Automatic Thesaurus Construction

The automatic thesaurus construction approaches discussed here were selected because
they are interesting, are quite different from each other, and rely purely on statistical techniques.
(The alternative is to use linguistic methods.) Consequently, the two major approaches selected
here have not necessarily received equal attention in the literature. The first approach, designing
thesauri from document collections, is a standard one. The second, merging existing thesauri, is
better known using manual methods.
From a Collection of Document Items

Here the idea is to use a collection of documents as the source for thesaurus construction.
This assumes that a representative body of text is available. The idea is to apply statistical
procedures to identify important terms as well as their significant relationships. The central
thesis in applying statistical methods is that computationally simple procedures can identify the
more important semantic knowledge for thesauri. It is semantic
knowledge that is used by both indexer and searcher. Until more direct methods are known,
statistical methods will continue to be used.

By Merging Existing Thesauri

This second approach is appropriate when two or more thesauri for a given subject exist
that need to be merged into a single unit. If a new database can indeed be served by merging two
or more existing thesauri, then a merger is likely to be more efficient than producing the
thesaurus from scratch.

User Generated Thesaurus

In this third alternative, the idea is that users of IR systems are aware of and use many
term relationships in their search strategies long before these find their way into thesauri. The
objective is to capture this knowledge from users' searches.

THESAURUS CONSTRUCTION FROM TEXTS

The overall process may be divided into three stages: (1) Construction of vocabulary: This
involves normalization and selection of terms. It also includes phrase construction depending on
the coordination level desired. (2) Similarity computations between terms: This step identifies
the significant statistical associations between terms. (3) Organization of vocabulary: Here the
selected vocabulary is organized, generally into a hierarchy.
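
The three stages can be viewed as a simple pipeline. The Python skeleton below makes this
explicit; it is a sketch only, the stage functions are supplied by the caller (possible
implementations are sketched in the sections that follow), and all names are illustrative.

def build_thesaurus(documents, construct_vocabulary,
                    compute_similarities, organize_vocabulary):
    vocabulary = construct_vocabulary(documents)                 # stage 1
    similarities = compute_similarities(vocabulary, documents)   # stage 2
    return organize_vocabulary(vocabulary, similarities)         # stage 3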

Construction of vocabulary

The objective here is to identify the most informative terms (words and phrases) for the
thesaurus vocabulary from document collections. The first step is to identify an appropriate
document collection. The only loosely stated criterion is that the collection should be sizable and
representative of the subject area. The next step is to determine the required specificity for the
thesaurus.
Stem evaluation and selection

There are a number of methods for statistically evaluating the worth of a term. The ones we
discuss here are: (1) selection of terms based on frequency of occurrence, (2) selection of terms
based on Discrimination Value, (3) selection of terms based on the Poisson model.

Selection by Frequency of Occurrence

The basic idea is that each term may be placed in one of three different frequency categories with
respect to a collection of documents: high, medium, and low frequency. Terms in the mid-
frequency range are the best for indexing and searching. Terms in the low-frequency range have
minimal impact on retrieval, while high-frequency terms are too general and negatively impact
search precision. Salton recommends creating term classes for the low-frequency terms.
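
A minimal sketch of this three-way partition follows; the cutoff values are illustrative
assumptions that would, in practice, be tuned to the collection.

from collections import Counter

def partition_by_frequency(documents, low=2, high=100):
    # documents: list of token lists; low/high are illustrative cutoffs.
    freq = Counter(term for doc in documents for term in doc)
    low_terms = {t for t, f in freq.items() if f < low}
    high_terms = {t for t, f in freq.items() if f > high}
    mid_terms = set(freq) - low_terms - high_terms   # best index terms
    return low_terms, mid_terms, high_terms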

Selection by Discrimination Value (DV)

DV measures the degree to which a term is able to discriminate or distinguish between the
documents of the collection. The more discriminating a term, the higher its value as an index
term. The overall procedure is to compute the average interdocument similarity in the collection,
using some appropriate similarity function. Next, the term k being evaluated is removed from the
indexing vocabulary and the same average similarity is recomputed. The discrimination value
(DV) for the term is then computed as:

DV(k) = (Average similarity without k) - (Average similarity with k)
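
Since removing a good discriminator makes the remaining documents more similar to one
another, terms with large positive DV are the best candidates. The sketch below assumes
documents are represented as term-frequency dictionaries and that cosine is the "appropriate
similarity function"; pairwise averaging is quadratic, and centroid-based shortcuts are used in
practice.

import math
from itertools import combinations

def cosine(d1, d2):
    # Cosine similarity between two term-frequency dictionaries.
    num = sum(d1[t] * d2[t] for t in set(d1) & set(d2))
    den = math.sqrt(sum(v * v for v in d1.values())) * \
          math.sqrt(sum(v * v for v in d2.values()))
    return num / den if den else 0.0

def avg_similarity(docs):
    pairs = list(combinations(docs, 2))
    return sum(cosine(a, b) for a, b in pairs) / len(pairs)

def discrimination_value(docs, k):
    # DV(k) = (average similarity without k) - (average similarity with k)
    without_k = [{t: f for t, f in d.items() if t != k} for d in docs]
    return avg_similarity(without_k) - avg_similarity(docs)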

Selection by the Poisson Method

The Poisson distribution is a discrete probability distribution that can be used to model a variety
of random phenomena, including the number of typographical errors on a page of writing and the
number of red cars passing a point on a highway per hour. Across the research on the family of
Poisson models, the one consistent result is that the occurrences of trivial words follow a single
Poisson distribution, while the distribution of nontrivial words deviates significantly from a
Poisson distribution.
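
One simple, informal diagnostic for the single-Poisson property is the variance-to-mean ratio
of a term's per-document frequency: under a Poisson model the variance equals the mean, so a
ratio well above 1 suggests a nontrivial, content-bearing word. The sketch below computes this
ratio; the Poisson-model literature uses more formal goodness-of-fit tests.

def variance_to_mean_ratio(term, documents):
    # documents: list of token lists; counts are per-document frequencies.
    counts = [doc.count(term) for doc in documents]
    mean = sum(counts) / len(counts)
    var = sum((c - mean) ** 2 for c in counts) / len(counts)
    return var / mean if mean else 0.0   # ~1 for Poisson-like (trivial) words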

Phrase construction

This step may be used to build phrases if desired. As mentioned before, this decision is
influenced by the coordination level selected. Also, phrase construction can be performed to
decrease the frequency of high-frequency terms and thereby increase their value for retrieval.

Salton and McGill Procedure: This procedure is a statistical alternative to syntactic and/or
semantic methods for identifying and constructing phrases. Two general criteria
are used. First, the component words of a phrase should occur frequently in a common context,
such as the same sentence.
The second general requirement is that the component words should represent broad concepts,
and their frequency of occurrence should be sufficiently high. These criteria motivate their
algorithm, which is described below:

1. Compute pairwise co-occurrence for high-frequency words. (Any suitable contextual
constraint such as the ones above may be applied in selecting pairs of terms.)

2. If this co-occurrence is lower than a threshold, do not consider the pair any further.

3. For pairs that qualify, compute the cohesion value. Two formulas for computing cohesion are
given below. Both ti and tj represent terms, and size-factor is related to the size of the thesaurus
vocabulary.

   COHESION(ti, tj) = co-occurrence-frequency / sqrt(frequency(ti) * frequency(tj))

   COHESION(ti, tj) = size-factor * co-occurrence-frequency / (total-frequency(ti) * total-frequency(tj))

4. If cohesion is above a second threshold, retain the pair as a valid vocabulary phrase.
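
As a concrete illustration, here is a minimal Python sketch of the procedure using the first
cohesion formula, with co-occurrence counted within a sentence. The thresholds are illustrative
assumptions, and the restriction to high-frequency words in step 1 is omitted for brevity.

import math
from collections import Counter
from itertools import combinations

def candidate_phrases(sentences, min_cooccur=5, min_cohesion=0.1):
    # sentences: list of token lists (one list per sentence).
    freq = Counter(t for s in sentences for t in s)
    cooccur = Counter()
    for s in sentences:
        for ti, tj in combinations(sorted(set(s)), 2):
            cooccur[(ti, tj)] += 1
    phrases = []
    for (ti, tj), c in cooccur.items():
        if c < min_cooccur:                    # step 2: drop infrequent pairs
            continue
        cohesion = c / math.sqrt(freq[ti] * freq[tj])   # first formula above
        if cohesion >= min_cohesion:           # step 4: keep cohesive pairs
            phrases.append((ti, tj, cohesion))
    return phrases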

Choueka Procedure: The second phrase construction method is based on the work by Choueka
(1988). He proposes a rather interesting and novel approach for identifying collocational
expressions, by which he refers to phrases whose meaning cannot be derived in a simple way
from that of the component words, for example, "artificial intelligence." The algorithm proposed
is statistical and combinatorial and requires a large collection (at least a million items) of
documents to be effective. The following are the steps:

1. Select the range of length allowed for each collocational expression. Example: two to six
words.

2. Build a list of all potential expressions of the prescribed length from the collection that have
a minimum frequency (again, a preset value).

3. Delete sequences that begin or end with a trivial word. The trivial words include prepositions,
pronouns, articles, conjunctions, and so on.

4. Delete expressions that contain high-frequency nontrivial words.

5. Given an expression such as a b c d, evaluate any potential subexpressions such as a b c and
b c d for relevance. Discard any that are not sufficiently relevant.

6. Try to merge smaller expressions into larger and more meaningful ones. For example, a b c
and b c d may merge to form a b c d.
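
Here is a partial sketch of steps 1 through 3 (candidate generation and trivial-word filtering);
the trivial-word list and frequency threshold are illustrative assumptions, and steps 4 through 6
are omitted.

from collections import Counter

TRIVIAL = {"a", "an", "the", "of", "in", "on", "and", "or", "to", "it"}

def collocation_candidates(tokens, min_len=2, max_len=6, min_freq=3):
    ngrams = Counter()
    for n in range(min_len, max_len + 1):          # step 1: length range
        for i in range(len(tokens) - n + 1):
            ngrams[tuple(tokens[i:i + n])] += 1
    return {g: f for g, f in ngrams.items()
            if f >= min_freq                       # step 2: frequency floor
            and g[0] not in TRIVIAL
            and g[-1] not in TRIVIAL}              # step 3: trivial-word filter
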
Similarity computation

Once the appropriate thesaurus vocabulary has been identified, and phrases have been designed
if necessary, the next step is to determine the statistical similarity between pairs of terms. Two
similarity routines are commonly used:

Cosine: This computes the number of documents associated with both terms divided by the
square root of the product of the number of documents associated with the first term and the
number of documents associated with the second term.

Dice: This computes the number of documents associated with both terms divided by the sum of
the number of documents associated with one term and the number associated with the other.
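
The sketch below implements both routines, assuming each term is represented by the set of
documents in which it occurs. Note that the conventional Dice coefficient multiplies the
numerator by 2; the version here follows the verbal description above.

import math

def cosine_sim(docs_i, docs_j):
    # docs_i, docs_j: sets of document identifiers for two terms.
    both = len(docs_i & docs_j)
    return both / math.sqrt(len(docs_i) * len(docs_j))

def dice_sim(docs_i, docs_j):
    # Conventional Dice uses 2 * both in the numerator; the plain ratio
    # stated in the text is used here.
    both = len(docs_i & docs_j)
    return both / (len(docs_i) + len(docs_j))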

Vocabulary Organization

Once the statistical term similarities have been computed, the last step is to impose some
structure on the vocabulary which usually means a hierarchical arrangement of term classes. For
this, any appropriate clustering program can be used. A standard clustering algorithm generally
accepts all pairwise similarity values corresponding to a collection of objects and uses these
similarity values to partition the objects into clusters or classes such that objects within a cluster
are more similar than objects in different clusters. Some clustering algorithms can also generate
hierarchies.
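
As an illustration, the sketch below groups terms into classes with SciPy's agglomerative
(average-link) hierarchical clustering; the similarity matrix, threshold, and linkage method are
illustrative assumptions, and any suitable clustering program could be substituted.

import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

def cluster_terms(terms, sim, threshold=0.5):
    # sim: symmetric term-term similarity matrix with values in [0, 1].
    dist = 1.0 - np.asarray(sim, dtype=float)     # similarity -> distance
    np.fill_diagonal(dist, 0.0)
    tree = linkage(squareform(dist), method="average")   # build hierarchy
    labels = fcluster(tree, t=threshold, criterion="distance")
    classes = {}
    for term, label in zip(terms, labels):
        classes.setdefault(label, []).append(term)
    return classes   # thesaurus classes: one list of terms per cluster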

MERGING EXISTING THESAURI

As noted earlier, this approach is appropriate when two or more thesauri for a given subject
exist and need to be merged into a single unit; such a merger is likely to be more efficient than
producing the thesaurus from scratch. This approach has been discussed at some length in Forsyth and Rada
(1986). The challenge is that the merger should not violate the integrity of any component
thesaurus. Rada has experimented with augmenting the MESH thesaurus with selected terms
from SNOMED (Forsyth and Rada 1986, 216). MESH stands for Medical Subject Headings and
is the thesaurus used in MEDLINE, a medical document retrieval system, constructed and
maintained by the National Library of Medicine. It provides a sophisticated controlled
vocabulary for indexing and accessing medical documents. SNOMED, which stands for
Systematized Nomenclature of Medicine, is a detailed thesaurus developed by the College of
American Pathologists for use in hospital records. MESH terms are used to describe documents,
while SNOMED terms are for describing patients. Ideally, a patient can be completely described
by choosing one or more terms from each of several categories in SNOMED. Both MESH and
SNOMED follow a hierarchical structure. Rada's focus in his experiments has been on
developing suitable algorithms for merging related but separate thesauri such as MESH and
SNOMED and also in evaluating the end products.
