Information Retrieval System
3.0 Objectives
3.1 Introduction
3.2 Theoretical Foundations
3.3 Models of Information Retrieval Systems
3.3.1 Models Based on Input and Output
3.3.2 Models Based on Theories and Tools
3.4 IRS : Design and Operation
3.5 Search Strategy
3.6 Evaluation of IRS
3.7 Summary
3.8 Answers to Self Check Exercises
3.9 Keywords
3.10 References and Further Reading
3.0 OBJECTIVES
After reading this unit, you will be able to:
• understand the definition of information retrieval systems;
• know the theoretical foundations and models of information retrieval systems;
• acquaint yourself with the design and operation of IRS; and
• explain the method of searching for information in an IRS.
3.1 INTRODUCTION
It was Calvin Mooers who in 1950 coined the term “information retrieval” and
described it as “searching and retrieval of information from storage according to
specific subject.” The word retrieval means to discover and bring to the notice of
the users the documents in which information is embedded. B.C. Vickery, in turn, has
described it as follows: “retrieval is essentially concerned with the structure of the operation
of the device to select documentary information from the store of information in
response to several questions.”
Retrieval systems are usually in a state of continuous gradual revision: data are
added or withdrawn, new index points are inserted, and syndetic relationships are changed. The
development of effective retrieval techniques has been the core of IR research for
more than 30 years. Nowadays multimedia indexing and retrieval techniques are
being developed to access image, video and sound databases without text descriptions.
Types of Information Systems
The information retrieval system is certainly not a new concept; it is an integral part
of the communication process, a direct outgrowth of the desire among people to
communicate with each other.
The classification of retrieval techniques proposed by Nicholas
Belkin and Bruce Croft is as follows:
Retrieval techniques
    Exact match
    Partial match
        Individual
        Network
Belkin and Croft distinguish between exact and partial match techniques. Exact
match techniques are currently in use in most conventional IR systems. Queries
are usually formulated as Boolean expressions, and the search patterns within the
query have to match the text representation of the document to be retrieved
exactly. Partial match techniques, as opposed to exact match techniques, are
categorised into individual and network techniques. Individual techniques search single document
nodes without considering the document collection as a whole; they are either
feature-based or structure-based. In feature-based
techniques, documents are represented by sets of features or index terms. The index terms
can be either assigned manually or computed automatically. In structure-based
techniques, documents are represented by a more complicated structure than the
set of index terms used in feature-based techniques.
In network-based methods, the set of all documents and their relationships is used
to find the most relevant documents. Network techniques include clustering,
browsing, and spreading activation. In
clustering, the most similar documents are clustered together and all documents are
grouped into a cluster hierarchy until a ranked list of lowest-level clusters is
produced. Spreading activation is similar to browsing. From a start node, the
nodes connected to it are activated. Activated nodes then propagate, or
spread, their activation through the network.
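The spreading-activation idea described above can be sketched in a few lines of Python. This is a minimal illustration only: the document graph `links`, the `decay` factor and the `threshold` are hypothetical values chosen for the example and do not come from Belkin and Croft.

```python
from collections import deque

def spread_activation(graph, start_nodes, decay=0.5, threshold=0.2):
    """Propagate activation from the start nodes through linked documents.

    graph maps each node to the nodes it is connected to (hypothetical data).
    Each hop multiplies the activation by `decay`; propagation stops once
    the activation passed on would fall below `threshold`.
    """
    activation = {node: 1.0 for node in start_nodes}
    queue = deque(start_nodes)
    while queue:
        node = queue.popleft()
        passed_on = activation[node] * decay
        if passed_on < threshold:
            continue
        for neighbour in graph.get(node, []):
            # Keep the strongest activation seen so far for each node.
            if passed_on > activation.get(neighbour, 0.0):
                activation[neighbour] = passed_on
                queue.append(neighbour)
    return activation

# Toy document network: doc1 links to doc2 and doc3, and so on.
links = {
    "doc1": ["doc2", "doc3"],
    "doc2": ["doc4"],
    "doc3": [],
    "doc4": ["doc5"],
    "doc5": [],
}
ranked = sorted(spread_activation(links, ["doc1"]).items(),
                key=lambda item: (-item[1], item[0]))
print(ranked)  # [('doc1', 1.0), ('doc2', 0.5), ('doc3', 0.5), ('doc4', 0.25)]
```

With these assumed settings, doc5 is never activated because the activation reaching it would be below the threshold; a lower threshold would let the activation spread further.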
Theoretically there is no constraint on the type and structure of the information items
to be stored and retrieved with an information retrieval (IR) system. Until recently,
information retrieval systems were limited to searching textual information. Gerard
Salton has defined an information retrieval system as a “system used to store items
of information that need to be processed, searched, retrieved, and disseminated to
various user populations.”
According to Allen Kent, any information retrieval system entails a series of processes
or steps, which are as follows:
i) Analysis, involving perusal of the record and the selection of points of view (or
analytics).
ii) Terminology and subject heading control, involving the establishment of some arbitrary
relationships among the ‘analytics’ in the system.
iii) Recording the results of analysis on a searchable medium.
iv) Storage of records or source documents, involving the physical placement of
the record in some location.
v) Question analysis and development of search strategy involving the expression
of a question or a problem.
vi) Conducting of search involving the manipulation or operation of the search
mechanism in order to identify records from the file.
vii) Delivery of results of search involving physical removal or copying of a record
from files.
Thus, any information retrieval system has three components: input, process and
output. The storing of information is the input component. Generally, the search or
retrieval of information from the information retrieval system is carried out through a query
processing system. The information stored in the system is indexed by key words using some
indexing technique. The processing system matches the key words
of the query with the key words under which the information items
have been indexed. The matching results in the response output, which may be the
answer given to the user in response to a request or search for information.
of information need is complex and time-consuming. It calls for a long
conversational or browsing process, and the information retrieval model must
incorporate such interface facilities.
A distinction is made between data retrieval and information retrieval. In practice
the distinction is one of degree rather than kind. Figure 2 shows a spectrum of
various attributes of data retrieval and information retrieval:
Attribute              Information Retrieval    Data Retrieval
Retrieval models       Probabilistic            Deterministic
Indexing               Derived from contents    Complete items
Matching/Retrieval     Partial or best match    Exact match
Query language         Natural                  Artificial/structured
Result criteria        Relevance                Any match
Query specification    Incomplete               Complete
Items wanted           Relevant                 Matching
Error response         Insensitive              Sensitive
i) The semantic base, which conveys meaning from one human being to another;
ii) The syntactic base, which helps in the formation of semantics by the use of grammar;
and
iii) The vocabulary, which supplies different meanings to terms for the formation of
expressions in sentences, paragraphs, etc.
The logical structure of a language and the taxonomy of the language refer to the
relationship between vocabulary and concepts. The vocabulary generally refers to
the logical structure. Vocabulary control includes thesaurus content and technical
glossary control. The indexing language, with its control of expression terms, provides
the basic model for information retrieval. The use of associative mathematics in search
logic and in search expression formulation provides yet another type of language
control in information retrieval.
Mathematical Models
Mathematical models are essentially based on representative mathematics as well
as associative connections. In particular, cluster analysis and clustering techniques
are used on an experimental basis in automatic abstracting and indexing. The use of set
theory and Boolean logic is a familiar method of mathematical modelling of information.
Similarity measures, the choice of variables and the combinatorial aspects
of clustering together attempt to provide a semantic structure for the information represented.
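As a small illustration of a set-based similarity measure, the sketch below computes the Jaccard coefficient between the index-term sets of two documents. The choice of the Jaccard coefficient and the sample terms are assumptions made for illustration; the text does not name a specific measure.

```python
def jaccard(terms_a, terms_b):
    """Jaccard coefficient: size of the intersection over size of the union."""
    a, b = set(terms_a), set(terms_b)
    if not a and not b:
        return 0.0  # convention chosen here for two empty term sets
    return len(a & b) / len(a | b)

# Hypothetical index terms assigned to two documents.
doc_x = {"information", "retrieval", "index"}
doc_y = {"information", "retrieval", "thesaurus", "query"}
print(jaccard(doc_x, doc_y))  # 2 shared terms out of 5 distinct terms -> 0.4
```

A clustering procedure could use such a coefficient to decide which documents are similar enough to be grouped together.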
Psychological Model
The psycholinguistic approach to information retrieval led to the study of the formation
of concepts in the human mind, the way in which the human thinking process arranges
ideas and presents them at the time of inquiry, and the types of retrieval it demands
while searching. The studies of Belkin, Brooks and Oddy on the anomalous state of
knowledge provide an interesting insight into the information retrieval
process. Further, studies in information retrieval and artificial intelligence have made
significant contributions, bringing about a harmonious coupling of psychological theory with
information retrieval.
The Economic Model
The economic model of information retrieval centres on the measurement of the cost-
effectiveness and cost-efficiency of information retrieval. These two criteria are based
on the performance of the information retrieval system in relation to input cost as
well as the number of successful outputs. The provision of multiple access
points gives a chance to measure information transfer. Several
models of information measurement based on statistical and mathematical techniques
have been used for studies in bibliometrics and scientometrics, providing scope for
correlation with economic benefits. However, due to various intangible elements in
information retrieval which cannot be identified, the economic model does not yet
provide a holistic approach to information retrieval.
User Model in Information Retrieval
In IRS, the user model is used to provide assistance to the user in the query
formulation process. The goal is to express the information requirements in the best
possible way. Clearly, the best way is the one that provides the system with enough
input information to retrieve all relevant documents. Users frequently have a hard
time specifying explicitly what exactly they are looking for. It is the task of the user
model component of the IR system to automatically help and complement the user’s
interests based on their previous search behaviour.
[Figure 3: The user model in an information retrieval system. Diagram labels: User; Get user’s preference (query); Analyse user input; User model; Construct formal query; Information System; Process System output.]
Figure 3 illustrates the central position that a user model can assume in an information
retrieval system. An IR system enhanced with user modelling techniques will normally
start by getting the user’s preferences. For example, there can be a statement of the
user’s interests as a self-description, or it may be an SQL-based query. This input is
subsequently analysed, using the user model, and the user model is updated
accordingly. Then the formal query is constructed and processed, based on the
user’s preferences. Afterwards, in close interaction with the user model, the output is
prepared for presentation to the user and the user model is refreshed. Finally, the
user can evaluate the query and restart the whole cycle if needed.
[Figure (Section 3.4, IRS: Design and Operation). Recoverable diagram labels: Input (keyboard, machine tapes, text, character reader); Indexing (manual, computer-assisted, auto-indexing; full text, word deletion, statistics, dictionary, syntactic); File structure (serial, clustered, random); Abstracting; Extraction; Thesauri (manual, computer-assisted, mechanical); Query language; Feedback (manual, automatic); Output.]
3.5 SEARCH STRATEGY
Basic Search Techniques
In a bibliographical information retrieval environment, searches can be divided into
two main classes: known item search and unknown item search. A known item
search is conducted when the user knows something about the item being
sought. This may be any key, such as author, title, publisher, ISBN, and so on. An
unknown item search is conducted when users are not aware of the existence of any
document that may solve their problems. In other words, users do not know whether
an item exists that can meet their information requirements. There are
different types of searches, and knowing them helps in understanding the entire process of
search strategy.
Keyword and Phrase Search
A search can be conducted by entering a single search term or a phrase comprising
more than one term. The keyword search is the simplest form of search facility
offered by a search system. In keyword search mode, the system searches the
inverted file (the index) for each keyword/term forming the search expression. The
search terms can be entered through the keyboard or can be selected from an index
or vocabulary control tool, such as subject headings lists or thesauri. Search
expressions containing more than one keyword may require the use of Boolean or
proximity operators.
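A keyword lookup in an inverted file can be sketched as follows. This is a simplified illustration, not the implementation of any particular search system: the sample `records`, the whitespace tokenisation and the function names are assumptions made for the example.

```python
from collections import defaultdict

def build_inverted_file(records):
    """Map each lower-cased keyword to the set of record ids containing it."""
    inverted = defaultdict(set)
    for record_id, text in records.items():
        for word in text.lower().split():
            inverted[word].add(record_id)
    return inverted

def keyword_search(inverted, term):
    """Look the term up in the inverted file; unknown terms return no records."""
    return inverted.get(term.lower(), set())

# Hypothetical database records: record id -> indexed text.
records = {
    1: "internet and computer networks",
    2: "computer aided indexing",
    3: "library classification schemes",
}
inverted_file = build_inverted_file(records)
print(sorted(keyword_search(inverted_file, "Computer")))  # [1, 2]
```

The search never scans the records themselves; it only consults the index, which is why each keyword in a search expression can be resolved quickly and the results combined afterwards.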
In a phrase search, the system searches for the entire phrase rather than each individual
keyword forming the phrase. Phrase searches can be conducted only in those fields
that are phrase indexed. If the index file comprises only single terms, then a phrase
search cannot be conducted unless proximity operators are used, whereby the system
will search for each constituent keyword in the search expression separately and
retrieve only those records where the keywords occur consecutively. A search phrase
can simply be entered through the keyboard, or selected from an index file or
vocabulary control tools like subject headings lists and thesauri.
Different search systems provide different facilities for conducting key word and
phrase searches. For example, in a Dialog search one can simply enter a key word
or a phrase preceded by the search command. The user can restrict the search to
one or more fields.
Many bibliographical information retrieval systems provide two types of search
facilities for conducting an unknown item search: keyword search and subject search.
A keyword search allows users to enter one or more keywords pertaining to their
query. These keywords can be chosen by the user in any combination depending
upon the requirements, and there are several search operators that can be used to
combine several keywords to formulate a search expression. The search keywords
can appear anywhere, or in one or more chosen fields, in the database records. A
subject search allows the user to submit a subject expression that reflects his or her
information requirement. Such a search is conducted on the subject field that contains
the subject headings assigned by the indexers when the database was created. Thus,
a record will be retrieved only when the user’s subject search expression exactly
matches the subject heading assigned by the indexers. To standardize the process,
and also to help the user identify the appropriate subject headings, IRS use certain
tools, called vocabulary control tools.
Boolean Search
This is a search technique that combines search terms according to Boolean logic.
Three types of Boolean search are possible: AND search, OR search and NOT
search.
The AND search allows the user to combine two or more search terms using the
Boolean AND operator. The search will then retrieve all those items that contain all
the constituent terms. For example, the search expression “Internet AND computer”
will retrieve all those records where both the terms occur. The search is restricted
by adding more search terms. The more search terms are ANDed, the more restricted,
or specific, the search will be and, as a result, the smaller the search output.
Sometimes a search may produce a blank result if too many search terms are
ANDed.
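The narrowing effect of AND can be seen by treating each search term's postings as a set. The sketch below is illustrative only: the `postings` table and record ids are made up, and the OR and NOT functions simply follow ordinary Boolean logic (set union and set difference).

```python
# Hypothetical postings: search term -> set of record ids containing it.
postings = {
    "internet": {1, 2, 5},
    "computer": {2, 3, 5, 7},
    "library": {5, 8},
}
all_records = set(range(1, 9))  # record ids 1..8 in this toy database

def boolean_and(*terms):
    """Records containing every one of the terms (set intersection)."""
    result = set(all_records)
    for term in terms:
        result &= postings.get(term, set())
    return result

def boolean_or(*terms):
    """Records containing at least one of the terms (set union)."""
    result = set()
    for term in terms:
        result |= postings.get(term, set())
    return result

def boolean_not(wanted, unwanted):
    """Records containing `wanted` but not `unwanted` (set difference)."""
    return postings.get(wanted, set()) - postings.get(unwanted, set())

print(sorted(boolean_and("internet", "computer")))             # [2, 5]
print(sorted(boolean_and("internet", "computer", "library")))  # [5]
print(sorted(boolean_or("internet", "library")))               # [1, 2, 5, 8]
print(sorted(boolean_not("computer", "internet")))             # [3, 7]
```

Each additional ANDed term can only shrink the result set, which is why over-specified AND searches may return nothing at all.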
Truncation
Truncation is a facility that enables a search to be conducted for all the different
forms of a word having the same common root. As an example, the truncated word
COMPUT* will retrieve items like COMPUTER, COMPUTING,
COMPUTATION, COMPUTE, etc. A number of different options are available
for truncation, viz. right truncation (as in the COMPUT* example), left truncation, and
masking of letters in the middle of the word. Left truncation retrieves all words
having the same characters at the right-hand part, e.g. *HYL will retrieve words like
METHYL, ETHYL, etc. Similarly, middle truncation retrieves all words having the
same characters at the left- and right-hand parts. For example, the middle-truncated
search term COL*R will retrieve both the terms COLOUR and COLOR.
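The three kinds of truncation can be tried out with Python's standard `fnmatch` module, in which `*` matches any run of characters. This is only a sketch: the `vocabulary` list is hypothetical, the middle-truncated pattern is written here as `COL*R`, and real search systems use their own truncation symbols and rules.

```python
from fnmatch import fnmatchcase

def truncation_search(pattern, vocabulary):
    """Return the index terms matching a pattern in which * stands for any run of letters."""
    pattern = pattern.upper()
    return [term for term in vocabulary if fnmatchcase(term.upper(), pattern)]

# Hypothetical index vocabulary.
vocabulary = ["COMPUTER", "COMPUTING", "COMPUTE", "METHYL", "ETHYL",
              "COLOUR", "COLOR", "COLLAR", "COLD"]

print(truncation_search("COMPUT*", vocabulary))  # right truncation
# ['COMPUTER', 'COMPUTING', 'COMPUTE']
print(truncation_search("*HYL", vocabulary))     # left truncation
# ['METHYL', 'ETHYL']
print(truncation_search("COL*R", vocabulary))    # middle truncation
# ['COLOUR', 'COLOR', 'COLLAR']
```

Note that the middle-truncated pattern also picks up COLLAR, a reminder that unrestricted truncation can retrieve unwanted terms along with the intended spelling variants.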
Proximity Search
This search facility allows the user to specify:
1) Whether two search terms should occur adjacent to each other,
2) Whether one or more words may occur in between the search terms,
3) Whether the search terms should occur in the same paragraph irrespective of the
intervening words, and so on.
The operators used for proximity searching and their meanings differ from one search
system to another. Various types of proximity search facilities and the
corresponding operators are available in CD-ROM and online databases.
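A proximity test can be sketched by comparing the word positions of the two search terms. The function names, the `max_gap` and `ordered` parameters and the sample record below are assumptions for illustration; they do not correspond to the operators of any particular search system.

```python
def positions(text, term):
    """Word positions (0-based) at which term occurs in text."""
    words = text.lower().split()
    return [i for i, word in enumerate(words) if word == term.lower()]

def within(text, first, second, max_gap=0, ordered=True):
    """True if `second` occurs with at most `max_gap` words between it and `first`.

    max_gap=0 means the two terms are adjacent. With ordered=True the
    second term must follow the first; otherwise either order is accepted.
    """
    for p in positions(text, first):
        for q in positions(text, second):
            distance = q - p if ordered else abs(q - p)
            if 1 <= distance <= max_gap + 1:
                return True
    return False

record = "information retrieval systems store and retrieve documents"
print(within(record, "information", "retrieval"))           # True: adjacent
print(within(record, "information", "systems", max_gap=1))  # True: one word between
print(within(record, "systems", "information", max_gap=1))  # False: wrong order
print(within(record, "systems", "information", max_gap=1, ordered=False))  # True
```

A same-paragraph operator could be built the same way by first splitting the record into paragraphs and then checking that both terms occur in one of them.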
Field-specific Search
A search can be conducted on all the fields in a database or it may be restricted to
one or more chosen fields to produce more specific results. Specific fields and
codes vary according to the search systems and database.
Limiting Search
Sometimes the user may want to limit a given search by using certain criteria such as
language, year of publication, type of information source and so on. These are called
limiting searches. Parameters that can be used to limit a search are decided by the
database concerned. Below are two examples of limiting searches in Dialog.
Limit                              Qualifier    Example
English-language documents only    /ENG         SELECT URBAN(S)CRIME?/ENG
Patents only                       /PAT         S TRANSISTOR?/PAT
Range Search
The range search is very useful with numerical information. It is important in selecting
records within certain data ranges. The following options are usually available for
range searching, though the exact number of operators, their meanings, etc. differ
from one search system to another:
• Greater than (>)
• Less than (<)
• Not equal to (!= or <>)
• Greater than or equal to (>=)
• Less than or equal to (<=)
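A range search over a numeric field can be sketched as a simple filter. The records, the `year` field and the operator table below are hypothetical and only illustrate the comparison options listed above.

```python
import operator

# Range operators mapped to Python comparison functions.
OPERATORS = {
    ">": operator.gt,
    "<": operator.lt,
    "!=": operator.ne,
    "<>": operator.ne,
    ">=": operator.ge,
    "<=": operator.le,
}

def range_search(records, field, op, value):
    """Return the records whose numeric field satisfies the comparison."""
    compare = OPERATORS[op]
    return [r for r in records if compare(r[field], value)]

# Hypothetical bibliographic records with a publication year.
records = [
    {"title": "A", "year": 1987},
    {"title": "B", "year": 1995},
    {"title": "C", "year": 2003},
]
print([r["title"] for r in range_search(records, "year", ">=", 1995)])  # ['B', 'C']
print([r["title"] for r in range_search(records, "year", "<>", 1995)])  # ['A', 'C']
```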
Search Tools
Library and information professionals have long been using four types of tools for
organizing information. They are:
1) Classification Schemes
Classification schemes, such as the Dewey Decimal Classification (DDC), the Universal
Decimal Classification (UDC), the Library of Congress Classification (LC), the Colon
Classification and so on, are used for classifying documents, for organizing files and
also for the physical organization of documents in libraries.
2) Catalogue Codes
The catalogue codes, such as Anglo-American Cataloguing Rules, Classified
Catalogue Code, etc., are used to prepare catalogue records of documents, which
provide information to a user about what a given library/information center possesses.
3) Standard Bibliographic Record Formats
Standard record formats such as ISBD and MARC (Machine Readable Cataloguing)
formats are used to prepare machine readable records of bibliographic and other
types of documents.
4) Vocabulary Control Devices
Vocabulary control devices such as thesauri and subject headings lists are used to
standardize the terminology, which can be used both at the time of indexing and
searching records.
All these tools can be used for organizing information in various types of information
systems including digital library systems. However, these are only basic search tools
which may be used and there are many more search techniques available for specific
information retrieval systems.
3.6 EVALUATION OF IRS
Any information system exists to provide the seeker of information with the document
which bears the information or answers the query. Evaluation is a diagnostic
activity to understand the performance of a system. It reveals the strengths as well as the
weaknesses of an information system. It informs us about the social benefits that accrue
from the system. It also tells us about the economic aspects of the system, such as
its various costs. On the basis of a careful evaluation, one can thus find ways of
improving the system, if required. Evaluation is rightly called an investment for the
future.
Evaluation Methodology
The evaluation programme of an information system involves a number of distinct
steps. Let us understand these steps:
1) The first step is to be clear about the scope of evaluation. That is to say the
purpose of evaluation should be very clearly defined. The scope should be
defined precisely before the designing and execution of an evaluation programme.
2) The second step is the designing of the evaluation programme. The design should
be such so that it suits the objectives and purpose defined earlier. The success
of the evaluation programme depends upon the choice of appropriate design.
3) After deciding about the scope and design of the evaluation programme, the
next step is the execution proper. The execution includes the collection of data,
its organization, analysis and, lastly, the drawing of conclusions.
4) The fourth step is to analyse the conclusions and interpret the results.
5) The fifth and final step is to modify the information system on the basis of the
results of the evaluation as revealed in steps 3 and 4.
Irrespective of the methodology followed, the purpose of evaluating any information
system is to find out how well the system performs and what measures need to be taken
for its improvement. Sometimes, the evaluation of a particular IRS may provide a
clue for the design and development of other systems.
Criteria for Evaluation
The criteria on the basis of which an IRS can be evaluated are:
1) Recall and precision and related factors affecting retrieval efficiency
2) Cost
3) Response time
Recall and Precision
The effectiveness of information retrieval can be measured by the ability of the
system to retrieve the relevant documents and hold back the irrelevant ones in a
given collection in relation to a particular query. The ability to retrieve
relevant documents and the ability to withhold irrelevant ones are called the recall and
precision powers of the system respectively. Though 100% recall and
precision are theoretically desired, in practice this is not possible, as the two factors tend to be inversely
related: raising one usually lowers the other. The system in which these two factors are at the optimum
level will be regarded as the best one and would be preferred for application.
In response to a query, all the relevant documents may not be retrieved in a search;
only a part of them may be retrieved. Similarly, all the documents retrieved may not
be relevant, while a number of non-relevant documents also remain unretrieved.
This can be illustrated in the following formats:
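Recall is the retrieval of relevant documents by the system. The recall ratio can be defined
as the ratio of the number of relevant items retrieved to the total number of relevant
documents in the system. This can be mathematically represented as:
\[ \text{Recall ratio} = \frac{\text{Number of relevant items retrieved}}{\text{Total number of relevant items in the system}} \times 100 = \frac{a}{a + c} \times 100 \]
where a is the number of relevant items retrieved and c is the number of relevant items not retrieved.
Suppose there are in all 100 relevant documents in a file and the index is able to
retrieve only 75 of them and misses 25; then the recall ratio is
\[ \frac{75}{75 + 25} \times 100 = 75\% \]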
Precision Ratio
Precision is the capacity of the system to withhold non-relevant documents. The precision
ratio may be defined as the ratio of the relevant retrieved documents to the total
number of documents retrieved from the file. Mathematically it may be represented
as:
\[ \text{Precision ratio} = \frac{\text{Number of relevant items retrieved}}{\text{Total number of items retrieved}} \times 100 = \frac{a}{a + b} \times 100 \]
where a is the number of relevant items retrieved and b is the number of non-relevant items retrieved.
Suppose the total number of documents retrieved is 150 and, out of these, 75 are
relevant; then the precision ratio is
\[ \frac{75}{150} \times 100 = 50\% \]
It is often difficult to know the actual number of relevant documents in the
store. Nevertheless, findings on recall and precision are helpful in assessing the quality
of an IRS.
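The two worked examples above can be checked with a short Python sketch. The function and parameter names are chosen here for illustration; the figures are the ones used in the text.

```python
def recall_ratio(relevant_retrieved, relevant_in_store):
    """Percentage of the relevant documents in the store that were retrieved."""
    return relevant_retrieved / relevant_in_store * 100

def precision_ratio(relevant_retrieved, total_retrieved):
    """Percentage of the retrieved documents that are relevant."""
    return relevant_retrieved / total_retrieved * 100

# Figures from the worked examples in the text.
print(recall_ratio(75, 100))     # 75 of 100 relevant documents retrieved -> 75.0
print(precision_ratio(75, 150))  # 75 relevant among 150 retrieved -> 50.0
```

The sketch does not guard against a zero denominator (an empty result set or a store with no relevant documents); a practical evaluation script would need to decide how to report those cases.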
Besides recall ratio and precision ratio, the other relevant measures which indicate
the retrieval efficiency of a system are:
1) Noise ratio.
2) Fallout ratio.
3) Novelty ratio.
Noise Ratio
The noise ratio is the proportion of retrieved documents that are not relevant; it is
the complement of the precision ratio. The lower the noise ratio, the more efficient a retrieval system will be.
Fallout Ratio
It shows how many non-relevant documents, out of the total number of documents in
the store, have been retrieved by the retrieval system. Mathematically it may be put
as:
\[ \text{Fallout ratio} = \frac{\text{Total no. of non-relevant documents retrieved}}{\text{Total no. of documents in store}} \times 100 \]
Novelty Ratio
It is the proportion of nascent or new information items which the system is able to
bring to the attention of information seekers for the first time. Out of the total number
of relevant documents, a small percentage may be documents which contain
nascent information. If out of 100 relevant documents there are 15 such documents,
the novelty ratio will be 15%, i.e.
\[ \frac{15}{100} \times 100 = 15\% \]
An efficient retrieval system will bring to the attention of the user more of such
documents which provide novel or new or nascent information.
Indexing Exhaustivity
The exhaustivity of a system refers to the accuracy and depth with which the various
concepts contained in the system are covered. Exhaustivity is a property of the index
description. Indexing exhaustivity is connected with the recall power of the system:
a system having high indexing exhaustivity possesses high recall power.
Cost
Cost is an important factor in IR system evaluation. Cost may relate to the initial
expenditure required to develop a system and also to other direct charges concerned
with manpower, material, tools and other initial costs. Cost is a composite factor
which also includes the effort involved on the part of the indexer, the time involved
in the preparation of the index, and the search time and search effort on the part of the
user. Initial cost can easily be measured, but the cost of effort is a matter of
experience and realization. If a particular system is less costly than another, it is
better than the other. The ease of use of the system by the user can be related to
this aspect.
Response Time
Response time is another important factor for measuring the efficacy of the system.
Response time should be measured while the users are interacting with the system.
If a system requires less time to retrieve information, it would be more economical and
would be better than another taking a longer time to retrieve the same information.
Self Check Exercise
1) Discuss the various information retrieval techniques.
2) Name the information retrieval models based on tools and techniques.
3) Explain search strategy in brief.
4) Discuss the criteria for evaluation of IRS.
Note: i) Write your answers in the space given below
ii) Check your answer with the answers given at the end of this unit.
...................................................................................................
...................................................................................................
3.7 SUMMARY
The development of effective retrieval techniques has been the core of IR research
for more than 30 years. A number of measures of effectiveness have been proposed.
Effective interfaces for text-based information systems are a high priority for users of
these systems. With the increase in the use of the internet, there has been a
corresponding increase in the demand for information retrieval systems that can work
in wide area network environments. Search engines like INFOSEEK, LYCOS,
etc., index web pages and provide access to them. Developing databases and
providing search and retrieval access in an integrated manner have been the most
important aspects of developing IRS. Techniques for indexing and query
optimisation have been the major issues.
Retrieval techniques
    Exact match
    Partial match
        Individual
            Structure-based (logic, graph)
            Feature-based (formal, ad hoc)
        Network
            Cluster
            Browsing
            Spreading activation
i) Linguistic Model
3) There are 5 types of search strategies depending upon the kind of queries:
i) Boolean search
One can also go for proximity search and field specific search.
ii) Fallout