Irs Unit-1-1

The document provides a comprehensive overview of Information Retrieval Systems (IRS), covering their definitions, objectives, capabilities, and relationships to database management systems. It details various processes involved in cataloging, indexing, and user search techniques, as well as the importance of precision and recall in evaluating system performance. Additionally, it discusses the functional components of IRS, including item normalization, selective dissemination of information, and document database searches.

Uploaded by

ankithmahareddy
UNIT - I

Introduction to Information Retrieval Systems: Definition of Information Retrieval System, Objectives of Information Retrieval Systems, Functional Overview, Relationship to Database Management Systems, Digital Libraries and Data Warehouses.
Information Retrieval System Capabilities: Search Capabilities, Browse Capabilities, Miscellaneous Capabilities.

UNIT - II
Cataloging and Indexing: History and Objectives of Indexing, Indexing Process, Automatic Indexing, Information Extraction.
Data Structure: Introduction to Data Structure, Stemming Algorithms, Inverted File Structure, N-Gram Data Structures, PAT Data Structure, Signature File Structure, Hypertext and XML Data Structures, Hidden Markov Models.

UNIT - III
Automatic Indexing: Classes of Automatic Indexing, Statistical Indexing, Natural Language, Concept Indexing, Hypertext Linkages.
Document and Term Clustering: Introduction to Clustering, Thesaurus Generation, Item Clustering, Hierarchy of Clusters.

UNIT - IV
User Search Techniques: Search Statements and Binding, Similarity Measures and Ranking, Relevance Feedback, Selective Dissemination of Information Search, Weighted Searches of Boolean Systems, Searching the INTERNET and Hypertext.
Information Visualization: Introduction to Information Visualization, Cognition and Perception, Information Visualization Technologies.

UNIT - V
Text Search Algorithms: Introduction to Text Search Techniques, Software Text Search Algorithms, Hardware Text Search Systems.
Multimedia Information Retrieval: Spoken Language Audio Retrieval, Non-Speech Audio Retrieval, Graph Retrieval, Imagery Retrieval, Video Retrieval.

TEXT BOOK:
1. Information Storage and Retrieval Systems – Theory and Implementation, Second Edition, Gerald J. Kowalski, Mark T. Maybury, Springer

REFERENCE BOOKS:
1. Frakes, W.B., Ricardo Baeza-Yates: Information Retrieval Data Structures and Algorithms, Prentice Hall, 1992.
2. Information Storage & Retrieval By Robert Korfhage – John Wiley & Sons.
3. Modern Information Retrieval, Ricardo Baeza-Yates and Berthier Ribeiro-Neto, Pearson Education.
Information Storage and Retrieval

Chapter 1:
Introduction to Information Retrieval Systems
OBJECTIVES

◻ Definition of Information Retrieval Systems

◻ Objectives of Information Retrieval Systems

◻ Functional Overview

◻ Relationship to Database Management Systems


Information Retrieval System Definition
◻ An Information Retrieval System is a system that is capable of storage, retrieval, and maintenance of information.
◻ Information in this context can be composed of text (including numeric and date data), images, audio, video and other multimedia objects.
◻ Techniques are beginning to emerge to search these other media types.
Gauge of an IR System
◻ An Information Retrieval System consists of a software program that facilitates a user in finding the information the user needs.
◻ The gauge of success of an information system is how well it can minimize the overhead for a user to find the needed information.
◻ Overhead from a user's perspective is the time required to find the information needed, excluding the time for actually reading the relevant data. Thus search composition, search execution, and reading non-relevant items are all aspects of information retrieval overhead.
What is an Item?
◻ The term "item" is used to represent the smallest complete textual unit that is processed and manipulated by the system.
◻ The definition of item varies by how a specific source treats information. A complete document, such as a book, newspaper or magazine, could be an item. At other times each chapter or article may be defined as an item.
◻ As sources vary and systems include more complex processing, an item may address even lower levels of abstraction such as a contiguous passage of text or a paragraph.
Objectives of an IR System
◻ The general objective of an Information Retrieval System is to minimize the overhead of a user locating needed information.
◻ Overhead can be expressed as the time a user spends in all of the steps leading to reading an item containing the needed information (e.g., query generation, query execution, scanning results of query to select items to read, reading non-relevant items).
Measures Associated with IR Systems

◻ The two major measures commonly associated with information systems are precision and recall.
◻ When a user decides to issue a search looking for information on a topic, the total database is logically divided into four segments: relevant items retrieved, relevant items not retrieved, non-relevant items retrieved, and non-relevant items not retrieved.
Measures Associated with IR Systems Cont.

◻ Relevant items are those documents that contain information that helps the searcher in answering his question.
◻ Non-relevant items are those items that do not provide any directly useful information.
◻ There are two possibilities with respect to each item: it can be retrieved or not retrieved by the user's query.
$$\text{Precision} = \frac{\text{Number\_Retrieved\_Relevant}}{\text{Number\_Total\_Retrieved}}$$

$$\text{Recall} = \frac{\text{Number\_Retrieved\_Relevant}}{\text{Number\_Possible\_Relevant}}$$

Measures Associated with IR Systems Cont.

Where:
◻ Number_Possible_Relevant is the number of relevant items in the database.
◻ Number_Total_Retrieved is the total number of items retrieved from the query.
◻ Number_Retrieved_Relevant is the number of items retrieved that are relevant to the user's search need.
Measures Associated with IR Systems Cont.

◻ Precision measures one aspect of information retrieval overhead for a user associated with a particular search.
◻ If a search has 85 percent precision, then 15 percent of the user effort is overhead reviewing non-relevant items.
◻ Recall gauges how well a system processing a particular query is able to retrieve the relevant items that the user is interested in seeing.
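
The two measures can be computed directly from the sets involved. The following is a minimal sketch, assuming items are identified by integer ids; the function name `precision_recall` and the sample numbers are illustrative and not part of the original slides.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    retrieved: set of item ids returned by the query
    relevant:  set of item ids in the database that are relevant
    """
    retrieved_relevant = len(retrieved & relevant)  # Number_Retrieved_Relevant
    precision = retrieved_relevant / len(retrieved) if retrieved else 0.0
    recall = retrieved_relevant / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical query: 20 items retrieved, 17 of them relevant,
# out of 34 relevant items in the whole database.
retrieved = set(range(20))
relevant = set(range(17)) | set(range(100, 117))
p, r = precision_recall(retrieved, relevant)
print(f"precision={p:.2f} recall={r:.2f}")  # precision=0.85 recall=0.50
```

With 85 percent precision, the remaining 15 percent of the retrieved items are the overhead the user spends reviewing non-relevant material.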


Ideal Precision and Recall
◻ Figure 1.2a shows the values of precision and recall as the number of items retrieved increases, under an optimum query where every returned item is relevant. There are "N" relevant items in the database.
◻ In Figure 1.2a the basic properties of precision (solid line) and recall (dashed line) can be observed.
◻ Precision starts off at 100 percent and maintains that value as long as relevant items are retrieved.
◻ Recall starts off close to zero and increases as long as relevant items are retrieved until all possible relevant items have been retrieved.
◻ Once all "N" relevant items have been retrieved, the only items being retrieved are non-relevant. Precision is directly affected by retrieval of non-relevant items and drops to a number close to zero. Recall is not affected by retrieval of non-relevant items and thus remains at 100 percent.
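
The behaviour described for the ideal case can be reproduced numerically. This is a small sketch, assuming an optimum query in which the first N items returned are all relevant and everything after that is non-relevant; `ideal_curves` is a hypothetical helper name.

```python
def ideal_curves(n_relevant, n_retrieved_max):
    """Precision and recall after each retrieved item for an optimum query.

    Assumes the first n_relevant items returned are all relevant and
    every later item is non-relevant.
    """
    points = []
    for k in range(1, n_retrieved_max + 1):
        hits = min(k, n_relevant)  # relevant items retrieved so far
        points.append((k, hits / k, hits / n_relevant))
    return points


for k, precision, recall in ideal_curves(n_relevant=4, n_retrieved_max=8):
    print(f"retrieved={k} precision={precision:.2f} recall={recall:.2f}")
```

Precision stays at 1.0 until the fourth item and then falls, while recall climbs to 1.0 and stays there.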
Objectives of an IR System Cont.

◻ The first objective of an Information Retrieval System is support of user search generation.
◻ Natural languages suffer from word ambiguities such as homographs and use of acronyms that allow the same word to have multiple meanings (e.g., the word "field").
◻ Disambiguation techniques exist but introduce significant system overhead in processing power and extended search times and often require interaction with the user.
Objectives of an IR System Cont.

◻ Many users have trouble in generating a good search statement. The typical user does not have significant experience with, nor even the aptitude for, Boolean logic statements.
◻ Quite often the user is not an expert in the area that is being searched and lacks the domain-specific vocabulary unique to that particular subject area (the search begins with a general concept and a limited knowledge of the vocabulary associated with a particular area).

Objectives of an IR System Cont.

◻ Even when the user is an expert in the area being searched, the ability to select the proper search terms is constrained by lack of knowledge of the author's vocabulary.
◻ Thus, an Information Retrieval System must provide tools to help overcome the search specification problems discussed above.
Vocabulary Domains
Objectives of an IR System Cont.

◻ An objective of an information system is to present the search results in a format that facilitates the user in determining relevant items.
◻ Historically data has been presented in an order dictated by how it was physically stored. Typically, this is in order of arrival to the system, thereby always displaying the results of a search sorted by time. For those users interested in current events this is useful.
Objectives of an IR System Cont.

◻ Newer Information Retrieval Systems provide functions that return the results of a query in order of potential relevance to the user.
◻ Even more sophisticated techniques use item clustering and link analysis to provide additional item selection insights.
IR Systems Functional Overview
◻ A total Information Storage and
Retrieval System is composed of four
major functional processes:
1. Item Normalization,
2. Selective Dissemination of Information
(i.e., "Mail"),
3. Archival Document Database Search, and
4. An Index Database Search.
◻ Commercial systems have not integrated these capabilities into a single system but supply them as independent capabilities.


1. Item Normalization
◻ Normalize the incoming items to a standard format.
◻ Standardizing the input takes the different external formats of input data and performs the translation to the formats acceptable to the system.
◻ A system may have a single format for all items or allow multiple formats.
1. Item Normalization Cont.
◻ The next process is to parse the item into logical subdivisions that have meaning to the user. This process, called "Zoning," is visible to the user and used to increase the precision of a search and optimize the display.
◻ An item is subdivided into zones, which may be hierarchical (Title, Author, Abstract, Main Text, Conclusion, and References).
◻ The zoning information is passed to the processing token identification operation to store the information, allowing searches to be restricted to a specific zone.

1. Item Normalization Cont.
◻ Once the standardization and zoning have been completed, the information (i.e., words) used in the search process needs to be identified in the item.
◻ The first step in identification of a processing token consists of determining a word. Systems determine words by dividing input symbols into three classes: valid word symbols, inter-word symbols, and special processing symbols.
1. Item Normalization Cont.
◻ A word is defined as a contiguous set of word symbols bounded by inter-word symbols.
◻ Examples of word symbols are alphabetic characters and numbers.
◻ Examples of possible inter-word symbols are blanks, periods and semicolons.
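
The word definition above can be sketched as a single pass over the input. This is a simplified illustration, assuming that alphanumeric characters are word symbols and that every other character is an inter-word symbol; special processing symbols are not modelled, and `split_words` is a hypothetical name.

```python
def split_words(text):
    """Split text into words: contiguous runs of word symbols
    bounded by inter-word symbols."""
    words, current = [], []
    for ch in text:
        if ch.isalnum():      # word symbol: letters and digits
            current.append(ch)
        elif current:         # inter-word symbol closes the current word
            words.append("".join(current))
            current = []
    if current:               # flush a word that runs to the end of the text
        words.append("".join(current))
    return words


print(split_words("Stop; look. Listen now"))  # ['Stop', 'look', 'Listen', 'now']
```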
1. Item Normalization Cont.
◻ Next, a Stop List/Algorithm is applied to the list of potential processing tokens.
◻ The objective of the Stop function is to save system resources by eliminating from the set of searchable processing tokens those that have little value to the system.
◻ Stop Lists are commonly found in most systems and consist of words (processing tokens) whose frequency and/or semantic use make them of no value as a searchable token.
◻ Such words (e.g., "the") have no search value and are not a useful part of a user's query.
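
A Stop List can be applied as a simple filter over the candidate tokens. The sketch below assumes a tiny hand-picked list; the contents of `STOP_LIST` are illustrative only, since real systems derive them from frequency and semantic use.

```python
# Illustrative stop list; not taken from any particular system.
STOP_LIST = {"the", "a", "an", "of", "and", "to", "in"}


def remove_stop_words(tokens, stop_list=STOP_LIST):
    """Drop tokens that appear in the stop list (case-insensitive)."""
    return [t for t in tokens if t.lower() not in stop_list]


print(remove_stop_words(["The", "history", "of", "indexing"]))
# ['history', 'indexing']
```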

1. Item Normalization Cont.

◻ The next step in finalizing processing tokens is identification of any specific word characteristics.
◻ The characteristic is used in systems to assist in disambiguation of a particular word.
◻ Morphological analysis of the processing token's part of speech is included here.
1. Item Normalization Cont.
◻ Once the potential processing token has been identified and characterized, most systems apply stemming algorithms to normalize the token to a standard semantic representation.
◻ The decision to perform stemming is a trade-off between precision of a search (i.e., finding exactly what the query specifies) and standardization to reduce system overhead in expanding a search term to similar token representations, with a potential increase in recall.
◻ The amount of stemming that is applied can lead to retrieval of many non-relevant items.
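
The trade-off can be seen with even a crude stemmer. The following is a naive suffix-stripping sketch written for illustration only; it is not one of the stemming algorithms covered in Unit II, and the suffix list and `min_stem` parameter are assumptions.

```python
# Suffixes are tried in order, so longer ones come first.
SUFFIXES = ("ing", "ers", "er", "ed", "es", "s")


def naive_stem(token, min_stem=3):
    """Strip the first matching suffix, keeping at least min_stem characters."""
    token = token.lower()
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= min_stem:
            return token[: -len(suffix)]
    return token


for word in ["computers", "computing", "posters", "posting"]:
    print(word, "->", naive_stem(word))
```

Conflating "computers" and "computing" to one stem helps recall, but "posters" and "posting" also collapse to the same stem, so a query about posters would retrieve items about posting, which is the loss of precision the slide warns about.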
2. Selective Dissemination of Information
◻ The Selective Dissemination of Information (Mail) Process provides the capability to dynamically compare newly received items in the information system against standing statements of interest of users and deliver the item to those users whose statement of interest matches the contents of the item.
◻ The Mail process is composed of the search process, user statements of interest (Profiles) and user mail files.
◻ When the search statement is satisfied, the item is placed in the Mail File(s) associated with the profile.
2. Selective Dissemination of Information Cont.

◻ As each item is received, it is processed against every user's profile. A profile contains a typically broad search statement along with a list of user mail files that will receive the document if the search statement in the profile is satisfied.
◻ User search profiles are different than ad hoc queries in that they contain significantly more search terms (10 to 100 times more terms) and cover a wider range of interests.
◻ These profiles define all the areas in which a user is interested, versus an ad hoc query, which is frequently focused to answer a specific question.
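
The Mail process can be sketched as matching each incoming item against every stored profile. This is a minimal illustration, assuming a profile's search statement is just a set of terms joined by OR; the dictionary layout and the names `matches` and `disseminate` are hypothetical.

```python
def matches(profile_terms, item_text):
    """A profile is satisfied if the item contains any of its terms."""
    words = set(item_text.lower().split())
    return bool(words & profile_terms)


def disseminate(item_id, item_text, profiles, mail_files):
    """Place the item in the mail file(s) of every satisfied profile."""
    for profile in profiles:
        if matches(profile["terms"], item_text):
            for name in profile["mail_files"]:
                mail_files.setdefault(name, []).append(item_id)


profiles = [
    {"terms": {"retrieval", "indexing"}, "mail_files": ["alice_ir"]},
    {"terms": {"warehouse"}, "mail_files": ["bob_dw"]},
]
mail_files = {}
disseminate(1, "New indexing methods", profiles, mail_files)
print(mail_files)  # {'alice_ir': [1]}
```

Unlike an ad hoc query, the profiles stay fixed and the items stream past them as they arrive.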
3. Document Database Search

◻ The Document Database Search process is composed of the search process, user entered queries (typically ad hoc queries) and the document database which contains all items that have been received, processed and stored by the system.
◻ Any search for information that has already been processed into the system can be considered a "retrospective" search for information.
◻ Queries differ from profiles in that they are typically short and focused on a specific area of interest.
4. Index Database Search
◻ When an item is determined to be of interest, a user may want to save it for future reference. This is in effect filing it.
◻ In an information system this is accomplished via the index process. In this process the user can logically store an item in a file along with additional index terms and descriptive text the user wants to associate with the item.
4. Index Database Search Cont.
◻ The Index Database Search Process provides the capability to create indexes and search them.
◻ The user may search the index and retrieve the index and/or the document it references.
◻ The system also provides the capability to search the index and then search the items referenced by the index records that satisfied the index portion of the query. This is called a combined file search.
4. Index Database Search Cont.
◻ There are two classes of index files: Public and Private Index files.
◻ Every user can have one or more Private Index files, leading to a very large number of files. Each Private Index file references only a small subset of the total number of items in the Document Database.
◻ Public Index files are maintained by professional library services personnel and typically index every item in the Document Database.
Relationship to Database Management Systems

◻ 1. An Information Retrieval System is software that has the features and functions required to manipulate "information" items, whereas a DBMS is optimized to handle "structured" data. Information is fuzzy text.
◻ 2. Structured data is well-defined data (facts) typically represented by tables. There is a semantic description associated with each attribute within a table that precisely defines that attribute. On the other hand, if two different people generate an abstract for the same item, the abstracts can be different.
Relationship to Database Management Systems Cont.
◻ 3. With structured data a user enters a specific request and the results returned provide the user with the desired information. The results are frequently tabulated and presented in a report format for ease of use. In contrast, a search of "information" items has a high probability of not finding all the items a user is looking for. The user has to refine his search to locate additional items of interest. This process is called "iterative search."
◻ From a practical standpoint, the integration of DBMSs and Information Retrieval Systems is very important.
Chapter 2:
Information Retrieval System Capabilities
OBJECTIVES

◻ Discussing the major functions that are available in an Information Retrieval System.
◻ Search and browse capabilities are crucial to assist the user in locating relevant items.
Search Capabilities
◻ The objective of the search capability is to allow for a mapping between a user's specified need and the items in the information database that will answer that need.
◻ "Weighting" of search terms holds significant potential for assisting in the location and ranking of relevant items.
◻ E.g., Find articles that discuss data mining(.9) or data warehouses(.3).
◻ The system would recognize in its importance ranking and item selection process that items discussing data mining are far more important than items discussing data warehouses.
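
The weighted query above can be turned into a simple ranking score. This is a sketch under the assumption that an item's score is the sum of the weights of the query terms it contains; the slides do not specify a scoring formula, and the sample items are invented.

```python
def weighted_score(item_text, weighted_terms):
    """Sum the weights of all query terms that occur in the item."""
    text = item_text.lower()
    return sum(weight for term, weight in weighted_terms.items() if term in text)


query = {"data mining": 0.9, "data warehouses": 0.3}
items = {
    "A": "A survey of data warehouses",
    "B": "Data mining for fraud detection",
    "C": "Data mining on data warehouses",
}
ranked = sorted(items, key=lambda k: weighted_score(items[k], query), reverse=True)
print(ranked)  # ['C', 'B', 'A']
```

Item B, which only mentions data mining, outranks item A, which only mentions data warehouses, as the weights intend.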
1. Boolean Logic
◻ Boolean logic allows a user to logically relate multiple concepts together to define what information is needed. The typical Boolean operators are AND, OR, and NOT.
◻ Parentheses placed around portions of the search statement overtly specify the order of Boolean operations (i.e., nesting function). If parentheses are not used, the system follows a default precedence ordering of operations (e.g.,
Use of Boolean Operators
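
Boolean operators map naturally onto set operations over the items that contain each term. The sketch below assumes a small hypothetical inverted index (`INDEX`) and shows how nesting with parentheses changes the result.

```python
# Hypothetical inverted index: term -> set of item ids containing it.
INDEX = {"computer": {1, 2, 4}, "design": {2, 3, 4}, "sale": {4, 5}}
ALL_ITEMS = {1, 2, 3, 4, 5}


def items_with(term):
    """Items containing the term (empty set if the term is unknown)."""
    return INDEX.get(term, set())


def not_(items):
    """NOT is the complement with respect to the whole database."""
    return ALL_ITEMS - items


# computer OR (design AND sale)
a = items_with("computer") | (items_with("design") & items_with("sale"))
# (computer OR design) AND sale
b = (items_with("computer") | items_with("design")) & items_with("sale")
# computer AND design AND NOT sale
c = items_with("computer") & items_with("design") & not_(items_with("sale"))
print(a, b, c)  # {1, 2, 4} {4} {2}
```

The first two queries use the same terms and operators but return different items, which is why explicit nesting matters.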
2. Proximity
◻ Proximity is used to restrict the distance allowed within an item between two search terms.
◻ The semantic concept is that the closer two terms are found in a text, the more likely they are related in the description of a particular concept.
◻ Proximity is used to increase the precision of a search.
◻ If the terms COMPUTER and DESIGN are found within a few words of each other, then the item is more likely to be discussing the design of computers than if the words are paragraphs apart.
2. Proximity Cont.
◻ TERM1 within "m units" of TERM2
◻ The distance operator "m" is an integer number and units are in Characters, Words, Sentences, or Paragraphs.
◻ A special case of the Proximity operator is the Adjacent (ADJ) operator that normally has a distance operator of one and a forward only direction.
◻ Another special case is where the distance is set to zero, meaning within the same semantic unit.
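
A proximity test can be sketched by comparing word positions. This example assumes the unit is Words and splits on whitespace only; the function name `within` and its `ordered` flag (used to model the forward-only ADJ operator) are illustrative.

```python
def within(text, term1, term2, m, ordered=False):
    """True if term2 occurs within m words of term1.

    ordered=True requires term2 to follow term1 (forward only),
    so within(..., m=1, ordered=True) behaves like ADJ.
    """
    words = text.lower().split()
    positions1 = [i for i, w in enumerate(words) if w == term1]
    positions2 = [i for i, w in enumerate(words) if w == term2]
    for i in positions1:
        for j in positions2:
            distance = j - i
            if ordered and 0 < distance <= m:
                return True
            if not ordered and 0 < abs(distance) <= m:
                return True
    return False


text = "the design of a computer system"
print(within(text, "computer", "design", 4))  # True
print(within(text, "computer", "design", 2))  # False
print(within("computer design handbook", "computer", "design", 1, ordered=True))  # True
```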
3. Contiguous Word Phrases
◻ A Contiguous Word Phrase (CWP) is both a way of specifying a query term and a special search operator. A Contiguous Word Phrase is two or more words that are treated as a single semantic unit.
◻ An example of a CWP is "United States of America." It is four words that specify a search term representing a single specific semantic concept (a country) that can be used with any of the operators discussed above.
◻ Thus a query could specify "manufacturing" AND "United States of America", which returns any item that contains the word "manufacturing" and the contiguous words "United States of America".
◻ A contiguous word phrase also acts like a special search operator that is similar to the proximity (Adjacency) operator but allows for additional specificity.
4. Fuzzy Searches
◻ Fuzzy Searches provide the capability to locate spellings of words that are similar to the entered search term. This function is primarily used to compensate for errors in spelling of words.
◻ Fuzzy searching increases recall at the expense of decreasing precision.
◻ A Fuzzy Search on the term "computer" would automatically include the following words from the information database: "computer," "compiter," "conputer," "computter," "compute."
◻ An additional enhancement may look up the proposed alternative spelling and, if it is a valid word with a different meaning, include it in the search with a low ranking or not include it at all (e.g., "commuter").
◻ In the process of expanding a query term, fuzzy searching includes other terms that have similar spellings, giving more weight (in systems that rank output) to words in the database that have similar word lengths and position of the characters as the entered term.
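
One common way to measure spelling similarity is edit distance. The slides do not name a specific technique, so the following is a sketch that swaps in Levenshtein distance as an assumption; `fuzzy_expand` and the `max_distance` threshold are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum single-character edits from a to b."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # delete from a
                               current[j - 1] + 1,       # insert into a
                               previous[j - 1] + cost))  # substitute
        previous = current
    return previous[-1]


def fuzzy_expand(term, vocabulary, max_distance=1):
    """Vocabulary words whose spelling is within max_distance of the term."""
    return [w for w in vocabulary if edit_distance(term, w) <= max_distance]


vocabulary = ["computer", "compiter", "conputer", "computter",
              "compute", "commuter", "database"]
print(fuzzy_expand("computer", vocabulary))
```

Note that "commuter" is also one edit away from "computer", which is exactly the kind of valid but unrelated word the enhancement described above would rank low or exclude.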


5. Term Masking
◻ Term masking is the ability to expand a query term by masking a portion of the term and accepting as valid any processing token that maps to the unmasked portion of the term. The value of term masking is much higher in systems that do not perform stemming or only provide a very simple stemming algorithm.
◻ There are two types of search term masking: fixed length and variable length.
◻ Fixed length masking is a single position mask. It masks out any symbol in a particular position or the lack of that position in a word.
◻ Variable length "don't cares" allows masking of any number of characters within a processing token.
5. Term Masking (Variable Length)
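
Both kinds of mask can be sketched by translating the masked term into a regular expression. The "*" symbol for variable length masking appears elsewhere in these slides; using "?" for the fixed length mask is an assumption made for this example, as are the function names.

```python
import re


def mask_to_regex(mask):
    """Translate a masked term into a compiled regular expression."""
    parts = []
    for ch in mask:
        if ch == "*":
            parts.append(".*")   # variable length: any number of characters
        elif ch == "?":
            parts.append(".?")   # fixed length: one symbol, or none, at this position
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts))


def expand_mask(mask, vocabulary):
    """Processing tokens in the vocabulary that match the masked term."""
    pattern = mask_to_regex(mask)
    return [w for w in vocabulary if pattern.fullmatch(w)]


vocabulary = ["compulsion", "compulsive", "compulsory", "computer",
              "multinational", "multi-national", "multiply"]
print(expand_mask("compul*", vocabulary))
print(expand_mask("multi?national", vocabulary))
```

The fixed length mask matches both "multi-national" and "multinational", reflecting the "lack of that position" case described above.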
6. Numeric and Date Ranges
◻ Term masking is useful when applied to words, but does not work for finding ranges of numbers or numeric dates.
◻ To find numbers larger than "125", using a term "125*" will not find any number except those that begin with the digits "125."
◻ A user could enter inclusive ranges (e.g., "125-425" or "4/2/93-5/2/95" for numbers and dates) or infinite ranges (">125", "<=233", representing "Greater Than" or "Less Than or Equal") as part of a query.
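
Range terms need numeric comparison rather than string matching. The sketch below parses the range notations shown above into a predicate; it handles non-negative integers only (dates are left out), and `range_predicate` is a hypothetical helper.

```python
import operator
import re

COMPARATORS = {">": operator.gt, ">=": operator.ge,
               "<": operator.lt, "<=": operator.le}


def range_predicate(expr):
    """Turn '125-425', '>125' or '<=233' into a function of one number."""
    inclusive = re.fullmatch(r"(\d+)-(\d+)", expr)
    if inclusive:
        low, high = int(inclusive.group(1)), int(inclusive.group(2))
        return lambda value: low <= value <= high
    open_ended = re.fullmatch(r"(>=|<=|>|<)(\d+)", expr)
    if open_ended:
        compare = COMPARATORS[open_ended.group(1)]
        bound = int(open_ended.group(2))
        return lambda value: compare(value, bound)
    raise ValueError(f"unsupported range expression: {expr!r}")


numbers = [100, 125, 126, 233, 300, 1250]
print([n for n in numbers if range_predicate("125-425")(n)])  # [125, 126, 233, 300]
print([n for n in numbers if range_predicate(">125")(n)])     # [126, 233, 300, 1250]
print([n for n in numbers if str(n).startswith("125")])       # [125, 1250]
```

The last line mimics the mask "125*": it matches only numbers whose digits begin with 125, which is why masking cannot stand in for a range.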
7. Concept/Thesaurus Expansion
◻ Associated with both Boolean and Natural Language Queries is the ability to expand the search terms via a Thesaurus or Concept Class database reference tool.
◻ A Thesaurus is typically a one-level or two-level expansion of a term to other terms that are similar in meaning.
◻ A Concept Class is a tree structure that expands each meaning of a word into potential concepts that are related to the initial term.
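
One-level and two-level thesaurus expansion can be sketched as a bounded walk over a synonym table. The `THESAURUS` entries below are invented for illustration, and `expand` is a hypothetical function name.

```python
# Hypothetical thesaurus: term -> terms similar in meaning.
THESAURUS = {
    "car": ["automobile", "vehicle"],
    "automobile": ["car", "sedan"],
    "computer": ["machine", "processor"],
}


def expand(term, thesaurus, levels=1):
    """Return the term plus everything reachable within `levels` expansions."""
    expanded = {term}
    frontier = {term}
    for _ in range(levels):
        # Terms reached in this level that have not been seen before.
        frontier = {s for t in frontier for s in thesaurus.get(t, [])} - expanded
        expanded |= frontier
    return expanded


print(sorted(expand("car", THESAURUS, levels=1)))  # ['automobile', 'car', 'vehicle']
print(sorted(expand("car", THESAURUS, levels=2)))  # ['automobile', 'car', 'sedan', 'vehicle']
```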
8. Natural Language Queries
◻ Natural Language Queries allow a user to enter a prose statement that describes the information that the user wants to find.
◻ The longer the prose, the more accurate the results returned. The most difficult logic case associated with Natural Language Queries is the ability to specify negation in the search statement and have the system recognize it as negation.
8. Natural Language Queries Cont.

◻ An example of a Natural Language Query is:
◻ Find for me all the items that discuss databases and current attempts in database applications. Include all items that discuss Microsoft trials in the development process. Do not include items about relational databases.
8. Natural Language Queries Cont.
◻ This usage pattern is important because sentence fragments make morphological analysis of the natural language query difficult and may limit the system's ability to perform term disambiguation (e.g., understand which meaning of a word is meant).
◻ Natural language interfaces improve the recall of systems with a decrease in precision when negation is required.
Browse Capabilities
◻ Browse capabilities provide the user with the capability to determine which items are of interest and select those to be displayed.
◻ There are two ways of displaying a summary of the items that are associated with a query: line item status and data visualization.
◻ If searches resulted in high precision, then the importance of the browse capabilities would be lessened.
◻ Since searches return many items that are not relevant to the user's information need, browse capabilities can assist the user in focusing on items that have the highest likelihood of meeting his need.
1. Ranking
◻ Hits are retrieved in either a sorted order (e.g., sort by Title) or in time order from the newest to the oldest item.
◻ With the introduction of ranking based upon predicted relevance values, the status summary displays the relevance score associated with the item along with a brief descriptor of the item (usually both fit on one display screen line).
◻ The relevance score is the search system's estimate of how closely the item satisfies the search statement. Typically relevance scores are normalized to a value between 0.0 and 1.0. The highest value of 1.0 is interpreted to mean that the system is sure that the item is relevant to the search statement.
1. Ranking Cont.
◻ In practice, systems have a default minimum value, which the user can modify, below which items are not returned.
◻ Presenting the actual relevance number seems to be more confusing to the user than presenting a category that the number falls in.
◻ For example, some systems create relevance categories and indicate, by displaying items in different colors, which category an item belongs to. Other systems use a nomenclature such as High, Medium High, Medium, Low, and Non-relevant. The color technique removes the need for written indication of an
1. Ranking Cont.
◻ Rather than limiting the number of items that can be assessed by the number of lines on a screen, other graphical visualization techniques showing the relevance relationships of the hit items can be used.
◻ For example, a two or three dimensional graph can be displayed where points on the graph represent items and the location of the points represents their relative relationship between each other and the user's query.
◻ This technique allows a user to see the clustering of items by topics and browse through a cluster or move to another topical cluster.
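
The minimum-value cutoff and the relevance categories mentioned in the ranking slides can be sketched together. The category names come from the slides; the numeric boundaries, the default minimum of 0.2 and the function names are assumptions chosen for illustration.

```python
def categorize(score):
    """Map a normalized relevance score (0.0 to 1.0) to a category name.

    The boundaries are illustrative; the slides do not specify them.
    """
    if score >= 0.8:
        return "High"
    if score >= 0.6:
        return "Medium High"
    if score >= 0.4:
        return "Medium"
    if score >= 0.2:
        return "Low"
    return "Non-relevant"


def ranked_hits(scores, minimum=0.2):
    """Drop items below the minimum and return the rest, best first."""
    hits = [(item, s) for item, s in scores.items() if s >= minimum]
    hits.sort(key=lambda pair: pair[1], reverse=True)
    return [(item, s, categorize(s)) for item, s in hits]


scores = {"a": 0.95, "b": 0.1, "c": 0.65, "d": 0.3}
for item, score, category in ranked_hits(scores):
    print(item, score, category)
```

Item "b" falls below the minimum and is never shown; the others are listed from highest to lowest score with their category.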
2. Zoning
◻ The user wants to see the minimum information needed to determine if the item is relevant.
◻ Limited display screen sizes require selectability of what portions of an item a user needs to see to make the relevance determination.
◻ For example, display of the Title and Abstract may be sufficient information for a user to predict the potential relevance of an item. Limiting the display of each item to these two zones allows multiple items to be displayed on a single display screen.
◻ This makes maximum use of the speed of the user's cognitive process in scanning the single image and understanding the potential relevance of the multiple items on the screen.
3. Highlighting
◻ Highlighting lets the user quickly focus on the potentially relevant parts of the text to scan for item relevance.
◻ Most systems allow the display of an item to begin with the first highlight within the item and allow subsequent jumping to the next highlight.
◻ Another capability, which is gaining strong acceptance, is for the system to determine the passage in the document most relevant to the query and position the browse to start at that passage.
◻ With Natural Language Processing and automatic expansion of terms via thesauri, highlighting loses some of its value.
◻ The terms being highlighted that caused a particular item to be returned may not have a direct or obvious mapping to any of the search terms entered.
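
Basic highlighting can be sketched as marking each occurrence of a search term and recording where the first one falls, so the display can start there. The bracket markers and the function name `highlight` are illustrative, and the sketch assumes a non-empty list of terms.

```python
import re


def highlight(text, terms, start="[", end="]"):
    """Wrap whole-word occurrences of the terms in markers.

    Returns the marked text and the character offset of the first
    highlight in the original text (None if nothing matched).
    Assumes `terms` is non-empty.
    """
    alternatives = "|".join(re.escape(t) for t in terms)
    pattern = re.compile(r"\b(" + alternatives + r")\b", re.IGNORECASE)
    marked = pattern.sub(lambda m: f"{start}{m.group(0)}{end}", text)
    first = pattern.search(text)
    return marked, (first.start() if first else None)


marked, offset = highlight("The design of a computer is hard.", ["computer", "design"])
print(marked)  # The [design] of a [computer] is hard.
print(offset)  # 4
```

This only marks literal query terms, which is why highlighting becomes less helpful once thesaurus expansion or natural language processing brings in items through terms the user never typed.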
Miscellaneous Capabilities

◻ There are many additional functions that facilitate the user's ability to input queries, reducing the time it takes to generate the queries and reducing a priori the probability of entering a poor query.


1. Vocabulary Browse
◻ Vocabulary browse is the capability to display, in alphabetically sorted order, words from the document database.
◻ The user can enter a word or word fragment and the system will begin to display the dictionary around the entered text.
◻ It helps the user determine the impact of using a fixed or variable length mask on a search term and potential misspellings.
◻ The user can determine that entering the search term "compul*" in effect is searching for "compulsion" or "compulsive" or "compulsory."
2. Iterative Search and Search History Log

◻ The process of refining the results of a previous search to focus on relevant items is called iterative search.
◻ To facilitate locating previous searches as starting points for new searches, search history logs are available.
◻ The search history log is the capability to display all the previous searches that were executed during the current session.
3. Canned Query
◻ The capability to name a query and store it to be retrieved and executed during a later user session is called canned or stored queries.
◻ A canned query lets the user define the general area of interest once, then retrieve it later and add search criteria for the data that is currently needed.
◻ Queries that start with a canned query are significantly larger than ad hoc queries.
