IRS Unit 1 Notes

Uploaded by

sarani sattineni

Information Retrieval Systems

UNIT-I

Introduction to Information Retrieval Systems

1.1 Definition of Information Retrieval System
1.2 Objectives of Information Retrieval Systems
1.3 Functional Overview
1.4 Relationship to Database Management Systems
1.5 Digital Libraries and Data Warehouses
1.6 Information Retrieval System Capabilities
    1.6.1 Search or Querying Capabilities
    1.6.2 Browse Capabilities
    1.6.3 Miscellaneous Capabilities

INTRODUCTION

 IRS stands for Information Retrieval System.
 An IR system is a system capable of storage, retrieval, and maintenance of information.
 An IR system helps users find the information they need.

Types of Searches in IRS:

 Web search
 Desktop search
 Library search

 Information: text, images, audio, video, and other multimedia objects. These notes focus on textual information.

 Item (Data): The smallest complete textual unit processed and manipulated by an IR system.
What counts as an item depends on how a specific source treats information.

• Success measure (objective of an IR system): minimize the overhead of finding information.

The differences between Database Management Systems (DBMS) and Information Retrieval Systems
are easily misunderstood. It is easy to confuse the software that optimizes functional support of
each type of system with the actual information or structured data that is being stored and
manipulated. The differences matter because a database management system cannot provide the
functions needed to process “information.” Conversely, an information system containing structured
data also suffers major functional deficiencies.

1.1 Definition of Information Retrieval System

 An Information Retrieval System is a system that is capable of storage, retrieval, and
maintenance of information.
 Information in this context can be composed of text (including numeric and date data),
images, audio, video and other multimedia objects.
 Of all these data types, text is the only one that supports full functional
processing.
 The term “item” is used to represent the smallest complete unit that is processed and
manipulated by the system.
 The definition of an item varies with how a specific source treats information. A complete
document, such as a book, newspaper or magazine, could be an item.
 For example, a video news program could be considered an item. It is composed of text in the
form of closed captioning, audio text provided by the speakers, and the video images being
displayed.
 The efficiency of an information system lies in its ability to minimize the overhead for a
user to find the needed information.
 Overhead from a user’s perspective is the time required to find the needed information,
excluding the time for actually reading the relevant data. Thus search composition, search
execution, and reading non-relevant items are all aspects of information retrieval overhead.

1.2 Objectives of Information Retrieval System

The general objective of an Information Retrieval System is to minimize the overhead of a user
locating needed information.
Overhead can be expressed as the time a user spends in all of the steps leading to reading an item
containing the needed information, excluding the time for actually reading the relevant data.
The steps are:
 Query generation
 Search composition
 Search execution
 Scanning results of query to select items to read

Query Generation:
The user decides which keywords describe the information need, e.g., “Sachin Tendulkar” or
“best restaurants”, for which the system is expected to return relevant information.
Search Composition:
The user composes the keywords into a search statement that the system can process.
Search Execution:
The system executes the search to locate the relevant information.
Scanning Results of Query for Reading Item:
The results are displayed on the user’s screen, and the user scans them to select the items
to read.

 An Information Retrieval System consists of a software program that facilitates a user in finding
the information the user needs.
 The system may use standard computer hardware or specialized hardware to support the search
subfunction and to convert non-textual sources to a searchable medium.

The success of an IRS is measured by how well it can minimize the overhead for a user to find the
needed information.

 Overhead from a user’s perspective is the time required to find the information.
 Thus, search composition, search execution, and reading non-relevant items are all aspects of IR
overhead.

Relevant Item: In an IRS the term “relevant” item is used to represent an item containing the needed
information.

Ex: JPG, BMP

From a user’s perspective, “relevant” and “needed” are synonymous.

The success of an information system is very subjective, based upon what information is needed
and the willingness of a user to accept overhead.

 The two major measures commonly associated with information systems are

1) Precision
2) Recall

 An IR system should also provide:
 Support of user search generation
 Presentation of the search results in a format that helps the user determine
relevant items
Precision
 The ability to retrieve top-ranked documents that are mostly relevant.
Ex: Items that exactly match the key terms are ranked at the top.

Precision = Number_Retrieved_Relevant / Number_Total_Retrieved

Recall
• The ability of the search to find all of the relevant items.

Ex: A search for “computer” also returns items on hard disks, keyboards, and monitors.

Recall = Number_Retrieved_Relevant / Number_Possible_Relevant

Where

Number_Retrieved_Relevant is the number of items retrieved that are relevant to the user’s
search need.
Number_Total_Retrieved is the total number of items retrieved from the query.
Number_Possible_Relevant is the number of relevant items in the database.

 Precision measures one aspect of information retrieval overhead for a user associated with a
particular search.
 If a search has 85% precision, then 15% of the user effort is overhead spent reviewing
non-relevant items.
 Recall measures how well a system processing a particular query is able to retrieve the relevant items
that the user is interested in seeing.
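
The two measures can be computed directly from sets of item identifiers. The sketch below is only an illustration: the function name, the item numbers and the counts are made up for the example, not taken from any real system.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    retrieved: ids of the items returned by the search (Number_Total_Retrieved)
    relevant:  ids of all relevant items in the database (Number_Possible_Relevant)
    """
    retrieved, relevant = set(retrieved), set(relevant)
    retrieved_relevant = retrieved & relevant  # Number_Retrieved_Relevant
    precision = len(retrieved_relevant) / len(retrieved) if retrieved else 0.0
    recall = len(retrieved_relevant) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical query: 20 items retrieved, 17 of them relevant,
# while the database holds 34 relevant items in total.
retrieved = range(1, 21)
relevant = list(range(1, 18)) + list(range(101, 118))
precision, recall = precision_recall(retrieved, relevant)
print(precision, recall)  # 0.85 0.5
```

With these numbers the user spends 15% of the review effort on non-relevant items, and half of the relevant items in the database were never retrieved.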

1.3 Functional Overview:

A total Information Storage and Retrieval System is composed of four major functional
processes:

 Item Normalization
 Selective Dissemination of Information (i.e., “Mail”)
 Archival Document Database Search
 Index Database Search, along with the Automatic File Build process that supports Index Files
1. Item Normalization:

The first step in any integrated system is to normalize the incoming items to a standard format.
 Item normalization is the process of standardizing and transforming various aspects of the
items (documents, web pages, etc.).
 Item normalization ensures that the information retrieval system can efficiently and accurately
retrieve relevant documents in response to user queries.

Text / Item Normalization Process

Standardize Input

 Standardizing the input takes the different external formats of input data and translates them
into the formats acceptable to the system. A system may have a single format for all items or allow
multiple formats.
 Translate foreign languages into Unicode.
 This allows a single browser to display the languages and potentially a single search system to
search them.
 Translate multimedia input into a standard format:
 Video: MPEG-2, MPEG-1, AVI, Real Video…
 Audio: WAV, Real Audio
 Image: GIF, JPEG

Logical Subsetting (Zoning)

 Parse the item into logical subdivisions that have meaning to the user: Title, Author, Abstract,
Main Text, Conclusion, References, Country, Keyword…
 Zones are visible to the user and are used to increase the precision of a search and optimize the
display.
 The zoning information is passed to the processing token identification operation and stored,
allowing searches to be restricted to a specific zone.
 For display, show the minimum data required from each item to allow determination of the possible
relevance of that item (display zones such as Title, Abstract…).

Identify Processing Tokens

 Identify the information that is used in the search process: processing tokens (a more accurate
term than “words”).
 The first step is to determine a word by dividing the input symbols into three classes:
1. Valid word symbols - alphabetic characters, numbers
2. Inter-word symbols - blanks, periods, semicolons (not searchable)
3. Special processing symbols - hyphen (-)
 A word is defined as a contiguous set of word symbols bounded by inter-word symbols.
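
The definition of a word above can be sketched as a small character-class scanner. This is a minimal illustration, not a production tokenizer: the names are hypothetical, and treating the hyphen as part of a word (so that “clean-up” stays one token) is an assumption about how the special processing symbol is handled.

```python
import string

WORD_SYMBOLS = set(string.ascii_letters + string.digits)  # class 1
SPECIAL_SYMBOLS = {"-"}                                     # class 3 (assumed to join words)
# Every other character is treated as an inter-word symbol (class 2).


def split_words(text):
    """Return the contiguous runs of word symbols bounded by inter-word symbols."""
    words, current = [], ""
    for ch in text + " ":  # the trailing blank flushes the last word
        if ch in WORD_SYMBOLS or ch in SPECIAL_SYMBOLS:
            current += ch
        else:
            word = current.strip("-")  # a stray hyphen on its own is not a word
            if word:
                words.append(word)
            current = ""
    return words


tokens = split_words("Clean-up costs; see page 42.")
print(tokens)  # ['Clean-up', 'costs', 'see', 'page', '42']
```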

Stop List Algorithm

 Save system resources by eliminating from the set of searchable processing tokens those that have
little value to the search, i.e., tokens whose frequency and/or semantic use make them of no use as
searchable tokens:
 Any word found in almost every item
 Any word found only once or twice in the database
 Zipf’s law: Frequency * Rank = Constant
 Stop algorithm vs. stop list
 Examples of stop algorithms: stop all numbers greater than “999999” (this value was selected to
allow dates to be searchable); stop any processing token that has numbers and characters
intermixed.
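
The difference between a stop list and a stop algorithm can be shown in a few lines. The sketch below is only illustrative: the stop list contents and the function name are assumptions, while the two rules are the stop algorithms given as examples above.

```python
STOP_LIST = {"the", "a", "an", "of", "and", "to", "in"}  # illustrative stop list


def is_stopped(token):
    """Return True if the token should not become a searchable processing token."""
    t = token.lower()
    if t in STOP_LIST:                # stop list: explicit enumeration of words
        return True
    if t.isdecimal():                 # stop algorithm 1: numbers greater than 999999
        return int(t) > 999999
    has_digit = any(c.isdigit() for c in t)
    has_alpha = any(c.isalpha() for c in t)
    return has_digit and has_alpha    # stop algorithm 2: digits and letters intermixed


tokens = ["The", "budget", "of", "1997", "was", "12500000", "for", "unit", "A7X"]
searchable = [t for t in tokens if not is_stopped(t)]
print(searchable)  # ['budget', '1997', 'was', 'for', 'unit']
```

The year 1997 stays searchable because it is below the 999999 threshold, which is the reason given above for choosing that value.
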
Characterize Tokens

 Identify any specific word characteristics:
 Word sense disambiguation
 Part-of-speech tagging
 Uppercase: proper names, acronyms, and organizations
 Numbers and dates

Stemming Algorithm

 Normalize the token to a standard semantic representation:
Computer, Compute, Computers, Computing → “Comput”
 Reduce the number of unique words the system has to contain.
Ex: “computable”, “computation”, “computability”
→ Small database: saves 32 percent of storage
→ Larger database: 1.6 MB
 Improves the efficiency of the IR system and improves recall, at the cost of a decline in precision.
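
A toy suffix stripper is enough to show how the example words collapse to one stem. This is a deliberately simplified sketch, not the Porter algorithm or any other published stemmer; the suffix list and the minimum stem length are assumptions chosen for the examples in these notes.

```python
# Longer suffixes are listed first so that "computers" loses "ers", not just "s".
SUFFIXES = ("ability", "ation", "able", "ing", "ers", "er", "es", "e", "s")


def stem(word, min_stem=3):
    """Strip the first matching suffix, keeping at least min_stem characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word


words = ["Computer", "Compute", "Computers", "Computing",
         "computable", "computation", "computability"]
stems = {stem(w) for w in words}
print(stems)  # {'comput'}
```

Seven distinct words become a single entry in the searchable data structure, which is where the storage saving and the recall gain come from. The precision loss comes from the same effect: a query for one form also retrieves items that only contain the others.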

Create Searchable Data Structure

 Processing tokens -> Stemming Algorithm -> update to the searchable data structure

 Internal representation (not visible to the user): signature file, inverted list, PAT tree…

 Contains the semantic concepts that represent the items in the database, and so limits what a
user can find as a result of the search.
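
Of the internal representations listed, the inverted list is the simplest to sketch: each processing token maps to the set of items that contain it. The item texts and names below are hypothetical, and whitespace splitting stands in for the full normalization steps described earlier.

```python
from collections import defaultdict


def build_inverted_list(items):
    """Map each processing token to the set of item ids in which it occurs."""
    inverted = defaultdict(set)
    for item_id, text in items.items():
        for token in text.lower().split():
            inverted[token].add(item_id)
    return inverted


items = {
    1: "oil reserves found",
    2: "oil production costs",
    3: "computer production",
}
inverted = build_inverted_list(items)
print(sorted(inverted["oil"]))         # [1, 2]
print(sorted(inverted["production"]))  # [2, 3]
```

A token that never made it into this structure (for example, one removed by the stop list) cannot be found by any search, which is the limitation noted above.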

2. Selective Dissemination (Distribution, Spreading) of Information

The Selective Dissemination of Information (Mail) Process provides the capability to dynamically compare
newly received items in the information system against standing statements of interest of users and deliver the
item to those users whose statement of interest matches the contents of the item.
The Mail process is composed of the
 search process,
 user statements of interest (Profiles) and
 user mail files.
As each item is received, it is processed against every user’s profile. A profile contains a typically broad search
statement along with a list of user mail files that will receive the document if the search statement in the profile
is satisfied. When the search statement is satisfied, the item is placed in the mail file(s) associated with the
profile. User search profiles differ from ad hoc queries in that they contain significantly more search terms and
cover a wider range of interests. Selective Dissemination of Information has not yet been applied to multimedia
sources.
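
The Mail process can be pictured as running every stored profile against each newly received item. The sketch below is a simplification under stated assumptions: the profile layout, the user names and mail file names are hypothetical, and “any profile term occurs in the item” stands in for a real search statement.

```python
def disseminate(item_text, profiles):
    """Return the mail files that should receive a newly arrived item."""
    item_tokens = set(item_text.lower().split())
    deliveries = []
    for profile in profiles:
        # Assumed matching rule: the profile is satisfied if any term occurs.
        if item_tokens & profile["terms"]:
            deliveries.extend(profile["mail_files"])
    return deliveries


profiles = [
    {"user": "alice", "terms": {"oil", "energy", "pipeline"}, "mail_files": ["alice/energy"]},
    {"user": "bob", "terms": {"cricket", "football"}, "mail_files": ["bob/sports"]},
]
deliveries = disseminate("New pipeline approved for oil exports", profiles)
print(deliveries)  # ['alice/energy']
```

Note the direction of the comparison: the queries (profiles) are stored and the items stream past them, the reverse of an ad hoc search against a stored database.
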
3. Document Database Search

The Document Database Search Process provides the capability for a query to search against all
items received by the system. The Document Database Search process is composed of the
search process, user-entered queries (typically ad hoc queries) and the document database, which
contains all items that have been received, processed and stored by the system. Typically, items
in the Document Database do not change (i.e., are not edited) once received. The database may be
partitioned by time, which allows archiving by time partition.
 Queries differ from profiles in that they are typically short and focused on a specific area of interest.

4. Index Database Search:


 When an item is determined to be of interest, a user may want to save it (file it) for future reference.
This is accomplished via the index process.
 In the index process, the user can logically store an item in a file along with additional index terms and
descriptive text the user wants to associate with the item. An index can reference the original item, or contain
substantive information on the original item, similar to a card catalog in a library.
 The Index Database Search Process provides the capability to create indexes and search them.
 The user may search the index and retrieve the index and/or the document it references.
 The system also provides the capability to search the index and then search the items referenced by the index
records that satisfied the index portion of the query. This is called a combined file search.
 In an ideal system the index record could reference portions of items versus the total item.
 Two classes of index files:
 public and
 private index files
 Every user can have one or more private index files, leading to a very large number of files. Each private
index file references only a small subset of the total number of items in the Document Database.
 Public index files are maintained by professional library services personnel and typically index every item
in the Document Database.
 The capability to create private and public index files is frequently implemented via a structured Database
Management System (e.g., an RDBMS).
 To assist the users in generating indexes, the system provides a process called Automatic File Build
(Information Extraction).
 This process selects incoming documents and automatically determines potential indexing for the item, such as
authors, date of publication, source, and references.
 The rules that govern which documents are processed for extraction of index information, and the index term
extraction process itself, are stored in Automatic File Build Profiles.
 When an item is processed, it results in the creation of Candidate Index Records, which a user reviews and
edits prior to the actual update of an index file.

1.4 Relationship to Database Management Systems

There are two major categories of systems available to process items:
Information Retrieval Systems and Database Management Systems (DBMS).

1. An Information Retrieval System is software that has the features and functions required to manipulate
“information” items versus a DBMS that is optimized to handle “structured” data.

2. Structured data is well-defined data (facts) typically represented by tables. There is a semantic description
associated with each attribute within a table that well defines that attribute. For example, there is no confusion
between the meaning of “employee name” or “employee salary” and what values to enter in a specific database
record. On the other hand, if two different people generate an abstract for the same item, the abstracts can differ.
One abstract may generally discuss the most important topic in an item. Another abstract, using a different
vocabulary, may specify the details of many topics. It is this diversity and ambiguity of language that
distinguishes “information” items from structured data.

Downloaded by sarani sattineni (saranisattineni@gail.com)

3. With structured data a user enters a specific request and the results returned provide the user with the desired
information. The results are frequently tabulated and presented in a report format for ease of use. In contrast, a
search of “information” items has a high probability of not finding all the items a user is looking for. The user
has to refine the search to locate additional items of interest. This process is called “iterative search.”

4. From a practical standpoint, the integration of DBMSs and Information Retrieval Systems is very important.
Commercial database companies have already integrated the two types of systems. One of the first commercial
databases to integrate the two systems into a single view is the INQUIRE DBMS, which has been available for
over fifteen years.

5. A more current example is the ORACLE DBMS, which now offers an embedded capability called CONVECTIS, an
information retrieval system that uses a comprehensive thesaurus as the basis for generating “themes” for a
particular item. The INFORMIX DBMS has the ability to link to RetrievalWare to provide integration of structured
data and information, along with functions associated with Information Retrieval Systems.

1.5 Digital Libraries and Data Warehouses (DataMarts)

Two other systems frequently described in the context of information retrieval are
 Digital Libraries and
 Data Warehouses (or Data Marts).
 There is a significant overlap between these two systems and an Information Storage and Retrieval
System. All three systems are repositories of information and their primary goal is to “satisfy user
information needs.”

Digital Library:

A Digital Library enables users to interact effectively with information distributed across a network.
These networked information systems support search and display of items from organized
collections.
 Libraries have always been concerned with storing and retrieving information in the media it is
created on.
 As the quantities of information grew exponentially, libraries were forced to make maximum use of
electronic tools to facilitate the storage and retrieval process. With the worldwide interlinking of libraries and
information sources (e.g., publishers, news agencies, wire services, radio broadcasts) via the Internet,
more focus has been on the concept of an electronic library.

List of Software for Digital Libraries
 KOHA
 BIBLIOTEQ
 PMP

 Indexing is one of the critical disciplines in library science and significant effort has gone into the
establishment of indexing and cataloging standards. Migration of many of the library products to a digital
format introduces both opportunities and challenges. The full text of items is now available for search,
which changes the role of the index process.
 Another important library service is acting as a source of search intermediaries to assist users in finding
information.
 Information Storage and Retrieval technology has addressed a small subset of the issues associated with
Digital Libraries. The focus has been on the search and retrieval of textual data with no concern for
establishing standards on the contents of the system.

Data Warehouses:

A Data Warehouse is a type of data management system that is designed to enable and support
business intelligence activities, especially analytics. Data warehouses are solely intended to
perform queries and analysis and often contain large amounts of historical data.

List of Software for Data Warehouses

 Amazon Redshift
 Microsoft Azure
 Google BigQuery
 Snowflake

 A Data Warehouse is a relational database that is designed for query and analysis rather than
transaction processing. It includes historical data derived from transaction data from single
and multiple sources.
 A Data Warehouse holds data for the entire organization, not only for a particular
group of users.
 It is not used for daily operations and transaction processing, but for making decisions.

1.6 Information Retrieval System Capabilities
Capabilities are crucial to assist the user in locating relevant items.
The capabilities of information retrieval systems are:
o Search/Query Capabilities
o Browse Capabilities
o Miscellaneous Capabilities

1. Search Capabilities

The objective of the search capability is to allow for a mapping between a user’s specified
need and the items in the information database that will answer that need. The search
capabilities address both Boolean and Natural Language queries. The algorithms used for
searching are called Boolean, natural language processing and probabilistic. Probabilistic
algorithms use the frequency of occurrence of processing tokens (words) in determining
similarities between queries and items, and also as predictors of the potential relevance of the
found item to the searcher.

The search query can consist of natural language text in composition style and/or query terms
(referred to simply as terms) with Boolean logic indicators between them. One concept that has
occasionally been implemented in commercial systems (e.g., RetrievalWare), and holds
significant potential for assisting in the location and ranking of relevant items, is the
“weighting” of search terms. This would allow a user to indicate the importance of search terms in
either a Boolean or natural language interface.

Search capabilities fall into two groups.
Functions that define the relationships between the terms in the search statement
Examples:
 Boolean, Natural Language
 Proximity
 Contiguous Word Phrases
 Fuzzy Searches
Functions that define the interpretation of a particular word
Examples:
 Term Masking
 Numeric and Date Range
 Concept/Thesaurus expansion
 Natural Language

Boolean Logic

 Boolean logic allows a user to logically relate multiple concepts together to define what
information is needed.
 The Boolean functions apply to processing tokens identified anywhere within an item.
 Operators: AND, OR, NOT (sometimes XOR).
 Set operations: intersection, union, difference.
 Precedence: NOT, AND, OR; use parentheses to override;
process left to right among operators with the same precedence.

 Weighting: A weight is associated with each term.
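
The correspondence between Boolean operators and set operations can be shown with Python sets. The index below is a hypothetical inverted list (term to item ids) invented for the illustration.

```python
# Hypothetical inverted list: term -> ids of the items containing it.
index = {
    "oil": {1, 2, 5},
    "reserves": {1, 5},
    "production": {2, 3},
    "mexico": {5},
}


def items(term):
    """Items containing the term (empty set if the term is unknown)."""
    return index.get(term, set())


and_result = items("oil") & items("reserves")    # oil AND reserves   -> intersection
or_result = items("oil") | items("production")   # oil OR production  -> union
not_result = items("oil") - items("mexico")      # oil NOT mexico     -> difference

# Precedence NOT, AND, OR: "production OR oil AND NOT mexico" is evaluated as
# production OR (oil AND (NOT mexico)).
precedence_result = items("production") | (items("oil") - items("mexico"))

print(and_result, or_result, not_result, precedence_result)
```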

Proximity
 Restrict the distance within documents between two search terms.
 Proximity limits the acceptable occurrences and increases the precision of the search.
 Important for large documents.
 General format: TERM1 within m units of TERM2
 UNIT may be character, word, paragraph, etc.
 Direction operator: specify which term should appear first.
 Adjacent operator: m = 1 in forward direction.

Examples:

SEARCH STATEMENT: “Venetian” ADJ “Blind”
SYSTEM OPERATION: Would find items that mention a Venetian Blind on a window but not
items discussing a Blind Venetian.

SEARCH STATEMENT: “United” within five words of “American”
SYSTEM OPERATION: Would hit on “United States and American interests” and “United Airlines
and American Airlines,” but not on “United States of America and the American dream.”

SEARCH STATEMENT: “Nuclear” within zero paragraphs of “clean-up”
SYSTEM OPERATION: Would find items that have “nuclear” and “clean-up” in the same
paragraph.
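
A word-level proximity check can be sketched by comparing token positions. This is a minimal illustration with the word as the only supported unit; the function names and the simple regular-expression tokenizer are assumptions made for the example.

```python
import re


def positions(tokens, term):
    """Indexes at which the term occurs in the token list."""
    return [i for i, tok in enumerate(tokens) if tok == term]


def within(text, term1, term2, m, ordered=False):
    """True if term2 occurs within m words of term1 (after it only, if ordered)."""
    tokens = re.findall(r"[a-z0-9-]+", text.lower())
    for i in positions(tokens, term1.lower()):
        for j in positions(tokens, term2.lower()):
            distance = j - i if ordered else abs(j - i)
            if 1 <= distance <= m:
                return True
    return False


def adj(text, term1, term2):
    """Adjacent operator: m = 1 in the forward direction."""
    return within(text, term1, term2, 1, ordered=True)


print(adj("a Venetian blind on the window", "Venetian", "Blind"))               # True
print(adj("the blind Venetian", "Venetian", "Blind"))                            # False
print(within("United States and American interests", "United", "American", 5))  # True
print(within("United States of America and the American dream",
             "United", "American", 5))                                           # False
```

In the last case “American” is six words after “United”, so the five-word limit excludes it.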

Contiguous Word Phrase


 Treats a sequence of N words as a single semantic unit.
 A Contiguous Word Phrase (CWP) is both a way of specifying a query term and a special
search operator.
 A Contiguous Word Phrase is two or more words that are treated as a single semantic unit.
 An example of a CWP is “United States of America.”
 CWP is an N-ary operator (not a Boolean one). It cannot be expressed as a Boolean query.
 If only two words are specified, then CWP reduces to the adjacent operator (or the proximity
operator with m = 1 in the forward direction).
 Also called “literal string” or “exact phrase” matching.
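
Exact phrase matching can be sketched as a sliding window over the item's tokens. The function name and the whitespace tokenization are assumptions for the illustration.

```python
def contains_phrase(text, phrase):
    """True if the words of the phrase occur contiguously and in order in the text."""
    tokens = text.lower().split()
    words = phrase.lower().split()
    n = len(words)
    return any(tokens[i:i + n] == words for i in range(len(tokens) - n + 1))


print(contains_phrase("the United States of America signed", "United States of America"))  # True
print(contains_phrase("America of the United States", "United States of America"))          # False
```

The second item contains all four words, so the Boolean query United AND States AND of AND America would retrieve it. Only the phrase operator enforces order and contiguity.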

Fuzzy (Approximate) matching


 Match terms that are similar to the query term.
 Fuzzy matching compensates for spelling errors, especially when documents were scanned in
and then subjected to optical character recognition (OCR).
 Increased recall (more documents qualify because new terms may be matched) at the expense
of decreased precision.

Example:
COMPUTER may match COMPITER, CONPUTER, etc.
 Usually, a term should not match if the closely spelled word is legitimate in itself (e.g., COMMUTER).
This helps maintain precision.
 Rules are needed to indicate the allowed differences (e.g., one character replacement, or one transposition
of adjacent characters).
 A similar method may be used to overcome phonetic spelling errors. Fuzzy matching should be distinguished
from fuzzy set theory solutions.
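
The two example rules (one character replacement, one transposition of adjacent characters) can be coded directly. The sketch below is illustrative only: the function names, the vocabulary and the set of legitimate words are hypothetical, and a real system may allow other differences such as insertions or deletions.

```python
def one_edit_apart(query, candidate):
    """True if candidate differs from query by one replacement or one adjacent swap."""
    if len(query) != len(candidate) or query == candidate:
        return False
    diffs = [i for i, (a, b) in enumerate(zip(query, candidate)) if a != b]
    if len(diffs) == 1:
        return True  # one character replaced
    if len(diffs) == 2 and diffs[1] == diffs[0] + 1:
        i, j = diffs
        # two adjacent characters swapped
        return query[i] == candidate[j] and query[j] == candidate[i]
    return False


def fuzzy_matches(query, vocabulary, legitimate_words):
    """Near-misses of query, excluding words that are legitimate in themselves."""
    return [w for w in vocabulary
            if one_edit_apart(query, w) and w not in legitimate_words]


vocabulary = ["COMPITER", "CONPUTER", "COMMUTER", "COPMUTER", "COMPUTER", "KEYBOARD"]
matches = fuzzy_matches("COMPUTER", vocabulary, legitimate_words={"COMMUTER"})
print(matches)  # ['COMPITER', 'CONPUTER', 'COPMUTER']
```

COMMUTER is only one replacement away from COMPUTER, so it is the dictionary check, not the edit rule, that keeps it out of the result.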

Term masking
Match terms that contain the query term.
1. Fixed length mask:

 Fixed length masking is a single-position mask.
 It not only allows any character in the masked position, but also accepts words where the
position does not exist.
Example:
The term MULTI$NATIONAL will be matched by “multi-national” or
“multinational” (but not by “multi national”, since it is a sequence of two terms).

2. Variable length mask:

 Variable length “don’t care” masking allows any number of characters within a processing token.
 The masking may be at the front, at the end, at both front and end, or embedded.
 The first three of these cases are called suffix search, prefix search and embedded character
string search.
Examples:
“*COMPUTER” - Suffix Search
“COMPUTER*” - Prefix Search
“*COMPUTER*” - Embedded String Search

Examples

SEARCH STATEMENT: MULTI$NATIONAL
SYSTEM OPERATION: Matches “multi-national” and “multinational” but does not match
“multi national”, since it is two processing tokens.

SEARCH STATEMENT: *computer*
SYSTEM OPERATION: Matches “minicomputer”, “microcomputer” or “computer”

SEARCH STATEMENT: comput*
SYSTEM OPERATION: Matches “computers”, “computing”, “computes”

SEARCH STATEMENT: *comput*
SYSTEM OPERATION: Matches “microcomputers”, “minicomputing”, “compute”
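
One way to implement both kinds of mask is to translate the search term into a regular expression that is applied to each processing token. This is a sketch under that assumption; the function names are hypothetical, and `$` and `*` are the mask symbols used in the examples above.

```python
import re


def mask_to_regex(mask):
    """Translate a masked term into a compiled, case-insensitive regular expression."""
    parts = []
    for ch in mask:
        if ch == "$":
            parts.append(r"\S?")         # fixed length: any one character, or none
        elif ch == "*":
            parts.append(r"\S*")         # variable length: any number of characters
        else:
            parts.append(re.escape(ch))  # literal character
    return re.compile("".join(parts), re.IGNORECASE)


def mask_matches(mask, token):
    """True if the whole processing token matches the masked term."""
    return mask_to_regex(mask).fullmatch(token) is not None


print(mask_matches("MULTI$NATIONAL", "multi-national"))  # True
print(mask_matches("MULTI$NATIONAL", "multinational"))   # True
print(mask_matches("MULTI$NATIONAL", "multi national"))  # False
print(mask_matches("comput*", "computing"))              # True
print(mask_matches("*comput*", "microcomputers"))        # True
```

The masks use `\S` rather than `.` so that a blank can never be absorbed by a mask, which keeps “multi national” (two tokens) from matching.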

Numeric and Date Ranges

Match numeric or date terms that fall within the range given in the query term.
Numeric
 Term masking cannot express numeric ranges: to find numbers larger than “125,” a term such as “125*”
finds only numbers that begin with the digits “125.”

Date
 Query term: 9/1/97 - 8/31/98 (matches all dates between 1 September 1997 and 31 August 1998).
 In a way, term masking handles “string ranges”.
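
The contrast between masking and true range matching can be made concrete. The sketch below assumes tokens arrive as strings and that dates use the month/day/two-digit-year form of the example; the function names are hypothetical.

```python
from datetime import datetime


def parse_date(text):
    """Parse a date written as month/day/two-digit year, e.g. 9/1/97."""
    return datetime.strptime(text.strip(), "%m/%d/%y").date()


def in_date_range(value, range_term):
    """True if value lies inside a range term such as '9/1/97 - 8/31/98'."""
    start, end = (parse_date(part) for part in range_term.split("-"))
    return start <= parse_date(value) <= end


def greater_than(token, bound):
    """Numeric comparison on the value of the token, not on its characters."""
    return token.isdecimal() and int(token) > bound


tokens = ["99", "125", "1250", "130", "4000"]
masked = [t for t in tokens if t.startswith("125")]   # what the mask "125*" finds
larger = [t for t in tokens if greater_than(t, 125)]  # what "larger than 125" means

print(masked)  # ['125', '1250']
print(larger)  # ['1250', '130', '4000']
print(in_date_range("12/25/97", "9/1/97 - 8/31/98"))  # True
print(in_date_range("9/1/98", "9/1/97 - 8/31/98"))    # False
```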

Concept/Thesaurus Expansion
 Thesaurus expansion is associated with both Boolean and Natural Language queries. It expands
the search terms through a Thesaurus or Concept Class database reference tool.
 A Thesaurus is typically a one-level or two-level expansion of a term to other terms that are similar
in meaning.
 A Concept Class is a tree structure that expands each meaning of a word into concepts that are
related to the initial term (e.g., in the TOPIC system).
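
A one-level thesaurus expansion can be sketched as replacing each query term with an OR-group of the term and its similar terms. The thesaurus entries and function names below are invented for the illustration and do not come from any real thesaurus.

```python
# Hypothetical one-level thesaurus: term -> terms that are similar in meaning.
THESAURUS = {
    "computer": ["machine", "processor"],
    "oil": ["petroleum", "crude"],
}


def expand_terms(terms, thesaurus=THESAURUS):
    """Return one group per query term: the term followed by its expansions."""
    return [[term] + thesaurus.get(term, []) for term in terms]


def to_boolean(expanded):
    """Render the groups as a Boolean query: OR inside a group, AND between groups."""
    return " AND ".join("(" + " OR ".join(group) + ")" for group in expanded)


query = to_boolean(expand_terms(["oil", "reserves"]))
print(query)  # (oil OR petroleum OR crude) AND (reserves)
```

Expansion tends to raise recall, because items using a synonym are now found, and tends to lower precision when a synonym is used in a different sense.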

Natural Language

 Describe the information needed in natural language prose.

 Natural Language Queries allow a user to enter a prose statement that describes the
information that the user wants to find.

 The longer the prose, the more accurate the results returned.

Example:

 Find all the documents that discuss oil reserves and current attempts to find oil reserves.
Include any documents that discuss the international financial aspects of the oil production
process. Do not include documents about the oil industry in the United States.

Pseudo NL processing: The system scans the prose and extracts recognized terms and Boolean
connectors. The grammaticality of the text is not important.

Problem: Recognizing the negation in the search statement (“Do not include…”).

Compromise: The user enters natural language sentences connected with Boolean operators.

Using the same search statement, a Boolean query attempting to find the same information
might appear:

(“locate” AND “new” AND “oil reserves”) OR (“international” AND “financ*” AND “oil
production”) NOT (“oil industry” AND “United States”)

2. Browse Capabilities
 Once the search is complete, Browse capabilities let the user
determine which items are of interest and select those to be displayed.

 There are two ways of displaying a summary of the items that are associated with a query:

→ Line item status

→ Data visualization

 Powerful browsing capabilities are particularly important when precision is low.

Different types of Browsing Capabilities

1. Ranking

2. Zoning

3. Highlighting

Ranking

 Under Boolean systems, the status display is a count of the number of items found by the
query.

 Every one of the items meets all aspects of the Boolean query.

 The reasons why an item was selected can easily be traced to and displayed (e.g., via
highlighting) in the retrieved items.

 The system accumulates the various user rankings and uses this information to order the
output for other user queries.

Examples:

 AMAZON.COM
 MovieFinder.com

→ These sites decide which products to display to users based upon their queries.

Zoning

 The user wants to see the minimum information needed to determine if the item is relevant.

 The item is divided into uniform-sized passages that are indexed.

 In locality-based retrieval the passage boundaries can be dynamic. In these cases the
system can display the particular passage of the item that was found rather than the complete
item.

 The system would also provide an expand capability to retrieve the complete item as an
option.

Highlighting

 Different strengths of highlighting indicate how strongly the highlighted word participated in
the selection of the item.

 Most systems allow the display of an item to begin with the first highlight within the item and
allow subsequent jumping to the next highlight.

 Highlighting has always been useful in Boolean systems to indicate the cause of the retrieval.

 This is because of the direct mapping between terms in the search and terms in the item.

*********************************************************************

3. Miscellaneous Capabilities
These capabilities help the user input queries, reducing the time it takes to generate queries and
reducing the probability of entering a poor query.

Different Types of Miscellaneous Capabilities

1. Vocabulary Browse

2. Iterative Search and Search History Log

3. Canned Query

Vocabulary browse

 Users enter a term and are positioned in an alphabetically sorted list of all the terms that
appear in the database.

 With each term, the number of documents in which it appears is shown.

 Assists users who are not familiar with the vocabulary.

 Helps users determine the impact of using individual terms.
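
Vocabulary browse can be sketched as a binary search into the sorted term list followed by a small window around the landing position. The terms and document counts below are invented for the illustration, as are the function name and the window size.

```python
from bisect import bisect_left


def vocabulary_browse(term, doc_counts, window=2):
    """Return (term, document count) pairs around where `term` falls alphabetically."""
    vocabulary = sorted(doc_counts)
    pos = bisect_left(vocabulary, term)  # position of the term, or of its insertion point
    start = max(0, pos - window)
    return [(t, doc_counts[t]) for t in vocabulary[start:pos + window + 1]]


# Hypothetical database vocabulary with the number of documents per term.
doc_counts = {
    "compromise": 53, "comptroller": 18, "compulsion": 5, "compulsory": 4,
    "computation": 265, "compute": 1765, "computer": 10800,
}
view = vocabulary_browse("comput", doc_counts)
for term, count in view:
    print(f"{term:12} {count}")
```

A user who typed “comput” sees at a glance that “computer” would return far more documents than “computation”, which is the impact of individual terms mentioned above.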

Iterative search (query refinement)

 Rather than typing a complete new query, the results of the previous search can be used to
create a new query

 The process of refining the results of a previous search to focus on relevant items is called
iterative search.

 During a login session, a user could execute many queries to locate the needed information.

 The search history log is the capability to display all the previous searches that were executed
during the current session.

Canned Query

 The capability to name a query and store it to be retrieved and executed during a later user
session is called canned or stored queries.

 A canned query allows a user to create and refine a search that focuses on the user’s general
area of interest

*************************************************************************
