IRS Unit 1 by Krishna
UNIT-1 PART-1
INTRODUCTION TO IRS
Information Retrieval Systems (IRS) store, retrieve, and maintain diverse information
types, with text being the primary focus due to its full functional processing
capabilities. IRS aim to minimize user overhead by efficiently locating relevant
information, utilizing techniques like indexing and search algorithms. The evolution
of IRS reflects technological advancements, from early library catalogs to modern
internet-based search engines, with a growing emphasis on multimedia search.
4. Historical Context:
◦ The earliest IRS were developed to organize information in libraries
through catalog systems.
◦ The advent of computers introduced electronic database management
systems, revolutionizing the storage and retrieval of textual information.
◦ Initial research was limited by hardware capabilities and the library-
centric paradigm.
5. Advancements in Technology:
◦ Governments, especially military entities, spearheaded the development
of advanced IRS due to their need to process large textual databases.
◦ Inexpensive, powerful computers and the growth of the Internet have
made large-scale textual databases accessible to the public.
◦ Modern search engines (e.g., Google, EXCITE) and specialized tools
(e.g., WEBSEEK for images) have expanded retrieval capabilities.
6. Multi-Media Retrieval:
◦ Non-textual information (images, audio, and video) is increasingly
searchable using specialized indexing and pattern-matching techniques.
◦ Examples include tools for image searches (WEBSEEK, DITTO.COM),
audio transcription (e.g., news archives), and video indexing (e.g.,
Disney for video reuse).
7. Challenges and Future Directions:
◦ Multi-media information retrieval is still emerging, with significant
theoretical and practical gaps remaining.
◦ Research and development are increasingly driven by the private sector
to meet growing demands.
◦ Recall: Gauges the ability to retrieve all relevant items from the
database. Achieving high recall ensures comprehensive retrieval, which
is critical for tasks requiring complete information.
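The recall measure above can be illustrated with a small calculation. This is a minimal sketch: the item identifiers and the relevance judgments are hypothetical, and precision (the share of retrieved items that are relevant) is included only as the usual companion measure.

```python
def recall(retrieved, relevant):
    """Fraction of all relevant items in the database that were retrieved."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    return len(relevant & set(retrieved)) / len(relevant)


def precision(retrieved, relevant):
    """Fraction of the retrieved items that are actually relevant."""
    retrieved = set(retrieved)
    if not retrieved:
        return 0.0
    return len(retrieved & set(relevant)) / len(retrieved)


# Hypothetical search result and relevance judgments.
retrieved = ["d1", "d2", "d3", "d4"]
relevant = ["d1", "d3", "d5"]

print(recall(retrieved, relevant))     # 2 of the 3 relevant items were found
print(precision(retrieved, relevant))  # 2 of the 4 retrieved items are relevant
```

A search that returned every item in the database would reach a recall of 1.0, which is why recall is normally read together with precision.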
Functional Overview
The functional overview of a Total Information Storage and Retrieval System
highlights its four key components:
1. Item Normalization:
The initial step involves converting incoming items into a standardized format
that supports further processes. This includes logical restructuring, token
identification (e.g., words), characterization, stemming, and creating searchable
data structures.
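The normalization steps can be sketched in a few lines. This is an illustration only: the suffix list in `naive_stem` is a crude stand-in for a real stemming algorithm, and the two sample items are hypothetical.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Split text into lowercase processing tokens (words and numbers)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def naive_stem(token):
    """Strip one common suffix; a rough stand-in for a real stemmer."""
    for suffix in ("ing", "ers", "er", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def build_index(items):
    """Build a searchable structure: stemmed token -> set of item ids."""
    index = defaultdict(set)
    for item_id, text in items.items():
        for token in tokenize(text):
            index[naive_stem(token)].add(item_id)
    return index


# Hypothetical incoming items.
items = {1: "Computers index documents.", 2: "Indexing supports searching."}
index = build_index(items)
print(sorted(index["index"]))  # both items share the stem "index"
```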
1. Standardizing Formats:
5. Characterization:
Additional Considerations:
• Multimedia Normalization:
For non-text inputs like video or audio, normalization involves encoding into
standard formats (e.g., MPEG for video, WAV for audio).
1. Core Functionality:
◦ SDI dynamically compares newly received items in an information
system against predefined "statements of interest" (or profiles) created
by users.
◦ If an item matches a user's profile, it is delivered to the user through a
"Mail File," ensuring the information is personalized and relevant.
4. Challenges:
◦ SDI systems face difficulties in ranking results based on their relevance
due to the dynamic and asynchronous nature of updates.
◦ The lack of integration with other parts of the information retrieval
system, such as existing Mail or Index Files, limits the system's ability to
filter redundant or low-value information effectively.
5. Potential Improvements:
◦ Expanding the dissemination process to consider data from existing files
(like an index file) can improve relevance. For example, users might
prefer updates about changes in a topic (e.g., oil prices deviating from
$30) over repetitive information.
◦ Profiles could be enhanced to deprioritize or filter topics already covered
extensively in the user’s Mail File.
6. Multimedia Integration:
◦ Current SDI systems primarily handle text-based information. Although
some systems transcribe audio into text for processing, research on
disseminating multimedia sources (like videos or images) remains
limited.
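The profile matching described under Core Functionality can be sketched as follows. This is a simplified illustration: profiles are assumed to be plain sets of required words, and the user names, profile contents, and the `disseminate` function are hypothetical rather than taken from any particular SDI system.

```python
def matches(profile_terms, item_text):
    """True if every term in the profile occurs as a word in the item."""
    words = set(item_text.lower().split())
    return all(term in words for term in profile_terms)


def disseminate(item_text, profiles, mail_files):
    """Compare one newly received item against every user profile and
    append it to the Mail File of each user whose profile it satisfies."""
    for user, terms in profiles.items():
        if matches(terms, item_text):
            mail_files.setdefault(user, []).append(item_text)


# Hypothetical profiles (statements of interest), one per user.
profiles = {"alice": {"oil", "prices"}, "bob": {"refugees"}}
mail_files = {}

disseminate("Oil prices rise again", profiles, mail_files)
print(mail_files)  # only alice's profile matches this item
```

Each item is processed once, as it arrives, which is why ranking it against items delivered earlier or later is difficult in this model.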
7. Research and Development:
• Index Creation: Users can store items with additional index terms and
descriptive text, creating structured records.
• Public vs. Private Index Files:
◦ Public Index Files: Managed by professionals, indexing all items in the
database. These are widely accessible with permissions.
◦ Private Index Files: Created and maintained by individual users,
referencing a small subset of documents.
• Automatic File Build: A process that generates index suggestions using
predefined rules, aiding in indexing.
◦ Extracts citation data (e.g., author, publication date) and complex
metadata (e.g., countries or organizations mentioned).
• Search Process:
◦ Users can search only the index, retrieve referenced documents, or
conduct combined file searches.
◦ Search results prioritize structured database constraints followed by free-
text searches.
This feature provides a powerful mechanism for organizing, storing, and retrieving
items with enhanced user control.
These processes collectively demonstrate the diverse and robust mechanisms within
an information retrieval system to handle textual and multimedia data.
Relationship to Database Management Systems
System Type | Data Type | Data Characteristics | Search Process | User Interface | Example
Information Retrieval System | Information | Fuzzy text, concepts, ideas, abstractions, minimal consistency in vocabulary | Iterative search, relevance feedback | Relevance ranked results | INQUIRE DBMS, ORACLE DBMS with CONVECTIS, and INFORMIX
Data Base Management System (DBMS) | Structured data | Well-defined data, facts, semantic description of attributes | Specific request, tabulated results | Report format | Most commercial databases
The relationship between Information Retrieval Systems (IRS) and Database
Management Systems (DBMS) can be understood by comparing how each system
handles and processes data.
Key Differences:
1. Nature of Data:
◦ Information Retrieval Systems (IRS) work with fuzzy text or
unstructured data. This type of data lacks strict organization or a defined
structure. For instance, textual information (like articles, abstracts, or
documents) may have diverse vocabulary, meanings, and presentation
styles, making it difficult to standardize. The system helps users search
and retrieve information based on relevance, but the search process is
iterative. Users often need to refine their search queries multiple times
to find all the desired items because of the ambiguity and diversity of
language.
◦ In DBMS, the user query is specific, and the system returns exact
results. The results are often very structured and well-organized,
typically in a tabular form (like a report), making it easy for the user to
interpret.
Conclusion:
In summary, DBMS is ideal for managing structured data where the relationships
between data points are well-defined (like in tables), while IRS is better for handling
unstructured, fuzzy, and ambiguous information (like text). The integration of these
systems allows for handling both types of data simultaneously, enhancing data
retrieval capabilities and providing more flexible solutions. This integration ensures
that the user can benefit from the precision of a DBMS and the flexibility of an IRS.
Digital Libraries:
3. Key Features:
◦ Access to Information: Digital libraries do not necessarily require
libraries to own physical copies of information. As long as users have
access to digital versions, libraries can serve as entry points to that
information.
◦ Indexing and Cataloging: A significant aspect of digital libraries is
indexing, that is, organizing information to make it retrievable. However, with
digitized content, indexing becomes more valuable because full-text
search can be applied.
◦ Search Intermediaries: Digital libraries often require experts who can
help users find information, analyze sources, and assess the reliability of
digital content.
◦ Legal Issues: One challenge digital libraries face is the legal protection
of digital content, such as copyright and intellectual property rights,
especially in an uncontrolled global environment.
◦ Content Formats: Digital libraries manage various digital formats,
from text to multimedia content (e.g., images, videos), making the task
of preserving and retrieving this information more complex.
4. Future Concerns:
5. Data Mining:
◦ Data mining, also known as Knowledge Discovery in Databases
(KDD), is a technique used in data warehouses to automatically extract
relationships or patterns from the data that were not explicitly part of
the database design. This involves advanced statistical methods, pattern
recognition, and artificial intelligence algorithms.
◦ Clustering: While clustering in information retrieval is based on known
characteristics of items, in data mining, relationships are discovered
without prior knowledge of the data relationships.
IRS CAPABILITIES
Information Retrieval Systems rely on search and browse capabilities, with search
algorithms including Boolean, natural language processing, and probabilistic
approaches. Browse functions are crucial for filtering search results, while evolving
standards in language and architecture will enable interoperability and accelerate
development of user-centric tools.
Search Capabilities
- Search Query Structure: Users can use natural language text and/or query terms
with Boolean logic to express their information needs.
- Search Term Weighting: Users can indicate the importance of search terms,
allowing the system to prioritize results based on user preferences.
- Query Scoping: Users can limit search to specific parts of an item (zones) to
improve relevance and reduce retrieval of irrelevant information.
- Search Statement Satisfaction: Improved precision can be achieved by requiring
search statement satisfaction within a contiguous subset of the document.
- Search Statement Functions: Various functions are associated with
understanding the search statement, including term relationships, word
interpretation, and search modifiers.
- Terminology: “Word” or “term” is used interchangeably with “processing token”
to represent searchable units extracted from an item.
Boolean Logic
Boolean logic is a fundamental search technique used in information retrieval
systems to logically combine multiple search terms or concepts. This allows users to
specify their information needs more precisely by relating concepts together using
logical operators.
Key Components:
◦ AND: Ensures that all specified terms must be present in the item (set
intersection).
◦ OR: Ensures that at least one of the specified terms must be present in
the item (set union).
◦ NOT: Excludes items containing the specified term (set difference).
◦ Exclusive OR: Matches items that contain exactly one of two terms. Few
systems offer it, because it can be written as a longer combination of AND,
OR, and NOT.
• "M of N" Logic: This variant allows users to define a set of N terms and specify
that any subset of at least M of them is acceptable. For example, a user might search
for items containing at least two of the terms "AA," "BB," and
"CC." This expands into several combinations of AND operations, joined by
OR: (AA AND BB) OR (AA AND CC) OR (BB AND CC).
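The operators above map directly onto set operations over the items that contain each term. The sketch below assumes a hypothetical `postings` table (term to item ids); the `m_of_n` helper is an illustrative name, not part of any specific system.

```python
from itertools import combinations

# Hypothetical postings: term -> ids of the items that contain it.
postings = {
    "AA": {1, 2, 3},
    "BB": {2, 3, 4},
    "CC": {3, 5},
}

and_result = postings["AA"] & postings["BB"]  # AND: set intersection
or_result = postings["AA"] | postings["BB"]   # OR: set union
not_result = postings["AA"] - postings["CC"]  # AA NOT CC: set difference
xor_result = postings["AA"] ^ postings["BB"]  # exclusive OR: in exactly one set


def m_of_n(terms, m, postings):
    """Items containing at least m of the given terms, computed as the OR
    (union) of the AND (intersection) of every m-term combination."""
    result = set()
    for combo in combinations(terms, m):
        result |= set.intersection(*(postings[t] for t in combo))
    return result


print(m_of_n(["AA", "BB", "CC"], 2, postings))  # items with at least two of the terms
```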
Important Notes:
In essence, Boolean logic provides users with a structured way to combine and refine
search terms to narrow down or broaden search results, though it might lack
flexibility for more sophisticated querying needs like term weighting.
Proximity
Proximity search is used to refine search results by limiting how far apart two search
terms can be within a document or item. The basic idea is that the closer two terms
are to each other, the more likely they are to be related to the same concept or topic.
This can help improve the precision of search results by focusing on terms that
appear close to each other in a meaningful way.
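A word-distance proximity test can be sketched as below. This is a minimal illustration that measures distance in words only; real systems may also measure in sentences or paragraphs, and the function name and sample text are hypothetical.

```python
def within_proximity(text, term_a, term_b, max_distance):
    """True if term_a and term_b occur at most max_distance words apart.

    Both terms are expected in lowercase; the text is lowercased and split
    on whitespace, so punctuation handling is deliberately ignored here.
    """
    words = text.lower().split()
    positions_a = [i for i, w in enumerate(words) if w == term_a]
    positions_b = [i for i, w in enumerate(words) if w == term_b]
    return any(abs(i - j) <= max_distance
               for i in positions_a for j in positions_b)


text = "the venetian blind was fixed by a blind man in venice"
print(within_proximity(text, "venetian", "blind", 1))   # adjacent words
print(within_proximity(text, "venetian", "venice", 3))  # nine words apart
```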
Key Components:
• Special Cases:
Proximity search helps to ensure that related terms are located close together,
enhancing the relevance of search results by focusing on contextually related terms
rather than simply individual words scattered throughout a document.
• Semantic Unit: A Contiguous Word Phrase (CWP) is treated as a single semantic
entity, meaning that the system recognizes the phrase as a whole rather than as
individual words. This allows for more precise searching when you need to find a
specific phrase.
• Search Examples:
Contiguous Word Phrases allow users to search for multi-word concepts as single
units. They provide a way to ensure that a specific combination of words is treated as
an indivisible entity in the search process, which improves the precision of results
when looking for exact phrases or terms. This is especially useful for proper nouns,
specific terms, or standard phrases where the meaning relies on the combination of
words.
Fuzzy Searches
Fuzzy Searches are a search technique that helps find words or terms that are similar
in spelling to the entered search term. They are primarily used to compensate for
misspellings or typographical errors. This type of search increases recall (finding
more results) but often decreases precision (because it may return terms that are
similar but not exactly what was intended).
1. Spell Similarity:
3. Heuristic Function:
◦ The system determines how "close" alternate spellings are to the original
search term using a heuristic function. This function may vary
depending on the system used.
◦ In some systems, the search might also rank terms based on how similar
their word lengths and character positions are to the query.
◦ Fuzzy searches may include alternate spellings that are common but may
have different meanings. For example, if "commuter" is identified as a
close match to "computer," the system might either include it with a low
ranking or not include it at all, depending on the system's rules.
5. OCR and Fuzzy Searching:
◦ Users can often specify the maximum number of new terms to include in
the query. This ensures that the search does not expand too broadly and
overwhelm the results with irrelevant matches.
Example:
If you perform a fuzzy search for "computer," the system might automatically include
results for terms like:
• "computer"
• "compiter"
• "conputer"
• "computter"
• "compute"
• It might also exclude very different words like "commuter" depending on
system settings.
Summary:
Fuzzy searching helps increase recall by expanding the query to include variations of
the search term that are similar in spelling. While this can lead to more
comprehensive results, it may reduce precision as it might bring in irrelevant terms.
This feature is especially useful for compensating for errors, such as those introduced
by OCR scanning processes, where minor character recognition errors can occur.
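One possible closeness heuristic is edit distance, the number of single-character insertions, deletions, and substitutions needed to turn one word into another. The sketch below uses it as an assumption, since the heuristic differs between systems; the vocabulary and the `max_terms` cap (the limit on added terms mentioned above) are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def fuzzy_expand(term, vocabulary, max_distance=1, max_terms=5):
    """Vocabulary words within max_distance edits of term, closest first,
    capped at max_terms so the query does not expand too broadly."""
    scored = sorted((edit_distance(term, w), w) for w in vocabulary)
    return [w for d, w in scored if d <= max_distance][:max_terms]


# Hypothetical database vocabulary, including OCR-style misspellings.
vocabulary = ["computer", "compiter", "conputer", "computter",
              "compute", "commuter", "company"]
print(fuzzy_expand("computer", vocabulary))
```

Note that "commuter" is only one substitution away from "computer", so a pure edit-distance heuristic admits it; excluding it would need an extra rule such as a stop list or a frequency check.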
Term Masking
Term Masking is a technique used in information retrieval systems to expand a
query term by allowing certain parts of the term to be "masked" or replaced with
wildcard characters, thus enabling the search to match any word that fits the
unmasked portions of the term. This technique is valuable in systems that do not
perform advanced stemming or only use simple stemming algorithms. Term masking
allows for more flexible searching by accepting multiple forms of a word or term.
Term masking enhances the flexibility of searches by allowing part of a search term
to be masked, enabling the retrieval of a wide range of matching terms. It can be
either fixed length (masking specific positions) or variable length (using wildcards to
mask multiple characters). Suffix search is by far the most frequently used type of
term masking, especially in operational systems. This feature is particularly useful in
cases where a search term may appear in different forms or with different endings,
making it easier for users to find relevant results.
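Fixed-length and variable-length masks can be tried out with Python's standard `fnmatch` module, where `?` masks exactly one character and `*` masks any number of characters. The mask symbols a given retrieval system uses may differ; the vocabulary below is hypothetical.

```python
from fnmatch import fnmatchcase


def mask_search(pattern, vocabulary):
    """Return the vocabulary words that fit the masked term."""
    return [word for word in vocabulary if fnmatchcase(word, pattern)]


# Hypothetical database vocabulary.
vocabulary = ["compute", "computer", "computers",
              "minicomputer", "multinational", "national"]

print(mask_search("comput*", vocabulary))     # variable-length mask at the end
print(mask_search("*computer", vocabulary))   # variable-length mask at the start
print(mask_search("*computer*", vocabulary))  # masks on both sides
print(mask_search("comput??", vocabulary))    # fixed-length mask: exactly two characters
```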
• Fixed Number Range: To search for numbers within a specific range, systems
allow the specification of inclusive ranges such as "125-425". This finds any
number between 125 and 425, inclusive.
• Infinite Ranges: Numeric queries can also be used to find values greater than
or less than a specified number. For instance:
◦ ">125" will find any numbers greater than 125.
◦ "<=233" will find any numbers less than or equal to 233. These
operators allow for more flexible and dynamic searches.
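The range forms above can be turned into a test on numeric tokens. The sketch below is one possible parser for the three query shapes shown; the function name is hypothetical, and support for ">=" and "<" is an added assumption.

```python
import re


def parse_numeric_query(query):
    """Return a function that tests a number against queries such as
    '125-425' (inclusive range), '>125', or '<=233'."""
    match = re.fullmatch(r"(\d+)-(\d+)", query)
    if match:
        low, high = int(match.group(1)), int(match.group(2))
        return lambda n: low <= n <= high
    match = re.fullmatch(r"(>=|<=|>|<)(\d+)", query)
    if match:
        op, bound = match.group(1), int(match.group(2))
        tests = {
            ">": lambda n: n > bound,
            ">=": lambda n: n >= bound,
            "<": lambda n: n < bound,
            "<=": lambda n: n <= bound,
        }
        return tests[op]
    raise ValueError(f"unsupported numeric query: {query!r}")


in_range = parse_numeric_query("125-425")
print([n for n in (100, 125, 300, 425, 426) if in_range(n)])
```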
Date Range Queries:
Similar to numeric ranges, date ranges can be specified by entering dates in formats
such as "4/2/93-5/2/95", which will retrieve any items with dates falling
between April 2, 1993, and May 2, 1995. Systems can also support greater-than (>) or
less-than-or-equal (<=) operations for date queries, allowing users to search for
items that are before or after certain dates.
Concept/Thesaurus Expansion
Concept/Thesaurus Expansion refers to the process of expanding or refining search
queries in information retrieval systems using either a Thesaurus or a Concept Class
database. These tools help broaden or focus search results by associating terms with
related concepts or synonyms, enhancing the search process.
1. Thesaurus Expansion:
◦ A Thesaurus provides a list of terms with similar meanings. When
expanding a search, the system can include other terms that are
semantically related to the initial search term.
◦ Thesauri are typically organized in a hierarchical manner, where one or
two levels of expansion can reveal similar words or synonyms. For
example, searching for "computer" might expand to terms like "laptop"
or "PC."
◦ There are two types of thesauri: semantic and statistical.
▪ A semantic thesaurus is manually curated, with terms listed
based on their meanings and relationships.
▪ A statistical thesaurus is generated by analyzing a specific
database or dataset to find words that frequently appear together,
but it doesn't have a defined semantic structure.
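Level-by-level thesaurus expansion can be sketched as a breadth-first walk over term relationships. The thesaurus below is a hypothetical hand-built example (the "notebook" entry is invented to show a second level), and `expand` is an illustrative name.

```python
# Hypothetical semantic thesaurus: term -> directly related terms.
thesaurus = {
    "computer": ["laptop", "pc"],
    "laptop": ["notebook"],
}


def expand(term, thesaurus, levels=1):
    """Return the term plus related terms reachable within `levels` steps."""
    expanded = [term]
    frontier = [term]
    for _ in range(levels):
        next_frontier = []
        for current in frontier:
            for related in thesaurus.get(current, []):
                if related not in expanded:
                    expanded.append(related)
                    next_frontier.append(related)
        frontier = next_frontier
    return expanded


print(expand("computer", thesaurus, levels=1))
print(expand("computer", thesaurus, levels=2))
```

Each extra level tends to raise recall while pulling in terms that are further from the user's intent, which is why expansion is usually limited to one or two levels.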
2. Concept Class Expansion:
The system's task is to parse this prose into a meaningful search, but negation (such
as excluding items about the U.S. oil industry) is challenging. In practice, users often
input sentence fragments to minimize effort, which may complicate the system's
ability to analyze the language correctly. Commercial systems combine both Boolean
logic and natural language capabilities to handle these queries effectively. Although
Natural Language Queries typically improve recall, they can reduce precision,
especially when negation is involved.
Multimedia Queries:
Multimedia Queries are more complex due to the need to handle different types of
media, such as still images, video, and audio, along with textual information. While
traditional text-based queries still apply to multimedia databases, users must specify
search terms for each modality. For instance, a user might use a still image to search
for related images or specific scenes in a video. Video content can be indexed by
scene changes, represented as images, and textual elements in videos can be
searchable through Optical Character Recognition (OCR) or audio transcription.
Audio content is converted to text through transcription, allowing the user to search
based on this text. However, transcriptions can be error-prone, especially with
conversational speech, which affects search accuracy. In audio search, speaker
identification is also possible, allowing users to find segments spoken by specific
individuals.
When conducting a multimedia query, the system correlates various modalities (such
as video scenes, transcribed audio, and text) based on factors like time or location.
For example, a query like "Find where Bill Clinton is discussing Cuban refugees and
there is a picture of a boat" could return results where the relevant video segment
includes Clinton's discussion of Cuban refugees and shows a boat during that time.
In summary, while Natural Language Queries provide a user-friendly way to input
search requests, Multimedia Queries expand the complexity by incorporating various
media types and correlating them to return precise results.
Browse Capabilities
Browse capabilities allow users to select and display items of interest after a search.
These capabilities, particularly useful when search precision is low, help users focus
on relevant items.
1. General Purpose
Once a search is completed, browse capabilities allow users to efficiently sift through
results. The primary goal is to help users identify items of interest and select them for
further review. Users can interact with the search results in a structured way to focus
on the most relevant items.
• Line Item Status: This displays a basic summary of each item, often with a
relevance score (from ranking) and brief descriptors (such as title or abstract).
This format helps users quickly scan results and decide which items to explore
further.
• Data Visualization: This approach uses graphical representations like charts or
graphs (e.g., 2D or 3D graphs) to visually organize the search results. Items are
placed in relation to each other based on their relevance or topics, aiding users
in navigating large sets of results by grouping similar items.
In cases of high precision (where search results are very accurate), browsing
capabilities may not be as crucial. However, in more complex searches where results
may include many irrelevant items, browse capabilities become critical. They help
users focus on the most relevant items based on their needs.
4. Ranking
Ranking systems assess the relevance of each item and display them accordingly.
This is an improvement over Boolean systems, where all retrieved items meet specific
query criteria, but without any ranking. In ranked systems, each item is given a
relevance score, typically normalized between 0 and 1, with higher scores indicating
higher relevance. These scores help users decide when to stop reviewing items,
reducing the need to look at less relevant results.
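A normalized relevance score can be illustrated with a deliberately simple measure: the fraction of query terms that appear in an item. Real ranking algorithms are far more elaborate; this scoring rule and the sample items are assumptions chosen only to show scores between 0 and 1 driving the display order.

```python
def score(query_terms, item_text):
    """Fraction of query terms found in the item, a value between 0 and 1."""
    if not query_terms:
        return 0.0
    words = set(item_text.lower().split())
    hits = sum(1 for term in query_terms if term in words)
    return hits / len(query_terms)


def rank(query_terms, items):
    """Return (score, item_id) pairs, highest relevance first."""
    scored = [(score(query_terms, text), item_id)
              for item_id, text in items.items()]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


# Hypothetical items.
items = {
    "a": "oil prices rise",
    "b": "oil spill cleanup",
    "c": "weather report",
}
for relevance, item_id in rank(["oil", "prices"], items):
    print(item_id, relevance)
```

A user reading this list from the top can stop once the scores drop below a level they consider useful.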
Collaborative Filtering:
Some systems use collaborative filtering to rank items. This technique involves
analyzing user feedback (e.g., ratings) on items and using it to adjust the ranking for
similar users or future queries. It's widely used on e-commerce platforms like
Amazon, where user preferences and past behavior are used to personalize the
displayed items.
5. Zoning
Zoning refers to how a search result item is displayed in sections or "zones." Users
typically only need a portion of the item (like the title or abstract) to assess its
relevance. By limiting the initial view to these key sections, the system allows
multiple items to fit on one screen, optimizing the user's ability to quickly scan
results.
In more advanced cases, items can be broken into smaller subdivisions called
"passages," which are indexed and retrieved independently. This allows users to
access only the relevant parts of an item, instead of the whole document.
6. Highlighting
Highlighting is used to visually emphasize the parts of an item that contributed to its
retrieval based on the query. This can include keywords or phrases that match the
search terms. Highlighting is particularly useful in Boolean systems where the search
terms have a direct correspondence with the text in the item.
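For Boolean-style queries, highlighting amounts to marking every occurrence of a query term in the displayed text. The sketch below wraps matches in a text marker; the `**` marker and the function name are arbitrary choices for illustration.

```python
import re


def highlight(text, terms, marker="**"):
    """Wrap each whole-word occurrence of a query term in the marker,
    ignoring case and preserving the original capitalization."""
    if not terms:
        return text
    alternatives = "|".join(re.escape(term) for term in terms)
    pattern = re.compile(r"\b(" + alternatives + r")\b", re.IGNORECASE)
    return pattern.sub(lambda m: f"{marker}{m.group(0)}{marker}", text)


print(highlight("Oil prices rose as oil demand grew", ["oil", "demand"]))
```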
Graphical information visualization can aid users in refining their queries. Instead of
highlighting individual terms, the system might show a visual representation of how
different terms contributed to the retrieval process. This helps users understand the
relevance of each part of the query and suggests ways to adjust their search to get
more precise results.
Conclusion
Browse capabilities are designed to enhance the user's interaction with search results.
By displaying summaries, using visualizations, and incorporating techniques like
ranking, zoning, and highlighting, these features help users navigate large volumes of
data more effectively. They are especially valuable when search results are large and
imprecise, guiding the user to focus on the most relevant items.
Miscellaneous Capabilities
Miscellaneous capabilities in information retrieval systems refer to additional
features that improve the user's ability to generate and refine queries, reduce query
mistakes, and facilitate the retrieval of relevant results. These features are designed to
make querying more efficient and user-friendly.
1. Vocabulary Browse
Vocabulary Browse helps users explore and understand the words and terms
available in the database. This feature displays unique words (tokens) from the
database in alphabetical order, along with the number of items each word appears in.
The user can enter a partial word or a word fragment, and the system will display a
list of nearby words found in the database, allowing the user to refine their search
terms.
For example, if a user enters "comput", the system will display alphabetically
adjacent words like "compulsion," "computer," and "computing," along with their frequency of
occurrence. This feature aids in detecting misspellings and understanding the impact
of search terms. It also helps in identifying whether a term is too common (e.g.,
"computer" might return too many irrelevant results), guiding the user to modify their
search for better precision.
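One way to read this example is that the system shows the alphabetical neighborhood of the entered fragment, which is why a non-matching word such as "compulsion" can appear. A minimal sketch of that behavior, using a hypothetical vocabulary with invented item counts:

```python
from bisect import bisect_left


def vocabulary_browse(fragment, vocabulary_counts, window=2):
    """Return (word, item_count) pairs for the words that sit alphabetically
    around the fragment: up to `window` before it and `window` from it on."""
    words = sorted(vocabulary_counts)
    position = bisect_left(words, fragment)
    start = max(0, position - window)
    return [(w, vocabulary_counts[w]) for w in words[start:position + window]]


# Hypothetical vocabulary: word -> number of items containing it.
vocabulary_counts = {
    "compromise": 53,
    "compulsion": 5,
    "computer": 10800,
    "computing": 18,
    "conference": 40,
}
for word, count in vocabulary_browse("comput", vocabulary_counts):
    print(word, count)
```

A very large count next to a word signals a term that is probably too common to be useful on its own.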
Iterative Search allows users to refine their queries by using the results of previous
searches. Rather than starting over with a new search, users can narrow down the
results by adding more search criteria to the existing search, effectively applying an
"AND" condition. This process helps to filter out irrelevant results and focus on the
most relevant items.
The Search History Log is a record of all the searches conducted during a user’s
session. It displays the previous queries along with the number of hits, making it easy
for users to revisit past searches and use them as a foundation for new queries. This
feature improves workflow by saving time and ensuring that users can build on past
search results.
3. Canned Query
A Canned Query is a stored query that a user can name and save for future use. This
is useful for users who frequently search within a specific domain. For example, if a
user often searches for European investment data, they can save a canned query that
includes geographic terms related to Europe. The user can then add more specific
search criteria to this canned query at any time, avoiding the need to repeatedly enter
common search terms.
Canned queries can be customized by inserting variables that are bound to specific
values when the query is executed, offering flexibility and saving time. These queries
are particularly helpful when users need to execute searches that are structurally
similar but require different parameters at different times.
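A canned query with variables can be sketched with Python's standard `string.Template`, which binds `$name` placeholders at execution time. The query text, the stored name, and the Boolean syntax below are hypothetical.

```python
from string import Template

# Hypothetical saved queries; $topic is bound when the query is run.
canned_queries = {
    "european_investment": Template(
        "(europe OR france OR germany) AND investment AND $topic"
    ),
}


def run_canned(name, **variables):
    """Fill in the variables of a stored query and return the query text."""
    return canned_queries[name].substitute(**variables)


print(run_canned("european_investment", topic="energy"))
print(run_canned("european_investment", topic="banking"))
```

`substitute` raises `KeyError` if a variable is left unbound, which catches incomplete queries before they are executed.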
4. Multimedia
In the case of audio, for instance, users may have difficulty processing large amounts
of audio data in a linear fashion. To mitigate this, systems often provide transcribed
audio alongside the original content, allowing users to follow the transcription while
listening to the audio. This combination has been shown to significantly reduce
processing time for users.
Additionally, multimedia items are ranked differently from text-based items, as the
system needs to assign weights based on how well each modality (e.g., text, image,
audio) satisfies the query. The combination of different media types in a single result
complicates the ranking process and requires specialized algorithms.
Summary of Miscellaneous Capabilities:
The system offers several features that make querying easier, including vocabulary
browse, iterative search, and canned queries. Vocabulary browse allows users to
explore the database’s vocabulary, identify potential misspellings, and understand the
impact of search terms. Iterative search and search history logs enable users to refine
previous searches and easily access past results, while canned queries allow users to
save and reuse frequently used search criteria.