IRS Unit 1 Notes
UNIT-I
INTRODUCTION
• Web Search
• Desktop Search
• Library Search
• Information: text, image, audio, video, and other multimedia objects. The focus here is on textual information.
An IR system facilitates a user in finding the information the user needs.
   Item (Data): The smallest complete textual unit processed and manipulated by an IR system.
   What constitutes an item depends on how a specific source treats information.
• Success measure (Objectives of an IR System) : Minimize the overhead for finding information
      There is a potential for confusion in the understanding of the differences between Database
      Management Systems (DBMS) and Information Retrieval Systems. It is easy to confuse the
      software that optimizes functional support of each type of system with actual information or
      structured data that is being stored and manipulated. The importance of the differences lies in
      the inability of a database management system to provide the functions needed to process
      “information.” An information system containing structured data also suffers major functional
      deficiencies.
    The general objective of an Information Retrieval System is to minimize the overhead of a user
    locating needed information.
       Overhead can be expressed as the time a user spends in all of the steps leading to reading an item
containing the needed information, excluding the time for actually reading the relevant data
        Example:
              Query generation
              Search composition
              Search execution
              Scanning results of query to select items to read
Query Generation:
The user enters keywords that describe the information needed, e.g., Sachin Tendulkar, Best Restaurants, etc.
Search Composition:
The keywords are composed into a search statement that the system can check against the available information.
Search Execution:
The system executes the search and locates the items that may contain relevant information.
Scanning Results of Query for Reading Item:
The results of the query are displayed on the user's screen, and the user scans them to select the items to read.
    An Information Retrieval System consists of a software program that facilitates a user in finding the
     information the user needs.
    The system may use standard computer hardware or specialized hardware to support the search
        subfunction and to convert non-textual sources to searchable media.
• Relevant Item: In an IRS the term “relevant” item is used to represent an item containing the needed
                      information.
 The two major measures commonly associated with information systems are
            1) Precision
            2) Recall
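• Precision
              • The fraction of the retrieved items that are relevant.
• Recall
              • The ability of the search to find all of the relevant items.
Precision = Number_Retrieved_Relevant / Number_Total_Retrieved
Recall = Number_Retrieved_Relevant / Number_Possible_Relevant
Where
Number_Retrieved_Relevant is the number of items retrieved that are relevant to the user’s
search need.
Number_Total_Retrieved is the total number of items retrieved from the query.
Number_Possible_Relevant is the number of relevant items in the database.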
Precision measures one aspect of information retrieval overhead for a user associated with a particular
search.
• If a search has an 85% precision, then 15% of the user effort is overhead spent reviewing non-relevant
items.
 Recall measures how well a system processing a particular query is able to retrieve the relevant items
that the user is interested in seeing.
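The two measures can be computed directly from sets of item identifiers. The sketch below is only an illustration: the function name `precision_recall` and the sample item numbers are assumptions, not part of the notes.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    retrieved: set of item ids returned by the search
    relevant:  set of item ids in the database that are relevant
    """
    retrieved_relevant = len(retrieved & relevant)
    precision = retrieved_relevant / len(retrieved) if retrieved else 0.0
    recall = retrieved_relevant / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical query: 20 items retrieved, 17 of them relevant,
# and 34 relevant items exist in the whole database.
retrieved = set(range(20))      # items 0..19
relevant = set(range(3, 37))    # items 3..36
p, r = precision_recall(retrieved, relevant)
print(f"precision={p:.2f} recall={r:.2f}")  # precision=0.85 recall=0.50
```

With these numbers the user wastes 15% of the review effort on non-relevant items, and half of the relevant items in the database are never seen.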
      A total Information Storage and Retrieval System is composed of four major functional
processes:
• Item Normalization,
• Selective Dissemination of Information (i.e., “Mail”),
• archival Document Database Search, and
• an Index Database Search along with the Automatic File Build process that supports Index Files.
1. Item Normalization:
       The first step in any integrated system is to normalize the incoming items to a standard format.
       Item normalization is the process of standardizing and transforming various aspects of the
        items (documents, web pages, etc.) into a consistent form.
       Item normalization ensures that the information retrieval system can efficiently and accurately
        retrieve relevant documents in response to user queries.
    Text / Item Normalization Process
Standardize Input
     Standardizing the input takes the different external formats of input data and translates them
into the formats acceptable to the system. A system may have a single format for all items or allow multiple
formats.
   Translate foreign languages into Unicode
      This allows a single browser to display the languages and potentially a single search system to search
      them
   Translate multimedia input into a standard format
      Video: MPEG-2, MPEG-1, AVI, Real Video…
      Audio: WAV, Real Audio
      Image: GIF, JPEG
    Logical Subsetting (Zoning)
          • Parse the item into logical subdivisions that have meaning to the user: Title, Author, Abstract, Main
           Text, Conclusion, References, Country, Keyword…
          • Zones are visible to the user and are used to increase the precision of a search and to optimize the display.
          • The zoning information is passed to the processing token identification operation to store the
           information, allowing searches to be restricted to a specific zone.
          • For display, show the minimum data required from each item to allow determination of the possible
           relevance of that item (display zones such as Title, Abstract…).
    Identify Processing Tokens
   Identify the information that is used in the search process: processing tokens (a better term than words)
   The first step is to determine a word
   Input symbols are divided into three classes:
1. Valid word symbols - alphabetic characters, numbers
2. Inter-word symbols - blanks, periods, semicolons (non-searchable)
3. Special processing symbols - hyphen (-)
   A word is defined as a contiguous set of word symbols bounded by inter-word symbols
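A minimal sketch of this word definition follows. It assumes that a hyphen is kept only when it joins two runs of word symbols; real systems choose their own rule for special processing symbols. The name `tokenize` and the sample sentence are illustrative only.

```python
import re

# Word symbols: letters and digits.
# Special processing symbol: a hyphen is kept only when it joins two runs
# of word symbols (an assumption made for this sketch).
# Everything else (blanks, periods, semicolons, ...) is an inter-word symbol.
WORD = re.compile(r"[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*")


def tokenize(text):
    """Split text into lowercase processing tokens."""
    return [m.group(0).lower() for m in WORD.finditer(text)]


print(tokenize("Clean-up of the site began; cost: 125 units."))
# ['clean-up', 'of', 'the', 'site', 'began', 'cost', '125', 'units']
```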
    Apply Stop List / Stop Algorithm
   Save system resources by eliminating from the set of searchable processing tokens those that have little
value to the search, i.e., tokens whose frequency and/or semantic use make them of no use as searchable tokens:
      Any word found in almost every item
      Any word found only once or twice in the database
   Zipf's law: Frequency * Rank = Constant
   Stop algorithm vs. stop list
   Examples of stop algorithms:
      Stop all numbers greater than “999999” (this value was selected to allow dates to be searchable)
      Stop any processing token that has numbers and characters intermixed
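The two stop algorithms above, combined with a stop list, can be sketched as follows. The contents of `STOP_LIST` and the function name `is_stopped` are hypothetical; only the two rules come from the notes.

```python
STOP_LIST = {"the", "of", "and", "a"}  # hypothetical stop list


def is_stopped(token):
    """Return True if the token should not become a searchable token."""
    if token in STOP_LIST:
        return True
    if token.isdecimal():
        # Stop algorithm 1: numbers greater than 999999 are dropped,
        # so years and dates such as 1997 stay searchable.
        return int(token) > 999999
    # Stop algorithm 2: numbers and characters intermixed.
    has_digit = any(c.isdigit() for c in token)
    has_alpha = any(c.isalpha() for c in token)
    return has_digit and has_alpha


tokens = ["the", "1997", "1234567", "x15b", "oil", "reserves"]
print([t for t in tokens if not is_stopped(t)])
# ['1997', 'oil', 'reserves']
```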
     Characterize Tokens
Stemming Algorithm
 Processing tokens -> Stemming Algorithm -> update to the Searchable data structure
2. Selective Dissemination of Information:
The Selective Dissemination of Information (Mail) Process provides the capability to dynamically compare
newly received items in the information system against standing statements of interest of users and deliver the
item to those users whose statement of interest matches the contents of the item.
The Mail process is composed of the
     search process,
     user statements of interest (Profiles) and
     user mail files.
As each item is received, it is processed against every user’s profile. A profile contains a typically broad search
statement along with a list of user mail files that will receive the document if the search statement in the profile
is satisfied. When the search statement is satisfied, the item is placed in the mail file(s) associated with the
profile. User search profiles are different from ad hoc queries in that they contain significantly more search
terms and cover a wider range of interests. Selective Dissemination of Information has not yet been applied to
multimedia sources.
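The Mail process can be illustrated with a small sketch. It assumes the simplest possible profile, a set of terms that is satisfied when any one of them occurs in the item; the profile layout, the mail file names and the function `disseminate` are all hypothetical.

```python
def disseminate(item_id, item_text, profiles, mail_files):
    """Place a newly received item in the mail file of every matching profile."""
    words = set(item_text.lower().split())
    for profile in profiles:
        # Assumption: a profile is satisfied when any of its terms occurs.
        if words & profile["terms"]:
            for mail_file in profile["mail_files"]:
                mail_files.setdefault(mail_file, []).append(item_id)


profiles = [
    {"terms": {"oil", "petroleum", "refinery"}, "mail_files": ["alice/energy"]},
    {"terms": {"cricket", "tennis"}, "mail_files": ["bob/sports"]},
]
mail_files = {}
disseminate("doc-1", "New oil reserves found offshore", profiles, mail_files)
disseminate("doc-2", "Cricket season opens", profiles, mail_files)
print(mail_files)
# {'alice/energy': ['doc-1'], 'bob/sports': ['doc-2']}
```

Note the direction of the comparison: the profiles are standing and the items arrive one at a time, the reverse of an ad hoc query against a stored database.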
      3. Document Database Search
The Document Database Search Process provides the capability for a query to search against all
items received by the system. The Document Database Search process is composed of the
search process, user entered queries (typically ad hoc queries) and the document database which
contains all items that have been received, processed and stored by the system. Typically items
in the Document Database do not change (i.e., are not edited) once received. The database may be partitioned
by time to allow for archiving by the time partitions.
• Queries differ from profiles in that they are typically short and focused on a specific area of interest.
1. An Information Retrieval System is software that has the features and functions required to manipulate
“information” items versus a DBMS that is optimized to handle “structured” data.
2. Structured data is well defined data (facts) typically represented by tables. There is a semantic description
associated with each attribute within a table that well defines that attribute. For example, there is no confusion
3. With structured data a user enters a specific request and the results returned provide the user with the desired
information. The results are frequently tabulated and presented in a report format for ease of use. In contrast, a
search of “information” items has a high probability of not finding all the items a user is looking for. The user
has to refine the search to locate additional items of interest. This process is called “iterative search.”
4. From a practical standpoint, the integration of DBMSs and Information Retrieval Systems is very important.
Commercial database companies have already integrated the two types of systems. One of the first commercial
databases to integrate the two systems into a single view is the INQUIRE DBMS, which has been available for
over fifteen years.
5. A more current example is the ORACLE DBMS, which now offers an embedded capability called CONVECTIS,
an information retrieval system that uses a comprehensive thesaurus which provides the basis to generate
“themes” for a particular item. The INFORMIX DBMS has the ability to link to RetrievalWare to provide
integration of structured data and information along with functions associated with Information Retrieval Systems.
         Two other systems frequently described in the context of information retrieval are,
               Digital Libraries and
               Data Warehouses (or Data Marts).
     There is a significant overlap between these two systems and an Information Storage and Retrieval
      System. All three systems are repositories of information and their primary goal is to “satisfy user
      information needs”
Digital Library:
A Digital Library enables users to interact effectively with information distributed across a network.
These networked information systems support search and display of items from organized
collections.
     Libraries have always been concerned with storing and retrieving information in the media it is
      created on.
     As the quantities of information grew exponentially, libraries were forced to make maximum use of
      electronic tools to facilitate the storage and retrieval process. With the worldwide interconnection of
      libraries and information sources (e.g., publishers, news agencies, wire services, radio broadcasts) via
      the Internet, more focus has been on the concept of an electronic library.
List of Software for Digital Libraries
 KOHA
 BIBLIOTEQ
 PMP
     Indexing is one of the critical disciplines in library science and significant effort has gone into the
      establishment of indexing and cataloging standards. Migration of many of the library products to a digital
      format introduces both opportunities and challenges. The full text of items available for search makes the
      index process.
     Another important library service is a source of search intermediaries to assist users in finding
      Information.
     Information Storage and Retrieval technology has addressed a small subset of the issues associated with
      Digital Libraries. The focus has been on the search and retrieval of textual data with no concern for
      establishing standards on the contents of the system.
Data Warehouse:
A data warehouse is a type of data management system that is designed to enable and support
business intelligence activities, especially analytics. Data warehouses are solely intended to
perform queries and analysis and often contain large amounts of historical data.
• A data warehouse is a relational database that is designed for query and analysis rather than
transaction processing. It includes historical data derived from transaction data from single
and multiple sources.
• A data warehouse is a collection of data specific to the entire organization, not only to a particular
group of users.
• It is not used for daily operations and transaction processing but for making decisions.
o Search Capabilities
o Browse Capabilities
o Miscellaneous Capabilities
1. Search Capabilities
            The objective of the search capability is to allow for a mapping between a user’s specified
        need and the items in the information database that will answer that need. The search
        capabilities address both Boolean and Natural Language queries. The algorithms used for
        searching are called Boolean, natural language processing and probabilistic. Probabilistic
        algorithms use frequency of occurrence of processing tokens (words) in determining
        similarities between queries and items and also in predictors on the potential relevance of the
        found item to the searcher.
        A search statement can consist of natural language text in composition style and/or query terms (referred to as
        terms in this book) with Boolean logic indicators between them. One concept that has
        occasionally been implemented in commercial systems (e.g., RetrievalWare), and holds
        significant potential for assisting in the location and ranking of relevant items, is the
        “weighting” of search terms. This would allow a user to indicate importance of search terms in
        either a Boolean or natural language interface.
Functions that define the relationships between the terms in the search statement
Examples:
        Boolean, Natural Language
        Proximity
        Contiguous Word Phrases
        Fuzzy Searches
Functions that define the interpretation of a particular word
Examples:
        Term Masking
        Numeric and Date Range
        Concept/Thesaurus expansion
Boolean Logic
       Boolean logic allows a user to logically relate multiple concepts together to define what
        information is needed.
       The Boolean functions apply to processing tokens identified anywhere within an item.
       Operators: AND, OR, NOT (sometimes XOR).
       Set operations: intersection, union, difference.
       Precedence: NOT, AND, OR; use parentheses to override;
       process left to right among operators with the same precedence.
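The mapping from Boolean operators to set operations can be sketched with a toy inverted index. The index contents, the item numbers and the helper `term` are invented for illustration.

```python
# Hypothetical inverted index: processing token -> set of item ids.
index = {
    "oil": {1, 2, 4},
    "reserves": {2, 4, 5},
    "mexico": {4, 6},
}


def term(t):
    """Items that contain the processing token t."""
    return index.get(t, set())


# "oil" AND "reserves"  -> intersection
print(sorted(term("oil") & term("reserves")))                     # [2, 4]
# "oil" OR "mexico"     -> union
print(sorted(term("oil") | term("mexico")))                       # [1, 2, 4, 6]
# ("oil" AND "reserves") NOT "mexico" -> difference
print(sorted((term("oil") & term("reserves")) - term("mexico")))  # [2]
```

The parentheses in the last query are deliberate: Python's own operator precedence (`-` before `&` before `|`) is not the NOT, AND, OR order given above, so a query translator has to make the grouping explicit.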
Proximity
Example:
   “Venetian” ADJ “Blind”: would find items that mention a Venetian Blind on a window but not
      items discussing a Blind Venetian.
   “United” within five words of “American”: would hit on “United States and American interests” and
      “United Airlines and American Airlines,” but not on “United States of America and the American dream.”
   “Nuclear” within zero paragraphs of “clean-up”: would find items that have “nuclear” and “clean-up” in
      the same paragraph.
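The word-level examples can be checked with a short sketch. It assumes that distance is the difference between word positions and that matching ignores case; the function names `within` and `adjacent` are illustrative and not taken from any particular system.

```python
import re


def _tokens(text):
    return re.findall(r"[a-z0-9]+", text.lower())


def _positions(tokens, word):
    return [i for i, t in enumerate(tokens) if t == word.lower()]


def within(text, a, b, n):
    """True if a and b occur at most n word positions apart, in either order."""
    tokens = _tokens(text)
    return any(abs(i - j) <= n
               for i in _positions(tokens, a)
               for j in _positions(tokens, b))


def adjacent(text, a, b):
    """ADJ operator: a is immediately followed by b."""
    tokens = _tokens(text)
    return any(j - i == 1
               for i in _positions(tokens, a)
               for j in _positions(tokens, b))


print(within("United States and American interests", "United", "American", 5))
print(within("United States of America and the American dream", "United", "American", 5))
print(adjacent("a Venetian blind on the window", "Venetian", "Blind"))
print(adjacent("a blind Venetian", "Venetian", "Blind"))
```

Under these assumptions the four calls print True, False, True and False, matching the examples above: in the second sentence “American” is six positions after “United,” and “America” is a different token.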
Fuzzy Searches
Example:
COMPUTER may match COMPITER, CONPUTER, etc.
      Usually, a term should not match if the closely spelled word is legitimate in itself (e.g., COMMUTER).
       This helps maintain precision.
      Rules are needed to indicate the allowed differences (e.g., one character replacement, or one transposition
       of adjacent characters).
      A similar method may be used to overcome phonetic spelling errors. Fuzzy searching should be distinguished
       from fuzzy set theory solutions.
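A sketch of the two rules named above (one character replacement, one transposition of adjacent characters), including the check against legitimate words. The vocabulary and the function names are assumptions made for the example.

```python
def one_edit_apart(a, b):
    """True if b differs from a by exactly one character replacement
    or one transposition of adjacent characters."""
    if len(a) != len(b) or a == b:
        return False
    diffs = [i for i in range(len(a)) if a[i] != b[i]]
    if len(diffs) == 1:
        return True                      # one replacement
    if len(diffs) == 2 and diffs[1] == diffs[0] + 1:
        i, j = diffs
        return a[i] == b[j] and a[j] == b[i]   # adjacent transposition
    return False


def fuzzy_match(query, term, vocabulary):
    """Match exact terms and close misspellings, but never a close term
    that is a legitimate word in its own right."""
    if term == query:
        return True
    return one_edit_apart(query, term) and term not in vocabulary


vocabulary = {"COMPUTER", "COMMUTER"}  # hypothetical list of legitimate words
for candidate in ["COMPITER", "CONPUTER", "COMPTUER", "COMMUTER"]:
    print(candidate, fuzzy_match("COMPUTER", candidate, vocabulary))
```

The first three candidates match; COMMUTER is rejected even though it is only one replacement away, which is the precision safeguard described in the notes.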
Term masking
Match terms that contain the query term.
   1. Fixed length mask :
Numeric and Date Range
Match numeric or date terms that are in the range of the query term.
Numeric
To find numbers larger than “125,” using a term “125*” will not find any number except those that begin
with the digits “125.” A numeric range query is needed instead.
Date
      Query terms: 9/1/97 - 8/31/98 (matches all dates between 1 September 1997 and 31 August 1998).
      In a way, term masking handles “string ranges”.
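The difference between masking and a true range can be seen in a few lines. The sketch uses Python's standard `fnmatch` module as a stand-in for a term-masking engine; the sample terms and dates are made up.

```python
from datetime import date
from fnmatch import fnmatchcase

terms = ["125", "1250", "126", "130", "90"]

# Term masking: "125*" only matches terms that begin with the digits 125.
masked = [t for t in terms if fnmatchcase(t, "125*")]
print(masked)        # ['125', '1250']

# Numeric range: compare values, not character strings.
larger = [t for t in terms if int(t) > 125]
print(larger)        # ['1250', '126', '130']

# Date range 9/1/97 - 8/31/98.
start, end = date(1997, 9, 1), date(1998, 8, 31)
dates = [date(1997, 8, 31), date(1998, 1, 15), date(1998, 9, 1)]
in_range = [d.isoformat() for d in dates if start <= d <= end]
print(in_range)      # ['1998-01-15']
```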
Natural Language
      Natural Language Queries allow a user to enter a prose statement that describes the
       information that the user wants to find.
 The longer the prose, the more accurate the results returned.
Example:
 Find all the documents that discuss oil reserves and current attempts to find oil reserves.
       Pseudo NL processing: The system scans the prose and extracts recognized terms and Boolean
       connectors. The grammaticality of the text is not important.
Problem: Recognizing the negation in the search statement (“Do not include…”).
Compromise: The user enters natural language sentences connected with Boolean operators.
       Using the same search statement, a Boolean query attempting to find the same information
       might appear:
       (“locate” AND “new” AND “oil reserves”) OR (“international” AND “financ*” AND “oil
       production”) NOT (“oil industry” AND “United States”)
2. Browse Capabilities
      Once the search is complete, Browse capabilities provide the user with the capability to
       determine which items are of interest and select those to be displayed.
 There are two ways of displaying a summary of the items that are associated with a query:
→ Data visualization
1. Ranking
2. Zoning
3. Highlighting
Ranking
      Under Boolean systems, the status display is a count of the number of items found by the
       query.
 Every one of the items meets all aspects of the Boolean query.
      The reasons why an item was selected can easily be traced to and displayed (e.g., via
       highlighting) in the retrieved items.
      Alternatively, the system accumulates the various user rankings and uses this information to order the
       output for other user queries.
Examples:
• AMAZON.COM
→ Such sites decide which products to display to users based upon their queries
Zoning
 The user wants to see the minimum information needed to determine if the item is relevant.
       With locality-based retrieval the passage boundaries can be dynamic; in these cases the
        system can display the particular passage that was found rather than the complete
        item.
       The system would also provide an expand capability to retrieve the complete item as an
        option.
Highlighting
       Different strengths of highlighting indicate how strongly the highlighted word participated in
        the selection of the item.
       Most systems allow the display of an item to begin with the first highlight within the item and
        allow subsequent jumping to the next highlight.
• Highlighting has always been useful in Boolean systems to indicate the cause of the retrieval.
 This is because of the direct mapping between terms in the search and terms in the item.
*********************************************************************
3. Miscellaneous Capabilities
       These capabilities help the user input queries, reducing the time it takes to generate queries and reducing
       the probability of entering a poor query.
1. Vocabulary Browse
2. Iterative Search and Search History Log
3. Canned Query
Vocabulary browse
               Users enter a term and are positioned in an alphabetically sorted list of all the terms that
                appear in the database.
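Positioning a user in a sorted term list is a binary search. The sketch below uses the standard `bisect` module; the vocabulary, the `window` parameter and the function name `vocabulary_browse` are hypothetical.

```python
from bisect import bisect_left


def vocabulary_browse(vocabulary, term, window=2):
    """Return the terms surrounding the position where `term` falls
    in the alphabetically sorted vocabulary."""
    pos = bisect_left(vocabulary, term)
    return vocabulary[max(0, pos - window): pos + window + 1]


# Hypothetical database vocabulary, already sorted alphabetically.
vocabulary = ["compress", "comptroller", "compulsion", "compute",
              "computer", "computers", "comrade"]

print(vocabulary_browse(vocabulary, "computer"))
# ['compulsion', 'compute', 'computer', 'computers', 'comrade']

# A misspelled entry still lands next to the intended terms.
print(vocabulary_browse(vocabulary, "computr"))
# ['computer', 'computers', 'comrade']
```

Because the user sees neighboring terms, spelling errors and unexpected word variants can be spotted before the query is run.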
Iterative search and search history log
    Rather than typing a complete new query, the results of the previous search can be used to
     create a new query
    The process of refining the results of a previous search to focus on relevant items is called
     iterative search.
 During a login session, a user could execute many queries to locate the needed information.
    The search history log is the capability to display all the previous searches that were executed
     during the current session.
Canned Query
    The capability to name a query and store it to be retrieved and executed during a later user
     session is called canned or stored queries.
    A canned query allows a user to create and refine a search that focuses on the user’s general
     area of interest
*************************************************************************