0% found this document useful (0 votes)

10 views28 pages

Lecture17 IR

The document discusses information retrieval (IR) systems, which retrieve documents based on user queries, distinguishing between structured and unstructured data. It outlines various IR concepts, retrieval models, query types, and evaluation measures for search relevance, emphasizing the importance of precision and recall. Additionally, it covers web search engines, their analysis, and practical applications of web analysis in optimizing search results.

Uploaded by

prashant840

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

10 views28 pages

Lecture17 IR

Uploaded by

prashant840

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 28

Information Retrieval

CSC 375, Fall 2019

An information retrieval system will tend not

to be used whenever it is more painful and
troublesome for a customer to have
information than for him not to have it.

Zator Technical Bulletin, 136,

“Mooers’ Law; or why some
retrieval systems are used and others are
not”, Zator Company, 1959

Slides slightly modified from Ramez Elmasri and Shamkant Navathe (2011)

Information Retrieval (IR) Concepts

• Information retrieval
§ Process of retrieving documents from a
collection in response to a query by a user
• Introduction to information retrieval
§ What is the distinction between structured
and unstructured data?
§ Information retrieval defined
• “Discipline that deals with the structure, analysis,
organization, storage, searching, and retrieval
of information”

Information Retrieval 2
Information Retrieval (IR) Concepts (cont’d.)

• User’s information need expressed as a

free-form search request
§ Keyword search query
§ Query
• IR systems characterized by:
§ Types of users
§ Types of data
§ Types of information needed
§ Levels of scale

Information Retrieval 3

Information Retrieval (IR) Concepts (cont’d.)

• High noise-to-signal ratio

• Enterprise search systems
§ IR solutions for searching different entities in an
enterprise’s intranet
• Desktop search engines
§ Retrieve files, folders, and different kinds of
entities stored on the computer

Information Retrieval 4
Information Retrieval Systems

• Information retrieval (IR) systems use a simpler

data model than database systems
§ Information organized as a collection of documents
§ Documents are unstructured, no schema
• Information retrieval locates relevant
documents, on the basis of user input such as
keywords or example documents
§ e.g., find documents containing the words “database
systems”
• Can be used even on textual descriptions
provided with non-textual data such as images
• Web search engines are the most familiar
example of IR systems
Information Retrieval 5

Information Retrieval Systems (Cont.)

• Differences from database systems

§ IR systems don’t deal with transactional
updates (including concurrency control and
recovery)
§ Database systems deal with structured data,
with schemas that define the data
organization
§ IR systems deal with some querying issues not
generally addressed by database systems
• Approximate searching by keywords
• Ranking of retrieved answers by estimated
degree of relevance
Information Retrieval 6
Databases and IR Systems: A Comparison

Brief History of IR

• Inverted file organization

§ Based on keywords and their weights
§ SMART system in 1960s
• Text Retrieval Conference (TREC)
• Search engine
§ Application of information retrieval to large-
scale document collections
§ Crawler
• Responsible for discovering, analyzing, and
indexing new documents

Information Retrieval 8
Modes of Interaction in IR Systems

• Query
§ Set of terms
• Used by searcher to specify information need
• Main modes of interaction with IR systems:
§ Retrieval
• Extraction of information from a repository of
documents through an IR query
§ Browsing
• User visiting or navigating through similar or
related documents

Information Retrieval 9

Modes of Interaction in IR Systems (cont’d.)

• Hyperlinks
§ Used to interconnect Web pages
§ Mainly used for browsing
• Anchor texts
§ Text phrases within documents used to label
hyperlinks
§ Very relevant to browsing

Information Retrieval 10
Modes of Interaction in IR Systems (cont’d.)

• Web search
§ Combines browsing and retrieval
• Rank of a Webpage
§ Measure of relevance to query that
generated result set

Information Retrieval 11

Generic IR Pipeline
Generic IR Pipeline (cont’d.)

Information Retrieval 13

Retrieval Models

• Three main statistical models

§ Boolean
§ Vector space
§ Probabilistic
• Semantic model

Information Retrieval 14
Boolean Model

• Documents represented as a set of terms

• Form queries using standard Boolean
logic set-theoretic operators
§ AND, OR and NOT
• Retrieval and relevance
§ Binary concepts
• Lacks sophisticated ranking algorithms

Information Retrieval 15

Vector Space Model

• Documents
§ Represented as features and weights in an n-
dimensional vector space
• Query
§ Specified as a terms vector
§ Compared to the document vectors for
similarity/relevance assessment

Information Retrieval 16
Vector Space Model (cont’d.)

• Different similarity functions can be used

§ Cosine of the angle between the query and
document vector commonly used
• TF-IDF
§ Statistical weight measure
§ Used to evaluate the importance of a
document word in a collection of documents
• Rocchio algorithm
§ Well-known relevance feedback algorithm

Information Retrieval 17

Relevance Ranking Using Terms

• TF-IDF (Term frequency/Inverse Document
frequency) ranking:
§ Let n(d) = number of terms in the document d
§ n(d, t) = number of occurrences of term t in
the document d.
§ Relevance of a document d to a term t
n(d, t)
TF (d, t) = log 1+
n(d)
The log factor is to avoid excessive weight to
frequent terms
§ Relevance of document to query Q
TF (d, t)
r (d, Q) = å
tÎQ n(t)
Information Retrieval 18
Similarity Based Retrieval

• Similarity based retrieval - retrieve documents similar to a

given document
§ Similarity may be defined on the basis of common words
• E.g. find k terms in A with highest TF (d, t ) / n (t ) and use these terms
to find relevance of other documents.
• Relevance feedback: Similarity can be used to refine
answer set to keyword query
§ User selects a few relevant documents from those retrieved by
keyword query, and system finds other documents similar to
these
• Vector space model: define an n-dimensional space,
where n is the number of words in the document set.
§ Vector for document d goes from origin to a point whose ith
coordinate is TF (d,t ) / n (t )
§ The cosine of the angle between the vectors of two documents
is used as a measure of their similarity.
Information Retrieval 19

Probabilistic Model

• Probability ranking principle

§ Decide whether the document belongs to the
relevant set or the nonrelevant set for a query
• Conditional probabilities calculated using
Bayes’ Rule
• BM25 (Best Match 25)
§ Popular probabilistic ranking algorithm
• Okapi system

Information Retrieval 20
Semantic Model

• Include different levels of analysis

§ Morphological
§ Syntactic
§ Semantic
• Knowledge-based IR systems
§ Based on semantic models
§ Cyc knowledge base
§ WordNet

Information Retrieval 21

Types of Queries in IR Systems

• Keywords
§ Consist of words, phrases, and other
characterizations of documents
§ Used by IR system to build inverted index
• Queries compared to set of index
keywords
• Most IR systems
§ Allow use of Boolean and other operators to
build a complex query

Information Retrieval 22
Keyword Queries

• Simplest and most commonly used forms

of IR queries
• Keywords implicitly connected by a
logical AND operator
• Remove stopwords
§ Most commonly occurring words
• a, the, of
• IR systems do not pay attention to the
ordering of these words in the query

Information Retrieval 23

Boolean Queries

• AND: both terms must be found

• OR: either term found
• NOT: record containing keyword omitted
• ( ): used for nesting
• +: equivalent to and
• – Boolean operators: equivalent to AND
NOT
• Document retrieved if query logically true
as exact match in document

Information Retrieval 24
Phrase Queries

• Phrases encoded in inverted index or

implemented differently
• Phrase generally enclosed within double
quotes
• More restricted and specific version of
proximity searching

Information Retrieval 25

Proximity Queries

• Accounts for how close within a record

multiple terms should be to each other
• Common option requires terms to be in
the exact order
• Various operator names
§ NEAR, ADJ(adjacent), or AFTER
• Computationally expensive

Information Retrieval 26
Wildcard Queries

• Support regular expressions and pattern

matching-based searching
§ ‘Data*’ would retrieve data, database,
datapoint, dataset
• Involves preprocessing overhead
• Not considered worth the cost by many
Web search engines today
• Retrieval models do not directly provide
support for this query type

Information Retrieval 27

Natural Language Queries

• Few natural language search engines

• Active area of research
• Easier to answer questions
§ Definition and factoid questions

Information Retrieval 28
Text Preprocessing

• Commonly used text preprocessing

techniques
• Part of text processing task

Information Retrieval 29

Stopword Removal

• Stopwords
§ Very commonly used words in a language
§ Expected to occur in 80 percent or more of
the documents
§ the, of, to, a, and, in, said, for, that, was, on,
he, is, with, at, by, and it
• Removal must be performed before
indexing
• Queries can be preprocessed for
stopword removal

Information Retrieval 30
Stemming

• Stem
§ Word obtained after trimming the suffix and
prefix of an original word
• Reduces different forms of the word
formed by inflection
• Most famous stemming algorithm:
§ Martin Porter’s stemming algorithm

Information Retrieval 31

Utilizing a Thesaurus

• Thesaurus
§ Precompiled list of important concepts and
the main word that describes each
§ Synonym converted to its matching concept
during preprocessing
§ Examples:
• UMLS
– Large biomedical thesaurus of concepts/meta
concepts/relationships
• WordNet
– Manually constructed thesaurus that groups words
into strict synonym sets

Information Retrieval 32
Other Preprocessing Steps: Digits, Hyphens,
Punctuation Marks, Cases

• Digits, dates, phone numbers, e-mail

addresses, and URLs may or may not be
removed during preprocessing
• Hyphens and punctuation marks
§ May be handled in different ways
• Most information retrieval systems perform
case-insensitive search
• Text preprocessing steps language
specific

Information Retrieval 33

Information Extraction

• Generic term
• Extracting structured content from text
• Examples of IE tasks
• Mostly used to identify contextually
relevant features that involve text
analysis, matching, and categorization

Information Retrieval 34
Inverted Indexing

• Vocabulary
§ Set of distinct query terms in the document set
• Inverted index
§ Data structure that attaches distinct terms
with a list of all documents that contains term
• Steps involved in inverted index
construction

Information Retrieval 35
Evaluation Measures of Search Relevance

• Topical relevance
§ Measures extent to which topic of a result
matches topic of query
• User relevance
§ Describes “goodness” of a retrieved result with
regard to user’s information need
• Web information retrieval
§ Must evaluate document ranking order

Information Retrieval 37

Measuring Retrieval Effectiveness

• Information-retrieval may result in:

§ false negative (false drop) - some relevant
documents may not be retrieved.
§ false positive - some irrelevant documents
may be retrieved.
§ For many applications a good index should
not permit any false drops, but may permit a
few false positives.
• Need relevant performance metrics

Information Retrieval 38
Recall and Precision

• Precision is the fraction of retrieved instances

that are relevant
• Recall is the fraction of relevant instances
that are retrieved
• Formally,
§ Recall
• Fraction of the documents that are relevant to the
query that are successfully retrieved.
– Number of relevant documents retrieved by a search /
Total number of existing relevant documents
§ Precision
• Fraction of retrieved documents that are relevant to
the search
– Number of relevant documents retrieved by a search /
Total number of documents retrieved by that search

Information Retrieval 39

Measuring Retrieval Effectiveness (Cont.)

• Recall vs. precision tradeoff:

§ Can increase recall by retrieving many
documents (down to a low level of relevance
ranking), but many irrelevant documents would
be fetched, reducing precision
• Measures of retrieval effectiveness:
§ Recall as a function of number of documents
fetched, or
§ Precision as a function of recall
• Equivalently, as a function of number of documents
fetched
§ E.g. “precision of 75% at recall of 50%, and 60% at
a recall of 75%”
• Problem: which documents are actually
relevant, and which are not
Information Retrieval 40
Recall and Precision (cont’d.)

• Average precision
§ Useful for computing a single precision value
to compare different retrieval algorithms
• Recall/precision curve
§ Usually has a negative slope indicating inverse
relationship between precision and recall
• F-score
§ Single measure that combines precision and
recall to compare different result sets

Information Retrieval 41

Recall/precision curve

Information Retrieval 42
Web Search and Analysis

• Vertical search engines

§ Topic-specific search engines
• Metasearch engines
§ Query different search engines simultaneously
• Digital libraries
§ Collections of electronic resources and
services

Information Retrieval 43

Web Analysis and Its Relationship to IR

• Goals of Web analysis:

§ Improve and personalize search results
relevance
§ Identify trends
• Classify Web analysis:
§ Web content analysis
§ Web structure analysis
§ Web usage analysis

Information Retrieval 44
Searching the Web

• Hyperlink components
§ Destination page
§ Anchor text
• Hub
§ Web page or a Website that links to a
collection of prominent sites (authorities) on a
common topic

Information Retrieval 45

Analyzing the Link Structure of Web Pages

• The PageRank ranking algorithm

§ Used by Google
§ Highly linked pages are more important (have
greater authority) than pages with fewer links
§ Measure of query-independent importance of
a page/node
• HITS Ranking Algorithm
§ Contains two main steps: a sampling
component and a weight-propagation
component

Information Retrieval 46
Web Content Analysis

• Structured data extraction

§ Several approaches: writing a wrapper,
manual extraction, wrapper induction,
wrapper generation
• Web information integration
§ Web query interface integration and schema
matching
• Ontology-based information integration
§ Single, multiple, and hybrid

Information Retrieval 47

Web Content Analysis (cont’d.)

• Building concept hierarchies

§ Documents in a search result are organized
into groups in a hierarchical fashion
• Segmenting Web pages and detecting
noise
§ Eliminate superfluous information such as ads
and navigation

Information Retrieval 48
Approaches to Web Content Analysis

• Agent-based approach categories

§ Intelligent Web agents
§ Information filtering/categorization
§ Personalized Web agents
• Database-based approach
§ Infer the structure of the Website or to
transform a Web site to organize it as a
database

Information Retrieval 49

Web Usage Analysis

• Typically consists of three main phases:

§ Preprocessing, pattern discovery, and pattern
analysis
• Pattern discovery techniques:
§ Statistical analysis
§ Association rules
§ Clustering of users
• Establish groups of users exhibiting similar browsing
patterns
§ Clustering of pages
• Pages with similar contents are grouped together
§ Sequential patterns
§ Dependency modeling
§ Pattern modeling

Information Retrieval 50
Practical Applications of Web Analysis

• Web analytics
§ Understand and optimize the performance of
Web usage
• Web spamming
§ Deliberate activity to promote a page by
manipulating results returned by search
engines
• Web security
• Alternate uses for Web crawlers

Information Retrieval 51

Web Search Engines

• Web crawlers are programs that locate and

gather information on the Web
§ Recursively follow hyperlinks present in known
documents, to find other documents
• Starting from a seed set of documents
§ Fetched documents
• Handed over to an indexing system
• Can be discarded after indexing, or store as a
cached copy
• Crawling the entire Web would take a very
large amount of time
§ Search engines typically cover only a part of the
Web, not all of it
§ Take months to perform a single crawl
Information Retrieval 52
Web Crawling (Cont.)

• Crawling is done by multiple processes on

multiple machines, running in parallel
§ Set of links to be crawled stored in a database
§ New links found in crawled pages added to this set, to
be crawled later
• Indexing process also runs on multiple machines
§ Creates a new copy of index instead of modifying old
index
§ Old index is used to answer queries
§ After a crawl is “completed” new index becomes “old”
index
• Multiple machines used to answer queries
§ Indices may be kept in memory
§ Queries may be routed to different machines for load
balancing

Information Retrieval 53

Trends in Information Retrieval

• Faceted search
§ Allows users to explore by filtering available
information
§ Facet
• Defines properties or characteristics of a class of
objects
• Social search
§ New phenomenon facilitated by recent Web
technologies: collaborative social search, guided
participation
• Conversational search (CS)
§ Interactive and collaborative information finding
interaction
§ Aided by intelligent agents
Information Retrieval 54
Summary

• IR introduction
§ Basic terminology, query and browsing
modes, semantics, retrieval modes
• Web search analysis
§ Content, structure, usage
§ Algorithms
§ Current trends

Information Retrieval 55

Informaiton Retrieval and Web Search
No ratings yet
Informaiton Retrieval and Web Search
44 pages
CompletedUNIT 1 PPT 10.7.17
100% (6)
CompletedUNIT 1 PPT 10.7.17
87 pages
DDB Ch27
No ratings yet
DDB Ch27
60 pages
Chapter 1
No ratings yet
Chapter 1
52 pages
Wollo University Kombolcha Institute of Technology College of Informatics Department of Information Technology
100% (1)
Wollo University Kombolcha Institute of Technology College of Informatics Department of Information Technology
35 pages
1 IR Intro
No ratings yet
1 IR Intro
30 pages
Monday - IR Fundamentals - Grace Yang - AFIRM19-IR
No ratings yet
Monday - IR Fundamentals - Grace Yang - AFIRM19-IR
77 pages
1 IR Introductionn
No ratings yet
1 IR Introductionn
30 pages
Introduction to Information Retrieval Course
No ratings yet
Introduction to Information Retrieval Course
39 pages
Week 2 - Information Retrieval Basics
No ratings yet
Week 2 - Information Retrieval Basics
74 pages
Chapter 1 Introduction To IR
No ratings yet
Chapter 1 Introduction To IR
18 pages
Intro to Info Retrieval Course
No ratings yet
Intro to Info Retrieval Course
31 pages
1 IR Chapter-One
No ratings yet
1 IR Chapter-One
47 pages
Information Retrieval Systems
No ratings yet
Information Retrieval Systems
46 pages
Introduction To IR 2021
No ratings yet
Introduction To IR 2021
40 pages
01 Introduction To ISR
No ratings yet
01 Introduction To ISR
34 pages
Chapter One IR
No ratings yet
Chapter One IR
18 pages
ch1 - Information Retrieval Systems
No ratings yet
ch1 - Information Retrieval Systems
52 pages
Ch1 IR
No ratings yet
Ch1 IR
39 pages
Information Retrieval: Dr. Bassel ALKHATIB
No ratings yet
Information Retrieval: Dr. Bassel ALKHATIB
55 pages
Introduction To IIR
No ratings yet
Introduction To IIR
53 pages
IR Chapter 1
No ratings yet
IR Chapter 1
29 pages
UNIT I - Introduction and Motivation
No ratings yet
UNIT I - Introduction and Motivation
57 pages
Information Storage and Retrieval: Chapter One - Introduction
No ratings yet
Information Storage and Retrieval: Chapter One - Introduction
50 pages
IR Chapter 1
No ratings yet
IR Chapter 1
32 pages
Unit 1: Introduction and Data Pre-Processing
No ratings yet
Unit 1: Introduction and Data Pre-Processing
71 pages
Chapter 1 Ir
No ratings yet
Chapter 1 Ir
37 pages
Unit II
No ratings yet
Unit II
73 pages
Chapter 1 Introduction To IR
No ratings yet
Chapter 1 Introduction To IR
18 pages
IRS B Tech CSE Part 1
No ratings yet
IRS B Tech CSE Part 1
161 pages
Information Retrieval: Adt-V Unit
No ratings yet
Information Retrieval: Adt-V Unit
106 pages
RetrivalChapter One
No ratings yet
RetrivalChapter One
30 pages
Information Retrieval
No ratings yet
Information Retrieval
5 pages
1 introIR
No ratings yet
1 introIR
22 pages
Bulu
No ratings yet
Bulu
47 pages
11 Multimedia Media IR
No ratings yet
11 Multimedia Media IR
19 pages
1-Overview of Information Retrieval - New
No ratings yet
1-Overview of Information Retrieval - New
47 pages
Introduction To Information Retrieval
No ratings yet
Introduction To Information Retrieval
50 pages
01 Introduction To ISR
No ratings yet
01 Introduction To ISR
48 pages
Unit-5 Adt
No ratings yet
Unit-5 Adt
11 pages
Information Retrieval Course Overview
100% (2)
Information Retrieval Course Overview
12 pages
Part B
No ratings yet
Part B
12 pages
Introduction To IR Chapter 01
No ratings yet
Introduction To IR Chapter 01
29 pages
Adt Unit 5
No ratings yet
Adt Unit 5
31 pages
IR Textbook
No ratings yet
IR Textbook
167 pages
Ch2 - IR and LT
No ratings yet
Ch2 - IR and LT
45 pages
Information Retrieval Techniques
No ratings yet
Information Retrieval Techniques
59 pages
Unit - I - IR
No ratings yet
Unit - I - IR
39 pages
1 Introduction MIR
No ratings yet
1 Introduction MIR
35 pages
Lecturenote - 580003121chapter 1
No ratings yet
Lecturenote - 580003121chapter 1
10 pages
Information Retrieval Detailed Lecture Nov 2023
No ratings yet
Information Retrieval Detailed Lecture Nov 2023
39 pages
Section a-UNIT 1
No ratings yet
Section a-UNIT 1
25 pages
27 Information Retrieval
No ratings yet
27 Information Retrieval
48 pages
Intro to Information Retrieval Systems
No ratings yet
Intro to Information Retrieval Systems
10 pages
Introduction To Information Retrieval
No ratings yet
Introduction To Information Retrieval
42 pages
IR UNIT I - Notes
0% (1)
IR UNIT I - Notes
23 pages
Splunk 7.0.0 SearchTutorial Search Tutorial
No ratings yet
Splunk 7.0.0 SearchTutorial Search Tutorial
104 pages
Product Manager Task334
No ratings yet
Product Manager Task334
2 pages
Chapter-1 Basics of Internet
No ratings yet
Chapter-1 Basics of Internet
20 pages
ElasticSearch Interview Questions and Answers For Freshers
No ratings yet
ElasticSearch Interview Questions and Answers For Freshers
14 pages
Polytechnic University of The Philippines Quezon City Campus Commonwealth, Quezon City
No ratings yet
Polytechnic University of The Philippines Quezon City Campus Commonwealth, Quezon City
149 pages
Library Resources Research Techniques
No ratings yet
Library Resources Research Techniques
22 pages
ISR Lab Manual
No ratings yet
ISR Lab Manual
58 pages
Retrieving and Visualizing Data: Charles Severance
No ratings yet
Retrieving and Visualizing Data: Charles Severance
19 pages
SEO Audit for BHSpecialtyConstruction.com
No ratings yet
SEO Audit for BHSpecialtyConstruction.com
16 pages
Amharic Document Representation For Adhoc Retrieval: Tilahun Yeshambel, Josiane Mothe, Yaregal Assabie
No ratings yet
Amharic Document Representation For Adhoc Retrieval: Tilahun Yeshambel, Josiane Mothe, Yaregal Assabie
13 pages
Michael J. Folk, Bill Zoellick, Greg Riccardi - File Structures - An Object-Oriented Approach With C++-Addison-Wesley (1998)
No ratings yet
Michael J. Folk, Bill Zoellick, Greg Riccardi - File Structures - An Object-Oriented Approach With C++-Addison-Wesley (1998)
749 pages
Stemming: Ilakiyaselvan N, B2 Slot
No ratings yet
Stemming: Ilakiyaselvan N, B2 Slot
23 pages
SAP HANA Text Analysis Developer Guide en
No ratings yet
SAP HANA Text Analysis Developer Guide en
62 pages
Final
No ratings yet
Final
42 pages
Primo System Administration Guide
No ratings yet
Primo System Administration Guide
150 pages
Traffic Building PPT Bsmkt204
No ratings yet
Traffic Building PPT Bsmkt204
35 pages
RelativityOne - Searching Guide
No ratings yet
RelativityOne - Searching Guide
145 pages
Case Study-Google
100% (1)
Case Study-Google
16 pages
Chapter - 3 and 4
No ratings yet
Chapter - 3 and 4
47 pages
Bi Unit 1
No ratings yet
Bi Unit 1
17 pages
Unit 3 Assessment - Attempt Review - Saylor Academy
No ratings yet
Unit 3 Assessment - Attempt Review - Saylor Academy
13 pages
PCM 600
No ratings yet
PCM 600
90 pages
How Search Engines Work: SEO Basics
No ratings yet
How Search Engines Work: SEO Basics
59 pages
Unit 6 Advanced Databases
No ratings yet
Unit 6 Advanced Databases
108 pages
Information Retrieval System Assignment-1
No ratings yet
Information Retrieval System Assignment-1
10 pages
Efficient IR Indexing Techniques
No ratings yet
Efficient IR Indexing Techniques
153 pages
11 - Chapter 3 PDF
No ratings yet
11 - Chapter 3 PDF
80 pages
DocFetcher Manual
No ratings yet
DocFetcher Manual
5 pages