
Information Retrieval on the World Wide Web

Dr. Bulu Maharana


bulumaharana@gmail.com
Why is Web IR an Issue?
Source: http://www.internetworldstats.com/stats.htm
Information Retrieval
• Representation, storage, organisation, and access to information items
• (Usually) keyword-based representation

[Diagram: an information need is expressed as a query to the retrieval system (search engine), which matches it against the documents and returns a set of retrieved documents, ideally useful or relevant information to the user.]

Primary goal of an IR system

“Retrieve all the documents which are relevant to a user query, while retrieving as few non-relevant documents as possible.”
What is different about the Web?
The Big Challenge

Meet the user needs given the heterogeneity of Web pages

What is different about the Web?
The Bigger Challenge

Why don’t users get what they want from the Web?
User Tasks
Pull technology
• User requests information in an interactive manner
• 3 retrieval tasks
– Browsing (hypertext)
– Retrieval (classical IR systems)
– Browsing and retrieval (modern digital libraries and web systems)

Push technology
• Automatic and permanent pushing of information to the user
• Software agents, e.g., a news service
• Filtering (a retrieval task): keeps relevant information for later inspection by the user
IR Architecture
Information Retrieval Models
• An IR model governs how a document and a query are represented and how the relevance of a document to a user query is defined.
• Main models:
– Boolean model
– Vector space model
– Statistical language model

Source: CS583, Bing Liu, UIC


Boolean model
• Each document or query is treated as a “bag” of words or terms. Word sequence is not considered.
• Given a collection of documents D, let V = {t1, t2, ..., t|V|} be the set of distinct words/terms in the collection. V is called the vocabulary.
• A weight wij > 0 is associated with each term ti of a document dj ∈ D. For a term that does not appear in document dj, wij = 0. The document dj is then represented as the vector
  dj = (w1j, w2j, ..., w|V|j)
Boolean model (contd)
• Query terms are combined logically using the Boolean operators AND, OR, and NOT.
– E.g., ((data AND mining) AND (NOT text))
• Retrieval
– Given a Boolean query, the system retrieves every document that makes the query logically true.
– Called exact match.
• The retrieval results are usually quite poor because term frequency is not considered.
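A minimal sketch of this exact-match retrieval in Python, using an inverted index over a made-up three-document collection (the documents and variable names are illustrative, not from the slides):

```python
# A toy document collection; the text and ids are illustrative only.
docs = {
    1: "data mining and knowledge discovery",
    2: "text mining of web data",
    3: "data warehousing and data mining",
}

# Build an inverted index: term -> set of ids of documents containing the term.
index = {}
for doc_id, text in docs.items():
    for term in set(text.split()):
        index.setdefault(term, set()).add(doc_id)

# Evaluate ((data AND mining) AND (NOT text)) with set operations: exact match.
result = (index.get("data", set()) & index.get("mining", set())) - index.get("text", set())
print(sorted(result))  # -> [1, 3]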
Vector Space model
• Documents are also treated as a “bag” of words or terms.
• Each document is represented as a vector.
• However, the term weights are no longer 0 or 1. Each term weight is computed based on some variation of the TF or TF-IDF scheme.
• Term Frequency (TF) Scheme: The weight of a term ti in document dj is the number of times that ti appears in dj, denoted by fij. Normalization may also be applied.
TF-IDF term weighting scheme
• The most well-known weighting scheme
– TF: still term frequency (normalized), tfij = fij / max{f1j, f2j, ..., f|V|j}
– IDF: inverse document frequency, idfi = log(N / dfi)
  N: total number of docs
  dfi: the number of docs in which ti appears
• The final TF-IDF term weight is:
  wij = tfij × idfi
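A short Python sketch of the scheme above, assuming the normalized TF and log-based IDF just defined (the toy documents are made up for illustration):

```python
import math

# Toy collection for illustration; term lists stand in for pre-processed documents.
docs = {
    "d1": "data mining and text mining".split(),
    "d2": "web data retrieval".split(),
    "d3": "text retrieval on the web".split(),
}
N = len(docs)

# Document frequency df_i: number of documents in which each term appears.
df = {}
for terms in docs.values():
    for t in set(terms):
        df[t] = df.get(t, 0) + 1

def tfidf(term, terms):
    """w_ij = (f_ij / max_k f_kj) * log(N / df_i), as defined above."""
    f = terms.count(term)
    if f == 0:
        return 0.0
    max_f = max(terms.count(t) for t in set(terms))
    return (f / max_f) * math.log(N / df[term])

print(round(tfidf("mining", docs["d1"]), 3))  # "mining" occurs twice in d1 only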
Retrieval in vector space model
• Query q is represented in the same way or slightly differently.
• Relevance of di to q: compare the similarity of query q and document di.
• Cosine similarity (the cosine of the angle between the two vectors):
  cosine(q, di) = (q · di) / (||q|| ||di||)
• Cosine is also commonly used in text clustering.
An Example
• A document space is defined by three terms:
– hardware, software, users
– the vocabulary
• A set of documents is defined as:
– A1=(1, 0, 0), A2=(0, 1, 0), A3=(0, 0, 1)
– A4=(1, 1, 0), A5=(1, 0, 1), A6=(0, 1, 1)
– A7=(1, 1, 1), A8=(1, 0, 1), A9=(0, 1, 1)
• If the query is “hardware and software”, what documents should be retrieved?
An Example (cont.)
• In Boolean query matching:
– documents A4, A7 will be retrieved (“AND”)
– retrieved: A1, A2, A4, A5, A6, A7, A8, A9 (“OR”)
• In similarity matching (cosine):
– q=(1, 1, 0)
– S(q, A1)=0.71, S(q, A2)=0.71, S(q, A3)=0
– S(q, A4)=1, S(q, A5)=0.5, S(q, A6)=0.5
– S(q, A7)=0.82, S(q, A8)=0.5, S(q, A9)=0.5
– Document retrieved set (with ranking) = {A4, A7, A1, A2, A5, A6, A8, A9}
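The scores above can be reproduced with a few lines of Python; a minimal sketch (the cosine helper is written here for illustration):

```python
import math

# Document vectors and query from the example above (hardware, software, users).
docs = {
    "A1": (1, 0, 0), "A2": (0, 1, 0), "A3": (0, 0, 1),
    "A4": (1, 1, 0), "A5": (1, 0, 1), "A6": (0, 1, 1),
    "A7": (1, 1, 1), "A8": (1, 0, 1), "A9": (0, 1, 1),
}
q = (1, 1, 0)  # query "hardware and software"

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norms if norms else 0.0

for name in sorted(docs, key=lambda d: cosine(q, docs[d]), reverse=True):
    print(name, round(cosine(q, docs[name]), 2))
# A4 1.0, A7 0.82, A1/A2 0.71, A5/A6/A8/A9 0.5, A3 0.0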
The bright side: Web advantages vs. classic IR
Web IR tools
Web IR tools (cont...)
Algorithmic issues related to search engines
Ranking of Web Pages
Considerations for Search Engines
 Scalability
 Content Freshness
 Speed of service
 Relevancy of search results

Units of a Search Engine
1. Crawling
2. Indexing
3. Querying
4. Searching
5. Ranking
6. Browsing
Text pre-processing
• Word (term) extraction: easy
• Stopwords removal
• Stemming
• Frequency counts and computing TF-IDF term weights
Stopwords removal
• Many of the most frequently used words in English are useless in IR and text mining; these words are called stop words.
– the, of, and, to, ….
– Typically about 400 to 500 such words
– For an application, an additional domain-specific stopword list may be constructed
• Why do we need to remove stopwords?
– Reduce indexing (or data) file size
• stopwords account for 20-30% of total word counts
– Improve efficiency and effectiveness
• stopwords are not useful for searching or text mining
• they may also confuse the retrieval system
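A minimal sketch of stopword filtering in Python, with a deliberately tiny stopword list standing in for the 400-500-word lists mentioned above:

```python
# A deliberately tiny stopword list; real lists run to roughly 400-500 words,
# often extended with domain-specific terms.
STOPWORDS = {"the", "of", "and", "to", "a", "in", "is", "on"}

def remove_stopwords(text):
    return [t for t in text.lower().split() if t not in STOPWORDS]

print(remove_stopwords("The retrieval of documents on the Web"))
# -> ['retrieval', 'documents', 'web']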
Stemming
• Techniques used to find the root/stem of a word. E.g.,
– user, users, used, using → stem: use
– engineering, engineered, engineer → stem: engineer
Usefulness:
• improving effectiveness of IR and text mining
– matching similar words
– mainly improves recall
• reducing indexing size
– combining words with the same root may reduce indexing size by as much as 40-50%
Basic stemming methods
Using a set of rules. E.g.,
• remove ending
– if a word ends with a consonant other than s, followed by an s, then delete the s.
– if a word ends in es, drop the s.
– if a word ends in ing, delete the ing unless the remaining word consists only of one letter or of th.
– if a word ends with ed, preceded by a consonant, delete the ed unless this leaves only a single letter.
– ……
• transform words
– if a word ends with “ies” but not “eies” or “aies”, then “ies” --> “y”.
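A rough Python sketch of a few of these rules (simplified ordering and coverage; not a full stemmer such as Porter's):

```python
import re

# Simplified ordering of a few of the rules above; not a full stemmer.
def simple_stem(word):
    w = word.lower()
    if w.endswith("ies") and not w.endswith(("eies", "aies")):
        w = w[:-3] + "y"                 # "ies" -> "y"
    elif w.endswith("es"):
        w = w[:-1]                       # ends in "es": drop the s
    elif re.search(r"[^aeious]s$", w):
        w = w[:-1]                       # consonant (other than s) + "s": drop the s
    if w.endswith("ing") and len(w) - 3 > 1 and w[:-3] != "th":
        w = w[:-3]                       # drop "ing" unless almost nothing remains
    if w.endswith("ed") and len(w) > 3 and w[-3] not in "aeiou":
        w = w[:-2]                       # consonant + "ed": drop the "ed"
    return w

print([simple_stem(w) for w in ["users", "studies", "engineered", "flowers"]])
# -> ['user', 'study', 'engineer', 'flower']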
Frequency counts + TF-IDF
• Count the number of times a word occurs in a document.
– Use occurrence frequencies to indicate the relative importance of a word in a document.
• if a word appears often in a document, the document likely “deals with” subjects related to the word.
• Count the number of documents in the collection that contain each word.
• TF-IDF can then be computed.
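A brief sketch of the two counts, using Python's collections.Counter on made-up documents:

```python
from collections import Counter

# Toy pre-processed documents, just to show the two counts needed for TF-IDF.
docs = {
    "d1": "web data mining data".split(),
    "d2": "web retrieval".split(),
}

tf = {d: Counter(terms) for d, terms in docs.items()}           # per-document term frequencies
df = Counter(t for terms in docs.values() for t in set(terms))  # document frequencies

print(tf["d1"]["data"], df["web"])  # -> 2 2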
Evaluation: Precision, Recall & E-Measure
• Given a query:
– Are all retrieved documents relevant?
– Have all the relevant documents been retrieved?
• Measures for system performance:
– The first question is about the precision of the search.
– The second is about the completeness (recall) of the search.
– E-Measure: a normalization of recall and precision.
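A small worked sketch, assuming van Rijsbergen's common form of the E-measure, E = 1 - (1 + b²)PR / (b²P + R); the retrieved and relevant sets are made up for illustration:

```python
# Illustrative retrieved and relevant sets; the E-measure here uses van Rijsbergen's
# common form E = 1 - (1 + b^2) * P * R / (b^2 * P + R).
retrieved = {"d1", "d2", "d3", "d4"}
relevant = {"d2", "d4", "d5"}

hits = len(retrieved & relevant)
precision = hits / len(retrieved)   # 2/4 = 0.5
recall = hits / len(relevant)       # 2/3 ~= 0.67

b = 1.0  # b > 1 emphasises recall, b < 1 emphasises precision
e_measure = 1 - (1 + b**2) * precision * recall / (b**2 * precision + recall)
print(precision, round(recall, 2), round(e_measure, 2))  # 0.5 0.67 0.43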
Search Result Ranking
• Ranking is the process in which the closeness of a document to the user query is measured.
• Although there are many ranking techniques used by search engines, most of the ranking algorithms are not publicly known.
Popular Ranking Techniques
1. Boolean Spread
Number of query terms found in the page and its neighborhood pages
2. Vector Space
Term Frequency (TF) and Inverse Document Frequency (IDF)
3. Most cited
Pages being pointed to in the answer set (authorities) and pages in the answer set which have outgoing links (hubs); open to hyperlink spamming
4. Citation Rank (Google’s PageRank)
[Illustration: the size of each face is proportional to the total size of the other faces pointing to it.]
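A minimal power-iteration sketch of the PageRank idea on a made-up three-page link graph (not the algorithm as given in the slides, which describe it only through the illustration above):

```python
# A made-up three-page link graph; every page here has at least one out-link.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
pages = list(links)
d = 0.85                                  # damping factor commonly used with PageRank
rank = {p: 1.0 / len(pages) for p in pages}

for _ in range(50):                       # power iteration until ranks roughly stabilise
    new = {}
    for p in pages:
        incoming = sum(rank[q] / len(links[q]) for q in pages if p in links[q])
        new[p] = (1 - d) / len(pages) + d * incoming
    rank = new

print({p: round(r, 3) for p, r in rank.items()})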
Web IR Challenges
(Challenge: related issues / efforts)
• Distribution of Web content: network limitations; platform incompatibility
• High data volatility: millions of pages added and eliminated; domain name changes
• Heterogeneity and size of Web data: varies in language, file formats, media
• Lack of structure and data redundancy: no structure because of HTML; mirroring or proxy servers; about 30% of Web pages duplicated
• Poor content quality: anybody can post, no editorial process
• Web traps: anti-spam protocols, URL aliases, content duplication
• Modeling the Web: the vector space model exhausted
• Querying: embedding structure in search queries
• Distributed architecture: indexing mechanisms to be replaced with effective search agents
• Ranking: integrating the user in the search process
• Hidden Web: advanced search agents


Web IR for Librarians
Small Directories
• Built by information specialists
• Selected, evaluated, annotated
• Organized into subject categories
• Examples:
– Librarians’ Internet Index (lii.org): by a group of California library professionals
– Infomine: by a UC consortium of library professionals
– Academic Info: by a librarian in Arizona
Larger Directories
• Google Web directory
– http://directory.google.com
– 5+ million pages - less than 0.04% of Google web
• About.com – a collection of specialized directories
– search by subject
• Yahoo’s directory
– http://dir.yahoo.com
– 4 million UNevaluated pages - about 0.06% of Yahoo! search
Finding “expert pages” and searchable databases
• Look in all the directories just mentioned
– Databases and “expert pages” scattered throughout
• In routine searching:
– If a site calls itself a directory or database, you can search on it
  e.g., genome database, “cell biology” directory
• Look for societies’ pages with collections of links
  e.g., genome society retrieves the home page of the “International mammalian genome society”
CRITICAL EVALUATION
Why Evaluate What You Find on the Web?
• Anyone can put up a Web page
– about anything
• Many pages not kept up-to-date
• No quality control
– most sites not “peer-reviewed”
– less trustworthy than scholarly publications
– no selection guidelines for search engines
Web Evaluation Techniques
Before you click to view the page...
• Look at the URL - personal page or site?
• Domain name appropriate for the content?
– edu, com, org, net, gov, ca.us, uk, etc.
• Published by an entity that makes sense?
• News from its source?
• Advice from a valid agency?


Web Evaluation Techniques
Scan the perimeter of the page
• Can you tell who wrote it?
– name of page author
– organization, institution, agency you recognize
– e-mail contact by itself not enough
• Credentials for the subject matter?
– Look for links to: “About us”, “Philosophy”, “Background”, “Biography”
• Is it recent or current enough?
– Look for “last updated” date - usually at the bottom
• If no links or other clues...
– truncate back the URL
Web Evaluation Techniques
Indicators of quality
• Sources documented
– links, footnotes, etc.
– As detailed as you expect in print publications?
– Do the links work?
• Information retyped or forged?
– Why not a link to the published version instead?
• Links to other resources
– biased, slanted?
Web Evaluation Techniques
What Do Others Say?
• Search the URL in alexa.com
– Who links to the site? Who owns the domain?
– Type or paste the URL into the basic search box
– Traffic data for the top 100,000 sites
• See what links are in Google’s Similar pages
• Look up the page author in Google
Web Evaluation Techniques
STEP BACK & ASK: Does it all add up?
• Why was the page put on the Web?
– to inform with facts and data?
– to explain, persuade?
– to sell, attract?
– to share, disclose?
– as a parody or satire?
• Is it appropriate for your purpose?
Thank you !!
