
Information Retrieval on the World Wide Web

Dr. Bulu Maharana


bulumaharana@gmail.com
Why is Web IR an Issue?
Source: http://www.internetworldstats.com/stats.htm
Information Retrieval
• Representation, storage, organisation, and access to information items
• (Usually) keyword-based representation

[Diagram: an information need is expressed as a query to the retrieval system (search engine), which matches it against the documents and returns a set of retrieved documents, ideally useful or relevant information to the user.]

Primary goal of an IR system

“Retrieve all the documents which are relevant to a user query, while retrieving as few non-relevant documents as possible.”
What is different about the Web?
The Big Challenge

Meet the user needs given the heterogeneity of Web pages

What is different about the Web?
The Bigger Challenge

Why don’t users get what they want from the Web?
User Tasks
Pull technology
• User requests information in an interactive manner
• 3 retrieval tasks
– Browsing (hypertext)
– Retrieval (classical IR systems)
– Browsing and retrieval (modern digital libraries and web systems)

Push technology
• Automatic and permanent pushing of information to the user
• Software agents, e.g., a news service
• Filtering (a retrieval task): keeps relevant information for later inspection by the user
IR Architecture
Information Retrieval Models
• An IR model governs how a document and a query are represented and how the relevance of a document to a user query is defined.
• Main models:
– Boolean model
– Vector space model
– Statistical language model

Source: CS583, Bing Liu, UIC


Boolean model
• Each document or query is treated as a “bag” of words or terms. Word sequence is not considered.
• Given a collection of documents D, let V = {t1, t2, ..., t|V|} be the set of distinct words/terms in the collection. V is called the vocabulary.
• A weight wij > 0 is associated with each term ti of a document dj ∈ D. For a term that does not appear in document dj, wij = 0. The document dj is then represented as the vector
  dj = (w1j, w2j, ..., w|V|j)
Boolean model (contd)
• Query terms are combined logically using the Boolean operators AND, OR, and NOT.
– E.g., ((data AND mining) AND (NOT text))
• Retrieval
– Given a Boolean query, the system retrieves every document that makes the query logically true.
– Called exact match.
• The retrieval results are usually quite poor because term frequency is not considered.
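A minimal sketch of this exact-match retrieval in Python, using an inverted index over a made-up three-document collection (the documents and variable names are illustrative, not from the slides):

```python
# A toy document collection; the text and ids are illustrative only.
docs = {
    1: "data mining and knowledge discovery",
    2: "text mining of web data",
    3: "data warehousing and data mining",
}

# Build an inverted index: term -> set of ids of documents containing the term.
index = {}
for doc_id, text in docs.items():
    for term in set(text.split()):
        index.setdefault(term, set()).add(doc_id)

# Evaluate ((data AND mining) AND (NOT text)) with set operations: exact match.
result = (index.get("data", set()) & index.get("mining", set())) - index.get("text", set())
print(sorted(result))  # -> [1, 3]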
Vector Space model
• Documents are also treated as a “bag” of words or terms.
• Each document is represented as a vector.
• However, the term weights are no longer 0 or 1. Each term weight is computed based on some variation of the TF or TF-IDF scheme.
• Term Frequency (TF) Scheme: The weight of a term ti in document dj is the number of times that ti appears in dj, denoted by fij. Normalization may also be applied.
TF-IDF term weighting scheme
• The most well-known weighting scheme
– TF: still term frequency (normalized), tfij = fij / max{f1j, f2j, ..., f|V|j}
– IDF: inverse document frequency, idfi = log(N / dfi)
  N: total number of docs
  dfi: the number of docs in which ti appears
• The final TF-IDF term weight is:
  wij = tfij × idfi
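A short Python sketch of the scheme above, assuming the normalized TF and log-based IDF just defined (the toy documents are made up for illustration):

```python
import math

# Toy collection for illustration; term lists stand in for pre-processed documents.
docs = {
    "d1": "data mining and text mining".split(),
    "d2": "web data retrieval".split(),
    "d3": "text retrieval on the web".split(),
}
N = len(docs)

# Document frequency df_i: number of documents in which each term appears.
df = {}
for terms in docs.values():
    for t in set(terms):
        df[t] = df.get(t, 0) + 1

def tfidf(term, terms):
    """w_ij = (f_ij / max_k f_kj) * log(N / df_i), as defined above."""
    f = terms.count(term)
    if f == 0:
        return 0.0
    max_f = max(terms.count(t) for t in set(terms))
    return (f / max_f) * math.log(N / df[term])

print(round(tfidf("mining", docs["d1"]), 3))  # "mining" occurs twice in d1 only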
Retrieval in vector space model
• Query q is represented in the same way or slightly differently.
• Relevance of di to q: compare the similarity of query q and document di.
• Cosine similarity (the cosine of the angle between the two vectors):
  cosine(q, di) = (q · di) / (||q|| ||di||)
• Cosine is also commonly used in text clustering.
An Example
• A document space is defined by three terms:
– hardware, software, users
– the vocabulary
• A set of documents is defined as:
– A1=(1, 0, 0), A2=(0, 1, 0), A3=(0, 0, 1)
– A4=(1, 1, 0), A5=(1, 0, 1), A6=(0, 1, 1)
– A7=(1, 1, 1), A8=(1, 0, 1), A9=(0, 1, 1)
• If the query is “hardware and software”, what documents should be retrieved?
An Example (cont.)
• In Boolean query matching:
– documents A4, A7 will be retrieved (“AND”)
– retrieved: A1, A2, A4, A5, A6, A7, A8, A9 (“OR”)
• In similarity matching (cosine):
– q=(1, 1, 0)
– S(q, A1)=0.71, S(q, A2)=0.71, S(q, A3)=0
– S(q, A4)=1, S(q, A5)=0.5, S(q, A6)=0.5
– S(q, A7)=0.82, S(q, A8)=0.5, S(q, A9)=0.5
– Document retrieved set (with ranking) = {A4, A7, A1, A2, A5, A6, A8, A9}
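The scores above can be reproduced with a few lines of Python; a minimal sketch (the cosine helper is written here for illustration):

```python
import math

# Document vectors and query from the example above (hardware, software, users).
docs = {
    "A1": (1, 0, 0), "A2": (0, 1, 0), "A3": (0, 0, 1),
    "A4": (1, 1, 0), "A5": (1, 0, 1), "A6": (0, 1, 1),
    "A7": (1, 1, 1), "A8": (1, 0, 1), "A9": (0, 1, 1),
}
q = (1, 1, 0)  # query "hardware and software"

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norms if norms else 0.0

for name in sorted(docs, key=lambda d: cosine(q, docs[d]), reverse=True):
    print(name, round(cosine(q, docs[name]), 2))
# A4 1.0, A7 0.82, A1/A2 0.71, A5/A6/A8/A9 0.5, A3 0.0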
The bright side: Web advantages vs. classic IR
Web IR tools
Web IR tools (cont...)
Algorithmic issues related to search engines
Ranking of Web Pages
Considerations for Search Engines
 Scalability
 Content Freshness
 Speed of service
 Relevancy of search results

Units of a Search Engine
1. Crawling
2. Indexing
3. Querying
4. Searching
5. Ranking
6. Browsing
Text pre-processing
• Word (term) extraction: easy
• Stopwords removal
• Stemming
• Frequency counts and computing TF-IDF term weights
Stopwords removal
• Many of the most frequently used words in English are useless in IR and text mining; these words are called stop words.
– the, of, and, to, ….
– Typically about 400 to 500 such words
– For an application, an additional domain-specific stopword list may be constructed
• Why do we need to remove stopwords?
– Reduce indexing (or data) file size
• stopwords account for 20-30% of total word counts
– Improve efficiency and effectiveness
• stopwords are not useful for searching or text mining
• they may also confuse the retrieval system
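A minimal sketch of stopword filtering in Python, with a deliberately tiny stopword list standing in for the 400-500-word lists mentioned above:

```python
# A deliberately tiny stopword list; real lists run to roughly 400-500 words,
# often extended with domain-specific terms.
STOPWORDS = {"the", "of", "and", "to", "a", "in", "is", "on"}

def remove_stopwords(text):
    return [t for t in text.lower().split() if t not in STOPWORDS]

print(remove_stopwords("The retrieval of documents on the Web"))
# -> ['retrieval', 'documents', 'web']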
Stemming
• Techniques used to find the root/stem of a word. E.g.,
– user, users, used, using → stem: use
– engineering, engineered, engineer → stem: engineer
Usefulness:
• improving effectiveness of IR and text mining
– matching similar words
– mainly improves recall
• reducing indexing size
– combining words with the same root may reduce indexing size by as much as 40-50%
Basic stemming methods
Using a set of rules. E.g.,
• remove ending
– if a word ends with a consonant other than s, followed by an s, then delete the s.
– if a word ends in es, drop the s.
– if a word ends in ing, delete the ing unless the remaining word consists only of one letter or of th.
– if a word ends with ed, preceded by a consonant, delete the ed unless this leaves only a single letter.
– ……
• transform words
– if a word ends with “ies” but not “eies” or “aies”, then “ies” --> “y”.
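A rough Python sketch of a few of these rules (simplified ordering and coverage; not a full stemmer such as Porter's):

```python
import re

# Simplified ordering of a few of the rules above; not a full stemmer.
def simple_stem(word):
    w = word.lower()
    if w.endswith("ies") and not w.endswith(("eies", "aies")):
        w = w[:-3] + "y"                 # "ies" -> "y"
    elif w.endswith("es"):
        w = w[:-1]                       # ends in "es": drop the s
    elif re.search(r"[^aeious]s$", w):
        w = w[:-1]                       # consonant (other than s) + "s": drop the s
    if w.endswith("ing") and len(w) - 3 > 1 and w[:-3] != "th":
        w = w[:-3]                       # drop "ing" unless almost nothing remains
    if w.endswith("ed") and len(w) > 3 and w[-3] not in "aeiou":
        w = w[:-2]                       # consonant + "ed": drop the "ed"
    return w

print([simple_stem(w) for w in ["users", "studies", "engineered", "flowers"]])
# -> ['user', 'study', 'engineer', 'flower']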
Frequency counts + TF-IDF
• Count the number of times a word occurs in a document.
– Use occurrence frequencies to indicate the relative importance of a word in a document.
• if a word appears often in a document, the document likely “deals with” subjects related to the word.
• Count the number of documents in the collection that contain each word.
• TF-IDF can then be computed.
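A brief sketch of the two counts, using Python's collections.Counter on made-up documents:

```python
from collections import Counter

# Toy pre-processed documents, just to show the two counts needed for TF-IDF.
docs = {
    "d1": "web data mining data".split(),
    "d2": "web retrieval".split(),
}

tf = {d: Counter(terms) for d, terms in docs.items()}           # per-document term frequencies
df = Counter(t for terms in docs.values() for t in set(terms))  # document frequencies

print(tf["d1"]["data"], df["web"])  # -> 2 2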
Evaluation: Precision, Recall & E-Measure
• Given a query:
– Are all retrieved documents relevant?
– Have all the relevant documents been retrieved?
• Measures for system performance:
– The first question is about the precision of the search.
– The second is about the completeness (recall) of the search.
– E-Measure: a normalization of recall and precision.
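A small worked sketch, assuming van Rijsbergen's common form of the E-measure, E = 1 - (1 + b²)PR / (b²P + R); the retrieved and relevant sets are made up for illustration:

```python
# Illustrative retrieved and relevant sets; the E-measure here uses van Rijsbergen's
# common form E = 1 - (1 + b^2) * P * R / (b^2 * P + R).
retrieved = {"d1", "d2", "d3", "d4"}
relevant = {"d2", "d4", "d5"}

hits = len(retrieved & relevant)
precision = hits / len(retrieved)   # 2/4 = 0.5
recall = hits / len(relevant)       # 2/3 ~= 0.67

b = 1.0  # b > 1 emphasises recall, b < 1 emphasises precision
e_measure = 1 - (1 + b**2) * precision * recall / (b**2 * precision + recall)
print(precision, round(recall, 2), round(e_measure, 2))  # 0.5 0.67 0.43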
Search Result Ranking
• Ranking is the process in which the closeness of a document to the user query is measured.
• Although there are many ranking techniques used by search engines, most of the ranking algorithms are not publicly known.
Popular Ranking Techniques
1. Boolean Spread
Number of query terms found in the page and its neighborhood pages
2. Vector Space
Term Frequency (TF) and Inverse Document Frequency (IDF)
3. Most cited
Pages being pointed to in the answer set (authorities) and pages in the answer set which have outgoing links (hubs); open to hyperlink spamming
4. Citation Rank (Google’s PageRank)
[Illustration: the size of each face is proportional to the total size of the other faces pointing to it.]
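A minimal power-iteration sketch of the PageRank idea on a made-up three-page link graph (not the algorithm as given in the slides, which describe it only through the illustration above):

```python
# A made-up three-page link graph; every page here has at least one out-link.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
pages = list(links)
d = 0.85                                  # damping factor commonly used with PageRank
rank = {p: 1.0 / len(pages) for p in pages}

for _ in range(50):                       # power iteration until ranks roughly stabilise
    new = {}
    for p in pages:
        incoming = sum(rank[q] / len(links[q]) for q in pages if p in links[q])
        new[p] = (1 - d) / len(pages) + d * incoming
    rank = new

print({p: round(r, 3) for p, r in rank.items()})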
Web IR Challenges
(Challenge: related issues / efforts)
• Distribution of Web content: network limitations; platform incompatibility
• High data volatility: millions of pages added and eliminated; domain name changes
• Heterogeneity and size of Web data: varies in language, file formats, media
• Lack of structure and data redundancy: no structure because of HTML; mirroring or proxy servers; about 30% of Web pages duplicated
• Poor content quality: anybody can post, no editorial process
• Web traps: anti-spam protocols, URL aliases, content duplication
• Modeling the Web: the vector space model exhausted
• Querying: embedding structure in search queries
• Distributed architecture: indexing mechanisms to be replaced with effective search agents
• Ranking: integrating the user in the search process
• Hidden Web: advanced search agents


Web IR for Librarians
Small Directories
• Built by information specialists
• Selected, evaluated, annotated
• Organized into subject categories
• Examples:
– Librarians’ Internet Index (lii.org): by a group of California library professionals
– Infomine: by a UC consortium of library professionals
– Academic Info: by a librarian in Arizona
Larger Directories
• Google Web directory
– http://directory.google.com
– 5+ million pages - less than 0.04% of Google web
• About.com – a collection of specialized directories
– search by subject
• Yahoo’s directory
– http://dir.yahoo.com
– 4 million UNevaluated pages - about 0.06% of Yahoo! search
Finding “expert pages” and searchable databases
• Look in all the directories just mentioned
– Databases and “expert pages” scattered throughout
• In routine searching:
– If a site calls itself a directory or database, you can search on it
  e.g., genome database, “cell biology” directory
• Look for societies’ pages with collections of links
  e.g., genome society retrieves the home page of the “International mammalian genome society”
CRITICAL EVALUATION
Why Evaluate What You Find on the Web?
• Anyone can put up a Web page
– about anything
• Many pages not kept up-to-date
• No quality control
– most sites not “peer-reviewed”
– less trustworthy than scholarly publications
– no selection guidelines for search engines
Web Evaluation Techniques
Before you click to view the page...
• Look at the URL - personal page or site?
• Domain name appropriate for the content?
– edu, com, org, net, gov, ca.us, uk, etc.
• Published by an entity that makes sense?
• News from its source?
• Advice from a valid agency?


Web Evaluation Techniques
Scan the perimeter of the page
• Can you tell who wrote it?
– name of page author
– organization, institution, agency you recognize
– e-mail contact by itself not enough
• Credentials for the subject matter?
– Look for links to: “About us”, “Philosophy”, “Background”, “Biography”
• Is it recent or current enough?
– Look for “last updated” date - usually at the bottom
• If no links or other clues...
– truncate back the URL
Web Evaluation Techniques
Indicators of quality
• Sources documented
– links, footnotes, etc.
– As detailed as you expect in print publications?
– Do the links work?
• Information retyped or forged?
– Why not a link to the published version instead?
• Links to other resources
– biased, slanted?
Web Evaluation Techniques
What Do Others Say?
• Search the URL in alexa.com
– Who links to the site? Who owns the domain?
– Type or paste the URL into the basic search box
– Traffic data for the top 100,000 sites
• See what links are in Google’s Similar pages
• Look up the page author in Google
Web Evaluation Techniques
STEP BACK & ASK: Does it all add up?
• Why was the page put on the Web?
– to inform with facts and data?
– to explain, persuade?
– to sell, attract?
– to share, disclose?
– as a parody or satire?
• Is it appropriate for your purpose?
Thank you !!
