Unit 1
Information Retrieval
• Boolean Retrieval
Sec. 1.1
Term-document incidence matrices
Incidence vectors
• So we have a 0/1 vector for each term.
• To answer the query Brutus AND Caesar AND NOT Calpurnia: take the vectors for
Brutus, Caesar, and Calpurnia (complemented), then AND them bitwise.
– 110100 AND
– 110111 AND
– 101111 =
– 100100
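A hedged sketch of this bitwise AND in code (a toy example, not from the slides; the six-bit vectors are the ones shown above, one bit per document):

# Answer "Brutus AND Caesar AND NOT Calpurnia" on 0/1 incidence vectors,
# represented here as Python integers so we can use bitwise operators.
brutus    = 0b110100
caesar    = 0b110111
calpurnia = 0b010000
mask = (1 << 6) - 1                            # six documents in the toy collection
answer = brutus & caesar & (~calpurnia & mask)
print(format(answer, "06b"))                   # -> 100100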
Sec. 1.1
Answers to query
Sec. 1.1
Bigger collections
• Consider N = 1 million documents, each with
about 1000 words.
• Avg 6 bytes/word including
spaces/punctuation
– 6GB of data in the documents.
• Say there are M = 500K distinct terms among
these.
Sec. 1.1
Inverted index
• For each term t, we must store a list of all
documents that contain t.
– Identify each doc by a docID, a document serial
number
• Can we use fixed-size arrays for this?
Brutus 1 2 4 11 31 45 173 174
Caesar 1 2 4 5 6 16 57 132
Calpurnia 2 31 54 101
Inverted index
• We need variable-size postings lists
– On disk, a continuous run of postings is normal
and best
– In memory, can use linked lists or variable length arrays
• Some tradeoffs in size/ease of insertion
Brutus 1 2 4 11 31 45 173 174
Caesar 1 2 4 5 6 16 57 132
Calpurnia 2 31 54 101
Dictionary (the terms) and postings (the docID lists).
Sorted by docID (more later on why).
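A minimal sketch of building such an index (assumes naive whitespace tokenization; the two example documents are made up):

from collections import defaultdict

def build_index(docs):                      # docs: {docID: text}
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    # postings are kept sorted by docID, which the merge algorithm relies on
    return {term: sorted(ids) for term, ids in index.items()}

index = build_index({1: "Brutus killed Caesar", 2: "Caesar was ambitious"})
print(index["caesar"])                      # -> [1, 2]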
Sec. 1.2
Indexer steps: Tokenizer → Linguistic modules → Indexer, which builds the inverted index (e.g., friend → 2, 4; roman → 1, 2; countryman → 13, 16).
Initial stages of text processing
• Tokenization
– Cut character sequence into word tokens
• Deal with “John’s”, a state-of-the-art solution
• Normalization
– Map text and query term to same form
• You want U.S.A. and USA to match
• Stemming
– We may wish different forms of a root to match
• authorize, authorization
• Stop words
– We may omit very common words (or not)
• the, a, to, of
Sec. 1.2
The resulting index for Doc 1 and Doc 2: terms and counts in the dictionary, with pointers to the lists of docIDs (the postings). Why record frequency? Will discuss later.
Sec. 1.2
IR system implementation
• How do we index efficiently?
• How much storage do we need?
Introduction to Information Retrieval
• Query processing with an inverted index
Sec. 1.3
The merge
• Walk through the two postings
simultaneously, in time linear in the total
number of postings entries
Brutus: 2 4 8 16 32 64 128
Caesar: 1 2 3 5 8 13 21 34
Intersecting two postings lists
(a “merge” algorithm)
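The intersection algorithm, sketched in Python (assuming both postings lists are sorted by docID, as above):

def intersect(p1, p2):
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i]); i += 1; j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return answer

print(intersect([2, 4, 8, 16, 32, 64, 128], [1, 2, 3, 5, 8, 13, 21, 34]))  # -> [2, 8]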
Sec. 1.3
Boolean queries: More general merges
Sec. 1.3
Query optimization
• What is the best order for query
processing?
• Consider a query that is an AND of n terms.
• For each of the n terms, get its postings,
then AND them together.
Brutus 2 4 8 16 32 64 128
Caesar 1 2 3 5 8 16 21 34
Calpurnia 13 16
• Process terms in order of increasing document frequency: start with the smallest postings list (here Calpurnia), since each AND can only shrink the intermediate result (see the sketch below).
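A rough sketch of this heuristic (a toy illustration; the postings are the ones above, and a simple membership test stands in for the real merge):

def and_query(terms, index):
    postings = sorted((index[t] for t in terms), key=len)   # rarest term first
    result = postings[0]
    for p in postings[1:]:
        if not result:                                       # empty result: stop early
            break
        ps = set(p)
        result = [d for d in result if d in ps]              # stand-in for the real merge
    return result

index = {"brutus":    [2, 4, 8, 16, 32, 64, 128],
         "caesar":    [1, 2, 3, 5, 8, 16, 21, 34],
         "calpurnia": [13, 16]}
print(and_query(["brutus", "caesar", "calpurnia"], index))   # -> [16]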
Exercise
• Recommend a query processing order for an AND query over several terms, given each term's document frequency
Does Google use the Boolean model?
▪ On Google, the default interpretation of a query [w1 w2 . . . wn] is w1 AND w2 AND . . . AND wn
▪ Cases where you get hits that do not contain one of the wi :
▪ anchor text
▪ page contains variant of wi (morphology, spelling correction, synonym)
▪ long queries ( n large)
▪ boolean expression generates very few hits
▪ Simple Boolean vs. Ranking of result set
▪ Simple Boolean retrieval returns matching documents in no particular
order.
▪ Google (and most well designed Boolean engines) rank the result set –
they rank good hits (according to some estimator of relevance) higher than
bad hits.
31
Extended Boolean model versus Ranked Retrieval
TOKENS AND TERMS
Slides by Manning, Raghavan, Schutze 7
Introduction to Information Retrieval Sec. 2.2.1
Tokenization
Input: “Friends, Romans, Countrymen”
Output: Tokens
Friends
Romans
Countrymen
A token is a sequence of characters in a document
Each such token is now a candidate for an index
entry, after further processing
Described below
But what are valid tokens to emit?
Slides by Manning, Raghavan, Schutze 8
Introduction to Information Retrieval Sec. 2.2.1
Tokenization
Issues in tokenization:
Finland’s capital
Finland? Finlands? Finland’s?
Hewlett-Packard → Hewlett and Packard as two tokens?
state‐of‐the‐art: break up hyphenated sequence.
co‐education
lowercase, lower‐case, lower case ?
It can be effective to get the user to put in possible hyphens
San Francisco: one token or two?
How do you decide it is one token?
Slides by Manning, Raghavan, Schutze 9
Introduction to Information Retrieval Sec. 2.2.1
Numbers
3/12/91 Mar. 12, 1991 12/3/91
55 B.C.
B‐52
My PGP key is 324a3df234cb23e
(800) 234‐2333
Often have embedded spaces
Older IR systems may not index numbers
But often very useful: think about things like looking up error
codes/stacktraces on the web
(One answer is using n‐grams: Lecture 3)
Will often index “meta‐data” separately
Creation date, format, etc.
Slides by Manning, Raghavan, Schutze 10
Introduction to Information Retrieval Sec. 2.2.1
Tokenization: language issues
French
L'ensemble one token or two?
L ? L’ ? Le ?
Want l’ensemble to match with un ensemble
Until at least 2003, it didn’t on Google
Internationalization!
German noun compounds are not segmented
Lebensversicherungsgesellschaftsangestellter
‘life insurance company employee’
German retrieval systems benefit greatly from a compound splitter
module
Can give a 15% performance boost for German
Slides by Manning, Raghavan, Schutze 11
Introduction to Information Retrieval Sec. 2.2.1
Tokenization: language issues
Chinese and Japanese have no spaces between
words:
莎拉波娃现在居住在美国东南部的佛罗里达。
Not always guaranteed a unique tokenization
Further complicated in Japanese, with multiple
alphabets intermingled
Dates/amounts in multiple formats
フォーチュン500社は情報不足のため時間あた$500K(約6,000万円)
Tokenization: language issues
Arabic (or Hebrew) is basically written right to left,
but with certain items like numbers written left to
right
Words are separated, but letter forms within a word
form complex ligatures
← → ← → ← start
‘Algeria achieved its independence in 1962 after 132
years of French occupation.’
With Unicode, the surface presentation is complex, but the
stored form is straightforward
Slides by Manning, Raghavan, Schutze 13
Introduction to Information Retrieval Sec. 2.2.2
Stop words
With a stop list, you exclude from the dictionary
entirely the commonest words. Intuition:
They have little semantic content: the, a, and, to, be
There are a lot of them: ~30% of postings for top 30 words
But the trend is away from doing this:
Good compression techniques (lecture 5) mean the space for
including stop words in a system is very small
Good query optimization techniques (lecture 7) mean you pay little
at query time for including stop words.
You need them for:
Phrase queries: “King of Denmark”
Various song titles, etc.: “Let it be”, “To be or not to be”
“Relational” queries: “flights to London”
Slides by Manning, Raghavan, Schutze 14
Introduction to Information Retrieval Sec. 2.2.3
Normalization to terms
We need to “normalize” words in indexed text as
well as query words into the same form
We want to match U.S.A. and USA
Result is terms: a term is a (normalized) word type,
which is an entry in our IR system dictionary
We most commonly implicitly define equivalence
classes of terms by, e.g.,
deleting periods to form a term
U.S.A., USA → USA
deleting hyphens to form a term
anti-discriminatory, antidiscriminatory → antidiscriminatory
Slides by Manning, Raghavan, Schutze 15
Introduction to Information Retrieval Sec. 2.2.3
Normalization: other languages
Accents: e.g., French résumé vs. resume.
Umlauts: e.g., German: Tuebingen vs. Tübingen
Should be equivalent
Most important criterion:
How are your users likely to write their queries for these
words?
Even in languages that standardly have accents, users
often may not type them
Often best to normalize to a de‐accented term
Tuebingen, Tübingen, Tubingen → Tubingen
Slides by Manning, Raghavan, Schutze 16
Introduction to Information Retrieval Sec. 2.2.3
Normalization: other languages
Normalization of things like date forms
7月30日 vs. 7/30
Japanese use of kana vs. Chinese characters
Tokenization and normalization may depend on the
language and so is intertwined with language
detection
Morgen will ich in MIT … Is this German “mit”?
Crucial: Need to “normalize” indexed text as well as
query terms into the same form
Slides by Manning, Raghavan, Schutze 17
Introduction to Information Retrieval Sec. 2.2.3
Case folding
Reduce all letters to lower case
exception: upper case in mid‐sentence?
e.g., General Motors
Fed vs. fed
SAIL vs. sail
Often best to lower case everything, since
users will use lowercase regardless of
‘correct’ capitalization…
Google example:
Query C.A.T.
#1 result was for “cat” (well, Lolcats), not Caterpillar Inc.
Slides by Manning, Raghavan, Schutze 18
Introduction to Information Retrieval Sec. 2.2.3
Normalization to terms
An alternative to equivalence classing is to do
asymmetric expansion
An example of where this may be useful
Enter: window Search: window, windows
Enter: windows Search: Windows, windows, window
Enter: Windows Search: Windows
Potentially more powerful, but less efficient
Slides by Manning, Raghavan, Schutze 19
Introduction to Information Retrieval
Thesauri and soundex
How do we handle synonyms and homonyms?
E.g., by hand‐constructed equivalence classes
car = automobile color = colour
We can rewrite to form equivalence‐class terms
When the document contains automobile, index it under car-automobile (and vice-versa)
Or we can expand a query
When the query contains automobile, look under car as well
What about spelling mistakes?
One approach is soundex, which forms equivalence classes
of words based on phonetic heuristics
More in lectures 3 and 9
Slides by Manning, Raghavan, Schutze 20
Introduction to Information Retrieval Sec. 2.2.4
Lemmatization
Reduce inflectional/variant forms to base form
E.g.,
am, are, is → be
car, cars, car's, cars' → car
the boy's cars are different colors → the boy car be different color
Lemmatization implies doing “proper” reduction to
dictionary headword form
Slides by Manning, Raghavan, Schutze 21
Introduction to Information Retrieval Sec. 2.2.4
Stemming
Reduce terms to their “roots” before indexing
“Stemming” suggests crude affix chopping
language dependent
e.g., automate(s), automatic, automation all reduced to
automat.
Porter’s algorithm
Commonest algorithm for stemming English
Results suggest it’s at least as good as other stemming
options
Conventions + 5 phases of reductions
phases applied sequentially
each phase consists of a set of commands
sample convention: Of the rules in a compound command,
select the one that applies to the longest suffix.
Slides by Manning, Raghavan, Schutze 23
Introduction to Information Retrieval Sec. 2.2.4
Typical rules in Porter
sses → ss
ies → i
ational → ate
tional → tion
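A toy sketch applying just these four rules (this is not the Porter stemmer, only an illustration of the longest-suffix convention stated above):

RULES = [("sses", "ss"), ("ational", "ate"), ("tional", "tion"), ("ies", "i")]

def stem_step(word):
    # pick the applicable rule with the longest suffix, per the convention above
    best = max((r for r in RULES if word.endswith(r[0])),
               key=lambda r: len(r[0]), default=None)
    return word[: -len(best[0])] + best[1] if best else word

print(stem_step("caresses"))    # -> caress
print(stem_step("relational"))  # -> relate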
Slides by Manning, Raghavan, Schutze 24
Introduction to Information Retrieval Sec. 2.2.4
Other stemmers
Other stemmers exist, e.g., Lovins stemmer
http://www.comp.lancs.ac.uk/computing/research/stemming/general/lovins.htm
Single‐pass, longest suffix removal (about 250 rules)
Full morphological analysis – at most modest
benefits for retrieval
Do stemming and other normalizations help?
English: very mixed results. Helps recall but harms precision
operative (dentistry) ⇒ oper
operational (research) ⇒ oper
operating (systems) ⇒ oper
Definitely useful for Spanish, German, Finnish, …
30% performance gains for Finnish!
Slides by Manning, Raghavan, Schutze 25
Introduction to Information Retrieval Sec. 2.2.4
Language‐specificity
Many of the above features embody transformations
that are
Language‐specific and
Often, application‐specific
These are “plug‐in” addenda to the indexing process
Both open source and commercial plug‐ins are
available for handling these
Slides by Manning, Raghavan, Schutze 26
Introduction to Information Retrieval Sec. 2.2
Dictionary entries – first cut
ensemble.french
時間.japanese
tokenization.english
Slides by Manning, Raghavan, Schutze 27
Introduction to Information Retrieval
FASTER POSTINGS MERGES:
SKIP POINTERS/SKIP LISTS
Slides by Manning, Raghavan, Schutze 28
Introduction to Information Retrieval Sec. 2.3
Recall basic merge
Walk through the two postings simultaneously, in
time linear in the total number of postings entries
Brutus: 2 4 8 41 48 64 128
Caesar: 1 2 3 8 11 17 21 31
(docIDs found in common so far: 2, 8)
Can we do better?
Yes (if index isn’t changing too fast).
Slides by Manning, Raghavan, Schutze 29
Introduction to Information Retrieval Sec. 2.3
Augment postings with skip pointers
(at indexing time)
Brutus: 2 4 8 41 48 64 128 (skip pointers: 2 → 41, 41 → 128)
Caesar: 1 2 3 8 11 17 21 31 (skip pointers: 1 → 11, 11 → 31)
Why?
To skip postings that will not figure in the search
results.
How?
Where do we place skip pointers?
Slides by Manning, Raghavan, Schutze 30
Introduction to Information Retrieval Sec. 2.3
Query processing with skip pointers
Use the same postings lists and skip pointers as above: when the current docIDs differ, follow a skip pointer on the smaller side if its target does not overshoot the other list's current docID.
Where do we place skips?
Tradeoff:
More skips → shorter skip spans → more likely to skip. But lots of comparisons to skip pointers.
Fewer skips → fewer pointer comparisons, but then long skip spans → few successful skips.
Slides by Manning, Raghavan, Schutze 32
Introduction to Information Retrieval Sec. 2.3
Placing skips
Simple heuristic: for postings of length L, use √L evenly-spaced skip pointers.
This ignores the distribution of query terms.
Easy if the index is relatively static; harder if L keeps
changing because of updates.
This definitely used to help; with modern hardware it may not (Bahle et al. 2002) unless your index is memory-based
The I/O cost of loading a bigger postings list can outweigh the gains from quicker in-memory merging!
Slides by Manning, Raghavan, Schutze 33
Introduction to Information Retrieval
Postings lists intersection with skip
pointers
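A sketch of such an intersection (assumptions: postings sorted by docID, and skips represented as a map from a position to the forward position it jumps to, spaced roughly √L apart as above):

def make_skips(postings):
    step = max(1, round(len(postings) ** 0.5))       # ~sqrt(L) evenly spaced
    return {i: i + step for i in range(0, len(postings) - step, step)}

def intersect_with_skips(p1, skips1, p2, skips2):
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i]); i += 1; j += 1
        elif p1[i] < p2[j]:
            if i in skips1 and p1[skips1[i]] <= p2[j]:
                while i in skips1 and p1[skips1[i]] <= p2[j]:
                    i = skips1[i]                     # take the skip: it cannot overshoot
            else:
                i += 1
        else:
            if j in skips2 and p2[skips2[j]] <= p1[i]:
                while j in skips2 and p2[skips2[j]] <= p1[i]:
                    j = skips2[j]
            else:
                j += 1
    return answer

p1 = [2, 4, 8, 41, 48, 64, 128]
p2 = [1, 2, 3, 8, 11, 17, 21, 31]
print(intersect_with_skips(p1, make_skips(p1), p2, make_skips(p2)))   # -> [2, 8]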
Slides by Manning, Raghavan, Schutze 34
Introduction to Information Retrieval
PHRASE QUERIES AND POSITIONAL
INDEXES
Slides by Manning, Raghavan, Schutze 35
Introduction to Information Retrieval Sec. 2.4
Phrase queries
Want to be able to answer queries such as “stanford
university” – as a phrase
Thus the sentence “I went to university at Stanford”
is not a match.
The concept of phrase queries has proven easily
understood by users; one of the few “advanced search”
ideas that works
Many more queries are implicit phrase queries
For this, it no longer suffices to store only
<term : docs> entries
Slides by Manning, Raghavan, Schutze 36
Introduction to Information Retrieval Sec. 2.4.1
A first attempt: Biword indexes
Index every consecutive pair of terms in the text as a
phrase
For example the text “Friends, Romans,
Countrymen” would generate the biwords
friends romans
romans countrymen
Each of these biwords is now a dictionary term
Two‐word phrase query‐processing is now
immediate.
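A minimal sketch of generating biwords from a token stream (toy example):

def biwords(tokens):
    return [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]

print(biwords(["friends", "romans", "countrymen"]))
# -> ['friends romans', 'romans countrymen']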
Slides by Manning, Raghavan, Schutze 37
Introduction to Information Retrieval Sec. 2.4.1
Longer phrase queries
Longer phrases are processed as we did with wild‐
cards:
stanford university palo alto can be broken into the
Boolean query on biwords:
stanford university AND university palo AND palo alto
Without the docs, we cannot verify that the docs
matching the above Boolean query do contain the
phrase.
Can have false positives!
Slides by Manning, Raghavan, Schutze 38
Introduction to Information Retrieval Sec. 2.4.1
Extended biwords
Parse the indexed text and perform part‐of‐speech‐tagging
(POST).
Bucket the terms into (say) Nouns (N) and
articles/prepositions (X).
Call any string of terms of the form NX*N an extended biword.
Each such extended biword is now made a term in the
dictionary.
Example: catcher in the rye
N X X N
Query processing: parse it into N’s and X’s
Segment query into enhanced biwords
Look up in index: catcher rye
Slides by Manning, Raghavan, Schutze 39
Introduction to Information Retrieval Sec. 2.4.1
Issues for biword indexes
False positives, as noted before
Index blowup due to bigger dictionary
Infeasible for more than biwords, big even for them
Biword indexes are not the standard solution (for all
biwords) but can be part of a compound strategy
Slides by Manning, Raghavan, Schutze 40
Introduction to Information Retrieval Sec. 2.4.2
Solution 2: Positional indexes
In the postings, store for each term the position(s) in
which tokens of it appear:
<term, number of docs containing term;
doc1: position1, position2 … ;
doc2: position1, position2 … ;
etc.>
Slides by Manning, Raghavan, Schutze 41
Introduction to Information Retrieval Sec. 2.4.2
Positional index example
<be: 993427;
 1: 7, 18, 33, 72, 86, 231;
 2: 3, 149;
 4: 17, 191, 291, 430, 434;
 5: 363, 367, …>
Which of docs 1, 2, 4, 5 could contain “to be or not to be”?
For phrase queries, we use a merge algorithm
recursively at the document level
But we now need to deal with more than just
equality
Slides by Manning, Raghavan, Schutze 42
Introduction to Information Retrieval Sec. 2.4.2
Processing a phrase query
Extract inverted index entries for each distinct term:
to, be, or, not.
Merge their doc:position lists to enumerate all
positions with “to be or not to be”.
to:
2:1,17,74,222,551; 4:8,16,190,429,433; 7:13,23,191; ...
be:
1:17,19; 4:17,191,291,430,434; 5:14,19,101; ...
Same general method for proximity searches
Slides by Manning, Raghavan, Schutze 43
Introduction to Information Retrieval Sec. 2.4.2
Proximity queries
LIMIT! /3 STATUTE /3 FEDERAL /2 TORT
Again, here, /k means “within k words of”.
Clearly, positional indexes can be used for such
queries; biword indexes cannot.
Exercise: Adapt the linear merge of postings to
handle proximity queries. Can you make it work for
any value of k?
This is a little tricky to do correctly and efficiently
There’s likely to be a problem on it!
Slides by Manning, Raghavan, Schutze 44
Introduction to Information Retrieval
An algorithm for proximity intersection
of postings lists p1 and p2
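Since the algorithm figure is not reproduced here, a rough sketch follows. It checks every position pair per document (quadratic in the number of positions), whereas the textbook algorithm walks the two position lists linearly; the index layout is the <term, doc: positions> form described above:

def proximity(index, term1, term2, k):
    """Return {docID: list of (pos1, pos2)} where the terms occur within k positions."""
    hits = {}
    d1, d2 = index.get(term1, {}), index.get(term2, {})
    for doc in sorted(d1.keys() & d2.keys()):            # docs containing both terms
        pairs = [(i, j) for i in d1[doc] for j in d2[doc] if abs(i - j) <= k]
        if pairs:
            hits[doc] = pairs
    return hits

index = {"to": {2: [1, 17, 74, 222, 551], 4: [8, 16, 190, 429, 433]},
         "be": {1: [17, 19], 4: [17, 191, 291, 430, 434]}}
print(proximity(index, "to", "be", 1))   # phrase-like: "to" adjacent to "be"
# -> {4: [(16, 17), (190, 191), (429, 430), (433, 434)]}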
Slides by Manning, Raghavan, Schutze 45
Introduction to Information Retrieval Sec. 2.4.2
Positional index size
You can compress position values/offsets: we’ll talk
about that in lecture 5
Nevertheless, a positional index expands postings
storage substantially
Nevertheless, a positional index is now standardly
used because of the power and usefulness of phrase
and proximity queries … whether used explicitly or
implicitly in a ranking retrieval system.
Slides by Manning, Raghavan, Schutze 46
Introduction to Information Retrieval Sec. 2.4.2
Positional index size
Need an entry for each occurrence, not just once per
document
Index size depends on average document size Why?
Average web page has <1000 terms
SEC filings, books, even some epic poems … easily 100,000
terms
Consider a term with frequency 0.1%
Document size   Postings   Positional postings
1,000           1          1
100,000         1          100
Slides by Manning, Raghavan, Schutze 47
Introduction to Information Retrieval Sec. 2.4.2
Rules of thumb
A positional index is 2–4 times as large as a non-positional index
Positional index size 35–50% of volume of original
text
Caveat: all of this holds for “English‐like” languages
Slides by Manning, Raghavan, Schutze 48
Introduction to Information Retrieval Sec. 2.4.3
Combination schemes
These two approaches can be profitably
combined
For particular phrases (“Michael Jackson”, “Britney
Spears”) it is inefficient to keep on merging positional
postings lists
Even more so for phrases like “The Who”
Williams et al. (2004) evaluate a more
sophisticated mixed indexing scheme
A typical web query mixture was executed in ¼ of the
time of using just a positional index
It required 26% more space than having a positional index alone
Slides by Manning, Raghavan, Schutze 49
Introduction to Information Retrieval
Dictionaries and tolerant retrieval
Slides by Manning, Raghavan, Schutze 1
Introduction to Information Retrieval Ch. 2
Recap of the previous lecture
The type/token distinction
Terms are normalized types put in the dictionary
Tokenization problems:
Hyphens, apostrophes, compounds, CJK
Term equivalence classing:
Numbers, case folding, stemming, lemmatization
Skip pointers
Encoding a tree‐like structure in a postings list
Biword indexes for phrases
Positional indexes for phrases/proximity queries
Slides by Manning, Raghavan, Schutze 2
Introduction to Information Retrieval Ch. 3
This lecture
Dictionary data structures
“Tolerant” retrieval
Wild‐card queries
Spelling correction
Soundex
Slides by Manning, Raghavan, Schutze 3
Introduction to Information Retrieval Sec. 3.1
Dictionary data structures for inverted
indexes
The dictionary data structure stores the term
vocabulary, document frequency, pointers to each
postings list … in what data structure?
Slides by Manning, Raghavan, Schutze 4
Introduction to Information Retrieval Sec. 3.1
A naïve dictionary
An array of structs, one per term:
term (char[20], 20 bytes) | document frequency (int, 4/8 bytes) | pointer to postings list (Postings *, 4/8 bytes)
How do we store a dictionary in memory efficiently?
How do we quickly look up elements at query time?
Slides by Manning, Raghavan, Schutze 5
Introduction to Information Retrieval Sec. 3.1
Dictionary data structures
Two main choices:
Hashtables
Trees
Some IR systems use hashtables, some trees
Slides by Manning, Raghavan, Schutze 6
Introduction to Information Retrieval Sec. 3.1
Hashtables
Each vocabulary term is hashed to an integer
(We assume you’ve seen hashtables before)
Pros:
Lookup is faster than for a tree: O(1)
Cons:
No easy way to find minor variants:
judgment/judgement
No prefix search [tolerant retrieval]
If vocabulary keeps growing, need to occasionally do the
expensive operation of rehashing everything
Slides by Manning, Raghavan, Schutze 7
Introduction to Information Retrieval Sec. 3.1
Tree: binary tree
The root splits the vocabulary into an a-m subtree and an n-z subtree.
Slides by Manning, Raghavan, Schutze 8
Introduction to Information Retrieval Sec. 3.1
Tree: B‐tree
Example: an internal node with children covering a-hu, hy-m, and n-z.
Definition: Every internal node has a number of children in the interval [a,b] where a, b are appropriate natural numbers, e.g., [2,4].
Slides by Manning, Raghavan, Schutze 9
Introduction to Information Retrieval Sec. 3.1
Trees
Simplest: binary tree
More usual: B‐trees
Trees require a standard ordering of characters and hence
strings … but we typically have one
Pros:
Solves the prefix problem (terms starting with hyp)
Cons:
Slower: O(log M) [and this requires a balanced tree]
Rebalancing binary trees is expensive
But B‐trees mitigate the rebalancing problem
Slides by Manning, Raghavan, Schutze 10
Introduction to Information Retrieval
WILD‐CARD QUERIES
Slides by Manning, Raghavan, Schutze 11
Introduction to Information Retrieval Sec. 3.2
Wild‐card queries: *
mon*: find all docs containing any word beginning
with “mon”.
Easy with binary tree (or B‐tree) lexicon: retrieve all
words in range: mon ≤ w < moo
*mon: find words ending in “mon”: harder
Maintain an additional B‐tree for terms backwards.
Can retrieve all words in range: nom ≤ w < non.
Query processing
At this point, we have an enumeration of all terms in
the dictionary that match the wild‐card query.
We still have to look up the postings for each
enumerated term.
E.g., consider the query:
se*ate AND fil*er
This may result in the execution of many Boolean
AND queries.
Slides by Manning, Raghavan, Schutze 13
Introduction to Information Retrieval Sec. 3.2
B‐trees handle *’s at the end of a
query term
How can we handle *’s in the middle of query term?
co*tion
We could look up co* AND *tion in a B‐tree and
intersect the two term sets
Expensive
The solution: transform wild‐card queries so that the
*’s occur at the end
This gives rise to the Permuterm Index.
Slides by Manning, Raghavan, Schutze 14
Introduction to Information Retrieval Sec. 3.2.1
Permuterm index
For term hello, index under:
hello$, ello$h, llo$he, lo$hel, o$hell
where $ is a special symbol.
Queries:
X → lookup on X$
X* → lookup on $X*
*X → lookup on X$*
*X* → lookup on X*
X*Y → lookup on Y$X*
X*Y*Z → ??? Exercise!
Query = hel*o
X=hel, Y=o
Lookup o$hel*
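A small sketch of the rotations and of rotating a single-* query (with $ as the end-of-word symbol, as above):

def rotations(term):
    t = term + "$"
    return [t[i:] + t[:i] for i in range(len(t))]

def permuterm_query(wildcard):             # a single '*' is assumed
    t = wildcard + "$"
    star = t.index("*")
    return t[star + 1:] + t[:star] + "*"   # rotate so the '*' ends up last

print(rotations("hello"))        # hello$, ello$h, llo$he, lo$hel, o$hell, $hello
print(permuterm_query("hel*o"))  # -> o$hel*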
Slides by Manning, Raghavan, Schutze 15
Introduction to Information Retrieval Sec. 3.2.1
Permuterm query processing
Rotate query wild‐card to the right
Now use B‐tree lookup as before.
Permuterm problem: ≈ quadruples lexicon size
Slides by Manning, Raghavan, Schutze 16
Introduction to Information Retrieval Sec. 3.2.2
Bigram (k‐gram) indexes
Enumerate all k‐grams (sequence of k chars)
occurring in any term
e.g., from text “April is the cruelest month” we get
the 2‐grams (bigrams)
$a,ap,pr,ri,il,l$,$i,is,s$,$t,th,he,e$,$c,cr,ru,
ue,el,le,es,st,t$, $m,mo,on,nt,h$
$ is a special word boundary symbol
Maintain a second inverted index from bigrams to
dictionary terms that match each bigram.
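A minimal sketch of k-gram enumeration and the k-gram-to-term index (k = 2, with $ as the word boundary symbol):

from collections import defaultdict

def kgrams(term, k=2):
    t = f"${term}$"
    return [t[i:i + k] for i in range(len(t) - k + 1)]

def build_kgram_index(terms, k=2):
    idx = defaultdict(set)
    for term in terms:
        for g in kgrams(term, k):
            idx[g].add(term)
    return idx

print(kgrams("april"))                                    # -> ['$a', 'ap', 'pr', 'ri', 'il', 'l$']
idx = build_kgram_index(["among", "amortize", "along"])
print(sorted(idx["mo"]))                                  # -> ['among', 'amortize']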
Slides by Manning, Raghavan, Schutze 17
Introduction to Information Retrieval Sec. 3.2.2
Bigram index example
The k‐gram index finds terms based on a query
consisting of k‐grams (here k=2).
$m → mace, madden
mo → among, amortize
on → along, among
Slides by Manning, Raghavan, Schutze 18
Introduction to Information Retrieval Sec. 3.2.2
Processing wild‐cards
Query mon* can now be run as
$m AND mo AND on
Gets terms that match AND version of our wildcard
query.
But we’d enumerate moon.
Must post‐filter these terms against query.
Surviving enumerated terms are then looked up in
the term‐document inverted index.
Fast, space efficient (compared to permuterm).
Slides by Manning, Raghavan, Schutze 19
Introduction to Information Retrieval Sec. 3.2.2
Processing wild‐card queries
As before, we must execute a Boolean query for each
enumerated, filtered term.
Wild‐cards can result in expensive query execution
(very large disjunctions…)
pyth* AND prog*
If you encourage “laziness” people will respond!
Search
Type your search terms, use ‘*’ if you need to.
E.g., Alex* will match Alexander.
Which web search engines allow wildcard queries?
Slides by Manning, Raghavan, Schutze 20
Introduction to Information Retrieval
SPELLING CORRECTION
Slides by Manning, Raghavan, Schutze 21
Introduction to Information Retrieval Sec. 3.3
Spell correction
Two principal uses
Correcting document(s) being indexed
Correcting user queries to retrieve “right” answers
Two main flavors:
Isolated word
Check each word on its own for misspelling
Will not catch typos resulting in correctly spelled words
e.g., from form
Context‐sensitive
Look at surrounding words,
e.g., I flew form Heathrow to Narita.
Slides by Manning, Raghavan, Schutze 22
Introduction to Information Retrieval Sec. 3.3
Document correction
Especially needed for OCR’ed documents
Correction algorithms are tuned for this: rn/m
Can use domain‐specific knowledge
E.g., OCR can confuse O and D more often than it would confuse O
and I (adjacent on the QWERTY keyboard, so more likely
interchanged in typing).
But also: web pages and even printed material have
typos
Goal: the dictionary contains fewer misspellings
But often we don’t change the documents and
instead fix the query‐document mapping
Slides by Manning, Raghavan, Schutze 23
Introduction to Information Retrieval Sec. 3.3
Query mis‐spellings
Our principal focus here
E.g., the query Alanis Morisett
We can either
Retrieve documents indexed by the correct spelling, OR
Return several suggested alternative queries with the
correct spelling
Did you mean … ?
Slides by Manning, Raghavan, Schutze 24
Introduction to Information Retrieval Sec. 3.3.2
Isolated word correction
Fundamental premise – there is a lexicon from which
the correct spellings come
Two basic choices for this
A standard lexicon such as
Webster’s English Dictionary
An “industry‐specific” lexicon – hand‐maintained
The lexicon of the indexed corpus
E.g., all words on the web
All names, acronyms etc.
(Including the mis‐spellings)
Slides by Manning, Raghavan, Schutze 25
Introduction to Information Retrieval Sec. 3.3.2
Isolated word correction
Given a lexicon and a character sequence Q, return
the words in the lexicon closest to Q
What’s “closest”?
We’ll study several alternatives
Edit distance (Levenshtein distance)
Weighted edit distance
n‐gram overlap
Slides by Manning, Raghavan, Schutze 26
Introduction to Information Retrieval Sec. 3.3.3
Edit distance
Given two strings S1 and S2, the minimum number of
operations to convert one to the other
Operations are typically character‐level
Insert, Delete, Replace, (Transposition)
E.g., the edit distance from dof to dog is 1
From cat to act is 2 (Just 1 with transpose.)
from cat to dog is 3.
Generally found by dynamic programming.
See http://www.merriampark.com/ld.htm for a nice
example plus an applet.
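A sketch of the dynamic program for (unweighted) Levenshtein distance:

def edit_distance(s1, s2):
    m, n = len(s1), len(s2)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i                       # delete all of s1[:i]
    for j in range(n + 1):
        d[0][j] = j                       # insert all of s2[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s1[i - 1] == s2[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # delete
                          d[i][j - 1] + 1,        # insert
                          d[i - 1][j - 1] + cost) # replace / match
    return d[m][n]

print(edit_distance("dof", "dog"), edit_distance("cat", "act"), edit_distance("cat", "dog"))
# -> 1 2 3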
Slides by Manning, Raghavan, Schutze 27
Introduction to Information Retrieval Sec. 3.3.3
Weighted edit distance
As above, but the weight of an operation depends on
the character(s) involved
Meant to capture OCR or keyboard errors
Example: m more likely to be mis‐typed as n than as q
Therefore, replacing m by n is a smaller edit distance than
by q
This may be formulated as a probability model
Requires weight matrix as input
Modify dynamic programming to handle weights
Slides by Manning, Raghavan, Schutze 28
Introduction to Information Retrieval Sec. 3.3.4
Using edit distances
Given query, first enumerate all character sequences
within a preset (weighted) edit distance (e.g., 2)
Intersect this set with list of “correct” words
Show terms you found to user as suggestions
Alternatively,
We can look up all possible corrections in our inverted
index and return all docs … slow
We can run with a single most likely correction
The alternatives disempower the user, but save a
round of interaction with the user
Slides by Manning, Raghavan, Schutze 29
Introduction to Information Retrieval Sec. 3.3.4
Edit distance to all dictionary terms?
Given a (mis‐spelled) query – do we compute its edit
distance to every dictionary term?
Expensive and slow
Alternative?
How do we cut the set of candidate dictionary
terms?
One possibility is to use n‐gram overlap for this
This can also be used by itself for spelling correction.
Slides by Manning, Raghavan, Schutze 30
Introduction to Information Retrieval Sec. 3.3.4
n‐gram overlap
Enumerate all the n‐grams in the query string as well
as in the lexicon
Use the n‐gram index (recall wild‐card search) to
retrieve all lexicon terms matching any of the query
n‐grams
Threshold by number of matching n‐grams
Variants – weight by keyboard layout, etc.
Slides by Manning, Raghavan, Schutze 31
Introduction to Information Retrieval Sec. 3.3.4
Example with trigrams
Suppose the text is november
Trigrams are nov, ove, vem, emb, mbe, ber.
The query is december
Trigrams are dec, ece, cem, emb, mbe, ber.
So 3 trigrams overlap (of 6 in each term)
How can we turn this into a normalized measure of
overlap?
Slides by Manning, Raghavan, Schutze 32
Introduction to Information Retrieval Sec. 3.3.4
One option – Jaccard coefficient
A commonly‐used measure of overlap
Let X and Y be two sets; then the Jaccard coefficient is
|X ∩ Y| / |X ∪ Y|
Equals 1 when X and Y have the same elements and
zero when they are disjoint
X and Y don’t have to be of the same size
Always assigns a number between 0 and 1
Now threshold to decide if you have a match
E.g., if J.C. > 0.8, declare a match
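A small sketch computing the Jaccard coefficient of two trigram sets, using the november/december example above (word-internal trigrams only):

def trigrams(word):
    return {word[i:i + 3] for i in range(len(word) - 2)}

def jaccard(x, y):
    return len(x & y) / len(x | y)

print(jaccard(trigrams("november"), trigrams("december")))
# -> 0.333... (3 shared trigrams out of 9 distinct)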
Slides by Manning, Raghavan, Schutze 33
Introduction to Information Retrieval Sec. 3.3.4
Matching trigrams
Consider the query lord – we wish to identify words
matching 2 of its 3 bigrams (lo, or, rd)
Context‐sensitive spell correction
Text: I flew from Heathrow to Narita.
Consider the phrase query “flew form Heathrow”
We’d like to respond
Did you mean “flew from Heathrow”?
because no docs matched the query phrase.
Slides by Manning, Raghavan, Schutze 35
Introduction to Information Retrieval Sec. 3.3.5
Context‐sensitive correction
Need surrounding context to catch this.
First idea: retrieve dictionary terms close (in
weighted edit distance) to each query term
Now try all possible resulting phrases with one word
“fixed” at a time
flew from heathrow
fled form heathrow
flea form heathrow
Hit‐based spelling correction: Suggest the
alternative that has lots of hits.
Slides by Manning, Raghavan, Schutze 36
Introduction to Information Retrieval Sec. 3.3.5
Exercise
Suppose that for “flew form Heathrow” we have 7
alternatives for flew, 19 for form and 3 for heathrow.
How many “corrected” phrases will we enumerate in
this scheme?
Slides by Manning, Raghavan, Schutze 37
Introduction to Information Retrieval Sec. 3.3.5
Another approach
Break phrase query into a conjunction of biwords
(Lecture 2).
Look for biwords that need only one term corrected.
Enumerate only phrases containing “common”
biwords.
Slides by Manning, Raghavan, Schutze 38
Introduction to Information Retrieval Sec. 3.3.5
General issues in spell correction
We enumerate multiple alternatives for “Did you
mean?”
Need to figure out which to present to the user
The alternative hitting most docs
Query log analysis
More generally, rank alternatives probabilistically:
argmax_corr P(corr | query)
From Bayes’ rule, this is equivalent to:
argmax_corr P(query | corr) · P(corr)
SOUNDEX
Slides by Manning, Raghavan, Schutze 40
Introduction to Information Retrieval Sec. 3.4
Soundex
Class of heuristics to expand a query into phonetic
equivalents
Language specific – mainly for names
E.g., chebyshev → tchebycheff
Invented for the U.S. census … in 1918
Slides by Manning, Raghavan, Schutze 41
Introduction to Information Retrieval Sec. 3.4
Soundex – typical algorithm
Turn every token to be indexed into a 4‐character
reduced form
Do the same with query terms
Build and search an index on the reduced forms
(when the query calls for a soundex match)
http://www.creativyst.com/Doc/Articles/SoundEx1/SoundEx1.htm#Top
Slides by Manning, Raghavan, Schutze 42
Introduction to Information Retrieval Sec. 3.4
Soundex – typical algorithm
1. Retain the first letter of the word.
2. Change all occurrences of the following letters to '0'
(zero):
'A', 'E', 'I', 'O', 'U', 'H', 'W', 'Y'.
3. Change letters to digits as follows:
B, F, P, V → 1
C, G, J, K, Q, S, X, Z → 2
D, T → 3
L → 4
M, N → 5
R → 6
Slides by Manning, Raghavan, Schutze 43
Introduction to Information Retrieval Sec. 3.4
Soundex continued
4. Repeatedly remove one out of each pair of consecutive identical digits.
5. Remove all zeros from the resulting string.
6. Pad the resulting string with trailing zeros and
return the first four positions, which will be of the
form <uppercase letter> <digit> <digit> <digit>.
E.g., Herman becomes H655.
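A sketch that follows the steps above (the full Soundex algorithm has a few more edge cases, e.g., when the letter right after the first has the same code):

CODES = {**dict.fromkeys("aeiouhwy", "0"),
         **dict.fromkeys("bfpv", "1"), **dict.fromkeys("cgjkqsxz", "2"),
         **dict.fromkeys("dt", "3"), "l": "4",
         **dict.fromkeys("mn", "5"), "r": "6"}

def soundex(word):
    word = word.lower()
    digits = [CODES.get(ch, "0") for ch in word]
    # remove one of each pair of consecutive identical digits
    collapsed = [d for i, d in enumerate(digits) if i == 0 or d != digits[i - 1]]
    body = "".join(d for d in collapsed[1:] if d != "0")   # drop zeros; keep first letter
    return (word[0].upper() + body + "000")[:4]

print(soundex("Herman"))   # -> H655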
Soundex
Soundex is the classic algorithm, provided by most
databases (Oracle, Microsoft, …)
How useful is soundex?
Not very – for information retrieval
Okay for “high recall” tasks (e.g., Interpol), though
biased to names of certain nationalities
Zobel and Dart (1996) show that other algorithms for
phonetic matching perform much better in the
context of IR
Slides by Manning, Raghavan, Schutze 45
Introduction to Information Retrieval
What queries can we process?
We have
Positional inverted index with skip pointers
Wild‐card index
Spell‐correction
Soundex
Queries such as
(SPELL(moriset) /3 toron*to) OR SOUNDEX(chaikofski)
Slides by Manning, Raghavan, Schutze 46
Introduction to Information Retrieval
Exercise
Draw yourself a diagram showing the various indexes
in a search engine incorporating all the functionality
we have talked about
Identify some of the key design choices in the index
pipeline:
Does stemming happen before the Soundex index?
What about n‐grams?
Given a query, how would you parse and dispatch
sub‐queries to the various indexes?
Slides by Manning, Raghavan, Schutze 47
Index construction
How do we construct an index?
What strategies can we use with limited
main memory?
Hardware basics
Many design decisions in information retrieval
are based on the characteristics of hardware
We begin by reviewing hardware basics
Hardware basics
Access to data in memory is much faster than
access to data on disk.
Disk seeks: No data is transferred from disk while
the disk head is being positioned.
Therefore: Transferring one large chunk of data
from disk to memory is faster than transferring
many small chunks.
Disk I/O is block-based: Reading and writing of
entire blocks (as opposed to smaller chunks).
Block sizes: 8KB to 256 KB.
Hardware basics
Servers used in IR systems now typically have
several GB of main memory, sometimes tens of
GB.
Available disk space is several (2–3) orders of
magnitude larger.
Fault tolerance is very expensive: It’s much
cheaper to use many regular machines rather
than one fault tolerant machine.
Hardware assumptions
symbol   statistic                                            value
s        average seek time                                    5 ms = 5 × 10^-3 s
b        transfer time per byte                               0.02 μs = 2 × 10^-8 s
         processor’s clock rate                               10^9 s^-1
p        low-level operation (e.g., compare & swap a word)    0.01 μs = 10^-8 s
         size of main memory                                  several GB
         size of disk space                                   1 TB or more
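A quick back-of-the-envelope check with these assumed numbers, illustrating why one large transfer beats many small ones:

seek = 5e-3          # s per seek
per_byte = 2e-8      # s per byte transferred

contiguous = seek + 10_000_000 * per_byte              # read 10 MB in one chunk
scattered  = 10_000 * (seek + 1_000 * per_byte)        # read 10 MB as 10,000 x 1 KB
print(f"{contiguous:.2f} s vs {scattered:.2f} s")      # ~0.21 s vs ~50.2 s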
RCV1: Our collection for this
lecture
Shakespeare’s collected works definitely aren’t large
enough for demonstrating many of the points in this
course.
The collection we’ll use isn’t really large enough
either, but it’s publicly available and is at least a more
plausible example.
As an example for applying scalable index
construction algorithms, we will use the Reuters
RCV1 collection.
http://trec.nist.gov/data/reuters/reuters.html
This is one year of Reuters newswire (part of 1995
and 1996)
A Reuters RCV1 document
Reuters RCV1 statistics
symbol statistic value
N documents 800,000
L avg. # tokens per doc 200
M terms (= word types) 400,000
avg. # bytes per token 6
(incl. spaces/punctuation)
avg. # bytes per token 4.5
(without spaces/punctuation)
avg. # bytes per term 7.5
T non-positional postings 100,000,000
4.5 bytes per word token vs. 7.5 bytes per word type: why?
Recall index construction
Documents are parsed to extract words and these are saved with the document ID.

Doc 1: I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me.
Doc 2: So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious

Term–docID pairs, in order of appearance:
I 1, did 1, enact 1, julius 1, caesar 1, I 1, was 1, killed 1, i' 1, the 1, capitol 1, brutus 1, killed 1, me 1,
so 2, let 2, it 2, be 2, with 2, caesar 2, the 2, noble 2, brutus 2, hath 2, told 2, you 2, caesar 2, was 2, ambitious 2
Key step
After all documents have been parsed, the term–docID pairs are sorted, by term and then by docID:
(I, 1), (did, 1), (enact, 1), (julius, 1), … → sorted → (ambitious, 2), (be, 2), (brutus, 1), (brutus, 2), …
4
Runs being
merged.
Disk
Merge tree
1 run, 100M/run
1 run, 80M/run
2 runs, 40M/run
5 runs, 20M/run
…
1 2 9 10
How to merge the sorted runs?
One option is a series of binary merges, as in the merge tree above. But it is more efficient to do an n-way merge, where you are reading from all blocks simultaneously
Providing you read decent-sized chunks of each
block into memory and then write out a decent-sized
output chunk, then you’re not killed by disk seeks
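A sketch of an n-way merge over sorted runs of (termID, docID) pairs, using a heap (the in-memory lists here stand in for buffered reads from the blocks on disk):

import heapq

runs = [[(1, 2), (3, 1)],           # each run: (termID, docID) pairs, already sorted
        [(1, 5), (2, 7)],
        [(2, 3), (3, 4)]]
merged = list(heapq.merge(*runs))   # lazily merges any number of sorted iterables
print(merged)                       # -> [(1, 2), (1, 5), (2, 3), (2, 7), (3, 1), (3, 4)]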
Remaining problem with sort-based algorithm
Our assumption was: we can keep the dictionary
in memory.
We need the dictionary (which grows
dynamically) in order to implement a term to
termID mapping.
Actually, we could work with term,docID postings
instead of termID,docID postings . . .
. . . but then intermediate files become very large.
(We would end up with a scalable, but very slow
index construction method.)
SPIMI:
Single-pass in-memory indexing
Key idea 1: Generate separate dictionaries for
each block – no need to maintain term-termID
mapping across blocks.
Key idea 2: Don’t sort. Accumulate postings in
postings lists as they occur.
With these two ideas we can generate a
complete inverted index for each block.
These separate indexes can then be merged into
one big index.
SPIMI-Invert
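Since the SPIMI-Invert pseudocode figure is not reproduced here, a rough sketch of inverting one block follows (it assumes the token stream yields (term, docID) pairs with docIDs arriving in increasing order):

from collections import defaultdict

def spimi_invert(token_stream):
    """token_stream yields (term, docID) pairs until the block's memory is full."""
    dictionary = defaultdict(list)           # term -> postings list for this block
    for term, doc_id in token_stream:
        postings = dictionary[term]          # adds the term on first occurrence
        if not postings or postings[-1] != doc_id:
            postings.append(doc_id)          # postings accumulate in docID order
    return sorted(dictionary.items())        # sort terms, then write the block to disk

block = [("caesar", 1), ("brutus", 1), ("caesar", 2), ("noble", 2)]
print(spimi_invert(block))
# -> [('brutus', [1]), ('caesar', [1, 2]), ('noble', [2])]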
Documents
No good answers …
“Rich” documents
(How) Do we index images?
Researchers have devised Query By Image Content (QBIC) systems
“show me a picture similar to this orange circle”
watch for lecture on vector space retrieval
In practice, image search usually based on meta-
data such as file name e.g., monalisa.jpg
New approaches exploit social tagging
E.g., flickr.com
Passage/sentence retrieval
Suppose we want to retrieve not an entire
document matching a query, but only a
passage/sentence - say, in a very long document
Can index passages/sentences as mini-
documents – what should the index units be?
This is the subject of XML search