
Unit 1

The document introduces Boolean retrieval and the inverted index, which are the fundamental concepts underlying modern information retrieval systems. It explains how an inverted index stores a postings list of document IDs for each term, allowing efficient processing of Boolean queries by merging the relevant postings lists. Large collections require variable-length (and ultimately compressed) postings lists rather than fixed-size arrays. The key steps of indexing are tokenization, sorting, and building the dictionary and postings lists from the token stream.


Introduction to

Information Retrieval
• Boolean Retrieval
Sec. 1.1

Unstructured data in 1620


• Which plays of Shakespeare contain the words
Brutus AND Caesar but NOT Calpurnia?
• One could grep all of Shakespeare’s plays for
Brutus and Caesar, then strip out lines containing
Calpurnia?
• Why is that not the answer?
– Slow (for large corpora)
– NOT Calpurnia is non-trivial
– Other operations (e.g., find the word Romans near
countrymen) not feasible
– Ranked retrieval (best documents to return)
• Later lectures
2
Sec. 1.1

Term-document incidence matrices

• Each matrix entry is 1 if the play contains the word, 0 otherwise.
• Example query: Brutus AND Caesar BUT NOT Calpurnia
Sec. 1.1

Incidence vectors
• So we have a 0/1 vector for each term.
• To answer query: take the vectors for Brutus,
Caesar and Calpurnia (complemented) 
bitwise AND.
– 110100 AND
– 110111 AND
– 101111 =
– 100100

4
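For concreteness, here is a minimal Python sketch (not part of the original slides) of the bitwise AND over the 0/1 vectors shown above; the variable names are illustrative.

# Incidence bit vectors from the slide, one bit per play (1 = term occurs in that play).
brutus = 0b110100
caesar = 0b110111
calpurnia_complemented = 0b101111   # NOT Calpurnia

answer = brutus & caesar & calpurnia_complemented
print(format(answer, '06b'))        # -> 100100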
Sec. 1.1

Answers to query

• Antony and Cleopatra, Act III, Scene ii


Agrippa [Aside to DOMITIUS ENOBARBUS]: Why, Enobarbus,
When Antony found Julius Caesar dead,
He cried almost to roaring; and he wept
When at Philippi he found Brutus slain.

• Hamlet, Act III, Scene ii


Lord Polonius: I did enact Julius Caesar I was killed i’ the
Capitol; Brutus killed me.

5
Sec. 1.1

Bigger collections
• Consider N = 1 million documents, each with
about 1000 words.
• Avg 6 bytes/word including
spaces/punctuation
– 6GB of data in the documents.
• Say there are M = 500K distinct terms among
these.

6
Sec. 1.1

Can’t build the matrix


• 500K x 1M matrix has half-a-trillion 0’s and
1’s.
Why?

• But it has no more than one billion 1’s.


– matrix is extremely sparse.

• What’s a better representation?


– We only record the 1 positions.
7
Introduction to
Information Retrieval
• The Inverted Index
• The key data structure underlying
modern IR
Sec. 1.2

Inverted index
• For each term t, we must store a list of all
documents that contain t.
– Identify each doc by a docID, a document serial
number
• Can we use fixed-size arrays for this?
Brutus 1 2 4 11 31 45 173 174

Caesar 1 2 4 5 6 16 57 132

Calpurnia 2 31 54 101

What happens if the word Caesar is added to
document 14?
9
Sec. 1.2

Inverted index
• We need variable-size postings lists
– On disk, a continuous run of postings is normal
and best
– In memory, can use linked lists or variable length
arrays Posting
• Some tradeoffs in size/ease of insertion
Brutus 1 2 4 11 31 45 173 174

Caesar 1 2 4 5 6 16 57 132

Calpurnia 2 31 54 101

Dictionary Postings
Sorted by docID (more later on why). 10
Sec. 1.2

Inverted index construction

• Documents to be indexed:  Friends, Romans, countrymen.
• Tokenizer → token stream:  Friends  Romans  Countrymen
• Linguistic modules → modified tokens:  friend  roman  countryman
• Indexer → inverted index:
      friend     → 2, 4
      roman      → 1, 2
      countryman → 13, 16
Initial stages of text processing
• Tokenization
– Cut character sequence into word tokens
• Deal with “John’s”, a state-of-the-art solution
• Normalization
– Map text and query term to same form
• You want U.S.A. and USA to match
• Stemming
– We may wish different forms of a root to match
• authorize, authorization
• Stop words
– We may omit very common words (or not)
• the, a, to, of
Sec. 1.2

Indexer steps: Token sequence

• Sequence of (Modified token, Document ID) pairs.

Doc 1: I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me.
Doc 2: So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious.
Sec. 1.2

Indexer steps: Sort


• Sort by terms
– And then docID

Core indexing step


Sec. 1.2

Indexer steps: Dictionary & Postings

• Multiple term entries


in a single document
are merged.
• Split into Dictionary
and Postings
• Doc. frequency
information is added.

Why frequency?
Will discuss later.
Sec. 1.2

Where do we pay in storage?

• Terms and counts (the dictionary)
• Pointers from dictionary entries to their postings lists
• Lists of docIDs (the postings)

IR system implementation
• How do we index efficiently?
• How much storage do we need?
16
Introduction to
Information Retrieval
• Query processing with an inverted
index
Sec. 1.3

The index we just built


• How do we process a query? (our focus here)

– Later: what kinds of queries can we process?

Brutus AND Caesar

18
Sec. 1.3

Query processing: AND


• Consider processing the query:
Brutus AND Caesar
– Locate Brutus in the Dictionary;
• Retrieve its postings.
– Locate Caesar in the Dictionary;
• Retrieve its postings.
– “Merge” the two postings (intersect the
document sets):
Brutus:  2 4 8 16 32 64 128
Caesar:  1 2 3 5 8 13 21 34

19
Sec. 1.3

The merge
• Walk through the two postings
simultaneously, in time linear in the total
number of postings entries
Brutus:  2 4 8 16 32 64 128
Caesar:  1 2 3 5 8 13 21 34

If the list lengths are x and y, the merge takes O(x+y)


operations.
Crucial: postings sorted by docID.

20
Intersecting two postings lists
(a “merge” algorithm)

21
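The intersection algorithm figure is not reproduced here. Below is a minimal Python sketch of the linear "merge", assuming each postings list is an ascending list of docIDs; the function name intersect is illustrative, not from the slides.

def intersect(p1, p2):
    # Linear-time intersection of two postings lists sorted by docID.
    answer = []
    i, j = 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return answer

# Example postings from the slide:
brutus = [2, 4, 8, 16, 32, 64, 128]
caesar = [1, 2, 3, 5, 8, 13, 21, 34]
print(intersect(brutus, caesar))   # -> [2, 8]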
Sec. 1.3

Boolean queries: Exact match

• The Boolean retrieval model lets us pose queries that are Boolean expressions:
– Boolean Queries are queries using AND, OR and NOT
to join query terms
• Views each document as a set of words
• Is precise: document matches condition or not.
– Perhaps the simplest model to build an IR system on
• Primary commercial retrieval tool for 3 decades.
• Many search systems you still use are Boolean:
– Email, library catalog, Mac OS X Spotlight
22
Sec. 1.4

Example: WestLaw http://www.westlaw.com/

• Largest commercial (paying subscribers)


legal search service (started 1975; ranking
added 1992; new federated search added
2010)
• Tens of terabytes of data; ~700,000 users
• Majority of users still use boolean queries
• Example query:
– What is the statute of limitations in cases
involving the federal tort claims act?
– LIMIT! /3 STATUTE ACTION /S FEDERAL /2
TORT /3 CLAIM
• /3 = within 3 words, /S = in same sentence
23
Sec. 1.4

Example: WestLaw http://www.westlaw.com/

• Another example query:


– Requirements for disabled people to be able to
access a workplace
– disabl! /p access! /s work-site work-place
(employment /3 place)
• Note that SPACE is disjunction, not conjunction!
• Long, precise queries; proximity operators;
incrementally developed; not like web search
• Many professional searchers still like Boolean
search
– You know exactly what you are getting
• But that doesn’t mean it actually works better….
Sec. 1.3

Boolean queries:
More general merges

• Exercise: Adapt the merge for the query:


Brutus AND NOT Caesar

• Can we still run through the merge in time O(x+y)?
  What can we achieve?

25
Sec. 1.3

Query optimization
• What is the best order for query
processing?
• Consider a query that is an AND of n terms.
• For each of the n terms, get its postings,
then AND them together.
Brutus 2 4 8 16 32 64 128

Caesar 1 2 3 5 8 16 21 34

Calpurnia 13 16

Query: Brutus AND Calpurnia AND Caesar


26
Sec. 1.3

Query optimization example


• Process in order of increasing freq :
– start with smallest set, then keep cutting further.

This is why we kept


document freq. in dictionary

Brutus 2 4 8 16 32 64 128

Caesar 1 2 3 5 8 16 21 34

Calpurnia 13 16

Execute the query as ( Calpurnia AND Brutus) AND Caesar.


27
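A small Python sketch of this ordering heuristic (not from the slides), assuming a dictionary `index` that maps each term to its sorted postings list; a simple set-based intersection stands in for the linear merge above.

def intersect_sorted(p1, p2):
    # Stand-in for the linear merge shown earlier (set-based here for brevity).
    s2 = set(p2)
    return [d for d in p1 if d in s2]

def intersect_many(terms, index):
    # AND together the postings of several terms, processing the shortest lists first.
    postings = sorted((index[t] for t in terms), key=len)
    result = postings[0]
    for p in postings[1:]:
        if not result:          # empty intermediate result: we can stop early
            break
        result = intersect_sorted(result, p)
    return result

index = {
    'brutus':    [2, 4, 8, 16, 32, 64, 128],
    'caesar':    [1, 2, 3, 5, 8, 16, 21, 34],
    'calpurnia': [13, 16],
}
print(intersect_many(['brutus', 'calpurnia', 'caesar'], index))   # -> [16]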
Sec. 1.3

More general optimization


• e.g., (madding OR crowd) AND (ignoble OR strife)
• Get doc. freq.’s for all terms.
• Estimate the size of each OR by the sum of its
doc. freq.’s (conservative).
• Process in increasing order of OR sizes.

28
Exercise

• Recommend a query
processing order for

(tangerine OR trees) AND


(marmalade OR skies) AND
(kaleidoscope OR eyes)

• Which two terms should we


process first?

29
Does Google use the Boolean model?
▪ On Google, the default interpretation of a query [ w1 w2 . . . wn] is w1 AND w2 AND . . .
AND wn
▪ Cases where you get hits that do not contain one of the wi :
▪ anchor text
▪ page contains variant of wi (morphology, spelling correction, synonym)
▪ long queries ( n large)
▪ boolean expression generates very few hits
▪ Simple Boolean vs. Ranking of result set
▪ Simple Boolean retrieval returns matching documents in no particular
order.
▪ Google (and most well designed Boolean engines) rank the result set –
they rank good hits (according to some estimator of relevance) higher than
bad hits.

31
31
Extended Boolean model versus Ranked Retrieval

The Boolean retrieval model contrasts with ranked retrieval models


such as the vector space model (Section 6.3 ), in which users
largely use free text queries , that is, just typing one or more words
rather than using a precise language with operators for building up
query expressions, and the system decides which documents best
satisfy the query. Despite decades of academic research on the
advantages of ranked retrieval, systems implementing the Boolean
retrieval model were the main or only search option provided by
large commercial information providers for three decades until the
early 1990s (approximately the date of arrival of the World Wide
Web). However, these systems did not have just the basic Boolean
operations (AND, OR, and NOT) which we have presented so far. A
strict Boolean expression over terms with an unordered results set
is too limited for many of the information needs that people have,
and these systems implemented extended Boolean retrieval
models by incorporating additional operators such as term
proximity operators. A proximity operator is a way of specifying that
two terms in a query must occur close to each other in a document,
where closeness may be measured by limiting the allowed number
of intervening words or by reference to a structural unit such as a
sentence or paragraph.

Many users, particularly professionals, prefer Boolean query


models. Boolean queries are precise: a document either matches
the query or it does not. This offers the user greater control and
transparency over what is retrieved.
Introduction to Information Retrieval

TOKENS AND TERMS

Slides by Manning, Raghavan, Schutze 7
Introduction to Information Retrieval Sec. 2.2.1

Tokenization
 Input: “Friends, Romans, Countrymen”
 Output: Tokens
 Friends
 Romans
 Countrymen
 A token is a sequence of characters in a document
 Each such token is now a candidate for an index 
entry, after further processing
 Described below
 But what are valid tokens to emit?
Slides by Manning, Raghavan, Schutze 8
Introduction to Information Retrieval Sec. 2.2.1

Tokenization
 Issues in tokenization:
 Finland's capital → Finland? Finlands? Finland's?
 Hewlett‐Packard → Hewlett and Packard as two tokens?
 state‐of‐the‐art: break up hyphenated sequence.  
 co‐education
 lowercase, lower‐case, lower case ?
 It can be effective to get the user to put in possible hyphens
 San Francisco: one token or two?  
 How do you decide it is one token?
Slides by Manning, Raghavan, Schutze 9
Introduction to Information Retrieval Sec. 2.2.1

Numbers
 3/12/91 Mar. 12, 1991 12/3/91
 55 B.C.
 B‐52
 My PGP key is 324a3df234cb23e
 (800) 234‐2333
 Often have embedded spaces
 Older IR systems may not index numbers
 But often very useful: think about things like looking up error 
codes/stacktraces on the web
 (One answer is using n‐grams: Lecture 3)
 Will often index “meta‐data” separately
 Creation date, format, etc.
Slides by Manning, Raghavan, Schutze 10
Introduction to Information Retrieval Sec. 2.2.1

Tokenization: language issues
 French
 L'ensemble → one token or two?
 L ? L’ ? Le ?
 Want l’ensemble to match with un ensemble
 Until at least 2003, it didn’t on Google
 Internationalization!

 German noun compounds are not segmented
 Lebensversicherungsgesellschaftsangestellter
 ‘life insurance company employee’
 German retrieval systems benefit greatly from a compound splitter 
module
 Can give a 15% performance boost for German 
Slides by Manning, Raghavan, Schutze 11
Introduction to Information Retrieval Sec. 2.2.1

Tokenization: language issues
 Chinese and Japanese have no spaces between 
words:
 莎拉波娃现在居住在美国东南部的佛罗里达。
 Not always guaranteed a unique tokenization
 Further complicated in Japanese, with multiple 
alphabets intermingled
 Dates/amounts in multiple formats
フォーチュン500社は情報不足のため時間あた$500K(約6,000万円)

Katakana Hiragana Kanji Romaji

End-user can express query entirely in hiragana!


Slides by Manning, Raghavan, Schutze 12
Introduction to Information Retrieval Sec. 2.2.1

Tokenization: language issues
 Arabic (or Hebrew) is basically written right to left, 
but with certain items like numbers written left to 
right
 Words are separated, but letter forms within a word 
form complex ligatures

 ←  →    ← →                         ← start
 ‘Algeria achieved its independence in 1962 after 132 
years of French occupation.’
 With Unicode, the surface presentation is complex, but the 
stored form is  straightforward
Slides by Manning, Raghavan, Schutze 13
Introduction to Information Retrieval Sec. 2.2.2

Stop words
 With a stop list, you exclude from the dictionary 
entirely the commonest words. Intuition:
 They have little semantic content: the, a, and, to, be
 There are a lot of them: ~30% of postings for top 30 words
 But the trend is away from doing this:
 Good compression techniques (lecture 5) means the space for 
including stopwords in a system is very small
 Good query optimization techniques (lecture 7) mean you pay little 
at query time for including stop words.
 You need them for:
 Phrase queries: “King of Denmark”
 Various song titles, etc.: “Let it be”, “To be or not to be”
 “Relational” queries: “flights to London”
Slides by Manning, Raghavan, Schutze 14
Introduction to Information Retrieval Sec. 2.2.3

Normalization to terms
 We need to “normalize” words in indexed text as 
well as query words into the same form
 We want to match U.S.A. and USA
 Result is terms: a term is a (normalized) word type, 
which is an entry in our IR system dictionary
 We most commonly implicitly define equivalence 
classes of terms by, e.g., 
 deleting periods to form a term
 U.S.A., USA → USA
 deleting hyphens to form a term
 anti‐discriminatory, antidiscriminatory → antidiscriminatory
Slides by Manning, Raghavan, Schutze 15
Introduction to Information Retrieval Sec. 2.2.3

Normalization: other languages
 Accents: e.g., French résumé vs. resume.
 Umlauts: e.g., German: Tuebingen vs. Tübingen
 Should be equivalent
 Most important criterion:
 How are your users likely to write their queries for these
words?

 Even in languages that standardly have accents, users 
often may not type them
 Often best to normalize to a de‐accented term
 Tuebingen, Tübingen, Tubingen → Tubingen
Slides by Manning, Raghavan, Schutze 16
Introduction to Information Retrieval Sec. 2.2.3

Normalization: other languages
 Normalization of things like date forms
 7月30日 vs. 7/30
 Japanese use of kana vs. Chinese characters

 Tokenization and normalization may depend on the 
language and so is intertwined with language 
detection
Morgen will ich in MIT …   ← Is this German "mit"?

 Crucial: Need to “normalize” indexed text as well as 
query terms into the same form
Slides by Manning, Raghavan, Schutze 17
Introduction to Information Retrieval Sec. 2.2.3

Case folding
 Reduce all letters to lower case
 exception: upper case in mid‐sentence?
 e.g., General Motors
 Fed vs. fed
 SAIL vs. sail
 Often best to lower case everything, since 
users will use lowercase regardless of 
‘correct’ capitalization…

 Google example:
 Query C.A.T.
 #1 result was for "cat" (well, Lolcats) not Caterpillar Inc.
Slides by Manning, Raghavan, Schutze 18
Introduction to Information Retrieval Sec. 2.2.3

Normalization to terms

 An alternative to equivalence classing is to do 
asymmetric expansion
 An example of where this may be useful
 Enter: window Search: window, windows
 Enter: windows Search: Windows, windows, window
 Enter: Windows Search: Windows
 Potentially more powerful, but less efficient

Slides by Manning, Raghavan, Schutze 19
Introduction to Information Retrieval

Thesauri and soundex
 Do we handle synonyms and homonyms?
 E.g., by hand‐constructed equivalence classes
 car = automobile color = colour
 We can rewrite to form equivalence‐class terms
 When the document contains automobile, index it under car‐
automobile (and vice‐versa)
 Or we can expand a query
 When the query contains automobile, look under car as well
 What about spelling mistakes?
 One approach is soundex, which forms equivalence classes 
of words based on phonetic heuristics
 More in lectures 3 and 9
Slides by Manning, Raghavan, Schutze 20
Introduction to Information Retrieval Sec. 2.2.4

Lemmatization
 Reduce inflectional/variant forms to base form
 E.g.,
 am, are, is → be
 car, cars, car's, cars' → car
 the boy's cars are different colors → the boy car be
different color
 Lemmatization implies doing “proper” reduction to 
dictionary headword form

Slides by Manning, Raghavan, Schutze 21
Introduction to Information Retrieval Sec. 2.2.4

Stemming
 Reduce terms to their “roots” before indexing
 "Stemming" suggests crude affix chopping
 language dependent
 e.g., automate(s), automatic, automation all reduced to 
automat.

Example:
  before stemming:  for example compressed and compression are both accepted as equivalent to compress.
  after stemming:   for exampl compress and compress ar both accept as equival to compress
Slides by Manning, Raghavan, Schutze 22
Introduction to Information Retrieval Sec. 2.2.4

Porter’s algorithm
 Commonest algorithm for stemming English
 Results suggest it’s at least as good as other stemming 
options
 Conventions + 5 phases of reductions
 phases applied sequentially
 each phase consists of a set of commands
 sample convention: Of the rules in a compound command, 
select the one that applies to the longest suffix.

Slides by Manning, Raghavan, Schutze 23
Introduction to Information Retrieval Sec. 2.2.4

Typical rules in Porter
 sses → ss
 ies → i
 ational → ate
 tional → tion

 Rules sensitive to the measure of words

 (m>1) EMENT → (null)
 replacement → replac
 cement → cement

Slides by Manning, Raghavan, Schutze 24
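A toy Python sketch (not the full Porter algorithm) of the rule types listed above, using a rough approximation of the "measure" m; the rule table and helper names are illustrative.

# Toy illustration of Porter-style suffix rules (not the full algorithm).
RULES = [('sses', 'ss'), ('ies', 'i'), ('ational', 'ate'), ('tional', 'tion')]

def measure(stem):
    # Rough 'measure' m: counts vowel-to-consonant transitions in the stem.
    vowels = 'aeiou'
    m, prev_vowel = 0, False
    for ch in stem:
        is_vowel = ch in vowels
        if prev_vowel and not is_vowel:
            m += 1
        prev_vowel = is_vowel
    return m

def apply_rules(word):
    for suffix, repl in RULES:
        if word.endswith(suffix):
            return word[:-len(suffix)] + repl
    # Measure-conditioned rule from the slide: (m > 1) EMENT -> (null)
    if word.endswith('ement') and measure(word[:-5]) > 1:
        return word[:-5]
    return word

print(apply_rules('caresses'))      # caress
print(apply_rules('relational'))    # relate
print(apply_rules('replacement'))   # replac
print(apply_rules('cement'))        # cement (measure of 'c' is 0, so no change)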
Introduction to Information Retrieval Sec. 2.2.4

Other stemmers
 Other stemmers exist, e.g., Lovins stemmer 
 http://www.comp.lancs.ac.uk/computing/research/stemming/general/lovins.htm

 Single‐pass, longest suffix removal (about 250 rules)

 Full morphological analysis – at most modest 
benefits for retrieval
 Do stemming and other normalizations help?
 English: very mixed results. Helps recall but harms precision
 operative (dentistry) ⇒ oper
 operational (research) ⇒ oper
 operating (systems) ⇒ oper
 Definitely useful for Spanish, German, Finnish, …
 30% performance gains for Finnish!
Slides by Manning, Raghavan, Schutze 25
Introduction to Information Retrieval Sec. 2.2.4

Language‐specificity
 Many of the above features embody transformations 
that are
 Language‐specific and
 Often, application‐specific
 These are “plug‐in” addenda to the indexing process
 Both open source and commercial plug‐ins are 
available for handling these

Slides by Manning, Raghavan, Schutze 26
Introduction to Information Retrieval Sec. 2.2

Dictionary entries – first cut
ensemble.french
時間.japanese
MIT.english
mit.german
guaranteed.english
entries.english
sometimes.english
tokenization.english

These may be grouped by language (or not…).
More on this in ranking/query processing.
Slides by Manning, Raghavan, Schutze 27
Introduction to Information Retrieval

FASTER POSTINGS MERGES:
SKIP POINTERS/SKIP LISTS

Slides by Manning, Raghavan, Schutze 28
Introduction to Information Retrieval Sec. 2.3

Recall basic merge
 Walk through the two postings simultaneously, in 
time linear in the total number of postings entries

Brutus:        2 4 8 41 48 64 128
Caesar:        1 2 3 8 11 17 21 31
Intersection:  2 8

If the list lengths are m and n, the merge takes O(m+n)


operations.

Can we do better?
Yes (if index isn’t changing too fast).
Slides by Manning, Raghavan, Schutze 29
Introduction to Information Retrieval Sec. 2.3

Augment postings with skip pointers
(at indexing time)
Brutus:  2 4 8 41 48 64 128   (skip pointers: 2 → 41, 41 → 128)
Caesar:  1 2 3 8 11 17 21 31  (skip pointers: 1 → 11, 11 → 31)

 Why?
 To skip postings that will not figure in the search 
results.
 How?
 Where do we place skip pointers?
Slides by Manning, Raghavan, Schutze 30
Introduction to Information Retrieval Sec. 2.3

Query processing with skip pointers
Brutus:  2 4 8 41 48 64 128   (skip pointers: 2 → 41, 41 → 128)
Caesar:  1 2 3 8 11 17 21 31  (skip pointers: 1 → 11, 11 → 31)

Suppose we've stepped through the lists until we
process 8 on each list. We match it and advance.

We then have 41 on the upper list and 11 on the lower; 11 is smaller.

But the skip successor of 11 on the lower list is 31, so
we can skip ahead past the intervening postings.
Slides by Manning, Raghavan, Schutze 31
Introduction to Information Retrieval Sec. 2.3

Where do we place skips?
 Tradeoff:
 More skips → shorter skip spans → more likely to skip.
But lots of comparisons to skip pointers.
 Fewer skips → fewer pointer comparisons, but then long skip
spans → few successful skips.

Slides by Manning, Raghavan, Schutze 32
Introduction to Information Retrieval Sec. 2.3

Placing skips
 Simple heuristic: for postings of length L, use √L
evenly‐spaced skip pointers.
 This ignores the distribution of query terms.
 Easy if the index is relatively static; harder if L keeps 
changing because of updates.

 This definitely used to help; with modern hardware it 
may not (Bahle et al. 2002) unless you’re memory‐
based
 The I/O cost of loading a bigger postings list can outweigh 
the gains from quicker in memory merging!
Slides by Manning, Raghavan, Schutze 33
Introduction to Information Retrieval

Postings lists intersection with skip 
pointers

Slides by Manning, Raghavan, Schutze 34
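The intersection-with-skips algorithm figure is not reproduced here. Below is a hedged Python sketch, assuming skip pointers are placed every √L positions and stored as a position-to-position map; the names are illustrative.

import math

def add_skips(postings):
    # Skip pointers roughly every sqrt(L) positions: position -> position skipped to.
    L = len(postings)
    step = int(math.sqrt(L)) or 1
    return {i: i + step for i in range(0, L - step, step)}

def intersect_with_skips(p1, p2):
    s1, s2 = add_skips(p1), add_skips(p2)
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i, j = i + 1, j + 1
        elif p1[i] < p2[j]:
            if i in s1 and p1[s1[i]] <= p2[j]:
                while i in s1 and p1[s1[i]] <= p2[j]:   # follow skips that don't overshoot p2[j]
                    i = s1[i]
            else:
                i += 1
        else:
            if j in s2 and p2[s2[j]] <= p1[i]:
                while j in s2 and p2[s2[j]] <= p1[i]:
                    j = s2[j]
            else:
                j += 1
    return answer

brutus = [2, 4, 8, 41, 48, 64, 128]
caesar = [1, 2, 3, 8, 11, 17, 21, 31]
print(intersect_with_skips(brutus, caesar))   # -> [2, 8]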
Introduction to Information Retrieval

PHRASE QUERIES AND POSITIONAL 
INDEXES

Slides by Manning, Raghavan, Schutze 35
Introduction to Information Retrieval Sec. 2.4

Phrase queries
 Want to be able to answer queries such as “stanford 
university” – as a phrase
 Thus the sentence “I went to university at Stanford”
is not a match. 
 The concept of phrase queries has proven easily 
understood by users; one of the few “advanced search” 
ideas that works
 Many more queries are implicit phrase queries
 For this, it no longer suffices to store only
<term : docs> entries
Slides by Manning, Raghavan, Schutze 36
Introduction to Information Retrieval Sec. 2.4.1

A first attempt: Biword indexes
 Index every consecutive pair of terms in the text as a 
phrase
 For example the text “Friends, Romans, 
Countrymen” would generate the biwords
 friends romans
 romans countrymen
 Each of these biwords is now a dictionary term
 Two‐word phrase query‐processing is now 
immediate.

Slides by Manning, Raghavan, Schutze 37
Introduction to Information Retrieval Sec. 2.4.1

Longer phrase queries
 Longer phrases are processed as we did with wild‐
cards:
 stanford university palo alto can be broken into the 
Boolean query on biwords:
stanford university AND university palo AND palo alto

Without the docs, we cannot verify that the docs 
matching the above Boolean query do contain the 
phrase.
Can have false positives!
Slides by Manning, Raghavan, Schutze 38
Introduction to Information Retrieval Sec. 2.4.1

Extended biwords
 Parse the indexed text and perform part‐of‐speech‐tagging 
(POST).
 Bucket the terms into (say) Nouns (N) and 
articles/prepositions (X).
 Call any string of terms of the form NX*N an extended biword.
 Each such extended biword is now made a term in the 
dictionary.
 Example:  catcher in the rye
N           X   X    N
 Query processing: parse it into N’s and X’s
 Segment query into enhanced biwords
 Look up in index: catcher rye
Slides by Manning, Raghavan, Schutze 39
Introduction to Information Retrieval Sec. 2.4.1

Issues for biword indexes
 False positives, as noted before
 Index blowup due to bigger dictionary
 Infeasible for more than biwords, big even for them

 Biword indexes are not the standard solution (for all 
biwords) but can be part of a compound strategy

Slides by Manning, Raghavan, Schutze 40
Introduction to Information Retrieval Sec. 2.4.2

Solution 2: Positional indexes
 In the postings, store for each term the position(s) in 
which tokens of it appear:

<term, number of docs containing term;
doc1: position1, position2 … ;
doc2: position1, position2 … ;
etc.>

Slides by Manning, Raghavan, Schutze 41
Introduction to Information Retrieval Sec. 2.4.2

Positional index example

<be: 993427;
 1: 7, 18, 33, 72, 86, 231;
 2: 3, 149;
 4: 17, 191, 291, 430, 434;
 5: 363, 367, …>

Which of docs 1, 2, 4, 5 could contain "to be or not to be"?

 For phrase queries, we use a merge algorithm 
recursively at the document level
 But we now need to deal with more than just 
equality
Slides by Manning, Raghavan, Schutze 42
Introduction to Information Retrieval Sec. 2.4.2

Processing a phrase query
 Extract inverted index entries for each distinct term: 
to, be, or, not.
 Merge their doc:position lists to enumerate all 
positions with “to be or not to be”.
 to: 
 2:1,17,74,222,551; 4:8,16,190,429,433; 7:13,23,191; ...
 be:  
 1:17,19; 4:17,191,291,430,434; 5:14,19,101; ...
 Same general method for proximity searches

Slides by Manning, Raghavan, Schutze 43
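A minimal Python sketch (not from the slides) of positional intersection for a two-word phrase such as "to be", assuming the positional index is a dict mapping term -> {docID: sorted list of positions}; it reports (docID, position) pairs where the second term immediately follows the first.

def phrase_positions(index, term1, term2):
    # Find (docID, pos) pairs where term2 occurs at pos+1 right after term1 at pos.
    hits = []
    docs1, docs2 = index[term1], index[term2]
    for doc in sorted(set(docs1) & set(docs2)):
        pos2 = set(docs2[doc])
        for p in docs1[doc]:
            if p + 1 in pos2:
                hits.append((doc, p))
    return hits

# Positions adapted from the slide's postings for "to" and "be":
index = {
    'to': {2: [1, 17, 74, 222, 551], 4: [8, 16, 190, 429, 433]},
    'be': {1: [17, 19],              4: [17, 191, 291, 430, 434]},
}
print(phrase_positions(index, 'to', 'be'))
# -> [(4, 16), (4, 190), (4, 429), (4, 433)]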
Introduction to Information Retrieval Sec. 2.4.2

Proximity queries
 LIMIT! /3 STATUTE /3 FEDERAL /2 TORT 
 Again, here, /k means “within k words of”.
 Clearly, positional indexes can be used for such 
queries; biword indexes cannot.
 Exercise: Adapt the linear merge of postings to 
handle proximity queries.  Can you make it work for 
any value of k?
 This is a little tricky to do correctly and efficiently
 There’s likely to be a problem on it!

Slides by Manning, Raghavan, Schutze 44
Introduction to Information Retrieval

An algorithm for proximity intersection 
of postings lists p1 and p2

Slides by Manning, Raghavan, Schutze 45
Introduction to Information Retrieval Sec. 2.4.2

Positional index size
 You can compress position values/offsets: we’ll talk 
about that in lecture 5 
 Nevertheless, a positional index expands postings 
storage substantially
 Nevertheless, a positional index is now standardly 
used because of the power and usefulness of phrase 
and proximity queries … whether used explicitly or 
implicitly in a ranking retrieval system.

Slides by Manning, Raghavan, Schutze 46
Introduction to Information Retrieval Sec. 2.4.2

Positional index size
 Need an entry for each occurrence, not just once per 
document
 Index size depends on average document size Why?
 Average web page has <1000 terms
 SEC filings, books, even some epic poems … easily 100,000 
terms
 Consider a term with frequency 0.1%
Document size    Postings    Positional postings
1,000            1           1
100,000          1           100
Slides by Manning, Raghavan, Schutze 47
Introduction to Information Retrieval Sec. 2.4.2

Rules of thumb
 A positional index is 2–4 times as large as a non‐positional
index
 Positional index size 35–50% of volume of original 
text
 Caveat: all of this holds for “English‐like” languages

Slides by Manning, Raghavan, Schutze 48
Introduction to Information Retrieval Sec. 2.4.3

Combination schemes
 These two approaches can be profitably 
combined
 For particular phrases (“Michael Jackson”, “Britney 
Spears”) it is inefficient to keep on merging positional 
postings lists
 Even more so for phrases like “The Who”
 Williams et al. (2004) evaluate a more 
sophisticated mixed indexing scheme
 A typical web query mixture was executed in ¼ of the 
time of using just a positional index
 It required 26% more space than having a positional 
index alone Slides by Manning, Raghavan, Schutze 49
Introduction to Information Retrieval

Dictionaries and tolerant retrieval

Slides by Manning, Raghavan, Schutze 1
Introduction to Information Retrieval Ch. 2

Recap of the previous lecture
 The type/token distinction
 Terms are normalized types put in the dictionary
 Tokenization problems:
 Hyphens, apostrophes, compounds, CJK
 Term equivalence classing:
 Numbers, case folding, stemming, lemmatization
 Skip pointers
 Encoding a tree‐like structure in a postings list
 Biword indexes for phrases
 Positional indexes for phrases/proximity queries
Slides by Manning, Raghavan, Schutze 2
Introduction to Information Retrieval Ch. 3

This lecture
 Dictionary data structures
 “Tolerant” retrieval
 Wild‐card queries
 Spelling correction
 Soundex

Slides by Manning, Raghavan, Schutze 3
Introduction to Information Retrieval Sec. 3.1

Dictionary data structures for inverted 
indexes
 The dictionary data structure stores the term 
vocabulary, document frequency, pointers to each 
postings list … in what data structure?

Slides by Manning, Raghavan, Schutze 4
Introduction to Information Retrieval Sec. 3.1

A naïve dictionary
 An array of struct:

char[20]   int                   Postings *
20 bytes   4/8 bytes        4/8 bytes  
 How do we store a dictionary in memory efficiently?
 How do we quickly look up elements at query time?
Slides by Manning, Raghavan, Schutze 5
Introduction to Information Retrieval Sec. 3.1

Dictionary data structures
 Two main choices:
 Hashtables
 Trees
 Some IR systems use hashtables, some trees

Slides by Manning, Raghavan, Schutze 6
Introduction to Information Retrieval Sec. 3.1

Hashtables
 Each vocabulary term is hashed to an integer
 (We assume you’ve seen hashtables before)
 Pros:
 Lookup is faster than for a tree: O(1)
 Cons:
 No easy way to find minor variants:
 judgment/judgement
 No prefix search [tolerant  retrieval]
 If vocabulary keeps growing, need to occasionally do the 
expensive operation of rehashing everything

Slides by Manning, Raghavan, Schutze 7
Introduction to Information Retrieval Sec. 3.1

Tree: binary tree
Root
a-m n-z

a-hu hy-m n-sh si-z

Slides by Manning, Raghavan, Schutze 8
Introduction to Information Retrieval Sec. 3.1

Tree: B‐tree

a-hu n-z
hy-m

 Definition: Every internal node has a number of children
in the interval [a,b] where a, b are appropriate natural 
numbers, e.g., [2,4].
Slides by Manning, Raghavan, Schutze 9
Introduction to Information Retrieval Sec. 3.1

Trees
 Simplest: binary tree
 More usual: B‐trees
 Trees require a standard ordering of characters and hence 
strings … but we typically have one
 Pros:
 Solves the prefix problem (terms starting with hyp)
 Cons:
 Slower: O(log M)  [and this requires balanced tree]
 Rebalancing binary trees is expensive
 But B‐trees mitigate the rebalancing problem

Slides by Manning, Raghavan, Schutze 10
Introduction to Information Retrieval

WILD‐CARD QUERIES

Slides by Manning, Raghavan, Schutze 11
Introduction to Information Retrieval Sec. 3.2

Wild‐card queries: *
 mon*: find all docs containing any word beginning 
with “mon”.
 Easy with binary tree (or B‐tree) lexicon: retrieve all 
words in range: mon ≤ w < moo
 *mon: find words ending in “mon”: harder
 Maintain an additional B‐tree for terms backwards.
Can retrieve all words in range: nom ≤ w < non.

Exercise: from this, how can we enumerate all terms


meeting the wild-card query pro*cent ?
Slides by Manning, Raghavan, Schutze 12
Introduction to Information Retrieval Sec. 3.2

Query processing
 At this point, we have an enumeration of all terms in 
the dictionary that match the wild‐card query.
 We still have to look up the postings for each 
enumerated term.
 E.g., consider the query:
se*ate AND fil*er
This may result in the execution of many Boolean 
AND queries.

Slides by Manning, Raghavan, Schutze 13
Introduction to Information Retrieval Sec. 3.2

B‐trees handle *’s at the end of a 
query term
 How can we handle *’s in the middle of query term?
 co*tion
 We could look up co* AND *tion in a B‐tree and 
intersect the two term sets
 Expensive
 The solution: transform wild‐card queries so that the 
*’s occur at the end
 This gives rise to the Permuterm Index.

Slides by Manning, Raghavan, Schutze 14
Introduction to Information Retrieval Sec. 3.2.1

Permuterm index
 For term hello, index under:
 hello$, ello$h, llo$he, lo$hel, o$hell
where $ is a special symbol.
 Queries:
 X     →  lookup on X$            X*     →  lookup on $X*
 *X    →  lookup on X$*           *X*    →  lookup on X*
 X*Y   →  lookup on Y$X*          X*Y*Z  →  ??? Exercise!

Query = hel*o
X=hel, Y=o
Lookup o$hel*
Slides by Manning, Raghavan, Schutze 15
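A small Python sketch of the permuterm idea (illustrative, not from the slides): index every rotation of term$ and answer a single-'*' query by rotating it so the '*' ends up at the end, then doing a prefix match. The function names and toy vocabulary are assumptions made for the example.

def permuterm_rotations(term):
    # All rotations of term + '$'; each rotation points back to the original term.
    t = term + '$'
    return [t[i:] + t[:i] for i in range(len(t))]

def permuterm_lookup(query, vocabulary):
    # Resolve a single-'*' wildcard query (e.g. 'hel*o') against a small vocabulary.
    permuterm = {}
    for term in vocabulary:
        for rot in permuterm_rotations(term):
            permuterm.setdefault(rot, set()).add(term)
    # Rotate the query so the '*' ends up at the end: X*Y -> lookup on Y$X*
    x, y = query.split('*')
    prefix = y + '$' + x
    return {t for rot, terms in permuterm.items() if rot.startswith(prefix) for t in terms}

vocab = ['hello', 'help', 'halo', 'hero']
print(permuterm_lookup('hel*o', vocab))   # -> {'hello'}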
Introduction to Information Retrieval Sec. 3.2.1

Permuterm query processing
 Rotate query wild‐card to the right
 Now use B‐tree lookup as before.
 Permuterm problem: ≈ quadruples lexicon size

Empirical observation for English.

Slides by Manning, Raghavan, Schutze 16
Introduction to Information Retrieval Sec. 3.2.2

Bigram (k‐gram) indexes
 Enumerate all k‐grams (sequence of k chars) 
occurring in any term
 e.g., from text “April is the cruelest month” we get 
the 2‐grams (bigrams)

$a,ap,pr,ri,il,l$,$i,is,s$,$t,th,he,e$,$c,cr,ru,
ue,el,le,es,st,t$, $m,mo,on,nt,h$
 $ is a special word boundary symbol
 Maintain a second inverted index from bigrams to
dictionary terms that match each bigram.
Slides by Manning, Raghavan, Schutze 17
Introduction to Information Retrieval Sec. 3.2.2

Bigram index example
 The k‐gram index finds terms based on a query 
consisting of k‐grams (here k=2).

$m  →  mace, madden
mo  →  among, amortize
on  →  along, among

Slides by Manning, Raghavan, Schutze 18
Introduction to Information Retrieval Sec. 3.2.2

Processing wild‐cards
 Query mon* can now be run as
 $m AND mo AND on
 Gets terms that match AND version of our wildcard 
query.
 But we’d enumerate moon.
 Must post‐filter these terms against query.
 Surviving enumerated terms are then looked up in 
the term‐document inverted index.
 Fast, space efficient (compared to permuterm).

Slides by Manning, Raghavan, Schutze 19
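A Python sketch of the k-gram approach just described, for a single-'*' query (the function names and toy vocabulary are illustrative). It includes the post-filtering step that removes false matches such as "moon" for mon*.

import re
from collections import defaultdict

def kgrams(term, k=2):
    # k-grams of '$term$', where '$' marks the word boundary.
    t = '$' + term + '$'
    return {t[i:i+k] for i in range(len(t) - k + 1)}

def build_kgram_index(vocabulary, k=2):
    index = defaultdict(set)
    for term in vocabulary:
        for g in kgrams(term, k):
            index[g].add(term)
    return index

def wildcard_lookup(query, vocabulary, k=2):
    # Answer a single-'*' wildcard query with a k-gram index plus post-filtering.
    index = build_kgram_index(vocabulary, k)
    prefix, suffix = query.split('*')
    padded = ['$' + prefix, suffix + '$']            # grams must not span the '*'
    grams = {p[i:i+k] for p in padded for i in range(len(p) - k + 1)}
    candidate_sets = [index[g] for g in grams if g in index]
    candidates = set.intersection(*candidate_sets) if candidate_sets else set()
    # Post-filter: the k-gram match is necessary but not sufficient
    # ('moon' matches $m AND mo AND on, but not the query mon*).
    pattern = re.compile('^' + re.escape(prefix) + '.*' + re.escape(suffix) + '$')
    return {t for t in candidates if pattern.match(t)}

vocab = ['month', 'moon', 'monday', 'amortize', 'among']
print(wildcard_lookup('mon*', vocab))   # -> {'month', 'monday'}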
Introduction to Information Retrieval Sec. 3.2.2

Processing wild‐card queries
 As before, we must execute a Boolean query for each 
enumerated, filtered term.
 Wild‐cards can result in expensive query execution 
(very large disjunctions…)
 pyth* AND prog*
 If you encourage “laziness” people will respond!

Search
Type your search terms, use ‘*’ if you need to.
E.g., Alex* will match Alexander.

 Which web search engines allow wildcard queries?
Slides by Manning, Raghavan, Schutze 20
Introduction to Information Retrieval

SPELLING CORRECTION

Slides by Manning, Raghavan, Schutze 21
Introduction to Information Retrieval Sec. 3.3

Spell correction
 Two principal uses
 Correcting document(s) being indexed
 Correcting user queries to retrieve “right” answers
 Two main flavors:
 Isolated word
 Check each word on its own for misspelling
 Will not catch typos resulting in correctly spelled words
 e.g., from  form
 Context‐sensitive
 Look at surrounding words, 
 e.g., I flew form Heathrow to Narita.

Slides by Manning, Raghavan, Schutze 22
Introduction to Information Retrieval Sec. 3.3

Document correction
 Especially needed for OCR’ed documents
 Correction algorithms are tuned for this: rn/m
 Can use domain‐specific knowledge
 E.g., OCR can confuse O and D more often than it would confuse O 
and I (adjacent on the QWERTY keyboard, so more likely 
interchanged in typing).
 But also: web pages and even printed material have 
typos
 Goal: the dictionary contains fewer misspellings
 But often we don’t change the documents and 
instead fix the query‐document mapping
Slides by Manning, Raghavan, Schutze 23
Introduction to Information Retrieval Sec. 3.3

Query mis‐spellings
 Our principal focus here
 E.g., the query Alanis Morisett
 We can either
 Retrieve documents indexed by the correct spelling, OR
 Return several suggested alternative queries with the 
correct spelling
 Did you mean … ?

Slides by Manning, Raghavan, Schutze 24
Introduction to Information Retrieval Sec. 3.3.2

Isolated word correction
 Fundamental premise – there is a lexicon from which 
the correct spellings come
 Two basic choices for this
 A standard lexicon such as
 Webster’s English Dictionary
 An “industry‐specific” lexicon – hand‐maintained
 The lexicon of the indexed corpus
 E.g., all words on the web
 All names, acronyms etc.
 (Including the mis‐spellings)

Slides by Manning, Raghavan, Schutze 25
Introduction to Information Retrieval Sec. 3.3.2

Isolated word correction
 Given a lexicon and a character sequence Q, return 
the words in the lexicon closest to Q
 What’s “closest”?
 We’ll study several alternatives
 Edit distance (Levenshtein distance)
 Weighted edit distance
 n‐gram overlap

Slides by Manning, Raghavan, Schutze 26
Introduction to Information Retrieval Sec. 3.3.3

Edit distance
 Given two strings S1 and S2, the minimum number of 
operations to convert one to the other
 Operations are typically character‐level
 Insert, Delete, Replace, (Transposition)
 E.g., the edit distance from dof to dog is 1
 From cat to act is 2 (Just 1 with transpose.)
 from cat to dog is 3.
 Generally found by dynamic programming.
 See http://www.merriampark.com/ld.htm for a nice 
example plus an applet.
Slides by Manning, Raghavan, Schutze 27
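A standard dynamic-programming sketch of Levenshtein edit distance in Python (insert, delete, replace at cost 1, no transposition), reproducing the distances quoted above.

def edit_distance(s1, s2):
    # Levenshtein distance via dynamic programming.
    m, n = len(s1), len(s2)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s1[i - 1] == s2[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # delete from s1
                          d[i][j - 1] + 1,         # insert into s1
                          d[i - 1][j - 1] + cost)  # replace (or match)
    return d[m][n]

print(edit_distance('dof', 'dog'))   # 1
print(edit_distance('cat', 'act'))   # 2 (1 if transposition were allowed)
print(edit_distance('cat', 'dog'))   # 3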
Introduction to Information Retrieval Sec. 3.3.3

Weighted edit distance
 As above, but the weight of an operation depends on 
the character(s) involved
 Meant to capture OCR or keyboard errors
Example: m more likely to be mis‐typed as n than as q
 Therefore, replacing m by n is a smaller edit distance than 
by q
 This may be formulated as a probability model
 Requires weight matrix as input
 Modify dynamic programming to handle weights

Slides by Manning, Raghavan, Schutze 28
Introduction to Information Retrieval Sec. 3.3.4

Using edit distances
 Given query, first enumerate all character sequences 
within a preset (weighted) edit distance (e.g., 2)
 Intersect this set with list of “correct” words
 Show terms you found to user as suggestions
 Alternatively, 
 We can look up all possible corrections in our inverted 
index and return all docs … slow
 We can run with a single most likely correction
 The alternatives disempower the user, but save a 
round of interaction with the user
Slides by Manning, Raghavan, Schutze 29
Introduction to Information Retrieval Sec. 3.3.4

Edit distance to all dictionary terms?
 Given a (mis‐spelled) query – do we compute its edit 
distance to every dictionary term?
 Expensive and slow
 Alternative?
 How do we cut the set of candidate dictionary 
terms?
 One possibility is to use n‐gram overlap for this
 This can also be used by itself for spelling correction.

Slides by Manning, Raghavan, Schutze 30
Introduction to Information Retrieval Sec. 3.3.4

n‐gram overlap
 Enumerate all the n‐grams in the query string as well 
as in the lexicon
 Use the n‐gram index (recall wild‐card search) to 
retrieve all lexicon terms matching any of the query 
n‐grams
 Threshold by number of matching n‐grams
 Variants – weight by keyboard layout, etc.

Slides by Manning, Raghavan, Schutze 31
Introduction to Information Retrieval Sec. 3.3.4

Example with trigrams
 Suppose the text is november
 Trigrams are nov, ove, vem, emb, mbe, ber.
 The query is december
 Trigrams are dec, ece, cem, emb, mbe, ber.
 So 3 trigrams overlap (of 6 in each term)
 How can we turn this into a normalized measure of 
overlap?

Slides by Manning, Raghavan, Schutze 32
Introduction to Information Retrieval Sec. 3.3.4

One option – Jaccard coefficient
 A commonly‐used measure of overlap
 Let X and Y be two sets; then the J.C. is

X Y / X Y
 Equals 1 when X and Y have the same elements and 
zero when they are disjoint
 X and Y don’t have to be of the same size
 Always assigns a number between 0 and 1
 Now threshold to decide if you have a match
 E.g., if J.C. > 0.8, declare a match 
Slides by Manning, Raghavan, Schutze 33
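A short Python sketch computing the Jaccard coefficient over the trigram sets from the november/december example; helper names are illustrative.

def trigrams(word):
    return {word[i:i+3] for i in range(len(word) - 2)}

def jaccard(x, y):
    # |X intersect Y| / |X union Y|: 1 for identical sets, 0 for disjoint sets.
    return len(x & y) / len(x | y)

nov, dec = trigrams('november'), trigrams('december')
print(sorted(nov & dec))             # ['ber', 'emb', 'mbe']
print(round(jaccard(nov, dec), 2))   # 3 shared / 9 total = 0.33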
Introduction to Information Retrieval Sec. 3.3.4

Matching trigrams
 Consider the query lord – we wish to identify words 
matching 2 of its 3 bigrams (lo, or, rd)

lo  →  alone, lore, sloth
or  →  border, lore, morbid
rd  →  ardent, border, card

Standard postings “merge” will enumerate …


Adapt this to using Jaccard (or another) measure.
Slides by Manning, Raghavan, Schutze 34
Introduction to Information Retrieval Sec. 3.3.5

Context‐sensitive spell correction
 Text: I flew from Heathrow to Narita.
 Consider the phrase query “flew form Heathrow”
 We’d like to respond
Did you mean “flew from Heathrow”?
because no docs matched the query phrase.

Slides by Manning, Raghavan, Schutze 35
Introduction to Information Retrieval Sec. 3.3.5

Context‐sensitive correction
 Need surrounding context to catch this.
 First idea: retrieve dictionary terms close (in 
weighted edit distance) to each query term
 Now try all possible resulting phrases with one word 
“fixed” at a time
 flew from heathrow 
 fled form heathrow
 flea form heathrow
 Hit‐based spelling correction: Suggest the 
alternative that has lots of hits.

Slides by Manning, Raghavan, Schutze 36
Introduction to Information Retrieval Sec. 3.3.5

Exercise
 Suppose that for “flew form Heathrow”  we have 7 
alternatives for flew, 19 for form and 3 for heathrow.
How many “corrected” phrases will we enumerate in 
this scheme?

Slides by Manning, Raghavan, Schutze 37
Introduction to Information Retrieval Sec. 3.3.5

Another approach
 Break phrase query into a conjunction of biwords 
(Lecture 2).
 Look for biwords that need only one term corrected.
 Enumerate only phrases containing “common” 
biwords.

Slides by Manning, Raghavan, Schutze 38
Introduction to Information Retrieval Sec. 3.3.5

General issues in spell correction
 We enumerate multiple alternatives for “Did you 
mean?”
 Need to figure out which to present to the user
 The alternative hitting most docs
 Query log analysis
 More generally, rank alternatives probabilistically
argmax_corr P(corr | query)
 From Bayes rule, this is equivalent to
argmax_corr P(query | corr) * P(corr)

              noisy channel      language model


Slides by Manning, Raghavan, Schutze 39
Introduction to Information Retrieval

SOUNDEX

Slides by Manning, Raghavan, Schutze 40
Introduction to Information Retrieval Sec. 3.4

Soundex
 Class of heuristics to expand a query into phonetic
equivalents
 Language specific – mainly for names
 E.g., chebyshev → tchebycheff
 Invented for the U.S. census … in 1918

Slides by Manning, Raghavan, Schutze 41
Introduction to Information Retrieval Sec. 3.4

Soundex – typical algorithm
 Turn every token to be indexed into a 4‐character 
reduced form
 Do the same with query terms
 Build and search an index on the reduced forms
 (when the query calls for a soundex match)

 http://www.creativyst.com/Doc/Articles/SoundEx1/SoundEx1.htm#Top

Slides by Manning, Raghavan, Schutze 42
Introduction to Information Retrieval Sec. 3.4

Soundex – typical algorithm
1. Retain the first letter of the word. 
2. Change all occurrences of the following letters to '0' 
(zero):
'A', 'E', 'I', 'O', 'U', 'H', 'W', 'Y'.
3. Change letters to digits as follows: 
 B, F, P, V → 1
 C, G, J, K, Q, S, X, Z → 2
 D, T → 3
 L → 4
 M, N → 5
 R → 6
Slides by Manning, Raghavan, Schutze 43
Introduction to Information Retrieval Sec. 3.4

Soundex continued
4. Repeatedly remove one out of each pair of consecutive identical digits.
5. Remove all zeros from the resulting string.
6. Pad the resulting string with trailing zeros and 
return the first four positions, which will be of the 
form <uppercase letter> <digit> <digit> <digit>. 

E.g., Herman becomes H655.

Will hermann generate the same code?


Slides by Manning, Raghavan, Schutze 44
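A minimal Python sketch that follows the steps listed on these slides; note that "official" Soundex implementations differ slightly in corner cases (e.g., the treatment of H and W between consonants with the same code).

SOUNDEX_CODES = {
    **dict.fromkeys('AEIOUHWY', '0'),
    **dict.fromkeys('BFPV', '1'),
    **dict.fromkeys('CGJKQSXZ', '2'),
    **dict.fromkeys('DT', '3'),
    'L': '4',
    **dict.fromkeys('MN', '5'),
    'R': '6',
}

def soundex(word):
    # 4-character Soundex code, following the steps on the slide.
    word = word.upper()
    first = word[0]
    digits = [SOUNDEX_CODES[c] for c in word[1:] if c in SOUNDEX_CODES]
    # Collapse each run of consecutive identical digits to a single digit
    collapsed = []
    for d in digits:
        if not collapsed or collapsed[-1] != d:
            collapsed.append(d)
    # Drop the zeros, pad with trailing zeros, keep letter + three digits
    code = ''.join(d for d in collapsed if d != '0')
    return (first + code + '000')[:4]

print(soundex('Herman'))    # H655
print(soundex('hermann'))   # H655 -- answers the question on the slide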
Introduction to Information Retrieval Sec. 3.4

Soundex
 Soundex is the classic algorithm, provided by most 
databases (Oracle, Microsoft, …)
 How useful is soundex?
 Not very – for information retrieval
 Okay for “high recall” tasks (e.g., Interpol), though 
biased to names of certain nationalities
 Zobel and Dart (1996) show that other algorithms for 
phonetic matching perform much better in the 
context of IR

Slides by Manning, Raghavan, Schutze 45
Introduction to Information Retrieval

What queries can we process?
 We have
 Positional inverted index with skip pointers
 Wild‐card index
 Spell‐correction
 Soundex
 Queries such as
(SPELL(moriset) /3 toron*to) OR SOUNDEX(chaikofski)

Slides by Manning, Raghavan, Schutze 46
Introduction to Information Retrieval

Exercise
 Draw yourself a diagram showing the various indexes 
in a search engine incorporating all the functionality 
we have talked about
 Identify some of the key design choices in the index 
pipeline:
 Does stemming happen before the Soundex index?
 What about n‐grams?
 Given a query, how would you parse and dispatch 
sub‐queries to the various indexes?

Slides by Manning, Raghavan, Schutze 47
Index construction
Index construction
 How do we construct an index?
 What strategies can we use with limited
main memory?
Hardware basics
 Many design decisions in information retrieval
are based on the characteristics of hardware
 We begin by reviewing hardware basics
Hardware basics
 Access to data in memory is much faster than
access to data on disk.
 Disk seeks: No data is transferred from disk while
the disk head is being positioned.
 Therefore: Transferring one large chunk of data
from disk to memory is faster than transferring
many small chunks.
 Disk I/O is block-based: Reading and writing of
entire blocks (as opposed to smaller chunks).
 Block sizes: 8KB to 256 KB.
Hardware basics
 Servers used in IR systems now typically have
several GB of main memory, sometimes tens of
GB.
 Available disk space is several (2–3) orders of
magnitude larger.
 Fault tolerance is very expensive: It’s much
cheaper to use many regular machines rather
than one fault tolerant machine.
Hardware assumptions
symbol   statistic                                           value
s        average seek time                                   5 ms = 5 × 10⁻³ s
b        transfer time per byte                              0.02 μs = 2 × 10⁻⁸ s
         processor's clock rate                              10⁹ s⁻¹
p        low-level operation (e.g., compare & swap a word)   0.01 μs = 10⁻⁸ s
         size of main memory                                 several GB
         size of disk space                                  1 TB or more
RCV1: Our collection for this
lecture
 Shakespeare’s collected works definitely aren’t large
enough for demonstrating many of the points in this
course.
 The collection we’ll use isn’t really large enough
either, but it’s publicly available and is at least a more
plausible example.
 As an example for applying scalable index
construction algorithms, we will use the Reuters
RCV1 collection.
 http://trec.nist.gov/data/reuters/reuters.html
 This is one year of Reuters newswire (part of 1995
and 1996)
A Reuters RCV1 document
Reuters RCV1 statistics
symbol   statistic                                        value
N        documents                                        800,000
L        avg. # tokens per doc                            200
M        terms (= word types)                             400,000
         avg. # bytes per token (incl. spaces/punct.)     6
         avg. # bytes per token (without spaces/punct.)   4.5
         avg. # bytes per term                            7.5
T        non-positional postings                          100,000,000

4.5 bytes per word token vs. 7.5 bytes per word type: why?
Recall index construction

 Documents are parsed to extract words and these are saved with the Document ID.

Doc 1: I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me.
Doc 2: So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious.

Resulting (term, docID) pairs, in order of appearance:
  Doc 1: I 1, did 1, enact 1, julius 1, caesar 1, I 1, was 1, killed 1, i' 1, the 1,
         capitol 1, brutus 1, killed 1, me 1
  Doc 2: so 2, let 2, it 2, be 2, with 2, caesar 2, the 2, noble 2, brutus 2, hath 2,
         told 2, you 2, caesar 2, was 2, ambitious 2
Key step

 After all documents have been parsed, the inverted file is sorted by terms
   (and then by docID within each term).

 We focus on this sort step. We have 100M items to sort.

Before sorting (parsing order):
  I 1, did 1, enact 1, julius 1, caesar 1, I 1, was 1, killed 1, i' 1, the 1,
  capitol 1, brutus 1, killed 1, me 1, so 2, let 2, it 2, be 2, with 2, caesar 2,
  the 2, noble 2, brutus 2, hath 2, told 2, you 2, caesar 2, was 2, ambitious 2

After sorting by term (then docID):
  ambitious 2, be 2, brutus 1, brutus 2, caesar 1, caesar 2, caesar 2, capitol 1,
  did 1, enact 1, hath 2, I 1, I 1, i' 1, it 2, julius 1, killed 1, killed 1,
  let 2, me 1, noble 2, so 2, the 1, the 2, told 2, was 1, was 2, with 2, you 2
Scaling index construction
 In-memory index construction does not scale.
 How can we construct an index for very large
collections?
 Taking into account the hardware constraints we
just learned about . . .
 Memory, disk, speed, etc.
Sort-based index construction
 As we build the index, we parse docs one at a time.
 While building the index, we cannot easily exploit

compression tricks (you can, but much more complex)


 The final postings for any term are incomplete until the
end.
 At 12 bytes per non-positional postings entry (term,
doc, freq), demands a lot of space for large collections.
 T = 100,000,000 in the case of RCV1
 So … we can do this in memory, but typical

collections are much larger. E.g. the New York


Times provides an index of >150 years of newswire
 Thus: We need to store intermediate results on disk.
Use the same algorithm for disk?
 Can we use the same index construction
algorithm for larger collections, but by using disk
instead of memory?
 No: Sorting T = 100,000,000 records on disk is
too slow – too many disk seeks.
 We need an external sorting algorithm.
Bottleneck
 Parse and build postings entries one doc at a
time
 Now sort postings entries by term (then by doc
within each term)
 Doing this with random disk seeks would be too
slow – must sort T=100M records

If every comparison took 2 disk seeks, and N items could be
sorted with N log₂ N comparisons, how long would this take?
BSBI: Blocked sort-based Indexing
(Sorting with fewer disk seeks)
 12-byte (4+4+4) records (term, doc, freq).
 These are generated as we parse docs.
 Must now sort 100M such 12-byte records by term.
 Define a Block ~ 10M such records
 Can easily fit a couple into memory.
 Will have 10 such blocks to start with.
 Basic idea of algorithm:
 Accumulate postings for each block, sort, write to
disk.
 Then merge the blocks into one long sorted order.
Sorting 10 blocks of 10M records
 First, read each block and sort within:
 Quicksort takes 2N ln N expected steps
 In our case 2 x (10M ln 10M) steps
 Exercise: estimate total time to read each block
from disk and and quicksort it.
 10 times this estimate – gives us 10 sorted runs
of 10M records each.
 Need 2 copies of data on disk
 But can optimize this
BSBI Index Construction
n = 0
while (all documents have not been processed)
  do n = n + 1
     block = ParseNextBlock()       // read postings until the block is full
     BSBI-Invert(block)             // sort all postings; collect postings with the
                                    // same termID into a list (a small inverted index)
     WriteBlockToDisk(block, f_n)
MergeBlocks(f_1, f_2, …, f_n; f_merged)
How to merge the sorted runs?
 Can do binary merges, with a merge tree of log₂ 10 = 4 layers.
 During each layer, read into memory runs in blocks of
10M, merge, write back.
[Figure: sorted runs on disk being read into memory and combined into a single merged run, which is written back to disk.]
Merge tree

Binary merge tree over the sorted runs (bottom level of the tree is the 10 sorted runs 1, 2, …, 9, 10):
  10 sorted runs, 10M records each
  5 runs, 20M/run
  2 runs, 40M/run
  1 run, 80M/run
  1 run, 100M/run
How to merge the sorted runs?
 But it is more efficient to do a n-way merge, where
you are reading from all blocks simultaneously
 Providing you read decent-sized chunks of each
block into memory and then write out a decent-sized
output chunk, then you’re not killed by disk seeks
Remaining problem with sort-
based algorithm
 Our assumption was: we can keep the dictionary
in memory.
 We need the dictionary (which grows
dynamically) in order to implement a term to
termID mapping.
 Actually, we could work with term,docID postings
instead of termID,docID postings . . .
 . . . but then intermediate files become very large.
(We would end up with a scalable, but very slow
index construction method.)
SPIMI:
Single-pass in-memory indexing
 Key idea 1: Generate separate dictionaries for
each block – no need to maintain term-termID
mapping across blocks.
 Key idea 2: Don’t sort. Accumulate postings in
postings lists as they occur.
 With these two ideas we can generate a
complete inverted index for each block.
 These separate indexes can then be merged into
one big index.
SPIMI-Invert

 Merging of blocks is analogous to BSBI.
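The SPIMI-Invert pseudocode figure is not reproduced on this slide. Below is a simplified Python sketch of indexing one SPIMI block (illustrative only: a real implementation keeps consuming tokens until memory is exhausted and grows postings lists dynamically).

from collections import defaultdict

def spimi_invert(token_stream, block_id):
    # One SPIMI pass: build an in-memory index for one block, then write it to disk.
    # token_stream yields (term, docID) pairs. There is no global term->termID mapping
    # and postings are accumulated as they occur, not sorted.
    dictionary = defaultdict(list)               # term -> postings list for this block
    for term, doc_id in token_stream:
        postings = dictionary[term]              # creates an empty list for a new term
        if not postings or postings[-1] != doc_id:
            postings.append(doc_id)              # docIDs arrive in increasing order
    # Terms (not postings) are sorted just before the block is written out
    block_file = 'block-%d.txt' % block_id
    with open(block_file, 'w') as f:
        for term in sorted(dictionary):
            f.write(term + ' ' + ' '.join(map(str, dictionary[term])) + '\n')
    return block_file

tokens = [('caesar', 1), ('died', 1), ('caesar', 2), ('came', 2), ('caesar', 2)]
print(spimi_invert(tokens, 0))   # writes block-0.txt, one term and its postings per line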


SPIMI: Compression
 Compression makes SPIMI even more efficient.
 Compression of terms
 Compression of postings
Distributed indexing
 For web-scale indexing:
must use a distributed computing cluster
 Individual machines are fault-prone
 Can unpredictably slow down or fail
 How do we exploit such a pool of machines?
Google data centers
 Google data centers mainly contain commodity
machines.
 Data centers are distributed around the world.
 Estimate: a total of 1 million servers, 3 million
processors/cores (Gartner 2007)
 Estimate: Google installs 100,000 servers each
quarter.
 Based on expenditures of 200–250 million dollars
per year
 This would be 10% of the computing capacity of
the world (in 2007)!?!
Google data centers
 If in a non-fault-tolerant system with 1000 nodes,
each node has 99.9% uptime, what is the uptime
of the system?
 Answer: 63%
 Calculate the number of servers failing per
minute for an installation of 1 million servers.
Distributed indexing
 Maintain a master machine directing the indexing
job – considered “safe”.
 Break up indexing into sets of (parallel) tasks.
 Master machine assigns each task to an idle
machine from a pool.
Parallel tasks
 We will use two sets of parallel tasks
 Parsers
 Inverters
 Break the input document collection into splits
 Each split is a subset of documents
(corresponding to blocks in BSBI/SPIMI)
Parsers
 Master assigns a split to an idle parser machine
 Parser reads a document at a time and emits
(term, doc) pairs
 Parser writes pairs into j partitions
 Each partition is for a range of terms’ first letters
 (e.g., a-f, g-p, q-z) – here j = 3.
 Now to complete the index inversion
Inverters
 An inverter collects all (term,doc) pairs (=
postings) for one term-partition.
 Sorts and writes to postings lists
Data flow

[Figure: the master assigns splits to parsers and term partitions to inverters.
 Map phase: each parser reads its split and writes (term, doc) pairs into segment
 files partitioned by term range (a-f, g-p, q-z).
 Reduce phase: each inverter collects the segment files for one term partition
 (a-f, g-p, or q-z) and writes out the postings.]
MapReduce
 The index construction algorithm we just
described is an instance of MapReduce.
 MapReduce (Dean and Ghemawat 2004) is a
robust and conceptually simple framework for
distributed computing …
 … without having to write code for the distribution
part.
 They describe the Google indexing system (ca.
2002) as consisting of a number of phases, each
implemented in MapReduce.
MapReduce
 Index construction was just one phase.
 Another phase: transforming a term-partitioned
index into a document-partitioned index.
 Term-partitioned: one machine handles a
subrange of terms
 Document-partitioned: one machine handles a
subrange of documents
 As we discuss in the web part of the course,
most search engines use a document-partitioned
index … better load balancing, etc.
Schema for index construction in
MapReduce
 Schema of map and reduce functions
 map: input → list(k, v)
 reduce: (k, list(v)) → output
 Instantiation of the schema for index construction
 map: web collection → list(termID, docID)
 reduce: <termID1, list(docID)>, <termID2, list(docID)>, … →
(postings list 1, postings list 2, …)
 Example for index construction
 map:
 d2: “Caesar died”. d1: “Caesar came, Caesar conquered”. →
<Caesar, d2>, <died, d2>, <Caesar, d1>, <came, d1>, <Caesar, d1>, <conquered, d1>
 reduce:
 <Caesar, d2>, <died, d2>, <Caesar, d1>, <came, d1>, <Caesar, d1>, <conquered, d1>
→ (<Caesar, (d1:2, d2:1)>, <died, (d2:1)>, <came, (d1:1)>, <conquered, (d1:1)>)
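A toy single-process instantiation of this schema in Python, using the two-document Caesar example from the slide (a real MapReduce run would distribute the map and reduce calls across machines and shuffle the key-value pairs between the two phases; names here are illustrative):

```python
from collections import defaultdict

def map_phase(collection):
    """map: collection -> list(term, docID), one pair per token occurrence."""
    pairs = []
    for doc_id, text in collection:
        for term in text.lower().replace(",", " ").replace(".", " ").split():
            pairs.append((term, doc_id))
    return pairs

def reduce_phase(pairs):
    """reduce: <term, list(docID)> -> postings lists with term frequencies."""
    grouped = defaultdict(list)
    for term, doc_id in pairs:             # this grouping is the shuffle step
        grouped[term].append(doc_id)
    index = {}
    for term, doc_ids in grouped.items():
        tf = defaultdict(int)
        for d in doc_ids:
            tf[d] += 1                     # docID -> in-document frequency
        index[term] = sorted(tf.items())
    return index

collection = [("d1", "Caesar came, Caesar conquered."),
              ("d2", "Caesar died.")]
print(reduce_phase(map_phase(collection)))
# {'caesar': [('d1', 2), ('d2', 1)], 'came': [('d1', 1)], ...}
```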
MapReduce
 The infrastructure used today for this kind of task
 Map: Parsing
 Reduce: Inverting
 More on MapReduce theory here:
 http://en.wikipedia.org/wiki/MapReduce
 http://labs.google.com/papers/mapreduce.html
 A good MapReduce implementation is offered by
Apache: http://hadoop.apache.org/mapreduce/
More on real life
 The same machine can do map, reduce, and other
tasks (e.g., crawling the Web) at the same time
 Focus on minimizing I/O (network & disk)
Dynamic indexing
 Up to now, we have assumed that collections are
static.
 They rarely are:
 Documents come in over time and need to be
inserted.
 Documents are deleted and modified.
 This means that the dictionary and postings lists
have to be modified:
 Postings updates for terms already in dictionary
 New terms added to dictionary
Simplest approach
 Maintain “big” main index
 New docs go into “small” auxiliary index
 Search across both, merge results
 Deletions
 Invalidation bit-vector for deleted docs
 Filter the docs returned by a search through this
invalidation bit-vector
 Periodically, re-index into one main index
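A minimal sketch of this main-plus-auxiliary scheme (indexes as term → docID-list dicts, a set of docIDs as the invalidation structure; class and method names are illustrative):

```python
class DynamicIndex:
    def __init__(self, main_index):
        self.main = main_index          # "big" main index: term -> sorted docIDs
        self.aux = {}                   # "small" in-memory auxiliary index
        self.deleted = set()            # invalidation bit-vector, here a set

    def add_document(self, doc_id, terms):
        for term in set(terms):
            self.aux.setdefault(term, []).append(doc_id)

    def delete_document(self, doc_id):
        self.deleted.add(doc_id)

    def postings(self, term):
        # Search across both indexes, then filter out invalidated docs.
        docs = self.main.get(term, []) + self.aux.get(term, [])
        return sorted(d for d in set(docs) if d not in self.deleted)

    def reindex(self):
        # Periodically merge everything back into one main index.
        for term, docs in self.aux.items():
            self.main[term] = sorted(set(self.main.get(term, [])) | set(docs))
        self.main = {t: [d for d in p if d not in self.deleted]
                     for t, p in self.main.items()}
        self.main = {t: p for t, p in self.main.items() if p}
        self.aux, self.deleted = {}, set()
```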
Issues with main and auxiliary
indexes
 Problem of frequent merges – you touch stuff a lot
 Poor performance during merge
 Actually:
 Merging of the auxiliary index into the main index is efficient
if we keep a separate file for each postings list.
 Merge is the same as a simple append.
 But then we would need a lot of files.
 Assumption for the rest of the lecture: The index is
one big file.
 In reality: Use a scheme somewhere in between
(e.g., split very large postings lists, collect postings
lists of length 1 in one file etc.)
Logarithmic merge
 Maintain a series of indexes, each twice as large
as the previous one.
 Keep smallest (Z0) in memory
 Larger ones (I0, I1, …) on disk
 If Z0 gets too big (> n), write to disk as I0
 or merge with I0 (if I0 already exists) as Z1
 Either write Z1 to disk as I1 (if no I1)
 Or merge with I1 to form Z2
 etc.
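A hedged sketch of the logarithmic-merge bookkeeping; indexes are plain term → docID-list dicts and the "disk" is just a Python list, so this only illustrates the carry-propagation structure (like binary addition), not real I/O:

```python
def merge(i1, i2):
    """Merge two indexes (term -> sorted docID lists)."""
    out = dict(i1)
    for term, docs in i2.items():
        out[term] = sorted(set(out.get(term, [])) | set(docs))
    return out

class LogarithmicMerge:
    def __init__(self, n):
        self.n = n            # capacity of the in-memory index Z0 (in postings)
        self.z0 = {}          # smallest index, kept in memory
        self.disk = []        # disk[i] holds I_i (size ~ n * 2**i) or None

    def add(self, term, doc_id):
        self.z0.setdefault(term, []).append(doc_id)
        if sum(len(p) for p in self.z0.values()) > self.n:
            self._flush()

    def _flush(self):
        z, level = self.z0, 0
        # Merge with I_level as long as it exists, forming Z_(level+1).
        while level < len(self.disk) and self.disk[level] is not None:
            z = merge(z, self.disk[level])
            self.disk[level] = None
            level += 1
        if level == len(self.disk):
            self.disk.append(None)
        self.disk[level] = z          # write Z_level to disk as I_level
        self.z0 = {}

    def postings(self, term):
        # Query processing must consult O(log T) indexes.
        docs = set(self.z0.get(term, []))
        for idx in self.disk:
            if idx is not None:
                docs |= set(idx.get(term, []))
        return sorted(docs)
```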
Logarithmic merge
 Auxiliary and main index: index construction time
is O(T²) as each posting is touched in each
merge.
 Logarithmic merge: Each posting is merged
O(log T) times, so complexity is O(T log T)
 So logarithmic merge is much more efficient for
index construction
 But query processing now requires the merging
of O(log T) indexes
 Whereas it is O(1) if you just have a main and
auxiliary index
Further issues with multiple
indexes
 Collection-wide statistics are hard to maintain
 E.g., when we spoke of spell-correction: which of
several corrected alternatives do we present to
the user?
 We said, pick the one with the most hits
 How do we maintain the top ones with multiple
indexes and invalidation bit vectors?
 One possibility: ignore everything but the main
index for such ordering
 Will see more such statistics used in results
ranking
Dynamic indexing at search
engines
 All the large search engines now do dynamic
indexing
 Their indices have frequent incremental changes
 News items, blogs, new topical web pages
 Sarah Palin, …
 But (sometimes/typically) they also periodically
reconstruct the index from scratch
 Query processing is then switched to the new
index, and the old index is then deleted
Other sorts of indexes
 Positional indexes
 Same sort of sorting problem … just larger. Why?
 Building character n-gram indexes:
 As text is parsed, enumerate n-grams.
 For each n-gram, need pointers to all dictionary
terms containing it – the “postings”.
 Note that the same “postings entry” will arise
repeatedly in parsing the docs – need efficient
hashing to keep track of this.
 E.g., that the trigram uou occurs in the term deciduous
will be discovered on each text occurrence of deciduous
 Only need to process each term once
Building n-gram indexes
 Once all (n-gram∈term) pairs have been
enumerated, must sort for inversion
 For an average dictionary term of 8 characters
 We have about 6 trigrams per term on average
 For a vocabulary of 500K terms, this is about 3
million pointers – can compress
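A small sketch of building such a character k-gram index over dictionary terms (k = 3, no boundary markers, which matches the ≈6 trigrams per 8-character term quoted above; names are illustrative):

```python
from collections import defaultdict

def char_ngrams(term, k=3):
    """All character k-grams of a term (no boundary markers)."""
    return {term[i:i + k] for i in range(len(term) - k + 1)}

def build_kgram_index(dictionary_terms, k=3):
    """k-gram -> sorted list of dictionary terms containing it.
    Using a set per term means 'uou' in 'deciduous' is recorded once,
    no matter how often the term occurs in the collection."""
    index = defaultdict(set)
    for term in dictionary_terms:
        for gram in char_ngrams(term, k):
            index[gram].add(term)
    return {gram: sorted(terms) for gram, terms in index.items()}

print(build_kgram_index(["deciduous", "conspicuous"])["uou"])
# ['conspicuous', 'deciduous']
```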
Index on disk vs. memory
 Most retrieval systems keep the dictionary in
memory and the postings on disk
 Web search engines frequently keep both in
memory
 massive memory requirement
 feasible for large web service installations
 less so for commercial usage where query loads
are lighter
Indexing in the real world
 Typically, don’t have all documents sitting on a
local filesystem
 Documents need to be spidered
 Could be dispersed over a WAN with varying
connectivity
 Must schedule distributed spiders
 Have already discussed distributed indexers
 Could be (secure content) in
 Databases
 Content management applications
 Email applications
Content residing in applications
 Mail systems/groupware and content management
systems contain the most “valuable” documents
 HTTP is often not the most efficient way of fetching
these documents – fetch via native APIs instead
 Specialized, repository-specific connectors
 These connectors also facilitate document viewing
when a search result is selected for viewing
Secure documents
 Each document is accessible to a subset of users
 Usually implemented through some form of
Access Control Lists (ACLs)
 Search users are authenticated
 Query should retrieve a document only if user
can access it
 So if there are docs matching your search but
you’re not privy to them, “Sorry no results found”
 E.g., as a lowly employee in the company, I get
“No results” for the query “salary roster”
Users in groups, docs from groups
 Index the ACLs and filter results by them
 [Figure: a users × documents access matrix – the entry is 0 if the user
can’t read the doc, 1 otherwise. A filtering sketch follows below.]
 Often, user membership in an ACL group verified
at query time – slowdown
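A tiny sketch of ACL filtering at result time, with a user's row of the access matrix represented as a set of readable docIDs (names and data are made up for illustration):

```python
def filter_by_acl(matching_doc_ids, readable_docs):
    """Keep only the matching docs the user is allowed to read."""
    return [d for d in matching_doc_ids if d in readable_docs]

# Each user's 0/1 row of the users x documents matrix, as a set of docIDs.
acl = {"lowly_employee": {"d1", "d3"},
       "hr_manager": {"d1", "d2", "d3"}}

matches = ["d2"]                            # docs matching "salary roster"
print(filter_by_acl(matches, acl["lowly_employee"]) or "Sorry, no results found")
print(filter_by_acl(matches, acl["hr_manager"]))          # ['d2']
```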
Exercise
 Can spelling suggestion compromise such
document-level security?
 Consider the case when there are documents
matching my query, but I lack access to them.
Compound documents
 What if a doc consisted of components
 Each component has its own ACL.
 Your search should return a doc only if your query
matches one of its components that you have
access to.
 More generally: doc assembled from
computations on components
 e.g., in content management systems
 How do you index such docs?
No good answers …
“Rich” documents
 (How) Do we index images?
 Researchers have devised Query Based on
Image Content (QBIC) systems
 “show me a picture similar to this orange circle”
 watch for lecture on vector space retrieval
 In practice, image search usually based on meta-
data such as file name e.g., monalisa.jpg
 New approaches exploit social tagging
 E.g., flickr.com
Passage/sentence retrieval
 Suppose we want to retrieve not an entire
document matching a query, but only a
passage/sentence - say, in a very long document
 Can index passages/sentences as mini-
documents – what should the index units be?
 This is the subject of XML search