5: Index Construction
Information Retrieval Techniques
Index construction
p How do we construct an index?
p What strategies can we use with limited main
memory?
Hardware basics
p Many design decisions in information retrieval are
based on the characteristics of hardware
p We begin by reviewing hardware basics
Hardware basics
p Access to data in memory is much faster than
access to data on disk
p Disk seeks: No data is transferred from disk
while the disk head is being positioned
p Therefore transferring one large chunk of data
from disk to memory is faster than transferring
many small chunks
p Disk I/O is block-based: Reading and writing of
entire blocks (as opposed to smaller chunks)
p Block sizes: 8 KB to 256 KB.
Hardware basics
p Servers used in IR systems now typically have
several GB of main memory, sometimes tens of
GB
p Available disk space is several (2–3) orders of
magnitude larger
p Fault tolerance is very expensive: it's much cheaper to use many regular machines than one fault-tolerant machine.
Google Web Farm
p The best guess is that Google now has more than 2 million servers (8 petabytes of RAM = 8 × 10⁶ gigabytes)
p Spread over at least 12 locations around the world
p Connecting these centers is a high-capacity fiber-optic network that the company has assembled over the last few years.
[Photos: Google data centers in The Dalles, Oregon and Dublin, Ireland]
Hardware assumptions
symbol   statistic                                           value
s        average seek time                                   5 ms = 5 × 10⁻³ s
b        transfer time per byte                              0.02 μs = 2 × 10⁻⁸ s/B
         processor's clock rate                              10⁹ s⁻¹
p        low-level operation (e.g., compare & swap a word)   0.01 μs = 10⁻⁸ s
         size of main memory                                 several GB
         size of disk space                                  1 TB or more

p Example: reading 1 GB from disk
n If stored in contiguous blocks: 2 × 10⁻⁸ s/B × 10⁹ B = 20 s
n If stored in 1M chunks of 1 KB: 20 s + 10⁶ × 5 × 10⁻³ s = 5020 s ≈ 1.4 h
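A quick sanity check of this example, as a sketch in Python using the values from the table above (the function and variable names are ours):

```python
# Rough timing model from the hardware-assumptions table (values are assumptions from this slide).
SEEK_TIME = 5e-3           # s, average seek time
TRANSFER_PER_BYTE = 2e-8   # s/B, transfer time per byte

def read_time(total_bytes, n_chunks):
    """Time to read `total_bytes` split into `n_chunks` chunks,
    paying one seek per chunk plus the raw transfer time."""
    return n_chunks * SEEK_TIME + total_bytes * TRANSFER_PER_BYTE

one_gb = 10**9
print(read_time(one_gb, 1))       # contiguous: ~20 s
print(read_time(one_gb, 10**6))   # 1M chunks of 1 KB: 5020 s, ~1.4 h
```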
A Reuters RCV1 document
Reuters RCV1 statistics
symbol   statistic                                           value
N        documents                                           800,000
L        avg. # tokens per doc                               200
M        terms (= word types)                                400,000
         avg. # bytes per token (incl. spaces/punct.)        6
         avg. # bytes per token (without spaces/punct.)      4.5
         avg. # bytes per term                               7.5
T        non-positional postings                             100,000,000

• 4.5 bytes per word token vs. 7.5 bytes per word type: why?
• Why is T < N × L?
Recall IIR 1 index construction
p Documents are parsed to extract words, and these are saved with the Document ID.

Doc 1: I did enact Julius Caesar. I was killed i' the Capitol; Brutus killed me.
Doc 2: So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious

Resulting (term, Doc #) pairs, in order of occurrence:
(I, 1) (did, 1) (enact, 1) (julius, 1) (caesar, 1) (I, 1) (was, 1) (killed, 1) (i', 1) (the, 1) (capitol, 1) (brutus, 1) (killed, 1) (me, 1)
(so, 2) (let, 2) (it, 2) (be, 2) (with, 2) (caesar, 2) (the, 2) (noble, 2) (brutus, 2) (hath, 2) (told, 2) (you, 2) (caesar, 2) (was, 2) (ambitious, 2)
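A minimal sketch of this parsing step on the two example documents (the tokenization is deliberately simplified):

```python
import re

docs = {
    1: "I did enact Julius Caesar. I was killed i' the Capitol; Brutus killed me.",
    2: "So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious",
}

def parse(docs):
    """Emit (term, docID) pairs in order of occurrence (simplified tokenization)."""
    pairs = []
    for doc_id, text in docs.items():
        for token in re.findall(r"[a-z']+", text.lower()):
            pairs.append((token, doc_id))
    return pairs

pairs = parse(docs)
print(pairs[:5])   # [('i', 1), ('did', 1), ('enact', 1), ('julius', 1), ('caesar', 1)]
```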
Sec. 4.2
Key step
p After all documents have been parsed, the inverted file is sorted by terms. We focus on this sort step.
p We have 100M items to sort for Reuters RCV1 (after having removed duplicate docIDs for each term).

Sorted (term, Doc #) pairs:
(ambitious, 2) (be, 2) (brutus, 1) (brutus, 2) (caesar, 1) (caesar, 2) (caesar, 2) (capitol, 1) (did, 1) (enact, 1) (hath, 2) (I, 1) (I, 1) (i', 1) (it, 2) (julius, 1) (killed, 1) (killed, 1) (let, 2) (me, 1) (noble, 2) (so, 2) (the, 1) (the, 2) (told, 2) (you, 2) (was, 1) (was, 2) (with, 2)
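The sort step itself is just a lexicographic sort of those pairs, by term and then by docID; a sketch continuing the parse() example above:

```python
# Continuing the parse() sketch above: sort the (term, docID) pairs by term,
# then by docID. Python's tuple ordering gives exactly this lexicographic sort.
sorted_pairs = sorted(parse(docs))
print(sorted_pairs[:4])
# [('ambitious', 2), ('be', 2), ('brutus', 1), ('brutus', 2)]
```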
Sec. 4.2
Scaling index construction
p In-memory index construction does not scale
p How can we construct an index for very large
collections?
p Taking into account the hardware constraints we
just learned about . . .
p Memory, disk, speed, etc.
Sec. 4.2
Sort-based index construction
p As we build the index, we parse docs one at a time
n While building the index, we cannot easily exploit
compression tricks (you can, but much more
complex)
n The final postings for any term are incomplete until
the end
p At 12 bytes per non-positional postings entry (term, doc, freq), this demands a lot of space for large collections
p T = 100,000,000 in the case of RCV1 – so 1.2 GB
n So … we can do this in memory in 2015, but typical collections are much larger – e.g., the New York Times provides an index of more than 150 years of newswire
p Thus: We need to store intermediate results on disk.
Sec. 4.2
Use the same algorithm for disk?
p Can we use the same index construction
algorithm for larger collections, but by using
disk instead of memory?
n I.e. scan the documents, and for each term
write the corresponding posting (term, doc,
freq) on a file
n Finally sort the postings and build the postings
lists for all the terms
p No: Sorting T = 100,000,000 records (term, doc,
freq) on disk is too slow – too many disk seeks
n See next slide
p We need an external sorting algorithm.
Sec. 4.2
Bottleneck
p Parse and build postings entries one doc at a time
p Then sort postings entries by term (then by doc
within each term)
p Doing this with random disk seeks would be too
slow – must sort T = 100M records
If every comparison took 2 disk seeks, and N items could be sorted with N log₂ N comparisons, how long would this take?

symbol   statistic                                           value
s        average seek time                                   5 ms = 5 × 10⁻³ s
b        transfer time per byte                              0.02 μs = 2 × 10⁻⁸ s/B
p        low-level operation (e.g., compare & swap a word)   0.01 μs = 10⁻⁸ s
Solution
p (2 × seek-time + comparison-time) × N log₂ N seconds
= (2 × 5 × 10⁻³ + 10⁻⁸) × 10⁸ × log₂ 10⁸
≈ (2 × 5 × 10⁻³) × 10⁸ × log₂ 10⁸
since the time required for the comparison is negligible (as is the time for transferring the data within main memory)
= 10⁶ × log₂ 10⁸ ≈ 10⁶ × 26.6 ≈ 2.66 × 10⁷ s ≈ 307 days!
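The same estimate, as a sketch using the hardware values above (the comparison time is dropped, as in the derivation):

```python
import math

SEEK = 5e-3    # s per disk seek (from the hardware-assumptions table)
T = 1e8        # records to sort

comparisons = T * math.log2(T)      # N log2 N comparisons
seconds = comparisons * 2 * SEEK    # 2 seeks per comparison, op time negligible
print(seconds, seconds / 86400)     # ~2.66e7 s, ~307 days
```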
p What can we do?
Gaius Julius Caesar
Divide et Impera (divide and conquer)
Sec. 4.2
BSBI: Blocked sort-based Indexing
(Sorting with fewer disk seeks)
p 12-byte (4+4+4) records (term-id, doc-id, freq)
p These are generated as we parse docs
p Must now sort 100M such 12-byte records by
term
p Define a Block ~ 10M such records
n Can easily fit a couple into memory
n Will have 10 such blocks to start with (RCV1)
p Basic idea of algorithm:
n Accumulate postings for each block (write on a
file), (read and) sort, write to disk
n Then merge the sorted blocks into one long sorted order.
Sec. 4.2
[Figure: blocks of sorted postings to be merged; the blocks contain term-ids instead of terms, and are obtained by parsing different documents]
Sec. 4.2
Sorting 10 blocks of 10M records
p First, read each block and sort (in memory)
within:
n Quicksort takes 2N log₂ N expected steps
n In our case 2 × (10M × log₂ 10M) steps
p Exercise: estimate total time to read each block
from disk and quicksort it
n Approximately 7 s
p 10 times this estimate – gives us 10 sorted runs
of 10M records each
p Done straightforwardly, need 2 copies of data on
disk
n But can optimize this
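A sketch of the exercise, assuming 12-byte records and 10⁻⁸ s per quicksort step (both values taken from the slides above):

```python
import math

TRANSFER_PER_BYTE = 2e-8   # s/B
OP = 1e-8                  # s per low-level operation
BLOCK = 10**7              # records per block
RECORD_BYTES = 12

read_time = BLOCK * RECORD_BYTES * TRANSFER_PER_BYTE   # ~2.4 s to read the block
sort_time = 2 * BLOCK * math.log2(BLOCK) * OP          # ~4.7 s to quicksort it
print(read_time + sort_time)   # ~7 s per block, ~70 s for all 10 blocks
```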
Sec. 4.2
Blocked sort-based indexing
[Figure: BSBI pseudocode – the dictionary is kept in memory; n = number of generated blocks]
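A compact sketch of the BSBI idea (not the book's exact BSBIndexConstruction pseudocode): accumulate a block of (termID, docID) pairs, sort it, write it out as a run, and later merge the runs. The file layout and helper names here are illustrative assumptions.

```python
import os
import pickle

BLOCK_SIZE = 10_000_000   # postings per block (~10M for RCV1 in the slides)

def bsbi_index(pair_stream, tmp_dir="runs"):
    """pair_stream yields (term_id, doc_id) pairs as documents are parsed."""
    os.makedirs(tmp_dir, exist_ok=True)
    runs, block = [], []
    for pair in pair_stream:
        block.append(pair)
        if len(block) >= BLOCK_SIZE:              # block full: sort it and write a run
            runs.append(_write_run(block, tmp_dir, len(runs)))
            block = []
    if block:
        runs.append(_write_run(block, tmp_dir, len(runs)))
    return runs   # paths of sorted run files, to be merged into the final index

def _write_run(block, tmp_dir, run_no):
    block.sort()                                  # sort by (term_id, doc_id)
    path = os.path.join(tmp_dir, f"run{run_no}.bin")
    with open(path, "wb") as f:
        pickle.dump(block, f)
    return path
```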
Sec. 4.2
How to merge the sorted runs?
p Open all block files and maintain small read
buffers - and a write buffer for the final merged
index
p In each iteration select the lowest termID that
has not been processed yet
p All postings lists for this termID are read and
merged, and the merged list is written back to disk
p Each read buffer is refilled from its file when
necessary
p Provided you read decent-sized chunks of each block into memory and then write out a decent-sized output chunk, you're not killed by disk seeks.
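A sketch of this merge phase using a priority queue over the sorted runs (e.g., the run files written by the BSBI sketch earlier); for brevity each run is read back whole rather than through a fixed-size read buffer as described above:

```python
import heapq
import pickle
from itertools import groupby

def merge_runs(run_paths):
    """Merge sorted runs of (term_id, doc_id) pairs into one postings stream."""
    def read_run(path):
        with open(path, "rb") as f:
            yield from pickle.load(f)     # simplification: whole run in memory

    merged = heapq.merge(*(read_run(p) for p in run_paths))   # k-way merge
    for term_id, group in groupby(merged, key=lambda pair: pair[0]):
        postings = sorted({doc_id for _, doc_id in group})    # merged postings list
        yield term_id, postings           # in reality, written back to disk here
```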
Sec. 4.3
Remaining problem with sort-based
algorithm
p Our assumption was: we can keep the dictionary
in memory
p We need the dictionary (which grows
dynamically) in order to implement a term to
termID mapping
p Actually, we could work with (term, docID)
postings instead of (termID, docID) postings . . .
p . . . but then intermediate files become larger -
we would end up with a scalable, but slower
index construction method.
Why?
Sec. 4.3
SPIMI:
Single-pass in-memory indexing
p Key idea 1: Generate separate dictionaries for
each block – no need to maintain term-termID
mapping across blocks
p Key idea 2: Don’t sort the postings - accumulate
postings in postings lists as they occur
n But at the end, before writing on disk, sort
the terms
p With these two ideas we can generate a complete
inverted index for each block
p These separate indexes can then be merged into
one big index (because terms are sorted).
Sec. 4.3
SPIMI-Invert
[Figure: SPIMI-Invert pseudocode] When memory has been exhausted, write the index of the block (dictionary, postings lists) to disk.
p Then merging of blocks is analogous to BSBI (plus dictionary merging).
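A sketch of the SPIMI idea in the same spirit as the book's SPIMI-Invert: postings are appended to per-term lists as tokens arrive, no term-termID mapping is kept across blocks, and terms are sorted only when a block is written out. The memory-limit test is a stand-in (we count postings instead of measuring free memory).

```python
from collections import defaultdict

def spimi_invert(token_stream, max_postings=10_000_000):
    """token_stream yields (term, doc_id); yields one sorted block index at a time."""
    dictionary = defaultdict(list)   # term -> postings list, built per block
    n_postings = 0
    for term, doc_id in token_stream:
        dictionary[term].append(doc_id)        # accumulate postings, no sorting here
        n_postings += 1
        if n_postings >= max_postings:         # stand-in for "memory exhausted"
            yield sorted(dictionary.items())   # sort terms only when writing the block
            dictionary, n_postings = defaultdict(list), 0
    if dictionary:
        yield sorted(dictionary.items())
```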
Sec. 4.3
SPIMI: Compression
p Compression makes SPIMI even more efficient.
n Compression of terms
n Compression of postings
Sec. 4.4
Distributed indexing
p For web-scale indexing (don’t try this at home!):
must use a distributed computing cluster
p Individual machines are fault-prone
n Can unpredictably slow down or fail
p How do we exploit such a pool of machines?
Sec. 4.4
Google data centers
p Google data centers mainly contain commodity
machines
p Data centers are distributed around the world
p Estimate: a total of 2 million servers
p Estimate: Google installs 100,000 servers each
quarter
n Based on expenditures of 200–250 million
dollars per year
p This would be 10% of the computing capacity of
the world!?!
Google data centers
p Consider a non-fault-tolerant system with
1000 nodes
p Each node has 99.9% availability (probability to
be up in a time unit), what is the availability of
the system?
n All of them should be simultaneously up
p Answer: 37%
n (probability of staying up)^(number of servers) = 0.999¹⁰⁰⁰ ≈ 0.37
p Calculate the number of servers failing per
minute for an installation of 1 million servers.
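The arithmetic behind the 37% figure, plus a sketch of the exercise under an assumed failure rate (one failure per server every 3 years is our illustrative assumption, not a Google statistic):

```python
# Probability that all 1000 nodes are up at the same time:
availability = 0.999 ** 1000
print(availability)                      # ~0.368, i.e. about 37%

# Exercise sketch: expected failures per minute for 1,000,000 servers,
# assuming (our assumption) each server fails about once every 3 years.
minutes_per_3_years = 3 * 365 * 24 * 60
print(1_000_000 / minutes_per_3_years)   # ~0.63, roughly one failure per minute
```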
Sec. 4.4
Distributed indexing
p Maintain a master machine directing the
indexing job – considered “safe”
p Break up indexing into sets of (parallel) tasks
p Master machine assigns each task to an idle
machine from a pool.
Sec. 4.4
Parallel tasks
p We will use two sets of parallel tasks
n Parsers
n Inverters
p Break the input document collection into splits
[Diagram: the document collection divided into splits]
p Each split is a subset of documents
(corresponding to blocks in BSBI/SPIMI)
Sec. 4.4
Parsers
p Master assigns a split to an idle parser machine
p Parser reads a document at a time and emits
(term-id, doc-id) pairs
p Parser writes pairs into j partitions
p Each partition is for a range of terms’ first letters
n (e.g., a-f, g-p, q-z) – here j = 3.
p Now to complete the index inversion …
[Diagram: a split flows into a parser, which writes its postings into 3 partitions: a-f, g-p, q-z]
Sec. 4.4
Inverters
p An inverter collects all (term-id, doc-id) pairs
(= postings) for one term-partition (from the
different segments produced by the parsers)
p Sorts and writes to postings lists.
[Diagram: the a-f segment files from all parsers flow into one inverter, which produces the a-f postings]
Sec. 4.4
Data flow: MapReduce
[Diagram: the master assigns splits to parsers (map phase) and term partitions to inverters (reduce phase); each parser writes segment files partitioned into a-f, g-p, q-z, and each inverter merges one partition into postings]
Sec. 4.4
MapReduce
p The index construction algorithm we just described is
an instance of MapReduce
p MapReduce (Dean and Ghemawat 2004) is a robust
and conceptually simple framework for distributed
computing …
n … without having to write code for the distribution
part
p Solve large computing problems on cheap commodity
machines or nodes that are built from standard parts
(processor, memory, disk) as opposed to on a
supercomputer with specialized hardware
p They describe the Google indexing system (ca. 2002)
as consisting of a number of phases, each
implemented in MapReduce.
Sec. 4.4
MapReduce
p Index construction was just one phase
p Another phase (not shown here): transforming a
term-partitioned index into a document-
partitioned index (for query processing)
n Term-partitioned: one machine handles a
subrange of terms
n Document-partitioned: one machine handles a
subrange of documents
p As we will discuss in the web part of the course -
most search engines use a document-
partitioned index … better load balancing, etc.
Sec. 4.4
Schema for index construction in
MapReduce
p Schema of map and reduce functions
n map: input → list(k, v)
n reduce: (k, list(v)) → output
p Instantiation of the schema for index construction
n map: web collection → list(termID, docID)
n reduce: (<termID1, list(docID)>, <termID2, list(docID)>, …) → (postings list1, postings list2, …)
p Example for index construction (terms are written instead of termIDs for better readability)
n map: (d2: "C died.", d1: "C came, C c'ed.") → (<C, d2>, <died, d2>, <C, d1>, <came, d1>, <C, d1>, <c'ed, d1>)
n reduce: (<C, (d2, d1, d1)>, <died, (d2)>, <came, (d1)>, <c'ed, (d1)>) → (<C, (d1:2, d2:1)>, <died, (d2:1)>, <came, (d1:1)>, <c'ed, (d1:1)>)
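A toy, single-process sketch of these map and reduce functions on the example above; the grouping ("shuffle") that a real MapReduce framework performs between the two phases is simulated in plain Python, and terms are used instead of termIDs:

```python
from collections import Counter, defaultdict

def map_fn(doc_id, text):
    """map: document -> list of (term, docID) pairs."""
    tokens = text.lower().replace(".", "").replace(",", "").split()
    return [(term, doc_id) for term in tokens]

def reduce_fn(term, doc_ids):
    """reduce: (term, list of docIDs) -> postings list with term frequencies."""
    return term, sorted(Counter(doc_ids).items())

docs = {"d2": "C died.", "d1": "C came, C c'ed."}

# "Shuffle": group the mapper output by key, as the framework would.
grouped = defaultdict(list)
for doc_id, text in docs.items():
    for term, d in map_fn(doc_id, text):
        grouped[term].append(d)

print([reduce_fn(term, ds) for term, ds in grouped.items()])
# e.g. ('c', [('d1', 2), ('d2', 1)]), ('died', [('d2', 1)]), ...
```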
Sec. 4.5
Dynamic indexing
p Up to now, we have assumed that collections are
static
p They rarely are:
n Documents come in over time and need to be
inserted
n Documents are deleted and modified
p This means that the dictionary and postings
lists have to be modified:
n Postings updates for terms already in
dictionary
n New terms added to dictionary.
Sec. 4.5
Simplest approach
p Maintain big main index
p New docs go into small auxiliary index
p Search across both, merge results
p Deletions
n Invalidation bit-vector for deleted docs
n Filter documents in search results using this invalidation bit-vector
p Periodically, re-index into one main index.
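A sketch of this scheme: queries consult both indexes and results are filtered through the invalidation bit-vector (modelled here as a set of deleted docIDs). The class and method names are illustrative.

```python
class DynamicIndex:
    """Main index + small in-memory auxiliary index + deletion filter (sketch)."""
    def __init__(self, main_index):
        self.main = main_index    # term -> postings list (big, on disk in reality)
        self.aux = {}             # term -> postings list for newly added docs
        self.deleted = set()      # "invalidation bit-vector" as a set of docIDs

    def add_document(self, doc_id, terms):
        for term in terms:
            self.aux.setdefault(term, []).append(doc_id)

    def delete_document(self, doc_id):
        self.deleted.add(doc_id)

    def postings(self, term):
        merged = self.main.get(term, []) + self.aux.get(term, [])
        return [d for d in merged if d not in self.deleted]
```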
Sec. 4.5
Issues with main and auxiliary indexes
p Problem of frequent merges – you touch stuff a lot
p Poor performance during merge
p Actually:
n Merging of the auxiliary index into the main index
is efficient if we keep a separate file for each
postings list
n Merge is the same as a simple append
n But then we would need a lot of files – inefficient
for the OS
p Assumption for the rest of the lecture: The index is
one big file
p In reality: use a scheme somewhere in between (e.g.,
split very large postings lists, collect postings lists of
length 1 in one file etc.).
Sec. 4.5
Logarithmic merge
p Maintain a series of indexes, each twice as
large as the previous one
p Keep smallest (Z0) in memory
p Larger ones (I0, I1, …) on disk
p If Z0 gets too big (> n), write it to disk as I0
p or merge it with I0 (if I0 already exists) into Z1
p Either write Z1 to disk as I1 (if there is no I1)
p Or merge with I1 to form Z2
p etc.
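A sketch of logarithmic merging with in-memory lists standing in for the on-disk indexes I0, I1, …: each level either holds an index of size about n · 2^i or is empty, and an oversized Z0 is carried upward exactly as described above (the caller empties Z0 after a flush).

```python
def log_merge_add(levels, z0, n):
    """Flush the in-memory index z0 (a list of postings entries) once it exceeds n.

    `levels[i]` holds the on-disk index I_i (size ~ n * 2**i) or None.
    Merging two indexes is just concatenation + re-sort in this sketch."""
    if len(z0) <= n:
        return levels                      # Z0 still fits in memory, nothing to do
    carry, i = sorted(z0), 0
    while True:
        if i == len(levels):
            levels.append(None)
        if levels[i] is None:              # free slot: store the carry as I_i
            levels[i] = carry
            return levels
        carry = sorted(levels[i] + carry)  # merge with I_i and propagate upward
        levels[i] = None
        i += 1
```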
Sec. 4.5
[Figure: logarithmic merging – the permanent indexes I0, I1, … already stored on disk]
Sec. 4.5
Logarithmic merge
p Auxiliary and main index: index construction time is O(T²) as each posting is touched in each merge
p Logarithmic merge: each posting is merged O(log T) times, so complexity is O(T log T)
p So logarithmic merge is much more efficient for
index construction
p But query processing now requires the merging of
O(log T) indexes
n Whereas it is O(1) if you just have a main and
auxiliary index
Sec. 4.5
Further issues with multiple indexes
p Collection-wide statistics are hard to maintain
p E.g., when we spoke of spell-correction: which of
several corrected alternatives do we present to
the user?
n We said, pick the one with the most hits
p How do we maintain the top ones with multiple
indexes and invalidation bit vectors?
n One possibility: ignore everything but the main
index for such ordering
p Will see more such statistics used in results
ranking.
Sec. 4.5
Dynamic indexing at search engines
p All the large search engines now do dynamic
indexing
p Their indices have frequent incremental changes
n News items, blogs, new topical web pages
p Grillo, Crimea, …
p But (sometimes/typically) they also periodically
reconstruct the index from scratch
n Query processing is then switched to the new
index, and the old index is then deleted
Google trends
http://www.google.com/trends/
Sec. 4.5
Other sorts of indexes
p Positional indexes
n Same sort of sorting problem … just larger. Why?
p Building character n-gram indexes:
n As text is parsed, enumerate n-grams
n For each n-gram, need pointers to all dictionary terms
containing it – the “postings”
n Note that the same “postings entry” (i.e., terms) will
arise repeatedly in parsing the docs – need efficient
hashing to keep track of this
p E.g., that the trigram uou occurs in the term
deciduous will be discovered on each text
occurrence of deciduous
p Only need to process each term once.
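A sketch of building such a character k-gram index over the dictionary, touching each term only once (trigrams, no boundary markers):

```python
from collections import defaultdict

def build_kgram_index(terms, k=3):
    """Map each character k-gram to the sorted list of dictionary terms containing it."""
    kgram_index = defaultdict(set)
    for term in set(terms):                  # each term processed only once
        for i in range(len(term) - k + 1):
            kgram_index[term[i:i + k]].add(term)
    return {g: sorted(ts) for g, ts in kgram_index.items()}

index = build_kgram_index(["deciduous", "duo", "obvious"])
print(index["uou"])   # ['deciduous'] – the trigram 'uou' points to terms containing it
```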