Week 7–8
Introduction to Information Retrieval
Index Construction
Plan
§ Last lecture:
  § Dictionary data structures
  § Tolerant retrieval
    § Wildcards
    § Spell correction
    § Soundex
§ This lecture: Index construction
Index construction (Ch. 4)
§ How do we construct an index? (Previous study)
§ What strategies can we use with limited main memory? (Previous study)
Hardware basics (Sec. 4.1)
§ Many design decisions in information retrieval are based on the characteristics of hardware, e.g. the size of main memory, disk, and cache, and the cost of read, write, and other operations.
§ We begin by reviewing hardware basics.
Hardware basics (Sec. 4.1)
§ Access to data in memory is much faster than access to data on disk.
§ Disk seeks: no data is transferred from disk while the disk head is being positioned.
§ Therefore: transferring one large chunk of data from disk to memory is faster than transferring many small chunks.
§ Disk I/O is block-based: reading and writing of entire blocks (as opposed to smaller chunks).
§ Block sizes: 8 KB to 256 KB.
Hardware basics (Sec. 4.1)
§ Servers used in IR systems now typically have several GB of main memory, sometimes tens of GB.
§ Available disk space is several (2–3) orders of magnitude larger.
§ Fault tolerance is very expensive: it is much cheaper to use many regular machines than one fault-tolerant machine.
Bottleneck (Sec. 4.1)
§ Parse and build postings entries one doc at a time.
§ Now sort postings entries by term (then by doc within each term).
§ Doing this with random disk seeks would be too slow – must sort T = 100M records.
Algorithm description
§ The algorithm parses documents into termID–docID pairs and accumulates the pairs in memory until a block of a fixed size is full (PARSENEXTBLOCK).
§ We choose the block size to fit comfortably into memory to permit a fast in-memory sort.
§ The block is then inverted and written to disk.
§ Inversion involves two steps. First, we sort the termID–docID pairs. Next, we collect all termID–docID pairs with the same termID into a postings list, where a posting is simply a docID.
§ The result, an inverted index for the block we have just read, is then written to disk.
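
To make the inversion step concrete, here is a minimal Python sketch of inverting one in-memory block as just described: sort the termID–docID pairs, then group the docIDs of each termID into a postings list. The function name and the toy block are our own illustrative assumptions, not the book's pseudocode.

```python
from itertools import groupby

def invert_block(pairs):
    """Invert one in-memory block of (termID, docID) pairs:
    sort by termID (then docID), then group the docIDs of each
    termID into a postings list (a posting is simply a docID)."""
    pairs.sort()  # fast in-memory sort; the block was sized to fit in RAM
    index = {}
    for term_id, group in groupby(pairs, key=lambda p: p[0]):
        index[term_id] = [doc_id for _, doc_id in group]
    return index

# Example: a tiny block of termID–docID pairs
block = [(2, 1), (1, 2), (2, 2), (1, 1)]
print(invert_block(block))  # {1: [1, 2], 2: [1, 2]}
```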
Class Exercise (Sec. 4.2)
§ Exercise 4.1
If we need T log₂ T comparisons (where T is the number of termID–docID pairs) and two disk seeks for each comparison, how much time would index construction for Reuters-RCV1 take if we used disk instead of memory for storage and an unoptimized sorting algorithm (i.e., not an external sorting algorithm)? Use the system parameters in Table 4.1.
Solution
⇒ Disk seek time = 5 × 10⁻³ s, so 2 × (5 × 10⁻³) seconds per comparison.
   Transfer time = 2 × 10⁻⁸ s per byte.
   Low-level operation = 10⁻⁸ s.
⇒ How long would it take to make T log₂ T comparisons with 2 disk seeks per comparison?
⇒ T log₂ T × 2 × (5 × 10⁻³ s) ... then consider transfer time and any low-level operations.
⇒ Or see the next slide.
Solution #2
§ 100,000,000 records
§ N log₂ N = 2,657,542,475.91 comparisons
§ 2 disk seeks per comparison: 2,657,542,475.91 × 0.005 s = 13,287,712.38 seconds, × 2
§ = 26,575,424.76 seconds
§ = 442,923.75 minutes
§ = 7,382.06 hours
§ = 307.59 days
§ = 84% of a year
§ = 1% of your life
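
The arithmetic on this slide is easy to check; a short Python verification (the variable names are ours, the parameters come from Table 4.1):

```python
import math

T = 100_000_000   # termID–docID pairs (records) for Reuters-RCV1
seek = 5e-3       # disk seek time in seconds (Table 4.1)

comparisons = T * math.log2(T)       # ≈ 2,657,542,475.91
seconds = comparisons * 2 * seek     # two disk seeks per comparison
print(f"{seconds:,.2f} s = {seconds/60:,.2f} min = "
      f"{seconds/3600:,.2f} h = {seconds/86400:,.2f} days")
# 26,575,424.76 s = 442,923.75 min = 7,382.06 h = 307.59 days
```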
Homework #4(a)
§ Exercise 4.2 [⋆]
How would you create the dictionary in blocked sort-based indexing on the fly to avoid an extra pass through the data?
SPIMI-Invert (Sec. 4.3)
[Figure: SPIMI-INVERT pseudocode]

SPIMI: Process
§ SPIMI can index collections of any size as long as there is enough disk space available.
§ The SPIMI algorithm is shown earlier. The part of the algorithm that parses documents and turns them into a stream of term–docID pairs, which we call tokens here, has been omitted.
§ SPIMI-INVERT is called repeatedly on the token stream until the entire collection has been processed.
§ Tokens are processed one by one (line 4). When a term occurs for the first time, it is added to the dictionary, and a new postings list is created (line 6). The call in line 7 returns this postings list for subsequent occurrences of the term.
SPIMI: Process
§ A difference between BSBI and SPIMI is that SPIMI adds a posting directly to its postings list (line 10).
§ Instead of first collecting all termID–docID pairs and then sorting them (as we did in BSBI), each postings list is dynamic (i.e., its size is adjusted as it grows), and it is immediately available to collect postings.
§ This has two advantages:
  § It is faster, and it saves memory: because we keep track of the term a postings list belongs to, the termIDs of postings need not be stored.
  § As a result, the blocks that individual calls of SPIMI-INVERT process are much larger, and the index construction process as a whole is more efficient.
§ We allocate space for a short postings list initially and double the space each time it is full (lines 8–9).
§ Some memory is wasted. However, the overall memory requirements for the dynamically constructed index are still lower than in BSBI.
SPIMI: Process
§ When memory has been exhausted, we write the index of the block (which consists of the dictionary and the postings lists) to disk (line 12).
§ We have to sort the terms (line 11) before doing this because we want to write postings lists in lexicographic order to facilitate the final merging step.
§ Each call of SPIMI-INVERT writes a block to disk, just as in BSBI.
§ The last step of SPIMI (corresponding to line 7 in Figure 4.2; not shown in the algorithm here) is then to merge the blocks into the final inverted index.
§ The time complexity of SPIMI is Θ(T).
§ Both the postings and the dictionary terms can be stored compactly on disk if we employ compression. Compression increases efficiency further because we can process even larger blocks, and because the individual blocks require less space on disk (Section 4.7).
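
A minimal single-machine Python sketch of one SPIMI-INVERT pass as described on the preceding slides: postings are appended directly to per-term lists (no global sort of termID–docID pairs), and terms are sorted only when the block is written out. The function name and the posting-count threshold stand in for the real "memory exhausted" test and are our own assumptions.

```python
def spimi_invert(token_stream, max_postings=1_000_000):
    """One SPIMI-INVERT pass: consume (term, docID) tokens until the
    in-memory block is 'full', then return the sorted block index."""
    dictionary = {}   # term -> postings list (a dynamic Python list)
    n_postings = 0
    for term, doc_id in token_stream:
        if term not in dictionary:
            dictionary[term] = []        # new term: create its postings list
        dictionary[term].append(doc_id)  # add the posting directly, unsorted
        n_postings += 1
        if n_postings >= max_postings:   # stand-in for "memory exhausted"
            break
    # Sort the terms so postings lists go out in lexicographic order,
    # which makes the final multi-way merge of blocks straightforward.
    return {term: dictionary[term] for term in sorted(dictionary)}

tokens = [("brutus", 1), ("caesar", 1), ("caesar", 2), ("brutus", 4)]
print(spimi_invert(tokens))
# {'brutus': [1, 4], 'caesar': [1, 2]}
```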
Distributed indexing (Sec. 4.4)
§ Used mainly for web-scale indexing:
  § must use a distributed computing cluster
§ Individual machines are fault-prone:
  § they can unpredictably slow down or fail
§ How do we exploit such a pool of machines?
§ By constructing a distributed index that is partitioned across several machines.
Distributed indexing (Sec. 4.4)
§ Maintain a master machine directing the indexing job.
§ Break up indexing into sets of (parallel) tasks.
§ The master machine assigns each task to an idle machine from a pool.
Parallel tasks (Sec. 4.4)
§ We will use two sets of parallel tasks:
  § Parsers
  § Inverters
§ Break the input document collection into splits.
§ Each split is a subset of documents (corresponding to blocks in BSBI/SPIMI).
§ First, the input data, in our case a collection of web pages, is split into n splits, where the size of each split is chosen to ensure that the work can be distributed evenly (chunks should not be too large) and efficiently (the total number of chunks we need to manage should not be too large); 16 or 64 MB are good sizes in distributed indexing.
Parsers (Sec. 4.4)
§ The master assigns a split to an idle parser machine.
§ The parser reads a document at a time and emits (term, docID) pairs.
§ The parser writes the pairs into j partitions.
§ Each partition is for a range of terms' first letters
  § (e.g., a-f, g-p, q-z) – here j = 3.
§ Splits are not pre-assigned to machines, but are instead assigned by the master node on an ongoing basis: as a machine finishes processing one split, it is assigned the next one. If a machine dies or becomes a laggard (too slow) due to hardware problems, the split it is working on is simply reassigned to another machine.
Inverters (Sec. 4.4)
§ An inverter collects all (term, docID) pairs (= postings) for one term partition, e.g. for a-f.
§ It sorts them and writes the postings lists.
§ Each term partition (corresponding to r segment files, one on each parser) is processed by one inverter. Finally, the list of values (docIDs) is sorted for each key (term) and written to the final sorted postings list ("postings" in the figure). This completes the construction of the inverted index.
Data flow (Sec. 4.4)
[Figure: MapReduce data flow — the master assigns splits to parsers and term partitions to inverters; the output is the postings.]
MapReduce
§ The index construction algorithm we just described is an instance of MapReduce.
§ MapReduce (Dean and Ghemawat 2004) is a robust and conceptually simple framework for distributed computing ...
§ ... without having to write code for the distribution part.
§ The original Google indexing system consisted of a number of phases, each implemented in MapReduce.
MapReduce (General) (Sec. 4.4)
§ To minimize write times before inverters reduce the data, each parser writes its segment files to its local disk.
§ In the reduce phase, the master communicates to an inverter the locations of the relevant segment files.
§ Each segment file only requires one sequential read. This setup minimizes the amount of network traffic needed during indexing.
§ The same machine can be a parser in the map phase and an inverter in the reduce phase.
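
A toy, single-process Python sketch of the two phases just described: the map phase (parsers) emits (term, docID) pairs into j partitions by first letter, and the reduce phase (inverters) sorts each partition into postings lists. The partition ranges, names, and in-process execution are illustrative assumptions; in a real deployment these run on separate machines with segment files on local disks.

```python
from collections import defaultdict

def map_phase(split, j_ranges=("af", "gp", "qz")):
    """Parser: emit (term, docID) pairs into j partitions by first letter."""
    partitions = defaultdict(list)
    for doc_id, text in split:
        for term in text.lower().split():
            for rng in j_ranges:
                if rng[0] <= term[0] <= rng[1]:
                    partitions[rng].append((term, doc_id))
                    break
    return partitions

def reduce_phase(partition):
    """Inverter: collect, sort, and group one term partition into postings."""
    postings = defaultdict(list)
    for term, doc_id in sorted(partition):
        postings[term].append(doc_id)
    return dict(postings)

split = [(1, "Caesar was ambitious"), (2, "Brutus killed Caesar")]
segments = map_phase(split)
print({rng: reduce_phase(p) for rng, p in segments.items()})
# {'af': {'ambitious': [1], 'brutus': [2], 'caesar': [1, 2]},
#  'qz': {'was': [1]}, 'gp': {'killed': [2]}}
```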
Exercise 4.3
§ For n = 15 splits, r = 10 segments, and j = 3 term partitions, how long would distributed index creation take for Reuters-RCV1 in a MapReduce architecture? Base your assumptions about cluster machines on Tables 4.1 and 4.2.
Solution
§ We will be splitting by documents, so each split is roughly:
  split_documents = 800,000 / 15 = 53,333 documents
§ Each split size is about:
  split_size = 53,333 documents × 200 tokens/document × 6 bytes/token = 63,999,600 bytes ≈ 61 MB
§ MAP phase:
  § Time spent by a machine to read a split:
    read_per_split = split_size × (2 × 10⁻⁸ s/byte) = 1.28 s
  § Time spent to sort this split (algorithm complexity is O(n log₂ n)):
    sort_time = split_documents × 200 tokens/document × log₂(split_documents × 200 tokens/document) × (10⁻⁸ s per comparison) = 2.49 s
Solution
§ Time spent by a machine to write a split:
  write_per_split = split_documents × 200 tokens/document × 4.5 bytes/token × (2 × 10⁻⁸ s/byte) = 0.96 s
  (note: here tokens are already without spaces/punctuation)
§ MAP phase is read + sort + write:
  = 1.28 s + 2.49 s + 0.96 s = 4.73 s
  There are only 10 parser machines, since we have 10 segments. So to parse 15 splits we will need 2 passes of MAP.
  Total MAP phase = 4.73 × 2 = 9.46 s
§ REDUCE phase:
  The index is split into 3 term partitions, so each term partition will hold about 100,000,000 / 3 tokens (this is a rough assumption; in reality, term partition sizes could vary).
Solution
Each inverter will need to read, sort, and write this amount of tokens.
§ term_partition_size = 100,000,000 / 3 tokens × 4.5 bytes/token = 150,000,000 bytes ≈ 143 MB
§ time_reading = 150,000,000 bytes × (2 × 10⁻⁸ s/byte) = 3 s
§ time_sorting = 100,000,000 / 3 tokens × log₂(100,000,000 / 3 tokens) × (10⁻⁸ s per comparison) = 8.33 s
§ time_writing = 150,000,000 bytes × (2 × 10⁻⁸ s/byte) = 3 s
§ Total REDUCE phase = 3 + 8.33 + 3 = 14.33 s
Total time of distributed index creation = 9.46 s + 14.33 s = 23.79 s
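
As with Exercise 4.1, the numbers above are easy to check with a few lines of Python (variable names are ours; the parameters come from Tables 4.1 and 4.2):

```python
import math

byte_transfer = 2e-8   # s/byte (Table 4.1)
low_level_op = 1e-8    # s per comparison (Table 4.1)

# MAP phase (per split): 800,000 docs over n = 15 splits
split_docs = 800_000 // 15                 # ≈ 53,333 documents
tokens = split_docs * 200                  # 200 tokens/document
read = tokens * 6 * byte_transfer          # 6 bytes/token incl. spaces/punct.
sort = tokens * math.log2(tokens) * low_level_op
write = tokens * 4.5 * byte_transfer       # 4.5 bytes/token, no spaces/punct.
map_total = 2 * (read + sort + write)      # 15 splits on 10 parsers -> 2 passes

# REDUCE phase (per term partition): 100M tokens over j = 3 partitions
part_tokens = 100_000_000 / 3
part_bytes = part_tokens * 4.5
reduce_total = (2 * part_bytes * byte_transfer            # read + write
                + part_tokens * math.log2(part_tokens) * low_level_op)

print(f"MAP {map_total:.2f} s + REDUCE {reduce_total:.2f} s "
      f"= {map_total + reduce_total:.2f} s")   # ≈ 9.46 + 14.33 = 23.79 s
```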
Dynamic indexing (Sec. 4.5)
§ Up to now, we have assumed that collections are static. They rarely are:
  § Documents come in over time and need to be inserted.
  § Documents are deleted and modified.
§ This means that the dictionary and postings lists have to be modified:
  § Postings updates for terms already in the dictionary
  § New terms added to the dictionary
Dynamic indexing (Sec. 4.5)
§ Assumption for the remaining lecture: the index is one big file.
§ In reality: use a scheme somewhere in between (e.g., split very large postings lists, collect postings lists of length 1 in one file, etc.)
Logarithmic merge (Sec. 4.5)
§ Maintain a series of indexes, each twice as large as the previous one.
§ At any time, some of these powers of 2 are instantiated.
§ Keep the smallest (Z0) in memory.
§ Keep the larger ones (I0, I1, ...) on disk.
§ If Z0 gets too big (> n), write it to disk as I0,
  § or merge it with I0 (if I0 already exists) as Z1.
§ Either write the merged Z1 to disk as I1 (if there is no I1),
§ or merge it with I1 to form Z2.
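
A minimal Python sketch of this cascade, under our own simplifying assumptions: an "index" is just a sorted list of postings, and merging is concatenate-and-sort.

```python
def log_merge(z0, disk, n):
    """Flush/merge the in-memory index Z0 into on-disk generations
    disk[0] = I0, disk[1] = I1, ...  Generation i holds an index of
    size up to n * 2**i; Z0 is flushed once it exceeds n postings."""
    if len(z0) <= n:
        return z0, disk              # Z0 still fits in memory
    z, i = sorted(z0), 0
    while i < len(disk) and disk[i] is not None:
        z = sorted(z + disk[i])      # merge Zi with existing Ii -> Z(i+1)
        disk[i] = None               # Ii has been consumed by the merge
        i += 1
    if i == len(disk):
        disk.append(None)
    disk[i] = z                      # write the merged index to disk as Ii
    return [], disk                  # the in-memory Z0 starts empty again

disk = []
_, disk = log_merge([3, 1], disk, n=1)   # flush -> I0 = [1, 3]
_, disk = log_merge([2, 4], disk, n=1)   # merge with I0 -> I1 = [1, 2, 3, 4]
print(disk)                              # [None, [1, 2, 3, 4]]
```

Each posting is merged at most O(log(T/n)) times under this scheme, which is where the Θ(T log T) construction bound on the next slide comes from.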
Logarithmic merge (Sec. 4.5)
§ Overall index construction time is Θ(T log T), where T is the total number of postings.
§ We trade this efficiency gain for a slowdown of the query search process.
§ Due to the complexity of dynamic indexing, some large search engines do not construct indexes dynamically. Instead, a new index is built from scratch periodically. Query processing is then switched to the new index and the old index is deleted.
Assignment #4(b) (Sec. 4.5)
§ Exercise 4.6
§ Total index construction time in blocked sort-based indexing is broken down in Table 4.3. Fill out the time column of the table for Reuters-RCV1, assuming a system with the parameters given in Table 4.1.
Assignment #4(c)
§ Exercise 4.7
§ Repeat Exercise 4.6 for the larger collection in Table 4.4. Choose a block size that is realistic for current technology (remember that a block should easily fit into main memory). How many blocks do you need?
Assignment #4(d)
§ Exercise 4.9
§ Assume that machines in MapReduce have 100 GB of disk space each. Assume further that the postings list of the term "the" has a size of 200 GB. Then the MapReduce algorithm as described cannot be run to construct the index. How would you modify MapReduce so that it can handle this case?