5: Index Construction
Information Retrieval Techniques
Index construction
p How do we construct an index?
p What strategies can we use with limited main
memory?
Hardware basics
p Many design decisions in information retrieval are
based on the characteristics of hardware
p We begin by reviewing hardware basics
Hardware basics
p Access to data in memory is much faster than
access to data on disk
p Disk seeks: No data is transferred from disk
while the disk head is being positioned
p Therefore transferring one large chunk of data
from disk to memory is faster than transferring
many small chunks
p Disk I/O is block-based: Reading and writing of
entire blocks (as opposed to smaller chunks)
p Block sizes: 8 KB to 256 KB.
Hardware basics
p Servers used in IR systems now typically have
several GB of main memory, sometimes tens of
GB
p Available disk space is several (2–3) orders of
magnitude larger
p Fault tolerance is very expensive: it's much cheaper to use many regular machines than one fault-tolerant machine.
Google Web Farm
p The best guess is that Google now has more than 2 million servers (8 petabytes of RAM = 8 × 10⁶ gigabytes)
p Spread over at least 12 locations around the world
p Connecting these centers is a high-capacity fiber-optic network that the company has assembled over the last few years.
[Photos: Google data centers in The Dalles, Oregon and Dublin, Ireland]
Hardware assumptions
symbol   statistic                                           value
s        average seek time                                   5 ms = 5 × 10⁻³ s
b        transfer time per byte                              0.02 μs = 2 × 10⁻⁸ s/B
         processor's clock rate                              10⁹ s⁻¹
p        low-level operation (e.g., compare & swap a word)   0.01 μs = 10⁻⁸ s
         size of main memory                                 several GB
         size of disk space                                  1 TB or more

p Example: reading 1 GB from disk
n If stored in contiguous blocks: 2 × 10⁻⁸ s/B × 10⁹ B = 20 s
n If stored in 1M chunks of 1 KB: 20 s + 10⁶ × 5 × 10⁻³ s = 5020 s ≈ 1.4 h
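A quick sanity check of this example, as a sketch in Python using the values from the table above (the function and variable names are ours):

```python
# Rough timing model from the hardware-assumptions table (values are assumptions from this slide).
SEEK_TIME = 5e-3           # s, average seek time
TRANSFER_PER_BYTE = 2e-8   # s/B, transfer time per byte

def read_time(total_bytes, n_chunks):
    """Time to read `total_bytes` split into `n_chunks` chunks,
    paying one seek per chunk plus the raw transfer time."""
    return n_chunks * SEEK_TIME + total_bytes * TRANSFER_PER_BYTE

one_gb = 10**9
print(read_time(one_gb, 1))       # contiguous: ~20 s
print(read_time(one_gb, 10**6))   # 1M chunks of 1 KB: 5020 s, ~1.4 h
```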
A Reuters RCV1 document
Reuters RCV1 statistics
symbol   statistic                                           value
N        documents                                           800,000
L        avg. # tokens per doc                               200
M        terms (= word types)                                400,000
         avg. # bytes per token (incl. spaces/punct.)        6
         avg. # bytes per token (without spaces/punct.)      4.5
         avg. # bytes per term                               7.5
T        non-positional postings                             100,000,000

• 4.5 bytes per word token vs. 7.5 bytes per word type: why?
• Why is T < N × L?
Recall IIR 1 index construction
p Documents are parsed to extract words, and these are saved with the Document ID.

Doc 1: I did enact Julius Caesar. I was killed i' the Capitol; Brutus killed me.
Doc 2: So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious

Resulting (term, Doc #) pairs, in order of occurrence:
(I, 1) (did, 1) (enact, 1) (julius, 1) (caesar, 1) (I, 1) (was, 1) (killed, 1) (i', 1) (the, 1) (capitol, 1) (brutus, 1) (killed, 1) (me, 1)
(so, 2) (let, 2) (it, 2) (be, 2) (with, 2) (caesar, 2) (the, 2) (noble, 2) (brutus, 2) (hath, 2) (told, 2) (you, 2) (caesar, 2) (was, 2) (ambitious, 2)
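A minimal sketch of this parsing step on the two example documents (the tokenization is deliberately simplified):

```python
import re

docs = {
    1: "I did enact Julius Caesar. I was killed i' the Capitol; Brutus killed me.",
    2: "So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious",
}

def parse(docs):
    """Emit (term, docID) pairs in order of occurrence (simplified tokenization)."""
    pairs = []
    for doc_id, text in docs.items():
        for token in re.findall(r"[a-z']+", text.lower()):
            pairs.append((token, doc_id))
    return pairs

pairs = parse(docs)
print(pairs[:5])   # [('i', 1), ('did', 1), ('enact', 1), ('julius', 1), ('caesar', 1)]
```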
Sec. 4.2
Key step
p After all documents have been parsed, the inverted file is sorted by terms. We focus on this sort step.
p We have 100M items to sort for Reuters RCV1 (after having removed duplicate docIDs for each term).

Sorted (term, Doc #) pairs:
(ambitious, 2) (be, 2) (brutus, 1) (brutus, 2) (caesar, 1) (caesar, 2) (caesar, 2) (capitol, 1) (did, 1) (enact, 1) (hath, 2) (I, 1) (I, 1) (i', 1) (it, 2) (julius, 1) (killed, 1) (killed, 1) (let, 2) (me, 1) (noble, 2) (so, 2) (the, 1) (the, 2) (told, 2) (you, 2) (was, 1) (was, 2) (with, 2)
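The sort step itself is just a lexicographic sort of those pairs, by term and then by docID; a sketch continuing the parse() example above:

```python
# Continuing the parse() sketch above: sort the (term, docID) pairs by term,
# then by docID. Python's tuple ordering gives exactly this lexicographic sort.
sorted_pairs = sorted(parse(docs))
print(sorted_pairs[:4])
# [('ambitious', 2), ('be', 2), ('brutus', 1), ('brutus', 2)]
```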
Sec. 4.2
Scaling index construction
p In-memory index construction does not scale
p How can we construct an index for very large
collections?
p Taking into account the hardware constraints we
just learned about . . .
p Memory, disk, speed, etc.
Sec. 4.2
Sort-based index construction
p As we build the index, we parse docs one at a time
n While building the index, we cannot easily exploit
compression tricks (you can, but much more
complex)
n The final postings for any term are incomplete until
the end
p At 12 bytes per non-positional postings entry (term, doc, freq), this demands a lot of space for large collections
p T = 100,000,000 in the case of RCV1 – so 1.2 GB
n So … we can do this in memory in 2015, but typical collections are much larger – e.g., the New York Times provides an index of more than 150 years of newswire
p Thus: We need to store intermediate results on disk.
Sec. 4.2
Use the same algorithm for disk?
p Can we use the same index construction
algorithm for larger collections, but by using
disk instead of memory?
n I.e. scan the documents, and for each term
write the corresponding posting (term, doc,
freq) on a file
n Finally sort the postings and build the postings
lists for all the terms
p No: Sorting T = 100,000,000 records (term, doc,
freq) on disk is too slow – too many disk seeks
n See next slide
p We need an external sorting algorithm.
Sec. 4.2
Bottleneck
p Parse and build postings entries one doc at a time
p Then sort postings entries by term (then by doc
within each term)
p Doing this with random disk seeks would be too
slow – must sort T = 100M records
If every comparison took 2 disk seeks, and N items could be sorted with N log₂ N comparisons, how long would this take?

symbol   statistic                                           value
s        average seek time                                   5 ms = 5 × 10⁻³ s
b        transfer time per byte                              0.02 μs = 2 × 10⁻⁸ s/B
p        low-level operation (e.g., compare & swap a word)   0.01 μs = 10⁻⁸ s
Solution
p (2 × seek-time + comparison-time) × N log₂ N seconds
= (2 × 5 × 10⁻³ + 10⁻⁸) × 10⁸ × log₂ 10⁸
≈ (2 × 5 × 10⁻³) × 10⁸ × log₂ 10⁸
since the time required for the comparison is negligible (as is the time for transferring the data within main memory)
= 10⁶ × log₂ 10⁸ ≈ 10⁶ × 26.6 ≈ 2.66 × 10⁷ s ≈ 307 days!
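The same estimate, as a sketch using the hardware values above (the comparison time is dropped, as in the derivation):

```python
import math

SEEK = 5e-3    # s per disk seek (from the hardware-assumptions table)
T = 1e8        # records to sort

comparisons = T * math.log2(T)      # N log2 N comparisons
seconds = comparisons * 2 * SEEK    # 2 seeks per comparison, op time negligible
print(seconds, seconds / 86400)     # ~2.66e7 s, ~307 days
```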
p What can we do?
Gaius Julius Caesar
Divide et Impera (divide and conquer)
Sec. 4.2
BSBI: Blocked sort-based Indexing
(Sorting with fewer disk seeks)
p 12-byte (4+4+4) records (term-id, doc-id, freq)
p These are generated as we parse docs
p Must now sort 100M such 12-byte records by
term
p Define a Block ~ 10M such records
n Can easily fit a couple into memory
n Will have 10 such blocks to start with (RCV1)
p Basic idea of algorithm:
n Accumulate postings for each block (write on a
file), (read and) sort, write to disk
n Then merge the sorted blocks into one long sorted order.
Sec. 4.2
[Figure: blocks of sorted postings to be merged; the blocks contain term-ids instead of terms, and are obtained by parsing different documents]
Sec. 4.2
Sorting 10 blocks of 10M records
p First, read each block and sort (in memory)
within:
n Quicksort takes 2N log₂ N expected steps
n In our case 2 × (10M × log₂ 10M) steps
p Exercise: estimate total time to read each block
from disk and quicksort it
n Approximately 7 s
p 10 times this estimate – gives us 10 sorted runs
of 10M records each
p Done straightforwardly, need 2 copies of data on
disk
n But can optimize this
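A sketch of the exercise, assuming 12-byte records and 10⁻⁸ s per quicksort step (both values taken from the slides above):

```python
import math

TRANSFER_PER_BYTE = 2e-8   # s/B
OP = 1e-8                  # s per low-level operation
BLOCK = 10**7              # records per block
RECORD_BYTES = 12

read_time = BLOCK * RECORD_BYTES * TRANSFER_PER_BYTE   # ~2.4 s to read the block
sort_time = 2 * BLOCK * math.log2(BLOCK) * OP          # ~4.7 s to quicksort it
print(read_time + sort_time)   # ~7 s per block, ~70 s for all 10 blocks
```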
Sec. 4.2
Blocked sort-based indexing
[Figure: BSBI pseudocode – the dictionary is kept in memory; n = number of generated blocks]
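A compact sketch of the BSBI idea (not the book's exact BSBIndexConstruction pseudocode): accumulate a block of (termID, docID) pairs, sort it, write it out as a run, and later merge the runs. The file layout and helper names here are illustrative assumptions.

```python
import os
import pickle

BLOCK_SIZE = 10_000_000   # postings per block (~10M for RCV1 in the slides)

def bsbi_index(pair_stream, tmp_dir="runs"):
    """pair_stream yields (term_id, doc_id) pairs as documents are parsed."""
    os.makedirs(tmp_dir, exist_ok=True)
    runs, block = [], []
    for pair in pair_stream:
        block.append(pair)
        if len(block) >= BLOCK_SIZE:              # block full: sort it and write a run
            runs.append(_write_run(block, tmp_dir, len(runs)))
            block = []
    if block:
        runs.append(_write_run(block, tmp_dir, len(runs)))
    return runs   # paths of sorted run files, to be merged into the final index

def _write_run(block, tmp_dir, run_no):
    block.sort()                                  # sort by (term_id, doc_id)
    path = os.path.join(tmp_dir, f"run{run_no}.bin")
    with open(path, "wb") as f:
        pickle.dump(block, f)
    return path
```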
Sec. 4.2
How to merge the sorted runs?
p Open all block files and maintain small read
buffers - and a write buffer for the final merged
index
p In each iteration select the lowest termID that
has not been processed yet
p All postings lists for this termID are read and
merged, and the merged list is written back to disk
p Each read buffer is refilled from its file when
necessary
p Provided you read decent-sized chunks of each block into memory and then write out a decent-sized output chunk, you're not killed by disk seeks.
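A sketch of this merge phase using a priority queue over the sorted runs (e.g., the run files written by the BSBI sketch earlier); for brevity each run is read back whole rather than through a fixed-size read buffer as described above:

```python
import heapq
import pickle
from itertools import groupby

def merge_runs(run_paths):
    """Merge sorted runs of (term_id, doc_id) pairs into one postings stream."""
    def read_run(path):
        with open(path, "rb") as f:
            yield from pickle.load(f)     # simplification: whole run in memory

    merged = heapq.merge(*(read_run(p) for p in run_paths))   # k-way merge
    for term_id, group in groupby(merged, key=lambda pair: pair[0]):
        postings = sorted({doc_id for _, doc_id in group})    # merged postings list
        yield term_id, postings           # in reality, written back to disk here
```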
Sec. 4.3
Remaining problem with sort-based
algorithm
p Our assumption was: we can keep the dictionary
in memory
p We need the dictionary (which grows
dynamically) in order to implement a term to
termID mapping
p Actually, we could work with (term, docID)
postings instead of (termID, docID) postings . . .
p . . . but then intermediate files become larger -
we would end up with a scalable, but slower
index construction method.
Why?
Sec. 4.3
SPIMI:
Single-pass in-memory indexing
p Key idea 1: Generate separate dictionaries for
each block – no need to maintain term-termID
mapping across blocks
p Key idea 2: Don’t sort the postings - accumulate
postings in postings lists as they occur
n But at the end, before writing on disk, sort
the terms
p With these two ideas we can generate a complete
inverted index for each block
p These separate indexes can then be merged into
one big index (because terms are sorted).
Sec. 4.3
SPIMI-Invert
[Figure: SPIMI-Invert pseudocode] When memory has been exhausted, write the index of the block (dictionary, postings lists) to disk.
p Then merging of blocks is analogous to BSBI (plus dictionary merging).
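A sketch of the SPIMI idea in the same spirit as the book's SPIMI-Invert: postings are appended to per-term lists as tokens arrive, no term-termID mapping is kept across blocks, and terms are sorted only when a block is written out. The memory-limit test is a stand-in (we count postings instead of measuring free memory).

```python
from collections import defaultdict

def spimi_invert(token_stream, max_postings=10_000_000):
    """token_stream yields (term, doc_id); yields one sorted block index at a time."""
    dictionary = defaultdict(list)   # term -> postings list, built per block
    n_postings = 0
    for term, doc_id in token_stream:
        dictionary[term].append(doc_id)        # accumulate postings, no sorting here
        n_postings += 1
        if n_postings >= max_postings:         # stand-in for "memory exhausted"
            yield sorted(dictionary.items())   # sort terms only when writing the block
            dictionary, n_postings = defaultdict(list), 0
    if dictionary:
        yield sorted(dictionary.items())
```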
Sec. 4.3
SPIMI: Compression
p Compression makes SPIMI even more efficient.
n Compression of terms
n Compression of postings
Sec. 4.4
Distributed indexing
p For web-scale indexing (don’t try this at home!):
must use a distributed computing cluster
p Individual machines are fault-prone
n Can unpredictably slow down or fail
p How do we exploit such a pool of machines?
Sec. 4.4
Google data centers
p Google data centers mainly contain commodity
machines
p Data centers are distributed around the world
p Estimate: a total of 2 million servers
p Estimate: Google installs 100,000 servers each
quarter
n Based on expenditures of 200–250 million
dollars per year
p This would be 10% of the computing capacity of
the world!?!
Google data centers
p Consider a non-fault-tolerant system with
1000 nodes
p Each node has 99.9% availability (probability to
be up in a time unit), what is the availability of
the system?
n All of them should be simultaneously up
p Answer: 37%
n (probability of staying up)^(number of servers) = 0.999¹⁰⁰⁰ ≈ 0.37
p Calculate the number of servers failing per
minute for an installation of 1 million servers.
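The arithmetic behind the 37% figure, plus a sketch of the exercise under an assumed failure rate (one failure per server every 3 years is our illustrative assumption, not a Google statistic):

```python
# Probability that all 1000 nodes are up at the same time:
availability = 0.999 ** 1000
print(availability)                      # ~0.368, i.e. about 37%

# Exercise sketch: expected failures per minute for 1,000,000 servers,
# assuming (our assumption) each server fails about once every 3 years.
minutes_per_3_years = 3 * 365 * 24 * 60
print(1_000_000 / minutes_per_3_years)   # ~0.63, roughly one failure per minute
```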
Sec. 4.4
Distributed indexing
p Maintain a master machine directing the
indexing job – considered “safe”
p Break up indexing into sets of (parallel) tasks
p Master machine assigns each task to an idle
machine from a pool.
Sec. 4.4
Parallel tasks
p We will use two sets of parallel tasks
n Parsers
n Inverters
p Break the input document collection into splits
[Diagram: the document collection divided into splits]
p Each split is a subset of documents
(corresponding to blocks in BSBI/SPIMI)
Sec. 4.4
Parsers
p Master assigns a split to an idle parser machine
p Parser reads a document at a time and emits
(term-id, doc-id) pairs
p Parser writes pairs into j partitions
p Each partition is for a range of terms’ first letters
n (e.g., a-f, g-p, q-z) – here j = 3.
p Now to complete the index inversion …
[Diagram: a split flows into a parser, which writes its postings into 3 partitions: a-f, g-p, q-z]
Sec. 4.4
Inverters
p An inverter collects all (term-id, doc-id) pairs
(= postings) for one term-partition (from the
different segments produced by the parsers)
p Sorts and writes to postings lists.
[Diagram: the a-f segment files from all parsers flow into one inverter, which produces the a-f postings]
Sec. 4.4
Data flow: MapReduce
[Diagram: the master assigns splits to parsers (map phase) and term partitions to inverters (reduce phase); each parser writes segment files partitioned into a-f, g-p, q-z, and each inverter merges one partition into postings]
Sec. 4.4
MapReduce
p The index construction algorithm we just described is
an instance of MapReduce
p MapReduce (Dean and Ghemawat 2004) is a robust
and conceptually simple framework for distributed
computing …
n … without having to write code for the distribution
part
p Solve large computing problems on cheap commodity
machines or nodes that are built from standard parts
(processor, memory, disk) as opposed to on a
supercomputer with specialized hardware
p They describe the Google indexing system (ca. 2002)
as consisting of a number of phases, each
implemented in MapReduce.
Sec. 4.4
MapReduce
p Index construction was just one phase
p Another phase (not shown here): transforming a
term-partitioned index into a document-
partitioned index (for query processing)
n Term-partitioned: one machine handles a
subrange of terms
n Document-partitioned: one machine handles a
subrange of documents
p As we will discuss in the web part of the course -
most search engines use a document-
partitioned index … better load balancing, etc.
Sec. 4.4
Schema for index construction in
MapReduce
p Schema of map and reduce functions
n map: input → list(k, v)
n reduce: (k, list(v)) → output
p Instantiation of the schema for index construction
n map: web collection → list(termID, docID)
n reduce: (<termID1, list(docID)>, <termID2, list(docID)>, …) → (postings list1, postings list2, …)
p Example for index construction (terms are written instead of termIDs for better readability)
n map: (d2: "C died.", d1: "C came, C c'ed.") → (<C, d2>, <died, d2>, <C, d1>, <came, d1>, <C, d1>, <c'ed, d1>)
n reduce: (<C, (d2, d1, d1)>, <died, (d2)>, <came, (d1)>, <c'ed, (d1)>) → (<C, (d1:2, d2:1)>, <died, (d2:1)>, <came, (d1:1)>, <c'ed, (d1:1)>)
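A toy, single-process sketch of these map and reduce functions on the example above; the grouping ("shuffle") that a real MapReduce framework performs between the two phases is simulated in plain Python, and terms are used instead of termIDs:

```python
from collections import Counter, defaultdict

def map_fn(doc_id, text):
    """map: document -> list of (term, docID) pairs."""
    tokens = text.lower().replace(".", "").replace(",", "").split()
    return [(term, doc_id) for term in tokens]

def reduce_fn(term, doc_ids):
    """reduce: (term, list of docIDs) -> postings list with term frequencies."""
    return term, sorted(Counter(doc_ids).items())

docs = {"d2": "C died.", "d1": "C came, C c'ed."}

# "Shuffle": group the mapper output by key, as the framework would.
grouped = defaultdict(list)
for doc_id, text in docs.items():
    for term, d in map_fn(doc_id, text):
        grouped[term].append(d)

print([reduce_fn(term, ds) for term, ds in grouped.items()])
# e.g. ('c', [('d1', 2), ('d2', 1)]), ('died', [('d2', 1)]), ...
```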
Sec. 4.5
Dynamic indexing
p Up to now, we have assumed that collections are
static
p They rarely are:
n Documents come in over time and need to be
inserted
n Documents are deleted and modified
p This means that the dictionary and postings
lists have to be modified:
n Postings updates for terms already in
dictionary
n New terms added to dictionary.
Sec. 4.5
Simplest approach
p Maintain big main index
p New docs go into small auxiliary index
p Search across both, merge results
p Deletions
n Invalidation bit-vector for deleted docs
n Filter documents in search results using this invalidation bit-vector
p Periodically, re-index into one main index.
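A sketch of this scheme: queries consult both indexes and results are filtered through the invalidation bit-vector (modelled here as a set of deleted docIDs). The class and method names are illustrative.

```python
class DynamicIndex:
    """Main index + small in-memory auxiliary index + deletion filter (sketch)."""
    def __init__(self, main_index):
        self.main = main_index    # term -> postings list (big, on disk in reality)
        self.aux = {}             # term -> postings list for newly added docs
        self.deleted = set()      # "invalidation bit-vector" as a set of docIDs

    def add_document(self, doc_id, terms):
        for term in terms:
            self.aux.setdefault(term, []).append(doc_id)

    def delete_document(self, doc_id):
        self.deleted.add(doc_id)

    def postings(self, term):
        merged = self.main.get(term, []) + self.aux.get(term, [])
        return [d for d in merged if d not in self.deleted]
```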
Sec. 4.5
Issues with main and auxiliary indexes
p Problem of frequent merges – you touch stuff a lot
p Poor performance during merge
p Actually:
n Merging of the auxiliary index into the main index
is efficient if we keep a separate file for each
postings list
n Merge is the same as a simple append
n But then we would need a lot of files – inefficient
for the OS
p Assumption for the rest of the lecture: The index is
one big file
p In reality: use a scheme somewhere in between (e.g.,
split very large postings lists, collect postings lists of
length 1 in one file etc.).
Sec. 4.5
Logarithmic merge
p Maintain a series of indexes, each twice as
large as the previous one
p Keep smallest (Z0) in memory
p Larger ones (I0, I1, …) on disk
p If Z0 gets too big (> n), write it to disk as I0
p or merge it with I0 (if I0 already exists) into Z1
p Either write Z1 to disk as I1 (if there is no I1)
p Or merge with I1 to form Z2
p etc.
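A sketch of logarithmic merging with in-memory lists standing in for the on-disk indexes I0, I1, …: each level either holds an index of size about n · 2^i or is empty, and an oversized Z0 is carried upward exactly as described above (the caller empties Z0 after a flush).

```python
def log_merge_add(levels, z0, n):
    """Flush the in-memory index z0 (a list of postings entries) once it exceeds n.

    `levels[i]` holds the on-disk index I_i (size ~ n * 2**i) or None.
    Merging two indexes is just concatenation + re-sort in this sketch."""
    if len(z0) <= n:
        return levels                      # Z0 still fits in memory, nothing to do
    carry, i = sorted(z0), 0
    while True:
        if i == len(levels):
            levels.append(None)
        if levels[i] is None:              # free slot: store the carry as I_i
            levels[i] = carry
            return levels
        carry = sorted(levels[i] + carry)  # merge with I_i and propagate upward
        levels[i] = None
        i += 1
```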
Sec. 4.5
[Figure: logarithmic merging – the permanent indexes I0, I1, … already stored on disk]
Sec. 4.5
Logarithmic merge
p Auxiliary and main index: index construction time is O(T²) as each posting is touched in each merge
p Logarithmic merge: each posting is merged O(log T) times, so complexity is O(T log T)
p So logarithmic merge is much more efficient for
index construction
p But query processing now requires the merging of
O(log T) indexes
n Whereas it is O(1) if you just have a main and
auxiliary index
Sec. 4.5
Further issues with multiple indexes
p Collection-wide statistics are hard to maintain
p E.g., when we spoke of spell-correction: which of
several corrected alternatives do we present to
the user?
n We said, pick the one with the most hits
p How do we maintain the top ones with multiple
indexes and invalidation bit vectors?
n One possibility: ignore everything but the main
index for such ordering
p Will see more such statistics used in results
ranking.
Sec. 4.5
Dynamic indexing at search engines
p All the large search engines now do dynamic
indexing
p Their indices have frequent incremental changes
n News items, blogs, new topical web pages
p Grillo, Crimea, …
p But (sometimes/typically) they also periodically
reconstruct the index from scratch
n Query processing is then switched to the new
index, and the old index is then deleted
Google trends
http://www.google.com/trends/
Sec. 4.5
Other sorts of indexes
p Positional indexes
n Same sort of sorting problem … just larger. Why?
p Building character n-gram indexes:
n As text is parsed, enumerate n-grams
n For each n-gram, need pointers to all dictionary terms
containing it – the “postings”
n Note that the same “postings entry” (i.e., terms) will
arise repeatedly in parsing the docs – need efficient
hashing to keep track of this
p E.g., that the trigram uou occurs in the term
deciduous will be discovered on each text
occurrence of deciduous
p Only need to process each term once.
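A sketch of building such a character k-gram index over the dictionary, touching each term only once (trigrams, no boundary markers):

```python
from collections import defaultdict

def build_kgram_index(terms, k=3):
    """Map each character k-gram to the sorted list of dictionary terms containing it."""
    kgram_index = defaultdict(set)
    for term in set(terms):                  # each term processed only once
        for i in range(len(term) - k + 1):
            kgram_index[term[i:i + k]].add(term)
    return {g: sorted(ts) for g, ts in kgram_index.items()}

index = build_kgram_index(["deciduous", "duo", "obvious"])
print(index["uou"])   # ['deciduous'] – the trigram 'uou' points to terms containing it
```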