Week 7–8
Introduction to Information Retrieval
Index Construction
Plan
§ Last lecture:
  § Dictionary data structures
  § Tolerant retrieval
    § Wildcards
    § Spell correction
    § Soundex
§ This lecture: Index construction
Index construction (Ch. 4)
§ How do we construct an index? (Previous study)
§ What strategies can we use with limited main memory? (Previous study)
Hardware basics (Sec. 4.1)
§ Many design decisions in information retrieval are based on the characteristics of hardware, e.g. the size of main memory, disk, and cache, and the cost of read, write, and other operations.
§ We begin by reviewing hardware basics.
Hardware basics (Sec. 4.1)
§ Access to data in memory is much faster than access to data on disk.
§ Disk seeks: no data is transferred from disk while the disk head is being positioned.
§ Therefore: transferring one large chunk of data from disk to memory is faster than transferring many small chunks.
§ Disk I/O is block-based: reading and writing of entire blocks (as opposed to smaller chunks).
§ Block sizes: 8 KB to 256 KB.
Hardware basics (Sec. 4.1)
§ Servers used in IR systems now typically have several GB of main memory, sometimes tens of GB.
§ Available disk space is several (2–3) orders of magnitude larger.
§ Fault tolerance is very expensive: it is much cheaper to use many regular machines than one fault-tolerant machine.
Bottleneck (Sec. 4.1)
§ Parse and build postings entries one doc at a time.
§ Now sort postings entries by term (then by doc within each term).
§ Doing this with random disk seeks would be too slow – must sort T = 100M records.
Algorithm description
§ The algorithm parses documents into termID–docID pairs and accumulates the pairs in memory until a block of a fixed size is full (PARSENEXTBLOCK).
§ We choose the block size to fit comfortably into memory to permit a fast in-memory sort.
§ The block is then inverted and written to disk.
§ Inversion involves two steps. First, we sort the termID–docID pairs. Next, we collect all termID–docID pairs with the same termID into a postings list, where a posting is simply a docID.
§ The result, an inverted index for the block we have just read, is then written to disk.
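
To make the inversion step concrete, here is a minimal Python sketch of inverting one in-memory block as just described: sort the termID–docID pairs, then group the docIDs of each termID into a postings list. The function name and the toy block are our own illustrative assumptions, not the book's pseudocode.

```python
from itertools import groupby

def invert_block(pairs):
    """Invert one in-memory block of (termID, docID) pairs:
    sort by termID (then docID), then group the docIDs of each
    termID into a postings list (a posting is simply a docID)."""
    pairs.sort()  # fast in-memory sort; the block was sized to fit in RAM
    index = {}
    for term_id, group in groupby(pairs, key=lambda p: p[0]):
        index[term_id] = [doc_id for _, doc_id in group]
    return index

# Example: a tiny block of termID–docID pairs
block = [(2, 1), (1, 2), (2, 2), (1, 1)]
print(invert_block(block))  # {1: [1, 2], 2: [1, 2]}
```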
Class Exercise (Sec. 4.2)
§ Exercise 4.1
If we need T log₂ T comparisons (where T is the number of termID–docID pairs) and two disk seeks for each comparison, how much time would index construction for Reuters-RCV1 take if we used disk instead of memory for storage and an unoptimized sorting algorithm (i.e., not an external sorting algorithm)? Use the system parameters in Table 4.1.
Solution
⇒ Disk seek time = 5 × 10⁻³ s, so 2 × (5 × 10⁻³) seconds per comparison.
   Transfer time = 2 × 10⁻⁸ s per byte.
   Low-level operation = 10⁻⁸ s.
⇒ How long would it take to make T log₂ T comparisons with 2 disk seeks per comparison?
⇒ T log₂ T × 2 × (5 × 10⁻³ s) ... then consider transfer time and any low-level operations.
⇒ Or see the next slide.
Solution #2
§ 100,000,000 records
§ N log₂ N = 2,657,542,475.91 comparisons
§ 2 disk seeks per comparison: 2,657,542,475.91 × 0.005 s = 13,287,712.38 seconds, × 2
§ = 26,575,424.76 seconds
§ = 442,923.75 minutes
§ = 7,382.06 hours
§ = 307.59 days
§ = 84% of a year
§ = 1% of your life
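
The arithmetic on this slide is easy to check; a short Python verification (the variable names are ours, the parameters come from Table 4.1):

```python
import math

T = 100_000_000   # termID–docID pairs (records) for Reuters-RCV1
seek = 5e-3       # disk seek time in seconds (Table 4.1)

comparisons = T * math.log2(T)       # ≈ 2,657,542,475.91
seconds = comparisons * 2 * seek     # two disk seeks per comparison
print(f"{seconds:,.2f} s = {seconds/60:,.2f} min = "
      f"{seconds/3600:,.2f} h = {seconds/86400:,.2f} days")
# 26,575,424.76 s = 442,923.75 min = 7,382.06 h = 307.59 days
```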
Homework #4(a)
§ Exercise 4.2 [⋆]
How would you create the dictionary in blocked sort-based indexing on the fly to avoid an extra pass through the data?
SPIMI-Invert (Sec. 4.3)
[Figure: SPIMI-INVERT pseudocode]

SPIMI: Process
§ SPIMI can index collections of any size as long as there is enough disk space available.
§ The SPIMI algorithm is shown earlier. The part of the algorithm that parses documents and turns them into a stream of term–docID pairs, which we call tokens here, has been omitted.
§ SPIMI-INVERT is called repeatedly on the token stream until the entire collection has been processed.
§ Tokens are processed one by one (line 4). When a term occurs for the first time, it is added to the dictionary, and a new postings list is created (line 6). The call in line 7 returns this postings list for subsequent occurrences of the term.
SPIMI: Process
§ A difference between BSBI and SPIMI is that SPIMI adds a posting directly to its postings list (line 10).
§ Instead of first collecting all termID–docID pairs and then sorting them (as we did in BSBI), each postings list is dynamic (i.e., its size is adjusted as it grows), and it is immediately available to collect postings.
§ This has two advantages:
  § It is faster, and it saves memory: because we keep track of the term a postings list belongs to, the termIDs of postings need not be stored.
  § As a result, the blocks that individual calls of SPIMI-INVERT process are much larger, and the index construction process as a whole is more efficient.
§ We allocate space for a short postings list initially and double the space each time it is full (lines 8–9).
§ Some memory is wasted. However, the overall memory requirements for the dynamically constructed index are still lower than in BSBI.
SPIMI: Process
§ When memory has been exhausted, we write the index of the block (which consists of the dictionary and the postings lists) to disk (line 12).
§ We have to sort the terms (line 11) before doing this because we want to write postings lists in lexicographic order to facilitate the final merging step.
§ Each call of SPIMI-INVERT writes a block to disk, just as in BSBI.
§ The last step of SPIMI (corresponding to line 7 in Figure 4.2; not shown in the algorithm here) is then to merge the blocks into the final inverted index.
§ The time complexity of SPIMI is Θ(T).
§ Both the postings and the dictionary terms can be stored compactly on disk if we employ compression. Compression increases efficiency further because we can process even larger blocks, and because the individual blocks require less space on disk (Section 4.7).
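
A minimal single-machine Python sketch of one SPIMI-INVERT pass as described on the preceding slides: postings are appended directly to per-term lists (no global sort of termID–docID pairs), and terms are sorted only when the block is written out. The function name and the posting-count threshold stand in for the real "memory exhausted" test and are our own assumptions.

```python
def spimi_invert(token_stream, max_postings=1_000_000):
    """One SPIMI-INVERT pass: consume (term, docID) tokens until the
    in-memory block is 'full', then return the sorted block index."""
    dictionary = {}   # term -> postings list (a dynamic Python list)
    n_postings = 0
    for term, doc_id in token_stream:
        if term not in dictionary:
            dictionary[term] = []        # new term: create its postings list
        dictionary[term].append(doc_id)  # add the posting directly, unsorted
        n_postings += 1
        if n_postings >= max_postings:   # stand-in for "memory exhausted"
            break
    # Sort the terms so postings lists go out in lexicographic order,
    # which makes the final multi-way merge of blocks straightforward.
    return {term: dictionary[term] for term in sorted(dictionary)}

tokens = [("brutus", 1), ("caesar", 1), ("caesar", 2), ("brutus", 4)]
print(spimi_invert(tokens))
# {'brutus': [1, 4], 'caesar': [1, 2]}
```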
Distributed indexing (Sec. 4.4)
§ Used mainly for web-scale indexing:
  § must use a distributed computing cluster
§ Individual machines are fault-prone:
  § they can unpredictably slow down or fail
§ How do we exploit such a pool of machines?
§ By constructing a distributed index that is partitioned across several machines.
Distributed indexing (Sec. 4.4)
§ Maintain a master machine directing the indexing job.
§ Break up indexing into sets of (parallel) tasks.
§ The master machine assigns each task to an idle machine from a pool.
Parallel tasks (Sec. 4.4)
§ We will use two sets of parallel tasks:
  § Parsers
  § Inverters
§ Break the input document collection into splits.
§ Each split is a subset of documents (corresponding to blocks in BSBI/SPIMI).
§ First, the input data, in our case a collection of web pages, is split into n splits, where the size of each split is chosen to ensure that the work can be distributed evenly (chunks should not be too large) and efficiently (the total number of chunks we need to manage should not be too large); 16 or 64 MB are good sizes in distributed indexing.
Parsers (Sec. 4.4)
§ The master assigns a split to an idle parser machine.
§ The parser reads a document at a time and emits (term, docID) pairs.
§ The parser writes the pairs into j partitions.
§ Each partition is for a range of terms' first letters
  § (e.g., a-f, g-p, q-z) – here j = 3.
§ Splits are not pre-assigned to machines, but are instead assigned by the master node on an ongoing basis: as a machine finishes processing one split, it is assigned the next one. If a machine dies or becomes a laggard (too slow) due to hardware problems, the split it is working on is simply reassigned to another machine.
Inverters (Sec. 4.4)
§ An inverter collects all (term, docID) pairs (= postings) for one term partition, e.g. for a-f.
§ It sorts them and writes the postings lists.
§ Each term partition (corresponding to r segment files, one on each parser) is processed by one inverter. Finally, the list of values (docIDs) is sorted for each key (term) and written to the final sorted postings list ("postings" in the figure). This completes the construction of the inverted index.
Data flow (Sec. 4.4)
[Figure: MapReduce data flow — the master assigns splits to parsers and term partitions to inverters; the output is the postings.]
MapReduce
§ The index construction algorithm we just described is an instance of MapReduce.
§ MapReduce (Dean and Ghemawat 2004) is a robust and conceptually simple framework for distributed computing ...
§ ... without having to write code for the distribution part.
§ The original Google indexing system consisted of a number of phases, each implemented in MapReduce.
MapReduce (General) (Sec. 4.4)
§ To minimize write times before inverters reduce the data, each parser writes its segment files to its local disk.
§ In the reduce phase, the master communicates to an inverter the locations of the relevant segment files.
§ Each segment file only requires one sequential read. This setup minimizes the amount of network traffic needed during indexing.
§ The same machine can be a parser in the map phase and an inverter in the reduce phase.
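
A toy, single-process Python sketch of the two phases just described: the map phase (parsers) emits (term, docID) pairs into j partitions by first letter, and the reduce phase (inverters) sorts each partition into postings lists. The partition ranges, names, and in-process execution are illustrative assumptions; in a real deployment these run on separate machines with segment files on local disks.

```python
from collections import defaultdict

def map_phase(split, j_ranges=("af", "gp", "qz")):
    """Parser: emit (term, docID) pairs into j partitions by first letter."""
    partitions = defaultdict(list)
    for doc_id, text in split:
        for term in text.lower().split():
            for rng in j_ranges:
                if rng[0] <= term[0] <= rng[1]:
                    partitions[rng].append((term, doc_id))
                    break
    return partitions

def reduce_phase(partition):
    """Inverter: collect, sort, and group one term partition into postings."""
    postings = defaultdict(list)
    for term, doc_id in sorted(partition):
        postings[term].append(doc_id)
    return dict(postings)

split = [(1, "Caesar was ambitious"), (2, "Brutus killed Caesar")]
segments = map_phase(split)
print({rng: reduce_phase(p) for rng, p in segments.items()})
# {'af': {'ambitious': [1], 'brutus': [2], 'caesar': [1, 2]},
#  'qz': {'was': [1]}, 'gp': {'killed': [2]}}
```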
Exercise 4.3
§ For n = 15 splits, r = 10 segments, and j = 3 term partitions, how long would distributed index creation take for Reuters-RCV1 in a MapReduce architecture? Base your assumptions about cluster machines on Tables 4.1 and 4.2.
Solution
§ We will be splitting by documents, so each split is roughly:
  split_documents = 800,000 / 15 = 53,333 documents
§ Each split size is about:
  split_size = 53,333 documents × 200 tokens/document × 6 bytes/token = 63,999,600 bytes ≈ 61 MB
§ MAP phase:
  § Time spent by a machine to read a split:
    read_per_split = split_size × (2 × 10⁻⁸ s/byte) = 1.28 s
  § Time spent to sort this split (algorithm complexity is O(n log₂ n)):
    sort_time = split_documents × 200 tokens/document × log₂(split_documents × 200 tokens/document) × (10⁻⁸ s per comparison) = 2.49 s
Solution
§ Time spent by a machine to write a split:
  write_per_split = split_documents × 200 tokens/document × 4.5 bytes/token × (2 × 10⁻⁸ s/byte) = 0.96 s
  (note: here tokens are already without spaces/punctuation)
§ MAP phase is read + sort + write:
  = 1.28 s + 2.49 s + 0.96 s = 4.73 s
  There are only 10 parser machines, since we have 10 segments. So to parse 15 splits we will need 2 passes of MAP.
  Total MAP phase = 4.73 × 2 = 9.46 s
§ REDUCE phase:
  The index is split into 3 term partitions, so each term partition will hold about 100,000,000 / 3 tokens (this is a rough assumption; in reality, term partition sizes could vary).
Solution
Each inverter will need to read, sort, and write this amount of tokens.
§ term_partition_size = 100,000,000 / 3 tokens × 4.5 bytes/token = 150,000,000 bytes ≈ 143 MB
§ time_reading = 150,000,000 bytes × (2 × 10⁻⁸ s/byte) = 3 s
§ time_sorting = 100,000,000 / 3 tokens × log₂(100,000,000 / 3 tokens) × (10⁻⁸ s per comparison) = 8.33 s
§ time_writing = 150,000,000 bytes × (2 × 10⁻⁸ s/byte) = 3 s
§ Total REDUCE phase = 3 + 8.33 + 3 = 14.33 s
Total time of distributed index creation = 9.46 s + 14.33 s = 23.79 s
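
As with Exercise 4.1, the numbers above are easy to check with a few lines of Python (variable names are ours; the parameters come from Tables 4.1 and 4.2):

```python
import math

byte_transfer = 2e-8   # s/byte (Table 4.1)
low_level_op = 1e-8    # s per comparison (Table 4.1)

# MAP phase (per split): 800,000 docs over n = 15 splits
split_docs = 800_000 // 15                 # ≈ 53,333 documents
tokens = split_docs * 200                  # 200 tokens/document
read = tokens * 6 * byte_transfer          # 6 bytes/token incl. spaces/punct.
sort = tokens * math.log2(tokens) * low_level_op
write = tokens * 4.5 * byte_transfer       # 4.5 bytes/token, no spaces/punct.
map_total = 2 * (read + sort + write)      # 15 splits on 10 parsers -> 2 passes

# REDUCE phase (per term partition): 100M tokens over j = 3 partitions
part_tokens = 100_000_000 / 3
part_bytes = part_tokens * 4.5
reduce_total = (2 * part_bytes * byte_transfer            # read + write
                + part_tokens * math.log2(part_tokens) * low_level_op)

print(f"MAP {map_total:.2f} s + REDUCE {reduce_total:.2f} s "
      f"= {map_total + reduce_total:.2f} s")   # ≈ 9.46 + 14.33 = 23.79 s
```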
Dynamic indexing (Sec. 4.5)
§ Up to now, we have assumed that collections are static. They rarely are:
  § Documents come in over time and need to be inserted.
  § Documents are deleted and modified.
§ This means that the dictionary and postings lists have to be modified:
  § Postings updates for terms already in the dictionary
  § New terms added to the dictionary
Dynamic indexing (Sec. 4.5)
§ Assumption for the remaining lecture: the index is one big file.
§ In reality: use a scheme somewhere in between (e.g., split very large postings lists, collect postings lists of length 1 in one file, etc.)
Logarithmic merge (Sec. 4.5)
§ Maintain a series of indexes, each twice as large as the previous one.
§ At any time, some of these powers of 2 are instantiated.
§ Keep the smallest (Z0) in memory.
§ Keep the larger ones (I0, I1, ...) on disk.
§ If Z0 gets too big (> n), write it to disk as I0,
  § or merge it with I0 (if I0 already exists) as Z1.
§ Either write the merged Z1 to disk as I1 (if there is no I1),
§ or merge it with I1 to form Z2.
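
A minimal Python sketch of this cascade, under our own simplifying assumptions: an "index" is just a sorted list of postings, and merging is concatenate-and-sort.

```python
def log_merge(z0, disk, n):
    """Flush/merge the in-memory index Z0 into on-disk generations
    disk[0] = I0, disk[1] = I1, ...  Generation i holds an index of
    size up to n * 2**i; Z0 is flushed once it exceeds n postings."""
    if len(z0) <= n:
        return z0, disk              # Z0 still fits in memory
    z, i = sorted(z0), 0
    while i < len(disk) and disk[i] is not None:
        z = sorted(z + disk[i])      # merge Zi with existing Ii -> Z(i+1)
        disk[i] = None               # Ii has been consumed by the merge
        i += 1
    if i == len(disk):
        disk.append(None)
    disk[i] = z                      # write the merged index to disk as Ii
    return [], disk                  # the in-memory Z0 starts empty again

disk = []
_, disk = log_merge([3, 1], disk, n=1)   # flush -> I0 = [1, 3]
_, disk = log_merge([2, 4], disk, n=1)   # merge with I0 -> I1 = [1, 2, 3, 4]
print(disk)                              # [None, [1, 2, 3, 4]]
```

Each posting is merged at most O(log(T/n)) times under this scheme, which is where the Θ(T log T) construction bound on the next slide comes from.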
Logarithmic merge (Sec. 4.5)
§ Overall index construction time is Θ(T log T), where T is the total number of postings.
§ We trade this efficiency gain for a slowdown of the query search process.
§ Due to the complexity of dynamic indexing, some large search engines do not construct indexes dynamically. Instead, a new index is built from scratch periodically. Query processing is then switched to the new index and the old index is deleted.
Assignment #4(b) (Sec. 4.5)
§ Exercise 4.6
§ Total index construction time in blocked sort-based indexing is broken down in Table 4.3. Fill out the time column of the table for Reuters-RCV1, assuming a system with the parameters given in Table 4.1.
Assignment #4(c)
§ Exercise 4.7
§ Repeat Exercise 4.6 for the larger collection in Table 4.4. Choose a block size that is realistic for current technology (remember that a block should easily fit into main memory). How many blocks do you need?
Assignment #4(d)
§ Exercise 4.9
§ Assume that machines in MapReduce have 100 GB of disk space each. Assume further that the postings list of the term "the" has a size of 200 GB. Then the MapReduce algorithm as described cannot be run to construct the index. How would you modify MapReduce so that it can handle this case?