Department of Information Technology
Sub : Information Storage and Retrieval
Topic: File Strcutre-2
Mr. Rajpure A.S.
Introduction
Contents:
Basic Concepts of IR, Data Retrieval & Information Retrieval, text mining and IR relation, IR system
block diagram.
Automatic Text Analysis: Luhn's ideas, Conflation Algorithm, Indexing and Index Term Weighing,
Probabilistic Indexing
Inverted file, Suffix trees & suffix arrays, Signature Files, Scatter storage or hash addressing, Clustered
files, Hypertext and XML data structures.
Unit Objectives
1. To understand information retrieval process.
Unit outcomes: On completion the students will be able to :
1. By the end of the course, students should be able to understand the concept of Information
retrieval
Outcome Mapping:
PEO: I, PO: a CO: 1., PSO: 1
Books :
T1. Yates & Neto, Modern Information Retrieval, Pearson Education, ISBN:81-297-0274-6 2.
T2. C.J. T2: Rijsbergen, Information Retrieval, (www.dcs.gla.ac.uk)., 2ndISBN:978- 408709293
Indexing and Searching
(File Structures)
File Structures
• Inverted Files
• Signatures
• PAT Trees
• Sequential Searching
• Compression
4
Inverted Files
• Characteristics
• A word-oriented mechanism based on sorted list of
keywords, with each keyword having links to the documents
containing that keyword.
• Preprocessing
• Each document is assigned a list of keywords or attributes.
• Each keyword (attribute) is associated with relevance
weights.
5
Inversion of Word List
1. The input text is parsed into a list of words along with their
location in the text. (time and storage consuming operation)
2. This list is inverted from a list of terms in location order to
a list of terms in alphabetical order.
3. Add term weights, or reorganize or compress the files.
Inversion of Word List
Structure and Construction
• Structure (split the index into two files)
• Vocabulary: O(nβ) according to Heaps’ Law
• Occurrences : depends on the addressing granularity
• Construction
• The vocabulary is stored in lexicographical order and
points to posting list.
• Posting file:the lists of occurrences are stored
contiguously
8
Dictionary and Postings File
(document #,
frequency)
Vocabulary and Posting File
10
Structures used in Inverted Files
• Vocabulary
• Sorted Arrays
• Hashing Structures
• Keyword Trees: Tries (digital search trees)
• The Search Procedure
• Vocabulary search
• Retrieval of occurrences
• Manipulation of occurrences
11
Size of an Inverted File
• Block addressing
• The text is divided in blocks, and the occurrences point to
the blocks instead of full inverted indices where exact
occurrences are recorded
Cost
• Advantage
• easy to implement
• Disadvantage
• updating the index is expensive
13
Signature Files
Information Retrieval: Data Structures and
Algorithms (Chapters 4)
W.B. Frakes and R. Baeza-Yates (Eds.) Englewood
Cliffs, NJ: Prentice Hall, 1992.
Signature Files
• Characteristics
• Word-oriented index structures based on hashing
• Low overhead (10%~20% over the text size) at the cost
of forcing a sequential search over the index
• Suitable for not very large texts
• Inverted files outperform signature files for most
applications
15
Construction and Search
• Word-oriented index structures base on hashing
• Maps words to bit masks of B bits
• Divides the text in blocks of b words each
• The mask is obtained by bitwise ORing the signatures of all
the words in the text block.
• Search
• Hash the query to a bit mask W
• If W & Bi = W, the text block may contain the word
16
Example
• Four blocks:
• This is a text. A text ha s many words. Words are made from letter s.
000101 110101 100100 101101
• Hash(text) = 000101
• Hash(many)= 110000 Block 4:
• Hash(words)= 100100 001100
• Hash(made)= 001100 OR 100001
• Hash(letters)= 100001 101101
17
False Drop
• Assumes that m bits are randomly set in the mask
• Let α=m/B
• For b words, the probability that a given bit of the
mask is set is 1-(1-1/B)bm ≈1-e-bα
• Hence, the probability that the l random bits are
-bα αΒ ←
also set is Fd =(1-e ) False alarm
• Fd is minimized for α=ln(2)/b
• Fd= 2-m
m = B ln2/b
18
Sequential Signature File (SSF)
Assume documents span exactly one logical block
the size of document signature F = the size of block signature
B
19
Classification of Signature-Based Methods
• Horizontal partitioning
Grouping similar signatures together and/or providing an index on the
signature matrix may result in better-than-linear search.
• Vertical partitioning
Storing the signature matrix column-wise improves the response time on
the expense of insertion time.
20
Classification of Signature-Based Methods
• Vertical partitioning
• without compression
bit-sliced signature files (BSSF, B’SSF)
frame sliced (FSSF)
generalized frame-sliced (GFSSF)
• with compression
compressed bit slices (CBS)
doubly compressed bit slices (DCBS)
no-false-drop method (NFD)
21
Classification of Signature-Based Methods
• Sequential storage of the signature matrix
• without compression
sequential signature files (SSF)
• with compression
bit-block compression (BC)
variable bit-block compression (VBC)
• Horizontal partitioning
• data independent partitioning
Gustafson’s method
partitioned signature files
• data dependent partitioning
2-level signature files
5-trees
22
Criteria
• The storage overhead
• The response time on single word queries
• The performance on insertion, as well as whether
the insertion maintains the “append-only” property
23
Vertical Partitioning
• Idea
avoid bringing useless portions of the document
signature in main memory
• Methods
• store the signature file in a bit-sliced form or in a
frame-sliced form
• store the signature matrix column-wise to improve the
response time on the expense of insertion time
24
Bit-Sliced Signature Files (BSSF)
Transposed bit matrix
documen
(document
ts
signature)
transpose
documen
ts
represent
documents
F bit-files
search: (1) retrieve m bit-files.
e.g., the word signature of free is 001 000 110
010
the document contains “free”: 3rd, 7th, 8th, 11th bit
are set
i.e., only 3rd, 7th, 8th, 11th files are examined.
(2) “and” these vectors. The 1s in the result N-bit
vector
denote the qualifying logical blocks (documents).
(3) retrieve text file through pointer file.
Frame-Sliced Signature File (FSSF)
• Ideas
• Random disk accesses are more expensive than sequential ones
• Force each word to hash into bit positions that are closer to each
other in the document signature
• these bit files are stored together and can be retrieved with a few
random accesses
• Procedures
• The document signature (F bits long) is divided into k frames of s
consecutive bits each.
• For each word in the document, one of the k frames will be chosen
by a hash function.
• Using another hash function, the word sets m bits in that frame.
27
Frame-Sliced Signature File (Cont.)
documen
ts
frame
s
Each frame will be kept in consecutive disk
blocks.
FSSF (Continued)
• Example (n=2, B=12, s=6, f=2, m=3)
Word Signature
free 000000 110010
text 010110 000000
doc. signature 010110 110010
• Search
• Only one frame has to be retrieved for a single word query. I.E., only one
random disk access is required.
e.g., search documents that contain the word “free”
->because the word signature of “free” is placed in 2nd frame,
only the 2nd frame has to be examined.
• At most k frames have to be scanned for an k word query.
• Insertion
• Only f frames have to be accessed instead of F bit-slices.
29
Horizontal Partitioning
1. Goal: group the signatures into sets, partitioning the signature
matrix horizontally.
2. Grouping criterion
documents
Partitioned Signature Files
• Using a portion of a document signature as a
signature key to partition the signature file.
• All signatures with the same key will be grouped
into a so-called “module”.
• When a query signature arrives,
• examine its signature key and look for the
corresponding modules
• scan all the signatures within those modules that have
been selected
31
Suffix Trees
Suffix Trees and Suffix Arrays
• Each position in the text is considered as a text
suffix
• Index points are selected form the text, which point
to the beginning of the text positions which will be
retrievable
33
Suffix arrays
• The main drawbacks of Suffix Array are its costly
construction process.
• Allow binary searches done by comparing the
contents of each pointer.
• Supra-indices (for large suffix array)
35
Construction of Suffix Arrays for Large Texts
Sequential Searching
Brute Force
Knuth-Morris-Pratt
Boyer-Moore Family
Shift-Or
Suffix Automaton
Knuth-Morris-Pratt
Boyer-Moore Family
41
Shift-Or
Suffix Automaton
43
Pattern Matching
• Searching allowing errors
• Dynamic Programming
• Automaton
• Regular Expressions and Extended patterns
• Pattern Matching Using Indices
• Inverted files
• Suffix Trees and Suffix Arrays
45
Dynamic Programming
Automaton
47
Regular Expressions
48
Pattern Matching Using Indices
• Inverted Files
• The types of queries such as suffix or substring queries,
searching allowing errors and regular expressions, are
solved by a sequential search
• The restriction is to find approximate matches or regular
expressions that span many word.
49
Pattern Matching Using Indices
• Suffix Trees
• Suffix trees are able to perform complex searches
• Word, prefix, suffix, substring, and Range queries
• Regular expressions
• Unrestricted approximate string matching
• Useful in specific areas
• Find the longest substring
• Find the most common substring of a fixed size
50
Pattern Matching Using Indices
• Suffix Arrays
• Some patterns can be searched directly in the suffix
array without simulation the suffix tree
• Word, prefix, suffix, subword search and range search
51
Compression
• Compressed text--Huffman coding
• Taking words as symbols
• Use an alphabet of bytes instead of bits
• Compressed indices
• Inverted Files
• Suffix Trees and Suffix Arrays
• Signature Files
52
Thank You