0% found this document useful (0 votes)
9 views54 pages

Department of Information Technology Sub: Information Storage and Retrieval Topic: File Strcutre-2 Mr. Rajpure A.S

Uploaded by

Amol Rajpure
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
9 views54 pages

Department of Information Technology Sub: Information Storage and Retrieval Topic: File Strcutre-2 Mr. Rajpure A.S

Uploaded by

Amol Rajpure
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 54

Department of Information Technology

Sub : Information Storage and Retrieval


Topic: File Strcutre-2

Mr. Rajpure A.S.


Introduction
Contents:
Basic Concepts of IR, Data Retrieval & Information Retrieval, text mining and IR relation, IR system
block diagram.
Automatic Text Analysis: Luhn's ideas, Conflation Algorithm, Indexing and Index Term Weighing,
Probabilistic Indexing
Inverted file, Suffix trees & suffix arrays, Signature Files, Scatter storage or hash addressing, Clustered
files, Hypertext and XML data structures.

Unit Objectives
1. To understand information retrieval process.

Unit outcomes: On completion the students will be able to :


1. By the end of the course, students should be able to understand the concept of Information
retrieval
Outcome Mapping:
PEO: I, PO: a CO: 1., PSO: 1

Books :
T1. Yates & Neto, Modern Information Retrieval, Pearson Education, ISBN:81-297-0274-6 2.
T2. C.J. T2: Rijsbergen, Information Retrieval, (www.dcs.gla.ac.uk)., 2ndISBN:978- 408709293
Indexing and Searching
(File Structures)
File Structures

• Inverted Files
• Signatures
• PAT Trees
• Sequential Searching
• Compression

4
Inverted Files

• Characteristics
• A word-oriented mechanism based on sorted list of
keywords, with each keyword having links to the documents
containing that keyword.
• Preprocessing
• Each document is assigned a list of keywords or attributes.
• Each keyword (attribute) is associated with relevance
weights.

5
Inversion of Word List

1. The input text is parsed into a list of words along with their
location in the text. (time and storage consuming operation)
2. This list is inverted from a list of terms in location order to
a list of terms in alphabetical order.
3. Add term weights, or reorganize or compress the files.
Inversion of Word List
Structure and Construction

• Structure (split the index into two files)


• Vocabulary: O(nβ) according to Heaps’ Law
• Occurrences : depends on the addressing granularity

• Construction
• The vocabulary is stored in lexicographical order and
points to posting list.
• Posting file:the lists of occurrences are stored
contiguously

8
Dictionary and Postings File

(document #,
frequency)
Vocabulary and Posting File

10
Structures used in Inverted Files

• Vocabulary
• Sorted Arrays
• Hashing Structures
• Keyword Trees: Tries (digital search trees)

• The Search Procedure


• Vocabulary search
• Retrieval of occurrences
• Manipulation of occurrences

11
Size of an Inverted File

• Block addressing
• The text is divided in blocks, and the occurrences point to
the blocks instead of full inverted indices where exact
occurrences are recorded
Cost

• Advantage
• easy to implement

• Disadvantage
• updating the index is expensive

13
Signature Files
Information Retrieval: Data Structures and
Algorithms (Chapters 4)
W.B. Frakes and R. Baeza-Yates (Eds.) Englewood
Cliffs, NJ: Prentice Hall, 1992.
Signature Files

• Characteristics
• Word-oriented index structures based on hashing
• Low overhead (10%~20% over the text size) at the cost
of forcing a sequential search over the index
• Suitable for not very large texts
• Inverted files outperform signature files for most
applications

15
Construction and Search

• Word-oriented index structures base on hashing


• Maps words to bit masks of B bits
• Divides the text in blocks of b words each
• The mask is obtained by bitwise ORing the signatures of all
the words in the text block.
• Search
• Hash the query to a bit mask W
• If W & Bi = W, the text block may contain the word

16
Example

• Four blocks:
• This is a text. A text ha s many words. Words are made from letter s.

000101 110101 100100 101101


• Hash(text) = 000101
• Hash(many)= 110000 Block 4:
• Hash(words)= 100100 001100
• Hash(made)= 001100 OR 100001
• Hash(letters)= 100001 101101

17
False Drop

• Assumes that m bits are randomly set in the mask


• Let α=m/B
• For b words, the probability that a given bit of the
mask is set is 1-(1-1/B)bm ≈1-e-bα
• Hence, the probability that the l random bits are
-bα αΒ ←
also set is Fd =(1-e ) False alarm
• Fd is minimized for α=ln(2)/b
• Fd= 2-m
m = B ln2/b

18
Sequential Signature File (SSF)

Assume documents span exactly one logical block


the size of document signature F = the size of block signature
B

19
Classification of Signature-Based Methods

• Horizontal partitioning
Grouping similar signatures together and/or providing an index on the
signature matrix may result in better-than-linear search.

• Vertical partitioning
Storing the signature matrix column-wise improves the response time on
the expense of insertion time.

20
Classification of Signature-Based Methods

• Vertical partitioning
• without compression
bit-sliced signature files (BSSF, B’SSF)
frame sliced (FSSF)
generalized frame-sliced (GFSSF)
• with compression
compressed bit slices (CBS)
doubly compressed bit slices (DCBS)
no-false-drop method (NFD)

21
Classification of Signature-Based Methods

• Sequential storage of the signature matrix


• without compression
sequential signature files (SSF)
• with compression
bit-block compression (BC)
variable bit-block compression (VBC)

• Horizontal partitioning
• data independent partitioning
Gustafson’s method
partitioned signature files
• data dependent partitioning
2-level signature files
5-trees

22
Criteria

• The storage overhead


• The response time on single word queries
• The performance on insertion, as well as whether
the insertion maintains the “append-only” property

23
Vertical Partitioning

• Idea
avoid bringing useless portions of the document
signature in main memory
• Methods
• store the signature file in a bit-sliced form or in a
frame-sliced form
• store the signature matrix column-wise to improve the
response time on the expense of insertion time

24
Bit-Sliced Signature Files (BSSF)
Transposed bit matrix
documen
(document
ts
signature)

transpose

documen
ts

represent
documents

F bit-files

search: (1) retrieve m bit-files.


e.g., the word signature of free is 001 000 110
010
the document contains “free”: 3rd, 7th, 8th, 11th bit
are set
i.e., only 3rd, 7th, 8th, 11th files are examined.
(2) “and” these vectors. The 1s in the result N-bit
vector
denote the qualifying logical blocks (documents).
(3) retrieve text file through pointer file.
Frame-Sliced Signature File (FSSF)
• Ideas
• Random disk accesses are more expensive than sequential ones
• Force each word to hash into bit positions that are closer to each
other in the document signature
• these bit files are stored together and can be retrieved with a few
random accesses
• Procedures
• The document signature (F bits long) is divided into k frames of s
consecutive bits each.
• For each word in the document, one of the k frames will be chosen
by a hash function.
• Using another hash function, the word sets m bits in that frame.

27
Frame-Sliced Signature File (Cont.)

documen
ts

frame
s

Each frame will be kept in consecutive disk


blocks.
FSSF (Continued)

• Example (n=2, B=12, s=6, f=2, m=3)


Word Signature
free 000000 110010
text 010110 000000
doc. signature 010110 110010
• Search
• Only one frame has to be retrieved for a single word query. I.E., only one
random disk access is required.
e.g., search documents that contain the word “free”
->because the word signature of “free” is placed in 2nd frame,
only the 2nd frame has to be examined.
• At most k frames have to be scanned for an k word query.
• Insertion
• Only f frames have to be accessed instead of F bit-slices.

29
Horizontal Partitioning
1. Goal: group the signatures into sets, partitioning the signature
matrix horizontally.
2. Grouping criterion

documents
Partitioned Signature Files

• Using a portion of a document signature as a


signature key to partition the signature file.
• All signatures with the same key will be grouped
into a so-called “module”.
• When a query signature arrives,
• examine its signature key and look for the
corresponding modules
• scan all the signatures within those modules that have
been selected

31
Suffix Trees
Suffix Trees and Suffix Arrays

• Each position in the text is considered as a text


suffix
• Index points are selected form the text, which point
to the beginning of the text positions which will be
retrievable

33
Suffix arrays

• The main drawbacks of Suffix Array are its costly


construction process.
• Allow binary searches done by comparing the
contents of each pointer.
• Supra-indices (for large suffix array)

35
Construction of Suffix Arrays for Large Texts
Sequential Searching
Brute Force
Knuth-Morris-Pratt
Boyer-Moore Family
Shift-Or
Suffix Automaton
Knuth-Morris-Pratt
Boyer-Moore Family

41
Shift-Or
Suffix Automaton

43
Pattern Matching

• Searching allowing errors


• Dynamic Programming
• Automaton
• Regular Expressions and Extended patterns
• Pattern Matching Using Indices
• Inverted files
• Suffix Trees and Suffix Arrays

45
Dynamic Programming
Automaton

47
Regular Expressions

48
Pattern Matching Using Indices

• Inverted Files
• The types of queries such as suffix or substring queries,
searching allowing errors and regular expressions, are
solved by a sequential search
• The restriction is to find approximate matches or regular
expressions that span many word.

49
Pattern Matching Using Indices

• Suffix Trees
• Suffix trees are able to perform complex searches
• Word, prefix, suffix, substring, and Range queries
• Regular expressions
• Unrestricted approximate string matching
• Useful in specific areas
• Find the longest substring
• Find the most common substring of a fixed size

50
Pattern Matching Using Indices

• Suffix Arrays
• Some patterns can be searched directly in the suffix
array without simulation the suffix tree
• Word, prefix, suffix, subword search and range search

51
Compression

• Compressed text--Huffman coding


• Taking words as symbols
• Use an alphabet of bytes instead of bits
• Compressed indices
• Inverted Files
• Suffix Trees and Suffix Arrays
• Signature Files

52
Thank You

You might also like