0% found this document useful (0 votes)

9 views54 pages

Department of Information Technology Sub: Information Storage and Retrieval Topic: File Strcutre-2 Mr. Rajpure A.S

Uploaded by

Amol Rajpure

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

9 views54 pages

Department of Information Technology Sub: Information Storage and Retrieval Topic: File Strcutre-2 Mr. Rajpure A.S

Uploaded by

Amol Rajpure

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 54

Department of Information Technology

Sub : Information Storage and Retrieval

Topic: File Strcutre-2

Mr. Rajpure A.S.

Introduction
Contents:
Basic Concepts of IR, Data Retrieval & Information Retrieval, text mining and IR relation, IR system
block diagram.
Automatic Text Analysis: Luhn's ideas, Conflation Algorithm, Indexing and Index Term Weighing,
Probabilistic Indexing
Inverted file, Suffix trees & suffix arrays, Signature Files, Scatter storage or hash addressing, Clustered
files, Hypertext and XML data structures.

Unit Objectives
1. To understand information retrieval process.

Unit outcomes: On completion the students will be able to :

1. By the end of the course, students should be able to understand the concept of Information
retrieval
Outcome Mapping:
PEO: I, PO: a CO: 1., PSO: 1

Books :
T1. Yates & Neto, Modern Information Retrieval, Pearson Education, ISBN:81-297-0274-6 2.
T2. C.J. T2: Rijsbergen, Information Retrieval, (www.dcs.gla.ac.uk)., 2ndISBN:978- 408709293
Indexing and Searching
(File Structures)
File Structures

• Inverted Files
• Signatures
• PAT Trees
• Sequential Searching
• Compression

4
Inverted Files

• Characteristics
• A word-oriented mechanism based on sorted list of
keywords, with each keyword having links to the documents
containing that keyword.
• Preprocessing
• Each document is assigned a list of keywords or attributes.
• Each keyword (attribute) is associated with relevance
weights.

5
Inversion of Word List

1. The input text is parsed into a list of words along with their
location in the text. (time and storage consuming operation)
2. This list is inverted from a list of terms in location order to
a list of terms in alphabetical order.
3. Add term weights, or reorganize or compress the files.
Inversion of Word List
Structure and Construction

• Structure (split the index into two files)

• Vocabulary: O(nβ) according to Heaps’ Law
• Occurrences : depends on the addressing granularity

• Construction
• The vocabulary is stored in lexicographical order and
points to posting list.
• Posting file：the lists of occurrences are stored
contiguously

8
Dictionary and Postings File

(document #,
frequency)
Vocabulary and Posting File

10
Structures used in Inverted Files

• Vocabulary
• Sorted Arrays
• Hashing Structures
• Keyword Trees: Tries (digital search trees)

• The Search Procedure

• Vocabulary search
• Retrieval of occurrences
• Manipulation of occurrences

11
Size of an Inverted File

• Block addressing
• The text is divided in blocks, and the occurrences point to
the blocks instead of full inverted indices where exact
occurrences are recorded
Cost

• Advantage
• easy to implement

• Disadvantage
• updating the index is expensive

13
Signature Files
Information Retrieval: Data Structures and
Algorithms (Chapters 4)
W.B. Frakes and R. Baeza-Yates (Eds.) Englewood
Cliffs, NJ: Prentice Hall, 1992.
Signature Files

• Characteristics
• Word-oriented index structures based on hashing
• Low overhead (10%~20% over the text size) at the cost
of forcing a sequential search over the index
• Suitable for not very large texts
• Inverted files outperform signature files for most
applications

15
Construction and Search

• Word-oriented index structures base on hashing

• Maps words to bit masks of B bits
• Divides the text in blocks of b words each
• The mask is obtained by bitwise ORing the signatures of all
the words in the text block.
• Search
• Hash the query to a bit mask W
• If W & Bi = W, the text block may contain the word

16
Example

• Four blocks:
• This is a text. A text ha s many words. Words are made from letter s.

000101 110101 100100 101101

• Hash(text) = 000101
• Hash(many)= 110000 Block 4:
• Hash(words)= 100100 001100
• Hash(made)= 001100 OR 100001
• Hash(letters)= 100001 101101

17
False Drop

• Assumes that m bits are randomly set in the mask

• Let α=m/B
• For b words, the probability that a given bit of the
mask is set is 1-(1-1/B)bm ≈1-e-bα
• Hence, the probability that the l random bits are
-bα αΒ ←
also set is Fd =(1-e ) False alarm
• Fd is minimized for α=ln(2)/b
• Fd= 2-m
m = B ln2/b

18
Sequential Signature File (SSF)

Assume documents span exactly one logical block

the size of document signature F = the size of block signature
B

19
Classification of Signature-Based Methods

• Horizontal partitioning
Grouping similar signatures together and/or providing an index on the
signature matrix may result in better-than-linear search.

• Vertical partitioning
Storing the signature matrix column-wise improves the response time on
the expense of insertion time.

20
Classification of Signature-Based Methods

• Vertical partitioning
• without compression
bit-sliced signature files (BSSF, B’SSF)
frame sliced (FSSF)
generalized frame-sliced (GFSSF)
• with compression
compressed bit slices (CBS)
doubly compressed bit slices (DCBS)
no-false-drop method (NFD)

21
Classification of Signature-Based Methods

• Sequential storage of the signature matrix

• without compression
sequential signature files (SSF)
• with compression
bit-block compression (BC)
variable bit-block compression (VBC)

• Horizontal partitioning
• data independent partitioning
Gustafson’s method
partitioned signature files
• data dependent partitioning
2-level signature files
5-trees

22
Criteria

• The storage overhead

• The response time on single word queries
• The performance on insertion, as well as whether
the insertion maintains the “append-only” property

23
Vertical Partitioning

• Idea
avoid bringing useless portions of the document
signature in main memory
• Methods
• store the signature file in a bit-sliced form or in a
frame-sliced form
• store the signature matrix column-wise to improve the
response time on the expense of insertion time

24
Bit-Sliced Signature Files (BSSF)
Transposed bit matrix
documen
(document
ts
signature)

transpose

documen
ts

represent
documents

F bit-files

search: (1) retrieve m bit-files.

e.g., the word signature of free is 001 000 110
010
the document contains “free”: 3rd, 7th, 8th, 11th bit
are set
i.e., only 3rd, 7th, 8th, 11th files are examined.
(2) “and” these vectors. The 1s in the result N-bit
vector
denote the qualifying logical blocks (documents).
(3) retrieve text file through pointer file.
Frame-Sliced Signature File (FSSF)
• Ideas
• Random disk accesses are more expensive than sequential ones
• Force each word to hash into bit positions that are closer to each
other in the document signature
• these bit files are stored together and can be retrieved with a few
random accesses
• Procedures
• The document signature (F bits long) is divided into k frames of s
consecutive bits each.
• For each word in the document, one of the k frames will be chosen
by a hash function.
• Using another hash function, the word sets m bits in that frame.

27
Frame-Sliced Signature File (Cont.)

documen
ts

frame
s

Each frame will be kept in consecutive disk

blocks.
FSSF (Continued)

• Example (n=2, B=12, s=6, f=2, m=3)

Word Signature
free 000000 110010
text 010110 000000
doc. signature 010110 110010
• Search
• Only one frame has to be retrieved for a single word query. I.E., only one
random disk access is required.
e.g., search documents that contain the word “free”
->because the word signature of “free” is placed in 2nd frame,
only the 2nd frame has to be examined.
• At most k frames have to be scanned for an k word query.
• Insertion
• Only f frames have to be accessed instead of F bit-slices.

29
Horizontal Partitioning
1. Goal: group the signatures into sets, partitioning the signature
matrix horizontally.
2. Grouping criterion

documents
Partitioned Signature Files

• Using a portion of a document signature as a

signature key to partition the signature file.
• All signatures with the same key will be grouped
into a so-called “module”.
• When a query signature arrives,
• examine its signature key and look for the
corresponding modules
• scan all the signatures within those modules that have
been selected

31
Suffix Trees
Suffix Trees and Suffix Arrays

• Each position in the text is considered as a text

suffix
• Index points are selected form the text, which point
to the beginning of the text positions which will be
retrievable

33
Suffix arrays

• The main drawbacks of Suffix Array are its costly

construction process.
• Allow binary searches done by comparing the
contents of each pointer.
• Supra-indices (for large suffix array)

35
Construction of Suffix Arrays for Large Texts
Sequential Searching
Brute Force
Knuth-Morris-Pratt
Boyer-Moore Family
Shift-Or
Suffix Automaton
Knuth-Morris-Pratt
Boyer-Moore Family

41
Shift-Or
Suffix Automaton

43
Pattern Matching

• Searching allowing errors

• Dynamic Programming
• Automaton
• Regular Expressions and Extended patterns
• Pattern Matching Using Indices
• Inverted files
• Suffix Trees and Suffix Arrays

45
Dynamic Programming
Automaton

47
Regular Expressions

48
Pattern Matching Using Indices

• Inverted Files
• The types of queries such as suffix or substring queries,
searching allowing errors and regular expressions, are
solved by a sequential search
• The restriction is to find approximate matches or regular
expressions that span many word.

49
Pattern Matching Using Indices

• Suffix Trees
• Suffix trees are able to perform complex searches
• Word, prefix, suffix, substring, and Range queries
• Regular expressions
• Unrestricted approximate string matching
• Useful in specific areas
• Find the longest substring
• Find the most common substring of a fixed size

50
Pattern Matching Using Indices

• Suffix Arrays
• Some patterns can be searched directly in the suffix
array without simulation the suffix tree
• Word, prefix, suffix, subword search and range search

51
Compression

• Compressed text--Huffman coding

• Taking words as symbols
• Use an alphabet of bytes instead of bits
• Compressed indices
• Inverted Files
• Suffix Trees and Suffix Arrays
• Signature Files

52
Thank You

ISR Signaturefile
No ratings yet
ISR Signaturefile
27 pages
Unit 3
No ratings yet
Unit 3
10 pages
Cay Dau Hieu Nhi Phan
No ratings yet
Cay Dau Hieu Nhi Phan
9 pages
Ir Mod 4
No ratings yet
Ir Mod 4
15 pages
Module 2-Data Structures and Algorithms For Retrieval-Cat1
No ratings yet
Module 2-Data Structures and Algorithms For Retrieval-Cat1
133 pages
PAT Trees and PAT Arrays
No ratings yet
PAT Trees and PAT Arrays
12 pages
Extendible Hashing Report
100% (2)
Extendible Hashing Report
37 pages
File Structures for Students
No ratings yet
File Structures for Students
17 pages
Chapter 3 Part 1
No ratings yet
Chapter 3 Part 1
43 pages
Indexing and Compression Basics
No ratings yet
Indexing and Compression Basics
43 pages
File Structure for Students
No ratings yet
File Structure for Students
17 pages
Introduction To File Structure: FS Lab Mini Project Placement Statistics
No ratings yet
Introduction To File Structure: FS Lab Mini Project Placement Statistics
44 pages
Assignment 3
No ratings yet
Assignment 3
3 pages
Ds Unit-5
No ratings yet
Ds Unit-5
5 pages
History of File Structures
No ratings yet
History of File Structures
26 pages
File Organization
No ratings yet
File Organization
49 pages
Lecture 04 Inaryseachtree
No ratings yet
Lecture 04 Inaryseachtree
20 pages
Dsa 240404 220052
No ratings yet
Dsa 240404 220052
9 pages
Experiment 8 DS Student
No ratings yet
Experiment 8 DS Student
8 pages
DBMS Section A
No ratings yet
DBMS Section A
56 pages
IR Chapter Three
No ratings yet
IR Chapter Three
30 pages
18 - Shanthi - Applying SD-Tree For Object-Oriented Query
No ratings yet
18 - Shanthi - Applying SD-Tree For Object-Oriented Query
12 pages
File-System Interface
No ratings yet
File-System Interface
14 pages
Unit V
No ratings yet
Unit V
5 pages
Chapter 3 Indexing Structures
No ratings yet
Chapter 3 Indexing Structures
63 pages
Broadword Implementation of Rank-Select Queries (Nov 19, 2014)
No ratings yet
Broadword Implementation of Rank-Select Queries (Nov 19, 2014)
15 pages
Completed UNIT-III 20.9.17
No ratings yet
Completed UNIT-III 20.9.17
61 pages
Fin f12 Sol
No ratings yet
Fin f12 Sol
6 pages
Information Retrieval (Data Structures and Algorithms) - W B Frakes, R Baeza-Yates
No ratings yet
Information Retrieval (Data Structures and Algorithms) - W B Frakes, R Baeza-Yates
237 pages
Assignment No: 3: Aim: Objective: Theory:-Inverted Index
No ratings yet
Assignment No: 3: Aim: Objective: Theory:-Inverted Index
2 pages
Chap 6 Distributed File System
No ratings yet
Chap 6 Distributed File System
56 pages
File Systems
No ratings yet
File Systems
8 pages
210 Maps PDF
No ratings yet
210 Maps PDF
39 pages
L14 PPT IVSem
No ratings yet
L14 PPT IVSem
33 pages
Module 5 - Indexing and Searching
No ratings yet
Module 5 - Indexing and Searching
15 pages
IRS Module5-I
No ratings yet
IRS Module5-I
15 pages
Data Mining: Sketching, Locality Sensitive Hashing
No ratings yet
Data Mining: Sketching, Locality Sensitive Hashing
61 pages
Chapter 09 Advanced Data Structures
No ratings yet
Chapter 09 Advanced Data Structures
9 pages
Unit 6 File Management
No ratings yet
Unit 6 File Management
70 pages
CompSci Paper 3 - 4 Notes
No ratings yet
CompSci Paper 3 - 4 Notes
51 pages
Data Structures & Algorithms for IR
No ratings yet
Data Structures & Algorithms for IR
34 pages
File Organization Notes
No ratings yet
File Organization Notes
21 pages
Lab08 - DS - Hash Tables
No ratings yet
Lab08 - DS - Hash Tables
9 pages
Hashing Powerpoint
No ratings yet
Hashing Powerpoint
58 pages
File Organizations and Indexes
No ratings yet
File Organizations and Indexes
51 pages
Data Structures and Overview
No ratings yet
Data Structures and Overview
13 pages
Handout 9 - Hashing
No ratings yet
Handout 9 - Hashing
11 pages
Assignment (DS)
No ratings yet
Assignment (DS)
8 pages
A New Data Structure For Cumulative Frequency Tables: Peter M. Fenwick
No ratings yet
A New Data Structure For Cumulative Frequency Tables: Peter M. Fenwick
10 pages
DSA Chapter 08 (Searching)
No ratings yet
DSA Chapter 08 (Searching)
65 pages
9nm4alc: CS 213 M 2023 Data Structures
No ratings yet
9nm4alc: CS 213 M 2023 Data Structures
21 pages
Operating System
No ratings yet
Operating System
38 pages
Chapter 3,4, 5 and 6
No ratings yet
Chapter 3,4, 5 and 6
145 pages
File Structure and Organization Concepts
No ratings yet
File Structure and Organization Concepts
17 pages
Advanced Data Structures Notes
100% (2)
Advanced Data Structures Notes
142 pages
Single Pass Algorithm
No ratings yet
Single Pass Algorithm
16 pages
Unit I PART 3
No ratings yet
Unit I PART 3
27 pages
Conversion of NFA - To NFA
No ratings yet
Conversion of NFA - To NFA
6 pages
Unit I PART 1.ppt1
No ratings yet
Unit I PART 1.ppt1
79 pages
Unit I PART 2
No ratings yet
Unit I PART 2
24 pages
Unit III
No ratings yet
Unit III
58 pages
Unit 2.1
No ratings yet
Unit 2.1
131 pages
Root Canal Dangers Exposed
No ratings yet
Root Canal Dangers Exposed
1 page
Work, Energy, and Power Calculations
No ratings yet
Work, Energy, and Power Calculations
4 pages
7es Respiratory System Lesson Plan
No ratings yet
7es Respiratory System Lesson Plan
5 pages
Maths and Calculator Skills For Science Students. 2nd Edition
No ratings yet
Maths and Calculator Skills For Science Students. 2nd Edition
17 pages
Capacity and Behaviour On One Way Cycle Tracks o 2016 Transportation Researc
No ratings yet
Capacity and Behaviour On One Way Cycle Tracks o 2016 Transportation Researc
15 pages
Preliminaries
No ratings yet
Preliminaries
9 pages
Project Coordination Update
No ratings yet
Project Coordination Update
8 pages
Engineers' Guide to Tank Stability
No ratings yet
Engineers' Guide to Tank Stability
10 pages
Bài M U IELTS Writing Task 2 Band 8 Ngày 11/05/2024
No ratings yet
Bài M U IELTS Writing Task 2 Band 8 Ngày 11/05/2024
13 pages
Zehad Muhit Chowdhury Resume
No ratings yet
Zehad Muhit Chowdhury Resume
2 pages
Devotional
No ratings yet
Devotional
4 pages
The Collar Analyses by Naim
No ratings yet
The Collar Analyses by Naim
10 pages
Santhana Gopala Sthothram
No ratings yet
Santhana Gopala Sthothram
40 pages
6267 - Kolhapur Institute of Technologys College of Engineering (Autonomous), Kolhapur
No ratings yet
6267 - Kolhapur Institute of Technologys College of Engineering (Autonomous), Kolhapur
13 pages
Periyar Rice: Internship Insights
No ratings yet
Periyar Rice: Internship Insights
38 pages
CBC-Event Management Services NC III - Course Design
100% (1)
CBC-Event Management Services NC III - Course Design
8 pages
SPN 111 Fmi 1
No ratings yet
SPN 111 Fmi 1
4 pages
Conclusion
No ratings yet
Conclusion
4 pages
Accepted Manuscript: Advantages and Disadvantages of Membrane Separation
No ratings yet
Accepted Manuscript: Advantages and Disadvantages of Membrane Separation
1 page
CA-103-C-Language Notes
No ratings yet
CA-103-C-Language Notes
77 pages
38-Service Manual Rear Final Drive 01R
No ratings yet
38-Service Manual Rear Final Drive 01R
102 pages
Lecture 1 Modi
100% (1)
Lecture 1 Modi
24 pages
Crafting a Bucket List Essay
100% (2)
Crafting a Bucket List Essay
8 pages
Case Study TCS
0% (1)
Case Study TCS
19 pages
КТП 2023-24 Action 10 кл
No ratings yet
КТП 2023-24 Action 10 кл
16 pages
Chapter II RRL
100% (1)
Chapter II RRL
13 pages
HVAC Smoke Detector Guidelines
No ratings yet
HVAC Smoke Detector Guidelines
1 page
Kerala's Rural-Urban Continuum
No ratings yet
Kerala's Rural-Urban Continuum
14 pages
User'S Manual Vibrating-Wire Piezometers Big Boy: Model 98050 Slim Jim: Model 98051 Conehead: Model Number 98052 Serial No.
No ratings yet
User'S Manual Vibrating-Wire Piezometers Big Boy: Model 98050 Slim Jim: Model 98051 Conehead: Model Number 98052 Serial No.
16 pages
GB 21
No ratings yet
GB 21
2 pages

Department of Information Technology Sub: Information Storage and Retrieval Topic: File Strcutre-2 Mr. Rajpure A.S

Uploaded by

Department of Information Technology Sub: Information Storage and Retrieval Topic: File Strcutre-2 Mr. Rajpure A.S

Uploaded by

Department of Information Technology

Sub : Information Storage and Retrieval

Mr. Rajpure A.S.

Unit outcomes: On completion the students will be able to :

• Structure (split the index into two files)

• The Search Procedure

• Word-oriented index structures base on hashing

000101 110101 100100 101101

• Assumes that m bits are randomly set in the mask

Assume documents span exactly one logical block

• Sequential storage of the signature matrix

• The storage overhead

search: (1) retrieve m bit-files.

Each frame will be kept in consecutive disk

• Example (n=2, B=12, s=6, f=2, m=3)

• Using a portion of a document signature as a

• Each position in the text is considered as a text

• The main drawbacks of Suffix Array are its costly

• Searching allowing errors

• Compressed text--Huffman coding

You might also like