Advanced Database Systems

Spring 2025

Lecture #09:
Hash-Based Indexing

R&G: Chapter 11

RECAP: FILE ORGANISATIONS

Method of arranging a file of records on secondary storage

Heap Files
Store records in no particular order

Sorted Files
Store records in sorted order, based on search key fields

Index Files
Store records to enable fast lookup and modifications
Tree-based & hash-based indexes

[Figure: DBMS architecture layers, top to bottom: SQL Client, Query Planning, Operator Execution, Files & Index Management, Buffer Management, Disk Space Management, Database]
RECAP: IN-MEMORY HASH TABLE
(From Algorithms & Data Structures course)

A hash table implements an associative array (dictionary)

Data is stored as a collection of key-value pairs

It uses a hash function to compute an offset into an array of buckets (slots)

From which the desired value can be found

[Figure: hash table in which two keys hash to the same slot (a collision). Source: Introduction to Algorithms, 3rd edition]

COLLISION RESOLUTION

By chaining
Link together entries hashed to the same value
Long chains can degrade search performance

Open addressing
Single giant table of slots
Hash to slot, then probe until a free slot is found
Variants: Linear Probing, Cuckoo, Robin Hood, …

Sources: Introduction to Algorithms, 3rd edition; https://en.wikipedia.org/wiki/Hash_table
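
The two collision resolution strategies can be contrasted in a few lines. This is a minimal sketch, not a production design: the class names ChainedTable and LinearProbingTable, the fixed table size of 8, and the use of Python's built-in hash() are all assumptions made for illustration. Deletion and resizing are left out.

```python
class ChainedTable:
    """Chaining: each slot holds a list of all entries hashed to it."""

    def __init__(self, n_slots=8):
        self.slots = [[] for _ in range(n_slots)]

    def put(self, key, value):
        chain = self.slots[hash(key) % len(self.slots)]
        for i, (k, _) in enumerate(chain):
            if k == key:                 # key already present: overwrite
                chain[i] = (key, value)
                return
        chain.append((key, value))       # collision or empty slot: extend chain

    def get(self, key):
        for k, v in self.slots[hash(key) % len(self.slots)]:
            if k == key:
                return v
        raise KeyError(key)


class LinearProbingTable:
    """Open addressing: one flat table, probe forward until a free slot."""

    def __init__(self, n_slots=8):
        self.slots = [None] * n_slots

    def _probe(self, key):
        start = hash(key) % len(self.slots)
        for step in range(len(self.slots)):
            yield (start + step) % len(self.slots)

    def put(self, key, value):
        for i in self._probe(key):
            if self.slots[i] is None or self.slots[i][0] == key:
                self.slots[i] = (key, value)
                return
        raise RuntimeError("table is full")

    def get(self, key):
        for i in self._probe(key):
            if self.slots[i] is None:    # free slot reached: key is absent
                break
            if self.slots[i][0] == key:
                return self.slots[i][1]
        raise KeyError(key)


# Keys 3 and 11 collide, since both are 3 modulo 8.
chained, probing = ChainedTable(), LinearProbingTable()
for table in (chained, probing):
    table.put(3, "a")
    table.put(11, "b")
```

With chaining both entries end up in the chain of slot 3; with linear probing the second entry moves on to slot 4.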

HASHING IN DATABASES
We want to be able to group together tuples with the same key value

Partition the data with hash function(s) applied on the key

All tuples with a certain key will be in the same partition

Useful for:
Removing duplicates (all duplicates will be grouped together)
Grouping data (for GROUP BY)
Looking up data using hash indexes
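
As a rough illustration of why partitioning helps with grouping, the sketch below hash-partitions tuples on a key and then aggregates each partition independently. The function names hash_partition and group_count, the dict-based tuple representation, and the partition count of 4 are assumptions for this example only.

```python
from collections import defaultdict


def hash_partition(tuples, key, n_partitions):
    """Place each tuple in the partition chosen by hashing its key value."""
    partitions = [[] for _ in range(n_partitions)]
    for t in tuples:
        partitions[hash(t[key]) % n_partitions].append(t)
    return partitions


def group_count(tuples, key, n_partitions=4):
    """COUNT(*) ... GROUP BY key, computed one partition at a time."""
    result = {}
    for part in hash_partition(tuples, key, n_partitions):
        counts = defaultdict(int)
        for t in part:
            counts[t[key]] += 1
        # A key value never spans two partitions, so results can simply be merged.
        result.update(counts)
    return result


rows = [{"dept": "a"}, {"dept": "b"}, {"dept": "a"}]
counts = group_count(rows, "dept")
```

Because all tuples with the same key value land in the same partition, each partition can be processed without looking at the others; duplicate elimination works the same way.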

HASH-BASED INDEXING

Suitable for equality-based predicates

SELECT * FROM Customer WHERE A = constant

Cannot support range queries

Other query operations internally generate a flood of equality tests

E.g.: nested loop join, where hash index can make a real difference

Support in commercial DBMSs

Tree-structured indexes preferred since they cover the more general range predicates
But hash-based indexes are used for (index) nested loop joins
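
The "flood of equality tests" in an index nested loop join can be sketched as follows. A Python dict stands in for the hash index here, and the names build_hash_index and index_nested_loop_join as well as the sample tables are hypothetical; a real DBMS would probe an on-disk index instead.

```python
from collections import defaultdict


def build_hash_index(rows, key):
    """Stand-in for a hash index: key value -> list of matching rows."""
    index = defaultdict(list)
    for row in rows:
        index[row[key]].append(row)
    return index


def index_nested_loop_join(outer, inner_index, outer_key):
    """For every outer row, run one equality lookup against the index."""
    for o in outer:
        for i in inner_index.get(o[outer_key], []):
            yield {**o, **i}


customers = [{"cid": 1, "name": "Ann"}, {"cid": 2, "name": "Bob"}]
orders = [{"oid": 10, "cid": 1}, {"oid": 11, "cid": 1}, {"oid": 12, "cid": 3}]

customer_index = build_hash_index(customers, "cid")
joined = list(index_nested_loop_join(orders, customer_index, "cid"))
```

Each outer row triggers exactly one equality probe, which is the access pattern a hash index serves well.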

OVERVIEW
Static and dynamic hashing techniques exist
Trade-offs similar to ISAM vs. B+ trees

Static hashing schemes

Chained hashing

Dynamic hashing schemes

Extendible hashing
Linear hashing (not covered)

STATIC CHAINED HASHING

Hash index is a collection of buckets

Build static hash index on column A

Allocate a fixed area of N (successive) pages, the so-called primary buckets
In each bucket, install a pointer to a chain of overflow pages (initially set to null)
Define a hash function h with range [0, …, N-1]
The domain of h is the type of A
e.g., h : INTEGER ⟶ [0, …, N-1], if A is of type INTEGER

The hash function determines the bucket where the desired value can be found

STATIC CHAINED HASH TABLE

Bucket = primary page plus zero or more overflow pages

Buckets contain index entries k* implemented using any of the variants A, B, or C

[Figure: h looks at the search key field k of record r; h(k) selects one of the primary bucket pages 0, 1, …, N-1 (Bucket 0, Bucket 1, …, Bucket N-1), each of which may have a chain of overflow pages]

STATIC CHAINED HASH TABLE MANAGEMENT

Operations: search, insert, delete
Compute h(k) on the search key field k of record r
Access the primary bucket page with number h(k)
Search for/insert/delete record on this page or, if needed, access the overflow pages

If overflow chain access is avoidable

search requires a single I/O operation
insert and delete require two I/O operations
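
The steps above can be sketched with an in-memory model that counts page reads. This is only an illustration under stated assumptions: Page and StaticHashIndex are hypothetical names, keys are integers hashed with k mod N, each page holds a fixed number of entries, and one page access is counted as one I/O.

```python
class Page:
    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = []
        self.overflow = None      # pointer to the next overflow page, initially null


class StaticHashIndex:
    def __init__(self, n_buckets, page_capacity):
        self.n = n_buckets
        self.page_capacity = page_capacity
        # Fixed area of N primary bucket pages.
        self.primary = [Page(page_capacity) for _ in range(n_buckets)]

    def h(self, k):
        return k % self.n         # range [0, N-1]

    def search(self, k):
        """Return (found, pages_read)."""
        page, reads = self.primary[self.h(k)], 0
        while page is not None:
            reads += 1
            if k in page.entries:
                return True, reads
            page = page.overflow
        return False, reads

    def insert(self, k):
        page = self.primary[self.h(k)]
        while len(page.entries) >= page.capacity:
            if page.overflow is None:          # extend the overflow chain
                page.overflow = Page(self.page_capacity)
            page = page.overflow
        page.entries.append(k)


index = StaticHashIndex(n_buckets=4, page_capacity=2)
for key in (1, 5, 9):             # all three keys hash to bucket 1
    index.insert(key)

print(index.search(1))            # (True, 1): found on the primary page
print(index.search(9))            # (True, 2): found on the overflow page
```

Once a bucket overflows, the cost of a search depends on the chain length, which is why the single-I/O figure only holds when overflow chains are avoided.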

HASH COLLISIONS AND OVERFLOW CHAINS

Hash collisions are unavoidable
For two search keys k and k', it can happen that h(k) = h(k')
Search keys may not be unique (e.g., student age)
Even if unique, the search key space is much larger than # of buckets
Having as many primary bucket pages as different search keys in database ⇒ waste of space

Long overflow chains can degrade performance

Operation costs become non-uniform and unpredictable for a query optimiser
To reduce this problem, h needs to scatter search keys evenly across [0, …, N-1]
Large # of entries can still cause long chains (dynamic hashing to fix this)

HASH FUNCTIONS
How to map a large key space into a smaller domain
Real distributions of search key values are often non-uniform (skewed)

Trade-off between being fast vs. collision rate

We want a lightweight (non-cryptographic) hash function with a low collision rate

Simple hash function: h(k) = k mod N

Guarantees the range of h(k) to be [0, N - 1]
Choosing N = 2^d for some d effectively considers only the least significant d bits of k
Prime numbers work best for N
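
A small experiment makes both points concrete. The helper names bucket_of and buckets_used and the skewed key set (multiples of 8) are chosen purely for illustration.

```python
def bucket_of(k, n):
    """Simple hash function h(k) = k mod N."""
    return k % n


def buckets_used(keys, n):
    """Number of distinct buckets that receive at least one key."""
    return len({bucket_of(k, n) for k in keys})


# With N = 2^d, k mod N is the same as masking out the least d bits of k.
d = 3
N = 2 ** d
assert all(bucket_of(k, N) == k & (N - 1) for k in range(1000))

# Skewed keys: every key is a multiple of 8, so its least 3 bits are all zero.
skewed_keys = range(0, 800, 8)
print(buckets_used(skewed_keys, 8))   # 1: every key lands in bucket 0
print(buckets_used(skewed_keys, 7))   # 7: a prime N spreads the same keys out
```

The power-of-two modulus ignores the higher bits entirely, so any regularity in the low bits translates directly into collisions.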

Better hash functions used in practice

xxHash (+ benchmark), MurmurHash, Google CityHash, Google FarmHash, CLHash

STATIC HASHING AND DYNAMIC FILES

If the data file grows,
the development of overflow chains spoils the index I/O behaviour (1–2 I/O operations)

If the data file shrinks,

a significant fraction of primary buckets may be (almost) empty – a waste of space

We may periodically rehash the data file to restore the ideal situation
(20% free space, no overflow chains)
Expensive – the index not usable while rehashing is in progress

As for ISAM, static hashing has advantages with concurrent access

Only need to lock one bucket page to store a new entry or extend the overflow chain

EXTENDIBLE HASHING
Situation: Bucket (primary page) is full and we want to insert. Why not
reorganize the index by doubling # of buckets?
Reading and writing all pages is expensive!

Idea: Use directory of pointers to buckets, double # of buckets by doubling the directory, splitting just the bucket that overflowed
Directory much smaller than file, so doubling it is much cheaper

Only one page of data entries is split

No overflow pages!

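EXTENDIBLE HASHING

[Figure: directory with depth 2 and entries 00, 01, 10, 11. Entries 00 and 10 lead to a bucket with depth 1 holding 8 (01000) and 14 (01110); entry 01 leads to a bucket with depth 2 holding 21 (10101) and 25 (11001); entry 11 leads to a bucket with depth 2 holding 11 (01011).]

Note: we depict as index entries h(k) instead of k*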

GLOBAL AND LOCAL DEPTH

Global depth (n at directory)
Use the least n bits of h(k) to find a bucket pointer in the directory
The directory size is 2^n

Local depth (d at individual buckets)
The hash values h(k) of all entries in this bucket agree on their least d bits

[Figure: the same directory and buckets, with the global depth (2) labelled at the directory and the local depths (1, 2, 2) labelled at the buckets]
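
Both depth definitions reduce to bit masking, as the following sketch shows. The helper names directory_slot and agree_on_low_bits are hypothetical; the hash values are the ones used in the running example.

```python
def directory_slot(hash_value, global_depth):
    """Directory index = least `global_depth` bits of the hash value."""
    return hash_value & ((1 << global_depth) - 1)


def agree_on_low_bits(hash_values, local_depth):
    """True if all hash values share the same least `local_depth` bits."""
    mask = (1 << local_depth) - 1
    return len({h & mask for h in hash_values}) <= 1


print(format(directory_slot(14, 2), "02b"))   # 10
print(agree_on_low_bits([8, 14], 1))          # True: both end in 0
print(agree_on_low_bits([8, 14], 2))          # False: 00 versus 10
print(agree_on_low_bits([21, 25], 2))         # True: both end in 01
```

A bucket with local depth d smaller than the global depth n is therefore reachable from 2^(n-d) directory entries.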

EXTENDIBLE HASHING

Find A: hash(A) = 14 = 01110₂

[Figure: global depth 2; directory entries 00, 01, 10, 11; buckets holding 8 (01000) and 14 (01110) with local depth 1, 21 (10101) and 25 (11001) with local depth 2, and 11 (01011) with local depth 2]

To find a bucket for A, take the least 2 bits of hash(A)

Check if the bucket contains key A. Need to compare keys due to collisions!

EXTENDIBLE HASHING

Insert B: hash(B) = 29 = 11101₂

[Figure: global depth 2; 29 (11101) is added to the bucket with local depth 2 that already holds 21 (10101) and 25 (11001)]

If the bucket still has capacity, store the index entry in it

EXTENDIBLE HASHING

Insert C: hash(C) = 5 = 00101₂

[Figure: global depth 2; the least 2 bits of hash(C) lead to the bucket with local depth 2 holding 21 (10101), 25 (11001), and 29 (11101), which is full]

Split bucket if full (allocate new bucket, increase local depth, redistribute)

EXTENDIBLE HASHING

[Figure: global depth still 2. The full bucket has been split into two buckets with local depth 3: one holding 25 (11001), the other holding 21 (10101) and 29 (11101). The buckets holding 8 (01000) and 14 (01110) with local depth 1, and 11 (01011) with local depth 2, are unchanged.]

3 bits now needed to discriminate between these two buckets ⇒ double directory

EXTENDIBLE HASHING

[Figure: global depth 3; directory entries 000 through 111. The bucket holding 25 (11001) and the bucket holding 21 (10101) and 29 (11101) both have local depth 3. The buckets holding 8 (01000) and 14 (01110) with local depth 1, and 11 (01011) with local depth 2, are unchanged. In the final step, 5 (00101) is stored in the bucket with 21 and 29.]

DIRECTORY DOUBLING
Double the directory by copying its original pointers and "fixing" the pointer to the split bucket
Use of least significant bits enables efficient doubling via copying!

Splitting a bucket does not always require doubling the directory

Buckets with local depth < global depth have multiple pointers to them
Splitting such buckets does not require doubling

Modifying one or more bucket pointers in directory is sufficient

Directory can also shrink when buckets become empty
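
The find, insert, split, and doubling steps can be put together in a compact in-memory sketch. Several things here are assumptions rather than part of the lecture: the class names Bucket and ExtendibleHash, a bucket capacity of 3, the identity function as hash function, and the insertion order of the keys, which was chosen so that the resulting states match the running example. Directory shrinking and deletion are not covered.

```python
class Bucket:
    def __init__(self, local_depth, capacity):
        self.local_depth = local_depth
        self.capacity = capacity
        self.keys = []


class ExtendibleHash:
    def __init__(self, capacity=3, hash_fn=lambda k: k):
        self.capacity = capacity
        self.hash_fn = hash_fn
        self.global_depth = 0
        self.directory = [Bucket(0, capacity)]   # 2^0 = 1 directory entry

    def _slot(self, key):
        # Least `global_depth` bits of the hash value.
        return self.hash_fn(key) & ((1 << self.global_depth) - 1)

    def find(self, key):
        # Keys must be compared explicitly because of collisions.
        return key in self.directory[self._slot(key)].keys

    def insert(self, key):
        # Note: more than `capacity` keys with an identical hash value would
        # split forever; a real index needs overflow handling for that case.
        while True:
            bucket = self.directory[self._slot(key)]
            if len(bucket.keys) < bucket.capacity:
                bucket.keys.append(key)
                return
            self._split(bucket)

    def _split(self, bucket):
        if bucket.local_depth == self.global_depth:
            # Doubling by copying: entries i and i + 2^n share a bucket.
            self.directory = self.directory + self.directory
            self.global_depth += 1
        bit = 1 << bucket.local_depth             # the newly significant bit
        bucket.local_depth += 1
        image = Bucket(bucket.local_depth, self.capacity)
        old_keys, bucket.keys = bucket.keys, []
        for k in old_keys:                        # redistribute on the new bit
            (image if self.hash_fn(k) & bit else bucket).keys.append(k)
        for i, b in enumerate(self.directory):    # "fix" the affected pointers
            if b is bucket and i & bit:
                self.directory[i] = image


eh = ExtendibleHash(capacity=3)
for key in (8, 14, 21, 25, 11, 29):
    eh.insert(key)
depth_before = eh.global_depth                    # 2, as in the example
eh.insert(5)                                      # forces split and doubling
```

Because the directory is indexed by the least significant bits, doubling is a plain copy of the pointer array; only the entries for the split bucket need to be redirected afterwards.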

SUMMARY
Hash-based indexes
Best for equality searches, cannot support range searches

Static hashing
Can lead to long overflow chains

Extendible hashing
Avoids overflow chains by splitting a full bucket when a new entry is to be added to it
