Advanced Database Systems
Spring 2025
Lecture #09:
Hash-Based Indexing
R&G: Chapter 11
2
R ECAP : F ILE O RGANISATIONS
Method of arranging a file of records on secondary storage
Heap Files S Q L C lie n t
Store records in no particular order Q u e r y P la n n i n g
O p e ra to r E xe c u ti o n
Sorted Files
F ile s & I n d ex M a n a g e me n t
Store records in sorted order, based on search key fields
B u f f e r Ma n a g e m e nt
Index Files Disk Space Management
Store records to enable fast lookup and modifications D a ta b a se
Tree-based & hash-based indexes
R ECAP : I N -M EMORY H ASH TABLE
3
( F R O M A L G O R I T H M S & D A TA S T R U C T U R E S C O U R S E )
A hash table implements an associative array (dictionary)
Data is stored as a collection of key-value pairs
It uses a hash function to compute an offset into an array of buckets (slots)
From which the desired value can be found
collision
Source: Introduction to Algorithms, 3rd edition
4
C OLLISION R ESOLUTION
By chaining Open Addressing
Link together entries hashed to the same value Single giant table of slots
Long chains can degrade search performance Hash to slot, then probe until a free slot is found
Variants: Linear Probing, Cuckoo, Robin Hood, …
Source: Introduction to Algorithms, 3rd edition Source: https://en.wikipedia.org/wiki/Hash_table
5
H ASHING IN D ATABASES
We want to be able to group together tuples with the same key value
Partition the data with hash function(s) applied on the key
All tuples with a certain key will be in the same partition
Useful for:
Removing duplicates (all duplicates will be grouped together)
Grouping data (for GROUP BY)
Looking up data using hash indexes
6
H ASH -B ASED I NDEXING
Suitable for equality-based predicates
SELECT * FROM Customer WHERE A = constant
Cannot support range queries
Other query operations internally generate a flood of equality tests
E.g.: nested loop join, where hash index can make a real difference
Support in commercial DBMSs
Tree-structured indexes preferred since they cover the more general range predicates
But hash-based indexes are used for (index) nested loop joins
7
O VERVIEW
Static and dynamic hashing techniques exist
Trade-offs similar to ISAM vs. B+ trees
Static hashing schemes
Chained hashing
Dynamic hashing schemes
Extendible hashing
Linear hashing (not covered)
8
S TATIC C HAINED H ASHING
Hash index is a collection of buckets
Build static hash index on column A
Allocate a fixed area of N (successive) pages, the so-called primary buckets
In each bucket, install a pointer to a chain of overflow pages (initially set to null)
Define a hash function h with range [0, …, N-1]
The domain of h is the type of A
e.g., h : INTEGER ⟶ [0, …, N-1], if A is of type INTEGER
The hash function determines the bucket where the desired value can be found
9
S TATIC C HAINED H ASH TABLE
Bucket = primary page plus zero or more overflow pages
Buckets contain index entries k* implemented using any of the variants A, B, or C
0 Bucket 0
record r 1
h(k) Bucket 1
…
N-1
h looks at the search Overflow pages Bucket N-1
key field k of record r
Primary
bucket pages
10
S TATIC C HAINED H ASH TABLE M ANAGEMENT
Operations: search, insert, delete
Compute h(k) on the search key field k of record r
Access the primary bucket page with number h(k)
Search for/insert/delete record on this page or, if needed, access the overflow pages
If overflow chain access is avoidable
search requires a single I/O operation
insert and delete require two I/O operations
11
H ASH C OLLISIONS AND O VERFLOW C HAINS
Hash collisions are unavoidable
For search keys k and k’, can happen h(k) = h(k’)
Search keys may not be unique (e.g., student age)
Even if unique, the search key space is much larger than # of buckets
Having as many primary bucket pages as different search keys in database ⇒ waste of space
Long overflow chains can degrade performance
Operation costs become non-uniform and unpredictable for a query optimiser
To reduce this problem, h needs to scatter search keys evenly across [0, …, N-1]
Large # of entries can still cause long chains (dynamic hashing to fix this)
12
H ASH F UNCTIONS
How to map a large key space into a smaller domain
Real distributions of search key values are often non-uniform (skewed)
Trade-off between being fast vs. collision rate
We want a lightweight (non-cryptographic) hash function with a low collision rate
Simple hash function: h(k) = k mod N
Guarantees the range of h(k) to be [0, N - 1]
Choosing N = 2d for some d effectively considers the least d bits of k only
Prime numbers work best for N
Better hash functions used in practice
xxHash (+ benchmark), MurmurHash, Google CityHash, Google FarmHash, CLHash
13
S TATIC H ASHING AND D YNAMIC F ILES
If the data file grows,
the development of overflow chains spoils the index I/O behaviour (1–2 I/O operations)
If the data file shrinks,
a significant fraction of primary buckets may be (almost) empty – a waste of space
We may periodically rehash the data file to restore the ideal situation
(20% free space, no overflow chains)
Expensive – the index not usable while rehashing is in progress
As for ISAM, static hashing has advantages with concurrent access
Only need to lock one bucket page to store a new entry or extend the overflow chain
14
E XTENDIBLE H ASHING
Situation: Bucket (primary page) is full and we want to insert. Why not
reorganize the index by doubling # of buckets?
Reading and writing all pages is expensive!
Idea: Use directory of pointers to buckets, double # of buckets by
doubling the directory, splitting just the bucket that overflowed
Directory much smaller than file, so doubling it is much cheaper
Only one page of data entries is split
No overflow pages!
15
E XTENDIBLE H ASHING
2 8 (01000) 1
14 (01110)
00
01 21 (10101) 2
10 25 (11001)
11
11 (01011) 2
Note: we depict as index entries h(k) instead of k*
16
G LOBAL AND L OCAL D EPTH
Global depth (n at directory) global 1 local
2 8 (01000)
14 (01110)
Use the least n bits of h(k) to find a 00
bucket pointer in the directory 01 21 (10101) 2 local
The directory size is 2n 10 25 (11001)
11
Local depth (d at individual buckets) 11 (01011) 2 local
The hash values h(k) of all entries in
this bucket agree on their least d bits
17
E XTENDIBLE H ASHING
global 2 8 (01000) 1 local
14 (01110)
Find A
00 hash(A) = 14 = 011102
01 21 (10101) 2 local
10 25 (11001)
11
11 (01011) 2 local
To find a bucket for A, take the least 2 bits of hash(A)
18
E XTENDIBLE H ASHING
global 2 8 (01000) 1 local
14 (01110)
Find A
00 hash(A) = 14 = 011102
01 21 (10101) 2 local
10 25 (11001)
11
11 (01011) 2 local
Check if the bucket contains key A. Need to compare keys due to collisions!
19
E XTENDIBLE H ASHING
global 2 8 (01000) 1 local
14 (01110)
Find A
00 hash(A) = 14 = 011102
01 21 (10101) 2 local Insert B
10 25 (11001)
hash(B) = 29 = 111012
11
11 (01011) 2 local
20
E XTENDIBLE H ASHING
global 2 8 (01000) 1 local
14 (01110)
Find A
00 hash(A) = 14 = 011102
01 21 (10101) 2 local Insert B
10 25 (11001)
hash(B) = 29 = 111012
29 (11101)
11
11 (01011) 2 local
If the bucket still has capacity, store the index entry in it
21
E XTENDIBLE H ASHING
global 2 8 (01000) 1 local
14 (01110)
Find A
00 hash(A) = 14 = 011102
01 21 (10101) 2 local Insert B
10 25 (11001)
hash(B) = 29 = 111012
29 (11101)
11
11 (01011) 2 local Insert C
hash(C) = 5 = 001012
22
E XTENDIBLE H ASHING
global 2 8 (01000) 1 local
14 (01110)
Find A
00 hash(A) = 14 = 011102
01 21 (10101) 2 local Insert B
10 25 (11001)
hash(B) = 29 = 111012
29 (11101)
11
11 (01011) 2 local Insert C
hash(C) = 5 = 001012
Split bucket if full (allocate new bucket, increase local, redistribute)
23
E XTENDIBLE H ASHING
global 2 8 (01000) 1 local
14 (01110)
Find A
00 hash(A) = 14 = 011102
01 25 (11001) 3 local Insert B
10
hash(B) = 29 = 111012
11
21 (10101) 3 local Insert C
29 (11101) hash(C) = 5 = 001012
3 bits now needed to
discriminate between
these two buckets ⇒ 11 (01011) 2 local
double directory
24
E XTENDIBLE H ASHING
global 3 8 (01000) 1 local
14 (01110)
Find A
000 hash(A) = 14 = 011102
001 25 (11001) 3 local Insert B
010
hash(B) = 29 = 111012
011
100 21 (10101) 3 local Insert C
101 29 (11101) hash(C) = 5 = 001012
110
111 11 (01011) 2 local
25
E XTENDIBLE H ASHING
global 3 8 (01000) 1 local
14 (01110)
Find A
000 hash(A) = 14 = 011102
001 25 (11001) 3 local Insert B
010
hash(B) = 29 = 111012
011
100 21 (10101) 3 local Insert C
101 29 (11101) hash(C) = 5 = 001012
5 (00101)
110
111 11 (01011) 2 local
26
D IRECTORY D OUBLING
Double directory by copying its original pointers and ”fixing” pointer
to split bucket
Use of least significant bits enables efficient doubling via copying!
Splitting a bucket does not always require doubling the directory
Buckets with local depth < global depth have multiple pointers to them
Splitting such buckets does not require doubling
Modifying one or more bucket pointers in directory is sufficient
Directory can also shrink when buckets become empty
27
S UMMARY
Hash-based indexes
Best for equality searches, cannot support range searches
Static hashing
Can lead to long overflow chains
Extendible hashing
Avoids overflow chains by splitting a full bucket when a new entry is to be added to it