0% found this document useful (0 votes)
4 views27 pages

Lec 9

The document discusses hash-based indexing techniques in databases, focusing on static and extendible hashing. It explains how hash tables work, collision resolution methods, and the advantages of using hash indexes for equality searches. Extendible hashing is highlighted as a solution to avoid overflow chains by dynamically managing bucket sizes.

Uploaded by

p20232002567
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
4 views27 pages

Lec 9

The document discusses hash-based indexing techniques in databases, focusing on static and extendible hashing. It explains how hash tables work, collision resolution methods, and the advantages of using hash indexes for equality searches. Extendible hashing is highlighted as a solution to avoid overflow chains by dynamically managing bucket sizes.

Uploaded by

p20232002567
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 27

Advanced Database Systems

Spring 2025

Lecture #09:
Hash-Based Indexing

R&G: Chapter 11
2

R ECAP : F ILE O RGANISATIONS


Method of arranging a file of records on secondary storage

Heap Files S Q L C lie n t

Store records in no particular order Q u e r y P la n n i n g

O p e ra to r E xe c u ti o n
Sorted Files
F ile s & I n d ex M a n a g e me n t
Store records in sorted order, based on search key fields
B u f f e r Ma n a g e m e nt

Index Files Disk Space Management

Store records to enable fast lookup and modifications D a ta b a se


Tree-based & hash-based indexes
R ECAP : I N -M EMORY H ASH TABLE
3

( F R O M A L G O R I T H M S & D A TA S T R U C T U R E S C O U R S E )

A hash table implements an associative array (dictionary)


Data is stored as a collection of key-value pairs

It uses a hash function to compute an offset into an array of buckets (slots)


From which the desired value can be found

collision

Source: Introduction to Algorithms, 3rd edition


4

C OLLISION R ESOLUTION
By chaining Open Addressing
Link together entries hashed to the same value Single giant table of slots
Long chains can degrade search performance Hash to slot, then probe until a free slot is found
Variants: Linear Probing, Cuckoo, Robin Hood, …

Source: Introduction to Algorithms, 3rd edition Source: https://en.wikipedia.org/wiki/Hash_table


5

H ASHING IN D ATABASES
We want to be able to group together tuples with the same key value

Partition the data with hash function(s) applied on the key


All tuples with a certain key will be in the same partition

Useful for:
Removing duplicates (all duplicates will be grouped together)
Grouping data (for GROUP BY)
Looking up data using hash indexes
6

H ASH -B ASED I NDEXING


Suitable for equality-based predicates

SELECT * FROM Customer WHERE A = constant

Cannot support range queries

Other query operations internally generate a flood of equality tests


E.g.: nested loop join, where hash index can make a real difference

Support in commercial DBMSs


Tree-structured indexes preferred since they cover the more general range predicates
But hash-based indexes are used for (index) nested loop joins
7

O VERVIEW
Static and dynamic hashing techniques exist
Trade-offs similar to ISAM vs. B+ trees

Static hashing schemes


Chained hashing

Dynamic hashing schemes


Extendible hashing
Linear hashing (not covered)
8

S TATIC C HAINED H ASHING


Hash index is a collection of buckets

Build static hash index on column A


Allocate a fixed area of N (successive) pages, the so-called primary buckets
In each bucket, install a pointer to a chain of overflow pages (initially set to null)
Define a hash function h with range [0, …, N-1]
The domain of h is the type of A
e.g., h : INTEGER ⟶ [0, …, N-1], if A is of type INTEGER

The hash function determines the bucket where the desired value can be found
9

S TATIC C HAINED H ASH TABLE


Bucket = primary page plus zero or more overflow pages

Buckets contain index entries k* implemented using any of the variants A, B, or C

0 Bucket 0

record r 1
h(k) Bucket 1


N-1
h looks at the search Overflow pages Bucket N-1
key field k of record r
Primary
bucket pages
10

S TATIC C HAINED H ASH TABLE M ANAGEMENT


Operations: search, insert, delete
Compute h(k) on the search key field k of record r
Access the primary bucket page with number h(k)
Search for/insert/delete record on this page or, if needed, access the overflow pages

If overflow chain access is avoidable


search requires a single I/O operation
insert and delete require two I/O operations
11

H ASH C OLLISIONS AND O VERFLOW C HAINS


Hash collisions are unavoidable
For search keys k and k’, can happen h(k) = h(k’)
Search keys may not be unique (e.g., student age)
Even if unique, the search key space is much larger than # of buckets
Having as many primary bucket pages as different search keys in database ⇒ waste of space

Long overflow chains can degrade performance


Operation costs become non-uniform and unpredictable for a query optimiser
To reduce this problem, h needs to scatter search keys evenly across [0, …, N-1]
Large # of entries can still cause long chains (dynamic hashing to fix this)
12

H ASH F UNCTIONS
How to map a large key space into a smaller domain
Real distributions of search key values are often non-uniform (skewed)

Trade-off between being fast vs. collision rate


We want a lightweight (non-cryptographic) hash function with a low collision rate

Simple hash function: h(k) = k mod N


Guarantees the range of h(k) to be [0, N - 1]
Choosing N = 2d for some d effectively considers the least d bits of k only
Prime numbers work best for N

Better hash functions used in practice


xxHash (+ benchmark), MurmurHash, Google CityHash, Google FarmHash, CLHash
13

S TATIC H ASHING AND D YNAMIC F ILES


If the data file grows,
the development of overflow chains spoils the index I/O behaviour (1–2 I/O operations)

If the data file shrinks,


a significant fraction of primary buckets may be (almost) empty – a waste of space

We may periodically rehash the data file to restore the ideal situation
(20% free space, no overflow chains)
Expensive – the index not usable while rehashing is in progress

As for ISAM, static hashing has advantages with concurrent access


Only need to lock one bucket page to store a new entry or extend the overflow chain
14

E XTENDIBLE H ASHING
Situation: Bucket (primary page) is full and we want to insert. Why not
reorganize the index by doubling # of buckets?
Reading and writing all pages is expensive!

Idea: Use directory of pointers to buckets, double # of buckets by


doubling the directory, splitting just the bucket that overflowed
Directory much smaller than file, so doubling it is much cheaper

Only one page of data entries is split

No overflow pages!
15

E XTENDIBLE H ASHING

2 8 (01000) 1
14 (01110)
00
01 21 (10101) 2
10 25 (11001)

11
11 (01011) 2

Note: we depict as index entries h(k) instead of k*


16

G LOBAL AND L OCAL D EPTH


Global depth (n at directory) global 1 local
2 8 (01000)
14 (01110)
Use the least n bits of h(k) to find a 00
bucket pointer in the directory 01 21 (10101) 2 local
The directory size is 2n 10 25 (11001)

11
Local depth (d at individual buckets) 11 (01011) 2 local
The hash values h(k) of all entries in
this bucket agree on their least d bits
17

E XTENDIBLE H ASHING

global 2 8 (01000) 1 local


14 (01110)
Find A
00 hash(A) = 14 = 011102
01 21 (10101) 2 local
10 25 (11001)

11
11 (01011) 2 local

To find a bucket for A, take the least 2 bits of hash(A)


18

E XTENDIBLE H ASHING

global 2 8 (01000) 1 local


14 (01110)
Find A
00 hash(A) = 14 = 011102
01 21 (10101) 2 local
10 25 (11001)

11
11 (01011) 2 local

Check if the bucket contains key A. Need to compare keys due to collisions!
19

E XTENDIBLE H ASHING

global 2 8 (01000) 1 local


14 (01110)
Find A
00 hash(A) = 14 = 011102
01 21 (10101) 2 local Insert B
10 25 (11001)
hash(B) = 29 = 111012
11
11 (01011) 2 local
20

E XTENDIBLE H ASHING

global 2 8 (01000) 1 local


14 (01110)
Find A
00 hash(A) = 14 = 011102
01 21 (10101) 2 local Insert B
10 25 (11001)
hash(B) = 29 = 111012
29 (11101)
11
11 (01011) 2 local

If the bucket still has capacity, store the index entry in it


21

E XTENDIBLE H ASHING

global 2 8 (01000) 1 local


14 (01110)
Find A
00 hash(A) = 14 = 011102
01 21 (10101) 2 local Insert B
10 25 (11001)
hash(B) = 29 = 111012
29 (11101)
11
11 (01011) 2 local Insert C
hash(C) = 5 = 001012
22

E XTENDIBLE H ASHING

global 2 8 (01000) 1 local


14 (01110)
Find A
00 hash(A) = 14 = 011102
01 21 (10101) 2 local Insert B
10 25 (11001)
hash(B) = 29 = 111012
29 (11101)
11
11 (01011) 2 local Insert C
hash(C) = 5 = 001012

Split bucket if full (allocate new bucket, increase local, redistribute)


23

E XTENDIBLE H ASHING

global 2 8 (01000) 1 local


14 (01110)
Find A
00 hash(A) = 14 = 011102
01 25 (11001) 3 local Insert B
10
hash(B) = 29 = 111012
11
21 (10101) 3 local Insert C
29 (11101) hash(C) = 5 = 001012
3 bits now needed to
discriminate between
these two buckets ⇒ 11 (01011) 2 local
double directory
24

E XTENDIBLE H ASHING

global 3 8 (01000) 1 local


14 (01110)
Find A
000 hash(A) = 14 = 011102
001 25 (11001) 3 local Insert B
010
hash(B) = 29 = 111012
011
100 21 (10101) 3 local Insert C
101 29 (11101) hash(C) = 5 = 001012
110
111 11 (01011) 2 local
25

E XTENDIBLE H ASHING

global 3 8 (01000) 1 local


14 (01110)
Find A
000 hash(A) = 14 = 011102
001 25 (11001) 3 local Insert B
010
hash(B) = 29 = 111012
011
100 21 (10101) 3 local Insert C
101 29 (11101) hash(C) = 5 = 001012
5 (00101)
110
111 11 (01011) 2 local
26

D IRECTORY D OUBLING
Double directory by copying its original pointers and ”fixing” pointer
to split bucket
Use of least significant bits enables efficient doubling via copying!

Splitting a bucket does not always require doubling the directory


Buckets with local depth < global depth have multiple pointers to them
Splitting such buckets does not require doubling

Modifying one or more bucket pointers in directory is sufficient

Directory can also shrink when buckets become empty


27

S UMMARY
Hash-based indexes
Best for equality searches, cannot support range searches

Static hashing
Can lead to long overflow chains

Extendible hashing
Avoids overflow chains by splitting a full bucket when a new entry is to be added to it

You might also like