L04: Hashing
For efficient look-up in a table
Outline
What is hashing?
How to hash?
What is collision?
How to resolve collision?
Separate chaining
Linear probing
Quadratic probing
Double hashing
Load factor
Primary clustering and secondary clustering
2
ADT Table Operations
ADT table supports 3 operations:
1. Insertion
2. Deletion
3. Retrieval
Sorted Balanced Hashing
Array BST
Insertion O(n) O(log n) O(1) avg
Deletion O(n) O(log n) O(1) avg
Retrieval O(log n) O(log n) O(1) avg
Note: Balanced Binary Search Tree (BST) will be covered
in the next lecture
3
Direct Addressing
Table
A simplified version of hash table
SBS bus Problem
5
Direct Addressing Table: Operations
insert (key, data)
a[key] = data // where a[] is an array - the table
delete (key)
a[key] = null
find (key)
return a[key]
6
Direct Addressing Table: Restrictions
Keys must be integer
What happens for key values 151A and NR10?
Range of keys must be small
Keys must be dense, i.e. not many gaps in
the key values.
How to overcome these restrictions?
7
Hash Table
A generalization of direct addressing
table, to remove its restrictions.
9
10
Good Hash Functions
Fast to compute
Scatter keys evenly throughout the hash table
Less collisions
Need less slots (space)
11
Bad Hash Functions
What happen when you hash Singapore’s
house phone numbers by selecting the first
three digits?
12
Perfect Hash Functions
13
How to Define a Hash Function?
Uniform hash function
Division method
Multiplication method
Hashing of strings
14
Uniform Hash Functions
15
3 Division method (mod operator)
Map into a hash table of m slots.
Use the modulo operator (% ) to map an integer
to a value between 0 and m-1.
hash(k ) = k % m
The most popular method.
16
3 How to pick m (table size)?
17
Multiplication method
1. Multiply by a fraction A between 0 and 1
2. Extract the fractional part
3. Multiply the fraction by m, the hash table size
hash (k ) = m(kA- kA )
18
Hashing of strings
hash1(s) // s is a string
sum = 0
foreach character c in s
sum += c // sum up the ASCII values of all characters
return sum % m // m is the hash table size
19
Hashing of strings
hash(“Tan Ah Teck”)
= (“T” + “a” + “n” + “ ” +
“A” + “h” + “ ” +
“T” + “e” + “c” + “k”) % 11 // hash table size is 11
= (84 + 97 + 110 + 32 +
65 + 104 + 32 +
84 + 101 + 99 + 107) % 11
= 825 % 11
=0
20
Hashing of strings:
Lee Chin Tan
Chen Le Tian
Chan Tin Lee
All 3 strings above have the same hash value! Why?
Problem: The hash code produced by hash1 does
not depend on positions of characters! – Bad
21
Improved Hashing of strings
A better way is to “shift” the sum before adding the next
character, so that its position will affect the hash code.
This kind of hash code is called a polynomial hash code
hash2(s)
sum = 0
foreach character c in s
sum = sum*37 + c
return sum % m // m is the hash table size
22
Collision Resolution
4 Probability of Collision (1/2)
von Mises Paradox (The Birthday Paradox): “How
many people must be in a room before the probability
that some share a birthday, ignoring the year and leap
days, becomes at least 50 percent?”
Q(n) = Probability of unique birthday for n people
364 363 362 365 - n 1
= ...
365 365 365 365
P(n) = Probability of collisions (same birthday) for n people
= 1 – Q(n)
P(23) = 0.507
24
4 Probability of Collision (2/2)
This means that if there are 23 people in a room,
the probability that some people share a
birthday is 50.7%.
So, if we insert 23 keys in to a table with 365
slots, more than half of the time we get
collisions. Such a result is counter-intuitive to
many.
So, collision is very likely!
25
How to resolve collisions?
Conflict resolution schemes commonly used:
Separate Chaining
Linear Probing
Quadratic Probing
Double Hashing
26
27
28
Load Factor
n: number of keys in the hash table
m: size of the hash tables – number of slots
Define the load factor
= n/m
a measure of how full the hash table is.
Note: can be >= 1.
If table size is the number of linked lists, then
is the average length of the linked lists.
29
Average Running Time
Find O(1 + )
Insert O(1)
Delete O(1 + )
Note that affects the performance of find and
delete operations.
If is bounded by some constant, then all three
operations are O(1).
30
31
Linear Probing
0
hash(k) = k mod 7 In linear probing,
1 when there is a
Here the table size m=7
collision, we scan
2 forwards for the the
Note: 7 is a prime number.
3 next empty slot
(wrapping around
4 when we reach the
last slot).
5
32
Linear Probing: Insert 18
0
hash(k) = k mod 7
1
hash(18) 2
= 18 mod 7
3
=4
4 18
5
33
Linear Probing: Insert 14
0 14
hash(k) = k mod 7
1
2
hash(14)
3
= 14 mod 7
=0 4 18
5
34
Linear Probing: Insert 21
0 14
hash(k) = k mod 7
1 21
Collision occurs!
2 Look for next empty slot.
3
hash(21)
= 21 mod 7 4 18
=0 5
35
Linear Probing
4.2 Linear Probing: Insert 1
0 14
hash(k) = k mod 7 Collides with 21
1 (hash value 0). Look
21
for next empty slot.
hash(1) 2 1
= 1 mod 7 3
=1 4 18
5
36
Linear Probing: Insert 35
0 14
hash(k) = k mod 7 Collision, need to
1 check next 3 slots.
21
hash(35) 2 1
= 35 mod 7 3 35
=0
4 18
5
37
Linear Probing: Find 35
0 14
hash(k) = k mod 7 Found 35, after 4
1 probes.
21
hash(35) = 0
2 1
3 35
4 18
5
38
Linear Probing: Find 8
0 14
hash(k) = k mod 7 8 NOT found.
1 Need 5 probes!
21
hash(8) = 1
2 1
3 35
4 18
5
39
Linear Probing: Delete 21
0 14
hash(k) = k mod 7
1 21
hash(21) = 0
2 1
3 35
4 18
5
40
Linear Probing: Find 35
0 14
hash(k) = k mod 7 35 NOT found!
1 Incorrect!
hash(35) = 0
2 1
We cannot simply
3 35 remove a value,
4 because it can
18
affect find()!
5
41
How to delete?
Lazy Deletion
Use three different states of a slot
Occupied
deleted
Empty
When a value is removed from linear probed
hash table, we just mark the status of the slot
as “deleted”, Instead of emptying the slot.
Need to use a state array the same size as
the hash table.
42
Linear Probing: Delete 21
0 14
hash(k) = k mod 7 Slot 1 is occupied
1 but now marked
X
21
as deleted.
hash(21) = 0
2 1
3 35
4 18
5
43
Linear Probing
Linear Probing: Find 35
0 14
hash(k) = k mod 7
1 X
21
hash(35) = 0
2 1
3 35
Found 35
4 18 Now we can find 35
44
Linear Probing: Insert 15
0 14
hash(k) = k mod 7
1 X
21
15 15 is inserted into
hash(15) = 1 the slot 1 which
2 1 was marked as
Note: We continue 3 35 deleted
to search for 15, 4 18
and found that 15 We can insert a
is not in the hash 5 new value into a
table (total 5 slot that has been
6
probes).
marked as
deleted.
45
46
Problem of Linear Probing
0 14
1 15
Consecutive
2 1 occupied slots.
3 35
4 18
5
6 This is called
Primary
Clustering
47
Linear Probing
48
Modified Linear Probing
Q: How to modify linear probing to avoid primary clustering?
We can modify the probe sequence as follows:
hash(key)
( hash(key) + 1 * d ) % m
( hash(key) + 2 * d ) % m
( hash(key) + 3 * d ) % m
:
where d is some constant integer >1 and is co-prime to m.
Note: Since d and m are co-primes, the probe sequence
covers all the slots in the hash table.
49
Quadratic Probing
50
Quadratic Probing: Insert 3, 18
0
hash(k) = k mod 7
1
hash(3) = 3
Hash(18)= 4 2
3 3
4 18
5
51
Quadratic Probing: Insert 38
0 38
hash(k) = k mod 7
1
hash(38) = 3
2
Collision
3 3
+1
4 18
5
+4
6
52
Can quadratic probing always find a free slot?
Insert 12 into the previous example, followed by 10.
what happen?
Theorem
If < 0.5, and m is prime, then we can always
find an empty slot.
(m is the table size and is the load factor)
When quadratic probing is used, in the worst
case, 50% of the hash table is wasted.
53
Secondary Clustering
In quadratic probing, clusters are formed along
the path of probing, instead of around the home
location.
These clusters are called secondary clusters.
Secondary clusters are formed as a result of
using the same pattern in probing by all keys.
54
Double Hashing
55
After Inserting 14 and 18, Insert 21
hash(k) = k mod 7 0 14 hash(21)
hash2(k) = k mod 5
1 21 =21 mod 7
2
=0
3
hash2(21)
4 18 = 21 mod 5
5 =1
6
56
Double Hashing: Insert 4
hash(k) = k mod 7 0 14 If we insert 4, the
hash2(k) = k mod 5
1 probe sequence is
21
4 (home), 8, 12, …
2
3 hash(4) = 4
4
hash2(4) = 4
18
5 4
6
57
Double Hashing: Insert 35
hash(k) = k mod 7 0 14 But if we insert 35,
hash2(k) = k mod 5
1 the probe sequence
21
hash(35) = 0 is 0, 0, 0, …
hash2(35) = 0 2
What is wrong?
3 Since hash2(35)=0.
4 Not acceptable!
18
5 4
6
58
Warning:
59
Note:
If hash2(k) =1, then it is the same as linear probing.
If hash2(k) =d, where d is a constant integer > 1, then it
is the same as modified linear probing.
60
Good Collision Resolution Method
Small cluster size
Always find an empty slot if it exists
Give different probe sequences when 2 keys
collide
Fast
61
[CS1020 Lecture 10 AY2013/4 S1]
62
Summary
How to hash? Criteria for good hash functions?
How to resolve collision?
Collision resolution techniques:
separate chaining
linear probing
quadratic probing
double hashing
Problem on deletions
Primary clustering and secondary clustering.
63