Ads-Unit I

The document discusses dictionaries and hash tables. It defines a dictionary as a dynamic set that supports search, insert, and delete operations using keys. Hash tables are described as one implementation of dictionaries using a hash function to map keys to table indexes. Open hashing and closed hashing are two broad classes of collision resolution techniques for hash tables, with open hashing using separate chaining of collided keys in linked lists and closed hashing resolving collisions by probing to find vacant cells in the table array. The universal hashing method is presented as a novel technique using a random binary matrix to hash variable-length keys to fixed-length hashes.


ADVANCED DATA STRUCTURES UNIT - I CSE

DICTIONARIES

A dictionary is a container of elements from a totally ordered universe that supports the
basic operations of inserting/deleting elements and searching for a given element.
In this chapter, first, we introduce the abstract data type Set which includes dictionaries,
priority queues, etc. as subclasses.

Sets:
The set is the most fundamental data model of mathematics.
A set is a collection of well-defined elements. The members of a set are all different.
There are special operations that are commonly performed on sets, such as union,
intersection, and difference.

1. The union of two sets S and T, denoted S ∪ T, is the set containing the
elements that are in S or T, or both.
2. The intersection of sets S and T, written S ∩ T, is the set containing the
elements that are in both S and T.
3. The difference of sets S and T, denoted S − T, is the set containing those
elements that are in S but not in T.

For example:
Let S be the set {1, 2, 3} and T the set {3, 4, 5}. Then
S ∪ T = {1, 2, 3, 4, 5}, S ∩ T = {3}, and S − T = {1, 2}
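These operations can be checked directly with Python's built-in set type:

```python
# The three basic set operations on the example sets S and T.
S = {1, 2, 3}
T = {3, 4, 5}

union = S | T          # elements in S or T, or both
intersection = S & T   # elements in both S and T
difference = S - T     # elements in S but not in T

print(union)         # {1, 2, 3, 4, 5}
print(intersection)  # {3}
print(difference)    # {1, 2}
```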

Set implementation:
Possible data structures include
 Bit Vector
 Array
 Linked List
o Unsorted
o Sorted

Dictionaries:
A dictionary is a dynamic set ADT with the operations:

 Search(S, k) – an access operation that returns a pointer x to an element where x.key = k
 Insert(S, x) – a manipulation operation that adds the element pointed to by x to S
 Delete(S, x) – a manipulation operation that removes the element pointed to by x from S
Dictionaries store elements so that they can be located quickly using keys.
They are useful in implementing symbol tables, text retrieval systems, database systems,
page mapping tables, etc.
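As a minimal sketch (not from the original notes), the three dictionary operations can be modeled on top of a Python dict, with elements accessed through references:

```python
# A minimal dictionary ADT sketch; Search/Insert/Delete keyed on x.key.
class Element:
    def __init__(self, key, value):
        self.key = key
        self.value = value

def insert(S, x):      # Insert(S, x): add the element pointed to by x to S
    S[x.key] = x

def search(S, k):      # Search(S, k): return element with x.key == k, or None
    return S.get(k)

def delete(S, x):      # Delete(S, x): remove the element pointed to by x from S
    del S[x.key]

S = {}
insert(S, Element(7, "seven"))
print(search(S, 7).value)  # seven
```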

Implementation:
1. Fixed Length arrays
2. Linked lists: sorted, unsorted, skip-lists
3. Hash Tables: open, closed
4. Trees
 Binary Search Trees (BSTs)
 Balanced BSTs
o AVL Trees
o Red-Black Trees
 Splay Trees
 Multiway Search Trees
o 2-3 Trees
o B Trees
 Tries
Let n be the number of elements in a dictionary D. The following is a summary of the
worst-case performance of some basic implementation methods:

[Worst-case complexity table omitted: the row and column labels did not survive; all listed entries were O(n) except a single O(1).]

Among these, the sorted list has the best average-case performance.
In this chapter, we discuss two data structures for dictionaries, namely Hash Tables and Skip
Lists.

HASHING
Division Method:

One common method of determining a hash key is the division method of hashing.

The formula that will be used is:

H(key) = key % no.of slots in the table

i.e. h(key) = key mod array size

For example:

Consider a table with 8 slots. i.e. array size 8.

Hash key = key % table size

The key values are 36, 18, 72, 43, 6, 42

The division method is generally a reasonable strategy, unless the key happens to have
some undesirable properties.

Note: if the table size is 10 and all of the keys end in zero, then every key hashes to slot 0.

In the above example 42 mod 8 = 2, but that position in the hash table is already filled (by 18). This is
known as a collision, i.e. two or more record keys map to the same array index.

In this case, the choice of hash function and table size needs to be carefully considered. The
best table sizes are prime numbers.
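A minimal sketch of the division method on the example above:

```python
# Division-method hashing: key mod table size, here with 8 slots.
def h(key, m=8):
    return key % m

keys = [36, 18, 72, 43, 6, 42]
print([(k, h(k)) for k in keys])
# 36->4, 18->2, 72->0, 43->3, 6->6, 42->2 (42 collides with 18 at slot 2)
```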

Multiplication method:
The simplest situation is when the keys are floating-point numbers known to be in a fixed
range.

For example:

If the keys are numbers that are greater than 0 and less than 1, we can just multiply by m
(table size) and take the floor to get an address between 0 and m-1.

Algorithm:

1. Choose constant A in the range 0 < A < 1.


2. Multiply key k by A.
3. Extract the fractional part of k*A
4. Multiply the fractional part by number of slots, m.
5. Take the floor of the result.

Mathematically

h( k ) = ⌊m · (k·A mod 1)⌋

where k·A mod 1 = k·A − ⌊k·A⌋ = fractional part of k·A

 Disadvantage: Slower than division method.


 Advantage: Value of m is not critical.

Example:

m = 8 (i.e. m = 2^3, p = 3)

w=5

k = 21

Must have 0 < s < 2^5 = 32; choose s = 13 → A = s/2^w = 13/32.

 h(k) = ⌊m · (k·A mod 1)⌋

h(21) = ⌊8 · ((21 · 13/32) mod 1)⌋

k·A = 21 · 13/32 = 273/32 = 8 + 17/32 → k·A mod 1 = 17/32

m · (k·A mod 1) = 8 · 17/32 = 17/4 = 4 + 1/4 → ⌊m · (k·A mod 1)⌋ = 4

So h(21) = 4.
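The computation above can be reproduced with a short sketch (A = 13/32 as in the example):

```python
import math

# Multiplication method: h(k) = floor(m * (k*A mod 1)).
def h_mult(k, m=8, A=13/32):
    frac = (k * A) % 1          # fractional part of k*A
    return math.floor(m * frac)

print(h_mult(21))  # 4, matching the worked example
```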

Example:
m = 8 (i.e. m = 2^3, p = 3)
w=5
k = 21
s = 13

k·s = 21 · 13

= 273

= 8 · 2^5 + 17

= (r1, r0)

r1 = 8
r0 = 17 = 10001₂
Written in w = 5 bits, r0 = 10001₂. The p = 3 most significant bits of r0 are 100₂ or 4₁₀,
so h(21) = 4.
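The same hash can be computed with word arithmetic only, using a multiply, a mask, and a shift; a sketch under the parameters above (w = 5, p = 3, s = 13):

```python
# Shift-based multiplication hashing: with m = 2^p and w-bit words,
# h(k) is the top p bits of the low w bits of k*s.
def h_shift(k, s=13, w=5, p=3):
    r0 = (k * s) % (1 << w)   # low w bits of k*s
    return r0 >> (w - p)      # top p bits of r0

print(h_shift(21))  # 4: k*s = 273 = 8*2^5 + 17, r0 = 10001b, top 3 bits = 100b
```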

Exercise Example:

m = 4 (i.e. m = 2^2, p = 2)
w = 3
k = 12
s = 5 (0 < s < 2^w = 2^3 = 8)

k·s = 12 · 5

= ?

= ? · 2^3 + ?

= (r1, r0)

r1 = ?
r0 = ? = ?₂
 Written in w = 3 bits, r0 = ?₂
 The p = 2 most significant bits of r0 are ?₂ or ?₁₀, so h(12) = ?.

Universal method:
Hashing is a fun idea that has lots of unexpected uses. Here, we look at a novel type of hash
function that makes it easy to create a family of universal hash functions. The method is
based on a random binary matrix and is very simple to implement.

The idea is very simple. Suppose you have an input data item with m bits and you want a
hash function that produces n bits; first generate a random binary matrix (M) of order n×m.

The hash function is h(x) = Mx, where x is the key viewed as a binary column vector and the arithmetic is done modulo 2.

For example, suppose you have the key 11; its binary form is 1011, a four-bit input
value (m = 4), and you want a three-bit hash value (n = 3).

Then generate a random matrix, say:

    ( 0 1 0 0 )
M = ( 1 0 1 1 )
    ( 1 1 0 1 )

and if the data value was 1011 the hash value would be computed as:

            ( 0 1 0 0 ) (1)   (0)
h(x) = Mx = ( 1 0 1 1 ) (0) = (1)
            ( 1 1 0 1 ) (1)   (0)
                        (1)
There are a number of other ways to look at the way the arithmetic is done that suggest
different ways of implementing the algorithm.

The first is to notice that what you are doing is ANDing each row with the data column
vector. Taking the second row as an example: (1 0 1 1) AND (1 0 1 1) = (1 0 1 1),
and then you add up the bits in the result modulo 2: 1 + 0 + 1 + 1 = 3 ≡ 1 (mod 2).

Doing this for each row gives the index 010, which in decimal is 2.
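The row-AND-and-parity view can be sketched as follows (rows and input written as bit masks):

```python
# Matrix-method hash: for each row of M, AND the row with the input x,
# then take the parity (sum of bits mod 2) as one output bit.
def matrix_hash(rows, x):
    h = 0
    for row in rows:
        parity = bin(row & x).count("1") % 2
        h = (h << 1) | parity
    return h

M = [0b0100, 0b1011, 0b1101]   # the 3x4 matrix from the example
print(matrix_hash(M, 0b1011))  # 2, i.e. the index 010
```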


Hashing gives an alternative approach that is often the fastest and most convenient way of
solving problems in areas such as AI search programs, cryptography, networks, and complexity
theory.

Collision Resolution Techniques:

In general, a hashing function can map several keys into the same address, which leads to a
collision. The colliding records must be stored and accessed as determined by a collision-resolution
technique.
There are two broad classes of such techniques:

Open Hashing (also called separate chaining), and

Closed Hashing (also called open addressing)

The difference between the two has to do with whether collisions are stored outside the
table (open hashing), or whether colliding records are stored at another slot
in the table (closed hashing).

The particular hashing method that one uses depends on many factors. One important
factor is the ratio of the number of keys in the table to the number of hash addresses. It is called the load
factor, and is given by: load factor (α) = n/m,
where n is the number of keys in the table and m is the number of hash addresses (table size).

Open Hashing:

The simplest form of open hashing defines each slot in the hash table to be the head of a
linked list. All records that hash to a particular slot are placed on that slot’s linked list.

The below figure illustrates a hash table where each slot stores one record and a link pointer
to the rest of the list.

Consider the same example of division method:

Any keys that hash to the same index are simply added to that slot's linked list; there is no need to
search for empty cells in the array. This method is called separate chaining.
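A minimal sketch of separate chaining, reusing the division-method keys:

```python
# Separate chaining: each slot is the head of a list of the keys that hash there.
class ChainedHashTable:
    def __init__(self, size=8):
        self.size = size
        self.slots = [[] for _ in range(size)]

    def insert(self, key):
        self.slots[key % self.size].append(key)

    def search(self, key):
        return key in self.slots[key % self.size]

t = ChainedHashTable()
for k in [36, 18, 72, 43, 6, 42]:
    t.insert(k)
print(t.slots[2])  # [18, 42] -- both hash to slot 2, no probing needed
```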

Closed Hashing (Open addressing):

It resolves collisions in the prime area, that is, the area that contains all of the home addresses:
when a data item cannot be placed at the index calculated by the hash function, another
location in the array is sought.

There are different methods of open addressing, which vary in the method used to find the
next vacant cell.

They are (i) Linear probing


(ii) Quadratic probing
(iii) Pseudo random probing
(iv) Double hashing
(v) Key – offset

Hashing with Linear probe:

We resolve the collision by adding 1 to the current address, assuming that the table is not full.
We apply the division method of hashing.

Consider the example:

Add the keys 10, 5, and 15 to the above table


H(k) = k mod table size
10 mod 8 = 2 is a collision, so add 1 to the address and check whether that cell is empty or filled. If it is
filled, repeat the step; in this way key 10 is placed in the cell at index 5.

If the physical end of the table is reached during the linear search, the search will wrap around to the
beginning of the table and continue from there.

If an empty slot is not found before returning to the point of collision, the table is full.

A problem with the linear probe method is that it is possible for blocks of data to form when
collisions are resolved. This is known as primary clustering.
This means that any key that hashes into the cluster will require several attempts to resolve
the collision.

Linear probes have two advantages: first, they are quite simple to implement; second, data
tend to remain near their home address.

Exercise Example:
Insert the nodes 89, 18, 49, 58, and 69 into a hash table that holds 10 items using the
division method.
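A sketch of this exercise using linear probing (h(k) = k mod 10; the table contents assume the keys are inserted in the order given):

```python
# Linear probing: on a collision, step forward one slot at a time (wrapping).
def insert_linear(table, key):
    m = len(table)
    i = key % m
    while table[i] is not None:   # assumes the table is not full
        i = (i + 1) % m
    table[i] = key

table = [None] * 10
for k in [89, 18, 49, 58, 69]:
    insert_linear(table, k)
print(table)  # [49, 58, 69, None, None, None, None, None, 18, 89]
```

Note how 49, 58, and 69 pile up at the start of the table: that block is a primary cluster.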

Hashing with Quadratic probe:



In this probe, rather than always moving one spot, move i² spots from the point of
collision, where i is the number of attempts to resolve the collision.

In linear probing, if the primary hash index is x, subsequent probes go to x+1, x+2, x+3 and so
on. In quadratic probing, probes go to x+1, x+4, x+9, and so on: the distance from the initial
probe is the square of the step number, x+1², x+2², x+3², x+4² and so on.

i.e., at first it picks the adjacent cell, if that is occupied, it tries 4 cells away, if that is
occupied it tries 9 cells away, and so on. It eliminates the primary clustering problem with
linear probe.
Consider the above exercise problem, keys 89, 18, 49, 58, 69 with table size 10.

Here each key that hashes to the same location follows the same probe sequence. This phenomenon is
called secondary clustering. It is not a serious problem, but quadratic probing is not often
used because there is a slightly better solution.
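For comparison, a sketch of the same exercise keys under quadratic probing (the i-th attempt probes (home + i²) mod m):

```python
# Quadratic probing: the i-th attempt probes (home + i*i) mod m.
def insert_quadratic(table, key):
    m = len(table)
    home = key % m
    i = 0
    while table[(home + i * i) % m] is not None:
        i += 1
    table[(home + i * i) % m] = key

table = [None] * 10
for k in [89, 18, 49, 58, 69]:
    insert_quadratic(table, k)
print(table)  # [49, None, 58, 69, None, None, None, None, 18, 89]
```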
Hashing with Pseudo – random probing:

This method uses pseudo – random number to resolve the collision i.e. this probe function
would select the next position on the probe sequence at random from among the unvisited
slots that is the probe sequence should be a random permutation of hash table positions.

Unfortunately, we can’t actually select the next position in the probe sequence at random,
because we would not be able to duplicate this same probe sequence when searching for
the key.

In this probing, the i-th slot in the probe sequence is (h(key) + r_i) mod M, where r_i is the i-th
value in a random permutation of the numbers from 1 to M-1.
All insertions and searches use the same sequence of random numbers.

Consider the same example of division method:



36 % 8 = 4
18 % 8 = 2
72 % 8 = 0
43 % 8 = 3
6 % 8 = 6

Now insert 60: 60 % 8 = 4 is a collision.

The pseudo-random permutation to use is: 0 6 5 2 3 4 1 7

For collision resolution:

Current slot = (4 + 0) % 8 = 4; slot 4 is occupied

Current slot = (4 + 6) % 8 = 2; occupied

Current slot = (4 + 5) % 8 = 1; it is empty; key 60 occupies slot 1
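A sketch of this resolution scheme, with each probe computed from the home address plus the next permutation value:

```python
# Pseudo-random probing: probe (home + r) mod m for each r in a fixed
# permutation; inserts and searches must use the same permutation.
def insert_pseudo_random(table, key, perm):
    m = len(table)
    home = key % m
    for r in perm:
        slot = (home + r) % m
        if table[slot] is None:
            table[slot] = key
            return slot
    raise RuntimeError("table full")

# Slots 0, 2, 3, 4, 6 already hold 72, 18, 43, 36, 6 (division example).
table = [72, None, 18, 43, 36, None, 6, None]
slot = insert_pseudo_random(table, 60, [0, 6, 5, 2, 3, 4, 1, 7])
print(slot)  # 1: probes slot 4 (occupied), slot 2 (occupied), then slot 1
```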

Pseudo-random numbers are a relatively simple solution, but they have one significant
limitation: all keys follow the same collision-resolution path through the list, so this
method can still create significant secondary clustering.

Double Hashing:

Double hashing uses the idea of applying a second hash function to the key when a collision
occurs. The result of the second hash function will be the number of positions from the
point of collision to insert.

There are a couple of requirements for the second function:

 It must never evaluate to 0
 It must make sure that all cells can be probed

A popular second hash function is: Hash2(key) = R – (key % R)
where R is a prime number that is smaller than the size of the table.

Table size = 10

Hash1 (key) = key % 10


Hash2 (key) = 7 – (key % 7)
because 7 is a prime number smaller than the size of the table.

Insert keys: 89, 18, 49, 58, 69

Hash (89) = 89 % 10 = 9
Hash (18) = 18 % 10 = 8

Hash1 (49) = 49 % 10 = 9; it’s a collision


Hash2 (49) = 7 – (49 % 7) = 7; positions from location 9

Hash1 (58) = 58 % 10 =8; it’s a collision


Hash2 (58) = 7 – (58 % 7) = 5; positions from location 8
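A sketch of double hashing with these two functions:

```python
# Double hashing: on a collision, step by hash2(key) = 7 - (key % 7).
def insert_double(table, key):
    m = len(table)
    i = key % m                 # hash1
    step = 7 - (key % 7)        # hash2: never 0, and key-dependent
    while table[i] is not None:
        i = (i + step) % m
    table[i] = key

table = [None] * 10
for k in [89, 18, 49, 58, 69]:
    insert_double(table, k)
print(table)  # [69, None, None, 58, None, None, 49, None, 18, 89]
```

Note that 49, 58, and 69 all collide on their first probe yet land in different slots, because each key gets its own step size.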

NOTE:

Linear probing → m distinct probe sequences, primary clustering

Quadratic probing → m distinct probe sequences, no primary but secondary clustering

Double hashing → m² distinct probe sequences, no primary and no secondary clustering

Key – offset:

It is a double-hashing method that produces different collision paths for different keys. Whereas
the pseudo-random number generator produces a new address as a function of the
previous address only, key offset calculates the new address as a function of the old address and
the key.

One of the simplest versions simply adds the quotient of key divided by the list size to the
address to determine the next collision resolution address, as shown below

Offset = key / list size

Address = (offset + old address) modulo list size

For example:

The key is 166702 and the list size is 307; the modulo-division hashing method generates an
address of 1. It is a collision because that slot already holds the key 070918.

Using key offset to calculate the next address, we get 237 as shown below

Offset = 166702 / 307 = 543

Address = (543 + 001) modulo 307 = 237

If 237 were also a collision, we would repeat the process to locate the next address, as
shown below

Offset = 166702 / 307 = 543

Address = (543 + 237) modulo 307 = 166

If it is free, then place the key in this address.
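A sketch of the key-offset calculation for this example:

```python
# Key offset: the step is derived from the key itself (offset = key // size),
# so keys that collide at the same address follow different resolution paths.
def next_address(key, old_address, list_size=307):
    offset = key // list_size
    return (offset + old_address) % list_size

print(next_address(166702, 1))    # 237: offset 543, (543 + 1) mod 307
print(next_address(166702, 237))  # 166: (543 + 237) mod 307
```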

Skip Lists:

Skip list is a type of data structure that can be used as an alternative to balanced (binary)
trees or B-Trees. As compared to a binary tree, skip lists allow quick search, insertions and
deletions of elements with simple algorithms. This is achieved by using probabilistic
balancing rather than strictly enforced balancing as done in B-trees.

Skip lists are also significantly faster than equivalent algorithms for B-Trees.

A skip list is basically a linked list with additional pointers that allow intermediate nodes to
be skipped, hence the name skip list.

In a simple linked list that consists of ‘n’ elements, to perform a search ‘n’ comparisons are
required in the worst case.

For example:

If a second pointer pointing two nodes ahead is added to every node, the number of
comparisons goes down to ⌈n/2⌉ + 1 in the worst case.

Consider a sorted list where every other node has an additional pointer to the node two
ahead of it in the list.

Here, every other node has an additional pointer.

Next, every second node has a pointer two ahead of it

In the list of the above figure, every second node has a pointer two ahead of it and every fourth
node has a pointer four ahead of it. Here we need to examine no more than ⌈n/4⌉ + 2
nodes.

In the below figure, every (2^i)-th node has a pointer 2^i nodes ahead (i = 1, 2, ...); then the number
of nodes to be examined can be reduced to ⌈log₂ n⌉ while only doubling the number of
pointers.

Here, every (2^i)-th node has a pointer to a node 2^i nodes ahead (i = 1, 2, ...)

 A node that has k forward pointers is called a level k node. If every (2^i)-th node has a
pointer 2^i nodes ahead, then
# of level 1 nodes: 50 %
# of level 2 nodes: 25 %
# of level 3 nodes: 12.5 %
 Such a data structure can be used for fast searching but insertions and deletions will
be extremely cumbersome, since levels of nodes will have to change.

 What would happen if the levels of nodes were randomly chosen but in the same
proportions (below figure)?
o The level of a node is chosen randomly when the node is inserted.
o A node's i-th pointer, instead of pointing to a node that is 2^(i-1) nodes ahead,
points to the next node of level i or higher.
o In this case, insertions and deletions will not change the level of any node.
o Some arrangements of levels would give poor execution times but it can be
shown that such arrangements are rare.
Such a linked representation is called a skip list.
 Each element is represented by a node the level of which is chosen randomly when
the node is inserted, without regard for the number of elements in the data
structure.
 A level i node has i forward pointers, indexed 1 through i. There is no need to store
the level of a node in the node.
 Maxlevel is the maximum number of levels in a node.
o Level of a list = Maxlevel
o Level of empty list = 1
o Level of header = Maxlevel

It is a skip list

Initialization:
An element NIL is allocated and given a key greater than any legal key. All levels of all lists
are terminated with NIL. A new list is initialized so that the level of list = maxlevel and all
forward pointers of the list's header point to NIL

Search:
We search for an element by traversing forward pointers that do not overshoot the node
containing the element being searched for. When no more progress can be made at the
current level of forward pointers, the search moves down to the next level. When we can
make no more progress at level 1, we must be immediately in front of the node that
contains the desired element (if it is in the list).

Insertion and Deletion:


 Insertion and deletion are through search and splice
 update [i] contains a pointer to the rightmost node of level i or higher that is to the
left of the location of insertion or deletion.
 If an insertion generates a node with a level greater than the previous maximum
level, we update the maximum level and initialize appropriate portions of update list.
 After a deletion, we check to see if we have deleted the maximum level element of
the list and if so, decrease the maximum level of the list.

 Below figure provides an example of Insert and Delete. The pseudo - code for Insert
and Delete is shown below.
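The search-and-splice scheme described above can be sketched as follows (a minimal illustration; the MaxLevel value and the example keys are arbitrary choices):

```python
import random

MAX_LEVEL = 4

class Node:
    def __init__(self, key, level):
        self.key = key
        self.forward = [None] * level   # a level-i node has i forward pointers

def make_skiplist():
    header = Node(None, MAX_LEVEL)
    nil = Node(float("inf"), 0)         # NIL: key greater than any legal key
    for i in range(MAX_LEVEL):
        header.forward[i] = nil         # all levels terminate with NIL
    return header

def random_level(p=0.5):                # node level chosen at random on insert
    level = 1
    while random.random() < p and level < MAX_LEVEL:
        level += 1
    return level

def search(header, key):
    x = header
    for i in reversed(range(MAX_LEVEL)):   # drop a level when no more progress
        while x.forward[i].key < key:
            x = x.forward[i]
    x = x.forward[0]                       # now immediately in front of the node
    return x if x.key == key else None

def insert(header, key):
    update = [None] * MAX_LEVEL            # rightmost node left of the spot, per level
    x = header
    for i in reversed(range(MAX_LEVEL)):
        while x.forward[i].key < key:
            x = x.forward[i]
        update[i] = x
    node = Node(key, random_level())
    for i in range(len(node.forward)):     # splice in at each of the node's levels
        node.forward[i] = update[i].forward[i]
        update[i].forward[i] = node

sl = make_skiplist()
for k in [3, 7, 12, 19, 25]:
    insert(sl, k)
print(search(sl, 12) is not None, search(sl, 13) is None)  # True True
```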

Analysis of Skip lists:

In a skip list of 16 elements, we may have

 9 elements at level 1
 3 elements at level 2
 3 elements at level 3
 1 element at level 6

One important question is:


Where do we start our search? Analysis shows we should start from level L(n), where

L(n) = log₂ n

In general, if p is the probability fraction,

L(n) = log₁/ₚ n

where p is the fraction of the nodes with level i pointers that also have level (i + 1)
pointers.

 However, starting at the highest level does not alter the efficiency in a significant
way.
 Another important question to ask is:
What should MaxLevel be? A good choice is

MaxLevel = L(N) = log₁/ₚ N

where N is an upper bound on the number of elements in the skip list.

 Complexity of search, delete, insert is dominated by the time required to search for
the appropriate element. This in turn is proportional to the length of the search
path. This is determined by the pattern in which elements with different levels
appear as we traverse the list.

 Insert and delete involve additional cost proportional to the level of the node being
inserted or deleted.
