DBMS Unit-5 Notes
UNIT-5
STORAGE AND INDEXING
1. Data on External Storage
Disks:
Can retrieve random page at fixed cost
But reading several consecutive pages is much cheaper than
reading them in random order
Tapes:
Can only read pages in sequence
Cheaper than disks; used for archival storage
Indexing:
An index or database index is a data structure which is used to quickly locate and access the
data in a database table.
• The first column is the Search key that contains a copy of the primary key or candidate
key of the table. These values are stored in sorted order so that the corresponding data
can be accessed quickly (Note that the data may or may not be stored in sorted order).
• The second column is the Data Reference which contains a set of pointers holding the
address of the disk block where that particular key value can be found.
An index contains a collection of data entries, for which there are three alternatives: a data entry k* can be (1) the actual data record with search key value k, (2) a <k, rid> pair, or (3) a <k, rid-list> pair, where rid is the record id of a data record.
❖ Alternative 1:
▪ At most one index on a given collection of data records can use Alternative 1.
(Otherwise, data records are duplicated, leading to redundant storage and potential
inconsistency.)
▪ If data records are very large, the number of pages containing data entries is high.
This implies that the size of auxiliary information in the index is typically also large.
❖ Alternatives 2 and 3:
▪ Data entries typically much smaller than data records. So, better than Alternative
1 with large data records, especially if search keys are small. (Portion of index
structure used to direct search, which depends on size of data entries, is much
smaller than with Alternative 1.)
▪ Alternative 3 more compact than Alternative 2, but leads to variable sized data
entries even if search keys are of fixed length.
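As an illustration, the three alternatives can be sketched as Python values. This is a minimal sketch: the `Rid` type and the sample key "Smith" are hypothetical, not taken from any particular system.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Rid:
    """Record id: (page, slot) address of a record on disk (illustrative)."""
    page_id: int
    slot: int

# Alternative 1: the data entry is the actual data record itself.
alt1_entry = ("Smith", ("Smith", 22, 3.5))

# Alternative 2: <k, rid> -- a pointer to one matching record.
alt2_entry = ("Smith", Rid(page_id=4, slot=2))

# Alternative 3: <k, rid-list> -- compact when a key matches many records,
# but entries become variable-sized even for fixed-length keys.
alt3_entry = ("Smith", [Rid(4, 2), Rid(7, 0)])
```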
2. Types of Indexing:
• Primary Index − Primary index is defined on an ordered data file. The data file is ordered
on a key field. The key field is generally the primary key of the relation.
• Secondary Index − Secondary index may be generated from a field which is a candidate
key and has a unique value in every record, or a non-key with duplicate values.
• Clustering Index − Clustering index is defined on an ordered data file. The data file is
ordered on a non-key field. (or)
If the order of data records is the same as, or 'close to', the order of data entries, then the
index is called a clustered index. Alternative 1 implies clustered; in practice, clustered also
implies Alternative 1 (since sorted files are rare). A file can be clustered on at most one
search key. The cost of retrieving data records through an index varies greatly based on
whether the index is clustered or not!
Suppose that Alternative (2) is used for data entries, and that the data records are stored
in a Heap file.
▪ To build clustered index, first sort the Heap file (with some free space on each
page for future inserts).
▪ Overflow pages may be needed for inserts. (Thus, the order of data records is 'close
to', but not identical to, the sort order.)
Database Management Systems
Indexes can also be classified as:
o Dense Index
o Sparse Index
Dense Index
In a dense index, there is an index record for every search key value in the database. This
makes searching faster but requires more space to store the index records themselves. Index
records contain the search key value and a pointer to the actual record on the disk.
Sparse Index
In a sparse index, index records are not created for every search key. An index record here
contains a search key and an actual pointer to the data on the disk. To search a record, we first
follow the index record to reach the approximate location of the data. If the data we are
looking for is not where we arrive by following the index, the system starts a sequential
search until the desired data is found.
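The difference can be sketched in Python. This is a minimal model, assuming a file of sorted blocks; the block layout and key values are invented for illustration.

```python
from bisect import bisect_right

# The "file" is a list of disk blocks; each block holds keys in sorted order.
blocks = [[2, 5, 8], [12, 15, 18], [21, 25, 29]]

# Dense index: one entry per search-key value -> block number.
dense = {k: b for b, blk in enumerate(blocks) for k in blk}

# Sparse index: one entry per block, holding that block's first key.
sparse_keys = [blk[0] for blk in blocks]          # [2, 12, 21]

def sparse_lookup(key):
    # Find the last block whose first key <= key, then scan it sequentially.
    b = bisect_right(sparse_keys, key) - 1
    if b < 0:
        return None
    return (b, key) if key in blocks[b] else None

# Dense index answers directly; sparse index scans within one block.
assert dense[15] == 1
assert sparse_lookup(15) == (1, 15)
```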
Multilevel Index
Index records comprise search-key values and data pointers. Multilevel index is stored on the
disk along with the actual database files. As the size of the database grows, so does the size of
the indices. There is a strong need to keep the index records in main memory so as to speed
up the search operations. If a single-level index is used, then a large index cannot be kept in
memory, which leads to multiple disk accesses.
Indexes must be chosen to speed up important queries (and perhaps some updates!).
Related data and information are stored collectively in file formats. A file is a sequence of
records stored in binary format. A disk drive is formatted into several blocks that can store
records. File records are mapped onto those disk blocks.
File Organization
File Organization defines how file records are mapped onto disk blocks. We have four types of
File Organization to organize file records:
• Heap File Organization
• Sequential File Organization
• Hash File Organization
• Clustered File Organization
When a file is created using Heap File Organization, the Operating System allocates memory
area to that file without any further accounting details. File records can be placed anywhere in
that memory area. It is the responsibility of the software to manage the records. Heap File does
not support any ordering, sequencing, or indexing on its own.
Every file record contains a data field (attribute) to uniquely identify that record. In sequential
file organization, records are placed in the file in some sequential order based on the unique key
field or search key. Practically, it is not possible to store all the records sequentially in physical
form.
Hash File Organization uses Hash function computation on some fields of the records. The
output of the hash function determines the location of disk block where the records are to be
placed.
Clustered file organization is not considered good for large databases. In this mechanism,
related records from one or more relations are kept in the same disk block, that is, the ordering
of records is not based on primary key or search key.
File Operations
• Update Operations
• Retrieval Operations
Update operations change the data values by insertion, deletion, or update. Retrieval operations,
on the other hand, do not alter the data but retrieve them after optional conditional filtering. In
both types of operations, selection plays a significant role. Other than creation and deletion of a
file, there could be several operations, which can be done on files.
• Open − A file can be opened in one of the two modes, read mode or write mode. In
read mode, the operating system does not allow anyone to alter data. In other words,
data is read only. Files opened in read mode can be shared among several entities. Write
mode allows data modification. Files opened in write mode can be read but cannot be
shared.
• Locate − Every file has a file pointer, which tells the current position where the data is
to be read or written. This pointer can be adjusted accordingly. Using find (seek)
operation, it can be moved forward or backward.
• Read − By default, when files are opened in read mode, the file pointer points to the
beginning of the file. There are options where the user can tell the operating system
where to locate the file pointer at the time of opening a file. The very next data to the
file pointer is read.
• Write − User can select to open a file in write mode, which enables them to edit its
contents. The modification can be deletion, insertion, or an update. The file pointer can be
positioned at the time of opening or can be dynamically changed if the operating system
allows it.
• Close − This is the most important operation from the operating system’s point of view.
When a request to close a file is generated, the operating system
o removes all the locks (if in shared mode),
o saves the data (if altered) to the secondary storage media, and
o releases all the buffers and file handlers associated with the file.
The organization of data inside a file plays a major role here. The process of locating the file
pointer at a desired record inside a file varies based on whether the records are arranged
sequentially or clustered.
1. Hash-based Indexing: For a huge database structure, it can be nearly impossible to
search all the index values through all its levels and then reach the destination data
block to retrieve the desired data. Hashing is an effective technique to calculate the
direct location of a data record on the disk without using an index structure.
Hashing uses hash functions with search keys as parameters to generate the address of a data
record.
• Bucket − A hash file stores data in bucket format. Bucket is considered a unit of
storage. A bucket typically stores one complete disk block, which in turn can store one
or more records.
• Hash Function − A hash function, h, is a mapping function that maps all the set of
search-keys K to the address where actual records are placed. It is a function from
search keys to bucket addresses.
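A minimal sketch of this mapping, assuming a mod-based hash function and four buckets; the key values are invented for illustration.

```python
# A mod-based hash function mapping search keys to bucket addresses.
NUM_BUCKETS = 4
buckets = [[] for _ in range(NUM_BUCKETS)]

def h(key):
    # Hash function: search key -> bucket address.
    return key % NUM_BUCKETS

for key in [10, 15, 20, 27, 33]:
    buckets[h(key)].append(key)

# 10 % 4 = 2, 15 % 4 = 3, 20 % 4 = 0, 27 % 4 = 3, 33 % 4 = 1
assert buckets == [[20], [33], [10], [15, 27]]
```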
The lowest level of the tree, called the leaf level, contains the data entries. This allows
us to efficiently locate all data entries with search key values in a desired range. All
searches begin at the topmost node, called the root, and the contents of pages in non-leaf
levels direct searches to the correct leaf page. Non-leaf pages contain node pointers
separated by search key values. The pointer to the left of a key value k points to the
subtree that contains only data entries less than k. The node pointer to the right of a key
value k points to a subtree that contains only data entries greater than or equal to k.
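This routing rule can be sketched as follows. The dict-based nodes and the two-level tree over a handful of sample keys are hypothetical, for illustration only.

```python
from bisect import bisect_right

def child_index(node_keys, search_key):
    # Number of keys <= search_key selects the subtree to follow:
    # pointer left of k covers entries < k, pointer right of k covers >= k.
    return bisect_right(node_keys, search_key)

def search(tree, key):
    node = tree["root"]
    while not node["leaf"]:
        node = node["children"][child_index(node["keys"], key)]
    return key in node["keys"]

# Tiny two-level tree over the leaf entries 10 ... 40.
tree = {"root": {
    "leaf": False, "keys": [20, 33],
    "children": [
        {"leaf": True, "keys": [10, 15]},
        {"leaf": True, "keys": [20, 27]},
        {"leaf": True, "keys": [33, 37, 40]},
    ]}}

assert search(tree, 27) and not search(tree, 28)
```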
Measuring the number of page I/Os ignores the gains of pre-fetching a sequence of pages;
thus, even I/O cost is only approximated. This is an average-case analysis, based on several
simplistic assumptions.
There are two index data structures, called ISAM and B+ trees, based on tree organizations. These
structures provide efficient support for range searches, including sorted file scans as a special case.
Unlike sorted files, these index structures support efficient insertion and deletion. They also
provide support for equality selections, although they are not as efficient in this case as hash-
based indexes.
Tree-structured indexing techniques support both range searches and equality searches:
ISAM: static structure;
B+ tree: dynamic, adjusts gracefully under inserts and deletes.
Problems with concurrent access to ISAM files were solved with the addition of a client-server
framework which marshals client requests and maintains ordering. This is the basic concept
behind a database management system (DBMS), which is a client layer over the underlying
data store.
The indexed access method of reading or writing data only provides the desired outcome if in
fact the file is organized as an ISAM file with the appropriate, previously defined keys. Access
to data via the previously defined key(s) is extremely fast. Multiple keys, overlapping keys and
key compression within the hash tables are supported.
Example:
Index entries: <the search key value, block/page id>
they direct search for data entries in leaves.
Example where each node can hold 2 entries; 10* 15* 20* 27* 33* 37* 40* 46* 51* 55* 63*
97*
(* represents key entry)
8. B+ Tree:
B Trees: B Trees are multi-way trees; that is, each node contains a set of keys and pointers. A
B Tree with four keys and five pointers represents the minimum size of a B Tree node. A B
Tree contains only data pages.
B Trees are dynamic. That is, the height of the tree grows and contracts as records are added
and deleted.
B+ Trees A B+ Tree combines features of ISAM and B Trees. It contains index pages and data
pages. The data pages always appear as leaf nodes in the tree. The root node and intermediate
nodes are always index pages. These features are similar to ISAM. Unlike ISAM, overflow
pages are not used in B+ trees.
The index pages in a B+ tree are constructed through the process of inserting and deleting
records. Thus, B+ trees grow and contract like their B Tree counterparts. The contents and the
number of index pages reflects this growth and shrinkage.
B+ Trees and B Trees use a "fill factor" to control the growth and the shrinkage. A 50% fill
factor would be the minimum for any B+ or B tree. As our example we use the smallest page
structure. This means that our B+ tree conforms to the following guidelines.
Number of keys/page: 4
Number of pointers/page: 5
Fill factor: 50%
Minimum keys in each page: 2
As this table indicates, each page must have a minimum of two keys. The root page may violate
this rule. The following table shows a B+ tree. As the example illustrates this tree does not have
a full index page. (We have room for one more key and pointer in the root page.) In addition,
one of the data pages contains empty slots.
The key value determines a record's placement in a B+ tree. The leaf pages are maintained in
sequential order AND a doubly linked list (not shown) connects each leaf page with its sibling
page(s). This doubly linked list speeds data movement as the pages grow and contract.
We must consider three scenarios when we add a record to a B+ tree. Each scenario causes a
different action in the insert algorithm. The scenarios are:
Adding a record when the leaf page is full but the index page is not
Next, we're going to insert a record with a key value of 70 into our B+ tree. This record should
go in the leaf page containing 50, 55, 60, and 65. Unfortunately this page is full. This means
that we must split the page as follows:
Left leaf page: 50 55        Right leaf page: 60 65 70
The middle key of 60 is placed in the index page between 50 and 75.
The following table shows the B+ tree after the addition of 70.
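The split step above can be sketched as a small helper. This is hypothetical illustration code; real systems split pages of a fixed byte size, not Python lists.

```python
def split_leaf(keys, new_key):
    """Split a full leaf page on insert; return (left, right, promoted key)."""
    keys = sorted(keys + [new_key])
    mid = len(keys) // 2
    left, right = keys[:mid], keys[mid:]
    # The first key of the right page is copied up into the index page.
    return left, right, right[0]

# A full leaf [50, 55, 60, 65] receives 70 and splits.
left, right, up = split_leaf([50, 55, 60, 65], 70)
assert left == [50, 55]
assert right == [60, 65, 70]
assert up == 60   # 60 is placed in the index page between 50 and 75
```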
Adding a record when both the leaf page and the index page are full
As our last example, we're going to add a record containing a key value of 95 to our B+ tree.
This record belongs in the page containing 75, 80, 85, and 90. Since this page is full we split it
into two pages:
Left leaf page: 75 80        Right leaf page: 85 90 95
The middle key, 85, rises to the index page. Unfortunately, the index page is also full, so we
split the index page:
Left index page: 25 50        Right index page: 75 85        New root: 60
The following table illustrates the addition of the record containing 95 to the B+ tree.
Deletion in B+ Tree:
Example (figures omitted): first deleting 19* and 20*, then deleting 24*.
We must consider three scenarios when we delete a record from a B+ tree. Each scenario causes
a different action in the delete algorithm. For example, when the leaf page falls below the fill
factor (YES) but the index page does not (NO), we combine the leaf page and its sibling and
change the index page to reflect the change, either by deleting the parent or merging the
parent-level indexes.
Hash Organization
Static Hashing
In static hashing, when a search-key value is provided, the hash function always computes the
same address. For example, if a mod-4 hash function is used, then it generates only four
values (0 through 3).
The output address is always the same for that function. The number of buckets provided
remains unchanged at all times.
Operation
• Insertion − When a record is required to be entered using static hash, the hash
function h computes the bucket address for search key K, where the record will be
stored.
• Search − When a record needs to be retrieved, the same hash function can be used to
retrieve the address of the bucket where the data is stored.
Bucket Overflow
The condition of bucket-overflow is known as collision. A static hash function cannot grow
the number of buckets to avoid it; in this case, overflow chaining can be used.
• Overflow Chaining − When buckets are full, a new bucket is allocated for the same hash
result and is linked after the previous one. This mechanism is called Closed Hashing.
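A minimal sketch of overflow chaining, assuming each bucket holds at most two records; the capacity and key values are invented for illustration.

```python
CAPACITY = 2      # records per bucket (assumed for the example)
NUM_BUCKETS = 4

class Bucket:
    def __init__(self):
        self.records = []
        self.overflow = None   # link to the chained overflow bucket

table = [Bucket() for _ in range(NUM_BUCKETS)]

def insert(key):
    b = table[key % NUM_BUCKETS]
    while len(b.records) == CAPACITY:   # walk the chain to a non-full bucket
        if b.overflow is None:
            b.overflow = Bucket()       # allocate and link an overflow bucket
        b = b.overflow
    b.records.append(key)

def search(key):
    b = table[key % NUM_BUCKETS]
    while b is not None:                # follow the overflow chain
        if key in b.records:
            return True
        b = b.overflow
    return False

for k in [4, 8, 12, 16]:   # all hash to bucket 0, so a chain forms
    insert(k)
assert search(16) and not search(20)
```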
The problem with static hashing is that it does not expand or shrink dynamically as the size of
the database grows or shrinks. Dynamic hashing provides a mechanism in which data buckets
are added and removed dynamically and on-demand. Dynamic hashing is also known
as extended hashing.
Hash function, in dynamic hashing, is made to produce a large number of values and only a few
are used initially.
The prefix of the entire hash value is taken as a hash index; only a portion of the hash value is
used for computing bucket addresses. Every hash index has a depth value to signify how many
bits are used for computing bucket addresses. These n bits can address 2^n buckets. When all
these buckets are full, the depth value is increased by one and twice as many buckets are
allocated.
Hashing is not favorable when the data is organized in some ordering and the queries require a
range of data. When data is discrete and random, hash performs the best.
Computing a hash function is more complex than the comparisons used in indexing, but each
hash operation completes in constant time.
• The hash function generates values over a large range, typically b-bit integers, with
b = 32.
• At any time, only a prefix of the hash value is used to index into a table of bucket
addresses.
• Let the length of the prefix be i bits, 0 ≤ i ≤ 32. The bucket address table size is 2^i;
initially i = 0. The value of i grows and shrinks as the size of the database grows and
shrinks.
• Multiple entries in the bucket address table may point to the same bucket; thus, the
actual number of buckets is at most 2^i. The number of buckets also changes
dynamically due to merging and splitting of buckets.
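The prefix computation can be sketched as follows. This is a simplified model; a real implementation also maintains per-bucket local depths to drive merging and splitting.

```python
def bucket_address(hash_value, i, b=32):
    """Return the i-bit prefix of a b-bit hash value (indexes a 2^i table)."""
    if i == 0:
        return 0
    return hash_value >> (b - i)

h = 0b1011_0000_0000_0000_0000_0000_0000_0000  # a 32-bit hash value

assert bucket_address(h, 1) == 0b1      # table has 2^1 = 2 entries
assert bucket_address(h, 3) == 0b101    # table has 2^3 = 8 entries
# Growing i from 3 to 4 doubles the table to 2^4 = 16 entries.
assert bucket_address(h, 4) == 0b1011
```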
Linear Hashing:
LH handles the problem of long overflow chains without using a directory, and handles
duplicates.
Idea: Use a family of hash functions h0, h1, h2, ...
• h_i(key) = h(key) mod (2^i · N), where N is the initial number of buckets.
• h is some hash function (its range is 0 to 2^MachineBitLength − 1).
• If N = 2^d0 for some d0, then h_i consists of applying h and looking at the last d_i
bits, where d_i = d0 + i.
• h_{i+1} doubles the range of h_i (similar to directory doubling).
The directory is avoided in LH by using overflow pages and by choosing the bucket to split
in round-robin order.
• Splitting proceeds in 'rounds'. A round ends when all N_R initial (for round R) buckets
have been split.
• Buckets 0 to Next−1 have been split; buckets Next to N_R have yet to be split.
• The current round number is called Level.
The Linear Hashing scheme has m initial buckets labelled 0 through m−1, an initial hashing
function h0(k) = f(k) % m that is used to map any key k into one of the m buckets (for simplicity
assume h0(k) = k % m), and a pointer p which points to the bucket to be split next whenever an
overflow page is generated (initially p = 0). An example is shown in Figure 1. Bucket Split:
When the first overflow occurs (it can occur in any bucket), bucket 0, which is pointed to by p, is
split (rehashed) into two buckets: the original bucket 0 and a new bucket m. A new empty page
is also added in the overflown bucket to accommodate the overflow. The search values originally
mapped into bucket 0 (by the hash function h0) are now distributed between buckets 0 and m
using a new hashing function h1.
As an example, Figure 2 shows the layout of the Linear Hashing of Figure 1 after inserting a
new record with key 11. The circled records are the existing records that are moved to the new
bucket. In more detail, the record is inserted into bucket 11 % 4 = 3. The bucket overflows and
an overflow page is introduced to accommodate the new record. Bucket 0 is split and the records
originally in bucket 0 are distributed between bucket 0 and bucket 4, using a new hash function
h1(k) = k % 8. The next bucket overflow, such as one triggered by inserting two records into
bucket 2 or four records into bucket 3 in Figure 2, will cause a new split that attaches a new
bucket m+1, and the contents of bucket 1 will be distributed using h1 between buckets 1 and
m+1. A crucial property of h1 is that search values that were originally mapped by h0 to some
bucket j must be remapped either to bucket j or to bucket j+m. This is a necessary property for
Linear Hashing to work. An example of such a hashing function is h1(k) = k % 2m. Further
bucket overflows will cause additional bucket splits in a linear bucket-number order (increasing
p by one for every split).
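The round of splits described above can be simulated in simplified form. A bucket capacity of two records is assumed for illustration, and the split is triggered as soon as a bucket overflows; real linear hashing pages hold more records and use actual overflow pages.

```python
CAPACITY = 2        # records per bucket page (assumed for the example)
m, p = 4, 0         # m initial buckets, split pointer p
buckets = {i: [] for i in range(m)}

def insert(k):
    global p
    b = k % m                    # h0(k) = k % m
    if b < p:                    # bucket already split this round:
        b = k % (2 * m)          # use h1(k) = k % 2m instead
    overflow = len(buckets[b]) >= CAPACITY
    buckets[b].append(k)         # an overflow page absorbs the extra record
    if overflow:                 # any overflow splits bucket p (round-robin)
        old = buckets[p]
        buckets[p] = [x for x in old if x % (2 * m) == p]
        buckets[p + m] = [x for x in old if x % (2 * m) == p + m]
        p += 1

for k in [0, 4, 1, 5, 2, 3, 7, 11]:   # inserting 11 overflows bucket 3
    insert(k)

# Bucket 0 was split: 0 stays (0 % 8 == 0), 4 moves to the new bucket 4.
assert buckets[0] == [0] and buckets[4] == [4]
assert p == 1
```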