V Unit
TERM 2014-15
Data on External Storage
– Reading several consecutive pages is much cheaper than reading them in
random order
• Architecture: Buffer manager stages pages from external storage to main memory buffer
pool. File and index layers make calls to the buffer manager.
Many alternatives exist, each ideal for some situations, and not so good in others:
– Heap (random order) files: Suitable when typical access is a file scan retrieving
all records.
– Sorted Files: Best if records must be retrieved in some order, or only a `range’ of
records is needed.
– Indexes: Like sorted files, they speed up searches for a subset of records, based
on values in certain (“search key”) fields.
Index Classification
• Primary vs. secondary: If search key contains primary key, then called primary index.
• Clustered vs. unclustered: If order of data records is the same as, or `close to’, order of
data entries, then called clustered index.
• Suppose that Alternative (2) is used for data entries, and that the data records are stored in
a Heap file.
– To build clustered index, first sort the Heap file (with some free space on each
page for future inserts).
– Overflow pages may be needed for inserts. (Thus, order of data recs is `close to’,
but not identical to, the sort order.)
Indexes
• An index on a file speeds up selections on the search key fields for the index.
– Any subset of the fields of a relation can be the search key for an index on the
relation.
– Search key is not the same as key (minimal set of fields that uniquely identify a
record in a relation).
• An index contains a collection of data entries, and supports efficient retrieval of all data
entries k* with a given key value k.
– Given data entry k*, we can find record with key k in at most one disk I/O.
B+ Tree Indexes
❖ Leaf pages contain data entries, and are chained (prev & next)
Example B+ Trees
• Example B+ Tree
• Find 28*? 29*? All > 15* and < 30*
• Insert/delete: Find data entry in leaf, then change it. Need to adjust parent sometimes.
– And change sometimes bubbles up the tree
Hash-Based Indexes
– Examples of indexing techniques: B+ trees, hash-based structures
– Typically, index contains auxiliary information that directs searches to the desired
data entries
• Alternative 1: <Data record with key value k>
– If this is used, index structure is a file organization for data records (instead of a
Heap file or sorted file).
– At most one index on a given collection of data records can use Alternative 1.
(Otherwise, data records are duplicated, leading to redundant storage and potential
inconsistency.)
– If data records are very large, the number of pages containing data entries is high,
which typically implies that the auxiliary information in the index is also large.
• Alternatives for Data Entries (Contd.)
• Alternatives 2 (<k, rid of data record with search key value k>) and 3 (<k, list of rids of
data records with search key k> ):
– Data entries typically much smaller than data records. So, better than Alternative
1 with large data records, especially if search keys are small. (Portion of index
structure used to direct search, which depends on size of data entries, is much
smaller than with Alternative 1.)
– Alternative 3 more compact than Alternative 2, but leads to variable sized data
entries even if search keys are of fixed length.
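As a rough sketch of the contrast between the two alternatives (the rids, keys, and field names below are made up for illustration, with rid = (page id, slot number)):

```python
# Hypothetical records keyed by rid = (page id, slot number).
records = {
    ("p1", 0): {"name": "Madhu", "dept": "CS"},
    ("p1", 1): {"name": "Naveen", "dept": "CS"},
    ("p2", 0): {"name": "Durga", "dept": "EE"},
}

# Alternative 2: one <k, rid> entry per record; duplicate keys repeat.
alt2 = [(rec["dept"], rid) for rid, rec in records.items()]

# Alternative 3: one <k, rid-list> entry per key; more compact, but the
# entries are variable-sized even though the keys are fixed-length.
alt3 = {}
for rid, rec in records.items():
    alt3.setdefault(rec["dept"], []).append(rid)

print(sorted(alt2))
print(alt3)  # {'CS': [('p1', 0), ('p1', 1)], 'EE': [('p2', 0)]}
```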
Operations to Compare
Assumptions in Our Analysis
• Heap Files:
– Equality selection on key; exactly one match.
• Sorted Files:
– Files compacted after deletions.
• Indexes:
– Alt (2), (3): data entry size = 10% size of record
– Hash: No overflow buckets.
• 80% page occupancy => File size = 1.25 data size
– Tree: 67% occupancy (this is typical).
• Implies file size = 1.5 data size
• Scans:
– Leaf levels of a tree-index are chained.
– Index data-entries plus actual file scanned for unclustered indexes.
• Range searches:
– We use tree indexes to restrict the set of data records fetched, but ignore hash
indexes.
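A quick back-of-the-envelope check of the occupancy assumptions above, for a hypothetical file whose data alone would fill 1000 pages:

```python
# With 80% occupancy (hash) or 67% occupancy (tree), the stored file is
# larger than the raw data by the reciprocal of the occupancy.
data_pages = 1000

hash_pages = data_pages / 0.80   # hash: 80% page occupancy
tree_pages = data_pages / 0.67   # tree: 67% page occupancy

print(hash_pages / data_pages)              # 1.25 -> file size = 1.25 x data size
print(round(tree_pages / data_pages, 2))    # ~1.49, i.e. roughly 1.5 x data size
```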
Cost of Operations
– The type of update (INSERT/DELETE/UPDATE), and the attributes that are
affected.
Choice of Indexes
Indexes with Composite Search Keys
Index-Only Plans
• A number of queries can be answered without retrieving any tuples from one or more of
the relations involved if a suitable index is available.
<E.dno>
SELECT E.dno, MIN(E.sal) FROM Emp E GROUP BY E.dno
SELECT AVG(E.sal) FROM Emp E WHERE E.age=25 AND E.sal BETWEEN 3000 AND 5000
Summary
Introduction
Range Searches
ISAM
Comments on ISAM
• File creation: Leaf (data) pages allocated sequentially, sorted by search key;
then index pages allocated, then space for overflow pages.
• Index entries: <search key value, page id>; they `direct’ search for data entries, which
are in leaf pages.
• Search: Start at root; use key comparisons to go to leaf. Cost = log_F N, where
F = number of entries per index page and N = number of leaf pages.
• Insert: Find leaf data entry belongs to, and put it there.
• Delete: Find and remove from leaf; if empty overflow page, de-allocate.
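The log_F N cost can be sketched directly: each index level multiplies the number of reachable leaves by the fan-out F, so about log_F N index pages are read on the way down (the F and N values below are made up):

```python
# Minimal sketch of the ISAM search cost: count how many index levels
# are needed before fan-out F covers all N leaf pages.
def isam_levels(fanout, leaf_pages):
    levels, reachable = 0, 1
    while reachable < leaf_pages:
        reachable *= fanout   # one more level multiplies reach by F
        levels += 1
    return levels

print(isam_levels(100, 1_000_000))  # 3 index-level reads to hit a leaf
```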
• Example ISAM Tree
B+ Tree: Most Widely Used Index
• Example B+ Tree
• Search begins at root, and key comparisons direct it to a leaf (as in ISAM).
• Search for 5*, 15*, all data entries >= 24* ...
B+ Trees in Practice
• Observe how minimum occupancy is guaranteed in both leaf and index pg splits.
• Note difference between copy-up and push-up; be sure you understand the reasons for this.
Example B+ Tree After Inserting 8*
Deleting a Data Entry from a B+ Tree
• Try to re-distribute, borrowing from sibling (adjacent node with same
parent as L).
• If re-distribution fails, merge L and sibling.
• If merge occurred, must delete entry (pointing to L or sibling) from parent of L.
• Merge could propagate to root, decreasing height.
Example Tree After (Inserting 8*, Then) Deleting 19* and 20* ...
• Must merge.
• Observe `toss’ of index entry (on right), and `pull down’ of index entry (below).
File Organization
o A file is a collection of records that can be accessed using the primary key. The type and
frequency of access supported are determined by the file organization used for a given set
of records.
o File organization is a logical relationship among various records. This method defines how
file records are mapped onto disk blocks.
o File organization is used to describe the way in which the records are stored in terms of
blocks, and the blocks are placed on the storage medium.
o The first approach to mapping the database to files is to use several files, each storing
only fixed-length records of a single type. An alternative approach is to structure our files
so that they can accommodate records of multiple lengths.
o Files of fixed length records are easier to implement than the files of variable length
records.
1. Registers
• Located within the CPU.
• Smallest and fastest type of storage.
• Used to hold data currently being processed.
3. Main Memory (RAM)
• Data that's actively being used or processed is loaded here.
• Faster than secondary storage.
• Volatile in nature (i.e., data is lost when power is turned off).
7. Magnetic Tapes
• Sequential access storage, unlike disks which are random access.
• Often used for backups and archiving due to their high capacity and low cost.
• Much slower access times compared to magnetic disks.
Types of Storage:
1. Primary Storage: Includes registers, cache memory, and main memory (RAM). It's the
main memory where the operating system, application programs, and data in current use
are kept for quick access by the computer's processor.
2. Secondary Storage: Encompasses data storage devices like HDDs, SSDs, CDs, and USB
drives. It is non-volatile and retains data even when the computer is turned off.
3. Tertiary Storage or Off-line Storage: Often involves magnetic tape systems or optical
disk archives. This is slower than secondary storage and is used for data archiving and
backup.
4. Quaternary Storage: Refers to methods like cloud storage or other remote storage
techniques where data is stored in remote servers and is fetched over the internet or other
networks.
Note: The storage hierarchy in a DBMS serves as a framework for balancing the trade-offs
between cost, speed, and capacity. By strategically using different levels of this hierarchy,
a DBMS can ensure fast access times for frequently used data while also providing large-capacity
storage for data that's accessed less frequently.
File Organization
File organization refers to the arrangement of data on storage devices. The method
chosen can have a profound effect on the efficiency of various database operations.
Common methods of file organization include:
• Ordered Records: Records in a sequential file are stored based on a key field.
• Continuous Memory Allocation: The records are stored in contiguous memory
locations.
• No Direct Access: To access a record, you have to traverse from the first record until
you find the desired one.
• Simplicity: The design and logic behind sequential file organization are
straightforward.
• Efficient for Batch Processing: Since records are stored in sequence, sequential
processing (like batch updates) can be done efficiently.
• Inefficient for Random Access: If you need a specific record, you may have to go
through many records before finding the desired one. This makes random access
slow.
• Insertion and Deletion: Inserting or deleting a record (other than at the end) can be
time-consuming since you may need to shift records to maintain the order.
• Redundancy Issues: There's a risk of redundancy if checks are not made before
inserting records. For example, a record with the same key might get added twice if
not checked.
Practical Application: Suppose you have a file of students ordered by their roll
number:
Roll No | Name
--------|--------
1 | Madhu
2 | Naveen
4 | Shivaji
5 | Durga
In a sequential file, if you wanted to add a student with roll number 6, you would append
them at the end. However, if you wanted to insert a student with a roll number 3 which is
between 2 and 4, you would need to shift all subsequent records to maintain the
sequence, which can be time-consuming.
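A minimal sketch of that shifting cost, using the roll numbers from the table above ("Kiran" is a made-up name, not from the table):

```python
import bisect

# Sorted "file" of roll numbers with a parallel list of names.
roll_nos = [1, 2, 4, 5]
names = ["Madhu", "Naveen", "Shivaji", "Durga"]

pos = bisect.bisect_left(roll_nos, 3)  # slot that preserves sorted order
roll_nos.insert(pos, 3)                # O(n): every later record shifts right
names.insert(pos, "Kiran")

print(roll_nos)  # [1, 2, 3, 4, 5]
```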
• Hash Function: A hash function converts a record's key value into an address.
• Buckets: A bucket typically stores one or more records. A hash function might map
multiple keys to the same bucket.
• No Ordering of Records: Records are not stored in any specific logical order.
• Rapid Access: If the hash function is efficient and there's minimal collision, the
retrieval of a record is very quick.
• Uniform Distribution: A good hash function will distribute records uniformly across
all buckets.
• Collisions: A collision occurs when two different keys hash to the same bucket.
Handling collisions can be tricky and might affect access time.
directs you to a particular bucket. If two books' ISBNs hash to the same value, you handle
that collision, maybe by placing the new record in a linked list associated with that bucket.
• Primary Data File: The actual database file where records are stored.
• Index: An auxiliary file that contains key values and pointers to the corresponding
records in the data file.
• Multi-level Index: Sometimes, if the index becomes large, a secondary (or even tertiary)
index can be created on the primary index to expedite searching further.
• Quick Random Access: Direct access to records is possible using the index.
• Ordered Access: If the primary file is ordered, then indexed file organization can
support efficient sequential access too.
• Overhead of Maintaining Index: Every time a record is added, deleted, or updated, the index also
needs to be updated. This can introduce overhead.
• Space Overhead: Indexes consume additional storage space.
• Complexity: Maintaining multiple levels of indexes can introduce complexity in terms of design
and implementation.
Practical Application: Consider a database that holds information about students,
where each student has a unique student ID. The main file would contain detailed records for each
student. A separate index file would contain student IDs and pointers to the location of the detailed
records in the main file. When you want to find a specific student's details, you first search the
index to find the pointer and then use that pointer to fetch the record directly from the main file.
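The two-step lookup just described can be sketched as follows (the student ids and names are hypothetical):

```python
# Main file of records plus a separate index of <key, pointer> pairs.
main_file = [
    {"sid": 501, "name": "Abhi"},
    {"sid": 502, "name": "Bharath"},
    {"sid": 503, "name": "Chinni"},
]

# Index file: student id -> position of the record in the main file.
index = {rec["sid"]: pos for pos, rec in enumerate(main_file)}

def lookup(sid):
    pos = index.get(sid)     # step 1: search the index for the pointer
    if pos is None:
        return None
    return main_file[pos]    # step 2: follow the pointer directly

print(lookup(502)["name"])  # Bharath
```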
Note: Indexed file organization strikes a balance between the direct and sequential access of
records. While it offers faster search capabilities, there's a trade-off in terms of space and
maintenance overhead. It is ideal for databases where search operations are predominant, and
where the overhead of maintaining the index can be justified by the improved access times.
Indexing
Indexing involves creating an auxiliary structure (an index) to improve data retrieval times. Just
like the index in the back of a book, a database index provides pointers to the locations of records.
Structure of Index
We can create indices using some columns of the database.
|-------------|----------------|
| Search Key | Data Reference |
|-------------|----------------|
• The search key is the index’s first column; it contains a copy of the table’s candidate key or
primary key. The key values are stored in sorted order so that the related data can be accessed
quickly.
• The data reference is the index’s second column. It contains a group of pointers to the disk
block where the value of the corresponding key can be found.
Types of Indexes:
1. Single-level Index: A single index table that contains pointers to the actual data records.
2. Multi-level Index: An index of indexes. This hierarchical approach reduces the number of accesses
(disk I/O operations) required to find an entry.
3. Dense and Sparse Indexes:
o In a dense index, there's an index entry for every search key value in the database.
o In a sparse index, there are fewer index entries. One entry might point to several records.
4. Primary and Secondary Indexes:
o A primary index is an ordered file whose records are of fixed length with two fields. The
first field is the same as the primary key, and the second field is a pointer to the data block.
In a dense primary index, there's a one-to-one relationship between the number of entries
in the index and the number of records in the main file.
o A secondary index provides a secondary means of accessing data. For each secondary key
value, the index points to all the records with that key value.
5. Clustered vs. Non-clustered Index:
o In a clustered index, the rows of data in the table are stored on disk in the same order as
the index. There can only be one clustered index per table.
o In a non-clustered index, the order of rows does not match the index's order. You can have
multiple non-clustered indexes.
6. Bitmap Index: Used mainly for data warehousing setups, a bitmap index uses bit arrays (bitmaps)
and usually involves columns that have a limited number of distinct values.
7. B-trees and B+ trees: Balanced tree structures that ensure logarithmic access time. B+ trees are
particularly popular in DBMS for their efficiency in disk I/O operations.
Benefits of Indexing:
• Faster search and retrieval times for database operations.
Drawbacks of Indexing:
• Overhead for insert, update, and delete operations, as indexes need to be maintained.
• Additional storage requirements for the index structures.
A clustered index determines the physical order of data in a table. In other words, the order in
which the rows are stored on disk is the same as the order of the index key values. There can be
only one clustered index on a table, but the table can have multiple non-clustered (or secondary)
indexes.
Characteristics of a Clustered Index:
1. It dictates the physical storage order of the data in the table.
2. There can only be one clustered index per table.
3. It can be created on columns with non-unique values, but if it's on a non-unique column,
the DBMS will usually add a uniqueifier to make each entry unique.
4. Lookup based on the clustered index is fast because the desired data is directly located
without the need for additional lookups (unlike non-clustered indexes which require a
second lookup to fetch the data).
If we create a clustered index on `BookID`, the physical order of records would be rearranged
based on the ascending order of `BookID`.
Now, when you want to find a book based on its ID, the DBMS can quickly locate the data because
the data is stored in the order of the BookID.
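A minimal sketch of that reordering, assuming hypothetical rows (the titles are made up): creating the clustered index on `BookID` rewrites the rows in ascending key order.

```python
# Rows in arbitrary (heap) order before the clustered index exists.
books = [
    {"BookID": 3, "Title": "DBMS"},
    {"BookID": 1, "Title": "Networks"},
    {"BookID": 2, "Title": "OS"},
]

books.sort(key=lambda row: row["BookID"])  # physical order now follows the key

print([row["BookID"] for row in books])  # [1, 2, 3]
```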
• Fast Data Retrieval: Because the data is stored sequentially in the order of the index key,
range queries or ordered queries can be very efficient.
• Data Pages: With the data being stored sequentially, the number of data pages that need
to be read from the disk is minimized.
• No Additional Lookups: Once the key is located using the index, there's no need for
additional lookups to fetch the data, as it is stored right there.
A primary index is an ordered file whose records are of fixed length with two fields. The first field
is the same as the primary key of the data file, and the second field is a pointer to the data block
where that specific key can be found.
The primary index can be classified into two types:
1. Dense Index: In this, an index entry appears for every search key value in the data file.
2. Sparse (or Non-Dense) Index: Here, index records are created only for some of the search
key values. A sparse index reduces the size of the index file.
1. It is based on the primary key of the table.
2. The primary key must have unique values.
3. The index entries are sorted in the same order as the data file (hence often a clustered
index).
4. The index has one entry for each block in a data file, not for each record.
Assuming each block of our storage can hold two records, our blocks will be:
Sparse Primary Index:
In the dense primary index, there's an entry for every key value, whereas in the sparse primary
index, only the first key value of each block (the block's anchor record) gets an entry in the index.
In this way, sparse indexes are more compact but might require more sequential block reads if a
search key isn't an anchor for a block.
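The contrast can be sketched directly, assuming (as above) two records per block; the key values below are hypothetical:

```python
# Block number -> the sorted keys stored in that block.
blocks = [[10, 20], [30, 40], [50, 60]]

# Dense index: one entry per search key value.
dense = {key: b for b, block in enumerate(blocks) for key in block}

# Sparse index: one entry per block, keyed by its anchor (first) record.
sparse = {block[0]: b for b, block in enumerate(blocks)}

def sparse_lookup(key):
    # Find the last anchor <= key, then scan that block sequentially.
    anchor = max(a for a in sparse if a <= key)
    return sparse[anchor]

print(len(dense), len(sparse))  # 6 3 -- the sparse index is more compact
print(sparse_lookup(40))        # key 40 lives in block 1 (not an anchor)
```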
A secondary index provides a secondary means (other than the primary key) to access table data.
Unlike primary indexes, secondary indexes aren't necessarily unique and aren't based on the
primary key of a table. The records in a secondary index are stored separately from the data, and
each index entry points to the corresponding data record. A table can have multiple secondary
indexes.
Characteristics of a Secondary Index:
1. Provides an alternative path to access the data.
2. Can be either dense or sparse.
3. Allows for non-unique values.
4. Does not guarantee the order of records in the data file.
5. Unlike a primary (often clustered) index, a secondary index is typically a non-clustered
index. This means the physical order of rows in a table is not the same as the index order.
| Age | Record Pointer |
|-----|----------------|
| 21  | Record 3       |
| 22  | Record 2       |
| 23  | Record 4       |
Here, each age value has a direct pointer to the corresponding record.
If another student with an age of 22 is added:
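A sketch of how a non-unique secondary index might absorb that insert, using the rows shown above (record pointers are symbolic):

```python
# Record pointer -> age, mirroring the table above.
records = {"Record 2": 22, "Record 3": 21, "Record 4": 23}

# Build the secondary index: each age maps to all matching pointers.
secondary = {}
for ptr, age in records.items():
    secondary.setdefault(age, []).append(ptr)

# Another student aged 22 simply extends the existing entry for 22.
secondary.setdefault(22, []).append("Record 5")

print(secondary[22])  # ['Record 2', 'Record 5']
```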
Index Data Structures
One way to organize data entries is to hash the data entries on the search key. Another way is to
build a tree-like data structure that directs a search for data entries.
Hash-Based Indexing
In hash-based indexing, a hash function is used to convert a key into a hash code. This hash code
serves as an index where the value associated with that key is stored. The goal is to distribute the
keys uniformly across an array, so that access time is, on average, constant.
Let's break down some of these elements to further understand how hash-based indexing works in
practice:
Buckets
In hash-based indexing, the data space is divided into a fixed number of slots known as "buckets."
A bucket usually contains a single page (also known as a block), but it may have additional pages
linked in a chain if the primary page becomes full. This is known as overflow.
Hash Function
The hash function is a mapping function that takes the search key as an input and returns the bucket
number where the record should be located. Hash functions aim to distribute records uniformly
across buckets to minimize the number of collisions (two different keys hashing to the same
bucket).
Insert Operations
When a new record is inserted into the dataset, its search key is hashed to find the appropriate
bucket. If the primary page of the bucket is full, an additional overflow page is allocated and linked
to the primary page. The new record is then stored on this overflow page.
Search Operations
To find a record with a specific search key, the hash function is applied to the search key to identify
the bucket. All pages (primary and overflow) in that bucket are then examined to find the desired
record.
Limitations
Hash-based indexing is not suitable for range queries or when the search key is not known. In such
cases, a full scan of all pages is required, which is resource-intensive.
Hash Function: H(x) = ASCII value of first letter of the name mod 3
• Alice: 65 mod 3 = 2
• Bob: 66 mod 3 = 0
• Carol: 67 mod 3 = 1
Buckets:
Bucket 0: Bob
Bucket 1: Carol
Bucket 2: Alice
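The bucket assignment above can be implemented with chaining, so that a collision (for instance, a second name starting with 'A', which is a made-up addition) is absorbed into the same bucket:

```python
def h(name):
    return ord(name[0]) % 3  # H(x) = ASCII value of first letter mod 3

buckets = [[] for _ in range(3)]
for name in ["Alice", "Bob", "Carol"]:
    buckets[h(name)].append(name)

print(buckets)  # [['Bob'], ['Carol'], ['Alice']]

buckets[h("Adam")].append("Adam")  # 'Adam' collides with 'Alice' in bucket 2
print(buckets[2])  # ['Alice', 'Adam']
```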
Tree-based Indexing
The most commonly used tree-based index structure is the B-Tree, and its variations like B+ Trees
and B* Trees. In tree-based indexing, data is organized into a tree-like structure. Each node
represents a range of key values, and leaf nodes contain the actual data or pointers to the data.
• Sorted Data: They maintain data in sorted order, making it easier to perform range queries.
• Balanced Tree: B-Trees and their variants are balanced, meaning the path from the root
node to any leaf node is of the same length. This balancing ensures that data retrieval times
are consistently fast, even as the dataset grows.
• Multi-level Index: Tree-based indexes can be multi-level, which helps to minimize the
number of disk I/Os required to find an item.
• Dynamic Nature: B-Trees are dynamic, meaning they're good at inserting and deleting
records without requiring full reorganization.
• Versatility: They are useful for both exact-match and range queries.
| ID | Name    |
|----|---------|
| 1  | Abhi    |
| 2  | Bharath |
| 3  | Chinni  |
| 4  | Devid   |
[1, 3]
/ \
[1] [3, 4]
/ \ / \
1 2 3 4
In the tree, navigating from the root to the leaf nodes will lead us to the desired data record.
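Because the leaf level keeps keys sorted (and chained), a range query reduces to one descent plus a sequential scan. A minimal sketch over the leaf keys of the example tree:

```python
import bisect

leaf_keys = [1, 2, 3, 4]  # leaf-level keys of the example tree, in order

def range_query(lo, hi):
    start = bisect.bisect_left(leaf_keys, lo)  # descend to first key >= lo
    out = []
    for key in leaf_keys[start:]:              # scan the chained leaves
        if key > hi:
            break
        out.append(key)
    return out

print(range_query(2, 3))  # [2, 3]
```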