
DATABASE MANAGEMENT SYSTEMS

TERM 2014-15

B. Tech II/CSE II Semester

Text Books: (1) DBMS by Raghu Ramakrishnan

(2) DBMS by Sudarshan and Korth

INDEX

S.No.   Module (as per session planner)
-------------------------------------------------------------------------------------------------

1. Data on external storage & File organization and indexing

2. Index data structures

3. Comparison of file organizations

4. Comparison of file organizations

5. Indexes and performance tuning

6. Indexes and performance tuning

7. Intuition for tree indexes & ISAM

8. B+ tree

Data on External Storage

• Disks: Can retrieve random page at fixed cost

– But reading several consecutive pages is much cheaper than reading them in
random order

• Tapes: Can only read pages in sequence

– Cheaper than disks; used for archival storage

• File organization: Method of arranging a file of records on external storage.

– Record id (rid) is sufficient to physically locate a record


– Indexes are data structures that allow us to find the record ids of records with given
values in index search key fields

• Architecture: Buffer manager stages pages from external storage to main memory buffer
pool. File and index layers make calls to the buffer manager.

Alternative File Organizations

Many alternatives exist, each ideal for some situations, and not so good in others:

– Heap (random order) files: Suitable when typical access is a file scan retrieving
all records.

– Sorted Files: Best if records must be retrieved in some order, or only a `range’ of
records is needed.

– Indexes: Data structures to organize records via trees or hashing.

• Like sorted files, they speed up searches for a subset of records, based on
values in certain (“search key”) fields

• Updates are much faster than in sorted files.

Index Classification

• Primary vs. secondary: If search key contains primary key, then called primary index.

– Unique index: Search key contains a candidate key.

• Clustered vs. unclustered: If the order of data records is the same as, or `close to’, the order of
data entries, then it is called a clustered index.

– Alternative 1 implies clustered; in practice, clustered also implies Alternative 1
(since sorted files are rare).
– A file can be clustered on at most one search key.
– The cost of retrieving data records through an index varies greatly based on whether the index
is clustered or not!

Clustered vs. Unclustered Index

• Suppose that Alternative (2) is used for data entries, and that the data records are stored in
a Heap file.

– To build a clustered index, first sort the Heap file (with some free space on each
page for future inserts).
– Overflow pages may be needed for later inserts. (Thus, the order of data records is `close to’,
but not identical to, the sort order.)

Indexes

• An index on a file speeds up selections on the search key fields for the index.
– Any subset of the fields of a relation can be the search key for an index on the
relation.
– Search key is not the same as key (minimal set of fields that uniquely identify a
record in a relation).
• An index contains a collection of data entries, and supports efficient retrieval of all data
entries k* with a given key value k.
– Given data entry k*, we can find record with key k in at most one disk I/O.

B+ Tree Indexes

❖ Leaf pages contain data entries, and are chained (prev & next)

❖ Non-leaf pages have index entries; only used to direct searches:

Example B+ Trees

• Example B+ Tree
• Find 28*? 29*? All > 15* and < 30*
• Insert/delete: Find data entry in leaf, then change it. Need to adjust parent sometimes.
– And the change sometimes bubbles up the tree.
Hash-Based Indexes

• Good for equality selections.


• Index is a collection of buckets.
– Bucket = primary page plus zero or more overflow pages.
– Buckets contain data entries.
• Hashing function h: h(r) = bucket in which (data entry for) record r belongs. h looks at
the search key fields of r.
– No need for “index entries” in this scheme.

Alternatives for Data Entry k* in Index

• In a data entry k* we can store:


– Data record with key value k, or
– <k, rid of data record with search key value k>, or
– <k, list of rids of data records with search key k>
• Choice of alternative for data entries is orthogonal to the indexing technique used to locate
data entries with a given key value k.

– Examples of indexing techniques: B+ trees, hash-based structures
– Typically, index contains auxiliary information that directs searches to the desired
data entries
• Alternative 1:(Data record with key value k)
– If this is used, index structure is a file organization for data records (instead of a
Heap file or sorted file).
– At most one index on a given collection of data records can use Alternative 1.
(Otherwise, data records are duplicated, leading to redundant storage and potential
inconsistency.)
– If data records are very large, the number of pages containing data entries is also high, which
typically implies that the auxiliary information in the index is large as well.
Alternatives for Data Entries (Contd.)
• Alternatives 2 (<k, rid of data record with search key value k>) and 3 (<k, list of rids of
data records with search key k> ):
– Data entries typically much smaller than data records. So, better than Alternative
1 with large data records, especially if search keys are small. (Portion of index
structure used to direct search, which depends on size of data entries, is much
smaller than with Alternative 1.)
– Alternative 3 more compact than Alternative 2, but leads to variable sized data
entries even if search keys are of fixed length.

Cost Model for Our Analysis

We ignore CPU costs, for simplicity:


– B: The number of data pages
– R: Number of records per page
– D: (Average) time to read or write disk page
– Measuring number of page I/O’s ignores gains of pre-fetching a sequence of pages;
thus, even I/O cost is only approximated.
– Average-case analysis; based on several simplistic assumptions.

Comparing File Organizations

• Heap files (random order; insert at eof)


• Sorted files, sorted on <age, sal>
• Clustered B+ tree file, Alternative (1), search key <age, sal>
• Heap file with unclustered B+ tree index on search key <age, sal>
• Heap file with unclustered hash index on search key <age, sal>

Operations to Compare

• Scan: Fetch all records from disk


• Equality search
• Range selection
• Insert a record
• Delete a record

Assumptions in Our Analysis

• Heap Files:
– Equality selection on key; exactly one match.
• Sorted Files:
– Files compacted after deletions.
• Indexes:
– Alt (2), (3): data entry size = 10% size of record
– Hash: No overflow buckets.
• 80% page occupancy => File size = 1.25 data size
– Tree: 67% occupancy (this is typical).
• Implies file size = 1.5 data size
• Scans:
– Leaf levels of a tree-index are chained.
– Index data-entries plus actual file scanned for unclustered indexes.
• Range searches:
– We use tree indexes to restrict the set of data records fetched, but ignore hash
indexes.

Cost of Operations
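The detailed cost table itself is not included here; as a rough stand-in, the sketch below (plain Python) computes approximate Scan and Equality Search I/O costs for three of the five organizations, using the assumptions just listed. The fanout F and the example numbers are illustrative assumptions, not figures from the slides.

from math import log

def approx_costs(B, D, F=100):
    # B = number of data pages, D = time per page I/O, F = assumed tree fanout.
    # (R, the records per page, matters only for the unclustered-index rows,
    # which are not sketched here.)
    return {
        "heap file":         {"scan": B * D,       "equality": 0.5 * B * D},          # on average, scan half the file
        "sorted file":       {"scan": B * D,       "equality": D * log(B, 2)},        # binary search
        "clustered B+ tree": {"scan": 1.5 * B * D, "equality": D * log(1.5 * B, F)},  # 67% occupancy => 1.5B pages
    }

# Example: B = 100,000 data pages and D = 15 ms per page I/O
for org, cost in approx_costs(100_000, 0.015).items():
    print(org, {op: round(t, 2) for op, t in cost.items()})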

Understanding the Workload


• For each query in the workload:
– Which relations does it access?
– Which attributes are retrieved?
– Which attributes are involved in selection/join conditions? How selective are these
conditions likely to be?
• For each update in the workload:
– Which attributes are involved in selection/join conditions? How selective are these
conditions likely to be?

– The type of update (INSERT/DELETE/UPDATE), and the attributes that are
affected.

Choice of Indexes

• What indexes should we create?


– Which relations should have indexes? What field(s) should be the search key?
Should we build several indexes?
• For each index, what kind of an index should it be?
– Clustered? Hash/tree?
• One approach: Consider the most important queries in turn. Consider the best plan using
the current indexes, and see if a better plan is possible with an additional index. If so,
create it.
– Obviously, this implies that we must understand how a DBMS evaluates queries
and creates query evaluation plans!
– For now, we discuss simple 1-table queries.
• Before creating an index, must also consider the impact on updates in the workload!
– Trade-off: Indexes can make queries go faster, updates slower. Require disk space,
too.

Index Selection Guidelines

• Attributes in WHERE clause are candidates for index keys.


– Exact match condition suggests hash index.
– Range query suggests tree index.
• Clustering is especially useful for range queries; can also help on equality
queries if there are many duplicates.
• Multi-attribute search keys should be considered when a WHERE clause contains several
conditions.
– Order of attributes is important for range queries.
– Such indexes can sometimes enable index-only strategies for important queries.
• For index-only strategies, clustering is not important!
• Examples of Clustered Indexes
• B+ tree index on E.age can be used to get qualifying tuples.
– How selective is the condition?
– Is the index clustered?
SELECT E.dno FROM Emp E WHERE E.age>40

• Consider the GROUP BY query.


– If many tuples have E.age > 10, using E.age index and sorting the retrieved tuples
may be costly.
– Clustered E.dno index may be better!
SELECT E.dno, COUNT (*) FROM Emp E WHERE E.age>10 GROUP BY E.dno
• Equality queries and duplicates:
– Clustering on E.hobby helps!
SELECT E.dno FROM Emp E WHERE E.hobby='Stamps'

Indexes with Composite Search Keys

• Composite Search Keys: Search on a combination of fields.


– Equality query: Every field value is equal to a constant value, e.g., with respect to a <sal,age>
index:
• age=20 and sal=75
– Range query: Some field value is not a constant, e.g.:
• age=20, or age=20 and sal > 10
• Data entries in index sorted by search key to support range queries.
– Lexicographic order, or
– Spatial order.

Composite Search Keys


• To retrieve Emp records with age=30 AND sal=4000, an index on <age,sal> would be
better than an index on age or an index on sal.
– Choice of index key orthogonal to clustering etc.
• If condition is: 20<age<30 AND 3000<sal<5000:
– Clustered tree index on <age,sal> or <sal,age> is best.
• If condition is: age=30 AND 3000<sal<5000:
– Clustered <age,sal> index much better than <sal,age> index!
• Composite indexes are larger, updated more often.

Index-Only Plans

• A number of queries can be answered without retrieving any tuples from one or more of
the relations involved if a suitable index is available.

<E.dno>
SELECT E.dno, COUNT(*) FROM Emp E GROUP BY E.dno

<E.dno,E.sal> Tree index!
SELECT E.dno, MIN(E.sal) FROM Emp E GROUP BY E.dno

<E.age,E.sal> or <E.sal,E.age> Tree index!
SELECT AVG(E.sal) FROM Emp E WHERE E.age=25 AND E.sal BETWEEN 3000 AND 5000

Summary

• Many alternative file organizations exist, each appropriate in some situation.


• If selection queries are frequent, sorting the file or building an index is important.
– Hash-based indexes only good for equality search.
– Sorted files and tree-based indexes best for range search; also good for equality
search. (Files rarely kept sorted in practice; B+ tree index is better.)
• Index is a collection of data entries plus a way to quickly find entries with given key values.
• Data entries can be actual data records, <key, rid> pairs, or <key, rid-list> pairs.
– Choice orthogonal to indexing technique used to locate data entries with a given
key value.
• Can have several indexes on a given file of data records, each with a different search key.
• Indexes can be classified as clustered vs. unclustered, primary vs. secondary, and dense vs.
sparse. Differences have important consequences for utility/performance.

Introduction

• As for any index, 3 alternatives for data entries k*:


– Data record with key value k
– <k, rid of data record with search key value k>
– <k, list of rids of data records with search key k>
• Choice is orthogonal to the indexing technique used to locate data entries k*.
• Tree-structured indexing techniques support both range searches and equality searches.
• ISAM: static structure; B+ tree: dynamic, adjusts gracefully under inserts and deletes.

Range Searches

• ``Find all students with gpa > 3.0’’


– If the data is in a sorted file, do a binary search to find the first such student, then scan to find
the others (a small sketch of this approach appears below).
– Cost of binary search can be quite high.
• Simple idea: Create an `index’ file.
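A small sketch of the sorted-file approach just described (plain Python; the gpa values and names are illustrative, and the bisect module plays the role of the binary search):

import bisect

# A sorted file of (gpa, student) records, kept in gpa order.
records = [(2.8, "Bharath"), (3.0, "Abhi"), (3.2, "Chinni"), (3.7, "Devid")]
gpas = [gpa for gpa, _ in records]

# Binary search finds the first student with gpa > 3.0 ...
start = bisect.bisect_right(gpas, 3.0)
# ... and a sequential scan from that point returns the rest.
print([name for _, name in records[start:]])   # ['Chinni', 'Devid']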

ISAM

Comments on ISAM

• File creation: Leaf (data) pages allocated sequentially, sorted by search key;
then index pages allocated, then space for overflow pages.

• Index entries: <search key value, page id>; they `direct’ the search for data entries, which
are in the leaf pages.
• Search: Start at the root; use key comparisons to go to a leaf. Cost ≈ log_F N, where F = number of
entries per index page and N = number of leaf pages.
• Insert: Find the leaf the data entry belongs to, and put it there.
• Delete: Find and remove from the leaf; if an overflow page becomes empty, de-allocate it.
• Example ISAM Tree

B+ Tree: Most Widely Used Index

• Insert/delete at log_F N cost; the tree is kept height-balanced. (F = fanout, N = number of leaf pages)


• Minimum 50% occupancy (except for root). Each node contains d <= m <= 2d entries.
The parameter d is called the order of the tree.
• Supports equality and range-searches efficiently.

• Example B+ Tree

• Search begins at root, and key comparisons direct it to a leaf (as in ISAM).
• Search for 5*, 15*, all data entries >= 24* ...

B+ Trees in Practice

• Typical order: 100. Typical fill-factor: 67%.
– Average fanout F = 133.
• Typical capacities:
– Height 4: 133^4 ≈ 312,900,721 records
– Height 3: 133^3 = 2,352,637 records
• Can often hold the top levels in the buffer pool:
– Level 1 = 1 page = 8 KB
– Level 2 = 133 pages ≈ 1 MB
– Level 3 = 17,689 pages ≈ 133 MB
(A quick sanity check of these figures follows.)
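These figures follow directly from the fanout; here is a quick back-of-the-envelope check in plain Python (the 8 KB page size is the one stated above, and the "133 MB" figure treats level 2 as exactly 1 MB):

fanout = 133      # average entries per non-leaf page, as stated above
page_kb = 8       # assumed 8 KB pages

# Records reachable from a tree of a given height.
for height in (3, 4):
    print(f"height {height}: {fanout ** height:,} records")

# Pages (and rough size) at each of the top index levels.
for level in (1, 2, 3):
    pages = fanout ** (level - 1)
    print(f"level {level}: {pages:,} pages = {pages * page_kb:,} KB")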
Inserting a Data Entry into a B+ Tree

• Find correct leaf L.


• Put data entry onto L.
– If L has enough space, done!
– Else, must split L (into L and a new node L2)
• Redistribute entries evenly, copy up middle key.
• Insert index entry pointing to L2 into parent of L.
• This can happen recursively.
– To split an index node, redistribute entries evenly, but push up the middle key. (Contrast
with leaf splits, which copy up.)
• Splits “grow” the tree; a root split increases the height.
– Tree growth: the tree gets wider, or grows one level taller at the top. (A minimal sketch of the
two split rules follows.)
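To make the copy-up vs. push-up distinction concrete, here is a minimal sketch of just the two split rules, not a full B+ tree; the node layout and names are illustrative assumptions, and the sample keys simply echo the kind of values used in the "insert 8*" example referred to below.

def split_leaf(entries):
    # Leaf split: the middle key is COPIED up -- it also stays in the new
    # right-hand leaf, because leaves must keep every data entry.
    mid = len(entries) // 2
    left, right = entries[:mid], entries[mid:]
    copied_up_key = right[0][0]            # entries are (key, rid) pairs
    return left, right, copied_up_key

def split_index(keys):
    # Index-node split: the middle key is PUSHED up -- it moves to the parent
    # and appears in neither half, since index entries only direct searches.
    mid = len(keys) // 2
    return keys[:mid], keys[mid + 1:], keys[mid]

# A full leaf of (key, rid) entries, and a full index node of keys:
print(split_leaf([(2, 'r2'), (3, 'r3'), (5, 'r5'), (7, 'r7'), (8, 'r8')]))
# -> ([(2, 'r2'), (3, 'r3')], [(5, 'r5'), (7, 'r7'), (8, 'r8')], 5)
print(split_index([5, 13, 17, 24, 30]))
# -> ([5, 13], [24, 30], 17)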

Inserting 8* into Example B+ Tree

• Observe how minimum occupancy is guaranteed in both leaf and index pg splits.
• Note difference between copy-up and push-up; be sure you understand the reasons for this.

Example B+ Tree After Inserting 8*

❖ Notice that root was split, leading to increase in height.

❖ In this example, we could have avoided the split by re-distributing entries; however, this is
usually not done in practice.

Deleting a Data Entry from a B+ Tree

• Start at root, find leaf L where entry belongs.


• Remove the entry.
– If L is at least half-full, done!
– If L has only d-1 entries,
• Try to re-distribute, borrowing from a sibling (an adjacent node with the same
parent as L).
• If re-distribution fails, merge L and the sibling.
• If merge occurred, must delete entry (pointing to L or sibling) from parent of L.
• Merge could propagate to root, decreasing height.

Example Tree After (Inserting 8*, Then) Deleting 19* and 20* ...

• Deleting 19* is easy.


• Deleting 20* is done with re-distribution. Notice how middle key is copied up.

... And Then Deleting 24*

• Must merge.
• Observe `toss’ of index entry (on right), and `pull down’ of index entry (below).

File Organization

o A file is a collection of records. Using the primary key, we can access the records. The
type and frequency of access is determined by the type of file organization used
for a given set of records.
o File organization is a logical relationship among the various records. It defines how
file records are mapped onto disk blocks.
o File organization describes the way in which the records are stored in terms of
blocks, and how the blocks are placed on the storage medium.
o The first approach to mapping the database to files is to use several files and store only
fixed-length records in any given file. An alternative approach is to structure our files
so that they can contain records of multiple lengths.
o Files of fixed-length records are easier to implement than files of variable-length
records.

Objective of file organization


o Optimal selection of records, i.e., records can be selected as quickly as possible.
o Insert, delete, and update transactions on records should be quick and easy.
o Duplicate records should not be induced as a result of insert, update, or delete operations.
o Records should be stored efficiently, so that the cost of storage is minimal.

Types of file organization:


File organization can be done using various methods. Each method has its pros and cons in terms of
access and selection, and the programmer chooses the best-suited file organization method
according to the application's requirements.

Types of file organization include sequential (serial), direct (hashed), and indexed file
organization (including ISAM); each is described later in this unit.
UNIT – V

Data on External Storage in DBMS


The storage system in a DBMS refers to the hierarchical arrangement of storage devices and
media to store, manage, and retrieve data efficiently. The system is designed to handle different
storage capacities, access speeds, volatility, and costs.

Storage System Hierarchy in DBMS


The storage hierarchy typically has multiple levels, each with its specific characteristics. Here's a
typical hierarchy from fastest (and usually most expensive per byte) to slowest (and usually least
expensive per byte):

1. Registers
• Located within the CPU.
• Smallest and fastest type of storage.
• Used to hold data currently being processed.

2. Cache Memory (L1, L2, L3 caches)


• On or very close to the CPU.
• Extremely fast but small in size.
• Acts as a buffer for frequently used data.

3. Main Memory (RAM)
• Data that's actively being used or processed is loaded here.
• Faster than secondary storage.
• Volatile in nature (i.e., data is lost when power is turned off).

4. Flash Storage (Solid State Drives - SSD)


• No moving parts.
• Faster than traditional hard drives.
• More durable and reliable.

5. Magnetic Disks (Hard Disk Drives - HDD)


• Primary secondary storage medium.
• Non-volatile, persistent storage.
• Data is stored in tracks, sectors, and cylinders.
• Slower than main memory but offers a large storage capacity at a lower cost.

6. Optical Disks (CD, DVD, Blu-Ray)


• Data is read using lasers.
• Slower than magnetic disks and usually have less storage capacity.
• Portable and commonly used for media and software distribution.

7. Magnetic Tapes
• Sequential access storage, unlike disks which are random access.
• Often used for backups and archiving due to their high capacity and low cost.
• Much slower access times compared to magnetic disks.

8. Remote Storage/Cloud Storage


• Data stored in remote servers and accessed over the internet.
• Provides scalability, availability, and fault-tolerance.
• Latency depends on network speed and distance to servers.

Types of Storage:
1. Primary Storage: Includes registers, cache memory, and main memory (RAM). It's the
main memory where the operating system, application programs, and data in current use
are kept for quick access by the computer's processor.
2. Secondary Storage: Encompasses data storage devices like HDDs, SSDs, CDs, and USB
drives. It is non-volatile and retains data even when the computer is turned off.
3. Tertiary Storage or Off-line Storage: Often involves magnetic tape systems or optical
disk archives. This is slower than secondary storage and is used for data archiving and
backup.
4. Quaternary Storage: Refers to methods like cloud storage or other remote storage
techniques where data is stored in remote servers and is fetched over the internet or other
networks.

Note: The storage hierarchy in a DBMS serves as a framework for balancing the trade-offs
between cost, speed, and capacity. By strategically using different levels of this hierarchy,
a DBMS can ensure fast access times for frequently used data while also providing large-capacity
storage for data that's accessed less frequently.

File organization and indexing are fundamental concepts in database management


systems (DBMS) that deal with efficient storage, retrieval, and management of data.

File Organization
File organization refers to the arrangement of data on storage devices. The method
chosen can have a profound effect on the efficiency of various database operations.
Common methods of file organization include:

Sequential (or Serial) File Organization


In sequential file organization, records are stored in sequence, one after the other, based
on a key field. This key field is a unique identifier for records, ensuring that they have
some order. The records are inserted at the end of the file, ensuring the sequence is
maintained.

Features of Sequential File Organization

• Ordered Records: Records in a sequential file are stored based on a key field.

• Continuous Memory Allocation: The records are stored in contiguous memory
locations.

• No Direct Access: To access a record, you have to traverse from the first record until
you find the desired one.

Advantages of Sequential File Organization

• Simplicity: The design and logic behind sequential file organization are
straightforward.

• Efficient for Batch Processing: Since records are stored in sequence, sequential
processing (like batch updates) can be done efficiently.

• Less Overhead: There's no need for complex algorithms or mechanisms to store records.

Disadvantages of Sequential File Organization

• Inefficient for Random Access: If you need a specific record, you may have to go
through many records before finding the desired one. This makes random access
slow.

• Insertion and Deletion: Inserting or deleting a record (other than at the end) can be
time-consuming, since you may need to shift records to maintain the order.

• Redundancy Issues: There's a risk of redundancy if checks are not made before
inserting records. For example, a record with the same key might get added twice if
not checked.
Practical Application: Suppose you have a file of students ordered by their roll
number:

Roll No | Name
--------|--------
1 | Madhu
2 | Naveen
4 | Shivaji
5 | Durga

In a sequential file, if you wanted to add a student with roll number 6, you would append
them at the end. However, if you wanted to insert a student with roll number 3, which falls
between 2 and 4, you would need to shift all subsequent records to maintain the
sequence, which can be time-consuming (see the sketch below).
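A small sketch of that cost difference in plain Python (an in-memory list stands in for the file, and the new students' names are made up for illustration):

import bisect

records = [(1, "Madhu"), (2, "Naveen"), (4, "Shivaji"), (5, "Durga")]

# Appending roll number 6 is cheap: it already belongs at the end of the sequence.
records.append((6, "Kiran"))

# Inserting roll number 3 must preserve the order, so every later record shifts.
bisect.insort(records, (3, "Ravi"))
print([roll for roll, _ in records])   # [1, 2, 3, 4, 5, 6]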

Direct (or Hashed) File Organization


In hash file organization, a hash function is used to compute the address of a block (or
bucket) where the record is stored. The value returned by the hash function using a
record's key value is its address in the database.

Features of Hash File Organization

• Hash Function: A hash function converts a record's key value into an address.

• Buckets: A bucket typically stores one or more records. A hash function might map
multiple keys to the same bucket.

• No Ordering of Records: Records are not stored in any specific logical order.

Advantages of Hash File Organization

• Rapid Access: If the hash function is efficient and there's minimal collision, the
retrieval of a record is very quick.

• Uniform Distribution: A good hash function will distribute records uniformly across
all buckets.

• Efficient Search: Searching becomes efficient, as only a specific bucket needs to be
searched rather than the entire file.

Disadvantages of Hash File Organization

• Collisions: A collision occurs when two different keys hash to the same bucket.
Handling collisions can be tricky and might affect access time.

• Dependency on the Hash Function: The efficiency of a hash file organization heavily
depends on the hash function used. A bad hash function can lead to clustering and
inefficient utilization of space.

• Dynamic Growth and Shrinking: If the number of records grows or shrinks
significantly, rehashing might be needed, which is an expensive operation.

Practical Application: Imagine a database that holds information about books.
Each book has a unique ISBN number. A hash function takes an ISBN and returns an
address. When you want to find a particular book's details, you hash the ISBN, which
directs you to a particular bucket. If two books' ISBNs hash to the same value, you handle
that collision, maybe by placing the new record in a linked list associated with that bucket.
A small sketch of this scheme follows.
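A minimal sketch in plain Python (the bucket count, the toy hash on the last digits of the ISBN, and the sample ISBNs are all illustrative assumptions):

NUM_BUCKETS = 4
buckets = [[] for _ in range(NUM_BUCKETS)]     # colliding records chain inside one bucket

def bucket_of(isbn):
    # Toy hash function: last three digits of the ISBN modulo the bucket count.
    return int(isbn[-3:]) % NUM_BUCKETS

def insert(isbn, title):
    buckets[bucket_of(isbn)].append((isbn, title))

def lookup(isbn):
    # Only the one bucket the key hashes to needs to be searched.
    return [rec for rec in buckets[bucket_of(isbn)] if rec[0] == isbn]

insert("9780000000101", "Database Systems")
insert("9780000000205", "Operating Systems")    # 101 % 4 == 205 % 4, so these two collide
print(lookup("9780000000205"))                  # [('9780000000205', 'Operating Systems')]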

Indexed File Organization


Indexed file organization is a method used to store and retrieve data in databases. It is
designed to provide quick random access to records based on key values. In this
organization, an index is created which helps in achieving faster search and access times.

Features of Indexed File Organization:

• Primary Data File: The actual database file where records are stored.

• Index: An auxiliary file that contains key values and pointers to the corresponding
records in the data file.

• Multi-level Index: Sometimes, if the index becomes large, a secondary (or even tertiary)
index can be created on the primary index to expedite searching further.

Advantages of Indexed File Organization:

• Quick Random Access: Direct access to records is possible using the index.

• Flexible Searches: Since an index provides a mechanism to jump directly to records,
different types of search operations (like range queries) can be efficiently supported.

• Ordered Access: If the primary file is ordered, then indexed file organization can
support efficient sequential access too.

Disadvantages of Indexed File Organization:

• Overhead of Maintaining Index: Every time a record is added, deleted, or updated, the index also
needs to be updated. This can introduce overhead.
• Space Overhead: Indexes consume additional storage space.
• Complexity: Maintaining multiple levels of indexes can introduce complexity in terms of design
and implementation.
Practical Application: Consider a database that holds information about students,
where each student has a unique student ID. The main file would contain detailed records for each
student. A separate index file would contain student IDs and pointers to the locations of the detailed
records in the main file. When you want to find a specific student's details, you first search the
index to find the pointer, and then use that pointer to fetch the record directly from the main file
(a small sketch follows).
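A small sketch of that two-step lookup in plain Python (a dictionary stands in for the index file, and the offsets are illustrative):

# Main data file: records stored at arbitrary offsets (offset -> record).
data_file = {
    0: {"student_id": 1001, "name": "Madhu", "age": 20},
    1: {"student_id": 1007, "name": "Ramu",  "age": 21},
    2: {"student_id": 1003, "name": "Mahi",  "age": 22},
}

# Index file: search key (student ID) -> pointer (offset) into the data file.
index = {rec["student_id"]: offset for offset, rec in data_file.items()}

def find_student(student_id):
    offset = index[student_id]       # step 1: search the index for the pointer
    return data_file[offset]         # step 2: follow the pointer to the record

print(find_student(1003))            # {'student_id': 1003, 'name': 'Mahi', 'age': 22}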

Note: Indexed file organization strikes a balance between the direct and sequential access of
records. While it offers faster search capabilities, there's a trade-off in terms of space and
maintenance overhead. It is ideal for databases where search operations are predominant, and
where the overhead of maintaining the index can be justified by the improved access times.

Indexed Sequential Access Method (ISAM)


ISAM is a popular method for indexed file organization. In ISAM:

• The primary file is stored in a sequential manner based on a primary key.


• There's a static primary index built on the primary key.
• Overflow areas are designated for insertion of new records, which keeps the main file in sequence.
Periodically, the overflow area can be merged back into the main file.

Indexing
Indexing involves creating an auxiliary structure (an index) to improve data retrieval times. Just
like the index in the back of a book, a database index provides pointers to the locations of records.

Structure of Index
We can create indices using some columns of the database.

|-------------|----------------|
| Search Key | Data Reference |
|-------------|----------------|

• The search key is the index’s first column; it contains a copy of the table’s
candidate key or primary key. The key values are saved in sorted order so that the related
data can be accessed quickly.
• The data reference is the index’s second column. It contains a group of pointers that point to the
disk block where the value of a specific key can be found.

Types of Indexes:
1. Single-level Index: A single index table that contains pointers to the actual data records.

2. Multi-level Index: An index of indexes. This hierarchical approach reduces the number of accesses
(disk I/O operations) required to find an entry.
3. Dense and Sparse Indexes:
o In a dense index, there's an index entry for every search key value in the database.
o In a sparse index, there are fewer index entries. One entry might point to several records.
4. Primary and Secondary Indexes:
o A primary index is an ordered file whose records are of fixed length with two fields. The
first field is the same as the primary key, and the second field is a pointer to the data block.
There's a one-to-one relationship between the number of entries in the index and the
number of records in the main file.
o A secondary index provides a secondary means of accessing data. For each secondary key
value, the index points to all the records with that key value.
5. Clustered vs. Non-clustered Index:
o In a clustered index, the rows of data in the table are stored on disk in the same order as
the index. There can only be one clustered index per table.
o In a non-clustered index, the order of rows does not match the index's order. You can have
multiple non-clustered indexes.
6. Bitmap Index: Used mainly for data warehousing setups, a bitmap index uses bit arrays (bitmaps)
and usually involves columns that have a limited number of distinct values.
7. B-trees and B+ trees: Balanced tree structures that ensure logarithmic access time. B+ trees are
particularly popular in DBMS for their efficiency in disk I/O operations.

Benefits of Indexing:
• Faster search and retrieval times for database operations.

Drawbacks of Indexing:
• Overhead for insert, update, and delete operations, as indexes need to be maintained.
• Additional storage requirements for the index structures.

Clustered Indexes in DBMS

A clustered index determines the physical order of data in a table. In other words, the order in
which the rows are stored on disk is the same as the order of the index key values. There can be
only one clustered index on a table, but the table can have multiple non-clustered (or secondary)
indexes.

Characteristics of a Clustered Index:
1. It dictates the physical storage order of the data in the table.
2. There can only be one clustered index per table.
3. It can be created on columns with non-unique values, but if it's on a non-unique column,
the DBMS will usually add a uniqueifier to make each entry unique.
4. Lookup based on the clustered index is fast because the desired data is directly located
without the need for additional lookups (unlike non-clustered indexes which require a
second lookup to fetch the data).

Clustered Index Example


Imagine a `Books` table with the following records:

| BookID | Title                | Genre     |
|--------|----------------------|-----------|
| 3      | A Tale of Two Cities | Fiction   |
| 1      | Database Systems     | Academic  |
| 4      | Python Programming   | Technical |
| 2      | The Great Gatsby     | Fiction   |

If we create a clustered index on `BookID`, the physical order of records would be rearranged
based on the ascending order of `BookID`.
The table would then look like this:

| BookID | Title                | Genre     |
|--------|----------------------|-----------|
| 1      | Database Systems     | Academic  |
| 2      | The Great Gatsby     | Fiction   |
| 3      | A Tale of Two Cities | Fiction   |
| 4      | Python Programming   | Technical |

Now, when you want to find a book based on its ID, the DBMS can quickly locate the data because
the data is stored in the order of the BookID.

Benefits of a Clustered Index:

• Fast Data Retrieval: Because the data is stored sequentially in the order of the index key,
range queries or ordered queries can be very efficient.
• Data Pages: With the data being stored sequentially, the number of data pages that need
to be read from the disk is minimized.
• No Additional Lookups: Once the key is located using the index, there's no need for
additional lookups to fetch the data, as it is stored right there.

Drawbacks of a Clustered Index:


• Overhead on Inserts/Updates: Because the data must be stored physically in the order of
the index keys, inserts or updates can be slower since they might require data pages to be
rearranged.
• Single Clustered Index: You can have only one clustered index per table, so you have to
choose wisely based on the most critical queries' requirements.
In practice, it's often beneficial to have the primary key of a table also be the clustered index, but
this is not a requirement. The decision of which column to use as a clustered index should be based
on query patterns, the nature of the data, and the characteristics of the application using the
database.

Primary Indexes in DBMS

A primary index is an ordered file whose records are of fixed length with two fields. The first field
is the same as the primary key of the data file, and the second field is a pointer to the data block
where that specific key can be found.
The primary index can be classified into two types

1. Dense Index: In this, an index entry appears for every search key value in the data file.
2. Sparse (or Non-Dense) Index: Here, index records are created only for some of the search
key values. A sparse index reduces the size of the index file.

Characteristics of a Primary Index:

1. It is based on the primary key of the table.
2. The primary key must have unique values.
3. The index entries are sorted in the same order as the data file (hence often a clustered
index).
4. The index has one entry for each block in a data file, not for each record.

Primary Index Example


Let's assume a simple table named `Students` that stores data about students. The table has the
following structure:

| RollNumber (Primary Key) | Name  | Age |
|--------------------------|-------|-----|
| 1001                     | Madhu | 20  |
| 1003                     | Mahi  | 22  |
| 1007                     | Ramu  | 21  |
| 1010                     | Durga | 23  |

Assuming each block of our storage can hold two records, our blocks will be:

• Block 1: Contains records for RollNumbers 1001 and 1003.


• Block 2: Contains records for RollNumbers 1007 and 1010.
Dense Primary Index:

| Key (RollNumber) | Pointer to Block |
|------------------|------------------|
| 1001             | Block 1          |
| 1003             | Block 1          |
| 1007             | Block 2          |
| 1010             | Block 2          |

Sparse Primary Index:

| Key (RollNumber) | Pointer to Block |
|------------------|------------------|
| 1001             | Block 1          |
| 1007             | Block 2          |

In the dense primary index, there's an entry for every key value, whereas in the sparse primary
index, only the first key value of each block (the block's anchor record) gets an entry in the index.
In this way, sparse indexes are more compact but might require more sequential block reads if a
search key isn't an anchor for a block.
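A minimal sketch of how those two indexes are built from the blocks above (plain Python; the two-records-per-block layout comes from the example, the rest is an illustrative simplification):

BLOCK_SIZE = 2   # records per block, as in the example above

students = [(1001, "Madhu", 20), (1003, "Mahi", 22),
            (1007, "Ramu", 21), (1010, "Durga", 23)]   # sorted on the primary key

# Carve the sorted file into blocks.
blocks = [students[i:i + BLOCK_SIZE] for i in range(0, len(students), BLOCK_SIZE)]

# Dense index: one entry for every search key value.
dense = [(rec[0], block_no) for block_no, block in enumerate(blocks, 1) for rec in block]

# Sparse index: one entry per block, keyed on the block's anchor (first) record.
sparse = [(block[0][0], block_no) for block_no, block in enumerate(blocks, 1)]

print(dense)    # [(1001, 1), (1003, 1), (1007, 2), (1010, 2)]
print(sparse)   # [(1001, 1), (1007, 2)]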

Benefits of a Primary Index:


• Fast retrieval of records based on the primary key.
• Efficient for range-based queries due to sorted order.
• Can be used to enforce the uniqueness of the primary key.
However, while the primary index provides rapid access and ensures the uniqueness of records, it
can be slower for insertions, updates, and deletions, as maintaining the sorted order of the index
can be time-consuming.

Secondary Indexes in DBMS

A secondary index provides a secondary means (other than the primary key) to access table data.
Unlike primary indexes, secondary indexes aren't necessarily unique and aren't based on the
primary key of a table. The records in a secondary index are stored separately from the data, and
each index entry points to the corresponding data record. A table can have multiple secondary
indexes.

Characteristics of a Secondary Index:
1. Provides an alternative path to access the data.
2. Can be either dense or sparse.
3. Allows for non-unique values.
4. Does not guarantee the order of records in the data file.
5. Unlike a primary (often clustered) index, a secondary index is typically a non-clustered
index. This means the physical order of rows in a table is not the same as the index order.

Secondary Index Example


Let's continue with the `Students` table:

| RollNumber (Primary Key) | Name  | Age |
|--------------------------|-------|-----|
| 1001                     | Madhu | 20  |
| 1003                     | Mahi  | 22  |
| 1007                     | Ramu  | 21  |
| 1010                     | Durga | 23  |

Assuming we want to create a secondary index on the `Age` column:


Dense Secondary Index on Age:

| Age | Pointer to Record |
|-----|-------------------|
| 20  | Record 1          |
| 21  | Record 3          |
| 22  | Record 2          |
| 23  | Record 4          |

Here, each age value has a direct pointer to the corresponding record.
If another student with an age of 22 is added:

| RollNumber | Name  | Age |
|------------|-------|-----|
| 1012       | David | 22  |

The dense secondary index would then be:

| Age | Pointer to Record  |
|-----|--------------------|
| 20  | Record 1           |
| 21  | Record 3           |
| 22  | Record 2, Record 5 |
| 23  | Record 4           |
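A small sketch of maintaining that dense secondary index in plain Python (record numbers follow the row order above, and the dictionary-of-lists layout is an illustrative choice):

from collections import defaultdict

students = [(1001, "Madhu", 20), (1003, "Mahi", 22),
            (1007, "Ramu", 21), (1010, "Durga", 23)]

age_index = defaultdict(list)    # non-unique secondary key: age -> list of record numbers

def insert(student):
    students.append(student)
    age_index[student[2]].append(len(students))   # the new row's record number

# Build the index over the existing rows, then add the new student.
for record_no, stu in enumerate(students, 1):
    age_index[stu[2]].append(record_no)

insert((1012, "David", 22))
print(dict(age_index))    # {20: [1], 22: [2, 5], 21: [3], 23: [4]}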

Benefits of a Secondary Index:


• Provides additional query paths, potentially speeding up query performance.
• Can be created on non-primary key columns, even on columns with non-unique values.
• Useful for optimizing specific query patterns.

Drawbacks of a Secondary Index:


• Increases the overhead of write operations since any insert, update, or delete operation on
the table may require changes to the secondary index.
• Consumes additional storage space.
• May increase the complexity of database maintenance.
In a real-world scenario, database administrators and developers need to strike a balance. They
need to identify which columns are frequently used in query conditions and might benefit from
indexing, while also considering the trade-offs regarding space and write operation performance.

Index data Structures

One way to organize data entries is to hash the data entries on the search key. Another way to organize
data entries is to build a tree-like data structure that directs a search for data entries.

Hash-Based Indexing
In hash-based indexing, a hash function is used to convert a key into a hash code. This hash code
serves as an index where the value associated with that key is stored. The goal is to distribute the
keys uniformly across an array, so that access time is, on average, constant.
Let's break down some of these elements to further understand how hash-based indexing works in
practice:

Buckets
In hash-based indexing, the data space is divided into a fixed number of slots known as "buckets."
A bucket usually contains a single page (also known as a block), but it may have additional pages
linked in a chain if the primary page becomes full. This is known as overflow.

Hash Function
The hash function is a mapping function that takes the search key as an input and returns the bucket
number where the record should be located. Hash functions aim to distribute records uniformly
across buckets to minimize the number of collisions (two different keys hashing to the same
bucket).

Disk I/O Efficiency


Hash-based indexing is particularly efficient when it comes to disk I/O operations. Given a search
key, the hash function quickly identifies the bucket (and thereby the disk page) where the desired
record is located. This often requires only one or two disk I/Os, making the retrieval process very
fast.

Insert Operations
When a new record is inserted into the dataset, its search key is hashed to find the appropriate
bucket. If the primary page of the bucket is full, an additional overflow page is allocated and linked
to the primary page. The new record is then stored on this overflow page.

Search Operations
To find a record with a specific search key, the hash function is applied to the search key to identify
the bucket. All pages (primary and overflow) in that bucket are then examined to find the desired
record.

Limitations
Hash-based indexing is not suitable for range queries or when the search key is not known. In such
cases, a full scan of all pages is required, which is resource-intensive.

Hash-Based Indexing Example


Let's consider a simple example using employee names as the search key.
Employee Records

| Name  | Age | Salary |
|-------|-----|--------|
| Alice | 28  | 50000  |
| Bob   | 35  | 60000  |
| Carol | 40  | 70000  |

Hash Function: H(x) = ASCII value of first letter of the name mod 3

• Alice: 65 mod 3 = 2
• Bob: 66 mod 3 = 0
• Carol: 67 mod 3 = 1
Buckets:
Bucket 0: Bob
Bucket 1: Carol
Bucket 2: Alice
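The same example in a few lines of plain Python (the first-letter hash and the modulus of 3 come from the example above; the list-of-buckets layout is an illustrative choice):

employees = [("Alice", 28, 50000), ("Bob", 35, 60000), ("Carol", 40, 70000)]

def h(name):
    return ord(name[0]) % 3          # ASCII value of the first letter, mod 3

buckets = [[] for _ in range(3)]
for emp in employees:
    buckets[h(emp[0])].append(emp)

for i, bucket in enumerate(buckets):
    print(f"Bucket {i}: {[name for name, _, _ in bucket]}")
# Bucket 0: ['Bob'], Bucket 1: ['Carol'], Bucket 2: ['Alice']

# Equality search: hash the key, then examine only that one bucket.
print([e for e in buckets[h("Carol")] if e[0] == "Carol"])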

Pros of Hash-Based Indexing


• Extremely fast for exact match queries.
• Well-suited for equality comparisons.

Cons of Hash-Based Indexing


• Not suitable for range queries (e.g., "SELECT * FROM table WHERE age BETWEEN 20
AND 30").
• Performance can be severely affected by poor hash functions or a large number of
collisions.

Tree-based Indexing
The most commonly used tree-based index structure is the B-Tree, along with its variations such as B+ Trees
and B* Trees. In tree-based indexing, data is organized into a tree-like structure. Each node
represents a range of key values, and leaf nodes contain the actual data or pointers to the data.

Why Tree-based Indexing?


Tree-based indexes like B-Trees offer a number of advantages:

• Sorted Data: They maintain data in sorted order, making it easier to perform range queries.
• Balanced Tree: B-Trees and their variants are balanced, meaning the path from the root
node to any leaf node is of the same length. This balancing ensures that data retrieval times
are consistently fast, even as the dataset grows.
• Multi-level Index: Tree-based indexes can be multi-level, which helps to minimize the
number of disk I/Os required to find an item.
• Dynamic Nature: B-Trees are dynamic, meaning they're good at inserting and deleting
records without requiring full reorganization.
• Versatility: They are useful for both exact-match and range queries.

Tree-based Indexing Example


Continuing with the "Students" table:

ID  Name
1   Abhi
2   Bharath
3   Chinni
4   Devid

A simplified B-Tree index could look like this:

        [1, 3]
       /      \
    [1]      [3, 4]
   /   \     /    \
  1     2   3      4

In the tree, navigating from the root to the leaf nodes will lead us to the desired data record.
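A hedged sketch of the navigation idea in plain Python (a single sorted leaf level searched with the bisect module stands in for walking the tree's levels, so this approximates the structure above rather than implementing it):

import bisect

# Leaf level of the index: keys in sorted order, with their records.
leaf_keys    = [1, 2, 3, 4]
leaf_records = ["Abhi", "Bharath", "Chinni", "Devid"]

def exact_match(key):
    i = bisect.bisect_left(leaf_keys, key)
    if i < len(leaf_keys) and leaf_keys[i] == key:
        return leaf_records[i]
    return None

def range_query(low, high):
    # Sorted order makes a range query a contiguous slice of the leaf level.
    lo = bisect.bisect_left(leaf_keys, low)
    hi = bisect.bisect_right(leaf_keys, high)
    return leaf_records[lo:hi]

print(exact_match(3))       # Chinni
print(range_query(2, 4))    # ['Bharath', 'Chinni', 'Devid']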

Pros of Tree-based Indexing:


• Efficient for range queries.
• Good for both exact and partial matches.
• Keeps data sorted.

Cons of Tree-based Indexing:


• Slower than hash-based indexing for exact queries.
• More complex to implement and maintain.
