Overview of Physical Storage Media – Notes
🔹 Definition:
Physical storage media refers to hardware devices used to store digital data, either
temporarily or permanently.
🔹 Types of Physical Storage Media:
 1. Magnetic Storage
       Examples: Hard Disk Drive (HDD), Magnetic Tapes, Floppy Disks
       Features:
          High storage capacity
          Cost-effective
          Slower than solid-state drives
          Suitable for long-term storage
 2. Optical Storage
       Examples: CD, DVD, Blu-ray Disc
       Features:
          Uses laser technology for reading/writing
          Portable and inexpensive
          Commonly used for media distribution and backups
 3. Solid-State Storage
       Examples: Solid-State Drive (SSD), USB Flash Drive, SD Card
       Features:
          No moving parts (more durable)
          High speed and performance
          Higher cost per GB than HDD
          Common in laptops, smartphones
 4. Magneto-Optical Storage
       Examples: MO Disks
       Features:
          Combines magnetic and optical technology
          Rewritable and durable
          Rarely used in modern systems
🔹 Key Performance Metrics:
 Metric                                          Description
 Access Time                                     Time to locate and access data
 Data Transfer Rate                              Speed of reading/writing data
 Storage Capacity                                Total data it can hold
 Durability                                      Resistance to physical damage or
                                                 environmental wear
 Cost per Bit                                    Expense of storing each unit of data
📘 Storage Device Hierarchy – Notes
🔹 Definition:
The storage device hierarchy refers to the organization of storage devices in a computer
system based on their speed, cost, capacity, and proximity to the CPU.
🔹 Key Characteristics:
 Level               Speed            Cost per Bit       Capacity          Volatility
 Top (Closer to      Fastest          Most Expensive     Small             Volatile
 CPU)
 Bottom (Farther     Slowest          Cheapest           Very Large        Non-volatile
 from CPU)
🔹 Typical Hierarchical Levels:
 1. Registers
         Located inside the CPU
         Fastest and most expensive
         Very limited in size
         Volatile
 2. Cache Memory
         Small memory close to the CPU
       Stores frequently accessed data
       Very fast and expensive
       Volatile
 3. Main Memory (RAM)
       Temporary storage for running programs
       Faster than secondary storage
       Volatile
 4. Secondary Storage
       Examples: HDD, SSD
       Used for long-term data storage
       Slower and cheaper than RAM
       Non-volatile
 5. Tertiary Storage
       Examples: Optical Discs, Magnetic Tapes
       Used for backup and archival
       Very large capacity
       Very slow and cheap
       Non-volatile
 6. Cloud/Remote Storage (Optional Layer)
       Online storage (e.g., Google Drive, AWS)
       Depends on internet speed
       Accessible anywhere
       Used for backup, sync, collaboration
🔹 Summary:
    As we go up the hierarchy: Speed increases, Cost increases, Capacity decreases
    As we go down the hierarchy: Speed decreases, Cost decreases, Capacity increases
File Organization – Notes
🔹 Definition:
File organization refers to the way records are arranged and stored in a file on storage media.
It affects the efficiency of data retrieval, insertion, and deletion.
🔹 Types of File Organization:
 1. Sequential File Organization
       Records are stored in a linear sequence, one after the other.
       Best suited for: Batch processing, large volume of data.
       Pros:
          Simple to implement
          Efficient for reading the entire file
     Cons:
       Slow for random access
       Insertion/deletion is costly
2. Heap (Unordered) File Organization
     Records are stored randomly as they come.
     Best suited for: Small files or when no specific order is needed.
     Pros:
        Fast insertions
     Cons:
        Slow search
        Inefficient for updates/deletes
3. Hashing File Organization
     Uses a hash function to compute the address of a record.
     Best suited for: Direct access to individual records.
     Pros:
        Fast search for exact matches
     Cons:
        Poor performance on range queries
        Collisions need handling
4. Indexed File Organization
     An index is created to map keys to their locations in the file.
     Best suited for: Applications with frequent searching.
     Pros:
        Fast access
        Good for large datasets
     Cons:
        Overhead of maintaining index
        Complex implementation
5. Clustered File Organization
     Related records from different tables are stored together (used in DBMS).
     Improves performance for join operations.
🔹 Comparison Table:
 Type                Search Speed        Insert/Delete         Ordered?   Use Case
 Sequential          Slow (for random)   Slow                  Yes        Payroll, logs
 Heap                Slow                Fast                  No         Temporary data
 Hashing             Very Fast           Moderate              No         Exact match queries
 Indexed             Fast                Moderate              Yes        Large databases, search-
                                                                          heavy apps
 Clustered           Fast (for joins)    Moderate              Yes        Relational DB operations
Spanned vs Unspanned Records – Notes
🔹 Definition:
When storing records in blocks (or pages), a record may either:
    Fit entirely within a single block (Unspanned), or
    Be split across two or more blocks (Spanned)
🔹 1. Unspanned Records
    A record must be stored entirely within one block.
    If a record does not fit, it is moved to the next block.
✅ Advantages:
    Simpler to manage
    Faster access (only one block needs to be read)
❌ Disadvantages:
    Wastes space (internal fragmentation)
    Less flexible for variable-length records
🔹 2. Spanned Records
    A record can be split across multiple blocks.
    The system keeps track of parts of the record across blocks.
✅ Advantages:
    Better space utilization (less wasted space)
    Can store large records even in small blocks
❌ Disadvantages:
    More complex to manage
    Slower access (requires reading multiple blocks)
🔹 Example:
Assume block size = 100 bytes, record size = 70 bytes
    Unspanned: Only 1 record per block (30 bytes wasted)
    Spanned: First block stores 70 bytes of record A and 30 bytes of record B; next block
    continues record B
🔹 Summary Table:
 Feature                          Spanned                       Unspanned
 Record split allowed             Yes                           No
 Space utilization                High                          Low
 Access speed                     Slower (may need 2+ blocks)   Faster (1 block per record)
 Complexity                       Higher                        Lower
Indexing – Notes
🔹 Definition:
Indexing is a data structure technique used to quickly locate and access the data in a
database table, just like an index in a book.
🔹 Why Use Indexing?
    To speed up data retrieval operations (SELECT queries)
    To avoid scanning the entire table (full table scan)
    Especially useful for large databases
🔹 Types of Indexing:
 1. Primary Index
       Built on the primary key
       One index entry per record
         Records are stored in sorted order of the key
         Dense or Sparse indexing can be used
2. Secondary Index
         Built on non-primary key attributes
         Can be many for a table
         Useful for searching on non-unique fields (like Department, City)
3. Clustered Index
         Data is physically sorted based on the indexed column
         Only one clustered index per table
         Faster retrieval for range-based queries
4. Non-Clustered Index
         Index and data are stored separately
         Can have multiple non-clustered indexes
         Slower than clustered for range queries, but still effective
5. Multilevel Index
         Index built on top of another index to reduce access time
         Used when the first-level index is too large
         Similar to B-tree or B+ tree structures
🔹 Dense vs Sparse Index:
 Type                      Every Record Indexed?    Space Used               Speed
 Dense                     Yes                      More                     Fast
 Sparse                    No (only some entries)   Less                     Slower
🔹 Indexing Data Structures:
   B-Tree / B+ Tree: Balanced tree structures for multilevel indexing
   Hash Indexing: Uses hash function for fast equality search
   Bitmap Indexing: Efficient for columns with few distinct values (used in data warehouses)
🔹 Advantages of Indexing:
   Faster data retrieval
   Improves performance of SELECT queries
   Efficient range and exact match searching
🔹 Disadvantages:
   Takes extra disk space
   Slows down INSERT, UPDATE, DELETE due to index maintenance
Indexing in DBMS – Detailed Explanation
🔹 1. Primary Index
🔸 Description:
    Built on primary key of the table.
    The data file is sorted on the primary key field.
    One entry for each block, not each record (Sparse index).
    Also known as sparse primary index if one entry per block.
🔸 Example:
Suppose a table of Students is sorted by Roll_No. A primary index is created on Roll_No.
 Roll_No                                        Name
 101                                            A
 102                                            B
 103                                            C
 ...
Index might store:
 Key (Roll_No)                                  Block Address
 101                                            0
 110                                            1
 120                                            2
🔸 Advantages:
    Fast access using key
    Simple to implement
🔸 Disadvantages:
    Only one per table (based on sorting)
🔹 2. Secondary Index
🔸 Description:
    Created on fields that are not primary keys.
    Table may or may not be sorted on that field.
    Can be dense (every record has an index entry).
🔸 Example:
Creating a secondary index on Department in the Employee table.
 Emp_ID                         Name                         Department
 101                            A                            HR
 102                            B                            IT
 103                            C                            HR
Index (dense):
 Department                                    List of Record Pointers
 HR                                            [101, 103]
 IT                                            [102]
🔸 Advantages:
    Allows fast search on non-key fields
    Supports multiple indexes
🔸 Disadvantages:
    Additional storage
    Slower than primary index
🔹 3. Clustered Index
🔸 Description:
    Data is physically sorted based on the indexed column.
    Only one clustered index per table.
    The index entries contain actual data records, not just pointers.
🔸 Example:
A clustered index on Order_Date would arrange the entire table based on dates.
🔸 Advantages:
    Fast retrieval for range queries (e.g., date ranges)
    Better performance for sorted scans
🔸 Disadvantages:
    Insertion/Deletion costly
    Only one per table
🔹 4. Non-Clustered Index
🔸 Description:
    Index is stored separately from the data.
    Index contains pointers to the data rows.
    Can create multiple non-clustered indexes.
🔸 Example:
Non-clustered index on Customer_Name while table is sorted by Customer_ID.
 Customer_ID                                  Name
 101                                          Zaid
 102                                          Aamir
 103                                          Vikram
Index:
 Name                                         Pointer
 Aamir                                        → 102
 Vikram                                       → 103
 Zaid                                         → 101
🔸 Advantages:
    Multiple indexes allowed
    Faster point lookup
🔸 Disadvantages:
    Slower than clustered for range queries
    Overhead on storage and maintenance
🔹 5. Dense vs Sparse Index
 Feature                        Dense Index                   Sparse Index
 Records Indexed                Every record                  One per block
 Space                          More storage                  Less storage
                                needed
 Search Speed                   Faster (direct)               Slightly slower
 Maintenance                    More complex                  Easier to maintain
                                during updates
🔹 6. Multilevel Index
🔸 Description:
   Index built on top of another index.
   Helps when primary index is too large to be stored in memory.
   Implemented as B-trees or B+ trees.
🔸 Structure:
   Level 1: Index of blocks
   Level 2: Index of index blocks
   And so on…
🔸 Advantages:
   Reduces disk I/O
   Logarithmic search time (like binary search)
🔹 7. Hash Indexing
🔸 Description:
   Uses a hash function to map keys to block addresses.
   Good for exact match queries, not for range queries.
🔸 Example:
Hash(Employee_ID) = Block Number
🔸 Advantages:
     Very fast access for equality searches
     Simple to implement
🔸 Disadvantages:
     Doesn’t support range searches
     Collision handling needed (e.g., chaining)
🔹 8. Bitmap Indexing
🔸 Description:
     Used for columns with few distinct values (e.g., gender, department).
     Stores a bit array for each value.
🔸 Example:
For a Gender column:
 ID                                               Gender
 1                                                M
 2                                                F
 3                                                M
Bitmap:
     M→101
     F→010
🔸 Advantages:
     Efficient for AND/OR operations
     Small storage space for low-cardinality fields
🔸 Disadvantages:
     Not good for high-cardinality fields
     Slower for frequent updates
📌 Summary Table:
 Index           Sorted          Supports         Speed           Space        Use Case
 Type                            Range            (Exact
                                 Queries          Match)
 Primary         Yes             Yes              Fast            Low          Primary
 Index                                                                         key access
 Secondary       No              Yes              Moderate        Medium       Search on
 Index                                                                         non-key
                                                                               fields
 Clustered       Yes             Yes              Very Fast       Medium       Sorting,
 Index                                                                         range
                                                                               scans
 Non-            No              Yes              Fast            High         Multiple
 Clustered                                                                     key
                                                                               searches
 Multilevel      Yes             Yes              Very Fast       Medium       Large
 Index                                                                         datasets
 Hash            No              No               Very Fast       Medium       Equality
 Index                                                                         search
 Bitmap          No              No               Fast for        Low          Categoric
 Index                                            low values                   al
                                                                               attributes
Hashing in DBMS – Notes
🔹 What is Hashing?
Hashing is a technique used to quickly access data by applying a hash function to a search
key. It maps a key to a location (or "bucket") in a hash table.
🔹 Why Use Hashing?
    To enable fast data retrieval (especially for equality searches)
    Used in indexing, hash-based joins, and partitioning
       Efficient in large databases where quick access is needed
🔹 Hash Function
A hash function converts a search key (e.g., Employee ID) into a bucket address.
Formula example:
text
CopyEdit
h(k) = k mod m
Where:
       k = search key
       m = number of buckets
🔹 Types of Hashing:
1. Static Hashing
       Number of buckets is fixed.
       Hash function does not change.
       Each bucket can hold multiple records.
🔸 Example:
If m = 10, then h(107) = 107 % 10 = 7
🔸 Drawbacks:
       Overflow may occur (more records than bucket space)
       Poor for growing datasets
2. Dynamic Hashing
       Number of buckets grows/shrinks as data changes
       Handles overflows gracefully
       Common techniques: Extendible Hashing, Linear Hashing
🔹 Hash Collisions
Occurs when two keys hash to the same bucket.
🔸 Methods to Handle:
 1. Chaining – Use a linked list at each bucket.
 2. Open Addressing – Find next available bucket using probing:
       Linear Probing: check next slots sequentially
       Quadratic Probing: check i² positions ahead
       Double Hashing: use a second hash function
🔹 Extendible Hashing
    Uses a directory of pointers to buckets.
    Directory size = 2^d, where d is the number of bits in hash prefix.
    Buckets split when full; directory doubles if needed.
🔸 Advantage:
    Grows incrementally
    Avoids frequent rehashing
🔹 Linear Hashing
    Buckets are split incrementally over time.
    No directory is used.
    Splits are triggered by load factor.
🔸 Advantage:
    Simpler than extendible hashing
    Automatically adapts to data growth
🔹 Applications of Hashing in DBMS
    Hash-based indexing
    Hash join algorithms
    Partitioning large tables
    Fast equality-based search
🔹 Advantages of Hashing
✅ Constant time access: O(1)
✅ Efficient for large data
✅ Good for equality searches (e.g., Emp_ID = 101)
🔹 Disadvantages of Hashing
❌ Not suitable for range queries
❌ Collisions can degrade performance
❌ Rehashing is expensive in static hashing
📌 Summary Table:
 Feature                          Static Hashing                    Dynamic Hashing
 Buckets                          Fixed                             Grows/Shrinks
 Performance                      Degrades with size                Maintains efficiency
 Collision Handling               Chaining/Probing                  Bucket split/directory
                                                                    use
 Range Queries                    Not supported                     Not supported
B+ Tree in DBMS – Notes
🔹 What is a B+ Tree?
A B+ Tree is a balanced tree data structure used for indexing in databases and file systems. It
is a type of self-balancing search tree that maintains sorted data and allows searches,
sequential access, insertions, and deletions in logarithmic time.
🔹 Why Use B+ Tree in DBMS?
    Efficient for large-scale disk-based databases
    Ideal for range queries
    Used in multilevel indexing
🔹 Structure of B+ Tree
 1. Internal Nodes:
       Only store keys (not data)
       Directs search by guiding to the correct child node
 2. Leaf Nodes:
       Store keys + actual data pointers
       Linked together to form a linked list for sequential access
 3. Root Node:
       Can be a leaf node (if the tree is small) or an internal node
🔹 Properties of B+ Tree (Order n):
 Feature                                          Description
 Order (n)                                        Maximum number of children per node
 Internal node                                    Has at most n children and at least ⌈n/2⌉
 Leaf node                                        Contains ⌈(n−1)/2⌉ to n−1 keys
 Root node                                        Has minimum 2 children if not a leaf
 Balanced                                         All leaves appear on the same level
 Data stored in                                   Only leaf nodes, internal nodes hold keys for
                                                  navigation only
 Leaf node linkage                                Leaf nodes are connected via linked list for
                                                  range traversal
🔹 Operations in B+ Tree
✅ 1. Search
    Start at root and move down based on key comparison.
    Final search is in leaf node.
    Efficient due to balanced structure (Time: O(log n))
✅ 2. Insertion
    Insert key in proper leaf node.
    If overflow occurs:
        Split the leaf
        Promote key to parent
        May recursively split and grow the tree
✅ 3. Deletion
    Remove key from leaf.
    If underflow:
        Borrow from sibling, or
        Merge with sibling
        Update parent keys accordingly
🔹 Advantages of B+ Tree
 Benefit                                         Explanation
 Efficient search                                O(log n) due to balanced structure
 Fast range queries                              Leaf nodes linked for sequential traversal
 Space efficient                                 Internal nodes only store keys (not full
                                                 data)
 Good for disk-based access                      High fan-out reduces tree height (fewer
                                                 disk I/Os)
 Dynamic                                         Supports insertion/deletion efficiently
What is a B-Tree?
A B-Tree is a self-balancing tree data structure that maintains sorted data and allows
searches, sequential access, insertions, and deletions in logarithmic time. It is commonly
used in databases and file systems to efficiently store and retrieve large blocks of data.
Why B-Trees in DBMS?
    Databases often store data on disks, where accessing data is slow compared to memory.
    B-Trees minimize disk reads by maximizing the number of keys stored in each node,
    reducing the height of the tree.
    This helps optimize operations such as search, insert, update, and delete.
Characteristics of B-Trees
 1. Balanced tree: All leaf nodes are at the same level.
 2. Nodes have multiple keys: Unlike binary trees, a B-tree node can have more than two
    children.
 3. Order (m): Defines the maximum number of children each node can have.
 4. Node properties:
       Each node can have a minimum of ceil(m/2) children (except root).
       Each node can have a maximum of m children.
       The number of keys in each node is one less than the number of children.
 5. Keys in nodes are sorted, and child pointers separate key ranges.
Basic Operations in B-Tree
    Search: Similar to binary search but over multiple keys in a node.
   Insertion: Insert in the correct leaf node; if the node is full, split the node and propagate
   the split upwards.
   Deletion: Remove the key, and if underflow occurs (too few keys), redistribute keys or
   merge nodes to maintain properties.
Example (Order = 4)
   Each node can have up to 3 keys and 4 children.
   Minimum children per node (except root) = 2.
   Keys in node: [K1, K2, K3]
   Children: [P0, P1, P2, P3]
   Keys in subtree P0 < K1, keys in subtree P1 between K1 and K2, etc.
Use of B-Trees in DBMS
   Indexing large tables.
   Supporting range queries efficiently.
   Used in storage engines like MySQL's InnoDB.
   Handles large blocks of data with minimal disk I/O.
Summary
 Feature                                         Description
 Data Structure                                  Balanced multiway search tree
 Purpose                                         Efficient disk-based data
                                                 storage
 Node Capacity                                   Multiple keys and children
 Height                                          Low, to reduce disk reads
 Operations Time                                 O(log n) for search, insert,
                                                 delete
 Use Case                                        Indexing in databases
Heap File Organization (DBMS)
   Definition:
   A file organization method where records are stored in no particular order. New records
   are simply added at the end of the file (or in the first available space).
Key Characteristics
   Unordered storage: No sorting or indexing on records.
   Insertion: Fast, simply append record at the end.
   Search: Linear scan required to find a record.
   Deletion: Mark record as deleted (using a flag) or remove physically, which may require
   reorganization.
   Update: Locate record by scanning, then modify.
Advantages
   Simple and easy to implement.
   Efficient for bulk loading and when no queries require sorted access.
   Fast insertions because no order is maintained.
Disadvantages
   Inefficient for searching (requires full file scan).
   Not suitable for queries that require ordered data or fast retrieval.
   Deletion and updates may lead to fragmentation, requiring periodic reorganization.
Use Cases
   Small tables or temporary tables.
   When no indexing or search on specific fields is required.
   When insertions are more frequent than searches.
Summary Table
 Operation                                     Efficiency
 Insertion                                     Fast (append at end)
 Search                                        Slow (linear scan)
 Deletion                                      Slow (requires scanning)
 Update                                        Slow (search + modify)