🗂️ Indexing Structures for Files and Physical Database Design
Introduction
Indexes are used to speed up record retrieval in response to certain search
conditions.
Index structures provide secondary access paths.
Any field can be used to create an index.
Multiple indexes can be constructed.
Most indexes are based on ordered files.
Tree data structures organize the index.
� Types of Single-Level Ordered Indexes
Ordered index similar to index in a textbook.
Indexing field (attribute): Index stores each value of the index field with a list of
pointers to all disk blocks that contain records with that field value. Values in index
are ordered.
There are a few types of single-level ordered indexes:
Primary index: Specified on the ordering key field of ordered file of records.
Clustering index: Used if numerous records can have the same value for the
ordering field.
Secondary index: Can be specified on any nonordering field, and the data
file can have several secondary indexes.
Primary Indexes
Ordered file with two fields:
Primary key, K(i)
Pointer to a disk block, P(i)
One index entry in the index file for each block in the data file.
Indexes may be dense or sparse.
Dense index: Has an index entry for every search key value in the data file. Sparse
index: Has entries for only some search values.
Example: Suppose that we have an ordered file with r=300,000r=300,000 records,
a disk with block size B=4,096B=4,096 bytes. File records are of fixed size and are
unspanned. The record length R=100R=100 bytes.
The blocking factor for the
file bfr=⌊(B/R)⌋=⌊(4,096/100)⌋=40bfr=⌊(B/R)⌋=⌊(4,096/100)⌋=40 records per
block.
The number of blocks needed for the file
is b=⌈(r/bfr)⌉=⌈(300,000/40)⌉=7,500b=⌈(r/bfr)⌉=⌈(300,000/40)⌉=7,500 blocks
.
A binary search on the data file would need
approximately ⌈log2b⌉=⌈(log27,500)⌉=13⌈log2b⌉=⌈(log27,500)⌉=13 block
accesses.
If the size key field is V=9V=9 bytes and block pointer =6=6 bytes:
The size of each index entry is (9+6)=15(9+6)=15 bytes.
The blocking factor for the index
is bfr=⌊(B/Ri)⌋=⌊(4,096/15)⌋=273bfr=⌊(B/Ri)⌋=⌊(4,096/15)⌋=273 entries
per block.
The total number of index entries is equal to the number of blocks in
the data file, which is 7,500.
The number of index blocks is hence bi=⌈(ri/bfri)⌉=⌈(7,500/273)⌉=28bi
=⌈(ri/bfri)⌉=⌈(7,500/273)⌉=28 blocks.
A binary search on the index file would
need ⌈(log2bi)⌉=⌈(log228)⌉=5⌈(log2bi)⌉=⌈(log228)⌉=5 block accesses.
We need one additional block access to the data file for a total
of 5+1=65+1=6 block accesses.
Major Problem with Primary Indexes: Insertion and deletion of records can be
difficult because you have to move records around and change index values. With
a primary index, the problem is compounded because if you insert a record in its
correct position in the data file, you must not only move records to make space
for the new record but also change some index entries.
Solutions:
Use unordered overflow file
Use linked list of overflow records
Clustering Indexes
Clustering field: File records are physically ordered on a nonkey field
without a distinct value for each record.
Ordered file with two fields:
Same type as clustering field
Disk block pointer
Secondary Indexes
Provide secondary means of accessing a data file. Some primary access
exists. The data file records could be ordered, unordered, or hashed.
Ordered file with two fields:
Indexing field, K(i)
Block pointer or record pointer, P(i)
Usually need more storage space and longer search time than primary
index.
Improves search time for arbitrary record.
🌲 Multilevel Indexes
Designed to greatly reduce remaining search space as search is conducted.
Index file
Considered first (or base level) of a multilevel index.
Second level
Primary index to the first level
Third level
Primary index to the second level
🌳 Dynamic Multilevel Indexes Using B-Trees and B+ -Trees
Tree data structure terminology:
Tree is formed of nodes.
Each node (except root) has one parent and zero or more child
nodes.
Leaf node has no child nodes.
Unbalanced if leaf nodes occur at different levels.
Nonleaf node called internal node.
Subtree of node consists of node and all descendant nodes.
Search Trees and B-Trees
Search tree used to guide search for a record given value of one of record’s
fields.
Algorithms necessary for inserting and deleting search values into and from
the tree.
B-Trees
Provide multi-level access structure.
Tree is always balanced.
Space wasted by deletion never becomes excessive.
Each node is at least half-full.
Each node in a B-tree of order pp can have at most p−1p−1 search values.
B+ -Trees
Data pointers stored only at the leaf nodes.
Leaf nodes have an entry for every value of the search field, and a data
pointer to the record if search field is a key field.
For a nonkey search field, the pointer points to a block containing
pointers to the data file records.
Internal nodes
Some search field values from the leaf nodes repeated to guide
search.
Searching for a Record With Search Key Field Value K, Using a B+ -Tree
Algorithm 17.2 details the method for searching for a record with search key field
value KK, using a B+ -Tree.
🔑 Indexes on Multiple Keys
Multiple attributes involved in many retrieval and update requests.
Composite keys: Access structure using key value that combines attributes.
Partitioned hashing: Suitable for equality comparisons.
Grid files: Array with one dimension for each search attribute
⚙️ Other Types of Indexes
Hash Indexes
Secondary structure for file access.
Uses hashing on a search key other than the one used for the primary data
file organization.
Index entries of form (K,P)(K,P) or (K,P)(K,P):
PP : pointer to the record containing the key.
PP: pointer to the block containing the record for that key.
Bitmap Indexes
Used with a large number of rows.
Creates an index for one or more columns.
Each value or value range in the column is indexed.
Built on one particular value of a particular field.
Array of bits
Existence bitmap
Bitmaps for B+ -tree leaf nodes
Function-Based Indexing
Value resulting from applying some function on a field (or fields) becomes
the index key.
Introduced in Oracle relational DBMS. Example:
Function UPPER(Lname) returns uppercase representation.
❗ Some General Issues Concerning Indexing
Physical index:
Pointer specifies physical record address.
Disadvantage: pointer must be changed if record is moved.
Logical index:
Used when physical record addresses expected to change frequently.
Entries of the form (K,Kp)(K,Kp).
Index Creation
General form of the command to create an index:
CREATE INDEX index_name ON table_name (column1, column2, ...);
UNIQUE and CLUSTER keywords optional.
Order can be ASC or DESC.
Secondary indexes can be created for any primary record organization.
Complements other primary access methods.
Indexing of Strings
Strings can be variable length.
Strings may be too long, limiting the fan-out.
Prefix compression: Stores only the prefix of the search key adequate to
distinguish the keys that are being separated and directed to the subtree.
Tuning Indexes
Tuning goals
Dynamically evaluate requirements
Reorganize indexes to yield best performance
Reasons for revising initial index choice
Certain queries may take too long to run due to lack of an index.
Certain indexes may not get utilized.
Certain indexes may undergo too much updating if based on an
attribute that undergoes frequent changes.
Additional Issues Related to Storage of Relations and Indexes
Enforcing a key constraint on an attribute
Reject insertion if new record has same key attribute as existing
record.
Duplicates occur if index is created on a nonkey field.
Fully inverted file: Has secondary index on every field.
Indexing hints in queries: Suggestions used to expedite query execution.
Column-based storage of relations:
Alternative to traditional way of storing relations by row.
Offers advantages for read-only queries.
Offers additional freedom in index creation.
💾 Physical Database Design in Relational Databases
Physical design goals
Create appropriate structure for data in storage
Guarantee good performance
Must know job mix for particular set of database system applications
Analyzing the database queries and transactions
Information about each retrieval query
Information about each update transaction
Analyzing the expected frequency of invocation of queries and transactions
Expected frequency of using each attribute as a selection or join
attribute
80-20 rule: 80 percent of processing accounted for by only 20
percent of queries and transactions
Analyzing the time constraints of queries and transactions
Selection attributes associated with time constraints are candidates
for primary access structures
Analyzing the expected frequency of update operations
Minimize number of access paths for a frequently-updated file
Updating the access paths themselves slows down update operations
Analyzing the uniqueness constraints on attributes
Access paths should be specified on all candidate key attributes that
are either the primary key of a file or unique attributes
Physical Database Design Decisions
Design decisions about indexing
Whether to index an attribute
Attribute is a key or used by a query
What attribute(s) to index on
Single or multiple
Whether to set up a clustered index
One per table
Whether to use a hash index over a tree index
Hash indexes do not support range queries
Whether to use dynamic hashing
Appropriate for very volatile files
📝 Summary
Indexes are access structures that improve efficiency of record retrieval
from a data file
Ordered single-level index types
Primary, clustering, and secondary
Multilevel indexes can be implemented as B-trees and B+ -trees
Dynamic structures
Multiple key access methods
Logical and physical indexes