OVERVIEW OF
STORAGE AND
INDEXING
When you database storage and indexing, you
might think
"This sounds like a fancy way of saying 'how
do we put stuff away so we can find it
again?'"
That's exactly what it is!
But as always, the devil is in the details.
Think about your bedroom.
If you just throw everything on the floor, sure, it's
all there
but good luck finding your favorite socks
when you're late for work
But if you organize things
socks in one drawer
shirts in another
books on shelves arranged by topic
suddenly finding things becomes much
easier
That's indexing.
It's the art of organized laziness.
WHY CAN'T WE JUST SEARCH
EVERYTHING?
Imagine you have a library with a million books
and someone asks you to find every book
that mentions "quantum mechanics"
If the books are just randomly scattered on
shelves, you'd have to look through every
single book.
That's a million books to check!
If each book takes you just one second to check
(and that's being generous)
You'd need about 11.5 days of non-stop
searching.
But wait - what if ten people come in with
different requests?
Now you're looking at 115 days of work.
This doesn't scale, as they say in the
computer business.
This is exactly the problem databases face.
Data storage is cheap.
We can throw terabytes of information onto
disks without breaking a sweat.
But finding specific information?
THE PHYSICAL REALITY WE
CANNOT IGNORE
Before we get too abstract, let's talk about the
physical world.
When you store data, it doesn't just float around
in some mystical digital cloud.
It sits on actual, physical devices: spinning
disks, flash memory, magnetic tapes.
HARD DISK DRIVES
THE PATIENT ELEPHANTS
A hard disk drive is like a very patient elephant
with an excellent memory but terrible knees.
It can remember enormous amounts of
information
but it takes time to lumber over to where the
information is stored
Picture a record player, because that's essentially
what a hard drive is.
You have a spinning disk (or several of them)
and a read head that moves back and forth.
To read data, the head has to move to the
right track (seek time)
wait for the right spot to spin around
(rotational latency)
and then read the data sequentially
If your data is scattered all over the disk
the poor read head is dancing back and forth
like a frantic conductor
But if related data is stored together
the head can read it all in one smooth
motion
Sequential reads are roughly 100 times faster than
random reads on traditional hard drives.
That's not just a little better
that's like difference between walking
and taking a jet plane
SOLID STATE DRIVES
THE CAFFEINATED CHEETAHS
SSDs are different beasts.
They're like caffeinated cheetahs, incredibly fast
but they get tired if you work them too hard
and they're more expensive to feed
With SSDs, there's no mechanical movement.
Data access is nearly instantaneous regardless of
where it's stored.
But SSDs have their own quirks
they can only be written to a limited number
of times
and they prefer to work with data in blocks
MEMORY
THE BRILLIANT BUT FORGETFUL FRIEND
RAM is like having a brilliant friend who can
instantly recall anything you've told them
recently.
but completely forgets everything when they
go to sleep
It's incredibly fast.
thousands of times faster than even SSDs
but it's volatile and expensive
We want to keep frequently accessed data in the
fastest storage available.
and less frequently accessed data in cheaper,
slower storage
FILE ORGANIZATION
Now that we understand our storage devices, let's
talk about how we organize data on them.
There are several approaches, each with its own
trade-offs.
HEAP FILES
THE JUNK DRAWER APPROACH
A heap file is like that junk drawer/box.
New records just get thrown in wherever there's
space.
It's simple and fast for insertions, just find any
empty spot and plop the data down.
But searching?
That's where heap files show their weakness.
To find a specific record, you might have to look
through the entire file.
It's like searching for your passport in that junk
drawer, you might get lucky and find it right away,
or you might have to empty the entire drawer
onto the counter.
SORTED FILES
Sorted files are like having a perfectly organized
library.
Every book is in exactly the right place according
to some ordering (say, by author's last name).
This makes finding a specific item much faster,
you can use binary search, which is logarithmic in
time complexity.
If you have a million records, you can find any
specific record in at most 20 steps.
But there's a catch.
Maintaining sorted order is expensive.
Every time you want to insert a new record, you
might have to move thousands of other records to
make room.
It's like trying to insert a new book into a tightly
packed bookshelf - you might have to shift half
the books to make space.
CLUSTERED VS. UNCLUSTERED
ORGANIZATION
Think of clustered organization like a filing
cabinet where related documents are physically
stored together.
If you're looking for all invoices from January
2023, they're all in the same drawer, in the same
section.
Unclustered organization is like having a card
catalog that tells you where to find each
document
but the actual documents might be scattered
all over the building
The catalog is fast to search, but then you have to
run all over the place collecting the actual
documents.
INDEXING
An index is like having a really smart friend who
knows where everything is.
You don't ask them to get the thing for you; you
ask them where to find it.
THE PHONE BOOK PRINCIPLE
Remember phone books?
Phone books were examples of indexing.
You wanted to find Taylor Batumbakal's phone
number.
You didn't start at page 1 and read every entry
until you found Taylor Batumbakal.
Instead, you flipped to the B section, then to Ba,
then to Batumbakal, and finally to Taylor
Batumbakal.
Each step eliminated vast portions of the search
space.
This is what a B+ tree index does
but instead of alphabetical order, it uses
whatever ordering makes sense for your
data.
B+ TREES
B+ trees are like multi-level phone books.
The top level gives you broad categories
the next level breaks those down into smaller
categories
and so on, until you get to the actual data.
A B+ tree is a variant of a B tree where all data is in
the leaf nodes
and the leaf nodes are also a linked list
A B tree is like a binary tree but with more than
two nodes per level
They're always balanced.
No matter how much data you add or remove, the
tree maintains its shape.
This means that finding any piece of data always
takes roughly the same amount of time.
Think of it like a perfectly balanced pyramid.
Whether you're looking for something that should
be on the left side or the right side
you always have to climb the same number
of levels to get there.
HASH INDEXES
Hash indexes are like having a magical filing
system where you whisper the name of what
you're looking for
and a genie (almost) instantly hands it to you
Here's how they work:
you take your search key (say, a customer ID)
run it through a mathematical function called
a hash function, and that tells you exactly
which bucket to look in.
If the hash function is good and your data is
distributed evenly, you can find any record in
constant time
BITMAP INDEXES
Bitmap indexes are elegant in their simplicity.
For each possible value of a field, you create a
bitmap
a string of 1s and 0s indicating which records
have that value.
Let's say you have a million customer records, and
you want to index the "gender" field.
You create one bitmap for "Male" and one for
"Female".
The Male bitmap might look like: 1001011010...
where 1 means "this record is Male" and 0 means
"this record is not Male".
Bitmaps are also used in chess engines.
The thing about bitmaps is that you can combine
them using simple boolean operations.
Want to find all male customers in California?
Just take the Male bitmap AND the California
bitmap.
The computer can do these operations incredibly
quickly because they work at the bit level.
THE TRADE-OFFS
THERE'S NO FREE LUNCH
In database systems, there are fundamental
trade-offs.
You can't optimize everything at once.
SPACE VS. TIME
Indexes speed up queries, but they take up space.
More importantly, they slow down writes
because every time you modify data, you
might have to update multiple indexes.
It's like having multiple copies of your address
book
one sorted by name, one by phone number,
one by address.
Finding information is fast because you can use
whichever copy is most convenient.
But every time someone moves or changes
their phone number, you have to update all
the copies.
READ VS. WRITE PERFORMANCE
Systems optimized for reading (like data warehouses)
often use different strategies than systems optimized
for writing (like transaction processing systems).
For read-heavy systems
you might create lots of indexes
store data in sorted order
and use compression
For write-heavy systems
you might minimize indexes
accept some disorder in exchange for faster
insertions
and avoid expensive maintenance operations
CONSISTENCY VS. PERFORMANCE
In distributed systems:
you can have strong consistency (everyone
sees the same data at the same time)
or high performance
but it's very difficult to have both
It's like trying to keep a group of friends perfectly
synchronized when they're scattered around the
world.
The more you insist on perfect synchronization
the more time you spend coordinating
and the less time you spend actually doing
useful work
CONCLUSION
All the complexity ultimately comes down to a few
simple principles:
1. Locality matters: Keep related things close
together
2. Prediction helps: If you can guess what will be
needed next, you can prepare for it
3. Trade-offs are inevitable: You can't optimize
everything at once
4. Simple ideas scale: The best solutions are often
elegant applications of basic principles
The next time someone tries to impress you with
fancy database terminology
remember that they're really just talking
about organized ways to put things away so
you can find them again.
Everything else is just details
important details
fascinating details
but details nonetheless