
Lecture #5: Storage Models & Compression

15-445/645 Database Systems (Spring 2025)


https://15445.courses.cs.cmu.edu/spring2025/
Carnegie Mellon University
Jignesh Patel

1 Database Workloads
OLTP: Online Transaction Processing
An OLTP workload is characterized by fast, short-running, repetitive operations and simple queries that operate on a single entity at a time. OLTP workloads typically handle more writes than reads, and only read/update a small amount of data each time.
An example of an OLTP workload is the Amazon storefront. Users can add items to their cart and make purchases, but these actions only affect their own account.

OLAP: Online Analytical Processing


An OLAP workload is characterized by long-running, complex queries (which often involve computing aggregates) and reads on large portions of the database. In OLAP workloads, the database system is often analyzing and deriving new data from existing data collected on the OLTP side.
An example of an OLAP workload that analyzes data collected on the OLTP side is personalized Amazon shopping ads. The website analyzes all of the data collected from users' carts and purchases, and then selects different ads for different users.

HTAP: Hybrid Transaction + Analytical Processing


A new type of workload (which has become popular recently) is HTAP, where OLTP and OLAP workloads
are present together on the same database instance.
Workload Focus:
• OLTP - Write Heavy
• OLAP - Read Heavy
• HTAP - A mix of both of the above

2 Storage Models
There are different ways to store tuples in pages.

N-Ary Storage Model (NSM)


In the n-ary storage model, the DBMS stores all of the attributes for a single tuple (row) contiguously in a single page. Thus, it is also known as a "row store". This approach is ideal for OLTP workloads where requests are write-heavy and accesses mostly touch individual entities. It is ideal because it takes only one fetch to get all of the attributes for a single tuple. NSM pages are typically some constant multiple of 4KB hardware pages.
Advantages:
• Fast inserts, updates, and deletes.
• Good for queries that need the entire tuple (OLTP).
• Can use index-oriented physical storage for clustering.
Disadvantages:
• Inefficient for scanning large portions of the table and/or a subset of the attributes.
• Poor memory locality in access patterns.
• Difficult to apply compression because of multiple value domains within a single page.

Decomposition Storage Model (DSM)


In the decomposition storage model, the DBMS stores a single attribute (column) for all tuples contiguously
in a block of data. Thus, it is also known as a “column store”. This model is ideal for OLAP workloads
with many read-only queries that perform large scans over a subset of the table’s attributes.
Advantages:
• Reduces the amount of I/O wasted per query because the DBMS only reads the data that it needs for
that query.
• Better (faster) query processing because of increased locality and cached data reuse.
• Better data compression.
Disadvantages:
• Slow for point queries, inserts, updates, and deletes because of tuple splitting/stitching.
To put the tuples back together when using a column store, there are two common approaches:
• The most commonly used approach is fixed-length offsets (see the sketch after this list). Here, the value at a given offset in one column belongs to the same tuple as the value at the same offset in every other column. Therefore, every value within a column must be the same length.
• A less common approach is to use embedded tuple ids. Here, for every attribute in the columns,
the DBMS stores a tuple id (ex: a primary key) with it. Then, the system would also store a mapping
to tell it how to jump to every attribute that has that id. Note that this method has a large storage
overhead because it needs to store a tuple id for every attribute entry.
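
To make the fixed-length offset approach concrete, here is a minimal Python sketch (not part of the lecture notes; the column lists and the fetch_tuple helper are purely illustrative). Each column stores its values at matching offsets, so stitching a tuple back together is just an indexed read into every column:

```python
# Illustrative only: three columns of a table stored separately, values aligned
# by offset. name_col values are padded to a fixed length of 3 characters.
id_col   = [101, 102, 103]
name_col = ["Al ", "Bo ", "Cy "]
age_col  = [30, 41, 25]

def fetch_tuple(offset):
    """Stitch tuple `offset` back together by reading the same offset in each column."""
    return (id_col[offset], name_col[offset].rstrip(), age_col[offset])

print(fetch_tuple(1))  # (102, 'Bo', 41)
```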

Partition Attributes Across (PAX)


In the hybrid Partition Attributes Across storage model, the DBMS vertically partitions attributes within
a database page. The goal of doing so is to get the benefit of faster processing on columnar storage while
retaining the spatial locality benefits of row storage.

In PAX, the rows are horizontally partitioned into groups of rows. Within each row group, the attributes
are vertically partitioned into columns. Each row group is similar to a column store for its subset of the
rows.
A PAX file has a global header containing a directory with offsets to the file’s row groups, and each row
group maintains its own header with meta-data about its contents.
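
As a rough illustration (not from the lecture notes; the two-attribute table and the ROWS_PER_GROUP constant are made up), the following Python sketch shows how tuples might be split into row groups and then stored column-wise within each group:

```python
# Illustrative PAX-style layout: horizontal partitioning into row groups,
# vertical partitioning into columns within each group.
ROWS_PER_GROUP = 2  # hypothetical row-group size

tuples = [(1, "a"), (2, "b"), (3, "c"), (4, "d"), (5, "e")]

row_groups = []
for start in range(0, len(tuples), ROWS_PER_GROUP):
    group = tuples[start:start + ROWS_PER_GROUP]
    # Within the group, store each attribute contiguously (a mini column store).
    row_groups.append({
        "header": {"num_rows": len(group)},
        "col0": [t[0] for t in group],
        "col1": [t[1] for t in group],
    })

# A scan of col0 touches only the col0 chunk inside each row group.
print([v for g in row_groups for v in g["col0"]])  # [1, 2, 3, 4, 5]
```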

3 Database Compression
Compression is widely used in disk-based DBMSs as disk I/O is (almost) always the main bottleneck. It
is especially popular in systems that have read-only analytical workloads. The DBMS can fetch more useful tuples per I/O if they have been compressed beforehand, at the cost of greater computational overhead for compression and decompression.
The trade-off is different for in-memory DBMSs, since they do not have to fetch data from disk to execute a query. Memory is much faster than disk, so these systems must strike a balance between speed and compression ratio; compressing the database reduces DRAM requirements and may also decrease CPU costs during query execution.
Given this, we want a database compression scheme to have the following properties:
• Must produce fixed-length values. The only exception is variable length data stored in separate pools.
This is because the DBMS should follow word-alignment and be able to access data using offsets.
• Allow the DBMS to postpone decompression as long as possible during query execution (late ma-
terialization).
• Must be a lossless scheme because people do not like losing data. Any kind of lossy compression
has to be performed at the application level.
If data sets were completely random bits, there would be no way to perform compression. However, there
are key properties of real-world data sets that are amenable to compression:
• Highly skewed distributions for attribute values (e.g., Zipfian distribution of the Brown Corpus).
• High correlation between attributes of the same tuple (e.g., Zip Code to City, Order Date to Ship
Date).

Compression Granularity
The kind of data we want to compress greatly affects which compression schemes can be used. There are
four levels of compression granularity:
• Block Level: Compress a block of tuples for the same table.
• Tuple Level: Compress the contents of the entire tuple (NSM only).
• Attribute Level: Compress a single attribute value within one tuple. Can target multiple attributes
for the same tuple.
• Columnar Level: Compress multiple values for one or more attributes stored for multiple tuples
(DSM only). This allows for more complicated compression schemes.


4 Naive Compression
The DBMS compresses data using a general purpose algorithm (e.g., gzip, LZO, LZ4, Snappy, Brotli, Oracle
OZIP, Zstd). Although there are several compression algorithms that the DBMS could use, engineers often
choose ones that provide a lower compression ratio in exchange for faster compression/decompression.
An example of using naive compression is in MySQL InnoDB. The DBMS compresses disk pages, pads
them to a power of two kilobytes, and stores them in the buffer pool. However, every time the DBMS tries to read/update data, the compressed data in the buffer pool must first be decompressed. For blind writes, no decompression is required.
Since accessing data requires decompression of compressed data, this limits the scope of the compression
scheme. If the goal is to compress the entire table into one giant block, using naive compression schemes
would be impractical since the whole table would need to be compressed/decompressed for every access. Therefore, MySQL breaks the table into smaller chunks to keep the compression scope manageable.
Another problem is that these naive schemes also do not consider the high-level meaning or semantics
of the data. The algorithm is oblivious to both the structure of the data, and how the query is planning
to access the data. Thus, this eliminates the opportunity to utilize late materialization, since the DBMS
cannot tell when it can delay the decompression of data.
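
The following hedged sketch illustrates the idea with Python's zlib module standing in for whatever general-purpose codec the DBMS uses; the serialized "page" is fabricated for the example:

```python
import zlib

# A "page" here is just a bytes blob of serialized tuples; a DBMS would have
# to decompress it before it can read or update any tuple inside it.
page = b"|".join(f"tuple-{i},value-{i % 10}".encode() for i in range(1000))

compressed = zlib.compress(page, 6)
print(len(page), "->", len(compressed))  # repetitive data compresses well

# Any read of a single tuple still pays for decompressing the whole page.
restored = zlib.decompress(compressed)
assert restored == page
```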

5 Columnar Compression
Run-Length Encoding (RLE)
RLE compresses runs (consecutive instances) of the same value in a single column into triplets:
• The value of the attribute
• The start position in the column segment (offset)
• The number of elements in the run (length)
The DBMS should sort the columns intelligently beforehand to maximize compression opportunities. This
clusters duplicate values, thereby increasing the compression ratio. Note that the effectiveness of RLE greatly depends on the underlying data characteristics (e.g., how many duplicate values there are and how often they appear in consecutive positions).
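
A minimal sketch of RLE in Python follows (illustrative only; the column data and helper names are not from the lecture):

```python
# Compress a column into (value, start_offset, run_length) triplets.
def rle_encode(column):
    triplets, start = [], 0
    for i in range(1, len(column) + 1):
        if i == len(column) or column[i] != column[start]:
            triplets.append((column[start], start, i - start))
            start = i
    return triplets

def rle_decode(triplets):
    out = []
    for value, _start, length in triplets:
        out.extend([value] * length)
    return out

col = ["A", "A", "A", "B", "B", "A", "C", "C", "C", "C"]
encoded = rle_encode(col)
print(encoded)  # [('A', 0, 3), ('B', 3, 2), ('A', 5, 1), ('C', 6, 4)]
assert rle_decode(encoded) == col
```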

Bit-Packing Encoding
When all values for an attribute are smaller than the attribute's declared maximum size, store them with fewer bits.
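
For example, values declared as 64-bit integers that all happen to fit in 8 bits can be stored one byte each. A small illustrative sketch using Python's struct module (the values are made up):

```python
import struct

values = [3, 17, 255, 42]  # declared as BIGINT, but every value fits in one byte

packed = struct.pack(f"{len(values)}B", *values)        # 4 bytes instead of 32
unpacked = list(struct.unpack(f"{len(packed)}B", packed))
assert unpacked == values
print(len(values) * 8, "bytes ->", len(packed), "bytes")
```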

Mostly Encoding
A bit-packing variant that uses a special marker to indicate when a value exceeds the largest size, and maintains a separate look-up table to store those values. Use this when values are "mostly" less than the largest size.
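
A rough sketch of the idea, assuming a hypothetical sentinel value and a per-position overflow table (neither detail is from the lecture):

```python
MARKER = 255  # hypothetical sentinel meaning "value stored in the overflow table"

values = [3, 17, 42, 1_000_000, 7]
packed, overflow = [], {}
for pos, v in enumerate(values):
    if v < MARKER:
        packed.append(v)          # small value stored in one byte
    else:
        packed.append(MARKER)     # too big: leave a marker...
        overflow[pos] = v         # ...and keep the real value in a look-up table

decoded = [overflow.get(pos, v) for pos, v in enumerate(packed)]
assert decoded == values
```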

Bitmap Encoding
The DBMS stores a separate bitmap for each unique value of a particular attribute where an offset in the
vector corresponds to a tuple. The ith position in the bitmap corresponds to the ith tuple in the table and
indicates whether that value is present in the attribute of that tuple. The bitmap is typically segmented
into chunks to avoid allocating large blocks of contiguous memory.


This approach is only practical if the value cardinality is low, since the size of the bitmap is linearly pro-
portional to the cardinality of the attribute value. If the cardinality of the value is high, then the bitmap
can become larger than the original data set.
There are compressed bitmap data structures for sparse data sets (for example, Roaring Bitmaps).
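
A minimal illustrative sketch of bitmap encoding (the column and its values are made up):

```python
# One bitmap per distinct value of the attribute; bit i says whether tuple i
# has that value.
column = ["M", "F", "M", "M", "F"]

bitmaps = {}
for i, value in enumerate(column):
    bits = bitmaps.setdefault(value, [0] * len(column))
    bits[i] = 1

print(bitmaps)  # {'M': [1, 0, 1, 1, 0], 'F': [0, 1, 0, 0, 1]}

# A predicate like value = 'F' becomes a scan of a single bitmap.
matching_tuples = [i for i, bit in enumerate(bitmaps["F"]) if bit]
print(matching_tuples)  # [1, 4]
```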

Delta Encoding
Instead of storing exact values, record the difference between values that follow each other in the same
column. The base value can be stored in-line or in a separate look-up table. We can also use RLE on the
stored deltas to get even better compression ratios.
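
A small illustrative sketch of delta encoding (the readings are made up; a real system would additionally bit-pack or RLE the deltas):

```python
# Slowly changing sequences yield tiny deltas that need few bits.
readings = [99, 100, 101, 101, 102, 103]

base = readings[0]
deltas = [readings[i] - readings[i - 1] for i in range(1, len(readings))]
print(base, deltas)  # 99 [1, 1, 0, 1, 1]

# Decode by cumulatively re-applying the deltas to the base value.
decoded = [base]
for d in deltas:
    decoded.append(decoded[-1] + d)
assert decoded == readings
```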

Incremental Encoding
This is a type of delta encoding whereby common prefixes or suffixes and their lengths are recorded so
that they need not be duplicated. This works best with sorted data.
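
A minimal sketch of prefix-based incremental encoding over a sorted list of strings (illustrative only; the word list is made up):

```python
# Store only the length of the shared prefix with the previous value plus the
# new suffix; sorted data maximizes the shared prefixes.
words = ["rob", "robbed", "robbing", "robot"]

encoded = []
prev = ""
for w in words:
    common = 0
    while common < min(len(prev), len(w)) and prev[common] == w[common]:
        common += 1
    encoded.append((common, w[common:]))
    prev = w

print(encoded)  # [(0, 'rob'), (3, 'bed'), (4, 'ing'), (3, 'ot')]

# Decode by reusing the prefix of the previously decoded value.
decoded, prev = [], ""
for prefix_len, suffix in encoded:
    prev = prev[:prefix_len] + suffix
    decoded.append(prev)
assert decoded == words
```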

Dictionary Compression
The most common database compression scheme is dictionary encoding. The DBMS replaces frequent
patterns in values with smaller codes. It then stores only these codes and a data structure (i.e. the dictio-
nary) that maps these codes to their original value. A dictionary compression scheme needs to support
fast encoding/decoding, as well as range queries.
Encoding and Decoding: The dictionary needs to decide how to encode (convert uncompressed value
into its compressed form) and decode (convert compressed value back into its original form) data. As
such, it is not possible to use hash functions, since a hash cannot be reversed to recover the original value.
The encoded values also need to support sorting in the same order as the original values (i.e., order-preserving encodings). This ensures that queries run on compressed data return results consistent with the same queries run on the original uncompressed data. This order-preserving property allows operations to be performed directly on the codes, so certain queries (such as string matching) can be executed faster.
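
A minimal sketch of an order-preserving dictionary (illustrative only; a real DBMS would use a more compact dictionary structure and fixed-width codes):

```python
# Sorted distinct values get increasing codes, so range predicates can be
# evaluated directly on the codes.
column = ["Berkeley", "Amherst", "Pittsburgh", "Amherst", "Berkeley"]

dictionary = {v: code for code, v in enumerate(sorted(set(column)))}
reverse = {code: v for v, code in dictionary.items()}

encoded = [dictionary[v] for v in column]
print(dictionary)  # {'Amherst': 0, 'Berkeley': 1, 'Pittsburgh': 2}
print(encoded)     # [1, 0, 2, 0, 1]

# WHERE city >= 'Berkeley' can be evaluated as code >= dictionary['Berkeley'].
threshold = dictionary["Berkeley"]
print([reverse[c] for c in encoded if c >= threshold])  # ['Berkeley', 'Pittsburgh', 'Berkeley']
```
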
Note: Columnar compression schemes work best with read-heavy workloads and may need additional support for writes.
