05 Storage3
05 Storage3
Systems
            Storage Models &
            Data Compression
15-445/645 FALL 2024                    PROF. ANDY PAVLO
                                              ADMINISTRIVIA
                                Homework #2 is due Sept 22nd @ 11:59pm
                                                  LAST CLASS
                                We discussed storage architecture alternatives to
                                tuple-oriented scheme.
                                → Log-structured storage
                                → Index-organized storage
                                           TODAY'S AGENDA
                                Database Workloads
                                Storage Models
                                Data Compression
                                DB Flash Talk: StarTree
                                          DATABASE WORKLOADS
                                On-Line Transaction Processing (OLTP)
                                → Fast operations that only read/update a small amount of
                                  data each time.
                                                          DATABASE WORKLOADS
                                               Complex
                        Operation Complexity
                                                                                  OLAP
                                                                                                     Jim Gray
Jim Gray
                                                              OLTP
                                                Simple
                                                         Write-Heavy                    Read-Heavy
                                                          DATABASE WORKLOADS
                                               Complex
                        Operation Complexity
                                                                              OLAP
                                                                          HTAP
                                                              OLTP
                                                Simple
                                                         Write-Heavy                    Read-Heavy
WIKIPEDIA EXAMPLE
                                                OBSERVATION
                                The relational model does not specify that the
                                DBMS must store all a tuple's attributes together in
                                a single page.
                                                            OLTP
                        On-line Transaction Processing:
                                                                       SELECT   P.*, R.*
                        → Simple queries that read/update a small        FROM   pages AS P
                          amount of data that is related to a single    INNER   JOIN revisions AS R
                          entity in the database.                          ON   P.latest = R.revID
                                                                        WHERE   P.pageID = ?
                        This is usually the kind of application
                        that people build first.                       UPDATE useracct
                                                                          SET lastLogin = NOW(),
                                                                              hostname = ?
                                                                        WHERE userID = ?
                                                           OLAP
                        On-line Analytical Processing:                  SELECT COUNT(U.lastLogin),
                        → Complex queries that read large portions             EXTRACT(month FROM
                          of the database spanning multiple entities.               U.lastLogin) AS month
                                                                          FROM useracct AS U
                                                                         WHERE U.hostname LIKE '%.gov'
                        You execute these workloads on the               GROUP BY
                                                                          EXTRACT(month FROM U.lastLogin)
                        data you have collected from your
                        OLTP application(s).
                                               STORAGE MODELS
                                A DBMS's storage model specifies how it
                                physically organizes tuples on disk and in memory.
                                → Can have different performance characteristics based on
                                  the target workload (OLTP vs. OLAP).
                                → Influences the design choices of the rest of the DBMS.
                                                                  Database Page
                                                                                  header
header a0 b0 c0
                                                                  Database Page
                                                                                   header
                                                                                                          header a1
                                                                                  b1    c1      header a0 b0    c0
                                                                  Database Page
                                                                                    header
                                                                                  header a5    b5 c5  header a4
                                                                                   b4   c4  header a3 b3    c3
                                                                                  header a2 b2    c2  header a1
                                                                                   b1   c1  header a0 b0    c0
header - - - - -
                        Disk
5-445/645 (Fall 2024)
                                                             header   userID
                                                                        -    userName
                                                                                -     userPass
                                                                                         -     hostname
                                                                                                  -       lastLogin
                                                                                                              -
                        Disk
5-445/645 (Fall 2024)
                        Disk
5-445/645 (Fall 2024)
                        Disk
5-445/645 (Fall 2024)
                        Disk
5-445/645 (Fall 2024)
                        Disk
5-445/645 (Fall 2024)
                                                                            Useless Data!
       15-445/645 (Fall 2024)
                                                                                               24
                                                 NSM: SUMMARY
                                Advantages
                                → Fast inserts, updates, and deletes.
                                → Good for queries that need the entire tuple (OLTP).
                                → Can use index-oriented physical storage for clustering.
                                Disadvantages
                                → Not good for scanning large portions of the table and/or a
                                  subset of the attributes.
                                → Terrible memory locality in access patterns.
                                → Not ideal for compression because of multiple value
                                  domains within a single page.
                        length values.
                                                                                   Row #1     a1      b1         c1
                                                                                   Row #2     a2      b2         c2
                        → Most systems identify unique physical                    Row #3     a3      b3         c3
                          tuples using offsets into these arrays.                  Row #4     a4      b4         c4
                        → Need to handle variable-length values…                   Row #5     a5      b5         c5
                                                                    Page #1
                                                                               header              null bitmap
                        length values.
                                                                                           Row #1     a1      b1         c1
                                                                                           Row #2     a2      b2         c2
                        → Most systems identify unique physical                            Row #3     a3      b3         c3
                          tuples using offsets into these arrays.                          Row #4     a4      b4         c4
                        → Need to handle variable-length values…                           Row #5     a5      b5         c5
                                                                    Page #2 Page #1
                                                                                       header              null bitmap
                        length values.
                                                                                            Row #1      a1      b1         c1
                                                                                            Row #2      a2      b2         c2
                        → Most systems identify unique physical                             Row #3      a3      b3         c3
                          tuples using offsets into these arrays.                           Row #4      a4      b4         c4
                        → Need to handle variable-length values…                            Row #5      a5      b5         c5
                                                                    Page #2 Page #1
                                                                                       header                null bitmap
                                                                    Page #3
                                                                                       c0        c1      c2         c3      c4
5-445/645 (Fall 2024)
                                                                                       c5
       15-445/645 (Fall 2024)
                                                                                             29
                                  Disadvantages
                                  → Slow for point queries, inserts, updates, and deletes
                                    because of tuple splitting/stitching/reorganization.
                                                  OBSERVATION
                                OLAP queries almost never access a single column
                                in a table by itself.
                                → At some point during query execution, the DBMS must get
                                  other columns and stitch the original tuple back together.
                                But we still need to store data in a columnar format
                                to get the storage + execution benefits.
                                                                                                                              Row Group
                                                                         Chunk
                                                                                              Row Group Meta-Data
                          immutable (Parquet, Orc).                                     a0    a1    a2   b0      b1     b2
                                                                             PAX File
                                                                                         c0        c1     c2
                                                                                                                              Row Group
                                                                         Chunk
                                                                                              Row Group Meta-Data
                          immutable (Parquet, Orc).                                     a0    a1    a2   b0      b1     b2
                                                                             PAX File
                                                                                         c0        c1     c2
                                                                                                                              Row Group
                                                                                              Row Group Meta-Data
                                                                                                 File Meta-Data
5-445/645 (Fall 2024)
                                                                                                                              Row Group
                                                                         Chunk
                                                                                              Row Group Meta-Data
                          immutable (Parquet, Orc).                                     a0    a1    a2   b0      b1     b2
                                                                             PAX File
                                                                                         c0        c1     c2
                                                                                                                              Row Group
                                                                                              Row Group Meta-Data
                                                                                                 File Meta-Data
5-445/645 (Fall 2024)
                                                OBSERVATION
                                I/O is the main bottleneck if the DBMS fetches data
                                from disk during query execution.
                                        DATABASE COMPRESSION
                                Goal #1: Must produce fixed-length values.
                                → Only exception is var-length data stored in separate pool.
                                     COMPRESSION GRANULARITY
                                Choice #1: Block-level
                                → Compress a block of tuples for the same table.
                                Choice #2: Tuple-level
                                → Compress the contents of the entire tuple (NSM-only).
                                Choice #3: Attribute-level
                                → Compress a single attribute within one tuple (overflow).
                                → Can target multiple attributes for the same tuple.
                                Choice #4: Column-level
                                → Compress multiple values for one or more attributes stored
                                  for multiple tuples (DSM-only).
                                           NAÏVE COMPRESSION
                                Compress data using a general-purpose algorithm.
                                Scope of compression is only based on the data
                                provided as input.
                                → LZO (1996), LZ4 (2011), Snappy (2011),
                                  Oracle OZIP (2014), Zstd (2015)
                                Considerations
                                → Computational overhead
                                → Compress vs. decompress speed.
                                                  mod log
                                               Compressed Page2
                                                          mod log
                                                       Compressed Page1
                                                          mod log
                                                       Compressed Page2
                                                                      mod log
                                        Uncompressed               Compressed Page1
                                            Page0          16 KB
                                                                      mod log
                                                                   Compressed Page2
                                           NAÏVE COMPRESSION
                                The DBMS must decompress data first before it can
                                be read and (potentially) modified.
                                → This limits the "scope" of the compression scheme.
                                                   OBSERVATION
                                Ideally, we want the DBMS to operate on
                                compressed data without decompressing it first.
                                                    Database Magic!
                        SELECT * FROM users                           SELECT * FROM users
                         WHERE name = 'Andy'                           WHERE name = XX
                                     COMPRESSION GRANULARITY
                                Choice #1: Block-level
                                → Compress a block of tuples for the same table.
                                Choice #2: Tuple-level
                                → Compress the contents of the entire tuple (NSM-only).
                                Choice #3: Attribute-level
                                → Compress a single attribute within one tuple (overflow).
                                → Can target multiple attributes for the same tuple.
                                Choice #4: Column-level
                                → Compress multiple values for one or more attributes stored
                                  for multiple tuples (DSM-only).
                                      COLUMNAR COMPRESSION
                                Run-length Encoding
                                Bit-Packing Encoding
                                Bitmap Encoding
                                Delta / Frame-of-Reference Encoding
                                Incremental Encoding
                                Dictionary Encoding
                                         RUN-LENGTH ENCODING
                                Compress runs of the same value in a single column
                                into triplets:
                                → The value of the attribute.
                                → The start position in the column segment.
                                → The # of elements in the run.
RUN-LENGTH ENCODING
RUN-LENGTH ENCODING
                                                          Compressed Data
                                                             id     isDead
                                                              1    (Y,0,3)
                                                              2    (N,3,1)
                                SELECT isDead, COUNT(*)
                                                              3    (Y,4,1)
                                  FROM users
                                 GROUP BY isDead              4    (N,5,1)
                                                              6    (Y,6,2)
                                                              7   RLE Triplet
                                                              8   - Value
                                                              9   - Offset
                                                                  - Length
RUN-LENGTH ENCODING
RUN-LENGTH ENCODING
                                                 BIT PACKING
                                                                    Original Data        Original:
                        If the values for an integer attribute is                        8 × 32-bits =
                        smaller than the range of its given             int32            256 bits
                        data type size, then reduce the                   13        00000000 00000000 00000000 00001101
                                                 BIT PACKING
                                                                    Original Data        Original:
                        If the values for an integer attribute is                        8 × 32-bits =
                        smaller than the range of its given             int32            256 bits
                        data type size, then reduce the                   13        00000000 00000000 00000000 00001101
                                                 BIT PACKING
                                                                    Original Data        Original:
                        If the values for an integer attribute is                        8 × 32-bits =
                        smaller than the range of its given             int32            256 bits
                        data type size, then reduce the                   13        00001101
value. 56 00111000
92 01011100
81 01010001
172 10101100
                                                                                     Compressed:
                                                                                     8 × 8-bits =
                                                                                     64 bits
5-445/645 (Fall 2024)
                                              BITMAP ENCODING
                                Store a separate bitmap for each unique value for an
                                attribute where an offset in the vector corresponds
                                to a tuple.
                                → The ith position in the Bitmap corresponds to the ith tuple
                                  in the table.
                                → Typically segmented into chunks to avoid allocating large
                                  blocks of contiguous memory.
BITMAP ENCODING
                                              BITMAP ENCODING
                                                                                   Compressed:
                                Original Data                      Compressed Data 16 bits + 16 bits =
                                                                                   32 bits
                                      id                                       isDead
                                           isDead
                                                                          id   Y   N
                                                                                        2 × 8-bits =
                                      1      Y                                          16 bits
                                      2      Y                            1    1   0
                                      3      Y                            2    1   0
                                      4      N      Original:             3    1   0
                                      6      Y
                                                    8 × 8-bits =          4    0   1    8 × 2-bits =
                                                    64 bits               6    1   0    16 bits
                                      7      N
                                      8      Y                            7    0   1
                                      9      Y                            8    1   0
                                                                          9    1   0
                                                  DELTA ENCODING
                                   Recording the difference between values that follow
                                   each other in the same column.
                                   → Store base value in-line or in a separate look-up table.
                                Original Data
                                  time64   temp
                                   12:00   99.5
                                   12:01   99.4
                                   12:02   99.5
                                   12:03   99.6
                                   12:04   99.4
                                                  DELTA ENCODING
                                   Recording the difference between values that follow
                                   each other in the same column.
                                   → Store base value in-line or in a separate look-up table.
                                                  DELTA ENCODING
                                   Recording the difference between values that follow
                                   each other in the same column.
                                   → Store base value in-line or in a separate look-up table.
                                   → Combine with RLE to get even better compression ratios.
                                                  DELTA ENCODING
                                   Recording the difference between values that follow
                                   each other in the same column.
                                   → Store base value in-line or in a separate look-up table.
                                   → Combine with RLE to get even better compression ratios.
                                                      DELTA ENCODING
                                   Recording the difference between values that follow
                                   each other in the same column.
                                   → Store base value in-line or in a separate look-up table.
                                   → Combine with RLE to get even better compression ratios.
                                   Frame-of-Reference Variant: Use global min value.
                                Original Data             Compressed Data           Compressed Data
                                  time64       temp         time64     temp           time64     temp
                                   12:00       99.5          12:00     99.5             12:00     99.5
                                   12:01       99.4            +1      -0.1            (+1,4)     -0.1
                                   12:02       99.5            +1      +0.1                       +0.1
                                   12:03       99.6            +1      +0.1                       +0.1
                                   12:04       99.4            +1      -0.2                       -0.2
                                 5 × 64-bits             64-bits + (4 × 16-bits)   64-bits + (2 × 16-bits)
                                 = 320 bits              = 128 bits                = 96 bits
5-445/645 (Fall 2024)
                                      DICTIONARY COMPRESSION
                                Replace frequent values with smaller fixed-length
                                codes and then maintain a mapping (dictionary)
                                from the codes to the original values
                                → Typically, one code per attribute value.
                                → Most widely used native compression scheme in DBMSs.
                                DICTIONARY: ORDER-PRESERVING
                                The encoded values need to support the same
                                collation as the original values.
                                                                                           Dictionary
                                          name                 name       value     code
                                                                                             Sorted
                                         Andrea                 10        Andrea     10
                                      Mr.Pickles                40         Andy      20
                                          Andy                  20       Jignesh     30
                                        Jignesh                 30     Mr.Pickles    40
                                      Mr.Pickles                40
5-445/645 (Fall 2024)
                                DICTIONARY: ORDER-PRESERVING
                                The encoded values need to support the same
                                collation as the original values.
                                                                                           Dictionary
                                          name                 name       value     code
                                                                                             Sorted
                                         Andrea                 10        Andrea     10
                                      Mr.Pickles                40         Andy      20
                                          Andy                  20       Jignesh     30
                                        Jignesh                 30     Mr.Pickles    40
                                      Mr.Pickles                40
5-445/645 (Fall 2024)
                                    ORDER-PRESERVING ENCODING
                        SELECT name FROM users
                         WHERE name LIKE 'And%'      Still must perform scan on column
                                                                                        Dictionary
                                            name            name       value     code
                                                                                          Sorted
                                           Andrea            10        Andrea     10
                                        Mr.Pickles           40         Andy      20
                                            Andy             20       Jignesh     30
                                          Jignesh            30     Mr.Pickles    40
                                        Mr.Pickles           40
5-445/645 (Fall 2024)
                                                 CONCLUSION
                                It is important to choose the right storage model for
                                the target workload:
                                → OLTP = Row Store
                                → OLAP = Column Store
                                         DATABASE STORAGE
                                Problem #1: How the DBMS represents the
                                database in files on disk.