Prof. Hasso Plattner
A Course in In-Memory Data Management
The Inner Mechanics of In-Memory Databases
August 30, 2013
This learning material is part of the reading material for Prof. Plattner's online lecture "In-Memory Data Management" taking place at www.openHPI.de. If you have any questions or remarks regarding the online lecture or the reading material, please drop us a note at openhpi-imdb@hpi.uni-potsdam.de. We are happy to further improve the material.
Chapter 7
Compression
As discussed in Chapter 5, SanssouciDB is a database architecture designed to run transactional and analytical workloads in enterprise computing. The underlying data set can easily reach a size of several terabytes in large companies. Although memory capacities of commodity servers are growing, it is still expensive to keep such huge data sets entirely in main memory. Therefore, SanssouciDB and most modern in-memory storage engines use compression techniques on top of the initial dictionary encoding to decrease the total memory requirements. Columnar storage of data is well suited for compression, as data of the same type and domain is stored consecutively and can thus be processed efficiently.
Another advantage of compression is that it reduces the amount of data
that needs to be transferred between main memory and CPUs, thereby in-
creasing the performance of query execution. We discuss this in more detail
in Chapter 16 on materialization strategies.
This chapter introduces several lightweight compression techniques, which provide a good trade-off between compression rate and the additional CPU cycles needed for encoding and decoding. There is also a large number of so-called heavyweight compression techniques. They achieve much higher compression rates, but encoding and decoding are prohibitively expensive for use in the context of enterprise applications. An in-depth discussion of many compression techniques can be found in [AMF06].
7.1 Prefix Encoding
In real-world databases, we often find that a column contains one predominant value while the remaining values appear only rarely. In this case, the same value would be stored very many times in an uncompressed format. Prefix encoding is a simple way to handle this situation more efficiently. To apply prefix encoding, the data set needs to be sorted by the column with the predominant value, and the attribute vector has to start with the predominant value.
To compress the column, the predominant value should not be stored
explicitly every time it occurs. This is achieved by saving the number of
occurrences of the predominant value and one instance of the value itself
in the attribute vector. Thus, a prefix-encoded attribute vector contains the
following information:
• number of occurrences of the predominant value
• valueID of the predominant value from the dictionary
• valueIDs of the remaining values
Example

Given is the attribute vector of the country column from the world population table, which is sorted by the population of countries in descending order. Thus, the 1.4 billion Chinese citizens are listed first, then the Indian citizens, and so on. The valueID for China, which is situated at position 37 in the dictionary (see Figure 7.1a), is stored 1.4 billion times at the beginning of the attribute vector in uncompressed format. In compressed format, the valueID 37 is written only once, followed by the remaining valueIDs for the other countries as before. The number of occurrences, 1.4 billion, for China is stored explicitly. Figure 7.1b depicts the uncompressed and compressed attribute vectors for this example.
(a) Dictionary (b) Dictionary-Encoded Attribute Vector (Top) and Prefix-Encoded Dictionary-Encoded Attribute Vector (Bottom)
Fig. 7.1: Prefix Encoding Example
The following calculation illustrates the compression rate. First, the number of bits required to encode all 200 countries is calculated as ⌈log2(200)⌉, which results in 8 bit.
Without compression, the attribute vector stores these 8 bit for each valueID 8 billion times:

8 billion · 8 bit = 8 billion Byte ≈ 7.45 GB
If the country column is prefix-encoded, the valueID for China is stored only once in 8 bit instead of 1.4 billion times 8 bit. An additional 64 bit field is added to store the number of occurrences (⌈log2(1.4 billion)⌉ = 31 bit, rounded up to a 64 bit field). Consequently, instead of storing 1.4 billion times 8 bit, only 64 bit + 8 bit = 72 bit are necessary. The complete storage space for the compressed attribute vector is now:

(8 billion − 1.4 billion) · 8 bit + 64 bit + 8 bit ≈ 6.15 GB
Thus, 1.3 GB, i.e., 17 % of the storage space, is saved. Another advantage of prefix encoding is direct access with row number calculation. For example, to find all male Chinese citizens, the database engine can determine that only tuples with row numbers from 1 to 1.4 billion need to be considered, and then filter those by the gender value.
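The arithmetic above is easy to verify. A short Python check with the chapter's numbers (an illustrative recomputation, using binary gigabytes):

```python
import math

# Recompute the chapter's storage estimates for the prefix-encoded
# country column of the world population table.
N = 8_000_000_000                 # rows in the world population table
CHINA = 1_400_000_000             # occurrences of the predominant value
BITS = math.ceil(math.log2(200))  # 200 countries -> 8 bit per valueID

GB = 8 * 1024**3                  # bits per gigabyte
uncompressed_gb = N * BITS / GB                        # ~7.45 GB
compressed_gb = ((N - CHINA) * BITS + 64 + BITS) / GB  # ~6.15 GB
```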
Although we see that we have reduced the required amount of main
memory, it is evident that we still store much redundant information for
all other countries. Therefore, we introduce run-length encoding in the next
section.
7.2 Run-Length Encoding
Run-length encoding is a compression technique that works best if the attribute vector consists of few distinct values with large numbers of occurrences. For maximum compression rates, the column needs to be sorted, so that all occurrences of the same value are located next to each other. In run-length encoding, sequences of the same value are replaced with a single instance of the value and
• (a) either its number of occurrences or
• (b) its starting position as offset.
Figure 7.2 provides an example of run-length encoding using the starting positions as offsets. Storing the starting positions speeds up access.
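Variant (b) can be sketched as follows. This is an illustrative Python sketch, not SanssouciDB's implementation; direct access uses binary search over the start positions:

```python
import bisect

def rle_encode(attribute_vector):
    """Run-length encode a sorted valueID vector into two parallel
    lists: the distinct values and their starting positions."""
    values, starts = [], []
    for pos, v in enumerate(attribute_vector):
        if not values or values[-1] != v:
            values.append(v)
            starts.append(pos)
    return values, starts

def rle_lookup(values, starts, row):
    """Direct access: binary-search the start positions to find the
    valueID stored at a given (zero-based) row number."""
    i = bisect.bisect_right(starts, row) - 1
    return values[i]

values, starts = rle_encode([37, 37, 37, 74, 74, 195])
assert values == [37, 74, 195] and starts == [0, 3, 5]
assert rle_lookup(values, starts, 4) == 74
```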
(a) Dictionary (b) Dictionary-Encoded Attribute Vector (Top) and Compressed Dictionary-Encoded Attribute Vector (Bottom); the record with ID 1.5 billion is found via binary search
Fig. 7.2: Run-Length Encoding Example
Example
Applied to our example of the country column sorted by population, instead
of storing all 8 billion values (7.45 GB), we store two vectors:
• one with all distinct values: 200 times 8 bit
• the other with the starting positions: 200 times 33 bit, with 33 bit necessary to store offsets of up to 8 billion (⌈log2(8 billion)⌉ = 33 bit). An additional 33 bit field at the end of this vector stores the number of occurrences of the last value.
Hence, the size of the attribute vector can be significantly reduced to
approximately 1 KB without any loss of information:
200 · (33 bit + 8 bit) + 33 bit ≈ 1 KB
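Again, the estimate is easy to recompute (illustrative Python with the chapter's numbers):

```python
import math

# Verify the ~1 KB estimate for the run-length encoded country column:
# 200 distinct values, start-position offsets of up to 8 billion.
bits_value = math.ceil(math.log2(200))            # 8 bit per valueID
bits_start = math.ceil(math.log2(8_000_000_000))  # 33 bit per offset

size_bits = 200 * (bits_start + bits_value) + bits_start
size_bytes = size_bits / 8   # ~1029 Byte, i.e. roughly 1 KB
```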
If the number of occurrences is stored in the second vector, one field
of 33 bit can be saved with the disadvantage of losing the direct access
possibility via binary search. Losing direct access results in longer response
times, which is not tolerable in the context of enterprise applications. Storing
the position of the last occurrence of a value in the second vector allows
for saving one field as well as keeping fast access through binary search.
Figure 7.2 shows the implementation using the starting positions instead of the described optimal way for didactic reasons; the principle is easier to understand when explained with starting positions instead of ending positions.
7.3 Cluster Encoding
Cluster encoding works on equal-sized blocks of a column. The attribute vector is partitioned into N blocks of fixed size (typically 1024 elements). If a cluster contains only a single value, it is replaced by a single occurrence of this value. Otherwise, the cluster remains uncompressed. An additional bit vector of length N indicates which blocks have been replaced by a single value (1 if replaced, 0 otherwise). Figure 7.3 depicts an example of cluster encoding with the uncompressed attribute vector on the top and the compressed attribute vector on the bottom. Here, the blocks contain only four elements for simplicity.
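The encoding step can be sketched as follows (an illustrative Python sketch; the four-element block size matches Figure 7.3, a real system would use 1024):

```python
def cluster_encode(attribute_vector, block_size=4):
    """Collapse single-valued blocks to one value; keep mixed blocks
    uncompressed. Returns the compressed vector and the indicator
    bit vector (1 = block replaced by a single value)."""
    compressed, bits = [], []
    for i in range(0, len(attribute_vector), block_size):
        block = attribute_vector[i:i + block_size]
        if len(set(block)) == 1:
            compressed.append(block[0])
            bits.append(1)       # block was replaced by one value
        else:
            compressed.extend(block)
            bits.append(0)       # block stays uncompressed
    return compressed, bits

vec = [4, 4, 4, 4, 3, 3, 3, 1, 1, 0, 0, 0]
compressed, bits = cluster_encode(vec)
assert compressed == [4, 3, 3, 3, 1, 1, 0, 0, 0]
assert bits == [1, 0, 0]
```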
Fig. 7.3: Cluster Encoding Example With a Block Size of 4
Example
Given is the city column (1 million different cities) from the world population table. The whole table is sorted cascadingly by country and city. Hence, cities that belong to the same country are stored next to each other. Consequently, the occurrences of the same city values are stored next to each other as well. 20 bit are needed to represent the 1 million city valueIDs (⌈log2(1 million)⌉ = 20 bit). Without compression, the city attribute vector requires 18.6 GB (8 billion times 20 bit).
Now we compute the size of the compressed attribute vector illustrated in Figure 7.3. With a cluster size of 1024 elements, the number of blocks is 7.8 million (8 billion rows / 1024 elements per block). In the worst case, every city has one incompressible block. Thus, the size of the compressed attribute vector is computed from the following sizes:
incompressible blocks + compressible blocks + bit vector
= 1 million · 1024 · 20 bit + (7.8 − 1) million · 20 bit + 7.8 million · 1 bit
= 2441 MB + 16 MB + 1 MB
≈ 2.4 GB
With a resulting size of 2.4 GB, a compression rate of 87 % (16.2 GB less
space required) can be achieved.
Cluster encoding does not support direct access to records. The position of
a record needs to be computed via the bit vector. As an example, consider the
query that counts how many men and women live in Berlin (for simplicity,
we assume that only one city with the name “Berlin” exists and the table is
sorted by city).
SELECT gender, COUNT(gender)
FROM world_population
WHERE city = 'Berlin'
GROUP BY gender;
To find the recordIDs for the result set, we look up the valueID for “Berlin”
in the dictionary. In our example, illustrated in Figure 7.4, this valueID
is 3. Then, we scan the cluster-encoded city attribute vector for the first
appearance of valueID 3. While scanning the cluster-encoded vector, we need
to maintain the corresponding position in the bit vector, as each position in
the vector is mapped to either one value (if the cluster is compressed) or four
values (if the cluster is uncompressed) of the cluster-encoded city attribute
vector. In Figure 7.4, this is illustrated by stretching the bit vector to the
corresponding value or values of the cluster-encoded attribute vector. After
the position is found, a bit vector lookup is needed to check whether the
block(s) containing this valueID are compressed or not to determine the
recordID range containing the value “Berlin”. In our example, the first block
containing “Berlin” is uncompressed and the second one is compressed.
Thus, we need to analyze the first uncompressed block to find the first
occurrence of valueID 3, which is the second position, and can calculate
the range of recordIDs with valueID 3, in our example 10 to 16. Having determined the recordIDs that match the desired city attribute, we can use these recordIDs to access the corresponding gender records and aggregate according to the gender values.
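The positional bookkeeping described above can be sketched as follows. This is an illustrative Python sketch using the four-element blocks of Figure 7.3; it returns zero-based recordIDs, so Berlin's rows 10 to 16 appear as 9 to 15:

```python
def find_record_range(compressed, bits, value_id, block_size=4):
    """Scan a cluster-encoded vector for value_id and return the first
    and last matching recordIDs (zero-based). The bit vector tells us
    whether each block occupies one entry (compressed) or block_size
    entries (uncompressed) in the compressed vector."""
    pos = 0                # position in the compressed vector
    row = 0                # recordID in the original vector
    first = last = None
    for bit in bits:
        if bit:            # compressed: one entry covers block_size rows
            block = [compressed[pos]] * block_size
            pos += 1
        else:              # uncompressed: block_size entries
            block = compressed[pos:pos + block_size]
            pos += block_size
        for offset, v in enumerate(block):
            if v == value_id:
                if first is None:
                    first = row + offset
                last = row + offset
        row += block_size
    return first, last

# Figure 7.4's example: Berlin (valueID 3) spans recordIDs 10 to 16
# (one-based), i.e. 9 to 15 zero-based.
compressed = [4, 4, 4, 3, 3, 3, 3, 1, 1, 0, 0, 0]
bits = [1, 1, 0, 1, 0, 1]
assert find_record_range(compressed, bits, 3) == (9, 15)
```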
7.4 Indirect Encoding
Similar to cluster encoding, indirect encoding operates on blocks of data with N elements (typically 1024). Indirect encoding can be applied efficiently if the data blocks hold only a small number of distinct values. This is often the case if a table
Fig. 7.4: Cluster Encoding Example: No Direct Access Possible
is sorted by another column and a correlation between these two columns
exists (e.g., name column if table is sorted by countries).
Besides a global dictionary used by dictionary encoding in general, ad-
ditional local dictionaries are introduced for those blocks that contain only
a few distinct values. A local dictionary for a block contains all (and only
those) distinct values that appear in this specific block. Thus, mapping to
even smaller valueIDs can save space. Direct access is still possible; however, an indirection is introduced because of the local dictionary. Figure 7.5 depicts an example of indirect encoding with a block size of 1024 elements. The upper part shows the dictionary-encoded attribute vector, the lower part shows the compressed vector, a local dictionary, and the pointer structure used to directly access the starting positions of the respective blocks. This is necessary because the blocks may use different numbers of bits to represent their values, as shown in the example. The first block contains only 200 distinct values and is compressed. The second block in this example is not compressed since it contains too many distinct values to take advantage of the additional dictionary of indirect encoding.
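Per-block encoding and the extra indirection on access can be sketched as follows (illustrative Python; the global valueIDs are those of Figure 7.5's first block):

```python
def indirect_encode_block(block):
    """Build a local dictionary for one block and re-encode its global
    valueIDs as smaller local valueIDs."""
    local_values = sorted(set(block))
    local_id = {v: i for i, v in enumerate(local_values)}
    return local_values, [local_id[v] for v in block]

def indirect_decode(local_values, encoded_block, position):
    """Direct access: one extra indirection through the local
    dictionary maps the local valueID back to the global one."""
    return local_values[encoded_block[position]]

# Global valueIDs from the first block in Figure 7.5
block = [2, 126, 576, 55, 126, 2, 2, 55]
local_dict, encoded = indirect_encode_block(block)
assert local_dict == [2, 55, 126, 576]
assert encoded == [0, 2, 3, 1, 2, 0, 0, 1]
assert indirect_decode(local_dict, encoded, 2) == 576
```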
Example
Given is the dictionary-encoded attribute vector for the first name column
(5 million distinct values) of the world population table that is sorted by
country. The number of bits required to store 5 million distinct values is 23 bit
Fig. 7.5: Indirect Encoding Example
(⌈log2(5 million)⌉ = 23 bit). Thus, the size of this vector without additional compression is 21.4 GB (8 billion · 23 bit).
Now we split the attribute vector into blocks of 1024 elements, resulting in 7.8 million blocks (8 billion rows / 1024 elements per block). For our calculation and for simplicity, we assume that each set of 1024 people of the same country contains on average 200 different first names and that all blocks will be compressed. The number of bits required to represent 200 different values is 8 bit (⌈log2(200)⌉ = 8 bit). As a result, the elements in the compressed attribute vector need only 8 bit instead of 23 bit when using local dictionaries.
Dictionary sizes can be calculated from the (average) number of distinct
values in a block (200) multiplied by the size of the corresponding old val-
ueID (23 bit) being the value in the local dictionary. For the reconstruction
of a certain row, a pointer to the local dictionary for the corresponding block
is stored (64 bit). Thus, the runtime for accessing a row is constant. The
total amount of memory necessary for the compressed attribute vector is
calculated as follows:
local dictionaries + compressed attribute vector
= (200 · 23 bit + 64 bit) · 7.8 million blocks + 8 billion · 8 bit
= 4.2 GB + 7.6 GB
≈ 11.8 GB
Compared to the 21.4 GB for the dictionary-encoded attribute vector, a
saving of 9.6 GB (44 %) can be achieved.
The following example query that selects the birthdays of all people
named “John” in the “USA” shows that indirect encoding allows for direct
access:
SELECT birthday
FROM world_population
WHERE fname = 'John' AND country = 'USA'
As the table is sorted by country, we can easily identify the recordIDs of
the records with country=“USA”, and determine the corresponding blocks to
scan the “fname” column by dividing the first and last recordID by the cluster
size. Then, the valueID for “John” is retrieved from the global dictionary
and, for each block, the global valueID is translated into the local valueID by
looking it up in the local dictionary. This is illustrated in Figure 7.6 for a single
block. Then, the block is scanned for the local valueID and corresponding
recordIDs are returned for the birthday projection. In most cases, the starting
and ending recordID will not match the beginning and the end of a block.
In this case, we only consider the elements from the first found recordID in the starting block up to the last found recordID for the value "USA" in the ending block.
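Mapping the recordID range of the sorted predicate to the blocks that must be scanned is a simple division, as described above. A minimal sketch in Python; the block size matches the text, but the recordID range in the usage line is purely hypothetical:

```python
BLOCK_SIZE = 1024  # elements per indirect-encoding block, as in the text

def blocks_to_scan(first_record_id, last_record_id):
    """Translate a (zero-based) recordID range, e.g. the rows matching
    country = 'USA', into the range of block numbers whose local
    dictionaries must be probed for the global valueID of 'John'."""
    return range(first_record_id // BLOCK_SIZE,
                 last_record_id // BLOCK_SIZE + 1)

# hypothetical recordID range, for illustration only
blocks = blocks_to_scan(5_000_000, 5_310_000)
assert blocks.start == 4882 and blocks.stop == 5186
```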
Fig. 7.6: Indirect Encoding Example
7.5 Delta Encoding
The compression techniques covered so far reduce the size of the attribute vector. There are also compression techniques that reduce the data volume in the dictionary. Let us assume that the data in the dictionary is sorted alpha-numerically and we often encounter a large number of values with the same prefixes. Delta encoding exploits this fact and stores common prefixes only once.
Like the techniques in the previous sections, delta encoding compresses block-wise, with typically 16 strings per block. At the beginning of each block, the length of the first string, followed by the string itself, is stored. For each following value, the number of characters shared with the previous value's prefix, the number of characters added to this prefix, and the added characters themselves are stored. Thus, each following string can be composed of the characters shared with the previous string and its remaining part. Figure 7.7 shows an example of a compressed dictionary. The dictionary itself is shown in Figure 7.7a; its compressed counterpart is provided in Figure 7.7b.
(a) Dictionary (b) Compressed Dictionary: each block stores the length of the first value in the block, then for every following value the length of the prefix common with the previous value and the length of the remaining part after the common prefix. Block 0 encodes "Aach" as "4 Aach", "Aachen" as "4 2 en", "Aalbourg" as "2 6 lbourg", and "Aba" as "1 2 ba".
Fig. 7.7: Delta Encoding Example
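The block-wise scheme can be sketched as follows (illustrative Python; the city names are those of Figure 7.7's first block):

```python
def delta_encode_block(strings):
    """Delta-encode one sorted block of strings: store the first string
    with its length, then for each following string the shared-prefix
    length, the added length, and the added characters."""
    encoded = [(len(strings[0]), strings[0])]
    for prev, cur in zip(strings, strings[1:]):
        shared = 0
        while shared < min(len(prev), len(cur)) and prev[shared] == cur[shared]:
            shared += 1
        encoded.append((shared, len(cur) - shared, cur[shared:]))
    return encoded

def delta_decode_block(encoded):
    """Rebuild the strings by reusing the shared prefix of each
    previously decoded string."""
    strings = [encoded[0][1]]
    for shared, _, suffix in encoded[1:]:
        strings.append(strings[-1][:shared] + suffix)
    return strings

block = ["Aach", "Aachen", "Aalbourg", "Aba"]
encoded = delta_encode_block(block)
assert encoded == [(4, "Aach"), (4, 2, "en"), (2, 6, "lbourg"), (1, 2, "ba")]
assert delta_decode_block(encoded) == block
```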
Example
Given is a dictionary for the city column sorted alpha-numerically. The size
of the uncompressed dictionary with 1 million cities, each value using 49 Byte
(we assume the longest city name has 49 letters), is 46.7 MB.
For compression purposes, the dictionary is separated into blocks of 16 values. Thus, the number of blocks is 62,500 (1 million cities / 16). Furthermore, we assume the following data characteristics to calculate the required size in memory:
• the average length of city names is 7 letters
• adjacent city names overlap in 3 letters on average
• the longest city name has 49 letters, so each length field needs ⌈log2(49)⌉ = 6 bit.
The size of the compressed dictionary is now calculated as follows:

compressed block size · number of blocks
= (encoding lengths + 1st city + 15 other cities) · number of blocks
= ((1 + 15 · 2) · 6 bit + 7 · 1 Byte + 15 · (7 − 3) · 1 Byte) · 62,500
≈ 5.4 MB
Compared to the 46.7 MB without compression, the saving is 41.3 MB (roughly 88 %).
7.6 Limitations
Keep in mind that most compression techniques require sorted sets to reach their full potential, but a database table can only be sorted by one column, or cascadingly by several, if no auxiliary data structures are used. Furthermore, some compression techniques do not allow direct access. This has to be carefully considered with regard to the response time requirements of queries.
7.7 References
[AMF06] Daniel Abadi, Samuel Madden, and Miguel Ferreira. Integrating
compression and execution in column-oriented database systems.
In Proceedings of the 2006 ACM SIGMOD international conference on
Management of data, SIGMOD ’06, pages 671–682, New York, NY,
USA, 2006. ACM.