⚠️ Warning: This extension is still an early WIP version, not suitable for production use. The on-disk format, function signatures, etc. may still change.
This PostgreSQL extension implements tinyhist, a data structure for on-line accumulation of approximate histograms. The algorithm is friendly to parallel programs, and the histograms are fully mergeable.
Each histogram is only 32B, and has 16 buckets. The buckets are not of the same size. Each bucket covers twice the range of the preceding one, so the accuracy is better for lower values (the buckets are smaller). The smaller buckets also use less space, making the histogram compact.
The bucket sizes look like this:

```
 0 =>  8 bits
 1 =>  9 bits
 2 => 10 bits
 3 => 11 bits
...
15 => 23 bits
```
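To make that concrete, the per-bucket counter capacities can be listed with a quick query (this just restates the math above, it's not how the extension stores anything):

```
-- bucket i uses (8 + i) bits, so its counter can go up to 2^(8+i) - 1
SELECT i AS bucket, 8 + i AS bits, pow(2, 8 + i)::bigint - 1 AS max_count
  FROM generate_series(0, 15) AS i;
```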
Assuming uniformly distributed values, doubling a bucket's range means it receives twice as many values, which aligns with the extra bit per bucket: all buckets get "full" at about the same time. Of course, the data distribution may not be uniform, in which case the buckets get full at different times.
A histogram with a fixed number of buckets can represent only a limited range of values. The upper boundary doubles with each bucket, so after 15 doublings, if the first bucket is [0,1], the last bucket is (16384,32768]:

```
 0 => [0,1]
 1 => (1,2]
 2 => (2,4]
 3 => (4,8]
...
14 => (8192,16384]
15 => (16384,32768]
```
Note: This shows the bucket widths don't quite double every time, because the first two buckets are equally wide. This may result in some inefficiency, but has no other effect on how the histogram works.
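These boundaries are easy to reproduce: with unit range [0,1], bucket 0 is [0,1] and bucket i is (2^(i-1), 2^i]. A quick sanity check (not an extension function):

```
-- reproduce the bucket boundaries listed above (unit range [0,1])
SELECT i AS bucket_index,
       CASE WHEN i = 0 THEN 0 ELSE pow(2, i - 1)::int END AS bucket_lower,
       pow(2, i)::int AS bucket_upper
  FROM generate_series(0, 15) AS i;
```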
This determines the "dynamic range" of the histogram. The histogram as a whole always covers a range 32768x wider than the unit bucket, but it's up to us to decide how the buckets map to real values.
Picking the range represented by the first bucket (called the "unit" bucket) fully determines the mapping. For example, if we choose the unit bucket to represent values [0, 100], that determines the ranges of all other buckets, and of the histogram as a whole. The last bucket will cover the range (1638400, 3276800] and the histogram [0, 3276800].
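Put differently, with unit range [0, u], a value v lands in bucket ceil(log2(v / u)), clamped to 0 for values inside the unit bucket. A minimal sketch of that mapping with u = 100 (the extension computes this internally, and the exact implementation may differ):

```
-- values 50, 150, 250000, 1638401 land in buckets 0, 1, 12, 15
SELECT v, GREATEST(0, ceil(log(2, v / 100.0)))::int AS bucket_index
  FROM unnest(ARRAY[50, 150, 250000, 1638401]) AS v;
```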
An empty histogram is created with the "unit" range set to 1; the range is then adjusted depending on what values are added to the histogram.
If a histogram needs to absorb values outside the current range, it adjusts the "unit" range: the first two buckets are combined, and a new bucket is added at the end, which doubles the unit range. This is repeated until the histogram can absorb the new value.
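For example, adding the value 1000000 to an empty histogram (unit range 1, total range [0, 32768]) should take five doublings, ending with unit range 32 and total range [0, 1048576]. This can be verified with tinyhist_buckets, described below:

```
-- inspect the bucket boundaries after absorbing an out-of-range value
SELECT * FROM tinyhist_buckets(tinyhist_add(NULL::tinyhist, 1000000));
```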
The histogram stores the number of doublings in 4 bits, so the maximum number of doublings is 15 and the largest "unit" range is 2^15, i.e. 32768. Combined with the 16 buckets, this means the maximum total range is 32768 × 32768 = 1073741824 (i.e. 2^30).
Note: It'd be possible to allow wider histograms, but the primary
use case for this is tracking e.g. query timings in ms, and 1B ms is
roughly 12 days. You should not have very many queries taking that long.
The buckets are very small, with only a couple of bytes per counter, and can get full easily. The first bucket is only 8 bits, so the maximum value of the counter is 255. The last bucket is 23 bits, with a maximum value of 8388607, but it also covers a much wider range.
The histogram needs to handle buckets getting full, which is done by sampling. Initially, all data is processed, as if with a sample rate of 100%. When any bucket gets full, the sample rate is cut in half, and the bucket counters are adjusted to match (divided by 2).
After the sampling rate gets reduced (from 100%), the histogram only approximates the data distribution.
Similarly to the unit range, the sample rate is encoded in 4 bits, which means the lowest sample rate is 1/32768 (i.e. 1/2^15). Incidentally, this is also how the whole histogram adds up to the 32 B mentioned above: 16 counters with 8 + 9 + ... + 23 = 248 bits, plus 4 bits for the number of doublings and 4 bits for the sample rate, i.e. 256 bits.
As the sample rate gets decreased, counters for buckets with few values may drop to 0, as if there were no values at all. The counters are small integers, with "1" as the lowest non-zero value, representing a bucket with (1/sample_rate) values. In other words, the sample rate determines the accuracy of the histogram, i.e. the "delta" represented by a counter increment.
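For instance, the number of values represented by a counter at a given sample rate works out like this (a back-of-the-envelope check, not an extension function):

```
-- counter c at sample rate 1/2^s represents roughly c * 2^s values
SELECT c AS count, s AS halvings, (c * pow(2, s))::bigint AS estimated_values
  FROM (VALUES (255, 0), (255, 5), (255, 15)) AS t(c, s);
```

The last row (255 at sample rate 1/32768) gives 8355840, which is the limit discussed in the note below.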
Note: This may need some more thought. It means the smallest bucket can accept only ~8355840 values (255 × 32768) in total. For short queries this may not be enough. So maybe we should have one fewer bucket, and use the extra 8 bits for the sample rate, or something like that? Different tradeoffs. Or maybe it should be customizable, similar to how we allow setting precision/scale for numeric.
The extension provides two functions that you can use to add values into a histogram:

- `tinyhist_add(hist tinyhist, value double precision)`
- `tinyhist_add(hist tinyhist, values double precision[])`
That is, you can run something like this:

```
CREATE TABLE t (h tinyhist);

INSERT INTO t VALUES (NULL);

UPDATE t SET h = tinyhist_add(h, 123);
UPDATE t SET h = tinyhist_add(h, ARRAY[456, 789]);
```
The extension also adds + operators backed by these functions, so you may do this instead:

```
UPDATE t SET h = h + 123;
UPDATE t SET h = h + ARRAY[456, 789];
```
There's also an aggregate function tinyhist_agg, which aggregates values into a histogram:

```
SELECT tinyhist_agg(random() * 10000) FROM generate_series(1,10000);
```
Note: Histograms can also be combined with each other, either with tinyhist_add(hist1, hist2) or the + operator (see the reference below).
The default output format of a histogram is not very readable (it's just the raw counters etc.), so there's a set-returning function to show a more readable version:
```
WITH hist AS (SELECT tinyhist_agg(random() * 10000) AS h
                FROM generate_series(1,10000))
SELECT * FROM tinyhist_buckets((SELECT h FROM hist));

 bucket_index | bucket_lower | bucket_upper | bucket_range | bucket_count | bucket_frac |   bucket_density
--------------+--------------+--------------+--------------+--------------+-------------+--------------------
            0 |            0 |            1 |            1 |           14 |       0.014 |              0.014
            1 |            1 |            2 |            1 |            4 |       0.004 |              0.004
            2 |            2 |            4 |            2 |            4 |       0.004 |              0.002
            3 |            4 |            8 |            4 |            8 |       0.008 |              0.002
            4 |            8 |           16 |            8 |            7 |       0.007 |           0.000875
            5 |           16 |           32 |           16 |           15 |       0.015 |          0.0009375
            6 |           32 |           64 |           32 |           20 |        0.02 |           0.000625
            7 |           64 |          128 |           64 |           25 |       0.025 |        0.000390625
            8 |          128 |          256 |          128 |           50 |        0.05 |        0.000390625
            9 |          256 |          512 |          256 |           53 |       0.053 |      0.00020703125
           10 |          512 |         1024 |          512 |          106 |       0.106 |      0.00020703125
           11 |         1024 |         2048 |         1024 |          155 |       0.155 |    0.0001513671875
           12 |         2048 |         4096 |         2048 |          195 |       0.195 |    9.521484375e-05
           13 |         4096 |         8192 |         4096 |          247 |       0.247 |   6.0302734375e-05
           14 |         8192 |        16384 |         8192 |           97 |       0.097 |  1.18408203125e-05
           15 |        16384 |        32768 |        16384 |            0 |           0 |                  0
```
`tinyhist_add(hist tinyhist, value double precision)`

Adds a value to hist, returning the modified histogram. NULL values are ignored. If the histogram is empty (hist = NULL), a new one is created.
`tinyhist_add(hist tinyhist, values double precision[])`

Adds an array of values to hist, returning the modified histogram. NULL values are ignored. If the histogram is empty (hist = NULL), a new one is created.
`tinyhist_add(hist1 tinyhist, hist2 tinyhist)`

Merges two histograms hist1 and hist2 into a single histogram. If both histograms are empty (NULL), returns NULL.
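For example, merging histograms built from two separate data sets (this mirrors the + operator listed at the end of this reference):

```
WITH a AS (SELECT tinyhist_agg(random() * 100)   AS h FROM generate_series(1, 1000)),
     b AS (SELECT tinyhist_agg(random() * 10000) AS h FROM generate_series(1, 1000))
SELECT tinyhist_add(a.h, b.h) AS h FROM a, b;
```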
There's also a function returning a record with information about the histogram hist. The output parameters are:
- `hist_unit` - size of the "unit" bucket (with index 0)
- `hist_sample` - sample rate of the histogram (1, ...)
- `hist_count` - number of values in the histogram (all buckets)
- `hist_upper` - histogram range (upper boundary of the last bucket)
The count value is raw, i.e. it's a simple sum of bucket counters.
It's not adjusted to account for sample rate.
`tinyhist_buckets(hist tinyhist)`

Returns a set of records with information about the buckets of hist. The output parameters are:

- `bucket_index` - index of the bucket (0 .. 15)
- `bucket_lower` - lower boundary of the bucket
- `bucket_upper` - upper boundary of the bucket
- `bucket_range` - range (length) of the bucket (upper - lower)
- `bucket_count` - number of values in the bucket (raw, not adjusted)
- `bucket_frac` - fraction of values represented by the bucket
- `bucket_density` - density of values (fraction adjusted by range)
The count value is raw, i.e. it's the value of the counter, not adjusted to account for the sample rate.
`tinyhist_agg(value double precision)`

An aggregate function, building a histogram from a set of values, as if calling tinyhist_add() in a loop (but more efficiently).
The function is parallel-safe, i.e. the histograms can be built by a parallel query.
There's also an aggregate function merging pre-calculated histograms into a single histogram. That function is parallel-safe too, i.e. the histograms can be merged by a parallel query.
Finally, the + operators:

- `tinyhist + double precision` - equivalent to `tinyhist_add(hist, value)`
- `tinyhist + double precision[]` - equivalent to `tinyhist_add(hist, values)`
- `tinyhist + tinyhist` - equivalent to `tinyhist_add(hist1, hist2)`
At the moment, the extension only supports double precision values, but it should not be very difficult to extend it to other numeric types (e.g. integer and/or floating point). The ranges etc. should however remain integers, for efficiency reasons, so it's probably simpler to just cast values when building the histogram.
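A minimal sketch of that, assuming a hypothetical table queries with an integer column duration_ms:

```
-- cast the values explicitly when building the histogram
SELECT tinyhist_agg(duration_ms::double precision) FROM queries;
```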
This software is distributed under the terms of the PostgreSQL license. See LICENSE or http://www.opensource.org/licenses/bsd-license.php for more details.