Fix broken cache eviction in clockcache #1603

JamesGuthrie · 2022-08-30T21:52:48Z

Description

The clockcache is a simple approximation of an LRU cache. Instead of
maintaining an exact ordering of which elements were least recently
used, the cache simply keeps track of whether each element is "used".

When an item is inserted into the cache, it is counted as unused. When
an item is fetched from the cache, it is counted as used.

The eviction algorithm looks through the elements of the cache one by
one, trying to find one to evict. If the item is used, it marks it as
unused. If it is unused, it can evict the item. The next time an
eviction is required, it can continue with the next element.

Through a bug in index manipulation, if the next item to be evicted was
at index 1 of the storage array, the next eviction would not continue to
index 2. Instead, after successfully evicting an item from index 1, it
would attempt (on the next eviction) to evict the item at index 1
(again).

The above, combined with the fact that items just inserted are marked as
unused, meant that it was possible to end up in a situation where each
insert evicted the previously inserted item.

This caused much woe, and thrashing.

Merge requirements

Please take into account the following non-code changes that you may need to make with your PR:

CHANGELOG entry for user-facing changes
Updated the relevant documentation

cevian · 2022-08-30T22:06:40Z

pkg/clockcache/cache.go


 			if insertPtr != nil {
-				self.next = next + 1
+				self.next = (startLoc + next + 1) % self.Len()


are we sure we don't need % self.Len() on line 189 too? Even if startLoc is the last element?

Assuming that self.storage has length 10, and startLoc == 9 (i.e. it points to the last element).

When we slice self.storage into postStart and preStart, the contents of postStart will be one item, the last item, and the contents of preStart will be the first 9 items.

So even if startLoc is the last element, preStart will never contain that element.

This is because go's slices are half-open (i.e. exclusive of the last element).

JamesGuthrie · 2022-08-31T11:38:50Z

The commit I added to simplify the evict method has very different performance characteristic on AMD Ryzen processors (AWS cloud instance):

name            old time/op    new time/op    delta
Eviction/500       647ns ± 1%    1304ns ± 0%  +101.48%  (p=0.008 n=5+5)
Eviction/5000     7.58µs ±11%   11.56µs ± 2%   +52.53%  (p=0.008 n=5+5)
Eviction/50000    64.8µs ±10%   102.8µs ± 5%   +58.61%  (p=0.008 n=5+5)

name            old alloc/op   new alloc/op   delta
Eviction/500       0.00B          0.00B           ~     (all equal)
Eviction/5000      0.00B          0.00B           ~     (all equal)
Eviction/50000    0.40B ±150%     3.00B ± 0%  +650.00%  (p=0.008 n=5+5)

name            old allocs/op  new allocs/op  delta
Eviction/500        0.00           0.00           ~     (all equal)
Eviction/5000       0.00           0.00           ~     (all equal)
Eviction/50000      0.00           0.00           ~     (all equal)

github-actions · 2022-08-31T13:11:12Z

@JamesGuthrie

Benchmark Result for `7d8d512`

Master vs PR

name                                                                                      old time/op      new time/op      delta
pkg:github.com/timescale/promscale/pkg/clockcache goos:linux goarch:amd64
IntCache                                                                                       411ns ± 2%       388ns ± 9%    -5.59%  (p=0.004 n=9+9)
StringCache                                                                                    558ns ± 4%       347ns ± 1%   -37.79%  (p=0.000 n=10+9)
Eviction/500                                                                                  1.19µs ± 0%      1.19µs ± 0%      ~     (p=0.282 n=8+10)
Eviction/5000                                                                                 10.4µs ± 0%      10.4µs ± 0%      ~     (p=0.190 n=10+10)
Eviction/50000                                                                                98.9µs ± 2%      98.9µs ± 2%      ~     (p=0.971 n=10+10)
Membership/500                                                                                 164ns ± 2%       162ns ± 1%    -1.39%  (p=0.003 n=10+9)
Membership/5000                                                                                169ns ± 1%       167ns ± 0%    -0.94%  (p=0.001 n=10+10)
Membership/50000                                                                               220ns ± 3%       228ns ± 4%    +3.79%  (p=0.002 n=10+10)
Membership/500000                                                                              437ns ± 1%       413ns ± 1%    -5.53%  (p=0.000 n=8+9)
NotFound/500                                                                                   153ns ± 2%       151ns ± 0%    -1.36%  (p=0.000 n=10+10)
NotFound/5000                                                                                  155ns ± 1%       154ns ± 0%    -1.07%  (p=0.000 n=10+10)
NotFound/50000                                                                                 167ns ± 2%       164ns ± 0%    -1.37%  (p=0.000 n=10+10)
NotFound/500000                                                                                292ns ± 1%       263ns ± 4%   -10.04%  (p=0.000 n=9+10)
InsertUnderCapacity/500                                                                        162ns ± 1%       164ns ± 2%      ~     (p=0.247 n=8+10)
InsertUnderCapacity/5000                                                                       164ns ± 3%       164ns ± 4%      ~     (p=0.868 n=10+10)
InsertUnderCapacity/50000                                                                      183ns ± 6%       186ns ± 7%      ~     (p=0.541 n=10+10)
InsertUnderCapacity/500000                                                                     275ns ± 5%       255ns ± 6%    -7.03%  (p=0.000 n=10+10)
InsertOverCapacity/500                                                                         373ns ± 2%       369ns ± 1%    -1.05%  (p=0.029 n=10+10)
InsertOverCapacity/5000                                                                        349ns ± 0%       349ns ± 0%      ~     (p=0.085 n=10+10)
InsertOverCapacity/50000                                                                       355ns ± 2%       351ns ± 2%    -1.06%  (p=0.018 n=10+10)
InsertOverCapacity/500000                                                                      461ns ± 3%       436ns ± 4%    -5.55%  (p=0.000 n=10+10)
InsertConcurrent/500                                                                           370ns ± 1%       372ns ± 1%      ~     (p=0.068 n=10+9)
InsertConcurrent/5000                                                                          348ns ± 0%       350ns ± 0%    +0.36%  (p=0.003 n=10+10)
InsertConcurrent/50000                                                                         359ns ± 1%       354ns ± 1%    -1.28%  (p=0.001 n=10+10)
InsertConcurrent/500000                                                                        464ns ± 2%       446ns ± 3%    -3.98%  (p=0.000 n=10+10)
CacheFalseSharing                                                                              435ns ± 1%       413ns ± 1%    -5.23%  (p=0.000 n=10+8)
MemoryEmptyCache                                                                              50.1ns ± 2%      50.8ns ± 8%      ~     (p=0.523 n=8+8)
pkg:github.com/timescale/promscale/pkg/pgmodel/model/pgutf8str goos:linux goarch:amd64
MapWithContains/No_nulls_sanitize                                                             6.21ns ± 0%      6.20ns ± 0%    -0.07%  (p=0.026 n=9+9)
MapWithContains/No_nulls_revert                                                               19.1ns ± 0%      19.2ns ± 0%    +0.15%  (p=0.001 n=9+9)
MapWithContains/With_nulls_sanitize                                                            138ns ± 0%       137ns ± 0%    -0.63%  (p=0.000 n=10+8)
MapWithContains/With_nulls_revert                                                              158ns ± 1%       156ns ± 0%    -1.05%  (p=0.000 n=10+9)
MapWithoutContains/No_nulls_sanitize                                                          82.4ns ± 0%      82.0ns ± 0%    -0.40%  (p=0.000 n=10+9)
MapWithoutContains/No_nulls_revert                                                            89.5ns ± 0%      89.2ns ± 0%    -0.36%  (p=0.000 n=8+9)
MapWithoutContains/With_nulls_sanitize                                                        72.6ns ± 0%      72.3ns ± 0%    -0.36%  (p=0.000 n=9+10)
MapWithoutContains/With_nulls_revert                                                          79.1ns ± 0%      78.8ns ± 0%    -0.36%  (p=0.000 n=9+9)
pkg:github.com/timescale/promscale/pkg/rules goos:linux goarch:amd64
VectorConversion/yolo                                                                         0.00ns ±74%      0.00ns ±69%      ~     (p=0.370 n=10+9)
VectorConversion/looping                                                                      0.00ns ± 6%      0.00ns ± 7%      ~     (p=0.210 n=10+10)
pkg:github.com/timescale/promscale/pkg/tests/end_to_end_tests goos:linux goarch:amd64
MetricIngest                                                                                   25.9s ± 5%       25.4s ± 1%      ~     (p=0.101 n=10+8)
NewSeriesIngestion/small_series_metrics_100_seriesPerMetric_10_labels_4_batchSize_100          1.86s ± 7%       1.83s ± 4%      ~     (p=0.315 n=10+10)
NewSeriesIngestion/medium_series_metrics_100_seriesPerMetric_100_labels_4_batchSize_1000       4.38s ± 2%       4.29s ± 3%    -1.97%  (p=0.028 n=9+10)
NewSeriesIngestion/large_series_metrics_100_seriesPerMetric_500_labels_4_batchSize_10000       15.4s ± 3%       15.2s ± 4%      ~     (p=0.089 n=10+10)
GetSeriesIDForKeyValueArrayExistingSeries                                                      989µs ± 2%       981µs ± 1%    -0.86%  (p=0.035 n=9+10)
GetSeriesIDForKeyValueArrayNewSeriesExistingLabels                                            1.50ms ± 5%      1.52ms ± 4%    +1.88%  (p=0.043 n=10+10)
GetSeriesIDForKeyValueArrayNewMetric                                                          42.3ms ± 4%      42.4ms ± 2%      ~     (p=0.739 n=10+10)
GetSeriesIDForKeyValueArrayNewSeriesNewLabels                                                 1.70ms ± 5%      1.73ms ± 3%      ~     (p=0.190 n=10+10)
KeyValueArrayToLabelArrayCreateNewLabels                                                      1.00ms ± 3%      0.99ms ± 1%      ~     (p=0.497 n=10+9)
KeyValueArrayToLabelArrayExistingLabels                                                        575µs ± 3%       574µs ± 2%      ~     (p=0.912 n=10+10)
KeyValueArrayToLabelArrayCreateNewLabelKeys                                                    975µs ± 4%       986µs ± 4%      ~     (p=0.156 n=10+9)
GetOrCreateMetricTableName                                                                     331µs ± 2%       329µs ± 1%      ~     (p=0.315 n=10+10)

name                                                                                      old alloc/op     new alloc/op     delta
pkg:github.com/timescale/promscale/pkg/clockcache goos:linux goarch:amd64
IntCache                                                                                      1.60B ±100%       2.22B ±55%      ~     (p=0.123 n=10+9)
StringCache                                                                                    3.00B ± 0%       0.00B       -100.00%  (p=0.000 n=9+10)
Eviction/500                                                                                   0.00B            0.00B           ~     (all equal)
Eviction/5000                                                                                  0.00B            0.00B           ~     (all equal)
Eviction/50000                                                                                 1.00B ± 0%       1.00B ± 0%      ~     (all equal)
Membership/500                                                                                 0.00B            0.00B           ~     (all equal)
Membership/5000                                                                                0.00B            0.00B           ~     (all equal)
Membership/50000                                                                               0.00B            0.00B           ~     (all equal)
Membership/500000                                                                              0.00B            0.00B           ~     (all equal)
NotFound/500                                                                                   0.00B            0.00B           ~     (all equal)
NotFound/5000                                                                                  0.00B            0.00B           ~     (all equal)
NotFound/50000                                                                                 0.00B            0.00B           ~     (all equal)
NotFound/500000                                                                                0.00B            0.00B           ~     (all equal)
InsertUnderCapacity/500                                                                        0.00B            0.00B           ~     (all equal)
InsertUnderCapacity/5000                                                                       0.00B            0.00B           ~     (all equal)
InsertUnderCapacity/50000                                                                      0.00B            0.00B           ~     (all equal)
InsertUnderCapacity/500000                                                                     0.00B            0.00B           ~     (all equal)
InsertOverCapacity/500                                                                         0.00B            0.00B           ~     (all equal)
InsertOverCapacity/5000                                                                        0.00B            0.00B           ~     (all equal)
InsertOverCapacity/50000                                                                       0.00B            0.00B           ~     (all equal)
InsertOverCapacity/500000                                                                      0.00B            0.00B           ~     (all equal)
InsertConcurrent/500                                                                           0.00B            0.00B           ~     (all equal)
InsertConcurrent/5000                                                                          0.00B            0.00B           ~     (all equal)
InsertConcurrent/50000                                                                         0.00B            0.00B           ~     (all equal)
InsertConcurrent/500000                                                                        0.00B            0.00B           ~     (all equal)
CacheFalseSharing                                                                              15.0B ± 0%       15.0B ± 0%      ~     (all equal)
MemoryEmptyCache                                                                                131B ± 0%        129B ± 1%    -1.78%  (p=0.000 n=8+9)
pkg:github.com/timescale/promscale/pkg/pgmodel/model/pgutf8str goos:linux goarch:amd64
MapWithContains/No_nulls_sanitize                                                              0.00B            0.00B           ~     (all equal)
MapWithContains/No_nulls_revert                                                                0.00B            0.00B           ~     (all equal)
MapWithContains/With_nulls_sanitize                                                            24.0B ± 0%       24.0B ± 0%      ~     (all equal)
MapWithContains/With_nulls_revert                                                              32.0B ± 0%       32.0B ± 0%      ~     (all equal)
MapWithoutContains/No_nulls_sanitize                                                           0.00B            0.00B           ~     (all equal)
MapWithoutContains/No_nulls_revert                                                             0.00B            0.00B           ~     (all equal)
MapWithoutContains/With_nulls_sanitize                                                         0.00B            0.00B           ~     (all equal)
MapWithoutContains/With_nulls_revert                                                           0.00B            0.00B           ~     (all equal)
pkg:github.com/timescale/promscale/pkg/rules goos:linux goarch:amd64
VectorConversion/yolo                                                                          0.00B            0.00B           ~     (all equal)
VectorConversion/looping                                                                       0.00B            0.00B           ~     (all equal)
pkg:github.com/timescale/promscale/pkg/tests/end_to_end_tests goos:linux goarch:amd64
MetricIngest                                                                                  2.06GB ± 6%      2.03GB ± 5%      ~     (p=0.247 n=10+10)
NewSeriesIngestion/small_series_metrics_100_seriesPerMetric_10_labels_4_batchSize_100          105MB ± 0%       105MB ± 0%      ~     (p=0.074 n=8+9)
NewSeriesIngestion/medium_series_metrics_100_seriesPerMetric_100_labels_4_batchSize_1000       146MB ± 0%       146MB ± 0%      ~     (p=0.853 n=10+10)
NewSeriesIngestion/large_series_metrics_100_seriesPerMetric_500_labels_4_batchSize_10000       323MB ± 0%       323MB ± 1%      ~     (p=0.604 n=10+9)
GetSeriesIDForKeyValueArrayExistingSeries                                                     1.37kB ± 0%      1.37kB ± 0%      ~     (p=0.544 n=8+9)
GetSeriesIDForKeyValueArrayNewSeriesExistingLabels                                            1.38kB ± 1%      1.38kB ± 1%      ~     (p=0.986 n=10+10)
GetSeriesIDForKeyValueArrayNewMetric                                                          1.40kB ±12%      1.44kB ±14%      ~     (p=0.370 n=10+10)
GetSeriesIDForKeyValueArrayNewSeriesNewLabels                                                 1.38kB ± 1%      1.37kB ± 1%      ~     (p=0.169 n=10+10)
KeyValueArrayToLabelArrayCreateNewLabels                                                      1.17kB ± 0%      1.17kB ± 1%      ~     (p=0.986 n=10+10)
KeyValueArrayToLabelArrayExistingLabels                                                       1.17kB ± 0%      1.17kB ± 0%      ~     (p=0.965 n=10+10)
KeyValueArrayToLabelArrayCreateNewLabelKeys                                                   1.17kB ± 1%      1.17kB ± 1%      ~     (p=0.988 n=10+10)
GetOrCreateMetricTableName                                                                      563B ± 0%        562B ± 0%    -0.22%  (p=0.003 n=10+9)

name                                                                                      old allocs/op    new allocs/op    delta
pkg:github.com/timescale/promscale/pkg/clockcache goos:linux goarch:amd64
IntCache                                                                                        0.00             0.00           ~     (all equal)
StringCache                                                                                     0.00             0.00           ~     (all equal)
Eviction/500                                                                                    0.00             0.00           ~     (all equal)
Eviction/5000                                                                                   0.00             0.00           ~     (all equal)
Eviction/50000                                                                                  0.00             0.00           ~     (all equal)
Membership/500                                                                                  0.00             0.00           ~     (all equal)
Membership/5000                                                                                 0.00             0.00           ~     (all equal)
Membership/50000                                                                                0.00             0.00           ~     (all equal)
Membership/500000                                                                               0.00             0.00           ~     (all equal)
NotFound/500                                                                                    0.00             0.00           ~     (all equal)
NotFound/5000                                                                                   0.00             0.00           ~     (all equal)
NotFound/50000                                                                                  0.00             0.00           ~     (all equal)
NotFound/500000                                                                                 0.00             0.00           ~     (all equal)
InsertUnderCapacity/500                                                                         0.00             0.00           ~     (all equal)
InsertUnderCapacity/5000                                                                        0.00             0.00           ~     (all equal)
InsertUnderCapacity/50000                                                                       0.00             0.00           ~     (all equal)
InsertUnderCapacity/500000                                                                      0.00             0.00           ~     (all equal)
InsertOverCapacity/500                                                                          0.00             0.00           ~     (all equal)
InsertOverCapacity/5000                                                                         0.00             0.00           ~     (all equal)
InsertOverCapacity/50000                                                                        0.00             0.00           ~     (all equal)
InsertOverCapacity/500000                                                                       0.00             0.00           ~     (all equal)
InsertConcurrent/500                                                                            0.00             0.00           ~     (all equal)
InsertConcurrent/5000                                                                           0.00             0.00           ~     (all equal)
InsertConcurrent/50000                                                                          0.00             0.00           ~     (all equal)
InsertConcurrent/500000                                                                         0.00             0.00           ~     (all equal)
CacheFalseSharing                                                                               1.00 ± 0%        1.00 ± 0%      ~     (all equal)
MemoryEmptyCache                                                                                0.00             0.00           ~     (all equal)
pkg:github.com/timescale/promscale/pkg/pgmodel/model/pgutf8str goos:linux goarch:amd64
MapWithContains/No_nulls_sanitize                                                               0.00             0.00           ~     (all equal)
MapWithContains/No_nulls_revert                                                                 0.00             0.00           ~     (all equal)
MapWithContains/With_nulls_sanitize                                                             1.00 ± 0%        1.00 ± 0%      ~     (all equal)
MapWithContains/With_nulls_revert                                                               1.00 ± 0%        1.00 ± 0%      ~     (all equal)
MapWithoutContains/No_nulls_sanitize                                                            0.00             0.00           ~     (all equal)
MapWithoutContains/No_nulls_revert                                                              0.00             0.00           ~     (all equal)
MapWithoutContains/With_nulls_sanitize                                                          0.00             0.00           ~     (all equal)
MapWithoutContains/With_nulls_revert                                                            0.00             0.00           ~     (all equal)
pkg:github.com/timescale/promscale/pkg/rules goos:linux goarch:amd64
VectorConversion/yolo                                                                           0.00             0.00           ~     (all equal)
VectorConversion/looping                                                                        0.00             0.00           ~     (all equal)
pkg:github.com/timescale/promscale/pkg/tests/end_to_end_tests goos:linux goarch:amd64
MetricIngest                                                                                   10.5M ± 1%       10.5M ± 1%      ~     (p=0.218 n=10+10)
NewSeriesIngestion/small_series_metrics_100_seriesPerMetric_10_labels_4_batchSize_100          69.9k ± 1%       69.8k ± 0%      ~     (p=0.447 n=9+10)
NewSeriesIngestion/medium_series_metrics_100_seriesPerMetric_100_labels_4_batchSize_1000        487k ± 0%        487k ± 0%      ~     (p=0.099 n=9+10)
NewSeriesIngestion/large_series_metrics_100_seriesPerMetric_500_labels_4_batchSize_10000       2.33M ± 0%       2.33M ± 0%      ~     (p=0.101 n=8+10)
GetSeriesIDForKeyValueArrayExistingSeries                                                       31.0 ± 0%        31.0 ± 0%      ~     (all equal)
GetSeriesIDForKeyValueArrayNewSeriesExistingLabels                                              31.0 ± 0%        31.0 ± 0%      ~     (all equal)
GetSeriesIDForKeyValueArrayNewMetric                                                            32.0 ± 0%        32.0 ± 0%      ~     (all equal)
GetSeriesIDForKeyValueArrayNewSeriesNewLabels                                                   31.0 ± 0%        31.0 ± 0%      ~     (all equal)
KeyValueArrayToLabelArrayCreateNewLabels                                                        26.0 ± 0%        26.0 ± 0%      ~     (all equal)
KeyValueArrayToLabelArrayExistingLabels                                                         26.0 ± 0%        26.0 ± 0%      ~     (all equal)
KeyValueArrayToLabelArrayCreateNewLabelKeys                                                     26.0 ± 0%        26.0 ± 0%      ~     (all equal)
GetOrCreateMetricTableName                                                                      9.00 ± 0%        9.00 ± 0%      ~     (all equal)

name                                                                                      old inserts      new inserts      delta
pkg:github.com/timescale/promscale/pkg/clockcache goos:linux goarch:amd64
InsertUnderCapacity/500                                                                        7.35M ± 3%       7.37M ± 3%      ~     (p=0.853 n=10+10)
InsertUnderCapacity/5000                                                                       7.41M ± 1%       7.42M ± 0%      ~     (p=0.888 n=8+9)
InsertUnderCapacity/50000                                                                      6.51M ± 5%       6.38M ± 9%      ~     (p=0.353 n=10+10)
InsertUnderCapacity/500000                                                                     4.13M ± 4%       4.52M ± 5%    +9.32%  (p=0.000 n=10+10)
InsertOverCapacity/500                                                                         3.22M ± 1%       3.25M ± 1%    +1.05%  (p=0.015 n=10+10)
InsertOverCapacity/5000                                                                        3.44M ± 0%       3.45M ± 1%      ~     (p=0.247 n=10+10)
InsertOverCapacity/50000                                                                       3.40M ± 2%       3.46M ± 2%    +1.67%  (p=0.015 n=10+10)
InsertOverCapacity/500000                                                                      2.78M ± 2%       2.90M ± 5%    +4.23%  (p=0.008 n=9+10)

name                                                                                      old inserts/ops  new inserts/ops  delta
pkg:github.com/timescale/promscale/pkg/clockcache goos:linux goarch:amd64
InsertUnderCapacity/500                                                                         1.00 ± 0%        1.00 ± 0%      ~     (all equal)
InsertUnderCapacity/5000                                                                        1.00 ± 0%        1.00 ± 0%      ~     (all equal)
InsertUnderCapacity/50000                                                                       1.00 ± 0%        1.00 ± 0%      ~     (all equal)
InsertUnderCapacity/500000                                                                      1.00 ± 0%        1.00 ± 0%      ~     (all equal)
InsertOverCapacity/500                                                                          1.00 ± 0%        1.00 ± 0%      ~     (all equal)
InsertOverCapacity/5000                                                                         1.00 ± 0%        1.00 ± 0%      ~     (all equal)
InsertOverCapacity/50000                                                                        1.00 ± 0%        1.00 ± 0%      ~     (all equal)
InsertOverCapacity/500000                                                                       1.00 ± 0%        1.00 ± 0%      ~     (all equal)

JamesGuthrie · 2022-08-31T14:00:42Z

It looks like the benchmark job confirms the benchmark that I ran:

Eviction/500                                   885ns ± 4%      2287ns ± 4%  +158.44%  (p=0.000 n=10+10)
Eviction/5000                                 9.41µs ± 4%     21.94µs ± 6%  +133.10%  (p=0.000 n=10+10)
Eviction/50000                                 101µs ± 3%       203µs ± 5%   +99.87%  (p=0.000 n=10+10)

I've removed the commit which reworks the cache.

niksajakovljevic · 2022-08-31T17:53:17Z

Just wondering if you managed to benchmark version of code that does not split array?

harkishen · 2022-09-01T06:03:09Z

Is #1603 (comment) benchmark before the fix or after the fix. If its after the fix, then we are loosing perf in Eviction, right?

JamesGuthrie · 2022-09-02T07:24:16Z

Is #1603 (comment) benchmark before the fix or after the fix. If its after the fix, then we are loosing perf in Eviction, right?

This benchmark is after the fix. I was curious why we're losing performance, so I took a deeper look at this to better understand it.

I think that I understand it now. What follows is a bit of a brain dump into understanding what's going on. The bottom line is: buggy code can look fast, but if it's buggy, it doesn't matter how fast it is.

My first hypothesis was: maybe the code being emitted is different enough to be causing a problem (because we broke some "complier magic" like bounds check elimination), so I took to the disassembler (using go tool objdump -s 'clockcache.*evict().*' -S dist/promscale after running make build).

This was before the fix:

...
				self.next = next + 1
  0x1007ace64		910004e1		ADD $1, R7, R1		
  0x1007ace68		f9002c01		MOVD R1, 88(R0)		
...

And after the fix:

...
				self.next = (startLoc + next + 1) % self.Len()
  0x1007ace74		940003ff		CALL github.com/timescale/promscale/pkg/clockcache.(*Cache).Len(SB)	
  0x1007ace78		f94017e1		MOVD 40(RSP), R1							
  0x1007ace7c		f94013e2		MOVD 32(RSP), R2							
  0x1007ace80		8b020021		ADD R2, R1, R1								
  0x1007ace84		b4000480		CBZ R0, 36(PC)								
  0x1007ace88		91000421		ADD $1, R1, R1								
  0x1007ace8c		9ac00c3b		SDIV R0, R1, R27							
  0x1007ace90		9b008761		MSUB R0, R1, R27, R1							
  0x1007ace94		f94027e2		MOVD 72(RSP), R2							
  0x1007ace98		f9002c41		MOVD R1, 88(R2)								
...

That function call is probably not helping, so I replaced self.Len() with len(self.storage):

				self.next = (startLoc + next + 1) % len(self.storage)
  0x1007ace64		8b0200e1		ADD R2, R7, R1		
  0x1007ace68		f9401802		MOVD 48(R0), R2		
  0x1007ace6c		b4000462		CBZ R2, 35(PC)		
  0x1007ace70		91000421		ADD $1, R1, R1		
  0x1007ace74		9ac20c3b		SDIV R2, R1, R27	
  0x1007ace78		9b028761		MSUB R2, R1, R27, R1	
  0x1007ace7c		f9002c01		MOVD R1, 88(R0)

There are more instructions (7 vs 2), but in a branch of code which is not hit very frequently, so I don't think that it could be exactly here that is causing the problem, but there's only one way to find out.

So I used -cpuprofile on BenchmarkEviction (go test -bench='BenchmarkEviction' -run='^$' -cpu=1 -count=5 -cpuprofile master.out).

Then, using pprof's list on the evict function, we can see where time is spent in the function:

Before fix:

(pprof) list evict
Total: 47.50s
ROUTINE ======================== github.com/timescale/promscale/pkg/clockcache.(*Cache).evict in /home/ubuntu/promscale/pkg/clockcache/cache.go
    22.28s     22.28s (flat, cum) 46.91% of Total
         .          .    161:	//     based off an induction variable e.g. `next := range slice`,
         .          .    162:	//     if the value is merely guarded by e.g. `if next >= len(slice) { next = 0 }`
         .          .    163:	//     the bounds check will not be elided. Doing the walk like this lowers
         .          .    164:	//     eviction time by about a third
         .          .    165:	startLoc := self.next
      20ms       20ms    166:	postStart := self.storage[startLoc:]
         .          .    167:	preStart := self.storage[:startLoc]
         .          .    168:	for i := 0; i < 2; i++ {
      30ms       30ms    169:		for next := range postStart {
    20.91s     20.91s    170:			elem := &postStart[next]
     620ms      620ms    171:			old := atomic.SwapUint32(&elem.used, 0)
     680ms      680ms    172:			if old == 0 {
         .          .    173:				insertPtr = elem
         .          .    174:			}
         .          .    175:
      20ms       20ms    176:			if insertPtr != nil {
         .          .    177:				self.next = next + 1
         .          .    178:				return
         .          .    179:			}
         .          .    180:		}
         .          .    181:		for next := range preStart {

After fix:

Total: 50.66s
ROUTINE ======================== github.com/timescale/promscale/pkg/clockcache.(*Cache).evict in /home/ubuntu/promscale/pkg/clockcache/cache.go
    26.45s     26.45s (flat, cum) 52.21% of Total
         .          .    161:	//     based off an induction variable e.g. `next := range slice`,
         .          .    162:	//     if the value is merely guarded by e.g. `if next >= len(slice) { next = 0 }`
         .          .    163:	//     the bounds check will not be elided. Doing the walk like this lowers
         .          .    164:	//     eviction time by about a third
         .          .    165:	startLoc := self.next
      40ms       40ms    166:	postStart := self.storage[startLoc:]
         .          .    167:	preStart := self.storage[:startLoc]
         .          .    168:	for i := 0; i < 2; i++ {
      10ms       10ms    169:		for next := range postStart {
    12.51s     12.51s    170:			elem := &postStart[next]
     370ms      370ms    171:			old := atomic.SwapUint32(&elem.used, 0)
     400ms      400ms    172:			if old == 0 {
         .          .    173:				insertPtr = elem
         .          .    174:			}
         .          .    175:
      50ms       50ms    176:			if insertPtr != nil {
         .          .    177:				self.next = (startLoc + next + 1) % self.Len()
         .          .    178:				return
         .          .    179:			}
         .          .    180:		}
      10ms       10ms    181:		for next := range preStart {
    12.09s     12.09s    182:			elem := &preStart[next]
     520ms      520ms    183:			old := atomic.SwapUint32(&elem.used, 0)
     430ms      430ms    184:			if old == 0 {
         .          .    185:				insertPtr = elem
         .          .    186:			}
         .          .    187:
      20ms       20ms    188:			if insertPtr != nil {
         .          .    189:				self.next = next + 1
         .          .    190:				return
         .          .    191:			}
         .          .    192:		}
         .          .    193:	}

The first thing to note is that practically no time is spent on the line of code that we were looking at in the disassembler. What is also interesting is that where time is spent in the function is different. The explanation for this is relatively simple: the bug that we had would effectively ensure that self.next is 1 very often. If self.next is small, then we will predominantly be looking at items after the split point in self.storage, so we spend all of our time in for next := range postStart, and (practically) none in for next := range preStart.

With this insight, I had another theory: because the location we check for an entry to evict is different in the broken and fixed versions, maybe we spend more time looking for an item to evict in the fixed code. I instrumented the code to tell me how many items in self.storage we looked at during eviction. That turned out not to be it. (The explanation is simple. In the eviction test, whenever we insert an item, we mark all items as used, so we have to do a full pass through self.storage anyway).

The final attempt, which yielded fruit, was the realisation that the memory location that we're looking at is always the same in the broken version. In the broken code, most of the time self.next is 1. What I did next was to modify the fixed code to set startLoc to always be 1 (instead of self.next). Then I ran the BenchmarkEviction code, and saw that it is the same speed as the broken version.

The patch:

diff --git a/pkg/clockcache/cache.go b/pkg/clockcache/cache.go
index 43448ca7..1a36818f 100644
--- a/pkg/clockcache/cache.go
+++ b/pkg/clockcache/cache.go
@@ -162,7 +162,7 @@ func (self *Cache) evict() (insertPtr *element) {
 	//     if the value is merely guarded by e.g. `if next >= len(slice) { next = 0 }`
 	//     the bounds check will not be elided. Doing the walk like this lowers
 	//     eviction time by about a third
-	startLoc := self.next
+	startLoc := 1
 	postStart := self.storage[startLoc:]
 	preStart := self.storage[:startLoc]
 	for i := 0; i < 2; i++ {

The result when comparing between current master and this PR with the above patch:

name          old time/op    new time/op    delta
Eviction/500     336ns ± 1%     338ns ± 1%  +0.64%  (p=0.043 n=10+10)

name          old alloc/op   new alloc/op   delta
Eviction/500     0.00B          0.00B         ~     (all equal)

name          old allocs/op  new allocs/op  delta
Eviction/500      0.00           0.00         ~     (all equal)

(If you're curious, the +0.64% is because we're using self.Len() instead of len(self.storage)).

I believe that this explains what changed in our code to cause this change in performance. What it doesn't explain is why this change causes a ~20% change in eviction performance. I suspect that the answer has to do with "magical CPU stuff", like branch prediction, or cache access patterns. I would like to be able to answer this, but I cannot (yet).

JamesGuthrie · 2022-09-02T08:45:24Z

Just wondering if you managed to benchmark version of code that does not split array?

This was the benchmark result: #1603 (comment)

The clockcache is a simple approximation of an LRU cache. Instead of maintaining an exact ordering of which elements were least recently used, the cache simply keeps track of whether each element is "used". When an item is inserted into the cache, it is counted as unused. When an item is fetched from the cache, it is counted as used. The eviction algorithm looks through the elements of the cache one by one, trying to find one to evict. If the item is used, it marks it as unused. If it is unused, it can evict the item. The next time an eviction is required, it can continue with the next element. Through a bug in index manipulation, if the next item to be evicted was at index 1 of the storage array, the next eviction would not continue to index 2. Instead, after successfully evicting an item from index 1, it would attempt (on the next eviction) to evict the item at index 1 (again). The above, combined with the fact that items just inserted are marked as unused, meant that it was possible to end up in a situation where each insert evicted the _previously inserted_ item. This caused much woe, and thrashing.

JamesGuthrie force-pushed the jg/fix-clockcache-eviction branch from 88dfafb to 7a7b10e Compare August 30, 2022 21:53

cevian reviewed Aug 30, 2022

View reviewed changes

JamesGuthrie added the action:benchmarks Triggers Benchmark CI workflows on commit push to a PR label Aug 31, 2022

JamesGuthrie force-pushed the jg/fix-clockcache-eviction branch 2 times, most recently from 7a7b10e to 9b852fd Compare August 31, 2022 13:53

JamesGuthrie requested a review from cevian August 31, 2022 13:59

cevian approved these changes Aug 31, 2022

View reviewed changes

JamesGuthrie marked this pull request as ready for review August 31, 2022 15:06

JamesGuthrie requested a review from a team as a code owner August 31, 2022 15:06

niksajakovljevic approved these changes Aug 31, 2022

View reviewed changes

JamesGuthrie force-pushed the jg/fix-clockcache-eviction branch from 9b852fd to fede97f Compare September 1, 2022 12:55

JamesGuthrie force-pushed the jg/fix-clockcache-eviction branch from fede97f to 826d148 Compare September 5, 2022 10:45

JamesGuthrie enabled auto-merge (rebase) September 5, 2022 10:45

JamesGuthrie merged commit 8ccdc0f into master Sep 5, 2022

JamesGuthrie deleted the jg/fix-clockcache-eviction branch September 5, 2022 11:10

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Fix broken cache eviction in clockcache #1603

Fix broken cache eviction in clockcache #1603

Uh oh!

JamesGuthrie commented Aug 30, 2022 •

edited

Loading

Uh oh!

cevian Aug 30, 2022

Uh oh!

JamesGuthrie Aug 31, 2022 •

edited

Loading

Uh oh!

JamesGuthrie commented Aug 31, 2022

Uh oh!

github-actions bot commented Aug 31, 2022 •

edited

Loading

Uh oh!

JamesGuthrie commented Aug 31, 2022 •

edited

Loading

Uh oh!

niksajakovljevic commented Aug 31, 2022

Uh oh!

harkishen commented Sep 1, 2022

Uh oh!

JamesGuthrie commented Sep 2, 2022 •

edited

Loading

Uh oh!

JamesGuthrie commented Sep 2, 2022

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

Fix broken cache eviction in clockcache #1603

Fix broken cache eviction in clockcache #1603

Uh oh!

Conversation

JamesGuthrie commented Aug 30, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

Merge requirements

Uh oh!

cevian Aug 30, 2022

Choose a reason for hiding this comment

Uh oh!

JamesGuthrie Aug 31, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

JamesGuthrie commented Aug 31, 2022

Uh oh!

github-actions bot commented Aug 31, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Benchmark Result for 7d8d512

Master vs PR

Uh oh!

JamesGuthrie commented Aug 31, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

niksajakovljevic commented Aug 31, 2022

Uh oh!

harkishen commented Sep 1, 2022

Uh oh!

JamesGuthrie commented Sep 2, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

JamesGuthrie commented Sep 2, 2022

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

JamesGuthrie commented Aug 30, 2022 •

edited

Loading

JamesGuthrie Aug 31, 2022 •

edited

Loading

github-actions bot commented Aug 31, 2022 •

edited

Loading

Benchmark Result for `7d8d512`

JamesGuthrie commented Aug 31, 2022 •

edited

Loading

JamesGuthrie commented Sep 2, 2022 •

edited

Loading